A Beginner's Guide to Scikit-Learn
Scikit-learn is one of the fundamental Python libraries for data analysis. It is an open-source package that is relatively simple, efficient, and accessible. The library focuses on bringing machine learning to non-specialists through a general-purpose, high-level language, and it is mostly concerned with processing, analyzing, and modelling data. Scikit-learn has minimal dependencies and is distributed under the simplified BSD (Berkeley Software Distribution) license, which encourages its use in both organizational and academic settings. Because it depends on the scientific Python environment, it can easily be incorporated into projects outside the traditional range of statistical data analysis.
Installing scikit-learn
pip install -U scikit-learn
A simple machine learning model using scikit-learn
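from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report

# Load the features X and the class labels y
X, y = load_wine(return_X_y=True)

# Create a logistic regression estimator
lr = LogisticRegression()

# Hold out part of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train the logistic regression model and predict on the test set
model = lr.fit(X_train, y_train)
predictions = model.predict(X_test)

# Train a random forest and predict on the test set
rf = RandomForestClassifier()
rf_model = rf.fit(X_train, y_train)
rf_predictions = rf_model.predict(X_test)

# Fill missing values with the column mean (the wine data has none; shown for illustration)
imputer = SimpleImputer(strategy='mean')
X_train_clean = imputer.fit_transform(X_train)

# Evaluate the random forest: true labels first, then predictions
print(classification_report(y_test, rf_predictions))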
Step-by-step explanation
Load the wine dataset
The wine dataset is a classic, small multi-class classification dataset: 178 samples, 13 numeric features, and 3 classes.
from sklearn.datasets import load_wine
X,y = load_wine(return_X_y=True)
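As a quick check of what was loaded, the sketch below inspects the dataset. It assumes the plain load_wine() call (without return_X_y), which returns a Bunch object that also carries the feature and class names; the variable name wine is chosen here for illustration.

```python
from sklearn.datasets import load_wine

# Load the data as a Bunch object to get feature and class names as well
wine = load_wine()
X, y = wine.data, wine.target

print(X.shape)                 # (178, 13): 178 samples, 13 numeric features
print(wine.target_names)       # names of the three classes
print(wine.feature_names[:3])  # first few feature names
```

Looking at the shape and the class names first makes it easier to read the evaluation output later on.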
Logistic Regression classifier
Logistic regression is a fundamental classification technique. It belongs to the group of linear classifiers: it scores each class with a weighted sum of the features, which makes it closely related to linear regression, but it outputs class probabilities instead of a continuous value. The next step, with scikit-learn, is to create the logistic regression estimator and save it as an object.
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
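The following sketch illustrates that creating the estimator only stores its settings and does not train anything yet. The two parameters printed here are picked for illustration; get_params() lists all of them.

```python
from sklearn.linear_model import LogisticRegression

# Creating the estimator only stores its settings; no data is involved yet
lr = LogisticRegression()

params = lr.get_params()
print(params["C"])         # inverse regularization strength, 1.0 by default
print(params["max_iter"])  # solver iteration limit, 100 by default

# Learned attributes such as coef_ exist only after fit() has been called
print(hasattr(lr, "coef_"))  # False
```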
train_test_split Function
The train_test_split() function splits a single dataset into two parts: one for training the model and one for testing it. By default, 25% of the samples go to the test set, and setting random_state makes the split reproducible.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
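To see what the split produces, the sketch below prints the shapes of both parts. The second call, with test_size=0.2 and stratify=y, is an optional variation added here as an assumption and is not part of the original example; the names X_tr, X_te, y_tr, and y_te are illustrative.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)

# Default split: 25% of the 178 samples are held out for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
print(X_train.shape, X_test.shape)  # (133, 13) (45, 13)

# Optional variation (an assumption, not part of the original example):
# hold out 20% and keep the class proportions similar in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
print(X_tr.shape, X_te.shape)  # (142, 13) (36, 13)
```

Stratifying can be useful on small datasets like this one, where a purely random split may leave a class under-represented in the test set.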
fit() method
The fit() method trains the model on the dataset you provide. Fitting the model to the training data is the training part of the modelling process. fit() returns the estimator itself, so model and lr refer to the same fitted object.
model = lr.fit(X_train, y_train)
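The sketch below shows what fitting actually learns for logistic regression: a weight per class and feature, plus an intercept per class. Raising max_iter above its default is an assumption made here so the solver has more room to converge on the unscaled features; manual_scores is an illustrative name.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# max_iter is raised above its default here (an assumption) to give the
# solver more room to converge on the unscaled features
lr = LogisticRegression(max_iter=10000)
model = lr.fit(X_train, y_train)

# fit() returns the estimator itself
print(model is lr)  # True

# Fitting learns one row of weights per class and one column per feature
print(lr.coef_.shape)       # (3, 13)
print(lr.intercept_.shape)  # (3,)

# A linear classifier scores each class as a weighted sum of the features
manual_scores = X_test @ lr.coef_.T + lr.intercept_
print(np.allclose(manual_scores, lr.decision_function(X_test)))  # True
```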
predict() method
After it is trained, the model can be used to make predictions on previously unseen data, usually with a predict() method call.
predictions = model.predict(X_test)
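A minimal sketch of how the predictions can be compared with the true test labels follows. It uses accuracy_score() as one simple metric and predict_proba() to show the probabilities behind each prediction; max_iter=10000 is again an assumption to help the solver converge.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=10000).fit(X_train, y_train)

# predict() returns one class label per test sample
predictions = model.predict(X_test)
print(predictions[:10])
print(y_test[:10])

# Fraction of test samples whose predicted label matches the true label
accuracy = accuracy_score(y_test, predictions)
print(accuracy)

# predict_proba() gives the class probabilities behind each prediction
probabilities = model.predict_proba(X_test)
print(probabilities[0])
```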
Random forest classifier
A forest is made up of trees. A random forest builds decision trees on randomly selected samples of the data, gets a prediction from each tree, and combines them by voting. In scikit-learn, the vote is taken by averaging the class probabilities predicted by the individual trees.
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf_model = rf.fit(X_train, y_train)
rf_predictions = rf_model.predict(X_test)
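To make the combination step concrete, the sketch below averages the per-tree class probabilities by hand and compares the result with the forest's own output. Fixing random_state is an assumption added here so the run is repeatable; tree_probas and averaged are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# random_state is fixed here (an assumption) so the run is repeatable
rf = RandomForestClassifier(random_state=0)
rf.fit(X_train, y_train)

# The fitted forest keeps its individual decision trees
print(len(rf.estimators_))  # 100 trees by default

# Average the class probabilities predicted by each tree
tree_probas = [tree.predict_proba(X_test) for tree in rf.estimators_]
averaged = np.mean(tree_probas, axis=0)

# The forest's own probabilities match the average over its trees
print(np.allclose(averaged, rf.predict_proba(X_test)))  # True
```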
Simple Imputer
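SimpleImputer is a scikit-learn class for handling missing data in a dataset. It replaces NaN values with a specified placeholder, such as the mean of each column when strategy='mean'. The wine dataset has no missing values, so the step is included here only to show the API.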
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')
X_train_clean = imputer.fit_transform(X_train)
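Because the wine data is complete, the effect of imputation is easier to see on a toy array. The sketch below uses made-up values (X_train_toy and X_test_toy are hypothetical names) and shows one common pattern: learn the column means on the training data, then reuse them on the test data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy training data with two missing entries (made up for illustration)
X_train_toy = np.array([[1.0, 2.0],
                        [np.nan, 4.0],
                        [5.0, np.nan]])
X_test_toy = np.array([[np.nan, 10.0]])

imputer = SimpleImputer(strategy="mean")

# fit_transform() learns the column means and returns the filled-in array
X_train_filled = imputer.fit_transform(X_train_toy)
print(imputer.statistics_)  # [3. 3.]
print(X_train_filled)
# [[1. 2.]
#  [3. 4.]
#  [5. 3.]]

# transform() reuses the means learned from the training data
X_test_filled = imputer.transform(X_test_toy)
print(X_test_filled)  # [[ 3. 10.]]
```

Note that fit() alone returns the imputer object, not the data; the filled-in array comes from transform() or fit_transform().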
Model Evaluation
Once a model has been trained, you need to measure how well it predicts on new data. This step is known as model evaluation, and the metric you choose is determined by the task you are trying to solve.
Classification Report
A classification report measures the quality of the predictions from a classification algorithm by comparing the predicted labels with the true labels. It displays the precision, recall, F1-score, and support for each class. Note that classification_report() expects the true labels first and the predictions second.
from sklearn.metrics import classification_report
print(classification_report(y_test, rf_predictions))
The classification report summarizes the key metrics of a classification problem.
Metric | Description |
---|---|
precision | of the samples predicted as a class, the fraction that truly belong to it |
recall | of the samples that truly belong to a class, the fraction the model found |
f1-score | harmonic mean of precision and recall |
support | number of occurrences of the class in the true labels |
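The definitions in the table can be checked by hand on a small example. The sketch below uses made-up labels (y_true and y_pred are hypothetical) and computes the four values for class 1, then reads the same values from classification_report() with output_dict=True.

```python
from sklearn.metrics import classification_report

# Made-up labels for illustration
y_true = [0, 0, 1, 1, 1, 2]
y_pred = [0, 1, 1, 1, 2, 2]

# Count the outcomes for class 1 by hand
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # 2
fp = sum(t != 1 and p == 1 for t, p in zip(y_true, y_pred))  # 1
fn = sum(t == 1 and p != 1 for t, p in zip(y_true, y_pred))  # 1

precision = tp / (tp + fp)  # 2/3
recall = tp / (tp + fn)     # 2/3
f1 = 2 * precision * recall / (precision + recall)  # 2/3
support = sum(t == 1 for t in y_true)  # 3

# The same numbers from scikit-learn; true labels go first
report = classification_report(y_true, y_pred, output_dict=True)
print(report["1"])
print(precision, recall, f1, support)
```

Swapping the two arguments would exchange precision and recall for each class, which is why the argument order matters.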