A Beginner's Guide to Scikit-Learn

Scikit-learn is one of the fundamental Python libraries for data analysis. It is an open-source package that is relatively simple, efficient, and accessible. The library aims to bring machine learning to non-specialists through a general-purpose, high-level language, and it focuses mostly on processing, analyzing, and modelling data. Scikit-learn has minimal dependencies and is distributed under the 3-clause BSD (Berkeley Software Distribution) license, which encourages its use in both organizational and academic settings. Because it builds on the scientific Python ecosystem (NumPy and SciPy), it can easily be incorporated into projects outside the traditional range of statistical data analysis.

Installing scikit-learn

pip install -U scikit-learn
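
To confirm that the installation worked, one quick check is to import the package and print its version. This is a minimal sketch and assumes the package was installed into the same interpreter you are running:

```python
# Sanity check: import scikit-learn and report the installed version.
import sklearn

installed_version = sklearn.__version__
print("scikit-learn version:", installed_version)
```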

A simple machine learning model using Scikit Learn

from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report

# Load the features (X) and class labels (y)
X, y = load_wine(return_X_y=True)

# Logistic regression: create the estimator, split the data, train, predict
lr = LogisticRegression()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = lr.fit(X_train, y_train)
predictions = model.predict(X_test)

# Random forest: train and predict on the same split
rf = RandomForestClassifier()
rf_model = rf.fit(X_train, y_train)
rf_predictions = rf_model.predict(X_test)

# Fill missing values with the column mean (fit_transform returns the cleaned array)
imputer = SimpleImputer(strategy='mean')
X_train_clean = imputer.fit_transform(X_train)

# classification_report expects the true labels first, then the predictions
print(classification_report(y_test, rf_predictions))

Step-by-step explanation

Load wine data set

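The wine dataset is a classic, small multi-class classification dataset: 178 samples, 13 numeric features, and 3 classes. Passing return_X_y=True returns the feature matrix X and the label vector y directly.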
from sklearn.datasets import load_wine
X,y = load_wine(return_X_y=True)
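
Before modelling, it helps to look at what was loaded. The sketch below only inspects the arrays returned by load_wine(); the use of NumPy for the class counts is an addition for illustration:

```python
import numpy as np
from sklearn.datasets import load_wine

X, y = load_wine(return_X_y=True)

print(X.shape)         # (178, 13): 178 wines, 13 numeric features
print(np.unique(y))    # [0 1 2]: three classes
print(np.bincount(y))  # [59 71 48]: number of samples per class
```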

Logistic Regression classifier

Logistic regression is a fundamental classification technique. It belongs to the group of linear classifiers and is related to linear regression: it passes a linear combination of the features through a logistic function to estimate class probabilities. The next step, with scikit-learn, is to create the logistic regression estimator and store it in a variable.
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
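
The wine features are on very different scales, so the default solver may emit a ConvergenceWarning on the raw data. One common remedy, shown here as a sketch rather than as part of the original walkthrough, is to standardize the features in a pipeline. The names scaled_lr and test_accuracy, and the value max_iter=1000, are illustrative choices:

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Standardize each feature, then fit logistic regression on the scaled data.
# max_iter=1000 is an illustrative value, not a tuned one.
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X_train, y_train)

# score() reports accuracy on the held-out test set.
test_accuracy = scaled_lr.score(X_test, y_test)
print("Test accuracy:", test_accuracy)
```

Because the scaler lives inside the pipeline, it is fitted on the training data only and then reused unchanged on the test data.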

train_test_split Function

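The train_test_split() function splits a single dataset into two parts: one for training the model and one for testing it. By default, 25% of the samples are held out for testing; random_state=0 fixes the shuffle so the split is reproducible.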
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
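
The split sizes can be checked directly. The sketch below also shows an alternative split using the test_size and stratify parameters; the 80/20 ratio and the names X_tr, X_te, y_tr, y_te are illustrative and not taken from the walkthrough above:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)

# Default split: 25% of the samples go to the test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
print(X_train.shape, X_test.shape)

# Explicit 80/20 split that keeps the class proportions similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
print(X_tr.shape, X_te.shape)
```

Stratifying is worth considering when classes are imbalanced, since a purely random split can leave a rare class under-represented in the test set.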

fit() method

The fit() method trains the model on the data you provide. Fitting the model to the training data is the training part of the modelling process.
model = lr.fit(X_train, y_train)
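
It can be useful to see what fit() actually returns and what it stores. In the sketch below, max_iter=10000 is an assumption added only to give the solver room to converge on the unscaled data:

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# max_iter is raised from its default only to help the solver converge.
lr = LogisticRegression(max_iter=10000)
model = lr.fit(X_train, y_train)

print(model is lr)        # fit() returns the estimator itself
print(model.classes_)     # class labels seen during training
print(model.coef_.shape)  # one row of coefficients per class, one column per feature
```

Since fit() returns the same object, `model` and `lr` are two names for one trained estimator.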

predict() method

After it is trained, the model can be used to make predictions on previously unseen data, usually with a predict() method call.
predictions = model.predict(X_test)
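
Besides hard labels, classifiers such as LogisticRegression also expose predict_proba(), which returns one probability per class. The sketch below relates the two; max_iter=10000 and the variable names are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lr = LogisticRegression(max_iter=10000).fit(X_train, y_train)

predictions = lr.predict(X_test)          # one class label per test sample
probabilities = lr.predict_proba(X_test)  # one row per sample, one column per class

print(predictions[:5])
print(probabilities[:5].round(3))

# The predicted label is the class with the highest estimated probability.
most_likely = lr.classes_[np.argmax(probabilities, axis=1)]
print((most_likely == predictions).all())
```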

Random forest classifier

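A forest is made up of trees. A random forest builds decision trees on randomly selected samples of the data, gets a prediction from each tree, and combines them by voting to choose the final class (scikit-learn's implementation averages the trees' predicted class probabilities rather than counting hard votes).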
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf_model = rf.fit(X_train, y_train)
rf_predictions = rf_model.predict(X_test)
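
A random forest is randomized, so results vary slightly from run to run unless a seed is fixed. The sketch below fixes random_state (an arbitrary seed chosen for illustration) and prints the impurity-based feature importances the fitted forest exposes:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, random_state=0
)

# n_estimators=100 is the library default, spelled out for clarity;
# random_state=0 is an arbitrary seed so repeated runs build the same forest.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

# Rank features by impurity-based importance, highest first.
order = np.argsort(rf.feature_importances_)[::-1]
for idx in order[:5]:
    print(f"{wine.feature_names[idx]}: {rf.feature_importances_[idx]:.3f}")
```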

Simple Imputer

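SimpleImputer is a scikit-learn class for handling missing data in a dataset. It replaces NaN values with a placeholder computed by the chosen strategy, here the mean of each column. Note that fit() only learns the column means and returns the imputer itself; fit_transform() learns them and returns the filled-in array.
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')
X_train_clean = imputer.fit_transform(X_train)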
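
The wine data has no missing values, so the effect of the imputer is easier to see on a toy array. The values below are made up purely for illustration:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data (hypothetical values) with one missing entry per column.
X_missing = np.array([
    [1.0, 2.0],
    [np.nan, 4.0],
    [5.0, np.nan],
])

imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X_missing)  # learn column means, then fill NaNs

print(imputer.statistics_)  # [3. 3.]: column means of the non-missing values
print(X_filled)             # NaNs replaced by 3.0 in each column
```

On real data, the imputer would normally be fitted on the training set and then applied to the test set with transform(), so that the test data does not influence the learned means.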

Model Evaluation

Once a model has been trained you need to measure how good the model is at predicting on new data. This step is known as model evaluation and the metric that you choose will be determined by the task you are trying to solve.
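
As a starting point for a classification task, accuracy on the held-out test set and cross-validated accuracy are two common choices. The sketch below is one possible setup; the random forest, the seed, and cv=5 are illustrative assumptions:

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Accuracy on the held-out test set: fraction of correct predictions.
test_accuracy = accuracy_score(y_test, rf.predict(X_test))
print("Held-out accuracy:", test_accuracy)

# 5-fold cross-validation on the training data depends less on a single split.
cv_scores = cross_val_score(
    RandomForestClassifier(random_state=0), X_train, y_train, cv=5
)
print("CV accuracy per fold:", cv_scores)
print("CV mean:", cv_scores.mean())
```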

Classification Report

A classification report measures the quality of the predictions from a classification algorithm: how many predictions are correct and how many are wrong, broken down by class. The classification_report() function returns a text summary of the precision, recall, F1-score, and support for each class.
from sklearn.metrics import classification_report
# The true labels come first, then the predictions
print(classification_report(y_test, rf_predictions))
The report summarizes the key metrics of a classification problem.
Metric     Description
precision  of the samples predicted as a class, the fraction that truly belong to it
recall     of the samples that truly belong to a class, the fraction that were predicted as it
f1-score   harmonic mean of precision and recall
support    number of occurrences of the class in the true labels
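
To make these definitions concrete, the metrics can be recomputed by hand from the confusion matrix and compared with scikit-learn's own values. This is a sketch that assumes every class is predicted at least once (otherwise the divisions below would produce NaN); the seed and variable names are illustrative:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
rf_predictions = rf.predict(X_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, rf_predictions)
true_positives = np.diag(cm)

precision = true_positives / cm.sum(axis=0)  # correct / all predicted as the class
recall = true_positives / cm.sum(axis=1)     # correct / all truly in the class
f1 = 2 * precision * recall / (precision + recall)
support = cm.sum(axis=1)                     # true samples per class

# Cross-check against scikit-learn's own computation.
p, r, f, s = precision_recall_fscore_support(y_test, rf_predictions)
print(np.allclose(precision, p), np.allclose(recall, r), np.allclose(f1, f))
```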