Random Forests Classifiers | Python

Random forest is a supervised learning algorithm made up of many decision trees. A single decision tree can only predict to a certain degree of accuracy, but many trees combined form a significantly more robust prediction tool. Adding trees generally stabilizes the predictions and reduces the risk of overfitting compared with a single tree, although the gain levels off once the forest is large enough.
The "forest" generated by the Random Forest algorithm is an ensemble of decision trees, usually trained with the "bagging" method.

Ensemble methods

The goal of ensemble algorithms is to combine the predictions of several base estimators built with a given learning algorithm in order to improve robustness over a single estimator.

Bagging methods

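Bootstrap Aggregation, or Bagging, is an ensemble meta-algorithm. Each base learner is trained on a bootstrap sample of the training data (rows drawn at random with replacement), and the predictions of all learners are then combined to produce a more accurate and more stable output.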
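The idea of bagging can be sketched by hand in a few lines. The example below is only an illustration under assumptions: it uses synthetic data from make_classification rather than the article's dataset, and the names n_trees, trees and bagged_pred are made up for this sketch. RandomForestClassifier does more than this internally (for example, it also samples features at each split).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the article's dataset).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25  # odd number, so a two-class vote can never tie
trees = []
for _ in range(n_trees):
    # Bootstrap sample: draw as many row indices as there are rows, with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# Aggregate: every tree votes, and the majority class wins.
votes = np.stack([tree.predict(X) for tree in trees])  # shape (n_trees, n_samples)
bagged_pred = (votes.mean(axis=0) > 0.5).astype(int)

print("Bagged accuracy on the training data:", (bagged_pred == y).mean())
```

Each tree sees a slightly different version of the data, so the individual errors tend to differ and partly cancel out in the vote.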

Hyper-parameters

The hyperparameters of a Random Forest model are used either to increase the predictive power of the model or to make the model faster.

n_estimators

n_estimators is the number of trees in the Random Forest. Since the Random Forest algorithm is an ensemble method that builds multiple decision trees, this parameter controls how many trees are built.

max_features

max_features is the maximum number of features the Random Forest considers when looking for the best split at a node.

n_jobs

n_jobs tells the engine how many processors it is allowed to use. A value of -1 uses all available processors.

random_state

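random_state sets the seed of the random number generator, so that the random choices made while building the forest (the bootstrap samples and the features considered at each split) are the same on every run. train_test_split takes its own random_state parameter, which makes the train-test split deterministic.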
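As a minimal sketch, the four hyperparameters described above can be passed together when the classifier is created. The data here is synthetic (an assumption, generated with make_classification), and the specific values 50, 2, -1 and 42 are arbitrary choices for illustration, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption; not the article's dataset).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

clf = RandomForestClassifier(
    n_estimators=50,   # number of trees in the forest
    max_features=2,    # features considered at each split
    n_jobs=-1,         # use all available processors
    random_state=42,   # seed for bootstrap sampling and feature selection
)
clf.fit(X, y)

# The fitted trees are available in clf.estimators_.
print("Trees in the forest:", len(clf.estimators_))
print("Features considered per split:", clf.estimators_[0].max_features_)
```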

Python implementation of the Random Forest algorithm

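The Random Forest algorithm establishes the outcome based on the predictions of its decision trees. For classification, scikit-learn's RandomForestClassifier averages the class probabilities predicted by the individual trees and picks the class with the highest average; for regression, the outputs of the trees are averaged. Increasing the number of trees makes the combined prediction more stable.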
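The way the forest combines its trees can be checked directly. The sketch below, on synthetic data (an assumption), averages the class probabilities of the individual trees by hand and compares the result with what the fitted RandomForestClassifier reports. The names tree_probs, mean_probs and manual_pred are made up for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption; not the article's dataset).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
clf = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Average the class probabilities predicted by each individual tree.
tree_probs = np.stack([tree.predict_proba(X) for tree in clf.estimators_])
mean_probs = tree_probs.mean(axis=0)

# The forest's own probabilities, for comparison.
forest_probs = clf.predict_proba(X)

# Predict the class with the highest averaged probability.
manual_pred = clf.classes_[mean_probs.argmax(axis=1)]

print("Probabilities match:", np.allclose(mean_probs, forest_probs))
print("Predictions match:", np.array_equal(manual_pred, clf.predict(X)))
```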

About the Dataset

The "study_hours.csv" dataset has four columns: school_hrs, self_hrs, tution_hrs and passed. school_hrs indicates how many hours per year the student studies at school, self_hrs indicates how many hours per year the student studies at home, and tution_hrs indicates how many hours per year the student takes private tuition classes. Apart from these three independent variables, there is one dependent variable in the dataset named "passed". This label has two values: 1 indicates pass and 0 indicates fail. The file is meant for testing purposes only and can be downloaded as study_hours.csv.

Import Python Packages

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
import seaborn as sn
import matplotlib.pyplot as plt

Create the DataFrame

df = pd.read_csv("study_hours.csv")
print(df)

Data Pre-Processing

X = df[['school_hrs', 'self_hrs', 'tution_hrs']]
y = df['passed']
The X variable contains the three feature columns (school_hrs, self_hrs, tution_hrs) of the dataset, while y contains the target column "passed".

Apply train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
The code above sets the test size to 0.25, so the model is tested on 25% of the dataset and trained on the remaining 75%. If you don't specify random_state in train_test_split, a different random split is generated every time you run the code, and the train and test datasets contain different rows each time. If you assign a fixed value such as random_state=0, the split is the same no matter how many times you execute the code, i.e. the train and test datasets always contain the same rows.
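Both effects, the 75/25 proportion and the repeatable split, can be verified with a small experiment. The DataFrame below is a hypothetical stand-in with 40 randomly generated rows that only mirrors the article's column names; its values are made up and are not the contents of study_hours.csv.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in with 40 rows; the values are random, not the real data.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "school_hrs": rng.integers(500, 800, size=40),
    "self_hrs": rng.integers(150, 400, size=40),
    "tution_hrs": rng.integers(100, 600, size=40),
    "passed": rng.integers(0, 2, size=40),
})
X = df[["school_hrs", "self_hrs", "tution_hrs"]]
y = df["passed"]

# Split twice with the same seed.
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(
    X, y, test_size=0.25, random_state=0)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X, y, test_size=0.25, random_state=0)

print("Train rows:", len(X_train_a), "Test rows:", len(X_test_a))
print("Same test rows both times:", X_test_a.index.equals(X_test_b.index))
```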

Building a Random Forest Model

clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)
n_estimators is used to control the number of trees to be used in the process.
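Note that the call above does not pass random_state to RandomForestClassifier, so the forest itself is built with different random choices on each run and the reported accuracy may vary slightly. The sketch below, on synthetic data (an assumption), shows that fixing random_state on the classifier makes two separately trained forests give identical predictions. The names pred_a and pred_b are made up for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the article's dataset).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Train two forests independently, both with the same seed.
pred_a = RandomForestClassifier(n_estimators=100, random_state=0).fit(
    X_train, y_train).predict(X_test)
pred_b = RandomForestClassifier(n_estimators=100, random_state=0).fit(
    X_train, y_train).predict(X_test)

print("Identical predictions:", np.array_equal(pred_a, pred_b))
```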

Making Predictions With Random Forest Model

Once the Random Forest model has been trained, it's time to make predictions with it. You can store the predicted values in the y_pred variable.
y_pred = clf.predict(X_test)

Accuracy and Confusion Matrix

Once the prediction is done, the next step is to print the accuracy and plot the confusion matrix.
print('Accuracy: ', 100 * metrics.accuracy_score(y_test, y_pred))
Accuracy: 90.0 %

Confusion Matrix

confusion_matrix = pd.crosstab(y_test, y_pred, rownames=['Actual'], colnames=['Predicted'])
sn.heatmap(confusion_matrix, annot=True)
plt.show()

Random Forests Confusion Matrix
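If you prefer the counts as a plain array instead of a heatmap, sklearn.metrics.confusion_matrix returns them directly. In the sketch below, y_pred holds the ten predictions that the article prints further down, while y_true is hypothetical: it was chosen so that the counts agree with the matrix shown here (3, 1, 0, 6), but which record is actually misclassified cannot be known from the article.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# The ten predictions printed in the article.
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 1, 1, 1])
# Hypothetical actual labels: the position of the single error is an assumption.
y_true = np.array([1, 0, 0, 1, 1, 0, 1, 1, 1, 0])

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

print(cm)
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```

For a two-class problem with labels 0 and 1, the array is laid out as [[TN, FP], [FN, TP]], which matches the Actual/Predicted layout of the crosstab.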
You can also derive the accuracy from the above confusion matrix:
Accuracy = (True Positives + True Negatives)/(Sum of all values on the matrix)
Accuracy = (6+3)/(3+1+0+6)
Accuracy = (9)/(10)
Accuracy = 0.9 * 100 = 90 %

Full Source | Python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
import seaborn as sn
import matplotlib.pyplot as plt

df = pd.read_csv("study_hours.csv")

X = df[['school_hrs', 'self_hrs', 'tution_hrs']]
y = df['passed']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print('Accuracy: ', 100 * metrics.accuracy_score(y_test, y_pred))

confusion_matrix = pd.crosstab(y_test, y_pred, rownames=['Actual'], colnames=['Predicted'])
sn.heatmap(confusion_matrix, annot=True)
plt.show()

Checking the Prediction

Recall that your original dataset ("study_hours.csv") had 40 observations. Since you set the test size to 0.25, the confusion matrix displayed the results for a total of 10 records (= 40 * 0.25). These are the 10 test records:
print(X_test)
    school_hrs  self_hrs  tution_hrs
22         690       370         500
20         600       200         100
25         550       270         100
4          780       400         300
10         690       370         500
15         690       170         100
28         670       330         600
11         730       370         600
18         650       370         600
29         660       370         400
Predictions were also made for those 10 records on the target field "passed", which represents whether a student passed or not (1 indicates pass and 0 indicates fail).
print(y_pred)
[1 0 0 1 1 0 1 1 1 1]
From the following table, which compares the predictions with the actual values, you can confirm that 9 out of 10 results are correct.
Random Forest Classifier Python Code Example
This matches the accuracy level of 90%.
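A comparison table like the one described above can be built with pandas. This is only a sketch: y_pred and the row index are taken from the output printed in this article, but y_true is hypothetical, since the actual test labels are not listed in the text. It contains one mismatch so that the result agrees with the reported 9 out of 10, and the position of that mismatch is an assumption.

```python
import numpy as np
import pandas as pd

# Row labels of the ten test records and their predictions, as printed above.
test_index = [22, 20, 25, 4, 10, 15, 28, 11, 18, 29]
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 1, 1, 1])
# Hypothetical actual labels: the position of the single error is an assumption.
y_true = np.array([1, 0, 0, 1, 1, 0, 1, 1, 1, 0])

comparison = pd.DataFrame(
    {"Actual": y_true, "Predicted": y_pred}, index=test_index)
comparison["Correct"] = comparison["Actual"] == comparison["Predicted"]

print(comparison)
print("Correct predictions:", comparison["Correct"].sum(), "out of", len(comparison))
```

With the real data you would use y_test in place of y_true, and X_test.index in place of test_index.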