Random Forests Classifiers | Python
Random forest is a supervised learning algorithm made up of many decision trees. A single decision tree can only predict to a certain degree of accuracy, but many trees combined become a significantly more robust prediction tool. Adding more trees generally makes the predictions more stable and reduces the overfitting that a single deep tree is prone to, although the gains level off beyond a certain number of trees.
The "forest" generated by the Random Forest algorithm is an ensemble of decision trees, usually trained with the "bagging" method.
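To make the idea of a bagged ensemble concrete, here is a minimal hand-rolled sketch. It is not how scikit-learn implements Random Forest internally. It assumes synthetic data from make_classification (not the dataset used later in this article), and the names n_trees, trees, votes and bagged_pred are made up for illustration. Each tree is fitted on a sample of the training rows drawn with replacement, and the final prediction is the majority vote of the trees.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data (an assumption, not the article's dataset).
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

rng = np.random.default_rng(0)
n_trees = 25  # an odd number avoids tied votes with two classes
trees = []
for _ in range(n_trees):
    # Bootstrap sample: draw row indices with replacement.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(
        max_features="sqrt",  # random subset of features at each split
        random_state=int(rng.integers(1_000_000)),
    )
    trees.append(tree.fit(X_train[idx], y_train[idx]))

# Aggregate: every tree votes and the majority class wins.
votes = np.stack([t.predict(X_test) for t in trees])  # shape (n_trees, n_test)
bagged_pred = (votes.mean(axis=0) > 0.5).astype(int)

print("single tree accuracy:", (trees[0].predict(X_test) == y_test).mean())
print("bagged accuracy:     ", (bagged_pred == y_test).mean())
```

The exact accuracies depend on the synthetic data and the seeds, so treat the printed numbers as illustrative rather than as a benchmark.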
Ensemble methods
The goal of ensemble algorithms is to combine the predictions of several base estimators built with a given learning algorithm in order to improve robustness over a single estimator.
Bagging methods
Bootstrap Aggregation, or Bagging, is an ensemble meta-algorithm. Each base learner is trained on a bootstrap sample (a random sample of the training data drawn with replacement), and the predictions of the base learners are then combined to create a more accurate output.
Hyper-parameters
The hyperparameters of a Random Forest model are used either to increase the predictive power of the model or to make the model faster.
n_estimators
n_estimators is the number of trees in the Random Forest. Because Random Forest is an ensemble method that builds multiple decision trees, this parameter controls how many trees are built.
max_features
max_features is the maximum number of features Random Forest considers when splitting a node.
n_jobs
n_jobs tells the engine how many processors it is allowed to use.
random_state
random_state sets the seed of the random number generator. In RandomForestClassifier it makes the randomness used to build the trees reproducible; in train_test_split it makes the train-test split deterministic.
Python implementation of the Random Forest algorithm
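As a quick illustration of the four hyperparameters described above, the sketch below passes them to scikit-learn's RandomForestClassifier. The data comes from make_classification and is purely synthetic, and the specific values (50 trees, "sqrt", 2 jobs, seed 42) are arbitrary choices for the example, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data for illustration only (not the article's dataset).
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

params = dict(
    n_estimators=50,      # number of trees in the forest
    max_features="sqrt",  # features considered at each split
    n_jobs=2,             # allow up to two processors (-1 would use all)
    random_state=42,      # seed for bootstrap and feature sampling
)

clf_a = RandomForestClassifier(**params).fit(X, y)
clf_b = RandomForestClassifier(**params).fit(X, y)

# One fitted tree per estimator.
print(len(clf_a.estimators_))
# Same seed and same data give the same predictions.
print((clf_a.predict(X) == clf_b.predict(X)).all())
```

Fitting the model twice with the same random_state is only there to show reproducibility; in practice you fit once.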
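The Random Forest algorithm establishes the outcome based on the predictions of the individual decision trees. For classification, the trees' predictions are combined by voting (scikit-learn averages the class probabilities predicted by the trees and picks the most probable class); for regression, the outputs of the trees are averaged. Increasing the number of trees generally makes the outcome more stable.
About the Dataset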
The "study_hours.csv" dataset has four columns: school_hrs, self_hrs, tution_hrs and passed. school_hrs indicates how many hours per year the student studies at school, self_hrs indicates how many hours per year the student studies at home, and tution_hrs indicates how many hours per year the student takes private tuition classes. Apart from these three independent variables, there is one dependent variable in the dataset named "passed". This label has two values: 1 indicates pass and 0 indicates fail. The file is meant for testing purposes only; download it from here: study_hours.csv.
Sample Data from study_hours.csv:
Import Python Packages
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
import seaborn as sn
import matplotlib.pyplot as plt
Create the DataFrame
df = pd.read_csv("study_hours.csv")
print(df)
Data Pre-Processing
X = df[['school_hrs', 'self_hrs','tution_hrs']]
y = df['passed']
The X variable contains the three feature columns (school_hrs, self_hrs, tution_hrs) of the dataset, while y contains the target column "passed".
Apply train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=0)
The code above sets the test size to 0.25, so the model is tested on 25% of the dataset and trained on the remaining 75%.
If you don't specify random_state in train_test_split, a new random split is generated every time you run the code, so the train and test datasets contain different rows each time. If you assign a fixed value such as random_state=0, the split is the same no matter how many times you run the code, i.e. the train and test datasets always contain the same rows.
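The effect of random_state on the split can be checked directly. The sketch below uses 40 dummy rows (the same number of observations as the dataset in this article, but with made-up values) and compares splits produced with the same seed and with a different seed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 40 dummy rows with made-up values (not the contents of study_hours.csv).
X = np.arange(80).reshape(40, 2)
y = np.arange(40) % 2

# Same seed twice: the two splits are identical.
a_train, a_test, _, _ = train_test_split(X, y, test_size=0.25, random_state=0)
b_train, b_test, _, _ = train_test_split(X, y, test_size=0.25, random_state=0)
print("same seed, same test rows:", np.array_equal(a_test, b_test))

# A different seed generally selects different rows.
c_train, c_test, _, _ = train_test_split(X, y, test_size=0.25, random_state=1)
print("different seed, same test rows:", np.array_equal(a_test, c_test))

# 75% of the rows go to training and 25% to testing.
print(a_train.shape, a_test.shape)
```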
Building a Random Forest Model
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train,y_train)
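n_estimators controls the number of trees built for the forest. Because no random_state is passed to RandomForestClassifier here, the fitted trees, and therefore the reported accuracy, can vary slightly from run to run.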
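To get a feel for what n_estimators does, you can fit forests of different sizes and compare them on held-out data. The sketch below assumes synthetic data from make_classification with some label noise, not the study_hours.csv file, and the sizes 1, 10 and 100 are arbitrary. The scores it prints are specific to that synthetic data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data with some label noise (illustrative only).
X, y = make_classification(
    n_samples=500, n_features=10, flip_y=0.1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

scores = {}
tree_counts = {}
for n in (1, 10, 100):
    clf = RandomForestClassifier(n_estimators=n, random_state=0)
    clf.fit(X_train, y_train)
    tree_counts[n] = len(clf.estimators_)  # number of fitted trees
    scores[n] = clf.score(X_test, y_test)  # mean accuracy on the test set

for n in scores:
    print(f"n_estimators={n:>3}: trees={tree_counts[n]:>3}, "
          f"test accuracy={scores[n]:.3f}")
```

More trees also mean longer training time, so in practice the value is a trade-off between stability and speed.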
Making Predictions With Random Forest Model
Once your Random Forest model is trained, it's time to make predictions with it. Store the predicted values in the y_pred variable.
y_pred=clf.predict(X_test)
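If you are curious how the forest arrives at a class label, the sketch below shows the relationship between predict and predict_proba in scikit-learn: the forest averages the class probabilities of its trees and returns the class with the highest average. The data is synthetic (make_classification), and the variable names proba, manual and pred_from_proba are made up for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data for illustration only (not the article's dataset).
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
clf = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

sample = X[:5]

# Class probabilities averaged over all trees, one row per sample.
proba = clf.predict_proba(sample)

# Reproduce the average by hand from the individual trees.
manual = np.mean(
    [tree.predict_proba(sample) for tree in clf.estimators_], axis=0
)

# predict() returns the class with the highest averaged probability.
pred_from_proba = clf.classes_[np.argmax(proba, axis=1)]

print(np.allclose(proba, manual))
print(np.array_equal(pred_from_proba, clf.predict(sample)))
```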
Accuracy and Confusion Matrix
Once the prediction is done, the next step is to print the accuracy and plot the confusion matrix.
print('Accuracy: ', 100 * metrics.accuracy_score(y_test, y_pred))
Accuracy: 90.0 %
Confusion Matrix
confusion_matrix = pd.crosstab(y_test, y_pred, rownames=['Actual'], colnames=['Predicted'])
sn.heatmap(confusion_matrix, annot=True)
plt.show()
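If you prefer a text-only view, sklearn.metrics.confusion_matrix returns the counts as an array. The label arrays in the sketch below are hypothetical: they are not the actual test records, and were chosen only so that the counts match those used in the accuracy calculation in this section (3 true negatives, 1 false positive, 0 false negatives, 6 true positives).

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels, NOT the article's actual test records.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 1, 1, 1, 1, 1, 1, 1])

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm.ravel()

# Accuracy = correct predictions / all predictions.
accuracy = (tp + tn) / cm.sum()

print(cm)
print("accuracy from matrix:", accuracy)
print("accuracy_score:      ", accuracy_score(y_true, y_hat))
```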

You can also derive the accuracy from the above confusion matrix:
Accuracy = (True Positives + True Negatives) / (Sum of all values in the matrix)
Accuracy = (6 + 3) / (3 + 1 + 0 + 6)
Accuracy = 9 / 10
Accuracy = 0.9 = 90%
Full Source | Python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
import seaborn as sn
import matplotlib.pyplot as plt
df = pd.read_csv("study_hours.csv")
X = df[['school_hrs', 'self_hrs','tution_hrs']]
y = df['passed']
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=0)
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train,y_train)
y_pred=clf.predict(X_test)
print('Accuracy: ', 100 * metrics.accuracy_score(y_test, y_pred))
confusion_matrix = pd.crosstab(y_test, y_pred, rownames=['Actual'], colnames=['Predicted'])
sn.heatmap(confusion_matrix, annot=True)
plt.show()
Checking the Prediction
Recall that the original dataset ("study_hours.csv") has 40 observations. Since the test size was set to 0.25, the confusion matrix displays the results for a total of 10 records (= 40 * 0.25). These are the 10 test records:
print(X_test)
    school_hrs  self_hrs  tution_hrs
22         690       370         500
20         600       200         100
25         550       270         100
4          780       400         300
10         690       370         500
15         690       170         100
28         670       330         600
11         730       370         600
18         650       370         600
29         660       370         400
The prediction was also made for those 10 records. The target field "passed" represents whether a student passed or not: 1 indicates pass and 0 indicates fail.
print(y_pred)
[1 0 0 1 1 0 1 1 1 1]
Comparing the predictions with the actual values (a table of actual versus predicted labels for the 10 test records) confirms that 9 out of 10 predictions are correct. This matches the accuracy level of 90%.
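One way to build such a comparison table yourself is sketched below. The helper compare_predictions is a hypothetical name introduced for this example, and the labels passed to it are made up; in the tutorial you would call it with y_test and y_pred instead.

```python
import pandas as pd


def compare_predictions(y_true, y_pred):
    """Return a table of actual vs. predicted labels with a match flag."""
    table = pd.DataFrame({"Actual": list(y_true), "Predicted": list(y_pred)})
    table["Correct"] = table["Actual"] == table["Predicted"]
    return table


# Hypothetical labels for illustration; pass y_test and y_pred in practice.
table = compare_predictions([1, 0, 0, 1, 1], [1, 0, 1, 1, 1])
print(table)
print("Correct:", int(table["Correct"].sum()), "of", len(table))
```

Dividing the number of correct rows by the total number of rows gives the same accuracy that metrics.accuracy_score reports.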