Random Forests Classifiers | Python
Random forest is a supervised learning algorithm made up of many decision trees. A single decision tree can only predict to a certain degree of accuracy, but when many trees are combined they become a significantly more robust prediction tool. Adding more trees generally makes the predictions more stable and reduces the overfitting that a single deep tree is prone to, although the gains level off beyond a certain number of trees.
The "forest" generated by the Random Forest algorithm is an ensemble of decision trees, usually trained with the "bagging" method.
Ensemble methods
The goal of ensemble algorithms is to combine the predictions of several base estimators built with a given learning algorithm in order to improve robustness over a single estimator.
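Bagging methods
Bootstrap Aggregation, or Bagging, is an ensemble meta-algorithm. The concept behind bagging is to train several base learners, each on a bootstrap sample of the training data (rows drawn at random with replacement), and to combine their predictions by voting or averaging to create a more accurate output.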
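The bagging idea described above can be sketched by hand. This is only an illustrative sketch, not how the Random Forest class is implemented internally: the synthetic data from make_classification, the variable names, and the choice of 25 base learners are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data, used only for illustration (not the tutorial's dataset).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

rng = np.random.default_rng(0)
n_learners = 25  # hypothetical number of base learners (odd, so votes cannot tie)
trees = []
for _ in range(n_learners):
    # Bootstrap sample: draw as many row indices as there are rows, with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# Aggregate: every tree votes and the majority class wins (labels are 0/1).
votes = np.stack([tree.predict(X) for tree in trees])  # shape (n_learners, n_samples)
bagged_pred = (votes.mean(axis=0) >= 0.5).astype(int)

print("Bagged training accuracy:", (bagged_pred == y).mean())
```

Each tree sees a slightly different version of the data, so their individual errors tend to differ, and the vote smooths them out.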
Hyper-parameters
The hyperparameters of the Random Forest model are used either to increase the predictive power of the model or to make the model faster.
n_estimators
The n_estimators parameter is the number of trees to be used in the Random Forest. Since the Random Forest algorithm is an ensemble method that builds multiple decision trees, this parameter controls how many trees are built.
max_features
The max_features parameter is the maximum number of features Random Forest considers when splitting a node.
n_jobs
The n_jobs parameter tells the engine how many processors it is allowed to use. A value of -1 means all available processors.
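random_state
The random_state parameter sets a seed for the random generator so that results are reproducible. In the Random Forest model it fixes the bootstrap samples and the feature subsets considered at each split; in train_test_split it makes the train-test split deterministic.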
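The four hyperparameters above can be set together when the classifier is created. The following is a minimal sketch on synthetic data; the specific values (50 trees, seed 42) are arbitrary choices for illustration, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data for illustration only.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

clf = RandomForestClassifier(
    n_estimators=50,      # number of trees in the forest
    max_features="sqrt",  # consider about sqrt(n_features) features per split
    n_jobs=-1,            # use all available processors
    random_state=42,      # seed for bootstrap samples and feature subsets
)
clf.fit(X, y)

# The fitted forest holds one decision tree per estimator.
print(len(clf.estimators_))
```

Fitting a second classifier with the same random_state on the same data should give the same predictions, which is the practical benefit of fixing the seed.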
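Python implementation of the Random Forest algorithm
The Random Forest algorithm establishes the outcome based on the predictions of its decision trees. For classification, the trees vote and the forest returns the winning class (scikit-learn averages the class probabilities predicted by the trees and picks the class with the highest average); for regression, it takes the mean of the outputs from the various trees. Increasing the number of trees generally makes the outcome more stable.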
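The averaging step described above can be checked directly on a fitted scikit-learn forest. This is a small sketch on synthetic data (an assumption for the example), comparing a hand-computed average of the per-tree probabilities with the forest's own output.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data for illustration only.
X, y = make_classification(n_samples=200, n_features=6, random_state=1)
forest = RandomForestClassifier(n_estimators=20, random_state=1).fit(X, y)

# Average the per-tree class probabilities by hand.
tree_probs = np.stack([tree.predict_proba(X) for tree in forest.estimators_])
mean_probs = tree_probs.mean(axis=0)

# The predicted class is the one with the highest average probability.
manual_pred = forest.classes_[mean_probs.argmax(axis=1)]

same_probs = np.allclose(mean_probs, forest.predict_proba(X))
same_labels = (manual_pred == forest.predict(X)).all()
print(same_probs, same_labels)
```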
About the Dataset
The "study_hours.csv" dataset has four columns: school_hrs, self_hrs, tution_hrs and passed. The school_hrs column indicates how many hours per year the student studies at school, self_hrs indicates how many hours per year the student studies at home, and tution_hrs indicates how many hours per year the student takes private tuition classes. Apart from these three independent variables, there is one dependent variable in the dataset named "passed". This label has two values, either 1 or 0: a value of 1 indicates pass and a value of 0 indicates fail. The file is meant for testing purposes only; download it from here: study_hours.csv.
Sample Data from study_hours.csv:
Import Python Packages
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
import seaborn as sn
import matplotlib.pyplot as plt
Create the DataFrame
df = pd.read_csv("study_hours.csv")
print(df)
X = df[['school_hrs', 'self_hrs', 'tution_hrs']]
y = df['passed']
The X variable contains the three feature columns (school_hrs, self_hrs, tution_hrs) of the dataset, while y contains the target column "passed".
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
The code above sets the test size to 0.25, so the model is tested on 25% of the dataset and trained on the remaining 75%. If you don't specify random_state in train_test_split, a different random split is generated every time you run the code, so the train and test datasets contain different rows each time. However, if a fixed value is assigned, such as random_state=0, then no matter how many times you execute your code the result is the same, i.e., the same rows end up in the train and test datasets.
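The effect of a fixed random_state described above can be sketched with a small stand-in array. The 40-row array below is a hypothetical placeholder chosen to match the dataset size mentioned in this tutorial; it is not the real study_hours.csv data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for a 40-row dataset.
X = np.arange(40).reshape(-1, 1)
y = np.arange(40) % 2

# Two calls with the same seed produce the same split.
_, X_test_a, _, _ = train_test_split(X, y, test_size=0.25, random_state=0)
_, X_test_b, _, _ = train_test_split(X, y, test_size=0.25, random_state=0)

print(len(X_test_a))                       # 10 rows, i.e. 25% of 40
print(np.array_equal(X_test_a, X_test_b))  # True
```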
Building a Random Forest Model
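clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)
Here n_estimators controls the number of trees used in the forest. Because random_state is not set on the classifier itself, the fitted forest, and possibly the accuracy, can vary slightly from run to run.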
Making Predictions With Random Forest Model
Once your Random Forest model training is complete, it's time to predict on the test data using the created model. You can store the predicted values in the y_pred variable.
y_pred = clf.predict(X_test)
Accuracy and Confusion Matrix
Once the prediction is over, the next step is to print the accuracy and plot the confusion matrix.
print('Accuracy: ', 100 * metrics.accuracy_score(y_test, y_pred))
Accuracy: 90.0 %
confusion_matrix = pd.crosstab(y_test, y_pred, rownames=['Actual'], colnames=['Predicted'])
sn.heatmap(confusion_matrix, annot=True)
plt.show()
You can also derive the accuracy from the above confusion matrix:
Accuracy = (True Positives + True Negatives)/(Sum of all values on the matrix)
Accuracy = (6+3)/(3+1+0+6)
Accuracy = (9)/(10)
Accuracy = 0.9 * 100 = 90 %
Full Source | Python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
import seaborn as sn
import matplotlib.pyplot as plt
df = pd.read_csv("study_hours.csv")
X = df[['school_hrs', 'self_hrs', 'tution_hrs']]
y = df['passed']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print('Accuracy: ', 100 * metrics.accuracy_score(y_test, y_pred))
confusion_matrix = pd.crosstab(y_test, y_pred, rownames=['Actual'], colnames=['Predicted'])
sn.heatmap(confusion_matrix, annot=True)
plt.show()
Checking the Prediction
Recall that your original dataset ("study_hours.csv") had 40 observations. Since you set the test size to 0.25, the confusion matrix displayed the results for a total of 10 records (=40*0.25). These are the 10 test records:
    school_hrs  self_hrs  tution_hrs
22         690       370         500
20         600       200         100
25         550       270         100
4          780       400         300
10         690       370         500
15         690       170         100
28         670       330         600
11         730       370         600
18         650       370         600
29         660       370         400
Predictions were also made for those 10 records on the target field "passed", which represents whether a student passed or not (1 indicates pass and 0 indicates fail).
[1 0 0 1 1 0 1 1 1 1]
From the following table you can confirm that 9 out of 10 predictions are correct.
From the above table, you can confirm that the result matches the accuracy level of 90%.
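The accuracy derivation from the confusion matrix counts can also be sketched in a few lines. The counts 3, 1, 0 and 6 are taken from the derivation in this tutorial; their arrangement into rows (actual) and columns (predicted) is an assumption, since only the counts are given in the text.

```python
import numpy as np

# Counts from the accuracy derivation; the layout (rows = actual,
# columns = predicted) is an assumption made for this sketch.
cm = np.array([[3, 1],
               [0, 6]])

correct = np.trace(cm)  # diagonal: true negatives + true positives
total = cm.sum()        # all test records
accuracy = correct / total

print(f"Accuracy: {100 * accuracy:.1f} %")
```

The diagonal of the matrix holds the correctly classified records, so dividing its sum by the total reproduces the 9 out of 10 result.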