Decision Tree | Machine Learning

A decision tree is one of the predictive modelling approaches used in machine learning. It can be used for both classification and regression problems.

How does a decision tree work?

The logic behind a decision tree is easy to understand because it has a tree-like structure. Decision trees classify instances by sorting them down the tree from the root to some leaf node, which provides the classification of the instance. Each node in the tree specifies a test of some attribute of the instance. Each branch descending from a node corresponds to one of the possible values for the attribute. Each leaf node assigns a classification.
A Decision Tree consists of three types of nodes:
  1. Root Node – the topmost node of the tree.
  2. Decision Node – a sub-node that splits into further sub-nodes.
  3. Leaf / Terminal Node – a node with no children, also called a leaf.
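
To make the root-to-leaf traversal concrete, here is a minimal sketch of a hand-built tree stored as nested dictionaries. The attributes, their values, the labels, and the `classify` helper are all hypothetical and chosen only for illustration; nothing here is learned from data.

```python
# A hand-written toy tree (hypothetical, not learned from data).
# Decision nodes test one attribute; each branch is one possible value;
# leaf nodes carry the final class label.
toy_tree = {
    "attribute": "outlook",                      # root node
    "branches": {
        "overcast": {"label": "play"},           # leaf node
        "sunny": {                               # decision node
            "attribute": "humidity",
            "branches": {
                "high": {"label": "stay home"},
                "normal": {"label": "play"},
            },
        },
        "rain": {                                # decision node
            "attribute": "windy",
            "branches": {
                "yes": {"label": "stay home"},
                "no": {"label": "play"},
            },
        },
    },
}


def classify(node, instance):
    """Follow the branch matching the instance's value until a leaf is reached."""
    while "label" not in node:
        value = instance[node["attribute"]]
        node = node["branches"][value]
    return node["label"]


print(classify(toy_tree, {"outlook": "sunny", "humidity": "high", "windy": "no"}))
# stay home
print(classify(toy_tree, {"outlook": "overcast", "humidity": "high", "windy": "yes"}))
# play
```

A learned tree works the same way; the only difference is that the tests at each node are chosen by a splitting criterion rather than written by hand.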

Decision Tree algorithm

Decision trees can use many splitting criteria. The three main ones are:
  1. Gini Impurity
  2. Entropy
  3. Variance

Gini Impurity

Gini impurity measures how often a randomly chosen element from a set would be incorrectly labelled if it were labelled randomly according to the distribution of labels in that set.
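
The definition above corresponds to the standard formula G = 1 - sum of p_k squared, where p_k is the proportion of class k in the node. The following is a minimal sketch of that calculation; the `gini_impurity` function name is an assumption made for this example.

```python
from collections import Counter


def gini_impurity(labels):
    """Return 1 - sum(p_k ** 2) over the class proportions p_k in labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


print(gini_impurity(["absent"] * 4))              # 0.0 (pure node)
print(gini_impurity(["absent", "present"] * 2))   # 0.5 (evenly mixed, two classes)
```

A value of 0 means every element in the node has the same label, and larger values mean a more mixed node.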

Entropy

Entropy is a measure of the impurity in a collection of training examples. For a binary target, entropy can be calculated as:
$H(X) = -\left[\, p_i \log_2 p_i + q_i \log_2 q_i \,\right]$

where,

  1. $p_i$ is the probability of Y = 1 (probability of success of the event).
  2. $q_i$ is the probability of Y = 0 (probability of failure of the event), so $q_i = 1 - p_i$.
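
As a rough illustration of the entropy formula, the sketch below computes it from a list of labels. The `entropy` function name is an assumption for this example; it sums over whatever classes are present, which for two classes reduces to the two-term formula above.

```python
import math
from collections import Counter


def entropy(labels):
    """Return the Shannon entropy (in bits) of the class distribution in labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    # Only classes that actually occur are counted, so log2(0) never arises.
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


print(entropy([1, 1, 0, 0]))   # 1.0 (maximum impurity for two classes)
print(entropy([1, 1, 1, 1]))   # 0.0 (pure node)
```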

Variance

Variance is the splitting criterion used when the target variable is continuous, i.e., in regression problems. It is so called because it uses variance as the measure for deciding the feature on which a node is split into child nodes. The split chosen is the one that gives the lowest weighted variance in the child nodes.
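
One way to picture this is to compare the variance of a parent node with the size-weighted variance of its children. The sketch below does that for two candidate splits; the `variance` and `variance_reduction` helpers and the target values are made up for illustration.

```python
def variance(values):
    """Population variance of a list of numbers (0.0 for an empty list)."""
    n = len(values)
    if n == 0:
        return 0.0
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


def variance_reduction(parent, left, right):
    """Parent variance minus the size-weighted variance of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * variance(left) + (len(right) / n) * variance(right)
    return variance(parent) - weighted


# Hypothetical continuous targets at one node.
parent = [1.0, 2.0, 9.0, 10.0]

# A split that separates small targets from large ones removes most of the variance.
print(variance_reduction(parent, [1.0, 2.0], [9.0, 10.0]))   # 16.0
# A split that mixes small and large targets barely helps.
print(variance_reduction(parent, [1.0, 9.0], [2.0, 10.0]))   # 0.25
```

The first split would be preferred because it reduces the variance far more.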

Python Implementation of Decision Tree

About the Dataset - Kyphosis

Kyphosis is a medical condition that causes a forward curving of the back. It can occur at any age but is most common in older women. Here, the data frame has 81 rows and 4 columns, representing data on patients who have had corrective spinal surgery. This file is meant for testing purposes only; download it from here: kyphosis.csv. This data frame contains the following columns:
  1. Kyphosis : A factor with levels absent and present, indicating whether a kyphosis was present after the operation.
  2. Age : In months.
  3. Number : The number of vertebrae involved.
  4. Start : The number of the first vertebra operated on.

Importing Python Libraries

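import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from io import StringIO  # standard-library equivalent of six.StringIO on Python 3
from IPython.display import Image
from sklearn import metrics  # needed for metrics.accuracy_score
from sklearn.model_selection import train_test_split  # needed for the train/test split
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.metrics import classification_report, confusion_matrix
import pydot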

Import data from kyphosis Dataset

df = pd.read_csv('kyphosis.csv')


Python Decision Tree Classification with Scikit-Learn

Data Pre-Processing

X = df.drop('Kyphosis', axis=1)
y = df['Kyphosis']
The X variable contains the last three columns (Age, Number, Start) of the dataset while y contains the first column (Kyphosis).

Train Test Split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)
The above code splits the dataset into 70% training data and 30% test data. The next step is to fit the model to the training set.
dtree = DecisionTreeClassifier()
dtree.fit(X_train, y_train)
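
Since `train_test_split` shuffles the rows randomly, each run produces a different split. If repeatable results are wanted, one option is to pass `random_state`, and `stratify` to keep the class proportions similar in both parts. The sketch below uses small synthetic data, not the kyphosis file, and the `X_demo`, `y_demo`, and `model` names are assumptions for this example.

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT the kyphosis dataset): 20 rows, 2 features.
X_demo = [[i, i % 3] for i in range(20)]
y_demo = ["present" if i >= 15 else "absent" for i in range(20)]

# random_state fixes the shuffle; stratify keeps both classes in each part.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=42, stratify=y_demo
)

# random_state on the classifier makes tie-breaking between splits repeatable too.
model = DecisionTreeClassifier(random_state=42)
model.fit(X_tr, y_tr)

print(len(X_tr), len(X_te))
```

Running this twice gives the same split and the same fitted tree, which makes accuracy figures comparable between runs.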

Make predictions and check accuracy

After fitting the Decision Tree Classifier to the training data, the next step is to make predictions on the test data, store them in the y_pred vector, and find the accuracy score.
y_pred = dtree.predict(X_test)
accuracy = metrics.accuracy_score(y_test, y_pred)
print('Accuracy Score:', accuracy)
Accuracy Score: 0.76
The decision tree classifier gave an accuracy of 76%. Because the train/test split is random, the exact score will vary from run to run.

Confusion Matrix and Classification Report

The final step is to evaluate the model and see how well it is performing. For that you can use metrics such as the confusion matrix.
cfMatrix = confusion_matrix(y_test, y_pred)
print(cfMatrix)
[[18  3]
 [ 3  1]]
The confusion matrix above shows that 6 of the 25 test observations were misclassified: 3 absent cases were predicted as present, and 3 present cases were predicted as absent.
cReport = classification_report(y_test, y_pred)
print(cReport)
              precision    recall  f1-score   support

      absent       0.86      0.86      0.86        21
     present       0.25      0.25      0.25         4

    accuracy                           0.76        25
   macro avg       0.55      0.55      0.55        25
weighted avg       0.76      0.76      0.76        25
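
The accuracy and the per-class figures can be recomputed by hand from the confusion matrix printed earlier, which is a useful check on how to read it. The sketch below hard-codes that matrix; the variable names are assumptions for this example. In scikit-learn's layout, rows are the true classes and columns are the predicted classes, here in the order absent, present.

```python
# Confusion matrix from the run above.
# Rows: true class, columns: predicted class, order [absent, present].
cf = [[18, 3],
      [3, 1]]

total = sum(sum(row) for row in cf)
correct = cf[0][0] + cf[1][1]           # diagonal entries are correct predictions
accuracy = correct / total

# Precision and recall for the "present" class.
precision_present = cf[1][1] / (cf[0][1] + cf[1][1])   # of those predicted present
recall_present = cf[1][1] / (cf[1][0] + cf[1][1])      # of those truly present

print(total - correct)                      # 6
print(round(accuracy, 2))                   # 0.76
print(precision_present, recall_present)    # 0.25 0.25
```

The high overall accuracy hides the weak result on the present class, which has only 4 test cases.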

Visualize the Tree

You can use Scikit-learn's export_graphviz function to display the tree. For plotting trees, you also need to install the following:
conda install python-graphviz
pip install pydot
The export_graphviz function converts the decision tree classifier into DOT format, and pydot converts this DOT data to a PNG image.
features = list(df.columns[1:])
dot_data = StringIO()
export_graphviz(dtree, out_file=dot_data, feature_names=features, filled=True, rounded=True)
graph = pydot.graph_from_dot_data(dot_data.getvalue())
Image(graph[0].create_png())
Output:
[Figure: the rendered decision tree]
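
If Graphviz is not installed, a plain-text view of the fitted rules is a possible alternative. The sketch below uses scikit-learn's `export_text` function on a small synthetic dataset; the data values are invented, and the kyphosis column names are reused purely as labels.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data (NOT the kyphosis dataset); values are invented.
X_demo = [[10, 2, 5], [20, 3, 14], [30, 4, 6], [40, 5, 15], [50, 6, 3], [60, 2, 16]]
y_demo = ["present", "absent", "present", "absent", "present", "absent"]

demo_tree = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# export_text returns the decision rules as an indented string, one line per node.
rules = export_text(demo_tree, feature_names=["Age", "Number", "Start"])
print(rules)
```

Each indented line is either a test on a feature or a leaf with its predicted class, so the output can be read the same way as the chart.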
In the decision tree chart, each internal node has a decision rule that splits the data. The gini value in each node is its Gini impurity (sometimes called the Gini ratio), which measures the impurity of the node. A node is pure when all of its records belong to the same class; such a node is not split further and becomes a leaf node.

Full Source | Python

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from io import StringIO  # standard-library equivalent of six.StringIO on Python 3
from IPython.display import Image
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.metrics import classification_report, confusion_matrix
import pydot

# Load the data
df = pd.read_csv('kyphosis.csv')
df.head()

# Separate the features from the target
X = df.drop('Kyphosis', axis=1)
y = df['Kyphosis']

# Split into 70% training data and 30% test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

# Fit the classifier
dtree = DecisionTreeClassifier()
dtree.fit(X_train, y_train)

# Predict and evaluate
y_pred = dtree.predict(X_test)
accuracy = metrics.accuracy_score(y_test, y_pred)
print('Accuracy Score:', accuracy)
cfMatrix = confusion_matrix(y_test, y_pred)
cReport = classification_report(y_test, y_pred)
print(cfMatrix)
print(cReport)

# Visualize the tree
features = list(df.columns[1:])
dot_data = StringIO()
export_graphviz(dtree, out_file=dot_data, feature_names=features, filled=True, rounded=True)
graph = pydot.graph_from_dot_data(dot_data.getvalue())
Image(graph[0].create_png())