Decision Tree | Machine Learning
A decision tree is one of the predictive modelling approaches used in machine learning. It can be used for both classification and regression problems.
How does a decision tree work?
The logic behind a decision tree is easy to follow because it has a tree-like structure. Decision trees classify instances by sorting them down the tree from the root to some leaf node, which provides the classification of the instance. Each node in the tree specifies a test of some attribute of the instance. Each branch descending from a node corresponds to one of the possible values for the attribute. Each leaf node assigns a classification.
A Decision Tree consists of three types of nodes:
- Root Node – The topmost node of the tree is called the root node.
- Decision Node – A sub-node that splits into further sub-nodes is called a decision node.
- Leaf / Terminal Node – A node with no children is called a leaf or terminal node.
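The root-to-leaf walk described above can be sketched in a few lines of plain Python. This is only an illustration: the nested-dictionary layout, the `classify` helper, and the thresholds are all made up for this example (the feature names are borrowed from the Kyphosis dataset used later), and it is not how scikit-learn stores a tree internally.

```python
# Hypothetical toy tree: thresholds and labels are invented for illustration only.
toy_tree = {
    "feature": "Start", "threshold": 8.5,        # root node
    "left": {                                     # decision node (Start <= 8.5)
        "feature": "Age", "threshold": 40.0,
        "left": {"label": "absent"},              # leaf node
        "right": {"label": "present"},            # leaf node
    },
    "right": {"label": "absent"},                 # leaf node (Start > 8.5)
}


def classify(node, instance):
    """Walk from the root to a leaf, applying one attribute test per node."""
    while "label" not in node:
        go_left = instance[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["label"]


print(classify(toy_tree, {"Age": 60, "Number": 4, "Start": 5}))   # present
print(classify(toy_tree, {"Age": 60, "Number": 4, "Start": 12}))  # absent
```

Each loop iteration corresponds to one test at a root or decision node; the loop stops as soon as a leaf is reached.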
Decision Tree algorithm
There are many splitting criteria used in decision trees. The 3 main splitting criteria used in decision trees are:
- Gini Impurity
- Entropy
- Variance
Gini Impurity
Gini impurity is a measure of how often a randomly chosen element from the set would be incorrectly labelled if it were randomly labelled according to the distribution of labels in the subset.
Entropy
Entropy is a measure of the impurity in a collection of training examples. For a binary target, entropy can be calculated as:
H(X) = -[p_i * log2(p_i) + q_i * log2(q_i)]
where,
- p_i is the probability of Y = 1 (probability of success of the event).
- q_i is the probability of Y = 0 (probability of failure of the event), so q_i = 1 - p_i.
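Both impurity measures can be computed directly from a list of class labels. The sketch below uses only the standard library; the helper names are made up for this example, and the Gini formula used (one minus the sum of squared class probabilities) is the standard closed form of the definition given in the Gini Impurity section rather than something stated in this article.

```python
from collections import Counter
from math import log2


def class_probabilities(labels):
    """Relative frequency of each class that occurs in the label list."""
    counts = Counter(labels)
    total = len(labels)
    return [n / total for n in counts.values()]


def gini_impurity(labels):
    """Gini impurity: 1 - sum of squared class probabilities."""
    return 1.0 - sum(p * p for p in class_probabilities(labels))


def entropy(labels):
    """Entropy in bits: sum of -p * log2(p) over the classes present."""
    return sum(-p * log2(p) for p in class_probabilities(labels))


mixed = ["present"] * 5 + ["absent"] * 5   # 50/50 split: maximally impure
pure = ["absent"] * 10                     # single class: perfectly pure

print(gini_impurity(mixed), entropy(mixed))  # 0.5 1.0
print(gini_impurity(pure), entropy(pure))    # 0.0 0.0
```

For a binary target, both measures peak at a 50/50 class mix and drop to zero when a node contains a single class.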
Variance
Variance is a splitting method used when the target variable is continuous, i.e., in regression problems. It is so called because it uses variance as the measure for deciding the feature on which a node is split into child nodes.
Python Implementation of Decision Tree
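For regression targets, the variance criterion described in the Variance section can be sketched in plain Python before moving on to the dataset. The helper names and target values below are invented for illustration. The assumed scoring rule is the usual one: a split is better when the size-weighted variance of the two child nodes falls further below the parent's variance.

```python
from statistics import pvariance


def weighted_child_variance(left, right):
    """Size-weighted average of the population variances of two child nodes."""
    n = len(left) + len(right)
    return (len(left) / n) * pvariance(left) + (len(right) / n) * pvariance(right)


def variance_reduction(parent, left, right):
    """How much a candidate split lowers the variance of the target values."""
    return pvariance(parent) - weighted_child_variance(left, right)


# Hypothetical continuous target values, already sorted by some feature.
targets = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

good_split = variance_reduction(targets, targets[:3], targets[3:])
poor_split = variance_reduction(targets, targets[:1], targets[1:])

print(good_split > poor_split)  # True
```

The split that separates the low values from the high values removes far more variance, so it would be preferred over the lopsided one.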
About the Dataset - Kyphosis
Kyphosis is a medical condition that causes a forward curving of the back. It can occur at any age but is most common in older women. Here, the data frame has 81 rows and 4 columns, representing data on patients who have had corrective spinal surgery. This file is meant for testing purposes only, download it from here: kyphosis.csv. This data frame contains the following columns:
- Kyphosis: A factor with levels absent and present, indicating whether a kyphosis was present after the operation.
- Age: In months.
- Number: The number of vertebrae involved.
- Start: The number of the first vertebra operated on.
Importing Python Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from io import StringIO  # standard-library StringIO; no need for the six package
from sklearn import metrics  # provides metrics.accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.metrics import classification_report, confusion_matrix
from IPython.display import Image
import pydot
Import data from kyphosis Dataset
df = pd.read_csv('kyphosis.csv')

Data Pre-Processing
X = df.drop('Kyphosis',axis=1)
y = df['Kyphosis']
The X variable contains the last three columns (Age, Number, Start) of the dataset while y contains the first column (Kyphosis).
Train Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)
The above code splits the dataset into 70% train data and 30% test data.
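To see what a 70/30 split means for the 81 rows of this dataset, here is a standard-library sketch of a shuffled split. The `split_indices` helper and the seed are made up for this example. Rounding the test size up is an assumption intended to mirror how scikit-learn treats a fractional test_size, and it is consistent with the 25 test rows reported later in this article.

```python
import math
import random


def split_indices(n_samples, test_size=0.30, seed=0):
    """Shuffle row indices and cut them into train and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Assumption: the test set size is rounded up, the rest goes to training.
    n_test = math.ceil(test_size * n_samples)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(81)
print(len(train_idx), len(test_idx))  # 56 25
```

Because the rows are shuffled before cutting, a different seed gives a different split and therefore slightly different scores.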
The next step is to fit the model to the training set.
dtree = DecisionTreeClassifier()
dtree.fit(X_train,y_train)
Make predictions and check accuracy
After fitting the Decision Tree Classifier to the training data, the next step is to make predictions on the test data, store them in the y_pred vector, and find the accuracy score.
y_pred = dtree.predict(X_test)
accuracy = metrics.accuracy_score(y_test,y_pred)
print('Accuracy Score:',accuracy )
Accuracy Score: 0.76
The decision tree classifier gave an accuracy of 76% on this run. Because the train/test split is random, the exact figure can change from run to run.
Confusion Matrix and Classification Report
The final step is to evaluate the model and see how well it is performing. For that you can use metrics such as the confusion matrix and the classification report.
cfMatrix = confusion_matrix(y_test,y_pred)
print(cfMatrix)
[[18  3]
 [ 3  1]]
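The confusion matrix above shows that 6 of the 25 test observations were misclassified: 3 absent cases were predicted as present, and 3 present cases were predicted as absent.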
cReport = classification_report(y_test,y_pred)
print(cReport)
              precision    recall  f1-score   support

      absent       0.86      0.86      0.86        21
     present       0.25      0.25      0.25         4

    accuracy                           0.76        25
   macro avg       0.55      0.55      0.55        25
weighted avg       0.76      0.76      0.76        25
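The figures in the report follow directly from the confusion matrix. The sketch below recomputes accuracy, precision and recall by hand from the matrix printed earlier, assuming scikit-learn's convention that rows are true classes and columns are predicted classes, in the order absent, present.

```python
# Confusion matrix from the run above: rows = true class, columns = predicted class.
cf = [[18, 3],
      [3, 1]]
labels = ["absent", "present"]

total = sum(sum(row) for row in cf)
correct = sum(cf[i][i] for i in range(len(cf)))  # diagonal = correct predictions
accuracy = correct / total
print(f"accuracy: {accuracy:.2f}")  # accuracy: 0.76

scores = {}
for i, name in enumerate(labels):
    predicted_as_class = sum(row[i] for row in cf)  # column total
    actually_class = sum(cf[i])                     # row total
    precision = cf[i][i] / predicted_as_class
    recall = cf[i][i] / actually_class
    scores[name] = (precision, recall)
    print(f"{name}: precision={precision:.2f} recall={recall:.2f}")
# absent: precision=0.86 recall=0.86
# present: precision=0.25 recall=0.25
```

The low scores for the present class show why accuracy alone can be misleading here: only 1 of the 4 present cases in the test set was identified.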
Visualize the Tree
You can use Scikit-learn's export_graphviz function to display the tree. For plotting trees, you also need to install the following:
conda install python-graphviz
pip install pydot
The export_graphviz function converts the decision tree classifier into DOT format, and pydot converts that DOT data into a PNG image.
features = list(df.columns[1:])
dot_data = StringIO()
export_graphviz(dtree, out_file=dot_data,feature_names=features,filled=True,rounded=True)
graph = pydot.graph_from_dot_data(dot_data.getvalue())
Image(graph[0].create_png())
Output: [image: the fitted decision tree rendered by Graphviz]
In the decision tree chart, each internal node has a decision rule that splits the data. The gini value shown in each node is the Gini impurity, which measures the impurity of that node. A node is pure when all of its records belong to the same class; such nodes are leaf nodes.
Full Source | Python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from io import StringIO  # standard-library StringIO; no need for the six package
from sklearn import metrics  # provides metrics.accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.metrics import classification_report, confusion_matrix
from IPython.display import Image
import pydot
df = pd.read_csv('kyphosis.csv')
df.head()
X = df.drop('Kyphosis',axis=1)
y = df['Kyphosis']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)
dtree = DecisionTreeClassifier()
dtree.fit(X_train,y_train)
y_pred = dtree.predict(X_test)
accuracy = metrics.accuracy_score(y_test,y_pred)
print('Accuracy Score:',accuracy )
cfMatrix = confusion_matrix(y_test,y_pred)
cReport = classification_report(y_test,y_pred)
print(cfMatrix)
print(cReport)
features = list(df.columns[1:])
dot_data = StringIO()
export_graphviz(dtree, out_file=dot_data,feature_names=features,filled=True,rounded=True)
graph = pydot.graph_from_dot_data(dot_data.getvalue())
Image(graph[0].create_png())