# Decision Tree | Machine Learning

A decision tree is one of the **predictive modelling** approaches used in machine learning. It can be used for both **classification** and regression problems.

## How does a decision tree work?

The logic behind a decision tree is easy to follow because it has a **tree-like structure**. Decision trees classify instances by sorting them down the tree from the root to some **leaf node**, which provides the classification of the instance. Each node in the tree specifies a test of some attribute of the instance. Each branch descending from a node corresponds to one of the possible values for the attribute. Each leaf node assigns a classification.

A **Decision Tree** consists of three types of nodes:

- **Root Node** – the topmost node of the tree.
- **Decision Node** – a sub-node that splits into further sub-nodes.
- **Leaf / Terminal Node** – a node with no children; such nodes are also called leaves.
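To make the root-to-leaf walk concrete, here is a minimal sketch in plain Python. The tree, its attribute names (outlook, humidity) and its labels are hypothetical and hand-built for illustration; they are not learned from any dataset.

```python
# Hypothetical hand-built tree: every non-leaf node tests one attribute,
# and each branch corresponds to one possible value of that attribute.
tree = {
    "attribute": "outlook",                      # root node
    "branches": {
        "sunny": {                               # decision node
            "attribute": "humidity",
            "branches": {
                "high": {"label": "no"},         # leaf node
                "normal": {"label": "yes"},      # leaf node
            },
        },
        "overcast": {"label": "yes"},            # leaf node
        "rainy": {"label": "no"},                # leaf node
    },
}


def classify(node, instance):
    """Sort an instance down the tree and return the label of the leaf it reaches."""
    while "label" not in node:
        value = instance[node["attribute"]]      # test the node's attribute
        node = node["branches"][value]           # follow the matching branch
    return node["label"]


print(classify(tree, {"outlook": "sunny", "humidity": "normal"}))  # yes
print(classify(tree, {"outlook": "rainy", "humidity": "high"}))    # no
```

A learned tree works the same way at prediction time; the difference is that the tests at each node are chosen from the training data by a splitting criterion.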

## Decision Tree algorithm

There are many splitting criteria used in **decision trees**. The three main ones are:

- Gini Impurity
- Entropy
- Variance

### Gini Impurity

Gini impurity is a measure of how often a **randomly chosen element** from the set would be incorrectly labelled if it were randomly labelled according to the distribution of labels in the subset.
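The definition above corresponds to the usual formula, Gini = 1 − Σ p_k², where p_k is the proportion of class k in the subset. A minimal sketch of that calculation follows; the function name gini_impurity and the sample labels are illustrative assumptions.

```python
from collections import Counter


def gini_impurity(labels):
    """Return 1 - sum(p_k ** 2) over the class proportions p_k in `labels`."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


pure = ["yes", "yes", "yes", "yes"]    # one class only
mixed = ["yes", "yes", "no", "no"]     # evenly split between two classes

print(gini_impurity(pure))   # 0.0
print(gini_impurity(mixed))  # 0.5
```

A pure node scores 0, and an evenly split two-class node scores 0.5, the maximum for two classes.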

### Entropy

Entropy is a measure of the impurity in a collection of training examples. For a binary target, **entropy** can be calculated as:

$$H(X) = -\left[ p_i \log_2 p_i + q_i \log_2 q_i \right]$$

where:

- **pi** is the probability of Y = 1 (probability of success of the event).
- **qi** is the probability of Y = 0 (probability of failure of the event).
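As a quick check of the formula, the sketch below evaluates the two-class entropy for a few probabilities. The function name binary_entropy is an assumption made for this example, and terms with zero probability are skipped, following the usual convention that 0 · log2 0 counts as 0.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a two-class node, where p is the probability of Y = 1."""
    q = 1.0 - p                       # probability of Y = 0
    entropy = 0.0
    for prob in (p, q):
        if prob > 0:                  # skip zero terms: 0 * log2(0) is treated as 0
            entropy -= prob * math.log2(prob)
    return entropy


print(binary_entropy(0.5))             # 1.0
print(binary_entropy(1.0))             # 0.0
print(round(binary_entropy(0.25), 3))  # 0.811
```

Entropy is highest (1 bit) when both outcomes are equally likely and drops to 0 when the node contains a single class.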

### Variance

Variance is the splitting criterion used when the target variable is continuous, i.e., in **regression problems**. It is so called because it uses variance as the measure for deciding the feature on which a node is split into child nodes.
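One common way to apply this criterion is to prefer the split that most reduces variance: the parent's variance minus the size-weighted variance of its children. The sketch below assumes that formulation; the function name variance_reduction and the target values are made up for illustration.

```python
from statistics import pvariance


def variance_reduction(parent, left, right):
    """Parent variance minus the size-weighted variance of the two child nodes."""
    n = len(parent)
    weighted_children = (
        (len(left) / n) * pvariance(left) + (len(right) / n) * pvariance(right)
    )
    return pvariance(parent) - weighted_children


targets = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

# A split that separates small targets from large ones...
good = variance_reduction(targets, [1.0, 2.0, 3.0], [10.0, 11.0, 12.0])
# ...versus one that mixes them in both children.
poor = variance_reduction(targets, [1.0, 10.0, 3.0], [2.0, 11.0, 12.0])

print(round(good, 2))  # 20.25
print(good > poor)     # True
```

The split that groups similar target values together leaves children with low variance, so it yields the larger reduction and would be preferred.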

## Python Implementation of Decision Tree

### About the Dataset - Kyphosis

Kyphosis is a medical condition that causes a **forward curving** of the back. It can occur at any age but is most common in older women. Here, the data frame has **81 rows and 4 columns**, representing data on patients who have had corrective spinal surgery. This file is meant for testing purposes only; download it from here: kyphosis.csv. This **data frame** contains the following columns:

- **Kyphosis**: A factor with levels absent and present, indicating whether a kyphosis was present after the operation.
- **Age**: In months.
- **Number**: The number of vertebrae involved.
- **Start**: The number of the first vertebra operated on.

### Importing Python Libraries

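import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# StringIO from the standard library; no need for the six package on Python 3
from io import StringIO
from IPython.display import Image
# metrics and train_test_split are used later in this tutorial
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.metrics import classification_report, confusion_matrix
import pydot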

### Import data from kyphosis Dataset

df = pd.read_csv('kyphosis.csv')

### Data Pre-Processing

X = df.drop('Kyphosis',axis=1)
y = df['Kyphosis']

The X variable contains the **last three columns** (Age, Number, Start) of the dataset, while y contains the **first column** (Kyphosis).

### Train Test Split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

The above code splits the dataset into **70%** training data and **30%** test data. The next step is to **fit the model** to the training set.

dtree = DecisionTreeClassifier()
dtree.fit(X_train,y_train)
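Because the split above is random, each run can produce a different train and test set and therefore different scores. One way to make it repeatable is to pass a fixed random_state to train_test_split. The sketch below uses placeholder lists with 81 rows (the size of the Kyphosis data frame) rather than the real data, and the seed value 42 is an arbitrary assumption.

```python
from sklearn.model_selection import train_test_split

# Placeholder data with 81 rows; the values are meaningless stand-ins.
X_demo = [[i] for i in range(81)]
y_demo = [i % 2 for i in range(81)]

# The same seed gives the same shuffle, so both calls return identical splits.
split_a = train_test_split(X_demo, y_demo, test_size=0.30, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.30, random_state=42)

X_train_demo, X_test_demo, y_train_demo, y_test_demo = split_a
print(len(X_train_demo), len(X_test_demo))  # 56 25
print(split_a == split_b)                   # True
```

With 81 rows and test_size=0.30, the test set holds 25 rows, which matches the support total in the classification report further down.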

### Make predictions and check accuracy

After fitting the **Decision Tree Classifier** to the training data, the next step is to make predictions on the test data, store them in the **y_pred vector**, and compute the accuracy score.

y_pred = dtree.predict(X_test)
accuracy = metrics.accuracy_score(y_test, y_pred)
print('Accuracy Score:', accuracy)

Accuracy Score: 0.76

The **decision tree classifier** gave an accuracy of 76% on this test split. Because the split is random, the exact figure can vary from run to run.

### Confusion Matrix and Classification Report

The final step is to **evaluate the model** and see how well it is performing. For that, you can use metrics such as the confusion matrix.

cfMatrix = confusion_matrix(y_test,y_pred)
print(cfMatrix)

[[18 3]
[ 3 1]]

The **confusion matrix** above shows that 6 of the 25 test observations were misclassified (the two off-diagonal entries, 3 + 3).

cReport = classification_report(y_test,y_pred)
print(cReport)

| | precision | recall | f1-score | support |
|---|---|---|---|---|
| absent | 0.86 | 0.86 | 0.86 | 21 |
| present | 0.25 | 0.25 | 0.25 | 4 |
| accuracy | | | 0.76 | 25 |
| macro avg | 0.55 | 0.55 | 0.55 | 25 |
| weighted avg | 0.76 | 0.76 | 0.76 | 25 |
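The figures in the report can be recomputed by hand from the confusion matrix printed earlier. The sketch below assumes scikit-learn's convention that rows are true classes and columns are predicted classes, in the order (absent, present); the variable names are illustrative.

```python
# Confusion matrix from the run above: rows = true class, columns = predicted class.
cf = [[18, 3],
      [3, 1]]
labels = ["absent", "present"]

total = sum(sum(row) for row in cf)
correct = sum(cf[i][i] for i in range(len(cf)))   # diagonal entries
accuracy = correct / total
print(f"accuracy: {accuracy:.2f}")                # accuracy: 0.76

report = {}
for i, label in enumerate(labels):
    predicted_as_class = sum(row[i] for row in cf)  # column sum
    support = sum(cf[i])                            # row sum
    precision = cf[i][i] / predicted_as_class
    recall = cf[i][i] / support
    report[label] = (precision, recall, support)
    print(f"{label}: precision={precision:.2f} recall={recall:.2f} support={support}")
# absent: precision=0.86 recall=0.86 support=21
# present: precision=0.25 recall=0.25 support=4
```

The overall accuracy looks reasonable, but the low precision and recall for the present class show that the model struggles with the minority class, which has only 4 test examples.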

## Visualize the Tree

You can use **Scikit-learn's** export_graphviz function to display the tree. For plotting trees, you also need to install the following:

conda install python-graphviz
pip install pydot

The **export_graphviz** function converts the decision tree classifier into a dot file, and **pydot** converts this dot file to a PNG image.

features = list(df.columns[1:])
dot_data = StringIO()
export_graphviz(dtree, out_file=dot_data,feature_names=features,filled=True,rounded=True)
graph = pydot.graph_from_dot_data(dot_data.getvalue())
Image(graph[0].create_png())

**output:**

In the **decision tree chart**, each internal node has a decision rule that splits the data. **Gini** refers to the Gini impurity (sometimes called the Gini ratio), which measures the impurity of the node. A node is pure when all of its records belong to the same class; such nodes are **leaf nodes**.

**Full Source | Python**

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from io import StringIO
from IPython.display import Image
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.metrics import classification_report, confusion_matrix
import pydot

# Load the data
df = pd.read_csv('kyphosis.csv')
df.head()

# Separate the features from the target
X = df.drop('Kyphosis', axis=1)
y = df['Kyphosis']

# Split into 70% training data and 30% test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

# Fit the model
dtree = DecisionTreeClassifier()
dtree.fit(X_train, y_train)

# Predict and evaluate
y_pred = dtree.predict(X_test)
accuracy = metrics.accuracy_score(y_test, y_pred)
print('Accuracy Score:', accuracy)
cfMatrix = confusion_matrix(y_test, y_pred)
cReport = classification_report(y_test, y_pred)
print(cfMatrix)
print(cReport)

# Visualize the tree
features = list(df.columns[1:])
dot_data = StringIO()
export_graphviz(dtree, out_file=dot_data, feature_names=features, filled=True, rounded=True)
graph = pydot.graph_from_dot_data(dot_data.getvalue())
Image(graph[0].create_png())
