# Logistic Regression | Python

Classification is an area of **supervised machine learning** that tries to predict which class or category an entity belongs to, based on its features. Approximately 65% of problems in Data Science are classification problems.

**Logistic Regression** is a machine learning classification algorithm used to predict discrete values such as 0 or 1, spam or not spam, and so on. This article implements a **Logistic Regression model** using Python and scikit-learn. Using the **"students_data.csv"** dataset, it predicts whether a given student will pass or fail an exam based on three study-related features.

## About the Dataset

The students_data.csv dataset has three features: school_hrs, self_hrs and tution_hrs. **school_hrs** indicates how many hours per year the student studies at school, **self_hrs** indicates how many hours per year the student studies at home, and **tution_hrs** indicates how many hours per year the student spends in private tuition classes. Apart from these three features, the dataset has one label named **"passed"**. This label has two values, **1** or **0**: a value of 1 indicates a pass and a value of 0 indicates a fail. The file is meant for testing purposes only; you can download it from here: students_data.csv.

## Logistic Regression example | Python

Here, you can build a **Logistic Regression** model in which:

- The dependent variable "passed" represents whether a student passed or not.
- The 3 independent variables are the school_hrs, self_hrs and tution_hrs.
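To make the relationship between the three independent variables and the dependent variable concrete, here is a minimal sketch of how a logistic model turns the features into a pass probability. The coefficient values and the helper names (`sigmoid`, `pass_probability`) are hypothetical and chosen only for illustration; they are not the values scikit-learn would fit on students_data.csv.

```python
import math


def sigmoid(z: float) -> float:
    """Map any real-valued score to a probability in (0, 1)."""
    # Two branches keep math.exp from overflowing for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def pass_probability(school_hrs, self_hrs, tution_hrs, weights, intercept):
    """Probability that passed == 1 under a logistic model."""
    w_school, w_self, w_tution = weights
    score = (intercept
             + w_school * school_hrs
             + w_self * self_hrs
             + w_tution * tution_hrs)
    return sigmoid(score)


# Hypothetical coefficients, for illustration only (not fitted values).
example_weights = (0.01, 0.02, 0.005)
example_intercept = -15.0

p = pass_probability(650, 370, 600, example_weights, example_intercept)
predicted_label = 1 if p >= 0.5 else 0  # 1 = passed, 0 = failed
print(round(p, 3), predicted_label)
```

The model itself only produces a probability; the 0 or 1 label comes from comparing that probability with a threshold, which is 0.5 in this sketch.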

## Import Python packages

import pandas as pd
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import seaborn as sn

## Importing the Data Set into Python Script

The first step is to read in the dataset using the pandas **read_csv()** function.

df = pd.read_csv("students_data.csv")
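Before going further, it can help to confirm that the loaded data has the expected columns and that the label only contains 0 and 1. The sketch below does this on a small in-memory CSV; the three rows are made up for illustration and are not taken from students_data.csv, and the variable names (`sample_df`, `expected_columns`) are assumptions of this example.

```python
import io

import pandas as pd

# Made-up rows for illustration only; not taken from students_data.csv.
sample_csv = io.StringIO(
    "school_hrs,self_hrs,tution_hrs,passed\n"
    "600,300,200,0\n"
    "700,400,500,1\n"
    "640,350,300,1\n"
)
sample_df = pd.read_csv(sample_csv)

# Check that every column the tutorial relies on is present.
expected_columns = ["school_hrs", "self_hrs", "tution_hrs", "passed"]
missing = [c for c in expected_columns if c not in sample_df.columns]

# Check that the label column only holds the two class values.
labels_ok = bool(sample_df["passed"].isin([0, 1]).all())

print(sample_df.shape)
print(missing, labels_ok)
```

The same two checks can be run on the real df after reading students_data.csv.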

Extract the **independent** variables (X) and the **dependent** variable (y) from the dataset.

X = df[['school_hrs', 'self_hrs', 'tution_hrs']]
y = df['passed']

## Building a Logistic Regression Model

The next step is to apply **train_test_split**. In this example, the test size is set to 0.25, so the model is tested on 25% of the dataset and trained on the remaining 75%.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
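The effect of test_size and random_state can be checked without the real dataset. The following sketch splits 40 placeholder row ids (standing in for the 40 observations) and shows that the split sizes are 30 and 10 and that the same random_state reproduces the same test set. The names `row_ids`, `train_ids` and `test_ids` are assumptions of this example.

```python
from sklearn.model_selection import train_test_split

# 40 placeholder row ids standing in for the 40 observations.
row_ids = list(range(40))

train_ids, test_ids = train_test_split(row_ids, test_size=0.25, random_state=0)
print(len(train_ids), len(test_ids))

# Repeating the call with the same random_state gives the same test rows.
_, test_ids_again = train_test_split(row_ids, test_size=0.25, random_state=0)
print(test_ids == test_ids_again)
```

Fixing random_state is what makes the tutorial's results repeatable from run to run.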

At this stage, you are ready to create your **logistic regression model**. You can do this using the LogisticRegression class you imported at the beginning.

logistic_regression = LogisticRegression()

## Training the Logistic Regression Model

Once the model is defined, you can **fit it to your data**. To do so, call the fit method on the model with the training data.

logistic_regression.fit(X_train, y_train)

## Making Predictions With Logistic Regression Model

Once **model training** is complete, it's time to make predictions on the test data using the model. You can store the predicted values in the **y_pred** variable.

y_pred = logistic_regression.predict(X_test)
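The predict method returns class labels, but a fitted LogisticRegression also exposes class probabilities through predict_proba. The sketch below, which runs on synthetic data rather than students_data.csv, illustrates that for a binary problem the predicted label corresponds to thresholding the probability of class 1 at 0.5. The data-generating weights and the names `X_demo`, `y_demo` and `model` are assumptions of this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data for illustration only: 200 rows, 3 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
noise = rng.normal(scale=0.5, size=200)
y_demo = (X_demo @ np.array([1.0, 0.5, 0.8]) + noise > 0).astype(int)

model = LogisticRegression()
model.fit(X_demo, y_demo)

# Column 1 of predict_proba holds the probability of class 1.
proba_pass = model.predict_proba(X_demo)[:, 1]
labels = model.predict(X_demo)

# Thresholding the class-1 probability at 0.5 reproduces predict().
thresholded = (proba_pass > 0.5).astype(int)
print(np.array_equal(labels, thresholded))
```

Looking at the probabilities can be useful when a record sits close to the 0.5 boundary and the hard label alone hides that uncertainty.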

To check the results later, print **X_test** and **y_pred**.

print(X_test)

school_hrs self_hrs tution_hrs
22 550 230 400
20 620 330 200
25 670 330 600
4 680 390 400
10 610 270 300
15 610 300 100
28 650 370 600
11 690 370 500
18 540 270 200
29 660 330 500

Then print **y_pred**.

print(y_pred)

[0 0 1 1 0 0 1 1 0 1]

The "students_data.csv" file has **40 observations**. Since the test size was set to 0.25, the predictions cover **10 records** (40 * 0.25 = 10), where 1 = passed and 0 = failed. Next, use the code below to get the **Confusion Matrix**:

## Confusion Matrix

confusion_matrix = pd.crosstab(y_test, y_pred, rownames=['Actual'], colnames=['Predicted'])
sn.heatmap(confusion_matrix, annot=True)

The **Confusion Matrix** above shows:

- True Positives (TP) = 4
- True Negatives (TN) = 4
- False Positives (FP) = 1
- False Negatives (FN) = 1
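The four counts above can also be derived by comparing actual and predicted labels pair by pair. In the sketch below, the predictions are the ones printed earlier, while the actual labels are hypothetical: the article does not list y_test, so `y_true_demo` was made up to be consistent with the reported counts, and the positions of the two errors are an assumption.

```python
# Predictions as printed in the article.
y_pred_demo = [0, 0, 1, 1, 0, 0, 1, 1, 0, 1]
# Hypothetical actual labels, made up to match the reported counts.
y_true_demo = [0, 1, 1, 1, 0, 0, 1, 0, 0, 1]


def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP and FN for a binary classification result."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # predicted pass, actually passed
            else:
                fp += 1  # predicted pass, actually failed
        else:
            if actual == positive:
                fn += 1  # predicted fail, actually passed
            else:
                tn += 1  # predicted fail, actually failed
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}


counts = confusion_counts(y_true_demo, y_pred_demo)
print(counts)  # {'TP': 4, 'TN': 4, 'FP': 1, 'FN': 1}
```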

From the above data, you can calculate the Accuracy manually:

**Accuracy = (TP+TN)/Total = (4+4)/10 = 0.8**

**Accuracy Percentage = 100 * 0.8 = 80%**
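The same arithmetic can be written as a small helper. This is only a sketch of the formula above; the function name `accuracy_from_counts` is an assumption of this example and not part of scikit-learn.

```python
def accuracy_from_counts(tp, tn, fp, fn):
    """Accuracy = correct predictions / all predictions."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("no predictions to score")
    return (tp + tn) / total


# Counts taken from the confusion matrix above.
acc = accuracy_from_counts(tp=4, tn=4, fp=1, fn=1)
print(acc, f"{100 * acc:.1f}%")  # 0.8 80.0%
```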

## Finding Accuracy using Python

You can compute the accuracy of your model in order to evaluate its performance. For this, you can use the accuracy_score function from scikit-learn's metrics module (imported with "from sklearn import metrics"), as shown below:

accuracy = metrics.accuracy_score(y_test, y_pred)
accuracy_percentage = 100 * accuracy

print('Accuracy : ', accuracy)
print("Accuracy Percentage (%) : ", accuracy_percentage)

Accuracy : 0.8
Accuracy Percentage (%) : 80.0

The result shows that the accuracy of the model is **80.0%**. Accuracy here means the number of correct predictions divided by the total number of predictions.

**Full Source | Python**

import pandas as pd
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import seaborn as sn
df = pd.read_csv("students_data.csv")
X = df[['school_hrs', 'self_hrs', 'tution_hrs']]
y = df['passed']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
logistic_regression = LogisticRegression()
logistic_regression.fit(X_train, y_train)
y_pred = logistic_regression.predict(X_test)
print(X_test)
print(y_pred)
confusion_matrix = pd.crosstab(y_test, y_pred, rownames=['Actual'], colnames=['Predicted'])
sn.heatmap(confusion_matrix, annot=True)
accuracy = metrics.accuracy_score(y_test, y_pred)
accuracy_percentage = 100 * accuracy
print('Accuracy : ', accuracy)
print("Accuracy Percentage (%) : ", accuracy_percentage)

Accuracy : 0.8
Accuracy Percentage (%) : 80.0

## Checking the Prediction

The **confusion matrix** displayed the results for **10 records** (40 * 0.25). These are the 10 test records:

print(X_test)

school_hrs self_hrs tution_hrs
22 550 230 400
20 620 330 200
25 670 330 600
4 680 390 400
10 610 270 300
15 610 300 100
28 650 370 600
11 690 370 500
18 540 270 200
29 660 330 500

The predictions were also made for those **10 records** (where 1 = passed and 0 = failed):

print(y_pred)

[0 0 1 1 0 0 1 1 0 1]

Comparing these predictions with the actual values in y_test, you can confirm that the results are correct for **8 out of 10** records, which matches the **accuracy level of 80%**.
