Logistic Regression | Python
Classification is an area of supervised machine learning that tries to predict which class or category an entity belongs to, based on its features. Approximately 65% of problems in Data Science are classification problems. Logistic Regression is a Machine Learning classification algorithm used to predict discrete values such as 0 or 1, Spam or Not Spam, etc. This article implements a Logistic Regression model using Python and scikit-learn. It uses a "students_data.csv" dataset to predict whether a given student will pass or fail an exam based on three study-related features.
About the Dataset
The students_data.csv dataset has three features: school_hrs, self_hrs and tution_hrs. school_hrs indicates how many hours per year the student studies at school, self_hrs indicates how many hours per year the student studies at home, and tution_hrs indicates how many hours per year the student spends in private tuition classes. Apart from these three features, the dataset has one label named "passed". This label has two values, 1 or 0: a value of 1 indicates pass and a value of 0 indicates fail. The file is meant for testing purposes only; you can download it from here: students_data.csv.
Logistic Regression example | Python
Here, you can build a Logistic Regression model using:
- The dependent variable "passed", which represents whether a student passed or not.
- The 3 independent variables school_hrs, self_hrs and tution_hrs.
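To make the link between the three independent variables and the 0/1 label concrete, here is a minimal sketch of what a logistic regression model computes: a weighted sum of the features passed through the sigmoid function, then thresholded at 0.5. The weights, the intercept and the helper name predict_pass_probability are hypothetical values chosen for illustration only; a trained model learns its own values from the data.

```python
import math


def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weights and intercept, for illustration only.
# A trained model would learn these values from the data.
weights = {"school_hrs": 0.01, "self_hrs": 0.02, "tution_hrs": 0.005}
intercept = -15.0


def predict_pass_probability(student):
    """Return the modelled probability that a student passes."""
    z = intercept + sum(weights[name] * student[name] for name in weights)
    return sigmoid(z)


student = {"school_hrs": 670, "self_hrs": 330, "tution_hrs": 600}
probability = predict_pass_probability(student)

# A probability of 0.5 or more is mapped to class 1 (pass), otherwise 0 (fail).
predicted_class = 1 if probability >= 0.5 else 0
print(probability, predicted_class)
```

The sigmoid keeps the output between 0 and 1, which is why it can be read as a probability before being turned into a discrete class.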

Import Python packages
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
import seaborn as sn
Importing the Data Set into Python Script
Here, you read in the dataset using the Pandas read_csv() function.
df = pd.read_csv("students_data.csv")
Extract the dependent variable (y) and the independent variables (X) from the dataset.
X = df[['school_hrs', 'self_hrs', 'tution_hrs']]
y = df['passed']
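If the CSV file is not at hand, the loading and column-selection steps can be tried on an in-memory stand-in. The sketch below assumes only the column layout described above; the four rows are made up and are not taken from the real students_data.csv.

```python
import io

import pandas as pd

# Hypothetical rows that only mimic the layout of students_data.csv;
# they are not taken from the real file.
csv_text = """school_hrs,self_hrs,tution_hrs,passed
500,200,100,0
600,300,300,0
650,350,500,1
700,400,600,1
"""

# read_csv accepts a file-like object as well as a path
df = pd.read_csv(io.StringIO(csv_text))

X = df[["school_hrs", "self_hrs", "tution_hrs"]]  # features: 2-D DataFrame
y = df["passed"]                                  # label: 1-D Series

print(X.shape, y.shape)
print(y.value_counts())
```

Checking the shapes and the class counts right after loading is a cheap way to catch a misspelled column name or a dataset that contains only one class.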
Building a Logistic Regression Model
The next step is to apply train_test_split. In this example, you set the test size to 0.25, so the model is tested on 25% of the dataset and trained on the remaining 75%.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
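The effect of test_size and random_state can be checked on a synthetic stand-in. The sketch below assumes a 40-row dataset with the same column names; the values and the pass/fail rule are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a 40-row dataset (values are made up).
idx = np.arange(40)
X = pd.DataFrame({
    "school_hrs": 500 + 5 * idx,
    "self_hrs": 200 + 5 * idx,
    "tution_hrs": 100 + 10 * (idx % 7),
})
y = pd.Series((idx >= 20).astype(int), name="passed")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
print(X_train.shape, X_test.shape)  # 30 training rows, 10 test rows

# The same random_state reproduces exactly the same split.
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X, y, test_size=0.25, random_state=0
)
print(X_test.index.equals(X_test2.index))
```

Fixing random_state makes the split repeatable, which is what allows the same 10 test records to be inspected again later.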
At this stage, you are ready to create your logistic regression model. You can do this using the LogisticRegression class you imported at the beginning.
logistic_regression = LogisticRegression()
Training the Logistic Regression Model
Once the model is defined, you can fit it to your data. To do so, call the fit method on the model with the training data.
logistic_regression.fit(X_train, y_train)
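What fit produces can be inspected through the coef_ and intercept_ attributes of the trained model. The sketch below uses synthetic stand-in data (not the real file), and max_iter=1000 is an assumption added here to give the solver room to converge on unscaled hour values; the article itself uses the defaults.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (values and pass/fail rule are made up).
idx = np.arange(40)
X = pd.DataFrame({
    "school_hrs": 500 + 5 * idx,
    "self_hrs": 200 + 5 * idx,
    "tution_hrs": 100 + 10 * (idx % 7),
})
y = pd.Series((idx >= 20).astype(int), name="passed")

# max_iter is raised from its default as a precaution (an assumption,
# not part of the original example).
model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# One learned weight per feature, plus a single intercept.
for name, weight in zip(X.columns, model.coef_[0]):
    print(name, weight)
print("intercept", model.intercept_[0])
print("classes", model.classes_)
```

For a binary problem, coef_ has one row with one entry per feature, so its shape here is (1, 3).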
Making Predictions With Logistic Regression Model
Once model training is complete, it's time to make predictions using the model. You can store the predicted values in the y_pred variable.
y_pred = logistic_regression.predict(X_test)
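The 0/1 values returned by predict come from class probabilities, which predict_proba exposes directly. The following sketch, again on synthetic stand-in data with an assumed max_iter=1000, shows that thresholding the probability of class 1 at 0.5 gives the same labels as predict.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: 40 students, three hour-based features.
idx = np.arange(40)
X = np.column_stack([500 + 5 * idx, 200 + 5 * idx, 100 + 10 * (idx % 7)])
y = (idx >= 20).astype(int)  # made-up rule: the upper half passes

model = LogisticRegression(max_iter=1000).fit(X, y)

proba = model.predict_proba(X)  # shape (40, 2): P(fail), P(pass) per row
labels = model.predict(X)       # hard 0/1 predictions

# Thresholding P(pass) at 0.5 gives the same labels as predict().
labels_from_proba = (proba[:, 1] > 0.5).astype(int)
print(proba[:3])
print(np.array_equal(labels, labels_from_proba))
```

Looking at the probabilities is useful when a prediction is close to the 0.5 boundary and a hard label alone would hide that uncertainty.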
To check the results later, print X_test and y_pred.
print(X_test)
    school_hrs  self_hrs  tution_hrs
22         550       230         400
20         620       330         200
25         670       330         600
4          680       390         400
10         610       270         300
15         610       300         100
28         650       370         600
11         690       370         500
18         540       270         200
29         660       330         500
Print y_pred.
print(y_pred)
[0 0 1 1 0 0 1 1 0 1]
The "students_data.csv" file has 40 observations. Since you set the test size to 0.25, the prediction displays results for 10 records (40*0.25=10), where 1 = passed and 0 = failed.
Confusion Matrix
Use the code below to get the Confusion Matrix:
confusion_matrix = pd.crosstab(y_test, y_pred, rownames=['Actual'], colnames=['Predicted'])
sn.heatmap(confusion_matrix, annot=True)

The Confusion Matrix above shows:
- True Positives (TP) = 4
- True Negatives (TN) = 4
- False Positives (FP) = 1
- False Negatives (FN) = 1
From the above data, you can calculate the Accuracy manually:
Accuracy = (TP+TN)/Total = (4+4)/10 = 0.8
Accuracy Percentage = 100 * 0.8 = 80%
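The same arithmetic can be read off the crosstab in code. In the sketch below, y_pred is the prediction array printed earlier, while y_test is a hypothetical set of actual labels, chosen only so that the counts match the matrix above (TP = 4, TN = 4, FP = 1, FN = 1); the real labels depend on the dataset.

```python
import pandas as pd

# Predictions as printed earlier in the article.
y_pred = pd.Series([0, 0, 1, 1, 0, 0, 1, 1, 0, 1], name="Predicted")

# Hypothetical actual labels, chosen only to reproduce the counts
# TP=4, TN=4, FP=1, FN=1; they are not the real test labels.
y_test = pd.Series([0, 1, 1, 1, 0, 0, 1, 1, 0, 0], name="Actual")

cm = pd.crosstab(y_test, y_pred)  # rows: actual, columns: predicted
print(cm)

tn, fp = cm.loc[0, 0], cm.loc[0, 1]  # actual 0: predicted 0 / predicted 1
fn, tp = cm.loc[1, 0], cm.loc[1, 1]  # actual 1: predicted 0 / predicted 1

accuracy = (tp + tn) / cm.values.sum()
print(tp, tn, fp, fn, accuracy)
```

The correct predictions sit on the diagonal of the matrix, so accuracy is the diagonal total divided by the sum of all cells.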
Finding Accuracy using Python
You can find the accuracy of your model in order to evaluate its performance. For this, you can use the accuracy_score function of scikit-learn's metrics module (imported with from sklearn import metrics), as shown below:
accuracy = metrics.accuracy_score(y_test, y_pred)
accuracy_percentage = 100 * accuracy
print('Accuracy : ', accuracy)
print("Accuracy Percentage (%) : ", accuracy_percentage)
Accuracy : 0.8
Accuracy Percentage (%) : 80.0
The result shows that the accuracy of the model is 80.0%. Accuracy here means the number of correct predictions divided by the total number of predictions.
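That definition can be verified directly. The sketch below compares accuracy_score with a manual count of matching labels; y_test is again a hypothetical set of actual labels with 8 matches out of 10, not the real test labels.

```python
import numpy as np
from sklearn import metrics

y_pred = np.array([0, 0, 1, 1, 0, 0, 1, 1, 0, 1])  # as printed in the article
y_test = np.array([0, 1, 1, 1, 0, 0, 1, 1, 0, 0])  # hypothetical actual labels

accuracy = metrics.accuracy_score(y_test, y_pred)

# normalize=False returns the number of correct predictions
# instead of the fraction.
n_correct = metrics.accuracy_score(y_test, y_pred, normalize=False)

# Manual version: fraction of positions where the labels agree.
manual_accuracy = np.mean(y_test == y_pred)

print(accuracy, n_correct, manual_accuracy)
```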
Full Source | Python
import pandas as pd
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import seaborn as sn
# Load the dataset and separate the features (X) from the label (y)
df = pd.read_csv("students_data.csv")
X = df[['school_hrs', 'self_hrs', 'tution_hrs']]
y = df['passed']
# Hold out 25% of the records for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
# Create and train the model, then predict on the test set
logistic_regression = LogisticRegression()
logistic_regression.fit(X_train, y_train)
y_pred = logistic_regression.predict(X_test)
print(X_test)
print(y_pred)
# Confusion matrix of actual vs. predicted labels
confusion_matrix = pd.crosstab(y_test, y_pred, rownames=['Actual'], colnames=['Predicted'])
sn.heatmap(confusion_matrix, annot=True)
# Accuracy of the predictions
accuracy = metrics.accuracy_score(y_test, y_pred)
accuracy_percentage = 100 * accuracy
print('Accuracy : ', accuracy)
print("Accuracy Percentage (%) : ", accuracy_percentage)
Accuracy : 0.8
Accuracy Percentage (%) : 80.0
Checking the Prediction
The confusion matrix displayed the results for 10 records (40*0.25). These are the 10 test records:
print(X_test)
    school_hrs  self_hrs  tution_hrs
22         550       230         400
20         620       330         200
25         670       330         600
4          680       390         400
10         610       270         300
15         610       300         100
28         650       370         600
11         690       370         500
18         540       270         200
29         660       330         500
The prediction was also made for those 10 records (where 1 = passed, while 0 = failed):
print(y_pred)
[0 0 1 1 0 0 1 1 0 1]
From the following table, you can confirm that 8 out of 10 predictions were correct.

From the above table, you can also confirm that the result matches the accuracy level of 80%.
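A side-by-side table of this kind can be built with pandas. The sketch below is one possible way to do it; the "actual" column holds hypothetical labels (the real values come from y_test), so only the method, not the individual rows, should be relied on.

```python
import pandas as pd

comparison = pd.DataFrame({
    # Hypothetical actual labels; in practice use y_test.values.
    "actual": [0, 1, 1, 1, 0, 0, 1, 1, 0, 0],
    # Predictions as printed in the article.
    "predicted": [0, 0, 1, 1, 0, 0, 1, 1, 0, 1],
})

# Flag each row where the prediction agrees with the actual label.
comparison["correct"] = comparison["actual"] == comparison["predicted"]
print(comparison)

n_correct = int(comparison["correct"].sum())
print(n_correct, "correct out of", len(comparison))
```

With the real test labels, the number of True values in the "correct" column divided by the number of rows gives the accuracy reported earlier.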