# Logistic Regression | Python

Classification is a core task in supervised machine learning, where the goal is to predict the class or category of an entity based on its features. Logistic Regression is a widely used classification algorithm for predicting discrete values, such as binary outcomes like 0 or 1, or Spam or Not spam. It is a fundamental tool in Data Science, since a significant portion of the problems encountered in real-world applications are classification problems.

This article implements a Logistic Regression model using Python and scikit-learn. Using the "students_data.csv" dataset, it predicts whether a given student will pass or fail an exam based on three study-related features.

The students_data.csv dataset has three features—namely, school_hrs, self_hrs and tution_hrs. The school_hrs indicates how many hours per year the student studies at school, self_hrs indicates how many hours per year the student studies at home, and tution_hrs indicates how many hours per year the student is taking private tuition classes.

Apart from these three features, there is one label in the dataset named "passed". This label has two values, 1 or 0. A value of 1 indicates pass and a value of 0 indicates fail.

The file is meant for testing purposes only. You can download it from here: students_data.csv.
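If you do not have the file at hand, a stand-in with the same column names can be generated to follow along. This is only a sketch: the value ranges and the pass/fail rule below are invented for illustration and are not taken from the real students_data.csv, so any results you get with it will differ from the ones shown in this article.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for students_data.csv: same column names, invented values.
rng = np.random.default_rng(0)
n_students = 40

df = pd.DataFrame({
    "school_hrs": rng.integers(500, 700, n_students),
    "self_hrs": rng.integers(200, 400, n_students),
    "tution_hrs": rng.integers(100, 700, n_students),
})

# Invented labelling rule (an assumption, not from the real dataset):
# a student passes when total study hours exceed the median total.
total_hrs = df.sum(axis=1)
df["passed"] = (total_hrs > total_hrs.median()).astype(int)

print(df.head())
print(df["passed"].value_counts())

# Uncomment to save a file that pd.read_csv can load later:
# df.to_csv("students_data_synthetic.csv", index=False)
```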

## Logistic Regression example | Python

Here, you can build a Logistic Regression model using:

1. The dependent variable "passed", which represents whether a student passed or not.
2. The three independent variables school_hrs, self_hrs and tution_hrs.

## Import Python packages

import pandas as pd
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import seaborn as sn

## Importing the Data Set into Python Script

First, read in the dataset using the pandas read_csv() function.

df = pd.read_csv("students_data.csv")

Then extract the dependent variable (y) and the independent variables (X) from the dataset.

X = df[['school_hrs', 'self_hrs', 'tution_hrs']]
y = df['passed']

## Building a Logistic Regression Model

The next step is to apply train_test_split. In this example, the test size is set to 0.25, so the model is tested on 25% of the dataset and trained on the remaining 75%.

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=0)
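To see what this call does to a 40-row dataset, the sketch below splits placeholder arrays (not the real student data) and prints the resulting sizes. It also shows the optional stratify argument of train_test_split, which keeps the class ratio the same in both parts. The article's own code does not use stratify; it is included here only as a variation worth knowing about for small datasets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 40 rows, 2 features, perfectly balanced labels.
X = np.arange(80).reshape(40, 2)
y = np.array([0, 1] * 20)

# Same settings as in the article: 25% test, fixed seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
print(X_train.shape, X_test.shape)

# Calling it again with the same random_state gives the same split.
_, X_test_again, _, _ = train_test_split(X, y, test_size=0.25, random_state=0)
print(np.array_equal(X_test, X_test_again))

# Optional variation: stratify=y preserves the 50/50 class ratio in the test set.
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
print(np.bincount(y_test_strat))
```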

At this stage, you are ready to create your logistic regression model. You can do this using the LogisticRegression class you imported at the beginning.

logistic_regression= LogisticRegression()

## Training the Logistic Regression Model

Once the model is defined, you can fit it to your data. To do so, call the fit method on the model with the training data.

logistic_regression.fit(X_train,y_train)
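After fitting, the learned weights are available through the coef_ and intercept_ attributes of the model. The sketch below shows how you might inspect them. It makes two assumptions that are not part of the article: the data is a synthetic stand-in with invented values, and the features are standardized with StandardScaler first, which often helps the solver converge when features are measured in hundreds of hours.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (invented values and labelling rule).
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "school_hrs": rng.integers(500, 700, 40),
    "self_hrs": rng.integers(200, 400, 40),
    "tution_hrs": rng.integers(100, 700, 40),
})
total_hrs = X.sum(axis=1)
y = (total_hrs > total_hrs.median()).astype(int)

# Standardize the features, then fit the logistic regression.
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
pipeline.fit(X, y)

# make_pipeline names each step after its lowercased class name.
clf = pipeline.named_steps["logisticregression"]
for name, weight in zip(X.columns, clf.coef_[0]):
    print(f"{name}: {weight:+.3f}")
print("intercept:", clf.intercept_[0])
```

Each weight describes how the log-odds of passing change per standard deviation of that feature, so larger absolute values indicate features the model relies on more.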

## Making Predictions With Logistic Regression Model

Once model training is complete, it's time to make predictions with the model. You can store the predicted values in the y_pred variable.

y_pred=logistic_regression.predict(X_test)
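Behind predict, the model computes a probability for each record and picks the class on the larger side of 0.5. The following sketch, run on randomly generated placeholder data rather than the student dataset, illustrates that relationship for a binary model with default settings: predict_proba gives the probabilities, they match the sigmoid of decision_function, and thresholding them at 0.5 reproduces predict.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder data: 40 rows, 3 features, noisy linear labelling rule (invented).
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = (X.sum(axis=1) + rng.normal(scale=0.5, size=40) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = LogisticRegression().fit(X_train, y_train)

# Column 1 of predict_proba is the estimated probability of class 1.
proba_pass = model.predict_proba(X_test)[:, 1]

# The same probability, computed as the sigmoid of the linear score.
scores = model.decision_function(X_test)
sigmoid_of_scores = 1.0 / (1.0 + np.exp(-scores))

# Thresholding the probability at 0.5 gives the predicted class.
y_pred = model.predict(X_test)
thresholded = (proba_pass > 0.5).astype(int)

print(np.round(proba_pass, 3))
print(y_pred)
```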

To check the predictions later, print X_test and y_pred.

print (X_test)
    school_hrs  self_hrs  tution_hrs
22         550       230         400
20         620       330         200
25         670       330         600
4          680       390         400
10         610       270         300
15         610       300         100
28         650       370         600
11         690       370         500
18         540       270         200
29         660       330         500

Print y_pred.

print (y_pred)
[0 0 1 1 0 0 1 1 0 1]

The "students_data.csv" dataset has 40 observations. Since the test size is set to 0.25, the predictions cover 10 records (40 * 0.25 = 10), where 1 = passed and 0 = failed.

## Confusion Matrix

Then, use the code below to get the confusion matrix:

confusion_matrix = pd.crosstab(y_test, y_pred, rownames=['Actual'], colnames=['Predicted'])
sn.heatmap(confusion_matrix, annot=True)

The confusion matrix above shows:

1. True Positives (TP) = 4
2. True Negatives (TN) = 4
3. False Positives (FP) = 1
4. False Negatives (FN) = 1

From the above data, you can calculate the Accuracy manually:

Accuracy = (TP+TN)/Total = (4+4)/10 = 0.8

Accuracy Percentage = 100 * 0.8 = 80%
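The same counts can be read out programmatically with scikit-learn's confusion_matrix function instead of from the heatmap. In the sketch below, y_pred is the prediction array printed earlier, but y_true is hypothetical: the actual y_test values are not listed in this article, so the labels were chosen only so that they reproduce the counts above (TP = 4, TN = 4, FP = 1, FN = 1).

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Predictions as printed in the article.
y_pred = np.array([0, 0, 1, 1, 0, 0, 1, 1, 0, 1])

# HYPOTHETICAL actual labels, picked to match the reported counts;
# the real y_test may place the two errors on different records.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 1, 0, 0])

# For binary labels [0, 1], ravel() returns the counts in this order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)
print("Accuracy:", accuracy)
```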

## Finding Accuracy using Python

You can compute the accuracy of your model in order to evaluate its performance. For this, you can use the accuracy_score function from scikit-learn's metrics module (imported with from sklearn import metrics), as shown below:

accuracy = metrics.accuracy_score(y_test, y_pred)
accuracy_percentage = 100 * accuracy
print('Accuracy : ', accuracy)
print("Accuracy Percentage (%) : ", accuracy_percentage)

Accuracy : 0.8
Accuracy Percentage (%) : 80.0

The result shows that the accuracy of the model is 80.0%. Accuracy here means the number of correct predictions divided by the total number of predictions.
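As a quick check of that definition, the sketch below computes accuracy three ways on randomly generated placeholder data (not the student dataset): by hand as the fraction of matching predictions, with metrics.accuracy_score, and with the classifier's own score method, which reports mean accuracy. All three should agree.

```python
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder data with an invented, noisy labelling rule.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
y = (X.sum(axis=1) + rng.normal(scale=0.5, size=40) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = LogisticRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# 1. By hand: correct predictions divided by total predictions.
by_hand = np.mean(y_pred == y_test)

# 2. Using scikit-learn's metrics module.
via_metrics = metrics.accuracy_score(y_test, y_pred)

# 3. Using the classifier's score method (mean accuracy).
via_score = model.score(X_test, y_test)

print(by_hand, via_metrics, via_score)
```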

## Full Source | Python

import pandas as pd
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import seaborn as sn

df = pd.read_csv("students_data.csv")
X = df[['school_hrs', 'self_hrs', 'tution_hrs']]
y = df['passed']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

logistic_regression = LogisticRegression()
logistic_regression.fit(X_train, y_train)
y_pred = logistic_regression.predict(X_test)

print(X_test)
print(y_pred)

confusion_matrix = pd.crosstab(y_test, y_pred, rownames=['Actual'], colnames=['Predicted'])
sn.heatmap(confusion_matrix, annot=True)

accuracy = metrics.accuracy_score(y_test, y_pred)
accuracy_percentage = 100 * accuracy
print('Accuracy : ', accuracy)
print("Accuracy Percentage (%) : ", accuracy_percentage)

Accuracy : 0.8
Accuracy Percentage (%) : 80.0

## Checking the Prediction

The confusion matrix displayed the results for 10 records (40*0.25). These are the 10 test records:

print (X_test)
    school_hrs  self_hrs  tution_hrs
22         550       230         400
20         620       330         200
25         670       330         600
4          680       390         400
10         610       270         300
15         610       300         100
28         650       370         600
11         690       370         500
18         540       270         200
29         660       330         500

The prediction was also made for those 10 records (where 1 = passed, while 0 = failed):

print (y_pred)
[0 0 1 1 0 0 1 1 0 1]

Comparing these predictions record by record with the actual values in y_test, you can confirm that 8 out of 10 results are correct, which matches the accuracy level of 80%.
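One way to build such a side-by-side comparison yourself is to add the actual and predicted labels as columns next to the test features. The sketch below does this on a synthetic stand-in dataset with invented values, so its rows and its accuracy will not match the numbers in this article. The column names actual, predicted and correct are illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for students_data.csv (invented values and labels).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "school_hrs": rng.integers(500, 700, 40),
    "self_hrs": rng.integers(200, 400, 40),
    "tution_hrs": rng.integers(100, 700, 40),
})
total_hrs = df.sum(axis=1)
df["passed"] = (total_hrs > total_hrs.median()).astype(int)

X = df[["school_hrs", "self_hrs", "tution_hrs"]]
y = df["passed"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# max_iter is raised as a precaution because the features are unscaled.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Put features, actual labels and predictions side by side.
comparison = X_test.copy()
comparison["actual"] = y_test
comparison["predicted"] = model.predict(X_test)
comparison["correct"] = comparison["actual"] == comparison["predicted"]

print(comparison)
print(comparison["correct"].sum(), "of", len(comparison), "predictions correct")
```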

## Conclusion

Logistic Regression is a popular and powerful classification algorithm in supervised machine learning. It is used to predict binary outcomes and assign probabilities to discrete categories, making it well-suited for various applications such as spam detection, medical diagnosis, and customer churn prediction.