Ordinary Least Squares Regression | Python

Machine Learning (ML) develops algorithms (models) that predict an output value within an acceptable error margin from a set of known input parameters. Ordinary Least Squares (OLS) is a widely used regression technique that falls under supervised learning. It estimates the unknown parameters of a linear model by minimizing the sum of the squared errors between the observed data and the model's predictions. In other words, given a regression line through the data, you compute the distance from each data point to the line, square it, and sum all of the squared errors together. This sum is the quantity that ordinary least squares minimizes.
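The idea above can be sketched in a few lines of NumPy. For a univariate line y = a + b·x, the minimizing slope and intercept have a well-known closed form; the data here is made up purely for illustration:

```python
import numpy as np

# Hypothetical univariate data (made up for illustration).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Closed-form OLS for the line y = a + b*x:
# slope b = cov(x, y) / var(x), intercept a = mean(y) - b*mean(x).
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

# The quantity OLS minimizes: the sum of squared residuals.
residuals = y - (a + b * x)
sse = np.sum(residuals ** 2)
print(b, a, sse)
```

Any other choice of slope or intercept yields a larger sum of squared residuals, which is exactly what "least squares" means.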
The OLS method works for both univariate datasets (a single independent variable and a single dependent variable) and multivariate datasets (multiple independent variables and a single dependent variable). An example of a scenario in which one may use OLS is predicting food price from a dataset that includes food quality and service quality.
Ordinary Least Squares Example

Consider the restaurant dataset: restaurants.csv. A restaurant guide collects several variables from a group of restaurants in a city. The description of the variables is given below:
| Variable | Description |
| --- | --- |
| Food_Quality | Measure of food quality, in points |
| Service_Quality | Measure of service quality, in points |
| Price | Price of the meal |
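If the CSV is not at hand, a small made-up stand-in with the same columns lets you follow the snippets below (the values are illustrative only, not the real data):

```python
import pandas

# Made-up sample with the same schema as restaurants.csv.
df = pandas.DataFrame({
    "Food_Quality":    [19, 22, 17, 24, 20, 16],
    "Service_Quality": [18, 21, 16, 23, 19, 15],
    "Price":           [38, 46, 33, 51, 41, 30],
})
print(df.head())
```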
Loading required Python packages

```python
import pandas
import statsmodels.api as sm
```
Importing the dataset

The Python Pandas module lets you read CSV files and returns a DataFrame object. The file is meant for testing purposes only; you can download it here: restaurants.csv.
```python
df = pandas.read_csv("restaurants.csv")
```
From the restaurants.csv dataset, use the price of the meal ('Price') as the response Y and the measure of food quality ('Food_Quality') as the predictor X.
```python
X = df['Food_Quality']
Y = df['Price']
```
Fit the Model

The sm.OLS class takes the dependent (Y) and independent (X) values as arguments, and its fit() method estimates the model parameters. Add a constant term with sm.add_constant() so that the intercept of your linear model is fitted.
```python
X = sm.add_constant(X)
model = sm.OLS(Y, X).fit()
```
Summary

The summary() method returns a table that gives an extensive description of the regression results.

Full Source | Python
```python
import pandas
import statsmodels.api as sm

df = pandas.read_csv("restaurants.csv")
X = df['Food_Quality']
Y = df['Price']
X = sm.add_constant(X)
model = sm.OLS(Y, X).fit()
summary = model.summary()
print(summary)
```
Description of some of the terms in the table:

- R-squared - a statistical measure of how well the regression line approximates the real data points.
- Adj. R-squared - R-squared adjusted for the number of independent variables in the model.
- F-statistic - the ratio of the variance explained by the model to the residual variance; it tests whether the model as a whole is significant.
- AIC - the Akaike information criterion; estimates the relative quality of statistical models for a given dataset.
- BIC - the Bayesian information criterion; used for model selection among a finite set of models.
- coef - the estimated coefficients of the independent variables and the constant term in the equation.
- std err - the standard error of the coefficient estimate.
- t - the t-statistic (the coefficient divided by its standard error); a measure of how statistically significant the coefficient is.
- P > |t| - the p-value for the null hypothesis that the coefficient equals zero; a small value indicates the coefficient is significantly different from zero.