R-Squared in Regression Analysis

After building a machine learning model, you need to determine how well the model fits the data. R-squared is a statistical measure of how close the data are to the fitted regression line: the proportion of variance in the dependent variable that is explained by the independent variable(s). In practice, it is a comparison of the residual sum of squares (SSres) with the total sum of squares (SStot). The residual for a data point is the difference between the actual value and the value predicted by your linear regression model; summing the squared residuals gives SSres. SStot is the sum of squared differences between the actual values and their mean (the "average model"). R-squared is then

R² = 1 - (SSres / SStot)

For a fitted linear regression model, R-squared is always between 0 and 100%:
- 0% indicates a low level of correlation: the model explains none of the variation in the dependent variable around its mean. This usually means the regression model is not useful, but not in all cases.
- 100% indicates that the two variables are perfectly correlated: the model explains all of the variation, leaving no unexplained variance.
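To put the metric in context, here is a minimal sketch that fits a simple linear regression with scikit-learn and reports its R-squared; the feature values and responses are made up for illustration, and LinearRegression's score() method returns exactly this metric.

import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: one feature, roughly linear response
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])

model = LinearRegression().fit(X, y)
predicted = model.predict(X)

# score() returns the R-squared of the fitted model on this data
print("R-Squared from score():", model.score(X, y))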
R-squared manual calculation
import numpy as np

# Manual calculation
actual = np.array([56, 45, 68, 49, 26, 40, 52, 38, 30, 48])
predicted = np.array([58, 42, 65, 47, 29, 46, 50, 33, 31, 47])

ssres = sum((actual - predicted) ** 2)          # residual sum of squares
sstot = sum((actual - np.mean(actual)) ** 2)    # total sum of squares
r2_m = 1 - (ssres / sstot)
print("R-Squared:", r2_m)
R-squared using sklearn.metrics
import numpy as np
import sklearn.metrics as metrics

actual = np.array([56, 45, 68, 49, 26, 40, 52, 38, 30, 48])
predicted = np.array([58, 42, 65, 47, 29, 46, 50, 33, 31, 47])

r2_sk = metrics.r2_score(actual, predicted)
print("R-Squared:", r2_sk)
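Both approaches compute the same quantity: for the arrays above, SSres works out to 102 and SStot to 1383.6, so each snippet prints an R-squared of about 0.926.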
Limitations of R-Squared:
R-squared does not indicate whether a regression model is adequate. You can have a low R-squared value for a good model, or a high R-squared value for a model that does not fit the data.
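As one illustration of this point, consider fitting a straight line to clearly curved data; the data below are made up (a noiseless quadratic relationship), yet the linear model still reports a high R-squared even though it is systematically wrong. Inspecting the residuals reveals the problem that the single number hides.

import numpy as np
from sklearn.linear_model import LinearRegression
import sklearn.metrics as metrics

# Made-up data: a purely quadratic relationship with no noise
x = np.arange(0, 11).reshape(-1, 1)
y = x.ravel() ** 2

# Fit a straight line to curved data
linear_model = LinearRegression().fit(x, y)
predicted = linear_model.predict(x)

# R-squared is high (roughly 0.93) even though the model is clearly misspecified
print("R-Squared:", metrics.r2_score(y, predicted))

# The residuals follow a systematic U-shaped pattern that R-squared does not reveal
print("Residuals:", y - predicted)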