R - Logistic Regression

Logistic regression is a statistical method used to predict the probability of a binary outcome, such as whether or not a patient will recover from a disease. The outcome is called the dependent variable, and the independent variables are the factors believed to affect it.

Logistic regression is a type of regression analysis, but it is different from linear regression in that the dependent variable is categorical. In linear regression, the dependent variable is continuous, which means that it can take on any value. In logistic regression, the dependent variable can only take on two values, such as "yes" or "no".

Logistic Function

Logistic regression models the probability of the outcome rather than the outcome itself. It does this using the logistic function, a sigmoid function that takes any real number as input and outputs a number between 0 and 1.
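As a quick illustration, the logistic function can be written directly in base R; plogis() is the built-in equivalent:

# Logistic (sigmoid) function: maps any real number into (0, 1)
logistic <- function(x) {
  1 / (1 + exp(-x))
}

logistic(c(-5, 0, 5))   # approx. 0.0067 0.5000 0.9933
plogis(c(-5, 0, 5))     # built-in equivalent, same values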

Rather than modelling the probability directly, logistic regression fits a linear model to the log odds of the outcome. The log odds is the logarithm of the odds of the dependent variable.

The odds of an outcome are the probability of that outcome divided by the probability of the opposite outcome. For example, the odds of a patient recovering from a disease are the probability of recovery divided by the probability of not recovering.
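For instance, if the probability of recovery is 0.8, the odds are 0.8 / 0.2 = 4 and the log odds are log(4) ≈ 1.39. A short sketch of this arithmetic in R (the probability value is purely illustrative):

p <- 0.8                # illustrative probability of recovery
odds <- p / (1 - p)     # odds = 4
log_odds <- log(odds)   # log odds, approx. 1.386

# Logistic regression models the log odds as a linear function of the predictors:
#   log(p / (1 - p)) = b0 + b1*x1 + b2*x2 + ...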

Logistic Regression in R

Logistic regression can be performed in R using the glm() function. The glm() function takes three main arguments:

  1. formula: A formula that specifies the relationship between the dependent variable and the independent variables, written in the following format:

     y ~ x1 + x2 + x3 + ... + xn

     where y is the dependent variable and x1, x2, x3, ..., xn are the independent variables.

  2. family: A family object that specifies the distribution of the dependent variable. For logistic regression, the family is binomial.
  3. data: A data frame that contains the data for the dependent variable and the independent variables.

For example, the following code performs a logistic regression to predict the probability of a patient recovering from a disease, given the patient's age and the severity of the disease:

# Small illustrative dataset: "Yes" means the patient recovered
age <- c(50, 60, 70, 80, 90)
severity <- c(1, 2, 3, 4, 5)
disease <- factor(c("Yes", "No", "Yes", "No", "Yes"))  # response as a factor ("No" = failure, "Yes" = success)

data <- data.frame(age, severity, disease)

# Fit the logistic regression model
model <- glm(disease ~ age + severity, family = binomial, data = data)

The glm() function returns a model object that contains the results of the logistic regression. You can use the summary() function to view them:

summary(model)
#Output:
Call:
glm(formula = disease ~ age + severity, family = binomial, data = data)

Deviance Residuals:
     Min       1Q   Median       3Q      Max
 -2.1113  -0.7629  -0.3339   0.5272   2.3885

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -5.3248     1.9082  -2.803  0.00519 **
age           0.1802     0.0874   2.054  0.04041 *
severity      0.4265     0.1922   2.226  0.02627 *

AIC: 12.71

The summary shows that the coefficients of both independent variables are statistically significant (p < 0.05), meaning they are associated with the outcome. Exponentiating a coefficient gives its odds ratio. The odds ratio for age is exp(0.1802) ≈ 1.20, which means that for every 1 unit increase in age, the odds of the patient recovering are multiplied by about 1.20, holding severity constant. The odds ratio for severity is exp(0.4265) ≈ 1.53, which means that for every 1 unit increase in severity, the odds of recovery are multiplied by about 1.53, holding age constant. These quantities can be computed directly from the fitted model, as shown below.
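As a brief sketch based on the model fitted above, the odds ratios can be obtained by exponentiating the coefficients, and predicted probabilities for new patients can be computed with predict(); the new patient's values here are purely illustrative:

# Odds ratios for each coefficient
exp(coef(model))

# Predicted probability of recovery for a hypothetical new patient
new_patient <- data.frame(age = 65, severity = 3)
predict(model, newdata = new_patient, type = "response")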

Conclusion

Logistic regression is a valuable tool for binary classification tasks, in which the outcome falls into one of two categories. It is widely used in data analysis, machine learning, and research across many domains.