Dr. O. Aly
Computer Science
Introduction
The purpose of this discussion is to analyze the assumptions of Logistic Regression and the assumptions of Regular Regression that are not applicable to Logistic Regression. The discussion and analysis also address the types of variables used in both Logistic Regression and Regular Regression.
Regular Linear Regression:
Regression analysis is used when a linear model is fit to the data to predict values of an outcome (dependent) variable from one or more predictor (independent) variables (Field, 2013). Linear Regression is also defined by Field (2013) as a method used to predict the values of continuous variables and to make inferences about how specific variables are related to a continuous variable. These two procedures of prediction and inference rely on information from the statistical model, which is represented by an equation or series of equations with some number of parameters (Fischetti, 2015). Linear Regression is the most important prediction method for continuous variables (Giudici, 2005).
With one predictor (independent) variable, the technique is sometimes referred to as "Simple Regression" (Field, 2013; Fischetti, 2015; Fischetti, Mayor, & Forte, 2017; Giudici, 2005). However, when there are several predictor (independent) variables in the model, it is referred to as "Multiple Regression" (Field, 2013; Fischetti, 2015; Fischetti et al., 2017; Giudici, 2005). In Regression Analysis, the differences between what the model predicts and the observed data are called "Residuals," which are the same as "Deviations" when looking at the Mean (Field, 2013). These deviations are the vertical distances between what the model predicted and each data point that was observed. Sometimes the predicted value of the outcome is less than the actual value, and sometimes it is greater, meaning that the residuals are sometimes positive and sometimes negative. To evaluate the error in a regression model, just as when the fit of the mean is assessed using the variance, a sum of squared errors can be computed: the residual sum of squares (SSR), also called the sum of squared residuals (Field, 2013). The SSR is an indicator of how well a particular line fits the data: if the SSR is large, the line is not representative of the data; if the SSR is small, the line is representative of the data (Field, 2013).
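To make the idea of residuals and the SSR concrete, the following minimal sketch fits a straight line to a few hypothetical data points and computes the sum of squared residuals. Python and the data values are choices of this illustration, not of the cited sources.

import numpy as np

# Hypothetical observed data: x is the predictor, y is the outcome.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Fit a straight line y_hat = b0 + b1*x by least squares.
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

# Residuals: vertical distances between observed and predicted values.
residuals = y - y_hat

# SSR (sum of squared residuals): a small value suggests the line is
# representative of the data; a large value suggests it is not.
ssr = np.sum(residuals ** 2)
print(b0, b1, ssr)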
When using Simple Linear Regression with two variables, one a predictor (independent) variable and the other an outcome (dependent) variable, the equation is as follows (Field, 2013):

Yi = (b Xi) + errori
In this Regression Model, (b) is the correlation coefficient (more often denoted as (r)), and it is a standardized measure (Field, 2013). However, an unstandardized measure of (b) can be used, in which case the equation alters to the following (Field, 2013):

Yi = (b0 + b1 Xi) + errori
This model differs from that of a correlation only in that it uses an unstandardized measure of the relationship (b1) and consequently must include a parameter (b0) for the value of the outcome when the predictor is zero (Field, 2013). These parameters (b0) and (b1) are known as the Regression Coefficients (Field, 2013).
When more than two variables might be related to the outcome, Multiple Regression can be used, with three, four, or more predictors (Field, 2013). The equation for Multiple Regression is as follows:

Yi = (b0 + b1 X1i + b2 X2i + ... + bn Xni) + errori
Here, (b1) is the coefficient of the first predictor (X1), (b2) is the coefficient of the second predictor (X2), and so forth, with (bn) the coefficient of the nth predictor (Xn) (Field, 2013).
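As a minimal sketch of these regression coefficients, the following example fits a multiple regression with two predictors and reports b0, b1, and b2. The simulated data and the use of the statsmodels library are assumptions of this illustration, not of the cited sources.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Hypothetical data: two predictors (X1, X2) and a continuous outcome (Y).
n = 100
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)
Y = 1.5 + 2.0 * X1 - 0.5 * X2 + rng.normal(scale=0.8, size=n)

# Add a constant column so the model estimates the intercept b0.
X = sm.add_constant(np.column_stack([X1, X2]))
model = sm.OLS(Y, X).fit()

# model.params holds b0 (const), b1, and b2 -- the regression coefficients.
print(model.params)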
To assess the goodness of fit of the Regular Regression, the sums of squares, R, and R2 can be used. When using the Mean as a model, the difference between the observed values and the values predicted by the mean can be calculated using the total sum of squares (denoted SST) (Field, 2013). This SST value represents how good the Mean is as a model of the observed data (Field, 2013). When using the Regression Model, the SSR represents the degree of inaccuracy that remains when the best model is fitted to the data (Field, 2013).
Moreover, these two sums of squares, SST and SSR, can be used to calculate how much better the regression model is than a baseline model such as the Mean model (Field, 2013). The improvement in prediction resulting from using the Regression Model rather than the Mean Model is measured by calculating the difference between SST and SSR (Field, 2013). This improvement is the Model Sum of Squares (SSM) (Field, 2013). If the value of SSM is large, the regression model is very different from using the mean to predict the outcome variable, indicating that the Regression Model has made a big improvement to how well the outcome variable can be predicted (Field, 2013). However, if the SSM is small, then using the Regression Model is only a little better than using the Mean model (Field, 2013). The R2 is calculated by dividing SSM by SST to measure the proportion of the improvement due to the model. The R2 represents the amount of variance in the outcome explained by the model (SSM) relative to how much variation there was to explain in the first place (SST) (Field, 2013). Other methods to assess the goodness of fit of the model include the F-test using Mean Squares (MS) and the F-statistic to test the significance of R2 (Field, 2013). To measure the individual contribution of a predictor in Regular Linear Regression, the estimated regression coefficients (b) and their standard errors are used to compute a t-statistic (Field, 2013).
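The following sketch, again on hypothetical simulated data, computes SST, SSR, SSM, and R2 by hand and compares them with the values reported by a fitted model; the data and library choice are assumptions of this illustration.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)

# Hypothetical data: one predictor and a continuous outcome.
x = rng.normal(size=50)
y = 3.0 + 1.2 * x + rng.normal(scale=1.0, size=50)

fit = sm.OLS(y, sm.add_constant(x)).fit()

# SST: total sum of squares -- the error of the mean model.
sst = np.sum((y - y.mean()) ** 2)
# SSR: residual sum of squares -- the error of the regression model.
ssr = np.sum(fit.resid ** 2)
# SSM: improvement of the regression model over the mean model.
ssm = sst - ssr
# R^2: proportion of the variation explained by the model.
r_squared = ssm / sst

print(r_squared, fit.rsquared)   # the two values should agree
print(fit.fvalue, fit.f_pvalue)  # F-statistic for the significance of R^2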
Generalization of the Regression Model is a critical additional step, because if the model cannot be generalized, then any conclusions based on the model must be restricted to the sample used (Field, 2013). For the regression model to generalize, cross-validation can be used (Field, 2013; Fischetti, 2015) and the underlying assumptions must be met (Field, 2013).
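A minimal cross-validation sketch might look as follows; the held-out R2 scores gauge how well the model predicts data beyond the sample used to fit it. The simulated data and the use of scikit-learn are assumptions of this illustration, not of the cited sources.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)

# Hypothetical data: three predictors and a continuous outcome.
X = rng.normal(size=(200, 3))
y = 1.0 + X @ np.array([0.5, -1.0, 2.0]) + rng.normal(scale=1.0, size=200)

# 5-fold cross-validation: the model is repeatedly fit on part of the data
# and scored (R^2) on the held-out part, as a check of generalization.
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print(scores.mean(), scores.std())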
Central Assumptions of Regular Linear Regression in Order of Importance
The assumptions of the Linear Model, in order of importance as indicated by Field (2013), are as follows:
- Additivity and Linearity: The outcome variable should be linearly related to any predictors, and with several predictors, their combined effect is best described by adding their effects together. Thus, the relationship between variables is linear. If this assumption is not met, the model is invalid. Sometimes, variables can be transformed to make their relationships linear (Field, 2013).
- Independent Errors: The residual terms should be uncorrelated (i.e., independent) for any two observations, a requirement sometimes described as a "lack of autocorrelation" (Field, 2013). If this assumption of independence is violated, the confidence intervals and significance tests will be invalid; however, the estimates of the model parameters obtained by the method of least squares will still be valid, though not optimal (Field, 2013). This assumption can be tested with the Durbin-Watson test, which tests for serial correlations between errors; specifically, it tests whether adjacent residuals are correlated (Field, 2013). The size of the Durbin-Watson statistic depends on the number of predictors in the model and the number of observations (Field, 2013). As a very conservative rule of thumb, values less than 1 or greater than 3 are cause for concern; however, values closer to 2 may still be problematic, depending on the sample and model (Field, 2013).
- Homoscedasticity: At each level of the predictor variable(s), the variance of the residual terms should be constant, meaning that the residuals at each level of the predictor(s) should have the same variance (homoscedasticity) (Field, 2013). When the variances are very unequal, there is said to be heteroscedasticity. Violating this assumption invalidates the confidence intervals and significance tests (Field, 2013); however, estimates of the model parameters (b) using the method of least squares are still valid but not optimal (Field, 2013). This problem can be overcome by using weighted least squares regression, in which each case is weighted by a function of its variance (Field, 2013).
- Normally Distributed Errors: It is assumed that the residuals in the model are random, normally distributed variables with a mean of 0. This assumption means that the differences between the model and the observed data are most frequently zero or very close to zero, and that differences much greater than zero happen only occasionally (Field, 2013). This assumption is sometimes confused with the idea that the predictors have to be normally distributed; the predictors do not need to be normally distributed (Field, 2013). In small samples a lack of normality will invalidate confidence intervals and significance tests; in large samples it will not, because of the central limit theorem (Field, 2013). If the concern is only with estimating the model parameters, and not with significance tests and confidence intervals, then this assumption barely matters (Field, 2013). In other words, this assumption matters for significance tests and confidence intervals, and it can also be set aside if bootstrapped confidence intervals are used (Field, 2013). A sketch of how these residual checks might be carried out follows this list.
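The following sketch illustrates, on hypothetical data, how the independence, homoscedasticity, and normality of the residuals might be examined. The Durbin-Watson statistic is the test named above; the other two checks are common diagnostics chosen for this illustration and are not prescribed by Field (2013).

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from scipy import stats

rng = np.random.default_rng(3)

# Hypothetical data and a fitted linear model.
x = rng.normal(size=100)
y = 2.0 + 0.7 * x + rng.normal(scale=1.0, size=100)
fit = sm.OLS(y, sm.add_constant(x)).fit()
resid = fit.resid

# Independent errors: Durbin-Watson statistic (values near 2 suggest
# uncorrelated adjacent residuals; values below 1 or above 3 are cause
# for concern).
print(durbin_watson(resid))

# Homoscedasticity: a rough check of whether the spread of the residuals
# changes with the fitted values (near-zero correlation suggests it does not).
print(np.corrcoef(fit.fittedvalues, np.abs(resid))[0, 1])

# Normally distributed errors: Shapiro-Wilk test on the residuals.
print(stats.shapiro(resid))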
Additional Assumptions of Regular Linear Regression
There are additional assumptions when dealing with Regular Linear Regression. These additional assumptions, as indicated by Field (2013), are as follows:
- Predictors are uncorrelated with "External Variables" or "Third Variables": External variables are variables which have not been included in the regression model but which influence the outcome variable; they can be described as "third variables." This assumption indicates that there should be no external variables that correlate with any of the variables included in the regression model (Field, 2013). If external variables do correlate with the predictors, the conclusions drawn from the model become unreliable, because other variables exist that can predict the outcome just as well (Field, 2013).
- Variable Types: All predictor (independent) variables must be "quantitative" or "categorical," and the outcome (dependent) variable must be "quantitative," "continuous," and "unbounded" (Field, 2013). "Quantitative" indicates that they should be measured at the interval level, and "unbounded" indicates that there should be no constraints on the variability of the outcome (Field, 2013).
- No Perfect Multicollinearity: If the model has more than one predictor, then there should be no perfect linear relationship between two or more of the predictors. Thus, the predictor (independent) variables should not correlate too highly (Field, 2013) (see the sketch after this list).
- Non-Zero Variance: The predictors should have some variation in value, meaning they should not have variances of zero (Field, 2013).
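As an illustration of the multicollinearity and non-zero variance assumptions, the following sketch simulates predictors (one of which is nearly a copy of another) and reports their variances and variance inflation factors. The VIF is a common collinearity diagnostic chosen for this illustration, not a method taken from the cited sources.

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(4)

# Hypothetical predictors; X3 is nearly a copy of X1, so collinearity is high.
X1 = rng.normal(size=100)
X2 = rng.normal(size=100)
X3 = X1 + rng.normal(scale=0.05, size=100)
X = sm.add_constant(np.column_stack([X1, X2, X3]))

# Non-zero variance: every predictor should vary (variance > 0).
print(X[:, 1:].var(axis=0))

# Variance inflation factors: large values (often taken as > 10) flag
# predictors that correlate too highly with the other predictors.
print([variance_inflation_factor(X, i) for i in range(1, X.shape[1])])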
Logistic Regression
When the outcome (dependent) variable is categorical and the predictor (independent) variables are continuous or categorical, Logistic Regression is used (Field, 2013). Logistic Regression is multiple regression but with an outcome (dependent) variable that is categorical and predictor variables that are continuous or categorical. Logistic Regression is the main prediction method for qualitative variables (Giudici, 2005).
Logistic Regression can have life-saving applications: in medical research it is used to generate models from which predictions can be made about the "likelihood" that, for example, a tumor is cancerous or benign (Field, 2013). A database is used to establish which variables are influential in predicting the "likelihood" of malignancy of a tumor (Field, 2013). These variables can then be measured for a new patient and their values placed in the Logistic Regression model, from which a "probability" of malignancy can be estimated (Field, 2013). Logistic Regression calculates the "probability" of the outcome occurring rather than predicting the value of the outcome for a given set of predictors (Ahlemeyer-Stubbe & Coleman, 2014). The expected values of the target variable from a Logistic Regression are between 0 and 1 and can be interpreted as a "likelihood" (Ahlemeyer-Stubbe & Coleman, 2014).
There are two types of Logistic Regression: Binary Logistic Regression, and Multinomial (or Polychotomous) Logistic Regression. Binary Logistic Regression is used to predict membership of only two categories of the outcome (dependent) variable, while Multinomial or Polychotomous Logistic Regression is used to predict membership of more than two categories of the outcome (dependent) variable (Field, 2013).
Concerning the assessment of the model, the R-statistic can be used to calculate a more literal version of the multiple correlation in the Logistic Regression model. The R-statistic is the partial correlation between the outcome variable and each of the predictor variables, and it can vary between -1 and +1. A positive value indicates that as the predictor variable increases, so does the likelihood of the event occurring, while a negative value indicates that as the predictor variable increases, the likelihood of the outcome occurring decreases (Field, 2013). If a variable has a small value of R, then it contributes only a small amount to the model. Other measures for such assessment include the Hosmer and Lemeshow, Cox and Snell's, and Nagelkerke's statistics (Field, 2013). Although these measures differ in their computation, conceptually they are somewhat the same, and they can be seen as similar to the R2 in linear regression regarding interpretation, as they provide a gauge of the substantive significance of the model (Field, 2013).
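As a sketch of how two of these measures can be obtained, the following example fits a logistic model to simulated data and computes Cox and Snell's and Nagelkerke's R2 from the log-likelihoods of the fitted and intercept-only models. The formulas are the standard ones; the data and the use of statsmodels are assumptions of this illustration.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)

# Hypothetical data: one continuous predictor and a binary outcome.
n = 300
x = rng.normal(size=n)
p = 1.0 / (1.0 + np.exp(-(0.5 + 1.5 * x)))
y = rng.binomial(1, p)

fit = sm.Logit(y, sm.add_constant(x)).fit(disp=0)

ll_model = fit.llf    # log-likelihood of the fitted model
ll_null = fit.llnull  # log-likelihood of the intercept-only (baseline) model

# Cox and Snell's R^2 and Nagelkerke's R^2 from the log-likelihoods.
r2_cox_snell = 1.0 - np.exp(2.0 * (ll_null - ll_model) / n)
r2_nagelkerke = r2_cox_snell / (1.0 - np.exp(2.0 * ll_null / n))
print(r2_cox_snell, r2_nagelkerke)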
In Logistic Regression, there is an analogous statistic, the z-statistic, which follows the normal distribution and measures the individual contribution of predictors (Field, 2013). Like the t-test in Regular Linear Regression, the z-statistic indicates whether the (b) coefficient for a predictor is significantly different from zero (Field, 2013). If the coefficient is significantly different from zero, then it can be assumed that the predictor is making a significant contribution to the prediction of the outcome (Y) (Field, 2013). The z-statistic is known as the Wald statistic, as it was developed by Abraham Wald (Field, 2013).
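A minimal sketch of the Wald z-statistic, assuming simulated data and the statsmodels library, divides each estimated coefficient by its standard error:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)

# Hypothetical data: two predictors and a binary outcome.
n = 200
X = rng.normal(size=(n, 2))
p = 1.0 / (1.0 + np.exp(-(-0.3 + 1.0 * X[:, 0] - 0.8 * X[:, 1])))
y = rng.binomial(1, p)

fit = sm.Logit(y, sm.add_constant(X)).fit(disp=0)

# Wald z-statistic for each coefficient: b divided by its standard error.
z = fit.params / fit.bse
print(z)
print(fit.pvalues)  # tests whether each b differs significantly from zero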
Principles of Logistic Regression
One of the assumptions mentioned above for regular linear models is that the relationship between variables must be linear for the linear regression to be valid. However, when the outcome variable is categorical, this assumption is violated, as explained in the "Variable Types" assumption above, because the outcome (dependent) variable must be "quantitative," "continuous," and "unbounded" (Field, 2013). To get around this problem, the data must be transformed using a logarithmic transformation. The purpose of this transformation is to express a non-linear relationship as a linear one (Field, 2013). Logistic Regression is based on this principle: it expresses the multiple linear regression equation in logarithmic terms, called the "logit," and thus overcomes the problem of violating the assumption of linearity (Field, 2013). The transformation logit(p) is used in Logistic Regression, with the letter (p) representing the probability of success (Ahlemeyer-Stubbe & Coleman, 2014). The logit(p) is a non-linear transformation, and Logistic Regression is a type of non-linear regression (Ahlemeyer-Stubbe & Coleman, 2014).
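As a small illustration of this transformation, the following sketch defines logit(p) = ln(p / (1 - p)) and its inverse, which maps any value of the linear predictor back to a probability between 0 and 1; the function names are chosen for this illustration only.

import numpy as np

# logit(p): the log odds of a probability p -- the transformation that lets
# Logistic Regression be expressed as a model linear in the predictors.
def logit(p):
    return np.log(p / (1.0 - p))

# Inverse of the logit: maps any linear predictor value back to a
# probability between 0 and 1.
def inverse_logit(eta):
    return 1.0 / (1.0 + np.exp(-eta))

print(logit(0.8))                 # log odds of p = 0.8
print(inverse_logit(logit(0.8)))  # recovers 0.8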
Assumptions of the Logistic Regression
In Logistic Regression, the assumptions of ordinary regression are still applicable. However, the following two assumptions are dealt with differently in Logistic Regression (Field, 2013):
- Linearity: While in ordinary regression the assumption is that the outcome has a linear relationship with the predictors, in logistic regression the outcome is categorical, so this assumption is violated, and the log (or logit) of the data is used to overcome the violation (Field, 2013). Thus, the assumption of linearity in Logistic Regression is that there is a linear relationship between any continuous predictors and the logit of the outcome variable (Field, 2013). This assumption can be tested by checking whether the interaction term between a predictor and its log transformation is significant (Field, 2013); a sketch of this check follows this list. In short, the linearity assumption in Logistic Regression is that each continuous predictor has a linear relationship with the logit of the outcome variable.
- Independence of Errors: In Logistic Regression, violating this assumption produces overdispersion, which occurs when the observed variance is bigger than expected from the Logistic Regression model. Overdispersion can occur for two reasons (Field, 2013). The first is correlated observations, when the assumption of independence is broken (Field, 2013); the second is variability in success probabilities (Field, 2013). Overdispersion tends to shrink the standard errors, which creates two problems. The first problem concerns the test statistics of the regression parameters, which are computed by dividing by the standard error; if the standard error is too small, then the test statistic will be too big and falsely deemed significant. The second problem concerns the confidence intervals, which are computed from the standard errors; if the standard error is too small, then the confidence interval will be too narrow, resulting in overconfidence about the likely relationship between the predictors and the outcome in the population. In short, overdispersion occurs when the variance is larger than the variance expected from the model, and it can be caused by violating the assumption of independence. This problem makes the standard errors too small (Field, 2013), which can bias the conclusions about the significance of the model parameters (b-values) and their population values (Field, 2013).
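The following sketch illustrates both checks on simulated data: the interaction between a predictor and its natural log tests the linearity of the logit, and the ratio of the Pearson chi-square to the residual degrees of freedom gives a rough indication of overdispersion. The data and the use of statsmodels are assumptions of this illustration.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)

# Hypothetical data: one positive continuous predictor and a binary outcome.
n = 300
x = rng.uniform(0.5, 5.0, size=n)
p = 1.0 / (1.0 + np.exp(-(-1.0 + 0.8 * x)))
y = rng.binomial(1, p)

# Linearity of the logit: include the interaction between the predictor and
# its natural log; a significant interaction term signals a violation.
X = sm.add_constant(np.column_stack([x, x * np.log(x)]))
fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()
print(fit.pvalues)  # the p-value of the x*ln(x) term is the one of interest

# Overdispersion: compare the Pearson chi-square to its degrees of freedom;
# a ratio well above 1 suggests more variance than the model expects.
print(fit.pearson_chi2 / fit.df_resid)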
Business Analytics Methods Based on Data Types
Hodeghatta and Nayak (2016) summarize the business analytics methods based on data types in the following table. As shown in the table, when the response (dependent) variable is continuous and the predictor variables are either continuous or categorical, the Linear Regression method is used. When the response (dependent) variable is categorical and the predictor variables are either continuous or categorical, Logistic Regression is used. Other methods are also listed as additional information.

Table 1. Business Analytics Methods Based on Data Types. Adapted from Hodeghatta and Nayak (2016).
References
Ahlemeyer-Stubbe, A., & Coleman, S. (2014). A practical guide to data mining for business and industry: John Wiley & Sons.
Field, A. (2013). Discovering Statistics using IBM SPSS Statistics: Sage Publications.
Fischetti, T. (2015). Data Analysis with R: Packt Publishing Ltd.
Fischetti, T., Mayor, E., & Forte, R. M. (2017). R: Predictive Analysis: Packt Publishing.
Giudici, P. (2005). Applied data mining: statistical methods for business and industry: John Wiley & Sons.
Hodeghatta, U. R., & Nayak, U. (2016). Business Analytics Using R-A Practical Approach: Springer.