Dr. Aly, O.
Computer Science
Introduction
The purpose of this discussion is to examine the assumptions of the General Least Square Model (GLM) for regression and correlation. This discussion also covers the issues involved in transforming variables to make relationships linear. The procedure in R for linear regression is also addressed. The discussion begins with some basics, such as measurement scale, correlation, and regression, followed by the main topics for this discussion.
Measurement Scale
There are three types of measurement scale. Nominal (categorical) variables include race, color, job, sex or gender, job status, and so forth (Kometa, 2016). Ordinal (categorical) variables have ordered categories, such as the effect of a drug (none, mild, or severe) or job importance rated 1-5, where 1 is not important and 5 is very important (Kometa, 2016). Interval variables (continuous, covariates, scale metric) include temperature (in Celsius), weight (in kg), height (in inches or cm), and so forth (Kometa, 2016). Interval variables have all the properties of nominal and ordinal variables (Bernard, 2011): they form an exhaustive and mutually exclusive list of attributes, and the attributes have a rank-order structure (Bernard, 2011). They have one additional property: the distance between attributes is meaningful (Bernard, 2011). Therefore, interval variables involve true quantitative measurement (Bernard, 2011).
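As a brief illustration of how these three scales are typically represented in R, consider the sketch below; the variable names and values are hypothetical and not taken from the sources cited above.
gender <- factor(c("male", "female", "female", "male"))   # nominal (categorical)
effect <- factor(c("none", "mild", "severe", "mild"),
                 levels = c("none", "mild", "severe"), ordered = TRUE)   # ordinal
weight <- c(70.2, 65.5, 80.1, 72.4)   # interval/continuous (scale)
str(gender); str(effect); str(weight)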
Correlations
Correlation analysis is used to measure the association between two variables. A correlation coefficient (r) is a statistic used for measuring the strength of a supposed linear association between two variables (Kometa, 2016). Correlation analysis can be conducted using interval data, ordinal data, or categorical data (crosstabs) (Kometa, 2016). The fundamental concept of correlation requires analyzing two variables simultaneously to find whether there is a relationship between the two sets of scores, and how strong or weak that relationship is, presuming that a relationship does, in fact, exist (Huck, Cormier, & Bounds, 2012). There are three possible scenarios within any bivariate data set. The first scenario is referred to as high-high, low-low: high and low scores on the first variable tend to be paired with high and low scores, respectively, on the second variable. The second scenario is referred to as high-low, low-high: the relationship is inverse, meaning that high and low scores on the first variable tend to be paired with low and high scores, respectively, on the second variable. The third scenario is referred to as "little systematic tendency": some of the high and low scores on the first variable are paired with high scores on the second variable, whereas other high and low scores on the first variable are paired with low scores on the second variable (Huck et al., 2012).
The correlation coefficient varies from -1 to +1 (Huck et al., 2012; Kometa, 2016). Any r that falls on the right side of the continuum represents a positive correlation, indicating a direct relationship between the two measured variables, which can be categorized under the high-high, low-low scenario. Conversely, any r that falls on the left side represents a negative correlation, indicating an indirect, or inverse, relationship, which can be categorized under the high-low, low-high scenario. If r lands on either end of the correlation continuum, the term "perfect" may be used to describe the obtained correlation. The term "high" comes into play when r assumes a value close to either end, implying a strong relationship; conversely, the term "low" is used when r lands close to the middle of the continuum, implying a weak relationship. Any r that ends up in the middle area of the left or right side of the correlation continuum is called "moderate" (Huck et al., 2012). Figure 1 illustrates the correlation continuum of values from -1 to +1.

Figure 1. Correlation Continuum (-1 and +1) (Huck et al., 2012).
The most common correlation coefficient is the Pearson correlation coefficient, used to measure the relationship between two interval variables (Huck et al., 2012; Kometa, 2016). The Pearson correlation is designed for situations where each of the two variables is quantitative and each variable is measured so as to produce raw scores (Huck et al., 2012). Spearman's rho is the second most popular bivariate correlational technique; it is used where each of the two variables is measured so as to produce ranks, with the resulting correlation coefficient symbolized as rs or ρ (Huck et al., 2012). Kendall's tau is similar to Spearman's rho (Huck et al., 2012).
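All three coefficients can be computed with base R's cor() function by changing the method argument; the x and y vectors below are made-up paired scores used only for illustration.
x <- c(12, 15, 9, 20, 17, 11)
y <- c(14, 18, 10, 22, 15, 13)
cor(x, y, method = "pearson")    # Pearson correlation for interval data
cor(x, y, method = "spearman")   # Spearman's rho, based on ranks
cor(x, y, method = "kendall")    # Kendall's tau, also rank-based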
Regression
Correlation and association between statistical variables treat the variables in a symmetric way. When the variables are treated in a non-symmetric way, however, a predictive model for one or more response variables can be derived from one or more of the others (Giudici, 2005). Linear Regression is a predictive data mining method (Giudici, 2005; Perugachi-Diaz & Knapik, 2017).
Linear Regression is described as the most important prediction method for continuous variables, while Logistic Regression is the main prediction method for qualitative variables (Giudici, 2005). Cluster analysis differs from Logistic Regression and Tree Models: in cluster analysis the clustering is unsupervised and is carried out with no reference variable, whereas in Logistic Regression and Tree Models the analysis is supervised and is measured against a reference variable, such as a response whose levels are known (Giudici, 2005).
Linear Regression examines and predicts data by modeling the relationship between the dependent variable, also called the "response" variable, and the independent variable, also known as the "explanatory" variable. The purpose of Linear Regression is to find the best statistical relationship between these variables in order to predict the response variable or to examine the relationship between the variables (Perugachi-Diaz & Knapik, 2017).
Bivariate Linear Regression can be used to evaluate whether one variable, called the dependent variable or response, can be explained and therefore predicted as a function of another variable, called the independent variable, explanatory variable, covariate, or feature (Giudici, 2005). Y is used for the dependent or response variable, and X is used for the independent or explanatory variable (Giudici, 2005). Linear Regression is the simplest statistical model that can describe Y as a function of X (Giudici, 2005). The Linear Regression model specifies a "noisy" linear relationship between the variables Y and X, and for each paired observation (xi, yi), the following Regression Function is used (Giudici, 2005; Schumacker, 2015):
yi = a + bxi + ei
Where:
- i = 1, 2, …, n
- a = The intercept of the regression function.
- b = The slope coefficient of the regression function, also called the regression coefficient.
- ei = The random error of the regression function, relative to the ith observation.
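To make the regression function concrete, the following sketch simulates data from such a noisy linear relationship; the intercept, slope, and error standard deviation are arbitrary values chosen for illustration.
set.seed(123)
n <- 100
a <- 2                             # intercept (illustrative value)
b <- 0.5                           # slope coefficient (illustrative value)
x <- runif(n, 0, 10)               # explanatory variable
e <- rnorm(n, mean = 0, sd = 1)    # random error term
y <- a + b * x + e                 # regression function: yi = a + b*xi + ei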
The Regression Function has two main elements: the Regression Line and the Error Term. The Regression Line can be developed empirically, starting from the matrix of available data. The Error Term describes how well the Regression Line approximates the observed response variable. The determination of the Regression Line can be described as the problem of fitting a straight line to the observed dispersion diagram, where the Regression Line is the linear function given by the following formula (Giudici, 2005):
ŷi = a + bxi
Where:
- ŷi = The fitted ith value of the dependent variable, calculated on the basis of the ith value of the explanatory variable xi.
The simple formula for the Regression Line, as indicated in Bernard (2011) and Schumacker (2015), is as follows:
y = a + bx
Where:
- y = the value of the dependent variable.
- a and b = constants; a is the intercept and b is the slope.
- x = the value of the independent variable.
The Error Term ei in the Regression Function represents, for each observation yi, the residual, namely the difference between the observed response value yi and the corresponding value fitted with the Regression Line, using the following formula (Giudici, 2005):
ei = yi - ŷi
Each residual can be interpreted as the part of the corresponding observed value that is not explained by the linear relationship with the explanatory variable. To obtain the analytic expression of the regression line, it is sufficient to calculate the parameters a and b on the basis of the available data. The method of least squares is often used for this: it chooses the straight line that minimizes the sum of squared errors of the fit (SSE), defined by the following formula (Giudici, 2005):
SSE = Σ ei² = Σ (yi - ŷi)², summed over i = 1, …, n
Figure 2 illustrates the representation of the regression line.

Figure 2. Representation of the Regression Line (Giudici, 2005).
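To make the least squares fit concrete, the following sketch computes a and b directly from paired data and compares them with the estimates returned by lm(); x and y are assumed to be any paired numeric vectors, for example the simulated values from the earlier sketch.
b_hat <- cov(x, y) / var(x)            # slope estimate from the data
a_hat <- mean(y) - b_hat * mean(x)     # intercept estimate
y_fit <- a_hat + b_hat * x             # fitted values on the regression line
SSE <- sum((y - y_fit)^2)              # sum of squared errors of the fit
coef(lm(y ~ x))                        # lm() returns the same a and b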
General Least Square Model (GLM) for Regression and Correlations
Linear Regression is based on the Gauss-Markov theorem, which states that if the errors of prediction are independently distributed, sum to zero, and have constant variance, then the least squares estimate of the regression weight is the best linear unbiased estimator of the population regression weight (Schumacker, 2015). The Gauss-Markov theorem thus justifies selecting a regression weight by minimizing the error of prediction, which gives the best prediction of Y. This is referred to as the least squares criterion: selecting regression weights that minimize the sum of squared errors of prediction (Schumacker, 2015). The least squares criterion is sometimes referred to as BLUE, or Best Linear Unbiased Estimator (Schumacker, 2015).
Several assumptions are made when using Linear Regression, among which is one crucial assumption known as the "independence assumption," which is satisfied when the observations are taken on subjects that are not related in any sense (Perugachi-Diaz & Knapik, 2017). Under this assumption, the errors in the data can be assumed to be independent (Perugachi-Diaz & Knapik, 2017). If this assumption is violated, the errors are dependent, and the quality of the statistical inference may no longer follow from the classical theory (Perugachi-Diaz & Knapik, 2017).
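A minimal diagnostic sketch along these lines, assuming x and y are paired numeric vectors such as those simulated earlier, fits a model and inspects whether the residuals behave as the assumptions require.
model <- lm(y ~ x)
res <- residuals(model)
sum(res)                    # least squares residuals sum to approximately zero
plot(fitted(model), res)    # look for roughly constant variance around zero
acf(res)                    # marked autocorrelation would suggest dependent errors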
Regression works by fitting a straight line through the data points so that the overall distance between the points and the line is minimized, using the statistical method called least squares. Figure 3 illustrates an example of a scatter plot of two variables, e.g., English and Maths scores (Muijs, 2010).

Figure 3. Example of a Scatter Plot of two Variables, e.g. English and Maths Scores (Muijs, 2010).
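A comparable scatter plot with its fitted regression line can be produced in base R; the english and maths vectors below are hypothetical scores used only for illustration.
english <- c(55, 62, 70, 48, 66, 75, 58, 80)
maths <- c(52, 60, 72, 45, 70, 78, 55, 85)
plot(english, maths, main = "English and Maths scores")   # scatter plot
abline(lm(maths ~ english))    # overlay the least squares regression line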
In Pearson's correlation, r measures how much changes in one variable correspond with equivalent changes in the other variable (Bernard, 2011). It can also be used as a measure of association between an interval and an ordinal variable, or between an interval variable and a dummy variable, which is a nominal variable coded as 1 or 0, present or absent (Bernard, 2011). The square of Pearson's r, or r-squared, is a PRE (proportionate reduction of error) measure of association for linear relations between interval variables (Bernard, 2011). It indicates how much better the scores of a dependent variable can be predicted if the scores of some independent variables are known (Bernard, 2011). Each dot illustrated in Figure 4 is physically distant from the dotted mean line by a certain amount. The sum of the squared distances to the mean is the smallest sum possible, that is, the smallest cumulative prediction error, given that only the mean of the dependent variable is known (Bernard, 2011). The distances from the dots above the line to the mean are positive; the distances from the dots below the line to the mean are negative (Bernard, 2011). The sum of the actual distances is zero; squaring the distances gets rid of the negative numbers (Bernard, 2011). The solid line that runs diagonally through the graph in Figure 4 minimizes the prediction error for these data. This line is called the best fitting line, the least squares line, or the regression line (Bernard, 2011).

Figure 4. Example of a Plot of Data of TFR and “INFMORT” for Ten countries (Bernard, 2011).
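The relationship between Pearson's r and r-squared in the bivariate case can be verified numerically; this sketch reuses the hypothetical english and maths scores from the previous example.
r <- cor(english, maths)                   # Pearson correlation coefficient
r^2                                        # r-squared as a PRE measure
summary(lm(maths ~ english))$r.squared     # equals r^2 in bivariate regression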
Transformation of Variables for Linear Regression
Data transformation can involve transforming the data matrix into univariate and multivariate frequency distributions (Giudici, 2005). It can also be a process that simplifies the statistical analysis and the interpretation of the results (Giudici, 2005). For instance, when the p variables of the data matrix are expressed in different measurement units, it is a good idea to put all the variables on the same measurement scale so that the different scales do not affect the results (Giudici, 2005). This can be implemented using a linear transformation that standardizes the variables, subtracting the average of each one and dividing by the square root of its variance (Giudici, 2005). There are other data transformations, such as the non-linear Box-Cox transformation (Giudici, 2005).
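A brief sketch of the standardization described above, using base R's scale() on a hypothetical data frame; the commented Box-Cox line is an optional step that assumes the MASS package is available.
df <- data.frame(weight_kg = c(70, 65, 80, 72),
                 height_cm = c(175, 160, 182, 169))
df_std <- as.data.frame(scale(df))   # subtract each mean, divide by each standard deviation
colMeans(df_std)                     # means are approximately zero after standardizing
# library(MASS); boxcox(lm(weight_kg ~ height_cm, data = df))   # non-linear Box-Cox option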
The transformation of the data is also a method of solving problems with data quality, perhaps because items are missing or because there are anomalous values, known as outliers (Giudici, 2005). There are two primary approaches to dealing with missing data: remove it, or substitute it using the remaining data (Giudici, 2005). The identification of anomalous values requires a formal statistical analysis; an anomalous value can seldom simply be eliminated, as its existence often provides valuable information about the descriptive or predictive model connected to the data under examination (Giudici, 2005).
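The two approaches to missing data mentioned above might look like the following in R; the vector v with NA values is purely illustrative, and mean imputation is only one simple way of substituting from the remaining data.
v <- c(4.1, NA, 5.3, 6.0, NA, 4.8)
v_removed <- na.omit(v)                                   # approach 1: remove missing values
v_imputed <- ifelse(is.na(v), mean(v, na.rm = TRUE), v)   # approach 2: substitute (mean imputation)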
The underlying concept behind transforming variables is to correct for distributional problems, outliers, lack of linearity, or unequal variances (Field, 2013). Transformation changes the form of the relationships between variables, but the relative differences between people on a given variable stay the same, so those relationships can still be quantified (Field, 2013). However, it does change the differences between different variables because it changes the units of measurement (Field, 2013). Thus, when the focus is a relationship between variables, e.g., regression, the transformation is applied to the problematic variable; when the focus is differences between variables, such as a change in a variable over time, the transformation is applied to all of those variables (Field, 2013).
There are various transformation techniques to correct various problems. The log transformation (log(Xi)) can be used to correct for positive skew, positive kurtosis, unequal variances, and lack of linearity (Field, 2013). The square root transformation (√Xi) can be used to correct for positive skew, positive kurtosis, unequal variances, and lack of linearity (Field, 2013). The reciprocal transformation (1/Xi) can be used to correct for positive skew, positive kurtosis, and unequal variances (Field, 2013). The reverse score transformation can be used to correct for negative skew (Field, 2013). Table 1 summarizes these types of transformation and their correction use.
Transformation                        Corrects for
Log transformation, log(Xi)           Positive skew, positive kurtosis, unequal variances, lack of linearity
Square root transformation, √Xi       Positive skew, positive kurtosis, unequal variances, lack of linearity
Reciprocal transformation, 1/Xi       Positive skew, positive kurtosis, unequal variances
Reverse score transformation          Negative skew
Table 1. Transformation of Data Methods and their Use. Adapted from (Field, 2013).
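In R, these transformations can be applied directly to a variable. The sketch below uses a hypothetical positively skewed vector x, and the reverse score line uses the common max(x) + 1 - x form for negatively skewed data.
x <- c(1, 2, 2, 3, 4, 8, 15, 40)   # hypothetical positively skewed scores
log_x <- log(x)                    # log transformation
sqrt_x <- sqrt(x)                  # square root transformation
recip_x <- 1 / x                   # reciprocal transformation
rev_x <- max(x) + 1 - x            # reverse score transformation (negative skew)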
Procedures in R for Linear Regressions
In R, the "stats" package contains two different functions that can be used to estimate the intercept and slope in the linear regression equation (Schumacker, 2015). These two functions are lm() and lsfit() (Schumacker, 2015). The lm() function uses a data frame, while lsfit() uses a matrix or data vector. The lm() function outputs an intercept term, which has meaning when interpreting results in linear regression. The lm() function can also specify an equation with no intercept (Schumacker, 2015).
Example of lm() function with intercept on y as dependent variable and x as independent variable:
- LReg = lm(y ~ x, data = dataframe)
Example of lm() function with no intercept on y as dependent variable and x as independent variable:
- LReg = lm(y ~ 0 + x, data = dataframe), or
- LReg = lm(y ~ x - 1, data = dataframe)
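As a short sketch contrasting the two stats-package functions, where dataframe is a hypothetical data frame with columns y and x:
dataframe <- data.frame(x = c(1, 2, 3, 4, 5),
                        y = c(2.1, 3.9, 6.2, 8.1, 9.8))
LReg <- lm(y ~ x, data = dataframe)           # lm() with an intercept term
LReg0 <- lm(y ~ 0 + x, data = dataframe)      # lm() with no intercept
lsfit(dataframe$x, dataframe$y)$coefficients  # lsfit() takes a matrix or vector
coef(LReg)                                    # intercept and slope from lm()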
When using the lm() function, the expectation is that the response variable is normally distributed (Hodeghatta & Nayak, 2016). However, the independent variables are not required to be normally distributed (Hodeghatta & Nayak, 2016). Predictors can also be factors (Hodeghatta & Nayak, 2016).
# cor() function to find the correlation between the variables x and y
cor(x, y)
# Build a linear regression model in R (y as response, x as predictor)
model <- lm(y ~ x, data = dataset)
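Once the model has been fitted, its coefficients, fit statistics, and predictions can be inspected; this follow-up sketch assumes the model object and dataset from the lines above, with the predictor column named x.
summary(model)                                        # coefficients, R-squared, significance tests
coef(model)                                           # estimated intercept and slope
predict(model, newdata = data.frame(x = c(10, 20)))   # predictions for new x values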

References
Bernard, H. R. (2011). Research methods in anthropology: Qualitative and quantitative approaches. Rowman Altamira.
Field, A. (2013). Discovering statistics using IBM SPSS Statistics. Sage Publications.
Giudici, P. (2005). Applied data mining: Statistical methods for business and industry. John Wiley & Sons.
Hodeghatta, U. R., & Nayak, U. (2016). Business analytics using R: A practical approach. Springer.
Huck, S. W., Cormier, W. H., & Bounds, W. G. (2012). Reading statistics and research (6th ed.). New York: Harper & Row.
Kometa, S. T. (2016). Getting started with IBM SPSS Statistics for Windows: A training manual for beginners (8th ed.). Pearson.
Muijs, D. (2010). Doing quantitative research in education with SPSS. Sage.
Perugachi-Diaz, Y., & Knapik, B. (2017). Correlation in linear regression.
Schumacker, R. E. (2015). Learning statistics using R. Sage Publications.