Overfitting and Parsimony in Large Dataset Analysis

Dr. O. Aly
Computer Science

Introduction:  The purpose of this discussion is to examine the issue of overfitting versus parsimony and its importance in Big Data analysis.  The discussion also addresses whether overfitting is a problem in the General Least Squares Model (GLM) approach.  Some hierarchical methods which do not require the parsimony of the GLM are also discussed.  The GLM itself is not covered here, as it was discussed earlier.  The discussion begins with the parsimony principle in statistical models.

Parsimony Principle in Statistical Model

The medieval (14th century) English philosopher William of Ockham (1285–1347/49) (Forster, 1998) popularized a critical principle stated by Aristotle: “Entities must not be multiplied beyond what is necessary” (Bordens & Abbott, 2008; Epstein, 1984; Forster, 1998).  Ockham’s refinement of this principle is now called “Occam’s Razor,” which states that a problem should be stated in the simplest possible terms and explained with the fewest postulates possible (Bordens & Abbott, 2008; Epstein, 1984; Field, 2013; Forster, 1998).  This principle is now known as the Law or Principle of Parsimony (Bordens & Abbott, 2008; Epstein, 1984; Field, 2013; Forster, 1998).  Thus, based on this law, a theory should account for phenomena within its domain in the simplest terms possible and with the fewest assumptions (Bordens & Abbott, 2008; Epstein, 1984; Field, 2013; Forster, 1998).  As indicated by Bordens and Abbott (2008), if there are two competing theories concerning a behavior, the one which explains the behavior in the simplest terms is preferred under the law of parsimony.

Modern theories of the attribution process, development, memory, and motivation adhere to this law of parsimony (Bordens & Abbott, 2008).  However, the history of science has seen theories collapse under their own weight of complexity (Bordens & Abbott, 2008).  For instance, the collapse of interest in the Hull-Spence model of learning occurred primarily because the theory had been modified so many times to account for anomalous data that it was no longer parsimonious (Bordens & Abbott, 2008).  The Hull-Spence model became too complicated, with too many assumptions and too many variables whose values had to be extracted from the very data that the theory was meant to explain (Bordens & Abbott, 2008).  As a result of this complexity, interest in the theory collapsed.  The Ptolemaic theory of planetary motion similarly lost its parsimony and, with it, much of its true predictive power (Bordens & Abbott, 2008).

Parsimony is one of the characteristics of a good theory (Bordens & Abbott, 2008).  A parsimonious explanation or theory accounts for a relationship using relatively few assumptions (Bordens & Abbott, 2008).  When more than one explanation is offered for an observed behavior, scientists and researchers prefer the parsimonious explanation, that is, the one which explains the behavior with the fewest assumptions (Bordens & Abbott, 2008).  Scientific explanations are regularly evaluated and examined for consistency with the evidence and with known principles, and for parsimony and generality (Bordens & Abbott, 2008).  Accepted explanations can be overthrown in favor of views which are more general, more parsimonious, and more consistent with observation (Bordens & Abbott, 2008).

How to Develop a Fit Model Using the Parsimony Principle

When building a model, the researcher should strive for parsimony (Bordens & Abbott, 2008; Field, 2013).  The statistical implication of the parsimony heuristic is that models should be kept as simple as possible, meaning that predictors should not be included unless they provide explanatory benefit (Field, 2013).  This strategy can be implemented by fitting a model that includes all potential predictors and then systematically removing any that do not appear to contribute to the model (Field, 2013).  Moreover, if the model includes interaction terms, the main effects involved in those interactions should be retained for the interaction terms to be valid (Field, 2013).  An example of applying parsimony in model development involves three variables in a patient dataset: (1) an outcome variable (cured or not cured), which is the dependent variable (DV); (2) an intervention variable, which is a predictor or independent variable (IV); and (3) duration, another predictor (Field, 2013).  The three potential predictors are therefore Intervention, Duration, and the interaction Intervention x Duration (Field, 2013).  The most complex model includes all three predictors.  As the model is built up, any terms that are added but do not improve the model should be removed, and the model without those terms should be adopted.  Thus, the first model (model-1) the researcher might fit has only Intervention as a predictor (Field, 2013).  The model is then built up by adding the main effect of Duration (model-2), and the interaction Intervention x Duration can be added in model-3.  Figure 1 illustrates these three stages of development, and an R sketch of this comparison follows the figure caption.  The goal is to determine which of these models best fits the data while adhering to the general idea of parsimony (Field, 2013).  If the interaction term in model-3 does not improve model-2, then model-2 should be used as the final model; if Duration in model-2 does not improve model-1, then model-1 should be used as the final model (Field, 2013).  The aim is to build the model systematically and choose the most parsimonious model as the final model.  Parsimonious representations are essential because simpler models tend to give more insight into a problem (Ledolter, 2013).

Figure 1.  Building Models based on the Principle of Parsimony (Field, 2013).
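To make this concrete, the following is a minimal R sketch of the same three-stage comparison.  It assumes a hypothetical data frame named patients with a binary outcome Cured and predictors Intervention and Duration (these names are illustrative and not part of the original text); the nested models are compared and the simplest adequate one is retained.

  • ## Hierarchical model building under the parsimony principle (hypothetical data)
  • model1 <- glm(Cured ~ Intervention, data=patients, family=binomial)
  • model2 <- glm(Cured ~ Intervention + Duration, data=patients, family=binomial)
  • model3 <- glm(Cured ~ Intervention * Duration, data=patients, family=binomial)
  • ## If a more complex model does not significantly improve the fit, retain the simpler one
  • anova(model1, model2, model3, test="Chisq")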

Overfitting in Statistical Models

Overfitting is a term used for models or procedures that violate the parsimony principle; it means that the model includes more terms than are necessary or uses more complicated approaches than necessary (Hawkins, 2004).  There are two types of overfitting.  The first is to use a model which is more flexible than it needs to be (Hawkins, 2004).  For instance, a neural net can accommodate some curvilinear relationships and so is more flexible than a simple linear regression (Hawkins, 2004).  However, if it is used on a dataset that conforms to the linear model, it adds a level of complexity without any corresponding benefit in performance, or, even worse, performs more poorly than the simpler model (Hawkins, 2004).  The second type of overfitting is to use a model that includes irrelevant components, such as a polynomial of excessive degree or a multiple linear regression that contains irrelevant predictors alongside the needed ones (Hawkins, 2004).

Overfitting is undesirable for four essential reasons (Hawkins, 2004).  The first involves wasting resources and expanding the possibilities for undetected errors in databases, which can lead to prediction mistakes, because values of the unneeded predictors must be supplied each time the model is used in the future (Hawkins, 2004).  The second reason is that a model with unneeded predictors can lead to worse decisions (Hawkins, 2004).  The third reason is that irrelevant predictors can make predictions worse, because the coefficients fitted to them add random variation to subsequent predictions (Hawkins, 2004).  The last reason is that the choice of model has an impact on its portability (Hawkins, 2004).  A one-predictor linear regression that captures a relationship is highly portable (Hawkins, 2004).  The more portable model is preferred over the less portable one, as a fundamental requirement of science is that one researcher’s results can be duplicated by another researcher (Hawkins, 2004).

Moreover, large models overfitted to a training dataset turn out to be extremely poor predictors in new situations, as unneeded predictor variables increase the prediction error variance (Ledolter, 2013).  Overparameterized models are of little use if it is difficult to collect data on the predictor variables in the future.  The partitioning of the data into training and evaluation (test) datasets is central to most data mining methods (Ledolter, 2013).  Researchers must check whether the relationships found in the training dataset will hold up in the future (Ledolter, 2013).

How to Recognize and Avoid Overfit Models

A model overfits if it is more complicated than another model that fits equally well (Hawkins, 2004).  Recognizing an overfitted model involves not only comparing the simpler model with the more complex one but also the question of how the fit of a model is measured (Hawkins, 2004).  Cross-validation can detect overfitted models by partitioning the data and determining how well the model generalizes to other datasets (minitab.com, 2015).  This process helps assess how well the model fits new observations which were not used in the model estimation process (minitab.com, 2015); a brief R sketch of this idea follows.
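The following minimal sketch splits the built-in faithful data (analyzed later in this document) into training and test sets and compares a simple linear model with a deliberately over-flexible ninth-degree polynomial; the polynomial degree is an assumption chosen only to exaggerate overfitting.

  • ## Hold-out validation: an overfitted model often predicts new data no better
  • set.seed(123)
  • train_idx <- sample(nrow(faithful), size=floor(0.7*nrow(faithful)))
  • train <- faithful[train_idx, ]
  • test <- faithful[-train_idx, ]
  • simple_fit <- lm(waiting ~ eruptions, data=train)
  • complex_fit <- lm(waiting ~ poly(eruptions, 9), data=train)
  • ## Root mean squared prediction error on the held-out data
  • rmse <- function(model, newdata) sqrt(mean((newdata$waiting - predict(model, newdata))^2))
  • c(simple=rmse(simple_fit, test), complex=rmse(complex_fit, test))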

Hierarchical Methods

The regression analysis types include simple, hierarchical, and stepwise analysis (Bordens & Abbott, 2008).  The main difference between these types is how the predictor variables are entered into the regression equation, which may affect the regression solution (Bordens & Abbott, 2008).  In simple regression analysis, all predictor variables are entered together, while in hierarchical regression the order in which variables are entered into the regression equation is specified by the researcher (Bordens & Abbott, 2008; Field, 2013).  Thus, hierarchical regression is used when a well-developed theory or model suggests a specific causal order (Bordens & Abbott, 2008).  As a general rule, known predictors should be entered into the model first, in order of their importance in predicting the outcome (Field, 2013).  After the known predictors have been entered, any new predictors can be added to the model (Field, 2013).  In stepwise regression, the order in which variables are entered is based on a statistical decision, not on theory (Bordens & Abbott, 2008).

The choice of regression analysis should be based on the research questions or the underlying theory (Bordens & Abbott, 2008).  If the theoretical model suggests a particular order of entry, hierarchical regression should be used (Bordens & Abbott, 2008).  Stepwise regression is used infrequently because sampling and measurement error tend to make the correlations among variables unstable (Bordens & Abbott, 2008).  The main problem with stepwise methods is that they assess the fit of a variable based on the other variables in the model (Field, 2013).

Goodness-of-fit Measure for the Fit Model

Comparison between hierarchical and stepwise methods:  The hierarchical and stepwise methods involve adding predictors to the model in stages, and it is useful to know whether these additions improve the model (Field, 2013).  Since larger values of R² indicate better fit, a simple way to see whether a model has improved as a result of adding predictors would be to check whether R² for the new model is bigger than for the old model.  However, R² will always get bigger if predictors are added, so the real issue is whether it gets significantly bigger (Field, 2013).  The significance of the change in R² can be assessed with an F-statistic, just as an F-statistic is used to test the significance of R² itself (Field, 2013).

However, because the focus is on the change between models, the F-ratio is based on the change in R² (R²_change) and the R² of the newer model (R²_new): F_change = ((N − k_new − 1) × R²_change) / (k_change × (1 − R²_new)), where N is the sample size, k_new is the number of predictors in the newer model, and k_change is the number of predictors added (Field, 2013).  Models can be compared using this F-ratio (Field, 2013), as sketched below.
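In R, the same comparison can be obtained with anova() on nested models; the sketch below uses hypothetical names (a data frame dat with outcome y and predictors x1 and x2) purely for illustration.

  • ## Testing whether added predictors significantly increase R-squared (hypothetical data)
  • old_model <- lm(y ~ x1, data=dat)
  • new_model <- lm(y ~ x1 + x2, data=dat)
  • ## anova() on nested models reports the F-ratio for the change between them
  • anova(old_model, new_model)
  • ## The change in R-squared itself
  • summary(new_model)$r.squared - summary(old_model)$r.squared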

Akaike’s Information Criterion (AIC) is a goodness-of-fit measure which penalizes the model for having more variables: a larger AIC indicates a worse fit, and a smaller AIC indicates a better fit (Field, 2013).  If the Automated Linear Modeling function in SPSS is used, AIC is used to select models rather than the change in R².  AIC is meaningful only for comparing models with the same outcome variable; if it gets smaller, the fit of the model is improving (Field, 2013).  In addition to AIC, there is Hurvich and Tsai’s criterion (AICC), a version of AIC designed for small samples (Field, 2013); Bozdogan’s criterion (CAIC), a version of AIC which adjusts for model complexity and sample size; and Schwarz’s Bayesian Information Criterion (BIC), which is comparable to AIC (Field, 2013; Forster, 1998) but slightly more conservative, as it corrects more harshly for the number of parameters being estimated (Field, 2013).  BIC should be used when sample sizes are large and the number of parameters is small (Field, 2013).

AIC and BIC are the most commonly used measures of model fit.  The values of AIC, AICC, CAIC, and BIC are useful as a way of comparing models: they can be compared to their equivalent values in other models, and in all cases smaller values mean better-fitting models (Field, 2013).
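In R, these criteria can be obtained directly with the AIC() and BIC() functions; the sketch below again uses the hypothetical dat, y, x1, and x2 names for illustration.

  • ## Comparing competing models for the same outcome; smaller values indicate better fit
  • m1 <- lm(y ~ x1, data=dat)
  • m2 <- lm(y ~ x1 + x2, data=dat)
  • AIC(m1, m2)
  • BIC(m1, m2)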

There is also Rissanen’s Minimum Description Length (MDL) measure, which is based on the idea that statistical inference centers on capturing regularity in data; regularity, in turn, can be exploited to compress the data (Field, 2013; Vandekerckhove, Matzke, & Wagenmakers, 2015).  Thus, the goal is to find the model which compresses the data the most (Vandekerckhove et al., 2015).  There are three versions of MDL.  The first is the crude two-part code, in which the penalty for complex models is that they take many bits to describe, increasing the summed code length; in this version it can be difficult to define the number of bits required to describe the model.  The second version is the Fisher Information Approximation (FIA), which is similar to AIC and BIC in that it includes a first term representing goodness-of-fit and additional terms representing a penalty for complexity (Vandekerckhove et al., 2015).  The second term resembles that of BIC, and the third term reflects a more sophisticated penalty representing the number of distinguishable probability distributions that a model can generate (Vandekerckhove et al., 2015).  FIA differs from AIC and BIC in that it also accounts for functional-form complexity, not just complexity due to the number of free parameters (Vandekerckhove et al., 2015).  The third version is normalized maximum likelihood (NML), which is simple to state but can be difficult to compute; for instance, the denominator may be infinite, which requires further measures to be taken (Vandekerckhove et al., 2015).  Moreover, NML requires integration over the entire set of possible datasets, which may be difficult to define as it depends on unknown decision processes of the researchers (Vandekerckhove et al., 2015).

AIC and BIC in R:  If there are p potential predictors, then there are 2^p possible models (r-project.org, 2002).  AIC and BIC can be used in R as selection criteria for linear regression models as well as for other types of models.  As indicated in (r-project.org, 2002), the criteria are AIC = −2 log-likelihood + 2p and BIC = −2 log-likelihood + p log(n), where p is the number of model parameters and n is the number of observations.

For linear regression models, the −2 log-likelihood (known as the deviance) is n log(RSS/n) (r-project.org, 2002).  AIC and BIC are to be minimized (r-project.org, 2002).  Larger models will fit better and so have smaller RSS, but they use more parameters (r-project.org, 2002).  Thus, the best choice of model will balance fit against model size (r-project.org, 2002).  BIC penalizes larger models more heavily and so tends to prefer smaller models than AIC does (r-project.org, 2002).
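As a small illustration of this deviance-based form, the sketch below computes AIC and BIC by hand for a linear regression on the built-in faithful data (used later in this document) and checks the AIC value against extractAIC(), the function step() uses for linear models.

  • ## Deviance-based AIC and BIC for a linear model
  • fit <- lm(waiting ~ eruptions, data=faithful)
  • n <- nrow(faithful)
  • rss <- sum(residuals(fit)^2)
  • p <- length(coef(fit))  ## number of estimated regression parameters
  • n*log(rss/n) + 2*p        ## AIC = n*log(RSS/n) + 2p
  • n*log(rss/n) + log(n)*p   ## BIC = n*log(RSS/n) + p*log(n)
  • extractAIC(fit)           ## returns (edf, AIC) as used by step()
  • extractAIC(fit, k=log(n)) ## the BIC variant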

An example of the code in R, using the state.x77 dataset, is below.  The step() function does not evaluate the AIC for all possible models but uses a search method that compares models sequentially, as shown in the output of the R commands.

  • state.x77.df <- data.frame(state.x77)  ## convert the built-in state.x77 matrix to a data frame
  • g <- lm(Life.Exp ~ ., data=state.x77.df)
  • step(g)  ## AIC-based sequential model search
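A BIC-flavored search can also be requested by setting the penalty argument k of step() to log(n), which penalizes larger models more heavily than the default AIC penalty of k = 2:

  • n <- nrow(state.x77.df)
  • step(g, k=log(n))  ## BIC-style sequential search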

References

Bordens, K. S., & Abbott, B. B. (2008). Research Design and Methods: A Process Approach: McGraw-Hill.

Epstein, R. (1984). The principle of parsimony and some applications in psychology. The Journal of Mind and Behavior, 119-130.

Field, A. (2013). Discovering Statistics using IBM SPSS Statistics: Sage publications.

Forster, M. R. (1998). Parsimony and Simplicity. Retrieved from http://philosophy.wisc.edu/forster/220/simplicity.html, University of Wisconsin-Madison.

Hawkins, D. M. (2004). The problem of overfitting. Journal of chemical information and computer sciences, 44(1), 1-12.

Ledolter, J. (2013). Data mining and business analytics with R: John Wiley & Sons.

minitab.com. (2015). The Danger of Overfitting Regression Models. Retrieved from http://blog.minitab.com/blog/adventures-in-statistics-2/the-danger-of-overfitting-regression-models.

r-project.org. (2002). Practical Regression and ANOVA Using R Retrieved from https://cran.r-project.org/doc/contrib/Faraway-PRA.pdf.

Vandekerckhove, J., Matzke, D., & Wagenmakers, E.-J. (2015). Model Comparison and the Principle of Parsimony. In The Oxford Handbook of Computational and Mathematical Psychology (Vol. 300): Oxford Library of Psychology.

Quantitative Analysis of the “Faithful” Dataset Using R-Programming

Dr. O. Aly
Computer Science

Abstract

The purpose of this project is to analyze the faithful dataset.  The project is divided into two main parts.  Part-I evaluates and examines the dataset in RStudio in order to understand it and involves five significant tasks.  Part-II covers the data analysis, beginning by converting the dataset to a data frame, and involves nine significant tasks to analyze the data frame.  The result shows that there is a relationship between the waiting time until the next eruption and the eruption time.  The probability value (p-value), known as the significance value, should be very small (for example 0.001, 0.005, 0.01, or 0.05) for the relationship between the response variable and the independent variable to be significant.  As the result shows, the probability of error for the coefficient of eruptions is minimal and nearly 0, i.e., <2e-16.  Thus, we reject the null hypothesis that the parameter is not significant to the model and accept the alternative hypothesis that the parameter is significant to the model.  We conclude that there is a significant relationship between the response variable waiting time and the independent variable eruptions.  The project also compares three regressions: standard linear regression, polynomial regression, and lowess regression.  The result shows similar relationships and fitted lines across the three models.

Keywords: R-Dataset; Faithful Dataset; Regression Analysis Using R.

Introduction

This project examines and analyzes the dataset faithful.csv (oldfaithful.csv), downloaded from http://vincentarelbundock.github.io/Rdatasets/.  The dataset has 272 observations on two variables, eruptions and waiting, and describes the waiting time between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA.  A closer look at faithful$eruptions reveals heavily rounded times, originally in seconds, in which multiples of 5 are more frequent than would be expected under non-human measurement.  The eruptions variable is the numeric eruption time in minutes, and the waiting variable is the numeric waiting time to the next eruption in minutes.  A related dataset, geyser, is available in the package MASS.  There are two parts.  Part-I addresses five tasks to examine and understand the dataset using R before the analysis.

Part-II addresses the analysis using R and includes the following nine tasks, which are followed by the discussion and analysis of the results.

  • Task-1: The first five records of the dataset.
  • Task-2: Density Histograms and Smoothed Density Histograms.
  • Task-3: Standard Linear Regression
  • Task-4: Polynomial Regression.
  • Task-5: Lowess Regression.
  • Task-6: The Summary of the Model.
  • Task-7: Comparison of Linear Regressions, Polynomial Regression, and Lowess Regression.
  • Task-8: The Summary of all Models.
  • Task-9:  Discussion and Analysis.

Various resources were utilized to develop the required code using R. These resources include (Ahlemeyer-Stubbe & Coleman, 2014; Fischetti, Mayor, & Forte, 2017; Ledolter, 2013; r-project.org, 2018).

Part-I:  Understand and Examine the Dataset “faithful.”

Task-1:  Install MASS Package

The purpose of this task is to install the MASS package, which is required for this project; the faithful.csv analysis uses this package.

  • Command: > install.packages("MASS")

Task-2:  Understand the Variables of the Data Sets

The purpose of this task is to understand the variables of the dataset.  The dataset is a “faithful” dataset. It describes the Waiting Time between Eruptions and the duration of the Eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA.  The dataset has 272 observations on two main variables:

  • Eruptions: numeric Eruption time in minutes.
  • Waiting: numeric Waiting time to next eruption in minutes.

Task-3:  Examine the Variables of the Data Sets

The main dataset is called “faithful.csv” dataset, which includes two main variables eruptions and waiting.

  • ##Examine the dataset
  • data()
  • ?faithful
  • install.packages("MASS")
  • install.packages("lattice")
  • library(lattice)
  • faithful <- read.csv("C:/CS871/Data/faithful.csv")
  • data(faithful)
  • summary(faithful)

Figure 1.  Eruptions and Waiting for Eruption Plots for Faithful dataset.

Task-4: Create a Data Frame to represent the dataset of faithful.

  • ##Create DataFrame
  • faithful.df <- data.frame(faithful)
  • faithful.df
  • summary(faithful.df)

Task-5: Examine the Content of the Data Frame using head(), names(), colnames(), and dim() functions.

  • names(faithful.df)
  • head(faithful.df)
  • dim(faithful.df)

Part-II: Discussion and Analysis

Task-1:  The first Ten lines of Waiting and Eruptions.

  • ##The first ten lines of Waiting and Eruptions
  • faithful$waiting[1:10]
  • faithful$eruptions[1:10]
  • ##The descriptive analysis of waiting and eruptions
  • summary(faithful$waiting)
  • summary(faithful$eruptions)

Task-2:  Density Histograms, and Smoothed Density Histograms.

  • ##Density histogram for Waiting Time
  • hist(faithful.df$waiting, col="blue", freq=FALSE, main="Histogram of Waiting Time to Next Eruption", xlab="Waiting Time To Next Eruption In Minutes")
  • ##Smoothed density histogram for Waiting Time (requires the locfit package)
  • library(locfit)
  • smoothedDensity_waiting <- locfit(~lp(waiting), data=faithful.df)
  • plot(smoothedDensity_waiting, col="blue", main="Smoothed Density of Waiting Time to Next Eruption", xlab="Waiting Time To Next Eruption In Minutes")
  • ##Density histogram for Eruptions
  • hist(faithful.df$eruptions, col="red", freq=FALSE, main="Histogram of Eruption Time", xlab="Eruption Time In Minutes")
  • ##Smoothed density histogram for Eruptions
  • smoothedDensity_eruptions <- locfit(~lp(eruptions), data=faithful.df)
  • plot(smoothedDensity_eruptions, col="red", main="Smoothed Density of Eruption Time", xlab="Eruption Time In Minutes")

Figure 2.  Density Histogram and Smoothed Density Histogram of Waiting Time.

Figure 3.  Density Histogram and Smoothed Density Histogram of Eruption.

Task-3: Standard Linear Regression

            The purpose of this task is to examine the standard linear regression for the two main variables of the faithful dataset, Waiting Time until Next Eruption and Eruption Time.  This task also addresses the diagnostic plots of the standard linear regression, such as residuals vs. fitted, as examined below.  The R code is as follows:

  • ##Standard Regression of waiting time on eruption time.
  • lin.reg.model=lm(waiting~eruptions, data=faithful.df)
  • plot(waiting~eruptions, data=faithful, col="blue", main="Regression of Two Factors of Waiting and Eruption Time")
  • abline(lin.reg.model, col="red")

Figure 4.  Regression of Two Factors of Waiting and Eruption Time.

            The following graphs represent the diagnostic plots for the standard linear regression.  The first plot shows the residuals vs. fitted values, the second the Normal Q-Q plot, the third the Scale-Location plot, and the fourth the Residuals vs. Leverage plot.  The discussion and analysis of these graphs appear in the discussion and analysis section of this project.

Figure 5.  Diagnostic Plots for Standard Linear Regression.

Task-4: Polynomial Regression

            The purpose of this task is to examine the polynomial regression for the Waiting Time until Next Eruption and Eruption Time variables.
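The code for this task is not reproduced in the text.  One possible minimal sketch, assuming a quadratic polynomial fit of waiting time on eruption time, is:

  • ##Polynomial (quadratic) regression of waiting time on eruption time
  • poly.reg <- lm(waiting ~ poly(eruptions, 2), data=faithful)
  • plot(waiting~eruptions, data=faithful, col="blue", main="Polynomial Regression of Waiting and Eruption Time", xlab="Eruption Time in Minutes", ylab="Waiting Time to Next Eruption in Minutes")
  • ord <- order(faithful$eruptions)
  • lines(faithful$eruptions[ord], fitted(poly.reg)[ord], col="red")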

Figure 6.  Polynomial Regreession.

Task-5: Lowess Regression

            The purpose of this task is to examine the lowess regression of Waiting Time until Next Eruption on Eruption Time.
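The code for this task is not reproduced in the text.  One possible minimal sketch, assuming the base R lowess() smoother with its default span f = 2/3, is:

  • ##Lowess regression of waiting time on eruption time
  • lowess.reg <- lowess(faithful$eruptions, faithful$waiting, f=2/3)
  • plot(waiting~eruptions, data=faithful, col="blue", main="Lowess Regression of Waiting and Eruption Time", xlab="Eruption Time in Minutes", ylab="Waiting Time to Next Eruption in Minutes")
  • lines(lowess.reg, col="red")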

Figure 7.  Lowess Regression.

Task-6: Summary of the Model.

            The purpose of this task is to examine the descriptive summary of the model, such as the residuals, intercept, and R-squared.  The R code is as follows:

  • summary(lin.reg.model)

Task-7:  Comparison of Linear Regression, Polynomial Regression and Lowess Regression.

  • ##Comparing local polynomial regression to the standard regression.
  • lowessReg=lowess(faithful$waiting~faithful$eruptions, f=2/3)
  • local.poly.reg <-locfit(waiting~lp(eruptions, nn=0.5), data=faithful)
  • standard.reg=lm(waiting~eruptions, data=faithful)
  • plot(faithful$waiting~faithful$eruptions, main="Eruptions Time", xlab="Eruption Time in Minutes", ylab="Waiting Time to Next Eruption", col="blue")
  • lines(lowessReg, col="red")
  • abline(standard.reg, col="green")
  • lines(local.poly.reg, col="yellow")

Figure 8.  Regression Comparison for the Eruptions and Waiting Time Variables.

Task-8:  Summary of these Models

            The purpose of this task is to examine the summary of each model.  The R code is as follows:

  • ##Summary of the regressions
  • summary(lowessReg)
  • summary(local.poly.reg)
  • summary(standard.reg)
  • cor(faithful.df$eruptions, faithful.df$waiting)

Task-9: Discussion and Analysis

            The descriptive analysis shows that the average of eruptions is 3.49 minutes, which is lower than the median value of 4.0 minutes, indicating a negatively skewed distribution.  The average waiting time until the next eruption is 70.9 minutes, which is less than the median of 76.0, also indicating a negatively skewed distribution.  Figure 2 illustrates the density histogram and smoothed density histogram of the waiting time.  The result shows that the peak waiting time is ~80 minutes, with the highest density point of 0.04.  Figure 3 illustrates the density histogram and smoothed density histogram of the eruption time in minutes.  The result shows that the peak eruption time is ~4.4 minutes, with the highest density point of 0.6.  Figure 4 illustrates the linear regression of the two variables, waiting time until the next eruption and eruption time in minutes.  The result shows that as the eruption time increases, the waiting time increases.  The residuals depict the difference between the actual value of the response variable and the value of the response variable predicted using the regression.  The maximum residual is 15.97.  The spread of residuals is given by the min, max, median, Q1, and Q3 of the residuals; in this case, the spread is from -12.08 to 15.97.  Since the principle behind the regression line and the regression equation is to reduce this error or difference, the expectation is that the median residual should be very near 0.  However, the median is 0.21, which is above 0.  The prediction error can go up to the maximum value of the residual; as this value is 15.97, which is not small, this residual cannot be considered acceptable.  The result also shows that the value next to each coefficient estimate is the standard error of the estimate, which specifies the uncertainty of the estimate; next comes the t-value, which specifies how large the coefficient estimate is relative to its uncertainty.  The next value is the probability of obtaining an absolute t-value greater than the one observed by chance alone.  The probability value (p-value), known as the significance value, should be very small (for example 0.001, 0.005, 0.01, or 0.05) for the relationship between the response variable and the independent variable to be significant.  As the result shows, the probability of error for the coefficient of eruptions is very small and nearly 0, i.e., <2e-16.  Thus, we reject the null hypothesis that the parameter is not significant to the model and accept the alternative hypothesis that the parameter is significant to the model.  We conclude that there is a significant relationship between the response variable waiting time and the independent variable eruptions.

The diagnostic plots of the standard regression are also discussed in this project.  Figure 5 illustrates four diagnostic plots of the standard regression, and this analysis also covers the residuals and fitted lines.  The first plot in Figure 5 shows the Residuals vs. Fitted values for the linear regression model of waiting time until the next eruption as a function of eruption time in minutes.  The residuals depict the difference between the actual value of the response variable and the value predicted using the regression equation (Hodeghatta & Nayak, 2016).  The principle behind the regression line and the regression equation is to reduce this error or difference (Hodeghatta & Nayak, 2016), and the expectation is that the median residual should be very near zero (Hodeghatta & Nayak, 2016).  For the model to pass the test of linearity, there should be no pattern in the distribution of the residuals; when there is no pattern, the condition of linearity is satisfied (Hodeghatta & Nayak, 2016).  The plot of the fitted values against the residuals, with a line, shows the relationship between the two: a horizontal, straight line indicates that the “average residual” is more or less the same for all fitted values (Navarro, 2015).  The result for the variables eruptions and waiting time shows that the residuals have a curved pattern, indicating that a better model might be obtained by adding a quadratic term, because ideally this line should be straight and horizontal.  Figure 5 also illustrates the Normal Q-Q plot, which is used to test the normality of the distribution (Hodeghatta & Nayak, 2016).  The residuals lie almost on the straight line, indicating that they are normally distributed; hence, the normality test of the residuals is passed.  Figure 5 also illustrates the Scale-Location plot, one of the graphs generated by the plot command above.  The points are spread randomly around the line, but the line is not horizontal and the spread is not even.  If the line were horizontal with equally and randomly spread points, this would indicate that the assumption of constant error variance (homoscedasticity) is fulfilled (Hodeghatta & Nayak, 2016); thus, it is not fulfilled in this case.  Figure 5 also illustrates the Residuals vs. Leverage plot for the linear regression model.  In this plot, patterns are not as relevant as in the other diagnostic plots; instead, outlying values at the upper right or lower right corners are watched (Bommae, 2015), since those are the places where a case can be influential against the regression line (Bommae, 2015).  When cases fall outside Cook’s distance, meaning they have high Cook’s distance scores, they are influential to the regression results (Bommae, 2015).  Here the Cook’s distance lines (red dashed lines) are far away, indicating there is no influential case.

Regression assumes that the relationship between predictors and outcomes is linear; however, non-linear relationships between variables can exist in some cases (Navarro, 2015).  There are tools in statistics which can be employed to perform non-linear regression.  Some non-linear regression models assume that the relationship between predictors and outcomes is monotonic, such as isotonic regression; others assume that it is smooth but not necessarily monotonic, such as lowess regression; and others assume that the relationship has a known form which happens to be non-linear, such as polynomial regression (Navarro, 2015).  As indicated in (Dias, n.d.), Cleveland (1979) proposed the lowess algorithm as an outlier-resistant method based on local polynomial fits.  The underlying concept is to start with a local polynomial (k-NN type) least squares fit and then use robust methods to obtain the final fit (Dias, n.d.).  The result of the polynomial regression is also addressed in this project: it shows a relationship between waiting time until the next eruption and eruption time, and the line in Figure 6 is similar to that of the standard linear regression.  The lowess regression shows the same pattern in the relationship between waiting time until the next eruption and eruption time, and the line in Figure 7 is likewise similar to the standard linear regression.  These three lines of the standard linear regression, polynomial regression, and lowess regression are illustrated together for comparison in Figure 8.  The correlation coefficient shows a positive correlation, indicating a positive effect of eruption time on the waiting time until the next eruption.

References

Ahlemeyer-Stubbe, A., & Coleman, S. (2014). A practical guide to data mining for business and industry: John Wiley & Sons.

Bommae, K. (2015). Understanding Diagnostic Plots of Linear Regression Analysis. Retrieved from https://data.library.virginia.edu/diagnostic-plots/.

Dias, R. (n.d.). Nonparametric Regression: Lowess/Loess. Retrieved from https://www.ime.unicamp.br/~dias/loess.pdf.

Fischetti, T., Mayor, E., & Forte, R. M. (2017). R: Predictive Analysis: Packt Publishing.

Hodeghatta, U. R., & Nayak, U. (2016). Business Analytics Using R-A Practical Approach: Springer.

Ledolter, J. (2013). Data mining and business analytics with R: John Wiley & Sons.

Navarro, D. J. (2015). Learning statistics with R: A tutorial for psychology students and other beginners. R package version 0.5.

r-project.org. (2018). R: A Language and Environment for Statistical Computing. Retrieved from https://cran.r-project.org/doc/manuals/r-release/fullrefman.pdf.

 

Quantitative Analysis of “Ethanol” Dataset Using R-Programming

Dr. Aly, O.
Computer Science

Introduction

The purpose of this discussion is to examine locally weighted scatterplot smoothing, known as the LOWESS method, for multiple regression models in a k-nearest-neighbor-based framework.  The discussion also addresses whether LOWESS is a parametric or non-parametric method.  The advantages and disadvantages of LOWESS from a computational standpoint are also addressed.  Moreover, another purpose of this discussion is to select a dataset from http://vincentarelbundock.github.io/Rdatasets/ and perform a multiple regression analysis using R programming.  The dataset selected for this discussion is the “ethanol” dataset.  The discussion begins with multiple regression, the LOWESS method, lowess/loess in R, and k-nearest-neighbor (k-NN), followed by the analysis of the “ethanol” dataset.

Multiple Regression

When there is more than one predictor variable, simple linear regression becomes multiple linear regression, and the analysis becomes more involved (Kabacoff, 2011).  Polynomial regression is typically a particular case of multiple regression (Kabacoff, 2011): quadratic regression has two predictors (X and X²), and cubic regression has three predictors (X, X², and X³) (Kabacoff, 2011).  When there is more than one predictor variable, the regression coefficients indicate the increase in the dependent variable for a unit change in a predictor variable, holding all other predictor variables constant (Kabacoff, 2011).
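As a brief illustration of this point, the sketch below fits quadratic and cubic regressions as multiple regressions; the built-in cars data (stopping distance as a function of speed) is used purely as an example and is not part of the original discussion.

  • ## Polynomial regression as a special case of multiple regression
  • quadratic.fit <- lm(dist ~ speed + I(speed^2), data=cars)           ## two predictors: speed, speed^2
  • cubic.fit <- lm(dist ~ speed + I(speed^2) + I(speed^3), data=cars)  ## three predictors
  • summary(quadratic.fit)
  • summary(cubic.fit)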

Locally Weighted Scatterplot Smoothing (Lowess) Method

Regression assumes that the relationship between predictors and outcomes is linear; however, non-linear relationships between variables can exist in some cases (Navarro, 2015).  There are tools in statistics which can be employed to perform non-linear regression.  Some non-linear regression models assume that the relationship between predictors and outcomes is monotonic, such as isotonic regression; others assume that it is smooth but not necessarily monotonic, such as lowess regression; and others assume that the relationship has a known form which happens to be non-linear, such as polynomial regression (Navarro, 2015).  As indicated in (Dias, n.d.), Cleveland (1979) proposed the lowess algorithm as an outlier-resistant method based on local polynomial fits.  The underlying concept is to start with a local polynomial (k-NN type) least squares fit and then to use robust methods to obtain the final fit (Dias, n.d.).

Moreover, lowess and loess are non-parametric strategies for fitting a smooth curve to data points (statisticshowto.com, 2013).  “Parametric” indicates that the data are assumed in advance to fit some distribution, e.g., a normal distribution (statisticshowto.com, 2013).  Parametric fitting can lead to a smooth curve which misrepresents the data, because some distribution is assumed in advance (statisticshowto.com, 2013).  Thus, in those cases, non-parametric smoothers may be a better choice (statisticshowto.com, 2013).  Non-parametric smoothers like loess try to find a curve of best fit without assuming that the data must fit some distribution shape (statisticshowto.com, 2013).  In general, both types of smoothers are used on the same set of data to offset the advantages and disadvantages of each type (statisticshowto.com, 2013).  The benefits of non-parametric smoothing include a flexible approach to representing data, ease of use, and easy computation (statisticshowto.com, 2013).  The disadvantages of non-parametric smoothing include the following: (1) it cannot be used to obtain a simple equation for a set of data, (2) it is less well understood than parametric smoothers, and (3) it requires a little guesswork to obtain a result (statisticshowto.com, 2013).

Lowess/Loess in R

There are two versions of the lowess or loess scatter-diagram smoothing approach implemented in R (Dias, n.d.).  The former (lowess) was implemented first, while loess is more flexible and powerful (Dias, n.d.).  An example of lowess:

  • lowess(x, y, f=2/3, iter=3, delta=0.01*diff(range(x)))

where the following model is assumed: y = b(x)+e. 

  • The “f” is the smoother span, which gives the proportion of points in the plot which influence the smooth at each value; larger values give more smoothness.
  • The “iter” is the number of “robustifying” iterations which should be performed; using smaller values of “iter” will make “lowess” run faster.
  • The “delta” is a tolerance: values of “x” which lie within “delta” of each other are replaced by a single value in the output from “lowess” (Dias, n.d.).

The loess() function uses a formula to specify the response and (in its application as a scatter-diagram smoother) a single predictor variable (Dias, n.d.).  The loess() function creates an object which contains the results, and the predict() function retrieves the fitted values.  These can be plotted along with the response variable (Dias, n.d.).  However, the points must be plotted in increasing order of the predictor variable for the lines() function to draw the line appropriately; this is done using the order() function applied to the predictor variable values and explicit subscripting (in square brackets []) to arrange the observations in ascending order (Dias, n.d.), as sketched below.
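The sketch below illustrates this loess()/predict()/lines() workflow; the built-in cars data is used here only as an example and is not part of the original discussion.

  • ## Loess scatter-diagram smoothing with fitted values plotted in order of the predictor
  • fit <- loess(dist ~ speed, data=cars)
  • fitted.vals <- predict(fit)
  • plot(dist ~ speed, data=cars, col="blue")
  • ord <- order(cars$speed)              ## plot in increasing order of the predictor
  • lines(cars$speed[ord], fitted.vals[ord], col="red")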

K-Nearest-Neighbor (K-NN)

The k-NN classifier is based on learning numeric attributes in an n-dimensional space.  All of the training samples are stored in the n-dimensional space with a unique pattern (Hodeghatta & Nayak, 2016).  When a new sample is given, the k-NN classifier searches the pattern space for the patterns which are closest to the sample and accordingly labels the class in the k-pattern space (called the k-nearest neighbors) (Hodeghatta & Nayak, 2016).  “Closeness” is defined in terms of Euclidean distance, where the Euclidean distance between two points X = (x1, x2, x3, ..., xn) and Y = (y1, y2, y3, ..., yn) is d(X, Y) = sqrt((x1 − y1)² + (x2 − y2)² + ... + (xn − yn)²).

The unknown sample is assigned to the nearest class among its k nearest neighbors.  The aim is to look for records in the training data which are similar to, or “near,” the record to be classified, i.e., records with values close to X = (x1, x2, x3, ..., xn) (Hodeghatta & Nayak, 2016).  These records are grouped into classes based on “closeness,” and the unknown sample looks for the class (defined by k) and identifies itself with the class which is nearest in the k-space (Hodeghatta & Nayak, 2016).  If a new record has to be classified, the classifier finds the nearest matches to the record and tags it with that class (Hodeghatta & Nayak, 2016).

The k-NN method does not assume any relationship between the predictors (X) and the class (Y) (Hodeghatta & Nayak, 2016).  Instead, it draws conclusions about the class based on similarity measures between the record and the records in the dataset (Hodeghatta & Nayak, 2016).  Although there are many potential measures, k-NN uses the Euclidean distance between records to find the similarities used to label the class (Hodeghatta & Nayak, 2016).  The predictor variables should be standardized to a common scale before computing the Euclidean distances and classifying (Hodeghatta & Nayak, 2016).  After computing the distances between records, a rule is required to put these records into the different classes (k) (Hodeghatta & Nayak, 2016).  A higher value of k reduces the risk of overfitting due to noise in the training set (Hodeghatta & Nayak, 2016).  Ideally, the value of k can be varied between 2 and 10, computing the misclassification error each time and choosing the value of k which gives the minimum error (Hodeghatta & Nayak, 2016).

The advantages of k-NN as a classification method include its simplicity and lack of parametric assumptions, and it performs well for large training datasets (Hodeghatta & Nayak, 2016).  However, its disadvantages include the time required to find the nearest neighbors and reduced performance when there are a large number of predictors (Hodeghatta & Nayak, 2016).
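The following minimal sketch illustrates k-NN classification in R using the class package and the built-in iris data; both the package and the dataset are illustrative choices and not part of the original discussion.  The predictors are standardized before the Euclidean distances are computed.

  • ## k-NN classification with standardized predictors
  • library(class)
  • set.seed(123)
  • x <- scale(iris[, 1:4])   ## standardize the predictors to a common scale
  • train_idx <- sample(nrow(iris), size=floor(0.7*nrow(iris)))
  • pred <- knn(train=x[train_idx, ], test=x[-train_idx, ], cl=iris$Species[train_idx], k=5)
  • mean(pred != iris$Species[-train_idx])  ## misclassification error on the held-out records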

Multiple Regression Analysis for “ethanol” dataset Using R

This section is divided into five major tasks.  The first task is to understand and examine the dataset.  Task-2, Task-3, and Task-4 cover the density histograms, the linear regression, and the multiple linear regression.  Task-5 covers the discussion and analysis of the results.

Task-1:  Understand and Examine the Dataset.

The purpose of this task is to understand and examine the dataset.  The description of the dataset is found in (r-project.org, 2018).  It is a data frame with 88 observations on the following three variables:

  • NOx Concentration of nitrogen oxides (NO and NO2) in micrograms/J.
  • C Compression ratio of the engine.
  • E Equivalence ratio–a measure of the richness of the air and ethanol fuel mixture.

#R-Commands and Results using summary(), names(), head(), dim(), and plot() functions.

  • ethanol <- read.csv("C:/CS871/Data/ethanol.csv")
  • data(ethanol)
  • summary(ethanol)
  • names(ethanol)
  • head(ethanol)
  • dim(ethanol)
  • plot(ethanol, col="red")

Figure 1.  Plot Summary of NOx, C, and E in Ethanol Dataset.

  • ethanol[1:3,]   ##First three lines
  • ethanol$NOx[1:10]  ##First 10 lines for concentration of nitrogen oxides (NOx)
  • ethanol$C[1:10]   ##First 10 lines for Compression Ratio of the Engine ( C )
  • ethanol$E[1:10]  ##First 10 lines for Equivalence Ratio ( E )
  • ##Descriptive Analysis using summary() function to analyze the central tendency.
  • summary(ethanol$NOx)
  • summary(ethanol$C)
  • summary(ethanol$E)

Task-2:  Density Histogram and Smoothed Density Histogram

  • ##Density histogram for NOx
  • hist(ethanol$NOx, freq=FALSE, col="orange")
  • install.packages("locfit")  ##locfit library is required for the smoothed histogram
  • library(locfit)
  • smoothedDensity_NOx <- locfit(~lp(NOx), data=ethanol)
  • plot(smoothedDensity_NOx, col="orange", main="Smoothed Density Histogram for NOx")

Figure 2.  Density Histogram and Smoothed Density Histogram of NOx of Ethanol.

  • ##Density histogram for Equivalence Ration ( E )
  • hist(ethanol$E, freq=FALSE, col="blue")
  • smoothedDensity_E <- locfit(~lp(E), data=ethanol)
  • plot(smoothedDensity_E, col="blue", main="Smoothed Density Histogram for Equivalence Ratio")

Figure 3.  Density Histogram and Smoothed Density Histogram of E of Ethanol.

  • ##Density histogram for Compression Ratio ( C )
  • hist(ethanol$C, freq=FALSE, col="blue")
  • smoothedDensity_C <- locfit(~lp(C), data=ethanol)
  • plot(smoothedDensity_C, col="blue", main="Smoothed Density Histogram for Compression Ratio")

Figure 4.  Density Histogram and Smoothed Density Histogram of C of Ethanol.

Task-3:  Linear Regression Model

  • ## Linear Regression
  • lin.reg.model1=lm(NOx~E, data=ethanol)
  • lin.reg.model1
  • plot(NOx~E, data=ethanol, col="blue", main="Linear Regression of NOx and Equivalence Ratio in Ethanol")
  • abline(lin.reg.model1, col="red")
  • mean.NOx=mean(ethanol$NOx, na.rm=T)
  • abline(h=mean.NOx, col="green")

Figure 5:  Linear Regression of the NOx and E in Ethanol.

  • ##local polynomial regression of NOx on the equivalent ratio
  • ##fit with a 50% nearest neighbor bandwidth.
  • local.poly.reg <-locfit(NOx~lp(E, nn=0.5), data=ethanol)
  • plot(local.poly.reg, col="blue")

Figure 6:  Smoothed Polynomial Regression of the NOx and E in Ethanol.

Figure 7.  Residuals vs. Fitted Plots.

Figure 8.  Normal Q-Q Plot.

Figure 9. Scale-Location Plot.

Figure 10.  Residuals vs. Leverage.

  • ##To better understand the linearity of the relationship represented by the model.
  • summary(lin.reg.model1)
  • plot(lin.reg.model1)
  • library(car)  ##crPlots() comes from the car package
  • crPlots(lin.reg.model1)
  • termplot(lin.reg.model1)

Figure 11.  crPlots() Plots for the Linearity of the Relationship between NOx and Equivalence Ratio of the Model.

  • ##Examine the Correlation between NOx and E.
  • cor(ethanol$NOx, ethanol$E)

Task-4: Multiple Regressions

  • ##Produce Plots of some explanatory variables.
  • plot(NOx~E, ethanol, col="blue")
  • plot(NOx~C, ethanol, col="red")
  • ##Use the vertical bar to find the relationship of E on NOx conditioned on C
  • coplot(NOx~E|C, panel=panel.smooth, data=ethanol, col="blue")
  • model2=lm(NOx~E*C, ethanol)
  • plot(model2, col="blue")

Figure 12. Multiple Regression – Relationship of E on NOx conditioned with C.

Figure 13. Multiple Regression Diagnostic Plot: Residual vs. Fitted.

Figure 14. Multiple Regression Diagnostic Plot: Normal Q-Q.

Figure 15. Multiple Regression Diagnostic Plot: Scale-Location.

Figure 16. Multiple Regression Diagnostic Plot: Residual vs. Leverage.

  • summary(model2)

Task-5:  Discussion and Analysis:  The result shows that the average NOx is 1.96, which is higher than the median of 1.75, indicating a positively skewed distribution.  The average compression ratio of the engine (C) is 12.034, which is slightly higher than the median of 12.00, indicating an almost normal distribution.  The average equivalence ratio (E) is 0.926, which is slightly lower than the median of 0.932, indicating a slightly negatively skewed but close to normal distribution.  In summary, the average for NOx is 1.96, for C is 12.034, and for E is 0.926.

The NOx exhaust emissions depend on two predictor variables: the fuel-air equivalence ratio (E) and the compression ratio (C) of the engine.  The density of the NOx emissions and its smoothed version using local polynomial regression are illustrated in Figure 2.  The result shows that NOx increases as the density rises from about 0.15; however, after the density reaches about 0.35, NOx continues to increase while the density starts to drop.  Thus, there seems to be a positive relationship between NOx and density between densities of about 0.15 and 0.35, after which the relationship appears to reverse and become negative.

The density of the equivalence ratio (E), the measure of the richness of the air and ethanol fuel mixture, and its smoothed version using local polynomial regression are illustrated in Figure 3.  The result shows that the density varies as E increases: the density increases with E until it reaches about 1.5, then drops to about 1.2 while E continues to increase, rises again until it reaches about 1.6, and then drops once more as E keeps increasing.  In summary, the density fluctuates while the value of E keeps increasing.

The density of the compression ratio of the engine (C) and its smoothed version using local polynomial regression are illustrated in Figure 4.  The result shows that C increases as the density rises from about 0.09; however, after the density reaches about 0.11, C continues to increase while the density starts to drop.  Thus, there seems to be a positive relationship between C and density between densities of about 0.09 and 0.11, after which the relationship appears to reverse and become negative.

Figure 5 illustrates the linear regression between NOx and the equivalence ratio in ethanol, and Figure 6 illustrates the smoothed polynomial regression of NOx and E.  The result of the linear regression of NOx as a function of the equivalence ratio (E) shows that as E increases, NOx varies, first increasing and then decreasing.  Figure 6, the smoothed polynomial regression of NOx and E, indicates the same result: there is a positive association between E and NOx, with NOx increasing to about 3.5 as E increases to about 0.9; after that point the relationship becomes negative, meaning that NOx decreases as E continues to increase.

This analysis also covers the residuals and fitted lines.  Figure 7 illustrates the Residuals vs. Fitted plot for the linear regression model of NOx as a function of E.  The residuals depict the difference between the actual value of the response variable and the value predicted using the regression equation (Hodeghatta & Nayak, 2016).  The principle behind the regression line and the regression equation is to reduce this error or difference (Hodeghatta & Nayak, 2016), and the expectation is that the median residual should be very near zero (Hodeghatta & Nayak, 2016).  For the model to pass the test of linearity, there should be no pattern in the distribution of the residuals; when there is no pattern, the condition of linearity is satisfied (Hodeghatta & Nayak, 2016).  The plot of the fitted values against the residuals, with a line, shows the relationship between the two: a horizontal, straight line indicates that the “average residual” is more or less the same for all fitted values (Navarro, 2015).

The result of the linear regression for the variables E and NOx shows that the residuals have a curved pattern, indicating that a better model might be obtained by adding a quadratic term, because ideally this line should be straight and horizontal.  Figure 8 illustrates the Normal Q-Q plot, which is used to test the normality of the distribution (Hodeghatta & Nayak, 2016).  Figure 8 shows that the residuals lie almost on the straight line, indicating that they are normally distributed; hence, the normality test of the residuals is passed.  Figure 9 illustrates the Scale-Location plot, one of the graphs generated by the plot command above.  The points are spread randomly around the line, but the line is not horizontal and the spread is not even.  If the line were horizontal with equally and randomly spread points, this would indicate that the assumption of constant error variance (homoscedasticity) is fulfilled (Hodeghatta & Nayak, 2016); thus, it is not fulfilled in this case.  Figure 10 illustrates the Residuals vs. Leverage plot for the linear regression model.  In this plot, patterns are not as relevant as in the other diagnostic plots; instead, outlying values at the upper right or lower right corners are watched (Bommae, 2015), since those are the places where a case can be influential against the regression line (Bommae, 2015).  When cases fall outside Cook’s distance, meaning they have high Cook’s distance scores, they are influential to the regression results (Bommae, 2015).  Here the Cook’s distance lines (red dashed lines) are far away, indicating there is no influential case.  Figure 11 illustrates the crPlots() output, which is used to better understand the linearity of the relationship represented by the model (Hodeghatta & Nayak, 2016); non-linearity requires re-exploring the model (Hodeghatta & Nayak, 2016).  The result in Figure 11 shows that the model created is not linear, which requires re-exploring the model.  Moreover, the correlation between NOx and E is negative, with a value of -0.11.  Figure 12 illustrates the multiple regression and the relationship of the equivalence ratio to NOx conditioned on the compression ratio.  Multiple linear regression is useful for modeling the relationship between a numeric outcome or dependent variable (Y) and multiple explanatory or independent variables (X).  The result shows that the interaction of C and E affects NOx: as E and C increase, NOx decreases.  Approximately 0.013 of the variation in NOx can be explained by this model (E*C).  The interaction of E and C has a negative coefficient of -0.063 on NOx.

References

Bommae, K. (2015). Understanding Diagnostic Plots of Linear Regression Analysis. Retrieved from https://data.library.virginia.edu/diagnostic-plots/.

Dias, R. (n.d.). Nonparametric Regression: Lowess/Loess. Retrieved from https://www.ime.unicamp.br/~dias/loess.pdf.

Hodeghatta, U. R., & Nayak, U. (2016). Business Analytics Using R-A Practical Approach: Springer.

Kabacoff, R. I. (2011). R in Action: Data Analysis and Graphics with R: Manning Publications.

Navarro, D. J. (2015). Learning statistics with R: A tutorial for psychology students and other beginners. R package version 0.5.

r-project.org. (2018). R: A Language and Environment for Statistical Computing. Retrieved from https://cran.r-project.org/doc/manuals/r-release/fullrefman.pdf.

statisticshowto.com. (2013). Lowess Smoothing: Overview. Retrieved from http://www.statisticshowto.com/lowess-smoothing/.

Quantitative Analysis of “State.x77” Dataset Using R-Programming

Dr. Aly, O.
Computer Science

Introduction

The purpose of this discussion is to continue working with R, using the state.x77 dataset for this assignment.  In this task, the dataset is converted to a data frame, and regression is performed on it.  The commands used in this discussion are derived from (r-project.org, 2018).  There are four major tasks.  The discussion begins with Task-1, which is to understand and examine the dataset.  Task-2 covers the creation of the data frame.  Task-3 examines the data frame.  Task-4 investigates the data frame using linear regression analysis; it is comprehensive, as it covers the R commands, the results of the commands, and the analysis of the results.

Task-1:  Understand and Examine the dataset:

The purpose of this task is to understand and examine the dataset.  The following is a summary of the variables from the information provided in the help site as a result of ?state.x77 command:

  • Command: > ?state.x77
  • Command: > summary(state.x77)
  • Command: >head(state.x77)
  • Command: >dim(state.x77)
  • Command:  >list(state.x77)

The dataset state.x77 has 50 rows and 8 columns giving the following statistics in the respective columns: Population, Income, Illiteracy, Life Exp, Murder, HS Grad, Frost, and Area.

##The first 10 lines of Income, Illiteracy, and Murder.

  • state.x77.df$Income[1:10]
  • state.x77.df$Illiteracy[1:10]
  • state.x77.df$Murder[1:10]

The descriptive statistical analysis (central tendency and spread: mean, median, min, max, 1st and 3rd quartiles) of the Income, Illiteracy, and Population variables.

  • Command:>summary(state.x77.df$Income)
  • Command:>summary(state.x77.df$Illiteracy)
  • Command:>summary(state.x77.df$Population)

Task2:  Create a Data Frame

  • Command: >state.x77.df <- data.frame(state.x77)
  • Command:>state.selected.variables <- as.data.frame(state.x77[,c("Murder", "Population", "Illiteracy", "Income", "Frost")])

Task-3: Examine the Data Frame

  • Command: > list(state.x77.df)
  • Command: >names(state.x77.df)

Task-4: Linear Regression Model – Commands, Results and Analysis:

  • plot(Income~Illiteracy, data=state.x77.df)
  • mean.Income=mean(state.x77.df$Income, na.rm=T)
  • abline(h=mean.Income, col="red")
  • model1=lm(Income~Illiteracy, data=state.x77.df)
  • model1

Figure 1.  Linear Regression Model for Income and Illiteracy.

Analysis: Figure 1 illustrates the Linear Regression between Income and Illiteracy.  The result of the Linear Regression of Income as a function of Illiteracy shows that income increases when the illiteracy percentage decreases, and vice versa, indicating an inverse relationship between illiteracy and income.  More analysis of the residuals and the fitted lines is discussed below using the plot() function in R.

  • Command: > plot(model1)

Figure 2.  Residuals vs. Fitted in Linear Regression Model for Income and Illiteracy.

Analysis:  Figure 2 illustrates the Residuals vs. Fitted plot in the Linear Regression Model for Income as a function of Illiteracy.  The residuals depict the difference between the actual value of the response variable and the value predicted using the regression equation (Hodeghatta & Nayak, 2016).  The principle behind the regression line and the regression equation is to reduce this error or difference (Hodeghatta & Nayak, 2016).  The expectation is that the median residual should be very near zero (Hodeghatta & Nayak, 2016).  For the model to pass the test of linearity, there should be no pattern in the distribution of the residuals (Hodeghatta & Nayak, 2016).  In the plot of the fitted values against the residuals, a horizontal, straight line indicates that the "average residual" is more or less the same for all "fitted values" (Navarro, 2015).  The result of the Linear Regression for the identified variables of Illiteracy and Income (Figure 2) shows that the residuals have a curved pattern, indicating that a better model could be obtained using a quadratic term, because ideally this line should be a straight horizontal line.
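
Because the residuals show a curved pattern, a quadratic term can be tried.  A minimal sketch, using the data frame created in Task-2 (the object name model1.quad is hypothetical):

#Add a quadratic term for Illiteracy and compare it with the original model
model1.quad <- lm(Income ~ Illiteracy + I(Illiteracy^2), data=state.x77.df)
summary(model1.quad)
anova(model1, model1.quad)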

Figure 3.  Normal Q-Q Plot of the Linear Regression Model for Illiteracy and Income.

Analysis: Figure 3 illustrates the Normal Q-Q plot, which is used to test the normality of the distribution (Hodeghatta & Nayak, 2016).  The result shows that the residuals are almost on the straight line in the preceding Normal Q-Q plot, indicating that the residuals are normally distributed.  Hence, the normality test of the residuals is passed.

Figure 4. Scale-Location Plot Generated in R to Validate Homoscedasticity for Illiteracy and Income.

Analysis: Figure 4 illustrates the Scale-Location graph, which is one of the graphs generated as part of the plot command above.  The points are spread in a random fashion around the nearly horizontal line, which suggests that the assumption of constant variance of the errors (or homoscedasticity) is fulfilled (Hodeghatta & Nayak, 2016).

Figure 5. Residuals vs. Leverage Plot Generated in R for the LR Model.

Analysis: Figure 5 illustrates the Residuals vs. Leverage Plot generated for the LR Model.  In this plot of Residuals vs. Leverage, the patterns are not as relevant as they are in the other diagnostic plots of the linear regression.  Instead, the outlying values at the upper right corner or the lower right corner are watched (Bommae, 2015).  Those spots are the places where a case can be influential against a regression line (Bommae, 2015).  When cases are outside of the Cook's distance, meaning they have high Cook's distance scores, the cases are influential to the regression results (Bommae, 2015).
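
A minimal sketch for checking influential cases numerically, assuming model1 fitted above:

#Cook's distance for each state; values far above the rest flag influential cases
cooks.distance(model1)
plot(model1, which=4)   #plot of Cook's distance by observation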

##Better understand the linearity of the relationship represented by the model.

  • Command: >crPlots(model1)

Figure 6.  crPlots() Plots for the Linearity of the Relationship between Income and Illiteracy of the Model.

Analysis:  Figure 6 illustrates the crPlots() function, which is used to better understand the linearity of the relationship represented by the model (Hodeghatta & Nayak, 2016).  Non-linearity would require re-exploring the model (Hodeghatta & Nayak, 2016).  The result in Figure 6 shows that the model created is approximately linear and confirms the inverse relationship between income and illiteracy analyzed above in Figure 1.

##Examine the Correlation between Income and Illiteracy.
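
The command assumed for this step, using the data frame created in Task-2, is:

  • Command: >cor(state.x77.df$Income, state.x77.df$Illiteracy)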

Analysis: The correlation result shows a negative association between income and illiteracy as anticipated in the linear regression model.

References: 

Bommae, K. (2015). Understanding Diagnostic Plots of Linear Regression Analysis. Retrieved from https://data.library.virginia.edu/diagnostic-plots/.

Hodeghatta, U. R., & Nayak, U. (2016). Business Analytics Using R-A Practical Approach: Springer.

Navarro, D. J. (2015). Learning statistics with R: A tutorial for psychology students and other beginners. R package version 0.5.

r-project.org. (2018). R: A Language and Environment for Statistical Computing. Retrieved from https://cran.r-project.org/doc/manuals/r-release/fullrefman.pdf.

The Assumptions of General Least Square Modeling for Regression and Correlations

Dr. Aly, O.
Computer Science

Introduction

The purpose of this discussion is to compare the assumptions of General Least Square Model (GLM) modeling for regression and correlations.  This discussion also covers the issues with transforming variables to make them linear.  The procedure in R for linear regression is also addressed in this assignment.  The discussion begins with some basics such as measurement scale, correlation, and regression, followed by the main topics for this discussion.

Measurement Scale

There are three types of measurement scales.  The nominal (categorical) scale includes variables such as race, color, job, sex or gender, job status, and so forth (Kometa, 2016).  The ordinal (categorical) scale includes variables such as the effect of a drug (none, mild, or severe) or job importance (1-5, where 1 is not important and 5 is very important) (Kometa, 2016).  The interval scale (continuous, covariate, scale metric) includes variables such as temperature (in Celsius), weight (in kg), height (in inches or cm), and so forth (Kometa, 2016).  Interval variables have all the properties of nominal and ordinal variables (Bernard, 2011): they are an exhaustive and mutually exclusive list of attributes, and the attributes have a rank-order structure (Bernard, 2011).  They have one additional property, related to the distance between attributes (Bernard, 2011): the distance between the attributes is meaningful (Bernard, 2011).  Therefore, interval variables involve true quantitative measurement (Bernard, 2011).

Correlations

Correlation analysis is used to measure the association between two variables.  A correlation coefficient (r) is a statistic used for measuring the strength of a supposed linear association between two variables (Kometa, 2016).  The correlation analysis can be conducted using interval data, ordinal data, or categorical data (crosstabs) (Kometa, 2016).  The fundamental concept of correlation requires the analysis of two variables simultaneously to find whether there is a relationship between the two sets of scores, and how strong or weak that relationship is, presuming that a relationship does, in fact, exist (Huck, Cormier, & Bounds, 2012).  There are three possible scenarios within any bivariate data set.  The first scenario is referred to as high-high, low-low, when high and low scores on the first variable tend to be paired with high and low scores on the second variable, respectively.  The second scenario is referred to as high-low, low-high, when the relationship is inverse, meaning that high and low scores on the first variable tend to be paired with low and high scores on the second variable.  The third scenario is referred to as "little systematic tendency," when some of the high and low scores on the first variable are paired with high scores on the second variable, whereas other high and low scores on the first variable are paired with low scores on the second variable (Huck et al., 2012).

The correlation coefficient varies between -1 and +1 (Huck et al., 2012; Kometa, 2016).  Any r that falls on the right side represents a positive correlation, indicating a direct relationship between the two measured variables, which can be categorized under the high-high, low-low scenario.  Any r that falls on the left side represents a negative correlation, indicating an indirect, or inverse, relationship, which can be categorized under the high-low, low-high scenario.  If r lands on either end of the correlation continuum, the term "perfect" may be used to describe the obtained correlation.  The term "high" comes into play when r assumes a value close to either end, implying a "strong relationship"; conversely, the term "low" is used when r lands close to the middle of the continuum, implying a "weak relationship."  Any r that ends up in the middle area of the left or right side of the correlation continuum is called "moderate" (Huck et al., 2012).  Figure 1 illustrates the correlation continuum of values between -1 and +1.

Figure 1. Correlation Continuum (-1 and +1) (Huck et al., 2012).

The most common correlation coefficient is the Pearson correlation coefficient, used to measure the relationship between two interval variables (Huck et al., 2012; Kometa, 2016).  The Pearson correlation is designed for situations where each of the two variables is quantitative and each variable is measured to produce raw scores (Huck et al., 2012).  Spearman's Rho is the second most popular bivariate correlational technique, where each of the two variables is measured to produce ranks, with the resulting correlation coefficient symbolized as rs or ρ (rho) (Huck et al., 2012).  Kendall's Tau is similar to Spearman's Rho (Huck et al., 2012).
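
All three coefficients can be obtained in R with the cor() function; a minimal sketch, assuming two numeric vectors x and y:

#Pearson (default), Spearman's rho, and Kendall's tau
cor(x, y, method="pearson")
cor(x, y, method="spearman")
cor(x, y, method="kendall")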

Regression

When dealing with correlation and association between statistical variables, the variables are treated in a symmetric way. However, when dealing with the variables in a non-symmetric way, a predictive model for one or more response variables can be derived from one or more of the others (Giudici, 2005).  Linear Regression is a predictive data mining method (Giudici, 2005; Perugachi-Diaz & Knapik, 2017).

Linear Regression is described as the most important prediction method for continuous variables, while Logistic Regression is the main prediction method for qualitative variables (Giudici, 2005).  Cluster analysis is different from Logistic Regression and Tree Models: in cluster analysis the clustering is unsupervised and is measured with no reference variables, while in Logistic Regression and Tree Models the clustering is supervised and is measured against a reference variable, such as a response whose levels are known (Giudici, 2005).

Linear Regression is used to examine and predict data by modeling the relationship between the dependent variable, also called the "response" variable, and the independent variable, also known as the "explanatory" variable.  The purpose of Linear Regression is to find the best statistical relationship between these variables, either to predict the response variable or to examine the relationship between the variables (Perugachi-Diaz & Knapik, 2017).

Bivariate Linear Regression can be used to evaluate whether one variable, called the dependent variable or the response, can be caused, explained, and therefore predicted as a function of another variable, called the independent variable, the explanatory variable, the covariate, or the feature (Giudici, 2005).  Y is used for the dependent or response variable, and X is used for the independent or explanatory variable (Giudici, 2005).  Linear Regression is the simplest statistical model which can describe Y as a function of X (Giudici, 2005).  The Linear Regression model specifies a "noisy" linear relationship between the variables Y and X, and for each paired observation (xi, yi), the following Regression Function is used (Giudici, 2005; Schumacker, 2015):

yi = a + b·xi + ei

Where: 

  • i = 1, 2, …n
  • a = The intercept of the regression function.
  • b = The slope coefficient of the regression function also called the regression coefficient.
  • ei = the random error of the regression function, relative to the ith observation.

The Regression Function has two main elements: the Regression Line and the Error Term.  The Regression Line can be developed empirically, starting from the matrix of available data.  The Error Term describes how well the regression line approximates the observed response variable.  The determination of the Regression Line can be described as a problem of fitting a straight line to the observed dispersion diagram, where the Regression Line is the linear function given by the following formula (Giudici, 2005):

ŷi = a + b·xi

Where: 

ŷi = the fitted ith value of the dependent variable, calculated on the basis of the ith value of the explanatory variable xi.

The simple formula of the Regression Line, as indicated in (Bernard, 2011; Schumacker, 2015), is as follows:

y = a + bx

Where: 

  • y = the value of the dependent variable.
  • a and b = constants (the intercept and the slope, respectively).
  • x = the value of the independent variable.

The Error Term ei in the expression of the Regression Function represents, for each observation yi, the residual, namely the difference between the observed response value yi and the corresponding value fitted with the Regression Line, using the following formula (Giudici, 2005):

ei = yi − ŷi

Each residual can be interpreted as the part of the corresponding value that is not explained by the linear relationship with the explanatory variable.  To obtain the analytic expression of the regression line, it is sufficient to calculate the parameters a and b on the basis of the available data.  The method of least squares is often used for this: it chooses the straight line which minimizes the sum of squares of the errors of the fit (SSE), defined by the following formula (Giudici, 2005):

SSE = Σ ei² = Σ (yi − ŷi)²
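
A minimal R sketch of the least squares solution, using the closed-form expressions for b and a that minimize SSE (the vectors x and y are assumptions for illustration):

#Closed-form least squares estimates of the slope and intercept
b <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
a <- mean(y) - b * mean(x)

#The same estimates obtained from lm()
coef(lm(y ~ x))

#Sum of squared errors of the fit
SSE <- sum((y - (a + b * x))^2)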

Figure 2 illustrates the representation of the regression line.

Figure 2.  Representation of the Regression Line (Giudici, 2005).

General Least Square Model (GLM) for Regression and Correlations

The Linear Regression is based on the Gauss-Markov theorem, which states that if the errors of prediction are independently distributed, sum to zero and have constant variance, then the least squares estimation of the regression weight is the best linear unbiased estimator of the population (Schumacker, 2015).   The Gauss-Markov theorem provides the rule that justifies the selection of a regression weight based on minimizing the error of prediction, which gives the best prediction of Y, which is referred to as the least squares criterion, that is, selecting regression weights based on minimizing the sum of squared errors of prediction (Schumacker, 2015). The least squares criterion is sometimes referred to as BLUE, or Best Linear Unbiased Estimator (Schumacker, 2015).

Several assumptions are made when using Linear Regression, among which is one crucial assumption known as the "independence assumption," which is satisfied when the observations are taken on subjects which are not related in any sense (Perugachi-Diaz & Knapik, 2017).  Under this assumption, the errors of the data can be assumed to be independent (Perugachi-Diaz & Knapik, 2017).  If this assumption is violated, the errors may be dependent, and the quality of statistical inference may not follow from the classical theory (Perugachi-Diaz & Knapik, 2017).

Regression works by trying to fit a straight line through the data points so that the overall distance between the points and the line is minimized, using the statistical method called least squares.  Figure 3 illustrates an example of a scatter plot of two variables, e.g., English and Maths scores (Muijs, 2010).

Figure 3. Example of a Scatter Plot of two Variables, e.g. English and Maths Scores (Muijs, 2010).

In Pearson's correlation, r measures how much changes in one variable correspond with equivalent changes in the other variable (Bernard, 2011).  It can also be used as a measure of association between an interval and an ordinal variable, or between an interval and a dummy variable, which is a nominal variable coded as 1 or 0 (present or absent) (Bernard, 2011).  The square of Pearson's r, or r-squared, is a PRE (proportionate reduction of error) measure of association for linear relations between interval variables (Bernard, 2011).  It indicates how much better the scores of a dependent variable can be predicted if the scores of some independent variables are known (Bernard, 2011).  Each dot illustrated in Figure 4 is physically distant from the dotted mean line by a certain amount.  The sum of the squared distances to the mean is the smallest sum possible, which is the smallest cumulative prediction error given that only the mean of the dependent variable is known (Bernard, 2011).  The distances from the dots above the line to the mean are positive; the distances from the dots below the line to the mean are negative (Bernard, 2011).  The sum of the actual distances is zero; squaring the distances gets rid of the negative numbers (Bernard, 2011).  The solid line that runs diagonally through the graph in Figure 4 minimizes the prediction error for these data.  This line is called the best fitting line, the least squares line, or the regression line (Bernard, 2011).

Figure 4.  Example of a Plot of Data of TFR and “INFMORT” for Ten countries (Bernard, 2011).

Transformation of Variables for Linear Regression

The transformation of the data can involve transforming the data matrix into univariate and multivariate frequency distributions (Giudici, 2005).  It can also involve a process to simplify the statistical analysis and the interpretation of the results (Giudici, 2005).  For instance, when the p variables of the data matrix are expressed in different measurement units, it is a good idea to put all the variables into the same measurement unit so that the different measurement scales do not affect the results (Giudici, 2005).  This transformation can be implemented using a linear transformation to standardize the variables, taking away the average of each one and dividing it by the square root of its variance (Giudici, 2005).  There are other data transformations, such as the non-linear Box-Cox transformation (Giudici, 2005).

The transformation of the data is also a method of solving problems with data quality, perhaps because items are missing or because there are anomalous values, known as outliers (Giudici, 2005).  There are two primary approaches to deal with missing data; remove it, or substitute it using the remaining data (Giudici, 2005).  The identification of anomalous values requires a formal statistical analysis; an anomalous value can seldom be eliminated as its existence often provides valuable information about the descriptive or predictive model connected to the data under examination (Giudici, 2005). 

The underlying concept behind the transformation of variables is to correct for distributional problems, outliers, lack of linearity, or unequal variances (Field, 2013).  The transformation of variables changes the form of the relationships between variables, but the relative differences between people for a given variable stay the same; thus, those relationships can still be quantified (Field, 2013).  However, it does change the differences between different variables because it changes the units of measurement (Field, 2013).  Thus, in the case of a relationship between variables, e.g., regression, the transformation is applied to the problematic variable.  However, in the case of differences between variables, such as a change in a variable over time, the transformation is applied to all of those variables (Field, 2013).

There are various transformation techniques to correct various problems.  The Log Transformation (log(Xi)) can be used to correct for positive skew, positive kurtosis, unequal variances, and lack of linearity (Field, 2013).  The Square Root Transformation (√Xi) can be used to correct for positive skew, positive kurtosis, unequal variances, and lack of linearity (Field, 2013).  The Reciprocal Transformation (1/Xi) can be used to correct for positive skew, positive kurtosis, and unequal variances (Field, 2013).  The Reverse Score Transformation can be used to correct for negative skew (Field, 2013).  Table 1 summarizes these types of transformation and their correction use.

Table 1.  Transformation of Data Methods and their Use. Adapted from (Field, 2013).
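
A minimal R sketch of these transformations, assuming a numeric vector x with strictly positive values:

#Log, square root, and reciprocal transformations
x.log   <- log(x)
x.sqrt  <- sqrt(x)
x.recip <- 1 / x

#One common form of the reverse score transformation (subtract each score from the maximum plus 1)
x.rev <- (max(x) + 1) - x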

Procedures in R for Linear Regressions

In R, the "stats" package contains two different functions which can be used to estimate the intercept and slope in the linear regression equation (Schumacker, 2015).  These two functions are lm() and lsfit() (Schumacker, 2015).  The lm() function uses a data frame, while lsfit() uses a matrix or data vector.  The lm() function outputs an intercept term, which has meaning when interpreting results in linear regression.  The lm() function can also specify an equation with no intercept, of the form shown below (Schumacker, 2015).

Example of lm() function with intercept on y as dependent variable and x as independent variable: 

  • LReg = lm(y ~ x, data = dataframe).

Example of lm() function with no intercept on y as dependent variable and x as independent variable: 

  • LReg = lm(y ~ 0 + x, data=dataframe) or
  • LReg = lm(y ~ x - 1, data = dataframe)

 The expectation when using the lm() function is that the response variable data is distributed normally (Hodeghatta & Nayak, 2016).  However, the independent variables are not required to be normally distributed (Hodeghatta & Nayak, 2016).  Predictors can be factors (Hodeghatta & Nayak, 2016).

#cor() function to find the correlation between variables

cor(x,y)

#To build linear regression model with R

model <- lm(y ~ x, data=dataset)

References

Bernard, H. R. (2011). Research methods in anthropology: Qualitative and quantitative approaches: Rowman Altamira.

Field, A. (2013). Discovering Statistics using IBM SPSS Statistics: Sage publications.

Giudici, P. (2005). Applied data mining: statistical methods for business and industry: John Wiley & Sons.

Hodeghatta, U. R., & Nayak, U. (2016). Business Analytics Using R-A Practical Approach: Springer.

Huck, S. W., Cormier, W. H., & Bounds, W. G. (2012). Reading statistics and research (6th ed.): Harper & Row New York.

Kometa, S. T. (2016). Getting Started With IBM SPSS Statistics for Windows: A Training Manual for Beginners (8th ed.): Pearson.

Muijs, D. (2010). Doing quantitative research in education with SPSS: Sage.

Perugachi-Diaz, Y., & Knapik, B. (2017). Correlation in Linear Regression.

Schumacker, R. E. (2015). Learning statistics using R: Sage Publications.

Quantitative Analysis of “Births2006.smpl” Dataset Using R-Programming

Dr. Aly, O.
Computer Science

Abstract

The purpose of this project is to analyze the selected dataset births2006.smpl.  The dataset is part of the R library "nutshell."  The project is divided into two main parts.  Part-I evaluates and examines the dataset to understand it using R; it involves five significant tasks for the examination of the dataset.  Part-II addresses the data analysis of the dataset and involves nine significant tasks.  The first eight tasks involve the code and the results, with plot graphs and bar charts for analysis; Task-9 is the last task of Part-II, devoted to discussion and analysis.  The most notable results include the higher number of births during the working days of Tuesday through Thursday compared with the weekend, and the domination of the vaginal delivery method over the C-section.  The results also show that the average birth weight increases among male babies for quintuplet or higher births, while the trend continues to decline among female babies.  The researcher recommends further statistical significance and effect size tests to verify these results and examine the interaction among specific variables such as birth weight and Apgar score.

Keywords: Births2006.smpl; Box Plot and Graphs Analysis Using R.

Introduction

This project examines and analyzes the dataset births2006.smpl, which is part of the nutshell package of RStudio.  This dataset contains information on babies born in the United States in the year 2006.  The source of this dataset is (https://www.cdc.gov/NCHS/data_access/VitalStatsOnline.htm).  There is only one record per birth.  The dataset is a random ten percent sample of the original data (RDocumentation, n.d.).  The package which is required for this dataset is called "nutshell" in R.  The dataset contains 427,323 records, as shown below.  There are two parts.  Part-I addresses five tasks to examine and understand the dataset using R before the analysis.

Part-II addresses the analysis using R and includes the following nine tasks.  The first eight tasks are followed by the discussion and analysis of the results in Task-9.

  • Task-1: The first five records of the dataset.
  • Task-2: The Number of Birth in 2006 per day of the week in the U.S.
  • Task-3: The Number of Birth per Delivery Method and Day of Week in 2006 in the U.S.
  • Task-4: The Number of Birth based on Birth Weight and Single or Multiple Birth Using Histogram.
  • Task-5: The Number of Birth based on Birth Weight and Delivery Method Using Histogram.
  • Task-6: Box Plot of Birth Weight Per Apgar Score.
  • Task-7: Box Plot of Birth Weight Per Day of Week.
  • Task-8: The Average of Birth Weight Per Multiple Births by Gender.
  • Task-9:  Discussion and Analysis.

Part-I:  Understanding and Examining the Dataset “births2006.smpl”

Task-1: Install Nutshell Package

The purpose of this task is to install the nutshell package which is required for this project. The births2006.smpl dataset is part of Nutshell package in R. 

  • Command: >install.packages("nutshell")
  • Command: >library(nutshell)

Task-2: Understand the Variables of the Dataset

The purpose of this task is to understand the variables of the dataset.  The dataset is documented in R (RDocumentation, n.d.).  The main dataset is called "births2006.smpl" and includes thirteen variables, as shown in Table 1.

Table 1. The Variables of the Dataset of births2006.smpl.

As noted in the Introduction, this dataset contains one record per birth for babies born in the United States in 2006 and is a random ten percent sample of the original data (RDocumentation, n.d.).  The dataset contains 427,323 records, as shown by the command below.

  • Command:> nrow(births.dataframe)

Task-3: Examine the Datasets Using R

The purpose of this task is to examine the dataset using the R console.  The command primarily used in this section is summary(), to better understand the dataset.

  • Command: >summary(births2006.smpl)

Task-4: Create a Data frame to represent the dataset of births2006.smpl.

            The purpose of this task is to create a data frame for the dataset.

  • Command:  >births.dataframe <- data.frame(births2006.smpl)

Task-5: Examine the Content of the Data frame using head(), names(), colnames(), and dim() functions.

            The purpose of this task is to examine the content of the data frame using the functions of the head(), names(), colnames(), and dim().

  • Command: >head(births.dataframe)
  • Command: >names(births.dataframe)
  • Command: >colnames(births.dataframe)
  • Command: >dim(births.dataframe)

Part-II: Birth Dataset Tasks and Analysis

Task-1: The First Five records of the dataset.

            The purpose of this task is to display the first five records using head() function.   

Task-2: The Number of Birth in 2006 per day of the week in the U.S.

The purpose of this task is to display a bar chart of the “frequency” of births according to the day of the week of the birth.  
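
A minimal sketch of the command assumed for this task; DOB_WK is taken as the day-of-week column of the data frame created in Part-I:

#Frequency of births by day of week (1 = Sunday, ..., 7 = Saturday)
births.dow <- table(births.dataframe$DOB_WK)
barplot(births.dow, xlab="Day of Week", ylab="Number of Births")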

Figure 1.  Frequency of Birth in 2006 per day of the week in the United States.

Task-3: The Number of Births Per Delivery Method and Day of Week in 2006 in the U.S.

The purpose of this task is to show a bar chart of the “frequency” for two-way classification of birth according to the day of the week and the method of the delivery (C-section or Vaginal).
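
A minimal sketch, assuming DOB_WK (day of week) and DMETH_REC (delivery method) as the relevant columns:

#Two-way classification: delivery method by day of week
dow.dmeth <- table(births.dataframe$DMETH_REC, births.dataframe$DOB_WK)
barplot(dow.dmeth, beside=TRUE, legend.text=TRUE, xlab="Day of Week", ylab="Number of Births")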

Figure 2.  The Number of Births Per Delivery Method and Day of Week in 2006 in the US.

Task-4: The Number of Birth based on Birth Weight and Single or Multiple Birth Using Histogram.

The purpose of this task is to use "lattice" (trellis) graphs, via the lattice R package, to condition density histograms on the value of a third variable.  In this task, the multiple-births variable is the conditioning variable, and separate histograms of birth weight are drawn according to its levels, as sketched below.
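
A minimal sketch, assuming DBWT (birth weight) and DPLURAL (single or multiple birth) as the relevant columns:

#Density histograms of birth weight conditioned on single or multiple birth
library(lattice)
histogram(~DBWT | DPLURAL, data=births.dataframe)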

Figure 3.  The Number of the Birth based on Weight and Single or Multiple Birth.

Task-5: The Number of Birth based on Birth Weight and Delivery Method Using Histogram.

The purpose of this task is to use "lattice" (trellis) graphs, via the lattice R package, to condition density histograms on the value of a third variable.  In this task, the delivery-method variable is the conditioning variable, and separate histograms of birth weight are drawn according to its levels, as sketched below.
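
A minimal sketch, assuming DBWT (birth weight) and DMETH_REC (delivery method) as the relevant columns:

#Density histograms of birth weight conditioned on delivery method
library(lattice)
histogram(~DBWT | DMETH_REC, data=births.dataframe)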

Figure 4.  The Number of the Birth based on Birth Weight and Delivery Method.

Task-6: Box Plot of Birth Weight Per Apgar Score

The purpose of this task is to create a box plot of birth weight against Apgar score; box plots of birth weight by day of the week of delivery follow in Task-7.
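
A minimal sketch, assuming APGAR5 is the five-minute Apgar score column:

#Box plot of birth weight for each Apgar score
boxplot(DBWT ~ APGAR5, data=births.dataframe, xlab="Apgar Score", ylab="Birth Weight (grams)")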

Figure 5.  Box Plot of Birth Weight Per Apgar Score.

Task-7: Box Plot of Birth Weight Per Day of the Week

The purpose of this task is to create a box plot of birth weight per day of the week.
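
A minimal sketch, using the same assumed column names:

#Box plot of birth weight by day of the week
boxplot(DBWT ~ DOB_WK, data=births.dataframe, xlab="Day of Week", ylab="Birth Weight (grams)")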

Figure 6.  Box Plot of Birth Weight Per day of the Week.

Task-8: The Average of Birth Weight Per Multiple Births by Gender.

The purpose of this task is to calculate the average birth weight as a function of multiple births for males and females separately.  In this task, the tapply function is used, and the option na.rm=TRUE is used for missing values.
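
A minimal sketch, assuming SEX and DPLURAL as the gender and multiple-births columns:

#Average birth weight by multiple births and gender, ignoring missing values
avg.wt <- tapply(births.dataframe$DBWT, list(births.dataframe$DPLURAL, births.dataframe$SEX), mean, na.rm=TRUE)
barplot(t(avg.wt), beside=TRUE, legend.text=TRUE, ylab="Average Birth Weight (grams)")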

Figure 7.  Bar Plot of Average Birth Weight Per Multiple Births by Gender.

 Task-9: Discussion and Data Analysis

For the number of births in 2006 per day of the week in the United States, where the days are coded from Sunday (1) through Saturday (7), the result (Figure 1) shows that the highest numbers of births, which are very close to one another, occur on the working days 3, 4, and 5 (Tuesday, Wednesday, and Thursday, respectively).  The lowest number of births is observed on day 1 (Sunday), followed by day 7 (Saturday), day 2 (Monday), and day 6 (Friday).

For the number of births per delivery method (C-section vs. vaginal) and day of the week in 2006 in the United States, the result (Figure 2) shows that the vaginal method dominates the delivery methods and has the highest counts on all weekdays in comparison with the C-section.  Within the vaginal method, the highest numbers of births per day again fall on the working days of Tuesday, Wednesday, and Thursday, and the lowest numbers on Sunday, followed by Saturday, Monday, and Friday.  The highest number of C-section births is observed on Friday, followed by Tuesday through Thursday, while the lowest numbers of C-section births per day are still on Sunday, followed by Saturday and Monday.

For the number of births based on birth weight and single or multiple births (twin, triplet, quadruplet, and quintuplet or higher), the result (Figure 3) shows that the single-birth frequency has an almost normal distribution.  However, for higher-order births such as twin, triplet, quadruplet, and quintuplet or higher, the distribution moves further toward the left, indicating lower weight.  Thus, this result suggests that multiple births (twin, triplet, quadruplet, and quintuplet or higher) have lower birth weights on average.

For the number of births based on birth weight and delivery method, the result (Figure 4) shows that the vaginal and C-section methods have almost the same distribution.  However, the vaginal method shows a higher percent of the total than the C-section.  The unknown delivery method shows almost the same pattern of distribution as the vaginal and C-section methods.  More analysis is required to determine the effect of birth weight on the delivery method and the rate of births.

The Apgar score is a scoring system used by doctors and nurses to evaluate newborns one minute and five minutes after the baby is born (Gill, 2018).  The Apgar scoring system is divided into five categories: activity/muscle tone, pulse/heart rate, grimace, appearance, and respiration/breathing.  Each category receives a score of 0 to 2 points, so at most a child will receive an overall score of 10 (Gill, 2018).  However, a baby rarely scores a 10 in the first few moments of life, because most babies have blue hands or feet immediately after birth (Gill, 2018).  For the birth weight per Apgar score, the result (Figure 5) shows that the median birth weight is almost the same, or very close, for Apgar scores of 3-10.  The medians for Apgar scores of 0 and 2 are close to each other, while the lowest median belongs to Apgar score 1, all within the birth weight range of 0-2000 grams.  For birth weights from 2000-4000 grams, the medians are close to each other for Apgar scores from 3-10, at almost ~3000 grams.  The spread of the birth weight distribution also varies: for lower Apgar scores it ranges between ~1500 and 2300 grams, while closer to Apgar score 10 it moves between ~2500 and ~3000 grams.  There are outliers in the distribution for Apgar scores 8 and 9; these outliers show heavyweight babies above 6000 grams with Apgar scores of 8-9.  As the Apgar score increases, there are more outliers than in the distributions of lower Apgar scores.  Thus, more analysis using statistical significance tests and effect size can be performed for further investigation of the interaction of these two variables.

For the birth weight per day of the week, the result (Figure 6) shows that there is a normal distribution across the seven days of the week.  The median of the birth weight for all days is almost the same.  The minimum, the maximum, and the range of the birth weight also show a similar distribution among the days of the week.  However, there are outliers in the birth weight for the working days of Tuesday, Wednesday, and Thursday, and additional outliers on Monday as well as on Saturday, though fewer than on the working days of Tuesday-Thursday.  This result indicates that there is no relationship between the birth weight and the days of the week, as the heavyweight babies above 6000 grams reflected in the outliers tend to occur with no regard to the day of the week.

For the average birth weight per multiple births by gender, the result (Figure 7) shows that single births have the highest birth weight for males and females, at ~3500 grams.  The birth weight tends to decrease for "twin" and "triplet" births for both males and females.  For "quadruplet" births, the birth weight decreases for females and decreases even more for males.  The most notable result is seen in male babies, where the average birth weight increases for the "quintuplet or higher" category, while the birth weight for females continues to decline for the same category.  This result confirms the impact of multiple births on birth weight, as discussed earlier and illustrated in Figure 3.

In summary, the analysis of the births2006.smpl dataset using R indicates that the frequency of births tends to be concentrated more on the working days than the weekends, and that the vaginal method tends to dominate the delivery methods.  Moreover, the frequency of births based on birth weight and single or multiple births shows that single births have a more normal distribution than the other, multiple births.  The vaginal and C-section methods show almost similar distributions.  The birth weight per Apgar score is between ~2500-3000 grams and close among Apgar scores of 8-10.  The day of the week does not show any difference in birth weight.  Moreover, the birth weight per gender shows that the birth weight tends to decrease with multiple births among females and males, except for the quintuplet-or-higher category, where it continues to decrease in females while it increases in males.  This result of increasing birth weight among male births for quintuplet or higher requires more investigation to evaluate the reasons and causes for such an increase in the birth weight.  The researcher recommends further statistical significance and effect size tests to verify these results.

Conclusion

This project analyzed the selected dataset births2006.smpl.  The dataset is part of the R library "nutshell."  The project was divided into two main parts.  Part-I evaluated and examined the dataset to understand it using R and involved five major tasks.  Part-II addressed the data analysis of the dataset, which involved nine major tasks.  The first eight tasks involved the code and the results, with plot graphs and bar charts for analysis, and the discussion and analysis were addressed in Task-9.  The most notable results showed that the number of births increases during the working days of Tuesday through Thursday compared with the weekend, and that the vaginal method dominates over the C-section.  The results also showed that the average birth weight increases among male babies for quintuplet or higher births, while the trend continues to decline among female babies.  The researcher recommends further statistical significance and effect size tests to verify these results and examine the interaction among certain variables such as birth weight and Apgar score.

References

Gill, K. (2018). Apgar Score: What You Should Know. Retrieved from https://www.healthline.com/health/apgar-score#apgar-rubric.

RDocumentation. (n.d.). Births in the United States, 2006: births2006.smpl dataset. Retrieved from https://www.rdocumentation.org/packages/nutshell/versions/2.0/topics/births2006.smpl.

 

Machine Learning: Supervised Learning

Dr. Aly, O.
Computer Science

Introduction

The purpose of this discussion is to discuss supervised learning and how it can be used with large datasets to overcome the problem in which everything appears statistically significant.  The discussion also addresses the importance of a clear purpose for supervised learning and the use of random sampling.

Supervised Learning (SL) Algorithm

According to (Hall, Dean, Kabul, & Silva, 2014), SL "refers to techniques that use labeled data to train a model."  It comprises "Prediction" ("Regression") algorithms and "Classification" algorithms.  The "Regression" or "Prediction" algorithm is used for "interval labels," while the "Classification" algorithm is used for "class labels" (Hall et al., 2014).  In the SL algorithm, the training data, represented in observations, measurements, and so forth, are associated with labels reflecting the class of the observations (Han, Pei, & Kamber, 2011).  New data is classified based on the "training set" (Han et al., 2011).

The "Predictive Modeling" (PM) operation of "Data Mining" uses the same concept as human learning, using observation to formulate a model of specific characteristics and phenomena (Coronel & Morris, 2016).  The analysis of an existing database to determine the essential characteristics ("model") of the data set can be implemented using the PM operation (Coronel & Morris, 2016).  The SL algorithm develops these key characteristics represented in a "model" (Coronel & Morris, 2016).  The SL approach has two phases: (1) the Training Phase and (2) the Testing Phase.  In the "Training Phase," a model is developed using a large sample of historical data called the "Training Set."  In the "Testing Phase," the model is tested on new, previously unseen data to determine its accuracy and performance characteristics.  The PM operation involves two approaches: (1) the Classification Technique and (2) the Value Prediction Technique (Connolly & Begg, 2015).  The nature of the predicted variables distinguishes the two techniques of classification and value prediction (Connolly & Begg, 2015).

The "Classification Technique" involves two specializations of classification: (1) "Tree Induction" and (2) "Neural Induction," which are used to develop a predetermined class for each record in the database from a set of possible class values (Connolly & Begg, 2015).  The application of this approach can answer questions such as "What is the probability that customers who are renting will be interested in purchasing a home?"

The “Value Prediction,” on the other hand, implements the traditional statistical methods of (1) “Linear Regression,” and (2) “Non-Linear Regression” which are used to estimate a continuous numeric value that is associated with a database record (Connolly & Begg, 2015).  The application of this approach can be used for “Credit Card Fraud Detection,” and “Target Mailing List Identification” (Connolly & Begg, 2015).  The limitation of this approach is that the “Linear Regression” works well only with “Linear Data” (Connolly & Begg, 2015). The application of the PM operation includes the (1) “Customer Retention Management,” (2) “Credit Approval,” (3) “Cross-Selling,” and (4) “Direct Marketing” (Connolly & Begg, 2015).  Furthermore, the Supervised methods such as Linear Regression or Multiple Linear Regression can be used if there exists a strong relationship between a response variable and various predictors (Hodeghatta & Nayak, 2016).   

Clear Purpose of Supervised Learning

The purpose of the supervised learning must be clear before the implementation of the data mining process.  The data mining process involves six steps, according to (Dhawan, 2014), as follows.

  • The first step includes the exploration of the data domain.  To achieve the expected result, understanding and grasping the domain of the application assists in accumulating better data sets, which in turn determine the data mining technique to be applied.
  • The second phase includes the data collection.  In the data collection stage, all data mining algorithms are implemented on some data sets.  
  • The third phase involves the refinement and the transformation of the data.  In this stage, the datasets are further refined to remove any noise, outliers, missing values, and other inconsistencies.  The refinement of the data is followed by the transformation of the data for further processing for analysis and pattern extraction.
  • The fourth step involves the feature selection.  In this stage, relevant features are selected to apply further processing. 
  • The fifth stage involves the application of the relevant algorithm.  After the data is acquired, cleaned and features are selected, in this step, the algorithm is selected to process the data and produce results.  Some of the commonly used algorithms include (1) clustering algorithm, (2) association rule mining algorithm, (3) decision tree algorithm, and (4) sequence mining algorithm. 
  • The last phase involves the observation, the analysis, and the evaluation of the data.  In this step, the purpose is to find a pattern in the result produced by the algorithm.  The conclusion is typically based on the observation and evaluation of the data.

Classification is one of the data mining techniques.  Classification-based data mining is the cornerstone of machine learning in artificial intelligence (Dhawan, 2014).  The process in Supervised Classification begins with given sample data, also known as a training set, which consists of multiple entries, each with multiple features.  The purpose of this supervised classification is to analyze the sample data and to develop an accurate understanding or model for each class using the attributes present in the data.  This supervised classification is then used to classify and label test data.  Thus, a precise purpose for the supervised classification is critical to analyzing the sample data and developing an accurate model for each class using the attributes present in the data.  Figure 1 illustrates the supervised classification technique in data mining as depicted in (Dhawan, 2014).

Figure 1:  Linear Overview of steps involved in Supervised Classification (Dhawan, 2014)

The conventional techniques employed in Supervised Classification involve the well-known algorithms of (1) Bayesian Classification, (2) Naïve Bayesian Classification, (3) Robust Bayesian Classifier, and (4) Decision Tree Learning.

Various Types of Sampling

A sample of records can be taken for any analysis unless the dataset is drawn from a big data infrastructure (Hodeghatta & Nayak, 2016).  A randomization technique should be used, and steps must be taken to ensure that all the members of a population have an equal chance of being selected (Hodeghatta & Nayak, 2016).  This method is called probability sampling.  There are various variations on this sampling type: Random Sampling, Stratified Sampling, and Systematic Sampling (Hodeghatta & Nayak, 2016), as well as cluster and multi-stage sampling (Saunders, 2011).  In Random Sampling, a sample is picked randomly, and every member has an equal opportunity to be selected.  In Stratified Sampling, the population is divided into groups, and data is selected randomly from each group or stratum.  In Systematic Sampling, members are selected systematically, for instance, every tenth member of that particular time or event (Hodeghatta & Nayak, 2016).  The most appropriate sampling technique to obtain a representative sample should be chosen based on the research question(s) and the objectives of the research study (Saunders, 2011).
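
A minimal R sketch of the three probability sampling variations described above, assuming a data frame named dataset with a grouping column named strata (both names are hypothetical):

#Random sampling: 100 rows, each row with an equal chance of selection
random.sample <- dataset[sample(nrow(dataset), 100), ]

#Systematic sampling: every tenth row after a random start
start <- sample(1:10, 1)
systematic.sample <- dataset[seq(start, nrow(dataset), by=10), ]

#Stratified sampling: up to 10 rows selected at random within each stratum
stratified.sample <- do.call(rbind, lapply(split(dataset, dataset$strata),
                                           function(g) g[sample(nrow(g), min(10, nrow(g))), ]))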

In summary, supervised learning is comprised of Prediction or Regression, and Classification. In both approaches, a clear understanding of the SL is critical to analyze the sample data and develop an accurate understanding or model for each class using the attributes present in the data.  There are various types of sampling:  random, stratified and systematic.  The most appropriate sampling technique to obtain a representative sample should be implemented based on the research question(s) and the objectives of the research study. 

References

Connolly, T., & Begg, C. (2015). Database Systems: A Practical Approach to Design, Implementation, and Management (6th Edition ed.): Pearson.

Coronel, C., & Morris, S. (2016). Database systems: design, implementation, & management: Cengage Learning.

Dhawan, S. (2014). An Overview of Efficient Data Mining Techniques. Paper presented at the International Journal of Engineering Research and Technology.

Hall, P., Dean, J., Kabul, I. K., & Silva, J. (2014). An Overview of Machine Learning with SAS® Enterprise Miner™. SAS Institute Inc.

Han, J., Pei, J., & Kamber, M. (2011). Data mining: concepts and techniques: Elsevier.

Hodeghatta, U. R., & Nayak, U. (2016). Business Analytics Using R-A Practical Approach: Springer.

Saunders, M. N. (2011). Research methods for business students, 5/e: Pearson Education India.