Quantitative Analysis of “State.x77” Dataset Using R-Programming

Dr. Aly, O.
Computer Science

Introduction

This discussion continues the work with R, using the state.x77 dataset. The dataset is converted to a data frame, and a linear regression is then performed on it. The commands used are derived from the R reference manual (r-project.org, 2018). There are four major tasks. Task-1 examines the dataset. Task-2 creates the data frame. Task-3 examines the data frame. Task-4 investigates the data frame using linear regression analysis; it is the most comprehensive task, covering the R commands, their results, and the analysis of those results.

Task-1: Understand and Examine the dataset:

The purpose of this task is to understand and examine the dataset. The help page opened by the ?state.x77 command describes the variables, and the remaining commands below summarize and inspect the data:

Command: > ?state.x77
Command: > summary(state.x77)
Command: > head(state.x77)
Command: > dim(state.x77)
Command: > list(state.x77)

The state.x77 dataset has 50 rows (one per U.S. state) and 8 columns, each column holding one statistic for the respective state.
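The structure described above can be checked directly. The following is a minimal sketch that uses only base R; state.x77 ships with the built-in datasets package, so nothing needs to be downloaded.

```r
# state.x77 is a numeric matrix, not a data frame.
class(state.x77)

# Number of rows (states) and columns (variables).
dim(state.x77)

# The eight variable names; note that two of them contain a space.
colnames(state.x77)

# Rows are labelled by state name.
head(rownames(state.x77))
```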

##The first 10 values of Income, Illiteracy, and Murder.

state.x77[1:10, "Income"]
state.x77[1:10, "Illiteracy"]
state.x77[1:10, "Murder"]

The descriptive statistics (minimum, 1st quartile, median, mean, 3rd quartile, and maximum) of the Income, Illiteracy, and Population variables:

Command: > summary(state.x77[, "Income"])
Command: > summary(state.x77[, "Illiteracy"])
Command: > summary(state.x77[, "Population"])
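The three summaries can also be collected into a single table for side-by-side comparison. This is a minimal sketch using base R only; the names vars and stats.table are hypothetical and chosen here for illustration.

```r
# Variables of interest (hypothetical helper name).
vars <- c("Income", "Illiteracy", "Population")

# Convert only the selected columns to a data frame, then apply summary()
# to each column; sapply() binds the results into one matrix.
stats.table <- sapply(as.data.frame(state.x77[, vars]), summary)

# Rows are the six summary statistics, columns are the three variables.
stats.table
```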

Task-2: Create a Data Frame

Command: > state.x77.df <- data.frame(state.x77)
Command: > state.selected.variables <- as.data.frame(state.x77[, c("Murder", "Population", "Illiteracy", "Income", "Frost")])
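One detail worth checking after the conversion is the column names. By default, data.frame() makes names syntactically valid, so the two matrix columns whose names contain a space are renamed. The sketch below illustrates this; full.df and selected.df are hypothetical names used only here.

```r
# data.frame() uses check.names = TRUE by default, so "Life Exp" and
# "HS Grad" become "Life.Exp" and "HS.Grad".
full.df <- data.frame(state.x77)
names(full.df)

# The renamed columns can then be accessed with the $ operator.
full.df$Life.Exp[1:3]

# Selecting a subset of columns before converting, as in the second command.
selected.df <- as.data.frame(
  state.x77[, c("Murder", "Population", "Illiteracy", "Income", "Frost")]
)
str(selected.df)
```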

Task-3: Examine the Data Frame

Command: > list(state.x77.df)

Command: > names(state.x77.df)

Task-4: Linear Regression Model – Commands, Results and Analysis:

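plot(Income ~ Illiteracy, data = state.x77.df)
mean.Income <- mean(state.x77.df$Income, na.rm = TRUE)
abline(h = mean.Income, col = "red")  # horizontal reference line at the mean income
model1 <- lm(Income ~ Illiteracy, data = state.x77.df)
model1  # prints the fitted intercept and slope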

Figure 1. Linear Regression Model for Income and Illiteracy.

Analysis: Figure 1 illustrates the linear regression of Income on Illiteracy. The fitted model shows that income increases as the illiteracy percentage decreases, and vice versa, indicating an inverse relationship between illiteracy and income. The residuals and the fitted line are analyzed further below using the plot() function in R.
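The direction of the relationship can be read from the slope coefficient of the fitted model. The sketch below rebuilds the model so that it runs on its own, extracts the coefficients, and overlays the fitted regression line on the scatter plot; drawing the fitted line in blue is an addition made here for illustration.

```r
# Rebuild the data frame and the model so this example is self-contained.
state.x77.df <- data.frame(state.x77)
model1 <- lm(Income ~ Illiteracy, data = state.x77.df)

# Intercept and slope; a negative slope means income falls as illiteracy rises.
coef(model1)

# Share of the variance in Income explained by Illiteracy.
summary(model1)$r.squared

# Scatter plot with the fitted regression line (blue) and the mean income (red).
plot(Income ~ Illiteracy, data = state.x77.df)
abline(model1, col = "blue")
abline(h = mean(state.x77.df$Income), col = "red")
```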

Command: > plot(model1)

Figure 2. Residuals vs. Fitted in Linear Regression Model for Income and Illiteracy.

Analysis: Figure 2 illustrates the Residuals vs. Fitted plot for the linear regression model of Income as a function of Illiteracy. A residual is the difference between the actual value of the response variable and the value predicted by the regression equation (Hodeghatta & Nayak, 2016). The principle behind the regression line and the regression equation is to minimize this difference, or error, and the median of the residuals is expected to be very near zero (Hodeghatta & Nayak, 2016). For the model to pass the test of linearity, the residuals should show no pattern in their distribution (Hodeghatta & Nayak, 2016). The plot shows the fitted values against the residuals, with a line summarizing the relationship between the two. A straight, horizontal line indicates that the “average residual” is more or less the same for all “fitted values” (Navarro, 2015). For Illiteracy and Income, the line in Figure 2 is curved rather than straight and horizontal, which suggests that a better model could be obtained by adding a quadratic term.
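The definition of a residual and the quadratic-term suggestion can both be tried out in a few lines. This is a sketch under stated assumptions: the name model2 and the choice of anova() to compare the two fits are illustrative additions, not part of the original analysis.

```r
# Rebuild the linear model so this example is self-contained.
state.x77.df <- data.frame(state.x77)
model1 <- lm(Income ~ Illiteracy, data = state.x77.df)

# A residual is the observed response minus the fitted (predicted) response.
res <- residuals(model1)
all.equal(unname(res), state.x77.df$Income - unname(fitted(model1)))

# The median of the residuals, expected to be close to zero.
median(res)

# Hypothetical follow-up model with a quadratic term in Illiteracy.
model2 <- lm(Income ~ Illiteracy + I(Illiteracy^2), data = state.x77.df)

# F-test comparing the nested linear and quadratic models.
anova(model1, model2)
```

A small p-value in the anova() table would support keeping the quadratic term; the outcome should be read from the actual output rather than assumed.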

Figure 3. Normal Q-Q Plot of the Linear Regression Model for Illiteracy and Income.

Analysis: Figure 3 illustrates the Normal Q-Q plot, which is used to test the normality of the residuals (Hodeghatta & Nayak, 2016). The residuals lie almost on the straight line, indicating that they are approximately normally distributed. Hence, the residuals pass the normality test.
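The visual check can be complemented by a numeric one. The sketch below redraws a Q-Q plot of the raw residuals and runs the Shapiro-Wilk test from base R; using this particular test is an assumption made here, as the original analysis relies on the plot alone.

```r
# Rebuild the model so this example is self-contained.
model1 <- lm(Income ~ Illiteracy, data = data.frame(state.x77))
res <- residuals(model1)

# Q-Q plot of the raw residuals with a reference line
# (plot(model1, which = 2) draws the standardized version).
qqnorm(res)
qqline(res)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal.
sw <- shapiro.test(res)
sw$p.value
```

A large p-value means normality is not rejected; it does not prove that the residuals are normal, so the plot and the test are best read together.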

Figure 4. Scale-Location Plot Generated in R to Validate Homoscedasticity for Illiteracy and Income.

Analysis: Figure 4 illustrates the Scale-Location plot, one of the graphs generated by the plot(model1) command above. The points are spread randomly around a nearly horizontal line, which indicates that the assumption of constant error variance (homoscedasticity) is fulfilled (Hodeghatta & Nayak, 2016).
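To see what the Scale-Location plot actually displays, it can be rebuilt by hand: the vertical axis is the square root of the absolute standardized residuals. The sketch below uses base R only; scale.loc is a hypothetical name.

```r
# Rebuild the model so this example is self-contained.
model1 <- lm(Income ~ Illiteracy, data = data.frame(state.x77))

# Square root of the absolute standardized residuals.
scale.loc <- sqrt(abs(rstandard(model1)))

# Plot against the fitted values and add a smoothed trend line;
# a roughly flat trend is consistent with constant error variance.
plot(fitted(model1), scale.loc,
     xlab = "Fitted values",
     ylab = "sqrt(|standardized residuals|)")
lines(lowess(fitted(model1), scale.loc), col = "red")
```

The same graph alone can be requested with plot(model1, which = 3).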

Figure 5. Residuals vs. Leverage Plot Generated in R for the LR Model.

Analysis: Figure 5 illustrates the Residuals vs. Leverage plot generated for the LR model. Unlike in the other diagnostic plots, patterns are not relevant here. Instead, the plot is checked for outlying values in the upper right or lower right corner (Bommae, 2015). Those are the places where a case can be influential against the regression line (Bommae, 2015). Cases that fall outside the Cook’s distance lines have high Cook’s distance scores and are influential to the regression results (Bommae, 2015).
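Cook's distance and leverage can also be listed numerically, which makes it easier to name the influential states. In the sketch below, the cutoff of 4/n is a common rule of thumb and an assumption made here; it is not taken from the original analysis.

```r
# Rebuild the model so this example is self-contained.
model1 <- lm(Income ~ Illiteracy, data = data.frame(state.x77))

# Cook's distance and leverage (hat value) for each state.
cd  <- cooks.distance(model1)
lev <- hatvalues(model1)

# Assumed rule-of-thumb cutoff: 4 divided by the number of observations.
cutoff <- 4 / length(cd)

# States whose Cook's distance exceeds the cutoff, largest first.
sort(cd[cd > cutoff], decreasing = TRUE)

# States with the highest leverage.
head(sort(lev, decreasing = TRUE))
```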

##Better understand the linearity of the relationship represented by the model.

Command: > library(car)  # crPlots() is provided by the car package
Command: > crPlots(model1)

Figure 6. crPlots() Plots for the Linearity of the Relationship between Income and Illiteracy of the Model.

Analysis: Figure 6 shows the output of the crPlots() function, which is used to better understand the linearity of the relationship represented by the model (Hodeghatta & Nayak, 2016). Non-linearity in this plot would require the model to be re-explored (Hodeghatta & Nayak, 2016). Figure 6 shows that the fitted model is linear and confirms the inverse relationship between income and illiteracy identified in Figure 1.
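If the car package is not installed, a similar component-plus-residual view can be approximated in base R. This is a rough sketch, not a reproduction of crPlots(): it plots the partial residuals returned by residuals(..., type = "partial") against the predictor, with a linear fit and a lowess smooth for comparison.

```r
# Rebuild the model so this example is self-contained.
state.x77.df <- data.frame(state.x77)
model1 <- lm(Income ~ Illiteracy, data = state.x77.df)

# Partial residuals: one column per model term (here only Illiteracy).
pr <- residuals(model1, type = "partial")

# Partial residuals against the predictor.
plot(state.x77.df$Illiteracy, pr[, 1],
     xlab = "Illiteracy",
     ylab = "Component + residual (Income)")

# Dashed line: linear fit. Solid line: lowess smooth.
# A smooth that departs clearly from the dashed line hints at non-linearity.
abline(lm(pr[, 1] ~ state.x77.df$Illiteracy), lty = 2)
lines(lowess(state.x77.df$Illiteracy, pr[, 1]), col = "magenta")
```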

##Examine the Correlation between Income and Illiteracy.

Analysis: The correlation result shows a negative association between income and illiteracy as anticipated in the linear regression model.
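The command behind this result is not shown above. A minimal sketch, assuming the Pearson correlation computed with the base R functions cor() and cor.test(), could look as follows.

```r
# Rebuild the data frame so this example is self-contained.
state.x77.df <- data.frame(state.x77)

# Pearson correlation coefficient between Income and Illiteracy.
r <- cor(state.x77.df$Income, state.x77.df$Illiteracy)
r

# Significance test and confidence interval for the correlation.
cor.test(state.x77.df$Income, state.x77.df$Illiteracy)

# In simple linear regression, the squared correlation equals R-squared.
model1 <- lm(Income ~ Illiteracy, data = state.x77.df)
all.equal(r^2, summary(model1)$r.squared)
```

The sign of r matches the sign of the regression slope, which is why the correlation and the regression model point to the same inverse relationship.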

References:

Bommae, K. (2015). Understanding Diagnostic Plots of Linear Regression Analysis. Retrieved from https://data.library.virginia.edu/diagnostic-plots/.

Hodeghatta, U. R., & Nayak, U. (2016). Business Analytics Using R - A Practical Approach. Springer.

Navarro, D. J. (2015). Learning statistics with R: A tutorial for psychology students and other beginners. R package version 0.5.

r-project.org. (2018). R: A Language and Environment for Statistical Computing. Retrieved from https://cran.r-project.org/doc/manuals/r-release/fullrefman.pdf.

Published by Think and Knowledge Tank