Quantitative Analysis of the “Faithful” Dataset Using R-Programming

Dr. O. Aly
Computer Science

Abstract

The purpose of this project is to analyze the faithful dataset. The project is divided into two main parts. Part-I examines the dataset in RStudio in order to understand it and involves five tasks. Part-II analyzes the dataset, after its conversion to a data frame, through nine tasks. The result shows that there is a relationship between Waiting Time until Next Eruption and Eruption Time. For the relationship between the response variable and the independent variable to be considered significant, the probability value (p-value) should fall below a chosen significance level such as 0.001, 0.005, 0.01, or 0.05. As the result shows, the p-value for the coefficient of eruptions is close to 0, i.e., <2e-16. Thus, we reject the null hypothesis that the parameter is not significant to the model and accept the alternative hypothesis that the parameter is significant to the model. We conclude that there is a significant relationship between the response variable Waiting Time and the independent variable Eruptions. The project also compares three regressions: standard linear regression, polynomial regression, and lowess regression. The result shows a similar relationship and similar fitted lines across the three models.

Keywords: R-Dataset; Faithful Dataset; Regression Analysis Using R.

Introduction

This project examines and analyzes the dataset faithful.csv (oldfaithful.csv). The dataset is downloaded from http://vincentarelbundock.github.io/Rdatasets/. The dataset has 272 observations on two variables: eruptions and waiting. It describes the waiting time between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA. A closer look at faithful$eruptions reveals that these are heavily rounded times originally in seconds, where multiples of 5 are more frequent than would be expected under non-human measurement. The eruptions variable is numeric and gives the eruption time in minutes. The waiting variable is numeric and gives the waiting time to the next eruption in minutes. A related version of the data, geyser, is in package MASS. There are two parts. Part-I addresses five tasks to examine and understand the dataset using R before the analysis, as follows:

  • Task-1: Install the MASS package.
  • Task-2: Understand the variables of the dataset.
  • Task-3: Examine the variables of the dataset.
  • Task-4: Create a data frame to represent the dataset.
  • Task-5: Examine the content of the data frame.

Part-II addresses the analysis using R and includes the following nine tasks, the last of which is the discussion and analysis of the results:

  • Task-1: The first ten records of the dataset.
  • Task-2: Density Histograms and Smoothed Density Histograms.
  • Task-3: Standard Linear Regression.
  • Task-4: Polynomial Regression.
  • Task-5: Lowess Regression.
  • Task-6: The Summary of the Model.
  • Task-7: Comparison of Linear Regression, Polynomial Regression, and Lowess Regression.
  • Task-8: The Summary of all Models.
  • Task-9: Discussion and Analysis.

Various resources were utilized to develop the required code using R. These resources include (Ahlemeyer-Stubbe & Coleman, 2014; Fischetti, Mayor, & Forte, 2017; Ledolter, 2013; r-project.org, 2018).

Part-I:  Understand and Examine the Dataset “faithful.”

Task-1:  Install MASS Package

The purpose of this task is to install the MASS package, which is used in this project with faithful.csv.

  • Command: > install.packages("MASS")

Task-2:  Understand the Variables of the Data Sets

The purpose of this task is to understand the variables of the dataset.  The dataset is a “faithful” dataset. It describes the Waiting Time between Eruptions and the duration of the Eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA.  The dataset has 272 observations on two main variables:

  • Eruptions: numeric Eruption time in minutes.
  • Waiting: numeric Waiting time to next eruption in minutes.
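The structure described above can be checked directly in R. The following is a minimal sketch, assuming the built-in copy of faithful from base R's datasets package is used rather than the downloaded CSV; the names eruption_secs and share_mult5 are illustrative and not part of the original project. It also gives a rough check of the rounding noted in the introduction by converting eruption times to seconds.

```r
# faithful ships with base R's datasets package, so no download is needed here.
data(faithful)

str(faithful)   # two numeric columns: eruptions and waiting
dim(faithful)   # number of rows and columns

# Convert eruption durations from minutes to whole seconds, then compute
# the share of observations that are multiples of 5 seconds.
eruption_secs <- round(faithful$eruptions * 60)
share_mult5 <- mean(eruption_secs %% 5 == 0)
share_mult5
```

If durations were recorded without rounding, roughly one in five values would be expected to be a multiple of 5 seconds, so a noticeably larger share is consistent with the rounding remark.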

Task-3:  Examine the Variables of the Data Sets

The main dataset is “faithful.csv”, which includes two main variables: eruptions and waiting.

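  • ##Examine the dataset
  • data()
  • ?faithful
  • install.packages("MASS")
  • install.packages("lattice")
  • library(lattice)
  • ##Read the downloaded CSV copy of the dataset
  • faithful <- read.csv("C:/CS871/Data/faithful.csv")
  • ##Load the built-in copy; this replaces the object read from the CSV above
  • data(faithful)
  • summary(faithful)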

Figure 1.  Eruptions and Waiting for Eruption Plots for Faithful dataset.

Task-4: Create a Data Frame to represent the dataset of faithful.

  • ##Create DataFrame
  • faithful.df <- data.frame(faithful)
  • faithful.df
  • summary(faithful.df)

Task-5: Examine the Content of the Data Frame using head(), names(), colnames(), and dim() functions.

  • names(faithful.df)
  • colnames(faithful.df)
  • head(faithful.df)
  • dim(faithful.df)

Part-II: Discussion and Analysis

Task-1:  The first Ten lines of Waiting and Eruptions.

  • ##The first ten lines of Waiting and Eruptions
  • faithful$waiting[1:10]
  • faithful$eruptions[1:10]
  • ##The descriptive analysis of waiting and eruptions
  • summary(faithful$waiting)
  • summary(faithful$eruptions)

Task-2:  Density Histograms, and Smoothed Density Histograms.

  • ##locfit() and lp() come from the locfit package; install it once with install.packages("locfit")
  • library(locfit)
  • ##Density histogram for Waiting Time
  • hist(faithful.df$waiting, col="blue", freq=FALSE, main="Histogram of Waiting Time to Next Eruption", xlab="Waiting Time To Next Eruption In Minutes")
  • ##smoothed density histogram for Waiting Time
  • smoothedDensity_waiting <- locfit(~lp(waiting), data=faithful.df)
  • plot(smoothedDensity_waiting, col="blue", main="Histogram of Waiting Time to Next Eruption", xlab="Waiting Time To Next Eruption In Minutes")
  • ##Density histogram for Eruptions
  • hist(faithful.df$eruptions, col="red", freq=FALSE, main="Histogram of Eruption Time", xlab="Eruption Time In Minutes")
  • ##smoothed density histogram for Eruptions
  • smoothedDensity_eruptions <- locfit(~lp(eruptions), data=faithful.df)
  • plot(smoothedDensity_eruptions, col="red", main="Histogram of Eruption Time", xlab="Eruption Time In Minutes")

Figure 2.  Density Histogram and Smoothed Density Histogram of Waiting Time.

Figure 3.  Density Histogram and Smoothed Density Histogram of Eruption.
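For readers without the locfit package, a similar smoothed curve can be sketched with the kernel density estimator density() from base R. This is a different smoother from the local-likelihood fit used above, so the curves will not match exactly; the object names dens_waiting, dens_eruptions, peak_waiting, and peak_eruptions are illustrative assumptions.

```r
data(faithful)

# Kernel density estimates from base R, used here in place of locfit.
dens_waiting <- density(faithful$waiting)
dens_eruptions <- density(faithful$eruptions)

# Density histogram of waiting time with the smoothed curve overlaid.
hist(faithful$waiting, freq = FALSE, col = "blue",
     main = "Histogram of Waiting Time to Next Eruption",
     xlab = "Waiting Time To Next Eruption In Minutes")
lines(dens_waiting, lwd = 2)

# Density histogram of eruption time with the smoothed curve overlaid.
hist(faithful$eruptions, freq = FALSE, col = "red",
     main = "Histogram of Eruption Time",
     xlab = "Eruption Time In Minutes")
lines(dens_eruptions, lwd = 2)

# Location of the highest point of each smoothed curve.
peak_waiting <- dens_waiting$x[which.max(dens_waiting$y)]
peak_eruptions <- dens_eruptions$x[which.max(dens_eruptions$y)]
c(waiting = peak_waiting, eruptions = peak_eruptions)
```

The two peak locations give a numeric counterpart to the modes read from the figures.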

Task-3: Standard Linear Regression

            The purpose of this task is to examine the standard linear regression for the two main variables of the faithful dataset, Waiting Time until Next Eruption and Eruption Time. This task also addresses the diagnostic plots of the standard linear regression, such as Residuals vs. Fitted, as examined below. The R code is as follows:

  • ##Standard Regression of waiting time on eruption time.
  • lin.reg.model <- lm(waiting~eruptions, data=faithful.df)
  • plot(waiting~eruptions, data=faithful.df, col="blue", main="Regression of Two Factors of Waiting and Eruption Time")
  • abline(lin.reg.model, col="red")

Figure 4.  Regression of Two Factors of Waiting and Eruption Time.

            The following graphs represent the diagnostic plots for the standard linear regression. The first plot represents Residuals vs. Fitted. The second plot represents the Normal Q-Q. The third plot represents Scale-Location. The fourth plot represents Residuals vs. Leverage. These graphs are discussed and analyzed in the Discussion and Analysis section of this project.

Figure 5.  Diagnostic Plots for Standard Linear Regression.
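The four diagnostic plots listed above can be produced by calling plot() on the fitted lm object. The following is a minimal sketch, assuming the same model as in Task-3 fitted on the built-in faithful data; the 2-by-2 panel layout is a presentation choice and not taken from the original.

```r
data(faithful)

# Refit the standard linear regression of waiting time on eruption time.
lin.reg.model <- lm(waiting ~ eruptions, data = faithful)

# Arrange the four default diagnostic plots in a 2 x 2 grid:
# Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage.
old_par <- par(mfrow = c(2, 2))
plot(lin.reg.model)
par(old_par)  # restore the previous plotting layout
```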

Task-4: Polynomial Regression

            The purpose of this task is to examine the polynomial regression for the Waiting Time until Next Eruption and the Eruption Time variables. The R code is as follows:
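A minimal sketch of such a fit, assuming a second-degree polynomial in eruptions. The degree, the plot title, and the object names poly.reg.model, grid_x, and fitted_curve are assumptions, not taken from the text.

```r
data(faithful)

# Quadratic regression of waiting time on eruption time.
# raw = TRUE uses plain powers (x, x^2) rather than orthogonal polynomials.
poly.reg.model <- lm(waiting ~ poly(eruptions, 2, raw = TRUE), data = faithful)

# Scatter plot of the data.
plot(waiting ~ eruptions, data = faithful, col = "blue",
     main = "Polynomial Regression of Waiting on Eruption Time")

# Evaluate the fitted curve on an evenly spaced grid and draw it.
grid_x <- seq(min(faithful$eruptions), max(faithful$eruptions), length.out = 200)
fitted_curve <- predict(poly.reg.model, newdata = data.frame(eruptions = grid_x))
lines(grid_x, fitted_curve, col = "red")
```

Predicting on a grid, rather than on the observed points, gives a smooth curve regardless of the order of the observations.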

Figure 6.  Polynomial Regression.

Task-5: Lowess Regression

            The purpose of this task is to examine the Lowess Regression.  The R code is as follows:
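A minimal sketch of a lowess fit, assuming the smoother span f = 2/3 that is also used in the comparison task. The plot title and the object name lowess.fit are assumptions, not taken from the text.

```r
data(faithful)

# lowess() returns a list with components x and y: the sorted x values
# and the corresponding smoothed y values.
lowess.fit <- lowess(faithful$eruptions, faithful$waiting, f = 2/3)

# Scatter plot of the data with the lowess curve overlaid.
plot(waiting ~ eruptions, data = faithful, col = "blue",
     main = "Lowess Regression of Waiting on Eruption Time")
lines(lowess.fit, col = "red")
```

A smaller f uses fewer neighbouring points for each local fit and follows the data more closely; a larger f gives a smoother curve.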

Figure 7.  Lowess Regression.

Task-6: Summary of the Model.

            The purpose of this task is to examine the descriptive analysis summary such as residuals, intercept, R-squared.  The R code is as follows:

  • summary(lin.reg.model)

Task-7:  Comparison of Linear Regression, Polynomial Regression and Lowess Regression.

  • ##Comparing lowess and local polynomial regression to the standard regression.
  • ##locfit() and lp() come from the locfit package
  • library(locfit)
  • lowessReg <- lowess(faithful$waiting~faithful$eruptions, f=2/3)
  • local.poly.reg <- locfit(waiting~lp(eruptions, nn=0.5), data=faithful)
  • standard.reg <- lm(waiting~eruptions, data=faithful)
  • plot(faithful$waiting~faithful$eruptions, main="Eruptions Time", xlab="Eruption Time in Minutes", ylab="Waiting Time to Next Eruption Time", col="blue")
  • lines(lowessReg, col="red")
  • abline(standard.reg, col="green")
  • lines(local.poly.reg, col="yellow")

Figure 8.  Regression Comparison for the Eruptions and Waiting Time Variables.

Task-8:  Summary of these Models

            The purpose of this task is to examine the summary of each model. The R code is as follows:

  • ##Summary of the regressions
  • summary(lowessReg)
  • summary(local.poly.reg)
  • summary(standard.reg)
  • cor(faithful.df$eruptions, faithful.df$waiting)
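For a simple linear regression with one predictor, the squared correlation coefficient equals the R-squared reported in the model summary. The following minimal sketch checks that link between the cor() call and the summary of the standard regression; the names r and r_squared are illustrative.

```r
data(faithful)

# Standard linear regression of waiting time on eruption time.
standard.reg <- lm(waiting ~ eruptions, data = faithful)

# Pearson correlation between the two variables.
r <- cor(faithful$eruptions, faithful$waiting)

# R-squared taken from the model summary.
r_squared <- summary(standard.reg)$r.squared

# In simple linear regression these two quantities coincide.
all.equal(r^2, r_squared)
```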

Task-9: Discussion and Analysis

            The descriptive analysis shows that the average eruption time is 3.49 minutes, which is lower than the median of 4.0 minutes, indicating a negatively skewed distribution. The average waiting time until the next eruption is 70.9 minutes, which is less than the median of 76.0 minutes, again indicating a negatively skewed distribution. Figure 2 illustrates the density histogram and smoothed density histogram of the waiting time. The result shows that the peak waiting time is ~80 minutes, with the highest density point at about 0.04. Figure 3 illustrates the density histogram and smoothed density histogram of the eruption time in minutes. The result shows that the peak eruption time is ~4.4 minutes, with the highest density point at about 0.6.

            Figure 4 illustrates the linear regression of the waiting time until the next eruption on the eruption time in minutes. The result shows that as the eruption time increases, the waiting time until the next eruption increases. The residuals depict the difference between the actual value of the response variable and the value predicted by the regression. The spread of the residuals is given by their minimum, Q1, median, Q3, and maximum. In this case, the residuals range from -12.08 to 15.97. Since the principle behind the regression line and the regression equation is to reduce this error, the expectation is that the median should be very near 0. However, the median is 0.21, which is higher than 0. The prediction error can be as large as the maximum residual. As this value of 15.97 is not small, a residual of this size cannot be accepted.

            In the model summary, the value next to each coefficient estimate is the standard error of the estimate, which specifies the uncertainty of the estimate. Next comes the t value, which specifies how large the coefficient estimate is relative to that uncertainty. The next value is the probability of observing an absolute t value at least as large as the one reported purely by chance. For the relationship between the response variable and the independent variable to be considered significant, this probability value (p-value) should fall below a chosen significance level such as 0.001, 0.005, 0.01, or 0.05. As the result shows, the p-value for the coefficient of eruptions is close to 0, i.e., <2e-16. Thus, we reject the null hypothesis that the parameter is not significant to the model and accept the alternative hypothesis that the parameter is significant to the model. We conclude that there is a significant relationship between the response variable waiting time and the independent variable eruptions.
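The quantities discussed above can be recomputed from the fitted model. The following is a minimal sketch, assuming the standard linear regression from Task-3; the names coef_table, t_manual, and p_manual are illustrative. It forms the t value as the estimate divided by its standard error and obtains the two-sided p-value from the t distribution.

```r
data(faithful)
lin.reg.model <- lm(waiting ~ eruptions, data = faithful)

# Five-number spread of the residuals (min, Q1, median, Q3, max).
quantile(residuals(lin.reg.model))

# Coefficient table: estimate, standard error, t value, p-value.
coef_table <- summary(lin.reg.model)$coefficients

# t value = estimate / standard error.
t_manual <- coef_table["eruptions", "Estimate"] /
  coef_table["eruptions", "Std. Error"]

# Two-sided p-value from the t distribution with the residual degrees of freedom.
p_manual <- 2 * pt(abs(t_manual), df = df.residual(lin.reg.model),
                   lower.tail = FALSE)

c(t = t_manual, p = p_manual)
```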

The diagnostic plots of the standard regression are also discussed in this project. Figure 5 illustrates four diagnostic plots of the standard regression. The first is the Residuals vs. Fitted plot of the linear regression model for Waiting Time until Next Eruption as a function of Eruption Time in minutes. The residuals depict the difference between the actual value of the response variable and the value predicted using the regression equation (Hodeghatta & Nayak, 2016). The principle behind the regression line and the regression equation is to reduce this error (Hodeghatta & Nayak, 2016). The expectation is that the median value should be very near zero (Hodeghatta & Nayak, 2016). For the model to pass the test of linearity, no pattern should exist in the distribution of the residuals (Hodeghatta & Nayak, 2016). The plot of the residuals against the fitted values includes a line that shows the relationship between the two. A straight, horizontal line indicates that the “average residual” is more or less the same for all “fitted values” (Navarro, 2015). The result of the linear regression for Eruptions and Waiting Time shows that the residuals have a curved pattern, indicating that a better model might be obtained by adding a quadratic term, because ideally this line should be straight and horizontal.

Figure 5 also illustrates the Normal Q-Q plot, which is used to test the normality of the distribution (Hodeghatta & Nayak, 2016). The residuals lie almost on the straight line, indicating that they are normally distributed. Hence, the normality test of the residuals is passed.

Figure 5 also illustrates the Scale-Location plot, which is one of the graphs generated by plotting the fitted model. A horizontal line with points spread equally and randomly around it would indicate that the assumption of constant variance of the errors, or homoscedasticity, is fulfilled (Hodeghatta & Nayak, 2016). Here, the points are spread randomly around the line but not equally along it. Thus, the assumption is not fulfilled in this case.

Figure 5 also illustrates the Residuals vs. Leverage plot generated for the linear regression model. In this plot, patterns are not as relevant as they are in the other diagnostic plots. Instead, outlying values at the upper right corner or the lower right corner are watched (Bommae, 2015). Those spots are the places where a case can be influential against a regression line (Bommae, 2015). When cases are outside the Cook’s distance lines, meaning they have high Cook’s distance scores, they are influential to the regression results (Bommae, 2015). The Cook’s distance lines (red dashed lines) are far from the points, indicating that there is no influential case.
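The reading of the Residuals vs. Leverage plot can be backed by numbers. The following is a minimal sketch, assuming the standard linear regression from Task-3. The 0.5 cut-off mirrors the inner Cook's distance contour drawn by plot() for lm objects and is used here only as a rule of thumb; the names cooks_d, leverage, and influential are illustrative.

```r
data(faithful)
lin.reg.model <- lm(waiting ~ eruptions, data = faithful)

# Cook's distance and leverage (hat value) for every observation.
cooks_d <- cooks.distance(lin.reg.model)
leverage <- hatvalues(lin.reg.model)

# Largest Cook's distance in the data.
max(cooks_d)

# Observations beyond the rule-of-thumb threshold (assumed cut-off of 0.5).
influential <- which(cooks_d > 0.5)
length(influential)

# Observations with the highest leverage, for inspection.
head(sort(leverage, decreasing = TRUE))
```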

Regression assumes that the relationship between predictors and outcomes is linear. However, non-linear relationships between variables can exist in some cases (Navarro, 2015). There are tools in statistics that can be employed to perform non-linear regression. Some non-linear regression models, such as isotonic regression, assume that the relationship between predictors and outcomes is monotonic; others, such as lowess regression, assume that it is smooth but not necessarily monotonic; and others, such as polynomial regression, assume that the relationship has a known form that happens to be non-linear (Navarro, 2015). As indicated in Dias (n.d.), Cleveland (1979) proposed the lowess algorithm as an outlier-resistant method based on local polynomial fits. The underlying concept is to start with a local polynomial (a k-NN type fitting) least squares fit and then to use robust methods to obtain the final fit (Dias, n.d.).

The result of the polynomial regression is also addressed in this project. The polynomial regression shows a relationship between Waiting Time until Next Eruption and Eruption Time. The line in Figure 6 is similar to that of the standard linear regression. Lowess regression shows the same pattern in the relationship between Waiting Time until Next Eruption and Eruption Time. The line in Figure 7 is similar to that of the standard linear regression. The three lines of the standard linear regression, polynomial regression, and lowess regression are shown together for comparison in Figure 8. The correlation coefficient is positive, indicating a positive association between Eruption Time and Waiting Time until Next Eruption.
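The visual similarity between the linear and polynomial fits can be complemented by a formal comparison of the two nested models. The following is a minimal sketch, assuming a second-degree polynomial as the alternative model; the degree and the names linear.fit and quadratic.fit are assumptions, not taken from the text.

```r
data(faithful)

# Nested models: the linear fit is a special case of the quadratic fit.
linear.fit <- lm(waiting ~ eruptions, data = faithful)
quadratic.fit <- lm(waiting ~ poly(eruptions, 2), data = faithful)

# F-test of whether the quadratic term improves on the linear model.
anova(linear.fit, quadratic.fit)

# Adjusted R-squared of each model, for a side-by-side view.
c(linear = summary(linear.fit)$adj.r.squared,
  quadratic = summary(quadratic.fit)$adj.r.squared)
```

A small p-value in the anova table would support adding the quadratic term, in line with the curved pattern noted in the Residuals vs. Fitted plot.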

References

Ahlemeyer-Stubbe, A., & Coleman, S. (2014). A practical guide to data mining for business and industry. John Wiley & Sons.

Bommae, K. (2015). Understanding Diagnostic Plots of Linear Regression Analysis. Retrieved from https://data.library.virginia.edu/diagnostic-plots/.

Dias, R. (n.d.). Nonparametric Regression: Lowess/Loess. Retrieved from https://www.ime.unicamp.br/~dias/loess.pdf.

Fischetti, T., Mayor, E., & Forte, R. M. (2017). R: Predictive Analysis. Packt Publishing.

Hodeghatta, U. R., & Nayak, U. (2016). Business Analytics Using R-A Practical Approach. Springer.

Ledolter, J. (2013). Data mining and business analytics with R. John Wiley & Sons.

Navarro, D. J. (2015). Learning statistics with R: A tutorial for psychology students and other beginners. R package version 0.5.

r-project.org. (2018). R: A Language and Environment for Statistical Computing. Retrieved from https://cran.r-project.org/doc/manuals/r-release/fullrefman.pdf.