Dr. Aly, O.
Computer Science
Introduction
The purpose of this discussion is to examine locally weighted scatterplot smoothing, known as the LOWESS method, for multiple regression models in a k-nearest-neighbor-based framework. The discussion also addresses whether LOWESS is a parametric or non-parametric method, along with its advantages and disadvantages from a computational standpoint. In addition, a dataset is selected from http://vincentarelbundock.github.io/Rdatasets/ and a multiple regression analysis is performed using R; the dataset selected for this discussion is the "ethanol" dataset. The discussion begins with Multiple Regression, the Lowess method, Lowess/Loess in R, and K-Nearest-Neighbor (k-NN), followed by the analysis of the "ethanol" dataset.
Multiple Regression
When there is more than one predictor variable, simple Linear Regression becomes Multiple Linear Regression, and the analysis becomes more involved (Kabacoff, 2011). Polynomial Regression is a special case of Multiple Regression: Quadratic Regression has two predictors (X and X^2), and Cubic Regression has three predictors (X, X^2, and X^3) (Kabacoff, 2011). When there is more than one predictor variable, each regression coefficient indicates the change in the dependent variable for a unit change in that predictor variable, holding all other predictor variables constant (Kabacoff, 2011).
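To make this concrete, the following minimal sketch (using R's built-in mtcars dataset, an illustrative choice not tied to this discussion's dataset) fits a multiple regression and a quadratic regression:
- ##Multiple linear regression: each coefficient is the change in mpg per unit change in that predictor, holding the other constant.
- fit.multiple <- lm(mpg ~ wt + hp, data=mtcars)
- summary(fit.multiple)
- ##Quadratic regression as a special case of multiple regression, with predictors X and X^2; I() protects the squared term in the formula.
- fit.quadratic <- lm(mpg ~ wt + I(wt^2), data=mtcars)
- summary(fit.quadratic)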
Locally Weighted Scatterplot Smoothing (Lowess) Method
Regression assumes that the relationship between predictors and outcomes is linear. However, non-linear relationships between variables can exist in some cases (Navarro, 2015). There are statistical tools which can be employed to do non-linear regression. Some non-linear regression models assume that the relationship between predictors and outcomes is monotonic (such as Isotonic Regression); others assume that it is smooth but not necessarily monotonic (such as Lowess Regression); still others assume that the relationship is of a known form which happens to be non-linear (such as Polynomial Regression) (Navarro, 2015). As indicated in (Dias, n.d.), Cleveland (1979) proposed the Lowess algorithm as an outlier-resistant method based on local polynomial fits. The underlying concept is to start with a local polynomial (a k-NN type fitting) least squares fit and then to use robust methods to obtain the final fit (Dias, n.d.).
Moreover, Lowess and Loess are non-parametric strategies for fitting a smooth curve to data points (statisticshowto.com, 2013). "Parametric" indicates that an assumption is made in advance that the data fit some distribution, e.g., a normal distribution (statisticshowto.com, 2013). Parametric fitting can produce a smooth curve which misrepresents the data precisely because a distribution is assumed in advance; in those cases, non-parametric smoothers may be a better choice (statisticshowto.com, 2013). Non-parametric smoothers like Loess try to find a curve of best fit without assuming that the data must fit some distribution shape (statisticshowto.com, 2013). In general, both types of smoothers are used on the same set of data to offset the advantages and disadvantages of each type (statisticshowto.com, 2013). The benefits of non-parametric smoothing include a flexible approach to representing data, ease of use, and easy computation (statisticshowto.com, 2013). The disadvantages of non-parametric smoothing include the following: (1) it cannot be used to obtain a simple equation for a set of data, (2) it is less well understood than parametric smoothers, and (3) it requires some guesswork to obtain a result (statisticshowto.com, 2013).
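To illustrate the contrast, a minimal sketch (on R's built-in cars dataset, chosen only for illustration) overlays a parametric least-squares line and a non-parametric lowess curve on the same scatterplot:
- plot(dist ~ speed, data=cars, main="Parametric vs. Non-Parametric Fit")
- abline(lm(dist ~ speed, data=cars), col="red") ##parametric: assumes a straight-line form
- lines(lowess(cars$speed, cars$dist), col="blue") ##non-parametric: no assumed functional form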
Lowess/Loess in R
There are two versions of the lowess or loess scatter-diagram smoothing approach implemented in R (Dias, n.d.). The former (lowess) was implemented first, while loess is more flexible and powerful (Dias, n.d.). Example of lowess:
- lowess(x, y, f=2/3, iter=3, delta=0.01*diff(range(x)))
where the following model is assumed: y = b(x)+e.
- The "f" is the smoother span, which gives the proportion of points in the plot which influence the smooth at each value. Larger values give more smoothness.
- The “iter” is the number of “robustifying” iterations which should be performed; using smaller values of “iter” will make “lowess” run faster.
- The "delta" is a tolerance: values of "x" which lie within "delta" of each other are replaced by a single value in the output from "lowess" (Dias, n.d.). These parameters are shown in use in the sketch below.
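A minimal sketch of these parameters in use (again on the built-in cars dataset, assumed purely for illustration), comparing two choices of the span "f":
- plot(dist ~ speed, data=cars)
- lines(lowess(cars$speed, cars$dist, f=2/3, iter=3), col="blue") ##default span: smoother curve
- lines(lowess(cars$speed, cars$dist, f=0.2, iter=3), col="red") ##smaller span: wigglier, follows local structure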
The loess() function uses a formula to specify the response and (in its application as a scatter-diagram smoother) a single predictor variable (Dias, n.d.). The loess() function creates an object which contains the results, and the predict() function retrieves the fitted values. These can be plotted along with the response variable (Dias, n.d.). However, the points must be plotted in increasing order of the predictor variable for the lines() function to draw the line appropriately, which is done using the order() function applied to the predictor variable values and explicit subscripting (in square brackets []) to arrange the observations in ascending order (Dias, n.d.).
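The following minimal sketch (built-in cars dataset again assumed for illustration) shows this loess()/predict()/order() pattern:
- lo.fit <- loess(dist ~ speed, data=cars) ##fit the local regression
- fitted.vals <- predict(lo.fit) ##retrieve the fitted values
- plot(dist ~ speed, data=cars)
- ord <- order(cars$speed) ##indices that sort the predictor in ascending order
- lines(cars$speed[ord], fitted.vals[ord], col="blue") ##draw the smooth in x-order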
K-Nearest-Neighbor (K-NN)
The K-NN classifier is based on learning numeric attributes in an n-dimensional space. All of the training samples are stored in n-dimensional space, each with a unique pattern (Hodeghatta & Nayak, 2016). When a new sample is given, the K-NN classifier searches the pattern space for the patterns which are closest to the sample and accordingly labels the class in the k-pattern space (called k-nearest-neighbor) (Hodeghatta & Nayak, 2016). "Closeness" is defined in terms of Euclidean distance, where the Euclidean distance between two points, X = (x1, x2, x3, ..., xn) and Y = (y1, y2, y3, ..., yn), is defined as follows:

d(X, Y) = sqrt((x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2)
The unknown sample is assigned to the nearest class among its k nearest neighbors. The aim is to look for the records which are similar to, or "near," the record to be classified: training records which have values close to X = (x1, x2, x3, ..., xn) (Hodeghatta & Nayak, 2016). These records are grouped into classes based on this "closeness," and the unknown sample identifies itself with the class (determined by the k neighbors) which is nearest in the k-space (Hodeghatta & Nayak, 2016). If a new record has to be classified, the method finds the nearest matches to the record and tags it with that class (Hodeghatta & Nayak, 2016).
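As a minimal sketch of this distance-based matching (the euclid helper, the small train data frame, and new.rec are all hypothetical, introduced only for illustration):
- euclid <- function(x, y) sqrt(sum((x - y)^2)) ##Euclidean distance between two numeric vectors
- train <- data.frame(x1=c(1, 2, 8), x2=c(1, 3, 9), class=c("A", "A", "B")) ##hypothetical training records
- new.rec <- c(x1=2, x2=2) ##hypothetical record to classify
- dists <- apply(train[, c("x1", "x2")], 1, euclid, y=new.rec) ##distance to every training record
- train$class[order(dists)[1:2]] ##classes of the k=2 nearest neighbors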
The K-NN does not assume any relationship between the predictors (X) and the class (Y) (Hodeghatta & Nayak, 2016). Instead, it draws its conclusion about the class based on similarity measures between the predictors and the records in the dataset (Hodeghatta & Nayak, 2016). Although there are many potential measures, K-NN uses the Euclidean distance between the records to find the similarities and label the class (Hodeghatta & Nayak, 2016). The predictor variables should be standardized to a common scale before computing the Euclidean distances and classifying (Hodeghatta & Nayak, 2016). After computing the distances between records, a rule is required to assign records to classes based on the k neighbors (Hodeghatta & Nayak, 2016). A higher value of k reduces the risk of overfitting due to noise in the training set (Hodeghatta & Nayak, 2016). Values of k between 2 and 10 are typically tried, computing the misclassification error each time, to find the value of k which gives the minimum error (Hodeghatta & Nayak, 2016), as sketched below.
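A minimal sketch of this standardize-then-tune workflow, assuming the class package is installed and using R's built-in iris data purely for illustration:
- library(class) ##provides the knn() classifier
- set.seed(1) ##make the hypothetical split reproducible
- X <- scale(iris[, 1:4]) ##standardize predictors to a common scale
- idx <- sample(nrow(iris), 100) ##hypothetical train/test split
- err <- sapply(2:10, function(k) mean(knn(X[idx, ], X[-idx, ], iris$Species[idx], k=k) != iris$Species[-idx])) ##misclassification error for each k
- which.min(err) + 1 ##the value of k giving the minimum error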
The advantages of K-NN as a classification method include its simplicity and its lack of parametric assumptions, and it performs well for large training datasets (Hodeghatta & Nayak, 2016). However, its disadvantages include the time required to find the nearest neighbors and reduced performance when there is a large number of predictors (Hodeghatta & Nayak, 2016).
Multiple Regression Analysis for “ethanol” dataset Using R
This section is divided into five major tasks. The first task is to understand and examine the dataset. Task-2, Task-3, and Task-4 cover the density histogram, linear regression, and multiple linear regression, respectively. Task-5 covers the discussion and analysis of the results.
Task-1: Understand and Examine the Dataset.
The purpose of this task is to understand and examine the dataset. The description of the dataset is found in (r-project.org, 2018): a data frame with 88 observations on the following three variables.
- NOx: concentration of nitrogen oxides (NO and NO2) in micrograms/J.
- C: compression ratio of the engine.
- E: equivalence ratio, a measure of the richness of the air and ethanol fuel mixture.
#R-Commands and Results using summary(), names(), head(), dim(), and plot() functions.
- ethanol <- read.csv("C:/CS871/Data/ethanol.csv")
- data(ethanol)
- summary(ethanol)
- names(ethanol)
- head(ethanol)
- dim(ethanol)
- plot(ethanol, col="red")

Figure 1. Plot Summary of NOx, C, and E in Ethanol Dataset.
- ethanol[1:3,] ##First three lines
- ethanol$NOx[1:10] ##First 10 lines for concentration of nitrogen oxides (NOx)
- ethanol$C[1:10] ##First 10 lines for Compression Ratio of the Engine ( C )
- ethanol$E[1:10] ##First 10 lines for Equivalence Ratio ( E )
- ##Descriptive Analysis using summary() function to analyze the central tendency.
- summary(ethanol$NOx)
- summary(ethanol$C)
- summary(ethanol$E)
Task-2: Density Histogram and Smoothed Density Histogram
- ##Density histogram for NOx
- hist(ethanol$NOx, freq=FALSE, col="orange")
- install.packages("locfit") ##locfit library is required for smoothed histogram
- library(locfit)
- smoothedDensity_NOx <- locfit(~lp(NOx), data=ethanol)
- plot(smoothedDensity_NOx, col="orange", main="Smoothed Density Histogram for NOx")

Figure 2. Density Histogram and Smoothed Density Histogram of NOx of Ethanol.
- ##Density histogram for Equivalence Ratio ( E )
- hist(ethanol$E, freq=FALSE, col="blue")
- smoothedDensity_E <- locfit(~lp(E), data=ethanol)
- plot(smoothedDensity_E, col="blue", main="Smoothed Density Histogram for Equivalence Ratio")

Figure 3. Density Histogram and Smoothed Density Histogram of E of Ethanol.
- ##Density histogram for Compression Ratio ( C )
- hist(ethanol$C, freq=FALSE, col="blue")
- smoothedDensity_C <- locfit(~lp(C), data=ethanol)
- plot(smoothedDensity_C, col="blue", main="Smoothed Density Histogram for Compression Ratio")

Figure 4. Density Histogram and Smoothed Density Histogram of C of Ethanol.
Task-3: Linear Regression Model
- ## Linear Regression
- lin.reg.model1=lm(NOx~E, data=ethanol)
- lin.reg.model1
- plot(NOx~E, data=ethanol, col="blue", main="Linear Regression of NOx and Equivalence Ratio in Ethanol")
- abline(lin.reg.model1, col="red")
- mean.NOx=mean(ethanol$NOx, na.rm=T)
- abline(h=mean.NOx, col="green")

Figure 5. Linear Regression of the NOx and E in Ethanol.
- ##Local polynomial regression of NOx on the equivalence ratio,
- ##fit with a 50% nearest neighbor bandwidth.
- local.poly.reg <- locfit(NOx~lp(E, nn=0.5), data=ethanol)
- plot(local.poly.reg, col="blue")

Figure 6. Smoothed Polynomial Regression of the NOx and E in Ethanol.

Figure 7. Residuals vs. Fitted Plots.

Figure 8. Normal Q-Q Plot.

Figure 9. Scale-Location Plot.

Figure 10. Residuals vs. Leverage.
- ##To better understand the linearity of the relationship represented by the model.
- summary(lin.reg.model1)
- plot(lin.reg.model1)
- library(car) ##crPlots() requires the car package
- crPlots(lin.reg.model1)
- termplot(lin.reg.model1)

Figure 11. crPlots() Plots for the Linearity of the Relationship between NOx and Equivalence Ratio of the Model.
- ##Examine the Correlation between NOx and E.
- cor(ethanol$NOx, ethanol$E)
Task-4: Multiple Regression
- ##Produce Plots of some explanatory variables.
- plot(NOx~E, ethanol, col="blue")
- plot(NOx~C, ethanol, col="red")
- ##Use a vertical bar to find the relationship of E on NOx conditioned on C
- coplot(NOx~E|C, panel=panel.smooth, data=ethanol, col="blue")
- model2=lm(NOx~E*C, ethanol)
- plot(model2, col="blue")

Figure 12. Multiple Regression – Relationship of E on NOx conditioned with C.

Figure 13. Multiple Regression Diagnostic Plot: Residual vs. Fitted.

Figure 14. Multiple Regression Diagnostic Plot: Normal Q-Q.

Figure 15. Multiple Regression Diagnostic Plot: Scale-Location.

Figure 16. Multiple Regression Diagnostic Plot: Residual vs. Leverage.
- summary(model2)

Task-5: Discussion and Analysis: The results show that the mean of NOx is 1.96, which is higher than the median of 1.75, indicating a positively skewed distribution. The mean of the compression ratio of the engine (C) is 12.034, slightly higher than the median of 12.00, indicating an approximately symmetric distribution. The mean of the equivalence ratio (E) is 0.926, slightly lower than the median of 0.932, indicating a slightly negatively skewed but close-to-normal distribution. In summary, the mean for NOx is 1.96, for C is 12.034, and for E is 0.926.
The NOx exhaust emissions depend on two predictor variables: the fuel-air equivalence ratio (E) and the compression ratio (C) of the engine. The density of the NOx emissions and its smoothed version using local polynomial regression are illustrated in Figure 2. The result shows that the density rises from about 0.15 as NOx increases, but after the density reaches about 0.35 it starts to drop while NOx continues to increase. Thus, there appears to be a positive relationship between NOx and density over the 0.15 to 0.35 density range, after which the relationship reverses and becomes negative.
The density of the equivalence ratio (E), the measure of the richness of the air and ethanol fuel mixture, and its smoothed version using local polynomial regression are illustrated in Figure 3. The result shows that the density varies as E increases: the density rises with E until it reaches ~1.5, then drops to ~1.2 while E continues to increase, rises again until it reaches ~1.6, and then drops once more as E keeps increasing. In summary, the density fluctuates while the E value keeps increasing.
The density of the compression ratio of the engine (C) and its smoothed version using local polynomial regression are illustrated in Figure 4. The result shows that the density rises from about 0.09 as C increases, but after the density reaches ~0.11 it starts to drop while C continues to increase. Thus, there appears to be a positive relationship between C and density over the ~0.09 to ~0.11 density range, after which the relationship reverses and becomes negative.
Figure 5 illustrates the linear regression between NOx and the equivalence ratio in ethanol, and Figure 6 illustrates the smoothed polynomial regression of NOx and E. The result of the linear regression of NOx as a function of E shows that as E increases, NOx first increases and then decreases. Figure 6, the smoothed polynomial regression of NOx and E, indicates the same result: there is a positive association between E and NOx while E increases up to ~0.9, with NOx rising to ~3.5. After that point the relationship is negative, meaning that NOx decreases as E continues to increase.
This analysis also covers the residuals and fitted values. Figure 7 illustrates the Residuals vs. Fitted plot for the linear regression model of NOx as a function of E. The residuals depict the difference between the actual value of the response variable and the value predicted using the regression equation (Hodeghatta & Nayak, 2016). The principle behind the regression line and the regression equation is to reduce this error or difference (Hodeghatta & Nayak, 2016). The expectation is that the median residual should be very near zero (Hodeghatta & Nayak, 2016). For the model to pass the test of linearity, no pattern should exist in the distribution of the residuals; when there is no pattern, the condition of linearity is satisfied (Hodeghatta & Nayak, 2016). The plot of the fitted values against the residuals, with a line, shows the relationship between the two: a horizontal, straight line indicates that the "average residual" is more or less the same for all fitted values (Navarro, 2015).
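The same diagnostic can also be drawn by hand from the fitted model; a minimal sketch using the lin.reg.model1 object created in Task-3:
- plot(fitted(lin.reg.model1), resid(lin.reg.model1), col="blue", main="Residuals vs. Fitted (by hand)")
- abline(h=0, col="red") ##a pattern-free scatter around this line supports linearity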
The result of the linear regression for E and NOx shows that the residuals have a curved pattern, indicating that a better model could be obtained using a quadratic term, because ideally this line should be straight and horizontal (a sketch of such a quadratic fit follows this discussion). Figure 8 illustrates the Normal Q-Q plot, which is used to test the normality of the distribution (Hodeghatta & Nayak, 2016). Figure 8 shows that the residuals lie almost on the straight line, indicating that they are normally distributed; hence, the normality test of the residuals is passed. Figure 9 illustrates the Scale-Location graph, one of the graphs generated by the plot command above. The points are spread in a random fashion around the horizontal line, but not equally along it. If the line were horizontal with equally and randomly spread points, the result would indicate that the assumption of constant error variance, or homoscedasticity, is fulfilled (Hodeghatta & Nayak, 2016); thus, it is not fulfilled in this case. Figure 10 illustrates the Residuals vs. Leverage plot generated for the linear regression model. In this plot, patterns are not as relevant as in the other diagnostic plots; instead, outlying values at the upper right or lower right corners are watched (Bommae, 2015). Those spots are the places where a case can be influential against a regression line (Bommae, 2015). When cases lie outside of Cook's distance, meaning they have high Cook's distance scores, they are influential to the regression results (Bommae, 2015). Here the Cook's distance lines (red dashed lines) are far away, indicating there is no influential case.

Figure 11 illustrates the output of the crPlots() function, which is used to better understand the linearity of the relationship represented by the model (Hodeghatta & Nayak, 2016). Non-linearity requires re-exploring the model (Hodeghatta & Nayak, 2016). The result in Figure 11 shows that the model created is not linear, which requires re-exploring the model. Moreover, the correlation between NOx and E is negative, with a value of -0.11. Figure 12 illustrates the multiple regression and the relationship of the equivalence ratio on NOx conditioned on the compression ratio. Multiple linear regression is useful for modeling the relationship between a numeric outcome or dependent variable (Y) and multiple explanatory or independent variables (X). The result shows that the interaction of C and E affects NOx: while E and C increase, NOx decreases. Approximately 0.013 of the variation in NOx can be explained by this model (E*C), and the interaction of E and C has a negative coefficient of -0.063 on NOx.
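As a minimal sketch of the quadratic refinement suggested by the curved residual pattern (the name model.quad is hypothetical, introduced only for illustration):
- model.quad <- lm(NOx ~ E + I(E^2), data=ethanol) ##add a quadratic term in E
- summary(model.quad) ##compare R-squared and residual behavior against lin.reg.model1
- plot(model.quad, which=1) ##Residuals vs. Fitted for the quadratic model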
References
Bommae, K. (2015). Understanding Diagnostic Plots of Linear Regression Analysis. Retrieved from https://data.library.virginia.edu/diagnostic-plots/.
Dias, R. (n.d.). Nonparametric Regression: Lowess/Loess. Retrieved from https://www.ime.unicamp.br/~dias/loess.pdf.
Hodeghatta, U. R., & Nayak, U. (2016). Business Analytics Using R-A Practical Approach: Springer.
Kabacoff, R. I. (2011). R in Action: Data Analysis and Graphics with R. Manning Publications.
Navarro, D. J. (2015). Learning statistics with R: A tutorial for psychology students and other beginners. R package version 0.5.
r-project.org. (2018). R: A Language and Environment for Statistical Computing. Retrieved from https://cran.r-project.org/doc/manuals/r-release/fullrefman.pdf.
statisticshowto.com. (2013). Lowess Smoothing: Overview. Retrieved from http://www.statisticshowto.com/lowess-smoothing/.