Quantitative Analysis of the “Flight-Delays” Dataset Using R-Programming

Dr. O. Aly
Computer Science

Abstract

The purpose of this project is to analyze the flight-delays Dataset. The project is divided into two main Parts. Part-I evaluates and examines the Dataset in RStudio and involves five major tasks to review and understand the Dataset variables. Part-II discusses the pre-data analysis by converting the Dataset to a data frame and involves three major tasks to analyze the data frame, first with logistic regression and then with the naïve Bayesian method. The naïve Bayesian method used probabilities from a training set consisting of 60% randomly selected flights, with the remaining 40% of the 2201 flights serving as the holdout period. The misclassification proportion of the naïve Bayesian method is 19.52%, which is slightly higher than that of the logistic regression. The method correctly identifies 30 of the 167 delayed flights but fails to identify 137/(137+30), or about 82%, of the delayed flights. Moreover, 35/(35+679), or 4.9%, of the on-time flights are predicted as delayed, as illustrated in Task-2 of Part-II and Figure 19.

Keywords: Flight-Delays Dataset; Naïve Bayes Prediction Analysis Using R.

Introduction

This project examines and analyzes the flight.delays.csv Dataset. The Dataset was downloaded from the CTU course materials. A couple of attempts were made to download the Dataset from https://www.transtats.bts.gov/; however, the analysis could not continue with those downloads because of the size of the Datasets from that link and the limited resources of the student's machine. Thus, this project uses the version of flight.delays.csv provided in the course material. The flight.delays.csv Dataset has 2201 observations on 14 variables. The focus of this analysis is Naïve Bayes; however, for a better understanding of the prediction and a comparison of two different models, the researcher first implemented logistic regression, followed by the naïve Bayesian approach, on the same flight.delays.csv Dataset. This project addresses two major Parts. Part-I covers the following key Tasks to understand and examine the Dataset of "flight.delays.csv."

  • Task-1:  Review the Variables of the Dataset.
  • Task-2:  Load and Understand the Dataset Using names(), head(), dim() Functions.
  • Task-3:  Examine the Dataset, Install the Required Packages, and Summary of the Descriptive Statistics.
  • Task-4:  Create Data Frame and Histogram of the Delay (Response)
  • Task-5:  Visualization of the Desired Variables Using the plot() Function.

Part-II covers the following three primary Tasks to plot, discuss, and analyze the results.

  • Task-1:  Logistic Regression Model for Flight Delays Prediction
  • Task-2:  Naïve Bayesian Model for Flight Delays Prediction.
  • Task-3:  Discussion and Analysis.

Various resources were utilized to develop the required code using R. These resources include (Ahlemeyer-Stubbe & Coleman, 2014; Fischetti, Mayor, & Forte, 2017; Ledolter, 2013; r-project.org, 2018).

Part-I:  Understand and Examine the Dataset “flight.delays.csv”

Task-1:  Review the Variables of the Dataset

The purpose of this task is to understand the variables of the Dataset. The Dataset is the "flight.delays" Dataset, which describes 2201 flights from Washington, DC, into New York City and whether each flight was delayed by more than 15 minutes. There are 14 variables. Table 1 summarizes the selected variables for this project.

Table 1:  Flight Delays Variables

Task-2:  Load and Understand the Dataset Using names(), head(), dim() Functions.

            The purpose of this task is to load and understand the Dataset using the names(), head(), and dim() functions. The task also displays the first three observations.

  • ## reading the data
  • fd <- read.csv("C:/CS871/Data/flight.delays.csv")
  • names(fd[1:5,])
  • head(fd)
  • dim(fd)
  • fd[1:3,]
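In addition to the calls above, a quick structural check can confirm which columns R has read in as factors and which as numeric; the following is a small added sketch using base R only.

  • ## Added sketch: inspect the type of each column before any recoding
  • str(fd)
  • sapply(fd, class)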

Task-3:  Examine the Dataset, Install the Required Packages, and Summary of the Descriptive Statistics.

            The purpose of this task is to examine the data set and install the required package (car), which is needed to recode variables. This task also displays the descriptive statistics for the analysis.

  • ### set seed
  • set.seed(1)
  • ##Required Library(car) to recode a variable
  • install.packages("car")
  • library(car)
  • summary(fd)
  • plot(fd, col="blue")

Figure 1. The plot of the Identified Variables for the Flight Delays Dataset.

Task-5:  Visualization of the Desired Variables Using the plot() Function.

            The purpose of this task is to visualize the selected variables using the plot() function to gain a better understanding of each variable and its distribution.

  • ## schedf (scheduled hour of departure) is derived from schedtime, as in Part-II
  • fd$schedf=factor(floor(fd$schedtime/100))
  • plot(fd$schedf, col="blue", main="Histogram of the Scheduled Time")
  • plot(fd$carrier, col="blue", main="Histogram of the Carrier")
  • plot(fd$dest, col="blue", main="Histogram of the Destination")
  • plot(fd$origin, col="blue", main="Histogram of the Origin")
  • plot(fd$weather, col="blue", main="Histogram of the Weather")
  • plot(fd$dayweek, col="blue", main="Histogram of the Day of Week")

Figure 2.  Histogram of the Schedule Time and Carrier.

Figure 3.  Histogram of the Destination and Origin.

Figure 4.  Histogram of the Weather and Day of Week.

Part-II:  Plot, Discuss and Analyze

 Task-1: Logistic Regression Model for Flight Delays

The purpose of this task is to first use the logistic regression model for predicting whether a flight is on time or delayed by more than 15 minutes. The Dataset consists of 2201 flights in 2004 from Washington, DC, into New York City. The response is whether or not a flight has been delayed by more than 15 minutes, coded as 0 = no delay and 1 = delay of more than 15 minutes. The explanatory variables include:

  • Three arrival airports (Kennedy, Newark, and LaGuardia).
  • Three different departure airports (Reagan, Dulles, and Baltimore).
  • Eight carriers.
  • A categorical variable for 16 different hours of departure (6:00 AM to 10:00 PM).
  • Weather conditions (0=good, 1=bad).
  • Day of week (1 for Sunday and Monday; and 0 for all other days).

The code of R is shown below for the logistic regression model.

  • ## Create a Data Frame and Understand the Dataset.
  • fd <-data.frame(fd)
  • names(fd)
  • head(fd)
  • fd[1:5,]
  • dim(fd)
  • summary(fd)
  • plot(fd, col="blue")
  • ## library car is needed to recode variables
  • library(car)
  • ##Define hours of Departure
  • fd$sched=factor(floor(fd$schedtime/100))
  • table(fd$sched)
  • table(fd$carrier)
  • table(fd$dest)
  • table(fd$origin)
  • table(fd$weather)
  • table(fd$dayweek)
  • table(fd$daymonth)
  • table(fd$delay)
  • fd$delay=recode(fd$delay,"'delayed'=1;else=0")
  • fd$delay=as.numeric(levels(fd$delay)[fd$delay])
  • table(fd$delay)
  • ## Summary of the Major Variables
  • summary(fd$sched)
  • summary(fd$carrier)
  • summary(fd$dest)
  • summary(fd$origin)
  • summary(fd$weather)
  • summary(fd$dayweek)
  • summary(fd$daymonth)
  • summary(fd$delay)
  • ## Plots and Histograms of the Major Variables
  • plot(fd$sched, col="blue", main="Schedule Departure Time")
  • plot(fd$carrier, col="darkblue", main="Flight Carriers")
  • plot(fd$dest, col="darkred", main="Destination of Flights")
  • plot(fd$origin, col="green", main="Origin of Flights")
  • plot(fd$weather, col="darkgreen", main="Weather During Flight Days")
  • hist(fd$dayweek, col="darkblue", main="Flights Day of the Week", xlab="Day of Week")
  • hist(fd$daymonth, col="yellow", main="Flights Day of the Month")
  • plot(fd$delay, col="red", main="Plot of the Delay")
  • hist(fd$delay, col="red", main="Histogram of the Delay")
  • ## Day of week: Monday (1) and Sunday (7) are coded as 1, else 0
  • fd$dayweek=recode(fd$dayweek,"c(1,7)=1;else=0")
  • table(fd$dayweek)
  • summary(fd$dayweek)
  • hist(fd$dayweek, col="darkblue", main="Flights Day of the Week", xlab="Day of Week")
  • ## Omit unused variables
  • fd=fd[,c(-1,-3,-5,-6,-7,-11,-12)]
  • fd[1:5,]
  • ## Create Sample Dataset
  • delay.length=length(fd$delay)
  • delay.length
  • delay.length1=floor(delay.length*(0.6))
  • delay.length1
  • delay.length2=delay.length-delay.length1
  • delay.length2
  • train=sample(1:delay.length, delay.length1)
  • train
  • plot(train, col="red")
  • ## Estimation of Logistic Regression Model
  • ##Explanatory Variables: carrier, destination, origin, weather, day of week,
  • ##(weekday/weekend), scheduled hour of departure.
  • ## Create design matrix; indicators for categorical variables (factors)
  • Xfd <- model.matrix(delay~., data=fd) [,-1]
  • Xfd[1:5,]
  • xtrain <- Xfd[train,]
  • xtrain[1:2,]
  • xtest <- Xfd[-train,]
  • xtest[1:2,]
  • ytrain <- fd$delay[train]
  • ytrain[1:5]
  • ytest <- fd$delay[-train]
  • ytest[1:5]
  • model1 = glm(delay~., family=binomial, data=data.frame(delay=ytrain,xtrain))
  • summary(model1)
  • ## Prediction: predicted delay probabilities for cases in the test set
  • probability.test <- predict(model1, newdata=data.frame(xtest), type="response")
  • data.frame(ytest,probability.test)[1:10,]
  • ## The first column in list represents the case number of the test element
  • plot(ytest~probability.test, col="blue")
  • ## Code the prediction as 1 if the predicted probability is 0.5 or larger
  • ### using floor function
  • probability.fifty = floor(probability.test+0.5)
  • table.ytest = table(ytest,probability.fifty)
  • table.ytest
  • error = ((table.ytest[1,2]+table.ytest[2,1])/delay.length2)
  • error

Figure 5.  The probability of the Delay Using Logistic Regression.
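To relate the confusion table above to the discussion in Task-3, its cells can be converted into the proportion of delayed flights that are caught (sensitivity) and the proportion of on-time flights that are correctly left alone (specificity). The following is an added sketch, assuming both predicted classes appear in the table.ytest object created above.

  • ## Added sketch: sensitivity and specificity from the holdout confusion table
  • sensitivity <- table.ytest[2,2]/sum(table.ytest[2,])   ## delayed flights correctly flagged
  • specificity <- table.ytest[1,1]/sum(table.ytest[1,])   ## on-time flights correctly left alone
  • c(sensitivity=sensitivity, specificity=specificity)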

Task-2:  Naïve Bayesian Model for Predicting Delayed and On-Time Flights

The purpose of this task is to use the naïve Bayesian model for predicting a categorical response from mostly categorical predictor variables. The Dataset consists of 2201 flights in 2004 from Washington, DC, into New York City. The response is whether or not a flight has been delayed by more than 15 minutes (0 = no delay, 1 = delay). The explanatory variables include the following:

  • Three arrival airports (Kennedy, Newark, and LaGuardia).
  • Three different departure airports (Reagan, Dulles, and Baltimore).
  • Eight carriers.
  • A categorical variable for 16 different hours of departure (6:00 AM to 10:00 PM).
  • Weather conditions (0=good, 1=bad).
  • Day of week (7 days with Monday=1, …, Sunday=7).

The R code for the naïve Bayesian model is shown below, followed by the result of each step.

  • fd=data.frame(fd)
  • fd$schedf=factor(floor(fd$schedtime/100))
  • fd$delay=recode(fd$delay,"'delayed'=1;else=0")
  • response=as.numeric(levels(fd$delay)[fd$delay])
  • hist(response, col=”orange”)
  • fd.mean.response=mean(response)
  • fd.mean.response
  • ## Create Train Dataset 60/40
  • n=length(fd$dayweek)
  • n
  • n1=floor(n*(0.6))
  • n1
  • n2=n-n1
  • n2
  • train=sample(1:n,n1)
  • train
  • plot(train, col="blue", main="Train Data Plot")
  • ## Determine Marginal Probabilities
  • td=cbind(fd$schedf[train],fd$carrier[train],fd$dest[train],fd$origin[train],fd$weather[train],fd$dayweek[train],response[train])
  • td
  • tttrain0=td[td[,7]<0.5,]
  • tttrain1=td[td[,7]>0.5,]
  • tttrain0[1:3,]
  • tttrain1[1:3,]
  • plot(td, col="blue", main="Train Data")
  • plot(tttrain0, col="blue", main="Marginal Probability < 0.5")
  • plot(tttrain1, col="blue", main="Marginal Probability > 0.5")
  • ## Prior Probabilities for Delay for P( y = 0 ) and P(y = 1)
  • tdel=table(response[train])
  • tdel=tdel/sum(tdel)
  • tdel
  • ## Output:  P(y=0) = 0.8022727   P(y=1) = 0.1977273
  • ## Probabilities for Scheduled Time
  • ### Probabilities for (y=0) for Scheduled Time
  • ts0=table(tttrain0[,1])
  • ts0=ts0/sum(ts0)
  • ts0
  • ### Probabilities for (y = 1) for Scheduled Time
  • ts1=table(tttrain1[,1])
  • ts1=ts1/sum(ts1)
  • ts1
  • ## Probabilities for Carrier
  • ## Probabilities for (y = 0) for Carrier
  • tc0=table(tttrain0[,2])
  • tc0=tc0/sum(tc0)
  • tc0
  • tc1=table(tttrain1[,2])
  • tc1=tc1/sum(tc1)
  • tc1
  • ## Probabilities for Destination
  • ##Probabilities for (y=0) for Destination
  • td0=table(tttrain0[,3])
  • td0=td0/sum(td0)
  • td0
  • ##Probabilities for (y=1) for Destination
  • td1=table(tttrain1[,3])
  • td1=td1/sum(td1)
  • td1
  • ## Probabilities for Origin
  • ##Probabilities for (y=0) for Origin
  • to0=table(tttrain0[,4])
  • to0=to0/sum(to0)
  • to0
  • ##Probabilities for (y=1) for Origin
  • to1=table(tttrain1[,4])
  • to1=to1/sum(to1)
  • to1
  • ## Probabilities for Weather
  • ## Probabilities for (y=0) for Weather
  • tw0=table(tttrain0[,5])
  • tw0=tw0/sum(tw0)
  • tw0
  • ## Probabilities for (y=1) for Weather
  • tw1=table(tttrain1[,5])
  • tw1=tw1/sum(tw1)
  • tw1
  • ## Bandaid (applied after tw1 is available): among on-time flights the bad-weather
  • ## cell is empty, so tw0 is rebuilt to contain both weather levels
  • tw0=tw1
  • tw0[1]=1
  • tw0[2]=0
  • ## Probabilities for Day of Week
  • #### Probabilities for (y=0) for Day of Week
  • tdw0=table(tttrain0[,6])
  • tdw0=tdw0/sum(tdw0)
  • tdw0
  • #### Probabilities for (y=1) for Day of Week
  • tdw1=table(tttrain1[,6])
  • tdw1=tdw1/sum(tdw1)
  • tdw1
  • ### Create Test Data
  • testdata=cbind(fd$schedf[-train],fd$carrier[-train],fd$dest[-train],fd$origin[-train],fd$weather[-train],fd$dayweek[-train],response[-train])
  • testdata[1:3,]
  • ## With these estimates, the following posterior probability can be determined:
  • ## P(y=1 | Carrier=7, DOW=7, DepTime=9AM-10AM, Dest=LGA, Origin=DCA, Weather=0)
  • ##   = [(0.015)(0.172)(0.027)(0.402)(0.490)(0.920)](0.198) /
  • ##     {[(0.015)(0.172)(0.027)(0.402)(0.490)(0.920)](0.198) + [(0.015)(0.099)(0.059)(0.545)(0.653)(1)](0.802)}
  • ##   = 0.09
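  • ## Added sketch (not part of the original listing): the same Bayes computation for the
  • ## first row of the test data, using the probability tables estimated above; the
  • ## vectorized code that follows repeats this calculation for every test case
  • x1=testdata[1,]
  • num1=ts1[x1[1]]*tc1[x1[2]]*td1[x1[3]]*to1[x1[4]]*tw1[x1[5]+1]*tdw1[x1[6]]*tdel[2]
  • num0=ts0[x1[1]]*tc0[x1[2]]*td0[x1[3]]*to0[x1[4]]*tw0[x1[5]+1]*tdw0[x1[6]]*tdel[1]
  • num1/(num1+num0)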
  • ## Creating Predictions, stored in prediction
  • p0=ts0[testdata[,1]]*tc0[testdata[,2]]*td0[testdata[,3]]*to0[testdata[,4]]*tw0[testdata[,5]+1]*tdw0[testdata[,6]]
  • p1=ts1[testdata[,1]]*tc1[testdata[,2]]*td1[testdata[,3]]*to1[testdata[,4]]*tw1[testdata[,5]+1]*tdw1[testdata[,6]]
  • prediction=(p1*tdel[2])/(p1*tdel[2]+p0*tdel[1])
  • hist(prediction, col="blue", main="Histogram of Predictions")
  • plot(response[-train], prediction, col="blue")
  • ###Coding as 1 if probability >=0.5 
  • ## Calculate the Probability for at least 0.5 or more
  • prob1=floor(prediction+0.5)
  • tr=table(response[-train],prob1)
  • tr
  • error=(tr[1,2]+tr[2,1])/n2
  • error
  • ## Calculate the Probability for at least 0.3 or more
  • prob2=floor(prediction+0.3)
  • tr2=table(response[-train],prob2)
  • tr2
  • error2=(tr2[1,2]+tr2[2,1])/n2
  • error2
  • ## Calculating the lift: cumulative 1's sorted by predicted values versus
  • ## cumulative 1's using the average success probability from the training/evaluation data
  • ## Order the test cases by predicted delay probability (largest first), paired with the actual outcome
  • bb=cbind(prediction,response[-train])
  • bb1=bb[order(prediction,decreasing=TRUE),]
  • ## Average success (delay) probability in the evaluation set
  • xbar=mean(response[-train])
  • axis=dim(n2)
  • ax=dim(n2)
  • ay=dim(n2)
  • axis[1]=1
  • ax[1]=xbar
  • ay[1]=bb1[1,2]
  • for (i in 2:n2) {
  •   axis[i]=i
  •   ax[i]=xbar*i
  •   ay[i]=ay[i-1]+bb1[i,2]
  • }
  • aaa=cbind(bb1[,1],bb1[,2],ay,ax)
  • aaa[1:100,]
  • plot(axis,ay,xlab="Number of Cases",ylab="Number of Successes",main="Lift: Cum Successes Sorted by Predicted Values Using Average Success Probabilities", col="red")
  • points(axis,ax,type="l")

Figure 6. Histogram of the Response (Delay) Variable.

Figure 7: Train Dataset Plot.

Figure 8.  Train Data, Marginal Probability of <0.5 and >0.5.

Figure 9.  Prior Probabilities for Delay: P(y=0) and P(y=1).

Figure 10.  Conditional Probabilities for Scheduled Time: Left (y=0) and Right (y=1).

Figure 11.  Conditional Probabilities for Carrier: Left (y=0) and Right (y=1).

Figure 12.  Conditional Probabilities for Destination: Left (y=0) and Right (y=1).

Figure 13.  Conditional Probabilities for Origin: Left (y=0) and Right (y=1).

Figure 14.  Conditional Probabilities for Weather: Left (y=0) and Right (y=1).

Figure 15.  Conditional Probabilities for Day of Week: Left (y=0) and Right (y=1).

Figure 16.  Test Data Plot.

Figure 17.  Histogram of the Prediction Using Bayesian Method.

Figure 18.  Plot of Prediction to the Response Using the Test Data.

Figure 19.  Classification Tables Using a Cutoff Probability of at Least 0.5 (left) and at Least 0.3 (right).

Figure 20.  Lift: Cum Success Sorted by Predicted Values Using Average Success Probabilities.
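As a cross-check of the hand-computed conditional probability tables above, the same classifier can also be fit with a packaged implementation. The following is a minimal sketch only; it assumes the e1071 package (which is not otherwise used in this project) and relies on the fd, response, and train objects created in Task-2. Its confusion table can be compared with the one produced above, although its handling of the empty bad-weather cell may differ slightly from the manual bandaid.

  • ## Added sketch: naïve Bayes via the e1071 package as a cross-check (assumes e1071 is installed)
  • library(e1071)
  • nb.df <- data.frame(delay=factor(response), schedf=fd$schedf, carrier=fd$carrier, dest=fd$dest, origin=fd$origin, weather=factor(fd$weather), dayweek=factor(fd$dayweek))
  • nb.fit <- naiveBayes(delay~., data=nb.df[train,])
  • nb.prob <- predict(nb.fit, nb.df[-train,], type="raw")[,"1"]
  • table(nb.df$delay[-train], floor(nb.prob+0.5))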

Task-3: Discussion and Analysis

            The descriptive analysis shows that the mean scheduled-time code is 1372, which is less than the median of 1455, indicating a negatively skewed distribution, while the mean departure-time code is 1369, which is less than the median of 1450, confirming the negatively skewed distribution. The carrier results show that DH has the highest frequency (551 flights), followed by RU (408). The destination results show that LGA has the highest frequency (1150), followed by EWR (665) and JFK (386). The origin results show that DCA has the highest frequency (1370), followed by IAD (686) and BWI (145). The results also show that weather is not the primary reason for the delays; only a few delays are associated with bad weather. The descriptive analysis shows that on-time is the most frequent outcome (1773 flights), followed by delayed (428 flights). The average of the delay response is 0.195.

The results show that the success probability, which is the proportion of delayed planes in the training set, is 0.198, as analyzed in Task-2 of Part-II; the failure probability, which is the proportion of on-time flights, is 0.802, as discussed and analyzed in Task-2 of Part-II. The naïve rule, which does not incorporate any covariate information, classifies every flight as being on time because the estimated unconditional probability of a flight being on time, 0.802, is larger than the cutoff of 0.5. Thus, this rule never makes an error when predicting a flight that is on time, but it makes a 100% error when the flight is delayed. The naïve rule fails to identify the 167 delayed flights among the 881 flights of the evaluation Dataset, as shown in Task-1 of Part-II; its misclassification error rate in the holdout sample is 167/881 = 0.189. The logistic regression reduces the overall misclassification error in the holdout (evaluation/test) Dataset to 0.176, which is a modest improvement over the naïve rule (0.189), as illustrated in Task-1 of Part-II. Among the 167 delayed flights, the logistic regression correctly identifies 14 (8.4%) but misses 153/167 (91.6%). Moreover, the logistic regression model predicts 2 of the 714 on-time flights as being delayed, as illustrated in Task-1 of Part-II.

The naïve Bayesian method used probabilities from the training set consisting of 60% randomly selected flights, with the remaining 40% of the 2201 flights serving as the holdout period. The misclassification proportion of the naïve Bayesian method is 19.52%, which is slightly higher than that of the logistic regression. The method correctly identifies 30 of the 167 delayed flights but fails to identify 137/(137+30), or about 82%, of the delayed flights. Moreover, 35/(35+679), or 4.9%, of the on-time flights are predicted as delayed, as illustrated in Task-2 of Part-II and Figure 19.

The lift chart (Figure 20) is constructed with the number of cases on the x-axis and the cumulative true-positive cases on the y-axis. True positives are those observations which are classified correctly. The chart measures the effectiveness of a classification model by comparing the true positives against what would be obtained without a model (Hodeghatta & Nayak, 2016). It also indicates how well the model performs relative to selecting samples randomly from the population (Hodeghatta & Nayak, 2016). The lift chart also allows the performance of different models to be compared on the same set of cases (Hodeghatta & Nayak, 2016). In Figure 20, the lift varies with the number of cases, and the black line is a reference line: it provides the benchmark for predicting positive cases in the absence of a model. The lift curve in Figure 20 graphs the cumulative number of delayed flights, ordered by estimated delay probability, against the number of cases. The reference line expresses the performance of the naïve model, in which the probability of delay is estimated by the proportion of delayed flights in the evaluation sample: with ten flights, for instance, the expected number of delayed flights is 10p, where p is the proportion of delayed flights in the evaluation sample, which is 0.189 in this case. At the very end, the lift curve and the reference line meet; in the beginning, however, the model leads to a "lift." For instance, when picking the 10 cases with the largest estimated success probabilities, all 10 cases turn out to be delayed. If the lift is close to the reference line, then there is not much point in using the estimated model for classification. The overall misclassification rate of the logistic regression is not that different from that of the naïve strategy, which considers all flights as being on time. However, as the lift curve shows, flights with the largest probabilities of being delayed are classified correctly, and the model is quite successful in identifying those flights as being delayed. The lift curve in Figure 20 thus shows that the model gives an advantage in detecting the flights that are most clearly going to be delayed or on time.

References

Ahlemeyer-Stubbe, A., & Coleman, S. (2014). A practical guide to data mining for business and industry: John Wiley & Sons.

Fischetti, T., Mayor, E., & Forte, R. M. (2017). R: Predictive Analysis: Packt Publishing.

Hodeghatta, U. R., & Nayak, U. (2016). Business Analytics Using R-A Practical Approach: Springer.

Ledolter, J. (2013). Data mining and business analytics with R: John Wiley & Sons.

r-project.org. (2018). R: A Language and Environment for Statistical Computing. Retrieved from https://cran.r-project.org/doc/manuals/r-release/fullrefman.pdf.

Quantitative Analysis of “German.Credit” Dataset Using K-NN Classification and Cross-Validation

Dr. O. Aly
Computer Science

Introduction

The purpose of this discussion is to use the german.credit.csv dataset to address the issue of lending that results in default. The two outcomes are success (defaulting on the loan) and failure (not defaulting on the loan). The explanatory variables in the logistic regression are the type of loan and the borrowing amount. For the K-NN classification, three continuous variables are used: duration, amount, and installment. Cross-validation with k=5 for the nearest neighbor is also used in this analysis.

The dataset was downloaded from the following machine learning repository archive: https://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data). The dataset has 1000 observations on 22 variables. There are two versions of the dataset. The original dataset, in the form provided by Professor Hofmann, contains categorical/symbolic attributes and corresponds to the german.credit.csv file used in this discussion. The other dataset, "german.data-numeric," produced by Strathclyde University for algorithms that require numerical attributes, is not used here. This discussion utilizes the original version of german.credit.csv, which contains both the categorical and the continuous variables.

This analysis discusses and addresses fourteen Tasks as shown below:

  • Task-1: Understand the Variables of the Dataset
  • Task-2: Load and Review the Dataset using names(), head(), dim() functions.
  • Task-3: Pre and Post Factor and Level of Categorical Variables of the Dataset.
  • Task-4: Summary and Plot the Continuous Variables: Duration, Amount, and Installment
  • Task-5: Classify Amount into Groups.
  • Task-6: Summary of all selected variables.
  • Task-7: Select and Plot Specific Variables for this analysis.
  • Task-8:  Create Design Matrix
  • Task-9:   Create Training and Prediction Dataset.
  • Task-10:  Implement K-Nearest Neighbor Method.
  • Task-11: Calculate the Proportion of Correct Classification.
  • Task-12: Plot for 3 Nearest Neighbor.
  • Task-13: Cross-Validation with k=5 for the Nearest Neighbor.
  • Task-14: Discussion and Analysis.

Various resources were utilized to develop the required code using R. These resources include (Ahlemeyer-Stubbe & Coleman, 2014; Fischetti, Mayor, & Forte, 2017; Ledolter, 2013; r-project.org, 2018)

Task-1:  Understand the Variables of the Data Sets

The purpose of this task is to understand the variables of the dataset. The dataset is the "german.credit" dataset, which describes clients who may default on a loan. A subset of the 22 variables is selected as the target of this analysis. Table 1 and Table 2 summarize these selected variables for this discussion. Table 1 focuses on the variables with binary and numerical values, while Table 2 focuses on the variables with categorical values.

Table 1:  Binary and Numerical Variables

Table 2: Categorical Variables.

Task-2:  Load and Review the Dataset Using names(), head(), dim() Functions

  • gc <- read.csv("C:/CS871/german.credit.csv")
  • names(gc)
  • head(gc)
  • dim(gc)
  • gc[1:3,]

Task-3:  Pre and Post Factor and Level of Categorical Variables of the Data Sets

  • ## history categorical variable pre and post factor and level.
  • summary(gc$history)
  • plot(gc$history, col="green", xlab="History Categorical Variable Pre Factor and Level")
  • gc$history = factor(gc$history, levels=c("A30", "A31", "A32", "A33", "A34"))
  • levels(gc$history)=c("good-others", "good-thisBank", "current-paid-duly", "bad-delayed", "critical")
  • summary(gc$history)
  • plot(gc$history, col="green", xlab="History Categorical Variable Post Factor and Level")
  • ##### purpose pre and post factor and level
  • summary(gc$purpose)
  • plot(gc$purpose, col="darkgreen")
  • ### transform purpose (A47, vacation, does not occur in the data, so ten labels match the ten levels)
  • gc$purpose <- factor(gc$purpose, levels=c("A40","A41","A42","A43","A44","A45","A46","A48","A49","A410"))
  • levels(gc$purpose) <- c("newcar","usedcar","furniture/equipment","radio/television","domestic appliances","repairs","edu","retraining","business","others")
  • summary(gc$purpose)
  • plot(gc$purpose, col="darkgreen")

Figure 1.  Example of Pre and Post Factor of the Purpose Categorical Variable.

Task-4:  Summary & Plot the Numerical Variables: Duration, Amount, Installment

  • ##summary and plot of those numerical variables
  • summary(gc$duration)
  • summary(gc$amount)
  • plot(gc$amount, col="blue", main="Amount Numerical Variable")
  • summary(gc$installment)
  • plot(gc$installment, col="blue", main="Installment Numerical Variable")

Figure 2:  Duration, Amount, Installment Continuous IV.

Task-5:  Classify the Amount into Groups

  • #### To classify the amount into groups
  • gc$amount <- as.factor(ifelse(gc$amount<=2500, '0-2500', ifelse(gc$amount<=5000, '2600-5000', '5000+')))
  • summary(gc$amount)

Task-6:  Summary of all variables

  • summary(gc$duration)
  • summary(gc$amount)
  • summary(gc$installment)
  • summary(gc$age)
  • summary(gc$history)
  • summary(gc$purpose)
  • summary(gc$housing)
  • summary(gc$rent)

Task-7:  Select and Plot specific variables for this discussion

  • ## Cut the dataset down to the selected variables:
  • ## (duration, amount, installment, and age), which are numeric;
  • ## (history, purpose, foreign, and housing), which are categorical; and
  • ## Default (representing the risk), which is binary.
  • gc.sv <- gc[,c("Default", "duration", "amount", "installment", "age", "history", "purpose", "foreign", "housing")]
  • gc.sv[1:3,]
  • summary(gc.sv)
  • ### Setting the Rent
  • gc$rent <- factor(gc$housing=="A151")
  • summary(gc$rent)
  • plot(gc, col="blue")

Figure 3:  Plot of The Selected Variables.

Task-8:  Create Design Matrix

  • ###Create a Design Matrix
  • ##Factor variables are turned into indicator variables
  • ##The first column of ones is ommitted
  • Xgc <- model.matrix(Default~.,data=gc)[,-1]
  • Xgc[1:3,]

Task-9: Create Training and Prediction Datasets

  • ## creating training and prediction datasets
  • ## select 900 rows for estimation and 100 for testing
  • set.seed(1)
  • train <- sample(1:1000,900)
  • xtrain <- Xgc[train,]
  • xnew <- Xgc[-train,]
  • ytrain <- gc$Default[train]
  • ynew <- gc$Default[-train]

Task-10:  K-Nearest Neighbor Method

  • ## k-nearest neighbor method
  • library(class)
  • nearest1 <- knn(train=xtrain, test=xnew, cl=ytrain, k=1)
  • nearest3 <- knn(train=xtrain, test=xnew, cl=ytrain, k=3)
  • data.frame(ynew,nearest1,nearest3)[1:10,]
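Because k-nearest neighbors relies on Euclidean distance, columns measured on larger numeric scales (such as duration or age) can dominate the distance calculation. The following optional refinement is a sketch only and is not part of the original analysis; it standardizes the design matrix before rerunning the 3-nearest-neighbor classifier.

  • ## Added sketch: standardize the design matrix so no single column dominates the distance
  • Xgc.s <- scale(Xgc)
  • nearest3.s <- knn(train=Xgc.s[train,], test=Xgc.s[-train,], cl=ytrain, k=3)
  • 100*sum(ynew==nearest3.s)/100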

Task-11: Calculate the Proportion of Correct Classification

  • ## calculate the proportion of correct classifications
  • proportion.correct.class1=100*sum(ynew==nearest1)/100
  • proportion.correct.class3=100*sum(ynew==nearest3)/100
  • proportion.correct.class1
  • proportion.correct.class3

Task-12: Plot for 3 Nearest Neighbors

  • ## plot for 3NN
  • plot(xtrain[,c("amount","duration")],
  • col=c(4,3,6,2)[gc[train,"installment"]],
  • pch=c(1,2)[as.numeric(ytrain)],
  • main="Predicted Default, by 3 Nearest Neighbors", xlab="Amount", ylab="Duration", cex.main=.95)
  • points(xnew[,c("amount","duration")],
  • bg=c(4,3,6,2)[gc[-train,"installment"]],
  • pch=c(21,24)[as.numeric(nearest3)], cex=1.2, col=grey(.7))
  • legend("bottomright", pch=c(1,16,2,17), bg=c(1,1,1,1),
  • legend=c("data 0","pred 0","data 1","pred 1"),
  • title="default", bty="n", cex=.8)
  • legend("topleft", fill=c(4,3,6,2), legend=c(1,2,3,4),
  • title="installment %", horiz=TRUE, bty="n", col=grey(.7), cex=.8)

Figure 4:  Predicted Default by 3 Nearest Neighbors.

Task-13: Cross-Validation with k=5 for the nearest neighbor

  • ## The above was for just one training set
  • ## Leave-one-out cross-validation over k = 1, ..., 10 nearest neighbors
  • proportion.corr=dim(10)
  • for (k in 1:10) {
  •   prediction=knn.cv(Xgc, cl=gc$Default, k)
  •   proportion.corr[k]=100*sum(gc$Default==prediction)/1000
  • }
  • proportion.corr
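A quick visual check of these leave-one-out results is sketched below (an addition using base graphics only); it simply plots the percentage of correct classifications against the number of neighbors k.

  • ## Added sketch: percent of correct leave-one-out classifications by number of neighbors k
  • plot(1:10, proportion.corr, type="b", col="blue", xlab="k (number of neighbors)", ylab="Percent correctly classified")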

Task-14: Discussion and Analysis

The descriptive analysis shows that the average duration (Mean=20.9) is higher than the median (Median=18), indicating a positively skewed distribution. The average amount (Mean=3271) is higher than the median (Median=2320), also indicating a positively skewed distribution. The average installment rate (Mean=2.97) is slightly less than the median (Median=3.00), indicating a small negative skew. The results also show that radio/TV ranks first as a loan purpose, followed by new car. A training dataset is created by selecting 900 rows for estimation and 100 for testing. The K-NN method is applied with k=1 and k=3, and the proportion of correct classifications is 60% and 61%, respectively (Figure 4). The cross-validation result with k=5 nearest neighbors is about 65% correct classification.

References

Ahlemeyer-Stubbe, A., & Coleman, S. (2014). A practical guide to data mining for business and industry: John Wiley & Sons.

Fischetti, T., Mayor, E., & Forte, R. M. (2017). R: Predictive Analysis: Packt Publishing.

Ledolter, J. (2013). Data mining and business analytics with R: John Wiley & Sons.

r-project.org. (2018). R: A Language and Environment for Statistical Computing. Retrieved from https://cran.r-project.org/doc/manuals/r-release/fullrefman.pdf.


Examples of Bayesian Analysis in the Context of Social Media

Dr. O. Aly
Computer Science

Introduction

The purpose of this discussion is to provide examples of how Bayesian analysis can be used in the context of social media.  The discussion also summarizes the study which used Bayesian analysis in the context of social media. The discussion begins with Naïve Bayes Classifiers for Mining, followed by the Importance of Social Media, the Social Media Mining, Social Media Mining Techniques, Social Media Mining Process, Data Modelling Step, Twitter Mining Using Naïve Bayes with R.  The discussion ends with additional examples of the Bayesian Analysis methods in Social Media.

Naïve Bayes Classifiers for Mining

Naïve Bayes classifiers are probabilistic classifiers, built using the Bayes Theorem (Kumar & Paul, 2016). Naïve Bayes is also known as prior probability and a class conditional probability classifier, since it uses the prior probability of each feature and generates a posterior probability distribution over all the features (Kumar & Paul, 2016). 

The Naïve Bayes classifier makes the following assumptions about the data (Kumar & Paul, 2016):

  • All the features in the dataset are independent of each other.
  • All the features are important.

Though these assumptions may not be accurate in a real-world scenario, Naïve Bayes is still used in many applications for text classification, such as (Kumar & Paul, 2016):

  • Spam filtering for email applications
  • Social media mining, such as finding the sentiments in a given text.
  • Computer network security applications.

As indicated by Kumar and Paul (2016), the classifier has various strengths, such as:

  • Naïve Bayes classifiers are highly scalable and need less computational cycles when compared with other advanced and sophisticated classifiers.
  • A vast number of features can be taken into consideration.
  • The Naïve Bayes classifiers work well when there is missing data, and the dimensionality of the inputs is high.
  • Naïve Bayes classifiers need only small amounts of training data.

The Importance of Social Media

Traditional media such as radio, newspapers, or television facilitate one-way communication with a limited scope of reach and usability. Although the audience can interact with channels such as radio, the quality and frequency of such communication are limited (Ravindran & Garg, 2015). On the other hand, Internet-based social media offers multi-way communication with features such as immediacy and permanence (Ravindran & Garg, 2015).

Social media is an approach to communication using online tools such as Twitter, Facebook, LinkedIn, and so forth (Ravindran & Garg, 2015).  Social Media is defined by Andreas Kaplan and Michael Haenlein as cited in (Ravindran & Garg, 2015) as follows:  “A group of Internet-based applications that build on the ideological and technological foundations of Web 2.0 and that allow the creation and exchange of user-generated content.”

Social media spans various Internet-based platforms which facilitate activities such as:

  • Networking, such as Facebook, LinkedIn.
  • Microblogging, such as Twitter, Tumblr.  
  • Photo sharing, such as Instagram, Flickr.
  • Video sharing, such as YouTube, Vimeo.
  • Stack exchanging, such as Stack Overflow, Github.
  • Instant messaging, such as Whatsapp, Hike.

The marketing industry is maturing in its understanding of the promise and impact of social media (Ravindran & Garg, 2015). While social media is regarded as a great tool for banner advertisement in terms of cost and reach, it can turn out to be even more influential in the long term (Ravindran & Garg, 2015). Organizations need to find out about the opinions of consumers by mining social networks (Ravindran & Garg, 2015). They can understand the current and potential outlook of consumers by collecting information on their opinions, and such information can guide business decisions and, in the long run, influence the fate of any business (Ravindran & Garg, 2015).

Social Media Mining

Social media mining is the systematic analysis of information generated from social media. The set of tools and techniques used to mine such information are collectively called data mining techniques; in the context of social media, they are called Social Media Mining (SMM) (Ravindran & Garg, 2015). There has been much research in multiple disciplines of social media, such as modeling behavior, predictive analysis, and recommending content (Ravindran & Garg, 2015).

Social Media Mining (SMM) Techniques

Graph mining is one technique of SMM. Graph mining is described as "the process of extracting useful knowledge (patterns, outliers and so on) from a social relationship between the community members [that] can be represented as a graph" (Ravindran & Garg, 2015). The most influential example of graph mining is Facebook Graph Search (Ravindran & Garg, 2015). Text mining is another SMM technique, which includes the extraction of meaning from unstructured text data presented in social media. The primary targets of this type of mining are blogs and microblogs such as Twitter (Ravindran & Garg, 2015).

Social Media Mining Process

The process of social media mining includes the following five steps (Ravindran & Garg, 2015):

  1. Getting authentication from the social website.
  2. Data Visualization.
  3. Cleaning and pre-processing.
  4. Data modeling using standard algorithms such as opinion mining, clustering, anomaly/spam detection, correlations and segmentation, recommendations.
  5. Result Visualization.

Data Modelling Step

This step is the fourth in the social media mining process and includes the application of mining algorithms (Ravindran & Garg, 2015). Standard mining algorithms include opinion mining, or sentiment mining, where the opinion/sentiment present in a given phrase is assessed (Ravindran & Garg, 2015). Although the classification of sentiments is not a simple technique, various classification algorithms have been employed to aid opinion mining. These algorithms vary from simple probabilistic classifiers such as Naïve Bayes, which assumes that all features are independent and does not use any prior information, to more advanced classifiers such as Maximum Entropy, which uses prior information to a certain extent (Ravindran & Garg, 2015). Other classifiers that have been used to classify sentiments include Support Vector Machines (SVM) and Neural Networks (NN) (Ravindran & Garg, 2015). Additional methods include anomaly/spam detection, or social spammer detection (Ravindran & Garg, 2015). Fake profiles created with malicious intentions are known as spam or anomalous profiles (Ravindran & Garg, 2015).

Twitter Mining Using Naïve Bayes with R

In this application, after getting the cleaned Twitter data, R packages are used to assess the sentiments in the tweets. The first step is to obtain authorization using the following calls, the first two of which come from the twitteR package (r-project.org, 2018; Ravindran & Garg, 2015):

  • getTwitterOAuth(consumer_key, consumer_secret)
  • registerTwitterOAuth(OAuth)
  • source("authenticate.R")

Tweets are then collected as a corpus using the searchTwitter() function in R. After obtaining the cleaned Twitter data, a few R packages are used to assess the sentiments in the tweets. The first step is to use a naïve algorithm which gives a score based on the number of times a positive or negative word occurs in a given sentence (Ravindran & Garg, 2015). To estimate the sentiment further, Naïve Bayes is used to decide on the emotion present in any tweet. The Naïve Bayes method requires the R packages Rstem and sentiment to assist with this assessment. These packages have been removed from the CRAN repository; however, they can still be downloaded from the archive at https://cran.r-project.org/src/contrib/Archive/. An additional function used is classify_emotion() from the sentiment package. An example of the result of applying the Naïve Bayes method in R is illustrated in Figure 1 (Ravindran & Garg, 2015).
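A minimal sketch of this workflow is shown below. It assumes the twitteR package for collection and the archived Rstem and sentiment packages for classification, as described by Ravindran and Garg (2015); the search term "#example" and the tweet_texts object are hypothetical placeholders rather than part of the original study.

  • ## Added sketch only: assumes twitteR plus the archived Rstem and sentiment packages
  • library(twitteR)
  • library(Rstem)
  • library(sentiment)
  • ## Collect a small corpus of tweets (hypothetical search term)
  • tweets <- searchTwitter("#example", n=100)
  • tweet_texts <- sapply(tweets, function(t) t$getText())
  • ## Naïve Bayes emotion classification from the sentiment package
  • class_emo <- classify_emotion(tweet_texts, algorithm="bayes", prior=1.0)
  • emotion <- class_emo[, "BEST_FIT"]
  • table(emotion, useNA="ifany")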

Figure 1. Example of the Result When Applying Naïve Bayes Method in R (Ravindran & Garg, 2015).

In summary, the Naïve Bayes method can be used in social media settings such as the Twitter example above. Certain R packages must be installed to complete the Naïve Bayes analysis on a Twitter dataset.

Additional Examples of Bayesian Analysis methods in Social Media

In another application of the Naïve Bayes method to social media, reported by Singh, Singh, and Singh (2017), the Naïve Bayes classifier, a popular supervised classifier, provides a way to express positive, negative, and neutral feelings in web text. It utilizes conditional probability to classify words into their respective categories (Singh et al., 2017). The benefit of using Naïve Bayes for text classification is that it needs only a small dataset for training (Singh et al., 2017). The raw data from the web undergo pre-processing, with removal of numerics, foreign words, HTML tags, and special symbols, yielding the set of words (Singh et al., 2017). The tagging of words with positive, negative, and neutral labels is performed manually by human experts (Singh et al., 2017). This pre-processing produces word-category pairs for the training set (Singh et al., 2017).

The work of Singh et al. (2017) focused on four text classifiers utilized for sentiment analysis: the Naïve Bayes, J48, BFTree, and OneR algorithms. Naïve Bayes was found to be quite fast in learning, whereas the OneR method was found more promising, generating 91.3% precision, 97% F-measure, and 92.34% correctly classified instances (Singh et al., 2017).

Another example of the application of Bayesian analysis is reported in the research paper of Volkova and Van Durme (2015), which proposed two approaches for mining streaming social media. They studied iterative, incremental retraining in batch and online settings, with and without iterative annotations. They treated each new message as independent evidence that is combined into an incremental user-prediction model by applying Bayes' rule, and they explored model training in parallel with its application, rather than assuming a previously existing labeled dataset. They applied Bayes-rule updates to dynamically revise posterior probability estimates of the attribute value in question (Volkova & Van Durme, 2015).

References

Kumar, A., & Paul, A. (2016). Mastering Text Mining with R: Packt Publishing Ltd.

r-project.org. (2018). Package 'twitteR'. Retrieved from https://cran.r-project.org/web/packages/twitteR/twitteR.pdf.

Ravindran, S. K., & Garg, V. (2015). Mastering social media mining with R: Packt Publishing Ltd.

Singh, J., Singh, G., & Singh, R. (2017). Optimization of sentiment analysis using machine learning classifiers. Human-centric Computing and Information Sciences, 7(1), 32.

Volkova, S., & Van Durme, B. (2015). Online Bayesian Models for Personal Analytics in Social Media.

Bayesian Analysis

Dr. Aly, O.
Computer Science

Introduction

The purpose of this discussion is to discuss and analyze Bayesian analysis and the reasons for using it when faced with uncertainty in making decisions.  The discussion also addresses any assumptions of Bayesian analysis, cases which would call for Bayesian analysis, and any problems with the Bayesian analysis.

Probability Theory and Probability Calculus

Various words and terms are used to describe uncertainty and related concepts, such as probability, chance, randomness, luck, hazard, and fate (Hand, Mannila, & Smyth, 2001). Modeling uncertainty is a required component of almost all data analysis (Hand et al., 2001). There are various reasons for such uncertainty. One reason is that the data may be only a sample from the population used for the research study, so the uncertainty concerns the extent to which different samples differ from each other and from the overall population (Hand et al., 2001). Another reason is that a prediction about tomorrow based on today's data is subject to uncertainty about what the future will bring (Hand et al., 2001). A further reason is ignorance: some value cannot be observed, and conclusions must be based on a "best guess" about it (Hand et al., 2001).

There are two distinct aspects of probability: probability theory and probability calculus (Hand et al., 2001). Probability theory is concerned with the interpretation of probability, while probability calculus is concerned with the manipulation of the mathematical representation of probability (Hand et al., 2001).

Probability measures the likelihood that a particular event will occur. Mathematicians refer to a set of potential outcomes of an experiment or trial to which a probability of occurrence can be assigned (Fischetti, Mayor, & Forte, 2017). Probabilities are expressed as a number between 0 and 1 or as a percentage out of 100 (Fischetti et al., 2017). An event with a probability of 0 denotes an impossible outcome, and a probability of 1 describes an event that is certain to occur (Fischetti et al., 2017). An example is a coin flip, where there are two outcomes: heads or tails. Since the entire sample space is covered by these two outcomes, they are said to be collectively exhaustive and mutually exclusive, meaning that they can never co-occur (Fischetti et al., 2017). Thus, the probabilities of obtaining heads or tails are P(heads) = 0.50 and P(tails) = 0.50, respectively. Moreover, when the probability of one outcome does not affect the probability of the other, these events are described as independent (Fischetti et al., 2017). For two independent events A and B, the probability of both A and B occurring is the product of the probability of A and the probability of B (Fischetti et al., 2017).
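For example (a worked toy case added here for concreteness rather than taken from the cited sources): for two independent flips of a fair coin, P(heads on the first flip and heads on the second) = P(heads) × P(heads) = 0.5 × 0.5 = 0.25.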

Subjective vs. Objective Probability Views

In the first half of the 20th century, the dominant approach in statistics was the frequentist one (Hand et al., 2001; O'Hagan, 2004). The frequentist view takes the perspective that probability is an "objective" concept, where the probability of an event is defined as the limiting proportion of times that the event would occur in repetitions of substantially identical situations (Hand et al., 2001). The coin flip mentioned above is an example of this frequentist probability.

In the second half of the 20th century, the dominance of the frequentist view started to fade (Hand et al., 2001; O'Hagan, 2004). Although the vast majority of statistical analysis in practice is still frequentist, a competing view of "subjective probability" has acquired increasing importance (Hand et al., 2001; O'Hagan, 2004). The principles and methodologies for data analysis derived from the "subjective" view are often referred to as "Bayesian" statistics (Fischetti et al., 2017; Hand et al., 2001; O'Hagan, 2004). The calculus is the same for the two viewpoints, even though the underlying interpretation is entirely different (Hand et al., 2001).

Bayesian Theorem

Bayes' Theorem is named after the 18th-century minister Thomas Bayes, whose paper presented to the Royal Society in 1763 first used the Bayesian argument (O'Hagan, 2004). Although the Bayesian approach can be traced back to Thomas Bayes, its modern incarnation began only in the 1950s and 1960s (O'Hagan, 2004).

The central tenet of Bayesian statistics is the explicit characterization of all forms of uncertainty in a data analysis problem, including uncertainty about any parameters estimated from the data, uncertainty as to which among a set of model structures is best or closest to the "truth," uncertainty in any forecast that may be made, and so forth (Hand et al., 2001). Because the Bayesian interpretation is subjective, when evidence is scarce there are sometimes wildly different degrees of belief among different people (Fischetti et al., 2017; Hand et al., 2001; O'Hagan, 2004).

From the perspective of "subjective" probability, the Bayesian interpretation views probability as the degree of belief in a claim or hypothesis (Fischetti et al., 2017; Hand et al., 2001). Bayesian inference provides a method to update that belief in the light of new evidence (Fischetti et al., 2017). Bayes' Theorem is defined in equation (1) (Fischetti et al., 2017; Hand et al., 2001; O'Hagan, 2004).

Where:

  • H is the hypothesis.
  • E is the evidence.
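Equation (1), reconstructed here from the definitions above, is the standard statement of Bayes' Theorem:

P(H | E) = P(E | H) P(H) / P(E)     (1)

where P(H) is the prior probability of the hypothesis, P(E | H) is the likelihood of the evidence given the hypothesis, and P(H | E) is the posterior probability of the hypothesis given the evidence.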

Bayesian Continuous Posterior Distribution

When working with Bayesian analysis, the hypothesis concerns a continuous parameter or many parameters (Fischetti et al., 2017). Bayesian analysis usually yields a continuous posterior called a "posterior distribution" (Fischetti et al., 2017). Bayes methods utilize Bayes' rule, which provides a powerful framework for combining sample information with a prior expert opinion to produce an updated, or posterior, expert opinion (Giudici, 2005). In Bayesian analysis, a parameter is treated as a random variable whose uncertainty is modeled by a probability distribution (Giudici, 2005). This distribution is the expert's prior distribution p(q), stated in the absence of the sampled data (Giudici, 2005). The likelihood is the distribution of the sample, conditional on the values of the random variable q: p(x|q) (Giudici, 2005). Bayes' rule provides an algorithm to update the expert's opinion in the light of the data, producing the so-called posterior distribution p(q|x), as shown in equation (2) below (Giudici, 2005), with c = p(x), a constant that does not depend on the unknown parameter q.
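Equation (2), reconstructed here in the notation of the preceding paragraph, is:

p(q | x) = p(x | q) p(q) / p(x) = (1/c) p(x | q) p(q)     (2)

so the posterior is proportional to the likelihood times the prior.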

The posterior distribution represents the main Bayesian inferential tool (Giudici, 2005). Once it is obtained, it is easy to derive any inference of interest (Giudici, 2005). For instance, to obtain a point estimate, a summary of the posterior distribution is taken, such as the mean or the mode (Giudici, 2005). Similarly, credible (confidence) intervals can be easily derived by taking any two values of q such that the probability of q belonging to the interval described by those two values corresponds to the given confidence level (Giudici, 2005). As q is a random variable, it is now correct to interpret the confidence level as a probabilistic statement: (1 − α) is the coverage probability of the interval, namely the probability that q assumes values in the interval (Giudici, 2005). The Bayesian approach is thus described as a coherent and flexible procedure (Giudici, 2005).

Special Case of Bayesian Estimates: Maximum Likelihood Estimator (MLE)

The MLE is a special case of Bayesian estimation: when a constant (flat) distribution expressing a vague state of prior knowledge is assumed as the prior for q, the posterior mode is equal to the MLE (Giudici, 2005). More generally, when a large sample is considered, the Bayesian posterior distribution approaches an asymptotic normal distribution with the MLE as its expected value (Giudici, 2005).

Bayesian Statistics

In Bayesian statistics, there is a four-step process (O'Hagan, 2004). The first step is to create a statistical model linking the data to the parameters. The second step is to formulate prior information about the parameters. The third step is to combine the two sources of information using Bayes' theorem. The last step is to use the resulting posterior distribution to derive inferences about the parameters (O'Hagan, 2004). Figure 1 illustrates the synthesis of the information by Bayes' theorem.

Figure 1.  Synthesis of the Information by Bayes’ Theorem (O’Hagan, 2004).

Figure 2 illustrates Bayes' Theorem using a "triplot," in which the prior distribution, likelihood, and posterior distribution are all plotted on the same graph. The prior information is represented by the dashed line, lying in this example between -4 and +4. The dotted line represents the likelihood from the data, which favors values of the parameter between 0 and 3 and strongly argues against any value below -2 or above +4 (O'Hagan, 2004). The posterior, represented as a solid line, puts these two sources of information together (O'Hagan, 2004). Thus, for values below -2, the posterior density is minimal because the data say that these values are highly implausible, while values above +4 are ruled out by the prior (O'Hagan, 2004). While the data favor values around 1.5 and the prior prefers values around 0, the posterior listens to both; the synthesis is a compromise, and the parameter is most likely to be around 1 (O'Hagan, 2004).

Figure 2.  Triplot. Prior Density (dashed), Likelihood (dotted), and Posterior Density (solid) (O’Hagan, 2004).

Bayes Assumption, Naïve Bayes, and Bayes Classifier

Any joint distribution can be simplified by making appropriate independence assumptions, essentially approximating a full table of probabilities by products of much smaller tables (Hand et al., 2001). At an extreme, an assumption can be made that all the variables are conditionally independent given the class. Such an assumption is sometimes referred to as the naïve Bayes or first-order Bayes assumption (Alexander & Wang, 2017; Baştanlar & Özuysal, 2014; Hand et al., 2001; Suguna, Sakthi Sakunthala, Sanjana, & Sanjhana, 2017). The conditional independence model is linear in the number of variables (p) rather than being exponential (Hand et al., 2001). To use the model for classification, the product form is simply used for the class-conditional distributions, yielding the naïve Bayes classifier (Hand et al., 2001). The reduction in the number of parameters by using the naïve Bayes model comes at a cost, as an extreme independence assumption is made (Hand et al., 2001). In some cases, the conditional independence assumption can be quite reasonable (Hand et al., 2001); in many practical cases, it may not be realistic (Hand et al., 2001). Although the independence assumption may not be a realistic model of the probabilities involved, it may still permit relatively accurate classification performance (Hand et al., 2001).
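In symbols (a standard statement of the assumption, added here for concreteness rather than quoted from the cited sources), the naïve Bayes classifier approximates the class-conditional joint distribution of p features x1, ..., xp by a product of univariate terms, so that

p(c | x1, ..., xp) ∝ p(c) p(x1 | c) p(x2 | c) ... p(xp | c),

and a new case is assigned to the class c with the largest value of this product.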

The Naïve Bayes model can easily be generalized in many different directions (Hand et al., 2001).  The simplicity, parsimony, and interpretability of the Naïve Bayes model have led to its widespread popularity, particularly in the machine learning literature (Hand et al., 2001).  The model can be generalized equally well by including some but not all dependencies beyond first-order (Hand et al., 2001).  However, the conventional wisdom in practice is that such additions to the model often provide only limited improvements in classification performance on many data sets, underscoring the difference between building accurate density estimators and building good classifiers (Hand et al., 2001).

Markov Chain Monte Carlo (MCMC) vs. Bayesian Analysis

Computing tools developed explicitly for Bayesian analysis are more powerful than anything available for frequentist methods, in the sense that Bayesians can now tackle enormously intricate problems that frequentist methods cannot begin to address (O'Hagan, 2004). The advent of MCMC methods in the early 1990s served to emancipate the implementation of Bayesian analysis (Allenby, Bradlow, George, Liechty, & McCulloch, 2014).

The transformation is continuing, and computational developments are shifting the balance consistently in favor of Bayesian methods (O’Hagan, 2004).  MCMC is a simulation technique, whose concept is to by-pass the mathematical operations rather than to implement them (O’Hagan, 2004).  The Bayesian inference is solved by randomly drawing a sizeable simulated sample from the posterior distribution (O’Hagan, 2004).  The underlying concept of the Bayesian inference is that a sufficiently large sample from any distribution can represent the whole distribution effectively (O’Hagan, 2004).

Bayesian Analysis Application

Although Bayesian methods are not yet the tools of choice in every application area (O'Hagan, 2004), Bayes' Theorem has been applied to, and proven useful in, various disciplines and contexts, such as breaking the German Enigma code during World War II, saving millions of lives (Fischetti et al., 2017). Furthermore, an essential application of Bayes' rule arises in predictive classification problems (Giudici, 2005). Bayesian analyses applied to marketing-science problems have become increasingly popular due to their ability to capture individual-level customer heterogeneity (Allenby et al., 2014). Big Data is promoting the collection and archiving of an unprecedented amount of data (Allenby et al., 2014). The marketing industry is convinced that there is gold in such Big Data (Allenby et al., 2014). Big Data is described as discrete and as huge because of its breadth, not because of its depth (Allenby et al., 2014). It provides significant amounts of shallow data which do not reveal the state of the respondent or the state of the market (Allenby et al., 2014). Bayesian methods are found to be useful in marketing analysis because of their ability to deal with large, shallow datasets and their ability to produce exact, finite-sample inference (Allenby et al., 2014).

The pharmaceutical industry is another example of the use of Bayesian methods.  Pharmaceutical companies are regularly forced to abandon drugs that have just failed to demonstrate beneficial effects in frequentist terms, while a Bayesian analysis would suggest that it is worth persevering (O'Hagan, 2004).  DNA evidence is another example of the use of Bayesian analysis (O'Hagan, 2004).  Other applications of Bayesian methods include fraud detection (Bolton & Hand, 2002) and social networks (Rodriguez, 2012).

Bayesian Analysis Software

There is a growing range of software available to assist with Bayesian analysis (O'Hagan, 2004).  Two particular packages are in general use and freely available: First Bayes and WinBUGS (O'Hagan, 2004).

First Bayes is an elementary program aimed at helping the beginner learn and understand how Bayesian methods work (O'Hagan, 2004).  WinBUGS is a robust program for carrying out MCMC computations and is in widespread use for serious Bayesian analysis (O'Hagan, 2004).

Advantages and Disadvantages of Bayesian Analysis

Bayesian methods and classical methods both have advantages and disadvantages, and there are some similarities (sas.com, n.d.).  When the sample size is large, Bayesian inference often provides results for parametric models which are very similar to those produced by frequentist methods (sas.com, n.d.).  Bayesian Analysis has the following five advantages:

  • It provides a natural and principled technique of combining prior information with data, within a solid decision-theoretical framework.  Thus, past information about a parameter can be incorporated to form a prior distribution for future analysis.  When new observations become available, the previous posterior distribution can be used as a prior, as illustrated in the short sketch after this list.  All inferences logically follow from Bayes' Theorem (sas.com, n.d.).
  • It provides inferences which are conditional on the data and are precise, without reliance on the asymptotic approximation.  Small sample inference proceeds in the same manner as if one had a large sample.  The Bayesian analysis can also estimate any functions of parameters directly, without using the “plug-in” method, which is a way to estimate functionals by plugging the estimated parameters in the functionals (sas.com, n.d.).
  • It adheres to the likelihood principle.  If two distinct sampling designs yield proportional likelihood functions for θ, then all inferences about θ should be identical from these two designs.  Classical inference does not adhere to the likelihood principle in general (sas.com, n.d.).
  • It provides interpretable answers, such as "the true parameter θ has a probability of 0.95 of falling in a 95% credible interval" (sas.com, n.d.).
  • It provides a convenient setting for a wide range of models, such as hierarchical models and missing data problems (sas.com, n.d.).    
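
The following short R sketch illustrates the point in the first advantage about reusing a posterior as a prior; the conjugate Beta-Binomial model and the success/failure counts are illustrative assumptions.

  • ## Conjugate Beta-Binomial sketch: yesterday's posterior becomes today's prior.
  • ## All counts below are illustrative assumptions.
  • prior.a <- 1; prior.b <- 1                        ## flat Beta(1,1) prior
  • post.a <- prior.a + 12; post.b <- prior.b + 8     ## posterior after 12 successes, 8 failures
  • post2.a <- post.a + 3; post2.b <- post.b + 7      ## new data: reuse the posterior as the prior
  • c(mean1 = post.a/(post.a+post.b), mean2 = post2.a/(post2.a+post2.b))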

Bayesian Analysis also has the following disadvantages:

  • It does not tell how to select a prior, and there is no single correct method for choosing one.  Bayesian inference requires skill to translate subjective prior beliefs into a mathematically formulated prior, and it can generate misleading results if this is done without caution (sas.com, n.d.).
  • It can produce posterior distributions which are heavily influenced by the priors.  From a practical point of view, subject matter experts might disagree with the validity of the chosen prior (sas.com, n.d.).
  • It often comes with a high computational cost, especially in models with a large number of parameters.  Also, simulations provide slightly different answers unless the same random seed is used.  However, these slight variations do not contradict the claim that Bayesian inferences are exact: the posterior distribution of a parameter is exact, given the likelihood function and the priors, while simulation-based estimates of posterior quantities can vary because of the random number generation used in the procedures (sas.com, n.d.).

References

Alexander, C., & Wang, L. (2017). Big data analytics in heart attack prediction. The Journal of Nursing Care, 6(393).

Allenby, G. M., Bradlow, E. T., George, E. I., Liechty, J., & McCulloch, R. E. (2014). Perspectives on Bayesian Methods and Big Data. Customer Needs and Solutions, 1(3), 169-175.

Baştanlar, Y., & Özuysal, M. (2014). Introduction to machine learning miRNomics: MicroRNA Biology and Computational Analysis (pp. 105-128): Springer.

Bolton, R. J., & Hand, D. J. (2002). Statistical fraud detection: A review. Statistical Science, 235-249.

Fischetti, T., Mayor, E., & Forte, R. M. (2017). R: Predictive Analysis: Packt Publishing.

Giudici, P. (2005). Applied data mining: statistical methods for business and industry: John Wiley & Sons.

Hand, D. J., Mannila, H., & Smyth, P. (2001). Principles of data mining: MIT Press.

O’Hagan, A. (2004). Bayesian statistics: principles and benefits. Frontis, 31-45.

Rodriguez, A. (2012). Modeling the dynamics of social networks using Bayesian hierarchical blockmodels. Statistical Analysis and Data Mining: The ASA Data Science Journal, 5(3), 218-234.

sas.com. (n.d.). Bayesian Analysis: Advantages and Disadvantages. Retrieved from https://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_introbayes_sect006.htm.

Suguna, M., Sakthi Sakunthala, N., Sanjana, S., & Sanjhana, S. (2017). A Survey on Prediction of Heart Diseases Using Big Data Algorithms.

Quantitative Analysis of the “German.Credit” Dataset Using Logistic Regression

Dr. O. Aly
Computer Science

Abstract

The purpose of this project is to analyze the German Credit dataset. The project is divided into two main Parts.  Part-I evaluates and examines the Dataset for understanding the Dataset using RStudio.  Part-I involves six significant tasks.  Part-II discusses the Pre-Data Analysis, by converting the Dataset to a Data Frame, and involves five significant tasks to analyze the Data Frame.  Although the project implements three models, namely Standard Linear Regression, Multiple Linear Regression, and Logistic Regression, the Logistic Regression is the emphasis of this project.  The result shows that duration, amount, installment, and rent have positive coefficient values, indicating that they have a positive impact on the probability of the dependent binary outcome (Default).  As the p-value is much less than 0.05 for duration, amount, installment, history, the purposes of used car, goods, repair, and business, and rent, we reject the null hypothesis that the parameter is not significant to the model and accept the alternative hypothesis that it is significant.  The p-value for age is (p=0.05), indicating that we accept the null hypothesis that the parameter is not significant to the model.  The p-value for the purpose of education is > 0.05, indicating that we accept the null hypothesis that the parameter is not significant to the model.  The performance of the Logistic Regression model on the test dataset shows that it correctly recognizes 23 of the 28 Defaults (82%) and incorrectly predicts default for 42 of the 72 good loans (58%).

Keywords: German-Credit Dataset; Regression; Logistic Regression Analysis Using R.

Introduction

This project examines and analyzes the dataset (german.credit.csv).  The dataset is downloaded from the following archive site of the machine learning repository:

https://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data).

The dataset has 1,000 observations on 22 variables.  There are two versions of the German Credit dataset.  The original dataset, in the form provided by Professor Hofmann, contains categorical/symbolic attributes and is in the german.credit.csv file used in this project.  The other dataset, "german.data-numeric," which was produced by Strathclyde University for algorithms that need numerical attributes, is not used in this project.  This project utilizes the original version of german.credit.csv, which has categorical variables, because these categorical variables will be transformed during the analysis process to generate various models, including linear models and a logistic model.

There are two Parts to this project.  Part-I addresses six tasks to examine and understand the dataset using R before the analysis, as follows:

  • Task-1: Review the Variables of the Dataset
  • Task-2: Understand the Dataset using names(), head(), and dim() functions
  • Task-3: Pre and Post Factor and Level of Categorical Variables of the Dataset.
  • Task-4: Summary and Plot the Continuous Variables: Duration, Amount, and Installment
  • Task-5: Classify Amount into Groups.
  • Task-6: Summary of all selected variables.

Part-II addresses the analysis using R and includes the following five tasks, which are followed by the discussion and analysis of the results.

  • Task-1: Select and Plot Specific Variables for this Project.
  • Task-2: Model-1: Linear Regression – Single Variable, and Diagnostic Analysis.
  • Task-3: Model-2: Multiple Regression Analysis.
  • Task-4: Model-3: Logistic Regression Analysis.
  • Task-5: Discussion and Analysis.

There is an assumption that, on average, lending to a borrower who defaults is five times as costly as not lending to a good debtor (default = success = 1, no default = failure = 0).  Defaulting on a loan is defined as "success," while not defaulting is defined as "failure."  If a probability (p) of default is estimated, the expected cost is 5p if the bank makes the loan, and 1(1-p) if the bank refuses to make the loan.  Thus, if 5p < 1-p, the bank can expect to lose less by making the loan than by turning away business.  The following decision rule is implied: make the loan if the probability of default p < 1/6.  A Default (success) is predicted whenever p ≥ 1/6.
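
The cost argument above can be checked with a few lines of R; the example probabilities below are illustrative assumptions.

  • ## Sketch of the 5:1 cost argument behind the 1/6 cutoff (illustrative probabilities).
  • p <- c(0.05, 0.10, 0.20, 0.40)          ## estimated default probabilities
  • cost.lend <- 5*p                        ## expected cost of lending to a defaulter
  • cost.refuse <- 1*(1-p)                  ## expected cost of refusing a good debtor
  • data.frame(p, cost.lend, cost.refuse, make.loan = cost.lend < cost.refuse)
  • ## make.loan is TRUE exactly when p < 1/6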

This project analyzes the German Credit dataset.  The two possible outcomes are success (defaulting on the loan) and failure (not defaulting on the loan).  The project implements three models: Linear Regression, Multiple Linear Regression, and Logistic Regression.  However, the emphasis of this project is on the Logistic Regression (Model-3) to estimate the probability of default, using continuous variables (duration, amount, installment, age) and categorical variables (loan history, purpose, rent) as explanatory (independent) variables.

Various resources were utilized to develop the required code using R. These resources include (Ahlemeyer-Stubbe & Coleman, 2014; Fischetti, Mayor, & Forte, 2017; Ledolter, 2013; r-project.org, 2018).

Part-I:  Understand and Examine the Dataset “german.credit.csv”

Task-1:  Review the Variables of the Dataset

The purpose of this task is to understand the variables of the dataset.  The dataset is the "german.credit" dataset, which describes clients who may default on a loan.  Selected variables out of the 22 variables are targeted for this project.  Table 1 and Table 2 summarize these selected variables.  Table 1 focuses on the variables with binary and numerical values, while Table 2 focuses on the variables with categorical values.

Table 1:  Binary (Default) and Continuous (Numerical) Variables

Table 2: Categorical Variables.

Task-2:  Load and Understand the Dataset Using names(), head(), dim() Functions.

            The purpose of this task is to load and understand the data set using names(), head(), and dim() functions as shown in the code below.

  • gc <- read.csv("C:/CS871/german.credit.csv")
  • names(gc)
  • head(gc)
  • dim(gc)
  • gc[1:3,]

Task-3:  Pre and Post Factor and Level of Categorical Variables of the Dataset

            The purpose of this task is to factor and level all categorical variables and to show the result before and after this process, as shown in the code and snapshots below.

  • ### history categorical variable pre and post factor and level.
  • summary(gc$history)
  • plot(gc$history, col="green", xlab="History Categorical Variable Pre Factor and Level")
  • gc$history = factor(gc$history, levels=c("A30", "A31", "A32", "A33", "A34"))
  • levels(gc$history)=c("good-others", "good-thisBank", "current-paid-duly", "bad-delayed", "critical")
  • summary(gc$history)
  • plot(gc$history, col="green", xlab="History Categorical Variable Post Factor and Level")

Figure 1. Pre and Post Factor and Level of History Categorical Variable.

  • ##### purpose pre and post factor and level
  • summary(gc$purpose)
  • plot(gc$purpose, col="darkgreen")
  • ### transform purpose
  • ### A47 (vacation) is included so the level codes match the eleven labels below
  • gc$purpose <- factor(gc$purpose, levels=c("A40","A41","A42","A43","A44","A45","A46","A47","A48","A49","A410"))
  • levels(gc$purpose) <- c("newcar","usedcar","furniture/equipment","radio/television","domestic appliances","repairs","edu","vacation-doesNotExist","retraining","business","others")
  • summary(gc$purpose)
  • plot(gc$purpose, col="darkgreen")

Figure 2. Pre and Post Factor and Level of Purpose Categorical Variable.

  • ### housing categorical variable pre and post factor and level.
  • summary(gc$housing)
  • gc$rent <- factor(gc$housing=="A151")
  • summary(gc$rent)
  • ##### housing categorical variable pre and post factor and level
  • summary(gc$housing)
  • plot(gc$housing, col="orange")
  • gc$housing <- factor(gc$housing, levels=c("A151", "A152", "A153"))
  • levels(gc$housing) <- c("rent","own","for free")
  • summary(gc$housing)

Figure 3. Pre and Post Factor and Level of Housing Categorical Variable.

 Task-4:  Summary & Plot the Numerical Variables: Duration, Amount, Installment, & Age.

            The purpose of this task is to obtain the summary of the numerical variables: duration, amount, installment and age and plot the result.

  • ## summary and plot of the numerical variables
  • summary(gc$duration)
  • plot(gc$duration, col="blue", main="Duration Numerical Variable")

Figure 4. Duration Plot as a Numerical and Continuous Variable.

  • summary(gc$amount)
  • plot(gc$amount, col="blue", main="Amount Numerical Variable")

Figure 5. Amount Plot as a Numerical and Continuous Variable.

  • summary(gc$installment)
  • plot(gc$installment, col="blue", main="Installment Numerical Variable")

Figure 6. Installment Plot as a Numerical and Continuous Variable.

  • summary(gc$age)
  • plot(gc$age, col="blue", main="Age Numerical Variable")

Figure 7.  Age Plot as a Numerical and Continuous Variable.

Task-5:  Classify the Amount into Groups

            The purpose of this task is to classify the amount into groups as shown below.

  • #### To classify the amount into groups
  • gc$amount <- as.factor(ifelse(gc$amount <= 2500, '0-2500', ifelse(gc$amount <= 5000, '2600-5000', '5000+')))
  • summary(gc$amount)
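
As a usage note, the same grouping can also be produced with cut(); the sketch below assumes it is run before gc$amount has been overwritten by the ifelse() version above, and the group name amount.group is introduced here only for illustration.

  • ## Equivalent grouping with cut(), assuming gc$amount is still numeric at this point.
  • gc$amount.group <- cut(gc$amount, breaks = c(0, 2500, 5000, Inf),
  • labels = c("0-2500", "2600-5000", "5000+"))
  • summary(gc$amount.group)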

Task-6:  Summary of all variables

The purpose of this task is to obtain the result of all required variables.

  • summary(gc$duration)
  • summary(gc$amount)
  • summary(gc$installment)
  • summary(gc$age)
  • summary(gc$history)
  • summary(gc$purpose)
  • summary(gc$housing)
  • summary(gc$rent)

Part-II: Plots, Discussion and Analysis

Task-1:  Select and Plot specific variables for this project

            The purpose of this task is to select the specific required variable and plot them into one single diagram.

  • ##for demonstration, cut the dataset to these selected variables
  • ##(duration, amount, installment, and age) which are numeric and
  • ##(history, purpose and housing) which are categorical and
  • ## Default (representing the risk) which is binary.
  • gc.sv <- gc[,c("Default", "duration", "amount", "installment", "age", "history", "purpose", "foreign", "housing")]
  • gc.sv[1:3,]
  • summary(gc.sv)
  • plot(gc.sv, col="red")
  • ### Setting the Rent
  • gc$rent <- factor(gc$housing=="A151")
  • summary(gc$rent)

Figure 8.  Plot of all Selected Variables.

Task-2: Model-1: Linear Regression – single variable, and Diagnostic Analysis

                The purpose of this task is to implement Model-1 for Linear Regression using one variable (History) as a factor for the Default.

  • # to get a plot of the Default as dependent and history as independent
  • plot(Default~history, data=gc, col="darkred")
  • ##calculate the mean first – but no mean because of NA
  • mean(gc$Default)
  • # to check the NA values
  • gc$Default
  • ##we have to remove null values
  • mean(gc$Default, na.rm=T)
  • ##get the mean without NA values
  • mean.Default=mean(gc$Default,na.rm=T)
  • ## plot again after removing the NA value
  • plot(Default~history, data=gc, col="darkred", main="Default and History Effect")
  • #put horizontal line at the mean
  • abline(h=mean.Default)
  • #use lm to fit a regression line through these data:
  • model1.Default.history = lm(Default ~ history, data=gc)
  • model1.Default.history
  • abline(model1.Default.history, col="yellow")
  • plot(model1.Default.history)
  • # to see the effect of history on Default
  • termplot(model1.Default.history)
  • #summary
  • summary(model1.Default.history)

Figure 9.  Model-1: Linear Regression Using the Mean Line (black).

Figure 10.  Model-1: Linear Regression Using Regression Line (yellow).

Figure 11.  Model-1: Four Diagnostic Plots For Standard Linear Regression.

Task-3: Model-2: Multiple Regression Analysis

            The purpose of this task is to implement Model-2 for Multiple Regression, including Purpose, History, Installment, Housing, Amount, Duration, and Age, which can influence the Default as the dependent variable.

  • #Find relationship between multiple variables
  • ##purpose on Default conditioned with history.
  • coplot(Default~purpose|history, panel=panel.smooth, gc)  
  • ##rent on Default conditioned with history.
  • coplot(Default~rent|history, panel=panel.smooth, gc)  
  • ##housing on Default conditioned with history.
  • coplot(Default~housing|history, panel=panel.smooth, gc)  

Figure 12.  Model-2: Multiple Linear Regression Using Purpose, Housing, Rent, conditioned with History as factors influencing Default.

  • ##amount on Default conditioned with duration.
  • coplot(Default~amount|duration, panel=panel.smooth, gc)  
  • ##installment on Default conditioned with duration.
  • coplot(Default~installment|duration, panel=panel.smooth, gc)
  • ##age on Default conditioned with duration.
  • coplot(Default~age|duration, panel=panel.smooth, gc)  

Figure 13.  Model-2: Multiple Linear Regression Using Amount, Duration as Factors influencing Default.

Task-4: Model-3:  Logistic Regression Analysis

            The purpose of this task is to implement the Logistic Regression Analysis.

  • summary(gc)
  • ###Create a Design Matrix
  • ##Factor variables are turned into indicator variables
  • ##The first column of ones is omitted
  • Xgc <- model.matrix(Default~.,data=gc)[,-1]
  • Xgc[1:3,]
  • ## Create training and prediction datasets
  • ## Select 900 rows for estimation and 100 for testing
  • set.seed(1)
  • train <- sample(1:1000,900)
  • xtrain <- Xgc[train,]
  • xnew <- Xgc[-train,]
  • ytrain <- gc$Default[train]
  • ynew <- gc$Default[-train]
  • gcglm=glm(Default~.,family=binomial,data=data.frame(Default=ytrain,xtrain))
  • summary(gcglm)
  • plot(gcglm, col="blue")
  • coef(gcglm)

Figure 14.  Model-3: Logistic Regression: Diagnostic Analysis: Residuals v. Fitted.

Figure 15.  Model-3: Logistic Regression: Diagnostic Analysis: Normal Q-Q.

Figure 16.  Model-3: Logistic Regression: Diagnostic Analysis: Scale-Location.

Figure 17.  Model-3: Logistic Regression: Diagnostic Analysis: Residuals vs. Leverage.

  • ## Now to prediction: what are the underlying default probabilities
  • ## for cases in the test set
  • ptest <- predict(gcglm, newdata=data.frame(xnew), type="response")
  • data.frame(ynew,ptest)
  • ## Using probability cutoff 1/6
  • ## coding as 1 (predicting default) if probability 1/6 or larger
  • probability.one.sixth=floor(ptest+(5/6))
  • probability1=table(ynew,probability.one.sixth)
  • probability1
  • error=(probability1[1,2]+probability1[2,1])/100
  • error
  • ### Binary classification: using probabilities to make decisions, sensitivity and specificity,
  • ### Cut 1/6
  • cut=1/6
  • true.positive <- ynew==1 & ptest >= cut
  • true.positive
  • true.negative <- ynew==0 & ptest < cut
  • true.negative
  • # Sensitivity (predict default when it does happen)
  • sum(true.positive)/sum(ynew==1)
  • # Specificity (predict no default when it does not happen)
  • sum(true.negative)/sum(ynew==0)
  • ## Using probability cutoff 1/2
  • ## coding as 1 if probability 1/2 or larger
  • cut=1/2
  • probability.half=floor(ptest+(1/2))
  • probability2=table(ynew,probability.half)
  • probability2
  • error=(probability2[1,2]+probability2[2,1])/100
  • error
  • true.positive <- ynew==1 & ptest >=cut
  • true.negative <- ynew==0 & ptest < cut
  • # Sensitivity (predict default when it does happen)
  • sum(true.positive)/sum(ynew==1)
  • # Specificity (predict no default when it does not happen)
  • sum(true.negative)/sum(ynew==0)
  • ## R macro for plotting the ROC curve
  • ## plot the ROC curve for classification of y with p
  • roc <- function(p,y){
  • y <- factor(y)
  • n <- length(p)
  • p <- as.vector(p)
  • Q <- p > matrix(rep(seq(0,1,length=500),n),ncol=500,byrow=TRUE)
  • fp <- colSums((y==levels(y)[1])*Q)/sum(y==levels(y)[1])
  • tp <- colSums((y==levels(y)[2])*Q)/sum(y==levels(y)[2])
  • plot(fp, tp, xlab="1-Specificity", ylab="Sensitivity")
  • abline(a=0,b=1,lty=2,col=8)
  • }
  • ###### ROC for hold-out period
  • roc(p=ptest,y=ynew)
  • ## ROC for all cases (in-sample)
  • gcglmall <- glm(gc$Default ~ Xgc,family=binomial)
  • roc(p=gcglmall$fitted, y=gcglmall$y)
  • ## using the ROCR package to graph the ROC curves
  • library(ROCR)

Figure 18.  Model-3: Logistic Regression: Sensitivity and Specificity for a sample of the data.

Figure 19.  Model-3: Logistic Regression: Sensitivity and Specificity for All the data.

  • ## input is a data frame consisting of two columns
  • ## predictions in first column and actual outcomes in the second
  • ## ROC for hold-out period
  • predictions=ptest
  • labels=ynew
  • data=data.frame(predictions,labels)
  • data
  • ## prediction1: function to create prediction objects
  • prediction1 <- prediction(data$predictions,data$labels)
  • prediction1
  • ## perf: creates the input to be plotted
  • ## sensitivity and one minus specificity (the false positive rate)
  • performance <- performance(prediction1, "sens", "fpr")
  • performance
  • plot(performance)

Figure 20.  Model-3: Logistic Regression: ROC for the hold-out cases.

  • ## ROC for all cases (in-sample)
  • gcglmall <- glm(gc$Default ~ Xgc,family=binomial)
  • predictions=gcglmall$fitted
  • labels=gcglmall$y
  • data=data.frame(predictions,labels)
  • prediction2 <- prediction(data$predictions,data$labels)
  • performance2 <- performance(prediction2, "sens", "fpr")
  • plot(performance2)

Figure 21.  Model-3: Logistic Regression: ROC for all cases.

Task-5:  Discussion and Analysis

The descriptive analysis of history (Figure 1) shows that existing credits paid back duly till now rank first with 530 cases, followed by the critical account with 293, and then the category of delay in paying off in the past with 88.  The category of all credits at this bank paid back duly includes only 49 cases, and no credits taken or all credits paid back duly has only 40.  The descriptive analysis of purpose (Figure 2) shows that radio/TV, with 280 loans, ranks first, followed by the new-car category with 234, furniture/equipment with 181, and used car with 103.  Business and education have 97 and 50 respectively, while repairs, domestic appliances, retraining, and others have the lowest ranks for loans.  The descriptive analysis of housing (Figure 3) shows that owners (Own=713) rank first among those who receive loans, followed by renters (Rent=179).  The category of "free" housing (Free=108) ranks last for receiving loans.

The descriptive analysis of duration (Figure 4) shows that the average duration (Mean=21 months) is higher than the median duration (Median=18 months), indicating a positively skewed distribution.  The maximum duration is 72 months, while the minimum duration for a loan is four months.  Most durations are up to 60 months, and there is an outlier of roughly 70 months as illustrated in Figure 4.  The descriptive analysis of amount (Figure 5) shows that the average loan amount (Mean=3271) is higher than the median amount (Median=2320), indicating a positively skewed distribution.  The maximum loan amount is 18,424, while the minimum is 250.  Most loans have amounts up to roughly 5,000, followed by amounts of 10,000-15,000, and there is an outlier above 15,000 as illustrated in Figure 5.

The descriptive analysis of installment (Figure 6) shows that the average installment (Mean=2.98) is higher than the median installment (Median=1.00), indicating a positively skewed distribution.  The maximum installment rate is 4.00, while the minimum is 1.00.  As illustrated in Figure 6, the installment rates fall into the categories 1.00, 2.00, 3.00, and 4.00.  The descriptive analysis of age (Figure 7) shows that the average age (Mean=36) is higher than the median age (Median=33), indicating a positively skewed distribution.  The maximum age is 75, while the minimum age is 19.  As illustrated in Figure 7, most ages range from about 20 to 50, with lower density from 50 to 60 and the lowest density above 60 and 70.

The Linear Regression is the first model (Model-1), implemented with the history explanatory variable as a predictor of the Default.  The mean of Default (Mean=0.3) is plotted as a horizontal line in Figure 9 of the Linear Regression of history on Default.  Figure 10 adds the fitted regression line.  The result shows a negative relationship between Default (success) and history: a bad history is associated with defaulting on a loan.  The diagnostic plots of the standard regression are also discussed in this project; Figure 11 illustrates the four diagnostic plots of the standard regression.

This analysis also covers the residuals and fitted values.  Figure 11 illustrates the Residuals vs. Fitted plot of the Linear Regression Model with history as a function of the Default.  The residuals depict the difference between the actual value of the response variable and the value predicted using the regression equation (Hodeghatta & Nayak, 2016).  The principle behind the regression line and the regression equation is to reduce this error or difference (Hodeghatta & Nayak, 2016).  The expectation is that the median residual should be near zero (Hodeghatta & Nayak, 2016).  For the model to pass the test of linearity, there should be no pattern in the distribution of the residuals (Hodeghatta & Nayak, 2016); where no pattern exists, the condition of linearity is satisfied (Hodeghatta & Nayak, 2016).  The plot of the fitted values against the residuals, with the fitted line, shows the relationship between the two.  A horizontal, straight line indicates that the "average residual" is more or less the same for all fitted values (Navarro, 2015).  The result for the history and Default variables shows that the residuals have a curved pattern, indicating that a better model could be obtained using a quadratic term, because ideally this line should be straight and horizontal.

Figure 11 also illustrates the Normal Q-Q plot, which is used to test the normality of the residual distribution (Hodeghatta & Nayak, 2016).  The residuals are not on the straight line, indicating that they are not normally distributed; hence the normality test of the residuals does not pass, as the points should fall on a straight line for it to pass.  Figure 11 also illustrates the Scale-Location plot, which is one of the graphs generated as part of the plot() output.  The points are spread in a random fashion around the horizontal line but not equally along it.  If the line were horizontal with equally and randomly spread points, the result would indicate that the assumption of constant error variance, or homoscedasticity, is fulfilled (Hodeghatta & Nayak, 2016); thus, it is not fulfilled in this case.  Figure 11 also illustrates the Residuals vs. Leverage plot generated for the Linear Regression Model.  In this plot, the patterns are not as relevant as in the other diagnostic plots of the Linear Regression; instead, the outlying values at the upper-right or lower-right corners are watched (Bommae, 2015).  Those spots are the places where a case can be influential against a regression line (Bommae, 2015).  When cases fall outside of Cook's distance, meaning they have high Cook's distance scores, they are influential to the regression results (Bommae, 2015).  The Cook's distance lines (red dashed lines) are far from the cases, indicating that there is no influential case.

The Multiple Linear Regression is the second model (Model-2).  Multiple variables, such as purpose conditioned on history, rent conditioned on history, and housing conditioned on history, are used to implement the Multiple Regression models.  There are two noticeable results (Figure 12).  The first noticeable result shows that, for the category of "good to other banks," the Default (success) goes down when the loans are made for the purposes of new car, furniture, radio/TV, domestic appliances, and repair, while the Default (success) goes up for education, retraining, business, and others.  The second noticeable result is illustrated in the category of "currently paid duly": the Default goes down for the same purposes as in the first result, but only up to radio/TV; for the remaining purposes of domestic appliances, repair, education, retraining, business, and others, the loans are not defaulted on.  The other Multiple Regression uses rent and housing conditioned on history: the "good to others" category shows no default, while the "own" category shows default for the "currently paid duly" category.  For rent specifically, the result shows no default on the loan in either of these categories.  For the Multiple Regression models of amount, installment, and age conditioned on duration, the result shows that, for loans of 5000+, when the duration increases the default increases, and when the duration decreases the default decreases, meaning there is a significant chance of defaulting on the loan.  A similar pattern shows that when the installment increases, the default on the loan decreases, and the same holds for age: when age increases, the default on the loan decreases.

The Logistic Regression Model (Model-3) is the emphasis of this project.  The Logistic Regression is used to predict the probability of Default on a loan based on the other independent variables; in this case, the probability of Default is based on the identified required independent and explanatory variables of duration, amount, installment, and age (continuous variables), and history, purpose, and rent (categorical variables).

A random selection of 900 of the 1,000 cases is made for the Training set, and the remaining 100 cases form the Test set.  The result shows the coefficient estimates, starting with a negative intercept of -2.705e-01 with a standard error of 4.833e-01; the coefficient for each of the identified explanatory variables is also estimated.  The standard error of a coefficient estimate represents the accuracy of the coefficient: the larger the standard error, the less confidence in the estimate.  The (z) value is the z-statistic, which is the coefficient estimate divided by its standard error.  The Pr(>|z|) is the last column in the result of the logistic analysis, giving the p-value corresponding to the z-statistic; the smaller the p-value, the more significant the estimate.  A one-unit increase in duration is associated with an increase in the logarithm of the odds of the observation being a Default.  As indicated in (Fischetti et al., 2017), if a coefficient is positive it has a positive impact on the probability of the dependent variable, and if a coefficient is negative it has a negative impact on the probability of the binary outcome (Default).  Age, the poor and terrible history categories, and the purposes of used car, goods or repairs, education, and business have negative coefficient values, indicating a negative impact on the probability of the binary outcome of Default.  Duration, amount, installment, and rent show positive coefficient values, indicating that they have a positive impact on the probability of the dependent binary outcome (Default).  As the p-value is much less than 0.05 for duration, amount, installment, history, the purposes of used car, goods, repair, and business, and rent, we reject the null hypothesis that the parameter is not significant to the model and accept the alternative hypothesis that it is significant.  The p-value for age is (p=0.05), indicating that we accept the null hypothesis that the parameter is not significant to the model.  The p-value for the purpose of education is > 0.05, indicating that we accept the null hypothesis that the parameter is not significant to the model.  The performance of the Logistic Regression model on the Test dataset shows that it recognizes 23 of the 28 Defaults (82%) and incorrectly predicts default for 42 of the 72 good loans (58%).  The coordinates of sensitivity and specificity (sensitivity = 0.82, 1-specificity = 0.58) define one point on the ROC curve (Figure 18).  With sensitivity = 8/28 = 0.29 and specificity = 68/72 = 0.94, another point on the ROC curve is (sensitivity = 0.29, 1-specificity = 0.06) (Figure 19).  The ROC is traced out by varying the cutoff on the probability.  The ROC curves assess the predictive quality of the classification rule on the holdout sample of 100 observations (Figure 18) and on the complete data set of all 1,000 cases (Figure 19).  Specified values of sensitivity and specificity imply a certain value for the probability cutoff.  However, for certain data and models, no cutoff may achieve the desired properties of sensitivity and specificity, implying that the desired sensitivity and specificity cannot be attained (Ledolter, 2013).
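
As a small check of the arithmetic quoted above for the 1/6 cutoff on the 100-case Test set, the counts from the discussion can be arranged in R as follows; the counts are taken from the text and are not recomputed here.

  • ## Sensitivity/specificity arithmetic for the 1/6 cutoff (counts from the text).
  • TP <- 23; FN <- 28 - 23    ## defaults: recognized vs. missed
  • FP <- 42; TN <- 72 - 42    ## good loans: flagged as default vs. cleared
  • c(sensitivity = TP/(TP+FN),           ## 23/28 = 0.82
  • one.minus.specificity = FP/(FP+TN))   ## 42/72 = 0.58
  • ## so specificity = 30/72 = 0.42 at this cutoff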

References

Ahlemeyer-Stubbe, A., & Coleman, S. (2014). A practical guide to data mining for business and industry: John Wiley & Sons.

Bommae, K. (2015). Understanding Diagnostic Plots of Linear Regression Analysis. Retrieved from https://data.library.virginia.edu/diagnostic-plots/.

Fischetti, T., Mayor, E., & Forte, R. M. (2017). R: Predictive Analysis: Packt Publishing.

Hodeghatta, U. R., & Nayak, U. (2016). Business Analytics Using R-A Practical Approach: Springer.

Ledolter, J. (2013). Data mining and business analytics with R: John Wiley & Sons.

Navarro, D. J. (2015). Learning statistics with R: A tutorial for psychology students and other beginners. R package version 0.5.

r-project.org. (2018). R: A Language and Environment for Statistical Computing. Retrieved from https://cran.r-project.org/doc/manuals/r-release/fullrefman.pdf.

Machine Learning: How Logistic Regression Is Used to Predict Categorical Outcome

Dr. O. Aly
Computer Science

Introduction

The purpose of this discussion is to explain how Logistic Regression is used to predict a categorical outcome.  The discussion addresses the predictive power of categorical predictors of a binary outcome and whether Logistic Regression should be used.  The discussion begins with an overview of variable types and of business analytics methods organized by data type and by market sector.  It then addresses how Logistic Regression is used when working with a categorical outcome variable, and it ends with an example of Logistic Regression using R.

Variable Types

Variables can be classified in various ways.  Variables can be categorical or continuous (Ary, Jacobs, Sorensen, & Walker, 2013).  When researchers classify subjects by sorting them into mutually exclusive groups, the attribute on which they base the classification is termed a "categorical variable" (Ary et al., 2013).  Examples of categorical variables are home language, county of residence, father's principal occupation, and school in which enrolled (Ary et al., 2013).  The simplest type of categorical variable has only two mutually exclusive classes and is called a "dichotomous variable" (Ary et al., 2013).  Male-Female, Citizen-Alien, and Pass-Fail are examples of dichotomous variables (Ary et al., 2013).  Some categorical variables have more than two classes, such as educational level, religious affiliation, and state of birth (Ary et al., 2013).  When the attribute has an "infinite" number of values within a range, it is a continuous variable (Ary et al., 2013).  Examples of continuous variables include height, weight, age, and achievement test scores (Ary et al., 2013).

The most important classification of variables is by their use in the research under consideration when they are classified as independent variables or dependent variables (Ary et al., 2013).  The independent variables are antecedent to dependent variables and are known or are hypothesized to influence the dependent variable which is the outcome (Ary et al., 2013).  In experimental studies, the treatment is the independent variable, and the outcome is the dependent variable (Ary et al., 2013).  In a non-experimental study, it is often more challenging to label variables as independent or dependent (Ary et al., 2013).  The variable that inevitably precedes another one in time is called an independent variable (Ary et al., 2013).  For instance, in a research study of the relationship between teacher experience and students’ achievement scores, teacher experience would be considered as the independent variable (Ary et al., 2013).

Business Analytics Methods Based on Variable Types

The data types play a significant role in the employment of the analytical method.  As indicated in (Hodeghatta & Nayak, 2016), when the response (dependent) variable is continuous, and the predictor variables are either continuous or categorical, the Linear Regression, Neural Network, K-Nearest Neighbor (K-NN) methods can be used as detailed in Table 1. When the response (dependent) variable is categorical, and the predictor variables are either continuous or categorical, the Logistic Regression, K-NN, Neural Network, Decision/Classification Trees, Naïve Bayes can be used as detailed in Table 1.

Table-1: Business Analytics Methods Based on Data Types. Adapted from (Hodeghatta & Nayak, 2016).

Analytics Techniques/Methods Used By Market Sectors

In (EMC, 2015), the analytic techniques and methods used by some of the Market Sectors are summarized in Table 2.  These are examples of the application of these analytic techniques and methods.  As shown in Table 2, Logistic Regression can be used in the Retail Business and Wireless Telecom industries.  Additional methods are also used in different Market Sectors, as shown in Table 2.

Table 2.  Analytic Techniques/Methods Used by Market Sector (EMC, 2015).

Besides the above Market Sectors, Logistic Regression can also be used in Medical, Finance, Marketing and Engineering (EMC, 2015), while the Linear Regression can be used in Real Estate, Demand Forecasting, and Medical (EMC, 2015).

Predicting Categorical Outcomes Using Logistic Regression

The Logistic Regression model was first introduced by Berkson (Colesca, 2009; Wilson & Lorenz, 2015), who showed how the model could be fitted using iteratively reweighted least squares (Colesca, 2009).  Logistic Regression is widely used (Ahlemeyer-Stubbe & Coleman, 2014; Colesca, 2009) in social science research because many studies involve a binary response variable (Colesca, 2009).  Thus, in Logistic Regression the target outcome is "binary," such as YES or NO, or categorical with just a few categories (Ahlemeyer-Stubbe & Coleman, 2014), while Regular Linear Regression is used to model continuous target variables (Ahlemeyer-Stubbe & Coleman, 2014).  Logistic Regression calculates the probability of the outcome occurring, rather than predicting the outcome corresponding to a given set of predictors (Ahlemeyer-Stubbe & Coleman, 2014).  Logistic Regression can answer questions such as "What is the probability that an applicant will default on a loan?" while Linear Regression can answer questions such as "What is a person's expected income?" (EMC, 2015).  Logistic Regression is based on the logistic function f(y), as shown in equation (1) (EMC, 2015).

The expected value of the target variable from a Logistic Regression is between 0 and 1 and can be interpreted as a "likelihood" (Ahlemeyer-Stubbe & Coleman, 2014).  As y → ∞, f(y) → 1, and as y → −∞, f(y) → 0.  Figure 1 illustrates how the value of the logistic function f(y) varies from 0 to 1 as y increases (EMC, 2015).

Figure 1.  Logistic Function (EMC, 2015).

Because the range of f(y) is (0,1), the logistic function appears to be an appropriate function to model the probability of a particular outcome occurring (EMC, 2015).  As the value of (y) increases, the probability of the outcome occurring increases (EMC, 2015).  To predict the likelihood of an outcome, (y) needs to be a function of the input variables in any proposed model (EMC, 2015).  In Logistic Regression, (y) is expressed as a linear function of the input variables (EMC, 2015).  The formula of the Logistic Regression is shown in equation (2) below, which is similar to the Linear Regression equation (EMC, 2015).  However, one difference is that the values of (y) are not directly observed; only the value of f(y) in terms of success or failure, typically expressed as 1 or 0 respectively, is observed (EMC, 2015).

Based on the input variables of x1, x2, …, xp-1, the probability of an event is shown in equation (3) below (EMC, 2015).

Using (p) to denote f(y), the equation can be rewritten as shown in equation (4) (EMC, 2015).  The quantity ln(p/(1-p)) in equation (4) is known as the log odds ratio, or the logit of (p) (EMC, 2015).
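
Equations (1) through (4) referenced above do not appear in this copy; the standard forms implied by the surrounding text, written with coefficients β0, β1, ..., βp−1, are reconstructed below.

f(y) = \frac{1}{1 + e^{-y}}  (1)

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_{p-1} x_{p-1}  (2)

p = f(y) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_{p-1} x_{p-1})}}  (3)

\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_{p-1} x_{p-1}  (4)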

The probability is a continuous measurement, but because it is a constrained measurement, bounded by 0 and 1, it cannot be modeled using Regular Linear Regression (Fischetti, 2015), because one of the assumptions of Regular Linear Regression is that all predictor variables must be "quantitative" or "categorical," and the outcome variable must be "quantitative," "continuous," and "unbounded" (Field, 2013).  "Quantitative" indicates that they should be measured at the interval level, and "unbounded" indicates that there should be no constraints on the variability of the outcome (Field, 2013).  In Regular Linear Regression, the predicted outcome can fall below 0 and above 1 (Fischetti, 2015).

The logistic function can be applied to the outcome of a Linear Regression to constrain it to be between 0 and 1, and it can be interpreted as a proper probability (Fischetti, 2015).  As shown in Figure 1, the outcome of the logistic function is always between 0 and 1. Thus, the Linear Regression can be adapted to output probabilities (Fischetti, 2015).  However, the function which can be applied to the linear combination of predictors is called “inverse link function,” while the function that transforms the dependent variable into a value that can be modeled using linear regression is just called “link function” (Fischetti, 2015).  In the Logistic Regression, the “link function” is called “logit function” (Fischetti, 2015).   The transformation logit (p) is used in Logistic Regression with the letter (p) to represent the probability of success (Ahlemeyer-Stubbe & Coleman, 2014).  The logit (p) is a non-linear transformation, and Logistic Regression is a type of non-linear regression (Ahlemeyer-Stubbe & Coleman, 2014).
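
In base R, the logit link and its inverse are available directly, as the short sketch below shows; the probability value used is an illustrative assumption.

  • ## The logit link and its inverse in base R (illustrative value of p).
  • p <- 0.75
  • qlogis(p)          ## logit(p) = log(p/(1-p)), approximately 1.099
  • plogis(qlogis(p))  ## the inverse link returns the original probability, 0.75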

There are two problems that must be considered when dealing with Logistic Regression.  The first problem is that the ordinary least squares approach of Regular Linear Regression cannot be used to solve for the coefficients, because the link function is non-linear (Fischetti, 2015).  Most statistical software solves this problem by using a technique called Maximum Likelihood Estimation (MLE) instead (Fischetti, 2015).  Techniques such as MLE are used to estimate the model parameters (EMC, 2015).  MLE determines the values of the model parameters which maximize the chance of observing the given dataset (EMC, 2015).
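
The following R sketch illustrates the MLE idea on simulated data, which is an illustrative assumption; it minimizes the negative log-likelihood with optim() and compares the result with glm().

  • ## MLE for a logistic model via optim(), compared with glm() (simulated data).
  • set.seed(1)
  • x <- rnorm(200)
  • y <- rbinom(200, 1, plogis(-0.5 + 1.2*x))
  • negloglik <- function(beta) -sum(dbinom(y, 1, plogis(beta[1] + beta[2]*x), log=TRUE))
  • optim(c(0, 0), negloglik)$par         ## MLE of (intercept, slope)
  • coef(glm(y ~ x, family = binomial))   ## essentially the same estimates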

The second problem is that Linear Regression assumes that the error distribution is normally distributed (Fischetti, 2015).  Logistic Regression models the error distribution as a "Bernoulli" distribution or a "binomial" distribution (Fischetti, 2015).  In Logistic Regression, the link function and error distribution are the logit and the binomial, respectively.  In Regular Linear Regression, the link function is the identity function, which returns its argument unchanged, and the error distribution is the normal distribution (Fischetti, 2015).

Logistic Regression in R

The glm() function is used in R to perform Logistic Regression.  The error distribution and link function are specified in the "family" argument, which can be written as family="binomial" or family=binomial().  An example of glm() using the births.df dataset is sketched below, building a Logistic Regression for the SEX outcome (male, female) using all available predictor variables.
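
A sketch of the glm() call described above follows; the births.df data frame is the one referenced in the text and is not reproduced here, and treating SEX as a two-level factor outcome is an assumption.

  • ## Logistic Regression of SEX on all other variables in births.df (sketch only;
  • ## births.df is referenced in the text and is not reproduced here).
  • model.sex <- glm(SEX ~ ., data = births.df, family = binomial())
  • summary(model.sex)
  • head(predict(model.sex, type = "response"))  ## predicted probabilities of the second level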

References

Ahlemeyer-Stubbe, A., & Coleman, S. (2014). A practical guide to data mining for business and industry: John Wiley & Sons.

Ary, D., Jacobs, L. C., Sorensen, C. K., & Walker, D. (2013). Introduction to research in education: Cengage Learning.

Colesca, S. E. (2009). Increasing e-trust: A solution to minimize risk in e-government adoption. Journal of applied quantitative methods, 4(1), 31-44.

EMC. (2015). Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data. (1st ed.): Wiley.

Field, A. (2013). Discovering Statistics using IBM SPSS Statistics: Sage publications.

Fischetti, T. (2015). Data Analysis with R: Packt Publishing Ltd.

Hodeghatta, U. R., & Nayak, U. (2016). Business Analytics Using R-A Practical Approach: Springer.

Wilson, J. R., & Lorenz, K. A. (2015). Short History of the Logistic Regression Model Modeling Binary Correlated Responses using SAS, SPSS and R (pp. 17-23): Springer.

Machine Learning: Logistic Regression

Dr. O. Aly
Computer Science

Introduction

The purpose of this discussion is to discuss and analyze the assumptions of the Logistic Regression, and the assumptions of the Regular Regression which are not applicable to the Logistic Regression.  The discussion and the analysis also address the types of variables used in both the Logistic Regression and the Regular Regression.

Regular Linear Regression:

Regression analysis is used when a linear model is fit to the data to predict values of an outcome (dependent) variable from one or more predictor (independent) variables (Field, 2013).  Linear Regression is also defined in (Field, 2013) as a method used to predict the values of continuous variables and to make inferences about how specific variables are related to a continuous variable.  These two procedures of prediction and inference rely on information from the statistical model, which is represented by an equation or series of equations with some number of parameters (Tony Fischetti, 2015).  Linear Regression is the most important prediction method for "continuous" variables (Giudici, 2005).

With one predictor or independent variable, the technique is sometimes referred to as "Simple Regression" (Field, 2013; Tony Fischetti, 2015; T. Fischetti, Mayor, & Forte, 2017; Giudici, 2005).  However, when there are several predictors or independent variables in the model, it is referred to as "Multiple Regression" (Field, 2013; Tony Fischetti, 2015; T. Fischetti et al., 2017; Giudici, 2005).  In Regression Analysis, the differences between what the model predicts and the observed data are called "Residuals," which are the same as the "Deviations" obtained when looking at the Mean (Field, 2013).  These deviations are the vertical distances between what the model predicted and each data point that was observed.  Sometimes the predicted value of the outcome is less than the actual value, and sometimes it is greater, meaning that the residuals are sometimes positive and sometimes negative.  To evaluate the error in a regression model, just as the fit of the mean is assessed using the variance, a sum of squared errors can be used: the residual sum of squares (SSR), also called the sum of squared residuals (Field, 2013).  The SSR is an indicator of how well a particular line fits the data: if the SSR is large, the line is not representative of the data; if the SSR is small, the line is representative of the data (Field, 2013).

When using the Simple Linear Regression with the two variables; one independent or predictor and the other is outcome or dependent, the equation is as follows (Field, 2013).

In this Regression Model, the (b) is the correlation coefficient (more often denoted as ( r )), and it is a standardized measure (Field, 2013). However, an unstandardized measure of (b) can be used, but the equation will alter to be as follows (Field, 2013):

This model differs from that of a correlation only in that it uses an unstandardized measure of the relationship (b) and consequently a parameter (b0) for the value of the outcome must be included, when the predictor is zero (Field, 2013).  These parameters of (b0) and (b1) are known as the Regression Coefficients (Field, 2013).

When there are more than two variables which might be related to the outcome, Multiple Regression can be used. The Multiple Regression can be used with three, four or more predictors (Field, 2013).  The equation for the Multiple Regression is as follows:

The (b1) is the coefficient of the first predictor (X1), (b2) is the second coefficient of the second predictor (X2), and so forth, as (bn) is the coefficient of the nth predictor (Xni) (Field, 2013). 
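
The regression equations referenced above at the three "as follows" points do not appear in this copy; the standard forms implied by the surrounding text are reconstructed below.

Y_i = b\,X_i + \varepsilon_i  (standardized form, with b = r)

Y_i = b_0 + b_1 X_i + \varepsilon_i  (unstandardized simple regression)

Y_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \dots + b_n X_{ni} + \varepsilon_i  (multiple regression)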

To assess the goodness of fit for the Regular Regression, the sum of squares, the R and R2 can be used.  When using the Mean as a model, the difference between the observed values and the values predicted by the mean can be calculated using the sum of squares (denoted SST) (Field, 2013).  This value of the SST represents how good the Mean is as a model of the observed data (Field, 2013).  When using the Regression Model, the SSR can be used to represent the degree of inaccuracy when the best model is fitted to the data (Field, 2013). 

Moreover, these two sums of squares, SST and SSR, can be used to calculate how much better the regression model is than a baseline model such as the Mean model (Field, 2013).  The improvement in prediction resulting from using the Regression Model rather than the Mean Model is measured by calculating the difference between SST and SSR (Field, 2013).  Such improvement is the Model Sum of Squares (SSM) (Field, 2013).  If the value of SSM is large, then the regression model is very different from using the mean to predict the outcome variable, indicating that the Regression Model has made a big improvement to how well the outcome variable can be predicted (Field, 2013).  However, if the SSM is small, then using the Regression Model is only a little better than using the Mean model (Field, 2013).  The R2 is calculated by dividing SSM by SST to measure the proportion of the improvement due to the model; R2 represents the amount of variance in the outcome explained by the model (SSM) relative to how much variation there was to explain in the first place (SST) (Field, 2013).  Other methods to assess the goodness-of-fit of the model include the F-test using Mean Squares (MS) (Field, 2013) and the F-statistic to calculate the significance of R2 (Field, 2013).  To measure the individual contribution of a predictor in Regular Linear Regression, the estimated regression coefficients (b) and their standard errors are used to compute a t-statistic (Field, 2013).
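
The sum-of-squares decomposition described above can be verified with a short R sketch; the built-in cars data set and the lm() fit are illustrative choices, not part of the cited sources.

  • ## SST, SSR, SSM and R-squared for a simple lm() fit (cars data, illustrative).
  • fit <- lm(dist ~ speed, data = cars)
  • SST <- sum((cars$dist - mean(cars$dist))^2)  ## total sum of squares (Mean model)
  • SSR <- sum(residuals(fit)^2)                 ## residual sum of squares
  • SSM <- SST - SSR                             ## model sum of squares (improvement)
  • c(R2 = SSM/SST, R2.from.summary = summary(fit)$r.squared)  ## identical values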

Generalization of the Regression Model is a critical additional step, because if the model cannot be generalized, then any conclusions based on the model must be restricted to the sample used (Field, 2013).  For the regression model to generalize, cross-validation can be used (Field, 2013; Tony Fischetti, 2015) and the underlying assumptions must be met (Field, 2013).

Central Assumptions of Regular Linear Regression in Order of Importance

The assumptions of the Linear Model in order of importance as indicated in (Field, 2013) are as follows:

  1. Additivity and Linearity:  The outcome variable should be linearly related to any predictors, and with several predictors, their combined effect is best described by adding their effects together. Thus, the relationship between variables is linear.  If this assumption is not met, the model is invalid.  Sometimes, variables can be transformed to make their relationships linear (Field, 2013).
  2. Independent Errors:  The residual terms should be uncorrelated (i.e., independent) for any two observations, sometimes described as a "lack of autocorrelation" (Field, 2013).  If this assumption of independence is violated, the confidence intervals and significance tests will be invalid.  However, regarding the model parameters, the estimates using the method of least squares will still be valid but not optimal (Field, 2013).  This assumption can be tested with the Durbin-Watson test, which tests for serial correlations between errors; specifically, it tests whether adjacent residuals are correlated (Field, 2013), as shown in the short sketch after this list.  The size of the Durbin-Watson statistic depends upon the number of predictors in the model and the number of observations (Field, 2013).  As a very conservative rule of thumb, values less than one or greater than three are a cause for concern; however, values closer to 2 may still be problematic, depending on the sample and model (Field, 2013).
  3. Homoscedasticity:  At each level of the predictor variable(s), the variance of the residual terms should be constant, meaning that the residuals at each level of the predictor(s) should have the same variance (homoscedasticity) (Field, 2013).  When the variances are very unequal there is said to be heteroscedasticity.  Violating this assumption invalidates the confidence intervals and significance tests (Field, 2013). However, estimates of the model parameters (b) using the method of least squares are still valid but not optimal (Field, 2013).  This problem can be overcome by using weighted least squares regression in which each case is weighted by a function of its variance (Field, 2013).
  4. Normally Distributed Errors:   It is assumed that the residuals in the model are random, normally distributed variables with a mean of 0.  This assumption means that the differences between the model and the observed data are most frequently zero or very close to zero, and that differences much greater than zero happen only occasionally (Field, 2013).  This assumption sometimes is confused with the idea that predictors have to be normally distributed (Field, 2013).  Predictors do not need to be normally distributed (Field, 2013). In small samples a lack of normality will invalidate confidence intervals and significance tests; in large samples, it will not, because of the central limit theorem (Field, 2013).  If the concern is only with estimating the model parameters and not with the significance tests and confidence intervals, then this assumption barely matters (Field, 2013).  In other words, this assumption matters for significance tests and confidence intervals.  This assumption can also be ignored if the bootstrap of confidence intervals is used (Field, 2013).
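
As referenced in the second assumption above, the independence of errors can be checked with the Durbin-Watson test; the lmtest package and the built-in cars data set are illustrative assumptions in this sketch.

  • ## Durbin-Watson test for autocorrelated residuals (lmtest package, cars data).
  • library(lmtest)
  • fit <- lm(dist ~ speed, data = cars)
  • dwtest(fit)   ## a statistic near 2 suggests little autocorrelation in the residuals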

Additional Assumptions of Regular Linear Regression

There are additional assumptions when dealing with Regular Linear Regression.  These additional assumptions are as follows as indicated in (Field, 2013).

  • Predictors are uncorrelated with "External Variables," or "Third Variables."  External variables are variables which have not been included in the regression model but influence the outcome variable.  These variables can be described as "third variables."  This assumption indicates that there should be no external variables that correlate with any of the variables included in the regression model (Field, 2013).  If external variables do correlate with the predictors, the conclusions drawn from the model become "unreliable," because other variables exist that can predict the outcome just as well (Field, 2013).
  • Variable Types: All predictor (independent) variables must be “quantitative” or “categorical,” and the outcome (dependent) variables must be “quantitative,” “continuous” and “unbounded” (Field, 2013). The “quantitative” indicates that they should be measured at the interval level, and the “unbounded” indicates that there should be no constraints on the variability of the outcome (Field, 2013).  
  • No Perfect Multicollinearity: If the model has more than one predictor then there should be no perfect linear relationship between two or more of the predictors.  Thus, the predictors (independent) variables should not correlate too highly (Field, 2013).
  • Non-Zero Variance: The predictors should have some variations in value; meaning they do not have variances of zero (Field, 2013).  

Logistic Regression

Logistic Regression is used when the outcome (dependent) variable is categorical and the predictor (independent) variables are continuous or categorical (Field, 2013). In other words, Logistic Regression is multiple regression but with an outcome variable that is categorical and predictor variables that are continuous or categorical.  Logistic Regression is the main prediction method for qualitative variables (Giudici, 2005).

Logistic Regression can have life-saving applications; in medical research, it is used to generate models from which predictions can be made about the “likelihood” that, for example, a tumor is cancerous or benign (Field, 2013). A database is used to determine which variables are influential in predicting the “likelihood” of malignancy of a tumor (Field, 2013). These variables can be measured for a new patient, and their values placed in a Logistic Regression model from which a “probability” of malignancy can be estimated (Field, 2013).  Logistic Regression calculates the “probability” of the outcome occurring rather than predicting the value of the outcome corresponding to a given set of predictors (Ahlemeyer-Stubbe & Coleman, 2014). The expected values of the target variable from a Logistic Regression are between 0 and 1 and can be interpreted as a “likelihood” (Ahlemeyer-Stubbe & Coleman, 2014).

There are two types of Logistic Regression: Binary Logistic Regression, and Multinomial (or Polychotomous) Logistic Regression.  Binary Logistic Regression is used to predict membership of only two categories of the outcome (dependent) variable, while Multinomial or Polychotomous Logistic Regression is used to predict membership of more than two categories (Field, 2013).

Concerning the assessment of the model, the R-statistic can be used to calculate a more literal version of the multiple correlation in the Logistic Regression model.  The R-statistic is the partial correlation between the outcome variable and each of the predictor variables, and it can vary between -1 and +1. A positive value indicates that as the predictor variable increases, so does the likelihood of the event occurring, while a negative value indicates that as the predictor variable increases, the likelihood of the outcome occurring decreases (Field, 2013).  If a variable has a small value of R, then it contributes a small amount to the model.  Other measures for such assessment include Hosmer and Lemeshow’s, Cox and Snell’s, and Nagelkerke’s (Field, 2013). All these measures differ in their computation, but conceptually they are somewhat the same, and they can be seen as similar to the R2 in linear regression regarding interpretation, as they provide a gauge of the substantive significance of the model (Field, 2013).
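
As a rough illustration of how two of these measures can be computed by hand from a fitted model in R, the sketch below uses the built-in mtcars data with its binary am variable purely as a stand-in; the formulas follow the standard definitions of the Cox and Snell and Nagelkerke measures.

  • ##Illustrative logistic regression on the built-in mtcars data (binary outcome am)
  • m <- glm(am ~ wt + hp, data=mtcars, family=binomial)
  • n <- nrow(mtcars)
  • ##Cox and Snell R-squared computed from the null and residual deviances
  • r2.cs <- 1 - exp((m$deviance - m$null.deviance) / n)
  • ##Nagelkerke R-squared rescales Cox and Snell so that its maximum is 1
  • r2.nag <- r2.cs / (1 - exp(-m$null.deviance / n))
  • c(CoxSnell=r2.cs, Nagelkerke=r2.nag)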

In the Logistic Regression, there is an analogous statistic, the z-statistic, which follows the normal distribution and measures the individual contribution of predictors (Field, 2013). Like the t-tests in the Regular Linear Regression, the z-statistic indicates whether the (b) coefficient for that predictor is significantly different from zero (Field, 2013). If the coefficient is significantly different from zero, then the assumption can be that the predictor is making a significant contribution to the prediction of the outcome (Y) (Field, 2013). The z-statistic is known as the Wald statistic, as it was developed by Abraham Wald (Field, 2013).

Principles of Logistic Regression

One of the assumptions mentioned above for the regular linear models is that the relationship between variables is linear for the linear regression to be valid.  However, when the outcome variable is categorical, this assumption is violated, as explained in the “Variable Types” assumption above, because the outcome (dependent) variable must be “quantitative,” “continuous” and “unbounded” (Field, 2013).  To get around this problem, the data must be transformed using the logarithmic transformation.  The purpose of this transformation is to express the non-linear relationship as a linear relationship (Field, 2013). Logistic Regression is based on this principle, as it expresses the multiple linear regression equation in logarithmic terms called the “logit” and thus overcomes the problem of violating the assumption of linearity (Field, 2013).  The transformation logit (p) is used in Logistic Regression, with the letter (p) representing the probability of success (Ahlemeyer-Stubbe & Coleman, 2014).  The logit (p) is a non-linear transformation, and Logistic Regression is a type of non-linear regression (Ahlemeyer-Stubbe & Coleman, 2014).
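
In symbols, with p denoting the probability of the outcome, this standard transformation and the resulting model can be written as:

$$\operatorname{logit}(p) = \ln\!\left(\frac{p}{1-p}\right) = b_0 + b_1X_1 + \cdots + b_nX_n, \qquad p = \frac{1}{1 + e^{-(b_0 + b_1X_1 + \cdots + b_nX_n)}}$$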

Assumptions of the Logistic Regression

In the Logistic Regression, the assumptions of the ordinary regression are still applicable.  However, the following two assumptions are dealt with differently in the Logistic Regression (Field, 2013):

  • Linearity:  While in the ordinary regression the assumption is that the outcome has a linear relationship with the predictors, in the logistic regression the outcome is categorical, and so this assumption is violated; the log (or logit) of the data is used to overcome this violation (Field, 2013). Thus, the assumption of linearity in Logistic Regression is that there is a linear relationship between any continuous predictors and the logit of the outcome variable (Field, 2013). This assumption can be tested by checking whether the interaction term between the predictor and its log transformation is significant (Field, 2013).  In short, the linearity assumption is that each predictor has a linear relationship with the logit of the outcome variable when using the Logistic Regression; a minimal R sketch of this check appears after this list.
  • Independence of Errors:  In the Logistic Regression, violating this assumption produces overdispersion, which can occur when the observed variance is bigger than expected from the Logistic Regression model.   The overdispersion can occur for two reasons (Field, 2013). The first reason is correlated observations, when the assumption of independence is broken (Field, 2013). The second reason is variability in success probabilities (Field, 2013). Overdispersion tends to make the standard errors too small, which creates two problems. The first problem concerns the test statistics of the regression parameters, which are computed by dividing by the standard error; if the standard error is too small, then the test statistic will be too big and falsely deemed significant.  The second problem concerns the confidence intervals, which are computed from standard errors; if the standard error is too small, then the confidence interval will be too narrow and result in overconfidence about the likely relationship between predictors and the outcome in the population. In short, overdispersion occurs when the variance is larger than the expected variance from the model.  This overdispersion can be caused by violating the assumption of independence.  This problem makes the standard errors too small (Field, 2013), which can bias the conclusions about the significance of the model parameters (b-values) and their population values (Field, 2013).  A simple check for overdispersion is included in the sketch after this list.
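
The sketch below illustrates both checks, assuming a hypothetical data frame dat with a binary outcome y and a continuous predictor x; all names are illustrative, and the deviance-based overdispersion check is most meaningful for grouped binomial data.

  • ##Linearity of the logit: include the interaction of the predictor with its own log
  • m.lin <- glm(y ~ x + x:log(x), data=dat, family=binomial)
  • summary(m.lin)   ##a significant x:log(x) term suggests the linearity-of-the-logit assumption is violated
  • ##Overdispersion: compare the residual deviance to its degrees of freedom
  • m <- glm(y ~ x, data=dat, family=binomial)
  • m$deviance / m$df.residual   ##values well above 1 suggest overdispersion
  • ##A quasi-binomial fit estimates a dispersion parameter and adjusts the standard errors accordingly
  • m.quasi <- update(m, family=quasibinomial)
  • summary(m.quasi)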

Business Analytics Methods Based on Data Types

In (Hodeghatta & Nayak, 2016), the following table summarizes the business analytics methods based on the data types.  As shown in the table, when the response (dependent) variable is continuous and the predictor variables are either continuous or categorical, the Linear Regression method is used.  When the response (dependent) variable is categorical and the predictor variables are either continuous or categorical, the Logistic Regression is used.  Other methods are also listed as additional information.

Table-1. Business Analytics Methods Based on Data Types. Adapted from (Hodeghatta & Nayak, 2016).

References

Ahlemeyer-Stubbe, A., & Coleman, S. (2014). A practical guide to data mining for business and industry: John Wiley & Sons.

Field, A. (2013). Discovering Statistics using IBM SPSS Statistics: Sage publications.

Fischetti, T. (2015). Data Analysis with R: Packt Publishing Ltd.

Fischetti, T., Mayor, E., & Forte, R. M. (2017). R: Predictive Analysis: Packt Publishing.

Giudici, P. (2005). Applied data mining: statistical methods for business and industry: John Wiley & Sons.

Hodeghatta, U. R., & Nayak, U. (2016). Business Analytics Using R-A Practical Approach: Springer.

Overfitting and Parsimony in Large Dataset Analysis

Dr. O. Aly
Computer Science

Introduction:  The purpose of this discussion is to address the issues of overfitting versus parsimony and their importance in Big Data analysis.  The discussion also addresses whether overfitting is a problem in the General Least Square Model (GLM) approach.  Some hierarchical methods which do not require the parsimony of the GLM are also covered. This discussion does not include the GLM itself, as it was discussed earlier. It begins with the parsimony principle in statistical models.

Parsimony Principle in Statistical Model

The medieval (14th century) English philosopher, William of Ockham (1285 – 1347/49) (Forster, 1998) popularized a critical principle stated by Aristotle “Entities must not be multiplied beyond what is necessary” (Bordens & Abbott, 2008; Epstein, 1984; Forster, 1998).  The refinement of this principle by Ockham is now called “Occam’s Razor” stating that a problem should be stated in the simplest possible terms and explained with the fewest postulates possible (Bordens & Abbott, 2008; Epstein, 1984; Field, 2013; Forster, 1998).  This method is now known as Law or Principle of Parsimony (Bordens & Abbott, 2008; Epstein, 1984; Field, 2013; Forster, 1998).  Thus, based on this law, a theory should account for phenomena within its domain in the simplest terms possible and with the fewest assumptions (Bordens & Abbott, 2008; Epstein, 1984; Field, 2013; Forster, 1998).  As indicated by (Bordens & Abbott, 2008), if there are two competing theories concerning a behavior, the one which explains the behavior in the simplest terms is preferred under the law of parsimony. 

Modern theories of the attribution process, development, memory, and motivation adhere to this law of parsimony (Bordens & Abbott, 2008).  However, the history of science has witnessed some theories which were crushed under the weight of their complexity (Bordens & Abbott, 2008). For instance, the collapse of interest in the Hull-Spence model of learning occurred primarily because the theory had been modified so many times to account for anomalous data that it was no longer parsimonious (Bordens & Abbott, 2008).  The Hull-Spence model became too complicated, with too many assumptions and too many variables whose values had to be extracted from the very data that the theory was meant to explain (Bordens & Abbott, 2008).  As a result of such complexity, interest in the theory collapsed.  The Ptolemaic theory of planetary motion also lost its parsimony and, with it, much of its true predictive power (Bordens & Abbott, 2008).

Parsimony is one of the characteristics of a good theory (Bordens & Abbott, 2008).  A parsimonious explanation or theory explains a relationship using relatively few assumptions (Bordens & Abbott, 2008). When more than one explanation is offered for an observed behavior, scientists and researchers prefer the parsimonious explanation, which explains the behavior with the fewest assumptions (Bordens & Abbott, 2008). Scientific explanations are regularly evaluated and examined for consistency with the evidence and with known principles of parsimony and generality (Bordens & Abbott, 2008).  Accepted explanations can be overthrown in favor of views which are more general, more parsimonious, and more consistent with observation (Bordens & Abbott, 2008).

How to Develop Fit Model Using Parsimony Principle

When building a model, the researcher should strive for parsimony (Bordens & Abbott, 2008; Field, 2013). The statistical implication of using a parsimony heuristic is that models should be kept as simple as possible, meaning predictors should not be included unless they have explanatory benefit (Field, 2013).  This strategy can be implemented by fitting the model that includes all potential predictors and then systematically removing any that do not seem to contribute to the model (Field, 2013).  Moreover, if the model includes interaction terms, then, for the interaction terms to be valid, the main effects involved in the interaction should be retained (Field, 2013).  An example of the implementation of parsimony in developing a model involves three variables in a patient dataset: (1) an outcome variable (cured or not cured), which is the dependent variable (DV), (2) an intervention variable, which is a predictor independent variable (IV), and (3) duration, which is another predictor independent variable (Field, 2013).  Thus, the three potential predictors can be Intervention, Duration, and the interaction “Intervention x Duration” (Field, 2013).  The most complex model includes all three of these predictors.  As the model is being developed, any terms that are added but do not improve the model should be removed, and the model without those terms should be adopted.  Thus, the first model (model-1) which the researchers can fit would have only Intervention as a predictor (Field, 2013).  Then the model is built up by adding the other main effect, Duration, as model-2.  The interaction of Intervention x Duration can be added in model-3. Figure 1 illustrates these three models of development. The goal is to determine which of these models best fits the data while adhering to the general idea of parsimony (Field, 2013).   If the interaction term in model-3 does not improve model-2, then model-2 should be used as the final model.  If Duration in model-2 does not make any difference and does not improve model-1, then model-1 should be used as the final model (Field, 2013).  The aim is to build the model systematically and choose the most parsimonious model as the final model.  Parsimonious representations are essential because simpler models tend to give more insight into a problem (Ledolter, 2013).
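
A minimal R sketch of this build-up, assuming a hypothetical data frame named patients with the variables Cured, Intervention, and Duration (the names and the binomial family are illustrative only):

  • ##Model-1: Intervention only
  • model1 <- glm(Cured ~ Intervention, data=patients, family=binomial)
  • ##Model-2: add the main effect of Duration
  • model2 <- update(model1, . ~ . + Duration)
  • ##Model-3: add the Intervention x Duration interaction
  • model3 <- update(model2, . ~ . + Intervention:Duration)
  • ##Compare the nested models; terms that do not significantly improve the fit are dropped
  • anova(model1, model2, model3, test="Chisq")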

Figure 1.  Building Models based on the Principle of Parsimony (Field, 2013).

Overfitting in Statistical Models

Overfitting is a term used when models or procedures violate the Parsimony Principle; it means that the model includes more terms than are necessary or uses more complicated approaches than necessary (Hawkins, 2004).   There are two types of “Overfitting.”  The first is to use a model which is more flexible than it needs to be (Hawkins, 2004).  For instance, a neural net can accommodate some curvilinear relationships and so is more flexible than a simple linear regression (Hawkins, 2004). However, if it is used on a dataset that conforms to the linear model, it will add a level of complexity without any corresponding benefit in performance, or even worse, with poorer performance than the simpler model (Hawkins, 2004).  The second is to use a model that includes irrelevant components, such as a polynomial of excessive degree or a multiple linear regression that has irrelevant as well as the needed predictors (Hawkins, 2004).

Overfitting is not preferred for four essential reasons (Hawkins, 2004).  The first reason involves wasting resources and expanding the possibilities for undetected errors in databases, which can lead to prediction mistakes, as the values of these unuseful predictors must be supplied in any future use of the model (Hawkins, 2004).  The second reason is that a model with unneeded predictors can lead to worse decisions (Hawkins, 2004).   The third reason is that irrelevant predictors can make predictions worse because the coefficients fitted to them add random variation to the subsequent predictions (Hawkins, 2004). The last reason is that the choice of model has an impact on its portability (Hawkins, 2004). The one-predictor linear regression that captures a relationship with the model is highly portable (Hawkins, 2004).  The more portable model is preferred over the less portable model, as a fundamental requirement of science is that one researcher’s results can be duplicated by another researcher (Hawkins, 2004).

Moreover, large models overfitted on the training dataset turn out to be extremely poor predictors in new situations, as unneeded predictor variables increase the prediction error variance (Ledolter, 2013).  Overparameterized models are of little use if it is difficult to collect data on the predictor variables in the future.  The partitioning of the data into training and evaluation (test) datasets is central to most data mining methods (Ledolter, 2013). Researchers must check whether the relationships found in the training dataset will hold up in the future (Ledolter, 2013).

How to recognize and avoid Overfit Models

A model overfits if it is more complicated than another model that fits equally well (Hawkins, 2004).  The recognition of an overfitted model involves not only the comparison of the simpler model and the more complex model but also the issue of how the fit of a model is measured (Hawkins, 2004).  Cross-validation can detect overfitted models by determining how well the model generalizes to other datasets by partitioning the data (minitab.com, 2015).  This process of cross-validation helps assess how well the model fits new observations which were not used in the model estimation process (minitab.com, 2015).
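
A minimal R sketch of such a holdout check, assuming a hypothetical data frame dat with an outcome y and predictors x1 and x2 (all names are illustrative):

  • ##Split the data into a training set and a holdout (test) set
  • set.seed(1)
  • train.idx <- sample(seq_len(nrow(dat)), size=round(0.7 * nrow(dat)))
  • train <- dat[train.idx, ]
  • test <- dat[-train.idx, ]
  • fit <- lm(y ~ x1 + x2, data=train)
  • ##A test RMSE much larger than the training RMSE suggests the model is overfitted
  • rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
  • c(train=rmse(train$y, fitted(fit)), test=rmse(test$y, predict(fit, newdata=test)))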

Hierarchical Methods

The regression analysis types include simple, hierarchical, and stepwise analysis (Bordens & Abbott, 2008).  The main difference between these types is how predictor variables are entered into the regression equation, which may affect the regression solution (Bordens & Abbott, 2008).  In the simple regression analysis, all predictor variables are entered together, while in the hierarchical regression, the order in which variables are entered into the regression equation is specified (Bordens & Abbott, 2008; Field, 2013).  Thus, the hierarchical regression is used with a well-developed theory or model suggesting a specific causal order (Bordens & Abbott, 2008). As a general rule, known predictors should be entered into the model first, in order of their importance in predicting the outcome (Field, 2013).  After the known predictors have been entered, any new predictors can be added to the model (Field, 2013).  In the stepwise regression, the order in which variables are entered is based on a statistical decision, not on a theory (Bordens & Abbott, 2008).

The choice of the regression analysis should be based on the research questions or the underlying theory (Bordens & Abbott, 2008).  If the theoretical model suggests a particular order of entry, the hierarchical regression should be used (Bordens & Abbott, 2008). Stepwise regression is infrequently used because sampling and measurement error tend to make the correlations among variables in stepwise regression unstable (Bordens & Abbott, 2008).  The main problem with the stepwise methods is that they assess the fit of a variable based on the other variables in the model (Field, 2013).

Goodness-of-fit Measure for the Fit Model

Comparison between hierarchical and stepwise methods:  The hierarchical and stepwise methods involve adding predictors to the model in stages, and it is useful to know whether these additions improve the model (Field, 2013).  Since larger values of R2 indicate better fit, a simple way to see whether a model has improved as a result of adding predictors to it would be to see whether R2 for the new model is bigger than for the old model.  However, R2 will always get bigger if predictors are added, so the issue is whether it gets significantly bigger (Field, 2013).  The significance of the change in R2 can be assessed using the equation below, as the F-statistic can also be used to calculate the significance of R2 (Field, 2013).

However, because the focus is on the change in the models, the change in R2 (R2 change) and the R2 of the newer model (R2 new) are used in the following equation (Field, 2013). Models can then be compared using this F-ratio (Field, 2013).
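
In a standard form consistent with (Field, 2013), with N observations and k predictors, and with the subscripts referring to the newer model and to the change between models, these F-ratios are:

$$F = \frac{(N - k - 1)\,R^2}{k\,(1 - R^2)}, \qquad F_{\text{change}} = \frac{(N - k_{\text{new}} - 1)\,R^2_{\text{change}}}{k_{\text{change}}\,(1 - R^2_{\text{new}})}$$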

Akaike’s Information Criterion (AIC) is a goodness-of-fit measure which penalizes the model for having more variables. If the AIC is bigger, the fit is worse; if the AIC is smaller, the fit is better (Field, 2013).   If the Automated Linear Model function in SPSS is used, then AIC is used to select models rather than the change in R2. AIC is used to compare models with the same outcome variable; if it gets smaller, then the fit of the model is improving (Field, 2013).  In addition to the AIC, there is Hurvich and Tsai’s criterion (AICC), which is a version of AIC designed for small samples (Field, 2013).  Bozdogan’s criterion (CAIC) is a version of AIC which adjusts for model complexity and sample size. The Bayesian Information Criterion (BIC) of Schwarz is comparable to the AIC (Field, 2013; Forster, 1998); however, it is slightly more conservative, as it corrects more harshly for the number of parameters being estimated (Field, 2013).  It should be used when sample sizes are large and the number of parameters is small (Field, 2013).

The AIC and BIC are the most commonly used measures for the fit of the model.  The values of these measures are all useful as a way of comparing models (Field, 2013).  The value of AIC, AICC, CAIC, and BIC can all be compared to their equivalent values in other models.  In all cases, smaller values mean better-fitting models (Field, 2013).

There is also the Minimum Description Length (MDL) measure of Rissanen, which is based on the idea that statistical inference centers around capturing regularity in data; regularity, in turn, can be exploited to compress the data (Field, 2013; Vandekerckhove, Matzke, & Wagenmakers, 2015).  Thus, the goal is to find the model which compresses the data the most (Vandekerckhove et al., 2015).   There are three versions of MDL.  The first is the crude two-part code, where the penalty for complex models is that they take many bits to describe, increasing the summed code length; in this version, it can be difficult to define the number of bits required to describe the model.  The second version of MDL is the Fisher Information Approximation (FIA), which is similar to AIC and BIC in that it includes a first term that represents goodness-of-fit and additional terms that represent a penalty for complexity (Vandekerckhove et al., 2015).  The second term resembles that of BIC, and the third term reflects a more sophisticated penalty which represents the number of distinguishable probability distributions that a model can generate (Vandekerckhove et al., 2015).  The FIA differs from AIC and BIC in that it also accounts for functional form complexity, not just complexity due to the number of free parameters (Vandekerckhove et al., 2015).  The third version of MDL is the normalized maximum likelihood (NML), which is simple to state but can be difficult to compute; for instance, the denominator may be infinite, and this requires further measures to be taken (Vandekerckhove et al., 2015). Moreover, NML requires integration over the entire set of possible datasets, which may be difficult to define as it depends on unknown decision processes of the researchers (Vandekerckhove et al., 2015).

AIC and BIC in R: If there are (p) potential predictors, then there are 2^p possible models (r-project.org, 2002).  AIC and BIC can be used in R as selection criteria for linear regression models as well as for other types of models.  As indicated in (r-project.org, 2002), the equations for AIC and BIC are as follows.
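
In standard form, consistent with (r-project.org, 2002), with p parameters and n observations:

$$\text{AIC} = -2\log(\text{likelihood}) + 2p, \qquad \text{BIC} = -2\log(\text{likelihood}) + p\log(n)$$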

For the linear regression models, the -2 log-likelihood (known as the deviance) is n log(RSS/n) (r-project.org, 2002).  AIC and BIC need to be minimized (r-project.org, 2002).  Larger models will fit better and so have smaller RSS, but they use more parameters (r-project.org, 2002).  Thus, the best choice of model will balance fit with model size (r-project.org, 2002).  The BIC penalizes larger models more heavily and so will tend to prefer smaller models in comparison to AIC (r-project.org, 2002).

An example of the code in R using the built-in state.x77 dataset is below. The step() function does not evaluate the AIC for all possible models but uses a search method that compares models sequentially, as shown in the result of the R commands.

  • state.x77.df <- data.frame(state.x77)   ##convert the built-in state.x77 matrix to a data frame
  • g <- lm(Life.Exp ~ ., data=state.x77.df)
  • step(g)   ##stepwise selection using AIC

References

Bordens, K. S., & Abbott, B. B. (2008). Research Design and Methods: A Process Approach: McGraw-Hill.

Epstein, R. (1984). The principle of parsimony and some applications in psychology. The Journal of Mind and Behavior, 119-130.

Field, A. (2013). Discovering Statistics using IBM SPSS Statistics: Sage publications.

Forster, M. R. (1998). Parsimony and Simplicity. Retrieved from http://philosophy.wisc.edu/forster/220/simplicity.html, University of Wisconsin-Madison.

Hawkins, D. M. (2004). The problem of overfitting. Journal of chemical information and computer sciences, 44(1), 1-12.

Ledolter, J. (2013). Data mining and business analytics with R: John Wiley & Sons.

minitab.com. (2015). The Danger of Overfitting Regression Models. Retrieved from http://blog.minitab.com/blog/adventures-in-statistics-2/the-danger-of-overfitting-regression-models.

r-project.org. (2002). Practical Regression and ANOVA Using R Retrieved from https://cran.r-project.org/doc/contrib/Faraway-PRA.pdf.

Vandekerckhove, J., Matzke, D., & Wagenmakers, E.-J. (2015). Model comparison and the principle of parsimony. The Oxford Handbook of Computational and Mathematical Psychology (Vol. 300): Oxford Library of Psychology.

Quantitative Analysis of the “Faithful” Dataset Using R-Programming

Dr. O. Aly
Computer Science

Abstract

The purpose of this project is to analyze the faithful dataset. The project is divided into two main Parts.  Part-I evaluates and examines the Dataset for understanding the Dataset using the RStudio.  Part-I involves five significant tasks.  Part-II discusses the Pre-Data Analysis, by converting the Dataset to a Data Frame, involving nine significant tasks to analyze the Data Frame.   The result shows that there is a relationship between the Waiting Time until Next Eruption and the Eruption Time.   The probability value (p-value), known as the value of significance, should be very small, such as 0.001, 0.005, 0.01, or 0.05, for the relationship between the response variable and the independent variable to be significant. As the result shows, the probability of the error of the coefficient of eruptions is minimal, almost near 0, i.e., <2e-16. Thus, we reject the null hypothesis that there is no significance of the parameter to the model and accept the alternative hypothesis that the parameter is significant to the model.  We conclude that there is a significant relationship between the response variable Waiting Time and the independent variable Eruptions. The project also analyzed the comparison among three regressions: standard regression, polynomial regression, and lowess regression.  The result shows a similar relationship and similar lines between the three models.

Keywords: R-Dataset; Faithful Dataset; Regression Analysis Using R.

Introduction

This project examines and analyzes the dataset of faithful.csv (oldfaithful.csv).  The dataset is downloaded from http://vincentarelbundock.github.io/Rdatasets/.  The dataset has 272 observations on two variables: eruptions and waiting. The dataset describes the waiting time between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA.  A closer look at faithful$eruptions reveals that these are heavily rounded times originally in seconds, where multiples of 5 are more frequent than would be expected under non-human measurement.  The eruptions variable is numeric for the eruption time in minutes.  The waiting variable is numeric for the waiting time to the next eruption in minutes. A related version of the data (geyser) is available in the package MASS.  There are two Parts.  Part-I addresses five tasks to examine and understand the dataset using R before the analysis.

Part-II addresses the analysis using R and includes the following nine tasks, which conclude with the discussion and analysis of the results.

  • Task-1: The first five records of the dataset.
  • Task-2: Density Histograms and Smoothed Density Histograms.
  • Task-3: Standard Linear Regression
  • Task-4: Polynomial Regression.
  • Task-5: Lowess Regression.
  • Task-6: The Summary of the Model.
  • Task-7: Comparison of Linear Regressions, Polynomial Regression, and Lowess Regression.
  • Task-8: The Summary of all Models.
  • Task-9:  Discussion and Analysis.

Various resources were utilized to develop the required code using R. These resources include (Ahlemeyer-Stubbe & Coleman, 2014; Fischetti, Mayor, & Forte, 2017; Ledolter, 2013; r-project.org, 2018).

Part-I:  Understand and Examine the Dataset “faithful.”

Task-1:  Install MASS Package

The purpose of this task is to install the MASS package, which is required for this project. The faithful.csv analysis requires this package.

  • Command: >install.packages("MASS")

Task-2:  Understand the Variables of the Data Sets

The purpose of this task is to understand the variables of the dataset.  The dataset is a “faithful” dataset. It describes the Waiting Time between Eruptions and the duration of the Eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA.  The dataset has 272 observations on two main variables:

  • Eruptions: numeric Eruption time in minutes.
  • Waiting: numeric Waiting time to next eruption in minutes.

Task-3:  Examine the Variables of the Data Sets

The main dataset is called “faithful.csv” dataset, which includes two main variables eruptions and waiting.

  • ##Examine the dataset
  • data()
  • ?faithful
  • install.packages("MASS")
  • install.packages("lattice")
  • library(lattice)
  • faithful <- read.csv("C:/CS871/Data/faithful.csv")
  • data(faithful)
  • summary(faithful)

Figure 1.  Eruptions and Waiting for Eruption Plots for Faithful dataset.

Task-4: Create a Data Frame to represent the dataset of faithful.

  • ##Create DataFrame
  • faithful.df <- data.frame(faithful)
  • faithful.df
  • summary(faithful.df)

Task-5: Examine the Content of the Data Frame using head(), names(), colnames(), and dim() functions.

  • names(faithful.df)
  • head(faithful.df)
  • dim(faithful.df)

Part-II: Discussion and Analysis

Task-1:  The first Ten lines of Waiting and Eruptions.

  • ##The first ten lines of Waiting and Eruptions
  • faithful$waiting[1:10]
  • faithful$eruptions[1:10]
  • ##The descriptive analysis of waiting and eruptions
  • summary(faithful$waiting)
  • summary(faithful$eruptions)

Task-2:  Density Histograms, and Smoothed Density Histograms.

  • ##Density histogram for Waiting Time
  • hist(faithful.df$waiting, col="blue", freq=FALSE, main="Histogram of Waiting Time to Next Eruption", xlab="Waiting Time To Next Eruption In Minutes")
  • ##smoothed density histogram for Waiting Time (the locfit package provides the smoothed fit)
  • library(locfit)
  • smoothedDensity_waiting <- locfit(~lp(waiting), data=faithful.df)
  • plot(smoothedDensity_waiting, col="blue",  main="Histogram of Waiting Time to Next Eruption", xlab="Waiting Time To Next Eruption In Minutes")
  • ##Density histogram for Eruptions
  • hist(faithful.df$eruptions, col="red", freq=FALSE, main="Histogram of Eruption Time", xlab="Eruption Time In Minutes")
  • ##smoothed density histogram for Eruptions
  • smoothedDensity_eruptions <- locfit(~lp(eruptions), data=faithful.df)
  • plot(smoothedDensity_eruptions, col="red",  main="Histogram of Eruption Time", xlab="Eruption Time In Minutes")

Figure 2.  Density Histogram and Smoothed Density Histogram of Waiting Time.

Figure 3.  Density Histogram and Smoothed Density Histogram of Eruption.

Task-3: Standard Linear Regression

            The purpose of this task is to examine the Standard Linear Regression for the two main factors of the faithful dataset Waiting Time until Next Eruption and Eruption Time.  This task also addresses the diagnostic plots of the standard linear regression such as residuals vs. fitted as examined below.  The R codes are as follows:

  • ##Standard Regression of waiting time on eruption time.
  • lin.reg.model=lm(waiting~eruptions, data=faithful.df)
  • plot(waiting~eruptions, data=faithful, col="blue", main="Regression of Two Factors of Waiting and Eruption Time")
  • abline(lin.reg.model, col="red")

Figure 4.  Regression of Two Factors of Waiting and Eruption Time.

            The following graphs represent the diagnostic plots for the standard linear regression.  The first plot represents the Residuals vs. Fitted. The second plot represents the Normal Q-Q. The third plot represents the Scale-Location. The fourth plot represents the Residuals vs. Leverage.  The discussion and analysis of these graphs appear in the discussion and analysis section of this project.

Figure 5.  Diagnostic Plots for Standard Linear Regression.

Task-4: Polynomial Regression

            The purpose of this task is to examine the polynomial regression for the Waiting Time Until Next Eruption and the Eruption Time variables. The R codes are as follows:
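
A minimal sketch of representative code for this task, mirroring the locfit() local polynomial call used later in Task-7 (the smoothing parameter nn=0.5 is illustrative and not necessarily the value used to produce Figure 6):

  • ##Local polynomial regression of waiting time on eruption time
  • library(locfit)
  • poly.reg <- locfit(waiting~lp(eruptions, nn=0.5), data=faithful.df)
  • plot(faithful.df$waiting~faithful.df$eruptions, col="blue", main="Polynomial Regression of Waiting and Eruption Time", xlab="Eruption Time in Minutes", ylab="Waiting Time to Next Eruption")
  • lines(poly.reg, col="red")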

Figure 6.  Polynomial Regression.

Task-5: Lowess Regression

            The purpose of this task is to examine the Lowess Regression.  The R codes are as follows:
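
A minimal sketch of representative code for this task, mirroring the lowess() call used later in Task-7 (the smoother span f=2/3 is the default and is illustrative):

  • ##Lowess regression of waiting time on eruption time
  • lowess.reg <- lowess(faithful$waiting~faithful$eruptions, f=2/3)
  • plot(faithful$waiting~faithful$eruptions, col="blue", main="Lowess Regression of Waiting and Eruption Time", xlab="Eruption Time in Minutes", ylab="Waiting Time to Next Eruption")
  • lines(lowess.reg, col="red")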

Figure 7.  Lowess Regression.

Task-6: Summary of the Model.

            The purpose of this task is to examine the descriptive analysis summary, such as the residuals, intercept, and R-squared.  The R code is as follows:

  • summary(lin.reg.model)   ##summary of the standard linear regression model fitted in Task-3

Task-7:  Comparison of Linear Regression, Polynomial Regression and Lowess Regression.

  • ##Comparing local polynomial regression to the standard regression.
  • lowessReg=lowess(faithful$waiting~faithful$eruptions, f=2/3)
  • local.poly.reg <-locfit(waiting~lp(eruptions, nn=0.5), data=faithful)
  • standard.reg=lm(waiting~eruptions, data=faithful)
  • plot(faithful$waiting~faithful$eruptions, main="Eruptions Time", xlab="Eruption Time in Minutes", ylab="Waiting Time to Next Eruption Time", col="blue")
  • lines(lowessReg, col="red")
  • abline(standard.reg, col="green")
  • lines(local.poly.reg, col="yellow")

Figure 8.  Regression Comparison for the Eruptions and Waiting Time Variables.

Task-8:  Summary of these Models

            The purpose of this task is to examine the summary of each model. The R codes are as follows:

  • ##Summary of the regressions
  • summary(lowessReg)
  • summary(local.poly.reg)
  • summary(standard.reg)
  • cor(faithful.df$eruptions, faithful.df$waiting)

Task-9: Discussion and Analysis

            The result shows that the descriptive analysis of the average for the eruptions is 3.49, which is lower than the median value of 4.0 minutes, indicating a negatively skewed distribution.  The average for the waiting time until the next eruption is 70.9, which is less than the median of 76.0, indicating a negatively skewed distribution.  Figure 2 illustrates the density histogram and smoothed density histogram of the waiting time. The result shows that the peak waiting time is ~80 minutes with the highest density point of 0.04.   Figure 3 illustrates the density histogram and smoothed density histogram of the eruption time in minutes.  The result shows that the peak eruption time in minutes is ~4.4 with the highest frequency density point of 0.6.  Figure 4 illustrates the linear regression of the two factors of waiting time until the next eruption and eruption time in minutes.  The result shows that as the eruption time increases, the waiting time increases.

The residuals depict the difference between the actual value of the response variable and the value of the response variable predicted using the regression.  The maximum residual is shown as 15.97.  The spread of residuals is provided by specifying the values of min, max, median, Q1, and Q3 of the residuals. In this case, the spread is from -12.08 to 15.97.  Since the principle behind the regression line and the regression equation is to reduce the error or this difference, the expectation is that the median value should be very near to 0.  However, the median shows 0.21, which is higher than 0.  The prediction error can go up to the maximum value of the residual.  As this value of 15.97 is not small, this residual cannot be accepted.   The result also shows that the value next to the coefficient estimate is the standard error of the estimate. This specifies the uncertainty of the estimate; then comes the “t” value of the standard error.  This value specifies how large the coefficient estimate is relative to the uncertainty.  The next value is the probability that the absolute (t) value is greater than the one specified due to chance error.  The probability value (p-value), known as the value of significance, should be very small, such as 0.001, 0.005, 0.01, or 0.05, for the relationship between the response variable and the independent variable to be significant. As the result shows, the probability of the error of the coefficient of eruptions is very small, almost near 0, i.e., <2e-16. Thus, we reject the null hypothesis that there is no significance of the parameter to the model and accept the alternate hypothesis that the parameter is significant to the model.  We conclude that there is a significant relationship between the response variable waiting time and the independent variable of eruptions.

The diagnostic plots of the standard regression are also discussed in this project. Figure 5 illustrates four different diagnostic plots of the standard regression.  This analysis also covers the residuals and fitted lines.  Figure 5 illustrates the Residuals vs. Fitted plot in the Linear Regression Model for Waiting Time until Next Eruption as a function of the Eruption Time in minutes.  The residuals depict the difference between the actual value of the response variable and the response variable predicted using the regression equation (Hodeghatta & Nayak, 2016).  The principle behind the regression line and the regression equation is to reduce the error or this difference (Hodeghatta & Nayak, 2016).  The expectation is that the median value should be very near to zero (Hodeghatta & Nayak, 2016).  For the model to pass the test of linearity, no pattern in the distribution of the residuals should exist (Hodeghatta & Nayak, 2016).  When there is no pattern in the distribution of the residuals, it passes the condition of linearity (Hodeghatta & Nayak, 2016).  The plot of the fitted values against the residuals with a line shows the relationship between the two. A horizontal and straight line indicates that the “average residual” for all “fitted values” is more or less the same (Navarro, 2015).  The result of the Linear Regression for the identified variables of Eruptions and Waiting Time shows that the residuals have a curved pattern, indicating that a better model can be obtained using a quadratic term, because ideally this line should be a straight horizontal line.  Figure 5 also illustrates the Normal Q-Q Plot, which is used to test the normality of the distribution (Hodeghatta & Nayak, 2016). The residuals are almost on the straight line, indicating that the residuals are normally distributed. Hence, the normality test of the residuals is passed.  Figure 5 also illustrates the Scale-Location graph, which is one of the graphs generated as part of the plot command above. The points are spread in a random fashion around the horizontal line but not equally along it. If the line were horizontal with equally and randomly spread points, the result would indicate that the assumption of constant variance of the errors (homoscedasticity) is fulfilled (Hodeghatta & Nayak, 2016).  Thus, it is not fulfilled in this case.  Figure 5 also illustrates the Residuals vs. Leverage plot generated for the Linear Regression Model. In this plot of Residuals vs. Leverage, the patterns are not as relevant as in the other diagnostic plots of the linear regression.  In this plot, the outlying values at the upper right corner or the lower right corner are watched (Bommae, 2015).  Those spots are the places where a case can be influential against a regression line (Bommae, 2015).  When cases are outside of the Cook’s distance, meaning they have high Cook’s distance scores, the cases are influential to the regression results (Bommae, 2015).  The Cook’s distance lines (red dashed lines) are far away, indicating that there is no influential case.

Regression assumes that the relationship between predictors and outcomes is linear.  However, non-linear relationships between variables can exist in some cases (Navarro, 2015).   There are some tools in statistics which can be employed to do non-linear regression.  Some non-linear regression models assume that the relationship between predictors and outcomes is monotonic, such as Isotonic Regression; others assume that it is smooth but not necessarily monotonic, such as Lowess Regression; and others assume that the relationship is of a known form which happens to be non-linear, such as Polynomial Regression (Navarro, 2015).  As indicated in (Dias, n.d.), Cleveland (1979) proposed the Lowess algorithm as an outlier-resistant method based on local polynomial fits. The underlying concept is to start with a local polynomial (a k-NN type fitting) least squares fit and then to use robust methods to obtain the final fit (Dias, n.d.).   The result of the Polynomial Regression is also addressed in this project.  The polynomial regression shows a relationship between the Waiting Time until Next Eruption and the Eruption Time.  The line in Figure 6 is similar to the Standard Linear Regression.  The Lowess Regression shows the same pattern in the relationship between the Waiting Time until Next Eruption and the Eruption Time.  The line in Figure 7 is similar to the Standard Linear Regression.  The three lines of the Standard Linear Regression, Polynomial Regression, and Lowess Regression are illustrated together for comparison in Figure 8.   The correlation coefficient result shows that there is a positive correlation, indicating the positive effect of the Eruption Time on the Waiting Time until Next Eruption.

References

Ahlemeyer-Stubbe, A., & Coleman, S. (2014). A practical guide to data mining for business and industry: John Wiley & Sons.

Bommae, K. (2015). Understanding Diagnostic Plots of Linear Regression Analysis. Retrieved from https://data.library.virginia.edu/diagnostic-plots/.

Dias, R. (n.d.). Nonparametric Regression: Lowess/Loess. Retrieved from https://www.ime.unicamp.br/~dias/loess.pdf.

Fischetti, T., Mayor, E., & Forte, R. M. (2017). R: Predictive Analysis: Packt Publishing.

Hodeghatta, U. R., & Nayak, U. (2016). Business Analytics Using R-A Practical Approach: Springer.

Ledolter, J. (2013). Data mining and business analytics with R: John Wiley & Sons.

Navarro, D. J. (2015). Learning statistics with R: A tutorial for psychology students and other beginners. R package version 0.5.

r-project.org. (2018). R: A Language and Environment for Statistical Computing. Retrieved from https://cran.r-project.org/doc/manuals/r-release/fullrefman.pdf.

 

Quantitative Analysis of “Ethanol” Dataset Using R-Programming

Dr. Aly, O.
Computer Science

Introduction

The purpose of this discussion is to discuss locally weighted scatterplot smoothing, known as the LOWESS method, for multiple regression models in a k-nearest-neighbor-based framework.  The discussion also addresses whether LOWESS is a parametric or non-parametric method.  The advantages and disadvantages of LOWESS from the computational standpoint are also addressed in this discussion.  Moreover, another purpose of this discussion is to select a dataset from http://vincentarelbundock.github.io/Rdatasets/ and perform a multiple regression analysis using R programming.  The dataset selected for this discussion is the “ethanol” dataset.  The discussion begins with Multiple Regression, the Lowess method, Lowess/Loess in R, and K-Nearest-Neighbor (k-NN), followed by the analysis of the “ethanol” dataset.

Multiple Regression

When there is more than one predictor variable, simple Linear Regression becomes Multiple Linear Regression, and the analysis becomes more involved (Kabacoff, 2011).  Polynomial Regression is typically a particular case of Multiple Regression (Kabacoff, 2011).  Quadratic Regression has two predictors (X and X2), and Cubic Regression has three predictors (X, X2, and X3) (Kabacoff, 2011).  When there is more than one predictor variable, the regression coefficients indicate the increase in the dependent variable for a unit change in a predictor variable, holding all other predictor variables constant (Kabacoff, 2011).
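
As a brief illustration (a sketch assuming a hypothetical data frame dat with an outcome y and a predictor x), polynomial terms can be added to lm() with I() or poly():

  • ##Quadratic regression: two predictors, X and X-squared
  • fit.quad <- lm(y ~ x + I(x^2), data=dat)
  • ##Equivalent fit using an orthogonal polynomial of degree 2
  • fit.poly <- lm(y ~ poly(x, 2), data=dat)
  • summary(fit.quad)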

Locally Weighted Scatterplot Smoothing (Lowess) Method

Regression assumes that the relationship between predictors and outcomes is linear.  However, non-linear relationships between variables can exist in some cases (Navarro, 2015).   There are some tools in statistics which can be employed to do non-linear regression.  Some non-linear regression models assume that the relationship between predictors and outcomes is monotonic, such as Isotonic Regression; others assume that it is smooth but not necessarily monotonic, such as Lowess Regression; and others assume that the relationship is of a known form which happens to be non-linear, such as Polynomial Regression (Navarro, 2015).  As indicated in (Dias, n.d.), Cleveland (1979) proposed the Lowess algorithm as an outlier-resistant method based on local polynomial fits. The underlying concept is to start with a local polynomial (a k-NN type fitting) least squares fit and then to use robust methods to obtain the final fit (Dias, n.d.).

Moreover, Lowess and Loess are non-parametric strategies for fitting a smooth curve to data points (statisticshowto.com, 2013).  “Parametric” indicates there is an assumption in advance that the data fit some distribution, i.e., the normal distribution (statisticshowto.com, 2013). Parametric fitting can lead to fitting a smooth curve which misrepresents the data, because some distribution is assumed in advance (statisticshowto.com, 2013).  Thus, in those cases, non-parametric smoothers may be a better choice (statisticshowto.com, 2013).  Non-parametric smoothers like Loess try to find a curve of best fit without assuming the data must fit some distribution shape (statisticshowto.com, 2013).  In general, both types of smoothers are used for the same set of data to offset the advantages and disadvantages of each type of smoother (statisticshowto.com, 2013). The benefits of non-parametric smoothing include providing a flexible approach to representing data, ease of use, and easy computations (statisticshowto.com, 2013).  The disadvantages of non-parametric smoothing include the following: (1) it cannot be used to obtain a simple equation for a set of data, (2) it is less well understood than parametric smoothers, and (3) it requires a little guesswork to obtain a result (statisticshowto.com, 2013).

Lowess/Loess in R

There are two versions of the lowess or loess scatter-diagram smoothing approach implemented in R (Dias, n.d.).  The former (lowess) was implemented first, while loess is more flexible and powerful (Dias, n.d.).  An example of lowess:

  • lowess(x, y, f=2/3, iter=3, delta=.01*diff(range(x)))

where the following model is assumed: y = b(x)+e. 

  • The “f” is the smoother span, which gives the proportion of points in the plot which influence the smooth at each value.  Larger values give more smoothness.
  • The “iter” is the number of “robustifying” iterations which should be performed; using smaller values of “iter” will make “lowess” run faster.
  • The “delta” means that values of “x” which lie within “delta” of each other are replaced by a single value in the output from “lowess” (Dias, n.d.).

The loess() function uses a formula to specify the response and (in its application as a scatter-diagram smoother) a single predictor variable (Dias, n.d.).  The loess() function creates an object which contains the results, and the predict() function retrieves the fitted values.  These can be plotted along with the response variable (Dias, n.d.).  However, the points must be plotted in increasing order of the predictor variable for the lines() function to draw the line in an appropriate fashion; this is done using the order() function applied to the predictor variable values and explicit sub-scripting (in square brackets []) to arrange the observations in ascending order (Dias, n.d.).
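
A minimal sketch of this sequence, using the built-in faithful data purely as a stand-in:

  • ##Fit loess(), retrieve the fitted values with predict(), and plot them in order of the predictor
  • lo.fit <- loess(waiting ~ eruptions, data=faithful)
  • fitted.vals <- predict(lo.fit)
  • plot(waiting ~ eruptions, data=faithful, col="blue")
  • o <- order(faithful$eruptions)   ##lines() needs the points in increasing order of the predictor
  • lines(faithful$eruptions[o], fitted.vals[o], col="red")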

K-Nearest-Neighbor (K-NN)

The K-NN classifier is based on learning numeric attributes in an n-dimensional space. All of the training samples are stored in the n-dimensional space with a unique pattern (Hodeghatta & Nayak, 2016).  When a new sample is given, the K-NN classifier searches for the pattern spaces which are closest to the sample and accordingly labels the class in the k-pattern space (called k-nearest-neighbor) (Hodeghatta & Nayak, 2016).  The “closeness” is defined in terms of Euclidean distance, where the Euclidean distance between two points, X = (x1, x2, x3,.., xn) and Y = (y1, y2, y3,.., yn), is defined as follows:
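
That is, using the standard Euclidean distance:

$$d(X, Y) = \sqrt{\sum_{i=1}^{n}(x_i - y_i)^2}$$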

The unknown sample is assigned the nearest class among the K-NN patterns.  The aim is to look for the records which are similar to, or “near,” the record to be classified among the training records which have values close to X = (x1, x2, x3,.., xn) (Hodeghatta & Nayak, 2016).  These records are grouped into classes based on their “closeness,” and the unknown sample looks for the class (defined by k) and identifies itself with the class which is nearest in the k-space (Hodeghatta & Nayak, 2016).   If a new record has to be classified, the classifier finds the nearest match to the record and tags it with that class (Hodeghatta & Nayak, 2016).

The K-NN does not assume any relationship between the predictors (X) and the class (Y) (Hodeghatta & Nayak, 2016).  Instead, it draws the conclusion of the class based on the similarity measures between the predictors and the records in the dataset (Hodeghatta & Nayak, 2016).  There are many potential measures; K-NN uses the Euclidean distance between the records to find the similarities and label the class (Hodeghatta & Nayak, 2016).  The predictor variables should be standardized to a common scale before computing the Euclidean distances and classifying (Hodeghatta & Nayak, 2016).  After computing the distances between records, a rule to put these records into different classes (k) is required (Hodeghatta & Nayak, 2016).  A higher value of (k) reduces the risk of overfitting due to noise in the training set (Hodeghatta & Nayak, 2016).  The value of (k) ideally can be between 2 and 10; for each value, the misclassification error is computed, and the value of (k) which gives the minimum error is chosen (Hodeghatta & Nayak, 2016), as in the sketch below.
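
A minimal sketch of this procedure, assuming the class package is installed and hypothetical standardized predictor matrices train.x and test.x with class labels train.y and test.y (all names are illustrative):

  • ##K-NN classification over k = 2..10, keeping the k with the smallest misclassification error
  • library(class)
  • errors <- sapply(2:10, function(k) {
  •   pred <- knn(train=train.x, test=test.x, cl=train.y, k=k)
  •   mean(pred != test.y)   ##misclassification error on the holdout set
  • })
  • k.values <- 2:10
  • k.values[which.min(errors)]   ##the value of k with the minimum error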

The advantages of the K-NN as a classification method include its simplicity and lack of parametric assumptions (Hodeghatta & Nayak, 2016).  It performs well for large training datasets (Hodeghatta & Nayak, 2016).  However, the disadvantages of the K-NN as a classification method include the time required to find the nearest neighbors and reduced performance for a large number of predictors (Hodeghatta & Nayak, 2016).

Multiple Regression Analysis for “ethanol” dataset Using R

This section is divided into five major Tasks.  The first task is to understand and examine the dataset.  Task-2, Task-3, and Task-4 are to understand density histogram, linear regression, and multiple linear regression.  Task-5 covers the discussion and analysis of the results.

Task-1:  Understand and Examine the Dataset.

The purpose of this task is to understand and examine the dataset. The description of the dataset is found in (r-project.org, 2018).  A data frame with 88 observations on the following three variables.

  • NOx Concentration of nitrogen oxides (NO and NO2) in micrograms/J.
  • C Compression ratio of the engine.
  • E Equivalence ratio–a measure of the richness of the air and ethanol fuel mixture.

#R-Commands and Results using summary(), names(), head(), dim(), and plot() functions.

  • ethanol <- read.csv("C:/CS871/Data/ethanol.csv")
  • data(ethanol)
  • summary(ethanol)
  • names(ethanol)
  • head(ethanol)
  • dim(ethanol)
  • plot(ethanol, col="red")

Figure 1.  Plot Summary of NOx, C, and E in Ethanol Dataset.

  • ethanol[1:3,]   ##First three lines
  • ethanol$NOx[1:10]  ##First 10 lines for concentration of nitrogen oxides (NOx)
  • ethanol$C[1:10]   ##First 10 lines for Compression Ratio of the Engine ( C )
  • ethanol$E[1:10]  ##First 10 lines for Equivalence Ratio ( E )
  • ##Descriptive Analysis using summary() function to analyze the central tendency.
  • summary(ethanol$NOx)
  • summary(ethanol$C)
  • summary(ethanol$E)

Task-2:  Density Histogram and Smoothed Density Histogram

  • ##Density histogram for NOx
  • hist(ethanol$NOx, freq=FALSE, col="orange")
  • install.packages("locfit")  ##locfit library is required for smoothed histogram
  • library(locfit)
  • smoothedDensity_NOx <- locfit(~lp(NOx), data=ethanol)
  • plot(smoothedDensity_NOx, col="orange", main="Smoothed Density Histogram for NOx")

Figure 2.  Density Histogram and Smoothed Density Histogram of NOx of Ethanol.

  • ##Density histogram for Equivalence Ratio ( E )
  • hist(ethanol$E, freq=FALSE, col="blue")
  • smoothedDensity_E <- locfit(~lp(E), data=ethanol)
  • plot(smoothedDensity_E, col="blue", main="Smoothed Density Histogram for Equivalence Ratio")

Figure 3.  Density Histogram and Smoothed Density Histogram of E of Ethanol.

  • ##Density histogram for Compression Ratio ( C )
  • hist(ethanol$C, freq=FALSE, col="blue")
  • smoothedDensity_C <- locfit(~lp(C), data=ethanol)
  • plot(smoothedDensity_C, col="blue", main="Smoothed Density Histogram for Compression Ratio")

Figure 4.  Density Histogram and Smoothed Density Histogram of C of Ethanol.

Task-3:  Linear Regression Model

  • ## Linear Regression
  • lin.reg.model1=lm(NOx~E, data=ethanol)
  • lin.reg.model1
  • plot(NOx~E, data=ethanol, col="blue", main="Linear Regression of NOx and Equivalence Ratio in Ethanol")
  • abline(lin.reg.model1, col="red")
  • mean.NOx=mean(ethanol$NOx, na.rm=T)
  • abline(h=mean.NOx, col="green")

Figure 5:  Linear Regression of the NOx and E in Ethanol.

  • ##local polynomial regression of NOx on the equivalent ratio
  • ##fit with a 50% nearest neighbor bandwidth.
  • local.poly.reg <-locfit(NOx~lp(E, nn=0.5), data=ethanol)
  • plot(local.poly.reg, col="blue")

Figure 6:  Smoothed Polynomial Regression of the NOx and E in Ethanol.

Figure 7.  Residuals vs. Fitted Plots.

Figure 8.  Normal Q-Q Plot.

Figure 9. Scale-Location Plot.

Figure 10.  Residuals vs. Leverage.

  • ##To better understand the linearity of the relationship represented by the model.
  • summary(lin.reg.model1)
  • plot(lin.reg.model1)
  • library(car)  ##the car package provides crPlots()
  • crPlots(lin.reg.model1)
  • termplot(lin.reg.model1)

Figure 11.  crPlots() Plots for the Linearity of the Relationship between NOx and Equivalence Ratio of the Model.

  • ##Examine the Correlation between NOx and E.
  • cor(ethanol$NOx, ethanol$E)

Task-4: Multiple Regressions

  • ##Produce Plots of some explanatory variables.
  • plot(NOx~E, ethanol, col="blue")
  • plot(NOx~C, ethanol, col="red")
  • ##Use the vertical bar to find the relationship of E on NOx conditioned on C
  • coplot(NOx~E|C, panel=panel.smooth, data=ethanol, col="blue")
  • model2=lm(NOx~E*C, ethanol)
  • plot(model2, col="blue")

Figure 12. Multiple Regression – Relationship of E on NOx conditioned with C.

Figure 13. Multiple Regression Diagnostic Plot: Residual vs. Fitted.

Figure 14. Multiple Regression Diagnostic Plot: Normal Q-Q.

Figure 15. Multiple Regression Diagnostic Plot: Scale-Location.

Figure 16. Multiple Regression Diagnostic Plot: Residual vs. Leverage.

  • summary(model2)

Task-5:  Discussion and Analysis:  The results show that the mean of NOx is 1.96, which is higher than the median of 1.75, indicating a positively skewed distribution.  The mean compression ratio of the engine (C) is 12.034, slightly above the median of 12.00, indicating an approximately normal distribution.  The mean equivalence ratio (E) is 0.926, slightly below the median of 0.932, indicating a slightly negatively skewed but nearly normal distribution.  In summary, the means are 1.96 for NOx, 12.034 for C, and 0.926 for E.
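
As a quick, hedged check of these central-tendency comparisons, the commands below recompute the means and medians and add a simple moment-based skewness estimate; the skew() helper is an illustrative addition and is not part of the original analysis.

  • ##Sketch: compare mean and median for each variable (base R only)
  • sapply(ethanol[, c("NOx", "C", "E")], function(x) c(mean = mean(x), median = median(x)))
  • ##Simple moment-based skewness estimate (positive values suggest right skew); skew() is a hypothetical helper
  • skew <- function(x) mean((x - mean(x))^3) / sd(x)^3
  • sapply(ethanol[, c("NOx", "C", "E")], skew)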

The NOx exhaust emissions depend on two predictor variables: the fuel-air equivalence ratio (E) and the compression ratio (C) of the engine.  The density of the NOx emissions and its smoothed version using local polynomial regression are illustrated in Figure 2.  The density rises from about 0.15 at the low end of the NOx values, peaks at about 0.35, and then falls as NOx continues to increase.  Thus, there appears to be a positive relationship between NOx and density over the range where the density lies between about 0.15 and 0.35; beyond the peak, the relationship reverses and turns negative.
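
To view the histogram and a smoothed density of NOx on a single plot, the sketch below overlays a base-R kernel density estimate on the histogram; it uses density() rather than locfit and is intended only as an illustrative cross-check of Figure 2.

  • ##Sketch: overlay a base-R kernel density estimate on the NOx histogram (cross-check of Figure 2)
  • hist(ethanol$NOx, freq=FALSE, col="orange", main="NOx with kernel density overlay", xlab="NOx")
  • lines(density(ethanol$NOx), lwd=2)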

The density of the equivalence ratio (E), the measure of the richness of the air and ethanol fuel mixture, and its smoothed version using local polynomial regression are illustrated in Figure 3.  The density varies as E increases: it rises with E until it reaches about 1.5, drops to about 1.2 as E continues to increase, rises again to about 1.6, and then drops once more while E keeps increasing.  In summary, the density rises and falls (showing two peaks) while the value of E keeps increasing.

The density of the compression ratio of the engine (C) and its smoothed version using local polynomial regression are illustrated in Figure 4.  The density rises from about 0.09 as C increases, peaks at about 0.11, and then falls as C continues to increase.  Thus, there appears to be a positive relationship between C and density over the range where the density lies between about 0.09 and 0.11; beyond the peak, the relationship reverses and turns negative.

Figure 5 illustrates the linear regression of NOx on the equivalence ratio (E) in the ethanol Dataset, and Figure 6 illustrates the smoothed polynomial regression of NOx on E.  The linear regression of NOx as a function of E shows that, as E increases, NOx first increases and then decreases.  The smoothed polynomial regression in Figure 6 shows the same pattern: there is a positive association between E and NOx until E reaches about 0.9 and NOx reaches about 3.5; after that point the relationship turns negative, meaning that NOx decreases as E continues to increase.
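
Because a fitted straight line cannot capture this rise-and-fall pattern, a hedged sketch of a quadratic fit is shown below; the model name quad.model and the plotting grid E.grid are illustrative additions, not part of the original code.

  • ##Sketch: add a quadratic term in E to capture the rise-and-fall of NOx (quad.model is a hypothetical name)
  • quad.model <- lm(NOx ~ E + I(E^2), data=ethanol)
  • summary(quad.model)
  • plot(NOx ~ E, data=ethanol, col="blue")
  • E.grid <- seq(min(ethanol$E), max(ethanol$E), length.out=100)
  • lines(E.grid, predict(quad.model, newdata=data.frame(E=E.grid)), col="red", lwd=2)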

This analysis also covers the residuals and fitted values.  Figure 7 illustrates the Residuals vs. Fitted plot for the linear regression model of NOx as a function of E.  The residuals depict the difference between the actual value of the response variable and the value predicted by the regression equation (Hodeghatta & Nayak, 2016).  The principle behind the regression line and the regression equation is to minimize this difference, or error (Hodeghatta & Nayak, 2016).  The expectation is that the median of the residuals should be very near zero (Hodeghatta & Nayak, 2016).  For the model to pass the test of linearity, there should be no pattern in the distribution of the residuals; when no pattern exists, the condition of linearity is satisfied (Hodeghatta & Nayak, 2016).  The plot of the fitted values against the residuals, with a smoothed line, shows the relationship between the two; a horizontal, straight line indicates that the average residual is more or less the same for all fitted values (Navarro, 2015).
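
The same Residuals vs. Fitted view shown in Figure 7 can be reproduced directly from the fitted model object, as sketched below; the lowess() smoother is an illustrative substitute for the smoother that plot() draws on the lm object.

  • ##Sketch: reproduce the Residuals vs. Fitted plot directly from the fitted model
  • plot(fitted(lin.reg.model1), resid(lin.reg.model1), col="blue", xlab="Fitted values", ylab="Residuals", main="Residuals vs. Fitted")
  • abline(h=0, col="red", lty=2)
  • lines(lowess(fitted(lin.reg.model1), resid(lin.reg.model1)), col="darkgreen")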

The linear regression of NOx on E shows that the residuals follow a curved pattern; ideally this line should be horizontal and straight, so a better model can likely be obtained by adding a quadratic term.  Figure 8 illustrates the Normal Q-Q plot, which is used to test the normality of the residuals (Hodeghatta & Nayak, 2016).  The residuals fall almost exactly on the straight line, indicating that they are approximately normally distributed; hence, the normality test of the residuals is passed.

Figure 9 illustrates the Scale-Location plot, one of the graphs generated by the plot() command above.  The points are spread around the horizontal line in a random fashion, but not equally along the line.  If the points were spread equally and randomly around a horizontal line, the assumption of constant error variance (homoscedasticity) would be fulfilled (Hodeghatta & Nayak, 2016); it is therefore not fulfilled in this case.

Figure 10 illustrates the Residuals vs. Leverage plot for the linear regression model.  In this plot, the patterns themselves are less relevant than in the other diagnostic plots; instead, outlying values in the upper-right or lower-right corners are watched (Bommae, 2015), because those are the places where a case can be influential against the regression line (Bommae, 2015).  When cases fall outside of Cook's distance, meaning they have high Cook's distance scores, they are influential to the regression results (Bommae, 2015).  Here the Cook's distance boundaries (the red dashed lines) are far from the points, indicating that there is no influential case.

Figure 11 illustrates the output of the crPlots() function, which is used to better understand the linearity of the relationship represented by the model (Hodeghatta & Nayak, 2016); non-linearity requires re-exploring the model (Hodeghatta & Nayak, 2016).  The result in Figure 11 shows that the fitted model is not linear, so the model should be re-explored.  Moreover, the correlation between NOx and E is negative, with a value of -0.11.

Figure 12 illustrates the multiple regression and the relationship of the equivalence ratio (E) on NOx conditioned on the compression ratio (C).  Multiple linear regression is useful for modeling the relationship between a numeric outcome or dependent variable (Y) and multiple explanatory or independent variables (X).  The result shows that the interaction of C and E affects NOx: while E and C increase, NOx decreases.  Approximately 0.013 (1.3%) of the variation in NOx can be explained by this model (E*C), and the interaction of E and C has a negative estimated effect of -0.063 on NOx.
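
The numerical statements above (the Cook's distance assessment, the proportion of variation explained by the E*C model, and the interaction estimate) can be checked directly from the fitted objects, as sketched below; the exact values depend on the data and the model objects created earlier.

  • ##Sketch: numerical checks behind the discussion above
  • head(sort(cooks.distance(lin.reg.model1), decreasing=TRUE))  ##largest Cook's distances (most influential cases)
  • summary(model2)$r.squared  ##proportion of variation in NOx explained by the E*C model
  • coef(model2)["E:C"]  ##estimated interaction effect of E and C on NOx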

References

Bommae, K. (2015). Understanding Diagnostic Plots of Linear Regression Analysis. Retrieved from https://data.library.virginia.edu/diagnostic-plots/.

Dias, R. (n.d.). Nonparametric Regression: Lowess/Loess. Retrieved from https://www.ime.unicamp.br/~dias/loess.pdf.

Hodeghatta, U. R., & Nayak, U. (2016). Business Analytics Using R-A Practical Approach: Springer.

Kabacoff, R. I. (2011). R in Action: Data Analysis and Graphics with R. Manning Publications.

Navarro, D. J. (2015). Learning statistics with R: A tutorial for psychology students and other beginners. R package version 0.5.

r-project.org. (2018). R: A Language and Environment for Statistical Computing. Retrieved from https://cran.r-project.org/doc/manuals/r-release/fullrefman.pdf.

statisticshowto.com. (2013). Lowess Smoothing: Overview. Retrieved from http://www.statisticshowto.com/lowess-smoothing/.