Dr. O. Aly
Computer Science
Abstract
The purpose of this project is to analyze the flight delays Dataset. The project is divided into two main Parts. Part-I evaluates and examines the Dataset in RStudio and involves five major tasks to review and understand the Dataset variables. Part-II discusses the pre-data analysis by converting the Dataset to a Data Frame and involves three major tasks to analyze the Data Frame, using logistic regression first, followed by the naïve Bayesian method. The naïve Bayesian method uses probabilities estimated from a training set consisting of 60% randomly selected flights, while the remaining 40% of the 2201 flights serve as the holdout period. The misclassification proportion of the naïve Bayesian method is 19.52%, which is slightly higher than that of the logistic regression. The model correctly predicts 30 of the 167 delayed flights but fails to identify 137/(137+30), or 82%, of the delayed flights. Moreover, 35/(35+679), or 4.9%, of the on-time flights are predicted as delayed, as illustrated in Task-2 of Part-II and Figure 19.
Keywords: Flight Delays Dataset; Naïve Bayes Prediction Analysis Using R.
Introduction
This project examines and analyzes the flight.delays.csv Dataset, which was downloaded from the CTU course materials. A couple of attempts were made to download the Dataset from https://www.transtats.bts.gov/; however, those attempts were abandoned because of the size of the Datasets available from that link and the limited resources of the student's machine. Thus, this project uses the version of flight.delays.csv provided in the course material. The Dataset has 2201 observations on 14 variables. The focus of this analysis is Naïve Bayes. However, for a better understanding of the prediction and for a comparison of two different models, the researcher also implemented Logistic Regression first, followed by the Naïve Bayesian approach, on the same flight.delays.csv Dataset. This project addresses two major Parts. Part-I covers the following key Tasks to understand and examine the Dataset of "flight.delays.csv."
- Task-1: Review the Variables of the Dataset.
- Task-2: Load and Understand the Dataset Using names(), head(), dim() Functions.
- Task-3: Examine the Dataset, Install the Required Packages, and Summary of the Descriptive Statistics.
- Task-4: Create Data Frame and Histogram of the Delay (Response).
- Task-5: Visualization of the Desired Variables Using the plot() Function.
Part-II covers the following three primary Tasks to plot, discuss, and analyze the results.
- Task-1: Logistic Regression Model for Flight Delays Prediction
- Task-2: Naïve Bayesian Model for Flight Delays Prediction.
- Task-3: Discussion and Analysis.
Various resources were utilized to develop the required code using R. These resources include (Ahlemeyer-Stubbe & Coleman, 2014; Fischetti, Mayor, & Forte, 2017; Ledolter, 2013; r-project.org, 2018).
Part-I: Understand and Examine the Dataset “flight.delays.csv”
Task-1: Review the Variables of the Dataset
The purpose of this task is to understand the variables of the "flight.delays.csv" Dataset. The Dataset describes 2201 flights from Washington, DC into New York City and whether each flight was delayed by more than 15 minutes. There are 14 variables. Table 1 summarizes the variables selected for this project; a short sketch for inspecting them in R follows Table 1.

Table 1: Flight Delays Variables
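Before any modeling, the structure and types of these variables can be confirmed directly in R. The following is a minimal sketch, assuming the file has already been read into the data frame fd as shown in Task-2.
- str(fd)                                  ## class and a preview of every variable
- sapply(fd, class)                        ## variable types only
- sapply(fd, function(x) sum(is.na(x)))    ## count of missing values per variable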
Task-2: Load and Understand the Dataset Using names(), head(), dim() Functions.
The purpose of this task is to load and understand the Dataset using the names(), head(), and dim() functions. The task also displays the first three observations.
- ## reading the data
- fd <- read.csv("C:/CS871/Data/flight.delays.csv")
- names(fd[1:5,])
- head(fd)
- dim(fd)
- fd[1:3,]
Task-3: Examine the Dataset, Install the Required Packages, and Summary of the Descriptive Statistics.
The purpose of this task is to examine the Dataset and install the required package (car). This task also displays the descriptive statistics for the analysis.
- ### set seed
- set.seed(1)
- ##Required Library(car) to recode a variable
- install.packages("car")
- library(car)
- summary(fd)
- plot(fd, col="blue")

Figure 1. The plot of the Identified Variables for the Flight Delays Dataset.
Task-5: Visualization of the Desired Variables Using the plot() Function.
The purpose of this task is to visualize the selected variables using the plot() function for a good understanding of these variables and of the current trend of each variable.
- plot(fd$schedf, col="blue", main="Histogram of the Scheduled Time")
- plot(fd$carrier, col="blue", main="Histogram of the Carrier")
- plot(fd$dest, col="blue", main="Histogram of the Destination")
- plot(fd$origin, col="blue", main="Histogram of the Origin")
- plot(fd$weather, col="blue", main="Histogram of the Weather")
- plot(fd$dayweek, col="blue", main="Histogram of the Day of Week")

Figure 2. Histogram of the Schedule Time and Carrier.

Figure 3. Histogram of the Destination and Origin.

Figure 4. Histogram of the Weather and Day of Week.
Part-II: Plot, Discuss and Analyze
Task-1: Logistic Regression Model for Flight Delays
The purpose of this task is to first use the logistic regression model for predicting whether a flight is delayed by more than 15 minutes. The Dataset consists of 2201 flights in 2004 from Washington, DC into New York City. The response indicates whether or not a flight was delayed by more than 15 minutes, coded as 0 = no delay and 1 = delayed by more than 15 minutes. The explanatory variables include:
- Three arrival airports (Kennedy, Newark, and LaGuardia).
- Three different departure airports (Reagan, Dulles, and Baltimore).
- Eight carriers.
- A categorical variable for 16 different hours of departure (6:00 AM to 10:00 PM).
- Weather conditions (0=good, 1=bad).
- Day of week (1 for Sunday and Monday; and 0 for all other days).
The R code for the logistic regression model is shown below.
- ## Create a Data Frame and Understand the Dataset.
- fd <-data.frame(fd)
- names(fd)
- head(fd)
- fd[1:5,]
- dim(fd)
- summary(fd)
- plot(fd, col="blue")
- ## library car is needed to recode variables
- library(car)
- ##Define hours of Departure
- fd$sched=factor(floor(fd$schedtime/100))
- table(fd$sched)
- table(fd$carrier)
- table(fd$dest)
- table(fd$origin)
- table(fd$weather)
- table(fd$dayweek)
- table(fd$daymonth)
- table(fd$delay)
- fd$delay=recode(fd$delay,"'delayed'=1;else=0")
- fd$delay=as.numeric(levels(fd$delay)[fd$delay])
- table(fd$delay)
- ## Summary of the Major Variables
- summary(fd$sched)
- summary(fd$carrier)
- summary(fd$dest)
- summary(fd$origin)
- summary(fd$weather)
- summary(fd$dayweek)
- summary(fd$daymonth)
- summary(fd$delay)
- ## Plots and Histograms of the Major Variables
- plot(fd$sched, col="blue", main="Schedule Departure Time")
- plot(fd$carrier, col="darkblue", main="Flight Carriers")
- plot(fd$dest, col="darkred", main="Destination of Flights")
- plot(fd$origin, col="green", main="Origin of Flights")
- plot(fd$weather, col="darkgreen", main="Weather During Flight Days")
- hist(fd$dayweek, col="darkblue", main="Flights Day of the Week", xlab="Day of Week")
- hist(fd$daymonth, col="yellow", main="Flights Day of the Month")
- plot(fd$delay, col="red", main="Plot of the Delay")
- hist(fd$delay, col="red", main="Histogram of the Delay")
- ## Day of week: Monday (1) and Sunday (7) coded as 1, else 0.
- fd$dayweek=recode(fd$dayweek,"c(1,7)=1;else=0")
- table(fd$dayweek)
- summary(fd$dayweek)
- hist(fd$dayweek, col="darkblue", main="Flights Day of the Week", xlab="Day of Week")
- ## Omit unused variables
- fd=fd[,c(-1,-3,-5,-6,-7,-11,-12)]
- fd[1:5,]
- ## Create Sample Dataset
- delay.length=length(fd$delay)
- delay.length
- delay.length1=floor(delay.length*(0.6))
- delay.length1
- delay.length2=delay.length-delay.length1
- delay.length2
- train=sample(1:delay.length, delay.length1)
- train
- plot(train, col="red")
- ## Estimation of Logistic Regression Model
- ##Explanatory Variables: carrier, destination, origin, weather, day of week,
- ##(weekday/weekend), scheduled hour of departure.
- ## Create design matrix; indicators for categorical variables (factors)
- Xfd <- model.matrix(delay~., data=fd) [,-1]
- Xfd[1:5,]
- xtrain <- Xfd[train,]
- xtrain[1:2,]
- xtest <- Xfd[-train,]
- xtest[1:2,]
- ytrain <- fd$delay[train]
- ytrain[1:5]
- ytest <- fd$delay[-train]
- ytest[1:5]
- model1 = glm(delay~., family=binomial, data=data.frame(delay=ytrain,xtrain))
- summary(model1)
- ## Prediction: predicted delay probabilities for cases in the test set
- probability.test <- predict(model1, newdata=data.frame(xtest), type="response")
- data.frame(ytest,probability.test)[1:10,]
- ## The first column in list represents the case number of the test element
- plot(ytest~probability.test, col="blue")
- ## Coding as 1 if the probability is 0.5 or larger
- ### using floor function
- probability.fifty = floor(probability.test+0.5)
- table.ytest = table(ytest,probability.fifty)
- table.ytest
- error = ((table.ytest[1,2]+table.ytest[2,1])/delay.length2)
- error

Figure 5. The probability of the Delay Using Logistic Regression.
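For comparison, the same logistic regression can be fit without building the design matrix explicitly, because glm() expands factor predictors automatically. The following is a minimal sketch, assuming fd still contains the recoded columns used in the listing above and that train holds the indices of the training rows; model1b and prob.b are new names introduced here for illustration.
- ## fit on the training rows only; factors are expanded into indicators by glm()
- model1b <- glm(delay ~ ., family=binomial, data=fd[train, ])
- summary(model1b)
- ## predicted delay probabilities for the holdout rows and the implied error at a 0.5 cutoff
- prob.b <- predict(model1b, newdata=fd[-train, ], type="response")
- table(ytest, floor(prob.b + 0.5))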
Task-2: Naïve Bayesian Model for Predicting Delayed and On-time Flights
The purpose of this task is to use the Naïve Bayesian model for predicting a categorical response from mostly categorical predictor variables. The Dataset consists of 2201 flights in 2004 from Washington, DC into New York City. The response indicates whether or not a flight was delayed by more than 15 minutes (0 = no delay, 1 = delay). The explanatory variables include the following:
- Three arrival airports (Kennedy, Newark, and LaGuardia).
- Three different departure airports (Reagan, Dulles, and Baltimore).
- Eight carriers.
- A categorical variable for 16 different hours of departure (6:00 AM to 10:00 PM).
- Weather conditions (0=good, 1=bad).
- Day of week (7 days with Monday=1, …, Sunday=7).
The R code for the naïve Bayesian model is shown below, followed by the result of each step.
- fd=data.frame(fd)
- fd$schedf=factor(floor(fd$schedtime/100))
- fd$delay=recode(fd$delay,"'delayed'=1;else=0")
- response=as.numeric(levels(fd$delay)[fd$delay])
- hist(response, col="orange")
- fd.mean.response=mean(response)
- fd.mean.response
- ## Create Train Dataset 60/40
- n=length(fd$dayweek)
- n
- n1=floor(n*(0.6))
- n1
- n2=n-n1
- n2
- train=sample(1:n,n1)
- train
- plot(train, col="blue", main="Train Data Plot")
- ## Determine Marginal Probabilities
- td=cbind(fd$schedf[train],fd$carrier[train],fd$dest[train],fd$origin[train],fd$weather[train],fd$dayweek[train],response[train])
- td
- tttrain0=td[td[,7]<0.5,]
- tttrain1=td[td[,7]>0.5,]
- tttrain0[1:3,]
- tttrain1[1:3,]
- plot(td, col="blue", main="Train Data")
- plot(tttrain0, col="blue", main="Marginal Probability < 0.5")
- plot(tttrain1, col="blue", main="Marginal Probability > 0.5")
- ## Prior Probabilities for Delay for P( y = 0 ) and P(y = 1)
- tdel=table(response[train])
- tdel=tdel/sum(tdel)
- tdel
- ## output: P(y = 0) = 0.8022727, P(y = 1) = 0.1977273
- ## Probabilities for Scheduled Time
- ### Probabilities for (y=0) for Scheduled Time
- ts0=table(tttrain0[,1])
- ts0=ts0/sum(ts0)
- ts0
- ### Probabilities for (y = 1) for Scheduled Time
- ts1=table(tttrain1[,1])
- ts1=ts1/sum(ts1)
- ts1
- ## Probabilities for Carrier
- ## Probabilities for (y = 0) for Carrier
- tc0=table(tttrain0[,2])
- tc0=tc0/sum(tc0)
- tc0
- tc1=table(tttrain1[,2])
- tc1=tc1/sum(tc1)
- tc1
- ## Probabilities for Destination
- ##Probabilities for (y=0) for Destination
- td0=table(tttrain0[,3])
- td0=td0/sum(td0)
- td0
- ##Probabilities for (y=1) for Destination
- td1=table(tttrain1[,3])
- td1=td1/sum(td1)
- td1
- ## Probabilities for Origin
- ##Probabilities for (y=0) for Origin
- to0=table(tttrain0[,4])
- to0=to0/sum(to0)
- to0
- ##Probabilities for (y=1) for Origin
- to1=table(tttrain1[,4])
- to1=to1/sum(to1)
- to1
- ## Probabilities for Weather
- ##Probabilities for (y=1) for Weather
- tw1=table(tttrain1[,5])
- tw1=tw1/sum(tw1)
- tw1
- ##Probabilities for (y=0) for Weather
- tw0=table(tttrain0[,5])
- tw0=tw0/sum(tw0)
- tw0
- ## bandaid: the on-time group may have no bad-weather flights, leaving an empty cell
- tw0=tw1
- tw0[1]=1
- tw0[2]=0
- ## Probabilities for Day of Week
- #### Probabilities for (y=0) for Day of Week
- tdw0=table(tttrain0[,6])
- tdw0=tdw0/sum(tdw0)
- tdw0
- #### Probabilities for (y=1) for Day of Week
- tdw1=table(tttrain1[,6])
- tdw1=tdw1/sum(tdw1)
- tdw1
- ### Create Test Data
- testdata=cbind(fd$schedf[-train],fd$carrier[-train],fd$dest[-train],fd$origin[-train],fd$weather[-train],fd$dayweek[-train],response[-train])
- testdata[1:3,]
- ## With these estimates, the following posterior probability can be determined:
- ## P(y = 1 | Carrier = 7, DOW = 7, DepTime = 9-10 AM, Dest = LGA, Origin = DCA, Weather = 0)
- ##   = [(0.015)(0.172)(0.027)(0.402)(0.490)(0.920)](0.198) /
- ##     { [(0.015)(0.172)(0.027)(0.402)(0.490)(0.920)](0.198) + [(0.015)(0.099)(0.059)(0.545)(0.653)(1)](0.802) }
- ##   = 0.09
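- ## (Illustrative check, not part of the original listing: the same posterior can be
- ##  verified numerically from the estimated probabilities quoted above.)
- num1 = 0.015*0.172*0.027*0.402*0.490*0.920*0.198
- num0 = 0.015*0.099*0.059*0.545*0.653*1*0.802
- num1/(num1+num0)   ## approximately 0.09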
- ## Creating Predictions, stored in prediction
- p0=ts0[testdata[,1]]*tc0[testdata[,2]]*td0[testdata[,3]]*to0[testdata[,4]]*tw0[testdata[,5]+1]*tdw0[testdata[,6]]
- p1=ts1[testdata[,1]]*tc1[testdata[,2]]*td1[testdata[,3]]*to1[testdata[,4]]*tw1[testdata[,5]+1]*tdw1[testdata[,6]]
- prediction=(p1*tdel[2])/(p1*tdel[2]+p0*tdel[1])
- hist(prediction, col="blue", main="Histogram of Predictions")
- plot(response[-train], prediction, col="blue")
- ###Coding as 1 if probability >=0.5
- ## Calculate the Probability for at least 0.5 or more
- prob1=floor(prediction+0.5)
- tr=table(response[-train],prob1)
- tr
- error=(tr[1,2]+tr[2,1])/n2
- error
- ## Calculate the predictions for a cutoff of at least 0.3
- prob2=floor(prediction+0.7)
- tr2=table(response[-train],prob2)
- tr2
- error2=(tr2[1,2]+tr2[2,1])/n2
- error2
- ## calculating the lift: cumulative 1's sorted by predicted values versus
- ## cumulative 1's expected from the average success probability
- ## (assumed definitions, not shown in the original listing: bb1 holds the predicted
- ##  probabilities and holdout responses ordered by decreasing prediction, and xbar is
- ##  the average success probability, i.e., the proportion of delayed flights in the holdout sample)
- ord=order(prediction,decreasing=TRUE)
- bb1=cbind(prediction[ord],response[-train][ord])
- xbar=mean(response[-train])
- axis=dim(n2)
- ax=dim(n2)
- ay=dim(n2)
- axis[1]=1
- ax[1]=xbar
- ay[1]=bb1[1,2]
- for (i in 2:n2) {
- axis[i]=i
- ax[i]=xbar*i
- ay[i]=ay[i-1]+bb1[i,2]
- }
- aaa=cbind(bb1[,1],bb1[,2],ay,ax)
- aaa[1:100,]
- plot(axis,ay,xlab="Number of Cases",ylab="Number of Successes",main="Lift: Cum Successes Sorted by Predicted Values Using Average Success Probabilities", col="red")
- points(axis,ax,type="l")
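As an optional cross-check of the hand-computed conditional probabilities above, the same naïve Bayesian classifier could also be fit with the naiveBayes() function from the e1071 package. This is only a sketch under the assumption that fd still holds the recoded columns used in this task (schedf, carrier, dest, origin, weather, dayweek) and that train and response are defined as above; the helper columns weatherf, dayweekf, and delayf are new names introduced here for illustration, and the results may differ slightly because of how empty cells are handled.
- ## install.packages("e1071")   ## if not already installed
- library(e1071)
- fd$weatherf <- as.factor(fd$weather)    ## helper factors introduced for this check
- fd$dayweekf <- as.factor(fd$dayweek)
- fd$delayf <- as.factor(response)
- nb <- naiveBayes(delayf ~ schedf + carrier + dest + origin + weatherf + dayweekf, data=fd[train, ])
- nb.prob <- predict(nb, newdata=fd[-train, ], type="raw")[, 2]   ## posterior P(delay = 1)
- table(response[-train], floor(nb.prob + 0.5))                   ## confusion table at the 0.5 cutoff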

Figure 6. Pre- and Post-Factor Levels of the Recoded Categorical Variable.

Figure 7. Train Dataset Plot.

Figure 8. Train Data, Marginal Probability of <0.5 and >0.5.

Figure 9. Prior Probability for Delay: (y=0) and (y=1).

Figure 10. Conditional Probabilities for Scheduled Time: Left (y=0) and Right (y=1).

Figure 11. Conditional Probabilities for Carrier: Left (y=0) and Right (y=1).

Figure 12. Conditional Probabilities for Destination: Left (y=0) and Right (y=1).

Figure 13. Conditional Probabilities for Origin: Left (y=0) and Right (y=1).

Figure 14. Conditional Probabilities for Weather: Left (y=0) and Right (y=1).

Figure 15. Conditional Probabilities for Day of Week: Left (y=0) and Right (y=1).

Figure 16. Test Data Plot.

Figure 17. Histogram of the Prediction Using Bayesian Method.

Figure 18. Plot of Prediction to the Response Using the Test Data.

Figure 19. Classification Results for a Cutoff of at Least 0.5 (left) and at Least 0.3 (right).

Figure 20. Lift: Cum Success Sorted by Predicted Values Using Average Success Probabilities.
Task-3: Discussion and Analysis
The descriptive analysis shows that the average scheduled time (13:72) is less than the median (14:55), indicating a negatively skewed distribution, while the average departure time (13:69) is less than the median (14:50), confirming the negative skew. The carrier results show that DH has the highest frequency (551), followed by RU (408). The destination results show that LGA has the highest frequency (1150), followed by EWR (665) and JFK (386). The origin results show that DCA has the highest frequency (1370), followed by IAD (686) and BWI (145). The results also show that weather is not the primary reason for the delays; only a few of the delays are related to bad weather. The descriptive analysis shows that on-time flights have the highest frequency (1773), followed by delayed flights (428). The average delay (response) is 0.195.
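The mean-versus-median comparison behind the skewness statement can be reproduced directly from the raw columns. The following is a minimal sketch, assuming a freshly read copy of fd in which the scheduled and actual departure times are stored in columns named schedtime and deptime (the latter name is an assumption).
- mean(fd$schedtime);  median(fd$schedtime)   ## mean below median suggests negative skew
- mean(fd$deptime);    median(fd$deptime)
- summary(fd$schedtime)                       ## quartiles give a fuller picture of the skew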
The results show that the success probability, the proportion of delayed flights in the training set, is 0.198, as analyzed in Task-2 of Part-II; the failure probability, the proportion of on-time flights, is 0.802, as discussed in Task-2 of Part-II. The naïve rule, which does not incorporate any covariate information, classifies every flight as on-time, because the estimated unconditional probability of a flight being on-time (0.802) is larger than the cutoff of 0.5. Thus, this rule never makes an error when a flight is on-time, but it is always wrong when the flight is delayed. The naïve rule fails to identify the 167 delayed flights among the 881 flights of the evaluation Dataset, as shown in Task-1 of Part-II; its misclassification error rate in the holdout sample is 167/881 = 0.189. The logistic regression reduces the overall misclassification error in the holdout (evaluation/test) Dataset to 0.176, a modest improvement over the naïve rule (0.189), as illustrated in Task-1 of Part-II. Among the 167 delayed flights, the logistic regression correctly identifies 14 (8.4%) but misses 153/167 (91.6%). Moreover, the logistic regression model predicts 2 of the 714 on-time flights as being delayed, as illustrated in Task-1 of Part-II.
The naïve Bayesian method uses probabilities estimated from the training set consisting of 60% randomly selected flights, while the remaining 40% of the 2201 flights serve as the holdout period. The misclassification proportion of the naïve Bayesian method is 19.52%, which is slightly higher than that of the logistic regression. The model correctly predicts 30 of the 167 delayed flights but fails to identify 137/(137+30), or 82%, of the delayed flights. Moreover, 35/(35+679), or 4.9%, of the on-time flights are predicted as delayed, as illustrated in Task-2 of Part-II and Figure 19.
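The error rates quoted above follow directly from the two-by-two confusion tables. The short sketch below recomputes them from the counts quoted in the text for the naïve Bayesian holdout results; the helper function class.rates and the matrix nb.tab are introduced here only for illustration.
- ## rows = actual (0 = on-time, 1 = delayed), columns = predicted
- nb.tab <- matrix(c(679, 35, 137, 30), nrow=2, byrow=TRUE,
-                  dimnames=list(actual=c("0","1"), predicted=c("0","1")))
- class.rates <- function(tab) {
-   c(error       = (tab[1,2] + tab[2,1]) / sum(tab),   ## overall misclassification
-     miss.rate   = tab[2,1] / sum(tab[2,]),            ## delayed flights not identified
-     false.alarm = tab[1,2] / sum(tab[1,]))            ## on-time flights flagged as delayed
- }
- round(class.rates(nb.tab), 4)   ## approximately 0.1952, 0.8204, 0.0490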
The lift chart (Figure 20) is constructed with the number of cases on the x-axis and the cumulative number of true positives on the y-axis. True positives are the observations that are classified correctly. The lift chart measures the effectiveness of a classification model by comparing the true positives obtained with the model against those expected without a model (Hodeghatta & Nayak, 2016). It also indicates how well the model performs relative to selecting cases randomly from the population (Hodeghatta & Nayak, 2016), and it allows the performance of different models to be compared on the same set of cases (Hodeghatta & Nayak, 2016). In Figure 20, the lift varies with the number of cases, and the black line is a reference line: it provides the benchmark for predicting positive cases when no model is used. The straight reference line graphs the expected number of delayed flights against the number of cases, assuming that the probability of delay is estimated by the proportion of delayed flights in the evaluation sample; it therefore expresses the performance of the naïve model. With ten flights, for instance, the expected number of delayed flights is 10p, where p is the proportion of delayed flights in the evaluation sample, which is 0.189 in this case. At the very end, the lift curve and the reference line meet. In the beginning, however, the logistic regression leads to a "lift": for instance, when picking the 10 cases with the largest estimated success probabilities, all 10 cases turn out to be delayed. If the lift is close to the reference line, then there is not much point in using the estimated model for classification. The overall misclassification rate of the logistic regression is not that different from that of the naïve strategy, which considers all flights as being on-time. However, as the lift curve shows, the flights with the largest probabilities of being delayed are classified correctly; the logistic regression is quite successful in identifying those flights as being delayed. The lift curve in Figure 20 thus shows that the model gives an advantage in detecting the flights that are most clearly going to be delayed or on-time.
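The cumulative-successes construction used for Figure 20 can also be expressed compactly with cumsum() instead of the explicit loop. This is a sketch under the same assumptions as the lift code above, namely that prediction holds the predicted delay probabilities and response[-train] the actual outcomes for the holdout flights.
- ord <- order(prediction, decreasing=TRUE)                   ## sort holdout cases by predicted probability
- cum.hits <- cumsum(response[-train][ord])                   ## cumulative delayed flights captured (lift curve)
- cum.ref <- mean(response[-train]) * seq_along(cum.hits)     ## expected captures under the naïve proportion
- plot(cum.hits, type="l", col="red", xlab="Number of Cases", ylab="Number of Successes")
- lines(cum.ref)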
References
Ahlemeyer-Stubbe, A., & Coleman, S. (2014). A practical guide to data mining for business and industry: John Wiley & Sons.
Fischetti, T., Mayor, E., & Forte, R. M. (2017). R: Predictive Analysis: Packt Publishing.
Hodeghatta, U. R., & Nayak, U. (2016). Business Analytics Using R-A Practical Approach: Springer.
Ledolter, J. (2013). Data mining and business analytics with R: John Wiley & Sons.
r-project.org. (2018). R: A Language and Environment for Statistical Computing. Retrieved from https://cran.r-project.org/doc/manuals/r-release/fullrefman.pdf.