Quantitative Analysis of “German.Credit” Dataset Using K-NN Classification and Cross-Validation

Dr. O. Aly
Computer Science

Introduction

The purpose of this discussion is to use the german.credit.csv dataset to examine which loans result in default.  The outcome is binary: success (defaulting on the loan) and failure (not defaulting on the loan).  The explanatory variables in the logistic regression are the type of loan and the borrowing amount.  For the K-NN classification, three continuous variables are used: duration, amount, and installment.  Cross-validation for the nearest-neighbor classifier with k=5 is also applied in this analysis.

The dataset was downloaded from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data).  It contains 1,000 observations on 21 variables.  Two versions of the dataset are available.  The original version, in the form provided by Professor Hofmann, contains categorical/symbolic attributes and is the german.credit.csv file used in this discussion.  The other version, "german.data-numeric", produced by Strathclyde University for algorithms that require numerical attributes, is not used here.  This discussion uses the original german.credit.csv, which has both categorical and continuous variables.

This analysis discusses and addresses fourteen Tasks as shown below:

  • Task-1: Understand the Variables of the Dataset.
  • Task-2: Load and Review the Dataset Using names(), head(), dim() Functions.
  • Task-3: Pre and Post Factor and Level of Categorical Variables of the Dataset.
  • Task-4: Summary and Plot of the Continuous Variables: Duration, Amount, and Installment.
  • Task-5: Classify Amount into Groups.
  • Task-6: Summary of All Selected Variables.
  • Task-7: Select and Plot Specific Variables for This Analysis.
  • Task-8: Create the Design Matrix.
  • Task-9: Create Training and Prediction Datasets.
  • Task-10: Implement the K-Nearest Neighbor Method.
  • Task-11: Calculate the Proportion of Correct Classifications.
  • Task-12: Plot for 3 Nearest Neighbors.
  • Task-13: Cross-Validation with k=5 for the Nearest Neighbor.
  • Task-14: Discussion and Analysis.

Various resources were utilized to develop the required R code (Ahlemeyer-Stubbe & Coleman, 2014; Fischetti, Mayor, & Forte, 2017; Ledolter, 2013; r-project.org, 2018).

Task-1:  Understand the Variables of the Dataset

The purpose of this task is to understand the variables of the "german.credit" dataset, which describes clients who may default on a loan.  A subset of the 21 variables is selected as the target of this analysis.  Table 1 and Table 2 summarize the selected variables: Table 1 covers the variables with binary and numerical values, while Table 2 covers the variables with categorical values.

Table 1:  Binary and Numerical Variables

Table 2: Categorical Variables.

Task-2:  Load and Review the Dataset using names(), head(), dim() functions

  • gc <- read.csv("C:/CS871/german.credit.csv")
  • names(gc)
  • head(gc)
  • dim(gc)
  • gc[1:3,]

Task-3:  Pre and Post Factor and Level of Categorical Variables of the Dataset

  • ## history categorical variable pre and post factor and level
  • summary(gc$history)
  • plot(gc$history, col="green", xlab="History Categorical Variable Pre Factor and Level")
  • gc$history <- factor(gc$history, levels=c("A30", "A31", "A32", "A33", "A34"))
  • levels(gc$history) <- c("good-others", "good-thisBank", "current-paid-duly", "bad-delayed", "critical")
  • summary(gc$history)
  • plot(gc$history, col="green", xlab="History Categorical Variable Post Factor and Level")
  • ## purpose pre and post factor and level
  • summary(gc$purpose)
  • plot(gc$purpose, col="darkgreen")
  • ## transform purpose (code A47, vacation, does not occur in the data)
  • gc$purpose <- factor(gc$purpose, levels=c("A40","A41","A42","A43","A44","A45","A46","A48","A49","A410"))
  • levels(gc$purpose) <- c("newcar","usedcar","furniture/equipment","radio/television","domestic appliances","repairs","edu","retraining","business","others")
  • summary(gc$purpose)
  • plot(gc$purpose, col="darkgreen")

Figure 1.  Example of Pre and Post Factor of Purpose as Categorical Variable.

Task-4:  Summary & Plot the Numerical Variables: Duration, Amount, Installment

  • ## summary and plot of the numerical variables
  • summary(gc$duration)
  • plot(gc$duration, col="blue", main="Duration Numerical Variable")
  • summary(gc$amount)
  • plot(gc$amount, col="blue", main="Amount Numerical Variable")
  • summary(gc$installment)
  • plot(gc$installment, col="blue", main="Installment Numerical Variable")

Figure 2:  Duration, Amount, and Installment Continuous Variables.

Task-5:  Classify the Amount into Groups

  • ## classify the amount into groups
  • gc$amount <- as.factor(ifelse(gc$amount <= 2500, '0-2500', ifelse(gc$amount <= 5000, '2501-5000', '5000+')))
  • summary(gc$amount)
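An equivalent and arguably more idiomatic way to bin a numeric variable in base R is cut(), which makes the interval boundaries explicit.  The sketch below is not part of the original code: it assumes gc$amount is still the raw numeric loan amount, and it writes the result to a hypothetical new column (amount.group) so the numeric values are preserved for the later distance-based steps.

```r
# Alternative to the nested ifelse(): cut() with explicit right-closed breaks.
# amount.group is a hypothetical column name; assigning to a new column keeps
# the numeric gc$amount available for the K-NN tasks that follow.
gc$amount.group <- cut(gc$amount,
                       breaks = c(0, 2500, 5000, Inf),
                       labels = c("0-2500", "2501-5000", "5000+"))
table(gc$amount.group)
```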

Task-6:  Summary of All Selected Variables

  • summary(gc$duration)
  • summary(gc$amount)
  • summary(gc$installment)
  • summary(gc$age)
  • summary(gc$history)
  • summary(gc$purpose)
  • summary(gc$housing)
  • summary(gc$rent)

Task-7:  Select and Plot specific variables for this discussion

  • ## cut the dataset to the selected variables:
  • ## (duration, amount, installment, and age) which are numeric,
  • ## (history, purpose, and housing) which are categorical, and
  • ## Default (representing the risk) which is binary
  • gc.sv <- gc[,c("Default", "duration", "amount", "installment", "age", "history", "purpose", "foreign", "housing")]
  • gc.sv[1:3,]
  • summary(gc.sv)
  • ## setting the rent indicator (A151 = rented housing)
  • gc$rent <- factor(gc$housing=="A151")
  • summary(gc$rent)
  • plot(gc.sv, col="blue")

Figure 3:  Plot of The Selected Variables.

Task-8:  Create Design Matrix

  • ## create a design matrix:
  • ## factor variables are turned into indicator variables;
  • ## the first column of ones is omitted
  • Xgc <- model.matrix(Default~., data=gc.sv)[,-1]
  • Xgc[1:3,]
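To see what model.matrix() does to a factor, a tiny self-contained illustration (with a hypothetical mini data frame, not the credit data) can help: each factor with L levels is expanded into L-1 indicator columns, and [,-1] drops the intercept column of ones.

```r
# Toy illustration of model.matrix() expanding a factor into 0/1 indicators.
# d is a hypothetical three-row data frame, not part of the credit dataset.
d <- data.frame(y       = c(0, 1, 1),
                housing = factor(c("A151", "A152", "A151")),
                amount  = c(1000, 2500, 4000))
# housing (2 levels) becomes one indicator column (housingA152);
# the numeric amount column passes through unchanged.
model.matrix(y ~ ., data = d)[,-1]
```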

Task-9: Create Training and Prediction Datasets

  • ## creating training and prediction datasets
  • ## select 900 rows for estimation and 100 for testing
  • set.seed(1)
  • train <- sample(1:1000,900)
  • xtrain <- Xgc[train,]
  • xnew <- Xgc[-train,]
  • ytrain <- gc$Default[train]
  • ynew <- gc$Default[-train]

Task-10:  K-Nearest Neighbor Method

  • ## k-nearest neighbor method
  • library(class)
  • nearest1 <- knn(train=xtrain, test=xnew, cl=ytrain, k=1)
  • nearest3 <- knn(train=xtrain, test=xnew, cl=ytrain, k=3)
  • data.frame(ynew,nearest1,nearest3)[1:10,]
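Because knn() measures Euclidean distance, variables on large scales (amount, in the thousands) can dominate duration and installment.  A common precaution, not shown in the original code, is to standardize the training columns and then apply the same centering and scaling to the test rows; this is a hedged sketch assuming xtrain, xnew, and ytrain from Task-9.

```r
library(class)  # provides knn()

# Standardize using the training-set means and SDs only, then apply the
# identical transformation to the held-out rows (no information leakage).
xtrain.s <- scale(xtrain)
xnew.s   <- scale(xnew,
                  center = attr(xtrain.s, "scaled:center"),
                  scale  = attr(xtrain.s, "scaled:scale"))
nearest3.s <- knn(train = xtrain.s, test = xnew.s, cl = ytrain, k = 3)
```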

Task-11: Calculate the Proportion of Correct Classification

  • ## calculate the proportion of correct classifications
  • ## (the prediction set holds 100 observations, hence the divisor)
  • proportion.correct.class1 <- 100*sum(ynew==nearest1)/100
  • proportion.correct.class3 <- 100*sum(ynew==nearest3)/100
  • proportion.correct.class1
  • proportion.correct.class3
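The overall proportion correct hides which kind of error the classifier makes.  A confusion matrix, built with base R's table(), separates false positives from false negatives; this sketch assumes ynew and nearest3 from Tasks 9 and 10.

```r
# Rows: actual Default; columns: 3-NN prediction on the 100 test rows.
conf3 <- table(actual = ynew, predicted = nearest3)
conf3
# Overall accuracy is the sum of the diagonal over the number of test cases.
sum(diag(conf3)) / sum(conf3)
```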

Task-12: Plot for 3 Nearest Neighbors

  • ## plot for 3-NN
  • plot(xtrain[,c("amount","duration")],
  • col=c(4,3,6,2)[gc[train,"installment"]],
  • pch=c(1,2)[as.numeric(ytrain)],
  • main="Predicted Default, by 3 Nearest Neighbors", xlab="Amount", ylab="Duration", cex.main=.95)
  • points(xnew[,c("amount","duration")],
  • bg=c(4,3,6,2)[gc[-train,"installment"]],
  • pch=c(21,24)[as.numeric(nearest3)], cex=1.2, col=grey(.7))
  • legend("bottomright", pch=c(1,16,2,17), bg=c(1,1,1,1),
  • legend=c("data 0","pred 0","data 1","pred 1"),
  • title="default", bty="n", cex=.8)
  • legend("topleft", fill=c(4,3,6,2), legend=c(1,2,3,4),
  • title="installment %", horiz=TRUE, bty="n", col=grey(.7), cex=.8)

Figure 4:  Predicted Default by 3 Nearest Neighbors.

Task-13: Cross-Validation with k=5 for the nearest neighbor

  • ## the above was for just one training set;
  • ## knn.cv() performs leave-one-out cross-validation on the full data
  • proportion.corr <- numeric(10)
  • for (k in 1:10) {
  •   prediction <- knn.cv(Xgc, cl=gc$Default, k)
  •   proportion.corr[k] <- 100*sum(gc$Default==prediction)/1000
  • }
  • proportion.corr
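Once proportion.corr holds the leave-one-out accuracy for each candidate k, the best-performing k can be read off directly; this is a short sketch assuming the loop above has already run.

```r
# which.max() returns the first k with the highest cross-validated accuracy.
best.k <- which.max(proportion.corr)
best.k                     # the chosen number of neighbors
proportion.corr[best.k]    # its leave-one-out percent correct
```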

Task-14: Discussion and Analysis

The descriptive analysis shows that the average duration (Mean=20.9) is higher than the median (Median=18), indicating a positively skewed distribution.  The average amount (Mean=3271) is likewise higher than the median (Median=2320), again indicating positive skew.  The average installment rate (Mean=2.97) is slightly below the median (Median=3.00), indicating a small negative skew.  The results also show that radio/TV ranks first among loan purposes, followed by new cars.  A training dataset of 900 rows is selected for estimation, leaving 100 rows for testing.  The K-NN method is applied with k=1 and k=3, and the proportion of correct classifications is 60% and 61% for k=1 and k=3, respectively (Figure 4 plots the 3-NN predictions).  Leave-one-out cross-validation for the nearest-neighbor classifier with k=5 classifies about 65% of the outcomes correctly.

References

Ahlemeyer-Stubbe, A., & Coleman, S. (2014). A practical guide to data mining for business and industry: John Wiley & Sons.

Fischetti, T., Mayor, E., & Forte, R. M. (2017). R: Predictive Analysis: Packt Publishing.

Ledolter, J. (2013). Data mining and business analytics with R: John Wiley & Sons.

r-project.org. (2018). R: A Language and Environment for Statistical Computing. Retrieved from https://cran.r-project.org/doc/manuals/r-release/fullrefman.pdf.
