Quantitative Analysis of “Prostate Cancer” Dataset Using R-Programming

Dr. O. Aly
Computer Science

Introduction

The purpose of this discussion is to use the prostate cancer dataset available in R, in which biopsy results are given for 97 men.  The goal is to predict tumor spread, measured in this dataset as the log of cancer volume, for these 97 men who underwent a biopsy.  The measures used for prediction are BPH, PSA, the Gleason score, capsular penetration (CP), and the patient's age.  The predicted tumor size affects the treatment options for the patients, which can include chemotherapy, radiation treatment, and surgical removal of the prostate.

The dataset “prostate.cancer.csv” was downloaded from the CTU course learning materials.  The dataset has 97 observations (patients) on six variables.  The response variable is the log of cancer volume (lcavol).  This assignment predicts this variable (lcavol) from five covariates (age; the logarithms of BPH, CP, and PSA; and the Gleason score) using a decision tree.  The response variable is a continuous measurement, so the sum of squared residuals is used as the impurity (fitting) criterion in this analysis.

This assignment discusses and addresses the tasks shown below.

Various resources were utilized to develop the required code in R. These resources include (Ahlemeyer-Stubbe & Coleman, 2014; Fischetti, Mayor, & Forte, 2017; Ledolter, 2013; r-project.org, 2018).

Task-1:  Understand the Variables of the Data Sets

The purpose of this task is to understand the variables of the dataset. The dataset has 97 observations (patients) with six variables. The response variable for prediction is lcavol, and the five covariates (age; the logarithms of BPH, CP, and PSA; and the Gleason score) will be used to predict it with a decision tree.  The response variable is a continuous measurement.  Table 1 summarizes these variables, including the response variable (lcavol).

Table 1:  Prostate Cancer Variables.

  Variable   Description
  lcavol     Log of the cancer volume (response variable)
  age        Age of the patient (years)
  lbph       Log of the amount of benign prostatic hyperplasia
  lcp        Log of capsular penetration
  gleason    Gleason score
  lpsa       Log of the prostate specific antigen (PSA) level

Task-2:  Load and Review the Dataset using names(), head(), dim() functions

  • pc <- read.csv("C:/CS871/prostate.cancer.csv")   ## load the dataset
  • pc
  • dim(pc)          ## 97 observations on 6 variables
  • names(pc)        ## variable names
  • head(pc)         ## first six rows
  • pc <- data.frame(pc)
  • head(pc)
  • str(pc)          ## structure of the variables
  • summary(pc)      ## descriptive statistics
  • plot(pc, col="blue", main="Plot of Prostate Cancer")

Figure 1. Plot of Prostate Variables.

Task-3:  Distribution of Prostate Cancer Variables.

  • #### Distribution of Prostate Cancer Variables
  • ### These are the variable names
  • colnames(pc)
  • ## Set up the plotting grid and margins
  • par(mfrow=c(3,3), mar=c(4,4,2,0.5))
  • for (j in 1:ncol(pc))
  • {
  •   hist(pc[,j], xlab=colnames(pc)[j],
  •        main=paste("Histogram of", colnames(pc)[j]),
  •        col="blue", breaks=20)
  • }
  • ## Individual histograms of the variables
  • hist(pc$lcavol, col="orange")
  • hist(pc$age, col="orange")
  • hist(pc$lbph, col="orange")
  • hist(pc$lcp, col="orange")
  • hist(pc$gleason, col="orange")
  • hist(pc$lpsa, col="orange")

Figure 2.  Distribution of Prostate Cancer Variables.

Task-4:  Correlation Among Prostate Variables

  • ## Correlations between prostate cancer variables
  • pc.cor = cor(pc)
  • pc.cor <- round(pc.cor, 3)
  • summary(pc)
  • ## Zero out the lower triangle and diagonal so each pair is counted once
  • pc.cor[lower.tri(pc.cor, diag=TRUE)] = 0
  • pc.cor.sorted = sort(abs(pc.cor), decreasing=T)
  • pc.cor.sorted[1]
  • ## [1] 0.871
  • ## Use arrayInd() to recover the variable pair behind each correlation
  • vars.big.cor = arrayInd(which(abs(pc.cor)==pc.cor.sorted[1]), dim(pc.cor))
  • colnames(pc)[vars.big.cor]
  • ## [1] "lcp"     "gleason"
  • pc.cor.sorted[2]
  • ## [1] 0.812
  • vars.big.cor = arrayInd(which(abs(pc.cor)==pc.cor.sorted[2]), dim(pc.cor))
  • colnames(pc)[vars.big.cor]
  • ## [1] "lbph" "lcp"

Task-5:  Visualization of the Relationship and Correlation between the Cancer Spread (lcavol) and other Variables.

  • ## Visualizing relationships among variables
  • plot(lcavol~age, data=pc, col="red", main="Relationship of Age on the Cancer Volume (lcavol)")
  • plot(lcavol~lbph, data=pc, col="red", main="Relationship of Amount of Benign Prostatic Hyperplasia (lbph) on the Cancer Volume (lcavol)")
  • plot(lcavol~lcp, data=pc, col="red", main="Relationship of Capsular Penetration (lcp) on the Cancer Volume (lcavol)")
  • plot(lcavol~gleason, data=pc, col="red", main="Relationship of Gleason Score (gleason) on the Cancer Volume (lcavol)")
  • plot(lcavol~lpsa, data=pc, col="red", main="Relationship of Prostate Specific Antigen (lpsa) on the Cancer Volume (lcavol)")

Figure 3.  Plot of Correlation Among Prostate Variables Using plot() Function.

  • ## Correlation among the variables using the corrplot() function
  • install.packages("ElemStatLearn")
  • library(ElemStatLearn)    ## it contains the data
  • install.packages("car")
  • library(car)              ## package to calculate variance inflation factors
  • install.packages("corrplot")
  • library(corrplot)         ## correlation plots
  • install.packages("leaps")
  • library(leaps)            ## best subsets regression
  • install.packages("glmnet")
  • library(glmnet)           ## allows ridge regression, LASSO, and elastic net
  • install.packages("caret")
  • library(caret)            ## parameter tuning
  • pc.cor = cor(pc)
  • corrplot.mixed(pc.cor)

Figure 4.  Plot of Correlation Among Prostate Variables Using corrplot() Function.

Task-6:  Build a Decision Tree for Prediction

  • ## Building a Decision Tree for Prediction
  • install.packages("tree")
  • library(tree)
  • ## Construct the initial tree
  • pctree <- tree(lcavol ~ ., data=pc, mindev=0.1, mincut=1)
  • pctree <- tree(lcavol ~ ., data=pc, mincut=1)   ## refit with the default mindev
  • pctree
  • plot(pctree, col=4)
  • text(pctree, digits=2)
  • ## Prune with increasing cost-complexity penalties k
  • pccut <- prune.tree(pctree, k=1.7)
  • plot(pccut, col="red", main="Pruning Using k=1.7")
  • pccut
  • text(pccut, digits=2)
  • pccut <- prune.tree(pccut, k=2.05)
  • plot(pccut, col="darkgreen", main="Pruning Using k=2.05")
  • pccut
  • text(pccut, digits=2)
  • pccut <- prune.tree(pctree, k=3)
  • plot(pccut, col="orange", main="Pruning Using k=3")
  • pccut
  • text(pccut, digits=2)
  • ## Full pruning sequence, then the final tree with three leaves
  • pccut <- prune.tree(pctree)
  • pccut
  • plot(pccut, col="blue", main="Decision Tree Pruning")
  • pccut <- prune.tree(pctree, best=3)
  • pccut
  • plot(pccut, col="orange", main="Pruning Using best=3")
  • text(pccut, digits=2)

Figure 5:  Initial Tree Development.

Figure 6:  First Pruning with α=1.7.

Figure 7:  Second Pruning with α=2.05.

Figure 8:  Third Pruning with α=3.

Figure 9:  Plot of the Decision Tree Pruning.

Figure 10.  Plot of the Final Tree.

Task-7:  Cross-Validation

  • ## Use cross-validation to prune the tree
  • set.seed(2)
  • cvpc <- cv.tree(pctree, K=10)   ## 10-fold cross-validation
  • cvpc$size
  • cvpc$dev
  • plot(cvpc, pch=21, bg=8, type="p", cex=1.5, ylim=c(65,100), col="blue")
  • pccut <- prune.tree(pctree, best=3)
  • pccut
  • plot(pccut, col="red")
  • text(pccut)
  • ## Final plot: partition of the (lcp, lpsa) space; circle size reflects lcavol
  • plot(pc[,c("lcp","lpsa")], col="red", cex=0.2*exp(pc$lcavol))
  • abline(v=.261624, col=4, lwd=2)
  • lines(x=c(-2,.261624), y=c(2.30257,2.30257), col=4, lwd=2)

Figure 13.  Plot of the Cross-Validation Deviance.

Figure 14.  Plot of the Final Classification Tree.

Figure 15.  Plots of Cross-Validation.

Task-8:  Discussion and Analysis

The classification and regression tree (CART) is a nonparametric technique that generalizes parametric regression models (Ledolter, 2013).  It allows for non-linearity and variable interactions without the need to specify the structure in advance. Furthermore, the constant-variance assumption, which is critical in regression models, is not critical in this technique (Ledolter, 2013).

The descriptive statistics show that lcavol has a mean of 1.35, which is less than the median of 1.45, indicating a negatively skewed distribution, with a minimum of -1.35 and a maximum of 2.8. The age of the prostate cancer patients has an average of 64 years, with a minimum of 41 and a maximum of 79 years.  The lbph has an average of 0.1004, which is less than the median of 0.300, indicating a similarly negatively skewed distribution, with a minimum of -1.39 and a maximum of 2.33.  The lcp has an average of -0.18, which is higher than the median of -0.79, indicating a positively skewed distribution, with a minimum of -1.39 and a maximum of 2.9.  The Gleason measure has a mean of 6.8, which is slightly less than the median of 7, indicating a slightly negatively skewed distribution, with a minimum of 6 and a maximum of 9.  The last variable, lpsa, has an average of 2.48, which is slightly less than the median of 2.59, indicating a slightly negatively skewed distribution, with a minimum of -0.43 and a maximum of 5.58. The results show a positive correlation between lpsa and lcavol, and between lcp and lcavol as well.  The results also show that lcavol increases for ages between 60 and 70.
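
The mean-versus-median comparisons above can be reproduced directly; the following is a minimal sketch, assuming the data frame pc loaded in Task-2.

  • ## Sketch: compare the mean and median of every variable to gauge skewness
  • apply(pc, 2, function(x) round(c(mean = mean(x), median = median(x)), 3))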

Furthermore, the results show that the Gleason score takes integer values of 6 and larger. The histogram of lpsa shows that the log PSA score is close to normally distributed.  The correlation result in Task-4 is not surprising: patients whose Gleason score is high now likely also have a history of high Gleason scores.  The results also show that lcavol should be included as a predictor in any prediction of lpsa.
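
A quick tabulation confirms the integer-valued Gleason scores; this is a minimal sketch, again assuming the data frame pc from Task-2.

  • ## Sketch: tabulate the Gleason scores (integer values of 6 and larger)
  • table(pc$gleason)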

As illustrated in Figure 4, the results show that PSA is highly correlated with the log of cancer volume (lcavol) and appears to have a strongly linear relationship with it.  The results also show that multicollinearity may become an issue; for example, cancer volume is also correlated with capsular penetration, which in turn is correlated with seminal vesicle invasion.
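
One way to quantify this concern is with variance inflation factors. The sketch below is illustrative only; it assumes the car package loaded in Task-5, and values above roughly 5 to 10 are commonly taken to signal multicollinearity.

  • ## Sketch: variance inflation factors for a linear model of lcavol on the covariates
  • pc.lm <- lm(lcavol ~ age + lbph + lcp + gleason + lpsa, data = pc)
  • vif(pc.lm)   ## car::vif; large values would flag multicollinearity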

For the implementation of the tree, the initial tree has 12 leaf nodes, so the size of the tree is 12, as illustrated in Figure 5.  The root shows the 97 cases with a deviance of 133.4. Node 1 is the root; Node 2 has lcp < 0.26 with 63 patients and a deviance of 64.11.  Node 3 has lcp > 0.26 with 34 cases and a deviance of 13.39.  Node 4 has lpsa < 2.30 with 35 cases and a deviance of 24.72. Node 5 has lpsa > 2.30 with 28 cases and a deviance of 18.6. Node 6 has lcp < 2.14 with 25 cases and a deviance of 6.662.  Node 7 has lcp > 2.139 with 9 cases and a deviance of 1.48. Node 8 has lpsa < 0.11 with 4 cases and a deviance of 0.3311, while Node 9 has lpsa > 0.11 with 31 cases and a deviance of 18.92, split further into age < 52 with a deviance of 0.12 and age > 52 with a deviance of 13.88. Node 10 has lpsa < 3.25 with 23 cases and a deviance of 11.61, while Node 11 has lcp > 3.25 with 5 cases and a deviance of 1.76.  Node 12 is for age < 62 with 7 cases and a deviance of 0.73.
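
The root deviance reported above is simply the total sum of squared residuals about the overall mean of lcavol; the following minimal sketch, assuming the data frame pc from Task-2, reproduces it.

  • ## Sketch: root-node deviance = sum of squared residuals about the mean of lcavol
  • sum((pc$lcavol - mean(pc$lcavol))^2)   ## should be close to the 133.4 reported above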

The first pruning process, using α=1.7, did not produce a tree different from the initial one; it still resulted in 12 nodes.  The second pruning, with α=2.05, simplified the tree to eight nodes.  The root shows the same result of 97 cases with a deviance of 133.4.  Node 1 has lcp < 0.26 with a deviance of 64.11, while Node 2 has lcp > 0.26 with a deviance of 13.39.  The third pruning, using α=3, further simplified the tree, as shown in Figure 8.  The final tree has a root with four nodes: Node 1 for lcp < 0.26 and Node 2 for lcp > 0.26; Node 3 has lpsa < 2.30, while Node 4 reflects lpsa > 2.30.  With regard to prediction, a patient with lcp = 0.20 (categorized in Node 2) and lpsa = 2.40 (categorized in Node 4) is predicted to have a log cancer volume (lcavol) of 1.20.
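
The same prediction can be obtained programmatically from the pruned tree. The sketch below is illustrative only: lcp and lpsa are taken from the example above, while the values of age, lbph, and gleason for this hypothetical patient are assumed (the final tree splits only on lcp and lpsa, so they do not affect the result).

  • ## Sketch: prediction for a hypothetical patient using the pruned tree (pccut)
  • ## lcp and lpsa come from the example above; the other values are assumed
  • new.patient <- data.frame(age = 65, lbph = 0.3, lcp = 0.20, gleason = 7, lpsa = 2.40)
  • predict(pccut, newdata = new.patient)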

The biggest challenge for the CART model, which is described as flexible in comparison to regression models, is overfitting (Giudici, 2005; Ledolter, 2013).  If the splitting algorithm is not stopped, the tree algorithm can ultimately extract all information from the data, including information that is not and cannot be predicted in the population with the current set of predictors, thereby fitting random noise (Ledolter, 2013).   However, when subsequent splits add minimal improvement to the prediction, stopping the generation of new split nodes can be used as a defense against overfitting.  Thus, if 90% of all cases can be predicted correctly from 10 splits and 90.1% from 11 splits, there is no need to add the 11th split to the tree, as it adds little value (only 0.1%). There are various techniques to stop the splitting process.  The basic constraints (mincut, mindev) lead to a full tree fit with a certain number of terminal nodes.  In this prostate analysis, mincut=1 is used, which is the minimum number of observations to include in a child node, and a tree of size 12 is obtained.
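
The effect of these constraints can be explored by refitting the tree with stricter values; the sketch below is illustrative only, it assumes the tree package and data frame pc already loaded, and the particular values of mindev and mincut are assumptions chosen for contrast.

  • ## Sketch: stricter stopping constraints yield a smaller tree
  • pctree.small <- tree(lcavol ~ ., data = pc, mindev = 0.05, mincut = 5)
  • summary(pctree.small)   ## typically reports fewer terminal nodes than the size-12 tree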

Since the tree-building is stopped, as illustrated in Figure 10, cross-validation is used to evaluate the quality of the prediction of the current tree.  Cross-validation subjects the tree computed from one set of observations (the training sample) to another, independent set of observations (the test sample). If most or all of the splits determined by the analysis of the training sample are based on random noise, the prediction for the test sample will be poor.  The cross-validation (CV) cost is the averaged error rate for a particular tree size.  The tree size that produces the minimum CV cost is found, and the reference tree is then pruned back to the number of nodes matching that size.  Pruning is implemented in a stepwise, bottom-up manner, by removing the least important nodes during each pruning cycle. The v-fold CV is implemented with the R command cv.tree(). The graph of the CV deviance in Figure 13 indicates that, for the prostate example, a tree of size 3 is appropriate.  Thus, the reference tree, which was obtained from all the data, is pruned back to size 3. CV chooses capsular penetration and PSA as the decision variables.  The effect of capsular penetration on the response, log volume (lcavol), depends on PSA. The final graph in Figure 15 shows that CART divides the space of the explanatory variables into rectangles, with each rectangle leading to a different prediction. The size of the circles of the data points in the respective rectangles reflects the magnitude of the response.  Figure 15 confirms that the tree splits are quite reasonable.
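
The size with the minimum CV deviance can also be selected programmatically rather than read off the plot; the following is a minimal sketch, assuming the cvpc object produced in Task-7.

  • ## Sketch: pick the tree size with the smallest cross-validated deviance
  • best.size <- cvpc$size[which.min(cvpc$dev)]
  • best.size                                   ## expected to be 3 for the prostate example
  • pc.final <- prune.tree(pctree, best = best.size)
  • plot(pc.final); text(pc.final, digits = 2)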

References

Ahlemeyer-Stubbe, A., & Coleman, S. (2014). A practical guide to data mining for business and industry. John Wiley & Sons.

Fischetti, T., Mayor, E., & Forte, R. M. (2017). R: Predictive analysis. Packt Publishing.

Giudici, P. (2005). Applied data mining: Statistical methods for business and industry. John Wiley & Sons.

Ledolter, J. (2013). Data mining and business analytics with R. John Wiley & Sons.

r-project.org. (2018). R: A language and environment for statistical computing. Retrieved from https://cran.r-project.org/doc/manuals/r-release/fullrefman.pdf