Dr. O. Aly
Computer Science
Introduction
The purpose of this discussion is to analyze Decision Trees, with a comparison of Classification and Regression Decision Trees. The discussion also addresses the advantages and disadvantages of Decision Trees. The focus is on the Classification and Regression Tree (CART) algorithm and the statistical criteria it uses. The discussion begins with a brief overview of classification, followed by related topics, and ends with a sample Decision Tree for deciding whether or not to take an umbrella.
Classification
Classification is a fundamental data mining technique (EMC, 2015). Most classification methods are supervised, in that they start with a training set of pre-labeled observations and learn how the attributes of those observations are likely to contribute to the classification of future unlabeled observations (EMC, 2015). For instance, marketing, sales, and customer demographic data can be used to develop a classifier that assigns a "purchase" or "no purchase" label to potential future customers (EMC, 2015). Classification is widely used for prediction purposes (EMC, 2015). Logistic Regression is one of the popular classification methods (EMC, 2015). Classification can also help health care professionals diagnose diseases such as heart disease (EMC, 2015). There are two fundamental classification methods: Decision Trees and Naïve Bayes. This discussion focuses on Decision Trees.
The Tree Models vs. Linear & Logistic Regression Models
The tree models are distinguished from the Linear and Logistic Regression models. Tree models first classify observations into groups and then obtain a score for each group, whereas the Linear and Logistic Regression methods first produce a score and then, possibly, a classification based on a discriminant rule (Giudici, 2005).
Regression Trees vs. Classification Trees
The tree models are divided into regression trees and classification trees (Giudici, 2005). Regression trees are used when the response variable is continuous, while classification trees are used when the response variable is quantitative discrete or qualitative (categorical) (Giudici, 2005). A tree model can be defined as a recursive process through which a set of n statistical units is progressively divided into groups, according to a division rule that aims to increase a homogeneity or purity measure of the response variable within each of the obtained groups (Giudici, 2005). At each step of the procedure, the division rule is specified by the choice of an explanatory variable and of a splitting rule on its levels, which partitions the observations (Giudici, 2005). The final partition of the observations is the main result of a tree model, and to achieve it, a stopping criterion for the division process must be specified (Giudici, 2005).
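As a concrete illustration of the two types of trees, the sketch below fits a regression tree to a continuous response and a classification tree to a categorical response. It uses scikit-learn's CART-style implementation in Python rather than the R tools of the cited texts, and the data and parameter values are invented purely for illustration.

```python
# Contrast of a regression tree (continuous response) with a classification
# tree (categorical response); data are simulated purely for illustration.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))            # two explanatory variables

# Continuous response -> regression tree
y_continuous = 3.0 * X[:, 0] + rng.normal(0, 1, size=200)
reg_tree = DecisionTreeRegressor(max_depth=3).fit(X, y_continuous)

# Categorical response -> classification tree
y_class = (X[:, 0] + X[:, 1] > 10).astype(int)   # two classes: 0 and 1
clf_tree = DecisionTreeClassifier(max_depth=3).fit(X, y_class)

print(reg_tree.predict(X[:3]))                   # fitted numeric values
print(clf_tree.predict(X[:3]))                   # fitted class labels
```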
In a classification tree, fitted values are given in terms of the fitted probabilities of affiliation to a single group (Giudici, 2005). A discriminant rule can be derived at each leaf of the tree (Giudici, 2005). A commonly used rule, called the "majority rule," classifies all observations belonging to a terminal node in the class corresponding to its most frequent level (Giudici, 2005). Other "voting" schemes can also be implemented, but in the absence of other considerations this rule is the most reasonable (Giudici, 2005). Thus, each leaf defines a clear allocation rule for the observations, which is read by following the path that connects the initial node to that leaf; every path in the tree model therefore represents a classification rule (Giudici, 2005).
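The majority rule itself is simple enough to express directly; the following is a minimal sketch with invented labels, not code from the cited sources.

```python
# Majority rule: a terminal node is assigned the most frequent class
# among the observations that fall into it.
from collections import Counter

def majority_rule(leaf_labels):
    return Counter(leaf_labels).most_common(1)[0][0]

print(majority_rule(["purchase", "no purchase", "purchase"]))  # -> "purchase"
```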
Compared with other discriminant models, tree models produce rules that are less explicit analytically but easier to understand graphically (Giudici, 2005). Tree models can be regarded as nonparametric predictive models, as they do not require assumptions about the probability distribution of the response variable (Giudici, 2005). This flexibility means that tree models are generally applicable, whatever the nature of the dependent and explanatory variables (Giudici, 2005). However, this flexibility comes at the cost of a higher demand for computational resources; moreover, their sequential nature and algorithmic complexity make them dependent on the observed data, so that even a small change in the data might alter the structure of the tree (Giudici, 2005). It is therefore difficult to take a tree structure designed for one context and generalize it to other contexts (Giudici, 2005).
The Classification Tree Analysis vs. The Hierarchical Cluster Analysis
The classification tree analysis is distinguished from the hierarchical cluster analysis despite their graphical similarities (Giudici, 2005). Classification trees are predictive rather than descriptive: while the hierarchical cluster analysis performs an unsupervised classification of the observations based on all available variables, the classification trees perform a classification of the observations based on all the explanatory variables, supervised by the presence of the response (target) variable (Giudici, 2005). The second critical difference between the hierarchical cluster analysis and the classification tree analysis concerns the partition rule. While in classification trees the segmentation is typically carried out using only one explanatory variable at a time, in hierarchical clustering the divisive or agglomerative rule between groups is established based on considerations of the distance between them, calculated using all the available variables (Giudici, 2005).
Decision Trees Algorithms
The goal of Decision Trees is to extract from the training data the succession of decisions about the attributes that best explains class (group) membership (Fischetti, Mayor, & Forte, 2017). A Decision Tree has a root node, which corresponds to the attribute whose split best explains the outcome (Fischetti et al., 2017). The dataset is partitioned into branches by this attribute (Fischetti et al., 2017). The branches lead to other nodes, which correspond to the next best partition for the considered branch (Fischetti et al., 2017). The process continues until terminal nodes are reached, where no more partitioning is required (Fischetti et al., 2017). Decision Trees allow class predictions (group membership) for previously unseen observations (testing or prediction datasets) using statistical criteria learned from the seen data (training dataset) (Fischetti et al., 2017); a brief sketch of this root/branch/leaf structure follows the list below. Commonly used algorithms, each relying on its own statistical criterion, include:
- ID3
- C4.5
- Random Forest
- Conditional Inference Trees
- Classification and Regression Trees (CART)
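To make the root/branch/leaf terminology concrete, the short sketch below prints the structure of a small tree; the feature names and toy data are invented, and scikit-learn is used only as a convenient stand-in for the algorithms listed above.

```python
# Print the structure of a small decision tree: the first split is the root,
# indented splits are branches, and "class: ..." lines are terminal nodes.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[25, 0], [30, 1], [45, 1], [50, 0], [35, 1], [60, 0]]    # [age, prior_purchase]
y = ["no purchase", "purchase", "purchase", "no purchase", "purchase", "no purchase"]

tree = DecisionTreeClassifier(criterion="gini", max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["age", "prior_purchase"]))
```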
The most used algorithm in the statistical community is the CART algorithm, while C4.5 and its latest version, C5.0, are widely used by computer scientists (Giudici, 2005). The first versions of C4.5 and C5.0 were limited to categorical predictors, but the most recent versions are similar to CART (Giudici, 2005).
Classification and Regression Trees (CART)
CART is often used as a generic acronym for decision tree models, although it is a specific implementation of tree models (EMC, 2015). Like C4.5, CART can handle continuous attributes (EMC, 2015). While C4.5 uses entropy-based criteria to rank tests, CART uses the Gini diversity index defined in equation (1) (EMC, 2015; Fischetti et al., 2017):

Gini(t) = 1 - \sum_{i=1}^{k} p_i^2,   (1)

where p_i is the proportion of observations of class i at node t and k is the number of classes.
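A direct translation of equation (1) into code may help make the index concrete; this is a minimal sketch, not drawn from the cited texts, where p holds the class proportions at a node.

```python
# Gini diversity index of a node, given its class proportions p (summing to 1).
def gini_index(p):
    return 1.0 - sum(pi ** 2 for pi in p)

print(gini_index([0.5, 0.5]))   # 0.5: the most impure two-class node
print(gini_index([1.0, 0.0]))   # 0.0: a pure node
```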
Moreover, while C4.5 uses stopping rules, CART constructs a sequence of subtrees, uses cross-validation to estimate the misclassification cost of each subtree, and chooses the one with the lowest cost (EMC, 2015; Hand, Mannila, & Smyth, 2001). CART is a powerful nonparametric technique which generalizes parametric regression models (Ledolter, 2013). It allows nonlinearity and variable interactions without having to specify the structure in advance (Ledolter, 2013). It operates by choosing the best variable for splitting the data into two groups at the root node (Hand et al., 2001). It builds the tree using a single variable at a time and can readily deal with large numbers of variables (Hand et al., 2001). It uses different statistical criteria to decide on tree splits (Fischetti et al., 2017). There are some differences between CART used for classification and the other algorithms in the family. In CART, the attribute to partition on is selected using the Gini index as the decision criterion (Fischetti et al., 2017). This method is described as more efficient than the information gain and the information gain ratio (Fischetti et al., 2017). CART performs the necessary partitioning on the modalities of the attribute and can merge modalities for the partition, such as modality A versus modalities B and C (Fischetti et al., 2017). CART can also predict a numeric outcome: in the case of regression trees, CART performs regression and builds the tree in a way that minimizes the squared residuals (Fischetti et al., 2017).
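The following sketch illustrates, under simplifying assumptions and with invented data, how a CART-style root split on a single continuous attribute could be chosen: candidate thresholds are tried and the one yielding the lowest weighted Gini impurity of the two resulting groups is kept.

```python
# Choose a root split on one continuous attribute by minimizing the
# weighted Gini impurity of the two groups it produces.
def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(x, y):
    best = (None, float("inf"))
    for t in sorted(set(x)):
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        if not left or not right:
            continue
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best[1]:
            best = (t, score)
    return best   # (threshold, weighted Gini impurity)

x = [2.1, 3.5, 4.0, 6.8, 7.2, 8.9]             # one continuous attribute
y = ["no", "no", "no", "yes", "yes", "yes"]    # class labels
print(best_split(x, y))                        # (4.0, 0.0): a pure split
```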
CART Algorithms of Division Criteria and Pruning
There are two critical aspects of the CART algorithm: the division criteria and pruning, which can be employed to reduce the complexity of a tree (Giudici, 2005). Concerning the division criteria, the essential element of a tree model is the choice of the division rule for the units belonging to a group, corresponding to a node of the tree (Giudici, 2005). Choosing a division rule means selecting a predictor from those available and selecting the best partition of its levels (Giudici, 2005). The selection is generally made using a goodness measure of the corresponding division rule, which allows the rule that maximizes the goodness measure to be determined at each stage of the procedure (Giudici, 2005).
The impurity concept refers to a measure of the variability of the response values of the observations at a node (Giudici, 2005). In a regression tree, a node is pure if it has null variance (all observations are equal) and impure if the variance of the observations is high (Giudici, 2005). For regression trees, the impurity therefore corresponds to the variance, while for classification trees alternative impurity measures are considered, such as the misclassification impurity, the Gini impurity, and the entropy impurity, which are also used for tree assessment (Giudici, 2005).
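The impurity measures named above can be written down directly; the sketch below is illustrative only, using invented class proportions and response values.

```python
# Impurity of a node: misclassification, Gini, and entropy impurity for a
# classification node with class proportions p, and variance for a
# regression node with response values y.
import numpy as np

def misclassification_impurity(p):
    return 1.0 - max(p)                          # 1 minus the modal proportion

def gini_impurity(p):
    return 1.0 - sum(pi ** 2 for pi in p)

def entropy_impurity(p):
    return -sum(pi * np.log2(pi) for pi in p if pi > 0)

def regression_impurity(y):
    return np.var(y)                             # variance of the responses

p = [0.7, 0.2, 0.1]                              # an impure three-class node
print(misclassification_impurity(p), gini_impurity(p), entropy_impurity(p))
print(regression_impurity([5.0, 5.0, 5.0]))      # a pure regression node -> 0.0
```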
When no stopping criterion is imposed, a tree model can grow until each node contains observations that are identical in terms of the values or levels of the dependent variable (Giudici, 2005). Such a tree does not provide a parsimonious segmentation (Giudici, 2005). Thus, it is critical to stop the growth of the tree at a reasonable size (Giudici, 2005). The ideal tree configuration is both parsimonious and accurate (Giudici, 2005). Parsimony means that the tree has a small number of leaves, so the predictive rule can be easily interpreted, whereas accuracy means that the leaves are as pure as possible, which may require a larger number of leaves (Giudici, 2005). There are two opposing techniques tree algorithms can employ for the final choice. The first uses stopping rules based on thresholds on the number of leaves or on the maximum number of steps in the process, whereas the second introduces probabilistic assumptions on the variables, allowing the use of suitable statistical tests (Giudici, 2005). In the absence of probabilistic assumptions, the growth is stopped when the decrease in impurity is too small (Giudici, 2005). The result of a tree model can be influenced by the choice of the stopping rule (Giudici, 2005).
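Stopping rules of the threshold type described above can be expressed, for example, through the pre-pruning parameters of scikit-learn's tree implementation; the parameter values below are illustrative assumptions, not recommendations from the cited sources.

```python
# Pre-pruning (stopping rules): bound the number of leaves, the depth, and the
# minimum impurity decrease required for a further split.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(
    max_leaf_nodes=8,             # threshold on the number of leaves
    max_depth=4,                  # threshold on the number of splitting steps
    min_impurity_decrease=0.01,   # stop when the impurity decrease is too small
).fit(X, y)
print(tree.get_n_leaves())        # the stopping rules bound the tree's size
```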
The CART method utilizes a strategy different from stepwise stopping criteria; it is based on the concept of pruning. The tree is first grown to its greatest size and is then "trimmed" or "pruned" according to a cost-complexity criterion (Giudici, 2005). The idea of pruning, used by the CART algorithm, is to find the optimal subtree that minimizes a loss function depending on the total impurity of the tree and on the tree's complexity (Giudici, 2005). The misclassification impurity is usually chosen for the pruning, although the other impurity measures can also be used (Giudici, 2005). Minimizing the loss function results in a compromise between choosing a complex model with low impurity but a high complexity cost and choosing a simple model with high impurity but a low complexity cost (Giudici, 2005). The loss function is assessed by measuring the complexity of the model fitted on the training dataset, while its misclassification errors are measured on the validation dataset (Giudici, 2005). This method partitions the training data into a subset for building the tree and then estimates the misclassification rate on the remaining validation subset (Hand et al., 2001).
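A possible way to carry out this grow-then-prune strategy is sketched below with scikit-learn's cost-complexity pruning path; the dataset and the use of a single validation split are illustrative assumptions rather than the exact procedure of the cited texts.

```python
# Cost-complexity pruning: grow a large tree on the training subset, then pick
# the subtree (via the ccp_alpha penalty) with the lowest misclassification
# rate on a held-out validation subset.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

# Candidate complexity penalties derived from the fully grown training tree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_error = 0.0, 1.0
for alpha in path.ccp_alphas:
    subtree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    error = 1.0 - subtree.score(X_valid, y_valid)   # validation misclassification rate
    if error < best_error:
        best_alpha, best_error = alpha, error

print(best_alpha, best_error)
```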
CART has been widely used for years in marketing and other applications (Hodeghatta & Nayak, 2016). CART is described as a flexible model because violations of the constant-variance assumption, which is critical in regression, are permissible in CART (Ledolter, 2013). However, the biggest challenge with CART is avoiding overfitting (Ledolter, 2013).
Advantages and Disadvantages of the Trees
Decision trees for regression and classification have advantages and disadvantages. Trees are regarded as easier to explain than linear regression and can be displayed graphically and interpreted easily (Cristina, 2010; Tibshirani, James, Witten, & Hastie, 2013). Decision trees are self-explanatory and easy to understand even for non-technical users (Cristina, 2010; Tibshirani et al., 2013). They can handle qualitative predictors without the need to create dummy variables (Tibshirani et al., 2013). Decision trees are efficient: complex alternatives can be expressed quickly and precisely, a tree can easily be modified as new information becomes available, and standard decision tree notation is easy to adopt (Cristina, 2010). They can be used in conjunction with other management tools. Decision trees can handle both nominal and numerical attributes (Cristina, 2010). They are capable of handling datasets that may contain errors or missing values. Decision trees are a nonparametric method, meaning they make no assumptions about the distribution of the data or the structure of the classifier, and their representations are rich enough to represent any discrete-value classifier.
However, trees have limitations as well. They do not have the same level of predictive accuracy as some other regression and classification models (Tibshirani et al., 2013). Most of the algorithms, like ID3 and C4.5, require the target attribute to have only discrete values. Decision trees are over-sensitive to the training set, to irrelevant attributes, and to noise. Because they use the "divide and conquer" method, decision trees tend to perform well when a few highly relevant attributes exist and less well when many complex interactions are present (Cristina, 2010). Table 1 summarizes the advantages and disadvantages of trees.

Table 1. Summary of the Advantages and Disadvantages of Trees.
Note: Constructed by the researcher based on the literature.
Take an Umbrella Decision Tree Example:
- If input field value < n, then target = Y%
- If input field value > n, then target = X%

Figure 1. Decision Tree for Taking an Umbrella
- The decision depends on the weather forecast: the predicted rain probability and whether it is sunny or cloudy.
- The forecast gives a rain probability, with 70% and 30% used as the decision thresholds.
- If the rain probability is above 70%, take an umbrella; otherwise, use the 30% threshold for the further decision.
- If the rain probability is between 30% and 70% and it is cloudy, take an umbrella; otherwise, do not.
- If the rain probability is below 30%, do not take an umbrella.
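The rules in Figure 1 can be written as a pair of nested conditions; the sketch below simply mirrors the thresholds stated above.

```python
# The umbrella decision tree of Figure 1 expressed as nested rules.
def take_umbrella(rain_probability, cloudy):
    if rain_probability > 0.70:      # root split: high chance of rain
        return True
    if rain_probability > 0.30:      # middle range: the sky decides
        return cloudy
    return False                     # low chance of rain

print(take_umbrella(0.80, cloudy=False))   # True
print(take_umbrella(0.50, cloudy=True))    # True
print(take_umbrella(0.20, cloudy=True))    # False
```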
References
Cristina, P. (2010). Decision Trees. Retrieved from http://www.cs.ubbcluj.ro/~gabis/DocDiplome/DT/DecisionTrees.pdf.
EMC. (2015). Data science and big data analytics: Discovering, analyzing, visualizing and presenting data (1st ed.). Wiley.
Fischetti, T., Mayor, E., & Forte, R. M. (2017). R: Predictive analysis. Packt Publishing.
Giudici, P. (2005). Applied data mining: Statistical methods for business and industry. John Wiley & Sons.
Hand, D. J., Mannila, H., & Smyth, P. (2001). Principles of data mining. MIT Press.
Hodeghatta, U. R., & Nayak, U. (2016). Business analytics using R: A practical approach. Springer.
Ledolter, J. (2013). Data mining and business analytics with R. John Wiley & Sons.
Tibshirani, R., James, G., Witten, D., & Hastie, T. (2013). An introduction to statistical learning: With applications in R. New York, NY: Springer.