Analysis of Ensembles

Dr. O. Aly
Computer Science

Introduction

The purpose of this discussion is to analyze the creation of ensembles from different methods such as logistic regression, nearest-neighbor methods, classification trees, naïve Bayes, or discriminant analysis. The discussion also addresses the use of Random Forests for the analysis.

Ensembles

There are two useful techniques that combine methods to improve predictive power: ensembles and uplift modeling. Ensembles are the focus here; uplift modeling is not addressed. An ensemble combines multiple “supervised” models into a “super-model” (Shmueli, Bruce, Patel, Yahav, & Lichtendahl Jr, 2017). An ensemble is based on the notion of combining models (EMC, 2015; Shmueli et al., 2017); thus, several models can be combined to achieve improved predictive accuracy (Shmueli et al., 2017).

Ensembles played a significant role in the million-dollar Netflix Prize contest, which started in 2006 to improve Netflix's movie recommendation system (Shmueli et al., 2017). The principle of combining methods is known to reduce risk because the combined variation is smaller than that of each individual component (Shmueli et al., 2017). In predictive modeling, risk is equivalent to variation in prediction error: the more the prediction errors vary, the more volatile the predictive model (Shmueli et al., 2017). Using an average of two predictions can result in smaller error variance and, therefore, better predictive power (Shmueli et al., 2017). Thus, results can be combined from multiple prediction methods or classifiers (Shmueli et al., 2017). The combination can be implemented for predictions, classifications, and propensities, as discussed below.
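As a brief worked illustration of the variance argument, suppose two models produce prediction errors e1 and e2 with standard deviations σ1 and σ2 and correlation ρ. The variance of the averaged error is Var((e1 + e2)/2) = (σ1² + σ2² + 2ρσ1σ2)/4. When σ1 = σ2 = σ and the errors are uncorrelated (ρ = 0), the averaged error has variance σ²/2, half that of either model alone; the benefit shrinks as ρ approaches 1, which is why low or negative error correlation is desirable.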

Ensembles Combining Prediction Using Average Method

When combining predictions, the predictions from different methods can be combined by taking an average. One alternative to a simple average is taking the median prediction, which is less affected by extreme predictions (Shmueli et al., 2017). Computing a weighted average is another possibility, where the weights are proportional to a quantity of interest such as quality or accuracy (Shmueli et al., 2017). Ensembles for prediction are useful not only in cross-sectional prediction but also in time series forecasting (Shmueli et al., 2017).
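A minimal sketch of combining numeric predictions by simple average, median, and a weighted average follows, assuming scikit-learn-style regressors; the synthetic data, the three models, and the inverse-RMSE weighting scheme are illustrative assumptions rather than a prescribed recipe.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Illustrative synthetic data split into training and validation sets
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=1)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=1)

models = [LinearRegression(), KNeighborsRegressor(), DecisionTreeRegressor(random_state=1)]
preds = np.column_stack([m.fit(X_train, y_train).predict(X_valid) for m in models])

avg_pred = preds.mean(axis=1)          # simple average of the three predictions
med_pred = np.median(preds, axis=1)    # median: less affected by extreme predictions

# Weighted average: weights proportional to an accuracy measure (here, inverse RMSE
# on the validation set -- an illustrative choice, not the only option)
rmse = np.sqrt(((preds - y_valid[:, None]) ** 2).mean(axis=0))
weights = (1.0 / rmse) / (1.0 / rmse).sum()
wavg_pred = preds @ weights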

Ensembles Combining Classification Using Voting Method

When combining classifications, the results from multiple classifiers can be combined using “voting”: for each record, multiple classifications are available, and a simple rule is to choose the most popular class among them (Shmueli et al., 2017). For instance, a classification tree, a naïve Bayes classifier, and discriminant analysis can be used for classifying a binary outcome (Shmueli et al., 2017). For each record, three predicted classes are generated, and simple voting chooses the most common class of the three (Shmueli et al., 2017). Similar to prediction, heavier weights can be assigned to scores from some models, based on considerations such as model accuracy or data quality, which can be implemented by setting a “majority rule” threshold different from 50% (Shmueli et al., 2017). Concerning k-nearest neighbors (k-NN), ensemble learning such as bagging can be performed with k-NN (Dubitzky, 2008). The individual decisions are combined to classify new examples, and the combination of individual results is performed by weighted or unweighted voting (Dubitzky, 2008).
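A minimal sketch of these two ideas, assuming scikit-learn is available, is given below: hard voting over a classification tree, Gaussian naïve Bayes, and linear discriminant analysis, followed by bagging of k-NN classifiers. The dataset and any weights are illustrative.

from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import BaggingClassifier, VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Majority ("hard") voting over a tree, naive Bayes, and discriminant analysis
vote = VotingClassifier(
    estimators=[("tree", DecisionTreeClassifier(random_state=1)),
                ("nb", GaussianNB()),
                ("lda", LinearDiscriminantAnalysis())],
    voting="hard",           # each model casts one vote per record
    # weights=[2, 1, 1],     # optional: heavier weight for a more trusted model
)
vote.fit(X_train, y_train)
print("voting accuracy:", vote.score(X_test, y_test))

# Bagging with k-NN base learners: individual votes are combined for new examples
bag_knn = BaggingClassifier(KNeighborsClassifier(n_neighbors=5),
                            n_estimators=25, random_state=1)
bag_knn.fit(X_train, y_train)
print("bagged k-NN accuracy:", bag_knn.score(X_test, y_test))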

Ensembles Combining Propensities Using Average Method

Similar to prediction, propensities can be combined by taking a simple or weighted average. Some algorithms, such as naïve Bayes, produce biased propensities and should therefore not be averaged with propensities from other methods (Shmueli et al., 2017).
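A minimal sketch of averaging propensities from two models (logistic regression and a classification tree, chosen here to avoid the biased naïve Bayes propensities noted above) is shown below; the synthetic data and the 0.7/0.3 weights are illustrative assumptions.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)

logit = LogisticRegression(max_iter=1000).fit(X_train, y_train)
tree = DecisionTreeClassifier(max_depth=4, random_state=2).fit(X_train, y_train)

# Propensity (probability of class 1) from each model
p_logit = logit.predict_proba(X_test)[:, 1]
p_tree = tree.predict_proba(X_test)[:, 1]

# Simple and weighted averages of the propensities
p_avg = (p_logit + p_tree) / 2
p_weighted = 0.7 * p_logit + 0.3 * p_tree    # illustrative weights

ensemble_class = (p_avg >= 0.5).astype(int)  # classify using the averaged propensity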

Other Forms of Ensembles

Various methods are commonly used for classification, including bagging, boosting, random forests, and support vector machines (SVM). Bagging, boosting, and random forests are all examples of ensemble methods, which use multiple models to obtain better predictive performance than can be obtained from any of the constituent models (EMC, 2015; Ledolter, 2013; Shmueli et al., 2017).

  • Bagging: Short for “bootstrap aggregating” (Ledolter, 2013; Shmueli et al., 2017), bagging was proposed by Leo Breiman in 1994 as a model aggregation technique to reduce model variance (Swamynathan, 2017). It is another form of ensemble based on averaging across multiple random data samples (Shmueli et al., 2017). There are two steps to implement bagging, and Figure 1 illustrates the bagging process flow:
    • Generate multiple random samples by sampling with replacement from the original data; this method is called “bootstrap sampling.”
    • Run an algorithm on each sample and produce scores (Shmueli et al., 2017).

Figure 1.  Bagging Process Flow (Swamynathan, 2017).

Bagging improves the performance stability of a model and helps avoid overfitting by separately modeling different data samples and then combining the results. Thus, it is especially useful for algorithms such as trees and neural networks. Figure 2 illustrates an example in which each bootstrap sample has the same size as the original sample, with about three-quarters of the original values appearing and sampling with replacement resulting in repeated values; a brief code sketch of the two bagging steps follows Figure 2.

Figure 2:  Bagging Example (Swamynathan, 2017).
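To make the two bagging steps concrete, the sketch below, assuming scikit-learn and a synthetic dataset, draws bootstrap samples with replacement, fits a tree to each, and averages the resulting scores; scikit-learn's BaggingClassifier packages the same idea, and the number of bootstrap samples here is illustrative.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=800, n_features=12, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3)

rng = np.random.default_rng(3)
n_samples, n_bags = len(X_train), 30
scores = []

for _ in range(n_bags):
    # Step 1: bootstrap sample -- draw with replacement from the original data
    idx = rng.integers(0, n_samples, size=n_samples)
    # Step 2: run the algorithm on the sample and produce scores
    tree = DecisionTreeClassifier(random_state=3).fit(X_train[idx], y_train[idx])
    scores.append(tree.predict_proba(X_test)[:, 1])

# Aggregate: average the scores across the bootstrap models
bagged_score = np.mean(scores, axis=0)
bagged_class = (bagged_score >= 0.5).astype(int)
print("bagged accuracy:", (bagged_class == y_test).mean())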

  • Boosting: Boosting is a slightly different method of creating ensembles. It was introduced by Freund and Schapire in 1995 with the well-known AdaBoost (adaptive boosting) algorithm (Swamynathan, 2017). The underlying concept of boosting is that, rather than fitting independent individual hypotheses, combining hypotheses in sequential order increases accuracy (Swamynathan, 2017). Boosting algorithms convert “weak learners” into “strong learners” and are well designed to address bias problems (Swamynathan, 2017); boosting tends to increase accuracy (Ledolter, 2013). The AdaBoost process involves three steps, illustrated in Figure 3:

  1. Assign a uniform weight to all data points, W0(x) = 1/N, where N is the total number of training data points.
  2. At each iteration, fit a classifier ym(xn) to the training data and update the weights to minimize the weighted error function.
  3. The final model is given by YM(x) = sign(Σm αm ym(x)), where αm is the weight assigned to the m-th classifier based on its weighted error.

Figure 3.  “AdaBoosting” Process (Swamynathan, 2017).

As an example illustration of AdaBoost, consider a sample dataset with 10 data points, assuming that all data points initially have equal weights given by 1/10, as illustrated in Figure 4; a brief code sketch follows the figure.

Figure 4.  An Example Illustration of AdaBoost: Final Model After Three Iterations (Swamynathan, 2017).
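A minimal sketch of this boosting process, assuming scikit-learn's AdaBoostClassifier with depth-1 decision trees (stumps) as the weak learners, is shown below; the synthetic data and parameter values are illustrative.

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=4)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=4)

# Weak learner: a decision stump (depth-1 tree)
stump = DecisionTreeClassifier(max_depth=1)

# Each iteration reweights the training points the previous learners misclassified;
# the weighted weak learners are then combined into the final strong model.
ada = AdaBoostClassifier(stump, n_estimators=50, learning_rate=1.0, random_state=4)
ada.fit(X_train, y_train)
print("AdaBoost accuracy:", ada.score(X_test, y_test))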

  • Random Forest: It is another class of ensemble method using decision tree classifiers.  It is a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. A particular case of random forest uses bagging on decision trees, where samples are randomly chosen with replacement from the original training set (EMC, 2015).
  • SVM: It is another common classification method, which combines linear models with instance-based learning techniques. An SVM selects a small number of critical boundary instances, called support vectors, from each class and builds a linear decision function that separates them as widely as possible. SVM performs linear classification efficiently by default and can also be configured to perform non-linear classification (EMC, 2015).
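A minimal scikit-learn sketch of this description is given below: a linear-kernel SVM by default and an RBF-kernel SVM configured for non-linear classification; the synthetic dataset, C value, and scaling pipeline are illustrative assumptions.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=10, random_state=5)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=5)

# Linear SVM: finds the widest-margin linear boundary defined by the support vectors
linear_svm = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
linear_svm.fit(X_train, y_train)

# Configured for non-linear classification via the RBF kernel
rbf_svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
rbf_svm.fit(X_train, y_train)

print("linear SVM accuracy:", linear_svm.score(X_test, y_test))
print("RBF SVM accuracy:", rbf_svm.score(X_test, y_test))

# The fitted support vectors can be inspected, e.g.:
print("support vectors per class:", linear_svm.named_steps["svc"].n_support_)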

Advantages and Limitations of Ensembles

Combining scores from multiple models aims to generate more precise predictions by lowering the prediction error variance (Shmueli et al., 2017). The ensemble method is most useful when the combined models generate prediction errors that are negatively correlated, but it can also be useful when the correlation is low (Ledolter, 2013; Shmueli et al., 2017). Ensembles can use simple averaging, weighted averaging, voting, or the median (Ledolter, 2013; Shmueli et al., 2017). Models can be based on the same algorithm or different algorithms, and on the same sample or different samples (Ledolter, 2013; Shmueli et al., 2017). Ensembles have become an important strategy for participants in data mining contests, where the goal is to optimize some predictive measure (Ledolter, 2013; Shmueli et al., 2017). Ensembles based on different data samples help avoid overfitting. However, overfitting can still happen with an ensemble, for instance through the choice of the best weights when using a weighted average (Shmueli et al., 2017).

The primary limitation of ensembles is the resources they require: computation, skills, and time (Shmueli et al., 2017). Ensembles that combine results from different algorithms require the development and evaluation of each model. Boosting-type and bagging-type ensembles do not require much effort, but they do have a computational cost. Furthermore, ensembles that rely on multiple data sources require the collection and maintenance of those sources (Shmueli et al., 2017). Ensembles are also regarded as “black box” methods, in which the relationship between the predictors and the outcome variable usually becomes non-transparent (Shmueli et al., 2017).

The Use of Random Forests for Analysis

A decision tree is based on a set of true/false decision rules, and the prediction is based on the tree rules for each terminal node. A decision tree built on a small set of training data encounters the overfitting problem. The random forest model, in contrast, is well suited to handle small-sample problems. A random forest contains multiple decision trees, and generally the more trees, the better. One source of randomness is selecting a random training subset from the training dataset using the bootstrap aggregating (bagging) method, which reduces overfitting by stabilizing the predictions; this method is utilized in many other machine-learning algorithms, not only in Random Forests (Hodeghatta & Nayak, 2016). Another type of randomness occurs when selecting variables randomly from the set of variables, resulting in different trees based on different sets of variables. In a forest, all the trees still influence the overall prediction made by the random forest (Hodeghatta & Nayak, 2016).

The programming logic for Random Forest includes seven steps as follows (Azhad & Rao, 2011).

  1. Input the number of training cases, N.
  2. Compute the number of attributes, M.
  3. Choose the number (m) of input attributes used to form the decision at a node, with m < M.
  4. Choose training set by sampling with replacement.
  5. For each node of the tree, use one of the (m) variables as the decision node.
  6. Grow each tree without pruning.
  7. Select the classification with maximum votes.
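The seven steps above can be mirrored in a short sketch, assuming scikit-learn and a synthetic dataset: each tree is grown without pruning on a bootstrap sample while considering only m < M randomly chosen attributes at each node, and the final classification takes the maximum votes. scikit-learn's RandomForestClassifier packages the same logic; the data and settings here are illustrative.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# N training cases with M attributes (illustrative values)
X, y = make_classification(n_samples=600, n_features=16, random_state=6)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=6)

rng = np.random.default_rng(6)
n_trees, n_cases = 50, len(X_train)
m = int(np.sqrt(X.shape[1]))           # m < M attributes considered at each node
forest = []

for _ in range(n_trees):
    idx = rng.integers(0, n_cases, size=n_cases)           # step 4: sample with replacement
    tree = DecisionTreeClassifier(max_features=m,           # step 5: m random variables per node
                                  random_state=int(rng.integers(1_000_000)))
    tree.fit(X_train[idx], y_train[idx])                    # step 6: grow without pruning
    forest.append(tree)

# Step 7: select the classification with the maximum votes across all trees
votes = np.stack([t.predict(X_test) for t in forest])
majority = (votes.mean(axis=0) >= 0.5).astype(int)          # binary majority vote (ties go to class 1)
print("random forest accuracy:", (majority == y_test).mean())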

Random Forests have low bias (Hodeghatta & Nayak, 2016). Adding more trees reduces the variance, and thus overfitting, which is one of the advantages of Random Forests and a reason for their growing popularity. Random Forest models are relatively robust to the set of input variables and often require little pre-processing of the data. Random Forests are also described as more efficient to build than other models such as SVM (Hodeghatta & Nayak, 2016). Table 1 summarizes the advantages and disadvantages of Random Forests in comparison with other classification algorithms such as naïve Bayes, decision trees, and nearest neighbors.

Table 1.  Advantages and Disadvantages of Random Forests in Comparison with Other Classification Algorithms. Adapted from Hodeghatta and Nayak (2016).

References

Azhad, S., & Rao, M. S. (2011). Ensuring data storage security in cloud computing.

Dubitzky, W. (2008). Data Mining in Grid Computing Environments: John Wiley & Sons.

EMC. (2015). Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data. (1st ed.): Wiley.

Hodeghatta, U. R., & Nayak, U. (2016). Business Analytics Using R-A Practical Approach: Springer.

Ledolter, J. (2013). Data mining and business analytics with R: John Wiley & Sons.

Shmueli, G., Bruce, P. C., Patel, N. R., Yahav, I., & Lichtendahl Jr, K. C. (2017). Data mining for business analytics: concepts, techniques, and applications in R: John Wiley & Sons.

Swamynathan, M. (2017). Mastering Machine Learning with Python in Six Steps: A Practical Implementation Guide to Predictive Data Analytics Using Python: Apress.