Machine Learning: Logistic Regression

Dr. O. Aly
Computer Science

Introduction

The purpose of this discussion is to analyze the assumptions of Logistic Regression and to identify which assumptions of Regular (linear) Regression do not apply to Logistic Regression.  The discussion and analysis also address the types of variables used in both Logistic Regression and Regular Regression.

Regular Linear Regression:

Regression analysis is used when a linear model is fit to the data to predict values of an outcome (dependent) variable from one or more predictor (independent) variables (Field, 2013).  Linear Regression is also defined in (Field, 2013) as a method used to predict the values of a continuous variable and to make inferences about how specific variables are related to that continuous variable.  These two procedures of prediction and inference rely on information from the statistical model, which is represented by an equation or series of equations with some number of parameters (Fischetti, 2015).  Linear Regression is the most important prediction method for “continuous” variables (Giudici, 2005).

With one predictor or independent variable, the technique is sometimes referred to as “Simple Regression” (Field, 2013; Fischetti, 2015; Fischetti, Mayor, & Forte, 2017; Giudici, 2005).  However, when there are several predictors or independent variables in the model, it is referred to as “Multiple Regression” (Field, 2013; Fischetti, 2015; Fischetti et al., 2017; Giudici, 2005).  In Regression Analysis, the differences between what the model predicts and the observed data are called “Residuals,” which are the same as “Deviations” when looking at the Mean (Field, 2013).  These deviations are the vertical distances between what the model predicted and each data point that was observed.  Sometimes the predicted value of the outcome is less than the actual value, and sometimes it is greater, meaning that the residuals are sometimes positive and sometimes negative.  To evaluate the error in a regression model, as when the fit of the mean is assessed using the variance, a sum of squared errors can be computed: the residual sum of squares (SSR), also called the sum of squared residuals (Field, 2013).  The SSR is an indicator of how well a particular line fits the data: if the SSR is large, the line is not representative of the data; if the SSR is small, the line is representative of the data (Field, 2013).

When using Simple Linear Regression with two variables, one a predictor (independent) variable and the other an outcome (dependent) variable, the equation is as follows (Field, 2013).
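A common form of this model, consistent with the description below, is:

\[ \text{outcome}_i = (b \times X_i) + \text{error}_i \]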

In this Regression Model, (b) is the correlation coefficient (more often denoted as r), and it is a standardized measure (Field, 2013). However, an unstandardized measure of (b) can be used, in which case the equation becomes the following (Field, 2013):
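With an unstandardized coefficient (b1) and an intercept (b0), the simple linear regression equation takes its familiar form:

\[ Y_i = b_0 + b_1 X_i + \varepsilon_i \]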

This model differs from that of a correlation only in that it uses an unstandardized measure of the relationship (b1) and consequently a parameter (b0), the value of the outcome when the predictor is zero, must be included (Field, 2013).  These parameters (b0) and (b1) are known as the Regression Coefficients (Field, 2013).

When there are more than two variables that might be related to the outcome, Multiple Regression can be used.  Multiple Regression can be used with three, four, or more predictors (Field, 2013).  The equation for Multiple Regression is as follows:
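A common way of writing the multiple regression equation, consistent with the notation below, is:

\[ Y_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \dots + b_n X_{ni} + \varepsilon_i \]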

Here, (b1) is the coefficient of the first predictor (X1), (b2) is the coefficient of the second predictor (X2), and so forth, with (bn) being the coefficient of the nth predictor (Xn) (Field, 2013).

To assess the goodness of fit of the Regular Regression, the sums of squares, R, and R2 can be used.  When using the Mean as a model, the difference between the observed values and the values predicted by the mean can be calculated using the total sum of squares (denoted SST) (Field, 2013).  This value of SST represents how good the Mean is as a model of the observed data (Field, 2013).  When using the Regression Model, the SSR can be used to represent the degree of inaccuracy when the best model is fitted to the data (Field, 2013).

Moreover, these two sums of squares, SST and SSR, can be used to calculate how much better the regression model is than a baseline model such as the Mean model (Field, 2013).  The improvement in prediction resulting from using the Regression Model rather than the Mean Model is measured by calculating the difference between SST and SSR (Field, 2013).  Such improvement is the Model Sum of Squares (SSM) (Field, 2013).  If the value of SSM is large, then the regression model is very different from using the mean to predict the outcome variable, indicating that the Regression Model has made a big improvement to how well the outcome variable can be predicted (Field, 2013). However, if SSM is small, then using the Regression Model is only a little better than using the Mean model (Field, 2013).  R2 is calculated by dividing SSM by SST to measure the proportion of the improvement due to the model.  R2 represents the amount of variance in the outcome explained by the model (SSM) relative to how much variation there was to explain in the first place (SST) (Field, 2013). Other methods to assess the goodness-of-fit of the Model include the F-test using Mean Squares (MS) (Field, 2013), and the F-statistic to calculate the significance of R2 (Field, 2013). To measure the individual contribution of a predictor in Regular Linear Regression, the estimated regression coefficients (b) and their standard errors are used to compute a t-statistic (Field, 2013).
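In summary, these goodness-of-fit quantities are related as follows:

\[ SS_M = SS_T - SS_R, \qquad R^2 = \frac{SS_M}{SS_T}, \qquad F = \frac{MS_M}{MS_R} \]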

The Regression Model must generalize; generalization is a critical additional step because, if the model cannot be generalized, then any conclusions drawn from the model must be restricted to the sample used (Field, 2013).  For the regression model to generalize, cross-validation can be used (Field, 2013; Fischetti, 2015) and the underlying assumptions must be met (Field, 2013).

Central Assumptions of Regular Linear Regression in Order of Importance

The assumptions of the Linear Model in order of importance as indicated in (Field, 2013) are as follows:

  1. Additivity and Linearity:  The outcome variable should be linearly related to any predictors, and with several predictors, their combined effect is best described by adding their effects together. Thus, the relationship between variables is linear.  If this assumption is not met, the model is invalid.  Sometimes, variables can be transformed to make their relationships linear (Field, 2013).
  2. Independent Errors:  The residual terms should be uncorrelated (i.e., independent) for any two observations, a property sometimes described as “lack of autocorrelation” (Field, 2013).  If this assumption of independence is violated, the confidence intervals and significance tests will be invalid.  However, regarding the model parameters, the estimates using the method of least squares will still be valid but not optimal (Field, 2013).  This assumption can be tested with the Durbin-Watson test, which tests for serial correlations between errors; specifically, it tests whether adjacent residuals are correlated (Field, 2013).  The size of the Durbin-Watson statistic depends upon the number of predictors in the model and the number of observations (Field, 2013).  As a very conservative rule of thumb, values less than one or greater than three are a cause for concern; however, values closer to 2 may still be problematic, depending on the sample and model (Field, 2013).  A brief R sketch of this test appears after this list.
  3. Homoscedasticity:  At each level of the predictor variable(s), the variance of the residual terms should be constant, meaning that the residuals at each level of the predictor(s) should have the same variance (homoscedasticity) (Field, 2013).  When the variances are very unequal there is said to be heteroscedasticity.  Violating this assumption invalidates the confidence intervals and significance tests (Field, 2013). However, estimates of the model parameters (b) using the method of least squares are still valid but not optimal (Field, 2013).  This problem can be overcome by using weighted least squares regression in which each case is weighted by a function of its variance (Field, 2013).
  4. Normally Distributed Errors:   It is assumed that the residuals in the model are random, normally distributed variables with a mean of 0.  This assumption means that the differences between the model and the observed data are most frequently zero or very close to zero, and that differences much greater than zero happen only occasionally (Field, 2013).  This assumption is sometimes confused with the idea that predictors have to be normally distributed (Field, 2013).  Predictors do not need to be normally distributed (Field, 2013). In small samples a lack of normality will invalidate confidence intervals and significance tests; in large samples it will not, because of the central limit theorem (Field, 2013).  If the concern is only with estimating the model parameters, and not with the significance tests and confidence intervals, then this assumption barely matters (Field, 2013).  In other words, this assumption matters for significance tests and confidence intervals.  This assumption can also be ignored if bootstrapped confidence intervals are used (Field, 2013).
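As a brief illustration of checking the independent-errors assumption in R, the following sketch applies the Durbin-Watson test from the lmtest package to a simple model fitted to the built-in mtcars data; the dataset and model are purely illustrative and are not part of the examples discussed above.

  • ## Durbin-Watson test for autocorrelated residuals (illustrative model and data)
  • library(lmtest)
  • model <- lm(mpg ~ wt + hp, data = mtcars)
  • dwtest(model)   ## a statistic near 2 suggests little autocorrelation; values below 1 or above 3 are a concern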

Additional Assumptions of Regular Linear Regression

There are additional assumptions when dealing with Regular Linear Regression.  These additional assumptions, as indicated in (Field, 2013), are as follows.

  • Predictors are uncorrelated with “External Variables” or “Third Variables”:  External variables are variables which have not been included in the regression model but which influence the outcome variable. These variables can be described as “third variables.” This assumption indicates that there should be no external variables that correlate with any of the variables included in the regression model (Field, 2013).  If external variables do correlate with the predictors, the conclusions drawn from the model become “unreliable” because other variables exist that can predict the outcome just as well (Field, 2013).
  • Variable Types: All predictor (independent) variables must be “quantitative” or “categorical,” and the outcome (dependent) variable must be “quantitative,” “continuous” and “unbounded” (Field, 2013). “Quantitative” indicates that they should be measured at the interval level, and “unbounded” indicates that there should be no constraints on the variability of the outcome (Field, 2013).
  • No Perfect Multicollinearity: If the model has more than one predictor, then there should be no perfect linear relationship between two or more of the predictors.  Thus, the predictor (independent) variables should not correlate too highly (Field, 2013).
  • Non-Zero Variance: The predictors should have some variation in value, meaning they do not have variances of zero (Field, 2013).

Logistic Regression

When the outcome variable is categorical and the predictor (independent) variables are continuous or categorical, Logistic Regression is used (Field, 2013). Logistic Regression is multiple regression but with an outcome (dependent) variable that is categorical and predictor variables that are continuous or categorical.  Logistic Regression is the main prediction method for qualitative variables (Giudici, 2005).

Logistic Regression can have life-saving applications: in medical research it is used to generate models from which predictions can be made about the “likelihood” that, for example, a tumor is cancerous or benign (Field, 2013). A database is used to determine which variables are influential in predicting the “likelihood” of malignancy of a tumor (Field, 2013). These variables can be measured for a new patient and their values placed in a Logistic Regression model, from which a “probability” of malignancy can be estimated (Field, 2013).  Logistic Regression calculates the “probability” of the outcome occurring rather than predicting the value of the outcome corresponding to a given set of predictors (Ahlemeyer-Stubbe & Coleman, 2014). The expected values of the target variable from a Logistic Regression are between 0 and 1 and can be interpreted as a “likelihood” (Ahlemeyer-Stubbe & Coleman, 2014).

There are two types of Logistic Regression: Binary Logistic Regression, and Multinomial (or Polychotomous) Logistic Regression.  Binary Logistic Regression is used to predict membership of one of two categories of the outcome (dependent) variable, while Multinomial or Polychotomous Logistic Regression is used to predict membership of one of more than two categories (Field, 2013).

Concerning the assessment of the model, the R-statistic can be used to calculate a more literal version of the multiple correlation in the Logistic Regression model.  The R-statistic is the partial correlation between the outcome variable and each of the predictor variables, and it can vary between -1 and +1. A positive value indicates that as the predictor variable increases, so does the likelihood of the event occurring, while a negative value indicates that as the predictor variable increases, the likelihood of the outcome occurring decreases (Field, 2013).  If a variable has a small value of R, then it contributes a small amount to the model.  Other measures for such assessment include Hosmer and Lemeshow’s, Cox and Snell’s, and Nagelkerke’s (Field, 2013). All these measures differ in their computation, but conceptually they are somewhat the same, and they can be seen as similar to R2 in linear regression regarding interpretation, as they provide a gauge of the substantive significance of the model (Field, 2013).

In Logistic Regression, there is an analogous statistic, the z-statistic, which follows the normal distribution and measures the individual contribution of predictors (Field, 2013). Like the t-test in Regular Linear Regression, the z-statistic indicates whether the (b) coefficient for that predictor is significantly different from zero (Field, 2013). If the coefficient is significantly different from zero, then it can be assumed that the predictor is making a significant contribution to the prediction of the outcome (Y) (Field, 2013). The z-statistic is known as the Wald statistic, as it was developed by Abraham Wald (Field, 2013).
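Like the t-statistic in linear regression, the Wald (z) statistic is computed by dividing the coefficient by its standard error:

\[ z = \frac{b}{SE_b} \]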

Principles of Logistic Regression

One of the assumptions mentioned above for regular linear models is that the relationship between variables must be linear for the linear regression to be valid.  However, when the outcome variable is categorical, this assumption is violated, as explained in the “Variable Types” assumption above, because the outcome (dependent) variable must be “quantitative,” “continuous” and “unbounded” (Field, 2013).  To get around this problem, the data must be transformed using a logarithmic transformation.  The purpose of this transformation is to express a non-linear relationship as a linear relationship (Field, 2013). Logistic Regression is based on this principle: it expresses the multiple linear regression equation in logarithmic terms, called the “logit,” and thus overcomes the problem of violating the assumption of linearity (Field, 2013).  The transformation logit(p) is used in Logistic Regression, with the letter (p) representing the probability of success (Ahlemeyer-Stubbe & Coleman, 2014).  The logit(p) is a non-linear transformation, and Logistic Regression is a type of non-linear regression (Ahlemeyer-Stubbe & Coleman, 2014).
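In its common form, the binary logistic regression model expresses the probability of the outcome as

\[ P(Y) = \frac{1}{1 + e^{-(b_0 + b_1 X_{1i} + \dots + b_n X_{ni})}} \]

which is equivalent to a linear model on the logit scale:

\[ \text{logit}(p) = \ln\!\left(\frac{p}{1-p}\right) = b_0 + b_1 X_{1i} + \dots + b_n X_{ni} \]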

Assumptions of the Logistic Regression

In the Logistic Regression, the assumptions of the ordinary regression are still applicable.  However, the following two assumptions are dealt with differently in the Logistic Regression (Field, 2013):

  • Linearity:  While in ordinary regression the assumption is that the outcome has a linear relationship with the predictors, in logistic regression the outcome is categorical, so this assumption is violated, and the log (or logit) of the data is used to overcome this violation (Field, 2013). Thus, the assumption of linearity in Logistic Regression is that there is a linear relationship between any continuous predictors and the logit of the outcome variable (Field, 2013). This assumption can be tested by checking whether the interaction term between each predictor and its log transformation is significant (Field, 2013).  In short, the linearity assumption in Logistic Regression is that each continuous predictor has a linear relationship with the logit of the outcome variable.
  • Independence of Errors:  In Logistic Regression, violating this assumption produces overdispersion, which occurs when the observed variance is bigger than expected from the Logistic Regression model.   Overdispersion can occur for two reasons (Field, 2013). The first reason is correlated observations, when the assumption of independence is broken (Field, 2013). The second reason is variability in success probabilities (Field, 2013). Overdispersion tends to make the standard errors too small, which creates two problems. The first problem concerns the test statistics of the regression parameters, which are computed by dividing by the standard error: if the standard error is too small, then the test statistic will be too big and falsely deemed significant.  The second problem concerns the confidence intervals, which are computed from standard errors: if the standard error is too small, then the confidence interval will be too narrow and result in overconfidence about the likely relationship between predictors and the outcome in the population. In short, overdispersion occurs when the variance is larger than the expected variance from the model.  This overdispersion can be caused by violating the assumption of independence.  This problem makes the standard errors too small (Field, 2013), which can bias the conclusions about the significance of the model parameters (b-values) and their population values (Field, 2013).  A minimal R sketch of fitting a logistic model and checking for overdispersion follows this list.
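As a minimal sketch (assuming a hypothetical data frame patients with a binary outcome cured and predictors intervention and duration; the names are illustrative only), a binary logistic model can be fitted in R with glm(), and overdispersion can be checked roughly by comparing the residual deviance to its residual degrees of freedom:

  • ## Fit a binary logistic regression (hypothetical data frame and variable names)
  • log.model <- glm(cured ~ intervention + duration, data = patients, family = binomial)
  • summary(log.model)
  • ## Rough overdispersion check: a ratio substantially greater than 1 suggests overdispersion
  • deviance(log.model) / df.residual(log.model)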

Business Analytics Methods Based on Data Types

In (Hodeghatta & Nayak, 2016), the following table summarizes the business analytics methods based on the data types.  As shown in the table, when the response (dependent) variable is continuous and the predictor variables are either continuous or categorical, the Linear Regression method is used.  When the response (dependent) variable is categorical and the predictor variables are either continuous or categorical, Logistic Regression is used.  Other methods are also listed as additional information.

Table-1. Business Analytics Methods Based on Data Types. Adapted from (Hodeghatta & Nayak, 2016).

References

Ahlemeyer-Stubbe, A., & Coleman, S. (2014). A practical guide to data mining for business and industry: John Wiley & Sons.

Field, A. (2013). Discovering Statistics using IBM SPSS Statistics: Sage publications.

Fischetti, T. (2015). Data Analysis with R: Packt Publishing Ltd.

Fischetti, T., Mayor, E., & Forte, R. M. (2017). R: Predictive Analysis: Packt Publishing.

Giudici, P. (2005). Applied data mining: statistical methods for business and industry: John Wiley & Sons.

Hodeghatta, U. R., & Nayak, U. (2016). Business Analytics Using R-A Practical Approach: Springer.

Overfitting and Parsimony in Large Dataset Analysis

Dr. O. Aly
Computer Science

Introduction:  The purpose of this discussion is to examine the issue of overfitting versus using parsimony and its importance in Big Data analysis.  The discussion also addresses whether overfitting is a problem in the General Linear Model (GLM) approach.  Some hierarchical methods which do not require the parsimony of the GLM are also discussed. This discussion does not cover the GLM itself, as it was discussed earlier. It begins with the parsimony principle in statistical models.

Parsimony Principle in Statistical Model

The medieval (14th century) English philosopher William of Ockham (1285–1347/49) (Forster, 1998) popularized a critical principle stated by Aristotle: “Entities must not be multiplied beyond what is necessary” (Bordens & Abbott, 2008; Epstein, 1984; Forster, 1998).  Ockham’s refinement of this principle is now called “Occam’s Razor,” stating that a problem should be stated in the simplest possible terms and explained with the fewest postulates possible (Bordens & Abbott, 2008; Epstein, 1984; Field, 2013; Forster, 1998).  This principle is now known as the Law or Principle of Parsimony (Bordens & Abbott, 2008; Epstein, 1984; Field, 2013; Forster, 1998).  Thus, based on this law, a theory should account for phenomena within its domain in the simplest terms possible and with the fewest assumptions (Bordens & Abbott, 2008; Epstein, 1984; Field, 2013; Forster, 1998).  As indicated by (Bordens & Abbott, 2008), if there are two competing theories concerning a behavior, the one which explains the behavior in the simplest terms is preferred under the law of parsimony.

Modern theories of the attribution process, development, memory, and motivation adhere to this law of parsimony (Bordens & Abbott, 2008).  However, the history of science has witnessed some theories which were crushed under the weight of their complexity (Bordens & Abbott, 2008). For instance, the collapse of interest in the Hull-Spence model of learning occurred primarily because the theory had been modified so many times to account for anomalous data that it was no longer parsimonious (Bordens & Abbott, 2008).  The Hull-Spence model became too complicated, with too many assumptions and too many variables whose values had to be extracted from the very data that the theory was meant to explain (Bordens & Abbott, 2008).  As a result of such complexity, interest in the theory collapsed.  The Ptolemaic theory of planetary motion likewise lost its parsimony and, with it, much of its true predictive power (Bordens & Abbott, 2008).

Parsimony is one of the characteristics of a good theory (Bordens & Abbott, 2008).  A parsimonious explanation or theory explains a relationship using relatively few assumptions (Bordens & Abbott, 2008). When more than one explanation is offered for observed behavior, scientists and researchers prefer the parsimonious explanation, which explains the behavior with the fewest assumptions (Bordens & Abbott, 2008). Scientific explanations are regularly evaluated and examined for consistency with the evidence and with known principles, and for parsimony and generality (Bordens & Abbott, 2008).  Accepted explanations can be overthrown in favor of views which are more general, more parsimonious, and more consistent with observation (Bordens & Abbott, 2008).

How to Develop Fit Model Using Parsimony Principle

When building a model, the researcher should strive for parsimony (Bordens & Abbott, 2008; Field, 2013). The statistical implication of using a parsimony heuristic is that models should be kept as simple as possible, meaning predictors should not be included unless they have explanatory benefit (Field, 2013).  This strategy can be implemented by fitting the model that includes all potential predictors and then systematically removing any that do not seem to contribute to the model (Field, 2013).  Moreover, if the model includes interaction terms, then, for the interaction terms to be valid, the main effects involved in them should be retained (Field, 2013).  An example of implementing parsimony in developing a model involves three variables in a patient dataset: (1) an outcome variable (cured or not cured), which is the dependent variable (DV); (2) an intervention variable, which is a predictor or independent variable (IV); and (3) duration, which is another predictor (IV) (Field, 2013).  Thus, the three potential predictors are Intervention, Duration, and the interaction “Intervention x Duration” (Field, 2013).  The most complex model includes all three of these predictors.  As the model is being developed, any terms that are added but do not improve the model should be removed, and the model without those terms should be adopted.  Thus, the first model (model-1) the researcher can fit has only Intervention as a predictor (Field, 2013).  The model is then built up by adding the other main effect, Duration, as model-2.  The interaction of Intervention x Duration can be added in model-3. Figure 1 illustrates these three stages of development. The goal is to determine which of these models best fits the data while adhering to the general idea of parsimony (Field, 2013).   If the interaction term in model-3 does not improve model-2, then model-2 should be used as the final model.  If Duration in model-2 does not make any difference and does not improve model-1, then model-1 should be used as the final model (Field, 2013).  The aim is to build the model systematically and choose the most parsimonious model as the final model.  Parsimonious representations are essential because simpler models tend to give more insight into a problem (Ledolter, 2013).

Figure 1.  Building Models based on the Principle of Parsimony (Field, 2013).
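A minimal R sketch of this model-building sequence, assuming a hypothetical data frame patients with a binary outcome Cured and predictors Intervention and Duration (the names are illustrative only), might proceed as follows, with nested models compared before deciding which terms to keep:

  • ## Build models of increasing complexity and compare them (hypothetical data)
  • model1 <- glm(Cured ~ Intervention, data = patients, family = binomial)
  • model2 <- update(model1, . ~ . + Duration)
  • model3 <- update(model2, . ~ . + Intervention:Duration)
  • ## Keep an added term only if it significantly improves the fit
  • anova(model1, model2, model3, test = "Chisq")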

Overfitting in Statistical Models

Overfitting is a term used for models or procedures which violate the parsimony principle; it means that the model includes more terms than are necessary or uses more complicated approaches than necessary (Hawkins, 2004).   There are two forms of “overfitting.”  The first is to use a model which is more flexible than it needs to be (Hawkins, 2004).  For instance, a neural net can accommodate some curvilinear relationships and so is more flexible than a simple linear regression (Hawkins, 2004). However, if it is used on a dataset that conforms to the linear model, it will add a level of complexity without any corresponding benefit in performance, or even worse, with poorer performance than the simpler model (Hawkins, 2004).  The second form is to use a model that includes irrelevant components, such as a polynomial of excessive degree or a multiple linear regression that has irrelevant as well as the needed predictors (Hawkins, 2004).

Overfitting is undesirable for four essential reasons (Hawkins, 2004).  The first is that it wastes resources and expands the possibilities for undetected errors in databases, which can lead to prediction mistakes, as the values of these unhelpful predictors must be supplied in future use of the model (Hawkins, 2004).  The second reason is that a model with unneeded predictors can lead to worse decisions (Hawkins, 2004).   The third reason is that irrelevant predictors can make predictions worse, because the coefficients fitted to them add random variation to the subsequent predictions (Hawkins, 2004). The last reason is that the choice of model has an impact on its portability (Hawkins, 2004). A one-predictor linear regression that captures a relationship with the model is highly portable (Hawkins, 2004).  The more portable model is preferred over the less portable model, as a fundamental requirement of science is that one researcher’s results can be duplicated by another researcher (Hawkins, 2004).

Moreover, large models overfitted on a training dataset turn out to be extremely poor predictors in new situations, as unneeded predictor variables increase the prediction error variance (Ledolter, 2013).  Overparameterized models are of little use if it is difficult to collect data on the predictor variables in the future.  The partitioning of the data into training and evaluation (test) datasets is central to most data mining methods (Ledolter, 2013). Researchers must check whether the relationships found in the training dataset will hold up in the future (Ledolter, 2013).

How to recognize and avoid Overfit Models

A model overfits if it is more complicated than another model that fits equally well (Hawkins, 2004).  Recognizing an overfit model involves not only the comparison of the simpler model and the more complex model but also the issue of how the fit of a model is measured (Hawkins, 2004).  Cross-validation can detect overfit models by partitioning the data and determining how well the model generalizes to other datasets (minitab.com, 2015).  This process of cross-validation helps assess how well the model fits new observations which were not used in the model estimation process (minitab.com, 2015).
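As a minimal sketch of this idea, the following R code uses a simple hold-out split on the built-in mtcars data (chosen purely for illustration) to compare the out-of-sample prediction error of a simple model and a deliberately over-complex one; a much larger test error for the complex model is a sign of overfitting:

  • ## Hold-out validation: compare a simple and an over-complex model on unseen data
  • set.seed(1)
  • n <- nrow(mtcars)
  • train <- sample(n, size = round(0.7 * n))
  • simple.model  <- lm(mpg ~ wt, data = mtcars[train, ])
  • complex.model <- lm(mpg ~ poly(wt, 5), data = mtcars[train, ])
  • ## Mean squared prediction error on the held-out test set
  • mean((mtcars$mpg[-train] - predict(simple.model,  mtcars[-train, ]))^2)
  • mean((mtcars$mpg[-train] - predict(complex.model, mtcars[-train, ]))^2)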

Hierarchical Methods

The regression analysis types include simple, hierarchical, and stepwise analysis (Bordens & Abbott, 2008).  The main difference between these types is how predictor variables are entered into the regression equation, which may affect the regression solution (Bordens & Abbott, 2008).  In simple regression analysis, all predictor variables are entered together, while in hierarchical regression, the order in which variables are entered into the regression equation is specified (Bordens & Abbott, 2008; Field, 2013).  Thus, hierarchical regression is used with a well-developed theory or model suggesting a specific causal order (Bordens & Abbott, 2008). As a general rule, known predictors should be entered into the model first, in order of their importance in predicting the outcome (Field, 2013).  After the known predictors have been entered, any new predictors can be added to the model (Field, 2013).  In stepwise regression, the order in which variables are entered is based on a statistical decision, not on a theory (Bordens & Abbott, 2008).

The choice of regression analysis should be based on the research questions or the underlying theory (Bordens & Abbott, 2008).  If the theoretical model suggests a particular order of entry, hierarchical regression should be used (Bordens & Abbott, 2008). Stepwise regression is infrequently used because sampling and measurement error tend to make the correlations among variables in stepwise regression unstable (Bordens & Abbott, 2008).  The main problem with stepwise methods is that they assess the fit of a variable based on the other variables already in the model (Field, 2013).

Goodness-of-fit Measure for the Fit Model

Comparison between hierarchical and stepwise methods:  Both hierarchical and stepwise methods involve adding predictors to the model in stages, and it is useful to know whether these additions improve the model (Field, 2013).  Since larger values of R2 indicate better fit, a simple way to see whether a model has improved as a result of adding predictors would be to see whether R2 for the new model is bigger than for the old model.  However, R2 will always get bigger if predictors are added, so the issue is whether it gets significantly bigger (Field, 2013).  The significance of R2 itself can be assessed with the F-statistic computed from the equation below (Field, 2013).
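In the form given by Field (2013), this F-statistic is

\[ F = \frac{(N - k - 1)\,R^2}{k\,(1 - R^2)} \]

where N is the sample size and k is the number of predictors in the model.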

However, because the focus is on the change in the models, the change in R2 (R2 change) and the R2 of the newer model (R2 new) are used in the following equation (Field, 2013). Models can then be compared using this F-ratio (Field, 2013).
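Following the same form (Field, 2013), the F for the change in R2 can be written as

\[ F_{\text{change}} = \frac{(N - k_{\text{new}} - 1)\,R^2_{\text{change}}}{k_{\text{change}}\,(1 - R^2_{\text{new}})} \]

where k_new is the number of predictors in the newer model and k_change is the number of predictors added.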

Akaike’s Information Criterion (AIC) is a goodness-of-fit measure which penalizes the model for having more variables: if the AIC is bigger, the fit is worse; if the AIC is smaller, the fit is better (Field, 2013).   If the Automated Linear Model function in SPSS is used, then AIC is used to select models rather than the change in R2. AIC is used to compare models with the same outcome variable; if it is getting smaller, then the fit of the model is improving (Field, 2013).  In addition to AIC, there is Hurvich and Tsai’s criterion (AICC), which is a version of AIC designed for small samples (Field, 2013).  Bozdogan’s criterion (CAIC) is a version of AIC which adjusts for model complexity and sample size. The Bayesian Information Criterion (BIC) of Schwarz is comparable to the AIC (Field, 2013; Forster, 1998); however, it is slightly more conservative, as it corrects more harshly for the number of parameters being estimated (Field, 2013).  It should be used when sample sizes are large and the number of parameters is small (Field, 2013).

The AIC and BIC are the most commonly used measures of the fit of a model.  The values of these measures are all useful as a way of comparing models (Field, 2013).  The values of AIC, AICC, CAIC, and BIC can all be compared to their equivalent values in other models.  In all cases, smaller values mean better-fitting models (Field, 2013).

There is also Rissanen’s Minimum Description Length (MDL) measure, which is based on the idea that statistical inference centers around capturing regularity in data; regularity, in turn, can be exploited to compress the data (Field, 2013; Vandekerckhove, Matzke, & Wagenmakers, 2015).  Thus, the goal is to find the model which compresses the data the most (Vandekerckhove et al., 2015).   There are three versions of MDL. The first is the crude two-part code, where the penalty for complex models is that they take many bits to describe, increasing the summed code length; in this version, it can be difficult to define the number of bits required to describe the model.  The second version of MDL is the Fisher Information Approximation (FIA), which is similar to AIC and BIC in that it includes a first term representing goodness-of-fit and additional terms representing a penalty for complexity (Vandekerckhove et al., 2015).  Its second term resembles that of BIC, and the third term reflects a more sophisticated penalty which represents the number of distinguishable probability distributions that a model can generate (Vandekerckhove et al., 2015).  The FIA differs from AIC and BIC in that it also accounts for functional form complexity, not just complexity due to the number of free parameters (Vandekerckhove et al., 2015).  The third version of MDL is normalized maximum likelihood (NML), which is simple to state but can be difficult to compute; for instance, the denominator may be infinite, and this requires further measures to be taken (Vandekerckhove et al., 2015). Moreover, NML requires integration over the entire set of possible datasets, which may be difficult to define as it depends on unknown decision processes of the researchers (Vandekerckhove et al., 2015).

AIC and BIC in R: If there are p potential predictors, then there are 2^p possible models (r-project.org, 2002).  AIC and BIC can be used in R as selection criteria for linear regression models as well as for other types of models.  As indicated in (r-project.org, 2002), the equations for AIC and BIC are as follows.
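In their standard form, with p parameters and n observations, these criteria can be written as

\[ AIC = -2\,\log(\text{likelihood}) + 2p, \qquad BIC = -2\,\log(\text{likelihood}) + p\,\log(n) \]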

For linear regression models, the -2 log-likelihood (known as the deviance) is n log(RSS/n) (r-project.org, 2002).  AIC and BIC need to be minimized (r-project.org, 2002).  Larger models will fit better and so have smaller RSS, but they use more parameters (r-project.org, 2002).  Thus, the best choice of model will balance fit with model size (r-project.org, 2002).  The BIC penalizes larger models more heavily and so will tend to prefer smaller models in comparison to AIC (r-project.org, 2002).

An example of the code in R using the state.x77 dataset is below. The step() function does not evaluate the AIC for all possible models but uses a search method that compares models sequentially, as shown in the output of the R commands.

  • ## Build a data frame from the built-in state.x77 matrix, then run stepwise AIC selection
  • statedata <- data.frame(state.x77)
  • g <- lm(Life.Exp ~ ., data=statedata)
  • step(g)

References

Bordens, K. S., & Abbott, B. B. (2008). Research Design and Methods: A Process Approach: McGraw-Hill.

Epstein, R. (1984). The principle of parsimony and some applications in psychology. The Journal of Mind and Behavior, 119-130.

Field, A. (2013). Discovering Statistics using IBM SPSS Statistics: Sage publications.

Forster, M. R. (1998). Parsimony and Simplicity. Retrieved from http://philosophy.wisc.edu/forster/220/simplicity.html, University of Wisconsin-Madison.

Hawkins, D. M. (2004). The problem of overfitting. Journal of chemical information and computer sciences, 44(1), 1-12.

Ledolter, J. (2013). Data mining and business analytics with R: John Wiley & Sons.

minitab.com. (2015). The Danger of Overfitting Regression Models. Retrieved from http://blog.minitab.com/blog/adventures-in-statistics-2/the-danger-of-overfitting-regression-models.

r-project.org. (2002). Practical Regression and ANOVA Using R Retrieved from https://cran.r-project.org/doc/contrib/Faraway-PRA.pdf.

Vandekerckhove, J., Matzke, D., & Wagenmakers, E.-J. (2015). Model Comparison and the Principle of Parsimony. In The Oxford handbook of computational and mathematical psychology (Vol. 300): Oxford Library of Psychology.

Quantitative Analysis of the “Faithful” Dataset Using R-Programming

Dr. O. Aly
Computer Science

Abstract

The purpose of this project is to analyze the faithful dataset. The project is divided into two main parts.  Part-I evaluates and examines the dataset, using RStudio, in order to understand it; it involves five significant tasks.  Part-II discusses the pre-data analysis, converting the dataset to a data frame and involving nine significant tasks to analyze the data frame.   The result shows that there is a relationship between the Waiting Time until the Next Eruption and the Eruption Time.   The probability value (p-value), known as the value of significance, should be very small, such as 0.001, 0.005, 0.01, or 0.05, for the relationship between the response variable and the independent variable to be significant. As the result shows, the probability of error for the coefficient of eruptions is minimal, almost 0, i.e., <2e-16. Thus, we reject the null hypothesis that the parameter is not significant to the model and accept the alternative hypothesis that the parameter is significant to the model.  We conclude that there is a significant relationship between the response variable Waiting Time and the independent variable Eruptions. The project also compares three regressions: standard linear regression, polynomial regression, and lowess regression.  The result shows a similar relationship and similar lines for the three models.

Keywords: R-Dataset; Faithful Dataset; Regression Analysis Using R.

Introduction

This project examines and analyzes the dataset faithful.csv (oldfaithful.csv).  The dataset was downloaded from http://vincentarelbundock.github.io/Rdatasets/.  The dataset has 272 observations on two variables, eruptions and waiting, and describes the waiting time between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA.  A closer look at faithful$eruptions reveals that these are heavily rounded times originally in seconds, where multiples of 5 are more frequent than expected under non-human measurement.  The eruptions variable is numeric and gives the eruption time in minutes.  The waiting variable is numeric and gives the waiting time to the next eruption in minutes. The related geyser dataset is in the package MASS.  There are two parts.  Part-I addresses five tasks to examine and understand the dataset using R before the analysis, as described below.

Part-II addresses the analysis using R and includes the following nine tasks, which are followed by the discussion and analysis of the results.

  • Task-1: The first five records of the dataset.
  • Task-2: Density Histograms and Smoothed Density Histograms.
  • Task-3: Standard Linear Regression
  • Task-4: Polynomial Regression.
  • Task-5: Lowess Regression.
  • Task-6: The Summary of the Model.
  • Task-7: Comparison of Linear Regressions, Polynomial Regression, and Lowess Regression.
  • Task-8: The Summary of all Models.
  • Task-9:  Discussion and Analysis.

Various resources were utilized to develop the required code using R. These resources include (Ahlemeyer-Stubbe & Coleman, 2014; Fischetti, Mayor, & Forte, 2017; Ledolter, 2013; r-project.org, 2018).

Part-I:  Understand and Examine the Dataset “faithful.”

Task-1:  Install MASS Package

The purpose of this task is to install the MASS package, which is required for this project. The faithful.csv analysis requires this package.

  • Command: > install.packages("MASS")

Task-2:  Understand the Variables of the Data Sets

The purpose of this task is to understand the variables of the dataset.  The dataset is a “faithful” dataset. It describes the Waiting Time between Eruptions and the duration of the Eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA.  The dataset has 272 observations on two main variables:

  • Eruptions: numeric Eruption time in minutes.
  • Waiting: numeric Waiting time to next eruption in minutes.

Task-3:  Examine the Variables of the Data Sets

The main dataset is called “faithful.csv” dataset, which includes two main variables eruptions and waiting.

  • ##Examine the dataset
  • data()
  • ?faithful
  • install.packages("MASS")
  • install.packages("lattice")
  • library(lattice)
  • faithful <- read.csv("C:/CS871/Data/faithful.csv")
  • data(faithful)
  • summary(faithful)

Figure 1.  Eruptions and Waiting for Eruption Plots for Faithful dataset.

Task-4: Create a Data Frame to represent the dataset of faithful.

  • ##Create DataFrame
  • faithful.df <- data.frame(faithful)
  • faithful.df
  • summary(faithful.df)

Task-5: Examine the Content of the Data Frame using head(), names(), colnames(), and dim() functions.

  • names(faithful.df)
  • head(faithful.df)
  • dim(faithful.df)

Part-II: Discussion and Analysis

Task-1:  The first Ten lines of Waiting and Eruptions.

  • ##The first ten lines of Waiting and Eruptions
  • faithful$waiting[1:10]
  • faithful$eruptions[1:10]
  • ##The descriptive analysis of waiting and eruptions
  • summary(faithful$waiting)
  • summary(faithful$eruptions)

Task-2:  Density Histograms, and Smoothed Density Histograms.

  • ##Density histogram for Waiting Time
  • hist(faithful.df$waiting, col="blue", freq=FALSE, main="Histogram of Waiting Time to Next Eruption", xlab="Waiting Time To Next Eruption In Minutes")
  • ##Smoothed density histogram for Waiting Time (requires the locfit package)
  • library(locfit)
  • smoothedDensity_waiting <- locfit(~lp(waiting), data=faithful.df)
  • plot(smoothedDensity_waiting, col="blue", main="Histogram of Waiting Time to Next Eruption", xlab="Waiting Time To Next Eruption In Minutes")
  • ##Density histogram for Eruptions
  • hist(faithful.df$eruptions, col="red", freq=FALSE, main="Histogram of Eruption Time", xlab="Eruption Time In Minutes")
  • ##Smoothed density histogram for Eruptions
  • smoothedDensity_eruptions <- locfit(~lp(eruptions), data=faithful.df)
  • plot(smoothedDensity_eruptions, col="red", main="Histogram of Eruption Time", xlab="Eruption Time In Minutes")

Figure 2.  Density Histogram and Smoothed Density Histogram of Waiting Time.

Figure 3.  Density Histogram and Smoothed Density Histogram of Eruption.

Task-3: Standard Linear Regression

            The purpose of this task is to examine the Standard Linear Regression for the two main factors of the faithful dataset, Waiting Time until Next Eruption and Eruption Time.  This task also addresses the diagnostic plots of the standard linear regression, such as residuals vs. fitted values, as examined below.  The R codes are as follows:

  • ##Standard Regression of waiting time on eruption time.
  • lin.reg.model=lm(waiting~eruptions, data=faithful.df)
  • plot(waiting~eruptions, data=faithful, col="blue", main="Regression of Two Factors of Waiting and Eruption Time")
  • abline(lin.reg.model, col="red")

Figure 4.  Regression of Two Factors of Waiting and Eruption Time.

            The following graphs represent the diagnostic plots for the standard linear regression.  The first plot shows the Residuals vs. Fitted values. The second plot shows the Normal Q-Q plot. The third plot shows the Scale-Location plot. The fourth plot shows Residuals vs. Leverage.  The discussion and analysis of these graphs appear under the discussion and analysis section of this project.

Figure 5.  Diagnostic Plots for Standard Linear Regression.

Task-4: Polynomial Regression

            The purpose of this task is to examine the polynomial regression for the Waiting Time Until Next Eruption and the Eruption Time variables. The R codes are as follows:
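A minimal sketch of such a fit, assuming a quadratic (second-degree) polynomial in eruption time, is shown below; the object names and plotting details are illustrative.

  • ##Polynomial (quadratic) regression of waiting time on eruption time
  • poly.reg.model <- lm(waiting ~ poly(eruptions, 2), data=faithful.df)
  • plot(waiting ~ eruptions, data=faithful.df, col="blue", main="Polynomial Regression of Waiting and Eruption Time")
  • ord <- order(faithful.df$eruptions)
  • lines(faithful.df$eruptions[ord], fitted(poly.reg.model)[ord], col="red")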

Figure 6.  Polynomial Regression.

Task-5: Lowess Regression

            The purpose of this task is to examine the Lowess Regression.  The R codes are as follows:
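A minimal sketch, assuming the base R lowess() smoother with its default span of f = 2/3, is shown below; the object names are illustrative.

  • ##Lowess (locally weighted scatterplot smoothing) of waiting time on eruption time
  • lowess.model <- lowess(faithful.df$eruptions, faithful.df$waiting, f=2/3)
  • plot(waiting ~ eruptions, data=faithful.df, col="blue", main="Lowess Regression of Waiting and Eruption Time")
  • lines(lowess.model, col="red")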

Figure 7.  Lowess Regression.

Task-6: Summary of the Model.

            The purpose of this task is to examine the descriptive analysis summary such as residuals, intercept, R-squared.  The R code is as follows:

  • summary(lin.reg.model)

Task-7:  Comparison of Linear Regression, Polynomial Regression and Lowess Regression.

  • ##Comparing local polynomial regression to the standard regression.
  • lowessReg=lowess(faithful$waiting~faithful$eruptions, f=2/3)
  • local.poly.reg <- locfit(waiting~lp(eruptions, nn=0.5), data=faithful)
  • standard.reg=lm(waiting~eruptions, data=faithful)
  • plot(faithful$waiting~faithful$eruptions, main="Eruptions Time", xlab="Eruption Time in Minutes", ylab="Waiting Time to Next Eruption Time", col="blue")
  • lines(lowessReg, col="red")
  • abline(standard.reg, col="green")
  • lines(local.poly.reg, col="yellow")

Figure 8.  Regression Comparison for the Eruptions and Waiting Time Variables.

Task-8:  Summary of these Models

            The purpose of this task is to examine the summary of each model. The R codes are as follows:

  • ##Summary of the regressions
  • summary(lowessReg)
  • summary(local.poly.reg)
  • summary(standard.reg)
  • cor(faithful.df$eruptions, faithful.df$waiting)

Task-9: Discussion and Analysis

            The results show that, in the descriptive analysis, the average for eruptions is 3.49 minutes, which is lower than the median value of 4.0 minutes, indicating a negatively skewed distribution.  The average waiting time until the next eruption is 70.9 minutes, which is less than the median of 76.0, again indicating a negatively skewed distribution.  Figure 2 illustrates the density histogram and smoothed density histogram of the waiting time. The result shows that the peak waiting time is ~80 minutes, with the highest density point of 0.04.   Figure 3 illustrates the density histogram and smoothed density histogram of the eruption time in minutes.  The result shows that the peak eruption time is ~4.4 minutes, with the highest density point of 0.6.  Figure 4 illustrates the linear regression of the two factors, waiting time until the next eruption and eruption time in minutes.  The result shows that as the eruption time increases, the waiting time increases.  The residuals depict the difference between the actual value of the response variable and the value of the response variable predicted using the regression.  The maximum residual is shown as 15.97.  The spread of the residuals is provided by the values of the min, max, median, Q1, and Q3 of the residuals; in this case, the spread is from -12.08 to 15.97.  Since the principle behind the regression line and the regression equation is to reduce the error, or this difference, the expectation is that the median value should be very near 0.  However, the median is 0.21, which is higher than 0.  The prediction error can go up to the maximum value of the residual.  As this value is 15.97, which is not small, this residual cannot be considered negligible.   The result also shows that the value next to the coefficient estimate is the standard error of the estimate. This specifies the uncertainty of the estimate; then comes the “t” value, which specifies how large the coefficient estimate is relative to its uncertainty.  The next value is the probability that an absolute t-value greater than the one observed would arise by chance.  The probability value (p-value), known as the value of significance, should be very small, such as 0.001, 0.005, 0.01, or 0.05, for the relationship between the response variable and the independent variable to be significant. As the result shows, the probability of error for the coefficient of eruptions is very small, almost 0, i.e., <2e-16. Thus, we reject the null hypothesis that the parameter is not significant to the model and accept the alternate hypothesis that the parameter is significant to the model.  We conclude that there is a significant relationship between the response variable waiting time and the independent variable eruptions.

The diagnostic plots of the standard regression are also discussed in this project. Figure 5 illustrates four different diagnostic plots of the standard regression.  This analysis also covers the residuals and fitted lines.  Figure 5 illustrates the Residuals vs. Fitted plot for the linear regression model of Waiting Time until Next Eruption as a function of Eruption Time in minutes.  The residuals depict the difference between the actual value of the response variable and the response variable predicted using the regression equation (Hodeghatta & Nayak, 2016).  The principle behind the regression line and the regression equation is to reduce the error, or this difference (Hodeghatta & Nayak, 2016).  The expectation is that the median value should be very near zero (Hodeghatta & Nayak, 2016).  For the model to pass the test of linearity, no pattern should exist in the distribution of the residuals (Hodeghatta & Nayak, 2016).  When there is no pattern in the distribution of the residuals, the model passes the condition of linearity (Hodeghatta & Nayak, 2016).  The plot of the fitted values against the residuals, with a line, shows the relationship between the two. A horizontal, straight line indicates that the “average residual” is more or less the same for all “fitted values” (Navarro, 2015).  The result of the linear regression for the identified variables of Eruptions and Waiting Time shows that the residual line has a curved pattern, indicating that a better model could be obtained using a quadratic term, because ideally this line should be straight and horizontal.  Figure 5 also illustrates the Normal Q-Q plot, which is used to test the normality of the distribution (Hodeghatta & Nayak, 2016). The residuals are almost on the straight line, indicating that the residuals are normally distributed. Hence, the normality test of the residuals is passed.  Figure 5 also illustrates the Scale-Location graph, which is one of the graphs generated as part of the plot command above. The points are spread in a random fashion around the line, but the spread is not equal along the line. If the line were horizontal with equally and randomly spread points, the result would indicate that the assumption of constant variance of the errors, or homoscedasticity, is fulfilled (Hodeghatta & Nayak, 2016).  Thus, it is not fulfilled in this case.  Figure 5 also illustrates the Residuals vs. Leverage plot generated for the linear regression model. In this plot, patterns are not as relevant as in the other diagnostic plots of the linear regression.  In this plot, outlying values at the upper right corner or the lower right corner are watched (Bommae, 2015).  Those spots are the places where a case can be influential against a regression line (Bommae, 2015).  When cases are outside of Cook’s distance, meaning they have high Cook’s distance scores, the cases are influential to the regression results (Bommae, 2015).  The Cook’s distance lines (red dashed lines) are far away, indicating there is no influential case.

Regression assumes that the relationship between predictors and outcomes is linear.  However, non-linear relationships between variables can exist in some cases (Navarro, 2015).   There are tools in statistics which can be employed to do non-linear regression.  Some non-linear regression models assume that the relationship between predictors and outcomes is monotonic, such as Isotonic Regression; others assume that it is smooth but not necessarily monotonic, such as Lowess Regression; and others assume that the relationship is of a known form which happens to be non-linear, such as Polynomial Regression (Navarro, 2015).  As indicated in (Dias, n.d.), Cleveland (1979) proposed the Lowess algorithm as an outlier-resistant method based on local polynomial fits. The underlying concept is to start with a local polynomial (a k-NN type fitting) least squares fit and then to use robust methods to obtain the final fit (Dias, n.d.).   The result of the polynomial regression is also addressed in this project.  The polynomial regression shows a relationship between Waiting Time until Next Eruption and Eruption Time.  The line in Figure 6 is similar to that of the Standard Linear Regression.  The Lowess Regression shows the same pattern in the relationship between Waiting Time until Next Eruption and Eruption Time.  The line in Figure 7 is similar to that of the Standard Linear Regression.  These three lines of the Standard Linear Regression, Polynomial Regression, and Lowess Regression are illustrated together for comparison in Figure 8.   The correlation coefficient result shows that there is a positive correlation, indicating a positive effect of Eruption Time on the Waiting Time until the Next Eruption.

References

Ahlemeyer-Stubbe, A., & Coleman, S. (2014). A practical guide to data mining for business and industry: John Wiley & Sons.

Bommae, K. (2015). Understanding Diagnostic Plots of Linear Regression Analysis. Retrieved from https://data.library.virginia.edu/diagnostic-plots/.

Dias, R. (n.d.). Nonparametric Regression: Lowess/Loess. Retrieved from https://www.ime.unicamp.br/~dias/loess.pdf.

Fischetti, T., Mayor, E., & Forte, R. M. (2017). R: Predictive Analysis: Packt Publishing.

Hodeghatta, U. R., & Nayak, U. (2016). Business Analytics Using R-A Practical Approach: Springer.

Ledolter, J. (2013). Data mining and business analytics with R: John Wiley & Sons.

Navarro, D. J. (2015). Learning statistics with R: A tutorial for psychology students and other beginners. R package version 0.5.

r-project.org. (2018). R: A Language and Environment for Statistical Computing. Retrieved from https://cran.r-project.org/doc/manuals/r-release/fullrefman.pdf.

 

Quantitative Analysis of “Ethanol” Dataset Using R-Programming

Dr. Aly, O.
Computer Science

Introduction

The purpose of this discussion is to discuss locally weighted scatterplot smoothing, known as the LOWESS method, for regression models based on a k-nearest-neighbor type of local fitting.  The discussion also addresses whether LOWESS is a parametric or non-parametric method.  The advantages and disadvantages of LOWESS from the computational standpoint are also addressed in this discussion.  Moreover, another purpose of this discussion is to select a dataset from http://vincentarelbundock.github.io/Rdatasets/ and perform a multiple regression analysis using R programming.  The dataset selected for this discussion is the “ethanol” dataset.  The discussion begins with Multiple Regression, the Lowess method, Lowess/Loess in R, and K-Nearest-Neighbor (k-NN), followed by the analysis of the “ethanol” dataset.

Multiple Regression

When there is more than one predictor variable, simple Linear Regression becomes Multiple Linear Regression, and the analysis becomes more involved (Kabacoff, 2011).  Polynomial Regression is typically treated as a particular case of Multiple Regression (Kabacoff, 2011).  Quadratic Regression has two predictors (X and X2), and Cubic Regression has three predictors (X, X2, and X3) (Kabacoff, 2011).  When there is more than one predictor variable, the regression coefficients indicate the increase in the dependent variable for a unit change in a predictor variable, holding all other predictor variables constant (Kabacoff, 2011).
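As a brief hedged illustration in R, the sketch below fits a multiple regression and a quadratic (polynomial) variant to the ethanol dataset analyzed later in this discussion, assuming the lattice version of the data with variables NOx, C, and E:

  • ##Multiple regression: NOx emissions on compression ratio (C) and equivalence ratio (E)
  • library(lattice)
  • data(ethanol)
  • multi.reg <- lm(NOx ~ C + E, data=ethanol)
  • summary(multi.reg)
  • ##Adding a quadratic term in E, a special case of multiple regression
  • quad.reg <- lm(NOx ~ C + E + I(E^2), data=ethanol)
  • summary(quad.reg)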

Locally Weighted Scatterplot Smoothing (Lowess) Method

Regression assumes that the relationship between predictors and outcomes is linear.  However, non-linear relationships between variables can exist in some cases (Navarro, 2015).   There are tools in statistics which can be employed to do non-linear regression.  Some non-linear regression models assume that the relationship between predictors and outcomes is monotonic, such as Isotonic Regression; others assume that it is smooth but not necessarily monotonic, such as Lowess Regression; and others assume that the relationship is of a known form which happens to be non-linear, such as Polynomial Regression (Navarro, 2015).  As indicated in (Dias, n.d.), Cleveland (1979) proposed the Lowess algorithm as an outlier-resistant method based on local polynomial fits. The underlying concept is to start with a local polynomial (a k-NN type fitting) least squares fit and then to use robust methods to obtain the final fit (Dias, n.d.).

Moreover, Lowess and Loess are non-parametric strategies for fitting a smooth curve to data points (statisticshowto.com, 2013).  “Parametric” indicates there is an assumption in advance that the data fit some distribution, e.g., a normal distribution (statisticshowto.com, 2013). Parametric fitting can lead to fitting a smooth curve which misrepresents the data because some distribution is assumed in advance (statisticshowto.com, 2013).  Thus, in those cases, non-parametric smoothers may be a better choice (statisticshowto.com, 2013).  Non-parametric smoothers like Loess try to find a curve of best fit without assuming the data must fit some distribution shape (statisticshowto.com, 2013).  In general, both types of smoothers are used for the same set of data to offset the advantages and disadvantages of each type of smoother (statisticshowto.com, 2013). The benefits of non-parametric smoothing include a flexible approach to representing data, ease of use, and easy computation (statisticshowto.com, 2013).  The disadvantages of non-parametric smoothing include the following: (1) it cannot be used to obtain a simple equation for a set of data, (2) it is less well understood than parametric smoothers, and (3) it requires a little guesswork to obtain a result (statisticshowto.com, 2013).

Lowess/Loess in R

There are two versions of the lowess or loess scatter-diagram smoothing approach implemented in R (Dias, n.d.).  The former (lowess) was implemented first, while loess is more flexible and powerful (Dias, n.d.).  Example of lowess:

  • lowess(x, y, f=2/3, iter=3, delta=0.01*diff(range(x)))

where the following model is assumed: y = b(x)+e. 

  • The “f” is the smoother span, which gives the proportion of points in the plot which influence the smooth at each value.  Larger values give more smoothness.
  • The “iter” is the number of “robustifying” iterations which should be performed; using smaller values of “iter” will make “lowess” run faster.
  • The “delta” is a tolerance: values of “x” which lie within “delta” of each other are replaced by a single value in the output from “lowess” (Dias, n.d.).

The loess() function uses a formula to specify the response and (in its application as a scatter-diagram smoother) a single predictor variable (Dias, n.d.).  The loess() function creates an object which contains the results, and the predict() function retrieves the fitted values.  These can be plotted along with the response variable (Dias, n.d.).  However, the points must be plotted in increasing order of the predictor variable for the lines() function to draw the line appropriately, which is done by applying the order() function to the predictor variable values and using explicit subscripting (in square brackets []) to arrange the observations in ascending order (Dias, n.d.).
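
As a minimal sketch of this workflow (assuming a hypothetical data frame df with response y and predictor x), the steps described above might look as follows:

  • loess.fit <- loess(y ~ x, data=df)  ##fit the local regression and store the results object
  • fitted.y <- predict(loess.fit)  ##retrieve the fitted values
  • plot(y ~ x, data=df)  ##plot the raw data
  • ord <- order(df$x)  ##indices that put the predictor in ascending order
  • lines(df$x[ord], fitted.y[ord], col="red")  ##draw the smooth in increasing order of x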

K-Nearest-Neighbor (K-NN)

The K-NN classifier is based on learning numeric attributes in an n-dimensional space. All of the training samples are stored in n-dimensional space with a unique pattern (Hodeghatta & Nayak, 2016).  When a new sample is given, the K-NN classifier searches for the pattern spaces which are closest to the sample and accordingly labels the class in the k-pattern space (called k-nearest-neighbor) (Hodeghatta & Nayak, 2016).  The “closeness” is defined in terms of Euclidean distance, where the Euclidean distance between two points, X = (x1, x2, x3, ..., xn) and Y = (y1, y2, y3, ..., yn), is defined as follows:

d(X, Y) = √[(x1 − y1)² + (x2 − y2)² + … + (xn − yn)²]

The unknown sample is assigned the nearest class among the K-NN patterns.  The aim is to look for the records which are similar to, or “near,” the record to be classified among the training records which have values close to X = (x1, x2, x3, ..., xn) (Hodeghatta & Nayak, 2016).  These records are grouped into classes based on the “closeness,” and the unknown sample looks for the class (defined by k) and identifies itself with the class which is nearest in the k-space (Hodeghatta & Nayak, 2016).   If a new record has to be classified, the classifier finds the nearest match to the record and assigns it to that class (Hodeghatta & Nayak, 2016).

The K-NN does not assume any relationship between the predictors (X) and class (Y) (Hodeghatta & Nayak, 2016).  Instead, it draws the conclusion of the class based on the similarity measures between predictors and records in the dataset (Hodeghatta & Nayak, 2016).  Although there are many potential measures, K-NN uses the Euclidean distance between the records to find the similarities and label the class (Hodeghatta & Nayak, 2016).  The predictor variables should be standardized to a common scale before computing the Euclidean distances and classifying (Hodeghatta & Nayak, 2016).  After computing the distances between records, a rule to put these records into different classes (k) is required (Hodeghatta & Nayak, 2016).  A higher value of (k) reduces the risk of overfitting due to noise in the training set (Hodeghatta & Nayak, 2016).  The value of (k) is ideally varied between 2 and 10, computing the misclassification error each time, to find the value of (k) which gives the minimum error (Hodeghatta & Nayak, 2016).

The advantages of the K-NN as a classification method include its simplicity and lack of parametric assumptions (Hodeghatta & Nayak, 2016).  It performs well for large training datasets (Hodeghatta & Nayak, 2016).  However, the disadvantages of the K-NN as a classification method include the time required to find the nearest neighbors and reduced performance when there is a large number of predictors (Hodeghatta & Nayak, 2016).
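
As an illustrative sketch only (not taken from the cited sources), a K-NN classification can be run in R with the knn() function from the class package; the data frames train.data and test.data, their column names, and the choice of k = 3 are assumptions:

  • library(class)  ##knn() is provided by the class package
  • train.X <- scale(train.data[, c("x1", "x2")])  ##standardize predictors to a common scale
  • test.X <- scale(test.data[, c("x1", "x2")], center=attr(train.X, "scaled:center"), scale=attr(train.X, "scaled:scale"))  ##apply the same scaling to the test set
  • pred.class <- knn(train=train.X, test=test.X, cl=train.data$class, k=3)  ##classify by the 3 nearest neighbors
  • table(pred.class, test.data$class)  ##misclassification table, repeated for several k to find the minimum error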

Multiple Regression Analysis for “ethanol” dataset Using R

This section is divided into five major Tasks.  The first task is to understand and examine the dataset.  Task-2, Task-3, and Task-4 are to understand density histogram, linear regression, and multiple linear regression.  Task-5 covers the discussion and analysis of the results.

Task-1:  Understand and Examine the Dataset.

The purpose of this task is to understand and examine the dataset. The description of the dataset is found in (r-project.org, 2018).  A data frame with 88 observations on the following three variables.

  • NOx Concentration of nitrogen oxides (NO and NO2) in micrograms/J.
  • C Compression ratio of the engine.
  • E Equivalence ratio–a measure of the richness of the air and ethanol fuel mixture.

#R-Commands and Results using summary(), names(), head(), dim(), and plot() functions.

  • ethanol <- read.csv("C:/CS871/Data/ethanol.csv")
  • data(ethanol)
  • summary(ethanol)
  • names(ethanol)
  • head(ethanol)
  • dim(ethanol)
  • plot(ethanol, col="red")

Figure 1.  Plot Summary of NOx, C, and E in Ethanol Dataset.

  • ethanol[1:3,]   ##First three lines
  • ethanol$NOx[1:10]  ##First 10 lines for concentration of nitrogen oxides (NOx)
  • ethanol$C[1:10]   ##First 10 lines for Compression Ratio of the Engine ( C )
  • ethanol$E[1:10]  ##First 10 lines for Equivalence Ratio ( E )
  • ##Descriptive Analysis using summary() function to analyze the central tendency.
  • summary(ethanol$NOx)
  • summary(ethanol$C)
  • summary(ethanol$E)

Task-2:  Density Histogram and Smoothed Density Histogram

  • ##Density histogram for NOx
  • hist(ethanol$NOx, freq=FALSE, col="orange")
  • install.packages("locfit")  ##locfit library is required for smoothed histogram
  • library(locfit)
  • smoothedDensity_NOx <- locfit(~lp(NOx), data=ethanol)
  • plot(smoothedDensity_NOx, col="orange", main="Smoothed Density Histogram for NOx")

Figure 2.  Density Histogram and Smoothed Density Histogram of NOx of Ethanol.

  • ##Density histogram for Equivalence Ratio ( E )
  • hist(ethanol$E, freq=FALSE, col="blue")
  • smoothedDensity_E <- locfit(~lp(E), data=ethanol)
  • plot(smoothedDensity_E, col="blue", main="Smoothed Density Histogram for Equivalence Ratio")

Figure 3.  Density Histogram and Smoothed Density Histogram of E of Ethanol.

  • ##Density histogram for Compression Ratio ( C )
  • hist(ethanol$C, freq=FALSE, col="blue")
  • smoothedDensity_C <- locfit(~lp(C), data=ethanol)
  • plot(smoothedDensity_C, col="blue", main="Smoothed Density Histogram for Compression Ratio")

Figure 4.  Density Histogram and Smoothed Density Histogram of C of Ethanol.

Task-3:  Linear Regression Model

  • ## Linear Regression
  • lin.reg.model1=lm(NOx~E, data=ethanol)
  • lin.reg.model1
  • plot(NOx~E, data=ethanol, col="blue", main="Linear Regression of NOx and Equivalence Ratio in Ethanol")
  • abline(lin.reg.model1, col="red")
  • mean.NOx=mean(ethanol$NOx, na.rm=T)
  • abline(h=mean.NOx, col="green")

Figure 5:  Linear Regression of the NOx and E in Ethanol.

  • ##local polynomial regression of NOx on the equivalent ratio
  • ##fit with a 50% nearest neighbor bandwidth.
  • local.poly.reg <- locfit(NOx~lp(E, nn=0.5), data=ethanol)
  • plot(local.poly.reg, col="blue")

Figure 6:  Smoothed Polynomial Regression of the NOx and E in Ethanol.

Figure 7.  Residuals vs. Fitted Plots.

Figure 8.  Normal Q-Q Plot.

Figure 9. Scale-Location Plot.

Figure 10.  Residuals vs. Leverage.

  • ##To better understand the linearity of the relationship represented by the model.
  • summary(lin.reg.model1)
  • plot(lin.reg.model1)
  • library(car)  ##crPlots() is provided by the car package
  • crPlots(lin.reg.model1)
  • termplot(lin.reg.model1)

Figure 11.  crPlots() Plots for the Linearity of the Relationship between NOx and Equivalence Ratio of the Model.

##Examine the Correlation between NOx and E.
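
The command itself is not shown in the original listing; a call along the following lines (assumed here) would compute the correlation reported in Task-5:

  • cor(ethanol$NOx, ethanol$E)  ##Pearson correlation between NOx and the equivalence ratio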

Task-4: Multiple Regressions

  • ##Produce Plots of some explanatory variables.
  • plot(NOx~E, ethanol, col="blue")
  • plot(NOx~C, ethanol, col="red")
  • ##Use vertical bar to find the relationship of E on NOx conditioned with C
  • coplot(NOx~E|C, panel=panel.smooth, data=ethanol, col="blue")
  • model2=lm(NOx~E*C, ethanol)
  • plot(model2, col="blue")

Figure 12. Multiple Regression – Relationship of E on NOx conditioned with C.

Figure 13. Multiple Regression Diagnostic Plot: Residual vs. Fitted.

Figure 14. Multiple Regression Diagnostic Plot: Normal Q-Q.

Figure 15. Multiple Regression Diagnostic Plot: Scale-Location.

Figure 16. Multiple Regression Diagnostic Plot: Residual vs. Leverage.

  • summary(model2)

Task-5:  Discussion and Analysis:  The result shows that the average of NOx is 1.96, which is higher than the median of 1.75, indicating a positively skewed distribution.  The average of the compression ratio of the engine ( C ) is 12.034, which is a little higher than the median of 12.00, indicating an almost normal distribution.  The average ethanol equivalence ratio measure ( E ) is 0.926, which is a little lower than the median of 0.932, indicating a slightly negatively skewed distribution that is close to a normal distribution.  In summary, the average for NOx is 1.96, for C is 12.034, and for E is 0.926.

The NOx exhaust emissions depend on two predictor variables: the fuel-air equivalence ratio ( E ) and the compression ratio ( C ) of the engine. The density of the NOx emissions and its smoothed version using the local polynomial regression are illustrated in Figure 2.  The result shows that the NOx increases as the density rises from 0.15.  However, after the density reaches 0.35, the NOx continues to increase while the density starts to drop.  Thus, there seems to be a positive relationship between NOx and density between a density of 0.15 and 0.35, after which the relationship seems to reverse into a negative direction.

The density of the Equivalence Ratio, a measure of the richness of the air and ethanol fuel mixture, and its smoothed version using the local polynomial regression are illustrated in Figure 3.  The result shows that the E varies with the density.  For instance, the density increases with the increasing value of E until the density reaches ~1.5, and then it drops while E continues to increase.  The density continues to drop until it reaches ~1.2, while E continues to increase.   The density then increases again until it reaches ~1.6 as E continues to increase.  After a density of ~1.6, the density drops again while the E value continues to increase.  Thus, in summary, the density varies while the E value keeps increasing.

The density of the Compression Ratio of the Engine and its smoothed version using the local polynomial regression are illustrated in Figure 4.  The result shows that the C increases as the density rises from ~0.09.  However, after the density reaches ~0.11, the C continues to increase while the density starts to drop.  Thus, there seems to be a positive relationship between C and density between a density of ~0.09 and ~0.11, after which the relationship seems to reverse into a negative direction.

Figure 5 illustrates the Linear Regression between the NOx and Equivalence Ratio in Ethanol. Figure 6 illustrates the Smoothed Polynomial Regression of the NOx and E in Ethanol. The result of the Linear Regression of NOx as a function of the Equivalence Ratio ( E ) shows that while the E value increases, the NOx varies, first increasing and then decreasing.  Figure 6 shows the smoothed polynomial regression of the NOx and E in ethanol, indicating the same result: there is a positive association between E and NOx, meaning that as E increases until it reaches ~0.9, NOx also increases until it reaches ~3.5. After that point, the relationship becomes negative, meaning that NOx decreases as E continues to increase.

This analysis also covers the residuals and fitted lines.  Figure 7 illustrates the Residuals vs. Fitted plot in the Linear Regression Model for NOx as a function of E.  The residuals depict the difference between the actual value of the response variable and the response variable predicted using the regression equation (Hodeghatta & Nayak, 2016).  The principle behind the regression line and the regression equation is to reduce the error or this difference (Hodeghatta & Nayak, 2016).  The expectation is that the median value should be very near to zero (Hodeghatta & Nayak, 2016).  For the model to pass the test of linearity, no pattern in the distribution of the residuals should exist (Hodeghatta & Nayak, 2016).  When there is no pattern in the distribution of the residuals, it passes the condition of linearity (Hodeghatta & Nayak, 2016).  The plot of the fitted values against the residuals with a line shows the relationship between the two. A horizontal and straight line indicates that the “average residual” for all “fitted values” is more or less the same (Navarro, 2015).

The result of the Linear Regression for the identified variables of E and NOx shows that the residuals have a curved pattern, indicating that a better model could be obtained using a quadratic term because, ideally, this line should be a straight horizontal line.  Figure 8 illustrates the Normal Q-Q Plot, which is used to test the normality of the distribution (Hodeghatta & Nayak, 2016).  Figure 8 shows that the residuals fall almost on the straight line, indicating that the residuals are normally distributed. Hence, the normality test of the residuals is passed.  Figure 9 illustrates the Scale-Location graph, which is one of the graphs generated as part of the plot command above. The points are spread in a random fashion around the horizontal line, but not equally along it. If the points were spread equally and randomly around a horizontal line, the result would indicate that the assumption of constant variance of the errors (homoscedasticity) is fulfilled (Hodeghatta & Nayak, 2016).  Thus, it is not fulfilled in this case.

Figure 10 illustrates the Residuals vs. Leverage Plot generated for the Linear Regression Model. In this plot, the patterns are not as relevant as in the other diagnostic plots of the linear regression.  Instead, the outlying values at the upper right corner or the lower right corner are watched (Bommae, 2015).  Those spots are the places where a case can be influential against a regression line (Bommae, 2015).  When cases are outside of the Cook’s distance, meaning they have high Cook’s distance scores, the cases are influential to the regression results (Bommae, 2015).  The Cook’s distance lines (red dashed lines) are far away, indicating there is no influential case.   Figure 11 illustrates the crPlots() function, which is used to better understand the linearity of the relationship represented by the model (Hodeghatta & Nayak, 2016).  Non-linearity would require re-exploring the model (Hodeghatta & Nayak, 2016).  The result in Figure 11 shows that the model created is not linear, which requires re-exploring the model. Moreover, the correlation result shows a negative correlation between NOx and E with a value of -0.11.

Figure 12 illustrates the Multiple Regression and the relationship of the Equivalence Ratio on NOx conditioned on the Compression Ratio.  Multiple Linear Regression is useful for modeling the relationship between a numeric outcome or dependent variable (Y) and multiple explanatory or independent variables (X).  The result shows that the interaction of C and E affects the NOx: while E and C increase, the NOx decreases.   Approximately 0.013 of the variation in NOx can be explained by this model (E*C).   The interaction of E and C has a negative coefficient of -0.063 on NOx.

References

Bommae, K. (2015). Understanding Diagnostic Plots of Linear Regression Analysis. Retrieved from https://data.library.virginia.edu/diagnostic-plots/.

Dias, R. (n.d.). Nonparametric Regression: Lowess/Loess. Retrieved from https://www.ime.unicamp.br/~dias/loess.pdf.

Hodeghatta, U. R., & Nayak, U. (2016). Business Analytics Using R-A Practical Approach: Springer.

Kabacoff, R. I. (2011). R in Action: Data Analysis and Graphics with R: Manning Publications.

Navarro, D. J. (2015). Learning statistics with R: A tutorial for psychology students and other beginners. R package version 0.5.

r-project.org. (2018). R: A Language and Environment for Statistical Computing. Retrieved from https://cran.r-project.org/doc/manuals/r-release/fullrefman.pdf.

statisticshowto.com. (2013). Lowess Smoothing: Overview. Retrieved from http://www.statisticshowto.com/lowess-smoothing/.

Quantitative Analysis of “State.x77” Dataset Using R-Programming

Dr. Aly, O.
Computer Science

Introduction

The purpose of this discussion is to continue working with R, using the state.x77 dataset for this assignment.  In this task, the dataset will be converted to a data frame.  Moreover, regression will be performed on the dataset.  The commands used in this discussion are derived from (r-project.org, 2018).  There are four major tasks.  The discussion begins with Task-1 to understand and examine the dataset.  Task-2 covers the data frame creation. Task-3 is to examine the data frame.  Task-4 investigates the data frame using Linear Regression analysis.  Task-4 is comprehensive as it covers the R commands, the results of the commands, and the analysis of the results.

Task-1:  Understand and Examine the dataset:

The purpose of this task is to understand and examine the dataset.  The following is a summary of the variables from the information provided in the help site as a result of ?state.x77 command:

  • Command: > ?state.x77
  • Command: > summary(state.x77)
  • Command: >head(state.x77)
  • Command: >dim(state.x77)
  • Command:  >list(state.x77)

The state.x77 dataset has 50 rows and 8 columns giving the following statistics in the respective columns.

##The first 10 lines of Income, Illiteracy, and Murder.

  • state.x77.df$Income[1:10]
  • state.x77.df$Illiteracy[1:10]
  • state.x77.df$Murder[1:10]

The descriptive statistical analysis (Central Tendency) (mean, median, min, max, 3rd quartile) of the Income, Illiteracy, and Population variables.

  • Command:>summary(state.x77.df$Income)
  • Command:>summary(state.x77.df$Illiteracy)
  • Command:>summary(state.x77.df$Population)

Task-2:  Create a Data Frame

  • Command: >state.x77.df <- data.frame(state.x77)
  • Command:>state.selected.variables <- as.data.frame(state.x77[,c("Murder", "Population", "Illiteracy", "Income", "Frost")])

Task-3: Examine the Data Frame

  • Command: > list(state.x77.df)
  • Command: >names(state.x77.df)

Task-4: Linear Regression Model – Commands, Results and Analysis:

  • plot(Income~Illiteracy, data=state.x77.df)
  • mean.Income=mean(state.x77.df$Income, na.rm=T)
  • abline(h=mean.Income, col="red")
  • model1=lm(Income~Illiteracy, data=state.x77.df)
  • model1

Figure 1.  Linear Regression Model for Income and Illiteracy.

Analysis: Figure 1 illustrates the Linear Regression between Income and Illiteracy.  The result of the Linear Regression of Income as a function of Illiteracy shows that income increases when the illiteracy percentage decreases, and vice versa, indicating an inverse relationship between illiteracy and income. More analysis of the residuals and the fitted lines is provided below using the plot() function in R.

  • Command: > plot(model1)

Figure 2.  Residuals vs. Fitted in Linear Regression Model for Income and Illiteracy.

Analysis:  Figure 2 illustrates the Residuals vs. Fitted plot in the Linear Regression Model for Income as a function of Illiteracy. The residuals depict the difference between the actual value of the response variable and the response variable predicted using the regression equation (Hodeghatta & Nayak, 2016).  The principle behind the regression line and the regression equation is to reduce the error or this difference (Hodeghatta & Nayak, 2016).  The expectation is that the median value should be very near to zero (Hodeghatta & Nayak, 2016). For the model to pass the test of linearity, no pattern in the distribution of the residuals should exist (Hodeghatta & Nayak, 2016).  When there is no pattern in the distribution of the residuals, it passes the condition of linearity (Hodeghatta & Nayak, 2016).  The plot of the fitted values against the residuals with a line shows the relationship between the two.  A horizontal and straight line indicates that the “average residual” for all “fitted values” is more or less the same (Navarro, 2015).   The result of the Linear Regression for the identified variables of Illiteracy and Income (Figure 2) shows that the residuals have a curved pattern, indicating that a better model could be obtained using a quadratic term because, ideally, this line should be a straight horizontal line.

Figure 3.  Normal Q-Q Plot of the Linear Regression Model for Illiteracy and Income.

Analysis: Figure 3 illustrates the Normal Q-Q plot, which is used to test the normality of the distribution (Hodeghatta & Nayak, 2016).  The result shows that the residuals are almost on the straight line in the preceding Normal Q-Q plot, indicating that the residuals are normally distributed.  Hence, the normality test of the residuals is passed.

Figure 4. Scale-Location Plot Generated in R to Validate Homoscedasticity for Illiteracy and Income.

Analysis: Figure 4 illustrates the Scale-Location graph, which is one of the graphs generated as part of the plot command above.  The points are spread in a random fashion around the near horizontal line, as such ensures that the assumption of constant variance of the errors (or homoscedasticity) is fulfilled (Hodeghatta & Nayak, 2016).

Figure 5. Residuals vs. Leverage Plot Generated in R for the LR Model.

Analysis: Figure 5 illustrates the Residuals vs. Leverage Plot generated for the LR Model.  In this plot of Residuals vs. Leverage, the patterns are not as relevant as in the other diagnostic plots of the linear regression.  In this plot, the outlying values at the upper right corner or the lower right corner are watched (Bommae, 2015).  Those spots are the places where a case can be influential against a regression line (Bommae, 2015).  When cases are outside of the Cook’s distance, meaning they have high Cook’s distance scores, the cases are influential to the regression results (Bommae, 2015).

##Better understand the linearity of the relationship represented by the model.

  • Command: >library(car)  ##crPlots() is provided by the car package
  • Command: >crPlots(model1)

Figure 6.  crPlots() Plots for the Linearity of the Relationship between Income and Illiteracy of the Model.

Analysis:  Figure 6 illustrates the crPlots() function, which is used to better understand the linearity of the relationship represented by the model (Hodeghatta & Nayak, 2016).  Non-linearity would require re-exploring the model (Hodeghatta & Nayak, 2016).  The result in Figure 6 shows that the model created is linear and confirms the inverse relationship between income and illiteracy analyzed above in Figure 1.

##Examine the Correlation between Income and Illiteracy.
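
The command itself is not shown here; a call along the following lines (assumed) computes this correlation from the data frame created in Task-2:

  • Command: >cor(state.x77.df$Income, state.x77.df$Illiteracy)  ##Pearson correlation between Income and Illiteracy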

Analysis: The correlation result shows a negative association between income and illiteracy as anticipated in the linear regression model.

References: 

Bommae, K. (2015). Understanding Diagnostic Plots of Linear Regression Analysis. Retrieved from https://data.library.virginia.edu/diagnostic-plots/.

Hodeghatta, U. R., & Nayak, U. (2016). Business Analytics Using R-A Practical Approach: Springer.

Navarro, D. J. (2015). Learning statistics with R: A tutorial for psychology students and other beginners. R package version 0.5.

r-project.org. (2018). R: A Language and Environment for Statistical Computing. Retrieved from https://cran.r-project.org/doc/manuals/r-release/fullrefman.pdf.

The Assumptions of General Least Square Modeling for Regression and Correlations

Dr. Aly, O.
Computer Science

Introduction

The purpose of this discussion is to compare the assumptions of General Least Square Model (GLM) modeling for regression and correlations.  This discussion also covers the issues with transforming variables to make them linear.  The procedure in R for linear regression is also addressed in this assignment.  The discussion begins with some basics such as measurement scale, correlation, and regression, followed by the main topics for this discussion.

Measurement Scale

There are three types of measurement scale.  There is nominal (categorical), such as race, color, job, sex or gender, job status, and so forth (Kometa, 2016).  There is ordinal (categorical), such as the effect of a drug (none, mild, or severe) or job importance (1-5, where 1 is not important and 5 is very important), and so forth (Kometa, 2016).  There is interval (continuous, covariate, scale metric), such as temperature (in Celsius), weight (in kg), height (in inches or cm), and so forth (Kometa, 2016). The interval variables have all the properties of nominal and ordinal variables (Bernard, 2011).  They are an exhaustive and mutually exclusive list of attributes, and the attributes have a rank-order structure (Bernard, 2011).  They have one additional property which is related to the distance between attributes (Bernard, 2011). The distance between the attributes is meaningful (Bernard, 2011). Therefore, the interval variables involve true quantitative measurement (Bernard, 2011).

Correlations

Correlation analysis is used to measure the association between two variables.  A correlation coefficient ( r ) is a statistic used for measuring the strength of a supposed linear association between two variables (Kometa, 2016).  The correlation analysis can be conducted using interval data, ordinal data, or categorical data (crosstabs) (Kometa, 2016).  The fundamental concept of the correlation requires the analysis of two variables simultaneously to find whether there is a relationship between the two sets of scores, and how strong or weak that relationship is, presuming that a relationship does, in fact, exist (Huck, Cormier, & Bounds, 2012).  There are three possible scenarios within any bivariate data set.  The first scenario is referred to as high-high, low-low when the high and low score on the first variable tend to be paired with the high and low score of the second variable respectively.  The second scenario is referred to as high-low, low-high, when the relationship represents inverse, meaning when the high and low score of the first variable tend to be paired with a low and high score of the second variable.  The third scenario is referred to as “little systematic tendency,” when some of the high and low scores on the first variable are paired with high scores on the second variable, whereas other high and low scores on the first variable are paired with low scores of the second variable (Huck et al., 2012).

The correlation coefficient varies from -1 to +1 (Huck et al., 2012; Kometa, 2016).  Any ( r ) that falls on the right side represents a positive correlation, indicating a direct relationship between the two measured variables, which can be categorized under the high-high, low-low scenario.  However, any ( r ) that falls on the left side represents a negative correlation, indicating an indirect, or inverse, relationship, which can be categorized under the high-low, low-high scenario.   If ( r ) lands on either end of the correlation continuum, the term “perfect” may be used to describe the obtained correlation. The term high comes into play when ( r ) assumes a value close to either end, implying a “strong relationship”; conversely, the term low is used when ( r ) lands close to the middle of the continuum, implying a “weak relationship.”   Any ( r ) that ends up in the middle area of the left or right side of the correlation continuum is called “moderate” (Huck et al., 2012).  Figure 1 illustrates the correlation continuum of values from -1 to +1.

Figure 1. Correlation Continuum (-1 and +1) (Huck et al., 2012).

The most common correlation coefficient is the Pearson correlation coefficient, used to measure the relationship between two interval variables (Huck et al., 2012; Kometa, 2016).  Pearson correlation is designed for situations where each of the two variables is quantitative, and each variable is measured to produce raw scores (Huck et al., 2012).  Spearman’s Rho is the second most popular bivariate correlational technique, where each of the two variables is measured to produce ranks, with the resulting correlation coefficient symbolized as rs or ρ (rho) (Huck et al., 2012).  Kendall’s Tau is similar to Spearman’s Rho (Huck et al., 2012).

Regression

When dealing with correlation and association between statistical variables, the variables are treated in a symmetric way. However, when dealing with the variables in a non-symmetric way, a predictive model for one or more response variables can be derived from one or more of the others (Giudici, 2005).  Linear Regression is a predictive data mining method (Giudici, 2005; Perugachi-Diaz & Knapik, 2017).

Linear Regression is described as the most important prediction method for continuous variables, while Logistic Regression is the main prediction method for qualitative variables (Giudici, 2005).  Cluster analysis is different from Logistic Regression and Tree Models: in cluster analysis the clustering is unsupervised and is measured with no reference variables, while in Logistic Regression and Tree Models the clustering is supervised and is measured against a reference variable, such as a response whose levels are known (Giudici, 2005).

Linear Regression is used to examine and predict data by modeling the relationship between the dependent variable, also called the “response” variable, and the independent variable, also known as the “explanatory” variable.  The purpose of Linear Regression is to find the best statistical relationship between these variables in order to predict the response variable or to examine the relationship between the variables (Perugachi-Diaz & Knapik, 2017).

Bivariate Linear Regression can be used to evaluate whether one variable, called the dependent variable or the response, can be caused, explained, and therefore predicted as a function of another variable, called the independent variable, the explanatory variable, the covariate, or the feature (Giudici, 2005).  The Y is used for the dependent or response variable, and X is used for the independent or explanatory variable (Giudici, 2005). Linear Regression is the simplest statistical model which can describe Y as a function of X (Giudici, 2005).  The Linear Regression model specifies a “noisy” linear relationship between variables Y and X, and for each paired observation (xi, yi), the following Regression Function is used (Giudici, 2005; Schumacker, 2015):

yi = a + b xi + ei

Where: 

  • i = 1, 2, …n
  • a = The intercept of the regression function.
  • b = The slope coefficient of the regression function also called the regression coefficient.
  • ei = the random error of the regression function, relative to the ith observation.

The Regression Function has two main elements: the Regression Line and the Error Term.  The Regression Line can be developed empirically, starting from the matrix of available data. The Error Term describes how well the regression line approximates the observed response variable.  The determination of the Regression Line can be described as a problem of fitting a straight line to the observed dispersion diagram, where the Regression Line is the linear function given by the following formula (Giudici, 2005):

ŷi = a + b xi

Where: 

ŷi = the fitted ith value of the dependent variable, calculated on the basis of the ith value of the explanatory variable xi.

The simple formula of the Regression Line, as indicated in (Bernard, 2011; Schumacker, 2015), is as follows:

y = a + bx

Where: 

  • y = variable value of dependent variable.
  • a and b are some constants.
  • x = the variable value of the independent variable.

The Error Term ei in the expression of the Regression Function represents, for each observation yi, the residual, namely the difference between the observed response value yi and the corresponding value fitted with the Regression Line, using the following formula (Giudici, 2005):

ei = yi − ŷi

Each residual can be interpreted as the part of the corresponding value that is not explained by the linear relationship with the explanatory variable.  To obtain the analytic expression of the regression line, it is sufficient to calculate the parameters a and b on the basis of the available data. The method of least squares is often used for this. It chooses the straight line which minimizes the sum of squares of the errors of the fit (SSE), defined by the following formula (Giudici, 2005):

SSE = Σ ei² = Σ (yi − ŷi)²

Figure 2 illustrates the representation of the regression line.

Figure 2.  Representation of the Regression Line (Giudici, 2005).

General Least Square Model (GLM) for Regression and Correlations

The Linear Regression is based on the Gauss-Markov theorem, which states that if the errors of prediction are independently distributed, sum to zero and have constant variance, then the least squares estimation of the regression weight is the best linear unbiased estimator of the population (Schumacker, 2015).   The Gauss-Markov theorem provides the rule that justifies the selection of a regression weight based on minimizing the error of prediction, which gives the best prediction of Y, which is referred to as the least squares criterion, that is, selecting regression weights based on minimizing the sum of squared errors of prediction (Schumacker, 2015). The least squares criterion is sometimes referred to as BLUE, or Best Linear Unbiased Estimator (Schumacker, 2015).

Several assumptions are made when using Linear Regression, among which is one crucial assumption known as the “independence assumption,” which is satisfied when the observations are taken on subjects which are not related in any sense (Perugachi-Diaz & Knapik, 2017).  Using this assumption, the errors of the data can be assumed to be independent (Perugachi-Diaz & Knapik, 2017).  If this assumption is violated, the errors may be dependent, and the quality of statistical inference may not follow from the classical theory (Perugachi-Diaz & Knapik, 2017).

Regression works by trying to fit a straight line through the data points so that the overall distance between the points and the line is minimized, using the statistical method called least squares.  Figure 3 illustrates an example of a scatter plot of two variables, e.g., English and Maths scores (Muijs, 2010).

Figure 3. Example of a Scatter Plot of two Variables, e.g. English and Maths Scores (Muijs, 2010).

In Pearson’s correlation, ( r ) measures how much changes in one variable correspond with equivalent changes in the other variable (Bernard, 2011). It can also be used as a measure of association between an interval and an ordinal variable, or between an interval and a dummy variable, which is a nominal variable coded as 1 or 0, present or absent (Bernard, 2011).  The square of Pearson’s r, or r-squared, is a PRE (proportionate reduction of error) measure of association for linear relations between interval variables (Bernard, 2011).  It indicates how much better the scores of a dependent variable can be predicted if the scores of some independent variables are known (Bernard, 2011).  Each dot illustrated in Figure 4 is physically distant from the dotted mean line by a certain amount. The sum of the squared distances to the mean is the smallest sum possible, which is the smallest cumulative prediction error given that only the mean of the dependent variable is known (Bernard, 2011).  The distances from the dots above the line to the mean are positive; the distances from the dots below the line to the mean are negative (Bernard, 2011).  The sum of the actual distances is zero.  Squaring the distances gets rid of the negative numbers (Bernard, 2011).  The solid line that runs diagonally through the graph in Figure 4 minimizes the prediction error for these data.  This line is called the best fitting line, the least squares line, or the regression line (Bernard, 2011).

Figure 4.  Example of a Plot of Data of TFR and “INFMORT” for Ten countries (Bernard, 2011).

Transformation of Variables for Linear Regression

The transformation of the data can involve transforming the data matrix into univariate and multivariate frequency distributions (Giudici, 2005).  It can also involve a process to simplify the statistical analysis and the interpretation of the results (Giudici, 2005).  For instance, when the p variables of the data matrix are expressed in different measurement units, it is a good idea to put all the variables into the same measurement unit so that the different measurement scales do not affect the results (Giudici, 2005).  This can be implemented using a linear transformation to standardize the variables, taking away the average of each one and dividing it by the square root of its variance (Giudici, 2005).  There are other data transformations, such as the non-linear Box-Cox transformation (Giudici, 2005).

The transformation of the data is also a method of solving problems with data quality, perhaps because items are missing or because there are anomalous values, known as outliers (Giudici, 2005).  There are two primary approaches to deal with missing data; remove it, or substitute it using the remaining data (Giudici, 2005).  The identification of anomalous values requires a formal statistical analysis; an anomalous value can seldom be eliminated as its existence often provides valuable information about the descriptive or predictive model connected to the data under examination (Giudici, 2005). 

The underlying concept behind the transformation of the variables is to correct for distributional problems, outliers, lack of linearity or unequal variances (Field, 2013).  The transformation of the variables changes the form of the relationships between variables, but the relative differences between people for a given variable stay the same. Thus, those relationships can still be quantified (Field, 2013).  However, it does change the differences between different variables because it changes the units of measurement (Field, 2013).  Thus, in the case of a relationship between variables, e.g., regression, the transformation is implemented at the problematic variable. However, in case of differences between variables such as a change in a variable over time, then the transformation is implemented for all of those variables (Field, 2013). 

There are various transformation techniques to correct various problems.  The Log Transformation (log(Xi)) method can be used to correct for positive skew, positive kurtosis, unequal variances, and lack of linearity (Field, 2013).  The Square Root Transformation (√Xi) can be used to correct for positive skew, positive kurtosis, unequal variances, and lack of linearity (Field, 2013).  The Reciprocal Transformation (1/Xi) can be used to correct for positive skew, positive kurtosis, and unequal variances (Field, 2013).  The Reverse Score Transformation can be used to correct for negative skew (Field, 2013).  Table 1 summarizes these types of transformation and their correction use.

Table 1.  Transformation of Data Methods and their Use. Adapted from (Field, 2013).
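
As a minimal sketch in R (assuming a hypothetical, positively skewed numeric variable x in a data frame df), these transformations can be applied directly before refitting a model:

  • df$log.x <- log(df$x)  ##log transformation (requires x > 0)
  • df$sqrt.x <- sqrt(df$x)  ##square root transformation (requires x >= 0)
  • df$recip.x <- 1/df$x  ##reciprocal transformation (requires x != 0)
  • df$rev.x <- max(df$x) + 1 - df$x  ##reverse score transformation for negative skew
  • hist(df$log.x)  ##inspect the distribution after transforming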

Procedures in R for Linear Regressions

In R, the “stats” package contains two different functions which can be used to estimate the intercept and slope in the linear regression equation (Schumacker, 2015). These two functions are lm() and lsfit() (Schumacker, 2015).  The lm() function uses a data frame, while lsfit() uses a matrix or data vector.  The lm() function outputs an intercept term, which has meaning when interpreting results in linear regression.   The lm() function can also specify an equation with no intercept, of the form shown below (Schumacker, 2015).

Example of lm() function with intercept on y as dependent variable and x as independent variable: 

  • LReg = lm(y ~ x, data = dataframe)

Example of lm() function with no intercept on y as dependent variable and x as independent variable: 

  • LReg = lm(y ~ 0 + x, data = dataframe) or
  • LReg = lm(y ~ x - 1, data = dataframe)

 The expectation when using the lm() function is that the response variable data is distributed normally (Hodeghatta & Nayak, 2016).  However, the independent variables are not required to be normally distributed (Hodeghatta & Nayak, 2016).  Predictors can be factors (Hodeghatta & Nayak, 2016).

#cor() function to find the correlation between variables

cor(x,y)

#To build linear regression model with R

model <- lm(y ~ x, data=dataset)
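
For a concrete, runnable illustration (using R's built-in cars dataset rather than any dataset from this discussion), the two functions can be combined as follows:

  • cor(cars$speed, cars$dist)  ##correlation between the two variables
  • model.int <- lm(dist ~ speed, data=cars)  ##linear regression with an intercept
  • model.noint <- lm(dist ~ 0 + speed, data=cars)  ##the same regression with no intercept
  • summary(model.int)  ##intercept, slope, R-squared, and residual diagnostics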

References

Bernard, H. R. (2011). Research methods in anthropology: Qualitative and quantitative approaches: Rowman Altamira.

Field, A. (2013). Discovering Statistics using IBM SPSS Statistics: Sage publications.

Giudici, P. (2005). Applied data mining: statistical methods for business and industry: John Wiley & Sons.

Hodeghatta, U. R., & Nayak, U. (2016). Business Analytics Using R-A Practical Approach: Springer.

Huck, S. W., Cormier, W. H., & Bounds, W. G. (2012). Reading statistics and research (6th ed.): Harper & Row New York.

Kometa, S. T. (2016). Getting Started With IBM SPSS Statistics for Windows: A Training Manual for Beginners (8th ed.): Pearson.

Muijs, D. (2010). Doing quantitative research in education with SPSS: Sage.

Perugachi-Diaz, Y., & Knapik, B. (2017). Correlation in Linear Regression.

Schumacker, R. E. (2015). Learning statistics using R: Sage Publications.

Quantitative Analysis of “Births2006.smpl” Dataset Using R-Programming

Dr. Aly, O.
Computer Science

Abstract

The purpose of this project is to analyze the selected dataset births2006.smpl.  The dataset is part of the R library “nutshell.” The project is divided into two main Parts.  Part-I evaluates and examines the dataset to understand it using R; it involves five significant tasks for the examination of the dataset. Part-II is about the Data Analysis of the dataset.  The Data Analysis involves nine significant tasks.  The first eight tasks involve the code and the results, with plot graphs and bar charts for analysis.   Task-9 is the last task of Part-II, covering the discussion and analysis. The most notable results include the higher number of births during the working days of Tuesday through Thursday than on the weekend, and the domination of the vaginal method over the C-section.  The results also show that the average birth weight increases among male babies for quintuplet or higher births, while the trend continues to decline among female babies. The researcher recommends further statistical significance and effect size tests to verify these results and examine the interaction among specific variables such as birth weight and Apgar score.

Keywords: Births2006.smpl; Box Plot and Graphs Analysis Using R.

Introduction

This project examines and analyzes the dataset births2006.smpl, which is part of the nutshell package of RStudio.  This dataset contains information on babies born in the United States in the year 2006.  The source of this dataset is (https://www.cdc.gov/NCHS/data_access/VitalStatsOnline.htm).  There is only one record per birth.  The dataset is a random ten percent sample of the original data (RDocumentation, n.d.).  The package which is required for this dataset is called “nutshell” in R.  The dataset contains 427,323 records, as shown below.  There are two Parts.  Part-I addresses five tasks to examine and understand the dataset using R before the analysis, as follows:

Part-II addresses the analysis using R and includes the following nine tasks, which are followed by the discussion and analysis of the results.

  • Task-1: The first five records of the dataset.
  • Task-2: The Number of Birth in 2006 per day of the week in the U.S.
  • Task-3: The Number of Birth per Delivery Method and Day of Week in 2006 in the U.S.
  • Task-4: The Number of Birth based on Birth Weight and Single or Multiple Birth Using Histogram.
  • Task-5: The Number of Birth based on Birth Weight and Delivery Method Using Histogram.
  • Task-6: Box Plot of Birth Weight Per Apgar Score.
  • Task-7: Box Plot of Birth Weight Per Day of Week.
  • Task-8: The Average of Birth Weight Per Multiple Births by Gender.
  • Task-9:  Discussion and Analysis.

Part-I:  Understanding and Examining the Dataset “births2006.smpl”

Task-1: Install Nutshell Package

The purpose of this task is to install the nutshell package which is required for this project. The births2006.smpl dataset is part of Nutshell package in R. 

  • Command: >install.packages("nutshell")
  • Command: >library(nutshell)

Task-2: Understand the Variables of the Dataset

The purpose of this task is to understand the variables of the dataset.  This dataset is part of RStudio dataset (RDocumentation, n.d.).  The main dataset is called “births2006.smpl” dataset, which includes thirteen variables as shown in Table 1. 

Table 1. The Variables of the Dataset of births2006.smpl.

This dataset contains information on babies born in the United States in the year 2006.  The source of this dataset is (https://www.cdc.gov/NCHS/data_access/VitalStatsOnline.htm).  There is only one record per birth.  The dataset is a random ten percent sample of the original data (RDocumentation, n.d.).  The package which is required for this dataset is called “nutshell” in R.  The dataset contains 427,323 records, as shown below.

  • Command:> nrow(births.dataframe)

Task-3: Examine the Datasets Using R

The purpose of this task is to examine the dataset using the RConsole. The command primarily used in this section is summary(), to understand the dataset better.

  • Command: >summary(births2006.smpl)

Task-4: Create a Data frame to represent the dataset of births2006.smpl.

            The purpose of this task is to create a data frame for the dataset.

  • Command:  >births.dataframe <- data.frame(births2006.smpl)

Task-5: Examine the Content of the Data frame using head(), names(), colnames(), and dim() functions.

            The purpose of this task is to examine the content of the data frame using the functions of the head(), names(), colnames(), and dim().

  • Command: >head(births.dataframe)
  • Command: >names(births.dataframe)
  • Command: >colnames(births.dataframe)
  • Command: >dim(births.dataframe)

Part-II: Birth Dataset Tasks and Analysis

Task-1: The First Five records of the dataset.

            The purpose of this task is to display the first five records using head() function.   

Task-2: The Number of Birth in 2006 per day of the week in the U.S.

The purpose of this task is to display a bar chart of the “frequency” of births according to the day of the week of the birth.  
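
The commands are not reproduced in this listing; a sketch that produces such a bar chart, assuming the standard DOB_WK (day-of-week) column of births2006.smpl, is:

  • births.byday <- table(births2006.smpl$DOB_WK)  ##frequency of births per day of week (1 = Sunday ... 7 = Saturday)
  • barplot(births.byday, col="blue", main="Births in 2006 per Day of Week")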

Figure 1.  Frequency of Birth in 2006 per day of the week in United States.

Task-3: The Number of Births Per Delivery Method and Day of Week in 2006 in the U.S.

The purpose of this task is to show a bar chart of the “frequency” for two-way classification of birth according to the day of the week and the method of the delivery (C-section or Vaginal).
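
A possible sketch, assuming the DMETH_REC (delivery method) and DOB_WK columns of births2006.smpl, is:

  • births.bymethod <- table(births2006.smpl$DMETH_REC, births2006.smpl$DOB_WK)  ##two-way classification of births
  • barplot(births.bymethod, beside=TRUE, legend=TRUE, main="Births per Delivery Method and Day of Week")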

Figure 2.  The Number of Births Per Delivery Method and Day of Week in 2006 in the US.

Task-4: The Number of Birth based on Birth Weight and Single or Multiple Birth Using Histogram.

The purpose of this task is to use “lattice” (trellis) graphs, via the lattice R package, to condition density histograms on the value of a third variable. Here the variable for multiple births (plurality) is the conditioning variable, and the histogram of birth weight is separated according to this variable.
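
A sketch of the conditioned histogram, assuming the DBWT (birth weight) and DPLURAL (single/multiple birth) columns, is:

  • library(lattice)  ##histogram() with a conditioning variable is provided by the lattice package
  • histogram(~DBWT | DPLURAL, data=births2006.smpl, layout=c(1, 5), col="black")  ##one panel per plurality level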

Figure 3.  The Number of the Birth based on Weight and Single or Multiple Birth.

Task-5: The Number of Birth based on Birth Weight and Delivery Method Using Histogram.

The purpose of this task is to use “lattice” (trellis) graphs, via the lattice R package, to condition density histograms on the value of a third variable. Here the variable for the method of delivery is the conditioning variable, and the histogram of birth weight is separated according to this variable.
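
An analogous sketch, conditioning on the assumed DMETH_REC (delivery method) column and with lattice already loaded, is:

  • histogram(~DBWT | DMETH_REC, data=births2006.smpl, layout=c(1, 3), col="black")  ##one panel per delivery method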

Figure 4.  The Number of the Birth based on Birth Weight and Delivery Method.

Task-6: Box Plot of Birth Weight Per Apgar Score

The purpose of this task is to use Box plot of birth weight against Apgar score and box plots of birth weight by day of the week of delivery. 
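
A sketch of this box plot, assuming the APGAR5 and DBWT columns, is:

  • boxplot(DBWT ~ APGAR5, data=births2006.smpl, xlab="Apgar Score", ylab="Birth Weight")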

Figure 5.  Box Plot of Birth

Task-7: Box Plot of Birth Weight Per Day of the Week

The purpose of this task is to use Box plot of birth weight per day of the week.
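
A sketch of this box plot, again assuming the DOB_WK and DBWT columns, is:

  • boxplot(DBWT ~ DOB_WK, data=births2006.smpl, xlab="Day of Week", ylab="Birth Weight")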

Figure 6.  Box Plot of Birth Weight Per day of the Week.

Task-8: The Average of Birth Weight Per Multiple Births by Gender.

The purpose of this task is to calculate the average birth weight as a function of multiple births for males and females separately.  In this task, the tapply function is used, and the option na.rm=TRUE is used for missing values.
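
A sketch of this calculation, assuming the DBWT, DPLURAL, and SEX columns, is:

  • avg.weight <- tapply(births2006.smpl$DBWT, INDEX=list(births2006.smpl$DPLURAL, births2006.smpl$SEX), FUN=mean, na.rm=TRUE)  ##average birth weight by plurality and gender
  • barplot(t(avg.weight), beside=TRUE, legend=TRUE, main="Average Birth Weight per Multiple Births by Gender")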

Figure 7.  Bar Plot of Average Birth Weight Per Multiple Births by Gender.

 Task-9: Discussion and Data Analysis

For the number of births in 2006 per day of the week in the United States, with days coded from Sunday (1) through Saturday (7), the result (Figure 1) shows that the highest numbers of births, which are very close to one another, occur on the working days 3, 4, and 5 (Tuesday, Wednesday, and Thursday, respectively). The lowest number of births is observed on day 1 (Sunday), followed by day 7 (Saturday), day 2 (Monday), and day 6 (Friday).

For the number of births per delivery method (C-section vs. vaginal) and day of the week in 2006 in the United States, the result (Figure 2) shows that the vaginal method dominates the delivery methods and has the highest counts on all days of the week in comparison with the C-section.  The highest numbers of vaginal births per day again occur on the working days of Tuesday, Wednesday, and Thursday.  The lowest number of vaginal births per day is on Sunday, followed by Saturday, Monday, and Friday.   The highest number of C-section births is observed on Friday, followed by Tuesday through Thursday.  The lowest number of C-section births per day is again on Sunday, followed by Saturday and Monday.

For the number of births based on birth weight and single or multiple births (twin, triplet, quadruplet, and quintuplet or higher), the result (Figure 3) shows that the single-birth frequency has an almost normal distribution.  However, for multiple births such as twin, triplet, quadruplet, and quintuplet or higher, the distribution moves further toward the left, indicating lower weight.  Thus, this result suggests that multiple births (twin, triplet, quadruplet, and quintuplet or higher) have lower birth weights on average.

For the number of births based on the birth weight and delivery method, the result (Figure 4) shows that the vaginal and C-section methods have almost the same distribution.  However, the vaginal method shows a higher percent of the total than the C-section.  The unknown delivery method shows almost the same pattern of distribution as the vaginal and C-section methods.  More analysis is required to determine the effect of the weight on the delivery method and the rate of birth.

The Apgar score is a scoring system used by doctors and nurses to evaluate newborns one minute and five minutes after the baby is born (Gill, 2018).  The Apgar scoring system is divided into five categories: activity/muscle tone, pulse/heart rate, grimace, appearance, and respiration/breathing. Each category receives a score of 0 to 2 points.  At most, a child will receive an overall score of 10 (Gill, 2018). However, a baby rarely scores a 10 in the first few moments of life, because most babies have blue hands or feet immediately after the birth (Gill, 2018).  For the birth weight per Apgar score, the result (Figure 5) shows that the median birth weight is almost the same, or close, for Apgar scores of 3-10.  The median birth weight for Apgar scores of 0 and 2 is close, while the lowest median is for Apgar score 1, within the same birth weight range of 0-2000 grams.  For birth weights from 2000-4000 grams, the median birth weight is close for Apgar scores from 3-10, at almost ~3000 grams.  The birth weight distribution varies: it is more spread out between ~1500 and 2300 grams, and the closer the Apgar score is to 10, the more the birth weight moves between ~2500 and ~3000 grams.  There are outliers in the distributions for Apgar scores 8 and 9.   These outliers show heavyweight babies above 6000 grams with Apgar scores of 8-9.  As the Apgar score increases, there are more outliers than in the distributions for lower Apgar scores.  Thus, more analysis using statistical significance tests and effect size can be performed for further investigation of the interaction of these two variables.

For the birth weight per day of the week, the result (Figure 6) shows that there is a normal distribution for the seven days of the week. The median of the birth weight for all days is almost the same.  The minimum, the maximum, and the range of the birth weight are also similar across the days of the week.  However, there are outliers in the birth weight for the working days of Tuesday, Wednesday, and Thursday.  There are additional outliers in the birth weight on Monday, as well as on Saturday, but fewer than on the working days of Tuesday-Thursday.  This result indicates that there is no relationship between the birth weight and the day of the week, as the heavyweight babies above 6000 grams reflected in the outliers tend to occur without regard to the day of the week.

For the average birth weight per multiple births by gender, the result (Figure 7) shows that the single birth has the highest birth weight for males and females, at ~3500 grams.  The birth weight tends to decrease for “twin” and “triplet” births for males and females.  The birth weight decreases further for “quadruplet” births, with a larger decrease among males than females.  The most notable result is for male babies, whose birth weight increases for “quintuplet or higher,” while the birth weight for females continues to decline for the same category.  This result confirms the impact of multiple births on the birth weight, as discussed earlier and illustrated in Figure 3.

In summary, the analysis of the births2006.smpl dataset using R indicates that the frequency of births is concentrated more on the working days than the weekends, and that the vaginal method tends to dominate the delivery methods.  Moreover, the frequency of birth based on birth weight and single or multiple births shows that single births have a more normal distribution than the multiple births.  The vaginal and C-section methods show almost similar distributions.  The birth weight per Apgar score is between ~2500-3000 grams and close among the Apgar scores of 8-10.  The days of the week do not show any difference in the birth weight. Moreover, the birth weight per gender shows that the birth weight tends to decrease with multiple births among females and males, except for the quintuplet or higher category, where it continues to decrease among females while it increases among males.  This result of the increasing birth weight among male births for quintuplet or higher requires more investigation to evaluate the reasons and causes for such an increase in the birth weight.  The researcher recommends further statistical significance and effect size tests to verify these results.

Conclusion

The project analyzed the selected dataset births2006.smpl, which is part of the R library “nutshell.” The project is divided into two main parts.  Part-I evaluated and examined the dataset for understanding it using R and involved five major tasks.  Part-II addressed the data analysis of the dataset and involved nine major tasks.  The first eight tasks involved the code and the results with plot graphs and bar charts for analysis, and the discussion and analysis were addressed in Task-9.   The most notable results showed that the number of births increases during the working days of Tuesday through Thursday relative to the weekend, and that the vaginal method dominates over the C-section.  The results also showed that the average birth weight increases among male babies for quintuplet or higher births while the trend continues to decline among female babies. The researcher recommends further statistical significance and effect size tests to verify these results and to examine the interaction among certain variables such as birth weight and Apgar score.

References

Gill, K. (2018). Apgar Score: What You Should Know. Retrieved from https://www.healthline.com/health/apgar-score#apgar-rubric.

RDocumentation. (n.d.). Births in the United States, 2006: births2006.smpl dataset. Retrieved from https://www.rdocumentation.org/packages/nutshell/versions/2.0/topics/births2006.smpl.

 

Machine Learning: Supervised Learning

Dr. Aly, O.
Computer Science

Introduction

The purpose of this discussion is to discuss supervised learning and how it can be used with large datasets to overcome the problem in which everything appears statistically significant. The discussion also addresses the importance of a clear purpose for supervised learning and the use of random sampling.

Supervised Learning (SL) Algorithm

According to (Hall, Dean, Kabul, & Silva, 2014), SL “refers to techniques that use labeled data to train a model.”  It comprises the “Prediction” (“Regression”) algorithm and the “Classification” algorithm.  The “Regression” or “Prediction” algorithm is used for “interval labels,” while the “Classification” algorithm is used for “class labels” (Hall et al., 2014).  In the SL algorithm, the training data, represented as observations, measurements, and so forth, are associated with labels reflecting the class of the observations (Han, Pei, & Kamber, 2011).  New data is classified based on the “training set” (Han et al., 2011).

The “Predictive Modeling” (PM) operation of Data Mining uses the same concept as human learning, using observation to formulate a model of specific characteristics and phenomena (Coronel & Morris, 2016).  The analysis of an existing database to determine its essential characteristics, the “model” of the data set, can be implemented using the PM operation (Coronel & Morris, 2016).  The SL algorithm develops these key characteristics represented in a “model” (Coronel & Morris, 2016).  The SL approach has two phases: (1) the Training Phase and (2) the Testing Phase.  In the Training Phase, a model is developed using a large sample of historical data called the “Training Set.”  In the Testing Phase, the model is tested on new, previously unseen data to determine its accuracy and performance characteristics.  The PM operation involves two approaches: (1) the Classification Technique and (2) the Value Prediction Technique (Connolly & Begg, 2015).  The nature of the predicted variables distinguishes the classification and value prediction techniques (Connolly & Begg, 2015).
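The two phases can be illustrated with a short R sketch. The data frame below is simulated purely for illustration; only the split into a training set and a held-out test set reflects the approach described above:

set.seed(42)
df   <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
df$y <- 3 + 2 * df$x1 - df$x2 + rnorm(200)

# Training phase: fit a model on a random 70% "training set".
train_idx <- sample(seq_len(nrow(df)), size = 0.7 * nrow(df))
model     <- lm(y ~ x1 + x2, data = df[train_idx, ])

# Testing phase: evaluate accuracy on new, previously unseen data.
test <- df[-train_idx, ]
pred <- predict(model, newdata = test)
sqrt(mean((test$y - pred)^2))   # root mean squared error on the test set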

The “Classification Technique” involves two specializations of classification: (1) “Tree Induction” and (2) “Neural Induction,” which are used to develop a predetermined class for each record in the database from a set of possible class values (Connolly & Begg, 2015).  The application of this approach can answer questions like “What is the probability that customers who are renting will be interested in purchasing a home?”

The “Value Prediction” approach, on the other hand, implements the traditional statistical methods of (1) “Linear Regression” and (2) “Non-Linear Regression,” which are used to estimate a continuous numeric value associated with a database record (Connolly & Begg, 2015).  This approach can be used for “Credit Card Fraud Detection” and “Target Mailing List Identification” (Connolly & Begg, 2015).  The limitation of this approach is that “Linear Regression” works well only with “Linear Data” (Connolly & Begg, 2015). The applications of the PM operation include (1) “Customer Retention Management,” (2) “Credit Approval,” (3) “Cross-Selling,” and (4) “Direct Marketing” (Connolly & Begg, 2015).  Furthermore, supervised methods such as Linear Regression or Multiple Linear Regression can be used if there exists a strong relationship between a response variable and various predictors (Hodeghatta & Nayak, 2016).

Clear Purpose of Supervised Learning

The purpose of the supervised learning must be clear before the implementation of the data mining process.  According to (Dhawan, 2014), the data mining process involves six steps, as follows.

  • The first step involves the exploration of the data domain.  To achieve the expected result, understanding and grasping the domain of the application helps in accumulating better data sets, which in turn determine the data mining technique to be applied. 
  • The second step involves data collection.  In this stage, the data sets on which the data mining algorithms will be run are gathered.  
  • The third step involves the refinement and transformation of the data.  In this stage, the data sets are refined to remove any noise, outliers, missing values, and other inconsistencies.  The refinement of the data is followed by its transformation for further processing, analysis, and pattern extraction.  
  • The fourth step involves feature selection.  In this stage, relevant features are selected for further processing. 
  • The fifth step involves the application of the relevant algorithm.  After the data is acquired and cleaned and features are selected, the algorithm is chosen to process the data and produce results.  Some of the commonly used algorithms include (1) clustering algorithms, (2) association rule mining algorithms, (3) decision tree algorithms, and (4) sequence mining algorithms. 
  • The last step involves the observation, analysis, and evaluation of the data.  In this step, the purpose is to find a pattern in the results produced by the algorithm.  The conclusion is typically based on the observation and evaluation of the data.

Classification is one of the data mining techniques, and classification-based data mining is a cornerstone of machine learning in artificial intelligence (Dhawan, 2014).  The process of Supervised Classification begins with given sample data, also known as a training set, which consists of multiple entries, each with multiple features.  The purpose of supervised classification is to analyze the sample data and develop an accurate understanding, or model, for each class using the attributes present in the data.  This model is then used to classify and label test data.  Thus, a precise purpose for the supervised classification is critical in order to analyze the sample data and develop an accurate model for each class.  Figure 1 illustrates the supervised classification technique in data mining as depicted in (Dhawan, 2014).

Figure 1:  Linear Overview of steps involved in Supervised Classification (Dhawan, 2014)

The conventional techniques employed in Supervised Classification involve the well-known algorithms of (1) Bayesian Classification, (2) Naïve Bayesian Classification, (3) Robust Bayesian Classifier, and (4) Decision Tree Learning.

Various Types of Sampling

A sample of records can be taken for any analysis unless the dataset is drawn from a big data infrastructure (Hodeghatta & Nayak, 2016).  A randomization technique should be used, and steps must be taken to ensure that all members of a population have an equal chance of being selected (Hodeghatta & Nayak, 2016). This method is called probability sampling.  There are several variations of this sampling type:  Random Sampling, Stratified Sampling, and Systematic Sampling (Hodeghatta & Nayak, 2016), as well as cluster and multi-stage sampling (Saunders, 2011).  In Random Sampling, a sample is picked randomly, and every member has an equal opportunity to be selected. In Stratified Sampling, the population is divided into groups, and data is selected randomly from each group or stratum.  In Systematic Sampling, members are selected systematically, for instance, every tenth member at that particular time or event (Hodeghatta & Nayak, 2016).  The most appropriate sampling technique for obtaining a representative sample should be chosen based on the research question(s) and the objectives of the research study (Saunders, 2011).
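The three main variations can be sketched in a few lines of R. The population data frame below, including its strata column, is hypothetical and used only to illustrate the mechanics:

set.seed(7)
pop <- data.frame(id = 1:1000,
                  strata = rep(c("A", "B", "C", "D"), each = 250))

# Random sampling: every member has an equal chance of selection.
random_sample <- pop[sample(nrow(pop), 100), ]

# Stratified sampling: draw randomly within each group (stratum).
stratified_sample <- do.call(rbind, lapply(split(pop, pop$strata),
                                           function(g) g[sample(nrow(g), 25), ]))

# Systematic sampling: take every tenth member.
systematic_sample <- pop[seq(1, nrow(pop), by = 10), ]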

In summary, supervised learning is comprised of Prediction or Regression, and Classification. In both approaches, a clear understanding of the SL is critical to analyze the sample data and develop an accurate understanding or model for each class using the attributes present in the data.  There are various types of sampling:  random, stratified and systematic.  The most appropriate sampling technique to obtain a representative sample should be implemented based on the research question(s) and the objectives of the research study. 

References

Connolly, T., & Begg, C. (2015). Database Systems: A Practical Approach to Design, Implementation, and Management (6th Edition ed.): Pearson.

Coronel, C., & Morris, S. (2016). Database systems: design, implementation, & management: Cengage Learning.

Dhawan, S. (2014). An Overview of Efficient Data Mining Techniques. Paper presented at the International Journal of Engineering Research and Technology.

Hall, P., Dean, J., Kabul, I. K., & Silva, J. (2014). An Overview of Machine Learning with SAS® Enterprise Miner™. SAS Institute Inc.

Han, J., Pei, J., & Kamber, M. (2011). Data mining: concepts and techniques: Elsevier.

Hodeghatta, U. R., & Nayak, U. (2016). Business Analytics Using R-A Practical Approach: Springer.

Saunders, M. N. (2011). Research methods for business students, 5/e: Pearson Education India.

R-Programming Language

Dr. Aly, O.
Computer Science

Introduction

The purpose of this discussion is to compare the statistical features of R to its programming features.  The discussion also outlines the programming features available in R in a table format. Furthermore, the discussion describes how the analytics capabilities of R are suited for Big Data.  We begin by defining R, followed by the comparison.

What is R?

R is defined in (r-project.org, n.d.) as a “language and environment for statistical computing and graphics.” The R system for statistical computing is used for data analysis and graphics (Hothorn & Everitt, 2009; Venables, Smith, & Team, 2017).  It is also described as an integrated suite of software facilities for data manipulation, calculation, and graphical display (Venables et al., 2017).  The root of R is the S language, developed by John Chambers and colleagues at Bell Laboratories (formerly AT&T, now owned by Lucent Technologies) starting in the 1970s (Hothorn & Everitt, 2009; r-project.org, n.d.; Venables et al., 2017).   The S language was designed and developed as a programming language for data analysis.  While the S language is a full-featured programming language (Hothorn & Everitt, 2009; r-project.org, n.d.), R provides a wide range of statistical techniques such as linear and non-linear modeling, classical statistical tests, time-series analysis, classification, clustering, and so forth (Venables et al., 2017; Verzani, 2014).  It also provides graphical techniques and is highly extensible (Hothorn & Everitt, 2009; r-project.org, n.d.).  It is available as Free Software under the terms of the Free Software Foundation’s GNU General Public License (r-project.org, n.d.). R has become the “lingua franca,” or common language, of statistical computing (Hothorn & Everitt, 2009).  It is becoming the primary computing engine for reproducible statistical research because of its open-source availability and its powerful language and graphical capabilities (Hothorn & Everitt, 2009).  It is developed for the Unix-like, Windows, and Mac families of operating systems (Hornik, 2016; Hothorn & Everitt, 2009; r-project.org, n.d.; Venables et al., 2017).

The R system provides an extensive, coherent, integrated collection of intermediate tools for data analysis. It also provides graphical facilities for data analysis and display, either directly on the computer or on hard copy.  The term “environment” is intended to characterize R as a fully planned and coherent system, rather than an incremental accretion of specific and inflexible tools, as is the case with other data analysis software (Venables et al., 2017).  However, most programs written in R are written for a single piece of data analysis and are inherently ephemeral (Venables et al., 2017).  The R system provides most classical statistics and much of the latest methodology (Hothorn & Everitt, 2009; Venables et al., 2017).   Furthermore, the R system has a well-developed, simple, and effective programming language which includes conditionals, loops, user-defined recursive functions, and input and output facilities (Venables et al., 2017).  As observed, R has various advantages which make it a powerful tool for data analysis.

Statistical Features vs Programming Features

With R, several statistical tests and methods can be performed, such as two-sample tests, hypothesis testing, z-tests, t-tests, chi-square tests, regression analysis, multiple linear regression, analysis of variance, and so forth (Hothorn & Everitt, 2009; r-project.org, n.d.; Schumacker, 2014; Venables et al., 2017; Verzani, 2014).  With respect to the programming features, R is an interpreted language, and it can be accessed through a command-line interpreter.  R supports matrix arithmetic.  It supports procedural programming with functions and object-oriented programming with generic functions.  Procedural programming includes procedures, records, modules, and procedure calls.  R has useful data handling and storage facilities.  Packages are part of R programming and are useful for collecting sets of R functions into a single unit.  The programming features of R include database input, exporting data, viewing data, variable labels, missing data, and so forth.  R also supports a large pool of operators for performing operations on arrays and matrices.  It has facilities to print reports of the analyses performed in the form of graphs, either on screen or on hard copy (Hothorn & Everitt, 2009; r-project.org, n.d.; Schumacker, 2014; Venables et al., 2017; Verzani, 2014).  Table 1 summarizes these features.

Table 1. Summary of the Programming Features and Statistical Features in R.
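As a small illustration of the features summarized in Table 1, the following R sketch combines programming features (matrix arithmetic, a user-defined recursive function, conditionals, loops) with a statistical feature (a t-test); the data are simulated:

# Matrix arithmetic.
m <- matrix(1:4, nrow = 2)
m %*% t(m)

# User-defined recursive function with a conditional.
factorial_r <- function(n) if (n <= 1) 1 else n * factorial_r(n - 1)
factorial_r(5)    # 120

# Loop with simple data handling and storage.
squares <- numeric(5)
for (i in 1:5) squares[i] <- i^2
squares

# A classical statistical test on simulated data.
t.test(rnorm(30, mean = 1), rnorm(30, mean = 0))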

Big Data Analytics Using R

Big Data has attracted the attention of various sectors, researchers, academia, government, and even the media (Chen, Mao, & Liu, 2014; Géczy, 2014; Kaisler, Armour, Espinosa, & Money, 2013).   Such attention is driven by the value and the opportunities that can be derived from Big Data.   The importance of Big Data has been evident in almost every sector.   There are various advanced analytical theories and methods which can be utilized on Big Data in different fields such as Medical, Finance, Manufacturing, Marketing, and more. Six such analytical models are Clustering, Association Rules, Regression, Classification, Time Series Analysis, and Text Analysis (EMC, 2015).  The Clustering, Regression, and Classification models can be used in the Medical field. The Classification model with the Decision Tree and Naïve Bayes methods has been used to diagnose patients with specific diseases such as heart disease, and to estimate the probability of a patient having a specific disease.  As an example, in (Shouman, Turner, & Stocker, 2011), the researchers performed various experiments to evaluate the Decision Tree in the diagnosis of heart disease.  The key benefit of the study was the use of multiple variants of the Decision Tree with different splitting criteria such as Information Gain, Gini Index, and Gain Ratio.  The study also performed the experiments with and without the voting technique.

Furthermore, there are four major analytics types:  Descriptive Analytics, Predictive Analytics, Prescriptive Analytics (Apurva, Ranakoti, Yadav, Tomer, & Roy, 2017; Davenport & Dyché, 2013; Mohammed, Far, & Naugler, 2014), and Diagnostic Analytics (Apurva et al., 2017).  Descriptive Analytics is used to summarize historical data to provide useful information.  Predictive Analytics is used to predict future events based on previous behavior using data mining techniques and modeling.  Prescriptive Analytics supports the use of various data-model scenarios, such as multi-variable simulation and detecting hidden relationships between different variables; it is useful for finding an optimum solution and the best course of action using algorithms.  Diagnostic Analytics focuses on explaining why a particular outcome occurred.

Moreover, many organizations have employed Big Data and Data Mining in areas including fraud detection.  Big Data Analytics can empower the healthcare industry in fraud detection to mitigate the impact of fraudulent activities.  Several use cases, such as (Halyna, 2017; Nelson, 2017), have demonstrated the positive impact of integrating Big Data Analytics into fraud detection systems.  Big Data Analytics and Data Mining employ various techniques such as the classification model, the regression model, and the clustering model.  The classification model employs logistic, tree, naïve Bayesian, and neural network algorithms and can be used for fraud detection. The regression model employs linear and k-nearest-neighbor algorithms.  The clustering model employs k-means, hierarchical, and principal component algorithms.   For instance, in (Liu & Vasarhelyi, 2013), the researchers applied the clustering technique using an unsupervised data mining approach to detect fraud by insurance subscribers.  In (Ekina, Leva, Ruggeri, & Soyer, 2013), the researchers applied Bayesian co-clustering with an unsupervised data mining method to detect conspiracy fraud involving more than one party.  In (Capelleveen, 2013), the researcher employed the outlier detection technique using an unsupervised data mining method to detect anomalies in dental claim data within Medicaid.  In (Aral, Güvenir, Sabuncuoğlu, & Akar, 2012), the researchers used distance-based correlation with hybrid supervised and unsupervised data mining methods for prescription fraud detection.   These research studies and use cases are examples of taking advantage of Big Data Analytics in healthcare fraud detection.  Thus, Big Data Analytics has been shown to play a significant role in various areas such as healthcare fraud detection.
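As a simple illustration of the clustering approach described above, the R sketch below flags unusually large claims with k-means; the claims data frame and its columns are hypothetical and simulated only to show the mechanics:

set.seed(1)
claims <- data.frame(amount = c(rnorm(95, 200, 50), rnorm(5, 2000, 100)),
                     visits = c(rpois(95, 3), rpois(5, 20)))

# Cluster on standardized variables so both columns contribute equally.
km <- kmeans(scale(claims), centers = 2, nstart = 20)

# Treat the smaller cluster as candidate outliers for manual review.
suspect <- which.min(table(km$cluster))
claims[km$cluster == suspect, ]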

Therefore, given the nature of BD and BDA, and the nature of the R language, which can be integrated with other technologies such as SQL, Hadoop (Prajapati, 2013), and Spark (spark.rstudio.com, 2018), R is becoming the primary workhorse for statistical analyses (Hothorn & Everitt, 2009), and it can be used for BDA as discussed above.  Statistical methods not only help make scientific discoveries, but also quantify the reliability, reproducibility, and general uncertainty associated with these discoveries (Ramasubramanian & Singh, 2017).  Examples of using R for BDA include (Matrix, 2006), which analyzed customer behavioral data to identify unique and actionable segments of the customer base, and (Gentleman, 2005), which used R in a genetics and molecular biology use case.
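As a hedged illustration of integrating R with Spark, the sketch below uses the sparklyr package referenced above and assumes a local Spark installation; the dataset and model are placeholders:

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Copy a small demonstration dataset into Spark and fit a regression there.
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)
fit <- ml_linear_regression(mtcars_tbl, mpg ~ wt + hp)
fit   # inspect the fitted coefficients

spark_disconnect(sc)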

In summary, the R system offers programming and statistical features which help in data analysis.  Big Data has various types of analytics such as clustering, association rules, regression, classification, time series analysis, and text analysis.  Most of these analyses are statistically based and can be leveraged by using the R language.   R has been used in various BDA sectors such as healthcare and fraud detection.

References

Apurva, A., Ranakoti, P., Yadav, S., Tomer, S., & Roy, N. R. (2017, 12-14 Oct. 2017). Redefining cyber security with big data analytics. Paper presented at the 2017 International Conference on Computing and Communication Technologies for Smart Nation (IC3TSN).

Aral, K. D., Güvenir, H. A., Sabuncuoğlu, İ., & Akar, A. R. (2012). A prescription fraud detection model. Computer methods and programs in biomedicine, 106(1), 37-46.

Capelleveen, G. C. (2013). Outlier based predictors for health insurance fraud detection within US Medicaid. The University of Twente.  

Chen, M., Mao, S., & Liu, Y. (2014). Big data: a survey. Mobile Networks and Applications, 19(2), 171-209.

Davenport, T. H., & Dyché, J. (2013). Big data in big companies. International Institute for Analytics.

Ekina, T., Leva, F., Ruggeri, F., & Soyer, R. (2013). Application of Bayesian methods in the detection of healthcare fraud.

EMC. (2015). Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data. (1st ed.): Wiley.

Géczy, P. (2014). Big data characteristics. The Macrotheme Review, 3(6), 94-104.

Gentleman, R. (2005). Reproducible research: A bioinformatics case study.

Halyna. (2017). Challenge Accomplished: Healthcare Fraud Detection Using Predictive Analytics. Retrieved from https://www.romexsoft.com/blog/healthcare-fraud-detection/.

Hornik, K. (2016). R FAQ. Retrieved from: https://CRAN.R-project.org/doc/FAQ/R-FAQ.html.

Hothorn, T., & Everitt, B. S. (2009). A handbook of statistical analyses using R: Chapman and Hall/CRC.

Kaisler, S., Armour, F., Espinosa, J. A., & Money, W. (2013). Big data: issues and challenges moving forward. Paper presented at the System Sciences (HICSS), 2013 46th Hawaii International Conference on System Sciences.

Liu, Q., & Vasarhelyi, M. (2013). Healthcare fraud detection: A survey and a clustering model incorporating Geo-location information.

Matrix, L. (2006). Using R for Customer Analytics: A Practical Introduction to R for Business Analysts.

Mohammed, E. A., Far, B. H., & Naugler, C. (2014). Applications of the MapReduce Programming Framework to Clinical Big Data Analysis: Current Landscape and Future Trends. BioData mining, 7(1), 1.

Nelson, P. (2017). Fraud Detection Powered by Big Data – An Insurance Agency’s Case Story. Retrieved from https://www.searchtechnologies.com/blog/fraud-detection-big-data.

Prajapati, V.-i. (2013). Big Data Analytics with R and Hadoop: Packt Publishing Ltd.

r-project.org. (n.d.). What is R? Retrieved from https://www.r-project.org/about.html.

Ramasubramanian, K., & Singh, A. (2017). Machine Learning Using R: Springer.

Schumacker, R. E. (2014). Learning statistics using R: Sage Publications.

Shouman, M., Turner, T., & Stocker, R. (2011). Using decision tree for diagnosing heart disease patients. Paper presented at the Proceedings of the Ninth Australasian Data Mining Conference-Volume 121.

spark.rstudio.com. (2018). R Interface For Apache Spark. Retrieved from http://spark.rstudio.com/.

Venables, W. N., Smith, D. M., & Team, R. C. (2017). Introduction To R. Retrieved from: https://cran.r-project.org/doc/manuals/R-intro.pdf, Version 3.4.2(2017-09-28).

Verzani, J. (2014). Using R for introductory statistics: CRC Press.

Proposal for Big Data Analytics in Healthcare

Dr. Aly, O.
Computer Science

Abstract

The purpose of this project is to develop a proposal for Big Data Analytics (BDA) in healthcare.  The proposal covers three major parts. Part 1 covers the Big Data Analytics Business Plan in Healthcare. Part 2 addresses the Security Policy Proposal in Healthcare.  Part 3 proposes the Business Continuity and Disaster Recovery Plan in Healthcare.  The project begins with an overview of Big Data Analytics in healthcare, discussing the opportunities and challenges in healthcare and the Big Data Analytics ecosystem in healthcare. The project covers four major components of the BDA Business Plan. Big Data Management is the first building block, with a detailed discussion of the data store types, which the healthcare organization must select based on its requirements, and a use case to demonstrate the complexity of this task.  Big Data Analytics is the second building block, which covers the technologies and tools that are required when dealing with BDA in healthcare.  Big Data Governance is the third building block, which must be implemented to ensure data protection and compliance with the existing rules.  The last building block is Big Data Applications, with a detailed discussion of the methods which can be used with BDA in healthcare, such as clustering, classification, machine learning, and so forth.  The project also proposes a Security Policy in the comprehensive discussion of Part 2.  This part discusses in detail various security measures as part of the Security Policy, such as compliance with the CIA Triad, Internal Security, Equipment Security, Information Security, and protection techniques.  The last part covers the Business Continuity and Disaster Recovery Plan in Healthcare and best practices.

Keywords: Big Data Analytics, Healthcare, Security Policy, Business Continuity, Disaster Recovery.

Introduction

Healthcare generates various types of data from various sources such as physician notes, X-Ray reports, lab reports, case histories, diet regimes, lists of doctors and nurses, national health register data, medicines and pharmacies, and RFID-based identification of expiration dates for medical tools, materials, and instruments (Archenaa & Anita, 2015; Dezyre, 2016; Wang, Kung, & Byrd, 2018).  Thus, there has been an exponentially increasing trend in generating healthcare data, which has resulted in an expenditure of 1.2 trillion on healthcare data solutions in the healthcare industry (Dezyre, 2016).  Healthcare organizations rely on Big Data technology to capture this healthcare information about patients to gain more insight into care coordination, health management, and patient engagement.   As cited in (Dezyre, 2016), McKinsey projects that the use of Big Data in the healthcare industry can reduce the expenses associated with healthcare data management by $300-$500 billion, as an example of the benefits of using BD in healthcare.

            This project discusses and analyzes various aspects of Big Data Analytics in healthcare. It begins with an overview of Big Data Analytics and its benefits and challenges in healthcare, followed by the Big Data Analytics framework in healthcare.  The primary discussion and analysis focus on three major components of this project: the Database component of the framework, the Security Policy component, and the Disaster Recovery Plan component. These three components play significant roles in BDA in the healthcare industry.

Big Data Analytics Overview in Healthcare

            The healthcare industry is continuously generating a large volume of data resulting from record keeping, patient-related data, and compliance.  As indicated in (Dezyre, 2016), the US healthcare industry generated 150 billion gigabytes, which is 150 exabytes, of data in 2011.  In the era of information technology and the digital world, the digitization of data is becoming mandatory.  The analysis of such a large volume of data is critically required to improve the quality of healthcare, minimize healthcare-related costs, and respond to any challenges effectively and promptly.  Big Data Analytics (BDA) offers excellent opportunities in the healthcare industry to discover patterns and relationships using machine learning algorithms to gain meaningful insights for sound decision making (Jee & Kim, 2013).  Although BDA provides great benefits to healthcare, the application of BDA is confronted with various challenges.  The following two sections summarize some of these benefits and challenges.

1. Big Data and Big Data Analytics Opportunities and Benefits in Healthcare

            Various research studies and reports have discussed and analyzed the benefits of BDA in healthcare.  These benefits include providing patient-centric services.  Healthcare organizations can employ BDA in areas such as detecting diseases at an early stage, providing evidence-based medicine, minimizing drug doses to avoid side effects, and delivering effective medicine based on genetic makeup.  The use of BDA can reduce re-admission rates, and thereby the healthcare-related costs for patients are also reduced.   BDA can also be used in the healthcare industry to detect spreading diseases earlier, before they spread widely, using real-time analysis.  The analysis includes social logs of patients who suffer from a disease in a particular geographical location. This analytical process can assist healthcare professionals in advising the community to take preventive measures.  Moreover, BDA is used in the healthcare industry to monitor the quality of healthcare organizations and entities such as hospitals.   Treatment methods can be improved using BDA by monitoring the effectiveness of medications (Archenaa & Anita, 2015; Raghupathi & Raghupathi, 2014; Wang et al., 2018).

Moreover, researchers and practitioners have discussed various BDA techniques in healthcare to demonstrate its benefits.  For instance, in (Aljumah, Ahamad, & Siddiqui, 2013), the researchers discussed and analyzed the application of Data Mining (DM) to predict the modes of treating diabetic patients.  The researchers of this study concluded that drug treatment for young diabetic patients could be delayed to avoid side effects, while drug treatment for older diabetic patients should be immediate, along with other treatments, as no other alternatives are available.  In (Joudaki et al., 2015; Rawte & Anuradha, 2015), the researchers used DM techniques to detect healthcare fraud and abuse, which cost the healthcare industry a fortune.  In (Landolina et al., 2012), the researchers discussed and analyzed a remote monitoring technique to reduce healthcare use and improve the quality of care in heart failure patients with implantable defibrillators.

Practical examples of Big Data Analytics in the healthcare industry include Kaiser Permanente implementing its HealthConnect system to ensure data exchange across all medical facilities and promote the use of electronic health records. AstraZeneca and HealthCore have joined an alliance to determine the most effective and economical treatments for some chronic illnesses and common diseases based on their combined data (Fox & Vaidyanathan, 2016).

            Thus, the benefits and advantages of BDA in the healthcare industry are unquestionable.  Several research studies and real applications have proven and demonstrated the significant benefits and the critical role of BDA in healthcare.  Simply put, BDA is revolutionizing the healthcare industry.

2. Big Data and Big Data Analytics Challenges in Healthcare

            Although BDA offers great opportunities to the healthcare industry, various challenges emerge from the application of BDA in healthcare.  Various research studies and reports have discussed Big Data Analytics challenges in healthcare.  As indicated in the McKinsey report (Groves, Kayyali, Knott, & Kuiken, 2016), the nature of the healthcare industry itself poses challenges to BDA.  In (Hashmi, 2013), three major challenges facing the healthcare industry are discussed.  These challenges include the episodic culture, the data puddles, and IT leadership.  The episodic culture refers to the conservative culture of healthcare and the lack of an IT-technology mindset, which has created a rigid culture.  Few healthcare providers have overcome this rigid culture and begun to use technology, and there is still a long way to go for technology to become the foundation of the healthcare industry.  The data puddles reflect the silo nature of healthcare. Silos are described by (Wicklund, 2014) as one of the biggest flaws in the healthcare industry.  The healthcare industry is falling behind other industries because it is not using technology properly, as each silo uses its own way to collect data from labs, diagnosis, radiology, emergency, case management, and so forth. Collecting data from these sources is very challenging.  As indicated in (Hashmi, 2013), most healthcare organizations lack knowledge of the basic concepts of data warehousing and data analytics.  Thus, until healthcare providers have a good understanding of the value of BDA, taking full advantage of BDA in healthcare is still a long way off.  The third challenge is IT leadership.  The lack of familiarity with the latest technologies among IT leadership in the healthcare industry is a serious challenge, as IT professionals in healthcare depend on vendors, who store the data within their tools and can control the access level even for IT professionals.  This approach limits IT advancement and knowledge of emerging technologies and the application of Big Data.

            Other research studies argued that it would be difficult to ensure that Big Data plays a vital role in the healthcare industry (Jee & Kim, 2013; Ohlhorst, 2012; Stonebraker, 2012).  The concern comes from the fact that Big Data has its own challenges, such as the complex nature of the emerging technologies, security and privacy risks, and the need for professional skills.  In (Jee & Kim, 2013), the researchers found that healthcare Big Data has unique attributes and values and poses different challenges compared to the business sector.  These healthcare challenges include the scale and scope of healthcare data, which is growing exponentially.  Healthcare Big Data can be characterized by silos, security, and variety.  Security is a primary attribute of Big Data for governments and healthcare organizations, and it requires extra care and attention in healthcare, where security, privacy, authority, and legitimacy issues are of great concern.  The “variety” attribute refers to healthcare data ranging from chart readings to lab test results to X-ray images, producing structured, unstructured, and semi-structured data, as is the case in the business sector. However, in the healthcare industry, most healthcare data are structured, such as Electronic Health Records, rather than semi-structured or unstructured.  Thus, the database used to store the data must be carefully selected when dealing with BDA in healthcare.  Figure 1 summarizes the differences between BDA in the healthcare industry and the business sector.

Figure 1. Big Data Analytics Challenges in Healthcare vs. Business Sector (Jee & Kim, 2013).

            BDA in healthcare is more challenging than in the business sector due to the nature of the healthcare industry and the data it generates.  In (Alexandru, Alexandru, Coardos, & Tudora, 2016), the researchers identified six major challenges for BDA in the healthcare industry, some of which overlap with the ones discussed earlier.  The first challenge involves interpretation and correlations, especially when dealing with a complex data structure such as a healthcare dataset.  BD increases the need for standardization and interoperability in the healthcare industry, which is very challenging because some healthcare organizations use their own data and infrastructure.   Security and privacy are major concerns in the business sector; however, they become even more of a concern when dealing with healthcare information, due to the nature of the healthcare industry. Data expertise and infrastructure are required to facilitate the analytical processing of healthcare data. However, as addressed in various studies, most healthcare organizations lack such expertise in BD and BDA, which poses challenges to BDA in healthcare. Timeliness is another challenging aspect of BDA in healthcare, as time is critical when obtaining data for clinical decisions.  While BD speeds up decision support and may make it more accurate based on the collected data, care and attention to the data and the queries are critical to ensure that time constraints are respected while still getting accurate answers.  The last challenge is IT leadership, which is in agreement with (Hashmi, 2013).   As indicated in (Liang & Kelemen, 2016), several challenges of BDA in healthcare are discussed in several studies.  Some of these challenges include data structure, data storage and transfer, inaccuracies in data, real-time analytics, and regulatory compliance. Figure 2 summarizes these challenges of BDA in the healthcare industry, derived from (Alexandru et al., 2016; Hashmi, 2013; Jee & Kim, 2013; Liang & Kelemen, 2016).

Figure 2.  Summary of BDA Challenges in Healthcare (Alexandru et al., 2016; Hashmi, 2013; Jee & Kim, 2013; Liang & Kelemen, 2016).

            This project will not address all of these challenges due to its limited scope.  The scope of this project is limited to the Database, Security, and Disaster Recovery components.  Thus, the discussion and the analysis focus on these three components, which are part of the challenges discussed above.  Before diving into these three major topics, an overview of the BDA framework in healthcare can assist in understanding the complexity of applying BDA in healthcare.

3. Big Data Analytics Ecosystem Overview in Healthcare

It is essential for healthcare organization IT professionals to understand the framework and the topology of BDA for the healthcare organization to apply the security measures needed to protect patients’ information. The new framework for the healthcare industry includes emerging technologies such as Hadoop, MapReduce, and others, which can be utilized to gain more insight in various areas.  The traditional analytic system was not found adequate to deal with a large volume of data such as the data generated by healthcare (Wang et al., 2018).  Thus, new technologies such as Hadoop, with its major components of the Hadoop Distributed File System (HDFS) and MapReduce functions, along with NoSQL databases such as HBase and Hive, emerged to handle a large volume of data using various algorithms and machine learning to extract value from such data. Data without analytics has no value.  The analytical process turns raw data into valuable information which can be used to save lives, predict diseases, decrease costs, and improve the quality of healthcare services.

Various research studies have proposed BDA frameworks for healthcare in an attempt to shed light on integrating the new technologies to generate value for healthcare.  These proposed frameworks vary.  For instance, in (Raghupathi & Raghupathi, 2014), the framework involved several layers: the Data Source Layer, the Transformation Layer, the Big Data Platform Layer, and the Big Data Analytical Application Layer.   In (Chawla & Davis, 2013), the researchers proposed a personalized, patient-centric healthcare framework, empowering patients to take a more active role in their health and the health of their families.  In (Youssef, 2014), the researcher proposed a framework for secure healthcare systems based on BDA in a Mobile Cloud Computing environment.  That framework involved Cloud Computing as the technology for handling big healthcare data, electronic health records, and the security model.

Thus, this project introduces a framework and ecosystem for BDA in healthcare organizations which integrates data governance to protect patients’ information at the various levels of data, such as data in transit and data in storage.   The researcher of this project is in agreement with the framework proposed by (Wang et al., 2018), as it is a comprehensive framework addressing various data privacy protection techniques during analytical processing.  Thus, the selected framework for this project is based on the ecosystem and topology of (Wang et al., 2018).

The framework consists of five major layers: the Data Layer, the Data Aggregation Layer, the Analytics Layer, the Information Exploration Layer, and the Data Governance Layer. Each layer has its own purpose and role in the implementation of BDA in the healthcare domain.   Figure 3 illustrates the BDA framework for healthcare organizations (Wang et al., 2018).

Figure 3.  Big Data Analytics Framework in Healthcare (Wang et al., 2018).

            The framework includes the Data Governance Layer, which controls data processing from capturing the data, through transforming the data, to the consumption of the data. The Data Governance Layer consists of three essential elements: the Master Data Management element, the Data Life-Cycle Management element, and the Data Security and Privacy Management element. These three elements of the Data Governance Layer ensure the proper use of the data and its protection from breaches and unauthorized access.

            The Data Layer represents the capture of data from various sources such as patients’ records, mobile data, social media, clinical and lab results, X-Rays, R&D labs, home care sensors, and so forth.  This data is captured in various forms: structured, semi-structured, and unstructured.  The structured data represents the traditional electronic health records (EHRs).  Video, voice, and images represent the unstructured data type. Machine-generated data forms semi-structured data, while transaction data, including patients’ information, forms structured data.  These various types of data represent the variety feature, which is one of the three primary characteristics of Big Data (volume, velocity, and variety).   The integration of these data pools is required for the healthcare industry to gain significant opportunities from BDA.

The Data Aggregation Layer consists of three significant steps to digest and handle the data: data acquisition, data transformation, and data storage.  The acquisition step is challenging because it involves reading data from various communication channels with different frequencies, sizes, and formats.  As indicated in (Wang et al., 2018), the acquisition of the data is a significant obstacle in the early stage of BDA implementation, as the captured data has varying characteristics, and the budget may be exceeded by expanding the data warehouse to avoid bottlenecks during the workload.   The transformation step involves various processing steps such as moving, cleaning, splitting, translating, merging, sorting, and validating the data.  After the data is transformed using various transformation engines, it is loaded into storage such as HDFS or a Hadoop cloud for further processing and analysis.  The principles of data storage are based on compliance regulations, data governance policies, and access controls.  The data storage techniques can be implemented using batch processing or in real time.

            The Analytics Layer involves three central operations: Hadoop MapReduce, stream computing, and in-database analytics, depending on the type of the data.  The MapReduce operation is the most popular BDA technique, as it provides the capability to process a large volume of data in batch form in a cost-effective fashion and to analyze various types of data, such as structured and unstructured data, using massively parallel processing (MPP).  Moreover, the analytical process can run in real time or near real time.  With respect to real-time data analytics, the data in motion is tracked, responses to unexpected events are made as they occur, and the next-best actions are determined quickly.  Examples include healthcare fraud detection, where stream computing is a critical analytical tool for predicting the likelihood of illegal transactions or deliberate misuse of patients’ information.  With respect to in-database analytics, the analysis is implemented through Data Mining techniques using various approaches such as Clustering, Classification, Decision Trees, and so forth.   The Data Mining technique allows data to be processed within the Data Warehouse, providing high-speed parallel processing, scalability, and optimization features with the aim of analyzing big data.  The results of the in-database analytics process are not current or real-time; however, it generates reports with static predictions, which can be used in healthcare to support preventive healthcare practices and improve pharmaceutical management.  This Analytics Layer also provides significant support for evidence-based medical practice by analyzing electronic health records (EHRs), care experience, patterns of care, patients’ habits, and medical histories (Wang et al., 2018).
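The map and reduce idea behind this layer can be illustrated conceptually in base R, without an actual Hadoop cluster. The sketch below counts hypothetical diagnosis codes across two batches of records, mirroring the per-batch “map” and the combining “reduce” steps:

batches <- list(c("I10", "E11", "I10"),    # batch 1 of records
                c("E11", "J45", "I10"))    # batch 2 of records

# Map step: emit per-batch counts of each code.
mapped <- Map(function(batch) table(batch), batches)

# Reduce step: merge the partial counts into a global count.
reduce_counts <- function(a, b) {
  keys <- union(names(a), names(b))
  sapply(keys, function(k) sum(a[k], b[k], na.rm = TRUE))
}
Reduce(reduce_counts, mapped)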

            In the Information Exploration Layer, various visualization reports, real-time information monitoring, and meaningful business insights derived from the Analytics Layer are generated to assist organizations in making better decisions in a timely fashion.  For healthcare organizations, the most critical reporting involves real-time information monitoring such as alerts and proactive notifications, real-time data navigation, and operational key performance indicators (KPIs).  This information is gathered from devices such as smartphones and personal medical devices and can be sent to interested users or made available in the form of real-time dashboards for monitoring patients’ health and preventing accidental medical events.   The value of remote monitoring has been demonstrated for diabetes, as indicated in (Sidhtara, 2015), and for heart disease, as indicated in (Landolina et al., 2012).

Part-1: Big Data Analytics Business Plan in Healthcare

Healthcare can benefit from Big Data Analytics in various domains such as decreasing overhead costs, curing and diagnosing diseases, increasing profit, predicting epidemics, and improving the quality of human life (Dezyre, 2016).  Healthcare organizations have been generating a substantial volume of data, mostly driven by regulatory requirements, record keeping, compliance, and patient care.  McKinsey projects that Big Data Analytics in healthcare can decrease the costs associated with data management by $300-$500 billion.  Healthcare data includes electronic health records (EHR), clinical reports, prescriptions, diagnostic reports, medical images, pharmacy data, insurance information such as claims and billing, social media data, and medical journals (Eswari, Sampath, & Lavanya, 2015; Ward, Marsolo, & Froehle, 2014).

Various healthcare organizations such as scientific research labs, hospitals, and other medical organizations are leveraging Big Data Analytics to reduce the costs associated with healthcare by modifying the treatment delivery models.  Some of the Big Data Analytics technologies have been applied in the healthcare industry.  For instance, Hadoop technology has been used in healthcare analytics in various domains.  Examples of Hadoop application in healthcare include cancer treatments and genomics, monitoring patient vitals, hospital network, healthcare intelligence, fraud prevention and detection (Dezyre, 2016). 

For a healthcare organization to embrace Big Data and Big Data Analytics successfully, the organization must integrate the building blocks of Big Data into the building blocks of the healthcare system.  The organization should also incorporate both sets of building blocks into the Big Data Business Plan.

As indicated in (Verhaeghe, n.d.), there are four major building blocks for Big Data Analytics.  The first building block is Big Data Management, which enables the organization to capture, store, and protect the data. The second building block is Big Data Analytics, used to extract value from the data.  Big Data Integration is the third building block, ensuring the application of governance over the data.  The last building block is Big Data Applications, through which the organization applies the first three building blocks using Big Data technologies.

1.    Building Block 1: Big Data Management

The healthcare data must be stored in a data store before it is processed for analytical purposes.  The traditional relational database was found inadequate for storing the various types of data, such as unstructured and semi-structured datasets.  Thus, new types of databases, called NoSQL, emerged as a solution to the challenges faced by the relational database.

The organization must choose the appropriate databases to store the medical records of patients safely, not only to ensure compliance with current regulations and rules such as HIPAA but also to ensure protection against data leaks. Various recent platforms supporting Big Data management focus on data storage, management, processing, distribution, and data analytics.

1.1 Data Store Types in Big Data Analytics

NoSQL stands for “Not Only SQL” (EMC, 2015; Sahafizadeh & Nematbakhsh, 2015).  NoSQL is used for modern, scalable databases in the age of Big Data.  The scalability feature enables systems to increase their throughput when demand increases during data processing (Sahafizadeh & Nematbakhsh, 2015).  A platform can incorporate two types of scalability to support the processing of Big Data: horizontal scaling and vertical scaling. Horizontal scaling allows the workload to be distributed across many servers and nodes, and servers can be added to increase throughput (Sahafizadeh & Nematbakhsh, 2015).  With vertical scaling, on the other hand, more processors, more memory, and faster hardware are installed on a single server (Sahafizadeh & Nematbakhsh, 2015).  NoSQL offers benefits such as support for mass storage, fast read and write operations, easy expansion, and low cost (Sahafizadeh & Nematbakhsh, 2015).  Examples of NoSQL databases are MongoDB, CouchDB, Redis, Voldemort, Cassandra, BigTable, Riak, HBase, Hypertable, ZooKeeper, Vertica, Neo4j, db4o, and DynamoDB.  BDA utilizes these various types of databases, which can be scaled and distributed.  The data stores are categorized into four types: document-oriented, column-oriented (or column family) stores, graph databases, and key-value stores (EMC, 2015; Hashem et al., 2015).

The purpose of the document-oriented database is to store and retrieve collections of information and documents.  It supports complex data forms in various formats such as XML and JSON, in addition to binary forms such as PDF and MS Word (EMC, 2015; Hashem et al., 2015).   A document in a document-oriented database is similar to a tuple or row in a relational database; however, the document-oriented database is more flexible and can retrieve documents and information based on their contents.  The document-oriented data store offers additional features such as the creation of indexes to increase the search performance on documents (EMC, 2015).   Document-oriented data stores can be used for managing the content of web pages, as well as for web analytics of log data (EMC, 2015). Examples of document-oriented data stores include MongoDB, SimpleDB, and CouchDB (Hashem et al., 2015).
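As a minimal sketch of interacting with a document-oriented store from R, the example below uses the mongolite package against a local MongoDB instance; the database, collection, and field names are hypothetical:

library(mongolite)

patients <- mongo(collection = "patients", db = "ehr",
                  url = "mongodb://localhost")

# Documents are flexible JSON; fields can differ from record to record.
patients$insert('{"patient_id": "P001", "age": 54,
                  "conditions": ["hypertension", "type 2 diabetes"]}')

# Retrieve documents by content rather than by a fixed schema.
patients$find('{"conditions": "hypertension"}')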

The purpose of the column-oriented database is to store content in columns rather than rows, with attribute values belonging to the same column stored contiguously (Hashem et al., 2015).  The column family database is used to store and render blog entries, tags, and viewers’ feedback. It is also used to store and update various web page metrics and counters (EMC, 2015). An example of a column-oriented database is BigTable.  In (EMC, 2015; Erl, Khattak, & Buhler, 2016), Cassandra is also listed as a column-family data store.

The key-value data store is designed to store and access data with the ability to scale to an immense size (Hashem et al., 2015).   A key-value data store contains a value and a key to access that value; the values can be complex (EMC, 2015).  A key-value data store can be useful, for example, with a login ID as the key and customer preferences as the value, or with a web session ID as the key and the session data as the value.  Examples of key-value databases include DynamoDB, HBase, Cassandra, and Voldemort (Hashem et al., 2015).  While HBase and Cassandra are described as the most popular and scalable key-value stores (Borkar, Carey, & Li, 2012), DynamoDB and Cassandra are described as the two common AP (Availability and Partition tolerance) systems (Chen, Mao, & Liu, 2014).  Others, such as (Kaoudi & Manolescu, 2015), describe Apache Accumulo, DynamoDB, and HBase as popular key-value stores.

The purpose of the graph database is to store and represent data using a graph model with nodes, edges, and properties related to one another through relations.  An example of a graph database is Neo4j (Hashem et al., 2015).  Table 1 provides examples of NoSQL data stores.

Table 1.  NoSQL Data Store Types with Examples.

1.2 Big Data Analytics Data Store Use Case in Healthcare

Various research studies have discussed and analyzed these data stores for BDA in healthcare.  Some researchers, such as (Klein et al., 2015), have struggled to find a definitive answer on the proper data store for healthcare.  In the project of (Klein et al., 2015), the researchers performed application-specific prototyping and measurement to identify NoSQL products which could fit a data model and query use cases to meet the performance requirements of the provider.  The provider had been using a thick client system running at each site around the globe and connected to a centralized relational database, and had no experience with NoSQL.  The purpose of the project was to evaluate NoSQL databases which would meet their needs.  The provider was a large healthcare provider requesting a new Electronic Health Records (EHRs) system to support healthcare delivery for over nine million patients in more than 100 facilities across the world.  The rate of data growth is more than one terabyte per month, and the data must be retained for ninety-nine years.  The NoSQL technology was considered for two major reasons: first, as a primary data store for the EHRs system; second, to improve request latency and availability by using a local cache at each site.  This EHRs system required strong replica consistency.  A comparison was performed between the identified data stores for strong replica consistency versus eventual consistency among Cassandra, MongoDB, and Riak. The results of the project indicated that the Cassandra data store demonstrated the best throughput performance, but with the highest latency, for the specific workloads and configurations tested.   The researchers attributed these results to three factors: first, Cassandra’s hash-based sharding spread the request and storage load better than MongoDB; second, Cassandra’s indexing feature allowed efficient retrieval of the most recently written records, compared to Riak; and third, Cassandra’s P2P architecture and data-center-aware features provide efficient coordination of both read and write operations across the replica nodes and the data centers.  The results also showed that MongoDB and Cassandra performed more efficiently than the Riak data store, and both provided the strong replica consistency required for this application’s data models.  The researchers concluded that MongoDB exhibited a more transparent data model mapping than Cassandra, and that the indexing capabilities of MongoDB were found to be a better fit for the application.   Moreover, the results showed that throughput varied by a factor of ten, read operation latency varied by a factor of five, and write latency by a factor of four, with the highest-throughput product delivering the highest latency. The results also showed that the throughput for workloads using strong consistency was 10-25% lower than for workloads using eventual consistency.

Quick and accurate responses in healthcare are regarded as one of the challenges discussed earlier.  The use case focused on the performance analysis of the three selected data stores (Cassandra, MongoDB, and Riak) because performance was a requirement from the provider and also a challenging aspect of BDA in healthcare.  This use case demonstrates that there is no single answer to the choice of data store for BDA in healthcare; it depends on the requirements and priorities of the healthcare organization.  With respect to performance, this research sheds light on how Cassandra, MongoDB, and Riak behave when dealing with BDA in healthcare.

2.      Building Block 2: Big Data Analytics

The organization must follow the Big Data Analytics lifecycle, which defines the analytics process for the organization's data science projects.  This process involves the six phases of the data analytics lifecycle identified by (EMC, 2015): “Discovery,” “Data Preparation,” “Model Planning,” “Model Building,” “Communicate Results,” and “Operationalize” (EMC, 2015).

“Discovery” is the first phase of the data analytics lifecycle, which determines whether there is enough information to draft an analytic plan and share it for peer review.  In this first phase, the business domain, including its relevant history, and the available resources, including technology, time, data, and people, are assessed.  During this phase, the business problem and the initial hypotheses are identified.  Moreover, the key stakeholders are identified and interviewed to understand their perspectives on the identified problem.  Also during this phase, the potential data sources are identified, the aggregate data sources are captured, the raw data is reviewed, the data structures and tools needed for the project are evaluated, and the data infrastructure, such as disk storage and network capacity, is identified and scoped.

“Data Preparation” is the second phase of the data analytics lifecycle.  During this second phase, the analytics sandbox and workspace are prepared, and the process of Extract, Transform and Load (ETL) or Extract, Load and Transform (ELT), known collectively as ETLT, is performed.  Moreover, learning about the data is very important during this phase.  Thus, access to the project data must be clarified, gaps must be identified, and datasets outside the organization must be identified.  “Data conditioning” must be implemented, which involves cleaning the data, normalizing the datasets, and performing transformations on the data.  During the “Data Preparation” phase, visualization and statistics are also applied.  Common tools for the “Data Preparation” phase include Hadoop, Alpine Miner, OpenRefine, and Data Wrangler.
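
As a hedged illustration of the "data conditioning" step described above, the following minimal pandas sketch cleans, normalizes, and transforms a tiny synthetic clinical dataset; the column names, thresholds, and values are hypothetical and are not drawn from the tools listed in the text.

```python
# A minimal sketch of "data conditioning": cleaning, normalizing, and
# transforming a small (synthetic) clinical dataset with pandas.
import pandas as pd

raw = pd.DataFrame({
    "patient_id": [1, 2, 2, 3],
    "age":        [34, None, None, 61],
    "glucose":    [95.0, 180.0, 180.0, 140.0],
})

# Clean: drop duplicate rows and fill missing ages with the median.
clean = raw.drop_duplicates()
clean = clean.assign(age=clean["age"].fillna(clean["age"].median()))

# Normalize: scale glucose to the [0, 1] range (min-max normalization).
g = clean["glucose"]
clean = clean.assign(glucose_norm=(g - g.min()) / (g.max() - g.min()))

# Transform: derive a simple categorical feature from the numeric value.
clean = clean.assign(glucose_band=pd.cut(clean["glucose"],
                                         bins=[0, 140, 200, 1000],
                                         labels=["normal", "elevated", "high"]))
print(clean)
```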

“Model Planning” is the third phase of the data analytics lifecycle.  The purpose of this step is to capture the key predictors and variables instead of considering every possible variable that might impact the outcome.  In this phase, the data is explored, the variables are selected, and the relationships between the variables are determined.  The model is identified with the aim of selecting the analytical techniques that will achieve the goals of the project.  Common tools for the “Model Planning” phase include R, SQL Analysis Services, and SAS/ACCESS.

“Model Building” is the fourth phase of the data analytics lifecycle, in which the datasets are developed for testing, training, and production purposes.  The models identified in phase three are implemented and executed, and the tools to run the identified models must be identified and examined.  Common tools for the “Model Building” phase include commercial tools such as SAS Enterprise Miner, SPSS Modeler, Matlab, Alpine Miner, and STATISTICA, and open source tools such as R and PL/R, Octave, WEKA, Python, and SQL.  According to (EMC, 2015), there are six main advanced analytical models and methods which can be utilized to analyze Big Data in different fields such as finance, medicine, manufacturing, marketing, and so forth.  These six analytical models are Clustering, Association Rules, Regression, Classification, Time Series Analysis, and Text Analysis.  The Clustering, Regression, and Classification models can be used in the medical field, though each model serves the medical field in different areas.  For instance, the Clustering model with the K-Means analytical method can be used in the medical domain for preventive measures.  The Regression model can be used in the medical field to analyze the effect of a specific medication or treatment on a patient and the probability that the patient will respond positively to a specific treatment.  The Classification model appears to be the most appropriate model for diagnosing illness; with the Decision Tree and Naïve Bayes methods, it can be used to diagnose patients with certain diseases, such as heart disease, and to estimate the probability of a patient having a particular disease.
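
The following minimal scikit-learn sketch illustrates the Classification idea mentioned above, fitting a Decision Tree and a Naïve Bayes model to a handful of synthetic patient records; the features, values, and labels are invented for illustration and do not represent any real clinical data or the specific models of (EMC, 2015).

```python
# A minimal sketch of the Classification model described above, using a
# Decision Tree and Naive Bayes to predict the presence of a disease from
# a few hypothetical patient features (the data here is synthetic).
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

# Features: [age, resting blood pressure, cholesterol]; label: 1 = disease.
X = [[63, 145, 233], [37, 130, 250], [41, 130, 204],
     [56, 120, 236], [57, 140, 192], [44, 120, 263]]
y = [1, 0, 0, 1, 1, 0]

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
nb = GaussianNB().fit(X, y)

new_patient = [[58, 150, 270]]
print("Decision Tree prediction:", tree.predict(new_patient)[0])
print("Naive Bayes probability of disease:",
      nb.predict_proba(new_patient)[0][1])
```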

“Communicate Results” is the fifth phase, which involves communicating the results to the stakeholders.  The results of the project must be judged a success or failure based on the criteria developed in the first phase of “Discovery.”  The key findings must be identified in this phase, the business value must be quantified, and a narrative summary of the findings must be communicated to the stakeholders.  “Operationalize” is the last phase of the data analytics lifecycle.  This phase involves the final report delivery, briefing, code, and technical documentation.  A pilot project may also be implemented in a production environment.

3.      Building Block 3: Big Data Integration and Governance

Big Data Integration is the third building block, which ensures the application of governance over the data.  Data governance is critical to organizations, especially in healthcare, due to several security and privacy rules, to ensure the data is stored in the right place and used correctly.  Data silos are a persistent problem for healthcare organizations, particularly for those that have been slow to integrate and apply new technologies such as Big Data Analytics or have only recently begun to recognize their value.  As cited in (Jennifer, 2016), Dr. Ibrahim, Chief Data and Analytics Officer at Saint Francis Care, suggested that the solution to the siloed organizational environment is Data Governance, which should be integrated into the overall strategic roadmap.

As indicated in the McKinsey report by (Groves et al., 2016), there are six significant steps which healthcare organizations must implement to improve technology and governance strategies for clinical and operational data.  The first step involves data ownership and security policies, which should be established and implemented to ensure that appropriate access control and security measures are configured for authorized clinical staff such as physicians, nurses, and so forth.  The second step involves the “golden sources of truth” for clinical data, which should be implemented and reinforced by the organization; this step aggregates all relevant patient information in one central location to improve population health management and accountable care organizations.  The third step involves data architecture and governance models to manage and share key clinical, operational, and transactional data sources across the organization, thereby breaking down the internal silos.  The fourth step involves a clear data model, which should be implemented by the organization to comply with all relevant standards, and a knowledge architecture which provides consistency across disparate clinical systems and external clinical data repositories.  The fifth step involves decision bodies with joint clinical and IT representation, which should be developed by the organization; these decision bodies are responsible for defining and prioritizing key data needs.  The IT role is redefined in this step as an information services broker and architect, rather than an end-to-end manager of information services.  The last step involves “informatics talent” with clinical knowledge and expertise and advanced dynamic and statistical modeling capabilities, as the traditional model in which all clinical and IT roles were separate is no longer workable in the age of Big Data Analytics.

4.      Building Block 4: Big Data Application

The healthcare organization can apply Big Data Analytics in several areas related to healthcare and patients' medical information.  Examples include three significant applications: EMR data, sensor data, and healthcare systems.

4.1 Big Data Applications for EMR Data

            This application of Big Data in EMR involves Clustering, Computational Phenotyping, Disease Progression Modelling, and Image Data Analysis. 

The Clustering technique can assist in detecting similar patients or diseases.  Because raw healthcare data is not clean, there are two types of techniques to derive meaningful clusters.  The first tends to learn robust latent representations first, followed by clustering methods.  The second adopts probabilistic clustering models which can deal with raw healthcare data effectively (Lee et al., 2017).
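
The sketch below illustrates, in a minimal and hypothetical way, how similar patients can be grouped with K-Means clustering; the two features used (age and number of chronic conditions) are stand-ins for a richer EMR-derived representation and are not from the cited study.

```python
# A minimal sketch of clustering similar patients with K-Means.
# The features and values are synthetic and purely illustrative.
import numpy as np
from sklearn.cluster import KMeans

patients = np.array([
    [25, 0], [30, 1], [28, 0],     # younger patients, few conditions
    [68, 4], [72, 5], [75, 3],     # older patients, multiple conditions
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(patients)
print("Cluster assignments:", kmeans.labels_)
print("Cluster centers:\n", kmeans.cluster_centers_)
```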

Computational Phenotyping has become a hot topic recently and has attracted the attention of a large number of researchers, as it can assist in learning robust representations from sparse, high-dimensional, noisy raw EMR data (Lee et al., 2017).  There are various types of computational phenotyping, such as rules/algorithms and latent factors or latent bases for medical features.  Doctors regard phenotypes as rules that define diagnostic or inclusion criteria.  Finding phenotypes is principally achieved as a supervised task: domain experts first select some features, and then statistical methods such as logistic regression or the chi-square test are performed to identify the significant features, for example for developing acute kidney injury during hospital admissions (Lee et al., 2017).
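
As a hedged sketch of that supervised phenotyping workflow, the code below ranks candidate binary EMR indicators with a chi-square test and fits a logistic regression on the selected ones; the data is randomly generated and the feature indices carry no clinical meaning.

```python
# A minimal sketch of supervised phenotyping: a chi-square test ranks
# candidate features, and a logistic regression is fit on the selected ones.
# All data here is synthetic.
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression

np.random.seed(0)
# Binary indicators from the EMR (e.g., lab flags, medication exposures).
X = np.random.randint(0, 2, size=(200, 10))
# Outcome: acute kidney injury during the admission (synthetic labels).
y = np.random.randint(0, 2, size=200)

selector = SelectKBest(chi2, k=3).fit(X, y)
selected = selector.get_support(indices=True)
print("Indices of the most significant features:", selected)

model = LogisticRegression().fit(X[:, selected], y)
print("Coefficients of the selected features:", model.coef_)
```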

Disease Progression Modelling (DPM) utilizes computational methods to model the progression of a specific disease.  A specific disease can be detected early with the help of DPM and therefore managed better.  For chronic diseases, patient deterioration can be delayed and healthcare outcomes improved (Lee et al., 2017).  DPM involves statistical regression methods, machine learning methods, and deep learning methods.  The statistical regression methods for DPM model the correlation between the pathological features of patients and their condition indicators; the progression of patients can then be assessed through this correlation.  Survival analysis is another approach for DPM, linking a patient's disease progression to the time before a particular outcome such as a liver transplant.  Although the statistical regression methods have proven efficient due to their simple models and computation, they cannot be generalized to all medical scenarios.  Machine learning for DPM includes various models, from graphical models such as Markov models to multi-task learning methods and artificial neural networks.  As an example, a multi-state Markov model has been proposed for predicting the progression between different stages for abdominal aortic aneurysm patients while accounting for the probability of misclassification (Lee et al., 2017).  Deep learning methods have become more widely applicable due to the robust representation and abstraction provided by their non-linear activation functions.  For instance, a variant of Long Short-Term Memory (LSTM) can be employed to model the progression of both a diabetes cohort and a mental health cohort (Lee et al., 2017).
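
To make the multi-state Markov idea concrete, the minimal sketch below propagates a distribution over disease stages forward in time using an assumed transition matrix; the states and probabilities are illustrative only and are not taken from the cited work.

```python
# A minimal sketch of a multi-state Markov progression model: given a
# transition matrix between disease stages, propagate the stage distribution
# forward in time. The transition probabilities are illustrative only.
import numpy as np

# States: 0 = mild, 1 = moderate, 2 = severe.
P = np.array([
    [0.90, 0.08, 0.02],
    [0.00, 0.85, 0.15],
    [0.00, 0.00, 1.00],   # severe is treated as an absorbing state here
])

state = np.array([1.0, 0.0, 0.0])   # a patient starts in the mild stage
for year in range(1, 6):
    state = state @ P
    print(f"Year {year}: P(mild, moderate, severe) = {np.round(state, 3)}")
```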

            Image data analysis can be used to analyze medical images such as MRI images.  Experiments show that incorporating deformable models with deep learning algorithms can achieve better accuracy and robustness for fully automatic segmentation of the left ventricle from cardiac MRI datasets (Lee et al., 2017).

4.2 Big Data Applications for Sensor Data

Big Data Applications for sensor data involve Mobile Healthcare, Environment Monitoring, and Disease Detection.  With respect to sensor data, there has been a drive and trend toward the utilization of information and communication technology (ICT) in the healthcare of a growing elderly population, due to the shortage of clinical workforce; this is called mobile healthcare or mHealth (Lee et al., 2017).  Personalized healthcare services will be provided remotely, and diagnoses, medications, and treatments will be fine-tuned for patients according to spatiotemporal and psycho-physiological conditions using advanced technologies including machine learning and high-performance computing.  The integration of chemical sensors for detecting the presence of specific molecules in the environment is another healthcare application for sensor data.  Environmental monitoring for haze, sewage water, smog emission, and so forth has become a significant worldwide problem, and with currently advanced technologies such environmental issues can be monitored.  With regard to disease detection, biochemical sensors deployed in the body can detect particular volatile organic compounds (VOCs).  The big potential of such sensor devices and BDA of VOCs will revolutionize healthcare both at home and in hospitals (Lee et al., 2017).

4.3 Big Data Applications Healthcare Systems

Some healthcare systems have been designed and developed to serve as platforms for solving various healthcare issues when deploying Big Data Analytics.  One such system is HARVEST, a healthcare system allowing doctors to view patients' longitudinal EMR data at the point of care.  It is composed of two key parts: a front-end for better visualization, and a distributed back-end which can process patients' various types of EMR data and extract informative problem concepts from patients' free-text data, measuring each concept via “salience weights” (Lee et al., 2017).  Another system, miniTUBA, assists clinical researchers in employing dynamic Bayesian networks (DBN) for data analytics on temporal datasets.  A further system, GEMINI, is proposed by (Lee et al., 2017) to address various healthcare problems such as phenotyping, disease progression modeling, treatment recommendation, and so forth.  Figure 4 illustrates the GEMINI system.

Figure 4.  GEMINI Healthcare System (Lee et al., 2017).

            These are examples of Big Data Analytics applications in healthcare.  Additional applications should be investigated to fully utilize Big Data Analytics in various domains and areas of healthcare, as the healthcare industry is a vibrant field.

Part-2: Security Policy Proposal in Healthcare

      Security and privacy are closely related, as enforcing security helps ensure the protection of patients' private and critical information.  Security and privacy are significant challenges in healthcare, and various research studies and reports have addressed the serious data security and privacy problems in the healthcare field.  The data privacy concern is driven by potential data breaches and leaks of patients' information.  As indicated in (Fox & Vaidyanathan, 2016), cyber thieves routinely target medical records.  The Federal Bureau of Investigation (FBI) issued a warning to healthcare providers to guard their data against cyber attacks after the incident at Community Health Systems Inc., which is regarded as one of the largest U.S. hospital operators; in this incident, the personal information of 4.5 million patients was stolen by Chinese hackers.  Moreover, the names and addresses of 80 million patients were stolen by hackers from Anthem, which is regarded as one of the largest U.S. health insurance companies.  Although the details of these patients' illnesses and treatments were not exposed, the incident shows how exposed the healthcare industry is to cyber attacks.

1.1      Increasing Trend of Data Breaches in Healthcare

There is an increasing trend in such privacy breaches and data loss through cyber attack incidents.  As indicated in (himss.org, 2018), medical and healthcare entities accounted for 36.5% of the reported data breaches in 2017.  According to a recent report published by the HIPAA Journal, the first three months of 2018 saw 77 healthcare data breaches reported to the Department of Health and Human Services' Office for Civil Rights (OCR).  The report added that the impact of these breaches was significant, as more than one million patients and health plan members were affected, almost twice the number of individuals impacted by healthcare data breaches in Q4 of 2017.  Figure 5 illustrates this increasing trend in healthcare data breaches (HIPAA, 2018).

Figure 5:  Q1, 2018 Healthcare Data Breaches (HIPAA, 2018).

As reported in the same report, the healthcare industry is unique with respect to data breaches because they are caused mostly by insiders; “insiders were behind the majority of breaches” (HIPAA, 2018).  Other causes involve improper disposal, loss/theft, unauthorized access/disclosure incidents, and hacking incidents.  The most significant healthcare data breaches of Q1 2018 comprised 18 security breaches, each impacting more than 10,000 individuals.  The hacking/IT incidents involved more records than any other breach cause, as illustrated in Figure 6 (HIPAA, 2018).

Figure 6.  Healthcare Records Exposed by Breach Cause (HIPAA, 2018).

Healthcare providers were the worst affected by the healthcare data breaches in Q1 of 2018.  With respect to the states, California was the worst affected, with 11 reported breaches, followed by Massachusetts with eight security incidents.

1.2      HIPAA Compliance Requirements

The Health Insurance Portability and Accountability Act (HIPAA) of 1996 is U.S. legislation which provides data privacy and security provisions for safeguarding medical information.  Healthcare organizations must comply with and meet the requirements of HIPAA.  HIPAA compliance is critical because the privacy and security of patients' information are among the most critical aspects of the healthcare domain.  The goal of security is to meet the CIA Triad of Confidentiality, Integrity, and Availability.  In the healthcare domain, organizations must apply security measures by utilizing commercial software such as Cloudera rather than open source software, which may be exposed to security holes (Fox & Vaidyanathan, 2016).

1.3      Security Policy Proposal

            The Security Policy is a document which defines the scope of security needed by the organization.  It discusses the assets which require protection and the extent to which security solutions should go to provide the necessary protection (Stewart, Chapple, & Gibson, 2015).  The Security Policy is an overview of the security requirements of the organization.  It should identify the major functional areas of data processing, clarify and define all relevant terminology, and outline the overall security strategy for the organization.  There are several types of Security Policies.  An issue-specific Security Policy focuses on a specific network service, department, function, or other aspect which is distinct from the organization as a whole.  A system-specific Security Policy focuses on individual systems or types of systems and prescribes approved hardware and software, outlines methods for locking down a system, and may even mandate firewalls or other specific security controls.  Moreover, there are three categories of Security Policies: Regulatory, Advisory, and Informative.  The Regulatory Policy is required whenever industry or legal standards are applicable to the organization; it discusses the regulations which must be followed and outlines the procedures that should be used to elicit compliance.  The Advisory Policy discusses behaviors and activities which are acceptable and defines the consequences of violations; most policies are of the advisory type.  The Informative Policy is designed to provide information or knowledge about a specific subject, such as company goals, mission statements, or how the organization interacts with partners and customers.  While Security Policies are broad overviews, standards, baselines, guidelines, and procedures include more specific, detailed information on the actual security solution (Stewart et al., 2015).

The security policy should contain the security management concepts and principles with which the organization will comply.  The primary objectives of security are contained within the security principles reflected in the CIA Triad of Confidentiality, Integrity, and Availability.  These three security principles are the most critical elements within the realm of security; however, the importance of each element in the CIA Triad depends on the requirements and the security goals and objectives of the organization.  The security policies must consider these security principles.  Moreover, the security policy should also contain additional security concepts such as Identification, Authentication, Authorization, Auditing, and Accountability (Stewart et al., 2015).

2.3.1 CIA Triad Requirements in Security

Confidentiality is the first of the security principles.  Confidentiality provides a high level of assurance that objects, data, or resources are restricted from unauthorized users.  To maintain confidentiality on the network, data must be protected from unauthorized access, use, or disclosure while in storage, in transit, and in process.  Numerous attacks focus on violating Confidentiality, including capturing network traffic and stealing password files, as well as social engineering, port scanning, shoulder surfing, eavesdropping, sniffing, and so forth.  A violation of the Confidentiality principle can also result from the actions of a system administrator or end user, an oversight in security policy, or a misconfiguration of a security control.  Numerous countermeasures can be implemented to prevent violations of the Confidentiality principle and ensure Confidentiality against possible threats.  These measures include encryption, network traffic padding, strict access control, rigorous authentication procedures, data classification, and extensive personnel training.

            Integrity is the second security principle: objects must retain their veracity and be intentionally modified only by authorized users.  The Integrity principle provides a high level of assurance that objects, data, and resources are unaltered from their original protected state; unauthorized modification should not occur for data in storage, in transit, or in processing.  The Integrity principle can be examined using three methods.  The first method is to prevent unauthorized users from making any modification.  The second method is to prevent authorized users from making unauthorized modifications, such as mistakes.  The third method is to maintain the internal and external consistency of objects so that the data is a correct and accurate reflection of the real world and any relationship, such as child, peer, or parent, is validated and verified.  Numerous attacks focus on violating the Integrity principle, using viruses, logic bombs, unauthorized access, errors in coding and applications, malicious modification, intentional replacement, and system backdoors.  A violation of Integrity can also result from an oversight in security policy or a misconfiguration of a security control.  Numerous countermeasures can be implemented to enforce the Integrity principle and ensure Integrity against possible threats.  These measures include strict access control, rigorous authentication procedures, intrusion detection systems, encryption, complete hash verification, interface restrictions, function/input checks, and extensive personnel training (Stewart et al., 2015).

            Availability is the third security principle, which grants timely and uninterrupted access to objects.  Availability provides a high level of assurance that objects, data, and resources are accessible to authorized users.  It includes efficient, uninterrupted access to objects and prevention of Denial-of-Service (DoS) attacks.  The Availability principle also means that the supporting infrastructure, such as communications, access control, and network services, is functional and allows authorized users to gain authorized access.  Numerous threats to Availability include device failure, software errors, and environmental issues such as flooding and power loss, as well as attacks such as DoS attacks, object destruction, and communication interruptions.  A violation of the Availability principle can occur as a result of the actions of any user, including administrators, an oversight in security policy, or a misconfiguration of a security control.  Numerous countermeasures can be implemented to ensure the Availability principle against possible threats.  These measures include designing intermediary delivery systems properly, using access controls effectively, monitoring performance and network traffic, and using routers and firewalls to prevent DoS attacks.  Additional countermeasures for the Availability principle include implementing redundancy for critical systems and maintaining and testing backup systems.  Most security policies and Business Continuity Planning (BCP) focus on the use of fault-tolerance features at the various levels of access, storage, and security, aiming to eliminate single points of failure and maintain the availability of critical systems (Stewart et al., 2015).

            These three security principles drive the Security Policies of organizations.  Some organizations, such as military and government organizations, tend to prioritize Confidentiality above Integrity, while private organizations tend to prioritize Availability above Confidentiality and Integrity.  However, this prioritization does not imply that the other principles are ignored or improperly addressed (Stewart et al., 2015).

Additional security concepts must be considered in the Security Policy of the organization.  These security concepts are called the Five Elements of AAA Services: Identification, Authentication, Authorization, Auditing, and Accountability.

            Identification can include providing a username, swiping a smart card, waving a proximity device, speaking a phrase, or positioning a hand, face, or finger for a camera or scanning device.  The Identification concept is fundamental because it limits access to the secured building or data to authorized users only.  Authentication requires additional information from the users; typical forms of Authentication are passwords, PINs, passphrases, or security questions.  An authenticated user is not necessarily an authorized user.

The Authorization concept reflects the privileges which are assigned to the authenticated user.  The access control matrix is evaluated to determine whether the user is authorized to access specific data or objects.  The Authorization concept is implemented using access controls such as discretionary access control (DAC), mandatory access control (MAC), or role-based access control (RBAC) (Stewart et al., 2015).
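
The following minimal sketch illustrates the role-based access control (RBAC) idea described above; the roles, permissions, and user names are hypothetical examples, not a real access control product.

```python
# A minimal sketch of role-based access control (RBAC).
# Roles, permissions, and users are hypothetical.
ROLE_PERMISSIONS = {
    "physician": {"read_record", "write_record", "order_lab"},
    "nurse":     {"read_record", "record_vitals"},
    "help_desk": {"reset_password"},
}

USER_ROLES = {"dr_smith": "physician", "nurse_lee": "nurse", "it_kim": "help_desk"}

def is_authorized(user, permission):
    """Return True if the user's role grants the requested permission."""
    role = USER_ROLES.get(user)
    return permission in ROLE_PERMISSIONS.get(role, set())

print(is_authorized("nurse_lee", "read_record"))   # True
print(is_authorized("it_kim", "read_record"))      # False
```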

            The Auditing or Monitoring concept is the programmatic technique through which a user's actions are tracked and recorded, while the user is authenticated on a system, in order to hold the user accountable for those actions.  Abnormal activities in a system are detected using Auditing and Monitoring.  Auditing and Monitoring are required to detect malicious actions by users, attempted intrusions, and system failures, and to reconstruct events, provide evidence for prosecution, and produce problem reports and analyses (Stewart et al., 2015).

            The Accountability concept is the last security concept which must be addressed in the Security Policy.  Accountability means that security can be maintained only if users are held accountable for their actions.  The Accountability concept is implemented by linking a human to the activities of an online identity through the security services and techniques of Auditing and Monitoring, Authorization, Authentication, and Identification.  Thus, Accountability rests on the strength of the Authentication process; without a robust Authentication process and techniques, there will be doubt about Accountability.  For instance, if Authentication relies only on a password, there is significant room for doubt, especially with weak passwords.  However, if the password is combined with multi-factor authentication, such as a smartcard or a fingerprint scan, there will be very little room for doubt (Stewart et al., 2015).

2.3.2 Building and Internal Security

            The Security Policy must address security access from outside the building and from inside within the building.  Some employees are authorized to enter one part of the building but not others.  Thus, the Security Policy must identify the techniques and methods which will be used to ensure access only for authorized users.  Based on the design of the building discussed earlier, employees will use the main entrance; there is no back door entrance for employees.  A badge will be used to enter the building and also to enter the area authorized for each employee.  Visitors will have another entrance because this is a healthcare organization, and visitors and patients will be directed by the help desk to the right place, such as pediatrics, emergency, and so forth.  Thus, there will be two main entrances, one for employees and another for visitors and patients.  All equipment rooms must be locked at all times, and access to these rooms must be controlled.  A strict inventory of all equipment must be maintained so that any theft can be discovered.  Access to the data centers and server rooms must be more restricted and more secure than normal equipment rooms.  The data center must be secured physically with lock systems and should not have drop ceilings.  The work areas should be divided into sections based on the security access of employees; for instance, help desk employees will not have access to the data center or server rooms.  Work areas should be restricted to employees based on their security access roles and privileges.  Any access violation will result in up to three warnings, after which a violation action will be taken against the employee, which can lead to separating the employee from the organization (Abernathy & McMillan, 2016).

2.3.3 Environmental Security

            Most security considerations revolve around preventing mischief.  However, the security team is also responsible for preventing damage to data and equipment from environmental conditions, because this is part of the Availability principle of the CIA Triad.  The Security Plan should address fire protection, fire detection, and fire suppression, and all such measures must be in place; as an example of fire protection, no hazardous materials should be used.  Concerning the power supply, there are common power issues such as prolonged high voltage and power outages.  Preventive measures against static electricity damaging components should be observed, including anti-static sprays, proper humidity levels, anti-static mats, and wristbands.  HVAC should be considered, not only for the comfort of the employees but also for the computer rooms, data centers, and server centers.  Water leakage and flooding risks should be examined, and security measures such as water detectors should be in place.  Additional environmental alarms should be installed to protect the building from any environmental events that can damage the data center or server center.  The organization will comply with these environmental measures (Abernathy & McMillan, 2016).

2.3.4 Equipment Security

            The organization must follow procedures concerning equipment and media and the use of safes and vaults for protecting other valuable physical assets.  These procedures involve security measures such as tamper protection; tampering includes defacing, damaging, or changing the configuration of a device.  Integrity verification measures should be used to look for evidence of data tampering, errors, and omissions.  Moreover, sensitive data should be encrypted to prevent its exposure in the event of theft (Abernathy & McMillan, 2016).

            An inventory of all equipment should be performed, and the relevant list should be maintained and updated regularly.  Security devices requiring physical protection include firewalls, NAT devices, and intrusion detection and prevention systems.  Tracking devices can be used to locate a device that holds critical information.  With respect to protecting physical assets such as smartphones, laptops, and tablets, locking the devices is a proper security technique (Abernathy & McMillan, 2016).

2.3.5 Information Security

With respect to Information Security, there are seven main pillars.  Figure 7 summarizes these pillars of Information Security for the healthcare organization.

  • Complete Confidentiality.
  • Available Information.
  • Traceability.
  • Reliable Information.
  • Standardized Information.
  • Follow Information Security Laws, Rules, and Standards.
  • Informed Patients and Family with Permission.

Figure 7.  Information Security Seven Pillars.

Complete Confidentiality ensures that only authorized people can access sensitive information about patients.  Confidentiality is the first principle of the CIA Triad, and in Information Security it relates to information handled by the computer system, handled manually, or communicated among employees; the ultimate goal of Confidentiality is to protect patients' information from unauthorized users.  Available Information means that healthcare professionals should have access to patients' information when needed, a feature that is very critical in healthcare cases.  The healthcare organization should keep medical records, and the systems which store these records should be trustworthy, with the information available regardless of place, person, or time.  Traceability means that actions and decisions concerning the flow of information in the Information System should be traceable through logging and documentation; Traceability can be ensured by logging, supervision of the networks, and use of digital signatures, and the Auditing and Monitoring concept discussed earlier can enforce this goal.  Reliable Information means that the information is correct; access to reliable information is very important in the healthcare organization, and preventing unauthorized users from accessing the information can enforce its reliability.  Standardized Information reflects the importance of using the same structure and concepts when recording information; the healthcare organization should comply with all standards and policies, including HIPAA, to protect patients' information.  Informed Patients and Family means that patients and their families are kept aware of the patient's health status, and the patient has to approve before any medical records are passed to relatives (Kolkowska, Hedström, & Karlsson, 2009).

2.3.6 Protection Techniques

            The Security Policy should cover protection techniques and mechanisms for security control.  These protection techniques include multiple layers or levels of access, abstraction, data hiding, and encryption.  The multi-level technique is known as defense in depth, providing multiple controls in a series; this allows numerous and different controls to guard against threats.  When organizations apply the multi-level technique, most threats are mitigated, eliminated, or thwarted, so this technique should be applied in the healthcare organization.  For instance, a single entrance is provided which has several gateways or checkpoints that must be passed in sequential order to gain entry into active areas of the building, and the same multi-layering concept can be applied to the networks.  The single sign-on technique should not be used for all employees at all levels for all applications, especially in a healthcare organization; serious consideration must be given to implementing single sign-on because it eliminates the multi-layer security technique (Stewart et al., 2015).

            The abstraction technique is used for efficiency.  Elements that are similar should be classified into groups, classes, or roles that are assigned security controls, restrictions, or permissions as a collective.  Thus, the abstraction concept is used to define the types of data an object can contain and the types of functions that can be performed on it.  It simplifies security by assigning security controls to a group of objects collected by type or function (Stewart et al., 2015).

            Data hiding is another protection technique that prevents data from being discovered or accessed by unauthorized users.  Data hiding techniques include keeping a database from being accessed by unauthorized users and restricting users at a lower classification level from accessing data at a higher classification level.  Another form of data hiding is preventing an application from accessing hardware directly.  Data hiding is a critical element in security controls and programming (Stewart et al., 2015).

            Encryption is another protection technique, used to hide the meaning or intent of a communication from unintended recipients.  Encryption can take many forms and can be applied to every type of electronic communication, including text, audio, video files, and applications.  Encryption is an essential element in security control, especially for data in transit.  Encryption comes in various types and strengths, each used for a specific purpose; examples include public key infrastructure (PKI) and cryptographic applications, and symmetric key algorithms (Stewart et al., 2015).
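
As a hedged, minimal illustration of hiding the content of a message from unintended recipients, the sketch below uses symmetric encryption from the third-party "cryptography" package (assumed to be installed); the message and key handling are illustrative only and do not represent the organization's actual key management.

```python
# A minimal sketch of symmetric encryption for data in transit or at rest,
# using the "cryptography" package. Message and key handling are illustrative.
from cryptography.fernet import Fernet

key = Fernet.generate_key()          # in practice, keys come from a key store
cipher = Fernet(key)

plaintext = b"Patient 12345: lab result pending"
token = cipher.encrypt(plaintext)    # unreadable to unintended recipients
print(token)

recovered = cipher.decrypt(token)    # only holders of the key can recover it
print(recovered.decode())
```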

Part-3: Business Continuity and Disaster Recovery Plan

Organizations and businesses are confronted with disasters, whether caused by nature, such as a hurricane or earthquake, or by human-made calamities, such as fire or burst water pipes.  Thus, organizations and businesses must be prepared for such disasters to recover and ensure business continuity in the middle of these sudden damages.  The critical importance of planning for business continuity and disaster recovery has led the International Information System Security Certification Consortium, (ISC)², to include the two processes of Business Continuity and Disaster Recovery in the Common Body of Knowledge for the CISSP program (Abernathy & McMillan, 2016; Stewart et al., 2015).

3.1  Business Continuity Planning (BCP)

Business Continuity Planning (BCP) involves the assessment of the risks to organizational processes and the development of policies, plans, and processes to minimize the impact of those risks if they occur.  Organizations must implement BCP to maintain the continuous operation of the business if any disaster occurs.  The BCP emphasizes keeping and maintaining business operations with reduced or restricted infrastructure capabilities or resources, and it can be used to manage and restore the environment.  If the continuity of the business is broken, the business processes have ceased and the organization is in disaster mode, which should follow the Disaster Recovery Planning (DRP).  The top priority of both BCP and DRP is always people: the main concern is to get people out of harm's way, and only then can the organization address IT recovery and restoration issues (Abernathy & McMillan, 2016; Stewart et al., 2015).

3.1.1 NIST Seven Steps for Business Continuity Planning

As indicated in (Abernathy & McMillan, 2016), NIST Special Publication (SP) 800-34 Revision 1 (R1) identifies seven steps.  The first step involves the development of the contingency planning policy.  The second step involves performing the Business Impact Analysis.  The third step is to identify preventive controls.  The development of recovery strategies is the fourth step.  The fifth step involves the development of the BCP.  The sixth step involves testing, training, and exercises.  The last step is to maintain the plan.  Figure 8 summarizes these seven steps identified by NIST.

Figure 8.  Summary of the Business Continuity Steps (Abernathy & McMillan, 2016).

3.2  Disaster Recovery Planning

            In case a disaster event occurs, the organization must have in place a strategy and plan to recover from it.  Organizations and businesses are exposed to various types of disasters; however, these are categorized as either disasters caused by nature or disasters caused by humans.  Nature-related disasters include earthquakes, floods, storms, hurricanes, volcanoes, and fires.  Human-made disasters include intentionally caused fires, acts of terrorism, explosions, and power outages; other disasters can be caused by hardware and software failures, strikes and picketing, and theft and vandalism.  Thus, the organization must be prepared and ready to recover from any disaster.  Moreover, the organization must document the Disaster Recovery Plan and provide training to personnel (Stewart et al., 2015).

3.2.1        Fault Tolerance and System Resilience

            The security CIA Triad involves Confidentiality, Integrity, and Availability, and fault tolerance and system resilience directly affect one element of this triad: Availability.  The underlying concept behind fault tolerance and system resilience is to eliminate single points of failure.  A single point of failure is any component whose failure can cause the entire system to fail.  For instance, if a computer stores data on a single disk, the failure of that disk can cause the computer to fail, so the disk is a single point of failure.  Another example involves databases: when a single database serves multiple web servers, the database becomes a single point of failure (Stewart et al., 2015).

            Fault tolerance is the ability of a system to suffer a fault but continue to operate.  Fault tolerance is implemented by adding redundant components, such as additional disks within a redundant array of independent (sometimes called inexpensive) disks (RAID), or additional servers within a failover cluster configuration.  System resilience is the ability of a system to maintain an acceptable level of service during an adverse event and to return to its previous state.  For instance, if a primary server in a failover cluster fails, fault tolerance ensures failover to another system; system resilience means the cluster can fail back to the original server after it is repaired (Stewart et al., 2015).

3.2.2        Hard Drives Protection Strategy

            The organization must have a plan and strategy to protect the hard drives from single points of failure and provide fault tolerance and system resilience.  The typical technique is to add a redundant array of independent disks (RAID).  A RAID array includes two or more disks, and most RAID configurations will continue to operate even after one of the disks fails.  There are various types of arrays, and the organization must utilize the proper RAID level for fault tolerance and system resilience.  Figure 9 summarizes some of the standard RAID configurations.

Figure 9. A Summary of the Common RAID Configurations.
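
As a hedged illustration of how these configurations trade usable capacity for redundancy, the sketch below computes usable capacity for a few common RAID levels under the standard textbook relationships; the disk counts and sizes are illustrative, not vendor-specific figures.

```python
# A minimal sketch of usable capacity across common RAID levels, assuming
# n identical disks of a given size. Numbers are illustrative only.
def usable_capacity(level, n_disks, disk_tb):
    if level == "RAID 0":            # striping, no redundancy
        return n_disks * disk_tb
    if level == "RAID 1":            # mirroring (two disks)
        return disk_tb
    if level == "RAID 5":            # striping with one disk's worth of parity
        return (n_disks - 1) * disk_tb
    if level == "RAID 6":            # striping with two disks' worth of parity
        return (n_disks - 2) * disk_tb
    if level == "RAID 10":           # mirrored stripes
        return (n_disks // 2) * disk_tb
    raise ValueError("unknown RAID level")

for level in ["RAID 0", "RAID 5", "RAID 6", "RAID 10"]:
    print(level, "with 6 x 2 TB disks:", usable_capacity(level, 6, 2), "TB usable")
```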

3.2.3        Servers Protection

            Fault tolerance can be added to critical servers with failover clusters.  A failover cluster includes two or more servers or nodes; if one server fails, another server in the cluster can take over its load automatically using the “failover” process.  Failover clusters can also provide fault tolerance for multiple devices or applications.  A typical fault-tolerant topology includes multiple web servers behind network load balancing, multiple database servers at the backend (also behind a load balancer), and RAID arrays for redundancy.  Figure 10 illustrates a simple failover cluster with network load balancing, adapted from (Stewart et al., 2015).

Figure 10.  Failover Cluster with Network Load Balancing (Stewart et al., 2015).

3.2.4        Power Sources Protection Using Uninterruptible Power Supply

            The organization must also protect power sources with an uninterruptible power supply (UPS), a generator, or both to ensure a fault-tolerant environment.  The UPS provides battery-supplied power for a short period, between 5 and 30 minutes, while the generator provides long-term power.  The goal of using a UPS is to provide power long enough to complete a logical shutdown of a system, or until a generator is powered on to provide stable power.

3.2.5        Trusted and Secure Recovery

The organization must ensure that the recovered environment is secure and protected against malicious attacks.  Thus, the system administrator, together with the security professional, must ensure that the system can be trusted by the users.  A system can be designed to fail in a fail-secure state or a fail-open state: a fail-secure system defaults to a secure state in the event of a failure, blocking all access, while a fail-open system fails in an open state, granting all access to all users.  In a critical healthcare environment, fail-secure should be the default configuration in case of failure, and the security professional can restore access after the failure using an automated process that sets up the access control identified in the security plan (Stewart et al., 2015).
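
The sketch below illustrates, in a minimal and hypothetical way, the fail-secure behaviour described above: if the access-control check itself fails, the request is denied rather than allowed.  The check function and the failure it raises are invented for illustration.

```python
# A minimal sketch of fail-secure behaviour: any failure in the access
# control itself results in "deny" rather than "allow".
def check_access_control(user, resource):
    """Stand-in for a real policy lookup; may raise if the backend fails."""
    raise ConnectionError("policy store unreachable")

def request_access(user, resource):
    try:
        return check_access_control(user, resource)
    except Exception:
        # Fail-secure: a failure in the control defaults to denying access.
        return False

print(request_access("dr_smith", "ehr/patient/12345"))   # False
```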

3.2.6        Quality of Service (QoS)

            Quality of Service (QoS) controls provide protection and integrity for the data network under load.  QoS attempts to manage factors such as bandwidth, latency, the variation in latency between packets (known as “jitter”), packet loss, and interference.  QoS systems often prioritize certain traffic types which have a low tolerance for interference and high business requirements (Stewart et al., 2015).

3.2.7        Backup Plan

            The organization must implement a disaster recovery plan which covers the details of the recovery process for the system and environment in case of failure.  The disaster recovery plan should be designed to allow recovery even in the absence of the DRP team, by allowing people on the scene to begin the recovery effort until the DRP team arrives.

            The organization must engineer and develop the DRP so that the business units and operations with the highest priority are recovered first.  Thus, the priority of business operations and units must be identified in the DRP, and all critical business operations must be given top priority and recovered first.  The organization must also account in the DRP for the panic associated with a disaster; the personnel of the organization must be trained to handle the disaster recovery process properly and reduce the panic associated with it.  Moreover, the organization must establish internal and external communications for disaster recovery so that people can communicate during the recovery process (Stewart et al., 2015).

            The DRP should address in detail the backup strategy and plan.  There are three types of backups: Full Backup, Incremental Backup, and Differential Backup.  The Full Backup stores a complete copy of the data contained on the protected devices; it duplicates every file on the system.  The Incremental Backup stores only those files which have been modified since the time of the most recent full or incremental backup.  The Differential Backup stores all files which have been modified since the time of the most recent full backup.  It is very important for the healthcare organization to employ more than one type of backup; a full backup over the weekend with incremental or differential backups on a nightly basis should be implemented as part of the DRP (Stewart et al., 2015).
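
To make the distinction between the three backup types concrete, the minimal sketch below expresses each in terms of file modification times; the file names, timestamps, and backup days are invented for illustration.

```python
# A minimal sketch of the three backup types, expressed in terms of
# last-modified times. File names and day numbers are illustrative.
files = {  # file -> day on which it was last modified
    "labs.db": 1, "notes.db": 3, "images.db": 5,
}
last_full_backup = 2     # day of the most recent full backup
last_any_backup = 4      # day of the most recent full or incremental backup

full = set(files)                                               # every file
differential = {f for f, t in files.items() if t > last_full_backup}
incremental = {f for f, t in files.items() if t > last_any_backup}

print("Full backup copies:        ", sorted(full))
print("Differential backup copies:", sorted(differential))
print("Incremental backup copies: ", sorted(incremental))
```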

3.3  BC/DR Best Practice

Various research studies, such as (cleardata.com, 2015; Ranajee, 2012), have discussed best practices for Business Continuity (BC) and Disaster Recovery (DR) (BC/DR) in healthcare.  As addressed in (cleardata.com, 2015), the best practice is to use Cloud Computing for BC and DR rather than handling them in-house.  Some of the primary reasons for taking BC/DR to the Cloud include easier compliance with HIPAA and HITECH, better recovery objectives such as the recovery time objective (RTO) and the recovery point objective (RPO), moving spending from capital expenditure to operational expenditure, fast deployment, and enhanced scalability.

The best practice for BC/DR for healthcare organizations using the Cloud involves five significant steps.  The first step is a health check of the existing BC/DR environment, involving a risk assessment check, an IT performance check, a backup integrity check, and a restore capabilities check.  The second step is the impact analysis, which defines the costs, benefits, and risks associated with moving aspects of BC/DR to the cloud; the impact analysis should cover the finances and allocated budget, personnel, technology, business processes, security, compliance, patient care, innovation, and growth.  The third step is to outline the solution requirements, such as the RTO and RPO, regulatory requirements, resource requirements, the use of BYOD, and the ability to share patient data within the same network and with service providers.  The next step is for healthcare organizations to map the requirements to the available deployment models, such as cloud backup-as-a-service (BUaaS), cloud replication, and cloud Infrastructure-as-a-Service (IaaS).  The last step involves identifying criteria for demonstrating experience in the healthcare industry, including HIPAA compliance, RTO and RPO that meet the risk assessment guidelines, and a proof-of-concept delivery to test the BC/DR.

Conclusion

The project developed a proposal for Big Data Analytics (BDA) in healthcare.  The proposal covered three significant parts: Part 1 covered the Big Data Analytics Business Plan in Healthcare, Part 2 addressed the Security Policy Proposal in Healthcare, and Part 3 proposed the Business Continuity and Disaster Recovery Plan in Healthcare.  The project began with an overview of Big Data Analytics in healthcare, discussing the opportunities and challenges in healthcare and the Big Data Analytics ecosystem.  Part 1 of the BDA Business Plan covered the four major building blocks.  Big Data Management is the first building block, with a detailed discussion of the data store types which the healthcare organization must select based on its requirements, and a use case demonstrating the complexity of this task.  Big Data Analytics is the second building block, covering the technologies and tools required when dealing with BDA in healthcare.  Big Data Integration and Governance is the third building block, which must be implemented to ensure data protection and compliance with existing rules.  The last building block is the Big Data Application, with a detailed discussion of the methods which can be used when applying BDA in healthcare, such as clustering, classification, machine learning, and so forth.  The project also proposed a Security Policy in a comprehensive discussion in Part 2, which addressed in detail various security measures such as compliance with the CIA Triad, internal security, equipment security, information security, and protection techniques.  The last part covered the Business Continuity and Disaster Recovery Plan in Healthcare and the relevant best practices.

References

Abernathy, R., & McMillan, T. (2016). CISSP Cert Guide: Pearson IT Certification.

Alexandru, A., Alexandru, C., Coardos, D., & Tudora, E. (2016). Healthcare, Big Data and Cloud Computing. management, 1, 2.

Aljumah, A. A., Ahamad, M. G., & Siddiqui, M. K. (2013). Application of data mining: Diabetes health care in young and old patients. Journal of King Saud University-Computer and Information Sciences, 25(2), 127-136.

Archenaa, J., & Anita, E. M. (2015). A survey of big data analytics in healthcare and government. Procedia Computer Science, 50, 408-413.

Borkar, V. R., Carey, M. J., & Li, C. (2012). Big data platforms: what’s next? XRDS: Crossroads, The ACM Magazine for Students, 19(1), 44-49.

Chawla, N. V., & Davis, D. A. (2013). Bringing big data to personalized healthcare: a patient-centered framework. Journal of general internal medicine, 28(3), 660-665.

Chen, M., Mao, S., & Liu, Y. (2014). Big data: a survey. Mobile Networks and Applications, 19(2), 171-209.

cleardata.com. (2015). Best Practices in Healthcare IT Disaster Recovery Planning. Retrieved from https://www.cleardata.com/research/healthcare-it-disaster-recovery-planning/, White Paper.

Dezyre. (2016). 5 Healthcare Applications of Hadoop and Big Data Retrieved from https://www.dezyre.com/article/5-healthcare-applications-of-hadoop-and-big-data/85.

EMC. (2015). Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data. (1st ed.): Wiley.

Erl, T., Khattak, W., & Buhler, P. (2016). Big Data Fundamentals: Concepts, Drivers & Techniques: Prentice Hall Press.

Eswari, T., Sampath, P., & Lavanya, S. (2015). Predictive methodology for diabetic data analysis in big data. Procedia Computer Science, 50, 203-208.

Fox, M., & Vaidyanathan, G. (2016). IMPACTS OF HEALTHCARE BIG DATA: A FRAMEWORK WITH LEGAL AND ETHICAL INSIGHTS. Issues in Information Systems, 17(3).

Groves, P., Kayyali, B., Knott, D., & Kuiken, S. V. (2016). The ‘Big Data’ Revolution in Healthcare: Accelerating Value and Innovation.

Hashem, I. A. T., Yaqoob, I., Anuar, N. B., Mokhtar, S., Gani, A., & Khan, S. U. (2015). The rise of “big data” on cloud computing: Review and open research issues. Information Systems, 47, 98-115.

Hashmi, N. (2013). The Challenges of Implementing Big Data Analytics in Healthcare. Retrieved from https://searchhealthit.techtarget.com/tip/The-challenges-of-implementing-big-data-analytics-in-healthcare.

himss.org. (2018). 2017 Security Metrics: Guide to HIPAA Compliance: What Healthcare Entities and Business Associates Need to Know. Retrieved from http://www.himss.org/file/1318331/download?token=h9cBvnl2.

HIPAA. (2018). Report: Healthcare Data Breaches in Q1, 2018. Retrieved from https://www.hipaajournal.com/report-healthcare-data-breaches-in-q1-2018/.

Jee, K., & Kim, G.-H. (2013). Potentiality of big data in the medical sector: focus on how to reshape the healthcare system. Healthcare informatics research, 19(2), 79-85.

Jennifer, B. (2016). The Top 3 Planning Pain Points in Healthcare Big Data Analytics. Retrieved from https://healthitanalytics.com/news/the-top-3-planning-pain-points-in-healthcare-big-data-analytics.

Joudaki, H., Rashidian, A., Minaei-Bidgoli, B., Mahmoodi, M., Geraili, B., Nasiri, M., & Arab, M. (2015). Using data mining to detect health care fraud and abuse: a review of the literature. Global journal of health science, 7(1), 194.

Kaoudi, Z., & Manolescu, I. (2015). RDF in the clouds: a survey. The VLDB Journal, 24(1), 67-91.

Klein, J., Gorton, I., Ernst, N., Donohoe, P., Pham, K., & Matser, C. (2015, June 27-July 2). Application-Specific Evaluation of NoSQL Databases. Paper presented at the 2015 IEEE International Congress on Big Data.

Kolkowska, E., Hedström, K., & Karlsson, F. (2009). Information security goals in a Swedish hospital. Paper presented at the 8th Annual Security Conference, 15-16 April 2009, Las Vegas, USA.

Landolina, M., Perego, G. B., Lunati, M., Curnis, A., Guenzati, G., Vicentini, A., . . . Valsecchi, S. (2012). Remote Monitoring Reduces Healthcare Use and Improves Quality of Care in Heart Failure Patients With Implantable Defibrillators: Clinical Perspective: The Evolution of Management Strategies of Heart Failure Patients With Implantable Defibrillators (EVOLVO) Study. Circulation, 125(24), 2985-2992.

Lee, C., Luo, Z., Ngiam, K. Y., Zhang, M., Zheng, K., Chen, G., . . . Yip, W. L. J. (2017). Big healthcare data analytics: Challenges and applications. In Handbook of Large-Scale Distributed Computing in Smart Healthcare (pp. 11-41): Springer.

Liang, Y., & Kelemen, A. (2016). Big Data Science and its Applications in Health and Medical Research: Challenges and Opportunities. Austin Journal of Biometrics & Biostatistics, 7(3).

Ohlhorst, F. J. (2012). Big data analytics: turning big data into big money: John Wiley & Sons.

Raghupathi, W., & Raghupathi, V. (2014). Big data analytics in healthcare: promise and potential. Health Information Science and Systems, 2(1), 1.

Ranajee, N. (2012). Best practices in healthcare disaster recovery planning: The push to adopt EHRs is creating new data management challenges for healthcare IT executives. Health management technology, 33(5), 22-24.

Rawte, V., & Anuradha, G. (2015). Fraud detection in health insurance using data mining techniques. Paper presented at the Communication, Information & Computing Technology (ICCICT), 2015 International Conference on.

Sahafizadeh, E., & Nematbakhsh, M. A. (2015). A Survey on Security Issues in Big Data and NoSQL. Int’l J. Advances in Computer Science, 4(4), 2322-5157.

Sidhtara, T. (2015). 8 Studies that Prove the Value of Remote Monitoring for Diabetes. Retrieved from https://www.glooko.com/2015/05/8-studies-that-prove-the-value-of-remote-monitoring-for-diabetes/.

Stewart, J., Chapple, M., & Gibson, D. (2015). CISSP (ISC)² Certified Information Systems Security Professional Official Study Guide (7th ed.): Wiley.

Stonebraker, M. (2012). What Does 'Big Data' Mean? Communications of the ACM, BLOG@CACM.

Verhaeghe, X. (n.d.). The Building Blocks of a Big Data Strategy. Retrieved from https://www.oracle.com/uk/big-data/features/bigdata-strategy/index.html.

Wang, Y., Kung, L., & Byrd, T. A. (2018). Big data analytics: Understanding its capabilities and potential benefits for healthcare organizations. Technological Forecasting and Social Change, 126, 3-13.

Ward, M. J., Marsolo, K. A., & Froehle, C. M. (2014). Applications of business analytics in healthcare. Business Horizons, 57(5), 571-582.

Wicklund, E. (2014). ‘Silo’ one of healthcare’s biggest flaws. Retrieved from http://www.healthcareitnews.com/news/silo-one-healthcares-biggest-flaws.

Youssef, A. E. (2014). A framework for secure healthcare systems based on big data analytics in mobile cloud computing environments.