Building Blocks of a System for Healthcare Big Data Analytics

Dr. Aly, O.
Computer Science

Introduction

The purpose of this discussion is to create the building blocks of a system for healthcare Big Data Analytics and compare the building block design to a DNA networked cluster currently used by an organization in the current market.

The discussion begins with the Cloud Computing Building Blocks, followed by Big Data Analytics Building Blocks, and DNA Sequencing. The discussion also addresses the building blocks for the health analytics and the building blocks for DNA Sequencing System, and the comparison between both systems.

Cloud Computing Building Blocks

The Cloud Computing model contains two elements: the front end and the back end.  Both elements are connected through the network. The user interacts with the system using the front end, while the cloud itself is the back end. The front end is the client which the user uses to access the cloud through a device such as a smartphone, tablet, or laptop.  The back end, represented by the Cloud, provides the applications, computers, servers, and data storage which create the services (IBM, 2012).

As indicated in (Macias & Thomas, 2011), three building blocks are required to enable Cloud Computing. The first block is the “Infrastructure,” where the organization can optimize data center consolidation, enhance network performance, connect anyone, anywhere seamlessly, and implement pre-configured solutions.  The second block is the “Applications,” where the organization can identify applications for rapid deployment, and utilize automation and orchestration features.  The third block is the “Services,” where the organization can determine the right implementation model, and create a phased cloud migration plan.

In (Mousannif, Khalil, & Kotsis, 2013-14), the building blocks for Cloud Computing involve the physical layer, the virtualization layer, and the service layer.  Virtualization is a basic building block in Cloud Computing.  Virtualization is the technology which hides the physical characteristics of the computing platform from the front-end users and provides an abstract, emulated computing platform.  Clusters and grids are characteristics of Cloud Computing that support high-performance computing applications such as simulations. Other building blocks of Cloud Computing include Service-Oriented Architectures (SOA) and Web Services (Mousannif et al., 2013-14).

Big Data Building Blocks

As indicated in (Verhaeghe, n.d.), there are four major building blocks for Big Data Analytics.  The first building block is Big Data Management, which enables the organization to capture, store, and protect the data. The second building block is Big Data Analytics, to extract value from the data.  Big Data Integration is the third building block, to ensure the application of governance over the data.  The last building block is Big Data Applications, through which the organization applies the first three building blocks using Big Data technologies.

DNA Sequencing

DNA stands for Deoxyribonucleic Acid, which represents the smallest building block of life (Matthews, 2016).  As indicated in (Salzberg, 1999), advances in biotechnology have produced enormous volumes of DNA-related information.  However, the rate of data generation is outpacing the ability of scientists to analyze the data.  DNA Sequencing is a technique used to determine the order of the four chemical building blocks, called “bases,” which make up the DNA molecule (genome.gov, 2015).  The sequence provides the kind of genetic information which is carried in a particular DNA segment.  DNA sequencing can provide valuable information about the role of inheritance in susceptibility to disease and in response to environmental influences.  Moreover, DNA sequencing provides rapid and cost-effective diagnosis and treatments.  Markov chains and hidden Markov models are probabilistic techniques which can be used to analyze the results of DNA sequencing (Han, Pei, & Kamber, 2011).  An example of a DNA Sequencing application is discussed and analyzed in (Leung et al., 2011), where the researchers applied Data Mining to DNA sequence data sets for the Hepatitis B Virus.
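To make the Markov-chain idea concrete, the following minimal R sketch (a hypothetical base sequence, not data from the cited studies) estimates a first-order transition matrix over the four bases; a hidden Markov model extends this idea by adding unobserved states.

  # Hypothetical DNA fragment, not data from the cited studies
  dna <- c("A","C","G","T","A","C","G","G","C","T","A","A","C","G","T","C")

  bases  <- c("A","C","G","T")
  counts <- matrix(0, nrow = 4, ncol = 4, dimnames = list(bases, bases))

  # Count transitions from each base to the next base in the sequence
  for (i in seq_len(length(dna) - 1)) {
    counts[dna[i], dna[i + 1]] <- counts[dna[i], dna[i + 1]] + 1
  }

  # Normalize each row to obtain transition probabilities P(next base | current base)
  transitions <- counts / rowSums(counts)
  print(round(transitions, 2))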

DNA Sequencing was previously performed on non-networked computers, using a limited subset of data due to limited computer processing speed (Matthews, 2016).  However, DNA Sequencing has been benefiting from various advanced technologies and techniques.  Predictive Analytics is an example of these techniques; applied to DNA Sequencing, it results in Predictive Genomics.  Cloud Computing plays a significant role in the success of Predictive Genomics for two major reasons: the volume of the genomic data and the low cost (Matthews, 2016).  Cloud Computing is becoming a valuable tool for various domains, including DNA Sequencing.   As cited in (Blaisdell, 2017), a Transparency Market Research study showed that the healthcare Cloud Computing market is going to evolve further, reaching up to $6.8 billion by 2018.

Building Blocks for the Healthcare System

Healthcare data requires protection due to security and privacy concerns.  Thus, a Private Cloud will be used in this use case.  To build a Private Cloud, the virtualization layer, the physical layer, and the service layer are required.  The virtualization layer consists of a hypervisor, which allows multiple operating systems to share a single hardware system.  The hypervisor is a program which controls the host processors and resources by allocating the resources to each operating system.  There are two types of hypervisors: native (also called bare-metal or Type 1) and hosted (also called Type 2).  Type 1 runs directly on the physical hardware, while Type 2 runs on a host operating system which runs on the physical hardware.  Examples of native hypervisors include VMware’s ESXi and Microsoft’s Hyper-V; examples of hosted hypervisors include Oracle VirtualBox and VMware’s Workstation.  The physical layer can consist of two computer pools, one for PCs and the other for servers (Mousannif et al., 2013-14).

In (Archenaa & Anita, 2015), the researchers illustrated a secure Healthcare Analytics System.  The electronic health record is a heterogeneous dataset which is given as input to HDFS through Flume and Sqoop. The analysis of the data is performed using MapReduce and Hive by implementing Machine Learning algorithms to find similar patterns in the data and to predict the risk to a patient's health condition at an early stage.  The HBase database is used for storing the multi-structured data. Storm is used to process live streams and to detect emergency conditions, such as a patient's temperature falling outside the expected range. A Lambda architecture is also used in this healthcare system.  The final building block of the Healthcare system involves the reports generated by top-layer tools such as “Hunk.”  Figure 1 illustrates the Healthcare Analytics System, adapted from (Archenaa & Anita, 2015).

Figure 1.  Healthcare Analytics System. Adapted from (Archenaa & Anita, 2015)

Building Block for DNA and Next Generation Sequencing System

Besides DNA Sequencing, there is next-generation sequencing (NGS), whose use has been increasing exponentially since 2007 (Bhuvaneshwar et al., 2015).  In (Bhuvaneshwar et al., 2015), the Globus Genomics System is proposed as an enhanced Galaxy workflow system made available as a service, offering users the capability to process and transfer data easily, reliably, and quickly.  This system addresses the end-to-end NGS analysis requirements and is implemented on the Amazon Cloud Computing infrastructure.  Figure 2 illustrates the framework for the Globus Genomics System, taking into account the security measures for protecting the data.  Examples of healthcare organizations which are using Genomic Sequencing include Kaiser Permanente in Northern California and Geisinger Health System in Pennsylvania (Khoury & Feero, 2017).

Figure 2. Globus Genomics System for Next Generation Sequencing (NGS). Adapted from (Bhuvaneshwar et al., 2015).

In summary, Cloud Computing has reshaped the healthcare industry in many aspects.  Healthcare Cloud Computing and Analytics provide many benefits, from easy access to electronic patient records to DNA Sequencing and NGS.  The building blocks of Cloud Computing must be implemented with care for security and privacy considerations to protect patients’ data from unauthorized users.  The building blocks for the Healthcare Analytics system involve advanced technologies such as Hadoop, MapReduce, Storm, and Flume, as illustrated in Figure 1.  The building blocks for the DNA Sequencing and NGS system involve the Dynamic Worker Pool, HTCondor, a Shared File System, the Elastic Provisioner, Globus Transfer and Nexus, and Galaxy, as illustrated in Figure 2.  Each system has the required building blocks to perform the analytics tasks.

References

Archenaa, J., & Anita, E. M. (2015). A survey of big data analytics in healthcare and government. Procedia Computer Science, 50, 408-413.

Bhuvaneshwar, K., Sulakhe, D., Gauba, R., Rodriguez, A., Madduri, R., Dave, U., . . . Madhavan, S. (2015). A case study for cloud-based high throughput analysis of NGS data using the globus genomics system. Computational and structural biotechnology journal, 13, 64-74.

Blaisdell, R. (2017). DNA Sequencing in the Cloud. Retrieved from https://rickscloud.com/dna-sequencing-in-the-cloud/.

genome.gov. (2015). DNA Sequencing. Retrieved from https://www.genome.gov/10001177/dna-sequencing-fact-sheet/.

Han, J., Pei, J., & Kamber, M. (2011). Data mining: Concepts and techniques. Elsevier.

IBM. (2012). Cloud computing fundamentals: A different way to deliver computer resources. Retrieved from https://www.ibm.com/developerworks/cloud/library/cl-cloudintro/cl-cloudintro-pdf.pdf.

Khoury, M. J., & Feero, G. (2017). Genome Sequencing for Healthy Individuals? Think Big and Act Small! Retrieved from https://blogs.cdc.gov/genomics/2017/05/17/genome-sequencing-2/.

Leung, K., Lee, K., Wang, J., Ng, E. Y., Chan, H. L., Tsui, S. K., . . . Sung, J. J. (2011). Data mining on DNA sequences of Hepatitis B Virus. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), 8(2), 428-440.

Macias, F., & Thomas, G. (2011). Three Building Blocks to Enable the Cloud. Retrieved from https://www.cisco.com/c/dam/en_us/solutions/industries/docs/gov/white_paper_c11-675835.pdf.

Matthews, K. (2016). DNA Sequencing. Retrieved from https://cloudtweaks.com/2016/11/cloud-dna-sequencing/.

Mousannif, H., Khalil, I., & Kotsis, G. (2013-14). Collaborative learning in the clouds. Information Systems Frontiers, 15(2), 159-165. doi:10.1007/s10796-012-9364-y

Salzberg, S. L. (1999). Gene discovery in DNA sequences. IEEE Intelligent Systems and their Applications, 14(6), 44-48.

Verhaeghe, X. (n.d.). The Building Blocks of a Big Data Strategy. Retrieved from https://www.oracle.com/uk/big-data/features/bigdata-strategy/index.html.

Use Case: Analysis of Heart Disease

Dr. Aly, O.
Computer Science

Abstract

The purpose of this project is to articulate all the steps conducted to perform the analysis of the heart disease use case.  The project contained two main phases: Phase 1: Sandbox Configuration, and Phase 2: Heart Disease Use Case.  The setup and the configurations are not trivial and did require the integration of Hive with MapReduce and Tez.  It also required the integration of R and RStudio with Hive to perform transactions to retrieve and aggregate data.  The analysis included Descriptive Analysis for all patients and then drilled down to focus on gender: females and males.  Moreover, the analysis included the Decision Tree and the Fast-and-Frugal Trees (FFTrees).  The researcher of this paper is in agreement with other researchers that Big Data Analytics and Data Mining can play a significant role in healthcare in various areas such as patient care, healthcare records, and fraud detection and prevention.

Keywords: Decision Tree, Diagnosis of Heart Disease.

Introduction

            Medical records, and the databases that store these records, are increasing rapidly.  This rapid increase is leading researchers and practitioners to employ Big Data technologies.  The Data Mining technique plays a significant role in finding patterns and in extracting knowledge to provide better patient care and effective diagnostic capabilities.  As indicated in (Koh & Tan, 2011), “In healthcare, data mining is becoming increasingly popular, if not increasingly essential.”  Healthcare can benefit from Data Mining applications in various areas such as the evaluation of treatment effectiveness, customer and patient relationship management, healthcare management, and fraud detection and prevention.  Moreover, other benefits include predictive medicine and the analysis of DNA microarrays.

Various research studies have employed Data Mining techniques in healthcare.  In (Alexander & Wang, 2017), the main objective of the study was to identify the usage of Big Data Analytics to predict and prevent heart attacks.  The results showed that Big Data Analytics is useful in predicting and preventing heart attacks.  In (Dineshgar & Singh, 2016), the purpose of the study was to develop a prototype Intelligent Heart Disease Prediction System (IHDPC) using Data Mining techniques.  In (Karthiga, Mary, & Yogasini, 2017), the researchers utilized Data Mining techniques to predict heart disease using the Decision Tree algorithm and Naïve Bayes.  The results showed a prediction accuracy of 99%. Thus, Data Mining techniques enable the healthcare industry to predict patterns.  In (Kirmani & Ansarullah, 2016), the researchers also applied Data Mining techniques with the aim of investigating the results of applying different types of Decision Tree methods to obtain better performance in heart disease prediction.  These research studies are examples of the vast literature on the use of Big Data Analytics and Data Mining in the healthcare industry.

            In this project, the heart disease dataset is utilized as the Use Case for Data Mining application.  The project used Hortonworks sandbox, with Hive, MapReduce, and Tez.  The project also integrated R with Hive to perform statistical analysis including Decision Tree method.  The project utilized techniques from various research studies such as (Karthiga et al., 2017; Kirmani & Ansarullah, 2016; Martignon, Katsikopoulos, & Woike, 2008; Pandey, Pandey, & Jaiswal, 2013; Phillips, Neth, Woike, & Gaissmaier, 2017; Reddy, Raju, Kumar, Sujatha, & Prakash, 2016).

            The project begins with Phase 1 of Sandbox Configuration, followed by Phase 2 of the Heart Disease Use Case.  The Sandbox configuration included the environment set up from mapping the sandbox IP to the Ambari Console management and the Integration of R and RStudio with Hive.  The Heart Disease Use Case involved fourteen steps starting from understanding the dataset to the analysis of the result.  The project articulates the steps and the commands as required for this project. 

Phase 1:  Sandbox Configuration

1.      Environment Setup

The environment setup begins with the installation of the Virtual Box and Hortonworks Sandbox.

  1. Install Oracle VM VirtualBox, which can be downloaded from http://www.virtualbox.org
  2. Install Hortonworks Docker Sandbox version 2.6.4 from http://hortonworks.com/sandbox

After the installation, the environment must be configured to function fully using the following steps.

1.1       IP Address and HTTP Web Port

After the Sandbox is installed, the host is assigned an IP address depending on whether a Virtual Machine (VMware, VirtualBox) or a container (Docker) is used. After the installation is finished, the local IP address is assigned with the HTTP web access port of 8888 as shown below.  Thus, local access uses http://127.0.0.1:8888/

1.2       Map the Sandbox IP to Desired Hostname in the Hosts file

            The IP address can be mapped to a hostname using the hosts file as shown below. After setting the hostname to replace the IP address, the sandbox can be accessed from the browser using http://hortonworks-sandbox.com:8888.
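For example, assuming the local Docker sandbox address of 127.0.0.1 noted above and the hostname used throughout this paper, the hosts file entry would look like:

127.0.0.1    hortonworks-sandbox.com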

1.3     Roles and User Access

There are five major users with different roles in the Hortonworks Sandbox.  These users, their roles, services, and passwords are summarized in Table 1.

Table 1.  Users, Roles, Service, and Passwords.

With respect to access, PuTTY can be used to access the sandbox over SSH on port 2222.  The root user will be prompted to specify a new password.

1.4       Shell Web Client Method

The shell web client, also known as Shell-in-a-Box, allows the user to issue shell commands without installing additional software.  It uses port 4200.  The admin password can be reset using the shell web client.
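For example (an assumed step; this uses the password-reset script shipped with the Hortonworks sandbox rather than a command documented in this paper), the Ambari admin password can typically be reset from the shell web client by running:

  • Execute: $ambari-admin-password-reset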

1.5       Transfer Data and Files between the Sandbox and Local Machine.

            To transfer files and data between the local machine and the sandbox, secure copy (the scp command) can be used, as illustrated below.

To transfer from the local machine to the sandbox:
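For example (a sketch assuming the SSH port 2222 noted above and the heart.dat dataset used later in this project; adjust the path and file name as needed):

  • Execute: $scp -P 2222 heart.dat root@hortonworks-sandbox.com:/root/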

To transfer from the sandbox to the local machine.
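For example (same assumptions as above):

  • Execute: $scp -P 2222 root@hortonworks-sandbox.com:/root/heart.dat .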

1.6       Ambari Console and Management

            The admin can manage Ambari from the web UI on port 8080, at http://hortonworks-sandbox.com:8080, using the admin user and password.  The admin can operate the cluster, manage users and groups, and deploy views. The cluster section is the primary UI for Hadoop operators and allows the admin to grant permissions to Ambari users and groups.

2.      R and RStudio Setup

To download and install RStudio Server, follow these steps:

  •  Execute: $sudo yum install rstudio-server-rhel-0.99.893-x86_64.rpm
  • Install dpkg to divert the location of /sbin/initctl
    • Execute: $yum install dpkg
    • Execute: $dpkg-divert --local --rename --add /sbin/initctl
    • Execute: $ln -s /bin/true /sbin/initctl
  • Install R and verify the installation of RStudio.
    • Execute: $yum install -y R
    • Execute: $yum -y install libcurl-devel
    • Execute: $rstudio-server verify-installation
  • The default port of the RStudio server is 8787, which is not opened in the Docker Sandbox.  You can use port 8090, which is opened for Docker.
    • Execute: $sed -i '1 a www-port=8090' /etc/rstudio/rserver.conf
    • Restart the server by typing: exec /usr/lib/rstudio-server/bin/rserver
    • This will close your session. However, you can now browse to RStudio using port 8090 as shown below.
    • The RStudio login is amy_ds/amy_ds
  • Alternatively, open up the RStudio port 8787 by implementing the following steps:
    • Access the VM VirtualBox Manager tool.
    • Click on the Hortonworks VM → Network → Advanced → Port Forwarding.  Add Port 8787 for RStudio.
    • After you open up the port, modify /etc/rstudio/rserver.conf to reflect port 8787.
    • Stop and start the VM.

Phase 2:  Heart Disease Use Case

 1.      Review and Understand the Dataset

  • Obtain the heart disease dataset from the archive site at:

http://archive.ics.uci.edu/ml/datasets/.

  • Review the dataset.  The dataset has fourteen variables with N=271 records.  Table 2 describes these attributes.

Table 2.  Heart Disease Dataset Variables Description.

  • Load the heart.dat file into the Hadoop Distributed File System (HDFS):
    • Log in to Ambari using amy_ds/amy_ds.
    • File View → user → amy_ds → upload
  • Start the Hive database.  Create a table “heart” and import the dataset into the Hive database.
  • Retrieve the top 10 records for verification.

2.      Configure MapReduce as the Execution Engine in Hive

Hive can be configured to use MapReduce as its execution engine as follows.

  1. Click on the Hive Settings tab.
  2. Click Add New and add the following Key: Value pairs.
    • Key: hive.execution.engine → Value: mr (for MapReduce).
    • Key: hive.auto.convert.join → Value: false.
  3. Test the query using MapReduce as the execution engine; an illustrative query is shown below.
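For example (an illustrative HiveQL query against the heart table created earlier, not necessarily the exact query from the original run):

SELECT sex, COUNT(*) FROM heart GROUP BY sex;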
3.      Configure Tez as the Execution Engine in Hive

The user can also change the value of hive.execution.engine from mr to tez, since Hive is enabled for Tez execution, and take advantage of a DAG execution representing the query instead of multiple stages of a MapReduce program, which involve a lot of synchronization, barriers, and I/O overhead.

  1. Click on the Hive Settings tab.
  2. Click Add New and add the following Key: Value pairs.
    • Key: hive.execution.engine → Value: tez.
    • Key: hive.auto.convert.join → Value: false.

4.      Integrate TEZ with Hive for Directed Acyclic Graph (DAG)

This integration lets Hive take advantage of Directed Acyclic Graph (DAG) execution. Tez improves this technique by writing intermediate datasets to memory instead of to the hard disk.

  1. Go to Settings in Hive view.
  2. Change the hive.execution.engine to tez.

5.      Track Hive on Tez jobs in HDP Sandbox using the Web UI.

  1. Track the job from the browser at http://hortonworks-sandbox.com:8088/cluster, while it is running or after it completes, to see the details.
  2. Retrieve the average age and average cholesterol by gender for females and males (an illustrative query is shown below).
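For example (an assumed HiveQL sketch of this aggregation; the column names follow the queries used later in this paper):

SELECT sex, AVG(age), AVG(chol) FROM heart GROUP BY sex;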

6. Monitor Cluster Metrics

  1. Monitor the Cluster Metrics.

    

7. Review the Statistics of the table from Hive.

  1. Table → Statistics → Recompute, and check Include Columns.
  2. Click on Tez View.
  3. Click Analyze.
  4. Click Graphical View.

8.      Configure ODBC for Hive Database Connection

  1. Configure a User Data Source in ODBC on the client to connect to the Hive database.
  2. Test the ODBC connection to Hive.

9.      Setup R to use the ODBC for Hive Database Connection

  1. Execute the following to install the ODBC package in R:
    • >install.packages("RODBC")
  2. Execute the following to load the required library to establish the database connection from R to Hive:
    • >library("RODBC")
  3. Execute the following command to establish the database connection from R to Hive:
    • >cs881 <- odbcConnect("Hive")
  4. Execute the following to retrieve the top 10 records from Hive in R using the ODBC connection (HiveQL uses LIMIT instead of TOP):
    • >sqlQuery(cs881, "SELECT * FROM heart LIMIT 10")

10.   Create Data Frame

  • Execute the following command to create a data frame:
    • >heart_df <- sqlQuery(cs881, "SELECT * FROM heart")
  • Review the headers of the columns:
    • >print(head(heart_df))
  • Review and analyze the statistics summary:
    • >summary(heart_df)
  • List the names of the columns:
    • >names(heart_df)

11.   Analyze the Data using Descriptive Analysis

  • Find the Heart Disease Patients' Age and Cholesterol Level
    • Among all genders:
    • >age_chol_heart_disease <- sqlQuery(cs881, "SELECT age, chol FROM heart WHERE diagnosis = 1")
    • >summary(age_chol_heart_disease)

Figure 1.  Cholesterol Level Among All Heart Disease Patients.

  • Among Female Patients
    • >age_chol_heart_disease_female <- sqlQuery(cs881, "SELECT age, chol FROM heart WHERE diagnosis = 1 AND sex = 0")
    • >summary(age_chol_heart_disease_female)

Figure 2.  Cholesterol Level Among Heart Disease Female Patients.

  • Among Male Patients
    • >age_chol_heart_disease_male <- sqlQuery(cs881, "SELECT age, chol FROM heart WHERE diagnosis = 1 AND sex = 1")
    • >summary(age_chol_heart_disease_male)

Figure 3.  Cholesterol Level Among Heart Disease Male Patients.

 12.  Analyze the Data using Decision Tree

  • Print the headers of the columns
  • Create input.dat for the diagnosis, age, sex, and sugar attributes.
  • Create a png file for the tree plot.
  • Install the party package:
    • >install.packages("party")
    • >library(party)
  • Load and plot the Decision Tree (a minimal sketch of these steps is shown below).
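The following is a minimal R sketch of these steps.  It assumes the heart_df data frame created in Step 10 and the attribute names used in this paper (diagnosis, age, sex, sugar); the exact attributes, file names, and formula in the original run may differ.

  # Install and load the party package for conditional inference trees
  install.packages("party")
  library(party)

  # Grow a decision tree for the diagnosis from a few assumed attributes
  heart_tree <- ctree(factor(diagnosis) ~ age + sex + sugar, data = heart_df)

  # Save the tree plot to a png file
  png(file = "heart_decision_tree.png")
  plot(heart_tree)
  dev.off()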

Figure 4.  Decision Tree for Heart Disease Patients.

13.    Create FFTree for heart disease
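A minimal R sketch of this step, using the FFTrees package (Phillips et al., 2017), is shown below.  The binary recoding of the diagnosis, the decision labels, and the use of all remaining attributes as predictors are assumptions for illustration; the trees reported in Figures 5-7 may have been built differently.

  # Install and load the FFTrees package for fast-and-frugal trees
  install.packages("FFTrees")
  library(FFTrees)

  # FFTrees expects a logical criterion; assume diagnosis = 1 means heart disease
  heart_df$diagnosis <- heart_df$diagnosis == 1

  # Build the fast-and-frugal tree from all remaining attributes
  heart_fft <- FFTrees(formula = diagnosis ~ .,
                       data = heart_df,
                       main = "Heart Disease",
                       decision.labels = c("Low-Risk", "High-Risk"))

  # Plot the tree with its sensitivity and specificity statistics
  plot(heart_fft)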

Figure 5.  FFT Decision Tree with Low Risk and High-Risk Patients.

Figure 6.  Sensitivity and Specificity for Heart Disease Patients Using FFTree.

Figure 7. Custom Heart Disease FFTree.

14.    Analysis of the Results

The analysis of the heart disease dataset included descriptive analysis and a decision tree.  The results of the descriptive analysis showed that the minimum age among the heart disease patients is 29 years old, while the maximum age is 79, with a median and mean of 52 years old.  The results also showed that the minimum cholesterol level for these patients is 126, while the maximum is 564, with a median of 236 and a mean of 244, suggesting that the cholesterol level also increases with age.

The descriptive analysis then drilled down by gender (female vs. male) to identify the impact of age on cholesterol for the heart disease patients.  The results showed the same trend among the female heart disease patients, with a minimum age of 34, a maximum age of 76, and a median and mean of 54.   The cholesterol level among female heart disease patients ranges from a minimum of 141 to a maximum of 564, with a median of 250 and a mean of 257.  The maximum cholesterol level of 564 is an outlier, with another outlier at the age of 76.   With respect to the male heart disease patients, the results showed the same trend: the minimum age is 29 years old and the maximum age is 70 years old, with a median of 52 and a mean of 51.  The cholesterol level among these male heart disease patients ranges from a minimum of 126 to a maximum of 325, with a median and mean of 233. There is another outlier among the male heart disease patients at the age of 29.  Due to these outliers in the dataset among the female and male heart disease patients, the comparison between male and female patients will not be accurate.   However, Figure 5 and Figure 6 show similarities in the impact of age on the cholesterol level between both genders.

With regard to the decision tree, the first decision tree shows the data partitioned among six nodes.  The first two nodes are for the Resting Blood Pressure (RBP) attribute (RestingBP). The first node of this cluster shows 65 heart disease patients with an RBP of 138 or less, while the second node of this cluster shows 41 heart disease patients with an RBP greater than 138.  These two nodes cover patients whose vessels value is zero or less and whose heart rate is 165 or less.  For a vessels value that exceeds zero, there is another node with 95 patients.  The second set of nodes is for a heart rate greater than 165.  These three nodes are split on the vessels attribute: zero or fewer vessels, and more than zero vessels.  Two nodes fall under the first category of zero or less, including 22 heart disease patients with a heart rate of 172 or less. The last node shows a vessels value greater than zero with 15 heart disease patients.  The FFTree results show that high-risk heart disease patients have a vessels value greater than zero, while low-risk patients have a vessels value of zero or less.

Conclusion

The purpose of this project was to articulate all the steps conducted to perform the analysis of the heart disease use case.  The project contained two main phases: Phase 1: Sandbox Configuration, and Phase 2: Heart Disease Use Case.  The setup and the configurations are not trivial and did require the integration of Hive with MapReduce and Tez.  It also required the integration of R and RStudio with Hive to perform transactions to retrieve and aggregate data.  The analysis included Descriptive Analysis for all patients and then drilled down to focus on gender: females and males.  Moreover, the analysis included the Decision Tree and the FFTrees.  The researcher of this paper is in agreement with other researchers that Big Data Analytics and Data Mining can play a significant role in healthcare in various areas such as patient care, healthcare records, and fraud detection and prevention.

References

Alexander, C., & Wang, L. (2017). Big data analytics in heart attack prediction. The Journal of Nursing Care, 6(393).

Dineshgar, G. P., & Singh, L. (2016). A Review of Data Mining For Heart Disease Prediction. International Journal of Advanced Research in Electronics and Communication Engineering (IJARECE), 5(2).

Karthiga, A. S., Mary, M. S., & Yogasini, M. (2017). Early Prediction of Heart Disease Using Decision Tree Algorithm. International Journal of Advanced Research in Basic Engineering Sciences and Technology (IJARBEST), 3(3).

Kirmani, M. M., & Ansarullah, S. I. (2016). Prediction of Heart Disease using Decision Tree a Data Mining Technique. IJCSN International Journal of Computer Science and Network, 5(6), 885-892.

Koh, H. C., & Tan, G. (2011). Data mining applications in healthcare. Journal of healthcare information management, 19(2), 65.

Martignon, L., Katsikopoulos, K. V., & Woike, J. K. (2008). Categorization with limited resources: A family of simple heuristics. Journal of Mathematical Psychology, 52(6), 352-361.

Pandey, A. K., Pandey, P., & Jaiswal, K. (2013). A heart disease prediction model using decision tree. IUP Journal of Computer Sciences, 7(3), 43.

Phillips, N. D., Neth, H., Woike, J. K., & Gaissmaier, W. (2017). FFTrees: A toolbox to create, visualize, and evaluate fast-and-frugal decision trees. Judgment and Decision Making, 12(4), 344.

Reddy, R. V. K., Raju, K. P., Kumar, M. J., Sujatha, C., & Prakash, P. R. (2016). Prediction of heart disease using decision tree approach. International Journal of Advanced Research in Computer Science and Engineering, 6(3).

Big Data Analytics Framework and Relevant Tools Used in Healthcare Data Analytics.

Dr. Aly, O.
Computer Science

Introduction

The purpose of this discussion is to discuss and analyze Big Data Analytics framework and relevant tools used in healthcare data analytics.  The discussion also provides examples of how healthcare organizations can implement such a framework.

Healthcare can benefit from Big Data Analytics in various domains such as decreasing overhead costs, curing and diagnosing diseases, increasing profit, predicting epidemics, and improving the quality of human life (Dezyre, 2016).  Healthcare organizations have been generating very large volumes of data, mostly driven by regulatory requirements, record keeping, compliance, and patient care.  McKinsey projects that Big Data Analytics in healthcare can decrease the costs associated with data management by $300-$500 billion.  Healthcare data includes electronic health records (EHR), clinical reports, prescriptions, diagnostic reports, medical images, pharmacy data, insurance information such as claims and billing, social media data, and medical journals (Eswari, Sampath, & Lavanya, 2015; Ward, Marsolo, & Froehle, 2014).

Various healthcare organizations, such as scientific research labs, hospitals, and other medical organizations, are leveraging Big Data Analytics to reduce the costs associated with healthcare by modifying the treatment delivery models.  Several Big Data Analytics technologies have been applied in the healthcare industry.  For instance, Hadoop technology has been used in healthcare analytics in various domains.  Examples of Hadoop applications in healthcare include cancer treatments and genomics, monitoring patient vitals, hospital networks, healthcare intelligence, and fraud prevention and detection (Dezyre, 2016).  Thus, this discussion is limited to the Hadoop technology in healthcare.  The discussion begins with the types of analytics and the potential benefits of some of these analytics in healthcare, followed by the main discussion about the Hadoop Framework for Diabetes, including its major components, the Hadoop Distributed File System (HDFS) and Map/Reduce.

Types of Analytics

There are four major analytics types:  Descriptive Analytics, Predictive Analytics, Prescriptive Analytics (Apurva, Ranakoti, Yadav, Tomer, & Roy, 2017; Davenport & Dyché, 2013; Mohammed, Far, & Naugler, 2014), and Diagnostic Analytics (Apurva et al., 2017).  Descriptive Analytics is used to summarize historical data to provide useful information.  Predictive Analytics is used to predict future events based on previous behavior, using data mining techniques and modeling.  Prescriptive Analytics provides support for various data modeling scenarios, such as multi-variable simulation and detecting hidden relationships between different variables; it is useful for finding an optimum solution and the best course of action algorithmically.  Prescriptive Analytics, as indicated in (Mohammed et al., 2014), is less used in the clinical field.  Diagnostic Analytics is described as an advanced type of analytics which gets to the cause of a problem using drill-down techniques and data discovery.

Hadoop Framework for Diabetes

A predictive analysis algorithm is utilized by (Eswari et al., 2015) in a Hadoop/MapReduce environment to predict the prevalent diabetes types, the complications associated with each diabetic type, and the required treatment type.  The analysis by (Eswari et al., 2015) was performed on Indian patients.  According to the World Health Organization, as cited in (Eswari et al., 2015), the probability that a patient between the ages of 30 and 70 dies from one of the four major Non-Communicable Diseases (NCDs), namely diabetes, cancer, stroke, and respiratory disease, is 26%.   In 2014, 60% of all deaths in India were caused by NCDs.  Moreover, according to the Global Status Report, as cited in (Eswari et al., 2015), NCDs are expected to claim 52 million lives globally by the year 2030.

The architecture for the predictive analysis included four phases:  Data Collection, Data Warehousing, Predictive Analysis, Processing Analyzed Reports.  Figure 1 illustrates the framework used for the Predictive Analysis System-Healthcare application, adapted from (Eswari et al., 2015). 

Figure 1.  Predictive Analysis Framework for Healthcare. Adapted from (Eswari et al., 2015).

Phase 1:  The Data Collection phase included raw diabetic data which is loaded into the system.  The data is unstructured including EHR, patient health records (PHR), clinical systems and external sources such as government, labs, pharmacies, insurance and so forth.  The data have different formats such as .csv, tables, text.  The data which was collected from various sources in the first phase was stored in Data Warehouses. 

Phase 2:  During the second phase of data warehousing, the data gets cleansed, and loaded to be ready for further processing.

Phase 3:  The third phase involved the Predictive Analysis, which used the predictive algorithm in the Hadoop Map/Reduce environment to predict and classify the type of diabetes mellitus (DM), the complications associated with each type, and the treatment type to be provided.  The Hadoop framework was used in this analysis because it can process extremely large amounts of health data by allocating partitioned data sets to numerous servers.  Hadoop utilized the Map/Reduce technology to solve different parts of the larger problem and integrate them into the final result.  Moreover, Hadoop utilized the Hadoop Distributed File System (HDFS) as the distributed file system. The Predictive Analysis phase involved Pattern Discovery and Predictive Pattern Matching.

With respect to the Pattern Discovery, it was important for DM to test patterns such as plasma glucose concentration, serum insulin, diastolic blood pressure, diabetes pedigree, Body Mass Index (BMI), age, and number of times pregnant.   The Pattern Discovery process included association rule mining between the diabetic type and other information such as lab results. It also included clustering to group similar patterns.  The classification step of the Pattern Discovery included the classification of patients' risk based on their health condition.  Statistics were used to analyze the Pattern Discovery.  The last step in the Pattern Discovery involved the application.  The process of the Pattern Discovery of the Predictive Analysis phase is illustrated in Figure 2.

Figure 2.  Pattern Discovery of the Predictive Analysis.

With respect to the Predictive Pattern Matching of the Predictive Analysis, the Map/Reduce operation was performed whenever the warehoused dataset was sent to the Hadoop system.  Pattern Matching is the process of comparing the analyzed threshold value with the obtained value.   The Mapping phase involved splitting the large data into small tasks for the Worker/Slave Nodes (WN).  As illustrated in Figure 3, the Master Node (MN) consists of a Name Node (NN) and a Job Tracker (JT), which use the Map/Reduce technique.   The MN sends the order to a Worker/Slave Node, which processes the pattern matching task for the diabetes data with the help of the Data Node (DN) and Task Tracker (TT), which reside on the same machine as the WN.  When a WN completed the pattern matching based on the requirement, the result was stored on an intermediate disk, known as a local write.  When the MN initiated the reduce task, the other allocated Worker Nodes read the processed data from the intermediate disks.  The reduce task is performed in the WN based on the query received from the Client by the MN.  The results of the reduce phase are distributed to various servers in the cluster.

Figure 3.  Pattern Matching System Using Map/Reduce. Adapted from (Eswari et al., 2015).
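To illustrate the map and reduce steps described above outside of Hadoop, the following minimal R sketch uses hypothetical records and an assumed threshold (neither is from (Eswari et al., 2015)); each record is mapped to a (diabetes type, match) pair, and the pairs are then reduced to a count of above-threshold patients per type.

  # Hypothetical patient records: diabetes type and a measured glucose value
  records <- data.frame(type    = c("Type1", "Type2", "Type2", "Type1", "Type2"),
                        glucose = c(180, 130, 210, 95, 160),
                        stringsAsFactors = FALSE)

  threshold <- 140   # assumed threshold value for the pattern matching

  # Map: emit a (key = diabetes type, value = 1 if above threshold) pair per record
  mapped <- lapply(seq_len(nrow(records)), function(i) {
    list(key = records$type[i],
         value = as.integer(records$glucose[i] > threshold))
  })

  # Reduce: group the emitted pairs by key and sum the values for each key
  keys    <- vapply(mapped, function(p) p$key, character(1))
  values  <- vapply(mapped, function(p) p$value, integer(1))
  reduced <- tapply(values, keys, sum)

  print(reduced)   # number of above-threshold patients per diabetes type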

Phase 4:  In this phase, the Analyzed Reports are processed and distributed to various servers in the cluster and replicated through several nodes depending on the geographical area.  Using the proper electronic communication technology to exchange the information of patients among healthcare centers can lead to obtaining proper treatment at the right time in remote locations at low cost.

The implementation of the Hadoop framework helped transform the health records of diabetic patients into useful analyzed results that help patients understand the complications associated with each type of diabetes.

References

Apurva, A., Ranakoti, P., Yadav, S., Tomer, S., & Roy, N. R. (2017, 12-14 Oct. 2017). Redefining cyber security with big data analytics. Paper presented at the 2017 International Conference on Computing and Communication Technologies for Smart Nation (IC3TSN).

Davenport, T. H., & Dyché, J. (2013). Big data in big companies. International Institute for Analytics.

Dezyre. (2016). 5 Healthcare Applications of Hadoop and Big Data Retrieved from https://www.dezyre.com/article/5-healthcare-applications-of-hadoop-and-big-data/85.

Eswari, T., Sampath, P., & Lavanya, S. (2015). Predictive methodology for diabetic data analysis in big data. Procedia Computer Science, 50, 203-208.

Mohammed, E. A., Far, B. H., & Naugler, C. (2014). Applications of the MapReduce Programming Framework to Clinical Big Data Analysis: Current Landscape and Future Trends. BioData mining, 7(1), 1.

Ward, M. J., Marsolo, K. A., & Froehle, C. M. (2014). Applications of business analytics in healthcare. Business Horizons, 57(5), 571-582.

NoSQL Database Application to Health Informatics Data Analytics.

Dr. Aly, O.
Computer Science

Introduction

The purpose of this discussion is to discuss and analyze NoSQL database types such as Cassandra and MongoDB and how they are used or applied to health informatics data analytics. The discussion is based on a project implemented by (Klein et al., 2015).  In this project, the researchers performed application-specific prototyping and measurement to identify NoSQL products which fit the provider's data model and query use cases and meet the provider's performance requirements.  The provider was a healthcare provider with specific requirements for employing an Electronic Health Record (EHR) system.  The project used three NoSQL databases, Cassandra, MongoDB, and Riak, as the candidates, based on product maturity and the availability of enterprise support.  The researchers faced the challenge of selecting the right NoSQL database during their work on the project.

This research study was selected because it is comprehensive and has rich information about the implementation of these three data stores in healthcare. Moreover, the study has additional useful information regarding healthcare, such as HL7 and the healthcare-specific data models of “FHIR Patient Resources” and “FHIR Observation Resources,” besides the performance framework YCSB.

NoSQL Database Application in Healthcare

The provider has been using a thick client system running at each site around the globe and connected to a centralized relational database.  The provider has no experience with NoSQL.  The purpose of the project was to evaluate NoSQL databases which would meet their needs.

The provider was a large healthcare provider requesting a new EHR system which supports healthcare delivery for over nine million patients in more than 100 facilities across the world.  The rate of the data growth is more than one terabyte per month. The data must be retained for ninety-nine years.  The technology of NoSQL was considered for two major reasons.  The first reason involved a Primary Data Store for the EHR system.  The second reason is to improve request latency and availability by using a local cache at each site.  

The project involved four major steps as discussed below.  Step four involved five major configuration tasks to test the identified data stores as discussed below as well.  This EHR system requires robust and strong replica consistency.  A comparison was performed between the identified data stores for the strong replica consistency vs. the eventual consistency.   

Project Implementation

Step 1: Identify the Requirements:  The first step in this project was to identify the requirements from the stakeholders of the provider. These requirements were used to develop the evaluation of the NoSQL databases.  There were two main requirements.  The first requirement involved high availability with low latency under a high load in a distributed system.   This first requirement reflected performance and scalability as measures to evaluate the NoSQL candidates.  The second requirement involved logical data models and query patterns supported by NoSQL, and replica consistency in a distributed framework.  This requirement reflected data model mapping as a measure to evaluate the NoSQL candidates.

Step 2:  Define Two Primary Use Cases for the Use of the EHR System:  The provider supplied two specific use cases for the EHR system.  The first use case was to read recent medical test results for a patient. This use case is regarded as the core function used to populate the user interface when a clinician selects a new patient.  The second use case was to achieve strong replica consistency when a new medical test result is written for a patient. The purpose of this strong replica consistency is to allow all clinicians using the EHR system to see the information needed to make health care decisions for the patient, regardless of whether the patient is at the same site or in another location.

Step 3:  Select the Candidate NoSQL Database:  The provider requested the evaluation of different data models of NoSQL data stores such as key-value, column, and document to determine the best-fit NoSQL which can meet their requirements.  Thus, Cassandra, MongoDB, and Riak were the candidates for this project based on the maturity of the product and enterprise support.

Step 4:  Performance Tests Design and Execution:  A systematic test process was designed and executed to evaluate the three candidates based on the use cases requirements defined earlier. This systematic test process included five major Tasks as summarized in Table 1. 

Table 1:  Summary of the Performance Tests Design and Execution Tasks.

Task 1:  Test Environment Configuration:  The test environment was developed using the three identified NoSQL databases: MongoDB, Cassandra, and Riak.  Table 2 shows the identified NoSQL databases, types, versions, and sources.  The test environment included two configurations. The first configuration involved a single-node server.  The purpose of the single-node environment was to validate the test environment for each database type.  The second configuration involved a nine-node environment. The purpose of the nine-node environment was to represent the production environment, which was geographically distributed across three data centers.   The dataset was sharded across three nodes and then replicated to two additional groups; each group has three nodes.  The replication configuration was implemented according to each NoSQL database.  For instance, when using MongoDB, the replica sets were implemented using the Primary/Secondary feature.  When using Cassandra, the replica configuration was implemented using the built-in data center awareness distribution feature.  When using Riak, the data was sharded across all nine nodes, with three replicas of each shard stored across the nine nodes.   Amazon EC2 (Elastic Compute Cloud) instances were used for the test environment implementation.  Table 3 describes the EC2 type and source.

Table 2: Summary of Identified NoSQL Databases, Types, Versions, Sources, and Implementation.

Table 3:  Details of Nodes, Types, and Size.

Task 2:  Data Model Mapping:  A logical data model was mapped to the identified data model for healthcare.  The identified data model was HL7 Fast Healthcare Interoperability Resources (FHIR).  HL7 (Health Level Seven) refers to a set of international standards for transferring clinical and administrative data between software applications, which are used by various healthcare providers.  These international standards focus on the application layer, which is layer 7 in the OSI model (Beeler, 2010).   Two models were used: “FHIR Patient Resources” and “FHIR Observation Resources.”  The patient information, such as names, address, and phone, was modeled using the “FHIR Patient Resources,” while the medical test-related information, such as test type, result quantity, and result units, was modeled using the “FHIR Observation Resources.”  The relationship between patient and test results was a one-to-many relationship (1:M), and the relationship between patient and observations was also one-to-many, as illustrated in Figure 1.

Figure 1:  The Logical Data Model and the Relationship between Patient, Test Result, and Observations.

This 1:M relationship between the patient and the test results and observations, and the need to efficiently access the most recently written test results, were very challenging for each identified NoSQL data store.  For MongoDB, the researchers used a composite index of two attributes (Patient ID, Observation ID) for the test result records, indexed by the lab result date-time stamp.  This approach enabled the efficient retrieval of the most recent test result records for a particular patient.  For Cassandra, the researchers used a similar approach, but with a composite index of three attributes (Patient ID, Lab Result, Date-Time Stamp).  The retrieval of the most recent test results was efficient using this three-attribute composite index in Cassandra because the results were returned sorted by the server. With respect to Riak, the 1:M relationship was more complicated than in MongoDB and Cassandra.  The key-value data model of Riak enables the retrieval of a value which has a unique key.  Riak has a “secondary index” feature to avoid a full scan when the key is not known.   However, each node in the cluster stores the secondary indices only for the shards stored on that node.  A query matching a secondary index therefore results in a “scatter-gather” operation performed by the “request coordinator,” which asks each node for records with the requested secondary index value, waits for all nodes to respond, and then sends the list of keys for the matching records back to the requester.  This operation causes latency in locating records, and the need for two round trips to retrieve the records had a negative impact on the performance of the Riak data store.  There is no technique in Riak to filter and return only the most recent observations for a patient; thus, all matching records must be returned and then sorted and filtered by the client.   Table 4 summarizes the data model mapping for each of the identified data stores and the impact on performance.

Table 4:  Data Model Mapping and Impact on Performance.

Task 3: Data Generation and Load: The dataset contained one million patient records (Patient Records: N=1,000,000), and ten million records for test results (Test Results Records: N=10,000,000). The number for the test result for a patient ranged from zero to twenty with an average of seven (Test Result for each Patient:  Min=0, Max=20, Mean=7). 

Task 4: Load Test Client:  For this task, the researchers used the Yahoo Cloud Serving Benchmark (YCSB) framework to manage the execution of the tests and to take the measurements.  One of the key features of YCSB, as indicated in (Cooper, Silberstein, Tam, Ramakrishnan, & Sears, 2010), is extensibility, which provides an easy definition of new workload types and flexibility to benchmark new systems.  The YCSB framework and workloads are available in open source for system evaluations.  The researchers replaced the simple data models, data sets, and queries of the YCSB framework so that the project implementation reflected the specific use cases of the provider.  Another YCSB feature is the ability to specify the total number of operations and the mix of read and write operations in a workload.  The researchers utilized this feature and applied 80% read and 20% write operations for each workload of the EHR system, in response to the provider’s requirements.  Thus, the read operations were used to retrieve the five most recent observations for a patient, and the write operations were used to insert a new observation record for a patient.   Two workload use cases were used.  The first use case was to test the data store as a local cache, which involved a write-only workload performed on a daily basis to load a local cache from a centralized primary data store with records for patients with a scheduled appointment that day.  The second use case was a read workload to flush the cache back to the centralized primary data store, as illustrated in Figure 2.

Figure 2.  Read (R) and Write (W) Workloads.

The “operation latency” was measured by the YCSB framework as the time between the request and the response from the data store.   Read and write operation latencies were calculated separately using the YCSB framework.  Besides the “operation latency,” the latency distribution is a key scalability metric in Big Data Analytics; thus, the researchers recorded both the average and the 95th percentile values.  Moreover, the researchers extended the test to include the overall throughput in operations per second, which reflected the total number of read and write operations divided by the total workload execution time, excluding the time for the initial setup and the cleanup, to obtain a more accurate result.

Task 5: Test Script Development and Execution:  In this task, the researchers performed three runs to decrease any impact associated with the transient events in the cloud infrastructure.  These three runs were performed for each of the identified data stores.  The standard deviation of the throughput for any three-run set never exceeded 2% of the average.   YCSB allows running multiple execution threads to create concurrent client sessions. Thus, the workload execution was repeated for a defined range of test client threads for each of the three-run tests. This workload execution approach created a corresponding number of concurrent database connections.  The researchers indicated that NoSQL data stores are not designed to operate with a large number of concurrent database client sessions. NoSQL databases can usually handle between 16-64 concurrent sessions.

The researchers analyzed the appropriate approach for distributing the multiple concurrent connections to the database across the server nodes. Based on their analysis, the researchers found that MongoDB utilizes a centralized router node, and all clients connected to that single router node.  With respect to Cassandra, the built-in data center awareness distribution feature created three sub-clusters of three nodes each, and client connections were spread uniformly across the three nodes in one sub-cluster.   With respect to the Riak data store, client connections are only allowed to be spread uniformly across the full set of nine nodes.  Table 5 summarizes how each data store handles concurrent connections in a distributed system.

Table 5.  A Summary of the Data Store Techniques for Concurrent Connections in Distributed System.

Results and Findings

The nine-node topology was configured to represent the production system. Moreover, a single-server configuration was also tested, which demonstrated its limitation for production use. Thus, the performance results do not reflect the single-server configuration but rather the nine-node topology configuration and test execution.  However, the comparison between the single-node and the distributed nine-node scenario was performed to provide insight into the performance of the data stores, the efficiency of the distributed systems, and the tradeoff between scaling out with more nodes and using faster nodes with more storage.  The results covered three major areas:  Strong Consistency Evaluation, Eventual Consistency Evaluation, and Performance Evaluation.

Strong Consistency Evaluation

With respect to MongoDB, all writes were committed on the Primary Server, while all reads were performed from the Primary Server.  In Cassandra, all writes were committed on a quorum formed on each of the three sub-clusters, while the read operations required a quorum only on the local sub-cluster.  With respect to Riak, the effect was to require a quorum on the entire nine-node cluster for both read and write operations.  Table 6 summarizes this strong consistency evaluation result.  

Table 6.  Strong Consistency Evaluation. Adapted from (Klein et al., 2015).

Eventual Consistency Evaluation: The eventual consistency tests were performed on Cassandra and Riak.  The results showed that writes were committed on one node with replication occurring after the operation was acknowledged to the client.  The results also showed that read operations were executed on one replica, which may or may not return the latest values written to the data store. 

Performance Evaluation: Cassandra demonstrated the best overall performance among the three data stores, peaking at approximately 3,500 operations per second, as illustrated in Figure 3.  The comparison in Figure 3 includes throughput for the read-only workload, the write-only workload, and the read/write workload, with replicated data and quorum consistency, for these three data stores.

Figure 3.  Workload Comparison among Cassandra, MongoDB, and Riak. 

Adapted from (Klein et al., 2015). 

The decreased contention for storage I/O improved performance, while the additional work of coordinating write and read quorums across replicas and data centers decreased performance.  With respect to Cassandra, the improved performance exceeded the degraded performance, resulting in a net higher performance in the distributed configuration than the other two data stores.  The built-in data center awareness distribution feature in Cassandra contributed to its improved performance because this feature separates the replication and sharding configurations.   This separation allowed larger read operations to be completed without the need for request coordination, such as the P2P proxying of client requests that occurs in Riak.

With respect to latency, the results for the read/write workload showed that MongoDB had a constant average latency as the number of concurrent sessions increased.  The results for the read/write operations also showed that, although Cassandra achieved the highest overall throughput, it had the highest latencies, indicating high internal concurrency in processing the requests.

Conclusion

The Cassandra data store demonstrated the best throughput performance, but with the highest latency, for the specific workloads and configurations tested.   The researchers attributed these Cassandra results to three factors.  The first is that Cassandra's hash-based sharding spread the request and storage load better than MongoDB.  The second reason is that the indexing features of Cassandra allowed efficient retrieval of the most recently written records, compared to Riak.  The third reason is that the P2P architecture and data center awareness feature of Cassandra provide efficient coordination of both read and write operations across the replica nodes and the data centers.

The results also showed that MongoDB and Cassandra performed more efficiently than the Riak data store while providing the strong replica consistency required by this application's data model.  The researchers concluded that MongoDB's more transparent mapping to the application data model, together with its indexing capabilities, made it a better fit for this application.

Moreover, the results showed that throughput varied by a factor of ten, read latency by a factor of five, and write latency by a factor of four, with the highest-throughput product delivering the highest latency. The throughput for workloads using strong consistency was 10-25% lower than for workloads using eventual consistency.

References

Beeler, G. W. J. (2010). Introduction to:  HL7 References Information Model (RIM).  ANSI/HL7 RIM R3-2010 and ISO 21731. Retrieved from https://www.hl7.org/documentcenter/public_temp_4F08F84F-1C23-BA17-0C2B98D837BC327B/calendarofevents/himss/2011/HL7ReferenceInformationModel.pdf.

Cooper, B. F., Silberstein, A., Tam, E., Ramakrishnan, R., & Sears, R. (2010). Benchmarking cloud serving systems with YCSB. Paper presented at the Proceedings of the 1st ACM symposium on Cloud computing.

Klein, J., Gorton, I., Ernst, N., Donohoe, P., Pham, K., & Matser, C. (2015, June 27-July 2). Application-Specific Evaluation of NoSQL Databases. Paper presented at the 2015 IEEE International Congress on Big Data.

NoSQL Databases: Cassandra vs. DynamoDB

Dr. Aly, O.
Computer Science

Introduction

The purpose of this discussion is to analyze the differences between two NoSQL databases, Cassandra and DynamoDB. The discussion begins with a brief overview of NoSQL and the data store types for NoSQL databases, followed by a more focused discussion of Cassandra and DynamoDB.

NoSQL Overview

NoSQL stands for “Not Only SQL” (EMC, 2015; Sahafizadeh & Nematbakhsh, 2015).  NoSQL is used for modern, scalable databases in the age of Big Data.  The scalability feature enables systems to increase throughput when demand increases during the processing of the data (Sahafizadeh & Nematbakhsh, 2015).  A platform can incorporate two types of scalability to support the processing of Big Data: horizontal scaling and vertical scaling. Horizontal scaling distributes the workload across many servers and nodes; servers can be added to increase throughput (Sahafizadeh & Nematbakhsh, 2015).  In vertical scaling, on the other hand, more processors, more memory, and faster hardware are installed on a single server (Sahafizadeh & Nematbakhsh, 2015).  NoSQL offers benefits such as support for mass storage, fast read and write operations, easy expansion, and low cost (Sahafizadeh & Nematbakhsh, 2015).  Examples of NoSQL databases are MongoDB, CouchDB, Redis, Voldemort, Cassandra, BigTable, Riak, HBase, Hypertable, ZooKeeper, Vertica, Neo4j, db4o, and DynamoDB.

NoSQL Data Stores Types

Data stores are categorized into four types:  document-oriented, column-oriented (or column-family), graph, and key-value (EMC, 2015; Hashem et al., 2015).  The purpose of the document-oriented database is to store and retrieve collections of information and documents.  It supports complex data in various formats such as XML and JSON, in addition to binary forms such as PDF and MS Word (EMC, 2015; Hashem et al., 2015).   A document is similar to a tuple, or row, in a relational database; however, the document-oriented database is more flexible and can retrieve documents and information based on their contents.  The document-oriented data store offers additional features such as the creation of indexes to increase search performance (EMC, 2015).   Document-oriented data stores can be used for managing web page content as well as for web analytics of log data (EMC, 2015). Examples of document-oriented data stores include MongoDB, SimpleDB, and CouchDB (Hashem et al., 2015).

The purpose of the column-oriented database is to store content in columns rather than rows, with attribute values belonging to the same column stored contiguously (Hashem et al., 2015).  The column-family database is used to store and render blog entries, tags, and viewers’ feedback; it is also used to store and update various web page metrics and counters (EMC, 2015). An example of a column-oriented database is BigTable.  In (EMC, 2015; Erl, Khattak, & Buhler, 2016), Cassandra is also listed as a column-family data store.

The key-value data store is designed to store and access data with the ability to scale to a very large size (Hashem et al., 2015).   It contains values and a key to access each value, and the values themselves can be complex (EMC, 2015).  A key-value data store can be useful when a login ID serves as the key to a customer’s preference values, or a web session ID as the key to the session’s values.  Examples of key-value databases include DynamoDB, HBase, Cassandra, and Voldemort (Hashem et al., 2015).  While HBase and Cassandra are described as the most popular and scalable key-value stores (Borkar, Carey, & Li, 2012), DynamoDB and Cassandra are described as the two popular AP (Availability and Partition tolerance) systems (M. Chen, Mao, & Liu, 2014).  Others, such as (Kaoudi & Manolescu, 2015), describe Apache Accumulo, DynamoDB, and HBase as the popular key-value stores.

The purpose of the graph database is to store and represent data using a graph model of nodes, edges, and properties related to one another through relations.  An example of a graph database is Neo4j (Hashem et al., 2015).  Table 1 provides examples of NoSQL Data Stores.

Table 1.  NoSQL Data Store Types with Examples.
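As a purely illustrative sketch of the four data store types listed in Table 1, the following Python snippet expresses the same hypothetical patient record in each style; all names and values are assumptions for illustration only.

```python
# Illustrative sketch only: one hypothetical patient record expressed in the
# four NoSQL styles described above, using plain Python structures.
import json

# Key-value: an opaque value retrieved by its key.
kv_store = {"patient:42": b'{"name": "A. Smith", "age": 63}'}

# Document-oriented: structured content (e.g., JSON) that can be queried.
document = json.loads('{"_id": 42, "name": "A. Smith", "diagnoses": ["I25.1"]}')

# Column-family: row key -> column family -> column -> value.
column_family = {"42": {"vitals": {"bp": "130/85", "hr": 72}}}

# Graph: nodes connected by labeled edges.
graph_edges = [("patient:42", "DIAGNOSED_WITH", "condition:I25.1")]
```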

Cassandra

Cassandra is described as the most popular NoSQL database (C. P. Chen & Zhang, 2014; Mishra, Dehuri, & Kim, 2016).  It is a second-generation distributed key-value store which was developed by Facebook in 2008 (Bifet, 2012; Cattell, 2011; C. P. Chen & Zhang, 2014; Rabl, Sadoghi, & Jacobsen, 2012).  It is also described as a clustered key-value database which uses column-oriented storage and redundant storage for accessibility, and which scales in both read/write performance and data size (Mishra et al., 2016).

Cassandra can handle very large amounts of data spread across many servers. It also provides a highly available service with no single point of failure (Bahrami & Singhal, 2015; Tilmann Rabl et al., 2012).  Failure detection and recovery are fully automated (Cattell, 2011). Cassandra adopts concepts from both DynamoDB and BigTable, integrating the distributed technology of DynamoDB with the data model of BigTable (M. Chen et al., 2014; Tilmann Rabl et al., 2012).  Thus, the architecture of Cassandra is a mixture of Google's BigTable and Amazon's DynamoDB, providing availability and scalability (M. Chen et al., 2014; Tilmann Rabl et al., 2012).  An example of a Cassandra application is Netflix, which uses it as the back-end database for its streaming services (Bifet, 2012).

Cassandra, like HBase, is written in Java and used under Apache licensing (Cattell, 2011).   Cassandra has column groups, uses memory to cache updates before they are flushed to disk, and periodically compacts the on-disk representation.  Cassandra supports both partitioning and replication (Cattell, 2011).  The partitioning and replication techniques in Cassandra are said to be similar to those of DynamoDB for achieving consistency (M. Chen et al., 2014).  However, Cassandra is said to have a weaker concurrency model than other systems, as there is no locking technique and replicas are updated asynchronously (Cattell, 2011).

When using Cassandra, newly available nodes are brought automatically into a cluster; the “phi accrual” algorithm detects node failure, and cluster membership is determined in a distributed fashion using a gossip-style algorithm (Cattell, 2011).   Tables in Cassandra take the form of a distributed four-dimensional structured map, where the four dimensions are row, column, column family, and super column (M. Chen et al., 2014). Cassandra provides the concept of the “super column,” which offers another level of grouping within column groups (Cattell, 2011).  A row is distinguished by a string key of arbitrary length (M. Chen et al., 2014).  The number of columns to be read or written does not matter, because an operation on a row is an atomic operation (M. Chen et al., 2014).  The columns can constitute clusters called column families, similar to the data model of BigTable (M. Chen et al., 2014).
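A minimal sketch of this four-dimensional map, written as a nested Python dictionary with hypothetical keys and values, may help visualize the model; it is an analogy only, not Cassandra's actual storage layout.

```python
# Purely illustrative analogy of Cassandra's distributed map as a nested dict:
# row key -> column family -> super column -> column -> value.
# All keys and values are hypothetical.
cassandra_like_map = {
    "patient-42": {                  # row key (string key of arbitrary length)
        "vitals": {                  # column family
            "2015-06-27": {          # super column (extra grouping level)
                "heart_rate": 72,    # column -> value
                "systolic_bp": 130,
            }
        }
    }
}

# Reading one "row" retrieves the needed columns in a single operation.
reading = cassandra_like_map["patient-42"]["vitals"]["2015-06-27"]
```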

Cassandra uses an “ordered hash index,” providing the benefits of both hash and B-tree indexes (Cattell, 2011).  However, sorting is slower with the ordered hash index than with a B-tree index (Cattell, 2011). Cassandra is said to be gaining a lot of momentum as an open source project, as it has reportedly scaled to about 150 machines or more in Facebook's production platform (Cattell, 2011).  Cassandra uses the eventual-consistency model, which is said to be inadequate for some applications; however, “quorum reads” of a majority of replicas provide a technique to obtain the latest data (Cattell, 2011).  Writes in Cassandra are atomic within a column family (Cattell, 2011).  Moreover, Cassandra supports versioning and conflict resolution techniques (Cattell, 2011). The key characteristics of Cassandra include a structured and unstructured P2P (peer-to-peer) system, a decentralized storage system, a symmetric system orientation, efficient latencies, linear scalability, and a map indexed by a unique row key and column key (Kalid, Syed, Mohammad, & Halgamuge, 2017).

With respect to security, Cassandra hashes all passwords with the MD5 hash function, which is very weak and can pose a threat if a malicious user bypasses client authorization (Sahafizadeh & Nematbakhsh, 2015).  A user can extract data because of the lack of an authorization technique in inter-node message exchange (Sahafizadeh & Nematbakhsh, 2015).  Cassandra is also a potential target for denial-of-service attacks because it creates one thread per client, and it does not support inline auditing (Sahafizadeh & Nematbakhsh, 2015).  Cassandra uses a query language called the Cassandra Query Language (CQL), which is similar to SQL (Sahafizadeh & Nematbakhsh, 2015).  Experiments showed that injection attacks are possible in Cassandra using CQL, much like SQL injection (Sahafizadeh & Nematbakhsh, 2015).  Moreover, Cassandra has a limitation in managing inactive connections (Sahafizadeh & Nematbakhsh, 2015).
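As a hedged illustration of the CQL injection concern, the sketch below uses the DataStax Python driver (cassandra-driver) and contrasts string concatenation with parameter binding; the cluster address, keyspace, and table names are hypothetical.

```python
# Hedged sketch: issuing a CQL statement with bound parameters rather than
# string concatenation, to reduce the CQL-injection risk noted above.
from cassandra.cluster import Cluster

session = Cluster(["10.0.0.1"]).connect("ehr")   # placeholder address/keyspace

user_supplied_id = "p-42'; DROP TABLE observations; --"   # hostile input

# Unsafe pattern (do NOT do this): building the CQL string directly.
# query = "SELECT * FROM patients WHERE id = '" + user_supplied_id + "'"

# Safer pattern: let the driver bind the value as a parameter.
rows = session.execute(
    "SELECT * FROM patients WHERE id = %s", (user_supplied_id,)
)
```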

DynamoDB

DynamoDB belongs to Amazon and is a highly available, scalable, low-latency, key-value NoSQL database (Kalid et al., 2017; Mishra et al., 2016; Russell & Van Duren, 2016).  DynamoDB is described as one of the earliest NoSQL databases, influencing the design of other NoSQL databases such as Cassandra (Mishra et al., 2016).  DynamoDB supports both key-value and document data stores (Sahafizadeh & Nematbakhsh, 2015).  The goal of DynamoDB is higher performance and high throughput (Thuraisingham, Parveen, Masud, & Khan, 2017).  DynamoDB can expand and shrink as required by the applications (Thuraisingham et al., 2017).  It supports an in-memory cache via the DynamoDB Accelerator, providing millisecond responses for millions of requests per second (Thuraisingham et al., 2017).
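A minimal, hedged sketch of basic DynamoDB key-value access through the AWS SDK for Python (boto3) is shown below; the table name, key schema, region, and item values are assumptions for illustration only.

```python
# Hedged sketch: basic key-value access against a DynamoDB table with boto3.
# Table name, key schema, and region are placeholders, not from the cited sources.
import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("Patients")            # hypothetical table

# Write an item keyed by its partition key.
table.put_item(Item={"patient_id": "p-42", "name": "A. Smith", "age": 63})

# Read it back with a strongly consistent read (eventual consistency is the default).
response = table.get_item(Key={"patient_id": "p-42"}, ConsistentRead=True)
item = response.get("Item")
```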

With respect to security, in DynamoDB the data security, authentication, and access control can be implemented on a per-table basis, leveraging the AWS Identity and Access Management system (Russell & Van Duren, 2016).  However, data encryption is not supported in DynamoDB.  It supports communication between the client and server using the HTTPS protocol.  DynamoDB supports authentication and authorization, and requests must be signed using HMAC-SHA256 (Sahafizadeh & Nematbakhsh, 2015).
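The following sketch illustrates, in simplified form, how an HMAC-SHA256 signature over a request payload can be computed with Python's standard library; actual AWS request signing (Signature Version 4) involves additional canonicalization steps not shown here, and the key and payload are placeholders.

```python
# Conceptual sketch only: an HMAC-SHA256 signature over a request payload.
import hashlib
import hmac

secret_key = b"example-secret-access-key"        # placeholder secret
payload = b"GET /patients/p-42 HTTP/1.1"         # placeholder request string

signature = hmac.new(secret_key, payload, hashlib.sha256).hexdigest()
print(signature)   # hex-encoded signature that would accompany the request
```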

Summary of Comparison between DynamoDB and Cassandra

DynamoDB was one of the earliest NoSQL databases and influenced the design of other NoSQL databases such as Cassandra.  Cassandra integrated the data model of Google's BigTable with the distributed system technology of DynamoDB, providing availability and scalability.  DynamoDB and Cassandra are both popular for Availability and Partition tolerance. The partitioning and replication techniques in Cassandra are similar to those of DynamoDB for achieving consistency. DynamoDB does not support data encryption, while Cassandra hashes all passwords with MD5; however, DynamoDB supports the HTTPS protocol.  Table 2 summarizes the comparison between DynamoDB and Cassandra.

Table 2.  Summary of Comparison between DynamoDB and Cassandra

References

Bahrami, M., & Singhal, M. (2015). The role of cloud computing architecture in big data Information granularity, big data, and computational intelligence (pp. 275-295): Springer.

Bifet, A. (2012). Mining big data in real time. Informatica, 37(1).

Borkar, V. R., Carey, M. J., & Li, C. (2012). Big data platforms: what’s next? XRDS: Crossroads, The ACM Magazine for Students, 19(1), 44-49.

Cattell, R. (2011). Scalable SQL and NoSQL data stores. ACM SIGMOD Record, 39(4), 12-27.

Chen, C. P., & Zhang, C.-Y. (2014). Data-intensive applications, challenges, techniques and technologies: A survey on Big Data. Information Sciences, 275, 314-347.

Chen, M., Mao, S., & Liu, Y. (2014). Big data: a survey. Mobile Networks and Applications, 19(2), 171-209.

EMC. (2015). Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data. (1st ed.): Wiley.

Erl, T., Khattak, W., & Buhler, P. (2016). Big Data Fundamentals: Concepts, Drivers & Techniques: Prentice Hall Press.

Hashem, I. A. T., Yaqoob, I., Anuar, N. B., Mokhtar, S., Gani, A., & Khan, S. U. (2015). The rise of “big data” on cloud computing: Review and open research issues. Information Systems, 47, 98-115.

Kalid, S., Syed, A., Mohammad, A., & Halgamuge, M. N. (2017). Big-data NoSQL databases: A comparison and analysis of “Big-Table”, “DynamoDB”, and “Cassandra”. Paper presented at the Big Data Analysis (ICBDA), 2017 IEEE 2nd International Conference on.

Kaoudi, Z., & Manolescu, I. (2015). RDF in the clouds: a survey. The VLDB Journal, 24(1), 67-91.

Mishra, B. S. P., Dehuri, S., & Kim, E. (2016). Techniques and Environments for Big Data Analysis: Parallel, Cloud, and Grid Computing (Vol. 17): Springer.

Rabl, T., Gómez-Villamor, S., Sadoghi, M., Muntés-Mulero, V., Jacobsen, H.-A., & Mankovskii, S. (2012). Solving big data challenges for enterprise application performance management. Proceedings of the VLDB Endowment, 5(12), 1724-1735.

Rabl, T., Sadoghi, M., & Jacobsen, H. (2012). Solving Big Data Challenges for Enterprise Application Performance Management.

Russell, B., & Van Duren, D. (2016). Practical Internet of Things Security: Packt Publishing Ltd.

Sahafizadeh, E., & Nematbakhsh, M. A. (2015). A Survey on Security Issues in Big Data and NoSQL. Int’l J. Advances in Computer Science, 4(4), 2322-5157.

Thuraisingham, B., Parveen, P., Masud, M. M., & Khan, L. (2017). Big Data Analytics with Applications in Insider Threat Detection: CRC Press.

Decision Tree in Diagnosing Heart Disease Patients

Dr. Aly, O.
Computer Science

Abstract

The purpose of this project is to discuss and analyze the Decision Tree in diagnosing heart disease patients.  The project focuses on the research study of (Shouman, Turner, & Stocker, 2011), who performed various experiments to evaluate the Decision Tree in the diagnosis of heart disease.  The key benefit of this study is the use of multiple variants combining different Decision Tree types, such as Information Gain, Gini Index, and Gain Ratio.  The study also performed the experiments with and without the voting technique. The project analyzed the steps performed by the researchers of (Shouman et al., 2011), the attributes used, the voting techniques, and the data discretization using the unsupervised methods of equal width and equal frequency and the supervised methods of chi merge and entropy.   The four major steps for the evaluation of the Decision Tree in the diagnosis of heart disease are Data Discretization, Data Partitioning, Training Data and Decision Tree Type Selection, and Reduced Error Pruning to develop a pruned Decision Tree.  The findings indicated that the Gain Ratio Decision Tree type increases the accuracy of the probability calculation.   The researcher of this project agrees with the study's authors that further experimentation with a larger dataset is needed to verify whether the results hold on a large set of data.

Keywords: Decision Tree, Diagnosis of Heart Disease, Multi-Variant.

Introduction

            Various research studies have used data mining techniques in healthcare to diagnose diseases such as diabetes, stroke, cancer, and heart disease.  Researchers have applied various data mining techniques to the diagnosis of heart disease, such as Naïve Bayes, Decision Tree, Neural Network, and Kernel Density, with different levels of accuracy, using defined groups, the bagging algorithm, and support vector machines.  Several research studies have demonstrated the successful application of the Decision Tree mining technique in the diagnosis of heart disease.  In this project, the focus is on the Decision Tree mining technique for heart disease diagnosis.  The discussion and the analysis of the Decision Tree in this project are based on the research study of (Shouman et al., 2011).

            The Decision Tree mining technique has various types, such as J4.8, C4.5, Gini Index, and Information Gain.  Most research studies applied the J4.8 Decision Tree, which is based on the Gain Ratio for extracting Decision Tree rules and on binary discretization.  Other techniques, such as the voting method and reduced error pruning, are known to produce more accurate Decision Trees.  In (Shouman et al., 2011), the researchers applied various techniques to different types of Decision Trees with the aim of better performance in the diagnosis of heart disease.  Sensitivity, specificity, and accuracy measures were calculated to evaluate the performance of the alternative Decision Trees.

            The risk factors associated with heart disease include age, blood pressure, smoking habit, total cholesterol, diabetes, hypertension, family history of heart disease, obesity, and lack of physical activity.   A Decision Tree cannot handle continuous variables directly; the continuous variables must be converted into discrete attributes, a process called "discretization."  There are two discretization approaches: binary and multi-interval.  The J4.8 and C4.5 Decision Trees use binary discretization for continuous-valued features.   The multi-interval discretization method is known to produce more accurate Decision Tree results than the binary method; however, it is used less often in research studies of heart disease diagnosis. Other methods, such as multiple-classifier voting and reduced error pruning, can be used to improve the accuracy of the Decision Tree in heart disease diagnosis.

            The data discretization method and the Decision Tree type are the two components which impact the performance of the Decision Tree as an analytical mining technique.   In an effort to identify the most accurate combination, the researchers of (Shouman et al., 2011) investigated multiple-classifier voting with different multi-interval discretization methods, such as equal width and equal frequency, and with different types of Decision Tree, such as Information Gain, Gini Index, and Gain Ratio.  Microsoft Visual Studio 2008 was used in this investigation.

Examination of the Decision Tree Analytical Technique for Heart Disease Patients

In this research study, the researchers used twelve Decision Tree variants by mixing discretization approaches with different Decision Tree types.  Each variant was examined through five different voting partitioning schemes of three, five, seven, nine, and eleven partitions.   The dataset used in this research study is the Cleveland Clinic Foundation heart disease dataset (UCI, 1988).  The dataset has seventy-six raw attributes; however, because the published experiments only refer to thirteen of them, the researchers restricted the testing to the same thirteen attributes to allow comparison with other literature results.  The selected dataset attributes are illustrated in Table 1.  Although the researchers refer to thirteen attributes, the table displays fourteen.  The researcher of this project investigated the additional attribute in (UCI, 1988) and found that the fourteenth attribute is the predicted attribute for the diagnosis of heart disease patients.

Table 1.  Selected Dataset Attributes. Adapted from (Shouman et al., 2011)

The tests were executed over seventy Decision Trees using the same dataset.  The dataset contains 303 rows, of which 297 are complete; the six rows with missing values were eliminated from the test.  The tests were performed once with the voting application and once without it, to evaluate the impact of voting on accuracy.  The research study implemented these tests using the four major steps below.

  1. Data Discretization.
  2. Data Partitioning.
  3. Training Data and Decision Tree Type Selection.
  4. Reduced Error Pruning Application to develop a pruned Decision Tree.

1. Data Discretization for Discrete Attributes

Data discretization can be either supervised or unsupervised.  The unsupervised discretization methods do not utilize class membership information, while the supervised methods use the class labels to drive the discretization process, as in the chi-square-based and entropy-based methods.  The discretization process converts the continuous attributes in the dataset into discrete attributes, and the discretization used in this study produces five intervals.   Chi merge and entropy are two of the most well-known supervised discretization methods.  Chi-merge discretization uses the chi-square (χ²) statistic to test whether the class is independent of membership in two adjacent intervals: if the class is independent of the intervals, the intervals are merged; if it is not, they remain separate. The pair of adjacent intervals with the lowest χ² value is merged, provided that the number of intervals is still greater than the pre-defined maximum number of intervals.  Entropy is an information-theoretic measure of the uncertainty contained in the training set.  The entropy-based method selects boundaries for discretization by evaluating candidate cut points.  The entropy for each candidate cut point is calculated after the instances are sorted into ascending numeric order.   The cut points that minimize entropy are selected recursively until a stopping criterion is met; here, the stopping criterion is reaching five intervals per attribute.

The equal-width interval and equal-frequency methods are unsupervised discretization methods.  The equal-width discretization algorithm determines the maximum and minimum values of the attribute to be discretized and divides the range into a user-defined number of equal-width discrete intervals.  The equal-frequency method, on the other hand, sorts all values in ascending order and divides the range so that each interval contains approximately the same number of values.  Figure 1 summarizes the data discretization, and a brief illustrative sketch of the two unsupervised methods follows it.

Figure 1.  Data Discretization in Decision Tree Analytical Mining. Adapted from (Shouman et al., 2011).
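As a hedged illustration of the two unsupervised methods just described, the following Python sketch bins a small set of hypothetical cholesterol readings into five intervals by equal width and by equal frequency; the values and names are illustrative assumptions, not data from the study.

```python
# Hedged sketch: equal-width and equal-frequency discretization of a
# continuous attribute into five intervals (values are hypothetical).
import numpy as np

values = np.array([126, 160, 175, 188, 199, 210, 230, 245, 260, 320], dtype=float)
n_intervals = 5

# Equal width: split the [min, max] range into 5 equally wide bins.
width_edges = np.linspace(values.min(), values.max(), n_intervals + 1)
equal_width_bins = np.digitize(values, width_edges[1:-1])

# Equal frequency: cut at the 20th, 40th, ... percentiles of the sorted values
# so each bin holds roughly the same number of values.
freq_edges = np.quantile(values, np.linspace(0, 1, n_intervals + 1))
equal_freq_bins = np.digitize(values, freq_edges[1:-1])

print(equal_width_bins)   # bin index (0-4) per value under equal width
print(equal_freq_bins)    # bin index (0-4) per value under equal frequency
```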

2. Data Partitioning

This Data Partitioning step involved testing with and without voting.  Applying voting in a classification algorithm has been shown to increase accuracy.  Thus, the researchers applied multiple-classifier voting by dividing the training data into smaller, equal subsets and developing a Decision Tree classifier for each data subset.  Each classifier represents a single vote, and the voting is based on either plurality voting or majority voting.   The researchers of (Shouman et al., 2011) experimented with voting subsets, dividing the data into between three and eleven subsets for each discretization method and each Decision Tree type.  The results indicated that nine subsets were the most successful division.  Figure 2 illustrates the Data Partitioning step with and without the voting technique, and a brief illustrative sketch of the voting idea follows Figure 2.

Figure 2.  Data Partitioning With Voting and Without Voting. Adapted from (Shouman et al., 2011).
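The following Python sketch is a rough, hypothetical illustration of the multiple-classifier voting scheme described above, using scikit-learn on synthetic data; it is not the authors' implementation (they used Microsoft Visual Studio 2008), and all values are placeholders.

```python
# Hedged sketch of multiple-classifier voting: split the training data into
# equal subsets, train one decision tree per subset, and let each tree cast
# one vote on a new case. Data are synthetic; not the study's dataset.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(297, 13))           # 297 rows, 13 attributes (as in the study)
y = rng.integers(0, 2, size=297)         # 0 = healthy, 1 = sick (synthetic labels)

n_subsets = 9                            # the most successful division reported
trees = []
for X_part, y_part in zip(np.array_split(X, n_subsets), np.array_split(y, n_subsets)):
    trees.append(DecisionTreeClassifier(criterion="entropy").fit(X_part, y_part))

new_case = rng.normal(size=(1, 13))
votes = [int(t.predict(new_case)[0]) for t in trees]
prediction = max(set(votes), key=votes.count)   # majority vote
print(votes, "->", prediction)
```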

3. Training Data and Decision Tree Type Selection

            In this experimentation, the researchers used three Decision Tree types (Information Gain, Gini Index, and Gain Ratio) together with reduced error pruning, as these are the most commonly used.   The Decision Tree types are distinguished by the mathematical model used to select the splitting attribute when extracting the Decision Tree rules.  Figure 3 illustrates the Training Data step for these three Decision Tree types.

Figure 3.  Decision Tree Types and Training Data. Adapted from (Shouman et al., 2011).

            In the Information Gain type, the splitting attribute selected is the one which maximizes the Information Gain and thereby minimizes the entropy value.  The splitting attribute is identified by calculating the Information Gain for each attribute and selecting the attribute which maximizes it.  The Information Gain for each attribute is calculated using the following mathematical formula, where k is the number of classes of the target attribute and Pi is the probability of class i, i.e., the number of occurrences of class i divided by the total number of instances.
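The formula itself does not appear in this text; assuming the standard entropy-based definition consistent with the description above, it can be written as:

```latex
\mathrm{Entropy}(S) = -\sum_{i=1}^{k} P_i \log_2 P_i,
\qquad
\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)
```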

            Information Gain can produce biased results because the measure favors tests with many outcomes, so attributes with many values tend to be selected.  The Gain Ratio Decision Tree type was therefore introduced to reduce the effect of this bias.  The Gain Ratio adjusts the Information Gain of each attribute to allow for the breadth and uniformity of the attribute values.  The mathematical formula for the Gain Ratio is as follows, where the split information is a value based on the column sums of the frequency table:
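The formula is again not reproduced in this text; assuming the standard definition consistent with the description above, it can be written as:

```latex
\mathrm{SplitInfo}(S, A) = -\sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|},
\qquad
\mathrm{GainRatio}(S, A) = \frac{\mathrm{Gain}(S, A)}{\mathrm{SplitInfo}(S, A)}
```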

In the Gini Index Decision Tree type, the impurity of the data is measured.  The Gini Index is calculated for each attribute in the dataset using the following mathematical formula, where the target attribute has k classes and the probability of class i is Pi.  The splitting attribute in the Gini Index type is the one with the largest reduction in the value of the Gini Index.
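Assuming the standard Gini impurity definition consistent with the description above, the formula can be written as:

```latex
\mathrm{Gini}(S) = 1 - \sum_{i=1}^{k} P_i^{2}
```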

4. Reduced Error Pruning Application to Develop a Pruned Decision Tree

The Reduced Error Pruning method is described as the fastest pruning technique and has been shown to provide accuracy with small decision rules.   In this step, the researchers applied the reduced error pruning method to the three selected Decision Tree types to improve decision tree performance.  After the decision tree rules were extracted from the training data, the reduced error pruning method was applied to those rules, producing more compact decision tree rules and minimizing the number of extracted rules.  Figure 4 illustrates the four-step process, including the Reduced Error Pruning step, which is preceded by the Training Data step for the three selected Decision Tree types.

Figure 4.  The Four Major Steps of the Decision Tree Process to Evaluate Alternative Techniques (Shouman et al., 2011).

Performance Measures

            Three measures were used to evaluate the performance of each combination: sensitivity, specificity, and accuracy.  Sensitivity is the proportion of positive instances (sick patients) which are correctly classified as positive, while specificity is the proportion of negative instances (healthy patients) which are correctly classified as negative.  Accuracy is the proportion of all instances which are correctly classified, as shown in Table 2.

Table 2.  Performance Measures.
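The contents of Table 2 are not reproduced in this text; assuming the standard confusion-matrix definitions in terms of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN), these measures can be written as:

```latex
\mathrm{Sensitivity} = \frac{TP}{TP + FN}, \qquad
\mathrm{Specificity} = \frac{TN}{TN + FP}, \qquad
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
```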

These performance evaluation measures (sensitivity, specificity, and accuracy) were used for the diagnosis of heart disease with the equal width, equal frequency, chi merge, and entropy discretization methods, the three selected Decision Tree types of Information Gain, Gini Index, and Gain Ratio, and the Reduced Error Pruning application.

The Research Findings

      The results without the voting application showed that the highest accuracy of 79.1% was achieved using the equal-width discretization Information Gain Decision Tree.  However, the results with the voting application showed a better accuracy of 84.1% using the equal-frequency discretization Gain Ratio Decision Tree, a 6.4% increase in accuracy over the test without voting.   The results also showed that the chi merge and entropy supervised discretization methods, with or without voting, did not show any improvement in the accuracy of the Decision Tree.  Table 3 summarizes these results, focusing on accuracy only, derived from the detailed tables in (Shouman et al., 2011).  Figure 5 visualizes these results as well.

Table 3.  Accuracy Result With and Without Voting.

Figure 5.  Visual View of the Evaluation of Alternative Decision Tree Techniques.

            The researchers compared their findings with the J4.8 Decision Tree and the bagging algorithm, which used the same dataset.  They found that their tests showed higher sensitivity, specificity, and accuracy than the J4.8 Decision Tree, and higher sensitivity and accuracy than the bagging algorithm.  Table 4 shows this comparison, adapted from (Shouman et al., 2011).

Table 4.  Comparison between the Proposed Model and J4.8 and Bagging Algorithm. Adapted from (Shouman et al., 2011).

            While most researchers use binary discretization with the Gain Ratio Decision Tree in the diagnosis of heart disease patients, the researchers of this study concluded, based on their experimentation, that applying multi-interval equal-frequency discretization with nine-vote Gain Ratio Decision Trees provides better results in the diagnosis of heart disease patients.  Moreover, accuracy can be improved by increasing the granularity in splitting attributes offered by multi-interval discretization.  The accuracy of the probability calculation for any given value is increased using the Gain Ratio calculation.  Voting across multiple similar trees validated the higher probability and enhanced the selection of useful splitting attribute values.  The researchers proposed further research to apply the same techniques and evaluate the same performance measures on a larger dataset.

Conclusion

            This project discussed and analyzed the Decision Tree in diagnosing heart disease patients.  The project focused on the research study of (Shouman et al., 2011), who performed various experiments to evaluate the Decision Tree in the diagnosis of heart disease.  The key benefit of this study is the use of multiple variants combining different Decision Tree types, such as Information Gain, Gini Index, and Gain Ratio.  The study also performed the experiments with and without the voting technique. The project analyzed the exact steps performed by the researchers of (Shouman et al., 2011), the attributes used, the voting techniques, and the data discretization using the unsupervised methods of equal width and equal frequency and the supervised methods of chi merge and entropy.   The four major steps for the evaluation of the Decision Tree in the diagnosis of heart disease are Data Discretization, Data Partitioning, Training Data and Decision Tree Type Selection, and Reduced Error Pruning to develop a pruned Decision Tree.  The findings indicated that the Gain Ratio Decision Tree type increases the accuracy of the probability calculation.   The researcher of this project agrees with the study's authors that further experimentation with a larger dataset is needed to verify whether the results hold on a large set of data.

References

Shouman, M., Turner, T., & Stocker, R. (2011). Using decision tree for diagnosing heart disease patients. Paper presented at the Proceedings of the Ninth Australasian Data Mining Conference-Volume 121.

UCI. (1988). Heart Disease Dataset. Retrieved from http://archive.ics.uci.edu/ml/datasets/Heart+Disease.

Theories and Techniques Used in the Diagnosis of Illness with Big Data Analytics

Dr. Aly, O.
Computer Science.

Introduction

The purpose of this discussion is to analyze the theories and techniques which can be used in the diagnosis of illnesses with Big Data Analytics.  The discussion is followed by some recommendations and the rationale for them.

Advanced Analytical Theories and Methods

According to (EMC, 2015), there are six main advanced analytical theories and methods which can be utilized to analyze Big Data in different fields such as finance, medicine, manufacturing, and marketing.  These six analytical models are Clustering, Association Rules, Regression, Classification, Time Series Analysis, and Text Analysis.  In this discussion, the researcher discusses and analyzes each model with the analytical methods used for it.  Based on the discussion and analysis of each model and its analytical methods, the discussion ends with a conclusion on the most appropriate analytical model and method for the diagnosis of illnesses.

  1. Clustering Model and K-Means Method
    The “Clustering” model is used to group similar objects using an unsupervised technique to find hidden structure within unlabeled data, because the labels to apply to the clusters cannot be determined in advance.  The K-means method is an unsupervised analytical method which is commonly used with the Clustering model. K-means identifies K clusters of objects based on the proximity of the objects to the center of the K groups, for a selected value of K.  The center of each group is the mean, or average, of the n-dimensional attribute vectors of the objects in that cluster.  Clustering with K-means can be used in processing images such as security images.  It can also be used in medicine, for example to target individuals for specific preventive measures or for participation in a clinical trial.  Moreover, the Clustering model with K-means can be used in customer segmentation to identify customers with certain purchase patterns, to increase sales, and to retain customers by reducing the “churn rate.”  The Clustering model can also be applied to human genetics and to biology, to group and classify plants and animals. Thus, marketing, medicine, biology, and economics can benefit from the application of the advanced analytical Clustering model.  Once the clusters are identified, labels can be applied to each cluster to classify each group based on its characteristics.  Thus, the Clustering model serves as a lead-in to the Classification model (a brief illustrative sketch of the Clustering and Classification models appears after this list).
  2. Classification Model and Decision Tree and Naïve Bayes Methods
    The Classification model is used for data mining. Unlike the Clustering model, the class labels are predetermined in the Classification model, and class labels are assigned to new observations.  Most Classification methods are “supervised” methods. Logistic regression is a popular analytical method for classification. The Classification model is commonly used for prediction.  It can be used in healthcare to diagnose patients with heart disease; it can also be used to filter spam emails.  The two main Classification analytical methods are the “Decision Tree,” also known as the “Prediction Tree,” and “Naïve Bayes.”  Additional Classification methods are also available, such as bagging, boosting, random forest, and support vector machines.  However, the focus of this discussion is on the two classifiers: the “Decision Tree” and “Naïve Bayes.”

    The “Prediction Tree” utilizes a tree structure to identify the sequences of decisions and consequences.  It has two categories of trees: the “Classification Tree” and the “Regression Tree.”  While the “Classification Tree” applies to output variables which are categorical, such as Yes|No, the “Regression Tree” applies to output variables which are numeric or continuous, such as the predicted price of a product.  The “Decision Tree” can be used as a checklist of symptoms during a doctor’s evaluation of a patient’s case.  The analytical method of Decision Trees can also be used in the artificial intelligence engine of a video game, in financial institutions to decide whether a loan can be approved, and in animal classification, such as mammal versus non-mammal.  The Decision Tree utilizes a “greedy algorithm,” choosing the best option available at that moment.

    The “Naïve Bayes” analytical method is a probabilistic classification method based on “Bayes’ Theorem,” or “Bayes’ Law.”  The theorem relates the probabilities of two events and their conditional probabilities.  Naïve Bayes assumes that the absence or presence of a particular feature of a class is unrelated to the absence or presence of other features.  The basic algorithm uses categorical input variables; however, variations of the algorithm can work with continuous variables.   Naïve Bayes classifiers perform better than the Decision Tree on categorical variables with many levels.  They are robust to irrelevant variables, which end up distributed among all classes, and they can handle missing values, unlike Logistic Regression.  The Naïve Bayes algorithm handles high-dimensional data efficiently.   Naïve Bayes is competitive with other learning algorithms such as Decision Trees and Neural Networks, and in some cases it outperforms them.  Naïve Bayes classifiers are easy to implement and can execute efficiently even without prior knowledge of the data.  They are commonly used to classify text documents, such as in email spam filtering.  Naïve Bayes classifiers can also be used in fraud detection, such as in auto insurance, and to compute the probability of a patient having a disease.
  3. Regression Model and Linear Regression and Logistic Regression:  Regression analysis is used to explain the influence of input (independent) variables on an outcome (dependent) variable.  It has two main categories: “Linear Regression” and “Logistic Regression.”  Linear Regression can estimate, for example, the expected income of a person, while Logistic Regression can compute the probability that an applicant will default on a loan.  Additional Regression models include “Ridge Regression” and “Lasso Regression.”  The discussion in this section is limited to Linear Regression and Logistic Regression.

    Linear Regression is an analytical method used to model the relationship between several input variables and an outcome variable, where the outcome variable is continuous.  A key assumption is that the relationship between the input and the output variables is linear.  The Linear Regression method is used in business, government, and other scenarios. Examples of its application include real estate, demand forecasting, and medicine.  In real estate, Linear Regression can be used to model residential home prices as a function of the home’s living area.  In demand forecasting, it can be used to predict the demand for products and services. In medicine, it can be used to analyze the effect of a proposed radiation treatment on reducing tumor sizes in cancer patients.

    Logistic Regression can be used to predict the probability of an outcome based on the input variables.  The outcome variable is categorical, unlike in Linear Regression, where the outcome variable is continuous. The Logistic Regression analytical method is used in medicine, finance, marketing, and engineering.  In medicine, Logistic Regression can be used to determine the probability that a patient will respond positively to a specific treatment.  In finance, it can be used to determine the probability that a person will default on a loan.  In marketing, it can be used to determine the probability that a customer will switch to another wireless carrier. In engineering, it can be used to determine the probability that a mechanical part will experience a malfunction or failure.

    As a brief comparison, the Linear Regression model is appropriate when the outcome variable is continuous, while the Logistic Regression model is appropriate when the outcome variable is categorical.  Each method can be applied in the scenarios explained above.
  4. Association Rules and the Apriori Method:  The Association Rules model is an unsupervised method.  It is descriptive rather than predictive, and it is often used to discover interesting relationships hidden in a large dataset.  Association Rules are commonly used for mining transactions in databases.  Example scenarios for Association Rules include identifying products which are purchased together and similar customers buying similar products.  The “Apriori” algorithm is the earliest algorithm for generating Association Rules, and “support” is its major component.  The Apriori algorithm takes a bottom-up, iterative approach to discovering frequent itemsets by identifying all possible items and determining the most frequent itemsets.  It decreases the computational workload by only examining the itemsets which meet the specified minimum support threshold.  However, if the dataset is very large, the Apriori method can be computationally expensive; thus, approaches such as partitioning, sampling, and transaction reduction can be used to improve the efficiency of the Apriori algorithm.

    The Association Rules model can be applied to marketing.  “Market basket analysis” is one specific implementation of Association Rules mining, and recommender systems and clickstream analyses also use it. Moreover, as indicated in (Wassan, 2014), a recommender system can also be used to extract relevant information from Electronic Health Records and offer healthcare recommendations to users or to various stakeholders of the clinical environment.
  5. Time Series Analysis and ARIMA Model: “Time Series” analysis attempts to model the underlying structure of observations taken over time.  Various methods are used for Time Series analysis, such as the “Autoregressive Integrated Moving Average” (ARIMA), “Autoregressive Moving Average with Exogenous Inputs” (ARMAX), spectral analysis, “Generalized Autoregressive Conditionally Heteroscedastic” (GARCH) models, Kalman filtering, and multivariate Time Series analysis.  The focus in this discussion is on the ARIMA model.

    There are four components of a Time Series: trend, seasonality, cyclic behavior, and random variation.  The Box-Jenkins methodology assumes that any trend or seasonality has been removed from the Time Series; the Time Series must be stationary to apply the ARIMA model properly.  The advantage of the ARIMA model is that the analysis can be based on the historical time series data for the variable.  A disadvantage of the ARIMA model is its minimum data requirement.

    The Time Series model can be used in finance, economics, biology, engineering, retail, and manufacturing.  In a retail scenario, the model can forecast future monthly sales.  In manufacturing, the model can be used to forecast future spare-part demand to ensure an adequate supply of parts to repair customer products. In finance, the model can be used for stock trading, for example with a technique called pairs trading, in which a strong correlation between the prices of two stocks is identified.
  6. Text Analysis and Text Mining:  “Text Analysis,” also known as “Text Analytics,” involves the processing and modeling of textual data to derive useful insight. “Text Mining” is an important component of Text Analysis which is used to discover relationships and patterns in large text collections.  The unstructured nature of text collections is a challenging factor in Text Analysis.  The typical Text Analysis process involves six steps: collecting the raw text, representing the text, computing the usefulness of each word or term with “Term Frequency-Inverse Document Frequency” (TFIDF), categorizing the documents by topic using topic modeling, performing sentiment analysis, and gaining greater insight.  This model can be used in scenarios such as social media.
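As referenced in items 1 and 2 above, the following hedged Python sketch illustrates the Clustering (K-means) and Classification (Naïve Bayes) models on a synthetic "patient" dataset; the features, labels, and scikit-learn usage are illustrative assumptions only, not part of the cited sources.

```python
# Hedged sketch: unsupervised clustering and supervised classification on
# synthetic data standing in for patient attributes. All values are made up.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))                # e.g., age, blood pressure, cholesterol, BMI
y = (X[:, 1] + X[:, 2] > 0).astype(int)      # synthetic "diseased" label

# Clustering (unsupervised): K-means groups patients around K centroids.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Classification (supervised): Naive Bayes predicts the diagnosis label.
model = GaussianNB().fit(X[:150], y[:150])   # train on the first 150 patients
accuracy = model.score(X[150:], y[150:])     # evaluate on the held-out 50
print(clusters[:10], accuracy)
```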

Conclusion

Based on the above discussion and analysis of the advanced analytical models and methods, the researcher concludes that the Clustering, Regression, and Classification models can be used in the medical field.  However, each model serves the medical field in a different area.  For instance, the Clustering model with the K-means analytical method can be used in the medical domain for preventive measures.   The Regression model can be used to analyze the effect of a medication or treatment on the patient and the probability that the patient will respond positively to a specific treatment.   The Classification model appears to be the most appropriate model for diagnosing illness: with the Decision Tree and Naïve Bayes methods, it can be used to diagnose patients with diseases such as heart disease and to compute the probability of a patient having a certain disease.

References

EMC. (2015). Data Science and Big Data Analytics: Wiley.

Wassan, J. T. (2014). Modelling Stack Framework for Accessing Electronic Health Records with Big Data Needs. International Journal of Computer Applications, 106(1).

Encryption: Public Key Encryption and Public Key Infrastructure (PKI)

Dr. Aly, O.
Computer Science

Introduction

The purpose of this discussion is to elaborate on the previous discussion of encryption and to examine the functionality provided by public key encryption and the public key infrastructure (PKI).

Public Key Infrastructure (PKI)

The PKI is a framework which enables the integration of various services related to cryptography. The purpose of the PKI is to provide confidentiality, integrity, access control, authentication, and, most importantly, non-repudiation.  Encryption/decryption, digital signature, and key exchange are the three primary functions of the PKI (Srinivasan, 2016).

There are three major functional components of the PKI.  The first is the Certificate Authority (CA), an entity which issues certificates; one or more in-house servers, or a trusted third party such as VeriSign or GTE, can provide the CA function.  The second is the repository for keys, certificates, and Certificate Revocation Lists (CRLs), which is usually based on a Lightweight Directory Access Protocol (LDAP)-enabled directory service.  The third is the management function, which is typically implemented via a management console (RSA, 1999).  Moreover, if the PKI provides automated key recovery, there may also be a key recovery service; key recovery is an advanced function required to recover data or messages when a key is lost.  A PKI may also include a Registration Authority (RA), an entity dedicated to user registration and to accepting requests for certificates.  User registration is the process of collecting information about the user and verifying the user's identity, which is then used to register the user according to a policy; this process is different from the process of creating, signing, and issuing a certificate.  For instance, the Human Resources department may manage the RA function while the IT department manages the CA, and a separate RA makes it harder for any single department to subvert the security system.  Organizations can choose to have registration handled by a separate RA or included as a function of the CA.  Figure 1 illustrates the main server components of a PKI (certificate server, certificate repository, and key recovery server, together with a management console), as well as PKI-enabled application building blocks (RSA, 1999).

Figure 1.  The Main Server Components of PKI (RSA, 1999).

PKI Standards

The PKI standards permit multiple PKIs to interoperate and multiple applications to interface with a single, consolidated PKI.  PKI standards are required for enrollment procedures, certificate formats, CRL formats, certificate enrollment message formats, digital signature formats, and challenge-and-response protocols.  The primary forum for interoperable PKI standards is the PKI working group of the Internet Engineering Task Force (IETF), known as the PKIX group, which addresses PKI for X.509 certificates (RSA, 1999).  The PKIX specification is based on two other standards: X.509 from the International Telecommunication Union (ITU) and the Public Key Cryptography Standards (PKCS) from RSA Data Security (RSA, 1999).

Standards That Rely on PKI

There are standards which rely on PKI; most major security standards are designed to work with it.  Secure Sockets Layer (SSL) and Transport Layer Security (TLS), which are used to secure access to Web servers and web-based applications, rely on PKI.  Secure/Multipurpose Internet Mail Extensions (S/MIME), used for secure messaging, relies on PKI.  The Secure Electronic Transaction (SET) standard for securing bank card payments and IPsec for securing VPN connections also require PKI (Abernathy & McMillan, 2016; RSA, 1999; Stewart, Chapple, & Gibson, 2015).

The PKI Functions

The most common PKI functions are issuing certificates, revoking certificates, creating and publishing CRLs, storing and retrieving certificates and CRLs, and key lifecycle management.  The enhanced and emerging functions of PKI include the time-stamping and policy-based certificate validation.  The summary of the PKI functions is illustrated in Table 1, adapted from (RSA, 1999). 

Table 1.  PKI Functions (RSA, 1999).

Public Key Encryption

In 1976, the idea of public key cryptography was first presented at Stanford University by Martin Hellman, Ralph Merkle, and Whitfield Diffie (Janczewski, 2007; Maiwald, 2001; Srinivasan, 2016).  There are three requirements for a public key encryption method.  First, when the decryption process is applied to the encrypted message, the result must be the same as the original message before it was encrypted.  Second, it must be exceedingly difficult to deduce the decryption (private) key from the encryption (public) key.  Third, the encryption must not be breakable by a plaintext attack: since the encryption and decryption algorithms and the encryption key are public, people attempting to break the encryption can experiment with the algorithms to try to find flaws in the system (Janczewski, 2007).

One popular public key encryption method was developed by a group at MIT in 1978 and was named RSA after the initials of its three members, Ron Rivest, Adi Shamir, and Leonard Adleman (Janczewski, 2007).   The RSA algorithm was patented by MIT, and the patent was then handed over to a California company called Public Key Partners (PKP), which held an exclusive commercial license to sell and sublicense the RSA public key cryptosystem.  PKP also held other patents covering public key cryptography algorithms.  RSA encryption can in principle be broken by factoring the numbers involved, but this risk is largely theoretical because of the massive amount of time required to factor large numbers.   However, RSA is too slow for encrypting large amounts of data; thus, it is often used for encrypting the key used in a private key method such as the International Data Encryption Algorithm (IDEA) (Janczewski, 2007).

The main difference between symmetric key encryption and public key encryption is the number of keys used in the operation.  Symmetric key encryption utilizes a single key both to encrypt and decrypt information, while public key encryption utilizes two keys: one key is used to encrypt, and a different key is then used to decrypt the information (Maiwald, 2001).  Figure 2 illustrates the primary public key, or asymmetric, encryption operation.  Both the sender and receiver must have a key. The keys are related to each other and are called a key pair, but they are different.  The relationship between the keys is that information encrypted with one key can be decrypted only with the other key.  One key is called private, while the other key is called public.  The private key is kept secret by the owner of the key pair.  The public key is published along with information about who the owner is; it can be published because there is no practical way to derive the private key from it.

Figure 2.  Public Key Encryption Operation (Maiwald, 2001).

If confidentiality is desired, the encryption is performed with the public key, so that only the owner of the key pair can decrypt the information, since the private key is kept secret by the owner.   If authentication is desired, the owner of the key pair encrypts the information with the private key.   The integrity of the information can be checked if the original information was encrypted with the owner's private key (Maiwald, 2001; Stewart et al., 2015).

Asymmetric key cryptography, or public key encryption, provides an extremely flexible infrastructure, facilitating simple, secure communication between parties that do not necessarily know each other before initiating the communication.  Public key encryption also provides the framework for the digital signing of messages to ensure non-repudiation and message integrity, and it provides a scalable cryptographic architecture for use by large numbers of users.  The significant strength of public key encryption is its ability to facilitate communication between parties previously unknown to each other.  This is made possible by the PKI hierarchy of trust relationships.  These trusts permit combining asymmetric cryptography with symmetric cryptography, along with hashing and digital certificates, providing hybrid cryptography (Abernathy & McMillan, 2016; Maiwald, 2001; Stewart et al., 2015).

The limitation of public key encryption is that it tends to be computationally intensive and is thus much slower than symmetric key systems.  However, if public key encryption is teamed with symmetric key encryption, the result is a much stronger system: the public key system is used to exchange keys and authenticate both ends of the connection, and the symmetric key system is then used to encrypt the rest of the traffic, as it is faster than the public key system (Abernathy & McMillan, 2016; Maiwald, 2001; Stewart et al., 2015).
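A hedged sketch of this hybrid pattern, assuming the third-party Python cryptography package, is shown below: RSA protects only the small symmetric key, while the faster symmetric (Fernet) cipher protects the bulk data. The message content is a placeholder.

```python
# Hedged sketch: hybrid encryption. A symmetric (Fernet) key encrypts the bulk
# data; the recipient's RSA public key encrypts only that symmetric key.
from cryptography.fernet import Fernet
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding

# Recipient's key pair (the private key stays with the recipient).
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

# Symmetric key encrypts the (possibly large) message quickly.
sym_key = Fernet.generate_key()
ciphertext = Fernet(sym_key).encrypt(b"patient record ...")   # placeholder data

# RSA (slow, but applied to only a few bytes) protects the symmetric key in transit.
oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)
wrapped_key = public_key.encrypt(sym_key, oaep)

# The recipient unwraps the symmetric key with the private key, then decrypts.
recovered_key = private_key.decrypt(wrapped_key, oaep)
plaintext = Fernet(recovered_key).decrypt(ciphertext)
```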

References

Abernathy, R., & McMillan, T. (2016). CISSP Cert Guide: Pearson IT Certification.

Janczewski, L. (2007). Cyber warfare and cyber terrorism: IGI Global.

Maiwald, E. (2001). Network security: a beginner’s guide: McGraw-Hill Professional.

RSA. (1999). Understanding Public Key Infrastructure (PKI). Retrieved from ftp://ftp.rsa.com/pub/pdfs/understanding_pki.pdf, White Paper

Srinivasan, M. (2016). CISSP in 21 Days: Packt Publishing Ltd.

Stewart, J., Chapple, M., & Gibson, D. (2015). ISC Official Study Guide.  CISSP Security Professional Official Study Guide (7th ed.): Wiley.

Encryption

Dr. Aly, O.
Computer Science

Introduction

The purpose of this discussion is to analyze encryption, the encryption types, the impact of encryption, and the use of encryption when streaming data over the network.

Cryptography and Encryption

Cryptography comprises a set of algorithms and system-design principles, some well developed and some just emerging, for protecting data in the era of Big Data and Cloud Computing.  Cryptography is a field of knowledge whose products are encryption technologies.  With well-designed protocols, encryption technology is an inhibitor to the compromise of privacy.  However, it is not a "silver bullet" that cuts through the complexity of the existing issues of privacy and security (PCAST, May 2014).

Encryption technology involves the use of a key, and only with the key can the encrypted data be used.  At every stage of the life of the key, it is potentially open to misuse that can ultimately compromise the data which the key was intended to protect.  If users with access to private keys can be forced into sharing them, no system based on encryption is secure.

Until the 1970s, keys were distributed physically, on paper or computer media, protected by registered mail or armed guards.  However, the invention of public-key cryptography changed everything.  Public-key cryptography allows individuals to broadcast their personal key publicly.  However, this public key is only an encryption key, useful for turning plaintext into cryptotext which is meaningless to others.  The private key is used to transform cryptotext back into plaintext and is kept secret by the recipient (PCAST, May 2014).

When encryption functions are described, the message typically is in the form of plaintext, represented by the letter P.  The sender of the message uses a cryptographic algorithm to encrypt the plaintext message and generate a ciphertext, or cryptotext, represented by the letter C.  The message is transmitted, and the recipient uses a predetermined algorithm to decrypt the ciphertext message and retrieve the plaintext version.  All cryptographic algorithms rely on keys to maintain their security.  A key is just a number, usually a large binary number.  Every algorithm has a specific "keyspace," which is the range of values that are valid for use as a key for that algorithm and is defined by its "bit size."  Thus, a 128-bit key can take any value from 0 to 2^128 − 1.  It is essential to protect the secret keys, because the security of the cryptography relies on the ability to keep the keys private (Abernathy & McMillan, 2016; Maiwald, 2001; Stewart, Chapple, & Gibson, 2015).  Figure 1 illustrates the basic encryption operation (Maiwald, 2001).

Figure 1.  Basic Encryption Operation (Maiwald, 2001).
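The basic operation described above (plaintext P encrypted to ciphertext C and back) can be sketched in Python; this is an illustrative sketch using the Fernet symmetric recipe from the third-party cryptography package, and the message text is hypothetical.

# Minimal sketch of the basic encryption operation and of keyspace size.
from cryptography.fernet import Fernet

key = Fernet.generate_key()          # the secret key
P = b"plaintext message"             # plaintext P
C = Fernet(key).encrypt(P)           # ciphertext C, meaningless without the key
assert Fernet(key).decrypt(C) == P   # only a holder of the key recovers P

# A 128-bit key has 2**128 possible values, which defines the keyspace an
# attacker would have to search without knowing the key.
print(2 ** 128)  # 340282366920938463463374607431768211456 (about 3.4e38)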

Modern cryptosystems utilize computationally sophisticated algorithms and long cryptographic keys to meet the four cryptographic goals discussed below.  Three types of algorithm are commonly used: symmetric encryption algorithms, asymmetric encryption algorithms, and hashing algorithms (Abernathy & McMillan, 2016; Connolly & Begg, 2015; Stewart et al., 2015; Woo & Lam, 1992).

Four Goals for Encryption

Cryptography provides an additional level of security to data during processing, storage, and communications.  A series of increasingly sophisticated algorithms has been designed to ensure confidentiality, integrity, authentication, and non-repudiation.  At the same time, attackers have devoted time to undermining this additional security layer.  There are four fundamental goals for organizations that use cryptographic systems: confidentiality, integrity, authentication, and non-repudiation.  However, not all cryptosystems are intended to achieve all four goals (Abernathy & McMillan, 2016; Stewart et al., 2015).

Confidentiality ensures that the data remains private while at rest, such as when stored on a disk, or in transit, such as during transmission between two or more parties.  Confidentiality is the most common reason for using cryptosystems.  There are two types of cryptosystems that enforce confidentiality: symmetric key and asymmetric key algorithms.  Symmetric key cryptosystems use a shared secret key available to all users of the cryptosystem.  Asymmetric key algorithms use a combination of a public key and a private key for each user of the system (Abernathy & McMillan, 2016; Stewart et al., 2015).

When implementing a cryptographic system to provide confidentiality, the data types must be considered, whether the data is at rest or in motion.  Data at rest is data stored in a storage area waiting to be accessed; for instance, data stored on hard drives, backup tapes, cloud storage services, USB devices, and other storage media.  Data in motion, or data "on the wire," is data being transmitted across the network between systems; it might be traveling on a corporate network, a wireless network, or the public Internet.  Both types of data pose different confidentiality risks against which a cryptographic system can protect.  For instance, data in motion may be susceptible to eavesdropping attacks, while data at rest is more susceptible to the theft of physical devices (Abernathy & McMillan, 2016; Stewart et al., 2015).

Integrity ensures that the data is not modified without authorization.  If integrity techniques are in place, the recipient of a message can ensure that the message received is identical to the message sent.  Integrity provides protection against all forms of modification, such as intentional modification by a third party attempting to insert false information and unintentional modification caused by faults in the transmission process.  Message integrity is enforced through the use of encrypted message digests, known as digital signatures, created upon transmission of a message.  Integrity can be enforced by both public and secret key cryptosystems (Abernathy & McMillan, 2016; Stewart et al., 2015).

Authentication is a primary function of cryptosystems and verifies the claimed identity of system users; a secret code can be used for authentication.  Non-repudiation assures the recipient that the message was originated by the sender and not by someone masquerading as the sender.  Moreover, non-repudiation prevents the sender from denying that the message was sent or from repudiating its contents.  Secret key, or symmetric key, cryptosystems do not provide this guarantee; non-repudiation is offered only by public key, or asymmetric, cryptosystems (Abernathy & McMillan, 2016; Stewart et al., 2015).

Symmetric Cryptosystem

Symmetric key algorithms rely on a "shared secret" encryption key which is distributed to all users who participate in the communication.  The key is used by all members both to encrypt and to decrypt messages: the sender encrypts with the shared key, and the receiver decrypts with the same shared key.  Symmetric encryption is difficult to break when a sufficiently long key is used, and it is primarily employed for bulk encryption to meet the confidentiality goal.  Symmetric key cryptography is also called "secret key cryptography" or "private key cryptography."  Common symmetric cryptosystems include the Data Encryption Standard (DES), Triple DES (3DES), the International Data Encryption Algorithm (IDEA), Blowfish, Skipjack, and the Advanced Encryption Standard (AES).  The advantage of the symmetric cryptosystem is that it operates at high speed; it is roughly 1,000 to 10,000 times faster than asymmetric encryption (Abernathy & McMillan, 2016; Connolly & Begg, 2015; Maiwald, 2001; Stewart et al., 2015).
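A minimal Python sketch of shared-key encryption with AES, one of the algorithms listed above, is shown below using AES-GCM from the third-party cryptography package; the message and the key handling are illustrative only, not a prescribed implementation.

# Minimal sketch: both parties hold the same shared secret key.
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

shared_key = AESGCM.generate_key(bit_length=128)   # distributed to all parties
nonce = os.urandom(12)                             # must be unique per message

# The sender encrypts with the shared key ...
ciphertext = AESGCM(shared_key).encrypt(nonce, b"bulk data to protect", None)

# ... and the receiver decrypts with the same shared key.
plaintext = AESGCM(shared_key).decrypt(nonce, ciphertext, None)
assert plaintext == b"bulk data to protect"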

The symmetric cryptosystem has limitations.  The first is key distribution: the members must implement a secure method of exchanging the shared secret key before establishing communication with the symmetric key protocol, and if a secure electronic channel does not exist, an offline key distribution method must be used.  The second limitation is that it does not support non-repudiation, because sharing the same key makes it difficult to prove the source of a message.  The third limitation is that the algorithm does not scale well, since each pair of communicating users requires its own shared key.  The last limitation of the symmetric cryptosystem is the need to regenerate keys frequently (Abernathy & McMillan, 2016; Connolly & Begg, 2015; Stewart et al., 2015).

Asymmetric Cryptosystem

The asymmetric key algorithm, also known as the "public key algorithm," provides a solution to the limitations of symmetric key encryption.  This system uses two keys: a public key which is shared with all users, and a private key which is secret and known only to the user.  If the public key encrypts a message, the private key can decrypt it; likewise, if the private key encrypts a message, the public key decrypts it.  The asymmetric key cryptosystem supports digital signature technology.  The advantages of the asymmetric key algorithm include the need to generate only one public-private key pair for each new user and the ease of removing users; moreover, key regeneration is required only when a user's private key is compromised.  The asymmetric key algorithm provides integrity, authentication, and non-repudiation.  Key distribution is a simple process in the asymmetric key algorithm, and it does not require a pre-existing relationship to provide a secure mechanism for data exchange (Abernathy & McMillan, 2016; Connolly & Begg, 2015; Maiwald, 2001; Stewart et al., 2015).

The limitation of the asymmetric key cryptosystem is its slow speed of operation.  Thus, many applications which require the secure transmission of large volumes of data employ the public key cryptographic system to establish a connection and then exchange a symmetric secret key; the remainder of the session then utilizes the symmetric cryptographic approach.  Figure 2 illustrates a comparison of symmetric and asymmetric cryptographic systems (Abernathy & McMillan, 2016; Connolly & Begg, 2015; Stewart et al., 2015).
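One way the asymmetric step of this exchange can be performed is with a Diffie-Hellman-style key agreement.  The sketch below uses X25519 and HKDF from the third-party cryptography package; it is an illustrative sketch of the general idea, not the specific mechanism described by the cited sources, and the party names are hypothetical.

# Minimal sketch: derive a shared symmetric session key from an
# asymmetric exchange, then use it for the bulk of the session.
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric.x25519 import X25519PrivateKey
from cryptography.hazmat.primitives.kdf.hkdf import HKDF

# Each side generates a key pair and publishes only the public half.
alice_private = X25519PrivateKey.generate()
bob_private = X25519PrivateKey.generate()

# Each side combines its private key with the other's public key; both
# arrive at the same shared secret without ever transmitting it.
alice_secret = alice_private.exchange(bob_private.public_key())
bob_secret = bob_private.exchange(alice_private.public_key())
assert alice_secret == bob_secret

# The shared secret is stretched into a symmetric session key that then
# encrypts the remainder of the (much larger) traffic.
session_key = HKDF(algorithm=hashes.SHA256(), length=32, salt=None,
                   info=b"session key").derive(alice_secret)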

Figure 2.  Comparison of Symmetric and Asymmetric Cryptographic Systems. Adapted from (Abernathy & McMillan, 2016; Stewart et al., 2015).

The Hashing Algorithm

The hashing algorithm produces message digests, which are summaries of the content of the message.  It is computationally infeasible to derive the original message from an ideal hash value, and two different messages are extremely unlikely to produce the same hash value.  Some of the more common hashing algorithms in use today include Message Digest 2 (MD2), Message Digest 5 (MD5), the Secure Hash Algorithms (SHA-0, SHA-1, and SHA-2), and the Hashed Message Authentication Code (HMAC).  Unlike symmetric and asymmetric algorithms, hashing algorithms are publicly known.  Hash functions are performed in one direction only; they are not meant to be run in reverse.  The hashing algorithm ensures the integrity of the data by creating a number which is sent along with the data.  When the data reaches the destination, this number can be used to determine whether even a single bit has changed, by recalculating the hash value from the data which was received.  The hashing algorithm also helps in protecting against undetected corruption (Abernathy & McMillan, 2016; Connolly & Begg, 2015; Stewart et al., 2015).
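The integrity check described above can be sketched with Python's standard hashlib and hmac modules; the message and the shared secret are illustrative only.

# Minimal sketch: a digest travels with the data and is recomputed at the
# destination; any single changed bit yields a completely different digest.
import hashlib
import hmac

message = b"report contents sent to the destination"

digest_sent = hashlib.sha256(message).hexdigest()
digest_received = hashlib.sha256(message).hexdigest()   # recomputed on arrival
assert digest_sent == digest_received

# An HMAC adds a shared secret key, so an attacker who alters the data
# cannot simply recompute a matching value.
secret = b"shared-secret-key"
tag = hmac.new(secret, message, hashlib.sha256).hexdigest()
assert hmac.compare_digest(tag, hmac.new(secret, message, hashlib.sha256).hexdigest())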

Attacks Against Encryption

Encryption systems can be attacked in three ways: through weaknesses in the algorithm, through brute force against the key, or through a weakness in the surrounding system.  An attack on the algorithm exploits a weakness in the way the algorithm changes plaintext into ciphertext, so that the plaintext may be recovered without knowing the key; algorithms with weaknesses of this type are rarely considered reliable enough for use.  Brute-force attacks are attempts to use every possible key on the ciphertext to find the plaintext.  On average, 50% of the keys must be tried before finding the correct key.  The strength of the algorithm is then defined only by the number of keys that must be attempted: the longer the key, the larger the total number of keys and the larger the number of keys which must be tried until the correct key is found.  Brute-force attacks will always succeed eventually if enough time and resources are applied.  Thus, algorithms should be measured by the length of time the information is expected to remain protected even in the face of a brute-force attack; an algorithm is considered computationally secure if the cost of acquiring the key through brute force is more than the value of the information being protected.  The last type of attack, through weaknesses in the surrounding system, can involve keeping the key in a file protected by a password that is weak and can be guessed easily (Maiwald, 2001).
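The brute-force idea can be illustrated with a deliberately weak toy cipher in Python; the single-byte XOR "cipher" and the plaintext below are hypothetical teaching devices, not an algorithm discussed in the cited sources, and the point is only that a 256-value keyspace is searched instantly while a 2^128 keyspace is not.

# Toy brute-force illustration: a one-byte key gives only 256 possible keys.
def xor_cipher(data: bytes, key: int) -> bytes:
    return bytes(b ^ key for b in data)

secret_key = 0x5A
ciphertext = xor_cipher(b"attack at dawn", secret_key)

# The attacker tries every possible key and recognizes the plaintext.
for candidate in range(256):                   # the entire keyspace
    guess = xor_cipher(ciphertext, candidate)
    if guess == b"attack at dawn":
        print(f"key found: {candidate:#04x}")  # prints: key found: 0x5a
        break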

References

Abernathy, R., & McMillan, T. (2016). CISSP Cert Guide: Pearson IT Certification.

Connolly, T., & Begg, C. (2015). Database Systems: A Practical Approach to Design, Implementation, and Management (6th Edition ed.): Pearson.

Maiwald, E. (2001). Network security: a beginner’s guide: McGraw-Hill Professional.

PCAST. (May 2014). Big Data and Privacy: A Technological Perspective.

Stewart, J., Chapple, M., & Gibson, D. (2015). CISSP (ISC)² Certified Information Systems Security Professional Official Study Guide (7th ed.): Wiley.

Woo, T. Y., & Lam, S. S. (1992). Authentication for distributed systems. Computer, 25(1), 39-52.

Business Impact Analysis (BIA)

Dr. Aly, O.
Computer Science

Introduction

The purpose of this discussion is to analyze the role of the Business Impact Analysis (BIA) in providing useful information for the Business Continuity Plan (BCP) and the Disaster Recovery Plan (DRP).  The discussion begins with a brief overview of the BCP and the DRP, followed by the discussion and analysis of the BIA.

Business Continuity Plan (BCP)

Business Continuity Planning (BCP) involves the assessment of the risks to organizational processes and the development of policies, plans, and procedures to minimize the impact of those risks if they occur.  Organizations must implement BCP to maintain the continuous operation of the business if any disaster occurs.  The BCP emphasizes keeping and maintaining business operations with reduced or restricted infrastructure capabilities or resources, and it can be used to manage and restore the environment.  If the continuity of the business is broken, the business processes have ceased and the organization is in disaster mode, which should trigger the Disaster Recovery Plan (DRP).  The top priority of the BCP and DRP is always people: the main concern is to get people out of harm's way, after which the organization can address IT recovery and restoration issues (Abernathy & McMillan, 2016; Stewart, Chapple, & Gibson, 2015).  The BCP process involves four main steps to provide a quick, calm, and efficient response in the event of an emergency and to enhance the ability of the organization to recover from a disruptive event in a timely fashion: (1) Project Scope and Planning, (2) Business Impact Assessment, (3) Continuity Planning, and (4) Documentation and Approval (Stewart et al., 2015).

However, as indicated in (Abernathy & McMillan, 2016), NIST Special Publication (SP) 800-34 Revision 1 (R1) defines seven steps.  The first step involves the development of the contingency planning policy.  The second step involves performing the Business Impact Analysis.  The third step is the identification of preventive controls.  The development of recovery strategies is the fourth step, and the fifth step involves the development of the BCP.  The sixth step involves testing, training, and exercises, and the last step is to maintain the plan.  Figure 1 summarizes these seven steps identified by NIST.

Figure 1.  A Summary of the Business Continuity Steps (Abernathy & McMillan, 2016).

Disaster Recovery Plan (DRP)

In case a disaster event occurs, the organization must have in place a strategy and plan to recover from it.  Organizations and businesses are exposed to various types of disasters, which can be categorized as either natural or human-made.  Natural disasters include earthquakes, floods, storms, hurricanes, volcanoes, and fires.  Human-made disasters include intentionally set fires, acts of terrorism, explosions, and power outages; other disasters can be caused by hardware and software failures, strikes and picketing, and theft and vandalism.  Thus, the organization must be prepared and ready to recover from any disaster.  Moreover, the organization must document the Disaster Recovery Plan and provide training to the personnel (Stewart et al., 2015).

Business Impact Analysis (BIA)

As defined in (Abernathy & McMillan, 2016), the BIA is a functional analysis which occurs as a component of Business Continuity and Disaster Recovery planning.  In (Srinivasan, 2016), the BIA is described as a type of risk assessment exercise which attempts to assess and evaluate the qualitative and quantitative impacts on the business of a disruptive event.  Qualitative impacts are operational impacts, such as the ability to deliver, while quantitative impacts relate to financial loss and are described in numeric monetary value (Srinivasan, 2016; Stewart et al., 2015).

Organizations should perform a detailed and thorough BIA to help business units and operations understand the impact of a disaster.  The BIA should list the critical and required business functions, their resource dependencies, and their level of criticality to the overall organization.  The development of the BCP is based on the BIA, which helps the organization understand the impact of a disruptive event.  The BIA is a management-level analysis which identifies the impact of losing organizational resources.  The BIA involves four main steps, as sketched below.  The first step is the identification of the critical processes and resources, followed by the identification of outage impacts and the estimation of downtime.  The third step is the identification of resource requirements, and the last step is the identification of recovery priorities.  The BIA relies on any vulnerability analysis and risk assessment already completed by the BCP committee or by a separate task force (Abernathy & McMillan, 2016).
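As a purely hypothetical sketch, not taken from the cited sources, the outputs of the four BIA steps above might be captured as structured records such as the following, where the process names, dependencies, and figures are invented for illustration.

# Hypothetical sketch of BIA data: critical processes, outage impact,
# estimated downtime tolerance, and a derived recovery priority.
from dataclasses import dataclass

@dataclass
class BIAEntry:
    process: str                    # critical business function
    dependencies: list              # resources the function depends on
    outage_impact: str              # qualitative impact description
    est_loss_per_day: float         # quantitative (monetary) impact
    max_tolerable_downtime_h: int   # estimated downtime tolerance

entries = [
    BIAEntry("Patient admissions", ["EHR system", "network"],
             "cannot register patients", 250_000.0, 4),
    BIAEntry("Payroll", ["HR database"],
             "delayed salary payments", 20_000.0, 72),
]

# Recovery priorities follow from downtime tolerance and monetary impact.
for e in sorted(entries, key=lambda e: (e.max_tolerable_downtime_h, -e.est_loss_per_day)):
    print(e.process, "-> recover within", e.max_tolerable_downtime_h, "hours")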

References

Abernathy, R., & McMillan, T. (2016). CISSP Cert Guide: Pearson IT Certification.

Srinivasan, M. (2016). CISSP in 21 Days: Packt Publishing Ltd.

Stewart, J., Chapple, M., & Gibson, D. (2015). CISSP (ISC)² Certified Information Systems Security Professional Official Study Guide (7th ed.): Wiley.