Use Case: Analysis of Heart Disease

Dr. Aly, O.
Computer Science

Abstract

The purpose of this project is to articulate all the steps conducted to perform the analysis of the heart disease use case.  The project contained two main phases: Phase 1: Sandbox Configuration, and Phase 2: Heart Disease Use Case.  The setup and the configurations are not trivial and required the integration of Hive with MapReduce and Tez.  They also required the integration of R and RStudio with Hive to perform transactions to retrieve and aggregate data.  The analysis included Descriptive Analysis for all patients and then drilled down to focus on gender: females and males.  Moreover, the analysis included the Decision Tree and the Fast-and-Frugal Trees (FFTrees).  The researcher of this paper is in agreement with other researchers that Big Data Analytics and Data Mining can play a significant role in healthcare in various areas such as patient care, healthcare records, and fraud detection and prevention.

Keywords: Decision Tree, Diagnosis of Heart Disease.

Introduction

            The medical records and the databases that store these records are growing rapidly.  This rapid growth is leading researchers and practitioners to employ Big Data technologies.  Data Mining techniques play a significant role in finding patterns and in extracting knowledge to provide better patient care and effective diagnostic capabilities.  As indicated in (Koh & Tan, 2011), “In healthcare, data mining is becoming increasingly popular, if not increasingly essential.”  Healthcare can benefit from Data Mining applications in various areas such as the evaluation of treatment effectiveness, customer and patient relationship management, healthcare management, and fraud detection and prevention.  Other benefits include predictive medicine and the analysis of DNA micro-arrays.

Various research studies have employed Data Mining techniques in healthcare.  In (Alexander & Wang, 2017), the main objective of the study was to identify the usage of Big Data Analytics to predict and prevent heart attacks.  The results showed that Big Data Analytics is useful in predicting and preventing heart attacks.  In (Dineshgar & Singh, 2016), the purpose of the study was to develop a prototype Intelligent Heart Disease Prediction System (IHDPC) using Data Mining techniques.  In (Karthiga, Mary, & Yogasini, 2017), the researchers utilized Data Mining techniques to predict heart disease using the Decision Tree algorithm and Naïve Bayes.  The results showed a prediction accuracy of 99%.  Thus, Data Mining techniques enable the healthcare industry to predict patterns.  In (Kirmani & Ansarullah, 2016), the researchers also applied Data Mining techniques with the aim of investigating the results of applying different types of Decision Tree methods to obtain better performance in heart disease diagnosis.  These research studies are examples of the vast literature on the use of Big Data Analytics and Data Mining in the healthcare industry.

            In this project, the heart disease dataset is utilized as the use case for a Data Mining application.  The project used the Hortonworks Sandbox with Hive, MapReduce, and Tez.  The project also integrated R with Hive to perform statistical analysis, including the Decision Tree method.  The project utilized techniques from various research studies such as (Karthiga et al., 2017; Kirmani & Ansarullah, 2016; Martignon, Katsikopoulos, & Woike, 2008; Pandey, Pandey, & Jaiswal, 2013; Phillips, Neth, Woike, & Gaissmaier, 2017; Reddy, Raju, Kumar, Sujatha, & Prakash, 2016).

            The project begins with Phase 1, the Sandbox Configuration, followed by Phase 2, the Heart Disease Use Case.  The Sandbox configuration covered the environment setup, from mapping the sandbox IP address and managing the Ambari console to integrating R and RStudio with Hive.  The Heart Disease Use Case involved fourteen steps, starting from understanding the dataset and ending with the analysis of the results.  The project articulates the steps and the commands as required for this project.

Phase 1:  Sandbox Configuration

1.      Environment Setup

The environment setup begins with the installation of the Virtual Box and Hortonworks Sandbox.

  1. Install Oracle VM VirtualBox, downloaded from http://www.virtualbox.org
  2. Install the Hortonworks Docker Sandbox version 2.6.4 from http://hortonworks.com/sandbox

After the installation, the environment must be fully configured using the following steps.

1.1       IP Address and HTTP Web Port

After the Sandbox is installed, the host is assigned an IP address that depends on the virtualization platform (VMware, VirtualBox) or container (Docker).  After the installation is finished, the local IP address is assigned with the HTTP web access port of 8888.  Thus, local access uses http://127.0.0.1:8888/

1.2       Map the Sandbox IP to Desired Hostname in the Hosts file

            The IP address can be mapped to a hostname using the hosts file, for example by adding an entry such as 127.0.0.1 hortonworks-sandbox.com.  After setting the hostname to replace the IP address, the sandbox can be accessed from the browser using http://hortonworks-sandbox.com:8888.

1.3     Roles and User Access

There are five major users with different roles in the Hortonworks Sandbox.  These users, with their roles, services, and passwords, are summarized in Table 1.

Table 1.  Users, Roles, Service, and Passwords.

With respect to access, PuTTY can be used to access the sandbox over SSH on port 2222.  On the first login, the root user is prompted to specify a new password.

1.4       Shell Web Client Method

The shell web client, also known as Shell-in-a-Box, allows shell commands to be issued without installing additional software.  It uses port 4200.  The admin password can be reset using the shell web client.

1.5       Transfer Data and Files between the Sandbox and Local Machine.

            To transfer files and data between the local machine and the sandbox, secure copy using the scp command can be used, for example as illustrated below (assuming the SSH port 2222 noted above and a hypothetical file name).

To transfer from the local machine to the sandbox:

  Execute: $scp -P 2222 heart.dat root@hortonworks-sandbox.com:/root

To transfer from the sandbox to the local machine:

  Execute: $scp -P 2222 root@hortonworks-sandbox.com:/root/heart.dat .

1.6       Ambari Console and Management

            The admin can manage Ambari using the web address with port 8080, i.e., http://hortonworks-sandbox.com:8080, with the admin user and password.  The admin can operate the cluster, manage users and groups, and deploy views.  The cluster section is the primary UI for Hadoop operators and allows the admin to grant permissions to Ambari users and groups.

2.      R and RStudio Setup

To download and install RStudio Server, follow these steps:

  • Execute: $sudo yum install rstudio-server-rhel-0.99.893-x86_64.rpm
  • Install dpkg to divert the location of /sbin/initctl:
    • Execute: $yum install dpkg
    • Execute: $dpkg-divert --local --rename --add /sbin/initctl
    • Execute: $ln -s /bin/true /sbin/initctl
  • Install R and verify the installation of RStudio:
    • Execute: $yum install -y R
    • Execute: $yum -y install libcurl-devel
    • Execute: $rstudio-server verify-installation
  • The default port of RStudio Server is 8787, which is not opened in the Docker Sandbox.  You can use port 8090, which is opened for the Docker container.
    • Execute: $sed -i "1 a www-port=8090" /etc/rstudio/rserver.conf
    • Restart the server by typing: exec /usr/lib/rstudio-server/bin/rserver
    • This will close your session.  However, you can now browse to RStudio using port 8090.
    • The RStudio login is amy_ds/amy_ds.
  • Alternatively, open up the RStudio port 8787 by implementing the following steps:
    • Access the VM VirtualBox Manager tool.
    • Click on the Hortonworks VM → Network → Advanced → Port Forwarding, and add port 8787 for RStudio.
    • After you open up the port, modify /etc/rstudio/rserver.conf to reflect port 8787.
    • Stop and start the VM.

Phase 2:  Heart Disease Use Case

 1.      Review and Understand the Dataset

  • Obtain the heart disease dataset from the archive site at:

http://archive.ics.uci.edu/ml/datasets/.

  • Review the dataset.  The dataset has fourteen variables (N = 271).  Table 2 describes these attributes.

Table 2.  Heart Disease Dataset Variables Description.

  • Load the heart.dat file into the Hadoop Distributed File System (HDFS):
    1. Log in to Ambari using amy_ds/amy_ds.
    2. File View → user → amy_ds → Upload.
  • Start the Hive database.  Create a table “heart” and import the dataset into the Hive database.
  • Retrieve the top 10 records for verification.

2.      Configure MapReduce As the Execution Engine in Hive

There is an option to configure MapReduce as the execution engine in Hive to take advantage of the MapReduce features.

  1. Click on the Hive Settings tab.
  2. Click Add New and add the following Key: Value pairs:
    1. Key: hive.execution.engine → Value: mr (for MapReduce).
    2. Key: hive.auto.convert.join → Value: false.
  3. Test the query using MapReduce as the execution engine.  The following query ran using MapReduce.

3.      Configure Tez As the Execution Engine in Hive

The user can also change the value of hive.execution.engine from mr to tez so that Hive runs on the Tez execution engine and takes advantage of a DAG execution plan representing the query, instead of multiple stages of MapReduce programs that involve a lot of synchronization, barriers, and I/O overhead.

  1. Click on the Hive Settings tab.
  2. Click Add New and add the following Key: Value pairs:
    1. Key: hive.execution.engine → Value: tez.
    2. Key: hive.auto.convert.join → Value: false.

4.      Integrate TEZ with Hive for Directed Acyclic Graph (DAG)

This integration is implemented on Tez to take advantage of the Directed Acyclic Graph (DAG) execution.  Tez improves this technique by writing intermediate datasets into memory instead of to the hard disk.

  1. Go to Settings in Hive view.
  2. Change the hive.execution.engine to tez.

5.      Track Hive on Tez jobs in HDP Sandbox using the Web UI.

  1. Track the job from the browser at http://hortonworks-sandbox.com:8088/cluster, while it is running or afterwards, to see the details.
  2. For example, retrieve the average age and average cholesterol by gender for females and males (a query sketch follows this list).
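For reference, a minimal sketch of such an aggregation is given below.  It assumes the R/ODBC connection cs881 that is set up in step 9 and the table and column names used elsewhere in this project (heart, age, chol, sex); it is an illustration rather than the exact query used.

  # Average age and average cholesterol by gender (sex = 0 for females, 1 for males),
  # submitted to Hive through the ODBC connection cs881 from step 9.
  sqlQuery(cs881, "SELECT sex, AVG(age) AS avg_age, AVG(chol) AS avg_chol FROM heart GROUP BY sex")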

6. Monitor Cluster Metrics

  1. Monitor the cluster metrics (for example, from the Ambari dashboard) while the job runs.

7. Review the Statistics of the table from Hive.

  1. Table → Statistics → Recompute, and check Include Columns.
  2. Click on Tez View.
  3. Click Analyze.
  4. Click Graphical View.

8.      Configure ODBC for Hive Database Connection

  1. Configure a User Data Source in ODBC on the client to connect to the Hive database.
  2. Test the ODBC connection to Hive.

9.      Setup R to use the ODBC for Hive Database Connection

  1. Execute the following to install the ODBC package in R:
    • >install.packages("RODBC")
  2. Execute the following to load the required library to establish the database connection from R to Hive:
    • >library("RODBC")
  3. Execute the following command to establish the database connection from R to Hive:
    • >cs881 <- odbcConnect("Hive")
  4. Execute the following to retrieve the top 10 records from Hive in R using the ODBC connection:
    • >sqlQuery(cs881, "SELECT * FROM heart LIMIT 10")

10.   Create Data Frame

  • Execute the following command to create a data frame:
    • >heart_df <- sqlQuery(cs881, "SELECT * FROM heart")
  • Review the headers of the columns:
    • >print(head(heart_df))
  • Review and analyze the statistics summary:
    • >summary(heart_df)
  • List the names of the columns:
    • >names(heart_df)

11.   Analyze the Data using Descriptive Analysis

  • Find the age and cholesterol level of heart disease patients
    • Among all genders:
    • >age_chol_heart_disease <- sqlQuery(cs881, "SELECT age, chol FROM heart WHERE diagnosis = 1")
    • >summary(age_chol_heart_disease)

Figure 1.  Cholesterol Level Among All Heart Disease Patients.

  • Among Female Patients
    • >age_chol_heart_disease_female <- sqlQuery(cs881, "SELECT age, chol FROM heart WHERE diagnosis = 1 AND sex = 0")
    • >summary(age_chol_heart_disease_female)

Figure 2.  Cholesterol Level Among Heart Disease Female Patients.

  • Among Male Patients
    • >age_chol_heart_disease_male <- sqlQuery(cs881, "SELECT age, chol FROM heart WHERE diagnosis = 1 AND sex = 1")
    • >summary(age_chol_heart_disease_male)

Figure 3.  Cholesterol Level Among Heart Disease Male Patients.

 12.  Analyze the Data using Decision Tree

  • Print the headers of the columns.
  • Create input.dat for the diagnosis, age, sex, and sugar attributes.
  • Create a PNG file for the plot output.
  • Install and load the party library:
    • >install.packages("party")
    • >library(party)
  • Build and plot the Decision Tree (a sketch of these steps follows this list).
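A minimal R sketch of these steps is shown below.  It assumes the heart_df data frame created in step 10 and the column names diagnosis, age, sex, and sugar; the exact column names in the table may differ.

  # Hedged sketch: build a conditional inference tree with the party package
  # and write the plot to a PNG file.
  install.packages("party")
  library(party)
  heart_df$diagnosis <- as.factor(heart_df$diagnosis)        # treat the target as categorical
  heart_tree <- ctree(diagnosis ~ age + sex + sugar, data = heart_df)
  png(file = "heart_decision_tree.png")                      # create the PNG output file
  plot(heart_tree)                                           # draw the decision tree
  dev.off()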

Figure 4.  Decision Tree for Heart Disease Patients.

13.    Create FFTree for heart disease
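A minimal sketch of this step using the FFTrees package (Phillips, Neth, Woike, & Gaissmaier, 2017) is shown below.  It assumes the heart_df data frame from step 10 with a diagnosis column coded 1 for heart disease patients; the column coding and risk labels are assumptions.

  # Hedged sketch: fit a fast-and-frugal tree (FFT) for the heart disease data.
  install.packages("FFTrees")
  library(FFTrees)
  heart_df$diagnosis <- heart_df$diagnosis == 1              # FFTrees expects a binary (logical) criterion
  heart_fft <- FFTrees(formula = diagnosis ~ .,              # predict diagnosis from all other columns
                       data = heart_df,
                       main = "Heart Disease",
                       decision.labels = c("Low-Risk", "High-Risk"))
  plot(heart_fft)                                            # tree diagram with sensitivity and specificity
  heart_fft                                                  # print the accuracy summary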

Figure 5.  FFT Decision Tree with Low Risk and High-Risk Patients.

Figure 6.  Sensitivity and Specificity for Heart Disease Patients Using FFTree.

Figure 7. Custom Heart Disease FFTree.

14.    Analysis of the Results

The analysis of the heart disease dataset included descriptive analysis and a decision tree.  The results of the descriptive analysis showed that the minimum age among the patients who are exposed to heart disease is 29 years old, while the maximum age is 79, with a median and mean of 52 years old.  The results also showed that the minimum cholesterol level for these patients is 126, while the maximum is 564, with a median of 236 and a mean of 244, suggesting that the cholesterol level also increases with age.

The descriptive analysis then drilled down and focused on gender (female vs. male) to identify the impact of age on cholesterol for the heart disease patients.  The results showed the same trend among the female heart disease patients, with a minimum age of 34 and a maximum age of 76, with a median and mean of 54.  The cholesterol level among female heart disease patients ranges from a minimum of 141 to a maximum of 564, with a median of 250 and a mean of 257.  The maximum cholesterol level of 564 is an outlier, with another outlier at the age of 76.  With respect to the male heart disease patients, the results showed the same trend: a minimum age of 29 years old and a maximum age of 70 years old, with a median of 52 and a mean of 51.  The cholesterol level among these male heart disease patients showed a minimum of 126 and a maximum of 325, with a median and mean of 233.  There is another outlier among the male heart disease patients at the age of 29.  Due to these outliers in the dataset among the female and male heart disease patients, a comparison between male and female patients will not be accurate.  However, Figure 2 and Figure 3 show similarities in the impact of age on the cholesterol level between both genders.

With regard to the decision tree, the first decision tree shows the data partitioned among six leaf nodes.  The first two nodes relate to the Resting Blood Pressure (RBP) attribute (RestingBP).  The first node of this cluster shows 65 heart disease patients with an RBP of 138 or less, while the second node shows 41 heart disease patients with an RBP greater than 138.  These two nodes correspond to zero vessels and a heart rate of 165 or less.  For patients with more than zero vessels, there is another node with 95 patients.  The second set of nodes covers heart rates greater than 165 and is split on the vessels attribute: zero or fewer vessels, and more than zero vessels.  Two nodes fall under the first category of zero or fewer vessels, including 22 heart disease patients with a heart rate of 172 or less.  The last node, for more than zero vessels, contains 15 heart disease patients.  The FFTree results show that the high-risk heart disease patients are those with more than zero vessels, while the low-risk patients are those with zero vessels.

Conclusion

The purpose of this project was to articulate all the steps conducted to perform the analysis of the heart disease use case.  The project contained two main phases: Phase 1: Sandbox Configuration, and Phase 2: Heart Disease Use Case.  The setup and the configurations are not trivial and required the integration of Hive with MapReduce and Tez.  They also required the integration of R and RStudio with Hive to perform transactions to retrieve and aggregate data.  The analysis included Descriptive Analysis for all patients and then drilled down to focus on gender: females and males.  Moreover, the analysis included the Decision Tree and the FFTrees.  The researcher of this paper is in agreement with other researchers that Big Data Analytics and Data Mining can play a significant role in healthcare in various areas such as patient care, healthcare records, and fraud detection and prevention.

References

Alexander, C., & Wang, L. (2017). Big data analytics in heart attack prediction. The Journal of Nursing Care, 6(393).

Dineshgar, G. P., & Singh, L. (2016). A Review of Data Mining For Heart Disease Prediction. International Journal of Advanced Research in Electronics and Communication Engineering (IJARECE), 5(2).

Karthiga, A. S., Mary, M. S., & Yogasini, M. (2017). Early Prediction of Heart Disease Using Decision Tree Algorithm. International Journal of Advanced Research in Basic Engineering Sciences and Technology (IJARBEST), 3(3).

Kirmani, M. M., & Ansarullah, S. I. (2016). Prediction of Heart Disease using Decision Tree a Data Mining Technique. IJCSN International Journal of Computer Science and Network, 5(6), 885-892.

Koh, H. C., & Tan, G. (2011). Data mining applications in healthcare. Journal of healthcare information management, 19(2), 65.

Martignon, L., Katsikopoulos, K. V., & Woike, J. K. (2008). Categorization with limited resources: A family of simple heuristics. Journal of Mathematical Psychology, 52(6), 352-361.

Pandey, A. K., Pandey, P., & Jaiswal, K. (2013). A heart disease prediction model using decision tree. IUP Journal of Computer Sciences, 7(3), 43.

Phillips, N. D., Neth, H., Woike, J. K., & Gaissmaier, W. (2017). FFTrees: A toolbox to create, visualize, and evaluate fast-and-frugal decision trees. Judgment and Decision Making, 12(4), 344.

Reddy, R. V. K., Raju, K. P., Kumar, M. J., Sujatha, C., & Prakash, P. R. (2016). Prediction of heart disease using decision tree approach. International Journal of Advanced Research in Computer Science and Engineering, 6(3).

Big Data Analytics Framework and Relevant Tools Used in Healthcare Data Analytics.

Dr. Aly, O.
Computer Science

Introduction

The purpose of this discussion is to analyze the Big Data Analytics framework and relevant tools used in healthcare data analytics.  The discussion also provides examples of how healthcare organizations can implement such a framework.

Healthcare can benefit from Big Data Analytics in various domains, such as decreasing overhead costs, curing and diagnosing diseases, increasing profit, predicting epidemics, and improving the quality of human life (Dezyre, 2016).  Healthcare organizations have been generating very large volumes of data, mostly driven by various regulatory requirements, record keeping, compliance, and patient care.  McKinsey projects that Big Data Analytics in healthcare can decrease the costs associated with data management by $300-$500 billion.  Healthcare data include electronic health records (EHR), clinical reports, prescriptions, diagnostic reports, medical images, pharmacy data, insurance information such as claims and billing, social media data, and medical journals (Eswari, Sampath, & Lavanya, 2015; Ward, Marsolo, & Froehle, 2014).

Various healthcare organizations, such as scientific research labs, hospitals, and other medical organizations, are leveraging Big Data Analytics to reduce the costs associated with healthcare by modifying treatment delivery models.  Several Big Data Analytics technologies have been applied in the healthcare industry.  For instance, Hadoop technology has been used in healthcare analytics in various domains.  Examples of Hadoop applications in healthcare include cancer treatment and genomics, monitoring patient vitals, the hospital network, healthcare intelligence, and fraud prevention and detection (Dezyre, 2016).  Thus, this discussion is limited to Hadoop technology in healthcare.  The discussion begins with the types of analytics and the potential benefits of some of these analytics in healthcare, followed by the main discussion of a Hadoop framework for diabetes, including its major components, the Hadoop Distributed File System (HDFS) and Map/Reduce.

Types of Analytics

There are four major analytics types: Descriptive Analytics, Predictive Analytics, Prescriptive Analytics (Apurva, Ranakoti, Yadav, Tomer, & Roy, 2017; Davenport & Dyché, 2013; Mohammed, Far, & Naugler, 2014), and Diagnostic Analytics (Apurva et al., 2017).  Descriptive Analytics is used to summarize historical data to provide useful information.  Predictive Analytics is used to predict future events based on previous behavior using data mining techniques and modeling.  Prescriptive Analytics provides support for various data modeling scenarios, such as multi-variable simulation and detecting hidden relationships between different variables.  It is useful for finding an optimum solution and the best course of action using algorithms.  Prescriptive Analytics, as indicated in (Mohammed et al., 2014), is less used in the clinical field.  Diagnostic Analytics is described as an advanced type of analytics that gets to the cause of a problem using drill-down techniques and data discovery.

Hadoop Framework for Diabetes

The predictive analysis algorithm is utilized by (Eswari et al., 2015) in a Hadoop/MapReduce environment to predict the prevalent diabetes types, the complications associated with each diabetic type, and the required treatment type.  The analysis in (Eswari et al., 2015) was performed on Indian patients.  According to the World Health Organization, as cited in (Eswari et al., 2015), the probability that a patient between the ages of 30 and 70 dies from one of the four major Non-Communicable Diseases (NCDs), namely diabetes, cancer, stroke, and respiratory disease, is 26%.  In 2014, 60% of all deaths in India were caused by NCDs.  Moreover, according to the Global Status Report, as cited in (Eswari et al., 2015), NCD claims will reach 52 million patients globally by the year 2030.

The architecture for the predictive analysis included four phases: Data Collection, Data Warehousing, Predictive Analysis, and Processing of Analyzed Reports.  Figure 1 illustrates the framework used for the Predictive Analysis System-Healthcare application, adapted from (Eswari et al., 2015).

Figure 1.  Predictive Analysis Framework for Healthcare. Adapted from (Eswari et al., 2015).

Phase 1:  In the Data Collection phase, raw diabetic data is loaded into the system.  The data is unstructured, including EHR, patient health records (PHR), clinical systems, and external sources such as government, labs, pharmacies, insurance, and so forth.  The data comes in different formats such as .csv files, tables, and text.  The data collected from these various sources in the first phase was stored in data warehouses.

Phase 2:  During the second phase, data warehousing, the data is cleansed and loaded to be ready for further processing.

Phase 3:  The third phase involved the Predictive Analysis, which used the predictive algorithm in the Hadoop Map/Reduce environment to predict and classify the type of DM, the complications associated with each type, and the treatment type to be provided.  The Hadoop framework was used in this analysis because it can process extremely large amounts of health data by allocating partitioned datasets to numerous servers.  Hadoop utilized the Map/Reduce technology to solve different parts of the larger problem and integrate them into the final result.  Moreover, Hadoop utilized the Hadoop Distributed File System (HDFS) as the distributed file system.  The Predictive Analysis phase involved Pattern Discovery and Predictive Pattern Matching.

With respect to Pattern Discovery, it was important for DM to test patterns such as plasma glucose concentration, serum insulin, diastolic blood pressure, diabetes pedigree, Body Mass Index (BMI), age, and number of times pregnant.  The Pattern Discovery process included association rule mining between the diabetic type and other information such as lab results.  It also included clustering to group similar patterns.  The classification step of the Pattern Discovery included the classification of patients' risk based on their health condition.  Statistics were used to analyze the discovered patterns.  The last step in the Pattern Discovery involved the application.  The Pattern Discovery process of the Predictive Analysis phase is illustrated in Figure 2.

Figure 2.  Pattern Discovery of the Predictive Analysis.

With respect to the Predictive Pattern Matching of the Predictive Analysis, the Map/Reduce operation was performed whenever the warehoused dataset was sent to the Hadoop system.  Pattern Matching is the process of comparing the analyzed threshold value with the obtained value.  The mapping phase involved splitting the large dataset into small tasks for the Worker/Slave Nodes (WN).  As illustrated in Figure 3, the Master Node (MN) consists of the Name Node (NN) and the Job Tracker (JT), which use the Map/Reduce technique.  The MN sends the order to a Worker/Slave Node, which processes the pattern matching task for the diabetes data with the help of the Data Node (DN) and Task Tracker (TT) that reside on the same machine as the WN.  When a WN completed the pattern matching based on the requirement, the result was stored on an intermediate disk, known as a local write.  When the MN initiated the reduce task, the other allocated Worker Nodes read the processed data from the intermediate disks.  The reduce task is performed in the WN based on the query received by the MN from the client.  The results of the reduce phase are distributed to various servers in the cluster.
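To make the map and reduce steps concrete, the following is a conceptual R illustration only (not the authors' Hadoop implementation): the map step emits a (key, value) pair per record, and the reduce step aggregates the values per key.

  # Toy records standing in for warehoused diabetic-type records (invented data).
  records <- c("Type1", "Type2", "Type1", "Type2", "Type1")
  # Map step: emit a (key = type, value = 1) pair for each record.
  mapped <- Map(function(r) list(key = r, value = 1L), records)
  keys   <- vapply(mapped, function(m) m$key,   character(1))
  values <- vapply(mapped, function(m) m$value, integer(1))
  # Reduce step: sum the values per key, giving a count per diabetic type.
  reduced <- tapply(values, keys, sum)
  reduced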

Figure 3.  Pattern Matching System Using Map/Reduce. Adapted from (Eswari et al., 2015).

Phase 4:  In this phase, the Analyzed Reports are processed and distributed to various servers in the cluster and replicated through several nodes depending on the geographical area.  Using the proper electronic communication technology to exchange the information of patients among healthcare centers can lead to obtaining proper treatment at the right time in remote locations at low cost.

The implementation of the Hadoop framework helped transform various health records of diabetic patients into useful analyzed results that help patients understand the complications associated with each type of diabetes.

References

Apurva, A., Ranakoti, P., Yadav, S., Tomer, S., & Roy, N. R. (2017, 12-14 Oct. 2017). Redefining cyber security with big data analytics. Paper presented at the 2017 International Conference on Computing and Communication Technologies for Smart Nation (IC3TSN).

Davenport, T. H., & Dyché, J. (2013). Big data in big companies. International Institute for Analytics.

Dezyre. (2016). 5 Healthcare Applications of Hadoop and Big Data Retrieved from https://www.dezyre.com/article/5-healthcare-applications-of-hadoop-and-big-data/85.

Eswari, T., Sampath, P., & Lavanya, S. (2015). Predictive methodology for diabetic data analysis in big data. Procedia Computer Science, 50, 203-208.

Mohammed, E. A., Far, B. H., & Naugler, C. (2014). Applications of the MapReduce Programming Framework to Clinical Big Data Analysis: Current Landscape and Future Trends. BioData mining, 7(1), 1.

Ward, M. J., Marsolo, K. A., & Froehle, C. M. (2014). Applications of business analytics in healthcare. Business Horizons, 57(5), 571-582.

Decision Tree in Diagnosing Heart Disease Patients

Dr. Aly, O.
Computer Science

Abstract

The purpose of this project is to discuss and analyze the Decision Tree in diagnosing heart disease patients.  The project focuses on the research study of (Shouman, Turner, & Stocker, 2011), who performed various experiments to evaluate the Decision Tree in the diagnosis of heart disease.  The key benefit of this study is the evaluation of multiple variants using various Decision Tree types such as Information Gain, Gini Index, and Gain Ratio.  The study also performed the experiments with and without the voting technique.  The project analyzed the steps performed by the researchers of (Shouman et al., 2011), the attributes used, the voting techniques, and the data discretization using the unsupervised methods of equal width and equal frequency and the supervised methods of chi merge and entropy.  The four major steps for the evaluation of the Decision Tree in the diagnosis of heart disease are Data Discretization, Data Partitioning, Training Data and Decision Tree Type Selection, and Reduced Error Pruning to develop a pruned Decision Tree.  The findings of the researchers indicated that the Gain Ratio Decision Tree type increases the accuracy of the probability calculation.  The researcher of this project agrees with the study's authors that further experimentation using a larger dataset is needed to verify whether the results hold.

Keywords: Decision Tree, Diagnosis of Heart Disease, Multi-Variant.

Introduction

            Various research studies have used data mining techniques in healthcare to diagnose diseases such as diabetes, stroke, cancer, and heart disease.  Researchers have applied various data mining techniques to the diagnosis of heart disease, such as Naïve Bayes, Decision Tree, Neural Network, and Kernel Density, with different levels of accuracy, using defined groups, the bagging algorithm, and support vector machines.  However, the Decision Tree mining technique has been shown by several research studies to be successfully applicable to the diagnosis of heart disease.  In this project, the focus is on the Decision Tree mining technique for heart disease diagnosis.  The discussion and the analysis of the Decision Tree in this project are based on the research study of (Shouman et al., 2011).

            The Decision Tree mining technique has various types such as J4.8, C4.5, Gini Index, and Information Gain.  Most research studies applied the J4.8 Decision Tree, based on the Gain Ratio in the extraction of Decision Tree rules, and binary discretization.  Other techniques, such as the voting method and reduced error pruning, are known to provide more accurate Decision Trees.  In (Shouman et al., 2011), the researchers applied various techniques to different types of Decision Trees with the aim of obtaining better performance in the diagnosis of heart disease.  The sensitivity, the specificity, and the accuracy measures were calculated to evaluate the performance of the alternative Decision Trees.

            The risk factors associated with heart disease are identified as age, blood pressure, smoking habit, total cholesterol, diabetes, hypertension, family history of heart disease, obesity, and lack of physical activity.  The Decision Tree cannot handle continuous variables directly; thus, the continuous variables must be converted into discrete attributes.  This process is called discretization.  There are two discretization approaches: binary and multi-interval.  The J4.8 and C4.5 Decision Trees utilize binary discretization for continuous-valued features.  The multi-interval discretization method is known to produce more accurate Decision Tree results than the binary discretization method.  However, the multi-interval discretization method is used less often than the binary discretization method in research studies of heart disease diagnosis.  Other methods, such as multiple classifier voting and reduced error pruning, can be used to improve the accuracy of the Decision Tree results in heart disease diagnosis analysis.

            The data discretization method and the Decision Tree type are the two components which impact the performance of the Decision Tree as an analytical mining technique.  In an effort to identify the most accurate combination, the researchers of (Shouman et al., 2011) investigated multiple classifier voting with different multi-interval discretization methods, such as equal width and equal frequency, and with different types of Decision Tree, such as Information Gain, Gini Index, and Gain Ratio.  Microsoft Visual Studio 2008 was used in this investigation.

Examination of the Decision Tree Analytical Technique for Heart Disease Patients

In this research study, the researchers used twelve Decision Tree variants by mixing discretization approaches with different Decision Tree types.  Each variant was examined through five different voting partitioning schemes of three, five, seven, nine, and eleven partitions.  The dataset used in this research study came from the Cleveland Clinic Foundation heart disease dataset (UCI, 1988).  The dataset has seventy-six raw attributes.  However, because the published experiments only refer to thirteen of them, the researchers restricted the testing in this research study to the same thirteen attributes to allow comparison with other literature results.  The selected dataset attributes are illustrated in Table 1.  Although the researchers refer to thirteen attributes, the table displays fourteen attributes.  The researcher of this project investigated the additional attribute in (UCI, 1988) and found that the fourteenth attribute is the predicted attribute for the diagnosis of heart disease patients.

Table 1.  Selected Dataset Attributes. Adapted from (Shouman et al., 2011)

The tests executed over seventy Decision Trees using the same dataset.  The dataset contains 303 rows, of which 297 are complete; the six rows with missing values were eliminated from the test.  The tests were performed once with the voting application and once without it, to evaluate the impact of voting on the accuracy.  The research study implemented these tests using the four major steps below.

  1. Data Discretization.
  2. Data Partitioning.
  3. Training Data and Decision Tree Type Selection.
  4. The Reducing Error Pruning Application to develop pruned Decision Tree.

1. Data Discretization for Discrete Attributes

Data discretization can be either supervised or unsupervised.  The unsupervised data discretization methods do not utilize the class membership information, while the supervised methods use the class labels to implement the discretization process, as in the chi-square-based and entropy-based methods.  The discretization process is used to convert the continuous attributes in the dataset into discrete attributes.  The discretization in this study uses five intervals.  The chi merge and entropy methods are two of the most well-known supervised discretization methods.  The chi merge discretization uses the χ² statistic to test whether the class is independent of the two adjacent intervals; if they are independent, the intervals are combined, and if they are dependent, they remain separate.  The pair of adjacent intervals with the lowest χ² value is merged, as long as the number of intervals is more than the pre-defined maximum number of intervals.  Entropy is described as an information-theoretic measure of the uncertainty contained in the training set.  The purpose of the entropy method is to select boundaries for discretization by evaluating candidate cut points with an entropy-based measure.  The entropy for each candidate cut point is calculated after the instances are sorted into ascending numeric order.  To minimize entropy, cut points are recursively selected until a stopping criterion is met; the stopping criterion here is reaching five intervals for the attribute.

The equal-width interval and equal-frequency methods are unsupervised discretization methods.  The equal-width discretization algorithm determines the maximum and minimum values of the discretized attribute and divides the range into a user-defined number of equal-width discrete intervals.  The equal-frequency method, on the other hand, sorts all values in ascending order and divides them into intervals containing approximately equal numbers of values.  Figure 1 summarizes the data discretization.

Figure 1.  Data Discretization in Decision Tree Analytical Mining. Adapted from (Shouman et al., 2011).
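As a brief R illustration of the two unsupervised methods (not the authors' implementation), the following sketch discretizes a vector of values into five equal-width and five approximately equal-frequency intervals.

  # Illustration only: five-interval equal-width vs. equal-frequency discretization.
  x <- c(126, 150, 199, 233, 244, 250, 280, 310, 400, 564)   # e.g., cholesterol values
  equal_width <- cut(x, breaks = 5)                          # five equal-width bins
  equal_freq  <- cut(x, breaks = quantile(x, probs = seq(0, 1, by = 0.2)),
                     include.lowest = TRUE)                  # five bins with roughly equal counts
  table(equal_width)
  table(equal_freq)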

2. Data Partitioning

This Data Partitioning step involved testing with and without voting.  The application of voting in classification algorithms is proven to increase accuracy.  Thus, the researchers applied multiple classifier voting by dividing the training data into smaller equal subsets and developing a Decision Tree classifier for each data subset.  Each classifier represents a single vote, and the voting is based on either plurality voting or majority voting.  The researchers of (Shouman et al., 2011) performed the experiments with between three and eleven voting subsets for each discretization method and each Decision Tree type.  The results indicated that nine subsets were the most successful division.  Figure 2 illustrates the Data Partitioning step with and without the voting technique.

Figure 2.  Data Partitioning With Voting and Without Voting. Adapted from (Shouman et al., 2011).
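As a rough illustration of the multiple classifier voting idea (not the authors' implementation), the following R sketch trains one Decision Tree per data subset and combines the predictions by majority vote, assuming a hypothetical data frame heart_data with a factor column diagnosis.

  # Illustration only: majority voting over trees trained on nine equal subsets.
  library(party)                                             # assumes the party package is installed
  k <- 9                                                     # nine subsets, as in the study
  idx <- split(sample(nrow(heart_data)), rep_len(1:k, nrow(heart_data)))
  trees <- lapply(idx, function(i) ctree(diagnosis ~ ., data = heart_data[i, ]))
  votes <- sapply(trees, function(t) as.character(predict(t, newdata = heart_data)))
  majority <- apply(votes, 1, function(v) names(which.max(table(v))))
  mean(majority == as.character(heart_data$diagnosis))       # accuracy of the voted prediction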

3. Training Data and Decision Tree Type Selection

            In this experiment, the researchers used three Decision Tree types, the Information Gain, the Gini Index, and the Gain Ratio, together with reduced error pruning, as they are the most commonly used.  The Decision Tree types are distinguished by the mathematical model used to select the splitting attribute when extracting the Decision Tree rules.  Figure 3 illustrates the Training Data step for these three Decision Tree types.

Figure 3.  Decision Tree Types and Training Data. Adapted from (Shouman et al., 2011).

            In the Information Gain type, the splitting attribute selected is the one which maximizes the Information Gain and minimizes the entropy value.  The splitting attribute is identified by calculating the Information Gain for each attribute and selecting the attribute which maximizes the Information Gain.  The Information Gain for each attribute is calculated using the following mathematical formula, where k is the number of classes of the target attribute and Pi is the number of occurrences of class i divided by the total number of instances, i.e., the probability of class i occurring.
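In standard notation, the entropy of a training set S and the Information Gain for a candidate splitting attribute A are:

\[
\mathrm{Entropy}(S) = -\sum_{i=1}^{k} p_i \log_2 p_i ,
\qquad
\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)
\]

where \(S_v\) is the subset of S for which attribute A takes the value v.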

            Information Gain can produce biased results because its measure is biased toward tests with many outcomes, so that attributes with many values tend to be selected.  Thus, the Gain Ratio Decision Tree type was introduced to reduce the effect of this bias.  The Gain Ratio adjusts the Information Gain for each attribute to allow for the breadth and uniformity of the attribute values.  The mathematical formula for the Gain Ratio is as follows, where the split information is a value based on the column sums of the frequency table:
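In standard notation:

\[
\mathrm{GainRatio}(S, A) = \frac{\mathrm{Gain}(S, A)}{\mathrm{SplitInfo}(S, A)} ,
\qquad
\mathrm{SplitInfo}(S, A) = -\sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}
\]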

In the Gini Index Decision Tree type, the impurity of the data is measured.  The Gini Index is calculated for each attribute in the dataset as shown in the following mathematical formula, where the target attribute has k classes and the probability of class i is Pi.  The splitting attribute in the Gini Index is the one with the largest reduction in the value of the Gini Index.
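In standard notation, the Gini Index of a set S with k classes is:

\[
\mathrm{Gini}(S) = 1 - \sum_{i=1}^{k} p_i^{2}
\]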

 4. Reducing Error Pruning Application to Develop Pruned Decision Tree

The Reduced Error Pruning method is described as the fastest pruning technique and is proven to provide accuracy and small decision rules.  In this step, the researchers applied the reduced error pruning method to the three selected Decision Tree types to improve the decision tree performance.  After the decision tree rules were extracted from the training data, the reduced error pruning method was applied to those rules, providing more compact decision tree rules and minimizing the number of extracted rules.  Figure 4 illustrates the four-step process, including the Reduced Error Pruning step, which is preceded by the Training Data step for the three selected Decision Tree types.

Figure 4.  The Four Major Steps of the Decision Tree Process to Evaluate Alternative Techniques (Shouman et al., 2011).

Performance Measures

            Three measures were used to evaluate the performance of each combination: sensitivity, specificity, and accuracy.  The sensitivity is the proportion of positive instances which are correctly classified as positive (sick patients), while the specificity is the proportion of negative instances which are correctly classified as negative (healthy patients).  The accuracy is the proportion of all instances which are correctly classified, as shown in Table 2.

Table 2.  Performance Measures.
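In standard form, with TP, TN, FP, and FN denoting true positives, true negatives, false positives, and false negatives, these measures are:

\[
\mathrm{Sensitivity} = \frac{TP}{TP + FN} ,
\qquad
\mathrm{Specificity} = \frac{TN}{TN + FP} ,
\qquad
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
\]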

These performance evaluation measures (sensitivity, specificity, and accuracy) were used in the diagnosis of heart disease using equal width, equal frequency, chi merge, and entropy discretization with the three selected Decision Tree types of Information Gain, Gini Index, and Gain Ratio, and the Reduced Error Pruning application.

The Research Findings

      The results without the voting application showed that the highest accuracy of 79.1% was achieved using the equal width discretization Information Gain Decision Tree.  However, the results with the voting application showed a better accuracy of 84.1% using the equal frequency discretization Gain Ratio Decision Tree, which is a 6.4% increase in accuracy over the test without voting.  The results also showed that the chi merge and entropy supervised discretization methods, with or without voting, did not show any improvement in the accuracy of the Decision Tree.  Table 3 summarizes these results, focusing on accuracy only, as derived from the detailed tables in (Shouman et al., 2011).  Figure 5 visualizes these results as well.

Table 3.  Accuracy Result With and Without Voting.

Figure 5.  Visual View of the Evaluation of Alternative Decision Tree Techniques.

            The researchers compared their findings and results with the J4.8 Decision Tree and the bagging algorithm, which used the same dataset.  They found that their tests showed higher performance measures in sensitivity, specificity, and accuracy than the J4.8 Decision Tree.  Moreover, the results showed higher sensitivity and accuracy than the bagging algorithm.  Table 4 shows this comparison, adapted from (Shouman et al., 2011).

Table 4.  Comparison between the Proposed Model and J4.8 and Bagging Algorithm. Adapted from (Shouman et al., 2011).

            While most researchers use binary discretization with the Gain Ratio Decision Tree in the diagnosis of heart disease patients, the researchers of this study concluded, based on their experiments, that the application of multi-interval equal frequency discretization with a nine-voting Gain Ratio Decision Tree provides a better result in the diagnosis of heart disease patients.  Moreover, the accuracy can be improved by increasing the granularity in splitting attributes offered by the multi-interval discretization.  The accuracy of the probability calculation is increased for any given value using the Gain Ratio calculation.  The voting application across multiple similar trees validated the higher probability and enhanced the selection of useful splitting attribute values.  The researchers proposed further research to apply the same techniques and evaluate the same performance measures on a larger dataset.

Conclusion

            This project discussed and analyzed the Decision Tree in diagnosing heart disease patients.  The project focused on the research study of (Shouman et al., 2011), who performed various experiments to evaluate the Decision Tree in the diagnosis of heart disease.  The key benefit of this study is the evaluation of multiple variants using various Decision Tree types such as Information Gain, Gini Index, and Gain Ratio.  The study also performed the experiments with and without the voting technique.  The project analyzed the exact steps performed by the researchers of (Shouman et al., 2011), the attributes used, the voting techniques, and the data discretization using the unsupervised methods of equal width and equal frequency and the supervised methods of chi merge and entropy.  The four major steps for the evaluation of the Decision Tree in the diagnosis of heart disease are Data Discretization, Data Partitioning, Training Data and Decision Tree Type Selection, and Reduced Error Pruning to develop a pruned Decision Tree.  The findings of the researchers indicated that the Gain Ratio Decision Tree type increases the accuracy of the probability calculation.  The researcher of this project agrees with the study's authors that further experimentation using a larger dataset is needed to verify whether the results hold.

References

Shouman, M., Turner, T., & Stocker, R. (2011). Using decision tree for diagnosing heart disease patients. Paper presented at the Proceedings of the Ninth Australasian Data Mining Conference-Volume 121.

UCI. (1988). Heart Disease Dataset. Retrieved from http://archive.ics.uci.edu/ml/datasets/Heart+Disease.

Theories and Techniques Used in the Diagnosis of Illness with Big Data Analytics

Dr. Aly, O.
Computer Science.

Introduction

The purpose of this discussion is to analyze the theories and techniques which can be used in the diagnosis of illnesses with the use of Big Data Analytics.  The discussion is followed by some recommendations and the rationale for those recommendations.

Advanced Analytical Theories and Methods

According to (EMC, 2015), there are six main advanced analytical theories and methods which can be utilized to analyze Big Data in different fields such as finance, medicine, manufacturing, marketing, and so forth.  These six analytical models are Clustering, Association Rules, Regression, Classification, Time Series Analysis, and Text Analysis.  In this discussion, the researcher discusses and analyzes each model with the analytical method used for that model.  Based on the discussion and analysis of each model and its analytical methods, the discussion ends with a conclusion on the most appropriate analytical model and method for the diagnosis of illnesses.

  1. Clustering Model and K-Means Method
    The Clustering model is used to group similar objects using an unsupervised technique to find hidden structure within unlabeled data, because the labels to apply to the clusters cannot be determined in advance.  The K-means method is an unsupervised analytical method which is the most commonly used method for the Clustering model (a brief sketch combining Clustering and Classification appears after this list).  K-means identifies K clusters of objects based on the proximity of the objects to the centers of the K groups, for a selected value of K.  The center of each group is identified by the mean (average) of the n-dimensional attribute vectors of the objects in that cluster.  The Clustering model and its common K-means method can be used in processing images such as security images.  They can also be used in medicine, for example to target individuals for specific preventive measures or participation in a clinical trial.  Moreover, the Clustering model with its analytical technique of K-means can also be used in customer segmentation to identify customers with certain purchase patterns, to increase sales, and to retain customers by reducing the “churn rate.”  The Clustering model can also be applied to the human genetics field and to biology, to group and classify plants and animals.  Thus, marketing, medicine, biology, and economics can benefit from the application of the advanced analytical Clustering model.  When the clusters are identified, labels can be applied to each cluster to classify each group based on its characteristics.  Thus, the Clustering model can be used as a lead-in to the Classification model.
  2. Classification Model and Decision Tree and Naïve Bayes Methods
    The Classification model is used for data mining.  Unlike the Clustering model, the class labels are predetermined in the Classification model, where class labels are assigned to new observations.  Most Classification methods are “supervised” methods.  Logistic regression is a popular analytical method for classification.  The Classification model is commonly used for prediction.  It can be used in healthcare to diagnose patients with heart disease.  It can also be used to filter spam emails.  The two main Classification analytical methods discussed here are the “Decision Tree,” also known as the “Prediction Tree,” and “Naïve Bayes.”  Additional Classification methods are also available, such as bagging, boosting, random forest, and the support vector machine.  However, the focus of this discussion is on the two classifiers of the “Decision Tree” and “Naïve Bayes.”

    The “Prediction Tree” utilizes a tree structure to identify the sequences of decisions and consequences.  It has two categories of trees: the “Classification Tree” and the “Regression Tree.”  While the “Classification Tree” applies to output variables which are categorical, such as Yes|No, the “Regression Tree” applies to output variables which are numeric or continuous, such as the predicted price of a product.  The “Decision Tree” can be used as a checklist of symptoms during the doctor's evaluation of a patient's case.  The analytical method of the Decision Tree can also be used in the artificial intelligence engine of a video game, in financial institutions to decide whether a loan can be approved, and in animal classification, such as mammal or not mammal.  The Decision Tree utilizes a “greedy algorithm,” choosing the best option available at each moment.

    The “Naïve Bayes” analytical method is a probabilistic classification method which is based on “Bayes' Theorem” or “Bayes' Law.”  This theorem provides the relationship between the probabilities of two events and their conditional probabilities.  There is an assumption that the absence or presence of a particular feature of a class is not related to the absence or presence of other features.  Bayes' Law utilizes categorical input variables; however, variations of the algorithm can work with continuous variables.  The Naïve Bayes classifiers perform better than the Decision Tree on categorical values with many levels.  The Naïve Bayes classifiers are robust to irrelevant variables, which are distributed among all classes, and can handle missing values, unlike Logistic Regression.  The Naïve Bayes algorithm handles high-dimensional data efficiently.  Naïve Bayes is competitive with other learning algorithms such as Decision Trees and Neural Networks, and in some cases it outperforms other methods.  The Naïve Bayes classifiers are easy to implement and can execute efficiently even without prior knowledge of the data.  They are commonly used to classify text documents, as in email spam filtering.  The Naïve Bayes classifiers can also be used in fraud detection, such as auto insurance fraud.  They can also be used to compute the probability of a patient having a disease (see the sketch after this list).
  3. Regression Model and Linear Regression and Logistic Regression:  Regression analysis is used to explain the influence of input (independent) variables on an outcome (dependent) variable.  It has two main categories: “Linear Regression” and “Logistic Regression.”  The “Linear Regression” can estimate, for example, the expected income of a person, while the “Logistic Regression” can compute the probability that an applicant will default on a loan.  Additional Regression models include “Ridge Regression” and “Lasso Regression.”  The discussion in this section is limited to “Linear Regression” and “Logistic Regression.”

    “Linear Regression” is an analytical method used to model the relationship between several input variables and a continuous outcome variable.  There is a key assumption that the relationship between the input and the output variables is linear.  The “Linear Regression” method is used in business, government, and other scenarios.  Examples of the application of “Linear Regression” include real estate, demand forecasting, and medicine.  In real estate, the Linear Regression method can be used to model residential home prices as a function of the home's living area.  In demand forecasting, Linear Regression can be used to predict the demand for products and services.  In medicine, Linear Regression can be used to analyze the effect of a proposed radiation treatment on reducing tumor sizes in cancer patients.

    “Logistic Regression” can be used to predict the probability of an outcome based on the input variables.  The outcome variable is categorical, unlike in “Linear Regression,” where the outcome variable is continuous.  The “Logistic Regression” analytical method is used in medicine, finance, marketing, and engineering.  In the medical field, “Logistic Regression” can be used in scenarios such as determining the probability that a patient will respond positively to a specific treatment.  In the finance field, “Logistic Regression” can be used in scenarios such as determining the probability that a person will default on a loan.  In marketing, “Logistic Regression” can be used in scenarios such as determining the probability that a customer will switch to another wireless carrier.  In the engineering field, “Logistic Regression” can be used in scenarios such as determining the probability that a mechanical part will experience a malfunction or failure.

    As a brief comparison, the “Linear Regression” model is suitable when the outcome variable is continuous, while the “Logistic Regression” model is suitable when the outcome variable is categorical.  Each method can be applied in the scenarios explained above.
  4. Association Rules and the Apriori Method:  The Association Rules model is an unsupervised method.  It is a descriptive, not predictive, method, often used to discover interesting relationships hidden in a large dataset.  The “Association Rules” are commonly used for mining transactions in databases.  Examples of scenarios for “Association Rules” include products which are purchased together and similar customers buying similar products.  The “apriori” algorithm is the earliest algorithm for generating Association Rules, and “support” is the major measure used by the “apriori” method.  The “apriori” algorithm takes a bottom-up iterative approach to discover frequent itemsets, by identifying all possible itemsets and determining the most frequent ones.  The “apriori” algorithm decreases the computational workload by only examining the itemsets which meet the specified minimum support threshold.  However, if the size of the dataset is very large, the “apriori” method can be computationally expensive.  Thus, various approaches such as partitioning, sampling, and transaction reduction can be used to improve the efficiency of the “apriori” algorithm.

    The Association Rules can be applied to marketing.  The “market basket analysis” is one specific implementation of “Association Rules” mining.  Recommender systems and clickstream analyses also use “Association Rules” mining.  Moreover, as indicated in (Wassan, 2014), a recommender system can also be used to extract relevant information from Electronic Health Records and offer healthcare recommendations to users or the various stakeholders of the clinical environment.
  5. Time Series Analysis and the ARIMA Model:  The “Time Series” analysis attempts to model the underlying structure of observations taken over time.  Various methods are used for “Time Series” analysis, such as the “Autoregressive Integrated Moving Average” (ARIMA), the “Autoregressive Moving Average with Exogenous Inputs” (ARMAX), spectral analysis, “Generalized Autoregressive Conditional Heteroscedasticity” (GARCH), Kalman filtering, and “Multivariate Time Series Analysis.”  The focus in this discussion is on the ARIMA model.

    There are four components of a “Time Series”: trend, seasonality, cyclic, and random.  The Box-Jenkins methodology removes any trend or seasonality from the “Time Series,” since the series must be stationary for the ARIMA model to apply properly.  The advantage of the ARIMA model is that the analysis can be based on the historical time series data for the variable.  The disadvantage of the ARIMA model is its minimum data requirement: a sufficiently long historical series is needed.

    The “Time Series” model can be used in finance, economics, biology, engineering, retail, and manufacturing.  In the Retail scenario, the model can be used to forecast future monthly sales.  In the Manufacturing field, the model can be used to forecast future spare-part demand to ensure an adequate supply of the parts to repair customer products.  In the Finance field, the model can be used for stock trading with a technique called pairs trading, where a strong correlation between the prices of two stocks is identified.
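    As a minimal sketch of a seasonal ARIMA fit in R, the built-in monthly AirPassengers series can stand in for the monthly sales scenario described above; the chosen (1,1,1)(0,1,1)12 model order is illustrative only:

# Minimal ARIMA sketch on the built-in monthly AirPassengers series
y   <- log(AirPassengers)                          # log transform to stabilize variance
fit <- arima(y, order = c(1, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12))
fc  <- predict(fit, n.ahead = 12)                  # forecast the next 12 months
exp(fc$pred)                                       # back-transform to the original scale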
  6. Text Analysis and Text Mining:  The “Text Analysis,” also known as “Text Analytics,” involves the processing and modeling of textual data to derive useful insight.  “Text Mining” is an important component of the “Text Analysis,” used to discover relationships and patterns in large text collections.  The unstructured nature of the text collections is a challenging factor in Text Analysis.  The typical process for the Text Analysis involves six steps: the collection of the raw text, the representation of the text, the use of the “Term Frequency-Inverse Document Frequency” (TFIDF) measure to compute the usefulness of each word or term in the text, the categorization of the documents by topics using topic modeling, the sentiment analysis, and the gaining of greater insight.  This model can be used in scenarios such as social media analysis.
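    As a minimal illustration of the TFIDF step in base R, term weights can be computed over a tiny toy corpus; the three example documents are invented for illustration:

# Minimal TFIDF sketch in base R over a tiny, invented toy corpus
docs   <- c("heart disease risk factors",
            "heart rate monitoring data",
            "stream processing of sensor data")
tokens <- strsplit(tolower(docs), "\\s+")
vocab  <- sort(unique(unlist(tokens)))
tf     <- t(sapply(tokens, function(d) table(factor(d, levels = vocab))))  # term frequency
idf    <- log(length(docs) / colSums(tf > 0))      # inverse document frequency
tfidf  <- sweep(tf, 2, idf, `*`)                   # weight each term count by its IDF
round(tfidf, 2)

    Under this weighting, a term that appears in every document receives a weight of zero, while a term that is distinctive for a single document receives the highest weight.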

Conclusion

Based on the above discussion and analysis of the advanced analytical models and methods, the researcher concludes that the Clustering, Regression, and Classification models can be used in the medical field.  However, each model can serve the medical field in different areas.  For instance, the Clustering model with the K-Means analytical method can be used in the medical domain for preventive measures.  The Regression model can also be used in the medical field to analyze the effect of a certain medication or treatment on the patient, and the probability for the patient to respond positively to a specific treatment.  The Classification model seems to be the most appropriate model to diagnose illness.  The Classification model with the Decision Tree and Naïve Bayes methods can be used to diagnose patients with certain diseases such as heart disease, and to estimate the probability of a patient having a certain disease.

References

EMC. (2015). Data Science and Big Data Analytics: Wiley.

Wassan, J. T. (2014). Modelling Stack Framework for Accessing Electronic Health Records with Big Data Needs. International Journal of Computer Applications, 106(1).

Advanced Processing Techniques for Big Data

Dr. Aly, O.
Computer Science

Abstract

The purpose of this project is to discuss and analyze advanced processing techniques for Big Data.  There are various processing systems such as Iterative Processing, Graph Processing, Stream Processing (also known as Event Processing or Real-Time Processing), and Batch Processing.  A MapReduce-based framework such as Hadoop supports Batch-Oriented Processing.  MapReduce lacks built-in support for Iterative Processing, which requires parsing datasets iteratively, for large Graph Processing, and for Stream Processing.  Thus, various models such as Twister and iMapReduce are introduced to improve the Iterative Processing of MapReduce, and Surfer, Apache Hama, Pregel, and GraphLab for large Graph Processing.  Other models are also introduced to support Stream Processing, such as Aurora, Borealis, and IBM InfoSphere Streams.  This project focuses on the discussion and the analysis of the Stream Processing models of Aurora and Borealis.  The discussion and the analysis of the Aurora model include an overview of the Aurora model as a Streaming Processing Engine (SPE), followed by the Aurora Framework and the fundamental components of the Aurora topology.  The Query Model of Aurora, which is known as the Stream Query Algebra “SQuAl,” supports seven operators constructing the Aurora network and queries for expressing its stream processing requirements.  The discussion and analysis also include SQuAl and the Query Model, the Run-Time Framework, and the Optimization systems to overcome bottlenecks in the network.  Aurora* and Medusa as Distributed Stream Processing systems are also discussed and analyzed.  The second SPE is Borealis, which is a Distributed SPE.  The discussion and the analysis of Borealis involve the framework, the query model, and the optimization techniques to overcome bottlenecks in the network.  A comparison between Aurora and Borealis is also discussed and analyzed.

Keywords: Stream Processing, Event Processing, Aurora, Borealis.

Introduction

            When dealing with Big Data, its different characteristics and attributes such as volume, velocity, variety, veracity, and value must be taken into consideration (Chandarana & Vijayalakshmi, 2014).  Thus, different types of frameworks are required to run different types of analytics (Chandarana & Vijayalakshmi, 2014).  Large-scale data processing involves different types of workloads (Chandarana & Vijayalakshmi, 2014).  Organizations deploy a combination of different types of workloads to achieve the business goal (Chandarana & Vijayalakshmi, 2014).  These various types of workloads involve Batch-Oriented Processing, Online Transaction Processing (OLTP), Stream Processing, Interactive ad-hoc Query and Analysis (Chandarana & Vijayalakshmi, 2014), and Online Analytical Processing (OLAP) (Erl, Khattak, & Buhler, 2016).

For the Batch-Oriented Processing, a MapReduce-based framework such as Hadoop can be deployed for recurring tasks such as large-scale Data Mining or Aggregation (Chandarana & Vijayalakshmi, 2014; Erl et al., 2016; Sakr & Gaber, 2014).  For OLTP workloads such as user-facing e-commerce transactions, Apache HBase can be deployed (Chandarana & Vijayalakshmi, 2014).  The OLTP system processes transaction-oriented data (Erl et al., 2016).  For the Stream Processing, the Storm framework can be deployed to handle stream sources such as social media feeds or sensor data (Chandarana & Vijayalakshmi, 2014).  For the Interactive ad-hoc Query and Analysis, the Apache Drill framework can be deployed (Chandarana & Vijayalakshmi, 2014).  For OLAP, which forms an integral part of Business Intelligence, Data Mining, and Machine Learning, the systems are used for processing data analysis queries (Erl et al., 2016).

The Apache Hadoop framework allows the distributed processing of large data sets across clusters of computers using simple programming models (Chandarana & Vijayalakshmi, 2014).  The Apache Hadoop framework involves four major modules: Hadoop Core, the Hadoop Distributed File System (HDFS), Hadoop YARN, and Hadoop MapReduce.  The Hadoop Core provides the common utilities which support the other modules.  The HDFS module provides high-throughput access to application data.  The Hadoop YARN module is for job scheduling and resource management.  The Hadoop MapReduce module is for the parallel processing of large-scale datasets (Chandarana & Vijayalakshmi, 2014).

Moreover, there are various processing systems such as Iterative Processing (Schwarzkopf, Murray, & Hand, 2012), Graph Processing, and Stream Processing (Sakr & Gaber, 2014; Schwarzkopf et al., 2012).  The Iterative Processing systems utilize in-memory caching (Schwarzkopf et al., 2012).  Many data analysis applications require the iterative processing of the data, including algorithms for text-based search and machine learning.  However, because MapReduce lacks built-in support for iterative processing, which requires parsing datasets iteratively (Zhang, Chen, Wang, & Yu, 2015; Zhang, Gao, Gao, & Wang, 2012), various models such as Twister, HaLoop, and iMapReduce are introduced to improve the iterative processing of MapReduce (Zhang et al., 2015).  With regard to Graph Processing, MapReduce is suitable for processing flat data structures such as vertex-oriented tasks, while propagation is optimized for edge-oriented tasks on partitioned graphs.  However, to improve the programming models for large graph processing, various models are introduced such as Surfer (Chen, Weng, He, & Yang, 2010; Chen et al., 2012), GraphX (Gonzalez et al., 2014), Apache Hama, GoldenOrb, Giraph, Phoebus, GPS (Cui, Mei, & Ooi, 2014), Pregel (Cui et al., 2014; Hu, Wen, Chua, & Li, 2014; Sakr & Gaber, 2014), and GraphLab (Cui et al., 2014; Hu et al., 2014; Sakr & Gaber, 2014).  With regard to Stream Processing, because MapReduce is designed for Batch-Oriented Computation such as log analysis and text processing (Chandarana & Vijayalakshmi, 2014; Cui et al., 2014; Erl et al., 2016; Sakr & Gaber, 2014; Zhang et al., 2015; Zhang et al., 2012), and is not adequate for supporting real-time stream processing tasks (Sakr & Gaber, 2014), various Stream Processing models are introduced such as DEDUCE, Aurora, Borealis, IBM Spade, StreamCloud, Stormy (Sakr & Gaber, 2014), Twitter Storm (Grolinger et al., 2014; Sakr & Gaber, 2014), Spark Streaming, Apache Storm (Fernández et al., 2014; Gupta, Gupta, & Mohania, 2012; Hu et al., 2014; Scott, 2015), StreamMapReduce (Grolinger et al., 2014), Simple Scalable Streaming System (S4) (Fernández et al., 2014; Grolinger et al., 2014; Gupta et al., 2012; Hu et al., 2014; Neumeyer, Robbins, Nair, & Kesari, 2010), and IBM InfoSphere Streams (Gupta et al., 2012).

            The project focuses on two models of the Stream Processing.  The discussion and the analysis will be on Aurora stream processing systems and Borealis stream processing systems.  The discussion and the analysis will also address their characteristics, architectures, performance optimization capability, and scalability.  The project will also discuss and analyze the performance bottlenecks, the cause of such bottlenecks and the strategies to remove these bottlenecks.  The project begins with a general discussion on the Stream Processing.

Stream Processing Engines

            Stream Processing is defined by (Manyika et al., 2011) as technologies designed to process large real-time streams of event data.  Stream Processing enables applications such as algorithmic trading in financial services, RFID event processing applications, fraud detection (Manyika et al., 2011; Scott, 2015), process monitoring, and location-based services in telecommunications (Manyika et al., 2011).  Stream Processing reflects Real-Time Streaming and is also known as “Event Stream Processing” (Manyika et al., 2011).  The “Event Stream Processing” is also known as “Streaming Analytics,” which is used to process customer-centric data “on the fly” without the need for long-term storage (Spiess, T’Joens, Dragnea, Spencer, & Philippart, 2014).

            In the Real-Time mode, the data is processed in-memory because it is captured before it gets persisted to the disk (Erl et al., 2016).  The response time ranges from a sub-second to under a minute (Erl et al., 2016).  The Real-Time mode reflects the velocity characteristic of Big Data datasets (Erl et al., 2016).  When Big Data is processed using Real-Time or Event Stream Processing, the data arrives continuously in a stream, or at intervals in events (Erl et al., 2016).  The individual data items in a stream are small; however, the continuous nature of streaming results in very large datasets over time (Erl et al., 2016; Gradvohl, Senger, Arantes, & Sens, 2014).  The Real-Time mode also involves the “Interactive Mode” (Erl et al., 2016).  The “Interactive Mode” refers to Query Processing in Real-Time (Erl et al., 2016).

            The systems of the Event Stream Processing (ESP) are designed to provide high-performance analysis of streams with low latency (Gradvohl et al., 2014).  The first Event Stream Processing (ESP) systems, which were developed in the early 2000s, include Aurora, Borealis, STREAM, TelegraphCQ, NiagaraCQ, and Cougar (Gradvohl et al., 2014).  During that time, the systems were centralized, namely running on a single server, aiming to overcome the issues of stream processing with traditional databases (Gradvohl et al., 2014).  Tremendous efforts have been exerted to enhance and improve data stream processing, from centralized stream processing systems to stream processing engines with the ability to distribute queries among a cluster of nodes (Sakr & Gaber, 2014).  The next discussion focuses on two of the scalable stream processing engines: Aurora and Borealis.

  1. Aurora Streaming Processing Engine

Aurora was introduced through a project effort from Brandeis University, Brown University, and MIT (Abadi et al., 2003; Sakr & Gaber, 2014).  The prototype of the Aurora implementation was introduced in 2003 (Abadi et al., 2003; Sakr & Gaber, 2014).  The GUI interface of Aurora is based on Java, allowing the construction and execution of arbitrary Aurora networks and queries (Abadi et al., 2003; Sakr & Gaber, 2014).  The Aurora system is described as a processing model to manage data streams for monitoring applications, which are distinguished substantially from traditional business data processing (Abadi et al., 2003; Sakr & Gaber, 2014).  The main aim of the monitoring applications is to monitor continuous streams of data (Abadi et al., 2003; Sakr & Gaber, 2014).  An example of these Monitoring Applications is military applications which monitor readings from sensors worn by soldiers, such as blood pressure, heart rate, and position.  Another example of these Monitoring Applications is financial analysis applications which monitor the stock data streams reported from various stock exchanges (Abadi et al., 2003; Sakr & Gaber, 2014).  Tracking Applications which monitor the location of large numbers of objects are another type of Monitoring Application (Abadi et al., 2003; Sakr & Gaber, 2014).

Due to the nature of the Monitoring Applications, they can benefit from a Database Management System (DBMS) because of the high volume of monitored data and the query requirements of these applications (Abadi et al., 2003; Sakr & Gaber, 2014).  However, the existing DBMS systems are unable to fully support such applications because DBMS systems target Business Applications and not Monitoring Applications (Abadi et al., 2003; Sakr & Gaber, 2014).  A DBMS gets its data from humans issuing transactions, while the Monitoring Applications get their data from external sources such as sensors (Abadi et al., 2003; Sakr & Gaber, 2014).  The role of the DBMS when supporting the Monitoring Applications is to detect and alert humans of any abnormal activities (Abadi et al., 2003; Sakr & Gaber, 2014).  This model is described as the DBMS-Active, Human-Passive (DAHP) Model (Abadi et al., 2003; Sakr & Gaber, 2014).  This model is different from the traditional DBMS model, which is described as the Human-Active, DBMS-Passive (HADP) Model, where humans initiate queries and transactions on the DBMS passive repository (Abadi et al., 2003; Sakr & Gaber, 2014).

Besides, the Monitoring Applications require not only the latest value of an object but also the historical values (Abadi et al., 2003; Sakr & Gaber, 2014).  The Monitoring Applications are trigger-oriented applications which send an alert message when abnormal activities are detected (Abadi et al., 2003; Sakr & Gaber, 2014).  Moreover, the Monitoring Applications must deal with approximate answers due to the nature of data stream processing, where data can be lost or omitted for processing reasons.  The last characteristic of the Monitoring Applications involves the Real-Time requirement and the Quality-of-Service (QoS).  Table 1 summarizes these five major characteristics of the Monitoring Applications, for which Aurora systems are designed to manage data streams.

Table 1: Monitoring Applications Characteristics.

1.1 Aurora Framework

The traditional DBMS could not be used to implement these Monitoring Applications with these challenging characteristics (Abadi et al., 2003; Carney et al., 2002; Cherniack et al., 2003; Sakr & Gaber, 2014).  The prevalent requirements of these Monitoring Applications are the data and information streams, triggers, imprecise data, and real-time requirements (Abadi et al., 2003; Carney et al., 2002; Cherniack et al., 2003; Sakr & Gaber, 2014).  Thus, Aurora systems are designed to support these Monitoring Applications with these challenging characteristics and requirements (Abadi et al., 2003; Carney et al., 2002; Cherniack et al., 2003; Sakr & Gaber, 2014).  The underlying concept of the Aurora System Model is to process the incoming data streams in the manner specified by an application administrator, using the boxes-and-arrows paradigm of a data-flow system, where the tuples flow through a loop-free, directed graph of processing operations (Abadi et al., 2003; Carney et al., 2002; Cherniack et al., 2003; Sakr & Gaber, 2014).  The output streams are presented to applications which are programmed to deal with the asynchronous tuples in an output stream (Abadi et al., 2003; Sakr & Gaber, 2014).  The Aurora System Model also maintains historical storage to support ad-hoc queries (Abadi et al., 2003; Sakr & Gaber, 2014).  The Aurora systems handle data from a variety of sources such as computer programs which generate values at regular or irregular intervals, or hardware sensors (Abadi et al., 2003; Carney et al., 2002; Cherniack et al., 2003; Sakr & Gaber, 2014).  Figure 1 illustrates the Aurora System Model, reflecting the input data streams, the operator boxes, the continuous and ad-hoc queries, and the output to applications.

Figure 1:  Overview of Aurora System Model and Architecture.  Adapted from (Abadi et al., 2003; Carney et al., 2002; Cherniack et al., 2003; Sakr & Gaber, 2014).

1.2  Aurora Query Model: SQuAl Using Seven Primitive Operations

            The Aurora Stream Query Algebra (SQuAl) supports seven operators which are used to construct Aurora networks and queries for expressing its stream processing requirements (Abadi et al., 2003; Sakr & Gaber, 2014).  Many of these operators have analogs in the relational query operators.  For instance, the “filter” operator in the Aurora Query Algebra, which applies any number of predicates to each incoming tuple and routes the tuples based on the satisfied predicates, is like the relational operator “select” (Abadi et al., 2003; Sakr & Gaber, 2014).  The “aggregate” operator in the Aurora Query Algebra computes stream aggregations to address the fundamental push-based nature of data streams, applying a function such as a moving average across a window of values in a stream (Abadi et al., 2003; Sakr & Gaber, 2014).  The windowed operations are required when the data is stale or time-imprecise (Abadi et al., 2003; Sakr & Gaber, 2014).  The application administrator in the Aurora System Model can connect the output of one box to the input of several others, which implements an “implicit split” operation rather than the “explicit split” of the relational operators (Abadi et al., 2003; Sakr & Gaber, 2014).  Besides, the Aurora System Model contains an “explicit union” operation where two streams can be put together (Abadi et al., 2003; Sakr & Gaber, 2014).  The Aurora System Model also represents a collection of streams with a common schema, called “Arcs” (Abadi et al., 2003; Sakr & Gaber, 2014).  An Arc does not have any specific number of streams, which makes it easier to have streams come and go without any modification to the Aurora network (Abadi et al., 2003; Sakr & Gaber, 2014).
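            The following is a conceptual sketch only, written in R rather than in Aurora’s SQuAl, of what a windowed “aggregate” operator such as a moving average computes over a stream of tuples; the timestamps and values are invented for illustration:

# Conceptual sketch only (not Aurora's SQuAl): a sliding-window aggregate over a stream
stream <- data.frame(ts    = 1:10,                          # tuple timestamps
                     value = c(5, 7, 6, 9, 8, 12, 11, 10, 13, 14))
w <- 3                                                      # window of the last 3 tuples
moving_avg <- sapply(w:nrow(stream), function(i) {
  mean(stream$value[(i - w + 1):i])                         # aggregate over the window
})
data.frame(ts = stream$ts[w:nrow(stream)], moving_avg)      # one output tuple per window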

            In the Aurora Query Model, a stream is an append-only sequence of tuples with a uniform schema, where each tuple in a stream has a timestamp for QoS calculations (Abadi et al., 2003; Sakr & Gaber, 2014).  When using the Aurora Query Model, no arrival order is assumed, which helps in gaining latitude for producing outputs out of order so that high-priority tuples can be served first (Abadi et al., 2003; Sakr & Gaber, 2014).  Moreover, this no-arrival-order assumption also helps in redefining the windows on attributes, and in merging multiple streams (Abadi et al., 2003; Sakr & Gaber, 2014).  Some operators are described as “order-agnostic,” such as Filter, Map, and Union.  Other operators are described as “order-sensitive,” such as BSort, Aggregate, Join, and Resample, where they can only be guaranteed to execute with finite buffer space and in a finite time if they can assume some ordering over their input streams (Abadi et al., 2003; Sakr & Gaber, 2014).  Thus, the order-sensitive operators require order specification arguments which indicate the expected arrival order of the tuples (Abadi et al., 2003; Sakr & Gaber, 2014).

            The Aurora Query Model supports three main operation modes: (1) the continuous queries of real-time processing, (2) the views, and (3) the ad-hoc queries (Abadi et al., 2003; Sakr & Gaber, 2014).  These three operation modes utilize the same conceptual building blocks, processing flows based on QoS specifications (Abadi et al., 2003; Sakr & Gaber, 2014).  In the Aurora Query Model, each output is associated with two-dimensional QoS graphs which specify the utility of the output with regard to several performance-related and quality-related attributes (Abadi et al., 2003; Sakr & Gaber, 2014).  The stream-oriented operators which constitute the Aurora network and queries are designed to operate in a data-flow mode where data elements are processed as they appear on the input (Abadi et al., 2003; Sakr & Gaber, 2014).

1.3 Aurora Run-Time Framework and Optimization

The main purpose of the Aurora run-time operations is to process data flows through a potentially large workflow diagram (Abadi et al., 2003; Sakr & Gaber, 2014).  The Aurora Run-Time Architecture involves five main techniques: (1) the QoS data structure, (2) the Aurora Storage Management (ASM), (3) the Run-Time Scheduling (RTS), (4) the Introspection, and (5) the Load Shedding.   

The QoS is a multi-dimensional function which involves response times, tuple drops, and values produced (Abadi et al., 2003; Sakr & Gaber, 2014).  The ASM is designed to store all tuples required by the Aurora network.  The ASM requires two main operations: one to manage storage for the tuples being passed through an Aurora network, and a second to maintain the extra tuple storage which may be required at the connection points.  Thus, the ASM involves two main management operations: (1) the Queue Management, and (2) the Connection Point Management (Abadi et al., 2003; Sakr & Gaber, 2014).

The RTS in Aurora is challenging because of the need to simultaneously address several issues such as the large system scale, real-time performance requirements, and dependencies between box executions (Abadi et al., 2003; Sakr & Gaber, 2014).  Besides, the processing of a tuple in Aurora spans many scheduling and execution steps, where the input tuple goes through many boxes before potentially contributing to an output stream, which may require secondary storage (Abadi et al., 2003; Sakr & Gaber, 2014).  The Aurora systems reduce the overall processing costs by using two main non-linearities when processing tuples: the “Interbox Non-Linearity” and the “Intrabox Non-Linearity” techniques.  The Aurora systems take advantage of the non-linearity in both the interbox and the intrabox tuple processing through the “Train Scheduling” (Abadi et al., 2003; Sakr & Gaber, 2014).  The “Train Scheduling” is a set of scheduling heuristics which attempt (1) to have boxes queue as many tuples as possible without processing, thus generating long tuple trains, (2) to process complete trains at once, thus exploiting the “Intrabox Non-Linearity” technique, and (3) to pass them to subsequent boxes without having to go to disk, thus employing the “Interbox Non-Linearity” technique (Abadi et al., 2003; Sakr & Gaber, 2014).  The primary goal of the “Train Scheduling” is to minimize the number of I/O operations performed per tuple.  The secondary goal of the “Train Scheduling” is to minimize the number of box calls made per tuple.  With regard to the Introspection technique, Aurora systems employ static and dynamic (run-time) introspection techniques to predict and detect overload situations (Abadi et al., 2003; Sakr & Gaber, 2014).  The purpose of the static introspection technique is to determine whether the hardware running the Aurora network is sized correctly.  The dynamic analysis, which is based on the run-time introspection technique, uses the timestamps carried by all tuples (Abadi et al., 2003; Sakr & Gaber, 2014).  With regard to the “Load Shedding,” Aurora systems reduce the volume of tuple processing via load shedding if an overload is detected as a result of the static or dynamic analysis, by either dropping tuples or filtering tuples (Abadi et al., 2003; Sakr & Gaber, 2014).  Figure 2 illustrates the Aurora Run-Time Architecture, adapted from (Abadi et al., 2003; Sakr & Gaber, 2014).

Figure 2:  Aurora Run-Time Framework. Adapted from (Abadi et al., 2003).
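As a conceptual sketch only, and not Aurora’s actual implementation, the random load shedding described above can be illustrated in R by dropping a fraction of queued tuples once the queue exceeds a threshold; the threshold and drop rate are invented for illustration:

# Conceptual sketch only (not Aurora's implementation): random load shedding
shed_load <- function(queue, max_len = 100, drop_rate = 0.3) {
  if (length(queue) <= max_len) return(queue)   # no overload detected, keep every tuple
  keep <- runif(length(queue)) > drop_rate      # randomly drop roughly 30% of the tuples
  queue[keep]
}
set.seed(1)
queue <- rnorm(250)                             # a simulated backlog of 250 tuples
length(shed_load(queue))                        # roughly 175 tuples survive the shedding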

The Aurora optimization techniques involve two main optimization systems: (1) the dynamic continuous query optimization, and (2) the ad-hoc query optimization.  The dynamic continuous query optimization involves the optimization techniques of inserting projections, combining boxes, and reordering boxes (Abadi et al., 2003).  The ad-hoc query optimization involves historical information because Aurora semantics require the historical sub-network to be run first.  This historical information is organized in a B-tree data model (Abadi et al., 2003).  The initial boxes in an ad-hoc query can pull information from the B-tree associated with the corresponding connection point (Abadi et al., 2003).  When the historical operation is finished, the Aurora optimization technique switches the implementation to the standard push-based data structures and continues processing in the conventional mode (Abadi et al., 2003).

1.4 Aurora* and Medusa for Distributed Stream Processing

The Aurora System is a centralized stream processor.  However, in (Cherniack et al., 2003), Aurora* and Medusa are proposed for distributed processing.  Several architectural issues must be addressed for building a large-scale distributed version of a stream processing system such as Aurora.  In (Cherniack et al., 2003), the problem is divided into two categories: intra-participant distribution and inter-participant distribution.  The intra-participant distribution involves small-scale distribution within one administrative domain, which is handled by the proposed model of Aurora* (Cherniack et al., 2003).  The inter-participant distribution involves large-scale distribution across administrative boundaries, which is handled by the proposed model of Medusa (Cherniack et al., 2003).

2. Borealis Streaming Processing Engine

Borealis is described as a second-generation Distributed SPE which was also developed at Brandeis University, Brown University, and MIT (Abadi et al., 2005; Sakr & Gaber, 2014).  The Borealis streaming model inherits the core stream processing functionality from the Aurora model, and the core distribution functionality from the Medusa model (Abadi et al., 2005; Sakr & Gaber, 2014).  The Borealis model is an expansion and extension of both models to provide more advanced capabilities and functionalities which are commonly required by newly-emerging stream processing applications (Abadi et al., 2005; Sakr & Gaber, 2014).  Borealis is regarded as the successor to Aurora (Abadi et al., 2005; Sakr & Gaber, 2014).

The second generation of the SPE has three main requirements which are critical and at the same time challenging.  The first requirement involves the “Dynamic Revision of Query Results” (Abadi et al., 2005; Sakr & Gaber, 2014).  Applications are forced to live with imperfect results because corrections or updates to previously processed data are only available after the fact, unless the system has techniques to revise its processing and results to take into account newly available data or updates (Abadi et al., 2005; Sakr & Gaber, 2014).  The second requirement for the second generation of the SPE involves “Dynamic Query Modification,” which allows fast and automatic modification at runtime with low overhead (Abadi et al., 2005; Sakr & Gaber, 2014).  The third requirement for the second generation of the SPE involves “Flexible and Highly-Scalable Optimization,” where the optimization problem is more balanced between sensor-heavy and server-heavy optimization.  A more flexible optimization structure is needed to deal with a large number of devices and to perform cross-network, sensor-heavy and server-heavy resource management and optimization (Abadi et al., 2005; Sakr & Gaber, 2014).  However, this requirement for such an optimization framework brings two additional challenges.  The first challenge is the ability to simultaneously optimize different QoS metrics such as processing latency, throughput, or sensor lifetime (Abadi et al., 2005; Sakr & Gaber, 2014).  The second challenge of such a flexible optimization structure and framework is the ability to perform optimizations at different levels of granularity: at the node level, the sensor network level, the level of a cluster of sensors and servers, and so forth (Abadi et al., 2005; Sakr & Gaber, 2014).  These advanced challenges, capabilities, and requirements for the second generation of the SPE are added to the classical architecture of the SPE to form and introduce the Borealis framework (Abadi et al., 2005; Sakr & Gaber, 2014).

2.1 The Borealis Framework

            The Borealis framework is a distributed stream processing engine where the collection of continuous queries submitted to Borealis can be seen as a giant network of operators whose processing is distributed to multiple sites (Abadi et al., 2005; Sakr & Gaber, 2014). There is a sensor proxy interface which acts as another Borealis site (Abadi et al., 2005; Sakr & Gaber, 2014).  The sensor networks can participate in query processing behind that sensor proxy interface (Abadi et al., 2005; Sakr & Gaber, 2014).

            A Borealis server runs on each node with the Global Catalog (GC), High Availability (HA) module, Neighborhood Optimizer (NHO), Local Monitor (LM), Admin, and Query Processor (QP) at the top, and meta-data, control, and data at the bottom of the framework.  The GC can be centralized or distributed across a subset of processing nodes, holding information about the complete query network and the location of all query fragments (Abadi et al., 2005; Sakr & Gaber, 2014).  The HA modules monitor each node to handle any failure (Abadi et al., 2005; Sakr & Gaber, 2014).  The NHO utilizes local information and information from other NHOs to improve the load balance between the nodes (Abadi et al., 2005; Sakr & Gaber, 2014).  The LM collects performance-related statistics as the local system runs and reports to the local optimizer as well as the NHOs (Abadi et al., 2005; Sakr & Gaber, 2014).  The QP is the core component of the Borealis framework.  The actual execution of the query is implemented in the QP (Abadi et al., 2005; Sakr & Gaber, 2014).  The QP, which is a single-site processor, receives the input data streams, and the results are pulled through the I/O Queues, which route the tuples to and from remote Borealis nodes and clients (Abadi et al., 2005; Sakr & Gaber, 2014).  The Admin module controls the QP and issues system control messages (Abadi et al., 2005; Sakr & Gaber, 2014).  These messages are pushed to the Local Optimizer (LO), which communicates with the major run-time components of the QP to enhance the performance.  These run-time components of Borealis include (1) the Priority Scheduler, (2) the Box Processors, and (3) the Load Shedder.  The Priority Scheduler determines the order of box execution based on the priority of the tuples.  The Box Processors can change their behavior during run-time based on the messages received from the LO.  The Load Shedder discards low-priority tuples when the node is overloaded (Abadi et al., 2005; Sakr & Gaber, 2014).  The Storage Manager is part of the QP and is responsible for storing and retrieving data which flows through the arcs of the local query diagram.  The Local Catalog is another component of the QP which stores the query diagram description and metadata and is accessible by all components.  Figure 3 illustrates the Borealis framework, adapted from (Abadi et al., 2005; Sakr & Gaber, 2014).

Figure 3.  Borealis’ Framework, adapted from (Abadi et al., 2005; Sakr & Gaber, 2014).

2.2 Borealis’ Query Model and Comparison with Aurora’s Query Model

            Borealis inherits the Aurora model of boxes-and-arrows to specify the continuous queries, where the boxes reflect the query operators and the arrows reflect the data flow between the boxes (Abadi et al., 2005; Sakr & Gaber, 2014).  Borealis extends the data model of Aurora by supporting three types of messages: insertion, deletion, and replacement (Abadi et al., 2005; Sakr & Gaber, 2014).  The Borealis queries are an extended version of Aurora’s operators to support revision messages (Abadi et al., 2005; Sakr & Gaber, 2014).  The query model of Borealis supports the modification of box semantics during runtime (Abadi et al., 2005; Sakr & Gaber, 2014).  The QoS in Borealis, as in Aurora, forms the basis of resource management decisions.  However, while each query output is provided with a QoS function in Aurora’s model, Borealis allows QoS to be predicted at any point in the data flow (Abadi et al., 2005; Sakr & Gaber, 2014).  Thus, Borealis supports a Vector of Metrics on supplied messages to allow such prediction of QoS.

In the context of the query result revision, Borealis supports a “replayable” query diagram and the revision of the processing scheme.  While Aurora has an append-only model where a message cannot be modified once it is placed on a stream, providing an approximate or imperfect result, the Borealis model supports the modification of messages to process the query intelligently and provide correct query results (Abadi et al., 2005; Sakr & Gaber, 2014).  The query diagram must be replayable when messages are revised and modified, because the processing of the modified message must replay a portion of the past with the modified value (Abadi et al., 2005; Sakr & Gaber, 2014).  This replaying process is also useful for recovery and high availability (Abadi et al., 2005; Sakr & Gaber, 2014).  This dynamic revision with the replaying process can add more overhead.  Thus, the “closed” model is used to generate deltas which show the effects of the revisions instead of the entire result.
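As a conceptual sketch only, written in R rather than in Borealis’s actual interface, the effect of a replacement message on a previously emitted windowed result can be illustrated as a replay of the window followed by a delta:

# Conceptual sketch only (not Borealis's interface): revising a windowed result
window       <- c(10, 12, 11, 9)           # tuple values that fell into one window
original_avg <- mean(window)               # result first emitted downstream
window[2]    <- 20                         # a replacement message revises the second tuple
revised_avg  <- mean(window)               # replay the window with the revised value
delta        <- revised_avg - original_avg # a delta summarizing the effect of the revision
c(original = original_avg, revised = revised_avg, delta = delta)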

In the context of query modification, Borealis provides online modification of continuous queries by supporting the control lines and the time travel features.  The control lines extend the basic query model of Aurora to change operator parameters, and even the operators themselves, during run-time (Abadi et al., 2005; Sakr & Gaber, 2014).  The Borealis boxes contain the standard data input lines and special control lines which carry messages with revised box parameters and new box functions (Abadi et al., 2005; Sakr & Gaber, 2014).  Borealis provides a new function called “Bind” to bind new parameters to free variables within a function definition, which leads to a new function being created (Abadi et al., 2005; Sakr & Gaber, 2014).  Aurora’s connection points are leveraged to enable the time travel in Borealis.  The original purpose of the connection points was to support ad-hoc queries, which can query historical and run-time data.  This concept is extended in the Borealis model to include connection point views, enabling time travel applications, ad-hoc queries, and the query diagram to access the connection points independently and in parallel (Abadi et al., 2005; Sakr & Gaber, 2014).  The connection point views include two operations to enable the time travel: the replay operation and the undo operation.

2.3  The Optimization Model of Borealis

Borealis has an optimizer framework to optimize processing across a combined sensor and server network, to deal with the QoS in stream-based applications, and to support scalability, both size-wise and geographically, for stream-based applications.  The optimization model contains multiple collaborating monitoring and optimization components.  The monitoring components include the local monitor at every site and the end-point monitor at output sites.  The optimization components include the global optimizer, the neighborhood optimizer, and the local optimizer (Abadi et al., 2005).  While Aurora evaluated the QoS only at the outputs and had a difficult job inferring QoS at upstream nodes, Borealis can evaluate the predicted-QoS score function on each message by utilizing the values of the Metrics Vector (Abadi et al., 2005).  Borealis utilizes Aurora’s concept of train scheduling of boxes and tuples to reduce the scheduling overhead.  While Aurora processes messages in order of arrival, Borealis offers box scheduling flexibility which allows processing messages out of order, because the revision technique can be used to process them later as insertions (Abadi et al., 2005).  Borealis offers a superior load shedding technique to Aurora’s technique (Abadi et al., 2005).  The Load Shedder in Borealis detects and handles overload situations by adding “drop” operators to the processing network (Ahmad et al., 2005).  The “drop” operator filters out messages, either based on the value of the tuple or in a randomized fashion, to overcome the overload (Ahmad et al., 2005).  Higher quality outputs can be achieved by allowing nodes in a chain to coordinate in choosing where and how much load to shed.  Distributed load shedding algorithms are used to collect local statistics from the nodes and to pre-compute potential drop plans at compilation time (Ahmad et al., 2005).  Figure 4 illustrates the Optimization Components of Borealis, adapted from (Abadi et al., 2005).

Figure 4.  The Optimization Components of Borealis, adapted from (Abadi et al., 2005).

            Borealis provides fault-tolerance techniques for a distributed SPE such as replication, running multiple copies of the same query network on distinct processing nodes (Abadi et al., 2005; Balazinska, Balakrishnan, Madden, & Stonebraker, 2005).  When a node experiences a failure on one of its input streams, the node tries to find an alternate upstream replica.  All replicas must be consistent.  To ensure such consistency, the data-serializing operator “SUnion” is used to take multiple streams as input and produce one output stream with deterministically ordered tuples, ensuring that all operators of the replicas process the same input in the same order (Abadi et al., 2005; Balazinska et al., 2005).  To provide high availability, each SPE ensures that input data is processed and results are forwarded within a user-specified time threshold of its arrival (Abadi et al., 2005; Balazinska et al., 2005).  When the failure is corrected, each SPE which processed tentative data reconciles its state and stabilizes its output by replacing the previously tentative output with stable data tuples forwarded to downstream clients, reconciling the state of the SPE based on checkpoint/redo, undo/redo, and the new concept of revision tuples (Abadi et al., 2005; Balazinska et al., 2005).

Conclusion

                This project discussed and analyzed advanced processing techniques for Big Data.  There are various processing systems such as Iterative Processing, Graph Processing, Stream Processing (also known as Event Processing or Real-Time Processing), and Batch Processing.  A MapReduce-based framework such as Hadoop supports Batch-Oriented Processing.  MapReduce lacks built-in support for Iterative Processing, which requires parsing datasets iteratively, for large Graph Processing, and for Stream Processing.  Thus, various models such as Twister, HaLoop, and iMapReduce are introduced to improve the Iterative Processing of MapReduce.  With regard to Graph Processing, MapReduce is suitable for processing flat data structures such as vertex-oriented tasks, while propagation is optimized for edge-oriented tasks on partitioned graphs.  However, various models are introduced to improve the programming models for large graph processing, such as Surfer, Apache Hama, GoldenOrb, Giraph, Pregel, and GraphLab.  With regard to Stream Processing, various models are also introduced to overcome the limitation of the MapReduce framework, which deals only with batch-oriented processing.  These Stream Processing models include Aurora, Borealis, IBM Spade, StreamCloud, Stormy, Twitter Storm, Spark Streaming, Apache Storm, StreamMapReduce, Simple Scalable Streaming System (S4), and IBM InfoSphere Streams.  This project focused on the discussion and the analysis of the Stream Processing models of Aurora and Borealis.  The discussion and the analysis of the Aurora model included an overview of the Aurora model as a Streaming Processing Engine (SPE), followed by the Aurora Framework and the fundamental components of the Aurora topology.  The Query Model of Aurora, which is known as the Stream Query Algebra “SQuAl,” supports seven operators constructing the Aurora network and queries for expressing its stream processing requirements.  The discussion and analysis also included SQuAl and the Query Model, the Run-Time Framework, and the Optimization systems.  Aurora* and Medusa as Distributed Stream Processing systems were also discussed and analyzed.  The second SPE is Borealis, which is a Distributed SPE.  The discussion and the analysis of Borealis involved the framework, the query model, and the optimization techniques.  Borealis is an expansion of Aurora’s SPE to include and support features which are required for Distributed Real-Time Streaming.  The comparison between Aurora and Borealis was also discussed and analyzed at all levels, from the network, to the query model, to the optimization techniques.

References

Abadi, D. J., Ahmad, Y., Balazinska, M., Cetintemel, U., Cherniack, M., Hwang, J.-H., . . . Ryvkina, E. (2005). The Design of the Borealis Stream Processing Engine.

Abadi, D. J., Carney, D., Çetintemel, U., Cherniack, M., Convey, C., Lee, S., . . . Zdonik, S. (2003). Aurora: a new model and architecture for data stream management. The VLDB Journal, 12(2), 120-139.

Ahmad, Y., Berg, B., Cetintemel, U., Humphrey, M., Hwang, J.-H., Jhingran, A., . . . Tatbul, N. (2005). Distributed operation in the Borealis stream processing engine. Paper presented at the Proceedings of the 2005 ACM SIGMOD international conference on Management of data.

Balazinska, M., Balakrishnan, H., Madden, S., & Stonebraker, M. (2005). Fault-tolerance in the Borealis distributed stream processing system. Paper presented at the Proceedings of the 2005 ACM SIGMOD international conference on Management of data.

Carney, D., Çetintemel, U., Cherniack, M., Convey, C., Lee, S., Seidman, G., . . . Stonebraker, M. (2002). Monitoring streams—a new class of data management applications. Paper presented at the VLDB’02: Proceedings of the 28th International Conference on Very Large Databases.

Chandarana, P., & Vijayalakshmi, M. (2014, 4-5 April 2014). Big Data analytics frameworks. Paper presented at the Circuits, Systems, Communication and Information Technology Applications (CSCITA), 2014 International Conference on.

Chen, R., Weng, X., He, B., & Yang, M. (2010). Large graph processing in the cloud. Paper presented at the Proceedings of the 2010 ACM SIGMOD International Conference on Management of data.

Chen, R., Yang, M., Weng, X., Choi, B., He, B., & Li, X. (2012). Improving large graph processing on partitioned graphs in the cloud. Paper presented at the Proceedings of the Third ACM Symposium on Cloud Computing.

Cherniack, M., Balakrishnan, H., Balazinska, M., Carney, D., Cetintemel, U., Xing, Y., & Zdonik, S. B. (2003). Scalable Distributed Stream Processing.

Cui, B., Mei, H., & Ooi, B. C. (2014). Big data: the driver for innovation in databases. National Science Review, 1(1), 27-30.

Erl, T., Khattak, W., & Buhler, P. (2016). Big Data Fundamentals: Concepts, Drivers & Techniques: Prentice Hall Press.

Fernández, A., del Río, S., López, V., Bawakid, A., del Jesus, M. J., Benítez, J. M., & Herrera, F. (2014). Big Data with Cloud Computing: an insight on the computing environment, MapReduce, and programming frameworks. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 4(5), 380-409.

Gonzalez, J. E., Xin, R. S., Dave, A., Crankshaw, D., Franklin, M. J., & Stoica, I. (2014). GraphX: Graph Processing in a Distributed Dataflow Framework.

Gradvohl, A. L. S., Senger, H., Arantes, L., & Sens, P. (2014). Comparing distributed online stream processing systems considering fault tolerance issues. Journal of Emerging Technologies in Web Intelligence, 6(2), 174-179.

Grolinger, K., Hayes, M., Higashino, W. A., L’Heureux, A., Allison, D. S., & Capretz, M. A. (2014). Challenges for mapreduce in big data. Paper presented at the Services (SERVICES), 2014 IEEE World Congress on.

Gupta, R., Gupta, H., & Mohania, M. (2012). Cloud computing and big data analytics: what is new from databases perspective? Paper presented at the International Conference on Big Data Analytics.

Hu, H., Wen, Y., Chua, T.-S., & Li, X. (2014). Toward scalable systems for big data analytics: A technology tutorial. IEEE Access, 2, 652-687.

Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Byers, A. H. (2011). Big data: The next frontier for innovation, competition, and productivity.

Neumeyer, L., Robbins, B., Nair, A., & Kesari, A. (2010). S4: Distributed stream computing platform. Paper presented at the 2010 IEEE International Conference on Data Mining Workshops.

Sakr, S., & Gaber, M. (2014). Large Scale and big data: Processing and Management: CRC Press.

Schwarzkopf, M., Murray, D. G., & Hand, S. (2012). The seven deadly sins of cloud computing research. Paper presented at the 4th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud '12).

Scott, J. A. (2015). Getting Started with Spark: MapR Technologies, Inc.

Spiess, J., T’Joens, Y., Dragnea, R., Spencer, P., & Philippart, L. (2014). Using Big Data to Improve Customer Experience and Business Performance. Bell Labs Tech. J., 18: 13–17. doi:10.1002/bltj.21642

Zhang, Y., Chen, S., Wang, Q., & Yu, G. (2015). i^2MapReduce: Incremental MapReduce for Mining Evolving Big Data. IEEE transactions on knowledge and data engineering, 27(7), 1906-1919.

Zhang, Y., Gao, Q., Gao, L., & Wang, C. (2012). imapreduce: A distributed computing framework for iterative computation. Journal of Grid Computing, 10(1), 47-68.

Extract Knowledge from a Large-Scale Dataset

Dr. Aly, O.
Computer Science

Introduction

The purpose of this discussion is to analyze one research work which involves knowledge extraction from a large-scale dataset with specific real-world semantics by applying machine learning.  The discussion begins with an overview of Linked Open Data, Semantic Web Applications, and Machine Learning.  The discussion also addresses Knowledge Extraction from Linked Data, its eight-phase lifecycle, and examples of Linked Data-driven applications.  The discussion focuses on DBpedia as an example of such applications.

Linked Open Data, Semantic Web Application, and Machine Learning

The term Linked Data, as indicated in (Ngomo, Auer, Lehmann, & Zaveri, 2014), refers to a set of best practices for publishing and interlinking structured data on the Web.  The Linked Open Data (LOD) is regarded as the next-generation approach recommended for data on the web (Sakr & Gaber, 2014).  The LOD follows the principles which were designated by Tim Berners-Lee to publish and connect data on the web (Bizer, Heath, & Berners-Lee, 2011; Sakr & Gaber, 2014).  These Linked Data principles include the use of URIs as names for things, HTTP URIs, RDF, SPARQL, and the linking of RDF to other URIs (Bizer et al., 2011; Ngomo et al., 2014; Sakr & Gaber, 2014).  The LOD cloud, as indicated in (Sakr & Gaber, 2014), covers more than an estimated fifty billion facts from various domains like geography, media, biology, chemistry, economy, energy, and so forth.  The Semantic Web offers cost-effective techniques to publish data in a distributed environment (Sakr & Gaber, 2014).  The Semantic Web technologies are used to publish structured data on the web and to set links between data from one data source and data within other data sources (Bizer et al., 2011).  The LOD implements the Semantic Web concepts where organizations can upload their data on the web for open use of their information (Sakr & Gaber, 2014).  The underlying technologies behind the LOD involve the Uniform Resource Identifier (URI), the HyperText Transfer Protocol (HTTP), and the Resource Description Framework (RDF).  The properties of the LOD include that anyone can publish any data on the Web of Data, that data publishers are not constrained in their choice of vocabularies to represent data, and that entities are connected by RDF links (Bizer et al., 2011).  The LOD applications include Linked Data Browsers, Search Engines, and Domain-Specific Applications (Bizer et al., 2011).  Examples of the Linked Data Browsers are the Tabulator Browser (MIT, USA), Marbles (FU Berlin, DE), the OpenLink RDF Browser (OpenLink, UK), the Zitgist RDF Browser (Zitgist, USA), Humboldt (HP Labs, UK), the Disco Hyperdata Browser (FU Berlin, DE), and Fenfire (DERI, Ireland) (Bizer et al., 2011).  The Search Engines include two types: (1) Human-Oriented Search Engines, and (2) Application-Oriented Indexes.  Examples of the Human-Oriented Search Engines are Falcons (IWS, China) and “Sig.ma” (DERI, Ireland).  Examples of the Application-Oriented Indexes include Swoogle (UMBC, USA), VisiNav (DERI, Ireland), and Watson (Open University, UK).  Examples of the Domain-Specific Applications include “mashing up” data from various Linked Data sources, such as Revyu, DBpedia Mobile, Talis Aspire, BBC Programs and Music, and DERI Pipes (Bizer et al., 2011).  There are various challenges for the LOD applications, such as User Interfaces and Interaction Paradigms; Application Architectures; Schema Mapping and Data Fusion; Link Maintenance; Licensing; Trust, Quality, and Relevance; and Privacy (Bizer et al., 2011).

Machine Learning is described by (Holzinger, 2017) as the fastest growing technical field.  The primary goal of Machine Learning (ML) is to develop software which can learn from previous experience, so that a usable level of intelligence can be reached.  ML is categorized into “supervised” and “unsupervised” learning techniques depending on whether the output values are required to be present in the training data (Baştanlar & Özuysal, 2014).  The “unsupervised” learning technique requires only the input feature values in the training data, and the learning algorithm discovers hidden structure in the training data based on them (Baştanlar & Özuysal, 2014).  The “Clustering” techniques which partition the data into coherent groups fall into this category of “unsupervised” learning (Baştanlar & Özuysal, 2014).  The “unsupervised” techniques are used in bioinformatics for problems such as microarray and gene expression analysis (Baştanlar & Özuysal, 2014).  Market segment analysis, grouping people according to their social behavior, and the categorization of articles according to their topics are common tasks which involve clustering and “unsupervised” learning (Baştanlar & Özuysal, 2014).  Typical clustering algorithms are K-means, Hierarchical Clustering, and Spectral Clustering (Baştanlar & Özuysal, 2014).  The “supervised” learning techniques require the value of the output variable for each training sample to be known (Baştanlar & Özuysal, 2014).
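As a minimal sketch of unsupervised clustering in R, the built-in iris measurements can be partitioned with the base kmeans() function; the choice of three clusters is illustrative only:

# Minimal k-means sketch in base R, illustrating unsupervised clustering
set.seed(42)
x  <- iris[, 1:4]                        # four numeric measurements, labels withheld
km <- kmeans(scale(x), centers = 3)      # partition the observations into three groups
table(km$cluster, iris$Species)          # compare the discovered clusters with the species

Because the species labels are withheld during clustering, the final comparison only serves to show how well the discovered groups line up with a known structure.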

To improve the structure, semantic richness, and quality of Linked Data, the existing ML algorithms need to be extended from basic Description Logics such as ALC to expressive ones such as SROIQ(D), which serves as the basis of OWL 2 (Auer, 2010).  The algorithms need to be optimized for processing very large-scale knowledge bases (Auer, 2010).  Moreover, tools and algorithms must be developed for user-friendly knowledge representation, maintenance, and repair which enable detecting and fixing inconsistencies and modeling errors (Auer, 2010).

Knowledge Extraction from Linked Data

There is a difference between Data Mining (DM) and Knowledge Discovery (KD) (Holzinger & Jurisica, 2014).  Data Mining is described in (Holzinger & Jurisica, 2014) as “methods, algorithms, and tools to extract patterns from data by combining methods from computational statistics and machine learning: Data mining is about solving problems by analyzing data present in databases” (Holzinger & Jurisica, 2014).  Knowledge Discovery, on the other hand, is described in (Holzinger & Jurisica, 2014) as “Exploratory analysis and modeling of data and the organized process of identifying valid, novel, useful and understandable patterns from these data sets.”  However, some researchers argue that there is no difference between DM and KD.  In (Holzinger & Jurisica, 2014), Knowledge Discovery and Data Mining (KDD) are of equal importance and necessary in combination.

Linked Data has a lifecycle with eight phases: (1) Extraction, (2) Storage and Querying, (3) Authoring, (4) Linking, (5) Enrichment, (6) Quality Analysis, (7) Evolution and Repair, and (8) Search, Browsing, and Exploration (Auer, 2010; Ngomo et al., 2014).  The information which is represented in unstructured or semi-structured forms must first be mapped to the RDF data model, which reflects the first phase, Extraction (Auer, 2010; Ngomo et al., 2014).  When there is a critical mass of RDF data, techniques must be implemented to store, index, and query this RDF data efficiently, which reflects the second phase, Storage and Querying.  New structured information, or corrections and extensions to the existing information, can be implemented in the third phase, Authoring (Auer, 2010; Ngomo et al., 2014).  If information about the same or related entities is published by different publishers, links between those different information assets have to be established in the Linking phase.  After the Linking phase, there is still a lack of classification, structure, and schema information, which can be tackled in the Enrichment phase by enriching the data with higher-level structure in order to aggregate and query the data more efficiently.  The Data Web, similar to the Document Web, can contain a variety of information of different quality; hence, strategies for the quality evaluation and assessment of the published data must be established, which is implemented in the Quality Analysis phase.  When a quality problem is detected, strategies to repair the problem and to support the evolution of Linked Data must be implemented in the Evolution and Repair phase.  The last phase of the Linked Data lifecycle is for users to browse, search, and explore the structured information available on the Data Web quickly and in a user-friendly fashion (Auer, 2010; Ngomo et al., 2014).

The Linked Data-driven applications are categorized into four categories: (1) Content reuse applications such as the BBC's Music store, which reuses metadata from DBpedia and MusicBrainz, (2) Semantic Tagging and Rating applications such as Faviki, which employs unambiguous identifiers from DBpedia, (3) Integrated Question-Answering Systems such as DBpedia Mobile, with the ability to indicate locations from the DBpedia dataset in the user's vicinity, and (4) Event Data Management Systems such as Virtuoso's ODS-Calendar, with the ability to organize events, tasks, and notes (Konstantinou, Spanos, Stavrou, & Mitrou, 2010).

Integrative and interactive Machine Learning is the future for Knowledge Discovery and Data Mining.  This concept is demonstrated by (Sakr & Gaber, 2014) using the environmental domain, and in (Holzinger & Jurisica, 2014) using the biomedical domain.  Other applications of LOD include DBpedia, which reflects the Linked Data version of Wikipedia, and the BBC's platforms for the World Cup 2010 and the 2012 Olympic Games (Kaoudi & Manolescu, 2015).  An RDF dataset can involve a large volume of data; for example, data.gov contains more than five billion triples, while the latest version of DBpedia corresponds to more than 2 billion triples (Kaoudi & Manolescu, 2015).  The Linked Cancer Genome Atlas dataset consists of 7.36 billion triples and is estimated to reach 20 billion, as cited in (Kaoudi & Manolescu, 2015).  Such triple atoms have been reported to be frequent in real-world SPARQL queries, reaching about 78% of the DBpedia query log and 99.5% of the Semantic Web Dog Food query log, as cited in (Kaoudi & Manolescu, 2015).  The focus of this discussion is on DBpedia to extract knowledge from a large-scale dataset with real-world semantics by applying machine learning.

DBpedia

DBpedia is a community project that extracts structured, multilingual knowledge from Wikipedia and makes it available on the Web using Semantic Web and Linked Data technologies (Lehmann et al., 2015; Morsey, Lehmann, Auer, Stadler, & Hellmann, 2012).  DBpedia is interlinked with several external datasets following the Linked Data principles (Lehmann et al., 2015).  It builds a large-scale, multilingual knowledge base by extracting structured data from Wikipedia editions (Lehmann et al., 2015; Morsey et al., 2012) in 111 languages (Lehmann et al., 2015).  Several tools have been developed on top of the DBpedia knowledge base, such as DBpedia Mobile, Query Builder, Relation Finder, and Navigator (Morsey et al., 2012).  DBpedia is also used in several commercial applications such as Muddy Boots, Open Calais, Faviki, Zemanta, LODr, and TopBraid Composer (Bizer et al., 2009; Morsey et al., 2012).

The technical framework of DBpedia extraction involves four phases: Input, Parsing, Extraction, and Output (Lehmann et al., 2015).  In the Input phase, Wikipedia pages are read from an external source, either a Wikipedia dump or the MediaWiki API (Lehmann et al., 2015).  In the Parsing phase, each Wikipedia page is parsed by a wiki parser, which transforms the source code of the page into an Abstract Syntax Tree (Lehmann et al., 2015).  In the Extraction phase, the Abstract Syntax Tree of each Wikipedia page is forwarded to the extractors, which extract information such as labels, abstracts, or geographical coordinates (Lehmann et al., 2015).  Each extractor consumes an Abstract Syntax Tree and yields a set of RDF statements (Lehmann et al., 2015).  In the Output phase, the collected RDF statements are written to a sink supporting different formats such as N-Triples (Lehmann et al., 2015).  The DBpedia extraction framework uses various extractors to translate different parts of Wikipedia pages into RDF statements; these extractors fall into four categories: Mapping-Based Infobox Extraction, Raw Infobox Extraction, Feature Extraction, and Statistical Extraction (Lehmann et al., 2015).

The DBpedia extraction framework supports two workflows: Dump-based Extraction and Live Extraction.  The Dump-based Extraction workflow employs the Database Wikipedia page collection as the source of article texts and the N-Triples serializer as the output destination (Bizer et al., 2009).  The resulting knowledge base is made available as Linked Data, for download, and via DBpedia’s main SPARQL endpoint (Bizer et al., 2009).  The Live Extraction workflow employs the Wikipedia update stream to extract new RDF whenever an article is changed (Bizer et al., 2009).  The Live Wikipedia page collection accesses the article text to obtain the current version of the article, encoded according to the OAI-PMH protocol (The Open Archives Initiative Protocol for Metadata Harvesting) (Bizer et al., 2009).  The SPARQL-Update destination removes the existing information and inserts the new triples into a separate triple store (Bizer et al., 2009).  DBpedia reflects the latest Wikipedia update within one to two minutes (Bizer et al., 2009); the update stream can become a performance bottleneck because changes need more than one minute to arrive from Wikipedia (Bizer et al., 2009).
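Because the dump-based knowledge base is exposed through DBpedia’s public SPARQL endpoint, a small query illustrates how applications consume it.  The sketch below is a minimal example in Python, assuming the SPARQLWrapper library and the public endpoint at https://dbpedia.org/sparql; the dbo:Disease class used in the query is an assumption about the current DBpedia ontology release.

# A minimal sketch of querying DBpedia's public SPARQL endpoint with SPARQLWrapper.
# The dbo:Disease class is assumed to be present in the current DBpedia release.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setReturnFormat(JSON)
sparql.setQuery("""
    PREFIX dbo:  <http://dbpedia.org/ontology/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?disease ?label
    WHERE {
        ?disease a dbo:Disease ;
                 rdfs:label ?label .
        FILTER (lang(?label) = "en")
    }
    LIMIT 10
""")

results = sparql.query().convert()
for binding in results["results"]["bindings"]:
    # Each binding maps variable names to {'type': ..., 'value': ...} dicts.
    print(binding["disease"]["value"], "-", binding["label"]["value"])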

References

Auer, S. (2010). Towards creating knowledge out of interlinked data.

Baştanlar, Y., & Özuysal, M. (2014). Introduction to machine learning miRNomics: MicroRNA Biology and Computational Analysis (pp. 105-128): Springer.

Bizer, C., Heath, T., & Berners-Lee, T. (2011). Linked data: The story so far.

Bizer, C., Lehmann, J., Kobilarov, G., Auer, S., Becker, C., Cyganiak, R., & Hellmann, S. (2009). DBpedia: A crystallization point for the Web of Data. Web Semantics: Science, Services and Agents on the World Wide Web, 7(3), 154-165.

Holzinger, A. (2017). Introduction To MAchine Learning & Knowledge Extraction (MAKE). Machine Learning & Knowledge Extraction.

Holzinger, A., & Jurisica, I. (2014). Knowledge discovery and data mining in biomedical informatics: The future is in integrative, interactive machine learning solutions Interactive knowledge discovery and data mining in biomedical informatics (pp. 1-18): Springer.

Kaoudi, Z., & Manolescu, I. (2015). RDF in the clouds: a survey. The VLDB Journal, 24(1), 67-91.

Konstantinou, N., Spanos, D.-E., Stavrou, P., & Mitrou, N. (2010). Technically approaching the semantic web bottleneck. International Journal of Web Engineering and Technology, 6(1), 83-111.

Lehmann, J., Isele, R., Jakob, M., Jentzsch, A., Kontokostas, D., Mendes, P. N., . . . Auer, S. (2015). DBpedia–a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web, 6(2), 167-195.

Morsey, M., Lehmann, J., Auer, S., Stadler, C., & Hellmann, S. (2012). DBpedia and the live extraction of structured data from Wikipedia. Program, 46(2), 157-181.

Ngomo, A.-C. N., Auer, S., Lehmann, J., & Zaveri, A. (2014). Introduction to linked data and its lifecycle on the web. Paper presented at the Reasoning Web International Summer School.

Sakr, S., & Gaber, M. (2014). Large scale and big data: Processing and management. CRC Press.

Proactive Model

Dr. Aly, O.
Computer Science

The purpose of this blog is to hold a brainstorming session about the “Proactive Model.”  Computational intelligence and machine learning techniques have gained popularity in many domains.  The Internet of Things and the Internet of People are terms that reflect the increasing interaction between humans and machines.  The Internet of Things (IoT) is regarded as “one of the most promising fuels of Big Data expansion” (De Mauro, Greco, & Grimaldi, 2015).  The Internet of Things is the core component of Web 4.0.  The Web has evolved from its first generation, Web 1.0, which consisted of static, read-only web pages that broadcast information.  Web 1.0 was introduced by Berners-Lee (Aghaei, Nematbakhsh, & Farsani, 2012; Choudhury, 2014; Kambil, 2008; Patel, 2013) and is known as the “Web of Information Connections” (Aghaei et al., 2012).  Web 2.0, which emerged in 2004, is read-write and is known as the “Web of People Connections” (Aghaei et al., 2012) because it connects people.  Web 3.0, which emerged in 2006, is known as the “Semantic Web” or the “Web of Knowledge Connections” for sharing knowledge, followed by Web 4.0, known as the “Web of Intelligence Connections,” where Artificial Intelligence (AI) is expected to play a central role.  As indicated in the TED talk by (Hougland, 2014), current technology can help save lives during unexpected health events such as a heart attack or stroke through a wearable wristband.  There are also tools that protect elderly people who live alone by calling for help when they fall.  These are reactive tools that assist after the fact.  The question is:

“Can the “Web of Intelligence Connections” be intelligent enough to be proactive and provide us with useful information on a daily basis?”

As a computer science researcher who started with Web 1.0 and experienced the amazing evolution of the Web, I believe that our children will have better opportunities and better health because of the “Proactive Model.”  They will have far more advanced tools that communicate with them daily about what to eat, what to drink, when to exercise, and, in general, what to do.  For instance, a tool based on the “Proactive Model” would monitor glucose, cholesterol, potassium, and other levels daily so that it can intelligently tell the user what the body lacks and what is needed to fill that gap.  If the person has low potassium, the tool can suggest eating foods such as bananas to restore the potassium level.  If the person has a high cholesterol level, the tool can inform the person that this can damage the heart and provide recommendations to lower the cholesterol before it worsens and leads to a heart attack.  This “Proactive Model” will become embedded in our children’s future lives and be part of their daily routine.
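As a purely hypothetical illustration of this idea, the sketch below encodes the kind of threshold-based recommendation logic described above in Python; the measures, reference ranges, and suggestions are brainstorming placeholders, not medical guidance and not an actual implementation of the “Proactive Model.”

# A hypothetical sketch of threshold-based proactive recommendations.
# All values and suggestions are illustrative placeholders, not medical advice.
RULES = [
    # (measure, lower bound, upper bound, recommendation when out of range)
    ("potassium_mmol_l", 3.5, 5.0, "Consider potassium-rich foods such as bananas."),
    ("ldl_cholesterol_mg_dl", 0.0, 100.0, "High LDL cholesterol: review diet and consult a doctor."),
    ("glucose_mg_dl", 70.0, 100.0, "Glucose out of range: monitor intake and consult a doctor."),
]

def daily_recommendations(readings: dict) -> list:
    """Return proactive suggestions for readings that fall outside their range."""
    suggestions = []
    for measure, low, high, advice in RULES:
        value = readings.get(measure)
        if value is not None and not (low <= value <= high):
            suggestions.append(f"{measure} = {value}: {advice}")
    return suggestions

# Example: one day's hypothetical sensor readings.
today = {"potassium_mmol_l": 3.1, "ldl_cholesterol_mg_dl": 135.0, "glucose_mg_dl": 92.0}
for suggestion in daily_recommendations(today):
    print(suggestion)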

            The healthcare system may raise questions about its role in this model and about the model’s impact on medical practice.  The healthcare system should drive this model.  Doctors will play a role in these tools, since the recommendations will be based on medical practice.  The recommendations are not arbitrary; they must be based on doctors’ guidance, just as when a person visits the doctor in person.  In turn, doctors’ practice will focus more on serious conditions that cannot be proactively controlled, such as car accidents or other unexpected events.

            Do you think it is possible to build such an intelligent and sophisticated “Proactive Model”?  If so, how do you envision the model, and what obstacles do you think it will face?

References

Aghaei, S., Nematbakhsh, M. A., & Farsani, H. K. (2012). Evolution of the world wide web: From Web 1.0 to Web 4.0. International Journal of Web & Semantic Technology, 3(1), 1.

Choudhury, N. (2014). World Wide Web and its journey from web 1.0 to web 4.0.

De Mauro, A., Greco, M., & Grimaldi, M. (2015). What is big data? A consensual definition and a review of key research topics. Paper presented at the AIP Conference Proceedings.

Hougland, B. (2014). What is the Internet of Things? And why should you care?  [Video file]. TED Talks: Retrieved from https://www.youtube.com/watch?v=_AlcRoqS65E.

Kambil, A. (2008). What is your Web 5.0 strategy? Journal of business strategy, 29(6), 56-58.

Patel, K. (2013). The incremental journey for World Wide Web: introduced with Web 1.0 to recent Web 5.0–a survey paper.