Big Data Analytics Framework and Relevant Tools Used in Healthcare Data Analytics.

Dr. Aly, O.
Computer Science

Introduction

This discussion examines the Big Data Analytics framework and the relevant tools used in healthcare data analytics.  It also provides examples of how healthcare organizations can implement such a framework.

Healthcare can benefit from Big Data Analytics in various domains, such as decreasing overhead costs, diagnosing and curing diseases, increasing profit, predicting epidemics, and improving the quality of human life (Dezyre, 2016).  Healthcare organizations have been generating very large volumes of data, driven mostly by regulatory requirements, record keeping, compliance, and patient care.  McKinsey projects that Big Data Analytics in healthcare can decrease the costs associated with data management by $300-$500 billion.  Healthcare data includes electronic health records (EHR), clinical reports, prescriptions, diagnostic reports, medical images, pharmacy data, insurance information such as claims and billing, social media data, and medical journals (Eswari, Sampath, & Lavanya, 2015; Ward, Marsolo, & Froehle, 2014).

Various healthcare organizations, such as scientific research labs, hospitals, and other medical organizations, are leveraging Big Data Analytics to reduce healthcare costs by modifying treatment delivery models.  Several Big Data Analytics technologies have been applied in the healthcare industry.  For instance, Hadoop has been used in healthcare analytics in various domains, including cancer treatments and genomics, monitoring of patient vitals, hospital networks, healthcare intelligence, and fraud prevention and detection (Dezyre, 2016).  This discussion is therefore limited to Hadoop technology in healthcare.  It begins with the types of analytics and their potential benefits in healthcare, followed by the main discussion of a Hadoop framework for diabetes, including its major components, the Hadoop Distributed File System (HDFS) and Map/Reduce.

Types of Analytics

There are four major types of analytics:  Descriptive Analytics, Predictive Analytics, Prescriptive Analytics (Apurva, Ranakoti, Yadav, Tomer, & Roy, 2017; Davenport & Dyché, 2013; Mohammed, Far, & Naugler, 2014), and Diagnostic Analytics (Apurva et al., 2017).  Descriptive Analytics summarizes historical data to provide useful information.  Predictive Analytics predicts future events based on previous behavior, using data mining techniques and modeling.  Prescriptive Analytics supports the use of various data model scenarios, such as multi-variable simulation and the detection of hidden relationships between variables.  It is useful for finding an optimum solution and the best course of action algorithmically.  Prescriptive Analytics, as indicated in (Mohammed et al., 2014), is less used in the clinical field.  Diagnostic Analytics is described as an advanced type of analytics that gets to the cause of a problem using drill-down techniques and data discovery.

Hadoop Framework for Diabetes

Eswari et al. (2015) applied a predictive analysis algorithm in a Hadoop/MapReduce environment to predict the prevalent diabetes types, the complications associated with each type, and the required treatment type.  The analysis was performed on Indian patients.  According to the World Health Organization, as cited in (Eswari et al., 2015), the probability of dying between the ages of 30 and 70 from the four major Non-Communicable Diseases (NCDs), namely diabetes, cancer, stroke, and respiratory disease, is 26%.  In 2014, 60% of all deaths in India were caused by NCDs.  Moreover, according to the Global Status Report, as cited in (Eswari et al., 2015), NCDs are projected to claim 52 million lives globally by 2030.

The architecture for the predictive analysis included four phases:  Data Collection, Data Warehousing, Predictive Analysis, and Processing Analyzed Reports.  Figure 1 illustrates the framework used for the Predictive Analysis System-Healthcare application, adapted from (Eswari et al., 2015).

Figure 1.  Predictive Analysis Framework for Healthcare. Adapted from (Eswari et al., 2015).

Phase 1:  In the Data Collection phase, raw diabetic data was loaded into the system.  The data was unstructured and came from EHR, patient health records (PHR), clinical systems, and external sources such as government, labs, pharmacies, and insurance.  The data arrived in different formats, such as .csv files, tables, and text.  The data collected from these sources in the first phase was stored in data warehouses.

Phase 2:  During the second phase, Data Warehousing, the data was cleansed and loaded so that it was ready for further processing.
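
The cleansing step in Phase 2 can be illustrated with a minimal Python sketch.  The field names, the sample rows, and the cleansing rules (dropping duplicates and rows with missing or non-numeric values) are assumptions made for illustration; they are not taken from (Eswari et al., 2015).

```python
import csv
import io

# Hypothetical raw extract: field names and values are illustrative only.
RAW_CSV = """patient_id,age,glucose,bmi
p1,45,148,33.6
p2,,85,26.6
p1,45,148,33.6
p3,52,abc,31.0
p4,61,183,23.3
"""


def cleanse(raw_text):
    """Drop duplicate rows and rows with missing or non-numeric values."""
    seen = set()
    clean = []
    for row in csv.DictReader(io.StringIO(raw_text)):
        try:
            record = {
                "patient_id": row["patient_id"].strip(),
                "age": int(row["age"]),
                "glucose": float(row["glucose"]),
                "bmi": float(row["bmi"]),
            }
        except (ValueError, TypeError, AttributeError):
            continue  # missing or malformed field: skip the row
        if not record["patient_id"] or record["patient_id"] in seen:
            continue  # blank or duplicate identifier: skip the row
        seen.add(record["patient_id"])
        clean.append(record)
    return clean


if __name__ == "__main__":
    for rec in cleanse(RAW_CSV):
        print(rec)
```

In this sketch only the rows for p1 and p4 survive; a real warehouse load would more likely repair or flag bad rows than silently drop them.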

Phase 3:  The third phase involved the Predictive Analysis, which used the predictive algorithm in the Hadoop Map/Reduce environment to predict and classify the type of diabetes mellitus (DM), the complications associated with each type, and the treatment type to be provided.  The Hadoop framework was used in this analysis because it can process extremely large amounts of health data by allocating partitioned data sets to numerous servers.  Hadoop used Map/Reduce to solve different parts of the larger problem and integrate them into the final result.  Moreover, Hadoop used the Hadoop Distributed File System (HDFS) for distributed storage.  The Predictive Analysis phase involved Pattern Discovery and Predictive Pattern Matching.
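
The idea of allocating partitioned data to several servers and integrating the partial results can be sketched in plain Python, without Hadoop itself.  The readings and the helper names (`partition`, `map_partial`, `combine`, `distributed_mean`) are hypothetical; the sketch only shows how partial sums computed on separate chunks are merged into one final mean.

```python
from functools import reduce

# Hypothetical glucose readings (mg/dL); values are illustrative only.
READINGS = [148, 85, 183, 89, 137, 116, 78, 115, 197, 125]


def partition(data, n_parts):
    """Split data into n_parts chunks, one per simulated server."""
    return [data[i::n_parts] for i in range(n_parts)]


def map_partial(chunk):
    """Map step: each simulated server returns a partial (sum, count)."""
    return (sum(chunk), len(chunk))


def combine(a, b):
    """Reduce step: merge two partial (sum, count) results."""
    return (a[0] + b[0], a[1] + b[1])


def distributed_mean(data, n_parts=3):
    """Compute the mean by merging partial results from every chunk."""
    partials = [map_partial(chunk) for chunk in partition(data, n_parts)]
    total, count = reduce(combine, partials, (0, 0))
    return total / count


if __name__ == "__main__":
    print(distributed_mean(READINGS))
```

The result does not depend on the number of chunks, which is the property that lets the work be spread over many servers.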

With respect to Pattern Discovery, it was important for DM to test patterns such as plasma glucose concentration, serum insulin, diastolic blood pressure, diabetes pedigree, Body Mass Index (BMI), age, and number of times pregnant.  The Pattern Discovery process included association rule mining between the diabetic type and other information such as lab results.  It also included clustering to group similar patterns.  The classification step classified patients' risk based on their health condition.  Statistics were used to analyze the results of the Pattern Discovery.  The last step in the Pattern Discovery involved the application.  The process of the Pattern Discovery of the Predictive Analysis phase is illustrated in Figure 2.

Figure 2.  Pattern Discovery of the Predictive Analysis.
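
The association rule mining step of Pattern Discovery relies on two standard measures, support and confidence.  The following Python sketch computes both for a rule such as "high BMI implies type 2".  The records and item labels are invented for illustration and do not come from the cited study.

```python
# Hypothetical patient records as sets of observed items (illustrative only).
RECORDS = [
    {"type2", "high_glucose", "high_bmi"},
    {"type2", "high_glucose"},
    {"type1", "high_glucose"},
    {"type2", "high_bmi"},
    {"type1", "low_insulin", "high_glucose"},
]


def support(records, items):
    """Fraction of records that contain every item in `items`."""
    items = set(items)
    return sum(items <= record for record in records) / len(records)


def confidence(records, antecedent, consequent):
    """Estimate of P(consequent | antecedent) from the records."""
    base = support(records, antecedent)
    if base == 0:
        return 0.0  # the antecedent never occurs, so the rule has no evidence
    return support(records, set(antecedent) | set(consequent)) / base


if __name__ == "__main__":
    print(support(RECORDS, {"high_bmi", "type2"}))
    print(confidence(RECORDS, {"high_bmi"}, {"type2"}))
    print(confidence(RECORDS, {"high_glucose"}, {"type2"}))
```

A rule is usually kept only if both measures exceed chosen minimums; those minimums would be a tuning decision for the analyst.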

With respect to Predictive Pattern Matching, the Map/Reduce operation was performed whenever the warehoused dataset was sent to the Hadoop system.  Pattern matching is the process of comparing the analyzed threshold value with the obtained value.  The Map phase split the large dataset into small tasks for the Worker/Slave Nodes (WN).  As illustrated in Figure 3, the Master Node (MN) consisted of the Name Node (NN) and the Job Tracker (JT), which used the Map/Reduce technique.  The MN sent the order to a WN, which processed the pattern matching task for the diabetes data with the help of the Data Node (DN) and Task Tracker (TT) residing on the same machine as the WN.  When a WN completed the pattern matching as required, the result was stored on the intermediate disk, a step known as a local write.  When the MN initiated the reduce task, all other allocated Worker Nodes read the processed data from the intermediate disks.  The reduce task was performed on the WN based on the query the Client sent to the MN.  The results of the reduce phase were distributed across various servers in the cluster.

Figure 3.  Pattern Matching System Using Map/Reduce. Adapted from (Eswari et al., 2015).
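
The flow in Figure 3 can be approximated in a single Python process.  This is a simplified sketch, not the system built by (Eswari et al., 2015): the threshold value, the region keys, and the function names are assumptions, and the in-memory lists stand in for the worker nodes and their intermediate disks.

```python
from collections import defaultdict

# Hypothetical threshold and records; neither comes from the cited study.
GLUCOSE_THRESHOLD = 126  # illustrative cutoff in mg/dL

# Each inner list stands in for the data split given to one worker node.
SPLITS = [
    [("north", 148), ("south", 85), ("north", 110)],
    [("south", 183), ("north", 137)],
    [("south", 99), ("south", 130)],
]


def map_task(split):
    """Map: emit (region, 1) for each reading at or above the threshold."""
    return [(region, 1) for region, glucose in split
            if glucose >= GLUCOSE_THRESHOLD]


def shuffle(intermediate_outputs):
    """Group intermediate pairs by key, like reading the local writes."""
    grouped = defaultdict(list)
    for output in intermediate_outputs:
        for key, value in output:
            grouped[key].append(value)
    return grouped


def reduce_task(grouped):
    """Reduce: total number of threshold matches per key."""
    return {key: sum(values) for key, values in grouped.items()}


def run_job(splits):
    """Run one map task per split, then shuffle and reduce the results."""
    intermediate = [map_task(split) for split in splits]
    return reduce_task(shuffle(intermediate))


if __name__ == "__main__":
    print(run_job(SPLITS))
```

Here the comparison against the threshold happens entirely in the map tasks, so each worker needs only its own split; the reduce step sees only the small intermediate pairs.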

Phase 4:  In this phase, the analyzed reports were processed, distributed to various servers in the cluster, and replicated across several nodes depending on the geographical area.  Using proper electronic communication technology to exchange patient information among healthcare centers can help patients in remote locations obtain proper treatment at the right time and at low cost.

The implementation of the Hadoop framework helped transform the various health records of diabetic patients into useful analyzed results that help patients understand the complications associated with their type of diabetes.

References

Apurva, A., Ranakoti, P., Yadav, S., Tomer, S., & Roy, N. R. (2017, October 12-14). Redefining cyber security with big data analytics. Paper presented at the 2017 International Conference on Computing and Communication Technologies for Smart Nation (IC3TSN).

Davenport, T. H., & Dyché, J. (2013). Big data in big companies. International Institute for Analytics.

Dezyre. (2016). 5 Healthcare Applications of Hadoop and Big Data. Retrieved from https://www.dezyre.com/article/5-healthcare-applications-of-hadoop-and-big-data/85

Eswari, T., Sampath, P., & Lavanya, S. (2015). Predictive methodology for diabetic data analysis in big data. Procedia Computer Science, 50, 203-208.

Mohammed, E. A., Far, B. H., & Naugler, C. (2014). Applications of the MapReduce Programming Framework to Clinical Big Data Analysis: Current Landscape and Future Trends. BioData Mining, 7(1), 1.

Ward, M. J., Marsolo, K. A., & Froehle, C. M. (2014). Applications of business analytics in healthcare. Business Horizons, 57(5), 571-582.