Data Privacy and Governance in Healthcare Industry

Dr. Aly, O.
Computer Science

Introduction

The purpose of this discussion is to analyze data privacy and governance as they relate to the healthcare industry.  The discussion also addresses privacy violations, approaches to avoid them, and the best practices which the healthcare industry could adopt to avoid instances of data breach and privacy violation.

Data Privacy and Governance

The privacy protections of users’ information depend on the capabilities of the technology adopted to extract, analyze, and correlate potentially sensitive datasets.  In the age of Big Data and the emerging technologies built to handle it, protecting privacy is becoming more challenging than before.  The development of Big Data tools requires security measures and safeguards to protect the privacy of users, and of patients in the healthcare industry in particular.  Moreover, the data used for analytical purposes may include regulated information or intellectual property.  Thus, Big Data professionals must comply with the relevant regulations to ensure appropriate data use and protection (CSA, 2013).

With respect to privacy, various legislative and regulatory compliance issues arise.  Many US regulations include privacy requirements, such as the Health Insurance Portability and Accountability Act (HIPAA), the Sarbanes-Oxley Act of 2002 (SOX), and the Gramm-Leach-Bliley Act.  When privacy is violated, the affected individuals and organizations must be notified; otherwise, legal ramifications follow.  Privacy issues must be addressed when allowing or restricting personal use of email, retaining email, recording phone conversations, gathering information about surfing or spending habits, and so forth (Stewart, Chapple, & Gibson, 2015).

Data Breach and Privacy Violation in Healthcare

According to a recent report published by the HIPAA Journal, the first three months of 2018 saw 77 healthcare data breaches reported to the Department of Health and Human Services’ Office for Civil Rights (OCR).  The report added that the impact of these breaches was significant, as more than one million patients and health plan members were affected, almost twice the number of individuals impacted by healthcare data breaches in Q4 of 2017.  Figure 1 illustrates this increasing trend in healthcare data breaches (HIPAA, 2018).

Figure 1:  Q1, 2018 Healthcare Data Breaches (HIPAA, 2018).

As reported in the same publication, the healthcare industry is unique with respect to data breaches because they are caused mostly by insiders; “insiders were behind the majority of breaches” (HIPAA, 2018).  Other causes include improper disposal, loss/theft, unauthorized access/disclosure incidents, and hacking incidents.  The largest healthcare data breaches of Q1 2018 comprised 18 security breaches, each of which impacted more than 10,000 individuals.  Hacking/IT incidents involved more records than any other breach cause, as illustrated in Figure 2 (HIPAA, 2018).

Figure 2.  Healthcare Records Exposed by Breach Cause (HIPAA, 2018).

Healthcare providers were the worst affected by the healthcare data breaches in Q1 of 2018.  With respect to the states, California was the worst affected with 11 reported breaches, followed by Massachusetts with eight security incidents.

Best Practices

Organizations must follow four main principles to select technologies and activities which will protect their confidential data assets.  The first principle concerns the policies which must be honored by the organization throughout the confidential data lifespan.  Organizations should commit to processing all data in accordance with regulations and laws, protecting users’ and patients’ private information, and employing transparency so that individuals can correct their information.  The second principle involves the risk of unauthorized access or misuse of confidential information, which organizations should minimize.  Organizations should establish a security system which provides administrative, technical, and physical security measures to ensure the CIA Triad elements of Confidentiality, Integrity, and Availability of the data.  The third principle involves the impact of confidential data loss.  Organizations must establish a protection system which minimizes the impact of stolen or lost data by providing reasonable safeguards for data in storage and in transit, such as encryption, to preserve confidentiality in case of loss or theft.  Organizations must also develop an appropriate plan to respond to a data breach effectively and provide training to all involved employees.  The last principle involves documenting the applicable controls and demonstrating their effectiveness.  Organizations must verify the effectiveness of the implementation of these principles using appropriate monitoring, auditing, and control systems.  Moreover, organizations must establish an appropriate system for handling non-compliance and the escalation path (Salido, 2010).  Figure 3 summarizes these four principles which organizations must adhere to in order to ensure compliance with the CIA Triad.

Figure 3:  Four Principles for Data Privacy and Confidentiality.  Adapted from (Salido, 2010).
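
To make the third principle concrete, the following minimal sketch shows symmetric encryption of a confidential record using Python’s cryptography library.  The record fields and key handling are illustrative assumptions, not the controls of any specific organization; in practice the key would live in a separate key management system, not alongside the data.

```python
# Minimal sketch: encrypting a confidential record at rest with the
# cryptography library's Fernet recipe. Field names are illustrative.
from cryptography.fernet import Fernet

key = Fernet.generate_key()      # in practice, store in a key management system
cipher = Fernet(key)

record = b'{"patient_id": "12345", "diagnosis": "..."}'
token = cipher.encrypt(record)   # ciphertext safe to persist or transmit
restored = cipher.decrypt(token) # only holders of the key recover the plaintext
assert restored == record
```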

In summary, organizations, including those in healthcare, must carefully adhere to data privacy regulations and legislative rules to protect users and patients from data breaches.  Organizations must commit to these four principles for data privacy and confidentiality.  They must implement proper security policy and risk management to ensure the protection of private information and to minimize the impact of confidential data loss or theft.

References

Cloud Security Alliance (CSA). (2013). Big Data Analytics for Security Intelligence. Big Data Working Group.

HIPAA. (2018). Report: Healthcare Data Breaches in Q1, 2018. Retrieved from https://www.hipaajournal.com/report-healthcare-data-breaches-in-q1-2018/.

Salido, J. (2010). Data Governance for Privacy, Confidentiality, and Compliance: A Holistic Approach. ISACA Journal, 6, 17.

Stewart, J., Chapple, M., & Gibson, D. (2015). CISSP (ISC)² Certified Information Systems Security Professional Official Study Guide (7th ed.). Wiley.

The Role of Big Data Brokers in Healthcare Industry

Dr. Aly, O.
Computer Science

Introduction

The purpose of this discussion is to analyze the role of Data Brokers in the healthcare industry.  This discussion begins with the dark side of the Big Data industry, followed by the Data Broker business, and a discussion of some Data Broker use cases in healthcare.

The Dark Side of Big Data Industry

Big Data Analytics provides various benefits to organizations and businesses in many areas.  With respect to healthcare, as indicated in (Wang, Kung, & Byrd, 2018), the benefits of Big Data cover areas such as IT infrastructure benefits, organizational benefits, managerial benefits, and strategic benefits.  In IT infrastructure, a healthcare organization can benefit from reducing system redundancy, avoiding unnecessary IT costs, transferring data quickly among healthcare IT systems, and making better use of healthcare systems.  The organizational benefits of Big Data analytics in healthcare include a reduction in surgery-related hospitalization, improvement in the quality and accuracy of clinical decisions, and reduced diagnosis time.  The managerial benefits include gaining insights quickly about changing healthcare trends in the market, optimizing business-related decisions, and fraud detection.  Strategic benefits include the development of highly competitive healthcare services.  Thus, the utilization of Big Data Analytics makes a difference in healthcare organizations in particular, and in all businesses in general.

However, there is a dark side to Big Data.  As indicated in (Martin, 2015), there are ethical issues in the Big Data industry.  Big Data is about aggregating information from various sources to create knowledge, make better predictions, and tailor services.  The ethical issues stem from reselling consumer data on the secondary market for Big Data, and the Data Broker business is a prime example.

Data Broker Business:  A Data Broker is defined in (Gartner, n.d.) as follows:

 “A Data Broker is a business that aggregates information from a variety of sources; processes it to enrich, cleanse or analyze it; and licenses it to other organizations. Data Brokers can also license another company’s data directly, or process another organization’s data to provide them with enhanced results. Data is typically accessed via an application programming interface (API) and frequently involves subscription type contract. Data typically is not ‘sold’ (i.e., its ownership transferred), but rather it is licensed for particular or limited uses. (A data broker is also sometimes known as information broker, syndicated data broker, or information product company.)”

Big Data Means Big Business to Data Brokers: Use Cases in Healthcare:  As discussed in (Tanner, 2016), Data Brokers make money off patients’ medical records.  By law, patients’ identities should be kept secure for privacy protection.  Organizations which sell medical information to data mining companies strip out patients’ private information, such as social security numbers, names, and detailed addresses, to protect patient identity (Tanner, 2016).  However, Data Brokers add a unique number to each record they collect, which allows them to match disparate pieces of information to the same individual even if they do not know the patient’s name.  Such matching makes the information more valuable to the Data Brokers (Tanner, 2016).  As indicated in (Hill, 2013), Data Brokers make money by selling lists of rape victims, alcoholics, patients with AIDS/HIV, and sufferers of hundreds of other illnesses.  There are about four thousand Data Brokers, and the roughly 320 million people in the U.S. cannot escape their business.  Moreover, Data Brokers and pharmacies such as Walgreens commercialize medical data (Leetaru, 2018).
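
The record-matching practice described above can be illustrated with a small, hypothetical sketch: a stable pseudonymous key derived from quasi-identifiers (date of birth, ZIP code, sex) links “de-identified” records from different sources to the same person without ever using a name.  The function and field names are invented for illustration and do not represent any broker’s actual method.

```python
# Illustrative sketch only: a deterministic pseudonymous key lets records
# from different sources be joined even after names are stripped.
import hashlib

def pseudo_id(dob: str, zip_code: str, sex: str) -> str:
    """Derive a stable identifier from quasi-identifiers."""
    return hashlib.sha256(f"{dob}|{zip_code}|{sex}".encode()).hexdigest()[:16]

pharmacy_claim = {"dob": "1980-05-17", "zip": "94110", "sex": "F", "drug": "metformin"}
lab_result     = {"dob": "1980-05-17", "zip": "94110", "sex": "F", "test": "HbA1c"}

# The same key emerges from both "de-identified" records, so they link to one person.
print(pseudo_id(pharmacy_claim["dob"], pharmacy_claim["zip"], pharmacy_claim["sex"]) ==
      pseudo_id(lab_result["dob"], lab_result["zip"], lab_result["sex"]))  # True
```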

In summary, although Big Data has tremendous benefits for organizations in many industries, including healthcare, there is a major concern across the board about the privacy of people, including patients in the healthcare industry.  More regulations must be enforced to protect people’s privacy from businesses such as Data or Information Brokers.  As indicated in (Slater, 2014), “The debate surrounding how to regulate big data is set to continue, both within the data broker context and beyond.”

References

Gartner. (n.d.). Data Broker. Retrieved from https://www.gartner.com/it-glossary/data-broker.

Hill, K. (2013). Data Broker Was Selling Lists Of Rape Victims, Alcoholics, and ‘Erectile Dysfunction Sufferers.’ Retrieved from https://www.forbes.com/sites/kashmirhill/2013/12/19/data-broker-was-selling-lists-of-rape-alcoholism-and-erectile-dysfunction-sufferers/#25a8eb841d53.

Leetaru, K. (2018). How Data Brokers And Pharmacies Commercialize Our Medical Data. Retrieved from https://www.forbes.com/sites/kalevleetaru/2018/04/02/how-data-brokers-and-pharmacies-commercialize-our-medical-data/#238acd4b11a6.

Martin, K. E. (2015). Ethical issues in the big data industry. MIS Quarterly Executive, 14(2).

Slater, A. (2014). The Costs and Benefits of Data Brokers. Retrieved from https://www.theregreview.org/2014/06/19/19-slater-costs-and-benefits-of-data-brokers/.

Tanner, A. (2016). How Data Brokers Make Money Off Your Medical Records. Retrieved from https://www.scientificamerican.com/article/how-data-brokers-make-money-off-your-medical-records/.

Wang, Y., Kung, L., & Byrd, T. A. (2018). Big data analytics: Understanding its capabilities and potential benefits for healthcare organizations. Technological Forecasting and Social Change, 126, 3-13.

Case Study: Big Data Analytics Used to Detect Fraudulent Data in Healthcare Industry

Dr. Aly, O.
Computer Science

Introduction

The purpose of this discussion is to present a case study where Big Data Analytics was used to detect fraudulent data in the healthcare industry.  The discussion also analyzes the case study and how Big Data Analytics can be used to eliminate or mitigate fraud in the healthcare industry.

Healthcare Fraud is Reality

The healthcare industry is valued at $2.7 trillion in the United States alone, and one-third of this value is lost to waste, fraud, and abuse.  Common examples of fraud and abuse in healthcare include illegal medical billing for falsified claims, multiple claims filed by multiple providers, and stolen patient identities used to gain reimbursement for medical services never provided.  Fraudulent billing is estimated at 3-10% of annual healthcare costs in the US (Halyna, 2017).  As indicated in (Travaille, Mueller, Thornton, & van Hillegersberg, 2011), an estimated $600 to $850 billion is lost annually to fraud, waste, and abuse in the U.S. healthcare system, of which $125 to $175 billion is due to fraudulent activities.  An example of the magnitude of healthcare fraud is the Medicare fraud scheme of a Texas doctor who netted nearly $375 million, including $350 million from Medicare and more than $24 million from Medicaid (medicarefraudcenter.org, 2012).  Healthcare fraud is an intentional deception used by medical practitioners to obtain unauthorized benefits (Joudaki et al., 2015).  Fraud is a criminal activity and must be detected and stopped.  As indicated in (Rashidian, Joudaki, & Vian, 2012), combating fraud in the healthcare industry remains a challenge.

Thus, there is a serious need for effective fraud detection and prevention in the healthcare industry.  Big Data Analytics plays a significant role in fraud detection in healthcare.  As indicated in (Hitchcock, 2018), “Big data has massively transformed the healthcare industry in so many ways, contributing largely to today’s more efficient value-based healthcare system.”  Various research studies have shown that Big Data Analytics and Data Mining can be used effectively in fraud detection.  For example, in (J. Liu et al., 2016), the researchers used graph analysis for detecting fraud, waste, and abuse in healthcare data.  In (Bresnick, 2017), machine learning and Big Data Analytics are helping Anthem improve member engagement and detect fraud, waste, and abuse.  In (Suleiman, Agrawal, Seay, & Grosky, 2014), Big Data is used to filter fraudulent Medicaid applications.

Case Study In Health Insurance

This case study is based on (Nelson, 2017).  The health insurance company had been using a traditional technique for fraud detection, which relied on executing SQL queries against a data warehouse storing massive amounts of claims, billing, and other information.  This process took weeks or months before enough evidence for a legal case was developed, and the longer a fraud went undetected, the greater the organization’s losses.

The health insurance company then adopted Big Data techniques such as predictive analytics and machine learning for fraud detection.  The integration of Big Data and search architecture proved to be the most feasible approach.  Figure 1 illustrates the Big Data architecture for fraud detection utilized by the healthcare insurance company.  This Big Data framework made the fraud detection effort more scalable, faster, and more accurate.

Figure 1.  Big Data Analytics Architecture for Fraud Detection (Nelson, 2017).

As a result of integrating Big Data into its fraud detection system, the company experienced an immediate return on investment, saving $150 million used to prosecute large fraud cases.  The advantages of the integration also include the analysis of 10+ million claims, 100+ million bill line details, and related records.  The framework further provided more accurate information by computing key fraud indicators, automatically producing red-flag datasets for suspicious activities, and leveraging all records rather than just a statistical sample.

Big Data Analytics Integration and Impact in Fraud Detection in Healthcare

Many organizations have employed Big Data and Data Mining in areas including fraud detection.  Big Data Analytics can empower the healthcare industry to detect fraud and mitigate the impact of fraudulent activities.  Several use cases, such as (Halyna, 2017; Nelson, 2017), have demonstrated the positive impact of integrating Big Data Analytics into the fraud detection system.

Big Data Analytics and Data Mining offer various techniques such as classification, regression, and clustering models.  The classification model employs logistic, tree, naïve Bayesian, and neural network algorithms and can be used for fraud detection, as illustrated in the sketch below.  The regression model employs linear regression and k-nearest-neighbor algorithms.  The clustering model employs k-means, hierarchical, and principal component algorithms.
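
As a rough illustration of the classification approach, the hedged sketch below trains a logistic regression model on synthetic claim-level features with scikit-learn.  The feature names and the fraud label are invented; a real system would derive them from claims, billing, and provider history.

```python
# Hypothetical sketch: supervised classification of claims with synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))   # e.g., billed amount, visit count, distance, code rarity
# Synthetic "fraud" label correlated with two of the features.
y = (X[:, 0] + 2 * X[:, 3] + rng.normal(scale=0.5, size=1000) > 2).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression().fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```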

For instance, in (Q. Liu & Vasarhelyi, 2013), the researchers applied a clustering technique using an unsupervised data mining approach to detect fraud by insurance subscribers.  In (Ekina, Leva, Ruggeri, & Soyer, 2013), the researchers applied Bayesian co-clustering with an unsupervised data mining method to detect conspiracy fraud involving more than one party.  In (Capelleveen, 2013), the researchers employed an outlier detection technique using an unsupervised data mining method on dental claim data within Medicaid.  In (Aral, Güvenir, Sabuncuoğlu, & Akar, 2012), the researchers used distance-based correlation with hybrid supervised and unsupervised data mining methods for prescription fraud detection.  These research studies and use cases are examples of taking advantage of Big Data Analytics in healthcare fraud detection and demonstrate that Big Data Analytics can play a significant role in this area.
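
In the same spirit as the unsupervised studies above, the following sketch flags anomalous billed amounts with scikit-learn’s IsolationForest on synthetic data.  IsolationForest is chosen here only for convenience; it is not the specific method used in the cited papers.

```python
# Hypothetical sketch: unsupervised outlier detection over billed amounts.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)
normal_claims = rng.normal(loc=100, scale=15, size=(500, 1))   # typical billed amounts
suspect_claims = rng.normal(loc=400, scale=20, size=(5, 1))    # a few inflated bills
X = np.vstack([normal_claims, suspect_claims])

model = IsolationForest(contamination=0.01, random_state=1).fit(X)
flags = model.predict(X)           # -1 marks outliers worth a manual fraud review
print("flagged claims:", int((flags == -1).sum()))
```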

References

Aral, K. D., Güvenir, H. A., Sabuncuoğlu, İ., & Akar, A. R. (2012). A prescription fraud detection model. Computer methods and programs in biomedicine, 106(1), 37-46.

Bresnick, J. (2017). Borrowing from Retail, Anthem’s Big Data Analytics Boost Member Engagement. Retrieved from https://healthitanalytics.com/news/borrowed-from-retail-anthems-big-data-analytics-boost-member-engagement.

Capelleveen, G. C. (2013). Outlier based predictors for health insurance fraud detection within US Medicaid. The University of Twente.  

Ekina, T., Leva, F., Ruggeri, F., & Soyer, R. (2013). Application of Bayesian methods in the detection of healthcare fraud.

Halyna. (2017). Challenge Accomplished: Healthcare Fraud Detection Using Predictive Analytics. Retrieved from https://www.romexsoft.com/blog/healthcare-fraud-detection/.

Hitchcock, E. (2018). The Role of Big Data in Preventing Healthcare Fraud, Waste and Abuse. Retrieved from https://www.datameer.com/blog/role-big-data-preventing-healthcare-fraud-waste-abuse/.

Joudaki, H., Rashidian, A., Minaei-Bidgoli, B., Mahmoodi, M., Geraili, B., Nasiri, M., & Arab, M. (2015). Using data mining to detect health care fraud and abuse: a review of the literature. Global journal of health science, 7(1), 194.

Liu, J., Bier, E., Wilson, A., Guerra-Gomez, J. A., Honda, T., Sridharan, K., . . . Davies, D. (2016). Graph analysis for detecting fraud, waste, and abuse in healthcare data. AI Magazine, 37(2), 33-46.

Liu, Q., & Vasarhelyi, M. (2013). Healthcare fraud detection: A survey and a clustering model incorporating Geo-location information.

medicarefraudcenter.org. (2012). Biggest Medicare Fraud Scheme in History Netted $375 Million for Texas Doctor. Retrieved from http://www.medicarefraudcenter.org/medicare-fraud-news/8-2012/39-biggest-medicare-fraud-scheme-in-history-netted-375-for-texas-doctor.html.

Nelson, P. (2017). Fraud Detection Powered by Big Data – An Insurance Agency’s Case Story. Retrieved from https://www.searchtechnologies.com/blog/fraud-detection-big-data.

Rashidian, A., Joudaki, H., & Vian, T. (2012). No Evidence of the Effect of the Interventions to Combat Health Care Fraud and Abuse: A Systematic Review of Literature. Retrieved from http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0041988.

Suleiman, M., Agrawal, R., Seay, C., & Grosky, W. (2014). Data-driven implementation to filter fraudulent Medicaid applications. Paper presented at the SOUTHEASTCON 2014, IEEE.

Travaille, P., Mueller, R., Thornton, D., & van Hillegersberg, J. (2011). Electronic Fraud Detection in the U.S. Medicaid Healthcare Program: Lessons Learned from other Industries. Retrieved from https://research.utwente.nl/en/publications/electronic-fraud-detection-in-the-us-medicaid-healthcare-program-.

Installation and Configuration of OpenStack and AWS

Dr. Aly, O.
Computer Science

Abstract

The purpose of this project was to articulate all the steps for the installation and configuration of OpenStack and Amazon Web Services (AWS).  The project begins with an overview of OpenStack and is divided into three main phases.  Phase 1 discusses and analyzes the differences between the networking techniques in AWS and OpenStack.  Phase 2 discusses the configurations required to deploy the OpenStack Controller and analyzes the expansion of OpenStack to include an additional node as a Compute Node.  Phase 3 discusses the issues encountered during the installation and configuration of OpenStack and AWS services.  A virtual bridge for the provider network was configured so that all VM traffic reaches the Internet through the external bridge.  Floating IPs must also be disabled to avoid packets being dropped when they reach AWS.  In this project, OpenStack with a Controller Node and an additional Compute Node is deployed and accessed successfully using the Horizon dashboard.  Elastic Compute Cloud (EC2) is also installed and configured successfully using the default VPC, the default Security Group, and Access Control List.

Keywords: OpenStack, Amazon Web Services (AWS).

Introduction

            OpenStack is a result of initiatives from Rackspace and NASA in 2010 because NASA could not store its data in the Public Cloud for security reasons.  OpenStack is an open source project which can be utilized by leading vendors to bring AWS-like ability and agility to the private cloud.  OpenStack has been growing since its inception in 2010 to include 500 member companies as part of the OpenStack Foundation with platinum and gold members from the largest IT vendors globally.  Examples of these platinum members include RedHat, Suse, IBM, Hewlett Packard Enterprise, Ubuntu, AT&T and Rackspace (Armstrong, 2016).

            OpenStack primarily provides an Infrastructure-as-a-Service (IaaS) function within the private cloud, making centralized storage, commodity compute, and networking features available to end users to self-service their needs through the Horizon dashboard or a set of common APIs.  Many organizations deploy OpenStack in-house to develop their own data centers.  An OpenStack implementation is less likely to fail when it utilizes professional service support from known vendors, and it can provide an alternative solution to Microsoft Azure and AWS.  Examples of these professional service vendors include Red Hat, Suse, HP, Canonical, and Mirantis, each of which provides different methods of installing the platform (Armstrong, 2016).

            OpenStack follows a six-month release cycle, during which an upstream release is created.  The OpenStack Foundation creates the upstream release and governs it.  Examples of public cloud deployments of OpenStack include AT&T, Rackspace, and GoDaddy; thus, OpenStack is not exclusively used for private clouds.  However, OpenStack has been increasingly popular as a private cloud alternative to the AWS public cloud, and it is now widely used for Network Function Virtualization (NFV) (Armstrong, 2016).

OpenStack and AWS utilize different approaches to Networking.  This section begins with AWS Networking, followed by OpenStack Networking.

Phase 1:  OpenStack Networking vs. AWS Networking

1.1       AWS Networking

Amazon Virtual Private Cloud (VPC) provides a logically isolated network within AWS which can be combined with an organization’s private resources to form a hybrid cloud.  A default VPC is created for new AWS users.  The VPC can also be connected to a user’s network or to the private data center of the organization.  The underlying concept of connecting the VPC to the private data center is the use of a customer gateway and a virtual private gateway (VPG).  The VPG terminates two redundant VPN tunnels, which are instantiated from the private network of the user or the organization.  The organization’s gateway exposes a set of external static addresses from the organization’s site, using Network Address Translation-Traversal (NAT-T) to hide the addresses.  The organization can use one gateway device to access multiple VPCs.  The VPC provides an isolated view of all provisioned instances, and AWS Identity and Access Management (IAM) is used to set up user accounts to access the VPC.  Figure 1 illustrates an example of an AWS VPC with virtual machines (instances) mapped to one or more security groups and connected to different subnets attached to the VPC router (Armstrong, 2016; AWS, 2017).

Figure 1.  VPC of AWS showing multiple instances using Security Group.

            The VPC simplifies networking in software, allowing users and organizations to perform a set of networking operations such as subnet mapping, Domain Name System (DNS) configuration, public and private IP address assignment, and the application of security groups and access control lists.  When an organization creates a virtual machine (instance), the default VPC is assigned to it automatically.  Every VPC comes with a default router, which can be given additional custom routes and routing priorities to forward traffic to specific subnets based on the requirements of the organization and its users.  Figure 2 illustrates a VPC using private IPs, public IPs, and the Main Route Table, adapted from (Armstrong, 2016; AWS, 2017).

Figure 2.  AWS VPC Configuration Example (AWS, 2017).

            With respect to IP addressing in AWS, a mandatory private IP is assigned automatically to every instance, along with a public IP and DNS entry unless the instance is a dedicated instance.  The private IP is used to route traffic among instances when a virtual machine needs to communicate with another virtual machine close to it on the same subnet.  The public IP, on the other hand, is accessible through the Internet.  If a persistent public IP address is needed for a virtual machine, AWS provides the Elastic IP address feature, which is limited to five addresses per VPC account.  When using Elastic IP addresses, the IP address can be remapped quickly to another instance in case of an instance failure.  When using AWS, it can take up to 24 hours for the DNS Time to Live (TTL) of a public IP address to propagate.  Moreover, AWS supports a Maximum Transmission Unit (MTU) of 1,500 bytes for traffic passed to an instance, which organizations must consider for application performance (Armstrong, 2016; AWS, 2017).
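
The Elastic IP remapping described above can be sketched with boto3, assuming credentials and a region are already configured in the environment; the instance IDs below are placeholders.

```python
# Hedged sketch: re-pointing an Elastic IP at a standby instance after a failure.
import boto3

ec2 = boto3.client("ec2")
alloc = ec2.allocate_address(Domain="vpc")                  # reserve an Elastic IP
ec2.associate_address(AllocationId=alloc["AllocationId"],
                      InstanceId="i-0aaaaaaaaaaaaaaaa")     # primary instance (placeholder)

# On failure, move the same public address to a healthy replacement instance.
ec2.associate_address(AllocationId=alloc["AllocationId"],
                      InstanceId="i-0bbbbbbbbbbbbbbbb",     # standby instance (placeholder)
                      AllowReassociation=True)
```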

            AWS uses Security Groups (SGs) and Access Control Lists.  An SG in AWS groups a collection of access control rules with an implicit deny and can be associated with one or more network interfaces of instances, acting as the firewall for those instances.  A default SG is applied automatically if no other security group is specified for the instantiated instance; it allows all outbound traffic and allows inbound traffic only from other instances within the same VPC, and it cannot be deleted.  A custom SG allows no inbound traffic by default but allows all outbound traffic.  The user can add Access Control List (ACL) rules associated with the SG governing inbound traffic using the AWS console (Armstrong, 2016; AWS, 2017).
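
A hedged boto3 sketch of the custom Security Group behavior described above follows; the VPC ID and CIDR range are placeholders, and the single inbound rule is only an example.

```python
# Hedged sketch: create a custom Security Group and one inbound rule with boto3.
import boto3

ec2 = boto3.client("ec2")
sg = ec2.create_security_group(
    GroupName="web-sg",
    Description="Allow HTTPS from the corporate range",
    VpcId="vpc-0123456789abcdef0",          # placeholder VPC ID
)
ec2.authorize_security_group_ingress(
    GroupId=sg["GroupId"],
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 443,
        "ToPort": 443,
        "IpRanges": [{"CidrIp": "203.0.113.0/24"}],   # example address range
    }],
)
# Outbound traffic remains allowed by the group's default egress behavior.
```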

            The AWS VPC has access to different regions and Availability Zones of shared compute, which dictate the data center in which an instance or virtual machine will be deployed.  An Availability Zone (AZ) is an isolated location residing within a region, which is a geographic area isolated by design; thus, an AZ is a subset of a region.  Organizations and users can place resources in different locations for redundancy and recovery considerations.  AWS supports the use of more than one AZ when deploying production workloads.  Moreover, organizations and users can replicate instances and data across regions (Armstrong, 2016; AWS, 2017).

            AWS also offers the Elastic Load Balancing (ELB) feature, which can be configured within a VPC.  An ELB can be external or internal.  An external ELB creates an Internet-facing entry point into the VPC using an associated DNS entry and balances load among the instances in the VPC.  An SG is assigned to the ELB to control access to the ports which need to be used (Armstrong, 2016; AWS, 2017).

1.2       OpenStack Networking

            OpenStack is deployed in a data center on multiple controllers, which contain all of the OpenStack services.  These controllers can be installed on virtual machines, bare-metal physical servers, or containers.  When deployed in a production environment, they host all OpenStack services on a highly available and redundant platform.  Different OpenStack vendors offer different installers, such as Red Hat Director, Mirantis Fuel, HPE’s installer, and Canonical’s Juju.  All of these installers deploy controllers and are also used to scale out compute nodes in the OpenStack cloud (Armstrong, 2016; OpenStack, 2018b).

            With respect to OpenStack services, there are eleven core services installed on the OpenStack controller: Keystone, Heat, Glance, Cinder, Nova, Horizon, Rabbitmq, Galera, Swift, Ironic, and Neutron.  Figure 3 summarizes each core service (OpenStack, 2018a).  The Neutron networking service is similar in its constructs to AWS networking (Armstrong, 2016; OpenStack, 2018b).

Figure 3.  Summary of OpenStack Core Services (OpenStack, 2018a)

In OpenStack, a Project (also referred to as a Tenant) provides an isolated view of everything which a team has provisioned in the OpenStack cloud.  Using the Keystone identity service, different users can be set up for a Project (Tenant).  These accounts can be integrated with LDAP directories such as Active Directory to support a customizable permission model (Armstrong, 2016; OpenStack, 2018b).

The Neutron service of OpenStack performs all networking-related tasks and functions, which include seven major steps.  The first step is the creation of instances (virtual machines) mapped to networks.  The second step is the assignment of IP addresses using the built-in DHCP service.  The third step is the application of DNS entries to instances from named servers.  The fourth step is the assignment of private and floating IP addresses.  The fifth step is the creation or association of the network subnet, followed by the creation of routers.  The last step is the application of Security Groups (Armstrong, 2016; OpenStack, 2018b).
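
The tenant networking workflow above can be sketched with the openstacksdk library, assuming a cloud entry exists in clouds.yaml and an external network named "public" is available; the names and CIDR below are illustrative only, not a prescribed configuration.

```python
# Hedged sketch of the tenant network / subnet / router workflow with openstacksdk.
import openstack

conn = openstack.connect(cloud="mycloud")    # reads credentials from clouds.yaml

net = conn.network.create_network(name="private-net")
subnet = conn.network.create_subnet(
    network_id=net.id, name="private-subnet",
    ip_version=4, cidr="192.168.10.0/24",
)

ext = conn.network.find_network("public")    # assumed external/provider network
router = conn.network.create_router(
    name="tenant-router",
    external_gateway_info={"network_id": ext.id},
)
conn.network.add_interface_to_router(router, subnet_id=subnet.id)
```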

The compute nodes of OpenStack are deployed using a hypervisor which uses Open vSwitch.  Most vendor distributions of OpenStack provide the KVM hypervisor by default, which is deployed and configured on each compute node by the OpenStack installer.  The compute nodes in OpenStack are connected to the access layer of the STP 3-tier model; in modern networks, they are connected to leaf switches, with VLANs connected to each compute node in the OpenStack cloud.  Tenant networks provide isolation among tenants and use VXLAN or GRE tunneling to connect the layer 2 networks (Armstrong, 2016; OpenStack, 2018b).

The configuration of simple networking using Neutron in a Project (Tenant) requires two different networks: an internal network and an external network.  The internal network carries traffic among instances in the Project, with the subnet name and range specified in the subnet.  The external network makes the internal network accessible from outside of OpenStack.  A router is also created to route packets between the associated networks, and the external network is set as the router’s gateway.  The last step in the network configuration connects the router to both the internal and external networks.  Instances are provisioned onto the internal private network by selecting the private network NIC during instance deployment.  OpenStack assigns pools of public IPs, known as floating IP addresses, from an external network for instances which need to be externally routable outside of OpenStack (Armstrong, 2016; OpenStack, 2018b).

Like AWS, OpenStack uses SGs to set up firewall rules between instances.  However, unlike AWS, which allows all outbound communications, OpenStack supports both ingress and egress ACL rules.  SSH access must be configured as an ACL rule against the parent SG in OpenStack, which is pushed down to Open vSwitch into kernel space on each hypervisor.  Once the internal and external networks are set up and configured for the Project (Tenant), instances are ready to be launched on the private network, and users can access them from the Horizon dashboard (Armstrong, 2016; OpenStack, 2018b).

With respect to regions and availability zones, OpenStack, like AWS, uses regions and AZs.  The compute nodes (hypervisors) in OpenStack can be assigned to different AZs, which are virtual separations of computing resources.  An AZ in OpenStack can be segmented into host aggregates; however, a compute node can be assigned to only one AZ, while it can be part of multiple host aggregates within that AZ (Armstrong, 2016; OpenStack, 2018b).

OpenStack offers Load-Balancer-as-a-Service (LBaaS), which allows incoming requests to be distributed evenly among designated instances using a Virtual IP (VIP).  Examples of popular LBaaS plugins in OpenStack include Citrix NetScaler, F5, HAProxy, and Avi Networks.  The underlying concept of LBaaS on OpenStack is to allow organizations and users to employ LBaaS as a broker to the load balancing solutions, using the OpenStack APIs or the Horizon dashboard to configure the load balancer (Armstrong, 2016; OpenStack, 2018b).

Phase 2:  AWS and OpenStack Setup and Configuration

            This project deployed OpenStack on AWS and was limited to the configuration of the controller node; the OpenStack cloud was then expanded to add a compute node.  The topology for this project is illustrated in Figure 4.  Port 9000 will be configured to be accessed from the browser on the client.  The Compute Node VM will use a different IP address from the OpenStack Controller Node.  A private network will be configured using the Vagrant software, and a NAT interface will be configured and mapped to the Compute Node and the OpenStack Controller Node as illustrated in Figure 4.

Figure 4.  This Project’s Topology.

The Controller Node is configured with one processor, 4 GB of memory, and 5 GB of storage.  The Compute Node is configured with one processor, 2 GB of memory, and 10 GB of storage.  The installation must be performed on a 64-bit distribution on each node.  VirtualBox and the Vagrant software are used in this project.  Another tool, Sublime Text, is installed to edit the Vagrant file and avoid stray control characters at the end of each line, which can cause problems.  The project uses the Pike release.

2.1 Amazon Machine Image (AMI) and Elastic Compute Cloud (EC2) Configuration

The project requires an AWS account in order to select the image which can be used for the OpenStack deployment.  Multi-Factor Authentication is implemented to access the account.  An Amazon Machine Image (AMI) for Elastic Compute Cloud (EC2) is selected from the pool of AMIs for this project.  The Free Tier EC2 instance is configured with the default Security Group (SG) and Access Control List (ACL) rules as discussed earlier.  An EC2 AMI is a template which contains the software configuration, such as the operating system, application server, and applications, required to launch and instantiate the instance.  The EC2 AMI is configured to use the default VPC.
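
As a hedged illustration of launching the Free Tier EC2 instance from a selected AMI, the boto3 sketch below uses placeholder AMI, key pair, and security group identifiers; credentials and a region are assumed to be configured in the environment.

```python
# Hedged sketch: launch a Free Tier EC2 instance from an AMI with boto3.
import boto3

ec2 = boto3.client("ec2")
resp = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",     # placeholder AMI from the catalogue
    InstanceType="t2.micro",             # Free Tier eligible instance type
    MinCount=1,
    MaxCount=1,
    KeyName="openstack-lab-key",         # placeholder key pair
    SecurityGroupIds=["sg-0123456789abcdef0"],
)
print("launched:", resp["Instances"][0]["InstanceId"])
```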

2.2 OpenStack Controller Node Configuration

The Controller Node is configured first to use the IP Address identified in the topology.  This configuration is implemented using Vagrant software and Vagrant file. 

  • Connect to the Controller using the Vagrant software.  To start the Controller from Vagrant, execute:
    • $vagrant up controller
  • Verify the Controller is running successfully:
    • $vagrant status
  • Verify the NAT address using eth0:
    • $ifconfig -a
  • Verify the Private IP Address using eth1.  The IP address should match the IP address configured in the Vagrant configuration file.

Access the OpenStack Controller Node from the browser using port 9000.

  • Verify the Hypervisors from Horizon interface. 

2.3 OpenStack Compute Node Configuration

The OpenStack Cloud is expanded by adding a Compute Node.  The configuration of the compute node is performed using the Vagrant file.

  • Connect to the Compute Node using the Vagrant command.  The Compute Node uses node1 as its hostname.  To start the Compute Node from Vagrant, execute the following command:
    • $vagrant up node1
  • Verify the Compute Node is running successfully:
    • $vagrant status
  • Access node1 using SSH.
  • Check the OpenStack services:
    • $sudo systemctl list-units devstack@*
  • Verify the NAT address using eth0:
    • $ifconfig -a
  • Verify the Private IP Address using eth1.  The IP address should match the IP address configured in the Vagrant configuration file.

Access the OpenStack Controller Node from the browser using port 9000 and verify that the new Compute Node appears under Hypervisors in the Horizon interface.

Phase 3:  Issues Deploying OpenStack on AWS

Several issues were encountered during the deployment of OpenStack on AWS.  The issue which impacted the EC2 AMI involved the MAC address, which must be registered in the AWS network environment.  Moreover, the MAC address and the IP address must be mapped together, because packets will not be allowed to flow if the MAC address and the IP address do not match.

3.1 Neutron Networking

            During the configuration of OpenStack Neutron networking, a virtual bridge for the provider network is configured so that all VM traffic reaches the Internet through the external bridge, which is backed by the actual physical NIC, eth1.  Thus, a NIC with a special type of configuration is set up as the external interface, as shown in the topology for this project (Figure 4).

 3.2 Disable Floating IP

            The floating IP must be disabled because it would send packets through the router’s gateway with a floating IP as the source address; these packets would be dropped once they reach AWS because they would arrive at the switch with an unregistered IP and MAC address.  In this project, NAT is configured to reach public addresses externally, as shown in the topology in Figure 4.

Conclusion

The purpose of this project was to articulate all the steps for the installation and configuration of OpenStack and Amazon Web Services.  The project began with an overview of OpenStack and was divided into three main phases.  Phase 1 discussed and analyzed the differences between the networking techniques in AWS and OpenStack.  Phase 2 discussed the configurations required to deploy the OpenStack Controller and analyzed the expansion of OpenStack to include an additional node as a Compute Node.  Phase 3 discussed the issues encountered during the installation and configuration of OpenStack and AWS services.  A virtual bridge for the provider network was configured so that all VM traffic reaches the Internet through the external bridge.  Floating IPs also had to be disabled to avoid packets being dropped when they reach AWS.  In this project, OpenStack with a Controller Node and an additional Compute Node was deployed and accessed successfully using the Horizon dashboard.  Elastic Compute Cloud (EC2) was also installed and configured successfully using the default VPC, the default Security Group, and Access Control List.

References

Armstrong, S. (2016). DevOps for Networking: Packt Publishing Ltd.

AWS. (2017). Virtual Private Cloud:  User Guide. Retrieved from: http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/vpc-ug.pdf.

OpenStack. (2018a). Introduction to OpenStack. Retrieved from https://docs.openstack.org/security-guide/introduction/introduction-to-openstack.html.

OpenStack. (2018b). OpenStack Overview. Retrieved from https://docs.openstack.org/install-guide/overview.html.

The Use of Cloud Computing Technology in Healthcare Industry

Dr. Aly, O.
Computer Science

Introduction

The purpose of this discussion is to analyze the use of Cloud Computing technology in the healthcare industry.  It also discusses the present issues related to healthcare data in the Cloud and the advantages and disadvantages of placing the data in the Public and Private Cloud.  The discussion also provides a use case scenario.

Healthcare in Cloud Computing

As indicated in (Chen & Hoang, 2011), the healthcare industry is moving slowly toward Cloud Computing technology due to the sensitive nature of healthcare data.  Healthcare organizations fear employing Cloud Computing because of privacy and security issues which could leak data from the Cloud to unauthorized users.  Researchers have exerted tremendous effort to propose cloud frameworks which ensure data protection for the healthcare industry.

In (Chen & Hoang, 2011), the researchers proposed a robust data protection framework surrounded by a chain of protection schemes spanning access control, monitoring, and active auditing.  The proposed framework includes three major components in this chain of protection.  The first component is the Cloud-based, Privacy-aware, Role-based Access Control (CPRBAC) model.  The second is the Triggerable Data File Structure (TDFS) model.  The third is the Active Auditing Scheme (AAS).  In (Regola & Chawla, 2013), the researchers presented a prototype infrastructure in Amazon’s Virtual Private Cloud to allow researchers and practitioners to utilize the data in a HIPAA-compliant environment.  In (Yu, Kollipara, Penmetsa, & Elliadka, 2013), the researchers provided an approach for a distributed storage system using a combination of RDBMS and NoSQL databases to ensure optimal system performance and scalability.  These three research studies are examples of the tremendous effort exerted by researchers in the healthcare domain to ensure security.

Healthcare Use Case

The Healthcare Information System supports clinical and medical activities related to patient care.  The system is an integration of several components where each component serves a specific need of a medical system. These components include Radiology Information System (RIS), Picture Archiving and Communication System (PACS), Laboratory Information System (LIS), and Policy and Procedure Management System (PPMS) (Yu et al., 2013).

In (Yu et al., 2013), the researchers focused on the RIS, which is software used to manage patients and their radiology data such as ultrasound scans, X-rays, CT scans, audio, and video.  Patient activity management includes examination scheduling, patient data processing and monitoring, and analysis of patient record statistics.  Radiology data management includes the processing of file records, formatting and storing radiology data with a digital signature, and tracking film records.  The RIS deals with very large volumes of unstructured and structured data, is often used with the PACS, and requires very large storage space.

The researchers examined two NoSQL databases for this project, MongoDB and Cassandra, and found MongoDB to be more suitable for Healthcare Information Systems.  Table 1 summarizes the comparison between MongoDB and Cassandra, adapted from (Yu et al., 2013).

Table 1. Comparison between MongoDB and Cassandra for Healthcare data (Yu et al., 2013).

The RIS framework in this project included the System Architecture and the Cloud Architecture.  The System Architecture was deployed in AWS using EC2 (Elastic Compute Cloud), which can be accessed by a request from a browser using HTML or from a mobile client application.  The application server was placed in the Public Cloud, and the database was placed in the Private Cloud.  When the system requires communication with the database in the Private Cloud, the request must go through various security measures and pass through the security of the Private Cloud and the firewall to connect to the storage server.  The request talks to either a SQL or NoSQL database based on the data management logic model.  The Cloud Architecture involved a Public Cloud and a Private Cloud, with the Private Cloud used to store all sensitive data.  The storage server controls the SQL and NoSQL databases along with the security and backup capabilities and functionalities.  A NAS server was used as the storage solution to deal with the large volume of healthcare data (Yu et al., 2013).
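
A brief sketch of how an RIS record might be stored and indexed in MongoDB with pymongo is shown below; the connection endpoint, collection, and fields are assumptions for illustration, not the schema used in the cited study.  Large images would remain on the NAS/PACS storage, with only their references kept in the database.

```python
# Hypothetical sketch: storing and querying a radiology study record with pymongo.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # placeholder private-cloud endpoint
ris = client["ris"]["studies"]

ris.create_index("patient_id")                      # speed up per-patient lookups
ris.insert_one({
    "patient_id": "P-1001",
    "modality": "CT",
    "study_date": "2018-03-14",
    "report": "No acute findings.",
    "image_uri": "nas://archive/ct/P-1001/study-42", # large images stay on NAS/PACS
})
print(ris.find_one({"patient_id": "P-1001"})["modality"])
```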

Advantages and Disadvantages of healthcare data in the Cloud

Cloud Computing offers various advantages to several industries, including healthcare.  The major benefits of using Cloud Computing technology for healthcare include scalability, data storage, data sharing and data availability, reliability and efficiency, and cost reduction (Pullarao & Thirupathi Rao, 2013).  The major challenge when using Cloud Computing in the healthcare industry is security.  However, as demonstrated in the above use case, the risk of leaking data from the Cloud to unauthorized users can be mitigated by using a Private Cloud with additional security measures.  A Public Cloud should never be used for storing sensitive data; in the above use case, the Public Cloud was used only for the application layer, with security measures such as access control and SSL to access the data from the browser.

References

Chen, L., & Hoang, D. B. (2011, 16-18 Nov. 2011). Towards Scalable, Fine-Grained, Intrusion-Tolerant Data Protection Models for Healthcare Cloud. Paper presented at the 2011 IEEE 10th International Conference on Trust, Security, and Privacy in Computing and Communications.

Pullarao, K., & Thirupathi Rao, K. (2013). A secure approach for storing and using health data in a private cloud computing environment. International Journal of Advanced Research in Computer Science, 4(9).

Regola, N., & Chawla, N. (2013). Storing and using health data in a virtual private cloud. Journal of medical Internet research, 15(3), e63.

Yu, W. D., Kollipara, M., Penmetsa, R., & Elliadka, S. (2013, 9-12 Oct. 2013). A distributed storage solution for cloud-based e-Healthcare Information System. Paper presented at the 2013 IEEE 15th International Conference on e-Health Networking, Applications, and Services (Healthcom 2013).

Current State of Data Storage for Big Data

Dr. Aly, O.
Computer Science

Introduction

The purpose of this discussion is to analyze the current state of data storage for Big Data.  The discussion also analyzes the impact of Big Data storage on organizational processes.

Big Data and Big Data Analytics Brief Overview

The term Big Data refers to the explosive growth in the volume of data which is difficult to store, process, and analyze.  Volume, however, is only one feature; Big Data is characterized by the 3Vs of volume, variety, and velocity.  The variety of the data reflects the different types of data collected from sensors, smartphones, or social networks, so the collected data includes unstructured and semi-structured types in addition to structured data.  The velocity characteristic reflects the speed of data transfer, where the content of the data is continuously changing.  These three major features characterize the nature of Big Data.  Big Data is classified further into Data Sources, Content Format, Data Stores, Data Staging, and Data Processing.  Figure 1 summarizes the Big Data classifications, adapted from (Hashem et al., 2015).

Figure 1.  Big Data Classification.  Adapted from (Hashem et al., 2015).

Big Data without Analytics has no value.  Big Data Analytics (BDA) is the process of examining large datasets containing a variety of data types such as unstructured, semi-structured and structured. The purpose of the BDA is to uncover hidden patterns, market trends, unknown correlations, customer preferences and other useful business information that can help the organization (Arora & Bahuguna, 2016).  BDA has been used in various industries such as healthcare.   

Big Data Storage

The explosive growth of data has challenged the capabilities of existing storage technologies to store and manage data.  Organizations have traditionally stored data in structured relational databases.  However, Big Data and BDA require distributed storage technology based on Cloud Computing instead of local storage attached to a computer or electronic device.  Cloud Computing technologies provide a powerful framework which performs complex large-scale computing tasks and spans a range of IT functions from storage and computation to database and application services.  Organizations and users adopt Cloud Computing technologies because of the need to store, process, and analyze large amounts of data (Hashem et al., 2015).

Various storage technologies have emerged to meet the requirements of dealing with large volumes of data.  These storage technologies include Direct Attached Storage (DAS), Network Attached Storage (NAS), and Storage Area Network (SAN).  When using DAS, various hard disk drives are directly connected to the servers, and each hard disk drive receives a certain amount of I/O resource managed by the application.  DAS is a good fit for servers that are interconnected on a small scale.  NAS provides a storage device which is attached to a network through a switch or hub via TCP/IP protocols, and data is transferred as files.  The I/O burden in NAS is lower than in DAS because the NAS server can indirectly access a storage device through the network.  NAS suits scalable and bandwidth-intensive networks, including high-speed networks with optical-fiber connections.  A SAN provides data storage that is independent of storage on the local area network; data management and sharing are maximized by multipath data switching conducted among internal nodes.  An organizational data storage system built on DAS, NAS, or SAN can be divided into three categories: the disk array, the connection and network sub-systems, and the storage management software.  The disk array provides the storage system, the connection and network sub-systems provide the connections to one or more disk arrays and servers, and the storage management software handles data sharing, storage management, and disaster recovery tasks for multiple servers (Hashem et al., 2015).

When dealing with Big Data and BDA, the storage system is not physically separated from the processing system.  There are various storage types, such as hard drives, solid-state memory, object storage, optical storage, and cloud storage, each with its own advantages and limitations.  Thus, organizations must examine the goals and objectives of their data storage before selecting any of these storage media.  Table 1 shows a comparison of storage media, adapted from (Hashem et al., 2015).

Table 1.  Comparison of Storage Media.  Adapted from (Hashem et al., 2015).

The Hadoop Distributed File System (HDFS) is a primary component of the Hadoop technology which emerged to deal with Big Data and BDA; the other major component is MapReduce.  The Hadoop framework has been described as the de facto standard for Big Data storage and processing (Jinquan, Jie, Shengsheng, Yan, & Yuanhao, 2012).  HDFS is a distributed file system designed to run on top of the local file systems of the cluster nodes.  It stores extremely large files for streaming access, is highly fault-tolerant, and can scale from a single server to thousands of nodes, each offering local computation and storage.
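
As a small illustration of interacting with HDFS, the sketch below lists a directory through the WebHDFS REST interface using Python’s requests library; the NameNode address and path are placeholders, and WebHDFS is assumed to be enabled on the cluster.

```python
# Hedged sketch: list an HDFS directory via the WebHDFS REST API.
import requests

NAMENODE = "http://namenode.example.com:9870"       # placeholder NameNode address
path = "/data/claims"                               # placeholder HDFS directory

resp = requests.get(f"{NAMENODE}/webhdfs/v1{path}", params={"op": "LISTSTATUS"})
resp.raise_for_status()
for entry in resp.json()["FileStatuses"]["FileStatus"]:
    print(entry["pathSuffix"], entry["length"], "bytes")
```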

Cloud Computing technology can meet the requirements of Big Data and BDA by offering an effective framework and platform for both computation and storage.  Thus, organizations which intend to take advantage of Big Data and BDA utilize Cloud Computing technology.  However, the use of Cloud Computing does not come without a price: security and privacy have been major concerns for Cloud Computing users and organizations.  Although Cloud Computing offers several benefits, from scalability and fault tolerance to data storage, its adoption is curbed by security and privacy concerns.  Organizations must take appropriate security measures for data in storage, in transit, and in processing, such as SSL, encryption, access control, and Multi-Factor Authentication.

In summary, Big Data comes with big storage requirements.  Organizations have been facing various challenges when dealing with Big Data, such as data storage and data processing.  The data storage issue is partially solved by Cloud Computing technology.  However, until the security and privacy issues in the Cloud Computing platform are resolved, organizations must apply robust security measures to mitigate and alleviate the security risks.

References

Arora, M., & Bahuguna, H. (2016). Big Data Security–The Big Challenge.

Hashem, I. A. T., Yaqoob, I., Anuar, N. B., Mokhtar, S., Gani, A., & Khan, S. U. (2015). The rise of “big data” on cloud computing: Review and open research issues. Information Systems, 47, 98-115.

Jinquan, D., Jie, H., Shengsheng, H., Yan, L., & Yuanhao, S. (2012). The Hadoop Stack: New Paradigm for Big Data Storage and Processing. Intel Technology Journal, 16(4), 92-110.

Building Blocks of a System for Healthcare Big Data Analytics

Dr. Aly, O.
Computer Science

Introduction

The purpose of this discussion is to create the building blocks of a system for healthcare Big Data Analytics and to compare that building-block design to a DNA networked cluster currently used by organizations in the market.

The discussion begins with the Cloud Computing building blocks, followed by the Big Data Analytics building blocks and DNA sequencing.  The discussion also addresses the building blocks for healthcare analytics, the building blocks for a DNA sequencing system, and a comparison between the two systems.

Cloud Computing Building Blocks

The Cloud Computing model contains two elements, the front end and the back end, both connected to the network.  The user interacts with the system using the front end, while the cloud itself is the back end.  The front end is the client which the user uses to access the cloud through a device such as a smartphone, tablet, or laptop.  The back end, represented by the Cloud, provides the applications, computers, servers, and data storage which create the services (IBM, 2012).

As indicated in (Macias & Thomas, 2011), three building blocks are required to enable Cloud Computing. The first block is the “Infrastructure,” where the organization can optimize data center consolidation, enhance network performance, connect anyone, anywhere seamlessly, and implement pre-configured solutions.  The second block is the “Applications,” where the organization can identify applications for rapid deployment, and utilize automation and orchestration features.  The third block is the “Services,” where the organization can determine the right implementation model, and create a phased cloud migration plan.

In (Mousannif, Khalil, & Kotsis, 2013-14), the building blocks of Cloud Computing involve the physical layer, the virtualization layer, and the service layer.  Virtualization is a basic building block in Cloud Computing.  Virtualization is the technology which hides the physical characteristics of the computing platform from the front-end users.  Virtualization provides an abstract and emulated computing platform.  Clusters and grids are features of Cloud Computing used for high-performance computing applications such as simulations. Other building blocks of Cloud Computing include Service-Oriented Architectures (SOA) and Web Services (Mousannif et al., 2013-14).

Big Data Building Blocks

As indicated in (Verhaeghe, n.d.), there are four major building blocks for Big Data Analytics.  The first building block is Big Data Management, which enables the organization to capture, store, and protect the data. The second building block is Big Data Analytics, which extracts value from the data.  Big Data Integration is the third building block, which ensures the application of governance over the data.  The last building block is Big Data Applications, which allow the organization to apply the first three building blocks using the Big Data technologies.

DNA Sequencing

DNA stands for Deoxyribonucleic Acid, which represents the smallest building block of life (Matthews, 2016).  As indicated in (Salzberg, 1999), advances in biotechnology have produced enormous volumes of DNA-related information.  However, the rate of data generation is outpacing the ability of scientists to analyze the data.  DNA Sequencing is a technique used to determine the order of the four chemical building blocks, called “bases,” which make up the DNA molecule (genome.gov, 2015).  The sequence provides the kind of genetic information which is carried in a particular DNA segment.  DNA sequencing can provide valuable information about the role of inheritance in susceptibility to disease and in response to environmental influences.  Moreover, DNA sequencing enables rapid and cost-effective diagnosis and treatment.  Markov chains and hidden Markov models are probabilistic techniques which can be used to analyze the results of DNA sequencing (Han, Pei, & Kamber, 2011).  An example of a DNA Sequencing application is discussed and analyzed in (Leung et al., 2011), where the researchers employed Data Mining on DNA sequence data sets for the Hepatitis B Virus.

DNA Sequencing used to be performed on non-networked computers, using a limited subset of data due to limited computer processing speed (Matthews, 2016).  However, DNA Sequencing has been benefiting from various advanced technologies and techniques.  Predictive Analytics is an example of these techniques; applied to DNA Sequencing, it results in Predictive Genomics.  Cloud Computing plays a significant role in the success of Predictive Genomics for two major reasons.  The first reason is the volume of the genomic data, while the second reason is the low cost (Matthews, 2016).  Cloud Computing is becoming a valuable tool for various domains including DNA Sequencing.   As cited in (Blaisdell, 2017), a study by Transparency Market Research showed that the healthcare Cloud Computing market is expected to evolve further, reaching up to $6.8 billion by 2018.

Building Block for Healthcare System

Healthcare data requires protection due to security and privacy concerns.  Thus, a Private Cloud will be used in this use case.  To build a Private Cloud, the virtualization layer, the physical layer, and the service layer are required.  The virtualization layer consists of a hypervisor to allow multiple operating systems to share a single hardware system.  The hypervisor is a program which controls the host processors and resources by allocating the resources to each operating system.  There are two types of hypervisors: native (also called bare-metal or Type 1) and hosted (also called Type 2).  Type 1 runs directly on the physical hardware, while Type 2 runs on a host operating system which runs on the physical hardware.  Examples of native hypervisors include VMware ESXi and Microsoft Hyper-V; examples of hosted hypervisors include Oracle VirtualBox and VMware Workstation.  The physical layer can consist of two computer pools, one for PCs and the other for servers (Mousannif et al., 2013-14).

In (Archenaa & Anita, 2015), the researchers illustrated a secure Healthcare Analytics System.  The electronic health record is a heterogeneous dataset which is given as input to HDFS through Flume and Sqoop. The analysis of the data is performed using MapReduce and Hive by implementing a Machine Learning algorithm to analyze similar patterns in the data and to predict the risk to a patient's health condition at an early stage.  The HBase database is used for storing the multi-structured data. Storm is used to perform live streaming and to detect emergency conditions, such as a patient's temperature falling beyond the expected level. A Lambda function is also used in this healthcare system.  The final component of the building blocks in the Healthcare system involves the reports generated by the top-layer tools such as “Hunk.”  Figure 1 illustrates this Healthcare Analytics System, adapted from (Archenaa & Anita, 2015).

Figure 1.  Healthcare Analytics System. Adapted from (Archenaa & Anita, 2015)

Building Block for DNA and Next Generation Sequencing System

Besides DNA Sequencing, there is next-generation sequencing (NGS), whose use has been increasing exponentially since 2007 (Bhuvaneshwar et al., 2015).  In (Bhuvaneshwar et al., 2015), the Globus Genomics System is proposed as an enhanced Galaxy workflow system made available as a service, offering users the capability to process and transfer data easily, reliably, and quickly.  This system addresses the end-to-end NGS analysis requirements and is implemented on the Amazon Cloud Computing infrastructure.  Figure 2 illustrates the framework for the Globus Genomics System, taking into account the security measures for protecting the data.  Examples of healthcare organizations which are using genomic sequencing include Kaiser Permanente in Northern California and Geisinger Health System in Pennsylvania (Khoury & Feero, 2017).

Figure 2. Globus Genomics System for Next Generation Sequencing (NGS). Adapted from (Bhuvaneshwar et al., 2015).

In summary, Cloud Computing has reshaped the healthcare industry in many aspects.  Healthcare Cloud Computing and Analytics provide many benefits, from easy access to electronic patient records to DNA Sequencing and NGS.  The building blocks of Cloud Computing must be implemented with care for security and privacy considerations to protect the patients' data from unauthorized users.  The building blocks for the Healthcare Analytics system involve advanced technologies such as Hadoop, MapReduce, Storm, and Flume, as illustrated in Figure 1.  The building blocks for the DNA Sequencing and NGS system involve the Dynamic Worker Pool, HTCondor, Shared File System, Elastic Provisioner, Globus Transfer and Nexus, and Galaxy, as illustrated in Figure 2.  Each system has the required building blocks to perform its analytics tasks.

References

Archenaa, J., & Anita, E. M. (2015). A survey of big data analytics in healthcare and government. Procedia Computer Science, 50, 408-413.

Bhuvaneshwar, K., Sulakhe, D., Gauba, R., Rodriguez, A., Madduri, R., Dave, U., . . . Madhavan, S. (2015). A case study for cloud-based high throughput analysis of NGS data using the globus genomics system. Computational and structural biotechnology journal, 13, 64-74.

Blaisdell, R. (2017). DNA Sequencing in the Cloud. Retrieved from https://rickscloud.com/dna-sequencing-in-the-cloud/.

genome.gov. (2015). DNA Sequencing. Retrieved from https://www.genome.gov/10001177/dna-sequencing-fact-sheet/.

Han, J., Pei, J., & Kamber, M. (2011). Data mining: concepts and techniques: Elsevier.

IBM. (2012). Cloud computing fundamentals: A different way to deliver computer resources. Retrieved from https://www.ibm.com/developerworks/cloud/library/cl-cloudintro/cl-cloudintro-pdf.pdf.

Khoury, M. J., & Feero, G. (2017). Genome Sequencing for Healthy Individuals? Think Big and Act Small! Retrieved from https://blogs.cdc.gov/genomics/2017/05/17/genome-sequencing-2/.

Leung, K., Lee, K., Wang, J., Ng, E. Y., Chan, H. L., Tsui, S. K., . . . Sung, J. J. (2011). Data mining on dna sequences of hepatitis b virus. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), 8(2), 428-440.

Macias, F., & Thomas, G. (2011). Three Building Blocks to Enable the Cloud. Retrieved from https://www.cisco.com/c/dam/en_us/solutions/industries/docs/gov/white_paper_c11-675835.pdf.

Matthews, K. (2016). DNA Sequencing. Retrieved from https://cloudtweaks.com/2016/11/cloud-dna-sequencing/.

Mousannif, H., Khalil, I., & Kotsis, G. (2013-14). Collaborative learning in the clouds. Information Systems Frontiers, 15(2), 159-165. doi:10.1007/s10796-012-9364-y

Salzberg, S. L. (1999). Gene discovery in DNA sequences. IEEE Intelligent Systems and their Applications, 14(6), 44-48.

Verhaeghe, X. (n.d.). The Building Blocks of a Big Data Strategy. Retrieved from https://www.oracle.com/uk/big-data/features/bigdata-strategy/index.html.

Use Case: Analysis of Heart Disease

Dr. Aly, O.
Computer Science

Abstract

The purpose of this project is to articulate all the steps conducted to perform the analysis of a heart disease use case.  The project contained two main phases: Phase 1: Sandbox Configuration, and Phase 2: Heart Disease Use Case.  The setup and the configurations are not trivial and did require the integration of Hive with MapReduce and Tez.  It also required the integration of R and RStudio with Hive to perform transactions to retrieve and aggregate data.  The analysis included Descriptive Analysis for all patients and then drilled down to focus on gender: females and males.  Moreover, the analysis included the Decision Tree and the Fast-and-Frugal Trees (FFTrees).  The researcher of this paper is in agreement with other researchers that Big Data Analytics and Data Mining can play a significant role in healthcare in various areas such as patient care, healthcare records, and fraud detection and prevention.

Keywords: Decision Tree, Diagnosis of Heart Disease.

Introduction

            The medical records and the databases which store these records are increasing rapidly.  This rapid increase is leading researchers and practitioners to employ Big Data technologies.  The Data Mining technique plays a significant role in finding patterns and in extracting knowledge to provide better patient care and effective diagnostic capabilities.  As indicated in (Koh & Tan, 2011), “In healthcare, data mining is becoming increasingly popular, if not increasingly essential.”  Healthcare can benefit from Data Mining applications in various areas such as the evaluation of treatment effectiveness, customer and patient relationship management, healthcare management, and fraud detection and prevention.  Other benefits include predictive medicine and the analysis of DNA micro-arrays.

Various research studies employed various Data Mining techniques in healthcare.  In (Alexander & Wang, 2017), the main objective of the study was to identify the usage of Big Data Analytics to predict and prevent heart attacks.  The results showed that Big Data Analytics is useful in predicting and preventing heart attacks.  In (Dineshgar & Singh, 2016), the purpose of the study was to develop a prototype Intelligent Heart Disease Prediction System (IHDPC) using Data Mining techniques.  In (Karthiga, Mary, & Yogasini, 2017), the researchers utilized Data Mining techniques to predict heart disease using the Decision Tree algorithm and Naïve Bayes.  The result showed a prediction accuracy of 99%. Thus, Data Mining techniques enable the healthcare industry to predict patterns.  In (Kirmani & Ansarullah, 2016), the researchers also applied Data Mining techniques with the aim of investigating the results of applying different types of Decision Tree methods to obtain better performance in heart disease prediction.  These research studies are examples of the vast literature on the use of Big Data Analytics and Data Mining in the healthcare industry.

            In this project, the heart disease dataset is utilized as the Use Case for Data Mining application.  The project used Hortonworks sandbox, with Hive, MapReduce, and Tez.  The project also integrated R with Hive to perform statistical analysis including Decision Tree method.  The project utilized techniques from various research studies such as (Karthiga et al., 2017; Kirmani & Ansarullah, 2016; Martignon, Katsikopoulos, & Woike, 2008; Pandey, Pandey, & Jaiswal, 2013; Phillips, Neth, Woike, & Gaissmaier, 2017; Reddy, Raju, Kumar, Sujatha, & Prakash, 2016).

            The project begins with Phase 1 of Sandbox Configuration, followed by Phase 2 of the Heart Disease Use Case.  The Sandbox configuration included the environment set up from mapping the sandbox IP to the Ambari Console management and the Integration of R and RStudio with Hive.  The Heart Disease Use Case involved fourteen steps starting from understanding the dataset to the analysis of the result.  The project articulates the steps and the commands as required for this project. 

Phase 1:  Sandbox Configuration

1.      Environment Setup

The environment setup begins with the installation of the Virtual Box and Hortonworks Sandbox.

  1. Install Oracle VM VirtualBox from http://www.virtualbox.org
  2. Install Hortonworks Docker Sandbox version 2.6.4 from http://hortonworks.com/sandbox

After the installation, the environment must be fully configured to function, using the following steps.

1.1       IP Address and HTTP Web Port

After the Sandbox is installed, the host is assigned an IP address depending on whether a virtual machine (VMware, VirtualBox) or a container (Docker) is used. After the installation is finished, the local IP address is assigned with the HTTP web access port of 8888.  Thus, local access uses http://127.0.0.1:8888/

1.2       Map the Sandbox IP to Desired Hostname in the Hosts file

            The IP address can be mapped to a hostname using the hosts file as shown below. After setting the hostname to replace the IP address, the sandbox can be accessed from the browser using http://hortonworks-sandbox.com:8888.
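For example, the hosts file is located at /etc/hosts on Linux and macOS and at C:\Windows\System32\drivers\etc\hosts on Windows, and a minimal entry (the IP address shown is illustrative and depends on the local installation) would be:

127.0.0.1    hortonworks-sandbox.com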

1.3     Roles and User Access

There are five major users with different roles in the Hortonworks Sandbox.  These users, together with their roles, services, and passwords, are summarized in Table 1.

Table 1.  Users, Roles, Service, and Passwords.

With respect to access, PuTTY can be used to access the sandbox using SSH on port 2222.  On the first login, the root user gets a prompt to specify a new password.
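Alternatively, any command-line SSH client can be used; a minimal example, assuming the hostname mapping configured earlier, is:

  • Execute: $ssh root@hortonworks-sandbox.com -p 2222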

1.4       Shell Web Client Method

The shell web client, also known as Shell-in-a-Box, is used to issue shell commands without installing additional software.  It uses port 4200.  The admin password can be reset using the shell web client.

1.5       Transfer Data and Files between the Sandbox and Local Machine.

            To transfer files and data between the local machine and the sandbox, the secure copy (scp) command can be used in both directions, as illustrated below.
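The following is a minimal sketch of both directions; the file names and paths are illustrative only, and the commands assume SSH port 2222 and the hostname mapping configured earlier.

  • To transfer from the local machine to the sandbox, Execute: $scp -P 2222 heart.dat root@hortonworks-sandbox.com:/root/
  • To transfer from the sandbox to the local machine, Execute: $scp -P 2222 root@hortonworks-sandbox.com:/root/results.csv .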

1.6       Ambari Console and Management

            The admin can manage Ambari through the web interface on port 8080 at http://hortonworks-sandbox.com:8080, using the admin user and password.  The admin can operate the cluster, manage users and groups, and deploy views. The cluster section is the primary UI for Hadoop operators and allows the admin to grant permissions to Ambari users and groups.

2.      R and RStudio Setup

To download and install RStudio Server, follow these steps:

  •  Execute: $sudo yum install rstudio-server-rhel-0.99.893-x86_64.rpm
  • Install dpkg to divert the location of /sbin/initctl
    • Execute: $yum install dpkg
    • Execute: $dpkg-divert --local --rename --add /sbin/initctl
    • Execute: $ln -s /bin/true /sbin/initctl
  • Install R and verify the installation of RStudio.
    • Execute: $yum install -y R
    • Execute: $yum -y install libcurl-devel
    • Execute: $rstudio-server verify-installation
  • The default port of RStudio Server is 8787, which is not opened in the Docker Sandbox.  You can use port 8090, which is opened for Docker.
    • Execute: $sed -i "1 a www-port=8090" /etc/rstudio/rserver.conf
    • Restart the server by typing: exec /usr/lib/rstudio-server/bin/rserver
    • This will close your session. However, you can now browse to RStudio using port 8090.
    • The RStudio login is amy_ds/amy_ds
  • Alternatively, open up the RStudio port 8787 by implementing the following steps:
    • Access the VM VirtualBox Manager tool.
    • Click on the Hortonworks VM → Network → Advanced → Port Forwarding.  Add port 8787 for RStudio.
    • After you open up the port, modify /etc/rstudio/rserver.conf to reflect port 8787.
    • Stop and start the VM.

Phase 2:  Heart Disease Use Case

 1.      Review and Understand the Dataset

  • Obtain the heart disease dataset from the archive site at:

http://archive.ics.uci.edu/ml/datasets/.

  • Review the dataset.  The dataset has fourteen variables (N=271).  Table 2 describes these attributes.

Table 2.  Heart Disease Dataset Variables Description.

  • Load the heart.dat file into the Hadoop Distributed File System (HDFS):
    1. Login to Ambari using amy_ds/amy_ds.
    2. File View → user → amy_ds → upload
  • Start the Hive database.  Create a table “heart” and import the dataset into the Hive database.
  • Retrieve the top 10 records for verification (a minimal sketch of these HiveQL statements is shown below).
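The following HiveQL sketch illustrates these steps; the column names and the field delimiter are assumptions based on the dataset description in Table 2 and must be adjusted to match the actual heart.dat file:

CREATE TABLE heart (
  age DOUBLE, sex DOUBLE, chestpain DOUBLE, restingbp DOUBLE, chol DOUBLE,
  sugar DOUBLE, ecg DOUBLE, heartrate DOUBLE, angina DOUBLE, oldpeak DOUBLE,
  slope DOUBLE, vessels DOUBLE, thal DOUBLE, diagnosis INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ';

LOAD DATA INPATH '/user/amy_ds/heart.dat' INTO TABLE heart;

SELECT * FROM heart LIMIT 10;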

2.      Configure MapReduce As the Execution Engine in Hive

Hive can be configured to use MapReduce as its execution engine in order to take advantage of the MapReduce feature in Hive.

  1. Click on the Hive Settings tab.
  2. Click Add New and add the following Key: Value pairs.
    1. Key: hive.execution.engine → Value: mr (for MapReduce).
    2. Key: hive.auto.convert.join → Value: false.
  3. Test the query using MapReduce as the execution engine.  The following query ran using MapReduce (an illustrative query is sketched below).
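For example, any aggregation over the heart table will exercise the configured engine; the query below is only illustrative:

SELECT sex, COUNT(*) AS patients FROM heart GROUP BY sex;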
3.      Configure Tez As the Execution Engine in Hive

The user can also modify the value of hive.execution.engine from mr to tez, since Hive is enabled on Tez execution.  This takes advantage of a single DAG execution representing the query, instead of multiple stages of a MapReduce program which involve a lot of synchronization, barriers, and I/O overhead.

  1. Click on the Hive Settings tab.
  2. Click Add New and add the following Key: Value pairs.
    1. Key: hive.execution.engine → Value: tez.
    2. Key: hive.auto.convert.join → Value: false.

4.      Integrate TEZ with Hive for Directed Acyclic Graph (DAG)

This integration is implemented on Tez to also take advantage of the Directed Acyclic Graph (DAG) execution. This technique is improved in Tez by writing intermediate datasets into memory instead of to the hard disk.

  1. Go to Settings in Hive view.
  2. Change the hive.execution.engine to tez.

5.      Track Hive on Tez jobs in HDP Sandbox using the Web UI.

  1. Track the job from the browser at http://hortonworks-sandbox.com:8088/cluster, while it is running or afterwards, to see the details.
    1. Retrieve the average age and average cholesterol by gender for females and males (a sample query is sketched below).
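A minimal HiveQL sketch of this aggregation, assuming the column names used earlier and the dataset's coding of sex = 0 for female and sex = 1 for male, is:

SELECT sex, AVG(age) AS avg_age, AVG(chol) AS avg_chol
FROM heart
GROUP BY sex;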

6. Monitor Cluster Metrics

  1. Monitor the cluster metrics (for example, from the Ambari dashboard).

7. Review the Statistics of the table from Hive.

  1. Table → Statistics → Recompute and check Include Columns.
  2. Click on Tez View.
  3. Click Analyze.
  4. Click Graphical View.

8.      Configure ODBC for Hive Database Connection

  1. Configure a User Data Source in ODBC on the client to connect to the Hive database.
  2. Test the ODBC connection to Hive.

9.      Setup R to use the ODBC for Hive Database Connection

  1. Execute the following to install the ODBC package in R:
    1. >install.packages("RODBC")
  2. Execute the following to load the required library for the database connection from R to Hive:
    1. >library("RODBC")
  3. Execute the following command to establish the database connection from R to Hive:
    1. >cs881 <- odbcConnect("Hive")
  4. Execute the following to retrieve the top 10 records from Hive within R using the ODBC connection (HiveQL uses LIMIT rather than TOP):
    1. >sqlQuery(cs881, "SELECT * FROM heart LIMIT 10")

10.   Create Data Frame

  • Execute the following command to create a data frame:
    • >heart_df <- sqlQuery(cs881, "SELECT * FROM heart")
  • Review the headers of the columns:
    • >print(head(heart_df))
  • Review and analyze the statistics summary:
    • >summary(heart_df)
  • List the names of the columns:
    • >names(heart_df)

11.   Analyze the Data using Descriptive Analysis

  • Find the Heart Disease Patients, Age and Cholesterol Level
    • Among all genders
    • >age_chol_heart_disease <- sqlQuery(cs881, "SELECT age, chol FROM heart WHERE diagnosis = 1")
    • >summary(age_chol_heart_disease)

Figure 1.  Cholesterol Level Among All Heart Disease Patients.

  • Among Female Patients
    • >age_chol_heart_disease_female <- sqlQuery(cs881, "SELECT age, chol FROM heart WHERE diagnosis = 1 AND sex = 0")
    • >summary(age_chol_heart_disease_female)

Figure 2.  Cholesterol Level Among Heart Disease Female Patients.

  • Among Male Patients
    • >age_chol_heart_disease_male <- sqlQuery(cs881, "SELECT age, chol FROM heart WHERE diagnosis = 1 AND sex = 1")
    • >summary(age_chol_heart_disease_male)

Figure 3.  Cholesterol Level Among Heart Disease Male Patients.
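To complement these summaries, a minimal R sketch (assuming the heart_df data frame created earlier and the same column names) can visualize the cholesterol distribution of heart disease patients by gender:

>heart_patients <- heart_df[heart_df$diagnosis == 1, ]   # keep only diagnosed patients
>boxplot(chol ~ sex, data = heart_patients,
         names = c("Female", "Male"),
         xlab = "Gender", ylab = "Cholesterol",
         main = "Cholesterol Among Heart Disease Patients by Gender")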

 12.  Analyze the Data using Decision Tree

  • Print the headers of the columns.
  • Create input.dat for the diagnosis, age, sex, and sugar attributes.
  • Create a png file.
  • Install the party library.
    • >install.packages("party")
    • >library(party)
  • Load the Decision Tree (a minimal sketch of these steps is shown below).
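The following R sketch illustrates these steps, assuming the heart_df data frame created earlier and the column names diagnosis, age, sex, and sugar; the output file name is illustrative:

>print(names(heart_df))                                   # review the column headers
>input.dat <- heart_df[, c("diagnosis", "age", "sex", "sugar")]
>input.dat$diagnosis <- as.factor(input.dat$diagnosis)    # treat diagnosis as a class label
>install.packages("party")
>library(party)
>heart_tree <- ctree(diagnosis ~ age + sex + sugar, data = input.dat)
>png(file = "heart_decision_tree.png")                    # write the tree plot to a png file
>plot(heart_tree)
>dev.off()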

Figure 4.  Decision Tree for Heart Disease Patients.

13.    Create FFTree for heart disease
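The fast-and-frugal tree was created with the FFTrees package (Phillips et al., 2017).  The following is a minimal sketch only, assuming the heart_df data frame created earlier with a binary diagnosis column (1 = heart disease, 0 = no heart disease); the title and decision labels are illustrative:

>install.packages("FFTrees")
>library(FFTrees)
>heart_df$diagnosis <- heart_df$diagnosis == 1            # FFTrees expects a binary (logical) criterion
>heart_fft <- FFTrees(formula = diagnosis ~ .,
                      data = heart_df,
                      main = "Heart Disease",
                      decision.labels = c("Low-Risk", "High-Risk"))
>plot(heart_fft)                                          # tree plus sensitivity/specificity statistics
>summary(heart_fft)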

Figure 5.  FFT Decision Tree with Low Risk and High-Risk Patients.

Figure 6.  Sensitivity and Specificity for Heart Disease Patients Using FFTree.

Figure 7. Custom Heart Disease FFTree.

14.    Analysis of the Results

The analysis of the heart disease dataset included descriptive analysis and a decision tree.  The result of the descriptive analysis showed that the minimum age among the patients who are exposed to heart disease is 29 years old, while the maximum age is 79, with a median and mean of 52 years old.  The result also showed that the minimum cholesterol level for these patients is 126, while the maximum is 564, with a median of 236 and a mean of 244, suggesting that the cholesterol level also increases with age.

The descriptive analysis then drilled down and focused on gender (female vs. male) to identify the impact of age on cholesterol for the heart disease patients.  The result showed the same trend among the female heart disease patients, with a minimum age of 34 and a maximum age of 76, and a median and mean of 54.   The cholesterol level among female heart disease patients ranges from a minimum of 141 to a maximum of 564, with a median of 250 and a mean of 257.  The maximum cholesterol level of 564 is an outlier, along with another outlier at the age of 76.   With respect to the male heart disease patients, the result showed the same trend.  Among the male heart disease patients, the results showed a minimum age of 29 years old and a maximum age of 70 years old, with a median of 52 and a mean of 51.  The cholesterol level among these male heart disease patients showed a minimum of 126 and a maximum of 325, with a median and mean of 233. There is another outlier among the male heart disease patients at the age of 29.  Due to these outliers in the dataset among the female and male heart disease patients, the comparison between male and female patients will not be accurate.   However, Figure 5 and Figure 6 show similarities in the impact of age on the cholesterol level between both genders.

With regard to the decision tree, the first decision tree shows the data partitioned among six nodes.  The first two nodes relate to the Resting Blood Pressure (RBP) attribute (RestingBP). The first node of this cluster shows 65 heart disease patients with an RBP of 138 or less, while the second node shows 41 heart disease patients with an RBP greater than 138.  These two nodes correspond to a vessel count of zero or less with a heart rate of 165 or less.  For a vessel count exceeding zero, there is another node with 95 patients.  The second set of nodes corresponds to a heart rate greater than 165.  These three nodes are split on the vessels attribute: vessels less than or equal to zero, and vessels greater than zero.  Two nodes fall under the first category of zero or less, with 22 heart disease patients having a heart rate of 172 or less. The last node shows a vessel count greater than zero with 15 heart disease patients.  The FFTree results show that the high-risk heart disease patients are those with a vessel count greater than zero, while the low-risk patients are those with a vessel count of zero or less.

Conclusion

The purpose of this project was to articulate all the steps conducted to perform the analysis of a heart disease use case.  The project contained two main phases: Phase 1: Sandbox Configuration, and Phase 2: Heart Disease Use Case.  The setup and the configurations are not trivial and did require the integration of Hive with MapReduce and Tez.  It also required the integration of R and RStudio with Hive to perform transactions to retrieve and aggregate data.  The analysis included Descriptive Analysis for all patients and then drilled down to focus on gender: females and males.  Moreover, the analysis included the Decision Tree and the FFTrees.  The researcher of this paper is in agreement with other researchers that Big Data Analytics and Data Mining can play a significant role in healthcare in various areas such as patient care, healthcare records, and fraud detection and prevention.

References

Alexander, C., & Wang, L. (2017). Big data analytics in heart attack prediction. The Journal of Nursing Care, 6(393).

Dineshgar, G. P., & Singh, L. (2016). A Review of Data Mining For Heart Disease Prediction. International Journal of Advanced Research in Electronics and Communication Engineering (IJARECE), 5(2).

Karthiga, A. S., Mary, M. S., & Yogasini, M. (2017). Early Prediction of Heart Disease Using Decision Tree Algorithm. International Journal of Advanced Research in Basic Engineering Sciences and Technology (IJARBEST), 3(3).

Kirmani, M. M., & Ansarullah, S. I. (2016). Prediction of Heart Disease using Decision Tree a Data Mining Technique. IJCSN International Journal of Computer Science and Network, 5(6), 885-892.

Koh, H. C., & Tan, G. (2011). Data mining applications in healthcare. Journal of healthcare information management, 19(2), 65.

Martignon, L., Katsikopoulos, K. V., & Woike, J. K. (2008). Categorization with limited resources: A family of simple heuristics. Journal of Mathematical Psychology, 52(6), 352-361.

Pandey, A. K., Pandey, P., & Jaiswal, K. (2013). A heart disease prediction model using decision tree. IUP Journal of Computer Sciences, 7(3), 43.

Phillips, N. D., Neth, H., Woike, J. K., & Gaissmaier, W. (2017). FFTrees: A toolbox to create, visualize, and evaluate fast-and-frugal decision trees. Judgment and Decision Making, 12(4), 344.

Reddy, R. V. K., Raju, K. P., Kumar, M. J., Sujatha, C., & Prakash, P. R. (2016). Prediction of heart disease using decision tree approach. International Journal of Advanced Research in Computer Science and Engineering, 6(3).

Big Data Analytics Framework and Relevant Tools Used in Healthcare Data Analytics.

Dr. Aly, O.
Computer Science

Introduction

The purpose of this discussion is to discuss and analyze Big Data Analytics framework and relevant tools used in healthcare data analytics.  The discussion also provides examples of how healthcare organizations can implement such a framework.

Healthcare can benefit from Big Data Analytics in various domains such as decreasing overhead costs, curing and diagnosing diseases, increasing profit, predicting epidemics, and improving the quality of human life (Dezyre, 2016).  Healthcare organizations have been generating very large volumes of data, mostly driven by various regulatory requirements, record keeping, compliance, and patient care.  There is a projection from McKinsey that Big Data Analytics in healthcare can decrease the costs associated with data management by $300-$500 billion.  Healthcare data includes electronic health records (EHR), clinical reports, prescriptions, diagnostic reports, medical images, pharmacy data, insurance information such as claims and billing, social media data, and medical journals (Eswari, Sampath, & Lavanya, 2015; Ward, Marsolo, & Froehle, 2014).

Various healthcare organizations, such as scientific research labs, hospitals, and other medical organizations, are leveraging Big Data Analytics to reduce the costs associated with healthcare by modifying the treatment delivery models.  Several Big Data Analytics technologies have been applied in the healthcare industry.  For instance, Hadoop technology has been used in healthcare analytics in various domains.  Examples of Hadoop applications in healthcare include cancer treatment and genomics, monitoring patient vitals, the hospital network, healthcare intelligence, and fraud prevention and detection (Dezyre, 2016).  Thus, this discussion is limited to the Hadoop technology in healthcare.  The discussion begins with the types of analytics and the potential benefits of some of these analytics in healthcare, followed by the main discussion of the Hadoop Framework for Diabetes, including its major components, the Hadoop Distributed File System (HDFS) and Map/Reduce.

Types of Analytics

There are four major analytics types:  Descriptive Analytics, Predictive Analytics, Prescriptive Analytics (Apurva, Ranakoti, Yadav, Tomer, & Roy, 2017; Davenport & Dyché, 2013; Mohammed, Far, & Naugler, 2014), and Diagnostic Analytics (Apurva et al., 2017).  Descriptive Analytics is used to summarize historical data to provide useful information.  Predictive Analytics is used to predict future events based on previous behavior using data mining techniques and modeling.  Prescriptive Analytics supports the use of various data model scenarios, such as multi-variable simulation and detecting hidden relationships between different variables; it is useful for finding an optimum solution and the best course of action using algorithms.  Prescriptive Analytics, as indicated in (Mohammed et al., 2014), is less used in the clinical field.  Diagnostic Analytics is described as an advanced type of analytics used to get to the cause of a problem using drill-down techniques and data discovery.

Hadoop Framework for Diabetes

A predictive analysis algorithm is utilized by (Eswari et al., 2015) in a Hadoop/MapReduce environment to predict the prevalent diabetes types, the complications associated with each diabetic type, and the required treatment type.  The analysis by (Eswari et al., 2015) was performed on Indian patients.  According to the World Health Organization, as cited in (Eswari et al., 2015), the probability that a patient between the ages of 30 and 70 dies from one of the four major Non-Communicable Diseases (NCDs), namely diabetes, cancer, stroke, and respiratory disease, is 26%.   In 2014, 60% of all deaths in India were caused by NCDs.  Moreover, in accordance with the Global Status Report, as cited in (Eswari et al., 2015), NCDs are projected to claim 52 million patients globally by the year 2030.

The architecture for the predictive analysis included four phases:  Data Collection, Data Warehousing, Predictive Analysis, and Processing of Analyzed Reports.  Figure 1 illustrates the framework used for the Predictive Analysis System for healthcare applications, adapted from (Eswari et al., 2015).

Figure 1.  Predictive Analysis Framework for Healthcare. Adapted from (Eswari et al., 2015).

Phase 1:  The Data Collection phase included raw diabetic data which was loaded into the system.  The data is unstructured, including EHR, patient health records (PHR), clinical systems, and external sources such as government, labs, pharmacies, insurance, and so forth.  The data have different formats such as .csv, tables, and text.  The data collected from various sources in this first phase was stored in data warehouses.

Phase 2:  During the second phase of data warehousing, the data is cleansed and loaded to be ready for further processing.

Phase 3:  The third phase involved the Predictive Analysis, which used the predictive algorithm in the Hadoop Map/Reduce environment to predict and classify the type of diabetes mellitus (DM), the complications associated with each type, and the treatment type to be provided.  The Hadoop framework was used in this analysis because it can process extremely large amounts of health data by allocating partitioned data sets to numerous servers.  Hadoop utilized the Map/Reduce technology to solve different parts of the larger problem and integrate them into the final result.  Moreover, Hadoop utilized the Hadoop Distributed File System (HDFS) as the distributed file system. The Predictive Analysis phase involved Pattern Discovery and Predictive Pattern Matching.

With respect to the Pattern Discovery, it was important for DM to test patterns such as plasma glucose concentration, serum insulin, diastolic blood pressure, diabetes pedigree, Body Mass Index (BMI), age, and number of times pregnant.   The process of the Pattern Discovery included association rule mining between the diabetic type and other information such as lab results. It also included clustering to group similar patterns.  The classification step of the Pattern Discovery included the classification of patients' risk based on their health condition.  Statistics were used to analyze the Pattern Discovery.  The last step in the Pattern Discovery involved the application.  The process of the Pattern Discovery within the Predictive Analysis phase is illustrated in Figure 2.

Figure 2.  Pattern Discovery of the Predictive Analysis.

With respect to the Predictive Pattern Matching of the Predictive Analysis, the Map/Reduce operation was performed whenever the warehoused dataset was sent to the Hadoop system.  Pattern Matching is the process of comparing the analyzed threshold value with the obtained value.   The Mapping phase involved splitting the large data into small tasks for Worker/Slave Nodes (WN).  As illustrated in Figure 3, the Master Node (MN) consists of the Name Node (NN) and the Job Tracker (JT), which used the Map/Reduce technique.   The MN sends the order to the Worker/Slave Node, which processes the pattern matching task for the diabetes data with the help of the Data Node (DN) and Task Tracker (TT) which reside on the same machine as the WN.  When the WN completed the pattern matching based on the requirement, the result was stored on an intermediate disk, known as a local write.  When the MN initiated the reduce task, all other allocated Worker Nodes read the processed data from the intermediate disks.  The reduce task is performed in the WN based on the query received by the MN from the Client.  The results of the reduce phase are distributed to various servers in the cluster.

Figure 3.  Pattern Matching System Using Map/Reduce. Adapted from (Eswari et al., 2015).

Phase 4:  In this phase, the Analyzed Reports are processed and distributed to various servers in the cluster and replicated through several nodes depending on the geographical area.  Using the proper electronic communication technology to exchange the information of patients among healthcare centers can lead to obtaining proper treatment at the right time in remote locations at low cost.

The implementation of the Hadoop framework did help in transforming various health records of diabetic patients into useful analyzed results which help patients understand the complications associated with their type of diabetes.

References

Apurva, A., Ranakoti, P., Yadav, S., Tomer, S., & Roy, N. R. (2017, 12-14 Oct. 2017). Redefining cyber security with big data analytics. Paper presented at the 2017 International Conference on Computing and Communication Technologies for Smart Nation (IC3TSN).

Davenport, T. H., & Dyché, J. (2013). Big data in big companies. International Institute for Analytics.

Dezyre. (2016). 5 Healthcare Applications of Hadoop and Big Data Retrieved from https://www.dezyre.com/article/5-healthcare-applications-of-hadoop-and-big-data/85.

Eswari, T., Sampath, P., & Lavanya, S. (2015). Predictive methodology for diabetic data analysis in big data. Procedia Computer Science, 50, 203-208.

Mohammed, E. A., Far, B. H., & Naugler, C. (2014). Applications of the MapReduce Programming Framework to Clinical Big Data Analysis: Current Landscape and Future Trends. BioData mining, 7(1), 1.

Ward, M. J., Marsolo, K. A., & Froehle, C. M. (2014). Applications of business analytics in healthcare. Business Horizons, 57(5), 571-582.

NoSQL Database Application to Health Informatics Data Analytics.

Dr. Aly, O.
Computer Science

Introduction

The purpose of this discussion is to discuss and analyze a NoSQL database type, such as Cassandra or MongoDB, and how it is used or applied to health informatics data analytics. The discussion is based on a project implemented by (Klein et al., 2015).  In this project, the researchers performed application-specific prototyping and measurement to identify NoSQL products whose data model and query capabilities fit the use cases and meet the performance requirements of the provider.  The provider was a healthcare provider with specific requirements for an Electronic Health Record (EHR) system.  The project used three NoSQL databases, Cassandra, MongoDB, and Riak, as the candidates, based on the maturity of the products and the availability of enterprise support.  The researchers faced the challenge of selecting the right NoSQL database during their work on the project.

This research study is selected because it is comprehensive and has rich information about the implementation of these three data stores in healthcare. Moreover, the research study has additional useful information regarding healthcare, such as HL7 and the healthcare-specific data models of “FHIR Patient Resources” and “FHIR Observation Resources,” besides the performance framework YCSB.

NoSQL Database Application in Healthcare

The provider has been using a thick-client system running at each site around the globe and connected to a centralized relational database.  The provider had no experience with NoSQL.  The purpose of the project was to evaluate NoSQL databases which would meet their needs.

The provider was a large healthcare provider requesting a new EHR system which supports healthcare delivery for over nine million patients in more than 100 facilities across the world.  The rate of data growth is more than one terabyte per month, and the data must be retained for ninety-nine years.  NoSQL technology was considered for two major reasons.  The first reason was to serve as the primary data store for the EHR system.  The second reason was to improve request latency and availability by using a local cache at each site.

The project involved four major steps as discussed below.  Step four involved five major configuration tasks to test the identified data stores as discussed below as well.  This EHR system requires robust and strong replica consistency.  A comparison was performed between the identified data stores for the strong replica consistency vs. the eventual consistency.   

Project Implementation

Step 1: Identify the Requirements:  The first step in this project was to identify the requirements from the stakeholders of the provider. These requirements were used to develop the evaluation of the NoSQL databases.  There were two main requirements.  The first requirement involved high availability and low latency under high load in a distributed system.   This first requirement reflected performance and scalability as measures to evaluate the NoSQL candidates.  The second requirement involved the logical data models and query patterns supported by NoSQL, and replica consistency in a distributed framework.  This requirement reflected data model mapping as a measure to evaluate the NoSQL candidates.

Step 2:  Define Two Primary Use Cases for the Use of the EHR System:  The provider defined two specific use cases for the EHR system.  The first use case was to read the recent medical test results for a patient. This use case is regarded as the core function used to populate the user interface when a clinician selects a new patient.  The second use case was to achieve strong replica consistency when a new medical test result is written for a patient. The purpose of this strong replica consistency is to allow all clinicians using the EHR system to see the same information when making healthcare decisions for the patient, regardless of whether the patient is at the same site or at another location.

Step 3:  Select the Candidate NoSQL Database:  The provider requested the evaluation of different data models of NoSQL data stores such as key-value, column, and document to determine the best-fit NoSQL which can meet their requirements.  Thus, Cassandra, MongoDB, and Riak were the candidates for this project based on the maturity of the product and enterprise support.

Step 4:  Performance Tests Design and Execution:  A systematic test process was designed and executed to evaluate the three candidates based on the use case requirements defined earlier. This systematic test process included five major tasks, as summarized in Table 1.

Table 1:  Summary of the Performance Tests Design and Execution Tasks.

Task 1:  Test Environment Configuration:  The test environment was developed using the three identified NoSQL databases: MongoDB, Cassandra, and Riak.  Table 2 shows the identified NoSQL databases, types, versions, and sources.  The test environment included two configurations. The first configuration involved a single-node server.  The purpose of the first, single-node environment was to validate the test environment for each database type.  The second configuration involved a nine-node environment. The purpose of the nine-node environment was to represent the production environment, which was geographically distributed across three data centers.   The dataset was sharded across three nodes and then replicated to two additional groups; each group has three nodes.  The replication configuration was implemented according to each NoSQL database's mechanism.  For instance, with MongoDB, the replicas are implemented using the Primary/Secondary feature.  With Cassandra, the replica configuration is implemented using the built-in data center awareness distribution feature.  With Riak, the data was sharded across all nine nodes, with three replicas of each shard stored across the nine nodes.   Amazon EC2 (Elastic Compute Cloud) instances were used for the test environment implementation.  Table 3 describes the EC2 types and sources.

Table 2: Summary of Identified NoSQL Databases, Types, Versions, Sources, and Implementation.

Table 3:  Details of Nodes, Types, and Size.

Task 2:  Data Model Mapping:  A logical data model was mapped to the identified data model for healthcare.  The identified data model was HL7 Fast Healthcare Interoperability Resources (FHIR).  HL7 (Health Level Seven) refers to a set of international standards for transferring clinical and administrative data between software applications.  Various healthcare providers use these applications.  These international standards focus on the application layer, which is layer 7 in the OSI model (Beeler, 2010).   Two models were used: “FHIR Patient Resources” and “FHIR Observation Resources.”  A patient was modeled using the “FHIR Patient Resources,” with attributes such as names, address, and phone, while medical information such as test type, result quantity, and result units was modeled using “FHIR Observation Resources.”  The relationship between patient and test results was a one-to-many relationship (1:M), and the relationship between patient and observations was also one-to-many, as illustrated in Figure 1.

Figure 1:  The Logical Data Model and the Relationship between Patient, Test Result, and Observations.

This 1:M relationship between the patient and the test results and observations, and the need to efficiently access the most recently written test results, were very challenging for each identified NoSQL data store.  For MongoDB, the researchers used a composite index of two attributes (Patient ID, Observation ID) for the test result records, indexed by the lab result date-time stamp.  This approach enabled efficient retrieval of the most recent test result records for a particular patient.  For Cassandra, the researchers used a similar approach, but with a composite index of three attributes (Patient ID, Lab Result, Date-Time Stamp).  The retrieval of the most recent test results was efficient using this three-attribute composite index in Cassandra because the results were returned sorted by the server. With respect to Riak, the 1:M relationship was more complicated than in MongoDB and Cassandra.  The key-value data model of Riak enables the retrieval of a value which has a unique key.  Riak has a “secondary index” feature to avoid a full scan when the key is not known.   However, each node in the cluster stores the secondary indices only for those shards stored by the node.  A query matching a secondary index therefore results in a “scatter-gather” performed by the “request coordinator,” which asks each node for records with the requested secondary index value, waits for all nodes to respond, and then sends the list of keys for the matching records back to the requester.  This operation adds latency to locate records, and the need for two round trips to retrieve the records had a negative impact on the performance of the Riak data store.  There is no technique in Riak to filter and return only the most recent observations for a patient; thus, all matching records must be returned and then sorted and filtered by the client.   Table 4 summarizes the data model mapping for each of the identified data stores and the impact on performance.

Table 4:  Data Model Mapping and Impact on Performance.

Task 3: Data Generation and Load: The dataset contained one million patient records (Patient Records: N=1,000,000) and ten million test result records (Test Result Records: N=10,000,000). The number of test results per patient ranged from zero to twenty, with an average of seven (Test Results per Patient: Min=0, Max=20, Mean=7).

Task 4: Load Test Client:  For this task, the researchers used the Yahoo Cloud Serving Benchmark (YCSB) framework to manage the execution of the tests and the measurements.  One of the key features of YCSB, as indicated in (Cooper, Silberstein, Tam, Ramakrishnan, & Sears, 2010), is its extensibility, which provides an easy definition of new workload types and the flexibility to benchmark new systems.  The YCSB framework and workloads are available as open source for system evaluations.  The researchers replaced the simple data models, data sets, and queries of the YCSB framework so that the tests reflected the specific use cases of the provider.  Another YCSB feature is the ability to specify the total number of operations and the mix of read and write operations in a workload.  The researchers utilized this feature and applied an 80% read and 20% write mix for each workload of the EHR system, in response to the provider's requirements.  Thus, the read operations were used to retrieve the five most recent observations for a patient, and the write operations were used to insert a new observation record for a patient.   Two workload use cases were used.  The first use case was to test the data store as a local cache, which involved a write-only workload performed on a daily basis to load a local cache from a centralized primary data store with records for patients with a scheduled appointment that day.  The second use case was a read workload to flush the cache back to the centralized primary data store, as illustrated in Figure 2.

Figure 2.  Read (R) and Write (W) Workloads.

The “operation latency” was measured by the YCSB framework as the time between the request and the response from the data store.   The latency calculation for read and write operations was performed separately using the YCSB framework.  Besides the “operation latency,” the latency distribution is a key scalability metric in Big Data Analytics; thus, the researchers recorded both the average and the 95th percentile values.  Moreover, the researchers extended the test to include the overall throughput in operations per second, which reflected the total number of read and write operations divided by the total workload execution time, excluding the time for the initial setup and the cleanup, to obtain a more accurate result.

Task 5: Test Script Development and Execution:  In this task, the researchers performed three runs to decrease any impact associated with transient events in the cloud infrastructure.  These three runs were performed for each of the identified data stores.  The standard deviation of the throughput for any three-run set never exceeded 2% of the average.   YCSB allows running multiple execution threads to create concurrent client sessions. Thus, the workload execution was repeated for a defined range of test client threads for each of the three-run tests. This workload execution approach created a corresponding number of concurrent database connections.  The researchers indicated that NoSQL data stores are not designed to operate with a large number of concurrent database client sessions, and can usually handle between 16 and 64 concurrent sessions.

The researchers analyzed the appropriate approach to distributing the multiple concurrent connections to the database across the server nodes. Based on their analysis, the researchers found that MongoDB utilizes a centralized router node, and all clients connected to that single router node.  With respect to Cassandra, the built-in data center awareness distribution feature created three sub-clusters of three nodes each, and client connections were spread uniformly across the three nodes in one sub-cluster.   With respect to the Riak data store, client connections could only be spread uniformly across the full set of nine nodes.  Table 5 summarizes how each data store handles concurrent connections in a distributed system.

Table 5.  A Summary of the Data Store Techniques for Concurrent Connections in Distributed System.

Results and Findings

The nine-node topology was configured to represent the production system. A single-server configuration was also tested, which demonstrated its limitation for production use. Thus, the performance results do not reflect the single-server configuration but rather the nine-node topology configuration and test execution.  However, the comparison between the single-node and distributed nine-node scenarios was performed to provide insight into the performance of the data stores, the efficiency of the distributed systems, and the tradeoff between scaling out with more nodes and scaling up with faster nodes and more storage.  The results covered three major areas:  Strong Consistency Evaluation, Eventual Consistency Evaluation, and Performance Evaluation.

Strong Consistency Evaluation

With respect to MongoDB, all writes were committed on the Primary Server, while all reads were performed from the Primary Server.  In Cassandra, all writes were committed on a quorum formed on each of the three sub-clusters, while the read operations required a quorum only on the local sub-cluster.  With respect to Riak, the effect was to require a quorum on the entire nine-node cluster for both read and write operations.  Table 6 summarizes this strong consistency evaluation result.  

Table 6.  Strong Consistency Evaluation. Adapted from (Klein et al., 2015).

Eventual Consistency Evaluation: The eventual consistency tests were performed on Cassandra and Riak.  The results showed that writes were committed on one node with replication occurring after the operation was acknowledged to the client.  The results also showed that read operations were executed on one replica, which may or may not return the latest values written to the data store. 

Performance Evaluation: Cassandra demonstrated the best overall performance among the three data stores, peaking at approximately 3,500 operations per second, as illustrated in Figure 3.  The comparison in Figure 3 includes the throughput of the read-only, write-only, and read/write workloads, with replicated data and quorum consistency, for the three data stores.

Figure 3.  Workload Comparison among Cassandra, MongoDB, and Riak. 

Adapted from (Klein et al., 2015). 

The decreased contention for storage I/O improved the performance, while the additional work of coordinating write and read quorums across replicas and data centers decreased the performance.  With respect to Cassandra, the improvement exceeded the degradation, resulting in a net higher performance in the distributed configuration than the other two data stores.  The built-in data center awareness distribution feature in Cassandra contributed to its improved performance because this feature separates the replication and sharding configurations.   This separation allowed larger read operations to be completed without the need for request coordination, such as the P2P proxying of client requests in Riak.

With respect to latency, the results for the read/write workload showed that MongoDB had a constant average latency as the number of concurrent sessions increased.  The results for the read/write operations also showed that, although Cassandra achieved the highest overall throughput, it had the highest latencies, indicating high internal concurrency in processing the requests.

Conclusion

The Cassandra data store demonstrated the best throughput performance, but with the highest latency, for the specific workloads and configurations tested.   The researchers attributed Cassandra's results to three factors.  The first is that Cassandra's hash-based sharding spread the request and storage load better than MongoDB.  The second is that the indexing features of Cassandra allowed efficient retrieval of the most recently written records, compared to Riak.  The third is that the P2P architecture and data center awareness feature of Cassandra provide efficient coordination of both read and write operations across the replica nodes and the data centers.

The results also showed that MongoDB and Cassandra performed more efficiently than the Riak data store.  Moreover, both provided the strong replica consistency required for this application and its data models.  The researchers concluded that MongoDB exhibited a more transparent data model mapping than Cassandra, and that MongoDB's indexing capabilities were found to be a better fit for this application.

Moreover, the results also showed that the throughput varied by a factor of ten, read operation latency varied by a factor of five, and write latency by a factor of four with the highest throughput product delivering the highest latency. The results also showed that the throughput for workloads using strong consistency was 10-25% lower than workloads using eventual consistency.

References

Beeler, G. W. J. (2010). Introduction to:  HL7 References Information Model (RIM).  ANSI/HL7 RIM R3-2010 and ISO 21731. Retrieved from https://www.hl7.org/documentcenter/public_temp_4F08F84F-1C23-BA17-0C2B98D837BC327B/calendarofevents/himss/2011/HL7ReferenceInformationModel.pdf.

Cooper, B. F., Silberstein, A., Tam, E., Ramakrishnan, R., & Sears, R. (2010). Benchmarking cloud serving systems with YCSB. Paper presented at the Proceedings of the 1st ACM symposium on Cloud computing.

Klein, J., Gorton, I., Ernst, N., Donohoe, P., Pham, K., & Matser, C. (2015, June 27 2015-July 2 2015). Application-Specific Evaluation of No SQL Databases. Paper presented at the 2015 IEEE International Congress on Big Data.