Proposal: State-of-the-Art Healthcare System in Four States.

Dr. O. Aly
Computer Science

Abstract

The purpose of this proposal is to design a state-of-the-art healthcare system in the four States of Colorado, Utah, Arizona, and New Mexico.  Big Data (BD) and Big Data Analytics (BDA) have played significant roles in various industries, including the healthcare industry.  The value driven by BDA can save lives and minimize costs for patients.  The project proposes a design to apply BD and BDA in the healthcare system across these four States.  Cloud computing is the most appropriate technology to deal with the large volume of healthcare data at both the storage and the data-processing levels.  Due to the security issues of cloud computing, the Virtual Private Cloud (VPC) will be used.  The VPC provides a secure cloud environment by controlling network traffic with security groups and network access control lists.  The project requires other components to be fully implemented using the latest technology, such as Hadoop and MapReduce for data processing, and machine learning for artificial intelligence, which will also support the Internet of Things (IoT).  The NoSQL databases HBase and MongoDB will be used to handle semi-structured data such as XML and unstructured data such as logs and images.  Spark will be used for real-time data processing, which can be vital for urgent care and emergency services.  This proposal addresses the assumptions and limitations as well as the justification for selecting these specific components.  All stakeholders in the healthcare sector, including providers, insurers, pharmaceutical companies, and practitioners, should cooperate and coordinate to facilitate the implementation process.  The rigid culture and silo pattern need to change for better healthcare, which can save millions of dollars for the healthcare industry and, at the same time, provide excellent care to patients.

Keywords: Big Data Analytics; Hadoop; Healthcare Big Data System; Spark.

Introduction

            In the age of Big Data (BD), information technology plays a significant role in the healthcare industry (HIMSS, 2018).  The healthcare sector generates a massive amount of data every day to conform to standards and regulations (Alexandru, Alexandru, Coardos, & Tudora, 2016).  The generated Big Data has the potential to support many medical and healthcare operations, including clinical decision support, disease surveillance, and population health management (Alexandru et al., 2016).  This project proposes a state-of-the-art integrated system for hospitals located in Arizona, Colorado, New Mexico, and Utah.  The system is based on the Hadoop ecosystem to help the hospitals maintain and improve human health via diagnosis, treatment, and disease prevention.

The proposal begins with an overview of Big Data Analytics in healthcare, which covers the benefits and challenges of BD and BDA in the healthcare industry.  The overview also covers the various healthcare data sources for data analytics in different formats, such as semi-structured data, e.g., XML and JSON, and unstructured data, e.g., images and X-rays.  The second section addresses the healthcare BDA design proposal using Hadoop and covers various components.  The first component discusses the requirements for this design.  These requirements include state-of-the-art technology such as Hadoop/MapReduce, Spark, NoSQL databases, Artificial Intelligence (AI), and the Internet of Things (IoT).  The project also presents various diagrams, including the data flow diagram, a communication flowchart, and the overall system diagram.  The healthcare system design is bounded by regulations, policies, and governance such as HIPAA, which are also covered in this project.  The justification, limitations, and assumptions are also discussed.

Big Data Analytics in Healthcare Overview

BD and BDA are terms that have been used interchangeably and described as the next frontier for innovation, competition, and productivity (Maltby, 2011; Manyika et al., 2011).  BD has a multi-V model with unique characteristics: volume refers to the large size of the datasets, velocity refers to the speed of computation as well as data generation, and variety refers to the various data types such as semi-structured and unstructured data (Assunção, Calheiros, Bianchi, Netto, & Buyya, 2015; Hu, Wen, Chua, & Li, 2014).  Various industries, including healthcare, have taken this opportunity and applied BD and BDA in their business models (Manyika et al., 2011).  The McKinsey Global Institute predicted $300 billion as a potential annual value to US healthcare (Manyika et al., 2011).  

The healthcare industry generates extensive data driven by keeping patients’ records, complying with regulations and policies, and providing patient care (Raghupathi & Raghupathi, 2014).  The current trend is to digitalize this explosive growth of data in the age of Big Data (BD) and Big Data Analytics (BDA) (Raghupathi & Raghupathi, 2014).  BDA has revolutionized healthcare by transforming data into valuable information and knowledge to predict epidemics, cure diseases, improve quality of life, and avoid preventable deaths (Van-Dai, Chuan-Ming, & Nkabinde, 2016).  Various applications of BDA in healthcare include pervasive health, fraud detection, pharmaceutical discoveries, clinical decision support systems, computer-aided diagnosis, and biomedical applications. 

Healthcare Big Data Benefits and Challenges

            The healthcare sector employs BDA in various aspects of healthcare, such as detecting diseases at early stages, providing evidence-based medicine, minimizing doses of medication to avoid side effects, and delivering useful medicine based on genetic analysis.  The use of BD and BDA can reduce the re-admission rate and thereby reduce healthcare-related costs for patients.  Healthcare BDA can also detect spreading diseases earlier, before they spread widely, using real-time analytics (Archenaa & Anita, 2015; Raghupathi & Raghupathi, 2014; Wang, Kung, & Byrd, 2018).   An example of the application of BDA in the healthcare system is Kaiser Permanente’s implementation of HealthConnect to ensure data exchange across all medical facilities and promote the use of electronic health records (Fox & Vaidyanathan, 2016).

            Despite the various benefits of BD and BDA in the healthcare sector, various challenges and issues are emerging from the application of BDA in healthcare.  The nature of the healthcare industry poses challenges to BDA (Groves, Kayyali, Knott, & Kuiken, 2016).  The episodic culture, the data puddles, and IT leadership are the three significant challenges for the healthcare industry in applying BDA.  The episodic culture refers to the conservative culture of healthcare and the lack of an IT mindset, creating a rigid culture.  Few providers have overcome this rigid culture and started to use BDA technology.  The data puddles reflect the silo nature of healthcare.  The silo is described as one of the most significant flaws in the healthcare sector (Wicklund, 2014).  Proper use of technology is lacking in the healthcare sector, causing the industry to fall behind other industries.  Each silo uses its own methods to collect data from labs, diagnosis, radiology, emergency, case management, and so forth.  IT leadership is another challenge caused by the rigid culture of the healthcare industry; the lack of familiarity with the latest technologies among IT leadership in healthcare is a severe problem. 

Healthcare Data Sources for Data Analytics

            The current healthcare data are collected from clinical and non-clinical sources (InformationBuilders, 2018; Van-Dai et al., 2016; Zia & Khan, 2017).  Electronic healthcare records are digital copies of the medical history of patients.  They contain a variety of data relevant to the care of patients, such as demographics, medical problems, medications, body mass index, medical history, laboratory test data, radiology reports, clinical notes, and payment information.  These electronic healthcare records are the most important data in healthcare data analytics because they provide effective and efficient methods for providers and organizations to share data (Botta, de Donato, Persico, & Pescapé, 2016; Palanisamy & Thirunavukarasu, 2017; Van-Dai et al., 2016; Wang et al., 2018).  

Biomedical imaging data plays a crucial role in healthcare by aiding disease monitoring, treatment planning, and prognosis.  This data can be used to generate quantitative information and make inferences from the images that can provide insights into a medical condition.  Image analytics is more complicated due to the noise associated with the images, which is one of the significant limitations of biomedical analysis (Ji, Ganchev, O’Droma, Zhang, & Zhang, 2014; Malik & Sangwan, 2015; Van-Dai et al., 2016). 

Sensing data is ubiquitous in the medical domain, both for real-time and for historical data analysis.  Sensing data involves several forms of medical data collection instruments, such as the electrocardiogram (ECG) and electroencephalogram (EEG), which are vital sensors for collecting signals from various parts of the human body.  Sensing data plays a significant role in intensive care units (ICU) and in the real-time remote monitoring of patients with specific conditions such as diabetes or high blood pressure.  The real-time and long-term analysis of various trends and treatments in remote monitoring programs can help providers monitor the state of those patients with certain conditions (Van-Dai et al., 2016). 

Biomedical signals are collected from many sources such as the heart, blood pressure, oxygen saturation levels, blood glucose, nerve conduction, and brain activity.  Examples of biomedical signals include the electroneurogram (ENG), electromyogram (EMG), electrocardiogram (ECG), electroencephalogram (EEG), electrogastrogram (EGG), and phonocardiogram (PCG).  Real-time analytics of biomedical signals will provide better management of chronic diseases, earlier detection of adverse events such as heart attacks and strokes, and earlier diagnosis of disease.  These biomedical signals can be discrete or continuous based on the kind of care or the severity of a particular pathological condition (Malik & Sangwan, 2015; Van-Dai et al., 2016).

Genomic data analysis helps better understand the relationship between various genes, mutations, and disease conditions.  It has great potential in the development of various gene therapies to cure certain conditions.  Furthermore, genomic data analytics can assist in translating genetic discoveries into personalized medicine practice (Liang & Kelemen, 2016; Luo, Wu, Gopukumar, & Zhao, 2016; Palanisamy & Thirunavukarasu, 2017; Van-Dai et al., 2016).

Clinical text data analytics using data mining is the process of transforming information from clinical notes, stored in an unstructured format, into useful patterns.  The manual coding of clinical notes is costly and time-consuming because of their unstructured nature, heterogeneity, and differing formats and contexts across patients and practitioners.  Methods such as natural language processing (NLP) and information retrieval can be used to extract useful knowledge from large volumes of clinical text and to automatically encode clinical information in a timely manner (Ghani, Zheng, Wei, & Friedman, 2014; Sun & Reddy, 2013; Van-Dai et al., 2016).

Social network healthcare data analytics is based on various collected social media sources, such as social networking sites, e.g., Facebook, Twitter, and web logs, to discover new patterns and knowledge that can be leveraged to model and predict global health trends such as outbreaks of infectious epidemics (InformationBuilders, 2018; Luo et al., 2016; Van-Dai et al., 2016; Zia & Khan, 2017).  Figure 1 shows a summary of these healthcare data sources.


Figure 1.  Healthcare Data Sources.

Healthcare Big Data Analytics Design Proposal Using Hadoop

            The implementation of BDA in the hospitals within the four States aims to improve patient safety and clinical outcomes and to promote wellness and disease management (Alexandru et al., 2016; HIMSS, 2018).  The BDA system will take advantage of the large volume of healthcare-generated data to provide various applied analytical disciplines across the statistical, contextual, quantitative, predictive, and cognitive spectrums (Alexandru et al., 2016; HIMSS, 2018).  These applied analytical disciplines will drive fact-based decision making for planning, management, and learning in hospitals (Alexandru et al., 2016; HIMSS, 2018). 

            The proposal begins with the requirements, followed by the data flow diagram, the communication flowcharts, and the overall system diagram.  The proposal addresses the regulations, policies, and governance for the medical system.  The limitations and assumptions are also addressed in this proposal, followed by the justification for the overall design.

1.      Basic Design Requirements

The basic requirements for the implementation of this proposal include not only the tools and required software, but also training at all levels, from staff to nurses, clinicians, and patients.  The list of requirements is divided into system requirements, implementation requirements, and training requirements. 

1.1 Cloud Computing Technology Adoption Requirement

Volume is one of the significant characteristics of BD, especially in the healthcare industry (Manyika et al., 2011).  Based on the challenges addressed earlier when dealing with BD and BDA in healthcare, the system requirements cannot be met using a traditional on-premises data center, as it cannot handle the intensive computation requirements of BD or the storage requirements for all the medical information from the various hospitals in the four States (Hu et al., 2014).  Thus, the cloud computing environment is found to be more appropriate and a solution for the implementation of this proposal.  Cloud computing plays a significant role in BDA (Assunção et al., 2015).  The massive computation and storage requirements of BDA create a critical need for the emerging cloud computing technology (Mehmood, Natgunanathan, Xiang, Hua, & Guo, 2016).  Cloud computing offers various benefits such as cost reduction, elasticity, pay per use, availability, reliability, and maintainability (Gupta, Gupta, & Mohania, 2012; Kritikos, Kirkham, Kryza, & Massonet, 2017).  However, although cloud computing offers various benefits, it has security and privacy issues under the standard deployment models of public cloud, private cloud, hybrid cloud, and community cloud.  Thus, one of the major requirements is to adopt the Virtual Private Cloud, as it has been regarded as the most prominent approach to trusted computing technology (Abdul, Jena, Prasad, & Balraju, 2014).

 1.2 Security Requirement

Cloud computing has been facing various threats (Cloud Security Alliance, 2013, 2016, 2017).  Records show that over the three years from 2015 to 2017, the number of breaches, lost medical records, and settlement fines was staggering (Thompson, 2017).  The Office of Civil Rights (OCR) issued 22 resolution agreements, requiring monetary settlements approaching $36 million (Thompson, 2017).  Table 1 shows the data categories and the total for each year. 

Table 1.  Approximation of Records Lost by Category Disclosed on HHS.gov (Thompson, 2017)

Furthermore, a recent report (HIPAA, 2018d) showed that the first three months of 2018 saw 77 healthcare data breaches reported to the OCR.  In the second quarter of 2018, at least 3.14 million healthcare records were exposed (HIPAA, 2018a).  In the third quarter of 2018, 4.39 million records were exposed in 117 breaches (HIPAA, 2018c).

Thus, the protection of patients’ private information requires technology to extract, analyze, and correlate potentially sensitive datasets (HIPAA, 2018b).  The implementation of BDA requires security measures and safeguards to protect the privacy of patients in the healthcare industry (HIPAA, 2018b).  Sensitive data should be encrypted to prevent the exposure of data in the event of theft (Abernathy & McMillan, 2016).  The security requirements involve security at the VPC cloud deployment model as well as at the local hospitals in each State (Regola & Chawla, 2013).  Security at the VPC level should involve the implementation of security groups and network access control lists so that only the right individuals can access the right applications and patients’ records.  A security group in the VPC acts as the first line of defense, a firewall for the associated instances of the VPC (McKelvey, Curran, Gordon, Devlin, & Johnston, 2015).  The network access control lists act as the second layer of defense, a firewall for the associated subnets controlling the inbound and outbound traffic at the subnet level (McKelvey et al., 2015). 
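As a hedged illustration of this requirement, the following Python sketch (using the boto3 library) creates a security group and a subnet-level network ACL rule; the region, VPC ID, CIDR range, and port are placeholder assumptions rather than final design values.

# Illustrative sketch only: first and second lines of defense in the VPC.
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")  # assumed region

# First line of defense: a security group allowing only HTTPS from hospital subnets.
sg = ec2.create_security_group(
    GroupName="ehr-app-sg",
    Description="HTTPS access to EHR application servers",
    VpcId="vpc-0abc1234",  # placeholder VPC ID
)
ec2.authorize_security_group_ingress(
    GroupId=sg["GroupId"],
    IpPermissions=[{
        "IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
        "IpRanges": [{"CidrIp": "10.10.0.0/16", "Description": "Hospital network"}],
    }],
)

# Second line of defense: a network ACL rule applied at the subnet level.
acl = ec2.create_network_acl(VpcId="vpc-0abc1234")
ec2.create_network_acl_entry(
    NetworkAclId=acl["NetworkAcl"]["NetworkAclId"],
    RuleNumber=100,
    Protocol="6",             # TCP
    RuleAction="allow",
    Egress=False,             # inbound rule
    CidrBlock="10.10.0.0/16",
    PortRange={"From": 443, "To": 443},
)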

Security at the local hospital level in each State is mandatory to protect patients’ records and comply with HIPAA regulations (Regola & Chawla, 2013).  The medical equipment must be secured with authentication and authorization techniques so that only medical staff, nurses, and clinicians have access to the medical devices based on their roles.  General access should be prohibited, as every member of the hospital has a different role with different responsibilities.  Encryption should be used to hide the meaning or intent of communication from unintended users (Stewart, Chapple, & Gibson, 2015).  Encryption is an essential element of security control, especially for data in transit (Stewart et al., 2015).  The hospitals in all four States should implement the same types of encryption controls across the hospitals, such as PKI, cryptographic applications, and symmetric key algorithms (Stewart et al., 2015).

The system requirements should also include identity management systems that can correspond with the hospitals in each State.  The identity management system provides authentication and authorization techniques, allowing access to patients’ medical records only to those who should have it.  The proposal requires the implementation of various encryption techniques such as Secure Sockets Layer (SSL), Transport Layer Security (TLS), and Internet Protocol Security (IPSec) to protect information transferred over public networks (Zhang, R. & Liu, 2010).  
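A minimal sketch of encryption in transit at the application layer is shown below, assuming a MongoDB endpoint reachable over TLS; the host name and certificate path are hypothetical.

# Hedged example: enforcing TLS when the application layer connects to MongoDB.
from pymongo import MongoClient

client = MongoClient(
    "mongodb://ehr-mongo.internal.example:27017/",  # placeholder host
    tls=True,                              # require TLS for all traffic
    tlsCAFile="/etc/ssl/hospital-ca.pem",  # hospital certificate authority bundle
)
records = client["ehr"]["clinical_notes"]
print(records.estimated_document_count())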

 1.3 Hadoop Implementation for Data Stream Processing Requirement

While the velocity of BD refers to the speed of generating large volumes of data and requires speed in data processing (Hu et al., 2014), the variety of the data requires specific technology capabilities to handle various types of datasets such as structured, semi-structured, and unstructured data (Bansal, Deshpande, Ghare, Dhikale, & Bodkhe, 2014; Hu et al., 2014).  The Hadoop ecosystem is found to be the most appropriate system for implementing BDA (Bansal et al., 2014; Dhotre, Shimpi, Suryawanshi, & Sanghati, 2015).  The implementation requirements include various technologies and tools.  This section covers the components that are required when implementing Hadoop technology in the four States for the healthcare BDA system.

Hadoop has three significant limitations, which must be addressed in this design.  The first limitation is the lack of technical support and documentation for open-source Hadoop (Guo, 2013).  Thus, this design requires an enterprise edition of Hadoop, such as Cloudera, Hortonworks, or MapR, to get around this limitation (Guo, 2013).  The final decision on the product will be made by the cost analysis team.  The second limitation is that Hadoop is not optimal for real-time data processing (Guo, 2013).  The solution for this limitation requires the integration of a real-time streaming program such as Spark, Storm, or Kafka (Guo, 2013; Palanisamy & Thirunavukarasu, 2017).  This requirement of integrating Spark is discussed below as a separate requirement for this design (Guo, 2013).  The third limitation is that Hadoop is not a good fit for large graph datasets (Guo, 2013).  The solution for this limitation requires the integration of GraphLab, which is also discussed below as a separate requirement for this design.

1.3.1 Hadoop Ecosystem for Data Processing

Hadoop technologies have been the front-runner for Big Data applications (Bansal et al., 2014; Chrimes, Zamani, Moa, & Kuo, 2018).  The Hadoop ecosystem will be part of the implementation requirements, as it is proven to serve well for intensive computation on large datasets (Raghupathi & Raghupathi, 2014; Wang et al., 2018).  The implementation of Hadoop technology will be performed in the VPC deployment model.  The Hadoop version that is required is version 2.x, which includes YARN for resource management (Karanth, 2014).  Hadoop 2.x also includes HDFS snapshots to provide a read-only image of the entire filesystem or a particular subset of it to protect against user errors and to support backup and disaster recovery (Karanth, 2014).  The Hadoop platform can be implemented to gain more insight into various areas (Raghupathi & Raghupathi, 2014; Wang et al., 2018).  The Hadoop ecosystem involves the Hadoop Distributed File System (HDFS), MapReduce, and NoSQL databases such as HBase and Hive to handle a large volume of data, using various algorithms and machine learning to extract value from medical records that are structured, semi-structured, and unstructured (Raghupathi & Raghupathi, 2014; Wang et al., 2018).  Other components supporting the Hadoop ecosystem include Oozie for workflow, Pig for scripting, and Mahout for machine learning, which is part of artificial intelligence (AI) (Ankam, 2016; Karanth, 2014).  The Hadoop ecosystem will also include Flume for log collection, Sqoop for data exchange, and ZooKeeper for coordination (Ankam, 2016; Karanth, 2014).  HCatalog is a required component to manage the metadata in Hadoop (Ankam, 2016; Karanth, 2014).  Figure 2 shows the Hadoop ecosystem before integrating Spark for real-time analytics, and a brief MapReduce sketch follows the figure.


Figure 2.  Hadoop Architecture Overview (Alguliyev & Imamverdiyev, 2014).
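The following hedged sketch illustrates the MapReduce programming model with Hadoop Streaming, counting admissions per diagnosis code; the input layout, file names, and paths are assumptions for illustration only.

# --- mapper.py (illustrative sketch) ---
# Assumes tab-separated admission records with the diagnosis code in column 3.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) > 2:
        print(f"{fields[2]}\t1")  # emit (diagnosis_code, 1)

# --- reducer.py (illustrative sketch) ---
# Sums the counts per diagnosis code from the sorted mapper output.
import sys

current_key, count = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    if current_key is not None and key != current_key:
        print(f"{current_key}\t{count}")
        count = 0
    current_key = key
    count += int(value)
if current_key is not None:
    print(f"{current_key}\t{count}")

# Example invocation (paths are placeholders):
# hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py \
#   -mapper mapper.py -reducer reducer.py \
#   -input /ehr/admissions -output /ehr/diagnosis_counts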

1.3.2 Hadoop-specific File Format for Splittable and Agnostic Compression

The ability to split files plays a significant role during data processing (Grover, Malaska, Seidman, & Shapira, 2015).  Therefore, Hadoop-specific file formats such as SequenceFile, serialization formats such as Avro, and columnar formats such as RCFile and Parquet should be used, because these files share two characteristics that are essential for Hadoop applications: splittable compression and agnostic compression (Grover et al., 2015).  Hadoop allows large files to be split for input to MapReduce and other types of jobs, which is required for parallel processing and is an essential key to leveraging the data locality feature of Hadoop (Grover et al., 2015).  Agnostic compression is required to compress data using any compression codec without readers having to know the codec, because the codec is stored in the header metadata of the file format (Grover et al., 2015).  Figure 3 summarizes the three Hadoop file types with the two common characteristics, and a short sketch of writing such a file follows the figure.  


Figure 3. Three Hadoop File Types with the Two Common Characteristics.  
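As a brief illustration, the sketch below writes de-identified encounter records to Parquet with Snappy compression using the pyarrow library; the field names and output path are assumptions.

# Illustrative sketch: persisting encounter records in a splittable, columnar,
# codec-agnostic format suited to MapReduce jobs.
import pyarrow as pa
import pyarrow.parquet as pq

encounters = pa.table({
    "patient_id": ["P001", "P002", "P003"],
    "encounter_type": ["inpatient", "outpatient", "emergency"],
    "length_of_stay_days": [4, 0, 1],
})

# The compression codec is recorded in the file metadata, so downstream readers
# do not need to know it in advance (codec-agnostic compression).
pq.write_table(encounters, "encounters.parquet", compression="snappy")

print(pq.read_table("encounters.parquet").to_pydict())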

1.3.3 XML and JSON Use in Hadoop

The clinical data includes semi-structured formats such as XML and JSON.  The split process for XML and JSON is not straightforward and can present unique challenges in Hadoop, since Hadoop does not provide a built-in InputFormat for either format (Grover et al., 2015).  Furthermore, JSON presents more challenges to Hadoop than XML because no token is available to mark the beginning or end of a record (Grover et al., 2015).  When using these file formats, two primary considerations must be taken into account.  A container format such as Avro should be used, because Avro provides a compact and efficient method to store and process the data once it is transformed into Avro (Grover et al., 2015).  In addition, a library for processing XML or JSON should be used (Grover et al., 2015).  The XMLLoader in the PiggyBank library for Pig is an example for the XML data type, and the Elephant Bird project is an example for the JSON data type (Grover et al., 2015). 
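A minimal sketch of the first step, parsing an XML-based record into a dictionary that could then be serialized to a container format such as Avro, is shown below; the element names are illustrative and not taken from a specific HL7 schema.

# Hedged sketch: converting an XML-based patient record into a dictionary.
import xml.etree.ElementTree as ET

sample = """
<patient>
  <id>P001</id>
  <name>Jane Doe</name>
  <labResult test="HbA1c" value="6.1" units="%"/>
</patient>
"""

root = ET.fromstring(sample)
record = {
    "id": root.findtext("id"),
    "name": root.findtext("name"),
    "lab_test": root.find("labResult").get("test"),
    "lab_value": float(root.find("labResult").get("value")),
}
print(record)  # ready to be written with an Avro writer such as fastavro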

1.4 HBase and MongoDB NoSQL Database Integration Requirement

In the age of BD and BDA, traditional data stores are found inadequate to handle not only the large volume of the datasets but also the various data formats such as unstructured and semi-structured data (Hu et al., 2014).  Thus, Not Only SQL (NoSQL) databases emerged to meet the requirements of BDA.  These NoSQL data stores are used for modern, scalable databases (Sahafizadeh & Nematbakhsh, 2015).  The scalability feature of NoSQL data stores enables systems to increase throughput when demand increases during data processing (Sahafizadeh & Nematbakhsh, 2015).  The platform can incorporate two scalability types to support the large volume of the datasets: horizontal and vertical scalability.  Horizontal scaling distributes the workload across many servers and nodes to increase throughput, while vertical scaling requires more processors, more memory, and faster hardware to be installed on a single server (Sahafizadeh & Nematbakhsh, 2015). 

NoSQL data stores include various products such as MongoDB, CouchDB, Redis, Voldemort, Cassandra, Big Table, Riak, HBase, Hypertable, ZooKeeper, Vertica, Neo4j, db4o, and DynamoDB.  These data stores are categorized into four types: document-oriented, column-oriented (or column-family) stores, graph databases, and key-value stores (EMC, 2015; Hashem et al., 2015).  The document-oriented data store can store and retrieve collections of data and documents using complex data forms in various formats such as XML and JSON as well as PDF and MS Word (EMC, 2015; Hashem et al., 2015).  MongoDB and CouchDB are examples of document-oriented data stores (EMC, 2015; Hashem et al., 2015).  The column-oriented data store stores content in columns rather than rows, with the attributes of the columns stored contiguously (Hashem et al., 2015).  This type of data store can store and render blog entries, tags, and feedback (Hashem et al., 2015).  Cassandra, DynamoDB, and HBase are examples of column-oriented data stores (EMC, 2015; Hashem et al., 2015).  The key-value store can store and scale large volumes of data and contains a value and a key to access that value (EMC, 2015; Hashem et al., 2015).  The value can be complex, but this type of data store is useful for storing a user’s login ID as the key referencing the patient’s value.  Redis and Riak are examples of key-value NoSQL data stores (Alexandru et al., 2016).  Each of these NoSQL data stores has its limitations and advantages.  The graph NoSQL database can store and represent data using graph models with nodes, edges, and properties related to one another through relations, which will be useful for unstructured medical data such as images and lab results.  Neo4j is an example of this type of graph NoSQL database (Hashem et al., 2015).  Figure 4 summarizes these NoSQL data store types, the data types they store, and examples.

Figure 4.  Big Data Analytics NoSQL Data Store Types.

The proposed design requires one or more NoSQL data stores to meet the requirements of BDA using the Hadoop environment for this healthcare BDA system.  Healthcare big data has unique characteristics which must be addressed when selecting the data store, and consideration must be given to the various types of data.  HBase and HDFS are the most commonly used storage managers in the Hadoop environment (Grover et al., 2015).  HBase is a column-oriented data store which will be used to store multi-structured data (Archenaa & Anita, 2015).  HBase sits on top of HDFS in the Hadoop ecosystem framework (Raghupathi & Raghupathi, 2014).   

MongoDB will also be used to store semi-structured datasets such as XML and JSON, as well as metadata for the HBase data schema to improve its accessibility and readability (Luo et al., 2016).  Riak will be used for key-value datasets, which can serve dictionaries, hash tables, and associative arrays for login and user ID information for patients as well as for providers and clinicians (Klein et al., 2015).  The Neo4j NoSQL database will be used to store images with nodes and edges, such as lab images and X-rays (Alexandru et al., 2016).
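A hedged sketch of this storage split is shown below, writing a vital-sign sample to HBase (via the happybase client) and a semi-structured patient document to MongoDB (via pymongo); host names, table names, and column families are assumptions.

# Illustrative sketch of the proposed storage split.
import happybase
from pymongo import MongoClient

# Column-oriented store for signal data keyed by patient and timestamp.
hbase = happybase.Connection("hbase-master.internal.example")
vitals = hbase.table("patient_vitals")
vitals.put(b"P001|2018-11-02T10:15:00", {b"signal:heart_rate": b"72",
                                         b"signal:spo2": b"98"})

# Document store for the semi-structured clinical record (JSON/XML-derived).
mongo = MongoClient("mongodb://ehr-mongo.internal.example:27017/")
mongo["ehr"]["patients"].insert_one({
    "patient_id": "P001",
    "demographics": {"name": "Jane Doe", "state": "CO"},
    "medications": ["metformin"],
})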

The proposed healthcare system has a logical data model and query patterns that need to be supported by the NoSQL databases (Klein et al., 2015).  In the data model, reading the medical test results for patients is a core function used to populate the user interface.  The model also requires strong replica consistency when a new medical result is written for a patient.  Providers can make patient care decisions using these records.  All providers will be able to see the same information within the hospital systems in the four States, whether they are at the same site as the patients or providing telemedicine support from another location. 

The logical data model includes mapping the application-specific model onto the particular data model, indexing, and query language capabilities of each database.  HL7 Fast Healthcare Interoperability Resources (FHIR) is used as the logical data model for records analysis.  Patient demographic data such as names, addresses, and telephone numbers will be modeled using the FHIR Patient resource, and test results will be modeled with fields such as result quantity and result units (Klein et al., 2015). 
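A minimal, hedged example of this FHIR-style logical model is shown below; the identifiers and values are illustrative only.

# Illustrative FHIR-style documents: demographics and a test result.
patient = {
    "resourceType": "Patient",
    "id": "P001",
    "name": [{"family": "Doe", "given": ["Jane"]}],
    "address": [{"city": "Denver", "state": "CO"}],
    "telecom": [{"system": "phone", "value": "555-0100"}],
}

lab_result = {
    "resourceType": "Observation",
    "subject": {"reference": "Patient/P001"},
    "code": {"text": "Hemoglobin A1c"},
    "valueQuantity": {"value": 6.1, "unit": "%"},  # result quantity and units
}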

1.5 Spark Integration for Real-Time Data Processing Requirement

While the architecture of the Hadoop ecosystem has been designed for various scenarios of data storage, data management, statistical analysis, statistical association between various data sources, distributed computing, and batch processing, this proposal requires real-time data processing, which cannot be met by Hadoop alone (Basu, 2014).  Real-time analytics will add tremendous value to the proposed healthcare system.  Thus, Apache Spark is another component required to implement this proposal (Basu, 2014).  Spark allows in-memory processing for fast response times, bypassing MapReduce operations (Basu, 2014).  With Spark integrated with Hadoop, stream processing, machine learning, interactive analytics, and data integration will be possible (Scott, 2015).  Spark will run on top of Hadoop to benefit from YARN and the underlying storage of HDFS, HBase, and other Hadoop ecosystem building blocks (Scott, 2015).  Figure 5 shows the core engines of Spark.


Figure 5. Spark Core Engines (Scott, 2015).

 1.6 Big Healthcare Data Visualization Requirement

Visualization is one of the most powerful presentations of data (Jayasingh, Patra, & Mahesh, 2016).  It helps in viewing the data in a more meaningful way, in the form of graphs, images, and pie charts that can be understood easily.  It helps in synthesizing a large volume of data, such as healthcare data, to get at the core of the raw big data and convey the key points for insight (Meyer, 2018).  Some of the commercial visualization tools include Tableau, Spotfire, QlikView, and Adobe Illustrator.  However, the most commonly used visualization tools in healthcare include Tableau, PowerBI, and QlikView.  This healthcare design proposal will utilize Tableau. 

Healthcare providers are successfully transforming data from information to insight using Tableau software.  Healthcare organizations can utilize three approaches to get more from their healthcare datasets.  The first approach is to open up data access by empowering the departments in healthcare to explore their own data.  The second approach is to uncover answers with data from multiple systems to reveal trends and outliers.  The third approach is to share insights with executives, providers, and others to drive collaboration (Tableau, 2011).  Tableau has several advantages, including interactive visualization using drag-and-drop techniques, handling large amounts of data and millions of rows with ease, and integration with scripting languages such as Python (absentdata.com, 2018).  It also provides mobile support and responsive dashboards.  The limitation of Tableau is that it requires substantial training to fully master the platform, among other limitations including the lack of automatic refreshing, conditional formatting, and a 16-column table limit (absentdata.com, 2018).  Figure 6 shows the Patient Cycle Time data visualization using Tableau software, and a brief data preparation sketch follows the figure.


Figure 6. Patient Cycle Time Data Visualization Example (Tableau, 2011).
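As a hedged illustration of preparing data for such a dashboard, the following Python sketch computes patient cycle time and exports an extract that Tableau can consume; the column names and file names are hypothetical.

# Illustrative sketch: building a patient cycle time extract for Tableau.
import pandas as pd

visits = pd.read_csv("clinic_visits.csv", parse_dates=["check_in", "check_out"])
visits["cycle_time_minutes"] = (
    (visits["check_out"] - visits["check_in"]).dt.total_seconds() / 60
)
summary = visits.groupby("clinic")["cycle_time_minutes"].mean().reset_index()
summary.to_csv("patient_cycle_time.csv", index=False)  # connect Tableau to this file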

1.7 Artificial Intelligence Integration Requirement

Artificial Intelligence (AI) is a computational technique allowing machines to perform cognitive functions, such as acting or reacting to input, similar to the way humans do (Patrizio, 2018).  Traditional computing applications react to data, and their reactions and responses must be hand-coded with human intervention (Patrizio, 2018).  AI systems are continuously in flux, changing their behavior to accommodate changes in results and modifying their reactions accordingly (Patrizio, 2018).  AI techniques can include video recognition, natural language processing, speech recognition, machine learning engines, and automation (Mills, 2018).

The healthcare system can benefit from integrating BDA with Artificial Intelligence (AI) (Bresnick, 2018).  Since AI can play a significant role in BDA in the healthcare system, this proposal suggests the implementation of machine learning, which is part of AI, to deploy more precise and impactful interventions at the right time in patient care (Bresnick, 2018).  The application of AI in the proposed design requires machine learning (Patrizio, 2018).  Since the data used in AI and machine learning is already cleaned after removing duplicates and unnecessary data, AI can take advantage of this filtered data, leading to many healthcare breakthroughs such as genomic and proteomic experiments that enable personalized medicine (Kersting & Meyer, 2018).

The healthcare industry has been utilizing AI, machine learning (ML), and data mining (DM) to extract value from BD by transforming large medical datasets into actionable knowledge through predictive and prescriptive analytics (Palanisamy & Thirunavukarasu, 2017).  ML will be used to develop sophisticated algorithms that process massive medical datasets, including structured, unstructured, and semi-structured data, to perform advanced analytics (Palanisamy & Thirunavukarasu, 2017).  Apache Mahout, an open-source ML library, will be integrated with Hadoop to facilitate the execution of scalable machine learning algorithms, offering various techniques such as recommendation, classification, and clustering (Palanisamy & Thirunavukarasu, 2017).
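As a hedged illustration of this machine learning requirement, the sketch below trains a readmission-risk classifier with Spark MLlib rather than Mahout, purely to keep the example in Python; the feature names and sample values are assumptions.

# Illustrative sketch: a readmission-risk classifier on the Spark layer.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("readmission-risk").getOrCreate()

train = spark.createDataFrame(
    [(65, 3, 1.0), (42, 1, 0.0), (78, 5, 1.0), (55, 2, 0.0)],
    ["age", "prior_admissions", "readmitted"],
)

assembler = VectorAssembler(inputCols=["age", "prior_admissions"],
                            outputCol="features")
model = Pipeline(stages=[assembler,
                         LogisticRegression(labelCol="readmitted")]).fit(train)
model.transform(train).select("age", "prediction").show()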

1.8 Internet of Things (IoT) Integration Requirement

The Internet of Things (IoT) refers to the growing number of connected devices with IP addresses, which were not common years ago (Anand & Clarice, 2015; Thompson, 2017).  These connected devices collect data and use their IP addresses to transmit information (Thompson, 2017).  Providers in healthcare take advantage of the collected information to find new treatment methods and increase efficiency (Thompson, 2017).

The implementation of IoT will involve various technologies, including radio frequency identification (RFID), near field communication (NFC), machine to machine (M2M), wireless sensor networks (WSN), and addressing schemes (AS) using IPv6 addresses (Anand & Clarice, 2015; Kumari, 2017).  The implementation of IoT requires machine learning and algorithms to find patterns, correlations, and anomalies that have the potential of enabling healthcare improvements (O’Brien, 2016).  Machine learning is a critical component of artificial intelligence; thus, the success of IoT depends on the AI implementation. 
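A minimal sketch of the device side of this IoT integration is shown below, assuming an MQTT broker as the transport between bedside devices and the streaming layer; the broker host, topic, and device identifiers are assumptions.

# Illustrative sketch: a bedside monitor publishing a vital-sign reading.
import json
import time
import paho.mqtt.client as mqtt

client = mqtt.Client()
client.connect("iot-broker.internal.example", 1883)  # placeholder broker

reading = {
    "device_id": "icu-monitor-17",
    "patient_id": "P001",
    "heart_rate": 72,
    "timestamp": time.time(),
}
client.publish("hospital/icu/vitals", json.dumps(reading), qos=1)
client.disconnect()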

1.9 Training Requirement

This design proposal requires various forms of training for IT professionals, providers, clinicians, and all others who will be using this healthcare ecosystem, depending on their roles (Alexandru et al., 2016; Archenaa & Anita, 2015).  Each component of this ecosystem should have training, such as training for Hadoop/MapReduce, Spark, security, and so forth.  Training will play a significant role in the success of this design implementation to apply BD and BDA in the healthcare system in the four States of Colorado, Utah, Arizona, and New Mexico.  Patients should be included in training for remote monitoring programs such as blood sugar monitoring and blood pressure monitoring applications.  The senior generation might face some challenges; however, with technical support, this challenge can be alleviated.

2.      Data Flow Diagram

            This section discusses the data flow of the proposed healthcare ecosystem design for the application of BDA. 

2.1 HBase Cluster and HDFS Data Flow

HBase stores data in a table schema in which the column families are specified (Yang, Liu, Hsu, Lu, & Chu, 2013).  The table schema must be predefined, and the column families must be specified; however, new columns can be added to families as required, making the schema flexible and able to adapt to changing application requirements (Yang et al., 2013).  HBase is designed in a way similar to HDFS, with a NameNode and slave nodes, and to MapReduce, with a JobTracker and TaskTracker slaves (Yang et al., 2013).  HBase will play a vital role in the Hadoop cluster environment.  In HBase, a master node called the HMaster manages the cluster, while region servers store portions of the tables and perform the work on the data.  The HMaster reflects the Master Server, is responsible for monitoring all RegionServer instances in the cluster, and is the interface for all metadata changes.  This master executes on the NameNode in the distributed Hadoop cluster environment.  The HRegionServer represents the RegionServer and is responsible for serving and managing regions.  The RegionServer runs on a DataNode in the distributed Hadoop cluster environment.  ZooKeeper will assist in selecting another machine within the cluster as HMaster in case of a failure, unlike the HDFS framework, where the NameNode is a single point of failure.  The data flow between the DataNodes and the NameNode when integrating HBase on top of HDFS is shown in Figure 7, and a brief schema-definition sketch follows the figure.  


Figure 7.  HBase Cluster Data Flow (Yang et al., 2013).
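A hedged sketch of the schema step described above is shown below, defining the table and its column families up front with the happybase client; the table name, column families, and row key layout are assumptions.

# Illustrative sketch: column families are defined up front, while individual
# columns inside a family can be added later as application needs change.
import happybase

connection = happybase.Connection("hbase-master.internal.example")
connection.create_table(
    "patient_vitals",
    {
        "signal": dict(max_versions=3),   # time-series physiological values
        "meta": dict(),                   # device and context information
    },
)

table = connection.table("patient_vitals")
# A new column ("signal:resp_rate") can be written without altering the schema.
table.put(b"P001|2018-11-02T10:16:00", {b"signal:resp_rate": b"16"})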

2.2 HBase and MongoDB with Hadoop/MapReduce and HDFS Data Flow

The healthcare system integrates four significant components: HBase, MongoDB, MapReduce, and visualization.  HBase is used for data storage, MongoDB is used for metadata, MapReduce on Hadoop is used for computation, and a data visualization tool presents the results.  The signal data will be stored in HBase, while the metadata and other clinical data will be stored in MongoDB.  The data stored in both HBase and MongoDB will be accessible from the Hadoop/MapReduce environment for processing, as well as from the data visualization layer.  The cluster will consist of one master node, eight slave nodes, and several supporting servers.  The data will be imported into Hadoop and processed via MapReduce.  The result of the computational process will be viewed through a data visualization tool such as Tableau.  Figure 8 shows the data flow between these four components of the proposed healthcare ecosystem.


Figure 8.  The Proposed Data Flow Between Hadoop/MapReduce and Other Databases.

2.3 XML Design Flow Using ETL Process with MongoDB 

Healthcare records include various types of data, from structured and semi-structured to unstructured (Luo et al., 2016).  Some of these healthcare records are XML-based records in a semi-structured format using tags.  XML stands for eXtensible Markup Language (Fawcett, Ayers, & Quin, 2012).  The healthcare sector can derive value from these XML documents, which reflect semi-structured data (Aravind & Agrawal, 2014).  An example of an XML-based patient record is shown in Figure 9.


Figure 9.  Example of the Patient’s Electronic Health Record (HL7, 2011)

XML-based records need to be ingested into the Hadoop system for analytical purposes to derive value from this semi-structured XML-based data.  However, Hadoop does not offer a standard XML RecordReader (Lublinsky, Smith, & Yakubovich, 2013), even though XML is one of the standard file formats used with MapReduce.  Various approaches can be used to process XML semi-structured data.  The ETL (Extract, Transform and Load) process can be used to process XML data in Hadoop.  MongoDB is a NoSQL database which is required in this design proposal, and it handles document-oriented data such as XML. 

The ETL process in MongoDB starts with the extract and transform steps.  The MongoDB application provides the ability to map the XML elements within the document to the downstream data structure.  The application supports the ability to unwind simple arrays or present embedded documents using appropriate data relationships such as one-to-one (1:1), one-to-many (1:M), or many-to-many (M:M) (MongoDB, 2018).  The application infers the schema information by examining a subset of documents within target collections.  Organizations can add fields to the discovered data model that may not have been present within the subset of documents used for schema inference.  The application infers information about the existing indexes for collections to be queried and prompts or warns about queries that do not use any indexed fields.  The application can return a subset of fields from documents using query projections.  For queries against MongoDB Replica Sets, the application supports the ability to specify custom MongoDB Read Preferences for individual query operations.  The application then infers information about the sharded cluster deployment and notes the shard key fields for each sharded collection.  For queries against MongoDB Sharded Clusters, the application warns against queries that do not use proper query isolation, because broadcast queries in a sharded cluster can have a negative impact on database performance (MongoDB, 2018). 

The load process in MongoDB is performed after the extract and transform process.  The application supports the ability to write data to any MongoDB deployment, whether a single node, a replica set, or a sharded cluster.  For writes to a MongoDB Sharded Cluster, the application informs or displays an error message to the user if XML documents do not contain a shard key.  A custom WriteConcern can be used for any write operations to a running MongoDB deployment.  For bulk-loading operations, documents can be written in batches using the insert() method with MongoDB version 2.6 or above, which supports the bulk update database command.  For bulk loading into a MongoDB sharded deployment, bulk insert into a sharded collection is supported, including pre-splitting the collection’s shard key and inserting via multiple mongos processes.  Figure 10 shows this ETL process for XML-based patient records using MongoDB, and a brief load sketch follows the figure.


Figure 10.  The Proposed XML ETL Process in MongoDB.
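A minimal sketch of the load step is shown below, bulk-inserting transformed records into MongoDB with an explicit write concern; the host, database, and field names are assumptions.

# Illustrative sketch: bulk loading transformed XML records into MongoDB.
from pymongo import MongoClient, InsertOne
from pymongo.write_concern import WriteConcern

client = MongoClient("mongodb://ehr-mongo.internal.example:27017/")
collection = client["ehr"].get_collection(
    "patient_records", write_concern=WriteConcern(w="majority")
)

transformed_records = [
    {"patient_id": "P001", "source": "xml", "allergies": ["penicillin"]},
    {"patient_id": "P002", "source": "xml", "allergies": []},
]

# Batched writes reduce round trips during the bulk-loading phase.
result = collection.bulk_write([InsertOne(doc) for doc in transformed_records])
print(result.inserted_count)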

2.4 Real-Time Streaming Spark Data Flow

Real-time streaming can be implemented using any real-time streaming program such as Spark, Kafka, or Storm.  This healthcare design proposal will integrate the open-source Spark program for real-time streaming data, such as sensing data from various sources including intensive care units, remote monitoring programs, and biomedical signals.  The data from these sources will flow into Spark for analytics and then be imported into the data storage systems.  Figure 11 illustrates the data flow for real-time streaming analytics, and a brief streaming sketch follows the figure.

Figure 11.  The Proposed Spark Data Flow.
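A hedged sketch of this real-time path is shown below, assuming vital-sign messages arrive on a Kafka topic and that the Spark Kafka connector package is available; broker, topic, and field names are assumptions.

# Illustrative sketch: one-minute average heart rate per patient from a stream.
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col, from_json, window
from pyspark.sql.types import DoubleType, StringType, StructType, TimestampType

spark = SparkSession.builder.appName("vitals-streaming").getOrCreate()

schema = (StructType()
          .add("patient_id", StringType())
          .add("heart_rate", DoubleType())
          .add("event_time", TimestampType()))

vitals = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "kafka.internal.example:9092")
          .option("subscribe", "icu-vitals")
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("v"))
          .select("v.*"))

averages = (vitals
            .withWatermark("event_time", "2 minutes")
            .groupBy(window("event_time", "1 minute"), "patient_id")
            .agg(avg("heart_rate").alias("avg_heart_rate")))

query = (averages.writeStream.outputMode("update")
         .format("console").start())  # downstream sink is a placeholder
query.awaitTermination()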

3.      Communication Workflow

The communication flow involves the stakeholders involved in the healthcare system.  These stakeholders include providers, insurers, pharmaceutical vendors, IT professionals, and practitioners.  The communication flow is centered on the patient-centric healthcare system using cloud computing technology for the four States of Colorado, Utah, Arizona, and New Mexico, and the stakeholders are from these States.  The patient-centric healthcare system is the central point for communication.  The patients communicate with the central system using the web-based platform and clinical forums as needed.  The providers communicate with the patient-centric healthcare system using resource usage, patient feedback, hospital visits, and service details.  The insurers communicate with the central system using claims databases and census and societal data.  The pharmaceutical vendors will communicate with the central system using prescription and drug reports, which can be retrieved by the providers from anywhere in these four States.  The IT professionals and practitioners will communicate with the central system for data streaming, medical records, genomics, and all omics data analysis and reporting.  Figure 12 shows the communication flow between these stakeholders and the central system in the cloud, which can be accessed from any of these identified four States.

Figure 12.  The Proposed Patient-Centric Healthcare System Communication Flow.

4.      Overall System Diagram

The overall system represents the state-of-the-art healthcare ecosystem that utilizes the latest technology for healthcare Big Data Analytics.  The system is bounded by regulations and policies such as HIPAA to ensure the protection of patients’ privacy across the various layers of the overall system.  The system’s integrated components include the latest Hadoop technology with MapReduce and HDFS.  The data governance layer is the bottom layer, which contains three major building blocks: master data management (MDM), data life-cycle management (DLM), and data security and privacy management.  The MDM component is responsible for data completeness, accuracy, and availability, while the DLM component is responsible for archiving the data, maintaining the data warehouse, and handling data deletion and disposal.  The data security and privacy management building block is responsible for sensitive data discovery, vulnerability and configuration assessment, security policy application, auditing and compliance reporting, activity monitoring, identity and access management, and data protection.  The top layers include the data layer, the data aggregation layer, the data analytics layer, and the information exploration layer.  The data layer is responsible for data sources and content formats, while the data aggregation layer involves various components, from the data acquisition process and transformation engines to the data storage area using Hadoop, HDFS, and NoSQL databases such as MongoDB and HBase.  The data analytics layer involves the Hadoop/MapReduce mapping process, stream computing, real-time streaming, and database analytics.  AI and IoT are part of the data analytics layer.  The information exploration layer involves data visualization, visualization reporting, real-time monitoring using a healthcare dashboard, and clinical decision support.  Figure 13 illustrates the overall system diagram with these layers.


Figure 13.  The Proposed Healthcare Overall System Diagram.

5.      Regulations, Policies, and Governance for the Medical Industry

Healthcare data must be stored in a secure storage area to protect the information and the privacy of patients (Liveri, Sarri, & Skouloudi, 2015).  When the healthcare industry fails to comply with regulations and policies, the fines and costs can cause financial stress on the industry (Thompson, 2017).  Records show that the healthcare industry has paid millions of dollars in fines.  Advocate Health Care in suburban Chicago agreed to the most significant figure as of August 2016, with a total amount of $5.55 million (Thompson, 2017).  Memorial Health System in southern Florida became the second entity to exceed $5 million (Thompson, 2017).  Table 2 shows the five most substantial fines posted to the Office of Civil Rights (OCR) site. 

Table 2.  Five Largest Fines Posted to OCR Web Site (Thompson, 2017)

The hospitals must adhere carefully to the data privacy regulations and legislative rules to protect patients’ medical records from data breaches (HIPAA).  Proper security policy and risk management must be implemented to ensure the protection of private information and to minimize the impact of confidential data loss or theft (HIPAA, 2018a, 2018c; Salido, 2010).  The healthcare system design proposal requires the implementation of a compliance system, with an escalation path, for those hospitals or providers who are not compliant with the regulations and policies (Salido, 2010).  This design proposal implements four major principles as the best practice to comply with the required policies and regulations and protect the confidential data assets of the patients and users (Salido, 2010).  The first principle is to honor policies throughout the life of private data (Salido, 2010).  The second principle is to minimize the risk of unauthorized access or misuse of confidential data (Salido, 2010).  The third principle is to minimize the impact of confidential data loss, while the fourth principle is to document appropriate controls and demonstrate their effectiveness (Salido, 2010).  Figure 14 shows these four principles, to which this healthcare design proposal adheres in order to protect healthcare data from unauthorized users and comply with the required regulations and policies. 


Figure 14.  Healthcare Design Proposal Four Principles.

6.      Assumptions and Limitations

This design proposal assumes that the healthcare sector in the four States will support the application of BD and BDA across these four States.  The support includes investment in the proper technology, tools, and training based on the requirements of this design proposal.  The proposal also assumes that the stakeholders, including providers, patients, insurers, pharmaceutical vendors, and practitioners, will welcome the application of BDA and take advantage of it to provide efficient healthcare services, increase productivity, decrease costs for the healthcare sector as well as for patients, and provide better care to patients.

            The limitation of this proposal is the timeframe that is required to implement it.  With the support of the healthcare sector in these four States, the implementation can be expedited.  However, the silo and rigid culture of healthcare may interfere with the implementation, which could then take longer than expected.  The initial implementation might face unexpected challenges, most likely arising from the lack of experienced IT professionals and managers in the BD and BDA domain.  This design proposal will be enhanced based on the observations from the first few months of the implementation. 

7.      The Justification for the Overall Design

            Traditional database and analytical systems are found inadequate when dealing with healthcare data in the age of BDA.  The characteristics of healthcare datasets, including the large volume of medical records, the variety of the data from structured to semi-structured to unstructured, and the velocity of data generation and processing, require technology such as cloud computing (Fernández et al., 2014).  Cloud computing is found to be the best solution when dealing with BD and BDA to address the challenges of BD storage and the demands of computation-intensive processing (Alexandru et al., 2016; Hashem et al., 2015).  The healthcare system in the four States will shift the communication technology and services for applications across the hospitals and providers (Hashem et al., 2015).  Some of the advantages of cloud computing adoption include virtualized resources, parallel processing, security, and data service integration with scalable data storage (Hashem et al., 2015).  With cloud computing technology, the healthcare sector in the four States will reduce costs and increase efficiency (Hashem et al., 2015).  When quick access to critical data for patient care is required, the mobility of accessing the data from anywhere is one of the most significant advantages of the cloud computing adoption recommended by this proposed design (Carutasu, Botezatu, Botezatu, & Pirnau, 2016).  The benefits of cloud computing include technological benefits such as virtualization, multi-tenancy, data and storage, and security and privacy compliance (Chang, 2015).  Cloud computing also offers economic benefits such as pay per use, cost reduction, and return on investment (Chang, 2015).  The non-functional benefits of cloud computing cover elasticity, quality of service, reliability, and availability (Chang, 2015).  Thus, the proposed design justifies the use of cloud computing for these benefits, as cloud computing is proven to be the best technology for BDA, especially for healthcare data analytics.

            Although cloud computing offers several benefits to the proposed healthcare system, it has been suffering from security and privacy concerns (Balasubramanian & Mala, 2015; Kazim & Zhu, 2015).  The security concerns involve risk areas such as external data storage, dependency on the public internet, lack of control, multi-tenancy, and integration with internal security (Hashizume, Rosado, Fernández-medina, & Fernandez, 2013).  The traditional security techniques such as identity, authentication, and authorization are not sufficient for cloud computing environments in their current forms under the standard deployment models of the public cloud and private cloud (Hashizume et al., 2013).  The increasing trend of security threats and data breaches, and the inability of the current private and public cloud deployment models to meet these security challenges, have triggered the need for another deployment model to ensure security and privacy protection.  Thus, the VPC deployment model emerged as a new deployment model of cloud computing technology (Botta et al., 2016; Sultan, 2010; Venkatesan, 2012; Zhang, Q., Cheng, & Boutaba, 2010).  The VPC takes advantage of technologies such as the virtual private network (VPN), which will allow hospitals and providers to set up their required network settings such as security (Botta et al., 2016; Sultan, 2010; Venkatesan, 2012; Zhang, Q. et al., 2010).  The VPC deployment model will have dedicated resources with the VPN to provide the required isolation and security to protect the patients’ information (Botta et al., 2016; Sultan, 2010; Venkatesan, 2012; Zhang, Q. et al., 2010).  Thus, this proposed design will use the VPC cloud computing deployment model to store and use healthcare data in a secure and isolated environment to protect the patients’ medical records (Regola & Chawla, 2013).

The Hadoop ecosystem is a required component in this proposed design for several reasons.  Hadoop technology is a commonly used computing paradigm for massive-volume data processing in cloud computing (Bansal et al., 2014; Chrimes et al., 2018; Dhotre et al., 2015).  Hadoop is the only technology that enables large volumes of healthcare data to be stored in their native forms (Dezyre, 2016).  Hadoop has proven useful in developing better treatments for diseases such as cancer by accelerating the design and testing of effective treatments tailored to patients, expanding genetically based clinical cancer trials, and establishing a national cancer knowledge network to guide treatment decisions (Dezyre, 2016).  With the Hadoop system, hospitals in the four States will be able to monitor patients’ vitals (Dezyre, 2016).  Children’s Healthcare of Atlanta is an example of using the Hadoop ecosystem to treat over six thousand children in its ICU units (Dezyre, 2016).

The proposed design requires the integration of NoSQL databases because they offer benefits such as mass storage support, fast read and write operations, and easy, low-cost expansion (Sahafizadeh & Nematbakhsh, 2015).  HBase is proposed as a required NoSQL database as it is faster when reading more than six million variants, which is required when analyzing large healthcare datasets (Luo et al., 2016).  In addition, a query engine such as SeqWare can be integrated with HBase as needed to help bioinformatics researchers access large-scale whole-genome datasets (Luo et al., 2016).  HBase can store clinical sensor data, where the row key serves as the time stamp of a single value and the column stores patients’ physiological values corresponding to the row key time stamp (Luo et al., 2016).  HBase is a scalable, high-performance, low-cost NoSQL data store that can be integrated with Hadoop, sitting on top of HDFS (Yang et al., 2013).  As a column-oriented NoSQL data store that runs on top of HDFS in the Hadoop ecosystem, HBase is well suited to parse large healthcare datasets (Yang et al., 2013).  HBase supports applications written with Avro, REST, and Thrift (Yang et al., 2013).  MongoDB is another NoSQL data store, which will be used to store metadata to improve the accessibility and readability of the HBase data schema (Luo et al., 2016).

The integration of Spark is required to overcome Hadoop's limitation for real-time data processing, for which Hadoop is not optimal (Guo, 2013).  Apache Spark is therefore a required component of this proposal so that the healthcare BDA system can process data at rest using batch techniques as well as data in motion using real-time processing (Liang & Kelemen, 2016).  Spark allows in-memory processing for fast response times, bypassing MapReduce operations (Liang & Kelemen, 2016).   Spark also integrates well with recent Hadoop cluster deployments (Scott, 2015).  While Spark on its own is a powerful tool for processing large volumes of medical and healthcare datasets, it is not well suited to production workloads by itself.  Thus, integrating Spark with the Hadoop ecosystem provides capabilities that neither Spark nor Hadoop can offer on its own.
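The following sketch illustrates the kind of real-time processing Spark enables, using PySpark Structured Streaming to flag abnormal heart-rate readings as they arrive.  The socket source, field layout, and alert thresholds are assumptions chosen only to keep the example self-contained.

from pyspark.sql import SparkSession
from pyspark.sql.functions import split, col

# Illustrative only: source, schema, and thresholds are assumed values.
spark = SparkSession.builder.appName("VitalsAlerting").getOrCreate()

# Each incoming line is assumed to be "patient_id,heart_rate".
lines = (spark.readStream.format("socket")
         .option("host", "localhost").option("port", 9999).load())

vitals = lines.select(
    split(col("value"), ",").getItem(0).alias("patient_id"),
    split(col("value"), ",").getItem(1).cast("int").alias("heart_rate"),
)

# Keep only readings outside a plausible range and surface them immediately.
alerts = vitals.where((col("heart_rate") > 130) | (col("heart_rate") < 40))
query = alerts.writeStream.outputMode("append").format("console").start()
query.awaitTermination()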

The integration of AI as part of this proposal is justified by a Harvard Business Review (HBR) examination of ten promising AI applications in healthcare (Kalis, Collier, & Fu, 2018). The findings of HBR's examination showed that the application of AI could create up to $150 billion in annual savings for U.S. healthcare by 2026 (Kalis et al., 2018).  The results also showed that AI currently creates the most value in helping frontline clinicians be more productive and in making back-end processes more efficient (Kalis et al., 2018).   Furthermore, IBM invested $1 billion in AI through the IBM Watson Group, and the healthcare industry is the most significant application of Watson (Power, 2015).

Conclusion

Big Data and Big Data Analytics have played significant roles in various industries, including the healthcare industry.  The value driven by BDA can save lives and minimize costs for patients.  This project proposes a design to apply BDA in the healthcare system across the four States of Colorado, Utah, Arizona, and New Mexico.  Cloud computing is the most appropriate technology to deal with the large volume of healthcare data.  Due to the security issues of cloud computing, the Virtual Private Cloud (VPC) will be used.  VPC provides a secure cloud environment by controlling network traffic with security groups and network access control lists. 

The project requires other components to be fully implemented using the latest technology, such as Hadoop and MapReduce for data processing, and machine learning as the artificial intelligence that will support the Internet of Things (IoT).  The NoSQL databases HBase and MongoDB will be used to handle semi-structured data such as XML and unstructured data such as logs and images.  Spark will be used for real-time data processing, which can be vital for urgent care and emergency services.  This project addressed the assumptions and limitations, plus the justification for selecting these specific components. 

In summary, all stakeholders in the healthcare sector, including providers, insurers, pharmaceutical companies, and practitioners, should cooperate and coordinate to facilitate the implementation process.  All stakeholders are responsible for facilitating the integration of BD and BDA into the healthcare system.  The rigid culture and silo pattern need to change for a better healthcare system, which can save the healthcare industry millions of dollars while providing excellent care to patients at the same time.

References

Abdul, A. M., Jena, S., Prasad, S. D., & Balraju, M. (2014). Trusted Environment In Virtual Cloud. International Journal of Advanced Research in Computer Science, 5(4).

Abernathy, R., & McMillan, T. (2016). CISSP Cert Guide: Pearson IT Certification.

absentdata.com. (2018). Tableau Advantages and Disadvantages. Retrieved from https://www.absentdata.com/advantages-and-disadvantages-of-tableau/.

Alexandru, A., Alexandru, C., Coardos, D., & Tudora, E. (2016). Healthcare, Big Data and Cloud Computing. management, 1, 2.

Alguliyev, R., & Imamverdiyev, Y. (2014). Big data: big promises for information security. Paper presented at the Application of Information and Communication Technologies (AICT), 2014 IEEE 8th International Conference on.

Anand, M., & Clarice, S. (2015). Artificial Intelligence Meets Internet of Things. Retrieved from http://www.ijcset.net/docs/Volumes/volume5issue6/ijcset2015050604.pdf.

Ankam, V. (2016). Big Data Analytics: Packt Publishing Ltd.

Aravind, P. S., & Agrawal, V. (2014). Processing XML data in BigInsights 3.0. Retrieved from https://developer.ibm.com/hadoop/2014/10/31/processing-xml-data-biginsights-3-0/.

Archenaa, J., & Anita, E. M. (2015). A survey of big data analytics in healthcare and government. Procedia Computer Science, 50, 408-413.

Assunção, M. D., Calheiros, R. N., Bianchi, S., Netto, M. A. S., & Buyya, R. (2015). Big Data Computing and Clouds: Trends and Future Directions. Journal of Parallel and Distributed Computing, 79, 3-15. doi:10.1016/j.jpdc.2014.08.003

Balasubramanian, V., & Mala, T. (2015). A Review On Various Data Security Issues In Cloud Computing Environment And Its Solutions. Journal of Engineering and Applied Sciences, 10(2).

Bansal, A., Deshpande, A., Ghare, P., Dhikale, S., & Bodkhe, B. (2014). Healthcare data analysis using dynamic slot allocation in Hadoop. International Journal of Recent Technology and Engineering, 3(5), 15-18.

Basu, A. (2014). Real-Time Healthcare Analytics on Apache Hadoop* using Spark* and Shark. Retrieved from https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/big-data-real-time-healthcare-analytics-whitepaper.pdf.

Botta, A., de Donato, W., Persico, V., & Pescapé, A. (2016). Integration of Cloud Computing and Internet Of Things: a Survey. Future Generation computer systems, 56, 684-700.

Bresnick, J. (2018). Top 12 Ways Artificial Intelligence Will Impact Healthcare. Retrieved from https://healthitanalytics.com/news/top-12-ways-artificial-intelligence-will-impact-healthcare.

Carutasu, G., Botezatu, M., Botezatu, C., & Pirnau, M. (2016). Cloud Computing and Windows Azure. Electronics, Computers and Artificial Intelligence.

Chang, V. (2015). A Proposed Framework for Cloud Computing Adoption. International Journal of Organizational and Collective Intelligence, 6(3).

Chrimes, D., Zamani, H., Moa, B., & Kuo, A. (2018). Simulations of Hadoop/MapReduce-Based Platform to Support its Usability of Big Data Analytics in Healthcare.

Cloud Security Alliance. (2013). The Notorious Nine: Cloud Computing Top Threats in 2013. Cloud Security Alliance: Top Threats Working Group. 

Cloud Security Alliance. (2016). The Treacherous 12: Cloud Computing Top Threats in 2016. Cloud Security Alliance: Top Threats Working Group. 

Cloud Security Alliance. (2017). The Treacherous 12 Top Threats to Cloud Computing. Cloud Security Alliance: Top Threats Working Group. 

Dezyre. (2016). 5 Healthcare Applications of Hadoop and Big Data Retrieved from https://www.dezyre.com/article/5-healthcare-applications-of-hadoop-and-big-data/85.

Dhotre, P., Shimpi, S., Suryawanshi, P., & Sanghati, M. (2015). Health Care Analysis Using Hadoop. International Journal of Scientific & Technology Research, 4(12), 279-281.

EMC. (2015). Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data. (1st ed.): Wiley.

Fawcett, J., Ayers, D., & Quin, L. R. (2012). Beginning XML: John Wiley & Sons.

Fernández, A., del Río, S., López, V., Bawakid, A., del Jesus, M. J., Benítez, J. M., & Herrera, F. (2014). Big Data with Cloud Computing: An Insight on the Computing Environment, MapReduce, and Programming Frameworks. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 4(5), 380-409. doi:10.1002/widm.1134

Fox, M., & Vaidyanathan, G. (2016). Impacts of Healthcare Big Data: A Framework With Legal and Ethical Insights. Issues in Information Systems, 17(3).

Ghani, K. R., Zheng, K., Wei, J. T., & Friedman, C. P. (2014). Harnessing big data for health care and research: are urologists ready? European urology, 66(6), 975-977.

Grover, M., Malaska, T., Seidman, J., & Shapira, G. (2015). Hadoop Application Architectures: Designing Real-World Big Data Applications: O'Reilly Media, Inc.

Groves, P., Kayyali, B., Knott, D., & Kuiken, S. V. (2016). The ‘Big Data’ Revolution in Healthcare: Accelerating Value and Innovation.

Guo, S. (2013). Hadoop operations and cluster management cookbook: Packt Publishing Ltd.

Gupta, R., Gupta, H., & Mohania, M. (2012). Cloud Computing and Big Data Analytics: What is New From Databases Perspective? Paper presented at the International Conference on Big Data Analytics, Springer-Verlag Berlin Heidelberg.

Hashem, I. A. T., Yaqoob, I., Anuar, N. B., Mokhtar, S., Gani, A., & Khan, S. U. (2015). The Rise of “Big Data” on Cloud Computing: Review and Open Research Issues. Information Systems, 47, 98-115. doi:10.1016/j.is.2014.07.006

Hashizume, K., Rosado, D. G., Fernández-medina, E., & Fernandez, E. B. (2013). An analysis of security issues for cloud computing. Journal of internet services and applications, 4(1), 1-13. doi:10.1186/1869-0238-4-5

HIMSS. (2018). 2017 Security Metrics:  Guide to HIPAA Compliance: What Healthcare Entities and Business Associates Need to Know. . Retrieved on 12/1/2018 from  http://www.himss.org/file/1318331/download?token=h9cBvnl2. 

HIPAA. (2018a). At Least 3.14 Million Healthcare Records Were Exposed in Q2, 2018. Retrieved 11/22/2018 from https://www.hipaajournal.com/q2-2018-healthcare-data-breach-report/. 

HIPAA. (2018b). How to Defend Against Insider Threats in Healthcare. Retrieved 8/22/2018 from https://www.hipaajournal.com/category/healthcare-cybersecurity/. 

HIPAA. (2018c). Q3 Healthcare Data Breach Report: 4.39 Million Records Exposed in 117 Breaches. Retrieved 11/22/2018 from https://www.hipaajournal.com/q3-healthcare-data-breach-report-4-39-million-records-exposed-in-117-breaches/. 

HIPAA. (2018d). Report: Healthcare Data Breaches in Q1, 2018. Retrieved 5/15/2018 from https://www.hipaajournal.com/report-healthcare-data-breaches-in-q1-2018/. 

HL7. (2011). Patient Example Instance in XML.  

Hu, H., Wen, Y., Chua, T., & Li, X. (2014). Toward Scalable Systems for Big Data Analytics: A Technology Tutorial. Practical Innovation, Open Solution, 2, 652-687. doi:10.1109/ACCESS.2014.2332453

InformationBuilders. (2018). Data In Motion – Big Data Analytics in Healthcare. Retrieved from http://docs.media.bitpipe.com/io_10x/io_109369/item_674791/datainmotionbigdataanalytics.pdf, White Paper.

Jayasingh, B. B., Patra, M. R., & Mahesh, D. B. (2016, 14-17 Dec. 2016). Security issues and challenges of big data analytics and visualization. Paper presented at the 2016 2nd International Conference on Contemporary Computing and Informatics (IC3I).

Ji, Z., Ganchev, I., O’Droma, M., Zhang, X., & Zhang, X. (2014). A cloud-based X73 ubiquitous mobile healthcare system: design and implementation. The Scientific World Journal, 2014.

Kalis, B., Collier, M., & Fu, R. (2018). 10 Promising AI Applications in Health Care. Retrieved from https://hbr.org/2018/05/10-promising-ai-applications-in-health-care, Harvard Business Review.

Karanth, S. (2014). Mastering Hadoop: Packt Publishing Ltd.

Kazim, M., & Zhu, S. Y. (2015). A Survey on Top Security Threats in Cloud Computing. International Journal Advanced Computer Science and Application, 6(3), 109-113.

Kersting, K., & Meyer, U. (2018). From Big Data to Big Artificial Intelligence? : Springer.

Klein, J., Gorton, I., Ernst, N., Donohoe, P., Pham, K., & Matser, C. (2015, June 27-July 2, 2015). Application-Specific Evaluation of NoSQL Databases. Paper presented at the 2015 IEEE International Congress on Big Data.

Kritikos, K., Kirkham, T., Kryza, B., & Massonet, P. (2017). Towards a Security-Enhanced PaaS Platform for Multi-Cloud Applications. Future Generation computer systems, 67, 206-226. doi:10.1016/j.future.2016.10.008

Kumari, W. M. P. (2017). Artificial Intelligence Meets Internet of Things.

Liang, Y., & Kelemen, A. (2016). Big Data Science and its Applications in Health and Medical Research: Challenges and Opportunities. Austin Journal of Biometrics & Biostatistics, 7(3).

Liveri, D., Sarri, A., & Skouloudi, C. (2015). Security and Resilience in eHealth: Security Challenges and Risks. European Union Agency For Network And Information Security.

Lublinsky, B., Smith, K. T., & Yakubovich, A. (2013). Professional hadoop solutions: John Wiley & Sons.

Luo, J., Wu, M., Gopukumar, D., & Zhao, Y. (2016). Big data application in biomedical research and health care: a literature review. Biomedical informatics insights, 8, BII. S31559.

Malik, L., & Sangwan, S. (2015). MapReduce Framework Implementation on the Prescriptive Analytics of Health Industry. International Journal of Computer Science and Mobile Computing, ISSN, 675-688.

Maltby, D. (2011). Big Data Analytics. Paper presented at the Annual Meeting of the Association for Information Science and Technology.

Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Byers, A. H. (2011). Big Data: The Next Frontier for Innovation, Competition, and Productivity. McKinsey Global Institute.

McKelvey, N., Curran, K., Gordon, B., Devlin, E., & Johnston, K. (2015). Cloud Computing and Security in the Future Guide to Security Assurance for Cloud Computing (pp. 95-108): Springer.

Mehmood, A., Natgunanathan, I., Xiang, Y., Hua, G., & Guo, S. (2016). Protection of Big Data Privacy. Institute of Electrical and Electronic Engineers, 4, 1821-1834. doi:10.1109/ACCESS.2016.2558446

Meyer, M. (2018). The Rise of Healthcare Data Visualization.

Mills, T. (2018). Eight Ways Big Data And AI Are Changing The Business World.

MongoDB. (2018). ETL Best Practice.  

O’Brien, B. (2016). Why The IoT Needs ARtificial Intelligence to Succeed.

Palanisamy, V., & Thirunavukarasu, R. (2017). Implications of Big Data Analytics in developing Healthcare Frameworks–A review. Journal of King Saud University-Computer and Information Sciences.

Patrizio, A. (2018). Big Data vs. Artificial Intelligence.

Power, B. (2015). Artificial Intelligence Is Almost Ready for Business.

Raghupathi, W., & Raghupathi, V. (2014). Big data analytics in healthcare: promise and potential. Health Information Science and Systems, 2(1), 1.

Regola, N., & Chawla, N. (2013). Storing and Using Health Data in a Virtual Private Cloud. Journal of medical Internet research, 15(3), 1-12. doi:10.2196/jmir.2076

Sahafizadeh, E., & Nematbakhsh, M. A. (2015). A Survey on Security Issues in Big Data and NoSQL. Int’l J. Advances in Computer Science, 4(4), 2322-5157.

Salido, J. (2010). Data Governance for Privacy, Confidentiality and Compliance: A Holistic Approach. ISACA Journal, 6, 17.

Scott, J. A. (2015). Getting Started with Spark: MapR Technologies, Inc.

Stewart, J., Chapple, M., & Gibson, D. (2015). ISC Official Study Guide.  CISSP Security Professional Official Study Guide (7th ed.): Wiley.

Sultan, N. (2010). Cloud Computing for Education: A New Dawn? International Journal of Information Management, 30(2), 109-116. doi:10.1016/j.ijinfomgt.2009.09.004

Sun, J., & Reddy, C. (2013). Big Data Analytics for Healthcare. Retrieved from https://www.siam.org/meetings/sdm13/sun.pdf.

Tableau. (2011). Three Ways Healthcare Providers are Transforming Data from Information to Insight. White Paper.

Thompson, E. C. (2017). Building a HIPAA-Compliant Cybersecurity Program, Using NIST 800-30 and CSF to Secure Protected Health Information.

Van-Dai, T., Chuan-Ming, L., & Nkabinde, G. W. (2016, 5-7 July 2016). Big data stream computing in healthcare real-time analytics. Paper presented at the 2016 IEEE International Conference on Cloud Computing and Big Data Analysis (ICCCBDA).

Venkatesan, T. (2012). A Literature Survey on Cloud Computing. i-Manager’s Journal on Information Technology, 1(1), 44-49.

Wang, Y., Kung, L. A., & Byrd, T. A. (2018). Big Data Analytics: Understanding its Capabilities and Potential Benefits for Healthcare Organizations. Technological Forecasting and Social Change, 126, 3-13. doi:10.1016/j.techfore.2015.12.019

Wicklund, E. (2014). ‘Silo’ one of healthcare’s biggest flaws. Retrieved from http://www.healthcareitnews.com/news/silo-one-healthcares-biggest-flaws.

Yang, C. T., Liu, J. C., Hsu, W. H., Lu, H. W., & Chu, W. C. C. (2013, 16-18 Dec. 2013). Implementation of Data Transform Method into NoSQL Database for Healthcare Data. Paper presented at the 2013 International Conference on Parallel and Distributed Computing, Applications and Technologies.

Zhang, Q., Cheng, L., & Boutaba, R. (2010). Cloud Computing: State-of-the-Art and Research Challenges. Journal of internet services and applications, 1(1), 7-18. doi:10.1007/s13174-010-0007-6

Zhang, R., & Liu, L. (2010). Security models and requirements for healthcare application clouds. Paper presented at the Cloud Computing (CLOUD), 2010 IEEE 3rd International Conference on.

Zia, U. A., & Khan, N. (2017). An Analysis of Big Data Approaches in Healthcare Sector. International Journal of Technical Research & Science, 2(4), 254-264.

 

NoSQL Database Application to Health Informatics Data Analytics.

Dr. Aly, O.
Computer Science

Introduction

The purpose of this discussion is to analyze a NoSQL database type, such as Cassandra or MongoDB, and how it is applied to health informatics data analytics. The discussion is based on a project implemented by Klein et al. (2015).  In this project, the researchers performed application-specific prototyping and measurement to identify NoSQL products that fit a data model and could support the query use cases while meeting the provider's performance requirements.  The provider was a healthcare provider with specific requirements for a new Electronic Health Record (EHR) system.  The project evaluated three NoSQL databases, Cassandra, MongoDB, and Riak, as candidates based on product maturity and the availability of enterprise support.  The researchers faced the challenge of selecting the right NoSQL database during their work on the project.

This research study was selected because it is comprehensive and contains rich information about the implementation of these three data stores in healthcare. Moreover, the study includes additional useful information regarding healthcare, such as HL7 and the healthcare-specific data models "FHIR Patient Resources" and "FHIR Observation Resources," as well as the YCSB performance framework.  

NoSQL Database Application in Healthcare

The provider had been using a thick-client system running at each site around the globe, connected to a centralized relational database.  The provider had no experience with NoSQL.  The purpose of the project was to evaluate NoSQL databases that would meet its needs.  

The provider was a large healthcare organization requesting a new EHR system to support healthcare delivery for over nine million patients in more than 100 facilities across the world.  The data grows at a rate of more than one terabyte per month and must be retained for ninety-nine years.  NoSQL technology was considered for two major reasons.  The first was to serve as the primary data store for the EHR system.  The second was to improve request latency and availability by using a local cache at each site.  

The project involved four major steps, as discussed below.  Step four involved five major configuration tasks to test the identified data stores, also discussed below.  The EHR system requires strong replica consistency, so the identified data stores were compared under strong replica consistency versus eventual consistency.   

Project Implementation

Step 1: Identify the Requirements:  The first step in this project was to identify the requirements from the provider's stakeholders. These requirements were used to develop the evaluation of the NoSQL databases.  There were two main requirements.  The first requirement was high availability and low latency under high load in a distributed system.   This first requirement made performance and scalability the measures used to evaluate the NoSQL candidates.  The second requirement concerned the logical data models and query patterns supported by NoSQL, and replica consistency in a distributed framework.  This requirement made data model mapping the measure used to evaluate the NoSQL candidates.

Step 2:  Define Two Primary Use Cases for the EHR System:  The provider defined two specific use cases for the EHR system.  The first use case was to read recent medical test results for a patient. This use case is regarded as the core function used to populate the user interface when a clinician selects a new patient.  The second use case was to achieve strong replica consistency when a new medical test result is written for a patient. The purpose of this strong replica consistency is to allow all clinicians using the EHR system to see the same information when making healthcare decisions for the patient, regardless of whether the patient is at the same site or at another location.

Step 3:  Select the Candidate NoSQL Databases:  The provider requested the evaluation of different NoSQL data models, such as key-value, column, and document stores, to determine the best-fit NoSQL database for its requirements.  Thus, Cassandra, MongoDB, and Riak were the candidates for this project, based on product maturity and enterprise support.

Step 4:  Performance Test Design and Execution:  A systematic test process was designed and executed to evaluate the three candidates against the use-case requirements defined earlier. This systematic test process included five major tasks, summarized in Table 1. 

Table 1:  Summary of the Performance Tests Design and Execution Tasks.

Task 1:  Test Environment Configuration:  The test environment was developed using the three identified NoSQL databases: MongoDB, Cassandra, and Riak.  Table 2 shows the identified NoSQL databases, their types, versions, and sources.  The test environment included two configurations. The first configuration was a single-node server, used to validate the test environment for each database type.  The second configuration was a nine-node environment representing the production environment, geographically distributed across three data centers.   The dataset was sharded across three nodes and then replicated to two additional groups of three nodes each.  The replication configuration was implemented differently for each NoSQL database.  For instance, MongoDB replicas were implemented using its primary/secondary feature; Cassandra used its built-in data-center-aware distribution feature; and in Riak the data was sharded across all nine nodes, with three replicas of each shard stored across the nine nodes.   Amazon EC2 (Elastic Compute Cloud) instances were used to implement the test environment.  Table 3 describes the EC2 types and sources.

Table 2: Summary of Identified NoSQL Databases, Types, Versions, Sources, and Implementation.

Table 3:  Details of Nodes, Types, and Size.
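As a rough sketch of how a data-center-aware replication layout like the one described for Cassandra might be declared, the snippet below uses the DataStax Python driver to create a keyspace with one replica per data center, so each of three data centers holds a full copy sharded across its nodes.  The contact points, keyspace name, and data center names are assumptions, not values from the study.

from cassandra.cluster import Cluster  # DataStax Python driver for Cassandra

# Illustrative only: contact points and names are assumed.
cluster = Cluster(["10.0.1.10", "10.0.2.10", "10.0.3.10"])
session = cluster.connect()

# NetworkTopologyStrategy places the requested number of replicas in each
# data center, approximating the layout described above.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS ehr
    WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'dc1': 1, 'dc2': 1, 'dc3': 1
    }
""")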

Task 2:  Data Model Mapping:  A logical data model was mapped to the healthcare data model identified for the project, HL7 Fast Healthcare Interoperability Resources (FHIR).  HL7 (Health Level Seven) refers to a set of international standards for transferring clinical and administrative data between the software applications used by various healthcare providers.  These standards focus on the application layer, which is layer 7 in the OSI model (Beeler, 2010).   Two models were used: "FHIR Patient Resources" and "FHIR Observation Resources."  Patient information such as names, addresses, and phone numbers was modeled using "FHIR Patient Resources," while test-result information such as test type, result quantity, and result units was modeled using "FHIR Observation Resources."  The relationship between patient and test results was one-to-many (1:M), and the relationship between patient and observations was also one-to-many, as illustrated in Figure 1.

Figure 1:  The Logical Data Model and the Relationship between Patient, Test Result, and Observations.

This 1:M relationship between the patient and the test results/observations, combined with the need to efficiently access the most recently written test results, was challenging for each of the identified NoSQL data stores.  For MongoDB, the researchers used a composite index of two attributes (Patient ID, Observation ID) for the test-result records, indexed by the lab result date-time stamp.  This approach enabled efficient retrieval of the most recent test-result records for a particular patient.  For Cassandra, the researchers used a similar approach but with a composite index of three attributes (Patient ID, Lab Result, Date-Time Stamp).  Retrieval of the most recent test results was efficient with this three-attribute composite index because the results were returned already sorted by the server. In Riak, the 1:M relationship was more complicated to support than in MongoDB and Cassandra.  Riak's key-value data model enables retrieval of a value by its unique key, and Riak offers a "secondary index" feature to avoid a full scan when the key is not known.   However, each node in the cluster stores the secondary indices only for the shards stored on that node.  A query that matches a secondary index therefore results in a "scatter-gather" operation: the "request coordinator" asks each node for records with the requested secondary index value, waits for all nodes to respond, and then sends the list of keys for the matching records back to the requester.  This operation adds latency to locate records, and the need for two round trips to retrieve the records had a negative impact on Riak's performance.  Furthermore, Riak has no mechanism to filter and return only the most recent observations for a patient, so all matching records must be returned and then sorted and filtered by the client.   Table 4 summarizes the data model mapping for each of the identified data stores and the impact on performance.

Table 4:  Data Model Mapping and Impact on Performance.
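A minimal sketch of the MongoDB side of this mapping is shown below using pymongo: a compound index over patient ID and result timestamp supports efficient retrieval of a patient's most recent observations.  The connection string, collection, and field names are assumptions rather than the schema used in the study.

from pymongo import MongoClient, ASCENDING, DESCENDING

# Illustrative only: connection string, collection, and field names are assumed.
client = MongoClient("mongodb://localhost:27017")
observations = client["ehr"]["observations"]

# Compound index so the most recent test results for a patient can be read efficiently.
observations.create_index([("patient_id", ASCENDING),
                           ("result_datetime", DESCENDING)])

# Five most recent observations for one patient.
recent = (observations.find({"patient_id": "patient123"})
          .sort("result_datetime", DESCENDING)
          .limit(5))
for doc in recent:
    print(doc)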

Task 3: Data Generation and Load: The dataset contained one million patient records (N=1,000,000) and ten million test-result records (N=10,000,000). The number of test results per patient ranged from zero to twenty, with an average of seven (Min=0, Max=20, Mean=7). 
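The shape of this synthetic dataset can be reproduced with a short generator such as the sketch below; the distribution used to reach a mean of roughly seven results per patient is an assumption, since the study reports only the minimum, maximum, and mean.

import random

# Illustrative only: the truncated normal distribution is an assumed choice.
random.seed(42)
NUM_PATIENTS = 1_000_000

def generate_patients(n=NUM_PATIENTS):
    for patient_id in range(n):
        num_results = min(20, max(0, round(random.gauss(7, 4))))
        results = [{"patient_id": patient_id,
                    "observation_id": i,
                    "value": round(random.uniform(3.5, 10.5), 2)}
                   for i in range(num_results)]
        yield {"patient_id": patient_id, "results": results}

# Example: count test-result records generated for the first 1,000 patients.
total = sum(len(p["results"]) for _, p in zip(range(1000), generate_patients()))
print(total)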

Task 4: Load Test Client:  For this task, the researchers used the Yahoo Cloud Serving Benchmark (YCSB) framework to manage test execution and collect the measurements.  One of the key features of YCSB, as indicated in (Cooper, Silberstein, Tam, Ramakrishnan, & Sears, 2010), is its extensibility, which allows easy definition of new workload types and the flexibility to benchmark new systems.  The YCSB framework and workloads are available as open source for system evaluations.  The researchers replaced YCSB's simple data models, data sets, and queries with ones reflecting the provider's specific use cases.  Another YCSB feature is the ability to specify the total number of operations and the mix of read and write operations in a workload.  The researchers used this feature and applied 80% read and 20% write operations for each EHR workload, in line with the provider's requirements.  The read operations retrieved the five most recent observations for a patient, and the write operations inserted a new observation record for a patient.   Two workload use cases were used.  The first was to test the data store as a local cache, using a write-only workload performed on a daily basis to load a local cache from a centralized primary data store with records for patients with a scheduled appointment that day.  The second use case used a read workload to flush the cache back to the centralized primary data store, as illustrated in Figure 2.

Figure 2.  Read (R) and Write (W) Workloads.

The "operation latency" was measured by the YCSB framework as the time between the request and the response from the data store.   Read and write operation latencies were calculated separately using the YCSB framework.  Besides the "operation latency," the latency distribution is a key scalability metric in Big Data Analytics, so the researchers recorded both the average and the 95th percentile values.  Moreover, the researchers extended the test to include overall throughput in operations per second, computed as the total number of read and write operations divided by the total workload execution time, excluding the time for initial setup and cleanup, to obtain a more accurate result. 
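The metrics described here can be computed directly from raw per-operation latencies, as in the sketch below; the sample values and run time are made up solely to show the calculations, not measurements from the study.

import statistics

# Made-up latency samples standing in for what YCSB would record.
read_latencies_ms = [2.1, 2.4, 3.0, 2.2, 9.8, 2.5, 2.3, 3.1, 2.8, 2.6]
write_latencies_ms = [4.0, 4.2, 5.1, 4.4, 12.3, 4.1, 4.6, 4.3, 4.9, 4.5]

def p95(samples):
    ordered = sorted(samples)
    index = max(0, int(round(0.95 * len(ordered))) - 1)
    return ordered[index]

print(f"read:  avg={statistics.mean(read_latencies_ms):.1f} ms, "
      f"95th={p95(read_latencies_ms):.1f} ms")
print(f"write: avg={statistics.mean(write_latencies_ms):.1f} ms, "
      f"95th={p95(write_latencies_ms):.1f} ms")

# Overall throughput: total read and write operations divided by workload run
# time, excluding setup and cleanup.
total_operations = len(read_latencies_ms) + len(write_latencies_ms)
run_time_seconds = 0.005  # made-up elapsed time for these few samples
print(f"throughput = {total_operations / run_time_seconds:.0f} ops/sec")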

Task 5: Test Script Development and Execution:  In this task, the researchers performed three runs for each of the identified data stores to reduce the impact of transient events in the cloud infrastructure.  The standard deviation of the throughput for any three-run set never exceeded 2% of the average.   YCSB allows running multiple execution threads to create concurrent client sessions, so the workload execution was repeated over a defined range of test client threads for each three-run test. This approach created a corresponding number of concurrent database connections.  The researchers noted that NoSQL data stores are not designed to operate with a large number of concurrent client sessions and can usually handle between 16 and 64 concurrent sessions.

The researchers then analyzed the appropriate approach to distributing the multiple concurrent connections across the server nodes. They found that MongoDB uses a centralized router node, so all clients connected to that single router node.  In Cassandra, the built-in data-center-aware distribution feature created three sub-clusters of three nodes each, and client connections were spread uniformly across the three nodes in one sub-cluster.   In Riak, client connections could only be spread uniformly across the full set of nine nodes.  Table 5 summarizes how each data store handles concurrent connections in a distributed system.

Table 5.  A Summary of the Data Store Techniques for Concurrent Connections in Distributed System.

Results and Findings

The nine-node topology was configured to represent the production system. A single-server configuration was also tested, which demonstrated its limitations for production use. Thus, the performance results reflect the nine-node topology rather than the single-server configuration.  However, the comparison between the single-node and distributed nine-node scenarios was performed to provide insight into data store performance, the efficiency of the distributed systems, and the tradeoff between scaling out with more nodes and using faster nodes with more storage.  The results covered three major areas:  Strong Consistency Evaluation, Eventual Consistency Evaluation, and Performance Evaluation.

Strong Consistency Evaluation

With respect to MongoDB, all writes were committed on the Primary Server, while all reads were performed from the Primary Server.  In Cassandra, all writes were committed on a quorum formed on each of the three sub-clusters, while the read operations required a quorum only on the local sub-cluster.  With respect to Riak, the effect was to require a quorum on the entire nine-node cluster for both read and write operations.  Table 6 summarizes this strong consistency evaluation result.  

Table 6.  Strong Consistency Evaluation. Adapted from (Klein et al., 2015).

Eventual Consistency Evaluation: The eventual consistency tests were performed on Cassandra and Riak.  The results showed that writes were committed on one node with replication occurring after the operation was acknowledged to the client.  The results also showed that read operations were executed on one replica, which may or may not return the latest values written to the data store. 

Performance Evaluation: Cassandra demonstrated the best overall performance among the three data stores, peaking at approximately 3,500 operations per second, as illustrated in Figure 3.  The comparison in Figure 3 includes throughput for the read-only workload, write-only workload, and read/write workload, with replicated data and quorum consistency, for these three data stores.

Figure 3.  Workload Comparison among Cassandra, MongoDB, and Riak. 

Adapted from (Klein et al., 2015). 

The decreased contention for storage I/O improved performance, while the additional work of coordinating write and read quorums across replicas and data centers decreased it.  For Cassandra, the improvement exceeded the degradation, resulting in a net higher performance in the distributed configuration than the other two data stores.  Cassandra's built-in data-center-aware distribution feature contributed to this improvement because it separates the replication and sharding configurations.   This separation allowed larger read operations to be completed without request coordination, such as the P2P proxying of client requests required in Riak.

With respect to latency, the results for the read/write workload showed that MongoDB had a nearly constant average latency as the number of concurrent sessions increased.  The results also showed that for read/write operations, although Cassandra achieved the highest overall throughput, it had the highest latencies, indicating high internal concurrency in processing the requests. 

Conclusion

The Cassandra data store demonstrated the best throughput performance, but with the highest latency, for the specific workloads and configurations tested.   The researchers attributed these results to three factors.  First, Cassandra's hash-based sharding spread the request and storage load better than MongoDB.  Second, Cassandra's indexing features allowed efficient retrieval of the most recently written records, compared to Riak.  Third, Cassandra's P2P architecture and data-center-aware feature provided efficient coordination of both read and write operations across replica nodes and data centers.  

The results also showed that MongoDB and Cassandra performed more efficiently than the Riak data store, and both provided the strong replica consistency required for this application and data model.  The researchers concluded that MongoDB exhibited a more transparent data model mapping than Cassandra, and that MongoDB's indexing capabilities were a better fit for this application. 

Moreover, the results also showed that the throughput varied by a factor of ten, read operation latency varied by a factor of five, and write latency by a factor of four with the highest throughput product delivering the highest latency. The results also showed that the throughput for workloads using strong consistency was 10-25% lower than workloads using eventual consistency.

References

Beeler, G. W. J. (2010). Introduction to:  HL7 References Information Model (RIM).  ANSI/HL7 RIM R3-2010 and ISO 21731. Retrieved from https://www.hl7.org/documentcenter/public_temp_4F08F84F-1C23-BA17-0C2B98D837BC327B/calendarofevents/himss/2011/HL7ReferenceInformationModel.pdf.

Cooper, B. F., Silberstein, A., Tam, E., Ramakrishnan, R., & Sears, R. (2010). Benchmarking cloud serving systems with YCSB. Paper presented at the Proceedings of the 1st ACM symposium on Cloud computing.

Klein, J., Gorton, I., Ernst, N., Donohoe, P., Pham, K., & Matser, C. (2015, June 27-July 2, 2015). Application-Specific Evaluation of NoSQL Databases. Paper presented at the 2015 IEEE International Congress on Big Data.

NoSQL Databases: Cassandra vs. DynamoDB

Dr. Aly, O.
Computer Science

Introduction

The purpose of this discussion is to analyze the differences between two NoSQL databases, Cassandra and DynamoDB. The discussion begins with a brief overview of NoSQL and its data store types, followed by a more focused discussion of Cassandra and DynamoDB.

NoSQL Overview

NoSQL stands for "Not Only SQL" (EMC, 2015; Sahafizadeh & Nematbakhsh, 2015).  NoSQL is used for modern, scalable databases in the age of Big Data.  Scalability enables a system to increase its throughput as demand increases during data processing (Sahafizadeh & Nematbakhsh, 2015).  A platform can incorporate two types of scalability to support the processing of Big Data: horizontal scaling and vertical scaling. Horizontal scaling distributes the workload across many servers and nodes, and servers can be added to increase throughput (Sahafizadeh & Nematbakhsh, 2015).  With vertical scaling, on the other hand, more processors, more memory, and faster hardware are installed on a single server (Sahafizadeh & Nematbakhsh, 2015).  NoSQL offers benefits such as support for mass storage, fast read and write operations, easy expansion, and low cost (Sahafizadeh & Nematbakhsh, 2015).  Examples of NoSQL databases are MongoDB, CouchDB, Redis, Voldemort, Cassandra, BigTable, Riak, HBase, Hypertable, ZooKeeper, Vertica, Neo4j, db4o, and DynamoDB.

NoSQL Data Stores Types

Data stores are categorized into four types: document-oriented, column-oriented (column family) stores, graph databases, and key-value stores (EMC, 2015; Hashem et al., 2015).  The purpose of the document-oriented database is to store and retrieve collections of information and documents.  It supports complex data in various formats such as XML and JSON, in addition to binary forms such as PDF and MS Word (EMC, 2015; Hashem et al., 2015).   A document is similar to a tuple or row in a relational database; however, the document-oriented database is more flexible and can retrieve documents and information based on their contents.  Document-oriented data stores offer additional features such as the creation of indexes to increase search performance over documents (EMC, 2015).   They can be used for managing web page content as well as web analytics of log data (EMC, 2015). Examples of document-oriented data stores include MongoDB, SimpleDB, and CouchDB (Hashem et al., 2015).

The purpose of the column-oriented database is to store content by column rather than by row, with attribute values belonging to the same column stored contiguously (Hashem et al., 2015).  The column family database is used to store and render blog entries, tags, and viewers' feedback, and to store and update various web page metrics and counters (EMC, 2015). An example of a column-oriented database is BigTable; Cassandra is also listed as a column-family data store (EMC, 2015; Erl, Khattak, & Buhler, 2016).

The key-value data store is designed to store and access data with the ability to scale to a very large size (Hashem et al., 2015).   It contains a value and a key to access that value, and the values can be complex (EMC, 2015).  The key-value data store is useful, for example, when using a login ID as the key to a customer's preferences, or a web session ID as the key to the session state.  Examples of key-value databases include DynamoDB, HBase, Cassandra, and Voldemort (Hashem et al., 2015).  While HBase and Cassandra are described as the most popular and scalable key-value stores (Borkar, Carey, & Li, 2012), DynamoDB and Cassandra are described as the two popular AP (Availability and Partition tolerance) systems (M. Chen, Mao, & Liu, 2014).  Others, such as Kaoudi and Manolescu (2015), describe Apache Accumulo, DynamoDB, and HBase as the popular key-value stores.

The purpose of the graph database is to store and represent data using a graph model with nodes, edges, and properties related to one another through relations.  An example of a graph database is Neo4j (Hashem et al., 2015).  Table 1 provides examples of NoSQL data stores.

Table 1.  NoSQL Data Store Types with Examples.
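As a minimal sketch of the key-value pattern described above, the snippet below uses Redis (one of the key-value stores listed earlier) to keep customer preferences under a login-ID key and session state under a session-ID key.  The host, key layout, and stored fields are assumptions for illustration.

import json
import redis  # Redis used here as an illustrative key-value store

# Illustrative only: host, port, and key layout are assumed.
r = redis.Redis(host="localhost", port=6379)

# Login ID as the key to a customer's preferences.
r.set("prefs:jdoe", json.dumps({"language": "en", "units": "metric"}))

# Web session ID as the key to session state, expiring after 30 minutes.
r.set("session:9f2c1a", json.dumps({"user": "jdoe", "cart": []}), ex=1800)

prefs = json.loads(r.get("prefs:jdoe"))
print(prefs["language"])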

Cassandra

Cassandra is described as the most popular NoSQL database (C. P. Chen & Zhang, 2014; Mishra, Dehuri, & Kim, 2016).  It is a second-generation distributed key-value store that was developed by Facebook in 2008 (Bifet, 2012; Cattell, 2011; C. P. Chen & Zhang, 2014; Rabl, Sadoghi, & Jacobsen, 2012).  It is also described as a clustered key-value database that uses column-oriented storage and redundant storage for accessibility, with scaling in both read/write performance and data size (Mishra et al., 2016).  

Cassandra can handle very large amounts of data spread across many servers. It also provides a highly available service without a single point of failure (Bahrami & Singhal, 2015; Rabl et al., 2012).  Failure detection and recovery are fully automated (Cattell, 2011). Cassandra adopts concepts from both DynamoDB and BigTable, integrating the distributed technology of DynamoDB with the data model of BigTable (M. Chen et al., 2014; Rabl et al., 2012).  Thus, the architecture of Cassandra is a mixture of Google's BigTable and Amazon's DynamoDB, providing availability and scalability (M. Chen et al., 2014; Rabl et al., 2012).  An example of a Cassandra application is Netflix, which uses it as the back-end database for its streaming services (Bifet, 2012). 

Cassandra, like HBase, is written in Java and used under Apache licensing (Cattell, 2011).   Cassandra has column groups, uses memory to cache updates before they are flushed to disk, and compacts the on-disk representation periodically.  Cassandra supports partitioning and replication (Cattell, 2011).  The partitioning and replication techniques in Cassandra are said to be similar to those of DynamoDB for achieving consistency (M. Chen et al., 2014).  However, Cassandra is said to have a weaker concurrency model than other systems, as there is no locking mechanism and replicas are updated asynchronously (Cattell, 2011). 

When using Cassandra, newly available nodes are brought into a cluster automatically; the "phi accrual algorithm" detects node failure, and cluster membership is determined in a distributed fashion using a "gossip-style algorithm" (Cattell, 2011).   Tables in Cassandra take the form of a distributed four-dimensional structured map, where the four dimensions are row, column, column family, and super column (M. Chen et al., 2014). Cassandra provides the concept of a "super column," which offers another level of grouping within column groups (Cattell, 2011).  The row is distinguished by a string key of arbitrary length (M. Chen et al., 2014).  No matter how many columns are read or written, the operation on a row is atomic (M. Chen et al., 2014).  The columns can constitute clusters called column families, similar to the data model of BigTable (M. Chen et al., 2014). 

Cassandra uses an "ordered hash index," providing the benefits of both hash and B-Tree indexes (Cattell, 2011).  However, sorting is slower with an "ordered hash index" than with a B-Tree index (Cattell, 2011). Cassandra is said to be gaining a lot of momentum as an open source project, having reportedly scaled to about 150 machines or more in Facebook's production platform (Cattell, 2011).  Cassandra uses the eventual-consistency model, which is said to be inadequate on its own; however, "quorum reads" of a majority of replicas provide a technique to get the latest data (Cattell, 2011).  Writes in Cassandra are atomic within a column family, and Cassandra supports versioning and conflict resolution techniques (Cattell, 2011). The key features of Cassandra include a P2P (peer-to-peer) system for structured and unstructured data, a decentralized storage system, symmetric system orientation, efficient latencies, linear scalability, and a map indexed by a unique row key and column key (Kalid, Syed, Mohammad, & Halgamuge, 2017).

With respect to security, Cassandra hashes all passwords using the MD5 hash function, which is weak and can pose a threat if a malicious user bypasses client authorization (Sahafizadeh & Nematbakhsh, 2015).  A user can also extract data because of the lack of an authorization mechanism in inter-node message exchange (Sahafizadeh & Nematbakhsh, 2015).  Cassandra is potentially exposed to denial-of-service attacks because it runs one thread per client, and it does not support inline auditing (Sahafizadeh & Nematbakhsh, 2015).  Cassandra uses a query language called the Cassandra Query Language (CQL), which is similar to SQL (Sahafizadeh & Nematbakhsh, 2015).  Experiments showed that injection attacks are possible in Cassandra using CQL, much like SQL injection (Sahafizadeh & Nematbakhsh, 2015).  Moreover, Cassandra has limitations in managing inactive connections (Sahafizadeh & Nematbakhsh, 2015).
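A minimal sketch of the injection concern is shown below using the DataStax Python driver: concatenating untrusted input into a CQL string mirrors classic SQL injection, whereas binding parameters lets the driver handle escaping.  The contact point, keyspace, and table names are assumptions for illustration.

from cassandra.cluster import Cluster

# Illustrative only: contact point, keyspace, and table are assumed.
session = Cluster(["localhost"]).connect("ehr")

patient_id = "patient123"  # value that might originate from user input

# Unsafe: concatenating user input into CQL invites injection, as noted above.
# session.execute("SELECT * FROM observations WHERE patient_id = '" + patient_id + "'")

# Safer: bind the value as a parameter so the driver escapes it.
rows = session.execute(
    "SELECT * FROM observations WHERE patient_id = %s", (patient_id,))
for row in rows:
    print(row)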

DynamoDB

DynamoDB is Amazon's highly available, scalable, low-latency key-value NoSQL database (Kalid et al., 2017; Mishra et al., 2016; Russell & Van Duren, 2016).  DynamoDB is described as one of the earliest NoSQL databases and has influenced the design of other NoSQL databases such as Cassandra (Mishra et al., 2016).  DynamoDB supports both the key-value and document data models (Sahafizadeh & Nematbakhsh, 2015).  The goal of DynamoDB is high performance and high throughput (Thuraisingham, Parveen, Masud, & Khan, 2017).  DynamoDB can expand and shrink as required by applications (Thuraisingham et al., 2017), and it supports in-memory caching through the DynamoDB Accelerator, providing millisecond responses for millions of requests per second (Thuraisingham et al., 2017).

With respect to security, DynamoDB allows data security, authentication, and access control to be implemented on a per-table basis, leveraging the AWS identity and access management system (Russell & Van Duren, 2016).  However, data encryption is not supported in DynamoDB.  Client-server communication is supported over the HTTPS protocol.  DynamoDB supports authentication and authorization, and requests need to be signed using HMAC-SHA256 (Sahafizadeh & Nematbakhsh, 2015).  
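As a brief sketch of working with DynamoDB from Python, the snippet below uses boto3, which signs each request and communicates with the service over HTTPS on the caller's behalf.  The region, table name, key schema, and attributes are assumptions for illustration.

import boto3

# Illustrative only: region, table, and attribute names are assumed.
dynamodb = boto3.resource("dynamodb", region_name="us-west-2")
observations = dynamodb.Table("Observations")

# Write one observation item (assumes a table keyed on patient_id and result_datetime).
observations.put_item(Item={
    "patient_id": "patient123",
    "result_datetime": "2018-11-22T10:15:00Z",
    "test_type": "HbA1c",
    "result_value": "6.1",
})

# Read it back by its full key.
response = observations.get_item(Key={
    "patient_id": "patient123",
    "result_datetime": "2018-11-22T10:15:00Z",
})
print(response.get("Item"))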

Summary of Comparison between DynamoDB and Cassandra

DynamoDB was one of the earliest NoSQL databases and influenced the design of other NoSQL databases such as Cassandra.  Cassandra integrated the data model of Google's BigTable with the distributed system technology of DynamoDB, providing availability and scalability.  DynamoDB and Cassandra are popular for availability and partition tolerance. The partitioning and replication techniques in Cassandra are similar to those of DynamoDB for achieving consistency. DynamoDB does not support data encryption, while Cassandra hashes all passwords with MD5; however, DynamoDB supports the HTTPS protocol.  Table 2 summarizes the comparison between DynamoDB and Cassandra.

Table 2.  Summary of Comparison between DynamoDB and Cassandra

References

Bahrami, M., & Singhal, M. (2015). The role of cloud computing architecture in big data Information granularity, big data, and computational intelligence (pp. 275-295): Springer.

Bifet, A. (2012). Mining big data in real time. Informatica, 37(1).

Borkar, V. R., Carey, M. J., & Li, C. (2012). Big data platforms: what’s next? XRDS: Crossroads, The ACM Magazine for Students, 19(1), 44-49.

Cattell, R. (2011). Scalable SQL and NoSQL data stores. ACM SIGMOD Record, 39(4), 12-27.

Chen, C. P., & Zhang, C.-Y. (2014). Data-intensive applications, challenges, techniques and technologies: A survey on Big Data. Information Sciences, 275, 314-347.

Chen, M., Mao, S., & Liu, Y. (2014). Big data: a survey. Mobile Networks and Applications, 19(2), 171-209.

EMC. (2015). Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data. (1st ed.): Wiley.

Erl, T., Khattak, W., & Buhler, P. (2016). Big Data Fundamentals: Concepts, Drivers & Techniques: Prentice Hall Press.

Hashem, I. A. T., Yaqoob, I., Anuar, N. B., Mokhtar, S., Gani, A., & Khan, S. U. (2015). The rise of “big data” on cloud computing: Review and open research issues. Information Systems, 47, 98-115.

Kalid, S., Syed, A., Mohammad, A., & Halgamuge, M. N. (2017). Big-data NoSQL databases: A comparison and analysis of “Big-Table”,“DynamoDB”, and “Cassandra”. Paper presented at the Big Data Analysis (ICBDA), 2017 IEEE 2nd International Conference on.

Kaoudi, Z., & Manolescu, I. (2015). RDF in the clouds: a survey. The VLDB Journal, 24(1), 67-91.

Mishra, B. S. P., Dehuri, S., & Kim, E. (2016). Techniques and Environments for Big Data Analysis: Parallel, Cloud, and Grid Computing (Vol. 17): Springer.

Rabl, T., Gómez-Villamor, S., Sadoghi, M., Muntés-Mulero, V., Jacobsen, H.-A., & Mankovskii, S. (2012). Solving big data challenges for enterprise application performance management. Proceedings of the VLDB Endowment, 5(12), 1724-1735.

Rabl, T., Sadoghi, M., & Jacobsen, H. (2012). Solving Big Data Challenges for Enterprise Application Performance Management.

Russell, B., & Van Duren, D. (2016). Practical Internet of Things Security: Packt Publishing Ltd.

Sahafizadeh, E., & Nematbakhsh, M. A. (2015). A Survey on Security Issues in Big Data and NoSQL. Int’l J. Advances in Computer Science, 4(4), 2322-5157.

Thuraisingham, B., Parveen, P., Masud, M. M., & Khan, L. (2017). Big Data Analytics with Applications in Insider Threat Detection: CRC Press.

NoSQL Data Storage System

Dr. Aly, O.
Computer Science

Introduction:  For decades, traditional databases such as MySQL, PostgreSQL, SQL Server, and Oracle have been regarded as a one-size-fits-all approach for data persistence and retrieval (Sakr & Gaber, 2014). However, these traditional databases are challenged by the increasing demand for scalability, the requirements of new applications, and some web-scale applications (Sakr & Gaber, 2014).  The most common architecture for developing enterprise web applications is the three-tier framework: the server layer, the application layer, and the data layer (Sakr & Gaber, 2014).  Data partitioning and data replication are two commonly used approaches to achieve availability, scalability, and performance enhancement in distributed data management.  There are two main approaches to achieving scalability at the database layer so that client requests can be handled as the application load increases (Sakr & Gaber, 2014).  The first approach is to scale up by allocating a bigger machine to act as the database server. The second approach is to scale out by replicating and partitioning data across more machines (Sakr & Gaber, 2014). However, traditional databases suffer from serious limitations.  They are not easy to scale, as they cannot exceed a certain limit, and they are not easy to configure and maintain.  Specialized database systems, such as main-memory systems for OLTP and column stores for OLAP, add further complexity and cost when selecting a database system, in addition to the unnecessary cost of provisioning for peak load (Sakr & Gaber, 2014).

Thus, new systems called NoSQL started to emerge as an alternative to traditional database systems for dealing with Big Data (Moniruzzaman & Hossain, 2013; Pokorny, 2013; Sakr & Gaber, 2014).  "Not Only SQL" (NoSQL) databases were developed by major internet companies such as Facebook, Amazon, and Google when they were confronted with Big Data challenges (Moniruzzaman & Hossain, 2013).  NoSQL databases are found to be suitable for massive, schema-free datasets in Big Data management (Hu, Wen, Chua, & Li, 2014) and are considered a potential data management solution for Big Data (Abbasi, Sarker, & Chiang, 2016). 

NoSQL Data Storage and the Tradeoff between Consistency and Availability

The ACID properties are regarded as one of the basic features of traditional relational databases (Moniruzzaman & Hossain, 2013; Pokorny, 2013; Sakr & Gaber, 2014).  ACID stands for "Atomicity," "Consistency," "Isolation," and "Durability" (Pokorny, 2013).  These ACID properties express the "all or nothing" concept behind the traditional database (Pokorny, 2013).  Relational databases have been in full compliance with the ACID principles (Pokorny, 2013).  In addition to the ACID properties, there is also the CAP theorem, which states that for any system sharing data it is impossible to guarantee all three of its properties simultaneously (Pokorny, 2013).  These three CAP properties are "consistency," "availability," and "partition tolerance" (Pokorny, 2013; Sakr & Gaber, 2014).  Moreover, the traditional relational database is also characterized by a schema, where data is structured in tables, tuples, and fields (Moniruzzaman & Hossain, 2013; Sadalage & Fowler, 2013).  The traditional consistency model is not adequate for distributed systems such as the cloud environment (Sakr & Gaber, 2014).  

There are two major classes of consistency models: strong consistency, which includes linearizability and serializability, and weak consistency, which includes the causal, eventual, and timeline consistency models (Sakr & Gaber, 2014).  The causal consistency model ensures total ordering between operations that have causal relations. The eventual consistency model ensures all replicas will gradually and eventually become consistent in the absence of updates (Sakr & Gaber, 2014). The timeline consistency model guarantees that all replicas perform operations on one record in the same "correct order" (Sakr & Gaber, 2014).

As indicated in (Pokorny, 2013), NoSQL database systems scale nearly linearly with the number of servers used. The reason for this near-linear scaling is the use of "data partitioning" (Pokorny, 2013).  In NoSQL database systems, the method of distributed hash tables (DHT) is often used, in which (key, value) pairs are hashed into buckets (partial storage spaces), each of which is placed on a node (Pokorny, 2013). NoSQL is not characterized by a schema or structured data (Hu et al., 2014; Sadalage & Fowler, 2013).  NoSQL systems are fast, highly scalable, and reliable (Hu et al., 2014). The term "NoSQL" database indicates the loosely specified class of non-relational data stores (Pokorny, 2013).  NoSQL databases mostly do not use SQL as their query language (Pokorny, 2013), and they do not support "JOIN" and "ORDER BY" operations because data partitioning is done horizontally (Pokorny, 2013).  Data in NoSQL is often organized into tables at a logical level and accessed only through the primary key (Pokorny, 2013).
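A toy sketch of this hash-based partitioning is shown below: keys are hashed and assigned to buckets placed on nodes.  The node names and the simple modulo scheme are assumptions for illustration; production systems typically use richer schemes such as consistent hashing.

import hashlib

# Illustrative only: node names and the modulo scheme are assumed.
NODES = ["node-1", "node-2", "node-3"]

def node_for(key: str) -> str:
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % len(NODES)
    return NODES[bucket]

for key in ["patient123", "patient124", "patient125"]:
    print(key, "->", node_for(key))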

NoSQL database systems are organized by data model (Hu et al., 2014).  They are categorized as (1) key-value stores, (2) column-oriented databases, and (3) document databases (Hu et al., 2014).  Most NoSQL database systems are key-value stores, or big hash tables, which contain key-value pairs (Pokorny, 2013).  This key-value, or big-hash-table, approach increases the efficiency of lookups (Pokorny, 2013).  The key uniquely identifies a value and is typically a string, but it can also be a pointer to where the value is stored.  The value can be structured or completely unstructured, typically in BLOB (binary large object) format (Pokorny, 2013).  The key-value pairs can be of different types and may not come from the same table (Pokorny, 2013).  These characteristics of the key-value pair concept increase the efficiency and scalability of NoSQL database systems (Pokorny, 2013).  Column-oriented databases store and process data by column instead of by row (Hu et al., 2014), and the rows and columns are split over multiple nodes to achieve scalability (Hu et al., 2014).   Document-oriented databases support more complex data structures than key-value stores (Hu et al., 2014) and are the most general models of the NoSQL databases (Pokorny, 2013). There are other NoSQL data models, including graph databases, that are not discussed in this paper.  Table 1 illustrates some of these NoSQL data models, showing the databases for each model, producer, CAP option, and consistency, derived from (Hu et al., 2014).

Table 1:  NoSQL Database Models. Adapted from (Hu et al., 2014)

According to (Hu et al., 2014; Moniruzzaman & Hossain, 2013; Pokorny, 2013), some of these NoSQL databases do not implement ACID fully and can only be eventually consistent.  NoSQL databases implement the "eventual consistency" concept instead of "strong consistency," where updates and changes are replicated to the entire database eventually, but not necessarily at any given time (Hu et al., 2014; Moniruzzaman & Hossain, 2013; Pokorny, 2013).  The term "eventually consistent" means that the system will become consistent, but only after some time (Hu et al., 2014; Moniruzzaman & Hossain, 2013; Pokorny, 2013).  This principle of "eventual consistency" provides greater availability and greatly improves scalability at the expense of full and immediate consistency (Hu et al., 2014; Moniruzzaman & Hossain, 2013; Pokorny, 2013).  NoSQL databases have particular architectures that use different distribution options, ensuring the availability of and access to the data through replication techniques.   Figure 1 illustrates the characteristics of NoSQL databases, derived from (Moniruzzaman & Hossain, 2013).

Figure 1: Characteristics of NoSQL databases. Adapted from (Moniruzzaman & Hossain, 2013)

According to Chen and Zhang (2014), the most popular NoSQL database is Apache Cassandra, which Facebook released as open source in 2008 (Chen & Zhang, 2014).  Other NoSQL implementations include Apache Hadoop, MapReduce, MemcacheDB, and Voldemort (Chen & Zhang, 2014). 

References

Abbasi, A., Sarker, S., & Chiang, R. (2016). Big data research in information systems: Toward an inclusive research agenda. Journal of the Association for Information Systems, 17(2), 3.

Chen, C. P., & Zhang, C.-Y. (2014). Data-intensive applications, challenges, techniques, and technologies: A survey of Big Data. Information Sciences, 275, 314-347.

Hu, H., Wen, Y., Chua, T.-S., & Li, X. (2014). Toward scalable systems for big data analytics: A technology tutorial. IEEE Access, 2, 652-687.

Moniruzzaman, A., & Hossain, S. A. (2013). NoSQL database: New era of databases for big data analytics-classification, characteristics, and comparison. arXiv preprint arXiv:1307.0191.

Pokorny, J. (2013). NoSQL databases: a step to database scalability in a web environment. International Journal of Web Information Systems, 9(1), 69-82.

Sadalage, J. P., & Fowler, M. (2013). NoSQL: A Brief Guide to the Emerging World of Polyglot Persistence (1st Edition ed.): Addison-Wesley.

Sakr, S., & Gaber, M. (2014). Large Scale and big data: Processing and Management: CRC Press.