Dr. O. Aly
Computer Science
Abstract
The purpose of this proposal is to design a state-of-the-art healthcare system across the four states of Colorado, Utah, Arizona, and New Mexico. Big Data (BD) and Big Data Analytics (BDA) play significant roles in various industries, including healthcare. The value driven by BDA can save lives and reduce costs for patients. The project proposes a design that applies BD and BDA to the healthcare system across these four states. Cloud computing is the most appropriate technology for handling the large volume of healthcare data at both the storage and the data-processing levels. Because of the security issues of cloud computing, a Virtual Private Cloud (VPC) will be used. A VPC provides a secure cloud environment in which network traffic is controlled through security groups and network access control lists. The project requires other components to be fully implemented using the latest technology, such as Hadoop and MapReduce for data-stream processing and machine learning for artificial intelligence, which will also be used for the Internet of Things (IoT). The NoSQL databases HBase and MongoDB will handle semi-structured data such as XML and unstructured data such as logs and images. Spark will be used for real-time data processing, which can be vital for urgent care and emergency services. The proposal addresses the assumptions and limitations of the design as well as the justification for selecting these specific components. All stakeholders in the healthcare sector, including providers, insurers, pharmaceutical companies, and practitioners, should cooperate and coordinate to facilitate the implementation process. The rigid culture and silo pattern need to change for better healthcare, which can save millions of dollars for the healthcare industry while providing excellent care to patients.
Keywords: Big Data Analytics; Hadoop; Healthcare Big Data System; Spark.
Introduction
In the age of Big Data (BD), information technology plays a significant role in the healthcare industry (HIMSS, 2018). The healthcare sector generates a massive amount of data every day to conform to standards and regulations (Alexandru, Alexandru, Coardos, & Tudora, 2016). This Big Data has the potential to support many medical and healthcare operations, including clinical decision support, disease surveillance, and population health management (Alexandru et al., 2016). This project proposes a state-of-the-art integrated system for hospitals located in Arizona, Colorado, New Mexico, and Utah. The system is based on the Hadoop ecosystem and will help the hospitals maintain and improve human health through diagnosis, treatment, and disease prevention.
The proposal begins with an overview of Big Data Analytics in healthcare, covering the benefits and challenges of BD and BDA in the industry. The overview also covers the various healthcare data sources for data analytics and their different formats, both semi-structured (e.g., XML and JSON) and unstructured (e.g., images and X-rays). The second section presents the healthcare BDA design proposal using Hadoop. It opens with the requirements for the design, which include state-of-the-art technologies such as Hadoop/MapReduce, Spark, NoSQL databases, Artificial Intelligence (AI), and the Internet of Things (IoT). The proposal also includes several diagrams: the data flow diagram, a communication flowchart, and the overall system diagram. The healthcare system design is bounded by regulations, policies, and governance such as HIPAA, which are also covered. The justification, limitations, and assumptions are discussed as well.
Big Data Analytics in Healthcare Overview
BD and BDA are terms that have been used interchangeably and described as the next frontier for innovation, competition, and productivity (Maltby, 2011; Manyika et al., 2011). BD follows a multi-V model with unique characteristics: volume refers to the large size of datasets, velocity to the speed of computation and data generation, and variety to the different data types, such as semi-structured and unstructured data (Assunção, Calheiros, Bianchi, Netto, & Buyya, 2015; Hu, Wen, Chua, & Li, 2014). Various industries, including healthcare, have seized this opportunity and applied BD and BDA in their business models (Manyika et al., 2011). The McKinsey Global Institute predicted a potential annual value of $300 billion to US healthcare (Manyika et al., 2011).
The healthcare industry generates extensive data driven by keeping patient records, complying with regulations and policies, and providing patient care (Raghupathi & Raghupathi, 2014). The current trend is to digitize this explosively growing data in the age of Big Data (BD) and Big Data Analytics (BDA) (Raghupathi & Raghupathi, 2014). BDA has revolutionized healthcare by transforming data into valuable information and knowledge used to predict epidemics, cure diseases, improve quality of life, and avoid preventable deaths (Van-Dai, Chuan-Ming, & Nkabinde, 2016). Applications of BDA in healthcare include pervasive health, fraud detection, pharmaceutical discovery, clinical decision support systems, computer-aided diagnosis, and biomedical applications.
Healthcare Big Data Benefits and Challenges
The healthcare sector employs BDA in various aspects of care, such as detecting diseases at early stages, providing evidence-based medicine, minimizing medication doses to avoid side effects, and delivering effective medicine based on genetic analysis. The use of BD and BDA can reduce the re-admission rate and thereby lower healthcare-related costs for patients. Healthcare BDA can also detect spreading diseases early, before they become widespread, using real-time analytics (Archenaa & Anita, 2015; Raghupathi & Raghupathi, 2014; Wang, Kung, & Byrd, 2018). An example of BDA applied in a healthcare system is Kaiser Permanente's HealthConnect, implemented to ensure data exchange across all medical facilities and promote the use of electronic health records (Fox & Vaidyanathan, 2016).
Despite the various benefits of BD and BDA in the healthcare sector, several challenges and issues emerge from their application. The nature of the healthcare industry itself poses challenges to BDA (Groves, Kayyali, Knott, & Kuiken, 2016). The episodic culture, the data puddles, and the IT leadership are the three most significant obstacles to applying BDA in healthcare. The episodic culture refers to the conservative culture of healthcare and the lack of an IT mindset, which together create a rigid culture; few providers have overcome it and started to use BDA technology. The data puddles reflect the silo nature of healthcare. Silos are described as one of the most significant flaws in the healthcare sector (Wicklund, 2014); each silo uses its own methods to collect data from labs, diagnosis, radiology, emergency, case management, and so forth, and the resulting failure to use technology properly has left the industry behind others. IT leadership is a further challenge caused by the rigid culture of the industry: the lack of familiarity with the latest technologies among healthcare IT leadership is a severe problem.
Healthcare Data Sources for Data Analytics
The current healthcare data is collected from clinical and non-clinical sources (InformationBuilders, 2018; Van-Dai et al., 2016; Zia & Khan, 2017). Electronic healthcare records are digital copies of patients' medical histories. They contain a variety of data relevant to patient care, such as demographics, medical problems, medications, body mass index, medical history, laboratory test data, radiology reports, clinical notes, and payment information. These electronic healthcare records are the most important data in healthcare analytics because they provide effective and efficient methods for providers and organizations to share data (Botta, de Donato, Persico, & Pescapé, 2016; Palanisamy & Thirunavukarasu, 2017; Van-Dai et al., 2016; Wang et al., 2018).
Biomedical imaging data plays a crucial role in healthcare, aiding disease monitoring, treatment planning, and prognosis. This data can be used to generate quantitative information and draw inferences from images that provide insight into a medical condition. Image analytics is more complicated because of the noise associated with image data, which is one of the significant limitations of biomedical analysis (Ji, Ganchev, O'Droma, Zhang, & Zhang, 2014; Malik & Sangwan, 2015; Van-Dai et al., 2016).
Sensing data is ubiquitous in the medical domain, both for real-time and historical data analysis. It comes from medical data-collection instruments such as the electrocardiogram (ECG) and electroencephalogram (EEG), which are vital sensors for collecting signals from various parts of the human body. Sensing data plays a significant role in intensive care units (ICUs) and in real-time remote monitoring of patients with conditions such as diabetes or high blood pressure. Real-time and long-term analysis of trends and treatments in remote monitoring programs can help providers track the state of such patients (Van-Dai et al., 2016).
Biomedical signals are collected from many sources, such as the heart, blood pressure, oxygen saturation levels, blood glucose, nerve conduction, and brain activity. Examples include the electroneurogram (ENG), electromyogram (EMG), electrocardiogram (ECG), electroencephalogram (EEG), electrogastrogram (EGG), and phonocardiogram (PCG). Real-time analytics of biomedical signals will enable better management of chronic diseases, earlier detection of adverse events such as heart attacks and strokes, and earlier diagnosis of disease. These signals can be discrete or continuous depending on the kind of care or the severity of a particular pathological condition (Malik & Sangwan, 2015; Van-Dai et al., 2016).
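The simplest form of real-time signal analytics described above is a threshold rule over a stream of samples. The following minimal sketch illustrates the idea in Python; the normal range and the sample values are illustrative only, not clinical guidance.

```python
# Minimal sketch of threshold-based alerting on a stream of heart-rate
# samples, the kind of rule a real-time remote-monitoring pipeline might
# apply. The thresholds and sample data are illustrative, not clinical.

NORMAL_RANGE = (60, 100)  # resting heart rate in bpm (illustrative range)

def detect_alerts(samples, low=NORMAL_RANGE[0], high=NORMAL_RANGE[1]):
    """Return (timestamp, value) pairs that fall outside the normal range."""
    return [(t, v) for t, v in samples if v < low or v > high]

stream = [(0, 72), (1, 75), (2, 131), (3, 80), (4, 55)]
for t, v in detect_alerts(stream):
    print(f"alert at t={t}: heart rate {v} bpm")
```

In a production system this rule would run inside the streaming layer rather than over an in-memory list, but the logic is the same.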
Genomic data analysis helps us better understand the relationships among genes, mutations, and disease conditions. It has great potential for developing gene therapies to cure certain conditions. Furthermore, genomic data analytics can assist in translating genetic discoveries into personalized medical practice (Liang & Kelemen, 2016; Luo, Wu, Gopukumar, & Zhao, 2016; Palanisamy & Thirunavukarasu, 2017; Van-Dai et al., 2016).
Clinical text data analytics uses data mining to transform information in clinical notes, stored in unstructured formats, into useful patterns. Manual coding of clinical notes is costly and time-consuming because of their unstructured nature, heterogeneity, and varying format and context across patients and practitioners. Methods such as natural language processing (NLP) and information retrieval can extract useful knowledge from large volumes of clinical text and automatically encode clinical information in a timely manner (Ghani, Zheng, Wei, & Friedman, 2014; Sun & Reddy, 2013; Van-Dai et al., 2016).
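A small rule-based extraction illustrates the kind of structure such methods recover from free text. The drug names and dose pattern below are hypothetical examples, not a real clinical NLP pipeline.

```python
import re

# Illustrative rule-based extraction of medication mentions from a clinical
# note. The drug list and dose pattern are hypothetical; a real system would
# use a full clinical NLP toolkit and a drug vocabulary.
DOSE_PATTERN = re.compile(
    r"\b(metformin|lisinopril|aspirin)\s+(\d+)\s*mg\b", re.IGNORECASE
)

def extract_medications(note):
    """Return (drug, dose_mg) pairs mentioned in a free-text note."""
    return [(drug.lower(), int(dose)) for drug, dose in DOSE_PATTERN.findall(note)]

note = "Pt continues Metformin 500 mg BID; started lisinopril 10 mg daily."
print(extract_medications(note))  # [('metformin', 500), ('lisinopril', 10)]
```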
Social network healthcare data analytics draws on social media sources such as social networking sites (e.g., Facebook, Twitter, and web logs) to discover new patterns and knowledge that can be leveraged to model and predict global health trends, such as outbreaks of infectious epidemics (InformationBuilders, 2018; Luo et al., 2016; Van-Dai et al., 2016; Zia & Khan, 2017). Figure 1 shows a summary of these healthcare data sources.

Figure 1. Healthcare Data Sources.
Healthcare Big Data Analytics Design Proposal Using Hadoop
The implementation of BDA in the hospitals of the four states aims to improve patient safety and clinical outcomes while promoting wellness and disease management (Alexandru et al., 2016; HIMSS, 2018). The BDA system will take advantage of the large volume of healthcare-generated data to support applied analytical disciplines across the statistical, contextual, quantitative, predictive, and cognitive spectrums (Alexandru et al., 2016; HIMSS, 2018). These disciplines will drive fact-based decision making for planning, management, and learning in the hospitals (Alexandru et al., 2016; HIMSS, 2018).
The proposal begins with the requirements, followed by the data flow diagram, the communication flowcharts, and the overall system diagram. The proposal addresses the regulations, policies, and governance for the medical system. The limitation and assumptions are also addressed in this proposal, followed by the justification for the overall design.
1. Basic Design Requirements
The basic requirements for implementing this proposal include not only the necessary tools and software but also training at all levels, from staff to nurses, clinicians, and patients. The requirements are divided into system requirements, implementation requirements, and training requirements.
1.1 Cloud Computing Technology Adoption Requirement
Volume is one of the defining characteristics of BD, especially in the healthcare industry (Manyika et al., 2011). Given the challenges addressed earlier, the system requirements cannot be met with a traditional on-premise data center, which can handle neither the intensive computation BD requires nor the storage needed for all the medical information from the hospitals of the four states (Hu et al., 2014). The cloud computing environment is therefore the more appropriate solution for implementing this proposal. Cloud computing plays a significant role in BDA (Assunção et al., 2015); the massive computation and storage requirements of BDA create a critical need for this emerging technology (Mehmood, Natgunanathan, Xiang, Hua, & Guo, 2016). Cloud computing offers benefits such as cost reduction, elasticity, pay-per-use, availability, reliability, and maintainability (Gupta, Gupta, & Mohania, 2012; Kritikos, Kirkham, Kryza, & Massonet, 2017). However, it also raises security and privacy issues under the standard deployment models of public, private, hybrid, and community clouds. Thus, one of the major requirements is to adopt a Virtual Private Cloud (VPC), which has been regarded as the most prominent approach to trusted computing technology (Abdul, Jena, Prasad, & Balraju, 2014).
1.2 Security Requirement
Cloud computing has faced various threats (Cloud Security Alliance, 2013, 2016, 2017). Records show that over the three years from 2015 to 2017, the numbers of breaches, lost medical records, and fine settlements were staggering (Thompson, 2017). The Office for Civil Rights (OCR) issued 22 resolution agreements requiring monetary settlements approaching $36 million (Thompson, 2017). Table 1 shows the data categories and the total for each year.
Table 1. Approximation of Records Lost by Category Disclosed on HHS.gov (Thompson, 2017)

Furthermore, a recent report showed that in the first three months of 2018, 77 healthcare data breaches were reported to the OCR (HIPAA, 2018d). In the second quarter of 2018, at least 3.14 million healthcare records were exposed (HIPAA, 2018a), and in the third quarter, 4.39 million records were exposed in 117 breaches (HIPAA, 2018c).
Protecting patients' private information therefore requires technology that can extract, analyze, and correlate potentially sensitive datasets securely (HIPAA, 2018b). The implementation of BDA requires security measures and safeguards to protect patient privacy in the healthcare industry (HIPAA, 2018b), and sensitive data should be encrypted to prevent exposure in the event of theft (Abernathy & McMillan, 2016). The security requirements apply both to the VPC cloud deployment model and to the local hospitals in each state (Regola & Chawla, 2013). Security at the VPC level should involve security groups and network access control lists so that only the right individuals can reach the right applications and patient records. A security group in a VPC acts as the first-line-of-defense firewall for the associated instances (McKelvey, Curran, Gordon, Devlin, & Johnston, 2015), while network access control lists act as the second layer of defense for the associated subnets, controlling inbound and outbound traffic at the subnet level (McKelvey et al., 2015).
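The two defensive layers can be expressed declaratively. The following CloudFormation fragment is an illustrative sketch only: the resource names, VPC and ACL references, and CIDR ranges are placeholders, not a deployed configuration.

```yaml
# Illustrative sketch of the two layers: a security group as the
# instance-level firewall and a network ACL entry at the subnet level.
Resources:
  ClinicalAppSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Allow HTTPS only from the hospital network
      VpcId: !Ref HealthcareVpc          # placeholder VPC reference
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 443
          ToPort: 443
          CidrIp: 10.0.0.0/16            # placeholder hospital CIDR
  SubnetInboundHttpsRule:
    Type: AWS::EC2::NetworkAclEntry
    Properties:
      NetworkAclId: !Ref ClinicalSubnetAcl   # placeholder ACL reference
      RuleNumber: 100
      Protocol: 6                        # TCP
      RuleAction: allow
      Egress: false
      CidrBlock: 10.0.0.0/16
      PortRange:
        From: 443
        To: 443
```

The security group filters traffic per instance, while the ACL entry enforces the same policy for every instance in the subnet, giving defense in depth.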
Security at the local hospital level in each state is mandatory to protect patient records and comply with HIPAA regulations (Regola & Chawla, 2013). Medical equipment must be secured with authentication and authorization techniques so that only medical staff, nurses, and clinicians can access the devices, based on their roles; general access should be prohibited because every member of the hospital has a different role with different responsibilities. Encryption should be used to hide the meaning or intent of communication from unintended recipients (Stewart, Chapple, & Gibson, 2015); it is an essential security control, especially for data in transit (Stewart et al., 2015). The hospitals in all four states should implement the same encryption controls, such as PKI, cryptographic applications, and symmetric-key algorithms (Stewart et al., 2015).
The system requirements should also include identity management systems that interoperate across the hospitals in each state. An identity management system provides authentication and authorization so that only those who should have access to patients' medical records can obtain it. The proposal requires encryption protocols such as Secure Sockets Layer (SSL), Transport Layer Security (TLS), and Internet Protocol Security (IPSec) to protect information transferred over public networks (Zhang, R. & Liu, 2010).
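As a minimal sketch of enforcing TLS for data in transit, Python's standard library can build a client context with certificate verification on and older protocol versions refused. The TLS 1.2 floor is an assumed hospital-wide policy, not a requirement stated elsewhere in this proposal.

```python
import ssl

# Sketch of a TLS policy for data in transit using Python's standard
# library. In production this policy would typically live at the load
# balancer or service mesh, but the settings are the same.
context = ssl.create_default_context()

# Refuse anything older than TLS 1.2 (assumed hospital-wide policy).
context.minimum_version = ssl.TLSVersion.TLSv1_2

# The default context already verifies server certificates and hostnames.
print(context.verify_mode == ssl.CERT_REQUIRED)  # True
print(context.check_hostname)                    # True
```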
1.3 Hadoop Implementation for Data Stream Processing Requirement
The velocity of BD refers to the speed at which large volumes of data are generated and must be processed (Hu et al., 2014), while the variety of the data requires specific capabilities to handle structured, semi-structured, and unstructured datasets (Bansal, Deshpande, Ghare, Dhikale, & Bodkhe, 2014; Hu et al., 2014). The Hadoop ecosystem has been found to be the most appropriate system for implementing BDA (Bansal et al., 2014; Dhotre, Shimpi, Suryawanshi, & Sanghati, 2015). The implementation requirements span several technologies and tools; this section covers the components required to implement Hadoop for the healthcare BDA system across the four states.
Hadoop has three significant limitations that must be addressed in this design. The first is the lack of technical support and documentation for open-source Hadoop (Guo, 2013); this design therefore requires an enterprise edition of Hadoop from Cloudera, Hortonworks, or MapR (Guo, 2013), with the final product decision to be made by the cost-analysis team. The second limitation is that Hadoop is not optimal for real-time data processing (Guo, 2013); the solution is to integrate a real-time streaming framework such as Spark, Storm, or Kafka (Guo, 2013; Palanisamy & Thirunavukarasu, 2017), and the Spark integration is discussed below as a separate requirement (Guo, 2013). The third limitation is that Hadoop is not a good fit for large graph datasets (Guo, 2013); the solution is to integrate GraphLab, which is also discussed below as a separate requirement.
1.3.1 Hadoop Ecosystem for Data Processing
Hadoop technologies have been the front-runner for Big Data applications (Bansal et al., 2014; Chrimes, Zamani, Moa, & Kuo, 2018). The Hadoop ecosystem will be part of the implementation requirements, as it has proven to serve intensive computation on large datasets well (Raghupathi & Raghupathi, 2014; Wang et al., 2018). Hadoop will be deployed in the VPC deployment model. Version 2.x is required so that YARN can provide resource management (Karanth, 2014); Hadoop 2.x also includes HDFS snapshots, which provide a read-only image of an entire filesystem or a subset of it for protection against user error, backup, and disaster recovery (Karanth, 2014). The Hadoop platform can be used to gain insight in many areas (Raghupathi & Raghupathi, 2014; Wang et al., 2018). The ecosystem comprises the Hadoop Distributed File System (HDFS), MapReduce, and NoSQL databases such as HBase, along with Hive, to handle large datasets using various algorithms and machine learning to extract value from structured, semi-structured, and unstructured medical records (Raghupathi & Raghupathi, 2014; Wang et al., 2018). Other supporting components include Oozie for workflow, Pig for scripting, and Mahout for machine learning, which is part of artificial intelligence (AI) (Ankam, 2016; Karanth, 2014), as well as Flume for log collection, Sqoop for data exchange, and ZooKeeper for coordination (Ankam, 2016; Karanth, 2014). HCatalog is required to manage metadata in Hadoop (Ankam, 2016; Karanth, 2014). Figure 2 shows the Hadoop ecosystem before the integration of Spark for real-time analytics.

Figure 2. Hadoop Architecture Overview (Alguliyev & Imamverdiyev, 2014).
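The MapReduce model at the heart of this ecosystem can be sketched in miniature: map each record to key-value pairs, shuffle by key, then reduce each group. The diagnosis records and codes below are illustrative; Hadoop performs the same three phases across a cluster.

```python
from collections import defaultdict
from itertools import chain

# Pure-Python sketch of the MapReduce model Hadoop applies at scale.
def map_phase(record):
    """Emit one (diagnosis_code, 1) pair per patient record."""
    yield (record["dx_code"], 1)

def shuffle(pairs):
    """Group mapped pairs by key, as the shuffle phase does."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the values for each key, yielding per-code counts."""
    return {key: sum(values) for key, values in groups.items()}

records = [{"dx_code": "E11"}, {"dx_code": "I10"}, {"dx_code": "E11"}]
mapped = chain.from_iterable(map_phase(r) for r in records)
print(reduce_phase(shuffle(mapped)))  # {'E11': 2, 'I10': 1}
```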
1.3.2 Hadoop-Specific File Formats for Splittable and Agnostic Compression
The ability to split files plays a significant role during data processing (Grover, Malaska, Seidman, & Shapira, 2015). Therefore, Hadoop-specific file formats such as SequenceFile, serialization formats like Avro, and columnar formats such as RCFile and Parquet should be used, because they share two characteristics essential for Hadoop applications: splittable compression and agnostic compression (Grover et al., 2015). Splittability allows Hadoop to divide large files into inputs for MapReduce and other jobs, which is required for parallel processing and is key to leveraging Hadoop's data locality (Grover et al., 2015). Agnostic compression means data can be compressed with any codec without readers having to know which one, because the codec is stored in the header metadata of the file format (Grover et al., 2015). Figure 3 summarizes the three Hadoop file types and the two common characteristics.

Figure 3. Three Hadoop File Types with the Two Common Characteristics.
1.3.3 XML and JSON Use in Hadoop
Clinical data includes semi-structured formats such as XML and JSON. Splitting XML and JSON is not straightforward and presents unique challenges in Hadoop, since Hadoop does not provide a built-in InputFormat for either (Grover et al., 2015). JSON poses even more of a challenge than XML because no token marks the beginning or end of a record (Grover et al., 2015). Two primary considerations apply when using these formats. First, a container format such as Avro should be used, because Avro provides a compact and efficient way to store and process data once it has been transformed into Avro (Grover et al., 2015). Second, a library for processing XML or JSON should be chosen: XMLLoader in Pig's PiggyBank library is an example for XML data, and the Elephant Bird project is an example for JSON data (Grover et al., 2015).
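One common workaround for JSON's lack of record delimiters is newline-delimited JSON: one complete record per line, so any line boundary is a safe split point. The sketch below illustrates the convention; the record fields are illustrative.

```python
import json

# Newline-delimited JSON: each line is a complete record, which makes the
# file splittable on line boundaries. Field names are illustrative.
ndjson = """\
{"patient_id": "p1", "test": "HbA1c", "value": 6.1}
{"patient_id": "p2", "test": "HbA1c", "value": 7.4}
"""

records = [json.loads(line) for line in ndjson.splitlines() if line.strip()]
print(len(records), records[0]["test"])  # 2 HbA1c
```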
1.4 HBase and MongoDB NoSQL Database Integration Requirement
In the age of BD and BDA, traditional data stores have proven inadequate to handle not only the large volume of data but also its varied formats, such as unstructured and semi-structured data (Hu et al., 2014). Not Only SQL (NoSQL) databases emerged to meet the requirements of BDA. These NoSQL data stores serve modern, scalable databases (Sahafizadeh & Nematbakhsh, 2015); their scalability enables systems to increase throughput as demand rises during data processing (Sahafizadeh & Nematbakhsh, 2015). The platform can incorporate two types of scalability to support large datasets: horizontal and vertical. Horizontal scaling distributes the workload across many servers and nodes to increase throughput, while vertical scaling adds more processors, more memory, and faster hardware to a single server (Sahafizadeh & Nematbakhsh, 2015).
NoSQL data stores come in many implementations, such as MongoDB, CouchDB, Redis, Voldemort, Cassandra, BigTable, Riak, HBase, Hypertable, Vertica, Neo4j, db4o, and DynamoDB. They are categorized into four types: document-oriented, column-oriented (column-family), key-value, and graph databases (EMC, 2015; Hashem et al., 2015). Document-oriented stores hold and retrieve collections of documents in complex forms and various formats, such as XML and JSON as well as PDF and MS Word; MongoDB and CouchDB are examples (EMC, 2015; Hashem et al., 2015). Column-oriented stores keep content in columns rather than rows, with column attributes stored contiguously, and suit content such as blog entries, tags, and feedback; Cassandra, DynamoDB, and HBase are examples (EMC, 2015; Hashem et al., 2015). Key-value stores scale to large volumes of data and pair each value with a key used to access it; the value can be complex, and this type is useful for storing a user's login ID as the key referencing patient information. Redis and Riak are examples (Alexandru et al., 2016; EMC, 2015; Hashem et al., 2015). Graph databases store and represent data as graph models with nodes, edges, and properties related through relations, which is useful for unstructured medical data such as images and lab results; Neo4j is an example (Hashem et al., 2015). Each of these data stores has its own advantages and limitations. Figure 4 summarizes the NoSQL data store types, the data they store, and examples.

Figure 4. Big Data Analytics NoSQL Data Store Types.
The proposed design requires one or more NoSQL data stores to meet the BDA requirements of this healthcare system in the Hadoop environment. Healthcare big data has unique characteristics that must be considered when selecting data stores for the various data types. HBase and HDFS are the most commonly used storage managers in the Hadoop environment (Grover et al., 2015). HBase, a column-oriented data store that sits on top of HDFS in the Hadoop ecosystem (Raghupathi & Raghupathi, 2014), will be used to store multi-structured data (Archenaa & Anita, 2015).
MongoDB will be used to store semi-structured datasets such as XML and JSON, as well as metadata for the HBase data schema to improve its accessibility and readability (Luo et al., 2016). Riak will be used for key-value datasets such as dictionaries, hash tables, and associative arrays, which can hold login and user ID information for patients, providers, and clinicians (Klein et al., 2015). Neo4j will be used to store image data with nodes and edges, such as lab images and X-rays (Alexandru et al., 2016).
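The document model underlying a store like MongoDB can be illustrated without a running database: each patient is a self-describing document, and queries filter on nested fields. The field names, values, and the tiny `find` helper below are all hypothetical stand-ins for a real query such as `{"labs.HbA1c": {"$gt": 7}}`.

```python
# Illustrative sketch of document-oriented storage and querying.
# Documents, field names, and values are hypothetical.
patients = [
    {"_id": "p1", "name": "A. Smith", "labs": {"HbA1c": 7.9}},
    {"_id": "p2", "name": "B. Jones", "labs": {"HbA1c": 5.4}},
]

def find(collection, field, predicate):
    """Tiny stand-in for a document query on a dotted nested field."""
    results = []
    for doc in collection:
        value = doc
        for part in field.split("."):  # walk "labs.HbA1c" into the document
            value = value[part]
        if predicate(value):
            results.append(doc["_id"])
    return results

print(find(patients, "labs.HbA1c", lambda v: v > 7.0))  # ['p1']
```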
The proposed healthcare system has a logical data model and query patterns that must be supported by the NoSQL databases (Klein et al., 2015). Reading a patient's medical test results is a core function of the data model, used to populate the user interface. The model also requires strong replica consistency when a new medical result is written for a patient, since providers make patient care decisions using these records. All providers will see the same information within the hospital systems of the four states, whether they are at the same site as the patient or providing telemedicine support from another location.
The logical data model includes mapping the application-specific model onto the particular data model, indexing, and query language capabilities of each database. The HL7 Fast Healthcare Interoperability Resources (FHIR) standard is used as the logical data model for records analysis. Patient demographic data, such as names, addresses, and telephone numbers, will be modeled using the FHIR Patient Resource, and test results will be modeled with fields such as result quantity and result units (Klein et al., 2015).
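A FHIR Patient resource carrying the demographic fields mentioned above looks like the following sketch; the identifier and all values are fictitious, and only a few of the Patient Resource's elements are shown.

```python
import json

# Illustrative FHIR Patient resource (fictitious data, abbreviated fields).
patient = {
    "resourceType": "Patient",
    "id": "example-patient-1",
    "name": [{"family": "Smith", "given": ["Alice"]}],
    "telecom": [{"system": "phone", "value": "555-0100"}],
    "address": [{"city": "Denver", "state": "CO"}],
}

# FHIR resources are exchanged as JSON documents like this one.
print(json.dumps(patient)[:40])
```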
1.5 Spark Integration for Real-Time Data Processing Requirement
While the Hadoop ecosystem architecture has been designed for scenarios such as data storage, data management, statistical analysis, association analysis across data sources, distributed computing, and batch processing, this proposal also requires real-time data processing, which Hadoop alone cannot provide (Basu, 2014). Real-time analytics will add tremendous value to the proposed healthcare system. Thus, Apache Spark is another required component of this proposal (Basu, 2014). Spark allows in-memory processing for fast response times, bypassing MapReduce operations (Basu, 2014). With Spark integrated into Hadoop, stream processing, machine learning, interactive analytics, and data integration all become possible (Scott, 2015). Spark will run on top of Hadoop to benefit from YARN and the underlying storage of HDFS, HBase, and the other Hadoop ecosystem building blocks (Scott, 2015). Figure 5 shows Spark's core engines.

Figure 5. Spark Core Engines (Scott, 2015).
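The kind of windowed computation a Spark streaming job would run at scale can be sketched in plain Python: a sliding average over the most recent samples of a vital sign. The window size and the heart-rate values are illustrative only.

```python
from collections import deque

# Pure-Python sketch of a sliding-window computation, the building block
# of stream analytics such as Spark's windowed aggregations.
def sliding_averages(samples, window=3):
    """Return the running average of the last `window` samples."""
    buf = deque(maxlen=window)  # old samples fall off automatically
    averages = []
    for value in samples:
        buf.append(value)
        averages.append(round(sum(buf) / len(buf), 1))
    return averages

print(sliding_averages([70, 74, 78, 130, 128]))
# [70.0, 72.0, 74.0, 94.0, 112.0] -- the window average rises as the
# anomalous readings arrive, which a monitoring rule could flag.
```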
1.6 Big Healthcare Data Visualization Requirement
Visualization is one of the most powerful ways to present data (Jayasingh, Patra, & Mahesh, 2016). It makes data more meaningful by rendering it as graphs, images, and pie charts that can be understood easily, and it helps synthesize large datasets, such as healthcare data, to get to the core of raw big data and convey its key points for insight (Meyer, 2018). Commercial visualization tools include Tableau, Spotfire, QlikView, and Adobe Illustrator; the tools most commonly used in healthcare are Tableau, PowerBI, and QlikView. This healthcare design proposal will use Tableau.
Healthcare providers are successfully transforming data from information into insight using Tableau software. Healthcare organizations can take three approaches to get more from their datasets: break down barriers to data access by empowering departments to explore their own data; uncover answers by combining data from multiple systems to reveal trends and outliers; and share insights with executives, providers, and others to drive collaboration (Tableau, 2011). Tableau's advantages include interactive visualization with drag-and-drop techniques, easy handling of large datasets with millions of rows, and integration with scripting languages such as Python (absentdata.com, 2018). It also provides mobile support and responsive dashboards. Its limitations include the substantial training required to master the platform fully, the lack of automatic refreshing and conditional formatting, and a 16-column table limit (absentdata.com, 2018). Figure 6 shows an example of patient cycle time data visualized in Tableau.

Figure 6. Patient Cycle Time Data Visualization Example (Tableau, 2011).
1.7 Artificial Intelligence Integration Requirement
Artificial Intelligence is a computational technique that allows machines to perform cognitive functions, such as acting or reacting to input, similarly to the way humans do (Patrizio, 2018). Traditional computing applications react to data, but their reactions and responses must be hand-coded with human intervention (Patrizio, 2018). AI systems, by contrast, are continuously in flux, changing their behavior to accommodate changes in results and modifying their reactions accordingly (Patrizio, 2018). AI techniques include video recognition, natural language processing, speech recognition, machine learning engines, and automation (Mills, 2018).
The healthcare system can benefit from integrating BDA with Artificial Intelligence (AI) (Bresnick, 2018). Since AI can play a significant role in BDA in healthcare, this proposal suggests implementing machine learning, a branch of AI, to deploy more precise and impactful interventions at the right time in patient care (Bresnick, 2018). The application of AI in the proposed design requires machine learning (Patrizio, 2018). Since the data used for AI and machine learning has already been cleaned, with duplicates and unnecessary data removed, AI can take advantage of this filtered data, leading to healthcare breakthroughs such as genomic and proteomic experiments that enable personalized medicine (Kersting & Meyer, 2018).
The healthcare industry has been utilizing AI, machine learning (ML), and data mining (DM) to extract value from BD by transforming large medical datasets into actionable knowledge through predictive and prescriptive analytics (Palanisamy & Thirunavukarasu, 2017). ML will be used to develop sophisticated algorithms that process massive medical datasets, including structured, unstructured, and semi-structured data, to perform advanced analytics (Palanisamy & Thirunavukarasu, 2017). Apache Mahout, an open-source ML library, will be integrated with Hadoop to facilitate the execution of scalable machine learning algorithms, offering techniques such as recommendation, classification, and clustering (Palanisamy & Thirunavukarasu, 2017).
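As a rough illustration of the clustering technique Mahout provides at scale, the following minimal one-dimensional k-means sketch separates two patient groups; the glucose readings and the deterministic initialization are hypothetical choices for the example, not part of the proposal:

```python
def kmeans_1d(points, k, iterations=10):
    """Minimal 1-D k-means sketch; Mahout offers the same clustering
    technique at scale on top of Hadoop."""
    pts = sorted(points)
    # Deterministic init: k evenly spaced values from the sorted data.
    centroids = [pts[i * (len(pts) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in pts:
            nearest = min(range(k), key=lambda i: abs(centroids[i] - p))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

# Hypothetical fasting blood-glucose readings (mg/dL) with two natural groups.
glucose = [82, 85, 90, 88, 160, 155, 165, 158]
print(kmeans_1d(glucose, k=2))  # [86.25, 159.5]
```

In the proposed system, the equivalent clustering would run as a distributed Mahout job over HDFS-resident patient data rather than an in-memory list.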
1.8 Internet of Things (IoT) Integration Requirement
Internet of Things (IoT) refers to the growing number of connected devices with IP addresses, which were not common years ago (Anand & Clarice, 2015; Thompson, 2017). These connected devices collect information and use their IP addresses to transmit it (Thompson, 2017). Healthcare providers take advantage of the collected information to find new treatment methods and increase efficiency (Thompson, 2017).
The implementation of IoT will involve various technologies, including radio frequency identification (RFID), near field communication (NFC), machine-to-machine (M2M) communication, wireless sensor networks (WSN), and addressing schemes (AS) such as IPv6 addresses (Anand & Clarice, 2015; Kumari, 2017). The implementation of IoT also requires machine learning algorithms to find patterns, correlations, and anomalies that have the potential to enable healthcare improvements (O’Brien, 2016). Machine learning is a critical component of artificial intelligence; thus, the success of IoT depends on AI implementation.
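A simple stand-in for the anomaly detection the IoT layer would perform is a standard-deviation check over a sensor stream; the wearable heart-rate readings and the two-sigma threshold below are hypothetical:

```python
import statistics

def flag_anomalies(readings, threshold=2.0):
    """Flag readings more than `threshold` standard deviations from the
    mean: a minimal stand-in for ML-based anomaly detection on IoT data."""
    mean = statistics.mean(readings)
    stdev = statistics.pstdev(readings)
    return [r for r in readings if abs(r - mean) > threshold * stdev]

# Hypothetical heart-rate stream from a wearable monitor (bpm).
stream = [71, 73, 70, 72, 74, 71, 140, 72, 70, 73]
print(flag_anomalies(stream))  # [140]
```

A production deployment would replace this rule with a trained model, but the pattern of flagging outliers in device streams is the same.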
1.9 Training Requirement
This design proposal requires training for IT professionals, providers, clinicians, and everyone else who will use this healthcare ecosystem, depending on their role (Alexandru et al., 2016; Archenaa & Anita, 2015). Training should cover each component of the ecosystem, such as Hadoop/MapReduce, Spark, and security. The training will play a significant role in the success of this design implementation to apply BD and BDA in the healthcare system in the four States of Colorado, Utah, Arizona, and New Mexico. Patients should also receive training for remote monitoring programs, such as blood sugar and blood pressure monitoring applications. Older patients might face some challenges; however, technical support can alleviate them.
2. Data Flow Diagram
This section discusses the data flow of the proposed healthcare ecosystem design for BDA applications.
2.1 HBase Cluster and HDFS Data Flow
HBase stores data in tables whose schema must be predefined, specifying the column families (Yang, Liu, Hsu, Lu, & Chu, 2013). New columns can be added to families as required, making the schema flexible and able to adapt to changing application requirements (Yang et al., 2013). HBase follows a master/slave architecture similar to HDFS, with its NameNode and slave nodes, and MapReduce, with its JobTracker and TaskTracker slaves (Yang et al., 2013). HBase will play a vital role in the Hadoop cluster environment. In HBase, a master node called the HMaster manages the cluster, while region servers store portions of the tables and perform the work on the data. The HMaster is responsible for monitoring all RegionServer instances in the cluster and is the interface for all metadata changes; it executes on the NameNode in the distributed Hadoop cluster. The HRegionServer is responsible for serving and managing regions and runs on a DataNode in the distributed Hadoop cluster. ZooKeeper ensures that another machine within the cluster is selected as HMaster in case of a failure, unlike the HDFS framework, where the NameNode is a single point of failure. The data flow between the DataNodes and the NameNode when integrating HBase on top of HDFS is shown in Figure 7.

Figure 7. HBase Cluster Data Flow (Yang et al., 2013).
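HBase's row-key/column-family layout can be sketched with plain dictionaries; the table, family, and qualifier names below are hypothetical, and the composite row key follows a common HBase time-series pattern:

```python
# A toy model of HBase's storage layout: each row key maps to
# "family:qualifier" cells, and new qualifiers can be added to an
# existing family without any schema migration.
table = {}

def put(row_key, family, qualifier, value):
    table.setdefault(row_key, {})[f"{family}:{qualifier}"] = value

def get(row_key):
    return table.get(row_key, {})

# Row key: patient id + timestamp, a common HBase time-series pattern.
put("patient42#2018-11-01T10:00", "vitals", "heart_rate", 72)
put("patient42#2018-11-01T10:00", "vitals", "spo2", 98)
# A new cell in a different family needs no schema change.
put("patient42#2018-11-01T10:00", "meta", "ward", "ICU-3")

print(get("patient42#2018-11-01T10:00"))
```

In the real cluster, these cells would be distributed across RegionServers and persisted to HDFS, but the logical addressing (row key, then family:qualifier) is exactly this.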
2.2 HBase and MongoDB with Hadoop/MapReduce and HDFS Data Flow
The healthcare system integrates four major components: HBase for data storage, MongoDB for metadata, Hadoop/MapReduce for computation, and a data visualization tool. The signal data will be stored in HBase, while the metadata and other clinical data will be stored in MongoDB. The data stored in both HBase and MongoDB will be accessible from the Hadoop/MapReduce environment for processing, as well as from the data visualization layer. The cluster will consist of one master node, eight slave nodes, and several supporting servers. The data will be imported into Hadoop and processed via MapReduce, and the results of the computation will be viewed through a data visualization tool such as Tableau. Figure 8 shows the data flow between these four components of the proposed healthcare ecosystem.

Figure 8. The Proposed Data Flow Between Hadoop/MapReduce and Other Databases.
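The map, shuffle, and reduce phases of the MapReduce computation can be sketched in miniature; the hospital names and visit records below are hypothetical:

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit a (key, value) pair for each raw record."""
    for hospital, _patient in records:
        yield (hospital, 1)

def shuffle(pairs):
    """Shuffle: group intermediate values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: aggregate each key's values."""
    return {key: sum(values) for key, values in grouped.items()}

# Hypothetical visit records: (hospital, patient id).
visits = [("Denver", "p1"), ("Phoenix", "p2"), ("Denver", "p3"),
          ("Salt Lake City", "p4"), ("Denver", "p5")]
print(reduce_phase(shuffle(map_phase(visits))))
# {'Denver': 3, 'Phoenix': 1, 'Salt Lake City': 1}
```

In the proposed cluster, the map tasks would run on the slave nodes against HBase/MongoDB inputs and the reduced results would feed the visualization layer.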
2.3 XML Design Flow Using ETL Process with MongoDB
Healthcare records contain various types of data, from structured and semi-structured to unstructured (Luo et al., 2016). Some of these records are XML-based, in a semi-structured format using tags. XML stands for eXtensible Markup Language (Fawcett, Ayers, & Quin, 2012). The healthcare sector can derive value from these XML documents, which represent semi-structured data (Aravind & Agrawal, 2014). Figure 9 shows an example of an XML-based patient record.

Figure 9. Example of the Patient’s Electronic Health Record (HL7, 2011).
XML-based records need to be ingested into the Hadoop system for analytics, so that value can be derived from this semi-structured data. Although XML is a common input format for MapReduce jobs, Hadoop does not offer a standard XML “RecordReader” (Lublinsky, Smith, & Yakubovich, 2013), so various approaches can be used to process semi-structured XML data. The ETL (Extract, Transform, and Load) process can be used to process XML data in Hadoop. MongoDB, a NoSQL database required in this design proposal, handles document-oriented data and is therefore well suited to XML content.
The ETL process in MongoDB starts with extract and transform. The MongoDB application provides the ability to map the XML elements within the document to the downstream data structure. The application supports the ability to unwind simple arrays or present embedded documents using appropriate data relationships such as one-to-one (1:1), one-to-many (1:M), or many-to-many (M:M) (MongoDB, 2018). The application infers the schema by examining a subset of documents within the target collections. Organizations can add fields to the discovered data model that may not have been present within the subset of documents used for schema inference. The application also infers information about the existing indexes for collections to be queried and prompts or warns of queries that do not use any indexed fields. The application can return a subset of fields from documents using query projections. For queries against MongoDB Replica Sets, the application supports specifying custom MongoDB Read Preferences for individual query operations. The application then infers information about the sharded cluster deployment and notes the shard key fields for each sharded collection. For queries against MongoDB Sharded Clusters, the application warns against queries that do not use proper query isolation, since broadcast queries in a sharded cluster can have a negative impact on database performance (MongoDB, 2018).
The load process in MongoDB is performed after the extract and transform steps. The application supports writing data to any MongoDB deployment, whether a single node, a replica set, or a sharded cluster. For writes to a MongoDB Sharded Cluster, the application informs the user or displays an error message if XML documents do not contain a shard key. A custom WriteConcern can be used for any write operations to a running MongoDB deployment. For bulk loading, documents can be written in batches using the insert() method with MongoDB version 2.6 or above, which supports the bulk update database command. For bulk loading into a MongoDB sharded deployment, bulk insert into a sharded collection is supported, including pre-splitting the collection’s shard key and inserting via multiple mongos processes. Figure 10 shows this ETL process for XML-based patient records using MongoDB.

Figure 10. The Proposed XML ETL Process in MongoDB.
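The extract-and-transform step for an XML patient record can be sketched with the Python standard library; the record below is a hypothetical simplification of a real HL7 document, and the element and attribute names are illustrative only:

```python
import xml.etree.ElementTree as ET

# Hypothetical simplified patient record; real HL7 CDA documents are
# far richer, but the extract/transform step has the same shape.
xml_record = """
<patient id="p42">
  <name>Jane Doe</name>
  <observations>
    <observation code="BP" value="120/80"/>
    <observation code="HR" value="72"/>
  </observations>
</patient>
"""

def xml_to_document(xml_text):
    """Extract: parse the XML. Transform: map elements and attributes
    to a nested document ready to load into a MongoDB collection."""
    root = ET.fromstring(xml_text)
    return {
        "_id": root.get("id"),
        "name": root.findtext("name"),
        # Unwind the repeated <observation> elements into an array,
        # a natural fit for MongoDB's embedded-document model.
        "observations": [
            {"code": o.get("code"), "value": o.get("value")}
            for o in root.iter("observation")
        ],
    }

doc = xml_to_document(xml_record)
print(doc["_id"], len(doc["observations"]))  # p42 2
```

The load step would then hand `doc` to a MongoDB insert, with the shard-key and WriteConcern considerations described above.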
2.4 Real-Time Streaming Spark Data Flow
Real-time streaming can be implemented using a real-time streaming framework such as Spark, Kafka, or Storm. This healthcare design proposal will integrate the open-source Spark framework for real-time streaming data, such as sensor data from intensive care units, remote monitoring programs, and biomedical signals. The data from these sources will flow into Spark for analytics and then be imported into the data storage systems. Figure 11 illustrates the data flow for real-time streaming analytics.

Figure 11. The Proposed Spark Data Flow.
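The windowed aggregation that Spark Streaming applies to micro-batches can be illustrated with a minimal sliding-window average; the heart-rate stream and window size below are hypothetical:

```python
from collections import deque

class WindowedAverage:
    """Sliding-window mean over a sensor stream: a miniature analogue
    of the windowed aggregations Spark Streaming runs on micro-batches."""
    def __init__(self, size):
        self.window = deque(maxlen=size)  # old values fall off automatically

    def update(self, value):
        self.window.append(value)
        return sum(self.window) / len(self.window)

# Hypothetical ICU heart-rate stream processed as readings arrive.
monitor = WindowedAverage(size=3)
averages = [monitor.update(v) for v in [70, 74, 78, 90, 120]]
print(averages)
```

A sharp rise in the windowed average (as at the end of this stream) is the kind of signal that could trigger an alert for urgent care or emergency services.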
3. Communication Workflow
The communication flow involves the stakeholders in the healthcare system: providers, insurers, pharmaceutical vendors, IT professionals, and practitioners from the four States of Colorado, Utah, Arizona, and New Mexico. The communication flow is centered on the patient-centric healthcare system using cloud computing technology, and this patient-centric system is the central point of communication. The patients communicate with the central system using the web-based platform and clinical forums as needed. The providers communicate with the central system through resource usage, patient feedback, hospital visits, and service details. The insurers communicate with the central system using claims databases and census and societal data. The pharmaceutical vendors communicate with the central system using prescription and drug reports, which the providers can retrieve from anywhere in the four States. The IT professionals and practitioners communicate with the central system for data streaming, medical records, genomics, and other omics data analysis and reporting. Figure 12 shows the communication flow between these stakeholders and the central system in the cloud, which can be accessed from any of the four identified States.

Figure 12. The Proposed Patient-Centric Healthcare System Communication Flow.
4. Overall System Diagram
The overall system represents a state-of-the-art healthcare ecosystem that utilizes the latest technology for healthcare Big Data Analytics. The system is bounded by regulations and policies such as HIPAA to ensure the protection of patients’ privacy across its various layers. The integrated components include the latest Hadoop technology with MapReduce and HDFS. The data governance layer is the bottom layer, containing three major building blocks: master data management (MDM), data life-cycle management (DLM), and data security and privacy management. The MDM component is responsible for data completeness, accuracy, and availability, while the DLM component is responsible for archiving data, maintaining the data warehouse, and data deletion and disposal. The data security and privacy management building block is responsible for sensitive data discovery, vulnerability and configuration assessment, security policy application, auditing and compliance reporting, activity monitoring, identity and access management, and data protection. The upper layers include the data layer, data aggregation layer, data analytics layer, and information exploration layer. The data layer is responsible for data sources and content formats, while the data aggregation layer involves data acquisition, transformation engines, and the data storage area using Hadoop, HDFS, and NoSQL databases such as MongoDB and HBase. The data analytics layer involves the Hadoop/MapReduce mapping process, stream computing, real-time streaming, and database analytics; AI and IoT are part of this layer. The information exploration layer involves data visualization, visualization reporting, real-time monitoring using a healthcare dashboard, and clinical decision support. Figure 13 illustrates the overall system diagram with these layers.

Figure 13. The Proposed Healthcare Overall System Diagram.
5. Regulations, Policies, and Governance for the Medical Industry
Healthcare data must be stored in a secure storage area to protect the information and the privacy of patients (Liveri, Sarri, & Skouloudi, 2015). When the healthcare industry fails to comply with regulations and policies, the resulting fines and costs can cause financial stress on the industry (Thompson, 2017). Records show that the healthcare industry has paid millions of dollars in fines. Advocate Health Care in suburban Chicago agreed to the largest settlement as of August 2016, a total of $5.55 million (Thompson, 2017), and Memorial Health System in southern Florida became the second entity to pay more than $5 million (Thompson, 2017). Table 2 shows the five largest fines posted to the Office of Civil Rights (OCR) site.
Table 2. Five Largest Fines Posted to OCR Web Site (Thompson, 2017)

The hospitals must adhere carefully to the data privacy regulations and legislative rules (HIPAA) to protect patients’ medical records from data breaches. Proper security policies and risk management must be implemented to ensure the protection of private information and to minimize the impact of confidential data loss or theft (HIPAA, 2018a, 2018c; Salido, 2010). The healthcare system design proposal requires the implementation of a compliance system, with an escalation path, for hospitals or providers that are not compliant with the regulations and policies (Salido, 2010). This design proposal implements four major principles as best practices to comply with required policies and regulations and to protect the confidential data assets of patients and users (Salido, 2010). The first principle is to honor policies throughout the life of private data (Salido, 2010). The second is to minimize the risk of unauthorized access to or misuse of confidential data (Salido, 2010). The third is to minimize the impact of confidential data loss, and the fourth is to document appropriate controls and demonstrate their effectiveness (Salido, 2010). Figure 14 shows the four principles this healthcare design proposal adheres to in order to protect healthcare data from unauthorized users and comply with the required regulations and policies.

Figure 14. Healthcare Design Proposal Four Principles.
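The second principle, minimizing unauthorized access to confidential data, can be sketched as a minimal role-based access check; the roles and record fields below are hypothetical, not drawn from the proposal:

```python
# Hypothetical role-to-field permissions; a real deployment would load
# these from a policy store and log every access decision for auditing
# (principle four: document controls and demonstrate effectiveness).
PERMISSIONS = {
    "physician":  {"diagnosis", "medications", "lab_results"},
    "billing":    {"insurance_id", "claims"},
    "researcher": {"lab_results"},  # de-identified use only
}

def can_access(role, field):
    """Deny by default: a role sees only fields it is explicitly granted."""
    return field in PERMISSIONS.get(role, set())

print(can_access("physician", "diagnosis"))  # True
print(can_access("billing", "diagnosis"))    # False
```

Deny-by-default checks like this, combined with audit logging, are one concrete way the design's principles translate into code.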
6. Assumptions and Limitations
This design proposal assumes that the healthcare sector in the four States will support the application of BD and BDA across these States. The support includes investment in the proper technology, tools, and training based on the requirements of this design proposal. The proposal also assumes that the stakeholders, including the providers, patients, insurers, pharmaceutical vendors, and practitioners, will welcome the application of BDA and take advantage of it to provide efficient healthcare services, increase productivity, decrease costs for the healthcare sector as well as for patients, and provide better care to patients.
The main limitation of this proposal is the timeframe required to implement it. With the support of the healthcare sector in these four States, the implementation can be expedited. However, the silo pattern and the rigid culture of healthcare may interfere with the implementation, which could take longer than expected. The initial implementation might also face unexpected challenges, most likely stemming from the lack of experienced IT professionals and managers in the BD and BDA domain. This design proposal will be enhanced based on observations from the first few months of the implementation.
7. Justification for the Overall Design
Traditional database and analytical systems have been found inadequate for dealing with healthcare data in the age of BDA. The characteristics of healthcare datasets, including the large volume of medical records, the variety of the data from structured to semi-structured to unstructured, and the velocity of data generation and processing, require a technology such as cloud computing (Fernández et al., 2014). Cloud computing has been found to be the best solution for BD and BDA, addressing the challenges of BD storage and compute-intensive processing demands (Alexandru et al., 2016; Hashem et al., 2015). The healthcare system in the four States will shift the communication technology and services for applications across the hospitals and providers (Hashem et al., 2015). Advantages of cloud computing adoption include virtualized resources, parallel processing, security, and data service integration with scalable data storage (Hashem et al., 2015). With cloud computing technology, the healthcare sector in the four States will reduce costs and increase efficiency (Hashem et al., 2015). When quick access to critical patient-care data is required, the ability to access the data from anywhere is one of the most significant advantages of the cloud computing adoption recommended by this proposed design (Carutasu, Botezatu, Botezatu, & Pirnau, 2016). The benefits of cloud computing include technological benefits such as virtualization, multi-tenancy, data and storage, and security and privacy compliance (Chang, 2015). Cloud computing also offers economic benefits such as pay-per-use pricing, cost reduction, and return on investment (Chang, 2015). Its non-functional benefits cover elasticity, quality of service, reliability, and availability (Chang, 2015).
Thus, the proposed design justifies the use of cloud computing, which has proven to be the best technology for BDA, especially for healthcare data analytics.
Although cloud computing offers several benefits to the proposed healthcare system, it has suffered from security and privacy concerns (Balasubramanian & Mala, 2015; Kazim & Zhu, 2015). The security concerns involve risk areas such as external data storage, dependency on the public internet, lack of control, multi-tenancy, and integration with internal security (Hashizume, Rosado, Fernández-medina, & Fernandez, 2013). Traditional security techniques such as identity, authentication, and authorization are not sufficient for cloud computing environments in their current forms under the standard public and private cloud deployment models (Hashizume et al., 2013). The increasing trend in security threats and data breaches, together with the current private and public cloud deployment models that do not meet these security challenges, has triggered the need for another deployment model to ensure security and privacy protection. Thus, this proposal adopts the VPC, a newer deployment model of cloud computing technology (Botta et al., 2016; Sultan, 2010; Venkatesan, 2012; Zhang, Q., Cheng, & Boutaba, 2010). The VPC takes advantage of technologies such as a virtual private network (VPN), which will allow hospitals and providers to set up their required network settings, including security (Botta et al., 2016; Sultan, 2010; Venkatesan, 2012; Zhang, Q. et al., 2010). The VPC deployment model will have dedicated resources with the VPN to provide the isolation required to protect the patients’ information (Botta et al., 2016; Sultan, 2010; Venkatesan, 2012; Zhang, Q. et al., 2010). Thus, this proposed design will use the VPC cloud computing deployment model to store and use healthcare data in a secure and isolated environment that protects the patients’ medical records (Regola & Chawla, 2013).
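The security-group filtering that a VPC provides can be sketched as deny-by-default allow-list evaluation; the CIDR blocks, subnets, and ports below are hypothetical, not part of the proposal's actual network plan:

```python
import ipaddress

# Toy VPC-style security-group rules: traffic is denied unless an
# allow rule matches both the source network and the destination port.
ALLOW_RULES = [
    {"cidr": "10.0.1.0/24", "port": 443},    # hospital app subnet, HTTPS
    {"cidr": "10.0.2.0/24", "port": 27017},  # analytics subnet -> MongoDB
]

def is_allowed(source_ip, port):
    src = ipaddress.ip_address(source_ip)
    return any(
        src in ipaddress.ip_network(rule["cidr"]) and port == rule["port"]
        for rule in ALLOW_RULES
    )

print(is_allowed("10.0.1.15", 443))     # True: allowed subnet and port
print(is_allowed("198.51.100.7", 443))  # False: public internet blocked
```

Real VPC security groups and network ACLs are stateful/stateless variants of exactly this matching, configured in the cloud provider's console or API rather than in application code.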
The Hadoop ecosystem is a required component of this proposed design for several reasons. Hadoop is a commonly used computing paradigm for massive-volume data processing in cloud computing (Bansal et al., 2014; Chrimes et al., 2018; Dhotre et al., 2015). Hadoop is the only technology that enables large volumes of healthcare data to be stored in their native forms (Dezyre, 2016). Hadoop has helped develop better treatments for diseases such as cancer by accelerating the design and testing of effective treatments tailored to patients, expanding genetically based clinical cancer trials, and establishing a national cancer knowledge network to guide treatment decisions (Dezyre, 2016). With the Hadoop system, hospitals in the four States will be able to monitor patient vitals (Dezyre, 2016). Children’s Healthcare of Atlanta, for example, uses the Hadoop ecosystem to treat over six thousand children in its ICU units (Dezyre, 2016).
The proposed design requires the integration of a NoSQL database because it offers benefits such as mass storage support, fast read and write operations, and easy, low-cost expansion (Sahafizadeh & Nematbakhsh, 2015). HBase is proposed as the required NoSQL database because it is faster when reading more than six million variants, which is required when analyzing large healthcare datasets (Luo et al., 2016). In addition, a query engine such as SeqWare can be integrated with HBase as needed to help bioinformatics researchers access large-scale whole-genome datasets (Luo et al., 2016). HBase can store clinical sensor data, where the row key serves as the timestamp of a single value and the column stores the patient’s physiological values corresponding to that timestamp (Luo et al., 2016). HBase is a scalable, high-performance, low-cost NoSQL data store that can be integrated with Hadoop, sitting on top of HDFS (Yang et al., 2013). As a column-oriented NoSQL data store that runs on top of HDFS, HBase is well suited to parsing large healthcare datasets (Yang et al., 2013). HBase supports applications written with Avro, REST, and Thrift (Yang et al., 2013). MongoDB is another NoSQL data store, which will be used to store metadata to improve the accessibility and readability of the HBase data schema (Luo et al., 2016).
The integration of Spark is required to overcome Hadoop’s limitation in real-time data processing, because Hadoop is not optimal for real-time workloads (Guo, 2013). Thus, Apache Spark is a required component of this proposal so that the healthcare BDA system can take advantage of processing data at rest using batching as well as data in motion using real-time processing (Liang & Kelemen, 2016). Spark allows in-memory processing for fast response times, bypassing MapReduce operations (Liang & Kelemen, 2016), and integrates tightly with recent Hadoop cluster deployments (Scott, 2015). While Spark is a powerful tool on its own for processing large volumes of medical and healthcare datasets, it is not well suited for production workloads by itself. Thus, the integration of Spark with the Hadoop ecosystem provides many capabilities that neither Spark nor Hadoop can offer alone.
The integration of AI as part of this proposal is justified by a Harvard Business Review (HBR) examination that identified ten promising AI applications in healthcare (Kalis, Collier, & Fu, 2018). The findings showed that the application of AI could create up to $150 billion in annual savings for U.S. healthcare by 2026 (Kalis et al., 2018). The results also showed that AI currently creates the most value in helping frontline clinicians be more productive and in making back-end processes more efficient (Kalis et al., 2018). Furthermore, IBM invested $1 billion in AI through the IBM Watson Group, and the healthcare industry is the most significant application of Watson (Power, 2015).
Conclusion
Big Data and Big Data Analytics have played significant roles in various industries, including healthcare. The value driven by BDA can save lives and minimize costs for patients. This project proposes a design to apply BDA in the healthcare system across the four States of Colorado, Utah, Arizona, and New Mexico. Cloud computing is the most appropriate technology for dealing with the large volume of healthcare data. Due to the security issues of cloud computing, a Virtual Private Cloud (VPC) will be used. The VPC provides a secure cloud environment using network traffic controls such as security groups and network access control lists.
The project requires other components to be fully implemented using the latest technology, such as Hadoop and MapReduce for streaming data processing and machine learning for artificial intelligence, which will be used for the Internet of Things (IoT). The NoSQL databases HBase and MongoDB will be used to handle semi-structured data such as XML and unstructured data such as logs and images. Spark will be used for real-time data processing, which can be vital for urgent care and emergency services. This project has addressed the assumptions and limitations, plus the justification for selecting these specific components.
In summary, all stakeholders in the healthcare sector, including providers, insurers, pharmaceutical vendors, and practitioners, should cooperate and coordinate to facilitate the implementation process. All stakeholders are responsible for facilitating the integration of BD and BDA into the healthcare system. The rigid culture and silo pattern need to change for a better healthcare system, which can save millions of dollars for the healthcare industry and provide excellent care to patients at the same time.
References
Abdul, A. M., Jena, S., Prasad, S. D., & Balraju, M. (2014). Trusted Environment In Virtual Cloud. International Journal of Advanced Research in Computer Science, 5(4).
Abernathy, R., & McMillan, T. (2016). CISSP Cert Guide: Pearson IT Certification.
absentdata.com. (2018). Tableau Advantages and Disadvantages. Retrieved from https://www.absentdata.com/advantages-and-disadvantages-of-tableau/.
Alexandru, A., Alexandru, C., Coardos, D., & Tudora, E. (2016). Healthcare, Big Data and Cloud Computing. management, 1, 2.
Alguliyev, R., & Imamverdiyev, Y. (2014). Big data: big promises for information security. Paper presented at the Application of Information and Communication Technologies (AICT), 2014 IEEE 8th International Conference on.
Anand, M., & Clarice, S. (2015). Artificial Intelligence Meets Internet of Things. Retrieved from http://www.ijcset.net/docs/Volumes/volume5issue6/ijcset2015050604.pdf.
Ankam, V. (2016). Big Data Analytics: Packt Publishing Ltd.
Aravind, P. S., & Agrawal, V. (2014). Processing XML data in BigInsights 3.0. Retrieved from https://developer.ibm.com/hadoop/2014/10/31/processing-xml-data-biginsights-3-0/.
Archenaa, J., & Anita, E. M. (2015). A survey of big data analytics in healthcare and government. Procedia Computer Science, 50, 408-413.
Assunção, M. D., Calheiros, R. N., Bianchi, S., Netto, M. A. S., & Buyya, R. (2015). Big Data Computing and Clouds: Trends and Future Directions. Journal of Parallel and Distributed Computing, 79, 3-15. doi:10.1016/j.jpdc.2014.08.003
Balasubramanian, V., & Mala, T. (2015). A Review On Various Data Security Issues In Cloud Computing Environment And Its Solutions. Journal of Engineering and Applied Sciences, 10(2).
Bansal, A., Deshpande, A., Ghare, P., Dhikale, S., & Bodkhe, B. (2014). Healthcare data analysis using dynamic slot allocation in Hadoop. International Journal of Recent Technology and Engineering, 3(5), 15-18.
Basu, A. (2014). Real-Time Healthcare Analytics on Apache Hadoop* using Spark* and Shark. Retrieved from https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/big-data-real-time-healthcare-analytics-whitepaper.pdf.
Botta, A., de Donato, W., Persico, V., & Pescapé, A. (2016). Integration of Cloud Computing and Internet Of Things: a Survey. Future Generation computer systems, 56, 684-700.
Bresnick, J. (2018). Top 12 Ways Artificial Intelligence Will Impact Healthcare. Retrieved from https://healthitanalytics.com/news/top-12-ways-artificial-intelligence-will-impact-healthcare.
Carutasu, G., Botezatu, M., Botezatu, C., & Pirnau, M. (2016). Cloud Computing and Windows Azure. Electronics, Computers and Artificial Intelligence.
Chang, V. (2015). A Proposed Framework for Cloud Computing Adoption. International Journal of Organizational and Collective Intelligence, 6(3).
Chrimes, D., Zamani, H., Moa, B., & Kuo, A. (2018). Simulations of Hadoop/MapReduce-Based Platform to Support its Usability of Big Data Analytics in Healthcare.
Cloud Security Alliance. (2013). The Notorious Nine: Cloud Computing Top Threats in 2013. Cloud Security Alliance: Top Threats Working Group.
Cloud Security Alliance. (2016). The Treacherous 12: Cloud Computing Top Threats in 2016. Cloud Security Alliance: Top Threats Working Group.
Cloud Security Alliance. (2017). The Treacherous 12 Top Threats to Cloud Computing. Cloud Security Alliance: Top Threats Working Group.
Dezyre. (2016). 5 Healthcare Applications of Hadoop and Big Data Retrieved from https://www.dezyre.com/article/5-healthcare-applications-of-hadoop-and-big-data/85.
Dhotre, P., Shimpi, S., Suryawanshi, P., & Sanghati, M. (2015). Health Care Analysis Using Hadoop. International Journal of Scientific & Technology Research, 4(12), 279-281.
EMC. (2015). Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data. (1st ed.): Wiley.
Fawcett, J., Ayers, D., & Quin, L. R. (2012). Beginning XML: John Wiley & Sons.
Fernández, A., del Río, S., López, V., Bawakid, A., del Jesus, M. J., Benítez, J. M., & Herrera, F. (2014). Big Data with Cloud Computing: An Insight on the Computing Environment, MapReduce, and Programming Frameworks. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 4(5), 380-409. doi:10.1002/widm.1134
Fox, M., & Vaidyanathan, G. (2016). Impacts of Healthcare Big Data: A Framework With Legal and Ethical Insights. Issues in Information Systems, 17(3).
Ghani, K. R., Zheng, K., Wei, J. T., & Friedman, C. P. (2014). Harnessing big data for health care and research: are urologists ready? European urology, 66(6), 975-977.
Grover, M., Malaska, T., Seidman, J., & Shapira, G. (2015). Hadoop Application Architectures: Designing Real-World Big Data Applications. O'Reilly Media, Inc.
Groves, P., Kayyali, B., Knott, D., & Kuiken, S. V. (2016). The ‘Big Data’ Revolution in Healthcare: Accelerating Value and Innovation.
Guo, S. (2013). Hadoop operations and cluster management cookbook: Packt Publishing Ltd.
Gupta, R., Gupta, H., & Mohania, M. (2012). Cloud Computing and Big Data Analytics: What is New From Databases Perspective? Paper presented at the International Conference on Big Data Analytics, Springer-Verlag Berlin Heidelberg.
Hashem, I. A. T., Yaqoob, I., Anuar, N. B., Mokhtar, S., Gani, A., & Khan, S. U. (2015). The Rise of “Big Data” on Cloud Computing: Review and Open Research Issues. Information Systems, 47, 98-115. doi:10.1016/j.is.2014.07.006
Hashizume, K., Rosado, D. G., Fernández-medina, E., & Fernandez, E. B. (2013). An analysis of security issues for cloud computing. Journal of internet services and applications, 4(1), 1-13. doi:10.1186/1869-0238-4-5
HIMSS. (2018). 2017 Security Metrics: Guide to HIPAA Compliance: What Healthcare Entities and Business Associates Need to Know. Retrieved on 12/1/2018 from http://www.himss.org/file/1318331/download?token=h9cBvnl2.
HIPAA. (2018a). At Least 3.14 Million Healthcare Records Were Exposed in Q2, 2018. Retrieved 11/22/2018 from https://www.hipaajournal.com/q2-2018-healthcare-data-breach-report/.
HIPAA. (2018b). How to Defend Against Insider Threats in Healthcare. Retrieved 8/22/2018 from https://www.hipaajournal.com/category/healthcare-cybersecurity/.
HIPAA. (2018c). Q3 Healthcare Data Breach Report: 4.39 Million Records Exposed in 117 Breaches. Retrieved 11/22/2018 from https://www.hipaajournal.com/q3-healthcare-data-breach-report-4-39-million-records-exposed-in-117-breaches/.
HIPAA. (2018d). Report: Healthcare Data Breaches in Q1, 2018. Retrieved 5/15/2018 from https://www.hipaajournal.com/report-healthcare-data-breaches-in-q1-2018/.
HL7. (2011). Patient Example Instance in XML.
Hu, H., Wen, Y., Chua, T., & Li, X. (2014). Toward Scalable Systems for Big Data Analytics: A Technology Tutorial. IEEE Access, 2, 652-687. doi:10.1109/ACCESS.2014.2332453
InformationBuilders. (2018). Data In Motion – Big Data Analytics in Healthcare. Retrieved from http://docs.media.bitpipe.com/io_10x/io_109369/item_674791/datainmotionbigdataanalytics.pdf, White Paper.
Jayasingh, B. B., Patra, M. R., & Mahesh, D. B. (2016, 14-17 Dec. 2016). Security issues and challenges of big data analytics and visualization. Paper presented at the 2016 2nd International Conference on Contemporary Computing and Informatics (IC3I).
Ji, Z., Ganchev, I., O’Droma, M., Zhang, X., & Zhang, X. (2014). A cloud-based X73 ubiquitous mobile healthcare system: design and implementation. The Scientific World Journal, 2014.
Kalis, B., Collier, M., & Fu, R. (2018). 10 Promising AI Applications in Health Care. Retrieved from https://hbr.org/2018/05/10-promising-ai-applications-in-health-care, Harvard Business Review.
Karanth, S. (2014). Mastering Hadoop: Packt Publishing Ltd.
Kazim, M., & Zhu, S. Y. (2015). A Survey on Top Security Threats in Cloud Computing. International Journal Advanced Computer Science and Application, 6(3), 109-113.
Kersting, K., & Meyer, U. (2018). From Big Data to Big Artificial Intelligence? : Springer.
Klein, J., Gorton, I., Ernst, N., Donohoe, P., Pham, K., & Matser, C. (2015, June 27-July 2, 2015). Application-Specific Evaluation of NoSQL Databases. Paper presented at the 2015 IEEE International Congress on Big Data.
Kritikos, K., Kirkham, T., Kryza, B., & Massonet, P. (2017). Towards a Security-Enhanced PaaS Platform for Multi-Cloud Applications. Future Generation computer systems, 67, 206-226. doi:10.1016/j.future.2016.10.008
Kumari, W. M. P. (2017). Artificial Intelligence Meets Internet of Things.
Liang, Y., & Kelemen, A. (2016). Big Data Science and its Applications in Health and Medical Research: Challenges and Opportunities. Austin Journal of Biometrics & Biostatistics, 7(3).
Liveri, D., Sarri, A., & Skouloudi, C. (2015). Security and Resilience in eHealth: Security Challenges and Risks. European Union Agency For Network And Information Security.
Lublinsky, B., Smith, K. T., & Yakubovich, A. (2013). Professional hadoop solutions: John Wiley & Sons.
Luo, J., Wu, M., Gopukumar, D., & Zhao, Y. (2016). Big Data Application in Biomedical Research and Health Care: A Literature Review. Biomedical Informatics Insights, 8, BII.S31559.
Malik, L., & Sangwan, S. (2015). MapReduce Framework Implementation on the Prescriptive Analytics of Health Industry. International Journal of Computer Science and Mobile Computing, ISSN, 675-688.
Maltby, D. (2011). Big Data Analytics. Paper presented at the Annual Meeting of the Association for Information Science and Technology.
Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Byers, A. H. (2011). Big Data: The Next Frontier for Innovation, Competition, and Productivity. McKinsey Global Institute.
McKelvey, N., Curran, K., Gordon, B., Devlin, E., & Johnston, K. (2015). Cloud Computing and Security in the Future Guide to Security Assurance for Cloud Computing (pp. 95-108): Springer.
Mehmood, A., Natgunanathan, I., Xiang, Y., Hua, G., & Guo, S. (2016). Protection of Big Data Privacy. IEEE Access, 4, 1821-1834. doi:10.1109/ACCESS.2016.2558446
Meyer, M. (2018). The Rise of Healthcare Data Visualization.
Mills, T. (2018). Eight Ways Big Data And AI Are Changing The Business World.
MongoDB. (2018). ETL Best Practice.
O’Brien, B. (2016). Why The IoT Needs Artificial Intelligence to Succeed.
Palanisamy, V., & Thirunavukarasu, R. (2017). Implications of Big Data Analytics in Developing Healthcare Frameworks – A Review. Journal of King Saud University - Computer and Information Sciences.
Patrizio, A. (2018). Big Data vs. Artificial Intelligence.
Power, B. (2015). Artificial Intelligence Is Almost Ready for Business.
Raghupathi, W., & Raghupathi, V. (2014). Big data analytics in healthcare: promise and potential. Health Information Science and Systems, 2(1), 1.
Regola, N., & Chawla, N. (2013). Storing and Using Health Data in a Virtual Private Cloud. Journal of medical Internet research, 15(3), 1-12. doi:10.2196/jmir.2076
Sahafizadeh, E., & Nematbakhsh, M. A. (2015). A Survey on Security Issues in Big Data and NoSQL. Int’l J. Advances in Computer Science, 4(4), 2322-5157.
Salido, J. (2010). Data Governance for Privacy, Confidentiality and Compliance: A Holistic Approach. ISACA Journal, 6, 17.
Scott, J. A. (2015). Getting Started with Spark: MapR Technologies, Inc.
Stewart, J., Chapple, M., & Gibson, D. (2015). CISSP (ISC)² Certified Information Systems Security Professional Official Study Guide (7th ed.): Wiley.
Sultan, N. (2010). Cloud Computing for Education: A New Dawn? International Journal of Information Management, 30(2), 109-116. doi:10.1016/j.ijinfomgt.2009.09.004
Sun, J., & Reddy, C. (2013). Big Data Analytics for Healthcare. Retrieved from https://www.siam.org/meetings/sdm13/sun.pdf.
Tableau. (2011). Three Ways Healthcare Providers Are Transforming Data from Information to Insight. White Paper.
Thompson, E. C. (2017). Building a HIPAA-Compliant Cybersecurity Program, Using NIST 800-30 and CSF to Secure Protected Health Information.
Van-Dai, T., Chuan-Ming, L., & Nkabinde, G. W. (2016, 5-7 July 2016). Big data stream computing in healthcare real-time analytics. Paper presented at the 2016 IEEE International Conference on Cloud Computing and Big Data Analysis (ICCCBDA).
Venkatesan, T. (2012). A Literature Survey on Cloud Computing. i-Manager’s Journal on Information Technology, 1(1), 44-49.
Wang, Y., Kung, L. A., & Byrd, T. A. (2018). Big Data Analytics: Understanding its Capabilities and Potential Benefits for Healthcare Organizations. Technological Forecasting and Social Change, 126, 3-13. doi:10.1016/j.techfore.2015.12.019
Wicklund, E. (2014). ‘Silo’ one of healthcare’s biggest flaws. Retrieved from http://www.healthcareitnews.com/news/silo-one-healthcares-biggest-flaws.
Yang, C. T., Liu, J. C., Hsu, W. H., Lu, H. W., & Chu, W. C. C. (2013, 16-18 Dec. 2013). Implementation of Data Transform Method into NoSQL Database for Healthcare Data. Paper presented at the 2013 International Conference on Parallel and Distributed Computing, Applications and Technologies.
Zhang, Q., Cheng, L., & Boutaba, R. (2010). Cloud Computing: State-of-the-Art and Research Challenges. Journal of internet services and applications, 1(1), 7-18. doi:10.1007/s13174-010-0007-6
Zhang, R., & Liu, L. (2010). Security Models and Requirements for Healthcare Application Clouds. Paper presented at the 2010 IEEE 3rd International Conference on Cloud Computing (CLOUD).
Zia, U. A., & Khan, N. (2017). An Analysis of Big Data Approaches in Healthcare Sector. International Journal of Technical Research & Science, 2(4), 254-264.