Proposal: State-of-the-Art Healthcare System in Four States.

Dr. O. Aly
Computer Science

Abstract

The purpose of this proposal is to design a state-of-the-art healthcare system for the four States of Colorado, Utah, Arizona, and New Mexico.  Big Data (BD) and Big Data Analytics (BDA) have played significant roles in various industries, including healthcare.  The value driven by BDA can save lives and minimize costs for patients.  The project proposes a design to apply BD and BDA in the healthcare system across these four States.  Cloud computing is the most appropriate technology for handling the large volume of healthcare data at both the storage and the data-processing levels.  Because of the security issues of cloud computing, a Virtual Private Cloud (VPC) will be used.  The VPC provides a secure cloud environment in which network traffic is controlled through security groups and network access control lists.  The project requires other components implemented with the latest technology, such as Hadoop and MapReduce for data stream processing and machine learning as the artificial intelligence foundation for the Internet of Things (IoT).  The NoSQL databases HBase and MongoDB will handle semi-structured data such as XML and unstructured data such as logs and images.  Spark will be used for real-time data processing, which can be vital for urgent care and emergency services.  The proposal also addresses the assumptions and limitations, along with the justification for selecting these specific components.  All stakeholders in the healthcare sector, including providers, insurers, pharmaceutical companies, and practitioners, should cooperate and coordinate to facilitate the implementation process.  The rigid culture and silo pattern need to change for better healthcare, which can save the healthcare industry millions of dollars while providing excellent care to patients.

Keywords: Big Data Analytics; Hadoop; Healthcare Big Data System; Spark.

Introduction

            In the age of Big Data (BD), information technology plays a significant role in the healthcare industry (HIMSS, 2018).  The healthcare sector generates a massive amount of data every day to conform to standards and regulations (Alexandru, Alexandru, Coardos, & Tudora, 2016).  The generated Big Data has the potential to support many medical and healthcare operations, including clinical decision support, disease surveillance, and population health management (Alexandru et al., 2016).  This project proposes a state-of-the-art integrated system for hospitals located in Arizona, Colorado, New Mexico, and Utah.  The system is based on the Hadoop ecosystem and helps the hospitals maintain and improve human health through diagnosis, treatment, and disease prevention. 

The proposal begins with an overview of Big Data Analytics in healthcare, which covers the benefits and challenges of BD and BDA in the healthcare industry.  The overview also covers the various healthcare data sources for analytics, which come in different formats such as semi-structured (e.g., XML and JSON) and unstructured (e.g., images and X-rays).  The second section presents the healthcare BDA design proposal using Hadoop and covers several components.  The first component discusses the requirements for this design, which include state-of-the-art technologies such as Hadoop/MapReduce, Spark, NoSQL databases, Artificial Intelligence (AI), and the Internet of Things (IoT).  The proposal also includes several diagrams: the data flow diagram, a communication flow chart, and the overall system diagram.  The healthcare system design is bounded by regulations, policies, and governance such as HIPAA, which are also covered in this project.  The justification, limitations, and assumptions are discussed as well.

Big Data Analytics in Healthcare Overview

BD and BDA are terms that have been used interchangeably and described as the next frontier for innovation, competition, and productivity (Maltby, 2011; Manyika et al., 2011).  BD is characterized by a multi-V model: volume refers to the large size of datasets, velocity to the speed of data generation and computation, and variety to the different data types such as semi-structured and unstructured data (Assunção, Calheiros, Bianchi, Netto, & Buyya, 2015; Hu, Wen, Chua, & Li, 2014).  Various industries, including healthcare, have seized this opportunity and applied BD and BDA in their business models (Manyika et al., 2011).  The McKinsey Global Institute estimated a potential annual value of $300 billion to US healthcare (Manyika et al., 2011).  

The healthcare industry generates extensive data driven by keeping patients' records, complying with regulations and policies, and delivering patient care (Raghupathi & Raghupathi, 2014).  The current trend is to digitize this explosively growing data in the age of Big Data (BD) and Big Data Analytics (BDA) (Raghupathi & Raghupathi, 2014).  BDA has revolutionized healthcare by transforming data into valuable information and knowledge used to predict epidemics, cure diseases, improve quality of life, and avoid preventable deaths (Van-Dai, Chuan-Ming, & Nkabinde, 2016).  Applications of BDA in healthcare include pervasive health, fraud detection, pharmaceutical discovery, clinical decision support systems, computer-aided diagnosis, and biomedical applications. 

Healthcare Big Data Benefits and Challenges

            The healthcare sector employs BDA in various aspects of care, such as detecting diseases at early stages, providing evidence-based medicine, minimizing medication doses to avoid side effects, and delivering effective medicine based on genetic analysis.  The use of BD and BDA can reduce the re-admission rate and thereby lower healthcare-related costs for patients.  Healthcare BDA can also detect spreading diseases earlier, before an outbreak takes hold, using real-time analytics (Archenaa & Anita, 2015; Raghupathi & Raghupathi, 2014; Wang, Kung, & Byrd, 2018).   An example of BDA applied in a healthcare system is Kaiser Permanente's HealthConnect, which ensures data exchange across all medical facilities and promotes the use of electronic health records (Fox & Vaidyanathan, 2016).

            Despite the various benefits of BD and BDA in the healthcare sector, several challenges and issues emerge from their application.  The nature of the healthcare industry itself poses challenges to BDA (Groves, Kayyali, Knott, & Kuiken, 2016).  The episodic culture, the data puddles, and IT leadership are the three most significant obstacles to applying BDA in healthcare.  The episodic culture refers to the conservative culture of healthcare and the lack of an IT mindset, which together create a rigid culture; only a few providers have overcome this rigidity and started to use BDA technology.  The data puddles reflect the silo nature of healthcare.  Silos are described as one of the most significant flaws in the healthcare sector (Wicklund, 2014): each silo uses its own methods to collect data from labs, diagnosis, radiology, emergency, case management, and so forth, and the lack of proper use of technology leaves the industry behind other sectors.  IT leadership is another challenge, caused by the rigid culture of the industry; the lack of familiarity with the latest technologies among healthcare IT leaders is a severe problem. 

Healthcare Data Sources for Data Analytics

            The current healthcare data are collected from clinical and non-clinical sources (InformationBuilders, 2018; Van-Dai et al., 2016; Zia & Khan, 2017).  Electronic health records are digital copies of patients' medical histories.  They contain a variety of data relevant to patient care, such as demographics, medical problems, medications, body mass index, medical history, laboratory test data, radiology reports, clinical notes, and payment information.  These electronic health records are the most important data in healthcare analytics because they provide effective and efficient methods for providers and organizations to share data (Botta, de Donato, Persico, & Pescapé, 2016; Palanisamy & Thirunavukarasu, 2017; Van-Dai et al., 2016; Wang et al., 2018).  

Biomedical imaging data play a crucial role in healthcare by aiding disease monitoring, treatment planning, and prognosis.  These data can be used to generate quantitative information and to draw inferences from images that provide insight into a medical condition.  Image analytics is more complicated because of the noise associated with the images, which is one of the significant limitations of biomedical image analysis (Ji, Ganchev, O’Droma, Zhang, & Zhang, 2014; Malik & Sangwan, 2015; Van-Dai et al., 2016). 

Sensing data are ubiquitous in the medical domain, both for real-time and for historical analysis.  Sensing data come from medical data-collection instruments such as the electrocardiogram (ECG) and the electroencephalogram (EEG), vital sensors that collect signals from various parts of the human body.  Sensing data play a significant role in intensive care units (ICUs) and in real-time remote monitoring of patients with specific conditions such as diabetes or high blood pressure.  Real-time and long-term analysis of trends and treatments in remote monitoring programs can help providers track the state of those patients (Van-Dai et al., 2016). 

Biomedical signals are collected from many sources such as the heart, blood pressure, oxygen saturation levels, blood glucose, nerve conduction, and brain activity.  Examples include the electroneurogram (ENG), electromyogram (EMG), electrocardiogram (ECG), electroencephalogram (EEG), electrogastrogram (EGG), and phonocardiogram (PCG).  Real-time analytics of biomedical signals will provide better management of chronic diseases, earlier detection of adverse events such as heart attacks and strokes, and earlier diagnosis of disease.   These signals can be discrete or continuous depending on the kind of care or the severity of a particular pathological condition (Malik & Sangwan, 2015; Van-Dai et al., 2016).

Genomic data analysis helps researchers better understand the relationships among genes, mutations, and disease conditions.  It has great potential for the development of gene therapies to cure certain conditions.  Furthermore, genomic data analytics can assist in translating genetic discoveries into personalized medicine practice (Liang & Kelemen, 2016; Luo, Wu, Gopukumar, & Zhao, 2016; Palanisamy & Thirunavukarasu, 2017; Van-Dai et al., 2016).

Clinical text analytics uses data mining to transform the information in clinical notes, stored in unstructured formats, into useful patterns.  Manual coding of clinical notes is costly and time-consuming because of their unstructured nature, heterogeneity, and differing formats and contexts across patients and practitioners.  Methods such as natural language processing (NLP) and information retrieval can extract useful knowledge from large volumes of clinical text and automatically encode clinical information in a timely manner (Ghani, Zheng, Wei, & Friedman, 2014; Sun & Reddy, 2013; Van-Dai et al., 2016).

Social network healthcare data analytics draws on various social media sources, such as social networking sites (e.g., Facebook and Twitter) and web logs, to discover new patterns and knowledge that can be leveraged to model and predict global health trends such as outbreaks of infectious epidemics (InformationBuilders, 2018; Luo et al., 2016; Van-Dai et al., 2016; Zia & Khan, 2017). Figure 1 shows a summary of these healthcare data sources.


Figure 1.  Healthcare Data Sources.

Healthcare Big Data Analytics Design Proposal Using Hadoop

            The implementation of BDA in the hospitals within the four States aims to improve patient safety and clinical outcomes and to promote wellness and disease management (Alexandru et al., 2016; HIMSS, 2018).  The BDA system will take advantage of the large volume of healthcare-generated data to support applied analytical disciplines across the statistical, contextual, quantitative, predictive, and cognitive spectrums (Alexandru et al., 2016; HIMSS, 2018).  These analytical disciplines will drive fact-based decision making for planning, management, and learning in the hospitals (Alexandru et al., 2016; HIMSS, 2018). 

            The proposal begins with the requirements, followed by the data flow diagram, the communication flowchart, and the overall system diagram.  The proposal then addresses the regulations, policies, and governance for the medical system.  The limitations and assumptions are also addressed, followed by the justification for the overall design.

1.      Basic Design Requirements

The basic requirements for the implementation of this proposal include not only the tools and required software but also training at all levels, from staff to nurses, clinicians, and patients.  The requirements are divided into system requirements, implementation requirements, and training requirements. 

1.1 Cloud Computing Technology Adoption Requirement

Volume is one of the defining characteristics of BD, especially in the healthcare industry (Manyika et al., 2011).  Given the challenges addressed earlier in dealing with BD and BDA in healthcare, the system requirements cannot be met with a traditional on-premise data center, which can handle neither the intensive computation requirements of BD nor the storage requirements for the medical information from the hospitals across the four States (Hu et al., 2014).  The cloud computing environment is therefore the more appropriate solution for the implementation of this proposal.  Cloud computing plays a significant role in BDA (Assunção et al., 2015), and the massive computation and storage requirements of BDA create a critical need for this emerging technology (Mehmood, Natgunanathan, Xiang, Hua, & Guo, 2016).  Cloud computing offers various benefits such as cost reduction, elasticity, pay-per-use pricing, availability, reliability, and maintainability (Gupta, Gupta, & Mohania, 2012; Kritikos, Kirkham, Kryza, & Massonet, 2017).  However, it also has security and privacy issues under the standard deployment models of public, private, hybrid, and community clouds.  Thus, one of the major requirements is to adopt the Virtual Private Cloud (VPC), which has been regarded as the most prominent approach to trusted computing (Abdul, Jena, Prasad, & Balraju, 2014).

1.2 Security Requirement

Cloud computing has been facing various threats (Cloud Security Alliance, 2013, 2016, 2017).   Records show that over the three years from 2015 through 2017, the number of breaches, lost medical records, and fine settlements was staggering (Thompson, 2017).  The Office for Civil Rights (OCR) issued 22 resolution agreements requiring monetary settlements approaching $36 million (Thompson, 2017).  Table 1 shows the data categories and the totals for each year. 

Table 1.  Approximation of Records Lost by Category Disclosed on HHS.gov (Thompson, 2017)

Furthermore, a recent report showed that in the first three months of 2018, 77 healthcare data breaches were reported to the OCR (HIPAA, 2018d).  In the second quarter of 2018, at least 3.14 million healthcare records were exposed (HIPAA, 2018a), and in the third quarter, 4.39 million records were exposed in 117 breaches (HIPAA, 2018c).

Thus, protecting patients' private information requires technology to extract, analyze, and correlate potentially sensitive datasets (HIPAA, 2018b).  The implementation of BDA requires security measures and safeguards to protect patient privacy (HIPAA, 2018b).  Sensitive data should be encrypted to prevent exposure in the event of theft (Abernathy & McMillan, 2016).  The security requirements apply both to the VPC cloud deployment model and to the local hospitals in each State (Regola & Chawla, 2013).  Security in the VPC should include security groups and network access control lists so that only the right individuals can reach the right applications and patient records.  A security group in a VPC acts as the first line of defense, a firewall for the associated instances (McKelvey, Curran, Gordon, Devlin, & Johnston, 2015).  Network access control lists act as the second layer of defense, a firewall for the associated subnets that controls inbound and outbound traffic at the subnet level (McKelvey et al., 2015). 
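
As an illustration of how these two layers of defense might be provisioned in an AWS-style VPC, the sketch below uses the boto3 SDK to create a security group that admits only HTTPS from an assumed hospital network range, plus a matching network ACL rule at the subnet level.  The region, VPC ID, CIDR block, and port are placeholders, not values mandated by this proposal.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")  # placeholder region

# First line of defense: a security group scoped to the healthcare VPC.
sg = ec2.create_security_group(
    GroupName="ehr-app-servers",
    Description="Allow HTTPS only from the hospital network",
    VpcId="vpc-0123456789abcdef0",                   # placeholder VPC ID
)
ec2.authorize_security_group_ingress(
    GroupId=sg["GroupId"],
    IpPermissions=[{
        "IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
        "IpRanges": [{"CidrIp": "10.0.0.0/16"}],     # assumed hospital network range
    }],
)

# Second line of defense: a network ACL rule controlling traffic at the subnet level.
acl = ec2.create_network_acl(VpcId="vpc-0123456789abcdef0")
ec2.create_network_acl_entry(
    NetworkAclId=acl["NetworkAcl"]["NetworkAclId"],
    RuleNumber=100, Protocol="6", RuleAction="allow", Egress=False,
    CidrBlock="10.0.0.0/16", PortRange={"From": 443, "To": 443},
)
```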

Security at the local hospital level in each State is mandatory to protect patient records and to comply with HIPAA regulations (Regola & Chawla, 2013).  Medical equipment must be secured with authentication and authorization techniques so that only medical staff, nurses, and clinicians have access to medical devices according to their roles.  General access should be prohibited, because every member of the hospital has a different role with different responsibilities.  Encryption should be used to hide the meaning or intent of a communication from unintended users (Stewart, Chapple, & Gibson, 2015).   Encryption is an essential security control, especially for data in transit (Stewart et al., 2015).  The hospitals in all four States should implement the same types of encryption controls, such as PKI, cryptographic applications, and symmetric-key algorithms (Stewart et al., 2015).

The system requirements should also include identity management systems that interoperate across the hospitals in each State.  An identity management system provides authentication and authorization so that only those who should have access to patients' medical records can obtain it.  The proposal requires encryption protocols such as Secure Sockets Layer (SSL), Transport Layer Security (TLS), and Internet Protocol Security (IPSec) to protect information transferred over public networks (Zhang, R. & Liu, 2010).  
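
For data in transit, a minimal sketch of a TLS-protected connection using Python's standard ssl module is shown below; the hostname is a placeholder, and certificate management is assumed to be handled by the hospitals' existing PKI.

```python
import socket
import ssl

# Minimal TLS client sketch for protecting records in transit (hostname is a placeholder).
context = ssl.create_default_context()            # enables certificate and hostname checks
context.minimum_version = ssl.TLSVersion.TLSv1_2  # refuse legacy protocol versions

with socket.create_connection(("ehr.example-hospital.org", 443)) as raw_sock:
    with context.wrap_socket(raw_sock, server_hostname="ehr.example-hospital.org") as tls_sock:
        print("Negotiated protocol:", tls_sock.version())
```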

1.3 Hadoop Implementation for Data Stream Processing Requirement

While the velocity of BD refers to the speed at which large volumes of data are generated and must be processed (Hu et al., 2014), the variety of the data requires technology capable of handling structured, semi-structured, and unstructured datasets (Bansal, Deshpande, Ghare, Dhikale, & Bodkhe, 2014; Hu et al., 2014).  The Hadoop ecosystem is the most appropriate system for implementing BDA under these conditions (Bansal et al., 2014; Dhotre, Shimpi, Suryawanshi, & Sanghati, 2015).  The implementation requirements include various technologies and tools; this section covers the components required when implementing Hadoop for the healthcare BDA system in the four States.

Hadoop has three significant limitations that must be addressed in this design.  The first is the lack of technical support and documentation for open-source Hadoop (Guo, 2013).   This design therefore requires an enterprise edition of Hadoop from a vendor such as Cloudera, Hortonworks, or MapR (Guo, 2013); the final product decision will be made by the cost analysis team.  The second limitation is that Hadoop is not optimal for real-time data processing (Guo, 2013).  The solution is to integrate a real-time streaming framework such as Spark, Storm, or Kafka (Guo, 2013; Palanisamy & Thirunavukarasu, 2017); the Spark integration is discussed below as a separate requirement (Guo, 2013).  The third limitation is that Hadoop is not a good fit for large graph datasets (Guo, 2013).  The solution is to integrate GraphLab, which is also discussed below as a separate requirement.

1.3.1 Hadoop Ecosystem for Data Processing

Hadoop technologies have been front-runners for Big Data applications (Bansal et al., 2014; Chrimes, Zamani, Moa, & Kuo, 2018).  The Hadoop ecosystem will be part of the implementation requirement because it has proven to serve intensive computation over large datasets well (Raghupathi & Raghupathi, 2014; Wang et al., 2018).   The implementation of Hadoop will be performed in the VPC deployment model.  The required Hadoop version is 2.x, which includes YARN for resource management (Karanth, 2014).  Hadoop 2.x also includes HDFS snapshots, which provide a read-only image of an entire filesystem or a subset of it for protection against user error, backup, and disaster recovery (Karanth, 2014).  The Hadoop platform can be implemented to gain insight into various areas (Raghupathi & Raghupathi, 2014; Wang et al., 2018).  The Hadoop ecosystem comprises the Hadoop Distributed File System (HDFS), MapReduce, and data stores such as HBase and Hive, handling large volumes of data with various algorithms and machine learning to extract value from structured, semi-structured, and unstructured medical records (Raghupathi & Raghupathi, 2014; Wang et al., 2018).  Other supporting components include Oozie for workflow, Pig for scripting, and Mahout for machine learning, which is part of artificial intelligence (AI) (Ankam, 2016; Karanth, 2014).  The ecosystem will also include Flume for log collection, Sqoop for data exchange, and ZooKeeper for coordination (Ankam, 2016; Karanth, 2014).  HCatalog is required to manage metadata in Hadoop (Ankam, 2016; Karanth, 2014).   Figure 2 shows the Hadoop ecosystem before Spark is integrated for real-time analytics.


Figure 2.  Hadoop Architecture Overview (Alguliyev & Imamverdiyev, 2014).
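
Because Hadoop Streaming lets MapReduce jobs be written in any executable language, the following minimal sketch illustrates the kind of job the proposed cluster could run: a mapper and a reducer that count admissions per diagnosis code.  The tab-separated input layout (diagnosis code in the third field) is an assumption made for illustration, not part of the hospitals' actual record format.

```python
#!/usr/bin/env python3
# mapper.py -- emits (diagnosis_code, 1) for each admission record read from stdin.
# Assumes tab-separated input where the third field is a diagnosis code.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) >= 3:
        print(f"{fields[2]}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- sums the counts per diagnosis code; Hadoop delivers input sorted by key.
import sys

current_code, count = None, 0
for line in sys.stdin:
    code, value = line.rstrip("\n").split("\t")
    if code != current_code:
        if current_code is not None:
            print(f"{current_code}\t{count}")
        current_code, count = code, 0
    count += int(value)
if current_code is not None:
    print(f"{current_code}\t{count}")
```

A job like this could be submitted with a command along the lines of hadoop jar .../hadoop-streaming-*.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input <admissions path> -output <counts path>, with the exact jar location depending on the chosen Hadoop distribution.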

1.3.2 Hadoop-specific File Format for Splittable and Agnostic Compression

The ability to split files plays a significant role in data processing (Grover, Malaska, Seidman, & Shapira, 2015).  Therefore, Hadoop-specific file formats such as SequenceFile, serialization formats like Avro, and columnar formats such as RCFile and Parquet should be used, because they share two characteristics that are essential for Hadoop applications: splittable compression and agnostic compression (Grover et al., 2015).  Splittability allows large files to be divided for input to MapReduce and other types of jobs, which is required for parallel processing and is key to leveraging Hadoop's data locality feature (Grover et al., 2015).  Agnostic compression means data can be compressed with any codec without readers needing to know which one, because the codec is stored in the header metadata of the file format (Grover et al., 2015).  Figure 3 summarizes the three Hadoop file types with the two common characteristics.  


Figure 3. Three Hadoop File Types with the Two Common Characteristics.  
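
As a small illustration of a splittable, compression-agnostic columnar format, the sketch below writes a few mock lab-result rows to a Parquet file with Snappy compression using the pyarrow library; the column names are illustrative only, and the codec is recorded in the file metadata so readers need not know it in advance.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Mock lab-result records; field names are illustrative, not a prescribed schema.
table = pa.table({
    "patient_id": ["P001", "P002", "P003"],
    "test_code":  ["HbA1c", "LDL", "HbA1c"],
    "result":     [6.1, 130.0, 5.4],
})

# The compression codec is stored in the file metadata alongside the data.
pq.write_table(table, "lab_results.parquet", compression="snappy")

print(pq.read_table("lab_results.parquet").to_pydict())
```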

1.3.3 XML and JSON Use in Hadoop

Clinical data include semi-structured formats such as XML and JSON.  Splitting XML and JSON is not straightforward and presents unique challenges in Hadoop, since Hadoop does not provide a built-in InputFormat for either format (Grover et al., 2015).  Furthermore, JSON presents more challenges than XML because no token marks the beginning or end of a record (Grover et al., 2015).  When using these file formats, two primary considerations apply.  First, a container format such as Avro should be used, because Avro provides a compact and efficient way to store and process the data once it has been transformed into Avro (Grover et al., 2015).  Second, a library for processing XML or JSON should be used; XMLLoader in Pig's PiggyBank library is an example for XML, and the Elephant Bird project is an example for JSON (Grover et al., 2015). 
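
Before such records reach a container format like Avro, each XML document has to be parsed into discrete fields.  The sketch below uses Python's standard ElementTree module on a simplified patient record; the element names are illustrative and do not reproduce the HL7 schema.

```python
import xml.etree.ElementTree as ET

# Simplified patient record; element names are illustrative, not the HL7 CDA schema.
xml_record = """
<patient>
  <id>P001</id>
  <name>Jane Doe</name>
  <observation code="HbA1c" value="6.1" units="%"/>
</patient>
"""

root = ET.fromstring(xml_record)
record = {
    "patient_id": root.findtext("id"),
    "name": root.findtext("name"),
    "observations": [obs.attrib for obs in root.findall("observation")],
}
print(record)  # dict ready to be serialized to Avro or loaded into MongoDB
```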

1.4 HBase and MongoDB NoSQL Database Integration Requirement

In the age of BD and BDA, traditional data stores are inadequate to handle not only the large volume of data but also the varied data formats, such as unstructured and semi-structured data (Hu et al., 2014).   Not Only SQL (NoSQL) databases emerged to meet the requirements of BDA.  These NoSQL data stores are used for modern, scalable databases (Sahafizadeh & Nematbakhsh, 2015).  Their scalability enables the system to increase throughput as demand rises during data processing (Sahafizadeh & Nematbakhsh, 2015).  The platform can incorporate two types of scalability to support large datasets: horizontal and vertical.  Horizontal scaling distributes the workload across many servers and nodes to increase throughput, while vertical scaling requires more processors, more memory, and faster hardware on a single server (Sahafizadeh & Nematbakhsh, 2015). 

NoSQL data stores come in many varieties, including MongoDB, CouchDB, Redis, Voldemort, Cassandra, Big Table, Riak, HBase, Hypertable, ZooKeeper, Vertica, Neo4j, db4o, and DynamoDB.  These data stores fall into four categories: document-oriented, column-oriented (column-family), graph, and key-value (EMC, 2015; Hashem et al., 2015).  A document-oriented data store can store and retrieve collections of data and documents using complex data forms in formats such as XML and JSON as well as PDF and MS Word (EMC, 2015; Hashem et al., 2015); MongoDB and CouchDB are examples (EMC, 2015; Hashem et al., 2015).  A column-oriented data store keeps content in columns rather than rows, with the attributes of a column stored contiguously (Hashem et al., 2015); this type can store and render blog entries, tags, and feedback (Hashem et al., 2015), and Cassandra, DynamoDB, and HBase are examples (EMC, 2015; Hashem et al., 2015).  A key-value store can hold and scale large volumes of data, with a key used to access each value (EMC, 2015; Hashem et al., 2015); the value can be complex, and this type is useful for storing a user's login ID as the key referencing patient values.  Redis and Riak are examples of key-value NoSQL stores (Alexandru et al., 2016).  A graph NoSQL database stores and represents data using graph models of nodes, edges, and properties related to one another, which is useful for unstructured medical data such as images and lab results; Neo4j is an example (Hashem et al., 2015).  Each of these NoSQL data stores has its own advantages and limitations.  Figure 4 summarizes these NoSQL data store types, the data they store, and examples.

Figure 4.  Big Data Analytics NoSQL Data Store Types.

The proposed design requires one or more NoSQL data stores to meet the BDA requirements of this healthcare system in the Hadoop environment.  Healthcare big data has unique characteristics that must be considered when selecting the data store, and the various data types must be accommodated.   HBase and HDFS are the most commonly used storage managers in the Hadoop environment (Grover et al., 2015).  HBase is a column-oriented data store that will be used to store multi-structured data (Archenaa & Anita, 2015); it sits on top of HDFS in the Hadoop ecosystem (Raghupathi & Raghupathi, 2014).   

MongoDB will also be used to store semi-structured datasets such as XML and JSON, as well as metadata for the HBase data schema to improve the schema's accessibility and readability (Luo et al., 2016).  Riak will be used for key-value datasets, such as the dictionaries, hash tables, and associative arrays that hold login and user ID information for patients, providers, and clinicians (Klein et al., 2015).  Neo4j will be used to store images, such as lab images and X-rays, as nodes and edges (Alexandru et al., 2016).
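
As a small illustration of the metadata role MongoDB plays here, the sketch below stores a document describing an HBase table's column families using the pymongo client; the host, database, and field names are placeholders rather than a prescribed schema.

```python
from pymongo import MongoClient

# Placeholder connection string; database, collection, and fields are illustrative only.
client = MongoClient("mongodb://mongo.example.org:27017")
schema_catalog = client["healthcare"]["hbase_schema"]

schema_catalog.insert_one({
    "table": "patient_vitals",
    "column_family": "signals",
    "columns": ["heart_rate", "spo2", "blood_pressure"],
    "description": "Per-patient physiological signal time series",
})

# Later, applications can look up how the HBase table is laid out.
print(schema_catalog.find_one({"table": "patient_vitals"}))
```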

The proposed healthcare system has a logical data model and query patterns that must be supported by the NoSQL databases (Klein et al., 2015).  In the data model, reading a patient's medical test results is a core function used to populate the user interface.  The model also requires strong replica consistency when a new medical result is written for a patient, because providers make patient care decisions using these records.  All providers will see the same information within the hospital systems across the four States, whether they are at the same site as the patient or providing telemedicine support from another location. 

The logical data model includes mapping the application-specific model onto the particular data model, indexing, and query language capabilities of each database.  The HL7 Fast Healthcare Interoperability Resources (FHIR) standard is used as the logical data model for records analysis.  Patient demographic data such as names, addresses, and telephone numbers will be modeled using the FHIR Patient resource, and test results, including result quantity and result units, will be modeled with the corresponding FHIR resources (Klein et al., 2015). 
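
For illustration, a minimal FHIR Patient resource of the kind the logical data model would map might look like the following Python dictionary; all values are fictional.

```python
# Minimal, fictional FHIR Patient resource used to illustrate the logical data model.
patient_resource = {
    "resourceType": "Patient",
    "id": "example-P001",
    "name": [{"family": "Doe", "given": ["Jane"]}],
    "gender": "female",
    "birthDate": "1980-04-12",
    "address": [{"city": "Denver", "state": "CO", "postalCode": "80203"}],
    "telecom": [{"system": "phone", "value": "555-0100"}],
}
```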

1.5 Spark Integration for Real-Time Data Processing Requirement

While the Hadoop ecosystem architecture has been designed for scenarios such as data storage, data management, statistical analysis, statistical association across data sources, distributed computing, and batch processing, this proposal requires real-time data processing, which cannot be met by Hadoop alone (Basu, 2014).  Real-time analytics will add tremendous value to the proposed healthcare system.  Thus, Apache Spark is another required component of this proposal (Basu, 2014).  Spark allows in-memory processing for fast response times, bypassing MapReduce operations (Basu, 2014).  With Spark integrated into Hadoop, stream processing, machine learning, interactive analytics, and data integration become possible (Scott, 2015).  Spark will run on top of Hadoop to benefit from YARN and the underlying storage of HDFS, HBase, and the other Hadoop ecosystem building blocks (Scott, 2015).  Figure 5 shows Spark's core engines.


Figure 5. Spark Core Engines (Scott, 2015).
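
As an illustration of the kind of batch job this integration enables, the minimal PySpark sketch below reads Parquet-formatted lab results from HDFS and aggregates them in memory on the YARN-managed cluster; the HDFS path and column names are placeholders rather than part of the proposed schema.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Assumes the job is submitted on the Hadoop cluster; path and columns are placeholders.
spark = (SparkSession.builder
         .appName("lab-results-summary")
         .master("yarn")
         .getOrCreate())

labs = spark.read.parquet("hdfs:///healthcare/lab_results")
summary = labs.groupBy("test_code").agg(F.avg("result").alias("avg_result"))
summary.show()

spark.stop()
```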

1.6 Big Healthcare Data Visualization Requirement

Visualization is one of the most powerful ways to present data (Jayasingh, Patra, & Mahesh, 2016).  It helps in viewing data more meaningfully in the form of graphs, images, and pie charts that can be understood easily, and it helps in synthesizing large volumes of data, such as healthcare data, to get at the core of the raw big data and convey its key points for insight (Meyer, 2018).  Commercial visualization tools include Tableau, Spotfire, QlikView, and Adobe Illustrator; the tools most commonly used in healthcare are Tableau, PowerBI, and QlikView.  This healthcare design proposal will utilize Tableau. 

Healthcare providers are successfully transforming data from information to insight using Tableau software.  Healthcare organizations can take three approaches to get more from their datasets.  The first is to broaden data access by empowering healthcare departments to explore their own data.  The second is to uncover answers by combining data from multiple systems to reveal trends and outliers.  The third is to share insights with executives, providers, and others to drive collaboration (Tableau, 2011).  Tableau has several advantages, including interactive visualization with drag-and-drop techniques, the ability to handle large amounts of data and millions of rows with ease, and integration with scripting languages such as Python (absentdata.com, 2018).  It also provides mobile support and responsive dashboards.  Its limitations are that it requires substantial training to master fully, lacks automatic refreshing and conditional formatting, and has a 16-column table limit (absentdata.com, 2018).   Figure 6 shows a Patient Cycle Time data visualization built with Tableau.


Figure 6. Patient Cycle Time Data Visualization Example (Tableau, 2011).

1.7 Artificial Intelligence Integration Requirement

Artificial Intelligence (AI) is a computational technique that allows machines to perform cognitive functions, such as acting or reacting to input, in ways similar to humans (Patrizio, 2018).  Traditional computing applications react to data, but their reactions and responses must be hand-coded with human intervention (Patrizio, 2018).  AI systems, in contrast, continuously change their behavior to accommodate changes in results and modify their reactions accordingly (Patrizio, 2018).  AI techniques include video recognition, natural language processing, speech recognition, machine learning engines, and automation (Mills, 2018).

The healthcare system can benefit from integrating BDA with AI (Bresnick, 2018).  Because AI can play a significant role in healthcare BDA, this proposal suggests implementing machine learning, a branch of AI, to deploy more precise and impactful interventions at the right time in patient care (Bresnick, 2018).  The application of AI in the proposed design requires machine learning (Patrizio, 2018).  Since the data used for AI and machine learning have already been cleaned of duplicates and unnecessary records, AI can take advantage of these filtered data, leading to healthcare breakthroughs such as genomic and proteomic experiments that enable personalized medicine (Kersting & Meyer, 2018).

The healthcare industry has been utilizing AI, machine learning (ML), and data mining (DM) to extract value from BD by transforming large medical datasets into actionable knowledge through predictive and prescriptive analytics (Palanisamy & Thirunavukarasu, 2017).   ML will be used to develop sophisticated algorithms that process massive medical datasets, including structured, unstructured, and semi-structured data, and perform advanced analytics (Palanisamy & Thirunavukarasu, 2017).  Apache Mahout, an open-source ML library, will be integrated with Hadoop to run scalable machine learning algorithms offering techniques such as recommendation, classification, and clustering (Palanisamy & Thirunavukarasu, 2017).
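
As a hedged illustration of the kind of supervised model the ML layer might train, whether in Mahout on the cluster or in a Python environment, the sketch below fits a logistic-regression classifier for readmission risk on synthetic data; the features and labels are fabricated solely to make the example runnable and do not represent real patient data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic illustration only: in practice, features might be age, length of stay,
# prior admissions, and similar attributes drawn from the electronic health records.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=500) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
print("Held-out accuracy:", model.score(X_test, y_test))
```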

1.8 Internet of Things (IoT) Integration Requirement

The Internet of Things (IoT) refers to the growing number of connected devices with IP addresses, which were uncommon years ago (Anand & Clarice, 2015; Thompson, 2017).  These connected devices collect data and use their IP addresses to transmit information (Thompson, 2017).  Healthcare providers take advantage of the collected information to find new treatment methods and increase efficiency (Thompson, 2017).

The implementation of IoT will involve various technologies, including radio frequency identification (RFID), near field communication (NFC), machine-to-machine (M2M) communication, wireless sensor networks (WSN), and addressing schemes such as IPv6 addresses (Anand & Clarice, 2015; Kumari, 2017).  IoT also requires machine learning and algorithms to find patterns, correlations, and anomalies that have the potential to enable healthcare improvements (O’Brien, 2016).  Machine learning is a critical component of artificial intelligence; thus, the success of IoT depends on the AI implementation. 
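
To illustrate how a connected monitoring device might forward a reading into the analytics pipeline, the sketch below publishes a blood-pressure measurement over MQTT using the paho-mqtt client; MQTT is one of several possible transports rather than a requirement of this design, and the broker address, topic, and payload fields are placeholders.

```python
import json
from paho.mqtt import publish

# Placeholder payload fields; illustrative only, not the device's actual record format.
reading = {
    "patient_id": "P001",
    "systolic": 128,
    "diastolic": 82,
    "timestamp": "2018-10-01T08:00:00Z",
}

# publish.single handles connect, publish, and disconnect in one call.
publish.single(
    topic="wards/3/blood_pressure",
    payload=json.dumps(reading),
    qos=1,
    hostname="iot-broker.example-hospital.org",  # placeholder broker address
)
```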

1.9 Training Requirement

This design proposal requires training for IT professionals, providers, clinicians, and everyone else who will use this healthcare ecosystem, tailored to their roles (Alexandru et al., 2016; Archenaa & Anita, 2015).  Each component of the ecosystem should have its own training, such as training for Hadoop/MapReduce, Spark, security, and so forth.  Training will play a significant role in the success of implementing BD and BDA in the healthcare system of the four States of Colorado, Utah, Arizona, and New Mexico.   Patients should also receive training for remote monitoring programs such as blood sugar and blood pressure monitoring applications.  The senior generation might face some challenges; however, with technical support, these challenges can be alleviated.

2.      Data Flow Diagram

            This section discusses the data flow for the proposed healthcare ecosystem design for the application of BDA. 

2.1 HBase Cluster and HDFS Data Flow

HBase stores data in table schemas with specified column families (Yang, Liu, Hsu, Lu, & Chu, 2013).  The table schema and column families must be predefined, but new columns can be added to families as required, making the schema flexible and adaptable to changing application requirements (Yang et al., 2013).   HBase is built in a way similar to HDFS, with a NameNode and slave nodes, and to MapReduce, with a JobTracker and TaskTracker slaves (Yang et al., 2013).  HBase will play a vital role in the Hadoop cluster environment.  In HBase, a master node called the HMaster manages the cluster, while region servers store portions of the tables and perform the work on the data.  The HMaster is responsible for monitoring all RegionServer instances in the cluster and is the interface for all metadata changes; it runs on the NameNode in the distributed Hadoop cluster.  The HRegionServer serves and manages regions and runs on a DataNode in the distributed Hadoop cluster.   ZooKeeper assists by electing another machine in the cluster as HMaster in case of a failure, unlike the HDFS framework, where the NameNode is a single point of failure.  The data flow between the DataNodes and the NameNode when HBase is integrated on top of HDFS is shown in Figure 7.  


Figure 7.  HBase Cluster Data Flow (Yang et al., 2013).
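
As a minimal sketch of how the predefined table schema and column families described above might be created and written through the HBase Thrift interface, the snippet below uses the happybase Python client; the host, table, and column-family names are placeholders.

```python
import happybase

# Requires the HBase Thrift service; host, table, and family names are placeholders.
connection = happybase.Connection("hbase-master.example.org")

# Column families must be declared up front; columns within them can be added freely later.
connection.create_table(
    "patient_vitals",
    {"signals": dict(max_versions=3),   # time-series physiological signals
     "meta": dict()},                   # lightweight per-row annotations
)

table = connection.table("patient_vitals")
table.put(b"P001-20180101T0800", {b"signals:heart_rate": b"72"})

# Scan all rows for one patient using a row-key prefix.
for key, data in table.scan(row_prefix=b"P001"):
    print(key, data)
```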

2.2 HBase and MongoDB with Hadoop/MapReduce and HDFS Data Flow

The healthcare system integrates four significant components: HBase, MongoDB, MapReduce, and visualization.  HBase is used for data storage, MongoDB for metadata, Hadoop MapReduce for computation, and a visualization tool for presenting results.  The signal data will be stored in HBase, while the metadata and other clinical data will be stored in MongoDB.  The data stored in both HBase and MongoDB will be accessible from the Hadoop/MapReduce environment for processing as well as from the data visualization layer.   The cluster will consist of one master node, eight slave nodes, and several supporting servers.   The data will be imported into Hadoop and processed via MapReduce, and the results of the computation will be viewed through a data visualization tool such as Tableau.  Figure 8 shows the data flow between these four components of the proposed healthcare ecosystem.


Figure 8.  The Proposed Data Flow Between Hadoop/MapReduce and Other Databases.

2.3 XML Design Flow Using ETL Process with MongoDB 

Healthcare records contain various types of data, from structured and semi-structured to unstructured (Luo et al., 2016).   Some of these records are XML-based, stored in a semi-structured format using tags; XML stands for eXtensible Markup Language (Fawcett, Ayers, & Quin, 2012).  The healthcare sector can derive value from these XML documents, which represent semi-structured data (Aravind & Agrawal, 2014).  An example of an XML-based patient record is shown in Figure 9.


Figure 9.  Example of the Patient’s Electronic Health Record (HL7, 2011)

XML-based records need to be ingested into the Hadoop system so that value can be derived from this semi-structured data for analytical purposes.   However, Hadoop does not offer a standard XML RecordReader (Lublinsky, Smith, & Yakubovich, 2013), even though XML is one of the standard file formats for MapReduce.  Various approaches can be used to process XML semi-structured data; in this design, an ETL (Extract, Transform, and Load) process handles XML data destined for Hadoop.  MongoDB, a NoSQL database required in this design proposal, handles the XML document-oriented type. 

The ETL process in MongoDB starts with extract and transform.  The MongoDB application provides the ability to map the XML elements within a document to the downstream data structure.  The application supports unwinding simple arrays or presenting embedded documents using appropriate data relationships such as one-to-one (1:1), one-to-many (1:M), or many-to-many (M:M) (MongoDB, 2018).  It infers schema information by examining a subset of documents within the target collections, and organizations can add fields to the discovered data model that may not have been present within the subset used for schema inference.  The application also infers information about the existing indexes for the collections to be queried and prompts or warns about queries that do not use any indexed fields.  It can return a subset of fields from documents using query projections.  For queries against MongoDB replica sets, the application supports specifying custom MongoDB read preferences for individual query operations.  The application then infers information about the sharded cluster deployment and notes the shard key fields for each sharded collection.  For queries against MongoDB sharded clusters, the application warns against queries that do not use proper query isolation, because broadcast queries in a sharded cluster can have a negative impact on database performance (MongoDB, 2018). 

The load process in MongoDB is performed after the extract and transform steps.  The application supports writing data to any MongoDB deployment, whether a single node, a replica set, or a sharded cluster.  For writes to a sharded cluster, the application informs the user or displays an error message if XML documents do not contain a shard key.  A custom WriteConcern can be used for any write operation against a running MongoDB deployment.  For bulk loading, documents can be written in batches using the insert method with MongoDB version 2.6 or above, which supports the bulk update database command.  For bulk loading into a sharded MongoDB deployment, bulk insert into a sharded collection is supported, including pre-splitting the collection's shard key and inserting via multiple mongos processes.   Figure 10 shows this ETL process for XML-based patient records using MongoDB.


Figure 10.  The Proposed XML ETL Process in MongoDB.
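
A minimal sketch of the load step is shown below, assuming the XML records have already been parsed into Python dictionaries during the extract and transform steps and that a MongoDB deployment is reachable at a placeholder address; the unordered bulk insert allows the load to continue past individual document errors.

```python
from pymongo import MongoClient
from pymongo.errors import BulkWriteError

# Placeholder connection string; parsed_records would come from the XML extract/transform step.
client = MongoClient("mongodb://mongo.example.org:27017")
collection = client["healthcare"]["patient_records"]

parsed_records = [
    {"patient_id": "P001", "name": "Jane Doe",
     "observations": [{"code": "HbA1c", "value": "6.1"}]},
    {"patient_id": "P002", "name": "John Roe",
     "observations": [{"code": "LDL", "value": "130"}]},
]

try:
    # ordered=False lets the bulk load continue past individual document errors.
    result = collection.insert_many(parsed_records, ordered=False)
    print("Inserted:", len(result.inserted_ids))
except BulkWriteError as err:
    print("Some documents failed to load:", err.details["writeErrors"])
```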

2.4 Real-Time Streaming Spark Data Flow

Real-time streaming can be implemented using a streaming framework such as Spark, Kafka, or Storm.  This healthcare design proposal will integrate the open-source Spark framework for real-time streaming data, such as sensing data from intensive care units, remote monitoring programs, and biomedical signals.  The data from these sources will flow into Spark for analytics and then be imported into the data storage systems.  Figure 11 illustrates the data flow for real-time streaming analytics.

Figure 11.  The Proposed Spark Data Flow.
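
The sketch below is a minimal PySpark Structured Streaming version of this flow.  It assumes vital-sign readings arrive as JSON lines over a socket (a stand-in for whatever ingest endpoint is ultimately chosen) and computes per-minute averages per patient; the host, port, and field names are placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("vitals-streaming").getOrCreate()

# Assumed message layout; real field names would come from the monitoring devices.
schema = StructType([
    StructField("patient_id", StringType()),
    StructField("heart_rate", DoubleType()),
    StructField("event_time", TimestampType()),
])

# Socket source is a placeholder for the real ingest endpoint (e.g., a message queue).
raw = (spark.readStream.format("socket")
       .option("host", "ingest.example-hospital.org").option("port", 9999).load())
vitals = raw.select(F.from_json(F.col("value"), schema).alias("v")).select("v.*")

# Per-minute average heart rate per patient, tolerating two minutes of late data.
averages = (vitals
            .withWatermark("event_time", "2 minutes")
            .groupBy(F.window("event_time", "1 minute"), "patient_id")
            .agg(F.avg("heart_rate").alias("avg_heart_rate")))

query = averages.writeStream.outputMode("update").format("console").start()
query.awaitTermination()
```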

3.      Communication Workflow

The communication flow involves the stakeholders in the healthcare system: providers, insurers, pharmaceutical vendors, IT professionals, and practitioners from the four States of Colorado, Utah, Arizona, and New Mexico.  The flow is centered on the patient-centric healthcare system, hosted in the cloud, which serves as the central point of communication.  Patients communicate with the central system through web-based platforms and clinical forums as needed.  Providers communicate with the patient-centric healthcare system through resource usage, patient feedback, hospital visits, and service details.  Insurers communicate with the central system through claims databases and census and societal data.  Pharmaceutical vendors communicate through prescription and drug reports, which providers can retrieve from anywhere in the four States.  IT professionals and practitioners communicate with the central system for data streaming, medical records, genomics, and other omics data analysis and reporting.  Figure 12 shows the communication flow between these stakeholders and the central system in the cloud, which can be accessed from any of the four States.

Figure 12.  The Proposed Patient-Centric Healthcare System Communication Flow.

4.      Overall System Diagram

The overall system represents the state-of-the-art healthcare ecosystem that utilizes the latest technology for healthcare Big Data Analytics.  The system is bounded by regulations and policies such as HIPAA to ensure the protection of patient privacy across its various layers.  Its integrated components include the latest Hadoop technology with MapReduce and HDFS.  The data governance layer at the bottom contains three major building blocks: master data management (MDM), data life-cycle management (DLM), and data security and privacy management.  The MDM component is responsible for data completeness, accuracy, and availability, while the DLM component is responsible for archiving data, maintaining the data warehouse, and handling data deletion and disposal.   The data security and privacy management building block is responsible for sensitive data discovery, vulnerability and configuration assessment, application of security policies, auditing and compliance reporting, activity monitoring, identity and access management, and data protection.  Above it sit the data layer, the data aggregation layer, the data analytics layer, and the information exploration layer.  The data layer covers data sources and content formats, while the data aggregation layer spans data acquisition, transformation engines, and the data storage area built on Hadoop, HDFS, and NoSQL databases such as MongoDB and HBase.  The data analytics layer involves Hadoop/MapReduce processing, stream computing, real-time streaming, and database analytics; AI and IoT are also part of this layer.  The information exploration layer comprises data visualization, visualization reporting, real-time monitoring through a healthcare dashboard, and clinical decision support.  Figure 13 illustrates the overall system diagram with these layers.


Figure 13.  The Proposed Healthcare Overall System Diagram.

5.      Regulations, Policies, and Governance for the Medical Industry

Healthcare data must be stored in secure storage to protect the information and privacy of patients (Liveri, Sarri, & Skouloudi, 2015).  When the healthcare industry fails to comply with regulations and policies, the resulting fines and costs can cause financial stress (Thompson, 2017).  Records show that the healthcare industry has paid millions of dollars in fines.  Advocate Health Care in suburban Chicago agreed to the largest settlement as of August 2016, a total of $5.55 million (Thompson, 2017), and Memorial Health System in southern Florida became the second entity to top $5 million (Thompson, 2017).  Table 2 shows the five largest fines posted to the Office for Civil Rights (OCR) site. 

Table 2.  Five Largest Fines Posted to OCR Web Site (Thompson, 2017)

The hospitals must adhere carefully to data privacy regulations and legislative rules to protect patients' medical records from data breaches (HIPAA).  Proper security policies and risk management must be implemented to ensure the protection of private information and to minimize the impact of any loss or theft of confidential data (HIPAA, 2018a, 2018c; Salido, 2010).  The healthcare system design proposal requires a remediation process, with an escalation path, for hospitals or providers that are not compliant with the regulations and policies (Salido, 2010).  This design proposal adopts four major principles as best practice to comply with the required policies and regulations and to protect the confidential data assets of patients and users (Salido, 2010).  The first principle is to honor policies throughout the life of the private data (Salido, 2010).  The second is to minimize the risk of unauthorized access to or misuse of confidential data (Salido, 2010).  The third is to minimize the impact of confidential data loss, and the fourth is to document appropriate controls and demonstrate their effectiveness (Salido, 2010).  Figure 14 shows the four principles this healthcare design proposal adheres to in order to protect healthcare data from unauthorized users and to comply with the required regulations and policies. 


Figure 14.  Healthcare Design Proposal Four Principles.

6.      Assumptions and Limitations

This design proposal assumes that the healthcare sector in the four States will support the application of BD and BDA across them.  That support includes investment in the proper technology, tools, and training required by this design proposal.  The proposal also assumes that the stakeholders, including providers, patients, insurers, pharmaceutical vendors, and practitioners, will welcome the application of BDA and take advantage of it to provide efficient healthcare services, increase productivity, decrease costs for the healthcare sector as well as for patients, and provide better care.

            The main limitation of this proposal is the timeframe required to implement it.  With the support of the healthcare sector in the four States, the implementation can be expedited; however, the silos and the rigid culture of healthcare may interfere with the implementation, which could then take longer than expected.   The initial implementation might also face unexpected challenges, most likely stemming from the shortage of IT professionals and managers experienced in the BD and BDA domain.  This design proposal will be refined based on observations from the first few months of implementation. 

7.      Justification for the Overall Design

            Traditional database and analytics systems are inadequate for healthcare data in the age of BDA.  The characteristics of healthcare datasets, including the large volume of medical records, the variety of data from structured to semi-structured to unstructured, and the velocity of data generation and processing, require a technology such as cloud computing (Fernández et al., 2014).  Cloud computing is the best solution for BD and BDA because it addresses the challenges of BD storage and the demands of computation-intensive processing (Alexandru et al., 2016; Hashem et al., 2015).  The healthcare system in the four States will shift its communication technology and services for applications across the hospitals and providers (Hashem et al., 2015).  The advantages of cloud computing adoption include virtualized resources, parallel processing, security, and data service integration with scalable data storage (Hashem et al., 2015).  With cloud computing, the healthcare sector in the four States will reduce costs and increase efficiency (Hashem et al., 2015).  When quick access to critical patient data is required, the ability to access the data from anywhere is one of the most significant advantages of the cloud computing adoption recommended by this proposed design (Carutasu, Botezatu, Botezatu, & Pirnau, 2016).  The benefits of cloud computing include technological benefits such as virtualization, multi-tenancy, data and storage, and security and privacy compliance (Chang, 2015); economic benefits such as pay-per-use, cost reduction, and return on investment (Chang, 2015); and non-functional benefits covering elasticity, quality of service, reliability, and availability (Chang, 2015).  Thus, the proposed design justifies the use of cloud computing, which has proven to be the best technology for BDA, especially for healthcare data analytics.

            Although cloud computing offers several benefits to the proposed healthcare system, it has suffered from security and privacy concerns (Balasubramanian & Mala, 2015; Kazim & Zhu, 2015).  The security concerns involve risk areas such as external data storage, dependency on the public internet, lack of control, multi-tenancy, and integration with internal security (Hashizume, Rosado, Fernández-medina, & Fernandez, 2013).  Traditional security techniques such as identity, authentication, and authorization are not sufficient for cloud computing environments in their current form under the standard public and private cloud deployment models (Hashizume et al., 2013).  The increasing trend of security threats and data breaches, together with the inability of the current private and public cloud deployment models to meet these challenges, has triggered the need for another deployment model that ensures security and privacy protection.  The VPC is a newer deployment model of cloud computing technology (Botta et al., 2016; Sultan, 2010; Venkatesan, 2012; Zhang, Q., Cheng, & Boutaba, 2010).  The VPC takes advantage of technologies such as the virtual private network (VPN), which allows hospitals and providers to set up their required network settings, including security (Botta et al., 2016; Sultan, 2010; Venkatesan, 2012; Zhang, Q. et al., 2010).  The VPC deployment model will have dedicated resources within the VPN to provide the isolation required to protect patients' information (Botta et al., 2016; Sultan, 2010; Venkatesan, 2012; Zhang, Q. et al., 2010).  Thus, this proposed design will use the VPC cloud deployment model to store and use healthcare data in a secure and isolated environment that protects patients' medical records (Regola & Chawla, 2013).

The Hadoop ecosystem is a required component of this proposed design for several reasons.  Hadoop is a commonly used computing paradigm for processing massive volumes of data in the cloud (Bansal et al., 2014; Chrimes et al., 2018; Dhotre et al., 2015).  Hadoop is described as the only technology that enables large volumes of healthcare data to be stored in their native forms (Dezyre, 2016).  Hadoop has proven useful in developing better treatments for diseases such as cancer by accelerating the design and testing of effective treatments tailored to patients, expanding genetically based clinical cancer trials, and establishing a national cancer knowledge network to guide treatment decisions (Dezyre, 2016).  With a Hadoop system, hospitals in the four States will be able to monitor patient vitals (Dezyre, 2016).  Children's Healthcare of Atlanta, which uses the Hadoop ecosystem to treat over six thousand children in its ICU units, is one example (Dezyre, 2016).

The proposed design requires the integration of NoSQL databases because they offer benefits such as support for mass storage, fast read and write operations, and easy, low-cost expansion (Sahafizadeh & Nematbakhsh, 2015).  HBase is proposed as a required NoSQL database because it is faster when reading more than six million variants, which is required when analyzing large healthcare datasets (Luo et al., 2016).  In addition, a query engine such as SeqWare can be integrated with HBase as needed to help bioinformatics researchers access large-scale whole-genome datasets (Luo et al., 2016).  HBase can store clinical sensor data, where the row key serves as the time stamp of a single value and the columns store the patients' physiological values corresponding to that time stamp (Luo et al., 2016).  HBase is a scalable, high-performance, low-cost NoSQL data store that integrates with Hadoop by sitting on top of HDFS (Yang et al., 2013).  As a column-oriented NoSQL data store running on top of HDFS, HBase is well suited to parsing large healthcare datasets (Yang et al., 2013), and it supports applications written in Avro, REST, and Thrift (Yang et al., 2013).  MongoDB is another required NoSQL data store; it will store metadata to improve the accessibility and readability of the HBase data schema (Luo et al., 2016).

The integration of Spark is required to overcome Hadoop's limitation in real-time data processing, because Hadoop alone is not optimal for real-time workloads (Guo, 2013).  Thus, Apache Spark is a required component of this proposal so that the healthcare BDA system can take advantage of processing data at rest using batch techniques as well as data in motion using real-time processing (Liang & Kelemen, 2016).  Spark allows in-memory processing for fast response times, bypassing MapReduce operations (Liang & Kelemen, 2016).   Spark also integrates tightly with recent Hadoop cluster deployments (Scott, 2015).  While Spark is a powerful tool on its own for processing large volumes of medical and healthcare datasets, it is not well suited for production workloads by itself.  Thus, the integration of Spark with the Hadoop ecosystem provides capabilities that neither Spark nor Hadoop can offer on its own.
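
As an illustration of how Spark could complement Hadoop's batch processing for data in motion, the following is a minimal PySpark Structured Streaming sketch that flags elevated heart rates from a stream of vital-sign events; the input path, schema, window size, and threshold are assumptions for illustration only.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("vitals-stream").getOrCreate()

schema = StructType([
    StructField("patient_id", StringType()),
    StructField("heart_rate", DoubleType()),
    StructField("event_time", TimestampType()),
])

# Read the stream of JSON events and compute a per-patient average heart rate
# over one-minute windows, keeping only readings above an assumed threshold.
vitals = spark.readStream.schema(schema).json("hdfs:///healthcare/landing/vitals")
alerts = (vitals
          .withWatermark("event_time", "2 minutes")
          .groupBy(F.window("event_time", "1 minute"), "patient_id")
          .agg(F.avg("heart_rate").alias("avg_heart_rate"))
          .where("avg_heart_rate > 120"))

query = alerts.writeStream.outputMode("append").format("console").start()
query.awaitTermination()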

The integration of AI as part of this proposal is justified by a Harvard Business Review (HBR) examination that identified ten promising AI applications in healthcare (Kalis, Collier, & Fu, 2018). The findings showed that the application of AI could create up to $150 billion in annual savings for U.S. healthcare by 2026 (Kalis et al., 2018).  The results also showed that AI currently creates the most value by helping frontline clinicians be more productive and by making back-end processes more efficient (Kalis et al., 2018).   Furthermore, IBM invested $1 billion in AI through the IBM Watson Group, and the healthcare industry is the most significant application area for Watson (Power, 2015).

Conclusion

Big Data and Big Data Analytics have played significant roles in various industries, including the healthcare industry.  The value driven by BDA can save lives and minimize costs for patients.  This project proposes a design to apply BDA in the healthcare system across the four States of Colorado, Utah, Arizona, and New Mexico.  Cloud computing is the most appropriate technology to deal with the large volume of healthcare data.  Due to the security issues of cloud computing, the Virtual Private Cloud (VPC) will be used.  The VPC provides a secure cloud environment whose network traffic is controlled using security groups and network access control lists. 

The project requires other components to be fully implemented using the latest technology, such as Hadoop and MapReduce for data stream processing and machine learning for artificial intelligence, which will be used for the Internet of Things (IoT).  The NoSQL databases HBase and MongoDB will be used to handle semi-structured data such as XML and unstructured data such as logs and images.  Spark will be used for real-time data processing, which can be vital for urgent care and emergency services.  This project addressed the assumptions and limitations as well as the justification for selecting these specific components. 

In summary, all stakeholders in the healthcare sector, including providers, insurers, pharmaceutical companies, and practitioners, should cooperate and coordinate to facilitate the implementation process.  All stakeholders are responsible for facilitating the integration of BD and BDA into the healthcare system.  The rigid culture and silo pattern need to change for a better healthcare system, which can save millions of dollars for the healthcare industry while providing excellent care to patients.

References

Abdul, A. M., Jena, S., Prasad, S. D., & Balraju, M. (2014). Trusted Environment In Virtual Cloud. International Journal of Advanced Research in Computer Science, 5(4).

Abernathy, R., & McMillan, T. (2016). CISSP Cert Guide: Pearson IT Certification.

absentdata.com. (2018). Tableau Advantages and Disadvantages. Retrieved from https://www.absentdata.com/advantages-and-disadvantages-of-tableau/.

Alexandru, A., Alexandru, C., Coardos, D., & Tudora, E. (2016). Healthcare, Big Data and Cloud Computing. management, 1, 2.

Alguliyev, R., & Imamverdiyev, Y. (2014). Big data: big promises for information security. Paper presented at the Application of Information and Communication Technologies (AICT), 2014 IEEE 8th International Conference on.

Anand, M., & Clarice, S. (2015). Artificial Intelligence Meets Internet of Things. Retrieved from http://www.ijcset.net/docs/Volumes/volume5issue6/ijcset2015050604.pdf.

Ankam, V. (2016). Big Data Analytics: Packt Publishing Ltd.

Aravind, P. S., & Agrawal, V. (2014). Processing XML data in BigInsights 3.0. Retrieved from https://developer.ibm.com/hadoop/2014/10/31/processing-xml-data-biginsights-3-0/.

Archenaa, J., & Anita, E. M. (2015). A survey of big data analytics in healthcare and government. Procedia Computer Science, 50, 408-413.

Assunção, M. D., Calheiros, R. N., Bianchi, S., Netto, M. A. S., & Buyya, R. (2015). Big Data Computing and Clouds: Trends and Future Directions. Journal of Parallel and Distributed Computing, 79, 3-15. doi:10.1016/j.jpdc.2014.08.003

Balasubramanian, V., & Mala, T. (2015). A Review On Various Data Security Issues In Cloud Computing Environment And Its Solutions. Journal of Engineering and Applied Sciences, 10(2).

Bansal, A., Deshpande, A., Ghare, P., Dhikale, S., & Bodkhe, B. (2014). Healthcare data analysis using dynamic slot allocation in Hadoop. International Journal of Recent Technology and Engineering, 3(5), 15-18.

Basu, A. (2014). Real-Time Healthcare Analytics on Apache Hadoop* using Spark* and Shark. Retrieved from https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/big-data-real-time-healthcare-analytics-whitepaper.pdf.

Botta, A., de Donato, W., Persico, V., & Pescapé, A. (2016). Integration of Cloud Computing and Internet Of Things: a Survey. Future Generation computer systems, 56, 684-700.

Bresnick, J. (2018). Top 12 Ways Artificial Intelligence Will Impact Healthcare. Retrieved from https://healthitanalytics.com/news/top-12-ways-artificial-intelligence-will-impact-healthcare.

Carutasu, G., Botezatu, M., Botezatu, C., & Pirnau, M. (2016). Cloud Computing and Windows Azure. Electronics, Computers and Artificial Intelligence.

Chang, V. (2015). A Proposed Framework for Cloud Computing Adoption. International Journal of Organizational and Collective Intelligence, 6(3).

Chrimes, D., Zamani, H., Moa, B., & Kuo, A. (2018). Simulations of Hadoop/MapReduce-Based Platform to Support its Usability of Big Data Analytics in Healthcare.

Cloud Security Alliance. (2013). The Notorious Nine: Cloud Computing Top Threats in 2013. Cloud Security Alliance: Top Threats Working Group. 

Cloud Security Alliance. (2016). The Treacherous 12: Cloud Computing Top Threats in 2016. Cloud Security Alliance: Top Threats Working Group. 

Cloud Security Alliance. (2017). The Treacherous 12 Top Threats to Cloud Computing. Cloud Security Alliance: Top Threats Working Group. 

Dezyre. (2016). 5 Healthcare Applications of Hadoop and Big Data Retrieved from https://www.dezyre.com/article/5-healthcare-applications-of-hadoop-and-big-data/85.

Dhotre, P., Shimpi, S., Suryawanshi, P., & Sanghati, M. (2015). Health Care Analysis Using Hadoop. International Journal of Scientific & Technology Research, 4(12), 279-281.

EMC. (2015). Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data. (1st ed.): Wiley.

Fawcett, J., Ayers, D., & Quin, L. R. (2012). Beginning XML: John Wiley & Sons.

Fernández, A., del Río, S., López, V., Bawakid, A., del Jesus, M. J., Benítez, J. M., & Herrera, F. (2014). Big Data with Cloud Computing: An Insight on the Computing Environment, MapReduce, and Programming Frameworks. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 4(5), 380-409. doi:10.1002/widm.1134

Fox, M., & Vaidyanathan, G. (2016). Impacts of Healthcare Big Data:  A Framwork With Legal and Ethical Insights. Issues in Information Systems, 17(3).

Ghani, K. R., Zheng, K., Wei, J. T., & Friedman, C. P. (2014). Harnessing big data for health care and research: are urologists ready? European urology, 66(6), 975-977.

Grover, M., Malaska, T., Seidman, J., & Shapira, G. (2015). Hadoop Application Architectures: Designing Real-World Big Data Applications: O'Reilly Media, Inc.

Groves, P., Kayyali, B., Knott, D., & Kuiken, S. V. (2016). The ‘Big Data’ Revolution in Healthcare: Accelerating Value and Innovation.

Guo, S. (2013). Hadoop operations and cluster management cookbook: Packt Publishing Ltd.

Gupta, R., Gupta, H., & Mohania, M. (2012). Cloud Computing and Big Data Analytics: What is New From Databases Perspective? Paper presented at the International Conference on Big Data Analytics, Springer-Verlag Berlin Heidelberg.

Hashem, I. A. T., Yaqoob, I., Anuar, N. B., Mokhtar, S., Gani, A., & Khan, S. U. (2015). The Rise of “Big Data” on Cloud Computing: Review and Open Research Issues. Information Systems, 47, 98-115. doi:10.1016/j.is.2014.07.006

Hashizume, K., Rosado, D. G., Fernández-medina, E., & Fernandez, E. B. (2013). An analysis of security issues for cloud computing. Journal of internet services and applications, 4(1), 1-13. doi:10.1186/1869-0238-4-5

HIMSS. (2018). 2017 Security Metrics: Guide to HIPAA Compliance: What Healthcare Entities and Business Associates Need to Know. Retrieved on 12/1/2018 from http://www.himss.org/file/1318331/download?token=h9cBvnl2. 

HIPAA. (2018a). At Least 3.14 Million Healthcare Records Were Exposed in Q2, 2018. Retrieved 11/22/2018 from https://www.hipaajournal.com/q2-2018-healthcare-data-breach-report/. 

HIPAA. (2018b). How to Defend Against Insider Threats in Healthcare. Retrieved 8/22/2018 from https://www.hipaajournal.com/category/healthcare-cybersecurity/. 

HIPAA. (2018c). Q3 Healthcare Data Breach Report: 4.39 Million Records Exposed in 117 Breaches. Retrieved 11/22/2018 from https://www.hipaajournal.com/q3-healthcare-data-breach-report-4-39-million-records-exposed-in-117-breaches/. 

HIPAA. (2018d). Report: Healthcare Data Breaches in Q1, 2018. Retrieved 5/15/2018 from https://www.hipaajournal.com/report-healthcare-data-breaches-in-q1-2018/. 

HL7. (2011). Patient Example Instance in XML.  

Hu, H., Wen, Y., Chua, T., & Li, X. (2014). Toward Scalable Systems for Big Data Analytics: A Technology Tutorial. Practical Innovation, Open Solution, 2, 652-687. doi:10.1109/ACCESS.2014.2332453

InformationBuilders. (2018). Data In Motion – Big Data Analytics in Healthcare. Retrieved from http://docs.media.bitpipe.com/io_10x/io_109369/item_674791/datainmotionbigdataanalytics.pdf, White Paper.

Jayasingh, B. B., Patra, M. R., & Mahesh, D. B. (2016, 14-17 Dec. 2016). Security issues and challenges of big data analytics and visualization. Paper presented at the 2016 2nd International Conference on Contemporary Computing and Informatics (IC3I).

Ji, Z., Ganchev, I., O’Droma, M., Zhang, X., & Zhang, X. (2014). A cloud-based X73 ubiquitous mobile healthcare system: design and implementation. The Scientific World Journal, 2014.

Kalis, B., Collier, M., & Fu, R. (2018). 10 Promising AI Applications in Health Care. Retrieved from https://hbr.org/2018/05/10-promising-ai-applications-in-health-care, Harvard Business Review.

Karanth, S. (2014). Mastering Hadoop: Packt Publishing Ltd.

Kazim, M., & Zhu, S. Y. (2015). A Survey on Top Security Threats in Cloud Computing. International Journal Advanced Computer Science and Application, 6(3), 109-113.

Kersting, K., & Meyer, U. (2018). From Big Data to Big Artificial Intelligence? : Springer.

Klein, J., Gorton, I., Ernst, N., Donohoe, P., Pham, K., & Matser, C. (2015, June 27 2015-July 2 2015). Application-Specific Evaluation of NoSQL Databases. Paper presented at the 2015 IEEE International Congress on Big Data.

Kritikos, K., Kirkham, T., Kryza, B., & Massonet, P. (2017). Towards a Security-Enhanced PaaS Platform for Multi-Cloud Applications. Future Generation computer systems, 67, 206-226. doi:10.1016/j.future.2016.10.008

Kumari, W. M. P. (2017). Artificial Intelligence Meets Internet of Things.

Liang, Y., & Kelemen, A. (2016). Big Data Science and its Applications in Health and Medical Research: Challenges and Opportunities. Austin Journal of Biometrics & Biostatistics, 7(3).

Liveri, D., Sarri, A., & Skouloudi, C. (2015). Security and Resilience in eHealth: Security Challenges and Risks. European Union Agency For Network And Information Security.

Lublinsky, B., Smith, K. T., & Yakubovich, A. (2013). Professional hadoop solutions: John Wiley & Sons.

Luo, J., Wu, M., Gopukumar, D., & Zhao, Y. (2016). Big data application in biomedical research and health care: a literature review. Biomedical informatics insights, 8, BII. S31559.

Malik, L., & Sangwan, S. (2015). MapReduce Framework Implementation on the Prescriptive Analytics of Health Industry. International Journal of Computer Science and Mobile Computing, ISSN, 675-688.

Maltby, D. (2011). Big Data Analytics. Paper presented at the Annual Meeting of the Association for Information Science and Technology.

Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Byers, A. H. (2011). Big Data: The Next Frontier for Innovation, Competition, and Productivity. McKinsey Global Institute.

McKelvey, N., Curran, K., Gordon, B., Devlin, E., & Johnston, K. (2015). Cloud Computing and Security in the Future Guide to Security Assurance for Cloud Computing (pp. 95-108): Springer.

Mehmood, A., Natgunanathan, I., Xiang, Y., Hua, G., & Guo, S. (2016). Protection of Big Data Privacy. Institute of Electrical and Electronic Engineers, 4, 1821-1834. doi:10.1109/ACCESS.2016.2558446

Meyer, M. (2018). The Rise of Healthcare Data Visualization.

Mills, T. (2018). Eight Ways Big Data And AI Are Changing The Business World.

MongoDB. (2018). ETL Best Practice.  

O’Brien, B. (2016). Why The IoT Needs Artificial Intelligence to Succeed.

Palanisamy, V., & Thirunavukarasu, R. (2017). Implications of Big Data Analytics in developing Healthcare Frameworks–A review. Journal of King Saud University-Computer and Information Sciences.

Patrizio, A. (2018). Big Data vs. Artificial Intelligence.

Power, B. (2015). Artificial Intelligence Is Almost Ready for Business.

Raghupathi, W., & Raghupathi, V. (2014). Big data analytics in healthcare: promise and potential. Health Information Science and Systems, 2(1), 1.

Regola, N., & Chawla, N. (2013). Storing and Using Health Data in a Virtual Private Cloud. Journal of medical Internet research, 15(3), 1-12. doi:10.2196/jmir.2076

Sahafizadeh, E., & Nematbakhsh, M. A. (2015). A Survey on Security Issues in Big Data and NoSQL. Int’l J. Advances in Computer Science, 4(4), 2322-5157.

Salido, J. (2010). Data Governance for Privacy, Confidentiality and Compliance: A Holistic Approach. ISACA Journal, 6, 17.

Scott, J. A. (2015). Getting Started with Spark: MapR Technologies, Inc.

Stewart, J., Chapple, M., & Gibson, D. (2015). ISC Official Study Guide.  CISSP Security Professional Official Study Guide (7th ed.): Wiley.

Sultan, N. (2010). Cloud Computing for Education: A New Dawn? International Journal of Information Management, 30(2), 109-116. doi:10.1016/j.ijinfomgt.2009.09.004

Sun, J., & Reddy, C. (2013). Big Data Analytics for Healthcare. Retrieved from https://www.siam.org/meetings/sdm13/sun.pdf.

Tableau. (2011). Three Ways Healthcare Providers are Transforming Data from Information to Insight. White Paper.

Thompson, E. C. (2017). Building a HIPAA-Compliant Cybersecurity Program, Using NIST 800-30 and CSF to Secure Protected Health Information.

Van-Dai, T., Chuan-Ming, L., & Nkabinde, G. W. (2016, 5-7 July 2016). Big data stream computing in healthcare real-time analytics. Paper presented at the 2016 IEEE International Conference on Cloud Computing and Big Data Analysis (ICCCBDA).

Venkatesan, T. (2012). A Literature Survey on Cloud Computing. i-Manager’s Journal on Information Technology, 1(1), 44-49.

Wang, Y., Kung, L. A., & Byrd, T. A. (2018). Big Data Analytics: Understanding its Capabilities and Potential Benefits for Healthcare Organizations. Technological Forecasting and Social Change, 126, 3-13. doi:10.1016/j.techfore.2015.12.019

Wicklund, E. (2014). ‘Silo’ one of healthcare’s biggest flaws. Retrieved from http://www.healthcareitnews.com/news/silo-one-healthcares-biggest-flaws.

Yang, C. T., Liu, J. C., Hsu, W. H., Lu, H. W., & Chu, W. C. C. (2013, 16-18 Dec. 2013). Implementation of Data Transform Method into NoSQL Database for Healthcare Data. Paper presented at the 2013 International Conference on Parallel and Distributed Computing, Applications and Technologies.

Zhang, Q., Cheng, L., & Boutaba, R. (2010). Cloud Computing: State-of-the-Art and Research Challenges. Journal of internet services and applications, 1(1), 7-18. doi:10.1007/s13174-010-0007-6

Zhang, R., & Liu, L. (2010). Security models and requirements for healthcare application clouds. Paper presented at the Cloud Computing (CLOUD), 2010 IEEE 3rd International Conference on.

Zia, U. A., & Khan, N. (2017). An Analysis of Big Data Approaches in Healthcare Sector. International Journal of Technical Research & Science, 2(4), 254-264.

 

Hadoop: Manageable Size of Data.

Dr. O. Aly
Computer Science

Abstract

The purpose of this project is to discuss how data must be handled before Hadoop can break it into manageable sizes.  The discussion begins with an overview of Hadoop, providing a brief history and the differences between Hadoop 1.x and Hadoop 2.x. It then covers the Big Data Analytics process using Hadoop, which involves six significant steps, including data pre-processing and the ETL process, where the data must be converted and cleaned before it is processed.  Before data processing, consideration must be given to data preprocessing, modeling, and schema design in Hadoop for better processing and data retrieval, because these choices affect how data can be split among the nodes of the distributed environment, and not all tools can split the data.  This consideration begins with the data storage format, followed by Hadoop file types and the challenges of XML and JSON formats in Hadoop.  Compression must be chosen carefully because not all compression types are splittable.  The discussion also covers schema design considerations for HDFS and HBase, since they are used most often in the Hadoop ecosystem. 

Keywords: Big Data Analytics; Hadoop; Data Modelling in Hadoop; Schema Design in Hadoop.

Introduction

In the age of Big Data, dealing with large datasets in terabytes and petabytes is a reality that requires specific technology, as traditional technology has been found inappropriate for it (Dittrich & Quiané-Ruiz, 2012).  Hadoop was developed to store and process such large datasets efficiently and is becoming a data processing engine for Big Data (Dittrich & Quiané-Ruiz, 2012).  One of the significant advantages of Hadoop MapReduce is that it allows non-expert users to easily run analytical tasks over Big Data (Dittrich & Quiané-Ruiz, 2012). However, before the analytical process takes place, schema design and data modeling considerations must be addressed so that data processing in Hadoop can be efficient (Grover, Malaska, Seidman, & Shapira, 2015).  Hadoop requires splitting the data; some tools can split the data natively, while others cannot and require integration (Grover et al., 2015). 

This project discusses these considerations to ensure an appropriate schema design for Hadoop and its components, HDFS and HBase, where the data is stored in a distributed environment.   The discussion begins with an overview of Hadoop, followed by the data analytics process, and ends with the data modeling techniques and considerations for Hadoop that can assist in splitting the data appropriately for better data processing performance and better data retrieval.

Overview of Hadoop

            Google published its MapReduce technique and implementation around 2004 (Karanth, 2014).  It also introduced the Google File System (GFS), which is associated with the MapReduce implementation.  MapReduce has since become the most common technique for processing massive data sets in parallel and distributed settings across many companies (Karanth, 2014).  In 2008, Yahoo released Hadoop as an open-source implementation of the MapReduce framework (Karanth, 2014; sas.com, 2018). Hadoop and its file system, HDFS, are inspired by Google's MapReduce and GFS (Ankam, 2016; Karanth, 2014).  

Apache Hadoop is the parent project for all subsequent Hadoop projects (Karanth, 2014).  It contains three essential branches: the 0.20.1 branch, the 0.20.2 branch, and the 0.21 branch.  The 0.20.2 branch is often termed MapReduce v2.0, MRv2, or Hadoop 2.0.  Two additional releases, Hadoop-0.20-append and Hadoop-0.20-Security, introduced HDFS append and security-related features into Hadoop, respectively.  The timeline for Hadoop technology is outlined in Figure 1.


Figure 1.  Hadoop Timeline from 2003 until 2013 (Karanth, 2014).

Hadoop version 1.0 marked the inception and evolution of Hadoop as a simple MapReduce job-processing framework (Karanth, 2014).  It exceeded expectations, with wide adoption for massive data processing.  The stable 1.x release includes features such as append and security.  The Hadoop 2.0 release came out in 2013 to increase the efficiency and mileage of existing Hadoop clusters in enterprises.  Hadoop is becoming a common cluster-computing and storage platform rather than being limited to MapReduce, because it has been moving faster than MapReduce to stay at the forefront of massive-scale data processing while facing the challenge of remaining backward compatible (Karanth, 2014). 

            In Hadoop 1.x, the JobTracker was responsible for resource allocation and job execution (Karanth, 2014).  MapReduce was the only supported model because the computing model was tied to the resources in the cluster. Yet Another Resource Negotiator (YARN) was developed to separate the concerns of resource management and application execution, which enables other application paradigms to be added to the Hadoop computing cluster. The support for diverse applications results in efficient and effective utilization of resources and integrates well with the infrastructure of the business (Karanth, 2014).  YARN maintains backward compatibility with Hadoop 1.x APIs (Karanth, 2014).  Thus, an old MapReduce program can still execute in YARN with no code changes, but it has to be recompiled (Karanth, 2014).

            YARN abstracts the resource management functions into a platform layer called the ResourceManager (RM) (Karanth, 2014).  Every cluster must have an RM to keep track of cluster resource usage and activity.  The RM is also responsible for allocating resources and resolving contention among resource seekers in the cluster.  The RM utilizes a generalized resource model and is agnostic to application-specific resource needs; for example, it does not need to know the resources corresponding to a single Map or Reduce slot (Karanth, 2014). Figure 2 shows Hadoop 1.x and Hadoop 2.x with the YARN layer.   


Figure 2. Hadoop 1.x vs. Hadoop 2.x (Karanth, 2014).

Hadoop 2.x also involves various enhancements at the storage layer.   These enhancements include the high-availability feature, which provides a hot standby of the NameNode (Karanth, 2014); when the active NameNode fails, the standby can become the active NameNode in a matter of minutes.  Zookeeper or any other HA monitoring service can be utilized to track NameNode failure (Karanth, 2014), and the failover process that promotes the hot standby to active NameNode is triggered with the assistance of Zookeeper.  HDFS federation is another Hadoop 2.x enhancement and is a more generalized storage model in which block storage has been generalized and separated from the filesystem layer (Karanth, 2014).  HDFS snapshots are another enhancement in Hadoop 2.x, providing a read-only image of an entire filesystem or a particular subset of it to protect against user errors and to support backup and disaster recovery.   Other enhancements in Hadoop 2.x include Protocol Buffers (Karanth, 2014): the wire protocol for RPCs within Hadoop is based on Protocol Buffers.  Hadoop 2.x is also aware of the type of storage and exposes this information to the application to optimize data fetch and placement strategies (Karanth, 2014).  HDFS append support is another enhancement in Hadoop 2.x.

Hadoop is regarded as the de facto open-source framework for large-scale, massively parallel, and distributed data processing (Karanth, 2014).  The Hadoop framework includes two layers: a computation layer and a data layer (Karanth, 2014).  The computation layer is used for parallel and distributed computation, while the data layer provides highly fault-tolerant data storage associated with the computation layer.  These two layers run on commodity hardware, which is inexpensive, readily available, and compatible with other similar hardware (Karanth, 2014).

Hadoop Architecture

Apache Hadoop has four projects: Hadoop Common, the Hadoop Distributed File System (HDFS), Yet Another Resource Negotiator (YARN), and MapReduce (Ankam, 2016).  HDFS is used to store data, MapReduce is used to process data, YARN is used to manage cluster resources such as CPU and memory, and Hadoop Common provides the utilities that support the Hadoop framework (Ankam, 2016; Karanth, 2014).  Apache Hadoop integrates with other tools such as Avro, Hive, Pig, HBase, Zookeeper, and Apache Spark (Ankam, 2016; Karanth, 2014).

            Hadoop has three significant components for Big Data Analytics.  HDFS is a framework for reliable, distributed data storage (Ankam, 2016; Karanth, 2014), and some considerations must be taken into account when storing data in HDFS (Grover et al., 2015).  The frameworks for parallel processing of data include MapReduce, Crunch, Cascading, Hive, Tez, Impala, Pig, Mahout, Spark, and Giraph (Ankam, 2016; Karanth, 2014). The Hadoop architecture includes NameNodes and DataNodes.  It also includes Oozie for workflow, Pig for scripting, Mahout for machine learning, Hive for the data warehouse, Sqoop for data exchange, and Flume for log collection.  YARN, as discussed earlier, is the Hadoop 2.0 layer for distributed computing, while HCatalog provides Hadoop metadata management.  HBase is the columnar database and Zookeeper provides coordination (Alguliyev & Imamverdiyev, 2014).  Figure 3 shows the Hadoop ecosystem components.


Figure 3.  Hadoop Architecture (Alguliyev & Imamverdiyev, 2014).

Big Data Analytics Process Using Hadoop

The process of Big Data Analytics involves six essential steps (Ankam, 2016). The first step is identifying the business problem and the desired outcomes.  Examples of business problems include declining sales, shopping carts abandoned by customers, or a sudden rise in call volumes; examples of outcomes include improving the buying rate by 10%, decreasing shopping cart abandonment by 50%, or reducing call volume by 50% by the next quarter while keeping customers happy.  The second step is identifying the required data, whose sources can be a data warehouse using online analytical processing, an application database using online transactional processing, server log files, documents from the internet, sensor-generated data, and so forth, depending on the case and the problem.  Data collection is the third step (Ankam, 2016): Sqoop can be used to collect data from relational databases, Flume can be used for streaming data, and Apache Kafka can be used as reliable intermediate storage; the data collection design should follow a fault-tolerance strategy (Ankam, 2016).

The fourth step is data preprocessing and the ETL process.  The collected data comes in various formats and its quality can be an issue, so before processing it must be converted to the required format and cleaned of inconsistent, invalid, or corrupted records; Apache Hive, Apache Pig, and Spark SQL can be used to preprocess massive amounts of data, as sketched after this overview.  The fifth step is implementing the analytics needed to answer the business questions and problems, which requires understanding the data and the relationships between data points.  Descriptive and diagnostic analytics present past and current views of the data to answer what happened and why, while predictive analytics answers what would happen based on a hypothesis.  Apache Hive, Pig, Impala, Drill, Tez, Apache Spark, and HBase can be used for data analytics in batch processing mode, and real-time analytics tools such as Impala, Tez, Drill, and Spark SQL can be integrated with traditional business intelligence (BI) tools such as Tableau or QlikView for interactive analytics.  The last step is visualizing the data to present the analytics output in a graphical or pictorial format so that the analysis is better understood for decision making.  The finished data is exported from Hadoop to a relational database using Sqoop for integration into visualization systems, or visualization tools such as Tableau, QlikView, and Excel are integrated directly.  Web-based notebooks such as Jupyter, Zeppelin, and Databricks Cloud are also used to visualize data by integrating Hadoop and Spark components (Ankam, 2016). 
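
The following is a hedged sketch of the fourth step (pre-processing/ETL) using Spark SQL, as referenced above; the input path, column names, and cleaning rules are illustrative assumptions.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("claims-preprocess").getOrCreate()

raw = spark.read.option("header", True).csv("hdfs:///healthcare/raw/claims")

clean = (raw
         .dropDuplicates(["claim_id"])                             # redundancy elimination
         .withColumn("claim_amount", F.col("claim_amount").cast("double"))
         .filter(F.col("claim_amount").isNotNull())                # drop corrupted rows
         .withColumn("service_date", F.to_date("service_date", "yyyy-MM-dd")))

# Write the converted, cleaned data back to HDFS for the analytics step.
clean.write.mode("overwrite").parquet("hdfs:///healthcare/processed/claims")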

Data Preprocessing, Modeling and Design Consideration in Hadoop

            Before any data is processed, and before any data is collected for storage, some considerations must be made for data modeling and design in Hadoop to support better processing and retrieval (Grover et al., 2015).  The traditional data management system is referred to as a schema-on-write system, which requires defining the schema of the data store before the data is loaded (Grover et al., 2015).  This approach results in long cycles of analysis, data modeling, data transformation, loading, testing, and so forth before the data can be accessed (Grover et al., 2015).   In addition, if anything changes or a wrong decision was made, the cycle must start again from the beginning, which takes even longer (Grover et al., 2015).   This section addresses the various considerations that must be made before data is processed in Hadoop for analytical purposes.

Data Pre-Processing Consideration

A dataset may have various levels of quality regarding noise, redundancy, and consistency (Hu, Wen, Chua, & Li, 2014).  Preprocessing techniques to improve data quality should therefore be in place in Big Data systems (Hu et al., 2014; Lublinsky, Smith, & Yakubovich, 2013).  Data pre-processing involves three techniques: data integration, data cleansing, and redundancy elimination.

Data integration techniques are used to combine data residing in different sources and provide users with a unified view of the data (Hu et al., 2014).  The traditional database approach has well-established data integration methods, including the data warehouse method and the data federation method (Hu et al., 2014).  The data warehouse approach is also known as ETL, consisting of extraction, transformation, and loading (Hu et al., 2014).  The extraction step involves connecting to the source systems and selecting and collecting the required data to be processed for analytical purposes.  The transformation step involves applying a series of rules to the extracted data to convert it into a standard format.  The load step involves importing the extracted and transformed data into a target storage infrastructure (Hu et al., 2014).  The federation approach creates a virtual database to query and aggregate data from various sources (Hu et al., 2014).  The virtual database contains information, or metadata, about the actual data and its location but does not contain the data itself (Hu et al., 2014).  These two pre-processing approaches are store-and-pull techniques, which are not appropriate for Big Data processing given its high computation demands, heavy streaming, and dynamic nature (Hu et al., 2014).  

The data cleansing process is vital to keeping the data consistent and up to date, and it is widely used in many fields such as banking, insurance, and retail (Hu et al., 2014).  The cleansing process is required to determine which data are incomplete, inaccurate, or unreasonable and then to remove them to improve data quality (Hu et al., 2014). The data cleansing process includes five steps (Hu et al., 2014).  The first step is to define and determine the error types.  The second step is to search for and identify error instances.  The third step is to correct the errors, and the fourth is to document error instances and error types. The last step is to modify data entry procedures to reduce future errors.  Various types of checks must be performed during cleansing, including format checks, completeness checks, reasonableness checks, and limit checks (Hu et al., 2014).  Data cleansing is required to improve the accuracy of the analysis (Hu et al., 2014), but it depends on a complex relationship model and carries extra computation and delay overhead (Hu et al., 2014).  Organizations must therefore seek a balance between the complexity of the data-cleansing model and the resulting improvement in analysis accuracy (Hu et al., 2014). 
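
A minimal sketch of the four check types named above (format, completeness, reasonableness, and limit checks) applied with PySpark follows; the dataset, field names, patterns, and thresholds are assumptions.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cleansing-checks").getOrCreate()
vitals = spark.read.parquet("hdfs:///healthcare/processed/vitals")

checked = (vitals
    # format check: patient id must match an assumed pattern
    .filter(F.col("patient_id").rlike(r"^P-\d{6}$"))
    # completeness check: required fields must be present
    .filter(F.col("heart_rate").isNotNull() & F.col("event_time").isNotNull())
    # reasonableness check: physiologically plausible heart rate
    .filter(F.col("heart_rate").between(20, 300))
    # limit check: readings must fall within the reporting period
    .filter(F.col("event_time") >= "2018-01-01"))

checked.write.mode("overwrite").parquet("hdfs:///healthcare/clean/vitals")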

Redundancy elimination is the third pre-processing step.  Redundancy, where data is repeated, increases the overhead of data transmission and causes limitations for storage systems, including wasted space, inconsistency of the data, corruption of the data, and reduced reliability (Hu et al., 2014).  Redundancy reduction methods include redundancy detection and data compression (Hu et al., 2014).  The data compression method imposes an extra computational burden for the compression and decompression processes (Hu et al., 2014).

Data Modeling and Design Consideration

A schema-on-write system is used when the application or structure is well understood and high-value data is frequently accessed through queries and reports (Grover et al., 2015).  The term schema-on-read is used in the context of the Hadoop data management system (Ankam, 2016; Grover et al., 2015). It refers to raw, unprocessed data that can be loaded into Hadoop and given the required structure at processing time, based on the requirements of the processing application (Ankam, 2016; Grover et al., 2015).  Schema-on-read is used when the application or the structure of the data is not well understood (Ankam, 2016; Grover et al., 2015).  Schema-on-read gives the process its agility and provides valuable insights into data that was not previously accessible (Grover et al., 2015).
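
The following minimal sketch illustrates schema-on-read: the same raw JSON files in HDFS are read with two different schemas chosen at processing time. The paths and field names are assumptions.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# A billing job only cares about two fields of the raw records ...
billing_schema = StructType([
    StructField("claim_id", StringType()),
    StructField("claim_amount", DoubleType()),
])
billing = spark.read.schema(billing_schema).json("hdfs:///healthcare/raw/claims_json")

# ... while a clinical job projects a different structure from the same raw files.
clinical_schema = StructType([
    StructField("claim_id", StringType()),
    StructField("diagnosis_code", StringType()),
])
clinical = spark.read.schema(clinical_schema).json("hdfs:///healthcare/raw/claims_json")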

            Five factors must be considered before storing data in Hadoop for processing (Grover et al., 2015).  The first is the data storage format, as there are several file formats and compression formats supported on Hadoop, and each has strengths that make it better suited to specific applications.   Although the Hadoop Distributed File System (HDFS) is the building block of the Hadoop ecosystem used for storing data, several commonly used systems are implemented on top of HDFS, such as HBase for additional data access functionality and Hive for additional data management functionality, and these systems must also be taken into consideration before storing data in Hadoop (Grover et al., 2015). The second factor is multitenancy, since clusters commonly host multiple users, groups, and application types, and multi-tenant clusters involve important data storage considerations.  The third factor is schema design, which should be considered even though Hadoop is schema-less (Grover et al., 2015); it involves the directory structures for data loaded into HDFS and for the output of data processing and analysis, as well as the schema of objects stored in systems such as HBase and Hive.  The fourth factor is metadata management.  Metadata related to the stored data is often as important as the data itself, and understanding metadata management plays a significant role because it can affect the accessibility of the data.  Security is the final factor to consider before storing data in a Hadoop system.  Security decisions involve authentication, fine-grained access control, and encryption, and these measures should be considered both for data at rest, when it is stored, and for data in motion, during processing (Grover et al., 2015).  Figure 4 summarizes these considerations before storing data into the Hadoop system. 


Figure 4.  Considerations Before Storing Data into Hadoop.

Data Storage Format Considerations

            When architecting a solution on Hadoop, the method of storing the data is one of the essential decisions. Primary considerations for data storage in Hadoop involve the file format, compression, and the data storage system (Grover et al., 2015).  The standard file formats involve three types:  text data, structured text data, and binary data.  Figure 5 summarizes these three standard file formats.


Figure 5.  Standard File Formats.

Text data is in widespread use on Hadoop, including log files such as weblogs and server logs (Grover et al., 2015).  Text data can come in many forms, such as CSV files, or as unstructured data such as emails.  Compression of these files is recommended, and the selection of the compression format is influenced by how the data will be used (Grover et al., 2015).  For instance, if the data is for archival, the most compact compression method can be used, while if the data is used in processing jobs such as MapReduce, a splittable format should be used (Grover et al., 2015).  A splittable format enables Hadoop to split files into chunks for processing, which is essential for efficient parallel processing (Grover et al., 2015).

In most cases, the use of container formats such as SequenceFiles or Avro provides benefits that make them the preferred choice for most file types, including text (Grover et al., 2015).  It is worth noting that these container formats provide functionality to support splittable compression, among other benefits (Grover et al., 2015).   Binary data, such as images, can be stored in Hadoop as well, and a container format such as SequenceFile is preferred when storing it.  If the splittable unit of binary data is larger than 64 MB, the data should be put into its own file without using a container format (Grover et al., 2015).

XML and JSON Format Challenges with Hadoop

Structured text data includes formats such as XML and JSON, which can present unique challenges in Hadoop because splitting XML and JSON files for processing is not straightforward, and Hadoop does not provide a built-in InputFormat for either (Grover et al., 2015).  JSON presents more challenges than XML because no token is available to mark the beginning or end of a record.  When using these file formats, two primary considerations apply.  First, a container format such as Avro should be used, because transforming the data into Avro provides a compact and efficient way to store and process it (Grover et al., 2015).  Second, a library for processing XML or JSON should be chosen; XMLLoader in the PiggyBank library for Pig is an example for XML data, and the Elephant Bird project is an example for JSON data (Grover et al., 2015). 
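
As a hedged illustration of the first consideration, the following sketch converts raw JSON records into the Avro container format with Spark; it assumes the external spark-avro package is available on the cluster, and the coordinates and paths shown are examples only.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("json-to-avro")
         # illustrative: the spark-avro coordinates depend on the Spark version in use
         .config("spark.jars.packages", "org.apache.spark:spark-avro_2.11:2.4.0")
         .getOrCreate())

records = spark.read.json("hdfs:///healthcare/raw/hl7_json")   # assumed input path
records.write.format("avro").save("hdfs:///healthcare/staged/hl7_avro")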

Hadoop File Types Considerations

            Several Hadoop-based file formats have been created to work well with MapReduce (Grover et al., 2015).  The Hadoop-specific file formats include file-based data structures such as SequenceFiles, serialization formats such as Avro, and columnar formats such as RCFile and Parquet (Grover et al., 2015).  These file types share two essential characteristics that are important for Hadoop applications: splittable compression and agnostic compression.  The ability to split files plays a significant role during data processing and should not be underestimated when storing data in Hadoop, because it allows large files to be split for input to MapReduce and other types of jobs, which is a fundamental part of parallel processing and a key to leveraging Hadoop's data locality feature (Grover et al., 2015).  Agnostic compression is the ability to compress with any compression codec without readers having to know the codec in advance, because the codec is stored in the header metadata of the file format (Grover et al., 2015).  Figure 6 summarizes these Hadoop-specific file formats and the two common characteristics of splittable compression and agnostic compression.


Figure 6. Three Hadoop File Types with the Two Common Characteristics.  

1.      SequenceFiles Format Consideration

The SequenceFile format is the most widely used Hadoop file-based format.  It stores data as binary key-value pairs (Grover et al., 2015) and involves three formats for records stored within SequenceFiles:  uncompressed, record-compressed, and block-compressed.  Every SequenceFile uses a standard header format containing basic metadata about the file, such as the compression codec used, key and value class names, user-defined metadata, and a randomly generated sync marker.  SequenceFiles are well supported in Hadoop; however, they have limited support outside the Hadoop ecosystem because they are only supported in Java.  A frequent use case for SequenceFiles is as a container for smaller files.  Storing a large number of small files in Hadoop can cause memory issues and excessive processing overhead, so packing smaller files into a SequenceFile can make their storage and processing more efficient, because Hadoop is optimized for large files (Grover et al., 2015), as sketched after Figure 7.   Other file-based formats include MapFiles, SetFiles, ArrayFiles, and BloomMapFiles.  These formats offer a high level of integration for all forms of MapReduce jobs, including those run via Pig and Hive, because they were designed to work with MapReduce (Grover et al., 2015).  Figure 7 summarizes the three formats for records stored within SequenceFiles.


Figure 7.  Three Formats for Records Stored within SequenceFile.
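
The following is a minimal sketch of the small-files use case mentioned above: packing a directory of many small files into a single SequenceFile of (filename, content) pairs with PySpark. The paths are assumptions.

from pyspark import SparkContext

sc = SparkContext(appName="pack-small-files")

# wholeTextFiles yields (path, content) pairs, one per small file.
small_files = sc.wholeTextFiles("hdfs:///healthcare/raw/clinic_notes")

# Persist them as key-value records inside a SequenceFile container.
small_files.saveAsSequenceFile("hdfs:///healthcare/staged/clinic_notes_seq")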

2.      Serialization Formats Consideration

Serialization is the process of turning data structures into bytes for storage or for transferring data over the network (Grover et al., 2015).   Deserialization is the opposite process of converting a byte stream back into a data structure (Grover et al., 2015).  Serialization is a fundamental building block for distributed processing systems such as Hadoop because it allows data to be converted into a format that can be efficiently stored and transferred across a network connection (Grover et al., 2015).  Figure 8 summarizes the serialization formats to consider when architecting for Hadoop.


Figure 8.  Serialization Process vs. Deserialization Process.

Serialization covers two aspects of data processing in a distributed system: interprocess communication (remote procedure calls, or RPC) and data storage (Grover et al., 2015).  Hadoop utilizes Writables as its main serialization format, which is compact and fast but usable only from Java.  Other serialization frameworks, including Thrift, Protocol Buffers, and Avro, have been increasingly used within the Hadoop ecosystem (Grover et al., 2015).  Avro is a language-neutral data serialization system (Grover et al., 2015) designed to address the main limitation of Hadoop Writables, namely their lack of language portability.  Like Thrift and Protocol Buffers, Avro is described through a language-independent schema (Grover et al., 2015); unlike them, code generation in Avro is optional.  Table 1 provides a comparison between these serialization formats.

Table 1:  Comparison between Serialization Formats.
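
To illustrate Avro's language-independent schema, the following sketch defines a schema and writes a few records with the fastavro library; the record fields and file name are assumptions, not part of the cited comparison.

from fastavro import writer, parse_schema

schema = parse_schema({
    "namespace": "org.example.healthcare",
    "type": "record",
    "name": "LabResult",
    "fields": [
        {"name": "patient_id", "type": "string"},
        {"name": "test_code",  "type": "string"},
        {"name": "value",      "type": "double"},
    ],
})

records = [
    {"patient_id": "P-001234", "test_code": "GLU", "value": 5.4},
    {"patient_id": "P-001235", "test_code": "GLU", "value": 6.1},
]

# The schema travels with the data file, which is what makes Avro language-neutral.
with open("lab_results.avro", "wb") as out:
    writer(out, schema, records)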

3.      Columnar Format Consideration

Row-oriented systems have traditionally been used to fetch data stored in a database (Grover et al., 2015).  This type of data retrieval suits analyses that rely on fetching all fields for records belonging to a specific time range, and it is efficient when all columns of a record are available at the time of writing, because the record can be written with a single disk seek.  Columnar storage has more recently been used to fetch data, and it offers four main benefits over row-oriented systems (Grover et al., 2015).  First, it skips I/O and decompression on columns that are not part of the query.  Second, columnar storage works better for queries that access only a small subset of columns, whereas row-oriented storage suits queries that retrieve many columns.  Third, compression on columns is more efficient because data within the same column is more similar than data within a block of rows.  Fourth, columnar storage is more appropriate for data-warehousing applications where aggregations are implemented over specific columns rather than over extensive collections of whole records (Grover et al., 2015).  Hadoop applications use columnar file formats including the RCFile format, Optimized Row Columnar (ORC), and Parquet.  The RCFile format has been used as a Hive format and was developed to provide fast data loading, fast query processing, and highly efficient storage space utilization; it breaks files into row splits and, within each split, uses columnar-oriented storage.  Despite its query and compression performance advantages over SequenceFiles, RCFile has limitations that prevent optimal query times and compression.  The newer columnar formats, ORC and Parquet, are designed to address many of the limitations of RCFile (Grover et al., 2015). 

Compression Consideration

Compression is another data storage consideration because it plays a crucial role in reducing storage requirements and improving data processing performance (Grover et al., 2015).  Some compression formats supported on Hadoop are not splittable (Grover et al., 2015).  Because the MapReduce framework splits data for input to multiple tasks, a nonsplittable compression format is an obstacle to efficient processing, so splittability is a critical consideration when selecting the compression format and file format for Hadoop.  Compression types for Hadoop include Snappy, LZO, Gzip, and bzip2.  Google developed Snappy for speed; it does not offer the best compression size, is not inherently splittable, and is therefore designed to be used with a container format like SequenceFile or Avro.  Snappy is distributed with Hadoop.  Like Snappy, LZO is optimized for speed rather than size; unlike Snappy, LZO supports splitting compressed files, but it requires indexing, is not distributed with Hadoop, and requires a license and separate installation.  Gzip, like Snappy, provides good compression performance, but it is not splittable and should be used with a container format; its read speed is comparable to Snappy's, while its write speed is slower, and using smaller blocks with Gzip can result in better performance.   The bzip2 codec provides good compression performance but can be slower than other codecs such as Snappy, so it is not an ideal codec for Hadoop storage; unlike Snappy and Gzip, however, bzip2 is inherently splittable because it inserts synchronization markers between blocks, and it can be used for active archival purposes (Grover et al., 2015).

The compression format can become splittable when used with container file formats such as Avro or SequenceFile, which compress blocks of records or each record individually (Grover et al., 2015).  If compression is applied to the entire file without a container file format, a compression format that inherently supports splitting, such as bzip2, must be used.  Three recommendations apply to using compression with Hadoop (Grover et al., 2015).  The first is to enable compression of MapReduce intermediate output, which improves performance by decreasing the amount of intermediate data that needs to be read from and written to disk (a configuration sketch follows Figure 9).  The second is to pay attention to the order of the data: when similar data is close together, it compresses better, because data in Hadoop file formats is compressed in chunks and the organization of those chunks determines the final compression.   The last recommendation is to consider a compact file format with support for splittable compression, such as Avro.  Avro and SequenceFiles support splittability even with non-splittable compression formats: a single HDFS block can contain multiple Avro or SequenceFile blocks, and each block can be compressed and decompressed individually and independently of any other block, which makes the data splittable.  Figure 9 shows the Avro and SequenceFile splittability support (Grover et al., 2015).  


Figure 9.  Compression Example Using Avro (Grover et al., 2015).
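
As a hedged sketch of the first recommendation, the following shows how Snappy compression of intermediate and final output could be enabled through standard Hadoop configuration keys, set here from PySpark via the spark.hadoop.* prefix; the exact settings appropriate for a given cluster may differ.

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("compression-settings")
        # compress map output written between the map and reduce phases
        .set("spark.hadoop.mapreduce.map.output.compress", "true")
        .set("spark.hadoop.mapreduce.map.output.compress.codec",
             "org.apache.hadoop.io.compress.SnappyCodec")
        # compress final job output, ideally inside a splittable container format
        .set("spark.hadoop.mapreduce.output.fileoutputformat.compress", "true")
        .set("spark.hadoop.mapreduce.output.fileoutputformat.compress.codec",
             "org.apache.hadoop.io.compress.SnappyCodec"))

sc = SparkContext(conf=conf)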

Design Consideration for HDFS Schema

HDFS and HBase are the most commonly used storage managers in the Hadoop ecosystem.  Organizations can store data in HDFS directly or in HBase, which internally stores it on HDFS (Grover et al., 2015).  When storing data in HDFS, some design techniques must be taken into consideration.  The schema-on-read model of Hadoop does not impose any requirements when loading data: data can be ingested into HDFS by any of several methods without having to associate a schema or preprocess the data.  Although Hadoop has been used to load many types of data, such as unstructured and semi-structured data, some order is still required, because Hadoop serves as a central location for the entire organization and the data stored in HDFS is intended to be shared across various departments and teams (Grover et al., 2015).  The data repository should therefore be carefully structured and organized to provide several benefits to the organization (Grover et al., 2015).   With a standard directory structure, it becomes easier to share data among teams working with the same data sets.  Data is staged in a separate location before it is processed, and a standard staging convention helps avoid processing data that has not been fully or properly staged.  A standard organization of data also allows reuse of some of the code that processes it (Grover et al., 2015), and well-defined assumptions about data placement help simplify the process of loading data into Hadoop.   The HDFS data model design for projects such as data warehouse implementations is likely to use structured fact and dimension tables similar to a traditional schema (Grover et al., 2015), whereas the design for projects with unstructured and semi-structured data is likely to focus on directory placement and metadata management (Grover et al., 2015). 

Grover et al. (2015) suggested three key considerations when designing the schema, regardless of the data model design project.  The first is to develop standard practices that can be followed by all teams.  The second is to ensure the design works well with the chosen tools; for instance, if the version of Hive in use can only support table partitions on directories that are named a certain way, this will affect the schema design and the names of the table subdirectories.  The last consideration is to keep usage patterns in mind, because different data processing and querying patterns work better with different schema designs (Grover et al., 2015). 

HDFS Files Location Consideration

            The first step in designing an HDFS schema is determining the location of the files.  Standard file locations play a significant role in finding and sharing data among various departments and teams.  They also help in assigning access permissions to various groups and users.  The recommended file locations are summarized in Table 2.


Table 2.  Standard Files Locations.

HDFS Schema Design Consideration

The HDFS schema design involves advanced techniques for organizing data into files (Grover et al., 2015).   A few strategies are recommended for organizing a data set: partitioning, bucketing, and denormalization.  Partitioning a data set is a common technique used to reduce the amount of I/O required to process it.  Unlike a traditional data warehouse, HDFS does not store indexes on the data.  This lack of indexes speeds up data ingest, but it comes at the cost of a full table scan, where every query has to read the entire dataset even when processing only a small subset of the data. Breaking the data set into smaller subsets, or partitions, addresses this problem by allowing queries to read only the specific partitions they need, reducing the amount of I/O and improving query processing time significantly (Grover et al., 2015). When data is placed in the filesystem, the directory format for partitions should be as shown below.  In this example, the order data sets are partitioned by date because a large number of orders are placed daily, so the partitions will contain files large enough to be handled efficiently by HDFS.  Various tools such as HCatalog, Hive, Impala, and Pig understand this directory structure and leverage the partitioning to reduce the amount of I/O required during data processing (Grover et al., 2015).

  • <data set name>/<partition_column_name=partition_column_value>/
  • e.g., medication_orders/date=20181107/[order1.csv, order2.csv]
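
The following minimal PySpark sketch shows one way such a date-partitioned layout could be produced with partitionBy; the dataset name, column names, and paths are illustrative assumptions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-orders").getOrCreate()

orders = spark.read.parquet("hdfs:///healthcare/processed/medication_orders")

# Each distinct value of "date" becomes a directory such as .../date=20181107/
(orders.write
       .mode("overwrite")
       .partitionBy("date")
       .option("header", True)
       .csv("hdfs:///healthcare/warehouse/medication_orders"))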

Bucketing is another technique for breaking a large data set into manageable subsets (Grover et al., 2015).  Bucketing is similar to the hash partitioning used in relational databases.   The partitioning example above used the date, which resulted in data files large enough to be handled efficiently by HDFS (Grover et al., 2015).  However, if the data sets were partitioned by the physician's category instead, the result would be too many small files.  This small-files problem can lead to excessive memory use on the NameNode, since the metadata for each file stored in HDFS is held in memory (Grover et al., 2015), and it can also lead to many processing tasks and excessive processing overhead.  The solution for too many small files is to bucket the data, in this example by physician, using a hashing function that maps physicians into a specified number of buckets (Grover et al., 2015).

The bucketing technique controls the size of the data subsets and optimizes query speed (Grover et al., 2015).  The recommended average bucket size is a few multiples of the HDFS block size, and using a number of buckets that is a power of two is common practice.  It is essential that the data distributes well when hashed on the bucketing column, because that results in consistent bucketing (Grover et al., 2015).   Bucketing also helps when joining two data sets.  The join, in this case, represents the general idea of combining two data sets to retrieve a result, and joins can be implemented through SQL-on-Hadoop systems as well as in MapReduce, Spark, or other programming interfaces to Hadoop.  When joining bucketed data sets, corresponding buckets can be joined individually without joining the entire data sets, which reduces the time complexity of the computationally expensive reduce-side join (Grover et al., 2015).   The join can instead be implemented in the map stage of a MapReduce job by loading the smaller of the corresponding buckets into memory, because the buckets are small enough to fit easily; this is called a map-side join, and it improves join performance compared with a reduce-side join.  Hive recognizes that the tables are bucketed and optimizes the process accordingly.

Further optimization is possible if the data in each bucket is sorted: a merge join can then be used, and the entire bucket does not have to be held in memory during the join, resulting in a faster process that uses far less memory than a simple bucket join.  Hive supports this optimization as well.  Using both sorting and bucketing, with the join key as the bucketing column, is recommended for large tables that are frequently joined together (Grover et al., 2015).  A sketch of the corresponding Hive table definition follows.
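As a sketch of how this might look in practice, the Java snippet below submits an illustrative bucketed and sorted table definition through the Hive JDBC driver; the HiveServer2 URL, credentials, table name, columns, and bucket count are all assumptions for the example, not values taken from the cited design.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class BucketedTableDdl {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
                 Statement stmt = conn.createStatement()) {

                // Bucket and sort on the join key so Hive can use a sort-merge bucket join.
                stmt.execute(
                    "CREATE TABLE IF NOT EXISTS medication_orders_bucketed ("
                  + "  order_id BIGINT, physician_id STRING, drug STRING, qty INT) "
                  + "PARTITIONED BY (order_date STRING) "
                  + "CLUSTERED BY (physician_id) SORTED BY (physician_id) INTO 16 BUCKETS "
                  + "STORED AS ORC");

                // Ask Hive to prefer bucket map joins and their sort-merge variant.
                stmt.execute("SET hive.optimize.bucketmapjoin=true");
                stmt.execute("SET hive.optimize.bucketmapjoin.sortedmerge=true");
            }
        }
    }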

The schema design depends on how the data will be queried (Grover et al., 2015).  Thus, the columns to be used for joining and filtering must be identified before the partitioning and bucketing of the data is implemented.  When identifying a single partitioning key is difficult, the same data set can be stored multiple times, each copy with a different physical organization, which would be regarded as an anti-pattern in a relational database.  This solution works in Hadoop, however, because Hadoop data is write-once and few updates are expected, so the overhead of keeping the duplicated data sets in sync is reduced, and the cost of storage in Hadoop clusters is low as well (Grover et al., 2015).  Duplicated data sets kept in sync provide better query performance in such cases (Grover et al., 2015).

Denormalization is another technique that trades disk space for query performance by minimizing the need to join entire data sets (Grover et al., 2015).  In the relational database model, data is stored in third normal form (3NF), where redundancy is minimized and data integrity is enforced by splitting data into smaller tables, each holding a particular entity.  In that model, most queries require joining a large number of tables to produce the desired result (Grover et al., 2015).  In Hadoop, however, joins are often the slowest operations and consume the most cluster resources.  Specifically, the reduce-side join requires sending the entire table over the network, which is computationally costly.  While sorting and bucketing help minimize this cost, another solution is to create data sets that are pre-joined or pre-aggregated (Grover et al., 2015).  The data can be joined once and stored in that form instead of running the join every time the data is queried.  A Hadoop schema often consolidates many small dimension tables into a few larger dimensions by joining them during the ETL process (Grover et al., 2015).  Other techniques to speed up processing include aggregation and data type conversion.  Duplication of the data is of less concern; when a large number of queries repeat the same processing, it is recommended to do that processing once and reuse the result, as with a materialized view in a relational database.  In Hadoop, a new dataset is created that contains the same data in its aggregated form (Grover et al., 2015).  A sketch of pre-joining during ETL follows.
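One way to materialize such a pre-joined data set during ETL is a CREATE TABLE AS SELECT statement in Hive, sketched below through the Hive JDBC driver; the table names, join keys, and connection details are illustrative assumptions.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class DenormalizeDuringEtl {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
                 Statement stmt = conn.createStatement()) {

                // Join the dimension tables into the fact table once, during ETL,
                // so later queries read the pre-joined result instead of repeating the join.
                stmt.execute(
                    "CREATE TABLE IF NOT EXISTS orders_denormalized STORED AS ORC AS "
                  + "SELECT o.order_id, o.order_date, o.qty, "
                  + "       p.physician_name, p.specialty, d.drug_name, d.manufacturer "
                  + "FROM medication_orders o "
                  + "JOIN physicians p ON o.physician_id = p.physician_id "
                  + "JOIN drugs d ON o.drug_id = d.drug_id");
            }
        }
    }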

To summarize, partitioning reduces the I/O overhead of processing by selectively reading and writing data in particular partitions.  Bucketing can speed up queries that involve joins or sampling, again by reducing I/O.  Denormalization can be implemented to speed up Hadoop jobs.  This section reviewed advanced techniques for organizing data into files, including the preference for a small number of large files over a large number of small files, which Hadoop handles better.  It also contrasted the reduce-side and map-side join techniques; because the reduce-side join is computationally costly, the map-side join is preferred and recommended.

HBase Schema Design Consideration

HBase is not a relational database (Grover et al., 2015; Yang, Liu, Hsu, Lu, & Chu, 2013).  HBase is similar to a large hash table, which allows values to be associated with keys and performs a fast lookup of a value based on a given key (Grover et al., 2015).  The operations on this hash table involve put, get, scan, increment, and delete.  HBase provides scalability and flexibility and is useful in many applications, including fraud detection, which is a widespread use case for HBase (Grover et al., 2015).  A minimal put/get sketch follows.
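The Java sketch below shows the basic put and get operations through the HBase client API; it assumes an HBase 1.x/2.x client on the classpath and an hbase-site.xml that points at the cluster, and the table, column family, and row key values are illustrative.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBasePutGetExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("patient_orders"))) {

                byte[] rowKey = Bytes.toBytes("patient42|order1001");

                // put: associate column values with the row key.
                Put put = new Put(rowKey);
                put.addColumn(Bytes.toBytes("o"), Bytes.toBytes("drug"), Bytes.toBytes("amoxicillin"));
                put.addColumn(Bytes.toBytes("o"), Bytes.toBytes("qty"), Bytes.toBytes("30"));
                table.put(put);

                // get: fast lookup of the stored values for that key.
                Result result = table.get(new Get(rowKey));
                String drug = Bytes.toString(
                    result.getValue(Bytes.toBytes("o"), Bytes.toBytes("drug")));
                System.out.println("drug = " + drug);
            }
        }
    }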

The framework of HBase involves the Master Server, Region Servers, the Write-Ahead Log (WAL), the Memstore, HFiles, the API, and Hadoop HDFS (Bhojwani & Shah, 2016).  Each component of the HBase framework plays a significant role in data storage and processing.  Figure 10 illustrates the HBase framework.


Figure 10.  HBase Architecture (Bhojwani & Shah, 2016).

            The following considerations must be taken into account when designing a schema for HBase (Grover et al., 2015).

  • Row Key Consideration.
  • Timestamp Consideration.
  • Hops Consideration.
  • Tables and Regions Consideration.
  • Columns Use Consideration.
  • Column Families Use Consideration.
  • Time-To-Live Consideration.

The row key is one of the most critical factors in a well-architected HBase schema design (Grover et al., 2015).  Row key considerations involve record retrieval, distribution, the block cache, the ability to scan, size, readability, and uniqueness.  The row key is critical for retrieving records from HBase.  In a relational database, a composite key can combine multiple columns into a primary key; in HBase, multiple pieces of information can likewise be combined into a single key.  For instance, a key composed of customer_id, order_id, and timestamp can serve as the row key for a row describing an order.  These would be three different columns in a relational database, but in HBase they are combined into a single unique identifier.  Another consideration in selecting the row key is the get operation, because a get of a single record is the fastest operation in HBase.  Designing the row key so that a single get retrieves the most commonly used data improves performance; this requires putting much information into a single record, which is called a denormalized design.  For instance, while a relational database would spread customer information across various tables, in HBase all customer information can be stored in a single record retrieved with one get operation.

Distribution is another consideration for HBase schema design.  The row key determines how the rows of a given table are scattered across the regions of the HBase cluster (Grover et al., 2015; Yang et al., 2013).  The row keys are sorted, and each region stores a range of these sorted row keys (Grover et al., 2015).  Each region is pinned to a region server, that is, a node in the cluster (Grover et al., 2015).  For machine data, a combination of device ID and timestamp or reverse timestamp is commonly used to “salt” the key (Grover et al., 2015).  The block cache is a least recently used (LRU) cache that keeps data blocks in memory (Grover et al., 2015).  By default, HBase reads records from disk in chunks of 64 KB, each called an HBase block; when a block is read from disk, it is placed into the block cache (Grover et al., 2015).

The choice of the row key also affects scan operations.  HBase scan rates are about eight times slower than HDFS scan rates, so reducing I/O requirements has a significant performance advantage.  The size of the row key affects workload performance: a short row key is better than a long one because it has lower storage overhead and faster read/write performance.  The readability of the row key also matters, so it is advisable to start with a human-readable row key.  Finally, uniqueness of the row key is critical, since a row key is the equivalent of a key in the hash table analogy.  If the row key is based on a non-unique attribute, the application should handle collisions and only put data into HBase with a unique row key (Grover et al., 2015).  A sketch of a composite, salted row key follows.
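The sketch below builds one possible composite row key in Java; the field layout (a small salt, customer and order identifiers, and a reverse timestamp) and the bucket count are illustrative assumptions rather than a prescribed format.

    import org.apache.hadoop.hbase.util.Bytes;

    public class RowKeyBuilder {
        private static final int NUM_SALT_BUCKETS = 16;

        // Composite key: <salt><customer_id|order_id|><reverse timestamp>.
        // The salt spreads sequential writes across regions, and the reverse
        // timestamp makes the newest order for a customer sort first.
        static byte[] buildRowKey(String customerId, String orderId, long epochMillis) {
            byte salt = (byte) ((customerId.hashCode() & Integer.MAX_VALUE) % NUM_SALT_BUCKETS);
            long reverseTimestamp = Long.MAX_VALUE - epochMillis;
            return Bytes.add(
                new byte[] { salt },
                Bytes.add(Bytes.toBytes(customerId + "|" + orderId + "|"),
                          Bytes.toBytes(reverseTimestamp)));
        }

        public static void main(String[] args) {
            byte[] key = buildRowKey("cust-0007", "ord-1001", System.currentTimeMillis());
            System.out.println("row key length = " + key.length + " bytes");
        }
    }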

The timestamp is the second essential consideration for good HBase schema design (Grover et al., 2015).  The timestamp determines which version of a record is newer when a put modifies the record, and it determines the order in which records are returned when multiple versions of a single record are requested.  The timestamp is also used to remove out-of-date records: the time-to-live (TTL) setting is compared with the timestamp, and the timestamp likewise indicates whether a record value has been overwritten by another put or deleted (Grover et al., 2015).

The term hop refers to the number of synchronized “get” requests required to retrieve a piece of data from HBase (Grover et al., 2015).  The fewer hops, the better, because of the overhead involved.  Although multi-hop requests can be made against HBase, it is best to avoid them through better schema design, for example by leveraging denormalization, because every hop is a round trip to HBase that carries a significant performance overhead (Grover et al., 2015).

The number of tables and the number of regions per table in HBase can negatively affect performance and data distribution if they are not chosen correctly, resulting in an imbalanced load (Grover et al., 2015).  Important points to keep in mind are that there is one region server per node, a region server hosts many regions, a given region is pinned to a particular region server, and tables are split into regions that are scattered across region servers.  A table must have at least one region.  All regions in a region server receive put requests and share the region server's memstore, a cache structure present on every HBase region server: it caches the writes sent to that region server and sorts them before flushing them when certain memory thresholds are reached.  Thus, the more regions a region server hosts, the less memstore space is available per region.  The default configuration sets the ideal flush size to 100 MB, so dividing the available memstore space by 100 MB gives an estimate of the maximum number of regions that can reasonably be placed on that region server.  Very large regions also take a long time to compact; the practical upper limit on region size is around 20 GB, although there are successful HBase clusters with regions upward of 120 GB.  Regions can be assigned to an HBase table using one of two techniques: create the table with a single default region that auto-splits as data grows, or create the table with a given number of regions and set the region size high enough, e.g., 100 GB per region, to avoid auto-splitting (Grover et al., 2015).  Figure 11 shows a topology of region servers, regions, and tables, and a rough sizing sketch follows the figure.


Figure 11.  The Topology of Region Servers, Regions, and Tables (Grover et al., 2015).
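The arithmetic below sketches the region-count estimate described above; the heap size, memstore fraction, and number of actively written column families are illustrative assumptions, while the 100 MB flush size matches the default mentioned in the text.

    public class RegionCountEstimate {
        public static void main(String[] args) {
            double regionServerHeapGb = 16.0;   // illustrative region server heap (-Xmx)
            double memstoreFraction   = 0.4;    // assumed share of heap reserved for memstores
            double flushSizeMb        = 100.0;  // ideal memstore flush size (default noted above)
            int    writeHeavyFamilies = 1;      // column families receiving writes

            double memstoreMb = regionServerHeapGb * 1024 * memstoreFraction;
            double maxRegions = memstoreMb / (flushSizeMb * writeHeavyFamilies);

            System.out.printf(
                "~%.0f MB of memstore space supports roughly %.0f actively written regions%n",
                memstoreMb, maxRegions);
            // 16 GB * 0.4 = about 6554 MB of memstore; 6554 / 100 = roughly 65 regions.
        }
    }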

The use of columns in HBase differs from a traditional relational database (Grover et al., 2015; Yang et al., 2013).  In HBase, one record can have a million columns and the next record can have a million completely different columns; this is not recommended, but it is possible (Grover et al., 2015).  HBase stores data in a format called HFile, where each column value gets its own row in the HFile (Grover et al., 2015; Yang et al., 2013).  That row holds fields such as the row key, timestamp, column name, and value.  This file format enables functionality such as versioning and sparse column storage (Grover et al., 2015).

HBase includes the concept of column families (Grover et al., 2015; Yang et al., 2013).  A column family is a container for columns, and a table can have one or more column families.  Each column family has its own set of HFiles and is compacted independently of the other column families in the same table.  In many cases, no more than one column family is needed per table.  More than one column family is appropriate when the access pattern, or the rate of change, for a subset of a table's columns differs from that of the other columns (Grover et al., 2015; Yang et al., 2013).  The last consideration for HBase schema design is the use of TTL, a built-in feature of HBase that ages out data based on its timestamp (Grover et al., 2015).  If TTL is not used and an aging requirement exists, a much more I/O-intensive purge operation would be needed.  The objects in HBase begin with the table object, followed by the regions for the table, a store per column family for each region, the memstore, store files, and blocks (Yang et al., 2013).  Figure 12 shows the hierarchy of objects in HBase, and a table-creation sketch that sets a column family TTL follows the figure.

Figure 12.  The Hierarchy of Objects in HBase (Yang et al., 2013).
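The sketch below creates an illustrative table with a single column family whose TTL is set to 90 days, using the HBase 2.x Admin API; the table name, family name, and TTL value are assumptions for the example.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.ColumnFamilyDescriptor;
    import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.TableDescriptor;
    import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
    import org.apache.hadoop.hbase.util.Bytes;

    public class CreateTableWithTtl {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Admin admin = connection.getAdmin()) {

                // One column family is usually enough; the TTL ages out cells
                // whose timestamp is older than 90 days.
                ColumnFamilyDescriptor family = ColumnFamilyDescriptorBuilder
                    .newBuilder(Bytes.toBytes("o"))
                    .setTimeToLive(90 * 24 * 60 * 60)   // seconds
                    .build();

                TableDescriptor table = TableDescriptorBuilder
                    .newBuilder(TableName.valueOf("patient_orders"))
                    .setColumnFamily(family)
                    .build();

                if (!admin.tableExists(table.getTableName())) {
                    admin.createTable(table);
                }
            }
        }
    }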

To summarize this section, HBase schema design requires seven key considerations, starting with the row key, which should be selected carefully with record retrieval, distribution, block cache, ability to scan, size, readability, and uniqueness in mind.  The timestamp and hops are further design considerations.  Tables and regions must be sized with put performance and compaction time in mind.  The use of columns and column families should also be considered, and TTL can be used to remove data that has aged out.

Metadata Consideration

The discussion so far has been about the data and the techniques for storing it in Hadoop.  Metadata is as essential as the data itself; metadata is data about the data (Grover et al., 2015).  The Hadoop ecosystem holds various forms of metadata.  Metadata about logical datasets, usually stored in a separate metadata repository, includes information such as the location of a dataset (a directory in HDFS or an HBase table name), the schema associated with the dataset, its partitioning and sorting properties, and its format, e.g., CSV or SequenceFile (Grover et al., 2015).  Metadata about files on HDFS includes the permissions and ownership of those files and the locations of the various blocks on data nodes, usually stored and managed by the Hadoop NameNode (Grover et al., 2015).  Metadata about tables in HBase includes information such as table names, associated namespaces, associated attributes (e.g., MAX_FILESIZE, READONLY), and the names of column families, usually stored and managed by HBase itself (Grover et al., 2015).  Metadata about data ingest and transformation includes information such as which user generated a given dataset, where the dataset came from, how long it took to generate, and how many records it contains or the size of the data load (Grover et al., 2015).  Metadata about dataset statistics includes the number of rows in a dataset, the number of unique values in each column, a histogram of the distribution of the data, and maximum and minimum values (Grover et al., 2015).  Figure 13 summarizes these various types of metadata.


Figure 13.  Various Metadata in Hadoop.

Apache Hive was the first project in the Hadoop ecosystem to store, manage, and leverage metadata (Antony et al., 2016; Grover et al., 2015).  Hive stores this metadata in a relational database called the Hive metastore, and it provides a metastore service that interfaces with the metastore database (Antony et al., 2016; Grover et al., 2015).  When a query is submitted, Hive contacts the metastore to obtain the metadata for the query, the metastore returns that metadata, Hive generates an execution plan, the job is executed on the Hadoop cluster, and Hive sends the fetched results back to the user (Antony et al., 2016; Grover et al., 2015).  Figure 14 shows the query process and the role of the metastore in the Hive framework.


Figure 14.  Query Process and the Role of Metastore in Hive (Antony et al., 2016).

Other projects have taken advantage of the metadata concept introduced by Hive, and a separate project called HCatalog was created to enable the use of the Hive metastore outside of Hive (Grover et al., 2015).  HCatalog is a part of Hive and allows other tools, such as Pig and MapReduce, to integrate with the Hive metastore.  It also opens access to the Hive metastore to other tools through a REST API served by the WebHCat server.  MapReduce, Pig, and standalone applications can talk directly to the Hive metastore through its APIs, but HCatalog allows easy access through its WebHCat REST API and allows cluster administrators to lock down direct access to the Hive metastore to address security concerns.  Other ways to store metadata include embedding it in file paths and names, or storing it in HDFS in a hidden file, e.g., .metadata.  Figure 15 shows HCatalog as an accessibility veneer around the Hive metastore (Grover et al., 2015), and a sketch of a WebHCat call follows the figure.


Figure 15.  HCatalog acts as an accessibility veneer around the Hive metastore (Grover et al., 2015).
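As a rough sketch of the REST access that WebHCat provides, the snippet below lists the tables of the default database; it assumes a WebHCat (Templeton) server on its default port 50111 and an unsecured cluster where the user is passed as a query parameter, and the host name and user are placeholders that would change in a real deployment.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    public class ListHiveTablesViaWebHCat {
        public static void main(String[] args) throws Exception {
            // List the tables registered in the Hive metastore for the "default" database.
            URL url = new URL(
                "http://webhcat-host:50111/templeton/v1/ddl/database/default/table?user.name=etl");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("GET");

            try (BufferedReader reader = new BufferedReader(
                     new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);   // JSON listing of table names
                }
            } finally {
                conn.disconnect();
            }
        }
    }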

Hive Metastore and HCatalog Limitations

There are some limitations to the Hive metastore and HCatalog, including challenges with high availability (Grover et al., 2015).  HA database cluster solutions can be used to bring high availability to the Hive metastore database.  For the Hive metastore service, running multiple metastore instances concurrently on more than one node in the cluster is supported; however, concurrency issues related to data definition language (DDL) operations can occur, and the Hive community is working on fixing these issues.

The fixed schema for metadata is another limitation.  Hadoop provides great flexibility in the types of data that can be stored, mainly because of the Schema-on-Read concept, yet the Hive metastore imposes a fixed schema on the metadata itself and provides only a tabular abstraction for the datasets.  The metastore is also another moving part in the infrastructure that must be kept running and secured as part of the Hadoop deployment (Grover et al., 2015).

Conclusion

This project has discussed essential topics related to Hadoop technology.  It began with an overview of Hadoop, providing a history of Hadoop and the difference between Hadoop 1.x and Hadoop 2.x.  The discussion covered the Big Data Analytics process using Hadoop technology, which involves six significant steps, starting with problem identification, identification of the required data, and the data collection process.  Data preprocessing and the ETL process must be implemented before performing the analytics, and the last step is the visualization of the data for decision making.  Before collecting or processing any data for storage, considerations must be weighed for data preprocessing, modeling, and schema design in Hadoop to achieve better processing and better data retrieval, given that some tools can split the data while others cannot.  These considerations begin with the data storage format, followed by Hadoop file types and the challenges of XML and JSON formats in Hadoop.  Compression must also be considered when designing a schema for Hadoop.  Since HDFS and HBase are commonly used in Hadoop for data storage, the discussion covered HDFS and HBase schema design considerations.  In summary, the design of the schema for Hadoop, HDFS, and HBase makes a real difference in how data is stored across nodes and whether the right tools can split it.  Organizations must therefore pay attention to the process and design requirements before storing data in Hadoop to achieve better computational processing.

References

Alguliyev, R., & Imamverdiyev, Y. (2014). Big data: big promises for information security. Paper presented at the Application of Information and Communication Technologies (AICT), 2014 IEEE 8th International Conference on.

Ankam, V. (2016). Big Data Analytics: Packt Publishing Ltd.

Antony, B., Boudnik, K., Adams, C., Lee, C., Shao, B., & Sasaki, K. (2016). Professional Hadoop: John Wiley & Sons.

Armstrong, D. (n.d.). R: Learning by Example: Lattice Graphics. Retrieved from https://quantoid.net/files/rbe/lattice.pdf.

Bhojwani, N., & Shah, A. P. V. (2016). A survey on Hadoop HBase system. Development, 3(1).

Dittrich, J., & Quiané-Ruiz, J.-A. (2012). Efficient big data processing in Hadoop MapReduce. Proceedings of the VLDB Endowment, 5(12), 2014-2015.

Grover, M., Malaska, T., Seidman, J., & Shapira, G. (2015). Hadoop Application Architectures: Designing Real-World Big Data Applications: O'Reilly Media, Inc.

Hu, H., Wen, Y., Chua, T.-S., & Li, X. (2014). Toward scalable systems for big data analytics: A technology tutorial. IEEE Access, 2, 652-687.

Karanth, S. (2014). Mastering Hadoop: Packt Publishing Ltd.

Lublinsky, B., Smith, K. T., & Yakubovich, A. (2013). Professional hadoop solutions: John Wiley & Sons.

sas.com. (2018). Hadoop – what it is and why it matters. Retrieved from https://www.sas.com/en_us/insights/big-data/hadoop.html.

Yang, C. T., Liu, J. C., Hsu, W. H., Lu, H. W., & Chu, W. C. C. (2013, 16-18 Dec. 2013). Implementation of Data Transform Method into NoSQL Database for Healthcare Data. Paper presented at the 2013 International Conference on Parallel and Distributed Computing, Applications and Technologies.

 

Abstract

The purpose of this project is to discuss how data can be handled before Hadoop breaks it into manageable sizes.  The discussion begins with an overview of Hadoop, providing a brief history and the difference between Hadoop 1.x and Hadoop 2.x.  It then covers the Big Data Analytics process using Hadoop, which involves six significant steps, including data preprocessing and the ETL process, where the data must be converted and cleaned before it is processed.  Before data processing, several considerations must be weighed for data preprocessing, modeling, and schema design in Hadoop to achieve better processing and data retrieval, because these choices affect how data can be split among the various nodes in the distributed environment, and not all tools can split the data.  These considerations begin with the data storage format, followed by Hadoop file types and the challenges of XML and JSON formats in Hadoop.  The compression of the data must be considered carefully because not all compression types are “splittable.” The discussion also covers the schema design considerations for HDFS and HBase, since they are used often in the Hadoop ecosystem.

Keywords: Big Data Analytics; Hadoop; Data Modelling in Hadoop; Schema Design in Hadoop.

Introduction

In the age of Big Data, dealing with large datasets of terabytes and petabytes is a reality and requires specific technology, because traditional technology has been found inadequate for it (Dittrich & Quiané-Ruiz, 2012).  Hadoop was developed to store and process such large datasets efficiently and is becoming the data processing engine for Big Data (Dittrich & Quiané-Ruiz, 2012).  One of the significant advantages of Hadoop MapReduce is that it allows non-expert users to run analytical tasks easily over Big Data (Dittrich & Quiané-Ruiz, 2012).  However, before the analytical process takes place, some schema design and data modeling considerations must be addressed so that data processing in Hadoop can be efficient (Grover, Malaska, Seidman, & Shapira, 2015).  Hadoop requires splitting the data, and some tools can split the data natively while others cannot and require integration (Grover et al., 2015).

This project discusses these considerations to ensure an appropriate schema design for Hadoop and its components HDFS and HBase, where the data is stored in a distributed environment.  The discussion begins with an overview of Hadoop, followed by the data analytics process, and ends with the data modeling techniques and considerations for Hadoop that assist in splitting the data appropriately for better data processing performance and better data retrieval.

Overview of Hadoop

            Google published its MapReduce technique and implementation around 2004 (Karanth, 2014).  It also introduced the Google File System (GFS), which is associated with the MapReduce implementation.  MapReduce has since become the most common technique for processing massive data sets in parallel and distributed settings across many companies (Karanth, 2014).  In 2008, Yahoo released Hadoop as an open-source implementation of the MapReduce framework (Karanth, 2014; sas.com, 2018).  Hadoop and its file system, HDFS, are inspired by Google's MapReduce and GFS (Ankam, 2016; Karanth, 2014).

Apache Hadoop is the parent project for all subsequent Hadoop projects (Karanth, 2014).  It contained three essential early branches: the 0.20.1, 0.20.2, and 0.21 branches.  The later 0.23 branch is the one often termed MapReduce v2.0, MRv2, or Hadoop 2.0.  Two additional releases, Hadoop-0.20-append and Hadoop-0.20-Security, introduced HDFS append and security-related features into Hadoop, respectively.  The timeline for Hadoop technology is outlined in Figure 1.


Figure 1.  Hadoop Timeline from 2003 until 2013 (Karanth, 2014).

Hadoop version 1.0 marked the inception and evolution of Hadoop as a simple MapReduce job-processing framework (Karanth, 2014).  It exceeded expectations, with wide adoption for massive data processing.  The stable 1.x release includes features such as append and security.  The Hadoop 2.0 release came out in 2013 to increase the efficiency and mileage of existing Hadoop clusters in enterprises.  Hadoop has been evolving from a MapReduce-only framework into a general cluster-computing and storage platform, moving beyond MapReduce to stay at the forefront of massive-scale data processing while facing the challenge of remaining backward compatible (Karanth, 2014).

            In Hadoop 1.x, the JobTracker was responsible for resource allocation and job execution (Karanth, 2014).  MapReduce was the only supported model, since the computing model was tied to the resources in the cluster.  Yet Another Resource Negotiator (YARN) was developed to separate the concerns of resource management and application execution, which enables other application paradigms to be added to a Hadoop computing cluster.  The support for diverse applications results in efficient and effective utilization of the resources and integrates well with the infrastructure of the business (Karanth, 2014).  YARN maintains backward compatibility with the Hadoop 1.x APIs (Karanth, 2014); thus, old MapReduce programs can still execute on YARN with no code changes, although they have to be recompiled (Karanth, 2014).

            YARN abstracts the resource management functions into a platform layer called the ResourceManager (RM) (Karanth, 2014).  Every cluster must have an RM to keep track of cluster resource usage and activity.  The RM is also responsible for allocating resources and resolving contention among resource seekers in the cluster.  The RM uses a generalized resource model and is agnostic to application-specific resource needs; for example, it does not need to know the resources corresponding to a single Map or Reduce slot (Karanth, 2014).  Figure 2 shows Hadoop 1.x and Hadoop 2.x with the YARN layer.


Figure 2. Hadoop 1.x vs. Hadoop 2.x (Karanth, 2014).

Hadoop 2.x also involves various enhancements at the storage layer.  These include a high-availability feature that provides a hot standby NameNode (Karanth, 2014); when the active NameNode fails, the standby can become the active NameNode in a matter of minutes.  ZooKeeper or another HA monitoring service can be used to detect NameNode failure, and the failover process that promotes the hot standby to active NameNode is triggered with the assistance of ZooKeeper.  HDFS federation is another enhancement in Hadoop 2.x: a more generalized storage model in which block storage has been generalized and separated from the filesystem layer (Karanth, 2014).  HDFS snapshots are a further enhancement, providing a read-only image of an entire filesystem or a particular subset of it to protect against user errors and to support backup and disaster recovery.  Other enhancements added in Hadoop 2.x include Protocol Buffers (Karanth, 2014); the wire protocol for RPCs within Hadoop is based on Protocol Buffers.  Hadoop 2.x is also aware of the type of storage and exposes this information to applications so that they can optimize data fetch and placement strategies (Karanth, 2014).  HDFS append support has been another enhancement in Hadoop 2.x.

Hadoop is regarded as the de facto open-source framework for large-scale, massively parallel, and distributed data processing (Karanth, 2014).  The Hadoop framework includes two layers: a computation layer and a data layer (Karanth, 2014).  The computation layer is used for parallel and distributed processing, while the data layer provides highly fault-tolerant data storage associated with the computation layer.  Both layers run on commodity hardware, which is inexpensive, readily available, and compatible with other similar hardware (Karanth, 2014).

Hadoop Architecture

Apache Hadoop has four projects: Hadoop Common, the Hadoop Distributed File System (HDFS), Yet Another Resource Negotiator (YARN), and MapReduce (Ankam, 2016).  HDFS is used to store data, MapReduce is used to process data, YARN is used to manage cluster resources such as CPU and memory, and Hadoop Common provides the common utilities that support the framework (Ankam, 2016; Karanth, 2014).  Apache Hadoop integrates with other tools such as Avro, Hive, Pig, HBase, ZooKeeper, and Apache Spark (Ankam, 2016; Karanth, 2014).

            Hadoop has three significant components for Big Data Analytics.  HDFS is a framework for reliable distributed data storage (Ankam, 2016; Karanth, 2014), and some considerations must be taken into account when storing data in HDFS (Grover et al., 2015).  The frameworks for parallel processing of data include MapReduce, Crunch, Cascading, Hive, Tez, Impala, Pig, Mahout, Spark, and Giraph (Ankam, 2016; Karanth, 2014).  The Hadoop architecture includes NameNodes and DataNodes.  It also includes Oozie for workflow, Pig for scripting, Mahout for machine learning, Hive for data warehousing, Sqoop for data exchange, and Flume for log collection.  YARN, as discussed earlier, provides distributed resource management in Hadoop 2.0, HCatalog provides Hadoop metadata management, HBase serves as a columnar database, and ZooKeeper provides coordination (Alguliyev & Imamverdiyev, 2014).  Figure 3 shows the Hadoop ecosystem components.


Figure 3.  Hadoop Architecture (Alguliyev & Imamverdiyev, 2014).

Big Data Analytics Process Using Hadoop

The process of Big Data Analytics involves six essential steps (Ankam, 2016).  The first step is the identification of the business problem and the desired outcomes.  Examples of business problems include declining sales, shopping carts abandoned by customers, or a sudden rise in call volumes; examples of outcomes include improving the buying rate by 10%, decreasing shopping cart abandonment by 50%, and reducing call volume by 50% by next quarter while keeping customers happy.  The second step is identifying the required data, whose sources can be a data warehouse using online analytical processing, an application database using online transactional processing, log files from servers, documents from the internet, sensor-generated data, and so forth, depending on the case and the problem.  Data collection is the third step (Ankam, 2016).  The Sqoop tool can be used to collect data from relational databases, Flume can be used for streaming data, and Apache Kafka can be used for reliable intermediate storage.  Data collection and its design should be implemented with a fault-tolerance strategy (Ankam, 2016).

Data preprocessing and the ETL process form the fourth step.  The collected data comes in various formats, and data quality can be an issue; before processing, the data needs to be converted to the required format and cleaned of inconsistent, invalid, or corrupted records.  Apache Hive, Apache Pig, and Spark SQL can be used for preprocessing massive amounts of data.  The fifth step is the analytics implementation, which should be designed to answer the business questions and problems.  The analytical process requires understanding the data and the relationships between data points.  Descriptive and diagnostic analytics present past and current views of the data to answer questions such as what happened and why it happened, while predictive analytics answers questions such as what would happen based on a hypothesis.  Apache Hive, Pig, Impala, Drill, Tez, Apache Spark, and HBase can be used for data analytics in batch mode, and real-time tools including Impala, Tez, Drill, and Spark SQL can be integrated with traditional business intelligence (BI) tools such as Tableau, QlikView, and others for interactive analytics.  The last step is the visualization of the data, which presents the analytics output in graphical or pictorial form so that the analysis can be better understood for decision making.  The finished data is either exported from Hadoop to a relational database using Sqoop for integration with visualization systems, or visualization tools such as Tableau, QlikView, and Excel are integrated directly.  Web-based notebooks such as Jupyter, Zeppelin, and Databricks Cloud are also used to visualize data by integrating Hadoop and Spark components (Ankam, 2016).

Data Preprocessing, Modeling and Design Consideration in Hadoop

            Before processing any data, and before collecting any data for storage, some considerations must be weighed for data modeling and design in Hadoop to achieve better processing and better retrieval (Grover et al., 2015).  The traditional data management approach is referred to as a Schema-on-Write system, which requires the schema of the data store to be defined before the data is loaded (Grover et al., 2015).  This approach results in long analysis cycles of data modeling, data transformation, loading, and testing before the data can be accessed (Grover et al., 2015).  Moreover, if anything changes or a wrong decision was made, the cycle must start again from the beginning, further delaying processing (Grover et al., 2015).  This section addresses the various considerations to weigh before processing data in Hadoop for analytical purposes.

Data Pre-Processing Consideration

A dataset may vary in quality with respect to noise, redundancy, and consistency (Hu, Wen, Chua, & Li, 2014).  Preprocessing techniques that improve data quality should therefore be in place in Big Data systems (Hu et al., 2014; Lublinsky, Smith, & Yakubovich, 2013).  Data preprocessing involves three techniques: data integration, data cleansing, and redundancy elimination.

Data integration techniques are used to combine data residing in different sources and provide users with a unified view of the data (Hu et al., 2014).  The traditional database approach has well-established data integration methods, including the data warehouse method and the data federation method (Hu et al., 2014).  The data warehouse approach is also known as ETL, consisting of extraction, transformation, and loading (Hu et al., 2014).  The extraction step involves connecting to the source systems and selecting and collecting the required data to be processed for analytical purposes.  The transformation step applies a series of rules to the extracted data to convert it into a standard format, and the load step imports the extracted and transformed data into a target storage infrastructure (Hu et al., 2014).  The federation approach creates a virtual database to query and aggregate data from various sources (Hu et al., 2014); the virtual database contains information, or metadata, about the actual data and its location but does not contain the data itself (Hu et al., 2014).  Both approaches are store-and-pull techniques, which are not well suited to Big Data processing given its heavy computation, high-volume streaming, and dynamic nature (Hu et al., 2014).

The data cleansing process is vital for keeping data consistent and up to date, and it is widely used in fields such as banking, insurance, and retailing (Hu et al., 2014).  Cleansing is required to identify incomplete, inaccurate, or unreasonable data and then remove or correct it to improve data quality (Hu et al., 2014).  The data cleansing process includes five steps (Hu et al., 2014): define and determine the error types; search for and identify error instances; correct the errors; document the error instances and error types; and modify data entry procedures to reduce future errors.  Various checks must be performed during cleansing, including format checks, completeness checks, reasonableness checks, and limit checks (Hu et al., 2014); a small sketch of such checks follows.  Data cleansing improves the accuracy of the analysis (Hu et al., 2014), but it depends on a complex relationship model and introduces extra computation and delay overhead (Hu et al., 2014).  Organizations must seek a balance between the complexity of the data-cleansing model and the resulting improvement in the accuracy of the analysis (Hu et al., 2014).
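The plain-Java sketch below illustrates the kinds of checks listed above on an assumed comma-separated record layout of patient_id, order_date, and quantity; the layout, the date format, and the quantity limits are illustrative assumptions.

    import java.util.Arrays;
    import java.util.List;
    import java.util.regex.Pattern;

    public class RecordCleanser {
        private static final Pattern DATE_FORMAT = Pattern.compile("\\d{8}");   // yyyymmdd

        // Record layout (illustrative): patient_id,order_date,qty
        static boolean isValid(String record) {
            String[] f = record.split(",", -1);
            if (f.length != 3) return false;                          // completeness check
            if (f[0].isEmpty()) return false;                         // completeness check
            if (!DATE_FORMAT.matcher(f[1]).matches()) return false;   // format check
            try {
                int qty = Integer.parseInt(f[2]);
                return qty > 0 && qty <= 1000;                        // limit / reasonableness check
            } catch (NumberFormatException e) {
                return false;
            }
        }

        public static void main(String[] args) {
            List<String> raw = Arrays.asList(
                "P-17,20181107,30",      // valid
                "P-18,2018-11-07,30",    // bad date format
                ",20181107,30",          // missing patient id
                "P-19,20181107,-5");     // unreasonable quantity
            raw.stream().filter(RecordCleanser::isValid).forEach(System.out::println);
        }
    }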

Redundancy elimination is the third data preprocessing step.  Repeated data increases the overhead of data transmission and causes problems for storage systems, including wasted space, data inconsistency, data corruption, and reduced reliability (Hu et al., 2014).  Redundancy reduction methods include redundancy detection and data compression (Hu et al., 2014); data compression, however, imposes an extra computational burden during the compression and decompression processes (Hu et al., 2014).

Data Modeling and Design Consideration

A Schema-on-Write system is used when the application or structure is well understood and the data is frequently accessed through queries and reports on high-value data (Grover et al., 2015).  The term Schema-on-Read is used in the context of the Hadoop data management system (Ankam, 2016; Grover et al., 2015).  It refers to raw, unprocessed data that can be loaded into Hadoop and given the required structure at processing time, based on the requirements of the processing application (Ankam, 2016; Grover et al., 2015).  Schema-on-Read is used when the application or the structure of the data is not well understood (Ankam, 2016; Grover et al., 2015).  This approach brings agility to the process and provides valuable insights on data that was not previously accessible (Grover et al., 2015).

            Five factors must be considered before storing data in Hadoop for processing (Grover et al., 2015).  The first is the data storage format, as a number of file formats and compression formats are supported on Hadoop, and each format has strengths that make it better suited to specific applications.  Although the Hadoop Distributed File System (HDFS) is the building block of the Hadoop ecosystem used for storing data, several commonly used systems are implemented on top of HDFS, such as HBase for additional data access functionality and Hive for additional data management functionality, and these must be taken into account before storing data in Hadoop (Grover et al., 2015).  The second factor is multitenancy, a common approach in which clusters host multiple users, groups, and application types; multi-tenant clusters bring essential considerations for data storage.  The third factor is schema design, which should be considered even though Hadoop is schema-less (Grover et al., 2015); it involves the directory structures for data loaded into HDFS and for the output of data processing and analysis, as well as the schemas of objects stored in systems such as HBase and Hive.  The fourth factor is metadata management: metadata related to the stored data is often as important as the data itself, and understanding metadata management plays a significant role because it affects the accessibility of the data.  The fifth factor is security; decisions about securing the data involve authentication, fine-grained access control, and encryption, and these measures should be considered both for data at rest, when it is stored, and for data in motion, during processing (Grover et al., 2015).  Figure 4 summarizes these considerations before storing data in the Hadoop system.


Figure 4.  Considerations Before Storing Data into Hadoop.

Data Storage Format Considerations

            When architecting a solution on Hadoop, the method of storing the data is one of the essential decisions.  The primary considerations for data storage in Hadoop involve the file format, compression, and the data storage system (Grover et al., 2015).  The standard file formats fall into three types: text data, structured text data, and binary data.  Figure 5 summarizes these three standard file formats.


Figure 5.  Standard File Formats.

Text data is a widespread use of Hadoop, including log files such as weblogs and server logs (Grover et al., 2015).  Text data can come in many forms, such as CSV files, or as unstructured data such as emails.  Compression of such files is recommended, and the choice of compression is influenced by how the data will be used (Grover et al., 2015).  For instance, if the data is for archival, the most compact compression method can be used, while if the data is used in processing jobs such as MapReduce, a splittable format should be used (Grover et al., 2015).  A splittable format enables Hadoop to split files into chunks for processing, which is essential for efficient parallel processing (Grover et al., 2015).

In most cases, the use of a container format such as SequenceFile or Avro provides benefits that make it the preferred format for most file types, including text (Grover et al., 2015).  Among other benefits, these container formats provide functionality to support splittable compression (Grover et al., 2015).  Binary data, such as images, can be stored in Hadoop as well, and a container format such as SequenceFile is preferred for it.  If the splittable unit of binary data is larger than 64 MB, however, the data should be put into its own file rather than into a container format (Grover et al., 2015).

XML and JSON Format Challenges with Hadoop

Structured text data includes formats such as XML and JSON, which present unique challenges in Hadoop because splitting XML and JSON files for processing is not straightforward, and Hadoop does not provide a built-in InputFormat for either (Grover et al., 2015).  JSON presents more challenges than XML because no token is available to mark the beginning or end of a record.  When using these formats, two primary considerations apply.  First, a container format such as Avro should be used, because transforming the data into Avro provides a compact and efficient way to store and process it (Grover et al., 2015).  Second, a library for processing XML or JSON should be chosen; XMLLoader in Pig's PiggyBank library is an example for XML data, and the Elephant Bird project is an example for JSON data (Grover et al., 2015).

Hadoop File Types Considerations

            Several Hadoop-specific file formats were created to work well with MapReduce (Grover et al., 2015).  They include file-based data structures such as SequenceFiles, serialization formats such as Avro, and columnar formats such as RCFile and Parquet (Grover et al., 2015).  These file types share two essential characteristics that are important for Hadoop applications: splittable compression and agnostic compression.  The ability to split files plays a significant role in data processing and should not be underestimated when storing data in Hadoop, because it allows large files to be split for input to MapReduce and other types of jobs, which is a fundamental part of parallel processing and a key to leveraging Hadoop's data locality feature (Grover et al., 2015).  Agnostic compression is the ability to compress with any compression codec without readers having to know the codec, because the codec is stored in the header metadata of the file format (Grover et al., 2015).  Figure 6 summarizes these Hadoop-specific file formats and the two common characteristics of splittable and agnostic compression.


Figure 6. Three Hadoop File Types with the Two Common Characteristics.  

1.      SequenceFiles Format Consideration

The SequenceFile format is the most widely used of the Hadoop file-based formats.  A SequenceFile stores data as binary key-value pairs (Grover et al., 2015) and supports three record formats: uncompressed, record-compressed, and block-compressed.  Every SequenceFile uses a standard header containing basic metadata about the file, such as the compression codec used, key and value class names, user-defined metadata, and a randomly generated sync marker.  SequenceFiles are well supported in Hadoop but have limited support outside the Hadoop ecosystem, as they are only supported in Java.  A frequent use case for SequenceFiles is as a container for smaller files: storing a large number of small files in Hadoop can cause memory issues on the NameNode and excessive processing overhead, and packing smaller files into a SequenceFile makes their storage and processing more efficient because Hadoop is optimized for large files (Grover et al., 2015).  Other file-based formats include MapFiles, SetFiles, ArrayFiles, and BloomMapFiles; these formats offer a high level of integration for all forms of MapReduce jobs, including those run via Pig and Hive, because they were designed to work with MapReduce (Grover et al., 2015).  Figure 7 summarizes the three record formats within a SequenceFile, and a small packing sketch follows the figure.


Figure 7.  Three Formats for Records Stored within SequenceFile.
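The sketch below packs a directory of small files into one block-compressed SequenceFile using the Hadoop 2.x SequenceFile.Writer API; the input directory, output path, and choice of Snappy are illustrative assumptions.

    import java.io.File;
    import java.nio.file.Files;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.SequenceFile.CompressionType;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.SnappyCodec;

    public class PackSmallFiles {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Key = original file name, value = file contents; block compression
            // groups many small records into compressed blocks.
            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(new Path("packed/small_files.seq")),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(BytesWritable.class),
                    SequenceFile.Writer.compression(CompressionType.BLOCK, new SnappyCodec()))) {

                File[] smallFiles = new File("incoming").listFiles();
                if (smallFiles != null) {
                    for (File small : smallFiles) {
                        byte[] bytes = Files.readAllBytes(small.toPath());
                        writer.append(new Text(small.getName()), new BytesWritable(bytes));
                    }
                }
            }
        }
    }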

2.      Serialization Formats Consideration

Serialization is the process of turning data structures into bytes for storage or for transfer over the network (Grover et al., 2015).  Deserialization is the opposite process of converting a byte stream back into a data structure (Grover et al., 2015).  Serialization is a fundamental building block for distributed processing systems such as Hadoop because it allows data to be converted into a format that can be efficiently stored and transferred across a network connection (Grover et al., 2015).  Figure 8 summarizes the serialization and deserialization processes.


Figure 8.  Serialization Process vs. Deserialization Process.

Serialization appears in two aspects of data processing in a distributed system: data storage and interprocess communication through remote procedure calls (RPC) (Grover et al., 2015).  Hadoop uses Writables as its main serialization format, which is compact and fast but usable only from Java.  Other serialization frameworks have been increasingly used within the Hadoop ecosystem, including Thrift, Protocol Buffers, and Avro (Grover et al., 2015).  Avro is a language-neutral data serialization system (Grover et al., 2015) designed to address the main limitation of Hadoop's Writables, their lack of language portability.  Like Thrift and Protocol Buffers, Avro is described through a language-independent schema; unlike them, code generation in Avro is optional (Grover et al., 2015).  Table 1 provides a comparison between these serialization formats, and an Avro sketch follows it.

Table 1:  Comparison between Serialization Formats.
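The sketch below serializes a single record with Avro's GenericRecord API, which needs no generated code; the schema, field names, and output file are illustrative assumptions, and Snappy block compression is enabled to keep the file compact and splittable.

    import java.io.File;
    import org.apache.avro.Schema;
    import org.apache.avro.file.CodecFactory;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.DatumWriter;

    public class AvroSerializationExample {
        private static final String SCHEMA_JSON =
            "{\"type\":\"record\",\"name\":\"Order\",\"fields\":["
          + "{\"name\":\"order_id\",\"type\":\"long\"},"
          + "{\"name\":\"drug\",\"type\":\"string\"},"
          + "{\"name\":\"qty\",\"type\":\"int\"}]}";

        public static void main(String[] args) throws Exception {
            // The schema travels with the file, so readers in any language
            // can deserialize it without generated code.
            Schema schema = new Schema.Parser().parse(SCHEMA_JSON);

            GenericRecord order = new GenericData.Record(schema);
            order.put("order_id", 1001L);
            order.put("drug", "amoxicillin");
            order.put("qty", 30);

            DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(schema);
            try (DataFileWriter<GenericRecord> fileWriter = new DataFileWriter<>(datumWriter)) {
                fileWriter.setCodec(CodecFactory.snappyCodec());
                fileWriter.create(schema, new File("orders.avro"));
                fileWriter.append(order);
            }
        }
    }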

3.      Columnar Format Consideration

Row-oriented systems have traditionally been used to fetch data stored in a database (Grover et al., 2015).  This type of data retrieval suited analyses that relied heavily on fetching all fields of the records belonging to a specific time range, and it is efficient when all columns of a record are available at the time of writing, because the record can be written with a single disk seek.  Columnar storage has more recently been used to fetch data, and it has four main benefits over row-oriented systems (Grover et al., 2015).  It skips I/O and decompression on columns that are not part of the query.  It works better for queries that access only a small subset of columns, whereas row-oriented storage is appropriate when many columns are retrieved.  Compression is more efficient on columns because data is more similar within a column than within a block of rows.  Finally, columnar storage is more appropriate for data-warehousing applications where aggregations are computed over specific columns rather than over entire records (Grover et al., 2015).  Hadoop applications use columnar file formats including the RCFile format, Optimized Row Columnar (ORC), and Parquet.  The RCFile format has been used as a Hive format; it was developed to provide fast data loading, fast query processing, and highly efficient use of storage space, and it breaks files into row splits and uses column-oriented storage within each split.  Despite its query and compression advantages over SequenceFiles, RCFile has limitations that prevent optimal query times and compression, and the newer columnar formats, ORC and Parquet, were designed to address many of those limitations (Grover et al., 2015).
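As a sketch of writing a columnar file, the snippet below uses the parquet-avro module's AvroParquetWriter to store the same kind of illustrative order record in Parquet with Snappy compression; the schema, output path, and the assumption that parquet-avro is on the classpath are all part of the example rather than the cited design.

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.hadoop.ParquetWriter;
    import org.apache.parquet.hadoop.metadata.CompressionCodecName;

    public class ParquetWriteExample {
        private static final String SCHEMA_JSON =
            "{\"type\":\"record\",\"name\":\"Order\",\"fields\":["
          + "{\"name\":\"order_id\",\"type\":\"long\"},"
          + "{\"name\":\"drug\",\"type\":\"string\"},"
          + "{\"name\":\"qty\",\"type\":\"int\"}]}";

        public static void main(String[] args) throws Exception {
            Schema schema = new Schema.Parser().parse(SCHEMA_JSON);

            GenericRecord order = new GenericData.Record(schema);
            order.put("order_id", 1001L);
            order.put("drug", "amoxicillin");
            order.put("qty", 30);

            // Columns are stored and compressed separately, so analytic queries
            // can read only the columns they need.
            try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
                    .<GenericRecord>builder(new Path("orders.parquet"))
                    .withSchema(schema)
                    .withCompressionCodec(CompressionCodecName.SNAPPY)
                    .build()) {
                writer.write(order);
            }
        }
    }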

Compression Consideration

Compression is another data storage consideration because it plays a crucial role in reducing storage requirements and improving data processing performance (Grover et al., 2015).  Some compression formats supported on Hadoop are not splittable (Grover et al., 2015); because the MapReduce framework splits data for input to multiple tasks, a non-splittable compression format is an obstacle to efficient processing.  Splittability is therefore a critical factor in selecting compression and file formats for Hadoop.  Compression types for Hadoop include Snappy, LZO, Gzip, and bzip2.  Google developed Snappy for speed, so it does not offer the best compression ratio; it is not inherently splittable and is designed to be used with a container format such as SequenceFile or Avro, and it is distributed with Hadoop.  Like Snappy, LZO is optimized for speed rather than size.  Unlike Snappy, LZO supports splitting of compressed files, but this requires indexing; LZO is also not distributed with Hadoop because of its license and requires separate installation.  Gzip provides good compression performance, with read speed similar to Snappy but slower writes; it is not splittable and should be used with a container format, and using smaller blocks with Gzip can result in better performance.  Bzip2 is a further compression type for Hadoop.  It provides good compression performance but can be slower than other codecs such as Snappy, so it is not an ideal codec for Hadoop storage; unlike Snappy and Gzip, however, bzip2 is inherently splittable because it inserts synchronization markers between blocks, and it can be used for active archival purposes (Grover et al., 2015).

A compression format can become splittable when used with a container file format such as Avro or SequenceFile, which compresses blocks of records or each record individually (Grover et al., 2015).  If compression is applied to an entire file without a container format, a compression format that is inherently splittable, such as bzip2, must be used.  Three recommendations apply to the use of compression with Hadoop (Grover et al., 2015).  The first is to enable compression of MapReduce intermediate output, which improves performance by decreasing the amount of intermediate data that needs to be written to and read from disk.  The second is to pay attention to the order of the data: data that is close together compresses better, and because data in a Hadoop file format is compressed in chunks, the organization of those chunks determines the final compression ratio.  The last recommendation is to use a compact file format with support for splittable compression, such as Avro.  Avro and SequenceFiles support splittability even with non-splittable compression formats: a single HDFS block can contain multiple Avro or SequenceFile blocks, and each of those blocks can be compressed and decompressed individually and independently of the others, which makes the data splittable.  Figure 9 shows the Avro and SequenceFile splittability support (Grover et al., 2015), and a job-configuration sketch follows the figure.


Figure 9.  Compression Example Using Avro (Grover et al., 2015).
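The job setup below sketches the first and third recommendations: it enables Snappy compression of the intermediate map output and writes the final output as a block-compressed SequenceFile.  The input and output paths are illustrative, and the job relies on the default identity mapper and reducer, so it simply copies text records into the compressed container.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile.CompressionType;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.SnappyCodec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    public class CompressedJobSetup {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Recommendation 1: compress the intermediate map output to reduce
            // the data written to and read from disk between map and reduce.
            conf.setBoolean("mapreduce.map.output.compress", true);
            conf.setClass("mapreduce.map.output.compress.codec",
                          SnappyCodec.class, CompressionCodec.class);

            Job job = Job.getInstance(conf, "compressed-output-example");
            job.setJarByClass(CompressedJobSetup.class);
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);

            // Recommendation 3: write the final output into a splittable container
            // (a block-compressed SequenceFile), so Snappy can still be used.
            job.setOutputFormatClass(SequenceFileOutputFormat.class);
            FileOutputFormat.setCompressOutput(job, true);
            FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
            SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);

            FileInputFormat.addInputPath(job, new Path("/data/medication_orders"));
            FileOutputFormat.setOutputPath(job, new Path("/data/medication_orders_compressed"));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }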

Design Consideration for HDFS Schema

HDFS and HBase are the commonly used storage managers in the Hadoop ecosystem; organizations can store their data directly in HDFS or in HBase, which internally stores it on HDFS (Grover et al., 2015).  When storing data in HDFS, some design techniques must be taken into consideration.  The schema-on-read model of Hadoop does not impose any requirements when loading data: data can be ingested into HDFS by one of many methods without the need to associate a schema or preprocess the data.  Although Hadoop can hold many types of data, including unstructured and semi-structured data, some order is still required, because Hadoop serves as a central location for the entire organization and the data stored in HDFS is intended to be shared across various departments and teams (Grover et al., 2015).  A carefully structured and organized data repository provides several benefits (Grover et al., 2015).  With a standard directory structure, it becomes easier to share data among teams working with the same data sets.  Data can be staged in a separate location before processing, and a standard staging convention helps ensure that data is not processed before it has been completely and properly staged.  Standard organization of the data also allows reuse of some of the code that processes it (Grover et al., 2015), and standard assumptions about data placement help simplify the process of loading data into Hadoop.  An HDFS data model for a project such as a data warehouse implementation is likely to use structured fact and dimension tables similar to a traditional schema (Grover et al., 2015), while an HDFS data model for unstructured and semi-structured data is likely to focus on directory placement and metadata management (Grover et al., 2015).

Grover et al. (2015) suggested three key considerations when designing the schema, regardless of the data model design project.  The first consideration is to develop standard practices that can be followed by all teams.  The second point is to ensure the design works well with the chosen tools.  For instance, if the version of Hive can support only table partitions on directories that are named a certain way, it will affect the schema design and the names of the table subdirectories.  The last consideration when designing a schema is to keep usage patterns in mind, because different data processing and querying patterns work better with different schema designs (Grover et al., 2015). 

HDFS Files Location Consideration

            The first step in designing an HDFS schema is determining where files will be located.  Standard file locations play a significant role in finding and sharing data among various departments and teams, and they also help in assigning file access permissions to various groups and users.  The recommended file locations are summarized in Table 2, and a sketch of creating such directories follows the table.


Table 2.  Standard File Locations.

HDFS Schema Design Consideration

The HDFS schema design involves advanced techniques to organize data into files (Grover et al., 2015).  A few strategies are recommended for organizing a data set: partitioning, bucketing, and denormalization.  Partitioning a data set is a common technique used to reduce the amount of I/O required to process it.  Unlike a traditional data warehouse, HDFS does not store indexes on the data.  This lack of indexes speeds up data ingest, but at the cost of a full table scan: every query has to read the entire data set, even when it processes only a small subset of the data.  Breaking the data set into smaller subsets, or partitions, avoids the full table scan by allowing queries to read only the relevant partitions, reducing the amount of I/O and improving query processing time significantly (Grover et al., 2015).  When partitioned data is placed in the filesystem, the directory structure should follow the format shown below; a small sketch of writing into such a layout appears after the example.  The order data set is partitioned by date because a large number of orders are placed daily, so each partition contains files large enough to be handled efficiently by HDFS.  Various tools such as HCatalog, Hive, Impala, and Pig understand this directory structure and leverage the partitioning to reduce the amount of I/O required during data processing (Grover et al., 2015).

  • <data set name>/<partition_column_name=partition_column_value>/
  • e.g. medication_orders/date=20181107/[order1.csv, order2.csv]
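
As mentioned above, the following is a minimal sketch, using the standard HDFS FileSystem API, of writing a file into such a date-partitioned directory layout.  The base path, partition value, and CSV contents are hypothetical.

import java.io.BufferedWriter;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PartitionedWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Build the partition directory: <data set name>/<partition_column=value>/
        String dataSet = "/data/medication_orders";   // hypothetical base path
        String partition = "date=20181107";           // partition column and value
        Path file = new Path(dataSet + "/" + partition + "/order1.csv");

        // Write one small CSV file into the partition (illustration only).
        try (BufferedWriter out = new BufferedWriter(
                new OutputStreamWriter(fs.create(file, true), StandardCharsets.UTF_8))) {
            out.write("order_id,patient_id,medication,quantity\n");
            out.write("1001,P-778,metformin,30\n");
        }
    }
}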

Bucketing is another technique for breaking a large data set into manageable subsets (Grover et al., 2015).  Bucketing is similar to the hash partitioning used in relational databases.  The partition example above uses the date, which results in files large enough to be handled well by HDFS (Grover et al., 2015).  However, if the data set were partitioned by physician, the result would be a very large number of small files.  This small-files problem can lead to excessive memory use on the NameNode, since the metadata for each file stored in HDFS is kept in memory (Grover et al., 2015).  Many small files also lead to many processing tasks, causing excessive processing overhead.  The solution is to bucket the data set — by physician in this example — using a hash function that maps each physician into one of a specified number of buckets (Grover et al., 2015); a minimal sketch of this hashing step follows.
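
The following is a minimal sketch of the hash-to-bucket idea described above; it is not the exact hash function Hive uses, and the physician identifiers and bucket count are hypothetical.

public class BucketAssigner {
    private final int numBuckets;

    public BucketAssigner(int numBuckets) {
        this.numBuckets = numBuckets;   // typically a power of two, e.g., 16
    }

    /** Maps a physician identifier to a bucket number in [0, numBuckets). */
    public int bucketFor(String physicianId) {
        // Math.floorMod avoids negative buckets when hashCode() is negative.
        return Math.floorMod(physicianId.hashCode(), numBuckets);
    }

    public static void main(String[] args) {
        BucketAssigner assigner = new BucketAssigner(16);
        System.out.println(assigner.bucketFor("DR-NM-0042"));  // hypothetical physician ID
        System.out.println(assigner.bucketFor("DR-AZ-1187"));
    }
}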

The bucketing technique controls the size of the data subsets and optimizes query speed (Grover et al., 2015).  The recommended average bucket size is a few multiples of the HDFS block size.  An even distribution of data when hashed on the bucketing column is essential because it results in consistent bucketing (Grover et al., 2015).  Using a number of buckets that is a power of two is common.  Bucketing also supports joining two data sets, where a join represents the general idea of combining two data sets to retrieve a result.  Joins can be implemented through SQL-on-Hadoop systems as well as in MapReduce, Spark, or other programming interfaces to Hadoop.  With bucketed data, corresponding buckets can be joined individually without joining the entire data sets, which reduces the cost of the otherwise computationally expensive reduce-side join of the two data sets (Grover et al., 2015).  When the buckets are small enough to fit easily into memory, the join can be implemented in the map stage of a MapReduce job by loading the smaller bucket into memory; this is called a map-side join.  The map-side join improves join performance compared with a reduce-side join.  Hive recognizes that tables are bucketed and optimizes the process accordingly.

Further optimization is possible if the data in each bucket is sorted: a merge join can be used, so the entire bucket does not have to be held in memory during the join, resulting in a faster process and much lower memory use than a simple bucket join.  Hive supports this optimization as well.  For large tables that are frequently joined together, using both sorting and bucketing on the join key is recommended (Grover et al., 2015).

The schema design depends on how the data will be queried (Grover et al., 2015).  Thus, the columns used for joining and filtering must be identified before the partitioning and bucketing of the data is implemented.  In some cases, when identifying a single partitioning key is challenging, the same data set can be stored multiple times, each copy with a different physical organization, which would be regarded as an anti-pattern in a relational database.  However, this solution works with Hadoop because Hadoop data is write-once and few updates are expected, so the overhead of keeping the duplicated data sets in sync is low, and the cost of storage in Hadoop clusters is relatively modest (Grover et al., 2015).  Keeping duplicated, differently organized copies of the data provides better query processing speed in such cases (Grover et al., 2015).

Denormalization is another technique that trades disk space for query performance by minimizing the need to join entire data sets (Grover et al., 2015).  In the relational database model, data is stored in third normal form (3NF), where redundancy is minimized and data integrity is enforced by splitting data into smaller tables, each holding a particular entity.  In this relational model, most queries require joining a large number of tables together to produce the desired final result (Grover et al., 2015).  In Hadoop, however, joins are often the slowest operations and consume the most cluster resources.  Specifically, the reduce-side join requires sending the entire table over the network, which is computationally costly.  While sorting and bucketing help minimize this cost, another solution is to create data sets that are pre-joined or pre-aggregated (Grover et al., 2015).  The data can be joined once and stored in that form instead of running the join operations every time the data is queried.  A Hadoop schema often consolidates many of the small dimension tables into a few larger dimensions by joining them during the ETL process (Grover et al., 2015).  Other techniques to speed up processing include aggregation and data type conversion.  Duplication of data is of less concern in Hadoop; when the same processing is repeated for a large number of queries, it is recommended to do it once and reuse the result, as is the case with a materialized view in a relational database.  In Hadoop, a new data set is created that contains the same data in its aggregated form (Grover et al., 2015).

To summarize, partitioning is used to reduce the I/O overhead of processing by selectively reading and writing data in particular partitions.  Bucketing can be used to speed up queries that involve joins or sampling, again by reducing I/O.  Denormalization can be implemented to speed up Hadoop jobs.  This section has reviewed advanced techniques to organize data into files.  The discussion included the use of a small number of large files versus a large number of small files; Hadoop prefers working with a small number of large files rather than a large number of small files.  The discussion also addressed the reduce-side join versus the map-side join.  The reduce-side join is computationally costly; hence, the map-side join technique is preferred and recommended.

HBase Schema Design Consideration

HBase is not a relational database (Grover et al., 2015; Yang, Liu, Hsu, Lu, & Chu, 2013).  HBase is similar to a large hash table, which allows the association of values with keys and performs a fast lookup of the value based on a given key  (Grover et al., 2015). The operations of hash tables involve put, get, scan, increment and delete.  HBase provides scalability and flexibility and is useful in many applications, including fraud detection, which is a widespread application for HBase (Grover et al., 2015).

The framework of HBase involves the Master Server, Region Servers, the Write-Ahead Log (WAL), the Memstore, HFiles, the API, and Hadoop HDFS (Bhojwani & Shah, 2016).  Each component of the HBase framework plays a significant role in data storage and processing.  Figure 10 illustrates the HBase framework.


Figure 10.  HBase Architecture (Bhojwani & Shah, 2016).

            The following considerations must be taken into account when designing a schema for HBase (Grover et al., 2015).

  • Row Key Consideration.
  • Timestamp Consideration.
  • Hops Consideration.
  • Tables and Regions Consideration.
  • Columns Use Consideration.
  • Column Families Use Consideration.
  • Time-To-Live Consideration.

The row key is one of the most critical factors in a well-architected HBase schema design (Grover et al., 2015).  Row key considerations involve record retrieval, distribution, the block cache, the ability to scan, size, readability, and uniqueness.  The row key is critical for retrieving records from HBase.  In a relational database, a composite key can combine multiple columns into a primary key; in HBase, multiple pieces of information can likewise be combined into a single key.  For instance, a key composed of customer_id, order_id, and timestamp can serve as the row key for a row describing an order.  In a relational database these would be three different columns, but in HBase they are combined into a single unique identifier.  Another consideration for selecting the row key is the get operation, because a get of a single record is the fastest operation in HBase.  Designing the row key so that a single get retrieves the data needed for the most common access patterns improves performance; this requires putting much of the information into a single record, which is called a denormalized design.  For instance, while a relational database would spread customer information across various tables, in HBase all of the customer's information can be stored in a single record and retrieved with one get operation.  Distribution is another consideration for HBase schema design.  The row key determines how the rows of a given table are scattered across the regions of the HBase cluster (Grover et al., 2015; Yang et al., 2013).  The row keys are sorted, and each region stores a range of these sorted row keys (Grover et al., 2015).  Each region is pinned to a region server, namely a node in the cluster (Grover et al., 2015).  A combination of device ID and timestamp or reverse timestamp is commonly used to “salt” the key for machine data (Grover et al., 2015).  The block cache is a least recently used (LRU) cache that holds data blocks in memory (Grover et al., 2015).  HBase reads records in chunks of 64 KB from disk by default; each of these chunks is called an HBase block (Grover et al., 2015).  When an HBase block is read from disk, it is placed into the block cache (Grover et al., 2015).  The choice of the row key can affect scan operations as well.  HBase scan rates are about eight times slower than HDFS scan rates, so reducing I/O requirements has a significant performance advantage.  The size of the row key also affects workload performance: a short row key is better than a long one because it has lower storage overhead and faster read/write performance.  The readability of the row key matters too, so it is advisable to start with a human-readable row key.  Finally, the uniqueness of the row key is critical, since a row key is the equivalent of a key in the hash-table analogy.  If the row key is based on a non-unique attribute, the application should handle such cases and only put data into HBase with a unique row key (Grover et al., 2015).
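
The following is a minimal sketch, using the standard HBase Java client, of a composite row key and a denormalized put followed by a single get.  The table name, column family, qualifiers, and key values are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class OrderRowKeyExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("medication_orders"))) {

            // Composite row key: customer_id + order_id + timestamp in one key.
            byte[] rowKey = Bytes.toBytes("C00042_O98117_20181107093000");

            // Denormalized write: all order details in a single record.
            Put put = new Put(rowKey);
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("medication"), Bytes.toBytes("metformin"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("quantity"), Bytes.toBytes("30"));
            table.put(put);

            // A single get retrieves everything needed for the common access path.
            Result result = table.get(new Get(rowKey));
            String medication = Bytes.toString(
                    result.getValue(Bytes.toBytes("d"), Bytes.toBytes("medication")));
            System.out.println("medication = " + medication);
        }
    }
}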

The timestamp is the second essential consideration for a good HBase schema design (Grover et al., 2015).  The timestamp determines which version of a record is newer when a put operation modifies the record.  It also determines the order in which records are returned when multiple versions of a single record are requested.  The timestamp is also used to remove out-of-date records: the time-to-live (TTL) mechanism compares the timestamp against the configured lifetime to determine whether a record value should be aged out, for example after it has been overwritten by another put or deleted (Grover et al., 2015).

The term hop refers to the number of synchronized get requests required to retrieve specific data from HBase (Grover et al., 2015).  The fewer hops, the better, because each hop adds overhead.  Although multi-hop requests can be made against HBase, it is best to avoid them through better schema design, for example by leveraging denormalization, because every hop is a round trip to HBase with a significant performance cost (Grover et al., 2015).

The number of tables and the number of regions per table in HBase can have a negative impact on performance and on the distribution of the data (Grover et al., 2015).  If the numbers of tables and regions are not set correctly, the load can become unbalanced.  Important points include the following: there is one region server per node, a region server hosts many regions, a given region is pinned to a particular region server, and tables are split into regions and scattered across region servers.  A table must have at least one region.  All regions in a region server receive put requests and share the region server's memstore, a cache structure present on every HBase region server.  The memstore caches the writes sent to that region server and sorts them before flushing them when certain memory thresholds are reached.  Thus, the more regions a region server hosts, the less memstore space is available per region.  The default configuration sets the ideal flush size to 100 MB, so dividing the total memstore size by 100 MB gives the maximum number of regions that should be placed on that region server.  Very large regions take a long time to compact; the practical upper limit on the size of a region is around 20 GB, although there are successful HBase clusters with regions upward of 120 GB.  Regions can be assigned to an HBase table using one of two techniques.  The first is to create the table with a single default region, which auto-splits as the data grows.  The second is to create the table with a given number of regions and set the region size to a high enough value, e.g., 100 GB per region, to avoid auto-splitting (Grover et al., 2015).  Figure 11 shows a topology of region servers, regions, and tables, and a sketch of pre-splitting a table follows the figure.


Figure 11.  The Topology of Region Servers, Regions, and Tables (Grover et al., 2015).
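
As an illustration of the second technique, the following is a minimal sketch, assuming the HBase 2.x admin API, of creating a table with a fixed set of regions by supplying split keys.  The table name, column family, and split points are hypothetical.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptor;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTableExample {
    public static void main(String[] args) throws Exception {
        try (Connection connection =
                     ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = connection.getAdmin()) {

            TableDescriptor descriptor = TableDescriptorBuilder
                    .newBuilder(TableName.valueOf("patient_vitals"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("v"))
                    .build();

            // Create the table with a fixed set of regions up front
            // instead of relying on auto-splitting.
            byte[][] splitKeys = {
                    Bytes.toBytes("4"),
                    Bytes.toBytes("8"),
                    Bytes.toBytes("c")   // four regions for keys salted with a hex prefix
            };
            admin.createTable(descriptor, splitKeys);
        }
    }
}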

The way columns are used in HBase differs from a traditional relational database (Grover et al., 2015; Yang et al., 2013).  In HBase, unlike a traditional database, one record can have a million columns and the next record can have a million completely different columns; this is not recommended, but it is possible (Grover et al., 2015).  HBase stores data in a format called HFile, where each column value gets its own entry in the HFile (Grover et al., 2015; Yang et al., 2013).  Each entry includes fields such as the row key, timestamp, column name, and value.  The file format provides various functionality, such as versioning and sparse column storage (Grover et al., 2015).

HBase also includes the concept of column families (Grover et al., 2015; Yang et al., 2013).  A column family is a container for columns, and a table can have one or more column families.  Each column family has its own set of HFiles and is compacted independently of other column families in the same table.  In many cases, no more than one column family is needed per table.  More than one column family can be used when the operations performed on, or the rate of change of, a subset of the table's columns differ from those of the other columns (Grover et al., 2015; Yang et al., 2013).  The last consideration for HBase schema design is the use of TTL, a built-in feature of HBase that ages out data based on its timestamp (Grover et al., 2015).  If TTL is not used but an aging requirement exists, a much more I/O-intensive cleanup operation would be needed.  The objects in HBase begin with the table object, followed by the regions of the table, a store per column family within each region, the memstore, store files, and blocks (Yang et al., 2013).  Figure 12 shows the hierarchy of objects in HBase, and a sketch of configuring TTL on a column family follows the figure.

Figure 12.  The Hierarchy of Objects in HBase (Yang et al., 2013).
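
The following is a minimal sketch, assuming the HBase 2.x client API, of declaring a column family that ages out its data automatically via TTL; the family name and retention period are hypothetical.

import org.apache.hadoop.hbase.client.ColumnFamilyDescriptor;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class TtlColumnFamilyExample {
    public static void main(String[] args) {
        // Age out sensor readings automatically after 30 days (TTL is in seconds),
        // instead of running a separate, I/O-intensive cleanup job.
        ColumnFamilyDescriptor vitals = ColumnFamilyDescriptorBuilder
                .newBuilder(Bytes.toBytes("v"))
                .setTimeToLive(30 * 24 * 60 * 60)
                .build();
        System.out.println("TTL seconds: " + vitals.getTimeToLive());
    }
}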

To summarize this section, HBase schema design requires seven key considerations, starting with the row key, which should be selected carefully with record retrieval, distribution, the block cache, the ability to scan, size, readability, and uniqueness in mind.  The timestamp and the number of hops are additional schema design considerations for HBase.  Tables and regions must be considered for put performance and compaction time.  The use of columns and column families should also be considered when designing the schema, and TTL for removing aged data is another consideration.

Metadata Consideration

The discussion so far has been about the data itself and the techniques for storing it in Hadoop.  Metadata is as essential as the data: metadata is data about the data (Grover et al., 2015).  The Hadoop ecosystem has various forms of metadata.  Metadata about a logical dataset, usually stored in a separate metadata repository, includes information such as the location of the data set (a directory in HDFS or an HBase table name), the schema associated with the dataset, the partitioning and sorting properties of the data set, and its format, e.g., CSV or SequenceFile (Grover et al., 2015).  Metadata about files on HDFS includes the permissions and ownership of those files and the location of the various blocks on data nodes, usually stored and managed by the Hadoop NameNode (Grover et al., 2015).  Metadata about tables in HBase includes information such as table names, associated namespaces, associated attributes (e.g., MAX_FILESIZE, READONLY), and the names of column families, usually stored and managed by HBase (Grover et al., 2015).  Metadata about data ingest and transformation includes information such as which user generated a given dataset, where the dataset came from, how long it took to generate, and how many records it contains or the size of the data load (Grover et al., 2015).  Metadata about dataset statistics includes information such as the number of rows in a dataset, the number of unique values in each column, a histogram of the distribution of the data, and maximum and minimum values (Grover et al., 2015).  Figure 13 summarizes these various forms of metadata.


Figure 13.  Various Metadata in Hadoop.

Apache Hive was the first project in the Hadoop ecosystem to store, manage, and leverage metadata (Antony et al., 2016; Grover et al., 2015).  Hive stores this metadata in a relational database called the Hive metastore (Antony et al., 2016; Grover et al., 2015).  Hive also provides a metastore service that interfaces with the Hive metastore database (Antony et al., 2016; Grover et al., 2015).  During query processing, Hive contacts the metastore to get the metadata for the desired query; the metastore returns the metadata, Hive generates an execution plan, the job is executed on the Hadoop cluster, and Hive returns the fetched result to the user (Antony et al., 2016; Grover et al., 2015).  Figure 14 shows the query process and the role of the metastore in the Hive framework, and a sketch of submitting such a query over JDBC follows the figure.


Figure 14.  Query Process and the Role of Metastore in Hive (Antony et al., 2016).
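
The following is a minimal sketch, assuming a HiveServer2 endpoint and the hive-jdbc driver on the classpath, of submitting a query whose planning relies on the metastore's table and partition metadata.  The host, database, user, and table names are hypothetical.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC endpoint; host, database, and table are hypothetical.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        String url = "jdbc:hive2://hive-server.example.org:10000/healthcare";

        try (Connection conn = DriverManager.getConnection(url, "etl_user", "");
             Statement stmt = conn.createStatement()) {

            // Hive consults the metastore for the table's schema and partitions,
            // builds an execution plan, runs it on the cluster, and returns rows.
            ResultSet rs = stmt.executeQuery(
                    "SELECT medication, COUNT(*) AS orders " +
                    "FROM medication_orders WHERE `date` = '20181107' " +
                    "GROUP BY medication");
            while (rs.next()) {
                System.out.println(rs.getString("medication") + " -> " + rs.getLong("orders"));
            }
        }
    }
}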

As more projects began to use the metadata concept introduced by Hive, a separate project called HCatalog was created to enable the use of the Hive metastore outside of Hive (Grover et al., 2015).  HCatalog is part of Hive and allows other tools, such as Pig and MapReduce, to integrate with the Hive metastore.  It also opens access to the Hive metastore to other tools through a REST API served by the WebHCat server.  MapReduce, Pig, and standalone applications can talk directly to the Hive metastore through its APIs, but HCatalog allows easy access through its WebHCat REST APIs and lets cluster administrators lock down access to the Hive metastore to address security concerns.  Other ways to store metadata include embedding it in file paths and names, or storing it in a hidden file in HDFS, e.g., .metadata.  Figure 15 shows HCatalog as an accessibility veneer around the Hive metastore (Grover et al., 2015).


Figure 15.  HCatalog acts as an accessibility veneer around the Hive metastore (Grover et al., 2015).

Hive Metastore and HCatalog Limitations

There are some limitations to the Hive metastore and HCatalog, including high availability (Grover et al., 2015).  HA database cluster solutions can be used to bring high availability to the Hive metastore database.  For the Hive metastore service, running multiple metastores concurrently on more than one node in the cluster is supported; however, concurrency issues related to data definition language (DDL) operations can occur, and the Hive community is working on fixing these issues.

The fixed schema for metadata is another limitation.  Hadoop provides much flexibility in the type of data that can be stored, mainly because of the schema-on-read concept, yet the Hive metastore imposes a fixed schema on the metadata itself and provides a tabular abstraction for the data sets.  The metastore is also another moving part in the infrastructure that must be kept running and secured as part of the Hadoop deployment (Grover et al., 2015).

Conclusion

This project has discussed essential topics related to Hadoop technology.  It began with an overview of Hadoop, providing a history of Hadoop and the differences between Hadoop 1.x and Hadoop 2.x.  The discussion then covered the Big Data Analytics process using Hadoop technology.  The process involves six significant steps, starting with problem identification, determining the data required, and the data collection process.  Data pre-processing and the ETL process must be implemented before performing the analytics, and the last step is the visualization of the data for decision making.  Before collecting and processing any data for storage, considerations for data preprocessing, modeling, and schema design in Hadoop must be addressed for better processing and better data retrieval, given that some tools cannot split the data while others can.  These considerations begin with the data storage format, followed by Hadoop file type considerations and the challenges of the XML and JSON formats in Hadoop.  Compression must also be considered when designing the schema.  Since HDFS and HBase are commonly used in Hadoop for data storage, the discussion covered HDFS and HBase schema design considerations.  In short, the design of the schema for Hadoop, HDFS, and HBase makes a difference in how data is stored across nodes and in using the right tools for splitting the data.  Organizations must therefore pay attention to the process and design requirements before storing data in Hadoop to achieve better computational processing.

References

Alguliyev, R., & Imamverdiyev, Y. (2014). Big data: big promises for information security. Paper presented at the Application of Information and Communication Technologies (AICT), 2014 IEEE 8th International Conference on.

Ankam, V. (2016). Big Data Analytics: Packt Publishing Ltd.

Antony, B., Boudnik, K., Adams, C., Lee, C., Shao, B., & Sasaki, K. (2016). Professional Hadoop: John Wiley & Sons.

Armstrong, D. (n.d.). R: Learning by Example: Lattice Graphics. Retrieved from https://quantoid.net/files/rbe/lattice.pdf.

Bhojwani, N., & Shah, A. P. V. (2016). A survey on Hadoop HBase system. Development, 3(1).

Dittrich, J., & Quiané-Ruiz, J.-A. (2012). Efficient big data processing in Hadoop MapReduce. Proceedings of the VLDB Endowment, 5(12), 2014-2015.

Grover, M., Malaska, T., Seidman, J., & Shapira, G. (2015). Hadoop Application Architectures: Designing Real-World Big Data Applications: O'Reilly Media, Inc.

Hu, H., Wen, Y., Chua, T.-S., & Li, X. (2014). Toward scalable systems for big data analytics: A technology tutorial. IEEE Access, 2, 652-687.

Karanth, S. (2014). Mastering Hadoop: Packt Publishing Ltd.

Lublinsky, B., Smith, K. T., & Yakubovich, A. (2013). Professional hadoop solutions: John Wiley & Sons.

sas.com. (2018). Hadoop – why it is and why it matters. Retrieved from https://www.sas.com/en_us/insights/big-data/hadoop.html.

Yang, C. T., Liu, J. C., Hsu, W. H., Lu, H. W., & Chu, W. C. C. (2013, 16-18 Dec. 2013). Implementation of Data Transform Method into NoSQL Database for Healthcare Data. Paper presented at the 2013 International Conference on Parallel and Distributed Computing, Applications and Technologies.

 

 

The Impact of XML on MapReduce

Dr. Aly, O.
Computer Science

Introduction

The purpose of this discussion is to analyze the impact of XML on MapReduce.  The discussion addresses the various techniques and approaches proposed by research studies for processing large XML documents using MapReduce.  The XML fragmentation process, in both the absence and the presence of MapReduce, is also discussed to provide a better understanding of the complex processing of large XML documents in a distributed, scalable MapReduce environment.

XML Query Processing Using MapReduce

The XML format has been used to store data for multiple applications (Aravind & Agrawal, 2014).  Data needs to be ingested into Hadoop and analyzed to obtain value from the XML data (Aravind & Agrawal, 2014), so the Hadoop ecosystem needs to understand and interpret XML when it is ingested (Aravind & Agrawal, 2014).  MapReduce is a building block of the Hadoop ecosystem.  In the age of Big Data, XML documents are expected to be very large and to be processed in a scalable, distributed fashion.  Processing XML queries with MapReduce requires decomposing a big XML document and distributing its portions to different nodes.  The relational approach is not appropriate because transforming a big XML document into relational database tables can be extremely time consuming, and the resulting θ-joins among relational tables are expensive (Wu, 2014).  Various research studies have proposed approaches to implement native XML query processing algorithms using MapReduce.

            (Dede, Fadika, Gupta, & Govindaraju, 2011) have discussed and analyzed the scalable and distributed processing of scientific XML data and how the MapReduce model should be used in XML metadata indexing.  The study presented performance results using two MapReduce implementations: the Apache Hadoop framework and the proposed LEMO-MR framework.  The study provided an indexing framework capable of indexing and efficiently searching large-scale scientific XML datasets.  The framework has been tailored for integration with any framework that uses the MapReduce model, to meet scalability and variety requirements.

(Fegaras, Li, Gupta, & Philip, 2011) have also discussed and analyzed query optimization in a MapReduce environment.  The study presented a novel query language for large-scale analysis of XML data in a MapReduce environment, called MRQL (MapReduce Query Language), designed to capture the most common data analysis tasks in a form that can be optimized.  XML data fragmentation is also discussed in this study.  A parallel data computation expects the input data to be fragmented into small, manageable pieces that determine the granularity of the computation.  In a MapReduce environment, each map worker is assigned a data split that consists of data fragments, and a map worker processes these data one fragment at a time.  For structured relational data, the fragment is a relational tuple, while for a text file a fragment can be a single line.  However, for hierarchical data and nested collections such as XML, the fragment size and structure depend on the actual application that processes the data.  For instance, XML data may consist of a few XML documents, each containing a single XML element whose size may exceed the memory capacity of a map worker.  Thus, when processing XML data, it is recommended to allow custom fragmentation to meet a wide range of application requirements.  (Fegaras et al., 2011) have noted that Hadoop provides a simple input format for XML fragmentation based on a single tag name.  An XML document is divided into data splits that may start and end at arbitrary points in the document, even in the middle of tag names.  (Fegaras et al., 2011; Sakr & Gaber, 2014) have indicated that this input format allows reading the document as a stream of string fragments, so that each string contains a single complete element with the requested tag name.  An XML parser can then be used to parse these strings and convert them to objects.  The fragmentation process is complex because the requested elements may cross data split boundaries, and these data splits may reside on different data nodes in the distributed file system (DFS).  Hadoop's DFS is the implicit solution for this problem, allowing a reader to scan beyond a data split into the next one, subject to some overhead for transferring data between nodes.  (Fegaras et al., 2011) have proposed an XML fragmentation technique built on top of the existing Hadoop XML input format, providing a higher level of abstraction and better customization.  It is a higher level of abstraction because it constructs XML data in the MRQL data model, ready to be processed by MRQL queries, instead of deriving a string for each XML element (Fegaras et al., 2011).
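
To illustrate the tag-based fragmentation described above, the following is a minimal sketch of a mapper that assumes an upstream XML input format delivers each map record as one complete element; the element name, attribute, and counter names are hypothetical.

import java.io.IOException;
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/**
 * Assumes an upstream XML input format keyed on a single tag name delivers
 * each map record as one complete <patient> element (hypothetical schema).
 */
public class PatientFragmentMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(value.toString())));

            // Count fragments per diagnosis code found on the element.
            String diagnosis = doc.getDocumentElement().getAttribute("diagnosis");
            if (!diagnosis.isEmpty()) {
                context.write(new Text(diagnosis), ONE);
            }
        } catch (Exception e) {
            // Malformed fragments (e.g., split mid-element) are counted and skipped.
            context.getCounter("xml", "malformed_fragments").increment(1);
        }
    }
}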

(Sakr & Gaber, 2014) have also briefly discussed another language proposed to support distributed XML processing using the MapReduce framework, called ChuQL.  It presents a MapReduce-based extension of the syntax, grammar, and semantics of XQuery, the standard W3C language for querying XML documents.  The implementation of ChuQL takes care of distributing the computation to multiple XQuery engines running on Hadoop nodes, as described by one or more ChuQL MapReduce expressions.  The “word count” example program can be represented in ChuQL using its extended expressions, where the MapReduce expression describes a MapReduce job.  The input and output clauses are used to read from and write to HDFS, respectively, and the rr and rw clauses describe the record reader and writer, respectively.  The map and reduce clauses represent the standard map and reduce phases of the framework, where they process XML values or key/value pairs of XML values, specified using XQuery expressions, to match the MapReduce model.  Figure 1 shows the word count example in ChuQL using XML in a distributed environment.

Figure 1.  The Word Count Example Program in ChuQL Using XML in a Distributed Environment (Sakr & Gaber, 2014).

(Vasilenko & Kurapati, 2014) have discussed and analyzed the efficient processing of XML documents in the Hadoop MapReduce environment.  They argued that the most common approach to processing XML data is to introduce a custom solution based on user-defined functions or scripts, with common choices ranging from an ETL process that extracts the data of interest to the transformation of XML into other formats natively supported by Hive.  They addressed a generic approach to handling XML based on the Apache Hive architecture, describing an approach that complements the existing family of Hive serializers and deserializers for other popular data formats, such as JSON, and makes it much easier for users to deal with large XML datasets.  The implementation includes logical splits of the input files, each of which is assigned to an individual mapper.  The mapper relies on the implemented Apache Hive XML SerDe to break the split into XML fragments using specified start/end byte sequences.  Each fragment corresponds to a single Hive record, and the fragments are handled by the XML processor to extract values for the record columns using specified XPath queries.  The reduce phase was not required in this implementation (Vasilenko & Kurapati, 2014).

            (Wu, 2014) has discussed and analyzed the partitioning of XML documents and the distribution of XML fragments to different compute nodes, which can introduce high overhead when XML fragments are transferred from one node to another during MapReduce execution.  The researcher proposed a technique that uses MapReduce to distribute the labels in inverted lists across a computing cluster so that structural joins can be performed in parallel to process queries, along with an optimization technique that reduces the computing space in the proposed framework to improve query processing performance.  This approach differs from the current approach of shredding an XML document and distributing it across the nodes of a cluster.  The process reads and distributes only the inverted lists required by the input queries during query processing, and their size is much smaller than the size of the whole document.  The process also partitions the total computing space for structural joins so that each sub-space can be handled by one reducer.  A pruning-based optimization algorithm was also proposed to further improve the performance of the approach.

Conclusion

This discussion has addressed XML query processing in a MapReduce environment.  It has covered the various techniques and approaches proposed by research studies for processing large XML documents using MapReduce.  The XML fragmentation process, in both the absence and the presence of MapReduce, has also been discussed to provide a better understanding of the complex processing of large XML documents in a distributed, scalable MapReduce environment.

References

Aravind, P. S., & Agrawal, V. (2014). Processing XML data in BigInsights 3.0. Retrieved from https://developer.ibm.com/hadoop/2014/10/31/processing-xml-data-biginsights-3-0/.

Dede, E., Fadika, Z., Gupta, C., & Govindaraju, M. (2011). Scalable and distributed processing of scientific XML data. Paper presented at the Grid Computing (GRID), 2011 12th IEEE/ACM International Conference on.

Fegaras, L., Li, C., Gupta, U., & Philip, J. (2011). XML Query Optimization in Map-Reduce.

Sakr, S., & Gaber, M. (2014). Large Scale and big data: Processing and Management: CRC Press.

Vasilenko, D., & Kurapati, M. (2014). Efficient processing of xml documents in hadoop map reduce.

Wu, H. (2014). Parallelizing structural joins to process queries over big XML data using MapReduce. Paper presented at the International Conference on Database and Expert Systems Applications.

Big Data Analytics Tools

Dr. Aly, O.
Computer Science

The purpose of this discussion is to identify and describe a tool in the market for data analytics, how the tool is used, and where it can be used.  The discussion begins with an overview of Big Data Analytics tools, followed by the top five tools for 2018, among which RapidMiner is selected as the BDA tool for this discussion.  The discussion of RapidMiner as one of the top five BDA tools includes its features, technical specification, use, advantages, and limitations.  The application of RapidMiner in various industries, such as medicine and education, is also addressed.

Overview of Big Data Analytics Tools

Organizations must be able to quickly and effectively analyze a large amount of data and extract value from it to make sound business decisions.  The benefits of Big Data Analytics are driving organizations and businesses to implement BDA techniques in order to compete in the market.  A survey conducted by CIO Insight showed that 65% of executives and senior decision makers indicated that organizations risk becoming uncompetitive or irrelevant if Big Data is not embraced (McCafferly, 2015).  The same survey showed that 56% anticipated higher investment in big data, and 15% indicated that the increase in budget allocation would be significant (McCafferly, 2015).  Such budget allocations can be used for skilled professionals, BD data storage, BDA tools, and so forth.

Various BDA tools exist in the market for different business purposes, depending on the organization's business model, and organizations must select the right tool to serve that model.  Various studies have discussed tools for BDA implementation.  (Chen & Zhang, 2014) have examined various types of BD tools.  Some tools are based on batch processing, such as Apache Hadoop, Dryad, Apache Mahout, and Tableau, while other tools are based on stream processing, such as Storm, S4, Splunk, Apache Kafka, and SAP Hana, as summarized in Table 1 and Table 2.  Each tool provides certain features for BDA implementation and offers various advantages to organizations that adopt BDA.


Table 1.  Big Data Tools Based on Batch Processing (Chen & Zhang, 2014).


Table 2.  Big Data Tools Based on Stream Processing (Chen & Zhang, 2014).

Other studies, such as (Rangra & Bansal, 2014), have provided a comparative study of data mining tools such as Weka, Keel, R-Programming, Knime, RapidMiner, and Orange, covering their technical specifications, general features, specializations, advantages, and limitations.  (Choi, 2017) has discussed BDA tools by category: open source data tools, data visualization tools, sentiment tools, and data extraction tools.  Figure 1 provides a summary of some examples of BDA tools, including database sources from which big datasets can be downloaded for analysis.


Figure 1.  A Summary of Big Data Analytics Tools.

(Al-Khoder & Harmouch, 2014) have evaluated four of the most popular open source and free data mining tools: R, RapidMiner, Weka, and Knime.  The R Foundation developed R-Programming, while the Rapid-I company developed RapidMiner.  Weka was developed by the University of Waikato, and Knime by Knime.com AG.  Figure 2 provides a summary of these four most popular open source and free data mining tools, with the logo, description, launch date, current version at the time of the study, and development team.


Figure 2.  Open Source and Free Data Mining Tools Analyzed by (Al-Khoder & Harmouch, 2014).

The top five BDA tools for 2018 include Tableau Public, RapidMiner, Hadoop, R-Programming, and IBM Big Data (Seli, 2017).  The present discussion focuses on one of these top five BDA tools.  Figure 3 summarizes the top five BDA tools for 2018.


Figure 3.  Top Five BDA Tools for 2018.

RapidMiner Big Data Analytic Tool

The RapidMiner Big Data Analytics tool is selected for the present discussion since it is among the top five BDA tools for 2018.  RapidMiner is an open source platform for BDA based on the Java programming language.  RapidMiner provides machine learning procedures and data mining, as well as data visualization, processing, statistical modeling, deployment, evaluation, and predictive analytics (Hofmann & Klinkenberg, 2013; Rangra & Bansal, 2014; Seli, 2017).  RapidMiner is known for its commercial and business applications, as it provides an integrated environment and platform for machine learning, data mining, predictive analysis, and business analytics (Hofmann & Klinkenberg, 2013; Seli, 2017).  It is also used for research, education, training, rapid prototyping, and application development (Rangra & Bansal, 2014).  It specializes in predictive analysis and statistical computing and supports all steps of the data mining process (Hofmann & Klinkenberg, 2013; Rangra & Bansal, 2014).  RapidMiner uses the client/server model, where the server can be installed locally, offered as a service, or run on cloud infrastructure (Rangra & Bansal, 2014).

RapidMiner was released in 2006.  At the time of the cited announcement, the latest version of RapidMiner Server was 7.2, with free versions of Server and Radoop available for download from the RapidMiner site (rapidminer, 2018).  It can be installed on any operating system (Rangra & Bansal, 2014).  The advantages of RapidMiner include an integrated environment for all steps of the data mining process, an easy-to-use graphical user interface (GUI) for designing data mining processes, visualization of results and data, and validation and optimization of these processes.  RapidMiner can be integrated into more complex systems (Hofmann & Klinkenberg, 2013).  RapidMiner also stores data mining processes in a machine-readable XML format, which can be executed with the click of a button, providing a visual representation of the data mining processes (Hofmann & Klinkenberg, 2013).  It contains over a hundred learning schemes for regression, classification, and clustering analysis (Rangra & Bansal, 2014).  RapidMiner has a few limitations, including constraints on the number of rows and a need for more hardware resources than other tools, such as SAS, for the same task and data (Seli, 2017).  RapidMiner also requires substantial knowledge of database handling (Rangra & Bansal, 2014).

RapidMiner Use and Application

Data mining requires six essential steps to extract value from a large dataset (Chisholm, 2013).  The data mining process framework begins with business understanding, followed by data understanding and data preparation.  The modeling, evaluation, and deployment phases develop the models for prediction, test them, and deploy them in real time.  Figure 4 illustrates these six steps of data mining.


Figure 4.  Data Mining Six Phases Process Framework (Chisholm, 2013).

Before working with RapidMiner, the user must know the common terms it uses.  Some of these standard terms are process, operator, macro, repository, attribute, role, label, and ID (Chisholm, 2013).  The data mining process in RapidMiner begins with loading the data, using import mechanisms for either files or databases.  Splitting a large file into pieces can also be implemented in RapidMiner; for example, a process can read each line of a CSV file and write the lines out in chunks.  If the dataset resides in a database, a Java Database Connectivity (JDBC) driver must be used; RapidMiner supports MySQL, PostgreSQL, SQL Server, Oracle, and Access (Chisholm, 2013).  After loading the data into RapidMiner and generating data for testing, a predictive model can be created from the loaded dataset, followed by process execution and visual review of the results.  RapidMiner provides various techniques to visualize the data, including scatter plots, scatter 3D color, parallel and deviation plots, quartile color, series plotting, and the survey plotter.  Figure 5 illustrates scatter 3D color visualization of data in RapidMiner (Chisholm, 2013), and a minimal illustration of the line-by-line chunking idea follows the figure.


Figure 5.  Scatter 3D Color Visualization of the Data in RapidMiner (Chisholm, 2013).
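
The following is not a RapidMiner process but a minimal plain-Java sketch of the line-by-line chunking idea mentioned above; the file names and chunk size are hypothetical.

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class CsvChunkSplitter {
    public static void main(String[] args) throws Exception {
        Path input = Paths.get("patients.csv");   // hypothetical large input file
        int linesPerChunk = 100_000;

        int chunkIndex = 0;
        int linesInChunk = 0;
        BufferedWriter out = null;

        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                // Start a new chunk file when needed.
                if (out == null || linesInChunk == linesPerChunk) {
                    if (out != null) {
                        out.close();
                    }
                    out = Files.newBufferedWriter(
                            Paths.get("patients_chunk_" + (chunkIndex++) + ".csv"),
                            StandardCharsets.UTF_8);
                    linesInChunk = 0;
                }
                out.write(line);
                out.newLine();
                linesInChunk++;
            }
        } finally {
            if (out != null) {
                out.close();
            }
        }
    }
}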

RapidMiner supports statistical techniques such as k-nearest neighbor and Naïve Bayes classification, which can be used for credit approval and in education (Hofmann & Klinkenberg, 2013).  RapidMiner is also applied in other areas such as marketing, cross-selling, and recommender systems (Hofmann & Klinkenberg, 2013).  Other useful use cases of RapidMiner include clustering in the medical and education domains (Hofmann & Klinkenberg, 2013).  RapidMiner can also be used for text mining scenarios such as spam detection, language detection, and customer feedback analysis, as well as for anomaly detection and instance selection.

Conclusion

This discussion has identified different tools for Big Data Analytics (BDA).  Over thirty analytic tools are available that can be used to address various BDA needs.  Some are open source tools, such as Knime, R-Programming, and RapidMiner, which can be downloaded for free, while others are visualization tools, such as Tableau Public and Google Fusion, that provide compelling visual images of the data in various scenarios.  Other tools are more semantic, such as OpenText and Opinion Crawl.  Data extraction tools for BDA include Octoparse and Content Grabber, and users can download large datasets for BDA from various sources such as data.gov.

The discussion has also addressed the top five BDA tools for 2018: Tableau Public, RapidMiner, Hadoop, R-Programming, and IBM Big Data.  RapidMiner was selected as the BDA tool for this discussion, with a focus on its technical specification, use, advantages, and limitations.  The data mining process and steps when using RapidMiner have also been discussed.  The analytic process begins with uploading the data to RapidMiner, during which the data can be split using RapidMiner's capabilities.  After the data is loaded and cleaned, the data model is developed and tested, followed by visualization.  RapidMiner's analytic capabilities include statistical techniques such as k-nearest neighbor and Naïve Bayes classification.  RapidMiner use cases have been addressed as well, including the medical and education domains and text mining scenarios such as spam detection.  Organizations must select the appropriate BDA tools based on their business model.

References

Al-Khoder, A., & Harmouch, H. (2014). Evaluating four of the most popular open source and free data mining tools.

Chen, C. P., & Zhang, C.-Y. (2014). Data-intensive applications, challenges, techniques and technologies: A survey on Big Data. Information Sciences, 275, 314-347.

Chisholm, A. (2013). Exploring data with RapidMiner: Packt Publishing Ltd.

Choi, N. (2017). Top 30 Big Data Tools for Data Analysis. Retrieved from https://bigdata-madesimple.com/top-30-big-data-tools-data-analysis/.

Hofmann, M., & Klinkenberg, R. (2013). RapidMiner: Data mining use cases and business analytics applications: CRC Press.

McCafferly, D. (2015). How To Overcome Big Data Barriers. Retrieved from https://www.cioinsight.com/it-strategy/big-data/slideshows/how-to-overcome-big-data-barriers.html.

Rangra, K., & Bansal, K. (2014). Comparative study of data mining tools. International journal of advanced research in computer science and software engineering, 4(6).

rapidminer. (2018). Introducing RapidMiner 7.2, Free Versions of Server & Radoop, and New Pricing. Retrieved from https://rapidminer.com/blog/introducing-new-rapidminer-pricing-free-versions-server-radoop/.

Seli, A. (2017). Top 5 Big Data Analytics Tools for 2018. Retrieved from http://heartofcodes.com/big-data-analytics-tools-for-2018/.

Case Study: Hadoop in Healthcare Industry

Dr. Aly, O.
Computer Science

The purpose of this discussion is to identify a real-life case study where Hadoop was used.  The discussion also addresses the researcher's view on whether Hadoop was used to its fullest extent.  The benefits of Hadoop to the industry in the identified use case are also discussed.

Hadoop Real Life Case Study and Applications in the Healthcare Industry

Various research studies and reports have discussed Spark solutions for real-time data processing in particular industries such as healthcare, while others have discussed Hadoop solutions for healthcare data analytics.  For instance, (Shruika & Kudale, 2018) have discussed the use of Big Data in healthcare with Spark, while (Beall, 2016) has indicated that United Healthcare is processing data using the Hadoop framework for clinical advancements, financial analysis, and fraud and waste monitoring.  United Healthcare has utilized Hadoop to obtain a 360-degree view of each of its 85 million members (Beall, 2016).

The emphasis of this discussion is on Hadoop in the healthcare industry.  Data in the healthcare industry is growing exponentially (Dezyre, 2016).  McKinsey has anticipated a potential annual value of $300 billion for healthcare in the US, along with 7% annual productivity growth, from the use of BDA (Manyika et al., 2011).  (Dezyre, 2016) has reported that healthcare informatics poses challenges such as data knowledge representation, database design, data querying, and clinical decision support, which contribute to the development of BDA.

Big Data in healthcare includes patient-related data from electronic health records (EHRs), computerized physician order entry (CPOE) systems, clinical decision support systems, medical devices and sensors, lab results, and images such as X-rays (Alexandru, Alexandru, Coardos, & Tudora, 2016; Wang, Kung, & Byrd, 2018).  The Big Data framework for healthcare includes a data layer, a data aggregation layer, an analytical layer, and an information exploration layer (Alexandru et al., 2016).  Hadoop resides in the analytical layer of this framework (Alexandru et al., 2016).

The data analysis involves Hadoop and MapReduce processing large datasets economically in batch form, analyzing both structured and unstructured data in a massively parallel processing environment (Alexandru et al., 2016).  (Alexandru et al., 2016) have indicated that stream computing can also be implemented, using real-time or near real-time analysis to identify and respond quickly to any healthcare fraud.  The third type of analytics at the analytical layer is in-database analytics, which uses a data warehouse for data mining and allows high-speed parallel processing that can be used for prediction scenarios (Alexandru et al., 2016).  In-database analytics can be used for preventive health care and pharmaceutical management.  Using a Big Data framework that includes the Hadoop ecosystem provides additional healthcare benefits such as scalability, security, confidentiality, and optimization features (Alexandru et al., 2016).

Hadoop technology was found to be the only technology that enables healthcare to store data in its native forms (Dezyre, 2016).  There are five successful use cases and applications of Hadoop in the healthcare industry (Dezyre, 2016).  The first application is cancer treatment and genomics.  Hadoop helps develop better treatments for diseases such as cancer by accelerating the design and testing of effective treatments tailored to patients, expanding genetically based clinical cancer trials, and establishing a national “cancer knowledge network” to guide treatment decisions (Dezyre, 2016).  Hadoop can also be used to monitor patient vitals.  Children's Healthcare of Atlanta is an example of using the Hadoop ecosystem to treat over 6,200 children in its ICU units; through Hadoop, the hospital is able to store and analyze vital signs, and if a pattern changes, an alert is generated and sent to the physicians (Dezyre, 2016).  The third application of Hadoop in the healthcare industry involves the hospital network.  The Cleveland Clinic spinoff company known as “Explorys” is taking advantage of Hadoop by developing the most extensive database in the healthcare industry; as a result, Explorys is able to provide clinical support, reduce the cost of care measurement, and manage the population of at-risk patients (Dezyre, 2016).  The fourth application involves healthcare intelligence, where healthcare insurance businesses are interested in identifying, in specific regions, individuals below a certain age who are not victims of certain diseases.  Through Hadoop technology, healthcare insurance companies can compute the cost of insurance policies, with Pig, Hive, and MapReduce from the Hadoop ecosystem used to process such large datasets (Dezyre, 2016).  The last application of Hadoop in the healthcare industry involves fraud prevention and detection.

Conclusion

In conclusion, the healthcare industry has taken advantage of Hadoop technology in various areas, not only for better treatment and better medication but also for reducing costs and increasing productivity and efficiency.  It has also used Hadoop for fraud protection.  These are not the only benefits that Hadoop offers the healthcare industry: Hadoop also offers storage capabilities, scalability, and analytics capabilities for various types of datasets using parallel processing and a distributed file system.  From the viewpoint of the researcher, utilizing Spark on top of Hadoop will empower the healthcare industry not only at the batch processing level but also at the real-time data processing level.  (Basu, 2014) has reported that the healthcare industry can take advantage of Spark and Shark with Apache Hadoop for real-time healthcare analytics.  Although Hadoop alone offers excellent benefits to the healthcare industry, its integration with other analytic tools such as Spark can make a huge difference at the patient care level as well as in the industry's return on investment.

References

Alexandru, A., Alexandru, C., Coardos, D., & Tudora, E. (2016). Healthcare, Big Data, and Cloud Computing. Management, 1, 2.

Basu, A. (2014). Real-Time Healthcare Analytics on Apache Hadoop* using Spark* and Shark. Retrieved from https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/big-data-real-time-healthcare-analytics-whitepaper.pdf.

Beall, A. (2016). Big data in healthcare: How three organizations are using big data to improve patient care and more. Retrieved from https://www.sas.com/en_us/insights/articles/big-data/big-data-in-healthcare.html.

Dezyre. (2016). 5 Healthcare Applications of Hadoop and Big Data Retrieved from https://www.dezyre.com/article/5-healthcare-applications-of-hadoop-and-big-data/85.

Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Byers, A. H. (2011). Big data: The next frontier for innovation, competition, and productivity.

Shruika, D., & Kudale, R. A. (2018). Use of Big Data in Healthcare with Spark. International Journal of Science and Research (IJSR).

Wang, Y., Kung, L., & Byrd, T. A. (2018). Big data analytics: Understanding its capabilities and potential benefits for healthcare organizations. Technological Forecasting and Social Change, 126, 3-13.

RDF Data Query Processing Performance

Dr. Aly, O.
Computer Science

Abstract

The purpose of this paper is to provide a survey of state-of-the-art techniques for applying MapReduce to improve RDF data query processing performance.  Tremendous effort from industry and researchers has been exerted to develop efficient and scalable RDF processing systems.  The project discusses and analyzes the RDF framework and the major building blocks of the Semantic Web architecture.  The RDF store architecture, the MapReduce parallel processing framework, and Hadoop are discussed in this project.  Researchers have developed and standardized Semantic Web technologies to address the inadequacy of current traditional analytical techniques.  This paper also discusses and analyzes the most prominent standardized Semantic Web technologies, RDF and SPARQL.  The discussion and analysis of the various techniques applied with MapReduce to improve RDF query processing performance include RDFPath, PigSPARQL, Interactive SPARQL Query Processing on Hadoop (Sempala), Map-Side Index Nested Loop Join (MAPSIN join), HadoopRDF, RDF-3X (RDF Triple eXpress), and Rya (a scalable RDF triple store for the clouds).

Keywords: RDF, SPARQL, MapReduce, Performance

MapReduce and RDF Data Query Processing Optimized Performance

            This project provides a survey of state-of-the-art techniques for applying MapReduce to improve Resource Description Framework (RDF) data query processing performance.  There has been a tremendous effort from industry and researchers to develop efficient and scalable RDF processing systems.  Most complex data-processing tasks require multiple MapReduce cycles chained together sequentially, so the decomposition of a task into cycles or subtasks is often implemented.  A low overall workflow cost is a key objective of this decomposition, because each MapReduce cycle incurs significant overhead.  When using RDF, the decomposition problem involves distributing operations such as SELECT and JOIN into subtasks, each supported by a MapReduce cycle.  The decomposition is closely related to the order of operations, because neighboring operations in a query plan can be effectively grouped into the same subtask.  When using MapReduce, the operation order is constrained by the requirement of key partitioning, so that neighboring operations do not cause conflicts.  Various techniques have been proposed to enhance the performance of Semantic Web queries using RDF and MapReduce.

This project begins with an overview of RDF, followed by the RDF store architecture, the MapReduce Parallel Processing Framework, and Hadoop.  RDF and SPARQL semantic querying are then discussed, covering the syntax of SPARQL and the missing features that are required to enhance the performance of RDF using MapReduce.  Various techniques for applying MapReduce to improve RDF query processing performance are discussed and analyzed, including HadoopRDF, RDFPath, and PigSPARQL.

Resource Description Framework (RDF)

Resource Description Framework (RDF) is described as an emerging standard for processing metadata (Punnoose, Crainiceanu, & Rapp, 2012; Tiwana & Balasubramaniam, 2001).  RDF provides interoperability between applications that exchange machine-understandable information on the Web (Sakr & Gaber, 2014; Tiwana & Balasubramaniam, 2001).  The primary goal of RDF is to define a mechanism and provide standards for metadata and for describing resources on the Web (Boussaid, Tanasescu, Bentayeb, & Darmont, 2007; Firat & Kuzu, 2011; Tiwana & Balasubramaniam, 2001), making no a priori assumptions about a particular application domain or the associated semantics (Tiwana & Balasubramaniam, 2001).  These standards and mechanisms can prevent users from retrieving irrelevant subjects because RDF provides metadata that is relevant to the desired information (Firat & Kuzu, 2011).

RDF is also described as a Data Model (Choi, Son, Cho, Sung, & Chung, 2009; Myung, Yeon, & Lee, 2010) for representing labeled directed graphs (Choi et al., 2009; Nicolaidis & Iniewski, 2017), and it is useful for Data Warehousing solutions such as those built on the MapReduce framework (Myung et al., 2010).  RDF is an important building block of the Semantic Web in Web 3.0 (see Figure 1) (Choi et al., 2009; Firat & Kuzu, 2011).  Semantic Web technologies are useful for maintaining data in the Cloud (M. F. Husain, Khan, Kantarcioglu, & Thuraisingham, 2010) because they provide the ability to specify and query heterogeneous data in a standard manner (M. F. Husain et al., 2010).

The RDF Data Model can be extended with ontologies, using RDF Schema (RDFS) and the Web Ontology Language (OWL), to define and identify vocabularies specific to a certain domain, the schema, and the relations between the elements of the vocabulary (Choi et al., 2009).  RDF can be exported in various file formats (Sun & Jara, 2014); the most common of these is RDF/XML with XSD (Sun & Jara, 2014).  OWL is used to add semantics to the schema (Sun & Jara, 2014).  For instance, the statement "A isAssociatedWith B" implies that "B isAssociatedWith A" (Sun & Jara, 2014), and OWL makes it possible to express these two statements the same way (Sun & Jara, 2014).  This feature is very useful for "joining" data expressed in different schemas (Sun & Jara, 2014): it allows building relationships and joining up data from multiple sites, described as "Linked Data," which facilitates the integration of heterogeneous data streams (Sun & Jara, 2014).  OWL also enables new facts to be derived from known facts using inference rules (Nicolaidis & Iniewski, 2017).  For example, if one triple states that a car is a subtype of a vehicle and another triple states that a Cabrio is a subtype of a car, the new fact that a Cabrio is a vehicle can be inferred from the previous facts (Nicolaidis & Iniewski, 2017).
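To make the subtype example above concrete, the following minimal Java sketch uses the Apache Jena library and its built-in RDFS reasoner as a stand-in for the OWL reasoning described; the namespace and class names are hypothetical and chosen only for illustration.

    import org.apache.jena.rdf.model.InfModel;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.rdf.model.Resource;
    import org.apache.jena.vocabulary.RDFS;

    public class SubclassInference {
        public static void main(String[] args) {
            // Plain model holding only the asserted facts.
            Model base = ModelFactory.createDefaultModel();
            String ns = "http://example.org/vocab#";   // hypothetical namespace
            Resource vehicle = base.createResource(ns + "Vehicle");
            Resource car     = base.createResource(ns + "Car");
            Resource cabrio  = base.createResource(ns + "Cabrio");

            // Asserted triples: Car subClassOf Vehicle, Cabrio subClassOf Car.
            base.add(car, RDFS.subClassOf, vehicle);
            base.add(cabrio, RDFS.subClassOf, car);

            // Wrap the model with Jena's built-in RDFS reasoner.
            InfModel inf = ModelFactory.createRDFSModel(base);

            // The entailed triple "Cabrio subClassOf Vehicle" is now visible,
            // even though it was never asserted explicitly.
            System.out.println(inf.contains(cabrio, RDFS.subClassOf, vehicle)); // true
        }
    }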

The RDF Data Model is described as a simple and flexible framework (Myung et al., 2010).  The underlying form of expression in RDF is a collection of "triples," each consisting of a subject (s), a predicate (p), and an object (o) (Brickley, 2014; Connolly & Begg, 2015; Nicolaidis & Iniewski, 2017; Przyjaciel-Zablocki, Schätzle, Skaley, Hornung, & Lausen, 2013; Punnoose et al., 2012).  The subject and predicate are Resources, each encoded as a Uniform Resource Identifier (URI) to ensure uniqueness, while the object can be a Resource or a Literal such as a string, date, or number (Nicolaidis & Iniewski, 2017).  In (Firat & Kuzu, 2011), the basic structure of the RDF Data Model is described as a triplet of object (O), attribute (A), and value (V); the basic role of RDF is to provide a Data Model of object, attribute, and value (OAV) (Firat & Kuzu, 2011).  The RDF Data Model is similar to the XML Data Model in that neither includes form-related information or names (Firat & Kuzu, 2011).

Figure 1:  The Major Building Blocks of the Semantic Web Architectures. Adapted from (Firat & Kuzu, 2011).
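As an illustration of the triple structure described above, the following short Java sketch (using Apache Jena; the namespace and property name are hypothetical) builds a single triple whose subject and predicate are URI resources and whose object is a literal, then serializes it.

    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.rdf.model.Property;
    import org.apache.jena.rdf.model.Resource;

    public class TripleExample {
        public static void main(String[] args) {
            Model model = ModelFactory.createDefaultModel();
            String ns = "http://example.org/vocab#";  // hypothetical namespace

            // Subject and predicate are Resources identified by URIs;
            // the object here is a plain Literal.
            Resource person = model.createResource(ns + "person42");
            Property hasName = model.createProperty(ns, "hasName");
            model.add(person, hasName, "Jane Doe");

            // Serialize the single triple; Jena supports several RDF syntaxes.
            model.write(System.out, "TURTLE");
        }
    }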

            RDF has been commonly used in applications such as the Semantic Web, bioinformatics, and social networks because of its great flexibility and applicability (Choi et al., 2009).  These applications require huge computations over large sets of data, so large-scale graph datasets are very common among them (Choi et al., 2009).  However, the traditional techniques for processing such large-scale datasets have been found inadequate (Choi et al., 2009).  Moreover, the flexibility of the RDF Data Model enables existing heterogeneous database systems to be integrated into a Data Warehouse (Myung et al., 2010) and gives users an inference capability for discovering unknown knowledge, which is useful for large-scale data analysis (Myung et al., 2010).  RDF triples require terabytes of disk space for storage and analysis (M. Husain, McGlothlin, Masud, Khan, & Thuraisingham, 2011; M. F. Husain et al., 2010).  Researchers are encouraged to develop efficient repositories, because only a few frameworks, such as RDF-3X, Jena, Sesame, and BigOWLIM, exist for Semantic Web technologies (M. Husain et al., 2011; M. F. Husain et al., 2010).  These are single-machine RDF systems and are widely used because they are user-friendly and perform well for small and medium-sized RDF datasets (M. Husain et al., 2011; M. F. Husain et al., 2010; Sakr & Gaber, 2014).  RDF-3X is regarded as the fastest single-machine RDF system in terms of query performance and vastly outperforms previous single-machine systems (M. Husain et al., 2011; M. F. Husain et al., 2010; Sakr & Gaber, 2014).  However, the performance of RDF-3X diminishes for queries with unbound objects and a low selectivity factor (M. Husain et al., 2011; M. F. Husain et al., 2010; Sakr & Gaber, 2014).  All of these frameworks struggle with large RDF graphs (M. Husain et al., 2011; M. F. Husain et al., 2010).  Therefore, storing a large volume of RDF triples and querying them efficiently are challenging and are regarded as critical problems in the Semantic Web, and these challenges also limit scaling capabilities (M. Husain et al., 2011; M. F. Husain et al., 2010; Sakr & Gaber, 2014).

RDF Store Architecture

            The main purpose of an RDF store is to provide a database for storing and retrieving any data expressed in RDF (Modoni, Sacco, & Terkaj, 2014).  The term RDF store is used as an abstraction for any system that can handle RDF data, allowing the ingestion of serialized RDF data and its retrieval, and providing a set of APIs to facilitate integration with third-party client applications (Modoni et al., 2014).  The term triple store often refers to these types of systems (Modoni et al., 2014).  An RDF store includes two major components: the Repository and the Middleware (Modoni et al., 2014).  The Repository represents a set of files or a database (Modoni et al., 2014), while the Middleware sits on top of the Repository and is in constant communication with it (Modoni et al., 2014).  Figure 2 illustrates the RDF store architecture.  The Middleware has its own components: the Storage Provider, the Query Engine, the Parser/Serializer, and the Client Connector (Modoni et al., 2014).  Current RDF stores are categorized into three groups: database-based stores, native stores, and hybrid stores (Modoni et al., 2014).  Examples of database-based stores, which are built on top of existing commercial database engines, include MySQL and Oracle 12c (Modoni et al., 2014).  Examples of native stores, which are built as database engines from scratch, are AllegroGraph and OWLIM (Modoni et al., 2014).  Examples of hybrid stores, which support both architectural styles (native and DBMS-backed), are Virtuoso and Sesame (Modoni et al., 2014).

Figure 2:  RDF Store Architecture (Modoni et al., 2014)
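The middleware components named above can be summarized with the following illustrative-only Java interfaces; they do not correspond to the API of any particular RDF store and are intended only to show how the responsibilities are divided.

    // Illustrative-only interfaces mirroring the middleware components above.
    interface StorageProvider {                 // persists triples in the repository
        void store(String subject, String predicate, String object);
    }
    interface QueryEngine {                     // evaluates SPARQL against the repository
        java.util.List<String[]> select(String sparqlQuery);
    }
    interface ParserSerializer {                // ingests and emits serialized RDF
        void parse(java.io.InputStream rdfData, StorageProvider target);
    }
    interface ClientConnector {                 // API surface exposed to client applications
        QueryEngine queryEngine();
    }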

MapReduce Parallel Processing Framework and Hadoop

In 2004, Google introduced the MapReduce framework as a parallel processing framework for dealing with large sets of data (Bakshi, 2012; Fadzil, Khalid, & Manaf, 2012; White, 2012).  The MapReduce framework has gained much popularity because it hides the sophisticated operations of parallel processing (Fadzil et al., 2012).  Various MapReduce frameworks such as Hadoop were introduced because of the enthusiasm towards MapReduce (Fadzil et al., 2012).  The capability of the MapReduce framework was recognized by different research areas such as data warehousing, data mining, and bioinformatics (Fadzil et al., 2012).  The MapReduce framework consists of two main layers: the Distributed File System (DFS) layer to store data and the MapReduce layer for data processing (Lee, Lee, Choi, Chung, & Moon, 2012; Mishra, Dehuri, & Kim, 2016; Sakr & Gaber, 2014).  The DFS is a major feature of the MapReduce framework (Fadzil et al., 2012).

The MapReduce framework uses large clusters of low-cost commodity hardware to lower costs (Bakshi, 2012; H. Hu, Wen, Chua, & Li, 2014; Inukollu, Arsi, & Ravuri, 2014; Khan et al., 2014; Krishnan, 2013; Mishra et al., 2016; Sakr & Gaber, 2014; White, 2012).  It uses "Redundant Arrays of Independent (and inexpensive) Nodes (RAIN)," whose components are loosely coupled, so that when any node goes down there is no negative impact on the MapReduce job (Sakr & Gaber, 2014; Yang, Dasdan, Hsiao, & Parker, 2007).  The framework provides fault tolerance by applying replication, which allows a crashed node to be replaced by another node without affecting the currently running job (P. Hu & Dai, 2014; Sakr & Gaber, 2014).  It also automatically parallelizes execution, which makes MapReduce highly parallel and yet abstracted (P. Hu & Dai, 2014; Sakr & Gaber, 2014).

MapReduce was introduced to solve the problem of parallel processing of large sets of data in a distributed environment, which previously required manual management of hardware resources (Fadzil et al., 2012; Sakr & Gaber, 2014).  The complexity of parallelization is addressed with two techniques: the Map/Reduce technique and the Distributed File System (DFS) technique (Fadzil et al., 2012; Sakr & Gaber, 2014).  A parallel framework must be reliable to ensure good resource management in a distributed environment built from off-the-shelf hardware, and it must solve the scalability issue to support any future processing requirement (Fadzil et al., 2012).  Earlier frameworks such as the Message Passing Interface (MPI) had reliability and fault-tolerance issues when processing large sets of data (Fadzil et al., 2012).  The MapReduce framework covers both categories of scalability: structural scalability and load scalability (Fadzil et al., 2012).  It addresses structural scalability by using the DFS, which forms large virtual storage for the framework by adding off-the-shelf hardware, and it addresses load scalability by increasing the number of nodes to improve performance (Fadzil et al., 2012).
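A minimal example of the Map/Reduce technique is the canonical word-count job shown below, written against the Hadoop MapReduce Java API; the input and output paths are supplied as command-line arguments and are assumed to point to HDFS directories.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map phase: emit (word, 1) for every token in the input split.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce phase: sum the counts that the shuffle grouped by word.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory on HDFS
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory on HDFS
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }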

However, the earlier version of the MapReduce framework faced challenges.  Among these challenges are the join operation and the lack of support for aggregate functions needed to join multiple datasets in one task (Sakr & Gaber, 2014).  Another limitation of the standard MapReduce framework is iterative processing, which is required for analysis techniques such as the PageRank algorithm, recursive relational queries, and social network analysis (Sakr & Gaber, 2014).  The standard MapReduce does not share the execution of work to reduce the overall amount of work (Sakr & Gaber, 2014).  A further limitation is the lack of support for data indexes and column storage: only sequential scanning of the input data is supported, and this lack of indexing hurts query performance (Sakr & Gaber, 2014).  Moreover, many have argued that MapReduce is not the optimal solution for structured data.  It is known as a shared-nothing architecture, which supports scalability (Bakshi, 2012; Jinquan, Jie, Shengsheng, Yan, & Yuanhao, 2012; Sakr & Gaber, 2014; White, 2012) and the processing of large unstructured data sets (Bakshi, 2012).  MapReduce also has performance and efficiency limitations (Lee et al., 2012).

Hadoop is a software framework managed by Apache and derived from Google's MapReduce and Bigtable designs.  It was created by Doug Cutting and was named after his son's toy elephant (Mishra et al., 2016).  Hadoop allows applications to run on huge clusters of commodity hardware based on MapReduce (Mishra et al., 2016).  The underlying concept of Hadoop is to process data in parallel across different computing nodes to speed up computations and hide latency (Mishra et al., 2016).  The Hadoop Distributed File System (HDFS) is one of the major components of the Hadoop framework; it stores large files (Bao, Ren, Zhang, Zhang, & Luo, 2012; Cloud Security Alliance, 2013; De Mauro, Greco, & Grimaldi, 2015) and allows access to data scattered over multiple nodes without exposing the complexity of the environment (Bao et al., 2012; De Mauro et al., 2015).  The MapReduce programming model is another major component of the Hadoop framework (Bao et al., 2012; Cloud Security Alliance, 2013; De Mauro et al., 2015), designed to implement distributed and parallel algorithms efficiently (De Mauro et al., 2015).  HBase is a third component of the Hadoop framework (Bao et al., 2012); it is a NoSQL (Not only SQL) database developed on top of HDFS (Bao et al., 2012).
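The following short Java sketch illustrates how a client interacts with HDFS through the FileSystem abstraction without seeing blocks, replicas, or DataNodes; the path used is hypothetical, and the cluster address is assumed to come from the standard core-site.xml configuration on the classpath.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsRoundTrip {
        public static void main(String[] args) throws Exception {
            // Picks up fs.defaultFS (e.g., an hdfs:// NameNode address) from core-site.xml.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path path = new Path("/tmp/example.txt");   // hypothetical HDFS path

            // Write: the client sees a single file even though HDFS splits it
            // into replicated blocks across DataNodes.
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }

            // Read it back through the same abstraction.
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
                System.out.println(in.readLine());
            }
        }
    }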

The key features of Hadoop include scalability, flexibility, cost efficiency, and fault tolerance (H. Hu et al., 2014; Khan et al., 2014; Mishra et al., 2016; Polato, Ré, Goldman, & Kon, 2014; Sakr & Gaber, 2014).  Hadoop allows the nodes in the cluster to scale up and down based on computation requirements without any change in the data formats (H. Hu et al., 2014; Polato et al., 2014).  Hadoop also brings massively parallel computation to commodity hardware, decreasing the cost per terabyte of storage and keeping massively parallel computation affordable as data volumes increase (H. Hu et al., 2014).  Hadoop offers flexibility because it is not tied to a schema, which allows the use of structured, semi-structured, and unstructured data, and the aggregation of data from multiple sources (H. Hu et al., 2014; Polato et al., 2014).  Hadoop also allows nodes to crash without affecting data processing: it provides a fault-tolerant environment in which data and computation can be recovered without any negative impact on the processing of the data (H. Hu et al., 2014; Polato et al., 2014; White, 2012).

RDF and SPARQL Using Semantic Query

Researchers have worked on developing and standardizing Semantic Web technologies to address the inadequacy of traditional analytical techniques (M. Husain et al., 2011; M. F. Husain et al., 2010).  RDF and SPARQL (SPARQL Protocol and RDF Query Language) are the most prominent standardized Semantic Web technologies (M. Husain et al., 2011; M. F. Husain et al., 2010).  In 2007, the Data Access Working Group (DAWG) of the World Wide Web Consortium (W3C) recommended SPARQL as the standard query language for RDF, together with a protocol definition for sending SPARQL queries from a client to a query processor and an XML-based serialization format for the results returned by a SPARQL query (Konstantinou, Spanos, Stavrou, & Mitrou, 2010; Sakr & Gaber, 2014).  RDF is regarded as the standard for storing and representing data (M. Husain et al., 2011; M. F. Husain et al., 2010), and SPARQL is the query language for retrieving data from an RDF triplestore (M. Husain et al., 2011; M. F. Husain et al., 2010; Nicolaidis & Iniewski, 2017; Sakr & Gaber, 2014; Zeng, Yang, Wang, Shao, & Wang, 2013).  Like RDF, SPARQL is built on the "triple pattern," which also contains a subject, a predicate, and an object and is terminated with a full stop (Connolly & Begg, 2015); an RDF triple is itself a valid SPARQL triple pattern (Connolly & Begg, 2015).  URIs identifying resources are written inside angle brackets; literal strings are denoted with either double or single quotes; and properties, like Name, can be identified by their URI or, more commonly, using a QName-style syntax to improve readability (Connolly & Begg, 2015).  Unlike a triple, a triple pattern can include variables: any or all of the subject, predicate, and object values in a triple pattern may be replaced by a variable, which indicates the data items of interest to be returned by a query (Connolly & Begg, 2015).

The semantic query plays a significant role in the Semantic Web, and the standardization of SPARQL is central to achieving such semantic queries (Konstantinou et al., 2010).  Unlike traditional query languages, SPARQL does not operate at the graph level; rather, it models the graph as a set of triples (Konstantinou et al., 2010).  Thus, a SPARQL query specifies a graph pattern, and the nodes that match this pattern are returned (Konstantinou et al., 2010; Zeng et al., 2013).  SPARQL syntax is similar to SQL, the SELECT–FROM–WHERE form being the most striking similarity (Konstantinou et al., 2010).  The core of a SPARQL query is a conjunctive set of triple patterns called a "basic graph pattern" (Zeng et al., 2013).  Table 1 shows the SPARQL syntax for retrieving data from RDF using a SELECT statement.

Table 1:  Example of SPARQL syntax (Sakr & Gaber, 2014)

Although SPARQL syntax resembles SQL in its use of SELECT to retrieve data, SPARQL is not as mature as SQL (Konstantinou et al., 2010).  The form of SPARQL current at the time of (Konstantinou et al., 2010), SPARQL 1.0, allows access to the raw data using URIs from an RDF or OWL graph and leaves result processing to the user (Konstantinou et al., 2010).  However, SPARQL is expected to become the gateway for querying information and knowledge, supporting as many features as SQL does (Konstantinou et al., 2010).  SPARQL 1.0 does not support aggregate functions such as MAX, MIN, SUM, AVG, and COUNT, or the GROUP BY operation (Konstantinou et al., 2010).  Moreover, it supports ORDER BY only at the global level and not solely on the OPTIONAL part of the query (Konstantinou et al., 2010).  For mathematical operations, its support does not extend beyond the basics (Konstantinou et al., 2010).  SPARQL 1.0 does not support nested queries, meaning it does not allow a CONSTRUCT query in the FROM part of a query.  It also lacks the functionality offered by the SELECT ... WHERE ... LIKE statement in SQL, which allows keyword-based queries (Konstantinou et al., 2010); while SPARQL offers the regex() function for string pattern matching, it cannot fully emulate the LIKE operator (Konstantinou et al., 2010).  SPARQL 1.0 permits only unbound variables in the SELECT clause and rejects the use of functions or other operators (Konstantinou et al., 2010).  This limitation makes SPARQL an elementary query language in which only URIs or literals are returned, while users look for some result processing in practical use cases (Konstantinou et al., 2010).  SPARQL can be enhanced with these missing features, including stored procedures, triggers, and data-manipulation operations such as update, insert, and delete (Konstantinou et al., 2010).

The W3C SPARQL Working Group has been working on integrating these missing features into the language; many of them were later standardized in SPARQL 1.1.  SPARQL/Update is an extension to SPARQL, included in the leading Semantic Web development framework Jena, that allows update operations and the creation and removal of RDF graphs (Konstantinou et al., 2010).  ARQ is a query engine for Jena which supports the SPARQL RDF query language (Apache, 2017a).  Some of the key features of ARQ include update support, GROUP BY, access to and extension of the SPARQL algebra, and support for federated queries (Apache, 2017a).  LARQ integrates SPARQL with Apache's full-text search framework Lucene (Konstantinou et al., 2010), adding free-text searches to SPARQL (Apache, 2017b).  SPARQL+, an extension of the ARC RDF store, offers most of the common aggregates and extends SPARUL's INSERT with a CONSTRUCT clause (Konstantinou et al., 2010).  OpenLink's Virtuoso extends SPARQL with aggregate functions, nesting, and subqueries, and allows the user to embed SPARQL queries inside SQL (Konstantinou et al., 2010).  SPASQL offers similar functionality by embedding SPARQL into SQL (Konstantinou et al., 2010).
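A minimal sketch of querying an in-memory RDF model with Jena's ARQ engine is shown below; the vocabulary namespace and property name are hypothetical, and the regex() filter illustrates the string-matching facility discussed above.

    import org.apache.jena.query.Query;
    import org.apache.jena.query.QueryExecution;
    import org.apache.jena.query.QueryExecutionFactory;
    import org.apache.jena.query.QueryFactory;
    import org.apache.jena.query.QuerySolution;
    import org.apache.jena.query.ResultSet;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;

    public class ArqExample {
        public static void main(String[] args) {
            Model model = ModelFactory.createDefaultModel();
            String ns = "http://example.org/vocab#";          // hypothetical namespace
            model.add(model.createResource(ns + "p1"),
                      model.createProperty(ns, "hasName"), "Alice");
            model.add(model.createResource(ns + "p2"),
                      model.createProperty(ns, "hasName"), "Bob");

            // A basic graph pattern with one variable plus a regex() filter,
            // which approximates the role of SQL's LIKE operator.
            String sparql =
                "PREFIX ex: <" + ns + "> " +
                "SELECT ?name WHERE { ?s ex:hasName ?name . " +
                "FILTER regex(?name, \"^Ali\") }";

            Query query = QueryFactory.create(sparql);
            try (QueryExecution qexec = QueryExecutionFactory.create(query, model)) {
                ResultSet results = qexec.execSelect();
                while (results.hasNext()) {
                    QuerySolution row = results.next();
                    System.out.println(row.getLiteral("name").getString()); // Alice
                }
            }
        }
    }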

Although SPARQL lacks many SQL features, it offers features that are not part of SQL (Konstantinou et al., 2010).  One of these is the OPTIONAL operator, which does not restrict the results when the optional pattern is absent and which is found in almost all query languages for RDF (Konstantinou et al., 2010); it is roughly equivalent to a LEFT OUTER JOIN in SQL (Konstantinou et al., 2010).  Moreover, SPARQL syntax is arguably more user-friendly and intuitive than SQL (Konstantinou et al., 2010).

Techniques Applied on MapReduce to Improve RDF Query Processing Performance

With the explosive growth of data sizes, the traditional approach of analyzing data on a centralized server cannot scale (Punnoose et al., 2012; Sakr & Gaber, 2014) and cannot keep up with ever-larger RDF datasets (Sakr & Gaber, 2014).  Although SPARQL is used to query RDF data, querying RDF datasets at web scale is challenging because the computation of SPARQL queries requires several joins between subsets of the dataset (Sakr & Gaber, 2014).  New methods have been introduced to improve parallel computing and to allow the storage and retrieval of RDF across large compute clusters, enabling the processing of data of unprecedented magnitude (Punnoose et al., 2012).  Various solutions, such as PigSPARQL and RDFPath, have been introduced to address these challenges and achieve scalable RDF processing using the MapReduce framework.

1. RDFPath

In (Przyjaciel-Zablocki, Schätzle, Hornung, & Lausen, 2011), RDFPath is proposed as a declarative path query language for RDF that maps naturally to the MapReduce programming model by design while remaining extensible (Przyjaciel-Zablocki et al., 2011).  It supports the exploration of graph properties such as the shortest connection between two nodes in an RDF graph (Przyjaciel-Zablocki et al., 2011), and it is regarded as a valuable tool for the analysis of social graphs (Przyjaciel-Zablocki et al., 2011).  RDFPath combines an intuitive syntax for path queries with an effective execution strategy using MapReduce (Przyjaciel-Zablocki et al., 2011), and it benefits from the horizontal scaling properties of MapReduce: adding more nodes improves the overall execution time significantly (Przyjaciel-Zablocki et al., 2011).  Using RDFPath, large RDF graphs can be handled while scaling linearly with the size of the graph, so RDFPath can be used to investigate graph properties such as a variant of the famous six degrees of separation paradigm typically encountered in social graphs (Przyjaciel-Zablocki et al., 2011).  RDFPath focuses on path queries and studies their implementation on top of MapReduce.  There are various RDF query languages such as RQL, SeRQL, RDQL, Triple, N3, Versa, RxPath, RPL, and SPARQL (Przyjaciel-Zablocki et al., 2011), and RDFPath has an expressiveness competitive with these languages (Przyjaciel-Zablocki et al., 2011).  A comparison of RDFPath with these other RDF query languages shows that it has the same capabilities as SPARQL 1.1 for adjacent nodes, adjacent edges, the degree of a node, and fixed-length paths.  RDFPath goes beyond SPARQL 1.1 in areas such as the distance between two nodes and shortest paths, for which it has partial support, while SPARQL 1.1 has full support for aggregate functions and RDFPath only partial support (Przyjaciel-Zablocki et al., 2011).  Table 2 shows the comparison of RDFPath with other RDF query languages including SPARQL.

Table 2: Comparison of RDF Query Language, adapted from (Przyjaciel-Zablocki et al., 2011).

2. PigSPARQL

PigSPARQL is regarded as a competitive yet easy-to-use SPARQL query processing system on MapReduce that allows ad-hoc SPARQL query processing on large RDF graphs out of the box (Schätzle, Przyjaciel-Zablocki, Hornung, & Lausen, 2013).  PigSPARQL processes SPARQL queries on the MapReduce framework by translating them into Pig Latin programs, each of which is executed as a series of MapReduce jobs on a Hadoop cluster (Sakr & Gaber, 2014; Schätzle et al., 2013).  PigSPARQL uses the query language of Pig, a data analysis platform on top of Hadoop MapReduce, as an intermediate layer between SPARQL and MapReduce (Schätzle et al., 2013).  This intermediate layer provides an abstraction that makes PigSPARQL independent of the Hadoop version and thus ensures compatibility with future changes to the Hadoop framework, since those changes are absorbed by the underlying Pig layer (Schätzle et al., 2013).  The Pig Latin approach also makes PigSPARQL sustainable and an attractive long-term baseline for comparing MapReduce-based SPARQL implementations, underpinned by its competitiveness with existing systems such as HadoopRDF (Schätzle et al., 2013).  As illustrated in Figure 3, the PigSPARQL workflow begins by parsing the SPARQL query into an abstract syntax tree, which is translated into a SPARQL algebra tree (Schätzle et al., 2013).  Several optimizations are applied at the algebra level, such as the early execution of filters and the re-arrangement of triple patterns by selectivity (Schätzle et al., 2013).  The optimized algebra tree is then traversed bottom-up, and an equivalent sequence of Pig Latin expressions is generated for every SPARQL algebra operator (Schätzle et al., 2013).  At runtime, Pig automatically maps the resulting Pig Latin script into a sequence of MapReduce iterations (Schätzle et al., 2013).
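The first step of this workflow, parsing a SPARQL query and compiling it into a SPARQL algebra tree, can be illustrated with Jena ARQ's algebra utilities, as in the Java sketch below; the namespace is hypothetical, and the translation of algebra operators into Pig Latin is specific to PigSPARQL and is only indicated in the comments.

    import org.apache.jena.query.Query;
    import org.apache.jena.query.QueryFactory;
    import org.apache.jena.sparql.algebra.Algebra;
    import org.apache.jena.sparql.algebra.Op;

    public class SparqlAlgebraStep {
        public static void main(String[] args) {
            String sparql =
                "PREFIX ex: <http://example.org/vocab#> " +   // hypothetical namespace
                "SELECT ?name ?mbox WHERE { ?s ex:hasName ?name . ?s ex:hasEmail ?mbox }";

            // Parse the query into an abstract syntax tree, then compile it into
            // a SPARQL algebra expression (BGP, join, projection, ...).
            Query query = QueryFactory.create(sparql);
            Op algebra = Algebra.compile(query);
            Op optimized = Algebra.optimize(algebra);   // e.g., re-orders and pushes operators

            // PigSPARQL walks a tree like this bottom-up and emits one Pig Latin
            // expression per algebra operator; here we only print the algebra form.
            System.out.println(optimized);
        }
    }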

PigSPARQL is thus described as an easy-to-use and competitive baseline for comparing MapReduce-based SPARQL processing systems, and it exceeds the functionality of most existing research prototypes with its support for SPARQL 1.0 (Schätzle et al., 2013).

Figure 3: PigSPARQL Workflow From SPARQL to MapReduce, adapted from (Schätzle et al., 2013).

3. Interactive SPARQL Query Processing on Hadoop: Sempala

            In (Schätzle, Przyjaciel-Zablocki, Neu, & Lausen, 2014), Sempala, an interactive SPARQL query processing technique on Hadoop, is proposed.  Sempala is a SPARQL-over-SQL-on-Hadoop approach designed for selective queries (Schätzle et al., 2014), and it shows significant performance improvements compared to existing approaches (Schätzle et al., 2014).  Sempala is inspired by the SQL-on-Hadoop trend, in which several new systems have been developed for interactive SQL query processing, such as Hive, Shark, Presto, Phoenix, and Impala (Schätzle et al., 2014).  Thus, Sempala, as a SPARQL-over-SQL approach, follows this trend and provides interactive-time SPARQL query processing on Hadoop (Schätzle et al., 2014).  With Sempala, the RDF data is stored in a columnar layout on HDFS, and Impala, an open-source massively parallel processing (MPP) SQL query engine for Hadoop, serves as the execution layer on top (Schätzle et al., 2014).  The architecture of Sempala is illustrated in Figure 4.

Figure 4:  Sempala Architecture adapted from (Schätzle et al., 2014).

            The architecture of Sempala has two main components: the RDF Loader and the Query Compiler (Schätzle et al., 2014).  The RDF Loader converts an RDF dataset into the data layout used by Sempala, while the Query Compiler rewrites a given SPARQL query into the SQL dialect of Impala based on that layout (Schätzle et al., 2014).  The Query Compiler is based on the algebraic representation of SPARQL expressions defined by the W3C recommendation (Schätzle et al., 2014): Jena ARQ is used to parse a SPARQL query into the corresponding algebra tree (Schätzle et al., 2014), some basic algebraic optimizations such as filter pushing are applied (Schätzle et al., 2014), and the tree is finally traversed bottom-up to generate the equivalent Impala SQL expressions over the unified property table layout (Schätzle et al., 2014).  Sempala has been compared with other Hadoop-based systems such as Hive, PigSPARQL, MapMerge, and MAPSIN (Schätzle et al., 2014).  Hive is the standard SQL warehouse for Hadoop based on MapReduce (Schätzle et al., 2014); because Impala is designed to be highly compatible with Hive, the same query can run on the same data with only minor syntactical modification (Schätzle et al., 2014).  Sempala follows an approach similar to PigSPARQL, except that PigSPARQL uses Pig as the underlying system and intermediate layer between MapReduce and SPARQL (Schätzle et al., 2014).  MapMerge is an efficient map-side merge join implementation for scalable SPARQL BGP ("basic graph pattern" (W3C, 2016)) processing which reduces the shuffling of data between the map and reduce phases (Schätzle et al., 2014).  MAPSIN uses HBase, the standard NoSQL database for Hadoop, to store RDF data and applies a map-side index nested loop join that avoids the reduce phase of MapReduce (Schätzle et al., 2014).  The findings of (Schätzle et al., 2014) show that Sempala outperforms Hive and PigSPARQL, while MapMerge and MAPSIN could not be compared directly because they only support SPARQL BGP (Schätzle et al., 2014).

4. Map-Side Index Nested Loop Join (MAPSIN JOIN)

            MapReduce faces a challenge in processing joins because the datasets involved are very large (Sakr & Gaber, 2014).  Two datasets can only be joined directly if the matching records are brought to the same machine, which is not practical in general (Sakr & Gaber, 2014).  Thus, the reduce-side approach is commonly used and is regarded as the most prominent and flexible join technique in MapReduce (Sakr & Gaber, 2014).  The reduce-side approach is also known as the "repartition join" because the datasets are read in the map phase and repartitioned according to the join key in the shuffle phase, while the actual join computation is done in the reduce phase (Sakr & Gaber, 2014).  The problem with this approach is that the datasets are transferred through the network regardless of the join output, which can consume a great deal of network bandwidth and cause a bottleneck (Sakr & Gaber, 2014; Schätzle et al., 2013).  An alternative, the map-side join, performs the actual join processing in the map phase to avoid the shuffle and reduce phases and to avoid transferring both datasets over the network (Sakr & Gaber, 2014).  The most common variant is the map-side merge join, but it is hard to cascade, and the advantage of avoiding the shuffle and reduce phases is then lost (Sakr & Gaber, 2014).  The MAPSIN approach was therefore proposed: a map-side index nested loop join based on HBase (Sakr & Gaber, 2014; Schätzle et al., 2013).  The MAPSIN join leverages the indexing capabilities of HBase, which improves query performance for selective queries (Sakr & Gaber, 2014; Schätzle et al., 2013), and it retains the flexibility of reduce-side joins while utilizing the effectiveness of a map-side join without any modification to the underlying framework (Sakr & Gaber, 2014; Schätzle et al., 2013).
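The following Java sketch illustrates the general idea of a map-side index nested loop join in the spirit of MAPSIN, with a Hadoop mapper probing an HBase index table for each input triple; the table name, column family, and join condition are hypothetical and do not reflect MAPSIN's actual storage schema.

    import java.io.IOException;

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // For every triple read from the input split, the mapper probes an HBase
    // index table for the join partner, so no shuffle/reduce phase is needed.
    public class MapSideIndexJoinMapper extends Mapper<Object, Text, Text, Text> {

        private Connection connection;
        private Table indexTable;

        @Override
        protected void setup(Context context) throws IOException {
            connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
            indexTable = connection.getTable(TableName.valueOf("rdf_spo_index")); // hypothetical table
        }

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Input line assumed to be a tab-separated triple: subject \t predicate \t object.
            String[] triple = value.toString().split("\t");
            if (triple.length != 3) {
                return;
            }
            // Probe the index for triples whose subject equals this triple's object
            // (one possible join condition between two triple patterns).
            Result probe = indexTable.get(new Get(Bytes.toBytes(triple[2])));
            if (!probe.isEmpty()) {
                byte[] joined = probe.getValue(Bytes.toBytes("p"), Bytes.toBytes("o")); // hypothetical columns
                context.write(new Text(triple[0]), new Text(Bytes.toString(joined)));
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException {
            indexTable.close();
            connection.close();
        }
    }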

Compared with PigSPARQL, MAPSIN is faster when using its sophisticated HBase-based storage schema, which works well for selective queries but degrades significantly in performance for less selective queries (Schätzle et al., 2013).  However, MAPSIN does not support the queries of LUBM (the Lehigh University Benchmark) (W3C, 2016).  The query runtime of MAPSIN is close to the runtime of the merge join approach (Schätzle et al., 2013).

5. HadoopRDF

            HadoopRDF is proposed by (Tian, Du, Wang, Ni, & Yu, 2012) to combine the high fault tolerance and high throughput of the MapReduce distributed framework with the sophisticated indexing and query answering mechanisms of RDF stores (Tian et al., 2012).  HadoopRDF runs on a Hadoop cluster of many computers, and each node in the cluster has a Sesame server supplying the service for storing and retrieving RDF data (Tian et al., 2012).  HadoopRDF is a MapReduce-based RDF system that stores data directly in HDFS and does not require any modification to the Hadoop framework (Przyjaciel-Zablocki et al., 2013; Sakr & Gaber, 2014; Tian et al., 2012).  The basic idea is to substitute the rudimentary HDFS, which has no indexes or query execution engine, with more elaborate RDF stores (Tian et al., 2012).  The architecture of HadoopRDF is illustrated in Figure 5.

Figure 5: HadoopRDF Architecture, adapted from (Tian et al., 2012).

            The architecture of HadoopRDF is similar to the architecture of Hadoop, which scales up to thousands of nodes (Tian et al., 2012), and the Hadoop framework is the core of HadoopRDF (Tian et al., 2012).  Hadoop is built on top of HDFS, a replicated key-value store under the control of a central NameNode (Tian et al., 2012).  Files in HDFS are broken into fixed-size chunks, and replicas of these chunks are distributed across a group of DataNodes (Tian et al., 2012); the NameNode tracks the size and location of each replica (Tian et al., 2012).  Hadoop's MapReduce framework is used for computation in data-intensive applications (Tian et al., 2012).  In the architecture of HadoopRDF, the RDF stores are incorporated into the MapReduce framework.  HadoopRDF is an advanced SPARQL engine that splits the original RDF graph according to predicates and objects and utilizes a cost-based query execution plan for the reduce-side join (Przyjaciel-Zablocki et al., 2013; Sakr & Gaber, 2014; Schätzle et al., 2013).  HadoopRDF can re-balance automatically when the cluster size changes, but join processing is still done in the reduce phase (Przyjaciel-Zablocki et al., 2013; Sakr & Gaber, 2014).  The findings of (M. Husain et al., 2011) indicate that HadoopRDF is more scalable and handles low-selectivity queries more efficiently than RDF-3X, and that it is much more scalable than BigOWLIM and provides more efficient queries for large data sets (M. Husain et al., 2011).  Like most systems, HadoopRDF requires a pre-processing phase (Przyjaciel-Zablocki et al., 2013; Sakr & Gaber, 2014).

6. RDF-3X:  RDF Triple eXpress

            RDF-3X is proposed by (Neumann & Weikum, 2008).  The RDF-3X engine is an implementation of SPARQL that achieves excellent performance by pursuing a RISC-style design with a streamlined architecture (Neumann & Weikum, 2008).  RISC (Reduced Instruction Set Computer) denotes a type of microprocessor architecture that utilizes a small, highly optimized set of instructions rather than the more specialized instruction sets found in other architectures; RDF-3X follows this RISC-style concept with a "reduced instruction set" designed to support RDF (Neumann & Weikum, 2008).  RDF-3X is described as a generic solution for storing and indexing RDF triples that eliminates the need for physical-design tuning (Neumann & Weikum, 2008).  It provides a query optimizer that chooses optimal join orders using a cost model based on statistical synopses for entire join paths (Neumann & Weikum, 2008), and a powerful yet simple query processor that leverages fast merge joins for large-scale data (Neumann & Weikum, 2008).  RDF-3X has three major components: the physical design, the query processor, and the query optimizer.  The physical design is workload-independent, creating appropriate indexes over a single "giant triples table" (Neumann & Weikum, 2008).  The query processor is RISC-style, relying mostly on merge joins over sorted index lists, and the query optimizer focuses on join order when generating the execution plan (Neumann & Weikum, 2008).

The findings of (Neumann & Weikum, 2008) showed that RDF-3X addresses the challenge of schema-free data and copes very well with data that exhibit a large diversity of property names (Neumann & Weikum, 2008).  The optimizer of RDF-3X is known to produce efficient query execution plans (Galarraga, Hose, & Schenkel, 2014).  RDF-3X maintains local indexes for all possible orders and combinations of the triple components, as well as for aggregations, which enables efficient local data access (Galarraga et al., 2014).  RDF-3X does not support LUBM.  RDF-3X is a single-node RDF store that builds indexes over all possible permutations of subject, predicate, and object (Huang, Abadi, & Ren, 2011; M. Husain et al., 2011; Schätzle et al., 2014; Zeng et al., 2013).  RDF-3X is regarded as the fastest existing Semantic Web repository and the state-of-the-art "benchmark" engine for single machines (M. Husain et al., 2011; Przyjaciel-Zablocki et al., 2013); it outperforms other solutions for queries with bound objects and for aggregate queries (M. Husain et al., 2011).  However, the performance of RDF-3X diminishes exponentially for unbound queries and for queries with even simple joins if the selectivity factor is low (M. Husain et al., 2011; Przyjaciel-Zablocki et al., 2013).  The experiments of (M. Husain et al., 2011) showed that RDF-3X is not only slower for such queries but often aborts and cannot complete the query (M. Husain et al., 2011).

7. Rya: A Scalable RDF Triple Store for the Clouds

            In (Punnoose et al., 2012), Rya is proposed as a new scalable system for storing and retrieving RDF data across cluster nodes.  In Rya, the OWL model is treated as a set of triples that are stored in the triple store (Punnoose et al., 2012).  Storing all the data in the triple store makes it possible to use Hadoop MapReduce to run large batch processing jobs against the data set (Punnoose et al., 2012).  The first phase of the process is performed only once, at the time the OWL model is loaded into Rya (Punnoose et al., 2012): a MapReduce job iterates through the entire graph of relationships and outputs the implicit relationships it finds as explicit RDF triples stored in the RDF store (Punnoose et al., 2012).  The second phase is performed every time a query is run; once all explicit and implicit relationships are stored in Rya, the Rya query planner can expand the query at runtime to utilize all of these relationships (Punnoose et al., 2012).  A three-table index over the RDF triples is used to enhance performance (Punnoose et al., 2012).  The results of (Punnoose, Crainiceanu, & Rapp, 2015) showed that Rya outperformed SHARD; moreover, compared with the graph-partitioning algorithm introduced by (Huang et al., 2011), as reported in (Punnoose et al., 2015), Rya showed superior performance in many cases (Punnoose et al., 2015).

Conclusion

This project provided a survey of state-of-the-art techniques for applying MapReduce to improve RDF data query processing performance.  Tremendous effort has been exerted by industry and researchers to develop efficient and scalable RDF processing systems.  The project discussed the RDF framework and the major building blocks of the Semantic Web architecture, along with the RDF store architecture, the MapReduce Parallel Processing Framework, and Hadoop.  Researchers have worked on developing and standardizing Semantic Web technologies to address the inadequacy of traditional analytical techniques, and this paper discussed the most prominent of these standardized technologies, RDF and SPARQL.  The project also discussed and analyzed in detail various techniques applied on MapReduce to improve RDF query processing performance.  These techniques include RDFPath, PigSPARQL, Interactive SPARQL Query Processing on Hadoop (Sempala), the Map-Side Index Nested Loop Join (MAPSIN join), HadoopRDF, RDF-3X (RDF Triple eXpress), and Rya (a scalable RDF triple store for the clouds).

References

Apache, J. (2017a). ARQ – A SPARQL Processor for Jena.

Apache, J. (2017b). LARQ – Adding Free Text Searches to SPARQL.

Bakshi, K. (2012). Considerations for big data: Architecture and approach. Paper presented at the Aerospace Conference, 2012 IEEE.

Bao, Y., Ren, L., Zhang, L., Zhang, X., & Luo, Y. (2012). Massive sensor data management framework in cloud manufacturing based on Hadoop. Paper presented at the Industrial Informatics (INDIN), 2012 10th IEEE International Conference on.

Boussaid, O., Tanasescu, A., Bentayeb, F., & Darmont, J. (2007). Integration and dimensional modeling approaches for complex data warehousing. Journal of Global Optimization, 37(4), 571. doi:10.1007/s10898-006-9064-6

Brickley, D., & Guha, R. V. (Eds.). (2014). RDF schema 1.1. Retrieved from the W3C Web site: http://www.w3.org/TR/2014/REC-rdf-schema-20140225/.

Choi, H., Son, J., Cho, Y., Sung, M. K., & Chung, Y. D. (2009). SPIDER: a system for scalable, parallel/distributed evaluation of large-scale RDF data. Paper presented at the Proceedings of the 18th ACM conference on Information and knowledge management.

Cloud Security Alliance. (2013). Big Data Analytics for Security Intelligence. Big Data Working Group.

Connolly, T., & Begg, C. (2015). Database Systems: A Practical Approach to Design, Implementation, and Management (6th Edition ed.): Pearson.

De Mauro, A., Greco, M., & Grimaldi, M. (2015). What is big data? A consensual definition and a review of key research topics. Paper presented at the AIP Conference Proceedings.

Fadzil, A. F. A., Khalid, N. E. A., & Manaf, M. (2012). Performance of scalable off-the-shelf hardware for data-intensive parallel processing using MapReduce. Paper presented at the Computing and Convergence Technology (ICCCT), 2012 7th International Conference on.

Firat, M., & Kuzu, A. (2011). Semantic web for e-learning bottlenecks: disorientation and cognitive overload. International Journal of Web & Semantic Technology, 2(4), 55.

Galarraga, L., Hose, K., & Schenkel, R. (2014). Partout: a distributed engine for efficient RDF processing. Paper presented at the Proceedings of the 23rd International Conference on World Wide Web.

Hu, H., Wen, Y., Chua, T., & Li, X. (2014). Toward Scalable Systems for Big Data Analytics: A Technology Tutorial. Practical Innovation, Open Solution, 2, 652-687. doi:10.1109/ACCESS.2014.2332453

Hu, P., & Dai, W. (2014). Enhancing fault tolerance based on Hadoop cluster. International Journal of Database Theory and Application, 7(1), 37-48.

Huang, J., Abadi, D. J., & Ren, K. (2011). Scalable SPARQL querying of large RDF graphs. Proceedings of the VLDB Endowment, 4(11), 1123-1134.

Husain, M., McGlothlin, J., Masud, M. M., Khan, L., & Thuraisingham, B. M. (2011). Heuristics-based query processing for large RDF graphs using cloud computing. IEEE transactions on knowledge and data engineering, 23(9), 1312-1327.

Husain, M. F., Khan, L., Kantarcioglu, M., & Thuraisingham, B. (2010). Data intensive query processing for large RDF graphs using cloud computing tools. Paper presented at the Cloud Computing (CLOUD), 2010 IEEE 3rd International Conference on.

Inukollu, V. N., Arsi, S., & Ravuri, S. R. (2014). Security Issues Associated with Big Data in Cloud Computing. International Journal of Network Security & Its Applications, 6(3), 45. doi:10.5121/ijnsa.2014.6304

Jinquan, D., Jie, H., Shengsheng, H., Yan, L., & Yuanhao, S. (2012). The Hadoop Stack: New Paradigm for Big Data Storage and Processing. Intel Technology Journal, 16(4), 92-110.

Khan, N., Yaqoob, I., Hashem, I. A. T., Inayat, Z., Ali, M. W. K., Alam, M., . . . Gani, A. (2014). Big Data: Survey, Technologies, Opportunities, and Challenges. The Scientific World Journal, 1-18. doi:10.1155/2014/712826

Konstantinou, N., Spanos, D.-E., Stavrou, P., & Mitrou, N. (2010). Technically approaching the semantic web bottleneck. International Journal of Web Engineering and Technology, 6(1), 83-111.

Krishnan, K. (2013). Data warehousing in the age of big data: Newnes.

Lee, K.-H., Lee, Y.-J., Choi, H., Chung, Y. D., & Moon, B. (2012). Parallel data processing with MapReduce: a survey. ACM SIGMOD Record, 40(4), 11-20.

Mishra, B. S. P., Dehuri, S., & Kim, E. (2016). Techniques and Environments for Big Data Analysis: Parallel, Cloud, and Grid Computing (Vol. 17): Springer.

Modoni, G. E., Sacco, M., & Terkaj, W. (2014). A survey of RDF store solutions. Paper presented at the Engineering, Technology and Innovation (ICE), 2014 International ICE Conference on.

Myung, J., Yeon, J., & Lee, S.-g. (2010). SPARQL basic graph pattern processing with iterative MapReduce. Paper presented at the Proceedings of the 2010 Workshop on Massive Data Analytics on the Cloud.

Neumann, T., & Weikum, G. (2008). RDF-3X: a RISC-style engine for RDF. Proceedings of the VLDB Endowment, 1(1), 647-659.

Nicolaidis, I., & Iniewski, K. (2017). Building Sensor Networks: CRC Press.

Polato, I., Ré, R., Goldman, A., & Kon, F. (2014). A comprehensive view of Hadoop research—A systematic literature review. Journal of Network and Computer Applications, 46, 1-25.

Przyjaciel-Zablocki, M., Schätzle, A., Hornung, T., & Lausen, G. (2011). Rdfpath: Path query processing on large rdf graphs with mapreduce. Paper presented at the Extended Semantic Web Conference.

Przyjaciel-Zablocki, M., Schätzle, A., Skaley, E., Hornung, T., & Lausen, G. (2013). Map-side merge joins for scalable SPARQL BGP processing. Paper presented at the Cloud Computing Technology and Science (CloudCom), 2013 IEEE 5th International Conference on.

Punnoose, R., Crainiceanu, A., & Rapp, D. (2012). Rya: a scalable RDF triple store for the clouds. Paper presented at the Proceedings of the 1st International Workshop on Cloud Intelligence.

Punnoose, R., Crainiceanu, A., & Rapp, D. (2015). SPARQL in the cloud using Rya. Information Systems, 48, 181-195.

Sakr, S., & Gaber, M. (2014). Large Scale and big data: Processing and Management: CRC Press.

Schätzle, A., Przyjaciel-Zablocki, M., Hornung, T., & Lausen, G. (2013). PigSPARQL: a SPARQL query processing baseline for big data. Paper presented at the Proceedings of the 12th International Semantic Web Conference (Posters & Demonstrations Track)-Volume 1035.

Schätzle, A., Przyjaciel-Zablocki, M., Neu, A., & Lausen, G. (2014). Sempala: interactive SPARQL query processing on hadoop. Paper presented at the International Semantic Web Conference.

Sun, Y., & Jara, A. J. (2014). An extensible and active semantic model of information organizing for the Internet of Things. Personal and Ubiquitous Computing, 18(8), 1821-1833. doi:10.1007/s00779-014-0786-z

Tian, Y., Du, J., Wang, H., Ni, Y., & Yu, Y. (2012). HadoopRDF: A scalable RDF data analysis system. 8th ICIC, 633-641.

Tiwana, A., & Balasubramaniam, R. (2001). Integrating knowledge on the web. IEEE Internet Computing, 5(3), 32-39.

W3C. (2016). RDF Store Benchmarking. Retrieved from https://www.w3.org/wiki/RdfStoreBenchmarking.

White, T. (2012). Hadoop: The definitive guide. O'Reilly Media, Inc.

Yang, H.-c., Dasdan, A., Hsiao, R.-L., & Parker, D. S. (2007). Map-reduce-merge: simplified relational data processing on large clusters. Paper presented at the Proceedings of the 2007 ACM SIGMOD international conference on Management of data.

Zeng, K., Yang, J., Wang, H., Shao, B., & Wang, Z. (2013). A distributed graph engine for web scale RDF data. Paper presented at the Proceedings of the VLDB Endowment.

Various Efforts to Improve Performance of Incremental Computation

Dr. Aly, O.
Computer Science

Hadoop was developed by Yahoo and Apache to run jobs over hundreds of terabytes of data (Yan, Yang, Yu, Li, & Li, 2012).  Various large corporations such as Facebook and Amazon have used Hadoop because it offers high efficiency, high scalability, and high reliability (Yan et al., 2012).  Hadoop has faced various limitations, such as its low-level programming paradigm and schema, its strictly batch processing model, time skew, and incremental computation (Alam & Ahmed, 2014).  Incremental computation is regarded as one of the major shortcomings of the Hadoop technology (Alam & Ahmed, 2014).  Efficiency in handling incremented data typically comes at the expense of losing compatibility with the programming models offered by non-incremental systems such as MapReduce, since it requires the implementation of incremental algorithms and increases the complexity of the algorithms and the code (Alam & Ahmed, 2014).  A caching technique is proposed by (Alam & Ahmed, 2014) as a solution, applied at three levels: the job, the task, and the hardware (Alam & Ahmed, 2014).

Incoop is another solution, proposed by (Bhatotia, Wieder, Rodrigues, Acar, & Pasquin, 2011).  Incoop extends Hadoop, the open-source implementation of the MapReduce programming paradigm, so that unmodified MapReduce programs can run incrementally (Bhatotia et al., 2011; Sakr & Gaber, 2014).  Incoop allows programmers to make MapReduce programs incremental automatically, without any modification to the code (Bhatotia et al., 2011; Sakr & Gaber, 2014).  Moreover, information about previously executed MapReduce tasks is recorded by Incoop so that it can be reused in subsequent MapReduce computations when possible (Bhatotia et al., 2011; Sakr & Gaber, 2014).

Incoop is not a perfect solution, and it has some shortcomings which are addressed in (Sakr & Gaber, 2014; Zhang, Chen, Wang, & Yu, 2015).  Several enhancements were implemented in Incoop, including an incremental HDFS called Inc-HDFS, a Contraction phase, and a memoization-aware scheduler (Sakr & Gaber, 2014).  Inc-HDFS provides a delta technique over the inputs of two consecutive job runs and splits the input based on its contents while maintaining compatibility with HDFS.  The Contraction phase is a new phase in the MapReduce framework that breaks Reduce tasks into smaller sub-computations forming an inverted tree, so that a small change in the input requires recomputing only the path from the corresponding leaf to the root (Sakr & Gaber, 2014).  The memoization-aware scheduler is a modified version of the Hadoop scheduler that takes advantage of the locality of memoized results (Sakr & Gaber, 2014).

Another solution, i2MapReduce, was proposed by (Zhang et al., 2015) and compared to Incoop by the same authors.  Instead of task-level re-computation, i2MapReduce performs incremental processing at the level of key-value pairs; it also supports more complex iterative computation, which is used in data mining, and reduces I/O overhead by applying various techniques (Zhang et al., 2015).  IncMR is an enhanced framework for large-scale incremental data processing (Yan et al., 2012).  It inherits the simplicity of the standard MapReduce, does not modify HDFS, and uses the same MapReduce APIs (Yan et al., 2012).  When using IncMR, all programs can perform incremental data processing without any modification (Yan et al., 2012).
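A framework-free Java sketch of the underlying delta idea, assuming hypothetical split identifiers and a stand-in map function, is shown below: a fingerprint is remembered per input split, and only splits whose content changed since the previous run are recomputed.

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.HashMap;
    import java.util.Map;

    // Minimal sketch of the delta/memoization idea behind systems such as
    // Incoop and Inc-HDFS; it is not part of any of those systems' APIs.
    public class IncrementalSplits {

        private final Map<String, String> previousFingerprints = new HashMap<>();
        private final Map<String, String> memoizedMapOutput = new HashMap<>();

        // splits maps a split identifier to its content.
        public Map<String, String> run(Map<String, String> splits) throws Exception {
            Map<String, String> outputs = new HashMap<>();
            for (Map.Entry<String, String> split : splits.entrySet()) {
                String fingerprint = sha256(split.getValue());
                if (fingerprint.equals(previousFingerprints.get(split.getKey()))) {
                    // Unchanged split: reuse the memoized result, skip recomputation.
                    outputs.put(split.getKey(), memoizedMapOutput.get(split.getKey()));
                } else {
                    // Changed (or new) split: recompute and update the caches.
                    String result = mapFunction(split.getValue());
                    previousFingerprints.put(split.getKey(), fingerprint);
                    memoizedMapOutput.put(split.getKey(), result);
                    outputs.put(split.getKey(), result);
                }
            }
            return outputs;
        }

        // Stand-in for the user's map logic; here it just counts tokens.
        private String mapFunction(String content) {
            return Integer.toString(content.split("\\s+").length);
        }

        private static String sha256(String content) throws Exception {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        }
    }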

In conclusion, researchers have exerted various efforts to overcome the incremental computation limitation of Hadoop, with solutions such as Incoop, Inc-HDFS, i2MapReduce, and IncMR.  Each proposed solution attempts to enhance and extend standard Hadoop so as to avoid overheads such as I/O and to increase efficiency, without increasing the complexity of the computation and without requiring any modification to the code.

References

Alam, A., & Ahmed, J. (2014). Hadoop architecture and its issues. Paper presented at the Computational Science and Computational Intelligence (CSCI), 2014 International Conference on.

Bhatotia, P., Wieder, A., Rodrigues, R., Acar, U. A., & Pasquin, R. (2011). Incoop: MapReduce for incremental computations. Paper presented at the Proceedings of the 2nd ACM Symposium on Cloud Computing.

Sakr, S., & Gaber, M. (2014). Large Scale and big data: Processing and Management: CRC Press.

Yan, C., Yang, X., Yu, Z., Li, M., & Li, X. (2012). IncMR: Incremental data processing based on MapReduce. Paper presented at the Cloud Computing (CLOUD), 2012 IEEE 5th International Conference on.

Zhang, Y., Chen, S., Wang, Q., & Yu, G. (2015). i^2MapReduce: Incremental MapReduce for Mining Evolving Big Data. IEEE transactions on knowledge and data engineering, 27(7), 1906-1919.

The Weakness of the Initial MapReduce Framework in Iterative Computation

Dr. Aly, O.
Computer Science

The standard MapReduce framework faces the challenge of iterative computation, which is required in various operations such as data mining, PageRank, network traffic analysis, graph analysis, and social network analysis (Bu, Howe, Balazinska, & Ernst, 2010; Sakr & Gaber, 2014).  These analysis techniques require the data to be processed iteratively until the computation satisfies a convergence or stopping condition (Bu et al., 2010; Sakr & Gaber, 2014).  Because of this limitation, such iterative processing is implemented and executed manually using a driver program when the standard MapReduce framework is used (Bu et al., 2010; Sakr & Gaber, 2014).  However, the manual implementation and execution of such iterative computation has two major problems (Bu et al., 2010; Sakr & Gaber, 2014).  The first is that unchanged data is reloaded from iteration to iteration, wasting input/output (I/O), network bandwidth, and CPU resources (Bu et al., 2010; Sakr & Gaber, 2014).  The second is the overhead of detecting the termination condition, i.e., that the output of the application has not changed for two consecutive iterations and a fixed point has been reached (Bu et al., 2010; Sakr & Gaber, 2014).  This termination check may require an extra MapReduce job on each iteration, which incurs overhead for scheduling extra tasks, reading extra data from disk, and moving data across the network (Bu et al., 2010; Sakr & Gaber, 2014).
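A minimal Java sketch of such a hand-written driver loop is shown below; the identity mapper and reducer and the CHANGED_RECORDS counter are hypothetical placeholders for an application's real update logic, and each iteration re-reads its full input from HDFS, which is exactly the first overhead described above.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class IterativeDriver {

        // Counter a real reducer would increment whenever a value changes
        // between iterations; zero changed records means a fixed point.
        enum Convergence { CHANGED_RECORDS }

        // Placeholder identity mapper over tab-separated key/value lines.
        public static class IterationMapper extends Mapper<Object, Text, Text, Text> {
            @Override
            protected void map(Object key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                String[] parts = value.toString().split("\t", 2);
                if (parts.length == 2) {
                    ctx.write(new Text(parts[0]), new Text(parts[1]));
                }
            }
        }

        // Placeholder reducer; a real job would compare each value against the
        // previous iteration and increment CHANGED_RECORDS only on change.
        public static class IterationReducer extends Reducer<Text, Text, Text, Text> {
            @Override
            protected void reduce(Text key, Iterable<Text> values, Context ctx)
                    throws IOException, InterruptedException {
                for (Text v : values) {
                    ctx.write(key, v);
                }
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            String input = args[0];
            long changed = Long.MAX_VALUE;
            int iteration = 0;

            while (changed > 0 && iteration < 30) {             // fixpoint or iteration cap
                String output = args[1] + "/iter-" + iteration;

                Job job = Job.getInstance(conf, "iteration-" + iteration);
                job.setJarByClass(IterativeDriver.class);
                job.setMapperClass(IterationMapper.class);
                job.setReducerClass(IterationReducer.class);
                job.setOutputKeyClass(Text.class);
                job.setOutputValueClass(Text.class);
                FileInputFormat.addInputPath(job, new Path(input));   // re-reads full input each round
                FileOutputFormat.setOutputPath(job, new Path(output));

                if (!job.waitForCompletion(true)) {
                    System.exit(1);
                }

                changed = job.getCounters().findCounter(Convergence.CHANGED_RECORDS).getValue();
                input = output;                                  // this round's output feeds the next
                iteration++;
            }
        }
    }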

Researchers have exerted effort to solve the iterative computation problem.  HaLoop was proposed by (Bu et al., 2010), Twister by (Ekanayake et al., 2010), and Pregel by (Malewicz et al., 2010).  One solution to the iterative computation limitation, used in HaLoop (Bu et al., 2010) and Twister (Ekanayake et al., 2010), is to identify and keep invariant data across iterations so that repeatedly reading unnecessary data is avoided.  HaLoop (Bu et al., 2010) implements two caching functionalities (Bu et al., 2010; Sakr & Gaber, 2014).  The first caches the invariant data of the first iteration and reuses it in later iterations; the second caches the reducer outputs, which makes the check for the fixpoint more efficient without adding an extra MapReduce job (Bu et al., 2010; Sakr & Gaber, 2014).

The Pregel solution by (Malewicz et al., 2010) is more focused on graphs and was inspired by the Bulk Synchronous Parallel model (Malewicz et al., 2010).  It provides synchronous computation and communication (Malewicz et al., 2010) and uses an explicit messaging approach to acquire remote information rather than replicating remote values locally (Malewicz et al., 2010).  Mahout is another solution introduced to handle iterative computing by grouping a series of chained jobs to obtain the results (Polato, Ré, Goldman, & Kon, 2014); in Mahout's approach, the result of each job is pushed into the next job until the final results are obtained (Polato et al., 2014).  iHadoop, proposed by (Elnikety, Elsayed, & Ramadan, 2011), schedules iterations asynchronously and connects the output of one iteration to the next, allowing both to process their data concurrently (Elnikety et al., 2011).  The iHadoop task scheduler exploits inter-iteration data locality by scheduling tasks that exhibit a producer/consumer relation on the same physical machine, allowing fast transfer of local data (Elnikety et al., 2011).

Apache Hadoop and Apache Spark are the most popular technologies for iterative computation, with Spark relying on an in-memory data processing engine (Liang, Li, Wang, & Hu, 2011).  Hadoop expresses an iterative computation as a series of MapReduce jobs, where each job independently reads its data from the Hadoop Distributed File System (HDFS), processes it, and writes the results back to HDFS (Liang et al., 2011).  Dacoop was proposed by Liang et al. (2011) as an extension to Hadoop for data-iterative applications; it caches repeatedly processed data and introduces a shared memory-based data cache mechanism (Liang et al., 2011).  iMapReduce, proposed by Zhang, Gao, Gao, and Wang (2012), supports iterative processing by keeping map and reduce tasks persistent throughout the whole iterative process and defining how these persistent tasks are terminated (Zhang et al., 2012).  iMapReduce avoids three major overheads: the job startup overhead, by building an internal loop from reduce to map within a job; the communication overhead, by separating the iterated state data from the static structure data; and the synchronization overhead, by allowing asynchronous map task execution (Zhang et al., 2012).
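
As a brief illustration of the in-memory approach, the following PySpark sketch (assuming pyspark is installed and a local Spark runtime is available; the data and update rule are invented for the example) caches an invariant dataset once and reuses it across the iterations of a driver loop, instead of re-reading it from HDFS on every pass.

# A minimal PySpark sketch: the cached RDD holds the loop-invariant data, and
# the driver loop reuses it on each iteration with an updated parameter.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("iterative-sketch")
         .master("local[*]")
         .getOrCreate())
sc = spark.sparkContext

points = sc.parallelize([1.0, 4.0, 9.0, 16.0]).cache()   # invariant data kept in memory

estimate = 0.0
for _ in range(20):                                       # driver loop over cached data
    gradient = points.map(lambda v: estimate - v).mean()  # reuses the cached RDD
    new_estimate = estimate - 0.5 * gradient
    if abs(new_estimate - estimate) < 1e-6:               # convergence (fixpoint) check
        break
    estimate = new_estimate

spark.stop()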

References

Bu, Y., Howe, B., Balazinska, M., & Ernst, M. D. (2010). HaLoop: Efficient iterative data processing on large clusters. Proceedings of the VLDB Endowment, 3(1-2), 285-296.

Ekanayake, J., Li, H., Zhang, B., Gunarathne, T., Bae, S.-H., Qiu, J., & Fox, G. (2010). Twister: a runtime for iterative MapReduce. Paper presented at the Proceedings of the 19th ACM international symposium on high performance distributed computing.

Elnikety, E., Elsayed, T., & Ramadan, H. E. (2011). iHadoop: asynchronous iterations for MapReduce. Paper presented at the Cloud Computing Technology and Science (CloudCom), 2011 IEEE Third International Conference on.

Liang, Y., Li, G., Wang, L., & Hu, Y. (2011). Dacoop: Accelerating data-iterative applications on Map/Reduce cluster. Paper presented at the Parallel and Distributed Computing, Applications, and Technologies (PDCAT), 2011 12th International Conference on.

Malewicz, G., Austern, M. H., Bik, A. J., Dehnert, J. C., Horn, I., Leiser, N., & Czajkowski, G. (2010). Pregel: a system for large-scale graph processing. Paper presented at the Proceedings of the 2010 ACM SIGMOD International Conference on Management of data.

Polato, I., Ré, R., Goldman, A., & Kon, F. (2014). A comprehensive view of Hadoop research—A systematic literature review. Journal of Network and Computer Applications, 46, 1-25.

Sakr, S., & Gaber, M. (2014). Large Scale and big data: Processing and Management: CRC Press.

Zhang, Y., Gao, Q., Gao, L., & Wang, C. (2012). iMapReduce: A distributed computing framework for iterative computation. Journal of Grid Computing, 10(1), 47-68.

Hadoop and MapReduce


Basic Concepts of the MapReduce Framework

In 2004, Google introduced the MapReduce framework as a parallel processing framework for large sets of data (Bakshi, 2012; Fadzil, Khalid, & Manaf, 2012; White, 2012).  The MapReduce framework gained much popularity because it hides the sophisticated operations of parallel processing from the programmer (Fadzil et al., 2012).  This enthusiasm led to the introduction of several MapReduce implementations, such as Hadoop (Fadzil et al., 2012).

The capability of the MapReduce framework was recognized by different research areas such as data warehousing, data mining, and bioinformatics (Fadzil et al., 2012).  The MapReduce framework consists of two main layers: the Distributed File System (DFS) layer, which stores the data, and the MapReduce layer, which processes it (Lee, Lee, Choi, Chung, & Moon, 2012; Mishra, Dehuri, & Kim, 2016; Sakr & Gaber, 2014).  The DFS is a major feature of the MapReduce framework (Fadzil et al., 2012).
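
The programming model of the data-processing layer can be shown with a short, framework-free Python sketch: a map function emits (key, value) pairs, a shuffle step groups the pairs by key, and a reduce function aggregates each group.  In a real deployment, the framework distributes these steps across the cluster and the DFS supplies the input; the word-count logic and records below are illustrative only.

# A framework-free illustration of the map, shuffle, and reduce steps.
from collections import defaultdict

def map_phase(record):
    # emit a (word, 1) pair for every word in one input record
    return [(word.lower(), 1) for word in record.split()]

def reduce_phase(key, values):
    # aggregate all counts emitted for the same word
    return key, sum(values)

records = ["patient admitted", "patient discharged", "Patient admitted"]

grouped = defaultdict(list)            # shuffle: group intermediate pairs by key
for record in records:
    for key, value in map_phase(record):
        grouped[key].append(value)

results = [reduce_phase(key, values) for key, values in grouped.items()]
print(results)                         # [('patient', 3), ('admitted', 2), ('discharged', 1)]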

The MapReduce framework uses large clusters of low-cost commodity hardware to lower the cost (Bakshi, 2012; H. Hu, Wen, Chua, & Li, 2014; Inukollu, Arsi, & Ravuri, 2014; Khan et al., 2014; Krishnan, 2013; Mishra et al., 2016; Sakr & Gaber, 2014; White, 2012).  The cluster is organized as "Redundant Arrays of Independent (and inexpensive) Nodes (RAIN)," whose components are loosely coupled, so that when any node goes down there is no negative impact on the MapReduce job (Sakr & Gaber, 2014; Yang, Dasdan, Hsiao, & Parker, 2007).  The framework achieves fault tolerance through replication, allowing any crashed node to be replaced by another node without affecting the currently running job (P. Hu & Dai, 2014; Sakr & Gaber, 2014).  It also provides automatic parallelization of execution, which makes MapReduce highly parallel while keeping those details abstracted from the programmer (P. Hu & Dai, 2014; Sakr & Gaber, 2014).

The Top Three Features of Hadoop

The Hadoop Distributed File System (HDFS) is one of the major components of the Hadoop framework; it stores large files (Bao, Ren, Zhang, Zhang, & Luo, 2012; CSA, 2013; De Mauro, Greco, & Grimaldi, 2015) and allows access to data scattered over multiple nodes without exposing the complexity of the environment (Bao et al., 2012; De Mauro et al., 2015).  The MapReduce programming model is another major component of the Hadoop framework (Bao et al., 2012; CSA, 2013; De Mauro et al., 2015); it is designed to implement distributed and parallel algorithms efficiently (De Mauro et al., 2015).  HBase is the third component of the Hadoop framework (Bao et al., 2012).  HBase is a NoSQL (Not only SQL) database developed on top of HDFS (Bao et al., 2012).
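
To make the HBase component concrete, the following sketch uses the third-party happybase client, which talks to HBase through its Thrift gateway (an assumption about tooling, and the gateway must be running).  The host name, table name, column family, and row-key format are hypothetical examples, not part of the proposed design.

# A brief HBase read/write sketch via happybase; all names below are examples.
import happybase

connection = happybase.Connection('hbase-thrift-host')    # hypothetical host
table = connection.table('patient_records')               # hypothetical table

# store a sparse, schema-flexible row keyed by patient id and timestamp
table.put(b'patient-001|2018-05-01', {
    b'vitals:heart_rate': b'72',
    b'vitals:blood_pressure': b'120/80',
})

row = table.row(b'patient-001|2018-05-01')                 # point lookup by row key
print(row[b'vitals:heart_rate'])

connection.close()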

The key features of Hadoop include scalability, flexibility, cost efficiency, and fault tolerance (H. Hu et al., 2014; Khan et al., 2014; Mishra et al., 2016; Polato, Ré, Goldman, & Kon, 2014; Sakr & Gaber, 2014).  Hadoop allows the nodes in the cluster to scale up and down based on the computation requirements, with no change in the data formats (H. Hu et al., 2014; Polato et al., 2014).  It brings massively parallel computation to commodity hardware, decreasing the cost per terabyte of storage and keeping large-scale processing affordable as the data volume grows (H. Hu et al., 2014).  Hadoop also offers flexibility: it is not tied to a schema, so it can utilize structured, semi-structured, and unstructured data and aggregate data from multiple sources (H. Hu et al., 2014; Polato et al., 2014).  Finally, Hadoop tolerates node crashes without interrupting data processing; it provides a fault-tolerant environment in which data and computation can be recovered without any negative impact on the running job (H. Hu et al., 2014; Polato et al., 2014; White, 2012).
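
The schema flexibility noted above can be sketched in a few lines of Python: the same map logic ingests semi-structured JSON records from different sources even though their fields differ, without a schema being declared up front.  The sources and field names are illustrative only.

# Illustrative only: records from different sources carry different fields.
import json

records = [
    '{"source": "ehr", "patient": "p1", "heart_rate": 72}',
    '{"source": "wearable", "patient": "p1", "steps": 4200}',
    '{"source": "lab", "patient": "p2", "glucose": 5.4}',
]

def map_record(line):
    doc = json.loads(line)
    patient = doc.get("patient", "unknown")
    # emit (patient, metric, value) for whatever metrics this record happens to carry
    return [(patient, key, value) for key, value in doc.items()
            if key not in ("source", "patient")]

for line in records:
    for triple in map_record(line):
        print(triple)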

Pros and Cons of MapReduce Framework

MapReduce was introduced to solve the problem of parallel processing of large data sets in a distributed environment, which previously required manual management of the hardware resources (Fadzil et al., 2012; Sakr & Gaber, 2014).  The complexity of parallelization is addressed by two techniques: the Map/Reduce programming technique and the Distributed File System (DFS) (Fadzil et al., 2012; Sakr & Gaber, 2014).  A parallel framework must be reliable, ensuring sound resource management in a distributed environment built from off-the-shelf hardware, and scalable enough to support future processing requirements (Fadzil et al., 2012).  Earlier frameworks such as the Message Passing Interface (MPI) had reliability and fault-tolerance issues when processing large data sets (Fadzil et al., 2012).  The MapReduce framework covers both categories of scalability: structural scalability and load scalability (Fadzil et al., 2012).  It addresses structural scalability through the DFS, which forms large virtual storage by adding off-the-shelf hardware, and it addresses load scalability by increasing the number of nodes to improve performance (Fadzil et al., 2012).

However, the earlier versions of the MapReduce framework faced challenges.  Among these are the join operation and the lack of support for aggregate functions needed to combine multiple datasets in one task (Sakr & Gaber, 2014).  Another limitation of the standard MapReduce framework lies in iterative processing, which is required by analysis techniques such as the PageRank algorithm, recursive relational queries, and social network analysis (Sakr & Gaber, 2014).  The standard MapReduce also does not share the execution of common work across jobs to reduce the overall amount of computation (Sakr & Gaber, 2014).  A further limitation is the lack of support for data indexes and column storage: input data can only be scanned sequentially, which hurts query performance (Sakr & Gaber, 2014).  Moreover, many have argued that MapReduce is not the optimal solution for structured data.  Its shared-nothing architecture supports scalability (Bakshi, 2012; Jinquan, Jie, Shengsheng, Yan, & Yuanhao, 2012; Sakr & Gaber, 2014; White, 2012) and the processing of large unstructured data sets (Bakshi, 2012), but MapReduce still has limitations in performance and efficiency (Lee et al., 2012).
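
The join limitation, in particular, forced developers to hand-code patterns such as the reduce-side join: each map output is tagged with its source dataset, and the reducer pairs up the records that share a key.  The following self-contained Python sketch mimics that pattern on two invented datasets; a real implementation would run the tagging in mappers and the pairing in reducers.

# A compact, framework-free sketch of the reduce-side join pattern.
from collections import defaultdict

patients = [("p1", "Alice"), ("p2", "Bob")]                 # dataset A: (key, name)
visits = [("p1", "2018-05-01"), ("p1", "2018-06-02"),       # dataset B: (key, date)
          ("p2", "2018-05-09")]

grouped = defaultdict(lambda: {"patients": [], "visits": []})

for key, name in patients:          # "map" side: tag each record with its source
    grouped[key]["patients"].append(name)
for key, date in visits:
    grouped[key]["visits"].append(date)

joined = []                         # "reduce" side: pair the tagged records per key
for key, buckets in grouped.items():
    for name in buckets["patients"]:
        for date in buckets["visits"]:
            joined.append((key, name, date))

print(joined)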

References

Bakshi, K. (2012, 3-10 March 2012). Considerations for big data: Architecture and approach. Paper presented at the Aerospace Conference, 2012 IEEE.

Bao, Y., Ren, L., Zhang, L., Zhang, X., & Luo, Y. (2012). Massive sensor data management framework in cloud manufacturing based on Hadoop. Paper presented at the Industrial Informatics (INDIN), 2012 10th IEEE International Conference on.

CSA, C. S. A. (2013). Big Data Analytics for Security Intelligence. Big Data Working Group.

De Mauro, A., Greco, M., & Grimaldi, M. (2015). What is big data? A consensual definition and a review of key research topics. Paper presented at the AIP Conference Proceedings.

Fadzil, A. F. A., Khalid, N. E. A., & Manaf, M. (2012). Performance of scalable off-the-shelf hardware for data-intensive parallel processing using MapReduce. Paper presented at the Computing and Convergence Technology (ICCCT), 2012 7th International Conference on.

Hu, H., Wen, Y., Chua, T.-S., & Li, X. (2014). Toward scalable systems for big data analytics: A technology tutorial. IEEE Access, 2, 652-687.

Hu, P., & Dai, W. (2014). Enhancing fault tolerance based on Hadoop cluster. International Journal of Database Theory and Application, 7(1), 37-48.

Inukollu, V. N., Arsi, S., & Ravuri, S. R. (2014). Security issues associated with big data in cloud computing. International Journal of Network Security & Its Applications, 6(3), 45.

Jinquan, D., Jie, H., Shengsheng, H., Yan, L., & Yuanhao, S. (2012). The Hadoop Stack: New Paradigm for Big Data Storage and Processing. Intel Technology Journal, 16(4), 92-110.

Khan, N., Yaqoob, I., Hashem, I. A. T., Inayat, Z., Mahmoud Ali, W. K., Alam, M., . . . Gani, A. (2014). Big Data: Survey, Technologies, Opportunities, and Challenges. The Scientific World Journal, 2014.

Krishnan, K. (2013). Data warehousing in the age of big data: Newnes.

Lee, K.-H., Lee, Y.-J., Choi, H., Chung, Y. D., & Moon, B. (2012). Parallel data processing with MapReduce: a survey. ACM SIGMOD Record, 40(4), 11-20.

Mishra, B. S. P., Dehuri, S., & Kim, E. (2016). Techniques and Environments for Big Data Analysis: Parallel, Cloud, and Grid Computing (Vol. 17): Springer.

Polato, I., Ré, R., Goldman, A., & Kon, F. (2014). A comprehensive view of Hadoop research—A systematic literature review. Journal of Network and Computer Applications, 46, 1-25.

Sakr, S., & Gaber, M. (2014). Large Scale and big data: Processing and Management: CRC Press.

White, T. (2012). Hadoop: The definitive guide: O'Reilly Media, Inc.

Yang, H.-c., Dasdan, A., Hsiao, R.-L., & Parker, D. S. (2007). Map-reduce-merge: simplified relational data processing on large clusters. Paper presented at the Proceedings of the 2007 ACM SIGMOD international conference on Management of data.