The Impact of Big Data Analytics (BDA) on Artificial Intelligence (AI)

Dr. O. Aly
Computer Science

The purpose of this discussion is to examine the future impact of Big Data Analytics on Artificial Intelligence.  The discussion also provides an example of the use of AI in Big Data generation and analysis.  It begins with artificial intelligence, followed by advanced levels of big data analysis.  The impact of Big Data (BD) on artificial intelligence is then discussed, with examples showing how artificial intelligence is empowered by BD.

Artificial Intelligence

Artificial Intelligence (AI) has eight definitions laid out across two dimensions of thinking and acting (Table 1) (Russell & Norvig, 2016). The definitions on top are concerned with thought processes and reasoning, while those on the bottom address behavior.  The definitions on the left measure success in terms of fidelity to human performance, while the definitions on the right measure against an ideal performance standard called “rationality” (Russell & Norvig, 2016).  A system is “rational” if it does the “right thing” given what it knows.

Table 1:  Some Definitions of Artificial Intelligence, Organized Into Four Categories (Russell & Norvig, 2016).

Patrizio (2018) defined artificial intelligence as a computational technique that allows machines to perform cognitive functions, such as acting on or reacting to input, much as humans do.  Traditional computing applications also react to data, but their reactions and responses must be hand-coded, and such an application cannot react to unexpected results (Patrizio, 2018).  Artificial intelligence systems, by contrast, are continuously in flux, changing their behavior to accommodate changes in the results and modifying their reactions (Patrizio, 2018).  An artificial intelligence-enabled system is designed to analyze and interpret data and to address issues based on those interpretations (Patrizio, 2018).  Using machine learning algorithms, the computer learns once how to act or react to a particular result and knows to act the same way in the future (Patrizio, 2018).  IBM has invested $1 billion in artificial intelligence through the launch of its IBM Watson Group (Power, 2015).  Health care is the most significant application area of Watson (Power, 2015).

Advanced Level of Big Data Analysis

The fundamental analytics techniques include descriptive analytics, which breaks big data down into smaller, more useful pieces of information about what has happened and focuses on the insight gained from historical data to provide trending information on past or current events (Liang & Kelemen, 2016).  Advanced-level computational tools focus on predictive analytics, which determines patterns and predicts future outcomes and trends by quantifying the effects of future decisions to advise on possible outcomes (Liang & Kelemen, 2016).  Prescriptive analytics functions as a decision-support tool, exploring a set of possible actions and proposing actions based on descriptive and predictive analysis of complex data.  Advanced-level computational techniques also include real-time analytics.

Advanced-level data analysis includes various techniques.  Real-time analytics and meta-analysis can be used to integrate multiple data sources (Liang & Kelemen, 2016).  Hierarchical or multi-level models can be used for spatial data, and longitudinal and mixed models for real-time or dynamic temporal data rather than static data (Liang & Kelemen, 2016).  Data mining and pattern recognition can be used for trend and pattern detection (Liang & Kelemen, 2016).  Natural language processing (NLP) can be used for text mining, and machine learning, statistical learning, and Bayesian learning support automatic extraction of data and variables (Liang & Kelemen, 2016).  Artificial intelligence with automatic ensemble techniques and intelligent agents, together with deep learning methods such as neural networks, support vector machines, and dynamic state-space models, can be used for automated analysis and information retrieval (Liang & Kelemen, 2016).  Causal inference and Bayesian approaches can be used for probabilistic interpretations (Liang & Kelemen, 2016).

Big Data Empowers Artificial Intelligence

            The trend of artificial intelligence implementation is increasing.  It is anticipated that 70% of enterprises will implement artificial intelligence (AI) by the end of 2018, up from 40% in 2016 and 51% in 2017 (Mills, 2018).  A survey of C-level executive decision-makers conducted by NewVantage Partners found that 97.2% of executives stated that their companies are investing in, building, or launching Big Data and artificial intelligence initiatives (Bean, 2018; Patrizio, 2018).  The same survey found that 76.5% of executives feel that artificial intelligence and Big Data are becoming closely interconnected and that the availability of data is empowering the artificial intelligence and cognitive initiatives within their organizations (Patrizio, 2018).

Artificial intelligence, particularly machine learning, requires data to develop its intelligence (Patrizio, 2018).  The data used in artificial intelligence and machine learning should already be cleaned, with extraneous, duplicate, and unnecessary data removed, which is regarded as the first big step when combining Big Data and artificial intelligence (Patrizio, 2018).  The CERN data center has accumulated over 200 petabytes of filtered data (Kersting & Meyer, 2018). Machine learning and artificial intelligence can take advantage of this filtered data, leading to many breakthroughs (Kersting & Meyer, 2018).  Examples of such breakthroughs include genomic and proteomic experiments that enable personalized medicine (Kersting & Meyer, 2018), historical climate data that can be used to understand global warming and better predict the weather (Kersting & Meyer, 2018), and massive amounts of sensor network readings and hyperspectral images of plants used to identify drought conditions and gain insights into plant growth and development (Kersting & Meyer, 2018).
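The "first big step" of cleaning described above can be illustrated with a minimal sketch in Python using pandas; the file and column names are hypothetical and are not taken from the cited sources.

```python
# Minimal data-cleaning sketch (hypothetical file and column names): remove
# duplicate, incomplete, and unnecessary records before the data is handed
# to a machine learning model.
import pandas as pd

raw = pd.read_csv("sensor_readings.csv")                # hypothetical raw input
clean = (raw.drop_duplicates()                          # remove duplicate rows
            .dropna(subset=["sensor_id", "value"])      # drop incomplete records
            .drop(columns=["debug_flag"], errors="ignore"))  # drop an unnecessary column
clean.to_parquet("sensor_readings_clean.parquet")       # persist the filtered data
```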

Multiple technologies such as artificial intelligence, machine learning, and data mining techniques have been used together to extract the maximum value from Big Data (Luo, Wu, Gopukumar, & Zhao, 2016). Artificial intelligence, machine learning, and data mining have been used in healthcare (Luo et al., 2016).  Computational tools such as neural networks, genetic algorithms, support vector machines, and case-based reasoning have been used in the prediction (Mishra, Dehuri, & Kim, 2016; Qin, 2012) of stock markets and other financial markets (Qin, 2012).

AI has impacted the business world through social media and the large volume of data collected from it (Mills, 2018).  For instance, personalized content delivered in real time is increasingly used to enhance sales opportunities (Mills, 2018).  Artificial intelligence makes use of effective behavioral targeting methodologies (Mills, 2018).  Big Data improves customer service by making it proactive and allows companies to build customer-responsive products (Mills, 2018).  Big Data Analytics (BDA) assists in predicting what customers want out of a product (Mills, 2018).  BDA has also been playing a significant role in fraud prevention using artificial intelligence (Mills, 2018).  Artificial intelligence techniques such as video recognition, natural language processing, speech recognition, machine learning engines, and automation have been used to help businesses protect against sophisticated fraud schemes (Mills, 2018).

The healthcare industry has utilized machine learning to transform large volumes of medical data into actionable knowledge, performing predictive and prescriptive analytics (Palanisamy & Thirunavukarasu, 2017).  Machine learning platforms utilize artificial intelligence to develop sophisticated algorithms that process massive structured and unstructured datasets and perform advanced analytics (Palanisamy & Thirunavukarasu, 2017).  For a distributed environment, Apache Mahout, an open-source machine learning library, integrates with Hadoop to facilitate the execution of scalable machine learning algorithms, offering techniques such as recommendation, classification, and clustering (Palanisamy & Thirunavukarasu, 2017).
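As a hedged illustration of the idea above (scalable clustering over cluster-resident health data), the sketch below uses Spark MLlib rather than Mahout itself, since Mahout is a Java/Scala library; Spark also runs on Hadoop clusters. The paths and column names are hypothetical.

```python
# Illustrative sketch only: clustering patient records at scale with Spark MLlib
# (stand-in for a Mahout-style workflow). Dataset path and columns are hypothetical.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("patient-clustering").getOrCreate()
vitals = spark.read.parquet("hdfs:///data/ehr/vitals.parquet")   # hypothetical dataset

features = VectorAssembler(inputCols=["age", "bmi", "systolic_bp"],
                           outputCol="features").transform(vitals)
model = KMeans(k=5, seed=42, featuresCol="features").fit(features)
model.transform(features).select("patient_id", "prediction").show(5)
```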

Conclusion

Big Data has attracted the attention of various sectors, including academia, healthcare, and even government. Artificial intelligence has been around for some time.  Big Data offers various advantages to organizations, from increasing sales to reducing costs to improving health care.  Artificial intelligence also has its advantages, providing real-time analysis that reacts continuously to changes.  The use of Big Data has empowered artificial intelligence.  Various industries, such as healthcare, are taking advantage of Big Data and artificial intelligence.  Their growing adoption increasingly demonstrates that businesses realize the importance of artificial intelligence in the age of Big Data, and the importance of Big Data's role in the artificial intelligence domain.

References

Bean, R. (2018). How Big Data and AI Are Driving Business Innovation in 2018. Retrieved from https://sloanreview.mit.edu/article/how-big-data-and-ai-are-driving-business-innovation-in-2018/.

Kersting, K., & Meyer, U. (2018). From Big Data to Big Artificial Intelligence? Springer.

Liang, Y., & Kelemen, A. (2016). Big Data Science and its Applications in Health and Medical Research: Challenges and Opportunities. Austin Journal of Biometrics & Biostatistics, 7(3).

Luo, J., Wu, M., Gopukumar, D., & Zhao, Y. (2016). Big data application in biomedical research and health care: a literature review. Biomedical informatics insights, 8, BII. S31559.

Mills, T. (2018). Eight Ways Big Data And AI Are Changing The Business World.

Mishra, B. S. P., Dehuri, S., & Kim, E. (2016). Techniques and Environments for Big Data Analysis: Parallel, Cloud, and Grid Computing (Vol. 17): Springer.

Palanisamy, V., & Thirunavukarasu, R. (2017). Implications of Big Data Analytics in developing Healthcare Frameworks–A review. Journal of King Saud University-Computer and Information Sciences.

Patrizio, A. (2018). Big Data vs. Artificial Intelligence.

Power, B. (2015). Artificial Intelligence Is Almost Ready for Business.

Qin, X. (2012). Making use of the big data: next generation of algorithm trading. Paper presented at the International Conference on Artificial Intelligence and Computational Intelligence.

Russell, S. J., & Norvig, P. (2016). Artificial intelligence: A modern approach. Malaysia: Pearson Education Limited.

Hadoop: Manageable Size of Data.

Dr. O. Aly
Computer Science

Abstract

The purpose of this project is to discuss how data should be handled before Hadoop can break it into manageable sizes.  The discussion begins with an overview of Hadoop, providing a brief history and the differences between Hadoop 1.x and Hadoop 2.x. It then covers the Big Data Analytics process using Hadoop, which involves six significant steps, including data preprocessing and the ETL process, where the data must be converted and cleaned before it is processed.  Before data processing, consideration must be given to data preprocessing, modeling, and schema design in Hadoop for better processing and data retrieval, because these decisions affect how data can be split among the nodes of the distributed environment, and not all tools can split the data.  This consideration begins with the data storage format, followed by Hadoop file types and the challenges that the XML and JSON formats pose in Hadoop.  The compression of the data must be considered carefully because not all compression types are “splittable.” The discussion also covers schema design considerations for HDFS and HBase, since both are used often in the Hadoop ecosystem.

Keywords: Big Data Analytics; Hadoop; Data Modelling in Hadoop; Schema Design in Hadoop.

Introduction

In the age of Big Data, dealing with datasets of terabytes and petabytes is a reality that requires specific technology, as traditional technology has been found inadequate for it (Dittrich & Quiané-Ruiz, 2012).  Hadoop was developed to store and process such large datasets efficiently and is becoming a data processing engine for Big Data (Dittrich & Quiané-Ruiz, 2012).  One of the significant advantages of Hadoop MapReduce is that it allows non-expert users to easily run analytical tasks over Big Data (Dittrich & Quiané-Ruiz, 2012). However, before the analytical process takes place, schema design and data modeling considerations must be addressed so that data processing in Hadoop can be efficient (Grover, Malaska, Seidman, & Shapira, 2015).  Hadoop requires splitting the data; some tools can split the data natively, while others cannot and require integration (Grover et al., 2015).

This project discusses these considerations to ensure an appropriate schema design for Hadoop and its components HDFS and HBase, where the data is stored in a distributed environment.   The discussion begins with an overview of Hadoop, followed by the data analytics process, and ends with the data modeling techniques and considerations for Hadoop that can assist in splitting the data appropriately for better data processing performance and better data retrieval.

Overview of Hadoop

            Google published and disclosed its MapReduce technique and implementation around 2004 (Karanth, 2014).  It also introduced the Google File System (GFS), which is associated with the MapReduce implementation.  MapReduce has since become the most common technique for processing massive data sets in parallel and distributed settings across many companies (Karanth, 2014).  In 2008, Yahoo released Hadoop as an open-source implementation of the MapReduce framework (Karanth, 2014; sas.com, 2018). Hadoop and its file system, HDFS, are inspired by Google's MapReduce and GFS (Ankam, 2016; Karanth, 2014).

Apache Hadoop is the parent project for all subsequent Hadoop projects (Karanth, 2014).  It contains three essential branches: the 0.20.1 branch, the 0.20.2 branch, and the 0.21 branch.  The 0.20.2 branch is often termed MapReduce v1.0, MRv1, or Hadoop 1.0.  Two additional releases, Hadoop-0.20-append and Hadoop-0.20-Security, introduced HDFS append and security-related features into Hadoop, respectively.  The timeline for Hadoop technology is outlined in Figure 1.


Figure 1.  Hadoop Timeline from 2003 until 2013 (Karanth, 2014).

Hadoop version 1.0 marked the inception and evolution of Hadoop as a simple MapReduce job-processing framework (Karanth, 2014).  It exceeded expectations, with wide adoption for massive data processing.  The stable 1.x release includes features such as append and security.  The Hadoop 2.0 release came out in 2013 to increase the efficiency and mileage of existing Hadoop clusters in enterprises.  Hadoop is becoming a common cluster-computing and storage platform rather than being limited to MapReduce only, and it has been moving beyond MapReduce to stay at the forefront of massive-scale data processing while facing the challenge of remaining backward compatible (Karanth, 2014).

            In Hadoop 1.x, the JobTracker was responsible for resource allocation and job execution (Karanth, 2014).  MapReduce was the only supported model, since the computing model was tied to the resources in the cluster. Yet Another Resource Negotiator (YARN) was developed to separate the concerns of resource management and application execution, which enables other application paradigms to be added to a Hadoop computing cluster. The support for diverse applications results in efficient and effective utilization of the resources and integrates well with the infrastructure of the business (Karanth, 2014).  YARN maintains backward compatibility with Hadoop 1.x APIs (Karanth, 2014).  Thus, an old MapReduce program can still execute in YARN with no code changes, but it has to be recompiled (Karanth, 2014).

            YARN abstracts out the resource management functions to form a platform layer called the ResourceManager (RM) (Karanth, 2014).  Every cluster must have an RM to keep track of cluster resource usage and activity.  The RM is also responsible for allocating resources and resolving contention among resource seekers in the cluster.  The RM utilizes a generalized resource model and is agnostic to application-specific resource needs; it does not need to know the resources corresponding to a single Map or Reduce slot (Karanth, 2014). Figure 2 shows Hadoop 1.x and Hadoop 2.x with the YARN layer.


Figure 2. Hadoop 1.x vs. Hadoop 2.x (Karanth, 2014).

Hadoop 2.x also involves various enhancements at the storage layer.   These include the high-availability feature, which provides a hot standby of the NameNode (Karanth, 2014); when the active NameNode fails, the standby can become the active NameNode in a matter of minutes.  Zookeeper or another HA monitoring service can be utilized to track NameNode failure (Karanth, 2014), and the failover process that promotes the hot standby to active NameNode is triggered with the assistance of Zookeeper.  HDFS federation is another enhancement in Hadoop 2.x; it is a more generalized storage model in which block storage has been generalized and separated from the filesystem layer (Karanth, 2014).  HDFS snapshots are another Hadoop 2.x enhancement, providing a read-only image of an entire filesystem or a particular subset of it to protect against user errors and to support backup and disaster recovery.   Other enhancements in Hadoop 2.x include Protocol Buffers (Karanth, 2014): the wire protocol for RPCs within Hadoop is based on Protocol Buffers.  Hadoop 2.x is also aware of the type of storage and exposes this information to applications to optimize data fetch and placement strategies (Karanth, 2014).  HDFS append support is another enhancement in Hadoop 2.x.

Hadoop is regarded as the de facto open-source framework for large-scale, massively parallel, and distributed data processing (Karanth, 2014).  The Hadoop framework includes two layers: a computation layer and a data layer (Karanth, 2014).  The computation layer is used for parallel and distributed computation, while the data layer is a highly fault-tolerant data storage layer associated with the computation layer.  These two layers run on commodity hardware, which is inexpensive, readily available, and compatible with other similar hardware (Karanth, 2014).

Hadoop Architecture

Apache Hadoop has four projects: Hadoop Common, Hadoop Distributed File System (HDFS), Yet Another Resource Negotiator (YARN), and MapReduce (Ankam, 2016).  HDFS is used to store data, MapReduce is used to process data, YARN is used to manage cluster resources such as CPU and memory, and Hadoop Common provides the common utilities that support the Hadoop framework (Ankam, 2016; Karanth, 2014).  Apache Hadoop integrates with other tools such as Avro, Hive, Pig, HBase, Zookeeper, and Apache Spark (Ankam, 2016; Karanth, 2014).

            Hadoop has three significant components for Big Data Analytics.  HDFS is a framework for reliable distributed data storage (Ankam, 2016; Karanth, 2014), and some considerations must be taken into account when storing data in it (Grover et al., 2015).  The frameworks for parallel processing of data include MapReduce, Crunch, Cascading, Hive, Tez, Impala, Pig, Mahout, Spark, and Giraph (Ankam, 2016; Karanth, 2014). The Hadoop architecture includes NameNodes and DataNodes.  It also includes Oozie for workflow, Pig for scripting, Mahout for machine learning, Hive for the data warehouse, Sqoop for data exchange, and Flume for log collection.  YARN, as discussed earlier, is in Hadoop 2.0 for distributed computing, while HCatalog handles Hadoop metadata management.  HBase is the columnar database, and Zookeeper provides coordination (Alguliyev & Imamverdiyev, 2014).  Figure 3 shows the Hadoop ecosystem components.


Figure 3.  Hadoop Architecture (Alguliyev & Imamverdiyev, 2014).

Big Data Analytics Process Using Hadoop

The process of Big Data Analytics involves six essential steps (Ankam, 2016). The first step is the identification of the business problem and the desired outcomes.  Examples of business problems include declining sales, shopping carts abandoned by customers, or a sudden rise in call volumes; examples of outcomes include improving the buying rate by 10%, decreasing shopping cart abandonment by 50%, or reducing call volume by 50% by the next quarter while keeping customers happy.  The second step is to identify the required data, whose sources can be a data warehouse using online analytical processing, an application database using online transactional processing, log files from servers, documents from the internet, sensor-generated data, and so forth, depending on the case and the problem.

Data collection is the third step in analyzing Big Data (Ankam, 2016).  The Sqoop tool can be used to collect data from relational databases, and Flume can be used for streaming data; Apache Kafka can be used for reliable intermediate storage.  The data collection design should implement a fault-tolerance strategy (Ankam, 2016).  Data preprocessing and the ETL process constitute the fourth step.  The collected data comes in various formats, and data quality can be an issue; thus, before processing, the data needs to be converted to the required format and cleaned of inconsistent, invalid, or corrupted records.  Apache Hive, Apache Pig, and Spark SQL can be used for preprocessing massive amounts of data.

The fifth step is the implementation of the analytics needed to answer the business questions and problems, which requires understanding the data and the relationships between data points.  The types of data analytics include descriptive and diagnostic analytics, which present past and current views of the data to answer questions such as what happened and why, and predictive analytics, which answers questions such as what would happen based on a hypothesis. Apache Hive, Pig, Impala, Drill, Tez, Apache Spark, and HBase can be used for data analytics in batch processing mode, while real-time analytics tools such as Impala, Tez, Drill, and Spark SQL can be integrated with traditional business intelligence (BI) tools such as Tableau, QlikView, and others for interactive analytics. The last step is the visualization of the data, presenting the analytics output in a graphical or pictorial format so the analysis can be better understood for decision making.  The finished data is either exported from Hadoop to a relational database using Sqoop for integration into visualization systems, or visualization tools such as Tableau, QlikView, Excel, and so forth are integrated directly.  Web-based notebooks such as Jupyter, Zeppelin, and Databricks Cloud are also used to visualize data by integrating Hadoop and Spark components (Ankam, 2016).
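A compressed sketch of steps three through six is shown below using PySpark; all paths, table names, and column names are hypothetical and serve only to illustrate the flow of ingest, preprocess, analyze, and export.

```python
# Illustrative pipeline sketch (hypothetical paths and columns): ingest raw
# clickstream logs, clean them, compute a descriptive metric, and export the
# result for a BI/visualization tool.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("cart-abandonment").getOrCreate()

# Steps 3-4: collect/ingest the raw events and preprocess them.
events = spark.read.json("hdfs:///raw/clickstream/")
events = (events.dropDuplicates(["session_id", "event_id"])
                .filter(F.col("event_type").isNotNull()))

# Step 5: descriptive analytics -- daily shopping cart abandonment rate.
daily = (events.groupBy("event_date")
               .agg(F.avg(F.when(F.col("event_type") == "cart_abandoned", 1)
                           .otherwise(0)).alias("abandonment_rate")))

# Step 6: export the finished data for a visualization tool.
daily.coalesce(1).write.mode("overwrite").csv("hdfs:///curated/abandonment_daily")
```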

Data Preprocessing, Modeling and Design Consideration in Hadoop

            Before collecting or processing any data, some considerations must be taken into account for data modeling and design in Hadoop to achieve better processing and retrieval (Grover et al., 2015).  The traditional data management system is referred to as a Schema-on-Write system, which requires the schema of the data store to be defined before the data is loaded (Grover et al., 2015).  This traditional approach results in long analysis cycles of data modeling, data transformation, loading, testing, and so forth before the data can be accessed (Grover et al., 2015).   In addition to this long analysis cycle, if anything changes or a wrong decision was made, the cycle must start from the beginning, which takes even longer (Grover et al., 2015).   This section addresses the various considerations that precede processing data in Hadoop for analytical purposes.

Data Pre-Processing Consideration

A dataset may have varying levels of quality with regard to noise, redundancy, and consistency (Hu, Wen, Chua, & Li, 2014).  Preprocessing techniques to improve data quality must be in place in Big Data systems (Hu et al., 2014; Lublinsky, Smith, & Yakubovich, 2013).  Data pre-processing involves three techniques: data integration, data cleansing, and redundancy elimination.

Data integration techniques are used to combine data residing in different sources and provide users with a unified view of the data (Hu et al., 2014).  The traditional database approach has well-established data integration systems, including the data warehouse method and the data federation method (Hu et al., 2014).  The data warehouse approach, also known as ETL, consists of extraction, transformation, and loading (Hu et al., 2014).  The extraction step involves connecting to the source systems and selecting and collecting the required data to be processed for analytical purposes.  The transformation step involves applying a series of rules to the extracted data to convert it into a standard format.  The load step involves importing the extracted and transformed data into a target storage infrastructure (Hu et al., 2014).  The federation approach creates a virtual database to query and aggregate data from various sources (Hu et al., 2014).  The virtual database contains information or metadata about the actual data and its location, not the data itself (Hu et al., 2014).  These two pre-processing approaches are store-and-pull techniques, which are not appropriate for Big Data processing given its high computation, high streaming, and dynamic nature (Hu et al., 2014).

The data cleansing process is vital to keep data consistent and up to date, and it is widely used in fields such as banking, insurance, and retailing (Hu et al., 2014).  The cleansing process is required to identify incomplete, inaccurate, or unreasonable data and then remove or correct it to improve data quality (Hu et al., 2014). The data cleansing process includes five steps (Hu et al., 2014): the first is to define and determine the error types; the second is to search for and identify error instances; the third is to correct the errors; the fourth is to document error instances and error types; and the last is to modify data entry procedures to reduce future errors.  Various types of checks must be performed during cleansing, including format checks, completeness checks, reasonableness checks, and limit checks (Hu et al., 2014).  Data cleansing is required to improve the accuracy of the analysis (Hu et al., 2014), but it depends on a complex relationship model and carries extra computation and delay overhead (Hu et al., 2014).  Organizations must seek a balance between the complexity of the data-cleansing model and the resulting improvement in analysis accuracy (Hu et al., 2014).
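The format, completeness, and limit checks mentioned above can be expressed compactly; the sketch below uses PySpark, and the column names, patterns, and thresholds are hypothetical.

```python
# Minimal cleansing-check sketch (hypothetical dataset, columns, and limits).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("cleansing-checks").getOrCreate()
claims = spark.read.parquet("hdfs:///raw/claims/")

checked = (claims
    .filter(F.col("claim_id").rlike(r"^C[0-9]{8}$"))           # format check
    .filter(F.col("member_id").isNotNull() &
            F.col("amount").isNotNull())                       # completeness check
    .filter((F.col("amount") > 0) & (F.col("amount") < 1e6)))  # limit/reasonableness check

checked.write.mode("overwrite").parquet("hdfs:///clean/claims/")
```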

Redundancy elimination is the third data pre-processing step.  Redundancy, where data is repeated, increases the overhead of data transmission and causes limitations for storage systems, including wasted space, data inconsistency, data corruption, and reduced reliability (Hu et al., 2014).  Redundancy reduction methods include redundancy detection and data compression (Hu et al., 2014).  The data compression method imposes an extra computation burden during the compression and decompression processes (Hu et al., 2014).

Data Modeling and Design Consideration

A Schema-on-Write system is used when the application or structure is well understood and the data is frequently accessed through queries and reports on high-value data (Grover et al., 2015).  The term Schema-on-Read is used in the context of the Hadoop data management system (Ankam, 2016; Grover et al., 2015). It refers to raw data that is not processed when loaded into Hadoop and is given the required structure at processing time, based on the requirements of the processing application (Ankam, 2016; Grover et al., 2015).  Schema-on-Read is used when the application or structure of the data is not well understood (Ankam, 2016; Grover et al., 2015).  The agility achieved through schema-on-read provides valuable insights on data that was not previously accessible (Grover et al., 2015).
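A minimal PySpark sketch of Schema-on-Read follows: the raw files sit in HDFS untouched, and each application supplies the structure it needs only at read time. The path and field names are hypothetical.

```python
# Schema-on-Read sketch (hypothetical path and fields): the schema is applied
# when the raw CSV files are read, not when they are loaded into HDFS.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

orders_schema = StructType([
    StructField("order_id", StringType()),
    StructField("customer_id", StringType()),
    StructField("amount", DoubleType()),
])

# Another application could read the same raw files with a different schema.
orders = spark.read.csv("hdfs:///raw/orders/", schema=orders_schema, header=True)
orders.printSchema()
```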

            Five factors must be considered before storing data in Hadoop for processing (Grover et al., 2015).  The first is the data storage format, as there are a number of file formats and compression formats supported on Hadoop, and each type of format has strengths that make it better suited to specific applications.   Although the Hadoop Distributed File System (HDFS) is the building block of the Hadoop ecosystem used for storing data, several commonly used systems are implemented on top of HDFS, such as HBase for additional data access functionality and Hive for additional data management functionality (Grover et al., 2015); these systems must also be taken into consideration before storing data in Hadoop (Grover et al., 2015). The second factor is multitenancy, a common approach in which clusters host multiple users, groups, and application types; multi-tenant clusters involve important considerations for data storage.  The third factor is schema design, which should be considered even though Hadoop is schema-less (Grover et al., 2015); it involves the directory structures for data loaded into HDFS and for the output of data processing and analysis, as well as the schema of objects stored in systems such as HBase and Hive.  The fourth factor is metadata management.  Metadata related to the stored data is often as important as the data itself, and understanding metadata management plays a significant role because it can affect the accessibility of the data.  The fifth factor is security: decisions about the security of the data involve authentication, fine-grained access control, and encryption, and these measures should be considered both for data at rest, when it is stored, and for data in motion, during processing (Grover et al., 2015).  Figure 4 summarizes these considerations before storing data into the Hadoop system.


Figure 4.  Considerations Before Storing Data into Hadoop.

Data Storage Format Considerations

            When architecting a solution on Hadoop, the method of storing data in Hadoop is one of the essential decisions. Primary considerations for data storage in Hadoop involve the file format, compression, and the data storage system (Grover et al., 2015).  The standard file formats involve three types: text data, structured text data, and binary data.  Figure 5 summarizes these three standard file formats.


Figure 5.  Standard File Formats.

Text data is in widespread use in Hadoop, including log files such as weblogs and server logs (Grover et al., 2015).  Text data can come in many forms, such as CSV files, or as unstructured data such as emails.  Compression of text files is recommended, and the selection of the compression format is influenced by how the data will be used (Grover et al., 2015).  For instance, if the data is for archival purposes, the most compact compression method can be used, while if the data will be used in processing jobs such as MapReduce, a splittable format should be used (Grover et al., 2015).  A splittable format enables Hadoop to split files into chunks for processing, which is essential for efficient parallel processing (Grover et al., 2015).

In most cases, the use of container formats such as SequenceFiles or Avro provides benefits that make them the preferred format for most file types, including text (Grover et al., 2015).  Among other benefits, these container formats provide functionality to support splittable compression (Grover et al., 2015).   Binary data, such as images, can also be stored in Hadoop, and a container format such as SequenceFile is preferred for storing it.  If the splittable unit of binary data is larger than 64 MB, the data should be put into its own file without using a container format (Grover et al., 2015).
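The following sketch shows what writing text data into an Avro container with block compression might look like in PySpark; it assumes the external spark-avro package is available on the cluster (e.g., added via --packages), and the paths are hypothetical.

```python
# Container-format sketch (hypothetical paths; assumes the spark-avro package
# is on the classpath): persist plain text logs as compressed, splittable Avro.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("avro-container").getOrCreate()
logs = spark.read.text("hdfs:///raw/weblogs/")                 # plain text input

spark.conf.set("spark.sql.avro.compression.codec", "snappy")   # block-level codec
logs.write.mode("overwrite").format("avro").save("hdfs:///container/weblogs_avro/")
```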

XML and JSON Format Challenges with Hadoop

Structured text data includes formats such as XML and JSON, which present unique challenges in Hadoop because splitting XML and JSON files for processing is not straightforward, and Hadoop does not provide a built-in InputFormat for either (Grover et al., 2015).  JSON presents more challenges than XML because no token is available to mark the beginning or end of a record.  When using these file formats, two primary considerations apply.  First, a container format such as Avro should be used, because transforming the data into Avro provides a compact and efficient way to store and process it (Grover et al., 2015).  Second, a library designed for processing XML or JSON should be used: the XMLLoader in Pig's PiggyBank library is an example for XML data, and the Elephant Bird project is an example for JSON data (Grover et al., 2015).
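One hedged way to follow the first recommendation is sketched below: read the JSON once with Spark, which understands the record boundaries, and persist it in a splittable container format for downstream jobs (Parquet here; Avro, as the text recommends, works the same way via format("avro")). Paths are hypothetical, and an external reader such as spark-xml or Pig's XMLLoader would play the analogous role for XML.

```python
# JSON-handling sketch (hypothetical paths): read once with Spark, then store
# in a compact, splittable container format for later processing.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-to-parquet").getOrCreate()

events = spark.read.option("multiLine", "true").json("hdfs:///raw/events_json/")
events.write.mode("overwrite").parquet("hdfs:///container/events_parquet/")
```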

Hadoop File Types Considerations

            Several Hadoop-based file formats have been created to work well with MapReduce (Grover et al., 2015).  These Hadoop-specific file formats include file-based data structures such as SequenceFiles, serialization formats such as Avro, and columnar formats such as RCFile and Parquet (Grover et al., 2015).  These file types share two essential characteristics that are important for Hadoop applications: splittable compression and agnostic compression.  The ability to split files plays a significant role during data processing and should not be underestimated when storing data in Hadoop, because it allows large files to be split for input to MapReduce and other types of jobs, which is a fundamental part of parallel processing and a key to leveraging Hadoop's data locality feature (Grover et al., 2015).  Agnostic compression is the ability to compress a file with any compression codec without readers having to know the codec, because the codec is stored in the header metadata of the file format (Grover et al., 2015).  Figure 6 summarizes these Hadoop-specific file formats and their common characteristics of splittable compression and agnostic compression.


Figure 6. Three Hadoop File Types with the Two Common Characteristics.  

1.      SequenceFiles Format Consideration

The SequenceFile format is the most widely used of the Hadoop file-based formats.  It stores data as binary key-value pairs (Grover et al., 2015) and involves three formats for records stored within SequenceFiles: uncompressed, record-compressed, and block-compressed.  Every SequenceFile uses a standard header format containing basic metadata about the file, such as the compression codec used, key and value class names, user-defined metadata, and a randomly generated sync marker.  SequenceFiles are well supported in Hadoop but have limited support outside the Hadoop ecosystem, as they are only supported in Java.  A frequent use case for SequenceFiles is as a container for smaller files: storing a large number of small files in Hadoop can cause memory issues and excessive overhead in processing, and packing smaller files into a SequenceFile makes their storage and processing more efficient because Hadoop is optimized for large files (Grover et al., 2015).   Other file-based formats include MapFiles, SetFiles, ArrayFiles, and BloomMapFiles.  Because they were designed to work with MapReduce, these formats offer a high level of integration for all forms of MapReduce jobs, including those run via Pig and Hive (Grover et al., 2015).  Figure 7 summarizes the three formats for records stored within SequenceFiles.


Figure 7.  Three Formats for Records Stored within SequenceFile.
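As a small illustration of the "container for smaller files" use case above, the sketch below packs a few records into a SequenceFile of key-value pairs via the PySpark RDD API; the path and records are hypothetical.

```python
# SequenceFile sketch (hypothetical path and records): pack small documents
# into one SequenceFile of binary key-value pairs, then read them back.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("seqfile-demo").getOrCreate()
sc = spark.sparkContext

small_files = [("doc-0001", "first small document"),
               ("doc-0002", "second small document")]
sc.parallelize(small_files).saveAsSequenceFile("hdfs:///container/docs_seq")

# Reading the container back as (key, value) pairs:
print(sc.sequenceFile("hdfs:///container/docs_seq").collect())
```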

2.      Serialization Formats Consideration

Serialization is the process of turning data structures into bytes for storage or for transferring data over the network (Grover et al., 2015).   De-serialization is the opposite process of converting a byte stream back into a data structure (Grover et al., 2015).  Serialization is a fundamental building block for distributed processing systems such as Hadoop because it allows data to be converted into a format that can be efficiently stored and transferred across a network connection (Grover et al., 2015).  Figure 8 summarizes the serialization and deserialization processes.


Figure 8.  Serialization Process vs. Deserialization Process.

Serialization addresses two aspects of data processing in a distributed system: data storage and interprocess communication through remote procedure calls, or RPC (Grover et al., 2015).  Hadoop utilizes Writables as its main serialization format, which is compact and fast but works only with Java.  Other serialization frameworks have seen increasing use within the Hadoop ecosystem, including Thrift, Protocol Buffers, and Avro (Grover et al., 2015).  Avro is a language-neutral data serialization system (Grover et al., 2015) designed to address the main limitation of Hadoop's Writables, namely their lack of language portability.  Like Thrift and Protocol Buffers, Avro is described through a language-independent schema (Grover et al., 2015); unlike Thrift and Protocol Buffers, Avro makes code generation optional.  Table 1 provides a comparison between these serialization formats.

Table 1:  Comparison between Serialization Formats.
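A minimal sketch of Avro's schema-driven serialization and deserialization is shown below in Python using the fastavro package; the schema and records are hypothetical and are only meant to make the round trip concrete.

```python
# Avro serialization/deserialization sketch (hypothetical schema and record),
# using the fastavro package and an in-memory buffer in place of a file.
import io
import fastavro

schema = {
    "type": "record", "name": "Order",
    "fields": [{"name": "order_id", "type": "string"},
               {"name": "amount", "type": "double"}],
}

buf = io.BytesIO()
fastavro.writer(buf, schema, [{"order_id": "o-1", "amount": 19.99}])  # serialize
buf.seek(0)
for record in fastavro.reader(buf):                                   # deserialize
    print(record)
```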

3.      Columnar Format Consideration

Row-oriented systems have traditionally been used to fetch data stored in a database (Grover et al., 2015).  This type of data retrieval suits analyses that rely heavily on fetching all fields for records that belong to a specific time range, and it is efficient when all columns of the record are available at the time of writing, because the record can be written with a single disk seek.  Columnar storage has more recently been used to fetch data, and it has four main benefits over row-oriented systems (Grover et al., 2015).  First, it skips I/O and decompression on columns that are not part of the query.  Second, columnar storage works better for queries that access only a small subset of columns, whereas row-oriented storage is appropriate when many columns are retrieved.  Third, compression on columns is more efficient because data is more similar within the same column than within a block of rows.  Fourth, columnar storage is more appropriate for data-warehousing applications, where aggregations are computed over specific columns rather than over an extensive collection of whole records (Grover et al., 2015).  Hadoop applications have been using columnar file formats including the RCFile format, Optimized Row Columnar (ORC), and Parquet.  The RCFile format has been used as a Hive format; it was developed to provide fast data loading, fast query processing, and highly efficient storage space utilization.  It breaks files into row splits and, within each split, uses column-oriented storage.  Despite its query and compression advantages compared to SequenceFiles, it has limitations that prevent optimal query times and compression.  The newer columnar formats, ORC and Parquet, are designed to address many of the limitations of RCFile (Grover et al., 2015).
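The I/O-skipping benefit of columnar formats can be seen in a short PySpark sketch over a Parquet dataset; the dataset and columns are hypothetical.

```python
# Columnar-storage sketch (hypothetical dataset): with Parquet, only the
# columns touched by the query are read from disk.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("columnar-demo").getOrCreate()
sales = spark.read.parquet("hdfs:///warehouse/sales_parquet/")

# Only 'region' and 'revenue' are read for this aggregation, not the whole rows.
sales.groupBy("region").agg(F.sum("revenue").alias("total_revenue")).show()
```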

Compression Consideration

Compression is another data storage consideration because it plays a crucial role in reducing storage requirements and in improving data processing performance (Grover et al., 2015).  Some compression formats supported on Hadoop are not splittable (Grover et al., 2015).  Because the MapReduce framework splits data for input to multiple tasks, a non-splittable compression format is an obstacle to efficient processing; splittability is therefore a critical consideration in selecting the compression format and file format for Hadoop.  Compression types for Hadoop include Snappy, LZO, Gzip, and bzip2.  Google developed Snappy for speed, so it does not offer the best compression size; it is not inherently splittable and is designed to be used with a container format such as SequenceFile or Avro, and it is distributed with Hadoop.  Like Snappy, LZO is optimized for speed rather than size; unlike Snappy, LZO supports splitting of compressed files, but it requires indexing, and it is not distributed with Hadoop, requiring a license and a separate installation.  Gzip, like Snappy, provides good compression performance and similar read speed, but it is slower than Snappy for writes; like Snappy, it is not splittable and should be used with a container format, and the use of smaller blocks with Gzip can result in better performance.  Bzip2 is another compression type for Hadoop: it provides good compression performance but can be slower than other codecs such as Snappy, so it is not an ideal codec for general Hadoop storage; unlike Snappy and Gzip, bzip2 is inherently splittable because it inserts synchronization markers between blocks, and it can be used for active archival purposes (Grover et al., 2015).
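A small sketch of the codec trade-off described above follows, using PySpark writes into a container format; the paths are hypothetical, and the codec choice per path is only illustrative.

```python
# Codec-choice sketch (hypothetical paths): Snappy favors speed inside a
# splittable container format such as Parquet or Avro, while gzip compresses
# more tightly and suits colder, archival data.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("codec-choice").getOrCreate()
df = spark.read.parquet("hdfs:///clean/claims/")

df.write.mode("overwrite").option("compression", "snappy") \
  .parquet("hdfs:///hot/claims_snappy/")       # fast codec for frequently read data
df.write.mode("overwrite").option("compression", "gzip") \
  .parquet("hdfs:///archive/claims_gzip/")     # denser codec for archival data
```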

A compression format can become splittable when used with container file formats such as Avro or SequenceFile, which compress blocks of records or each record individually (Grover et al., 2015).  If compression is applied to the entire file without a container file format, a compression format that inherently supports splitting, such as bzip2, must be used.  The use of compression with Hadoop comes with three recommendations (Grover et al., 2015).  The first is to enable compression of MapReduce intermediate output, which improves performance by decreasing the amount of intermediate data that needs to be read from and written to disk.  The second is to pay attention to the order of the data: data in Hadoop file formats is compressed in chunks, the organization of those chunks determines the final compression, and data that is close together compresses better.  The last recommendation is to consider using a compact file format with support for splittable compression, such as Avro.  Avro and SequenceFiles support splittability even with non-splittable compression formats: a single HDFS block can contain multiple Avro or SequenceFile blocks, and each of those blocks can be compressed and decompressed individually and independently of the others, which makes the data splittable.  Figure 9 shows the Avro and SequenceFile splittability support (Grover et al., 2015).


Figure 9.  Compression Example Using Avro (Grover et al., 2015).

Design Consideration for HDFS Schema

HDFS and HBase are the most commonly used storage managers in the Hadoop ecosystem.  Organizations can store data in HDFS directly or in HBase, which internally stores it on HDFS (Grover et al., 2015).  When storing data in HDFS, some design techniques must be taken into consideration.  The schema-on-read model of Hadoop does not impose any requirements when loading data into Hadoop: data can be ingested into HDFS by one of many methods without the need to associate a schema or preprocess the data.  Although Hadoop has been used to load many types of data, such as unstructured and semi-structured data, some order is still required, because Hadoop serves as a central location for the entire organization and the data stored in HDFS is intended to be shared across various departments and teams (Grover et al., 2015).  The data repository should therefore be carefully structured and organized, which provides several benefits to the organization (Grover et al., 2015).   When there is a standard directory structure, it becomes easier to share data among teams working with the same data set.  Data is staged in a separate location before being processed, and a standard staging convention helps ensure that data which has not been appropriately or completely staged is not processed.  A standard organization of data also allows reuse of some of the code that processes it (Grover et al., 2015), and assumptions about data placement help simplify the loading of data into Hadoop.   An HDFS data model design for a project such as a data warehouse implementation is likely to use structured fact and dimension tables similar to a traditional schema (Grover et al., 2015), whereas a design for unstructured and semi-structured data is likely to focus on directory placement and metadata management (Grover et al., 2015).

Grover et al. (2015) suggested three key considerations when designing a schema, regardless of the data model design project.  The first is to develop standard practices that can be followed by all teams.  The second is to ensure the design works well with the chosen tools; for instance, if the version of Hive in use can support table partitions only on directories that are named a certain way, this will affect the schema design and the names of the table subdirectories.  The last consideration is to keep usage patterns in mind, because different data processing and querying patterns work better with different schema designs (Grover et al., 2015).

HDFS Files Location Consideration

            The first step in designing an HDFS schema is determining the location of the files.  Standard file locations play a significant role in finding and sharing data among various departments and teams, and they also help in assigning permissions to access files to various groups and users.  The recommended file locations are summarized in Table 2.


Table 2.  Standard File Locations.

HDFS Schema Design Consideration

HDFS schema design involves advanced techniques for organizing data into files (Grover et al., 2015).   A few strategies are recommended for organizing a data set: partitioning, bucketing, and denormalization.  Partitioning a data set is a common technique used to reduce the amount of I/O required to process it.  Unlike a traditional data warehouse, HDFS does not store indexes on the data; this lack of indexes speeds up data ingest, but it comes at the cost of a full table scan, where every query has to read the entire dataset even when processing only a small subset of the data. Breaking the data set up into smaller subsets, or partitions, addresses this by allowing queries to read only the specific partitions they need, reducing the amount of I/O and improving query processing time significantly (Grover et al., 2015). When data is placed in the filesystem, the directory format for partitions should be as shown below.  The order data sets are partitioned by date because a large number of orders are placed daily, so the partitions contain files that are large enough to be optimized by HDFS.  Various tools such as HCatalog, Hive, Impala, and Pig understand this directory structure and leverage the partitioning to reduce the amount of I/O required during data processing (Grover et al., 2015).

  • <data set name>/<partition_column_name=partition_column_value>/
  • e.g., medication_orders/date=20181107/[order1.csv, order2.csv]
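Producing this date-partitioned directory layout is straightforward with PySpark's partitionBy, as sketched below; the dataset, paths, and column names are hypothetical.

```python
# Partitioning sketch (hypothetical dataset and columns): writes directories
# such as .../medication_orders/date=20181107/part-*.parquet
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-layout").getOrCreate()
orders = spark.read.parquet("hdfs:///clean/medication_orders/")

orders.write.mode("overwrite").partitionBy("date") \
      .parquet("hdfs:///warehouse/medication_orders/")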

Bucketing is another technique for breaking a large data set into manageable subsets (Grover et al., 2015).  Bucketing is similar to the hash partitioning used in relational databases.   As noted above, tools such as HCatalog, Hive, Impala, and Pig leverage the partitioned directory structure to reduce the amount of I/O required during data processing. The partitioning example above used the date, which resulted in data files large enough to be optimized by HDFS (Grover et al., 2015).  However, if the data set were partitioned by, for example, the physician, the result would be too many small files.  Too many small files lead to the small-files problem, which can cause excessive memory use on the NameNode, since the metadata for each file stored in HDFS is held in memory (Grover et al., 2015).  Many small files also lead to many processing tasks, causing excessive overhead in processing.  The solution to too many small files is bucketing, which in this example uses a hashing function to map physicians into a specified number of buckets (Grover et al., 2015).

The bucketing technique controls the size of the data subsets and optimizes query speed (Grover et al., 2015).  The recommended average bucket size is a few multiples of the HDFS block size. An even distribution of the data when hashed on the bucketing column is essential because it results in consistent bucketing (Grover et al., 2015), and using a power of two for the number of buckets is common.   Bucketing also facilitates joining two data sets; the join, in this case, is used to represent the general idea of combining two data sets to retrieve a result, and joins can be implemented through the SQL-on-Hadoop systems as well as in MapReduce, Spark, or other programming interfaces to Hadoop.  When both data sets are bucketed on the join key, the corresponding buckets can be joined individually without having to join the entire data sets, which reduces the time complexity of the reduce-side join of the two data sets, a computationally expensive operation (Grover et al., 2015).   If the buckets are small enough to fit easily into memory, the join can be implemented in the map stage of a MapReduce job by loading the smaller of the corresponding buckets into memory, which is called a map-side join; the map-side join process improves join performance compared with a reduce-side join.  Hive, when used for data analysis, recognizes that the tables are bucketed and optimizes the process accordingly.

Further optimization is possible if the data in each bucket is sorted: a merge join can then be used, and the entire bucket does not have to be held in memory during the join, resulting in a faster process that uses much less memory than a simple bucket join.  Hive supports this optimization as well.  The use of both sorting and bucketing on large tables that are frequently joined together, using the join key as the bucketing column, is recommended (Grover et al., 2015).
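A minimal sketch of combining bucketing and sorting on the join key is shown below using Spark SQL tables (which follow Hive-style bucketing); it assumes a Spark session with Hive support, and the table, path, bucket count, and column names are hypothetical.

```python
# Bucketing-and-sorting sketch (hypothetical schema): hash the rows into 32
# buckets on the join key and sort within each bucket so joins on
# physician_id can proceed bucket by bucket.
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("bucketing-demo")
         .enableHiveSupport().getOrCreate())

orders = spark.read.parquet("hdfs:///clean/medication_orders/")
orders.write.mode("overwrite") \
      .bucketBy(32, "physician_id").sortBy("physician_id") \
      .saveAsTable("medication_orders_bucketed")
```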

The schema design depends on how the data will be queried (Grover et al., 2015).  Thus, the columns to be used for joining and filtering must be identified before the partitioning and bucketing of the data is implemented.   In some cases, when identifying a single partitioning key is difficult, the same data set can be stored multiple times, each copy with a different physical organization.  This is regarded as an anti-pattern in a relational database, but it works with Hadoop, because Hadoop data is write-once and few updates are expected, so the overhead of keeping the duplicated data sets in sync is reduced, and the cost of storage in Hadoop clusters is low as well (Grover et al., 2015). The duplicated, differently organized data sets provide better query processing speed in such cases (Grover et al., 2015).

Denormalization is another technique that trades disk space for query performance by minimizing the need to join entire data sets (Grover et al., 2015).   In the relational database model, data is stored in third normal form (3NF), where redundancy is minimized and data integrity is enforced by splitting data into smaller tables, each holding a particular entity; in this relational model, most queries require joining a large number of tables to produce the desired final result (Grover et al., 2015).  In Hadoop, however, joins are often the slowest operations and consume the most resources from the cluster.  Specifically, the reduce-side join requires sending the entire table over the network, which is computationally costly.  While sorting and bucketing help minimize this cost, another solution is to create data sets that are pre-joined or pre-aggregated (Grover et al., 2015): the data is joined once and stored in that form instead of running the join every time the data is queried.  A Hadoop schema often consolidates many of the small dimension tables into a few larger dimensions by joining them during the ETL process (Grover et al., 2015).  Other techniques to speed up processing include pre-aggregation and data type conversion.  Because duplication of the data is of less concern in Hadoop, when a computation is performed frequently across a large number of queries, it is recommended to do it once and reuse the result, as with a materialized view in a relational database; in Hadoop, a new dataset is created that contains the same data in its aggregated form (Grover et al., 2015).
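The pre-joining idea above can be sketched as a one-time ETL step; the tables, join key, and paths below are hypothetical.

```python
# Denormalization sketch (hypothetical tables): join once during ETL and store
# the pre-joined result so downstream queries avoid repeating the join.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("prejoin-demo").getOrCreate()

orders = spark.read.parquet("hdfs:///warehouse/medication_orders/")
physicians = spark.read.parquet("hdfs:///warehouse/physicians/")

denormalized = orders.join(physicians, "physician_id", "left")
denormalized.write.mode("overwrite") \
    .parquet("hdfs:///warehouse/medication_orders_denorm/")
```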

To summarize, partitioning is used to reduce the I/O overhead of processing by selectively reading and writing data in particular partitions; bucketing can be used to speed up queries that involve joins or sampling, again by reducing I/O; and denormalization can be implemented to speed up Hadoop jobs.   This section reviewed advanced techniques for organizing data into files, including the use of a small number of large files versus a large number of small files: Hadoop prefers working with a small number of large files.  The discussion also addressed reduce-side joins versus map-side joins; the reduce-side join is computationally costly, so the map-side join technique is preferred and recommended.

HBase Schema Design Consideration

HBase is not a relational database (Grover et al., 2015; Yang, Liu, Hsu, Lu, & Chu, 2013).  HBase is similar to a large hash table that allows values to be associated with keys and performs fast lookups of a value based on a given key (Grover et al., 2015). The hash-table-like operations it supports are put, get, scan, increment, and delete.  HBase provides scalability and flexibility and is useful in many applications, including fraud detection, which is a widespread application of HBase (Grover et al., 2015).
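The basic operations named above look roughly as follows from Python using the happybase client over the HBase Thrift gateway; the host, table, column family, and values are hypothetical, and a running Thrift server is assumed.

```python
# Basic HBase operations sketch (hypothetical host, table, and columns),
# using the happybase client against an HBase Thrift server.
import happybase

connection = happybase.Connection("hbase-thrift-host")   # assumed Thrift endpoint
table = connection.table("transactions")

table.put(b"cust42|order991", {b"d:amount": b"19.99", b"d:status": b"ok"})  # put
print(table.row(b"cust42|order991"))                                        # get
for key, data in table.scan(row_prefix=b"cust42|"):                         # scan
    print(key, data)
connection.close()
```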

The HBase framework involves the Master Server, Region Servers, the Write-Ahead Log (WAL), the Memstore, HFiles, the API, and Hadoop HDFS (Bhojwani & Shah, 2016).  Each component of the HBase framework plays a significant role in data storage and processing.  Figure 10 illustrates the HBase framework.


Figure 10.  HBase Architecture (Bhojwani & Shah, 2016).

            The following considerations must be taken into account when designing the schema for HBase (Grover et al., 2015).

  • Row Key Consideration.
  • Timestamp Consideration.
  • Hops Consideration.
  • Tables and Regions Consideration.
  • Columns Use Consideration.
  • Column Families Use Consideration.
  • Time-To-Live Consideration.

The row key is one of the most critical factors in a well-architected HBase schema design (Grover et al., 2015).  Row key considerations involve record retrieval, distribution, the block cache, the ability to scan, size, readability, and uniqueness.  The row key is critical for retrieving records from HBase. In a relational database, a composite key can combine multiple primary keys; in HBase, multiple pieces of information can be combined into a single key.  For instance, a key composed of customer_id, order_id, and timestamp could be the row key for a row describing an order: in a relational database these would be three different columns, but in HBase they are combined into a single unique identifier.  Another consideration in selecting the row key is the get operation, because a get of a single record is the fastest operation in HBase.  Designing the key so that a single get retrieves the most common uses of the data improves performance, which requires putting much information into a single record, a denormalized design.    For instance, while a relational database would place customer information in various tables, in HBase all of a customer's information can be stored in a single record retrieved with one get operation. Distribution is another consideration for HBase schema design.  The row key determines how the rows of a given table are scattered across the regions of the HBase cluster (Grover et al., 2015; Yang et al., 2013).   The row keys are sorted, and each region stores a range of these sorted row keys (Grover et al., 2015).  Each region is pinned to a region server, namely a node in the cluster (Grover et al., 2015).  A combination of device ID and timestamp or reverse timestamp is commonly used to “salt” the key in machine data (Grover et al., 2015).  The block cache is a least recently used (LRU) cache which caches data blocks in memory (Grover et al., 2015).  By default, HBase reads records from disk in chunks of 64 KB, each of which is called an HBase block (Grover et al., 2015); when an HBase block is read from disk, it is put into the block cache (Grover et al., 2015).   The choice of row key also affects scan operations.  HBase scan rates are about eight times slower than HDFS scan rates, so reducing I/O requirements provides a significant performance advantage.  The size of the row key affects the performance of the workload: a short row key is better than a long row key because it has lower storage overhead and faster read/write performance.  The readability of the row key also matters, so it is best to start with a human-readable row key.  Finally, the uniqueness of the row key is critical, since a row key is equivalent to a key in the hash-table analogy; if the row key is based on a non-unique attribute, the application should handle such cases and only put data into HBase with a unique row key (Grover et al., 2015).
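Composite and salted row keys of the kind described above can be constructed as in the sketch below; the field values, separators, and bucket count are hypothetical, and the salting scheme is just one plausible design.

```python
# Row key construction sketch (hypothetical fields and separators).
# Salting spreads monotonically increasing keys across regions; the reverse
# timestamp makes the newest rows for an entity sort first.
import hashlib

def composite_key(customer_id: str, order_id: str, ts_millis: int) -> bytes:
    reverse_ts = 10**13 - ts_millis               # reverse-timestamp component
    return f"{customer_id}|{order_id}|{reverse_ts:013d}".encode()

def salted_key(device_id: str, ts_millis: int, buckets: int = 16) -> bytes:
    salt = int(hashlib.md5(device_id.encode()).hexdigest(), 16) % buckets
    return f"{salt:02d}|{device_id}|{ts_millis}".encode()

print(composite_key("cust42", "order991", 1541548800000))
print(salted_key("sensor-7", 1541548800000))
```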

The timestamp is the second essential consideration for good HBase schema design (Grover et al., 2015).  The timestamp makes it possible to determine which version of a record is newer when a put operation modifies the record.  It also determines the order in which versions are returned when multiple versions of a single record are requested.  Finally, the timestamp is used to age out records: when a time-to-live (TTL) policy is in effect, HBase compares the timestamp against the TTL, and timestamps also indicate that a record value has been superseded by a newer put or by a delete (Grover et al., 2015).
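The brief sketch below shows how timestamps surface through the Java client API, continuing the hypothetical orders table from the previous example; the setMaxVersions call assumes the column family was created to retain multiple versions.

```java
import java.util.List;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class TimestampExample {
  static void showVersions(Table table, byte[] rowKey) throws Exception {
    // A put may carry an explicit timestamp; by default HBase assigns the server time.
    Put put = new Put(rowKey);
    put.addColumn(Bytes.toBytes("o"), Bytes.toBytes("status"),
        System.currentTimeMillis(), Bytes.toBytes("DELIVERED"));
    table.put(put);

    // Ask for up to three versions of each cell; they come back newest first.
    Get get = new Get(rowKey);
    get.setMaxVersions(3);
    Result result = table.get(get);
    List<Cell> versions = result.getColumnCells(Bytes.toBytes("o"), Bytes.toBytes("status"));
    for (Cell cell : versions) {
      System.out.println(cell.getTimestamp() + " -> " + Bytes.toString(CellUtil.cloneValue(cell)));
    }
  }
}
```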

The term "hop" refers to the number of synchronized get requests needed to retrieve a specific piece of data from HBase (Grover et al., 2015).  The fewer hops, the better, because every hop adds overhead.  Although multi-hop requests can be made against HBase, it is best to avoid them through better schema design, for example by leveraging denormalization, because every hop is a round-trip to HBase that carries a significant performance cost (Grover et al., 2015).

The number of tables and the number of regions per table in HBase can have a negative impact on the performance and distribution of the data (Grover et al., 2015).  If the number of tables and regions is not chosen correctly, it can result in an imbalance in the distribution of the load.  Important points to keep in mind are that there is one region server per node, a region server hosts many regions, a given region is pinned to a particular region server, and tables are split into regions and scattered across region servers.  A table must have at least one region.  All regions in a region server receive put requests and share the region server's memstore, a write cache present on every HBase region server.  The memstore buffers the writes sent to that region server and sorts them before flushing them when certain memory thresholds are reached.  Thus, the more regions that exist on a region server, the less memstore space is available per region.  The default configuration sets the ideal flush size to 100 MB, so dividing the available memstore memory by 100 MB gives an estimate of the maximum number of regions that can be put on that region server.  Very large regions take a long time to compact; the practical upper limit on the size of a region is around 20 GB, although there are successful HBase clusters with regions upward of 120 GB.  Regions can be assigned to an HBase table using one of two techniques.  The first technique is to create the table with a single default region, which auto-splits as data increases.  The second technique is to create the table with a given number of regions and set the region size to a high enough value, e.g., 100 GB per region, to avoid auto-splitting (Grover et al., 2015).  Figure 11 shows a topology of region servers, regions, and tables.


Figure 11.  The Topology of Region Servers, Regions, and Tables (Grover et al., 2015).
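As a rough sketch of the second technique, the older HTableDescriptor/HColumnDescriptor admin API (still available, though superseded by builder classes in HBase 2.x) can create a table pre-split into a fixed set of regions; the table name, family name, and split points below are illustrative assumptions.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTableExample {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Admin admin = conn.getAdmin()) {

      HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("sensor_readings"));
      desc.addFamily(new HColumnDescriptor("d"));          // single, short-named column family

      // Pre-split into four regions for row keys salted with a leading hex digit.
      byte[][] splitKeys = {Bytes.toBytes("4"), Bytes.toBytes("8"), Bytes.toBytes("c")};
      admin.createTable(desc, splitKeys);
    }
  }
}
```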

The use of columns in HBase differs from a traditional relational database (Grover et al., 2015; Yang et al., 2013).  In HBase, unlike a traditional database, one record can have a million columns and the next record can have a million completely different columns, which is not recommended but is possible (Grover et al., 2015).  HBase stores data in a format called HFile, where each column value gets its own row in the HFile (Grover et al., 2015; Yang et al., 2013).  That row holds fields such as the row key, timestamp, column name, and value.  This file format enables functionality such as versioning and sparse column storage (Grover et al., 2015).

HBase includes the concept of column families (Grover et al., 2015; Yang et al., 2013).  A column family is a container for columns, and a table can have one or more column families.  Each column family has its own set of HFiles and is compacted independently of the other column families in the same table.  In many cases, no more than one column family is needed per table.  More than one column family should be used only when the operations performed on, or the rate of change of, a subset of a table's columns differ from those of the other columns (Grover et al., 2015; Yang et al., 2013).  The last consideration for HBase schema design is the use of TTL, a built-in feature of HBase that ages out data based on its timestamp (Grover et al., 2015).  If TTL is not used but an aging requirement exists, removing old data would require a much more I/O-intensive operation.  The objects in HBase begin with the table object, followed by the regions for the table, a store per column family for each region, and then the memstore, store files, and blocks (Yang et al., 2013).  Figure 12 shows the hierarchy of objects in HBase.

Figure 12.  The Hierarchy of Objects in HBase (Yang et al., 2013).
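The following small sketch shows how a single column family with a TTL and a version limit might be declared through the same older admin API; the 30-day TTL, table name, and family name are assumptions for illustration.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class TtlColumnFamilyExample {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Admin admin = conn.getAdmin()) {

      HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("user_sessions"));
      HColumnDescriptor family = new HColumnDescriptor("s"); // one short-named column family
      family.setTimeToLive(30 * 24 * 60 * 60);               // age out cells 30 days after their timestamp
      family.setMaxVersions(1);                              // keep only the newest version of each cell
      desc.addFamily(family);
      admin.createTable(desc);
    }
  }
}
```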

To summarize this section, HBase schema design requires seven key considerations, starting with the row key, which should be selected carefully with record retrieval, distribution, block cache, ability to scan, size, readability, and uniqueness in mind.  The timestamp and hops are additional schema design considerations for HBase.  Tables and regions must be considered for put performance and compaction time.  The use of columns and column families should also be considered when designing the schema for HBase, and TTL for removing aged data is the final consideration.

Metadata Consideration

The discussion so far has been about the data and the techniques for storing it in Hadoop.  Metadata is as essential as the data itself; metadata is data about the data (Grover et al., 2015).  The Hadoop ecosystem has various forms of metadata.  Metadata about logical datasets, usually stored in a separate metadata repository, includes information such as the location of a dataset (a directory in HDFS or an HBase table name), the schema associated with the dataset, the partitioning and sorting properties of the dataset, and the format of the dataset, e.g., CSV, SequenceFile, etc. (Grover et al., 2015).  Metadata about files on HDFS includes the permissions and ownership of those files and the locations of the various blocks on the data nodes, usually stored and managed by the Hadoop NameNode (Grover et al., 2015).  Metadata about tables in HBase includes information such as table names, associated namespaces, associated attributes (e.g., MAX_FILESIZE, READONLY), and the names of column families, usually stored and managed by HBase itself (Grover et al., 2015).  Metadata about data ingest and transformation includes information such as which user generated a given dataset, where the dataset came from, how long it took to generate, and how many records there are or how large the data load was (Grover et al., 2015).  Metadata about dataset statistics includes information such as the number of rows in a dataset, the number of unique values in each column, a histogram of the distribution of the data, and maximum and minimum values (Grover et al., 2015).  Figure 13 summarizes these various types of metadata.


Figure 13.  Various Metadata in Hadoop.

Apache Hive was the first project in the Hadoop ecosystem to store, manage, and leverage metadata (Antony et al., 2016; Grover et al., 2015).  Hive stores this metadata in a relational database called the Hive metastore, and it provides a metastore service that interfaces with the metastore database (Antony et al., 2016; Grover et al., 2015).  During query processing, Hive contacts the metastore to obtain the metadata for the requested query, the metastore returns that metadata, Hive generates an execution plan, the job is executed on the Hadoop cluster, and Hive sends the fetched results back to the user (Antony et al., 2016; Grover et al., 2015).  Figure 14 shows the query process and the role of the metastore in the Hive framework.


Figure 14.  Query Process and the Role of Metastore in Hive (Antony et al., 2016).
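A minimal sketch of submitting such a query through the standard Hive JDBC driver, which exercises exactly the metastore-backed path described above; the host name, port, table, and column in this example are assumptions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    // HiveServer2 consults the metastore for table schema and location before planning the job.
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://hiveserver.example.com:10000/default", "analyst", "");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(
             "SELECT COUNT(*) FROM medication_orders WHERE `date` = '20181107'")) {
      while (rs.next()) {
        System.out.println("Matching orders: " + rs.getLong(1));
      }
    }
  }
}
```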

As more projects began to rely on the metadata concept introduced by Hive, a separate project called HCatalog was created to make the Hive metastore usable outside of Hive (Grover et al., 2015).  HCatalog is part of Hive and allows other tools, such as Pig and MapReduce, to integrate with the Hive metastore.  It also opens access to the Hive metastore to other tools through a REST API exposed by the WebHCat server.  MapReduce, Pig, and standalone applications can talk directly to the Hive metastore through its APIs, but HCatalog allows easy access through its WebHCat REST APIs and lets cluster administrators lock down direct access to the Hive metastore to address security concerns.  Other ways to store metadata include embedding metadata in file paths and names, or storing it in HDFS in a hidden file, e.g., .metadata.  Figure 15 shows HCatalog as an accessibility veneer around the Hive metastore (Grover et al., 2015).


Figure 15.  HCatalog acts as an accessibility veneer around the Hive metastore (Grover et al., 2015).

Hive Metastore and HCatalog Limitations

There are some limitations of the Hive metastore and HCatalog, including problems with high availability (HA) (Grover et al., 2015).  HA database cluster solutions can be used to bring high availability to the Hive metastore database.  For the metastore service itself, there is support for concurrently running multiple metastore instances on more than one node in the cluster.  However, concurrency issues related to data definition language (DDL) operations can occur, and the Hive community has been working on fixing these issues (Grover et al., 2015).

The fixed schema for metadata is another limitation.  Hadoop provides much flexibility in the types of data that can be stored, mainly because of the Schema-on-Read concept, but the Hive metastore imposes a fixed schema on the metadata itself and provides only a tabular abstraction for the datasets.  Finally, the metastore database and service are additional moving parts in the infrastructure that must be kept running and secured as part of the Hadoop deployment (Grover et al., 2015).

Conclusion

This project has discussed essential topics related to Hadoop technology.  It began with an overview of Hadoop, providing a history of Hadoop and the difference between Hadoop 1.x and Hadoop 2.x.  The discussion then covered the Big Data Analytics process using Hadoop technology.  The process involves six significant steps, starting with problem identification, identification of the required data, and the data collection process.  Data pre-processing and the ETL process must be completed before performing the analytics, and the last step is the visualization of the data for decision making.  Before collecting or processing any data for storage, some considerations must be taken into account for data preprocessing, modeling, and schema design in Hadoop to achieve better processing and better data retrieval, given that some tools cannot split the data while others can.  These considerations begin with the data storage format, followed by Hadoop file type considerations and the XML and JSON format challenges in Hadoop.  Compression must also be considered when designing the schema for Hadoop.  Since HDFS and HBase are commonly used in Hadoop for data storage, the discussion covered HDFS and HBase schema design considerations.  To summarize, the schema design for Hadoop, HDFS, and HBase makes a difference in how data is stored across the various nodes and whether the right tools can split the data.  Thus, organizations must pay attention to the process and the design requirements before storing data in Hadoop for better computational processing.

References

Alguliyev, R., & Imamverdiyev, Y. (2014). Big data: big promises for information security. Paper presented at the 2014 IEEE 8th International Conference on Application of Information and Communication Technologies (AICT).

Ankam, V. (2016). Big Data Analytics: Packt Publishing Ltd.

Antony, B., Boudnik, K., Adams, C., Lee, C., Shao, B., & Sasaki, K. (2016). Professional Hadoop: John Wiley & Sons.

Armstrong, D. (n.d.). R: Learning by Example: Lattice Graphics. Retrieved from https://quantoid.net/files/rbe/lattice.pdf.

Bhojwani, N., & Shah, A. P. V. (2016). A survey on Hadoop HBase system. Development, 3(1).

Dittrich, J., & Quiané-Ruiz, J.-A. (2012). Efficient big data processing in Hadoop MapReduce. Proceedings of the VLDB Endowment, 5(12), 2014-2015.

Grover, M., Malaska, T., Seidman, J., & Shapira, G. (2015). Hadoop Application Architectures: Designing Real-World Big Data Applications: O'Reilly Media, Inc.

Hu, H., Wen, Y., Chua, T.-S., & Li, X. (2014). Toward scalable systems for big data analytics: A technology tutorial. IEEE Access, 2, 652-687.

Karanth, S. (2014). Mastering Hadoop: Packt Publishing Ltd.

Lublinsky, B., Smith, K. T., & Yakubovich, A. (2013). Professional hadoop solutions: John Wiley & Sons.

sas.com. (2018). Hadoop – what it is and why it matters. Retrieved from https://www.sas.com/en_us/insights/big-data/hadoop.html.

Yang, C. T., Liu, J. C., Hsu, W. H., Lu, H. W., & Chu, W. C. C. (2013, 16-18 Dec. 2013). Implementation of Data Transform Method into NoSQL Database for Healthcare Data. Paper presented at the 2013 International Conference on Parallel and Distributed Computing, Applications and Technologies.

 

Abstract

The purpose of this project is to discuss how data can be handled before Hadoop breaks it into manageable sizes.  The discussion begins with an overview of Hadoop, providing a brief history of Hadoop and the difference between Hadoop 1.x and Hadoop 2.x.  The discussion then covers the Big Data Analytics process using Hadoop, which involves six significant steps, including the data pre-processing and ETL step in which data must be converted and cleaned before processing.  Before data processing, some considerations must be taken into account for data preprocessing, modeling, and schema design in Hadoop to achieve better processing and data retrieval, as these decisions affect how data can be split among the various nodes of the distributed environment, because not all tools can split the data.  These considerations begin with the data storage format, followed by Hadoop file type considerations and the XML and JSON format challenges in Hadoop.  The compression of the data must be considered carefully because not all compression types are "splittable."  The discussion also covers the schema design considerations for HDFS and HBase, since they are used often in the Hadoop ecosystem.

Keywords: Big Data Analytics; Hadoop; Data Modelling in Hadoop; Schema Design in Hadoop.

Introduction

In the age of Big Data, dealing with large datasets in terabytes and petabytes is a reality and requires specific technology, as traditional technology has been found inadequate for it (Dittrich & Quiané-Ruiz, 2012).  Hadoop was developed to store and process such large datasets efficiently and is becoming the data processing engine for Big Data (Dittrich & Quiané-Ruiz, 2012).  One of the significant advantages of Hadoop MapReduce is that it allows non-expert users to easily run analytical tasks over Big Data (Dittrich & Quiané-Ruiz, 2012).  However, before the analytical process takes place, some schema design and data modeling considerations must be addressed so that data processing in Hadoop can be efficient (Grover, Malaska, Seidman, & Shapira, 2015).  Hadoop requires splitting the data, and some tools can split the data while others cannot split it natively and require integration (Grover et al., 2015).

This project discusses these considerations to ensure an appropriate schema design for Hadoop and its components HDFS and HBase, where the data gets stored in a distributed environment.  The discussion begins with an overview of Hadoop, followed by the data analytics process, and ends with the data modeling techniques and considerations for Hadoop that can assist in splitting the data appropriately for better data processing performance and better data retrieval.

Overview of Hadoop

            Google published and disclosed its MapReduce technique and implementation around 2004 (Karanth, 2014).  It also introduced the Google File System (GFS), which is associated with the MapReduce implementation.  Since then, MapReduce has become the most common technique for processing massive data sets in parallel and distributed settings across many companies (Karanth, 2014).  In 2008, Yahoo released Hadoop as an open-source implementation of the MapReduce framework (Karanth, 2014; sas.com, 2018).  Hadoop and its file system HDFS are inspired by Google's MapReduce and GFS (Ankam, 2016; Karanth, 2014).

The Apache Hadoop project is the parent project for all subsequent Hadoop projects (Karanth, 2014).  It contains three essential branches: the 0.20.1 branch, the 0.20.2 branch, and the 0.21 branch.  The 0.20.2 branch is often termed MapReduce v2.0, MRv2, or Hadoop 2.0.  Two additional releases, Hadoop-0.20-append and Hadoop-0.20-security, introduced HDFS append and security-related features into Hadoop, respectively.  The timeline for Hadoop technology is outlined in Figure 1.


Figure 1.  Hadoop Timeline from 2003 until 2013 (Karanth, 2014).

Hadoop version 1.0 represented the inception and evolution of Hadoop as a simple MapReduce job-processing framework (Karanth, 2014).  It exceeded expectations with its wide adoption for massive data processing.  The stable 1.x release includes features such as append and security.  The Hadoop 2.0 release came out in 2013 to increase the efficiency and mileage of existing Hadoop clusters in enterprises.  Hadoop has been evolving from being limited to MapReduce into a common cluster-computing and storage platform, moving beyond MapReduce to stay at the leading edge of massive-scale data processing while facing the challenge of remaining backward compatible (Karanth, 2014).

            In Hadoop 1.x, the JobTracker was responsible for both resource allocation and job execution (Karanth, 2014).  MapReduce was the only supported model because the computing model was tied to the resources in the cluster.  Yet Another Resource Negotiator (YARN) was developed to separate the concerns of resource management and application execution, which enables other application paradigms to be added to a Hadoop computing cluster.  The support for diverse applications results in efficient and effective utilization of the resources and integrates well with the infrastructure of the business (Karanth, 2014).  YARN maintains backward compatibility with the Hadoop 1.x APIs (Karanth, 2014).  Thus, an old MapReduce program can still execute in YARN with no code changes, but it has to be recompiled (Karanth, 2014).

            YARN abstracts out the resource management functions to form a platform layer called ResourceManager (RM) (Karanth, 2014).  Every cluster must have RM to keep track of cluster resource usage and activity.  RM is also responsible for allocation of the resources and resolving contentions among resource seekers in the cluster.  RM utilizes a generalized resource model and is agnostic to application-specific resource needs.  RM does not need to know the resources corresponding to a single Map or Reduce slot (Karanth, 2014). Figure 2 shows Hadoop 1.x and Hadoop 2.x with YARN layer.   


Figure 2. Hadoop 1.x vs. Hadoop 2.x (Karanth, 2014).

Hadoop 2.x involves various enhancements at the storage layer as well.  These enhancements include a high-availability feature that keeps a hot standby of the NameNode (Karanth, 2014); when the active NameNode fails, the standby can become the active NameNode in a matter of minutes.  ZooKeeper or another HA monitoring service can be utilized to track NameNode failure (Karanth, 2014), and the failover process that promotes the hot standby to active NameNode is triggered with the assistance of ZooKeeper.  HDFS federation is another enhancement in Hadoop 2.x; it is a more generalized storage model in which block storage has been generalized and separated from the filesystem layer (Karanth, 2014).  HDFS snapshots are another Hadoop 2.x enhancement, providing a read-only image of the entire filesystem or a particular subset of it to protect against user errors and to support backup and disaster recovery.  Other enhancements added in Hadoop 2.x include Protocol Buffers (Karanth, 2014); the wire protocol for RPCs within Hadoop is based on Protocol Buffers.  Hadoop 2.x is also aware of the type of storage and exposes this information to the application to optimize data fetch and placement strategies (Karanth, 2014).  HDFS append support has been another enhancement in Hadoop 2.x.
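As a small illustration of the snapshot feature, the sketch below uses the Hadoop FileSystem API to take a read-only snapshot of a directory; it assumes an administrator has already marked the directory as snapshottable (e.g., with hdfs dfsadmin -allowSnapshot), and the path and snapshot name are illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SnapshotExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Prerequisite (run once by an admin): hdfs dfsadmin -allowSnapshot /data/medication_orders
    Path dir = new Path("/data/medication_orders");
    Path snapshot = fs.createSnapshot(dir, "before-reload-20181107");

    // The snapshot is a read-only image of the directory at this point in time.
    System.out.println("Created snapshot at " + snapshot);
  }
}
```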

Hadoop is regarded as the de facto open-source framework for large-scale, massively parallel, and distributed data processing (Karanth, 2014).  The Hadoop framework includes two layers: a computation layer and a data layer (Karanth, 2014).  The computation layer is used for parallel and distributed computation, while the data layer provides highly fault-tolerant data storage associated with the computation layer.  These two layers run on commodity hardware, which is inexpensive, readily available, and compatible with other similar hardware (Karanth, 2014).

Hadoop Architecture

Apache Hadoop has four projects: Hadoop Common, the Hadoop Distributed File System (HDFS), Yet Another Resource Negotiator (YARN), and MapReduce (Ankam, 2016).  HDFS is used to store data, MapReduce to process data, and YARN to manage cluster resources such as CPU and memory, while Hadoop Common provides the common utilities that support the Hadoop framework (Ankam, 2016; Karanth, 2014).  Apache Hadoop integrates with other tools such as Avro, Hive, Pig, HBase, Zookeeper, and Apache Spark (Ankam, 2016; Karanth, 2014).

            Hadoop has three significant components for Big Data Analytics.  HDFS is a framework for reliable distributed data storage (Ankam, 2016; Karanth, 2014), and some considerations must be taken into account when storing data in HDFS (Grover et al., 2015).  The frameworks for parallel processing of data include MapReduce, Crunch, Cascading, Hive, Tez, Impala, Pig, Mahout, Spark, and Giraph (Ankam, 2016; Karanth, 2014).  The Hadoop architecture includes NameNodes and DataNodes.  It also includes Oozie for workflow, Pig for scripting, Mahout for machine learning, Hive for the data warehouse, Sqoop for data exchange, and Flume for log collection.  YARN, as discussed earlier, provides distributed computing in Hadoop 2.0, while HCatalog provides Hadoop metadata management, HBase serves as a columnar database, and Zookeeper provides coordination (Alguliyev & Imamverdiyev, 2014).  Figure 3 shows the Hadoop ecosystem components.


Figure 3.  Hadoop Architecture (Alguliyev & Imamverdiyev, 2014).

Big Data Analytics Process Using Hadoop

The process of Big Data Analytics involves six essential steps (Ankam, 2016).  The identification of the business problem and desired outcomes is the first step.  Examples of business problems include declining sales, shopping carts abandoned by customers, a sudden rise in call volumes, and so forth.  Examples of outcomes include improving the buying rate by 10%, decreasing shopping cart abandonment by 50%, and reducing call volume by 50% by next quarter while keeping customers happy.  The second step is to identify the required data, where the data sources can be a data warehouse using online analytical processing, an application database using online transactional processing, log files from servers, documents from the internet, sensor-generated data, and so forth, depending on the case and the problem.  Data collection is the third step in analyzing Big Data (Ankam, 2016).  The Sqoop tool can be used to collect data from relational databases, Flume can be used for streaming data, and Apache Kafka can be used for reliable intermediate storage.  The data collection design should incorporate a fault-tolerance strategy (Ankam, 2016).

The fourth step is data pre-processing and the ETL process.  The collected data comes in various formats, and data quality can be an issue; thus, before processing, the data needs to be converted to the required format and cleaned of inconsistent, invalid, or corrupted records.  Apache Hive, Apache Pig, and Spark SQL can be used for preprocessing massive amounts of data.  The fifth step is the implementation of the analytics needed to answer the business questions and problems.  The analytical process requires understanding the data and the relationships between data points.  The types of data analytics include descriptive and diagnostic analytics, which present past and current views of the data to answer questions such as what happened and why it happened, and predictive analytics, which is performed to answer questions such as what would happen based on a hypothesis.  Apache Hive, Pig, Impala, Drill, Tez, Apache Spark, and HBase can be used for data analytics in batch processing mode.  Real-time analytics tools, including Impala, Tez, Drill, and Spark SQL, can be integrated with traditional business intelligence (BI) tools such as Tableau, QlikView, and others for interactive analytics.  The last step in this process is the visualization of the data, presenting the analytics output in graphical or pictorial form so the analysis can be better understood for decision making.  The finished data is exported from Hadoop to a relational database using Sqoop for integration into visualization systems, or visualization tools such as Tableau, QlikView, Excel, and so forth are integrated directly.  Web-based notebooks such as Jupyter, Zeppelin, and Databricks Cloud are also used to visualize data by integrating Hadoop and Spark components (Ankam, 2016).

Data Preprocessing, Modeling and Design Consideration in Hadoop

            Before processing any data, and before collecting any data for storage, some considerations must be taken into account for data modeling and design in Hadoop to achieve better processing and better retrieval (Grover et al., 2015).  The traditional data management system is referred to as a Schema-on-Write system, which requires the schema of the data store to be defined before the data is loaded (Grover et al., 2015).  This traditional approach results in long analysis cycles of data modeling, data transformation, loading, testing, and so forth before the data can be accessed (Grover et al., 2015).  In addition, if anything changes or a wrong decision was made, the cycle must start again from the beginning, which takes even longer (Grover et al., 2015).  This section addresses the various types of considerations to weigh before processing data in Hadoop for analytical purposes.

Data Pre-Processing Consideration

A dataset may have varying levels of quality with regard to noise, redundancy, and consistency (Hu, Wen, Chua, & Li, 2014).  Preprocessing techniques to improve data quality must therefore be in place in Big Data systems (Hu et al., 2014; Lublinsky, Smith, & Yakubovich, 2013).  Data pre-processing involves three techniques: data integration, data cleansing, and redundancy elimination.

Data integration techniques are used to combine data residing in different sources and provide users with a unified view of the data (Hu et al., 2014).  The traditional database approach has well-established data integration systems, including the data warehouse method and the data federation method (Hu et al., 2014).  The data warehouse approach is also known as ETL, consisting of extraction, transformation, and loading (Hu et al., 2014).  The extraction step involves connecting to the source systems and selecting and collecting the required data to be processed for analytical purposes.  The transformation step applies a series of rules to the extracted data to convert it into a standard format.  The load step imports the extracted and transformed data into a target storage infrastructure (Hu et al., 2014).  The federation approach creates a virtual database to query and aggregate data from various sources (Hu et al., 2014).  The virtual database contains information, or metadata, about the actual data and its location but does not contain the data itself (Hu et al., 2014).  These two pre-processing approaches are store-and-pull techniques, which are not well suited to Big Data processing given its high computational demands, streaming workloads, and dynamic nature (Hu et al., 2014).

The data cleansing process is vital for keeping data consistent and up to date and is widely used in many fields such as banking, insurance, and retailing (Hu et al., 2014).  The cleansing process identifies incomplete, inaccurate, or unreasonable data and then removes or corrects it to improve data quality (Hu et al., 2014).  The data cleansing process includes five steps (Hu et al., 2014).  The first step is to define and determine the error types, and the second is to search for and identify error instances.  The third step is to correct the errors, followed by documenting the error instances and error types.  The last step is to modify data entry procedures to reduce future errors.  Various checks must be performed during cleansing, including format checks, completeness checks, reasonableness checks, and limit checks (Hu et al., 2014).  Data cleansing is required to improve the accuracy of the analysis (Hu et al., 2014), but it depends on a complex relationship model and carries extra computation and delay overhead (Hu et al., 2014).  Organizations must therefore seek a balance between the complexity of the data-cleansing model and the resulting improvement in analysis accuracy (Hu et al., 2014).

Redundancy elimination is the third data pre-processing step.  Data redundancy, where data is repeated, increases the overhead of data transmission and causes problems for storage systems, including wasted space, data inconsistency, data corruption, and reduced reliability (Hu et al., 2014).  Redundancy reduction methods include redundancy detection and data compression (Hu et al., 2014).  The data compression method poses an extra computational burden in the compression and decompression processes (Hu et al., 2014).

Data Modeling and Design Consideration

A Schema-on-Write system is used when the application or structure is well understood and the data is frequently accessed through queries and reports on high-value data (Grover et al., 2015).  The term Schema-on-Read is used in the context of the Hadoop data management system (Ankam, 2016; Grover et al., 2015).  It refers to raw data that is not processed before being loaded into Hadoop; the required structure is applied at processing time based on the requirements of the processing application (Ankam, 2016; Grover et al., 2015).  Schema-on-Read is used when the application or the structure of the data is not well understood (Ankam, 2016; Grover et al., 2015).  The agility of the process comes from the schema-on-read approach, which provides valuable insights on data not previously accessible (Grover et al., 2015).

            Five factors must be considered before storing data in Hadoop for processing (Grover et al., 2015).  The first is the data storage format, as there are a number of file formats and compression formats supported on Hadoop, and each format has strengths that make it better suited to specific applications.  Although the Hadoop Distributed File System (HDFS) is the building block of the Hadoop ecosystem used for storing data, several commonly used systems are implemented on top of HDFS, such as HBase for additional data access functionality and Hive for additional data management functionality, and these systems must be taken into consideration before storing data in Hadoop (Grover et al., 2015).  The second factor is multitenancy, a common approach in which clusters host multiple users, groups, and application types; multi-tenant clusters involve important considerations for data storage.  The third factor is schema design, which should be considered even though Hadoop itself is schema-less (Grover et al., 2015).  Schema design covers the directory structures for data loaded into HDFS and for the output of data processing and analysis, as well as the schemas of objects stored in systems such as HBase and Hive.  The fourth factor is metadata management.  Metadata is related to the stored data and is often regarded as being as important as the data itself, and an understanding of metadata management plays a significant role because it can affect the accessibility of the data.  Security is the fifth factor to consider before storing data in a Hadoop system.  Security decisions involve authentication, fine-grained access control, and encryption, and these measures should be considered for data at rest, when it is stored, as well as for data in motion during processing (Grover et al., 2015).  Figure 4 summarizes these considerations before storing data into the Hadoop system.


Figure 4.  Considerations Before Storing Data into Hadoop.

Data Storage Format Considerations

            When architecting a solution on Hadoop, the method of storing the data is one of the essential decisions.  Primary considerations for data storage in Hadoop involve the file format, compression, and the data storage system (Grover et al., 2015).  The standard file formats fall into three types: text data, structured text data, and binary data.  Figure 5 summarizes these three standard file formats.


Figure 5.  Standard File Formats.

Text data is a widespread use of Hadoop, including log files such as weblogs and server logs (Grover et al., 2015).  This text data can come in many forms, such as CSV files, or as unstructured data such as emails.  Compression of the files is recommended, and the selection of the compression format is influenced by how the data will be used (Grover et al., 2015).  For instance, if the data is for archival purposes, the most compact compression method can be used, while if the data is used in processing jobs such as MapReduce, a splittable format should be used (Grover et al., 2015).  A splittable format enables Hadoop to split files into chunks for processing, which is essential for efficient parallel processing (Grover et al., 2015).

In most cases, the use of container formats such as SequenceFiles or Avro provides benefits that make them the preferred formats for most data types, including text (Grover et al., 2015).  It is worth noting that these container formats provide functionality to support splittable compression, among other benefits (Grover et al., 2015).  Binary data such as images can also be stored in Hadoop, and a container format such as SequenceFile is preferred for it.  If the splittable unit of binary data is larger than 64 MB, however, the data should be put into its own file without using a container format (Grover et al., 2015).

XML and JSON Format Challenges with Hadoop

Structured text data includes formats such as XML and JSON, which present unique challenges in Hadoop because splitting XML and JSON files for processing is not straightforward, and Hadoop does not provide a built-in InputFormat for either (Grover et al., 2015).  JSON presents more challenges than XML because no token is available to mark the beginning or end of a record.  When using these file formats, two primary approaches should be considered.  The first is to transform the data into a container format such as Avro, which provides a compact and efficient way to store and process it (Grover et al., 2015).  The second is to use a library designed for processing XML or JSON files: the XMLLoader in Pig's PiggyBank library is an example for XML data, and the Elephant Bird project is an example for JSON data (Grover et al., 2015).
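Another common workaround, sketched below under the assumption that each input line holds exactly one complete JSON object (so the standard newline-based input split remains safe), is to parse records inside a map task with a JSON library such as Jackson; the field name order_id is illustrative.

```java
import java.io.IOException;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Assumes one JSON object per input line, so TextInputFormat can split the file on newlines.
public class JsonRecordMapper extends Mapper<LongWritable, Text, Text, Text> {
  private final ObjectMapper jackson = new ObjectMapper();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    JsonNode record = jackson.readTree(line.toString());   // parse a single JSON record
    String orderId = record.path("order_id").asText("unknown");
    context.write(new Text(orderId), line);                // emit the raw record keyed by order_id
  }
}
```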

Hadoop File Types Considerations

            Several Hadoop-specific file formats have been created to work well with MapReduce (Grover et al., 2015).  They include file-based data structures such as SequenceFiles, serialization formats like Avro, and columnar formats such as RCFile and Parquet (Grover et al., 2015).  These file types share two essential characteristics that are important for Hadoop applications: splittable compression and agnostic compression.  Splittability plays a significant role during data processing and should not be underestimated when storing data in Hadoop, because it allows large files to be split for input to MapReduce and other types of jobs, which is a fundamental part of parallel processing and a key to leveraging Hadoop's data locality feature (Grover et al., 2015).  Agnostic compression is the ability to compress data using any compression codec without readers having to know the codec, because the codec is stored in the header metadata of the file format (Grover et al., 2015).  Figure 6 summarizes these Hadoop-specific file formats and their two common characteristics of splittable compression and agnostic compression.


Figure 6. Three Hadoop File Types with the Two Common Characteristics.  

1.      SequenceFiles Format Consideration

The SequenceFile format is the most widely used of the Hadoop file-based formats.  SequenceFiles store data as binary key-value pairs (Grover et al., 2015) and support three record formats: uncompressed, record-compressed, and block-compressed.  Every SequenceFile uses a standard header containing necessary metadata about the file, such as the compression codec used, the key and value class names, user-defined metadata, and a randomly generated sync marker.  SequenceFiles are well supported within Hadoop; however, they have limited support outside the Hadoop ecosystem because they are only supported in Java.  A frequent use case for SequenceFiles is as a container for smaller files.  Storing a large number of small files in Hadoop can cause memory issues and excessive processing overhead, and packing smaller files into a SequenceFile makes their storage and processing more efficient because Hadoop is optimized for large files (Grover et al., 2015).  Other file-based formats include MapFiles, SetFiles, ArrayFiles, and BloomMapFiles.  These formats offer a high level of integration for all forms of MapReduce jobs, including those run via Pig and Hive, because they were designed to work with MapReduce (Grover et al., 2015).  Figure 7 summarizes the three formats for records stored within SequenceFiles.


Figure 7.  Three Formats for Records Stored within SequenceFile.
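The following sketch packs a handful of local files into one block-compressed SequenceFile keyed by file name, illustrating the small-files use case; the paths, the input directory, and the choice of Snappy are assumptions for illustration.

```java
import java.io.File;
import java.nio.file.Files;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.SnappyCodec;

public class PackSmallFilesExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(new Path("/data/packed/emails.seq")),
        SequenceFile.Writer.keyClass(Text.class),
        SequenceFile.Writer.valueClass(BytesWritable.class),
        SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK, new SnappyCodec()))) {

      for (File small : new File("/tmp/emails").listFiles()) {              // many small input files
        byte[] bytes = Files.readAllBytes(small.toPath());
        writer.append(new Text(small.getName()), new BytesWritable(bytes)); // key = original file name
      }
    }
  }
}
```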

2.      Serialization Formats Consideration

Serialization is the process of turning data structures into bytes for storage or for transferring data over the network (Grover et al., 2015).  De-serialization is the opposite process of converting a byte stream back into data structures (Grover et al., 2015).  Serialization is a fundamental building block for distributed processing systems such as Hadoop because it allows data to be converted into a format that can be efficiently stored and transferred across a network connection (Grover et al., 2015).  Figure 8 contrasts the serialization and deserialization processes.


Figure 8.  Serialization Process vs. Deserialization Process.

Serialization appears in two areas of data processing in distributed systems: interprocess communication through remote procedure calls (RPC) and persistent data storage (Grover et al., 2015).  Hadoop uses Writables as its main serialization format, which is compact and fast but usable only from Java.  Other serialization frameworks are increasingly used within the Hadoop ecosystem, including Thrift, Protocol Buffers, and Avro (Grover et al., 2015).  Avro is a language-neutral data serialization system (Grover et al., 2015) designed to address the main limitation of Hadoop's Writables, namely the lack of language portability.  Like Thrift and Protocol Buffers, Avro is described through a language-independent schema (Grover et al., 2015); unlike Thrift and Protocol Buffers, code generation in Avro is optional.  Table 1 provides a comparison between these serialization formats.

Table 1:  Comparison between Serialization Formats.
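A minimal sketch of writing an Avro container file with the generic (no code generation) API and a Snappy codec; the record schema, field values, and output file name are invented for illustration.

```java
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumWriter;

public class AvroWriteExample {
  public static void main(String[] args) throws Exception {
    // Language-independent schema; no generated classes are required.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Order\",\"fields\":["
        + "{\"name\":\"order_id\",\"type\":\"string\"},"
        + "{\"name\":\"amount\",\"type\":\"double\"}]}");

    GenericRecord order = new GenericData.Record(schema);
    order.put("order_id", "O-1001");
    order.put("amount", 42.5);

    DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(schema);
    try (DataFileWriter<GenericRecord> fileWriter = new DataFileWriter<>(datumWriter)) {
      fileWriter.setCodec(CodecFactory.snappyCodec());    // block-level compression within the container
      fileWriter.create(schema, new File("orders.avro")); // the schema is embedded in the file header
      fileWriter.append(order);
    }
  }
}
```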

3.      Columnar Format Consideration

Row-oriented systems have traditionally been used to fetch data stored in a database (Grover et al., 2015).  This type of data retrieval suits analysis that relies heavily on fetching all fields for records that belong to a specific time range, and it is efficient when all columns of a record are available at write time, because the record can be written with a single disk seek.  Columnar storage has more recently been used to fetch data, and it offers four main benefits over a row-oriented system (Grover et al., 2015).  First, it skips I/O and decompression on columns that are not part of the query.  Second, columnar storage works better for queries that access only a small subset of columns, whereas row-oriented storage is appropriate when many columns of each record are retrieved.  Third, compression is more effective on columns because data within the same column is more similar than data within a block of rows.  Fourth, columnar storage is better suited to data warehousing applications in which aggregations are computed over specific columns rather than over an extensive collection of whole records (Grover et al., 2015).  Hadoop applications use columnar file formats including the RCFile format, Optimized Row Columnar (ORC), and Parquet.  The RCFile format has been used as a Hive format; it was developed to provide fast data loading, fast query processing, and highly efficient storage space utilization, and it breaks files into row splits that use column-oriented storage within each split.  Despite its query and compression advantages over SequenceFiles, RCFile has limitations that prevent optimal query times and compression, and the newer columnar formats ORC and Parquet are designed to address many of these limitations (Grover et al., 2015).
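As a sketch of producing one of the newer columnar formats from Java, the parquet-avro module's AvroParquetWriter can reuse the Avro schema from the previous example to write a Snappy-compressed Parquet file; the output path is an assumption, and exact builder signatures vary slightly between Parquet releases.

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class ParquetWriteExample {
  public static void main(String[] args) throws Exception {
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Order\",\"fields\":["
        + "{\"name\":\"order_id\",\"type\":\"string\"},"
        + "{\"name\":\"amount\",\"type\":\"double\"}]}");

    GenericRecord order = new GenericData.Record(schema);
    order.put("order_id", "O-1001");
    order.put("amount", 42.5);

    // Values are grouped by column on disk, so queries touching few columns read far less data.
    try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
        .<GenericRecord>builder(new Path("/data/orders/orders.parquet"))
        .withSchema(schema)
        .withCompressionCodec(CompressionCodecName.SNAPPY)
        .build()) {
      writer.write(order);
    }
  }
}
```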

Compression Consideration

Compression is another data storage consideration because it plays a crucial role in reducing storage requirements and improving data processing performance (Grover et al., 2015).  Some compression formats supported on Hadoop are not splittable (Grover et al., 2015); since the MapReduce framework splits data for input to multiple tasks, a non-splittable compression format is an obstacle to efficient processing.  Thus, splittability is a critical consideration when selecting compression and file formats for Hadoop.  Common compression types for Hadoop include Snappy, LZO, Gzip, and bzip2.  Snappy was developed by Google for speed, so it does not offer the best compression size, and because it is not inherently splittable it is designed to be used with a container format like SequenceFile or Avro; it is distributed with Hadoop.  Similar to Snappy, LZO is optimized for speed rather than size; unlike Snappy, LZO supports splitting of the compressed files, but this requires indexing, and LZO is not distributed with Hadoop because it requires a license and a separate installation.  Gzip provides good compression performance, with read speed similar to Snappy but slower writes; it is not splittable and should be used with a container format, and using smaller blocks with Gzip can result in better performance.  Bzip2 also provides good compression performance but can be slower than other compression codecs such as Snappy, so it is not an ideal codec for Hadoop storage; unlike Snappy and Gzip, bzip2 is inherently splittable because it inserts synchronization markers between blocks, and it can be used for active archival purposes (Grover et al., 2015).

A compression format can effectively become splittable when used with a container file format such as Avro or SequenceFile, which compresses blocks of records or each record individually (Grover et al., 2015).  If compression is applied to an entire file without a container file format, a compression format that is inherently splittable, such as bzip2, must be used.  There are three recommendations for using compression with Hadoop (Grover et al., 2015).  The first is to enable compression of MapReduce intermediate output, which improves performance by decreasing the amount of intermediate data that needs to be written to and read from disk.  The second is to pay attention to the order of the data: when similar data is close together it compresses better, because data in Hadoop file formats is compressed in chunks and the organization of those chunks determines the final compression.  The last recommendation is to consider using a compact file format with support for splittable compression, such as Avro.  Avro and SequenceFiles provide splittability even with non-splittable compression codecs: a single HDFS block can contain multiple Avro or SequenceFile blocks, and each of those blocks can be compressed and decompressed individually and independently of the others, which keeps the data splittable.  Figure 9 shows the Avro and SequenceFile splittability support (Grover et al., 2015).


Figure 9.  Compression Example Using Avro (Grover et al., 2015).
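The first recommendation can be sketched as a small driver-side configuration, shown below with Snappy for the intermediate map output and block-compressed SequenceFile job output; the property names are the standard Hadoop 2.x keys, while the job name and codec choice are illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class CompressionConfigExample {
  public static Job configure() throws Exception {
    Configuration conf = new Configuration();
    // Compress the intermediate map output that is spilled to disk and shuffled across the network.
    conf.setBoolean("mapreduce.map.output.compress", true);
    conf.setClass("mapreduce.map.output.compress.codec", SnappyCodec.class, CompressionCodec.class);

    Job job = Job.getInstance(conf, "compressed-etl");
    // Compress the final job output as block-compressed SequenceFiles, which remain splittable.
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
    SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);
    return job;
  }
}
```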

Design Consideration for HDFS Schema

HDFS and HBase are the most commonly used storage managers in the Hadoop ecosystem; organizations can store data directly in HDFS or in HBase, which internally stores its data on HDFS (Grover et al., 2015).  When storing data in HDFS, several design techniques must be taken into consideration.  The schema-on-read model of Hadoop does not impose any requirement when loading data: data can be ingested into HDFS by one of many methods without the need to associate a schema or preprocess the data.  Although Hadoop can load many types of data, such as unstructured and semi-structured data, some order is still required, because Hadoop often serves as a central location for the entire organization and data stored in HDFS is intended to be shared across various departments and teams (Grover et al., 2015).  A carefully structured and organized data repository provides several benefits (Grover et al., 2015).  When there is a standard directory structure, it becomes easier to share data among teams working with the same data sets.  Data can be staged in a separate location before it is processed, and a standard staging convention helps ensure that data is not processed before it has been completely and appropriately staged.  A standard organization of data also allows reuse of some of the code that processes it (Grover et al., 2015), and agreed-upon assumptions about data placement help simplify the process of loading data into Hadoop.  The HDFS data model design for projects such as a data warehouse implementation is likely to use structured fact and dimension tables similar to a traditional schema (Grover et al., 2015), whereas the design for projects with unstructured and semi-structured data is likely to focus on directory placement and metadata management (Grover et al., 2015).

Grover et al. (2015) suggested three key considerations when designing the schema, regardless of the data model design project.  The first consideration is to develop standard practices that can be followed by all teams.  The second point is to ensure the design works well with the chosen tools.  For instance, if the version of Hive can support only table partitions on directories that are named a certain way, it will affect the schema design and the names of the table subdirectories.  The last consideration when designing a schema is to keep usage patterns in mind, because different data processing and querying patterns work better with different schema designs (Grover et al., 2015). 

HDFS Files Location Consideration

            The first step in designing an HDFS schema is determining the location of the files.  Standard file locations play a significant role in finding and sharing data among various departments and teams, and they also help in assigning permissions to access files to various groups and users.  The recommended file locations are summarized in Table 2.


Table 2.  Standard Files Locations.

HDFS Schema Design Consideration

The HDFS schema design involves advanced techniques to organize data into files (Grover et al., 2015).  A few strategies are recommended for organizing a data set: partitioning, bucketing, and denormalization.  Partitioning a data set is a common technique used to reduce the amount of I/O required to process it.  Unlike a traditional data warehouse, HDFS does not store indexes on the data.  The lack of indexes speeds up data ingest, but it comes at the cost of a full table scan, where every query has to read the entire data set even when processing only a small subset of it.  Breaking the data set up into smaller subsets, or partitions, addresses this by allowing queries to read only the relevant partitions, reducing the amount of I/O and improving query processing time significantly (Grover et al., 2015).  When data is placed in the filesystem, the directory format for partitions should be as shown below; a minimal code sketch follows the list.  The order data set is partitioned by date because a large number of orders are placed daily, so the partitions contain files that are large enough to be handled efficiently by HDFS.  Various tools such as HCatalog, Hive, Impala, and Pig understand this directory structure and leverage the partitioning to reduce the amount of I/O required during data processing (Grover et al., 2015).

  • <data set name>/<partition_column_name=partition_column_value>/
  • e.g., medication_orders/date=20181107/[order1.csv, order2.csv]
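A minimal sketch of writing a file into such a date partition directory with the Hadoop FileSystem API; the base path and file contents are assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PartitionWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // The partition directory encodes the partition column and value in its name.
    Path partition = new Path("/data/medication_orders/date=20181107");
    fs.mkdirs(partition);

    try (FSDataOutputStream out = fs.create(new Path(partition, "order1.csv"))) {
      out.writeBytes("O-1001,C1001,42.50\n");   // one illustrative CSV record
    }
  }
}
```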

Bucketing is another technique for breaking a large data set into manageable subsets (Grover et al., 2015).  The bucketing technique is similar to the hash partitioning used in relational databases.  The partitioning example above used the date, which results in data files large enough to be handled well by HDFS (Grover et al., 2015).  However, if the data set were partitioned by the physician category, the result would be too many small files, leading to the small-files problem: excessive memory use on the NameNode, since the metadata for each file stored in HDFS is held in memory (Grover et al., 2015).  Many small files also lead to many processing tasks, causing excessive overhead in processing.  The solution to too many small files is to bucket the data, in this example by physician, using a hashing function to map physicians into a specified number of buckets (Grover et al., 2015).

The bucketing technique controls the size of the data subsets and optimizes query speed (Grover et al., 2015).  The recommended average bucket size is a few multiples of the HDFS block size, and an even distribution of the data when hashed on the bucketing column is essential because it results in consistent bucket sizes (Grover et al., 2015).  It is common to use a number of buckets that is a power of two.  Bucketing also benefits joins between two data sets; the join here represents the general idea of combining two data sets to retrieve a result, and joins can be implemented through SQL-on-Hadoop systems as well as in MapReduce, Spark, or other programming interfaces to Hadoop.  When both data sets are bucketed on the join key, corresponding buckets can be joined individually without joining the entire data sets, which minimizes the cost of the reduce-side join of the two data sets, a computationally expensive operation (Grover et al., 2015).  The join can instead be implemented in the map stage of a MapReduce job by loading the smaller of the corresponding buckets into memory, because the buckets are small enough to fit easily, which is called a map-side join; a map-side join improves join performance compared with a reduce-side join.  Hive recognizes that the tables are bucketed and optimizes the join process accordingly.

Further optimization is possible if the data in each bucket is sorted: a merge join can be used, the entire bucket does not have to be held in memory during the join, and the result is a faster process that uses much less memory than a simple bucket join.  Hive supports this optimization as well.  Using both sorting and bucketing on large tables that are frequently joined together, with the join key as the bucketing column, is recommended (Grover et al., 2015).
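Below is a sketch of declaring such a partitioned, bucketed, and sorted table through the Hive JDBC driver; the table layout, the choice of 32 buckets, and the session settings are illustrative, and the two SET properties shown apply to older Hive releases (newer releases enforce bucketing by default).

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class BucketedTableExample {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://hiveserver.example.com:10000/default", "analyst", "");
         Statement stmt = conn.createStatement()) {

      stmt.execute("SET hive.enforce.bucketing=true");      // needed on older Hive versions only
      stmt.execute("SET hive.optimize.bucketmapjoin=true"); // allow bucketed map-side joins

      stmt.execute("CREATE TABLE IF NOT EXISTS orders_bucketed ("
          + "  order_id STRING, physician_id STRING, amount DOUBLE)"
          + " PARTITIONED BY (order_date STRING)"
          + " CLUSTERED BY (physician_id) SORTED BY (physician_id) INTO 32 BUCKETS"
          + " STORED AS ORC");
    }
  }
}
```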

The schema design depends on how the data will be queried (Grover et al., 2015).  Thus, the columns to be used for joining and filtering must be identified before the partitioning and bucketing of the data are implemented.  In some cases, when identifying a single partitioning key is challenging, the same data set can be stored multiple times, each copy with a different physical organization, which would be considered an anti-pattern in a relational database.  This solution works with Hadoop, however, because Hadoop data is write-once and few updates are expected, so the overhead of keeping the duplicated data sets in sync is reduced, and the cost of storage in Hadoop clusters is relatively low as well (Grover et al., 2015).  The duplicated, differently organized data sets provide better query processing speed in such cases (Grover et al., 2015).

Denormalization is another technique that trades disk space for query performance by minimizing the need to join entire data sets (Grover et al., 2015).  In the relational database model, data is stored in third normal form (3NF), where redundancy is minimized and data integrity is enforced by splitting the data into smaller tables, each holding a particular entity; in that model, most queries require joining a large number of tables together to produce the final result (Grover et al., 2015).  In Hadoop, however, joins are often the slowest operations and consume the most resources from the cluster.  Specifically, a reduce-side join requires sending the entire table over the network, which is computationally costly.  While sorting and bucketing help minimize this cost, another solution is to create data sets that are pre-joined or pre-aggregated (Grover et al., 2015).  The data can be joined once and stored in that form instead of running the join operation every time the data is queried.  A Hadoop schema often consolidates many of the small dimension tables into a few larger dimension tables by joining them during the ETL process (Grover et al., 2015).  Other techniques to speed up processing include aggregation and data type conversion.  The duplication of data is of less concern in Hadoop; thus, when the same processing is required by a large number of queries, it is recommended to perform it once and reuse the result, as with a materialized view in a relational database.  In Hadoop, a new data set is created that contains the same data in its aggregated form (Grover et al., 2015).

To summarize, the partitioning process reduces the I/O overhead of processing by selectively reading and writing data in particular partitions, bucketing can speed up queries that involve joins or sampling by reducing I/O as well, and denormalization can be implemented to speed up Hadoop jobs.  This section reviewed advanced techniques for organizing data into files, including the preference for a small number of large files over a large number of small files, and compared the reduce-side join with the map-side join; because the reduce-side join is computationally costly, the map-side join technique is preferred and recommended.

HBase Schema Design Consideration

HBase is not a relational database (Grover et al., 2015; Yang, Liu, Hsu, Lu, & Chu, 2013).  It is closer to a large hash table: it associates values with keys and performs fast lookups of a value given its key (Grover et al., 2015). The core operations are put, get, scan, increment, and delete.  HBase provides scalability and flexibility and is useful in many applications; fraud detection is one of its most widespread uses (Grover et al., 2015).
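The hash-table analogy can be illustrated with the standard HBase Java client (the HBase 2.x API is assumed; the table, column family, and qualifier names below are hypothetical).  The sketch exercises the put, get, increment, scan, and delete operations mentioned above.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseHashTableOps {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("transactions"))) { // hypothetical table
            byte[] cf = Bytes.toBytes("d");
            byte[] row = Bytes.toBytes("cust42|order17");

            // put: associate a value with a key
            table.put(new Put(row).addColumn(cf, Bytes.toBytes("amount"), Bytes.toBytes("99.95")));

            // get: fast lookup of the value for a given key
            Result r = table.get(new Get(row));
            System.out.println(Bytes.toString(r.getValue(cf, Bytes.toBytes("amount"))));

            // increment: atomic counter update on a column
            table.incrementColumnValue(row, cf, Bytes.toBytes("visits"), 1L);

            // scan: read a sorted range of row keys
            try (ResultScanner scanner = table.getScanner(new Scan()
                    .withStartRow(Bytes.toBytes("cust42")).withStopRow(Bytes.toBytes("cust43")))) {
                for (Result res : scanner) {
                    System.out.println(Bytes.toString(res.getRow()));
                }
            }

            // delete: remove the row
            table.delete(new Delete(row));
        }
    }
}
```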

The framework of HBase involves the Master Server, Region Servers, the Write-Ahead Log (WAL), the Memstore, HFiles, the API, and Hadoop HDFS (Bhojwani & Shah, 2016).  Each component of the HBase framework plays a significant role in data storage and processing.  Figure 10 illustrates the HBase framework.


Figure 10.  HBase Architecture (Bhojwani & Shah, 2016).

The following considerations must be taken into account when designing the schema for HBase (Grover et al., 2015):

  • Row Key Consideration.
  • Timestamp Consideration.
  • Hops Consideration.
  • Tables and Regions Consideration.
  • Columns Use Consideration.
  • Column Families Use Consideration.
  • Time-To-Live Consideration.

The row key is one of the most critical factors in a well-architected HBase schema design (Grover et al., 2015).  Row key considerations involve record retrieval, distribution, block cache, ability to scan, size, readability, and uniqueness.  The row key is critical for retrieving records from HBase. In a relational database, a composite key can combine multiple columns into one primary key; in HBase, multiple pieces of information can likewise be combined into a single key.  For instance, a combination of customer_id, order_id, and timestamp can serve as the row key for a row describing an order. In a relational database these would be three different columns, but in HBase they are combined into a single unique identifier.  Another consideration in selecting the row key is the get operation, because a get of a single record is the fastest operation in HBase.  Designing the key so that a single get retrieves the most common uses of the data improves performance, which requires putting much of the information into a single record; this is called a denormalized design.  For instance, whereas a relational database spreads customer information across various tables, in HBase all of the customer information can be stored in a single record and retrieved with one get operation.

Distribution is another consideration for HBase schema design.  The row key determines how the rows of a given table are scattered across the regions of the HBase cluster (Grover et al., 2015; Yang et al., 2013).   Row keys are sorted, and each region stores a range of these sorted row keys (Grover et al., 2015).  Each region is pinned to a region server, namely a node in the cluster (Grover et al., 2015).  A combination of device ID and timestamp or reverse timestamp is commonly used to “salt” the key for machine data (Grover et al., 2015).  The block cache is a least recently used (LRU) cache that holds data blocks in memory (Grover et al., 2015).  HBase reads records from disk in chunks of 64 KB by default; each of these chunks is called an HBase block (Grover et al., 2015).  When an HBase block is read from disk, it is placed in the block cache (Grover et al., 2015).   The choice of the row key also affects the scan operation.  HBase scan rates are about eight times slower than HDFS scan rates, so reducing I/O requirements yields a significant performance advantage.  The size of the row key affects workload performance: a short row key is better than a long one because it has lower storage overhead and faster read/write performance.  Readability of the row key also matters, so it is advisable to start with a human-readable row key.  Finally, uniqueness of the row key is critical, since a row key is equivalent to a key in the hash table analogy.  If the row key is based on a non-unique attribute, the application should handle collisions and only put data into HBase with a unique row key (Grover et al., 2015).
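A minimal sketch of these row key ideas follows; the field layout, delimiter, and 16-way salt are illustrative choices, not prescriptions from Grover et al. (2015).  The first method builds a composite order key from customer_id, order_id, and a reverse timestamp, and the second builds a salted key for machine data.

```java
import org.apache.hadoop.hbase.util.Bytes;

public class RowKeys {
    // Composite key: customer_id + order_id + reverse timestamp, as discussed above.
    static byte[] orderKey(String customerId, String orderId, long timestampMillis) {
        long reverseTs = Long.MAX_VALUE - timestampMillis;   // newest rows sort first
        return Bytes.add(Bytes.toBytes(customerId + "|" + orderId + "|"), Bytes.toBytes(reverseTs));
    }

    // Salted key for machine data: a small hash prefix spreads sequential device/time keys
    // across regions and avoids hot-spotting a single region server.
    static byte[] saltedKey(String deviceId, long timestampMillis) {
        int salt = Math.abs(deviceId.hashCode()) % 16;        // 16 salt buckets is an arbitrary choice
        return Bytes.toBytes(String.format("%02d|%s|%d", salt, deviceId, timestampMillis));
    }

    public static void main(String[] args) {
        System.out.println(Bytes.toStringBinary(orderKey("cust42", "order17", System.currentTimeMillis())));
        System.out.println(Bytes.toStringBinary(saltedKey("sensor-007", System.currentTimeMillis())));
    }
}
```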

The timestamp is the second essential consideration for good HBase schema design (Grover et al., 2015).  The timestamp determines which record is newer when a put operation modifies an existing record.  It also determines the order in which records are returned when multiple versions of a single record are requested. The timestamp is also used to remove out-of-date records: the time-to-live (TTL) setting is compared against the timestamp to decide whether a record value has expired, has been overwritten by a later put, or has been deleted (Grover et al., 2015).

The term hop refers to the number of synchronized get requests needed to retrieve specific data from HBase (Grover et al., 2015). The fewer hops, the better, because of the overhead involved.  Although multi-hop requests can be made with HBase, it is best to avoid them through better schema design, for example by leveraging denormalization, because every hop is a round trip to HBase that carries a significant performance overhead (Grover et al., 2015).

The number of tables and the number of regions per table in HBase can have a negative impact on performance and on the distribution of the data (Grover et al., 2015).  If they are not chosen correctly, the result can be an imbalance in the distribution of the load.  Important considerations include the following: there is one region server per node, a region server hosts many regions, a given region is pinned to a particular region server, and tables are split into regions and scattered across region servers.  A table must have at least one region.  All regions in a region server receive put requests and share the region server’s memstore, a cache structure present on every HBase region server that collects the writes sent to that region server, sorts them, and flushes them when certain memory thresholds are reached. Thus, the more regions a region server hosts, the less memstore space is available per region.  The default configuration sets the ideal flush size to 100 MB, so dividing the available memstore memory by 100 MB gives the approximate maximum number of regions that can be hosted on that region server.   Very large regions take a long time to compact; the practical upper limit on the size of a region is around 20 GB, although there are successful HBase clusters with regions upward of 120 GB.  Regions can be assigned to an HBase table using one of two techniques. The first is to create the table with a single default region, which auto-splits as the data grows.  The second is to create the table with a given number of regions and set the region size to a high enough value, e.g., 100 GB per region, to avoid auto-splitting (Grover et al., 2015).  Figure 11 shows the topology of region servers, regions, and tables.


Figure 11.  The Topology of Region Servers, Regions, and Tables (Grover et al., 2015).
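The sizing rule and the second (pre-splitting) technique described above can be sketched as follows.  The heap size, the 40% memstore fraction, the table name, and the split points are illustrative assumptions, and the HBase 2.x Admin API is assumed.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTable {
    public static void main(String[] args) throws Exception {
        // Rough sizing rule from the text: regions per server ~= memstore memory / flush size.
        double memstoreHeapGb = 16 * 0.4;   // assumes a 16 GB heap with 40% reserved for memstores
        double flushSizeGb = 0.1;           // 100 MB default flush size
        System.out.println("Approx. max regions per server: " + Math.round(memstoreHeapGb / flushSizeGb));

        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            TableDescriptor desc = TableDescriptorBuilder
                    .newBuilder(TableName.valueOf("orders"))              // hypothetical table name
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("d"))
                    .build();
            // Second technique from the text: create the table with a fixed set of regions
            // by supplying explicit split points instead of relying on auto-splitting.
            byte[][] splits = { Bytes.toBytes("04"), Bytes.toBytes("08"), Bytes.toBytes("12") };
            admin.createTable(desc, splits);
        }
    }
}
```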

Columns are used differently in HBase than in a traditional relational database (Grover et al., 2015; Yang et al., 2013).  In HBase, one record can have a million columns and the next record can have a million completely different columns; this is not recommended, but it is possible (Grover et al., 2015).   HBase stores data in a format called HFile, where each column value gets its own entry in the HFile (Grover et al., 2015; Yang et al., 2013). Each entry contains fields such as the row key, timestamp, column name, and value. This file format enables functionality such as versioning and sparse column storage (Grover et al., 2015).

HBase includes the concept of column families (Grover et al., 2015; Yang et al., 2013).  A column family is a container for columns, and a table can have one or more column families.  Each column family has its own set of HFiles and is compacted independently of the other column families in the same table.  In many cases, no more than one column family is needed per table.  Using more than one column family is justified when the operations performed on a subset of the columns, or the rate of change of that subset, differ from those of the other columns (Grover et al., 2015; Yang et al., 2013).  The last consideration for HBase schema design is the use of TTL, a built-in feature of HBase that ages out data based on its timestamp (Grover et al., 2015).  If TTL is not used but an aging requirement exists, a much more I/O-intensive delete operation would be needed.   The hierarchy of objects in HBase begins with the table object, followed by the regions of the table, a store per column family for each region, the memstore, store files, and blocks (Yang et al., 2013).  Figure 12 shows the hierarchy of objects in HBase.

Figure 12.  The Hierarchy of Objects in HBase (Yang et al., 2013).
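A short sketch of enabling TTL on a single column family is shown below; the table name, family name, and 30-day TTL are illustrative, and the HBase 2.x descriptor builders are assumed.

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.TableDescriptor;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class TtlExample {
    public static TableDescriptor eventsTable() {
        // Single column family with a 30-day TTL: HBase ages out cells whose timestamp is older
        // than the TTL during compactions, avoiding a separate, I/O-intensive delete job.
        return TableDescriptorBuilder.newBuilder(TableName.valueOf("events"))   // hypothetical name
                .setColumnFamily(ColumnFamilyDescriptorBuilder
                        .newBuilder(Bytes.toBytes("d"))
                        .setTimeToLive(30 * 24 * 60 * 60)   // TTL is specified in seconds
                        .build())
                .build();
    }

    public static void main(String[] args) {
        System.out.println(eventsTable());
    }
}
```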

To summarize this section, HBase schema design requires seven key considerations, starting with the row key, which should be selected carefully with record retrieval, distribution, block cache, ability to scan, size, readability, and uniqueness in mind.  The timestamp and the number of hops are further schema design considerations.  Tables and regions must be considered for put performance and compaction time.  The use of columns and column families should also be considered when designing the schema, and the use of TTL to remove aged data is the final consideration for HBase schema design.

Metadata Consideration

The above discussion has been about the data and the techniques for storing it in Hadoop.  Metadata is as essential as the data itself; metadata is data about the data (Grover et al., 2015).  The Hadoop ecosystem contains various forms of metadata.   Metadata about a logical data set, usually stored in a separate metadata repository, includes information such as the location of the data set (e.g., a directory in HDFS or an HBase table name), the schema associated with the data set, its partitioning and sorting properties, and its format, e.g., CSV or SequenceFile (Grover et al., 2015). Metadata about files on HDFS includes the permissions and ownership of those files and the locations of their blocks on the data nodes, usually stored and managed by the Hadoop NameNode (Grover et al., 2015).  Metadata about tables in HBase includes information such as table names, associated namespaces, associated attributes (e.g., MAX_FILESIZE, READONLY), and the names of column families, usually stored and managed by HBase itself (Grover et al., 2015).  Metadata about data ingest and transformation includes information such as which user generated a given data set, where the data set came from, how long it took to generate, and how many records it contains or the size of the data load (Grover et al., 2015).  Metadata about data set statistics includes information such as the number of rows in a data set, the number of unique values in each column, a histogram of the distribution of the data, and maximum and minimum values (Grover et al., 2015).  Figure 13 summarizes these various kinds of metadata.


Figure 13.  Various Metadata in Hadoop.

Apache Hive was the first project in the Hadoop ecosystem to store, manage, and leverage metadata (Antony et al., 2016; Grover et al., 2015).  Hive stores this metadata in a relational database called the Hive metastore (Antony et al., 2016; Grover et al., 2015).  Hive also provides a metastore service that interfaces with the metastore database (Antony et al., 2016; Grover et al., 2015).  When a query is submitted, Hive contacts the metastore to obtain the metadata needed for the query, the metastore returns this metadata, Hive generates an execution plan, the job is executed on the Hadoop cluster, and Hive fetches the results and returns them to the user (Antony et al., 2016; Grover et al., 2015).  Figure 14 shows the query process and the role of the metastore in the Hive framework.


Figure 14.  Query Process and the Role of Metastore in Hive (Antony et al., 2016).

Other projects wanted to use the metadata concept introduced by Hive, which led to a separate project called HCatalog that exposes the Hive metastore outside of Hive (Grover et al., 2015).  HCatalog is part of Hive and allows other tools, such as Pig and MapReduce, to integrate with the Hive metastore.  It also opens access to the metastore to other tools through a REST API served by the WebHCat server.  MapReduce, Pig, and standalone applications can talk directly to the Hive metastore through its APIs, but HCatalog provides easier access through the WebHCat REST API and allows cluster administrators to lock down direct access to the metastore to address security concerns. Other ways to store metadata include embedding it in file paths and names, or storing it in HDFS in a hidden file, e.g., .metadata.  Figure 15 shows HCatalog acting as an accessibility veneer around the Hive metastore (Grover et al., 2015).


Figure 15.  HCatalog acts as an accessibility veneer around the Hive metastore (Grover et al., 2015).
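For illustration, the sketch below retrieves table metadata over HTTP through WebHCat.  The host, the port 50111, the /templeton/v1 path, the table name, and the user.name parameter are assumptions based on common WebHCat defaults and should be verified against the cluster configuration.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class WebHCatTableLookup {
    public static void main(String[] args) throws Exception {
        // Assumed WebHCat defaults: port 50111 and the /templeton/v1 base path.
        URL url = new URL("http://webhcat-host:50111/templeton/v1/ddl/database/default/table/orders"
                + "?user.name=etl");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);   // JSON description of the table from the Hive metastore
            }
        }
    }
}
```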

Hive Metastore and HCatalog Limitations

There are some limitations of the Hive metastore and HCatalog, including problems with high availability (Grover et al., 2015).  For the metastore database, standard HA database cluster solutions can be used to provide high availability.  For the Hive metastore service, multiple metastores can run concurrently on more than one node in the cluster; however, concurrency issues related to data definition language (DDL) operations can occur, and the Hive community is working on fixing these issues.

The fixed schema for metadata is another limitation.  Hadoop provides much flexibility in the types of data that can be stored, largely because of the schema-on-read concept, whereas the Hive metastore imposes a fixed schema for the metadata itself and offers only a tabular abstraction for data sets.   In addition, the metastore is another moving part of the infrastructure that must be kept running and secured as part of the Hadoop deployment (Grover et al., 2015).

Conclusion

This project has discussed essential topics related to Hadoop technology.  It began with an overview of Hadoop, including its history and the difference between Hadoop 1.x and Hadoop 2.x.  The discussion covered the Big Data Analytics process using Hadoop technology, which involves six significant steps starting with problem identification, identifying the required data, and the data collection process. Data pre-processing and the ETL process must be implemented before performing the analytics, and the last step is visualizing the data for decision making.  Before any data is collected and processed for storage, considerations must be made for data preprocessing, modeling, and schema design in Hadoop to enable better processing and better data retrieval, given that some tools cannot split the data while others can.  These considerations begin with the data storage format, followed by Hadoop file type considerations and the challenges of XML and JSON formats in Hadoop.  Compression must also be considered when designing the schema for Hadoop. Since HDFS and HBase are commonly used for data storage in Hadoop, the discussion covered the schema design considerations for both.  To summarize, the design of the schema for Hadoop, HDFS, and HBase makes a difference in how data is stored across nodes and whether the right tools can split the data.  Thus, organizations must pay attention to the process and the design requirements before storing data in Hadoop to achieve better computational processing.

References

Alguliyev, R., & Imamverdiyev, Y. (2014). Big data: big promises for information security. Paper presented at the Application of Information and Communication Technologies (AICT), 2014 IEEE 8th International Conference on.

Ankam, V. (2016). Big Data Analytics: Packt Publishing Ltd.

Antony, B., Boudnik, K., Adams, C., Lee, C., Shao, B., & Sasaki, K. (2016). Professional Hadoop: John Wiley & Sons.

Armstrong, D. (n.d.). R: Learning by Example: Lattice Graphics. Retrieved from https://quantoid.net/files/rbe/lattice.pdf.

Bhojwani, N., & Shah, A. P. V. (2016). A survey on Hadoop HBase system. Development, 3(1).

Dittrich, J., & Quiané-Ruiz, J.-A. (2012). Efficient big data processing in Hadoop MapReduce. Proceedings of the VLDB Endowment, 5(12), 2014-2015.

Grover, M., Malaska, T., Seidman, J., & Shapira, G. (2015). Hadoop Application Architectures: Designing Real-World Big Data Applications: O’Reilly Media, Inc.

Hu, H., Wen, Y., Chua, T.-S., & Li, X. (2014). Toward scalable systems for big data analytics: A technology tutorial. IEEE Access, 2, 652-687.

Karanth, S. (2014). Mastering Hadoop: Packt Publishing Ltd.

Lublinsky, B., Smith, K. T., & Yakubovich, A. (2013). Professional hadoop solutions: John Wiley & Sons.

sas.com. (2018). Hadoop – what it is and why it matters. Retrieved from https://www.sas.com/en_us/insights/big-data/hadoop.html.

Yang, C. T., Liu, J. C., Hsu, W. H., Lu, H. W., & Chu, W. C. C. (2013, 16-18 Dec. 2013). Implementation of Data Transform Method into NoSQL Database for Healthcare Data. Paper presented at the 2013 International Conference on Parallel and Distributed Computing, Applications and Technologies.

 

 

The Impact of XML on MapReduce

Dr. Aly, O.
Computer Science

Introduction

The purpose of this discussion is to discuss and analyze the impact of XML on MapReduce. The discussion addresses the various techniques and approaches proposed by research studies for processing large XML documents using MapReduce.  The XML fragmentation process in the absence and presence of MapReduce is also discussed to provide a better understanding of the complex processing of large XML documents in a distributed, scalable MapReduce environment.

XML Query Processing Using MapReduce

The XML format has been used to store data for multiple applications (Aravind & Agrawal, 2014).  XML data needs to be ingested into Hadoop and analyzed to obtain value from it (Aravind & Agrawal, 2014), so the Hadoop ecosystem must be able to understand and interpret XML when it is ingested (Aravind & Agrawal, 2014).  MapReduce is a building block of the Hadoop ecosystem.  In the age of Big Data, XML documents are expected to be very large, and their processing is expected to be scalable and distributed.  Processing XML queries with MapReduce requires decomposing a big XML document and distributing its portions to different nodes.  The relational approach is not appropriate because it is expensive: transforming a big XML document into relational database tables can be extremely time consuming, and query evaluation then requires costly θ-joins among the relational tables (Wu, 2014).  Various research studies have proposed approaches to implement native XML query processing algorithms using MapReduce.

(Dede, Fadika, Gupta, & Govindaraju, 2011) have discussed and analyzed the scalable and distributed processing of scientific XML data and how the MapReduce model should be used in XML metadata indexing.   The study presented performance results using two MapReduce implementations, the Apache Hadoop framework and the proposed LEMO-MR framework, and provided an indexing framework capable of indexing and efficiently searching large-scale scientific XML data sets.  The framework has been tailored for integration with any framework that uses the MapReduce model, in order to meet scalability and variety requirements.

(Fegaras, Li, Gupta, & Philip, 2011) have also discussed and analyzed query optimization in a MapReduce environment. The study presented a novel query language for large-scale analysis of XML data in a MapReduce environment, called MRQL (MapReduce Query Language), designed to capture the most common data analysis tasks in a form that can be optimized.   XML data fragmentation is also discussed in this study.  A parallel data computation expects the input data to be fragmented into small, manageable pieces that determine the granularity of the computation.  In a MapReduce environment, each map worker is assigned a data split that consists of data fragments, and the map worker processes these data one fragment at a time.  For structured relational data, a fragment is a relational tuple, while for a text file a fragment can be a single line.  However, for hierarchical data and nested collections such as XML, the fragment size and structure depend on the application that processes the data.  For instance, XML data may consist of a few XML documents, each containing a single XML element whose size may exceed the memory available to a map worker.  Thus, when processing XML data, it is recommended to allow custom fragmentation to meet a wide range of application requirements.  (Fegaras et al., 2011) have noted that Hadoop provides a simple input format for XML fragmentation based on a single tag name, even though data splits may start and end at arbitrary points in the document, even in the middle of tag names.  (Fegaras et al., 2011; Sakr & Gaber, 2014) have indicated that this input format reads the document as a stream of string fragments, so that each string contains a single complete element with the requested tag name.   An XML parser can then be used to parse these strings and convert them to objects.  The fragmentation process is complex because the requested elements may cross data-split boundaries, and these splits may reside on different data nodes in the distributed file system (DFS).  The Hadoop DFS provides the implicit solution to this problem by allowing a reader to scan beyond its data split into the next one, subject to some overhead for transferring data between nodes.  (Fegaras et al., 2011) have proposed an XML fragmentation technique built on top of the existing Hadoop XML input format that provides a higher level of abstraction and better customization: instead of deriving a string for each XML element, it constructs XML data in the MRQL data model, ready to be processed by MRQL queries (Fegaras et al., 2011).
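A minimal sketch of the fragment-per-record idea follows.  It assumes an XML-aware input format (configured with a start/end tag, as described above) has already delivered one complete element per record as a Text value; the element and attribute names are hypothetical, and this is not the MRQL implementation itself.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Each call to map() receives one complete XML fragment, e.g. a <publication> element.
public class XmlFragmentMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String year;
        try {
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(value.toString())));
            // Group by a hypothetical "year" attribute of the fragment's root element.
            year = doc.getDocumentElement().getAttribute("year");
        } catch (Exception e) {
            // Fragments that were split incorrectly across boundaries are counted and skipped.
            context.getCounter("xml", "parse_errors").increment(1);
            return;
        }
        context.write(new Text(year), ONE);
    }
}
```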

(Sakr & Gaber, 2014) have also briefly discussed another language proposed to support distributed XML processing using the MapReduce framework, called ChuQL.  ChuQL is a MapReduce-based extension of the syntax, grammar, and semantics of XQuery, the standard W3C language for querying XML documents.  The ChuQL implementation takes care of distributing the computation to multiple XQuery engines running on Hadoop nodes, as described by one or more ChuQL MapReduce expressions.  In the representation of the “word count” example program in ChuQL, the MapReduce expression is used to describe a MapReduce job.  The input and output clauses read from and write to HDFS, respectively; the rr and rw clauses describe the record reader and writer, respectively; and the map and reduce clauses represent the standard map and reduce phases of the framework, where they process XML values or key/value pairs of XML values, specified using XQuery expressions, to match the MapReduce model. Figure 1 shows the word count example in ChuQL using XML in a distributed environment.

Figure 1.  The Word Count Example Program in ChuQL Using XML in a Distributed Environment (Sakr & Gaber, 2014).

(Vasilenko & Kurapati, 2014) have discussed and analyzed the efficient processing of XML documents in the Hadoop MapReduce environment.  They argued that the most common approach to processing XML data is to introduce a custom solution based on user-defined functions or scripts, with common choices ranging from an ETL process that extracts the data of interest to transforming the XML into other formats natively supported by Hive.   They have addressed a generic approach to handling XML based on the Apache Hive architecture, describing an approach that complements the existing family of Hive serializers and deserializers for other popular data formats, such as JSON, and makes it much easier for users to deal with large XML data sets.  In the implementation, the input files are divided into logical splits, each of which is assigned to an individual mapper. The mapper relies on the implemented Apache Hive XML SerDe to break the split into XML fragments using specified start/end byte sequences, and each fragment corresponds to a single Hive record.  The fragments are handled by the XML processor to extract the values of the record’s columns using specified XPath queries.  The reduce phase was not required in this implementation (Vasilenko & Kurapati, 2014).
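The per-column XPath extraction can be illustrated with the standard Java XPath API.  This is illustrative only and is not the actual Hive XML SerDe; the fragment, element names, and XPath expressions are hypothetical.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class XPathColumnExtractor {
    public static void main(String[] args) throws Exception {
        // One XML fragment, i.e., one Hive record in the approach described above.
        String fragment = "<order id=\"17\"><customer><name>Ann</name></customer><total>99.95</total></order>";
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Each column of the record is populated by evaluating its configured XPath expression.
        String name  = xpath.evaluate("/order/customer/name/text()", new InputSource(new StringReader(fragment)));
        String total = xpath.evaluate("/order/total/text()", new InputSource(new StringReader(fragment)));
        System.out.println(name + " | " + total);   // the record's column values
    }
}
```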

(Wu, 2014) has discussed and analyzed the partitioning of XML documents and the distribution of XML fragments to different compute nodes, which can introduce high overhead from transferring XML fragments between nodes during MapReduce execution.  The study proposed a technique that uses MapReduce to distribute the labels in inverted lists across a computing cluster so that structural joins can be performed in parallel to process queries, along with an optimization technique that reduces the computing space of the proposed framework to improve query processing performance.  The author argued that this approach differs from the current practice of shredding an XML document and distributing it across the nodes of a cluster.  The process reads and distributes only the inverted lists required for the input queries, whose size is much smaller than that of the whole document, and partitions the total computing space for structural joins so that each sub-space can be handled by one reducer.  A pruning-based optimization algorithm was also proposed to further improve the performance of the approach.

Conclusion

This discussion has addressed XML query processing in a MapReduce environment.  It has covered the various techniques and approaches proposed by research studies for processing large XML documents using MapReduce.  The XML fragmentation process in the absence and presence of MapReduce has also been discussed to provide a better understanding of the complex processing of large XML documents in a distributed, scalable MapReduce environment.

References

Aravind, P. S., & Agrawal, V. (2014). Processing XML data in BigInsights 3.0. Retrieved from https://developer.ibm.com/hadoop/2014/10/31/processing-xml-data-biginsights-3-0/.

Dede, E., Fadika, Z., Gupta, C., & Govindaraju, M. (2011). Scalable and distributed processing of scientific XML data. Paper presented at the Grid Computing (GRID), 2011 12th IEEE/ACM International Conference on.

Fegaras, L., Li, C., Gupta, U., & Philip, J. (2011). XML Query Optimization in Map-Reduce.

Sakr, S., & Gaber, M. (2014). Large Scale and big data: Processing and Management: CRC Press.

Vasilenko, D., & Kurapati, M. (2014). Efficient processing of xml documents in hadoop map reduce.

Wu, H. (2014). Parallelizing structural joins to process queries over big XML data using MapReduce. Paper presented at the International Conference on Database and Expert Systems Applications.

XML Design Document

Dr. O. Aly
Computer Science

Introduction

The purpose of this discussion is to discuss and analyze the design of XML documents.  The discussion also examines XML document design from the users’ perspective for improved performance.  It begins with the XML design principles and a detailed analysis of each principle.  XML document design is then examined from the performance perspective, focusing on the appropriate use of elements and attributes when designing an XML document.

XML Design Principles

XML document design follows guidelines and principles that developers should observe.  These guidelines are divided into four major principles for the use of elements and attributes: the core content principle, the structured information principle, the readability principle, and the element/attribute binding principle.  Figure 1 summarizes these principles of XML document design.


Figure 1.  XML Design Document Four Principles for Elements and Attributes Use.

Core Content Principle

The core content principle concerns the choice between an element and an attribute.  If the information is part of the essential material for a human-readable document, the use of an element is recommended.  If the information is intended for machine-oriented record formats, or exists to help applications process the primary communication, the use of an attribute is recommended.  One example of this principle is a title that is mistakenly placed in an attribute when it should appear in element content.  Another example is internal product identifiers thrown as elements into detailed product records; in some cases an attribute is more appropriate than an element, because an internal product code with an extended format would not be of primary interest to most readers or processors of the document.  By analogy with data and metadata, data should be placed in elements and metadata in attributes (Ogbuji, 2004).

Since elements and attributes are the two main building blocks of an XML document, developers should be aware of what constitutes legal and illegal element names.  (Fawcett, Ayers, & Quin, 2012) have identified legal and illegal elements.  For instance, spaces are allowed after a name, but names cannot contain spaces.  Digits can appear within a name, but names cannot begin with a digit.  Spaces can appear between the name and the forward slash in a self-closing element, but initial spaces are not allowed.  A hyphen is allowed within a name, but not as the first character.  Non-Roman characters are allowed if they are classified as letters by the Unicode specification (for example, an element name such as forename written in Greek letters), and start and end tags must match case-sensitively (Fawcett et al., 2012).  Table 1 shows the legal and illegal elements to consider when designing an XML document.


Table 1.  Legal vs. Illegal Elements Consideration for XML Design Document (Fawcett et al., 2012).

For attributes, (Fawcett et al., 2012) have likewise identified legal and illegal forms.  A single quote inside double-quote delimiters is allowed, and double quotes inside single-quote delimiters are allowed, while a single quote inside single-quote delimiters is not.  Attribute names cannot begin with a digit, two attributes with the same name are not allowed, and mismatched delimiters are not allowed.   Table 2 shows the legal and illegal attributes to consider when designing an XML document.


Table 2.  Legal vs. Illegal Attributes Consideration for XML Design Document (Fawcett et al., 2012).

Structured Information Principle

Since the element is an extensible engine for expressing structure in XML, the use of an element is recommended if the information is expressed in a structured form, especially if the structure is extensible.  The use of an attribute is recommended if the information is expressed as an atomic token, since attributes are designed to express simple properties of the information represented in an element (Ogbuji, 2004).  A date is a good example: it has a fixed structure and acts as a single token, so it can be used as an attribute.  Personal names, by contrast, are recommended to be element content rather than attributes, since personal names have variable structure and are rarely an atomic token.  The following code example makes the name an element: Figure 2 shows the name as an element, while Figure 3 shows the name as an attribute.
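A minimal illustrative sketch of the two designs is given below (the element and field names are hypothetical, and Java text blocks, Java 15+, are used to hold the markup).  The element form preserves the internal structure of a personal name, while the attribute form reduces it to a single opaque token.

```java
public class NameDesignVariants {
    // Element form: the name keeps its internal, extensible structure.
    static final String NAME_AS_ELEMENT = """
            <patient>
              <name>
                <given>Ada</given>
                <family>Lovelace</family>
              </name>
            </patient>
            """;

    // Attribute form: the name is flattened into a single token, losing its structure.
    static final String NAME_AS_ATTRIBUTE = """
            <patient name="Ada Lovelace"/>
            """;

    public static void main(String[] args) {
        System.out.println(NAME_AS_ELEMENT);
        System.out.println(NAME_AS_ATTRIBUTE);
    }
}
```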

Readability Principle

If the information is intended for human readability, the use of an element is recommended; if the information is intended for machine readability, the use of an attribute is recommended (Ogbuji, 2004).  A URL is an example, as it cannot be used without a computer to retrieve the referenced resource (Ogbuji, 2004).

Element/Attribute Binding

The use of an element is recommended if its value needs to be modified by another attribute (Ogbuji, 2004).  The attribute should provide properties of, or modifications to, the element (Ogbuji, 2004).

XML Design Document Examination

One of the best practices identified by IBM for DB2 is to use attributes and elements appropriately in XML (IBM, 2018).  Although identified for DB2, this practice applies to the design of any application using XML documents, because elements and attributes are the building blocks of XML, as discussed above.  The example examined here involves a menu and the use of elements and attributes.

If a restaurant menu is developed as an XML document and the portion sizes of items are included, the core content principle is applied under the assumption that the portion size is not of primary importance to the reader of the menu.  Following the structured information principle, the portion measurement and unit are not placed into a single attribute.  Figure 4 shows the code using the core content principle, while Figure 5 shows the code using the structured information principle.

However, following the structured information principle as in Figure 5 allows portion-unit to modify portion-size, which is not recommended; an attribute should modify an element, which in this example is the menu-item element.   Thus, the solution is to modify the code so that the element is modified by the portion-unit attribute.  The resulting code displays the portion size to the reader, as shown in Figure 6.
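The menu example can be sketched as follows (the item names and values are hypothetical, and Java text blocks hold the markup).  The first variant splits the size and unit into separate elements; the second, following the element/attribute binding principle, makes the unit an attribute that modifies the element carrying the size.

```java
public class MenuDesignVariants {
    // Structured-information variant: size and unit are separate pieces of information.
    static final String SEPARATE_ELEMENTS = """
            <menu-item>
              <name>House Salad</name>
              <portion-size>8</portion-size>
              <portion-unit>oz</portion-unit>
            </menu-item>
            """;

    // Revised variant: the unit is an attribute modifying the element that carries the size.
    static final String UNIT_AS_ATTRIBUTE = """
            <menu-item>
              <name>House Salad</name>
              <portion unit="oz">8</portion>
            </menu-item>
            """;

    public static void main(String[] args) {
        System.out.println(SEPARATE_ELEMENTS);
        System.out.println(UNIT_AS_ATTRIBUTE);
    }
}
```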

After the code is modified so that the element is modified by the portion-unit attribute, the core content and readability principles come into play.  This modification contradicts the original decision, based on the core content principle, that the portion size is not essential to the reader.  Therefore, XML developers should judge which principle to apply based on the requirements.

Another XML menu example is available at https://www.w3schools.com/xml/default.asp.  The code from this link, shown in Figure 7, illustrates attributes modifying elements, which is the recommended practice.

Conclusion

This assignment has focused on XML document design.  The discussion has covered the four major principles that should be considered when designing an XML document.  The four principles revolve around the two building blocks of XML: the element and the attribute.  The use of an element is recommended for human-readable documents, while the use of an attribute is recommended for machine-oriented records.  An element is also recommended for information expressed in a structured form, especially if the structure is extensible, while an attribute is recommended for information expressed as an atomic token.  If a value needs to be modified by an attribute, that value should be an element.  XML document design should also consider the legal and illegal forms of elements and attributes.  A few examples have been provided to demonstrate the use of elements versus attributes and how to improve the code for good performance as well as good practice.   The discussion was limited to the use of elements and attributes and performance considerations from that perspective; however, XML document design involves other performance considerations for databases, parsing, and data warehouses, as discussed in (IBM, 2018; Mahboubi & Darmont, 2009; Nicola & John, 2003; Su-Cheng, Chien-Sing, & Mustapha, 2010).

References

Fawcett, J., Ayers, D., & Quin, L. R. (2012). Beginning XML: John Wiley & Sons.

IBM. (2018). Best Practices for XML Performance in DB2. Retrieved from https://www.ibm.com/support/knowledgecenter/en/SSEPEK_11.0.0/perf/src/tpc/db2z_bestpractice4xmlperf.html.

Mahboubi, H., & Darmont, J. (2009). Enhancing XML data warehouse query performance by fragmentation. Paper presented at the Proceedings of the 2009 ACM symposium on Applied Computing.

Nicola, M., & John, J. (2003). XML parsing: a threat to database performance. Paper presented at the Proceedings of the twelfth international conference on Information and knowledge management.

Ogbuji, U. (2004). Principles of XML Design: When to Use Elements Versus Attributes. Retrieved from https://www.ibm.com/developerworks/library/x-eleatt/x-eleatt-pdf.pdf.

Su-Cheng, H., Chien-Sing, L., & Mustapha, N. (2010). Bridging XML and relational databases: Mapping choices and performance evaluation. IETE Technical Review, 27(4), 308-317.

Case Study: Big Data Analytics in Healthcare Using Outlier Detection.

Dr. O. Aly
Computer Science

Abstract

The purpose of this project is to discuss and examine a Big Data Analytics (BDA) technique through a case study.  The discussion begins with an overview of BDA applications in various sectors, followed by the implementation of BDA in the healthcare industry.  The records show that the healthcare industry suffers from fraud, waste, and abuse (FWA), and the emphasis of this discussion is on FWA in that industry.  The project provides a case study of BDA in healthcare using the outlier detection data mining tool, and the data mining phases of the use case are discussed and analyzed.  An improvement for the selected BDA technique of outlier detection is proposed.  The analysis shows that the outlier detection data mining technique for fraud detection is still under experimentation and has not yet been proven reliable; the recommendation is to use the clustering data mining technique as a more heuristic technique for fraud detection. Organizations should evaluate BDA tools and select the one that best meets the requirements of their business model.

Keywords: Big Data Analytics; Healthcare; Outlier Detection; Fraud Detection.

Introduction

Organizations must be able to quickly and effectively analyze large amounts of data and extract value from such data for sound business decisions.  The benefits of Big Data Analytics are driving organizations and businesses to implement Big Data Analytics techniques in order to compete in the market.  A survey conducted by CIO Insight has shown that 65% of executives and senior decision makers indicated that organizations risk becoming uncompetitive or irrelevant if Big Data is not embraced (McCafferly, 2015).  The same survey showed that 56% anticipated higher investment in big data, and 15% indicated that this increasing trend in budget allocation will be significant (McCafferly, 2015). Such budget allocations can be used for skilled professionals, BD storage, BDA tools, and so forth.  This project discusses and analyzes the application of Big Data Analytics. It begins with an overview of these broad applications, with emphasis on a single application for further investigation.  The healthcare sector is selected for closer investigation of BDA implementation and of methods to improve such implementation.

Overview of Big Data Analytics Applications

Numerous research studies have discussed and analyzed the application of Big Data in different domains. (Chen & Zhang, 2014) have discussed BDA in scientific research domains such as astronomy, meteorology, social computing, bioinformatics, and computational biology, which are based on data-intensive scientific discovery.  Other studies, such as (Rabl et al., 2012), have investigated the performance of six modern open-source data stores in the context of application performance monitoring as part of the initiative of (CA-Technologies, 2018). (Bi & Cochran, 2014) have discussed BDA in cloud manufacturing, indicating that the success of a manufacturing enterprise depends on the advancement of IT to support and enhance the value stream.  Manufacturing technologies have evolved throughout the years, and the advancement of a manufacturing system can be measured by its scale, complexity, and automation responsiveness (Bi & Cochran, 2014).  Figure 1 illustrates the evolution of manufacturing technologies from before the 1950s until the Big Data age.


Figure 1.  Manufacturing Technologies, Information Systems, ITs, and Their Evolution (Bi & Cochran, 2014).

The McKinsey Institute first reported four essential sectors that can benefit from BDA: the healthcare industry, government services, retailing, and manufacturing (Brown, Chui, & Manyika, 2011).  The report also predicted that BDA implementation could improve productivity by 0.5 to 1 percent annually and produce hundreds of billions of dollars in new value (Brown et al., 2011).  The McKinsey Institute has indicated that not all industries are created equal when it comes to capturing the benefits of BDA (Brown et al., 2011).

Another McKinsey Institute report has described the transformative potential of BD in five domains: health care (U.S.), public sector administration (European Union), retail (U.S.), manufacturing (global), and personal location data (global) (Manyika et al., 2011).  The same report predicted $300 billion in potential annual value to U.S. healthcare and a potential 60% increase in retailers’ operating margins with BDA (Manyika et al., 2011). Some sectors are poised for more significant gains and benefits from BD than others, although the implementation of BD will matter across all sectors (Manyika et al., 2011).  The sectors are divided into clusters A, B, C, D, and E.  Cluster A reflects information and computer and electronic products, while finance and insurance and government are categorized as cluster B.  Cluster C includes several sectors such as construction, educational services, and arts and entertainment.  Cluster D covers manufacturing and wholesale trade, while cluster E covers retail, healthcare providers, and accommodation and food. Figure 2 shows the sectors positioned for the most significant gains from the use of BD.


Figure 2.  Capturing Value from Big Data by Sector (Manyika et al., 2011).

The application of BDA in specific sectors has been discussed in various research studies, including health and medical research (Liang & Kelemen, 2016), biomedical research (Luo, Wu, Gopukumar, & Zhao, 2016), and machine learning techniques in the healthcare sector (MCA, 2017).  The next section discusses the implementation of BDA in the healthcare sector.

Big Data Analytics Implementation in Healthcare

Numerous research studies have discussed Big Data Analytics (BDA) in the healthcare industry from different perspectives.  Healthcare organizations have taken advantage of BDA in fraud and abuse prevention, detection, and reporting (cms.gov, 2017).  Medicare fraud and abuse are regarded as a severe problem that needs attention (cms.gov, 2017), and various Medicare fraud scenarios have been reported (cms.gov, 2017).  Submitting, or causing to be submitted, false claims or making misrepresentations of fact to obtain a federal healthcare payment is the first Medicare fraud scenario.  Soliciting, receiving, offering, or paying remuneration to induce or reward referrals for items or services reimbursed by federal healthcare programs is another.  The last fraud scenario is making prohibited referrals for certain designated health services (cms.gov, 2017). The abuse of Medicare includes billing for unnecessary medical services, charging excessively for services or supplies, and misusing codes on a claim, such as upcoding or unbundling codes (cms.gov, 2017; J. Liu et al., 2016).  In 2012, improper healthcare payments amounted to $120 billion (J. Liu et al., 2016), with Medicare and Medicaid contributing more than half of that total (J. Liu et al., 2016).  The annual loss to fraud, waste, and abuse in the healthcare domain is estimated at $750 billion (J. Liu et al., 2016).  In 2013, over 60% of improper payments were healthcare related. Figure 3 illustrates the improper payments in government expenditure.


Figure 3. Improper Payments Resulted from Fraud and Abuse (J. Liu et al., 2016).

Medicare fraud and abuse are governed by federal laws (cms.gov, 2017).  These federal laws include the False Claims Act (FCA), the Anti-Kickback Statute (AKS), the Physician Self-Referral Law (Stark Law), the Criminal Health Care Fraud Statute, the Social Security Act, and the United States Criminal Code.  Anti-fraud and anti-abuse partnerships among various government agencies, such as the Health Care Fraud Prevention Partnership (HFPP) and the Centers for Medicare and Medicaid Services (CMS), have been established to combat fraud and abuse. The main aims of these partnerships are to uphold the integrity of the Medicare program, save and recoup taxpayer funds, reduce the cost of healthcare to patients, and improve the quality of healthcare (cms.gov, 2017).

In 2010, Health and Human Services (HHS) and CMS initiated a national effort known as the Fraud Prevention System (FPS), a predictive analytics technology that runs predictive algorithms and other analytics nationwide on all Medicare FFS claims prior to any payment, in an effort to detect potentially suspicious claims and patterns that may constitute fraud and abuse (cms.gov, 2017).  In 2012, CMS developed the Program Integrity Command Center to combine Medicare and Medicaid experts such as clinicians, policy experts, officials, fraud investigators, and the law enforcement community, including the FBI, to develop and improve predictive analytics that identify fraud and to mobilize a rapid response (cms.gov, 2017). This effort aims to connect with the field offices to examine fraud allegations within a few hours through real-time investigation.  Before the application of BDA, finding substantiating evidence for a fraud allegation took days or weeks.

Research communities and the data analytics industry have exerted various efforts to develop fraud-detection systems (J. Liu et al., 2016).  Various research studies have used different data mining techniques for healthcare fraud and abuse detection.  (J. Liu et al., 2016) have used an unsupervised data mining approach and applied the clustering technique for healthcare fraud detection.  (Ekina, Leva, Ruggeri, & Soyer, 2013) have used an unsupervised approach and applied Bayesian co-clustering for healthcare fraud detection.  (Ngufor & Wojtusiak, 2013) have used a hybrid supervised and unsupervised approach, applying unsupervised data labeling together with outlier detection, classification, and regression techniques for medical claims prediction.  (Capelleveen, 2013; van Capelleveen, Poel, Mueller, Thornton, & van Hillegersberg, 2016) have used an unsupervised approach and applied the outlier detection technique for health insurance fraud detection within the Medicaid domain.

Case Study of BDA in Healthcare

The case study presented by (Capelleveen, 2013; van Capelleveen et al., 2016) has been selected for further investigation of the application of BDA in healthcare.  Outlier detection, one of the unsupervised data mining techniques, is regarded as an effective predictor for fraud detection and is recommended for supporting the initiation of audits (Capelleveen, 2013; van Capelleveen et al., 2016).  Outlier detection is the primary analytic tool used in this case study, and it can be based on linear model analysis, multivariate clustering analysis, peak analysis, or boxplot analysis (Capelleveen, 2013; van Capelleveen et al., 2016).  The outlier detection algorithm in this case study was applied to a Medicaid data set of 650,000 healthcare claims from 369 dentists in one state. RapidMiner can be used for outlier detection data mining, although the study (Capelleveen, 2013; van Capelleveen et al., 2016) did not specify which tool was used for the outlier detection of fraud and abuse in the Medicaid dental domain.

The process for this unsupervised outlier detection technique involves seven iterative phases.  The first phase involves the composition of metrics for the domain. These metrics are derived or calculated data, such as a feature, attribute, or measurement that characterizes the behavior of an entity over a certain period, and their purpose is to enable comparative behavioral analysis using data mining algorithms.  In the first iteration, the metrics are inferred from provider behavior, supported by known fraud causes, and developed in cooperation with fraud experts.  In subsequent iterations, the metric set is updated, the configuration is modified, and the confidence levels are adjusted to optimize hit rates.  The metrics composition phase is followed by cleaning and filtering of the data.  Selecting provider groups and computing the metrics is the third phase.  The fourth phase involves comparing providers by metric and flagging outliers.  Forming suspicion of provider fraud from the predictors is the fifth phase, followed by the reporting and presentation to fraud investigators.  The last phase involves metric evaluation.  The result of the outlier detection analysis showed that 12 of the top 17 flagged providers (71%) submitted suspicious claim patterns and should be referred to officials for further investigation.  The study concluded that the outlier detection tool can reveal new patterns of potential fraud that can be identified and possibly used in future automated detection techniques.
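To illustrate the flagging step in the fourth phase, the sketch below applies a simple boxplot (interquartile range) rule to one metric computed per provider.  The metric, the 1.5 × IQR threshold, and the sample values are illustrative and are not the configuration used in the cited case study.

```java
import java.util.Arrays;

public class MetricOutlierFlagging {
    // Boxplot-style flagging for one metric across providers: values outside
    // [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged as outliers.
    static boolean[] flag(double[] metricByProvider) {
        double[] sorted = metricByProvider.clone();
        Arrays.sort(sorted);
        double q1 = sorted[(int) (0.25 * (sorted.length - 1))];
        double q3 = sorted[(int) (0.75 * (sorted.length - 1))];
        double iqr = q3 - q1;
        double low = q1 - 1.5 * iqr;
        double high = q3 + 1.5 * iqr;
        boolean[] flags = new boolean[metricByProvider.length];
        for (int i = 0; i < metricByProvider.length; i++) {
            flags[i] = metricByProvider[i] < low || metricByProvider[i] > high;
        }
        return flags;
    }

    public static void main(String[] args) {
        // Hypothetical claims-per-patient metric for a small group of providers.
        double[] claimsPerPatient = {1.1, 1.3, 1.2, 1.4, 1.2, 4.8};
        System.out.println(Arrays.toString(flag(claimsPerPatient)));  // last provider is flagged
    }
}
```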

Proposed Improvements for Outlier Detection Tool Use Case

(Lazarevic & Kumar, 2005) have indicated that most outlier detection techniques fall into four categories: the statistical approach, the distance-based approach, the profiling method, and the model-based approach.  In the statistical approach, the data points are modeled using a stochastic distribution and are determined to be outliers based on their relationship with the model.  Most statistical approaches struggle with high-dimensional distributions of the data points because of the complexity of such distributions, which results in inaccurate estimation.  The distance-based approach detects outliers by computing the distances among points, overcoming this limitation of the statistical approach.  Various distance-based outlier detection algorithms have been proposed, based on different strategies: the first computes the full-dimensional distances of points from one another using all available features, while the second computes the densities of local neighborhoods.   The profiling method builds profiles of normal behavior using data mining techniques or heuristic-based approaches and treats deviations from these profiles as intrusions.  The model-based approach first characterizes normal behavior using predictive models, such as replicator neural networks or unsupervised support vector machines, and detects outliers as deviations from the learned model (Lazarevic & Kumar, 2005).  (Capelleveen, 2013; van Capelleveen et al., 2016) have indicated that outlier detection as a data mining technique has not proven itself in the long run and is still under experimentation.  It is also considered a sophisticated data mining technique, and validating its effectiveness remains difficult (Capelleveen, 2013; van Capelleveen et al., 2016).

Based on this analysis of the outlier detection tool, a more heuristic and novel approach should be used.  (Viattchenin, 2016) has proposed a novel technique for outlier detection based on a heuristic, function-based clustering algorithm.  (Q. Liu & Vasarhelyi, 2013) have proposed healthcare fraud detection using a clustering model that incorporates geolocation information; the clustering model detected claims with extreme payment amounts and identified some suspicious claims.  In summary, integrating the clustering technique can play a role in enhancing the reliability and validity of the outlier detection data mining technique.

Conclusion

This project has discussed and examined Big Data Analytics (BDA) methods. An overview of BDA applications in various sectors was presented, followed by the implementation of BDA in the healthcare industry.  The records showed that the healthcare industry suffers from fraud, waste, and abuse.  The discussion provided a case study of BDA in healthcare using the outlier detection tool, and the data mining phases were discussed and analyzed.  A proposed improvement for the selected BDA technique of outlier detection was also addressed.  The analysis indicated that the outlier detection technique is still under experimentation and that a more heuristic fraud detection technique, such as the clustering data mining technique, should be used.  In summary, various BDA techniques are available for different industries, and organizations must select the appropriate BDA tool to meet the requirements of their business model.

References

Bi, Z., & Cochran, D. (2014). Big data analytics with applications. Journal of Management Analytics, 1(4), 249-265.

Brown, B., Chui, M., & Manyika, J. (2011). Are you ready for the era of ‘big data’. McKinsey Quarterly, 4(1), 24-35.

CA-Technologies. (2018). CA Technologies. Retrieved from https://www.ca.com/us/company/about-us.html.

Capelleveen, G. C. (2013). Outlier based predictors for health insurance fraud detection within US Medicaid. University of Twente.  

Chen, C. P., & Zhang, C.-Y. (2014). Data-intensive applications, challenges, techniques and technologies: A survey on Big Data. Information Sciences, 275, 314-347.

cms.gov. (2017). Medicare Fraud & Abuse: Prevention, Detection, and Reporting. Retrieved from https://www.cms.gov/Outreach-and-Education/Medicare-Learning-Network-MLN/MLNProducts/downloads/fraud_and_abuse.pdf.

Ekina, T., Leva, F., Ruggeri, F., & Soyer, R. (2013). Application of Bayesian methods in detection of healthcare fraud.

Lazarevic, A., & Kumar, V. (2005). Feature bagging for outlier detection. Paper presented at the Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining.

Liang, Y., & Kelemen, A. (2016). Big Data Science and its Applications in Health and Medical Research: Challenges and Opportunities. Austin Journal of Biometrics & Biostatistics, 7(3).

Liu, J., Bier, E., Wilson, A., Guerra-Gomez, J. A., Honda, T., Sricharan, K., . . . Davies, D. (2016). Graph analysis for detecting fraud, waste, and abuse in healthcare data. AI Magazine, 37(2), 33-46.

Liu, Q., & Vasarhelyi, M. (2013). Healthcare fraud detection: A survey and a clustering model incorporating Geo-location information.

Luo, J., Wu, M., Gopukumar, D., & Zhao, Y. (2016). Big data application in biomedical research and health care: a literature review. Biomedical informatics insights, 8, BII. S31559.

Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Byers, A. H. (2011). Big data: The next frontier for innovation, competition, and productivity.

MCA, M. J. S. (2017). Applications of Big Data Analytics and Machine Learning Techniques in Health Care Sectors. International Journal Of Engineering And Computer Science, 6(7).

McCafferly, D. (2015). How To Overcome Big Data Barriers. Retrieved from https://www.cioinsight.com/it-strategy/big-data/slideshows/how-to-overcome-big-data-barriers.html.

Ngufor, C., & Wojtusiak, J. (2013). Unsupervised labeling of data for supervised learning and its application to medical claims prediction. Computer Science, 14(2), 191.

Rabl, T., Gómez-Villamor, S., Sadoghi, M., Muntés-Mulero, V., Jacobsen, H.-A., & Mankovskii, S. (2012). Solving big data challenges for enterprise application performance management. Proceedings of the VLDB Endowment, 5(12), 1724-1735.

van Capelleveen, G., Poel, M., Mueller, R. M., Thornton, D., & van Hillegersberg, J. (2016). Outlier detection in healthcare fraud: A case study in the Medicaid dental domain. International Journal of Accounting Information Systems, 21, 18-31.

Viattchenin, D. A. (2016). A Technique for Outlier Detection Based on Heuristic Possibilistic Clustering. CERES, 17.

 

Big Data Analytics Tools

Dr. Aly, O.
Computer Science

The purpose of this discussion is to identify and describe a tool in the market for data analytics, how the tool is used, and where it can be used.  The discussion begins with an overview of Big Data Analytics tools, followed by the top five tools for 2018, among which RapidMiner is selected as the BDA tool for this discussion.  The discussion of RapidMiner as one of the top five BDA tools includes its features, technical specifications, use, advantages, and limitations.  The application of RapidMiner in various domains, such as medicine and education, is also addressed in this discussion.

Overview of Big Data Analytics Tools

Organizations must be able to quickly and effectively analyze large amounts of data and extract value from such data for sound business decisions.  The benefits of Big Data Analytics are driving organizations and businesses to implement Big Data Analytics techniques in order to compete in the market.  A survey conducted by CIO Insight showed that 65% of executives and senior decision makers indicated that organizations risk becoming uncompetitive or irrelevant if Big Data is not embraced (McCafferly, 2015).  The same survey also showed that 56% anticipated higher investment in big data, and 15% indicated that this increase in budget allocation would be significant (McCafferly, 2015).  Such budget allocation can be used for skilled professionals, BD data storage, BDA tools, and so forth.

Various BDA tools exist in the market for different business purposes, depending on the business model of the organization, and organizations must select the tool that best serves their business model.  Several studies have discussed tools for BDA implementation.  (Chen & Zhang, 2014) examined various types of BD tools: some are based on batch processing, such as Apache Hadoop, Dryad, Apache Mahout, and Tableau, while others are based on stream processing, such as Storm, S4, Splunk, Apache Kafka, and SAP Hana, as summarized in Table 1 and Table 2.  Each tool provides certain features for BDA implementation and offers particular advantages to the organizations that adopt it.


Table 1.  Big Data Tools Based on Batch Processing (Chen & Zhang, 2014).


Table 2.  Big Data Tools Based on Stream Processing (Chen & Zhang, 2014).

Other studies, such as (Rangra & Bansal, 2014), have provided a comparative study of data mining tools such as Weka, Keel, R-Programming, Knime, RapidMiner, and Orange, covering their technical specifications, general features, specializations, advantages, and limitations.  (Choi, 2017) discussed BDA tools by category: open source data tools, data visualization tools, sentiment tools, and data extraction tools.  Figure 1 provides a summary of some examples of BDA tools, including database sources from which large datasets can be downloaded for analysis.


Figure 1.  A Summary of Big Data Analytics Tools.

(Al-Khoder & Harmouch, 2014) evaluated four of the most popular open source and free data mining tools: R, RapidMiner, Weka, and Knime.  The R Foundation developed R-Programming, while the company Rapid-I developed RapidMiner; Weka was developed by the University of Waikato, and Knime was developed by Knime.com AG.  Figure 2 summarizes these four popular open source and free data mining tools, with the logo, description, launch date, current version at the time of writing, and development team for each.


Figure 2.  Open Source and Free Data Mining Tools Analyzed by (Al-Khoder & Harmouch, 2014).

The top five BDA tools for 2018 include Tableau Public, RapidMiner, Hadoop, R-Programming, and IBM Big Data (Seli, 2017).  The present discussion focuses on one of these top five BDA tools for 2018.  Figure 3 summarizes these top five BDA tools for 2018.


Figure 3.  Top Five BDA Tools for 2018.

RapidMiner Big Data Analytic Tool

The RapidMiner Big Data Analytics tool is selected for the present discussion since it is among the top five BDA tools for 2018.  RapidMiner is an open source platform for BDA based on the Java programming language.  RapidMiner provides machine learning procedures and data mining, as well as data visualization, processing, statistical modeling, deployment, evaluation, and predictive analytics (Hofmann & Klinkenberg, 2013; Rangra & Bansal, 2014; Seli, 2017).  RapidMiner is known for its commercial and business applications, as it provides an integrated environment and platform for machine learning, data mining, predictive analysis, and business analytics (Hofmann & Klinkenberg, 2013; Seli, 2017).  It is also used for research, education, training, rapid prototyping, and application development (Rangra & Bansal, 2014).  It specializes in predictive analysis and statistical computing and supports all steps of the data mining process (Hofmann & Klinkenberg, 2013; Rangra & Bansal, 2014).  RapidMiner uses a client/server model, where the server can be deployed as on-premise software, as a service, or on cloud infrastructure (Rangra & Bansal, 2014).

RapidMiner was first released in 2006.  The latest version of RapidMiner Server at the time of writing is 7.2, with free versions of Server and Radoop that can be downloaded from the RapidMiner site (rapidminer, 2018).  It can be installed on any operating system (Rangra & Bansal, 2014).  The advantages of RapidMiner include an integrated environment covering all steps of the data mining process, an easy-to-use graphical user interface (GUI) for designing data mining processes, visualization of results and data, and validation and optimization of these processes.  RapidMiner can also be integrated into more complex systems (Hofmann & Klinkenberg, 2013).  RapidMiner stores data mining processes in a machine-readable XML format, which can be executed with a click of a button and rendered as a visual graph of the data mining process (Hofmann & Klinkenberg, 2013).  It contains over a hundred learning schemes for regression, classification, and clustering analysis (Rangra & Bansal, 2014).  RapidMiner has a few limitations, including constraints on the number of rows it can handle and a need for more hardware resources than other tools such as SAS for the same task and data (Seli, 2017).  RapidMiner also requires sound knowledge of database handling (Rangra & Bansal, 2014).

RapidMiner Use and Application

Data mining requires six essential steps to extract value from a large dataset (Chisholm, 2013).  The process begins with business understanding, followed by data understanding and data preparation.  The modeling, evaluation, and deployment phases develop the predictive models, test them, and deploy them in real time.  Figure 4 illustrates these six steps of the data mining process.


Figure 4.  Data Mining Six Phases Process Framework (Chisholm, 2013).

Before working with RapidMiner, the user must know the common terms used by RapidMiner.  Some of these standard terms are process, operator, macro, repository, attribute, role, label, and ID (Chisholm, 2013).  The data mining process in RapidMiner begins with loading the data into RapidMiner, by importing either files or databases.  Large files can be split into pieces within RapidMiner; for example, a RapidMiner process can read each line of a CSV file and split the file into chunks.  If the dataset resides in a database, a Java Database Connectivity (JDBC) driver must be used; RapidMiner supports MySQL, PostgreSQL, SQL Server, Oracle, and Access (Chisholm, 2013).  A minimal JDBC sketch is shown after Figure 5.  After loading the data into RapidMiner and generating data for testing, a predictive model can be created from the loaded dataset, followed by process execution and visual review of the results.  RapidMiner provides various techniques to visualize the data, including scatter plots, scatter 3D color, parallel and deviation plots, quartile color plots, series plots, and the survey plotter.  Figure 5 illustrates the scatter 3D color visualization of data in RapidMiner (Chisholm, 2013).


Figure 5.  Scatter 3D Color Visualization of the Data in RapidMiner (Chisholm, 2013).
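As noted above, when the dataset resides in a relational database, RapidMiner connects through a JDBC driver.  The following is a minimal sketch, in plain Java, of the kind of JDBC connection such an import relies on; the connection URL, credentials, table, and column names are hypothetical placeholders, and RapidMiner itself performs the equivalent work through its database import operators rather than hand-written code.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class JdbcImportSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical MySQL connection; the MySQL JDBC driver jar must be on the classpath.
        String url = "jdbc:mysql://localhost:3306/claims_db";
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             Statement stmt = conn.createStatement();
             // Hypothetical table and columns used only for illustration.
             ResultSet rs = stmt.executeQuery("SELECT provider_id, amount FROM claims")) {
            while (rs.next()) {
                System.out.println(rs.getString("provider_id") + " : " + rs.getDouble("amount"));
            }
        }
    }
}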

RapidMiner supports statistical and machine learning techniques such as K-Nearest Neighbor classification and Naïve Bayes classification, which can be used for credit approval and in education (Hofmann & Klinkenberg, 2013).  RapidMiner is also applied in other areas such as marketing, cross-selling, and recommender systems (Hofmann & Klinkenberg, 2013).  Other useful use cases of RapidMiner include clustering in the medical and education domains (Hofmann & Klinkenberg, 2013).  RapidMiner can also be used for text mining scenarios such as spam detection, language detection, and customer feedback analysis.  Other applications of RapidMiner include anomaly detection and instance selection.  A from-scratch sketch of the K-Nearest Neighbor idea is shown below.
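To give a sense of what K-Nearest Neighbor classification does conceptually, the following is a minimal from-scratch sketch in Java; it is not RapidMiner's implementation, and the two-feature credit-approval data, feature meanings, and labels are purely hypothetical.

import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

public class KnnSketch {

    // Returns the majority label among the k training points nearest to the query point.
    static String classify(double[][] trainX, String[] trainY, double[] query, int k) {
        Integer[] order = new Integer[trainX.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, Comparator.comparingDouble(i -> distance(trainX[i], query)));

        Map<String, Integer> votes = new HashMap<>();
        for (int i = 0; i < k; i++) votes.merge(trainY[order[i]], 1, Integer::sum);
        return votes.entrySet().stream().max(Map.Entry.comparingByValue()).get().getKey();
    }

    // Euclidean distance between two feature vectors.
    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        // Hypothetical credit-approval data: [income, debt ratio] -> decision.
        double[][] trainX = {{55, 0.2}, {30, 0.6}, {70, 0.1}, {25, 0.8}};
        String[] trainY = {"approve", "reject", "approve", "reject"};
        System.out.println(classify(trainX, trainY, new double[]{60, 0.3}, 3)); // prints "approve"
    }
}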

Conclusion

This discussion has identified different tools for Big Data Analytics (BDA).  Over thirty analytic tools are available to address various BDA needs.  Some are open source tools, such as Knime, R-Programming, and RapidMiner, which can be downloaded for free, while others are visualization tools, such as Tableau Public and Google Fusion Tables, that provide compelling visual representations of the data in various scenarios.  Other tools are more semantic, such as OpenText and Opinion Crawl.  Data extraction tools for BDA include Octoparse and Content Grabber.  Users can download large datasets for BDA from various databases such as data.gov.

The discussion has also addressed the top five BDA tools for 2018: Tableau Public, RapidMiner, Hadoop, R-Programming, and IBM Big Data.  RapidMiner was selected as the BDA tool for this discussion, with a focus on its technical specifications, use, advantages, and limitations.  The data mining process and steps when using RapidMiner have also been discussed.  The analytic process begins with uploading the data to RapidMiner, during which the data can be split using RapidMiner's capabilities.  After loading and cleaning the data, the data model is developed and tested, followed by visualization of the results.  RapidMiner also supports statistical techniques such as K-Nearest Neighbor and Naïve Bayes classification.  RapidMiner use cases have been addressed as well, including the medical and education domains and text mining scenarios such as spam detection.  Organizations must select the appropriate BDA tools based on their business model.

References

Al-Khoder, A., & Harmouch, H. (2014). Evaluating four of the most popular open source and free data mining tools.

Chen, C. P., & Zhang, C.-Y. (2014). Data-intensive applications, challenges, techniques and technologies: A survey on Big Data. Information Sciences, 275, 314-347.

Chisholm, A. (2013). Exploring data with RapidMiner: Packt Publishing Ltd.

Choi, N. (2017). Top 30 Big Data Tools for Data Analysis. Retrieved from https://bigdata-madesimple.com/top-30-big-data-tools-data-analysis/.

Hofmann, M., & Klinkenberg, R. (2013). RapidMiner: Data mining use cases and business analytics applications: CRC Press.

McCafferly, D. (2015). How To Overcome Big Data Barriers. Retrieved from https://www.cioinsight.com/it-strategy/big-data/slideshows/how-to-overcome-big-data-barriers.html.

Rangra, K., & Bansal, K. (2014). Comparative study of data mining tools. International journal of advanced research in computer science and software engineering, 4(6).

rapidminer. (2018). Introducing RapidMiner 7.2, Free Versions of Server & Radoop, and New Pricing. Retrieved from https://rapidminer.com/blog/introducing-new-rapidminer-pricing-free-versions-server-radoop/.

Seli, A. (2017). Top 5 Big Data Analytics Tools for 2018. Retrieved from http://heartofcodes.com/big-data-analytics-tools-for-2018/.

Case Study: Hadoop in Healthcare Industry

Dr. Aly, O.
Computer Science

The purpose of this discussion is to identify a real-life case study where Hadoop was used.  The discussion also presents the researcher's view on whether Hadoop was used to its fullest extent.  The benefits of Hadoop to the industry identified in the use case are also discussed.

Hadoop Real Life Case Study and Applications in the Healthcare Industry

Various research studies and reports have discussed the Spark solution for real-time data processing in particular industries such as healthcare, while others have discussed the Hadoop solution for healthcare data analytics.  For instance, (Shruika & Kudale, 2018) discussed the use of Big Data in healthcare with Spark, while (Beall, 2016) indicated that United Healthcare processes data using the Hadoop framework for clinical advancements, financial analysis, and fraud and waste monitoring.  United Healthcare has utilized Hadoop to obtain a 360-degree view of each of its 85 million members (Beall, 2016).

The emphasis of this discussion is on Hadoop in the healthcare industry.  Data growth in the healthcare industry is increasing exponentially (Dezyre, 2016).  McKinsey has anticipated a potential annual value of $300 billion for healthcare in the US, along with 7% annual productivity growth, from the use of BDA (Manyika et al., 2011).  (Dezyre, 2016) reported that healthcare informatics poses challenges such as data knowledge representation, database design, data querying, and clinical decision support, which contribute to the development of BDA.

Big Data in healthcare includes patient-related data from electronic health records (EHRs), computerized physician order entry (CPOE) systems, clinical decision support systems, medical devices and sensors, lab results, and images such as X-rays (Alexandru, Alexandru, Coardos, & Tudora, 2016; Wang, Kung, & Byrd, 2018).  A Big Data framework for healthcare includes a data layer, a data aggregation layer, an analytical layer, and an information exploration layer (Alexandru et al., 2016).  Hadoop resides in the analytical layer of this Big Data framework (Alexandru et al., 2016).

The data analysis involves Hadoop and MapReduce processing large datasets in batch form economically, analyzing both structured and unstructured data in a massively parallel processing environment (Alexandru et al., 2016).  (Alexandru et al., 2016) indicated that stream computing can also be implemented, using real-time or near real-time analysis to identify and respond to healthcare fraud quickly.  The third type of analytics at the analytical layer is in-database analytics, which uses a data warehouse for data mining and allows high-speed parallel processing that can be used for prediction scenarios (Alexandru et al., 2016).  In-database analytics can be used for preventive health care and pharmaceutical management.  Using a Big Data framework that includes the Hadoop ecosystem provides additional healthcare benefits such as scalability, security, confidentiality, and optimization features (Alexandru et al., 2016).

Hadoop technology was found to be the only technology that enables healthcare to store data in its native forms (Dezyre, 2016).  There are five successful use cases and applications of Hadoop in the healthcare industry (Dezyre, 2016).  The first application of Hadoop technology in healthcare is cancer treatment and genomics.  Hadoop helps develop better treatments for diseases such as cancer by accelerating the design and testing of effective treatments tailored to patients, expanding genetically based clinical cancer trials, and establishing a national “cancer knowledge network” to guide treatment decisions (Dezyre, 2016).  Hadoop can also be used to monitor patient vitals.  Children’s Healthcare of Atlanta is an example of using the Hadoop ecosystem to treat over 6,200 children in its ICU units; through Hadoop, the hospital was able to store and analyze vital signs, and if any pattern changes, an alert is generated and sent to the physicians (Dezyre, 2016).  The third application of Hadoop in the healthcare industry involves the hospital network.  The Cleveland Clinic spinoff company known as “Explorys” is taking advantage of Hadoop by developing the most extensive database in the healthcare industry; as a result, Explorys was able to provide clinical support, reduce the cost of care measurement, and manage the population of at-risk patients (Dezyre, 2016).  The fourth application of Hadoop in the healthcare industry involves healthcare intelligence, where health insurance businesses are interested in questions such as identifying, for specific regions, individuals below a certain age who are not affected by certain diseases.  Through Hadoop technology, health insurance companies can compute the cost of insurance policies, with Pig, Hive, and MapReduce from the Hadoop ecosystem used in this scenario to process such large datasets (Dezyre, 2016).  The last application of Hadoop in the healthcare industry involves fraud prevention and detection.

Conclusion

In conclusion, the healthcare industry has taken advantage of Hadoop technology in various areas, not only for better treatment and medication but also for reducing cost and increasing productivity and efficiency.  It has also used Hadoop for fraud protection.  These are not the only benefits that Hadoop offers the healthcare industry: Hadoop also offers storage capability, scalability, and analytics over various types of datasets using parallel processing and a distributed file system.  From the viewpoint of the researcher, utilizing Spark on top of Hadoop will empower the healthcare industry not only at the batch processing level but also at the real-time data processing level.  (Basu, 2014) reported that the healthcare industry can take advantage of Spark and Shark with Apache Hadoop for real-time healthcare analytics.  Although Hadoop alone offers excellent benefits to the healthcare industry, its integration with other analytic tools such as Spark can make a huge difference at the patient care level as well as at the industry's return on investment.

References

Alexandru, A., Alexandru, C., Coardos, D., & Tudora, E. (2016). Healthcare, Big Data, and Cloud Computing. Management, 1, 2.

Basu, A. (2014). Real-Time Healthcare Analytics on Apache Hadoop* using Spark* and Shark. Retrieved from https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/big-data-real-time-healthcare-analytics-whitepaper.pdf.

Beall, A. (2016). Big data in healthcare: How three organizations are using big data to improve patient care and more. Retrieved from https://www.sas.com/en_us/insights/articles/big-data/big-data-in-healthcare.html.

Dezyre. (2016). 5 Healthcare Applications of Hadoop and Big Data Retrieved from https://www.dezyre.com/article/5-healthcare-applications-of-hadoop-and-big-data/85.

Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Byers, A. H. (2011). Big data: The next frontier for innovation, competition, and productivity.

Shruika, D., & Kudale, R. A. (2018). Use of Big Data in Healthcare with Spark. International Journal of Science and Research (IJSR).

Wang, Y., Kung, L., & Byrd, T. A. (2018). Big data analytics: Understanding its capabilities and potential benefits for healthcare organizations. Technological Forecasting and Social Change, 126, 3-13.

Hadoop Ecosystem

Dr. O. Aly
Computer Science

The purpose of this discussion is to discuss the Hadoop ecosystem, which is rapidly evolving.  The discussion also covers Apache Spark, which is a recent addition to the Hadoop ecosystem.  Both technologies offer significant benefits for the challenges of storing and processing large data sets in the age of Big Data Analytics.  The discussion also addresses the most significant differences between Hadoop and Spark.

Hadoop Solution, Components and Ecosystem

The growth of Big Data has demanded attention not only from researchers, academia, and government but also from software engineering, as dealing with Big Data using conventional computer science technologies has been challenging (Koitzsch, 2017).  (Koitzsch, 2017) references annual data volume statistics from the Cisco VNI Global IP Traffic Forecast 2014-2019, illustrated in Figure 1, to show the magnitude of data growth.


Figure 1.  Annual Data Volume Statistics [Cisco VNI Global IP Traffic Forecast 2014-2019] (Koitzsch, 2017). 

The complex characteristics of Big Data have demanded the innovation of distributed big data analysis, as the conventional techniques were found inadequate (Koitzsch, 2017; Lublinsky, Smith, & Yakubovich, 2013).  Thus, tools such as Hadoop have emerged, relying on clusters of relatively low-cost machines and disks and driving distributed processing for large-scale data projects.  Apache Hadoop is a Java-based open source distributed processing framework that evolved from Apache Nutch, an open source web search engine based on Apache Lucene (Koitzsch, 2017).  The newer Hadoop subsystems have various language bindings, such as Scala and Python (Koitzsch, 2017).  The core components of Hadoop 2 include MapReduce, YARN, and HDFS, along with other components such as Tez, as illustrated in Figure 2.


Figure 2.  Hadoop 2 Core Components (Koitzsch, 2017).

Hadoop and its ecosystem are divided into major building blocks (Koitzsch, 2017).  The core components of Hadoop 2 are YARN, MapReduce, HDFS, and Apache Tez.  The operational services component includes Apache Ambari, Oozie, Ganglia, Nagios, Falcon, and others.  The data services component includes Hive, HCatalog, Pig, HBase, Flume, Sqoop, and others.  The messaging component includes Apache Kafka, while the security services and secure ancillary components include Accumulo.  The glue components include Apache Camel, the Spring Framework, and Spring Data.  Figure 3 summarizes these building blocks of Hadoop and its ecosystem.

Figure 3. Hadoop 2 Technology Stack Diagram (Koitzsch, 2017).

Furthermore, the structure of the Hadoop ecosystem involves various components, with Hadoop at the center and cluster bookkeeping and management provided through ZooKeeper and Curator (Koitzsch, 2017).  Hive and Pig are standard components of the Hadoop ecosystem that provide data warehousing, while Mahout provides standard machine learning algorithm support.  Figure 4 shows the structure of the Hadoop ecosystem (Koitzsch, 2017).


Figure 4.  Hadoop Ecosystem (Koitzsch, 2017).

Hadoop Limitation Driving Additional Technologies

Hadoop has three significant limitations (Guo, 2013).  The first limitation is the instability of the Hadoop software, since it is open source, combined with a lack of technical support and documentation; Enterprise Hadoop can be used to overcome this first limitation.  The second limitation is that Hadoop cannot handle real-time data processing; Spark or Storm can be used to overcome this, as required by the application.  The third limitation is that Hadoop cannot handle large graph datasets efficiently; GraphLab can be utilized to overcome the large graph dataset limitation.

Enterprise Hadoop refers to distributions of Hadoop by Hadoop-oriented vendors such as Cloudera, Hortonworks, MapR, and Hadapt (Guo, 2013).  Cloudera provides Big Data solutions and is regarded as one of the most significant contributors to the Hadoop codebase (Guo, 2013).  Hortonworks and MapR are Hadoop-based Big Data solutions (Guo, 2013).  Spark is a real-time, in-memory processing platform for Big Data (Guo, 2013).  (Guo, 2013) indicates that Spark "can be up to 40 times faster than Hadoop" (page 15), and (Scott, 2015) indicates that Spark running in memory "can be 100 times faster than Hadoop MapReduce, but also ten times faster when processing disk-based data in a similar way to Hadoop MapReduce itself" (page 7).  Spark is described as ideal for iterative processing and responsive Big Data applications (Guo, 2013).  Spark can also be integrated with Hadoop, where a Hadoop-compatible storage API provides the capability to access any Hadoop-supported system (Guo, 2013).  Storm is another choice for overcoming the Hadoop limitation of real-time data processing; it was developed and open sourced by Twitter (Guo, 2013).  GraphLab is the alternative solution for the Hadoop limitation of dealing with large graph datasets; it is an open source distributed system, developed at Carnegie Mellon University, to handle sparse iterative graph algorithms (Guo, 2013).  Figure 5 summarizes these three limitations and the alternative tools used to overcome them.


Figure 5.  Three Major Limitations of Hadoop and Alternative Solutions.

Apache Spark Solution, and its Building Blocks

Spark was developed in 2009 by the UC Berkeley AMPLab.  Spark processes data in memory, which makes it quicker than Hadoop (Guo, 2013; Koitzsch, 2017; Scott, 2015).  In 2013, Spark became a project of the Apache Software Foundation, and early in 2014 it became one of the foundation's major projects.  (Scott, 2015) describes Spark as a general-purpose engine for data processing that can be used in various projects.  The primary tasks associated with Spark include interactive queries across large datasets, processing data streaming from sensors or financial systems, and machine learning (Scott, 2015).  While Hadoop was written in Java, Apache Spark was written primarily in Scala (Koitzsch, 2017).

Spark has three critical features: simplicity, speed, and support (Scott, 2015).  Simplicity is reflected in Spark's well-structured and well-documented APIs, which help data scientists utilize Spark quickly.  Speed reflects the in-memory processing of large datasets, and this feature distinguishes Spark from Hadoop.  Support is reflected in the programming languages that Spark supports, including Java, Python, R, and Scala (Scott, 2015).  Spark also has native support for integrating with leading storage solutions in the Hadoop ecosystem and beyond (Scott, 2015).  Databricks, IBM, and the other main Hadoop vendors provide Spark-based solutions.  A brief Java sketch of the Spark programming model is shown below.
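To illustrate the Java API support mentioned above, the following is a minimal word-count sketch using Spark's Java RDD API; the input and output paths and the local master setting are hypothetical, and a real deployment would typically submit the job to a cluster.

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class SparkWordCountSketch {
    public static void main(String[] args) {
        // Local master and application name are placeholders for this sketch.
        SparkConf conf = new SparkConf().setAppName("word-count-sketch").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile("input.txt");        // hypothetical input path

        JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())  // split lines into words
                .mapToPair(word -> new Tuple2<>(word, 1))                       // emit (word, 1) pairs
                .reduceByKey((a, b) -> a + b);                                  // sum counts per word

        counts.saveAsTextFile("word-counts");                    // hypothetical output directory
        sc.stop();
    }
}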

Typical uses of Spark include stream processing, machine learning, interactive analytics, and data integration (Scott, 2015).  An example of stream processing is real-time data processing to identify and prevent potentially fraudulent transactions.  Machine learning is another typical use case of Spark, supported by Spark's ability to keep data in memory and quickly run repeated queries, which helps in training machine learning algorithms and finding the most effective model (Scott, 2015).  Interactive analytics is another typical use of Spark, involving interactive query processing where Spark responds and adapts quickly.  Data integration is a further typical use of Spark, involving the extract, transform, and load (ETL) process and reducing its cost and time.  The Spark framework includes the Spark Core engine, together with Spark SQL, Spark Streaming for data streaming, MLlib for machine learning, GraphX for graph computation, and SparkR for running the R language on Spark.  Figure 6 summarizes the framework of Spark and its building blocks (Scott, 2015).


Figure 6.  Spark Building Blocks (Scott, 2015).

Differences Between Spark and Hadoop

Although Spark has its benefits in processing real-time data using in-memory processing, Spark is not a replacement for Hadoop or MapReduce (Scott, 2015).  Spark can run on top of Hadoop to benefit from YARN, Hadoop's cluster manager, and from underlying storage such as HDFS and HBase.  Spark can also run separately without Hadoop, integrating with other cluster managers such as Mesos and other storage such as Cassandra and Amazon S3 (Scott, 2015).  Spark is described as a great companion to a modern Hadoop cluster deployment (Scott, 2015) and as a powerful tool in its own right for processing large volumes of data.  However, Spark on its own is not well suited to production workloads; thus, integrating Spark with Hadoop provides many capabilities that Spark cannot offer on its own.

Hadoop offers YARN as a resource manager, the distributed file system, disaster recovery capabilities, data security, and a distributed data platform.  Spark brings machine learning capabilities to Hadoop, delivering functionality that is not easily achieved in Hadoop without Spark (Scott, 2015).  Spark also offers fast, in-memory, real-time data streaming, which Hadoop cannot accomplish without Spark (Scott, 2015).  In summary, although Hadoop has its limitations, Spark is not replacing Hadoop but empowering it.

Conclusion

This discussion has covered significant topics relevant to Hadoop and Spark.  It began with Big Data, its complex characteristics, and the urgent need for technologies and tools to deal with it.  Hadoop and Spark as emerging technologies, together with their building blocks, have been addressed in this discussion, and the differences between Spark and Hadoop are also covered.  The conclusion of this discussion is that Spark is not replacing Hadoop and MapReduce.  Spark offers various benefits to Hadoop, and at the same time Hadoop offers various benefits to Spark.  The integration of Spark and Hadoop offers great benefits to data scientists in the Big Data Analytics domain.

References

Guo, S. (2013). Hadoop operations and cluster management cookbook: Packt Publishing Ltd.

Koitzsch, K. (2017). Pro Hadoop Data Analytics: Springer.

Lublinsky, B., Smith, K. T., & Yakubovich, A. (2013). Professional hadoop solutions: John Wiley & Sons.

Scott, J. A. (2015). Getting Started with Spark: MapR Technologies, Inc.

Hadoop: Functionality, Installation and Troubleshooting

Dr. O. Aly
Computer Science

Abstract

The purpose of this project is to discuss Hadoop functionality, installation steps, and troubleshooting techniques.  It addresses two significant parts.  Part-I discusses Big Data and the emerging technology of Hadoop.  It also provides an overview of the Hadoop ecosystem, its building blocks, benefits, and limitations, and discusses the MapReduce framework, its benefits, and its limitations.  Part-I also provides a few success stories of Hadoop technology used with Big Data Analytics.  Part-II addresses the installation and configuration of Hadoop on the Windows operating system in fourteen critical tasks.  It also addresses the errors encountered during the configuration and the techniques used to overcome these errors and proceed successfully with the Hadoop installation.

Keywords: Big Data Analytics; Hadoop Ecosystem; MapReduce.

Introduction

This project discusses various significant topics related to Big Data Analytics and addresses two significant parts.  Part-I discusses Big Data and the emerging technology of Hadoop, provides an overview of the Hadoop ecosystem, its building blocks, benefits, and limitations, and discusses the MapReduce framework, its benefits, and its limitations.  Part-I also provides a few success stories of Hadoop technology used with Big Data Analytics.  Part-II addresses the installation and configuration of Hadoop on the Windows operating system in fourteen critical tasks, along with the errors encountered during the configuration and the techniques used to overcome these errors and proceed successfully with the Hadoop installation.

Part-I
Hadoop Technology Overview

The purpose of this Part is to address relevant topics related to Hadoop.  It begins with Big Data Analytics and the emerging Hadoop technology.  The building blocks of the Hadoop ecosystem are also addressed in this part, including the Hadoop Distributed File System (HDFS), MapReduce, and HBase.  The benefits and limitations of Hadoop, as well as of MapReduce, are also discussed in Part-I of the project.  Part-I ends with success stories of using Hadoop ecosystem technology with Big Data Analytics in various domains and industries.

Big Data Analytics and Hadoop Emerging Technology

Big Data is now the buzzword in the fields of computer science and information technology.  Big Data has attracted the attention of various sectors, researchers, academia, government, and even the media (Géczy, 2014; Kaisler, Armour, Espinosa, & Money, 2013).  The 2011 report of the International Data Corporation (IDC) estimated that the amount of information created and replicated would exceed 1.8 zettabytes (1.8 trillion gigabytes) in 2011, an amount that had grown by a factor of nine in just five years (Gantz & Reinsel, 2011).

Big Data Analytics (BDA) analyzes and mines Big Data to produce operational and business knowledge at an unprecedented scale (Bi & Cochran, 2014).  BDA is described by (Bi & Cochran, 2014) as an integral toolset for strategy, marketing, human resources, and research.  It is the process of inspecting, cleaning, transforming, and modeling BD with the objective of discovering knowledge, generating solutions, and supporting decision-making (Bi & Cochran, 2014).  Big Data (BD) and BDA are regarded as powerful tools from which various organizations have benefited (Bates, Saria, Ohno-Machado, Shah, & Escobar, 2014).  Companies that adopted Big Data Analytics have been successful at using Big Data to improve the efficiency of their business (Bates et al., 2014).  An example of a successful application of Big Data Analytics is IBM’s “Watson,” an application developed by IBM that appeared on the TV program Jeopardy and uses some of these Big Data approaches (Bates et al., 2014).  (Manyika et al., 2011) provide notable examples of organizations around the globe that are well known for their extensive and effective use of data, including Wal-Mart, Harrah’s, Progressive Insurance, Capital One, Tesco, and Amazon.  These companies have already taken advantage of Big Data as a “competitive weapon” (Manyika et al., 2011).  Figure 1 illustrates the different types of data that make up the Big Data space.


Figure 1: Big Data (Ramesh, 2015)

“Big data is about deriving value… The goal of big data is data-driven decision making” (Ramesh, 2015).  Thus, businesses should make analytics the goal when investing in storing Big Data and should focus on the analytics side of Big Data to retrieve the value that can assist in decision-making (Ramesh, 2015).  The value of BDA increases over time as cumulative cash flow increases (B. Gupta & Jyoti, 2014).  Figure 2 illustrates the value of BDA along the dimensions of time and cumulative cash flow.  Thus, there is no doubt that BDA provides great benefits to organizations.

Figure 2.  The Value of Big Data Analytics. Adapted from (B. Gupta & Jyoti, 2014).

Furthermore, organizations must learn how to use Big Data Analytics to drive business value that aligns with their core competencies and creates competitive advantages (Minelli, Chambers, & Dhiraj, 2013).  BDA can improve operational efficiency, increase revenue, and achieve competitive differentiation.  Table 1 summarizes the Big Data business models that organizations can use to put Big Data to work as business opportunities.

Table 1: Big Data Business Models (Minelli et al., 2013)

Organizations deal with three types of data status: data in use, data at rest, and data in motion.  Data in use means the data are being used by services or required by users to accomplish specific tasks.  Data at rest means the data are not in use and are stored or archived.  Data in motion means the data are about to change state from at rest to in use or are being transferred from one place to another (Chang, Kuo, & Ramachandran, 2016).  Figure 3 summarizes these three types of data.

Figure 3.  Three Types for Data.

One of the significant characteristics of Big Data is velocity.  The speed of data generation is described by (Abbasi, Sarker, & Chiang, 2016) as a “hallmark” of Big Data.  Wal-Mart is an example of such explosive data generation, collecting over 2.5 petabytes of customer transaction data every hour; moreover, over one billion new tweets occur every three days, and five billion search queries occur daily (Abbasi et al., 2016).  Velocity concerns data in motion (Chopra & Madan, 2015; Emani, Cullot, & Nicolle, 2015; Katal, Wazid, & Goudar, 2013; Moorthy, Baby, & Senthamaraiselvi, 2014; Nasser & Tariq, 2015).  Velocity involves streams of data, structured data, and the availability of access and delivery (Emani et al., 2015).  The challenge of velocity lies not only in the speed of the incoming data, which can be handled with batch processing, but also in streaming such high-speed data in real time for knowledge-based decisions (Emani et al., 2015; Nasser & Tariq, 2015).  Real-time data (a.k.a. data in motion) is streaming data that needs to be analyzed as it arrives (Jain, 2013).

(CSA, 2013) indicated that Big Data technologies are divided into two categories: batch processing for analyzing data at rest, and stream processing for analyzing data in motion.  An example of data-at-rest analysis is sales analysis, which is not based on real-time data processing (Jain, 2013); an example of data-in-motion analysis is association rules in e-commerce.  The response time for each data processing category is different.  For stream processing, response times range from milliseconds to seconds, but the more significant challenge is to stream data and reduce the response time to well below a millisecond, which is very challenging (Chopra & Madan, 2015; CSA, 2013).  Data in motion, which reflects stream or real-time processing, does not always need to reside in memory, and interactive analysis of large-scale data sets through new technologies such as Apache Drill and Google’s Dremel provides new paradigms for data analytics.  Figure 4 illustrates the response time for each processing type.


Figure 4.  The Batch and Stream Processing Responsiveness (CSA, 2013).

There are two kinds of systems for data at rest: NoSQL systems for interactive data-serving environments, and systems for large-scale analytics based on the MapReduce paradigm, such as Hadoop.  NoSQL systems are designed around a simpler key-value data model with built-in sharding and work seamlessly in a distributed, cloud-based environment (R. Gupta, Gupta, & Mohania, 2012).  A MapReduce-based framework such as Hadoop supports batch-oriented processing (Chandarana & Vijayalakshmi, 2014; Erl, Khattak, & Buhler, 2016; Sakr & Gaber, 2014).  Data stream management systems allow the user to analyze data in motion, rather than collecting large quantities of data, storing it on disk, and then analyzing it.  There are various stream processing systems, such as IBM InfoSphere Streams (R. Gupta et al., 2012; Hirzel et al., 2013), Twitter’s Storm, and Yahoo’s S4.  These systems are designed for clusters of commodity hardware performing real-time data processing (R. Gupta et al., 2012).  A brief sketch of key-value access is shown below.
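As an illustration of the key-value data model described above, the following is a minimal, hypothetical sketch using the Java client API of HBase, a NoSQL store from the Hadoop ecosystem discussed later in this part; the table name, column family, row key, and values are placeholders, and a running HBase instance configured through hbase-site.xml is assumed.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class KeyValueSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("transactions"))) {   // hypothetical table

            // Write a row keyed by a transaction id (key-value style access).
            Put put = new Put(Bytes.toBytes("txn-0001"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("amount"), Bytes.toBytes("42.50"));
            table.put(put);

            // Read the same row back by its key.
            Result result = table.get(new Get(Bytes.toBytes("txn-0001")));
            String amount = Bytes.toString(result.getValue(Bytes.toBytes("d"), Bytes.toBytes("amount")));
            System.out.println("amount = " + amount);
        }
    }
}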

In 2004, Google introduced MapReduce as a parallel processing framework for dealing with large sets of data (Bakshi, 2012; Fadzil, Khalid, & Manaf, 2012; White, 2012).  The MapReduce framework gained much popularity because it hides the sophisticated operations of parallel processing (Fadzil et al., 2012).  Various MapReduce frameworks such as Hadoop were introduced because of the enthusiasm for MapReduce (Fadzil et al., 2012).

The capability of the MapReduce framework was recognized in different research areas such as data warehousing, data mining, and bioinformatics (Fadzil et al., 2012).  The MapReduce framework consists of two main layers: the Distributed File System (DFS) layer for storing data and the MapReduce layer for processing data (Lee, Lee, Choi, Chung, & Moon, 2012; Mishra, Dehuri, & Kim, 2016; Sakr & Gaber, 2014).  The DFS is a significant feature of the MapReduce framework (Fadzil et al., 2012).

The MapReduce framework uses large clusters of low-cost commodity hardware to lower cost (Bakshi, 2012; H. Hu, Wen, Chua, & Li, 2014; Inukollu, Arsi, & Ravuri, 2014; Khan et al., 2014; Krishnan, 2013; Mishra et al., 2016; Sakr & Gaber, 2014; White, 2012).  It relies on “Redundant Arrays of Independent (and inexpensive) Nodes (RAIN),” whose components are loosely coupled, so that when any node goes down there is no negative impact on the MapReduce job (Sakr & Gaber, 2014; Yang, Dasdan, Hsiao, & Parker, 2007).  The framework achieves fault tolerance by applying replication, allowing any crashed node to be replaced by another node without affecting the currently running job (P. Hu & Dai, 2014; Sakr & Gaber, 2014).  The MapReduce framework also provides automatic support for parallelization of execution, which makes MapReduce highly parallel and yet abstracted (P. Hu & Dai, 2014; Sakr & Gaber, 2014).  A minimal word-count sketch of the programming model is shown below.
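To make the Map and Reduce abstractions concrete, the following is a minimal word-count sketch written against the Hadoop 2.x Java MapReduce API, closely following the well-known tutorial example; the input and output paths are supplied as command-line arguments and are assumptions of this sketch.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every token in the input line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts emitted for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local aggregation before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));     // input directory on the DFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // output directory (must not exist)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}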

Hadoop Ecosystem Building Blocks

Emerging BD technologies such as the Hadoop ecosystem (including Pig, Hive, Mahout, and Hadoop itself), stream mining, complex-event processing, and NoSQL databases enable the analysis of not only large-scale but also heterogeneous datasets at unprecedented scale and speed (Cardenas, Manadhata, & Rajan, 2013).  Hadoop was developed by Yahoo and Apache to run jobs over hundreds of terabytes of data (Yan, Yang, Yu, Li, & Li, 2012).  Various large corporations such as Facebook and Amazon have used Hadoop because it offers high efficiency, high scalability, and high reliability (Yan et al., 2012).  The Hadoop Distributed File System (HDFS) is one of the major components of the Hadoop framework, storing large files (Bao, Ren, Zhang, Zhang, & Luo, 2012; CSA, 2013; De Mauro, Greco, & Grimaldi, 2015) and allowing access to data scattered over multiple nodes without exposing the complexity of the environment (Bao et al., 2012; De Mauro et al., 2015).  The MapReduce programming model is another significant component of the Hadoop framework (Bao et al., 2012; CSA, 2013; De Mauro et al., 2015), designed to implement distributed and parallel algorithms efficiently (De Mauro et al., 2015).  HBase is the third component of the Hadoop framework (Bao et al., 2012); it is built on top of HDFS and is a NoSQL (Not only SQL) database (Bao et al., 2012).  A brief sketch of interacting with HDFS from Java is shown below.
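As a brief illustration of how applications interact with HDFS, the following hedged sketch uses the Hadoop FileSystem Java API to copy a local file into the distributed file system and list a directory; the NameNode address mirrors the fs.defaultFS value used in Part-II, and the file and directory paths are hypothetical.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // NameNode address; matches the fs.defaultFS setting configured in Part-II of this project.
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);

        // Copy a local file into HDFS (both paths are placeholders).
        fs.copyFromLocalFile(new Path("C:/data/claims.csv"), new Path("/healthcare/claims.csv"));

        // List the files stored under the target HDFS directory.
        for (FileStatus status : fs.listStatus(new Path("/healthcare"))) {
            System.out.println(status.getPath() + " (" + status.getLen() + " bytes)");
        }
        fs.close();
    }
}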

Hadoop Benefits and Limitations

Various studies have addressed the benefits of Hadoop technology.  These benefits include scalability and flexibility, cost efficiency, and fault tolerance (H. Hu et al., 2014; Khan et al., 2014; Mishra et al., 2016; Polato, Ré, Goldman, & Kon, 2014; Sakr & Gaber, 2014).  Hadoop allows the nodes in the cluster to scale up and down based on the computation requirements and with no change in the data formats (H. Hu et al., 2014; Polato et al., 2014).  Hadoop also brings massively parallel computation to commodity hardware, decreasing the cost per terabyte of storage and making massively parallel computation affordable as the volume of data increases (H. Hu et al., 2014).  Hadoop offers flexibility because it is not tied to a schema, which allows the use of structured, unstructured, or semi-structured data and the aggregation of data from multiple sources (H. Hu et al., 2014; Polato et al., 2014).  Hadoop also allows nodes to crash without affecting the data processing; it provides a fault-tolerant environment where data and computation can be recovered without any negative impact on the processing of the data (H. Hu et al., 2014; Polato et al., 2014; White, 2012).

Hadoop faces various limitations, such as a low-level programming paradigm and schema, strictly batch processing, time skew, and incremental computation (Alam & Ahmed, 2014).  Incremental computation is regarded as one of the significant shortcomings of Hadoop technology (Alam & Ahmed, 2014).  Efficient handling of incremental data comes at the expense of incompatibility with the programming models offered by non-incremental systems such as MapReduce, requiring the implementation of incremental algorithms and increasing the complexity of the algorithms and the code (Alam & Ahmed, 2014).  A caching technique is proposed by (Alam & Ahmed, 2014) as a solution, applied at three levels: the job, the task, and the hardware (Alam & Ahmed, 2014).

Incoop is another solution, proposed by (Bhatotia, Wieder, Rodrigues, Acar, & Pasquin, 2011).  Incoop extends the open-source Hadoop implementation of the MapReduce programming paradigm to run unmodified MapReduce programs incrementally (Bhatotia et al., 2011; Sakr & Gaber, 2014).  Incoop allows programmers to run MapReduce programs incrementally without any modification to the code (Bhatotia et al., 2011; Sakr & Gaber, 2014).  Moreover, information about previously executed MapReduce tasks is recorded by Incoop so that it can be reused in subsequent MapReduce computations when possible (Bhatotia et al., 2011; Sakr & Gaber, 2014).

Incoop is not a perfect solution; it has some shortcomings, which are addressed by (Sakr & Gaber, 2014; Zhang, Chen, Wang, & Yu, 2015).  Enhancements implemented for Incoop include an incremental HDFS called Inc-HDFS, a Contraction phase, and a memoization-aware scheduler (Sakr & Gaber, 2014).  Inc-HDFS computes the delta between the inputs of two consecutive job runs and splits the input based on its contents while maintaining compatibility with HDFS.  The Contraction phase is a new phase in the MapReduce framework that breaks Reduce tasks into smaller sub-computations forming an inverted tree, so that when a small portion of the input changes, only the path from the corresponding leaf to the root needs to be recomputed (Sakr & Gaber, 2014).  The memoization-aware scheduler is a modified version of the Hadoop scheduler that takes advantage of the locality of memoized results (Sakr & Gaber, 2014).

Another solution, called i2MapReduce, was proposed and compared with Incoop by (Zhang et al., 2015).  i2MapReduce performs incremental processing at the level of key-value pairs rather than at the task level.  This solution also supports more complex iterative computation, which is used in data mining, and reduces the I/O overhead by applying various techniques (Zhang et al., 2015).  IncMR is an enhanced framework for large-scale incremental data processing (Yan et al., 2012).  It inherits the simplicity of standard MapReduce, does not modify HDFS, and uses the same MapReduce APIs (Yan et al., 2012).  With IncMR, programs can perform incremental data processing without any modification (Yan et al., 2012).

In summary, researchers have exerted various efforts to overcome the incremental computation limitation of Hadoop, including Incoop, Inc-HDFS, i2MapReduce, and IncMR.  Each proposed solution attempts to enhance and extend standard Hadoop to avoid overheads such as I/O and to increase efficiency, without increasing the complexity of the computation and without requiring modification of the code.

MapReduce Benefits and Limitations

MapReduce was introduced to solve the problem of parallel processing of large sets of data in a distributed environment, which previously required manual management of hardware resources (Fadzil et al., 2012; Sakr & Gaber, 2014).  The complexity of parallelization is addressed using two techniques: the Map/Reduce technique and the Distributed File System (DFS) technique (Fadzil et al., 2012; Sakr & Gaber, 2014).  A parallel framework must be reliable, ensuring good resource management in a distributed environment of off-the-shelf hardware and solving the scalability issue to support future processing requirements (Fadzil et al., 2012).  Earlier frameworks such as the Message Passing Interface (MPI) had reliability and fault-tolerance issues when processing large sets of data (Fadzil et al., 2012).  The MapReduce framework covers both categories of scalability: structural scalability and load scalability (Fadzil et al., 2012).  It addresses structural scalability by using the DFS, which allows sizeable virtual storage to be formed by adding off-the-shelf hardware, and it addresses load scalability by increasing the number of nodes to improve performance (Fadzil et al., 2012).

However, the earlier version of the MapReduce framework faced several challenges.  Among them are the join operation and the lack of support for aggregate functions to join multiple datasets in one task (Sakr & Gaber, 2014).  Another limitation of the standard MapReduce framework concerns iterative processing, which is required for analysis techniques such as the PageRank algorithm, recursive relational queries, and social network analysis (Sakr & Gaber, 2014).  The standard MapReduce does not share the execution of work to reduce the overall amount of work (Sakr & Gaber, 2014).  A further limitation is the lack of support for data indexes and column storage, with only sequential scanning of the input data supported; this lack of a data index affects query performance (Sakr & Gaber, 2014).

Moreover, many have argued that MapReduce is not the optimal solution for structured data.  It is known for its shared-nothing architecture, which supports scalability (Bakshi, 2012; Jinquan, Jie, Shengsheng, Yan, & Yuanhao, 2012; Sakr & Gaber, 2014; White, 2012) and the processing of large unstructured data sets (Bakshi, 2012).  MapReduce also has limitations in performance and efficiency (Lee et al., 2012).

The standard MapReduce framework faced the challenge of iterative computation, which is required in various operations such as data mining, PageRank, network traffic analysis, graph analysis, social network analysis, and so forth (Bu, Howe, Balazinska, & Ernst, 2010; Sakr & Gaber, 2014).  These analysis techniques require the data to be processed iteratively until the computation satisfies a convergence or stopping condition (Bu et al., 2010; Sakr & Gaber, 2014).  Because of this limitation and this critical requirement, the iterative process is implemented and executed manually using a driver program when the standard MapReduce framework is used (Bu et al., 2010; Sakr & Gaber, 2014).  However, the manual implementation and execution of such iterative computation have two significant problems (Bu et al., 2010; Sakr & Gaber, 2014).  The first is that unchanged data are reloaded from iteration to iteration, wasting input/output (I/O), network bandwidth, and CPU resources (Bu et al., 2010; Sakr & Gaber, 2014).  The second is the overhead of the termination condition, which checks whether the output of the application has not changed for two consecutive iterations and has reached a fixed point (Bu et al., 2010; Sakr & Gaber, 2014).  This termination condition may require an extra MapReduce job on each iteration, which causes overhead for scheduling extra tasks, reading extra data from disk, and moving data across the network (Bu et al., 2010; Sakr & Gaber, 2014).  A minimal sketch of such a manual driver loop is shown below.
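The following is a minimal sketch of the manual driver loop described above; the mapper, reducer, and key/value configuration are omitted, the convergence counter is hypothetical, and a real application would typically compare consecutive outputs, which is exactly the termination-condition overhead noted here.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IterativeDriverSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);          // initial input on the distributed file system
        boolean converged = false;
        int iteration = 0;

        while (!converged && iteration < 20) {   // hard cap so the loop always terminates
            Path output = new Path(args[1] + "/iter-" + iteration);
            Job job = Job.getInstance(conf, "iterative-step-" + iteration);
            // Mapper, reducer, and key/value classes would be configured here (omitted).
            FileInputFormat.addInputPath(job, input);
            FileOutputFormat.setOutputPath(job, output);
            job.waitForCompletion(true);

            // Hypothetical application counter used as the convergence test; checking convergence
            // in practice may require an extra job or comparison of consecutive outputs, which is
            // the overhead discussed above.
            converged = job.getCounters()
                           .findCounter("app", "CHANGED_RECORDS").getValue() == 0;

            input = output;                      // the output of one iteration feeds the next
            iteration++;
        }
    }
}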

Researchers have exerted considerable effort to solve the iterative computation problem: HaLoop was proposed by (Bu et al., 2010), Twister by (Ekanayake et al., 2010), and Pregel by (Malewicz et al., 2010).  One approach to the iterative computation limitation, used in HaLoop (Bu et al., 2010) and Twister (Ekanayake et al., 2010), is to identify and keep invariant data across iterations so that repeatedly reading unnecessary data is avoided.  HaLoop (Bu et al., 2010) implements two caching functionalities (Bu et al., 2010; Sakr & Gaber, 2014).  The first caches the invariant data of the first iteration and reuses it in later iterations; the second caches the reducer outputs, making the fixpoint check more efficient without adding an extra MapReduce job (Bu et al., 2010; Sakr & Gaber, 2014).

The Pregel solution by (Malewicz et al., 2010) is more focused on graphs and was inspired by the Bulk Synchronous Parallel model (Malewicz et al., 2010).  It provides synchronous computation and communication (Malewicz et al., 2010), uses an explicit messaging approach to acquire remote information, and does not replicate remote values locally (Malewicz et al., 2010).  Mahout is another solution introduced to handle iterative computing by grouping a series of chained jobs to obtain the results; in this approach, the result of each job is pushed into the next job until the final results are obtained (Polato et al., 2014).  iHadoop, proposed by (Elnikety, Elsayed, & Ramadan, 2011), schedules iterations asynchronously and connects the output of one iteration to the next, allowing both to process their data concurrently (Elnikety et al., 2011).  The iHadoop task scheduler exploits inter-iteration data locality by scheduling tasks that exhibit a producer/consumer relation on the same physical machine, allowing fast transfer of the local data (Elnikety et al., 2011).

Apache Hadoop and Apache Spark are the most popular technologies for iterative computation using an in-memory data processing engine (Liang, Li, Wang, & Hu, 2011).  Hadoop expresses an iterative computation as a series of MapReduce jobs, where each job independently reads its data from the Hadoop Distributed File System (HDFS), processes the data, and writes the data back to HDFS (Liang et al., 2011).  Dacoop was proposed by Liang et al. as an extension to Hadoop for data-iterative applications, using a caching technique for repeatedly processed data and introducing a shared-memory-based data cache mechanism (Liang et al., 2011).  iMapReduce is another solution, proposed by (Zhang, Gao, Gao, & Wang, 2012), which supports iterative processing by keeping map and reduce tasks persistent during the whole iterative process and defining how the persistent tasks are terminated (Zhang et al., 2012).  iMapReduce avoids three significant overheads.  The first is the job startup overhead, avoided by building an internal loop from reduce to map within a job.  The second is the communication overhead, avoided by separating the iterated state data from the static structure data.  The third is the synchronization overhead, avoided by allowing asynchronous map task execution (Zhang et al., 2012).

Success Stories of Hadoop Technology for Big Data Analytics (BDA)

·         BDA and the Impact of Hadoop in Banking for Cost Reduction

(Davenport & Dyché, 2013) reported that Big Data has had an impact at an international financial services firm.  The bank has several objectives for Big Data; however, the primary objective is to exploit “a vast increase in computing power on dollar-for-dollar basis” (Davenport & Dyché, 2013).  The bank purchased a Hadoop cluster with 50 server nodes and 800 processor cores, capable of handling a petabyte of data.  The bank's data scientists take existing analytical procedures and convert them into the Hive scripting language to run on the Hadoop cluster.

·         BDA and the Impact of Real-Time and Hadoop on Fraud Detection

High-velocity Big Data has created opportunities and requirements for organizations to increase their capability for real-time sensing and response (Chan, 2014).  Real-time analysis and rapid response are critical features of Big Data management in many business situations (Chan, 2014).  For instance, as cited in (Chan, 2014), IBM (2013) identified potential fraud by scrutinizing the five million trade events created each day, and by analyzing 500 million daily call detail records in real time was able to predict customer churn faster (Chan, 2014).

“Fraud detection is one of the most visible uses of big data analytics” (Cardenas et al., 2013).  Credit card and phone companies have conducted large-scale fraud detection for decades (Cardenas et al., 2013).  However, the custom-built infrastructure necessary to mine Big Data for fraud detection was not economical enough for wide-scale adoption.  One of the significant impacts of BDA technologies is that they are enabling a wide variety of industries to build affordable infrastructure for security monitoring (Cardenas et al., 2013).  The new Big Data technologies of the Hadoop ecosystem (including Pig, Hive, and Mahout), together with stream mining, complex-event processing, and NoSQL databases, enable the analysis of not only large-scale but also heterogeneous datasets at unprecedented scale and speed (Cardenas et al., 2013).  These technologies have transformed security analytics by facilitating the storage, maintenance, and analysis of security information (Cardenas et al., 2013).

·         BDA and the Impact of Hadoop in Time Reduction as Business Value

Big Data Analytics can provide a competitive edge in marketing by reducing the time to respond to customers through rapid data capture, aggregation, processing, and analytics.  Harrah’s (currently Caesars) Entertainment has acquired both Hadoop clusters and open-source and commercial analytics software, with the primary objective of exploring and implementing Big Data to respond in real time to customer marketing and service.  GE is another example, regarded as the most prominent creator of new service offerings based on Big Data (Davenport & Dyché, 2013).  The primary focus of GE was to optimize service contracts and maintenance intervals for industrial products.

Part-II
Hadoop Installation

The purpose of this Part is to go through the installation of Hadoop on a single cluster node using the Windows 10 operating system.  It covers fifteen significant tasks, from downloading the software from the Apache site to demonstrating a successful installation and configuration.  The installation steps are derived from the installation guide of (apache.org, 2018).  Due to the lack of system resources, the Windows operating system was the most appropriate choice for this installation and configuration, although the researcher prefers Unix over Windows due to extensive experience with Unix.  However, the installation and configuration experience on Windows has its value as well.

Task-1: Hadoop Software Download

            The purpose of this task is to download the required Hadoop software for the Windows operating system from the following link: http://www-eu.apache.org/dist/hadoop/core/stable/.   Although a higher version than 2.9.1 is available, the researcher selected this version because it is the core stable version recommended by Apache.

Task-2: Java Installation

The purpose of this task is to install Java which is required for Hadoop as indicated in the administration guide.  Java 1.8.0_111 is installed on the system as shown below.

Task-3: Extract Hadoop Zip File

The purpose of this task is to extract Hadoop zip file into a directory under C:\Hadoop-2.9.1 as shown below.

Task-4: Set Up Required System Variables

The purpose of this task is to set up the required system variables.  Set up HADOOP_HOME, as it is required per the installation guide.
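
As a sketch (assuming Hadoop was extracted to C:\Hadoop-2.9.1 as in Task-3), HADOOP_HOME can be set from a command prompt:

>setx HADOOP_HOME "C:\Hadoop-2.9.1"

The %HADOOP_HOME%\bin and %HADOOP_HOME%\sbin folders should then be added to the Path variable through the Environment Variables dialog so that the Hadoop commands are available from any prompt.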

Task-5: Edit core-site.xml

The purpose of this task is to set up the configuration of Hadoop by editing the core-site.xml file under C:\Hadoop-2.9.1\etc\hadoop and adding the fs.defaultFS property to identify the file system for Hadoop using localhost and port 9000.
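
A minimal sketch of the property block inside core-site.xml, using the standard fs.defaultFS property and the localhost:9000 address described above:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>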

Task-6: Copy mapred-site.xml.template to mapred-site.xml

The purpose of this task is to copy the MapReduce configuration template.  Copy mapred-site.xml.template to a new file, mapred-site.xml, in the same directory.
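
For example, from a command prompt:

>cd C:\Hadoop-2.9.1\etc\hadoop
>copy mapred-site.xml.template mapred-site.xml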

Task-7: Edit mapred-site.xml

The purpose of this task is to set up the configuration for Hadoop MapReduce by editing mapred-site.xml and adding the property tags between the configuration tags as shown below.
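
A minimal sketch of the property block inside mapred-site.xml, assuming MapReduce jobs are to run on YARN (the standard mapreduce.framework.name property):

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>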

Task-8: Create Two Folders for DataNode and NameNode

The purpose of this task is to create two important folders for the data node and the name node, which are required for the Hadoop file system.  Create the folder “data” under the Hadoop home C:\Hadoop-2.9.1, then create the folders “datanode” and “namenode” under C:\Hadoop-2.9.1\data.
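
For example, from a command prompt:

>mkdir C:\Hadoop-2.9.1\data
>mkdir C:\Hadoop-2.9.1\data\datanode
>mkdir C:\Hadoop-2.9.1\data\namenode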

Task-9: Edit hdfs-site.xml

The purpose of this task is to set up the configuration for Hadoop HDFS by editing the file C:\Hadoop-2.9.1\etc\hadoop\hdfs-site.xml and adding the properties for dfs.replication, dfs.namenode.name.dir, and dfs.datanode.data.dir as shown below.
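
A sketch of the property block inside hdfs-site.xml, assuming a single-node cluster (replication factor 1) and the namenode and datanode folders created in Task-8:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///C:/Hadoop-2.9.1/data/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///C:/Hadoop-2.9.1/data/datanode</value>
  </property>
</configuration>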

Task-10: Edit yarn-site.xml

The purpose of this task is to set the configuration for the YARN tool by editing the file C:\Hadoop-2.9.1\etc\hadoop\yarn-site.xml and adding the yarn.nodemanager.aux-services property with the value mapreduce_shuffle as shown below.
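
A sketch of the property block inside yarn-site.xml; the mapreduce_shuffle auxiliary service and its handler class follow the standard Apache configuration:

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>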

Task-11: Overcome Java Error

The purpose of this task is to overcome a Java-related error.  Edit C:\Hadoop-2.9.1\etc\hadoop\hadoop-env.cmd and set JAVA_HOME to overcome the error shown below.
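
As a sketch of the fix, add a line such as the following to hadoop-env.cmd (the path C:\Java\jdk1.8.0_111 is an assumption; the actual value depends on where the JDK from Task-2 was installed, and a path containing spaces, such as C:\Program Files, is a frequent cause of this error):

set JAVA_HOME=C:\Java\jdk1.8.0_111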

Task-12: Test the Configuration

            The purpose of this task is to test the current configuration and setup by issuing the following command before running Hadoop.  At this stage, the command throws an error indicating that HADOOP_COMMON_HOME is not found.

>hdfs namenode -format

Task-13: Overcome the HADOOP_COMMON_HOME Error

To overcome the HADOOP_COMMON_HOME “not found” error, edit hadoop-env.cmd and add the following, then issue the command again; it will pass as shown below.
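
As a sketch of what such additions typically look like (these lines point the common, HDFS, YARN, and MapReduce home variables at the Hadoop home; they follow the general pattern in the installation guide and are not necessarily the author's exact edit):

set HADOOP_HOME=C:\Hadoop-2.9.1
set HADOOP_COMMON_HOME=%HADOOP_HOME%
set HADOOP_HDFS_HOME=%HADOOP_HOME%
set HADOOP_YARN_HOME=%HADOOP_HOME%
set HADOOP_MAPRED_HOME=%HADOOP_HOME%
set HADOOP_CONF_DIR=%HADOOP_HOME%\etc\hadoop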

Task-14: Start Hadoop Processes

The purpose of this task is to start the Hadoop DFS and YARN processes.
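
For example, using the scripts that ship with the Hadoop 2.x distribution in the sbin folder:

>cd C:\Hadoop-2.9.1\sbin
>start-dfs.cmd
>start-yarn.cmd

start-dfs.cmd launches the NameNode and DataNode processes, and start-yarn.cmd launches the ResourceManager and NodeManager processes.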

Task-15: Run the Cluster Page from the Browser

            The purpose of this task is to open the cluster page for Hadoop from the browser after the previous configuration setup.  If the configuration is implemented successfully, the cluster page is displayed with the Hadoop functionality as shown below; otherwise, the browser can return a 404 page-not-found error.
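
Assuming the Hadoop 2.x default ports have not been changed in the configuration, the pages can typically be reached at:

http://localhost:8088   (YARN ResourceManager cluster page)
http://localhost:50070  (HDFS NameNode web user interface)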

Conclusion

This project has discussed various significant topics related to Big Data Analytics.  It addressed two significant parts.  Part-I discussed Big Data and the emerging technology of Hadoop.  It provided an overview of the Hadoop ecosystem, its building blocks, benefits, and limitations.  It also discussed the MapReduce framework and its benefits and limitations.  Part-I also provided a few success stories for the use of Hadoop technology with Big Data Analytics.  Part-II addressed the installation and configuration of Hadoop on the Windows operating system through fifteen essential tasks.  It also addressed the errors encountered during the configuration setup and the techniques used to overcome them in order to proceed successfully with the Hadoop installation.

References

Abbasi, A., Sarker, S., & Chiang, R. (2016). Big data research in information systems: Toward an inclusive research agenda. Journal of the Association for Information Systems, 17(2), 3.

Alam, A., & Ahmed, J. (2014). Hadoop architecture and its issues. Paper presented at the Computational Science and Computational Intelligence (CSCI), 2014 International Conference on.

apache.org. (2018). Hadoop Installation Guide – Windows. Retrieved from https://wiki.apache.org/hadoop/Hadoop2OnWindows.

Bakshi, K. (2012). Considerations for big data: Architecture and approach. Paper presented at the Aerospace Conference, 2012 IEEE.

Bao, Y., Ren, L., Zhang, L., Zhang, X., & Luo, Y. (2012). Massive sensor data management framework in cloud manufacturing based on Hadoop. Paper presented at the Industrial Informatics (INDIN), 2012 10th IEEE International Conference on.

Bates, D. W., Saria, S., Ohno-Machado, L., Shah, A., & Escobar, G. (2014). Big data in health care: using analytics to identify and manage high-risk and high-cost patients. Health Affairs, 33(7), 1123-1131.

Bhatotia, P., Wieder, A., Rodrigues, R., Acar, U. A., & Pasquin, R. (2011). Incoop: MapReduce for incremental computations. Paper presented at the Proceedings of the 2nd ACM Symposium on Cloud Computing.

Bi, Z., & Cochran, D. (2014). Big data analytics with applications. Journal of Management Analytics, 1(4), 249-265.

Bu, Y., Howe, B., Balazinska, M., & Ernst, M. D. (2010). HaLoop: Efficient iterative data processing on large clusters. Proceedings of the VLDB Endowment, 3(1-2), 285-296.

Cardenas, A. A., Manadhata, P. K., & Rajan, S. P. (2013). Big data analytics for security. IEEE Security & Privacy, 11(6), 74-76.

Chan, J. O. (2014). An architecture for big data analytics. Communications of the IIMA, 13(2), 1.

Chandarana, P., & Vijayalakshmi, M. (2014). Big Data analytics frameworks. Paper presented at the Circuits, Systems, Communication and Information Technology Applications (CSCITA), 2014 International Conference on.

Chang, V., Kuo, Y.-H., & Ramachandran, M. (2016). Cloud computing adoption framework: A security framework for business clouds. Future Generation computer systems, 57, 24-41. doi:10.1016/j.future.2015.09.031

Chopra, A., & Madan, S. (2015). Big Data: A Trouble or A Real Solution? International Journal of Computer Science Issues (IJCSI), 12(2), 221.

CSA, C. S. A. (2013). Big Data Analytics for Security Intelligence. Big Data Working Group.

Davenport, T. H., & Dyché, J. (2013). Big data in big companies. International Institute for Analytics.

De Mauro, A., Greco, M., & Grimaldi, M. (2015). What is big data? A consensual definition and a review of key research topics. Paper presented at the AIP Conference Proceedings.

Ekanayake, J., Li, H., Zhang, B., Gunarathne, T., Bae, S.-H., Qiu, J., & Fox, G. (2010). Twister: a runtime for iterative mapreduce. Paper presented at the Proceedings of the 19th ACM international symposium on high performance distributed computing.

Elnikety, E., Elsayed, T., & Ramadan, H. E. (2011). iHadoop: asynchronous iterations for MapReduce. Paper presented at the Cloud Computing Technology and Science (CloudCom), 2011 IEEE Third International Conference on.

Emani, C. K., Cullot, N., & Nicolle, C. (2015). Understandable big data: A survey. Computer science review, 17, 70-81.

Erl, T., Khattak, W., & Buhler, P. (2016). Big Data Fundamentals: Concepts, Drivers & Techniques: Prentice Hall Press.

Fadzil, A. F. A., Khalid, N. E. A., & Manaf, M. (2012). Performance of scalable off-the-shelf hardware for data-intensive parallel processing using MapReduce. Paper presented at the Computing and Convergence Technology (ICCCT), 2012 7th International Conference on.

Gantz, J., & Reinsel, D. (2011). Extracting value from chaos. IDC iview, 1142, 1-12.

Géczy, P. (2014). Big data characteristics. The Macrotheme Review, 3(6), 94-104.

Gupta, B., & Jyoti, K. (2014). Big data analytics with hadoop to analyze targeted attacks on enterprise data.

Gupta, R., Gupta, H., & Mohania, M. (2012). Cloud computing and big data analytics: what is new from databases perspective? Paper presented at the International Conference on Big Data Analytics.

Hirzel, M., Andrade, H., Gedik, B., Jacques-Silva, G., Khandekar, R., Kumar, V., . . . Soulé, R. (2013). IBM streams processing language: Analyzing big data in motion. IBM Journal of Research and Development, 57(3/4), 7: 1-7: 11.

Hu, H., Wen, Y., Chua, T.-S., & Li, X. (2014). Toward scalable systems for big data analytics: A technology tutorial. IEEE Access, 2, 652-687.

Hu, P., & Dai, W. (2014). Enhancing fault tolerance based on Hadoop cluster. International Journal of Database Theory and Application, 7(1), 37-48.

Inukollu, V. N., Arsi, S., & Ravuri, S. R. (2014). Security issues associated with big data in cloud computing. International Journal of Network Security & Its Applications, 6(3), 45.

Jain, R. (2013). Big Data Fundamentals. Retrieved from http://www.cse.wustl.edu/~jain/cse570-13/ftp/m_10abd.pdf.

Jinquan, D., Jie, H., Shengsheng, H., Yan, L., & Yuanhao, S. (2012). The Hadoop Stack: New Paradigm for Big Data Storage and Processing. Intel Technology Journal, 16(4), 92-110.

Kaisler, S., Armour, F., Espinosa, J. A., & Money, W. (2013). Big data: issues and challenges moving forward. Paper presented at the System Sciences (HICSS), 2013 46th Hawaii International Conference on System Sciences.

Katal, A., Wazid, M., & Goudar, R. (2013). Big data: issues, challenges, tools and good practices. Paper presented at the Contemporary Computing (IC3), 2013 Sixth International Conference on Contemporary Computing.

Khan, N., Yaqoob, I., Hashem, I. A. T., Inayat, Z., Mahmoud Ali, W. K., Alam, M., . . . Gani, A. (2014). Big Data: Survey, Technologies, Opportunities, and Challenges. The Scientific World Journal, 2014.

Krishnan, K. (2013). Data warehousing in the age of big data: Newnes.

Lee, K.-H., Lee, Y.-J., Choi, H., Chung, Y. D., & Moon, B. (2012). Parallel data processing with MapReduce: a survey. ACM SIGMOD Record, 40(4), 11-20.

Liang, Y., Li, G., Wang, L., & Hu, Y. (2011). Dacoop: Accelerating data-iterative applications on Map/Reduce cluster. Paper presented at the Parallel and Distributed Computing, Applications and Technologies (PDCAT), 2011 12th International Conference on.

Malewicz, G., Austern, M. H., Bik, A. J., Dehnert, J. C., Horn, I., Leiser, N., & Czajkowski, G. (2010). Pregel: a system for large-scale graph processing. Paper presented at the Proceedings of the 2010 ACM SIGMOD International Conference on Management of data.

Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Byers, A. H. (2011). Big data: The next frontier for innovation, competition, and productivity.

Minelli, M., Chambers, M., & Dhiraj, A. (2013). Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today’s Businesses: John Wiley & Sons.

Mishra, B. S. P., Dehuri, S., & Kim, E. (2016). Techniques and Environments for Big Data Analysis: Parallel, Cloud, and Grid Computing (Vol. 17): Springer.

Moorthy, M., Baby, R., & Senthamaraiselvi, S. (2014). An Analysis for Big Data and its Technologies. International Journal of Science, Engineering and Computer Technology, 4(12), 412.

Nasser, T., & Tariq, R. (2015). Big Data Challenges. J Comput Eng Inf Technol 4: 3. doi:10.4172/2324, 9307, 2.

Polato, I., Ré, R., Goldman, A., & Kon, F. (2014). A comprehensive view of Hadoop research—A systematic literature review. Journal of Network and Computer Applications, 46, 1-25.

Ramesh, B. (2015). Big Data Architecture Big Data (pp. 29-59): Springer.

Sakr, S., & Gaber, M. (2014). Large Scale and big data: Processing and Management: CRC Press.

White, T. (2012). Hadoop: The definitive guide: O’Reilly Media, Inc.

Yan, C., Yang, X., Yu, Z., Li, M., & Li, X. (2012). Incmr: Incremental data processing based on mapreduce. Paper presented at the Cloud Computing (CLOUD), 2012 IEEE 5th International Conference on.

Yang, H.-c., Dasdan, A., Hsiao, R.-L., & Parker, D. S. (2007). Map-reduce-merge: simplified relational data processing on large clusters. Paper presented at the Proceedings of the 2007 ACM SIGMOD international conference on Management of data.

Zhang, Y., Chen, S., Wang, Q., & Yu, G. (2015). i^2MapReduce: Incremental MapReduce for Mining Evolving Big Data. IEEE transactions on knowledge and data engineering, 27(7), 1906-1919.

Zhang, Y., Gao, Q., Gao, L., & Wang, C. (2012). imapreduce: A distributed computing framework for iterative computation. Journal of Grid Computing, 10(1), 47-68.

 

XML in Healthcare and eCommerce

Dr. Aly, O.
Computer Science

The purpose of this discussion is to address the advantages and disadvantages of XML used in big data analytics for large healthcare organizations. The discussion also presents the use of XML in the healthcare industry as well as in another industry such as eCommerce.

Advantages of XML

XML has several advantages, such as simplicity, platform and vendor independence, extensibility, reuse by many applications, separation of content and presentation, and improved load balancing (Connolly & Begg, 2015; Fawcett, Ayers, & Quin, 2012).  XML also provides support for the integration of data from multiple sources (Connolly & Begg, 2015; Fawcett et al., 2012).  XML can describe data from a wide variety of applications (Connolly & Begg, 2015; Fawcett et al., 2012).  More advanced search engine capability is another advantage of XML (Connolly & Begg, 2015).  (Brewton, Yuan, & Akowuah, 2012) have identified two significant benefits of XML.  First, XML supports tags that are created by users, which makes the language fully extensible and overcomes any tag limitation.  The second significant benefit of XML in healthcare is versatility: any data type can be modeled, and tags can be created for specific contexts.
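
As a purely illustrative sketch of this extensibility (the element names below are invented for illustration and do not follow any particular healthcare standard such as CDA), a healthcare provider could define its own context-specific tags:

<patientRecord id="12345">
  <diagnosis code="E11.9">Type 2 diabetes mellitus</diagnosis>
  <medication dose="500 mg" frequency="twice daily">Metformin</medication>
</patientRecord>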

Disadvantages of XML

The specification of the namespace prefix within DTDs is a significant limitation, as users cannot choose their own namespace prefix but must use the prefix defined within the DTD (Fawcett et al., 2012).  This limitation exists because W3C completed the XML Recommendation before finalizing how namespaces would work.  While DTDs have poor support for XML namespaces, they still play an essential part in the XML Recommendation.  Furthermore, (Forster, 2008) has identified a few disadvantages of XML.  Inefficiency is one of these limitations, as XML was initially designed to accommodate the exchange of data between nodes of different systems rather than to serve as a database storage platform.  XML is described as inefficient compared to other storage formats (Forster, 2008).  The tags of XML make it readable to humans but require additional storage and bandwidth (Forster, 2008).  Encoded image data represented in XML requires another program to be displayed, as it must be decoded and then reassembled into an image (Forster, 2008).  In addition, there are three kinds of XML parsers, namely programs, APIs, and engines, with which inexperienced developers may not be familiar (Forster, 2008).  XML also lacks rendering instructions, as it is a back-end technology for data storage and transmission.  (Brewton et al., 2012) have identified two significant limitations of XML.  The first is the lack of applications that can process XML data and make its data useful; browsers use HTML to render XML documents, which indicates that XML cannot be used as a language independent of HTML.  The second major limitation of XML is the unlimited flexibility of the language: tags are created by the user, and there is no standard, accepted set of tags to be used in an XML document.  The result of this limitation is that developers cannot create general applications, as each company will have its own application with its own set of tags.

XML in Healthcare

Concerning XML in healthcare, (Brewton et al., 2012) have indicated that XML is a solution to the problem of finding a reliable and standardized means for storing and exchanging clinical documents.  The American National Standards Institute has accredited Health Level 7 (HL7) as the organization responsible for setting up many communication standards used across America (Brewton et al., 2012).  The goal of this organization is to provide standards for the exchange, management, and integration of data that support clinical patient care and the management, delivery, and evaluation of healthcare services (Brewton et al., 2012).  Furthermore, HL7 has developed the Clinical Document Architecture (CDA) to provide standards for the representation of clinical documents such as discharge summaries and progress notes.  The goal of CDA is to solve the problem of finding a reliable and standardized means for storing and exchanging clinical documents by specifying a markup and semantic structure through XML, allowing medical institutions to share clinical documents.  HL7 version 3 includes the rules for messaging as well as CDA, which are implemented with XML and are derived from the Reference Information Model (RIM).  In addition, XML supports the hierarchical structure of CDA (Brewton et al., 2012).  Healthcare data must be secured to protect the privacy of patients.  XML provides signature capabilities which operate identically to a regular digital signature (Brewton et al., 2012).  In addition to the XML signature, XML has encryption capabilities that address requirements for areas not covered by the secure socket layer technique (Brewton et al., 2012).  (Goldberg et al., 2005) have identified some limitations of XML when working with images in the biological domain.  A severe problem is that the bulk of an image file is represented by the pixels in the image and not by the metadata.  A related problem is that XML is verbose, meaning that an XML file is already larger than the equivalent binary file, and image files are already quite large, which causes another problem when using XML in healthcare (Goldberg et al., 2005).

XML in eCommerce

(Sadath, 2013) has discussed some benefits and limitations of XML in the eCommerce domain.  XML has the advantage of being a flexible hierarchical model suitable for representing semi-structured data.  It is used effectively in data mining and is described as the most common tool for data transformation between different types of applications.  In data mining with XML, there are two approaches to accessing an XML document: keyword-based search and query-answering.  The keyword-based approach offers little advantage because the search takes place only on the textual content of the document.  When using the query-answering approach, however, the structure of the document must be known in advance, which is often not the case.  The consequence of such a lack of knowledge about the structure can be information overload, where too much data is returned, or incorrect answers when the keywords used do not match the information in the document (Sadath, 2013).  Thus, researchers have made various efforts to find the best approach for data mining in XML, such as XQuery or Tree-based Association Rules (TARs) as a means to represent intentional knowledge in native XML.

References

Brewton, J., Yuan, X., & Akowuah, F. (2012). XML in health information systems. Paper presented at the Proceedings of the International Conference on Bioinformatics & Computational Biology (BIOCOMP).

Connolly, T., & Begg, C. (2015). Database Systems: A Practical Approach to Design, Implementation, and Management (6th Edition ed.): Pearson.

Fawcett, J., Ayers, D., & Quin, L. R. (2012). Beginning XML: John Wiley & Sons.

Forster, D. (2008). Advantages and Disadvantages that You Should Know About XML. Retrieved from https://www.informdecisions.com/downloads/XML_Advantages_and_Disadvantages.pdf.

Goldberg, I. G., Allan, C., Burel, J.-M., Creager, D., Falconi, A., Hochheiser, H., . . . Swedlow, J. R. (2005). The Open Microscopy Environment (OME) Data Model and XML file: open tools for informatics and quantitative analysis in biological imaging. Genome biology, 6(5), R47.

Sadath, L. (2013). Data mining in E-commerce: A CRM Platform. International Journal of Computer Applications, 68(24).