"Artificial Intelligence without Big Data Analytics is lame, and Big Data Analytics without Artificial Intelligence is blind." Dr. O. Aly, Computer Science.
The purpose of this discussion is to analyze the impact of XML on MapReduce. The discussion addresses the techniques and approaches proposed by various research studies for processing large XML documents using MapReduce. The XML fragmentation process in the absence and presence of MapReduce is also discussed to provide a better understanding of the complex processing of large XML documents in a distributed, scalable MapReduce environment.
XML Query Processing Using MapReduce
The XML format has been used to store data for multiple applications (Aravind & Agrawal, 2014). Data needs to be ingested into Hadoop and analyzed to obtain value from the XML data (Aravind & Agrawal, 2014). The Hadoop ecosystem needs to understand XML as it is ingested and be able to interpret it (Aravind & Agrawal, 2014). MapReduce is a building block of the Hadoop ecosystem. In the age of Big Data, XML documents are expected to be very large and to require scalable, distributed processing. Processing XML queries with MapReduce requires decomposing a big XML document and distributing its portions to different nodes. The relational approach is not appropriate because it is expensive: transforming a big XML document into relational database tables can be extremely time consuming, as can θ-joins among relational tables (Wu, 2014). Various research studies have proposed approaches to implement native XML query processing algorithms using MapReduce.
(Dede, Fadika, Gupta, & Govindaraju, 2011) have discussed and analyzed the scalable and distributed processing of scientific XML data and how the MapReduce model should be used in XML metadata indexing. The study has presented performance results using two MapReduce implementations: the Apache Hadoop framework and the proposed LEMO-MR framework. The study has provided an indexing framework that is capable of indexing and efficiently searching large-scale scientific XML datasets. The framework has been tailored for integration with any framework that uses the MapReduce model to meet scalability and variety requirements.
(Fegaras, Li, Gupta, & Philip, 2011) have also discussed and analyzed query optimization in a MapReduce environment. The study has presented a novel query language for large-scale analysis of XML data in a MapReduce environment, called MRQL (MapReduce Query Language), designed to capture the most common data analysis tasks in a form that can be optimized. XML data fragmentation is also discussed in this study. A parallel data computation expects its input data to be fragmented into small, manageable pieces that determine the granularity of the computation. In a MapReduce environment, each map worker is assigned a data split that consists of data
fragments. A map worker processes these
data one fragment at a time. The fragment is a relational tuple for
relational data that is structured, while for a text file, a fragment can be a single line in the file. However, for hierarchical data and nested
collections data such as XML data, the fragment
size and structure depend on the actual application that processes these
data. For instance, XML data may consist
of some XML documents, each one containing
a single XML element, whose size may exceed the memory capability of a map worker.
Thus, when processing XML data, it is recommended to allow custom fragmentation to meet a wide range of application requirements. (Fegaras et al., 2011) have argued that Hadoop provides a simple input format for XML fragmentation based on a single tag name. An XML document is divided into data splits, which may start and end at arbitrary points in the document, even in the middle of tag names. (Fegaras et al., 2011; Sakr & Gaber, 2014) have indicated that this input format allows reading the document as a stream of string fragments, so that each string contains a single complete element with the requested tag name. An XML parser can then be used to parse these strings and convert them to objects. The fragmentation process is complex because the requested elements may cross data split boundaries, and these data splits may reside on different data nodes in the distributed file system (DFS). The Hadoop DFS provides the implicit solution to this problem by allowing a scan to continue beyond a data split into the next, subject to some overhead for transferring data between nodes. (Fegaras et al., 2011) have proposed an XML fragmentation technique built on top of the existing Hadoop XML input format, providing a higher level of abstraction and better customization. It is a higher level of abstraction
because it constructs XML data in the MRQL
data model, ready to be processed by MRQL queries instead of deriving a string
for each XML element (Fegaras et al., 2011).
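To illustrate the tag-based fragmentation idea described above, the following is a minimal Python sketch (not the Hadoop or MRQL implementation) that extracts complete elements with a requested tag name from a stream of text chunks, where the chunks play the role of data splits that may cut an element at an arbitrary point; the element and tag names are hypothetical.

```python
import xml.dom.minidom

def xml_fragments(chunks, tag):
    """Yield complete <tag>...</tag> elements from a stream of text chunks.

    Chunks play the role of data splits: an element may start in one chunk
    and end in the next, so a buffer is carried across chunk boundaries.
    (Prefix matching of tag names is ignored for simplicity.)
    """
    start, end = "<" + tag, "</" + tag + ">"
    buf = ""
    for chunk in chunks:
        buf += chunk
        while True:
            s = buf.find(start)
            if s == -1:
                break
            e = buf.find(end, s)
            if e == -1:
                break
            yield buf[s:e + len(end)]
            buf = buf[e + len(end):]

# Usage: each yielded string holds one complete element with the requested tag
# name and can be handed to any XML parser.
chunks = ["<people><person><name>Ada</na",
          "me></person><person><name>Bob</name></person></people>"]
for fragment in xml_fragments(chunks, "person"):
    print(xml.dom.minidom.parseString(fragment).documentElement.tagName)
```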
(Sakr & Gaber, 2014) have also briefly discussed another language that has been proposed to support distributed XML processing using the MapReduce framework, called ChuQL. It is a MapReduce-based extension of the syntax, grammar, and semantics of XQuery, the standard W3C language for querying XML documents. The ChuQL implementation takes care of distributing the computation to multiple XQuery engines running on Hadoop nodes, as described by one or more ChuQL MapReduce expressions. The “word count” example program can be represented in ChuQL using its extended expressions, where the MapReduce expression describes a MapReduce job. The input and output clauses are used to read from and write to HDFS, respectively. The rr and rw clauses describe the record reader and record writer, respectively. The map and reduce clauses represent the standard map and reduce phases of the framework; they process XML values or key/value pairs of XML values, specified using XQuery expressions, to match the MapReduce model. Figure 1 shows the word count example in ChuQL using XML in a distributed environment.
Figure 1. The Word Count Example Program in ChuQL Using XML in a Distributed Environment (Sakr & Gaber, 2014).
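ChuQL itself expresses this example in extended XQuery; purely as an illustration of the same map/reduce shape (with hypothetical record contents, not ChuQL syntax), the following Python sketch has the map phase emit (word, 1) pairs from the text of each XML record and the reduce phase sum the counts per word.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Map phase: emit (word, 1) for every word in the text of each XML record.
def map_phase(xml_records):
    for record in xml_records:
        text = "".join(ET.fromstring(record).itertext())
        for word in text.split():
            yield word, 1

# Reduce phase: sum the counts grouped by word.
def reduce_phase(pairs):
    groups = defaultdict(int)
    for word, count in pairs:
        groups[word] += count
    return dict(groups)

records = ["<doc>big data big value</doc>", "<doc>big xml</doc>"]
print(reduce_phase(map_phase(records)))
# {'big': 3, 'data': 1, 'value': 1, 'xml': 1}
```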
(Vasilenko & Kurapati, 2014) have discussed and analyzed the efficient processing
of XML documents in Hadoop MapReduce environment. They argued that the most common approach to
process XML data is to introduce a custom solution based on the user-defined functions or scripts. The common
choices vary from introducing an ETL process for extracting the data of
interest to the transformation of XML
into other formats that are natively supported by Hive. They have addressed a generic approach to
handling XML based on Apache Hive architecture.
The researchers have described an approach that complements the existing
family of Hive serializers and de-serializers for other popular data formats,
such as JSON, and makes it much easier
for users to deal with large XML datasets. The implementation included logical splits of the input files, each of which is assigned to an individual mapper. The mapper relies on the implemented Apache Hive XML SerDe to break the split into XML fragments using specified start/end byte sequences. Each fragment corresponds to a single Hive record. The fragments are handled by the XML processor, which extracts the values for the record columns using specified XPath queries. The reduce phase was not required in this implementation (Vasilenko & Kurapati, 2014).
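As a rough illustration of that fragment-to-record mapping (not the Hive SerDe itself), the following Python sketch parses one hypothetical XML fragment and fills the columns of a single record using XPath-style queries:

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment delimited by the <order> start/end byte sequences;
# in the approach described above, each such fragment becomes one Hive record.
fragment = """
<order id="1001">
  <customer><name>Ada Lovelace</name></customer>
  <total currency="USD">42.50</total>
</order>
"""

root = ET.fromstring(fragment)
record = {
    "order_id": root.get("id"),                          # attribute value
    "customer_name": root.findtext("./customer/name"),   # XPath-style query
    "total": float(root.findtext("./total")),
}
print(record)
# {'order_id': '1001', 'customer_name': 'Ada Lovelace', 'total': 42.5}
```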
(Wu, 2014) has discussed and analyzed the partitioning of XML documents and the distribution of XML fragments to different compute nodes, which can introduce high overhead for transferring XML fragments from one node to another during MapReduce execution. The author has proposed a technique that uses MapReduce to distribute the labels in inverted lists across a computing cluster so that structural joins can be performed in parallel to process queries, together with an optimization technique that reduces the computing space in the proposed framework to improve query processing performance. The author has argued that these approaches differ from the current approach of shredding an XML document and distributing it across the nodes of a cluster. The process includes
reading and distributing the inverted lists that are required for input queries during the query processing, and
their size is much smaller than the size of the whole document. The process also includes the partition of
the total computing space for structural joins so that each sub-space can be
handled by one reducer to perform structural joins. The researchers have also proposed a pruning-based
optimization algorithm to improve the performance of their approach.
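The core of that structural-join step can be pictured with a small Python sketch using hypothetical labels and a naive nested-loop containment test (the paper's version partitions the computing space across reducers and is far more efficient): each inverted list holds region labels for one tag, and an ancestor-descendant pair is reported when one region contains the other.

```python
# Each entry in an inverted list is a region label (start, end) assigned to an
# element during a pre-order numbering of the document.  Element A is an
# ancestor of element D exactly when A.start < D.start and D.end < A.end.
inverted_lists = {                       # hypothetical labels for a tiny document
    "dept": [(1, 10), (11, 20)],
    "name": [(2, 3), (5, 6), (12, 13)],
}

def structural_join(ancestors, descendants):
    """Return (ancestor, descendant) label pairs satisfying containment."""
    return [(a, d) for a in ancestors for d in descendants
            if a[0] < d[0] and d[1] < a[1]]

# Answering //dept//name only needs the two inverted lists, not the document.
print(structural_join(inverted_lists["dept"], inverted_lists["name"]))
# [((1, 10), (2, 3)), ((1, 10), (5, 6)), ((11, 20), (12, 13))]
```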
Conclusion
This discussion has addressed XML query processing in a MapReduce environment. It has covered the various techniques and approaches proposed by research studies for processing large XML documents using MapReduce. The XML fragmentation process in the absence and presence of MapReduce has also been discussed to provide a better understanding of the complex processing of large XML documents in a distributed, scalable MapReduce environment.
Dede, E., Fadika,
Z., Gupta, C., & Govindaraju, M. (2011). Scalable and distributed processing of scientific XML data. Paper
presented at the Grid Computing (GRID), 2011 12th IEEE/ACM International
Conference on.
Fegaras, L., Li,
C., Gupta, U., & Philip, J. (2011). XML
Query Optimization in Map-Reduce.
Sakr, S., & Gaber, M. (2014). Large Scale and Big Data: Processing and Management. CRC Press.
Vasilenko, D., & Kurapati, M. (2014). Efficient processing of XML documents in Hadoop MapReduce.
Wu, H.
(2014). Parallelizing structural joins to
process queries over big XML data using MapReduce. Paper presented at the
International Conference on Database and Expert Systems Applications.
The purpose of this discussion is to analyze the design of XML documents. The discussion also examines XML document design from the perspective of the users for improved performance. The discussion begins with the XML design principles and a detailed analysis of each principle. XML document design is then examined from the performance perspective, focusing on the appropriate use of elements and attributes when designing an XML document.
XML Design Principles
XML document design has guidelines and principles that developers should follow. These guidelines are divided into four major principles for the use of elements and attributes: the core content principle, the structured information principle, the readability principle, and the element/attribute binding principle. Figure 1 summarizes these principles of XML document design.
Figure 1. The Four Principles for Element and Attribute Use in XML Document Design.
Core Content Principle
The core content principle guides the choice between an element and an attribute. If the information is part of the essential material of a human-readable document, the use of elements is recommended. If the information is aimed at machine-oriented record formats and helps applications process the primary communication, the use of attributes is recommended. One example of this principle is a title that is placed in an attribute when it should be placed in element content. Another example is internal product identifiers thrown as elements into the detailed records of the products; in some cases attributes are more appropriate than elements, because an internal product code with an extended format would not be of primary interest to most readers or processors of the document. By analogy, data should be placed in elements, and metadata should be placed in attributes (Ogbuji, 2004).
Since elements and attributes are the two main building blocks of XML document design, developers should be aware of legal and illegal elements and attributes. (Fawcett, Ayers, & Quin, 2012) have identified legal and illegal element forms. For instance, spaces are allowed after a name, but names cannot contain spaces. Digits can appear within a name, while names cannot begin with a digit. Spaces can appear between the name and the forward slash in a self-closing element, while spaces before the name are not allowed. A hyphen is allowed within a name, but not as the first character. Non-Roman characters are allowed if they are classified as letters by the Unicode specifications; one of their examples uses a Greek element name for a forename. Start and end tags must also match case-sensitively (Fawcett et al., 2012). Table 1 shows the legal and illegal element forms to consider when designing an XML document.
Table 1. Legal vs. Illegal Elements
Consideration for XML Design Document (Fawcett et al., 2012).
For attributes, (Fawcett et al., 2012) have likewise identified legal and illegal forms. A single quote inside double-quote delimiters is allowed. Double quotes inside single-quote delimiters are also allowed, while a single quote inside single-quote delimiters is not. Attribute names cannot begin with a digit. Two attributes with the same name are not allowed, and mismatched delimiters are not allowed. Table 2 shows the legal and illegal attribute forms to consider when designing an XML document.
Table 2. Legal vs. Illegal Attributes
Consideration for XML Design Document (Fawcett et al., 2012).
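A few of those rules can be checked directly with any XML parser; the following Python sketch (with made-up sample snippets) runs some of the legal and illegal cases from Tables 1 and 2 through the standard-library parser:

```python
import xml.etree.ElementTree as ET

samples = {
    "legal: digit inside a name":           "<line2>ok</line2>",
    "illegal: name starts with a digit":    "<2line>bad</2line>",
    "legal: hyphen inside a name":          "<first-name>Ada</first-name>",
    "illegal: duplicate attribute":         '<name first="Ada" first="Bob"/>',
    "illegal: mismatched quote delimiters": "<name first=\"Ada'/>",
}

for label, doc in samples.items():
    try:
        ET.fromstring(doc)            # raises ParseError for ill-formed XML
        print("well-formed:    ", label)
    except ET.ParseError as err:
        print("not well-formed:", label, "-", err)
```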
Structured Information Principle
Since the element is the extensible engine for expressing structure in XML, the use of an element is recommended if the information is expressed in a structured form, especially if the structure may be extended. The use of an attribute is recommended if the information is expressed as an atomic token, since attributes are designed to express simple properties of the information represented in an element (Ogbuji, 2004). A date is an excellent example: it has a fixed structure and acts as a single token, and hence can be stored in an attribute. Personal names, by contrast, are recommended to be placed in element content rather than in attributes, since personal names have a variable structure and are rarely atomic tokens. Figure 2 shows the name as an element, while Figure 3 shows the name as an attribute.
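The contrast can be sketched in Python with hypothetical author records (this illustrates the principle and is not the code from Figures 2 and 3): the name in element content can later grow internal structure, while the name in an attribute stays an opaque token; the fixed-format birth date stays in an attribute in both cases.

```python
import xml.etree.ElementTree as ET

# Name in element content (recommended): the structure can be extended with
# given/family parts without breaking existing consumers.
as_element = ET.fromstring(
    '<author born="1906-12-09">'
    "<name><given>Grace</given><family>Hopper</family></name>"
    "</author>"
)

# Name squeezed into an attribute: a single opaque token, hard to extend.
as_attribute = ET.fromstring('<author born="1906-12-09" name="Grace Hopper"/>')

print(as_element.findtext("./name/given"))   # Grace
print(as_attribute.get("name"))              # Grace Hopper
```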
Readability Principle
If the information is intended to be read and understood by a person, the use of an element is recommended. If the information is mainly for machine consumption, the use of an attribute is recommended (Ogbuji, 2004). A URL is an example: it is of little use to a reader without a computer to retrieve the referenced resource (Ogbuji, 2004).
Element/Attribute Binding
The use of an element is recommended if its value needs to be modified by another attribute (Ogbuji, 2004). An attribute should provide properties of, or modifications to, the element it is attached to (Ogbuji, 2004).
XML Design Document Examination
One of the best practices identified by IBM for DB2 is to use attributes and elements appropriately in XML (IBM, 2018). Although it is documented for DB2, it can be applied to the design of any application using XML documents, because elements and attributes are the building blocks of XML as discussed above. The example examined here involves a menu and the use of elements and attributes.
If a restaurant menu is developed as an XML document and the portion sizes of items are placed in the menu, the core content principle is applied under the assumption that the portion size is not important to the reader of the menu. Following the structured information principle, the portion measurement and its unit should not be combined into a single attribute. Figure 4 shows the code using the core content principle, while Figure 5 shows the code using the structured information principle.
However, following the structured information principle as in Figure 5 lets portion-unit modify portion-size, which is not recommended; an attribute is meant to modify an element, the menu-item element in this example. Thus, the solution is to modify the code so that the element is modified by the portion-unit attribute. The result of this code shows the portion size to the reader, as shown in Figure 6.
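A minimal Python sketch of the corrected shape, with hypothetical element and attribute names rather than the exact code of Figure 6: the portion size sits in element content, and the portion-unit attribute modifies the menu-item element rather than another attribute.

```python
import xml.etree.ElementTree as ET

# Hypothetical menu fragment: the portion size is element content and the
# portion-unit attribute modifies the menu-item element that contains it.
menu = ET.fromstring(
    '<menu>'
    '  <menu-item portion-unit="oz">'
    '    <name>Lamb chops</name>'
    '    <portion-size>16</portion-size>'
    '  </menu-item>'
    '</menu>'
)

for item in menu.findall("./menu-item"):
    print(item.findtext("./name"), "-",
          item.findtext("./portion-size"), item.get("portion-unit"))
# Lamb chops - 16 oz
```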
After the code is modified so that the element is modified by the portion-unit attribute, the core content and readability principles come into play. This modification contradicts the original decision, based on the core content principle, that the portion size is not essential to the reader. Therefore, XML developers should judge which principle to apply based on the requirements.
Another XML menu example is available at the following link: https://www.w3schools.com/xml/default.asp. The code from that link, shown in Figure 7, has the attributes modifying the elements, which is the recommended practice.
Conclusion
This assignment has focused on XML document design. The discussion has covered the four major principles that should be considered when designing an XML document. The four principles revolve around the two building blocks, the element and the attribute. The use of elements is recommended for human-readable content, while the use of attributes is recommended for machine-oriented records. The use of elements is also recommended for information that is expressed in a structured form, especially if the structure is extensible, while the use of attributes is recommended for information expressed as an atomic token. If a value would otherwise be modified by another attribute, the use of an element is recommended. XML document design should also consider the legal and illegal forms of elements and attributes.
A few examples have been provided to demonstrate the use of elements versus attributes and ways to improve the code for good performance as well as good practice. The discussion was limited to the use of elements and attributes and to performance considerations from that perspective. However, XML document design involves other performance considerations, for the database, for parsing, and for the data warehouse, as discussed in (IBM, 2018; Mahboubi & Darmont, 2009; Nicola & John, 2003; Su-Cheng, Chien-Sing, & Mustapha, 2010).
References
Fawcett, J.,
Ayers, D., & Quin, L. R. (2012). Beginning
XML: John Wiley & Sons.
Mahboubi, H.,
& Darmont, J. (2009). Enhancing XML
data warehouse query performance by fragmentation. Paper presented at the
Proceedings of the 2009 ACM symposium on Applied Computing.
Nicola, M., &
John, J. (2003). XML parsing: a threat to
database performance. Paper presented at the Proceedings of the twelfth
international conference on Information and knowledge management.
The purpose of this project is to discuss and examine a Big Data Analytics (BDA) technique and a case study. The discussion begins with an overview of BDA applications in various sectors, followed by the implementation of BDA in the healthcare industry. The records show that the healthcare industry suffers from fraud, waste, and abuse (FWA). The emphasis of this discussion is on FWA in the healthcare industry. The project provides a case study of BDA in healthcare using the outlier detection data mining tool. The data mining phases of the use case are discussed and analyzed. An improvement to the selected BDA technique of outlier detection is proposed in this project. The analysis shows that the outlier detection data mining technique for fraud detection is still under experimentation and is not yet proven reliable. The recommendation is to use the clustering data mining technique as a more heuristic technique for fraud detection. Organizations should evaluate the BDA tools and select the most appropriate and fitting tool to meet the requirements of the business model.
Keywords: Big Data Analytics; Healthcare; Outlier Detection; Fraud Detection.
Organizations must be
able to quickly and effectively analyze a large
amount of data and extract value from such data for sound business
decisions. The benefits of Big Data
Analytics are driving organizations and businesses to implement the Big Data
Analytics techniques to be able to compete in the market. A survey conducted by CIO Insight has shown that 65% of executives and senior decision makers believe organizations risk becoming uncompetitive or irrelevant if Big Data is not embraced (McCafferly, 2015). The same survey has also shown that 56% anticipated higher investment in big data, and 15% indicated that this increasing trend in budget allocation will be significant (McCafferly, 2015). Such budget allocation can be used for skilled professionals, BD storage, BDA tools, and so forth. This project discusses and analyzes the application of Big Data Analytics. It begins with an overview of such broad applications, then places more emphasis on a single application for further investigation. The healthcare sector is selected for further discussion, with a closer lens on the implementation of BDA and methods to improve that implementation.
Numerous research studies have discussed and analyzed the application of Big Data in different domains. (Chen & Zhang, 2014) have discussed BDA in scientific research domains such as astronomy, meteorology, social computing, bioinformatics, and computational biology, which are based on data-intensive scientific discovery. Other studies such as (Rabl et al., 2012) have investigated the performance of six modern open-source data stores in the context of application performance monitoring as part of the initiative of (CA-Technologies, 2018). (Bi & Cochran, 2014) have discussed BDA in cloud manufacturing, indicating that the success of a manufacturing enterprise depends on the advancement of IT to support and enhance the value stream. Manufacturing technologies have evolved throughout the years, and the advancement of a manufacturing system can be measured by its scale, complexity, and automation responsiveness (Bi & Cochran, 2014). Figure 1 illustrates the evolution of manufacturing technologies from before the 1950s until the Big Data age.
Figure 1. Manufacturing Technologies,
Information System, ITs, and Their Evolutions
The McKinsey Institute first reported four essential sectors that can benefit from BDA: the healthcare industry, government services, retailing, and manufacturing (Brown, Chui, & Manyika, 2011). The report also predicted that BDA implementation would improve productivity by 0.5 to 1 percent annually and produce hundreds of billions of dollars in new value (Brown et al., 2011). The McKinsey Institute has indicated that not all industries are created equal when it comes to capturing the benefits of BDA (Brown et al., 2011).
Another report by the McKinsey Institute has described the transformative potential of BD in five domains: health care (U.S.), public sector administration (European Union), retail (U.S.), manufacturing (global), and personal location data (global) (Manyika et al., 2011). The same report has predicted $300 billion in potential annual value to US healthcare and a potential 60% increase in retailers’ operating margins with BDA (Manyika et al., 2011). Some sectors are poised for more significant gains and benefits from BD than others, although the implementation of BD will matter across all sectors (Manyika et al., 2011). The sectors are divided into clusters A, B, C, D, and E. Cluster A reflects information and computer and electronic products, while finance and insurance and government are categorized as cluster B. Cluster C includes several sectors such as construction, educational services, and arts and entertainment. Cluster D includes manufacturing and wholesale trade, while cluster E covers retail, healthcare providers, and accommodation and food. Figure 2 shows that some sectors are positioned for more significant gains from the use of BD.
Figure 2. Capturing Value from Big Data by
Sector (Manyika
et al., 2011).
The application of BDA in specific sectors has been discussed in various research studies, such as health and medical research
(Liang
& Kelemen, 2016), biomedical research (Luo, Wu,
Gopukumar, & Zhao, 2016), machine learning techniques in
healthcare sectors (MCA,
2017). The
next section discusses the implementation of BDA in the healthcare sector.
Numerous research studies have discussed Big Data Analytics (BDA) in healthcare industries from different perspectives. Healthcare industries have taken advantage of BDA in fraud and abuse prevention, detection, and reporting (cms.gov, 2017). Fraud and abuse of Medicare are regarded as a severe problem that needs attention (cms.gov, 2017). Various examples of Medicare fraud scenarios are reported (cms.gov, 2017). Submitting, or causing to be submitted, false claims or making misrepresentations of fact to obtain a federal healthcare payment is the first Medicare fraud case. Soliciting, receiving, offering, or paying remuneration to induce or reward referrals for items or services reimbursed by federal health care programs is another Medicare fraud scenario. The last fraud case in Medicare is making prohibited referrals for certain designated health services (cms.gov, 2017). The abuse of Medicare includes billing for unnecessary medical services, charging excessively for services or supplies, and misusing codes on a claim, such as upcoding or unbundling codes (cms.gov, 2017; J. Liu et al., 2016). In 2012, $120 billion in healthcare payments were improper (J. Liu et al., 2016). Medicare and Medicaid contributed more than half of this improper payment total (J. Liu et al., 2016). The annual loss to fraud, waste, and abuse in the healthcare domain is estimated to be $750 billion (J. Liu et al., 2016). In 2013, over 60% of improper payments were healthcare related. Figure 3 illustrates the improper payments in government expenditure.
Figure 3. Improper Payments Resulted from Fraud and Abuse (J. Liu et al., 2016).
Medicare
fraud and abuse are governed by federal
laws (cms.gov, 2017). These federal laws include the False Claims Act (FCA), the Anti-Kickback Statute (AKS),
Physician Self-Referral Law (Stark Law), Criminal Health Care Fraud Statute, Social
Security Act, and the United States
Criminal Code. Medicare anti-fraud and
abuse partnerships of various government agencies such as Health Care Fraud
Prevention Partnership (HFPP) and Centers for Medicare
and Medicaid Services (CMS) have been established to combat fraud and abuse. The main aim of this
partnership is to uphold the integrity of the Medicare program, save and recoup
taxpayer funds, reduce the costs of health care
to patients, and improve the quality of healthcare (cms.gov, 2017).
In
2010, Health and Human Services (HHS) and CMS initiated a national effort known
as Fraud Prevention System (FPS), a predictive analytics technology which runs
predictive algorithms and other analytics nationwide on all Medicare FFS claims
prior to any payment in an effort to detect any potential suspicious claims and
patterns that may constitute fraud and abuse (cms.gov, 2017). In 2012, CMS developed the Program Integrity
Command Center to combine Medicare and Medicaid experts such as clinicians,
policy experts, officials, fraud investigators, and law enforcement community
including FBI to develop and improve predictive analytics that identifies fraud and mobilize a rapid response (cms.gov, 2017). Such effort aims to connect with the field
offices to examine the fraud allegations within few hours through a real-time
investigation. Before the application of
BDA, the process to find substantiating evidence of a fraud allegation took
days or weeks.
Research
communities and data analytics industry have exerted various efforts to develop
fraud-detection systems (J. Liu et al., 2016). Various research studies have used different data mining approaches for healthcare fraud and abuse detection. (J. Liu et al., 2016) have used an
unsupervised data mining approach and applied the clustering data mining
technique for healthcare fraud detection.
(Ekina, Leva, Ruggeri, & Soyer, 2013) have used the
unsupervised data mining approach and applied the Bayesian co-clustering data
mining technique for healthcare fraud
detection. (Ngufor & Wojtusiak, 2013) have used the
hybrid supervised and unsupervised data mining approach, and applied the
unsupervised data labeling and outlier detection, classification and regression
data mining technique for medical claims prediction. (Capelleveen, 2013; van Capelleveen, Poel, Mueller,
Thornton, & van Hillegersberg, 2016) have used an unsupervised data mining approach and applied the outlier detection data mining technique for health insurance fraud detection within the Medicaid domain.
The
case study presented by (Capelleveen, 2013; van Capelleveen et al., 2016) has been selected for further investigation on the
application of BDA in healthcare. The
outlier detection, which is one of the unsupervised data mining techniques, is
regarded as an effective predictor for
fraud detection and is recommended for use to support the audits initiations (Capelleveen, 2013; van Capelleveen et al., 2016). The outlier detection is the primary analytic
tool which was used in this case
study. The outlier detection tool can be based on linear model analysis, multivariate
clustering analysis, peak analysis, and boxplot analysis (Capelleveen, 2013; van Capelleveen et al., 2016). The outlier detection algorithm in this case study was applied to a Medicaid dataset of 650,000 healthcare claims from 369 dentists in one state. RapidMiner can be used for outlier detection data mining techniques; however, the study of (Capelleveen, 2013; van Capelleveen et al., 2016) did not specify the name of the tool that was used for the outlier detection of fraud and abuse in Medicaid, with an emphasis on dental practice.
The process for this unsupervised outlier detection data mining technique involves seven iterative phases. The first phase is the composition of metrics for the domain. These metrics are derived or calculated data, such as a feature, attribute, or measurement, that characterize the behavior of an entity over a certain period. The purpose of these metrics is to support a comparative behavioral analysis using data mining algorithms. During the first iteration, the metrics are inferred from provider behavior supported by known fraud causes and developed in cooperation with fraud experts. In subsequent iterations, the metrics composition incorporates the latest metrics, updating the existing ones, modifying the configuration, and adjusting the confidence level to optimize the hit rates. The composition of metrics phase is followed by cleaning and filtering the data. The third phase is the selection of provider groups and the computation of the metrics. The fourth phase involves comparing providers by metric and flagging outliers. Forming suspicion predictors for provider fraud detection is the fifth phase, followed by the phase of reporting and presenting to fraud investigators. The last phase of using the outlier detection analytic tool involves metric evaluation. The result of the outlier detection analysis
has shown that 12 of the top 17 providers (71%) submitted suspicious claim patterns and should be referred to officials for further
investigation. The study concluded that
the outlier detection tool could be used
to provide new patterns of potential fraud that can be identified and possibly
used for future automated detection technique.
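The flagging step can be pictured with a small Python sketch using made-up provider metrics and the boxplot-style fence mentioned above (an illustration of the idea, not the study's algorithm): providers whose metric value lies far above the upper quartile fence are flagged for review.

```python
import statistics

# Hypothetical metric, e.g. number of a given procedure billed per patient.
metric = {
    "prov_a": 1.0, "prov_b": 1.1, "prov_c": 1.2, "prov_d": 1.2,
    "prov_e": 1.3, "prov_f": 1.3, "prov_g": 1.4, "prov_h": 1.5,
    "prov_i": 1.6, "prov_j": 4.9, "prov_k": 5.6,
}

values = sorted(metric.values())
q1, _, q3 = statistics.quantiles(values, n=4)   # quartiles of the peer group
upper_fence = q3 + 1.5 * (q3 - q1)              # classic boxplot fence

flagged = {p: v for p, v in metric.items() if v > upper_fence}
print(flagged)   # {'prov_j': 4.9, 'prov_k': 5.6} -> candidates for audit
```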
(Lazarevic & Kumar, 2005) have indicated that most outlier detection techniques fall into four categories: the statistical approach, the distance-based approach, the profiling method, and the model-based approach. In the statistical approach, the data points are modeled using a stochastic distribution and are determined to be outliers based on their relationship with the model. Most statistical approaches are limited when the data points have a high-dimensional distribution, because the complexity of such a distribution results in inaccurate estimations. The distance-based approach detects outliers by computing the distances among points, overcoming that limitation of the statistical approach. Various distance-based outlier detection algorithms have been proposed, based on different strategies: the first computes the full-dimensional distances of points from one another using all the available features, while the second computes the densities of local neighborhoods. The profiling method develops profiles of normal behavior using different data mining techniques or heuristic-based approaches, and deviations from these profiles are considered intrusions. The model-based approach first characterizes normal behavior using predictive models, such as replicator neural networks or unsupervised support vector machines, and detects outliers as deviations from the learned model (Lazarevic & Kumar, 2005). (Capelleveen, 2013; van Capelleveen et al., 2016) have indicated that the outlier detection tool
as a data mining technique has not proven itself in the long run and is still
under experimentation. It is also
considered a sophisticated data mining
technique (Capelleveen, 2013; van Capelleveen et al., 2016). The
validation of effectiveness remains difficult (Capelleveen, 2013; van Capelleveen et al., 2016).
Based on this analysis of the outlier detection tool, a more heuristic and novel approach should be used. (Viattchenin, 2016) has proposed a novel technique for outlier detection based on a heuristic, function-based clustering algorithm. (Q. Liu & Vasarhelyi, 2013) have proposed healthcare fraud detection using a clustering model incorporating geolocation information. The results of the clustering model detected claims with extreme payment amounts and identified some suspicious claims. In summary, integrating the
clustering technique can play a role in enhancing the reliability and validity
of the outlier detection data mining technique.
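A minimal clustering sketch of that idea, assuming scikit-learn and NumPy are available and using synthetic claim features rather than real Medicaid data: claims that end up in very small, isolated clusters are flagged as candidates for review.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
claims = np.vstack([
    rng.normal(loc=[100, 2], scale=[10, 0.5], size=(200, 2)),   # routine claims
    np.array([[900.0, 9.0], [750.0, 8.0]]),                     # extreme payments
])

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(claims)
labels = model.labels_
sizes = np.bincount(labels)

# Claims falling into very small clusters are flagged for manual review.
suspicious_clusters = np.where(sizes < 0.05 * len(claims))[0]
flagged = np.where(np.isin(labels, suspicious_clusters))[0]
print(flagged)   # e.g. the two extreme claims at indices 200 and 201
```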
This project has discussed and examined Big Data Analytics (BDA) methods. An overview of BDA applications in various sectors was provided, followed by the implementation of BDA in the healthcare industry. The records show that the healthcare industry suffers from fraud, waste, and abuse. The discussion has provided a case study of BDA in healthcare using the outlier detection tool. The data mining phases have been discussed and analyzed. A proposed improvement to the selected BDA technique of outlier detection has also been addressed. The analysis has indicated that the outlier detection technique is still under experimentation, and that a more heuristic data mining fraud detection technique, such as the clustering data mining technique, should be used. In summary, various BDA techniques are available
for different industries. Organizations
must select the appropriate BDA tool to meet the requirements of the business
model.
Capelleveen,
G. C. (2013). Outlier based predictors
for health insurance fraud detection within US Medicaid. University of
Twente.
Chen,
C. P., & Zhang, C.-Y. (2014). Data-intensive applications, challenges,
techniques and technologies: A survey on Big Data. Information Sciences, 275, 314-347.
Ekina,
T., Leva, F., Ruggeri, F., & Soyer, R. (2013). Application of bayesian
methods in detection of healthcare fraud.
Lazarevic,
A., & Kumar, V. (2005). Feature
bagging for outlier detection. Paper presented at the Proceedings of the
eleventh ACM SIGKDD international conference on Knowledge discovery in data
mining.
Liang,
Y., & Kelemen, A. (2016). Big Data Science and its Applications in Health
and Medical Research: Challenges and Opportunities. Austin Journal of Biometrics & Biostatistics, 7(3).
Liu,
J., Bier, E., Wilson, A., Guerra-Gomez, J. A., Honda, T., Sricharan, K., . . .
Davies, D. (2016). Graph analysis for detecting fraud, waste, and abuse in
healthcare data. AI Magazine, 37(2),
33-46.
Liu,
Q., & Vasarhelyi, M. (2013). Healthcare
fraud detection: A survey and a clustering model incorporating Geo-location
information.
Luo,
J., Wu, M., Gopukumar, D., & Zhao, Y. (2016). Big data application in
biomedical research and health care: a literature review. Biomedical informatics insights, 8, BII. S31559.
Manyika,
J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Byers, A.
H. (2011). Big data: The next frontier for innovation, competition, and
productivity.
MCA,
M. J. S. (2017). Applications of Big Data Analytics and Machine Learning
Techniques in Health Care Sectors. International
Journal Of Engineering And Computer Science, 6(7).
Ngufor,
C., & Wojtusiak, J. (2013). Unsupervised labeling of data for supervised
learning and its application to medical claims prediction. Computer Science, 14(2), 191.
Rabl,
T., Gómez-Villamor, S., Sadoghi, M., Muntés-Mulero, V., Jacobsen, H.-A., &
Mankovskii, S. (2012). Solving big data challenges for enterprise application
performance management. Proceedings of
the VLDB Endowment, 5(12), 1724-1735.
van
Capelleveen, G., Poel, M., Mueller, R. M., Thornton, D., & van
Hillegersberg, J. (2016). Outlier detection in healthcare fraud: A case study in
the Medicaid dental domain. International
Journal of Accounting Information Systems, 21, 18-31.
Viattchenin, D. A. (2016). A Technique for Outlier
Detection Based on Heuristic Possibilistic Clustering. CERES, 17.
The purpose of this discussion is to
identify and describe a tool in the market for data analytics, how the tool is used and where it can be used.
The discussion begins with an overview of the Big Data Analytics tools,
followed by the top five tools for 2018, among which RapidMiner is selected as the BDA tool for this
discussion. The discussion of RapidMiner as one of the top five BDA tools includes its features, technical specification, use, advantages, and limitations. The application of
RapidMiner in various industries such as medical and education is also addressed in this discussion.
Overview of Big Data Analytics Tools
Organizations must be able to quickly and
effectively analyze a large amount of data
and extract value from such data for sound business decisions. The benefits of Big Data Analytics are
driving organizations and businesses to implement the Big Data Analytics
techniques to be able to compete in the market. A survey conducted by CIO Insight has shown that 65% of executives and senior decision makers believe organizations risk becoming uncompetitive or irrelevant if Big Data is not embraced (McCafferly, 2015). The same survey has also shown that 56% anticipated higher investment in big data, and 15% indicated that this increasing trend in budget allocation will be significant (McCafferly, 2015). Such budget allocation can be used for skilled professionals, BD storage, BDA tools, and so forth.
Various BDA tools exist in the market for different business purposes based on the business model of the organization, and organizations must select the right tool to serve that model. Several studies have discussed tools for BDA implementation. (Chen & Zhang, 2014) have examined various types of BD tools. Some tools are based on batch processing, such as Apache Hadoop, Dryad, Apache Mahout, and Tableau, while other tools are based on stream processing, such as Storm, S4, Splunk, Apache Kafka, and SAP Hana, as summarized in Table 1 and Table 2. Each tool provides certain features for BDA implementation and offers various advantages to the organizations that adopt it.
Table 1. Big Data Tools Based on Batch Processing (Chen & Zhang, 2014).
Table 2. Big Data Tools Based on Stream
Processing (Chen & Zhang, 2014).
Other studies such as (Rangra & Bansal, 2014) have provided a comparative study of data mining tools such as Weka, Keel, R-Programming, Knime, RapidMiner, and Orange, covering their technical specifications, general features, specializations, advantages, and limitations. (Choi, 2017) has discussed the BDA tools by category: open source data tools, data visualization tools, sentiment tools, and data extraction tools. Figure 1 provides a summary of some examples of BDA tools, including database sources from which big datasets can be downloaded for analysis.
Figure 1. A Summary of Big Data Analytics Tools.
(Al-Khoder & Harmouch, 2014) have evaluated four of the most popular open source and free data mining tools: R, RapidMiner, Weka, and Knime. The R Foundation has developed R-Programming, while the Rapid-I company has developed RapidMiner. Weka is developed by the University of Waikato, and Knime is developed by Knime.com AG. Figure 2 provides a summary of these four most popular open source and free data mining tools, with the logo, description, launch date, current version at the time of writing of the study, and development team.
Figure 2. Open Source and Free Data
Mining Tools Analyzed by (Al-Khoder & Harmouch, 2014).
The top five BDA tools for 2018 include Tableau Public, RapidMiner, Hadoop, R-Programming, and IBM Big Data (Seli, 2017). The present discussion focuses on one of these top five BDA tools for 2018. Figure 3 summarizes these top five BDA tools for 2018.
Figure 3. Top Five BDA Tools for 2018.
RapidMiner Big Data Analytic Tool
RapidMiner Big Data Analytic tool is
selected for the present discussion since
it was among the top five BDA tools for 2018.
RapidMiner is an open source platform for BDA, based on the Java programming language. RapidMiner provides machine learning procedures and data mining. It also provides data
visualization, processing, statistical modeling, deployment, evaluation and
predictive analytics (Hofmann & Klinkenberg, 2013; Rangra & Bansal, 2014; Seli, 2017). RapidMiner is known for its commercial and
business applications, as it provides an integrated environment and platform
for machine learning, data mining, predictive analysis, and business analytics (Hofmann & Klinkenberg, 2013; Seli, 2017). It is also
used for research, education, training, rapid prototyping, and application
development (Rangra & Bansal, 2014). It is specialized in predictive analysis and statistical computing. It supports all
steps of the data mining process (Hofmann & Klinkenberg, 2013; Rangra & Bansal, 2014). RapidMiner uses the client/server model, where the server can be offered as on-premise software, as a service, or on cloud infrastructure (Rangra & Bansal, 2014).
RapidMiner was released in 2006. The latest version of RapidMiner Server is 7.2, with a free version of the server and Radoop, and it can be downloaded from the RapidMiner site (rapidminer, 2018). It can be installed on any operating system (Rangra & Bansal, 2014). The advantages of RapidMiner include an integrated environment for all steps required for the data mining process, an easy-to-use graphical user interface (GUI) for designing data mining processes, visualization of the results and data, and validation and optimization of these processes. RapidMiner can be integrated into more complex systems (Hofmann & Klinkenberg, 2013). RapidMiner also stores data mining processes in a machine-readable XML format, which can be executed with the click of a button, and provides visualized graphics of the data mining processes (Hofmann & Klinkenberg, 2013). It contains over a hundred learning schemes for regression, classification, and clustering analysis (Rangra & Bansal, 2014). RapidMiner has a few limitations, including size constraints on the number of rows and a need for more hardware resources than other tools, such as SAS, for the same task and data (Seli, 2017). RapidMiner also requires sound knowledge of database handling (Rangra & Bansal, 2014).
RapidMiner Use and Application
Data mining requires six essential steps to extract value from a large dataset (Chisholm, 2013). The data mining process framework begins with business understanding, followed by data understanding and data preparation. The modeling, evaluation, and deployment phases develop the models for prediction, test them, and deploy them in real time. Figure 4 illustrates these six steps of data mining.
Figure 4. Data Mining Six Phases Process
Framework (Chisholm, 2013).
Before working with RapidMiner, the user must know the common terms used by RapidMiner. Some of these standard terms are process, operator, macro, repository, attribute, role, label, and ID (Chisholm, 2013). The data mining process in RapidMiner begins with loading the data into RapidMiner, using import techniques for data in either files or databases. The process of splitting a large file into pieces can be implemented in RapidMiner. In some cases, the dataset can be split into chunks using a RapidMiner process that reads each line of the file, such as a CSV file, and writes it into chunks, as sketched below. If the dataset is based on a database, a Java Database Connectivity (JDBC) driver must be used. RapidMiner supports MySQL, PostgreSQL, SQL Server, Oracle, and Access (Chisholm, 2013). After loading the data into RapidMiner and generating data for testing, a predictive model can be created based on the loaded dataset, followed by executing the process and reviewing the result visually. RapidMiner provides various techniques to visualize the data, including scatter plots, scatter 3D color, parallel and deviation plots, quartile color, series plots, and the survey plotter. Figure 5 illustrates the scatter 3D color visualization of data in RapidMiner (Chisholm, 2013).
Figure 5. Scatter 3D Color Visualization
of the Data in RapidMiner (Chisholm, 2013).
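The chunk-splitting step mentioned above can be sketched outside RapidMiner as plain Python (hypothetical file name and chunk size; RapidMiner implements this with its own operators): the script writes numbered chunk files and repeats the header row in each.

```python
import csv

def split_csv(path, rows_per_chunk=100_000):
    """Split a large CSV into numbered chunk files, repeating the header row."""
    with open(path, newline="") as source:
        reader = csv.reader(source)
        header = next(reader)
        out, writer, chunk = None, None, 0
        for count, row in enumerate(reader):
            if count % rows_per_chunk == 0:     # start a new chunk file
                if out:
                    out.close()
                chunk += 1
                out = open(f"{path}.part{chunk}.csv", "w", newline="")
                writer = csv.writer(out)
                writer.writerow(header)
            writer.writerow(row)
        if out:
            out.close()

# split_csv("claims.csv", rows_per_chunk=500_000)   # hypothetical file name
```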
RapidMiner supports
statistical analysis such as K-Nearest Neighbor Classifications, Naïve Bayes Classification, which can be used for credit
approval and in education (Hofmann & Klinkenberg, 2013). RapidMiner is also applied in other areas such as marketing, cross-selling, and recommender systems (Hofmann & Klinkenberg, 2013). Other useful use cases of RapidMiner include clustering in the medical and education domains (Hofmann & Klinkenberg, 2013). RapidMiner can also be used for text mining scenarios such as spam detection, language detection, and customer feedback analysis. Other applications of RapidMiner include anomaly detection and instance selection.
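As an illustration of the kind of k-Nearest Neighbor classification mentioned above, the following Python sketch uses scikit-learn (assumed installed) on a tiny made-up credit-approval dataset; RapidMiner exposes the same kind of learner through its GUI operators.

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical features: [annual income in $1000s, existing debt in $1000s]
X_train = [[55, 5], [80, 10], [20, 15], [25, 18], [90, 2], [30, 22]]
y_train = ["approve", "approve", "reject", "reject", "approve", "reject"]

# Classify new applicants by the majority label of their 3 nearest neighbors.
model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print(model.predict([[60, 8], [22, 20]]))   # ['approve' 'reject']
```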
Conclusion
This discussion has
identified the different tools for Big Data Analytics (BDA). Over thirty analytic tools can be used to address some of the BDA challenges. Some are open source tools, such as Knime, R-Programming, and RapidMiner, which can be downloaded for free, while others are described as visualization tools, such as Tableau Public and Google Fusion, that provide compelling visual images of the data in various scenarios. Other tools are more semantic, such as OpenText and Opinion Crawl. Data extraction tools for BDA include Octoparse and Content Grabber. Users can download large datasets for BDA from various databases such as data.gov.
The discussion has also
addressed the top five BDA tools for 2018, such as Tableau Public, RapidMiner,
Hadoop, R-Programming, and IBM Big Data. RapidMiner was selected as the BDA tool for this discussion. The focus of the discussion on RapidMiner included its technical specification, use, advantages, and limitations. The data mining process and steps when using RapidMiner have also been discussed. The analytic process begins with the data upload to RapidMiner, during which the data can be split using RapidMiner's capabilities. After the loading and cleaning of the data, the data model is developed and tested, followed by visualization. RapidMiner also supports statistical analysis such as K-Nearest Neighbor and Naïve Bayes classification. RapidMiner use cases have been addressed as well, including the medical and education domains and text mining scenarios such as spam detection. Organizations must
select the appropriate BDA tools based on the business model.
References
Al-Khoder, A., & Harmouch, H.
(2014). Evaluating four of the most popular open source and free data mining
tools.
Chen, C. P.,
& Zhang, C.-Y. (2014). Data-intensive applications, challenges, techniques
and technologies: A survey on Big Data. Information
Sciences, 275, 314-347.
Chisholm, A.
(2013). Exploring data with RapidMiner:
Packt Publishing Ltd.
Rangra, K., &
Bansal, K. (2014). Comparative study of data mining tools. International journal of advanced research in computer science and
software engineering, 4(6).
The purpose of this discussion is to identify a real-life case study where Hadoop was used. The discussion also addresses the researcher's view on whether Hadoop was used to its fullest extent. The benefits of Hadoop to the industry identified in the use case are also discussed.
Hadoop Real Life Case Study and
Applications in the Healthcare Industry
Various
research studies and reports have discussed Spark solution for real-time data
processing in particular industries such
as Healthcare, while others have discussed Hadoop solution for healthcare data
analytics. For instance, (Shruika & Kudale,
2018)
have discussed the use of Big Data in Healthcare with Spark, while (Beall, 2016) has indicated that United Healthcare is processing
data using Hadoop framework for clinical advancements, financial analysis, and
fraud and waste monitoring. United
Healthcare has utilized Hadoop to obtain a 360-degree
view of each of its 85 million members (Beall, 2016).
The
emphasis of this discussion is on Hadoop in the Healthcare
industry. Data in the healthcare industry is growing exponentially (Dezyre, 2016). McKinsey has anticipated that the potential annual value of BDA for US healthcare is $300 billion, with 7% annual productivity growth (Manyika et al., 2011). (Dezyre, 2016) has reported that healthcare informatics poses challenges such as data knowledge representation, database design, data querying, and clinical decision support, which contribute to the development of BDA.
Big Data in healthcare includes patient-related data from electronic health records (EHRs), computerized physician order entry systems (CPOE), clinical decision support systems, medical devices and sensors, lab results, images such as X-rays, and so forth (Alexandru, Alexandru, Coardos, & Tudora, 2016; Wang, Kung, & Byrd, 2018). The Big Data framework for healthcare includes the data layer, the data aggregation layer, the analytical layer, and the information exploration layer (Alexandru et al., 2016). Hadoop resides in the analytical layer of the Big Data framework (Alexandru et al., 2016).
The data analysis involves Hadoop and MapReduce processing large datasets in batch form economically, analyzing both structured and unstructured data in a massively parallel processing environment (Alexandru et al., 2016). (Alexandru et al., 2016) have indicated that stream computing can also be implemented, using real-time or near real-time analysis to identify and respond to any health care fraud quickly. The third type of analytics at the analytical layer is in-database analytics, which uses a data warehouse for data mining, allows high-speed parallel processing, and can be used for prediction scenarios (Alexandru et al., 2016). In-database analytics can be used for preventive health care and pharmaceutical management. Using a Big Data framework including the Hadoop ecosystem provides additional health care benefits such as scalability, security, confidentiality, and optimization features (Alexandru et al., 2016).
Hadoop
technology was found to be the only technology that enables healthcare to store
data in its native forms (Dezyre, 2016). There are
five successful use cases and applications of Hadoop in the healthcare industry (Dezyre, 2016). The first
application of Hadoop technology in healthcare is cancer treatment and genomics. Hadoop helps develop better treatments for diseases such as cancer by accelerating the design and testing of effective treatments tailored to patients, expanding genetically based clinical cancer trials, and establishing a national “cancer knowledge network” to guide treatment decisions (Dezyre, 2016). Hadoop can
also be used to monitor the patient vitals.
The Children’s Healthcare of Atlanta is an example of using the Hadoop ecosystem to treat over 6,200 children
in their ICU units. Through Hadoop, the
hospital was able to store and analyze the vital signs, and if there is any
pattern change, an alert is generated and
sent to the physicians (Dezyre, 2016). The third
application of Hadoop in Healthcare industry involves the hospital
network. The Cleveland Clinic spinoff
company, known as “Explorys” is taking advantages of Hadoop by developing the most extensive database in the healthcare industry. As a result, Explorys was
able to provide clinical support, reduce the cost of care measurement and
manage the population of at-risk patients (Dezyre, 2016). The fourth application of Hadoop in the healthcare industry involves healthcare intelligence, where healthcare insurance businesses are interested in identifying individuals in specific regions who, below a certain age, are not victims of certain diseases. Through Hadoop technology, healthcare insurance companies can compute the cost of an insurance policy. Pig, Hive, and MapReduce of the Hadoop ecosystem are used in this scenario to process such a large dataset (Dezyre, 2016). The last
application of Hadoop in the healthcare
industry involves fraud prevention and
detection.
Conclusion
In conclusion, the healthcare industry has taken advantage of Hadoop technology in various areas, not only for better treatment and better medication but also for reducing cost and increasing productivity and efficiency. It has also used Hadoop for fraud protection. These are not the only benefits Hadoop offers the healthcare industry. Hadoop also offers storage capabilities, scalability, and analytics capabilities over various types of datasets using parallel processing and a distributed file system. From the viewpoint of the researcher, utilizing Spark on top of Hadoop will empower the healthcare industry not only at the batch processing level but also at the real-time data processing level. (Basu, 2014) has reported that
the healthcare industry can take
advantages of Spark and Shark with Apache Hadoop for real-time healthcare
analytics. Although Hadoop alone offers excellent benefits to the healthcare
industry, its integration with other analytic tools such as Spark can make a huge
difference at the patient care level as well as at the industry return on
investment level.
References
Alexandru, A., Alexandru, C.,
Coardos, D., & Tudora, E. (2016). Healthcare, Big Data, and Cloud
Computing. Management, 1, 2.
Manyika, J.,
Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Byers, A. H.
(2011). Big data: The next frontier for innovation, competition, and
productivity.
Shruika, D.,
& Kudale, R. A. (2018). Use of Big Data in Healthcare with Spark. International Journal of Science and
Research (IJSR).
Wang,
Y., Kung, L., & Byrd, T. A. (2018). Big data analytics: Understanding its
capabilities and potential benefits for healthcare organizations. Technological Forecasting and Social Change,
126, 3-13.
The purpose of this discussion is to discuss the Hadoop ecosystem, which is rapidly evolving. The discussion also covers Apache Spark, which is a recent addition to the Hadoop ecosystem. Both technologies offer significant benefits for the challenges of storing and processing large data sets in the age of Big Data Analytics. The discussion also addresses the most significant differences between Hadoop and Spark.
Hadoop Solution, Components and Ecosystem
The growth of Big Data has demanded attention not only from researchers, academia, and government but also from software engineering, as dealing with Big Data using conventional computer science technologies has been challenging (Koitzsch, 2017). (Koitzsch, 2017) has referenced annual data volume statistics from the Cisco VNI Global IP Traffic Forecast for 2014-2019, illustrated in Figure 1, to show the magnitude of the data growth.
Figure 1. Annual Data Volume Statistics
[Cisco VNI Global IP Traffic Forecast
2014-2019] (Koitzsch, 2017).
The complex characteristics of Big Data have demanded the innovation of distributed big data analysis, as conventional techniques were found inadequate (Koitzsch, 2017; Lublinsky, Smith, & Yakubovich, 2013). Thus, tools such as Hadoop have emerged, relying on clusters of relatively low-cost machines and disks and driving distributed processing for large-scale data projects. Apache Hadoop is a Java-based open source distributed processing framework that evolved from Apache Nutch, an open source web search engine based on Apache Lucene (Koitzsch, 2017). The newer Hadoop subsystems have various language bindings such as Scala and Python (Koitzsch, 2017). The core components of Hadoop 2 include MapReduce, YARN, HDFS, and other components such as Tez, as illustrated in Figure 2.
Hadoop and its ecosystem are divided into major building blocks (Koitzsch, 2017). The core components of Hadoop 2 are YARN, MapReduce, HDFS, and Apache Tez. The operational services component includes Apache Ambari, Oozie, Ganglia, Nagios, Falcon, and others. The data services component includes Hive, HCatalog, Pig, HBase, Flume, Sqoop, and others. The messaging component includes Apache Kafka, while the security services and secure ancillary components include Accumulo. The glue components include Apache Camel, the Spring Framework, and Spring Data. Figure 3 summarizes these building blocks of Hadoop and its ecosystem.
Furthermore, the structure of the Hadoop ecosystem involves various components, with Hadoop in the center providing bookkeeping and management for the cluster using ZooKeeper and Curator (Koitzsch, 2017). Hive and Pig are standard components of the Hadoop ecosystem providing data warehousing, while Mahout provides standard machine learning algorithm support. Figure 4 shows the structure of the Hadoop ecosystem (Koitzsch, 2017).
Figure 4. Hadoop Ecosystem (Koitzsch, 2017).
Hadoop Limitation Driving Additional Technologies
Hadoop has three significant limitations (Guo, 2013). The first limitation is the instability of the Hadoop software, since it is open source and lacks dedicated technical support and documentation. Enterprise Hadoop distributions can be used to overcome this first limitation. The second limitation is that Hadoop cannot handle real-time data processing; Spark or Storm can be used for real-time processing, as required by the application. The third limitation is that Hadoop cannot handle large graph datasets; GraphLab can be utilized to overcome this limitation.
Enterprise Hadoop refers to the distributions of Hadoop offered by various Hadoop-oriented vendors such as Cloudera, Hortonworks, MapR, and Hadapt (Guo, 2013). Cloudera provides Big Data solutions and is regarded as one of the most significant contributors to the Hadoop codebase (Guo, 2013). Hortonworks and MapR are Hadoop-based Big Data solutions (Guo, 2013). Spark is a real-time in-memory processing platform for Big Data (Guo, 2013). (Guo, 2013) has indicated that Spark “can be up to 40 times faster than Hadoop” (p. 15). (Scott, 2015) has indicated that Spark, running in memory, “can be 100 times faster than Hadoop MapReduce, but also ten times faster when processing disk-based data in a similar way to Hadoop MapReduce itself” (p. 7). Spark is described as ideal for iterative processing and responsive Big Data applications (Guo, 2013). Spark can also be integrated with Hadoop, where the Hadoop-compatible storage API provides the capability to access any Hadoop-supported system (Guo, 2013). Storm is another option for overcoming the Hadoop limitation of real-time data processing; Storm was developed and open-sourced by Twitter (Guo, 2013). GraphLab is the alternative solution for the Hadoop limitation of dealing with large graph datasets. GraphLab is an open source distributed system, developed at Carnegie Mellon University, to handle sparse iterative graph algorithms (Guo, 2013). Figure 5 summarizes these three limitations of Hadoop and the alternatives to overcome them.
Figure 5. Three Major Limitations of Hadoop and Alternative Solutions.
Apache Spark Solution and its Building Blocks
Spark was developed in 2009 by the UC Berkeley AMPLab. Spark processes data in memory and is therefore quicker than Hadoop (Guo, 2013; Koitzsch, 2017; Scott, 2015). In 2013, Spark became a project of the Apache Software Foundation, and early in 2014, it became one of the foundation's major projects. (Scott, 2015) has described Spark as a general-purpose engine for data processing that can be used in various projects. The primary tasks associated with Spark include interactive queries across large datasets, processing data streaming from sensors or financial systems, and machine learning (Scott, 2015). While Hadoop was written in Java, Apache Spark was written primarily in Scala (Koitzsch, 2017).
Spark has three critical features: simplicity, speed, and support (Scott, 2015). The simplicity feature is represented in the access capabilities of Spark through a set of well-structured and well-documented APIs that help data scientists utilize Spark quickly. The speed feature reflects the in-memory processing of large datasets, and it is what most distinguishes Spark from Hadoop. The support feature is reflected in the various programming languages that Spark supports, such as Java, Python, R, and Scala (Scott, 2015). Spark has native support for integrating with the leading storage solutions in the Hadoop ecosystem and beyond (Scott, 2015). Databricks, IBM, and the other main Hadoop vendors provide Spark-based solutions.
The typical uses of Spark include stream processing, machine learning, interactive analytics, and data integration (Scott, 2015). An example of stream processing is real-time data processing to identify and prevent potentially fraudulent transactions. Machine learning is another typical use case, supported by Spark's ability to hold data in memory and quickly run repeated queries, which helps in training machine learning algorithms and finding the most efficient one (Scott, 2015). Interactive analytics is another typical use, involving interactive query processing where Spark responds and adapts quickly. Data integration is a further typical use, involving the extract, transform, and load (ETL) process and reducing its cost and time. The Spark framework includes the Spark Core engine, along with Spark SQL, Spark Streaming for data streaming, MLlib for machine learning, GraphX for graph computation, and SparkR for running the R language on Spark. Figure 6 summarizes the Spark framework and its building blocks (Scott, 2015).
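To make the interactive-analytics use case concrete, the minimal sketch below uses Spark's Java API to load a dataset and run an SQL query over it. The file name transactions.json and its fields (customerId, amount) are hypothetical examples introduced only for this sketch and do not come from the cited sources.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class InteractiveQueryExample {
    public static void main(String[] args) {
        // Start a local Spark session; on a Hadoop cluster this would
        // typically run on YARN instead of local[*].
        SparkSession spark = SparkSession.builder()
                .appName("InteractiveQueryExample")
                .master("local[*]")
                .getOrCreate();

        // Load a hypothetical JSON dataset of transactions.
        Dataset<Row> transactions = spark.read().json("transactions.json");

        // Register the data as a temporary view and run an interactive SQL query.
        transactions.createOrReplaceTempView("transactions");
        Dataset<Row> totals = spark.sql(
                "SELECT customerId, SUM(amount) AS total "
                + "FROM transactions GROUP BY customerId");

        totals.show();
        spark.stop();
    }
}

Because the data are held in memory after the first read, repeated queries of this kind respond quickly, which is the behavior the interactive-analytics use case relies on.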
Figure 6. Spark Building Blocks (Scott, 2015).
Differences Between Spark and Hadoop
Although Spark has its benefits in processing real-time data using in-memory processing, Spark is not a replacement for Hadoop or MapReduce (Scott, 2015). Spark can run on top of Hadoop to benefit from YARN, the cluster manager of Hadoop, and from underlying storage such as HDFS and HBase. Spark can also run by itself without Hadoop, integrating with other cluster managers such as Mesos and other storage such as Cassandra and Amazon S3 (Scott, 2015). Spark is described as a great companion to a modern Hadoop cluster deployment (Scott, 2015). Spark is also described as a powerful tool in its own right for processing large volumes of data. However, Spark on its own is not well suited for production workloads. Thus, the integration of Spark with Hadoop provides many capabilities that Spark cannot offer on its own.
Hadoop offers YARN as a resource manager, the distributed file system, disaster recovery capabilities, data security, and a distributed data platform. Spark offers a machine learning model to Hadoop, delivering capabilities that are not easily achieved in Hadoop without Spark (Scott, 2015). Spark also offers fast in-memory real-time data streaming, which Hadoop cannot accomplish without Spark (Scott, 2015). In summary, although Hadoop has its limitations, Spark is not replacing Hadoop but empowering it.
Conclusion
This discussion has covered significant topics relevant to Hadoop and Spark. It began with Big Data, its complex characteristics, and the urgent need for technologies and tools to deal with it. Hadoop and Spark as emerging technologies, together with their building blocks, have been addressed in this discussion. The differences between Spark and Hadoop are also covered. The conclusion of this discussion is that Spark is not replacing Hadoop and MapReduce. Spark offers various benefits to Hadoop, and at the same time, Hadoop offers various benefits to Spark. The integration of Spark and Hadoop offers great benefits to data scientists in the Big Data Analytics domain.
References
Guo, S. (2013). Hadoop operations and cluster management
cookbook: Packt Publishing Ltd.
Koitzsch, K.
(2017). Pro Hadoop Data Analytics:
Springer.
Lublinsky, B.,
Smith, K. T., & Yakubovich, A. (2013). Professional
hadoop solutions: John Wiley & Sons.
Scott,
J. A. (2015). Getting Started with Spark: MapR Technologies, Inc.
The purpose of this project is to discuss Hadoop functionality, installation steps, and troubleshooting techniques. It addresses two significant parts. Part-I discusses Big Data and the emerging technology of Hadoop. It also provides an overview of the Hadoop ecosystem, its building blocks, benefits, and limitations. It also discusses the MapReduce framework, its benefits, and limitations. Part-I provides a few success stories of Hadoop technology use with Big Data Analytics. Part-II addresses the installation and configuration of Hadoop on the Windows operating system using fourteen critical tasks. It also addresses the errors encountered during the configuration setup and the techniques to overcome these errors and proceed successfully with the Hadoop installation.
Keywords: Big Data Analytics; Hadoop
Ecosystem; MapReduce.
This project
discusses various significant topics related to Big Data Analytics. It addresses two significant parts. Part-I
discusses Big Data and the emerging technology of Hadoop. It also provides an overview of the Hadoop ecosystem, its building blocks,
benefits, and limitations. It also discusses the MapReduce framework,
its benefits, and limitations. Part-I provides a few success stories for
Hadoop technology use with Big Data Analytics.
Part-II addresses the installation and configuration of Hadoop on the Windows operating system using fourteen critical tasks. It also addresses the errors encountered during the configuration setup and the techniques to overcome these errors and proceed successfully with the Hadoop installation.
The purpose of this Part is to address relevant topics related to Hadoop. It begins with Big Data Analytics and the emerging Hadoop technology. The building blocks of the Hadoop ecosystem are also addressed in this part, including the Hadoop Distributed File System (HDFS), MapReduce, and HBase. The benefits and limitations of Hadoop as well as of MapReduce are also discussed in Part I of the project. Part I ends with success stories of using Hadoop ecosystem technology with Big Data Analytics in various domains and industries.
Big Data is now a buzzword in the fields of computer science and information technology. Big Data has attracted the attention of various sectors, researchers, academia, government, and even the media (Géczy, 2014; Kaisler, Armour, Espinosa, & Money, 2013). The 2011 report of the International Data Corporation (IDC) estimated that the amount of information created and replicated in 2011 would exceed 1.8 zettabytes (1.8 trillion gigabytes), an amount that has grown by a factor of nine in just five years (Gantz & Reinsel, 2011).
Big Data Analytics (BDA) analyzes and mines Big Data to produce operational and business knowledge at an unprecedented scale (Bi & Cochran, 2014). BDA is described by (Bi & Cochran, 2014) as an integral toolset for strategy, marketing, human resources, and research. It is the process of inspecting, cleaning, transforming, and modeling BD with the objective of discovering knowledge, generating solutions, and supporting decision-making (Bi & Cochran, 2014). Big Data (BD) and BDA are regarded as powerful tools from which various organizations have benefited (Bates, Saria, Ohno-Machado, Shah, & Escobar, 2014). Companies that adopted Big Data Analytics have been successful at using Big Data to improve business efficiency (Bates et al., 2014). An example of a successful application of Big Data Analytics is IBM's “Watson,” an application developed by IBM that appeared on the TV program Jeopardy and uses some of these Big Data approaches (Bates et al., 2014). (Manyika et al., 2011) have provided notable examples of organizations around the globe that are well known for their extensive and effective use of data, including Wal-Mart, Harrah's, Progressive Insurance, Capital One, Tesco, and Amazon. These companies have already taken advantage of Big Data as a “competitive weapon” (Manyika et al., 2011). Figure 1 illustrates the different types of data that make up the Big Data space.
Figure 1: Big Data (Ramesh, 2015)
“Big data is about deriving value… The goal of big data is data-driven decision making” (Ramesh, 2015). Thus, businesses should make analytics the goal when investing in storing Big Data (Ramesh, 2015) and should focus on the analytics side of Big Data to retrieve the value that can assist in decision-making (Ramesh, 2015). The value of BDA increases as the cumulative cash flow increases (B. Gupta & Jyoti, 2014). Figure 2 illustrates the value of BDA along the dimensions of time and cumulative cash flow. Thus, there is no doubt that BDA provides great benefits to organizations.
Figure 2. The Value of Big Data Analytics. Adapted from (B. Gupta & Jyoti, 2014).
Furthermore, an organization must learn how to use Big Data Analytics to drive value for the business in a way that aligns with its core competencies and creates competitive advantages (Minelli, Chambers, & Dhiraj, 2013). BDA can improve operational efficiencies, increase revenues, and achieve competitive differentiation. Table 1 summarizes the Big Data business models that organizations can use to put Big Data to work as business opportunities.
Table 1: Big Data Business Models (Minelli et al., 2013)
There are three types of data status that organizations deal with: data in use, data at rest, and data in motion. Data in use means that the data are being used by services or users to accomplish specific tasks. Data at rest means that the data are not in use and are stored or archived in storage. Data in motion means that the data are about to change state from data at rest to data in use or are being transferred from one place to another (Chang, Kuo, & Ramachandran, 2016). Figure 3 summarizes these three types of data.
Figure 3. Three Types for Data.
One of the significant characteristics of Big Data is velocity. The speed of data generation is described by (Abbasi, Sarker, & Chiang, 2016) as the “hallmark” of Big Data. Wal-Mart, for example, generates an explosive amount of data, collecting over 2.5 petabytes of customer transaction data every hour. Moreover, over one billion new tweets occur every three days, and five billion search queries occur daily (Abbasi et al., 2016). Velocity is the data in motion (Chopra & Madan, 2015; Emani, Cullot, & Nicolle, 2015; Katal, Wazid, & Goudar, 2013; Moorthy, Baby, & Senthamaraiselvi, 2014; Nasser & Tariq, 2015). Velocity involves streams of data, structured data, and the availability of access and delivery (Emani et al., 2015). The challenge of velocity is not only the speed of the incoming data, which can be handled with batch processing, but also the need to stream such high-speed data in real time for knowledge-based decisions (Emani et al., 2015; Nasser & Tariq, 2015). Real-time data (a.k.a. data in motion) is streaming data that needs to be analyzed as it comes in (Jain, 2013).
(CSA, 2013) has indicated that Big Data technologies are divided into two categories: batch processing for analyzing data at rest, and stream processing for analyzing data in motion. An example of data-at-rest analysis is sales analysis, which is not based on real-time data processing (Jain, 2013). An example of data-in-motion analysis is association rules in e-commerce. The response time of each data processing category is different. For stream processing, the response time ranges from milliseconds to seconds, but the more significant challenge is to stream data with response times far below a millisecond, which is very demanding (Chopra & Madan, 2015; CSA, 2013). Data in motion, reflecting stream or real-time processing, does not always need to reside in memory, and interactive analysis of large-scale data sets through new technologies such as Apache Drill and Google's Dremel provides new paradigms for data analytics. Figure 4 illustrates the response time for each processing type.
Figure 4. The Batch and Stream
Processing Responsiveness (CSA,
2013).
There are two kinds of systems for data at rest: NoSQL systems for interactive data serving environments, and systems for large-scale analytics based on the MapReduce paradigm, such as Hadoop. NoSQL systems are designed around a simpler key-value data model with built-in sharding, and they work seamlessly in a distributed, cloud-based environment (R. Gupta, Gupta, & Mohania, 2012). A MapReduce-based framework such as Hadoop supports batch-oriented processing (Chandarana & Vijayalakshmi, 2014; Erl, Khattak, & Buhler, 2016; Sakr & Gaber, 2014).
A data stream management system allows the user to analyze data in motion, rather than collecting large quantities of data, storing them on disk, and then analyzing them. There are various stream processing systems, such as IBM InfoSphere Streams (R. Gupta et al., 2012; Hirzel et al., 2013), Twitter's Storm, and Yahoo's S4. These systems are designed for and geared towards clusters of commodity hardware for real-time data processing (R. Gupta et al., 2012).
In 2004, Google introduced the MapReduce framework as a parallel processing framework for dealing with large sets of data (Bakshi, 2012; Fadzil, Khalid, & Manaf, 2012; White, 2012). The MapReduce framework has gained much popularity because it hides the sophisticated operations of parallel processing (Fadzil et al., 2012). Various MapReduce frameworks such as Hadoop were introduced because of the enthusiasm towards MapReduce (Fadzil et al., 2012). The capability of the MapReduce framework was realized in different research areas such as data warehousing, data mining, and bioinformatics (Fadzil et al., 2012). The MapReduce framework consists of two main layers: the Distributed File System (DFS) layer to store data and the MapReduce layer for data processing (Lee, Lee, Choi, Chung, & Moon, 2012; Mishra, Dehuri, & Kim, 2016; Sakr & Gaber, 2014). The DFS is a significant feature of the MapReduce framework (Fadzil et al., 2012).
The MapReduce framework uses large clusters of low-cost commodity hardware to lower cost (Bakshi, 2012; H. Hu, Wen, Chua, & Li, 2014; Inukollu, Arsi, & Ravuri, 2014; Khan et al., 2014; Krishnan, 2013; Mishra et al., 2016; Sakr & Gaber, 2014; White, 2012). It uses “Redundant Arrays of Independent (and inexpensive) Nodes (RAIN),” whose components are loosely coupled, so that when any node goes down there is no negative impact on the MapReduce job (Sakr & Gaber, 2014; Yang, Dasdan, Hsiao, & Parker, 2007). The framework provides fault tolerance by applying replication and allows any crashed node to be replaced by another node without affecting the currently running job (P. Hu & Dai, 2014; Sakr & Gaber, 2014). It also supports automatic parallelization of execution, which makes MapReduce highly parallel and yet abstracted (P. Hu & Dai, 2014; Sakr & Gaber, 2014).
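As an illustration of the programming model described above, the following is a minimal word-count job sketched with the standard Hadoop Java MapReduce API; the class names are chosen only for this sketch, and the input and output paths are passed as command-line arguments.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts emitted for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The programmer supplies only the map and reduce functions; splitting the input, scheduling the tasks across the cluster, and recovering from node failures are handled by the framework, which is the abstraction the paragraph above describes.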
Emerging BD technologies such as the Hadoop ecosystem (including Pig, Hive, Mahout, and Hadoop itself), stream mining, complex-event processing, and NoSQL databases enable the analysis of not only large-scale but also heterogeneous datasets at unprecedented scale and speed (Cardenas, Manadhata, & Rajan, 2013). Hadoop was developed by Yahoo and Apache to run jobs over hundreds of terabytes of data (Yan, Yang, Yu, Li, & Li, 2012). Various large corporations such as Facebook and Amazon have used Hadoop because it offers high efficiency, high scalability, and high reliability (Yan et al., 2012). The Hadoop Distributed File System (HDFS) is one of the major components of the Hadoop framework; it stores large files (Bao, Ren, Zhang, Zhang, & Luo, 2012; CSA, 2013; De Mauro, Greco, & Grimaldi, 2015) and allows access to data scattered over multiple nodes without exposing the complexity of the environment (Bao et al., 2012; De Mauro et al., 2015). The MapReduce programming model is another significant component of the Hadoop framework (Bao et al., 2012; CSA, 2013; De Mauro et al., 2015), designed to implement distributed and parallel algorithms efficiently (De Mauro et al., 2015). HBase is the third component of the Hadoop framework (Bao et al., 2012). HBase is built on HDFS and is a NoSQL (Not only SQL) database (Bao et al., 2012).
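To illustrate how HDFS hides the distribution of the data from the client, the short sketch below uses the Hadoop FileSystem Java API to read a file; the address hdfs://localhost:9000 and the path /data/sample.txt are hypothetical values chosen for this sketch, not taken from the cited sources.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        // The client only needs the cluster address; the placement of blocks
        // across data nodes stays hidden behind the FileSystem abstraction.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        try (FileSystem fs = FileSystem.get(conf);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(fs.open(new Path("/data/sample.txt"))))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line); // print each line of the distributed file
            }
        }
    }
}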
Various studies have addressed the benefits of Hadoop technology, which include scalability, flexibility, cost efficiency, and fault tolerance (H. Hu et al., 2014; Khan et al., 2014; Mishra et al., 2016; Polato, Ré, Goldman, & Kon, 2014; Sakr & Gaber, 2014). Hadoop allows the nodes in the cluster to scale up and down based on the computation requirements and with no change in the data formats (H. Hu et al., 2014; Polato et al., 2014). Hadoop also brings massively parallel computation to commodity hardware, decreasing the cost per terabyte of storage and keeping massively parallel computation affordable as the volume of data increases (H. Hu et al., 2014). Hadoop offers flexibility because it is not tied to a schema, which allows the use of any data, whether structured, unstructured, or semi-structured, and the aggregation of data from multiple sources (H. Hu et al., 2014; Polato et al., 2014). Hadoop also allows nodes to crash without affecting data processing; it provides a fault-tolerant environment in which data and computation can be recovered without any negative impact on the processing of the data (H. Hu et al., 2014; Polato et al., 2014; White, 2012).
Hadoop has faced various limitations, such as its low-level programming paradigm and schema, strictly batch processing, time skew, and incremental computation (Alam & Ahmed, 2014). Incremental computation is regarded as one of the significant shortcomings of Hadoop technology (Alam & Ahmed, 2014). Efficiency in handling incremental data comes at the expense of compatibility with the programming models offered by non-incremental systems such as MapReduce, since it requires the implementation of incremental algorithms and increases the complexity of the algorithms and the code (Alam & Ahmed, 2014). A caching technique is proposed by (Alam & Ahmed, 2014) as a solution. This caching solution operates at three levels: the job, the task, and the hardware (Alam & Ahmed, 2014).
Incoop is another solution, proposed by (Bhatotia, Wieder, Rodrigues, Acar, & Pasquin, 2011). Incoop extends Hadoop, the open-source implementation of the MapReduce programming paradigm, to run unmodified MapReduce programs in an incremental manner (Bhatotia et al., 2011; Sakr & Gaber, 2014). Incoop allows programmers to incrementalize their MapReduce programs automatically without any modification to the code (Bhatotia et al., 2011; Sakr & Gaber, 2014). Moreover, information about previously executed MapReduce tasks is recorded by Incoop to be reused in subsequent MapReduce computations when possible (Bhatotia et al., 2011; Sakr & Gaber, 2014).
Incoop is not a perfect solution, and it has some shortcomings that are addressed by (Sakr & Gaber, 2014; Zhang, Chen, Wang, & Yu, 2015). Some enhancements were implemented in Incoop, including an incremental HDFS called Inc-HDFS, a Contraction phase, and a “memoization-aware scheduler” (Sakr & Gaber, 2014). Inc-HDFS identifies the deltas between the inputs of two consecutive job runs and splits the input based on its contents while maintaining compatibility with HDFS. The Contraction phase is a new phase in the MapReduce framework that breaks the Reduce tasks into smaller sub-computations forming an inverted tree, so that when a small portion of the input changes, only the path from the corresponding leaf to the root needs to be recomputed (Sakr & Gaber, 2014). The memoization-aware scheduler is a modified version of the Hadoop scheduler that takes advantage of the locality of memoized results (Sakr & Gaber, 2014).
Another solution, called i2MapReduce, was proposed by (Zhang et al., 2015) and compared to Incoop. i2MapReduce performs incremental processing at the level of key-value pairs rather than at the task level. This solution also supports more complex iterative computation, which is used in data mining, and reduces the I/O overhead by applying various techniques (Zhang et al., 2015). IncMR is an enhanced framework for large-scale incremental data processing (Yan et al., 2012). It inherits the simplicity of standard MapReduce, does not modify HDFS, and utilizes the same MapReduce APIs (Yan et al., 2012). When using IncMR, all programs can perform incremental data processing without any modification (Yan et al., 2012).
In summary, various efforts have been exerted by researchers to overcome the incremental computation limitation of Hadoop, such as Incoop, Inc-HDFS, i2MapReduce, and IncMR. Each proposed solution is an attempt to enhance and extend standard Hadoop to avoid overheads such as I/O and to increase efficiency, without increasing the complexity of the computation and without requiring any modification to the code.
MapReduce was introduced to solve the problem of parallel processing of large sets of data in a distributed environment, which previously required manual management of the hardware resources (Fadzil et al., 2012; Sakr & Gaber, 2014). The complexity of parallelization is addressed by using two techniques: the Map/Reduce technique and the Distributed File System (DFS) technique (Fadzil et al., 2012; Sakr & Gaber, 2014). A parallel framework must be reliable to ensure good resource management in a distributed environment of off-the-shelf hardware, and it must be scalable to support any future processing requirements (Fadzil et al., 2012). Earlier frameworks such as the Message Passing Interface (MPI) had reliability and fault-tolerance issues when processing large sets of data (Fadzil et al., 2012). The MapReduce framework covers the two categories of scalability: structural scalability and load scalability (Fadzil et al., 2012). It addresses structural scalability by using the DFS, which allows sizeable virtual storage to be formed for the framework by adding off-the-shelf hardware, and it addresses load scalability by increasing the number of nodes to improve performance (Fadzil et al., 2012).
However, the earlier version of the MapReduce framework faced challenges. Among these challenges are the join operation and the lack of support for aggregate functions to join multiple datasets in one task (Sakr & Gaber, 2014). Another limitation of the standard MapReduce framework is found in iterative processing, which is required for analysis techniques such as the PageRank algorithm, recursive relational queries, and social network analysis (Sakr & Gaber, 2014). The standard MapReduce does not share the execution of work to reduce the overall amount of work (Sakr & Gaber, 2014). Another limitation was the lack of support for data indexes and column storage; only sequential scanning of the input data is supported, and this lack of indexing affects query performance (Sakr & Gaber, 2014).
Moreover, many have argued that MapReduce is not the optimal solution for structured data. It is known as a shared-nothing architecture, which supports scalability (Bakshi, 2012; Jinquan, Jie, Shengsheng, Yan, & Yuanhao, 2012; Sakr & Gaber, 2014; White, 2012) and the processing of large unstructured data sets (Bakshi, 2012). MapReduce also has limitations in performance and efficiency (Lee et al., 2012).
The standard MapReduce framework faced the challenge of iterative computation, which is required in various operations such as data mining, PageRank, network traffic analysis, graph analysis, social network analysis, and so forth (Bu, Howe, Balazinska, & Ernst, 2010; Sakr & Gaber, 2014). These analysis techniques require the data to be processed iteratively until the computation satisfies a convergence or stopping condition (Bu et al., 2010; Sakr & Gaber, 2014). Because of this limitation and this critical requirement, the iterative process is implemented and executed manually using a driver program when using the standard MapReduce framework (Bu et al., 2010; Sakr & Gaber, 2014). However, the manual implementation and execution of such iterative computation has two significant problems (Bu et al., 2010; Sakr & Gaber, 2014). The first problem is that unchanged data are loaded from iteration to iteration, wasting input/output (I/O), network bandwidth, and CPU resources (Bu et al., 2010; Sakr & Gaber, 2014). The second problem is the overhead of detecting the termination condition, that is, whether the output of the application has stopped changing for two consecutive iterations and reached a fixed point (Bu et al., 2010; Sakr & Gaber, 2014). This termination check may require an extra MapReduce job on each iteration, which causes overhead for scheduling extra tasks, reading extra data from disk, and moving data across the network (Bu et al., 2010; Sakr & Gaber, 2014).
Researchers have exerted efforts to solve the iterative computation problem. HaLoop was proposed by (Bu et al., 2010), Twister by (Ekanayake et al., 2010), and Pregel by (Malewicz et al., 2010). One solution to the iterative computation limitation, as in HaLoop (Bu et al., 2010) and Twister (Ekanayake et al., 2010), is to identify and keep invariant data during the iterations so that reading unnecessary data repeatedly is avoided. HaLoop implements two caching functionalities (Bu et al., 2010; Sakr & Gaber, 2014). The first caches the invariant data in the first iteration and reuses it in later iterations. The second caches the reducer outputs, making the check for the fixpoint more efficient without adding an extra MapReduce job (Bu et al., 2010; Sakr & Gaber, 2014).
The Pregel solution (Malewicz et al., 2010) is more focused on graphs and was inspired by the Bulk Synchronous Parallel model (Malewicz et al., 2010). It provides synchronous computation and communication (Malewicz et al., 2010), uses an explicit messaging approach to acquire remote information, and does not replicate remote values locally (Malewicz et al., 2010). Mahout is another solution introduced to address iterative computing by grouping a series of chained jobs to obtain the results (Polato et al., 2014). In the Mahout solution, the result of each job is pushed into the next job until the final results are obtained (Polato et al., 2014). The iHadoop proposed by (Elnikety, Elsayed, & Ramadan, 2011) schedules iterations asynchronously and connects the output of one iteration to the next, allowing both to process their data concurrently (Elnikety et al., 2011). The task scheduler of iHadoop exploits inter-iteration data locality by scheduling tasks that exhibit a producer/consumer relation on the same physical machine, allowing fast transfer of local data (Elnikety et al., 2011).
Apache Hadoop and Apache Spark are the most popular technologies for iterative computation using an in-memory data processing engine (Liang, Li, Wang, & Hu, 2011). Hadoop expresses an iterative computation as a series of MapReduce jobs, where each job independently reads the data from the Hadoop Distributed File System (HDFS), processes the data, and writes the data back to HDFS (Liang et al., 2011). Dacoop was proposed by (Liang et al., 2011) as an extension to Hadoop to handle data-iterative applications, using a caching technique for repeatedly processed data and introducing a shared memory-based data cache mechanism (Liang et al., 2011). iMapReduce is another solution, proposed by (Zhang, Gao, Gao, & Wang, 2012), which supports iterative processing by implementing persistent map and reduce tasks during the whole iterative process and defining how these persistent tasks are terminated (Zhang et al., 2012). iMapReduce avoids three significant overheads: the job startup overhead, avoided by building an internal loop from reduce to map within a job; the communication overhead, avoided by separating the iterated state data from the static structure data; and the synchronization overhead, avoided by allowing asynchronous map task execution (Zhang et al., 2012).
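The in-memory approach to iteration can be illustrated with a toy computation in Spark's Java API, in which the invariant input is cached once and a fixpoint test ends the loop; the data values and the convergence threshold below are arbitrary choices for this sketch and do not come from any of the cited systems.

import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class IterativeCachingExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("IterativeCachingExample")
                .setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Invariant input data: loaded once, cached in memory, and reused
        // on every iteration instead of being re-read from storage.
        List<Double> sample = Arrays.asList(1.0, 2.0, 3.0, 4.0, 5.0);
        JavaRDD<Double> data = sc.parallelize(sample).cache();

        // Toy iterative computation: move a running estimate toward the mean
        // until it stops changing (a fixpoint/stopping condition).
        double estimate = 0.0;
        for (int i = 0; i < 20; i++) {
            double mean = data.reduce((a, b) -> a + b) / data.count();
            double next = (estimate + mean) / 2.0;
            if (Math.abs(next - estimate) < 1e-6) {
                break; // convergence condition reached
            }
            estimate = next;
        }

        System.out.println("Converged estimate: " + estimate);
        sc.close();
    }
}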
(Davenport & Dyché, 2013) have reported that Big Data has had an impact at an international financial services firm. The bank has several objectives for Big Data; however, the primary objective is to exploit “a vast increase in computing power on dollar-for-dollar basis” (Davenport & Dyché, 2013). The bank purchased a Hadoop cluster with 50 server nodes and 800 processor cores, capable of handling a petabyte of data. The bank's data scientists take the existing analytical procedures and convert them into the Hive scripting language to run on the Hadoop cluster.
Big Data with high velocity has created opportunities and requirements for organizations to increase their capability of real-time sense and response (Chan, 2014). Real-time analysis and rapid response are critical features of Big Data management in many business situations (Chan, 2014). For instance, as cited in (Chan, 2014), IBM (2013), by scrutinizing the five million trade events created each day, identified potential fraud, and by analyzing 500 million daily call detail records in real time, was able to predict customer churn faster (Chan, 2014).
“Fraud detection is one of the most
visible uses of big data analytics” (Cardenas et al., 2013).
Credit card and phone companies have conducted large-scale fraud
detection for decades (Cardenas et al., 2013).
However, the custom-built infrastructure necessary to mine Big Data for fraud detection was not economical enough for wide-scale adoption. One of the significant impacts of BDA technologies is that they enable a wide variety of industries to develop affordable infrastructure for security monitoring (Cardenas et al., 2013).
These new BD technologies of the Hadoop ecosystem, together with stream mining, complex-event processing, and NoSQL databases, have transformed security analytics by facilitating the storage, maintenance, and analysis of security information at unprecedented scale and speed (Cardenas et al., 2013).
Big Data Analytics can be used in marketing as a competitive edge by reducing the time to respond to customers through rapid data capture, aggregation, processing, and analytics. Harrah's (currently Caesars) Entertainment has acquired both Hadoop clusters and open-source and commercial analytics software, with the primary objective of exploring and implementing Big Data to respond in real time to customer marketing and service needs. GE is another example and is regarded as the most prominent creator of new service offerings based on Big Data (Davenport & Dyché, 2013). The primary focus of GE has been to optimize the service contracts and maintenance intervals for industrial products.
The purpose of this Part is to go through the installation of Hadoop on a single cluster node using the Windows 10 operating system. It covers fourteen significant tasks, starting from the download of the software from the Apache site to the demonstration of the successful installation and configuration. The steps of the installation are derived from the installation guide of (apache.org, 2018). Due to the lack of system resources, the Windows operating system was the most appropriate choice for this installation and configuration, although the researcher prefers Unix over Windows due to extensive experience with Unix. However, the installation and configuration experience on Windows has its value as well.
The purpose of this task is to download the required Hadoop software for the Windows operating system from the following link: http://www-eu.apache.org/dist/hadoop/core/stable/. Although a version higher than 2.9.1 exists, the researcher selected version 2.9.1, which is the core stable version recommended by Apache.
The purpose of this task is to install Java, which is required for Hadoop as indicated in the administration guide. Java 1.8.0_111 is installed on the system, as shown below.
The purpose of this task is to set up the Hadoop configuration by editing the core-site.xml file in C:\Hadoop-2.9.1\etc\hadoop and adding the fs.defaultFS property to identify the file system for Hadoop, using localhost and port 9000.
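A minimal core-site.xml entry consistent with this task is sketched below; the original screenshot is not reproduced here, so this should be read as an assumed equivalent rather than the exact file from the setup.

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>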
The purpose of this task is to set up the configuration for Hadoop MapReduce by editing mapred-site.xml and adding the property shown below between the configuration tags.
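For a YARN-based Hadoop 2 setup, the property typically added here is mapreduce.framework.name; since the original screenshot is not reproduced, the entry below is an assumption of what was configured rather than a copy of it.

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>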
The purpose of this task is to create two important folders, for the data node and the name node, which are required for the Hadoop file system. Create a folder "data" under the Hadoop home C:\Hadoop-2.9.1, then create the folders "datanode" and "namenode" under C:\Hadoop-2.9.1\data.
The purpose of this task is to set up the configuration for Hadoop HDFS by editing the file C:\Hadoop-2.9.1\etc\hadoop\hdfs-site.xml and adding the properties for dfs.replication, the name node directory, and the data node directory, as shown below.
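An hdfs-site.xml fragment consistent with the folders created in the previous task is sketched below; the replication factor of 1 and the exact path syntax are assumptions for a single-node Windows setup and may need adjusting.

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>C:\Hadoop-2.9.1\data\namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>C:\Hadoop-2.9.1\data\datanode</value>
  </property>
</configuration>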
The purpose of this task is to set up the configuration for the YARN tool by editing the file C:\Hadoop-2.9.1\etc\hadoop\yarn-site.xml and adding the yarn.nodemanager.aux-services property with its value of mapreduce_shuffle, as shown below.
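A yarn-site.xml fragment matching this description is sketched below; the second property, which points the shuffle service at the standard ShuffleHandler class, is commonly paired with the first and is included here as an assumption, not as a copy of the original file.

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>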
The purpose of this task is to overcome the Java error shown below by editing C:\Hadoop-2.9.1\etc\hadoop\hadoop-env.cmd and setting JAVA_HOME.
The purpose of this task is to test the current configuration and setup by issuing the following command before running Hadoop. The command will throw an error indicating that HADOOP_COMMON_HOME is not found.
To overcome the HADOOP_COMMON_HOME "not found" error, edit hadoop-env.cmd and add the following, then issue the command again; it will pass as shown below.
The purpose of this task is to open the Hadoop cluster page from the browser after the previous configuration setup. If the configuration setup was implemented successfully, the cluster page is displayed with the Hadoop functionality as shown below; otherwise, a 404 "page not found" error is thrown.
This project has discussed various significant topics related to Big Data Analytics. It addressed two significant parts. Part-I discussed Big Data and the emerging technology of Hadoop. It provided an overview of the Hadoop ecosystem, its building blocks, benefits, and limitations. It also discussed the MapReduce framework, its benefits, and limitations. Part-I also provided a few success stories of Hadoop technology use with Big Data Analytics. Part-II addressed the installation and configuration of Hadoop on the Windows operating system using fourteen essential tasks. It also addressed the errors encountered during the configuration setup and the techniques to overcome these errors and proceed successfully with the Hadoop installation.
Abbasi,
A., Sarker, S., & Chiang, R. (2016). Big data research in information
systems: Toward an inclusive research agenda. Journal of the Association for Information Systems, 17(2), 3.
Alam, A., & Ahmed, J. (2014). Hadoop architecture and its issues. Paper
presented at the Computational Science and Computational Intelligence (CSCI),
2014 International Conference on.
Bakshi, K. (2012). Considerations for big data: Architecture
and approach. Paper presented at the Aerospace Conference, 2012 IEEE.
Bao, Y., Ren, L., Zhang, L., Zhang,
X., & Luo, Y. (2012). Massive sensor
data management framework in cloud manufacturing based on Hadoop. Paper
presented at the Industrial Informatics (INDIN), 2012 10th IEEE International
Conference on.
Bates, D. W., Saria, S.,
Ohno-Machado, L., Shah, A., & Escobar, G. (2014). Big data in health care:
using analytics to identify and manage high-risk and high-cost patients. Health Affairs, 33(7), 1123-1131.
Bhatotia, P., Wieder, A., Rodrigues,
R., Acar, U. A., & Pasquin, R. (2011). Incoop:
MapReduce for incremental computations. Paper presented at the Proceedings
of the 2nd ACM Symposium on Cloud Computing.
Bi, Z., & Cochran, D. (2014). Big
data analytics with applications. Journal
of Management Analytics, 1(4), 249-265.
Bu, Y., Howe, B., Balazinska, M.,
& Ernst, M. D. (2010). HaLoop: Efficient iterative data processing on large
clusters. Proceedings of the VLDB
Endowment, 3(1-2), 285-296.
Cardenas, A. A., Manadhata, P. K.,
& Rajan, S. P. (2013). Big data analytics for security. IEEE Security & Privacy, 11(6),
74-76.
Chan, J. O. (2014). An architecture
for big data analytics. Communications of
the IIMA, 13(2), 1.
Chandarana, P., & Vijayalakshmi,
M. (2014). Big Data analytics frameworks.
Paper presented at the Circuits, Systems, Communication and Information
Technology Applications (CSCITA), 2014 International Conference on.
Chang, V., Kuo, Y.-H., &
Ramachandran, M. (2016). Cloud computing adoption framework: A security
framework for business clouds. Future
Generation computer systems, 57, 24-41. doi:10.1016/j.future.2015.09.031
Chopra, A., & Madan, S. (2015).
Big Data: A Trouble or A Real Solution? International
Journal of Computer Science Issues (IJCSI), 12(2), 221.
CSA, C. S. A. (2013). Big Data
Analytics for Security Intelligence. Big
Data Working Group.
Davenport, T. H., & Dyché, J.
(2013). Big data in big companies. International
Institute for Analytics.
De Mauro, A., Greco, M., &
Grimaldi, M. (2015). What is big data? A
consensual definition and a review of key research topics. Paper presented
at the AIP Conference Proceedings.
Ekanayake, J., Li, H., Zhang, B.,
Gunarathne, T., Bae, S.-H., Qiu, J., & Fox, G. (2010). Twister: a runtime for iterative mapreduce. Paper presented at the
Proceedings of the 19th ACM international symposium on high performance
distributed computing.
Elnikety, E., Elsayed, T., &
Ramadan, H. E. (2011). iHadoop:
asynchronous iterations for MapReduce. Paper presented at the Cloud
Computing Technology and Science (CloudCom), 2011 IEEE Third International
Conference on.
Emani, C. K., Cullot, N., &
Nicolle, C. (2015). Understandable big data: A survey. Computer science review, 17, 70-81.
Erl, T., Khattak, W., & Buhler,
P. (2016). Big Data Fundamentals:
Concepts, Drivers & Techniques: Prentice Hall Press.
Fadzil, A. F. A., Khalid, N. E. A.,
& Manaf, M. (2012). Performance of
scalable off-the-shelf hardware for data-intensive parallel processing using
MapReduce. Paper presented at the Computing and Convergence Technology
(ICCCT), 2012 7th International Conference on.
Gantz, J., & Reinsel, D. (2011).
Extracting value from chaos. IDC iview,
1142, 1-12.
Géczy, P. (2014). Big data
characteristics. The Macrotheme Review, 3(6),
94-104.
Gupta, B., & Jyoti, K. (2014).
Big data analytics with hadoop to analyze targeted attacks on enterprise data.
Gupta, R., Gupta, H., & Mohania,
M. (2012). Cloud computing and big data
analytics: what is new from databases perspective? Paper presented at the
International Conference on Big Data Analytics.
Hirzel, M., Andrade, H., Gedik, B.,
Jacques-Silva, G., Khandekar, R., Kumar, V., . . . Soulé, R. (2013). IBM
streams processing language: Analyzing big data in motion. IBM Journal of Research and Development, 57(3/4), 7: 1-7: 11.
Hu, H., Wen, Y., Chua, T.-S., &
Li, X. (2014). Toward scalable systems for big data analytics: A technology
tutorial. IEEE Access, 2, 652-687.
Hu, P., & Dai, W. (2014).
Enhancing fault tolerance based on Hadoop cluster. International Journal of Database Theory and Application, 7(1),
37-48.
Inukollu, V. N., Arsi, S., &
Ravuri, S. R. (2014). Security issues associated with big data in cloud
computing. International Journal of
Network Security & Its Applications, 6(3), 45.
Jinquan, D., Jie, H., Shengsheng, H.,
Yan, L., & Yuanhao, S. (2012). The Hadoop Stack: New Paradigm for Big Data
Storage and Processing. Intel Technology
Journal, 16(4), 92-110.
Kaisler, S., Armour, F., Espinosa, J.
A., & Money, W. (2013). Big data:
issues and challenges moving forward. Paper presented at the System
Sciences (HICSS), 2013 46th Hawaii International Conference on System Sciences.
Katal, A., Wazid, M., & Goudar,
R. (2013). Big data: issues, challenges,
tools and good practices. Paper presented at the Contemporary Computing
(IC3), 2013 Sixth International Conference on Contemporary Computing.
Khan, N., Yaqoob, I., Hashem, I. A.
T., Inayat, Z., Mahmoud Ali, W. K., Alam, M., . . . Gani, A. (2014). Big Data:
Survey, Technologies, Opportunities, and Challenges. The Scientific World Journal, 2014.
Krishnan, K. (2013). Data warehousing in the age of big data:
Newnes.
Lee, K.-H., Lee, Y.-J., Choi, H.,
Chung, Y. D., & Moon, B. (2012). Parallel data processing with MapReduce: a
survey. ACM SIGMOD Record, 40(4),
11-20.
Liang, Y., Li, G., Wang, L., &
Hu, Y. (2011). Dacoop: Accelerating
data-iterative applications on Map/Reduce cluster. Paper presented at the
Parallel and Distributed Computing, Applications and Technologies (PDCAT), 2011
12th International Conference on.
Malewicz, G., Austern, M. H., Bik, A.
J., Dehnert, J. C., Horn, I., Leiser, N., & Czajkowski, G. (2010). Pregel: a system for large-scale graph
processing. Paper presented at the Proceedings of the 2010 ACM SIGMOD
International Conference on Management of data.
Manyika, J., Chui, M., Brown, B.,
Bughin, J., Dobbs, R., Roxburgh, C., & Byers, A. H. (2011). Big data: The
next frontier for innovation, competition, and productivity.
Minelli, M., Chambers, M., &
Dhiraj, A. (2013). Big Data, Big
Analytics: Emerging Business Intelligence and Analytic Trends for Today’s
Businesses: John Wiley & Sons.
Mishra, B. S. P., Dehuri, S., &
Kim, E. (2016). Techniques and
Environments for Big Data Analysis: Parallel, Cloud, and Grid Computing
(Vol. 17): Springer.
Moorthy, M., Baby, R., &
Senthamaraiselvi, S. (2014). An Analysis for Big Data and its Technologies. International Journal of Science,
Engineering and Computer Technology, 4(12), 412.
Nasser, T., & Tariq, R. (2015).
Big Data Challenges. J Comput Eng Inf
Technol 4: 3. doi:10.4172/2324, 9307, 2.
Polato, I., Ré, R., Goldman, A.,
& Kon, F. (2014). A comprehensive view of Hadoop research—A systematic
literature review. Journal of Network and
Computer Applications, 46, 1-25.
Ramesh, B. (2015). Big Data
Architecture Big Data (pp. 29-59):
Springer.
Sakr, S., & Gaber, M. (2014). Large Scale and big data: Processing and
Management: CRC Press.
White, T. (2012). Hadoop: The definitive guide: O'Reilly Media, Inc.
Yan, C., Yang, X., Yu, Z., Li, M.,
& Li, X. (2012). Incmr: Incremental
data processing based on mapreduce. Paper presented at the Cloud
Computing (CLOUD), 2012 IEEE 5th International Conference on.
Yang, H.-c., Dasdan,
A., Hsiao, R.-L., & Parker, D. S. (2007). Map-reduce-merge: simplified relational data processing on large
clusters. Paper presented at the Proceedings of the 2007 ACM SIGMOD
international conference on Management of data.
Zhang, Y., Chen,
S., Wang, Q., & Yu, G. (2015). i^2MapReduce: Incremental MapReduce for
Mining Evolving Big Data. IEEE
transactions on knowledge and data engineering, 27(7), 1906-1919.
Zhang, Y., Gao, Q., Gao, L., & Wang, C. (2012).
imapreduce: A distributed computing framework for iterative computation. Journal of Grid Computing, 10(1), 47-68.
The purpose of this discussion is to address the advantages and disadvantages of XML used in big data analytics for large healthcare organizations. The discussion also presents the use of XML in the healthcare industry as well as in other industries such as eCommerce.
Advantages of XML
XML has several advantages, such as simplicity, platform and vendor independence, extensibility, reuse by many applications, separation of content and presentation, and improved load balancing (Connolly & Begg, 2015; Fawcett, Ayers, & Quin, 2012). XML also provides support for the integration of data from multiple sources (Connolly & Begg, 2015; Fawcett et al., 2012). XML can describe data from a wide variety of applications (Connolly & Begg, 2015; Fawcett et al., 2012). More advanced search engine capabilities are another advantage of XML (Connolly & Begg, 2015). (Brewton, Yuan, & Akowuah, 2012) have identified two significant benefits of XML. The first is that XML supports tags created by the users, which makes the language fully extensible and overcomes any tag limitation. The second significant benefit of XML in healthcare is its versatility: any data type can be modeled, and tags can be created for specific contexts.
Disadvantages
of XML
The specification of the namespace prefix within DTDs is a significant limitation, as users cannot choose their namespace prefix but must use the prefix defined within the DTD (Fawcett et al., 2012). This limitation exists because the W3C completed the XML Recommendation before finalizing how namespaces would work. While the DTD has poor support for XML namespaces, it plays an essential part in the XML Recommendation. Furthermore, (Forster, 2008) has identified a few disadvantages of XML. Inefficiency is one of these limitations, as XML was initially designed to accommodate the exchange of data between nodes of different systems and not as a database storage platform. XML is described as inefficient compared to other storage algorithms (Forster, 2008). The tags of XML make it readable to humans but require additional storage and bandwidth (Forster, 2008). Encoded image data represented in XML requires another program to be displayed, as it must be decoded and then reassembled into an image (Forster, 2008).
There are also three kinds of XML parsers that inexperienced developers may not be familiar with: programs, APIs, and engines. XML lacks rendering instructions, as it is a back-end data storage and transmission technology. (Brewton et al., 2012) have identified two significant limitations of XML. The first is the lack of applications that can process XML data and make it useful; browsers utilize HTML to render XML documents, which indicates that XML cannot be used as a language independent of HTML. The second major limitation of XML is the unlimited flexibility of the language: the tags are created by the user, and there is no standard, accepted set of tags to be used in an XML document. The result of this limitation is that developers cannot create general applications, as each company will have its own application with its own set of tags.
XML in Healthcare
Concerning XML in healthcare, (Brewton et al., 2012) have indicated that XML was a solution to the problem of finding a reliable and standardized means for storing and exchanging clinical documents. The American National Standards Institute has accredited Health Level Seven (HL7) as the organization responsible for setting up many of the communication standards used across America (Brewton et al., 2012). The goal of this organization is to provide standards for the exchange, management, and integration of data that support clinical patient care and the management, delivery, and evaluation of healthcare services (Brewton et al., 2012). Furthermore, HL7 developed the Clinical Document Architecture (CDA) to provide standards for the representation of clinical documents such as discharge summaries and progress notes. The goal of CDA is to solve the problem of finding a reliable and standardized means for storing and exchanging clinical documents by specifying a markup and semantic structure through XML, allowing medical institutions to share clinical documents. HL7 version 3 includes the rules for messaging as well as CDA, which are implemented with XML and are derived from the Reference Information Model (RIM). In addition, XML supports the hierarchical structure of CDA (Brewton et al., 2012). Healthcare data must be secured to protect the privacy of patients. XML provides signature capabilities that operate identically to regular digital signatures (Brewton et al., 2012).
In addition to XML Signature, XML has encryption capabilities that mandate requirements for areas not covered by the Secure Sockets Layer technique (Brewton et al., 2012). (Goldberg et al., 2005) have identified some limitations of XML when working with images in the biological domain. The bulk of an image file is represented by the pixels in the image and not by the metadata, which is regarded as a severe problem. Another related problem is that XML is verbose, meaning that an XML file is already larger than the corresponding binary file, and image files are already quite large, which causes another problem when using XML in healthcare (Goldberg et al., 2005).
XML in eCommerce
(Sadath, 2013) has discussed some benefits and limitations of XML in the eCommerce domain. XML has the advantage of being a flexible hierarchical model suitable for representing semi-structured data. It is used effectively in data mining and is described as the most common tool used for data transformation between different types of applications. In data mining using XML, there are two approaches to accessing an XML document: keyword-based search and query answering. The keyword-based approach offers little advantage because the search takes place only on the textual content of the document. However, when using the query-answering approach to access the XML document, the structure should be known in advance, which is often not the case. The consequence of such a lack of knowledge about the structure can be information overload, where too much data is included because the information referenced by the keyword does not exist or, if it exists incorrectly, incorrect answers are received (Sadath, 2013). Thus, various efforts have been exerted by researchers to find the best approach for data mining in XML, such as XQuery or Tree-based Association Rules (TARs) as means to represent intensional knowledge in native XML.
References
Brewton, J., Yuan, X., & Akowuah, F.
(2012). XML in health information
systems. Paper presented at the Proceedings of the International Conference
on Bioinformatics & Computational Biology (BIOCOMP).
Connolly, T.,
& Begg, C. (2015). Database Systems:
A Practical Approach to Design, Implementation, and Management (6th Edition
ed.): Pearson.
Fawcett, J.,
Ayers, D., & Quin, L. R. (2012). Beginning
XML: John Wiley & Sons.
Goldberg, I. G., Allan,
C., Burel, J.-M., Creager, D., Falconi, A., Hochheiser, H., . . . Swedlow, J.
R. (2005). The Open Microscopy Environment (OME) Data Model and XML file: open
tools for informatics and quantitative analysis in biological imaging. Genome biology, 6(5), R47.
Sadath, L. (2013).
Data mining in E-commerce: A CRM Platform. International
Journal of Computer Applications, 68(24).
The purpose of this discussion is to discuss how XML is used to represent Big Data in various forms. The discussion begins with some basic information about XML, followed by the three pillars of XML, XML elements and attributes, and the document-centric versus data-centric views of XML. Big Data and XML representation are also addressed in this discussion, followed by XML processing efficiency and Hadoop technology.
What is XML
XML stands for eXtensible Markup Language, which can be utilized to describe data in a meaningful way (Fawcett, Ayers, & Quin, 2012). It has gained a good reputation due to its ability to provide interoperability among various applications and to pass data between different components (Fawcett et al., 2012). It has been used to describe documents and data in a text-based standardized format that can be transferred via standard Internet protocols (Benz & Durant, 2004; Fawcett et al., 2012). Various standardized formats for XML are available, known as "schemas," which represent various types of data such as medical records, financial transactions, and GPS data (Fawcett et al., 2012).
Unlike HTML, XML has no tags of its own (Howard, 2010). Instead, XML allows users to create their own tags as needed, provided that these tags follow the rules of the XML specification (Howard, 2010). These rules include a single root element, closing tags, properly nested elements, case sensitivity, and quoted attribute values (Howard, 2010). Figure 1 summarizes these rules of XML, which a document type definition (DTD) or schema enforces.
Figure 1. A Summary of the XML Rules.
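As a brief illustration of these well-formedness rules, the following is a minimal R sketch (an assumed example, not part of the original text; the element and attribute names are hypothetical) that uses the xml2 package to parse a small document. A document that violates the rules, such as one with improperly nested tags, fails to parse.
## minimal sketch (assumed example): checking XML well-formedness with the xml2 package
library(xml2)
good <- '<?xml version="1.0"?>
<note importance="high">
  <to>Reader</to>
  <body>A well-formed XML fragment.</body>
</note>'
doc <- read_xml(good)        ## parses because the rules are followed
xml_name(doc)                ## root element name: "note"
bad <- "<note><to>Reader</note></to>"
try(read_xml(bad))           ## improperly nested tags raise a parse error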
Like HTML, XML is
based on Standard Generalized Markup Language (SGML) (Benz & Durant, 2004; Fawcett et al., 2012; Nambiar, Lacroix, Bressan,
Lee, & Li, 2002). SGML was developed in 1974 as part of the IBM document-sharing project and was officially standardized by the International Organization for Standardization
(ISO) in 1986 (Benz & Durant, 2004). Although SGML was developed to define various types
of markup, it was found complicated, and hence few applications could read SGML
(Fawcett et al., 2012; Nambiar et al., 2002). Hyper Text
Markup Language (HTML) was the first adoption of the SGML (Benz & Durant, 2004). HTML was explicitly designed to describe documents for
display in a Web browser. However, with
the explosion of the Web and the need for more than just displaying data in a
Web browser, the developers struggled with the effectiveness of the HTML and
strived to find a method which could describe data more effectively than HTML on
the web (Benz & Durant, 2004).
In 1998, the World Wide Web Consortium (W3C) has combined the basic characteristics which
separate data from the format in SGML,
with the extension of the HTML tag
formats which were used for the Web and developed the first XML Recommendation (Benz & Durant, 2004). (Myer, 2005) have described HTML as a presentation language,
while XML as a data-description language. In brief, XML was designed to
overcome the limitation of both SGML and HTML (Benz & Durant, 2004; Fawcett et al., 2012; Nambiar et al., 2002).
Three Pillars of XML
Benz and Durant (2004) have identified three pillars of XML: extensibility, structure, and validity (Figure 2). The extensibility pillar reflects the ability of XML to describe structured data as text, with a format that is open to extension, meaning that any data which can be described as text and nested in XML tags can be generated as an XML file. Structure is the second pillar: an XML file can be complicated for a human to follow, but it is designed to be read by applications, and XML parsers and other tools that read XML are designed to handle the format easily. Data representations using XML are, however, much larger than their original formats (Benz & Durant, 2004; Fawcett et al., 2012). The validity pillar means that the data of an XML file can optionally be validated for structure and content, based on two validation standards: the document type definition (DTD) and the XML Schema standard (Benz & Durant, 2004).
Figure 2. Three Pillars of XML.
XML Elements and Attributes
Because XML was developed to describe data and documents more effectively than HTML, the W3C XML Recommendation provides strict instructions on the format requirements that distinguish an XML file with well-defined tags from an arbitrary text file that merely contains tags (Benz & Durant, 2004; Fawcett et al., 2012; Howard, 2010).
There are two main features of an XML file, known as “elements” and “attributes” (Benz & Durant, 2004; Fawcett et al., 2012; Howard, 2010). In the example below, “applicationUsers” and “user” reflect the element feature of XML, while “firstName” and “lastName” reflect the attribute feature. Text can be placed between the opening and closing tags of an element to represent the actual data associated with the elements and attributes surrounding the text (Benz & Durant, 2004; Fawcett et al., 2012; Howard, 2010). Figure 3 shows the elements and attributes in a simple way. Figure 4 illustrates a simple XML document, including the first line, which declares which version of the W3C XML Recommendation the document adheres to, in addition to XML rules such as the root element; a short parsing sketch follows the figure captions.
Figure 3. XML Elements and Features.
Figure 4. Simple XML Document Format Adapted from (Benz & Durant, 2004).
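To make the element/attribute distinction concrete, the following is a minimal R sketch (an assumed illustration; the user names are hypothetical and not taken from the original figures) that parses a document shaped like the “applicationUsers” example and extracts elements, attributes, and element text with the xml2 package.
## minimal sketch (assumed example): elements vs. attributes with xml2
library(xml2)
users_xml <- '<?xml version="1.0"?>
<applicationUsers>
  <user firstName="Jane" lastName="Doe">standard account</user>
  <user firstName="John" lastName="Smith">administrator account</user>
</applicationUsers>'
doc   <- read_xml(users_xml)
users <- xml_find_all(doc, "//user")   ## element nodes named "user"
xml_attr(users, "firstName")           ## attribute values: "Jane" "John"
xml_attr(users, "lastName")            ## attribute values: "Doe"  "Smith"
xml_text(users)                        ## text between the opening and closing tags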
Document-Centric vs. Data-Centric View of XML
XML documents can use a DTD to define their structure. XML documents can take any structure, while relational and object-relational data models have a fixed, pre-defined structure (Bourret, 2010; Nambiar et al., 2002). Structured XML data, in combination with a DTD or an XML Schema specification, is described as the data-centric view of XML (Bourret, 2010; Nambiar et al., 2002). The data-centric view of XML is highly structured, similar to a relational database, and the order of sibling elements is not essential in such documents. Various query languages were developed for the data-centric format of XML, such as XML-QL, LOREL, and XQL, which require data to be fully structured (Nambiar et al., 2002). Data-centric XML is typically handled by an XML-enabled database or by third-party software such as middleware, data integration software, or a Web application server (Bourret, 2010).
However, the document-centric
format of XML is highly unstructured (Nambiar et al., 2002). The data in the
document-centric format of XML can be
stored and retrieved using a native-XML database or document management system (Bourret, 2010). Furthermore,
the implicit and explicit order of the elements matters in such XML documents (Nambiar et al., 2002). The implicit order is represented by the order of the elements within a file in a tree-like representation, while the explicit order is represented by an attribute or a tag in the document (Nambiar et al., 2002). The explicit order can be expressed in a relational database, whereas capturing the implicit order while converting a document-centric XML document into a relational database has been a challenge (Nambiar et al., 2002). In addition to the implicit-order challenge, XML documents differ from a relational representation by allowing deep nesting and hyper-linked components (Nambiar et al., 2002). Transforming implicit order, nesting, and hyperlinks into tables can be a solution; however, such a transformation is costly in terms of time and space (Nambiar et al., 2002). Thus, XML processing efficiency has been a challenge.
Big Data and XML Representation
Big Data was first defined using the well-known 3V features: volume, velocity, and variety (Wang, Kung, & Byrd, 2018; Ylijoki & Porras, 2016). The volume feature reflects the magnitude and size of the data, from terabytes to exabytes. The velocity feature reflects the speed of data growth and of data processing, from batch to real-time and streaming. The variety feature reflects the various types of data, from text to graph data, including structured as well as unstructured and semi-structured data (Wang et al., 2018; Ylijoki & Porras, 2016).
Big Data development has gone through an evolutionary phase as well as a revolutionary phase. The evolutionary phase of Big Data development spanned the period from 2001 to 2008 (Wang et al., 2018). During that evolutionary period, it became possible for sophisticated software to meet the needs and requirements of dealing with the explosive growth of data (Wang et al., 2018). Analytics modules were added through software and application developments such as XML web services, database management systems, and Hadoop, in addition to functions added to core modules that focused on enhancing usability for end users (Wang et al., 2018). These software and application developments enabled users to process large amounts of data within and across organizations collaboratively as well as in real time (Wang et al., 2018). During the 2000s, XML became the standard formatting language for semi-structured data, mostly for online purposes, which led to the development of XML databases, regarded as a new generation of databases (Verheij, 2013).
Healthcare organizations, at the same time, began to digitize medical records and aggregate clinical data in substantial electronic databases (Wang et al., 2018). Such developments of software and applications like XML web services, database management systems, and Hadoop made the significant volume of healthcare data storable, usable, searchable, and actionable, and assisted healthcare providers in practicing medicine more effectively (Wang et al., 2018).
Starting from 2009, Big Data Analytics entered an advanced, revolutionary phase, in which the computing of big data became a breakthrough innovation for Business Intelligence (Wang et al., 2018). In addition, data management and its techniques were predicted to shift from structured to unstructured data and from a static environment to a ubiquitous cloud-based environment (Wang et al., 2018). The data for the healthcare industry continued to grow, and as of 2011, the stored data for healthcare reached 150 exabytes (1 EB = 10^18 bytes) worldwide, mainly in the form of electronic health records (Wang et al., 2018). Other Big Data Analytics pioneers, including banks and e-commerce companies, started to experience its impact on business process improvement, workforce effectiveness, cost reduction, and new customer attraction (Wang et al., 2018).
The data management approaches for Big Data include various types of databases such as columnar, document stores, key-value/tuple stores, graph, multi-model, object, grid and cloud database solutions, XML databases, multi-dimensional, and multi-value (Williams, 2016). Big Data analytics systems are distinguished from traditional data management systems by their capability to analyze semi-structured or unstructured data, which traditional data management systems lack (Williams, 2016). XML, as a textual language for exchanging data on the Web, is regarded as a typical example of semi-structured data (Benz & Durant, 2004; Gandomi & Haider, 2015; Nambiar et al., 2002).
In various industries such as healthcare, semi-structured and unstructured data refer to information which cannot be stored in a traditional database and cannot fit into predefined data models. Examples of such semi-structured and unstructured healthcare data include XML-based electronic healthcare records, clinical images, medical transcripts, and lab results. As a case study, Luo, Wu, Gopukumar, and Zhao (2016) have referenced a project in which a hybrid XML database and Hadoop/HBase infrastructure were used to design a Clinical Data Managing and Analyzing System.
XML Processing Efficiency and Hadoop Technology
Organizations can derive value from XML documents, which reflect semi-structured data (Aravind & Agrawal, 2014). To derive value from these semi-structured XML documents, the XML data needs to be ingested into Hadoop for analytic purposes (Aravind & Agrawal, 2014). However, Hadoop does not offer a standard XML “RecordReader,” even though XML is one of the standard file formats used with MapReduce (Lublinsky, Smith, & Yakubovich, 2013).
There is an increasing demand for efficient processing of large volumes of data stored in XML using Apache Hadoop MapReduce (Vasilenko & Kurapati, 2014). Various approaches have been used to process XML efficiently. An ETL process for extracting the data is one approach (Vasilenko & Kurapati, 2014). Transforming XML into other formats that are natively supported by Hive is another technique for efficient processing of XML (Vasilenko & Kurapati, 2014). A further approach is to use the Apache Hive XPath UDFs; however, these functions can only be used in Hive views and SELECT statements, not in CREATE TABLE DDL (Vasilenko & Kurapati, 2014). Subhashini and Arya (2012) have described a few attempts by various researchers, such as a generic XML-based Web information extraction solution built on two key technologies: XML-based Web data conversion and XSLT (Extensible Stylesheet Language Transformations). The XML-based Web data conversion technology is used to convert HTML into an XHTML document using XML rules and to build an XML DOM tree; a DOM-based XPath algorithm then generates XPath expressions for the desired information nodes once the information points are marked by the users. XSLT is used to extract the required information from the XHTML document, and the results of the extraction are expressed in XML (Subhashini & Arya, 2012). XSLT is regarded as one of the most important XML technologies to consider when solving information processing issues (Holman, 2017). Other attempts included the use of a wrapper based on the XBRL (eXtensible Business Reporting Language) GL taxonomy to extract financial data from the web (Subhashini & Arya, 2012). These are a few of the attempts to solve the processing issues outlined by Subhashini and Arya (2012).
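As a simple illustration of the DOM/XPath extraction idea described above (not the XSLT toolchain itself), the following R sketch uses the xml2 package to parse an HTML fragment into a DOM tree and pull out nodes with an XPath expression; the fragment and the XPath are hypothetical examples.
## minimal sketch (assumed example): DOM-based XPath extraction in R
library(xml2)
page <- '<html><body>
  <div class="price">19.99</div>
  <div class="price">24.50</div>
</body></html>'
doc    <- read_html(page)                               ## convert HTML to a DOM tree
prices <- xml_find_all(doc, "//div[@class='price']")    ## XPath for the desired nodes
xml_text(prices)                                        ## extracted text: "19.99" "24.50"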
References
Fawcett, J., Ayers, D., & Quin, L. R. (2012). Beginning XML. John Wiley & Sons.
Gandomi, A., & Haider, M. (2015). Beyond the hype: Big data concepts, methods, and analytics. International Journal of Information Management, 35(2), 137-144.
Howard, G. K. (2010). XML: Visual QuickStart Guide (2nd ed.). Pearson Education India.
Lublinsky, B., Smith, K. T., & Yakubovich, A. (2013). Professional Hadoop solutions. John Wiley & Sons.
Luo, J., Wu, M., Gopukumar, D., & Zhao, Y. (2016). Big data application in biomedical research and health care: A literature review. Biomedical Informatics Insights, 8, BII.S31559.
Nambiar, U., Lacroix, Z., Bressan, S., Lee, M. L., & Li, Y. G. (2002). Efficient XML data management: An analysis. Paper presented at the International Conference on Electronic Commerce and Web Technologies.
Subhashini, C., & Arya, A. (2012). A framework for extracting information from web using VTD-XML's XPath. International Journal on Computer Science and Engineering, 4(3), 463.
Vasilenko, D., & Kurapati, M. (2014). Efficient processing of XML documents in Hadoop MapReduce.
Verheij, B. (2013). The process of big data solution adoption. TU Delft, Delft University of Technology.
Wang, Y., Kung, L., & Byrd, T. A. (2018). Big data analytics: Understanding its capabilities and potential benefits for healthcare organizations. Technological Forecasting and Social Change, 126, 3-13.
Williams, S. (2016). Business intelligence strategy and big data analytics: A general management perspective. Morgan Kaufmann.
Ylijoki, O., & Porras, J. (2016). Perspectives to definition of big data: A mapping study and discussion. Journal of Innovation Management, 4(1), 69-91.
The purpose of this project is to analyze the online radio dataset (lastfm.csv). The project is divided into two main Parts. Part-I evaluates and examines the dataset using RStudio and involves three major tasks to review and understand the Dataset variables. Part-II discusses the pre-data analysis by converting the Dataset to a data frame and involves three major tasks to analyze the data frame. The Association Rule data mining technique is used in this project. The support for each of the 1,004 artists is calculated, and the support is displayed for all artists with support larger than 8%, indicating that the artists shown on the graph (Figure 4) are played by more than 8% of the users. The construction of the association rules is implemented using the “apriori” function in the R package arules. The search was implemented for artists or groups of artists who have support larger than 1% and who give confidence larger than 50% to another artist. These requirements rule out rare artists. The calculation and listing of antecedents (LHS) involving more than one artist are also implemented. The list is further narrowed down by requiring that the lift is larger than 5, and the resulting list is ordered by decreasing confidence, as illustrated in Figure 6.
Keywords: Online Radio, Association Rule Data Mining Analysis
Introduction
This project examines and analyzes the lastfm.csv dataset. The dataset is downloaded from the CTU course materials. The lastfm.csv dataset reflects an online radio service that keeps track of everything the user plays. It has 289,955 observations with four variables. The focus of this analysis is the Association Rule technique. The information in the dataset is used for recommending music the user is likely to enjoy and supports focused marketing, which sends the user advertisements for music the user is likely to buy. From the available information, such as demographic information (age, sex, and location), the support for the frequencies of listening to various individual artists can be determined, as well as the joint support for pairs or larger groupings of artists. To calculate such support, the count of incidences (0/1) is computed across all members of the network, and those frequencies are divided by the number of members. From the support, the confidence and the lift are calculated.
This
project addresses two major Parts. Part-I covers the following key Tasks to
understand and examine the Dataset of “lastfm.csv.”
Task-1: Review the Variables of the Dataset.
Task-2: Load and Understand the Dataset Using
names(), head(), dim() Functions.
Task-3: Examine the Dataset,
Summary of the Descriptive Statistics, and Visualization of the Variables.
Part-II covers the following three primary Tasks to plot, discuss, and analyze the results.
Task-1: Required Computations for
Association Rules and Frequent Items.
Task-2: Association Rules.
Task-3: Discussion and Analysis.
Various resources were utilized to develop the required code using R. These resources include (Ahlemeyer-Stubbe & Coleman, 2014; Fischetti, Mayor, & Forte, 2017; Ledolter, 2013; r-project.org, 2018).
The purpose of this task is to understand the variables of the dataset. The Dataset is the “lastfm.csv” dataset, which describes the artists and the users who listen to the music. From the available information, such as demographic information (age, sex, and location), the support for the frequencies of listening to various individual artists can be determined, as well as the joint support for pairs or larger groupings of artists. There are four variables. Table 1 summarizes the selected variables for this project.
The purpose of this task is to load and understand the Dataset using the names(), head(), and dim() functions. The task also displays the first three observations.
## reading the data
lf <- read.csv("C:/CS871/Data/lastfm.csv")
lf                          ## print the data set
dim(lf)                     ## number of observations and variables
length(lf$user)             ## number of user entries
names(lf)                   ## variable names
head(lf)                    ## first six observations
lf <- data.frame(lf)
head(lf)
str(lf)                     ## structure of the variables
lf[1:20,]                   ## first twenty observations
lfsmallset <- lf[1:1000,]   ## small subset of 1,000 observations
lfsmallset
plot(lfsmallset, col="blue", main="Small Set of Online Radio")
Figure 1. First Sixteen Observations for User (1) – Woman from Germany.
Figure 2. The plot of Small Set of Last FM Variables.
The purpose of this task is to examine the dataset. This task also converts the user and artist variables to factors and examines their levels. It also displays a summary of the variables and a visualization of each variable; a brief sketch of this step follows.
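No code is shown for this task in the text; the following is a minimal R sketch of what it could look like (an assumed reconstruction, presuming the four variables are user, artist, sex, and country, as suggested by the descriptive results reported later).
## factor the user and artist variables and inspect their levels (assumed sketch)
lf$user   <- factor(lf$user)
lf$artist <- factor(lf$artist)
nlevels(lf$user)                                    ## number of distinct users
nlevels(lf$artist)                                  ## number of distinct artists
summary(lf)                                         ## descriptive summary of the variables
table(lf$sex)                                       ## counts of male and female observations
head(sort(table(lf$country), decreasing = TRUE))    ## most frequent countries
barplot(head(sort(table(lf$artist), decreasing = TRUE), 10),
        las = 2, main = "Top Ten Artists by Play Count")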
The
purpose of this task is to first implement computations which are required for
the association rules. The required
package arules is first installed. This
task visualizes the frequency of items in Figure 4.
## Install arules library for association rules
install.packages("arules")
library(arules)
## computational environment for mining association rules and frequent item sets
playlist <- split(x = lf[, "artist"], f = lf$user)
playlist[1:2]
## Remove artist duplicates
playlist <- lapply(playlist, unique)
## view this as a list of "transactions"
## transactions is a data class defined in arules
playlist <- as(playlist, "transactions")
## lists the support of the 1,004 bands, i.e., the relative frequency with which
## each band is listened to on the playlists of the 15,000 users
itemFrequency(playlist)
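The text refers to a visualization of item frequencies for artists with support above 8% (Figure 4); a plot along those lines could be produced with the arules function itemFrequencyPlot, for example (an assumed sketch, not the original code):
## plot relative frequencies of artists with support larger than 8%
itemFrequencyPlot(playlist, support = 0.08, cex.names = 1.5)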
The purpose of this task is to implement data mining on the music list (lastfm.csv) using the Association Rules technique. First, the code builds the association rules, keeping only associations with support > 0.01 and confidence > 0.50. Rare bands are ruled out, and the results are ordered by confidence for a better understanding of the association rules.
## Build the Association Rules
## Only associations with support > 0.01 and confidence > 0.50
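The apriori call itself is not shown in the text; a minimal sketch consistent with the thresholds described (support > 1%, confidence > 50%, then lift > 5 ordered by decreasing confidence) could look as follows, with musicrules as an assumed object name:
## build the rules with the arules implementation of apriori (assumed sketch)
musicrules <- apriori(playlist,
                      parameter = list(support = 0.01, confidence = 0.50))
inspect(musicrules)    ## list all rules meeting the support and confidence thresholds
## narrow the list to rules with lift > 5, ordered by decreasing confidence
inspect(sort(subset(musicrules, subset = lift > 5), by = "confidence"))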
The association rules are
used to explore the relationship between items and sets of items (Fischetti et al.,
2017; Giudici, 2005). Each
transaction is composed of one or more items.
The interest is in transactions of at least two items because there
cannot be relationships between several items in the purchase of a single item (Fischetti et al.,
2017).
An association rule is an explicit statement of a relationship in the data, in the form X => Y, where X (the antecedent) can be composed of one or several items and is called an itemset, and Y (the consequent) is always a single item. In this project, the interest is in the antecedents of music since the interest is in promoting the purchase of music. The frequent “itemsets” are the items or collections of items which frequently occur in transactions. The “itemsets” are considered frequent if they occur more frequently than a specified threshold (Fischetti et al., 2017). This threshold is called the minimal support (Fischetti et al., 2017). The omission of “itemsets” with support less than the minimum support is called support pruning (Fischetti et al., 2017).
The support for an itemset is the proportion of all cases in which the itemset of interest is present, which allows estimation of how interesting an itemset or a rule is; when support is low, the interest is limited (Fischetti et al., 2017).
The confidence is the proportion of cases featuring X in which X => Y holds, which can be computed as the number of cases featuring both X and Y divided by the number of cases featuring X (Fischetti et al., 2017). Lift is a measure of the improvement of the rule support over what can be expected by chance, computed as support(X => Y) / (support(X) * support(Y)) (Fischetti et al., 2017). If the lift value is not higher than 1, the rule does not explain the relationship between the items any better than could be expected by chance. The goal of “apriori” is to compute the frequent “itemsets” and the association rules efficiently and to compute their support and confidence.
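As a toy illustration of these three measures (an assumed example, not taken from the project data), the following R snippet computes support, confidence, and lift by hand for a rule A => B over three users and three artists A, B, and C:
## incidence matrix: one row per user, 1 = artist played
toy <- data.frame(A = c(1, 1, 0), B = c(1, 1, 1), C = c(1, 0, 1))
n   <- nrow(toy)
support_A  <- sum(toy$A == 1) / n                    ## support(A)      = 2/3
support_B  <- sum(toy$B == 1) / n                    ## support(B)      = 1
support_AB <- sum(toy$A == 1 & toy$B == 1) / n       ## support(A => B) = 2/3
confidence <- support_AB / support_A                 ## confidence      = 1
lift       <- support_AB / (support_A * support_B)   ## lift            = 1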
In this project, the large lastfm dataset (289,955 observations and four variables) is used. The descriptive analysis shows that the number of male users (N = 211,823) exceeds the number of female users (N = 78,132), as illustrated in Figure 3. The top artist has a count of 2,704, followed by “Beatles” with 2,668 and “Coldplay” with 2,378. The top country has a count of 59,558, followed by the United Kingdom with 27,638 and Germany with 24,251, as illustrated in Task-3 of Part-I.
As illustrated in Figure 1, the first sixteen observations are for user (1), a woman from Germany, corresponding to the first sixteen rows of the data matrix.
The R package arules was used for
mining the association rules and for identifying frequent “itemsets.” The data is
transformed into an incidence matrix where each listener represents a
row, with 0 and 1s across the columns indicating whether or not the user has
played a particular artist. The incidence matrix is stored in the R object
“playlist.” The support for each of the 1004 artists is calculated, and the support is displayed for all artists with support
larger than 8% indicating that artists shown on the graph (Figure 4) are played by more than 8% of the users.
The construction of the association rules is also
implemented using the function of “apriori” in R package arules.
The search was implemented for
artists or groups of artists who have support larger than 1% and who give
confidence to another artist that is larger than 50%. These requirements rule out rare
artists. The calculation and the list of
antecedents (LHS) are also implemented which involve more than one artist. For instance, listening both to “Muse” and
“Beatles” has support larger than 1%, and the confidence for “Radiohead,” given
that someone listens to both “Muse” and “Beatles” is 0.507 with a lift of 2.82
as illustrated in Figure 5. This result meets the two requirements; antecedents involving three artists do not come up in the list because they do not meet both requirements. The list is further narrowed down by requiring that the lift is larger than 5, and the resulting list is ordered by decreasing confidence, as illustrated in Figure 6.
The result shows that listening to both “Led Zeppelin” and “the Doors”
has a support of 1%, the confidence of
0.597 (60%) and lift of 5.69 and is quite predictive of listening to “Pink
Floyd” as shown in Figure 6. Another example of the association rule result is
listening to “Judas Priest” lifts the chance of listening to the “Iron Maiden”
by a factor of 8.56 as illustrated in Figure 6.
Thus, if a user listens to “Judas Priest,” the recommendation for that user is to also listen to “Iron Maiden.” The same association rule reasoning applies to all six rules listed in Figure 6.
References
Ahlemeyer-Stubbe, A., & Coleman, S. (2014). A practical guide to data mining for business and industry. John Wiley & Sons.
Fischetti, T., Mayor, E., & Forte, R. M. (2017). R: Predictive analysis. Packt Publishing.
Giudici, P. (2005). Applied data mining: Statistical methods for business and industry. John Wiley & Sons.
Ledolter, J. (2013). Data mining and business analytics with R. John Wiley & Sons.