Big Data Analytics Tools

Dr. Aly, O.
Computer Science

The purpose of this discussion is to identify and describe a data analytics tool available in the market, how the tool is used, and where it can be used.  The discussion begins with an overview of Big Data Analytics (BDA) tools, followed by the top five tools for 2018, from which RapidMiner is selected as the BDA tool for this discussion.  The discussion of RapidMiner as one of the top five BDA tools covers its features, technical specifications, use, advantages, and limitations.  The application of RapidMiner in various industries such as healthcare and education is also addressed in this discussion.

Overview of Big Data Analytics Tools

Organizations must be able to quickly and effectively analyze large amounts of data and extract value from such data for sound business decisions.  The benefits of Big Data Analytics are driving organizations and businesses to implement BDA techniques in order to compete in the market.  A survey conducted by CIO Insight has shown that 65% of executives and senior decision makers indicated that organizations risk becoming uncompetitive or irrelevant if Big Data is not embraced (McCafferly, 2015).  The same survey also showed that 56% anticipated higher investment in Big Data, and 15% indicated that the increase in budget allocation will be significant (McCafferly, 2015). Such budget allocations can be used for skilled professionals, Big Data storage, BDA tools, and so forth.

Various BDA tools exist in the market for different business purposes, and organizations must select the tool that best serves their business model.  Several studies have discussed tools for BDA implementation.  (Chen & Zhang, 2014) have examined various types of Big Data tools. Some tools are based on batch processing, such as Apache Hadoop, Dryad, Apache Mahout, and Tableau, while others are based on stream processing, such as Storm, S4, Splunk, Apache Kafka, and SAP HANA, as summarized in Table 1 and Table 2.  Each tool provides certain features for BDA implementation and offers various advantages to organizations that adopt BDA.


Table 1.  Big Data Tools Based on Batch Processing (Chen & Zhang, 2014).


Table 2.  Big Data Tools Based on Stream Processing (Chen & Zhang, 2014).

Other studies, such as (Rangra & Bansal, 2014), have provided a comparative study of data mining tools such as Weka, Keel, R-Programming, Knime, RapidMiner, and Orange, covering their technical specifications, general features, specializations, advantages, and limitations. (Choi, 2017) has discussed BDA tools by category: open source data tools, data visualization tools, sentiment tools, and data extraction tools. Figure 1 summarizes examples of BDA tools, including database sources from which large datasets can be downloaded for analysis.


Figure 1.  A Summary of Big Data Analytics Tools.

(Al-Khoder & Harmouch, 2014) have evaluated four of the most popular open source and free data mining tools, including R, RapidMiner, Weka, and Knime.  The R Foundation developed R-Programming, the Rapid-I company developed RapidMiner, Weka was developed by the University of Waikato, and Knime was developed by Knime.com AG. Figure 2 summarizes these four popular open source and free data mining tools, with the logo, description, launch date, current version at the time the study was written, and development team.


Figure 2.  Open Source and Free Data Mining Tools Analyzed by (Al-Khoder & Harmouch, 2014).

The top five BDA tools for 2018 include Tableau Public, RapidMiner, Hadoop, R-Programming, and IBM Big Data (Seli, 2017). The present discussion focuses on one of these top five BDA tools for 2018.  Figure 3 summarizes these top five BDA tools.


Figure 3.  Top Five BDA Tools for 2018.

RapidMiner Big Data Analytic Tool

The RapidMiner Big Data Analytics tool is selected for the present discussion since it is among the top five BDA tools for 2018.  RapidMiner is an open source platform for BDA based on the Java programming language. RapidMiner provides machine learning and data mining procedures, as well as data visualization, data processing, statistical modeling, deployment, evaluation, and predictive analytics (Hofmann & Klinkenberg, 2013; Rangra & Bansal, 2014; Seli, 2017).  RapidMiner is known for its commercial and business applications, as it provides an integrated environment and platform for machine learning, data mining, predictive analysis, and business analytics (Hofmann & Klinkenberg, 2013; Seli, 2017).  It is also used for research, education, training, rapid prototyping, and application development (Rangra & Bansal, 2014).  It specializes in predictive analysis and statistical computing, and it supports all steps of the data mining process (Hofmann & Klinkenberg, 2013; Rangra & Bansal, 2014). RapidMiner uses a client/server model, where the server can be offered as on-premises software, as a service, or on cloud infrastructure (Rangra & Bansal, 2014).

RapidMiner was released in 2006.  At the time of the cited announcement, the latest version of RapidMiner Server was 7.2, with free versions of Server and Radoop that can be downloaded from the RapidMiner site (rapidminer, 2018).  It can be installed on any operating system (Rangra & Bansal, 2014).  The advantages of RapidMiner include an integrated environment for all steps of the data mining process, an easy-to-use graphical user interface (GUI) for designing data mining processes, visualization of data and results, and validation and optimization of these processes.  RapidMiner can also be integrated into more complex systems (Hofmann & Klinkenberg, 2013).  RapidMiner stores data mining processes in a machine-readable XML format, which can be executed with the click of a button and visualized graphically (Hofmann & Klinkenberg, 2013). It contains over a hundred learning schemes for regression, classification, and clustering analysis (Rangra & Bansal, 2014).  RapidMiner has a few limitations, including size constraints on the number of rows and a need for more hardware resources than tools such as SAS for the same task and data (Seli, 2017).  RapidMiner also requires solid knowledge of database handling (Rangra & Bansal, 2014).

RapidMiner Use and Application

Data mining requires six essential steps to extract value from a large dataset (Chisholm, 2013). The data mining process framework begins with business understanding, followed by data understanding and data preparation.  The modeling, evaluation, and deployment phases develop the predictive models, test them, and deploy them in real time. Figure 4 illustrates these six steps of data mining.


Figure 4.  Data Mining Six Phases Process Framework (Chisholm, 2013).

Before working with RapidMiner, the user must know the common terms used by RapidMiner.  Some of these standard terms are process, operator, macro, repository, attribute, role, label, and ID (Chisholm, 2013).  The data mining process in RapidMiner begins with loading the data into RapidMiner, using import techniques for data either in files or in databases.  A large file can be split into pieces within RapidMiner; for example, a RapidMiner process can read each line of a CSV file and split the dataset into chunks. If the dataset resides in a database, a Java Database Connectivity (JDBC) driver must be used; RapidMiner supports MySQL, PostgreSQL, SQL Server, Oracle, and Access (Chisholm, 2013).  After loading the data into RapidMiner and generating data for testing, a predictive model can be created based on the loaded dataset, followed by executing the process and reviewing the results visually. RapidMiner provides various techniques to visualize the data, including scatter plots, 3D color scatter plots, parallel and deviation plots, quartile color plots, series plots, and survey plots. Figure 5 illustrates a 3D color scatter visualization of data in RapidMiner (Chisholm, 2013).


Figure 5.  Scatter 3D Color Visualization of the Data in RapidMiner (Chisholm, 2013).
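As a point of reference for the database import step described above, the following minimal Java sketch shows what a JDBC driver does underneath such an import: it opens a connection, runs a query, and iterates over the result set. The connection URL, credentials, and the orders table are hypothetical and not taken from the cited sources; RapidMiner's own import wizard performs the equivalent work through its GUI.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class JdbcReadExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical MySQL database; RapidMiner supports MySQL, PostgreSQL,
        // SQL Server, Oracle, and Access through the corresponding JDBC drivers.
        String url = "jdbc:mysql://localhost:3306/salesdb";
        try (Connection conn = DriverManager.getConnection(url, "analyst", "secret");
             Statement stmt = conn.createStatement();
             ResultSet rows = stmt.executeQuery("SELECT customer_id, amount FROM orders")) {
            while (rows.next()) {
                // Each row becomes one record once imported into a data set.
                System.out.println(rows.getInt("customer_id") + " -> " + rows.getDouble("amount"));
            }
        }
    }
}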

RapidMiner supports statistical analysis techniques such as k-Nearest Neighbor classification and Naïve Bayes classification, which can be used for credit approval and in education (Hofmann & Klinkenberg, 2013). RapidMiner applications are also found in other areas such as marketing, cross-selling, and recommender systems (Hofmann & Klinkenberg, 2013).  Other useful use cases of RapidMiner include clustering in the medical and education domains (Hofmann & Klinkenberg, 2013).  RapidMiner can also be used for text mining scenarios such as spam detection, language detection, and customer feedback analysis.  Other applications of RapidMiner include anomaly detection and instance selection.
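To make the k-Nearest Neighbor technique mentioned above concrete, the following short Java sketch implements a plain k-NN classifier with Euclidean distance and majority voting on a toy, hypothetical credit-approval data set. It illustrates the idea only; in RapidMiner the equivalent functionality is provided by built-in operators rather than hand-written code.

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class KnnExample {

    static class Sample {
        final double[] features;
        final String label;
        Sample(double[] features, String label) { this.features = features; this.label = label; }
    }

    // Classify a query point by majority vote among its k nearest training samples.
    static String classify(List<Sample> training, double[] query, int k) {
        List<Sample> sorted = new ArrayList<>(training);
        sorted.sort(Comparator.comparingDouble(s -> distance(s.features, query)));

        Map<String, Integer> votes = new HashMap<>();
        for (Sample s : sorted.subList(0, Math.min(k, sorted.size()))) {
            votes.merge(s.label, 1, Integer::sum);
        }
        return Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    // Euclidean distance between two feature vectors.
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        // Toy data: [income in $1000s, debt ratio] -> hypothetical approval decision.
        List<Sample> training = new ArrayList<>();
        training.add(new Sample(new double[]{60.0, 0.2}, "approve"));
        training.add(new Sample(new double[]{55.0, 0.3}, "approve"));
        training.add(new Sample(new double[]{30.0, 0.7}, "decline"));
        training.add(new Sample(new double[]{25.0, 0.8}, "decline"));

        System.out.println(classify(training, new double[]{50.0, 0.25}, 3)); // expected: approve
    }
}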

Conclusion

This discussion has identified different tools for Big Data Analytics (BDA). Over thirty analytics tools are available that can be used to address BDA needs. Some are open source tools, such as Knime, R-Programming, and RapidMiner, which can be downloaded for free, while others are visualization tools, such as Tableau Public and Google Fusion, that provide compelling visual images of the data in various scenarios.  Other tools are more semantic, such as OpenText and Opinion Crawl.  Data extraction tools for BDA include Octoparse and Content Grabber. Users can download large datasets for BDA from various sources such as data.gov.

The discussion has also addressed the top five BDA tools for 2018: Tableau Public, RapidMiner, Hadoop, R-Programming, and IBM Big Data. RapidMiner was selected as the BDA tool for this discussion.  The focus of the discussion on RapidMiner included its technical specifications, use, advantages, and limitations.  The data mining process and steps when using RapidMiner have also been discussed.  The analytic process begins with loading the data into RapidMiner, during which the data can be split using RapidMiner's capabilities.  After the data are loaded and cleaned, the data model is developed and tested, followed by visualization.  RapidMiner also supports statistical analysis techniques such as k-Nearest Neighbor and Naïve Bayes classification.  RapidMiner use cases have been addressed as well, including the medical and education domains and text mining scenarios such as spam detection.  Organizations must select the appropriate BDA tool based on their business model.

References

Al-Khoder, A., & Harmouch, H. (2014). Evaluating four of the most popular open source and free data mining tools.

Chen, C. P., & Zhang, C.-Y. (2014). Data-intensive applications, challenges, techniques and technologies: A survey on Big Data. Information Sciences, 275, 314-347.

Chisholm, A. (2013). Exploring data with RapidMiner: Packt Publishing Ltd.

Choi, N. (2017). Top 30 Big Data Tools for Data Analysis. Retrieved from https://bigdata-madesimple.com/top-30-big-data-tools-data-analysis/.

Hofmann, M., & Klinkenberg, R. (2013). RapidMiner: Data mining use cases and business analytics applications: CRC Press.

McCafferly, D. (2015). How To Overcome Big Data Barriers. Retrieved from https://www.cioinsight.com/it-strategy/big-data/slideshows/how-to-overcome-big-data-barriers.html.

Rangra, K., & Bansal, K. (2014). Comparative study of data mining tools. International journal of advanced research in computer science and software engineering, 4(6).

rapidminer. (2018). Introducing RapidMiner 7.2, Free Versions of Server & Radoop, and New Pricing. Retrieved from https://rapidminer.com/blog/introducing-new-rapidminer-pricing-free-versions-server-radoop/.

Seli, A. (2017). Top 5 Big Data Analytics Tools for 2018. Retrieved from http://heartofcodes.com/big-data-analytics-tools-for-2018/.

Case Study: Hadoop in Healthcare Industry

Dr. Aly, O.
Computer Science

The purpose of this discussion is to identify a real-life case study where Hadoop was used.  The discussion also addresses the researcher's view on whether Hadoop was used to its fullest extent.  The benefits of Hadoop to the industry identified in the use case are also discussed.

Hadoop Real Life Case Study and Applications in the Healthcare Industry

Various research studies and reports have discussed Spark solutions for real-time data processing in particular industries such as healthcare, while others have discussed Hadoop solutions for healthcare data analytics. For instance, (Shruika & Kudale, 2018) have discussed the use of Big Data in healthcare with Spark, while (Beall, 2016) has indicated that United Healthcare processes data using the Hadoop framework for clinical advancements, financial analysis, and fraud and waste monitoring.  United Healthcare has utilized Hadoop to obtain a 360-degree view of each of its 85 million members (Beall, 2016).

The emphasis of this discussion is on Hadoop in the healthcare industry.  Data in the healthcare industry is growing exponentially (Dezyre, 2016).  McKinsey has anticipated a potential annual value of $300 billion for healthcare in the US and 7% annual productivity growth using BDA (Manyika et al., 2011).  (Dezyre, 2016) has reported that healthcare informatics poses challenges such as data knowledge representation, database design, data querying, and clinical decision support, which contribute to the development of BDA.

Big Data in healthcare includes patient-related data from electronic health records (EHRs), computerized physician order entry (CPOE) systems, clinical decision support systems, medical devices and sensors, lab results, and images such as X-rays (Alexandru, Alexandru, Coardos, & Tudora, 2016; Wang, Kung, & Byrd, 2018).  A Big Data framework for healthcare includes a data layer, a data aggregation layer, an analytical layer, and an information exploration layer (Alexandru et al., 2016). Hadoop resides in the analytical layer of the Big Data framework (Alexandru et al., 2016).

The data analysis involves Hadoop and MapReduce processing large datasets in batch form economically, analyzing both structured and unstructured data in a massively parallel processing environment (Alexandru et al., 2016).  (Alexandru et al., 2016) have indicated that stream computing can also be implemented, using real-time or near real-time analysis to identify and respond quickly to any healthcare fraud.  The third type of analytics at the analytical layer involves in-database analytics, using a data warehouse for data mining and allowing high-speed parallel processing, which can be used for prediction scenarios (Alexandru et al., 2016).  In-database analytics can be used for preventive healthcare and pharmaceutical management.  Using a Big Data framework that includes the Hadoop ecosystem provides additional healthcare benefits such as scalability, security, confidentiality, and optimization features (Alexandru et al., 2016).

Hadoop technology was found to be the only technology that enables healthcare to store data in its native forms (Dezyre, 2016).   There are five successful use cases and applications of Hadoop in the healthcare industry (Dezyre, 2016).   The first application of Hadoop technology in healthcare is cancer treatment and genomics.  Hadoop helps develop better treatments for diseases such as cancer by accelerating the design and testing of effective treatments tailored to patients, expanding genetically based clinical cancer trials, and establishing a national “cancer knowledge network” to guide treatment decisions (Dezyre, 2016).  Hadoop can also be used to monitor patient vitals.  Children’s Healthcare of Atlanta is an example of using the Hadoop ecosystem to treat over 6,200 children in its ICU units.  Through Hadoop, the hospital was able to store and analyze the vital signs, and if there is any pattern change, an alert is generated and sent to the physicians (Dezyre, 2016).  The third application of Hadoop in the healthcare industry involves the hospital network.  The Cleveland Clinic spinoff company known as “Explorys” is taking advantage of Hadoop by developing one of the most extensive databases in the healthcare industry. As a result, Explorys was able to provide clinical support, reduce the cost of care measurement, and manage the population of at-risk patients (Dezyre, 2016).  The fourth application of Hadoop in the healthcare industry involves healthcare intelligence, where health insurance businesses are interested in identifying, for specific regions, individuals below a certain age who are not affected by certain diseases.  Through Hadoop technology, health insurance companies can compute the cost of insurance policies.  Pig, Hive, and MapReduce from the Hadoop ecosystem are used in this scenario to process such large datasets (Dezyre, 2016).  The last application of Hadoop in the healthcare industry involves fraud prevention and detection.
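The vital-signs monitoring scenario above can be pictured with a small, purely illustrative Hadoop mapper. The record format, field positions, and heart-rate threshold below are assumptions made for this example and are not taken from the Children's Healthcare of Atlanta deployment; the sketch only shows how a map task could filter batches of readings so that a later step can raise alerts.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative mapper: reads lines of the assumed form "patientId,heartRate"
// and emits only out-of-range readings, keyed by patient.
public class VitalSignsAlertMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final int HIGH_HEART_RATE = 120;  // assumed threshold for the example

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        if (fields.length < 2) {
            return;  // skip malformed lines
        }
        int heartRate = Integer.parseInt(fields[1].trim());
        if (heartRate > HIGH_HEART_RATE) {
            context.write(new Text(fields[0].trim()), new IntWritable(heartRate));
        }
    }
}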

Conclusion

In conclusion, the healthcare industry has taken advantage of Hadoop technology in various areas, not only for better treatment and better medication but also for reducing cost and increasing productivity and efficiency.  It has also used Hadoop for fraud protection.  These are not the only benefits Hadoop offers the healthcare industry.  Hadoop also offers storage capabilities, scalability, and analytics capabilities over various types of datasets using parallel processing and a distributed file system.  From the viewpoint of the researcher, utilizing Spark on top of Hadoop will empower the healthcare industry not only at the batch processing level but also at the real-time data processing level. (Basu, 2014) has reported that the healthcare industry can take advantage of Spark and Shark with Apache Hadoop for real-time healthcare analytics.  Although Hadoop alone offers excellent benefits to the healthcare industry, its integration with other analytics tools such as Spark can make a huge difference at the patient care level as well as at the industry return-on-investment level.

References

Alexandru, A., Alexandru, C., Coardos, D., & Tudora, E. (2016). Healthcare, Big Data, and Cloud Computing. Management, 1, 2.

Basu, A. (2014). Real-Time Healthcare Analytics on Apache Hadoop* using Spark* and Shark. Retrieved from https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/big-data-real-time-healthcare-analytics-whitepaper.pdf.

Beall, A. (2016). Big data in healthcare: How three organizations are using big data to improve patient care and more. Retrieved from https://www.sas.com/en_us/insights/articles/big-data/big-data-in-healthcare.html.

Dezyre. (2016). 5 Healthcare Applications of Hadoop and Big Data Retrieved from https://www.dezyre.com/article/5-healthcare-applications-of-hadoop-and-big-data/85.

Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Byers, A. H. (2011). Big data: The next frontier for innovation, competition, and productivity.

Shruika, D., & Kudale, R. A. (2018). Use of Big Data in Healthcare with Spark. International Journal of Science and Research (IJSR).

Wang, Y., Kung, L., & Byrd, T. A. (2018). Big data analytics: Understanding its capabilities and potential benefits for healthcare organizations. Technological Forecasting and Social Change, 126, 3-13.

Hadoop Ecosystem

Dr. O. Aly
Computer Science

The purpose of this discussion is to discuss the Hadoop ecosystem, which is rapidly evolving. The discussion also covers Apache Spark, which is a recent addition to the Hadoop ecosystem. Both technologies offer significant benefits for the challenges of storing and processing large data sets in the age of Big Data Analytics.  The discussion also addresses the most significant differences between Hadoop and Spark.

Hadoop Solution, Components and Ecosystem

The growth of Big Data has demanded attention not only from researchers, academia, and government but also from software engineering, as dealing with Big Data using conventional computer science technologies has been challenging (Koitzsch, 2017).  (Koitzsch, 2017) has referenced annual data volume statistics from the Cisco VNI Global IP Traffic Forecast for 2014-2019, as illustrated in Figure 1, to show the magnitude of data growth.


Figure 1.  Annual Data Volume Statistics [Cisco VNI Global IP Traffic Forecast 2014-2019] (Koitzsch, 2017). 

The complex characteristics of Big Data have demanded the innovation of distributed big data analysis, as the conventional techniques were found inadequate (Koitzsch, 2017; Lublinsky, Smith, & Yakubovich, 2013).  Thus, tools such as Hadoop have emerged, relying on clusters of relatively low-cost machines and disks and driving distributed processing for large-scale data projects.  Apache Hadoop is a Java-based open source distributed processing framework that evolved from Apache Nutch, an open source web search engine based on Apache Lucene (Koitzsch, 2017).  Newer Hadoop subsystems have various language bindings such as Scala and Python (Koitzsch, 2017).  The core components of Hadoop 2 include MapReduce, YARN, and HDFS, along with other components such as Tez, as illustrated in Figure 2.


Figure 2.  Hadoop 2 Core Components (Koitzsch, 2017).

Hadoop and its ecosystem are divided into major building blocks (Koitzsch, 2017).  The core components of Hadoop 2 are YARN, MapReduce, HDFS, and Apache Tez.  The operational services component includes Apache Ambari, Oozie, Ganglia, Nagios, Falcon, and others. The data services component includes Hive, HCatalog, Pig, HBase, Flume, Sqoop, and others.   The messaging component includes Apache Kafka, while the security services and secure ancillary components include Accumulo.  The glue components include Apache Camel, the Spring Framework, and Spring Data.  Figure 3 summarizes these building blocks of Hadoop and its ecosystem.

Figure 3. Hadoop 2 Technology Stack Diagram (Koitzsch, 2017).

Furthermore, the structure of the Hadoop ecosystem involves various components, with Hadoop at the center and cluster bookkeeping and management provided by ZooKeeper and Curator (Koitzsch, 2017).  Hive and Pig are standard components of the Hadoop ecosystem that provide data warehousing, while Mahout provides standard machine learning algorithm support.  Figure 4 shows the structure of the Hadoop ecosystem (Koitzsch, 2017).


Figure 4.  Hadoop Ecosystem (Koitzsch, 2017).

Hadoop Limitations Driving Additional Technologies

Hadoop has three significant limitations (Guo, 2013).  The first limitation is the instability of the Hadoop software, as it is open source and lacks technical support and documentation; Enterprise Hadoop can be used to overcome this limitation.  The second limitation is that Hadoop cannot handle real-time data processing; Spark or Storm can be used to provide real-time processing, as required by the application. The third limitation is that Hadoop cannot handle large graph datasets; GraphLab can be utilized to overcome this limitation.

Enterprise Hadoop refers to distributions of Hadoop by various Hadoop-oriented vendors such as Cloudera, Hortonworks, MapR, and Hadapt (Guo, 2013).  Cloudera provides Big Data solutions and is regarded as one of the most significant contributors to the Hadoop codebase (Guo, 2013).  Hortonworks and MapR offer Hadoop-based Big Data solutions (Guo, 2013).  Spark is a real-time, in-memory processing platform for Big Data (Guo, 2013).  (Guo, 2013) has indicated that Spark “can be up to 40 times faster than Hadoop” (p. 15). (Scott, 2015) has indicated that Spark running in memory “can be 100 times faster than Hadoop MapReduce, but also ten times faster when processing disk-based data in a similar way to Hadoop MapReduce itself” (p. 7).  Spark is described as ideal for iterative processing and responsive Big Data applications (Guo, 2013). Spark can also be integrated with Hadoop, where its Hadoop-compatible storage API provides the capability to access any Hadoop-supported system (Guo, 2013).   Storm is another option to address Hadoop's real-time data processing limitation; Storm was developed and open sourced by Twitter (Guo, 2013). GraphLab is the alternative solution for Hadoop's limitation in dealing with large graph datasets.  GraphLab is an open source distributed system, developed at Carnegie Mellon University, to handle sparse iterative graph algorithms (Guo, 2013). Figure 5 summarizes these three limitations and the alternative solutions that overcome them.


Figure 5.  Three Major Limitations of Hadoop and Alternative Solutions.

Apache Spark Solution and Its Building Blocks

Spark was developed in 2009 at the UC Berkeley AMPLab. Spark processes data in memory, which makes it quicker than Hadoop (Guo, 2013; Koitzsch, 2017; Scott, 2015).  In 2013, Spark became a project of the Apache Software Foundation, and in early 2014 it became one of the foundation's top-level projects.   (Scott, 2015) has described Spark as a general-purpose engine for data processing that can be used in various projects.  The primary tasks associated with Spark include interactive queries across large datasets, processing data streaming from sensors or financial systems, and machine learning (Scott, 2015).  While Hadoop was written in Java, Apache Spark was written primarily in Scala (Koitzsch, 2017).

Spark has three critical features: simplicity, speed, and support (Scott, 2015).  Simplicity is reflected in Spark's access capabilities through a set of well-structured and well-documented APIs that help data scientists utilize Spark quickly.  Speed reflects the fast in-memory processing of large datasets and is the feature that has distinguished Spark from Hadoop.  Support is reflected in the various programming languages that Spark supports, such as Java, Python, R, and Scala (Scott, 2015). Spark also has native support for integrating with leading storage solutions in the Hadoop ecosystem and beyond (Scott, 2015).  Databricks, IBM, and other major Hadoop vendors provide Spark-based solutions.

Typical uses of Spark include stream processing, machine learning, interactive analytics, and data integration (Scott, 2015).  An example of stream processing is real-time data processing to identify and prevent potentially fraudulent transactions.  Machine learning is another typical use case of Spark, supported by Spark's ability to hold data in memory and quickly run repeated queries, which helps in training machine learning algorithms and finding the most efficient one (Scott, 2015).  Interactive analytics is another typical use of Spark, involving interactive query processes to which Spark responds and adapts quickly.  Data integration is another typical use of Spark, involving the extract, transform, and load (ETL) process while reducing cost and time.  The Spark framework includes the Spark Core Engine, along with Spark SQL, Spark Streaming for data streaming, MLlib for machine learning, GraphX for graph computation, and SparkR for running the R language on Spark. Figure 6 summarizes the Spark framework and its building blocks (Scott, 2015).


Figure 6.  Spark Building Blocks (Scott, 2015).
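As a small illustration of the Spark Core API discussed above, the following Java sketch counts words using RDD transformations. It runs in local mode for simplicity; the input and output paths are placeholders, and on a real cluster the master would be provided by YARN or Mesos rather than hard-coded.

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        // Local mode for illustration; "local[*]" uses all cores on one machine.
        SparkConf conf = new SparkConf().setAppName("WordCount").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile(args[0]);          // placeholder input path
            JavaRDD<String> words =
                    lines.flatMap(line -> Arrays.asList(line.split("\\s+")).iterator());

            // Classic map/reduce pipeline expressed as RDD transformations.
            words.mapToPair(word -> new Tuple2<>(word, 1))
                 .reduceByKey(Integer::sum)
                 .saveAsTextFile(args[1]);                          // placeholder output path
        }
    }
}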

Differences Between Spark and Hadoop

Although Spark has its benefits in processing real-time data using in-memory processing, Spark is not a replacement for Hadoop or MapReduce (Scott, 2015).  Spark can run on top of Hadoop to benefit from YARN, Hadoop's cluster manager, and from underlying storage such as HDFS and HBase.  Spark can also run separately without Hadoop, integrating with other cluster managers such as Mesos and other storage such as Cassandra and Amazon S3 (Scott, 2015). Spark is described as a great companion to a modern Hadoop cluster deployment (Scott, 2015).  Spark is also described as a powerful tool on its own for processing large volumes of data.  However, on its own, Spark is not well suited for production workloads.  Thus, integrating Spark with Hadoop provides many capabilities that Spark cannot offer on its own.

Hadoop offers YARN as a resource manager, a distributed file system, disaster recovery capabilities, data security, and a distributed data platform.  Spark brings machine learning capabilities to Hadoop that are not easily achieved in Hadoop without Spark (Scott, 2015).   Spark also offers fast in-memory real-time data streaming, which Hadoop cannot accomplish without Spark (Scott, 2015).  In summary, although Hadoop has its limitations, Spark is not replacing Hadoop but empowering it.

Conclusion

This discussion has covered significant topics relevant to Hadoop and Spark.  It began with Big Data, its complex characteristics, and the urgent need for technologies and tools to deal with Big Data. Hadoop and Spark as emerging technologies, along with their building blocks, have been addressed in this discussion.  The differences between Spark and Hadoop are also covered. The conclusion of this discussion is that Spark is not replacing Hadoop and MapReduce.  Spark offers various benefits to Hadoop, and at the same time, Hadoop offers various benefits to Spark.  The integration of Spark and Hadoop offers great benefits to data scientists in the Big Data Analytics domain.

References

Guo, S. (2013). Hadoop operations and cluster management cookbook: Packt Publishing Ltd.

Koitzsch, K. (2017). Pro Hadoop Data Analytics: Springer.

Lublinsky, B., Smith, K. T., & Yakubovich, A. (2013). Professional hadoop solutions: John Wiley & Sons.

Scott, J. A. (2015). Getting Started with Spark: MapR Technologies, Inc.

Hadoop: Functionality, Installation and Troubleshooting

Dr. O. Aly
Computer Science

Abstract

The purpose of this project is to discuss Hadoop functionality, installation steps, and troubleshooting techniques.  It addresses two significant parts.  Part-I discusses Big Data and the emerging technology of Hadoop.   It also provides an overview of the Hadoop ecosystem, its building blocks, benefits, and limitations, and discusses the MapReduce framework, its benefits, and its limitations.  Part-I provides a few success stories of using Hadoop technology for Big Data Analytics.  Part-II addresses the installation and configuration of Hadoop on the Windows operating system through fourteen critical tasks.  It also addresses the errors encountered during the configuration setup and the techniques used to overcome these errors and proceed successfully with the Hadoop installation.

Keywords: Big Data Analytics; Hadoop Ecosystem; MapReduce.

Introduction

This project discusses various significant topics related to Big Data Analytics.  It addresses two significant parts.  Part-I discusses Big Data and the emerging technology of Hadoop.   It also provides an overview of the Hadoop ecosystem, its building blocks, benefits, and limitations, and discusses the MapReduce framework, its benefits, and its limitations.  Part-I provides a few success stories of using Hadoop technology for Big Data Analytics.  Part-II addresses the installation and configuration of Hadoop on the Windows operating system through fourteen critical tasks.  It also addresses the errors encountered during the configuration setup and the techniques used to overcome these errors and proceed successfully with the Hadoop installation.

Part-I
Hadoop Technology Overview

The purpose of this part is to address relevant topics related to Hadoop.  It begins with Big Data Analytics and the emerging Hadoop technology.  The building blocks of the Hadoop ecosystem are also addressed in this part, including the Hadoop Distributed File System (HDFS), MapReduce, and HBase.  The benefits and limitations of Hadoop as well as MapReduce are also discussed in Part-I of the project.  Part-I ends with success stories of using Hadoop ecosystem technology with Big Data Analytics in various domains and industries.

Big Data Analytics and Hadoop Emerging Technology

Big Data is now a buzzword in the field of computer science and information technology.  Big Data has attracted the attention of various sectors, researchers, academia, government, and even the media (Géczy, 2014; Kaisler, Armour, Espinosa, & Money, 2013).   A 2011 report from the International Data Corporation (IDC) estimated that the amount of information created and replicated in 2011 would exceed 1.8 zettabytes (1.8 trillion gigabytes), an amount that grows by a factor of 9 in just five years (Gantz & Reinsel, 2011).

Big Data Analytics (BDA) analyzes and mines Big Data to produce operational and business knowledge at an unprecedented scale (Bi & Cochran, 2014).  BDA is described by (Bi & Cochran, 2014) as an integral toolset for strategy, marketing, human resources, and research. It is the process of inspecting, cleaning, transforming, and modeling Big Data with the objective of discovering knowledge, generating solutions, and supporting decision-making (Bi & Cochran, 2014).  Big Data (BD) and BDA are regarded as powerful tools from which various organizations have benefited (Bates, Saria, Ohno-Machado, Shah, & Escobar, 2014).  Companies that adopted Big Data Analytics have been successful at using Big Data to improve the efficiency of the business (Bates et al., 2014).  An example of a successful application of Big Data Analytics is IBM's “Watson,” an application developed by IBM and showcased on the TV program Jeopardy, using some of these Big Data approaches (Bates et al., 2014).  (Manyika et al., 2011) have provided notable examples of organizations around the globe that are well known for their extensive and effective use of data, including companies such as Wal-Mart, Harrah’s, Progressive Insurance, Capital One, Tesco, and Amazon. These companies have already taken advantage of Big Data as a “competitive weapon” (Manyika et al., 2011).  Figure 1 illustrates the different types of data that make up the Big Data space.


Figure 1: Big Data (Ramesh, 2015)

“Big data is about deriving value… The goal of big data is data-driven decision making” (Ramesh, 2015).  Thus, businesses should make analytics the goal when investing in storing Big Data (Ramesh, 2015), and should focus on the analytics side of Big Data to retrieve the value that can assist in decision-making (Ramesh, 2015).  The value of BDA increases as the cumulative cash flow increases (B. Gupta & Jyoti, 2014).  Figure 2 illustrates the value of BDA along the dimensions of time and cumulative cash flow.  Thus, there is no doubt that BDA provides great benefits to organizations.

Figure 2.  The Value of Big Data Analytics. Adapted from (B. Gupta & Jyoti, 2014).

Furthermore, organizations must learn how to use Big Data Analytics to drive business value that aligns with their core competencies and creates competitive advantages (Minelli, Chambers, & Dhiraj, 2013).  BDA can improve operational efficiency, increase revenue, and achieve competitive differentiation.  Table 1 summarizes the Big Data business models that organizations can use to put Big Data to work as business opportunities.

Table 1: Big Data Business Models (Minelli et al., 2013)

Organizations deal with data in three states: data in use, data at rest, and data in motion.  Data in use indicates that the data are being used by services or required by users to accomplish specific tasks.  Data at rest indicates that the data are not in use and are stored or archived.  Data in motion indicates that the data are about to change state from data at rest to data in use, or are being transferred from one place to another (Chang, Kuo, & Ramachandran, 2016).  Figure 3 summarizes these three states of data.

Figure 3.  Three Types for Data.

One of the significant characteristics of Big Data is velocity.  The speed of data generation is described by (Abbasi, Sarker, & Chiang, 2016) as a “hallmark” of Big Data.   Wal-Mart is an example of an organization generating an explosive amount of data, collecting over 2.5 petabytes of customer transaction data every hour.  Moreover, over one billion new tweets occur every three days, and five billion search queries occur daily (Abbasi et al., 2016).  Velocity corresponds to data in motion (Chopra & Madan, 2015; Emani, Cullot, & Nicolle, 2015; Katal, Wazid, & Goudar, 2013; Moorthy, Baby, & Senthamaraiselvi, 2014; Nasser & Tariq, 2015).  Velocity involves streams of data, structured data, and the availability of access and delivery (Emani et al., 2015). The velocity of incoming data is a challenge not only because of the speed at which data arrive, since such data can be handled with batch processing, but also because high-speed-generated data must be streamed in real time for knowledge-based decisions (Emani et al., 2015; Nasser & Tariq, 2015).  Real-time data (a.k.a. data in motion) is streaming data that needs to be analyzed as it comes in (Jain, 2013).

(CSA, 2013) have indicated that Big Data technologies are divided into two categories: batch processing for analyzing data at rest, and stream processing for analyzing data in motion. An example of data-at-rest analysis is sales analysis, which is not based on real-time data processing (Jain, 2013).  An example of data-in-motion analysis is association rules in e-commerce. The response time for each data processing category is different.  For stream processing, the response time ranges from milliseconds to seconds, but the more significant challenge is to stream data and reduce the response time to well below milliseconds, which is very challenging (Chopra & Madan, 2015; CSA, 2013). Data in motion, reflecting stream or real-time processing, does not always need to reside in memory, and new interactive analysis of large-scale data sets through technologies like Apache Drill and Google’s Dremel provides new paradigms for data analytics.  Figure 4 illustrates the response time for each processing type.


Figure 4.  The Batch and Stream Processing Responsiveness (CSA, 2013).

There are two kinds of systems for data at rest: NoSQL systems for interactive data-serving environments, and systems for large-scale analytics based on the MapReduce paradigm, such as Hadoop.  NoSQL systems are designed with a simpler key-value based data model, have built-in sharding, and work seamlessly in a distributed cloud-based environment (R. Gupta, Gupta, & Mohania, 2012).  A MapReduce-based framework such as Hadoop supports batch-oriented processing (Chandarana & Vijayalakshmi, 2014; Erl, Khattak, & Buhler, 2016; Sakr & Gaber, 2014). A data stream management system allows the user to analyze data in motion rather than collecting large quantities of data, storing them on disk, and then analyzing them. There are various stream processing systems, such as IBM InfoSphere Streams (R. Gupta et al., 2012; Hirzel et al., 2013), Twitter’s Storm, and Yahoo’s S4.   These systems are designed for clusters of commodity hardware for real-time data processing (R. Gupta et al., 2012).

In 2004, Google introduced MapReduce as a parallel processing framework for dealing with large sets of data (Bakshi, 2012; Fadzil, Khalid, & Manaf, 2012; White, 2012).  The MapReduce framework has gained much popularity because it hides the sophisticated operations of parallel processing (Fadzil et al., 2012).  Various MapReduce frameworks such as Hadoop were introduced because of the enthusiasm for MapReduce (Fadzil et al., 2012).

The capability of the MapReduce framework has been recognized in different research areas such as data warehousing, data mining, and bioinformatics (Fadzil et al., 2012).  The MapReduce framework consists of two main layers: the Distributed File System (DFS) layer to store data and the MapReduce layer for data processing (Lee, Lee, Choi, Chung, & Moon, 2012; Mishra, Dehuri, & Kim, 2016; Sakr & Gaber, 2014).  The DFS is a significant feature of the MapReduce framework (Fadzil et al., 2012).

The MapReduce framework uses large clusters of low-cost commodity hardware to lower cost (Bakshi, 2012; H. Hu, Wen, Chua, & Li, 2014; Inukollu, Arsi, & Ravuri, 2014; Khan et al., 2014; Krishnan, 2013; Mishra et al., 2016; Sakr & Gaber, 2014; White, 2012).  It uses “Redundant Arrays of Independent (and inexpensive) Nodes (RAIN),” whose components are loosely coupled, so that when any node goes down there is no negative impact on the MapReduce job (Sakr & Gaber, 2014; Yang, Dasdan, Hsiao, & Parker, 2007).  The framework achieves fault tolerance by applying a replication technique and allows replacing any crashed node with another node without affecting the currently running job (P. Hu & Dai, 2014; Sakr & Gaber, 2014).  The MapReduce framework also provides automatic support for parallelization of execution, which makes MapReduce highly parallel and yet abstracted (P. Hu & Dai, 2014; Sakr & Gaber, 2014).
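To ground the MapReduce model described above, the following Java sketch is the classic word count job: the map phase emits (word, 1) pairs, the framework shuffles them by key, and the reduce phase sums the counts. Input and output paths are supplied as command-line arguments; the code follows the standard Hadoop MapReduce API rather than any particular cited deployment.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every token in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts for each word; the framework handles the shuffle.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}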

Hadoop Ecosystem Building Blocks

Emerging Big Data technologies such as the Hadoop ecosystem (including Pig, Hive, Mahout, and Hadoop itself), stream mining, complex-event processing, and NoSQL databases enable the analysis of not only large-scale but also heterogeneous datasets at unprecedented scale and speed (Cardenas, Manadhata, & Rajan, 2013).  Hadoop was developed by Yahoo and Apache to run jobs over hundreds of terabytes of data (Yan, Yang, Yu, Li, & Li, 2012).  Various large corporations such as Facebook and Amazon have used Hadoop, as it offers high efficiency, high scalability, and high reliability (Yan et al., 2012).  The Hadoop Distributed File System (HDFS) is one of the major components of the Hadoop framework, storing large files (Bao, Ren, Zhang, Zhang, & Luo, 2012; CSA, 2013; De Mauro, Greco, & Grimaldi, 2015) and allowing access to data scattered over multiple nodes without any exposure to the complexity of the environment (Bao et al., 2012; De Mauro et al., 2015).  The MapReduce programming model is another significant component of the Hadoop framework (Bao et al., 2012; CSA, 2013; De Mauro et al., 2015), designed to implement distributed and parallel algorithms efficiently (De Mauro et al., 2015).  HBase is the third component of the Hadoop framework (Bao et al., 2012); it is built on HDFS and is a NoSQL (Not only SQL) type database (Bao et al., 2012).
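A brief Java sketch of the HDFS client API illustrates the point above that HDFS hides the distribution of data across nodes: the application lists and reads paths as if they were local files, while block placement and replication stay invisible. The fs.defaultFS value and the /user/demo paths below are placeholders for this example.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsListAndRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // fs.defaultFS normally comes from core-site.xml; set here for clarity.
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);

        // List a directory; HDFS hides which DataNodes hold the underlying blocks.
        for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }

        // Read a file as a stream, exactly as if it were local.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(new Path("/user/demo/sample.txt"))))) {
            reader.lines().forEach(System.out::println);
        }
        fs.close();
    }
}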

Hadoop Benefits and Limitations

Various studies have addressed the benefits of Hadoop technology, which include scalability, flexibility, cost efficiency, and fault tolerance (H. Hu et al., 2014; Khan et al., 2014; Mishra et al., 2016; Polato, Ré, Goldman, & Kon, 2014; Sakr & Gaber, 2014).  Hadoop allows the nodes in the cluster to scale up and down based on computation requirements, with no change in data formats (H. Hu et al., 2014; Polato et al., 2014).  Hadoop also brings massively parallel computation to commodity hardware, decreasing the cost per terabyte of storage and making massively parallel computation affordable as the volume of data increases (H. Hu et al., 2014).  Hadoop offers flexibility because it is not tied to a schema, which allows the use of any data, whether structured, unstructured, or semi-structured, and the aggregation of data from multiple sources (H. Hu et al., 2014; Polato et al., 2014).  Hadoop also allows nodes to crash without affecting data processing; it provides a fault-tolerant environment where data and computation can be recovered without any negative impact on processing (H. Hu et al., 2014; Polato et al., 2014; White, 2012).

Hadoop faces various limitations, such as a low-level programming paradigm and schema, strictly batch processing, time skew, and incremental computation (Alam & Ahmed, 2014).  Incremental computation is regarded as one of the significant shortcomings of Hadoop technology (Alam & Ahmed, 2014).   Efficient handling of incremental data comes at the expense of compatibility with the programming models offered by non-incremental systems such as MapReduce, requiring the implementation of incremental algorithms and increasing the complexity of the algorithm and the code (Alam & Ahmed, 2014).   A caching technique is proposed by (Alam & Ahmed, 2014) as a solution, operating at three levels: the job, the task, and the hardware (Alam & Ahmed, 2014).

Incoop is another solution, proposed by (Bhatotia, Wieder, Rodrigues, Acar, & Pasquin, 2011).   Incoop extends Hadoop's open-source implementation of the MapReduce programming paradigm to run unmodified MapReduce programs in an incremental manner (Bhatotia et al., 2011; Sakr & Gaber, 2014).  Incoop allows programmers to run MapReduce programs incrementally and automatically without any modification to the code (Bhatotia et al., 2011; Sakr & Gaber, 2014).  Moreover, information about previously executed MapReduce tasks is recorded by Incoop to be reused in subsequent MapReduce computations when possible (Bhatotia et al., 2011; Sakr & Gaber, 2014).

Incoop is not a perfect solution, and it has some shortcomings, which are addressed by (Sakr & Gaber, 2014; Zhang, Chen, Wang, & Yu, 2015).  Several enhancements have been implemented for Incoop, including an incremental HDFS called Inc-HDFS, a Contraction phase, and a “memoization-aware scheduler” (Sakr & Gaber, 2014).  Inc-HDFS applies a delta technique to the inputs of two consecutive job runs and splits the input based on its contents while maintaining compatibility with HDFS.  The Contraction phase is a new phase in the MapReduce framework that breaks up Reduce tasks into smaller sub-computations forming an inverted tree, so that when a small portion of the input changes, only the path from the corresponding leaf to the root needs to be recomputed (Sakr & Gaber, 2014).  The memoization-aware scheduler is a modified version of the Hadoop scheduler that takes advantage of the locality of memoized results (Sakr & Gaber, 2014).

Another solution, called i2MapReduce, was proposed by (Zhang et al., 2015) and compared to Incoop by the same authors.  i2MapReduce does not perform task-level incremental computation but rather key-value pair level incremental processing.  This solution also supports more complex iterative computation, which is used in data mining, and reduces I/O overhead by applying various techniques (Zhang et al., 2015).  IncMR is an enhanced framework for large-scale incremental data processing (Yan et al., 2012).  It inherits the simplicity of standard MapReduce, does not modify HDFS, and utilizes the same MapReduce APIs (Yan et al., 2012).  When using IncMR, all programs can perform incremental data processing without any modification (Yan et al., 2012).

In summary, researchers have exerted various efforts to overcome the incremental computation limitation of Hadoop, such as Incoop, Inc-HDFS, i2MapReduce, and IncMR.  Each proposed solution attempts to enhance and extend standard Hadoop to avoid overheads such as I/O and to increase efficiency, without increasing the complexity of the computation and without requiring any modification to the code.

MapReduce Benefits and Limitations

MapReduce was introduced to solve the problem of parallel processing of large sets of data in a distributed environment, which previously required manual management of hardware resources (Fadzil et al., 2012; Sakr & Gaber, 2014).  The complexity of parallelization is addressed by using two techniques: the Map/Reduce technique and the Distributed File System (DFS) technique (Fadzil et al., 2012; Sakr & Gaber, 2014).  A parallel framework must be reliable to ensure good resource management in a distributed environment built from off-the-shelf hardware, solving the scalability issue to support any future processing requirement (Fadzil et al., 2012).   Earlier frameworks such as the Message Passing Interface (MPI) had reliability and fault-tolerance issues when processing large sets of data (Fadzil et al., 2012).  The MapReduce framework covers the two categories of scalability: structural scalability and load scalability (Fadzil et al., 2012).  It addresses structural scalability by using the DFS, which allows sizeable virtual storage to be formed by adding off-the-shelf hardware, and it addresses load scalability by increasing the number of nodes to improve performance (Fadzil et al., 2012).

However, earlier versions of the MapReduce framework faced challenges. Among these challenges are the join operation and the lack of support for aggregate functions to join multiple datasets in one task (Sakr & Gaber, 2014).  Another limitation of the standard MapReduce framework is iterative processing, which is required for analysis techniques such as the PageRank algorithm, recursive relational queries, and social network analysis (Sakr & Gaber, 2014).  Standard MapReduce does not share the execution of work to reduce the overall amount of work (Sakr & Gaber, 2014).  Another limitation is the lack of support for data indexes and column storage, offering only sequential scanning of the input data; this lack of data indexing affects query performance (Sakr & Gaber, 2014).

Moreover, many have argued that MapReduce is not the optimal solution for structured data.   It is known as a shared-nothing architecture, which supports scalability (Bakshi, 2012; Jinquan, Jie, Shengsheng, Yan, & Yuanhao, 2012; Sakr & Gaber, 2014; White, 2012) and the processing of large unstructured data sets (Bakshi, 2012).  MapReduce also has limitations in performance and efficiency (Lee et al., 2012).

The standard MapReduce framework faces the challenge of iterative computation, which is required in various operations such as data mining, PageRank, network traffic analysis, graph analysis, and social network analysis (Bu, Howe, Balazinska, & Ernst, 2010; Sakr & Gaber, 2014).   These analysis techniques require the data to be processed iteratively until the computation satisfies a convergence or stopping condition (Bu et al., 2010; Sakr & Gaber, 2014).   Due to this limitation and this critical requirement, the iterative process is implemented and executed manually using a driver program when using the standard MapReduce framework (Bu et al., 2010; Sakr & Gaber, 2014).   However, the manual implementation and execution of such iterative computation have two significant problems (Bu et al., 2010; Sakr & Gaber, 2014).  The first problem is that unchanged data are reloaded from iteration to iteration, wasting input/output (I/O), network bandwidth, and CPU resources (Bu et al., 2010; Sakr & Gaber, 2014). The second problem is the overhead of detecting the termination condition, that is, when the output of the application has not changed for two consecutive iterations and has reached a fixed point (Bu et al., 2010; Sakr & Gaber, 2014).  This termination check may require an extra MapReduce job on each iteration, which causes overhead for scheduling extra tasks, reading extra data from disk, and moving data across the network (Bu et al., 2010; Sakr & Gaber, 2014).  A sketch of such a driver program is shown below.
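The following schematic Java driver is not code from the cited studies: the mapper, reducer, and convergence test are placeholders for whatever iterative algorithm (PageRank, for example) is being run. The loop shows where the two overheads described above arise, since each pass re-reads its input from HDFS and the fixpoint check would normally require extra work or an extra job.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IterativeDriver {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);        // initial dataset
        int maxIterations = 10;

        for (int i = 0; i < maxIterations; i++) {
            Path output = new Path(args[1] + "/iter-" + i);

            Job job = Job.getInstance(conf, "iteration-" + i);
            job.setJarByClass(IterativeDriver.class);
            // job.setMapperClass(...); job.setReducerClass(...);  // algorithm-specific placeholders
            FileInputFormat.addInputPath(job, input);
            FileOutputFormat.setOutputPath(job, output);
            if (!job.waitForCompletion(true)) {
                throw new IllegalStateException("Iteration " + i + " failed");
            }

            if (hasConverged(input, output)) {   // placeholder fixpoint check
                break;
            }
            input = output;  // unchanged data is still re-read from HDFS on the next pass
        }
    }

    // Placeholder: a real driver would compare consecutive outputs, often via an extra job.
    private static boolean hasConverged(Path previous, Path current) {
        return false;
    }
}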

Researchers have exerted efforts to solve the iterative computation problem.  HaLoop was proposed by (Bu et al., 2010), Twister by (Ekanayake et al., 2010), and Pregel by (Malewicz et al., 2010).   One solution to the iterative computation limitation, as in HaLoop (Bu et al., 2010) and Twister (Ekanayake et al., 2010), is to identify and keep invariant data during the iterations, so that repeatedly reading unnecessary data is avoided.  HaLoop (Bu et al., 2010) implemented two caching functionalities (Bu et al., 2010; Sakr & Gaber, 2014).  The first caching technique caches the invariant data in the first iteration and reuses them in later iterations. The second caching technique caches the reducer outputs, making the fixpoint check more efficient without adding any extra MapReduce job (Bu et al., 2010; Sakr & Gaber, 2014).

The Pregel solution by (Malewicz et al., 2010) is more focused on graphs and was inspired by the Bulk Synchronous Parallel model.  This solution provides synchronous computation and communication (Malewicz et al., 2010) and uses an explicit messaging approach to acquire remote information rather than replicating remote values locally (Malewicz et al., 2010).  Mahout is another solution introduced to address iterative computing by grouping a series of chained jobs to obtain the results (Polato et al., 2014); the result of each job is pushed into the next job until the final results are obtained (Polato et al., 2014).  iHadoop, proposed by (Elnikety, Elsayed, & Ramadan, 2011), schedules iterations asynchronously and connects the output of one iteration to the next, allowing both to process their data concurrently (Elnikety et al., 2011).   The iHadoop task scheduler utilizes inter-iteration data locality by scheduling tasks that exhibit a producer/consumer relation on the same physical machine, allowing fast transfer of local data (Elnikety et al., 2011).

Apache Hadoop and Apache Spark are the most popular technologies for iterative computation using an in-memory data processing engine (Liang, Li, Wang, & Hu, 2011).  In Hadoop, iterative computation is implemented as a series of MapReduce jobs, where each job independently reads the data from the Hadoop Distributed File System (HDFS), processes the data, and writes the data back to HDFS (Liang et al., 2011).   Dacoop was proposed by (Liang et al., 2011) as an extension to Hadoop to handle data-iterative applications, using a caching technique for repeatedly processed data and introducing a shared-memory-based data cache mechanism (Liang et al., 2011).  iMapReduce is another solution, proposed by (Zhang, Gao, Gao, & Wang, 2012), that supports iterative processing by implementing persistent map and reduce tasks during the whole iterative process and defining how the persistent tasks are terminated (Zhang et al., 2012).   iMapReduce avoids three significant overheads.  The first is the job startup overhead, which is avoided by building an internal loop from reduce to map within a job. The second is the communication overhead, which is avoided by separating the iterated state data from the static structure data.  The third is the synchronization overhead, which is avoided by allowing asynchronous map task execution (Zhang et al., 2012).

Success Stories of Hadoop Technology for Big Data Analytics (BDA)

·         BDA and the Impact of Hadoop in Banking for Cost Reduction

(Davenport & Dyché, 2013) have reported that Big Data has had an impact at an international financial services firm.  The bank has several objectives for Big Data, but the primary objective is to exploit “a vast increase in computing power on dollar-for-dollar basis” (Davenport & Dyché, 2013).  The bank purchased a Hadoop cluster with 50 server nodes and 800 processor cores, capable of handling a petabyte of data.  The bank's data scientists take existing analytical procedures and convert them into the Hive scripting language to run on the Hadoop cluster.

·         BDA and the Impact of Real-Time and Hadoop on Fraud Detection

High-velocity Big Data has created opportunities and requirements for organizations to increase their capability for real-time sensing and response (Chan, 2014).  Real-time analysis and rapid response are critical features of Big Data management in many business situations (Chan, 2014).  For instance, as cited in (Chan, 2014), IBM (2013) identified potential fraud by scrutinizing the five million trade events created each day, and was able to predict customer churn faster by analyzing 500 million daily call detail records in real time.

“Fraud detection is one of the most visible uses of big data analytics”  (Cardenas et al., 2013).  Credit card and phone companies have conducted large-scale fraud detection for decades (Cardenas et al., 2013).  However, the custom-built infrastructure necessary to mine Big Data for fraud detection was not economical enough for wide-scale adoption.  One of the significant impacts of BDA technologies is that they enable a wide variety of industries to develop affordable infrastructure for security monitoring (Cardenas et al., 2013).  The new Big Data technologies of the Hadoop ecosystem, including Pig, Hive, Mahout, and Hadoop itself, together with stream mining, complex-event processing, and NoSQL databases, enable the analysis of not only large-scale but also heterogeneous datasets at unprecedented scale and speed (Cardenas et al., 2013).  These technologies have transformed security analytics by facilitating the storage, maintenance, and analysis of security information (Cardenas et al., 2013).

·         BDA and the Impact of Hadoop in Time Reduction as Business Value

Big Data Analytics can provide a competitive edge in marketing by reducing the time to respond to customers through rapid data capture, aggregation, processing, and analytics.  Harrah’s Entertainment (now Caesars Entertainment) has acquired both Hadoop clusters and open-source and commercial analytics software, with the primary objective of exploring and implementing Big Data to respond in real time to customer marketing and service.  GE is another example and is regarded as the most prominent creator of new service offerings based on Big Data (Davenport & Dyché, 2013).  GE’s primary focus was to optimize the service contracts and maintenance intervals for its industrial products.

Part-II
Hadoop Installation

The purpose of this Part is to go through the installation of Hadoop on a single-node cluster using the Windows 10 operating system.  It covers fifteen significant tasks, starting from the download of the software from the Apache site and ending with a demonstration of the successful installation and configuration.  The steps of the installation are derived from the installation guide of (apache.org, 2018).  Due to the lack of system resources, the Windows operating system was the most appropriate choice for this installation and configuration, although the researcher prefers Unix over Windows because of extensive experience with Unix.  Nevertheless, the installation and configuration experience on Windows has its value as well.

Task-1: Hadoop Software Download

            The purpose of this task is to download the required Hadoop software for the Windows operating system from the following link: http://www-eu.apache.org/dist/hadoop/core/stable/.  Although a version higher than 2.9.1 exists, the researcher selected version 2.9.1 because it is the core stable version recommended by Apache.

Task-2: Java Installation

The purpose of this task is to install Java which is required for Hadoop as indicated in the administration guide.  Java 1.8.0_111 is installed on the system as shown below.

Task-3: Extract Hadoop Zip File

The purpose of this task is to extract Hadoop zip file into a directory under C:\Hadoop-2.9.1 as shown below.

Task-4: Setup Required System Variables.

The purpose of this task is to set up the required system variables.  Set up HADOOP_HOME, as it is required per the installation guide, as sketched below.
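
A minimal sketch of this step from a command prompt is shown below; the paths assume the extract location from Task-3, and appending the bin and sbin folders to PATH is an optional convenience rather than a step stated in the guide (the variables can equally be set through the System Properties dialog).

>setx HADOOP_HOME "C:\Hadoop-2.9.1"
>setx PATH "%PATH%;C:\Hadoop-2.9.1\bin;C:\Hadoop-2.9.1\sbin"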

Task-5: Edit core-site.xml

The purpose of this task is to set up the Hadoop configuration by editing the core-site.xml file under C:\Hadoop-2.9.1\etc\hadoop and adding the fs.defaultFS property to identify the file system for Hadoop using localhost and port 9000, as sketched below.
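
A minimal sketch of the core-site.xml content described above (the property name and the localhost:9000 value follow the task description; the surrounding configuration element is standard Hadoop syntax):

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>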

Task-6: Copy mapred-site.xml.template to mapred-site.xml

The purpose of this task is to copy the MapReduce template.  Copy mapred-site.xml.template to another file mapred-site.xml in the same directory.

Task-7: Edit mapred-site.xml

The purpose of this task is to set up the configuration for Hadoop MapReduce by editing mapred-site.xml and adding a property element between the configuration tags, as sketched below.
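
A minimal sketch of the property block added between the configuration tags; the mapreduce.framework.name property pointing MapReduce at YARN is an assumption based on the common single-node setup, since the original screenshot is not reproduced here:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>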

Task-8: Create Two Folders for DataNode and NameNode

The purpose of this task is to create the datanode and namenode folders, which are required for the Hadoop file system.  Create the folder “data” under the Hadoop home C:\Hadoop-2.9.1, then create the folders “datanode” and “namenode” under C:\Hadoop-2.9.1\data, as sketched below.
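
The folders can be created from a command prompt as sketched below; the paths follow the task description:

>mkdir C:\Hadoop-2.9.1\data
>mkdir C:\Hadoop-2.9.1\data\datanode
>mkdir C:\Hadoop-2.9.1\data\namenode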

Task-9: Edit hdfs-site.xml

The purpose of this task is to set up the configuration for Hadoop HDFS by editing the file C:\Hadoop-2.9.1\etc\hadoop\hdfs-site.xml and adding the properties for dfs.replication, dfs.namenode, and dfs.datanode, as sketched below.
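
A minimal sketch of the hdfs-site.xml properties; the replication factor of 1 is an assumption appropriate for a single-node cluster, and the directory values reuse the folders created in Task-8:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///C:/Hadoop-2.9.1/data/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///C:/Hadoop-2.9.1/data/datanode</value>
  </property>
</configuration>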

Task-10: Edit yarn-site.xml

The purpose of this task is to set the configuration for the YARN tool by editing the file C:\Hadoop-2.9.1\etc\hadoop\yarn-site.xml and adding yarn.nodemanager.aux-services with its value of mapreduce_shuffle, as sketched below.
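
A minimal sketch of the yarn-site.xml property described above:

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>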

Task-11: Overcome Java Error

The purpose of this task is to overcome the Java error.  Edit C:\Hadoop-2.9.1\etc\hadoop\hadoop-env.cmd and add JAVA_HOME, as sketched below, to overcome the following error.
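
A sketch of the hadoop-env.cmd edit; the JDK path below is hypothetical and must be replaced with the actual Java 1.8.0_111 install location (a path without spaces avoids further errors):

@rem Hypothetical JDK location -- replace with the actual install path
set JAVA_HOME=C:\Java\jdk1.8.0_111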

Task-12: Test the Configuration

            The purpose of this task is to test the current configuration and setup by issuing the following command before running Hadoop.  At this point, the command will throw an error stating that HADOOP_COMMON_HOME is not found.

>hdfs namenode -format

Task-13: Overcome the HADOOP_COMMON_HOME Error

To overcome the HADOOP_COMMON_HOME “not found” error, edit hadoop-env.cmd and add the required Hadoop home variables (see the sketch below), then issue the command again; it will now pass as shown below.
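
One common form of this hadoop-env.cmd addition is sketched below; the exact set of variables is an assumption, since the original screenshot is not reproduced here, but all of them conventionally point back to the Hadoop home directory:

set HADOOP_HOME=C:\Hadoop-2.9.1
set HADOOP_COMMON_HOME=%HADOOP_HOME%
set HADOOP_HDFS_HOME=%HADOOP_HOME%
set HADOOP_YARN_HOME=%HADOOP_HOME%
set HADOOP_CONF_DIR=%HADOOP_HOME%\etc\hadoop
set PATH=%PATH%;%HADOOP_HOME%\bin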

Task-14: Start Hadoop Processes

The purpose of this task is to start the Hadoop DFS and YARN processes, as sketched below.
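
The processes are started from the sbin folder of the Hadoop home, as sketched below:

>cd C:\Hadoop-2.9.1\sbin
>start-dfs.cmd
>start-yarn.cmd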

Task-15: Run the Cluster Page from the Browser

            The purpose of this task is to open the cluster page for Hadoop in the browser after the previous configuration setup.  If the configuration is implemented successfully, the cluster page is displayed with the Hadoop functionality as shown below; otherwise, it can throw a 404 “page not found” error.
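
Assuming the stock Hadoop 2.x default ports, the web pages can be reached at the addresses below; the port numbers would differ if they were overridden in the configuration:

http://localhost:8088    (YARN ResourceManager / cluster page)
http://localhost:50070   (HDFS NameNode web UI)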

Conclusion

This project has discussed various significant topics related to Big Data Analytics.  It addressed two significant parts.  Part-I discussed Big Data and the emerging technology of Hadoop.  It provided an overview of the Hadoop ecosystem, its building blocks, benefits, and limitations.  It also discussed the MapReduce framework, its benefits, and its limitations.  Part-I also provided a few success stories of Hadoop technology used with Big Data Analytics.  Part-II addressed the installation and configuration of Hadoop on the Windows operating system through fifteen essential tasks.  It also addressed the errors encountered during the configuration setup and the techniques used to overcome these errors and proceed successfully with the Hadoop installation.

References

Abbasi, A., Sarker, S., & Chiang, R. (2016). Big data research in information systems: Toward an inclusive research agenda. Journal of the Association for Information Systems, 17(2), 3.

Alam, A., & Ahmed, J. (2014). Hadoop architecture and its issues. Paper presented at the Computational Science and Computational Intelligence (CSCI), 2014 International Conference on.

apache.org. (2018). Hadoop Installation Guide – Windows. Retrieved from https://wiki.apache.org/hadoop/Hadoop2OnWindows.

Bakshi, K. (2012). Considerations for big data: Architecture and approach. Paper presented at the Aerospace Conference, 2012 IEEE.

Bao, Y., Ren, L., Zhang, L., Zhang, X., & Luo, Y. (2012). Massive sensor data management framework in cloud manufacturing based on Hadoop. Paper presented at the Industrial Informatics (INDIN), 2012 10th IEEE International Conference on.

Bates, D. W., Saria, S., Ohno-Machado, L., Shah, A., & Escobar, G. (2014). Big data in health care: using analytics to identify and manage high-risk and high-cost patients. Health Affairs, 33(7), 1123-1131.

Bhatotia, P., Wieder, A., Rodrigues, R., Acar, U. A., & Pasquin, R. (2011). Incoop: MapReduce for incremental computations. Paper presented at the Proceedings of the 2nd ACM Symposium on Cloud Computing.

Bi, Z., & Cochran, D. (2014). Big data analytics with applications. Journal of Management Analytics, 1(4), 249-265.

Bu, Y., Howe, B., Balazinska, M., & Ernst, M. D. (2010). HaLoop: Efficient iterative data processing on large clusters. Proceedings of the VLDB Endowment, 3(1-2), 285-296.

Cardenas, A. A., Manadhata, P. K., & Rajan, S. P. (2013). Big data analytics for security. IEEE Security & Privacy, 11(6), 74-76.

Chan, J. O. (2014). An architecture for big data analytics. Communications of the IIMA, 13(2), 1.

Chandarana, P., & Vijayalakshmi, M. (2014). Big Data analytics frameworks. Paper presented at the Circuits, Systems, Communication and Information Technology Applications (CSCITA), 2014 International Conference on.

Chang, V., Kuo, Y.-H., & Ramachandran, M. (2016). Cloud computing adoption framework: A security framework for business clouds. Future Generation computer systems, 57, 24-41. doi:10.1016/j.future.2015.09.031

Chopra, A., & Madan, S. (2015). Big Data: A Trouble or A Real Solution? International Journal of Computer Science Issues (IJCSI), 12(2), 221.

CSA, C. S. A. (2013). Big Data Analytics for Security Intelligence. Big Data Working Group.

Davenport, T. H., & Dyché, J. (2013). Big data in big companies. International Institute for Analytics.

De Mauro, A., Greco, M., & Grimaldi, M. (2015). What is big data? A consensual definition and a review of key research topics. Paper presented at the AIP Conference Proceedings.

Ekanayake, J., Li, H., Zhang, B., Gunarathne, T., Bae, S.-H., Qiu, J., & Fox, G. (2010). Twister: a runtime for iterative mapreduce. Paper presented at the Proceedings of the 19th ACM international symposium on high performance distributed computing.

Elnikety, E., Elsayed, T., & Ramadan, H. E. (2011). iHadoop: asynchronous iterations for MapReduce. Paper presented at the Cloud Computing Technology and Science (CloudCom), 2011 IEEE Third International Conference on.

Emani, C. K., Cullot, N., & Nicolle, C. (2015). Understandable big data: A survey. Computer science review, 17, 70-81.

Erl, T., Khattak, W., & Buhler, P. (2016). Big Data Fundamentals: Concepts, Drivers & Techniques: Prentice Hall Press.

Fadzil, A. F. A., Khalid, N. E. A., & Manaf, M. (2012). Performance of scalable off-the-shelf hardware for data-intensive parallel processing using MapReduce. Paper presented at the Computing and Convergence Technology (ICCCT), 2012 7th International Conference on.

Gantz, J., & Reinsel, D. (2011). Extracting value from chaos. IDC iview, 1142, 1-12.

Géczy, P. (2014). Big data characteristics. The Macrotheme Review, 3(6), 94-104.

Gupta, B., & Jyoti, K. (2014). Big data analytics with hadoop to analyze targeted attacks on enterprise data.

Gupta, R., Gupta, H., & Mohania, M. (2012). Cloud computing and big data analytics: what is new from databases perspective? Paper presented at the International Conference on Big Data Analytics.

Hirzel, M., Andrade, H., Gedik, B., Jacques-Silva, G., Khandekar, R., Kumar, V., . . . Soulé, R. (2013). IBM streams processing language: Analyzing big data in motion. IBM Journal of Research and Development, 57(3/4), 7: 1-7: 11.

Hu, H., Wen, Y., Chua, T.-S., & Li, X. (2014). Toward scalable systems for big data analytics: A technology tutorial. IEEE Access, 2, 652-687.

Hu, P., & Dai, W. (2014). Enhancing fault tolerance based on Hadoop cluster. International Journal of Database Theory and Application, 7(1), 37-48.

Inukollu, V. N., Arsi, S., & Ravuri, S. R. (2014). Security issues associated with big data in cloud computing. International Journal of Network Security & Its Applications, 6(3), 45.

Jain, R. (2013). Big Data Fundamentals. Retrieved from http://www.cse.wustl.edu/~jain/cse570-13/ftp/m_10abd.pdf.

Jinquan, D., Jie, H., Shengsheng, H., Yan, L., & Yuanhao, S. (2012). The Hadoop Stack: New Paradigm for Big Data Storage and Processing. Intel Technology Journal, 16(4), 92-110.

Kaisler, S., Armour, F., Espinosa, J. A., & Money, W. (2013). Big data: issues and challenges moving forward. Paper presented at the System Sciences (HICSS), 2013 46th Hawaii International Conference on System Sciences.

Katal, A., Wazid, M., & Goudar, R. (2013). Big data: issues, challenges, tools and good practices. Paper presented at the Contemporary Computing (IC3), 2013 Sixth International Conference on Contemporary Computing.

Khan, N., Yaqoob, I., Hashem, I. A. T., Inayat, Z., Mahmoud Ali, W. K., Alam, M., . . . Gani, A. (2014). Big Data: Survey, Technologies, Opportunities, and Challenges. The Scientific World Journal, 2014.

Krishnan, K. (2013). Data warehousing in the age of big data: Newnes.

Lee, K.-H., Lee, Y.-J., Choi, H., Chung, Y. D., & Moon, B. (2012). Parallel data processing with MapReduce: a survey. ACM SIGMOD Record, 40(4), 11-20.

Liang, Y., Li, G., Wang, L., & Hu, Y. (2011). Dacoop: Accelerating data-iterative applications on Map/Reduce cluster. Paper presented at the Parallel and Distributed Computing, Applications and Technologies (PDCAT), 2011 12th International Conference on.

Malewicz, G., Austern, M. H., Bik, A. J., Dehnert, J. C., Horn, I., Leiser, N., & Czajkowski, G. (2010). Pregel: a system for large-scale graph processing. Paper presented at the Proceedings of the 2010 ACM SIGMOD International Conference on Management of data.

Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Byers, A. H. (2011). Big data: The next frontier for innovation, competition, and productivity.

Minelli, M., Chambers, M., & Dhiraj, A. (2013). Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today’s Businesses: John Wiley & Sons.

Mishra, B. S. P., Dehuri, S., & Kim, E. (2016). Techniques and Environments for Big Data Analysis: Parallel, Cloud, and Grid Computing (Vol. 17): Springer.

Moorthy, M., Baby, R., & Senthamaraiselvi, S. (2014). An Analysis for Big Data and its Technologies. International Journal of Science, Engineering and Computer Technology, 4(12), 412.

Nasser, T., & Tariq, R. (2015). Big Data Challenges. J Comput Eng Inf Technol 4: 3. doi:10.4172/2324, 9307, 2.

Polato, I., Ré, R., Goldman, A., & Kon, F. (2014). A comprehensive view of Hadoop research—A systematic literature review. Journal of Network and Computer Applications, 46, 1-25.

Ramesh, B. (2015). Big Data Architecture Big Data (pp. 29-59): Springer.

Sakr, S., & Gaber, M. (2014). Large Scale and big data: Processing and Management: CRC Press.

White, T. (2012). Hadoop: The definitive guide: O’Reilly Media, Inc.

Yan, C., Yang, X., Yu, Z., Li, M., & Li, X. (2012). Incmr: Incremental data processing based on mapreduce. Paper presented at the Cloud Computing (CLOUD), 2012 IEEE 5th International Conference on.

Yang, H.-c., Dasdan, A., Hsiao, R.-L., & Parker, D. S. (2007). Map-reduce-merge: simplified relational data processing on large clusters. Paper presented at the Proceedings of the 2007 ACM SIGMOD international conference on Management of data.

Zhang, Y., Chen, S., Wang, Q., & Yu, G. (2015). i^2MapReduce: Incremental MapReduce for Mining Evolving Big Data. IEEE transactions on knowledge and data engineering, 27(7), 1906-1919.

Zhang, Y., Gao, Q., Gao, L., & Wang, C. (2012). imapreduce: A distributed computing framework for iterative computation. Journal of Grid Computing, 10(1), 47-68.

 

XML in Healthcare and eCommerce

Dr. Aly, O.
Computer Science

The purpose of this discussion is to address the advantages and disadvantages of XML used in big data analytics for large healthcare organizations. The discussion also presents the use of XML in the healthcare industry as well as in another industry such as eCommerce.

Advantages of XML

XML has several advantages, such as simplicity, platform and vendor independence, extensibility, reuse by many applications, separation of content and presentation, and improved load balancing (Connolly & Begg, 2015; Fawcett, Ayers, & Quin, 2012).  XML also provides support for the integration of data from multiple sources (Connolly & Begg, 2015; Fawcett et al., 2012).  XML can describe data from a wide variety of applications (Connolly & Begg, 2015; Fawcett et al., 2012).  More advanced search engine capability is another advantage of XML (Connolly & Begg, 2015).  (Brewton, Yuan, & Akowuah, 2012) have identified two significant benefits of XML.  First, XML supports tags created by the users, which makes the language fully extensible and overcomes any tag limitation.  The second significant benefit of XML in healthcare is versatility: any data type can be modeled, and tags can be created for specific contexts.

Disadvantages of XML

The specification of the namespace prefix within DTDs is a significant limitation, as users cannot choose their own namespace prefix but must use the prefix defined within the DTD (Fawcett et al., 2012).  This limitation exists because W3C completed the XML Recommendation before finalizing how namespaces would work.  While DTDs have poor support for XML namespaces, they play an essential part in the XML Recommendation.  Furthermore, (Forster, 2008) has identified a few disadvantages of XML.  Inefficiency is one of these limitations, as XML was initially designed to accommodate the exchange of data between nodes of different systems and not as a database storage platform.  XML is described as inefficient compared to other storage formats (Forster, 2008).  The tags of XML make it readable to humans but require additional storage and bandwidth (Forster, 2008).  Encoded image data represented in XML requires another program to be displayed, as it must be decoded and then reassembled into an image (Forster, 2008).  XML also involves parsers, APIs, and processing engines with which inexperienced developers may not be familiar.  In addition, XML lacks rendering instructions, as it is a back-end technology for data storage and transmission.  (Brewton et al., 2012) have identified two significant limitations of XML.  The first is the lack of applications that can process XML data and make that data useful; browsers use HTML to render XML documents, which indicates that XML cannot be used independently of HTML.  The second major limitation of XML is the unlimited flexibility of the language: tags are created by the user, and there is no standard accepted set of tags for an XML document.  The result of this limitation is that developers cannot create general applications, as each company will have its own application with its own set of tags.

XML in Healthcare

Concerning XML in healthcare, (Brewton et al., 2012) have indicated that XML was a solution to the problem of finding a reliable and standardized means of storing and exchanging clinical documents.  The American National Standards Institute has accredited Health Level 7 (HL7) as an organization responsible for setting up many of the communication standards used across America (Brewton et al., 2012).  The goal of this organization is to provide standards for the exchange, management, and integration of data that support clinical patient care and the management, delivery, and evaluation of healthcare services (Brewton et al., 2012).  Furthermore, HL7 has developed the Clinical Document Architecture (CDA) to provide standards for the representation of clinical documents such as discharge summaries and progress notes.  The goal of CDA is to solve the problem of finding a reliable and standardized means of storing and exchanging clinical documents by specifying a markup and semantic structure through XML, allowing medical institutions to share clinical documents.  HL7 version 3 includes the rules for messaging as well as CDA, which are implemented with XML and are derived from the Reference Information Model (RIM).  Besides, XML supports the hierarchical structure of CDA (Brewton et al., 2012).  Healthcare data must be secured to protect the privacy of patients.  XML provides signature capabilities that operate identically to regular digital signatures (Brewton et al., 2012).  In addition to the XML signature, XML has encryption capabilities that address requirements in areas not covered by the secure socket layer technique (Brewton et al., 2012).  (Goldberg et al., 2005) have identified some limitations of XML when working with images in the biological domain.  The bulk of an image file is represented by the pixels in the image and not the metadata, which is regarded as a severe problem.  A related problem is that XML is verbose, meaning that an XML file is already larger than the equivalent binary file, and image files are already quite large, which causes another problem when using XML in healthcare (Goldberg et al., 2005).

XML in eCommerce

(Sadath, 2013) has discussed some benefits and limitations of XML in the eCommerce domain.  XML has the advantage of being a flexible hierarchical model suitable for representing semi-structured data.  It is used effectively in data mining and is described as the most common tool for data transformation between different types of applications.  In data mining with XML, there are two approaches for accessing an XML document: keyword-based search and query-answering.  The keyword-based approach offers little advantage because the search takes place only on the textual content of the document.  The query-answering approach, however, requires that the structure of the XML document be known in advance, which is often not the case.  The consequence of this lack of knowledge about the structure can be information overload, where too much data is included because the keyword information does not exist or, where it exists incorrectly, incorrect answers are returned (Sadath, 2013).  Thus, researchers have made various efforts to find the best approach for data mining in XML, such as XQuery or Tree-based Association Rules (TARs) as means of representing intentional knowledge in native XML.

References

Brewton, J., Yuan, X., & Akowuah, F. (2012). XML in health information systems. Paper presented at the Proceedings of the International Conference on Bioinformatics & Computational Biology (BIOCOMP).

Connolly, T., & Begg, C. (2015). Database Systems: A Practical Approach to Design, Implementation, and Management (6th Edition ed.): Pearson.

Fawcett, J., Ayers, D., & Quin, L. R. (2012). Beginning XML: John Wiley & Sons.

Forster, D. (2008). Advantages and Disadvantages that You Should Know About XML. Retrieved from https://www.informdecisions.com/downloads/XML_Advantages_and_Disadvantages.pdf.

Goldberg, I. G., Allan, C., Burel, J.-M., Creager, D., Falconi, A., Hochheiser, H., . . . Swedlow, J. R. (2005). The Open Microscopy Environment (OME) Data Model and XML file: open tools for informatics and quantitative analysis in biological imaging. Genome biology, 6(5), R47.

Sadath, L. (2013). Data mining in E-commerce: A CRM Platform. International Journal of Computer Applications, 68(24).

eXtensible Markup Language (XML)

Dr. O. Aly
Computer Science

Introduction

The purpose of this discussion is to discuss how XML is used to represent Big Data and in various forms.  The discussion begins with some basic information about XML, followed by three pillars of XML, XML elements and Attributes, and document-centric vs. the data-centric view of XML.  Big Data and XML representation are also addressed in this discussion, followed by the XML processing efficiency and Hadoop technology.

What is XML

XML stands for eXtensible Markup Language, which can be utilized to describe data in a meaningful way (Fawcett, Ayers, & Quin, 2012).  It has gained a good reputation due to its support for interoperability among various applications and for passing data between different components (Fawcett et al., 2012).  It has been used to describe documents and data in a text-based, standardized format that can be transferred via Internet standard protocols (Benz & Durant, 2004; Fawcett et al., 2012).  Various standardized formats for XML are available, known as “schemas,” which represent various types of data such as medical records, financial transactions, and GPS (Fawcett et al., 2012).

XML has no tags of its own like HTML (Howard, 2010).  Instead, XML allows users to create their own tags as needed, provided that these tags follow the rules of the XML specification (Howard, 2010).  These rules include a root element, closing tags, properly nested elements, case sensitivity, and quoted attribute values (Howard, 2010).  Figure 1 summarizes these rules of XML.  A document type definition (DTD) or schema enforces these rules.


Figure 1.  A Summary of the XML Rules.

Like HTML, XML is based on the Standard Generalized Markup Language (SGML) (Benz & Durant, 2004; Fawcett et al., 2012; Nambiar, Lacroix, Bressan, Lee, & Li, 2002).  SGML was developed in 1974 as part of the IBM document-sharing project and was officially standardized by the International Organization for Standardization (ISO) in 1986 (Benz & Durant, 2004).  Although SGML was developed to define various types of markup, it was found to be complicated, and hence few applications could read SGML (Fawcett et al., 2012; Nambiar et al., 2002).  Hyper Text Markup Language (HTML) was the first adoption of SGML (Benz & Durant, 2004).  HTML was explicitly designed to describe documents for display in a Web browser.  However, with the explosion of the Web and the need to do more than just display data in a browser, developers struggled with the effectiveness of HTML and strived to find a method that could describe data on the Web more effectively than HTML (Benz & Durant, 2004).

In 1998, the World Wide Web Consortium (W3C) combined SGML’s basic characteristic of separating data from format with the extended HTML tag formats used for the Web, and developed the first XML Recommendation (Benz & Durant, 2004).  (Myer, 2005) has described HTML as a presentation language and XML as a data-description language.  In brief, XML was designed to overcome the limitations of both SGML and HTML (Benz & Durant, 2004; Fawcett et al., 2012; Nambiar et al., 2002).

Three Pillars of XML

(Benz & Durant, 2004) have identified three pillars of XML: extensibility, structure, and validity (Figure 2).  The extensibility pillar reflects the ability of XML to describe structured data as text, with a format that is open to extension, meaning that any data which can be described as text and nested in XML tags can be generated as an XML file.  Structure is the second pillar: although an XML file can be complicated for a human to follow, it is designed to be read by applications, and XML parsers and other tools that read XML are designed to process the format easily.  Data representations using XML are much larger than their original format (Benz & Durant, 2004; Fawcett et al., 2012).  The validity pillar means that the data in an XML file can optionally be validated for structure and content, based on two validation standards: the document type definition (DTD) and the XML Schema standard (Benz & Durant, 2004).


Figure 2. Three Pillars of XML.

XML Elements and Attributes

XML was developed to describe data and documents more effectively than HTML; therefore, the W3C XML Recommendation provides strict instructions on the format requirements that distinguish a text file that merely contains various tags from a true XML file with well-formed tags (Benz & Durant, 2004; Fawcett et al., 2012; Howard, 2010).

There are two main features of an XML file, known as “elements” and “attributes” (Benz & Durant, 2004; Fawcett et al., 2012; Howard, 2010).  In the example below, “applicationUsers” and “user” reflect the element feature of XML, while “firstName” and “lastName” reflect the attribute feature.  Text can be placed between the opening and closing tags of an element to represent the actual data associated with the elements and attributes surrounding the text (Benz & Durant, 2004; Fawcett et al., 2012; Howard, 2010).  Figure 3 shows the elements and attributes in a simple way.  Figure 4 illustrates a simple XML document, showing the first line, which determines what version of the W3C XML Recommendation the document adheres to, in addition to XML rules such as the root element.
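
The hypothetical fragment below illustrates the elements and attributes discussed above; the names applicationUsers, user, firstName, and lastName follow the text, while the values are invented for illustration:

<?xml version="1.0" encoding="UTF-8"?>
<applicationUsers>
  <user firstName="Jane" lastName="Smith">Administrator account</user>
  <user firstName="John" lastName="Doe">Standard account</user>
</applicationUsers>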


Figure 3.  XML Elements and Features.


Figure 4.  Simple XML Document Format Adapted from (Benz & Durant, 2004).

Document-Centric vs. Data-Centric View of XML

XML documents use a DTD to derive their structure.  Thus, XML documents can take any structure, while relational and object-relational data models have a fixed, pre-defined structure (Bourret, 2010; Nambiar et al., 2002).  Structured data expressed in XML in combination with a DTD or an XML Schema specification is described as the data-centric format, or data-centric characteristic, of XML (Bourret, 2010; Nambiar et al., 2002).  The data-centric view of XML is highly structured, similar to a relational database, and the order of sibling elements is not essential in such documents.  Various query languages were developed for the data-centric format of XML, such as XML-QL, LOREL, and XQL, which require the data to be fully structured (Nambiar et al., 2002).  Data in this format is typically stored in a database that is said to be XML-enabled, or handled by third-party software such as middleware, data integration software, or a Web application server (Bourret, 2010).

However, the document-centric format of XML is highly unstructured (Nambiar et al., 2002).  The data in the document-centric format of XML can be stored and retrieved using a native-XML database or a document management system (Bourret, 2010).  Furthermore, the implicit and explicit order of the elements matters in such XML documents (Nambiar et al., 2002).  The implicit order is represented by the order of the elements within a file in a tree-like representation, while the explicit order is represented by an attribute or a tag in the document (Nambiar et al., 2002).  The explicit order can be expressed in a relational database, whereas capturing the implicit order while converting a document-centric XML document into a relational database was a challenge (Nambiar et al., 2002).  In addition to the implicit-order challenge, XML documents differ from a relational representation by allowing deep nesting and hyper-linked components (Nambiar et al., 2002).  The transformation of implicit order, nesting, and hyperlinks into tables can be a solution; however, such a transformation is costly in terms of time and space (Nambiar et al., 2002).  Thus, XML processing efficiency was a challenge.

Big Data and XML Representation

Big Data was first defined using the well-known 3V features reflecting volume, velocity, and variety (Wang, Kung, & Byrd, 2018; Ylijoki & Porras, 2016).  The volume feature reflects the magnitude and size of the data, from terabytes to exabytes.  The velocity feature reflects the speed of data growth and the speed of data processing, from batch to real-time and streaming.  The variety feature of Big Data reflects the various types of data, from text to graph, including structured as well as unstructured and semi-structured data (Wang et al., 2018; Ylijoki & Porras, 2016).

Big Data development has gone through an evolutionary phase as well as a revolutionary phase.  The evolutionary phase of Big Data development spanned the period from 2001 to 2008 (Wang et al., 2018).  During that evolutionary period, it became possible for sophisticated software to meet the needs and requirements of dealing with the explosive growth of the data (Wang et al., 2018).  Analytics modules were added using software and application developments like XML web services, database management systems, and Hadoop, in addition to functions added to core modules that focused on enhancing usability for end users (Wang et al., 2018).  These software and application developments, such as XML web services, database management systems, and Hadoop, enabled users to process a large amount of data within and across organizations collaboratively as well as in real time (Wang et al., 2018).  During the 2000s, XML became the standard formatting language for semi-structured data, mostly for online purposes, which led to the development of the XML database, regarded as a new generation of database (Verheij, 2013).

Healthcare organizations, at the same time, began to digitize medical records and aggregate clinical data in substantial electronic databases (Wang et al., 2018).  Such development of software and applications like XML web services, database management systems, and Hadoop made the significant volume of healthcare data storable, usable, searchable, and actionable, and assisted healthcare providers in practicing medicine more effectively (Wang et al., 2018).

Starting from 2009, Big Data Analytics entered an advanced, revolutionary phase, in which the computing of big data became a breakthrough innovation for business intelligence (Wang et al., 2018).  Besides, data management and its techniques were predicted to shift from structured to unstructured data, and from a static environment to a ubiquitous cloud-based environment (Wang et al., 2018).  The data for the healthcare industry continued to grow, and as of 2011, the stored data for healthcare reached 150 exabytes (1 EB = 10^18 bytes) worldwide, mainly in the form of electronic health records (Wang et al., 2018).  Other Big Data Analytics pioneers, such as banks and e-commerce firms, started to experience the impact on business process improvement, workforce effectiveness, cost reduction, and new customer attraction (Wang et al., 2018).

The data management approaches for Big Data include various types of databases, such as columnar, document stores, key-value/tuple stores, graph, multimodal, object, grid and cloud database solutions, XML databases, multi-dimensional, and multi-value (Williams, 2016).  Big Data analytics systems are distinguished from traditional data management systems by their capability to analyze semi-structured or unstructured data, which traditional data management systems lack (Williams, 2016).  XML, as a textual language for exchanging data on the Web, is regarded as a typical example of semi-structured data (Benz & Durant, 2004; Gandomi & Haider, 2015; Nambiar et al., 2002).

In various industries such as healthcare, semi-structured and unstructured data refer to information which cannot be stored in a traditional database and cannot fit into predefined data models.  Examples of such semi-structured and unstructured healthcare data include XML-based electronic healthcare records, clinical images, medical transcripts, and lab results.  As a case study, (Luo, Wu, Gopukumar, & Zhao, 2016) have referenced a project in which a hybrid XML database and Hadoop/HBase infrastructure was used to design a Clinical Data Managing and Analyzing System.

 XML Processing Efficiency and Hadoop Technology

Organizations can derive value from XML documents, which represent semi-structured data (Aravind & Agrawal, 2014).  To derive value from these semi-structured XML documents, the XML data needs to be ingested into Hadoop for analytic purposes (Aravind & Agrawal, 2014).  However, Hadoop does not offer a standard XML “RecordReader,” even though XML is one of the standard file formats for MapReduce (Lublinsky, Smith, & Yakubovich, 2013).

There is an increasing demand for efficient processing of large volumes of data stored in XML using Apache Hadoop MapReduce (Vasilenko & Kurapati, 2014).  Various approaches have been used to make XML processing efficient.  An ETL process for extracting the data is one approach (Vasilenko & Kurapati, 2014).  Transforming XML into other formats that are natively supported by Hive is another technique (Vasilenko & Kurapati, 2014).  A further approach is to use the Apache Hive XPath UDFs; however, these functions can only be used in Hive views and SELECT statements, not in a CREATE TABLE DDL (Vasilenko & Kurapati, 2014).  (Subhashini & Arya, 2012) have described a few attempts by various researchers, such as a generic XML-based Web information extraction solution built on two key technologies: XML-based Web data conversion and XSLT (Extensible Stylesheet Language Transformations).  The XML-based Web data conversion technology converts HTML into an XHTML document using XML rules and builds an XML DOM tree, and a DOM-based XPath algorithm generates XPath expressions for the desired information nodes once the information points are marked by the users.  XSLT is then used to extract the required information from the XHTML document, and the results of the extraction are expressed in XML (Subhashini & Arya, 2012).  XSLT is regarded as one of the most important XML technologies for solving information processing issues (Holman, 2017).  Other attempts included the use of a wrapper based on the XBRL (eXtensible Business Reporting Language) GL taxonomy to extract financial data from the Web (Subhashini & Arya, 2012).  These are a few of the attempts to solve the processing issues outlined in (Subhashini & Arya, 2012).
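
As a rough illustration of the XSLT step, the hypothetical stylesheet below pulls the text of every XHTML table cell marked with a class of “price” into a small XML result document; the element and class names are invented, and a real extraction would use the XPath expressions generated for the marked information nodes:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xh="http://www.w3.org/1999/xhtml">
  <xsl:output method="xml" indent="yes"/>
  <!-- Match the whole XHTML document and emit only the marked cells -->
  <xsl:template match="/">
    <extractedItems>
      <xsl:for-each select="//xh:td[@class='price']">
        <item><xsl:value-of select="normalize-space(.)"/></item>
      </xsl:for-each>
    </extractedItems>
  </xsl:template>
</xsl:stylesheet>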

References

Aravind, P. S., & Agrawal, V. (2014). Processing XML data in BigInsights 3.0. Retrieved from https://developer.ibm.com/hadoop/2014/10/31/processing-xml-data-biginsights-3-0/.

Benz, B., & Durant, J. R. (2004). XML programming bible (Vol. 129): John Wiley & Sons.

Bourret, R. (2010). XML Database Products. Retrieved from http://www.rpbourret.com/xml/XMLDatabaseProds.htm.

Fawcett, J., Ayers, D., & Quin, L. R. (2012). Beginning XML: John Wiley & Sons.

Gandomi, A., & Haider, M. (2015). Beyond the hype: Big data concepts, methods, and analytics. International Journal of Information Management, 35(2), 137-144.

Holman, G. K. (2017). What is XSLT? Retrieved from https://www.xml.com/articles/2017/01/01/what-is-xslt/.

Howard, G. K. (2010). Xml: Visual Quickstart Guide, 2/E: Pearson Education India.

Lublinsky, B., Smith, K. T., & Yakubovich, A. (2013). Professional hadoop solutions: John Wiley & Sons.

Luo, J., Wu, M., Gopukumar, D., & Zhao, Y. (2016). Big data application in biomedical research and health care: a literature review. Biomedical informatics insights, 8, BII. S31559.

Myer, T. (2005). A Really, Really, Really Good Introduction to XML. Retrieved from https://www.sitepoint.com/really-good-introduction-xml/.

Nambiar, U., Lacroix, Z., Bressan, S., Lee, M. L., & Li, Y. G. (2002). Efficient XML data management: an analysis. Paper presented at the International Conference on Electronic Commerce and Web Technologies.

Subhashini, C., & Arya, A. (2012). A Framework For Extracting Information From Web Using VTD-XML’s XPath. International Journal on Computer Science and Engineering, 4(3), 463.

Vasilenko, D., & Kurapati, M. (2014). Efficient processing of xml documents in hadoop map reduce.

Verheij, B. (2013). The process of big data solution adoption. TU Delft, Delft University of Technology.  

Wang, Y., Kung, L., & Byrd, T. A. (2018). Big data analytics: Understanding its capabilities and potential benefits for healthcare organizations. Technological Forecasting and Social Change, 126, 3-13.

Williams, S. (2016). Business intelligence strategy and big data analytics: a general management perspective: Morgan Kaufmann.

Ylijoki, O., & Porras, J. (2016). Perspectives to definition of big data: a mapping study and discussion. Journal of Innovation Management, 4(1), 69-91.

Quantitative Analysis of Online Radio “LastFM” Dataset Using R-Programming

Dr. O. Aly
Computer Science

Abstract

The purpose of this project is to analyze the online radio dataset (lastfm.csv).  The project is divided into two main parts.  Part-I evaluates and examines the dataset using RStudio and involves three major tasks to review and understand the dataset variables.  Part-II discusses the pre-data analysis, converting the dataset to a data frame and involving three major tasks to analyze the data frame.  The Association Rule data mining technique is used in this project.  The support for each of the 1,004 artists is calculated, and the support is displayed for all artists with support larger than 8%, indicating that the artists shown on the graph (Figure 4) are played by more than 8% of the users.  The association rules are constructed using the “apriori” function in the R package arules.  The search was implemented for artists or groups of artists who have support larger than 1% and who give confidence larger than 50% to another artist.  These requirements rule out rare artists.  The antecedents (LHS) that involve more than one artist are also calculated and listed.  The list is further narrowed down by requiring that the lift be larger than 5, and the resulting list is ordered by decreasing confidence as illustrated in Figure 6.

Keywords: Online Radio, Association Rule Data Mining Analysis

Introduction

This project examines and analyzes the lastfm.csv dataset.  The dataset is downloaded from the CTU course materials.  The lastfm.csv dataset reflects an online radio service which keeps track of everything the user plays.  It has 289,955 observations on four variables.  The focus of this analysis is Association Rules.  The information in the dataset is used for recommending music the user is likely to enjoy and supports focused marketing which sends the user advertisements for music the user is likely to buy.  From the available information, such as demographic information (age, sex, and location), the support for the frequencies of listening to various individual artists can be determined, as well as the joint support for pairs or larger groupings of artists.  To calculate such support, the incidences (0/1) are counted across all members of the network, and those frequencies are divided by the number of members.  From the support, the confidence and the lift are calculated, as sketched below.
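
As a minimal sketch of that calculation, and assuming a small hypothetical 0/1 incidence matrix with one row per user and one column per artist, the support, confidence, and lift can be computed directly in R:

  • ## Hypothetical 0/1 incidence matrix: one row per user, one column per artist
  • incidence <- matrix(c(1,0,1, 1,1,0, 0,1,1, 1,1,1), ncol=3, byrow=TRUE)
  • colnames(incidence) <- c("radiohead", "muse", "beatles")
  • ## Support: count of incidences divided by the number of members
  • support <- colSums(incidence) / nrow(incidence)
  • ## Joint support, confidence, and lift for the rule {muse} => {beatles}
  • supp.xy <- mean(incidence[, "muse"] & incidence[, "beatles"])
  • confidence <- supp.xy / support["muse"]
  • lift <- supp.xy / (support["muse"] * support["beatles"])
  • support; confidence; lift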

This project addresses two major Parts.  Part-I covers the following key Tasks to understand and examine the Dataset of “lastfm.csv.” 

  • Task-1:  Review the Variables of the Dataset.
  • Task-2:  Load and Understand the Dataset Using names(), head(), dim() Functions.
  • Task-3:  Examine the Dataset, Summary of the Descriptive Statistics, and Visualization of the Variables.

Part-II covers the following three primary key Tasks to the plot, discuss and analyze the result.

  • Task-1: Required Computations for Association Rules and Frequent Items.
  • Task-2: Association Rules.
  • Task-3: Discussion and Analysis.

Various resources were utilized to develop the required code using R. These resources include (Ahlemeyer-Stubbe & Coleman, 2014; Fischetti, Mayor, & Forte, 2017; Ledolter, 2013; r-project.org, 2018).

Part-I:  Understand and Examine the Dataset “lastfm.csv”

Task-1:  Review the Variables of the Dataset

The purpose of this task is to understand the variables of the dataset.  The dataset is the “lastfm.csv” dataset, which describes the artists and the users who listen to the music.  From the available information, such as demographic information (age, sex, and location), the support for the frequencies of listening to various individual artists can be determined, as well as the joint support for pairs or larger groupings of artists.  There are four variables.  Table 1 summarizes the selected variables for this project.

Table 1:  LastFm Dataset Variables

Task-2:  Load and Understand the Dataset Using names(), head(), dim() Functions.

            The purpose of this task is to load and understand the dataset using the names(), head(), and dim() functions.  The task also displays the first few observations.

  • ## reading the data
  • lf <- read.csv("C:/CS871/Data/lastfm.csv")
  • lf
  • dim(lf)
  • length(lf$user)
  • names(lf)
  • head(lf)
  • lf <- data.frame(lf)
  • head(lf)
  • str(lf)
  • lf[1:20,]
  • lfsmallset <- lf[1:1000,]
  • lfsmallset
  • plot(lfsmallset, col="blue", main="Small Set of Online Radio")

Figure 1.  First Sixteen Observations for User (1) – Woman from Germany.

Figure 2. The plot of Small Set of Last FM Variables.

 Task-3:  Examine the Dataset, Summary of the Descriptive Statistics and Visualization of the Variables.

            The purpose of this task is to examine the dataset.  This task also converts the user variable to a factor and lists the levels of the user and artist variables.  It also displays the summary of the variables and a visualization of each variable.

  • ### Factor user and levels user and artist
  • lf$user <- factor(lf$user)
  • levels(lf$user)    ## 15,000 users
  • levels(lf$artist)  ## 1,004 artists        
  • ## Summary of the Variables
  • summary(lf)
  • summary(lf$user)
  • summary(lf$artist)
  • ## Plot for Visualization of the variables.
  • plot(lf$user, col="blue")
  • plot(lf$artist, col="blue")
  • plot(lf$sex, col="orange")
  • plot(lf$country, col="orange")

Figure 3.  Plots of LastFM Variables.

Part-II:  Association Rules Data Mining, Discussion and Analysis

 Task-1:  Required Computations for Association Rules and Frequent Items

The purpose of this task is to implement the computations which are required for the association rules.  The required arules package is installed first.  This task visualizes the frequency of items in Figure 4.

  • ## Install arules library for association rules
  • install.packages("arules")
  • library(arules)
  • ### computational environment for mining association rules and frequent item sets
  • playlist <- split(x=lf[,"artist"], f=lf$user)
  • playlist[1:2]
  • ## Remove Artist Duplicates.
  • playlist <- lapply(playlist,unique)
  • playlist <- as(playlist, "transactions")
  • ## view this as a list of “transaction”
  • ## transactions is a data class defined in arules
  • itemFrequency(playlist)
  • ## lists the support of the 1,004 bands
  • ## number of times band is listed to on the playlist of 15,000 users
  • ## computes relative frequency of artist mentioned by the 15,000 users
  • ## plots the item frequencies.
  • itemFrequencyPlot(playlist, support=0.08, cex.names=1.5, col="blue", main="Item Frequency")

Figure 4.  Plot of Item Frequency.

Task-2:  Association Rules Data Mining

The purpose of this task is to implement data mining for the music list (lastfm.csv) using the Association Rules technique.  The code first builds the association rules, keeping only associations with support > 0.01 and confidence > 0.50, which rules out rare bands, and then orders the result by confidence for a better understanding of the association rules result.

  • ## Build the Association Rules
  • ## Only associations with support > 0.01 and confidence > 0.50
  • ## Rule out rare bands
  • music.association.rules <- apriori(playlist, parameter=list(support=0.01, confidence=0.50))
  • inspect(music.association.rules)
  • ## Filter by lift > 5
  • ## Show only those with lift > 5, among those association with support > 0.01 and confidence > 0.50.
  • inspect(subset(music.association.rules, subset=lift > 5))
  • ## Order by confidence for better understanding of the association rules result.
  • inspect(sort(subset(music.association.rules, subset=lift>5), by="confidence"))

Figure 5.  Example of Listening to both “Muse” and “Beatles” with a Confidence of 0.507 for Radiohead.

Figure 6.  Narrow the List by increasing the Lift to > 5 and Decreasing Confidence.

 Task-3: Discussion and Analysis

Association rules are used to explore the relationship between items and sets of items (Fischetti et al., 2017; Giudici, 2005).  Each transaction is composed of one or more items.  The interest is in transactions of at least two items because there cannot be a relationship between several items in the purchase of a single item (Fischetti et al., 2017).  An association rule is an explicit statement of a relationship in the data, in the form X => Y, where X (the antecedent) can be composed of one or several items and is called an itemset, and Y (the consequent) is always a single item.  In this project, the interest is in the antecedents of music since the interest is in promoting the purchase of music.  The frequent “itemsets” are the items or collections of items which frequently occur in transactions.  The “itemsets” are considered frequent if they occur more frequently than a specified threshold (Fischetti et al., 2017).  The threshold is called the minimal support (Fischetti et al., 2017).  The omission of “itemsets” with support less than the minimum support is called support pruning (Fischetti et al., 2017).  The support of an itemset is the proportion of all cases in which the itemset of interest is present, which allows estimation of how interesting an itemset or a rule is; when support is low, the interest is limited (Fischetti et al., 2017).  The confidence is the proportion of cases of X where X => Y holds, which can be computed as the number of cases featuring both X and Y divided by the number of cases featuring X (Fischetti et al., 2017).  Lift is a measure of the improvement of the rule support over what can be expected by chance, computed as support(X => Y) / (support(X) * support(Y)) (Fischetti et al., 2017).  If the lift value is not higher than 1, the rule does not explain the relationship between the items better than could be expected by chance.  The goal of “apriori” is to compute the frequent “itemsets” and the association rules efficiently, along with their support and confidence.
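
As a brief illustration of how these measures surface in the code used in Task-2 (assuming that code has been run so the music.association.rules object exists), the arules quality() accessor returns the measures attached to every rule:

  • ## Quality measures attached to the rules built in Task-2
  • library(arules)
  • head(quality(music.association.rules))  ## columns include support, confidence, and lift
  • ## lift(X => Y) = support(X => Y) / (support(X) * support(Y))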

In this project, the large lastfm dataset (289,955 observations and four variables) is used.  The descriptive analysis shows that the number of observations from male users (N = 211,823) exceeds the number from female users (N = 78,132), as illustrated in Figure 3.  The top artist has a count of 2,704, followed by the “Beatles” with 2,668 and “Coldplay” with 2,378.  The top country has a count of 59,558, followed by the United Kingdom with 27,638 and Germany with 24,251, as illustrated in Task-3 of Part-I.

As illustrated in Figure 1, the first sixteen observations belong to user (1), a woman from Germany, and form the first sixteen rows of the data matrix.  The R package arules was used for mining the association rules and identifying frequent “itemsets.”  The data is transformed into an incidence matrix where each listener represents a row, with 0s and 1s across the columns indicating whether or not the user has played a particular artist.  The incidence matrix is stored in the R object “playlist.”  The support for each of the 1,004 artists is calculated, and the support is displayed for all artists with support larger than 8%, indicating that the artists shown on the graph (Figure 4) are played by more than 8% of the users.

The construction of the association rules is implemented using the “apriori” function in the R package arules.  The search was implemented for artists or groups of artists who have support larger than 1% and who give confidence larger than 50% to another artist.  These requirements rule out rare artists.  The antecedents (LHS) that involve more than one artist are also calculated and listed.  For instance, listening to both “Muse” and “Beatles” has support larger than 1%, and the confidence for “Radiohead,” given that someone listens to both “Muse” and “Beatles,” is 0.507 with a lift of 2.82, as illustrated in Figure 5.  This result meets both requirements; antecedents involving three artists do not appear in the list because they do not meet both requirements.  The list is further narrowed down by requiring that the lift be larger than 5, and the resulting list is ordered by decreasing confidence, as illustrated in Figure 6.  The result shows that listening to both “Led Zeppelin” and “The Doors” has a support of 1%, a confidence of 0.597 (60%), and a lift of 5.69, and is quite predictive of listening to “Pink Floyd,” as shown in Figure 6.  Another example of the association rule result is that listening to “Judas Priest” lifts the chance of listening to “Iron Maiden” by a factor of 8.56, as illustrated in Figure 6.  Thus, if the user listens to “Judas Priest,” the recommendation for that user is to also listen to “Iron Maiden.”  The same reading applies to all six rules listed in Figure 6.

References

Ahlemeyer-Stubbe, A., & Coleman, S. (2014). A practical guide to data mining for business and industry: John Wiley & Sons.

Fischetti, T., Mayor, E., & Forte, R. M. (2017). R: Predictive Analysis: Packt Publishing.

Giudici, P. (2005). Applied data mining: statistical methods for business and industry: John Wiley & Sons.

Ledolter, J. (2013). Data mining and business analytics with R: John Wiley & Sons.

r-project.org. (2018). R: A Language and Environment for Statistical Computing. Retrieved from https://cran.r-project.org/doc/manuals/r-release/fullrefman.pdf.

Quantitative Analysis of “Prostate Cancer” Dataset Using R-Programming

Dr. O. Aly
Computer Science

Introduction

The purpose of this discussion is to use the prostate cancer dataset available in R, in which biopsy results are given for 97 men.  The goal is to predict tumor spread, measured as the log of the cancer volume, in this dataset of 97 men who had undergone a biopsy.  The measures used for prediction are age, BPH, PSA, CP, and the Gleason score.  The predicted tumor size affects the treatment options for the patients, which can include chemotherapy, radiation treatment, and surgical removal of the prostate.

The dataset “prostate.cancer.csv” is downloaded from the CTU course learning materials.  The dataset has 97 observations (patients) on six variables.  The response variable is the log of the cancer volume (lcavol).  The assignment is to predict this variable (lcavol) from five covariates (age, the logarithms of bph, cp, and psa, and the Gleason score) using a decision tree.  The response variable is a continuous measurement variable, so the sum of squared residuals is used as the impurity (fitting) criterion in this analysis.

This assignment discusses and addresses fourteen Tasks as shown below:

Various resources were utilized to develop the required code using R. These resources include (Ahlemeyer-Stubbe & Coleman, 2014; Fischetti, Mayor, & Forte, 2017; Ledolter, 2013; r-project.org, 2018).

Task-1:  Understand the Variables of the Data Sets

The purpose of this task is to understand the variables of the dataset. The dataset has 97 observations or patients with six variables. The response variable for prediction is (lcavol), and the five covariates (age, logarithms of bph, cp, and PSA, and Gleason score) will be used for this prediction using the decision tree.  The response variable is a continuous measurement variable.  Table 1 summarizes these variables including the response variable of (lcavol).

Table 1:  Prostate Cancer Variables.

Task-2:  Load and Review the Dataset using names(), head(), dim() functions

  • pc <- read.csv("C:/CS871/prostate.cancer.csv")
  • pc
  • dim(pc)
  • names(pc)
  • head(pc)
  • pc <- data.frame(pc)
  • head(pc)
  • str(pc)
  • pc <-data.frame(pc)
  • summary(pc)
  • plot(pc, col="blue", main="Plot of Prostate Cancer")

Figure 1. Plot of Prostate Variables.

Task-3:  Distribution of Prostate Cancer Variables.

  • #### Distribution of Prostate Cancer Variables
  • ### These are the variables names
  • colnames(pc)
  • ##Setup grid, margins.
  • par(mfrow=c(3,3), mar=c(4,4,2,0.5))
  • for (j in 1:ncol(pc))
  • {
  • hist(pc[,j], xlab=colnames(pc)[j],
  • main=paste("Histogram of", colnames(pc)[j]),
  • col="blue", breaks=20)
  • }
  • hist(pc$lcavol, col="orange")
  • hist(pc$age, col="orange")
  • hist(pc$lbph, col="orange")
  • hist(pc$lcp, col="orange")
  • hist(pc$gleason, col="orange")
  • hist(pc$lpsa, col="orange")

Figure 2.  Distribution of Prostate Cancer Variables.

Task-4:  Correlation Among Prostate Variables

  • ##Correlations between prostate cancer variables
  • pc.cor = cor(pc)
  • pc.cor <- round(pc.cor, 3)
  • summary(pc)
  • pc.cor[lower.tri(pc.cor,diag=TRUE)] = 0
  • pc.cor.sorted = sort(abs(pc.cor),decreasing=T)
  • pc.cor.sorted[1]
  • ##[1] 0.871
  • # Use arrayInd()
  • vars.big.cor = arrayInd(which(abs(pc.cor)==pc.cor.sorted[1]), dim(pc.cor))
  • colnames(pc)[vars.big.cor]
  • ##[1] "lcp"     "gleason"
  • pc.cor.sorted[2]
  • ## [1] 0.812
  • vars.big.cor = arrayInd(which(abs(pc.cor)==pc.cor.sorted[2]), dim(pc.cor))
  • colnames(pc)[vars.big.cor]
  • ## [1] "lbph" "lcp"

Task-5:  Visualization of the Relationship and Correlation between the Cancer Spread (lcavol) and other Variables.

  • ## Visualizing relationships among variables
  • plot(lcavol~age, data=pc, col="red", main="Relationship of Age on the Cancer Volume (lcavol)")
  • plot(lcavol~lbph, data=pc, col="red", main="Relationship of Amount of Benign Prostatic Hyperplasia (lbph) on the Cancer Volume (lcavol)")
  • plot(lcavol~lcp, data=pc, col="red", main="Relationship of Capsular Penetration (lcp) on the Cancer Volume (lcavol)")
  • plot(lcavol~gleason, data=pc, col="red", main="Relationship of Gleason System (gleason) on the Cancer Volume (lcavol)")
  • plot(lcavol~lpsa, data=pc, col="red", main="Relationship of Prostate Specific Antigen (lpsa) on the Cancer Volume (lcavol)")

Figure 3.  Plot of Correlation Among Prostate Variables Using plot() Function.

  • ## Correlation among the variables using corrplot() function
  • install.packages("ElemStatLearn")
  • library(ElemStatLearn)           ## it contains the data
  • install.packages("car")
  • library(car)                     ## package to calculate the variance inflation factor
  • install.packages("corrplot")
  • library(corrplot)                ## correlation plots
  • install.packages("leaps")
  • library(leaps)                   ## best subsets regression
  • install.packages("glmnet")
  • library(glmnet)                  ## allows ridge regression, LASSO, and elastic net
  • install.packages("caret")
  • library(caret)                ##parameter tuning
  • pc.cor=cor(pc)
  • corrplot.mixed(pc.cor)

Figure 4.  Plot of Correlation Among Prostate Variables Using the corrplot() Function.

Task-6:  Build a Decision Tree for Prediction

  • ## Building a Decision Tree for Prediction
  • install.packages("tree")
  • library(tree)
  • ## Construct the tree
  • pctree <- tree(lcavol ~ ., data=pc, mindev=0.1, mincut=1)
  • pctree <- tree(lcavol ~ ., data=pc, mincut=1)
  • pctree
  • plot(pctree, col=4)
  • text(pctree, digits=2)
  • pccut <- prune.tree(pctree, k=1.7)
  • plot(pccut, col="red", main="Pruning Using k=1.7")
  • pccut
  • text(pccut, digits=2)
  • pccut <- prune.tree(pccut, k=2.05)
  • plot(pccut, col="darkgreen", main="Pruning Using k=2.05")
  • pccut
  • text(pccut, digits=2)
  • pccut <- prune.tree(pctree, k=3)
  • plot(pccut, col="orange", main="Pruning Using k=3")
  • pccut
  • text(pccut, digits=2)
  • pccut <- prune.tree(pctree)
  • pccut
  • plot(pccut, col="blue", main="Decision Tree Pruning")
  • pccut <- prune.tree(pctree, best=3)
  • pccut
  • plot(pccut, col="orange", main="Pruning Using best=3")
  • text(pccut, digits=2)

Figure 5:  Initial Tree Development.

Figure 6:  First Pruning with α=1.7.

Figure 7:  Second Pruning with α=2.05.

Figure 8:  Third Pruning with α=3.

Figure 9:  Plot of the Decision Tree Pruning.

Figure 10.  Plot of the Final Tree.

Task-7:  Cross-Validation

  • ## Use cross-validation to prune the tree
  • set.seed(2)
  • cvpc <- cv.tree(pctree, k=10)
  • cvpc$size
  • cvpc$dev
  • plot(cvpc, pch=21, bg=8, type="p", cex=1.5, ylim=c(65,100), col="blue")
  • pccut <- prune.tree(pctree, best=3)
  • pccut
  • plot(pccut, col="red")
  • text(pccut)
  • ## Final plot
  • plot(pc[,c("lcp","lpsa")], col="red", cex=0.2*exp(pc$lcavol))
  • abline(v=.261624, col=4, lwd=2)
  • lines(x=c(-2,.261624), y=c(2.30257,2.30257), col=4, lwd=2)

Figure 13.  Plot of the Cross-Validation Deviance.

Figure 14.  Plot of the Final Classification Tree.

Figure 15.  Plots of Cross-Validation.

Task-8:  Discussion and Analysis

The classification and regression tree (CART) represents a nonparametric technique which generalizes parametric regression models (Ledolter, 2013).  It allows for non-linearity and variable interactions without the need to specify the structure in advance. Furthermore, the violation of constant variance, which is a critical assumption in the regression model, is not critical in this technique (Ledolter, 2013).

The descriptive statistics show that lcavol has a mean of 1.35, which is less than the median of 1.45, indicating a negatively skewed distribution, with a minimum of -1.35 and a maximum of 2.8. The age of the prostate cancer patients has an average of 64 years, with a minimum of 41 and a maximum of 79 years.  The lbph has an average of 0.1004, which is less than the median of 0.300, indicating the same negatively skewed distribution, with a minimum of -1.39 and a maximum of 2.33.  The lcp has an average of -0.18, which is higher than the median of -0.79, indicating a positively skewed distribution, with a minimum of -1.39 and a maximum of 2.9.  The Gleason measure has a mean of 6.8, which is slightly less than the median of 7, indicating a slightly negatively skewed distribution, with a minimum of 6 and a maximum of 9.  The last variable, lpsa, has an average of 2.48, which is slightly less than the median of 2.59, indicating a slightly negatively skewed distribution, with a minimum of -0.43 and a maximum of 5.58. The results show a positive correlation between lpsa and lcavol, and between lcp and lcavol as well.  The results also show that lcavol increases for ages between 60 and 70.

Furthermore, the results show that the Gleason score takes integer values of 6 and larger. The result for lpsa shows that the log PSA score is close to normally distributed.  The result in Task-4 of the correlation among prostate variables is not surprising, as it shows that patients with a high Gleason score now likely had a history of high Gleason scores.  The results also show that lcavol should be included as a predictor in any prediction of lpsa.

As illustrated in Figure 4, the result shows that PSA is highly correlated with the log of cancer volume (lcavol); it appeared to have a highly linear relationship.  The result also shows that multicollinearity may become an issue; for example, cancer volume is also correlated with capsular penetration, and this is correlated with the seminal vesicle invasion.

For the implementation of the tree, the initial tree has 12 leaf nodes, and the size of the tree is thus 12, as illustrated in Figure 5.  The root shows the 97 cases with a deviance of 133.4. Node 1 is the root; Node 2 has lcp < 0.26 with 63 patients and a deviance of 64.11.  Node 3 has lcp > 0.26 with 34 cases and a deviance of 13.39.  Node 4 has lpsa < 2.30 with 35 cases and a deviance of 24.72. Node 5 has lpsa > 2.30 with 28 cases and a deviance of 18.6. Node 6 has lcp < 2.14 with 25 cases and a deviance of 6.662.  Node 7 has lcp > 2.139 with 9 cases and a deviance of 1.48. Node 8 has lpsa < 0.11 with 4 cases and a deviance of 0.3311, while Node 9 has lpsa > 0.11 with 31 cases and a deviance of 18.92, split by age < 52 with a deviance of 0.12 and age > 52 with a deviance of 13.88. Node 10 has lpsa < 3.25 with 23 cases and a deviance of 11.61, while Node 11 has lpsa > 3.25 with 5 cases and a deviance of 1.76.  Node 12 is for age < 62 with 7 cases and a deviance of 0.73.

The first pruning process using α=1.7 did not result in any change from the initial tree; it still had 12 nodes.  The second pruning with α=2.05 simplified the tree to eight nodes.  The root shows the same result of 97 cases with a deviance of 133.4.  Node 1 has lcp < 0.26 with a deviance of 64.11, while Node 2 has lcp > 0.26 with a deviance of 13.39.  The third pruning using α=3 further simplified the tree, as shown in Figure 8.  The final tree has the root with four nodes: Node 1 for lcp < 0.26 and Node 2 for lcp > 0.26; Node 3 has lpsa < 2.30, while Node 4 reflects lpsa > 2.30.  With regard to prediction, a patient with lcp = 0.20 (falling in the lcp < 0.26 branch) and lpsa = 2.40 (falling in the lpsa > 2.30 leaf) can be predicted to have a log cancer volume (lcavol) of 1.20.
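
As a minimal sketch of how such a prediction could be produced in R with the pruned tree from Task-6 (the covariate values other than lcp and lpsa are hypothetical; only lcp and lpsa drive the pruned tree, but predict() expects all columns used during fitting):

  • ## Hypothetical new patient; only lcp and lpsa matter for the 3-leaf tree
  • new.patient <- data.frame(age=65, lbph=0.1, lcp=0.20, gleason=7, lpsa=2.40)
  • predict(pccut, newdata=new.patient)   ## predicted lcavol from the pruned tree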

The biggest challenge for the CART model, which is described as flexible in comparison to regression models, is overfitting (Giudici, 2005; Ledolter, 2013).  If the splitting algorithm is not stopped, the tree algorithm can ultimately extract all information from the data, including information that is not and cannot be predicted in the population with the current set of predictors, causing random or noise variation (Ledolter, 2013).   However, when subsequent splits add only minimal improvement to the prediction, stopping the generation of new split nodes can be used as a defense against overfitting.  Thus, if 90% of all cases can be predicted correctly from 10 splits, and 90.1% of all cases from 11 splits, then there is no need to add the 11th split to the tree, as it adds only 0.1% in value. There are various techniques to stop the splitting process.  The basic constraints (mincut, mindev) lead to a full tree fit with a certain number of terminal nodes.  In this prostate analysis, mincut=1 is used, which is the minimum number of observations to include in a child node, and a tree of size 12 is obtained.

Once the tree-building is stopped, as illustrated in Figure 10, cross-validation is used to evaluate the quality of the prediction of the current tree.  Cross-validation subjects the tree computed from one set of observations (the training sample) to another independent set of observations (the test sample). If most or all of the splits determined by the analysis of the training sample are based on random noise, then the prediction for the test sample will be poor.  The cross-validation cost, or CV cost, is the averaged error rate for a particular tree size.  The tree size which produces the minimum CV cost is found, and the reference tree is then pruned back to the number of nodes matching that size.  Pruning was implemented in a stepwise bottom-up manner, by removing the least important nodes during each pruning cycle. The v-fold CV is implemented with the R command cv.tree(). The graph of the CV deviance in Figure 13 indicates that, for the prostate example, a tree of size 3 is appropriate.  Thus, the reference tree which was obtained from all the data is pruned back to size 3. CV chooses capsular penetration and PSA as the decision variables.  The effect of capsular penetration on the response of log volume (lcavol) depends on PSA. The final graph in Figure 15 shows that CART divides the space of the explanatory variables into rectangles, with each rectangle leading to a different prediction. The size of the circles of the data points in the respective rectangles reflects the magnitude of the response.  Figure 15 confirms that the tree splits are quite reasonable.

References

Ahlemeyer-Stubbe, A., & Coleman, S. (2014). A practical guide to data mining for business and industry: John Wiley & Sons.

Fischetti, T., Mayor, E., & Forte, R. M. (2017). R: Predictive Analysis: Packt Publishing.

Giudici, P. (2005). Applied data mining: statistical methods for business and industry: John Wiley & Sons.

Ledolter, J. (2013). Data mining and business analytics with R: John Wiley & Sons.

r-project.org. (2018). R: A Language and Environment for Statistical Computing. Retrieved from https://cran.r-project.org/doc/manuals/r-release/fullrefman.pdf.

Analysis of Ensembles

Dr. O. Aly
Computer Science

Introduction

The purpose of this discussion is to discuss and analyze creating ensembles from different methods such as logistic regression, nearest neighbor methods, classification trees, Bayesian, or discriminant analysis. This discussion also addresses the use of the Random Forest to do the analysis.

Ensembles

There are two useful techniques which combine methods for improving predictive power: ensembles and uplift modeling.  Ensembles are the focus of this discussion; thus, uplift modeling is not addressed here.  An ensemble combines multiple “supervised” models into a “super-model” (Shmueli, Bruce, Patel, Yahav, & Lichtendahl Jr, 2017).  An ensemble is based on the dominant notion of combining models (EMC, 2015; Shmueli et al., 2017). Thus, several models can be combined to achieve improved predictive accuracy (Shmueli et al., 2017). 

Ensembles played a significant role in the million-dollar Netflix Prize contest which started in 2006 to improve their movie recommendation system (Shmueli et al., 2017).  The principle of combining methods is known for reducing risk because the variation is smaller than each of the individual components (Shmueli et al., 2017).  The risk is equivalent to a variation in prediction error in predictive modeling.  The more the prediction errors vary, the more volatile the predictive model (Shmueli et al., 2017).  Using an average of two predictions can potentially result in smaller error variance, and therefore, better predictive power (Shmueli et al., 2017).  Thus, results can be combined from multiple prediction methods or classifiers (Shmueli et al., 2017).  The combination can be implemented for predictions, classifications, and propensities as discussed below. 

Ensembles Combining Prediction Using Average Method

When combining prediction, the predictions can be combined with different methods by taking an average.  One alternative to a simple average is taking the median prediction, which would be less affected by extreme predictions (Shmueli et al., 2017). Computing a weighted average is another possibility where the weights are proportional to a quantity of interest such as quality or accuracy (Shmueli et al., 2017).  Ensembles for prediction are useful not only in cross-sectional prediction but also in time series forecasting (Shmueli et al., 2017).
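
As a minimal sketch (assuming pred1 and pred2 are numeric vectors of predictions from two already-fitted models on the same records), the three combination schemes described above can be computed in R as:

  • ens.avg    <- (pred1 + pred2) / 2                    ## simple average
  • ens.median <- apply(cbind(pred1, pred2), 1, median)  ## median prediction per record
  • ens.wavg   <- 0.7*pred1 + 0.3*pred2                  ## weighted average (illustrative weights)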

Ensembles Combining Classification Using Voting Method

When combining classifications, the results from multiple classifiers can be combined using “voting”: for each record, multiple classifications are available, and a simple rule would be to choose the most popular class among them (Shmueli et al., 2017). For instance, a classification tree, a Naïve Bayes classifier, and discriminant analysis can be used for classifying a binary outcome (Shmueli et al., 2017).  For each record, three predicted classes are generated, and simple voting would choose the most common class of the three (Shmueli et al., 2017). Similar to prediction, heavier weights can be assigned to scores from some models, based on considerations such as model accuracy or data quality, which can be implemented by setting a “majority rule” different from 50% (Shmueli et al., 2017).   Concerning the nearest neighbor method (K-NN), ensemble learning such as bagging can be performed with K-NN (Dubitzky, 2008). The individual decisions are combined to classify new examples, and the combination of individual results is performed by weighted or unweighted voting (Dubitzky, 2008).
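
A minimal sketch of unweighted majority voting in R (assuming class1, class2, and class3 are vectors of predicted classes from three already-fitted classifiers, one element per record):

  • votes <- data.frame(class1, class2, class3, stringsAsFactors=FALSE)
  • ## For each record, tabulate the three votes and keep the most frequent class
  • majority.class <- apply(votes, 1, function(v) names(which.max(table(v))))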

Ensembles Combining Propensities Using Average Method

Similar to prediction, propensities can be combined by taking a simple or weighted average.  Some algorithms such as Naïve Bayes produce biased propensities and should not, therefore, be averaged with propensities from other methods (Shmueli et al., 2017).

Other Forms of Ensembles

Various methods are commonly used for classification, including bagging, boosting, random forests, and support vector machines (SVM).  Bagging, boosting, and random forests are all examples of ensemble methods, which use multiple models to obtain better predictive performance than can be obtained from any of the constituent models (EMC, 2015; Ledolter, 2013; Shmueli et al., 2017).

  • Bagging: It is short for “bootstrap aggregating” (Ledolter, 2013; Shmueli et al., 2017). It was proposed by Leo Breiman in 1994 as a model aggregation technique to reduce model variance (Swamynathan, 2017).  It is another form of ensemble based on averaging across multiple random data samples (Shmueli et al., 2017).  There are two steps to implement bagging, and Figure 1 illustrates the bagging process flow:
    • Generate multiple random samples by sampling “with replacement” from the original data.  This method is called “bootstrap sampling.”
    • Run an algorithm on each sample and produce scores (Shmueli et al., 2017).

Figure 1.  Bagging Process Flow (Swamynathan, 2017).

Bagging improves the performance stability of a model and helps avoid overfitting by separately modeling different data samples and then combining the results.  Thus, it is especially useful for algorithms such as trees and neural networks.  Figure 2 illustrates an example in which the bootstrap sample has the same size as the original sample, with roughly three-quarters of the original values appearing and sampling with replacement causing some values to repeat; a minimal R sketch of the bagging procedure follows Figure 2.

Figure 2:  Bagging Example (Swamynathan, 2017).
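
As a minimal sketch of the two bagging steps (assuming the rpart package is installed; the built-in mtcars data and the number of bootstrap samples B are illustrative choices):

  • library(rpart)
  • set.seed(1)
  • B <- 25
  • preds <- sapply(1:B, function(b) {
  •   boot.idx <- sample(nrow(mtcars), replace=TRUE)   ## step 1: draw a bootstrap sample
  •   fit <- rpart(mpg ~ ., data=mtcars[boot.idx, ])   ## step 2: fit a tree on that sample
  •   predict(fit, newdata=mtcars)                     ## score the original data
  • })
  • bagged.pred <- rowMeans(preds)                     ## combine by averaging the scores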

  • Boosting: It is a slightly different method of creating ensembles.  It was introduced by Freund and Schapire in 1995 with the well-known AdaBoost (adaptive boosting) algorithm (Swamynathan, 2017).  The underlying concept of boosting is that, rather than relying on independent individual hypotheses, combining hypotheses in a sequential order increases the accuracy (Swamynathan, 2017).  Boosting algorithms convert “weak learners” into “strong learners” (Swamynathan, 2017).  Boosting algorithms are well designed to address bias problems (Swamynathan, 2017), and boosting tends to increase accuracy (Ledolter, 2013). The AdaBoost process involves three steps, as illustrated in Figure 3:

  1. Assign a uniform weight to all data points, W0(x) = 1/N, where N is the total number of training data points.
  2. At each iteration, fit a classifier ym(xn) to the training data and update the weights to minimize the weighted error function.
  3. The final model is given by the following equation:
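
In the standard AdaBoost formulation, the final model combines the M weak learners, each weighted according to its accuracy:

$$Y_M(x) = \operatorname{sign}\!\left(\sum_{m=1}^{M} \alpha_m\, y_m(x)\right), \qquad \alpha_m = \tfrac{1}{2}\ln\frac{1-\epsilon_m}{\epsilon_m},$$

where ε_m is the weighted error rate of the m-th weak learner y_m(x).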

Figure 3.  “AdaBoosting” Process (Swamynathan, 2017).

As an example illustration of AdaBoost, consider a sample dataset with 10 data points, with the assumption that all data points initially have equal weights given by 1/10, as illustrated in Figure 4.

Figure 4.  An Example Illustration of AdaBoost: Final Model After Three Iterations (Swamynathan, 2017).

  • Random Forest: It is another class of ensemble method using decision tree classifiers.  It is a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. A particular case of random forest uses bagging on decision trees, where samples are randomly chosen with replacement from the original training set (EMC, 2015).
  • SVM: It is another common classification method which combines linear models with instance-based learning techniques. SVM selects a small number of critical boundary instances, called support vectors, from each class and builds a linear decision function which separates them as widely as possible.  SVM efficiently performs linear classification by default and can also be configured to perform non-linear classification (EMC, 2015).

Advantages and Limitations of Ensembles

Combining scores from multiple models is aimed at generating more precise predictions by lowering the prediction error variance (Shmueli et al., 2017).  The ensemble method is most useful when the combined models generate prediction errors which are negatively associated or correlated, but it can also be useful when the correlation is low (Ledolter, 2013; Shmueli et al., 2017).  Ensembles can use simple averaging, weighted averaging, voting, and the median (Ledolter, 2013; Shmueli et al., 2017).  Models can be based on the same algorithm or different algorithms, using the same sample or different samples (Ledolter, 2013; Shmueli et al., 2017).  Ensembles have become an important strategy for participants in data mining contests, where the goal is to optimize some predictive measure (Ledolter, 2013; Shmueli et al., 2017).  Ensembles which are based on different data samples help avoid overfitting. However, overfitting can also happen with an ensemble, for instance in the choice of the best weights when using a weighted average (Shmueli et al., 2017).   

The primary limitation of ensembles is the resources they require, both computationally and in terms of skills and time investment (Shmueli et al., 2017).  Ensembles which combine results from different algorithms require the development and evaluation of each model.  Boosting-type and bagging-type ensembles do not require much effort; however, they do have a computational cost.  Furthermore, ensembles which rely on multiple data sources require the collection and maintenance of those multiple data sources (Shmueli et al., 2017).  Ensembles are regarded as “black box” methods, where the relationship between the predictors and the outcome variable usually becomes non-transparent (Shmueli et al., 2017). 

The Use of Random Forests for Analysis

The decision tree is based on a set of True/False decision rules, and the prediction is based on the tree rules for each terminal node.  A decision tree fitted to a small set of training data encounters the overfitting problem. The random forest model, in contrast, is well suited to handle small-sample-size problems.  A random forest contains multiple decision trees; generally, the more trees, the better.  Randomness enters by selecting a random training subset from the training dataset, using the bootstrap aggregating or bagging method, which reduces overfitting by stabilizing the predictions. This method is utilized in many other machine-learning algorithms, not only in Random Forests (Hodeghatta & Nayak, 2016). Another type of randomness occurs when variables are selected randomly from the set of variables, resulting in different trees which are based on different sets of variables.  In a forest, all the trees still influence the overall prediction by the random forest (Hodeghatta & Nayak, 2016).

The programming logic for Random Forest includes seven steps, as follows (Azhad & Rao, 2011); a brief R sketch follows the list.

  1. Input the number of training cases, N.
  2. Compute the number of attributes, M.
  3. Choose the number (m) of input attributes used to form the decision at a node, where m < M.
  4. Choose a training set by sampling with replacement.
  5. For each node of the tree, use one of the (m) randomly selected variables for the decision at that node.
  6. Grow each tree without pruning.
  7. Select the classification with the maximum votes.
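
As a minimal illustration of these steps (assuming the randomForest package is installed; the built-in iris data and the ntree and mtry values are illustrative), a random forest can be grown and used for majority-vote classification as follows:

  • library(randomForest)
  • set.seed(1)
  • rf.fit <- randomForest(Species ~ ., data=iris, ntree=500, mtry=2)  ## mtry = (m) variables tried at each split
  • rf.fit                               ## OOB error estimate and confusion matrix
  • predict(rf.fit, newdata=head(iris))  ## classification by maximum votes across the unpruned trees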

Random Forests have a low bias (Hodeghatta & Nayak, 2016).  Adding more trees reduces the variance, and thus overfitting, which is one of the advantages of Random Forests and a reason for their growing popularity.  Random Forest models are relatively robust to the set of input variables and often require little pre-processing of the data.  Random Forests are described as more efficient to build than other models such as SVM (Hodeghatta & Nayak, 2016).  Table 1 summarizes the advantages and disadvantages of Random Forests in comparison with other classification algorithms such as Naïve Bayes, Decision Trees, and Nearest Neighbor. 

Table 1.  Advantages and Disadvantages of Random Forest in comparison with other Classification Algorithms. Adapted from (Hodeghatta & Nayak, 2016).

References

Azhad, S., & Rao, M. S. (2011). Ensuring data storage security in cloud computing.

Dubitzky, W. (2008). Data Mining in Grid Computing Environments: John Wiley & Sons.

EMC. (2015). Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data. (1st ed.): Wiley.

Hodeghatta, U. R., & Nayak, U. (2016). Business Analytics Using R-A Practical Approach: Springer.

Ledolter, J. (2013). Data mining and business analytics with R: John Wiley & Sons.

Shmueli, G., Bruce, P. C., Patel, N. R., Yahav, I., & Lichtendahl Jr, K. C. (2017). Data mining for business analytics: concepts, techniques, and applications in R: John Wiley & Sons.

Swamynathan, M. (2017). Mastering Machine Learning with Python in Six Steps: A Practical Implementation Guide to Predictive Data Analytics Using Python: Apress.

Decision Trees with a Comparison of Classification and Regression Decision Tree (CART)

Dr. O. Aly
Computer Science

Introduction

The purpose of this discussion is to discuss and analyze Decision Trees, with a comparison of Classification and Regression Decision Trees.  The discussion also addresses the advantages and disadvantages of the Decision Trees.  The focus of this discussion is on the Classification and Regression Tree (CART) algorithm as one of the statistical criteria. The discussion begins with a brief overview of the Classification, followed by additional related topics.  It will end with a sample Decision Tree for a decision whether or not to take an umbrella.

Classification

Classification is a fundamental data mining technique (EMC, 2015).  Most classification methods are supervised; they start with a training set of pre-labeled observations to learn how the attributes of these observations may contribute to the classification of future unlabeled observations (EMC, 2015).  For instance, marketing, sales, and customer demographic data can be used to develop a classifier to assign a “purchase” or “no purchase” label to potential future customers (EMC, 2015).  Classification is widely used for prediction purposes (EMC, 2015).  Logistic Regression is one of the popular classification methods (EMC, 2015).  Classification can be used by health care professionals to diagnose diseases such as heart disease (EMC, 2015).   There are two fundamental classification methods:  Decision Trees and Naïve Bayes.  In this discussion, the focus is on Decision Trees. 

The Tree Models vs. Linear & Logistic Regression Models

The tree models are distinguished from the Linear and Logistic Regression models.  The tree models produce a classification of observations into groups first and then obtain a score for each group, while the Linear and Logistic Regression methods produce a score and then possibly a classification based on a discriminant rule (Giudici, 2005). 

Regression Trees vs. Classification Trees

The tree models are divided into regression trees and classification trees (Giudici, 2005).  Regression trees are used when the response variable is continuous, while classification trees are used when the response variable is quantitative discrete or qualitative (categorical) (Giudici, 2005).  Tree models can be defined as a recursive process through which a set of (n) statistical units is divided into groups progressively, based on a division rule that aims to increase a homogeneity or purity measure of the response variable in each of the obtained groups (Giudici, 2005). An explanatory variable specifies a division rule at each step of the procedure, to split and establish splitting rules to partition the observations (Giudici, 2005). The final partition of the observations is the main result of a tree model (Giudici, 2005).  It is critical to specify a “stopping criterion” for the division process to achieve such a result (Giudici, 2005). 

Concerning the classification tree, fitted values are given regarding the fitted probabilities of affiliation to a single group (Giudici, 2005). A discriminant rule for the classification trees can be derived at each leaf of the tree (Giudici, 2005).  The classification of all observations belonging to a terminal node in the class corresponding to the most frequent level is a commonly used rule, called “majority rule” (Giudici, 2005).  While other “voting” schemes can also be implemented, in the absence of other consideration, this rule is the most reasonable (Giudici, 2005).  Thus, each of the leaves points out a clear allocation rule of the observation, which is read using the path that connects the initial node to each of them.  Therefore, every path in the tree model represents a classification rule (Giudici, 2005).  

In comparison to other discriminant models, tree models produce rules which are less explicit analytically but easier to understand graphically (Giudici, 2005). Tree models can be regarded as nonparametric predictive models, as they do not require assumptions about the probability distribution of the response variable (Giudici, 2005).  This flexibility means that tree models are generally applicable, whatever the nature of the dependent variable and the explanatory variables (Giudici, 2005).  However, this flexibility comes with disadvantages: a higher demand on computational resources, and a sequential nature and algorithmic complexity that can make the models dependent on the observed data, such that even a small change might alter the structure of the tree (Giudici, 2005). Thus, it is difficult to take a tree structure designed for one context and generalize it to other contexts (Giudici, 2005).

The Classification Tree Analysis vs. The Hierarchical Cluster Analysis

The classification tree analysis is distinguished from the hierarchical cluster analysis despite their graphical similarities  (Giudici, 2005).  The classification trees are predictive rather than descriptive. While the hierarchical cluster analysis performs an unsupervised classification of the observations based on all available variables, the classification trees perform a classification of the observations based on all explanatory variables and supervised by the presence of the response variable (target variable) (Giudici, 2005).  The second critical difference between the hierarchical cluster analysis and the classification tree analysis is related to the partition rule.  While in the classification trees the segmentation is typically carried out using only one explanatory variable at a time, in the hierarchical clustering the divisive or agglomerative rule between groups is established based on the considerations on the distance between them, calculated using all the available variables  (Giudici, 2005).

Decision Trees Algorithms

The goal of Decision Trees is to extract from the training data the succession of decisions about the attributes that best explains the class, that is, group membership (Fischetti, Mayor, & Forte, 2017).  Decision Trees have a root, which is the best attribute on which to split the data with respect to the outcome (Fischetti et al., 2017).  The dataset is partitioned into branches by this attribute (Fischetti et al., 2017).  The branches lead to other nodes which correspond to the next best partition for the considered branch (Fischetti et al., 2017).  The process continues until the terminal nodes are reached, where no more partitioning is required (Fischetti et al., 2017).   Decision Trees allow class predictions (group membership) of previously unseen observations (testing or prediction datasets) using statistical criteria applied to the seen data (training dataset) (Fischetti et al., 2017).  Commonly used algorithms, each with its own statistical criteria, include:

  • ID3
  • C4.5
  • Random Forest
  • Conditional Inference Trees
  • Classification and Regression Trees (CART)

The most used algorithm in the statistical community is the CART algorithm, while C4.5 and its latest version, C5.0, are widely used by computer scientists (Giudici, 2005).  The first versions of C4.5 and C5.0 were limited to categorical predictors, but the most recent versions are similar to CART (Giudici, 2005).  

Classification and Regression Trees (CART)

CART is often used as a generic acronym for decision trees, although it is a specific implementation of tree models (EMC, 2015).  CART, similar to C4.5, can handle continuous attributes (EMC, 2015).  While C4.5 uses entropy-based criteria to rank tests, CART uses the Gini diversity index defined in equation (1) below (EMC, 2015; Fischetti et al., 2017).
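
The standard form of the Gini diversity index for a node t with k classes is

$$\text{Gini}(t) = 1 - \sum_{j=1}^{k} p(j \mid t)^2, \qquad (1)$$

where p(j | t) is the proportion of observations of class j at node t; a pure node has a Gini index of 0, and the index grows as the classes become more evenly mixed.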

Moreover, while C4.5 uses stopping rules, CART constructs a sequence of subtrees, uses cross-validation to estimate the misclassification cost of each subtree, and chooses the one with the lowest cost (EMC, 2015; Hand, Mannila, & Smyth, 2001). CART represents a powerful nonparametric technique which generalizes parametric regression models (Ledolter, 2013).  It allows non-linearity and variable interactions without having to specify the structure in advance (Ledolter, 2013).  It operates by choosing the best variable for splitting the data into two groups at the root node (Hand et al., 2001).  It builds the tree using a single variable at a time and can readily deal with large numbers of variables (Hand et al., 2001).  It uses different statistical criteria to decide on tree splits (Fischetti et al., 2017).  There are some differences between CART used for classification and the other algorithms in this family.  In CART, the attribute to be partitioned is selected with the Gini index as the decision criterion (Fischetti et al., 2017).  This method is described as more efficient compared to the information gain and information ratio (Fischetti et al., 2017).  CART implements the necessary partitioning on the modalities of the attribute and merges modalities for the partition, such as modality A versus modalities B and C (Fischetti et al., 2017).  CART can also predict a numeric outcome (Fischetti et al., 2017).  In the case of regression trees, CART performs regression and builds the tree in a way which minimizes the squared residuals (Fischetti et al., 2017). 
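
As a brief illustration (assuming the rpart package, a common R implementation of CART-style trees, is installed; the built-in iris and mtcars datasets are illustrative choices), a classification tree using the Gini criterion and a regression tree minimizing squared residuals can be fitted as follows:

  • library(rpart)
  • ## Classification tree: split quality judged by the Gini index
  • class.fit <- rpart(Species ~ ., data=iris, method="class", parms=list(split="gini"))
  • ## Regression tree: splits chosen to minimize the squared residuals (ANOVA method)
  • reg.fit <- rpart(mpg ~ ., data=mtcars, method="anova")
  • printcp(class.fit)   ## complexity-parameter table used for cost-complexity pruning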

CART Algorithms of Division Criteria and Pruning

There are two critical aspects of the CART algorithm: division criteria and pruning, which can be employed to reduce the complexity of a tree (Giudici, 2005).  Concerning the division criteria, the essential element of a tree model is the choice of the division rule for the units belonging to a group, corresponding to a node of the tree (Giudici, 2005).  Selecting a decision rule means selecting a predictor from those available, and selecting the best partition of its levels (Giudici, 2005).  The selection is generally made using a goodness measure of the corresponding division rule, which allows the determination of the rule that maximizes the goodness measure at each stage of the procedure (Giudici, 2005). 

The impurity concept refers to a measure of variability of the response values of the observations (Giudici, 2005).  In a regression tree, a node will be pure if it has null variance as all observations are equal, and it will be impure if the variance of the observation is high (Giudici, 2005).  For the regression trees, the impurity corresponds to the variance, while for the classification trees alternative measures for the impurity are considered such as Misclassification impurity, Gini impurity, Entropy impurity, and Tree assessments (Giudici, 2005).  

When there is no “stopping criterion,” a tree model can grow until each node contains identical observations regarding the values or levels of the dependent variable (Giudici, 2005). This approach does not yield a parsimonious segmentation (Giudici, 2005). Thus, it is critical to stop the growth of the tree at a reasonable dimension (Giudici, 2005). The tree configuration becomes ideal when it is parsimonious and accurate (Giudici, 2005). The parsimonious attribute indicates that the tree has a small number of leaves, and therefore the predictive rule can be easily interpreted (Giudici, 2005). The accurate attribute indicates a large number of leaves which are pure to a maximum extent (Giudici, 2005). There are two opposing techniques for the final choice which tree algorithms can employ. The first technique uses stopping rules based on thresholds on the number of leaves or on the maximum number of steps in the process, whereas the other technique introduces probabilistic assumptions on the variables, allowing the use of suitable statistical tests (Giudici, 2005). In the absence of probabilistic assumptions, the growth is stopped when the decrease in impurity is too small (Giudici, 2005). The result of a tree model can be influenced by the choice of the stopping rule (Giudici, 2005). 

The CART method utilizes a strategy different from the stepwise stopping criteria; the method is based on the pruning concept.  The tree is first built to its greatest size and then gets “trimmed” or “pruned” according to a cost-complexity criterion (Giudici, 2005). The concept of pruning is to find the subtree that minimizes a loss function, used by the CART algorithm, which depends on the total impurity of the tree and the tree complexity (Giudici, 2005). The misclassification impurity is usually chosen for the pruning, although the other impurity measures can also be used.  The minimization of the loss function results in a compromise between choosing a complex model with low impurity but a high complexity cost and choosing a simple model with high impurity but a low complexity cost (Giudici, 2005).   The loss function is assessed by measuring the complexity of the model fitted on the training dataset, whose misclassification errors are measured in the validation dataset (Giudici, 2005).  This method partitions the training data into a subset for building the tree and then estimates the misclassification rate on the remaining validation subset (Hand et al., 2001).
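
A minimal sketch of this cost-complexity pruning in R (again assuming the rpart package; the mtcars example and the cp value used to grow the initial tree are illustrative), where the subtree with the lowest cross-validated error is retained:

  • library(rpart)
  • set.seed(1)
  • fit <- rpart(mpg ~ ., data=mtcars, method="anova", cp=0.001)     ## grow a large tree first
  • printcp(fit)                                                     ## cross-validated error for each subtree
  • best.cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
  • pruned <- prune(fit, cp=best.cp)                                 ## prune back to the lowest-xerror subtree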

CART has been widely used for years in marketing and other applications (Hodeghatta & Nayak, 2016).   CART is described as a flexible model because the violation of constant variance, which is very critical in regression, is permissible in CART (Ledolter, 2013).  However, the biggest challenge with CART is avoiding overfitting (Ledolter, 2013).

Advantages and Disadvantages of the Trees

Decision trees for regression and classification have advantages and disadvantages.  Trees are regarded as easier to interpret than linear regression and can be displayed graphically (Cristina, 2010; Tibshirani, James, Witten, & Hastie, 2013).  Decision trees are self-explanatory and easy to understand even for non-technical users (Cristina, 2010; Tibshirani et al., 2013). They can handle qualitative predictors without the need to create dummy variables (Tibshirani et al., 2013).  Decision trees are efficient, and complex alternatives can be expressed quickly and precisely. A decision tree can easily be modified as new information becomes available.  Standard decision tree notation is easy to adopt (Cristina, 2010).  They can be used in conjunction with other management tools.  Decision trees can handle both nominal and numerical attributes (Cristina, 2010).  They are capable of handling datasets which may have errors or missing values.  Decision trees are considered a non-parametric method, which means that they make no assumptions about the spatial distribution or the classifier structure.  Their representations are rich enough to represent any discrete-value classifier.

However, trees have limitations as well.  They do not have the same level of predictive accuracy as some of the other regression and classification models (Tibshirani et al., 2013). Most of the algorithms, like ID3 and C4.5, require that the target attribute have only discrete values. Decision trees are over-sensitive to the training set, to irrelevant attributes, and to noise. Decision trees tend to perform poorly if many complex interactions are present, and well if a few highly relevant attributes exist, since they use the “divide and conquer” method (Cristina, 2010).  Table 1 summarizes the advantages and disadvantages of trees.

Table 1.  Summary of the Advantages and Disadvantages of Trees.
Note:  Constructed by the researcher based on the literature.

Take an Umbrella Decision Tree Example:

  • If input field value < n
    • Then target = Y%
  • If input field value > n
    • Then target = X%

Figure 1.  Decision Tree for Taking an Umbrella

  • The decision depends on the weather: the predicted rain probability and whether it is sunny or cloudy.
  • The forecast gives a rain probability, which is compared against 70% and 30% thresholds (a minimal R sketch of these rules is given after this list):
    • If the rain probability is greater than 70%, take an umbrella.
    • If the rain probability is between 30% and 70% and it is cloudy, take an umbrella; otherwise, no umbrella.
    • If the rain probability is less than 30%, no umbrella.
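
As a minimal sketch (the thresholds and rule structure are taken directly from the list above; the function name is illustrative), these rules can be written as a small R function:

  • take_umbrella <- function(rain_prob, cloudy) {
  •   if (rain_prob > 0.70) {
  •     "Take an umbrella"                  ## high chance of rain
  •   } else if (rain_prob > 0.30 && cloudy) {
  •     "Take an umbrella"                  ## moderate chance of rain and cloudy
  •   } else {
  •     "No umbrella"                       ## low chance of rain, or moderate chance but sunny
  •   }
  • }
  • take_umbrella(0.45, cloudy=TRUE)        ## returns "Take an umbrella"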

References

Cristina, P. (2010). Decision Trees. Retrieved from http://www.cs.ubbcluj.ro/~gabis/DocDiplome/DT/DecisionTrees.pdf.

EMC. (2015). Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data. (1st ed.): Wiley.

Fischetti, T., Mayor, E., & Forte, R. M. (2017). R: Predictive Analysis: Packt Publishing.

Giudici, P. (2005). Applied data mining: statistical methods for business and industry: John Wiley & Sons.

Hand, D. J., Mannila, H., & Smyth, P. (2001). Principles of data mining.

Hodeghatta, U. R., & Nayak, U. (2016). Business Analytics Using R-A Practical Approach: Springer.

Ledolter, J. (2013). Data mining and business analytics with R: John Wiley & Sons.

Tibshirani, R., James, G., Witten, D., & Hastie, T. (2013). An introduction to statistical learning-with applications in R: New York, NY: Springer.