Dr. O. Aly
Computer Science
Abstract
The purpose of this project is to discuss Hadoop functionality, installation steps, and troubleshooting techniques. It comprises two parts. Part-I discusses Big Data and the emerging Hadoop technology, provides an overview of the Hadoop ecosystem, its building blocks, benefits, and limitations, and discusses the MapReduce framework together with its benefits and limitations. Part-I also provides a few success stories of using Hadoop technology for Big Data Analytics. Part-II addresses the installation and configuration of Hadoop on the Windows operating system through fifteen critical tasks. It also addresses the errors that arise during the configuration and the techniques used to overcome them so that the Hadoop installation can proceed successfully.
Keywords: Big Data Analytics; Hadoop Ecosystem; MapReduce.
Introduction
This project discusses various significant topics related to Big Data Analytics and is organized into two parts. Part-I discusses Big Data and the emerging Hadoop technology. It provides an overview of the Hadoop ecosystem, its building blocks, benefits, and limitations, and then discusses the MapReduce framework, its benefits, and its limitations. Part-I closes with a few success stories of using Hadoop technology for Big Data Analytics. Part-II addresses the installation and configuration of Hadoop on the Windows operating system through fifteen critical tasks, along with the errors encountered during the configuration and the techniques used to overcome them so that the installation can proceed successfully.
Part-I
Hadoop Technology Overview
The purpose of this part is to address relevant topics related to Hadoop. It begins with Big Data Analytics and the emerging Hadoop technology. The building blocks of the Hadoop ecosystem, including the Hadoop Distributed File System (HDFS), MapReduce, and HBase, are also addressed. The benefits and limitations of Hadoop, as well as of MapReduce, are then discussed. Part-I ends with success stories of using the Hadoop ecosystem for Big Data Analytics in various domains and industries.
Big Data Analytics and Hadoop Emerging Technology
Big Data is now the buzzword in the fields of computer science and information technology. It has attracted the attention of various sectors: researchers, academia, government, and even the media (Géczy, 2014; Kaisler, Armour, Espinosa, & Money, 2013). In its 2011 report, the International Data Corporation (IDC) estimated that the amount of information created and replicated that year would exceed 1.8 zettabytes (1.8 trillion gigabytes), having grown by a factor of nine in just five years (Gantz & Reinsel, 2011).
Big Data Analytics (BDA) analyzes and mines Big Data to produce operational and business knowledge at an unprecedented scale (Bi & Cochran, 2014). BDA is described by Bi and Cochran (2014) as an integral toolset for strategy, marketing, human resources, and research. It is the process of inspecting, cleaning, transforming, and modeling Big Data with the objective of discovering knowledge, generating solutions, and supporting decision-making (Bi & Cochran, 2014). Big Data (BD) and BDA are regarded as powerful tools from which various organizations have benefited (Bates, Saria, Ohno-Machado, Shah, & Escobar, 2014). Companies that adopted Big Data Analytics have succeeded in using Big Data to improve the efficiency of their business (Bates et al., 2014). One example of a successful application of Big Data Analytics is IBM’s “Watson,” which appeared on the TV program Jeopardy! and used some of these Big Data approaches (Bates et al., 2014). Manyika et al. (2011) provide notable examples of organizations around the globe that are well known for their extensive and effective use of data, including Wal-Mart, Harrah’s, Progressive Insurance, Capital One, Tesco, and Amazon. These companies have already taken advantage of Big Data as a “competitive weapon” (Manyika et al., 2011). Figure 1 illustrates the different types of data that make up the Big Data space.

Figure 1: Big Data (Ramesh, 2015)
“Big data is about deriving value… The goal of big data is data-driven decision making” (Ramesh, 2015). Thus, businesses should make analytics the goal when investing in storing Big Data and focus on the analytics side of Big Data to retrieve the value that can assist in decision-making (Ramesh, 2015). The value of BDA grows over time together with cumulative cash flow (B. Gupta & Jyoti, 2014). Figure 2 illustrates the value of BDA along the dimensions of time and cumulative cash flow. Thus, there is no doubt that BDA provides great benefits to organizations.

Figure 2. The Value of Big Data Analytics. Adapted from (B. Gupta & Jyoti, 2014).
Furthermore, an organization must learn how to use Big Data Analytics to drive business value that aligns with its core competencies and creates competitive advantages (Minelli, Chambers, & Dhiraj, 2013). BDA can improve operational efficiency, increase revenue, and achieve competitive differentiation. Table 1 summarizes the Big Data business models that organizations can use to put Big Data to work as business opportunities.

Table 1: Big Data Business Models (Minelli et al., 2013)
Data that organizations deal with can be in one of three states: data in use, data at rest, and data in motion. Data in use is being used by services or by users to accomplish specific tasks. Data at rest is not in use and is stored or archived. Data in motion is in transit: its state is about to change from at rest to in use, or it is being transferred from one place to another (Chang, Kuo, & Ramachandran, 2016). Figure 3 summarizes these three states.

Figure 3. The Three States of Data.
One of the significant characteristics of Big Data is velocity. The speed of data generation is described by (Abbasi, Sarker, & Chiang, 2016) as the “hallmark” of Big Data. Wal-Mart is an example of explosive data generation, collecting over 2.5 petabytes of customer transaction data every hour. Moreover, over one billion new tweets occur every three days, and five billion search queries occur daily (Abbasi et al., 2016). Velocity is the data in motion (Chopra & Madan, 2015; Emani, Cullot, & Nicolle, 2015; Katal, Wazid, & Goudar, 2013; Moorthy, Baby, & Senthamaraiselvi, 2014; Nasser & Tariq, 2015). Velocity involves streams of data, structured data, and the availability of access and delivery (Emani et al., 2015). The challenge of velocity lies not only in the speed of the incoming data, which can be handled with batch processing, but in streaming such high-speed data in real time to support knowledge-based decisions (Emani et al., 2015; Nasser & Tariq, 2015). Real-Time Data (a.k.a. Data in Motion) is streaming data that needs to be analyzed as it comes in (Jain, 2013).
(CSA, 2013) indicates that Big Data technologies fall into two categories: batch processing for analyzing data at rest, and stream processing for analyzing data in motion. An example of data-at-rest analysis is sales analysis, which is not based on real-time data processing (Jain, 2013). An example of data-in-motion analysis is association rules in e-commerce. The response time of each category differs: for stream processing, response times range from milliseconds to seconds, and the greater challenge is to push the response time well below a millisecond (Chopra & Madan, 2015; CSA, 2013). Data in motion, which reflects stream or real-time processing, does not always need to reside in memory, and interactive analysis of large-scale data sets through new technologies such as Apache Drill and Google’s Dremel provides new paradigms for data analytics. Figure 4 illustrates the response time of each processing type.

Figure 4. The Batch and Stream Processing Responsiveness (CSA, 2013).
There are two kinds of systems for data at rest: NoSQL systems for interactive data-serving environments, and systems for large-scale analytics based on the MapReduce paradigm, such as Hadoop. NoSQL systems are designed around a simpler key-value data model, have built-in sharding, and work seamlessly in a distributed cloud-based environment (R. Gupta, Gupta, & Mohania, 2012). A MapReduce-based framework such as Hadoop supports batch-oriented processing (Chandarana & Vijayalakshmi, 2014; Erl, Khattak, & Buhler, 2016; Sakr & Gaber, 2014). A data stream management system, by contrast, allows the user to analyze data in motion rather than collecting large quantities of data, storing it on disk, and then analyzing it. There are various stream processing systems, such as IBM InfoSphere Streams (R. Gupta et al., 2012; Hirzel et al., 2013), Twitter’s Storm, and Yahoo’s S4. These systems are designed for clusters of commodity hardware and geared towards real-time data processing (R. Gupta et al., 2012).
In 2004, Google introduced the MapReduce framework as a parallel processing framework for large data sets (Bakshi, 2012; Fadzil, Khalid, & Manaf, 2012; White, 2012). The MapReduce framework gained much popularity because it hides the sophisticated operations of parallel processing (Fadzil et al., 2012). Various MapReduce frameworks, such as Hadoop, were introduced because of the enthusiasm towards MapReduce (Fadzil et al., 2012).
The capability of the MapReduce framework has been recognized in different research areas such as data warehousing, data mining, and bioinformatics (Fadzil et al., 2012). The MapReduce framework consists of two main layers: the Distributed File System (DFS) layer, which stores the data, and the MapReduce layer, which processes it (Lee, Lee, Choi, Chung, & Moon, 2012; Mishra, Dehuri, & Kim, 2016; Sakr & Gaber, 2014). The DFS is a significant feature of the MapReduce framework (Fadzil et al., 2012).
The MapReduce framework uses large clusters of low-cost commodity hardware to lower cost (Bakshi, 2012; H. Hu, Wen, Chua, & Li, 2014; Inukollu, Arsi, & Ravuri, 2014; Khan et al., 2014; Krishnan, 2013; Mishra et al., 2016; Sakr & Gaber, 2014; White, 2012). It relies on “Redundant Arrays of Independent (and inexpensive) Nodes (RAIN),” whose components are loosely coupled, so that when any node goes down there is no negative impact on the MapReduce job (Sakr & Gaber, 2014; Yang, Dasdan, Hsiao, & Parker, 2007). The framework provides fault tolerance by applying replication, allowing any crashed node to be replaced without affecting the currently running job (P. Hu & Dai, 2014; Sakr & Gaber, 2014). It also automatically parallelizes execution, which makes MapReduce highly parallel and yet abstracted (P. Hu & Dai, 2014; Sakr & Gaber, 2014).
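To make the two-layer model concrete, the following listing is a minimal sketch of the canonical word-count job written against the Hadoop 2.x Java API, the standard introductory MapReduce example (White, 2012). The map phase emits a (word, 1) pair for every word read from the file system, and the reduce phase sums the counts for each word; the class and job names are illustrative.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each distinct word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in the DFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in the DFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}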
Hadoop Ecosystem Building Blocks
BD emerging technologies, such as the Hadoop ecosystem (including Pig, Hive, Mahout, and Hadoop itself), stream mining, complex-event processing, and NoSQL databases, enable the analysis of large-scale, heterogeneous datasets at unprecedented scale and speed (Cardenas, Manadhata, & Rajan, 2013). Hadoop was developed by Yahoo and Apache to run jobs over hundreds of terabytes of data (Yan, Yang, Yu, Li, & Li, 2012). Various large corporations, such as Facebook and Amazon, have used Hadoop because it offers high efficiency, high scalability, and high reliability (Yan et al., 2012). The Hadoop Distributed File System (HDFS) is one of the major components of the Hadoop framework; it stores large files (Bao, Ren, Zhang, Zhang, & Luo, 2012; CSA, 2013; De Mauro, Greco, & Grimaldi, 2015) and allows access to data scattered over multiple nodes without exposing the complexity of the environment (Bao et al., 2012; De Mauro et al., 2015). The MapReduce programming model is another significant component of the Hadoop framework (Bao et al., 2012; CSA, 2013; De Mauro et al., 2015); it is designed to implement distributed and parallel algorithms efficiently (De Mauro et al., 2015). HBase is the third component of the Hadoop framework (Bao et al., 2012). HBase is built on HDFS and is a NoSQL (Not only SQL) database (Bao et al., 2012).
Hadoop Benefits and Limitations
Various studies have addressed the benefits of Hadoop technology, which include scalability, flexibility, cost efficiency, and fault tolerance (H. Hu et al., 2014; Khan et al., 2014; Mishra et al., 2016; Polato, Ré, Goldman, & Kon, 2014; Sakr & Gaber, 2014). Hadoop allows the nodes in the cluster to scale up and down based on computation requirements, with no change in the data formats (H. Hu et al., 2014; Polato et al., 2014). It also brings massively parallel computation to commodity hardware, decreasing the cost per terabyte of storage and making such computation affordable as the data volume increases (H. Hu et al., 2014). Hadoop offers flexibility because it is not tied to a schema, which allows the use of structured, unstructured, or semi-structured data and the aggregation of data from multiple sources (H. Hu et al., 2014; Polato et al., 2014). Hadoop also allows nodes to crash without affecting data processing: it provides a fault-tolerant environment where data and computation can be recovered without any negative impact on the processing (H. Hu et al., 2014; Polato et al., 2014; White, 2012).
Hadoop has faced various limitations, such as a low-level programming paradigm and schema, strictly batch processing, time skew, and incremental computation (Alam & Ahmed, 2014). Incremental computation is regarded as one of the significant shortcomings of Hadoop technology (Alam & Ahmed, 2014). Efficient handling of incremental data comes at the expense of compatibility with the programming models offered by non-incremental systems such as MapReduce, and it requires the implementation of incremental algorithms, increasing the complexity of the algorithms and the code (Alam & Ahmed, 2014). A caching technique is proposed by (Alam & Ahmed, 2014) as a solution, operating at three levels: the job, the task, and the hardware (Alam & Ahmed, 2014).
Incoop is another solution, proposed by (Bhatotia, Wieder, Rodrigues, Acar, & Pasquin, 2011). Incoop extends the open-source Hadoop implementation of the MapReduce programming paradigm to run unmodified MapReduce programs incrementally (Bhatotia et al., 2011; Sakr & Gaber, 2014). It allows programmers to run their MapReduce programs incrementally without any modification to the code (Bhatotia et al., 2011; Sakr & Gaber, 2014). Moreover, information about previously executed MapReduce tasks is recorded by Incoop to be reused in subsequent MapReduce computations when possible (Bhatotia et al., 2011; Sakr & Gaber, 2014).
Incoop is not a perfect solution, and it has some shortcomings, which are addressed by (Sakr & Gaber, 2014; Zhang, Chen, Wang, & Yu, 2015). Several enhancements have been implemented for Incoop, including an incremental HDFS called Inc-HDFS, a Contraction phase, and a “memoization-aware scheduler” (Sakr & Gaber, 2014). Inc-HDFS applies a delta technique to the inputs of two consecutive job runs and splits the input based on its contents while maintaining compatibility with HDFS. The Contraction phase is a new phase in the MapReduce framework that breaks Reduce tasks into smaller sub-computations forming an inverted tree, so that when a small portion of the input changes, only the path from the corresponding leaf to the root needs to be recomputed (Sakr & Gaber, 2014). The memoization-aware scheduler is a modified version of the Hadoop scheduler that takes advantage of the locality of memoized results (Sakr & Gaber, 2014).
Another solution, called i2MapReduce, was proposed by (Zhang et al., 2015) and compared to Incoop by the same authors. i2MapReduce performs incremental processing at the level of key-value pairs rather than at the task level. It also supports more complex iterative computation, which is used in data mining, and reduces the I/O overhead by applying various techniques (Zhang et al., 2015). IncMR is an enhanced framework for large-scale incremental data processing (Yan et al., 2012). It inherits the simplicity of standard MapReduce, does not modify HDFS, and uses the same MapReduce APIs (Yan et al., 2012). When using IncMR, all programs can perform incremental data processing without any modification (Yan et al., 2012).
In summary, researchers have exerted various efforts to overcome the incremental computation limitation of Hadoop, such as Incoop, Inc-HDFS, i2MapReduce, and IncMR. Each proposed solution attempts to enhance and extend standard Hadoop so as to avoid overheads such as I/O and to increase efficiency, without increasing the complexity of the computation and without requiring any modification to the code.
MapReduce Benefits and Limitations
MapReduce was introduced to solve the problem of parallel processing of large data sets in a distributed environment, which previously required manual management of the hardware resources (Fadzil et al., 2012; Sakr & Gaber, 2014). The complexity of parallelization is addressed using two techniques: the Map/Reduce technique and the Distributed File System (DFS) technique (Fadzil et al., 2012; Sakr & Gaber, 2014). A parallel framework must be reliable, ensuring good resource management in a distributed environment built from off-the-shelf hardware, and scalable enough to support any future processing requirement (Fadzil et al., 2012). Earlier frameworks, such as the Message Passing Interface (MPI), had reliability and fault-tolerance issues when processing large data sets (Fadzil et al., 2012). The MapReduce framework covers the two categories of scalability: structural scalability and load scalability (Fadzil et al., 2012). It addresses structural scalability by using the DFS, which forms sizeable virtual storage for the framework by adding off-the-shelf hardware, and it addresses load scalability by increasing the number of nodes to improve performance (Fadzil et al., 2012).
However, the earlier versions of the MapReduce framework faced challenges. Among these are the join operation and the lack of support for aggregate functions to join multiple datasets in one task (Sakr & Gaber, 2014). Another limitation of the standard MapReduce framework lies in iterative processing, which is required for analysis techniques such as the PageRank algorithm, recursive relational queries, and social network analysis (Sakr & Gaber, 2014). Standard MapReduce does not share execution work across jobs to reduce the overall amount of work (Sakr & Gaber, 2014). A further limitation is the lack of support for data indexes and column storage: only sequential scanning of the input data is supported, and this lack of indexing hurts query performance (Sakr & Gaber, 2014).
Moreover, many have argued that MapReduce is not the optimal solution for structured data. It is known as a shared-nothing architecture, which supports scalability (Bakshi, 2012; Jinquan, Jie, Shengsheng, Yan, & Yuanhao, 2012; Sakr & Gaber, 2014; White, 2012) and the processing of large unstructured data sets (Bakshi, 2012), but it has performance and efficiency limitations (Lee et al., 2012).
The standard MapReduce framework also faced the challenge of iterative computation, which is required in operations such as data mining, PageRank, network traffic analysis, graph analysis, and social network analysis (Bu, Howe, Balazinska, & Ernst, 2010; Sakr & Gaber, 2014). These analysis techniques require the data to be processed iteratively until the computation satisfies a convergence or stopping condition (Bu et al., 2010; Sakr & Gaber, 2014). Because of this limitation, the iterative process must be implemented and executed manually, using a driver program, when using the standard MapReduce framework (Bu et al., 2010; Sakr & Gaber, 2014). However, the manual implementation and execution of such iterative computation has two significant problems (Bu et al., 2010; Sakr & Gaber, 2014). The first is that unchanged data are reloaded from iteration to iteration, wasting input/output (I/O), network bandwidth, and CPU resources. The second is the overhead of the termination condition, namely detecting that the output has not changed for two consecutive iterations and a fixed point has been reached (Bu et al., 2010; Sakr & Gaber, 2014). This termination check may require an extra MapReduce job on each iteration, causing overhead in scheduling extra tasks, reading extra data from disk, and moving data across the network (Bu et al., 2010; Sakr & Gaber, 2014).
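To illustrate the manual pattern described above, the following is a hedged sketch, not drawn from the cited papers, of a driver program that chains MapReduce jobs until convergence. The counter group and name (“convergence”/“CHANGED”) and the iteration cap are hypothetical choices for this sketch; note how each iteration rereads its full input from HDFS and how termination requires inspecting each job’s output.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IterativeDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path input = new Path(args[0]);
    boolean converged = false;
    int iteration = 0;

    while (!converged && iteration < 30) { // hypothetical safety cap
      Path output = new Path(args[1] + "/iter-" + iteration);
      Job job = Job.getInstance(conf, "iterative-step-" + iteration);
      job.setJarByClass(IterativeDriver.class);
      // The algorithm's Mapper/Reducer classes (e.g., a PageRank step) would be set here.
      FileInputFormat.addInputPath(job, input);     // rereads the full input every iteration
      FileOutputFormat.setOutputPath(job, output);
      if (!job.waitForCompletion(true)) System.exit(1);

      // Termination check: a user-defined counter ("convergence"/"CHANGED" is a
      // hypothetical name) that the reducers would increment for every changed record.
      long changed = job.getCounters().findCounter("convergence", "CHANGED").getValue();
      converged = (changed == 0);

      input = output; // the next job reloads everything, changed or not, from HDFS
      iteration++;
    }
  }
}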
Researchers have exerted efforts to solve the iterative computation problem: HaLoop was proposed by (Bu et al., 2010), Twister by (Ekanayake et al., 2010), and Pregel by (Malewicz et al., 2010). One approach to the iterative computation limitation, taken by HaLoop (Bu et al., 2010) and Twister (Ekanayake et al., 2010), is to identify and keep invariant data during the iterations, avoiding repeatedly reading unnecessary data. HaLoop implements two caching functionalities (Bu et al., 2010; Sakr & Gaber, 2014). The first caches invariant data in the first iteration and reuses them in later iterations. The second caches the reducer outputs, making the check for a fixpoint more efficient without adding an extra MapReduce job (Bu et al., 2010; Sakr & Gaber, 2014).
The Pregel solution by (Malewicz et al., 2010) is more focused on graphs and was inspired by the Bulk Synchronous Parallel model. It provides synchronous computation and communication, uses an explicit messaging approach to acquire remote information, and does not replicate remote values locally (Malewicz et al., 2010). Mahout is another solution introduced to address iterative computing by grouping a series of chained jobs to obtain the results, where the result of each job is pushed into the next job until the final results are obtained (Polato et al., 2014). iHadoop, proposed by (Elnikety, Elsayed, & Ramadan, 2011), schedules iterations asynchronously and connects the output of one iteration to the next, allowing both to process their data concurrently. The iHadoop task scheduler exploits inter-iteration data locality by scheduling tasks that exhibit a producer/consumer relation on the same physical machine, allowing fast transfer of local data (Elnikety et al., 2011).
Apache Hadoop and Apache Spark are the most popular technologies for iterative computation using an in-memory data processing engine (Liang, Li, Wang, & Hu, 2011). Hadoop expresses an iterative computation as a series of MapReduce jobs, where each job independently reads its data from the Hadoop Distributed File System (HDFS), processes it, and writes it back to HDFS (Liang et al., 2011). Dacoop was proposed as an extension to Hadoop for handling data-iterative applications, using a caching technique for repeatedly processed data and introducing a shared memory-based data cache mechanism (Liang et al., 2011). iMapReduce is another solution, proposed by (Zhang, Gao, Gao, & Wang, 2012), that supports iterative processing by keeping the map and reduce tasks persistent during the whole iterative process and defining how the persistent tasks are terminated. iMapReduce avoids three significant overheads (Zhang et al., 2012). The first is the job startup overhead, avoided by building an internal loop from reduce to map within a job. The second is the communication overhead, avoided by separating the iterated state data from the static structure data. The third is the synchronization overhead, avoided by allowing asynchronous map task execution (Zhang et al., 2012).
Success Stories of Hadoop Technology for Big Data Analytics (BDA)
· BDA and the Impact of Hadoop in Banking for Cost Reduction
(Davenport & Dyché, 2013) report that Big Data has had an impact at an international financial services firm. The bank has several objectives for Big Data, but the primary one is to exploit “a vast increase in computing power on dollar-for-dollar basis” (Davenport & Dyché, 2013). The bank purchased a Hadoop cluster with 50 server nodes and 800 processor cores, capable of handling a petabyte of data. The bank’s data scientists take the existing analytical procedures and convert them into the Hive scripting language to run on the Hadoop cluster.
· BDA and the Impact of Real-Time and Hadoop on Fraud Detection
Big Data with high velocity has created opportunities and requirements for organizations to increase their capability for real-time sense and response (Chan, 2014). Real-time analysis and rapid response are critical features of Big Data management in many business situations (Chan, 2014). For instance, as cited in (Chan, 2014), IBM (2013), by scrutinizing the five million trade events created each day, identified potential fraud, and by analyzing 500 million daily call detail records in real time, was able to predict customer churn faster (Chan, 2014).
“Fraud detection is one of the most visible uses of big data analytics” (Cardenas et al., 2013). Credit card and phone companies have conducted large-scale fraud detection for decades (Cardenas et al., 2013). However, the custom-built infrastructure necessary to mine Big Data for fraud detection was not economical enough for wide-scale adoption. One of the significant impacts of BDA technologies is that they enable a wide variety of industries to build affordable infrastructure for security monitoring (Cardenas et al., 2013). The new BD technologies of the Hadoop ecosystem, including Pig, Hive, Mahout, and Hadoop itself, together with stream mining, complex-event processing, and NoSQL databases, enable the analysis of large-scale, heterogeneous datasets at unprecedented scale and speed (Cardenas et al., 2013). These technologies have transformed security analytics by facilitating the storage, maintenance, and analysis of security information (Cardenas et al., 2013).
· BDA and the Impact of Hadoop in Time Reduction as Business Value
Big Data Analytics can be used as a competitive edge in marketing by reducing the time to respond to customers through rapid data capture, aggregation, processing, and analytics. Harrah’s (currently Caesars) Entertainment has acquired both Hadoop clusters and open-source and commercial analytics software, with the primary objective of exploring and implementing Big Data to respond in real time to customer marketing and service. GE is another example, regarded as the most prominent creator of new service offerings based on Big Data (Davenport & Dyché, 2013). GE’s primary focus was to optimize service contracts and maintenance intervals for industrial products.
Part-II
Hadoop Installation
The purpose of this part is to go through the installation of Hadoop on a single-node cluster using the Windows 10 operating system. It covers fifteen significant tasks, starting from downloading the software from the Apache site and ending with a demonstration of the successful installation and configuration. The installation steps are derived from the installation guide of (apache.org, 2018). Due to the lack of system resources, the Windows operating system was the most appropriate choice for this installation and configuration, although the researcher prefers Unix over Windows due to extensive experience with Unix. Nevertheless, the installation and configuration experience on Windows has its value as well.
Task-1: Hadoop Software Download
The purpose of this task is to download the required Hadoop software for the Windows operating system from the following link: http://www-eu.apache.org/dist/hadoop/core/stable/. Although versions higher than 2.9.1 exist, the researcher selected this version because it is the stable core release recommended by Apache.

Task-2: Java Installation
The purpose of this task is to install Java, which is required for Hadoop as indicated in the administration guide. Java 1.8.0_111 is installed on the system as shown below.

Task-3: Extract Hadoop Zip File
The purpose of this task is to extract the Hadoop zip file into the directory C:\Hadoop-2.9.1 as shown below.
Task-4: Set Up Required System Variables
The purpose of this task is to set up the required system variables. Set up HADOOP_HOME, which is required per the installation guide, as shown below.
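A minimal sketch of the variable setup from a Command Prompt, assuming the extraction path from Task-3 (the JDK path below is an example and must match the location from Task-2); adding the bin and sbin folders to PATH is a common convenience so that the hdfs and start-up commands in the later tasks resolve from any directory:
>setx HADOOP_HOME "C:\Hadoop-2.9.1"
>setx JAVA_HOME "C:\Java\jdk1.8.0_111"
>setx PATH "%PATH%;C:\Hadoop-2.9.1\bin;C:\Hadoop-2.9.1\sbin"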

Task-5: Edit core-site.xml
The purpose of this task is to set up the Hadoop configuration by editing the core-site.xml file in C:\Hadoop-2.9.1\etc\hadoop and adding the fs.defaultFS property, which identifies the Hadoop file system using localhost and port 9000, as sketched below.
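A sketch of the resulting core-site.xml, using the localhost address and port 9000 named above:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>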

Task-6: Copy mapred-site.xml.template to mapred-site.xml
The purpose of this task is to copy the MapReduce template. Copy mapred-site.xml.template to another file mapred-site.xml in the same directory.
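For example, from a Command Prompt:
>copy C:\Hadoop-2.9.1\etc\hadoop\mapred-site.xml.template C:\Hadoop-2.9.1\etc\hadoop\mapred-site.xml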

Task-7: Edit mapred-site.xml
The purpose of this task is to set up the configuration for Hadoop MapReduce by editing mapred-site.xml and adding the required property between the configuration tags, as shown below.
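The property conventionally added at this step in a single-node YARN setup is mapreduce.framework.name, which directs MapReduce jobs to run on YARN; the following sketch assumes that standard value:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>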

Task-8: Create Two Folders for DataNode and NameNode
The purpose of this task is to create two important folders, for the DataNode and the NameNode, which are required by the Hadoop file system. Create the folder “data” under the Hadoop home C:\Hadoop-2.9.1, then create the folders “datanode” and “namenode” under C:\Hadoop-2.9.1\data, as shown below.
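For example, from a Command Prompt:
>mkdir C:\Hadoop-2.9.1\data
>mkdir C:\Hadoop-2.9.1\data\datanode
>mkdir C:\Hadoop-2.9.1\data\namenode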

Task-9: Edit hdfs-site.xml
The purpose of this task is to set up the configuration for Hadoop HDFS by editing the file C:\Hadoop-2.9.1\etc\hadoop\hdfs-site.xml and adding the properties for dfs.replication, the NameNode directory, and the DataNode directory, as shown below.
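A sketch of hdfs-site.xml pointing the NameNode and DataNode directory properties at the folders created in Task-8; a replication factor of 1 is the usual assumption for a single-node cluster:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///C:/Hadoop-2.9.1/data/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///C:/Hadoop-2.9.1/data/datanode</value>
  </property>
</configuration>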

Task-10: Edit yarn-site.xml
The purpose of this task is to set the configuration for the YARN tool by editing the file C:\Hadoop-2.9.1\etc\hadoop\yarn-site.xml and adding yarn.nodemanager.aux-services with its value mapreduce_shuffle, as shown below.
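A sketch of yarn-site.xml with the property named above; the accompanying ShuffleHandler class property is commonly set alongside it in Hadoop 2.x setup guides:
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>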

Task-11: Overcome Java Error
The purpose of this task is to overcome a Java-related startup error: edit C:\Hadoop-2.9.1\etc\hadoop\hadoop-env.cmd and set JAVA_HOME, as sketched below.
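A one-line sketch of the addition to hadoop-env.cmd (the JDK path is an example and must match Task-2). A common pitfall on Windows is a JAVA_HOME containing spaces, such as C:\Program Files\Java\...; a space-free install path or the 8.3 short form (e.g., C:\PROGRA~1\Java\jdk1.8.0_111) is the usual workaround:
set JAVA_HOME=C:\Java\jdk1.8.0_111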

Task-12: Test the Configuration
The purpose of this task is to test the current configuration and setup by issuing the following command before running Hadoop. At this point, the command will throw an error stating that HADOOP_COMMON_HOME is not found.
>hdfs namenode -format
Task-13: Overcome the HADOOP_COMMON_HOME Error
To overcome the HADOOP_COMMON_HOME “not found” error, edit hadoop-env.cmd, add the variables shown below, and issue the command again; it should then complete successfully.
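A plausible form of the fix, assuming the Hadoop home from Task-3; the exact lines used in the original setup are not reproduced here, so these should be treated as representative:
set HADOOP_HOME=C:\Hadoop-2.9.1
set HADOOP_COMMON_HOME=%HADOOP_HOME%
set HADOOP_HDFS_HOME=%HADOOP_HOME%
set HADOOP_YARN_HOME=%HADOOP_HOME%
set HADOOP_MAPRED_HOME=%HADOOP_HOME%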

Task-14: Start Hadoop Processes
The purpose of this task is to start the Hadoop dfs and yarn processes, as sketched below.
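The two start-up scripts ship in the sbin folder of the Hadoop home; the first starts the HDFS daemons (NameNode and DataNode) and the second starts the YARN daemons (ResourceManager and NodeManager):
>C:\Hadoop-2.9.1\sbin\start-dfs.cmd
>C:\Hadoop-2.9.1\sbin\start-yarn.cmd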

Task-15: Run the Cluster Page from the Browser
The purpose of this task is to open the Hadoop cluster page in the browser after the previous configuration steps. If the configuration was implemented successfully, the cluster page is displayed with the Hadoop functionality as shown below; otherwise, the browser returns a 404 (page not found) error.
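With Hadoop 2.x defaults, the YARN cluster page is typically served at http://localhost:8088 and the NameNode web interface at http://localhost:50070; both URLs assume the ports were left at their default values.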


Conclusion
This project has discussed various significant topics related to Big Data Analytics across two parts. Part-I has discussed Big Data and the emerging technology of Hadoop, provided an overview of the Hadoop ecosystem, its building blocks, benefits, and limitations, and discussed the MapReduce framework together with its benefits and limitations. Part-I has also provided a few success stories of using Hadoop technology for Big Data Analytics. Part-II has addressed the installation and configuration of Hadoop on the Windows operating system through fifteen essential tasks, along with the errors encountered during the configuration setup and the techniques used to overcome them so that the Hadoop installation could proceed successfully.
References
Abbasi, A., Sarker, S., & Chiang, R. (2016). Big data research in information systems: Toward an inclusive research agenda. Journal of the Association for Information Systems, 17(2), 3.
Alam, A., & Ahmed, J. (2014). Hadoop architecture and its issues. Paper presented at the Computational Science and Computational Intelligence (CSCI), 2014 International Conference on.
apache.org. (2018). Hadoop Installation Guide – Windows. Retrieved from https://wiki.apache.org/hadoop/Hadoop2OnWindows.
Bakshi, K. (2012). Considerations for big data: Architecture and approach. Paper presented at the Aerospace Conference, 2012 IEEE.
Bao, Y., Ren, L., Zhang, L., Zhang, X., & Luo, Y. (2012). Massive sensor data management framework in cloud manufacturing based on Hadoop. Paper presented at the Industrial Informatics (INDIN), 2012 10th IEEE International Conference on.
Bates, D. W., Saria, S., Ohno-Machado, L., Shah, A., & Escobar, G. (2014). Big data in health care: using analytics to identify and manage high-risk and high-cost patients. Health Affairs, 33(7), 1123-1131.
Bhatotia, P., Wieder, A., Rodrigues, R., Acar, U. A., & Pasquin, R. (2011). Incoop: MapReduce for incremental computations. Paper presented at the Proceedings of the 2nd ACM Symposium on Cloud Computing.
Bi, Z., & Cochran, D. (2014). Big data analytics with applications. Journal of Management Analytics, 1(4), 249-265.
Bu, Y., Howe, B., Balazinska, M., & Ernst, M. D. (2010). HaLoop: Efficient iterative data processing on large clusters. Proceedings of the VLDB Endowment, 3(1-2), 285-296.
Cardenas, A. A., Manadhata, P. K., & Rajan, S. P. (2013). Big data analytics for security. IEEE Security & Privacy, 11(6), 74-76.
Chan, J. O. (2014). An architecture for big data analytics. Communications of the IIMA, 13(2), 1.
Chandarana, P., & Vijayalakshmi, M. (2014). Big Data analytics frameworks. Paper presented at the Circuits, Systems, Communication and Information Technology Applications (CSCITA), 2014 International Conference on.
Chang, V., Kuo, Y.-H., & Ramachandran, M. (2016). Cloud computing adoption framework: A security framework for business clouds. Future Generation computer systems, 57, 24-41. doi:10.1016/j.future.2015.09.031
Chopra, A., & Madan, S. (2015). Big Data: A Trouble or A Real Solution? International Journal of Computer Science Issues (IJCSI), 12(2), 221.
CSA (Cloud Security Alliance). (2013). Big Data Analytics for Security Intelligence. Big Data Working Group.
Davenport, T. H., & Dyché, J. (2013). Big data in big companies. International Institute for Analytics.
De Mauro, A., Greco, M., & Grimaldi, M. (2015). What is big data? A consensual definition and a review of key research topics. Paper presented at the AIP Conference Proceedings.
Ekanayake, J., Li, H., Zhang, B., Gunarathne, T., Bae, S.-H., Qiu, J., & Fox, G. (2010). Twister: a runtime for iterative mapreduce. Paper presented at the Proceedings of the 19th ACM international symposium on high performance distributed computing.
Elnikety, E., Elsayed, T., & Ramadan, H. E. (2011). iHadoop: asynchronous iterations for MapReduce. Paper presented at the Cloud Computing Technology and Science (CloudCom), 2011 IEEE Third International Conference on.
Emani, C. K., Cullot, N., & Nicolle, C. (2015). Understandable big data: A survey. Computer science review, 17, 70-81.
Erl, T., Khattak, W., & Buhler, P. (2016). Big Data Fundamentals: Concepts, Drivers & Techniques: Prentice Hall Press.
Fadzil, A. F. A., Khalid, N. E. A., & Manaf, M. (2012). Performance of scalable off-the-shelf hardware for data-intensive parallel processing using MapReduce. Paper presented at the Computing and Convergence Technology (ICCCT), 2012 7th International Conference on.
Gantz, J., & Reinsel, D. (2011). Extracting value from chaos. IDC iview, 1142, 1-12.
Géczy, P. (2014). Big data characteristics. The Macrotheme Review, 3(6), 94-104.
Gupta, B., & Jyoti, K. (2014). Big data analytics with hadoop to analyze targeted attacks on enterprise data.
Gupta, R., Gupta, H., & Mohania, M. (2012). Cloud computing and big data analytics: what is new from databases perspective? Paper presented at the International Conference on Big Data Analytics.
Hirzel, M., Andrade, H., Gedik, B., Jacques-Silva, G., Khandekar, R., Kumar, V., . . . Soulé, R. (2013). IBM streams processing language: Analyzing big data in motion. IBM Journal of Research and Development, 57(3/4), 7:1-7:11.
Hu, H., Wen, Y., Chua, T.-S., & Li, X. (2014). Toward scalable systems for big data analytics: A technology tutorial. IEEE Access, 2, 652-687.
Hu, P., & Dai, W. (2014). Enhancing fault tolerance based on Hadoop cluster. International Journal of Database Theory and Application, 7(1), 37-48.
Inukollu, V. N., Arsi, S., & Ravuri, S. R. (2014). Security issues associated with big data in cloud computing. International Journal of Network Security & Its Applications, 6(3), 45.
Jain, R. (2013). Big Data Fundamentals. Retrieved from http://www.cse.wustl.edu/~jain/cse570-13/ftp/m_10abd.pdf.
Jinquan, D., Jie, H., Shengsheng, H., Yan, L., & Yuanhao, S. (2012). The Hadoop Stack: New Paradigm for Big Data Storage and Processing. Intel Technology Journal, 16(4), 92-110.
Kaisler, S., Armour, F., Espinosa, J. A., & Money, W. (2013). Big data: issues and challenges moving forward. Paper presented at the System Sciences (HICSS), 2013 46th Hawaii International Conference on System Sciences.
Katal, A., Wazid, M., & Goudar, R. (2013). Big data: issues, challenges, tools and good practices. Paper presented at the Contemporary Computing (IC3), 2013 Sixth International Conference on Contemporary Computing.
Khan, N., Yaqoob, I., Hashem, I. A. T., Inayat, Z., Mahmoud Ali, W. K., Alam, M., . . . Gani, A. (2014). Big Data: Survey, Technologies, Opportunities, and Challenges. The Scientific World Journal, 2014.
Krishnan, K. (2013). Data warehousing in the age of big data: Newnes.
Lee, K.-H., Lee, Y.-J., Choi, H., Chung, Y. D., & Moon, B. (2012). Parallel data processing with MapReduce: a survey. ACM SIGMOD Record, 40(4), 11-20.
Liang, Y., Li, G., Wang, L., & Hu, Y. (2011). Dacoop: Accelerating data-iterative applications on Map/Reduce cluster. Paper presented at the Parallel and Distributed Computing, Applications and Technologies (PDCAT), 2011 12th International Conference on.
Malewicz, G., Austern, M. H., Bik, A. J., Dehnert, J. C., Horn, I., Leiser, N., & Czajkowski, G. (2010). Pregel: a system for large-scale graph processing. Paper presented at the Proceedings of the 2010 ACM SIGMOD International Conference on Management of data.
Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Byers, A. H. (2011). Big data: The next frontier for innovation, competition, and productivity.
Minelli, M., Chambers, M., & Dhiraj, A. (2013). Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today’s Businesses: John Wiley & Sons.
Mishra, B. S. P., Dehuri, S., & Kim, E. (2016). Techniques and Environments for Big Data Analysis: Parallel, Cloud, and Grid Computing (Vol. 17): Springer.
Moorthy, M., Baby, R., & Senthamaraiselvi, S. (2014). An Analysis for Big Data and its Technologies. International Journal of Science, Engineering and Computer Technology, 4(12), 412.
Nasser, T., & Tariq, R. (2015). Big data challenges. Journal of Computer Engineering & Information Technology, 4(3).
Polato, I., Ré, R., Goldman, A., & Kon, F. (2014). A comprehensive view of Hadoop research—A systematic literature review. Journal of Network and Computer Applications, 46, 1-25.
Ramesh, B. (2015). Big Data Architecture Big Data (pp. 29-59): Springer.
Sakr, S., & Gaber, M. (2014). Large Scale and big data: Processing and Management: CRC Press.
White, T. (2012). Hadoop: The Definitive Guide. O’Reilly Media, Inc.
Yan, C., Yang, X., Yu, Z., Li, M., & Li, X. (2012). Incmr: Incremental data processing based on mapreduce. Paper presented at the Cloud Computing (CLOUD), 2012 IEEE 5th International Conference on.
Yang, H.-c., Dasdan, A., Hsiao, R.-L., & Parker, D. S. (2007). Map-reduce-merge: simplified relational data processing on large clusters. Paper presented at the Proceedings of the 2007 ACM SIGMOD international conference on Management of data.
Zhang, Y., Chen, S., Wang, Q., & Yu, G. (2015). i2MapReduce: Incremental MapReduce for mining evolving big data. IEEE Transactions on Knowledge and Data Engineering, 27(7), 1906-1919.
Zhang, Y., Gao, Q., Gao, L., & Wang, C. (2012). imapreduce: A distributed computing framework for iterative computation. Journal of Grid Computing, 10(1), 47-68.