Hadoop and MapReduce

Dr. Aly, O.
Computer Science

Basic Concepts of the MapReduce Framework

In 2004, Google introduced MapReduce framework as a Parallel Processing framework which deals with large set of data (Bakshi, 2012; Fadzil, Khalid, & Manaf, 2012; White, 2012).  The MapReduce framework has gained much popularity because it has features for hiding sophisticated operations of the parallel processing (Fadzil et al., 2012).  Various MapReduce frameworks such as Hadoop were introduced because of the enthusiasm towards MapReduce (Fadzil et al., 2012). 

The capability of the MapReduce framework was realized by different research areas such as data warehousing, data mining, and the bioinformatics (Fadzil et al., 2012).  MapReduce framework consists of two main layers; the Distributed File System (DFS) layer to store data and the MapReduce layer for data processing (Lee, Lee, Choi, Chung, & Moon, 2012; Mishra, Dehuri, & Kim, 2016; Sakr & Gaber, 2014).  DFS is a major feature of the MapReduce framework (Fadzil et al., 2012).   

MapReduce framework is using large clusters of low-cost commodity hardware to lower the cost (Bakshi, 2012; H. Hu, Wen, Chua, & Li, 2014; Inukollu, Arsi, & Ravuri, 2014; Khan et al., 2014; Krishnan, 2013; Mishra et al., 2016; Sakr & Gaber, 2014; White, 2012).  MapReduce framework is using “Redundant Arrays of Independent (and inexpensive) Nodes (RAIN),” whose components are loosely coupled and when any node goes down there is no negative impact on the MapReduce job (Sakr & Gaber, 2014; Yang, Dasdan, Hsiao, & Parker, 2007).  MapReduce framework involves the “Fault-Tolerance” by applying the replication technique and allows replacing any crashed nodes with another node without affecting the current running job (P. Hu & Dai, 2014-7; Sakr & Gaber, 2014).  MapReduce framework involves the automatic support for the parallelization of execution which makes the MapReduce highly parallel and yet abstracted (P. Hu & Dai, 2014-7; Sakr & Gaber, 2014).  

The Top Three Features of Hadoop

The Hadoop Distributed File System (HDFS) is one of the major components of the Hadoop framework for storing large files (Bao, Ren, Zhang, Zhang, & Luo, 2012; CSA, 2013; De Mauro, Greco, & Grimaldi, 2015) and allowing access to data scattered over multiple nodes in without any exposure to the complexity of the environment (Bao et al., 2012; De Mauro et al., 2015).  The MapReduce programming model is another major component of the Hadoop framework (Bao et al., 2012; CSA, 2013; De Mauro et al., 2015) which is designed to implement the distributed and parallel algorithms efficiently (De Mauro et al., 2015).  HBase is the third component of Hadoop framework (Bao et al., 2012).  HBase is developed on the HDFS and is a NoSQL (Not only SQL) type database (Bao et al., 2012). 

The key features of Hadoop include the scalability and flexibility, cost efficiency and fault tolerance (H. Hu et al., 2014; Khan et al., 2014; Mishra et al., 2016; Polato, Ré, Goldman, & Kon, 2014; Sakr & Gaber, 2014).  Hadoop allows the nodes in the cluster to scale up and down based on the computation requirements and with no change in the data formats (H. Hu et al., 2014; Polato et al., 2014).  Hadoop also provides massively parallel computation to commodity hardware decreasing the cost per terabyte of storage which makes the massively parallel computation affordable when the volume of the data gets increased (H. Hu et al., 2014).  The Hadoop technology offers the flexibility feature as it is not tight with a schema which allows the utilization of any data either structured, non-structures, and semi-structured, and the aggregation of the data from multiple sources (H. Hu et al., 2014; Polato et al., 2014).  Hadoop also allows nodes to crash without affecting the data processing.  It provides fault tolerance environment where data and computation can be recovered without any negative impact on the processing of the data (H. Hu et al., 2014; Polato et al., 2014; White, 2012). 

Pros and Cons of MapReduce Framework

MapReduce was introduced to solve the problem of parallel processing of a large set of data in a distributed environment which required manual management of the hardware resources (Fadzil et al., 2012; Sakr & Gaber, 2014).  The complexity of the parallelization is solved by using two techniques:  Map/Reduce technique, and Distributed File System (DFS) technique (Fadzil et al., 2012; Sakr & Gaber, 2014).  The parallel framework must be reliable to ensure good resource management in the distributed environment using off-the-shelf hardware to solve the scalability issue to support any future requirement for processing (Fadzil et al., 2012).   The earlier frameworks such as the Message Passing Interface (MPI) framework was having a reliability issue and had a fault-tolerance issue when processing a large set of data (Fadzil et al., 2012).  MapReduce framework covers the two categories of the scalability; the structural scalability, and the load scalability (Fadzil et al., 2012).  It addresses the structural scalability by using the DFS which allows forming large virtual storage for the framework by adding off-the-shelf hardware.  MapReduce framework addresses the load scalability by increasing the number of the nodes to improve the performance (Fadzil et al., 2012). 

However, the earlier version of MapReduce framework faced challenges. Among these challenges are the join operation and the lack of support for aggregate functions to join multiple datasets in one task (Sakr & Gaber, 2014).  Another limitation of the standard MapReduce framework is found in the iterative processing which is required for analysis techniques such as PageRank algorithm, recursive relational queries, and social network analysis (Sakr & Gaber, 2014).  The standard MapReduce does not share the execution of work to reduce the overall amount of work  (Sakr & Gaber, 2014).  Another limitation was found in the lack of support of data index and column storage but support only for a sequential method when scanning the input data. Such a lack of data index affected the query performance (Sakr & Gaber, 2014).  Moreover, many argued that MapReduce is not regarded to be the optimal solution for structured data.   It is known as shared-nothing architecture, which supports scalability (Bakshi, 2012; Jinquan, Jie, Shengsheng, Yan, & Yuanhao, 2012; Sakr & Gaber, 2014; White, 2012), and the processing of large unstructured data sets (Bakshi, 2012).  MapReduce has the limitation of performance and efficiency (Lee et al., 2012).

References

Bakshi, K. (2012, 3-10 March 2012). Considerations for big data: Architecture and approach. Paper presented at the Aerospace Conference, 2012 IEEE.

Bao, Y., Ren, L., Zhang, L., Zhang, X., & Luo, Y. (2012). Massive sensor data management framework in cloud manufacturing based on Hadoop. Paper presented at the Industrial Informatics (INDIN), 2012 10th IEEE International Conference on.

CSA, C. S. A. (2013). Big Data Analytics for Security Intelligence. Big Data Working Group.

De Mauro, A., Greco, M., & Grimaldi, M. (2015). What is big data? A consensual definition and a review of key research topics. Paper presented at the AIP Conference Proceedings.

Fadzil, A. F. A., Khalid, N. E. A., & Manaf, M. (2012). Performance of scalable off-the-shelf hardware for data-intensive parallel processing using MapReduce. Paper presented at the Computing and Convergence Technology (ICCCT), 2012 7th International Conference on.

Hu, H., Wen, Y., Chua, T.-S., & Li, X. (2014). Toward scalable systems for big data analytics: A technology tutorial. IEEE Access, 2, 652-687.

Hu, P., & Dai, W. (2014-7). Enhancing fault tolerance based on Hadoop cluster. International Journal of Database Theory and Application, 7(1), 37-48.

Inukollu, V. N., Arsi, S., & Ravuri, S. R. (2014). Security issues associated with big data in cloud computing. International Journal of Network Security & Its Applications, 6(3), 45.

Jinquan, D., Jie, H., Shengsheng, H., Yan, L., & Yuanhao, S. (2012). The Hadoop Stack: New Paradigm for Big Data Storage and Processing. Intel Technology Journal, 16(4), 92-110.

Khan, N., Yaqoob, I., Hashem, I. A. T., Inayat, Z., Mahmoud Ali, W. K., Alam, M., . . . Gani, A. (2014). Big Data: Survey, Technologies, Opportunities, and Challenges. The Scientific World Journal, 2014.

Krishnan, K. (2013). Data warehousing in the age of big data: Newnes.

Lee, K.-H., Lee, Y.-J., Choi, H., Chung, Y. D., & Moon, B. (2012). Parallel data processing with MapReduce: a survey. ACM SIGMOD Record, 40(4), 11-20.

Mishra, B. S. P., Dehuri, S., & Kim, E. (2016). Techniques and Environments for Big Data Analysis: Parallel, Cloud, and Grid Computing (Vol. 17): Springer.

Polato, I., Ré, R., Goldman, A., & Kon, F. (2014). A comprehensive view of Hadoop research—A systematic literature review. Journal of Network and Computer Applications, 46, 1-25.

Sakr, S., & Gaber, M. (2014). Large Scale and big data: Processing and Management: CRC Press.

White, T. (2012). Hadoop: The definitive guide: ” O’Reilly Media, Inc.”.

Yang, H.-c., Dasdan, A., Hsiao, R.-L., & Parker, D. S. (2007). Map-reduce-merge: simplified relational data processing on large clusters. Paper presented at the Proceedings of the 2007 ACM SIGMOD international conference on Management of data.