Dr. Aly, O.
Computer Science
The standard MapReduce framework faced the challenge of the iterative computation which is required in various operations such as data mining, PageRank, network traffic analysis, graph analysis, social network analysis, and so forth (Bu, Howe, Balazinska, & Ernst, 2010; Sakr & Gaber, 2014). These analyses techniques require the data to be processed iteratively until the computation satisfies a convergence or stropping condition (Bu et al., 2010; Sakr & Gaber, 2014). Due to this limitation, and to this critical requirement, this iterative process is implemented and executed manually using a driver program when using the standard MapReduce framework (Bu et al., 2010; Sakr & Gaber, 2014). However, the manual implementation and execution of such iterative computation have two major problems (Bu et al., 2010; Sakr & Gaber, 2014). The first problem is reflected in loading unchanged data from iteration to iteration wasting input/output (I/O), network bandwidth, and CPU resources (Bu et al., 2010; Sakr & Gaber, 2014). The second problem is reflected in the overhead of the termination condition when the output of the application did not change for two consecutive iterations and reached a fixed point (Bu et al., 2010; Sakr & Gaber, 2014). This termination condition may require an extra MapReduce job on each iteration which causes overhead for scheduling extra tasks, reading extra data from disk, and moving data across the network (Bu et al., 2010; Sakr & Gaber, 2014).
Researchers exerted efforts to solve the iterative computation. HaLoop is proposed by (Bu et al., 2010), and Twister by (Ekanayake et al., 2010), Pregel by (Malewicz et al., 2010). One solution to the iterative computation limitation, as the case in HaLoop by (Bu et al., 2010) and Twister by (Ekanayake et al., 2010) are to identify and keep invariant data during the iterations, where reading unnecessary data repeatedly is avoided. The HaLoop by (Bu et al., 2010) implemented two caching functionalities (Bu et al., 2010; Sakr & Gaber, 2014). The first caching technique is implemented on the invariant data in the first iteration and reusing them in a later iteration. The second caching technique is implemented on the outputs of reducer making the check for the fixpoint more efficient without adding any extra MapReduce job (Bu et al., 2010; Sakr & Gaber, 2014).
The solution of Pregel by (Malewicz et al., 2010) is more focused on the graph and was inspired by the Bulk Synchronous Parallel model (Malewicz et al., 2010). This solution provides the synchronous computation and communication (Malewicz et al., 2010) and uses explicit messaging approach to acquire remote information and does not replicate remote values locally (Malewicz et al., 2010). Mahoot is another solution that was introduced to solve the iterative computing by grouping a series of chained jobs to obtain the results (Polato, Ré, Goldman, & Kon, 2014). In Mahoot solution, the result of each job is pushed into the next job until the final results are obtained (Polato et al., 2014). The iHadoop proposed by (Elnikety, Elsayed, & Ramadan, 2011) schedules iterations asynchronously and connects the output of one iteration to the next allowing both to process their data concurrently (Elnikety et al., 2011). The task scheduler of the iHadoop utilizes the inter-iteration data locality by scheduling tasks that exhibit a producer/consumer relation on the same physical machine allowing a fast transfer of the local data (Elnikety et al., 2011).
Apache Hadoop and Apache Spark are the most popular technology for the iterative computation using in-memory data processing engine (Liang, Li, Wang, & Hu, 2011). Hadoop defines the iterative computation as a series of MapReduce jobs where each job reads the data from Hadoop Distributed File System (HDFS) independently, processes the data, and writes the data back to HDFS (Liang et al., 2011). Dacoop was proposed by Liang as an extension to Hadoop to handle the data-iterative applications, by using cache technique for repeatedly data processing and introducing shared memory-based data cache mechanism (Liang et al., 2011). The iMapReduce is another solution proposed by (Zhang, Gao, Gao, & Wang, 2012) to provide support of iterative processing implementing the persistent tasks of the map and reduce during the whole iterative process and how the persistent tasks are terminated (Zhang et al., 2012). The iMapReduce avoid three major overheads. The first overhead is the job startup overhead which is avoided by building an internal loop from reduce to map within a job. The second overhead is the communication overhead which is avoided by separating the iterated state data from the static structure data. The third overhead is the synchronization overhead which is avoided by allowing asynchronous map task execution (Zhang et al., 2012).
References
Bu, Y., Howe, B., Balazinska, M., & Ernst, M. D. (2010). HaLoop: Efficient iterative data processing on large clusters. Proceedings of the VLDB Endowment, 3(1-2), 285-296.
Ekanayake, J., Li, H., Zhang, B., Gunarathne, T., Bae, S.-H., Qiu, J., & Fox, G. (2010). Twister: a runtime for iterative MapReduce. Paper presented at the Proceedings of the 19th ACM international symposium on high performance distributed computing.
Elnikety, E., Elsayed, T., & Ramadan, H. E. (2011). iHadoop: asynchronous iterations for MapReduce. Paper presented at the Cloud Computing Technology and Science (CloudCom), 2011 IEEE Third International Conference on.
Liang, Y., Li, G., Wang, L., & Hu, Y. (2011). Dacoop: Accelerating data-iterative applications on Map/Reduce cluster. Paper presented at the Parallel and Distributed Computing, Applications, and Technologies (PDCAT), 2011 12th International Conference on.
Malewicz, G., Austern, M. H., Bik, A. J., Dehnert, J. C., Horn, I., Leiser, N., & Czajkowski, G. (2010). Pregel: a system for large-scale graph processing. Paper presented at the Proceedings of the 2010 ACM SIGMOD International Conference on Management of data.
Polato, I., Ré, R., Goldman, A., & Kon, F. (2014). A comprehensive view of Hadoop research—A systematic literature review. Journal of Network and Computer Applications, 46, 1-25.
Sakr, S., & Gaber, M. (2014). Large Scale and big data: Processing and Management: CRC Press.
Zhang, Y., Gao, Q., Gao, L., & Wang, C. (2012). imapreduce: A distributed computing framework for iterative computation. Journal of Grid Computing, 10(1), 47-68.