Hadoop Ecosystem

Dr. O. Aly
Computer Science

The purpose of this discussion is to examine the Hadoop ecosystem, which is rapidly evolving, and Apache Spark, a more recent addition to it. Both technologies offer significant benefits for the challenges of storing and processing large data sets in the age of Big Data Analytics.  The discussion also addresses the most significant differences between Hadoop and Spark.

Hadoop Solution, Components and Ecosystem

The growth of Big Data has demanded attention not only from researchers, academia, and government, but also from software engineering, as dealing with Big Data using conventional computer science technologies has been challenging (Koitzsch, 2017).  Koitzsch (2017) referenced annual data volume statistics from the Cisco VNI Global IP Traffic Forecast for 2014-2019, as illustrated in Figure 1, to show the magnitude of the data growth.


Figure 1.  Annual Data Volume Statistics [Cisco VNI Global IP Traffic Forecast 2014-2019] (Koitzsch, 2017). 

The complex characteristics of Big Data have demanded the innovation of distributed big data analysis, as the conventional techniques were found inadequate (Koitzsch, 2017; Lublinsky, Smith, & Yakubovich, 2013).  Thus, tools such as Hadoop have emerged, relying on clusters of relatively low-cost machines and disks to drive the distributed processing of large-scale data projects.  Apache Hadoop is a Java-based open source distributed processing framework that evolved from Apache Nutch, an open source web search engine based on Apache Lucene (Koitzsch, 2017).  The newer Hadoop subsystems have various language bindings, such as Scala and Python (Koitzsch, 2017).  The core components of Hadoop 2 include MapReduce, YARN, and HDFS, along with other components such as Tez, as illustrated in Figure 2.


Figure 2.  Hadoop 2 Core Components (Koitzsch, 2017).
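As an illustration of the MapReduce model at the heart of Hadoop 2, the following is a minimal pure-Python sketch of the map, shuffle, and reduce phases applied to a word count. It only mimics the programming model; it is not the Hadoop API itself, and the sample documents are hypothetical.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word, as a Hadoop mapper would
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by key across mapper outputs
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the grouped values for each key
    return {key: sum(values) for key, values in groups.items()}

documents = ["Hadoop stores data", "Spark and Hadoop process data"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
print(counts["hadoop"])  # 2
```

In real Hadoop, the map and reduce functions run on different cluster nodes and the shuffle moves data between them over the network; the phases themselves follow this same pattern.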

Hadoop and its ecosystem are divided into major building blocks (Koitzsch, 2017).  The core components of Hadoop 2 involve YARN, MapReduce, HDFS, and Apache Tez.  The operational services component includes Apache Ambari, Oozie, Ganglia, Nagios, Falcon, etc. The data services component includes Hive, HCatalog, Pig, HBase, Flume, Sqoop, etc.   The messaging component includes Apache Kafka, while the security services and secure ancillary components include Accumulo.  The glue components include Apache Camel, the Spring Framework, and Spring Data.  Figure 3 summarizes these building blocks of Hadoop and its ecosystem.

Figure 3. Hadoop 2 Technology Stack Diagram (Koitzsch, 2017).

Furthermore, the structure of the Hadoop ecosystem involves various components, with Hadoop at the center providing bookkeeping and management for the cluster through ZooKeeper and Curator (Koitzsch, 2017).  Hive and Pig are standard components of the Hadoop ecosystem providing data warehousing, while Mahout provides standard machine learning algorithm support.  Figure 4 shows the structure of the Hadoop ecosystem (Koitzsch, 2017).


Figure 4.  Hadoop Ecosystem (Koitzsch, 2017).

Hadoop Limitation Driving Additional Technologies

Hadoop has three significant limitations (Guo, 2013).  The first is the relative instability of Hadoop's software, since it is open source and lacks commercial technical support and documentation; Enterprise Hadoop distributions can be used to overcome this limitation.  The second is that Hadoop cannot handle real-time data processing; Spark or Storm can be used for real-time processing, as required by the application.  The third is that Hadoop cannot handle large graph datasets efficiently; GraphLab can be utilized to overcome this limitation.

Enterprise Hadoop refers to distributions of Hadoop by Hadoop-oriented vendors such as Cloudera, Hortonworks, MapR, and Hadapt (Guo, 2013).  Cloudera provides Big Data solutions and is regarded as one of the most significant contributors to the Hadoop codebase (Guo, 2013).  Hortonworks and MapR are Hadoop-based Big Data solutions (Guo, 2013).  Spark is a real-time, in-memory processing platform for Big Data (Guo, 2013).  Guo (2013) indicated that Spark "can be up to 40 times faster than Hadoop" (p. 15). Scott (2015) indicated that Spark running in memory "can be 100 times faster than Hadoop MapReduce, but also ten times faster when processing disk-based data in a similar way to Hadoop MapReduce itself" (p. 7).  Spark is described as ideal for iterative processing and responsive Big Data applications (Guo, 2013). Spark can also be integrated with Hadoop, where a Hadoop-compatible storage API provides access to any Hadoop-supported storage system (Guo, 2013).   Storm, developed and open-sourced by Twitter, is another option for overcoming Hadoop's real-time processing limitation (Guo, 2013). GraphLab is the alternative solution for Hadoop's limitation in dealing with large graph datasets; it is an open source distributed system, developed at Carnegie Mellon University, to handle sparse iterative graph algorithms (Guo, 2013). Figure 5 summarizes these three limitations and the alternatives to Hadoop that overcome them.


Figure 5.  Three Major Limitations of Hadoop and Alternative Solutions.

Apache Spark Solution, and its Building Blocks

Spark was developed in 2009 at UC Berkeley's AMPLab. Spark processes data in memory and is therefore quicker than Hadoop (Guo, 2013; Koitzsch, 2017; Scott, 2015).  In 2013, Spark became a project of the Apache Software Foundation, and early in 2014, it became one of the foundation's major projects.   Scott (2015) described Spark as a general-purpose engine for data processing that can be used in various projects.  The primary tasks associated with Spark include interactive queries across large datasets, processing streaming data from sensors or financial systems, and machine learning (Scott, 2015).  While Hadoop was written in Java, Apache Spark was written primarily in Scala (Koitzsch, 2017).

Spark has three critical features: simplicity, speed, and support (Scott, 2015).  The simplicity feature is represented by Spark's access capabilities through a set of well-structured and well-documented APIs that help data scientists utilize Spark quickly.  The speed feature reflects the fast in-memory processing of large datasets, and it is what most distinguishes Spark from Hadoop.  The support feature is represented by the various programming languages Spark supports, such as Java, Python, R, and Scala (Scott, 2015). Spark also has native support for integrating with the leading storage solutions in the Hadoop ecosystem and beyond (Scott, 2015).  Databricks, IBM, and the other main Hadoop vendors provide Spark-based solutions.

The typical uses of Spark include stream processing, machine learning, interactive analytics, and data integration (Scott, 2015).  Examples of stream processing include real-time data processing to identify and prevent potentially fraudulent transactions.  Machine learning is another typical use case, supported by Spark's ability to hold data in memory and quickly run repeated queries, which helps in training machine learning algorithms and finding the most efficient model (Scott, 2015).  Interactive analytics is another typical use, involving interactive query processes to which Spark responds and adapts quickly.  Data integration is another typical use, involving the extract, transform, and load (ETL) process while reducing its cost and time.  The Spark framework includes the Spark Core Engine, along with Spark SQL, Spark Streaming for data streaming, MLlib for machine learning, GraphX for graph computation, and SparkR for running the R language on Spark. Figure 6 summarizes the framework of Spark and its building blocks (Scott, 2015).


Figure 6.  Spark Building Blocks (Scott, 2015).
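The in-memory, iterative style enabled by the Spark Core Engine can be sketched in pure Python. The `ToyRDD` class below is a hypothetical, simplified stand-in for a Spark RDD, not the actual Spark API: transformations (`map`, `filter`) are lazy, `cache` keeps results in memory (eagerly here for simplicity, whereas Spark materializes a cache on the first action), and actions (`count`, `collect`) trigger computation. The load counter shows why repeated queries, as in machine learning training, avoid re-reading from disk.

```python
loads = {"count": 0}

def load_from_disk():
    # Simulates an expensive read from distributed storage (e.g., HDFS)
    loads["count"] += 1
    return list(range(10))

class ToyRDD:
    """Illustrative stand-in for a Spark RDD: lazy transformations,
    optional in-memory caching, and actions that trigger computation."""
    def __init__(self, source, transforms=()):
        self._source = source
        self._transforms = list(transforms)
        self._cached = None

    def map(self, fn):          # lazy: records the step, computes nothing
        return ToyRDD(self._source, self._transforms + [("map", fn)])

    def filter(self, fn):       # lazy: records the step, computes nothing
        return ToyRDD(self._source, self._transforms + [("filter", fn)])

    def cache(self):
        # Materialize once and keep in memory (Spark does this lazily)
        self._cached = self._compute()
        return self

    def _compute(self):
        if self._cached is not None:
            return self._cached
        data = self._source()
        for kind, fn in self._transforms:
            data = [fn(x) for x in data] if kind == "map" \
                else [x for x in data if fn(x)]
        return data

    def count(self):            # action
        return len(self._compute())

    def collect(self):          # action
        return list(self._compute())

rdd = ToyRDD(load_from_disk).map(lambda x: x * 2).filter(lambda x: x > 5).cache()
for _ in range(3):              # repeated queries, as in iterative ML training
    rdd.count()
print(loads["count"])  # 1: the cached data is reused, not reloaded each time
```

Without the `cache()` call, every action would re-invoke the load, which is the disk-bound pattern MapReduce follows between jobs; caching is what makes Spark's iterative workloads fast.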

Differences Between Spark and Hadoop

Although Spark has its benefits in processing real-time data using in-memory processing, Spark is not a replacement for Hadoop or MapReduce (Scott, 2015).  Spark can run on top of Hadoop to benefit from YARN, Hadoop's cluster manager, and from underlying storage such as HDFS and HBase.  Spark can also run by itself without Hadoop, integrating with other cluster managers such as Mesos and other storage such as Cassandra and Amazon S3 (Scott, 2015). Spark is described as a great companion to a modern Hadoop cluster deployment (Scott, 2015), and as a powerful tool in its own right for processing large volumes of data.  However, Spark on its own is not well-suited for production workloads.  Thus, integrating Spark with Hadoop provides many capabilities that Spark cannot offer on its own.

Hadoop offers YARN as a resource manager, the distributed file system, disaster recovery capabilities, data security, and a distributed data platform.  Spark offers machine learning capabilities to Hadoop that are not easily achieved in Hadoop without Spark (Scott, 2015).   Spark also offers fast, in-memory, real-time data streaming, which Hadoop cannot accomplish without Spark (Scott, 2015).  In summary, although Hadoop has its limitations, Spark is not replacing Hadoop but empowering it.

Conclusion

This discussion has covered significant topics relevant to Hadoop and Spark.  It began with Big Data, its complex characteristics, and the urgent need for technologies and tools to deal with it. Hadoop and Spark, as emerging technologies, and their building blocks have been addressed in this discussion.  The differences between Spark and Hadoop are also covered. The conclusion of this discussion is that Spark is not replacing Hadoop and MapReduce.  Spark offers various benefits to Hadoop, and at the same time, Hadoop offers various benefits to Spark.  The integration of Spark and Hadoop offers great benefits to data scientists in the Big Data Analytics domain.

References

Guo, S. (2013). Hadoop operations and cluster management cookbook: Packt Publishing Ltd.

Koitzsch, K. (2017). Pro Hadoop Data Analytics: Springer.

Lublinsky, B., Smith, K. T., & Yakubovich, A. (2013). Professional Hadoop solutions: John Wiley & Sons.

Scott, J. A. (2015). Getting Started with Spark: MapR Technologies, Inc.