The Impact of XML on MapReduce

Dr. Aly, O.
Computer Science

Introduction

The purpose of this discussion is to analyze the impact of XML on MapReduce. The discussion addresses the techniques and approaches proposed by various research studies for processing large XML documents using MapReduce. The fragmentation of XML, both in the absence and in the presence of MapReduce, is also discussed to provide a better understanding of the complex task of processing large XML documents in a distributed, scalable MapReduce environment.

XML Query Processing Using MapReduce

The XML format has been used to store data for multiple applications (Aravind & Agrawal, 2014).  To obtain value from XML data, the data needs to be ingested into Hadoop and analyzed there (Aravind & Agrawal, 2014).  The Hadoop ecosystem therefore needs to understand and interpret XML as it is ingested (Aravind & Agrawal, 2014).  MapReduce is a building block of the Hadoop ecosystem.  In the age of Big Data, XML documents are expected to be very large, and their processing must be scalable and distributed.  Processing XML queries with MapReduce requires decomposing a big XML document and distributing its portions to different nodes.  The relational approach is not appropriate because it is expensive: transforming a big XML document into relational database tables can be extremely time consuming, and the θ-joins among the resulting relational tables are costly (Wu, 2014).  Various research studies have proposed approaches for implementing native XML query processing algorithms using MapReduce.

(Dede, Fadika, Gupta, & Govindaraju, 2011) have discussed and analyzed the scalable and distributed processing of scientific XML data, and how the MapReduce model can be used for XML metadata indexing.   The study presented performance results using two MapReduce implementations, the Apache Hadoop framework and the proposed LEMO-MR framework.  The study provided an indexing framework that is capable of indexing and efficiently searching large-scale scientific XML datasets.  The framework has been tailored for integration with any framework that uses the MapReduce model, in order to meet scalability and variety requirements.
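To make the indexing idea concrete, the following is a minimal sketch of how such an inverted index over XML metadata might be expressed as a Hadoop job in Java. It is not the LEMO-MR code; the record layout (one small XML metadata record per line, with a name attribute identifying the file and other attributes supplying index terms) is an assumption made purely for illustration.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class XmlMetadataIndex {

  public static class IndexMapper extends Mapper<LongWritable, Text, Text, Text> {
    // Matches attr="value" pairs in a record such as <file name="run42.xml" project="climate"/>
    private static final Pattern ATTR = Pattern.compile("(\\w+)=\"([^\"]*)\"");

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String fileName = null;
      List<String> terms = new ArrayList<>();
      Matcher m = ATTR.matcher(value.toString());
      while (m.find()) {
        if (m.group(1).equals("name")) {
          fileName = m.group(2);   // the document this metadata record describes
        } else {
          terms.add(m.group(2));   // every other attribute value becomes an index term
        }
      }
      if (fileName == null) return;  // skip malformed records
      for (String term : terms) {
        context.write(new Text(term), new Text(fileName));
      }
    }
  }

  public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text term, Iterable<Text> files, Context context)
        throws IOException, InterruptedException {
      StringBuilder postings = new StringBuilder();
      for (Text f : files) {
        if (postings.length() > 0) postings.append(',');
        postings.append(f.toString());
      }
      context.write(term, new Text(postings.toString()));  // term -> posting list
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "xml metadata index");
    job.setJarByClass(XmlMetadataIndex.class);
    job.setMapperClass(IndexMapper.class);
    job.setReducerClass(IndexReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Once such an index is built, answering a metadata query reduces to looking up a term's posting list rather than rescanning the raw XML, which is what makes the approach scale to large datasets.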

(Fegaras, Li, Gupta, & Philip, 2011) have also discussed and analyzed query optimization in a MapReduce environment. The study presented a novel query language for large-scale analysis of XML data in a MapReduce environment, called MRQL (MapReduce Query Language), which is designed to capture the most common data analysis tasks in a form that can be optimized.   XML data fragmentation is also discussed in this study.  A parallel data computation expects its input data to be fragmented into small, manageable pieces, which determine the granularity of the computation.  In a MapReduce environment, each map worker is assigned a data split that consists of data fragments, and it processes these data one fragment at a time.  For structured relational data, the fragment is a relational tuple, while for a text file, a fragment can be a single line.  However, for hierarchical data and nested collections such as XML data, the fragment size and structure depend on the actual application that processes the data.  For instance, an XML dataset may consist of a few XML documents, each containing a single XML element whose size exceeds the memory capacity of a map worker.  Thus, when processing XML data, it is recommended to allow custom fragmentation to meet a wide range of application requirements.  (Fegaras et al., 2011) have noted that Hadoop provides a simple input format for XML fragmentation based on a single tag name.  The data splits of an XML document may start and end at arbitrary points in the document, even in the middle of a tag name.  (Fegaras et al., 2011; Sakr & Gaber, 2014) have indicated that this input format allows reading the document as a stream of string fragments, such that each string contains a single complete element with the requested tag name.   An XML parser can then be used to parse these strings and convert them to objects.  The fragmentation process is complex because the requested elements may cross data split boundaries, and these data splits may reside on different data nodes in the distributed file system (DFS).  The Hadoop DFS solves this problem implicitly by allowing a reader to scan beyond its data split into the next one, subject to some overhead for transferring data between nodes.  (Fegaras et al., 2011) have proposed an XML fragmentation technique built on top of the existing Hadoop XML input format that provides a higher level of abstraction and better customization.  It is a higher level of abstraction because, instead of delivering a string for each XML element, it constructs XML data in the MRQL data model, ready to be processed by MRQL queries (Fegaras et al., 2011).
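To illustrate the tag-based fragmentation that such an input format performs, the following is a minimal Java sketch, not the Hadoop or MRQL implementation: it scans a character stream for complete elements with a requested tag name and returns each one as a string fragment. The tag name and sample document are assumptions for illustration, and a real Hadoop record reader would additionally track byte offsets so it can scan past the end of its split into the next one, as described above.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class TagFragmentReader {
  private final BufferedReader in;
  private final String startTag; // e.g. "<item"; simplified: a production reader must
                                 // also check the next character so "<item" does not
                                 // accidentally match "<items"
  private final String endTag;   // e.g. "</item>"

  public TagFragmentReader(Reader source, String tagName) {
    this.in = new BufferedReader(source);
    this.startTag = "<" + tagName;
    this.endTag = "</" + tagName + ">";
  }

  /** Returns the next complete element as a string, or null at end of stream. */
  public String next() throws IOException {
    if (!scanUntil(startTag, null)) return null;   // discard input before the start tag
    StringBuilder buf = new StringBuilder(startTag);
    if (!scanUntil(endTag, buf)) return null;      // truncated element: no complete fragment
    return buf.append(endTag).toString();
  }

  // Consumes input until the marker is seen; copies consumed chars into buf if non-null.
  // The naive matching is sound here because '<' never reappears inside a tag marker.
  private boolean scanUntil(String marker, StringBuilder buf) throws IOException {
    int matched = 0;
    int c;
    while ((c = in.read()) != -1) {
      char ch = (char) c;
      if (ch == marker.charAt(matched)) {
        if (++matched == marker.length()) return true;
      } else {
        if (buf != null) buf.append(marker, 0, matched);  // flush the failed partial match
        if (ch == marker.charAt(0)) {
          matched = 1;
        } else {
          if (buf != null) buf.append(ch);
          matched = 0;
        }
      }
    }
    return false;
  }

  public static void main(String[] args) throws IOException {
    String doc = "<doc><item id=\"1\">a</item><item id=\"2\">b</item></doc>";
    TagFragmentReader reader = new TagFragmentReader(new StringReader(doc), "item");
    for (String fragment; (fragment = reader.next()) != null; ) {
      System.out.println(fragment);  // each line is one complete <item> element
    }
  }
}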

(Sakr & Gaber, 2014) have also briefly discussed another language proposed to support distributed XML processing on the MapReduce framework, called ChuQL.  ChuQL is a MapReduce-based extension of the syntax, grammar, and semantics of XQuery, the standard W3C language for querying XML documents.  The implementation of ChuQL takes care of distributing the computation to multiple XQuery engines running on Hadoop nodes, as described by one or more ChuQL MapReduce expressions.  The classic "word count" example program can be represented in ChuQL using these extended expressions, where a MapReduce expression describes a MapReduce job.  The input and output clauses are used to read from and write to HDFS, respectively.  The rr and rw clauses describe the record reader and record writer, respectively.  The map and reduce clauses represent the standard map and reduce phases of the framework; they process XML values, or key/value pairs of XML values to match the MapReduce model, and are specified using XQuery expressions.  Figure 1 shows the word count example in ChuQL using XML in a distributed environment.

Figure 1.  The Word Count Example Program in ChuQL Using XML in a Distributed Environment (Sakr & Gaber, 2014).
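For readers more familiar with the Java API, the following is a rough analogue of the word count in Figure 1, written directly against Hadoop's map and reduce phases. It is a sketch, not ChuQL code: it assumes each input value is one XML fragment whose markup is stripped before the words are counted, and the job driver (identical in shape to the one in the indexing sketch above) is omitted.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class XmlWordCount {
  // Analogue of the map clause: take one XML fragment, strip markup, emit (word, 1).
  public static class MapPhase extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable offset, Text fragment, Context ctx)
        throws IOException, InterruptedException {
      String text = fragment.toString().replaceAll("<[^>]*>", " ");  // crude markup removal
      StringTokenizer words = new StringTokenizer(text);
      while (words.hasMoreTokens()) {
        ctx.write(new Text(words.nextToken()), ONE);
      }
    }
  }

  // Analogue of the reduce clause: sum the counts emitted for each word.
  public static class ReducePhase extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) sum += c.get();
      ctx.write(word, new IntWritable(sum));
    }
  }
}

The point of the comparison is that what takes two Java classes here is expressed in ChuQL as a single declarative MapReduce expression whose clauses are ordinary XQuery.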

(Vasilenko & Kurapati, 2014) have discussed and analyzed the efficient processing of XML documents in a Hadoop MapReduce environment.  They argued that the most common approach to processing XML data is to introduce a custom solution based on user-defined functions or scripts.  The common choices range from introducing an ETL process for extracting the data of interest to transforming XML into other formats that are natively supported by Hive.   They have addressed a generic approach to handling XML based on the Apache Hive architecture.  The researchers described an approach that complements the existing family of Hive serializers and de-serializers for other popular data formats, such as JSON, and makes it much easier for users to deal with large XML datasets.  The implementation divides the input files into logical splits, each of which is assigned to an individual mapper.  The mapper relies on the implemented Apache Hive XML SerDe to break the split into XML fragments using specified start/end byte sequences.  Each fragment corresponds to a single Hive record.  The fragments are handled by the XML processor, which extracts the values for the record's columns using specified XPath queries.  The reduce phase was not required in this implementation (Vasilenko & Kurapati, 2014).
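The column-extraction step can be pictured with a small Java sketch using the JDK's built-in XPath support. The fragment layout, the start/end sequences, and the XPath expressions below are hypothetical; the sketch shows only the kind of work performed per record, not the SerDe's actual code.

import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class FragmentColumnExtractor {
  public static void main(String[] args) throws Exception {
    // One fragment, as a mapper would receive it after the split is broken
    // on the start/end byte sequences "<order" / "</order>".
    String fragment =
        "<order id=\"42\"><customer>Acme</customer><total>99.50</total></order>";

    DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
    Document doc = builder.parse(
        new ByteArrayInputStream(fragment.getBytes(StandardCharsets.UTF_8)));

    // One XPath query per Hive column, mirroring the SerDe configuration.
    XPath xpath = XPathFactory.newInstance().newXPath();
    String id = xpath.evaluate("/order/@id", doc);
    String customer = xpath.evaluate("/order/customer/text()", doc);
    String total = xpath.evaluate("/order/total/text()", doc);

    System.out.println(id + "\t" + customer + "\t" + total);  // one Hive record
  }
}

Because each fragment maps to exactly one record and the extraction is purely local to the mapper, no reduce phase is needed, which is consistent with the map-only design described above.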

(Wu, 2014) has discussed and analyzed the partitioning of XML documents and the distribution of XML fragments to different compute nodes, which can introduce high overhead for transferring XML fragments from one node to another during the execution of a MapReduce process.  The study proposed a technique that uses MapReduce to distribute the labels in inverted lists across a computing cluster so that structural joins can be performed in parallel to process queries.  It also proposed an optimization technique that reduces the computing space in the proposed framework to improve the performance of query processing.  Wu argued that this approach differs from the prevailing approach of shredding an XML document and distributing its fragments to different nodes in a cluster.  The process reads and distributes only the inverted lists required by the input queries, and their size is much smaller than the size of the whole document.  The process also partitions the total computing space for structural joins so that each sub-space can be handled by one reducer performing structural joins.  A pruning-based optimization algorithm was also proposed to further improve the performance of this approach.
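The structural join at the heart of this approach can be sketched in a few lines of Java. The region-encoded labels and the toy query below are assumptions for illustration, and the code shows only the per-reducer join, not Wu's partitioning or pruning scheme. Under region encoding, an element A is an ancestor of element D exactly when A.start < D.start and D.end < A.end, and both inverted lists are kept sorted by start position.

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class StructuralJoin {
  // Region-encoded label: the element spans positions [start, end] in document order.
  record Label(int start, int end) {}

  /** Emits every (ancestor, descendant) pair between two lists sorted by start. */
  static List<Label[]> join(List<Label> ancestors, List<Label> descendants) {
    List<Label[]> out = new ArrayList<>();
    Deque<Label> stack = new ArrayDeque<>();  // ancestors whose regions are still open
    int a = 0;
    for (Label d : descendants) {
      // Push every ancestor candidate that starts before this descendant.
      while (a < ancestors.size() && ancestors.get(a).start() < d.start()) {
        stack.push(ancestors.get(a++));
      }
      // Pop candidates whose regions closed before this descendant began.
      while (!stack.isEmpty() && stack.peek().end() < d.start()) {
        stack.pop();
      }
      // By the nesting property of XML regions, everything left contains d.
      for (Label anc : stack) {
        out.add(new Label[] {anc, d});
      }
    }
    return out;
  }

  public static void main(String[] args) {
    // A toy //section//figure query: section(1,10) nests section(3,8), which nests figure(5,6).
    List<Label> sections = List.of(new Label(1, 10), new Label(3, 8));
    List<Label> figures = List.of(new Label(5, 6));
    for (Label[] pair : join(sections, figures)) {
      System.out.println("section(" + pair[0].start() + "," + pair[0].end()
          + ") contains figure(" + pair[1].start() + "," + pair[1].end() + ")");
    }
  }
}

Because a join like this needs only the two inverted lists of labels, not the element contents, shipping these lists between nodes is far cheaper than shipping the XML fragments themselves, which is the source of the overhead savings claimed in the study.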

Conclusion

This discussion has addressed XML query processing in a MapReduce environment.  It has covered the techniques and approaches proposed by various research studies for processing large XML documents using MapReduce.  The fragmentation of XML, both in the absence and in the presence of MapReduce, has also been discussed to provide a better understanding of the complex task of processing large XML documents in a distributed, scalable MapReduce environment.

References

Aravind, P. S., & Agrawal, V. (2014). Processing XML data in BigInsights 3.0. Retrieved from https://developer.ibm.com/hadoop/2014/10/31/processing-xml-data-biginsights-3-0/.

Dede, E., Fadika, Z., Gupta, C., & Govindaraju, M. (2011). Scalable and distributed processing of scientific XML data. Paper presented at the 12th IEEE/ACM International Conference on Grid Computing (GRID), 2011.

Fegaras, L., Li, C., Gupta, U., & Philip, J. (2011). XML Query Optimization in Map-Reduce.

Sakr, S., & Gaber, M. (2014). Large scale and big data: Processing and management. CRC Press.

Vasilenko, D., & Kurapati, M. (2014). Efficient processing of XML documents in Hadoop MapReduce.

Wu, H. (2014). Parallelizing structural joins to process queries over big XML data using MapReduce. Paper presented at the International Conference on Database and Expert Systems Applications.