RDF Data Query Processing Performance

Dr. Aly, O.
Computer Science

Abstract

The purpose of this paper is to provide a survey of state-of-the-art techniques for applying MapReduce to improve RDF data query processing performance.  Tremendous effort has been exerted by industry and researchers to develop efficient and scalable RDF processing systems.  This paper discusses and analyzes the RDF framework and the major building blocks of the Semantic Web architecture, along with the RDF store architecture, the MapReduce parallel processing framework, and Hadoop.  Researchers have developed and standardized Semantic Web technologies to address the inadequacy of traditional analytical techniques, and this paper discusses and analyzes the most prominent of these standards, RDF and SPARQL.  The discussion and analysis cover techniques applied on MapReduce to improve RDF query processing performance, including RDFPath, PigSPARQL, Interactive SPARQL Query Processing on Hadoop (Sempala), the Map-Side Index Nested Loop Join (MAPSIN join), HadoopRDF, RDF-3X (RDF Triple eXpress), and Rya (a scalable RDF triple store for the Clouds). 

Keywords: RDF, SPARQL, MapReduce, Performance

MapReduce and RDF Data Query Processing Optimized Performance

            This project provides a survey of state-of-the-art techniques for applying MapReduce to improve Resource Description Framework (RDF) data query processing performance.  There has been a tremendous effort from industry and researchers to develop efficient and scalable RDF processing systems.  Most complex data-processing tasks require multiple MapReduce cycles chained together sequentially, so a task is often decomposed into cycles or subtasks.  Because each MapReduce cycle incurs significant overhead, keeping the overall workflow cost low is a key concern in this decomposition.  For RDF, the decomposition problem amounts to distributing operations such as SELECT and JOIN into subtasks, each supported by a MapReduce cycle.  Decomposition is also related to operation order, because neighboring operations in a query plan can often be grouped into the same subtask.  When using MapReduce, the operation order is constrained by the key-partitioning requirement, so that neighboring operations do not conflict.  Various techniques have been proposed to enhance the performance of Semantic Web queries using RDF and MapReduce. 

This project begins with an overview of RDF, followed by the RDF store architecture, the MapReduce parallel processing framework, and Hadoop.  RDF and SPARQL semantic querying are then discussed, covering the syntax of SPARQL and the missing features required to enhance the performance of RDF using MapReduce.  Various techniques for applying MapReduce to improve RDF query processing performance are discussed and analyzed, including HadoopRDF, RDFPath, and PigSPARQL.

Resource Description Framework (RDF)

Resource Description Framework (RDF) is described as an emerging standard for processing metadata (Punnoose, Crainiceanu, & Rapp, 2012; Tiwana & Balasubramaniam, 2001).  RDF provides interoperability between applications that exchange machine-understandable information on the Web (Sakr & Gaber, 2014; Tiwana & Balasubramaniam, 2001).   The primary goal of RDF is to define a mechanism and provide standards for metadata and for describing resources on the Web (Boussaid, Tanasescu, Bentayeb, & Darmont, 2007; Firat & Kuzu, 2011; Tiwana & Balasubramaniam, 2001), making no a priori assumptions about a particular application domain or the associated semantics (Tiwana & Balasubramaniam, 2001).  These standards and mechanisms can prevent users from accessing irrelevant subjects because RDF provides metadata that is relevant to the desired information (Firat & Kuzu, 2011).

RDF is also described as a Data Model (Choi, Son, Cho, Sung, & Chung, 2009; Myung, Yeon, & Lee, 2010) for representing labeled directed graphs (Choi et al., 2009; Nicolaidis & Iniewski, 2017), and is useful for Data Warehousing solutions such as the MapReduce framework (Myung et al., 2010).  RDF is an important building block of the Semantic Web (Web 3.0) (see Figure 1) (Choi et al., 2009; Firat & Kuzu, 2011).  The technologies of the Semantic Web are useful for maintaining data in the Cloud (M. F. Husain, Khan, Kantarcioglu, & Thuraisingham, 2010).  These technologies provide the ability to specify and query heterogeneous data in a standard manner (M. F. Husain et al., 2010).   

The RDF Data Model can be extended to ontologies with RDF Schema (RDFS) and the Ontology Web Language (OWL), which provide techniques to define and identify vocabularies specific to a certain domain, their schema, and the relations between the elements of the vocabulary (Choi et al., 2009).   RDF can be exported in various file formats (Sun & Jara, 2014).  The most common of these formats are RDF + XML and XSD (Sun & Jara, 2014).  OWL is used to add semantics to the schema (Sun & Jara, 2014).  For instance, “A isAssociatedWith B” implies that “B isAssociatedWith A” (Sun & Jara, 2014).  OWL allows these two statements to be expressed the same way (Sun & Jara, 2014).  This similarity feature allowed by OWL is very useful for “joining” data expressed in different schemas (Sun & Jara, 2014).  It allows building relationships and joining data from multiple sites, described as “Linked Data,” facilitating heterogeneous data stream integration (Sun & Jara, 2014).  OWL enables new facts to be derived from known facts using inference rules (Nicolaidis & Iniewski, 2017).  For example, if one triple states that a car is a subtype of a vehicle and another triple states that a Cabrio is a subtype of a car, the new fact that a Cabrio is a vehicle can be inferred from the previous facts (Nicolaidis & Iniewski, 2017).
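The subtype inference above can be sketched as a simple fixed-point computation.  The following is a minimal illustration of the idea, not an OWL reasoner; the predicate name "subClassOf" and the class names are purely illustrative.

```python
# Illustrative sketch: deriving new facts from subclass triples by
# computing the transitive closure of the "subClassOf" relation.
def infer_subclass_closure(triples):
    """Repeatedly apply the rule (a subClassOf b) and (b subClassOf c)
    implies (a subClassOf c) until no new triples appear."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        for (a, p1, b) in list(inferred):
            for (b2, p2, c) in list(inferred):
                if p1 == p2 == "subClassOf" and b == b2:
                    new = (a, "subClassOf", c)
                    if new not in inferred:
                        inferred.add(new)
                        changed = True
    return inferred

facts = {("Cabrio", "subClassOf", "Car"), ("Car", "subClassOf", "Vehicle")}
closure = infer_subclass_closure(facts)
print(("Cabrio", "subClassOf", "Vehicle") in closure)  # True
```

A real reasoner would of course index the triples rather than scan them pairwise, but the fixed-point structure is the same.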

The RDF Data Model is described as a simple and flexible framework (Myung et al., 2010).   The underlying form of expression in RDF is a collection of “triples,” each consisting of a subject (s), a predicate (p), and an object (o) (Brickley, 2014; Connolly & Begg, 2015; Nicolaidis & Iniewski, 2017; Przyjaciel-Zablocki, Schätzle, Skaley, Hornung, & Lausen, 2013; Punnoose et al., 2012).   The subjects and predicates are Resources, each encoded as a Uniform Resource Identifier (URI) to ensure uniqueness, while the object can be a Resource or a Literal such as a string, date, or number (Nicolaidis & Iniewski, 2017).  According to Firat and Kuzu (2011), the basic structure of the RDF Data Model is based on a triplet of object (O), quality (A), and value (V), and the basic role of RDF is to provide a Data Model of object, quality, and value (OAV).  The RDF Data Model is similar to the XML Data Model in that neither includes form-related information or names (Firat & Kuzu, 2011).
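The triple model described above can be sketched very directly: a graph is just a set of (subject, predicate, object) tuples.  The URIs and property names below (an `example.org` subject and `foaf` predicates) are illustrative, not taken from any particular dataset.

```python
# A minimal sketch of the RDF data model: a graph is a set of
# (subject, predicate, object) triples.  Subjects and predicates are
# URIs; objects may be URIs or literals (strings, dates, numbers).
graph = {
    ("http://example.org/alice", "http://xmlns.com/foaf/0.1/name", "Alice"),
    ("http://example.org/alice", "http://xmlns.com/foaf/0.1/knows",
     "http://example.org/bob"),
}

def objects_of(graph, subject, predicate):
    # Retrieve all objects for a given subject and predicate.
    return {o for (s, p, o) in graph if s == subject and p == predicate}

print(objects_of(graph, "http://example.org/alice",
                 "http://xmlns.com/foaf/0.1/name"))  # {'Alice'}
```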

Figure 1:  The Major Building Blocks of the Semantic Web Architectures. Adapted from (Firat & Kuzu, 2011).

            RDF has been commonly used in applications such as the Semantic Web, bioinformatics, and social networks because of its great flexibility and applicability (Choi et al., 2009).   These applications require huge computations over large sets of data, so large-scale graph datasets are very common among them (Choi et al., 2009).  However, the traditional techniques for processing such large-scale datasets are found to be inadequate (Choi et al., 2009).   Moreover, the flexibility of the RDF Data Model enables existing heterogeneous database systems to be integrated into a Data Warehouse (Myung et al., 2010), and it provides users the inference capability to discover unknown knowledge, which is useful for large-scale data analysis (Myung et al., 2010).   RDF triples require terabytes of disk space for storage and analysis (M. Husain, McGlothlin, Masud, Khan, & Thuraisingham, 2011; M. F. Husain et al., 2010).  Researchers are encouraged to develop efficient repositories because there are only a few existing frameworks, such as RDF-3X, Jena, Sesame, and BigOWLIM, for Semantic Web technologies (M. Husain et al., 2011; M. F. Husain et al., 2010).  These frameworks are single-machine RDF systems and are widely used because they are user-friendly and perform well for small and medium-sized RDF datasets (M. Husain et al., 2011; M. F. Husain et al., 2010; Sakr & Gaber, 2014).  RDF-3X is regarded as the fastest single-machine RDF system in terms of query performance, vastly outperforming previous single-machine systems (M. Husain et al., 2011; M. F. Husain et al., 2010; Sakr & Gaber, 2014).  However, the performance of RDF-3X diminishes for queries with unbound objects and a low selectivity factor (M. Husain et al., 2011; M. F. Husain et al., 2010; Sakr & Gaber, 2014).  These frameworks are confronted by large RDF graphs (M. Husain et al., 2011; M. F. Husain et al., 2010).   Therefore, storing a large volume of RDF triples and querying them efficiently are challenging and are regarded as critical problems in the Semantic Web (M. Husain et al., 2011; M. F. Husain et al., 2010; Sakr & Gaber, 2014).  These challenges also limit scaling capabilities (M. Husain et al., 2011; M. F. Husain et al., 2010; Sakr & Gaber, 2014).

RDF Store Architecture

            The main purpose of an RDF store is to provide a database for storing and retrieving any data expressed in RDF (Modoni, Sacco, & Terkaj, 2014).  The term RDF store is used as an abstraction for any system that can handle RDF data, allowing the ingestion of serialized RDF data and the retrieval of these data, and providing a set of APIs to facilitate integration with third-party client applications (Modoni et al., 2014).  The term triple store often refers to these types of systems (Modoni et al., 2014).  An RDF store includes two major components: the Repository and the Middleware.  The Repository represents a set of files or a database (Modoni et al., 2014).  The Middleware sits on top of the Repository and is in constant communication with it (Modoni et al., 2014).  Figure 2 illustrates the RDF store architecture.  The Middleware has its own components: the Storage Provider, Query Engine, Parser/Serializer, and Client Connector (Modoni et al., 2014).  Current RDF stores are categorized into three groups: database-based stores, native stores, and hybrid stores (Modoni et al., 2014).  Examples of database-based stores include MySQL and Oracle 12c, which are built on top of existing commercial database engines (Modoni et al., 2014).  Examples of native stores are AllegroGraph and OWLIM, which are built as database engines from scratch (Modoni et al., 2014).  Examples of hybrid stores are Virtuoso and Sesame, which support both architectural styles: native and DBMS-backed (Modoni et al., 2014).

Figure 2:  RDF Store Architecture (Modoni et al., 2014)

MapReduce Parallel Processing Framework and Hadoop

In 2004, Google introduced the MapReduce framework as a parallel processing framework for dealing with large sets of data (Bakshi, 2012; Fadzil, Khalid, & Manaf, 2012; White, 2012).  The MapReduce framework has gained much popularity because it hides the sophisticated operations of parallel processing (Fadzil et al., 2012).  Various MapReduce frameworks such as Hadoop were introduced because of the enthusiasm towards MapReduce (Fadzil et al., 2012).  The capability of the MapReduce framework was realized in different research areas such as data warehousing, data mining, and bioinformatics (Fadzil et al., 2012).  The MapReduce framework consists of two main layers: the Distributed File System (DFS) layer to store data and the MapReduce layer for data processing (Lee, Lee, Choi, Chung, & Moon, 2012; Mishra, Dehuri, & Kim, 2016; Sakr & Gaber, 2014).  The DFS is a major feature of the MapReduce framework (Fadzil et al., 2012).  

The MapReduce framework uses large clusters of low-cost commodity hardware to lower cost (Bakshi, 2012; H. Hu, Wen, Chua, & Li, 2014; Inukollu, Arsi, & Ravuri, 2014; Khan et al., 2014; Krishnan, 2013; Mishra et al., 2016; Sakr & Gaber, 2014; White, 2012).  It uses “Redundant Arrays of Independent (and inexpensive) Nodes (RAIN),” whose components are loosely coupled, so that when any node goes down there is no negative impact on the MapReduce job (Sakr & Gaber, 2014; Yang, Dasdan, Hsiao, & Parker, 2007).  The framework achieves fault tolerance by applying replication, allowing any crashed node to be replaced by another node without affecting the currently running job (P. Hu & Dai, 2014; Sakr & Gaber, 2014).  It also provides automatic parallelization of execution, which makes MapReduce highly parallel and yet abstracted (P. Hu & Dai, 2014; Sakr & Gaber, 2014). 

MapReduce was introduced to solve the problem of parallel processing of large sets of data in a distributed environment, which had previously required manual management of hardware resources (Fadzil et al., 2012; Sakr & Gaber, 2014).  The complexity of parallelization is addressed by two techniques: the Map/Reduce technique and the Distributed File System (DFS) technique (Fadzil et al., 2012; Sakr & Gaber, 2014).  A parallel framework must be reliable to ensure good resource management in a distributed environment, using off-the-shelf hardware to solve the scalability issue and support any future processing requirement (Fadzil et al., 2012).   Earlier frameworks such as the Message Passing Interface (MPI) had reliability and fault-tolerance issues when processing large sets of data (Fadzil et al., 2012).  The MapReduce framework covers the two categories of scalability: structural scalability and load scalability (Fadzil et al., 2012).  It addresses structural scalability by using the DFS, which forms a large virtual storage for the framework by adding off-the-shelf hardware.  It addresses load scalability by increasing the number of nodes to improve performance (Fadzil et al., 2012). 
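The Map/Reduce technique described above can be sketched in a few lines.  The following single-process word-count example is purely illustrative: it shows the map, shuffle (group by key), and reduce phases that a framework like Hadoop distributes across a cluster.

```python
# A minimal single-process sketch of the MapReduce programming model.
from collections import defaultdict

def map_phase(record):
    # Map: emit (key, value) pairs, here one (word, 1) per word.
    for word in record.split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: aggregate all values seen for one key.
    return (key, sum(values))

records = ["rdf sparql rdf", "sparql mapreduce"]
pairs = [kv for r in records for kv in map_phase(r)]
result = dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())
print(result)  # {'rdf': 2, 'sparql': 2, 'mapreduce': 1}
```

In a real deployment the records live in DFS blocks, the map and reduce functions run on different nodes, and the shuffle moves data over the network, which is exactly the overhead the later sections try to minimize.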

However, the earlier version of the MapReduce framework faced challenges.  Among these are the join operation and the lack of support for aggregate functions for joining multiple datasets in one task (Sakr & Gaber, 2014).  Another limitation of the standard MapReduce framework is in iterative processing, which is required for analysis techniques such as the PageRank algorithm, recursive relational queries, and social network analysis (Sakr & Gaber, 2014).  The standard MapReduce does not share the execution of work to reduce the overall amount of work (Sakr & Gaber, 2014).  A further limitation is the lack of support for data indexes and column storage: only sequential scans of the input data are supported, which affects query performance (Sakr & Gaber, 2014).  Moreover, many have argued that MapReduce is not the optimal solution for structured data.   It is known as a shared-nothing architecture, which supports scalability (Bakshi, 2012; Jinquan, Jie, Shengsheng, Yan, & Yuanhao, 2012; Sakr & Gaber, 2014; White, 2012) and the processing of large unstructured data sets (Bakshi, 2012).  MapReduce also has limitations of performance and efficiency (Lee et al., 2012).

Hadoop is a software framework derived from Bigtable and MapReduce and managed by Apache.  It was created by Doug Cutting and was named after his son’s toy elephant (Mishra et al., 2016).  Hadoop allows applications to run on huge clusters of commodity hardware based on MapReduce (Mishra et al., 2016).  The underlying concept of Hadoop is to allow parallel processing of data across different computing nodes to speed up computations and hide latency (Mishra et al., 2016).  The Hadoop Distributed File System (HDFS) is one of the major components of the Hadoop framework for storing large files (Bao, Ren, Zhang, Zhang, & Luo, 2012; Cloud Security Alliance, 2013; De Mauro, Greco, & Grimaldi, 2015) and for allowing access to data scattered over multiple nodes without any exposure to the complexity of the environment (Bao et al., 2012; De Mauro et al., 2015).  The MapReduce programming model is another major component of the Hadoop framework (Bao et al., 2012; Cloud Security Alliance, 2013; De Mauro et al., 2015), designed to implement distributed and parallel algorithms efficiently (De Mauro et al., 2015).  HBase is the third component of the Hadoop framework (Bao et al., 2012).  HBase is developed on top of HDFS and is a NoSQL (Not only SQL) database (Bao et al., 2012). 

The key features of Hadoop include scalability, flexibility, cost efficiency, and fault tolerance (H. Hu et al., 2014; Khan et al., 2014; Mishra et al., 2016; Polato, Ré, Goldman, & Kon, 2014; Sakr & Gaber, 2014).  Hadoop allows the nodes in the cluster to scale up and down based on the computation requirements with no change in the data formats (H. Hu et al., 2014; Polato et al., 2014).  Hadoop also brings massively parallel computation to commodity hardware, decreasing the cost per terabyte of storage and making massively parallel computation affordable as the volume of data increases (H. Hu et al., 2014).  Hadoop offers flexibility because it is not tied to a schema, which allows the use of structured, unstructured, or semi-structured data and the aggregation of data from multiple sources (H. Hu et al., 2014; Polato et al., 2014).  Hadoop also allows nodes to crash without affecting the data processing: it provides a fault-tolerant environment where data and computation can be recovered without any negative impact on the processing of the data (H. Hu et al., 2014; Polato et al., 2014; White, 2012). 

RDF and SPARQL Using Semantic Query

Researchers have exerted effort in developing Semantic Web technologies, which have been standardized to address the inadequacy of traditional analytical techniques (M. Husain et al., 2011; M. F. Husain et al., 2010).  RDF and SPARQL (SPARQL Protocol and RDF Query Language) are the most prominent standardized Semantic Web technologies (M. Husain et al., 2011; M. F. Husain et al., 2010).   In 2007, the Data Access Working Group (DAWG) of the World Wide Web Consortium (W3C) recommended SPARQL and provided standards for it as the query language for RDF, a protocol definition for sending SPARQL queries from a client to a query processor, and an XML-based serialization format for results returned by a SPARQL query (Konstantinou, Spanos, Stavrou, & Mitrou, 2010; Sakr & Gaber, 2014).  RDF is regarded as the standard for storing and representing data (M. Husain et al., 2011; M. F. Husain et al., 2010).  SPARQL is the query language for retrieving data from an RDF triplestore (M. Husain et al., 2011; M. F. Husain et al., 2010; Nicolaidis & Iniewski, 2017; Sakr & Gaber, 2014; Zeng, Yang, Wang, Shao, & Wang, 2013).   Like RDF, SPARQL is built on the “triple pattern,” which also contains a “subject,” “predicate,” and “object” and is terminated with a full stop (Connolly & Begg, 2015).  An RDF triple is itself a valid SPARQL triple pattern (Connolly & Begg, 2015).  URIs are written inside angle brackets to identify resources; literal strings are denoted with either double or single quotes; properties, like Name, can be identified by their URI or, more commonly, using a QName-style syntax to improve readability (Connolly & Begg, 2015).  Unlike a triple, however, a triple pattern can include variables (Connolly & Begg, 2015).  Any or all of the subject, predicate, and object values in a triple pattern may be replaced by a variable, which indicates data items of interest to be returned by a query (Connolly & Begg, 2015).   

The semantic query plays a significant role in the Semantic Web, and the standardization of SPARQL plays a significant role in achieving such semantic queries (Konstantinou et al., 2010).  Unlike traditional query languages, SPARQL does not consider the graph level; rather, it models the graph as a set of triples (Konstantinou et al., 2010).  Thus, a SPARQL query identifies a graph pattern, and the nodes that match this pattern are returned (Konstantinou et al., 2010; Zeng et al., 2013).  SPARQL syntax is similar to SQL; the SELECT FROM WHERE form is the most striking similarity (Konstantinou et al., 2010).  The core of SPARQL is a conjunctive set of triple patterns called a “basic graph pattern” (Zeng et al., 2013).  Table 1 shows the syntax of a SPARQL SELECT statement for retrieving data from RDF. 

Table 1:  Example of SPARQL syntax (Sakr & Gaber, 2014)
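The evaluation of a basic graph pattern can be sketched as follows.  This is an illustrative, non-distributed matcher, not a SPARQL engine: the graph is a set of triples, the query is a list of triple patterns whose "?"-prefixed terms are variables, and matching returns every consistent set of variable bindings.

```python
# Illustrative sketch of SPARQL basic graph pattern (BGP) matching.
def match_bgp(graph, patterns, binding=None):
    binding = binding or {}
    if not patterns:
        return [binding]          # all patterns satisfied
    first, rest = patterns[0], patterns[1:]
    results = []
    for triple in graph:
        new_binding = dict(binding)
        ok = True
        for term, value in zip(first, triple):
            if term.startswith("?"):                 # variable term
                if new_binding.get(term, value) != value:
                    ok = False                       # conflicts with an earlier binding
                    break
                new_binding[term] = value
            elif term != value:                      # constant must match exactly
                ok = False
                break
        if ok:
            results.extend(match_bgp(graph, rest, new_binding))
    return results

graph = [("alice", "knows", "bob"), ("bob", "knows", "carol")]
# "Whom do the people Alice knows themselves know?"
query = [("alice", "knows", "?x"), ("?x", "knows", "?y")]
print(match_bgp(graph, query))  # [{'?x': 'bob', '?y': 'carol'}]
```

Each shared variable (here ?x) implicitly acts as a join key between patterns, which is why evaluating BGPs over large graphs reduces to the join problem discussed later in this paper.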

Although SPARQL syntax is similar to SQL in the context of SELECT for retrieving data, SPARQL is not as mature as SQL (Konstantinou et al., 2010).  The current form of SPARQL allows access to the raw data using URIs from an RDF or OWL graph, leaving the result processing to the user (Konstantinou et al., 2010).  However, SPARQL is expected to become the gateway for querying information and knowledge, supporting as many features as SQL does (Konstantinou et al., 2010).   SPARQL does not support aggregate functions such as MAX, MIN, SUM, AVG, and COUNT, or the GROUP BY operation (Konstantinou et al., 2010).  Moreover, SPARQL supports ORDER BY only on a global level and not solely on the OPTIONAL part of a query (Konstantinou et al., 2010).  For mathematical operations, SPARQL does not extend its support beyond the basic ones (Konstantinou et al., 2010).   SPARQL does not support nested queries, meaning it does not allow a CONSTRUCT query in the FROM part of a query.  Moreover, SPARQL is missing the functionality offered by the SELECT WHERE LIKE statement in SQL, which allows keyword-based queries (Konstantinou et al., 2010).  While SPARQL offers the regex() function for string pattern matching, it cannot emulate the functionality of the LIKE operator (Konstantinou et al., 2010).  SPARQL allows only unbound variables in the SELECT part, rejecting the use of functions or other operators (Konstantinou et al., 2010).  This limitation leaves SPARQL as an elementary query language in which only URIs or literals are returned, while users look for some result processing in practical use cases (Konstantinou et al., 2010).  SPARQL could be enhanced with these missing features and with stored procedures, triggers, and data manipulation operations such as update, insert, and delete (Konstantinou et al., 2010). 
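Because of the missing aggregates described above, applications of that era typically post-processed the raw bindings returned by a SPARQL query on the client side.  The sketch below is illustrative: the result rows and variable names are hypothetical, standing in for the bindings a SELECT query would return.

```python
# Client-side emulation of COUNT ... GROUP BY over SPARQL result bindings,
# since SPARQL 1.0 offered no aggregate functions.
from collections import Counter

# Hypothetical bindings from: SELECT ?author ?paper WHERE { ... }
rows = [
    {"?author": "alice", "?paper": "p1"},
    {"?author": "alice", "?paper": "p2"},
    {"?author": "bob",   "?paper": "p3"},
]

# Equivalent of: SELECT ?author (COUNT(?paper)) ... GROUP BY ?author
papers_per_author = Counter(row["?author"] for row in rows)
print(papers_per_author["alice"])  # 2
```

SPARQL 1.1 later standardized GROUP BY and aggregates, moving this work back into the query engine.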

The SPARQL Working Group is working on integrating these missing features into SPARQL.  SPARQL/Update is an extension to SPARQL included in the leading Semantic Web development framework, Jena, allowing the update operation and the creation and removal of RDF graphs (Konstantinou et al., 2010).  ARQ is a query engine for Jena which supports the SPARQL RDF query language (Apache, 2017a).  Some of the key features of ARQ include update, GROUP BY, access to and extension of the SPARQL algebra, and support for federated queries (Apache, 2017a).   LARQ integrates SPARQL with Apache’s full-text search framework Lucene (Konstantinou et al., 2010), adding free-text searches to SPARQL (Apache, 2017b).  The SPARQL+ extension of the ARC RDF store offers most of the common aggregates and extends SPARUL’s INSERT with a CONSTRUCT clause (Konstantinou et al., 2010).  OpenLink’s Virtuoso extends SPARQL with aggregate functions, nesting, and subqueries, allowing the user to embed SPARQL queries inside SQL (Konstantinou et al., 2010).  SPASQL offers similar functionality, embedding SPARQL into SQL (Konstantinou et al., 2010).    

Although SPARQL is missing many SQL features, it offers features that are not part of SQL (Konstantinou et al., 2010).  One of these is the OPTIONAL operator, which extends results where the optional pattern matches without discarding them where it does not, and which can be found in almost all query languages for RDF (Konstantinou et al., 2010).  This feature is equivalent to the LEFT OUTER JOIN in SQL (Konstantinou et al., 2010).  Moreover, SPARQL syntax is much more user-friendly and intuitive than SQL (Konstantinou et al., 2010). 

Techniques Applied on MapReduce

To Improve RDF Query Processing Performance

With the explosive growth of data size, the traditional approach of analyzing data on a centralized server is not adequate (Punnoose et al., 2012; Sakr & Gaber, 2014) and cannot scale with ever-increasing RDF datasets (Sakr & Gaber, 2014).   Although SPARQL is used to query RDF data, querying RDF datasets at web scale is challenging because the computation of SPARQL queries requires several joins between subsets of the dataset (Sakr & Gaber, 2014).  New methods have been introduced to improve parallel computing and allow storage and retrieval of RDF across large compute clusters, enabling the processing of data of unprecedented magnitude (Punnoose et al., 2012).   Various solutions, such as PigSPARQL and RDFPath, have been introduced to address these challenges and achieve scalable RDF processing using the MapReduce framework.

1. RDFPath

In (Przyjaciel-Zablocki, Schätzle, Hornung, & Lausen, 2011), RDFPath is proposed as a declarative path query language for RDF which, by design, provides a natural mapping to the MapReduce programming model while remaining extensible (Przyjaciel-Zablocki et al., 2011).  It supports the exploration of graph properties such as the shortest connections between two nodes in an RDF graph (Przyjaciel-Zablocki et al., 2011).  RDFPath is regarded as a valuable tool for the analysis of social graphs (Przyjaciel-Zablocki et al., 2011).  RDFPath combines an intuitive syntax for path queries with an effective execution strategy based on MapReduce (Przyjaciel-Zablocki et al., 2011).  RDFPath benefits from the horizontal scaling properties of MapReduce: adding more nodes improves the overall execution time significantly (Przyjaciel-Zablocki et al., 2011).  Using RDFPath, large RDF graphs can be handled while scaling linearly with the size of the graph, and RDFPath can be used to investigate graph properties such as a variant of the famous six degrees of separation paradigm typically encountered in social graphs (Przyjaciel-Zablocki et al., 2011).   RDFPath focuses on path queries and studies their implementation based on MapReduce.  There are various RDF query languages, such as RQL, SeRQL, RDQL, Triple, N3, Versa, RxPath, RPL, and SPARQL (Przyjaciel-Zablocki et al., 2011), and RDFPath has a competitive expressiveness relative to them (Przyjaciel-Zablocki et al., 2011).   A comparison of RDFPath with these other RDF query languages shows that RDFPath matches SPARQL 1.1 for adjacent nodes, adjacent edges, the degree of a node, and fixed-length paths.  However, RDFPath goes beyond SPARQL 1.1 in areas like the distance between two nodes and shortest paths, for which it has partial support.  Conversely, SPARQL 1.1 fully supports aggregate functions, while RDFPath supports them only partially (Przyjaciel-Zablocki et al., 2011).  Table 2 shows the comparison of RDFPath with other RDF query languages, including SPARQL.  

Table 2: Comparison of RDF Query Language, adapted from (Przyjaciel-Zablocki et al., 2011).
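The shortest-connection queries that RDFPath supports can be illustrated with a small, non-distributed sketch: a breadth-first search over the edges of an RDF graph.  RDFPath itself evaluates such path queries as MapReduce jobs; the graph and node names below are purely illustrative.

```python
# Illustrative sketch: shortest path between two nodes of an RDF graph,
# treating each (subject, predicate, object) triple as a directed edge.
from collections import deque

def shortest_path(triples, start, goal):
    adjacency = {}
    for s, _p, o in triples:
        adjacency.setdefault(s, []).append(o)
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()          # BFS pops shortest paths first
        if path[-1] == goal:
            return path
        for nxt in adjacency.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None                          # no connection exists

triples = [("alice", "knows", "bob"), ("bob", "knows", "carol"),
           ("alice", "knows", "dave"), ("dave", "knows", "carol")]
print(shortest_path(triples, "alice", "carol"))  # ['alice', 'bob', 'carol']
```

In a MapReduce setting, each BFS level corresponds roughly to one job over the triple set, which is why the number of chained cycles, and hence the per-cycle overhead, matters so much for path queries.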

2. PigSPARQL

PigSPARQL is regarded as a competitive yet easy-to-use SPARQL query processing system on MapReduce that allows ad-hoc SPARQL query processing on large RDF graphs out of the box (Schätzle, Przyjaciel-Zablocki, Hornung, & Lausen, 2013).  PigSPARQL processes SPARQL queries on the MapReduce framework by translating them into Pig Latin programs, each of which is executed as a series of MapReduce jobs on a Hadoop cluster (Sakr & Gaber, 2014; Schätzle et al., 2013).  PigSPARQL utilizes the query language of Pig, a data analysis platform on top of Hadoop MapReduce, as an intermediate layer between SPARQL and MapReduce (Schätzle et al., 2013).  This intermediate layer provides an abstraction that makes PigSPARQL independent of the Hadoop version and accordingly ensures compatibility with future changes of the Hadoop framework, as they will be covered by the underlying Pig layer (Schätzle et al., 2013).  The Pig Latin approach underpins the sustainability of PigSPARQL and makes it an attractive long-term baseline for comparing various MapReduce-based SPARQL implementations, which is reinforced by its competitiveness with existing systems such as HadoopRDF (Schätzle et al., 2013).  As illustrated in Figure 3, the PigSPARQL workflow begins by parsing the SPARQL query to generate an abstract syntax tree, which is translated into a SPARQL algebra tree (Schätzle et al., 2013).  Several optimizations are applied at the algebra level, such as the early execution of filters and a re-arrangement of triple patterns by selectivity (Schätzle et al., 2013).  The optimized algebra tree is traversed bottom-up, and an equivalent sequence of Pig Latin expressions is generated for every SPARQL algebra operator (Schätzle et al., 2013).  At runtime, Pig automatically maps the resulting Pig Latin script into a sequence of MapReduce iterations (Schätzle et al., 2013).

PigSPARQL is described as an easy-to-use and competitive baseline for the comparison of MapReduce-based SPARQL processing.  With its support for SPARQL 1.0, PigSPARQL exceeds the functionality of most existing research prototypes (Schätzle et al., 2013). 
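The bottom-up translation idea can be sketched as a toy compiler.  This is a heavily simplified illustration of the approach, not PigSPARQL's actual translator: the algebra operators, relation names, and emitted Pig Latin strings below are all hypothetical.

```python
# Hedged sketch: translate a tiny SPARQL-algebra-like tree bottom-up into
# Pig-Latin-style statements (which Pig would compile into MapReduce jobs).
def translate(node, counter=None):
    """Return (relation_name, statements) for an algebra node."""
    if counter is None:
        counter = [0]
    counter[0] += 1
    name = f"rel{counter[0]}"
    if node[0] == "BGP":                     # leaf: one triple pattern
        _s, p, _o = node[1]
        return name, [f"{name} = FILTER triples BY predicate == '{p}';"]
    if node[0] == "JOIN":                    # inner node: join two subplans
        left, lstmts = translate(node[1], counter)
        right, rstmts = translate(node[2], counter)
        stmt = f"{name} = JOIN {left} BY subject, {right} BY subject;"
        return name, lstmts + rstmts + [stmt]
    raise ValueError("unsupported operator")

tree = ("JOIN", ("BGP", ("?x", "knows", "?y")),
                ("BGP", ("?x", "age", "?a")))
_root, statements = translate(tree)
for s in statements:
    print(s)
```

The real system computes the actual join variables per operator and applies the algebra-level optimizations described above before emitting code; the point here is only the bottom-up, one-operator-to-a-few-statements structure of the translation.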

Figure 3: PigSPARQL Workflow From SPARQL to MapReduce, adapted from (Schätzle et al., 2013).

3. Interactive SPARQL Query Processing on Hadoop: Sempala

            In (Schätzle, Przyjaciel-Zablocki, Neu, & Lausen, 2014), an interactive SPARQL query processing technique on Hadoop, “Sempala,” is proposed.  Sempala is a SPARQL-over-SQL-on-Hadoop approach designed for selective queries (Schätzle et al., 2014).  It shows significant performance improvements compared to existing approaches (Schätzle et al., 2014).  The approach of Sempala is inspired by the SQL-on-Hadoop trend, in which several new systems were developed for interactive SQL query processing, such as Hive, Shark, Presto, Phoenix, and Impala (Schätzle et al., 2014).  Thus, Sempala, as a SPARQL-over-SQL approach, follows this trend to provide interactive-time SPARQL query processing on Hadoop (Schätzle et al., 2014).  With Sempala, RDF data is stored in a columnar layout on HDFS, and Impala, an open-source massively parallel processing (MPP) SQL query engine for Hadoop, serves as the execution layer on top (Schätzle et al., 2014).   The architecture of Sempala is illustrated in Figure 4. 

Figure 4:  Sempala Architecture adapted from (Schätzle et al., 2014).

            The architecture of the proposed Sempala has two main components: the RDF Loader and the Query Compiler (Schätzle et al., 2014).  The RDF Loader converts an RDF dataset into the data layout used by Sempala.  The Query Compiler rewrites a given SPARQL query into the SQL dialect of Impala based on the layout of the data (Schätzle et al., 2014).   The Query Compiler of Sempala is based on the algebraic representation of SPARQL expressions defined by the W3C recommendation (Schätzle et al., 2014).  Jena ARQ is used to parse a SPARQL query into the corresponding algebra tree (Schätzle et al., 2014).  Some basic algebraic optimizations, such as filter pushing, are applied (Schätzle et al., 2014).   The final step is to traverse the tree bottom-up to generate the equivalent Impala SQL expressions based on the unified property table layout (Schätzle et al., 2014).  Sempala has been compared with other Hadoop-based systems such as Hive, PigSPARQL, MapMerge, and MAPSIN.   Hive is the standard SQL warehouse for Hadoop based on MapReduce (Schätzle et al., 2014).  Because Impala was developed to be highly compatible with Hive, the same query can run on the same data with minor syntactical modifications (Schätzle et al., 2014).  Sempala follows an approach similar to PigSPARQL; however, PigSPARQL uses Pig as the underlying system and intermediate layer between SPARQL and MapReduce (Schätzle et al., 2014).  MapMerge is an efficient map-side merge join implementation for scalable SPARQL BGPs (“BasicGraphPatterns” (W3C, 2016)) which reduces the shuffling of data between the map and reduce phases in MapReduce (Schätzle et al., 2014).  MAPSIN is an approach that uses HBase, the standard NoSQL database for Hadoop, to store RDF data and applies a map-side index nested loop join which avoids the reduce phase of MapReduce (Schätzle et al., 2014).  
The findings of (Schätzle et al., 2014) show that Sempala outperforms Hive and PigSPARQL, while MapMerge and MAPSIN could not be compared directly because they only support SPARQL BGP processing.
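The bottom-up rewriting of a basic graph pattern into SQL over a unified property table can be sketched as follows. This is a minimal illustration only: the table name, the predicate-to-column mapping, and the translation rules are assumptions made for the sketch, not Sempala's actual compiler.

```python
def bgp_to_sql(triple_patterns, table="property_table"):
    """Translate (subject, predicate, object) patterns with ?variables into
    one SELECT over a unified property table (one row per subject, one
    column per predicate). Illustrative sketch, not Sempala's real schema."""
    select_cols, seen_vars, where = [], set(), []
    for s, p, o in triple_patterns:
        col = p  # assumption: each predicate maps to a column of that name
        if s.startswith("?") and s not in seen_vars:
            seen_vars.add(s)
            select_cols.append(f"subject AS {s[1:]}")
        if o.startswith("?"):
            if o not in seen_vars:
                seen_vars.add(o)
                select_cols.append(f"{col} AS {o[1:]}")
        else:
            # a bound object becomes a selection on the predicate's column
            where.append(f"{col} = '{o}'")
    sql = f"SELECT {', '.join(select_cols)} FROM {table}"
    if where:
        sql += " WHERE " + " AND ".join(where)
    return sql
```

For example, the pattern set {?s type Person . ?s name ?n} would compile to a single scan with one filter, which is why the property table layout avoids self-joins for star-shaped queries.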

4. Map-Side Index Nested Loop Join (MAPSIN JOIN)

            MapReduce faces a challenge in processing joins because the datasets involved are very large (Sakr & Gaber, 2014).  Two datasets could be joined easily if they were located on the same machine, but for very large datasets this is not practical (Sakr & Gaber, 2014).  Thus, solutions such as the reduce-side approach are used, which is regarded as the most prominent and flexible join technique in MapReduce (Sakr & Gaber, 2014).  The reduce-side approach is also known as "repartition join" because the datasets are read in the map phase and repartitioned according to the join key in the shuffle phase, while the actual join computation is done in the reduce phase (Sakr & Gaber, 2014).  The problem with this approach is that the datasets are transferred through the network regardless of the join output, which can consume a lot of network bandwidth and cause a bottleneck (Sakr & Gaber, 2014; Schätzle et al., 2013).  Another solution, the map-side join, performs the actual join processing in the map phase, avoiding the shuffle and reduce phases and avoiding transferring both datasets over the network (Sakr & Gaber, 2014).   The most common variant is the map-side merge join; however, it is hard to cascade, and when cascading is required the advantage of avoiding the shuffle and reduce phases is lost (Sakr & Gaber, 2014).  Thus, the MAPSIN approach is proposed: a map-side index nested loop join based on HBase (Sakr & Gaber, 2014; Schätzle et al., 2013).  The MAPSIN join leverages the indexing capabilities of HBase, which improves the query performance of selective queries (Sakr & Gaber, 2014; Schätzle et al., 2013).  It retains the flexibility of reduce-side joins while utilizing the effectiveness of a map-side join, without any modification to the underlying framework (Sakr & Gaber, 2014; Schätzle et al., 2013). 
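The core idea of MAPSIN — probing an index in the map phase instead of shuffling data to reducers — can be sketched as follows. A plain dictionary stands in for HBase's (subject, predicate) index lookups, and all names are illustrative assumptions, not MAPSIN's actual implementation.

```python
def mapsin_join(left_bindings, index, pattern):
    """For each input binding (the map phase), probe the index for triples
    matching `pattern` and emit the joined bindings directly -- no shuffle
    or reduce phase is involved, which is the point of a map-side join."""
    joined = []
    for binding in left_bindings:
        # Substitute already-bound variables into the pattern, then probe
        # the (subject, predicate) index, as MAPSIN would probe HBase.
        s = binding.get(pattern[0], pattern[0])
        p = pattern[1]
        for obj in index.get((s, p), []):
            new_binding = dict(binding)
            new_binding[pattern[2]] = obj  # bind the object variable
            joined.append(new_binding)
    return joined
```

Because each mapper only fetches index rows that can actually contribute to the join result, selective queries touch little data, which matches the selectivity behavior reported for MAPSIN.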

Comparing MAPSIN with PigSPARQL, MAPSIN performs faster than PigSPARQL by using a sophisticated storage schema based on HBase; this works well for selective queries, but performance diminishes significantly for less selective queries (Schätzle et al., 2013).  However, MAPSIN does not support the queries of LUBM (Lehigh University Benchmark) (W3C, 2016).  The query runtime of MAPSIN is close to the runtime of the merge join approach (Schätzle et al., 2013). 

5. HadoopRDF

            HadoopRDF is proposed by (Tian, Du, Wang, Ni, & Yu, 2012) to combine the high fault tolerance and high throughput of the MapReduce distributed framework with sophisticated indexing and query answering mechanisms (Tian et al., 2012).  HadoopRDF is deployed on a Hadoop cluster of many computers, and each node in the cluster runs a Sesame server that provides the service of storing and retrieving RDF data (Tian et al., 2012).  HadoopRDF is a MapReduce-based RDF system which stores data directly in HDFS and does not require any modification to the Hadoop framework (Przyjaciel-Zablocki et al., 2013; Sakr & Gaber, 2014; Tian et al., 2012).  The basic idea is to substitute the rudimentary HDFS, which lacks indexes and a query execution engine, with more elaborate RDF stores (Tian et al., 2012).  The architecture of HadoopRDF is illustrated in Figure 5.

Figure 5: HadoopRDF Architecture, adapted from (Tian et al., 2012).

            The architecture of HadoopRDF is similar to the architecture of Hadoop, which scales up to thousands of nodes (Tian et al., 2012).  The Hadoop framework is the core of HadoopRDF (Tian et al., 2012).  Hadoop is built on top of HDFS, which is a replicated key-value store under the control of a central NameNode (Tian et al., 2012).  Files in HDFS are broken into fixed-size chunks, and replicas of these chunks are distributed across a group of DataNodes (Tian et al., 2012).  The NameNode tracks the size and location of each replica (Tian et al., 2012).  Hadoop, as a MapReduce framework, is used for computation in data-intensive applications (Tian et al., 2012).  In the architecture of HadoopRDF, the RDF stores are incorporated into the MapReduce framework.  HadoopRDF is an advanced SPARQL engine which splits the original RDF graph according to predicates and objects and utilizes a cost-based query execution plan for reduce-side joins (Przyjaciel-Zablocki et al., 2013; Sakr & Gaber, 2014; Schätzle et al., 2013).  HadoopRDF can re-balance automatically when the cluster size changes, but join processing is still done in the reduce phase (Przyjaciel-Zablocki et al., 2013; Sakr & Gaber, 2014).  The findings of (M. Husain et al., 2011) indicated that HadoopRDF is more scalable and handles low-selectivity queries more efficiently than RDF-3X.  Moreover, the results showed that HadoopRDF is much more scalable than BigOWLIM and provides more efficient queries for large data sets (M. Husain et al., 2011).   Like most systems, HadoopRDF requires a pre-processing phase (Przyjaciel-Zablocki et al., 2013; Sakr & Gaber, 2014).
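The predicate-based splitting that HadoopRDF's storage scheme relies on can be sketched as follows. This is a simplified illustration under assumed names: it shows only the predicate split and omits the object-based splitting and the cost-based plan generation described above.

```python
from collections import defaultdict

def split_by_predicate(triples):
    """Group triples into per-predicate partitions. In a HadoopRDF-style
    layout each partition would live in its own HDFS file, so a query that
    mentions a predicate needs to scan only that partition; storing the
    predicate in the partition key makes it redundant inside the rows."""
    parts = defaultdict(list)
    for s, p, o in triples:
        parts[p].append((s, o))  # predicate is implicit in the partition
    return dict(parts)
```

A query such as {?x type Person} would then read only the "type" partition rather than the whole graph, which is where the pre-processing cost pays off.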

6. RDF-3X:  RDF Triple eXpress

            RDF-3X is proposed by (Neumann & Weikum, 2008).  The RDF-3X engine is an implementation of SPARQL which achieves excellent performance by pursuing a streamlined, RISC-style architecture (Neumann & Weikum, 2008).   RISC (Reduced Instruction Set Computer) is a type of microprocessor architecture that utilizes a small, highly optimized set of instructions rather than the more specialized sets of instructions often found in other architectures (Neumann & Weikum, 2008).  Thus, RDF-3X follows the RISC-style concept with a "reduced instruction set" designed to support RDF.  RDF-3X is described as a generic solution for storing and indexing RDF triples that eliminates the need for physical-design tuning (Neumann & Weikum, 2008).  RDF-3X provides a query optimizer for choosing optimal join orders using a cost model based on statistical synopses for entire join paths (Neumann & Weikum, 2008).   It also provides a powerful yet simple query processor which leverages fast merge joins over large-scale data (Neumann & Weikum, 2008).  RDF-3X has three major components: the physical design, the query processor, and the query optimizer.  The physical design is workload-independent, creating appropriate indexes over a single "giant triples table" (Neumann & Weikum, 2008).  The query processor is RISC-style, relying mostly on merge joins over sorted index lists.  The query optimizer focuses on join order in its generation of the execution plan (Neumann & Weikum, 2008).

The findings of (Neumann & Weikum, 2008) showed that RDF-3X addresses the challenge of schema-free data and copes very well with data that exhibit a large diversity of property names.  The optimizer of RDF-3X is known to produce efficient query execution plans (Galarraga, Hose, & Schenkel, 2014).  RDF-3X maintains local indexes for all possible orders and combinations of the triple components, as well as for aggregations, which enables efficient local data access (Galarraga et al., 2014).  RDF-3X does not support LUBM.  RDF-3X is a single-node RDF store which builds indexes over all possible permutations of subject, predicate, and object (Huang, Abadi, & Ren, 2011; M. Husain et al., 2011; Schätzle et al., 2014; Zeng et al., 2013).  RDF-3X is regarded as the fastest existing semantic web repository and the state-of-the-art "benchmark" engine for single machines (M. Husain et al., 2011; Przyjaciel-Zablocki et al., 2013).  Thus, it outperforms other solutions for queries with bound objects and for aggregate queries (M. Husain et al., 2011).  However, the performance of RDF-3X diminishes exponentially for unbound queries, and for queries with even simple joins if the selectivity factor is low (M. Husain et al., 2011; Przyjaciel-Zablocki et al., 2013).  The experiments of (M. Husain et al., 2011) showed that RDF-3X is not only slower for such queries, it often aborts and cannot complete the query. 
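The exhaustive indexing described above — sorted indexes over all six orderings of (subject, predicate, object) — can be sketched as follows. The in-memory lists here are an illustrative stand-in for RDF-3X's compressed on-disk indexes; only the idea that every ordering is materialized and kept sorted is taken from the source.

```python
from itertools import permutations

def build_permutation_indexes(triples):
    """Materialize a sorted index for each of the six orderings of
    (subject, predicate, object). Any triple pattern with bound components
    then maps to a prefix range scan on one index, and sorted order is
    what makes RDF-3X's merge joins possible."""
    indexes = {}
    for perm in permutations(range(3)):
        key = "".join("spo"[i] for i in perm)  # e.g. "pos", "osp"
        indexes[key] = sorted(tuple(t[i] for t in (triple,) for i in perm)
                              for triple in triples for t in (triple,))
    return indexes
```

Sorting each permutation is what lets two index scans be joined by a linear merge rather than by hashing or nested loops.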

7. Rya: A Scalable RDF Triple Store for the Clouds

            In (Punnoose et al., 2012), Rya is proposed as a new scalable system for storing and retrieving RDF data across cluster nodes.  Rya treats an OWL model as a set of triples and stores them in the triple store (Punnoose et al., 2012).  Storing all the data in the triple store provides the benefit of using Hadoop MapReduce to run large batch-processing jobs against the data set (Punnoose et al., 2012).  The first phase of the process is performed only once, at the time when the OWL model is loaded into Rya (Punnoose et al., 2012).  In phase 1, a MapReduce job iterates through the entire graph of relationships and outputs the implicit relationships it finds as explicit RDF triples stored in the RDF store (Punnoose et al., 2012).  The second phase is performed every time a query is run: once all explicit and implicit relationships are stored in Rya, the Rya query planner can expand the query at runtime to utilize all of these relationships (Punnoose et al., 2012).   A three-table index over the RDF triples is used to enhance performance (Punnoose et al., 2012).  The results of (Punnoose, Crainiceanu, & Rapp, 2015) showed that Rya outperformed SHARD.  Moreover, in a comparison with the graph-partitioning algorithm introduced by (Huang et al., 2011), as reported in (Punnoose et al., 2015), Rya showed superior performance in many cases over Graph Partitioning.
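The benefit of a three-table index is that any triple pattern with bound components can be answered by a prefix range scan on exactly one table. The sketch below assumes the SPO/POS/OSP layout commonly described for Rya; the selection logic itself is an illustrative assumption, not Rya's actual query planner.

```python
def choose_table(s_bound, p_bound, o_bound):
    """Pick the triple table whose sort order places the pattern's bound
    components first, so the scan reduces to a contiguous key range."""
    if s_bound and o_bound and not p_bound:
        return "OSP"          # (o, s) is a prefix of the OSP sort order
    if s_bound:
        return "SPO"          # covers (s), (s, p), and (s, p, o)
    if p_bound:
        return "POS"          # covers (p) and (p, o)
    if o_bound:
        return "OSP"          # covers (o)
    return "SPO"              # nothing bound: full scan of any table
```

All seven bound/unbound combinations thus map to a range scan, which is why three tables suffice instead of the six permutations a system like RDF-3X materializes.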

Conclusion

This project provided a survey on the state-of-the-art techniques for applying MapReduce to improve the RDF data query processing performance.  Tremendous effort from industry and researchers has been exerted to develop efficient and scalable RDF processing systems.   The project discussed the RDF framework and the major building blocks of the semantic web architecture.   The RDF store architecture, the MapReduce parallel processing framework, and Hadoop were also discussed.  Researchers have exerted effort in developing Semantic Web technologies, which have been standardized to address the inadequacy of current traditional analytical techniques.   This paper also discussed the most prominent standardized semantic web technologies, RDF and SPARQL.  The project also discussed and analyzed in detail the various techniques applied on MapReduce to improve RDF query processing performance.  These techniques include RDFPath, PigSPARQL, Interactive SPARQL Query Processing on Hadoop (Sempala), Map-Side Index Nested Loop Join (MAPSIN JOIN), HadoopRDF, RDF-3X (RDF Triple eXpress), and Rya (a Scalable RDF Triple Store for the Clouds). 

References

Apache Jena. (2017a). ARQ – A SPARQL processor for Jena.

Apache Jena. (2017b). LARQ – Adding free text searches to SPARQL.

Bakshi, K. (2012). Considerations for big data: Architecture and approach. Paper presented at the Aerospace Conference, 2012 IEEE.

Bao, Y., Ren, L., Zhang, L., Zhang, X., & Luo, Y. (2012). Massive sensor data management framework in cloud manufacturing based on Hadoop. Paper presented at the Industrial Informatics (INDIN), 2012 10th IEEE International Conference on.

Boussaid, O., Tanasescu, A., Bentayeb, F., & Darmont, J. (2007). Integration and dimensional modeling approaches for complex data warehousing. Journal of Global Optimization, 37(4), 571. doi:10.1007/s10898-006-9064-6

Brickley, D., & Guha, R. V. (Eds.). (2014). RDF schema 1.1. Retrieved from the W3C Web site: http://www.w3.org/TR/2014/REC-rdf-schema-20140225/.

Choi, H., Son, J., Cho, Y., Sung, M. K., & Chung, Y. D. (2009). SPIDER: a system for scalable, parallel/distributed evaluation of large-scale RDF data. Paper presented at the Proceedings of the 18th ACM conference on Information and knowledge management.

Cloud Security Alliance. (2013). Big Data Analytics for Security Intelligence. Big Data Working Group.

Connolly, T., & Begg, C. (2015). Database Systems: A Practical Approach to Design, Implementation, and Management (6th Edition ed.): Pearson.

De Mauro, A., Greco, M., & Grimaldi, M. (2015). What is big data? A consensual definition and a review of key research topics. Paper presented at the AIP Conference Proceedings.

Fadzil, A. F. A., Khalid, N. E. A., & Manaf, M. (2012). Performance of scalable off-the-shelf hardware for data-intensive parallel processing using MapReduce. Paper presented at the Computing and Convergence Technology (ICCCT), 2012 7th International Conference on.

Firat, M., & Kuzu, A. (2011). Semantic web for e-learning bottlenecks: disorientation and cognitive overload. International Journal of Web & Semantic Technology, 2(4), 55.

Galarraga, L., Hose, K., & Schenkel, R. (2014). Partout: a distributed engine for efficient RDF processing. Paper presented at the Proceedings of the 23rd International Conference on World Wide Web.

Hu, H., Wen, Y., Chua, T., & Li, X. (2014). Toward Scalable Systems for Big Data Analytics: A Technology Tutorial. Practical Innovation, Open Solution, 2, 652-687. doi:10.1109/ACCESS.2014.2332453

Hu, P., & Dai, W. (2014). Enhancing fault tolerance based on Hadoop cluster. International Journal of Database Theory and Application, 7(1), 37-48.

Huang, J., Abadi, D. J., & Ren, K. (2011). Scalable SPARQL querying of large RDF graphs. Proceedings of the VLDB Endowment, 4(11), 1123-1134.

Husain, M., McGlothlin, J., Masud, M. M., Khan, L., & Thuraisingham, B. M. (2011). Heuristics-based query processing for large RDF graphs using cloud computing. IEEE transactions on knowledge and data engineering, 23(9), 1312-1327.

Husain, M. F., Khan, L., Kantarcioglu, M., & Thuraisingham, B. (2010). Data intensive query processing for large RDF graphs using cloud computing tools. Paper presented at the Cloud Computing (CLOUD), 2010 IEEE 3rd International Conference on.

Inukollu, V. N., Arsi, S., & Ravuri, S. R. (2014). Security Issues Associated with Big Data in Cloud Computing. International Journal of Network Security & Its Applications, 6(3), 45. doi:10.5121/ijnsa.2014.6304

Jinquan, D., Jie, H., Shengsheng, H., Yan, L., & Yuanhao, S. (2012). The Hadoop Stack: New Paradigm for Big Data Storage and Processing. Intel Technology Journal, 16(4), 92-110.

Khan, N., Yaqoob, I., Hashem, I. A. T., Inayat, Z., Ali, M. W. K., Alam, M., . . . Gani, A. (2014). Big Data: Survey, Technologies, Opportunities, and Challenges. The Scientific World Journal, 1-18. doi:10.1155/2014/712826

Konstantinou, N., Spanos, D.-E., Stavrou, P., & Mitrou, N. (2010). Technically approaching the semantic web bottleneck. International Journal of Web Engineering and Technology, 6(1), 83-111.

Krishnan, K. (2013). Data warehousing in the age of big data: Newnes.

Lee, K.-H., Lee, Y.-J., Choi, H., Chung, Y. D., & Moon, B. (2012). Parallel data processing with MapReduce: a survey. ACM SIGMOD Record, 40(4), 11-20.

Mishra, B. S. P., Dehuri, S., & Kim, E. (2016). Techniques and Environments for Big Data Analysis: Parallel, Cloud, and Grid Computing (Vol. 17): Springer.

Modoni, G. E., Sacco, M., & Terkaj, W. (2014). A survey of RDF store solutions. Paper presented at the Engineering, Technology and Innovation (ICE), 2014 International ICE Conference on.

Myung, J., Yeon, J., & Lee, S.-g. (2010). SPARQL basic graph pattern processing with iterative MapReduce. Paper presented at the Proceedings of the 2010 Workshop on Massive Data Analytics on the Cloud.

Neumann, T., & Weikum, G. (2008). RDF-3X: a RISC-style engine for RDF. Proceedings of the VLDB Endowment, 1(1), 647-659.

Nicolaidis, I., & Iniewski, K. (2017). Building Sensor Networks: CRC Press.

Polato, I., Ré, R., Goldman, A., & Kon, F. (2014). A comprehensive view of Hadoop research—A systematic literature review. Journal of Network and Computer Applications, 46, 1-25.

Przyjaciel-Zablocki, M., Schätzle, A., Hornung, T., & Lausen, G. (2011). Rdfpath: Path query processing on large rdf graphs with mapreduce. Paper presented at the Extended Semantic Web Conference.

Przyjaciel-Zablocki, M., Schätzle, A., Skaley, E., Hornung, T., & Lausen, G. (2013). Map-side merge joins for scalable SPARQL BGP processing. Paper presented at the Cloud Computing Technology and Science (CloudCom), 2013 IEEE 5th International Conference on.

Punnoose, R., Crainiceanu, A., & Rapp, D. (2012). Rya: a scalable RDF triple store for the clouds. Paper presented at the Proceedings of the 1st International Workshop on Cloud Intelligence.

Punnoose, R., Crainiceanu, A., & Rapp, D. (2015). SPARQL in the cloud using Rya. Information Systems, 48, 181-195.

Sakr, S., & Gaber, M. (2014). Large Scale and big data: Processing and Management: CRC Press.

Schätzle, A., Przyjaciel-Zablocki, M., Hornung, T., & Lausen, G. (2013). PigSPARQL: a SPARQL query processing baseline for big data. Paper presented at the Proceedings of the 12th International Semantic Web Conference (Posters & Demonstrations Track)-Volume 1035.

Schätzle, A., Przyjaciel-Zablocki, M., Neu, A., & Lausen, G. (2014). Sempala: interactive SPARQL query processing on hadoop. Paper presented at the International Semantic Web Conference.

Sun, Y., & Jara, A. J. (2014). An extensible and active semantic model of information organizing for the Internet of Things. Personal and Ubiquitous Computing, 18(8), 1821-1833. doi:10.1007/s00779-014-0786-z

Tian, Y., Du, J., Wang, H., Ni, Y., & Yu, Y. (2012). Hadooprdf: A scalable rdf data analysis system. 8th ICIC, 633-641.

Tiwana, A., & Balasubramaniam, R. (2001). Integrating knowledge on the web. IEEE Internet Computing, 5(3), 32-39.

W3C. (2016). RDF store benchmarking. Retrieved from https://www.w3.org/wiki/RdfStoreBenchmarking.

White, T. (2012). Hadoop: The definitive guide. O'Reilly Media, Inc.

Yang, H.-c., Dasdan, A., Hsiao, R.-L., & Parker, D. S. (2007). Map-reduce-merge: simplified relational data processing on large clusters. Paper presented at the Proceedings of the 2007 ACM SIGMOD international conference on Management of data.

Zeng, K., Yang, J., Wang, H., Shao, B., & Wang, Z. (2013). A distributed graph engine for web scale RDF data. Paper presented at the Proceedings of the VLDB Endowment.