"Artificial Intelligence without Big Data Analytics is lame, and Big Data Analytics without Artificial Intelligence is blind." Dr. O. Aly, Computer Science.
The purpose of this discussion is to analyze the relationship between performance and security and the impact of security implementation on performance. It also examines how to balance security and performance so that both yield good operational results. The discussion begins with the characteristics of the distributed environment, including databases, to establish the complexity of distributed environments and the factors that influence them. It then analyzes the security challenges in distributed systems and the negative correlation between security and performance in such systems.
Distributed Environment Challenges
The distributed system involves components located at networked computers communicating and coordinating their actions only by passing messages. The distributed system includes concurrency of components, lack of a global clock and independent failures of components. The challenges of the distributed system arise from the heterogeneity of the system components, openness to allow components to be added or replaced, security, scalability, failure handling, concurrency of components, transparency and providing quality of service (Coulouris, Dollimore, & Kindberg, 2005).
Examples of distributed systems include web search, whose task is to index the entire content of the World Wide Web, covering a wide range of information types and styles including web pages, multimedia sources, and scanned books. Massively multiplayer online games (MMOGs) are another example of a distributed system: users interact through the Internet with a persistent virtual world. The financial trading market is a further example, relying on real-time access to a wide range of information sources such as current share prices and trends and economic and political developments (Coulouris et al., 2005).
Influential Factors in Distributed Systems
The distributed system is undergoing significant change driven by several trends. The first influential trend is the emergence of pervasive networking technology. The emergence of ubiquitous computing, coupled with the desire to support user mobility, is another factor affecting the distributed system. The increasing demand for multimedia services is a third influential trend. The last influential trend is the view of distributed systems as a utility. All these trends have a significant impact on the distributed system.
Security Challenge in Distributed System
Security is among the main challenges in the distributed system. Many of the information resources stored in a distributed system have a high value to their users, so their security is critically important. Information security involves confidentiality to protect against disclosure to unauthorized users, integrity to protect against alteration or corruption, and availability to protect against interference with the means of accessing the resources. Security must therefore satisfy the CIA triad of Confidentiality, Integrity, and Availability (Abernathy & McMillan, 2016; Coulouris et al., 2005; Stewart, Chapple, & Gibson, 2015). Security risks are associated with allowing access to resources in an intranet within the organization. Although firewalls can form barriers between departments within the intranet, restricting access to authorized users only, the proper use of a resource by users within the intranet and on the Internet cannot be ensured or guaranteed.
In the distributed system, users send requests to access data managed by a server, which involves sending information in messages over a network. For example, a user may send credit card information in electronic commerce or banking, or a doctor may request access to a patient's information. The challenge is to send sensitive information in a message over a network in a secure manner and to ensure that the recipient is the right user. Such challenges can be met by using security techniques such as encryption. However, two security challenges have not been fully resolved yet: Denial of Service (DoS) and the security of mobile code. A DoS attack occurs when the service is disrupted and users cannot access their data. Currently, DoS attacks are countered by attempting to catch and punish the perpetrators after the event, which is a reactive rather than proactive solution. The security of mobile code is another open challenge; for example, a received item such as an image or script may itself be a source of a DoS attack or of access to a local resource (Coulouris et al., 2005).
Negative Correlation between Security and Performance
The performance challenges of the distributed system emerge from the more complex algorithms required for the distributed environment compared with a centralized system. This complexity arises from requirements such as replicated database systems, fully interconnected networks, and network delays represented by simplistic queuing models. Security is one of the most important issues in the distributed system, and it requires layers of security measures to protect the system from intruders. These layers of protection have a negative impact on the performance of the distributed environment. Moreover, data and information in transit or in storage become vulnerable to attacks. There are four types of storage systems: server-attached Redundant Array of Independent Disks (RAID), centralized RAID, Network Attached Storage (NAS), and Storage Area Network (SAN). NAS and SAN differ in performance because they use different techniques for transferring data. NAS uses the TCP/IP protocol to transfer data across multiple devices, while SAN uses SCSI over Fibre Channel. Thus, NAS can be implemented on any physical network supporting TCP/IP, such as Ethernet, FDDI, or ATM, whereas SAN can be implemented only on Fibre Channel. SAN has better performance than NAS because TCP has higher overhead and SCSI is faster than the TCP/IP network (Firdhous, 2012).
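To make the security-performance tradeoff concrete, the following minimal Java sketch (an illustration of the general idea, not taken from the cited studies; the 4 MB payload size and the AES/CBC choice are assumptions) times the extra work an encryption layer adds before a message can be sent:

    import javax.crypto.Cipher;
    import javax.crypto.KeyGenerator;
    import javax.crypto.SecretKey;

    public class EncryptionOverheadDemo {
        public static void main(String[] args) throws Exception {
            byte[] payload = new byte[4 * 1024 * 1024];          // 4 MB message body (assumed size)
            new java.util.Random(42).nextBytes(payload);

            KeyGenerator keyGen = KeyGenerator.getInstance("AES");
            keyGen.init(128);
            SecretKey key = keyGen.generateKey();
            Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
            cipher.init(Cipher.ENCRYPT_MODE, key);

            long start = System.nanoTime();
            byte[] ciphertext = cipher.doFinal(payload);          // the extra work a security layer adds
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;

            System.out.println("Encrypted " + payload.length + " bytes -> "
                    + ciphertext.length + " bytes in " + elapsedMs + " ms");
        }
    }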
References
Abernathy, R., & McMillan, T. (2016). CISSP Cert Guide. Pearson IT Certification.
Coulouris, G. F., Dollimore, J., & Kindberg, T. (2005). Distributed systems: Concepts and design. Pearson Education.
Firdhous, M. (2012). Implementation of security in distributed systems - a comparative study. arXiv preprint arXiv:1211.2032.
Stewart, J., Chapple, M., & Gibson, D. (2015). CISSP (ISC)2 Official Study Guide (7th ed.). Wiley.
MapReduce and RDF Data
Query Processing Optimized Performance
This project provides a survey of state-of-the-art techniques for applying MapReduce to improve the performance of Resource Description Framework (RDF) data query processing. There has been a tremendous effort from industry and researchers to develop efficient and scalable RDF processing systems. Most complex data-processing tasks require multiple MapReduce cycles chained together sequentially, so a task is often decomposed into cycles or subtasks. A low overall workflow cost is therefore a key goal of the decomposition, because each MapReduce cycle incurs significant overhead. When using RDF, the decomposition problem reflects the distribution of operations such as SELECT and JOIN into subtasks, each supported by a MapReduce cycle. The decomposition is closely related to the ordering of operations, because neighboring operations in a query plan can be effectively grouped into the same subtask. When using MapReduce, the operation order is driven by the key-partitioning requirement, so that neighboring operations do not conflict. Various techniques have been proposed to enhance the performance of Semantic Web queries using RDF and MapReduce.
This project begins with an overview of RDF, followed by the RDF store architecture, the MapReduce parallel processing framework, and Hadoop. RDF and SPARQL semantic queries are then discussed, covering the syntax of SPARQL and the missing features that would be required to enhance the performance of RDF using MapReduce. Finally, various techniques for applying MapReduce to improve RDF query processing performance are discussed and analyzed, including HadoopRDF, RDFPath, and PigSPARQL.
Resource Description
Framework (RDF)
Resource Description Framework (RDF) is described as an emerging standard for processing metadata (Punnoose, Crainiceanu, & Rapp, 2012; Tiwana & Balasubramaniam, 2001). RDF provides interoperability between applications that exchange machine-understandable information on the Web (Sakr & Gaber, 2014; Tiwana & Balasubramaniam, 2001). The primary goal of RDF is to define a mechanism and provide standards for metadata and for describing resources on the Web (Boussaid, Tanasescu, Bentayeb, & Darmont, 2007; Firat & Kuzu, 2011; Tiwana & Balasubramaniam, 2001), making no a priori assumptions about a particular application domain or the associated semantics (Tiwana & Balasubramaniam, 2001). These standards and mechanisms can prevent users from being drawn to irrelevant subjects, because RDF provides metadata that is relevant to the desired information (Firat & Kuzu, 2011).
RDF is also described as a Data Model (Choi, Son, Cho, Sung, & Chung, 2009; Myung, Yeon, & Lee, 2010) for representing labeled directed graphs (Choi et al., 2009; Nicolaidis & Iniewski, 2017), and it is useful for data warehousing solutions built on the MapReduce framework (Myung et al., 2010). RDF is an important building block of the Semantic Web in Web 3.0 (see Figure 1) (Choi et al., 2009; Firat & Kuzu, 2011). The technologies of the Semantic Web are useful for maintaining data in the Cloud (M. F. Husain, Khan, Kantarcioglu, & Thuraisingham, 2010), and they provide the ability to specify and query heterogeneous data in a standard manner (M. F. Husain et al., 2010). The RDF Data Model can be extended with ontologies, including RDF Schema (RDFS) and the Web Ontology Language (OWL), to provide techniques for defining and identifying vocabularies specific to a certain domain, the schema, and the relations between the elements of the vocabulary (Choi et al., 2009).
RDF can be exported in various file formats (Sun & Jara, 2014); the most common of these formats are RDF/XML and XSD (Sun & Jara, 2014). OWL is used to add semantics to the schema (Sun & Jara, 2014). For instance, "A isAssociatedWith B" implies that "B isAssociatedWith A" (Sun & Jara, 2014), and OWL makes it possible to express these two statements the same way (Sun & Jara, 2014). This feature is very useful for "joining" data expressed in different schemas (Sun & Jara, 2014): it allows building relationships and joining data from multiple sites, described as "Linked Data," which facilitates the integration of heterogeneous data streams (Sun & Jara, 2014). OWL also enables new facts to be derived from known facts using inference rules (Nicolaidis & Iniewski, 2017). For example, if one triple states that a car is a subtype of a vehicle and another triple states that a Cabrio is a subtype of a car, the new fact that a Cabrio is a vehicle can be inferred from the previous facts (Nicolaidis & Iniewski, 2017).
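The following is a minimal sketch of this kind of inference using Apache Jena, which this project later mentions in connection with ARQ; the vehicle class URIs are made up for illustration:

    import org.apache.jena.rdf.model.InfModel;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.rdf.model.Resource;
    import org.apache.jena.vocabulary.RDFS;

    public class SubclassInferenceDemo {
        public static void main(String[] args) {
            String ns = "http://example.org/vehicles#";   // hypothetical namespace
            Model base = ModelFactory.createDefaultModel();
            Resource vehicle = base.createResource(ns + "Vehicle");
            Resource car     = base.createResource(ns + "Car");
            Resource cabrio  = base.createResource(ns + "Cabrio");

            base.add(car, RDFS.subClassOf, vehicle);      // asserted: Car is a subtype of Vehicle
            base.add(cabrio, RDFS.subClassOf, car);       // asserted: Cabrio is a subtype of Car

            // Wrap the asserted triples with an RDFS reasoner and ask for the derived fact.
            InfModel inf = ModelFactory.createRDFSModel(base);
            System.out.println("Cabrio subClassOf Vehicle? "
                    + inf.contains(cabrio, RDFS.subClassOf, vehicle));   // true (inferred)
        }
    }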
RDF Data Model is described as a simple and flexible framework (Myung et al., 2010). The underlying form of expression in RDF is a collection of "triples," each consisting of a subject (s), a predicate (p), and an object (o) (Brickley, 2014; Connolly & Begg, 2015; Nicolaidis & Iniewski, 2017; Przyjaciel-Zablocki, Schätzle, Skaley, Hornung, & Lausen, 2013; Punnoose et al., 2012). Subjects and predicates are Resources, each encoded as a Uniform Resource Identifier (URI) to ensure uniqueness, while the object can be a Resource or a Literal such as a string, date, or number (Nicolaidis & Iniewski, 2017). In (Firat & Kuzu, 2011), the basic structure of the RDF Data Model is based on a triplet of object (O), attribute (A), and value (V); the basic role of RDF is to provide a Data Model of object, attribute, and value (OAV) (Firat & Kuzu, 2011). The RDF Data Model is similar to the XML Data Model in that neither includes form-related information or names (Firat & Kuzu, 2011).
Figure 1: The Major Building Blocks of the Semantic Web Architectures. Adapted from (Firat & Kuzu, 2011).
RDF has been commonly used in
applications such as Semantic Web, Bioinformatics, and Social Networks because
of its great flexibility and applicability (Choi
et al., 2009).
These applications require a huge computation over a large set of data (Choi
et al., 2009).
Thus, the large-scale graph datasets are very common among these
applications of Semantic Web, Bioinformatics, and Social Networks (Choi
et al., 2009).
However, the traditional techniques for processing such large-scale datasets have been found to be inadequate (Choi et al., 2009).
Moreover, RDF Data Model enables
existing heterogeneous database systems to be integrated into a Data Warehouse
because of its flexibility (Myung
et al., 2010).
The flexibility of the RDF Data Model also provides users the inference
capability to discover unknown knowledge which is useful for large-scale data
analysis (Myung
et al., 2010). RDF triples require terabytes of disk space
for storage and analysis (M.
Husain, McGlothlin, Masud, Khan, & Thuraisingham, 2011; M. F. Husain et
al., 2010).
Researchers are encouraged to develop efficient repositories, because
there are only a few existing frameworks such as RDF-3X, Jena, Sesame, BigOWLIM
for Semantic Web technologies (M.
Husain et al., 2011; M. F. Husain et al., 2010).
These frameworks are single-machine RDF systems and are widely used
because they are user-friendly and perform well for small and medium-sized RDF datasets
(M.
Husain et al., 2011; M. F. Husain et al., 2010; Sakr & Gaber, 2014).
RDF-3X is regarded as the fastest single-machine RDF system in terms of query performance, vastly outperforming previous single-machine systems (M. Husain et al., 2011; M. F. Husain et al., 2010; Sakr & Gaber, 2014). However, the performance of RDF-3X diminishes for queries with unbound objects and a low selectivity factor (M. Husain et al., 2011; M. F. Husain et al., 2010; Sakr & Gaber, 2014). These frameworks are also challenged by large RDF graphs (M. Husain et al., 2011; M. F. Husain et al., 2010). Therefore, the storage of a large volume of RDF triples and the efficient querying of those triples are challenging and are regarded as critical problems in the Semantic Web (M. Husain et al., 2011; M. F. Husain et al., 2010; Sakr & Gaber, 2014). These challenges also limit scaling capabilities (M. Husain et al., 2011; M. F. Husain et al., 2010; Sakr & Gaber, 2014).
RDF Store Architecture
The main purpose of the RDF store is to provide a database for storing and retrieving any data expressed in RDF (Modoni, Sacco, & Terkaj, 2014). The term RDF store is used as an abstraction for any system that can handle RDF data, allowing the ingestion of serialized RDF data and the retrieval of those data, and providing a set of APIs to facilitate integration with third-party client applications (Modoni et al., 2014). The term triple store often refers to the same type of system (Modoni et al., 2014). An RDF store includes two major components: the Repository and the Middleware. The Repository represents a set of files or a database (Modoni et al., 2014). The Middleware sits on top of the Repository and is in constant communication with it (Modoni et al., 2014). Figure 2 illustrates the RDF store architecture. The Middleware has its own components: the Storage Provider, the Query Engine, the Parser/Serializer, and the Client Connector (Modoni et al., 2014). Current RDF stores fall into three groups: database-based stores, native stores, and hybrid stores (Modoni et al., 2014). Examples of database-based stores, built on top of existing commercial database engines, include MySQL and Oracle 12c (Modoni et al., 2014). Examples of native stores, built as database engines from scratch, are AllegroGraph and OWLIM (Modoni et al., 2014). Examples of hybrid stores, which support both architectural styles (native and DBMS-backed), are Virtuoso and Sesame (Modoni et al., 2014).
Figure 2: RDF
Store Architecture (Modoni et al., 2014)
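As a rough illustration of this ingest-and-retrieve contract (a sketch using an in-memory Apache Jena model as the repository; the file name is an assumption), serialized RDF is parsed in and the statements are read back through the API:

    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.rdf.model.StmtIterator;

    public class TinyRdfStoreDemo {
        public static void main(String[] args) {
            Model repository = ModelFactory.createDefaultModel();

            // Ingest serialized RDF (N-Triples assumed) into the repository component.
            repository.read("data.nt", "N-TRIPLES");       // hypothetical input file

            // Retrieve: a middleware-style API hands triples back to a client application.
            StmtIterator it = repository.listStatements();
            while (it.hasNext()) {
                System.out.println(it.nextStatement());    // subject  predicate  object
            }
        }
    }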
MapReduce Parallel
Processing Framework and Hadoop
In 2004, Google introduced the MapReduce framework as a parallel processing framework for dealing with large sets of data (Bakshi, 2012; Fadzil, Khalid, & Manaf, 2012; White, 2012). The MapReduce framework has gained much popularity because it hides the sophisticated operations of parallel processing (Fadzil et al., 2012). Various MapReduce frameworks such as Hadoop were introduced because of the enthusiasm towards MapReduce (Fadzil et al., 2012), and the capability of the MapReduce framework was recognized by different research areas such as data warehousing, data mining, and bioinformatics (Fadzil et al., 2012). The MapReduce framework consists of two main layers: the Distributed File System (DFS) layer to store data and the MapReduce layer for data processing (Lee, Lee, Choi, Chung, & Moon, 2012; Mishra, Dehuri, & Kim, 2016; Sakr & Gaber, 2014). The DFS is a major feature of the MapReduce framework (Fadzil et al., 2012).
The MapReduce framework uses large clusters of low-cost commodity hardware to lower costs (Bakshi, 2012; H. Hu, Wen, Chua, & Li, 2014; Inukollu, Arsi, & Ravuri, 2014; Khan et al., 2014; Krishnan, 2013; Mishra et al., 2016; Sakr & Gaber, 2014; White, 2012). It uses "Redundant Arrays of Independent (and inexpensive) Nodes (RAIN)," whose components are loosely coupled, so that when any node goes down there is no negative impact on the MapReduce job (Sakr & Gaber, 2014; Yang, Dasdan, Hsiao, & Parker, 2007). The MapReduce framework provides fault tolerance by applying replication, allowing any crashed node to be replaced by another node without affecting the currently running job (P. Hu & Dai, 2014; Sakr & Gaber, 2014). The framework also provides automatic support for parallelization of execution, which makes MapReduce highly parallel and yet abstracted (P. Hu & Dai, 2014; Sakr & Gaber, 2014).
MapReduce was introduced to solve the
problem of parallel processing of a large set
of data in a distributed environment which required manual management of the
hardware resources (Fadzil et al., 2012; Sakr & Gaber,
2014). The complexity of the parallelization is solved by using two techniques: Map/Reduce technique, and Distributed File
System (DFS) technique (Fadzil et al., 2012; Sakr & Gaber,
2014). The parallel framework must be reliable, ensuring good resource management in the distributed environment while using off-the-shelf hardware, and must solve the scalability issue to support any future processing requirements (Fadzil et al., 2012). Earlier frameworks such as the Message Passing Interface (MPI) had reliability and fault-tolerance issues when processing large sets of data (Fadzil et al., 2012). The MapReduce framework covers the two categories of scalability: structural scalability and load scalability (Fadzil et al., 2012). It addresses structural scalability by using the DFS, which forms large virtual storage for the framework by adding off-the-shelf hardware, and it addresses load scalability by increasing the number of nodes to improve performance (Fadzil et al., 2012).
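The division of labor between the DFS layer and the MapReduce layer can be illustrated with a minimal Hadoop job (a sketch, not taken from the cited papers; the input and output paths and the N-Triples line format are assumptions) that counts how often each predicate occurs in RDF data stored on the distributed file system:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class PredicateCount {

        // Map: each N-Triples line is "subject predicate object ." -> emit (predicate, 1).
        public static class PredicateMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
            private static final LongWritable ONE = new LongWritable(1);
            @Override
            protected void map(LongWritable key, Text line, Context ctx)
                    throws IOException, InterruptedException {
                String[] parts = line.toString().split("\\s+");
                if (parts.length >= 3) {
                    ctx.write(new Text(parts[1]), ONE);
                }
            }
        }

        // Reduce: sum the counts for each predicate.
        public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
            @Override
            protected void reduce(Text predicate, Iterable<LongWritable> counts, Context ctx)
                    throws IOException, InterruptedException {
                long sum = 0;
                for (LongWritable c : counts) sum += c.get();
                ctx.write(predicate, new LongWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "rdf predicate count");
            job.setJarByClass(PredicateCount.class);
            job.setMapperClass(PredicateMapper.class);
            job.setCombinerClass(SumReducer.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(LongWritable.class);
            FileInputFormat.addInputPath(job, new Path("/data/triples"));     // assumed HDFS input
            FileOutputFormat.setOutputPath(job, new Path("/out/predicates")); // assumed HDFS output
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }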
However, earlier versions of the MapReduce framework faced several challenges. Among these challenges are the join operation and the lack of support for aggregate functions needed to combine multiple datasets in one task (Sakr & Gaber, 2014). Another limitation of the standard MapReduce framework concerns iterative processing, which is required for analysis techniques such as the PageRank algorithm, recursive relational queries, and social network analysis (Sakr & Gaber, 2014). The standard MapReduce does not share the execution of work to reduce the overall amount of work (Sakr & Gaber, 2014). A further limitation is the lack of support for data indexes and column storage, with support only for sequential scanning of the input data; this lack of indexing affects query performance (Sakr & Gaber, 2014). Moreover, many argue that MapReduce is not the optimal solution for structured data. It is known as a shared-nothing architecture, which supports scalability (Bakshi, 2012; Jinquan, Jie, Shengsheng, Yan, & Yuanhao, 2012; Sakr & Gaber, 2014; White, 2012) and the processing of large unstructured data sets (Bakshi, 2012). MapReduce also has limitations in performance and efficiency (Lee et al., 2012).
Hadoop is a software framework derived from Google's Bigtable and MapReduce designs and managed by Apache.
It was created by Doug Cutting and was named after his son’s toy
elephant (Mishra et al.,
2016). Hadoop allows applications to run on huge
clusters of commodity hardware based on MapReduce (Mishra et al.,
2016). The underlying concept of Hadoop is to allow
the parallel processing of the data across different computing nodes to speed
up computations and hide the latency (Mishra et al.,
2016). The Hadoop Distributed File System (HDFS) is
one of the major components of the Hadoop framework for storing large files (Bao, Ren, Zhang,
Zhang, & Luo, 2012; Cloud Security Alliance, 2013; De Mauro, Greco, &
Grimaldi, 2015) and allowing access to data scattered over multiple nodes without any exposure to the complexity of the environment (Bao et al., 2012;
De Mauro et al., 2015). The MapReduce programming model is another major component
of the Hadoop framework (Bao
et al., 2012; Cloud Security Alliance, 2013; De Mauro et al., 2015) which is designed to implement the
distributed and parallel algorithms efficiently (De
Mauro et al., 2015).
HBase is the third component of Hadoop framework (Bao
et al., 2012).
HBase is developed on the HDFS and is a NoSQL (Not only SQL) type
database (Bao
et al., 2012).
The
key features of Hadoop include the scalability and flexibility, cost efficiency
and fault tolerance (H.
Hu et al., 2014; Khan et al., 2014; Mishra et al., 2016; Polato, Ré, Goldman,
& Kon, 2014; Sakr & Gaber, 2014).
Hadoop allows the nodes in the cluster to scale up and down based on the
computation requirements and with no change in the data formats (H.
Hu et al., 2014; Polato et al., 2014).
Hadoop also provides massively parallel computation to commodity
hardware decreasing the cost per terabyte of storage which makes the massively
parallel computation affordable when the volume of the data gets increased (H.
Hu et al., 2014).
Hadoop also offers flexibility, as it is not tied to a schema, which allows the use of structured, unstructured, and semi-structured data and the aggregation of data from multiple sources (H. Hu et al., 2014; Polato et al., 2014). In addition, Hadoop allows nodes to crash without affecting data processing: it provides a fault-tolerant environment in which data and computation can be recovered without any negative impact on the processing of the data (H. Hu et al., 2014; Polato et al., 2014; White, 2012).
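A small sketch of how a client writes to HDFS through the standard Hadoop FileSystem API (the path, the sample triple, and the replication factor of 3 are assumptions) shows where the replication underlying this fault tolerance is configured:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("dfs.replication", "3");              // each block kept on 3 DataNodes (assumed setting)

            FileSystem fs = FileSystem.get(conf);          // resolves the configured default (HDFS) file system
            Path file = new Path("/user/demo/triples.nt"); // hypothetical path
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.writeBytes("<http://ex.org/s> <http://ex.org/p> \"o\" .\n");
            }
            System.out.println("Wrote " + fs.getFileStatus(file).getLen() + " bytes to " + file);
        }
    }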
RDF and SPARQL Using
Semantic Query
Researchers
have exerted effort in developing Semantic Web technologies which have been standardized
to address the inadequacy of the current traditional analytical techniques (M.
Husain et al., 2011; M. F. Husain et al., 2010). RDF and SPARQL (SPARQL Protocol and RDF Query Language) are the most prominent standardized Semantic Web technologies (M. Husain et al., 2011; M. F. Husain et al., 2010). The Data Access Working Group (DAWG) of the
World Wide Web Consortium (W3C) in 2007 recommended SPARQL and provided
standards to be the query language for RDF, a protocol definition for sending
SPARQL queries from a client to a query processor and an XML-based serialization
format for results returned by the SPARQL query (Konstantinou,
Spanos, Stavrou, & Mitrou, 2010; Sakr & Gaber, 2014). RDF is regarded to be the standard for
storing and representing data (M.
Husain et al., 2011; M. F. Husain et al., 2010). SPARQL is the query language to retrieve data
from RDF triplestore (M.
Husain et al., 2011; M. F. Husain et al., 2010; Nicolaidis & Iniewski,
2017; Sakr & Gaber, 2014; Zeng, Yang, Wang, Shao, & Wang, 2013).
Like RDF, SPARQL is built on the “triple pattern,” which also contains
the “subject,” “predicate,” and “object” and is terminated with a full stop (Connolly
& Begg, 2015).
RDF triple is regarded to be a SPARQL triple pattern (Connolly
& Begg, 2015).
URIs are written inside angle brackets for identifying resources;
literal strings are denoted with either double or single quote; properties,
like Name, can be identified by their URI or more normally using a QName-style
syntax to improve readability (Connolly
& Begg, 2015). Unlike a triple, a triple pattern can include variables (Connolly & Begg, 2015). Any or all of the values of
subject, predicate, and object in a triple pattern may be replaced by a
variable, which indicates data items of interest that will be returned by a
query (Connolly
& Begg, 2015).
The semantic query plays a significant role in the Semantic Web, and the standardization of SPARQL plays a significant role in achieving such semantic queries (Konstantinou et al., 2010). Unlike traditional query languages, SPARQL does not operate at the graph level but rather models the graph as a set of triples (Konstantinou et al., 2010). Thus, a SPARQL query identifies a graph pattern, and the nodes that match this pattern are returned (Konstantinou et al., 2010; Zeng et al., 2013). SPARQL syntax is similar to SQL, the most striking similarity being the SELECT ... FROM ... WHERE structure (Konstantinou et al., 2010). The core of SPARQL is a conjunctive set of triple patterns called a "basic graph pattern" (Zeng et al., 2013). Table 1 shows the SPARQL syntax for retrieving data from RDF using a SELECT statement.
Table 1: Example
of SPARQL syntax (Sakr & Gaber, 2014)
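Since Table 1 only sketches the syntax, the following is a minimal runnable illustration using Apache Jena ARQ (mentioned later in this project); the data file and the FOAF name property are assumptions:

    import org.apache.jena.query.Query;
    import org.apache.jena.query.QueryExecution;
    import org.apache.jena.query.QueryExecutionFactory;
    import org.apache.jena.query.QueryFactory;
    import org.apache.jena.query.ResultSet;
    import org.apache.jena.query.ResultSetFormatter;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;

    public class SelectQueryDemo {
        public static void main(String[] args) {
            Model model = ModelFactory.createDefaultModel();
            model.read("people.nt", "N-TRIPLES");          // hypothetical data file

            // A basic graph pattern: one triple pattern with variables ?person and ?name.
            String queryString =
                "PREFIX foaf: <http://xmlns.com/foaf/0.1/> " +
                "SELECT ?person ?name WHERE { ?person foaf:name ?name }";

            Query query = QueryFactory.create(queryString);
            try (QueryExecution qexec = QueryExecutionFactory.create(query, model)) {
                ResultSet results = qexec.execSelect();
                ResultSetFormatter.out(System.out, results, query);  // print a result table
            }
        }
    }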
Although SPARQL syntax is similar to SQL
in the context of SELECT to retrieve data, SPARQL is not as mature as SQL (Konstantinou
et al., 2010).
The current form of SPARQL allows the access to the raw data using URIs
from RDF or OWL graph and letting the user perform the result processing (Konstantinou
et al., 2010).
However, SPARQL is expected to be the gateway to query information and
knowledge supporting as many features as SQL does (Konstantinou
et al., 2010).
SPARQL does not support any aggregated functions such as MAX, MIN, SUM,
AVG, COUNT, and the GROUP BY operations (Konstantinou
et al., 2010).
Moreover, SPARQL supports ORDER BY only on a global level and not solely
on the OPTIONAL part of the query (Konstantinou
et al., 2010).
For mathematical operations, SPARQL does not extend its support beyond
the basic mathematical operations (Konstantinou
et al., 2010).
SPARQL does not support the nested queries, meaning it does not allow
CONSTRUCT query in the FROM part of the query.
Moreover, SPARQL is missing the functionality offered by SELECT WHERE
LIKE statement in SQL, allowing for keyword-based queries (Konstantinou
et al., 2010).
While SPARQL offers regex() function for string pattern matching, it
cannot emulate the functionality of the LIKE operator (Konstantinou
et al., 2010).
SPARQL allows only unbound variables in the SELECT part and rejects the use of functions or other operators there (Konstantinou et al., 2010). This limitation leaves SPARQL as an elementary query language in which only URIs or literals are returned, while users expect some result processing in practical use cases (Konstantinou et al., 2010). SPARQL could be enhanced with these missing features and with functionality such as stored procedures, triggers, and data manipulation operations such as update, insert, and delete (Konstantinou et al., 2010).
The SPARQL Working Group is working on integrating these missing features into SPARQL. SPARQL/Update is an extension to SPARQL, included in the leading Semantic Web development framework "Jena," that allows update operations and the creation and removal of RDF graphs (Konstantinou et al., 2010). ARQ is a query engine for Jena
which supports the SPARQL RDF Query language (Apache,
2017a). Some of the key features of the ARQ include
the update, the GROUP BY, access and extension of the SPARQL algebra, and
support for the federated query (Apache,
2017a). LARQ integrates SPARQL with Apache’s
full-text search framework Lucene (Konstantinou
et al., 2010) adding free text searches to SPARQL (Apache,
2017b). The SPARQL+ extension of the ARC RDF store offers most of the common aggregates and extends SPARUL's INSERT with a CONSTRUCT clause (Konstantinou et al., 2010). OpenLink's Virtuoso extends SPARQL with aggregate functions, nesting, and subqueries, allowing the user to embed SPARQL queries inside SQL (Konstantinou
et al., 2010).
SPASQL offers a similar functionality embedding SPARQL into SQL (Konstantinou
et al., 2010).
Although
SPARQL is missing a lot of SQL features, it offers other features which are not
part of SQL (Konstantinou
et al., 2010).
One such feature is the OPTIONAL operator, which does not eliminate results when the optional pattern has no match and which is found in almost all query languages for RDF (Konstantinou et al., 2010). This feature is equivalent to the LEFT OUTER JOIN in SQL (Konstantinou et al., 2010).
However, SPARQL syntax is much more user-friendly and intuitive than SQL
(Konstantinou
et al., 2010).
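A short sketch of the OPTIONAL behavior described above (reusing the hypothetical Jena setup from the earlier example; the foaf:mbox property is an assumption): people without an email address still appear in the result with ?mbox unbound, exactly as in a left outer join.

    import org.apache.jena.query.QueryExecution;
    import org.apache.jena.query.QueryExecutionFactory;
    import org.apache.jena.query.ResultSetFormatter;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;

    public class OptionalQueryDemo {
        public static void main(String[] args) {
            Model model = ModelFactory.createDefaultModel();
            model.read("people.nt", "N-TRIPLES");          // hypothetical data file

            String queryString =
                "PREFIX foaf: <http://xmlns.com/foaf/0.1/> " +
                "SELECT ?name ?mbox WHERE { " +
                "  ?person foaf:name ?name . " +
                "  OPTIONAL { ?person foaf:mbox ?mbox } " + // a missing mbox does not drop the row
                "}";

            try (QueryExecution qexec = QueryExecutionFactory.create(queryString, model)) {
                ResultSetFormatter.out(System.out, qexec.execSelect());
            }
        }
    }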
Techniques Applied on MapReduce
To Improve RDF Query Processing
Performance
With
the explosive growth of the data size, the traditional approach of analyzing
the data in a centralized server is not adequate to scale up (Punnoose
et al., 2012; Sakr & Gaber, 2014), and cannot scale with the increasing size of RDF datasets (Sakr & Gaber, 2014).
Although SPARQL is used to query RDF data, the query of RDF dataset at
the web scale is challenging because the computation of SPARQL queries requires
several joins between subsets of the dataset (Sakr
& Gaber, 2014).
New methods are introduced to improve the parallel computing and allow
storage and retrieval of RDF across large compute clusters which enables
processing data of unprecedented magnitude (Punnoose
et al., 2012).
Various solutions are introduced to solve these challenges and achieve
scalable RDF processing using the MapReduce framework such as PigSPARQL, and
RDFPath.
1. RDFPath
In (Przyjaciel-Zablocki, Schätzle, Hornung, & Lausen, 2011), RDFPath is proposed as a declarative path query language for RDF that by design maps naturally to the MapReduce programming model while remaining extensible (Przyjaciel-Zablocki et al., 2011). It supports the exploration of graph properties such as the shortest connections between two nodes in an RDF graph (Przyjaciel-Zablocki et al., 2011) and is regarded as a valuable tool for the analysis of social graphs (Przyjaciel-Zablocki et al., 2011). RDFPath combines an intuitive syntax for path queries with an effective execution strategy based on MapReduce (Przyjaciel-Zablocki et al., 2011), and it benefits from the horizontal scaling properties of MapReduce: adding more nodes improves the overall execution time significantly (Przyjaciel-Zablocki et al., 2011). Using RDFPath, large RDF graphs can be handled while scaling linearly with the size of the graph, so RDFPath can be used to investigate graph properties such as a variant of the famous six degrees of separation paradigm typically encountered in social graphs (Przyjaciel-Zablocki et al., 2011). RDFPath focuses on path queries and studies their implementation based on MapReduce. There are various RDF query languages, such as RQL, SeRQL, RDQL, Triple, N3, Versa, RxPath, RPL, and SPARQL (Przyjaciel-Zablocki et al., 2011), and RDFPath has a competitive expressiveness compared with these languages (Przyjaciel-Zablocki et al., 2011). A comparison of RDFPath with these other RDF query languages shows that RDFPath has the same capabilities as SPARQL 1.1 for adjacent nodes, adjacent edges, the degree of a node, and fixed-length paths. RDFPath goes further than SPARQL 1.1 for the distance between two nodes and shortest paths, for which it offers partial support, whereas SPARQL 1.1 offers full support for aggregate functions while RDFPath offers only partial support (Przyjaciel-Zablocki et al., 2011). Table 2 shows the comparison of RDFPath with other RDF query languages including SPARQL.
Table 2: Comparison of RDF Query Language, adapted from (Przyjaciel-Zablocki et al., 2011).
2. PigSPARQL
PigSPARQL is regarded as a competitive yet easy-to-use SPARQL query processing system on MapReduce that allows ad-hoc SPARQL query processing on large RDF graphs out of the box (Schätzle, Przyjaciel-Zablocki, Hornung, & Lausen, 2013). PigSPARQL is described as a system for
processing SPARQL queries using the MapReduce framework by translating them
into Pig Latin programs where each Pig Latin program is executed by a series of
MapReduce jobs on a Hadoop cluster (Sakr & Gaber, 2014; Schätzle et al., 2013). PigSPARQL
utilizes the query language of Pig, which is a data analysis platform on top of
Hadoop MapReduce, as an intermediate layer between SPARQL and MapReduce (Schätzle et al., 2013). That intermediate layer provides an
abstraction level which makes PigSPARQL independent of Hadoop version and
accordingly ensures the compatibility to future changes of the Hadoop framework
as they will be covered by the underlying Pig layer (Schätzle et al., 2013). This intermediate layer of Pig Latin approach
provides the sustainability of PigSPARQL and is an attractive long-term
baseline for comparing various MapReduce based SPARQL implementations which are also underpinned by the competitiveness
with the existing systems such as HadoopRDF (Schätzle et al., 2013). As illustrated in Figure 3, the PigSPARQL
workflow begins with the SPARQL that is mapped to Pig Latin by parsing the
SPARQL query to generate an abstract syntax tree which is translated into a SPARQL Algebra tree (Schätzle et al., 2013). Several optimizations are applied on the
Algebra level like the early execution of filters and a re-arrangement of
triple patterns by selectivity (Schätzle et al., 2013). The optimized Algebra tree is traversed bottom-up,
and an equivalent sequence of Pig Latin
expressions are generated for every SPARQL Algebra operator (Schätzle et al., 2013). Pig automatically maps the resulting Pig
Latin script into a sequence of MapReduce iterations at the runtime (Schätzle et al., 2013).
PigSPARQL is described as an easy-to-use and competitive baseline for the comparison of MapReduce-based SPARQL processing. With its support for SPARQL 1.0, PigSPARQL exceeds the functionality of most existing research prototypes (Schätzle et al., 2013).
Figure 3: PigSPARQL Workflow From SPARQL to MapReduce, adapted from (Schätzle et al., 2013).
3. Interactive SPARQL Query Processing on Hadoop: Sempala
In (Schätzle, Przyjaciel-Zablocki, Neu, & Lausen, 2014), an interactive SPARQL query processing technique for Hadoop, "Sempala," is proposed. Sempala is a SPARQL-over-SQL-on-Hadoop approach designed for selective queries (Schätzle et al., 2014), and it shows significant performance improvements compared with existing approaches (Schätzle et al., 2014). The approach of Sempala is inspired by the SQL-on-Hadoop trend, in which several new systems have been developed for interactive SQL query processing, such as Hive, Shark, Presto, Phoenix, and Impala (Schätzle et al., 2014). Thus, Sempala, as a SPARQL-over-SQL approach, follows this trend and provides interactive-time SPARQL query processing on Hadoop (Schätzle et al., 2014). With Sempala, the RDF data is stored in a columnar layout on HDFS, and Impala, an open-source massively parallel processing (MPP) SQL query engine for Hadoop, serves as the execution layer on top (Schätzle et al., 2014). The architecture of Sempala is illustrated in Figure 4.
Figure 4: Sempala Architecture
adapted from (Schätzle et al., 2014).
The architecture of the proposed Sempala has two main components: the RDF Loader and the Query Compiler (Schätzle et al., 2014). The RDF Loader converts an RDF dataset into the data layout used by Sempala, while the Query Compiler rewrites a given SPARQL query into the SQL dialect of Impala based on that data layout (Schätzle et al., 2014). The Query Compiler of Sempala is based on the algebraic representation of SPARQL expressions defined by the W3C recommendation (Schätzle et al., 2014). Jena ARQ is used to parse a SPARQL query into the corresponding algebra tree (Schätzle et al., 2014), and some basic algebraic optimizations such as filter pushing are applied (Schätzle et al., 2014). The final step is to traverse the tree bottom-up and generate the equivalent Impala SQL expressions based on the unified property table layout (Schätzle et al., 2014). Sempala has been compared with other Hadoop-based systems such as Hive, PigSPARQL, MapMerge, and MAPSIN. Hive is the standard SQL warehouse for Hadoop based on MapReduce (Schätzle et al., 2014); because Impala is designed to be highly compatible with Hive, the same query can run on the same data with only minor syntactic modifications (Schätzle et al., 2014). Sempala follows a similar approach to PigSPARQL; however, PigSPARQL uses Pig as the underlying system and intermediate level between MapReduce and SPARQL (Schätzle et al., 2014). MapMerge is an efficient map-side merge join implementation for scalable SPARQL BGP ("basic graph pattern" (W3C, 2016)) processing that reduces the shuffling of data between the map and reduce phases in MapReduce (Schätzle et al., 2014). MAPSIN is an approach that uses HBase, the standard NoSQL database for Hadoop, to store RDF data and applies a map-side index nested loop join that avoids the reduce phase of MapReduce (Schätzle et al., 2014). The findings of (Schätzle et al., 2014) show that Sempala outperforms Hive and PigSPARQL, while MapMerge and MAPSIN could not be compared because they only support SPARQL BGP (Schätzle et al., 2014).
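To make the SPARQL-over-SQL idea more tangible, the following is a deliberately simplified illustration rather than Sempala's actual compiler (which targets Impala SQL over a unified property table): it rewrites a two-pattern basic graph pattern into SQL self-joins over a plain triples(s, p, o) table, assuming all patterns share the subject variable and have constant predicates.

    public class BgpToSqlSketch {

        /** One triple pattern: constants are stored as-is, variables start with '?'. */
        static final class Pattern {
            final String s, p, o;
            Pattern(String s, String p, String o) { this.s = s; this.p = p; this.o = o; }
        }

        // Rewrite a BGP whose patterns join on a shared subject variable.
        static String rewrite(Pattern[] bgp) {
            StringBuilder sql = new StringBuilder("SELECT t0.s AS subject");
            for (int i = 0; i < bgp.length; i++) sql.append(", t").append(i).append(".o AS o").append(i);
            sql.append(" FROM triples t0");
            for (int i = 1; i < bgp.length; i++)
                sql.append(" JOIN triples t").append(i).append(" ON t0.s = t").append(i).append(".s");
            sql.append(" WHERE ");
            for (int i = 0; i < bgp.length; i++) {
                if (i > 0) sql.append(" AND ");
                sql.append("t").append(i).append(".p = '").append(bgp[i].p).append("'");
            }
            return sql.toString();
        }

        public static void main(String[] args) {
            Pattern[] bgp = {
                new Pattern("?person", "foaf:name", "?name"),   // ?person foaf:name ?name
                new Pattern("?person", "foaf:age",  "?age")     // ?person foaf:age  ?age
            };
            // Prints one SQL statement with a self-join per additional triple pattern.
            System.out.println(rewrite(bgp));
        }
    }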
4. Map-Side Index Nested Loop Join (MAPSIN JOIN)
MapReduce faces challenges in processing joins because the datasets involved are very large (Sakr & Gaber, 2014). Two datasets can be joined using MapReduce, but they would have to be located on the same machine, which is not practical (Sakr & Gaber, 2014). Thus, solutions such as the reduce-side approach are used, and this is regarded as the most prominent and flexible join technique in MapReduce (Sakr & Gaber, 2014). The reduce-side approach is also known as a "repartition join" because the datasets are read in the map phase and repartitioned according to the join key in the shuffle phase, while the actual join computation is done in the reduce phase (Sakr & Gaber, 2014). The problem with this approach is that the datasets are transferred through the network regardless of the join output, which can consume a lot of network bandwidth and cause a bottleneck (Sakr & Gaber, 2014; Schätzle et al., 2013). Another solution, the map-side join, performs the actual join processing in the map phase to avoid the shuffle and reduce phases and to avoid transferring both datasets over the network (Sakr & Gaber, 2014). The most common variant is the map-side merge join, although it is hard to cascade, and when cascading is needed the advantage of avoiding the shuffle and reduce phases is lost (Sakr & Gaber, 2014). Thus, the MAPSIN approach is proposed: a map-side index nested loop join based on HBase (Sakr & Gaber, 2014; Schätzle et al., 2013). The MAPSIN join exploits the indexing capabilities of HBase, which improves the query performance of selective queries (Sakr & Gaber, 2014; Schätzle et al., 2013). It retains the flexibility of reduce-side joins while utilizing the effectiveness of a map-side join, without any modification to the underlying framework (Sakr & Gaber, 2014; Schätzle et al., 2013).
Comparing MAPSIN with PigSPARQL, MAPSIN performs faster than PigSPARQL when using a sophisticated storage schema based on HBase, which works well for selective queries but diminishes significantly in performance for less selective queries (Schätzle et al., 2013). However, MAPSIN does not support the queries of LUBM (the Lehigh University Benchmark) (W3C, 2016). The query runtime of MAPSIN is close to the runtime of the merge join approach (Schätzle et al., 2013).
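The reduce-side (repartition) join that MAPSIN tries to avoid can be sketched in plain Hadoop MapReduce as follows (an illustrative sketch, not code from the cited work; the input paths and the tab-separated record layout are assumptions): both inputs are tagged in the map phase, shuffled by join key, and combined in the reduce phase, which is exactly where the network cost arises.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class RepartitionJoin {

        // Both mappers emit (joinKey, taggedRecord); the whole record travels over the network.
        public static class LeftMapper extends Mapper<LongWritable, Text, Text, Text> {
            protected void map(LongWritable k, Text v, Context ctx) throws IOException, InterruptedException {
                String[] f = v.toString().split("\t", 2);                 // assumed layout: key \t payload
                if (f.length < 2) return;
                ctx.write(new Text(f[0]), new Text("L\t" + f[1]));
            }
        }
        public static class RightMapper extends Mapper<LongWritable, Text, Text, Text> {
            protected void map(LongWritable k, Text v, Context ctx) throws IOException, InterruptedException {
                String[] f = v.toString().split("\t", 2);
                if (f.length < 2) return;
                ctx.write(new Text(f[0]), new Text("R\t" + f[1]));
            }
        }

        // The actual join work happens here, after the shuffle has grouped records by key.
        public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
            protected void reduce(Text key, Iterable<Text> values, Context ctx)
                    throws IOException, InterruptedException {
                List<String> left = new ArrayList<>(), right = new ArrayList<>();
                for (Text t : values) {
                    String[] f = t.toString().split("\t", 2);
                    (f[0].equals("L") ? left : right).add(f[1]);
                }
                for (String l : left)
                    for (String r : right)
                        ctx.write(key, new Text(l + "\t" + r));           // cross product per key
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "repartition join");
            job.setJarByClass(RepartitionJoin.class);
            MultipleInputs.addInputPath(job, new Path("/data/left"), TextInputFormat.class, LeftMapper.class);
            MultipleInputs.addInputPath(job, new Path("/data/right"), TextInputFormat.class, RightMapper.class);
            job.setReducerClass(JoinReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileOutputFormat.setOutputPath(job, new Path("/out/joined"));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }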
5. HadoopRDF
HadoopRDF is proposed by (Tian, Du, Wang, Ni, & Yu, 2012) to combine the high fault tolerance and high throughput of the MapReduce distributed framework with the sophisticated indexing and query answering mechanisms of RDF stores (Tian et al., 2012). HadoopRDF is deployed on a Hadoop cluster with many computers, and each node in the cluster runs a Sesame server that supplies the service for storing and retrieving RDF data (Tian et al., 2012). HadoopRDF is a MapReduce-based RDF system that stores data directly in HDFS and does not require any modification to the Hadoop framework (Przyjaciel-Zablocki et al., 2013; Sakr & Gaber, 2014; Tian et al., 2012). The basic idea is to substitute the rudimentary HDFS, which lacks indexes and a query execution engine, with more elaborate RDF stores (Tian et al., 2012). The architecture of HadoopRDF is illustrated in Figure 5.
Figure 5: HadoopRDF Architecture,
adapted from (Tian et al., 2012).
The architecture of HadoopRDF is similar to the architecture of Hadoop, which scales up to thousands of nodes (Tian et al., 2012). The Hadoop framework is the core of HadoopRDF (Tian et al., 2012). Hadoop is built on top of HDFS, a replicated key-value store under the control of a central NameNode (Tian et al., 2012). Files in HDFS are broken into chunks of fixed size, and replicas of these chunks are distributed across a group of DataNodes (Tian et al., 2012); the NameNode tracks the size and location of each replica (Tian et al., 2012). Hadoop, as a MapReduce framework, is used for computation in data-intensive applications (Tian et al., 2012). In the architecture of HadoopRDF, the RDF stores are incorporated into the MapReduce framework. HadoopRDF is an advanced SPARQL engine that splits the original RDF graph according to predicates and objects and utilizes a cost-based query execution plan for the reduce-side join (Przyjaciel-Zablocki et al., 2013; Sakr & Gaber, 2014; Schätzle et al., 2013). HadoopRDF can re-balance automatically when the cluster size changes, but join processing is still done in the reduce phase (Przyjaciel-Zablocki et al., 2013; Sakr & Gaber, 2014). The findings of (M. Husain et al., 2011) indicate that HadoopRDF is more scalable and handles low-selectivity queries more efficiently than RDF-3X. Moreover, the results show that HadoopRDF is much more scalable than BigOWLIM and provides more efficient queries for large data sets (M. Husain et al., 2011). Like most systems, HadoopRDF requires a pre-processing phase (Przyjaciel-Zablocki et al., 2013; Sakr & Gaber, 2014).
6. RDF-3X: RDF Triple eXpress
RDF-3X is proposed by (Neumann & Weikum, 2008). The RDF-3X engine is an implementation of SPARQL that achieves excellent performance by pursuing a RISC-style, streamlined architecture (Neumann & Weikum, 2008). RISC stands for Reduced Instruction Set Computer, a type of microprocessor architecture that utilizes a small, highly optimized set of instructions rather than the more specialized instruction sets often found in other architectures (Neumann & Weikum, 2008). Thus, RDF-3X follows the RISC-style concept of a "reduced instruction set" designed to support RDF. RDF-3X is described as a generic solution for storing and indexing RDF triples that eliminates the need for physical-design tuning (Neumann & Weikum, 2008). RDF-3X provides a query optimizer for choosing optimal join orders using a cost model based on statistical synopses for entire join paths (Neumann & Weikum, 2008). It also provides a powerful and simple query processor that leverages fast merge joins for large-scale data (Neumann & Weikum, 2008). RDF-3X has three major components: the physical design, the query processor, and the query optimizer. The physical design is workload-independent, creating appropriate indexes over a single "giant triples table" (Neumann & Weikum, 2008). The query processor is RISC-style, relying mostly on merge joins over sorted index lists, and the query optimizer focuses on join order when generating the execution plan (Neumann & Weikum, 2008).
The findings of (Neumann & Weikum, 2008) showed that RDF-3X addresses the challenge of schema-free data and copes very well with data that exhibit a large diversity of property names (Neumann & Weikum, 2008). The optimizer of RDF-3X is known to produce efficient query execution plans (Galarraga, Hose, & Schenkel, 2014). RDF-3X maintains local indexes for all possible orders and combinations of the triple components, as well as for aggregations, which enables efficient local data access (Galarraga et al., 2014). RDF-3X does not support LUBM. RDF-3X is a single-node RDF store which builds indexes over all possible permutations of subject, predicate, and object (Huang, Abadi, & Ren, 2011; M. Husain et al., 2011; Schätzle et al., 2014; Zeng et al., 2013). RDF-3X is regarded as the fastest existing Semantic Web repository and the state-of-the-art "benchmark" engine for single-node machines (M. Husain et al., 2011; Przyjaciel-Zablocki et al., 2013). Thus, it outperforms other solutions for queries with bound objects and for aggregate queries (M. Husain et al., 2011). However, the performance of RDF-3X diminishes exponentially for unbound queries and for queries with even simple joins if the selectivity factor is low (M. Husain et al., 2011; Przyjaciel-Zablocki et al., 2013). The experiments of (M. Husain et al., 2011) showed that RDF-3X is not only slower for such queries but often aborts and cannot complete the query (M. Husain et al., 2011).
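The exhaustive indexing idea can be pictured with a tiny sketch (a conceptual illustration, not RDF-3X's storage code): each triple is materialized under all six orderings of its components, so any access pattern has a matching sorted index.

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class TriplePermutations {
        public static void main(String[] args) {
            String s = "<http://ex.org/alice>";
            String p = "<http://xmlns.com/foaf/0.1/knows>";
            String o = "<http://ex.org/bob>";

            // The six permutations used for exhaustive triple indexing.
            Map<String, String> indexKeys = new LinkedHashMap<>();
            indexKeys.put("SPO", s + " " + p + " " + o);
            indexKeys.put("SOP", s + " " + o + " " + p);
            indexKeys.put("PSO", p + " " + s + " " + o);
            indexKeys.put("POS", p + " " + o + " " + s);
            indexKeys.put("OSP", o + " " + s + " " + p);
            indexKeys.put("OPS", o + " " + p + " " + s);

            // Each key would be inserted into its own sorted index (e.g., a B+-tree) so that
            // any combination of bound/unbound positions can be answered by a range scan.
            indexKeys.forEach((order, key) -> System.out.println(order + " -> " + key));
        }
    }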
7. Rya: A Scalable RDF Triple Store for the Clouds
In (Punnoose et al., 2012), Rya is proposed as a new scalable system for storing and retrieving RDF data across cluster nodes. In Rya, the OWL model is treated as a set of triples that are stored in the triple store (Punnoose et al., 2012). Storing all the data in the triple store provides the benefit of using Hadoop MapReduce to run large batch processing jobs against the data set (Punnoose et al., 2012). The first phase of the process is performed only once, at the time the OWL model is loaded into Rya (Punnoose et al., 2012): a MapReduce job iterates through the entire graph of relationships and outputs the implicit relationships it finds as explicit RDF triples stored in the RDF store (Punnoose et al., 2012). The second phase is performed every time a query is run; once all explicit and implicit relationships are stored in Rya, the Rya query planner can expand the query at runtime to utilize all these relationships (Punnoose et al., 2012). A three-table index over the RDF triples is used to enhance performance (Punnoose et al., 2012). The results of (Punnoose, Crainiceanu, & Rapp, 2015) showed that Rya outperformed SHARD. Moreover, in comparison with the graph-partitioning algorithm introduced by (Huang et al., 2011), as reported in (Punnoose et al., 2015), Rya showed superior performance in many cases over Graph Partitioning (Punnoose et al., 2015).
Conclusion
This
project provided a survey on the state-of-the-art techniques for applying MapReduce to improve the RDF data query
processing performance. Tremendous effort from industry and researchers has been exerted to develop efficient and scalable RDF processing systems. The
project discussed the RDF framework and
the major building blocks of the semantic web architecture. The RDF store architecture and the MapReduce
Parallel Processing Framework and Hadoop are
discussed in this project.
Researchers have exerted effort in developing Semantic Web technologies
which have been standardized to address
the inadequacy of the current traditional
analytical techniques. This paper also
discussed the most prominent standardized
semantic web technologies RDF and
SPARQL. The project also discussed and analyzed in detail various techniques applied to MapReduce to improve RDF
query processing performance. These techniques include RDFPath, PigSPARQL,
Interactive SPARQL Query Processing on Hadoop (Sempala), Map-Side Index Nested
Loop Join (MAPSIN JOIN), HadoopRDF, RDF-3X (RDF Triple eXpress), and Rya (a
Scalable RDF Triple Store for the Clouds).
References
Apache, J. (2017a). ARQ - A SPARQL Processor for Jena.
Apache, J. (2017b). LARQ - Adding Free Text Searches to SPARQL.
Bakshi, K. (2012). Considerations for big data: Architecture and approach. Paper presented at the Aerospace Conference, 2012 IEEE.
Bao, Y., Ren, L., Zhang, L., Zhang, X., & Luo, Y. (2012). Massive sensor data management framework in cloud manufacturing based on Hadoop. Paper presented at the Industrial Informatics (INDIN), 2012 10th IEEE International Conference on.
Boussaid, O., Tanasescu, A., Bentayeb, F., & Darmont, J. (2007). Integration and dimensional modeling approaches for complex data warehousing. Journal of Global Optimization, 37(4), 571. doi:10.1007/s10898-006-9064-6
Choi, H., Son, J., Cho, Y., Sung, M. K., & Chung, Y. D. (2009). SPIDER: a system for scalable, parallel/distributed evaluation of large-scale RDF data. Paper presented at the Proceedings of the 18th ACM Conference on Information and Knowledge Management.
Cloud Security Alliance. (2013). Big Data Analytics for Security Intelligence. Big Data Working Group.
Connolly, T., & Begg, C. (2015). Database Systems: A Practical Approach to Design, Implementation, and Management (6th ed.). Pearson.
De Mauro, A., Greco, M., & Grimaldi, M. (2015). What is big data? A consensual definition and a review of key research topics. Paper presented at the AIP Conference Proceedings.
Fadzil, A. F. A., Khalid, N. E. A., & Manaf, M. (2012). Performance of scalable off-the-shelf hardware for data-intensive parallel processing using MapReduce. Paper presented at the Computing and Convergence Technology (ICCCT), 2012 7th International Conference on.
Firat, M., & Kuzu, A. (2011). Semantic web for e-learning bottlenecks: disorientation and cognitive overload. International Journal of Web & Semantic Technology, 2(4), 55.
Galarraga, L., Hose, K., & Schenkel, R. (2014). Partout: a distributed engine for efficient RDF processing. Paper presented at the Proceedings of the 23rd International Conference on World Wide Web.
Hu, H., Wen, Y., Chua, T., & Li, X. (2014). Toward Scalable Systems for Big Data Analytics: A Technology Tutorial. Practical Innovation, Open Solution, 2, 652-687. doi:10.1109/ACCESS.2014.2332453
Hu, P., & Dai, W. (2014). Enhancing fault tolerance based on Hadoop cluster. International Journal of Database Theory and Application, 7(1), 37-48.
Huang, J., Abadi, D. J., & Ren, K. (2011). Scalable SPARQL querying of large RDF graphs. Proceedings of the VLDB Endowment, 4(11), 1123-1134.
Husain, M., McGlothlin, J., Masud, M. M., Khan, L., & Thuraisingham, B. M. (2011). Heuristics-based query processing for large RDF graphs using cloud computing. IEEE Transactions on Knowledge and Data Engineering, 23(9), 1312-1327.
Husain, M. F., Khan, L., Kantarcioglu, M., & Thuraisingham, B. (2010). Data intensive query processing for large RDF graphs using cloud computing tools. Paper presented at the Cloud Computing (CLOUD), 2010 IEEE 3rd International Conference on.
Inukollu, V. N., Arsi, S., & Ravuri, S. R. (2014). Security Issues Associated with Big Data in Cloud Computing. International Journal of Network Security & Its Applications, 6(3), 45. doi:10.5121/ijnsa.2014.6304
Jinquan, D., Jie, H., Shengsheng, H., Yan, L., & Yuanhao, S. (2012). The Hadoop Stack: New Paradigm for Big Data Storage and Processing. Intel Technology Journal, 16(4), 92-110.
Khan, N., Yaqoob, I., Hashem, I. A. T., Inayat, Z., Ali, M. W. K., Alam, M., . . . Gani, A. (2014). Big Data: Survey, Technologies, Opportunities, and Challenges. The Scientific World Journal, 1-18. doi:10.1155/2014/712826
Konstantinou, N., Spanos, D.-E., Stavrou, P., & Mitrou, N. (2010). Technically approaching the semantic web bottleneck. International Journal of Web Engineering and Technology, 6(1), 83-111.
Krishnan, K. (2013). Data warehousing in the age of big data. Newnes.
Lee, K.-H., Lee, Y.-J., Choi, H., Chung, Y. D., & Moon, B. (2012). Parallel data processing with MapReduce: a survey. ACM SIGMOD Record, 40(4), 11-20.
Mishra, B. S. P., Dehuri, S., & Kim, E. (2016). Techniques and Environments for Big Data Analysis: Parallel, Cloud, and Grid Computing (Vol. 17). Springer.
Modoni, G. E., Sacco, M., & Terkaj, W. (2014). A survey of RDF store solutions. Paper presented at the Engineering, Technology and Innovation (ICE), 2014 International ICE Conference on.
Myung, J., Yeon, J., & Lee, S.-g. (2010). SPARQL basic graph pattern processing with iterative MapReduce. Paper presented at the Proceedings of the 2010 Workshop on Massive Data Analytics on the Cloud.
Neumann, T., & Weikum, G. (2008). RDF-3X: a RISC-style engine for RDF. Proceedings of the VLDB Endowment, 1(1), 647-659.
Nicolaidis, I., & Iniewski, K. (2017). Building Sensor Networks. CRC Press.
Polato, I., Ré, R., Goldman, A., & Kon, F. (2014). A comprehensive view of Hadoop research - A systematic literature review. Journal of Network and Computer Applications, 46, 1-25.
Przyjaciel-Zablocki, M., Schätzle, A., Hornung, T., & Lausen, G. (2011). RDFPath: Path query processing on large RDF graphs with MapReduce. Paper presented at the Extended Semantic Web Conference.
Przyjaciel-Zablocki, M., Schätzle, A., Skaley, E., Hornung, T., & Lausen, G. (2013). Map-side merge joins for scalable SPARQL BGP processing. Paper presented at the Cloud Computing Technology and Science (CloudCom), 2013 IEEE 5th International Conference on.
Punnoose, R., Crainiceanu, A., & Rapp, D. (2012). Rya: a scalable RDF triple store for the clouds. Paper presented at the Proceedings of the 1st International Workshop on Cloud Intelligence.
Punnoose, R., Crainiceanu, A., & Rapp, D. (2015). SPARQL in the cloud using Rya. Information Systems, 48, 181-195.
Sakr, S., & Gaber, M. (2014). Large Scale and Big Data: Processing and Management. CRC Press.
Schätzle, A., Przyjaciel-Zablocki, M., Hornung, T., & Lausen, G. (2013). PigSPARQL: a SPARQL query processing baseline for big data. Paper presented at the Proceedings of the 12th International Semantic Web Conference (Posters & Demonstrations Track), Volume 1035.
Schätzle, A., Przyjaciel-Zablocki, M., Neu, A., & Lausen, G. (2014). Sempala: interactive SPARQL query processing on Hadoop. Paper presented at the International Semantic Web Conference.
Sun, Y., & Jara, A. J. (2014). An extensible and active semantic model of information organizing for the Internet of Things. Personal and Ubiquitous Computing, 18(8), 1821-1833. doi:10.1007/s00779-014-0786-z
Tian, Y., Du, J., Wang, H., Ni, Y., & Yu, Y. (2012). HadoopRDF: A scalable RDF data analysis system. 8th ICIC, 633-641.
Tiwana, A., & Balasubramaniam, R. (2001). Integrating knowledge on the web. IEEE Internet Computing, 5(3), 32-39.
White, T. (2012). Hadoop: The Definitive Guide. O'Reilly Media, Inc.
Yang, H.-c., Dasdan, A., Hsiao, R.-L., & Parker, D. S. (2007). Map-reduce-merge: simplified relational data processing on large clusters. Paper presented at the Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data.
Zeng, K., Yang, J., Wang, H., Shao, B., & Wang, Z. (2013). A distributed graph engine for web scale RDF data. Paper presented at the Proceedings of the VLDB Endowment.
Hadoop was developed by Yahoo and Apache to run jobs over hundreds of terabytes of data (Yan, Yang, Yu, Li, & Li, 2012). Various large corporations such as Facebook and Amazon have used Hadoop because it offers high efficiency, high scalability, and high reliability (Yan et al., 2012). Hadoop has faced various limitations, such as its low-level programming paradigm and schema, strictly batch processing, time skew, and the lack of incremental computation (Alam & Ahmed, 2014). Incremental computation is regarded as one of the major shortcomings of the Hadoop technology (Alam & Ahmed, 2014). Efficiency in handling incremental data typically comes at the expense of losing compatibility with the programming models offered by non-incremental systems such as MapReduce, because it requires implementing incremental algorithms and increases the complexity of the algorithms and the code (Alam & Ahmed, 2014). A caching technique is proposed by (Alam & Ahmed, 2014) as a solution; this caching operates at three levels: the job, the task, and the hardware (Alam & Ahmed, 2014).
Incoop is another solution, proposed by (Bhatotia, Wieder, Rodrigues, Acar, & Pasquin, 2011). Incoop extends the open-source Hadoop implementation of the MapReduce programming paradigm to run unmodified MapReduce programs incrementally (Bhatotia et al., 2011; Sakr & Gaber, 2014). Incoop allows programmers to run their MapReduce programs incrementally and automatically, without any modification to the code (Bhatotia et al., 2011; Sakr & Gaber, 2014). Moreover, information about previously executed MapReduce tasks is recorded by Incoop and reused in subsequent MapReduce computations when possible (Bhatotia et al., 2011; Sakr & Gaber, 2014).
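To make this reuse of recorded task results concrete, here is a minimal, hedged sketch of task-level memoization in Python (not Incoop's actual implementation): a task's output is cached under a digest of its input split, so an unchanged split is never recomputed. The cache location and the word-count task are invented for illustration.

```python
import hashlib
import os
import pickle

# Illustrative sketch only (not Incoop's code): cache a task's output keyed by a
# digest of its input split, so an unchanged split is not recomputed.
CACHE_DIR = "/tmp/task_memo"          # hypothetical cache location
os.makedirs(CACHE_DIR, exist_ok=True)

def run_task_memoized(task_fn, input_split: bytes):
    key = hashlib.sha256(input_split).hexdigest()
    path = os.path.join(CACHE_DIR, key)
    if os.path.exists(path):          # split unchanged -> reuse stored result
        with open(path, "rb") as f:
            return pickle.load(f)
    result = task_fn(input_split)     # split new or changed -> recompute
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result

# Example: a toy "map task" that counts words in its split.
def word_count_task(split: bytes):
    counts = {}
    for word in split.decode().split():
        counts[word] = counts.get(word, 0) + 1
    return counts

print(run_task_memoized(word_count_task, b"big data big hadoop"))
```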
Incoop is not a perfect solution, and it has some shortcomings which are addressed by (Sakr & Gaber, 2014; Zhang, Chen, Wang, & Yu, 2015). Several enhancements are implemented in Incoop, including an incremental HDFS called Inc-HDFS, a Contraction Phase, and a “Memoization-aware Scheduler” (Sakr & Gaber, 2014). Inc-HDFS detects the delta between the inputs of two consecutive job runs and splits the input based on its contents rather than fixed sizes, while maintaining compatibility with HDFS (Sakr & Gaber, 2014). The Contraction Phase is a new phase in the MapReduce framework that breaks the Reduce tasks into smaller sub-computations forming an inverted tree, so that when a small portion of the input changes, only the path from the corresponding leaf to the root needs to be recomputed (Sakr & Gaber, 2014). The Memoization-aware Scheduler is a modified version of the Hadoop scheduler that takes advantage of the locality of memoized results (Sakr & Gaber, 2014).
Another solution, i2MapReduce, was proposed and compared against Incoop by (Zhang et al., 2015). i2MapReduce does not perform task-level recomputation but rather key-value pair level incremental processing. This solution also supports the more complex iterative computation used in data mining and reduces the I/O overhead by applying various techniques (Zhang et al., 2015). IncMR is an enhanced framework for large-scale incremental data processing (Yan et al., 2012). It inherits the simplicity of standard MapReduce, does not modify HDFS, and utilizes the same MapReduce APIs (Yan et al., 2012). When using IncMR, all programs can perform incremental data processing without any modification (Yan et al., 2012).
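By contrast, key-value-pair-level incremental processing can be sketched roughly as follows (an illustration in the spirit of i2MapReduce, not the published system): only the keys touched by new input records are re-reduced. The toy map and reduce functions and the in-memory state store are assumptions made for the example.

```python
from collections import defaultdict

# Illustrative sketch: keep per-key intermediate state so that a small change in
# the input only triggers re-reduction of the keys it touches.
state = defaultdict(list)          # key -> list of intermediate values (toy state store)

def map_fn(record):
    return [(word, 1) for word in record.split()]   # toy map: emit (word, 1) pairs

def reduce_fn(key, values):
    return sum(values)

def initial_run(records):
    for rec in records:
        for k, v in map_fn(rec):
            state[k].append(v)
    return {k: reduce_fn(k, vs) for k, vs in state.items()}

def incremental_update(new_records):
    touched = set()
    for rec in new_records:                          # only the delta is mapped
        for k, v in map_fn(rec):
            state[k].append(v)
            touched.add(k)
    return {k: reduce_fn(k, state[k]) for k in touched}   # re-reduce affected keys only

print(initial_run(["big data", "big hadoop"]))
print(incremental_update(["data lake"]))             # recomputes 'data' and 'lake' only
```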
In conclusion, various efforts have been made by researchers to overcome the incremental computation limitation of Hadoop, such as Incoop, Inc-HDFS, i2MapReduce, and IncMR. Each proposed solution is an attempt to enhance and extend standard Hadoop so as to avoid overheads such as I/O and to increase efficiency, without increasing the complexity of the computation and without requiring any modification to the code.
References
Alam, A., & Ahmed, J. (2014). Hadoop architecture and its issues.
Paper presented at the Computational Science and Computational Intelligence
(CSCI), 2014 International Conference on.
Bhatotia, P.,
Wieder, A., Rodrigues, R., Acar, U. A., & Pasquin, R. (2011). Incoop: MapReduce for incremental
computations. Paper presented at the Proceedings of the 2nd ACM Symposium
on Cloud Computing.
Sakr, S., &
Gaber, M. (2014). Large Scale and big
data: Processing and Management: CRC Press.
Yan, C., Yang,
X., Yu, Z., Li, M., & Li, X. (2012). IncMR:
Incremental data processing based on MapReduce. Paper presented at the
Cloud Computing (CLOUD), 2012 IEEE 5th International Conference on.
Zhang,
Y., Chen, S., Wang, Q., & Yu, G. (2015). i^2MapReduce: Incremental
MapReduce for Mining Evolving Big Data. IEEE
transactions on knowledge and data engineering, 27(7), 1906-1919.
Keywords: World Wide Web, Web, Performance
Bottlenecks, Large-Scale Data.
Introduction
The journey of the Web started with Web 1.0, known as the “Web of Information Connections,” followed by Web 2.0, which is known as the “Web of People Connections” (Aghaei, Nematbakhsh, & Farsani, 2012). Web 3.0 is known as the “Web of Knowledge Connections,” and Web 4.0 is known as the “Web of Intelligence Connections” (Aghaei et al., 2012). Web 5.0 is known as the “Symbionet Web” (Patel, 2013).
This project discusses this journey of the Web from Web 1.0 through Web 5.0. The inception of the World Wide Web (W3), known as the “Web,” goes back to 1989 when Berners-Lee introduced it through a project for CERN. The underlying concept behind the Web is the hypertext paradigm, which links documents together. The project was the starting point for changing the way we communicate with each other.
Web
1.0 is the first generation of the Web, and it was read-only, with no
interaction with the users. It was used to broadcast information to the
users. It used the simple framework of
Client/Server which is known as a single point of failure. Web 2.0 was introduced in 2004 by Tim
O’Reilly as a platform which allows read-write interaction by users. The topology of Web 2.0 is Peer-To-Peer, to avoid the single point of failure in Web 1.0. All nodes serve as both server and client and have the same capabilities to respond to users’ requests. The Peer-to-Peer relationship in Web 2.0 is also described as Master/Slave. Web 3.0 was introduced
by Markoff of New York Times in 2006 and is known as the Semantic Web. Berners-Lee introduced the concept of the
Semantic Web in 2001. The Semantic Web
has a layered architecture including URIs, RDF, Ontology Web Language (OWL),
XML and other components. Web 4.0 and
Web 5.0 are still in progress. Web 4.0
is known as the “Intelligent Web,” while Web 5.0 is known as “Symbionet
Web.” It is expected that the Artificial
Intelligence will play a key role in Web 4.0, and consequently Web 5.0. The project addressed the main sources for
the large-scale data in each Web generation.
Moreover, the key technologies and the underlying architecture and
framework for each Web generation are also discussed in this paper.
The project also discussed and
analyzed the bottleneck and the performance of the Web for each Web generation
from Client/Server simple topology of Web 1.0 to Peer-To-Peer of Web 2.0
topology, to the layered topology of Semantic Web of Web 3.0. Moreover, the bottleneck is also discussed
for Web 4.0 using the Internet of Things technology. Each generation has added tremendous value to
our lives and how we communicate with each other.
Web 1.0 – The Universal
Access to Read-Only Static Pages
Web
1.0 is the first generation of the World Wide Web (W3), known as the Web (Aghaei
et al., 2012; Choudhury, 2014; Kujur & Chhetri, 2015; Patel, 2013).
The inception of the W3 project took place at CERN in 1989 to enhance
the effectiveness of the CERN communication system (Kujur
& Chhetri, 2015).
Berners-Lee realized this hypertext paradigm could be applied globally (Kujur
& Chhetri, 2015). The W3 project allowed access to the
information online (T.
J. Berners-Lee, 1992).
The term “Web” derives from the resemblance of its structure of interconnected links to a spider’s web (T. J. Berners-Lee, 1992).
The underlying concept of W3 was the hypertext paradigm, through which documents refer to each other using links (T. J. Berners-Lee, 1992).
The user can access the document from that link universally (T.
J. Berners-Lee, 1992).
The user can create documents and link them into the Web using the
hypertext (T.
J. Berners-Lee, 1992).
If the data is stored in a database, the server can be modified to access the database from W3 clients and present it to the Web, as is the case with the generic Oracle server using the SQL SELECT statement (T. J. Berners-Lee, 1992).
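A rough sketch of this database-gateway idea is shown below, using Python's built-in sqlite3 and http.server modules in place of the Oracle server Berners-Lee describes; the table, its contents, and the port are invented for illustration.

```python
import sqlite3
from http.server import BaseHTTPRequestHandler, HTTPServer

# Toy in-memory database standing in for the back-end the gateway would query.
db = sqlite3.connect(":memory:", check_same_thread=False)
db.execute("CREATE TABLE phonebook (name TEXT, extension TEXT)")
db.execute("INSERT INTO phonebook VALUES ('Alice', '1234'), ('Bob', '5678')")

class GatewayHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Run a SELECT and render the rows as a simple hypertext page.
        rows = db.execute("SELECT name, extension FROM phonebook").fetchall()
        body = "<html><body><ul>" + "".join(
            f"<li>{name}: {ext}</li>" for name, ext in rows) + "</ul></body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(body.encode())

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), GatewayHandler).serve_forever()
```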
Large sets of structured data such as the database cannot be handled by
Web 1.0 hypertext alone (T.
J. Berners-Lee, 1992).
The solution for this limitation is to add “search” functionality to the
hypertext paradigm (T.
J. Berners-Lee, 1992).
Indexes, which are regarded as “special documents,” can be used for the search: the user provides a keyword, and the result is a “special document,” or “index,” that links to the documents found for that keyword (T. J. Berners-Lee, 1992). The phone book was the first document published on the Web (T. Berners-Lee, 1996; T. J. Berners-Lee, 1992).
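The following minimal sketch illustrates such a keyword index (the document names and URLs are hypothetical, and this is not Berners-Lee's implementation): a keyword lookup returns an "index" page whose only content is links to the matching documents.

```python
# Toy inverted index: keyword -> list of (title, URL) pairs for matching documents.
INDEX = {
    "phone": [("CERN phone book", "http://example.org/phonebook")],
    "physics": [("Experiment notes", "http://example.org/notes"),
                ("CERN phone book", "http://example.org/phonebook")],
}

def index_document(keyword: str) -> str:
    """Return the 'special document' for a keyword: a page made only of links."""
    hits = INDEX.get(keyword.lower(), [])
    links = "".join(f'<li><a href="{url}">{title}</a></li>' for title, url in hits)
    return f"<html><body><h1>Index: {keyword}</h1><ul>{links}</ul></body></html>"

print(index_document("physics"))
```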
Berners-Lee
is regarded to be the innovator of the Web 1.0 (Aghaei
et al., 2012; Choudhury, 2014; Kujur & Chhetri, 2015). Web 1.0 is defined as “It is an information
space in which the items of interest referred to as resources are identified by
a global identifier called as Uniform Resource Identifiers (URIs)” (Kujur
& Chhetri, 2015; Patel, 2013). Web
1.0 was read-only passive Web with no interaction with the websites (Choudhury,
2014).
The content of the data and the data
management are the sole responsibility of the webmaster. In the
1990s, the data sources included digital technology and database systems, which organizations widely adopted to store large amounts of data, such as bank
trading transactions, shopping mall records, and government sector archives (Hu,
Wen, Chua, & Li, 2014). Companies such as Google and Yahoo began to
develop search functions and portals to information for users (Kambil,
2008). Web 1.0 lasted from 1989 until 2005 (Kujur
& Chhetri, 2015; Patel, 2013). Examples of Web 1.0 include “Britannica Online,” which provides read-only information (Loretz,
2017).
Key
Technologies of Web 1.0
The key protocols
for Web 1.0 are HyperText Markup Language (HTML), HyperText Transfer Protocol
(HTTP), and Universal Resource Identifier (URI) (Choudhury,
2014; Patel, 2013).
Newer technologies include XML, XHTML, and the Cascading Style Sheet (CSS). Server-side scripting includes ASP, PHP, JSP, CGI, and Perl (Patel, 2013). Client-side scripting includes JavaScript, VBScript, and Flash (Patel, 2013).
Web
1.0 Architecture
The underlying architecture of Web 1.0 is the Client/Server topology, where the server retrieves data in response to requests from the client, on which the browser resides (T. J. Berners-Lee, 1992). A common library of network information-access code is shared by the clients (T. J. Berners-Lee, 1992). The servers that existed at the time of the W3 project were Files, VMS/Help, Oracle, and GNU Info (T. J. Berners-Lee, 1992).
The Client/Server topology is described as a single point of failure as
there is a total dependency on the server (Markatos,
2002).
Web
1.0 Performance Bottleneck
Web
1.0 framework consists of a web server and clients which are connected to the server using the Internet (Mosberger
& Jin, 1998).
HTTP is the protocol that connects the client and the server (Mosberger
& Jin, 1998).
In (Mosberger
& Jin, 1998), the httperf tool was used to test the load and the performance of a web server responding to several client requests
(Mosberger
& Jin, 1998).
The httperf tool has two main goals (Mosberger
& Jin, 1998). The first goal is good and predictable performance. The second goal is “ease of extensibility” (Mosberger
& Jin, 1998).
Load sustainability is a key factor in a good performance of the web
server (Mosberger
& Jin, 1998).
There are various client performance limits which should not be regarded
as server performance limits (Mosberger
& Jin, 1998).
The client CPU imposes a limitation on the client (Mosberger
& Jin, 1998).
Another limit is the size of the Transmission Control Protocol (TCP) port space, whose port numbers are sixteen bits wide. Privileged processes reserve 1,024 of the 64K available ports, and a port cannot be reused until it expires; the TCP TIME_WAIT state thus plays a role, since a port can be reused only after that state expires (Mosberger & Jin, 1998). This scenario can place a serious limit and bottleneck on the client’s sustainable offered rate (Mosberger & Jin, 1998). With a one-minute timeout, the sustainable rate is about 1,075 requests per second. However, with the RFC-793 recommended value of a four-minute timeout, the maximum rate would be about 268 requests per second instead (Mosberger & Jin, 1998).
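These figures follow from dividing the usable port space by the TIME_WAIT duration; the short sketch below reproduces that back-of-the-envelope arithmetic (an illustration, not code from httperf).

```python
# Back-of-the-envelope: sustainable request rate limited by TCP port reuse.
total_ports = 2 ** 16          # sixteen-bit port numbers
reserved = 1024                # ports reserved for privileged processes
usable = total_ports - reserved

for time_wait_s in (60, 240):  # one-minute timeout vs. RFC-793 four-minute timeout
    rate = usable / time_wait_s
    print(f"TIME_WAIT = {time_wait_s:>3}s -> about {rate:.0f} requests/second")
# Prints roughly 1,075 and 269 requests/second, consistent with the figures cited above.
```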
The total and the per-process number of file descriptors that can be
opened are limited in most operating systems (Mosberger
& Jin, 1998).
The “per-process limit” typically ranges from 256 to 2,048. Since a file descriptor cannot be reused until it is closed, the httperf timeout value plays a role in the number of open file descriptors. If the client runs into this bottleneck, the operating system of the client can be tuned to increase the limit on open file descriptors (Mosberger
& Jin, 1998).
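On a Unix-like client, this kind of tuning can be inspected and, within the hard limit, adjusted from the process itself; the sketch below uses Python's resource module with an arbitrary target of 4,096 descriptors purely as an illustration.

```python
import resource

# Inspect and, if permitted, raise the per-process open-file-descriptor limit on a
# Unix-like client; a sketch of the OS tuning described above, not httperf code.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"current file-descriptor limits: soft={soft}, hard={hard}")

desired = 4096                                           # illustrative target value
new_soft = desired if hard == resource.RLIM_INFINITY else min(desired, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
print(f"soft limit now {resource.getrlimit(resource.RLIMIT_NOFILE)[0]}")
```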
The TCP connection typically has
a “socket receive” and “send buffer” (Mosberger
& Jin, 1998).
The load a client can generate is limited by the client memory available for the “socket receive” buffers (Mosberger
& Jin, 1998).
Concurrent TCP connections can cause a bottleneck and poor performance (Mosberger
& Jin, 1998).
For
the server side performance, the granularity of the process scheduling in
operating systems is measured in millisecond ranges and plays a key role in the
performance of the server responding to several requests from several clients (Mosberger
& Jin, 1998).
Tools such as httperf check network activity for input and output using the select() system call and keep track of real time using the gettimeofday() function (Mosberger
& Jin, 1998).
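A hedged sketch of that select()-plus-timestamp pattern is shown below (the host, port, and request are placeholders, and this is not httperf's code): sockets are polled for readiness and each read is timestamped.

```python
import select
import socket
import time

# Watch a socket for readiness and timestamp events, without blocking on it.
sock = socket.create_connection(("example.org", 80))      # placeholder host/port
sock.sendall(b"GET / HTTP/1.0\r\nHost: example.org\r\n\r\n")
sock.setblocking(False)

start = time.time()                                        # analogous to gettimeofday()
while True:
    readable, _, _ = select.select([sock], [], [], 5.0)    # 5-second poll timeout
    if not readable:                                       # idle too long: give up
        break
    data = sock.recv(4096)
    if not data:                                           # server closed the connection
        break
    print(f"{time.time() - start:6.3f}s  received {len(data)} bytes")
sock.close()
```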
The ephemeral ports are limited, typically ranging from 1,024 to 5,000, which can cause a problem when the client runs out of ports (Mosberger
& Jin, 1998).
Tools such as httperf re-use the ports as soon as they are
released. However, incompatibilities between the TCP implementations of Unix and NT broke this solution: Unix allows the TIME_WAIT state to be pre-empted upon the arrival of a SYN segment, while NT does not (Mosberger
& Jin, 1998).
Allocating the ports in a round-robin manner solves this problem. Several thousand TCP control blocks can also cause slow system calls (Mosberger & Jin, 1998). A hash table is usually used to look up the TCP control blocks for incoming network traffic, which keeps the lookup time acceptable (Mosberger & Jin, 1998). However, some BSD-derived systems still use a linear control-block search for the bind() and connect() system calls, which slows these system calls down when many control blocks linger. The workaround lies in how the connection is closed; thus, tools such as httperf close the connection by using RESET (Mosberger & Jin, 1998).
Web 2.0 – The Universal
Access to Read-Write Web Pages
In 2004, Dale Dougherty, the Vice-President of O’Reilly Media, described Web 2.0 as the read-write web (Aghaei et al., 2012; Choudhury, 2014; Kujur & Chhetri, 2015; Patel, 2013). Web 2.0 is defined by Tim O’Reilly, as cited in (Aghaei et al., 2012; Choudhury, 2014; Kujur & Chhetri, 2015; Miller, 2008), as follows: “Web 2.0 is the business revolution in the computer industry caused by the move to the internet as platform, and an attempt to understand the rules for success on that new platform. Chief among those rules is this: Build applications that harness network effects to get better the more people use them.” Others defined Web 2.0 as a “transition” from Web 1.0, where information is isolated, to computing platforms that are interlinked together and function as local software for the user (Miller, 2008). In an attempt to differentiate between Cloud Computing and Web 2.0, Tim O’Reilly stated, as cited in (Miller, 2008), “Cloud computing refers specifically to the use of the Internet as a computing platform; Web 2.0, as I’ve defined it, is an attempt to explore and explain the business rules of that platform.”
Web 2.0 has shifted Web 1.0 not only from read-only to be read-write but also to be technology centric, business-centric and user-centric (Choudhury, 2014). The technology-centric is found in the platform concept of the Web 2.0 which is different from a client/server framework of Web 1.0. The platform technology is associated with blogs, wikis, and Really Simple Syndication (RSS) feeds (Choudhury, 2014). The business-centric concept is reflected in the shift to the internet as a platform and comprehending the key success factors using this new platform concept on the internet (Choudhury, 2014). The user-centric concept is the shift from companies publishing content for read, to communities of users who are interacting and communicating with each other using the new platform on the internet (Choudhury, 2014). Tim O’Reilly identified a list of the differences between Web 1.0 and Web 2.0 (O’Reilly, 2007) (see Figure 1).
Figure 1: Web 1.0 vs. Web 2.0 Examples (O’Reilly, 2007).
Web
2.0 has other attributes such as “wisdom web,” “people-centric web,” and
“participative web” with read and write capabilities (Aghaei et al., 2012; Choudhury, 2014; Kujur & Chhetri, 2015; Patel, 2013). With Web 2.0, the user can have
flexible web design, updates, collaborative content creation and
modification (Aghaei
et al., 2012; Patel, 2013). The support for collaboration is one of the
major characteristics of Web 2.0, where people can share data (Patel,
2013).
Examples of Web 2.0 implementation
include social networks such as MySpace, Facebook, and Twitter, and media sharing such as YouTube (Patel, 2013). Thus, the data for Web 2.0 is generated from all these sources, such as MySpace, Facebook, Twitter, and so forth. With
Web 2.0, the data is growing very fast and entering a new level of “Petabyte
age” (Demirkan
& Delen, 2013).
Key
Technologies of Web 2.0
Web
2.0 utilized key technologies to allow people to communicate with each other
through the new platform on the internet (Aghaei
et al., 2012).
The key technologies of Web 2.0 include RSS, Blogs, Mashups, Tags,
Folksonomy, and Tag Clouds (Aghaei
et al., 2012).
Three development approaches are used to create Web 2.0 applications: Asynchronous JavaScript and XML (AJAX), Flex,
and Google Web Toolkit (Aghaei
et al., 2012).
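As a small illustration of one of these technologies, the sketch below parses a minimal RSS feed with Python's standard xml.etree module; the feed content is invented for the example.

```python
import xml.etree.ElementTree as ET

# A tiny, invented RSS 2.0 feed used only to illustrate the format.
RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item><title>First post</title><link>http://example.org/1</link></item>
    <item><title>Second post</title><link>http://example.org/2</link></item>
  </channel>
</rss>"""

root = ET.fromstring(RSS)
for item in root.findall("./channel/item"):       # each <item> is one feed entry
    print(item.findtext("title"), "->", item.findtext("link"))
```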
Web
2.0 Architecture
Web 2.0 has
various architecture patterns (Governor,
Hinchcliffe, & Nickull, 2009). There are three main levels of Web 2.0 architecture patterns, ranging from the most concrete to the most abstract, the latter being high-level design patterns (Governor
et al., 2009).
Some of these Web 2.0 architecture patterns include Service-Oriented
Architecture (SOA), Software as a Service (SaaS), Participation-Collaboration,
Asynchronous Particle Update, Mashup,
Rich User Experience (RUE), the Synchronized Web, Collaborative Tagging, Declarative
Living and Tag Gardening, Semantic Web Grounding, Persistent Rights Management,
and Structured Information (Governor
et al., 2009).
Web
2.0 Performance Bottleneck
In Web 1.0, the
Client/Server architecture allowed users to access the data through the internet. However, the users had the
experience of “wait” (Miller,
2008). As discussed earlier, the client has
bottlenecks in addition to the server’s bottleneck. The issue in Web 1.0 is that all
communications among computers had to go through the server first (Miller,
2008). Due to the requirement that every client pass through the server first, the concept of Peer-to-Peer was established to relieve this overload and bottleneck on the server side. Web 2.0 works with a Peer-to-Peer framework (Aghaei et al., 2012; Patel, 2013). While in Web 1.0 the server has the full responsibility and capability to respond to clients, in Web 2.0, using Peer-to-Peer computing, each computer has the same responsibilities and capabilities as the server (Miller, 2008). This relationship between computers has also been described as master/slave, where a central server acts as the master and the client computers act as the slaves (Miller, 2008). In the Peer-to-Peer framework of Web 2.0, each computer serves as a client as well as a server (Miller,
2008).
Peer-To-Peer framework provided the
capability of streaming live video from a single source to a large number of
receivers or peers over the internet without any special support from the
network (Magharei
& Rejaie, 2006, 2009). This capability is called P2P streaming
mechanism (Magharei
& Rejaie, 2006, 2009). There are two main bottlenecks with a P2P
streaming mechanism (Magharei
& Rejaie, 2006, 2009). The first bottleneck is called the “bandwidth
bottleneck,” and the second bottleneck is called “content bottleneck” (Magharei
& Rejaie, 2006, 2009). The “bandwidth bottleneck” is experienced by
the peer when the aggregate bandwidth available from all other peers is not
sufficient to fully utilize the incoming access link bandwidth (Magharei
& Rejaie, 2006, 2009). The “content bottleneck” is experienced by
the peer when the useful content from other peers is not sufficient to fully
utilize the bandwidth available in the network (Magharei
& Rejaie, 2006, 2009).
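A rough sketch of how these two conditions could be distinguished for a single peer is given below; the bandwidth figures are invented, and the check is only an illustration of the definitions above, not the mechanism described in the cited papers.

```python
def classify_bottleneck(incoming_link_kbps, parents):
    """parents: list of (available_kbps, useful_content_kbps) per parent peer."""
    aggregate_bw = sum(bw for bw, _ in parents)
    useful_rate = sum(min(bw, useful) for bw, useful in parents)
    if aggregate_bw < incoming_link_kbps:
        return "bandwidth bottleneck"       # parents cannot fill the access link
    if useful_rate < incoming_link_kbps:
        return "content bottleneck"         # bandwidth exists, useful data does not
    return "no bottleneck"

# Illustrative numbers only.
print(classify_bottleneck(1000, [(300, 300), (200, 200)]))   # bandwidth bottleneck
print(classify_bottleneck(1000, [(800, 100), (600, 150)]))   # content bottleneck
print(classify_bottleneck(1000, [(800, 700), (600, 500)]))   # no bottleneck
```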
The discussion on the bottleneck in
Web 2.0 is not limited to Peer-to-Peer but extends to platform computing. In Web 2.0, the user not only reads the web, as in Web 1.0, but also has write capabilities. This read-write feature can cause a bottleneck under highly concurrent reading and writing operations on a large set of data (Choudhury,
2014). A data-intensive application or very
large-scale data transfer can cause a bottleneck, and it can be very costly (Armbrust
et al., 2009).
One solution to this issue is to ship disks or even whole computers via
overnight delivery services (Armbrust
et al., 2009).
When the data is moved to the Cloud, the data does not have any
bottleneck any longer because the data transfer is within the Cloud such as
storing the data in S3 in Amazon Web
Services, which can be transferred without any bottleneck to EC2 (Elastic
Compute Cloud) (Armbrust
et al., 2009).
WAN bandwidth can cause a bottleneck, and the intra-cloud networking
technology can also have a performance bottleneck (Armbrust
et al., 2009).
One Gigabit Ethernet (1GbE) reflects a bandwidth that can have a
bottleneck because it is not sufficient to process a large set of data using
technology such as Map/Reduce. However,
in Cloud Computing, 10 Gigabit Ethernet is used for such aggregation
links (Armbrust
et al., 2009).
Map/Reduce is a processing
strategy that divides the operation into two phases, Map and then Reduce, using
“filtering-join-aggregation” tasks (Ji,
Li, Qiu, Awada, & Li, 2012). When using the classic Hadoop MapReduce, the
cluster is artificially segregated into Map and Reduce slots (Krishnan,
2013). Thus, the application jobs are bottlenecked
on the Reduce operation which limits the scalability in the job execution (Krishnan,
2013).
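To make the two-phase structure concrete, here is a single-process sketch of the filter-then-aggregate pattern (not Hadoop code; the click-log records and field names are invented):

```python
from collections import defaultdict
from functools import reduce

# Invented click-log records: (user, country, clicks).
records = [("u1", "US", 3), ("u2", "DE", 5), ("u3", "US", 2), ("u4", "FR", 7)]

# Map phase: filter the records of interest and emit (key, value) pairs.
mapped = [(country, clicks) for (_, country, clicks) in records if clicks >= 3]

# Shuffle: group values by key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: aggregate each key's values.
totals = {key: reduce(lambda a, b: a + b, values) for key, values in groups.items()}
print(totals)   # e.g. {'US': 3, 'DE': 5, 'FR': 7}
```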
Because of the scalability bottleneck faced by the traditional Map/Reduce,
Yahoo introduced YARN (Yet Another Resource Negotiator) to overcome this
scalability bottleneck in 2010 (White,
2012).
There are additional performance
issues which cannot be predicted. For
instance, there is a performance degradation when using Virtual Machines (VMs) which
share CPU and main memory in the Cloud Computing platform (Armbrust
et al., 2009). Moreover, the bottleneck at the computing
platform can be caused when moving a large set of data continuously to remote
CPUs (Foster,
Zhao, Raicu, & Lu, 2008). The Input/Output (I/O) operations can also
cause a performance degradation issue (Armbrust
et al., 2009; Ji et al., 2012). Flash
memory is another technology that can minimize this I/O interference (Armbrust et al., 2009). The scheduling of VMs for High-Performance
Computing (HPC) apps is another unpredictable performance issue (Armbrust
et al., 2009).
The issue of such scheduling for HPC is to ensure that all threads of a
program are running concurrently which is not provided by either VMs or the
operating systems (Armbrust
et al., 2009).
Another bottleneck-related issue with the computing platform is that the availability of the cloud environment is threatened when there is a flooding attack, which affects the available bandwidth, processing power, and memory (Fernandes,
Soares, Gomes, Freire, & Inácio, 2014). Thus, to minimize the bottleneck issue when
processing a large set of data in the computing platform, the data must be
distributed over many computers (Foster
et al., 2008; Modi, Patel, Borisaniya, Patel, & Rajarajan, 2013).
Web 3.0 – Semantic Web
Web 3.0 is the
third generation of the Web. It was
introduced by John Markoff of the New York Times in 2006 (Aghaei
et al., 2012).
Web 3.0 is known as “Semantic Web” as well as the “Web of Cooperation” (Aghaei
et al., 2012), and as the “Executable Web” (Choudhury,
2014; Kujur & Chhetri, 2015; Patel, 2013). Berners-Lee introduced the concept of Semantic
Web in 2001 (T.
Berners-Lee, Hendler, & Lassila, 2001). The underlying concept of Web 3.0 is to link,
integrate and analyze data from various datasets to obtain new information
stream (Aghaei
et al., 2012).
Web 3.0 has a variety of capabilities (Aghaei
et al., 2012).
When using Web 3.0, the data management can be improved, the accessibility
of the mobile internet can be supported, creativity and innovation are
stimulated, the satisfaction of the customers is enhanced, and the collaboration
in the social web is organized (Aghaei
et al., 2012; Choudhury, 2014). Another key factor for Web 3.0 is that the
web is no longer understood only by human but also by machines (Aghaei
et al., 2012; Choudhury, 2014). In other words, the web is understood by human
and machines in Web 3.0, where the machines first understand the web followed
by a human (Aghaei
et al., 2012; Choudhury, 2014). Web 3.0 supports a worldwide database and a web-oriented architecture, which has been described as a web of documents (Patel,
2013).
Web 3.0 characteristics include portable
personal web, consolidating dynamic content, lifestream, individuals, RDF,
and user engagements (Aghaei
et al., 2012; Patel, 2013). Examples of Web 3.0 include Google Map, My
Yahoo (Patel,
2013). Since Web 3.0 is described as the mobile and
sensor-based era, the majority of the data sources are from mobile and
sensor-based devices (Chen,
Chiang, & Storey, 2012).
Key
Technologies of Web 3.0
While
in Web 2.0 the content creativity of users is the target, in Web 3.0 the linked
data sets are the target. Web 3.0 is not only for publishing data on the web
but also linking related data (Choudhury,
2014). The Linked Data principles were introduced by Berners-Lee in 2007 as the rules to publish and connect data on the web (Aghaei
et al., 2012; Choudhury, 2014). These rules are summarized below:
URIs should be used as names of things. HTTP URIs should be used so that those names can be looked up.
Useful information should be provided, using the standards of RDF and SPARQL, when the URIs are looked up.
Links to other URIs should be included to
discover more things.
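A hedged sketch of these principles is shown below using the third-party rdflib package (assuming it is installed; the URIs and data are invented): resources are named with HTTP URIs, described with RDF triples, linked to other URIs, and queried with SPARQL.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import FOAF

EX = Namespace("http://example.org/")          # invented namespace for the example
g = Graph()

# Name things with HTTP URIs and describe them with RDF triples.
g.add((EX.alice, FOAF.name, Literal("Alice")))
g.add((EX.alice, FOAF.knows, EX.bob))          # link to another URI for discovery
g.add((EX.bob, FOAF.name, Literal("Bob")))

# Provide useful information via SPARQL when the URIs are looked up.
results = g.query("""
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT ?name WHERE { <http://example.org/alice> foaf:knows ?friend .
                         ?friend foaf:name ?name . }
""")
for row in results:
    print(row.name)        # prints "Bob"
```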
In (T.
Berners-Lee et al., 2001) the technology of
the Semantic Web included XHTML, SVG, and SMIL, placed on top of the XML
layer. XSLT is used for the
transformation engines, while XPath and XPointer are used for path and pointer
engines (T.
Berners-Lee et al., 2001). CSS and XSL are used for the style engines
and formatters (T.
Berners-Lee et al., 2001).
Web
3.0 Architecture
The
Semantic Web framework is a multi-layered architecture. The degree of structure among objects is based on a model called the Resource Description Framework (RDF) (Aghaei et al., 2012; Patel, 2013). The structure of the semantic data includes, from the bottom up, Unicode and URI at the bottom of the framework, followed by the Extensible Markup Language (XML), RDF and RDF Schema, the Ontology Web Language (OWL), and Logic, Proof, and Trust at the top (see Figure 2).
Figure 2: Web 3.0 – Semantic Web Layered Architecture (Patel, 2013).
There
are eight categories identified by (T. Berners-Lee et al., 2001) to describe the relation of the Semantic Web to Hypermedia Research. The first category describes the basic node, link, and anchor data model. The second
category reflects the typed nodes, links, and anchors. The Conceptual Hypertext is the third
category of this relation. Virtual Links
and Anchors is the fourth category, while searching and querying is the fifth
category. Versioning and Authentication
features, Annotation and User Interface Design beyond the navigational
hypermedia reflect the last three categories of the relation of Semantic Web
with Hypermedia Research (T.
Berners-Lee et al., 2001).
Web
3.0 is expected to include four major drivers in accordance with Steve Wheeler
as cited in (Chisega-Negrila,
2016). The first driver is distributed computing. The second driver is extended smartphone technology. The third driver is collaborative intelligence. The last driver is 3D visualization and interaction (Chisega-Negrila,
2016).
Web
3.0 Performance Bottleneck
The research in (Firat & Kuzu, 2011) found that the Semantic Web components XML, RDF, and OWL help overcome hypermedia bottlenecks in various areas of eLearning, such as cognitive overload and disorientation in hypermedia (Firat & Kuzu, 2011). However, Web 3.0 faces the bottleneck of search in sensor networks (Nicolaidis & Iniewski, 2017), as the use of mobile technology has been increasing (see Figure 3) (Nicolaidis & Iniewski, 2017).
Figure 3: Increasing Use of the Mobile Technology (Nicolaidis & Iniewski, 2017)
The
wireless communication in sensor networks is causing the bottleneck with the
increasing number of requests (Nicolaidis
& Iniewski, 2017).
Thus, the time to answer queries increases, which degrades the search experience for the users (Nicolaidis & Iniewski, 2017). A push-based approach is therefore used to have sensors regularly push their new readings to a base station, which decreases the latency of responses to users’ requests (Nicolaidis & Iniewski, 2017). However, this approach cannot guarantee that the data is up-to-date, as it can be outdated (Nicolaidis & Iniewski, 2017). If changes are infrequent, an update can be sent only when a change occurs. However, if changes are too frequent, this approach can cause congestion of the wireless channel, resulting in delayed or missing messages (Nicolaidis
& Iniewski, 2017).
Prediction-Model-Based approaches are proposed to reduce the volume of
the data when transmitting dynamic sensor data (Nicolaidis
& Iniewski, 2017).
However, to create an accurately predicted model, a series of sensor
readings need to be transmitted (Nicolaidis
& Iniewski, 2017).
Using the prediction approach, the latency is reduced because the prediction is made at the base station instead of contacting a sensor over the unreliable
multi-hop connection (Nicolaidis
& Iniewski, 2017).
Moreover, scaling the systems to more sensors causes another bottleneck
which needs to be solved by utilizing the distribution approach (Nicolaidis
& Iniewski, 2017).
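A simple, hedged sketch of the prediction idea follows (the threshold and readings are invented): the sensor and base station share the last reported value, the sensor transmits only when a reading drifts beyond a threshold, and the base station answers queries from its local copy otherwise.

```python
class PredictiveSensor:
    """Sensor transmits only when its reading deviates from the shared model."""
    def __init__(self, threshold):
        self.threshold = threshold
        self.last_reported = None          # value the base station also holds

    def maybe_transmit(self, reading):
        if self.last_reported is None or abs(reading - self.last_reported) > self.threshold:
            self.last_reported = reading   # update the shared model
            return reading                 # message actually sent over the radio
        return None                        # suppressed: base station predicts last value

sensor = PredictiveSensor(threshold=0.5)
readings = [20.0, 20.1, 20.2, 21.0, 21.1, 19.0]   # invented temperature samples
sent = [r for r in readings if sensor.maybe_transmit(r) is not None]
print(f"transmitted {len(sent)} of {len(readings)} readings: {sent}")
```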
Moreover,
in (Konstantinou,
Spanos, Stavrou, & Mitrou, 2010), the bottleneck of the contemporary Semantic Web is described as the counterpart of the Knowledge Acquisition bottleneck, where it was too expensive to acquire and encode the large amount of
knowledge that is needed for the application (Konstantinou
et al., 2010).
The annotation of content in the Semantic Web is still an open issue and is regarded as an obstacle for Semantic Web applications, which need a considerable
volume of data to demonstrate their utility (Konstantinou
et al., 2010).
Web 4.0 – Intelligent Web
The rapid increase in wireless communication enables another major transition in the Web (Kambil, 2008). This transition enables people to connect with objects anywhere and anytime, in the physical world as well as in the virtual world (Kambil, 2008). Web 4.0 is the fourth generation of the Web and is known as the “web of intelligence connections” (Aghaei et al., 2012) or the “Ultra-Intelligent Electronic Agent” (Choudhury, 2014; Kujur & Chhetri, 2015; Patel, 2013). It is a read-write, execution, and concurrency web with intelligent interactions (Aghaei et al., 2012; Patel, 2013). Some consider it a “Symbiotic Web,” where humans and machines interact in a symbiotic fashion (Aghaei et al., 2012; Choudhury, 2014; Kujur & Chhetri, 2015; Patel, 2013), and a “Ubiquitous Web” (Kujur & Chhetri, 2015; Patel, 2013; Weber & Rech, 2009). Using Web 4.0, machines will intelligently read the web content and deliver web pages with superior performance in real time (Aghaei et al., 2012; Choudhury, 2014). It is also known as WebOS, the “Middleware” which will act as an operating system for the Web (Aghaei et al., 2012; Choudhury, 2014). WebOS is expected to resemble, or be equivalent to, the human brain and to interact highly intelligently (Aghaei et al., 2012; Choudhury, 2014). As indicated in (Aghaei et al., 2012), “the web is moving toward using artificial intelligence to become an intelligent web.” Web 4.0 will reflect the integration between people, virtual worlds, and objects in real time (Kambil, 2008). Artificial Intelligence technologies are expected to play a role in Web 4.0 (Weber & Rech, 2009).
The major challenge for Web 4.0 is generating value based on the full integration of physical objects and virtual objects with other content that is generated by users (Kambil,
2008). This challenge could lead to the next
generation of Supervisory Control and Data Acquisition (SCADA) applications (Kambil,
2008). The challenge can also lead to generating
value from sources such as “entertainment” which collect information from human
and objects (Kambil,
2008).
The migration of the virtual world to the physical world is another challenge in Web 4.0 (Patel,
2013). A good example of Web 4.0 is provided by (Patel,
2013),
which is to search or Google your home to locate an object such as your car
key.
An application of Web 4.0 was implemented by Rafi Haladjian and Olivier, who created the first consumer electronics device on Amazon, which can recognize you and provide recommended products and personalized advice (Patel,
2013). The time frame for Web 4.0 is expected to be
2020 – 2030 (Loretz,
2017; Weber & Rech, 2009). Web
4.0 is still in progress.
Web 4.0 and “Internet of Things”
Some, like (Loretz, 2017), consider the “Internet of Things” part of Web 3.0 and Web 4.0, while others, like (Pulipaka, 2016), categorize it under Web 4.0. Thus, the discussion of the “Internet of Things” from the bottleneck perspective is addressed under the Web 4.0 section of this paper.
As indicated in (Atzori, Iera, & Morabito, 2010) Internet of Things (IoT) is regarded to be “one of the most promising
fuels of Big Data expansion” (De Mauro, Greco, & Grimaldi, 2015). IoT seems promising as Google
acquired Nest for $3.2 billion in January 2014 (Dalton, 2016). The
Nest is a smart hub producer at the forefront of the Internet of Things (Dalton, 2016). This acquisition indicates the importance of IoT. IoT is becoming powerful because it affects our daily life and the behavior of users (Atzori et al., 2010). The underlying concept of IoT is its ubiquitous character, using various devices such as sensors, mobile phones, and so forth (Atzori et al., 2010).
As cited in (Batalla & Krawiec, 2014) “Internet of Things (IoT) is global
network infrastructure, linking physical and virtual objects through the exploitation
of data capture and communication capabilities” (Batalla & Krawiec, 2014). IoT is described by (Batalla & Krawiec, 2014) as “a huge connectivity platform for self-managed objects.” IoT is growing rapidly, and the reasons for such strong growth include the inexpensive cost of computing, including sensors, and the growth of Wi-Fi
(Gholap & Asole, 2016; Gubbi, Buyya,
Marusic, & Palaniswami, 2013), and 4G-LTE (Gubbi et al., 2013). Other factors include the growth of mobiles, the rise of software
developments, the emergence of standardized low-power wireless technologies (Gholap & Asole, 2016).
With the advancement in the Web, from static web pages in Web 1.0 to network web in Web 2.0, to ubiquitous
computing web in Web 3.0, the requirement for “data-on-demand” using complex and intuitive queries becomes increasingly
significant (Gubbi et al., 2013). With IoT, many objects and many
things surrounding people will be on the network (Gubbi et al., 2013). The Radio Frequency IDentification (RFID) and the technologies of
the sensor network emerge to respond to the IoT network challenges where
information and communication systems are embedded in the environment around us
invisibly (Gubbi et al., 2013). The computing criterion for the
IoT will go beyond the traditional scenarios of the mobile computing which
utilize the smartphones and portables (Gubbi et al., 2013). IoT will evolve to connect existing everyday objects and embed
intelligence into our environment (Gubbi et al., 2013).
Web 4.0 and IoT Performance Bottleneck
The elements of IoT include the RFID, Wireless Sensor Networks (WSN),
Addressing Schemes, Data Storage and Analytics, and Visualization (Gubbi et al., 2013). IoT will require the persistence
of the network to channel the traffic of the data ubiquitously. IoT confronts a bottleneck at the interface
between the gateway and wireless sensor devices (Gubbi et al., 2013). The bottleneck at the interface is between the Internet and smart
object networks of the RFID or WSN subnets (Jin, Gubbi, Marusic, & Palaniswami,
2014). Moreover, device addressing in the existing network must scale sustainably (Gubbi et al., 2013). The performance of the network or
the device functioning should not be affected by adding networks and devices (Gubbi et al., 2013; Jin et al., 2014). The Uniform Resource Name (URN) system will play a significant role in the development of IoT to overcome these issues (Gubbi et al., 2013).
Moreover, although Cloud can enhance and simplify the communication of
IoT, the Cloud can still represent a bottleneck in certain scenarios (Botta, de Donato, Persico, & Pescapé,
2016). As indicated in (Gubbi et al., 2013), with the high-capacity, large-scale web data generated by IoT, the Cloud becomes a bottleneck as IoT grows (Gubbi et al., 2013). A framework is proposed by (Gubbi et al., 2013) to enable the scalability of the cloud to provide the capacity required for IoT. While the proposed framework of (Gubbi et al., 2013) separates the networking, computation, storage, and visualization themes, it allows independent growth in each domain while, at the same time, each enhances the others in an environment that is shared among them (Gubbi et al., 2013).
Web 4.0 and IoT New Challenges
IoT faces additional challenges such as Addressing and Networking Issues (Atzori et al., 2010). Investigation efforts have been made regarding the integration of RFID tags
into IPv6. Mobile IP is proposed as a
solution for the mobility in IoT scenarios (Atzori et al., 2010). Moreover, the DNS (domain name servers), which provide IP address of a
host from a certain input name, does not seem to serve the IoT scenarios where
communications are among objects and not hosts. Object Name Service (ONS) is
proposed as a solution to the DNS issue (Atzori et al., 2010). ONS will associate a reference
to a description of the object and the related RFID tag identifier, and it must
work in a bidirectional manner (Atzori et al., 2010). For the complex operation of
IoT, the Object Code Mapping Service (OCMS) is still an open issue (Atzori et al., 2010). TCP as the Transmission Control
Protocol is found inadequate and
inefficient for the transmission control of end-to-end in the IoT (Atzori et al., 2010). The TCP issue is still an open
issue for IoT (Atzori et al., 2010). Other issues of IoT include
Quality of Service, Security, and
Privacy.
Web 5.0 – Symbionet Web
Web 5.0 is still in progress and can be regarded as the “Symbionet Web” (Loretz, 2017; Patel, 2013) or the “Telepathic Web” (Loretz, 2017). In Web 5.0, people will be able to have their own Personal Servers (PS), where they can store and communicate with their personal data using a Smart Communicator (SC) such as a smartphone, tablet, and so forth (Patel, 2013). The Smart Communicators will be part of the 3D Virtual World of the Symbionet (Patel, 2013). Web 5.0 will be aware of your emotions and feelings (Kambil, 2008). Devices such as headsets are being investigated for emotional interaction (Kambil, 2008). While some companies claim that they can map emotions and feelings, this claim can be hard to imagine because emotions and feelings are complex (Kambil, 2008). However, some technologies are examining the effect of emotions (Kambil, 2008). There is an idea that there will be a “brain implant” which enables a person to communicate with the internet and the web by thought (Loretz, 2017). The person will be able to open pages just by thinking (Loretz, 2017). The time frame for Web 5.0 is after 2030 (Loretz, 2017).
Conclusion
This
project discussed the Web from the inception of Web 1.0 to the latest generation, Web 5.0. The project addressed the main
characteristics of each generation and the main sources for generating the
large-scale web data. Web 1.0 is known
as the “Web of Information Connections” where information is broadcasted by
companies for users to read. Web 2.0 is
known as the “Web of People Connections” where people connected. Web 3.0 is known as the “Web of Knowledge
Connections” where people share knowledge.
Web 4.0 is known as the “Web of Intelligence Connections” where
Artificial Intelligence is expected to play a role. Web 5.0 is known as the “Symbionet Web” where
emotions and feelings are expected to be communicated to the machines, and be
part of the Web interactions.
The
project also discussed the key technologies and the underlying architecture of
each Web generation. Moreover, this paper also discussed and analyzed the
performance bottlenecks when accessing the large-scale data for each Web
generation, and the proposed solutions for some of these bottlenecks and the open
issues.
References
Aghaei, S., Nematbakhsh,
M. A., & Farsani, H. K. (2012). Evolution of the world wide web: From WEB
1.0 TO WEB 4.0. International Journal of
Web & Semantic Technology, 3(1), 1.
Armbrust,
M., Fox, A., Griffith, R., Joseph, A. D., Katz, R. H., Konwinski, A., . . .
Stoica, I. (2009). Above The Clouds: A Berkeley View of Cloud Computing.
Atzori,
L., Iera, A., & Morabito, G. (2010). The internet of things: A survey. Computer networks, 54(15), 2787-2805.
Batalla,
J. M., & Krawiec, P. (2014). Conception of ID layer performance at the
network level for Internet of Things. Personal
and Ubiquitous Computing, 18(2), 465-480.
Berners-Lee,
T. (1996). WWW: Past, present, and future. Computer,
29(10), 69-77.
Berners-Lee,
T., Hendler, J., & Lassila, O. (2001). The semantic web.
Berners-Lee,
T. J. (1992). The world-wide web. Computer
networks and ISDN systems, 25(4-5), 454-459.
Botta,
A., de Donato, W., Persico, V., & Pescapé, A. (2016). Integration of Cloud
Computing and Internet Of Things: a Survey. Future
Generation computer systems, 56, 684-700.
Chen,
H., Chiang, R. H., & Storey, V. C. (2012). Business Intelligence and
Analytics: From Big Data to Big Impact. MIS
quarterly, 36(4), 1165-1188.
Chisega-Negrila,
A. M. (2016). Impact of Web 3.0 on the evolution of learning. Bucharest.
Choudhury,
N. (2014). World Wide Web and its journey from web 1.0 to web 4.0.
Dalton,
C. (2016). Brilliant Strategy for
Business: How to plan, implement and evaluate strategy at any level of
management: Pearson UK.
De
Mauro, A., Greco, M., & Grimaldi, M. (2015). What is big data? A consensual definition and a review of key research
topics. Paper presented at the AIP Conference Proceedings.
Demirkan,
H., & Delen, D. (2013). Leveraging the capabilities of service-oriented
decision support systems: Putting analytics and big data in cloud. Decision Support Systems, 55(1),
412-421.
Fernandes,
D. A., Soares, L. F., Gomes, J. V., Freire, M. M., & Inácio, P. R. (2014).
Security issues in cloud environments: a survey. International Journal of Information Security, 13(2), 113-170.
doi:10.1007/s10207-013-0208-7
Firat,
M., & Kuzu, A. (2011). Semantic web for e-learning bottlenecks:
disorientation and cognitive overload. International
Journal of Web & Semantic Technology, 2(4), 55.
Foster,
I., Zhao, Y., Raicu, I., & Lu, S. (2008). Cloud Computing and Grid Computing 360-Degree Compared. Paper
presented at the 2008 Grid Computing Environments Workshop.
Gholap,
K. K., & Asole, S. (2016). Today’s Impact of Big Data on Cloud. International Journal of Engineering
Science, 3748.
Governor,
J., Hinchcliffe, D., & Nickull, D. (2009). Web 2.0 Architectures: What entrepreneurs and information architects
need to know: O’Reilly Media, Inc.
Gubbi,
J., Buyya, R., Marusic, S., & Palaniswami, M. (2013). Internet of Things
(IoT): A vision, architectural elements, and future directions. Future Generation computer systems, 29(7),
1645-1660.
Hu,
H., Wen, Y., Chua, T.-S., & Li, X. (2014). Toward scalable systems for big
data analytics: A technology tutorial. IEEE
Access, 2, 652-687.
Ji,
C., Li, Y., Qiu, W., Awada, U., & Li, K. (2012). Big Data Processing in Cloud Computing Environments. Paper
presented at the 2012 12th International Symposium on Pervasive Systems,
Algorithms and Networks.
Jin,
J., Gubbi, J., Marusic, S., & Palaniswami, M. (2014). An information
framework for creating a smart city through internet of things. IEEE Internet of Things Journal, 1(2),
112-121.
Kambil,
A. (2008). What is your Web 5.0 strategy? Journal
of business strategy, 29(6), 56-58.
Konstantinou,
N., Spanos, D.-E., Stavrou, P., & Mitrou, N. (2010). Technically
approaching the semantic web bottleneck. International
Journal of Web Engineering and Technology, 6(1), 83-111.
Krishnan,
K. (2013). Data warehousing in the age of
big data: Newnes.
Kujur,
P., & Chhetri, B. (2015). Evolution of World Wide Web: Journey From Web 1.0
to Web 4.0. IJCST, 6(1).
Magharei,
N., & Rejaie, R. (2006). Understanding
mesh-based peer-to-peer streaming. Paper presented at the Proceedings of
the 2006 international workshop on Network and operating systems support for
digital audio and video.
Magharei,
N., & Rejaie, R. (2009). Prime: Peer-to-peer receiver-driven mesh-based
streaming. IEEE/ACM Transactions on
Networking (TON), 17(4), 1052-1065.
Markatos,
E. P. (2002). Tracing a large-scale peer
to peer system: an hour in the life of gnutella. Paper presented at the
Cluster Computing and the Grid, 2002. 2nd IEEE/ACM International Symposium on.
Miller,
M. (2008). Cloud Computing, Web-Based Applications That Change the Way You Work
and Collaborate Online. Michael Miller.
Modi,
C., Patel, D., Borisaniya, B., Patel, A., & Rajarajan, M. (2013). A survey
on security issues and solutions at different layers of Cloud computing. The Journal of Supercomputing, 63(2),
561-592.
Mosberger,
D., & Jin, T. (1998). httperf—a tool for measuring web server performance. ACM SIGMETRICS Performance Evaluation
Review, 26(3), 31-37.
Nicolaidis,
I., & Iniewski, K. (2017). Building
Sensor Networks: CRC Press.
O’Reilly,
T. (2007). What is web 2.0.
Patel,
K. (2013). Incremental journey for World Wide Web: introduced with Web 1.0 to
recent Web 5.0–a survey paper.
Weber,
S., & Rech, J. (2009). An overview and differentiation of the evolutionary
steps of the web XY movement: the web before and beyond 2.0. Handbook of Research on Web 2.0, 3.0, and X.
0: Technologies, Business, and Social Applications: Technologies, Business, and
Social Applications, 12.
White, T. (2012). Hadoop: The definitive guide: O’Reilly Media, Inc.