Graph Dataset Partitioning Methods

Dr. Aly, O.
Computer Science

Introduction

Cloud Computing programs are designed to serve particular purposes.  They can be tailored for Graph Parallelism or Data Parallelism, which require graph partitioning and mapping or data striping and distribution, respectively (Sakr & Gaber, 2014).  Data Parallelism (DPLL) is distinguished from Graph Parallelism (GPLL).  DPLL is a form of parallel computation in which data is distributed across multiple nodes and corresponding tasks run in parallel on those machines; it is achieved when each machine runs one or more tasks over different partitions of the data (Sakr & Gaber, 2014).  MapReduce uses DPLL: the Hadoop Distributed File System (HDFS) partitions the input datasets into blocks, allowing MapReduce to exploit DPLL by running a map task per one or more blocks.  GPLL, on the other hand, is a form of parallelism that focuses on distributing graphs rather than data (Sakr & Gaber, 2014).  GPLL is widely used in many domains, such as Machine Learning (ML), Data Mining (DM), and Physics.

A graph can be distributed over the machines or nodes of a distributed system using various Graph Partitioning (GPRT) techniques (Hendrickson & Kolda, 2000; Sakr & Gaber, 2014).  With GPRT, the work and the vertices are divided over distributed nodes for efficient distributed computation (Hendrickson & Kolda, 2000; Sakr & Gaber, 2014).  As with DPLL, the underlying idea of GPRT is to process different parts of the graph in parallel by distributing a large graph across many nodes in the cluster; GPRT is therefore what enables GPLL (Sakr & Gaber, 2014).  The main objectives of GPRT are to distribute the work uniformly over the processors by partitioning the vertices into equally weighted partitions, one per processor, while minimizing the inter-processor communication reflected by edges crossing between partitions (Hendrickson & Kolda, 2000; Sakr & Gaber, 2014).  The quantity being minimized is called the “Edge Cut Metric” (Hendrickson & Kolda, 2000; Sakr & Gaber, 2014).  Pregel and GraphLab are good examples of systems that use GPRT.
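
To make the objective concrete, the following minimal sketch (in Python, using an invented toy graph and partition assignment rather than anything from the cited sources) computes the edge cut and the per-partition vertex weight for a given assignment of vertices to partitions; a good partitioner would search for an assignment that keeps the per-partition loads nearly equal while driving the cut value down.

from collections import defaultdict

def edge_cut_and_loads(edges, vertex_weight, partition_of):
    # edges: iterable of (u, v, edge_weight); partition_of maps vertex -> partition id.
    cut = 0
    loads = defaultdict(int)
    for v, w in vertex_weight.items():
        loads[partition_of[v]] += w            # per-partition vertex weight
    for u, v, w in edges:
        if partition_of[u] != partition_of[v]:
            cut += w                           # edge crosses between partitions
    return cut, dict(loads)

edges = [("a", "b", 1), ("b", "c", 1), ("c", "d", 1), ("a", "d", 1)]
weights = {"a": 1, "b": 1, "c": 1, "d": 1}
assignment = {"a": 0, "b": 0, "c": 1, "d": 1}
print(edge_cut_and_loads(edges, weights, assignment))  # (2, {0: 2, 1: 2})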

The Standard GPRT techniques have various flaws and limitations, as discussed below.  Alternative models exist, such as the Bipartite Graph Model, the Hypergraph Model, the Multi-Constraint Partitioning Model, and the Skewed Partitioning Model.  These models are only viable if efficient and effective algorithms can be developed to partition them (Hendrickson & Kolda, 2000).  Various research studies have demonstrated that the Multi-Level partitioning paradigm is robust and general; several researchers developed this paradigm independently in the early 1990s, and it became popular through the Chaco and METIS partitioning tools.  For graph processing, there are various Cloud-Based Graph Processing Platforms such as Pregel, PEGASUS, HADI, Surfer, Trinity, and GraphLab (Sakr & Gaber, 2014).

The discussion and analysis of alternative Graph Partitioning Models below are limited to the Bipartite Graph Model and the Multi-Constraint Partitioning Model.  The discussion begins with the Standard Graph Partitioning techniques and an analysis of their flaws and limitations.  The discussion and analysis of Cloud-Based Graph Processing Platforms are likewise limited to Pregel and GraphLab.

The Standard Graph Partitioning Approach

A graph is a set of vertices and edges, where the vertices can represent units of computation and the edges can encode data dependencies (Hendrickson & Kolda, 2000).  The Standard Graph Partitioning approach partitions the vertices of the graph into equally weighted sets in such a way that the total weight of the edges crossing between sets is minimized (Hendrickson & Kolda, 2000).  Chaco and METIS are well-known software packages that use this standard approach.  The standard approach has shortcomings and limitations: the “Edge Cut Metric” has four major flaws (Hendrickson & Kolda, 2000).  The first flaw is that edge cuts are not proportional to the total communication volume; the metric overcounts the true volume of communication because two or more edges that represent the same information flow are not recognized as such.  The second flaw concerns the time to send a message on a parallel computer, which is a function of the latency and the size of the message (Hendrickson & Kolda, 2000).  Graph Partitioning approaches try to minimize the total volume, but not the total number of messages, and in certain scenarios the message volume is less important than the message latencies (Hendrickson & Kolda, 2000).  The third flaw is that the performance of a parallel application is limited by the slowest processor: the communication effort can be out of balance even though the computational work is in balance.  The focus should therefore be on minimizing the maximum volume and number of messages handled by any single processor rather than the total communication volume or the total number of messages (Hendrickson & Kolda, 2000).  The last flaw involves the number of switches a message is routed through as it travels.  Communication should be restricted to processors that are close to each other to prevent contention of messages and enhance the overall throughput of message traffic (Hendrickson & Kolda, 2000).
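
The first flaw can be illustrated with a small hedged example: if a vertex in one partition has edges to two vertices in the same remote partition, the edge cut counts two crossing edges, yet that vertex's data only needs to cross the network once.  The sketch below uses an invented three-vertex graph purely for illustration.

from collections import defaultdict

def edge_cut(edges, part):
    return sum(1 for u, v in edges if part[u] != part[v])

def one_way_volume(edges, part):
    # Count a sender's data once per distinct remote partition that needs it,
    # treating each edge (u, v) as u's data being consumed by v.
    destinations = defaultdict(set)
    for u, v in edges:
        if part[u] != part[v]:
            destinations[u].add(part[v])
    return sum(len(parts) for parts in destinations.values())

edges = [("u", "x"), ("u", "y")]          # u's value is needed by both x and y
part = {"u": 0, "x": 1, "y": 1}
print(edge_cut(edges, part))              # 2 crossing edges
print(one_way_volume(edges, part))        # 1: u's data crosses the network once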

The Standard Graph Partitioning technique suffers from four limitations because of the lack of “expressibility” in the model (Hendrickson & Kolda, 2000).  The discussion in (Hendrickson & Kolda, 2000) emphasizes two standard models, the “Directed” and “Undirected” Graph Models.  The first limitation involves symmetric versus unsymmetric data dependencies.  The “Undirected Graph” model can only express symmetric data dependencies (Hendrickson & Kolda, 2000).  Unsymmetric dependencies must be represented with a “Directed Graph” model, in which edges are directed from the data-producing vertex to the data-consuming vertex (Hendrickson & Kolda, 2000).  The Standard Model cannot represent unsymmetric data dependencies directly, and there are two workarounds (Hendrickson & Kolda, 2000).  The first is to convert the directed edges into undirected edges (unsymmetric to symmetric).  The second extends the first by weighting edges according to whether the communication is one-way or two-way: an edge representing one-way communication gets a weight of one, and an edge representing two-way communication gets a weight of two (Hendrickson & Kolda, 2000).  The second limitation is that the symmetric model forces the partitions of the input and output data to be identical (Hendrickson & Kolda, 2000).  Although this might be desirable in certain cases, it imposes an unnecessary restriction in many others (Hendrickson & Kolda, 2000).  The third limitation of the Standard Model is the assumption that the input and output of the calculation are of the same size.  The last limitation concerns calculations with several distinct phases, such as applying a matrix and a preconditioner in an iterative method, solving a differential equation and applying boundary conditions, simulating different phenomena in a multi-physics calculation, or advancing a grid and detecting contacts in a transient dynamics computation (Hendrickson & Kolda, 2000).  Such a combination of phases cannot be described using the “Undirected Graph” Model.
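
The two workarounds for unsymmetric dependencies can be sketched as follows (a minimal Python illustration with invented edges, not code from the cited work): directed edges are folded into undirected ones, and an edge receives weight one for one-way communication or weight two when both directions are present.

def symmetrize(directed_edges):
    # Fold directed dependency edges into undirected ones; weight 1 means
    # one-way communication, weight 2 means two-way communication.
    directed = set(directed_edges)
    undirected = {}
    for u, v in directed:
        key = (min(u, v), max(u, v))
        undirected[key] = 2 if (v, u) in directed else 1
    return undirected

print(symmetrize([("a", "b"), ("b", "a"), ("b", "c")]))
# e.g. {('a', 'b'): 2, ('b', 'c'): 1}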

Alternative Graph Partitioning Models – Bipartite Model and Multi-Constraint Partitioning Model

To overcome some of the above flaws and limitations of the Standard Graph Models, several alternative models are proposed and discussed in (Hendrickson & Kolda, 2000).  The “Bipartite Graph Model” is proposed to handle unsymmetric data dependencies and unsymmetric partitions, which the Undirected Graph Model cannot encode because, as discussed in the flaws section above, that standard model can encode only symmetric data dependencies and symmetric partitions.

The “Bipartite” Model can also be applied to other applications involving unsymmetric dependencies and multiple phases (Hendrickson & Kolda, 2000).  The model has three advantages.  The first is the capability to encode a class of unsymmetric dependencies that cannot be encoded by the standard “Undirected” Graph Model.  The second is that the “Bipartite” model allows the initial partition to differ from the final partition by representing each vertex twice, once as an initial vertex and once as a final vertex, allowing for a reduction in communication (Hendrickson & Kolda, 2000).  However, the model cannot then provide a symmetric partition, which may be preferred in many applications (Hendrickson & Kolda, 2000).  The third advantage is the load balance obtained by partitioning both the initial and the final vertices (Hendrickson & Kolda, 2000).  Although the “Bipartite” Graph Model adds the “expressibility” missing from the Standard Graph Models, it still relies on the Edge Cut Metric and shares the associated problems from which the Standard Models suffer (Hendrickson & Kolda, 2000).  The Edge Cut Metric issue can be addressed by optimizing a quantity of the graph other than the cut edges (Hendrickson & Kolda, 2000).  Moreover, the “Bipartite” model cannot encode more than two main computational operations (Hendrickson & Kolda, 2000).
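
As a hedged sketch of the construction (the vertex names and the tuple-based encoding are invented for illustration), each vertex is duplicated into an “initial” copy and a “final” copy, and every directed dependency becomes an edge between an initial copy and a final copy, so the two sides can be partitioned independently.

def to_bipartite(directed_edges, vertices):
    # Duplicate every vertex into an initial copy and a final copy; a directed
    # dependency u -> v becomes an edge between u's initial copy and v's final copy.
    initial = {v: ("initial", v) for v in vertices}
    final = {v: ("final", v) for v in vertices}
    bip_edges = [(initial[u], final[v]) for u, v in directed_edges]
    return list(initial.values()), list(final.values()), bip_edges

initial_side, final_side, bip_edges = to_bipartite([("a", "b"), ("c", "b")], ["a", "b", "c"])
print(bip_edges)
# [(('initial', 'a'), ('final', 'b')), (('initial', 'c'), ('final', 'b'))]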

A solution to this two-operation limitation is a k-partite graph, in which the first set of vertices is connected to a second set, the second to a third, and so on.  This technique is called the Multi-Constraint methodology.  Multi-Constraint partitioning is not an alternative to the other models, but rather an enhancement of them (Hendrickson & Kolda, 2000).  The goal is to partition the vertices of the graph so as to minimize communication while balancing each of the k weights assigned to each vertex, thereby balancing every computation phase (Hendrickson & Kolda, 2000).  The model is also designed to minimize the edge cuts, with the edges representing the data dependencies in every phase of the computation (Hendrickson & Kolda, 2000).  The Multi-Constraint Model includes the “Bipartite” and k-partite techniques as special cases.  Although this model is powerful, it is difficult to implement (Hendrickson & Kolda, 2000).
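
The balancing objective of the Multi-Constraint Model can be sketched as follows (the weights and assignment are hypothetical): each vertex carries a vector of k weights, one per computational phase, and an acceptable partition must balance every one of the k weight sums separately.

def phase_loads(vertex_weights, partition_of, num_parts, k):
    # Sum the k per-phase weights separately for each partition; a balanced
    # multi-constraint partition keeps every phase column roughly equal across partitions.
    loads = [[0] * k for _ in range(num_parts)]
    for v, weights in vertex_weights.items():
        for phase in range(k):
            loads[partition_of[v]][phase] += weights[phase]
    return loads

# Hypothetical two-phase example: each vertex weight is (phase-0 work, phase-1 work).
weights = {"a": (2, 1), "b": (1, 2), "c": (2, 1), "d": (1, 2)}
assignment = {"a": 0, "b": 0, "c": 1, "d": 1}
print(phase_loads(weights, assignment, num_parts=2, k=2))  # [[3, 3], [3, 3]]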

Cloud-Based Graph Processing Platforms of Pregel and GraphLab

Pregel is a vertex-oriented graph processing engine that implements the Bulk Synchronous Parallel (BSP) model, where programs are expressed as a sequence of iterations (Sakr & Gaber, 2014).  In each iteration, a vertex can receive messages sent in the previous iteration, send messages to other vertices, and modify its state.  Google introduced the Pregel system as a scalable platform for implementing graph algorithms (Sakr & Gaber, 2014); for example, Pregel implements a vertex-oriented PageRank algorithm under the Message Passing paradigm (Sakr & Gaber, 2014).  Pregel passes the results of computation between workers, executing Compute() on all vertices in parallel (Sakr & Gaber, 2014).  Messages are passed over the network, and vertices can vote to halt if there is no work left to do (Sakr & Gaber, 2014).  In Pregel, each vertex in a graph is assigned a unique ID, and the graph is partitioned using a hash(ID) mod N function, where N is the number of partitions; the hash function can be replaced by users.  After the graph is partitioned, the partitions are mapped to cluster machines using a mapping function of the user's choice (Sakr & Gaber, 2014).  If the graph or the cluster size is modified after partitioning, the entire graph must be re-partitioned before any further processing (Sakr & Gaber, 2014).
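
As a hedged illustration of this default scheme (the identifiers, hash function, and round-robin mapping policy below are placeholders, not Pregel's actual implementation), partitioning and placement might look like the following.

def partition_for(vertex_id, num_partitions, hash_fn=hash):
    # Pregel-style default rule: partition = hash(vertex id) mod N.
    return hash_fn(vertex_id) % num_partitions

def place_partitions(num_partitions, machines):
    # Placeholder mapping policy: round-robin partitions over cluster machines.
    return {p: machines[p % len(machines)] for p in range(num_partitions)}

vertices = [101, 102, 103, 104, 105]
N = 4
assignment = {v: partition_for(v, N) for v in vertices}
placement = place_partitions(N, ["machine-0", "machine-1"])
print(assignment)   # with Python's integer hash: {101: 1, 102: 2, 103: 3, 104: 0, 105: 1}
print(placement)    # {0: 'machine-0', 1: 'machine-1', 2: 'machine-0', 3: 'machine-1'}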

GraphLab, in contrast to Pregel, employs a two-phase partitioning strategy (Sakr & Gaber, 2014).  In the first phase, the input graph is partitioned using a hash-based random algorithm into a number of partitions that must be larger than the number of cluster nodes (Sakr & Gaber, 2014).  A partition in GraphLab is called an “atom,” and it stores commands that generate vertices and edges (Sakr & Gaber, 2014).  Each atom also maintains information about the atom’s neighboring vertices and edges, denoted ghosts, which serve as a cache for efficient access to adjacent data (Sakr & Gaber, 2014).  In the second phase of GraphLab’s two-phase partitioning strategy, GraphLab stores the connectivity structure and the locations of atoms in an atom index file referred to as the meta-graph.  The meta-graph contains one vertex per atom and edges encoding the connectivity among atoms, and it is split uniformly across the cluster nodes (Sakr & Gaber, 2014).  Atoms are then loaded by the cluster machines, and each machine constructs its partitions by executing the commands in each of its assigned atoms (Sakr & Gaber, 2014).  Because partitions are generated by executing the commands stored in atoms, GraphLab allows future modifications of the graph to be appended as additional commands without repartitioning the entire graph (Sakr & Gaber, 2014).  Moreover, the same graph atoms can be reused for clusters of different sizes by simply re-dividing the corresponding atom index file and re-executing the atom commands (Sakr & Gaber, 2014).
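
A loose sketch of the atom idea follows (the command vocabulary and data layout are invented for illustration and do not mirror GraphLab's actual on-disk format): an atom is treated as a journal of commands that a machine replays to materialize its partition, so later graph changes can simply be appended as new commands.

def build_partition(atom_commands):
    # Replay an atom's command journal to materialize the local partition.
    vertices, edges, ghosts = {}, [], set()
    for cmd, *args in atom_commands:
        if cmd == "add_vertex":
            vertices[args[0]] = args[1]     # vertex id -> vertex data
        elif cmd == "add_edge":
            edges.append((args[0], args[1]))
        elif cmd == "add_ghost":
            ghosts.add(args[0])             # cached neighbor owned by another atom
    return vertices, edges, ghosts

atom = [("add_vertex", "v1", {"rank": 1.0}),
        ("add_vertex", "v2", {"rank": 1.0}),
        ("add_edge", "v1", "v2"),
        ("add_ghost", "v7")]                 # later graph changes would be appended here
print(build_partition(atom))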

Moreover, GraphLab is designed and developed for ML and DM algorithms, which are not naturally supported by MapReduce (Sakr & Gaber, 2014).  The GraphLab abstraction allows asynchronous, dynamic, graph-parallel computation while ensuring data consistency and achieving a high degree of parallel performance in the shared-memory setting (Sakr & Gaber, 2014).  This asynchronous parallel model differs from Pregel's BSP model (Sakr & Gaber, 2014).  The GraphLab framework has also been extended to the distributed setting while ensuring strong data consistency (Sakr & Gaber, 2014).

How To Handle the Unevenness of the Cloud Network Bandwidth

A large graph can occupy hundreds of gigabytes and requires data-intensive processing (Chen et al., 2012; Hendrickson & Kolda, 2000; Sakr & Gaber, 2014).  The Cloud Computing environment is described as a good fit for processing large graphs (Chen et al., 2012; Hendrickson & Kolda, 2000; Sakr & Gaber, 2014).  Graph processing systems support a vertex-oriented execution model that allows users to develop custom logic on vertices (Chen et al., 2012).  However, the random access pattern of vertex-oriented computation can generate a significant amount of network traffic (Chen et al., 2012).  GPRT is known to be effective at reducing network traffic in graph processing (Chen et al., 2012; Hendrickson & Kolda, 2000; Sakr & Gaber, 2014), but in the Cloud environment it must be effectively integrated into large graph processing (Chen et al., 2012; Sakr & Gaber, 2014).  The network bandwidth of the Cloud environment is uneven among machine pairs: machines are grouped into “pods,” which are connected by higher-level switches (Chen et al., 2012; Sakr & Gaber, 2014), and the intra-pod bandwidth is higher than the cross-pod bandwidth (Chen et al., 2012; Sakr & Gaber, 2014).  Moreover, users are not aware of the topology information because of the virtualization techniques employed by the Cloud (Chen et al., 2012; Sakr & Gaber, 2014).  Thus, the unevenness of the Cloud network bandwidth requires network optimizations and tuning of graph partitioning and processing (Chen et al., 2012; Sakr & Gaber, 2014).

A network-performance-aware graph partitioning framework is proposed in (Chen et al., 2012; Sakr & Gaber, 2014) with the aim of improving the network performance of the graph partitioning process.  The underlying idea is to partition, store, and process the graph partitions according to their number of cross-partition edges (Chen et al., 2012; Sakr & Gaber, 2014): partitions with a large number of cross-partition edges are stored on machines with high network bandwidth between them, because the network traffic for those partitions requires high bandwidth (Chen et al., 2012; Sakr & Gaber, 2014).  The framework partitions both the data graph and the machine graph and is known as the “Machine Graph” framework (Chen et al., 2012; Sakr & Gaber, 2014).  A complete weighted undirected graph models the machines chosen for graph partitioning, capturing the unevenness of the network bandwidth (Chen et al., 2012; Sakr & Gaber, 2014): each chosen machine is modeled as a vertex, an edge represents the connectivity between two machines, and the bandwidth between any two machines is represented as the weight of the edge (Chen et al., 2012; Sakr & Gaber, 2014).  The machine graph can be built with no knowledge or control of the physical network topology (Chen et al., 2012; Sakr & Gaber, 2014).
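
The placement idea can be sketched as follows (the bandwidth figures, partition names, and greedy pairing rule are hypothetical simplifications, not the actual algorithm of Chen et al. (2012)): pairwise bandwidth measurements form the machine graph, and the pair of partitions with the most cross-partition edges is placed on the machine pair with the highest bandwidth.

def greedy_placement(cross_edges, bandwidth):
    # cross_edges: {(partition_a, partition_b): number of edges between them}
    # bandwidth:   {(machine_x, machine_y): measured bandwidth} -- the machine graph
    partition_pairs = sorted(cross_edges, key=cross_edges.get, reverse=True)
    machine_links = sorted(bandwidth, key=bandwidth.get, reverse=True)
    placement = {}
    for (pa, pb), (mx, my) in zip(partition_pairs, machine_links):
        placement.setdefault(pa, mx)
        placement.setdefault(pb, my)
    return placement

cross = {("P0", "P1"): 900, ("P2", "P3"): 40}     # cross-partition edge counts
bw = {("m0", "m1"): 1000, ("m2", "m3"): 120}      # e.g. an intra-pod vs a cross-pod link
print(greedy_placement(cross, bw))  # {'P0': 'm0', 'P1': 'm1', 'P2': 'm2', 'P3': 'm3'}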

Partition Sketch is another framework proposed by (Chen, Weng, & He; Sakr & Gaber, 2014) to address the unevenness of the network bandwidth in the Cloud environment.  In the Partition Sketch, the execution of a multi-level graph partitioning algorithm is modeled as a tree structure (Sakr & Gaber, 2014).  Each node represents the graph that serves as the input for the partitioning operation at one level of the overall graph partitioning process (Chen et al.; Sakr & Gaber, 2014).  The root node represents the input graph, the non-leaf nodes at level (i + 1) represent the partitions produced by the ith iteration, and the leaf nodes represent the graph partitions generated by the multi-level graph partitioning algorithm.  The Partition Sketch is a binary tree, since graph partitioning is often implemented using bisection (Chen et al.; Sakr & Gaber, 2014).  The “Machine Graph” and “Partition Sketch” frameworks are the proposed network-bandwidth-aware graph partitioning techniques for the Cloud (Chen et al.; Sakr & Gaber, 2014); they enhance a popular multi-level graph partitioning algorithm with network performance awareness (Chen et al.; Sakr & Gaber, 2014).
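
To make the tree structure concrete, the sketch below uses a toy bisection that merely halves a vertex list (real multi-level partitioners are far more sophisticated) and records each bisection as a tree node, so that the leaves correspond to the final partitions.

def build_partition_sketch(vertices, levels):
    # Record each bisection as a tree node; leaves are the final partitions.
    node = {"vertices": list(vertices), "children": []}
    if levels == 0 or len(vertices) <= 1:
        return node
    mid = len(vertices) // 2                          # naive halving stands in for bisection
    node["children"] = [build_partition_sketch(vertices[:mid], levels - 1),
                        build_partition_sketch(vertices[mid:], levels - 1)]
    return node

def leaves(node):
    if not node["children"]:
        return [node["vertices"]]
    return leaves(node["children"][0]) + leaves(node["children"][1])

sketch = build_partition_sketch(["a", "b", "c", "d", "e", "f", "g", "h"], levels=2)
print(leaves(sketch))  # [['a', 'b'], ['c', 'd'], ['e', 'f'], ['g', 'h']]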

References

Chen, R., Weng, X., & He, B. On the efficiency and programmability of large graph processing in the cloud.

Chen, R., Yang, M., Weng, X., Choi, B., He, B., & Li, X. (2012). Improving large graph processing on partitioned graphs in the cloud. Paper presented at the Proceedings of the Third ACM Symposium on Cloud Computing.

Hendrickson, B., & Kolda, T. G. (2000). Graph partitioning models for parallel computing. Parallel Computing, 26(12), 1519-1534.

Sakr, S., & Gaber, M. (2014). Large Scale and Big Data: Processing and Management: CRC Press.

Traditional Distributed Models and Cloud Computing

Dr. Aly, O.
Computer Science

Introduction

The traditional distributed models are the Shared Memory (SM) and the Message Passing (MP) programming models.  SM and MP are the two major parallel computing architecture models (Coulouris, Dollimore, & Kindberg, 2005; Manoj, Manjunath, & Govindarajan, 2004).  They provide the basic interaction model for distributed tasks but lack any facility to automatically parallelize and distribute tasks or to tolerate faults (Hammoud, Rehman, & Sakr, 2012; Sakr & Gaber, 2014).  Advanced models such as MapReduce, Pregel, and GraphLab were introduced to overcome the inefficiencies and challenges of these traditional distributed models, especially when porting them to the Cloud environment (Fernández et al., 2014; Low et al., 2012; Sakr & Gaber, 2014).  These newer models are built upon the two traditional SM and MP models and offer various key properties for the Cloud environment; they are also referred to as “distributed analytics engines” (Sakr & Gaber, 2014).

Shared Memory Distributed Programming

An SM system has a global memory that is accessible to all the processors in the system (Manoj et al., 2004).  SM techniques include two models based on how this global memory is shared across processors: the Uniform Memory Access (UMA) model (Manoj et al., 2004; Pulipaka, 2016) and the Non-Uniform Memory Access (NUMA) model (Coulouris et al., 2005; Dean & Ghemawat, 2004; Hennessy & Patterson, 2011; Manoj et al., 2004; Mishra, Dehuri, & Kim, 2016).  In the UMA model the memory access time is the same for all processors, whereas in the NUMA model the access time varies across processors (Manoj et al., 2004).  Hardware Distributed Shared Memory (DSM) systems such as Stanford DASH, SCI, DDM, and KSR1 are examples of NUMA (Manoj et al., 2004).  In a uniprocessor environment, programming an SM system simply involves updating shared data, which is regarded as a straightforward programming process (Manoj et al., 2004).  However, as the number of processors increases, the environment experiences increased contention and long latencies in accessing the shared memory (Manoj et al., 2004), which diminish performance and limit scalability (Manoj et al., 2004).  Additional issues with SM systems involve data access synchronization, cache coherency, and memory consistency (Manoj et al., 2004): developers must ensure the appropriate memory access order through synchronization primitives (Manoj et al., 2004).  Moreover, implementing the SM abstraction in hardware increases the cost of the system (Manoj et al., 2004).

In the SM programming model, data is not explicitly communicated but implicitly exchanged via sharing, which entails the use of synchronization techniques within the distributed programs (Sakr & Gaber, 2014).  Tasks communicate by reading and writing to shared memory or disk locations.  A task can access any location in the distributed memories/disks, similar to the threads of a single process in an operating system, where all threads share the process address space and communicate by reading and writing to that space (Sakr & Gaber, 2014).  Synchronization is required to control the order in which read/write operations are performed by the various tasks and to prevent distributed tasks from simultaneously writing to shared data, which would produce inconsistent or corrupted data (Sakr & Gaber, 2014).  Communication between two tasks is thus implicit, through reads and writes to shared arrays and variables, while synchronization is explicit, through locks and barriers (Sakr & Gaber, 2014).
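
A small hedged illustration of this pattern, using ordinary Python threading rather than any of the cited systems: the tasks communicate implicitly by updating shared state, and a lock provides the explicit synchronization.

import threading

shared = {"total": 0}
lock = threading.Lock()

def task(values):
    for v in values:
        with lock:                    # explicit synchronization point
            shared["total"] += v      # implicit communication through shared state

threads = [threading.Thread(target=task, args=([1] * 1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(shared["total"])                # 4000, regardless of thread interleaving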

The SM programming model must meet two main requirements: developers do not need to explicitly encode functions to send/receive messages, and the underlying storage layer provides a shared view to all tasks.  MapReduce satisfies both requirements with its map and reduce phases: communication occurs only between the map and reduce phases, under the full control of the MapReduce engine, and synchronization is likewise handled by the engine (Sakr & Gaber, 2014).  For the storage layer, MapReduce utilizes the Hadoop Distributed File System (HDFS), which provides a shared abstraction to all tasks, where any task can transparently access any location in HDFS (Sakr & Gaber, 2014).  Thus, MapReduce provides a shared-memory abstraction that is supplied internally by Hadoop, which comprises the MapReduce engine and HDFS (Sakr & Gaber, 2014).  GraphLab also offers a shared-memory abstraction by eliminating the need for users to explicitly send/receive messages in update functions and by providing a shared view among the vertices in a graph (Fernández et al., 2014; Low et al., 2012; Sakr & Gaber, 2014).  GraphLab allows the scopes of vertices to overlap and allows vertices to read and write from and to their scopes, which can cause read-write and write-write conflicts between vertices sharing scope (Fernández et al., 2014; Low et al., 2012; Sakr & Gaber, 2014).
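
A single-process sketch of the map, shuffle, and reduce flow follows (plain Python, not Hadoop code); it shows that the only exchange of data happens in the shuffle between the two phases, which is the step the MapReduce engine controls.

from collections import defaultdict

def map_phase(block):
    return [(word, 1) for word in block.split()]

def shuffle(mapped):
    grouped = defaultdict(list)
    for key, value in mapped:
        grouped[key].append(value)       # the only point where data is exchanged
    return grouped

def reduce_phase(grouped):
    return {key: sum(values) for key, values in grouped.items()}

blocks = ["to be or not", "to be"]       # stand-ins for HDFS blocks
mapped = [pair for block in blocks for pair in map_phase(block)]
print(reduce_phase(shuffle(mapped)))     # {'to': 2, 'be': 2, 'or': 1, 'not': 1}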

Message Passing Distributed Programming

In the MP programming model, distributed tasks communicate by sending and receiving messages; the tasks do not share an address space through which they could access each other's memories (Sakr & Gaber, 2014).  The MP programming model thus provides an abstraction similar to that of processes, rather than threads, in an operating system (Sakr & Gaber, 2014).  The model incurs communication overheads for explicitly sending and receiving messages that contain data, such as variable network latency and potentially excessive data transfer (Hammoud et al., 2012; Sakr & Gaber, 2014).  In the MP programming model no explicit synchronization is needed, because every send operation has a corresponding receive operation (Sakr & Gaber, 2014).  Moreover, no illusion of a single shared address space is required for the tasks of the distributed system to interact (Sakr & Gaber, 2014).
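
A minimal hedged sketch of explicit message passing between two processes, using plain Python multiprocessing rather than MPI or Pregel: every send is matched by a receive, and no shared address space is assumed.

from multiprocessing import Pipe, Process

def worker(conn):
    data = conn.recv()        # explicit receive, matching the send below
    conn.send(sum(data))      # explicit send of the result
    conn.close()

if __name__ == "__main__":
    parent_conn, child_conn = Pipe()
    p = Process(target=worker, args=(child_conn,))
    p.start()
    parent_conn.send([1, 2, 3, 4])   # explicit send; no shared address space involved
    print(parent_conn.recv())        # 10
    p.join()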

The Message Passing Interface (MPI), an industry-standard library for writing message-passing programs, is a popular example of the MP programming model (Hammoud et al., 2012; Sakr & Gaber, 2014).  Pregel is regarded as the common analytics engine that utilizes the message-passing programming model (Malewicz et al., 2010; Sakr & Gaber, 2014): in Pregel, vertices can communicate only by sending and receiving messages, which must be explicitly encoded (Malewicz et al., 2010; Sakr & Gaber, 2014).  SM programs are generally easier to develop than MP programs (Manoj et al., 2004; Sakr & Gaber, 2014); the code structure of an SM program is often not much different from that of its sequential counterpart, with only additional directives added to specify parallel/distributed tasks, the scope of variables, and synchronization points (Manoj et al., 2004; Sakr & Gaber, 2014).  However, large-scale distributed systems such as the Cloud imply non-uniform access latencies (accessing remote data takes more time than accessing local data), forcing programmers to lay out data close to the relevant tasks (Sakr & Gaber, 2014).

References

Coulouris, G. F., Dollimore, J., & Kindberg, T. (2005). Distributed systems: Concepts and design: Pearson Education.

Dean, J., & Ghemawat, S. (2004). MapReduce: Simplified data processing on large clusters. Paper presented at the Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI '04).

Fernández, A., del Río, S., López, V., Bawakid, A., del Jesus, M. J., Benítez, J. M., & Herrera, F. (2014). Big Data with Cloud Computing: an insight on the computing environment, MapReduce, and programming frameworks. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 4(5), 380-409.

Hammoud, M., Rehman, M. S., & Sakr, M. F. (2012). Center-of-gravity reduce task scheduling to lower MapReduce network traffic. Paper presented at the 2012 IEEE 5th International Conference on Cloud Computing (CLOUD).

Hennessy, J. L., & Patterson, D. A. (2011). Computer architecture: a quantitative approach: Elsevier.

Low, Y., Bickson, D., Gonzalez, J., Guestrin, C., Kyrola, A., & Hellerstein, J. M. (2012). Distributed GraphLab: a framework for machine learning and data mining in the cloud. Proceedings of the VLDB Endowment, 5(8), 716-727.

Malewicz, G., Austern, M. H., Bik, A. J., Dehnert, J. C., Horn, I., Leiser, N., & Czajkowski, G. (2010). Pregel: a system for large-scale graph processing. Paper presented at the Proceedings of the 2010 ACM SIGMOD International Conference on Management of data.

Manoj, N., Manjunath, K., & Govindarajan, R. (2004). CAS-DSM: A compiler assisted software distributed shared memory. International Journal of Parallel Programming, 32(2), 77-122.

Mishra, B. S. P., Dehuri, S., & Kim, E. (2016). Techniques and Environments for Big Data Analysis: Parallel, Cloud, and Grid Computing (Vol. 17): Springer.

Pulipaka, G. (2016). Distributed Shared Memory Programming for Hadoop, MapReduce, and HPC Architecture. Retrieved from  https://medium.com/@gp_pulipaka/distributed-shared-memory-programming-for-hadoop-mapreduce-and-hpc-357a1b226ff6.

Sakr, S., & Gaber, M. (2014). Large Scale and Big Data: Processing and Management: CRC Press.