Skip to content

Faunus and Hadoop

Dan LaRocque edited this page Sep 5, 2014 · 29 revisions
This is the documentation for Faunus 0.4.
Faunus was merged into Titan and renamed Titan-Hadoop in version 0.5.
Documentation for the latest Titan version is available at http://s3.thinkaurelius.com/docs/titan/current.

Faunus is designed to support global graph traversals over massive-scale graphs. In particular, this typically leads to two particular use cases: graph derivations and graph statistics.

  • Graph derivation: Given an input graph, derive a new graph based upon the input graph’s structure and semantics. Other terms include graph rewriting and graph transformations.
    • deriving a cousin-graph from father and brother edges.
    • deriving a youth friend-graph by removing all vertices with age greater than 30 and keeping only friend edges.
    • adding properties to the graph elements according to some traversal.
  • Graph statistic: A graph is a complex structure that can best be understood when the nature of its structure is distilled down to a manageable set of numbers. An analogous term is graph/network analysis.
    • counting the number of vertices and edges in the graph.
    • determining the degree distribution of the graph.
    • determining the distribution of edges labels in the graph.
    • counting the number of paths that lead to vertices.
    • emanating all paths of length 3 between two sets of vertices.

Introduction to Hadoop

Hadoop is a distributed storage and processing system that greatly simplifies the creation of distributed computing jobs. The default storage layer for Hadoop is HDFS. HDFS is similar to any other file system in that it can be used to store arbitrarily formatted files (e.g. binary, text, etc.). However, HDFS allows for the storage of files so large that they can not be represented on a single machine. As such, HDFS is a distributed file system. Given files distributed over HDFS, it is possible to process these files using Hadoop’s distributed processing framework, MapReduce. MapReduce represents a computation as a series of parallel/atomic key/value pair computations. There are two steps to MapReduce:

  • Map: For every <key1,value1> input, yield <key2,value2> outputs.
  • Reduce: For every <key2,list<value2>> input, yield <key3, value3> outputs.

In this way, the map-step can be seen as a parallel analysis of all the <key1,value1> pairs represented in the source file/location. The reduce-step aggregates all the values (value2) emitted by the previous map-step that have the same key (key2). In the reduce-step, some algorithm is evaluated over those values (list<value2>) to ultimately write an arbitrary number of <key3,value3> pairs to the sink file/location.

Introduction to Graphs

A graph is a data structure composed of vertices (dots,things) and edges (lines,links). A vertex has a set of incoming edges and a set of outgoing edges. There are numerous types of graphs and the one used by Faunus is the property graph model exposed by the Blueprints API. A property graph is multi-relational in that edges are labeled to denote different types of relationships between vertices. Moreover, every vertex/edge (generally known as an element) can have an arbitrary number of key/value pairs associated with it (note that these key/value pairs should not be confused with the key/value pairs of Hadoop). Graphs are typically processed using traversals. A traversal is a algorithmic walk over the graph in order to arrive at a particular destination (e.g. search or derivation) or to yield some side-effect in the process (e.g. a ranking or recommendation or statistic).

Faunus and Hadoop

Faunus leverages Hadoop for what Hadoop’s MapReduce is best at: parallel processing a massive number of atomic records with (as best as possible) limited communication between records. With Faunus, these atomic records are the vertices of a graph and the processing is global graph traversals.

A graph can be represented by an adjacency list. Each “row” represents a vertex along with its incident edges. For property graphs, the vertex and edge properties are stored in the row as well. Faunus interprets a graph from this perspective. Every key/value in Hadoop is a single vertex along with its properties and its incoming and outgoing edges. As such, a Faunus MapReduce job operates on each vertex in parallel.

There are two types of derivation operations in Faunus.

  • Map-Only: Filters, Element Mutations
    • Map: For every <null,vertex> input, yield a <null,vertex> output.
  • MapReduce: Traversals, Vertex Filters, Groupings/Aggregations
    • Map: For every <null,vertex> input, yield a <id,tagged_element> output.
    • Reduce: For every <id,list<tagged_element>> input, yield a <null,vertex> output.

For every vertex in a map-only step, the vertex is either allowed or filtered from the next processing step or the vertex is mutated in some way (e.g. properties removed/added). For every vertex in a map-reduce step, the vertex’s edges are analyzed and messages are sent to the adjacent vertices. These messages are then aggregated at the reduce step and the message receiving vertex is updated in some way.

Given that the input and output of both map-only and map-reduce steps is <null,vertex>, it is possible to chain these steps together to yield a more complex derivation. The language used to do the chaining is called Gremlin. Faunus’ implementation of Gremlin is breadth-first, cluster oriented, and massively parallel.

The purpose of Faunus (as with Gremlin in general) is to either yield a graph derivation or a graph statistic. These two concepts are graphically represented below.

Graph derivation

Graph statistic

References

The following articles provided insight into the various theoretical and technical issues encountered during Faunus’ design and development.