SequenceFile Format

Dan LaRocque edited this page Sep 5, 2014 · 5 revisions
This is the documentation for Faunus 0.4.
Faunus was merged into Titan and renamed Titan-Hadoop in version 0.5.
Documentation for the latest Titan version is available at http://s3.thinkaurelius.com/docs/titan/current.

  • InputFormat: org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat
  • OutputFormat: org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat

Hadoop’s native binary data file is the SequenceFile. Every Writable object implements two methods, write and readFields, that let it serialize itself to and deserialize itself from a binary stream, and therefore a SequenceFile. Because both FaunusVertex and FaunusEdge implement Writable, they can be stored in a SequenceFile. Moreover, because a SequenceFile is a binary format, it supports a more compact representation than that found with text-based formats such as GraphSON.
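The Writable contract can be illustrated with a minimal sketch. To keep it runnable without Hadoop on the classpath, it declares a local stand-in interface with the same two method signatures as org.apache.hadoop.io.Writable. The SimpleVertex class and its two fields are hypothetical and are not the actual FaunusVertex layout.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class WritableSketch {

    // Local stand-in with the same two methods as org.apache.hadoop.io.Writable,
    // declared here only so the sketch compiles without Hadoop.
    public interface Writable {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    // Hypothetical element for illustration; NOT the real FaunusVertex format.
    public static class SimpleVertex implements Writable {
        public long id;
        public String name;

        public SimpleVertex() { }

        public SimpleVertex(long id, String name) {
            this.id = id;
            this.name = name;
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeLong(id);   // fixed 8 bytes
            out.writeUTF(name);  // 2-byte length prefix plus modified UTF-8 bytes
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            // Fields must be read in the same order they were written.
            id = in.readLong();
            name = in.readUTF();
        }
    }

    // Serialize any Writable to a byte array.
    public static byte[] toBytes(Writable w) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            w.write(new DataOutputStream(buffer));
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Rebuild a SimpleVertex from bytes produced by toBytes.
    public static SimpleVertex fromBytes(byte[] bytes) {
        try {
            SimpleVertex v = new SimpleVertex();
            v.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
            return v;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        SimpleVertex copy = fromBytes(toBytes(new SimpleVertex(42L, "saturn")));
        System.out.println(copy.id + " " + copy.name);
    }
}
```

In a real job, Hadoop supplies the DataOutput and DataInput streams, and the SequenceFile reader and writer call these two methods for each key and value.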

Faunus-Specific Compression

The following is a list of compression techniques used by Faunus within a SequenceFile.

  • Variable-width encoding of all ints and longs.
  • Edges sorted by direction to reduce the number of direction encodings.
  • Edges sorted by label to reduce the number of label encodings.
  • Only the adjacent vertex id is stored, since the root vertex’s id can be inferred.
  • Each element property’s type is encoded as a single byte.
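Several of the techniques listed above can be sketched in a few lines. The example below is an illustration under stated assumptions, not Faunus's actual byte layout: the Edge class, the run-based grouping, and the base-128 varint scheme are all chosen for the sketch. It sorts edges by direction and label, writes each (direction, label) pair once per run, and stores only adjacent vertex ids as variable-width longs.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class CompactEdgeSketch {

    public enum Direction { OUT, IN }

    // Hypothetical edge record: the root vertex id is implied by the owning
    // vertex, so only the adjacent vertex id is kept.
    public static final class Edge {
        final Direction direction;
        final String label;
        final long adjacentId;

        public Edge(Direction direction, String label, long adjacentId) {
            this.direction = direction;
            this.label = label;
            this.adjacentId = adjacentId;
        }
    }

    // Base-128 varint: 7 payload bits per byte, high bit set while more bytes follow.
    public static void writeVarLong(DataOutputStream out, long value) throws IOException {
        while ((value & ~0x7FL) != 0) {
            out.writeByte((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.writeByte((int) value);
    }

    public static byte[] encodeVarLong(long value) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            writeVarLong(new DataOutputStream(buffer), value);
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Decodes a byte array holding exactly one varint.
    public static long decodeVarLong(byte[] bytes) {
        long result = 0;
        int shift = 0;
        for (byte b : bytes) {
            result |= (long) (b & 0x7F) << shift;
            shift += 7;
        }
        return result;
    }

    // Sort by direction, then label; write direction and label once per run,
    // followed by the run length and the adjacent ids as varints.
    public static byte[] encodeEdges(List<Edge> edges) {
        List<Edge> sorted = new ArrayList<>(edges);
        sorted.sort(Comparator.comparing((Edge e) -> e.direction)
                .thenComparing(e -> e.label));
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            int i = 0;
            while (i < sorted.size()) {
                Edge first = sorted.get(i);
                int end = i;
                while (end < sorted.size()
                        && sorted.get(end).direction == first.direction
                        && sorted.get(end).label.equals(first.label)) {
                    end++;
                }
                out.writeByte(first.direction.ordinal()); // one byte per run
                out.writeUTF(first.label);                // label once per run
                writeVarLong(out, end - i);               // number of edges in the run
                for (int j = i; j < end; j++) {
                    writeVarLong(out, sorted.get(j).adjacentId);
                }
                i = end;
            }
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this layout, small ids cost one byte instead of eight, and a vertex with many edges of the same direction and label pays for that direction and label only once.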

Intermediate Format

Given that a SequenceFile is compact, splittable, and a native Hadoop format, Faunus makes use of the SequenceFile as the intermediate representation between consecutive Faunus jobs. In other words, when a Faunus computation requires more than one MapReduce phase, a SequenceFile representing the output of the first MapReduce job is temporarily persisted in HDFS and fed as the input to the second MapReduce job.
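The hand-off between two phases can be sketched without Hadoop. In the example below, a local temporary file stands in for the SequenceFile persisted in HDFS, and two plain methods stand in for the MapReduce jobs. The class name, the even-id filter, and the record layout are assumptions made for illustration only.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class IntermediateFileSketch {

    // Phase 1 (stand-in for the first job): keep even vertex ids and
    // persist them as fixed-width binary records.
    static void phaseOne(long[] vertexIds, Path intermediate) throws IOException {
        try (DataOutputStream out = new DataOutputStream(Files.newOutputStream(intermediate))) {
            for (long id : vertexIds) {
                if (id % 2 == 0) {
                    out.writeLong(id);
                }
            }
        }
    }

    // Phase 2 (stand-in for the second job): read the intermediate file
    // and count the records it contains.
    static long phaseTwo(Path intermediate) throws IOException {
        long count = 0;
        try (DataInputStream in = new DataInputStream(Files.newInputStream(intermediate))) {
            while (true) {
                in.readLong();
                count++;
            }
        } catch (EOFException endOfFile) {
            // Reaching the end of the file is the normal exit from the read loop.
        }
        return count;
    }

    // Chain the two phases through a temporary file, then clean it up.
    public static long run(long[] vertexIds) {
        Path intermediate = null;
        try {
            intermediate = Files.createTempFile("faunus-sketch-", ".bin");
            phaseOne(vertexIds, intermediate);
            return phaseTwo(intermediate);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            if (intermediate != null) {
                try {
                    Files.deleteIfExists(intermediate);
                } catch (IOException ignored) {
                    // Best-effort cleanup of the temporary file.
                }
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(run(new long[] {1, 2, 3, 4, 6}));
    }
}
```

The second phase never sees the original input, only the binary file written by the first, which mirrors how a later job consumes the temporary output of an earlier one.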