
js-ipfs pinning performance #2197

Closed

dirkmc opened this issue Jun 24, 2019 · 4 comments
Labels
  • exp/wizard: Extensive knowledge (implications, ramifications) required
  • exploration
  • kind/support: A question or request for support
  • status/ready: Ready to be worked

Comments

@dirkmc
Contributor

dirkmc commented Jun 24, 2019

ipfs.add() performance degrades severely once the number of pins exceeds 8192

Background

Users can pin a file or a block to prevent it from being garbage collected.

The pinning module maintains two sets of pins:

  • direct: the CID of the block that is pinned
  • recursive: the CID of the root node of a DAG of blocks

These pin sets are stored in the block store with the following structure:

  • if the number of pins is less than 8192, create a node with
    • 256 links pointing to an empty block
    • a link pointing to each pinned block
  • if the number of pins is greater than 8192
    • distribute the pins deterministically amongst 256 buckets (a sketch of the hashing follows this list)
    • each bucket is a node with one or more pins, with the same structure described above (i.e. buckets that themselves exceed 8192 pins distribute their pins into sub-buckets, and so on)
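For illustration, here is a minimal JavaScript sketch of the deterministic bucket choice. It is not the actual js-ipfs code (the real pin set may mix other data, such as a seed or depth, into the hash input), but it shows the principle: the bucket is a pure function of the pin, so the same CID always lands in the same bucket.

```js
// Minimal sketch: deterministically assign a pinned CID to one of 256 buckets.

// 32-bit FNV-1a over a byte array
function fnv1a (bytes) {
  let hash = 0x811c9dc5
  for (const b of bytes) {
    hash ^= b
    hash = Math.imul(hash, 0x01000193) >>> 0
  }
  return hash
}

function bucketFor (cidBytes, numBuckets = 256) {
  return fnv1a(cidBytes) % numBuckets
}

// Example with some made-up "CID" bytes
const fakeCid = Uint8Array.from([0x12, 0x20, 0xab, 0xcd, 0xef])
console.log(bucketFor(fakeCid)) // always the same bucket for the same bytes
```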

Performance

A pin set with fewer than 8192 pins is stored in a single DAG node. Once there are more than 8192 pins, they are distributed among 256 buckets, each with its own DAG node. Each time a new pin is added to the set, the distribution across the group of buckets is recalculated and the bucket nodes are written to the block store. The distribution is deterministic, so in reality only one bucket changes each time a new pin is added.

For example, if we simplify and say there are 8 buckets, with 5 pins (A - E):
[] [D] [] [EA] [] [C] [] [B]
When we add pin F only one bucket changes:
[] [D] [] [EA] [] [C] [] [BF]

We can improve performance by adding a cache that remembers the structure of the pin sets and only writes the nodes that change to the block store (instead of writing all nodes each time a pin is added or removed). This improves ipfs.add() performance dramatically once we exceed 8192 pins (a sketch of the caching idea follows the chart below):

[Figure: pinning-perf (ipfs.add() benchmark chart)]
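To make the caching idea concrete, here is a hypothetical sketch rather than the actual js-ipfs implementation: the cache mirrors the 256 buckets in memory, and adding a pin re-serializes and persists only the one bucket that changed. writeNode is an invented stand-in for whatever builds the bucket's DAG node and puts it in the block store.

```js
// Hypothetical sketch of the cache: only the bucket that changed is
// rebuilt and written back to the block store on each new pin.
// writeNode() is an invented stand-in for persisting a bucket's DAG node.

class PinSetCache {
  constructor (numBuckets = 256) {
    // in-memory mirror of each bucket's pins
    this.buckets = Array.from({ length: numBuckets }, () => [])
  }

  async addPin (index, cidBytes, writeNode) {
    this.buckets[index].push(cidBytes)
    // without the cache, every bucket node would be rebuilt and written here;
    // with it, only the single dirty bucket is persisted
    await writeNode(index, this.buckets[index])
  }
}

// Usage: in practice the index would come from the deterministic hash
// (bucketFor() in the earlier sketch)
const cache = new PinSetCache()
cache.addPin(42, Uint8Array.from([0x12, 0x20, 0x01]), async (i, pins) => {
  console.log(`writing bucket ${i} with ${pins.length} pin(s)`)
})
```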

Memory usage

The pinner uses fnv1a to distribute pins. fnv1a outputs a number (8 bytes), so if we also use it for cache keys each key will be 8 bytes. Each pin is represented by a DAG link pointing to the pinned CID. The DAG link has:

  • a name (the empty string)
  • a size (number)
  • a cid
    • version (number)
    • codec (e.g. 'dag-pb')
    • multihash (the hash itself)
    • multibaseName (e.g. 'base58btc')

So, rounding up, a DAG link requires about 128 bytes of memory; 10k pins therefore need roughly 1.3MB of memory for the cache.
Note: Storing the DAGLink object (rather than just the CID as a Buffer) saves us from having to re-create a lot of JavaScript objects, but uses about twice the memory. However, this memory would need to be allocated anyway each time a pin is added.
Note: The cache is not used if there are fewer than 8192 pins.
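As a quick back-of-the-envelope check of the numbers above (the 8-byte key and 128-byte-per-link figures are this issue's estimates, not measurements):

```js
// Rough memory estimate for the cache, using the figures above
const keyBytes = 8        // fnv1a hash used as the cache key
const linkBytes = 128     // rounded estimate for one DAGLink object
const pins = 10000

console.log((pins * linkBytes / 1e6).toFixed(2) + ' MB for the links')               // 1.28 MB
console.log((pins * (keyBytes + linkBytes) / 1e6).toFixed(2) + ' MB including keys') // 1.36 MB
```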

Command line

When invoking ipfs add from the command line with the daemon running, the HTTP API is loaded on each invocation. This can take several times longer than the add operation itself, so we should look at optimizing it.

@alanshaw added the exp/wizard, exploration, kind/support and status/ready labels on Jul 10, 2019
@achingbrain
Member

Does anyone have any context as to why we store pinsets as DAGs instead of storing individual CIDs we don't want to gc in leveldb? I'm thinking having dedicated datastores for pinned CIDs might be more performant than having to perform all these operations every time we read/write the pinsets.

cc @Stebalien @daviddias

@Stebalien
Member

> Does anyone have any context as to why we store pinsets as DAGs instead of storing individual CIDs

The goal was to eventually store the entire repo in a single DAG.

IMO, we should do this but at a different layer. I'd like to:

  1. Store the pin set as key/values in the datastore (easy, can use datastore queries, performant, etc.; a rough sketch of this idea follows below).
  2. Create a new dag-backed datastore (using a tiered HAMT) for everything except blocks.
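To illustrate option 1, here is a rough sketch of pins stored as key/value entries. A plain Map stands in for the repo datastore, and the /pins/ key prefix and value shape are assumptions for illustration, not necessarily what a real implementation would use.

```js
// Rough sketch of option 1: pins as key/value entries instead of a DAG.
// A Map stands in for the repo datastore; the key layout and value shape
// are assumptions for illustration only.

const pinstore = new Map()

const keyFor = (cidString) => `/pins/${cidString}`   // hypothetical key layout

function addPin (cidString, type = 'recursive') {
  pinstore.set(keyFor(cidString), { type })
}

function rmPin (cidString) {
  pinstore.delete(keyFor(cidString))
}

function isPinned (cidString) {
  return pinstore.has(keyFor(cidString))
}

function lsPins () {
  // a real datastore would expose this as a prefix query
  return [...pinstore].map(([key, value]) => ({
    cid: key.slice('/pins/'.length),
    type: value.type
  }))
}

// Usage
addPin('QmExampleRootCid')
addPin('QmExampleBlockCid', 'direct')
console.log(isPinned('QmExampleRootCid')) // true
console.log(lsPins())
```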

@achingbrain
Member

Great, I think 1) is what I'm suggesting so I'll give that a go and see what the performance difference is like.

@achingbrain
Member

This has been fixed by #2771
