Skip to content
This repository has been archived by the owner on Jun 29, 2022. It is now read-only.

Is this thin? #18

Closed
Ericson2314 opened this issue Aug 12, 2016 · 13 comments
Closed

Is this thin? #18

Ericson2314 opened this issue Aug 12, 2016 · 13 comments
Labels
kind/support A question or request for support

Comments

@Ericson2314
Copy link

Ericson2314 commented Aug 12, 2016

I think this is a good format, and I love the idea that the thin waist of IPFS gets it's own name and spec, but I don't think this format works well as a thin waist.

It's great it use from the application's perspective, but from the perspective of an implementation (storage, networking, etc), imposes extra obligations. The implementation needs to, e.g.:

But suppose that a bunch of implementations just used "bare-bones" Merkle nodes made up of a) a list hashes of other nodes, and b) a list of bytes. It's straightforward to implement IPLD on top of that: replace hashes in links with indices into the list of hashes, pack into canonical form, and use that as the list of bytes. All of bullet points I provisioned now become the responsibility of the bare-bones Merkle DAG <-> IPNS shim, and not the underlying system. If this is indeed a valid implementation strategy, then it would be nice to specify that bare-bones format so the shim can indeed be reused.

Now, it sometimes to make sense to give lower layers more information than they strictly need for the sake of optimization. But I don't see how a storage / networking layer could leverage the knowledge that nodes have the rich stricture of IPLD rather than the dumb structure of that bare-bones example. If there indeed is none, then all layers should use the bare bone format with a shim, and thus the bare bones format (by it's simplicity) becomes the thin waist.

@Stebalien
Copy link
Contributor

understand/verify some Unicode encoding (probably UTF-8)

Only if it creates IPLD objects and wants to validate that they aren't malformed. For storage, networking, etc... one can treat names as bytes. Also, one of the really nice things about UTF-8 is that it only uses bytes 127+ for non-ASCII data. This means that, e.g., when parsing a path one can just split on "/" (0x47) without doing any UTF-8 parsing as 0x47 can only mean "/".

with #9 understand IPNS

See #9 (comment). Basically, @jbenet agrees with you (and so do I). While IPLD will support links like this, IPLD implementations shouldn't resolve them.

with #1 traverse other IPLD objects when dereferencing

Even without #1, this is true. Being able to point into/through objects was a motivation for IPLD.

But suppose that a bunch of implementations just used "bare-bones" Merkle nodes made up of a) a list hashes of other nodes, and b) a list of bytes. It's straightforward to implement IPLD on top of that: replace hashes in links with indices into the list of hashes, pack into canonical form, and use that as the list of bytes. All of bullet points I provisioned now become the responsibility of the bare-bones Merkle DAG <-> IPNS shim, and not the underlying system. If this is indeed a valid implementation strategy, then it would be nice to specify that bare-bones format so the shim can indeed be reused.

That's effectively what the Protobuf version is. Yes, this does make it easier to traverse a DAG. Unfortunately, this doesn't actually fix anything because we want IPLD to be compatible with existing merkeldags. That is, we want to be able to interpret the block chain, git, etc. as IPLD data structures so we already need to support multiple complex formats. Someone who knows more about the motivation for IPLD can probably give you a better answer to this.

@Ericson2314
Copy link
Author

Ericson2314 commented Aug 13, 2016

Only if it creates IPLD objects...

That's all true, but it's nice to be able to lift those restrictions and be able to create objects too.

See #9 (comment). Basically, @jbenet agrees with you (and so do I).

Whew :).

That is, we want to be able to interpret the block chain, git, etc. as IPLD data structures so we already need to support multiple complex formats. Someone who knows more about the motivation for IPLD can probably give you a better answer to this.

Even without #1, this is true. Being able to point into/through objects was a motivation for IPLD.

Ok, but just cause we want a format that does that doesn't mean it needs to be the thin waist.

That's effectively what the Protobuf version is.

Not quite. The protobuf version also has string keys---mine explicitly forgoes any attempt to support fs-style paths and leaves that to a higher layer. I'm a bit skeptical of the size field on links too---that might be better as an optional hint, or left as wire format detail not part of the core spec as it doesn't add any information.


Unfortunately, this doesn't actually fix anything because we want IPLD to be compatible with existing merkeldags. That is, we want to be able to interpret the block chain, git, etc. as IPLD data structures so we already need to support multiple complex formats. Someone who knows more about the motivation for IPLD can probably give you a better answer to this.

Oh, Ok. Now things make a little more sense. My idea for this was to abuse multihash: instead of using just a one-byte discriminant, do something variable-size and re-purpose some large tags for these legacy protocols. Certainly any way of supporting these legacy formats is going to involve some dubious layering.

@jbenet
Copy link
Contributor

jbenet commented Aug 13, 2016

Lots to say of course, a bit limited for time today. Just this for now:

The point of IPLD as "a thin-waist" is to create a common format for distributed authenticated protocols. There needs to be enough in the thin waist to be valuable, and not too much to be bloated. We think the current construction is the right balance, there's very little that it does / forces, but enough that meaty applications can interop with it. As we have it, we'll be able to link all existing merkledags too. (The merkle forest is seeded \o/)

@Ericson2314
Copy link
Author

Ericson2314 commented Aug 14, 2016

Sure, it's the weekend, happy to read more later. At this point, well, I figured you all do think it is a good thin waist seeing that this is what you went with :). And yes, I am enough of an "end-to-end-er" in that I gravitate towards absolutely minimalism while adding more layers to pick up the ergonomic slack.

The plan special casing the hashing for the sake of various legacy formats (including but not limited to the old protobufs one) I find interesting because that is the only area I can see my bare-bones<->IPLD shim idea running into trouble.

@nicola
Copy link
Member

nicola commented Aug 14, 2016

@Ericson2314 thanks for filing this in. I am going to go through your argument to make sure I agree and we understand each other!

  • A thin layer to be thin must be simple and should just do as little as possible and be enough to be used across different applications, changing the pieces below it (in our case content-distribution) and above it (applications)
  • IPLD is a data model (JSON-like) that differently from JSON & co. introduce a Link Object that links to a new object via an IPLD path. An IPLD path is a path that traverse this link object attributes prefixed by a (special form of a) hash and a optional namespace.
  • [simple case] Consider the simple IPLD case in which you have IPLD objects that point to each other ONLY via (special) hashes - so no path, no namespace.
  • [advanced cases] Now to the simple objects we allow slightly complex objects:
    • where links have a path after the hash (HASH+PATH): your client/parser/resolver needs to know how to resolve them (which is easy!
    • where links have a namespace before the hash (NS+HASH[+PATH]): your client/parser/resolver needs to know about the namespace.

Would the simple case be the bare-bone case you are discussing?
Supporting advanced cases mainly means making the IPLD path "just identifiers" to a parser, but not to a resolver, would this be too complex?

Let me know!

@nicola nicola added kind/support A question or request for support discussion labels Aug 14, 2016
@Ericson2314
Copy link
Author

Ericson2314 commented Aug 15, 2016

First two are totally correct. But let me elaborate on the rest.

simple vs complex.

There's more difference than that more than just that.. IPLD add JSON-like structure within each merkle node. My bare-bones (node-hashes, bytes) on the contrary has purely unstructured nodes---all the structure comes from the Merkle dag itself. I'll explain this structure, and why don't like it, piece by piece.

Firstly (granted, I didn't talk about this effect before), this creates "observable chunking choices" if someone is, say, converting to JSON to IPLD. Every parent-child edge can be "outlined" into a full-blown IPLD link (the child is replaced with its hash), or "inlined" and kept as is. This means that anyone creating nodes needs to decide between these two options that are semantically equivalent. This choice essentially boils down to performance tradeoffs: bigger chunks (less outlining) might mean less requests), smaller chunks means more deduplication. But data often lives longer than the optimal of such tuning---i.e. newer hardware and cleverer algorithms and wire/memory formats can probably reduce the cost of coarser grained chunking. This should be all the more so true for the permanent web. I'd feel more comfortable if every child had to be "outlined" and given its own hash.

So say we do that, so IPLD now conceptually looks like:

<ipld> ::= { (<string>: <link>)* }
        |  [ <link>* ]
        |  <atom> # i.e. strings, numbers, booleans, null

My second complain is strings/maps. As I mentioned in previous comments, from the implement's side, this creates burdens. When creating nodes, the implementation needs to validate UTF-8 (and the receiver may want to too). Also when decoding nodes the implementation probably wants to make sure no map key is duplicated.

Now, perhaps most languages have good unicode implementations, and this is no big deal. It is almost my experience that perhaps first class strings and maps, or more broadly mandatory self-describing data (cause that's generally how this stuff is used in the JSON world) is perhaps not as useful as it seems. [This is definitely more opinion territory, and I don't really expect to make anybody that doesn't think this already change their minds. But rather consider there's more people like me.]

I write mainly Haskell and Rust, which are statically typed and put algebraic data types everywhere. For single language systems, we don't need self describing data because both ends know exactly what form it is in. Now programs definitely do evolve, and self-describing data does too, but JSON and like formats (including IPLD) are notoriously awkward for this because they make adding more fields struct a lot more easier than variants to a enum (using the easier-to-understand Rust/Swift terminology). Furthermore, we often like to refactor our data in more drastic ways---changing whole hierarchies---where JSON-style self description no longer works for backwards compatibility. [I'm sure everybody does this at some point, but I think static types make it less painful / more common.] So basically for the use-cases I'm most excited to use IPFS for, the strings and maps are awkward / unhelpful.

Taking out strings and maps leaves:

<ipld> ::= [ <link>* ]
        |  <atom> # numbers, booleans, null

Which is closer. I need to go now, but I'll hopefully soon continue anti-motivating features and stripping it away. Also, remember as I wrote above for those that like IPLD exactly the way it is, it can be implemented on top of my bare-bones with little-to-no penalty. I'll elaborate on that too.

@Ericson2314
Copy link
Author

Ericson2314 commented Aug 15, 2016

The language is now simple enough where the different sorts of atoms seem a bit incongruous.
Mathematically there are many different types of numbers, and computationally different ways to represent/approximate those different types, so it's hard to come up with a one size fits all choice.
Booleans and null don't have those variations, but again in a statically typed world you don't need that reflection of these built-in atoms too, so let's just have byte as our only atom. Now we have:

<ipld> ::= [ <link>* ]
        |  <byte>

Putting only a single byte in only leaf nodes is a bit ridiculous, so let's allow every node to have some bytes:

<ipld> ::= [ <link>* ; <byte>* ]

We're now close but what goes in a link? Something perhaps like:

<link> ::= "ipfs" / <multihash>
        | "ipld" / <pubkey>
        |  <link> / <natural-number-index>

Note that I've already removed maps but not arrays so positional indexing still make sense for relative paths---this will be my strawman for discrediting relative paths :).

I've mentioned mutable paths, but let me expand a bit. If all software storing/exchange IPFS nodes ought to make sure that IPLD references are defensible, then we are blurring the boundary between IPFS and IPNS---but I think everyone agrees this is sketchy. If we are just distinguishing mutable links for the user's sake, and IPFS implementations are free to ignore them, I don't think this is very thin-waist-ish. IPLD and application data are both black boxes as far as IPFS is concerned---all upper layers are equally opaque.

[Now conversely, for IPLD-as-is repurposed as a batteries-included standard especially tailed for end languages written with dynamic types I think it's fine to put any sort of mutable or immutable link there.]

That leaves indexing, or more generally relative paths. My problem for the thin waist is they provide a fall sense of cheapness. Given an untrusted network (the default for IPFS), a node requesting data for a multihash/path still needs the node pointed to by the hash and it's children down to the relative node to prove that second node is indeed the proper descendant. Maybe for some optimized network protocol there can be a way to prefetch such a chain, but for the data itself I don't see what this buys us. It creates multiple canonical forms for the exact same data since we're talking immutable data. [Conversely with mutable link data / cycles relative paths are super nice in order to make more updates atomic.]

With those two gone, we have:

<link> = "ipfs" / <multihash>

the "ipfs/" doesn't add any information anymore, so we can just drop it:

<link> = <multihash>

inlining link in our minimal ipfs, and removing more information-less charcters we have get:

<ipld> ::= <multihash>* <byte>*

Now THAT is my minimal thin waist proposal.

@nicola nicola mentioned this issue Aug 22, 2016
2 tasks
@nicola
Copy link
Member

nicola commented Aug 22, 2016

@Ericson2314, thank you so much for writing all this down, this is really valuable. I haven't been much connected recently, and I had to think more about what you say.

I will go in this order (1) address your complaints (2) discuss your thinner layer

1

Complaint 1: inline vs outlined: if every child had to be "outlined" and given its own hash.

Here I guess we are still discussing a data model that is JSON-like. At the very very beginning of my IPLD interest, I just needed every attribute to be a link, eventually linking to a leaf. This is really great for many reasons: a lot of data == a lot of deduplication (to be proven!), we can make cheaper proofs of membership, the data model is REALLY simple.

However, as soon as I started talking to the IPFS people I realized that being too strict would cause some issues to them, let me give you some examples:

{
  folder: {
    shard1: [hash1, hash2, hash3]
  },
  permission: { r: "Nicola", w: "Juan" }
}

hash1
{
  file: hashFile1,
  size: 300
}

Now, objects like permission or simple maps like size would become really big if every attribute would point to a hash (since the size of a hash is bigger than the permission object or the size value). Also, this means that each client would need to store ALL of these hashes mapping (which it makes sense if the value is repeated multiple times)

Complaint 2: self-describing data is not as useful as it seems

Let's live this as you said: there's more people like me.
This is really important since you are bringing us a use case.

2

There has been a lot of conversation about this recently, I strongly suggest you to participate in this: ipfs/specs#130. I will be short (but please ping me online if you want me to expand on this).

The way I see this is to provide a generic data model (like in the IPLD on CBOR case) and a path scheme that makes it easier to traverse and point to data. However, given the CID proposal, we can have IPLD on other formats, at a point in which we are even supporting byte arrays and array position in the path scheme.

In conclusion, the way I see is the following: we have a byte array (that can be an encoded CBOR or your particular implementation described above), a path scheme and a parser (in the current IPLD, it's an IPLD+CBOR parser). The role of this parser is: given a path, give me the bytes in the byte array. Typing is done via the CID bit.

How does this sound?

@Ericson2314
Copy link
Author

If we can do what I suggest in ipfs/specs#130 (comment) to make IPLD just another format supported by CIDs, then I am pretty much OK with however IPLD turns out. :)

@Ericson2314
Copy link
Author

Ericson2314 commented Aug 25, 2016

@nicola about the issue of fixed hashing ever child causing too much fragmentation. A middle ground is to only hash the children of maps or lists (+ strings = non-atoms) if the children are themselves non-atoms. This means for the JSON:

{
  "foo": null,
  "bar": "asdf asdf asdf asdf asdf asdf adsf asdf asdf adsf asdf asdf asdf q",
  "baz": { "a": 1, "b": 2, "c": 3 },
  "qwer": [ true false false ]
}

the hashed nodes are:

{
  "foo": null,
  "bar": <hash>,
  "baz": <hash>,
  "qwer": <hash>
}
"asdf asdf asdf asdf asdf asdf adsf asdf asdf adsf asdf asdf asdf q"
{ "a": 1, "b": 2, "c": 3 }
[ true false false ]

For easy memorizing, this mirrors when a pointer would be needed if maps are stored as association lists.

@nicola
Copy link
Member

nicola commented Sep 12, 2016

Hey, I am going to close this issue,

Resolution: CID makes IPLD thinner than IPLD+CBOR

We can discuss the other issues that we discussed (e.g. how to hash every children & so on) in new issues, feel free to re-open if you think there is more conversation to do, or open others!

@nicola nicola closed this as completed Sep 12, 2016
@nicola nicola removed the discussion label Sep 12, 2016
@Stebalien
Copy link
Contributor

Resolution: CID makes IPLD thinner than IPLD+CBOR

That's not true at all. CIDs reduce the size of the IPLD spec but increase the implementation complexity (the complaint in this post). That is, with CIDs, reading an IPLD object requires understanding every supported format.

@Ericson2314
Copy link
Author

That's true, but I was thinking one can just understand the CID-supported formats they want and ignore the rest.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
kind/support A question or request for support
Projects
None yet
Development

No branches or pull requests

4 participants