Skip to content
This repository has been archived by the owner on Jun 29, 2022. It is now read-only.

car: CARv2 spec #248

Closed
wants to merge 2 commits into from
Closed

car: CARv2 spec #248

wants to merge 2 commits into from

Conversation

rvagg
Copy link
Member

@rvagg rvagg commented Mar 10, 2020

WIP for discussion.

Highlights:

  • The v1 "header" block becomes a "version" block in v2, a CBOR encoded {version:2}—we can exact match the first 11 bytes to determine if it's a v2 (0x0aa16776657273696f6e02—essentially our CARv2 file magic number), otherwise fallback to decoding as per v1. CARv1 code should read this block and bork on version:2.
  • New "header" block is the first block after the version block, properly CID-prefixed, contains:
    • a deterministic flag
    • an array of CID + selector pairs

Deterministic CARs have to exhaustively describe the contained DAGs via the roots + selectors (the selectors can describe boundaries of partial DAGs but they must do so exhaustively). Therefore the CID of the header block for a deterministic CAR should uniquely identify that graphs it contains.

Non-deterministic CARs are lax, anywhere from the deterministic exhaustiveness to just-a-bundle-of-blocks with no roots. Where roots are present, selectors are optional.

A CARv1 would be read as a non-deterministic CARv2 with absent selectors.

@vmx
Copy link
Member

vmx commented Mar 10, 2020

I had a quick read and not really something to add as I'm not that deep into CAR files. Though it's sad to see that we can't do a [ varint | CID | dag-cbor block ] * that also includes the header. due to not being able to distinguish Version 1.

@mikeal
Copy link
Contributor

mikeal commented Mar 10, 2020

This is a good way to approach the initial discussion, but I think that the next version of the spec should probably be either a separate file or a separate sub-section so that we can preserve the original spec. (I read the diff wrong, reading the rendered new file is much clearer)

@mikeal
Copy link
Contributor

mikeal commented Mar 10, 2020

Do we want to try adding the index as a trailer?

* A compressed integer
* A "version" block of byte-length described by the preceding compressed integer
* A series of CID-prefixed blocks:
* A compressed integer prefixing each block indicating the total byte-length of that block, including the length of the encoded CID
Copy link
Contributor

@warpfork warpfork Mar 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it would be useful to have an int for the CID length and an int for the content data length, separately.

Figuring out how long a CID is... is a surprisingly nontrivial operation. https://github.com/ipfs/go-cid/pull/95/files recently added support for this, but involves at least three different varint parses (spread across more than just that one repo!) -- and in the meanwhile, you're passing around an overly large byte slice. (How much too large? I dunno. I guess you just take some bytes and wing it...?)

So, considering all that... yeah: I'd advocate for an int to state the CID size separately. While technically, sure, it's "defined" to just put the CID at the front of the content bytes and just report the total size... a separate size marker for CID and data separately seems advantageous: it's simpler, it's more antifragile, and it might even result in more efficient processing since it'll let code spend more time operating with known amounts of bytes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@warpfork At that stage isn't it simpler to just have proper CBOR:

type DataFrame struct {
  Cid bytes
  Data bytes
}

Copy link
Contributor

@mikeal mikeal Mar 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven’t seen a CBOR parser decode this as fast as we could parse the int based encoding so I’m worried about moving in that direction.

I do like the idea of separating the int lengths. Right now you’ve basically gotta parse the entire CID in order to work through the file. If you have a specific set of CID’s you’re scanning for you could do much faster binary comparison as you scan through the file and avoid full CID parsing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking this through a bit more combining with #248 (comment) :

Having v2 being a uniform stream of {cidlen}{cid}{payloadlen}{payload} where {cid} ( position 2 ) has a special codec signifying car header as opposed to just dag-cbor would solve all your symmetry issues and be super-fast to check for well-formedness...

Copy link
Contributor

@ribasushi ribasushi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall decent direction, I like it. Left a number of comments, mostly not being able to deduce a specifc answer from the spec alone ( which I suppose merits fixing... )

A meta-thought I want to surface: the spec is just that - a spec. If we cite limitations in the spec, and then do not implement any of them in the actual parsers, the limitations are nul and void. That is to say - I would be wary labeling too many things as invalid, because for each such case you need to implement the corresponding err("NOPE!") in the corresponding libraries...

If something is icky but harmless - let's just not mention it and move on.

* **Individual block reads**: as the CAR format contains no index information, reads require either a partial scan to discover the location of a required block or an external index must be maintained and referenced for a seek and partial read of that data. See below regarding indexing.
* **DAG traversal**: without an external index, traversal of a DAG specified by a "root" CID is not possible without dumping all blocks into a more convenient data store or by partial scans to find each block as required, which will likely be too inefficient to be practical.
* **Modification**: CARs may be appended after initial write as there is no constraint in the Header regarding total length. Care must be taken in appending if a CAR is intended to contain coherent DAG data.

### Security and Verifiability

The roots specified by the Header of a CAR must appear somewhere in its Data section, however there is no requirement that the roots define entire DAGs, nor that all blocks in a CAR must be part of DAGs described by the root CIDs in the Header. Therefore, the roots must not be used alone to determine or differentiate the contents of a CAR.
A deterministic CAR should be uniquely describable by the CID of its Header block. Two CARs with the same Header CID with a `deterministic` value of `true` should be comprised of identical bytes when formatted according to this specification.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this mean in practice? Depth-first, output block on "first-seen"? Or "it depends"? If the latter: we need a more nuanced deterministic marker aside from "true/false"...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We only need to cover these details as they relate to the header. If the header is a collection of [root, selector] pairings, then the deterministic ordering is actually defined by the selector spec.

* A compressed integer
* A "version" block of byte-length described by the preceding compressed integer
* A series of CID-prefixed blocks:
* A compressed integer prefixing each block indicating the total byte-length of that block, including the length of the encoded CID
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@warpfork At that stage isn't it simpler to just have proper CBOR:

type DataFrame struct {
  Cid bytes
  Data bytes
}

[ varint | cbor block ] [ varint | CID | dag-cbor block ] [ varint | CID | block ] [ varint | CID | block ] …
```

To maintain backward compatibility, Version 2 retains the lead-in block without CID but _only_ includes a CBOR encoded map containing a single key/value pair: `{ version: 2 }`. Therefore, a decoder may read the first 11 bytes of a CAR and match it against the bytes: `0x0aa16776657273696f6e02` (`0x0a` is a varint representing `10`, followed by 10 bytes of the CBOR encoded form of `{ version: 2 }`).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The need for a leading Version seems like too much magic... Isn't a try-parse-as-v2 or try-parse-as-v1 sufficient here? This also addresses @vmx's ick-factor

Copy link
Contributor

@mikeal mikeal Mar 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this was actually my idea so I’ll defend it.

One way we could maintain backwards compatibility is to keep the old v1 style header but only ever encode it as { version: 2 }. To the old v1 parser this looks like a CAR file with no roots. For a v2 parser you don’t need to even parse it, you can check if it’s a v2 CAR file by doing a simple binary comparison.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Step a little further back though: if we accept that v1 files "in the wild" all start with 0x0aa16776657273696f6e02, then the only concern you have is "what will a v1 ( old) parser do when it encounters a ver != 1 file. We want it to croak. Having v2 being a uniform stream of {cidlen}{cid}{payloadlen}{payload} should not possibly be decodeable as valid cbor, hence "problem solved"? Or you are concerned about the possibility of crafting a valid v2 file that "means something" to a v1 parser...?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want it to croak

There are many different forms of “croak” though 😉

Being unable to even read the header is likely to cause an exception. Reading the header, seeing that the version is beyond the available range, and not having any root information to use in order to parse the file, is a much less aggressive form of croaking. It’s likely to cause the CAR file to simply be ignored rather than causing a noticeable error.

Now, this may not be the desired behavior. We may want a stronger form of croaking, we may want to see clear exceptions, but I’m not sure we do and I’m a little worried about code in the Filecoin ecosystem encountering new CAR files in the future and what the failure mode looks like.

Copy link
Contributor

@ribasushi ribasushi Mar 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Being unable to even read the header is likely to cause an exception.

Only in clients linked against a sufficiently old go-car/js-car not knowing how to read v != 1

Or are you saying that at this point there is likely a significant number of implementations out there that wrote their own car handlers not using any of our libraries?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a high degree of certainty that, by the time we ship this, there will be car handlers we didn’t write in use. For one thing, we don’t have a Rust CAR implementation and there’s Rust stuff happening in the ecosystem.

How well they are written and if they can handle something like this, I have no idea.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see...
So perhaps we should pivot and ship v2 with only the following changes:

  • make versioning forwards compatible
  • potentially fix the cid-length specification / symmetry of data-frames questions

And leave everything else for v3?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure we're willing to drop selectors from scope, are we? We already have Go code in go-car that's passing in a Root+Selector (https://github.com/ipld/go-car/blob/master/selectivecar.go#L21-L24) and presumably corresponding code in go-fil-markets that's working this way, so the data is there and we could get ontop of this pretty quick.


### Data

Immediately following the Header block, **one or more** IPLD blocks are concatenated to form the _Data_ section of the CAR format. _(Caveat: see [Zero blocks](#zero-blocks) under Unresolved Issues.)_ Each block is encoded into a _Section_ by the concatenation of the following values:
Immediately following the Header block, **one or more** IPLD blocks are concatenated to form the _Data_ section of the CAR format. It is invalid for a CAR to contain zero blocks after the Header. Each block is encoded into a _Section_ by the concatenation of the following values:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems a bit wasteful in the case of non-deterministic files. I have a use-case where multiple car files are given and the second is just a "pinset" ( roots have already been seen in a previous car ). Something like:
( almost done writing up, should have it working correctly by early Thursday, handwavey CLI args ) :
ipfs dag import $( ipld-dagger --stdout=car-fifos ) where the inner invocation returns 2 named pipes, the first being an unordered car stream, the second being just the roots to pin, together being governed by one "suspended GC / pin-pass transaction".

Yes, it doesn't cost too much to just keep around the block-contents of each root, and re-emit them at the end again, but what would be the point? The second car is still incomplete either way...

Additionally keep in mind that car files will be used for patching in the future across the ecosystem, and it is conceivable that a "rollback" will contain no data at all, just the "previous root".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this use case, I think we should have a header that doesn’t specify any roots, and we should figure out who/what this might break and how to mitigate.

I know that there are parts of the Filecoin ecosystem that assume there is always a single root and will break without it, so one tradeoff of this might be that those CAR files won’t work at all in Filecoin.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are parts of the Filecoin ecosystem

My understanding was that the FIL ecosystem only deals in "deterministic cars" and anything else is virtually unsupported. Whereas .car as a concept within ipfs & friends ecosystems is way less strict, and is essentially a "better block put interface". Is my approach to this wrong...?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That’s not entirely accurate.

Filecoin clients that use the regular deal flow only use deterministic car files. The offline dealflow, or really any deal that is added to the network without our online dealflow code, supports any CAR file. The retrieval market does not require deterministic CAR files in order to read out data. It does require that there is a single root though.

I was also told that there are other actors in the Filecoin ecosystem reading CAR files that also assume there is always a single root and that multiple roots, in addition to no roots, would be problematic. In this case, I’m talking about developers outside of Protocol Labs building applications and tooling on top of the network.


### Performance

Some considerations regarding performance:

* **Streaming**: the CAR format is ideal for dumping blocks via streaming reads as the Header can be loaded first and minimal state is required for ongoing parsing.
* **Streaming**: the CAR format is ideal for dumping blocks via streaming reads as the Header can be loaded first and minimal state is required for ongoing parsing. Describing partial graphs with a deterministic CAR may be more difficult as a streaming operation as the selector boundaries may not be known ahead of time.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems mis-matched dumping ( implies export from a stateful application into a static .car ) vs streaming reads ( implies reading a static .car ). Also I am still fuzzy on how verification can be guaranteed on a non-exhaustive selector... Anyway: perhaps:

Suggested change
* **Streaming**: the CAR format is ideal for dumping blocks via streaming reads as the Header can be loaded first and minimal state is required for ongoing parsing. Describing partial graphs with a deterministic CAR may be more difficult as a streaming operation as the selector boundaries may not be known ahead of time.
* **Streaming**: the **deterministic** CAR format is ideal for ingesting blocks via streaming reads as the Header can be loaded first and minimal state is required for ongoing parsing. Describing partial graphs with a deterministic CAR may be more difficult as a streaming operation as the selector boundaries may not be known ahead of time.


Deterministic CAR creation is not covered by this specification. However, deterministic generation of a CAR from a given graph is possible and is relied upon by certain uses of the format, most notably, [Filecoin](https://filecoin-project.github.io/specs).
Where a `deterministic` value of `true` is present in a CAR, the header is able to describe the entirety of the graph represented by the trailing blocks in the _Data_ section. Such a CAR may be reproduced, byte-for-byte, from a block source using the same roots and selectors.
Copy link
Contributor

@ribasushi ribasushi Mar 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is insufficiently precise. If I give you a selector saying "root + next level of nodes only" - the car files will not have any leaves, thus you do not describe the entire graph, nor can you fully cryptographically verify such a .car file offline... Or perhaps I am confusing the level of determinism we spoke about on Monday?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"entirety of the graph" is probably in the wrong here

able to describe the entirety of a graph from the roots, limited by the selectors

I don't think this makes any claims about being able to verify it offline. You could do that if you had a * selector but I don't know if there's any way to do that with more advanced selectors. Maybe I should just add text to that effect:

The completeness of such a graph may not be able to be verified without the source graph in the case of a restrictive selector

Described as an IPLD Schema:

```ipldsch
type CarRoot struct {
Copy link
Contributor

@mikeal mikeal Mar 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing I want to explore is making the entire CAR header block a completely valid Selector (or perhaps we need to define some notion of a “multi-selector”).

It would be pretty amazing to be able to take a selector block, drop it into a selector engine, and spit out a CAR file with a CID for the header block that matches the selector you passed.

This may not be worth it. For instance, the deterministic boolean would become problematic as it’s unlikely to be there if you just create a selector block.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that would require a selector to ship with a CID embedded in it wouldn't it? I don't think we have that kind of structure in selectors as they are today (afaik!)

@rvagg
Copy link
Member Author

rvagg commented Mar 20, 2020

@ribasushi

At that stage isn't it simpler to just have proper CBOR:

type DataFrame struct {
  Cid bytes
  Data bytes
}

This is an interesting proposal. There's two paths here -- cbor or dag-cbor. In both cases you you'd end up replacing varints with CBOR's ints describing lengths, prefixed with type data. If we used plain cbor we'd need to specify a smallest-representation rule (as per dag-cbor #236).

For plain cbor you'd only add an extra byte for an average CID (more for long hashes) over using a varint. For dag-cbor you'd get an additional 3 bytes to properly represent it (tag, plus length, plus weird 0x0 lead byte). The data block would depend on how much data there is but I'd expect that you'd only be adding 1 or possibly 2 bytes for average block sizes.

A benefit is that you simplify the decode for new languages. They just need a cbor parser. Another benefit is that we can lean on our existing materials (like schemas) making this more part of the IPLD family in a doc & spec sense rather than such a special snowflake.

The complication for efficient implementations is that you'd want to do this streaming and partial so we'd end up rewriting the basics of major type 2 cbor parsing, which, tbh, isn't too much beyond what we do now, but at least we have easy varint libraries to lean on.

Edit: will also require a type lead-in, list or map, list being the most efficient, so representation tuple but that's an extra byte at the front of each "frame".

@warpfork
Copy link
Contributor

Relatedly: there's now also work on a new format called "dar", which I think had a lot of the same concerns as discussed here: https://github.com/anorth/go-dar

@mikeal
Copy link
Contributor

mikeal commented Feb 22, 2021

I’m still considering all the details, but my initial reaction to DAR is that it’s a better approach. In trying to iterate on the existing CAR format we don’t have a lot of opportunities for efficiency gains.

DAR seems to do a nice job of introducing constraints we don’t have in CAR that then provide space savings and assurances while also adding in a few useful features.

Keeping in mind that we already have the existing CAR format, so use cases that cannot live within the constraints of DAR will still have a format they can rely, I think we may want to put off iterating on CAR in favor of DAR or something similar.

@mikeal
Copy link
Contributor

mikeal commented Feb 22, 2021

I’ll also add that, given how much we’ve already adopted the existing CAR format w/ recommended constraints (single root only) the utility of an iteration that ads the complexity of detecting and potentially supporting two versions of the CAR format may be worse than just adding an entirely new format. It don’t think we gain anything by having two version of CAR instead of two separate but complimentary formats.

All such considerations are deferred to the user of the CAR format and should be documented there as this specification does not inherently support determinism.
Duplicate blocks are strictly not valid in a deterministic CAR. The first instance of a block discovered by any root + selector combination will be present, all repeated occurances of the same block encountered during further graph traversal are omitted.

### Additional Considerations
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this may have meant to be an h2?

(I like using the two-line syntax variant for these because it's much more visually distinctive and easier to catch myself in typos like this, fwiw 🙃 )

@warpfork
Copy link
Contributor

What can we do with this PR?

My concerns and goals today would be:

  • get the open PR count in this repo down
  • document whatever we learned -- by whatever means necessary
  • bonus/stretch goal: slowly anneal towards understanding what's up in the discussion of {car} vs {carv2} vs {carbs} vs {dar}.
    • (figuring out how to land this PR should probably not get hung up on this though; bigger topic / multiple topics.)

I'm not actually sure how to wrangle this one, at a glance. First of all -- are we still actively working on it? (It's not that stale; I'm just on a close-a-thon today.)

Here are some options for moving forward I can think of (but maybe there are also other better ideas? I'm receptive):

  • Could we extract this content into an exploration report, and merge it thusly?
    • (always an options, but might not be fun in this case: there's a good amount of content here, and it's intermixed with other existing content, so this suggests a lot of rewriting.)
  • Should we merge it almost entirely as-is, but maybe add some disclaimers about how widely implemented the v2 is as of today?
    • (I'm not actually sure how widely implemented the v2 is! So someone else's input here will be critical :))
  • Is this a candidate for the "merge-ignore" change management strategy -- where we "merge" the commits (so they're referencable in git history) but don't actually apply the diffs to the content as it appears in master?
    • (this is a very non-committal approach. Doesn't mean we can't resume the work later; if we do, we just cherry-pick the content and keep going. It just results in fewer open branches in the meanwhile.)
  • ... or is this actually still actively in flight, and I should hush myself on pushing for merge?

@rvagg
Copy link
Member Author

rvagg commented Apr 21, 2021

Not being actively worked on, should probably not be merged, this really is more like an exploration report than anything else. Focus has moved on to ways to augment CARv1 (CARBS, etc.) and entirely different approaches to a CARv2 (i.e. DAR). Nobody is actively exploring improving the existing format in-place and I don't see work on the horizon. Although this PR and discussion does contain some potentially useful tidbits for any future work to learn from.

@rvagg
Copy link
Member Author

rvagg commented May 6, 2021

I just wrote this elsewhere and wanted to record it somewhere more permanent, it clarifies the versioning statements in OP:

The problem we have with iterating on versions and doing any kind of successful differentiation from CARv1 is the instability of the leading bytes. The header in CARv1 is: [varint][DAG-CBOR({roots:[CID,...],version:1})]. And that initial varint is not reliable in any way except that it's a varint. The DAG-CBOR bit following it is a bit more reliable because it starts with something like: MAP(2){STRING(5,"roots"),ARRAY(... and it breaks down from there. So that gives you some form of magic bytes that you'll most likely see, 0xa265726f6f7473 but that's after the varint.
The risk that this presents is that if you want to have clear differentiation then your leading magic bytes need to be clearly distinguishable from anything that suggests CARv1.

In this PR I suggested just making the magic bytes 0x0aa16776657273696f6e02 because it's [varint(10)][DAG-CBOR({version:2})] and the existing CAR readers can read that and bork at "oh, this is version 2, I don't know how to deal with this (it also has missing roots, but we could add in an empty roots array to be safer for decoders that assume it's there). Then version detection is much more straightforward, you could read the first 11 bytes and very safely determine whether you have a v2, then fall back to v1 parsing. What's more, there's a straightforward path to v3, it's just 0x0aa16776657273696f6e03, so just that 11th byte is different, nice and stable and CARv1 decoders will error meaningfully ("Don't know how to decode version 3").

0x0aa16776657273696f6e02 is first 0x0a which is just uint(10) because varints this small are just ints. Then followed by 10 bytes of CBOR:

a1                                                # map(1)
  67                                              #   string(7)
    76657273696f6e                                #     "version"
  02                                              #   uint(2)

Of course we could just abandon version sniffing from leading-bytes, but that's not very multiformats is it!

/cc @raulk

masih added a commit to ipld/go-car that referenced this pull request Jun 11, 2021
Define the magic CAR v2 prefix proposed in ipld/specs#248 and assert
that it remains a valid CAR v1 header, along with other tests that
verify the expected length, content and version number.

Define basic structs to represent CAR v2 header, along with placeholder
for the future "characteristics" bitfield, and the optional padding
between CAR v1 dump and index added to the end of a CAR v2 archive.

Relates to:
- ipld/specs#248 (comment)
@raulk
Copy link
Member

raulk commented Jun 14, 2021

@rvagg Makes sense to me. Question:

  • Is it possible for CARv2 and subsequent versions to extend that initial header structure for additional purposes, as long as they preserve the version field. Or are we constraining this initial header and convicting it to ever only being used as a version discriminator, leaving it up to the version itself to specify the header?
    • e.g. CARv2 introduces the concept of "characteristics" that express traits about the CAR and the DAG, e.g. whether an index follows, whether the DAG is order depth-first, whether it's deduplicated, etc.

@willscott
Copy link
Member

My understanding is that the initial magic bytes is a signal of 'this is not a carv1' (and is in fact carv2) - and indicates to expect a carv2 metadata header with the characteristics to follow. If we end up wanting features that need to change that second header in a non-compatible way, we would consider changing the magic bytes at the beginning to indicate we're now on carv3

@raulk
Copy link
Member

raulk commented Jun 14, 2021

Understood. Comparing the 0x00 approach with the magic number, what the latter buys is a nicely readable error message at the expense of 10 extra bytes, right?

@willscott
Copy link
Member

Understood. Comparing the 0x00 approach with the magic number, what the latter buys is a nicely readable error message at the expense of 10 extra bytes, right?

and a static 8 byte magic header for mime type detection.

@rvagg
Copy link
Member Author

rvagg commented Jun 15, 2021

It might be best if we just view the CARv1 header as a minor mistake in that it didn't lead with a version (like we do with everything else multiformats style) but baked it into the block itself, which leaves little wriggle room for changing it around. Sorting order in DAG-CBOR gives us version coming last in most cases unless we want to have longer property names. We could break that rule for this to make version sniffing easier, but it all gets very messy very quickly. So if we just treat <varint length>dag-cbor({version:X}) as our multiformats-style leading magic bytes then we get to do whatever we want after that and can break from version to version (e.g. version 3 could be 0x0aa16776657273696f6e03 etc.). It's an unfortunately long series of magic bytes but we won't be alone and maybe we can be OK with that and write it off as an accident of history?

@mikeal
Copy link
Contributor

mikeal commented Jun 15, 2021

We’ve considered the header a mistake for some time.

If we’re going to break backwards compat on the header then we could just drop the header and make the format begin with a registered multicodec identifier followed by the cid + block data for the “new header” format. That would be “better” in every respect other than backwards compatibility.

@willscott
Copy link
Member

We’ve considered the header a mistake for some time.

If we’re going to break backwards compat on the header then we could just drop the header and make the format begin with a registered multicodec identifier followed by the cid + block data for the “new header” format. That would be “better” in every respect other than backwards compatibility.

I think we end up with much of the same complexity that we've spent the last month hashing out here. I'm hesitant to go back to redesign now since that was considered as an option previously, but rejected when we decided to do the least work we could to embed the carv1 as-is with an attached index.

We've gotten to a fairly straight forward fixed-length header prepended to a carv1. We get versioning. we can signal that an index is appended. that's what we need now. The goal was not to envision the format we actually want, which we've explicitly pushed back to carv3.

@rvagg rvagg closed this Jun 24, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants