Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ADL/single-node-view of a full unixFS file. #14

Merged
merged 10 commits into from
Nov 4, 2021
Merged

Conversation

willscott
Copy link
Collaborator

@willscott willscott commented Nov 2, 2021

Relevant for @warpfork:

  • defines a StreamableByteNode interface for extending a Kind_Bytes ipld.Node to also be usable as an io.Reader

Remaining work:

  • Hook up NewUnixFSFile to the top level Reify when files are encountered
  • support cidv0 'non-raw' leaf nodes
  • Roundtrip tests between the file builder and this decoder/ADL
  • Tests / fixtures of ipfs-style file extraction

@warpfork
Copy link
Member

warpfork commented Nov 2, 2021

EXCITING

Copy link
Member

@warpfork warpfork left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is pretty grand and I'm excited about high level interfaces like this emerging.

I wonder if we'll eventually want something like this generalized on datamodel.Node. Perhaps a way to get an io.ReadSeeker. (Whatever that would be: also in an optional way, of course, because we don't want to break existing code that implements that interface. Feature detection would be important.) Would love @mvdan's opinions on the long game there.

Despite mentioning generalizing, I also don't see a major reason we can't merge this sooner than later; we can worry about sussing out and finalizing a generalized form later.

file/file.go Outdated
}

func (s *shardNodeFile) Prototype() ipld.NodePrototype {
return basicnode.Prototype__Bytes{}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should return something more self-descriptive.

What exactly "self-descriptive" means here, I don't know -- it might be something that has to be somewhat best-effort. Maybe it should peek at the root blocks of the thing and try to see if there's any sharding parameters, and return something with those if possible? If it can't infer all the parameters that were used creating this ADL the first time, still returning a best guess that's at least the same ADL but with default parameters is probably better than nothing.

The overall purpose of .Prototype() is so that things like traversal.FocusedTransform (or the traversal.TransformFn that the user provides, if this is the leaf thing being replaced) can regenerate things correctly while making a copy-on-write update. So if it forgets that something is an ADL we get very different behavior in a way that's probably not desired!

file/file.go Outdated
return 0, err
}
if pbl, ok := lnk.(dagpb.PBLink); ok {
target, err := s.lsys.Load(ipld.LinkContext{Ctx: s.ctx}, pbl.Hash.Link(), basicnode.Prototype.Any)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could probably use a prototype that's specific to dag-pb structures, since we know it's that all the way down in unixfsv1? (Or possibly a separate prototype for bytes, sometimes, if the CID indicates raw mode? I'm not familiar with exactly how that part of the protocol works; maybe the dag-pb structures already account for that.)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaf nodes can potentially be bytes, depending on the type of file encoding. We should know off of the cid codec.
I wonder if there's a way to avoid having this type of logic every time the link system is used?

file/file.go Outdated
return 0, err
}

asFSNode, err := NewUnixFSFile(s.ctx, target, s.lsys)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An interesting way of recursing. This will work correctly, but may cause long chains of Read calls delegating to Read calls delegating to Read calls if it's a deeply recursive structure for a large file, right?

(I'd have no objection to merging that as a first pass anyway. It's probably dominated by IO costs, etc, and it would be something easy to revisit later without external API change if necessary.)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it only does it as the read reaches subsequent nodes that need to be accessed, so the tree is loaded as it's accessed.
I can probably do a bit better at deleting references to the multi-reader upon hitting EOF to help the garbage collector clean up better for big files.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm more worried about keeping the context associated to the New event with the node for it's whole lifetime. I wonder what the right way to manage that context lifetime is.

@willscott
Copy link
Collaborator Author

The code implementation should be complete now. remaining work is to find a test fixture to help validate the ADL over cidv0 non-raw leaf files.

Copy link
Contributor

@mvdan mvdan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the interface API exposed here is good enough for the foreseeable future. We might want to expose this datamodel.Node interface superset in go-ipld-prime at some point, but we also don't need to.

file/file.go Outdated
return newWrappedNode(substrate)
}

return &shardNodeFile{ctx, lsys, substrate, false, nil}, nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use keyed composite literals for readability? :)

file/file.go Outdated
return nil, err
}
lli := links.ListIterator()
if lli.Done() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you could just do if links.Length() == 0, I think

file/file.go Outdated

type singleNodeFile struct {
ipld.Node
ptr int
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps offset or read, since this is not really a pointer

}

func (f *singleNodeFile) Read(p []byte) (int, error) {
buf, err := f.Node.AsBytes()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we worried about repeated calls to AsBytes? it won't be a problem for basicnode.Bytes, but I don't think AsBytes is guaranteed to be trivially expensive. Perhaps we could call it on the first Read call, and hold onto it in a field until we hit EOF.

file/wrapped.go Outdated
// an empty degenerate one.
return &singleNodeFile{
Node: basicnode.NewBytes(nil),
ptr: 0,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: these can be omitted

file/shard.go Outdated
if err != nil {
return 0, err
}
if pbl, ok := lnk.(dagpb.PBLink); ok {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm, why do we use LookupByString("Links") above, and then a type assertion directly on a dagpb type here? seems like we should consistently be flexible to allow any IPLD node satisfying the same shape, or consistently assume dagpb types for simplicity/speed. Probably the former, to avoid unnecessary ties on go-codec-dagpb.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants