Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow feature detection for io.ReaderAt on Bytes nodes #372

Merged
merged 3 commits into from
Mar 3, 2022

Conversation

willscott
Copy link
Member

We can handle large bytes by allowing Bytes kinded nodes to optionally support
the ReadAt method thereby implementing the ReaderAt interface.

This is proposed over a distinct the reader / seeker implementation because the ReaderAt interface is stateless - it does not need the addition of a position cursor to be tracked in the node between calls, while still allowing for partial reads of the node contents.

… nodes

We can handle large bytes by allowing Bytes kinded nodes to optionally support
the `ReadAt` method thereby implementing the ReaderAt interface.

This is proposed over a distinct the reader / seeker implementation because the ReaderAt interface is stateless - it does not need the addition of a position cursor to be tracked in the node between calls, while still allowing for partial reads of the node contents.
@willscott willscott requested a review from warpfork March 1, 2022 19:23
@willscott
Copy link
Member Author

Work i see in this repo for fully implementing this proposal if it gets general approval:

  • Add comment about optional support for the interface at the main Node AsBytes method
  • Add ReadAt to the bind node implementation
  • Make use of the interface when available in codec Encoding

@willscott willscott requested a review from mvdan March 1, 2022 19:26
@willscott
Copy link
Member Author

In our one encounter with large bytes so far, we chose io.Reader and kept an extra offset for state in the bytes. I think this proposal is primarily asking us to decide if we're okay with a fast path support for the ReadAt method, which seems like plausibly the most common interface that would be possible to support.

@mvdan
Copy link
Contributor

mvdan commented Mar 1, 2022

Before: #319

I lean towards ReadSeeker over ReaderAt because io.Reader is fairly common, and because ReaderAt doesn't give any information about the total number of bytes - with seeking, you can seek all the way to the end to find out. Note that, currently, the Length method isn't valid for Bytes nodes.

On the flip side, ReaderAt is slightly easier to implement, and one could always implement an io.Reader atop an io.ReaderAt. I think I see your point about altering state when reading/seeking as a potential problem, but beyond "IPLD nodes are immutable", is that really a problem? It wouldn't affect what the node would contain or how it would be encoded; one could always seek back to the beginning.

For the sake of not keeping the read/seek offset state as part of the node itself, that's why in the other thread I suggested AsLargeBytes() (io.ReadSeeker, error). The other advantage of this mechanism is that an implementation can support two concurrent readers.

There is one other thing to keep in mind about ReaderAt:

Clients of ReadAt can execute parallel ReadAt calls on the same input source.

This would mean that any implementation, including ADLs, would need to support concurrent calls.

@mvdan
Copy link
Contributor

mvdan commented Mar 1, 2022

I should also warn against having a node optionally implement an io interface directly, without a method to obtain that implementation instead. With AsLargeBytes() (io.ReadSeeker, error), you can decide at run time whether or not you're really "large bytes" or not, whereas you must decide what the set of methods on a type is at compile time. This matters for any and kinded unions; AsLargeBytes will be defined for those, but will error if the kind isn't actually bytes. You could potentially emulate this by having ReadAt calls return a sentinel error, but then checking whether a node value or type implements io.ReaderAt is almost pointless, as it doesn't give you a good answer.

@willscott
Copy link
Member Author

  • Having to calculate full size may also be expensive, it would be nice to be able to support offset reads without needing to sign up for that (maybe you don't support seek to the end, i suppose)
  • The point on the clearly defined method makes sense. An AsLargeBytes seems reasonable.
  • Do we not think supporting concurrent calls is desirable / causes undue burden versus read seeker?

@mvdan
Copy link
Contributor

mvdan commented Mar 1, 2022

  • (maybe you don't support seek to the end, i suppose)

Right, you could indeed error when asked to seek relative to the end.

Do we not think supporting concurrent calls is desirable / causes undue burden versus read seeker?

With the understanding that we need an AsT method either way, the difference is that ReaderAt requires concurrent calls to work, whereas ReadSeeker doesn't. Given that most users would presumably grab their own "reader" via the method, and not share it with other goroutines, then I imagine requiring concurrency would be unnecessary, and that requirement could make implementing a "big bytes ADL" that loads on-demand harder - one suddenly needs to worry about data races :)

All this said, it's still just a "leaning towards ReadSeeker" for me and not a strong feeling. I think my only ask is that, if we do go for ReaderAt, then we should leave space for implementers to provide the total length, such as by extending the existing Length method.

@willscott
Copy link
Member Author

I guess the main candidate here at the moment is the unixfsnode ADL. I think i see what a read-seeker would look like there, so i'll prototype that library as a candidate implementation next.

@warpfork
Copy link
Collaborator

warpfork commented Mar 1, 2022

I have read this, and believe you two have more detailed understanding than I've got loaded anywhere near to my "hot cache", and I trust you :)

This sounds broadly directionally correct to me. Between the two specific interfaces I don't have a strong opinion.

FWIW, having one alloc to return whatever-interface-it-is sounds probably acceptable to me. Iterators already have that same cost threshhold in most implementations. If we wanted to optimize heavily: for both iterators, and for this, we could embed one of the thing in the main structure, wrap it in some sync primitives, and do allocs iff more than one is requested -- at the cost of larger resident memory. That hasn't seemed necessary yet, but seems plausible as future work.

@mvdan
Copy link
Contributor

mvdan commented Mar 1, 2022

That sounds reasonable, but I also think that one alloc to be able to consume a "big bytes" node will likely be negligible. At that point, you're going to be worrying about bigger factors like copying the bytes or reusing buffers correctly.

@willscott
Copy link
Member Author

it's going to be hard to add AsLargeBytes to the core ipld.node interface because it'll gate updates of ipld-prime to a release where all derived gen'd code has been re-built to include the additional method, right?

@mvdan
Copy link
Contributor

mvdan commented Mar 2, 2022

We could declare it as an optional interface either for some time, or forever. I think either of those will work as long as "imlements the interface" is not understood as "can be used as large bytes". But in any case, we can start with declaring the extra interface for now, and later decide if enough implementations have transitioned to it that we can add the method straight to Node. For example:

type LargeBytesNode interface {
    Node
    AsLargeBytes() (io.ReadSeeker, error)
}

@willscott
Copy link
Member Author

the downside is that there are two checks on each access - first a check of whether the interface is implemented, and then the error check when attempting to get the io.Reader.
That said, this seems simplest for an incremental rollout

@mvdan
Copy link
Contributor

mvdan commented Mar 3, 2022

Indeed, so I'm inclined to believe that we could move the method to ipld.Node sometime down the line - perhaps in six months or so.

@willscott
Copy link
Member Author

@mvdan ready for review. Should we also think about a way to make a bytes node from an io.Reader as an extension to the builder side API?

I'm inclined to defer for now, but it seems useful to figure out for efficiency of copies at some point.

@mvdan
Copy link
Contributor

mvdan commented Mar 3, 2022

I am happy to defer on that too, for now. I think building "large bytes" nodes will fall into two categories:

  1. The user doesn't care about scalability or efficiency. They can use SetBytes.
  2. The user does care about scalability, so they will use some ADL depending on their needs, which will have its own efficient and tailored "build" API.

Copy link
Contributor

@mvdan mvdan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: the docs talk about an io.Reader when it's actually ReadSeeker

@mvdan
Copy link
Contributor

mvdan commented Mar 3, 2022

We could also document that implementations may refuse to support certain kinds of seeks, like SeekEnd.

@willscott willscott merged commit c6f8188 into master Mar 3, 2022
@willscott willscott deleted the feat/bytereaderat branch March 3, 2022 12:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants