
need API for handling large bytes nodes #319

Open · warpfork opened this issue Dec 27, 2021 · 2 comments
Labels
- effort/days: Estimated to take multiple days, but less than a week
- exp/intermediate: Prior experience is likely helpful
- kind/architecture: Core architecture of project
- kind/enhancement: A net-new feature or improvement to an existing feature
- need/analysis: Needs further analysis before proceeding
- P1: High: Likely tackled by core team if no one steps up

Comments

warpfork (Collaborator) commented Dec 27, 2021

The Node interface needs to provide some way to handle "big" bytes: values bigger than would readily fit into a single []byte. That means some sort of io.Reader implementation, and probably an io.ReadSeeker.

This should make it possible to work even with data that wouldn't fit entirely in memory. For example, we should be able to use this for ADLs that store huge ranges of bytes across multiple blocks of actual storage, such as the FBL ADL.

A GetBytesReadSeeker() (io.ReadSeeker, error) method may be sufficient. Are there any other operations we expect to need?

The current AsBytes() ([]byte, error) method should also remain (to avoid breaking existing code), but newly written generic code would be encouraged to prefer GetBytesReadSeeker so that it works regardless of data size.

Correspondingly, NodeAssembler should be able to create a new node of bytes by copying in from an io.Reader, or by returning an io.WriteCloser.

We may be able to provide both of these features via feature detection rather than making them a required new part of Node and NodeBuilder. If we do, we should make sure to provide good helper functions that offer the streaming-style interfaces regardless of whether the optional features are present, so that this doesn't become something user code has to be bothered about.

Stretch goal: think about having variants of these features which state that byte slices used for intermediate handling may be retained by the implementation, so that it can attempt zero-copy handling of data. (E.g. I can imagine that one might want to read some data off the filesystem; "chunk" it; and then it might be nice to pass off the chunks in a way that eventually ends up in a writev([][]byte,...)-style syscall.)

mvdan (Contributor) commented Jan 6, 2022

Another idea, to keep the API more consistent long-term, would be to deprecate the existing AsBytes and add a method that returns an io.ReadSeeker. Any implementations which can cheaply supply a []byte could do so by returning a value that also has a Bytes() []byte method for callers to feature-detect. (Note that no stdlib type fits directly: *bytes.Buffer has Bytes() but doesn't implement Seek, while *bytes.Reader seeks but has no Bytes(), so a small wrapper type would be needed.)

I don't think we can do this sanely without breaking downstreams, but I thought it would be worth mentioning as an option.

Adding optional interfaces/methods seems fine. If so, big thumbs up to helpers to make handling them easier. For example, one can imagine a "reader from bytes node" function which would always return an io.ReadSeeker, even if a node only implements AsBytes, by wrapping the slice with bytes.NewReader. This would make it easy to support both small and large bytes nodes, as long as one only needs to do something high-level like copy all the bytes to a file on disk.

A few thoughts about the API:

  1. I disagree with the name GetBytesReadSeeker. The Get part is redundant, at least. I'd also like for us to at least try to adhere to the AsT() (T, error) signature; perhaps AsLargeBytes() (io.ReadSeeker, error).

  2. Maybe we can do away with the error, given that today, the error value is mostly used to signal "you called me when I'm not the right kind". But perhaps not, because of ADLs having extra logic.

  3. ReadSeeker is likely the right interface. Just FYI, its only quirk is that getting the total length is cumbersome. Assuming you're at the start, you seek all the way to the end, which gives you the new offset (the length). Then you may need to seek back to the start to actually consume the bytes. If we think "length" will be a very common use case, we could supply a helper for this, or perhaps make our own interface which is ReadSeeker plus something like Len() int64.

rvagg (Member) commented Jan 7, 2022

As much as a unified API would be beautiful, I think the API pain is going to make it too annoying. The common case is that you expect and get a single, cheap byte slice, so AsBytes() is reasonable. But if we took that away, returned a reader (of some kind), and forced the user either to consume the reader or to feature-detect a Bytes() method on it, then it's just going to add to the existing verbosity of the API.

Maybe this is a case where you have to leave it to the user to apply what they know about the context of their data to the way they consume the API. Something like - we recommend you use AsBytesReader() unless you are sure that you're only ever going to encounter byte slices that come from a single, underlying block.

And re concern 3 for ReadSeeker: that's probably a good thing in our situation, because for the multi-block case like FBL, we can only ever give you a hint about the length, if anything; no certainty. We probably should not be suggesting to the user that there is a meaningful Len() value going to come out of this thing. Imagine an FBL that spans an entire movie video file that you have to collect via the network: any Len() for that will either be a guess (or based on an encoded hint, which is not certain) or require you to download all of the blocks to provide it. I say leave off Len().
