
need API for handling large bytes nodes #319

Open · warpfork opened this issue Dec 27, 2021 · 2 comments
Labels
- effort/days: Estimated to take multiple days, but less than a week
- exp/intermediate: Prior experience is likely helpful
- kind/architecture: Core architecture of project
- kind/enhancement: A net-new feature or improvement to an existing feature
- need/analysis: Needs further analysis before proceeding
- P1: High: Likely tackled by core team if no one steps up

Comments

warpfork (Collaborator) commented Dec 27, 2021

The Node interface needs to provide some way to handle "big" bytes: values bigger than would readily fit into a single []byte. That means some sort of io.Reader implementation, and probably an io.ReadSeeker.

This should make it possible to work even with data that wouldn't fit entirely in memory. For example, we should be able to use this for ADLs that store huge ranges of bytes across multiple blocks of actual storage, such as the FBL ADL.

A GetBytesReadSeeker() (io.ReadSeeker, error) method may be sufficient. Are there any other operations we expect to need?

The current AsBytes() ([]byte, error) method should also remain (to avoid breaking existing code), but newly written generic code would be encouraged to prefer GetBytesReadSeeker so that it works regardless of data size.

Correspondingly, NodeAssembler should be able to create a new node of bytes by copying in from an io.Reader, or by returning an io.WriteCloser.

We may be able to provide both of these features via feature detection rather than making them a required new part of Node and NodeBuilder. If we do, we should make sure to provide good helper functions that offer the streaming-style interfaces regardless of whether the optional features are present, so that this doesn't become something user code has to be bothered about.

Stretch goal: think about having variants of these features which state that byte slices used for intermediate handling may be retained by the implementation, so that it can attempt zero-copy handling of data. (E.g. I can imagine that one might want to read some data off the filesystem; "chunk" it; and then it might be nice to pass off the chunks in a way that eventually ends up in a writev([][]byte,...)-style syscall.)

mvdan (Contributor) commented Jan 6, 2022

Another idea, to keep the API more consistent long-term, would be to deprecate the existing AsBytes and add a method that returns an io.ReadSeeker. Any implementations which can cheaply supply a []byte could do so by returning a value that also has a Bytes() []byte method for callers to feature-detect. (Note that no stdlib type fits directly: *bytes.Buffer has Bytes() but doesn't implement Seek, while *bytes.Reader seeks but has no Bytes(), so a small wrapper type would be needed.)

I don't think we can do this sanely without breaking downstreams, but I thought it would be worth mentioning as an option.

Adding optional interfaces/methods seems fine. If so, big thumbs up to helpers to make handling them easier. For example, one can imagine a "reader from bytes node" function which would always return an io.ReadSeeker, even if a node only implements AsBytes, by wrapping the slice with bytes.NewReader. This would make it easy to support both small and large bytes nodes, as long as one only needs to do something high-level like copy all the bytes to a file on disk.

A few thoughts about the API:

  1. I disagree with the name GetBytesReadSeeker. The Get part is redundant, at least. I'd also like for us to at least try to adhere to the AsT() (T, error) signature; perhaps AsLargeBytes() (io.ReadSeeker, error).

  2. Maybe we can do away with the error, given that today, the error value is mostly used to signal "you called me when I'm not the right kind". But perhaps not, because of ADLs having extra logic.

  3. ReadSeeker is likely the right interface. Just FYI, its only quirk is that getting the total length is cumbersome. Assuming you're at the start, you seek all the way to the end, which gives you the new offset (the length). Then you may need to seek back to the start to actually consume the bytes. If we think "length" will be a very common use case, we could supply a helper for this, or perhaps make our own interface which is ReadSeeker plus something like Len() int64.

rvagg (Member) commented Jan 7, 2022

As much as a unified API would be beautiful, I think the API pain is going to make it too annoying. The common case is that you expect and get a single, cheap byte slice, so AsBytes() is reasonable. But if we took that away, returned a reader (of some kind), and forced the user either to consume the reader or to feature-detect a Bytes() method on it, then it's just going to add to the existing verbosity of the API.

Maybe this is a case where you have to leave it to the user to apply what they know about the context of their data to the way they consume the API. Something like - we recommend you use AsBytesReader() unless you are sure that you're only ever going to encounter byte slices that come from a single, underlying block.

And re concern 3 for ReadSeeker: that's probably a good thing in our situation, because for the multi-block case like FBL, we can only ever give you a hint about the length, if anything; no certainty. We probably should not be suggesting to the user that there is a meaningful Len() value going to come out of this thing. Imagine an FBL that spans an entire movie video file that you have to collect via the network: any Len() for that will either be a guess (or based on an encoded hint, which is not certain) or require you to download all of the blocks to provide it. I say leave off Len().
