Clarify WriteRequest.write_offset spec for compressed blobs #193

Merged on Apr 22, 2021 (9 commits)

Conversation

nodirg (Contributor) commented on Apr 19, 2021:

The CAS compression spec currently states that WriteRequest.write_offset is the offset in the uncompressed blob, but it does not match Google's implementation on the client or server side.

The original intention was to encode each chunk separately, but having the entire stream compressed seems better, also because it allows reading from the data source (e.g. disk) and writing to the CAS to happen concurrently.

The google-cla bot added the "cla: yes" label (pull requests whose authors are covered by a CLA with Google) on Apr 19, 2021.
nodirg (Contributor, Author) commented on Apr 19, 2021:

@ramymedhat


EricBurnett (Collaborator) commented:

I'm not sure which Google implementations you're referring to, but it being the offset into the uncompressed data is by design and correctly specified, and at least some Google implementations (e.g. RBE) get this right.

This has been implemented incorrectly in a few cases, e.g. bazelbuild/remote-apis-sdks#273 fixed using the wrong offset in the SDK. Please flag anything still doing it wrong (internally if it's a non-public implementation) and we'll get those implementations fixed too.

> The original intention was to encode each chunk separately, but having the entire stream compressed seems better, also because it allows reading from the data source (e.g. disk) and writing to the CAS to happen concurrently.

Most CAS implementations don't support resumable uploads today so you can construct them however you wish (including streaming a compressed file from disk, compressing+concatenating chunks, or compressing on the fly), since you'll have to start over regardless if the stream gets broken. But yeah, it is a weakness of this scheme that resuming a partial upload of a large, compressed-on-disk file may require decompressing/recompressing the whole thing. That's a little unfortunate, but an uncommon enough corner case that I don't think it's worth solving for, and making the offsets be into the compressed data is problematic in its own right. Happy to expand if you need more information.

nodirg (Contributor, Author) commented on Apr 19, 2021:

https://github.com/bazelbuild/remote-apis-sdks/blob/40d1266c08bdd688d691a2e4090a01d498318030/go/pkg/chunker/chunker.go#L206 increases the offset by the length of the compressed chunk. That offset is then used as WriteRequest.WriteOffset: https://github.com/bazelbuild/remote-apis-sdks/blob/40d1266c08bdd688d691a2e4090a01d498318030/go/pkg/client/bytestream.go#L49
This implementation is used in production.

I am working on a faster implementation of the uploader. I tried the following scheme:

  • Read a chunk from the uncompressed reader
  • Compress it with zstd.Encoder.EncodeAll()
  • Pass the compressed bytes as WriteRequest.Data and use the uncompressed offset as WriteRequest.WriteOffset

but the server responded with an error message about an invalid offset. I can provide the code, but it will take me some time.
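
For reference, a minimal sketch of the per-chunk scheme described above, assuming the github.com/klauspost/compress/zstd encoder and the generated google.golang.org/genproto/googleapis/bytestream types; the already-opened stream, resource name, and function name are placeholders rather than the actual uploader code:

```go
package casupload

import (
	"github.com/klauspost/compress/zstd"
	bspb "google.golang.org/genproto/googleapis/bytestream"
)

// uploadPerChunk compresses each chunk independently with EncodeAll and
// sends it with the *uncompressed* offset as WriteOffset -- the scheme
// the server rejected with an invalid-offset error.
func uploadPerChunk(stream bspb.ByteStream_WriteClient, resourceName string, chunks [][]byte) error {
	enc, err := zstd.NewWriter(nil) // nil writer: only EncodeAll is used
	if err != nil {
		return err
	}
	defer enc.Close()

	var uncompressedOffset int64
	for i, chunk := range chunks {
		compressed := enc.EncodeAll(chunk, nil) // each call emits a complete zstd frame
		req := &bspb.WriteRequest{
			WriteOffset: uncompressedOffset, // offset in uncompressed bytes
			Data:        compressed,
			FinishWrite: i == len(chunks)-1,
		}
		if i == 0 {
			req.ResourceName = resourceName
		}
		if err := stream.Send(req); err != nil {
			return err
		}
		uncompressedOffset += int64(len(chunk))
	}
	// The caller would then call stream.CloseAndRecv() to get the WriteResponse.
	return nil
}
```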

ramymedhat commented:

@rubensf

EricBurnett (Collaborator) commented:

Ahh. Hmm. I confused read and write there, which was a mistake. In particular:

  • The read offset is specified once, at the beginning of the stream.
  • The write offset is specified at the beginning of the stream and on every subsequent message.
  • The write offset in the bytestream spec is defined as:

    // In the first WriteRequest of a Write() action, it indicates
    // the initial offset for the Write() call. The value must be equal to
    // the committed_size that a call to QueryWriteStatus() would return.
    //
    // On subsequent calls, this value must be set and must be equal to
    // the sum of the first write_offset and the sizes of all data bundles
    // sent previously on this stream.
That's more awkward than I remembered :(. Apologies for the confusion! And more to the point, I only confirmed that the RBE read implementation matches the spec, but if you're telling me the write implementation does not, that's unfortunate but good to know.

I'm...not actually sure what the right resolution is in this case. Making the write path use compressed offsets and the read path use uncompressed offsets would be easy I suppose, and I think more correctly matched to a strict reading of the ByteStream spec where it specifies the sum of the sizes of all data bundles sent on the stream. On the other hand, that's inconsistent with Read, and arguably problematic for anyone implementing resumable uploads - where it's not expected they'd commit the incoming bytes to storage byte-for-byte equally, and so wouldn't necessarily be able to handle resumptions of the form "the next bytes in a compressed stream, not aligned to any particular chunk boundary or interpretable without the exact bytes of the broken stream on hand". (Or to put it another way, I'm still pretty sure QueryWriteStatus() should return the number of post-decompression bytes successfully persisted.)

"Correct" in the ideal case would be to have the write offset in the initial chunk be the offset in the uncompressed bytes, but for subsequent chunks to be specified in terms of the sum of the lengths of the data bundles for validation purposes...except there's only one field in ByteStream and so this'd have you summing compressed and uncompressed bytes together today. Ugh.

rubensf (Contributor) commented on Apr 20, 2021:

Thinking about this has been quite a brain twist! But for the most part I concur with what Eric wrote.

Any scenario where the subsequent write_offsets refer to the uncompressed size would inherently disobey the spec (...the sum of the first write_offset and the sizes of all data bundles sent previously on this stream...), and thus be a no-go.

Changing the first write_offset to refer to the compressed size, however, would require QueryWriteStatus to return the compressed size, and doing that would effectively require us to expose one more API where the client can check the compressed -> uncompressed mapping. That sounds like more work than it's worth. FWIW, I don't think this would mean "not aligned to any particular chunk boundary", since the server is the one specifying the committed_size for the client to use, and I'm confident the server will prefer an interpretable boundary as the committed_size :)

Additionally, the current implementation by RBE's client and server technically adheres to the spec already: subsequent calls are already set to the first write_offset plus the size of all data bundles sent. The easiest solution here is likely to clarify the RE spec to explain this. Granted, I'm likely biased since I worked on the SDK implementation.

Of course, we could always ignore what ByteStream specifies and establish our own expectations, but in that scenario we might as well write our own REByteStream. That's not unreasonable, but I don't think the pros outweigh the cons.

rubensf (Contributor) commented on Apr 20, 2021:

> The original intention was to encode each chunk separately, but having the entire stream compressed seems better, also because it allows reading from the data source (e.g. disk) and writing to the CAS to happen concurrently.

Also, I'm not sure what you mean here. IIUC, writing to the CAS is sequential, and whether write_offset refers to the uncompressed or the compressed offset shouldn't affect potential performance.

FWIW, the API doesn't specify anything regarding chunks for writes; the SDK uses chunks as an implementation detail. In any case, zstd has frames, which are effectively another abstraction for chunks, each compressed as a single unit. The SDK also already reads from disk and compresses data concurrently, though SDK-specific discussions shouldn't happen on this thread.
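
As a small illustration of the frame point (assuming the github.com/klauspost/compress/zstd package; the helper name is made up): each EncodeAll call emits a self-contained zstd frame, and a streaming decoder reads a concatenation of such frames back as one stream.

```go
package zstdframes

import (
	"bytes"
	"io"

	"github.com/klauspost/compress/zstd"
)

// roundTripFrames compresses each chunk into its own zstd frame,
// concatenates the frames, and decodes the result as a single stream.
func roundTripFrames(chunks [][]byte) ([]byte, error) {
	enc, err := zstd.NewWriter(nil) // nil writer: only EncodeAll is used
	if err != nil {
		return nil, err
	}
	defer enc.Close()

	var compressed bytes.Buffer
	for _, c := range chunks {
		compressed.Write(enc.EncodeAll(c, nil)) // one complete frame per chunk
	}

	dec, err := zstd.NewReader(&compressed)
	if err != nil {
		return nil, err
	}
	defer dec.Close()
	return io.ReadAll(dec) // yields the original chunks, concatenated
}
```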

nodirg changed the title from "Fix incorrect protobuf spec" to "Clarify WriteRequest.write_offset spec for compressed blobs" on Apr 20, 2021.
nodirg (Contributor, Author) commented on Apr 20, 2021:

Thanks. I've clarified the spec. PTAL.

@rubensf you are right. The difference in the spec doesn't require reading and writing to be sequential.

mostynb (Contributor) left a comment:

This complexity is caused by ByteStream not quite fitting our use case exactly, and our desire for an incremental improvement to uncompressed ByteStream. For REAPIv3, IMO we should replace ByteStream with our own protocol (there are a few people probably interested in working on replacements, but I don't know the time scale for REAPIv3, so maybe it's too early to start).

To summarise the reasoning behind the current design:

  • The initial write_offset must refer to an offset in the uncompressed blob, because we have no way to specify a specific compressed version of the blob. (In most cases, I expect server implementations to only support 0 here anyway.)

  • Subsequent write_offset values refer to the amount of compressed data received so far (plus the initial uncompressed write_offset), because the ByteStream spec requires it, but also because it would be less useful for the server to check uncompressed write offsets. If we count compressed data here, the server can do some simple checks that it agrees with the client on how much compressed data has been sent. On the other hand, if we were to count uncompressed data, the server might not be able to perform checks until all the data has been sent anyway (e.g. if a block-based compressor is used), in which case it could just decompress everything and check the size + hash of the uncompressed blob. (A sketch of this offset bookkeeping follows below.)
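
A minimal sketch of that bookkeeping, under stated assumptions: it uses the github.com/klauspost/compress/zstd streaming encoder and the generated google.golang.org/genproto/googleapis/bytestream types; the resource name, stream setup, and request sizing are placeholders, not any particular implementation. The first request carries the uncompressed initial offset (usually 0); every later request carries that offset plus the compressed bytes sent so far.

```go
package casupload

import (
	"io"

	"github.com/klauspost/compress/zstd"
	bspb "google.golang.org/genproto/googleapis/bytestream"
)

// writeSender adapts a ByteStream Write stream to io.Writer so the zstd
// encoder can stream into it while tracking compressed bytes sent.
type writeSender struct {
	stream       bspb.ByteStream_WriteClient
	resourceName string
	initialOff   int64 // offset into the *uncompressed* blob, usually 0
	compressed   int64 // compressed bytes sent so far
	sentFirst    bool
}

func (w *writeSender) Write(p []byte) (int, error) {
	req := &bspb.WriteRequest{
		// First request: the uncompressed initial offset.
		// Later requests: initial offset + compressed bytes already sent.
		WriteOffset: w.initialOff + w.compressed,
		Data:        p, // a real client would also cap request sizes
	}
	if !w.sentFirst {
		req.ResourceName = w.resourceName
		w.sentFirst = true
	}
	if err := w.stream.Send(req); err != nil {
		return 0, err
	}
	w.compressed += int64(len(p))
	return len(p), nil
}

// uploadCompressed streams src through zstd into an already-opened Write stream.
func uploadCompressed(stream bspb.ByteStream_WriteClient, resourceName string, src io.Reader) error {
	ws := &writeSender{stream: stream, resourceName: resourceName}
	enc, err := zstd.NewWriter(ws)
	if err != nil {
		return err
	}
	if _, err := io.Copy(enc, src); err != nil {
		return err
	}
	if err := enc.Close(); err != nil { // flushes remaining compressed data through Write
		return err
	}
	// A final (empty) request marks the end of the write at the expected offset.
	return stream.Send(&bspb.WriteRequest{
		WriteOffset: ws.initialOff + ws.compressed,
		FinishWrite: true,
	})
}
```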

(Review comment on build/bazel/remote/execution/v2/remote_execution.proto: outdated, resolved.)
EricBurnett (Collaborator) commented:

Change LGTM! Thanks Nodir.

I wonder if we should go further and call out explicitly that this is nonsensical (adding uncompressed offset to compressed byte lengths gives...nothing meaningful) but done to fit the semantics of ByteStream, and/or that this will be split into distinct fields in V3 as we get off ByteStream, but that's optional. The text as written looks good, it's just going to reasonably make people question why it's defined so strangely :).

mostynb (Contributor) commented on Apr 20, 2021:

> Change LGTM! Thanks Nodir.
>
> I wonder if we should go further and call out explicitly that this is nonsensical (adding uncompressed offset to compressed byte lengths gives...nothing meaningful) but done to fit the semantics of ByteStream, and/or that this will be split into distinct fields in V3 as we get off ByteStream, but that's optional. The text as written looks good, it's just going to reasonably make people question why it's defined so strangely :).

This also sounds good to me.

The implementation in bazel-remote simply ignores the field in subsequent requests because it isn't useful.

nodirg (Contributor, Author) commented on Apr 21, 2021:

Acked the nonsensical piece explicitly in the comment.

mostynb (Contributor) commented on Apr 21, 2021:

lgtm, but I would remove "This might be improved in the next version." since it doesn't add any useful info (you could say this about any part of this file).

nodirg (Contributor, Author) commented on Apr 21, 2021:

> lgtm, but I would remove "This might be improved in the next version." since it doesn't add any useful info (you could say this about any part of this file).

done

nodirg (Contributor, Author) commented on Apr 21, 2021:

Seems like everyone is happy? How do I get an official "LGTM"? Note that I don't have permissions to submit this CL.

nodirg (Contributor, Author) commented on Apr 22, 2021:

Thanks for approving, @mostynb!

Folks, I don't have merging permissions. Please merge it for me. Thanks.
