Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable content checksums #4375

Closed
ckamm opened this issue Jan 20, 2016 · 13 comments
Closed

Enable content checksums #4375

ckamm opened this issue Jan 20, 2016 · 13 comments
Assignees
Labels
Enhancement ReadyToTest QA, please validate the fix/enhancement
Milestone

Comments

@ckamm
Copy link
Contributor

ckamm commented Jan 20, 2016

Due to #3235 and #2983 we use content checksums for detecting unnecessary uploads and renames-that-aren't. But content checksums are currently only computed for .eml files.

We should consider enabling content checksumming for everything.

@ckamm ckamm self-assigned this Jan 20, 2016
@ckamm ckamm added this to the 2.2-next milestone Jan 20, 2016
@guruz
Copy link
Contributor

guruz commented Jan 20, 2016

👍

I still vote for using a safe cryptograpgical checksum so we can later use it for more advanced magic.
But if @ckamm says we can anyway later switch...

@LukasReschke @ogoffart What do you think?

@dragotin
Copy link
Contributor

@moscicki your input here?

@guruz
Copy link
Contributor

guruz commented Jan 20, 2016

Maybe also from @PVince81 and @DeepDiver1975 about storage backends (on the server) natively supporting certain checksums so we can align with them.

@ckamm
Copy link
Contributor Author

ckamm commented Jan 20, 2016

Discussion on IRC regarding checksum type:

  • Performance of checksumming will rarely matter compared to the cost of uploads and downloads, we can use SHA1 or better.
  • We should choose a content checksum type that's supported by the server as transmission checksum. Otherwise we might have to compute two checksums.
  • The content checksum could be computed while we're reading the file into memory anyway, instead of as a separate pass.
  • The transmission checksum is currently part of the header of the last chunk's PUT request and must thus be computed in advance.

@ckamm
Copy link
Contributor Author

ckamm commented Jan 20, 2016

Other jobs that need to be done to expand our use of checksums:

  • Does the 'transmission checksum' vs 'content checksum' distinction make sense? Currently the server only sees the 'transmission checksum' and we discard the checksum we get from the server because the journal only stores the 'content checksum'.
  • Ensure we have a content checksum for downloaded files: either use the checksum the server sends or compute it after downloading.

@moscicki
Copy link
Contributor

Please consider that some storage backends such as ours already have a preferred checksum (adler32 in our case). It may be impossible for us to change the checksum type. More precisely, we have a technical ability to change the checksum type (for new files) but we may be constrained to do so by other considerations/policies. And also, consider that currently all our files are already checksummed with adler32 so any checksum type change could only apply to new files (so we would have a mix of checksum types in the system).

The adler32 checksum is not efficient for change detection for small files. Hence using it as content checksum may be problematic for files smaller than a given size (say 1KB). Also, consider this especially if you think in the future of storing checksum blocks which are by definition rather small. So if we want to avoid to compute additional content checksum then the content checksum mechanism with adler32 should be only enabled for files above the critical size (meaning that below that size you should count with a possibility of collision and false-positive). So that would option one but it is not optimal because it would not work for very small files (which I guess defeats the purpose for *.eml files).

So I think at least for us the distinction between content and transmission checksum makes sense. What is currently called a transmission checksum is actually enforced by the server and client cannot change it (another example: MD5 for Amazon S3). What is called content checksum is actually what client wants to use locally to achieve better operation with local change detection. In some cases both can be aligned, in other cases they cannot.

If we wanted to add additional content checksum than I would suggest:

  • on PUT: also send it as additional special header (e.g. OC-Client-Content-Checksum) just in case server decided to make use of this information in some way (or just store it as additional metadata). I would only consider to add this header unless you think that it is a better implementation choice to avoid the second pass on the client.
  • on GET: if transmission checksum does not match the type you expect you should compute the content checksum yourself and store locally. You may want to consider to store also the transmission checksum in case some correlation between the client and server checksums would be needed. This checksum however, is not normally needed by the client and should be included in a way which does not increase the memory footprint of the client (if that's possible at all)

I am open to discuss it further, these are just a bunch of initial thoughts but the initial constraints I described are valid.

@AnrDaemon
Copy link

A wishful note: Using two differently calculated checksums against same file often increase reliability.

@dragotin
Copy link
Contributor

Hm, I got pointed to this: https://tools.ietf.org/html/rfc3230
Maybe we should be compatible?

@dragotin
Copy link
Contributor

Created owncloud/core#22711 about this.

@ckamm ckamm assigned ckamm and unassigned dragotin Mar 1, 2016
ckamm added a commit to ckamm/owncloud-client that referenced this issue Mar 2, 2016
ckamm added a commit to ckamm/owncloud-client that referenced this issue Mar 2, 2016
@ckamm
Copy link
Contributor Author

ckamm commented Mar 2, 2016

Pull request #4532

@moscicki
Copy link
Contributor

moscicki commented Mar 2, 2016

Hello,

In 1.7.2 we have the content checksums implemented and enabled. Will this pull request affect the current transmission checksum implementation?

Thank you,

kuba

@ckamm
Copy link
Contributor Author

ckamm commented Mar 3, 2016

@moscicki This has no effects on transmission checksums, it's only about checksums that the client stores in its database.

ckamm added a commit that referenced this issue Mar 14, 2016
@ckamm ckamm added the ReadyToTest QA, please validate the fix/enhancement label Mar 14, 2016
@guruz
Copy link
Contributor

guruz commented Apr 6, 2016

-> #4638

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement ReadyToTest QA, please validate the fix/enhancement
Projects
None yet
Development

No branches or pull requests

5 participants