Don't re-download/re-upload manually copied files #3422

Closed
nalt opened this issue Jul 8, 2015 · 26 comments
@nalt

nalt commented Jul 8, 2015

For large files, it often makes sense to copy them directly between two computers, instead of waiting for an upload/download cycle. The client should detect such directly copied files and not re-download them from the server.

Expected behaviour

The client should not download a file that has already been copied manually into the ownCloud folder from another client.

Actual behaviour

The client downloads an already existing file again from the server if the file has been copied manually into the ownCloud folder.

Steps to reproduce

Two clients: A,B

  1. Stop owncloud client B
  2. Create a file FILE.DAT on A, let it sync to the server
  3. Copy FILE.DAT to the same location on B manually
  4. Start owncloud client B
  5. The client on B downloads FILE.DAT again from the server

Server configuration

Operating system: Ubuntu 12.04

Web server: Apache 2.4.12

Database: MySQL

PHP version: 5.5.23

ownCloud version: 8.0.2

Storage backend:

Client configuration

Client version: 1.8.3

Operating system: Ubuntu 15.04

OS language: DE

Installation path of client: /usr/bin

@ghost

ghost commented Jul 8, 2015

There is definitely an open feature request for this already, but I can't find it at the moment. I will see if I can find it tomorrow.

@ogoffart
Contributor

ogoffart commented Jul 9, 2015

This would be a use case for checksums. @dragotin

@phil-davis
Contributor

This would be really nice to have for slow connections when you want to "seed" a new client install. E.g., if I have nGB in ownCloud on my laptop client and now want my wife's laptop to have ownCloud with the same files, I could either:
a) Create the ownCloud folder before installing the client, put all the files in it with content and time-stamps... preserved (rsync, robocopy them there, whatever). Then install the client and let it sync.
or
b) Install the ownCloud client and have a checkbox on the installer to tell it "do not start a sync just yet". Then put all the files in place from the other laptop... Then let the client start to sync.

The client would find that the files on the server and files on the local client already matched - a little bit of time sending messages back-and-forth client<->server but no need to transfer any real file data.

I suspect that any code that succeeded in doing this sort of thing would work for both sequences (a) or (b) anyway.

@guruz guruz added the Feature label Jul 10, 2015
@guruz guruz changed the title Manually copied files are downloaded again Don't re-download/re-upload manually copied files Jul 10, 2015
@guruz guruz added this to the backlog milestone Jul 10, 2015
@tflidd

tflidd commented Jul 28, 2015

@prophoto

prophoto commented Aug 3, 2015

bump :0) ... we definitely need this feature. I have over 5TB of files I'd like to have in ownCloud. It would probably take over a year for the client to upload all of them!

@lhartmann

This issue also makes it extremely inconvenient to migrate from other sync tools (unison, rsync, Dropbox, ...) to ownCloud. I have over 100GB of files stored in 4 locations (Linux + unison + rsync) and decided to make ownCloud available for my co-workers... If we all have to re-download all the files again, our network will be clogged for weeks.

I know very little of owncloud's internals, but let me try brainstorming a possible solution...

For server side:

  • Do NOT assume ETAG = hash! Hashing is expensive and must remain an option, keeping compatibility with older clients.
  • Regular ETAG-based behaviour must remain sufficient to detect any potential change that might have happened.
  • Add a hash column to the file database, maybe on filecache.
  • Hashes should be stored along with their type, e.g. "MD5:afafafafafafafafaf". This should allow future algorithm changes without rehashing the whole database, and would also allow mobile clients to prefer a hardware-accelerated hash when available.
  • Let the hash column accept, and default to, NULL values for unknown hashes (compatibility for SQL INSERTs from older servers).
  • Add an "update trigger" to the filecache table, clearing the hash column on update if the etag changes but the hash does not (compatibility for SQL UPDATEs from older servers; removes stale hashes automatically).
  • Offload hashing to the client when possible and safe (desktop/mobile client during file upload).
  • Do not trust client-provided hashes for shared files. An ill-intentioned user should only be able to harm himself.
  • Maybe have the server hash shared files during upload... You are already running network and disk I/O on the server, and my i7 CPU can hash 280 MB/s on a single core (SHA-1)... I don't think this should impact performance that much.
  • Optionally have a cron job hashing null-hashed files every now and then. This also helps when upgrading from older servers.
  • Optionally, hash and update the table when a download is requested.
  • If a file is modified on the client side, hash it and update the local DB.
  • If during the sync an ETAG has changed but the HASH matches, then just update the ETAG and do not transfer data. (A rough sketch of the schema and trigger idea follows after the P.S. below.)

P.S.: I do realize that hashing on download is CPU- and disk-expensive on the server side, but pointless downloading is network-intensive and uses just as much disk I/O...
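
A minimal sketch, in Python with SQLite, of the nullable hash column and the etag-change trigger proposed above. The table layout, column names and the "ALGO:hex" prefix are illustrative assumptions only; the real ownCloud filecache schema differs.

```python
import sqlite3

# Illustrative only: sketches the "nullable hash column + invalidate-on-etag-change"
# idea, not the real ownCloud schema.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE filecache (
    fileid INTEGER PRIMARY KEY,
    path   TEXT,
    etag   TEXT,
    hash   TEXT DEFAULT NULL   -- e.g. 'MD5:afaf...'; NULL means unknown
);

-- If the etag changes but the hash was not updated in the same statement,
-- the stored hash is stale: clear it instead of trusting it.
CREATE TRIGGER filecache_hash_invalidate
AFTER UPDATE OF etag ON filecache
WHEN NEW.etag <> OLD.etag AND NEW.hash IS OLD.hash
BEGIN
    UPDATE filecache SET hash = NULL WHERE fileid = NEW.fileid;
END;
""")

# A code path that only touches the etag...
db.execute("INSERT INTO filecache VALUES (1, '/FILE.DAT', 'etag-1', 'MD5:afaf')")
db.execute("UPDATE filecache SET etag = 'etag-2' WHERE fileid = 1")
# ...leaves the hash cleared rather than stale:
print(db.execute("SELECT etag, hash FROM filecache").fetchone())  # ('etag-2', None)
```

A writer that knows nothing about the hash column simply leaves it NULL on INSERT (the default), and an etag-only UPDATE clears it via the trigger, which is the compatibility behaviour described above.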

@JH6

JH6 commented Aug 13, 2015

This would indeed make ownCloud a lot more useful. I have around 600GB of data. If I had to reinstall one computer and resync all this data (instead of first backing it up on an external drive and just copying it back after reinstallation), that would take a lot of time.

@AnrDaemon

Gotta agree with @lhartmann - he seems to have thought it through quite thoroughly.
The disk and CPU usage of rehashing is a known, bounded factor.
Network bandwidth is not.
When you play known cards against unknown ones, you'd better keep your hand tight and prepare for the worst.

@lhartmann

My suggestion is flawed for large files. Imagine you have a 50GB VirtualBox disk image to sync, where only 10kB were actually modified. You would have to rehash all of the 50GB, which is terrible, but so would transferring 50GB instead of 10kB.

Recent clients (I believe) can upload fragments of a file, say 4MB, per HTTP connection. I don't think this is a fixed value, which complicates things a lot... However, if a standard fragment size can be defined at least for ownCloud clients, then there is another potential solution:

  • All ETAG functionality is to remain identical, for compatibility. Change detection should behave as usual:
    • If ETAGs are identical files must be considered identical, ignoring HASH data.
    • If ETAGs differ files MAY differ, clients MAY decide to verify HASHes.
  • HASH functionality is for segments, not full files. The server holds no file HASH information.
  • Create a separate table for hashes of segments (hashcache?) with fileid, offset, size and HASH columns.
    • The file's ETAG must not be included, or it would invalidate all segment hashes unnecessarily.
    • Including offset and size makes us future-proof, and may allow other backends to provide HASHes for variable-sized segments, or even the complete file.
    • hashcache must be updated whenever the server HASHes a segment.
    • hashcache entries must be flushed for any file that changes without owncloud server intervention.
    • hashcache will be empty after a server update from old versions.
  • HASH information MUST NOT be included in replies to regular propfind request for the following reasons:
    • That would make responses big: 500kB+ in hash entries for a 50GB file with 4MB segments (12.5k segments at 40B each, not including segment offset, size and HTTP header encoding).
    • The first sync after a server update would only finish after all segments were hashed on the server, despite ETAGs indicating no changes. This would be terrible for upgrading.
  • Client behaviour remains the same for discovery and queueing of potential changes (ETAG-based).
  • During the transfer phase, before transferring each file, the client should request a propfind again on the potentially modified file, but now with a special HTTP header (maybe X-OwnCloud-Hash-Request: true) requesting hash info to be included (a rough sketch of this exchange follows after this list). Benefits:
    • New clients will only request hashes that may potentially be used to save network traffic.
    • Old/alternative clients will never receive hash headers they don't care about.
    • Prevents transfer of thousands of unnecessary HASH headers during discovery.
    • Eliminates hashcache-related disk IO during discovery.
  • The server should respond to a propfind+hash request as a regular propfind, but including HASH headers for each individual cached segment. Maybe something like "X-OwnCloud-Hash: MD5,offset,size,AABBCCDDEEFF". Non-hashcached segments SHOULD be HASHed on demand for backends where it makes sense.
    • HASHing happens only when it may save network traffic: when an ETAG has changed and a file is about to be transferred.
    • Allows for a smooth upgrade from old servers, as unchanged files (detected by ETAG) will not be hashed until really necessary.
    • Yes, it will take some time to HASH the 50GB disk image, but that will save 49.996GB of data transfer.
  • Other backends (such as samba, nfs, dropbox, ...) may not allow proper HASH functionality. In this case the server should reply with the best available information, maybe a full-file hash if dropbox supports it, or hash headers for just a few of the segments. The client should understand the lack of HASH info for a segment as an indicator of potential modification, and proceed to the usual data transfer.
    • For other backends the server may still cache HASHes for segments as they pass through the server in either direction (backend->server->client or client->server->backend), this way the hashcache will be valid until the backend file is changed outside of owncloud server. This should allow for segment transfers even if the backends have no segment hash support, as long as owncloud server is the only one changing the files directly.
  • Old owncloud and generic webdav clients may do unaligned transfers, will not provide any hashes on upload, and will never receive (or request) a HASH list for which they have no use.
  • New Owncloud clients should always do segment-aligned transfers, and handle/provide segment HASH accordingly. Server may choose to ignore client-provided HASH if trusting it is deemed unsafe (shared files).
  • When the server receives a misaligned segment or a segment with untrusted HASH, it MUST delete the invalidated entries from hashcache. Updating hashcache now could save a little disk IO on the next propfind+hash, but it is not mandatory.
  • I don't think SQL insert and update triggers as I previously proposed would work in this setup, as validating cross-table updates does not seem trivial. However, they were only useful for server version dancing such as: install or update to a new server (creates a valid hashcache), downgrade to an old server (modifies filecache, leaving hashcache inconsistent), then update to a new server again (finds a bad hashcache but believes it to be valid). I believe dropping hashcache entirely is probably safer in this case.

Phew... Took me 2 hours to write this, but at least this feels even better than the previous idea. :-)
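
A toy client-side sketch of the segment-hash comparison described in the list above, assuming fixed 4MB segments and the hypothetical "X-OwnCloud-Hash: ALGO,offset,size,hexdigest" header format from the proposal (none of this is an existing ownCloud API):

```python
import hashlib

SEGMENT_SIZE = 4 * 1024 * 1024  # assumed fixed segment size from the proposal


def local_segment_hashes(path, algo="md5"):
    """Hash the local file segment by segment, as if it were many small files."""
    hashes = {}
    offset = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(SEGMENT_SIZE)
            if not chunk:
                break
            hashes[(offset, len(chunk))] = hashlib.new(algo, chunk).hexdigest()
            offset += len(chunk)
    return hashes


def parse_hash_headers(header_values):
    """Parse hypothetical 'ALGO,offset,size,hexdigest' header values; the
    algorithm field is ignored here for brevity."""
    remote = {}
    for value in header_values:
        algo, offset, size, digest = value.split(",")
        remote[(int(offset), int(size))] = digest.lower()
    return remote


def segments_to_download(local, remote):
    """A segment must be fetched when the server provides no matching hash for
    it or the hashes differ; segments with matching hashes are skipped."""
    return [seg for seg, digest in remote.items() if local.get(seg) != digest]


# Usage sketch (how the headers are retrieved depends on the HTTP library used):
#   local  = local_segment_hashes("disk-image.vdi")
#   remote = parse_hash_headers(all_x_owncloud_hash_header_values)
#   for (offset, size) in segments_to_download(local, remote):
#       ...fetch just that byte range...
```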

@AnrDaemon

LH> My suggestion is flawed for large files.

Not quite.
Perhaps you do not understand this, but knowing whether the hashes match and
downloading the whole file just to make the hashes match are not the same thing.
Once you know that you have two different, but large, files, you have more
than one option for syncing them.

LH> Imagine you have a 50GB virtualbox disk image to sync, where only 10kB
LH> were actually modified. You would have to rehash all of the 50GB, which is
LH> terrible, but so would be to transfer 50GB instead of 10kB.

You WILL need to rehash it; it is unavoidable. There's just no other way to
ensure that your judgment is consistent.
The interesting part comes at the time of rehashing:
you could compute a number of separate hashes for chunks of the file.
Say, 4 or 8MB per chunk sounds reasonable.

LH> HASH information MUST NOT be included in replies to regular propfind
LH> request for the following reasons:

It doesn't need to be, but there are options.
You could then combine the chunk hashes into a string and compute a single hash
over, e.g., half of them. The client does the same. They exchange one hash at a
time. Then a simple binary search would quickly reveal the part which needs
syncing. Basically, the same principle that is used in other tools, like rsync
or, god forbid, torrents.

LH> That would make responses big: 500kB+ in hash entries for a 50GB file
LH> with 4MB segments

That assumes an MD5 hash, which is weak; a SHA-256 hash is 64 hex characters per chunk.
Which is another reason to go into playing a hot-potato game with hashes
before sending changes over the wire.
There are, however, edge cases that need to be thought through.
Take your example of a 50GB file and cut a part out of the middle:
blind 4MB chunk hashing will make you redownload roughly 25GB even though none
of that data has changed.
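
A toy sketch of that combine-and-binary-search idea, under the assumptions above (fixed 4MB chunks, SHA-256); in a real protocol each combined digest would be exchanged over the wire one at a time rather than compared between in-process lists:

```python
import hashlib

CHUNK = 4 * 1024 * 1024  # 4 MB chunks, as suggested above


def chunk_hashes(data):
    """Per-chunk digests; each side computes these for its own copy of the file."""
    return [hashlib.sha256(data[i:i + CHUNK]).digest()
            for i in range(0, len(data), CHUNK)]


def combined(hashes):
    """One digest over a run of chunk digests, so a whole range of chunks can
    be compared by exchanging a single hash."""
    return hashlib.sha256(b"".join(hashes)).digest()


def diverging_chunks(local, remote, lo=0, hi=None):
    """Binary search: recurse only into halves whose combined digest differs."""
    if hi is None:
        hi = max(len(local), len(remote))
    if lo >= hi:
        return []
    if combined(local[lo:hi]) == combined(remote[lo:hi]):
        return []
    if hi - lo == 1:
        return [lo]
    mid = (lo + hi) // 2
    return (diverging_chunks(local, remote, lo, mid) +
            diverging_chunks(local, remote, mid, hi))
```

Note that this keeps the edge case raised above: with fixed chunk boundaries, an insertion or deletion shifts every later chunk, which is the problem rolling-checksum schemes such as rsync's are designed to handle.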

@tflidd

tflidd commented Nov 6, 2015

@lhartmann @AnrDaemon : there are already topics about differential sync:
#179
owncloud/core#16162
https://dragotin.wordpress.com/2015/02/09/incremental-sync-in-owncloud/

But this topic is not about that. Don't get me wrong, it's a great feature and both issues could be dealt with at the same time. The problem here could be easier to implement and would already help a lot of users. Picture, music, or film archives are very static.

@lhartmann

Unless the filesystem supports keeping track of partial file changes, you will need to rehash the 50GB file on the client side every time a single byte is changed. Annoying, but still better than transferring all that data. On the server side, hashing is required only after a server update, but it would be gradual and on demand. Once all files are hashed, the client may send modified hashes along with modified segments, so normal operation would not need the server to hash anything. Exceptions are shared files, where trusting an ill-intentioned client is unsafe, and old/alternative clients, which would not provide the new hash as segments are uploaded.

Hashing is supposed to consider every segment independently, as if there were 12k 4MB files instead of a single 50GB file, so there would be no edge issues. Note that full file hashing is discouraged on the server, for it would require 50GB disk IO for every segment updated.

@tflidd Sorry about the crossed topic post. I will move the misplaced posts away as soon as I have the time.

@AnrDaemon

@tflidd, apologies for hijacking the thread, but these two questions are so tightly linked that it isn't really possible to point a finger and say "here ends the one and there starts the other."

@theminor

Is there any workaround to this issue? Perhaps a way to manually edit metadata on the remote files or trick the ownCloud sync client into thinking it has already synced the files? I have a new ownCloud installation with terabytes of data that exists, identical, in two locations, but ownCloud tries to re-sync everything, which would take forever and use a ton of bandwidth.

@prophoto

@theminor sadly I haven't found one yet. There needs to be a way to seed the server, but as of now there is none. I also have over 5TB of data with more than 1M files, which would take years to complete.

@ogoffart
Contributor

Now that we have checksums in the database, we could try to detect copies.

I don't know if the server supports the WebDAV COPY method.
Even if it does, we would need some extensions so the server can return the new etag or fileid, and so that we can set mtimes and such.

@shsdev

shsdev commented Jan 11, 2016

Same problem. I have about 500GB which would be re-downloaded over WiFi by the client even if the complete data has been copied beforehand.

@JH6

JH6 commented Jan 12, 2016

Any progress yet? My data is growing relatively fast, with at least two users having more than 500GB. Syncing this again on a reinstall of their laptops (which will inevitably happen someday) will be problematic.

@tflidd

tflidd commented Jan 23, 2016

it is scheduled for the 2.2 version:
#4375

@mark-orion

When will 2.2 be released? I just reinstalled a machine and can share the frustrating experience of the ownCloud client wanting to download everything again although it's already there.
Has anybody found a workaround for this problem? I mean, the sync client MUST keep track of what has been synced in some way, so there should be a way of simply pushing that information across.

@danimo danimo removed this from the backlog milestone Feb 22, 2016
@Nico83500

Hi,
Is there a solution for this issue?
Thank you!

@brunodegoyrans

Hi

It would be a greeeeeat feature!

@ghost

ghost commented Oct 3, 2016

@mark-orion @Nico83500

When moving files from e.g. PC A to PC B, make sure to copy over the hidden file .csync_journal.db to the same place/folder on PC B, and make sure the file mtimes stay the same. The sync client keeps its state in there and won't re-download the files.
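
If you script the copy instead of using rsync/robocopy, something along these lines preserves the modification times the client compares (a minimal sketch; the paths are placeholders, dirs_exist_ok needs Python 3.8+, and rsync -a has the same effect):

```python
import shutil

# Placeholder paths for the sync folders on PC A (source) and PC B (target).
SRC = "/path/to/ownCloud-on-PC-A"
DST = "/path/to/ownCloud-on-PC-B"

# copy2 preserves file modification times, and copytree also picks up hidden
# files such as .csync_journal.db along with the rest of the folder.
shutil.copytree(SRC, DST, copy_function=shutil.copy2, dirs_exist_ok=True)
```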

@brunodegoyrans and others

Please use the emoji reaction on the first post of this issue to say that you would like to see this feature. This keeps the issue from filling up with no-content posts like "me too".

@PVince81
Contributor

PVince81 commented Nov 8, 2016

The server is already able to store checksums as sent by the client.
These checksums can be retrieved by querying the property "oc:checksums" through PROPFIND and compared with local checksums to avoid re-downloads.
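
For reference, a rough sketch of such a PROPFIND from a script, assuming the remote.php/dav/files/<user>/ endpoint and the http://owncloud.org/ns namespace for oc: properties; URL, credentials and path are placeholders:

```python
import requests

# Placeholders: adjust server URL, user, password and remote path.
URL = "https://cloud.example.com/remote.php/dav/files/USERNAME/FILE.DAT"
BODY = """<?xml version="1.0"?>
<d:propfind xmlns:d="DAV:" xmlns:oc="http://owncloud.org/ns">
  <d:prop><oc:checksums/></d:prop>
</d:propfind>"""

resp = requests.request(
    "PROPFIND",
    URL,
    data=BODY,
    headers={"Depth": "0", "Content-Type": "application/xml"},
    auth=("USERNAME", "PASSWORD"),
)

# On success, the multistatus body carries an <oc:checksums> element (e.g.
# "SHA1:... MD5:...") that can be compared against a locally computed checksum
# before deciding to download the file at all.
print(resp.status_code)
print(resp.text)
```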

@PVince81
Contributor

PVince81 commented Nov 8, 2016

> I don't know if the server supports the WebDAV COPY method.

The OC server supports the COPY method, but it doesn't seem to copy the checksum (it doesn't copy any other metadata either). This could be improved to have it copy at least the checksum field.

Raised here owncloud/core#26584

@ckamm
Contributor

ckamm commented Jun 22, 2017

The original problem should be solved for client 2.4 by #5838 if the size and mtime of the server and client file are identical.

@PVince81 @ogoffart Your discussion of COPY relates to the case of avoiding the upload of a file that was locally copied? Is it okay if I close this one and create a new issue for that enhancement?
