Delta-sync, or sync only changed bytes in a file #417

Open
exokkk opened this issue Jul 15, 2016 · 55 comments
Comments

@exokkk

exokkk commented Jul 15, 2016

Hi all,

Delta sync would be great for my huge TrueCrypt/VeraCrypt files (~30-100 GB). Without delta sync I have to stay away from this product.

Delta sync would also give you a feature that distinguishes you from ownCloud.

Couldn't there be an optional (maybe extension-, folder-, or file-based) mechanism to perform delta sync? I say "optional" because I agree that delta sync does not make sense for all kinds of files and folders. Maybe it could even use something existing like rsync?

@rullzer
Member

rullzer commented Jul 15, 2016

I have looked into zsync (client-side rsync). If we ever implement such a feature, it will most likely use that, since you have to offload all the computation to the client side or else you will kill the server.
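
For readers unfamiliar with the approach, here is a minimal sketch of zsync-style matching, assuming an rsync-like weak rolling checksum plus SHA-256 as the strong hash. The function names (`make_metadata`, `blocks_to_fetch`) and the tiny block size are made up for illustration; this is not zsync's actual file format or API. The server only publishes per-block checksums once per file version, and the client does all the scanning against its local copy:

```python
import hashlib

BLOCK = 4  # unrealistically small block size, chosen so the example is easy to trace


def weak_sum(block):
    """rsync-style weak checksum, returned as its two 16-bit halves (a, b)."""
    n = len(block)
    a = sum(block) & 0xFFFF
    b = sum((n - i) * x for i, x in enumerate(block)) & 0xFFFF
    return a, b


def roll(a, b, out, new, n):
    """Slide an n-byte window forward by one byte: drop `out`, take in `new`."""
    a = (a - out + new) & 0xFFFF
    b = (b - n * out + a) & 0xFFFF
    return a, b


def make_metadata(target):
    """Server side (hypothetical): one (weak, strong) checksum pair per block."""
    blocks = [target[i:i + BLOCK] for i in range(0, len(target), BLOCK)]
    return [(weak_sum(blk), hashlib.sha256(blk).hexdigest()) for blk in blocks]


def blocks_to_fetch(local, metadata):
    """Client side: indexes of target blocks not found anywhere in `local`."""
    wanted = {}
    for idx, (weak, strong) in enumerate(metadata):
        wanted.setdefault(weak, []).append((idx, strong))
    found = set()
    if len(local) >= BLOCK:
        a, b = weak_sum(local[:BLOCK])
        pos = 0
        while True:
            if (a, b) in wanted:
                # Weak match: confirm with the strong hash before trusting it.
                strong = hashlib.sha256(local[pos:pos + BLOCK]).hexdigest()
                found.update(i for i, s in wanted[(a, b)] if s == strong)
            if pos + BLOCK >= len(local):
                break
            a, b = roll(a, b, local[pos], local[pos + BLOCK], BLOCK)
            pos += 1
    return [i for i in range(len(metadata)) if i not in found]


old = b"aaaabbbbccccdddd"
new = b"aaaaXbbbbccccdddd"   # one byte inserted after the first block
print(blocks_to_fetch(old, make_metadata(new)))  # -> [1, 4]
```

Because the window slides byte by byte, blocks that merely shifted position (here `bccc` and `cddd`) are still found locally, and only the block containing the insertion and the short trailing block need to be downloaded.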

I don't know about TrueCrypt/VeraCrypt, but most container formats (and encrypted ones are even worse) don't lend themselves particularly well to delta sync, since a small change often results in a lot of changed bytes.

@exokkk
Author

exokkk commented Jul 15, 2016

I can only speak for TrueCrypt (but I suppose VeraCrypt does the same): if you change 1 MB within the container, only a bit more than 1 MB of the whole file changes as well (I am very sure about this). Not even a password change alters the whole file, only small parts of it; see https://news.ycombinator.com/item?id=6523286 and http://crypto.stackexchange.com/questions/18479/how-does-truecrypt-change-password-without-the-need-for-a-complete-re-encryption
So delta sync + TrueCrypt (VeraCrypt) is a really perfect combination.
Although I can see that this feature will not be wanted by many people, there are some cases, like mine, where it would be great. Maybe there are other cases that I/we cannot think of but that exist. I am not sure, but VM images might benefit from delta sync as well, for example. Also, for some files you might uncompress -> compare -> delta_sync -> compress_server_side_again [OK, this might be too costly an action, I do not know. This would work for e.g. *.pptx etc. as well.]
Client-side computation seems reasonable to me.

Thanks for considering the feature in any future release.

@tflidd
Contributor

tflidd commented Jul 16, 2016

That is an interesting feature for virtual images. Are there any experiences with encrypted containers and diff sync tools?

One annoying thing about the sync-process is that you must transfer the files through the client. You can't just place them from a hard disk or use a faster transfer mechanism. Therefore many ask for a diff-sync feature but the ability to compare files (based on a hash-sum) would already help a lot of these people and it's much easier to implement.
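
As a rough illustration of the hash-sum comparison, here is a minimal sketch assuming the server could expose a SHA-256 digest per file. The names `file_digest` and `needs_transfer` are hypothetical and not part of any existing Nextcloud API:

```python
import hashlib


def file_digest(path, chunk_size=1024 * 1024):
    """SHA-256 of a file, read in chunks so huge files never sit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def needs_transfer(local_path, remote_digest):
    """True when the local file differs from what the server reports.

    `remote_digest` is assumed to come from the server; how it would be
    exposed is outside the scope of this sketch.
    """
    return file_digest(local_path) != remote_digest
```

With something like this, a file copied onto the disk by other means would be recognized as already present and skipped instead of being transferred again.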

@maverick74

I don't mind at all offloading all the computation to the client side, as long as such a feature is made available! We actually consider this a very important feature for business!

To get an idea: on a 0-10 scale, what would be the cost (not monetary) of developing this?

Tks

@wudimenghuan

Dropbox and OneDrive have delta sync.
Seafile has delta sync, but it causes broken files.
I hope you will look at rsync. I do need delta sync.

@ariselseng
Member

@rullzer How would the client have the previous file for calculating the delta?

@Bigpet

Bigpet commented Mar 23, 2017

@cowai you either need to keep a copy of your last sync around (using file-system specific things like shadow copy seems out of the question for the broad range of platforms with sync clients) or you have to do block-level syncing instead, like "syncthing" does it.
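
To make the block-level option concrete, here is a minimal sketch assuming fixed-size blocks hashed with SHA-256. It illustrates the general idea only and is not Syncthing's actual implementation; `block_hashes` and `changed_blocks` are names invented for this example. The point is that the client only has to remember the hash list from the last sync, not a full copy of the old file:

```python
import hashlib


def block_hashes(data, block_size):
    """Hash every fixed-size block; the last block may be shorter."""
    return [
        hashlib.sha256(data[i:i + block_size]).digest()
        for i in range(0, len(data), block_size)
    ]


def changed_blocks(old_hashes, new_data, block_size):
    """Indexes of blocks in `new_data` that are new or differ from the last sync."""
    new_hashes = block_hashes(new_data, block_size)
    return [
        i for i, h in enumerate(new_hashes)
        if i >= len(old_hashes) or old_hashes[i] != h
    ]


old = bytes(1000)
remembered = block_hashes(old, 128)      # all the client keeps after a sync

edited = bytearray(old)
edited[500] = 1                          # in-place change to a single byte
print(changed_blocks(remembered, bytes(edited), 128))   # -> [3]
```

One limitation of fixed blocks: inserting bytes in the middle shifts every later block, so everything after the insertion point counts as changed. A rolling checksum, as used by rsync, avoids that at the cost of more computation.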

@stratacast

I think block-level syncing like Syncthing's is probably the easiest implementation in code, and perhaps the cheapest to write. I'm seriously interested in this, and I know some companies that are too (QuickBooks files, man... ugly stuff). Like @Bigpet said, you'd need a copy of the file before changes on hand, or put some hooks into writes that go into that specific directory, but the latter sounds very messy and dangerous. I wish I knew how to write code better because I would 100% do this. I'm definitely a Kindergarten koder compared to a lot of people that put stuff on GitHub. Thought I'd voice that there's interest on my end, and on the end of local companies I know.

@eglipeter

Are there concrete plans when delta syncing will be available? May I hope to see this implemented in Nextcloud 13 already?

@gschenck

There is some progress on owncloud:

owncloud/core#16162

@ahmedammar

ahmedammar commented Oct 26, 2017

@gschenck Please feel free to try out the latest code, the core implementation should be complete now.

@jkaberg

jkaberg commented Oct 26, 2017

@ahmedammar any plans on submitting the PR against NC as well further down the road?

@ahmedammar

@jkaberg once the work is complete and merged in oC I can have a look, assuming the code-base isn't too different at the core ...

@maverick74

@ahmedammar can you give us an update about the feature? (If possible a probable ETA?)

@ahmedammar

@maverick74 no ETA for nextcloud, if someone is willing to open a bounty for it I could look into it more urgently, otherwise, for reference:
owncloud/client#6131
owncloud/core#29404

@L00maca

L00maca commented Dec 24, 2017

It's not much and I'm not even sure I did this right since I never did this before, but I don't mind chipping in to help this get done.
Bountysource

@jospoortvliet
Member

The bounty is already at 115 dollars now. It should not be terribly hard to get this merged in the Nc client and server, I think, but it won't make it for 13 😄

@ahmedammar

I won’t be looking into this until oC actually merges first, since that saves me any duplicated effort. Unless this bounty gets so big that I can ignore oC altogether :)

@maverick74

FWIW i guess there are some news at owncloud/core#29404

@wudimenghuan

@maverick74 So It can be merged... @rullzer @jospoortvliet

@tflidd
Contributor

tflidd commented Mar 21, 2018

FWIW i guess there are some news at owncloud/core#29404

That's the server side. The client side is still on a development branch and subject to testing (https://github.com/owncloud/client/labels/Delta-sync). Until that is finished, it doesn't make a lot of sense to merge anything, so for now you can only help by testing it.

@petrk94

petrk94 commented Apr 9, 2018

I think Nextcloud should hurry up; delta sync will be released in the next ownCloud update:
https://owncloud.com/owncloud-implements-delta-sync-technology/

@jospoortvliet
Member

@petrk94 yeah, it could in theory be merged - but ownCloud notes it'll be in testing until 2019, let's see. @ahmedammar can make a PR for the server - the client will get it as we sync upstream actively still.

@petrk94

petrk94 commented Apr 12, 2018

I'm wondering why I get so many thumbs down; I just want to keep the thread updated :/

@nextcloud-bot nextcloud-bot added the stale Ticket or PR with no recent activity label Jun 20, 2018
@jcklpe

jcklpe commented Aug 9, 2018

If I'm understanding stuff correctly it sounds like NextCloud won't be having this feature any time soon, correct?

@iskradelta

iskradelta commented Oct 19, 2019

@ariselseng rsync is only CPU-intensive on the sender side. The sender can be the client or the server, depending on whether the user is uploading or downloading. There is a limit to how many users can sync their tree (initial download) at the same time: the CPU available to the server, unless the bandwidth limit is hit first. That limit is only reached when the files in a user's tree have a changed timestamp or size, so once synced, many users can keep "syncing" without causing high CPU load.

If this ever becomes a problem, there is a solution: consider caching to avoid the expensive checksumming. But I don't like it, since it means we just assume that syncing is "always an initial sync", that users don't have any of their data on their phones/clients. And it's really a benefit (the zsync pre-calculated metadata file) when all the users are downloading the same tree (files); again, zsync makes sense when it's made for public data like ISO images.
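
To illustrate what such caching could look like, here is a minimal sketch assuming checksums are reused as long as a file's size and modification time are unchanged. `ChecksumCache` is a hypothetical helper written for this comment, not something from rsync, zsync, or Nextcloud:

```python
import hashlib
import os


class ChecksumCache:
    """Reuse a file's checksum until its size or mtime changes."""

    def __init__(self):
        self._entries = {}   # path -> ((size, mtime_ns), digest)
        self.computed = 0    # how many times we actually hashed a file

    def digest(self, path):
        st = os.stat(path)
        key = (st.st_size, st.st_mtime_ns)
        cached = self._entries.get(path)
        if cached is not None and cached[0] == key:
            return cached[1]             # cheap path: no file read at all
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1024 * 1024), b""):
                h.update(chunk)
        self.computed += 1
        self._entries[path] = (key, h.hexdigest())
        return self._entries[path][1]
```

The trade-off is the usual one for size-plus-mtime checks: a file rewritten with the same size and the same timestamp would be missed, so this trades a small correctness risk for far less hashing.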

There is a reason even Dropbox is using librsync. It's the best tool, the best.

@ahmedammar

Good luck.

@jospoortvliet
Member

@iskradelta I look forward to trying out your experiment ;-)

With regard to others asking about priorities: we prioritize things that benefit more users or that are paid for by customers. While everyone here cares deeply about delta sync, 99% of users don't handle very big files in which small parts change regularly. The only scenarios I can think of are VMs and encrypted filesystems, neither of which is used by the vast majority of computer users. The drive and E2E have big benefits for normal users, meanwhile, so we focus there. And finishing those is taking more than long enough; I hope you don't mind that we don't take on another huge task until we have both of those done. Our team can barely handle the support load for customers, and that's the main reason we are not making much progress. We have been trying to hire more people for 3 years already :(

@nextcloud nextcloud deleted a comment from PrivatePuffin Jan 31, 2020
@nextcloud nextcloud deleted a comment from PrivatePuffin Jan 31, 2020
@kesselb

This comment has been minimized.

@nextcloud nextcloud deleted a comment from PrivatePuffin Jan 31, 2020
@kesselb

This comment has been minimized.

@PrivatePuffin

This comment has been minimized.

@realies

This comment has been minimized.

@kesselb

This comment has been minimized.

@Lordroran

This comment has been minimized.

@RedKage

This comment has been minimized.

@tehXor

This comment has been minimized.

@nextcloud nextcloud locked as too heated and limited conversation to collaborators Feb 14, 2020
@jospoortvliet
Member

jospoortvliet commented Feb 20, 2020

I think it was explained before but:

  • small files (under 5 or 10 mb) don't benefit from deltasync - the overhead is not worth it
  • files that are compressed and/or encrypted usually change everywhere when a small modification is made, so they don't benefit either

So almost all common file types, including office documents (yes they are compressed), images, music and large PSD files etc do not benefit from it. A metadata change to a large movie might (not always, depends on the file format) and sometimes to large images, too. But how often do you do that? Once a month? It is really almost exclusively nice for VM images and encrypted container formats. And yes, they matter, but aren't the most important in the world for most of our users, sorry.
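
To see the compression effect for yourself, here is a small experiment, a sketch assuming zlib as a stand-in for the compression used in real file formats and a made-up, highly repetitive payload. `changed_fraction` is a helper written for this example. It flips one byte in the raw data and compares how much of the raw and of the compressed output differs, block by block:

```python
import zlib


def changed_fraction(a, b, block_size=256):
    """Fraction of fixed-size blocks that differ between two byte strings."""
    total = max(len(a), len(b))
    n_blocks = -(-total // block_size)          # ceiling division
    changed = sum(
        1 for i in range(0, total, block_size)
        if a[i:i + block_size] != b[i:i + block_size]
    )
    return changed / n_blocks


# Made-up, highly compressible payload with one byte changed near the start.
original = b"".join(b"record %06d\n" % i for i in range(20000))
edited = original[:10] + b"X" + original[11:]

raw = changed_fraction(original, edited)
packed = changed_fraction(zlib.compress(original), zlib.compress(edited))
print(f"raw blocks changed:        {raw:.2%}")
print(f"compressed blocks changed: {packed:.2%}")
```

In the raw data exactly one block differs. In the compressed stream the difference is usually far more widespread, because everything after the edit gets encoded differently, and that is why block-based delta sync tends to gain little on compressed formats.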

Look, customers use Nextcloud in many ways. SIEMENS, for example, uses it only with HUGE files (minimum 30 gigabytes, typically 50-100 GB). Some media companies use it with PSD files of hundreds of MB. If we could make those cases much more efficient with delta sync, we would look into it, but it wouldn't make a difference, so we don't.

There is little point in discussing this further. We have a lot of work to do and until we have a larger team and have finished other tasks, we won't get to this. If somebody else wants to do it - please, go ahead, pull requests are welcome. If somebody wants to pay for it, get in contact with sales.

@skjnldsv skjnldsv added 0. Needs triage Pending check for reproducibility or if it fits our roadmap 1. to develop Accepted and waiting to be taken care of and removed 0. Needs triage Pending check for reproducibility or if it fits our roadmap labels Aug 20, 2020
@solracsf solracsf changed the title Requesting delta-sync in longterm [$325.00] Delta-sync, or sync only changed bytes in a file Sep 17, 2021
@solracsf solracsf removed the bounty label Dec 1, 2021
@nextcloud nextcloud unlocked this conversation Nov 23, 2022
@jggc

jggc commented May 1, 2023

Since there is no mention of this use case yet in this thread : CAD files.

We are creating a lot of .rvt files that are 99% block duplicates of previous versions.

With the current client it takes a few minutes to sync; with Syncthing it takes 2-3 seconds.

I opened a forum thread about our specific setup but just posting here to keep this alive and maybe bring a business use case with it.

I would be interested in backporting the fixes from owncloud if some people are ready to sponsor this.

@rrauenza

Since there is no mention of this use case yet in this thread : CAD files.

The Adobe Lightroom database is another one. It's an SQLite database that mostly just gets appends.
