This repository has been archived by the owner on Feb 8, 2023. It is now read-only.

git-diff feature: Improve efficiency of IPFS for sharing updated file. Decrease file/block duplication #392

Open
avatar-lavventura opened this issue Oct 30, 2019 · 6 comments

Comments

@avatar-lavventura

  • Motivation:

When a file is updated and re-synced, reduce its block duplication on nodes all over the world, decrease the communication cost (only the updated blocks are downloaded), and save storage (only the updated sections of the file are stored as new blocks).


  • Problem:

Example: a file.tar.gz (~100 GB), which contains a data.txt file, is stored in my IPFS repo, pulled from another node (node-a).

I open data.txt, add a single character at a few locations in the file (beginning, middle, and end), compress it again as file.tar.gz, and store it in my IPFS repo. The update itself is only a few kilobytes.

[*] Even when I delete a single character at the beginning of the file, the hashes of all the following 124 kB blocks are altered, which causes the complete file to be downloaded again.

As a result, when node-a wants to re-get the updated tar.gz file, a re-sync takes place and the whole file is downloaded all over again. This duplicates blocks (~100 GB in this example) even though the change is only a few kilobytes, and this duplication is then iteratively propagated across all peers, which is very inefficient and consumes a large amount of storage and extra communication cost over time.
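For illustration (a sketch with assumed file names), the effect can be reproduced with the default fixed-size chunker; after even a one-byte change near the start of the archive, the two versions share essentially no blocks:

    # add the original archive with the default (fixed-size) chunker
    CID_OLD=$(ipfs add -Q file.tar.gz)

    # simulate a small change by prepending a single byte, then re-add
    { printf 'x'; cat file.tar.gz; } > file-new.tar.gz
    CID_NEW=$(ipfs add -Q file-new.tar.gz)

    # compare the block lists of the two DAGs; with fixed-size chunking the
    # intersection is essentially empty, so everything is fetched and stored twice
    ipfs refs -r "$CID_OLD" | sort > old.blocks
    ipfs refs -r "$CID_NEW" | sort > new.blocks
    comm -12 old.blocks new.blocks    # expected: no shared blocks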


  • Solution:

Other cloud providers try to solve this problem using block-level file copying. In their case, as in IPFS, a block list is used for block-level file copying; so when a file is updated (a character is added at the beginning of the file), Dropbox or OneDrive will re-upload the whole file, since the first block's hash changes and that in turn changes the hashes of all subsequent blocks. This does not solve the problem.

=> I believe a better solution is to consider the difference between successive versions of the file, along the lines of the approach git-diff uses. Only the changed (diff) parts of the file would be uploaded, which is a few kilobytes in the example I gave, and the diffed blocks would be merged when other nodes pull that file. The communication cost would then be only a few kilobytes, and the amount of data added to storage would be only a few kilobytes as well.

I know that it would be difficult to redesign IPFS, but this could be done as a wrapper solution that combines IPFS and git, which users could apply to very large files based on their needs.
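As a rough sketch of such a wrapper (nothing here is an existing IPFS feature; file names are assumed, and a binary-delta tool like xdelta3 stands in for git-diff, which does not handle large binaries well):

    # sender: compute a binary delta between the old and new archive
    xdelta3 -e -s file-old.tar.gz file-new.tar.gz delta.vcdiff
    DELTA_CID=$(ipfs add -Q delta.vcdiff)    # only a few kilobytes are published

    # receiver: already has file-old.tar.gz; fetch the delta and apply it
    ipfs get "$DELTA_CID" -o delta.vcdiff
    xdelta3 -d -s file-old.tar.gz delta.vcdiff file-new.tar.gz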


The IPFS team is considering adding this eventually, but it is not treated as a priority. I think it should at least be prioritized.


Please see the discussions I have already opened, and feel free to add your ideas to them:

=> Does IPFS provide block-level file copying feature?

=> Efficiency of IPFS for sharing updated file

@avatar-lavventura changed the title from "git-diff feature: Improve efficiency of IPFS for sharing updated file //decrease file/block duplication" to "git-diff feature: Improve efficiency of IPFS for sharing updated file. Decrease file/block duplication" on Oct 30, 2019
@ivan386

ivan386 commented Oct 30, 2019

Test this:
ipfs add -s=rabin filename
Rabin is a different chunking algorithm. It is more effective for big files with small changes.
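For example (a sketch with assumed file names), you can check how many blocks the two versions of the archive share when both are added with the rabin chunker:

    CID_OLD=$(ipfs add -Q --chunker=rabin file-old.tar.gz)
    CID_NEW=$(ipfs add -Q --chunker=rabin file-new.tar.gz)

    # count the blocks the two versions have in common
    comm -12 <(ipfs refs -r "$CID_OLD" | sort) \
             <(ipfs refs -r "$CID_NEW" | sort) | wc -l

    # the chunk sizes can also be tuned, e.g. --chunker=rabin-262144-524288-1048576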

@dbaarda

dbaarda commented Oct 31, 2019

Note that *.tar.gz files are compressed by default in a way that typically means a single byte change in an internal data file will result in nearly every byte changing in the compressed output.

This is why things like rsync cannot efficiently update standard *.gz files. Some (all?) gzip implementations support an --rsyncable argument that sacrifices a little bit of compression to minimise the differences in the compressed output. Interestingly, it does this using something similar to rabin chunking under the hood, though I think it predates rabin.

So you will need gzip --rsyncable when creating your tar.gz and use ipfs add -s rabin to get any deduping.
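A sketch of that combination, assuming a gzip build that supports --rsyncable and a data/ directory to archive:

    # create the archive with rsync-friendly compression
    tar -cf - data/ | gzip --rsyncable > file.tar.gz

    # add it with a content-defined chunker so unchanged regions deduplicate
    ipfs add --chunker=rabin file.tar.gz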

@Stebalien Stebalien transferred this issue from ipfs/kubo Nov 1, 2019
@MatthewSteeples

I know this goes against most of what IPFS plans to do (as it's a layer up), but it would be interesting to do some research into whether "understanding" compressed formats would have any benefit. I'm thinking something along the lines of the following:

  1. For .tar files, IPFS could be smart enough to store each file individually and rebuild the tar container itself when requested. For people using tar files as backups, this may result in a considerable saving if the files don't change much over time.
  2. For .gz files, IPFS could be smart enough to decompress the file before processing. The storage layer could have some form of compression applied to it anyway (so this doesn't take any additional disk space), as could the network layer (so no additional bandwidth is consumed)
  3. For zip files you get a combination of the above. Effectively you treat the zip file as a folder.

These would enable small changes in files to only require that file to be retransferred (as the rest could be rebuilt).
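Something close to point 1 can already be approximated a layer above IPFS (a sketch with assumed paths; the rebuilt tar will not be byte-identical, since ordering and metadata are not preserved): store the unpacked directory, which deduplicates per file, and rebuild a tar container on demand.

    # store the unpacked contents instead of the tar container;
    # files that do not change keep the same CIDs across versions
    mkdir -p backup
    tar -xf backup.tar -C backup/
    DIR_CID=$(ipfs add -r -Q backup/)

    # rebuild a tar archive on demand from the stored directory
    ipfs get "$DIR_CID" -o backup-restored/
    tar -cf backup-restored.tar -C backup-restored/ .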

Caveats/downsides:

  • Would only work on unprotected files
  • Files can be compressed more or less efficiently, so would need to consider whether that detail is persisted
  • You're no longer transferring the actual blocks of a file, but are rebuilding a file on demand
  • Additional complexity, having to understand file formats

@vmx
Member

vmx commented Nov 4, 2019

  1. For .tar files, IPFS could be smart enough to store each file individually and rebuild the tar container itself when requested. For people using tar files as backups, this may result in a considerable saving if the files don't change much over time.

That reminds me of some work @mib-kd743naq is doing, where you store a hybrid in IPFS which can be served up as individual files, but also as tar archives. I couldn't find a good link to it, but I'm sure @mib-kd743naq can tell more about this.

@momack2
Contributor

momack2 commented Dec 3, 2019

@ribasushi - thought this thread might be interesting to you.

@RubenKelevra

RubenKelevra commented Jan 25, 2020

I think the tar command of ipfs should be changed to allow any folder stored via ipfs-mfs to be cat out as a tar archive, and any tar container to be imported as an ipfs-mfs folder.

Compression is a different story, since it would be best to support this on the storage-layer side. If data is stored compressed, it might be wise to be able to export individual files compressed as well, or as a container which supports multiple individually compressed files (tar can't handle this).

Deduplication already works fine; you just have to switch to rabin or the new buzhash chunker. See ipfs/kubo#6841 (comment)
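A minimal example of that (the MFS path is just an illustration):

    # content-defined chunking with buzhash, the newer alternative to rabin
    ipfs add --chunker=buzhash file.tar.gz

    # optionally copy the result into MFS so it appears under a mutable path
    ipfs files cp /ipfs/<cid-from-add> /backups/file.tar.gz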

So I think this ticket can actually be closed, since it's already implemented :)
