Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance for large folders #59

Merged
merged 14 commits into from
Aug 23, 2017

Conversation

greezybacon
Copy link
Collaborator

@greezybacon greezybacon commented Dec 7, 2016

First of all, thank you for the project. I think it's very promising and very elegantly implemented.

I'll offer my use case first, as it might differ from others' usage of the project. I have two servers, we'll call them A and B. They are network file servers and should accept writes from any client at any time. It is my goal to use bitpocket to synchronize the writes between the two boxes. And yes, I'm aware that a cluster file system with distributed locking might be a better option; however, I think that's overly complicated. In our environment, folks don't generally edit the same file at the same time. Having a 10 minute lag between the servers is acceptable. I'm using the idea to replace Microsoft(R) DFSR (Distributed File System Replication), which works similarly to this project. (with the exception of a GUI)

When I started testing bitpocket I noticed that it deleted files unexpectedly and that it also pushed deleted files back. I believe this is because of how it tracks added and deleted files. Although describing why I think that is in a few paragraphs is very difficult.

I've adjusted how bitpocket works so that it assumes that, after the sync is complete, both of the servers have the same contents. Then, when the sync is performed the next time, a file list from both servers is collected and compared against the previous sync listing. A list of locally added and deleted files and remotely deleted files is compiled. These two lists are used for the pull and push to ensure that none of the files is deleted nor resurrected unexpectedly.

The most significant change here is that the stock version did not use any exclude list for the push.

I've also adjusted the backup method to run in parallel with the pull sync. This reduces one of the rsyncs between the servers. In my environment, I have about 500G of data in about 250k files. So each rsync between the servers is time consuming. And what's more, there is a bug in recent versions of rsync such that, when extended attributes are synced (--xattrs), then the sync time goes from about a minute to about 20 minutes.

I can go into more detail for any of the edits or commits but mainly want to start the conversation. I also don't mind rebasing to change parts of it.

Cheers,

Jared Hancock added 6 commits November 30, 2016 17:05
Fixes an issue where a file does not appear as new on the local side but does
not exist on the remote side. When the sync is performed the local file is
removed. If it is re-created locally, it will again not appear as new since the
previous sync marked it as existing before running the sync.

This fix suggests a file listing should be captured of the remote side and
compared with the file listing on the local side. This will help assess files
which appear to have been deleted remotely but still exist locally, as well as
files which exist locally only but do not exist remotely. This comparison can
be performed without comparing current and previous file listings.

It also incorporates several speed improvements, especially where using rsync
to synchronize extended attributes.
This adjusts the matching mechanism to create a list of locally added and
deleted files, and remotely deleted files. Then, when the push and pull are
run, locally added files are excluded from the pull and remotely deleted
files are excluded from the push.

This fixes the issue where remotely deleted files are not pulled, but then,
because they were not excluded, were also pushed back to the remote server.
Therefore, if files were removed remotely, they would appear again during
the sync.
Also, because the `tree-current` includes files which might have been
removed during the sync, the `tree-after` is a better representation of the
state of the tree after the sync. It also includes the locally added files
already and is properly sorted.
This removes one of the trips to the remote server. It does the dry-run of the
pull and analyzes the output. For each file which exists locally, a backup is
made. The filenames are then piped to another instance of rsync which actually
performs the sync.
@greezybacon greezybacon force-pushed the feature/inline-backup-pull branch 2 times, most recently from 0240b92 to c107d19 Compare December 8, 2016 19:59
This removes the usage of the `added-prev` file which included both files
fetched from the remote server in the `pull` process as well as files
created locally during the sync. Instead, it adds and removes filenames of
files added or deleted during the `pull` part of the sync to the `tree-prev`
file. This allows the files which were added in parallel with the `pull` to
be naturally included in the next run of the sync process.

This patch also removes another invocation of `rsync` to collect a local
snapshot of the file system after the sync has completed.
Some directories, such as new empty ones added remotely, would not be pulled
because of a wildcard match mishap. Then, because the directories were not
pulled, the remote directories would be removed in the following push cycle.
@standin000
Copy link

can't reproduce disappearing and reappearing files scenario in master version, How do you meet it? @greezybacon

@greezybacon
Copy link
Collaborator Author

I've also attempted to figure it out. I worked at trying to piece back together what I noticed when bitpocket had troubles. I tried putting together a test to reproduce it. I put this together, which the master branch fails. I think the problems surround the usage of the added-prev file, since it adds exclusions from the following sync. Strangely, my branch fails this test too.

diff --git a/spec/spec_helper.rb b/spec/spec_helper.rb
index 3cc3e83..1142014 100644
--- a/spec/spec_helper.rb
+++ b/spec/spec_helper.rb
@@ -35,6 +35,10 @@ def rm(path)
   FileUtils.rm(path)
 end
 
+def mv(a, b)
+  FileUtils.mv(a, b)
+end
+
 RSpec::Matchers.define :exist do
   match do |filename|
     File.exist?(filename)
diff --git a/spec/sync_spec.rb b/spec/sync_spec.rb
index 080c73e..a7953c9 100644
--- a/spec/sync_spec.rb
+++ b/spec/sync_spec.rb
@@ -75,4 +75,27 @@ describe 'bitpocket sync' do
     local_path('a').should_not exist
     remote_path('a').should_not exist
   end
+
+  it 'handles remote deletes between syncs' do
+    touch remote_path('a/c')
+    touch remote_path('a/f')
+
+    sync.should succeed
+
+    # After the sync, 'c' and 'f' are in 'added-prev', so they are excluded
+    # from the next pull
+
+    rm remote_path('a/c')
+    mkdir remote_path('a/b')
+    mv remote_path('a/f'), remote_path('a/b/f')
+
+    sync.should succeed
+
+    local_path('a/c').should_not exist
+    local_path('a/f').should_not exist
+    local_path('a/b/f').should exist
+
+    remote_path('a/c').should_not exist
+    remote_path('a/f').should_not exist
+  end
 end

I also continue to observe that bitpocket has this side effect where some files seem to get pulled back even if they were locally edited. And sometimes the same thing happes to permissions. If they are changed locally, the old permissions will be pulled. I haven't found a solution to either of these yet.

@greezybacon
Copy link
Collaborator Author

Okay, I was able to get the test to pass on my branch.

@greezybacon greezybacon force-pushed the feature/inline-backup-pull branch 2 times, most recently from 84dc6ba to f38c978 Compare April 7, 2017 19:17
@greezybacon
Copy link
Collaborator Author

Okay, I think most of my problems may have been self-inflicted. Since I was attempting to work around the performance issues, I probably introduced the strange behavior. I was able to fix the syncing of permissions.

If a file was not being changed or added, then the pull() function would not
queue it for rsync. Therefore changes to only change time and permissions would
not be synchronized in the pull() phase and would be replaced in the push()
phase.
The split pull using the inline backup technique neglects extra options to
rsync which may infleuence the files pulled, such as --acls or --xattrs. Such
changes would be reverted in the push phase of the sync.
@greezybacon greezybacon changed the title Fix disappearing and reappearing files, improve performance Improve performance for large folders May 8, 2017
Jared Hancock added 2 commits May 26, 2017 09:32
This patch adds symbolic files into the local and remote file listings. Without
them in the lists, when the lists are compared, locally-added symbolic links
appear to have been removed on the remote server. So, when the pull is run, the
link files are not protected and are removed.
If a folder is received with the --files-from parameter, and the folder name
ends with a trailing slash, then rsync will copy the contents of the folder
recursively. This is not intended in the sync process. Instead, only the stat
information on the folder should be pulled. The files in the folders will be
communicated via separate line items.
@greezybacon greezybacon force-pushed the feature/inline-backup-pull branch 2 times, most recently from 3567bcb to d93b64d Compare August 3, 2017 22:06
@aggsol
Copy link

aggsol commented Aug 4, 2017

Will this ever be merged?

@greezybacon
Copy link
Collaborator Author

@aggsol I don't have the authority to merge this. But I am using bitpocket with these patches for mission-critical data syncing. I don't know if @sickill has any interest in continuing to maintain the product. So perhaps I should apply to become a maintainer...

This patch adds a check before local deletion to inspect if a file to be
deleted was marked as remotely deleted.

This should help a situation where, when a sync runs for a long time (more than
a few seconds), it becomes likely that a new file will be created on the local
or remote system. Such a file could be removed by the sync if it sorts
alphabetically after large files holding the sync up. When the new file is
encountered, it is removed because it was not protected by the exclude list in
`local-add-del`.
@ku1ik
Copy link
Owner

ku1ik commented Aug 15, 2017

Hey @greezybacon, I'm happy to invite you as one of the maintainers of this project 🎉

I don't really have time and motivation to keep this project going, so I hope you and @torfason (if he's still interested / using it) can keep it in good health.

@torfason
Copy link
Collaborator

torfason commented Aug 20, 2017

Hi Jared (@greezybacon) and Marcin (@sickill),

I have not reviewed in detail each the commits that Jared has been submitting, but I read through a number of them. They are thoughtful and in line with what I saw as the key attraction of bitpocket - it being a simple to set up batch synchronizer with as minimal dependencies as possible, and as robust as possible.

It seems to me that Jared is currently the most dedicated user and developer of bitpocket, and I would vote to make him maintainer. It would be great to see this product continue to be improved.

Best,
Magnus

ps. I don't see myself contributing significantly to the project in the near future, so if Jared takes on the mantle, it seems it would be in the role of lead maintainer, which would be great from my perspective.

@ku1ik
Copy link
Owner

ku1ik commented Aug 20, 2017

Go for it @greezybacon! ;)

@greezybacon
Copy link
Collaborator Author

greezybacon commented Aug 23, 2017

The issues that two of the three last commits fixed exist in the master branch, I believe. I'll merge this branch and continue to track any issues I notice as new issues. I'll also go through the existing issues and pull requests to see which ones still apply or need to be rebased.

@greezybacon greezybacon merged commit 007d326 into ku1ik:master Aug 23, 2017
@aggsol
Copy link

aggsol commented Aug 24, 2017

If you feel up to the task, removing the "maintainers needed" notice from the readme won't keep new users away anymore ;-)

@ku1ik
Copy link
Owner

ku1ik commented Aug 24, 2017

@aggsol thanks for pointing this out. I've just removed this notice from README.

@greezybacon greezybacon deleted the feature/inline-backup-pull branch February 7, 2018 16:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants