Improve performance for large folders #59
Conversation
Fixes an issue where a file exists locally but does not appear as new, while it does not exist on the remote side. When the sync is performed, the local file is removed. If it is re-created locally, it will again not appear as new, since the previous sync marked it as existing. The fix captures a file listing of the remote side and compares it with the file listing on the local side. This helps assess files which appear to have been deleted remotely but still exist locally, as well as files which exist only locally and do not exist remotely. This comparison can be performed without comparing current and previous file listings. The change also incorporates several speed improvements, especially where rsync is used to synchronize extended attributes.
This adjusts the matching mechanism to create lists of locally added files, locally deleted files, and remotely deleted files. When the push and pull are run, locally added files are excluded from the pull and remotely deleted files are excluded from the push. This fixes the issue where remotely deleted files were not pulled but, because they were not excluded, were pushed back to the remote server; files removed remotely would therefore reappear after the sync.
Also, because the `tree-current` includes files which might have been removed during the sync, the `tree-after` is a better representation of the state of the tree after the sync. It also includes the locally added files already and is properly sorted.
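The list matching described above can be sketched with `comm(1)` on sorted snapshots. This is a minimal illustration, not bitpocket's actual code; `tree-prev` is the name used in this thread, while `local-current`, `remote-current`, and the three output list names are assumed for the example.

```shell
#!/bin/sh
# Sketch: derive change lists by comparing the snapshot recorded at the end
# of the last sync (tree-prev) against fresh local and remote listings.
# All file names below are illustrative.
set -e
work=$(mktemp -d)
cd "$work"

printf '%s\n' a b c > tree-prev        # state after the last sync
printf '%s\n' a c d > local-current    # 'b' deleted locally, 'd' added
printf '%s\n' a b   > remote-current   # 'c' deleted remotely

# comm expects sorted input; the printf lines above are already sorted.
comm -13 tree-prev local-current  > local-added    # only in local-current
comm -23 tree-prev local-current  > local-deleted  # only in tree-prev
comm -23 tree-prev remote-current > remote-deleted # gone on the remote side
```

With these lists, `local-added` can be fed to the pull as an exclude list and `remote-deleted` to the push, which is the protection mechanism this patch introduces.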
This removes one of the trips to the remote server. It performs a dry run of the pull and analyzes the output. For each file which exists locally, a backup is made. The filenames are then piped to another instance of rsync which actually performs the sync.
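The inline-backup idea can be sketched as follows. The dry-run output here is canned; in a real run it would come from something like `rsync -a --dry-run --out-format='%n' remote: .`, and the `.bitpocket/backup` path and flat file names are assumptions for the example.

```shell
#!/bin/sh
# Sketch: parse a (canned) dry-run pull listing, back up each file that
# already exists locally, and collect the names for the real transfer.
# Assumes flat file names (no subdirectories) for brevity.
set -e
work=$(mktemp -d)
cd "$work"
mkdir -p .bitpocket/backup

printf 'old\n' > a.txt          # exists locally: will be backed up
dryrun_output='a.txt
b.txt'

printf '%s\n' "$dryrun_output" | while IFS= read -r f; do
    [ -f "$f" ] && cp -p "$f" ".bitpocket/backup/$f"
    printf '%s\n' "$f"
done > files-to-pull
# The collected names would then feed the real transfer:
#   rsync -a --files-from=files-to-pull remote: .
```

This way the backup happens while the pull list is being built, instead of in a separate rsync pass against the remote.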
Force-pushed from 0240b92 to c107d19.
This removes the usage of the `added-prev` file which included both files fetched from the remote server in the `pull` process as well as files created locally during the sync. Instead, it adds and removes filenames of files added or deleted during the `pull` part of the sync to the `tree-prev` file. This allows the files which were added in parallel with the `pull` to be naturally included in the next run of the sync process. This patch also removes another invocation of `rsync` to collect a local snapshot of the file system after the sync has completed.
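Maintaining `tree-prev` in place, as described above, amounts to removing the names deleted during the pull and merging in the names that were added. A minimal sketch, with illustrative list names (`pull-added`, `pull-deleted` are assumptions):

```shell
#!/bin/sh
# Sketch: update tree-prev incrementally during the pull instead of
# re-scanning the file system with another rsync afterwards.
set -e
work=$(mktemp -d)
cd "$work"

printf '%s\n' a b c > tree-prev
printf '%s\n' d     > pull-added    # fetched during the pull
printf '%s\n' b     > pull-deleted  # removed during the pull

# Drop deleted names, merge added ones, and keep the result sorted so the
# next sync can diff against it directly.
grep -vxFf pull-deleted tree-prev | sort -u - pull-added > tree-prev.new
mv tree-prev.new tree-prev
```

Files created locally in parallel with the pull are simply absent from `tree-prev`, so the next sync naturally classifies them as locally added.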
Force-pushed from 0f65a78 to 0812d40.
Some directories, such as new empty ones added remotely, would not be pulled because of a wildcard match mishap. Then, because the directories were not pulled, the remote directories would be removed in the following push cycle.
Force-pushed from 0812d40 to 5bb793c.
I can't reproduce the disappearing and reappearing files scenario in the master version. How did you encounter it? @greezybacon
I've also attempted to figure it out. I tried to piece back together what I noticed when bitpocket had trouble, and to put together a test that reproduces it. Here is what I came up with:
diff --git a/spec/spec_helper.rb b/spec/spec_helper.rb
index 3cc3e83..1142014 100644
--- a/spec/spec_helper.rb
+++ b/spec/spec_helper.rb
@@ -35,6 +35,10 @@ def rm(path)
FileUtils.rm(path)
end
+def mv(a, b)
+ FileUtils.mv(a, b)
+end
+
RSpec::Matchers.define :exist do
match do |filename|
File.exist?(filename)
diff --git a/spec/sync_spec.rb b/spec/sync_spec.rb
index 080c73e..a7953c9 100644
--- a/spec/sync_spec.rb
+++ b/spec/sync_spec.rb
@@ -75,4 +75,27 @@ describe 'bitpocket sync' do
local_path('a').should_not exist
remote_path('a').should_not exist
end
+
+ it 'handles remote deletes between syncs' do
+ touch remote_path('a/c')
+ touch remote_path('a/f')
+
+ sync.should succeed
+
+ # After the sync, 'c' and 'f' are in 'added-prev', so they are excluded
+ # from the next pull
+
+ rm remote_path('a/c')
+ mkdir remote_path('a/b')
+ mv remote_path('a/f'), remote_path('a/b/f')
+
+ sync.should succeed
+
+ local_path('a/c').should_not exist
+ local_path('a/f').should_not exist
+ local_path('a/b/f').should exist
+
+ remote_path('a/c').should_not exist
+ remote_path('a/f').should_not exist
+ end
end

I also continue to observe that bitpocket has a side effect where some files seem to get pulled back even if they were locally edited. Sometimes the same thing happens to permissions: if they are changed locally, the old permissions will be pulled. I haven't found a solution to either of these yet.
Okay, I was able to get the test to pass on my branch.
Force-pushed from 84dc6ba to f38c978.
Okay, I think most of my problems may have been self-inflicted. Since I was attempting to work around the performance issues, I probably introduced the strange behavior. I was able to fix the syncing of permissions.
If a file was not being changed or added, the pull() function would not queue it for rsync. Therefore, changes affecting only timestamps and permissions would not be synchronized in the pull() phase and would be reverted in the push() phase.
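The distinction can be seen in rsync's `--itemize-changes` output: lines beginning with `>` transfer data, while lines beginning with `.` update only attributes. A sketch of the idea, using canned itemize-style output rather than a real rsync run (the file names and filter are illustrative, not bitpocket's actual code):

```shell
#!/bin/sh
# Sketch: attribute-only updates in rsync --itemize-changes output start
# with '.'; a filter that only keeps '>' lines misses them, so timestamp-
# and permission-only changes never get queued for the pull.
set -e
work=$(mktemp -d)
cd "$work"

cat > itemized <<'EOF'
>f+++++++++ new.txt
>f.st...... changed.txt
.f...p..... perms-only.txt
.d..t...... dir-time-only/
EOF

# Transfer-only filter: misses attribute-only updates.
awk '/^>/ { print $2 }' itemized > pull-transfer-only
# Also keeping '.'-prefixed lines queues permission/time changes too.
awk '/^[>.]/ { print $2 }' itemized > files-to-pull
```

Without `perms-only.txt` in the pull queue, its locally changed permissions would survive the pull and then be clobbered back in the push, which matches the symptom described above.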
Force-pushed from f38c978 to 91adea8.
The split pull using the inline backup technique neglects extra rsync options which may influence the files pulled, such as --acls or --xattrs. Such changes would be reverted in the push phase of the sync.
This patch adds symbolic links to the local and remote file listings. Without them in the lists, locally added symbolic links appear, when the lists are compared, to have been removed on the remote server. So, when the pull is run, the link files are not protected and are removed.
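Building a listing that includes symlinks is a one-line change to the `find` invocation. A minimal sketch (the `tree-current` name comes from this thread; the directory layout is illustrative):

```shell
#!/bin/sh
# Sketch: list regular files AND symbolic links, so that a locally created
# link does not look "remotely deleted" when the lists are compared.
set -e
work=$(mktemp -d)
cd "$work"

mkdir -p dir
touch dir/file
ln -s file dir/link

# find does not follow symlinks by default, so -type l matches the link
# itself; sort the output for later comparison with comm(1).
( cd dir && find . \( -type f -o -type l \) | sort ) > tree-current
```

With `./link` present in both snapshots, the comparison no longer classifies it as deleted on the other side.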
If a folder is received via the --files-from parameter and the folder name ends with a trailing slash, rsync will copy the contents of the folder recursively. This is not intended in the sync process; instead, only the stat information for the folder should be pulled. The files inside the folder are communicated via separate line items.
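Stripping the trailing slash before handing the list to rsync avoids the recursive copy. A minimal sketch (list file names are assumptions for the example):

```shell
#!/bin/sh
# Sketch: normalize a --files-from list so directory entries carry no
# trailing slash; rsync then transfers only the directory entry and its
# stat info, while the files inside arrive as their own line items.
set -e
work=$(mktemp -d)
cd "$work"

printf '%s\n' 'a/b/' 'a/b/f' 'c/' > files-from-raw
sed 's,/$,,' files-from-raw > files-from
# The cleaned list would then be used as:
#   rsync -a --files-from=files-from remote: .
```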
Force-pushed from 5f539a0 to 7a0f69d.
Force-pushed from 3567bcb to d93b64d.
Will this ever be merged?
This patch adds a check before local deletion to inspect whether a file to be deleted was marked as remotely deleted. This helps in a situation where a sync runs for a long time (more than a few seconds) and it becomes likely that a new file is created on the local or remote system during the run. Such a file could be removed by the sync if it sorts alphabetically after the large files holding the sync up: when the new file is encountered, it is removed because it was not protected by the exclude list in `local-add-del`.
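The guard can be sketched as a membership test against the remotely-deleted list before each local removal. File and list names here are illustrative, not bitpocket's actual code:

```shell
#!/bin/sh
# Sketch: only delete a local file if the remote side is confirmed to have
# deleted it; files that appeared mid-sync are in no list and must survive.
set -e
work=$(mktemp -d)
cd "$work"

touch stale.txt fresh.txt
printf '%s\n' stale.txt > remote-deleted  # confirmed deleted remotely

for f in stale.txt fresh.txt; do
    if grep -qxF "$f" remote-deleted; then
        rm "$f"    # safe: the remote really deleted it
    else
        echo "keeping $f (not marked as remotely deleted)"
    fi
done
```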
Force-pushed from d93b64d to ac0622c.
Hey @greezybacon, I'm happy to invite you as one of the maintainers of this project 🎉 I don't really have time and motivation to keep this project going, so I hope you and @torfason (if he's still interested / using it) can keep it in good health.
Hi Jared (@greezybacon) and Marcin (@sickill), I have not reviewed each of the commits that Jared has been submitting in detail, but I read through a number of them. They are thoughtful and in line with what I saw as the key attraction of bitpocket: a batch synchronizer that is simple to set up, with as few dependencies as possible, and as robust as possible. It seems to me that Jared is currently the most dedicated user and developer of bitpocket, and I would vote to make him maintainer. It would be great to see this product continue to be improved. Best,

P.S. I don't see myself contributing significantly to the project in the near future, so if Jared takes on the mantle, it seems it would be in the role of lead maintainer, which would be great from my perspective.
Go for it @greezybacon! ;)
I believe the issues fixed by two of the last three commits also exist in the master branch. I'll merge this branch and continue to track any issues I notice as new issues. I'll also go through the existing issues and pull requests to see which ones still apply or need to be rebased.
If you feel up to the task, removing the "maintainers needed" notice from the readme won't keep new users away anymore ;-)
@aggsol thanks for pointing this out. I've just removed this notice from the README.
First of all, thank you for the project. I think it's very promising and very elegantly implemented.
I'll offer my use case first, as it might differ from others' usage of the project. I have two servers, call them A and B. They are network file servers and should accept writes from any client at any time. My goal is to use `bitpocket` to synchronize the writes between the two boxes. And yes, I'm aware that a cluster file system with distributed locking might be a better option; however, I think that's overly complicated. In our environment, folks don't generally edit the same file at the same time, and a 10-minute lag between the servers is acceptable. I'm using this to replace Microsoft(R) DFSR (Distributed File System Replication), which works similarly to this project (with the exception of a GUI).

When I started testing `bitpocket`, I noticed that it deleted files unexpectedly and that it also pushed deleted files back. I believe this is because of how it tracks added and deleted files, although explaining why I think that in a few paragraphs is very difficult.

I've adjusted how `bitpocket` works so that it assumes that, after a sync completes, both servers have the same contents. When the next sync is performed, a file list is collected from both servers and compared against the listing from the previous sync. From this, lists of locally added files, locally deleted files, and remotely deleted files are compiled. These lists are used during the pull and push to ensure that no file is deleted or resurrected unexpectedly. The most significant change here is that the stock version did not use any exclude list for the push.

I've also adjusted the backup method to run in parallel with the pull sync. This removes one of the rsync passes between the servers. In my environment, I have about 500G of data in about 250k files, so each rsync pass between the servers is time-consuming. What's more, there is a bug in recent versions of rsync such that, when extended attributes are synced (`--xattrs`), the sync time goes from about a minute to about 20 minutes.

I can go into more detail on any of the edits or commits, but I mainly want to start the conversation. I also don't mind rebasing to change parts of it.
Cheers,