Add directory tracking to sync #425

Merged
pietern merged 15 commits into main from sync-delete-empty-directories
Jun 12, 2023

Contributor

@pietern pietern commented Jun 1, 2023

Changes

This change replaces usage of the repofiles package with the filer package to consolidate WSFS code paths.

The repofiles package implemented the following behavior. If a file at foo/bar.txt was created and then removed, the directory foo was kept around because sync did not perform directory tracking. If a file at foo was subsequently created, the write failed with fs.ErrExist because it is impossible to overwrite a directory. When this happened, the package performed a recursive delete of the path and retried the file write.

To make this use case work without resorting to a recursive delete on conflict, we need to implement directory tracking as part of sync. The approach in this commit is as follows:

  1. Maintain the set of directories needed for the current set of files and compare it to the set needed for the previous set of files. The difference yields a mkdir for each added directory and an rmdir for each removed directory.
  2. Creation of new directories should happen prior to writing files. Otherwise, many file writes may race to create the same parent directories, resulting in additional API calls. Removal of existing directories should happen after removing files.
  3. Making new directories can be deduped across common prefixes where only the longest prefix is created recursively.
  4. Removing existing directories must happen sequentially, starting with the longest prefix.
  5. Removal of directories is a best effort. It fails only if the directory is not empty, and if this happens we know something placed a file or directory manually, outside of sync.

Tests

  • Existing integration tests pass (modified where they used to assert that directories weren't cleaned up)
  • New integration test to confirm the inability to remove a directory doesn't fail the sync run

This deprecates usage of the `repofiles` package in favor
of the filer package and consolidates the code paths into WSFS.

Note: one potentially breaking change here is the following.
If a file at `foo/bar.txt` is created and removed, the directory
`foo` is kept around because we do not perform directory tracking.
If subsequently we need to write a file at `foo`, it will result
in an `fs.ErrExist` because it is impossible to overwrite a directory.

The previous implementation performed a recursive delete of the path
if this happened, where this implementation will return the `fs.ErrExist`
error to the user.

We can mitigate this in one of two ways:
* Track directories to remove as part of a `diff` and remove them
* Attempt to remove an empty directory tree if we see this error
* ...?

Sync currently doesn't clean up remote empty directories.

This change computes the set of directories that were removed
in an incremental update and removes those as well.
Contributor

@shreyas-goenka shreyas-goenka left a comment

Looks good to me, just needs integration tests for two cases:

  1. We delete empty directory trees on the workspace
  2. We do not delete if the directory tree is not empty (ie has a file in it)

Contributor

@fjakobs fjakobs left a comment

Let's try this out

@pietern pietern changed the base branch from sync-use-filer to main June 5, 2023 22:30
@pietern pietern mentioned this pull request Jun 6, 2023
@pietern pietern changed the title Delete directories when they become empty Use filer in sync command Jun 6, 2023
@pietern pietern changed the title Use filer in sync command Add directory tracking to sync Jun 6, 2023
@pietern pietern marked this pull request as ready for review June 6, 2023 09:27
Review thread on libs/sync/diff.go (resolved)
Contributor

@shreyas-goenka shreyas-goenka left a comment

Looks good, just some minor suggestions

Review thread on libs/sync/snapshot.go (outdated, resolved)
Review thread on libs/sync/diff_test.go (resolved)
Contributor Author

pietern commented Jun 9, 2023

@shreyas-goenka Could you take a look at the dirset*.go files?

Contributor

@shreyas-goenka shreyas-goenka left a comment

LGTM

@pietern pietern enabled auto-merge (squash) June 12, 2023 11:38
@pietern pietern merged commit 16bb224 into main Jun 12, 2023
@pietern pietern deleted the sync-delete-empty-directories branch June 12, 2023 11:47
@pietern pietern mentioned this pull request Jun 12, 2023
pietern added a commit that referenced this pull request Jun 12, 2023
## Changes

CLI:
* Add directory tracking to sync
([#425](#425)).
* Add fs cat command for dbfs files
([#430](#430)).
* Add fs ls command for dbfs
([#429](#429)).
* Add fs mkdirs command for dbfs
([#432](#432)).
* Add fs rm command for dbfs
([#433](#433)).
* Add installation instructions
([#458](#458)).
* Add new line to cmdio JSON rendering
([#443](#443)).
* Add profile on `databricks auth login`
([#423](#423)).
* Add readable console logger
([#370](#370)).
* Add workspace export-dir command
([#449](#449)).
* Added secrets input prompt for secrets put-secret command
([#413](#413)).
* Added spinner when loading command prompts
([#420](#420)).
* Better error message if can not load prompts
([#437](#437)).
* Changed service template to correctly handle required positional
arguments ([#405](#405)).
* Do not generate prompts for certain commands
([#438](#438)).
* Do not prompt for List methods
([#411](#411)).
* Do not use FgWhite and FgBlack for terminal output
([#435](#435)).
* Skip path translation of job task for jobs with a Git source
([#404](#404)).
* Tweak profile prompt
([#454](#454)).
* Update with the latest Go SDK
([#457](#457)).
* Use cmdio in version command for `--output` flag
([#419](#419)).

Bundles:
* Check for nil environment before accessing it
([#453](#453)).

Dependencies:
* Bump github.com/hashicorp/terraform-json from 0.16.0 to 0.17.0
([#459](#459)).
* Bump github.com/mattn/go-isatty from 0.0.18 to 0.0.19
([#412](#412)).

Internal:
* Add Mkdir and ReadDir functions to filer.Filer interface
([#414](#414)).
* Add Stat function to filer.Filer interface
([#421](#421)).
* Add check for path is a directory in filer.ReadDir
([#426](#426)).
* Add fs.FS adapter for the filer interface
([#422](#422)).
* Add implementation of filer.Filer for local filesystem
([#460](#460)).
* Allow equivalence checking of filer errors to fs errors
([#416](#416)).
* Fix locker integration test
([#417](#417)).
* Implement DBFS filer
([#139](#139)).
* Include recursive deletion in filer interface
([#442](#442)).
* Make filer.Filer return fs.DirEntry from ReadDir
([#415](#415)).
* Speed up sync integration tests
([#428](#428)).
3 participants