Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up non recursive list_dir for google cloud storage #318

Conversation

vlaminckaxel
Copy link
Contributor

The non-recursive implementation of list_dir for google cloud was slow.
The issue was already stated in the code that we just filtered out the top-level files and dirs out of al the blobs.

This is fixed by setting the delimiter='/' when listing a directory non-recursive.

Perf results

Before

┏━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ Test Name      ┃ Config Name                ┃ Iterations ┃           Mean ┃              Std ┃            Max ┃ N Items ┃
┡━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━┩
│ List Folders   │ List shallow recursive     │         10 │ 0:00:01.754618 │ ± 0:00:00.043073 │ 0:00:01.854687 │   5,500 │
│ List Folders   │ List shallow non-recursive │         10 │ 0:00:01.839951 │ ± 0:00:00.146716 │ 0:00:02.194600 │   5,500 │
│ List Folders   │ List normal recursive      │         10 │ 0:00:02.670551 │ ± 0:00:00.142644 │ 0:00:02.984851 │   7,877 │
│ List Folders   │ List normal non-recursive  │         10 │ 0:00:02.568673 │ ± 0:00:00.037830 │ 0:00:02.609728 │     113 │
│ List Folders   │ List deep recursive        │         10 │ 0:00:06.076180 │ ± 0:00:00.893733 │ 0:00:07.018396 │   7,955 │
│ List Folders   │ List deep non-recursive    │         10 │ 0:00:05.576774 │ ± 0:00:00.790192 │ 0:00:07.463698 │      31 │
│ Glob scenarios │ Glob shallow recursive     │         10 │ 0:00:01.943264 │ ± 0:00:00.073614 │ 0:00:02.100214 │   5,500 │
│ Glob scenarios │ Glob shallow non-recursive │         10 │ 0:00:02.041357 │ ± 0:00:00.409244 │ 0:00:03.196892 │   5,500 │
│ Glob scenarios │ Glob normal recursive      │         10 │ 0:00:02.708558 │ ± 0:00:00.059786 │ 0:00:02.772853 │   7,272 │
│ Glob scenarios │ Glob normal non-recursive  │         10 │ 0:00:02.355530 │ ± 0:00:00.033168 │ 0:00:02.397684 │      12 │
│ Glob scenarios │ Glob deep recursive        │         10 │ 0:00:08.697164 │ ± 0:00:02.117151 │ 0:00:13.550607 │   7,650 │
│ Glob scenarios │ Glob deep non-recursive    │         10 │ 0:00:10.489393 │ ± 0:00:02.967771 │ 0:00:15.646034 │      25 │
└────────────────┴────────────────────────────┴────────────┴────────────────┴──────────────────┴────────────────┴─────────┘

After

┏━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ Test Name      ┃ Config Name                ┃ Iterations ┃           Mean ┃              Std ┃            Max ┃ N Items ┃
┡━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━┩
│ List Folders   │ List shallow recursive     │         10 │ 0:00:01.782473 │ ± 0:00:00.079213 │ 0:00:01.978639 │   5,500 │
│ List Folders   │ List shallow non-recursive │         10 │ 0:00:01.658828 │ ± 0:00:00.055595 │ 0:00:01.796994 │   5,500 │
│ List Folders   │ List normal recursive      │         10 │ 0:00:02.438342 │ ± 0:00:00.086166 │ 0:00:02.603272 │   7,877 │
│ List Folders   │ List normal non-recursive  │         10 │ 0:00:00.172392 │ ± 0:00:00.008785 │ 0:00:00.191022 │     113 │
│ List Folders   │ List deep recursive        │         10 │ 0:00:08.002241 │ ± 0:00:01.882228 │ 0:00:10.528643 │   7,955 │
│ List Folders   │ List deep non-recursive    │         10 │ 0:00:00.168641 │ ± 0:00:00.007435 │ 0:00:00.179845 │      31 │
│ Glob scenarios │ Glob shallow recursive     │         10 │ 0:00:01.811696 │ ± 0:00:00.034521 │ 0:00:01.851655 │   5,500 │
│ Glob scenarios │ Glob shallow non-recursive │         10 │ 0:00:01.825440 │ ± 0:00:00.080909 │ 0:00:01.984854 │   5,500 │
│ Glob scenarios │ Glob normal recursive      │         10 │ 0:00:02.665863 │ ± 0:00:00.068024 │ 0:00:02.777973 │   7,272 │
│ Glob scenarios │ Glob normal non-recursive  │         10 │ 0:00:00.172190 │ ± 0:00:00.006265 │ 0:00:00.189009 │      12 │
│ Glob scenarios │ Glob deep recursive        │         10 │ 0:00:07.589083 │ ± 0:00:01.563928 │ 0:00:11.491544 │   7,650 │
│ Glob scenarios │ Glob deep non-recursive    │         10 │ 0:00:00.169732 │ ± 0:00:00.006388 │ 0:00:00.179240 │      25 │
└────────────────┴────────────────────────────┴────────────┴────────────────┴──────────────────┴────────────────┴─────────┘

@vlaminckaxel vlaminckaxel marked this pull request as draft February 6, 2023 20:08
@vlaminckaxel vlaminckaxel marked this pull request as ready for review February 6, 2023 22:42
cloudpathlib/gs/gsclient.py Show resolved Hide resolved
cloudpathlib/gs/gsclient.py Show resolved Hide resolved
tests/performance/cli.py Show resolved Hide resolved
tests/mock_clients/mock_gs.py Show resolved Hide resolved
tests/mock_clients/mock_gs.py Show resolved Hide resolved
@pjbull
Copy link
Member

pjbull commented Feb 9, 2023

Thanks @vlaminckaxel! Looks like a great speedup here. Let me know if you'll have a chance to tackle the code review comments, it would be great to get this in for our next release in a few days.

@pjbull pjbull changed the base branch from master to 318-local-gs-speedup February 15, 2023 23:52
@pjbull
Copy link
Member

pjbull commented Feb 15, 2023

Merging this into a local branch to do code review tweaks and run live server tests. Thanks @vlaminckaxel!

@pjbull pjbull merged commit c7f9cbc into drivendataorg:318-local-gs-speedup Feb 15, 2023
pjbull added a commit that referenced this pull request Feb 16, 2023
…cloud storage (#321)

* Speed up non recursive list_dir for google cloud storage (#318)

* use correct env vars for gs, azure bucket perf tests

* speedup non recursive _list_dir for gs

* linting

* make gs mock compatible with delimiter

* linting

* Code review changes

* revert dirs first

---------

Co-authored-by: axel.vlaminck <axel.vlaminck@gmail.com>
@pjbull
Copy link
Member

pjbull commented Feb 17, 2023

@vlaminckaxel this got shipped in v0.13.0, thanks! Test it out when you get a chance.

@vlaminckaxel
Copy link
Contributor Author

@vlaminckaxel this got shipped in v0.13.0, thanks! Test it out when you get a chance.

Works like a charm 👍

@pjbull pjbull mentioned this pull request Sep 22, 2023
12 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants