Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GSPath: Implement sliced object downloads #387

Closed
joconnor-ecaa opened this issue Dec 18, 2023 · 7 comments · Fixed by #389
Closed

GSPath: Implement sliced object downloads #387

joconnor-ecaa opened this issue Dec 18, 2023 · 7 comments · Fixed by #389

Comments

@joconnor-ecaa
Copy link
Contributor

Sliced downloads are beneficial in cases of downloading a small number of large files. GCP documentation here with example code using the cloud storage python client:

https://cloud.google.com/storage/docs/sliced-object-downloads

Unsure if this is within scope of cloudpathlib but such a feature would be useful.

@pjbull
Copy link
Member

pjbull commented Dec 20, 2023

Thanks @joconnor-ecaa. I think we'd take a PR that added a download_chunks_concurrently_kwargs as a parameter when initializing GSClient that defaults to None, which we store as private property on the GSClient object. Then if that property is set, when we run _download_file we check and use the sliced object downloads, passing those kwargs.

Our docstring should link to the method docs that have the full list of kwargs:
https://cloud.google.com/python/docs/reference/storage/latest/google.cloud.storage.transfer_manager#google_cloud_storage_transfer_manager_download_chunks_concurrently

@joconnor-ecaa
Copy link
Contributor Author

joconnor-ecaa commented Dec 21, 2023

Hi @pjbull and thanks for the advice! I've implemented this feature as suggested but am running into trouble with the local tests. Have opened a PR #389 in draft for visibility.

Turns out that storage_manager pickles the client object, which fails when running local tests with a MockClient.

AttributeError: Can't pickle local object 'mocked_client_class_factory.<locals>.MockClient'

I don't know too much about pickle so if anyone can help with the error that would be great. If not, options are either to run this test only on the live cloud servers (not sure when this is appropriate) or to put this PR on hold.

@pjbull
Copy link
Member

pjbull commented Dec 21, 2023

@joconnor-ecaa can you put the whole stack trace so I can see how we end up getting there?

@joconnor-ecaa
Copy link
Contributor Author

Sure thing:

tests/test_gs_specific.py:23: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
cloudpathlib/cloudpath.py:893: in download_to
    return self.client._download_file(self, destination)
cloudpathlib/gs/gsclient.py:129: in _download_file
    transfer_manager.download_chunks_concurrently(
venv/lib/python3.11/site-packages/google/cloud/storage/transfer_manager.py:895: in download_chunks_concurrently
    maybe_pickled_blob = _pickle_client(blob) if needs_pickling else blob
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

obj = <tests.mock_clients.mock_gs.MockBlob object at 0x11d96e7d0>

    def _pickle_client(obj):
        """Pickle a Client or an object that owns a Client (like a Blob)"""
    
        # We need a custom pickler to process Client objects, which are attached to
        # Buckets (and therefore to Blobs in turn). Unfortunately, the Python
        # multiprocessing library doesn't seem to have a good way to use a custom
        # pickler, and using copyreg will mutate global state and affect code
        # outside of the client library. Instead, we'll pre-pickle the object and
        # pass the bytestring in.
        f = io.BytesIO()
        p = pickle.Pickler(f)
        p.dispatch_table = copyreg.dispatch_table.copy()
        p.dispatch_table[Client] = _reduce_client
>       p.dump(obj)
E       AttributeError: Can't pickle local object 'mocked_client_class_factory.<locals>.MockClient'

venv/lib/python3.11/site-packages/google/cloud/storage/transfer_manager.py:1318: AttributeError

I managed to follow the thread a little. storage_manager needs to pickle the client when using worker_type="process" as noted e.g. here. The test passes if you pass worker_type="thread" via download_chunks_concurrently_kwargs.

@pjbull
Copy link
Member

pjbull commented Dec 22, 2023

Thanks, that's helpful. We shouldn't be hitting the actual transfer_manager code in this case, since these are the local mocked tests that don't communicate with the backend.

Instead, we need to add a patch for transfer_manager like we do for StorageClient.

You can add a mock_transfer_manager class in mock_gs.py that just implements the one method we use as an @staticmethod and just have it do the local copy.

When the test suite is run against the live backends, the actual codepath here will get tested.

@pjbull
Copy link
Member

pjbull commented Jan 4, 2024

Fixed in #389, thanks @joconnor-ecaa

@pjbull pjbull closed this as completed Jan 4, 2024
@pjbull
Copy link
Member

pjbull commented Feb 26, 2024

Thanks @joconnor-ecaa! This is now released in the latest version of cloudpathlib on PyPI (0.18.0).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants