
Adding GCS artifact storage capabilities. #152

Merged · 13 commits merged into mlflow:master on Jul 17, 2018

Conversation

bnekolny
Contributor

No description provided.

@bnekolny
Contributor Author

I still need to add some tests (working through the nuances of there being no moto equivalent for GCS).

We're using GCS for storage, so is there an appetite for this being merged into master?

Add google-cloud-storage as a dependency.

Fixing a couple bugs with the GCS store.
@codecov-io

codecov-io commented Jul 13, 2018

Codecov Report

Merging #152 into master will decrease coverage by 0.24%.
The diff coverage is 83.82%.


@@            Coverage Diff             @@
##           master     #152      +/-   ##
==========================================
- Coverage   49.79%   49.54%   -0.25%     
==========================================
  Files          89       89              
  Lines        4322     4503     +181     
==========================================
+ Hits         2152     2231      +79     
- Misses       2170     2272     +102
Impacted Files Coverage Δ
mlflow/store/artifact_repo.py 89.77% <83.82%> (-3.75%) ⬇️
mlflow/sagemaker/__init__.py 16.6% <0%> (-0.87%) ⬇️
mlflow/sagemaker/container/__init__.py 0% <0%> (ø) ⬆️
mlflow/sagemaker/cli.py 76.19% <0%> (+1.9%) ⬆️

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a4787a3...00ec0be. Read the comment docs.

@mateiz
Contributor

mateiz commented Jul 14, 2018

This definitely sounds like a good idea as long as you provide a way to test it. It's not a lot of code, and it will obviously help a lot of users.

One question about the URI scheme: is the gs:// scheme also recognized for Google Cloud Storage in TensorFlow, Spark, and other systems that support writing to cloud storage? Or are multiple schemes used? In the latter case, we'll probably want to choose the most common one because we want to make it possible for jobs to write directly to the artifact URI for a run without writing to a local file first (especially for large datasets). We already have this issue with S3, where Hadoop/Spark use s3a or s3n, and we plan to do the conversion automatically there.
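The automatic scheme conversion described above could be sketched roughly as below. This is purely illustrative: the helper name `normalize_artifact_uri` and the alias table are hypothetical, not MLflow's actual API.

```python
from urllib.parse import urlparse, urlunparse

# Hypothetical mapping: Hadoop-style S3 scheme variants normalized to the
# canonical scheme an artifact store might use.
_SCHEME_ALIASES = {"s3a": "s3", "s3n": "s3"}


def normalize_artifact_uri(uri):
    """Rewrite known scheme aliases (e.g. s3a://, s3n://) to a canonical form,
    leaving unrecognized schemes such as gs:// untouched."""
    parsed = urlparse(uri)
    scheme = _SCHEME_ALIASES.get(parsed.scheme, parsed.scheme)
    return urlunparse(parsed._replace(scheme=scheme))
```

With a table like this, jobs could write directly to an artifact URI regardless of which scheme variant their framework emits.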

@bnekolny
Contributor Author

@mateiz the gs:// scheme is the standard reference for GCS, much like s3://. There are different HTTP URLs, such as the S3 equivalent of s3.amazonaws.com, but I believe gs:// is more appropriate for a storage mechanism.

I added some tests, but let me know your thoughts. We're running on GCP and are already using a fork with this code in it, but are hoping to get it merged into mainline.
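As an aside on the gs:// scheme discussed above, splitting such a URI into bucket and object path is straightforward with the standard library. A minimal sketch (the helper name is hypothetical, not MLflow's actual parser):

```python
from urllib.parse import urlparse


def parse_gcs_uri(uri):
    """Split a gs:// URI into (bucket, object_path); reject other schemes."""
    parsed = urlparse(uri)
    if parsed.scheme != "gs":
        raise ValueError("Not a GCS URI: %s" % uri)
    # netloc is the bucket; strip the leading "/" from the object path.
    return parsed.netloc, parsed.path.lstrip("/")
```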

@smurching
Collaborator

Ah this is awesome, thanks @bnekolny - giving it a look now :)


@smurching smurching left a comment


Thanks a ton for the PR @bnekolny, this will definitely be helpful for users - leaving a few comments :)



class GCSArtifactRepository(ArtifactRepository):
"""Stores artifacts on Google Cloud Storage.
Collaborator

Thanks for adding docs, would you be able to update the Storage subsection of the tracking docs to (a) mention GCS as a storage option and (b) include a link to the GCS auth docs? Specifically, we should update this file - thanks!

Contributor Author

I added docs here: databricks@e387bd0

        infos = []
        prefix = dest_path + "/"

        results = self.gcs.Client().get_bucket(bucket).list_blobs(prefix=prefix)
Collaborator

Was a little confused whether list_blobs returns all the blobs in the bucket - the GCS docs suggest that the page_token argument is required:

page_token (str) – (Optional) Opaque marker for the next “page” of blobs. If not passed, will return the first page of blobs.

However based on this SO post it seems passing page_token isn't necessary. Would you happen to know for sure one way or the other?

Contributor Author

I can confirm that passing no page_token does return all the objects. I haven't looked at the implementation, but I've gotten >50k objects back from the Python SDK without using page tokens.
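The behavior described above is consistent with an iterator that paginates transparently: page tokens are consumed internally, so callers never see them. A generic sketch of that pattern (this is an illustration of the concept, not the google-cloud-storage implementation):

```python
def iterate_pages(fetch_page):
    """Chain paged results into one flat iterator.

    fetch_page(token) returns (items, next_token); next_token is None on the
    last page. Callers iterate items without ever handling tokens, which is
    why passing page_token explicitly isn't necessary.
    """
    token = None
    while True:
        items, token = fetch_page(token)
        for item in items:
            yield item
        if token is None:
            return
```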

from mlflow.utils.file_utils import TempDir


class TestGCSArtifactRepo(unittest.TestCase):
Collaborator

Sorry to nitpick on this - we've been planning to migrate all our tests to pytest format (instead of using unittest). Would you be able to convert these tests to pytest format? It should mainly be a matter of converting the test methods to functions & using the tmpdir fixture (link) instead of the TempDir utility in MLflow.

To override the GOOGLE_APPLICATION_CREDENTIALS environment variable / mock the GCS client, we can also use pytest fixtures e.g:

@pytest.fixture()
def gcs_credentials():
    old_creds_path = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS", None)
    mock_creds_path = "/dev/null"
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = mock_creds_path
    yield mock_creds_path
    # Restore the original value, or remove the mock value if the variable
    # was previously unset, so other tests aren't affected.
    if old_creds_path is not None:
        os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = old_creds_path
    else:
        del os.environ["GOOGLE_APPLICATION_CREDENTIALS"]

class TestGCSArtifactRepo(unittest.TestCase):
    def setUp(self):
        # Make sure that the environment variable isn't set to actually make calls
        os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/dev/null'
Collaborator

Just to be safe, let's save/restore the original value of this environment variable (if it was set) to avoid mutating it for other tests

        self.assertEqual(repo.list_artifacts()[0].path, mockobj.f)
        self.assertEqual(repo.list_artifacts()[0].file_size, mockobj.size)

    def test_log_artifact(self):
Collaborator

Could we also add tests for log_artifacts and download_artifacts? Also, ideally here we'd assert on the result of the upload / or test that the GCS upload_from_filename method is called with the right arguments.
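The kind of assertion requested above can be done with unittest.mock: patch the GCS client and verify that upload_from_filename receives the expected arguments. A sketch under assumptions — sample_log_artifact is a hypothetical stand-in for what the repository's log_artifact might do internally, not the code under review:

```python
from unittest import mock


def sample_log_artifact(client, bucket, dest, local_path):
    # Hypothetical stand-in for a GCS artifact-repo upload.
    client.get_bucket(bucket).blob(dest).upload_from_filename(local_path)


def assert_gcs_upload(log_artifact_fn):
    """Verify a log-artifact callable uploads via GCS with the right args."""
    mock_client = mock.MagicMock()
    log_artifact_fn(mock_client, "my-bucket", "runs/1/model.txt", "/tmp/model.txt")
    mock_client.get_bucket.assert_called_with("my-bucket")
    mock_client.get_bucket.return_value.blob.assert_called_with("runs/1/model.txt")
    (mock_client.get_bucket.return_value.blob.return_value
        .upload_from_filename.assert_called_with("/tmp/model.txt"))
```

The same pattern works for log_artifacts and download_artifacts: mock the client, call the method, and assert on the recorded calls rather than on real network traffic.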

@bnekolny
Contributor Author

@smurching I've updated the code to address all of your comments; let me know if there's anything else. I'll watch to make sure tests pass and tweak things if necessary to get them through.


@smurching smurching left a comment


Awesome this LGTM, merging to master - thanks for the hard work on this @bnekolny :)

We're hoping to make a release in the next few days, so you should see the GCS functionality in the PyPI installation of MLflow soon!

@smurching smurching merged commit 0ba50e7 into mlflow:master Jul 17, 2018
jdlesage added a commit to jdlesage/mlflow that referenced this pull request Apr 24, 2020
* Fix search with space in parameters name

* fix linting
dbczumar pushed a commit to dbczumar/mlflow that referenced this pull request Jun 15, 2022