Adding GCS artifact storage capabilities. #152
Conversation
I still need to add some tests (working through the nuances of no…). We're using GCS for storage, so is there an appetite for this being merged into master?
Force-pushed from 2f4f58e to e78c560 (Compare): Add google-cloud-storage as a dependency. Fix a couple of bugs with the GCS store.
Codecov Report

@@            Coverage Diff             @@
##           master     #152      +/-   ##
==========================================
- Coverage   49.79%   49.54%    -0.25%
==========================================
  Files          89       89
  Lines        4322     4503     +181
==========================================
+ Hits         2152     2231      +79
- Misses       2170     2272     +102

Continue to review full report at Codecov.
This definitely sounds like a good idea as long as you provide a way to test it. It's not a lot of code, and it will obviously help a lot of users. One question about the URI scheme: is the gs:// scheme also recognized for Google Cloud Storage in TensorFlow, Spark, and other systems that support writing to cloud storage? Or are multiple schemes used? In the latter case, we'll probably want to choose the most common one, because we want to make it possible for jobs to write directly to the artifact URI for a run without writing to a local file first (especially for large datasets). We already have this issue with S3, where Hadoop/Spark use s3a or s3n, and we plan to do the conversion automatically there.
@mateiz the gs:// scheme is a standard reference for GCS, much like s3://. There are different HTTP URLs, such as the S3 equivalent of s3.amazonaws.com, but I believe gs:// is more appropriate for a storage mechanism. I added some tests, but let me know your thoughts. We're running on GCP and are already using a fork with this code in it, but are hoping to get it merged into mainline.
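To illustrate how a repository can dispatch on the URI scheme, here is a minimal sketch of parsing a gs://bucket/path URI into a bucket name and object prefix. The helper name `parse_gcs_uri` is hypothetical, not necessarily what the PR implements:

```python
from urllib.parse import urlparse

def parse_gcs_uri(uri):
    """Split a gs://bucket/path URI into (bucket, object_path).

    Raises ValueError for non-GCS schemes, which is how a repository
    factory could reject URIs it doesn't handle.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "gs":
        raise ValueError("Not a GCS URI: %s" % uri)
    # Strip the leading "/" so the path is a valid object-name prefix
    return parsed.netloc, parsed.path.lstrip("/")

print(parse_gcs_uri("gs://my-bucket/experiments/0/artifacts"))
# -> ('my-bucket', 'experiments/0/artifacts')
```

The same shape would work for mapping s3:// to s3a:// paths: parse once, then reassemble with whichever scheme the downstream system expects.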
Ah this is awesome, thanks @bnekolny - giving it a look now :)
Thanks a ton for the PR @bnekolny, this will definitely be helpful for users - leaving a few comments :)
class GCSArtifactRepository(ArtifactRepository):
    """Stores artifacts on Google Cloud Storage.
Thanks for adding docs - would you be able to update the Storage subsection of the tracking docs to (a) mention GCS as a storage option and (b) include a link to the GCS auth docs? Specifically, we should update this file - thanks!
I added docs here: databricks@e387bd0
infos = []
prefix = dest_path + "/"

results = self.gcs.Client().get_bucket(bucket).list_blobs(prefix=prefix)
Was a little confused about whether `list_blobs` returns all the blobs in the bucket - the GCS docs suggest that the `page_token` argument is required:

    page_token (str) – (Optional) Opaque marker for the next "page" of blobs. If not passed, will return the first page of blobs.

However, based on this SO post it seems passing `page_token` isn't necessary. Would you happen to know for sure one way or the other?
I can confirm that passing no `page_token` does return all the objects. I haven't looked at the implementation, but I've gotten >50k objects back from the Python SDK without using page tokens.
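The reason callers don't need to pass `page_token` is that client libraries typically wrap a paged listing API in an iterator that fetches the next page lazily whenever the current one is exhausted. The sketch below simulates that pattern with a fake paged API (`fake_list_blobs_page` and `iterate_all` are illustrative names, not part of the GCS client):

```python
def fake_list_blobs_page(names, page_size, page_token=None):
    """Simulate a paged listing API: return (items, next_page_token)."""
    start = int(page_token) if page_token else 0
    items = names[start:start + page_size]
    end = start + page_size
    next_token = str(end) if end < len(names) else None
    return items, next_token

def iterate_all(names, page_size=2):
    """Auto-paginating wrapper: keep requesting pages until the token runs out."""
    token = None
    while True:
        items, token = fake_list_blobs_page(names, page_size, token)
        for item in items:
            yield item
        if token is None:
            return

blobs = ["a", "b", "c", "d", "e"]
assert list(iterate_all(blobs)) == blobs  # all pages transparently consumed
```

From the caller's perspective the iterator looks like a flat sequence, which matches the observed behavior of getting >50k objects back without handling tokens.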
from mlflow.utils.file_utils import TempDir

class TestGCSArtifactRepo(unittest.TestCase):
Sorry to nitpick on this - we've been planning to migrate all our tests to `pytest` format (instead of using `unittest`). Would you be able to convert these tests to `pytest` format? It should mainly be a matter of converting the test methods to functions & using the `tmpdir` fixture (link) instead of the `TempDir` utility in MLflow.

To override the `GOOGLE_APPLICATION_CREDENTIALS` environment variable / mock the GCS client, we can also use pytest fixtures, e.g.:
import os

import pytest

@pytest.fixture()
def gcs_credentials():
    old_creds_path = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS", None)
    mock_creds_path = "/dev/null"
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = mock_creds_path
    yield mock_creds_path
    # Restore the previous state, including "previously unset"
    if old_creds_path is not None:
        os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = old_creds_path
    else:
        os.environ.pop("GOOGLE_APPLICATION_CREDENTIALS", None)
class TestGCSArtifactRepo(unittest.TestCase):
    def setUp(self):
        # Make sure that the environment variable isn't set to actually make calls
        os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/dev/null'
Just to be safe, let's save/restore the original value of this environment variable (if it was set) to avoid mutating it for other tests
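The save/restore idiom can also be packaged as a reusable context manager, which guarantees restoration even if the test raises. This is a sketch under the assumption that restoring to "unset" matters (it does here, since other tests may check for the variable's absence); `override_env` is a hypothetical helper, not an MLflow utility:

```python
import os
from contextlib import contextmanager

@contextmanager
def override_env(name, value):
    """Temporarily set an environment variable, restoring the prior
    state (including "unset") on exit, even on exceptions."""
    old = os.environ.get(name)
    os.environ[name] = value
    try:
        yield value
    finally:
        if old is None:
            os.environ.pop(name, None)
        else:
            os.environ[name] = old

with override_env("GOOGLE_APPLICATION_CREDENTIALS", "/dev/null"):
    assert os.environ["GOOGLE_APPLICATION_CREDENTIALS"] == "/dev/null"
```

The `try/finally` is the key difference from a bare `setUp`/`tearDown` pair: the original value comes back even if the body of the `with` block fails.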
self.assertEqual(repo.list_artifacts()[0].path, mockobj.f)
self.assertEqual(repo.list_artifacts()[0].file_size, mockobj.size)

def test_log_artifact(self):
Could we also add tests for `log_artifacts` and `download_artifacts`? Also, ideally here we'd assert on the result of the upload, or test that the GCS `upload_from_filename` method is called with the right arguments.
…en restore after gcs tests.
@smurching I've updated the code to address all of your comments, let me know if there is anything else. I'll watch to make sure tests pass and tweak things if necessary to make sure those go through.
Awesome this LGTM, merging to master - thanks for the hard work on this @bnekolny :)
We're hoping to make a release in the next few days, so you should see the GCS functionality in the PyPI installation of MLflow soon!