Adding GCS artifact storage capabilities. #152
Conversation
I still need to add some tests (working through the nuances of no…). We're using GCS for storage, so is there an appetite for this being merged into master?
Force-pushed from 2f4f58e to e78c560 (Compare): Add google-cloud-storage as a dependency. Fix a couple of bugs with the GCS store.
Codecov Report

@@            Coverage Diff             @@
##           master     #152      +/-   ##
==========================================
- Coverage   49.79%   49.54%    -0.25%
==========================================
  Files          89       89
  Lines        4322     4503     +181
==========================================
+ Hits         2152     2231      +79
- Misses       2170     2272     +102

Continue to review full report at Codecov.
This definitely sounds like a good idea as long as you provide a way to test it. It's not a lot of code, and it will obviously help a lot of users. One question about the URI scheme: is the gs:// scheme also recognized for Google Cloud Storage in TensorFlow, Spark, and other systems that support writing to cloud storage? Or are multiple schemes used? In the latter case, we'll probably want to choose the most common one, because we want to make it possible for jobs to write directly to the artifact URI for a run without writing to a local file first (especially for large datasets). We already have this issue with S3, where Hadoop/Spark use s3a or s3n, and we plan to do the conversion automatically there.
@mateiz the gs:// scheme is a standard reference for GCS, much like s3://. There are different HTTP URLs, such as the S3 equivalent of s3.amazonaws.com, but I believe gs:// is more appropriate for a storage mechanism. I added some tests, but let me know your thoughts. We're running on GCP and are already using a fork with this code in it, but are hoping to get it merged into mainline.
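To illustrate how a repository can dispatch on the URI scheme, here is a minimal sketch of parsing a gs://bucket/path URI into a bucket name and object prefix. The helper name `parse_gcs_uri` is hypothetical, not necessarily what the PR implements:

```python
from urllib.parse import urlparse

def parse_gcs_uri(uri):
    """Split a gs://bucket/path URI into (bucket, object_path).

    Raises ValueError for non-GCS schemes, which is how a repository
    factory could reject URIs it doesn't handle.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "gs":
        raise ValueError("Not a GCS URI: %s" % uri)
    # Strip the leading "/" so the path is a valid object-name prefix
    return parsed.netloc, parsed.path.lstrip("/")

print(parse_gcs_uri("gs://my-bucket/experiments/0/artifacts"))
# -> ('my-bucket', 'experiments/0/artifacts')
```

The same shape would work for mapping s3:// to s3a:// paths: parse once, then reassemble with whichever scheme the downstream system expects.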
Ah this is awesome, thanks @bnekolny - giving it a look now :)
Thanks a ton for the PR @bnekolny, this will definitely be helpful for users - leaving a few comments :)
class GCSArtifactRepository(ArtifactRepository):
    """Stores artifacts on Google Cloud Storage.
Thanks for adding docs - would you be able to update the Storage subsection of the tracking docs to (a) mention GCS as a storage option and (b) include a link to the GCS auth docs? Specifically, we should update this file - thanks!
I added docs here: databricks@e387bd0
infos = []
prefix = dest_path + "/"

results = self.gcs.Client().get_bucket(bucket).list_blobs(prefix=prefix)
Was a little confused about whether `list_blobs` returns all the blobs in the bucket - the GCS docs suggest that the `page_token` argument is required:

    page_token (str) – (Optional) Opaque marker for the next "page" of blobs. If not passed, will return the first page of blobs.

However, based on this SO post it seems passing `page_token` isn't necessary. Would you happen to know for sure one way or the other?
I can confirm that passing no `page_token` does return all the objects. I haven't looked at the implementation, but I've gotten >50k objects back from the Python SDK without using page tokens.
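The reason callers don't need to pass `page_token` is that client libraries typically wrap a paged listing API in an iterator that fetches the next page lazily whenever the current one is exhausted. The sketch below simulates that pattern with a fake paged API (`fake_list_blobs_page` and `iterate_all` are illustrative names, not part of the GCS client):

```python
def fake_list_blobs_page(names, page_size, page_token=None):
    """Simulate a paged listing API: return (items, next_page_token)."""
    start = int(page_token) if page_token else 0
    items = names[start:start + page_size]
    end = start + page_size
    next_token = str(end) if end < len(names) else None
    return items, next_token

def iterate_all(names, page_size=2):
    """Auto-paginating wrapper: keep requesting pages until the token runs out."""
    token = None
    while True:
        items, token = fake_list_blobs_page(names, page_size, token)
        for item in items:
            yield item
        if token is None:
            return

blobs = ["a", "b", "c", "d", "e"]
assert list(iterate_all(blobs)) == blobs  # all pages transparently consumed
```

From the caller's perspective the iterator looks like a flat sequence, which matches the observed behavior of getting >50k objects back without handling tokens.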
from mlflow.utils.file_utils import TempDir

class TestGCSArtifactRepo(unittest.TestCase):
Sorry to nitpick on this - we've been planning to migrate all our tests to `pytest` format (instead of using `unittest`). Would you be able to convert these tests to `pytest` format? It should mainly be a matter of converting the test methods to functions & using the `tmpdir` fixture (link) instead of the `TempDir` utility in MLflow.

To override the `GOOGLE_APPLICATION_CREDENTIALS` environment variable / mock the GCS client, we can also use pytest fixtures, e.g.:
import os

import pytest

@pytest.fixture()
def gcs_credentials():
    old_creds_path = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS", None)
    mock_creds_path = "/dev/null"
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = mock_creds_path
    yield mock_creds_path
    # Restore the previous state, including "previously unset"
    if old_creds_path is not None:
        os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = old_creds_path
    else:
        os.environ.pop("GOOGLE_APPLICATION_CREDENTIALS", None)
class TestGCSArtifactRepo(unittest.TestCase):
    def setUp(self):
        # Make sure that the environment variable isn't set to actually make calls
        os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/dev/null'
Just to be safe, let's save/restore the original value of this environment variable (if it was set) to avoid mutating it for other tests
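The save/restore idiom can also be packaged as a reusable context manager, which guarantees restoration even if the test raises. This is a sketch under the assumption that restoring to "unset" matters (it does here, since other tests may check for the variable's absence); `override_env` is a hypothetical helper, not an MLflow utility:

```python
import os
from contextlib import contextmanager

@contextmanager
def override_env(name, value):
    """Temporarily set an environment variable, restoring the prior
    state (including "unset") on exit, even on exceptions."""
    old = os.environ.get(name)
    os.environ[name] = value
    try:
        yield value
    finally:
        if old is None:
            os.environ.pop(name, None)
        else:
            os.environ[name] = old

with override_env("GOOGLE_APPLICATION_CREDENTIALS", "/dev/null"):
    assert os.environ["GOOGLE_APPLICATION_CREDENTIALS"] == "/dev/null"
```

The `try/finally` is the key difference from a bare `setUp`/`tearDown` pair: the original value comes back even if the body of the `with` block fails.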
self.assertEqual(repo.list_artifacts()[0].path, mockobj.f)
self.assertEqual(repo.list_artifacts()[0].file_size, mockobj.size)

def test_log_artifact(self):
Could we also add tests for `log_artifacts` and `download_artifacts`? Also, ideally here we'd assert on the result of the upload, or test that the GCS `upload_from_filename` method is called with the right arguments.
…en restore after gcs tests.
@smurching I've updated the code to address all of your comments, let me know if there is anything else. I'll watch to make sure tests pass and tweak things if necessary to make sure those go through.
Awesome this LGTM, merging to master - thanks for the hard work on this @bnekolny :)
We're hoping to make a release in the next few days, so you should see the GCS functionality in the PyPI installation of MLflow soon!