
Lin 673 lineapy.get for MLflow #829

Merged
merged 14 commits into from
Nov 4, 2022
Conversation

Contributor

@mingjerli mingjerli commented Oct 28, 2022

Description

  • Implement LineaArtifact.get_value() for artifacts that use the MLflow backend.
  • Implement LineaArtifact.get_metadata() for artifacts.
  • Implement lineapy.api.delete to delete the MLflow artifact record from the lineapy db (without touching the artifact on the MLflow side).
  • Annotate statsmodels and xgboost and add them to the MLflow integration.
  • Add RTD docs for the MLflow integration.

Fixes

LIN-669, LIN-673, LIN-675, LIN-676, LIN-690

Type of change


  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

How Has This Been Tested?

  • Added LineaArtifact.get_value and LineaArtifact.get_metadata to existing tests.
  • For lineapy.delete, manually validated that the artifact entry has been deleted from the node_value, artifact, and mlflow_artifact_storage tables.

@mingjerli mingjerli marked this pull request as ready for review October 28, 2022 06:16
@mingjerli mingjerli changed the title Lin 673 lineapy.get Lin 673 lineapy.get for MLflow Oct 31, 2022
@mingjerli mingjerli changed the base branch from LIN-631-mlflow-integration to main November 1, 2022 19:52
* LIN-674 Add mlflow related configs in lineapy config
* LIN-671-enable-pip-install-lineapy[mlflow]
* Use Enum instead of Literal for ML_MODELS_STORAGE_BACKEND
* Add test for mlflow config

This reverts commit 879ffa9.
* LIN-668 Add metadata for mlflow storage backend
* Add mlflow_registry_uri into config items
* LIN-672 Implement lineapy.save for mlflow models
* LIN-670 Add test for lineapy.save to mlflow backend
Contributor

yoonspark commented Nov 2, 2022

To add new docstrings into RTD's API reference page, please run:

cd docs && rm -rf build source/build source/_build source/autogen && SPHINX_APIDOC_OPTIONS=members sphinx-apidoc -d 2 -f -o ./source/autogen ../lineapy/ && make html

This will update docs in lineapy/docs/source/autogen/, which should be committed.

Comment on lines 46 to +47
storage_location/index
storage_backend/index
Contributor

I am not sure the distinction between "storage location" and "storage backend" will be clear to our users. As such, why not combine these two sections? That is, the current PR's contents relating to MLflow can be put under the existing Changing Storage Location section and titled "Storing Model Artifact Values in MLflow", i.e.:

Changing Storage Location
  - Storing Artifact Metadata in PostgreSQL
  - Storing Artifact Values in Amazon S3
  - Storing Model Artifact Values in MLflow

Contributor Author

@mingjerli mingjerli Nov 3, 2022

I understand why you have this comment. However, location and backend are at different layers of abstraction. I think putting them together is more confusing.

MLflow has its own storage location, which can be local/s3/postgres. In this case, we are basically saying that for this type of artifact (ML models), we use MLflow to handle the storage; it could be s3/local/gcp/... and we don't really care. We just need to specify the MLflow host and MLflow will take care of the rest. For LineaPy itself, however, we are the host: we need to configure the underlying storage location and how the catalog (db) is hosted.

Contributor

Are the names storage_location and storage_backend also used by MLflow for those two configs? If so, then they are probably already clear to MLflow users, as you said.

Contributor

@lionsardesai lionsardesai Nov 3, 2022

Hmm, I think they have a backend store and an artifact store. The backend store is configured using --backend-store-uri and the artifact store using --default-artifact-root.

Not sure if I'm completely right here, but the storage_location here would be MLflow's --backend-store-uri and storage_backend would be "mlflow"?

Contributor

@mingjerli Ok, here's my understanding based on your explanation above:

[image: diagram of the storage backend vs. storage location relationship]

If this understanding is correct, then I suggest we make things clearer on the landing page of the Changing Storage Backend section, like so:

Out of the box, LineaPy is the default storage backend for all artifacts. For certain storage backends (e.g., storing model artifacts in MLflow), saving one more copy of the same artifact into LineaPy would cause sync issues between the two systems. Thus, LineaPy supports using different storage backends for certain data types (e.g., ML models). This support is essential for users to leverage functionality from both LineaPy and other familiar toolkits (e.g., MLflow).

NOTE: Storage backend refers to the overall system handling storage and should be distinguished from specific storage locations such as Amazon S3. For instance, LineaPy is a storage backend that can use different storage locations.

  • Using MLflow as Storage Backend to Save ML Models

Contributor Author

@lionsardesai I think what you are referring to is how to set up/run an MLflow server, not how end users connect to MLflow. End users use mlflow.set_tracking_uri and mlflow.set_registry_uri to configure the connection to MLflow. They don't need to know exactly how the MLflow server has been set up (as long as their IT/ops people tell them which tracking_uri and registry_uri to use). And the crazy part is you can put almost anything into set_tracking_uri: a file path, an S3 path, a database URI...

@yoonspark agree, added

Comment on lines 75 to 88
.. code:: python

# LineaPy way
artifact = lineapy.get('model')
model = artifact.get_value()

artifact.get_code() # to slice the code
lineapy.to_pipeline(['model']) # to create a pipeline

# MLflow way
metadata = artifact.get_metadata()
client = mlflow.MlflowClient()
latest_version = client.search_model_versions("name='clf'")[0].version
mlflow_model = mlflow.sklearn.load_model(f'models:/clf/{latest_version}')
Contributor

From the view of a fresh new user, this snippet is a bit confusing. Perhaps it is because the snippet covers two different scenarios (saving the model fully into the LineaPy backend vs. saving the model into the MLflow backend) in one shot.

  1. How is model different from mlflow_model above?
  2. Does artifact in metadata = artifact.get_metadata() differ from that in artifact = lineapy.get('model')?
  3. Under # MLflow way, what is the role of metadata, which is not used in any of the subsequent steps?

+-------------------------------------+---------------------------------------+---------+--------------------------------------------+-------------------------------------------------+
| logging_file | logging file path | Path | `$LINEAPY_HOME_DIR/lineapy.log` | `LINEAPY_LOGGING_FILE` |
+-------------------------------------+---------------------------------------+---------+--------------------------------------------+-------------------------------------------------+
| mlflow_tracking_uri | mlflow tracking | string | None | `LINEAPY_MLFLOW_TRACKING_URI` |
Contributor

Maybe for documentation we should break "integration-specific" configurations into a separate section to prevent this table from becoming massive.

return self.db.get_node_value_path(self._node_id, self._execution_id)

@lru_cache(maxsize=None)
def get_metadata(self, lineapy_only: bool = False) -> ArtifactInfo:
Contributor

Is this storage metadata only? If so, let's rename it to get_storage_metadata.

Contributor Author

No; it's just that for MLflow, all the metadata is storage-related.

@@ -79,9 +76,10 @@ class LineaArtifact:
"""name of the artifact"""
Contributor

Nitpick, but nice to fix: I think it's more common to have the comment above the parameter rather than below, right?
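For context on the convention at issue: a bare string placed directly below an attribute is the Sphinx-style attribute docstring, which tools like Sphinx autodoc can pick up; that may be why it sits below rather than above. A sketch with a hypothetical stand-in class:

```python
from dataclasses import dataclass


@dataclass
class LineaArtifactExample:  # hypothetical stand-in for LineaArtifact
    name: str
    """name of the artifact"""  # Sphinx-style attribute docstring, placed below the field


# the string is just an expression statement, so the dataclass works as usual
artifact = LineaArtifactExample(name="clf")
```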

Contributor Author

model_flavor: str


class ArtifactInfo(TypedDict):
Contributor

@andycui97 andycui97 Nov 3, 2022

Prefer not to have this be a TypedDict but just a normal class (dataclass is fine)

Don't like having to "index" TypedDicts by a string.
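To illustrate the trade-off being raised (the class names here are hypothetical stand-ins, not lineapy's actual ArtifactInfo):

```python
from dataclasses import dataclass
from typing import TypedDict


class ArtifactInfoDict(TypedDict):  # current approach: values reached by string keys
    model_flavor: str


@dataclass
class ArtifactInfoClass:  # suggested approach: plain attribute access
    model_flavor: str


info_dict: ArtifactInfoDict = {"model_flavor": "sklearn"}
info_obj = ArtifactInfoClass(model_flavor="sklearn")

flavor_a = info_dict["model_flavor"]  # "index" the TypedDict by a string
flavor_b = info_obj.model_flavor      # attribute access on the dataclass
```

A mistyped key on the TypedDict only surfaces via a type checker, whereas a mistyped attribute on the dataclass fails loudly at runtime as well.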

artifact_storage_dir.joinpath(pickle_filename)
if isinstance(artifact_storage_dir, Path)
else f'{artifact_storage_dir.rstrip("/")}/{pickle_filename}'
if self._artifact_id is None or self.date_created is None:
Contributor

@andycui97 andycui97 Nov 3, 2022

This logic seems like it would be good in an __init__ or __post_init__ to populate _artifact_id and date_created, so you can avoid the checks here (and so that other methods can use them in the future).
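A minimal sketch of the suggested __post_init__ approach; the class, field defaults, and lookup helper are hypothetical placeholders, not lineapy's actual implementation:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class ArtifactSketch:  # hypothetical, loosely mirrors LineaArtifact's fields
    name: str
    _artifact_id: Optional[int] = None
    date_created: Optional[datetime] = None

    def __post_init__(self) -> None:
        # populate once here so other methods can assume non-None values
        if self._artifact_id is None:
            self._artifact_id = self._fetch_artifact_id()
        if self.date_created is None:
            self.date_created = datetime.now()

    def _fetch_artifact_id(self) -> int:
        return 42  # placeholder for a real DB lookup


art = ArtifactSketch(name="my_model")
```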

assert isinstance(self.date_created, datetime)

storage_path = self._get_storage_path()
storage_backend = (
Contributor

Nitpick, but can we break this down into a proper if/else statement?

I expect this logic will get more complicated with more integrations, so we might as well write it out ...
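A sketch of the suggested explicit if/else form, using a stand-in enum rather than lineapy's actual ARTIFACT_STORAGE_BACKEND:

```python
from enum import Enum


class StorageBackend(Enum):  # stand-in for ARTIFACT_STORAGE_BACKEND
    lineapy = "lineapy"
    mlflow = "mlflow"


def resolve_backend(has_mlflow_metadata: bool) -> StorageBackend:
    # explicit branches leave room for future integrations as elif arms,
    # unlike a one-line conditional expression
    if has_mlflow_metadata:
        return StorageBackend.mlflow
    else:
        return StorageBackend.lineapy
```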

mlflow_io = {}

try:
import mlflow
Contributor

I have a non-blocking question about how this works, but we should create a ticket if you think it's worth it.

Right now this try/except runs when the lineapy module is initialized. This means that if a user installs a library mid-session because they realize it's missing, they must restart their environment and reload the extension for it to take effect.

I wonder if we should wrap this logic in a decorator that runs a function, catches any missing-mlflow-module errors, runs this import block, and retries the function once before erroring, and then decorate all the functions in this file with it.

That way, if a user installs a missing package and simply reruns the cell, it may work and they won't have to restart the kernel.

Contributor Author

You are right here; this won't cover the case where users install mlflow after lineapy is initialized. I wouldn't prioritize supporting this scenario either. If users need to read/write mlflow from lineapy, my guess is they are already mlflow users (at least for now).
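For the record, the retry-on-missing-import decorator floated above might look roughly like this (a hypothetical sketch, not lineapy's code, demonstrated with a stdlib module instead of mlflow):

```python
import functools
import importlib


def retry_with_import(module_name: str):
    """Retry a function once after importing a module that may have been
    installed after this module was initialized (hypothetical sketch)."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except NameError:
                # the module wasn't available at init time; import it into
                # this module's namespace and retry the call once
                globals()[module_name] = importlib.import_module(module_name)
                return func(*args, **kwargs)
        return wrapper
    return decorator


@retry_with_import("json")
def to_json(obj):
    # 'json' is resolved lazily; a NameError triggers the import-and-retry path
    return json.dumps(obj)
```

A real version would likely catch the specific "mlflow not installed" error raised by the wrapped functions rather than a bare NameError.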

"registered_model_name", name
)
kwargs["artifact_path"] = kwargs.get("artifact_path", name)
model_info = flavor_io["serializer"](value, **kwargs)
Contributor

@andycui97 andycui97 Nov 3, 2022

Add a code comment here saying this line is the actual "save".

It's kind of non-obvious, since the actual function is looked up from a dictionary ...
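A toy illustration of the suggested comment, with a stand-in serializer registry in place of MLflow's flavor modules:

```python
# hypothetical flavor registry: maps a role to a serializer callable
def fake_sklearn_log_model(value, **kwargs):
    # stand-in for something like mlflow.sklearn.log_model
    return {"value": value, **kwargs}


flavor_io = {"serializer": fake_sklearn_log_model}

kwargs = {"artifact_path": "model", "registered_model_name": "clf"}
# NOTE: this call performs the actual "save" -- easy to miss, since the
# serializer is looked up from a dict rather than called by name
model_info = flavor_io["serializer"]("weights", **kwargs)
```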

return a ModelInfo (MLflow model metadata) if successfully saved with
mlflow; otherwise None.

Note that Any is used for type checking in case mlflow is not installed.
Contributor

@andycui97 andycui97 Nov 3, 2022

Move this comment up to the top of the docstring. It is actually incredibly important for people to understand that the "failure mode" for this function is returning None.

lineapy/db/db.py Outdated
Comment on lines 785 to 794
def delete_mlflow_metadata_by_artifact_id(self, artifact_id: int) -> None:
"""
Delete MLflow metadata for the artifact
"""
res_query = self.session.query(MLflowArtifactMetadataORM).filter(
MLflowArtifactMetadataORM.artifact_id == artifact_id
)
res_query.delete()
self.renew_session()

Contributor

This should not be called. Since we are writing to an external system, we should retain every bit of metadata we can for future audit purposes. I propose adding a delete flag or a status column to indicate that the artifact tied to the specific MLflow model_uri has been deleted.
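A minimal sketch of the soft-delete idea using an in-memory SQLite table; the column names are illustrative, loosely modeled on the mlflow_artifact_storage table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mlflow_artifact_storage ("
    "  artifact_id INTEGER PRIMARY KEY,"
    "  model_uri TEXT,"
    "  is_deleted INTEGER NOT NULL DEFAULT 0)"
)
conn.execute(
    "INSERT INTO mlflow_artifact_storage (artifact_id, model_uri) "
    "VALUES (1, 'models:/clf/1')"
)


def soft_delete_mlflow_metadata(conn, artifact_id):
    # flag the row instead of deleting it, preserving model_uri for audits
    conn.execute(
        "UPDATE mlflow_artifact_storage SET is_deleted = 1 WHERE artifact_id = ?",
        (artifact_id,),
    )


soft_delete_mlflow_metadata(conn, 1)
flag, uri = conn.execute(
    "SELECT is_deleted, model_uri FROM mlflow_artifact_storage WHERE artifact_id = 1"
).fetchone()
```

Read paths would then filter on `is_deleted = 0`, while the original row stays available for auditing.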

logging.debug(f"No valid pickle path found for {node_id}")
elif lineapy_metadata.storage_backend == ARTIFACT_STORAGE_BACKEND.mlflow:
db.delete_artifact_by_name(artifact_name, version=version)
try:
Contributor

@andycui97 andycui97 Nov 3, 2022

This piece where we delete_artifact_by_name and delete_node_value_from_db is repeated code for both artifact types (and probably future types as well). Let's refactor this.

from lineapy.utils.config import DEFAULT_ML_MODELS_STORAGE_BACKEND, options
from tests.util import clean_lineapy_env_var

mlflow = pytest.importorskip("mlflow")
Contributor

Not needed for this set of tests?

Contributor Author

I added this test to check how default_ml_models_storage_backend behaves if we set mlflow_tracking_uri. Because of the mlflow dependency, I need to separate it from the other config tests.

Contributor

@andycui97 andycui97 left a comment

Please address the comments as you see fit.

@lionsardesai has also left an important comment on deletion from the DB ...
I don't think it should block the PR, but I had a discussion with him and I think it should be implemented before MLflow is "feature complete", so it might be worth rolling into this PR.

@mingjerli mingjerli merged commit 7f8167f into main Nov 4, 2022
@lionsardesai lionsardesai deleted the LIN-673-lineapy.get branch November 9, 2022 01:07