
Lin 673 lineapy.get for MLflow #829

Merged
merged 14 commits into from
Nov 4, 2022
Conversation

Contributor

@mingjerli mingjerli commented Oct 28, 2022

Description

  • Implement LineaArtifact.get_value() for artifacts that use the MLflow backend.
  • Implement LineaArtifact.get_metadata() for artifacts.
  • Implement lineapy.api.delete to delete the MLflow artifact record from the lineapy db (without touching the artifact on the MLflow side).
  • Annotate statsmodels and xgboost and add them to the MLflow integration.
  • Add RTD docs for the MLflow integration.

Fixes

LIN-669, LIN-673, LIN-675, LIN-676, LIN-690

Type of change


  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

How Has This Been Tested?

  • Added LineaArtifact.get_value and LineaArtifact.get_metadata to existing tests.
  • For lineapy.delete, manually validated that the artifact entry has been deleted from the node_value, artifact, and mlflow_artifact_storage tables.

@mingjerli mingjerli marked this pull request as ready for review October 28, 2022 06:16
@mingjerli mingjerli changed the title Lin 673 lineapy.get Lin 673 lineapy.get for MLflow Oct 31, 2022
@mingjerli mingjerli changed the base branch from LIN-631-mlflow-integration to main November 1, 2022 19:52
* LIN-674 Add mlflow related configs in lineapy config
* LIN-671-enable-pip-install-lineapy[mlflow]
* Use Enum instead of Literal for ML_MODELS_STORAGE_BACKEND
* Add test for mlflow config

This reverts commit 879ffa9.
* LIN-668 Add metadata for mlflow storage backend
* Add mlflow_registry_uri into config items
* LIN-672 Implement lineapy.save for mlflow models
* LIN-670 Add test for lineapy.save to mlflow backend
Contributor

yoonspark commented Nov 2, 2022

To add new docstrings into RTD's API reference page, please run:

cd docs && rm -rf build source/build source/_build source/autogen && SPHINX_APIDOC_OPTIONS=members sphinx-apidoc -d 2 -f -o ./source/autogen ../lineapy/ && make html

This will update docs in lineapy/docs/source/autogen/, which should be committed.

Comment on lines 46 to +47
storage_location/index
storage_backend/index
Contributor

I am not sure the distinction between "storage location" and "storage backend" will be clear to our users. As such, why not combine these two sections? That is, the current PR's contents relating to MLflow can be put under the existing Changing Storage Location section and titled "Storing Model Artifact Values in MLflow", i.e.:

Changing Storage Location
  - Storing Artifact Metadata in PostgreSQL
  - Storing Artifact Values in Amazon S3
  - Storing Model Artifact Values in MLflow

Contributor Author

@mingjerli mingjerli Nov 3, 2022

I understand why you have this comment. However, location and backend are at different layers of abstraction. I think putting them together is more confusing.

MLflow has its own storage location, which can be local/s3/postgres. In this case, we are basically saying that for this type of artifact (ML models), we use MLflow to handle the storage; it could be s3/local/gcp/... and we don't really care. We just need to specify the MLflow host and MLflow will take care of the rest. For LineaPy itself, however, we are the host: we need to configure the underlying storage location and how the catalog (db) is hosted.

Contributor

Are the names storage_location and storage_backend also used by MLflow for those two configs? If so, then they are probably already clear to MLflow users, as you said.

Contributor

@lionsardesai lionsardesai Nov 3, 2022

Hmm, I think they have a backend store and an artifact store. The backend store is configured using --backend-store-uri and the artifact store using --default-artifact-root.

Not sure if I'm completely right here, but the storage_location here would be MLflow's --backend-store-uri and storage_backend would be "mlflow"?

Contributor

@mingjerli Ok, here's my understanding based on your explanation above:

[image: diagram of the storage backend vs. storage location relationship]

If this understanding is correct, then I suggest we make things clearer on the landing page of the Changing Storage Backend section, like so:

Out of the box, LineaPy is the default storage backend for all artifacts. For certain storage backends (e.g., storing model artifacts in MLflow), saving one more copy of the same artifact into LineaPy would cause sync issues between the two systems. Thus, LineaPy supports using different storage backends for certain data types (e.g., ML models). This support is essential for users to leverage functionality from both LineaPy and other familiar toolkits (e.g., MLflow).

NOTE: Storage backend refers to the overall system handling storage and should be distinguished from specific storage locations such as Amazon S3. For instance, LineaPy is a storage backend that can use different storage locations.

  • Using MLflow as Storage Backend to Save ML Models

Contributor Author

@lionsardesai I think what you are referring to is how to set up/run an MLflow server, not how end users connect to MLflow. End users use mlflow.set_tracking_uri and mlflow.set_registry_uri to configure the connection to MLflow. They don't need to know exactly how the MLflow server has been set up (as long as their IT/ops people tell them which tracking_uri and registry_uri to use). And the crazy part is you can put almost anything into set_tracking_uri: a file path, an S3 path, a database URI...

@yoonspark agree, added

Comment on lines 75 to 88
.. code:: python

# LineaPy way
artifact = lineapy.get('model')
model = artifact.get_value()

artifact.get_code() # to slice the code
lineapy.to_pipeline(['model']) # to create a pipeline

# MLflow way
metadata = artifact.get_metadata()
client = mlflow.MlflowClient()
latest_version = client.search_model_versions("name='clf'")[0].version
mlflow_model = mlflow.sklearn.load_model(f'models:/clf/{latest_version}')
Contributor

From the view of a fresh new user, this snippet is a bit confusing. Perhaps it is because the snippet covers two different scenarios (saving the model fully into the LineaPy backend vs. saving the model into the MLflow backend) in one shot.

  1. How is model different from mlflow_model above?
  2. Does artifact in metadata = artifact.get_metadata() differ from that in artifact = lineapy.get('model')?
  3. Under # MLflow way, what is the role of metadata, which is not used in any of the subsequent steps?

+-------------------------------------+---------------------------------------+---------+--------------------------------------------+-------------------------------------------------+
| logging_file | logging file path | Path | `$LINEAPY_HOME_DIR/lineapy.log` | `LINEAPY_LOGGING_FILE` |
+-------------------------------------+---------------------------------------+---------+--------------------------------------------+-------------------------------------------------+
| mlflow_tracking_uri | mlflow tracking | string | None | `LINEAPY_MLFLOW_TRACKING_URI` |
Contributor

Maybe for documentation we should break "integration-specific" configurations into a separate section to prevent this table from becoming massive.

return self.db.get_node_value_path(self._node_id, self._execution_id)

@lru_cache(maxsize=None)
def get_metadata(self, lineapy_only: bool = False) -> ArtifactInfo:
Contributor

Is this storage metadata only? If so, let's rename it to get_storage_metadata.

Contributor Author

No; it's just that for MLflow, all the metadata is storage-related.

@@ -79,9 +76,10 @@ class LineaArtifact:
"""name of the artifact"""
Contributor

Nitpick, but nice to fix: I think it's more common to have the comment above the parameter rather than below, right?
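For context on the convention at issue: a bare string placed directly below an attribute is the Sphinx-style attribute docstring, which tools like Sphinx autodoc can pick up; that may be why it sits below rather than above. A sketch with a hypothetical stand-in class:

```python
from dataclasses import dataclass


@dataclass
class LineaArtifactExample:  # hypothetical stand-in for LineaArtifact
    name: str
    """name of the artifact"""  # Sphinx-style attribute docstring, placed below the field


# the string is just an expression statement, so the dataclass works as usual
artifact = LineaArtifactExample(name="clf")
```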

Contributor Author

model_flavor: str


class ArtifactInfo(TypedDict):
Contributor

@andycui97 andycui97 Nov 3, 2022

Prefer not to have this be a TypedDict but just a normal class (dataclass is fine)

Don't like having to "index" TypedDicts by a string.
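To illustrate the trade-off being raised (the class names here are hypothetical stand-ins, not lineapy's actual ArtifactInfo):

```python
from dataclasses import dataclass
from typing import TypedDict


class ArtifactInfoDict(TypedDict):  # current approach: values reached by string keys
    model_flavor: str


@dataclass
class ArtifactInfoClass:  # suggested approach: plain attribute access
    model_flavor: str


info_dict: ArtifactInfoDict = {"model_flavor": "sklearn"}
info_obj = ArtifactInfoClass(model_flavor="sklearn")

flavor_a = info_dict["model_flavor"]  # "index" the TypedDict by a string
flavor_b = info_obj.model_flavor      # attribute access on the dataclass
```

A mistyped key on the TypedDict only surfaces via a type checker, whereas a mistyped attribute on the dataclass fails loudly at runtime as well.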

artifact_storage_dir.joinpath(pickle_filename)
if isinstance(artifact_storage_dir, Path)
else f'{artifact_storage_dir.rstrip("/")}/{pickle_filename}'
if self._artifact_id is None or self.date_created is None:
Contributor

@andycui97 andycui97 Nov 3, 2022

This logic seems like it would be good in an __init__ or __post_init__ to populate _artifact_id and date_created, so you can avoid the checks here (and so that other methods can use them in the future).
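A minimal sketch of the suggested __post_init__ approach; the class, field defaults, and lookup helper are hypothetical placeholders, not lineapy's actual implementation:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class ArtifactSketch:  # hypothetical, loosely mirrors LineaArtifact's fields
    name: str
    _artifact_id: Optional[int] = None
    date_created: Optional[datetime] = None

    def __post_init__(self) -> None:
        # populate once here so other methods can assume non-None values
        if self._artifact_id is None:
            self._artifact_id = self._fetch_artifact_id()
        if self.date_created is None:
            self.date_created = datetime.now()

    def _fetch_artifact_id(self) -> int:
        return 42  # placeholder for a real DB lookup


art = ArtifactSketch(name="my_model")
```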

assert isinstance(self.date_created, datetime)

storage_path = self._get_storage_path()
storage_backend = (
Contributor

Nitpick, but can we break this down into a proper if/else statement?

I expect this logic will get more complicated with more integrations, so we might as well write it out ...
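A sketch of the suggested explicit if/else form, using a stand-in enum rather than lineapy's actual ARTIFACT_STORAGE_BACKEND:

```python
from enum import Enum


class StorageBackend(Enum):  # stand-in for ARTIFACT_STORAGE_BACKEND
    lineapy = "lineapy"
    mlflow = "mlflow"


def resolve_backend(has_mlflow_metadata: bool) -> StorageBackend:
    # explicit branches leave room for future integrations as elif arms,
    # unlike a one-line conditional expression
    if has_mlflow_metadata:
        return StorageBackend.mlflow
    else:
        return StorageBackend.lineapy
```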

mlflow_io = {}

try:
import mlflow
Contributor

I have a non-blocking question about how this works, but we should create a ticket if you think it's worth it.

Right now this try/except runs when the lineapy module is initialized. This means that if a user installs a library mid-session because they realize it's missing, they must restart their environment and reload the extension for it to take effect.

I wonder if we should wrap this logic in a decorator that runs a function, catches any missing-mlflow-module errors, runs this import block, and retries the function once before erroring, and then decorate all the functions in this file with it.

That way, if a user installs a missing package and simply reruns the cell, it may work and they won't have to restart the kernel.

Contributor Author

You are right here; this won't cover the case where users install mlflow after lineapy is initialized. I wouldn't prioritize supporting this scenario either. If users need to read/write mlflow from lineapy, my guess is they are already mlflow users (at least for now).
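For the record, the retry-on-missing-import decorator floated above might look roughly like this (a hypothetical sketch, not lineapy's code, demonstrated with a stdlib module instead of mlflow):

```python
import functools
import importlib


def retry_with_import(module_name: str):
    """Retry a function once after importing a module that may have been
    installed after this module was initialized (hypothetical sketch)."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except NameError:
                # the module wasn't available at init time; import it into
                # this module's namespace and retry the call once
                globals()[module_name] = importlib.import_module(module_name)
                return func(*args, **kwargs)
        return wrapper
    return decorator


@retry_with_import("json")
def to_json(obj):
    # 'json' is resolved lazily; a NameError triggers the import-and-retry path
    return json.dumps(obj)
```

A real version would likely catch the specific "mlflow not installed" error raised by the wrapped functions rather than a bare NameError.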

"registered_model_name", name
)
kwargs["artifact_path"] = kwargs.get("artifact_path", name)
model_info = flavor_io["serializer"](value, **kwargs)
Contributor

@andycui97 andycui97 Nov 3, 2022

Add a code comment here saying this line is the actual "save".

It's kind of non-obvious, since the actual function is looked up from a dictionary ...
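A toy illustration of the suggested comment, with a stand-in serializer registry in place of MLflow's flavor modules:

```python
# hypothetical flavor registry: maps a role to a serializer callable
def fake_sklearn_log_model(value, **kwargs):
    # stand-in for something like mlflow.sklearn.log_model
    return {"value": value, **kwargs}


flavor_io = {"serializer": fake_sklearn_log_model}

kwargs = {"artifact_path": "model", "registered_model_name": "clf"}
# NOTE: this call performs the actual "save" -- easy to miss, since the
# serializer is looked up from a dict rather than called by name
model_info = flavor_io["serializer"]("weights", **kwargs)
```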

return a ModelInfo (MLflow model metadata) if successfully saved with
mlflow; otherwise None.

Note that Any is used for type checking in case mlflow is not installed.
Contributor

@andycui97 andycui97 Nov 3, 2022

Move this comment up to the top of the docstring. It is actually incredibly important for people to understand that the "failure mode" for this function is returning None.

lineapy/db/db.py Outdated
Comment on lines 785 to 794
def delete_mlflow_metadata_by_artifact_id(self, artifact_id: int) -> None:
"""
Delete MLflow metadata for the artifact
"""
res_query = self.session.query(MLflowArtifactMetadataORM).filter(
MLflowArtifactMetadataORM.artifact_id == artifact_id
)
res_query.delete()
self.renew_session()

Contributor

This should not be called. Since we are writing to an external system, we should retain every bit of metadata we can for future audit purposes. I propose adding a delete flag or a status column to indicate that the artifact tied to the specific MLflow model_uri has been deleted.
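A minimal sketch of the soft-delete idea using an in-memory SQLite table; the column names are illustrative, loosely modeled on the mlflow_artifact_storage table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mlflow_artifact_storage ("
    "  artifact_id INTEGER PRIMARY KEY,"
    "  model_uri TEXT,"
    "  is_deleted INTEGER NOT NULL DEFAULT 0)"
)
conn.execute(
    "INSERT INTO mlflow_artifact_storage (artifact_id, model_uri) "
    "VALUES (1, 'models:/clf/1')"
)


def soft_delete_mlflow_metadata(conn, artifact_id):
    # flag the row instead of deleting it, preserving model_uri for audits
    conn.execute(
        "UPDATE mlflow_artifact_storage SET is_deleted = 1 WHERE artifact_id = ?",
        (artifact_id,),
    )


soft_delete_mlflow_metadata(conn, 1)
flag, uri = conn.execute(
    "SELECT is_deleted, model_uri FROM mlflow_artifact_storage WHERE artifact_id = 1"
).fetchone()
```

Read paths would then filter on `is_deleted = 0`, while the original row stays available for auditing.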

logging.debug(f"No valid pickle path found for {node_id}")
elif lineapy_metadata.storage_backend == ARTIFACT_STORAGE_BACKEND.mlflow:
db.delete_artifact_by_name(artifact_name, version=version)
try:
Contributor

@andycui97 andycui97 Nov 3, 2022

This piece where we delete_artifact_by_name and delete_node_value_from_db is repeated code for both artifact types (and probably future types as well). Let's refactor this.

from lineapy.utils.config import DEFAULT_ML_MODELS_STORAGE_BACKEND, options
from tests.util import clean_lineapy_env_var

mlflow = pytest.importorskip("mlflow")
Contributor

Not needed for this set of tests?

Contributor Author

I added this test to check how default_ml_models_storage_backend behaves if we set mlflow_tracking_uri. Because of the mlflow dependency, I need to separate it from the other config tests.

Contributor

@andycui97 andycui97 left a comment

Please address the comments as you see fit.

@lionsardesai has also left an important comment on deletion from the DB ...
I don't think it should block the PR, but I had a discussion with him and I think it should be implemented before MLflow is "feature complete", so it might be worth rolling into this PR.

@mingjerli mingjerli merged commit 7f8167f into main Nov 4, 2022
@lionsardesai lionsardesai deleted the LIN-673-lineapy.get branch November 9, 2022 01:07