LIN-673 lineapy.get for MLflow #829
Conversation
* LIN-674 Add mlflow related configs in lineapy config
* LIN-671-enable-pip-install-lineapy[mlflow]
* Use Enum instead of Literal for ML_MODELS_STORAGE_BACKEND
* Add test for mlflow config (this reverts commit 879ffa9)
* LIN-668 Add metadata for mlflow storage backend
* Add mlflow_registry_uri into config items
* LIN-672 Implement lineapy.save for mlflow models
* LIN-670 Add test for lineapy.save to mlflow backend
5228384 to 8cb1a02 (Compare)
To add new docstrings into RTD's API reference page, please run: `cd docs && rm -rf build source/build source/_build source/autogen && SPHINX_APIDOC_OPTIONS=members sphinx-apidoc -d 2 -f -o ./source/autogen ../lineapy/ && make html`. This will update the docs.
storage_location/index
storage_backend/index
I am not sure the distinction between "storage location" and "storage backend" will be clear to our users. As such, why not combine these two sections? That is, the current PR's contents relating to MLflow can be put under the existing Changing Storage Location section and be titled "Storing model artifact values in MLflow", i.e.:
Changing Storage Location
- Storing Artifact Metadata in PostgreSQL
- Storing Artifact Values in Amazon S3
- Storing Model Artifact Values in MLflow
I understand why you have this comment. However, location and backend are at different layers of abstraction, and I think putting them together is more confusing.
MLflow can have its own storage location; it can be local/s3/postgres. In this case, we are basically saying that for this type of artifact (ML model), we use MLflow to handle the storage; it could be s3/local/gcp/... and we don't really care. We just need to specify the host of MLflow, and MLflow will take care of the rest. However, for LineaPy itself, we are the host of LineaPy: we need to configure the underlying storage location and how the catalog (db) is hosted.
Are the names storage_location and storage_backend also used by MLflow for those two configs? If so, then it is probably already clear for MLflow users, as you said.
Hmm, I think they have a backend store and an artifact store. The backend store is configured using --backend-store-uri and the artifact store is configured using --default-artifact-root.
Not sure if I'm completely right here, but the storage_location here would be the --backend-store-uri of MLflow, and storage_backend would be "mlflow"?
@mingjerli Ok, here's my understanding based on your explanation above:
If this understanding is correct, then I suggest we make things clearer in the landing page of the Changing Storage Backend section, like so:
Out of the box, LineaPy is the default storage backend for all artifacts. For certain storage backends in use (e.g., storing model artifacts in MLflow), saving one more copy of the same artifact into LineaPy can cause sync issues between the two systems. Thus, LineaPy supports using different storage backends for certain data types (e.g., ML models). This support is essential for users to leverage functionalities from both LineaPy and other familiar toolkits (e.g., MLflow).
NOTE: Storage backend refers to the overall system handling storage and should be distinguished from specific storage locations such as Amazon S3. For instance, LineaPy is a storage backend that can use different storage locations.
- Using MLflow as Storage Backend to Save ML Models
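A hedged sketch of how such a backend might be selected via environment variables. `LINEAPY_MLFLOW_TRACKING_URI` is taken from the config table in this PR; the registry variable name and both URI values are illustrative assumptions:

```shell
# Point LineaPy at an MLflow tracking server (illustrative URI)
export LINEAPY_MLFLOW_TRACKING_URI="http://localhost:5000"

# Hypothetical companion variable for the mlflow_registry_uri config item
export LINEAPY_MLFLOW_REGISTRY_URI="http://localhost:5000"
```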
@lionsardesai I think what you are referring to is how to set up/run an MLflow server, not how end users connect to MLflow. End users use mlflow.set_tracking_uri and mlflow.set_registry_uri to configure the connection to MLflow. They don't need to know exactly how the MLflow server has been set up (as long as their IT/ops people tell them which tracking_uri and registry_uri to use). And the crazy part is you can put almost anything into set_tracking_uri: a filepath, an s3 path, a database ...
@yoonspark agree, added
.. code:: python

    # LineaPy way
    artifact = lineapy.get('model')
    model = artifact.get_value()

    artifact.get_code()  # to slice the code
    lineapy.to_pipeline(['model'])  # to create a pipeline

    # MLflow way
    metadata = artifact.get_metadata()
    client = mlflow.MlflowClient()
    latest_version = client.search_model_versions("name='clf'")[0].version
    mlflow_model = mlflow.sklearn.load_model(f'models:/clf/{latest_version}')
From the view of a fresh new user, this snippet is a bit confusing. Perhaps, it is because the snippet is covering two different scenarios (saving model fully into LineaPy backend vs. saving model into MLflow backend) in one shot.
- How is `model` different from `mlflow_model` above?
- Does `artifact` in `metadata = artifact.get_metadata()` differ from that in `artifact = lineapy.get('model')`?
- Under `# MLflow way`, what is the role of `metadata`, which is not used in any of the subsequent steps?
+-------------------------------------+---------------------------------------+---------+--------------------------------------------+-------------------------------------------------+
| logging_file                        | logging file path                     | Path    | `$LINEAPY_HOME_DIR/lineapy.log`            | `LINEAPY_LOGGING_FILE`                          |
+-------------------------------------+---------------------------------------+---------+--------------------------------------------+-------------------------------------------------+
| mlflow_tracking_uri                 | mlflow tracking                       | string  | None                                       | `LINEAPY_MLFLOW_TRACKING_URI`                   |
Maybe for documentation we should break integration-specific configurations out into a separate section to prevent this table from becoming massive.
        return self.db.get_node_value_path(self._node_id, self._execution_id)

    @lru_cache(maxsize=None)
    def get_metadata(self, lineapy_only: bool = False) -> ArtifactInfo:
Is this storage metadata only? If so, let's rename it to get_storage_metadata.
No; it just happens that for MLflow all the metadata are storage related.
@@ -79,9 +76,10 @@ class LineaArtifact:
    """name of the artifact"""
Nitpick but nice to fix: I think it's more common to have the comment above the parameter, not below, right?
Not sure we should move this; I think the current placement is for RTD purposes.
https://docs.lineapy.org/en/latest/autogen/lineapy.api.models.html#lineapy.api.models.linea_artifact.LineaArtifact.name
    model_flavor: str


class ArtifactInfo(TypedDict):
Prefer not to have this be a TypedDict but just a normal class (dataclass is fine).
I don't like having to "index" TypedDicts by a string.
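A minimal sketch of the contrast the reviewer is pointing at, with a hypothetical `ArtifactInfoClass` standing in for the proposed dataclass (field name taken from the diff above):

```python
from dataclasses import dataclass
from typing import TypedDict


# Current style: values are looked up by string key
class ArtifactInfoDict(TypedDict):
    model_flavor: str


# Suggested style: attribute access, which IDEs and linters can check
@dataclass
class ArtifactInfoClass:
    model_flavor: str


info_dict: ArtifactInfoDict = {"model_flavor": "sklearn"}
info = ArtifactInfoClass(model_flavor="sklearn")

print(info_dict["model_flavor"])  # string-indexed access
print(info.model_flavor)          # attribute access
```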
lineapy/api/models/linea_artifact.py
Outdated
artifact_storage_dir.joinpath(pickle_filename)
if isinstance(artifact_storage_dir, Path)
else f'{artifact_storage_dir.rstrip("/")}/{pickle_filename}'
if self._artifact_id is None or self.date_created is None:
This logic seems like it would be good in an __init__ or __post_init__ to populate _artifact_id and date_created, so you can avoid the checks here (and so that other methods can use them in the future).
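One way the reviewer's suggestion could look, as a hedged sketch: field names come from the diff, but the class is simplified and the population logic is purely illustrative:

```python
import uuid
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class LineaArtifactSketch:
    """Simplified stand-in for LineaArtifact, showing __post_init__ population."""

    name: str
    _artifact_id: Optional[str] = None
    date_created: Optional[datetime] = None

    def __post_init__(self) -> None:
        # Populate once, so later methods can rely on these being non-None
        # instead of re-checking at every call site.
        if self._artifact_id is None:
            self._artifact_id = str(uuid.uuid4())
        if self.date_created is None:
            self.date_created = datetime.now()


art = LineaArtifactSketch(name="model")
print(art._artifact_id, art.date_created)  # both populated after construction
```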
lineapy/api/models/linea_artifact.py
Outdated
assert isinstance(self.date_created, datetime)

storage_path = self._get_storage_path()
storage_backend = (
Nitpick, but can we break this down into a proper if/else statement?
I would expect this logic will become more complicated with more integrations, so we might as well write it out ...
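A sketch of the if/else form being requested. The enum values mirror the `ARTIFACT_STORAGE_BACKEND` names quoted elsewhere in this PR; the helper function and its dispatch condition are illustrative assumptions:

```python
from enum import Enum
from typing import Optional


class ARTIFACT_STORAGE_BACKEND(Enum):
    lineapy = "lineapy"
    mlflow = "mlflow"


def resolve_storage_backend(mlflow_metadata: Optional[object]) -> ARTIFACT_STORAGE_BACKEND:
    # Explicit if/else rather than a conditional expression, so future
    # integrations can each add their own branch here.
    if mlflow_metadata is not None:
        return ARTIFACT_STORAGE_BACKEND.mlflow
    else:
        return ARTIFACT_STORAGE_BACKEND.lineapy
```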
mlflow_io = {}

try:
    import mlflow
I have a non-blocking question about how this works, but we should create a ticket if you think it's worth it.
Right now this try/except runs when the lineapy module is initialized. This means that if a user installs a library mid-session because they realize it's missing, they must restart their environment and reload the extension to get this to work.
I wonder if we should wrap this logic in a decorator that runs a function, catches any missing-mlflow-module errors, runs this import block, and retries the function once before erroring, and then decorate all the functions in this file with it.
That way, if a user installs a missing package and simply reruns the cell, it may work and they won't have to restart the kernel.
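A rough sketch of the decorator idea, demonstrated with the stdlib `json` module in place of mlflow. This is an assumption about shape only; a real implementation would also need to refresh the `mlflow_io` registry after the late import:

```python
import functools
import importlib


def retry_after_import(module_name: str):
    """Retry a function once after importing module_name on ModuleNotFoundError."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except ModuleNotFoundError:
                # The package may have been installed mid-session:
                # import it now and retry the call once before failing.
                importlib.import_module(module_name)
                return func(*args, **kwargs)

        return wrapper

    return decorator


@retry_after_import("json")  # stand-in for "mlflow"
def to_json(obj):
    import json  # resolved lazily so a fresh install is picked up

    return json.dumps(obj)
```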
You are right; this won't cover the case where users install mlflow after lineapy is initialized. I wouldn't prioritize supporting this scenario either. If users need to read/write mlflow from lineapy, my guess is they are already mlflow users (at least for now).
    "registered_model_name", name
)
kwargs["artifact_path"] = kwargs.get("artifact_path", name)
model_info = flavor_io["serializer"](value, **kwargs)
Add a code comment here saying this line is the actual "save".
It's kind of non-obvious since the actual function is looked up from a dictionary ...
Return a ModelInfo (MLflow model metadata) if successfully saved with
mlflow; otherwise None.

Note that Any is used for type checking in case mlflow is not installed.
Move this comment up to the top of the docstring. It is actually incredibly important for people to understand that the "failure mode" of this function is returning None.
lineapy/db/db.py
Outdated
def delete_mlflow_metadata_by_artifact_id(self, artifact_id: int) -> None:
    """
    Delete MLflow metadata for the artifact
    """
    res_query = self.session.query(MLflowArtifactMetadataORM).filter(
        MLflowArtifactMetadataORM.artifact_id == artifact_id
    )
    res_query.delete()
    self.renew_session()
This should not be called. Since we are writing to an external system, we should retain every bit of metadata that we can for future audit purposes. I propose adding a delete flag or a status column to indicate that the artifact tied to the specific mlflow model_uri has been deleted.
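A sketch of the proposed soft delete, using a plain dataclass in place of the actual MLflowArtifactMetadataORM (the `deleted_at` field and helper are illustrative assumptions, not the project's schema):

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class MLflowArtifactMetadataRecord:
    """Stand-in for the ORM row; real code would add a column instead."""

    artifact_id: int
    model_uri: str
    # Soft-delete marker instead of a hard row deletion, so the
    # mlflow model_uri remains available for future audits.
    deleted_at: Optional[datetime] = None


def soft_delete(record: MLflowArtifactMetadataRecord) -> None:
    # Flag the record rather than removing it from the table.
    record.deleted_at = datetime.now()


rec = MLflowArtifactMetadataRecord(artifact_id=1, model_uri="models:/clf/1")
soft_delete(rec)
print(rec.deleted_at)  # metadata retained, just flagged as deleted
```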
lineapy/api/api.py
Outdated
    logging.debug(f"No valid pickle path found for {node_id}")
elif lineapy_metadata.storage_backend == ARTIFACT_STORAGE_BACKEND.mlflow:
    db.delete_artifact_by_name(artifact_name, version=version)
    try:
This piece where we delete_artifact_by_name and delete_node_value_from_db is repeated code for both artifact types (and probably future types as well). Let's refactor this.
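One shape the refactor could take, as a hedged sketch: the two `db` method names come from this thread, but the helper name, its parameters, and the stub used to exercise it are illustrative:

```python
def _delete_artifact_records(db, artifact_name, version, node_id, execution_id):
    # Shared cleanup used by every storage backend: remove the artifact
    # row and its cached node value in one place instead of per-branch.
    db.delete_artifact_by_name(artifact_name, version=version)
    db.delete_node_value_from_db(node_id, execution_id)


class StubDB:
    """Records calls so the helper's behavior can be verified without a real db."""

    def __init__(self):
        self.calls = []

    def delete_artifact_by_name(self, name, version=None):
        self.calls.append(("artifact", name, version))

    def delete_node_value_from_db(self, node_id, execution_id):
        self.calls.append(("node_value", node_id, execution_id))


stub = StubDB()
_delete_artifact_records(stub, "model", 1, "n1", "e1")
print(stub.calls)  # both deletions routed through the shared helper
```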
from lineapy.utils.config import DEFAULT_ML_MODELS_STORAGE_BACKEND, options
from tests.util import clean_lineapy_env_var

mlflow = pytest.importorskip("mlflow")
not needed for this set of tests?
I added this test to check how default_ml_models_storage_backend behaves if we set mlflow_tracking_uri. Because of the mlflow dependency, I need to separate it from all other config tests.
Please address the comments as you see fit.
@lionsardesai has also left an important comment on deletion from the DB ...
I don't think it should block the PR, but I had a discussion with him and I think it should be implemented before the MLflow integration is "feature complete", so it might be worth rolling into this PR.
Description
- LineaArtifact.get_value() for artifacts that use the MLflow backend.
- LineaArtifact.get_metadata() for artifacts.
- lineapy.api.delete to delete the MLflow artifact record from the lineapy db (not touching the MLflow artifact side).
- Add statsmodel and xgboost to the MLflow integration.

Fixes # (issue)
LIN-669, LIN-673, LIN-675, LIN-676, LIN-690
How Has This Been Tested?
- Add LineaArtifact.get_value and LineaArtifact.get_metadata to existing tests.
- For lineapy.delete, manually validate the artifact entry has been deleted from the node_value, artifact, and mlflow_artifact_storage tables.