Lin 673 lineapy.get for MLflow (LineaLabs#829)
* LIN-674, LIN-671 add mlflow configs (LineaLabs#825)

* LIN-674 Add mlflow related configs in lineapy config
* LIN-671-enable-pip-install-lineapy[mlflow]
* Use Enum instead of Literal for ML_MODELS_STORAGE_BACKEND
* Add test for mlflow config

This reverts commit 879ffa9.

* LIN-672 lineapy.save for mlflow (LineaLabs#828)

* LIN-668 Add metadata for mlflow storage backend
* Add mlflow_registry_uri into config items
* LIN-672 Implement lineapy.save for mlflow models
* LIN-670 Add test for lineapy.save to mlflow backend

* WIP-lineapy-get-metadata

* WIP - Implement Artifact.get_value and Artifact.get_metadata

* Implement delete for MLflow

* Add statsmodels and xgboost serializer/deserializer for MLflow

* Add doc

* Add RTD for MLflow

* Update docs to address PR review

* Address PR feedback

* Change mlflow deletion db logic

* refactor common code for different storage backend saving logic

* Add doc for backend storage
mingjerli authored Nov 4, 2022
1 parent 6f9e6c2 commit 7f8167f
Showing 26 changed files with 1,188 additions and 164 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -179,3 +179,6 @@ Untitled*.ipynb
tests/outputs
*.pickle
.linea/linea_pickles

# mlflow
mlruns/
6 changes: 6 additions & 0 deletions docs/source/autogen/lineapy.api.rst
@@ -24,6 +24,12 @@ lineapy.api.api\_utils module
.. automodule:: lineapy.api.api_utils
   :members:

lineapy.api.artifact\_serializer module
---------------------------------------

.. automodule:: lineapy.api.artifact_serializer
   :members:

Module contents
---------------

8 changes: 8 additions & 0 deletions docs/source/autogen/lineapy.plugins.rst
@@ -1,6 +1,14 @@
lineapy.plugins package
=======================

Subpackages
-----------

.. toctree::
   :maxdepth: 2

   lineapy.plugins.serializers

Submodules
----------

17 changes: 17 additions & 0 deletions docs/source/autogen/lineapy.plugins.serializers.rst
@@ -0,0 +1,17 @@
lineapy.plugins.serializers package
===================================

Submodules
----------

lineapy.plugins.serializers.mlflow\_io module
---------------------------------------------

.. automodule:: lineapy.plugins.serializers.mlflow_io
   :members:

Module contents
---------------

.. automodule:: lineapy.plugins.serializers
   :members:
15 changes: 15 additions & 0 deletions docs/source/guide/manage_artifacts/index.rst
@@ -39,8 +39,23 @@ imagine how difficult it would be to maintain correlations between the two. Line
Read more about configuration :ref:`here <configurations>`.


Storage Backend
---------------

Out of the box, LineaPy is the default storage backend for all artifacts.
When another storage backend is already in use (e.g., model artifacts stored in MLflow), saving a second copy of the same artifact in LineaPy can cause synchronization issues between the two systems.
Thus, LineaPy supports using different storage backends for certain data types (e.g., ML models).
This support lets users leverage functionality from both LineaPy and other familiar toolkits (e.g., MLflow).

.. note::

    Storage backend refers to the overall system handling storage and should be distinguished from specific storage locations such as Amazon S3.
    For instance, LineaPy is a storage backend that can use different storage locations.

.. toctree::
    :maxdepth: 1

    artifact_reuse
    storage_location/index
    storage_backend/index
14 changes: 14 additions & 0 deletions docs/source/guide/manage_artifacts/storage_backend/index.rst
@@ -0,0 +1,14 @@
Changing Storage Backend
========================

Out of the box, LineaPy is the default storage backend for all artifacts.
When an existing storage system (e.g., MLflow or a database) is already used to save artifacts, saving one more copy in LineaPy can cause synchronization issues between the two systems.
Thus, LineaPy supports using different storage backends for some data types.
This support lets users leverage functionality from both LineaPy and the tools they are already familiar with.

Currently, LineaPy supports MLflow as a storage backend for ML models.

.. toctree::
    :maxdepth: 1

    mlflow
110 changes: 110 additions & 0 deletions docs/source/guide/manage_artifacts/storage_backend/mlflow.rst
@@ -0,0 +1,110 @@
.. _mlflow:

Using MLflow as Storage Backend to Save ML Models
=================================================

.. include:: ../../../snippets/slack_support.rstinc

By default, LineaPy uses its own storage backend to save artifacts of all object types.
However, users who have access to MLflow may prefer to save their ML models there.
Thus, LineaPy supports using MLflow as the storage backend for ML models.

Configure MLflow
----------------

Depending on how MLflow is configured, we might need to specify the ``tracking URI`` and (optionally) the ``registry URI`` in MLflow before we can start using it.

.. code:: python

    import mlflow

    mlflow.set_tracking_uri('your_mlflow_tracking_uri')
    mlflow.set_registry_uri('your_mlflow_registry_uri')

To make LineaPy aware of MLflow, we also need to set the corresponding config items if we want to use MLflow as the storage backend for ML models.

.. code:: python

    import lineapy

    lineapy.options.set('mlflow_tracking_uri', 'your_mlflow_tracking_uri')
    lineapy.options.set('mlflow_registry_uri', 'your_mlflow_registry_uri')

.. note::

    For objects not supported by MLflow, LineaPy falls back to using itself as the storage backend as usual.

Set Default Storage Backend for ML Models
-----------------------------------------
Users may have different usage patterns for MLflow: some use it for logging and record every model under development, while others treat it as a public space and only publish models that meet specific criteria.
In the first case, users want MLflow to save artifacts (ML models) by default; in the second case, users want to save artifacts to MLflow only when they explicitly ask for it.
Thus, we provide an option (``default_ml_models_storage_backend``) that lets users decide the default storage backend for ML models when ``mlflow_tracking_uri`` has been set.

The following cases show which storage backend is used for ML models:

* Only set ``mlflow_tracking_uri`` but not ``default_ml_models_storage_backend``

  .. code:: python

      lineapy.options.set("mlflow_tracking_uri", "databricks")
      # Use MLflow (when mlflow_tracking_uri is set, default_ml_models_storage_backend defaults to mlflow)
      lineapy.save(model, 'model')
      lineapy.save(model, 'model', storage_backend='mlflow') # Use MLflow
      lineapy.save(model, 'model', storage_backend='lineapy') # Use LineaPy

* Set ``mlflow_tracking_uri`` and ``default_ml_models_storage_backend=='mlflow'``

  .. code:: python

      lineapy.options.set("mlflow_tracking_uri", "databricks")
      lineapy.options.set("default_ml_models_storage_backend", "mlflow")
      lineapy.save(model, 'model') # Use MLflow
      lineapy.save(model, 'model', storage_backend='mlflow') # Use MLflow
      lineapy.save(model, 'model', storage_backend='lineapy') # Use LineaPy

* Set ``mlflow_tracking_uri`` and ``default_ml_models_storage_backend=='lineapy'``

  .. code:: python

      lineapy.options.set("mlflow_tracking_uri", "databricks")
      lineapy.options.set("default_ml_models_storage_backend", "lineapy")
      lineapy.save(model, 'model') # Use LineaPy
      lineapy.save(model, 'model', storage_backend='mlflow') # Use MLflow
      lineapy.save(model, 'model', storage_backend='lineapy') # Use LineaPy

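
The cases above follow a simple precedence, which can be sketched as a small function. This is only an illustration of the documented behavior, not LineaPy's implementation: ``resolve_ml_model_backend`` and ``SUPPORTED_BACKENDS`` are hypothetical names, and the sketch assumes that LineaPy is used when ``mlflow_tracking_uri`` is unset and no backend is requested explicitly. It does not model the fallback for objects that MLflow cannot store.

```python
from typing import Optional

# Hypothetical constant: the two backend names used in the examples above.
SUPPORTED_BACKENDS = {"lineapy", "mlflow"}


def resolve_ml_model_backend(
    mlflow_tracking_uri: Optional[str],
    default_ml_models_storage_backend: Optional[str] = None,
    storage_backend: Optional[str] = None,
) -> str:
    """Return the backend an ML model would be saved to (illustrative only)."""
    # 1. An explicit storage_backend argument always wins.
    if storage_backend is not None:
        if storage_backend not in SUPPORTED_BACKENDS:
            raise ValueError(f"unknown storage backend: {storage_backend!r}")
        return storage_backend
    # 2. Assumption: without a tracking URI, MLflow is not configured,
    #    so LineaPy stays the backend.
    if mlflow_tracking_uri is None:
        return "lineapy"
    # 3. Otherwise use the configured default, which itself defaults to MLflow.
    return default_ml_models_storage_backend or "mlflow"


print(resolve_ml_model_backend("databricks"))
print(resolve_ml_model_backend("databricks", "lineapy"))
print(resolve_ml_model_backend("databricks", "lineapy", storage_backend="mlflow"))
```

The three calls correspond to the first bullet, the third bullet without an explicit argument, and the third bullet with ``storage_backend='mlflow'``.
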
Note that when using MLflow as the storage backend, ``lineapy.save`` wraps ``mlflow.flavor.log_model`` under the hood.
Users can pass any argument accepted by ``mlflow.flavor.log_model`` to ``lineapy.save`` as well.
For instance, if we want to specify ``registered_model_name``, we can write the save statement as:

.. code:: python

    lineapy.save(model, name="model", storage_backend="mlflow", registered_model_name="clf")

Retrieve Artifact from Both LineaPy and MLflow
----------------------------------------------

Depending on what users want to do (or what they are familiar with), they can retrieve the same artifact (ML model) through either the LineaPy API or the MLflow API once ``lineapy.save`` has been executed with ``mlflow`` as the storage backend.

* Retrieve the artifact (model) with the LineaPy API

  .. code:: python

      artifact = lineapy.get('model')
      lineapy_model = artifact.get_value()

* Retrieve the artifact (model) with the MLflow API

  .. code:: python

      client = mlflow.MlflowClient()
      latest_version = client.search_model_versions("name='clf'")[0].version
      # This is exactly the same object as `lineapy_model` in the previous snippet
      mlflow_model = mlflow.sklearn.load_model(f'models:/clf/{latest_version}')

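
The MLflow snippet above takes the first element of the search result as the latest version. If we would rather not depend on the order of the results, we can pick the highest version number ourselves. The helper below is a hypothetical sketch (``pick_latest_version`` is not part of LineaPy or MLflow); it only assumes that each result exposes a ``version`` attribute holding an integer or a numeric string, and it uses stand-in objects so that it runs without an MLflow server.

```python
from types import SimpleNamespace
from typing import Iterable


def pick_latest_version(model_versions: Iterable) -> str:
    """Return the highest ``version`` among the given results, as a string."""
    versions = list(model_versions)
    if not versions:
        raise LookupError("no registered model versions found")
    # Compare numerically so that "10" ranks above "2".
    latest = max(versions, key=lambda mv: int(mv.version))
    return str(latest.version)


# Stand-ins for the objects returned by client.search_model_versions(...).
fake_results = [
    SimpleNamespace(version="2"),
    SimpleNamespace(version="10"),
    SimpleNamespace(version="1"),
]
print(pick_latest_version(fake_results))  # prints 10
```

With a real client, the result of ``client.search_model_versions("name='clf'")`` would be passed in place of ``fake_results``.
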
Which MLflow Model Flavor is Supported
--------------------------------------

Currently, LineaPy supports the following flavors: ``sklearn``, ``xgboost``, ``prophet``, and ``statsmodels``.
We plan to support all MLflow-supported model flavors in the future.
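
To make the fallback rule concrete, here is a minimal sketch of how a flavor could be chosen from a model object. It is an assumption for illustration only and not how LineaPy detects flavors: ``guess_flavor`` and ``SUPPORTED_FLAVORS`` are hypothetical names, and the sketch simply matches the top-level package of the model's class against the flavors listed above.

```python
from typing import Optional

# Flavors listed in this section.
SUPPORTED_FLAVORS = ("sklearn", "xgboost", "prophet", "statsmodels")


def guess_flavor(model: object) -> Optional[str]:
    """Return a supported flavor name, or None to signal the LineaPy fallback."""
    # e.g. "sklearn.linear_model._base" -> "sklearn"
    top_level_package = type(model).__module__.split(".")[0]
    if top_level_package in SUPPORTED_FLAVORS:
        return top_level_package
    return None


# Stand-in class so the sketch runs without scikit-learn installed.
FakeRegressor = type(
    "LinearRegression", (), {"__module__": "sklearn.linear_model._base"}
)

print(guess_flavor(FakeRegressor()))  # prints sklearn
print(guess_flavor({"not": "a model"}))  # prints None
```

A ``None`` result corresponds to the case where the object is not supported by MLflow and is saved with LineaPy instead.
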

72 changes: 55 additions & 17 deletions docs/source/references/configurations.rst
@@ -11,23 +11,37 @@ These items are determined by the following order:
- Configuration file
- Default values

+----------------------------------------+----------------------------------------+----------+--------------------------------------------------+--------------------------------------------------+
| name                                   | usage                                  | type     | default                                          | environmental variables                          |
+========================================+========================================+==========+==================================================+==================================================+
| home_dir                               | LineaPy base folder                    | Path     | `$HOME/.lineapy`                                 | `LINEAPY_HOME_DIR`                               |
+----------------------------------------+----------------------------------------+----------+--------------------------------------------------+--------------------------------------------------+
| artifact_storage_dir                   | artifact saving folder                 | Path     | `$LINEAPY_HOME_DIR/linea_pickles`                | `LINEAPY_ARTIFACT_STORAGE_DIR`                   |
+----------------------------------------+----------------------------------------+----------+--------------------------------------------------+--------------------------------------------------+
| database_url                           | LineaPy db connection string           | string   | `sqlite:///$LINEAPY_HOME_DIR/db.sqlite`          | `LINEAPY_DATABASE_URL`                           |
+----------------------------------------+----------------------------------------+----------+--------------------------------------------------+--------------------------------------------------+
| customized_annotation_folder           | user annotations folder                | Path     | `$LINEAPY_HOME_DIR/customized_annotations`       | `LINEAPY_CUSTOMIZED_ANNOTATION_FOLDER`           |
+----------------------------------------+----------------------------------------+----------+--------------------------------------------------+--------------------------------------------------+
| do_not_track                           | disable user analytics                 | boolean  | false                                            | `LINEAPY_DO_NOT_TRACK`                           |
+----------------------------------------+----------------------------------------+----------+--------------------------------------------------+--------------------------------------------------+
| logging_level                          | logging level                          | string   | INFO                                             | `LINEAPY_LOGGING_LEVEL`                          |
+----------------------------------------+----------------------------------------+----------+--------------------------------------------------+--------------------------------------------------+
| logging_file                           | logging file path                      | Path     | `$LINEAPY_HOME_DIR/lineapy.log`                  | `LINEAPY_LOGGING_FILE`                           |
+----------------------------------------+----------------------------------------+----------+--------------------------------------------------+--------------------------------------------------+
* Core LineaPy configuration items

+----------------------------------------+----------------------------------------+----------+--------------------------------------------------+--------------------------------------------------+
| name                                   | usage                                  | type     | default                                          | environmental variables                          |
+========================================+========================================+==========+==================================================+==================================================+
| home_dir                               | LineaPy base folder                    | Path     | `$HOME/.lineapy`                                 | `LINEAPY_HOME_DIR`                               |
+----------------------------------------+----------------------------------------+----------+--------------------------------------------------+--------------------------------------------------+
| artifact_storage_dir                   | artifact saving folder                 | Path     | `$LINEAPY_HOME_DIR/linea_pickles`                | `LINEAPY_ARTIFACT_STORAGE_DIR`                   |
+----------------------------------------+----------------------------------------+----------+--------------------------------------------------+--------------------------------------------------+
| database_url                           | LineaPy db connection string           | string   | `sqlite:///$LINEAPY_HOME_DIR/db.sqlite`          | `LINEAPY_DATABASE_URL`                           |
+----------------------------------------+----------------------------------------+----------+--------------------------------------------------+--------------------------------------------------+
| customized_annotation_folder           | user annotations folder                | Path     | `$LINEAPY_HOME_DIR/customized_annotations`       | `LINEAPY_CUSTOMIZED_ANNOTATION_FOLDER`           |
+----------------------------------------+----------------------------------------+----------+--------------------------------------------------+--------------------------------------------------+
| do_not_track                           | disable user analytics                 | boolean  | false                                            | `LINEAPY_DO_NOT_TRACK`                           |
+----------------------------------------+----------------------------------------+----------+--------------------------------------------------+--------------------------------------------------+
| logging_level                          | logging level                          | string   | INFO                                             | `LINEAPY_LOGGING_LEVEL`                          |
+----------------------------------------+----------------------------------------+----------+--------------------------------------------------+--------------------------------------------------+
| logging_file                           | logging file path                      | Path     | `$LINEAPY_HOME_DIR/lineapy.log`                  | `LINEAPY_LOGGING_FILE`                           |
+----------------------------------------+----------------------------------------+----------+--------------------------------------------------+--------------------------------------------------+

* Configuration items for integration with other tools

+----------------------------------------+----------------------------------------+----------+--------------------------------------------------+--------------------------------------------------+
| name                                   | usage                                  | type     | default                                          | environmental variables                          |
+========================================+========================================+==========+==================================================+==================================================+
| mlflow_tracking_uri                    | mlflow tracking                        | string   | None                                             | `LINEAPY_MLFLOW_TRACKING_URI`                    |
+----------------------------------------+----------------------------------------+----------+--------------------------------------------------+--------------------------------------------------+
| mlflow_registry_uri                    | mlflow registry                        | string   | None                                             | `LINEAPY_MLFLOW_REGISTRY_URI`                    |
+----------------------------------------+----------------------------------------+----------+--------------------------------------------------+--------------------------------------------------+
| default_ml_models_storage_backend      | default storage backend for ml models  | string   | mlflow                                           | `LINEAPY_DEFAULT_ML_MODELS_STORAGE_BACKEND`      |
+----------------------------------------+----------------------------------------+----------+--------------------------------------------------+--------------------------------------------------+

All LineaPy configuration items follow the same naming convention: in the configuration file, variable names are lower case with underscores;
environment variable names are upper case with underscores; and CLI options are lower case.
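
As a quick illustration of this convention, the sketch below derives the environment variable name from a configuration item name. The ``LINEAPY_`` prefix is read off the tables above; ``to_env_var`` and ``read_from_env`` are hypothetical helpers written for this example, not LineaPy functions.

```python
import os
from typing import Optional

# Prefix taken from the "environmental variables" column of the tables above.
ENV_PREFIX = "LINEAPY_"


def to_env_var(config_name: str) -> str:
    """Map a lower-case config item name to its environment variable name."""
    return ENV_PREFIX + config_name.upper()


def read_from_env(config_name: str, default: Optional[str] = None) -> Optional[str]:
    """Look up a config item in the environment, falling back to a default."""
    return os.environ.get(to_env_var(config_name), default)


print(to_env_var("mlflow_tracking_uri"))  # prints LINEAPY_MLFLOW_TRACKING_URI
```
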
@@ -107,3 +121,27 @@ Instead, if you want ot use environmental variables, you should configure it thr

Note that which ``storage_options`` items you can set depends on the filesystem you are using.
In the following section, we will discuss how to set the storage options for S3.

Artifact Backend Storage
------------------------

When an artifact is an ML model, you can set ``mlflow_tracking_uri`` and ``mlflow_registry_uri`` (depending on how your MLflow is configured) to use MLflow as the storage backend for ML models;
i.e., ``lineapy.save(model, 'model', storage_backend='mlflow')`` saves the artifact (ML model) directly in MLflow while still registering it in the LineaPy artifact store.

For instance, if you want to use ``databricks`` as your MLflow tracking URI to save your ML models, you can set it with

.. code:: python

    lineapy.options.set('mlflow_tracking_uri', 'databricks')

or put it in the LineaPy configuration file. You can then run

.. code:: python

    lineapy.save(model, 'model', storage_backend='mlflow')

to save your artifact (ML model) in MLflow while still being able to use it as a typical LineaPy artifact.
If the ``model`` is not supported by MLflow, LineaPy falls back to its standard protocol to save the model as an artifact.

Furthermore, if ``default_ml_models_storage_backend='mlflow'`` (the default when you only set ``mlflow_tracking_uri``), there is no need to specify ``storage_backend='mlflow'`` in ``lineapy.save`` to save the model in MLflow.
Alternatively, you can set ``default_ml_models_storage_backend='lineapy'`` to save your artifacts (ML models) with the LineaPy backend by default and use MLflow only when you specify ``storage_backend='mlflow'`` in ``lineapy.save``.
42 changes: 42 additions & 0 deletions lineapy/_alembic/versions/07d0db31e15f_mlflow_integration.py
@@ -0,0 +1,42 @@
"""mlflow_integration
Revision ID: 07d0db31e15f
Revises: 4907800d9126
Create Date: 2022-11-03 16:26:37.217174
"""
import sqlalchemy as sa
from alembic import op

# revision identifiers, used by Alembic.
revision = "07d0db31e15f"
down_revision = "4907800d9126"
branch_labels = None
depends_on = None


def upgrade() -> None:
    # ### commands auto generated by Alembic - please adjust! ###
    op.create_table(
        "mlflow_artifact_storage",
        sa.Column("id", sa.Integer(), autoincrement=True, nullable=False),
        sa.Column("artifact_id", sa.Integer(), nullable=False),
        sa.Column("backend", sa.String(), nullable=False),
        sa.Column("tracking_uri", sa.String(), nullable=False),
        sa.Column("registry_uri", sa.String(), nullable=True),
        sa.Column("model_uri", sa.String(), nullable=False),
        sa.Column("model_flavor", sa.String(), nullable=False),
        sa.Column("delete_time", sa.DateTime(), nullable=True),
        sa.ForeignKeyConstraint(
            ["artifact_id"],
            ["artifact.id"],
        ),
        sa.PrimaryKeyConstraint("id"),
    )
    # ### end Alembic commands ###


def downgrade() -> None:
    # ### commands auto generated by Alembic - please adjust! ###
    op.drop_table("mlflow_artifact_storage")
    # ### end Alembic commands ###
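
To see what this migration produces, the sketch below recreates an equivalent table in an in-memory SQLite database and queries it. It is a hedged illustration rather than LineaPy code: the ``artifact`` table is reduced to its ``id`` column, the inserted values are made up, and treating a non-null ``delete_time`` as a soft delete is an assumption based on the nullable column, not something the migration itself states.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE artifact (id INTEGER PRIMARY KEY);
    CREATE TABLE mlflow_artifact_storage (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        artifact_id INTEGER NOT NULL REFERENCES artifact (id),
        backend VARCHAR NOT NULL,
        tracking_uri VARCHAR NOT NULL,
        registry_uri VARCHAR,
        model_uri VARCHAR NOT NULL,
        model_flavor VARCHAR NOT NULL,
        delete_time DATETIME
    );
    """
)

# Hypothetical row: artifact 1 stored as an sklearn model in MLflow.
conn.execute("INSERT INTO artifact (id) VALUES (1)")
conn.execute(
    "INSERT INTO mlflow_artifact_storage "
    "(artifact_id, backend, tracking_uri, model_uri, model_flavor) "
    "VALUES (?, ?, ?, ?, ?)",
    (1, "mlflow", "databricks", "models:/clf/1", "sklearn"),
)


def active_model_uris(connection: sqlite3.Connection) -> list:
    """Return model URIs of rows that have no delete_time set."""
    rows = connection.execute(
        "SELECT model_uri FROM mlflow_artifact_storage "
        "WHERE delete_time IS NULL ORDER BY id"
    ).fetchall()
    return [row[0] for row in rows]


before = active_model_uris(conn)

# Assumed soft delete: stamp delete_time instead of removing the row.
conn.execute(
    "UPDATE mlflow_artifact_storage SET delete_time = ? WHERE artifact_id = ?",
    (datetime.now(timezone.utc).isoformat(), 1),
)
after = active_model_uris(conn)
print(before, after)
```
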