Change pyfunc scoring server pandas format to split (mlflow#690)
* Use 'split' record format

* Fix azure test

* Format

* Test helper funcs fix

* Scoring server handle

* Print traceback when handling pyfunc server exception

* pyfunc scoring tests

* scoring server test file

* More tests

* Add scoring server tests

* Lint fix

* Azure var name

* Shorten java test name

* Docs and remove legacy sklearn serve_model

* Docs update

* Docs and new header

* Fix sagemaker scoring tests

* Azure docs update

* Return pandas record oriented frame

* Lint

* Docs, lint

* Docs improvement

* Adjust content type naming and semantics

* Content type adjustments

* Lint

* Address comments

* Address comments

* Address more comments

* Lint

* Include stacktrace as json key in exception text rather than formatted string for easier parsing

* Message fix

* Doc fixes

* Doc tweak

* remove redundant comments

* Another docs fix

* Fix content types

* Address docs comments

* Only log content type warning once

* Doc formatting

* Remove another instance of

* Spacing fix

* Address docs comments

* Address more docs comments

* Docs and java comments

* python docs tweaks

* Fix test and lint issue

* Lint and test fixes

* Tweak content type supported error response

* Spark test fix

* Remove unused sklearn imports

* Fix lint errors, sagemaker test

* Fix tests
dbczumar committed Nov 9, 2018
1 parent 1b09159 commit 1e0c2bd
Showing 26 changed files with 732 additions and 272 deletions.
106 changes: 80 additions & 26 deletions docs/source/models.rst
@@ -247,12 +247,34 @@ MLflow provides tools for deploying models on a local machine and to several pro
Not all deployment methods are available for all model flavors. Deployment is supported for the
Python Function format and all compatible formats.

.. _pyfunc_deployment:

Deploy a ``python_function`` model as a local REST API endpoint
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

MLflow can deploy models locally as REST API endpoints or use them to score CSV files directly.
This functionality is a convenient way of testing models before deploying to a remote model server.
You deploy the Python Function flavor locally using the CLI interface to the :py:mod:`mlflow.pyfunc` module.
The local REST API server accepts the following data formats as inputs:

* JSON-serialized Pandas DataFrames in the ``split`` orientation. For example,
``data = pandas_df.to_json(orient='split')``. This format is specified using a ``Content-Type``
request header value of ``application/json; format=pandas-split``. Starting in MLflow 0.9.0,
this will be the default format if ``Content-Type`` is ``application/json`` (i.e., with no format
specification).

* JSON-serialized Pandas DataFrames in the ``records`` orientation. *We do not recommend using
this format because it is not guaranteed to preserve column ordering.* Currently, this format is
specified using a ``Content-Type`` request header value of ``application/json; format=pandas-records``
or ``application/json``. Starting in MLflow 0.9.0, ``application/json`` will refer to the
``split`` format instead. For forward compatibility, we recommend using the ``split`` format
or specifying the ``application/json; format=pandas-records`` content type.

* CSV-serialized Pandas DataFrames. For example, ``data = pandas_df.to_csv()``. This format is
specified using a ``Content-Type`` request header value of ``text/csv``.

For more information about serializing Pandas DataFrames, see
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_json.html
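
These serialization options can be exercised directly with pandas. The sketch below builds each request body and notes the matching ``Content-Type`` header; the frame and its columns are purely illustrative:

```python
import json

import pandas as pd

# A toy frame standing in for real model input.
df = pd.DataFrame({"alcohol": [8.8], "pH": [3.0]})

# Content-Type: application/json; format=pandas-split
split_body = df.to_json(orient="split")

# Content-Type: application/json; format=pandas-records (column order not guaranteed)
records_body = df.to_json(orient="records")

# Content-Type: text/csv
csv_body = df.to_csv(index=False)

print(sorted(json.loads(split_body)))  # ['columns', 'data', 'index']
```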

* :py:func:`serve <mlflow.pyfunc.cli.serve>` deploys the model as a local REST API server.
* :py:func:`predict <mlflow.pyfunc.cli.predict>` uses the model to generate a prediction for a local
@@ -266,16 +288,23 @@ For more info, see:
mlflow pyfunc serve --help
mlflow pyfunc predict --help
.. _azureml_deployment:

Microsoft Azure ML
^^^^^^^^^^^^^^^^^^
The :py:mod:`mlflow.azureml` module can package ``python_function`` models into Azure ML container images.
These images can be deployed to Azure Kubernetes Service (AKS) and the Azure Container Instances (ACI)
platform for real-time serving. The resulting Azure ML ContainerImage will contain a webserver that
accepts the following data formats as input:

* JSON-serialized Pandas DataFrames in the ``split`` orientation. For example,
``data = pandas_df.to_json(orient='split')``. This format is specified using a ``Content-Type``
request header value of ``application/json``.

* :py:func:`build_image <mlflow.azureml.build_image>` registers an MLflow model with an existing Azure ML
workspace and builds an Azure ML container image for deployment to AKS and ACI. The `Azure ML SDK`_ is
required in order to use this function. *The Azure ML SDK requires Python 3. It cannot be installed with
earlier versions of Python.*

.. _Azure ML SDK: https://docs.microsoft.com/en-us/python/api/overview/azure/ml/intro?view=azure-ml-py

@@ -324,18 +353,25 @@ platform for real-time serving.
import requests
import json
# `sample_input` is a JSON-serialized Pandas DataFrame with the `split` orientation
sample_input = {
"residual sugar": {"0": 20.7},
"alcohol": {"0": 8.8},
"chlorides": {"0": 0.045},
"density": {"0": 1.001},
"sulphates": {"0": 0.45},
"total sulfur dioxide": {"0": 170.0},
"fixed acidity": {"0": 7.0},
"citric acid": {"0": 0.36},
"pH": {"0": 3.0},
"volatile acidity": {"0": 0.27},
"free sulfur dioxide": {"0": 45.0}
"columns": [
"alcohol",
"chlorides",
"citric acid",
"density",
"fixed acidity",
"free sulfur dioxide",
"pH",
"residual sugar",
"sulphates",
"total sulfur dioxide",
"volatile acidity"
],
"data": [
[8.8, 0.045, 0.36, 1.001, 7, 45, 3, 20.7, 0.45, 170, 0.27]
]
}
response = requests.post(
url=webservice.scoring_uri, data=json.dumps(sample_input),
@@ -358,19 +394,25 @@ platform for real-time serving.
scoring_uri=$(az ml service show --name <deployment-name> -v | jq -r ".scoringUri")
# `sample_input` is a JSON-serialized Pandas DataFrame with the `split` orientation
sample_input='
{
"residual sugar": {"0": 20.7},
"alcohol": {"0": 8.8},
"chlorides": {"0": 0.045},
"density": {"0": 1.001},
"sulphates": {"0": 0.45},
"total sulfur dioxide": {"0": 170.0},
"fixed acidity": {"0": 7.0},
"citric acid": {"0": 0.36},
"pH": {"0": 3.0},
"volatile acidity": {"0": 0.27},
"free sulfur dioxide": {"0": 45.0}
"columns": [
"alcohol",
"chlorides",
"citric acid",
"density",
"fixed acidity",
"free sulfur dioxide",
"pH",
"residual sugar",
"sulphates",
"total sulfur dioxide",
"volatile acidity"
],
"data": [
[8.8, 0.045, 0.36, 1.001, 7, 45, 3, 20.7, 0.45, 170, 0.27]
]
}'
echo $sample_input | curl -s -X POST $scoring_uri\
@@ -385,6 +427,8 @@ For more info, see:
mlflow azureml --help
mlflow azureml build-image --help
.. _sagemaker_deployment:

Deploy a ``python_function`` model on Amazon SageMaker
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

@@ -394,7 +438,17 @@ To deploy remotely to SageMaker you need to set up your environment and user acc
To export a custom model to SageMaker, you need an MLflow-compatible Docker image to be available on Amazon ECR.
MLflow provides a default Docker image definition; however, it is up to you to build the image and upload it to ECR.
MLflow includes the utility function ``build_and_push_container`` to perform this step. Once built and uploaded, you can use the MLflow
container for all MLflow models.
container for all MLflow models. Model webservers deployed using the :py:mod:`mlflow.sagemaker`
module accept the following data formats as input, depending on the deployment flavor:

* ``python_function``: For this deployment flavor, the endpoint accepts the same formats
as the pyfunc server. These formats are described in the
:ref:`pyfunc deployment documentation <pyfunc_deployment>`.

* ``mleap``: For this deployment flavor, the endpoint accepts *only*
JSON-serialized Pandas DataFrames in the ``split`` orientation. For example,
``data = pandas_df.to_json(orient='split')``. This format is specified using a ``Content-Type``
request header value of ``application/json``.

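Since both flavors accept the ``split`` orientation with a plain ``application/json`` content type, a single helper can prepare the request body regardless of flavor. This is an illustrative sketch, not part of MLflow; the endpoint invocation itself is omitted because it requires a live deployment:

```python
import json

import pandas as pd

def split_request(df: pd.DataFrame):
    """Return a (body, content_type) pair accepted by pyfunc and mleap endpoints."""
    return df.to_json(orient="split"), "application/json"

body, content_type = split_request(pd.DataFrame({"x": [1, 2]}))
print(content_type)              # application/json
print(json.loads(body)["data"])  # [[1], [2]]
```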
* :py:func:`run-local <mlflow.sagemaker.run_local>` deploys the model locally in a Docker
container. The image and the environment should be identical to how the model would be run
18 changes: 11 additions & 7 deletions docs/source/quickstart.rst
@@ -153,23 +153,27 @@ When you run the example, it outputs an MLflow run ID for that experiment. If yo
``mlflow ui``, you will also see that the run saved a ``model`` folder containing an ``MLmodel``
description file and a pickled scikit-learn model. You can pass the run ID and the path of the model
within the artifacts directory (here "model") to various tools. For example, MLflow includes a
simple REST server for Python-based models:

.. code:: bash
mlflow pyfunc serve -r <RUN_ID> -m model
.. note::

By default the server runs on port 5000. If that port is already in use, use the `--port` option to
specify a different port. For example: ``mlflow pyfunc serve --port 1234 -r <RUN_ID> -m model``

Once you have started the server, you can pass it some sample data and see the
predictions.

The following example uses ``curl`` to send a JSON-serialized Pandas DataFrame with the ``split``
orientation to the pyfunc server. For more information about the input data formats accepted by
the pyfunc model server, see the :ref:`MLflow deployment tools documentation <pyfunc_deployment>`.

.. code:: bash
curl -d '[{"x": 1}, {"x": -1}]' -H 'Content-Type: application/json' -X POST localhost:5000/invocations
curl -d '{"columns":["x"], "data":[[1], [-1]]}' -H 'Content-Type: application/json; format=pandas-split' -X POST localhost:5000/invocations
which returns::

Expand All @@ -178,7 +182,7 @@ which returns::
.. note::

The ``sklearn_logistic_regression/train.py`` script must be run with the same Python version as
the version of Python that runs ``mlflow pyfunc serve``. If they are not the same version,
the stacktrace below may appear::

File "/usr/local/lib/python3.6/site-packages/mlflow/sklearn.py", line 54, in _load_model_from_local_file
Expand Down
20 changes: 12 additions & 8 deletions docs/source/tutorial.rst
@@ -50,8 +50,8 @@ Training the Model
------------------


First, train a linear regression model that takes two hyperparameters: ``alpha`` and ``l1_ratio``.

.. plain-section::

.. container:: python
@@ -220,7 +220,7 @@ On this page, you can see a list of experiment runs with metrics you can use to
.. image:: _static/images/tutorial-compare.png

.. container:: R

.. image:: _static/images/tutorial-compare-R.png

You can use the search feature to quickly filter out many models. For example, the query ``metrics.rmse < 0.8``
@@ -367,7 +367,7 @@ in MLflow saved the model as an artifact within the run.

.. code::
mlflow pyfunc serve /Users/mlflow/mlflow-prototype/mlruns/0/7c1a0d5c42844dcdb8f5191146925174/artifacts/model -p 1234
.. note::

@@ -376,13 +376,17 @@ in MLflow saved the model as an artifact within the run.
``UnicodeDecodeError: 'ascii' codec can't decode byte 0x9f in position 1: ordinal not in range(128)``
or ``raise ValueError, "unsupported pickle protocol: %d"``.

Once you have deployed the server, you can pass it some sample data and see the
predictions. The following example uses ``curl`` to send a JSON-serialized Pandas DataFrame
with the ``split`` orientation to the pyfunc server. For more information about the input data
formats accepted by the pyfunc model server, see the
:ref:`MLflow deployment tools documentation <pyfunc_deployment>`.

.. code::
curl -X POST -H "Content-Type:application/json" --data '[{"fixed acidity": 6.2, "volatile acidity": 0.66, "citric acid": 0.48, "residual sugar": 1.2, "chlorides": 0.029, "free sulfur dioxide": 29, "total sulfur dioxide": 75, "density": 0.98, "pH": 3.33, "sulphates": 0.39, "alcohol": 12.8}]' http://127.0.0.1:1234/invocations
curl -X POST -H "Content-Type:application/json; format=pandas-split" --data '{"columns":["alcohol", "chlorides", "citric acid", "density", "fixed acidity", "free sulfur dioxide", "pH", "residual sugar", "sulphates", "total sulfur dioxide", "volatile acidity"],"data":[[12.8, 0.029, 0.48, 0.98, 6.2, 29, 3.33, 1.2, 0.39, 75, 0.66]]}' http://127.0.0.1:1234/invocations
the server should respond with output similar to::

{"predictions": [6.379428821398614]}

@@ -416,7 +420,7 @@ in MLflow saved the model as an artifact within the run.
.. image:: _static/images/tutorial-serving-r.png

.. note::

By default, a model is served using the R packages available. To ensure the environment serving
the prediction function matches the model, set ``restore = TRUE`` when calling
``mlflow_rfunc_serve()``.
9 changes: 7 additions & 2 deletions mlflow/azureml/__init__.py
@@ -31,6 +31,10 @@ def build_image(model_path, workspace, run_id=None, image_name=None, model_name=
The resulting image can be deployed as a web service to Azure Container Instances (ACI) or
Azure Kubernetes Service (AKS).
The resulting Azure ML ContainerImage will contain a webserver that processes model queries.
For information about the input data formats accepted by this webserver, see the
:ref:`MLflow deployment tools documentation <azureml_deployment>`.
:param model_path: The path to MLflow model for which the image will be built. If a run id
is specified, this should be a run-relative path. Otherwise, it
should be a local path.
@@ -307,6 +311,7 @@ def _get_mlflow_azure_resource_name():
from azureml.core.model import Model
from mlflow.pyfunc import load_pyfunc
from mlflow.pyfunc.scoring_server import parse_json_input
from mlflow.utils import get_jsonable_obj
@@ -316,8 +321,8 @@ def init():
model = load_pyfunc(model_path)
def run(json_input):
input_df = parse_json_input(json_input=json_input, orientation="split")
return get_jsonable_obj(model.predict(input_df))
"""
4 changes: 4 additions & 0 deletions mlflow/azureml/cli.py
@@ -51,6 +51,10 @@ def build_image(model_path, workspace_name, subscription_id, run_id, image_name,
Register an MLflow model with Azure ML and build an Azure ML ContainerImage for deployment.
The resulting image can be deployed as a web service to Azure Container Instances (ACI) or
Azure Kubernetes Service (AKS).
The resulting Azure ML ContainerImage will contain a webserver that processes model queries.
For information about the input data formats accepted by this webserver, see the following
documentation: https://www.mlflow.org/docs/latest/models.html#azureml-deployment.
"""
# The Azure ML SDK is only compatible with Python 3. However, this CLI should still be
accessible for inspection from Python 2. Therefore, we will only import from the SDK
2 changes: 0 additions & 2 deletions mlflow/cli.py
@@ -9,7 +9,6 @@

import mlflow.azureml.cli
import mlflow.projects as projects
import mlflow.sklearn
import mlflow.data
import mlflow.experiments
import mlflow.pyfunc.cli
@@ -204,7 +203,6 @@ def server(file_store, default_artifact_root, host, port, workers, static_prefix
sys.exit(1)


cli.add_command(mlflow.sklearn.commands)
cli.add_command(mlflow.data.download)
cli.add_command(mlflow.pyfunc.cli.commands)
cli.add_command(mlflow.rfunc.cli.commands)
16 changes: 14 additions & 2 deletions mlflow/exceptions.py
@@ -10,16 +10,28 @@ class MlflowException(Exception):
for debugging purposes. If the error text is sensitive, raise a generic `Exception` object
instead.
"""
def __init__(self, message, error_code=INTERNAL_ERROR, **kwargs):
"""
:param message: The message describing the error that occurred. This will be included in the
exception's serialized JSON representation.
:param error_code: An appropriate error code for the error that occurred; it will be included
in the exception's serialized JSON representation. This should be one of
the codes listed in the `mlflow.protos.databricks_pb2` proto.
:param kwargs: Additional key-value pairs to include in the serialized JSON representation
of the MlflowException.
"""
try:
self.error_code = ErrorCode.Name(error_code)
except (ValueError, TypeError):
self.error_code = ErrorCode.Name(INTERNAL_ERROR)
self.message = message
self.json_kwargs = kwargs
super(MlflowException, self).__init__(message)

def serialize_as_json(self):
exception_dict = {'error_code': self.error_code, 'message': self.message}
exception_dict.update(self.json_kwargs)
return json.dumps(exception_dict)
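
The serialization pattern above — folding extra keyword arguments into the exception's JSON — can be sketched in isolation. The class below is a simplified stand-in written for illustration (the real class also maps ``error_code`` through ``ErrorCode.Name``, omitted here):

```python
import json

class SketchException(Exception):
    """Simplified stand-in for MlflowException's JSON serialization."""

    def __init__(self, message, error_code="INTERNAL_ERROR", **kwargs):
        self.error_code = error_code
        self.message = message
        self.json_kwargs = kwargs  # extra keys, e.g. a stack trace, folded into the JSON
        super().__init__(message)

    def serialize_as_json(self):
        exception_dict = {"error_code": self.error_code, "message": self.message}
        exception_dict.update(self.json_kwargs)
        return json.dumps(exception_dict)

exc = SketchException("bad payload", error_code="BAD_REQUEST", stack_trace="trace...")
print(json.loads(exc.serialize_as_json())["stack_trace"])  # trace...
```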


class RestException(MlflowException):
Expand Down
@@ -55,17 +55,37 @@ public MLeapPredictor(String modelDataPath, String inputSchemaPath) {
@Override
protected PredictorDataWrapper predict(PredictorDataWrapper input)
throws PredictorEvaluationException {
PandasSplitOrientedDataFrame pandasFrame = null;
try {
pandasFrame = PandasSplitOrientedDataFrame.fromJson(input.toJson());
} catch (IOException e) {
logger.error(
"Encountered a JSON conversion error during conversion of Pandas dataframe to LeapFrame.",
"Encountered a JSON parsing error during conversion of input to a Pandas DataFrame"
+ " representation.",
e);
throw new PredictorEvaluationException(
"Failed to transform input into a JSON representation of an MLeap dataframe."
+ " Please ensure that the input is a JSON-serialized Pandas Dataframe"
+ " with the `record` orientation.",
"Encountered a JSON parsing error while transforming input into a Pandas DataFrame"
+ " representation. Ensure that the input is a JSON-serialized Pandas DataFrame"
+ " with the `split` orientation.",
e);
} catch (InvalidSchemaException e) {
logger.error(
"Encountered a schema mismatch while transforming input into a Pandas DataFrame"
+ " representation.",
e);
throw new PredictorEvaluationException(
"Encountered a schema mismatch while transforming input into a Pandas DataFrame"
+ " representation. Ensure that the input is a JSON-serialized Pandas DataFrame"
+ " with the `split` orientation.",
e);
} catch (IllegalArgumentException e) {
logger.error(
"Failed to transform input into a Pandas DataFrame because the parsed frame is invalid.",
e);
throw new PredictorEvaluationException(
"Failed to transform input into a Pandas DataFrame because the parsed frame is invalid."
+ " Ensure that the input is a JSON-serialized Pandas DataFrame with the `split`"
+ " orientation.",
e);
}

