EBMs now work in Spark via ONNX and SynapseML #459

brandongreenwell-8451 · 2023-08-02T17:17:11Z

This is not an issue per se, but rather a suggestion to add to the FAQ or example usage in the docs. In short, we recently submitted an issue to Microsoft's SynapseML library to get Spark support for distributed scoring with EBM models. The fix came quickly, and you can now deploy EBM models via Spark. The basic idea is to convert the model to ONNX (via ebm2onnx) and then bring it into Spark via SynapseML. I think this would be useful to call out somewhere in the docs. There's a minimal working example in that SynapseML issue that could be used (it was built on top of the Adult Census example in the EBM docs).

Happy to make a PR if you feel this is useful; just let me know where in the docs it would best fit.

Minimal example (but requires SynapseML to be installed):

import numpy as np
import pandas as pd
import onnx
import ebm2onnx

from sklearn.model_selection import train_test_split
from interpret.glassbox import ExplainableBoostingClassifier
from synapse.ml.onnx import ONNXModel


# Read in adult data from UCI ML repo
df = pd.read_csv(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
    header=None)
df.columns = [
    "Age", "WorkClass", "fnlwgt", "Education", "EducationNum",
    "MaritalStatus", "Occupation", "Relationship", "Race", "Gender",
    "CapitalGain", "CapitalLoss", "HoursPerWeek", "NativeCountry", "Income"
]

# Sample data and split into train/test sets
seed = 42
np.random.seed(seed)
df = df.sample(frac=0.05, random_state=seed)
train_cols = df.columns[0:-1]
label = df.columns[-1]
X = df[train_cols]
y = df[label]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=seed)

# Fit a (default) EBM model
ebm = ExplainableBoostingClassifier()
ebm.fit(X_train, y_train)

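def convert_model(model, input_df):
    """Convert a fitted EBM to a serialized ONNX model (bytes)."""
    onnx_model = ebm2onnx.to_onnx(
        model,
        ebm2onnx.get_dtype_from_pandas(input_df),  # infer input dtypes from the DataFrame
        predict_proba=True  # request class probabilities
    )
    onnx_model.ir_version = 4  # pin the ONNX IR version
    return onnx_model.SerializeToString()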

# Load ONNX payload into an ONNXModel and inspect inputs/outputs.
payload = convert_model(ebm, input_df=X_train)
onnx_ml = ONNXModel().setModelPayload(payload)
print("Model inputs:" + str(onnx_ml.getModelInputs()))
print("Model outputs:" + str(onnx_ml.getModelOutputs()))

# Map the model input to the input dataframe's column name (FeedDict), and 
# map the output dataframe's column names to the model outputs (FetchDict)
onnx_ml = (
    onnx_ml.setDeviceType("CPU")
    .setFeedDict({"input": "features"})
    .setFetchDict({"probability": "probabilities", "prediction": "label"})
    .setMiniBatchSize(5000)
)

# Coerce test data features to Spark DataFrame and transform (i.e., compute and add scores).
# Note: `spark` and `display` are assumed to be provided by the notebook environment.
X_test_sdf = spark.createDataFrame(X_test)
display(onnx_ml.transform(X_test_sdf))
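
The feed dict maps ONNX input names (keys) to Spark DataFrame column names (values), so it has to agree with whatever `getModelInputs()` printed above. As a minimal sketch, assuming the converted model turns out to expose one input per feature column with the same name as the column (check the printed inputs before relying on this), the mapping could be built and validated with a small helper. `build_feed_dict` is a hypothetical name, not part of SynapseML or ebm2onnx:

```python
def build_feed_dict(model_input_names, df_columns):
    """Map each ONNX input name to the DataFrame column of the same name.

    Assumption: the model exposes one input per feature column and the
    names match exactly. Raises ValueError if any input has no column.
    """
    columns = set(df_columns)
    missing = [name for name in model_input_names if name not in columns]
    if missing:
        raise ValueError(f"No DataFrame column for model inputs: {missing}")
    # Keys are ONNX input names, values are DataFrame column names.
    return {name: name for name in model_input_names}


# Illustrative input names only; in practice, take them from the printed model inputs.
feed = build_feed_dict(["Age", "WorkClass"], ["Age", "WorkClass", "fnlwgt"])
```

The result could then be passed to `setFeedDict(feed)` in place of the hard-coded mapping. Failing early on a missing column is easier to debug than an error raised inside the Spark job.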

paulbkoch · 2023-08-02T17:49:07Z

Hi @brandongreenwell-8451 -- This is really great! Yes, please submit a PR that adds this to the documentation. We use Jupyter Book for the docs, and all the notebooks for that go into https://github.com/interpretml/interpret/tree/develop/docs/interpret_docs

Then, I'd probably put this as a new file called "synapse" under "Deployment Guide" here:

interpret/docs/interpret_docs/_toc.yml

Lines 39 to 41 in 379ab3f

    
    - caption: Deployment Guide
      chapters:
      - file: deployment-guide
