EBMs now work in Spark via ONNX and SynapseML #459

Open
brandongreenwell-8451 opened this issue Aug 2, 2023 · 1 comment

Comments

@brandongreenwell-8451
Contributor

brandongreenwell-8451 commented Aug 2, 2023

This is not an issue per se, but rather a suggestion for the FAQ or the example usage section of the docs. In short, we recently submitted an issue to Microsoft's SynapseML library to get Spark support for distributed scoring with EBM models. The fix came quickly, and you can now deploy EBM models via Spark. The basic idea is to convert the model to ONNX (via ebm2onnx) and then bring it into Spark via SynapseML. I think this would be useful to call out somewhere in the docs. The issue linked above contains a minimal working example that could be used; it was built on top of the Adult Census example in the EBM docs.

Happy to open a PR if you feel this is useful; just let me know where in the docs it would best fit.

Minimal example (requires SynapseML to be installed, along with ebm2onnx and interpret):

import numpy as np
import pandas as pd
import onnx
import ebm2onnx

from sklearn.model_selection import train_test_split
from interpret.glassbox import ExplainableBoostingClassifier
from synapse.ml.onnx import ONNXModel


# Read in adult data from UCI ML repo
df = pd.read_csv(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
    header=None)
df.columns = [
    "Age", "WorkClass", "fnlwgt", "Education", "EducationNum",
    "MaritalStatus", "Occupation", "Relationship", "Race", "Gender",
    "CapitalGain", "CapitalLoss", "HoursPerWeek", "NativeCountry", "Income"
]

# Sample data and split into train/test sets
seed = 42
np.random.seed(seed)
df = df.sample(frac=0.05, random_state=seed)
train_cols = df.columns[0:-1]
label = df.columns[-1]
X = df[train_cols]
y = df[label]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=seed)

# Fit a (default) EBM model
ebm = ExplainableBoostingClassifier()
ebm.fit(X_train, y_train)

def convert_model(model, input_df):
    onnx_model = ebm2onnx.to_onnx(
        model,
        ebm2onnx.get_dtype_from_pandas(input_df),
        predict_proba=True
    )
    # Pin the ONNX IR version (presumably for compatibility with the ONNX runtime SynapseML uses)
    onnx_model.ir_version = 4
    return onnx_model.SerializeToString()

# Load ONNX payload into an ONNXModel and inspect inputs/outputs.
payload = convert_model(ebm, input_df=X_train)
onnx_ml = ONNXModel().setModelPayload(payload)
print("Model inputs:" + str(onnx_ml.getModelInputs()))
print("Model outputs:" + str(onnx_ml.getModelOutputs()))

# Map the model input to the input dataframe's column name (FeedDict), and 
# map the output dataframe's column names to the model outputs (FetchDict)
onnx_ml = (
    onnx_ml.setDeviceType("CPU")
    .setFeedDict({"input": "features"})
    .setFetchDict({"probability": "probabilities", "prediction": "label"})
    .setMiniBatchSize(5000)
)

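# Coerce test data features to a Spark DataFrame and transform (i.e., compute and add scores).
# `spark` and `display` are assumed to be provided by the notebook environment
# (e.g., Databricks or Synapse); they are not defined in this script.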
X_test_sdf = spark.createDataFrame(X_test)
display(onnx_ml.transform(X_test_sdf))
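
The FeedDict above maps a single ONNX input named "input" to a column named "features". If the converted model instead exposes one input per feature column (worth checking in the printed `getModelInputs()` output), the mapping needs one entry per input. Below is a minimal sketch of building such a mapping, assuming the ONNX input names match the DataFrame column names exactly. `build_feed_dict` is a hypothetical helper, not part of SynapseML or ebm2onnx, and the column names are just a subset of the ones used above.

```python
def build_feed_dict(model_input_names, dataframe_columns):
    """Map each ONNX input name to the DataFrame column with the same name.

    Hypothetical helper: assumes input names and column names match exactly.
    Raises ValueError if any model input has no matching column.
    """
    available = set(dataframe_columns)
    missing = [name for name in model_input_names if name not in available]
    if missing:
        raise ValueError(f"No DataFrame column for model inputs: {missing}")
    return {name: name for name in model_input_names}


# Illustrative values only; in practice, take the input names from the
# inspected model and the columns from the Spark DataFrame being scored.
model_input_names = ["Age", "WorkClass", "HoursPerWeek"]
dataframe_columns = ["Age", "WorkClass", "fnlwgt", "HoursPerWeek"]

feed_dict = build_feed_dict(model_input_names, dataframe_columns)
print(feed_dict)
```

The resulting dictionary could then be passed to `setFeedDict(...)` in place of the hard-coded one.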
@paulbkoch
Collaborator

Hi @brandongreenwell-8451 -- This is really great! Yes, please submit a PR that adds this to the documentation. We use Jupyter Book for the docs, and all the notebooks for that go into https://github.com/interpretml/interpret/tree/develop/docs/interpret_docs

Then, I'd probably put this as a new file called "synapse" under "Deployment Guide" here:

- caption: Deployment Guide
  chapters:
  - file: deployment-guide
