Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

starts_with function is serialised as UDF #578

Open
mpurins-coralogix opened this issue Dec 24, 2022 · 0 comments
Open

starts_with function is serialised as UDF #578

mpurins-coralogix opened this issue Dec 24, 2022 · 0 comments
Labels
bug Something isn't working

Comments

@mpurins-coralogix
Copy link
Contributor

Describe the bug
When physical plan includes datafusion builtin function starts_with then it is serialized as UDF.

To Reproduce

Following test (can be added in somewhere in https://github.com/apache/arrow-ballista/blob/master/ballista/core/src/serde/physical_plan/mod.rs) should pass but it fails with DataFusionError(Plan("There is no UDF named \"startswith\" in the registry"))

#[test]
    fn roundtrip_builtin_startswith_function() -> Result<()> {
        let field_a = Field::new("a", DataType::Utf8, false);
        let field_b = Field::new("b", DataType::Utf8, false);
        let schema = Arc::new(Schema::new(vec![field_a, field_b]));

        let input = Arc::new(EmptyExec::new(false, schema.clone()));

        let execution_props = ExecutionProps::new();

        let fun_expr = functions::create_physical_fun(
            &BuiltinScalarFunction::StartsWith,
            &execution_props,
        )?;

        let expr = ScalarFunctionExpr::new(
            &format!("{}", BuiltinScalarFunction::StartsWith), // Note that this is how ballista creates this name in https://github.com/apache/arrow-datafusion/blob/master/datafusion/physical-expr/src/functions.rs#L183
            fun_expr,
            vec![col("a", &schema)?, col("b", &schema)?],
            &DataType::Boolean,
        );

        let project =
            ProjectionExec::try_new(vec![(Arc::new(expr), "a".to_string())], input)?;

        roundtrip_test(Arc::new(project))
    }

Expected behavior
Above test should pass

Additional context
This is pretty much happening because UDFs and builtin functions are represented with the same ScalarFunctionExpr and ballista when serializing uses function name (which is startswith) and BuiltinScalarFunction::from_str (which matches only on starts_with) to differentiate between builtin and user defined functions.

While simplest fix probably would be in datafusion BuiltinScalarFunction::from_str to handle startswith as function name as well I believe that it would be better to fix it in some other way where builtin and udf functions are thrown in the same bag. Current approach means that in some given project code could unexpectedly break on datafusion updates if datafusion introduces new builtin function and project already had user defined function under the same name.

I wonder if it would be reasonable to separate builtin and user defined functions already in datafusion physical plan.

@mpurins-coralogix mpurins-coralogix added the bug Something isn't working label Dec 24, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

1 participant