Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a ScalarUDFImpl::simplfy() API #9289

Closed
Tracked by #9285
alamb opened this issue Feb 20, 2024 · 9 comments · Fixed by #9304
Closed
Tracked by #9285

Add a ScalarUDFImpl::simplfy() API #9289

alamb opened this issue Feb 20, 2024 · 9 comments · Fixed by #9304
Labels
enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented Feb 20, 2024

Is your feature request related to a problem or challenge?

As part of porting arrow_cast #9287 and make_array #9288, it has become clear that some functions have special simplification semantics:

  1. arrow_cast simplifies to cast of a particular data type. It is important for the rest of datafusion to understand the Cast semantics as the there are several special cases for Expr::Cast (such as unwrap_cast_in_comparison)
  2. make_array has special rewrite rules to combine / fold with array_append and array_prepaend in the simplifier (source link)

Also I think some systems may want to provide the ability to define UDFs that are more like macros (can be expressed in terms of other built in expressions), which needs some way for datafusion to "inline" the definition

Similarly, specialized functions (e.g replace regexp_match with a version that had the regexp pre-compiled ...) - #8051 sound very similar

Describe the solution you'd like

I propose adding a function such as the following, we could implement the simplifications for make_array and arrow_cast in the UDF (and not in the main simplify code):

/// Was the expression simplified?
enum Simplified {
  /// The function call was simplified to an entirely new Expr
  Rewritten(Expr),
  /// the function call could not be simplified, and the arguments
  /// are return unmodified
  Original(Vec<Expr>)
}
pub trait ScalarUDFImpl {
...

  /// Apply any function specific simplification rules
  ///
  /// Some functions like arrow_cast have special semantic simplifications
  /// (into `Expr::Cast` for example) that can improve planning.
  ///
  /// If there is a simpler representation of calling this function with the
  /// specified arguments, return `Simplified::Rewritten` with the simplification.
  /// If no such simplification was possible, returns `Simplified::Original` with
  /// the unmodified arguments (the default)
  ///
  /// This function should only apply simplifications specific to this function.
  /// DataFusion will automatically simplify the arguments with a variety
  /// of rewrites during optimization
  fn simplify(&self, args: Vec<Expr>) -> Result<Simplified> {
    Ok(Simplified::Original(args)
  }
  ...

}

Describe alternatives you've considered

There may be better

Additional context

No response

@alamb
Copy link
Contributor Author

alamb commented Feb 20, 2024

FYI @jayzhan211 -- I wonder if you have any thoughts about this ticket (or maybe have time to try implementing it).

I think a good test case would be to implement a udf function such as cast_to_int64 that was rewritten into cast(x as int64) or something

@alamb
Copy link
Contributor Author

alamb commented Feb 20, 2024

cc @universalmind303 and @thinkharderdev as I think you have been thinking about similar things

@Lordworms
Copy link
Contributor

In this case, should we add a function-level optimizer trying to rewrite the function? I think it would be a large match expression. I have done a similar issue today #9213 . Maybe I could do parts of it.

@alamb
Copy link
Contributor Author

alamb commented Feb 21, 2024

In this case, should we add a function-level optimizer trying to rewrite the function? I think it would be a large match expression. I have done a similar issue today #9213 . Maybe I could do parts of it.

Yes, @Lordworms I think that is an excellent idea. cc @mustafasrepo for your thoughts

Note that #9213 is for an aggregate function, which I think would have a similar API though it would be on a different struct (AggreagateUDF)

@mustafasrepo
Copy link
Contributor

In this case, should we add a function-level optimizer trying to rewrite the function? I think it would be a large match expression. I have done a similar issue today #9213 . Maybe I could do parts of it.

Yes, @Lordworms I think that is an excellent idea. cc @mustafasrepo for your thoughts

Note that #9213 is for an aggregate function, which I think would have a similar API though it would be on a different struct (AggreagateUDF)

I think, function level optimizer would be much more scalable. Also, simplification logic will be more contained. Also, once we can bring similar changes in ScalarUDFImpl::simplfy to AggregateUDF for support of this feature.

@alamb
Copy link
Contributor Author

alamb commented Feb 21, 2024

@jayzhan211 -- I think this proposal is also likely to be needed to get make_array moved into datafusion-functions-array - #9288

@jayzhan211
Copy link
Contributor

jayzhan211 commented Feb 21, 2024

@alamb How is simplify like for array concat operator rewrite? If we have MakeArrayUDF, args are for MakeArray only, but we need BinaryExpr that includes two array functions and 1 operator to start the simplification. It seems function level optimizer can only works for simplification like A function to B Expr / function.

@jayzhan211
Copy link
Contributor

simplify for arrow_cast is just trying to do actual casting x as i64 right?

@alamb
Copy link
Contributor Author

alamb commented Feb 26, 2024

@alamb How is simplify like for array concat operator rewrite? If we have MakeArrayUDF, args are for MakeArray only, but we need BinaryExpr that includes two array functions and 1 operator to start the simplification. It seems function level optimizer can only works for simplification like A function to B Expr / function.

🤔 that is a good point that the simplification is needed for StringConcat operator 🤔

simplify for arrow_cast is just trying to do actual casting x as i64 right?

Yes

Sorry for my delay, I have been out the last few days. I am catching up with reviews

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
4 participants