Python wrapper classes for all user interfaces #750

timsaucer · 2024-07-09T12:10:04Z

Which issue does this PR close?

Closes #749.

Rationale for this change

As described the the issue linked above, some python users find the current approach to exposing the pyo3 generated functions and classes to be "unpythonic" (their words). This PR attempts to make a first class user interface by adding python documentation and type annotations in the place most python users will look. It has the added benefits of a cleaner help(datafusion.functions) and similar calls.

What changes are included in this PR?

Exposes some functions to python that were not previously available
Exposes some classes to python that were not previously available
Adds wrappers around classes
Adds wrappers around all functions within datafusion.functions
Adds documentation strings for all classes and functions.
Adds unit tests for some features that were not covered.
Adds convenience options, such as passing either a str or pathlib.Path for paths, updating select to accept column names as strings, and performing comparisons against literals.

Are there any user-facing changes?

Simplifies the module naming. Instead of datafusion.functions.functions it is simply datafusion.functions. Similar changes occur for other modules.
When performing expression comparison, you can pass in either an expression or a literal value. Instead of using df.filter(col("a") > lit(3)) you can simply do df.filter(col("a") > 3). Any value that can be coerced into a literal is valid for the RHS.
When doing a select operation you can pass in either expressions or column names as a string. Instead of doing df.select(col("a"), col("b") - col("c")) you can do df.select("a", col("b") - col("c")).
filter now accepts multiple arguments and will treat them as an AND filter. Instead of doing df.filter(col("a") > 3).filter(col("b") == "orange") you can use df.filter(col("a" > 3, col("b") == "orange").
All commands that accept string paths can also accept pathlib.Path objects.
plan, serde, producer, and consumer within datafusion.substrait are all marked as deprecated. Please use Plan, Serde, Producer, and Consumer, respectively.
For defining an UDF or UDAF, there is Volatility enum available for convenience. The str values are still accepted.

Example generated documenation:

…r facing python features

…tly exposed.

andygrove

This looks like a big step forward to me. I am not a Python expert though, so it would be good to get reviews from others.

andygrove · 2024-07-09T15:07:54Z

@jdye64 @charlesbluca fyi

python/datafusion/udf.py

python/datafusion/substrait.py

max-muoto

We can add a py.typed file to indicate to type-checkers that library is typed now (per PEP 561)

python/datafusion/record_batch.py

python/datafusion/expr.py

python/datafusion/context.py

python/datafusion/substrait.py

python/datafusion/functions.py

datapythonista · 2024-07-10T08:00:36Z

This looks great in terms of functionality, but I personally don't think it's worth to have wrappers for every DataFusion functionality just for the types and the docs. Afaik PyO3 creates Python docstrings from the Rust documentation of the exposed objects. So things can be documented in the existing Rust functions. And as you mentioned, for types, a .pyi file could be used. Besides a couple of other things, this seems to be what this PR provides, and I think things will be much easier to maintain if all this duplication of code in the wrappers doesn't exist.

For the renaming part datafusion.functions.functions -> datafusion.functions, I think the intended behavior is what I implemented in #751, and shouldn't be needed.

timsaucer · 2024-07-10T13:25:42Z

It's a well made point @datapythonista and maintainability is definitely a concern.

I would argue that going with a .pyi file approach is worse for maintainability than what I am proposing here. My reasoning is that if there is a change made to an underlying function or parameters on the rust side, then there is no validation that the associated .pyi files get updated. With the wrapper functions, the python unit tests may catch those changes.

What I'm really trying to address is a concern from users that the datafusion-python project doesn't feel "pythonic". I was hoping this approach would give 3 things

Documentation flushed out for all exposed classes and functions. As you correctly point out that could be done on the rust side with pyo3.
Type hinting. This cannot be done with pyo3 due to a hard limit on the C api for python. The existing work around is the .pyi files. This is important to a lot of users, especially those just starting out.
User friendly location for investigating the available functions and classes. A user can find the available functions and classes within python by calling help(datafusion) or dir. They can also look at the online documentation. However, there are still many who prefer to have the python code laid out for their viewing.

This is my reasoning. If the community prefers to go the route of not adding type hinting (or go with pyi approach) I'm willing to move things over to a different location.

src/common/data_type.rs

datapythonista · 2024-07-10T13:42:06Z

Thanks for the detailed information @timsaucer, and for all the work with this.

I'm not an expert on type hints but it's not immediately clear to me what would every approach validate in a reasonable way. By using wrappers we won't have types for objects that don't exist, which can happen in the pyi file. But other than that, feels like validations should ve mostly via mypy of the unittests, and should be the same for both approaches, no? Or am I missing something?

For your 3rd point, the PR I referenced in the previous message, #751, should address all the dir() calls, which should return the correct members. If I'm not missing anything help() would be working with point 1, when the objects have docstrings.

So, for me personally, I fully agree with the goals, but feels like a pyi file and documenting objects in Rust will make things simpler to maintain. I may surely be wrong.

Michael-J-Ward · 2024-07-10T13:48:28Z

The wrappers also allow us to offer more pythonic interfaces when the rust API is "clunky", at least by python standards.

An example of this is how @timsaucer cleaned up the named_struct interface

datafusion-python/python/datafusion/functions.py

Lines 946 to 954 in 79bb196

    
           def named_struct(name_pairs: list[(str, Expr)]) -> Expr: 
        
               """ 
        
               Returns a struct with the given names and arguments pairs 
        
               """ 
        
               name_pairs = [[Expr.literal(pair[0]), pair[1]] for pair in name_pairs] 
        
               # flatten 
        
               name_pairs = [x.expr for xs in name_pairs for x in xs] 
        
               return Expr(f.named_struct(*name_pairs))

timsaucer · 2024-07-10T14:05:58Z

@datapythonista I think your PR is great! It definitely cleans up a problem, but it's not the same one as what I was trying to say.

I know there are many users who look at the repositories to try to understand how a project works. All of this information can be found by looking in python using dir and such, but many people like to look through the repository to see how all the pieces fit together. For these users, having the user facing python code in one place makes it easier to understand the layout and design. Again, all of this same information can be found in other places. I believe that if we want to see greater adoption we should try to teach people in the way that is most comfortable to them.

Again, this isn't a hill I intend to die on. I want this PR to be helpful.

timsaucer · 2024-07-15T13:30:47Z

Updated all docstrings to match google format and added line length check into pre-commit configuration. The only thing I have left on my TODO list is updating the documentation building to get the cross references working right. I am currently testing mkdocs per @kylebarron recommendation.

Once that is complete, I'll move this from Draft to ready and ask for a re-review.

Thank you everyone for the thoughtful and productive discussion!

kylebarron · 2024-07-16T02:16:46Z

I am currently testing mkdocs per @kylebarron recommendation.

Given that this project is currently using sphinx for documentation, it would be a pretty big change to move from sphinx to mkdocs, including new configuration and converting existing re-structured text files to markdown. Maybe it would be better to discuss moving to mkdocs in another issue? I mentioned that in my workflow as I'm a big fan of mkdocs over sphinx, but this PR is already decently large.

timsaucer · 2024-07-17T02:09:25Z

Added a temporary work around for CI issues and added an issue to fix once the upstream is resolved. #763

Michael-J-Ward

Excellent!

benchmarks/db-benchmark/join-datafusion.py

examples/substrait.py

python/datafusion/__init__.py

andygrove

This is excellent @timsaucer. Thank you for all of the hard work.

IMO, having Python wrappers is the most practical way to make sure we offer the Pythonic experience that people want.

This is required as of `datafusion-python` v40. ref: apache/datafusion-python#750

timsaucer added 7 commits July 9, 2024 07:02

Expose missing functions to python

d00c00a

Initial commit for creating wrapper classes and functions for all use…

27e4f30

…r facing python features

Remove extra level of python path that is no longer required

a3429ab

Move import to only happen for type checking for hints

7937963

Comment out classes from __all__ in the top level that are not curren…

1f4c829

…tly exposed.

Add license comments

d7f5f68

Add missing import

79bb196

andygrove reviewed Jul 9, 2024

View reviewed changes