Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Python wrapper classes for all user interfaces #750

Merged
merged 55 commits into from
Jul 18, 2024

Conversation

timsaucer
Copy link
Contributor

@timsaucer timsaucer commented Jul 9, 2024

Which issue does this PR close?

Closes #749.

Rationale for this change

As described the the issue linked above, some python users find the current approach to exposing the pyo3 generated functions and classes to be "unpythonic" (their words). This PR attempts to make a first class user interface by adding python documentation and type annotations in the place most python users will look. It has the added benefits of a cleaner help(datafusion.functions) and similar calls.

What changes are included in this PR?

  • Exposes some functions to python that were not previously available
  • Exposes some classes to python that were not previously available
  • Adds wrappers around classes
  • Adds wrappers around all functions within datafusion.functions
  • Adds documentation strings for all classes and functions.
  • Adds unit tests for some features that were not covered.
  • Adds convenience options, such as passing either a str or pathlib.Path for paths, updating select to accept column names as strings, and performing comparisons against literals.

Are there any user-facing changes?

  • Simplifies the module naming. Instead of datafusion.functions.functions it is simply datafusion.functions. Similar changes occur for other modules.
  • When performing expression comparison, you can pass in either an expression or a literal value. Instead of using df.filter(col("a") > lit(3)) you can simply do df.filter(col("a") > 3). Any value that can be coerced into a literal is valid for the RHS.
  • When doing a select operation you can pass in either expressions or column names as a string. Instead of doing df.select(col("a"), col("b") - col("c")) you can do df.select("a", col("b") - col("c")).
  • filter now accepts multiple arguments and will treat them as an AND filter. Instead of doing df.filter(col("a") > 3).filter(col("b") == "orange") you can use df.filter(col("a" > 3, col("b") == "orange").
  • All commands that accept string paths can also accept pathlib.Path objects.
  • plan, serde, producer, and consumer within datafusion.substrait are all marked as deprecated. Please use Plan, Serde, Producer, and Consumer, respectively.
  • For defining an UDF or UDAF, there is Volatility enum available for convenience. The str values are still accepted.

Example generated documenation:
Screenshot 2024-07-09 at 8 49 27 AM

Copy link
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a big step forward to me. I am not a Python expert though, so it would be good to get reviews from others.

@andygrove
Copy link
Member

@jdye64 @charlesbluca fyi

python/datafusion/udf.py Outdated Show resolved Hide resolved
python/datafusion/substrait.py Outdated Show resolved Hide resolved
Copy link

@max-muoto max-muoto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can add a py.typed file to indicate to type-checkers that library is typed now (per PEP 561)

python/datafusion/expr.py Outdated Show resolved Hide resolved
python/datafusion/context.py Outdated Show resolved Hide resolved
python/datafusion/context.py Outdated Show resolved Hide resolved
python/datafusion/context.py Outdated Show resolved Hide resolved
python/datafusion/context.py Outdated Show resolved Hide resolved
python/datafusion/context.py Outdated Show resolved Hide resolved
python/datafusion/substrait.py Outdated Show resolved Hide resolved
@datapythonista
Copy link

This looks great in terms of functionality, but I personally don't think it's worth to have wrappers for every DataFusion functionality just for the types and the docs. Afaik PyO3 creates Python docstrings from the Rust documentation of the exposed objects. So things can be documented in the existing Rust functions. And as you mentioned, for types, a .pyi file could be used. Besides a couple of other things, this seems to be what this PR provides, and I think things will be much easier to maintain if all this duplication of code in the wrappers doesn't exist.

For the renaming part datafusion.functions.functions -> datafusion.functions, I think the intended behavior is what I implemented in #751, and shouldn't be needed.

@timsaucer
Copy link
Contributor Author

It's a well made point @datapythonista and maintainability is definitely a concern.

I would argue that going with a .pyi file approach is worse for maintainability than what I am proposing here. My reasoning is that if there is a change made to an underlying function or parameters on the rust side, then there is no validation that the associated .pyi files get updated. With the wrapper functions, the python unit tests may catch those changes.

What I'm really trying to address is a concern from users that the datafusion-python project doesn't feel "pythonic". I was hoping this approach would give 3 things

  1. Documentation flushed out for all exposed classes and functions. As you correctly point out that could be done on the rust side with pyo3.
  2. Type hinting. This cannot be done with pyo3 due to a hard limit on the C api for python. The existing work around is the .pyi files. This is important to a lot of users, especially those just starting out.
  3. User friendly location for investigating the available functions and classes. A user can find the available functions and classes within python by calling help(datafusion) or dir. They can also look at the online documentation. However, there are still many who prefer to have the python code laid out for their viewing.

This is my reasoning. If the community prefers to go the route of not adding type hinting (or go with pyi approach) I'm willing to move things over to a different location.

@datapythonista
Copy link

Thanks for the detailed information @timsaucer, and for all the work with this.

I'm not an expert on type hints but it's not immediately clear to me what would every approach validate in a reasonable way. By using wrappers we won't have types for objects that don't exist, which can happen in the pyi file. But other than that, feels like validations should ve mostly via mypy of the unittests, and should be the same for both approaches, no? Or am I missing something?

For your 3rd point, the PR I referenced in the previous message, #751, should address all the dir() calls, which should return the correct members. If I'm not missing anything help() would be working with point 1, when the objects have docstrings.

So, for me personally, I fully agree with the goals, but feels like a pyi file and documenting objects in Rust will make things simpler to maintain. I may surely be wrong.

@Michael-J-Ward
Copy link
Contributor

The wrappers also allow us to offer more pythonic interfaces when the rust API is "clunky", at least by python standards.

An example of this is how @timsaucer cleaned up the named_struct interface

def named_struct(name_pairs: list[(str, Expr)]) -> Expr:
"""
Returns a struct with the given names and arguments pairs
"""
name_pairs = [[Expr.literal(pair[0]), pair[1]] for pair in name_pairs]
# flatten
name_pairs = [x.expr for xs in name_pairs for x in xs]
return Expr(f.named_struct(*name_pairs))

@timsaucer
Copy link
Contributor Author

@datapythonista I think your PR is great! It definitely cleans up a problem, but it's not the same one as what I was trying to say.

I know there are many users who look at the repositories to try to understand how a project works. All of this information can be found by looking in python using dir and such, but many people like to look through the repository to see how all the pieces fit together. For these users, having the user facing python code in one place makes it easier to understand the layout and design. Again, all of this same information can be found in other places. I believe that if we want to see greater adoption we should try to teach people in the way that is most comfortable to them.

Again, this isn't a hill I intend to die on. I want this PR to be helpful.

@timsaucer
Copy link
Contributor Author

Updated all docstrings to match google format and added line length check into pre-commit configuration. The only thing I have left on my TODO list is updating the documentation building to get the cross references working right. I am currently testing mkdocs per @kylebarron recommendation.

Once that is complete, I'll move this from Draft to ready and ask for a re-review.

Thank you everyone for the thoughtful and productive discussion!

@kylebarron
Copy link
Contributor

I am currently testing mkdocs per @kylebarron recommendation.

Given that this project is currently using sphinx for documentation, it would be a pretty big change to move from sphinx to mkdocs, including new configuration and converting existing re-structured text files to markdown. Maybe it would be better to discuss moving to mkdocs in another issue? I mentioned that in my workflow as I'm a big fan of mkdocs over sphinx, but this PR is already decently large.

@timsaucer timsaucer marked this pull request as ready for review July 16, 2024 12:53
@timsaucer
Copy link
Contributor Author

Added a temporary work around for CI issues and added an issue to fix once the upstream is resolved. #763

Copy link
Contributor

@Michael-J-Ward Michael-J-Ward left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent!

benchmarks/db-benchmark/join-datafusion.py Outdated Show resolved Hide resolved
examples/substrait.py Outdated Show resolved Hide resolved
python/datafusion/__init__.py Outdated Show resolved Hide resolved
Copy link
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is excellent @timsaucer. Thank you for all of the hard work.

IMO, having Python wrappers is the most practical way to make sure we offer the Pythonic experience that people want.

@andygrove andygrove merged commit aa8aa9c into apache:main Jul 18, 2024
11 checks passed
@timsaucer timsaucer deleted the tsaucer/python-explicit-imports branch August 1, 2024 12:13
Michael-J-Ward added a commit to Michael-J-Ward/datafusion-feedstock that referenced this pull request Sep 18, 2024
This is required as of `datafusion-python` v40.

ref: apache/datafusion-python#750
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add documentation and annotations for all user facing python classes and functions