Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW] Abstract Syntax Tree evaluator #5494

Merged
merged 186 commits into from
Oct 6, 2020
Merged

Conversation

bdice
Copy link
Contributor

@bdice bdice commented Jun 17, 2020

Purpose

This PR enables a new feature for constructing and evaluating abstract syntax trees (within some reasonable limitations). This is meant to assist with inequality joins. See #5493 for additional information about the proposed feature.

Resolves #5493. This will be used to implement #5401 and other features discussed in #5397.

Depends on #5716, #5735, #5832, #5859.

High-level APIs included:

  • Evaluate an expression on a table, returning a new column. Also called "n-ary transform."

Data sources:

  • Data source: Column data (any type, via type dispatcher)
  • Data source: Literal values (scalars)

Binary operators:

  • Arithmetic operators (ADD, SUB, MUL, DIV, TRUE_DIV, FLOOR_DIV, MOD, PYMOD, POW, BITWISE_AND, BITWISE_OR, BITWISE_XOR)
  • Comparators (EQUAL, NOT_EQUAL, LESS, GREATER, LESS_EQUAL, GREATER_EQUAL)
  • Logical operators (LOGICAL_AND, LOGICAL_OR)

Unary operators:

  • Arithmetic operators (IDENTITY, SIN, COS, TAN, ARCSIN, ARCCOS, ARCTAN, SINH, COSH, TANH, ARCSINH, ARCCOSH, ARCTANH, EXP, LOG, SQRT, CBRT, CEIL, FLOOR, ABS, RINT, BIT_INVERT)
  • Logical operators (NOT)

Additional functionality (not in scope for this PR):

  • Null value handling
  • Automatic up/downcasting for binary operators (float + double requires an upcast of the float to a double, while int && bool requires a downcast of the int to a logical).
    • Note: The current single-dispatch logic requires operand types to match. Double dispatch took a long time to compile and negatively affected performance.
  • Ternary operator (condition ? true_value : false_value)
  • Nullary operators (RAND, NOW, ROW)
  • Extracting components of timestamps (BlazingSQL supports this)

@bdice bdice added 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. labels Jun 17, 2020
@bdice bdice requested review from a team as code owners June 17, 2020 16:32
@bdice bdice self-assigned this Jun 17, 2020
@bdice bdice marked this pull request as draft June 17, 2020 16:32
@GPUtester
Copy link
Collaborator

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

@codecov
Copy link

codecov bot commented Jun 18, 2020

Codecov Report

Merging #5494 into branch-0.16 will decrease coverage by 0.90%.
The diff coverage is n/a.

Impacted file tree graph

@@               Coverage Diff               @@
##           branch-0.16    #5494      +/-   ##
===============================================
- Coverage        83.21%   82.30%   -0.91%     
===============================================
  Files               92      101       +9     
  Lines            14730    16206    +1476     
===============================================
+ Hits             12258    13339    +1081     
- Misses            2472     2867     +395     
Impacted Files Coverage Δ
python/cudf/cudf/_version.py 44.80% <0.00%> (-0.72%) ⬇️
python/cudf/cudf/testing/fuzzer.py 0.00% <0.00%> (ø)
...ython/dask_cudf/dask_cudf/io/tests/test_parquet.py 100.00% <0.00%> (ø)
python/cudf/cudf/_fuzz_testing/json.py 0.00% <0.00%> (ø)
python/cudf/cudf/_fuzz_testing/main.py 0.00% <0.00%> (ø)
python/cudf/cudf/benchmarks/bench_cudf_io.py 42.85% <0.00%> (ø)
python/cudf/cudf/_fuzz_testing/parquet.py 0.00% <0.00%> (ø)
python/cudf/cudf/benchmarks/conftest.py 100.00% <0.00%> (ø)
python/cudf/cudf/_fuzz_testing/utils.py 0.00% <0.00%> (ø)
python/cudf/cudf/_fuzz_testing/io.py 0.00% <0.00%> (ø)
... and 13 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 937fd7e...3692b80. Read the comment docs.

@bdice bdice changed the title [WIP] Abstract Syntax Tree evaluator [WIP] Abstract Syntax Tree evaluator [skip ci] Jun 26, 2020
@harrism
Copy link
Member

harrism commented Aug 23, 2020

My only outstanding concern is at least some tests with larger input columns.

Copy link
Contributor

@nvdbaranec nvdbaranec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I'm a random person browsing this code and I see "AST" I think "hmm, that sounds complex".

This is why I thought that the AST folder is (or should be) a detail API folder. Users don't create ASTs. n-ary transform should really be moved outside of the AST folder to a public API header, and the AST stuff should be detail stuff used by other APIs (binops, n-ary transform, inequality join).

For external software that wants to generate ASTs, that is also not "typical user" usage, so maybe this should be made clear.

@harrism
Copy link
Member

harrism commented Sep 24, 2020

If I'm a random person browsing this code and I see "AST" I think "hmm, that sounds complex".

This is why I thought that the AST folder is (or should be) a detail API folder. Users don't create ASTs. n-ary transform should really be moved outside of the AST folder to a public API header, and the AST stuff should be detail stuff used by other APIs (binops, n-ary transform, inequality join).

For external software that wants to generate ASTs, that is also not "typical user" usage, so maybe this should be made clear.

I agree with this, but this could probably be done in a followup, do you agree @nvdbaranec ? Since @bdice's internship ended, I'm looking to move this PR forward as-is and file issues for things that need to be cleaned up as followup PRs (possibly by other contributors). Brad will be returning to RAPIDS, but I'm sure he has plenty to focus on currently with finishing his PhD! :)

I made comments on a few unresolved threads above asking what things are still TODO. @jrhemstad if you know the answers it would help if you can resolve anything that is already resolved, and comment on what still needs to be done. (Or @bdice if you are listening and have a minute to respond.)

@bdice
Copy link
Contributor Author

bdice commented Sep 24, 2020

Brad will be returning to RAPIDS, but I'm sure he has plenty to focus on currently with finishing his PhD! :)

@harrism I appreciate your understanding -- I do have a lot on my plate at the moment. 😅 I replied to all open conversations that I saw, to clarify which items are still "to-do" and what could be done about them. We bit off a lot with this PR, so I'm glad to hear the plan to move it forward / file issues for the incomplete pieces.

@nvdbaranec nvdbaranec self-requested a review September 24, 2020 14:44
I've removed the r-value reference overload for the expression objects. As you mentioned it caused a bug.
Deleted r-value constructor overloads for 'expression' object
@harrism
Copy link
Member

harrism commented Sep 29, 2020

rerun tests

@harrism
Copy link
Member

harrism commented Oct 6, 2020

rerun tests

@harrism harrism added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Oct 6, 2020
@kkraus14 kkraus14 merged commit c65b212 into rapidsai:branch-0.16 Oct 6, 2020
@vyasr vyasr mentioned this pull request Jun 8, 2021
rapids-bot bot pushed a commit that referenced this pull request Jul 14, 2021
This PR implements conditional joins using expressions that are decomposed into abstract syntax trees for evaluation. This PR builds on the AST evaluation framework established in #5494 and #7418, but significantly refactors the internals and generalizes them to enable 1) expressions on two tables and 2) operations on nullable columns. This PR uses the nested loop join code created in #5397 for inner joins, but also substantially generalizes that code to enable 1) all types of joins, 2) joins with arbitrary AST expressions rather than just equality, and 3) handling of null values (with user-specified `null_equality`). A significant chunk of the code is currently out of place, but since this changeset is rather large I've opted not to move things in ways that will make reviewing this PR significantly more challenging. I will make a follow-up to address those issues once this PR is merged.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Jake Hemstad (https://github.com/jrhemstad)
  - Conor Hoekstra (https://github.com/codereport)

URL: #8214
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge libcudf Affects libcudf (C++/CUDA) code.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Abstract Syntax Tree evaluator
8 participants