[REVIEW] Abstract Syntax Tree evaluator #5494

bdice · 2020-06-17T16:32:33Z

Purpose

This PR enables a new feature for constructing and evaluating abstract syntax trees (within some reasonable limitations). This is meant to assist with inequality joins. See #5493 for additional information about the proposed feature.

Resolves #5493. This will be used to implement #5401 and other features discussed in #5397.

Depends on #5716, #5735, #5832, #5859.

High-level APIs included:

- Evaluate an expression on a table, returning a new column. Also called "n-ary transform."

Data sources:

- Column data (any type, via type dispatcher)
- Literal values (scalars)

Binary operators:

- Arithmetic operators (ADD, SUB, MUL, DIV, TRUE_DIV, FLOOR_DIV, MOD, PYMOD, POW, BITWISE_AND, BITWISE_OR, BITWISE_XOR)
- Comparators (EQUAL, NOT_EQUAL, LESS, GREATER, LESS_EQUAL, GREATER_EQUAL)
- Logical operators (LOGICAL_AND, LOGICAL_OR)

Unary operators:

- Arithmetic operators (IDENTITY, SIN, COS, TAN, ARCSIN, ARCCOS, ARCTAN, SINH, COSH, TANH, ARCSINH, ARCCOSH, ARCTANH, EXP, LOG, SQRT, CBRT, CEIL, FLOOR, ABS, RINT, BIT_INVERT)
- Logical operators (NOT)
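To make "evaluate an expression on a table, returning a new column" concrete, here is a minimal conceptual sketch in Python. It is not the libcudf API: the names `ColumnRef`, `Literal`, `Operation`, and `evaluate_expression` are hypothetical, the table is a plain list of Python lists, and only a handful of the operators listed above are included.

```python
import math
import operator
from dataclasses import dataclass
from typing import Any, List, Sequence, Tuple, Union


@dataclass(frozen=True)
class ColumnRef:
    """Data source: a column of the input table (hypothetical name)."""
    index: int


@dataclass(frozen=True)
class Literal:
    """Data source: a scalar value (hypothetical name)."""
    value: Any


@dataclass(frozen=True)
class Operation:
    """An operator applied to one or two child nodes (hypothetical name)."""
    op: str
    operands: Tuple["Node", ...]


Node = Union[ColumnRef, Literal, Operation]

# A small subset of the operators named in the PR description.
BINARY_OPS = {
    "ADD": operator.add,
    "SUB": operator.sub,
    "MUL": operator.mul,
    "EQUAL": operator.eq,
    "LESS": operator.lt,
    "GREATER": operator.gt,
    "LOGICAL_AND": lambda a, b: bool(a) and bool(b),
    "LOGICAL_OR": lambda a, b: bool(a) or bool(b),
}
UNARY_OPS = {
    "IDENTITY": lambda a: a,
    "ABS": abs,
    "SQRT": math.sqrt,
    "NOT": lambda a: not a,
}


def evaluate_row(node: Node, table: Sequence[Sequence[Any]], row: int) -> Any:
    """Recursively evaluate the tree for a single row of the table."""
    if isinstance(node, ColumnRef):
        return table[node.index][row]
    if isinstance(node, Literal):
        return node.value
    args = [evaluate_row(child, table, row) for child in node.operands]
    ops = UNARY_OPS if len(args) == 1 else BINARY_OPS
    return ops[node.op](*args)


def evaluate_expression(table: Sequence[Sequence[Any]], expr: Node) -> List[Any]:
    """Evaluate the expression once per row and collect a new column."""
    num_rows = len(table[0]) if table else 0
    return [evaluate_row(expr, table, row) for row in range(num_rows)]


# (column 0 + 5) < column 1
table = [[1, 20, 3], [10, 20, 30]]
expr = Operation("LESS", (Operation("ADD", (ColumnRef(0), Literal(5))), ColumnRef(1)))
result = evaluate_expression(table, expr)
```

With the two-column table above, `result` holds one boolean per row. The sketch evaluates each row independently, which is the property that lets an evaluator of this shape run one row per GPU thread.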

Additional functionality (not in scope for this PR):

- Null value handling
- Automatic up/downcasting for binary operators (float + double requires an upcast of the float to a double, while int && bool requires a downcast of the int to a logical).
  - Note: The current single-dispatch logic requires operand types to match. Double dispatch took a long time to compile and negatively affected performance.
- Ternary operator (condition ? true_value : false_value)
- Nullary operators (RAND, NOW, ROW)
- Extracting components of timestamps (BlazingSQL supports this)
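The difference between the current matching-types rule and the proposed automatic casting can be sketched as follows. This is an illustration only: the type names, the `WIDENING_RANK` ordering, and both function names are assumptions made for the example and do not come from the PR.

```python
# Hypothetical type names and widening order, for illustration only.
WIDENING_RANK = {"int32": 0, "float32": 1, "float64": 2}
LOGICAL_OPS = {"LOGICAL_AND", "LOGICAL_OR"}


def single_dispatch_type(lhs: str, rhs: str) -> str:
    """Current behavior as described in the note: operand types must match."""
    if lhs != rhs:
        raise TypeError(f"operand types must match, got {lhs} and {rhs}")
    return lhs


def promoted_operand_type(op: str, lhs: str, rhs: str) -> str:
    """Out-of-scope behavior: choose a common operand type automatically."""
    if op in LOGICAL_OPS:
        # Logical operators would downcast both operands to a logical type.
        return "bool"
    # Arithmetic operators would upcast to the wider of the two types.
    return lhs if WIDENING_RANK[lhs] >= WIDENING_RANK[rhs] else rhs
```

Under these assumptions, `promoted_operand_type("ADD", "float32", "float64")` selects `"float64"`, while `single_dispatch_type("float32", "float64")` raises `TypeError`, mirroring the restriction in the note.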

GPUtester · 2020-06-17T16:33:14Z

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

codecov · 2020-06-18T19:04:59Z

Codecov Report

Merging #5494 into branch-0.16 will decrease coverage by 0.90%.
The diff coverage is n/a.

@@               Coverage Diff               @@
##           branch-0.16    #5494      +/-   ##
===============================================
- Coverage        83.21%   82.30%   -0.91%     
===============================================
  Files               92      101       +9     
  Lines            14730    16206    +1476     
===============================================
+ Hits             12258    13339    +1081     
- Misses            2472     2867     +395

| Impacted Files | Coverage Δ |
|---|---|
| python/cudf/cudf/_version.py | `44.80% <0.00%> (-0.72%)` ⬇️ |
| python/cudf/cudf/testing/fuzzer.py | `0.00% <0.00%> (ø)` |
| ...ython/dask_cudf/dask_cudf/io/tests/test_parquet.py | `100.00% <0.00%> (ø)` |
| python/cudf/cudf/_fuzz_testing/json.py | `0.00% <0.00%> (ø)` |
| python/cudf/cudf/_fuzz_testing/main.py | `0.00% <0.00%> (ø)` |
| python/cudf/cudf/benchmarks/bench_cudf_io.py | `42.85% <0.00%> (ø)` |
| python/cudf/cudf/_fuzz_testing/parquet.py | `0.00% <0.00%> (ø)` |
| python/cudf/cudf/benchmarks/conftest.py | `100.00% <0.00%> (ø)` |
| python/cudf/cudf/_fuzz_testing/utils.py | `0.00% <0.00%> (ø)` |
| python/cudf/cudf/_fuzz_testing/io.py | `0.00% <0.00%> (ø)` |

... and 13 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 937fd7e...3692b80.

harrism · 2020-08-23T23:35:49Z

My only outstanding concern is that we need at least some tests with larger input columns.

nvdbaranec

If I'm a random person browsing this code and I see "AST" I think "hmm, that sounds complex".

This is why I thought that the AST folder is (or should be) a detail API folder. Users don't create ASTs. n-ary transform should really be moved outside of the AST folder to a public API header, and the AST stuff should be detail stuff used by other APIs (binops, n-ary transform, inequality join).

For external software that wants to generate ASTs, that is also not "typical user" usage, so maybe this should be made clear.

harrism · 2020-09-24T00:10:09Z

> If I'm a random person browsing this code and I see "AST" I think "hmm, that sounds complex".
>
> This is why I thought that the AST folder is (or should be) a detail API folder. Users don't create ASTs. n-ary transform should really be moved outside of the AST folder to a public API header, and the AST stuff should be detail stuff used by other APIs (binops, n-ary transform, inequality join).
>
> For external software that wants to generate ASTs, that is also not "typical user" usage, so maybe this should be made clear.

I agree with this, but this could probably be done in a followup, do you agree @nvdbaranec ? Since @bdice's internship ended, I'm looking to move this PR forward as-is and file issues for things that need to be cleaned up as followup PRs (possibly by other contributors). Brad will be returning to RAPIDS, but I'm sure he has plenty to focus on currently with finishing his PhD! :)

I made comments on a few unresolved threads above asking what things are still TODO. @jrhemstad if you know the answers it would help if you can resolve anything that is already resolved, and comment on what still needs to be done. (Or @bdice if you are listening and have a minute to respond.)

bdice · 2020-09-24T01:57:02Z

> Brad will be returning to RAPIDS, but I'm sure he has plenty to focus on currently with finishing his PhD! :)

@harrism I appreciate your understanding -- I do have a lot on my plate at the moment. 😅 I replied to all open conversations that I saw, to clarify which items are still "to-do" and what could be done about them. We bit off a lot with this PR, so I'm glad to hear the plan to move it forward / file issues for the incomplete pieces.

harrism · 2020-09-29T02:13:28Z

rerun tests

harrism · 2020-10-06T04:36:44Z

rerun tests

This PR implements conditional joins using expressions that are decomposed into abstract syntax trees for evaluation. This PR builds on the AST evaluation framework established in #5494 and #7418, but significantly refactors the internals and generalizes them to enable 1) expressions on two tables and 2) operations on nullable columns. This PR uses the nested loop join code created in #5397 for inner joins, but also substantially generalizes that code to enable 1) all types of joins, 2) joins with arbitrary AST expressions rather than just equality, and 3) handling of null values (with user-specified `null_equality`).

A significant chunk of the code is currently out of place, but since this changeset is rather large I've opted not to move things in ways that will make reviewing this PR significantly more challenging. I will make a follow-up to address those issues once this PR is merged.

Authors:
- Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
- Jake Hemstad (https://github.com/jrhemstad)
- Conor Hoekstra (https://github.com/codereport)

URL: #8214
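The idea of a nested loop join driven by an arbitrary predicate, as summarized for #8214, can be sketched in a few lines of Python. This is a conceptual illustration rather than libcudf code: `conditional_join`, `less_than`, `equal_with_nulls`, and the `nulls_equal` flag are hypothetical names, rows are plain tuples, `None` stands in for a null, and treating a comparison against a null as "no match" is an assumption of the sketch.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Row = Tuple[Optional[int], ...]
Predicate = Callable[[Row, Row], bool]


def conditional_join(
    left: Sequence[Row], right: Sequence[Row], predicate: Predicate, how: str = "inner"
) -> List[Tuple[int, Optional[int]]]:
    """Nested loop join: test every (left, right) row pair with the predicate."""
    pairs: List[Tuple[int, Optional[int]]] = []
    for i, left_row in enumerate(left):
        matched = False
        for j, right_row in enumerate(right):
            if predicate(left_row, right_row):
                pairs.append((i, j))
                matched = True
        if how == "left" and not matched:
            # A left join keeps unmatched left rows, paired with no right row.
            pairs.append((i, None))
    return pairs


def less_than(left_row: Row, right_row: Row) -> bool:
    """Inequality predicate on the first column; nulls never match (assumption)."""
    a, b = left_row[0], right_row[0]
    if a is None or b is None:
        return False
    return a < b


def equal_with_nulls(a: Optional[int], b: Optional[int], nulls_equal: bool) -> bool:
    """Equality where the caller decides whether two nulls compare equal."""
    if a is None or b is None:
        return nulls_equal and a is None and b is None
    return a == b


left = [(1,), (5,), (None,)]
right = [(3,), (4,)]
inner_pairs = conditional_join(left, right, less_than, how="inner")
left_pairs = conditional_join(left, right, less_than, how="left")
```

Here `inner_pairs` contains only the index pairs where the left value is less than the right value, while `left_pairs` additionally keeps the unmatched left rows. The cost is proportional to the product of the two table sizes, which is the usual trade-off for supporting arbitrary predicates instead of equality only.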

bdice added 2 commits June 17, 2020 06:24

First draft of ast.

80aecd6

Fix style in tests CMakeLists.

65feb88

bdice added the "2 - In Progress" and "libcudf" labels Jun 17, 2020

bdice requested review from a team as code owners June 17, 2020 16:32

bdice self-assigned this Jun 17, 2020

bdice requested review from rgsl888prabhu and nvdbaranec June 17, 2020 16:32

bdice marked this pull request as draft June 17, 2020 16:32

bdice removed request for rgsl888prabhu and nvdbaranec June 17, 2020 16:32

bdice added 3 commits June 17, 2020 09:34

Update CHANGELOG.md.

bffe833

Working kernel. (Tests will fail without row dispatch.)

03accf4

Add row dispatch.

844379a

bdice added 2 commits June 19, 2020 09:11

Add comparators, refactor operation dispatch.

784a24b

Move to ast namespace.

c1ee9b5

bdice changed the title ~~[WIP] Abstract Syntax Tree evaluator~~ [WIP] Abstract Syntax Tree evaluator [skip ci] Jun 26, 2020

bdice added 10 commits June 26, 2020 14:29

Add operators header.

777f2b3

Interim work: added visitor pattern for tree parsing, improved data references, removed unnecessary templating.

f63d525

Remove templates from tests (not yet passing).

2cfca4a

Added column_reference and made progress on AST linearizer.

687d4d7

Use const& instead of shared pointers.

da054e8

Refactoring data references, eliminating TODOs.

aeb5a09

Implement operator!= for cudf::data_type (needed for std::not_equal_to<>).

5f871f5

Add intermediate counter, more linearizer info.

d673c8d

Starting device code.

7e04e62

Starting work on operator dispatch.

fc99a01

nvdbaranec requested changes Aug 24, 2020

Merge branch 'branch-0.16' into ast

1d9167c

Merge branch 'branch-0.16' into ast

ec0435e

harrism approved these changes Sep 24, 2020

nvdbaranec self-requested a review September 24, 2020 14:44

nvdbaranec approved these changes Sep 24, 2020

harrism added 3 commits September 25, 2020 07:56

Replace rmm::mr::get_default_resource()

bea507d

Merge branch 'branch-0.16' into ast

b44438a

Fix test utilities paths

1d0b36d

jrhemstad approved these changes Sep 24, 2020

harrism mentioned this pull request Sep 24, 2020

[BUG] Remaining testing and cleanup tasks for Abstract Syntax Tree #6320

Closed

4 tasks

Deleted r-value constructors for 'expression' object

e696736

I've removed the r-value reference overload for the expression objects. As you mentioned it caused a bug.

lamarrr mentioned this pull request Sep 25, 2020

Deleted r-value constructor overloads for 'expression' object bdice/cudf#1

Merged

Merge pull request #1 from lamarrr/patch-1

91fe9ef

Deleted r-value constructor overloads for 'expression' object

Merge branch 'branch-0.16' into ast

3692b80

harrism added the "5 - Ready to Merge" label and removed the "3 - Ready for Review" label Oct 6, 2020

kkraus14 merged commit c65b212 into rapidsai:branch-0.16 Oct 6, 2020

vyasr mentioned this pull request Jun 8, 2021

Enable AST-based joining #8214

Merged

jlowe mentioned this pull request Jul 19, 2021

[FEA] Java bindings for AST expressions #8773

Closed

vyasr mentioned this pull request Jul 19, 2021

Refactor all AST-related APIs and internals including conditional joins and compute_column #8783

Closed
