Support for filters in the Druid Delta Lake connector #16288

Merged 24 commits on Apr 29, 2024

@abhishekrb19 abhishekrb19 commented Apr 15, 2024

This patch adds support for Delta filters in the Delta Lake input source. Before this patch, the Delta Lake connector read every Delta file in the latest snapshot, which could be inefficient. With filters, the connector translates them into Delta predicates and pushes those predicates down to the underlying Delta Kernel. The Kernel can then perform data skipping and read only the Delta files that match the filter predicates.
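
To make the data-skipping idea concrete, here is a minimal sketch, assuming a table partitioned by name with one file per employee. The classes SketchDataFile and DataSkippingSketch, the file paths, and the lexicographic string comparison are all assumptions made for illustration; none of this is the Delta Kernel API or the connector's code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical stand-in for one data file in a Delta snapshot.
// Not a Delta Kernel class; it only carries what the sketch needs.
final class SketchDataFile
{
  final String path;
  final String namePartition; // value of the "name" partition column

  SketchDataFile(String path, String namePartition)
  {
    this.path = path;
    this.namePartition = namePartition;
  }
}

final class DataSkippingSketch
{
  // A made-up snapshot with one file per employee name.
  static List<SketchDataFile> exampleSnapshot()
  {
    return Arrays.asList(
        new SketchDataFile("name=Employee1/part-0.parquet", "Employee1"),
        new SketchDataFile("name=Employee2/part-0.parquet", "Employee2"),
        new SketchDataFile("name=Employee3/part-0.parquet", "Employee3"),
        new SketchDataFile("name=Employee4/part-0.parquet", "Employee4")
    );
  }

  // Return the paths of files whose partition value satisfies the predicate.
  // Files that cannot match are skipped without being opened.
  static List<String> filesToRead(List<SketchDataFile> snapshot, Predicate<String> namePredicate)
  {
    List<String> selected = new ArrayList<>();
    for (SketchDataFile file : snapshot) {
      if (namePredicate.test(file.namePartition)) {
        selected.add(file.path);
      }
    }
    return selected;
  }

  public static void main(String[] args)
  {
    // Rough equivalent of {"type": ">=", "column": "name", "value": "Employee3"}
    List<String> selected = filesToRead(exampleSnapshot(), name -> name.compareTo("Employee3") >= 0);
    System.out.println(selected);
  }
}
```

Without a filter, all four files in this made-up snapshot would be read; with the filter, only the files for Employee3 and Employee4 are selected.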

Description

An example of a >= range filter:

REPLACE INTO "employee-filter1" OVERWRITE ALL
SELECT "name", MILLIS_TO_TIMESTAMP(CAST("birthday" AS BIGINT) * 1000) AS __time, "id", "salary" FROM TABLE(
  EXTERN(
    '{
      "type": "delta",
      "tablePath": "/Users/foo/employee-table",
      "filter": {
        "type": ">=",
        "column": "name",
        "value": "Employee3"
      }
    }',
    '{"type":"json"}'
)) EXTEND ("birthday" BIGINT, "name" VARCHAR, "id" BIGINT, "age" BIGINT, "salary" FLOAT)
PARTITIONED BY DAY
CLUSTERED BY name

An example of an AND filter:

REPLACE INTO "employee-filter2" OVERWRITE ALL
SELECT "name", MILLIS_TO_TIMESTAMP(CAST("birthday" AS BIGINT) * 1000) AS __time, "id", "salary" FROM TABLE(
  EXTERN(
    '{
      "type": "delta",
      "tablePath": "/Users/foo/employee-table",
      "filter": {
        "type": "and",
        "filters": [
          {
            "type": "<=",
            "column": "age",
            "value": "30"
          },
          {
            "type": ">=",
            "column": "name",
            "value": "Employee4"
          }
        ]
      }
    }',
    '{"type":"json"}'
)) EXTEND ("birthday" BIGINT, "name" VARCHAR, "id" BIGINT, "age" BIGINT, "salary" FLOAT)
PARTITIONED BY DAY
CLUSTERED BY name
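
In the AND example the value for the BIGINT column age is supplied as the string "30", and the commit history mentions handling NumberFormatException to give a nicer error message. The following is a minimal sketch of what such a conversion could look like. The class FilterValueSketch, its ColumnType enum, and the error message wording are hypothetical; the actual conversion logic in the connector may differ.

```java
// Hypothetical sketch: convert a string filter value to a typed literal
// based on the column's type. Not the connector's actual code.
final class FilterValueSketch
{
  // Assumed, simplified set of column types.
  enum ColumnType { LONG, DOUBLE, STRING }

  static Object parseValue(String column, ColumnType type, String rawValue)
  {
    try {
      switch (type) {
        case LONG:
          return Long.parseLong(rawValue);
        case DOUBLE:
          return Double.parseDouble(rawValue);
        default:
          return rawValue;
      }
    }
    catch (NumberFormatException e) {
      // Wrap the low-level exception in a message that names the column and type.
      throw new IllegalArgumentException(
          String.format("Cannot parse filter value[%s] for column[%s] as type[%s]", rawValue, column, type),
          e
      );
    }
  }

  public static void main(String[] args)
  {
    System.out.println(parseValue("age", ColumnType.LONG, "30"));
    System.out.println(parseValue("name", ColumnType.STRING, "Employee4"));
  }
}
```

In this sketch, a value such as "thirty" for a LONG column fails with an error that identifies the offending column rather than a bare NumberFormatException.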

Core changes:

  • Add the DeltaFilter interface and implementations for and, or, not, =, >, >=, <, <=.
  • Expose an optional filter in DeltaInputSource. If supplied, the filter is translated to a Delta Predicate that is pushed down to the Kernel to prune data files.
  • The design of these objects is similar to IcebergFilter: the filters can form an expression tree that translates directly to the underlying Delta Predicate used by the Kernel.
  • Update the ingestion and Delta docs to show the usage of these filter objects.
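
The expression-tree design described above can be sketched as follows. This is a simplified illustration under stated assumptions: SketchFilter, SketchComparison, SketchJunction, and SketchNot are hypothetical names, and the tree is rendered to a plain string rather than to a real Delta Kernel Predicate. It is not the DeltaFilter interface from this PR.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical filter node: every node can render itself as a predicate string.
interface SketchFilter
{
  String toPredicate();
}

// Leaf node for =, >, >=, <, <= between a column and a literal value.
final class SketchComparison implements SketchFilter
{
  private final String operator;
  private final String column;
  private final String value;

  SketchComparison(String operator, String column, String value)
  {
    this.operator = operator;
    this.column = column;
    this.value = value;
  }

  @Override
  public String toPredicate()
  {
    return column + " " + operator + " '" + value + "'";
  }
}

// Inner node for AND / OR over any number of child filters.
final class SketchJunction implements SketchFilter
{
  private final String keyword; // "AND" or "OR"
  private final List<SketchFilter> children;

  SketchJunction(String keyword, SketchFilter... children)
  {
    this.keyword = keyword;
    this.children = Arrays.asList(children);
  }

  @Override
  public String toPredicate()
  {
    List<String> parts = new ArrayList<>();
    for (SketchFilter child : children) {
      parts.add("(" + child.toPredicate() + ")");
    }
    return String.join(" " + keyword + " ", parts);
  }
}

// Inner node that negates a single child filter.
final class SketchNot implements SketchFilter
{
  private final SketchFilter child;

  SketchNot(SketchFilter child)
  {
    this.child = child;
  }

  @Override
  public String toPredicate()
  {
    return "NOT (" + child.toPredicate() + ")";
  }
}

final class FilterTreeSketch
{
  // Mirrors the AND filter from the PR description.
  static SketchFilter andExample()
  {
    return new SketchJunction(
        "AND",
        new SketchComparison("<=", "age", "30"),
        new SketchComparison(">=", "name", "Employee4")
    );
  }

  public static void main(String[] args)
  {
    System.out.println(andExample().toPredicate());
    System.out.println(new SketchNot(andExample()).toPredicate());
  }
}
```

Because every node, including the AND, OR, and NOT nodes, implements the same interface, each user-facing filter maps to exactly one node in the rendered predicate.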

Test changes

  • Add a new Delta table, employee-delta-table-partitioned-name, partitioned by employee name. Added instructions on how to generate it in the test README.
  • The expected rows and schema of the newly generated Delta table are hardcoded in PartitionedDeltaTable. The existing non-partitioned table is moved to NonPartitionedDeltaTable. The test classes can be cleaned up further by having a common DeltaTestTable interface. I can refactor and clean that up later when we add more test tables as needed.
  • Add tests for the filter POJOs in the delta/filter package.
  • Add a serde test, DeltaInputSourceSerdeTest, to test Jackson serialization of the input source objects.

Release note

Support for filters in the Delta Lake input source has been added. Use the Delta filter to prune unnecessary Delta files and read only data that matches the filter predicates. Refer to the documentation for instructions on how to use filters during ingestion.


Key changed/added classes in this PR
  • DeltaFilter interface and its implementations
  • DeltaInputSource

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • been tested in a test Druid cluster.

@abhishekrb19 abhishekrb19 marked this pull request as ready for review April 17, 2024 20:23

@317brian 317brian left a comment

Some minor nits on the docs

abhishekrb19 and others added 3 commits April 18, 2024 14:11

@LakshSingla LakshSingla left a comment

Partial review.
I have an overarching question, regarding the <, >, <=, >= filters.
I think it would be easier if we had the type as "predicate", a predicate field like ">", and then the LHS and RHS for the predicate, rather than a separate type for each, since that closely resembles what we want to do.

Apart from that, is there a way to write something like the following, without extra effort from the developer to convert it to its normal form?
!(col1 > col2 && col3 < col4)


abhishekrb19 commented Apr 20, 2024

Thanks for the review, @LakshSingla!

> I have an overarching question, regarding the <, >, <=, >= filters.
> I think it would be easier if we had the type as "predicate", a predicate field like ">", and then the LHS and RHS for the predicate, rather than a separate type for each, since that closely resembles what we want to do.

Yeah, as we discussed offline, the Delta Kernel API considers all the expressions, including AND, OR, as a Predicate type. So I think it'd make sense to treat user-facing filters as one common entity that can be translated one-to-one into an underlying Delta Predicate, rather than breaking them up into their own Filter and Predicate classes. Added javadocs to sort of reflect the intent. Let me know if that clarifies things.

> is there a way to write something like the following, without extra effort from the developer to convert it to its normal form?
> !(col1 > col2 && col3 < col4)

Yes, that should be possible! Please see the unit tests DeltaInputSourceTest.FilterParameterTests or DeltaNotFilterTest.testNotFilterWithAndExpression(). Relatedly, I plan to add recursive flattening of the AND and OR filters in a follow-up to support complex filter expressions.
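
As an illustration of what recursive flattening of AND and OR filters could mean, here is a minimal sketch. It is an assumption about the general technique, not the planned follow-up itself; FilterNode and FlattenSketch are hypothetical names, and leaves are plain labels rather than real comparisons.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical filter tree node used only for this sketch.
final class FilterNode
{
  final String type;               // "and", "or", "not", or "leaf"
  final String label;              // expression text, used by leaves only
  final List<FilterNode> children; // empty for leaves

  FilterNode(String type, String label, List<FilterNode> children)
  {
    this.type = type;
    this.label = label;
    this.children = children;
  }

  static FilterNode leaf(String label)
  {
    return new FilterNode("leaf", label, new ArrayList<>());
  }

  static FilterNode of(String type, FilterNode... children)
  {
    return new FilterNode(type, null, Arrays.asList(children));
  }
}

final class FlattenSketch
{
  // Merge nested AND-in-AND and OR-in-OR nodes into their parent, bottom up.
  static FilterNode flatten(FilterNode node)
  {
    if (node.children.isEmpty()) {
      return node;
    }
    boolean junction = node.type.equals("and") || node.type.equals("or");
    List<FilterNode> flat = new ArrayList<>();
    for (FilterNode child : node.children) {
      FilterNode flatChild = flatten(child);
      if (junction && flatChild.type.equals(node.type)) {
        // Same operator as the parent: lift the grandchildren up one level.
        flat.addAll(flatChild.children);
      } else {
        flat.add(flatChild);
      }
    }
    return new FilterNode(node.type, node.label, flat);
  }

  // Render a tree as text, for example and(a, or(b, c)).
  static String render(FilterNode node)
  {
    if (node.type.equals("leaf")) {
      return node.label;
    }
    List<String> parts = new ArrayList<>();
    for (FilterNode child : node.children) {
      parts.add(render(child));
    }
    return node.type + "(" + String.join(", ", parts) + ")";
  }

  public static void main(String[] args)
  {
    FilterNode nested = FilterNode.of(
        "and",
        FilterNode.leaf("a"),
        FilterNode.of("and", FilterNode.leaf("b"), FilterNode.of("or", FilterNode.leaf("c"), FilterNode.leaf("d")))
    );
    System.out.println(render(flatten(nested)));
  }
}
```

Only a child with the same operator as its parent is merged, so a NOT node or an OR nested inside an AND keeps its place in the tree.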

@abhishekrb19 abhishekrb19 added this to the 30.0.0 milestone Apr 22, 2024
@LakshSingla

Thanks for the additional feature 🚀

@abhishekrb19 abhishekrb19 merged commit 1d7595f into apache:master Apr 29, 2024
88 checks passed
@abhishekrb19 abhishekrb19 deleted the delta_lake_filters_support branch April 29, 2024 18:31
abhishekrb19 added a commit to abhishekrb19/incubator-druid that referenced this pull request May 2, 2024
* Delta Lake support for filters.

* Updates

* cleanup comments

* Docs

* Remove Enclosed runner

* Rename

* Cleanup test

* Serde test for the Delta input source and fix jackson annotation.

* Updates and docs.

* Update error messages to be clearer

* Fixes

* Handle NumberFormatException to provide a nicer error message.

* Apply suggestions from code review

Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>

* Doc fixes based on feedback

* Yes -> yes in docs; reword slightly.

* Update docs/ingestion/input-sources.md

Co-authored-by: Laksh Singla <lakshsingla@gmail.com>

* Update docs/ingestion/input-sources.md

Co-authored-by: Laksh Singla <lakshsingla@gmail.com>

* Documentation, javadoc and more updates.

* Not with an or expression end-to-end test.

* Break up =, >, >=, <, <= into its own types instead of sub-classing.

---------

Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>
Co-authored-by: Laksh Singla <lakshsingla@gmail.com>
abhishekrb19 added a commit to abhishekrb19/incubator-druid that referenced this pull request May 2, 2024
Adds a text box for delta filter that can accept an optional json
object.
vogievetsky pushed a commit that referenced this pull request May 2, 2024
Adds a text box for delta filter that can accept an optional json
object.
LakshSingla pushed a commit that referenced this pull request May 3, 2024
Support for filters in the Druid Delta Lake connector
abhishekrb19 added a commit to abhishekrb19/incubator-druid that referenced this pull request May 3, 2024
Adds a text box for delta filter that can accept an optional json
object.
abhishekrb19 added a commit that referenced this pull request May 3, 2024
Adds a text box for delta filter that can accept an optional json
object.
IgorBerman pushed a commit to IgorBerman/druid that referenced this pull request May 5, 2024
Adds a text box for delta filter that can accept an optional json
object.
3 participants