-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
planner: add an upper bound for estimated row count of inner side of index join #41996
Conversation
[REVIEW NOTIFICATION] This pull request has been approved by:
To complete the pull request process, please ask the reviewers in the list to review by filling The full list of commands accepted by this bot can be found here. Reviewer can indicate their review by submitting an approval review. |
/retest |
/test unit-test |
" │ └─IndexRangeScan 1000000.00 cop[tikv] table:t2, index:idx(b) range: decided by [eq(test.t.b, test.t.a)], keep order:false, stats:pseudo", | ||
" └─Selection(Probe) 1000000.00 cop[tikv] eq(test.t.a, 0)", | ||
" └─TableRowIDScan 1000000.00 cop[tikv] table:t2 keep order:false, stats:pseudo", | ||
)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a comparison, on current master branch, the execution plan is:
IndexJoin 1000000.00 root inner join, inner:IndexLookUp, outer key:test.t.a, inner key:test.t.b, equal cond:eq(test.t.a, test.t.b)
├─TableReader(Build) 1000.00 root data:Selection
│ └─Selection 1000.00 cop[tikv] lt(test.t.a, 1), not(isnull(test.t.a))
│ └─TableFullScan 500000.00 cop[tikv] table:t keep order:false, stats:pseudo
└─IndexLookUp(Probe) 1000000.00 root
├─Selection(Build) 500000000.00 cop[tikv] not(isnull(test.t.b))
│ └─IndexRangeScan 500000000.00 cop[tikv] table:t2, index:idx(b) range: decided by [eq(test.t.b, test.t.a)], keep order:false, stats:pseudo
└─Selection(Probe) 1000000.00 cop[tikv] eq(test.t.a, 0)
└─TableRowIDScan 500000000.00 cop[tikv] table:t2 keep order:false, stats:pseudo
return idxStats.NDV | ||
} | ||
} | ||
return -1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return max{single col ndv} instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it should be min.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rest LGTM
} | ||
} | ||
|
||
// 3. If we still haven't got an NDV, we use the minimal NDV in the column stats as a lower bound. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
prefer a more case 2.5, once the index prefix columns satisfied the keys,we can use the index‘s DNV / the abscent column’s DNV (thought they are completely independent distributions, and result after division is quite small than the real value, leading a higher row count estimations(but not that highest from direct dividing lowest col-ndv) is which we wanna get from here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a clever one!
It's not that obvious but indeed a correct lower bound for the NDV. And I think it doesn't need to be prefix columns, it's enough if the index contains the columns we want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But if we add this, there needs to be a strategy for choosing from multiple possible indexes, and we need to make sure the result is stable, we also need to care about some edge cases like prefix index and generated columns, and we also need to construct corresponding test cases, that's too much for this sprint.
Besides, this PR would be cherry-picked for a customer, safety is more important, and less change is preferred.
So I tend to add a TODO comment here and leave it to the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a clever one! It's not that obvious but indeed a correct lower bound for the NDV. And I think it doesn't need to be prefix columns, it's enough if the index contains the columns we want.
make sense,subset of index column is enough
/retest |
/merge |
This pull request has been accepted and is ready to merge. Commit hash: 23f34df
|
/retest |
What problem does this PR solve?
Issue Number: ref #31316
Problem Summary:
Because of our current implementation of index join, the row count for its inner side might be severely overestimated.
What is changed and how it works?
Add a reasonable upper bound for the inner children of index join to prevent very severe estimation errors.
Specifically, the average row count for the inner side
IndexScan
that corresponds to each row from the outer side should be no larger than (total row count / NDV of join key columns).Check List
Tests
Side effects
Documentation
Release note
Please refer to Release Notes Language Style Guide to write a quality release note.