Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

executor: refactor joinResultGenerator to handle the unmatched outer records #7288

Merged
merged 15 commits into from
Aug 8, 2018

Conversation

zz-jason
Copy link
Member

@zz-jason zz-jason commented Aug 6, 2018

What have you changed? (mandatory)

refactor the joinResultGenerator:

  • add a function onMissMatch to handle the unmatched outer records
  • refactor the emit method to tryToMatch, which returns a boolean value to indicate whether the outer record can be joined with any inner record.

This PR is a bugfix, we should not handle the unmatched outer records in the old emit function: this function only joins a batch of inner rows, not all of them. We only know that the outer records can not be joined with any inner record on a serials of emit function calls, and then decide whether the outer record is unmatched.

After this PR, a typical function call to the recordJoiner is:

hasMatch := false
for innerIter.Current() != innerIter.End() {
    matched, err := g.tryToMatch(outer, innerIter, chk)
    // handle err
    hasMatch = hasMatch || matched
}
if !hasMatch {
    g.onMissMatch(outer)
}

What is the type of the changes? (mandatory)

  • Bug fix (non-breaking change which fixes an issue)

How has this PR been tested? (mandatory)

  • unit test
  • explain test

Does this PR affect documentation (docs/docs-cn) update? (mandatory)

No

Does this PR affect tidb-ansible update? (mandatory)

No

Does this PR need to be added to the release notes? (mandatory)

release note:
Fix a bug that the `Outer Join` may output the same unmatched outer records multiply times under certain special conditions.

Add a few positive/negative examples (optional)

An negative case, which is already added in the test:

Before this PR:

drop table if exists t;
create table t(a bigint, b bigint, unique key idx1(a, b));
insert into t values(1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (1, 6);

set @@tidb_max_chunk_size = 2; -- this is important, which expose the bug in a small data set
select * from t t1 left join t t2 on t1.a = t2.a and t1.b = t2.b + 4;

and the wrong result is:

TiDB(localhost:4000) > select * from t t1 left join t t2 on t1.a = t2.a and t1.b = t2.b + 4;
+------+------+------+------+
| a    | b    | a    | b    |
+------+------+------+------+
|    1 |    1 | NULL | NULL |
|    1 |    1 | NULL | NULL |
|    1 |    1 | NULL | NULL |
|    1 |    1 | NULL | NULL |
|    1 |    2 | NULL | NULL |
|    1 |    2 | NULL | NULL |
|    1 |    2 | NULL | NULL |
|    1 |    2 | NULL | NULL |
|    1 |    3 | NULL | NULL |
|    1 |    3 | NULL | NULL |
|    1 |    3 | NULL | NULL |
|    1 |    3 | NULL | NULL |
|    1 |    4 | NULL | NULL |
|    1 |    4 | NULL | NULL |
|    1 |    4 | NULL | NULL |
|    1 |    4 | NULL | NULL |
|    1 |    5 |    1 |    1 |
|    1 |    5 | NULL | NULL |
|    1 |    5 | NULL | NULL |
|    1 |    5 | NULL | NULL |
|    1 |    6 |    1 |    2 |
|    1 |    6 | NULL | NULL |
|    1 |    6 | NULL | NULL |
|    1 |    6 | NULL | NULL |
+------+------+------+------+
24 rows in set (0.00 sec)

This PR fixes the above issue:

TiDB(localhost:4000) > select * from t t1 left join t t2 on t1.a = t2.a and t1.b = t2.b + 4;
+------+------+------+------+
| a    | b    | a    | b    |
+------+------+------+------+
|    1 |    1 | NULL | NULL |
|    1 |    2 | NULL | NULL |
|    1 |    3 | NULL | NULL |
|    1 |    4 | NULL | NULL |
|    1 |    5 |    1 |    1 |
|    1 |    6 |    1 |    2 |
+------+------+------+------+
6 rows in set (0.00 sec)

@zz-jason zz-jason added type/bugfix This PR fixes a bug. sig/execution SIG execution labels Aug 6, 2018
@zz-jason
Copy link
Member Author

zz-jason commented Aug 6, 2018

/run-all-tests

@zz-jason zz-jason changed the title Dev/fix join generator executor: refactor joinResultGenerator to handle the unmatched outer records Aug 6, 2018
// }
//
// NOTE: This interface is **not** thread-safe.
type recordJoiner interface {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be joinRecorder?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

record means data record, it should be recordJoiner or tupleJoiner.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

record have both verb and noun, and not used very often in executor package. What about rowJoiner.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The struct name can be discussed later, we can firstly focus on the code logic.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we keep the original name joinResultGenerator and do the name refactor later?
Because there are too many places to change.

@zz-jason
Copy link
Member Author

zz-jason commented Aug 7, 2018

/run-unit-test

@@ -60,7 +60,7 @@ type IndexLookUpJoin struct {
joinResult *chunk.Chunk
innerIter chunk.Iterator

resultGenerator joinResultGenerator
resultGenerator recordJoiner
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The field name needs to be updated accordingly?

@zz-jason
Copy link
Member Author

zz-jason commented Aug 7, 2018

@coocood @lysu @winoros PTAL

@@ -32,21 +32,51 @@ var (
_ joinResultGenerator = &innerJoinResultGenerator{}
)

// joinResultGenerator is used to generate join results according the join type, see every implementor for detailed information.
// joinResultGenerator is used to generate join results according the join type.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

joinResultGenerator is used to generate join results according to the join type.

// onMissMatch operates on the unmatched outer row according to the join
// type. An outer row can be considered miss matched if:
// 1. it can not pass the filter on the outer table side.
// 2. there does not exist an inner row with the same join key.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe you can use "there is no inner row with the same join key." Just for your reference.

// unmatched outer rows according to the current join type:
// 1. 'SemiJoin': ignores the unmatched outer row.
// 2. 'AntiSemiJoin': appends the unmatched outer row to the result buffer.
// 3. 'LeftOuterSemiJoin': concates the unmatched outer row with 0 and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

concates -> concats

// 1. 'SemiJoin': ignores the unmatched outer row.
// 2. 'AntiSemiJoin': appends the unmatched outer row to the result buffer.
// 3. 'LeftOuterSemiJoin': concates the unmatched outer row with 0 and
// appended it to the result buffer.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

appended -> appends

// 2. 'AntiSemiJoin': appends the unmatched outer row to the result buffer.
// 3. 'LeftOuterSemiJoin': concates the unmatched outer row with 0 and
// appended it to the result buffer.
// 4. 'AntiLeftOuterSemiJoin': concates the unmatched outer row with 0 and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

concates -> concats

// appended it to the result buffer.
// 4. 'AntiLeftOuterSemiJoin': concates the unmatched outer row with 0 and
// appended to the result buffer.
// 5. 'LeftOuterJoin': concates the unmatched outer row with a row of NULLs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

concates -> concats

// 4. 'AntiLeftOuterSemiJoin': concates the unmatched outer row with 0 and
// appended to the result buffer.
// 5. 'LeftOuterJoin': concates the unmatched outer row with a row of NULLs
// and appended to the result buffer.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

appended -> appends it

// appended to the result buffer.
// 5. 'LeftOuterJoin': concates the unmatched outer row with a row of NULLs
// and appended to the result buffer.
// 6. 'RightOuterJoin': concates the unmatched outer row with a row of NULLs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

concates -> concats

// 5. 'LeftOuterJoin': concates the unmatched outer row with a row of NULLs
// and appended to the result buffer.
// 6. 'RightOuterJoin': concates the unmatched outer row with a row of NULLs
// and appended to the result buffer.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

appended -> appends it

// tryToMatch tries to join an outer row with a batch of inner rows. When
// 'inners.Len != 0' but all the joined rows are filtered, the outer row is
// considered unmatched. Otherwise, the outer row is matched and some joined
// rows is appended to `chk`. The size of `chk` is limited to MaxChunkSize.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rows are appended to chk. The size of chk is limited to MaxChunkSize.

@zz-jason
Copy link
Member Author

zz-jason commented Aug 8, 2018

@CaitinChen Done, PTAL again, thanks!

@@ -60,21 +60,21 @@ type joinResultGenerator interface {
// onMissMatch operates on the unmatched outer row according to the join
// type. An outer row can be considered miss matched if:
// 1. it can not pass the filter on the outer table side.
// 2. there does not exist an inner row with the same join key.
// 2. there is no inner row with the same join key..
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please delete the extra period at the end of this sentence.

if !(err == nil && matched) {
return
}
// here we handle the matched outer.
chk.AppendPartialRow(0, outer)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think do the main work in defer function is too tricky.
How about just

defer func() {
    if matched {
        innters.ReachEnd()
    }
}

And AppendPartialRow before return true, nil.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, innters.ReachEnd() should be called together when we have a matched inner row and do the join work. So I chose the defer style to reduce the code duplication.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should not use defer to reducing code duplication.
It splits the execution flow, makes the code harder to read.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can extract functions to reducing code duplication.

chk.AppendPartialRow(0, outer)
chk.AppendInt64(outer.Len(), 1)
return nil
inners.ReachEnd()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

chk.AppendPartialRow(0, outer)
chk.AppendInt64(outer.Len(), 0)
return nil
inners.ReachEnd()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

@zz-jason
Copy link
Member Author

zz-jason commented Aug 8, 2018

@coocood @CaitinChen @winoros PTAL

Copy link
Contributor

@CaitinChen CaitinChen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@zz-jason
Copy link
Member Author

zz-jason commented Aug 8, 2018

/run-all-tests

@coocood
Copy link
Member

coocood commented Aug 8, 2018

LGTM

Copy link
Member

@winoros winoros left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@coocood coocood merged commit b39b5f5 into pingcap:master Aug 8, 2018
zz-jason added a commit to zz-jason/tidb that referenced this pull request Aug 8, 2018
@zz-jason zz-jason deleted the dev/fix-join-generator branch August 8, 2018 08:17
zz-jason added a commit that referenced this pull request Aug 8, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
sig/execution SIG execution type/bugfix This PR fixes a bug.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants