Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

executor: revert test case introduced in #18819 to solve the data race on release-4.0 #19722

Conversation

ichn-hu
Copy link
Contributor

@ichn-hu ichn-hu commented Sep 2, 2020

Reverts #18819 , partially solved #19700

This PR causes a data race that blocks all subsequent PRs, therefore it should consider a revert.

See: https://github.com/pingcap/tidb/pulls?q=is%3Apr+base%3Arelease-4.0+is%3Aclosed+sort%3Aupdated-desc

starting from #18819 , all CIs are broken.

Release Note

  • No release note

@sre-bot
Copy link
Contributor

sre-bot commented Sep 2, 2020

@ichn-hu
Copy link
Contributor Author

ichn-hu commented Sep 2, 2020

/cc @lzmhhh123

@ichn-hu
Copy link
Contributor Author

ichn-hu commented Sep 2, 2020

/cc @zz-jason

@ichn-hu
Copy link
Contributor Author

ichn-hu commented Sep 2, 2020

/run-all-tests

@ichn-hu
Copy link
Contributor Author

ichn-hu commented Sep 2, 2020

/cc @SunRunAway

@github-actions github-actions bot added the sig/execution SIG execution label Sep 2, 2020
@ichn-hu
Copy link
Contributor Author

ichn-hu commented Sep 2, 2020

/run-all-tests

@zz-jason
Copy link
Member

zz-jason commented Sep 2, 2020

What's the root cause of the CI broken?

@ichn-hu
Copy link
Contributor Author

ichn-hu commented Sep 2, 2020

What's the root cause of the CI broken?

I am currently not sure, I still need more time to investigate about this (hugely because of PointGet under parallel execution), but it can be obvious that this PR causes the broken CI, and in order not to block PRs to release-4.0, I proposed this revert PR.

@zz-jason
Copy link
Member

zz-jason commented Sep 2, 2020

another question is, why the CI in the master branch isn't broken 😂

@ichn-hu
Copy link
Contributor Author

ichn-hu commented Sep 2, 2020

another question is, why the CI in the master branch isn't broken

It might be fixed, but the fix is not ported to 4.0. I will look into it.

@ichn-hu
Copy link
Contributor Author

ichn-hu commented Sep 3, 2020

/run-all-tests

1 similar comment
@ichn-hu
Copy link
Contributor Author

ichn-hu commented Sep 3, 2020

/run-all-tests

ichn-hu referenced this pull request Sep 3, 2020
- implement Execute using ExecuteStmt
- Execute is only for internal usage and accept only one statement
- clean up the old execute function
@ichn-hu
Copy link
Contributor Author

ichn-hu commented Sep 3, 2020

another question is, why the CI in the master branch isn't broken

After investigation, the cause of this data race is similar with this one #16817 (comment) , both are due to point get at the outer side of look up merge join, which after execution, enters finishStmt and turned the txn.Transaction into invalid, while the point get executor is still running, checking the transaction variable in Next, see

e.txn, err = e.ctx.Txn(false)
.

It does not cause panic, just because it only checks if txn.Transaction is nil or not, not dereferencing it, while such check concurrently raced with the set to nil.

The reason it does not happen on master is because @tiancaiamao moved this check from Next to Open in 05b2e2b#r42000685 .

I do not have a fix for this right now, I plan to port this PR to 4.0-release to see if it could solve the problem, however this PR changes too much code, it would be splendid if @tiancaiamao could help with this.

@ichn-hu ichn-hu force-pushed the revert-18819-release-4.0-ac581ee01ec6 branch from 01b5098 to 8368bee Compare September 3, 2020 10:03
@ichn-hu
Copy link
Contributor Author

ichn-hu commented Sep 3, 2020

@zz-jason After discussion with @tiancaiamao , we decided to temporarily comment out the offending test case, once we solved the race condition, we will add it back.

@ichn-hu ichn-hu changed the title Revert "planner: support plan cache for cluster index (#18716)" Revert test case introduced in #18819 to solve the data race on release-4.0 Sep 3, 2020
@ichn-hu
Copy link
Contributor Author

ichn-hu commented Sep 3, 2020

/run-unit-test

@zz-jason
Copy link
Member

zz-jason commented Sep 3, 2020

OK. Does it only affect the unit tests?

@ichn-hu
Copy link
Contributor Author

ichn-hu commented Sep 3, 2020

/rebuild

@ichn-hu
Copy link
Contributor Author

ichn-hu commented Sep 3, 2020

OK. Does it only affect the unit tests?

As for now, only this unit test will cause the data race, but this is indeed a bug that need to be fixed no matter what, and it is already fixed on master, just not yet ported to release-4.0. The port (or cherry-pick) is complicated, and need more time, therefore we would like to temporarily fix it by revert the test.

Copy link
Contributor

@lzmhhh123 lzmhhh123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use

if israce.RaceEnabled {
	c.Skip("skip race test")
}

instead of comment out the cases.

@zz-jason
Copy link
Member

zz-jason commented Sep 3, 2020

OK. Does it only affect the unit tests?

As for now, only this unit test will cause the data race, but this is indeed a bug that need to be fixed no matter what, and it is already fixed on master, just not yet ported to release-4.0. The port (or cherry-pick) is complicated, and need more time, therefore we would like to temporarily fix it by revert the test.

Seems we only need to fix the test? Just confirm, is there a potential data race issue when users file queries?

@ichn-hu
Copy link
Contributor Author

ichn-hu commented Sep 3, 2020

OK. Does it only affect the unit tests?

As for now, only this unit test will cause the data race, but this is indeed a bug that need to be fixed no matter what, and it is already fixed on master, just not yet ported to release-4.0. The port (or cherry-pick) is complicated, and need more time, therefore we would like to temporarily fix it by revert the test.

Seems we only need to fix the test? Just confirm, is there a potential data race issue when users file queries?

Yes, it is possible, therefore we need to fix it.

@ichn-hu
Copy link
Contributor Author

ichn-hu commented Sep 4, 2020

@zz-jason @SunRunAway let's merge this revert.

@ichn-hu ichn-hu changed the title Revert test case introduced in #18819 to solve the data race on release-4.0 executor: revert test case introduced in #18819 to solve the data race on release-4.0 Sep 4, 2020
Copy link
Contributor

@SunRunAway SunRunAway left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ti-srebot ti-srebot added the status/LGT1 Indicates that a PR has LGTM 1. label Sep 4, 2020
@SunRunAway
Copy link
Contributor

/merge

@ti-srebot ti-srebot added the status/can-merge Indicates a PR has been approved by a committer. label Sep 4, 2020
@ti-srebot
Copy link
Contributor

/run-all-tests

@ti-srebot
Copy link
Contributor

@ichn-hu merge failed.

@ichn-hu
Copy link
Contributor Author

ichn-hu commented Sep 4, 2020

/run-tics-test

@SunRunAway SunRunAway merged commit 22de543 into pingcap:release-4.0 Sep 4, 2020
@SunRunAway SunRunAway deleted the revert-18819-release-4.0-ac581ee01ec6 branch September 4, 2020 07:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
sig/execution SIG execution status/can-merge Indicates a PR has been approved by a committer. status/LGT1 Indicates that a PR has LGTM 1.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants