Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tikv: fix notLeader in region cache #10572

Merged
merged 8 commits into from
May 23, 2019
Merged

tikv: fix notLeader in region cache #10572

merged 8 commits into from
May 23, 2019

Conversation

lysu
Copy link
Contributor

@lysu lysu commented May 22, 2019

What problem does this PR solve?

fixes bug introduced by #10256

not leader: region_id:8 leader:<id:9 store_id:1 > , ctx: region ID: 8, meta: id:8 end_key:\"t\\200\\000\\000\\000\\000\\000\\000\\005\" region_epoch:<conf_ver:5 version:2 > peers:<id:9 store_id:1 > peers:<id:10 store_id:5 > peers:<id:11 store_id:4 > , peer: id:10 store_id:5 , ad
dr: tidb-cluster-tikv-1.tidb-cluster-tikv-peer.dm-refactor-full-exp328-cat0-tidb-cluster.svc:20160 at 2019-05-22T06:42:46.764390748Z\nnot leader: region_id:8 leader:<id:9 store_id:1 > , ctx: region ID
: 8, meta: id:8 end_key:\"t\\200\\000\\000\\000\\000\\000\\000\\005\" region_epoch:<conf_ver:5 version:2 > peers:<id:9 store_id:1 > peers:<id:10 store_id:5 > peers:<id:11 store_id:4 > , peer: id:10 st
ore_id:5 , addr: tidb-cluster-tikv-1.tidb-cluster-tikv-peer.dm-refactor-full-exp328-cat0-tidb-cluster.svc:20160 at 2019-05-22T06:42:46.775362301Z\nnot leader: region_id:8 leader:<id:9 store_id:1 > , c
tx: region ID: 8, meta: id:8 end_key:\"t\\200\\000\\000\\000\\000\\000\\000\\005\" region_epoch:<conf_ver:5 version:2 > peers:<id:9 store_id:1 > peers:<id:10 store_id:5 > peers:<id:11 store_id:4 > , p
eer: id:10 store_id:5 , addr: tidb-cluster-tikv-1.tidb-cluster-tikv-peer.dm-refactor-full-exp328-cat0-tidb-cluster.svc:20160 at 2019-05-22T06:42:46.786329643Z"

the question is NotLeader response told A1 is leader, but A1 maybe meet other error(e.g. network error) and be marked as failed, when next request arrival, TiDB will try round-robin try other nodes, so request will send to A2, and A2 told NotLeader(A1 is leader) again.

What is changed and how it works?

when recieve NotLeader, not only switch currentPeer to A1 but also clear up A1's failed marker.

then next request will try A1, until Ax return a NotLeader with NULL is leader.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)

manual running tpcc/sysbench with write leader change test..

schrodinger boxdetail?boxid=15585290796895

Code changes

  • impl change

Side effects

  • N/A

Related changes

  • N/A

This change is Reviewable

@lysu lysu added the type/bugfix This PR fixes a bug. label May 22, 2019
@codecov
Copy link

codecov bot commented May 22, 2019

Codecov Report

Merging #10572 into master will increase coverage by 0.006%.
The diff coverage is 84.9462%.

@@               Coverage Diff               @@
##             master     #10572       +/-   ##
===============================================
+ Coverage   77.3594%   77.3655%   +0.006%     
===============================================
  Files           413        413               
  Lines         87498      87442       -56     
===============================================
- Hits          67688      67650       -38     
+ Misses        14605      14598        -7     
+ Partials       5205       5194       -11

@lysu lysu added component/tikv priority/release-blocker This issue blocks a release. Please solve it ASAP. labels May 22, 2019
@lysu
Copy link
Contributor Author

lysu commented May 22, 2019

/run-all-tests

@lysu lysu changed the title tikv: fix notLeader handle tikv: fix notLeader in region cache May 22, 2019
@lysu lysu requested review from coocood and tiancaiamao May 22, 2019 12:52
Copy link
Contributor

@winkyao winkyao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use failpoint to stimulate this case?

store/tikv/region_cache.go Outdated Show resolved Hide resolved
store/tikv/region_cache.go Outdated Show resolved Hide resolved
store/tikv/region_cache.go Outdated Show resolved Hide resolved
@lysu
Copy link
Contributor Author

lysu commented May 23, 2019

/run-all-tests

@lysu
Copy link
Contributor Author

lysu commented May 23, 2019

/rebuild

@lysu
Copy link
Contributor Author

lysu commented May 23, 2019

/run-unit-test

store/tikv/region_cache.go Outdated Show resolved Hide resolved
@coocood
Copy link
Member

coocood commented May 23, 2019

reloadRegionThreshold can be removed

store/tikv/region_cache.go Outdated Show resolved Hide resolved
@lysu
Copy link
Contributor Author

lysu commented May 23, 2019

/run-all-tests

@jackysp
Copy link
Member

jackysp commented May 23, 2019

PTAL @coocood

@lysu
Copy link
Contributor Author

lysu commented May 23, 2019

/run-all-tests

@lysu
Copy link
Contributor Author

lysu commented May 23, 2019

/run-integration-ddl-test

Copy link
Member

@jackysp jackysp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jackysp jackysp added component/DDL-need-LGT3 status/LGT1 Indicates that a PR has LGTM 1. and removed component/DDL-need-LGT3 labels May 23, 2019
@coocood
Copy link
Member

coocood commented May 23, 2019

LGTM

@jackysp jackysp merged commit eb38947 into pingcap:master May 23, 2019
db-storage pushed a commit to db-storage/tidb that referenced this pull request May 29, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component/tikv priority/release-blocker This issue blocks a release. Please solve it ASAP. status/LGT1 Indicates that a PR has LGTM 1. type/bugfix This PR fixes a bug.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants