
tikvclient: refine region-cache #10256

Merged 28 commits into pingcap:master on May 21, 2019

Conversation

@lysu (Contributor) commented Apr 24, 2019

What problem does this PR solve?

Tries to fix #10037 by keeping on retrying the region's other stores, in order to:

  • work with the hibernate region feature in TiKV
  • keep using old data when TiDB cannot connect to PD

What is changed and how it works?

What does the region cache hold?

  • region: a region maintains a range of data and is replicated on multiple stores; region entries change frequently (region split, region merge, leader change, scheduling from one store to another)
  • stores: a store is one TiKV process; one machine can run multiple stores and one store can hold multiple regions; store info changes infrequently (copying old KV data to a new machine, changing a network interface, ...); a rough sketch of this model follows the list
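A rough sketch of this data model in Go, using made-up field names rather than the actual TiDB types, just to make the two kinds of cached objects concrete:

package regioncache

// Region covers one contiguous key range and is replicated as peers on several
// stores; cached region entries go stale frequently (split, merge, leader
// change, scheduling).
type Region struct {
	ID            uint64
	StartKey      []byte
	EndKey        []byte
	LeaderStoreID uint64   // store holding the current leader peer
	PeerStoreIDs  []uint64 // stores holding the other peers
}

// Store is one TiKV process; a machine may run several stores, a store hosts
// many regions, and store info changes rarely (machine replacement, network
// interface change).
type Store struct {
	ID   uint64
	Addr string
}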

What failures or outdated data will the region cache meet?

  1. send data failure

Normally this happens when a machine goes down, the network partitions between TiDB and TiKV, or the process crashes.
On the TiDB side, a send-failure event is what identifies the store failure.

Sometimes it is instead caused by someone replacing the machine or changing a network interface, but people don't do that often.

  2. send succeeds but gets an error response

This means the store itself is working well, but the cached info is mismatched.
On the TiDB side, sending the data succeeds, but TiKV returns a failure response.

Normally it is the region info that has changed; only seldom is it the store info.

How the current code behaves

On a send failure, it drops the cached region and removes the store info, which triggers reloading both the region and the store info.

But this has two problems:

  • re-fetching the region info returns the old leader when TiKV hasn't triggered a new election, but a hibernated region needs another peer to be contacted to start a new election
  • fetching the region from PD may fail if PD is behind a network partition

How this PR changes it

  1. mark the store after sending
  • mark the store as failed when sending data fails; TiDB then blacks out the store for a following period (based on the consecutive failure count and the last failure timestamp)
  • mark the store as healthy when sending data succeeds

"Blacking out a store on failure" keeps the chance that a failed peer can be used again while avoiding a retry flood; a minimal sketch of this back-off follows.

  2. invalidate the region cache

A cache item is never deleted; it is only invalidated to trigger a re-fetch from PD.
If the fetch fails because PD is down, the item is marked valid again so the old data keeps being used (see the sketch below).
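A minimal sketch of this invalidate-and-fall-back behavior, with hypothetical names (lookup and loadFromPD stand in for the real index lookup and PD call):

package regioncache

import "sync/atomic"

type regionInfo struct { // stand-in for the real cached region metadata
	startKey, endKey []byte
}

type cachedRegion struct {
	data  *regionInfo // the cached info; never deleted
	valid int32       // 1 = usable, 0 = needs a reload from PD
}

func (c *cachedRegion) invalidate()   { atomic.StoreInt32(&c.valid, 0) }
func (c *cachedRegion) revalidate()   { atomic.StoreInt32(&c.valid, 1) }
func (c *cachedRegion) isValid() bool { return atomic.LoadInt32(&c.valid) == 1 }

type regionIndex struct {
	lookup     func(key []byte) *cachedRegion        // hypothetical: find the cached entry for a key
	loadFromPD func(key []byte) (*regionInfo, error) // hypothetical: fetch fresh info from PD
}

// locate prefers a valid cached entry; an invalidated entry triggers a PD
// reload, but if PD is unreachable the stale entry is revalidated and served
// instead of failing the request.
func (idx *regionIndex) locate(key []byte) (*regionInfo, error) {
	entry := idx.lookup(key)
	if entry != nil && entry.isValid() {
		return entry.data, nil
	}
	fresh, err := idx.loadFromPD(key)
	if err != nil {
		if entry != nil {
			entry.revalidate() // PD is down: keep using the old data
			return entry.data, nil
		}
		return nil, err
	}
	return fresh, nil
}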

Reviewing this PR may require a look at #6880 and #2792, and also at the GetRPCContext method, which isn't modified but is vital to this logic.

  3. main focus
  • is the cache-hit code path fast and lock-free?
  • when should other peers be tried if the current leader is unreachable?
  • when should a region be invalidated and reloaded?
  • when, and for how long, should a store be blacked out when it keeps failing?
  • when should the store info be reloaded?

Check List

Tests

  • Unit test (WIP: add more)
  • Integration test

Code changes

  • Has exported function/method change
  • Has exported variable/fields change
  • Has interface methods change

Side effects

  • Increased code complexity

Related changes

  • N/A


@lysu lysu added component/tikv type/bugfix This PR fixes a bug. labels Apr 24, 2019
@lysu lysu requested a review from coocood April 24, 2019 09:36
@lysu (Contributor, Author) commented Apr 24, 2019

/run-all-tests

@lysu lysu added the require-LGT3 Indicates that the PR requires three LGTM. label Apr 25, 2019
@lysu lysu removed the status/WIP label Apr 25, 2019
@lysu lysu marked this pull request as ready for review April 25, 2019 07:29
@lysu lysu requested a review from jackysp April 25, 2019 07:35
@lysu (Contributor, Author) commented Apr 25, 2019

/run-all-tests

@zhouqiang-cl (Contributor) commented:

/rebuild

1 similar comment
@zhouqiang-cl (Contributor) commented:

/rebuild

@lysu (Contributor, Author) commented Apr 25, 2019

/run-all-tests

@zhouqiang-cl (Contributor) commented:

/rebuild

1 similar comment
@mahjonp (Contributor) commented Apr 25, 2019

/rebuild

@lysu (Contributor, Author) commented Apr 25, 2019

/run-unit-test

@shenli shenli added the priority/release-blocker This issue blocks a release. Please solve it ASAP. label Apr 28, 2019
codecov bot commented Apr 28, 2019

Codecov Report

Merging #10256 into master will decrease coverage by 0.0033%.
The diff coverage is 72.3404%.

@@               Coverage Diff                @@
##             master     #10256        +/-   ##
================================================
- Coverage   77.8178%   77.8144%   -0.0034%     
================================================
  Files           410        410                
  Lines         84365      84438        +73     
================================================
+ Hits          65651      65705        +54     
- Misses        13813      13826        +13     
- Partials       4901       4907         +6

codecov bot commented Apr 28, 2019

Codecov Report

Merging #10256 into master will decrease coverage by 0.0176%.
The diff coverage is 81.0096%.

@@               Coverage Diff                @@
##             master     #10256        +/-   ##
================================================
- Coverage   77.2779%   77.2603%   -0.0177%     
================================================
  Files           413        413                
  Lines         86986      87244       +258     
================================================
+ Hits          67221      67405       +184     
- Misses        14600      14647        +47     
- Partials       5165       5192        +27

@lysu lysu removed the status/WIP label Apr 28, 2019
@coocood (Member) commented Apr 28, 2019

If we add an Unreachable pointer to the Store, make every region reference that pointer, and access it atomically,
we can avoid holding the lock for a long time when a request fails.

Then we need to check the store state and switch to a backup store in GetRPCContext.
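A minimal sketch of this idea, with hypothetical names: each store carries an atomically accessed flag that the store-selection step (as in GetRPCContext) can read without holding the region cache lock, falling back to a backup store when the leader's store is marked unreachable:

package regioncache

import "sync/atomic"

type store struct {
	addr        string
	unreachable int32 // accessed atomically; 1 = currently blacked out
}

func (s *store) markUnreachable() { atomic.StoreInt32(&s.unreachable, 1) }
func (s *store) markReachable()   { atomic.StoreInt32(&s.unreachable, 0) }
func (s *store) reachable() bool  { return atomic.LoadInt32(&s.unreachable) == 0 }

// pickStore is a stand-in for the store-selection step in GetRPCContext:
// prefer the leader's store and fall back to the first reachable backup,
// without taking the region cache lock.
func pickStore(leader *store, backups []*store) *store {
	if leader.reachable() {
		return leader
	}
	for _, s := range backups {
		if s.reachable() {
			return s
		}
	}
	return leader // everything looks down; let the caller retry the leader
}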

@jackysp jackysp removed the priority/release-blocker This issue blocks a release. Please solve it ASAP. label Apr 28, 2019
var (
	id   uint64
	peer *metapb.Peer
)
for id, peer = range r.backupPeers {
A reviewer (Contributor) commented:

I don't think it is good practice to depend on the fact that map iteration order is randomized... It may be simpler to use a slice to store backupPeers instead...
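A small sketch of that suggestion, with hypothetical type and field names: keep the backup peers in a slice so the iteration order is explicit instead of relying on Go's randomized map iteration:

package regioncache

import "github.com/pingcap/kvproto/pkg/metapb"

type regionWithBackups struct {
	peer        *metapb.Peer
	backupPeers []*metapb.Peer // was map[uint64]*metapb.Peer keyed by peer ID
}

// nextBackupPeer walks the backups in a deterministic round-robin order
// starting from idx, rather than depending on map iteration randomness.
func (r *regionWithBackups) nextBackupPeer(idx int) *metapb.Peer {
	if len(r.backupPeers) == 0 {
		return nil
	}
	return r.backupPeers[idx%len(r.backupPeers)]
}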


func (r *Region) tryRandPeer() {
	r.peer = r.meta.Peers[0]
	for i := 1; i < len(r.meta.Peers); i++ {
A reviewer (Contributor) commented:

How about rand.Shuffle on r.meta.Peers? Then you can try the peers one by one.
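A sketch of this suggestion; note that math/rand's Shuffle takes a length and a swap function rather than a slice, and peerUsable is a hypothetical placeholder check:

package regioncache

import (
	"math/rand"

	"github.com/pingcap/kvproto/pkg/metapb"
)

// Minimal stand-ins mirroring the snippet above; peerUsable is a hypothetical
// check (e.g. "the peer's store is not blacked out").
type Region struct {
	meta *metapb.Region
	peer *metapb.Peer
}

func (r *Region) peerUsable(p *metapb.Peer) bool { return p != nil }

func (r *Region) tryRandPeer() {
	// Shuffle a copy so r.meta.Peers itself stays untouched.
	peers := append([]*metapb.Peer(nil), r.meta.Peers...)
	rand.Shuffle(len(peers), func(i, j int) {
		peers[i], peers[j] = peers[j], peers[i]
	})
	for _, p := range peers {
		if r.peerUsable(p) {
			r.peer = p
			return
		}
	}
	r.peer = r.meta.Peers[0] // nothing usable: fall back to the first peer
}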

@coocood (Member) commented May 20, 2019

LGTM

@tiancaiamao (Contributor) commented:

LGTM

@tiancaiamao (Contributor) commented:

/run-all-tests

@tiancaiamao tiancaiamao added status/LGT3 The PR has already had 3 LGTM. and removed status/LGT2 Indicates that a PR has LGTM 2. labels May 21, 2019
@tiancaiamao tiancaiamao merged commit f6346a1 into pingcap:master May 21, 2019
db-storage pushed a commit to db-storage/tidb that referenced this pull request May 29, 2019
1. mark the store after sending
* mark the store as failed when sending data fails; TiDB then blacks out the store for a following period (consecutive failure count + last failure timestamp).
* mark the store as healthy when sending data succeeds.

2. invalidate the region cache
* a cache item is never deleted; it is only invalidated to trigger a re-fetch from PD.
* the item is made valid again to keep using the old data if the fetch fails because PD is down.
Labels
  • component/tikv
  • priority/release-blocker (This issue blocks a release. Please solve it ASAP.)
  • require-LGT3 (Indicates that the PR requires three LGTM.)
  • status/LGT3 (The PR has already had 3 LGTM.)
  • type/bugfix (This PR fixes a bug.)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ticlient doesn't make request in a round-robin fashion any more
10 participants