
store/tikv: avoid holding write lock for long time #6880

Merged
merged 4 commits into pingcap:master from region-cache-lock on Jun 25, 2018

Conversation

@coocood (Member) commented Jun 22, 2018

What have you changed? (mandatory)

When a TiKV store fails, many concurrent requests call OnRequestFail, and each of them holds the write lock for a while, trying to drop the regions on that store.
This PR lets only one request iterate all the regions and drop the ones on the failed store; the other requests see that the store has already been dropped and quit early.
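A minimal sketch of the idea, with hypothetical names and simplified types (not the actual store/tikv code):

```go
package sketch

import "sync"

// Simplified stand-ins for the real store/tikv types.
type RegionVerID uint64

type Region struct {
	storeID uint64 // store that the cached leader peer lives on (simplified)
}

type RegionCache struct {
	mu struct {
		sync.RWMutex
		regions map[RegionVerID]*Region
	}
}

// OnRequestFail: only the first caller that still finds the failed
// region in the cache pays the cost of scanning every cached region;
// later callers see the region is already gone and return immediately
// instead of repeating the scan under the write lock.
func (c *RegionCache) OnRequestFail(failedRegionID RegionVerID, failedStoreID uint64) {
	c.mu.Lock()
	if _, ok := c.mu.regions[failedRegionID]; !ok {
		// Another request already dropped the regions on this store.
		c.mu.Unlock()
		return
	}
	for id, r := range c.mu.regions {
		if r.storeID == failedStoreID {
			delete(c.mu.regions, id)
		}
	}
	c.mu.Unlock()
	// Work that doesn't need the lock (e.g. logging) happens here.
}
```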

This PR also removes the unreachableStores property from the Region struct, because:

  1. it makes things much more complex.
  2. it doesn't make sense to drop all the other regions on the store but keep the failed region when not all of the region's stores are unreachable.

What are the types of the changes (mandatory)?

  • Improvement (non-breaking change which is an improvement to an existing feature)
    Optimize RegionCache performance on send request failure

How has this PR been tested (mandatory)?

Unit test and benchmark test

Benchmark result if necessary (optional)

From 28221 ns/op to 125 ns/op.
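A Go benchmark of roughly this shape (hypothetical, reusing the sketch types above; not the PR's actual test) is what produces ns/op figures like these:

```go
// In a _test.go file alongside the sketch above.
import "testing"

func BenchmarkOnRequestFail(b *testing.B) {
	c := &RegionCache{}
	c.mu.regions = make(map[RegionVerID]*Region, 10000)
	for i := 0; i < 10000; i++ {
		c.mu.regions[RegionVerID(i)] = &Region{storeID: uint64(i % 3)}
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		// After the first iteration the failed region is already gone,
		// so every later call takes the early-exit path; that cheap
		// path is what a figure like 125 ns/op reflects.
		c.OnRequestFail(1, 1)
	}
}
```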

Optimize RegionCache performance on send request failure.
@coocood (Member Author) commented Jun 22, 2018

@disksing @tiancaiamao @zz-jason PTAL

@shenli (Member) commented Jun 23, 2018

Will this PR improve the performance of sysbench?

@@ -464,8 +464,8 @@ func (r *Region) removePeer(peerID uint64) {
 	r.incConfVer()
 }
 
-func (r *Region) changeLeader(leaderStoreID uint64) {
-	r.leader = leaderStoreID
+func (r *Region) changeLeader(leaderID uint64) {
Member:
Is this the ID of a peer or store?

Member Author:
Peer. It was mistakenly named leaderStoreID.

	if !ok {
		// The failed region is dropped already by another request, we don't need to iterate the regions
		// and find regions on the failed store to drop.
		c.mu.Unlock()
Member:
Why not use defer to unlock?

Member Author:
So we can do some work outside the lock at the end of this function.
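A generic illustration of the point (not the PR's exact code): with defer, everything up to the return runs under the lock, while an explicit Unlock lets the tail of the function run lock-free.

```go
// With defer, the logging at the end would still hold the lock:
func dropWithDefer(c *RegionCache) {
	c.mu.Lock()
	defer c.mu.Unlock()
	// ... drop regions ...
	// log.Infof(...) here would run while holding the write lock.
}

// An explicit Unlock releases the lock before the lock-free tail:
func dropExplicit(c *RegionCache) {
	c.mu.Lock()
	// ... drop regions ...
	c.mu.Unlock()
	// log.Infof(...) here runs without blocking other requests.
}
```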

@@ -581,26 +579,6 @@ func (r *Region) GetContext() *kvrpcpb.Context {
 	}
 }
 
-// OnRequestFail records unreachable peer and tries to select another valid peer.
-// It returns false if all peers are unreachable.
-func (r *Region) OnRequestFail(storeID uint64) bool {
Member:
Why remove this? Is this logic useless or moved to another place?

Member Author:
It's useless.

@disksing (Contributor) commented Jun 25, 2018
There were some considerations behind the unreachable store list.

Consider a case where a store is down and another peer of the region becomes the leader, but the new leader is somehow unable to send its heartbeat to PD in time.

With the unreachable store list, TiDB can try the other peers automatically. Otherwise, it will keep reconnecting to the down TiKV until it times out.
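For reference, the removed mechanism behaved roughly like this (a reconstruction from the deleted OnRequestFail doc comment above; the field and helper names are hypothetical):

```go
// Reconstruction of the removed behavior, based only on the deleted
// doc comment; field and helper names here are hypothetical.
type regionWithPeers struct {
	peers             []uint64 // store IDs of the region's peers
	leaderIdx         int      // index of the peer requests are sent to
	unreachableStores []uint64 // stores that recently failed a request
}

// onRequestFail records the unreachable store and tries to select
// another valid peer; it returns false if all peers are unreachable.
func (r *regionWithPeers) onRequestFail(storeID uint64) bool {
	r.unreachableStores = append(r.unreachableStores, storeID)
	for i, p := range r.peers {
		if !containsUint64(r.unreachableStores, p) {
			r.leaderIdx = i
			return true
		}
	}
	return false
}

func containsUint64(s []uint64, v uint64) bool {
	for _, x := range s {
		if x == v {
			return true
		}
	}
	return false
}
```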

Member Author:
@disksing
I know, but since we drop all the other regions on the store after a send-request failure, keeping an unreachable list for only one region doesn't make any difference.

@shenli (Member) commented Jun 23, 2018

@disksing @zhangjinpeng1987 PTAL

@coocood (Member Author) commented Jun 25, 2018

@shenli
This PR only optimizes the case where a tikv-server is down; it cannot improve the sysbench result.

@tiancaiamao (Contributor):
LGTM

@tiancaiamao tiancaiamao added the status/LGT1 Indicates that a PR has LGTM 1. label Jun 25, 2018
			c.dropRegionFromCache(id)
		}
	}
	c.mu.Unlock()
	log.Infof("drop regions that on the store %d due to send request fail, err: %v", failedStoreID, err)
Member:
Can we also print the "IP:port" address of the failed store?

@coocood (Member Author) commented Jun 25, 2018

@zz-jason PTAL

@zz-jason (Member) left a comment:
LGTM

@zz-jason zz-jason added status/LGT2 Indicates that a PR has LGTM 2. and removed status/LGT1 Indicates that a PR has LGTM 1. labels Jun 25, 2018
@coocood coocood merged commit 3ac6d3a into pingcap:master Jun 25, 2018
@coocood coocood deleted the region-cache-lock branch June 25, 2018 13:31