stats: refine updating stats using feedback #6796

alivxxx · 2018-06-08T11:47:10Z

What have you changed? (mandatory)

Previously, we first updated the bucket's count and then split the bucket. This PR reverses these two steps, because:

Previously, when a feedback covered only a very small fraction of the bucket, we could not update the stats because we limit the feedback fraction; this can happen when the distribution is too discrete. With this PR we can, because we split the bucket first, then calculate each new bucket's count, and only check whether the bucket is too small at the end.
The bucket counts are more accurate after the change, since the more a feedback and a bucket overlap, the more accurate the estimate is. As a result, we get better stats in fewer rounds of updating.
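
The reordering described above can be illustrated with a minimal Go sketch. This is not the PR's code: it assumes plain numeric half-open ranges instead of TiDB datums, and the `rng` type and `splitThenCount` function are hypothetical names chosen for illustration. It splits a bucket at the feedback boundaries first and only then estimates a count for each piece, so a feedback covering a tiny fraction of the bucket still contributes. The final "too small" check is left out here.

```go
package main

import (
	"fmt"
	"sort"
)

// rng is a hypothetical half-open numeric range [lower, upper) with a row count.
type rng struct {
	lower, upper float64
	count        float64
}

// splitThenCount first splits the bucket at every feedback boundary that falls
// strictly inside it, and only afterwards estimates a count for each piece.
func splitThenCount(bkt rng, fbs []rng) []rng {
	// Step 1: collect the split points.
	bounds := []float64{bkt.lower, bkt.upper}
	for _, fb := range fbs {
		for _, p := range []float64{fb.lower, fb.upper} {
			if p > bkt.lower && p < bkt.upper {
				bounds = append(bounds, p)
			}
		}
	}
	sort.Float64s(bounds)

	// Step 2: estimate the count of each piece.
	var pieces []rng
	for i := 1; i < len(bounds); i++ {
		if bounds[i] == bounds[i-1] {
			continue // skip duplicate boundaries
		}
		piece := rng{lower: bounds[i-1], upper: bounds[i]}
		width := piece.upper - piece.lower
		// Default: the piece's proportional share of the original bucket count.
		piece.count = bkt.count * width / (bkt.upper - bkt.lower)
		// Prefer a feedback that fully covers the piece, scaled by width.
		for _, fb := range fbs {
			if fb.lower <= piece.lower && piece.upper <= fb.upper {
				piece.count = fb.count * width / (fb.upper - fb.lower)
				break
			}
		}
		pieces = append(pieces, piece)
	}
	return pieces
}

func main() {
	bkt := rng{lower: 0, upper: 100, count: 1000}
	// The feedback covers only 2% of the bucket but reports 500 rows.
	fbs := []rng{{lower: 40, upper: 42, count: 500}}
	for _, p := range splitThenCount(bkt, fbs) {
		fmt.Printf("[%v, %v) count=%v\n", p.lower, p.upper, p.count)
	}
}
```

In this example the narrow piece [40, 42) takes its count from the feedback, while the two neighbouring pieces fall back to their proportional share of the original bucket count.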

What are the type of the changes (mandatory)?

Improvement (non-breaking change which is an improvement to an existing feature)

How has this PR been tested (mandatory)?

unit test

alivxxx · 2018-06-26T03:14:38Z

PTAL @coocood @zz-jason @winoros

coocood · 2018-06-28T08:08:22Z

statistics/feedback.go

+	}
+	return count
+}
+
 // Get the split count for the histogram.
 func getSplitCount(count, remainBuckets int) int {


How about numFeedbacks instead of count?

coocood · 2018-06-28T09:26:02Z

statistics/feedback.go

-	}
-	return int64(count)
-}
-
 // updateBucket split the bucket according to feedback.
 func (b *BucketFeedback) splitBucket(newBktNum int, totalCount float64, count float64) []bucket {


The comment and the method name do not match.

Should count be renamed to originBucketCount?

s/newBktNum/newNumBkts

coocood · 2018-06-28T11:19:19Z

statistics/feedback.go

+		fbLower, fbUpper = getFraction4PK(minValue, maxValue, fb.lower, fb.upper)
+		bktLower, bktUpper = getFraction4PK(minValue, maxValue, bkt.lower, bkt.upper)
+	}
+	ratio := (bktUpper - bktLower) / (fbUpper - fbLower)


The ratio seems inconsistent with the comment; should it be (fbUpper - fbLower) / (bktUpper - bktLower)?

And later we calculate the new bucket count with count * ratio.

Yes, the comment is wrong. The count is the feedback count, so this ratio is right.
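
A small Go sketch of why the ratio in the diff points in this direction, under the assumption that rows are spread uniformly over the feedback range. `estimateBucketCount` is a hypothetical helper, not a function from the PR; all bounds are fractions on a shared [0, 1] axis, as the fraction variables in the quoted diff suggest.

```go
package main

import "fmt"

// estimateBucketCount scales a feedback's row count to a bucket, assuming rows
// are spread uniformly over the feedback range.
func estimateBucketCount(fbCount, fbLower, fbUpper, bktLower, bktUpper float64) float64 {
	// Bucket width over feedback width: the count being scaled is the feedback's.
	ratio := (bktUpper - bktLower) / (fbUpper - fbLower)
	return fbCount * ratio
}

func main() {
	// The feedback spans the whole axis with 80 rows; the bucket spans a quarter of it.
	fmt.Println(estimateBucketCount(80, 0, 1, 0.25, 0.5)) // prints 20
}
```

Because the count being multiplied belongs to the feedback, the bucket width goes in the numerator. The inverse ratio would only be appropriate if a bucket count were being scaled to the feedback range.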

coocood · 2018-06-28T11:24:37Z

statistics/feedback.go

+	return overlap, ratio
+}
+
+func (b *BucketFeedback) bucketCount(bkt bucket, defaultCount float64) float64 {


How about naming it refineBucketCount?
Also, please add a comment describing this method.

coocood · 2018-06-28T11:30:11Z

statistics/feedback.go

@@ -425,31 +447,18 @@ func mergeBuckets(bkts []bucket, isNewBuckets []bool, totalCount float64) []buck

 func splitBuckets(h *Histogram, feedback *QueryFeedback) ([]bucket, []bool, int64) {
 	bktID2FB, fbNum := buildBucketFeedback(h, feedback)


s/fbNum/totalNumFBs

coocood · 2018-06-28T11:31:44Z

statistics/feedback.go

@@ -425,31 +447,18 @@ func mergeBuckets(bkts []bucket, isNewBuckets []bool, totalCount float64) []buck

 func splitBuckets(h *Histogram, feedback *QueryFeedback) ([]bucket, []bool, int64) {


add comments about this function.

coocood · 2018-06-28T11:40:41Z

statistics/feedback.go

+	}
+	return count
+}
+
 // Get the split count for the histogram.
 func getSplitCount(count, remainBuckets int) int {
 	remainBuckets = mathutil.Max(remainBuckets, 10)


Better to use another variable, such as splitCount, rather than modifying the input argument.

coocood · 2018-06-28T11:46:05Z

statistics/feedback.go

 			isNewBuckets = append(isNewBuckets, false)
 			continue
 		}
-		bkts := bkt.splitBucket(splitCount*len(bkt.feedback)/fbNum, float64(totCount), float64(counts[i]))
+		bkts := bkt.splitBucket(splitCount*len(bkt.feedback)/fbNum, h.totalRowCount(), float64(h.bucketCount(i)))


Need a dedicated variable for splitCount*len(bkt.feedback)/fbNum and add a comment to explain why.

coocood · 2018-06-28T11:48:06Z

statistics/feedback.go

+	}
+	return count
+}
+
 // Get the split count for the histogram.


Need to explain why this algorithm was chosen.

coocood · 2018-07-02T06:06:10Z

LGTM

alivxxx · 2018-07-02T11:39:01Z

PTAL @winoros @zz-jason

winoros · 2018-07-02T11:45:52Z

statistics/feedback.go

 			isNewBuckets = append(isNewBuckets, false)
 			continue
 		}
-		bkts := bkt.splitBucket(splitCount*len(bkt.feedback)/fbNum, float64(totCount), float64(counts[i]))
+		// distribute the total split count to bucket based on number of bucket feedback


distribute -> Distribute.
Add . at the end.

winoros · 2018-07-02T12:09:36Z

statistics/feedback.go

+	}
+	var fbLower, fbUpper, bktLower, bktUpper float64
+	minValue, maxValue := &datums[0], &datums[3]
+	if datums[0].Kind() == types.KindBytes {


What about adding a comment at the declaration of bucket stating that the datum in a bucket can only be Bytes or Int, where Bytes is for indexes and Int is for the int pk?

zz-jason · 2018-07-02T13:03:50Z

statistics/feedback.go

 			isNewBuckets = append(isNewBuckets, false)
 			continue
 		}
-		bkts := bkt.splitBucket(splitCount*len(bkt.feedback)/fbNum, float64(totCount), float64(counts[i]))
+		// distribute the total split count to bucket based on number of bucket feedback
+		newBktNums := splitCount * len(bkt.feedback) / totalNumFBs


how about:

s/totalNumFBs/numTotalFBs/

s/bkt/bktFB/

numFBs := len(bktFB.feedback)
numNewBkts := splitCount * numFBs / numTotalFBs
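
For reference, a self-contained Go sketch of the proportional distribution discussed in this comment. `distributeSplits` is a hypothetical helper that takes the per-bucket feedback counts as a plain slice; it is not part of the PR.

```go
package main

import "fmt"

// distributeSplits gives each bucket a share of splitCount proportional to the
// number of feedbacks that landed in it. Integer division truncates, so the
// shares may sum to less than splitCount.
func distributeSplits(splitCount int, numFBsPerBkt []int) []int {
	numTotalFBs := 0
	for _, n := range numFBsPerBkt {
		numTotalFBs += n
	}
	numNewBkts := make([]int, len(numFBsPerBkt))
	if numTotalFBs == 0 {
		return numNewBkts // no feedback at all: nothing to split
	}
	for i, numFBs := range numFBsPerBkt {
		numNewBkts[i] = splitCount * numFBs / numTotalFBs
	}
	return numNewBkts
}

func main() {
	fmt.Println(distributeSplits(10, []int{1, 3, 6})) // prints [1 3 6]
	fmt.Println(distributeSplits(10, []int{1, 1, 1})) // prints [3 3 3]
}
```

The second call shows the truncation: only 9 of the 10 splits are handed out.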

zz-jason · 2018-07-02T13:15:26Z

statistics/feedback.go

+		bkt := bucket{&bounds[i-1], bounds[i].Copy(), 0, 0}
+		// get bucket count
+		_, ratio := getOverlapFraction(feedback{b.lower, b.upper, int64(originBucketCount), 0}, bkt)
+		bucketCount := originBucketCount * ratio


how about:

s/bkts/newBkts/

s/bkt/newBkt/

s/bucketCount/countInNewBkt/

zz-jason · 2018-07-02T13:47:45Z

statistics/feedback.go

-		bkt := bucket{lower: b.lower, upper: b.upper, count: int64(count)}
-		return []bucket{bkt}
-	}
+// splitBucket split the bucket according to feedback.


how about changing this comment to:

// splitBucket firstly splits this "BucketFeedback" to "newNumBkts" new buckets,
// calculates the count for each new bucket, merge the new bucket whose count
// is smaller than "minBucketFraction*totalCount" with the next new bucket
// until the last new bucket.
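
The merge step in the proposed comment can be sketched in Go as follows. This is an illustration under stated assumptions, not the PR's implementation: ranges are plain floats, the counts are already computed, `piece` and `mergeSmall` are hypothetical names, and the last piece is kept even if it is still small because it has no successor to merge into.

```go
package main

import "fmt"

// piece is a hypothetical new bucket with numeric bounds and an estimated count.
type piece struct {
	lower, upper float64
	count        float64
}

// mergeSmall folds any piece whose count is below minFraction*totalCount into
// the piece that follows it. The last piece has no successor, so this sketch
// keeps it as is (an assumption, not necessarily the PR's behavior).
func mergeSmall(pieces []piece, minFraction, totalCount float64) []piece {
	var out []piece
	var pending *piece // small pieces carried forward into the next one
	for i, p := range pieces {
		if pending != nil {
			p.lower = pending.lower
			p.count += pending.count
			pending = nil
		}
		isLast := i == len(pieces)-1
		if p.count < minFraction*totalCount && !isLast {
			carried := p
			pending = &carried
			continue
		}
		out = append(out, p)
	}
	return out
}

func main() {
	pieces := []piece{{0, 10, 2}, {10, 20, 3}, {20, 30, 100}, {30, 40, 1}}
	for _, p := range mergeSmall(pieces, 0.05, 106) {
		fmt.Printf("[%v, %v) count=%v\n", p.lower, p.upper, p.count)
	}
}
```

With a threshold of 5% of 106 rows, the first two pieces are absorbed into the third, giving [0, 30) with 105 rows followed by [30, 40) with 1 row.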

zz-jason

LGTM

zz-jason · 2018-07-03T06:15:27Z

/run-all-tests

stats: refine updating stats using feedback

4d50383

zz-jason added the type/enhancement and component/statistics labels Jun 21, 2018

coocood reviewed Jun 28, 2018

Haibin Xie added 2 commits June 29, 2018 14:02

Merge branch 'master' of github.com:pingcap/tidb into split

ffb767d

address comment

5aa6d4e

coocood added the status/LGT1 label Jul 2, 2018

winoros reviewed Jul 2, 2018

zz-jason reviewed Jul 2, 2018

address comment

db52d40

zz-jason approved these changes Jul 3, 2018

Merge branch 'master' into split

2d77a9f

zz-jason added the status/LGT2 label and removed the status/LGT1 label Jul 3, 2018

Merge branch 'master' into split

3f7cab9

alivxxx added the status/all tests passed label Jul 3, 2018

alivxxx merged commit c9cea72 into pingcap:master Jul 3, 2018

alivxxx deleted the split branch July 3, 2018 07:05
