Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

goroutine stack overflow in 2pc commit #19887

Closed
glorv opened this issue Sep 8, 2020 · 6 comments
Closed

goroutine stack overflow in 2pc commit #19887

glorv opened this issue Sep 8, 2020 · 6 comments
Labels
priority/P2 The issue has P2 priority. severity/critical sig/sql-infra SIG: SQL Infra sig/transaction SIG:Transaction type/bug The issue is confirmed as a bug.

Comments

@glorv
Copy link
Contributor

glorv commented Sep 8, 2020

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

Run br integration test failed by tidb goroutine stack overflow

[2020-09-08T09:55:55.924Z] [2020/09/08 17:55:55.897 +08:00] [INFO] [2pc.go:1114] ["2PC commitTS rejected by TiKV, retry with a newer commitTS"] [txnStartTS=419314771327713288] [info="{StartTs:419314771327713288 AttemptedCommitTs:419314782927585349 Key:6d44423a3830000000fc0000000000000048 MinCommitTs:9223372036854775808}"]
[2020-09-08T09:55:55.924Z] [2020/09/08 17:55:55.897 +08:00] [INFO] [2pc.go:1114] ["2PC commitTS rejected by TiKV, retry with a newer commitTS"] [txnStartTS=419314771327713288] [info="{StartTs:419314771327713288 AttemptedCommitTs:419314782927585350 Key:6d44423a3830000000fc0000000000000048 MinCommitTs:9223372036854775808}"]
[2020-09-08T09:55:55.924Z] runtime: goroutine stack exceeds 1000000000-byte limit
[2020-09-08T09:55:55.924Z] fatal error: stack overflow
[2020-09-08T09:55:55.924Z] 
[2020-09-08T09:55:55.924Z] runtime stack:
[2020-09-08T09:55:55.924Z] runtime.throw(0x2ef0a73, 0xe)
[2020-09-08T09:55:55.924Z] 	/usr/local/go/src/runtime/panic.go:774 +0x72
[2020-09-08T09:55:55.924Z] runtime.newstack()
[2020-09-08T09:55:55.924Z] 	/usr/local/go/src/runtime/stack.go:1046 +0x6e9
[2020-09-08T09:55:55.924Z] runtime.morestack()
[2020-09-08T09:55:55.924Z] 	/usr/local/go/src/runtime/asm_amd64.s:449 +0x8f
[2020-09-08T09:55:55.924Z] 
[2020-09-08T09:55:55.924Z] goroutine 216 [running]:
[2020-09-08T09:55:55.924Z] github.com/pingcap/tidb/store/tikv.actionCommit.handleSingleBatch(0x100, 0xc00068e000, 0xc000798500, 0x37f, 0x5, 0xe, 0xc000a362d0, 0x2, 0x2, 0xc000d2d4a0, ...)
[2020-09-08T09:55:55.924Z] 	/go/pkg/mod/github.com/pingcap/tidb@v1.1.0-beta.0.20200907094850-1fadd78fc019/store/tikv/2pc.go:1068 +0x1bd1 fp=0xc0213468b8 sp=0xc0213468b0 pc=0x1ef4341
[2020-09-08T09:55:55.924Z] github.com/pingcap/tidb/store/tikv.(*actionCommit).handleSingleBatch(0xc002bd9390, 0xc00068e000, 0xc000798500, 0x37f, 0x5, 0xe, 0xc000a362d0, 0x2, 0x2, 0xc000d2d4a0, ...)
[2020-09-08T09:55:55.924Z] 	<autogenerated>:1 +0x8e fp=0xc021346970 sp=0xc0213468b8 pc=0x1f4dd6e
[2020-09-08T09:55:55.924Z] github.com/pingcap/tidb/store/tikv.(*twoPhaseCommitter).doActionOnBatches(0xc00068e000, 0xc000798500, 0x33d7de0, 0xc002bd9390, 0xc002dbbd00, 0x1, 0x1, 0xc000a362d0, 0x2)
[2020-09-08T09:55:55.924Z] 	/go/pkg/mod/github.com/pingcap/tidb@v1.1.0-beta.0.20200907094850-1fadd78fc019/store/tikv/2pc.go:628 +0x110 fp=0xc021346d40 sp=0xc021346970 pc=0x1eee000
[2020-09-08T09:55:55.924Z] github.com/pingcap/tidb/store/tikv.(*twoPhaseCommitter).doActionOnGroupMutations(0xc00068e000, 0xc000798500, 0x33d7de0, 0xc002bd9390, 0xc002dbbc80, 0x1, 0x1, 0x2, 0xc000d2d4d0)
[2020-09-08T09:55:55.924Z] 	/go/pkg/mod/github.com/pingcap/tidb@v1.1.0-beta.0.20200907094850-1fadd78fc019/store/tikv/2pc.go:582 +0x561 fp=0xc021347108 sp=0xc021346d40 pc=0x1eed9f1
[2020-09-08T09:55:55.924Z] github.com/pingcap/tidb/store/tikv.(*twoPhaseCommitter).doActionOnMutations(0xc00068e000, 0xc000798500, 0x33d7de0, 0xc002bd9390, 0xc000a362d0, 0x2, 0x2, 0xc000d2d4a0, 0x2, 0x2, ...)
[2020-09-08T09:55:55.924Z] 	/go/pkg/mod/github.com/pingcap/tidb@v1.1.0-beta.0.20200907094850-1fadd78fc019/store/tikv/2pc.go:478 +0x1c2 fp=0xc0213473a0 sp=0xc021347108 pc=0x1eec482
[2020-09-08T09:55:55.924Z] github.com/pingcap/tidb/store/tikv.(*twoPhaseCommitter).commitMutations(0xc00068e000, 0xc000798500, 0xc000a362d0, 0x2, 0x2, 0xc000d2d4a0, 0x2, 0x2, 0xc000d2d4d0, 0x2, ...)
[2020-09-08T09:55:55.924Z] 	/go/pkg/mod/github.com/pingcap/tidb@v1.1.0-beta.0.20200907094850-1fadd78fc019/store/tikv/2pc.go:1216 +0xea fp=0xc0213474d0 sp=0xc0213473a0 pc=0x1ef51da
[2020-09-08T09:55:55.924Z] github.com/pingcap/tidb/store/tikv.actionCommit.handleSingleBatch(0x100, 0xc00068e000, 0xc000798500, 0x37f, 0x5, 0xe, 0xc000a362d0, 0x2, 0x2, 0xc000d2d4a0, ...)
[2020-09-08T09:55:55.924Z] 	/go/pkg/mod/github.com/pingcap/tidb@v1.1.0-beta.0.20200907094850-1fadd78fc019/store/tikv/2pc.go:1130 +0x1680 fp=0xc021347d88 sp=0xc0213474d0 pc=0x1ef3df0
[2020-09-08T09:55:55.925Z] github.com/pingcap/tidb/store/tikv.(*actionCommit).handleSingleBatch(0xc002c473f0, 0xc00068e000, 0xc000798500, 0x37f, 0x5, 0xe, 0xc000a362d0, 0x2, 0x2, 0xc000d2d4a0, ...)
[2020-09-08T09:55:55.925Z] 	<autogenerated>:1 +0x8e fp=0xc021347e40 sp=0xc021347d88 pc=0x1f4dd6e
[2020-09-08T09:55:55.925Z] github.com/pingcap/tidb/store/tikv.(*twoPhaseCommitter).doActionOnBatches(0xc00068e000, 0xc000798500, 0x33d7de0, 0xc002c473f0, 0xc002e01880, 0x1, 0x1, 0xc000a362d0, 0x2)
[2020-09-08T09:55:55.925Z] 	/go/pkg/mod/github.com/pingcap/tidb@v1.1.0-beta.0.20200907094850-1fadd78fc019/store/tikv/2pc.go:628 +0x110 fp=0xc021348210 sp=0xc021347e40 pc=0x1eee000

See: https://internal.pingcap.net/idc-jenkins/blue/organizations/jenkins/br_ghpr_unit_and_integration_test/detail/br_ghpr_unit_and_integration_test/2853/pipeline/

2. What did you expect to see? (Required)

3. What did you see instead (Required)

4. What is your TiDB version? (Required)

release-4.0 branch with reversion 3706da0fa6c7a597f3e8c69520f3ba0748dbd17f

@glorv glorv added the type/bug The issue is confirmed as a bug. label Sep 8, 2020
@cfzjywxk
Copy link
Contributor

cfzjywxk commented Sep 8, 2020

The DDL job related log

[2020-09-08T09:55:05.834Z] [2020/09/08 17:55:05.773 +08:00] [INFO] [domain.go:624] ["performing DDL change, must reload"]
[2020-09-08T09:55:05.834Z] [2020/09/08 17:55:05.774 +08:00] [DEBUG] [scan.go:157] ["txn getData"] [nextStartKey=6d4442730000000000fa0000000000000068] [nextEndKey=6d4442730000000000fa0000000000000069] [reverse=false] [txnStartTS=9223372036854775807] 
[2020-09-08T09:55:05.834Z] [2020/09/08 17:55:05.774 +08:00] [DEBUG] [scan.go:157] ["txn getData"] [nextStartKey=6d4442730000000000fa000000000000006844423a3300000000fb] [nextEndKey=6d4442730000000000fa0000000000000069] [reverse=false] [txnStartTS=9223372036854775807]
[2020-09-08T09:55:05.834Z] [2020/09/08 17:55:05.775 +08:00] [DEBUG] [scan.go:157] ["txn getData"] [nextStartKey=6d44423a3634000000fc0000000000000068] [nextEndKey=6d44423a3634000000fc0000000000000069] [reverse=false] [txnStartTS=9223372036854775807] 
[2020-09-08T09:55:05.834Z] [2020/09/08 17:55:05.775 +08:00] [DEBUG] [scan.go:157] ["txn getData"] [nextStartKey=6d44423a3300000000fb0000000000000068] [nextEndKey=6d44423a3300000000fb0000000000000069] [reverse=false] [txnStartTS=9223372036854775807] 
[2020-09-08T09:55:05.834Z] [2020/09/08 17:55:05.775 +08:00] [DEBUG] [scan.go:157] ["txn getData"] [nextStartKey=6d44423a3630000000fc0000000000000068] [nextEndKey=6d44423a3630000000fc0000000000000069] [reverse=false] [txnStartTS=9223372036854775807] 
[2020-09-08T09:55:05.834Z] [2020/09/08 17:55:05.775 +08:00] [DEBUG] [scan.go:157] ["txn getData"] [nextStartKey=6d44423a3632000000fc0000000000000068] [nextEndKey=6d44423a3632000000fc0000000000000069] [reverse=false] [txnStartTS=9223372036854775807] 
[2020-09-08T09:55:05.834Z] [2020/09/08 17:55:05.775 +08:00] [DEBUG] [scan.go:157] ["txn getData"] [nextStartKey=6d44423a3100000000fb0000000000000068] [nextEndKey=6d44423a3100000000fb0000000000000069] [reverse=false] [txnStartTS=9223372036854775807] 
[2020-09-08T09:55:05.834Z] [2020/09/08 17:55:05.776 +08:00] [DEBUG] [scan.go:157] ["txn getData"] [nextStartKey=6d44423a3300000000fb00000000000000685461626c653a3135ff0000000000000000f7] [nextEndKey=6d44423a3300000000fb0000000000000069] [reverse=false] [txnStartTS=9223372036854775807]
[2020-09-08T09:55:05.834Z] [2020/09/08 17:55:05.776 +08:00] [DEBUG] [scan.go:157] ["txn getData"] [nextStartKey=6d44423a3300000000fb00000000000000685461626c653a3339ff0000000000000000f7] [nextEndKey=6d44423a3300000000fb0000000000000069] [reverse=false] [txnStartTS=9223372036854775807]

The scan snapshot version ts is 9223372036854775807 which is an invalid value int64.MAX. This scan resolved the following key and it gets a min_commit_ts with 9223372036854775808, committing this key causes infinite retry because of commit ts expired error.
Need to investigate why the DDL job uses an invalid snapshot version or timestamp to do the scan.

@cfzjywxk
Copy link
Contributor

@you06 We could add some debug log for the scanner or related path, we may also need to add some sanity check for the ts usage for the transaction and the snapshot and report error immediately if there is invalid input.

@you06
Copy link
Contributor

you06 commented Sep 27, 2020

@glorv This is caused by the invalid startTS input from BR, and it has been fixed. Besides, I found there still usage of MaxInt64 in BR's test code, can you help review pingcap/br#526?

@you06
Copy link
Contributor

you06 commented Sep 28, 2020

Should fixed in br.

@sre-bot
Copy link
Contributor

sre-bot commented Oct 9, 2020

Integrity check:
RCA symptom trigger_condition affect_version fix_version fields are empty

Please comment /info to get template

@ti-srebot
Copy link
Contributor

ti-srebot commented Oct 9, 2020

Please edit this comment to complete the following information

Not a bug

  1. Remove the 'type/bug' label
  2. Add notes to indicate why it is not a bug

Duplicate bug

  1. Add the 'type/duplicate' label
  2. Add the link to the original bug

Bug

Note: Make Sure that 'component', and 'severity' labels are added
Example for how to fill out the template: #20100

1. Root Cause Analysis (RCA)

BR used some functions from TiDB to communite with TiKV, but BR takes snapshot using math.MaxInt64 which will make TiKV set minCommitTS of concurrent write transaction to math.MaxInt64 + 1. The extreme large minCommitTS makes TiDB retry recursively and lead to stack overflow.

2. Symptom

The project used TiDB got stackoverflow.

3. All Trigger Conditions

Either of the 2 conditions may lead to this error.

  • Not upgraded BR to v4.0.6 or later.
  • Some projects import functions from TiDB and use txn functions with incorrect timestamp

4. Workaround (optional)

5. Affected versions

TiDB itself doesn't have this issue.

BR's affected versions: [v4.0.1:v4.0.5]

6. Fixed versions

BR v4.0.6

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority/P2 The issue has P2 priority. severity/critical sig/sql-infra SIG: SQL Infra sig/transaction SIG:Transaction type/bug The issue is confirmed as a bug.
Projects
None yet
Development

No branches or pull requests

6 participants