Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

panic when point get is executed under index merge join #16817

Closed
zz-jason opened this issue Apr 25, 2020 · 5 comments · Fixed by #17555
Closed

panic when point get is executed under index merge join #16817

zz-jason opened this issue Apr 25, 2020 · 5 comments · Fixed by #17555
Assignees
Labels
priority/P1 The issue has P1 priority. severity/major sig/execution SIG execution type/bug The issue is confirmed as a bug.
Milestone

Comments

@zz-jason
Copy link
Member

Bug Report

I wrote a customer sysbench test, the lua script file is named issue.lua, and the content is:

#!/usr/bin/env sysbench

function thread_init()
    drv = sysbench.sql.driver()
    con = drv:connect()
    con:query([[
        prepare stmt from "
            select * from t1, t2
            where t1.a = t2.a
            and t2.a = ?
            and t2.b = 1
        "
    ]])
end

function thread_done()
    con:disconnect()
end

function event()
    con:query("set @a = 1")
    con:query("execute stmt using @a")
end

function prepare()
    local drv = sysbench.sql.driver()
    local con = drv:connect()

    con:query("DROP TABLE IF EXISTS t1")
    con:query([[
        CREATE TABLE t1 (
            a DOUBLE,
            b DOUBLE,
            c DOUBLE DEFAULT NULL,
            PRIMARY KEY (a, b)
        )
    ]])

    con:query([[INSERT INTO t1(a, b) VALUES (1, 1)]])
    con:query([[INSERT INTO t1(a, b) VALUES (1, 2)]])
    con:query([[INSERT INTO t1(a, b) VALUES (2, 1)]])
    con:query([[INSERT INTO t1(a, b) VALUES (2, 2)]])
    con:query([[INSERT INTO t1(a, b) VALUES (3, 1)]])
    con:query([[INSERT INTO t1(a, b) VALUES (3, 2)]])

    con:query("DROP TABLE IF EXISTS t2")
    con:query([[
        CREATE TABLE t2 (
            a DOUBLE,
            b DOUBLE DEFAULT NULL,
            c DOUBLE DEFAULT NULL,
            UNIQUE KEY idx_a (a)
        )
    ]])

    con:query([[INSERT INTO t2(a, b) VALUES(1, 1)]])
    con:query([[INSERT INTO t2(a, b) VALUES(2, 1)]])
    con:query([[INSERT INTO t2(a, b) VALUES(3, 1)]])
end

function cleanup()
    local drv = sysbench.sql.driver()
    local con = drv:connect()
    con:query("DROP TABLE IF EXISTS t1")
    con:query("DROP TABLE IF EXISTS t2")
end

Each connection constantly print a panic log as following when the sysbench runs:

[2020/04/25 16:47:17.266 +08:00] [ERROR] [index_lookup_merge_join.go:300] ["panic in outerMergeWorker.run"] [conn=3] [r="\"invalid memory address or nil pointer dereference\""] ["stack trace"="github.com/pingcap/tidb/executor.(*outerMergeWorker).run.func1\n\t/Users/jianzhang.zj/Code/tidb/executor/                    index_lookup_merge_join.go:302\nruntime.gopanic\n\t/opt/goroot/src/runtime/panic.go:679\nruntime.panicmem\n\t/opt/goroot/src/runtime/panic.go:199\nruntime.sigpanic\n\t/opt/goroot/src/runtime/signal_unix.go:394\ngithub.com/pingcap/tidb/session.(*TxnState).GetSnapshot\n\t<autogenerated>:1\ngithub.com/pingcap/tidb/     executor.(*PointGetExecutor).Next\n\t/Users/jianzhang.zj/Code/tidb/executor/point_get.go:141\ngithub.com/pingcap/tidb/executor.Next\n\t/Users/jianzhang.zj/Code/tidb/executor/executor.go:249\ngithub.com/pingcap/tidb/executor.(*SelectionExec).Next\n\t/Users/jianzhang.zj/Code/tidb/executor/executor.go:1212\ngithub.com/ pingcap/tidb/executor.Next\n\t/Users/jianzhang.zj/Code/tidb/executor/executor.go:249\ngithub.com/pingcap/tidb/executor.(*outerMergeWorker).buildTask\n\t/Users/jianzhang.zj/Code/tidb/executor/index_lookup_merge_join.go:357\ngithub.com/pingcap/tidb/executor.(*outerMergeWorker).run\n\t/Users/jianzhang.zj/Code/tidb/     executor/index_lookup_merge_join.go:307"]

1. What did you do?

  • step 1: start a tidb-server with mocktikv and plan cache enabled

  • step 2: prepare table and data via the sysbench test:

sysbench --report-interval=1 --time=300 --threads=4 --mysql-host=127.0.0.1 --mysql-port=4000 --mysql-user=root --mysql-db=test --db-driver=mysql issue prepare
  • step 3: run the sysbench test:
sysbench --report-interval=1 --time=300 --threads=4 --mysql-host=127.0.0.1 --mysql-port=4000 --mysql-user=root --mysql-db=test --db-driver=mysql issue run
  • step 4: monitor the tidb log file, you can see the panic log.

2. What did you expect to see?

  • no panic, runs successfully

3. What did you see instead?

panic as described above

4. What version of TiDB are you using? (tidb-server -V or run select tidb_version(); on TiDB)

➜ ./bin/tidb-server -V
Release Version: v4.0.0-beta.2-352-g5ede18f10
Git Commit Hash: 5ede18f10eedfe2e3690d7728dec3ffa4b0af2d5
Git Branch: master
UTC Build Time: 2020-04-25 08:41:26
GoVersion: go1.13.10
Race Enabled: false
TiKV Min Version: v3.0.0-60965b006877ca7234adaced7890d7b029ed1306
Check Table Before Drop: false
@zz-jason zz-jason added type/bug The issue is confirmed as a bug. sig/planner SIG: Planner labels Apr 25, 2020
@zz-jason zz-jason added this to the v4.0.0-rc.2 milestone Apr 25, 2020
@eurekaka
Copy link
Contributor

eurekaka commented Apr 27, 2020

@zz-jason I can't reproduce this problem in my environment using your script and commands. I added INL_MERGE_JOIN(t1,t2) to the prepared statement, the panic was still not reproduced.

@eurekaka
Copy link
Contributor

eurekaka commented Apr 29, 2020

I have reproduced the panic in a SMP environment. Change the --threads=4 in the sysbench script to --thread=1 can reproduce the problem as well, but after I turned off the plan cache, the panic disappeared. However, the problem is not caused by plan cache actually.

When plan cache is disabled, the plan for the query is HashJoin instead of IndexMergeJoin, because the constant propagation would simplify the t1.a = t2.a in the query, so there is no column equal condition which can be used as index join lookup key. When plan cache is enabled, the constant propagation for col = const which contains parameters is forbidden, so the INL_MERGE_JOIN hint can take effect. The panic is caused by PointGet in IndexMergeJoin which runs in SMP context.

The root cause of the panic was introduced by #14775 originally, and the code which triggers the panic was introduced by #15400, but the panic is actually exposed by a recent PR #15627.

RCA of the panic: in code snippet of PointGetExecutor::Next:

136     e.txn, err = e.ctx.Txn(false)
137     if err != nil {
138         return err
139     }
140     if e.txn.Valid() && txnCtx.StartTS == txnCtx.GetForUpdateTS() {
141         e.snapshot = e.txn.GetSnapshot()
142     } else {
143         e.snapshot, err = e.ctx.GetStore().GetSnapshot(kv.Version{Ver: snapshotTS})
144         if err != nil {
145             return err
146         }
147     }

the e.txn is of struct TxnState during runtime of the query. The panic happens occasionally, when it doesn't happen, the code path falls into line 143, while when the panic happens, the code path always falls into line 141. The panic is caused by the fact that e.txn.(*TxnState).Transction becomes nil in line 141, however since it passes the check of Valid() in line 140, e.txn.(*TxnState).Transction cannot be nil then. The only explanation for this pretty weird phenomenon is that, e.txn.(*TxnState).Transction is set nil by other threads during the execution time window of line 140 and line 141.

I turned on the -race flag when compiling tidb-server, and found out the data race:

WARNING: DATA RACE
Read at 0x00c000dc0210 by goroutine 109:
  github.com/pingcap/tidb/session.(*TxnState).Valid()
      /Users/cauchy/go/src/github.com/pingcap/tidb/session/txn.go:125 +0x3a
  github.com/pingcap/tidb/executor.(*PointGetExecutor).Next()
      /Users/cauchy/go/src/github.com/pingcap/tidb/executor/point_get.go:140 +0x25c
  github.com/pingcap/tidb/executor.Next()
      /Users/cauchy/go/src/github.com/pingcap/tidb/executor/executor.go:249 +0x167
  github.com/pingcap/tidb/executor.(*SelectionExec).Next()
      /Users/cauchy/go/src/github.com/pingcap/tidb/executor/executor.go:1212 +0x6e5
  github.com/pingcap/tidb/executor.Next()
      /Users/cauchy/go/src/github.com/pingcap/tidb/executor/executor.go:249 +0x167
  github.com/pingcap/tidb/executor.(*outerMergeWorker).buildTask()
      /Users/cauchy/go/src/github.com/pingcap/tidb/executor/index_lookup_merge_join.go:357 +0xa0c
  github.com/pingcap/tidb/executor.(*outerMergeWorker).run()
      /Users/cauchy/go/src/github.com/pingcap/tidb/executor/index_lookup_merge_join.go:307 +0xdb

Previous write at 0x00c000dc0210 by goroutine 107:
  github.com/pingcap/tidb/session.(*session).doCommit.func1()
      /Users/cauchy/go/src/github.com/pingcap/tidb/session/txn.go:201 +0x71
  github.com/pingcap/tidb/session.(*session).doCommit()
      /Users/cauchy/go/src/github.com/pingcap/tidb/session/session.go:396 +0x990
  github.com/pingcap/tidb/session.(*session).doCommitWithRetry()
      /Users/cauchy/go/src/github.com/pingcap/tidb/session/session.go:454 +0x19f
  github.com/pingcap/tidb/session.(*session).CommitTxn()
      /Users/cauchy/go/src/github.com/pingcap/tidb/session/session.go:513 +0x163
  github.com/pingcap/tidb/session.finishStmt()
      /Users/cauchy/go/src/github.com/pingcap/tidb/session/tidb.go:211 +0x48b
  github.com/pingcap/tidb/session.runStmt()
      /Users/cauchy/go/src/github.com/pingcap/tidb/session/tidb.go:293 +0x4cd
  github.com/pingcap/tidb/session.(*session).ExecuteStmt()
      /Users/cauchy/go/src/github.com/pingcap/tidb/session/session.go:1163 +0xa7b
  github.com/pingcap/tidb/server.(*TiDBContext).ExecuteStmt()
      /Users/cauchy/go/src/github.com/pingcap/tidb/server/driver_tidb.go:248 +0x8e
  github.com/pingcap/tidb/server.(*clientConn).handleStmt()
      /Users/cauchy/go/src/github.com/pingcap/tidb/server/conn.go:1290 +0xc0
  github.com/pingcap/tidb/server.(*clientConn).handleQuery()
      /Users/cauchy/go/src/github.com/pingcap/tidb/server/conn.go:1278 +0x15e
  github.com/pingcap/tidb/server.(*clientConn).dispatch()
      /Users/cauchy/go/src/github.com/pingcap/tidb/server/conn.go:901 +0x620
  github.com/pingcap/tidb/server.(*clientConn).Run()
      /Users/cauchy/go/src/github.com/pingcap/tidb/server/conn.go:715 +0x364
  github.com/pingcap/tidb/server.(*Server).onConn()
      /Users/cauchy/go/src/github.com/pingcap/tidb/server/server.go:415 +0xfe5

The reason of the data race is that: PointGet is outer child of IndexMergeJoin, during IndexMergeJoinExecutor::Open, the outer worker goroutine is started to fetch rows by PointGetExecutor::Next, while the main goroutine returns back to runStmt -> finishStmt .. -> doCommit -> changeToInvalid, which would reset the TxnState.Transaction to nil, thus panic happens in the worker goroutine.

I guess there may exist other problems / panics behind this single symptom described in this issue, because PointGet now may be running in parallel with other goroutines(not only the main goroutine in this case, but other goroutines, e.g, workers of IndexJoin / IndexHashJoin?), while the original implementation of its execution seems not to consider the possibility of concurrent running? We'd better do a full check of these potential concurrent situations to make sure PointGet is safe there.

I checked out the commit before #15400, and the panic still happened, but in another code place, e.g, when calling TxnState.IsReadOnly in point get execution.

I prefer @lzmhhh123 @SunRunAway you guys to fix this problem because you know more about the potential risks of the concurrent execution than me.

BTW, another problem exposed by this issue is that, in https://github.com/pingcap/tidb/blob/master/executor/index_lookup_merge_join.go#L300, the outer worker only prints an ERROR log without reporting this error to upper level, so the query actually can succeed with empty result, which is unexpected. Please fix this as well.

@eurekaka eurekaka changed the title [plan cache] got panic on concurrently running index merge join statements panic when point get is executed under index merge join Apr 29, 2020
@eurekaka eurekaka added sig/execution SIG execution and removed sig/planner SIG: Planner labels Apr 29, 2020
@zz-jason zz-jason added the priority/P1 The issue has P1 priority. label Apr 29, 2020
@bb7133 bb7133 modified the milestones: v4.0.0-rc.1, 4.0.0-rc.2 May 9, 2020
@bb7133 bb7133 modified the milestones: 4.0.0-rc.2, v4.0.0-ga May 19, 2020
@zz-jason zz-jason modified the milestones: v4.0.0-ga, v4.0.1 May 28, 2020
@zz-jason zz-jason assigned tiancaiamao and unassigned eurekaka Jun 3, 2020
@zz-jason zz-jason removed this from the v4.0.1 milestone Jun 5, 2020
@SunRunAway
Copy link
Contributor

BTW, another problem exposed by this issue is that, in https://github.com/pingcap/tidb/blob/master/executor/index_lookup_merge_join.go#L300, the outer worker only prints an ERROR log without reporting this error to upper level, so the query actually can succeed with empty result, which is unexpected. Please fix this as well.

@qw4990 PTAL

@SunRunAway SunRunAway reopened this Jun 30, 2020
@zz-jason zz-jason modified the milestones: v4.0.2, v4.0.3 Jul 10, 2020
@winoros winoros modified the milestones: v4.0.3, v4.0.4 Jul 15, 2020
@zz-jason
Copy link
Member Author

BTW, another problem exposed by this issue is that, in master/executor/index_lookup_merge_join.go#L300, the outer worker only prints an ERROR log without reporting this error to upper level, so the query actually can succeed with empty result, which is unexpected. Please fix this as well.

@qw4990 PTAL

@SunRunAway the issue title is not reflecting the remaining unresolved problem. I think we can create another issue for that and close this one.

@SunRunAway
Copy link
Contributor

BTW, another problem exposed by this issue is that, in master/executor/index_lookup_merge_join.go#L300, the outer worker only prints an ERROR log without reporting this error to upper level, so the query actually can succeed with empty result, which is unexpected. Please fix this as well.

@qw4990 PTAL

@SunRunAway the issue title is not reflecting the remaining unresolved problem. I think we can create another issue for that and close this one.

Thanks for the reminding, #18070 and #18572 are reflecting this problem and both of them are resolved now. I'm going to close this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority/P1 The issue has P1 priority. severity/major sig/execution SIG execution type/bug The issue is confirmed as a bug.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants