select from information_schema returns "device or resource busy" for non-leader PD #37764

Closed
kolbe opened this issue Sep 12, 2022 · 5 comments

Labels: component/pd, component/store, duplicate (Issues or pull requests already exists), type/bug (The issue is confirmed as a bug), type/stale (This issue has not been updated for a long time)

Comments

kolbe (Contributor) commented Sep 12, 2022

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

I altered the route tables in AWS so that my primary region (us-east-1, where the PD leader is located) has no route to a tertiary region.
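For anyone scripting this reproduce step, here is a minimal sketch using aws-sdk-go (the route table ID and destination CIDR below are placeholders, not values from my setup):

package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	// Session in the primary region, where the PD leader lives.
	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-east-1")}))
	svc := ec2.New(sess)

	// Delete the route from us-east-1's route table toward the tertiary
	// region's CIDR block (both values are placeholders).
	_, err := svc.DeleteRoute(&ec2.DeleteRouteInput{
		RouteTableId:         aws.String("rtb-0123456789abcdef0"),
		DestinationCidrBlock: aws.String("10.2.0.0/16"),
	})
	if err != nil {
		log.Fatal(err)
	}
}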

2. What did you expect to see? (Required)

If a single PD member cannot be reached, but quorum remains and there is an active leader, queries to information_schema should still return results.

3. What did you see instead (Required)

select count(*) from information_schema.tikv_region_status;
ERROR 1105 (HY000): Get "http://test-us-east-2-pd-0.test-us-east-2-pd-peer.test-us-east-2.svc.cluster.local:2379/pd/api/v1/regions": dial tcp: lookup test-us-east-2-pd-0.test-us-east-2-pd-peer.test-us-east-2.svc.cluster.local: device or resource busy

tidb-server log:

[2022/09/12 20:05:32.498 +00:00] [INFO] [data_window.go:249] ["Error exists when getting the SQL Metric."]
[2022/09/12 20:06:38.546 +00:00] [INFO] [data_window.go:249] ["Error exists when getting the SQL Metric."]
[2022/09/12 20:07:44.594 +00:00] [INFO] [data_window.go:249] ["Error exists when getting the SQL Metric."]
[2022/09/12 20:08:05.351 +00:00] [INFO] [conn.go:1149] ["command dispatched failed"] [conn=7172376031750261375] [connInfo="id:7172376031750261375, addr:10.100.111.193:55344 status:10, collation:utf8_general_ci, user:root"] [command=Query] [status="inTxn:0, autocommit:1"] [sql="select count(*) from information_schema.tikv_region_status"] [txn_mode=PESSIMISTIC] [timestamp=0] [err="Get \"http://test-us-east-2-pd-0.test-us-east-2-pd-peer.test-us-east-2.svc.cluster.local:2379/pd/api/v1/regions\": dial tcp: lookup test-us-east-2-pd-0.test-us-east-2-pd-peer.test-us-east-2.svc.cluster.local: device or resource busy
github.com/pingcap/errors.AddStack
	/go/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20211224045212-9687c2b0f87c/errors.go:174
github.com/pingcap/errors.Trace
	/go/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20211224045212-9687c2b0f87c/juju_adaptor.go:15
github.com/pingcap/tidb/store/helper.(*Helper).requestPD
	/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/store/helper/helper.go:862
github.com/pingcap/tidb/store/helper.(*Helper).GetRegionsInfo
	/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/store/helper/helper.go:797
github.com/pingcap/tidb/executor.(*memtableRetriever).setDataForTiKVRegionStatus
	/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/infoschema_reader.go:1544
github.com/pingcap/tidb/executor.(*memtableRetriever).retrieve
	/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/infoschema_reader.go:145
github.com/pingcap/tidb/executor.(*MemTableReaderExec).Next
	/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/memtable_reader.go:118
github.com/pingcap/tidb/executor.Next
	/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/executor.go:319
github.com/pingcap/tidb/executor.(*HashAggExec).fetchChildData
	/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/aggregate.go:791
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1571"]

4. What is your TiDB version? (Required)

Release Version: v6.1.1
Edition: Community
Git Commit Hash: 5263a0abda61f102122735049fd0dfadc7b7f8b2
Git Branch: heads/refs/tags/v6.1.1
UTC Build Time: 2022-08-25 10:42:41
GoVersion: go1.18.5
Race Enabled: false
TiKV Min Version: v3.0.0-60965b006877ca7234adaced7890d7b029ed1306
Check Table Before Drop: false
kolbe added the type/bug label on Sep 12, 2022
zhangjinpeng87 (Contributor) commented Sep 12, 2022

@kolbe Where is the tidb-server on which you ran the SQL? Is it also in the us-east-1 region, or in another region? The error message hints that this tidb-server can't dial "test-us-east-2-pd-0"; does that region perhaps hold the new leader?

kolbe (Contributor, Author) commented Sep 12, 2022

The tidb-server where I run this SQL is in us-east-1.

MySQL [test]> select @@hostname;
+-----------------------+
| @@hostname            |
+-----------------------+
| test-us-east-1-tidb-0 |
+-----------------------+
1 row in set (0.002 sec)

PD leader is also in us-east-1:

$ kubectl --context "${contexts[0]}" -n test-us-east-1 exec test-us-east-1-pd-0 -- /pd-ctl member | jq .leader
{
  "name": "test-us-east-1-pd-0.test-us-east-1-pd-peer.test-us-east-1.svc.cluster.local",
  "member_id": 10924548383988693000,
  "peer_urls": [
    "http://test-us-east-1-pd-0.test-us-east-1-pd-peer.test-us-east-1.svc.cluster.local:2380"
  ],
  "client_urls": [
    "http://test-us-east-1-pd-0.test-us-east-1-pd-peer.test-us-east-1.svc.cluster.local:2379"
  ],
  "leader_priority": 2,
  "deploy_path": "/",
  "binary_version": "v6.1.1",
  "git_hash": "4ab9c0ef123441a0ef279bf9d2e36d1abe4a14c1"
}

zhangjinpeng87 (Contributor) commented Sep 12, 2022

According to https://github.com/pingcap/tidb/blob/master/store/helper/helper.go#L829 and the error message, I think this tidb-server has tried to contact all of the PD servers in all regions and printed only the last error it encountered. There may be network isolation between this tidb-server and all of the PD servers.
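For context, this is roughly the pattern that helper code follows (a minimal sketch, not the actual TiDB source; names are illustrative): each PD host is tried in turn, and only the error from the last failed attempt is surfaced, which can hide what happened with the other members.

package main

import (
	"fmt"
	"net/http"
	"time"
)

// requestAnyPD is a simplified stand-in for the loop around
// store/helper/helper.go#L829: try each PD host in turn, stop at the
// first success, and otherwise return only the last error seen.
func requestAnyPD(hosts []string, path string) (*http.Response, error) {
	client := &http.Client{Timeout: 5 * time.Second}
	var lastErr error
	for _, host := range hosts {
		resp, err := client.Get("http://" + host + path)
		if err == nil {
			return resp, nil
		}
		lastErr = err // earlier failures are overwritten
	}
	// Only the final host's failure reaches the caller, so the message
	// may name an unreachable follower even if other members failed in
	// a different (and more informative) way.
	return nil, fmt.Errorf("all PD hosts failed: %w", lastErr)
}

func main() {
	hosts := []string{"pd-0.example:2379", "pd-1.example:2379"} // hypothetical
	if _, err := requestAnyPD(hosts, "/pd/api/v1/regions"); err != nil {
		fmt.Println(err)
	}
}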

kolbe (Contributor, Author) commented Sep 13, 2022

I can confirm that the problem is not network isolation between this tidb-server instance and the PD servers. I was able to contact the PD servers in us-east-1 and us-west-2 using curl while logged into this tidb pod.
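For reference, a minimal Go probe in the same spirit as that curl check (the us-east-1 hostname is taken from the pd-ctl output above; the us-west-2 hostname is an assumed analogue):

package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	// Hostnames follow the naming pattern shown earlier in this thread;
	// the us-west-2 entry is an assumption, not copied from output.
	hosts := []string{
		"test-us-east-1-pd-0.test-us-east-1-pd-peer.test-us-east-1.svc.cluster.local:2379",
		"test-us-west-2-pd-0.test-us-west-2-pd-peer.test-us-west-2.svc.cluster.local:2379",
	}
	client := &http.Client{Timeout: 5 * time.Second}
	for _, h := range hosts {
		// /pd/api/v1/members is the same PD HTTP API family as the
		// /pd/api/v1/regions call that failed in the error above.
		resp, err := client.Get("http://" + h + "/pd/api/v1/members")
		if err != nil {
			fmt.Printf("%s: unreachable: %v\n", h, err)
			continue
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()
		fmt.Printf("%s: HTTP %d, %d bytes\n", h, resp.StatusCode, len(body))
	}
}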

nolouch (Member) commented Sep 13, 2022

I think this was fixed by PR #35750 (issue #35708), and v6.1.1 does not include that commit.

nolouch added the duplicate label on Sep 13, 2022
jebter closed this as completed on Aug 8, 2024
jebter added the type/stale label on Aug 8, 2024