select from information_schema returns "device or resource busy" for non-leader PD #37764

Closed
kolbe opened this issue Sep 12, 2022 · 5 comments

Labels: component/pd, component/store, duplicate (Issues or pull requests already exists), type/bug (The issue is confirmed as a bug), type/stale (This issue has not been updated for a long time)

Comments

kolbe (Contributor) commented Sep 12, 2022

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

I altered the route tables in AWS so that my primary region (us-east-1, where the PD leader is located) has no route to a tertiary region.
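For anyone scripting this reproduce step, here is a minimal sketch using aws-sdk-go (the route table ID and destination CIDR below are placeholders, not values from my setup):

package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	// Session in the primary region, where the PD leader lives.
	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-east-1")}))
	svc := ec2.New(sess)

	// Delete the route from us-east-1's route table toward the tertiary
	// region's CIDR block (both values are placeholders).
	_, err := svc.DeleteRoute(&ec2.DeleteRouteInput{
		RouteTableId:         aws.String("rtb-0123456789abcdef0"),
		DestinationCidrBlock: aws.String("10.2.0.0/16"),
	})
	if err != nil {
		log.Fatal(err)
	}
}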

2. What did you expect to see? (Required)

If a single PD member cannot be reached, but quorum remains and there is an active leader, queries to information_schema should still return results.

3. What did you see instead (Required)

select count(*) from information_schema.tikv_region_status;
ERROR 1105 (HY000): Get "http://test-us-east-2-pd-0.test-us-east-2-pd-peer.test-us-east-2.svc.cluster.local:2379/pd/api/v1/regions": dial tcp: lookup test-us-east-2-pd-0.test-us-east-2-pd-peer.test-us-east-2.svc.cluster.local: device or resource busy

tidb-server log:

[2022/09/12 20:05:32.498 +00:00] [INFO] [data_window.go:249] ["Error exists when getting the SQL Metric."]
[2022/09/12 20:06:38.546 +00:00] [INFO] [data_window.go:249] ["Error exists when getting the SQL Metric."]
[2022/09/12 20:07:44.594 +00:00] [INFO] [data_window.go:249] ["Error exists when getting the SQL Metric."]
[2022/09/12 20:08:05.351 +00:00] [INFO] [conn.go:1149] ["command dispatched failed"] [conn=7172376031750261375] [connInfo="id:7172376031750261375, addr:10.100.111.193:55344 status:10, collation:utf8_general_ci, user:root"] [command=Query] [status="inTxn:0, autocommit:1"] [sql="select count(*) from information_schema.tikv_region_status"] [txn_mode=PESSIMISTIC] [timestamp=0] [err="Get \"http://test-us-east-2-pd-0.test-us-east-2-pd-peer.test-us-east-2.svc.cluster.local:2379/pd/api/v1/regions\": dial tcp: lookup test-us-east-2-pd-0.test-us-east-2-pd-peer.test-us-east-2.svc.cluster.local: device or resource busy
github.com/pingcap/errors.AddStack
	/go/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20211224045212-9687c2b0f87c/errors.go:174
github.com/pingcap/errors.Trace
	/go/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20211224045212-9687c2b0f87c/juju_adaptor.go:15
github.com/pingcap/tidb/store/helper.(*Helper).requestPD
	/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/store/helper/helper.go:862
github.com/pingcap/tidb/store/helper.(*Helper).GetRegionsInfo
	/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/store/helper/helper.go:797
github.com/pingcap/tidb/executor.(*memtableRetriever).setDataForTiKVRegionStatus
	/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/infoschema_reader.go:1544
github.com/pingcap/tidb/executor.(*memtableRetriever).retrieve
	/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/infoschema_reader.go:145
github.com/pingcap/tidb/executor.(*MemTableReaderExec).Next
	/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/memtable_reader.go:118
github.com/pingcap/tidb/executor.Next
	/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/executor.go:319
github.com/pingcap/tidb/executor.(*HashAggExec).fetchChildData
	/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/aggregate.go:791
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1571"]

4. What is your TiDB version? (Required)

Release Version: v6.1.1
Edition: Community
Git Commit Hash: 5263a0abda61f102122735049fd0dfadc7b7f8b2
Git Branch: heads/refs/tags/v6.1.1
UTC Build Time: 2022-08-25 10:42:41
GoVersion: go1.18.5
Race Enabled: false
TiKV Min Version: v3.0.0-60965b006877ca7234adaced7890d7b029ed1306
Check Table Before Drop: false
kolbe added the type/bug label on Sep 12, 2022
zhangjinpeng87 (Contributor) commented Sep 12, 2022

@kolbe Where is the tidb-server on which you ran the SQL? Is it also in the us-east-1 region, or in another region? The error message hints that this tidb-server can't dial "test-us-east-2-pd-0"; does that region perhaps hold the new leader?

kolbe (Contributor, Author) commented Sep 12, 2022

The tidb-server where I run this SQL is in us-east-1.

MySQL [test]> select @@hostname;
+-----------------------+
| @@hostname            |
+-----------------------+
| test-us-east-1-tidb-0 |
+-----------------------+
1 row in set (0.002 sec)

PD leader is also in us-east-1:

$ kubectl --context "${contexts[0]}" -n test-us-east-1 exec test-us-east-1-pd-0 -- /pd-ctl member | jq .leader
{
  "name": "test-us-east-1-pd-0.test-us-east-1-pd-peer.test-us-east-1.svc.cluster.local",
  "member_id": 10924548383988693000,
  "peer_urls": [
    "http://test-us-east-1-pd-0.test-us-east-1-pd-peer.test-us-east-1.svc.cluster.local:2380"
  ],
  "client_urls": [
    "http://test-us-east-1-pd-0.test-us-east-1-pd-peer.test-us-east-1.svc.cluster.local:2379"
  ],
  "leader_priority": 2,
  "deploy_path": "/",
  "binary_version": "v6.1.1",
  "git_hash": "4ab9c0ef123441a0ef279bf9d2e36d1abe4a14c1"
}

zhangjinpeng87 (Contributor) commented Sep 12, 2022

According to https://github.com/pingcap/tidb/blob/master/store/helper/helper.go#L829 and the error message, I think this tidb-server has tried to contact all of the PD servers in all regions and printed only the last error it encountered. There may be network isolation between this tidb-server and all of the PD servers.
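For context, this is roughly the pattern that helper code follows (a minimal sketch, not the actual TiDB source; names are illustrative): each PD host is tried in turn, and only the error from the last failed attempt is surfaced, which can hide what happened with the other members.

package main

import (
	"fmt"
	"net/http"
	"time"
)

// requestAnyPD is a simplified stand-in for the loop around
// store/helper/helper.go#L829: try each PD host in turn, stop at the
// first success, and otherwise return only the last error seen.
func requestAnyPD(hosts []string, path string) (*http.Response, error) {
	client := &http.Client{Timeout: 5 * time.Second}
	var lastErr error
	for _, host := range hosts {
		resp, err := client.Get("http://" + host + path)
		if err == nil {
			return resp, nil
		}
		lastErr = err // earlier failures are overwritten
	}
	// Only the final host's failure reaches the caller, so the message
	// may name an unreachable follower even if other members failed in
	// a different (and more informative) way.
	return nil, fmt.Errorf("all PD hosts failed: %w", lastErr)
}

func main() {
	hosts := []string{"pd-0.example:2379", "pd-1.example:2379"} // hypothetical
	if _, err := requestAnyPD(hosts, "/pd/api/v1/regions"); err != nil {
		fmt.Println(err)
	}
}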

kolbe (Contributor, Author) commented Sep 13, 2022

I can confirm that the problem is not network isolation between this tidb-server instance and the PD servers. I was able to contact the PD servers in us-east-1 and us-west-2 using curl while logged into this tidb pod.
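For reference, a minimal Go probe in the same spirit as that curl check (the us-east-1 hostname is taken from the pd-ctl output above; the us-west-2 hostname is an assumed analogue):

package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	// Hostnames follow the naming pattern shown earlier in this thread;
	// the us-west-2 entry is an assumption, not copied from output.
	hosts := []string{
		"test-us-east-1-pd-0.test-us-east-1-pd-peer.test-us-east-1.svc.cluster.local:2379",
		"test-us-west-2-pd-0.test-us-west-2-pd-peer.test-us-west-2.svc.cluster.local:2379",
	}
	client := &http.Client{Timeout: 5 * time.Second}
	for _, h := range hosts {
		// /pd/api/v1/members is the same PD HTTP API family as the
		// /pd/api/v1/regions call that failed in the error above.
		resp, err := client.Get("http://" + h + "/pd/api/v1/members")
		if err != nil {
			fmt.Printf("%s: unreachable: %v\n", h, err)
			continue
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()
		fmt.Printf("%s: HTTP %d, %d bytes\n", h, resp.StatusCode, len(body))
	}
}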

nolouch (Member) commented Sep 13, 2022

I think this was fixed by PR #35750 (issue #35708), and v6.1.1 does not include that commit.

nolouch added the duplicate label on Sep 13, 2022
jebter closed this as completed on Aug 8, 2024
jebter added the type/stale label on Aug 8, 2024