dm-worker keeps retrying to execute ddl when encounter "invalid connection" error #4689
I ran into this problem a few days ago. I found it is
Can I make a PR to solve the problem?
/cc @lance6716
Welcome! Before writing code, can you give a brief introduction to your fix? We can discuss its effects in advance.
1. SetReadTimeout(maxDDLConnectionTimeout)
2. Line 97 in c345857
3. https://github.com/go-sql-driver/mysql/blob/217d05049e5a88d529b9a2d5fe5675120831efab/dsn.go#L51
4. Because DDL reorganization such as add index or modify column may take a long time, is it proper to set ReadTimeout to unlimited? Also, MaxDDLConnectionTimeoutMinute is a connection timeout argument, and for a connection timeout the go-sql-driver package expects the field "Timeout time.Duration // Dial timeout" to be set. Is DM setting the wrong argument?
5. Line 75 in c345857: there is a handler that ignores the invalid connection error when adding an index. Is it still needed if ReadTimeout is set to unlimited?
So, maybe there are these steps:
/cc @lance6716
@jiyfhust "1. set ReadTimeout unlimited" by itself cannot tell us that a downstream is dead. And I'm not sure how to do "2. add set connection timeout".

There are some Linux kernel features for TCP keepalive. If such a feature is available, enabling it turns the problem into this: we can know whether the TCP connection is dead, but will the downstream MySQL/TiDB/RDS respond to the query in the future while the TCP connection is still alive? For example, will MySQL/TiDB/RDS silently drop the query after receiving it? (I guess not, but I haven't checked this against the MySQL Client/Server Protocol.) Will MySQL/TiDB/RDS drop some query when reading from the socket and treat it as not received for some reason? Will MySQL/TiDB/RDS fail to send COM_QUERY_Response and not retry?

If you can find some proof for the above questions and correctly set up TCP keepalive, I think it's OK to totally remove the ReadTimeout from the application layer.

Another solution: after the ReadTimeout, we can use ADMIN SHOW DDL to check whether the downstream has really received the query. This is safer IMO but may need more code work.

Feel free to discuss!
By "2. add set connection timeout" I mean the timeout used when DM connects to the downstream, not something like TCP keepalive. There seems to be no good method to check whether the downstream is alive through the MySQL protocol on a connection that is executing a SQL query.

Does the problem "no COM_QUERY_Response for a long time" come from some MySQL proxy or LVS?

If we use ADMIN SHOW DDL or query information_schema.ddl_jobs, by what method do we identify the DDL that DM issued? Maybe by the DDL job_id, by the query SQL, or some other way?
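One possible answer to the matching question, sketched under assumptions: compare the SQL text that DM sent against the query text of each downstream DDL job after normalizing whitespace and case. The `ddlJob` struct and `findDDLJob` function are hypothetical; how the job list and query text are fetched from TiDB is left out, and this is not DM's actual logic.

```go
package main

import (
	"fmt"
	"strings"
)

// ddlJob is a hypothetical, simplified view of one downstream DDL job.
type ddlJob struct {
	ID    int64
	Query string
	State string
}

// normalizeSQL lowercases the statement, collapses whitespace, and drops a
// trailing semicolon. Lowercasing also changes string literals, so this is
// only a rough comparison key.
func normalizeSQL(sql string) string {
	s := strings.Join(strings.Fields(strings.ToLower(sql)), " ")
	return strings.TrimSpace(strings.TrimSuffix(s, ";"))
}

// findDDLJob returns the job with the highest ID whose query text matches
// sql after normalization, and false if no job matches.
func findDDLJob(jobs []ddlJob, sql string) (ddlJob, bool) {
	want := normalizeSQL(sql)
	var best ddlJob
	found := false
	for _, j := range jobs {
		if normalizeSQL(j.Query) == want && (!found || j.ID > best.ID) {
			best, found = j, true
		}
	}
	return best, found
}

func main() {
	jobs := []ddlJob{
		{ID: 9, Query: "CREATE TABLE t (a INT)", State: "synced"},
		{ID: 10, Query: "ALTER TABLE t ADD INDEX idx_a (a)", State: "running"},
	}
	if j, ok := findDDLJob(jobs, "alter  table t add index idx_a (a);"); ok {
		fmt.Printf("DDL already received downstream: job %d is %s\n", j.ID, j.State)
	}
}
```

Matching on text alone cannot distinguish two identical statements issued at different times, so a real implementation would likely also need the job start time or the job_id.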
I think it would take me a long time to fix it by myself. Maybe it is a better choice for someone who is familiar with DM to fix it. If DM syncs a DDL like "modify column", it may trigger a serious TiDB bug, whose fix was merged into 5.3.0 just three days ago.
In fact I haven't experienced "no COM_QUERY_Response for a long time" myself. I guess it can be caused by any component in the network link, for example a router that is down. DM can know the DDL in invalidConnF. Through
Don't worry, any kind of contribution is good!
What did you do?
What did you expect to see?
No error is reported.
What did you see instead?
DM encountered an "invalid connection" error and keeps retrying to execute the DDL every 5 minutes.
Versions of the cluster
DM version (run dmctl -V or dm-worker -V or dm-master -V): v2.0.6
Current status of DM cluster (execute query-status <task-name> in dmctl): (paste current status of DM cluster here)