
Retry to sync out of sync device lists #7453

Merged: babolivier merged 20 commits into develop from babolivier/device_list_retry on May 21, 2020

Conversation

babolivier
Contributor

@babolivier babolivier commented May 7, 2020

When a call to `user_device_resync` fails, we don't currently mark the remote user's device list as out of sync, nor do we retry syncing it.

#6776 introduced some code infrastructure to mark device lists as stale/out of sync.

This PR uses that code infrastructure to mark device lists as out of sync if processing an incoming device list update makes the device handler realise that the device list is out of sync, but we can't resync right now.

It also adds a looping call to retry all failed resyncs every 30s. I'm not entirely sure about that retry logic right now, so I'd like some opinions on it. It shouldn't cause too much spam in the logs, as I removed the "Failed to handle device list update for..." warning logs when catching `NotRetryingDestination`.
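To make the retry logic concrete, here's a rough sketch of the loop I have in mind, assuming Synapse-style clock/store helpers; the names and signatures (`DeviceListUpdater`, `_maybe_retry_device_resync`, `get_user_ids_requiring_device_list_resync`) are illustrative rather than the final code, and `user_device_resync` is assumed to be defined elsewhere on the class:

```python
from synapse.util.retryutils import NotRetryingDestination

class DeviceListUpdater:
    def __init__(self, hs):
        self.store = hs.get_datastore()
        self.clock = hs.get_clock()
        self._resync_retry_in_progress = False
        # Retry any outstanding resyncs every 30 seconds.
        self.clock.looping_call(self._maybe_retry_device_resync, 30 * 1000)

    async def _maybe_retry_device_resync(self):
        # Don't start a new pass if the previous one is still running.
        if self._resync_retry_in_progress:
            return
        self._resync_retry_in_progress = True
        try:
            user_ids = await self.store.get_user_ids_requiring_device_list_resync()
            for user_id in user_ids:
                try:
                    await self.user_device_resync(user_id)
                except NotRetryingDestination:
                    # We're backing off from this remote: stay quiet and
                    # let a later pass of the loop try again.
                    continue
        finally:
            self._resync_retry_in_progress = False
```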

Fixes #7418

TODO: Write some tests

Otherwise we're going to be logging `Failed to handle device list update
for @user:example.com` every 30s for every remote we're not retrying
because of backoff.
Gosh do I really not know the alphabet?
@babolivier
Contributor Author

babolivier commented May 8, 2020

Just tried this PR on my HS, and at least half of it seems to work well, in that inserting rows manually into `device_lists_remote_resync` using:

INSERT INTO device_lists_remote_resync VALUES ('@user:example.com', (EXTRACT(epoch FROM NOW()) * 1000)::BIGINT);

seems to result in the users' device lists being resynced automatically, and having their shield turned back to green in Riot.

Another point: I'll need to check that this PR doesn't clash with #6786. It shouldn't, but there may be something that needs to be done to smooth the two together and make the whole thing more maintainable.

@babolivier
Contributor Author

Alright, the other half of this (the bit that does the insertion if resync fails) also seems to be working, given I'm now seeing `Successfully resynced the device list for...` lines in my logs for users I haven't manually inserted \o/

@babolivier
Contributor Author

babolivier commented May 12, 2020

So I think this PR should be ready for review. I don't believe there's anything that needs to be done wrt #6786.

Also, I'm not sure whether it's good enough that, if a remote is unreachable, we retry it every 30s and fail silently on a `NotRetryingDestination` error, or whether the signature of `user_device_resync` should be changed so that it also returns how long to wait until the next attempt; `_maybe_retry_device_resync` would then keep track of which resync should be retried when (but that may be too much faff).
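For illustration, here's a minimal sketch of that per-resync bookkeeping (purely hypothetical, not code from this PR):

```python
import time

class ResyncSchedule:
    """Tracks the earliest time at which each user's resync may be retried."""

    def __init__(self):
        self._next_attempt_ms = {}  # user_id -> earliest retry timestamp (ms)

    def note_failure(self, user_id, retry_after_ms):
        # retry_after_ms would come from a modified user_device_resync that
        # reports how long to wait until the next attempt.
        self._next_attempt_ms[user_id] = time.time() * 1000 + retry_after_ms

    def due(self, user_ids):
        # Keep only the users whose back-off has expired.
        now_ms = time.time() * 1000
        return [u for u in user_ids if self._next_attempt_ms.get(u, 0) <= now_ms]
```

`_maybe_retry_device_resync` would then call `due()` on the list of stale users before attempting each resync.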

Another question/issue I have is that it would be neat to cache `get_user_ids_requiring_device_list_resync`, since we might be calling the function every 30s; however, the way that function is used outside of this PR makes invalidating that cache quite tricky, as we can't know all of the lists of user IDs the function has been called with.

@babolivier babolivier marked this pull request as ready for review May 12, 2020 09:59
@babolivier babolivier requested a review from a team May 12, 2020 09:59
@babolivier babolivier linked an issue May 12, 2020 that may be closed by this pull request
@babolivier
Contributor Author

babolivier commented May 13, 2020

Known shortcomings of this PR:

  • It doesn't do the right thing wrt logging contexts, c.f. the warnings below (a possible fix is sketched after them):
2020-05-13 11:23:04,595 - synapse.storage.database - 525 - WARNING - - Starting db txn 'update_remote_device_list_cache' from sentinel context
2020-05-13 11:23:04,595 - synapse.storage.database - 564 - WARNING - - Starting db connection from sentinel context: metrics will be lost
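A likely fix (an educated guess on my part, not necessarily what this PR ends up doing) is to wrap the periodic job in `run_as_background_process`, so it runs in a real logging context instead of the sentinel one:

```python
from synapse.metrics.background_process_metrics import run_as_background_process

# Give each run of the timer its own background-process logging context,
# so DB transactions started from it are no longer in the sentinel context.
self.clock.looping_call(
    run_as_background_process,
    30 * 1000,
    "retry_device_resync",
    self._maybe_retry_device_resync,
)
```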

@babolivier babolivier removed the request for review from a team May 13, 2020 13:23
@richvdh
Member

richvdh commented May 14, 2020

@babolivier please could you fix #7498 while you're here

@babolivier
Contributor Author

Sure

@babolivier
Contributor Author

So I'm not so sure about

It doesn't update device_lists_outbound_last_success (and maybe other tables) upon successful /devices/... response.

being a shortcoming of this PR anymore.

This is only needed if the remote server is completely unreachable, since Synapse will always respond 200 to device list updates, even if it failed to process them.

AFAICT, when responding to a request, we can't really be 100% sure the response will reach, and be fully processed by, the remote server (especially if that server or its connection is wobbly). In that case, I'd rather have the sending server wrongly think it has failed to send the update and resend it with the next transaction (so the remote server can simply ignore it if it has already received it) than have it wrongly think it has succeeded and never send it again.
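To make that concrete, here's a toy illustration of why resending is safe when the receiver dedupes updates; the names and schema are made up for the example, not Synapse's actual code:

```python
# The receiver remembers the highest stream_id it has processed per
# (origin server, user), so a re-delivered update is simply ignored.
processed = {}  # (origin, user_id) -> highest stream_id handled

def handle_device_update(origin, user_id, stream_id, update):
    key = (origin, user_id)
    if processed.get(key, -1) >= stream_id:
        # Duplicate or stale delivery: we've already handled it, do nothing.
        return
    processed[key] = stream_id
    apply_update(update)

def apply_update(update):
    # Placeholder for updating the local device-list cache.
    pass
```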

@babolivier babolivier requested a review from a team May 15, 2020 11:01
tests/test_federation.py: review thread (outdated, resolved)
synapse/handlers/device.py: review thread (outdated, resolved)
@babolivier babolivier requested a review from clokep May 21, 2020 15:03
Member

@clokep clokep left a comment


Looks good if CI passes!

@babolivier babolivier merged commit d1ae101 into develop May 21, 2020
@babolivier babolivier deleted the babolivier/device_list_retry branch May 21, 2020 15:41
phil-flex pushed a commit to phil-flex/synapse that referenced this pull request Jun 16, 2020
When a call to `user_device_resync` fails, we don't currently mark the remote user's device list as out of sync, nor do we retry syncing it.

matrix-org#6776 introduced some code infrastructure to mark device lists as stale/out of sync.

This commit uses that code infrastructure to mark device lists as out of sync if processing an incoming device list update makes the device handler realise that the device list is out of sync, but we can't resync right now.

It also adds a looping call to retry all failed resyncs every 30s. This shouldn't cause too much spam in the logs, as this commit also removes the "Failed to handle device list update for..." warning logs when catching `NotRetryingDestination`.

Fixes matrix-org#7418
Successfully merging this pull request may close these issues.

Cross-signing signatures not being always federated correctly