
Retry to sync out of sync device lists #7453

Merged: babolivier merged 20 commits into develop from babolivier/device_list_retry on May 21, 2020

Conversation

babolivier
Contributor

@babolivier babolivier commented May 7, 2020

When a call to `user_device_resync` fails, we don't currently mark the remote user's device list as out of sync, nor do we retry syncing it.

#6776 introduced some code infrastructure to mark device lists as stale/out of sync.

This PR uses that code infrastructure to mark device lists as out of sync if processing an incoming device list update makes the device handler realise that the device list is out of sync, but we can't resync right now.

It also adds a looping call to retry all failed resyncs every 30s. I'm not entirely sure about that retry logic right now, so I'd like some opinions on it. It shouldn't cause too much spam in the logs, as I removed the "Failed to handle device list update for..." warning logs when catching `NotRetryingDestination`.
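To make the retry logic concrete, here's a rough sketch of the loop I have in mind, assuming Synapse-style clock/store helpers; the names and signatures (`DeviceListUpdater`, `_maybe_retry_device_resync`, `get_user_ids_requiring_device_list_resync`) are illustrative rather than the final code, and `user_device_resync` is assumed to be defined elsewhere on the class:

```python
from synapse.util.retryutils import NotRetryingDestination

class DeviceListUpdater:
    def __init__(self, hs):
        self.store = hs.get_datastore()
        self.clock = hs.get_clock()
        self._resync_retry_in_progress = False
        # Retry any outstanding resyncs every 30 seconds.
        self.clock.looping_call(self._maybe_retry_device_resync, 30 * 1000)

    async def _maybe_retry_device_resync(self):
        # Don't start a new pass if the previous one is still running.
        if self._resync_retry_in_progress:
            return
        self._resync_retry_in_progress = True
        try:
            user_ids = await self.store.get_user_ids_requiring_device_list_resync()
            for user_id in user_ids:
                try:
                    await self.user_device_resync(user_id)
                except NotRetryingDestination:
                    # We're backing off from this remote: stay quiet and
                    # let a later pass of the loop try again.
                    continue
        finally:
            self._resync_retry_in_progress = False
```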

Fixes #7418

TODO: Write some tests

Otherwise we're going to be logging `Failed to handle device list update
for @user:example.com` every 30s for every remote we're not retrying
because of backoff.
Gosh do I really not know the alphabet?
@babolivier
Contributor Author

babolivier commented May 8, 2020

Just tried this PR on my HS, and at least half of it seems to work well, in that inserting rows manually into `device_lists_remote_resync` using:

INSERT INTO device_lists_remote_resync VALUES ('@user:example.com', (EXTRACT(epoch FROM NOW()) * 1000)::BIGINT);

seems to result in the users' device lists being resynced automatically, and having their shield turned back to green in Riot.

Another point: I'll need to check that this PR doesn't clash with #6786. It shouldn't, but there may be something that needs to be done to smooth the two together and make the whole thing more maintainable.

@babolivier
Contributor Author

Alright, the other half of this (the bit that does the insertion if resync fails) also seems to be working, given I'm now seeing `Successfully resynced the device list for...` lines in my logs for users I haven't manually inserted \o/

@babolivier
Contributor Author

babolivier commented May 12, 2020

So I think this PR should be ready for review. I don't believe there's anything that needs to be done wrt #6786.

Also, I'm not sure whether it's good enough that, if a remote is unreachable, we retry it every 30s and fail silently on a `NotRetryingDestination` error, or whether the signature of `user_device_resync` should be changed so that it also returns how long to wait until the next attempt; `_maybe_retry_device_resync` would then keep track of which resync should be retried when (but that may be too much faff).
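For illustration, here's a minimal sketch of that per-resync bookkeeping (purely hypothetical, not code from this PR):

```python
import time

class ResyncSchedule:
    """Tracks the earliest time at which each user's resync may be retried."""

    def __init__(self):
        self._next_attempt_ms = {}  # user_id -> earliest retry timestamp (ms)

    def note_failure(self, user_id, retry_after_ms):
        # retry_after_ms would come from a modified user_device_resync that
        # reports how long to wait until the next attempt.
        self._next_attempt_ms[user_id] = time.time() * 1000 + retry_after_ms

    def due(self, user_ids):
        # Keep only the users whose back-off has expired.
        now_ms = time.time() * 1000
        return [u for u in user_ids if self._next_attempt_ms.get(u, 0) <= now_ms]
```

`_maybe_retry_device_resync` would then call `due()` on the list of stale users before attempting each resync.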

Another question/issue I have is that it would be neat to cache `get_user_ids_requiring_device_list_resync`, since we might be calling the function every 30s; however, the way that function is used outside of this PR makes invalidating that cache quite tricky, as we can't know all of the lists of user IDs the function has been called with.

@babolivier babolivier marked this pull request as ready for review May 12, 2020 09:59
@babolivier babolivier requested a review from a team May 12, 2020 09:59
@babolivier babolivier linked an issue May 12, 2020 that may be closed by this pull request
@babolivier
Contributor Author

babolivier commented May 13, 2020

Known shortcomings of this PR:

  • It doesn't do the right thing wrt logging contexts, c.f. the warnings below (a possible fix is sketched after them):
2020-05-13 11:23:04,595 - synapse.storage.database - 525 - WARNING - - Starting db txn 'update_remote_device_list_cache' from sentinel context
2020-05-13 11:23:04,595 - synapse.storage.database - 564 - WARNING - - Starting db connection from sentinel context: metrics will be lost
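A likely fix (an educated guess on my part, not necessarily what this PR ends up doing) is to wrap the periodic job in `run_as_background_process`, so it runs in a real logging context instead of the sentinel one:

```python
from synapse.metrics.background_process_metrics import run_as_background_process

# Give each run of the timer its own background-process logging context,
# so DB transactions started from it are no longer in the sentinel context.
self.clock.looping_call(
    run_as_background_process,
    30 * 1000,
    "retry_device_resync",
    self._maybe_retry_device_resync,
)
```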

@babolivier babolivier removed the request for review from a team May 13, 2020 13:23
@richvdh
Member

richvdh commented May 14, 2020

@babolivier please could you fix #7498 while you're here

@babolivier
Contributor Author

Sure

@babolivier
Contributor Author

So I'm not so sure about

It doesn't update device_lists_outbound_last_success (and maybe other tables) upon successful /devices/... response.

being a shortcoming of this PR anymore.

This is only needed if the remote server is completely unreachable, since Synapse will always respond 200 to device list updates, even if it failed to process them.

AFAICT, when responding to a request, we can't really be 100% sure the response will reach, and be fully processed by, the remote server (especially if that server or its connection is wobbly). In that case, I'd rather have the sending server wrongly think it has failed to send the update and resend it with the next transaction (so the remote server can simply ignore it if it has already received it) than have it wrongly think it has succeeded and never send it again.
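To make that concrete, here's a toy illustration of why resending is safe when the receiver dedupes updates; the names and schema are made up for the example, not Synapse's actual code:

```python
# The receiver remembers the highest stream_id it has processed per
# (origin server, user), so a re-delivered update is simply ignored.
processed = {}  # (origin, user_id) -> highest stream_id handled

def handle_device_update(origin, user_id, stream_id, update):
    key = (origin, user_id)
    if processed.get(key, -1) >= stream_id:
        # Duplicate or stale delivery: we've already handled it, do nothing.
        return
    processed[key] = stream_id
    apply_update(update)

def apply_update(update):
    # Placeholder for updating the local device-list cache.
    pass
```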

@babolivier babolivier requested a review from a team May 15, 2020 11:01
tests/test_federation.py: review thread (outdated, resolved)
synapse/handlers/device.py: review thread (outdated, resolved)
@babolivier babolivier requested a review from clokep May 21, 2020 15:03
Member

@clokep clokep left a comment


Looks good if CI passes!

@babolivier babolivier merged commit d1ae101 into develop May 21, 2020
@babolivier babolivier deleted the babolivier/device_list_retry branch May 21, 2020 15:41
phil-flex pushed a commit to phil-flex/synapse that referenced this pull request Jun 16, 2020
When a call to `user_device_resync` fails, we don't currently mark the remote user's device list as out of sync, nor do we retry syncing it.

matrix-org#6776 introduced some code infrastructure to mark device lists as stale/out of sync.

This commit uses that code infrastructure to mark device lists as out of sync if processing an incoming device list update makes the device handler realise that the device list is out of sync, but we can't resync right now.

It also adds a looping call to retry all failed resyncs every 30s. This shouldn't cause too much spam in the logs, as this commit also removes the "Failed to handle device list update for..." warning logs when catching `NotRetryingDestination`.

Fixes matrix-org#7418
Successfully merging this pull request may close these issues.

Cross-signing signatures not being always federated correctly