Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Thread] Changing/reverting operational dataset leaves stale SRP services #21101

Open
Damian-Nordic opened this issue Jul 22, 2022 · 10 comments
Open

Comments

@Damian-Nordic
Copy link
Contributor

Problem

Consider the following scenario:

  1. Commissioning of a Thread device begins.
  2. In the network commissioning phase, the device joins a Thread network and registers its DNS services using SRP.
  3. The commissioner fails to discover the operational service, or communicate with the device due to a network failure.
  4. Failsafe on the device expires and the device removes its Thread operational dataset, hence it detaches from the Thread network.
  5. The registered services will still be discoverable for the period of lease time which is configured to 2 hours by default.

It was raised by @bluebin14 in the following slack discussion: https://csamembers.slack.com/archives/G014G30SVV0/p1657552482903829?thread_ts=1657202586.638369&cid=G014G30SVV0

Proposed Solution

  • Actively unregister services when detaching from the network
    • It complicates design of the failsafe expiry. Currently, the API is synchronous and we would need to introduce kind of expiration in progress state. What should we do if the next commissioning attempt starts when the network configuration has not been fully reverted yet?
    • If the commissioning could not be completed due to a Thread network failure, it's likely that unregistering will fail, too.
  • Use very short lease time for SRP services registered during the commissioning. After the commissioning re-register the services using long lease time
    • That could work, but SRP servers are free to reject services with too short lease time. No specific minimum is defined, but e.g. OTBR currently only accepts lease time >= 30 minutes.
  • Maybe it's not a big issue after all
    • If the commissioner assigns a unique node ID to the device in each commissioning attempt, stale operational services appearing in rare failure situations, for the period of approximately 2 hours, shouldn't impact the user experience.
    • The commissionable node service is potentially a bigger problem, but when the commissioning starts, CM key of the service is switched to 0, so the commissioner shouldn't use the stale service to initiate the commissioning. But we could also delay registering the commissionable node service if there's scenario in which it could improve the user experience.
@Damian-Nordic
Copy link
Contributor Author

@bluebin14 @tcarmelveilleux @jwhui @abtink @LuDuda @jmartinez-silabs I'm not sure what's the best way forward, so your comments are welcome :)

@bluebin14
Copy link
Contributor

Once the Thread network is successfully joined, for consistency failsafe expiry should not cause deleting Thread network credentials, just like removing the last peer doesn't. Then you don't need an "expiry in progress state" or other workarounds that may introduce new bugs. There is a removeNetwork command, the controller can use it if needed.

@Damian-Nordic
Copy link
Contributor Author

That would be against the spec which mandates that the failsafe expiry should have the following effect (among others):

Reset the configuration of all Network Commissioning Networks attribute to their state prior to the Fail-Safe being armed.

@bluebin14
Copy link
Contributor

Leaving a dangling SRP advertisement is much worse from usability perspective than adhering to a spec inconsistency as compared to removeFabric behavior for last peer. The spec requirement even implies resetting network configuration when there are other accessing peers, so it should be removed from the spec.

@Damian-Nordic
Copy link
Contributor Author

Could elaborate more on how exactly it impacts the usability? Or what are your assumptions? Do you assume that the commissioner keeps using the same node ID on each commissioning attempt?

@bluebin14
Copy link
Contributor

It impacts SRP server operation when the situation repeats, eventually being refused (as seen in uart log). Physically resetting a border router such as HomePod to clear the entries is a major disruptive operation. Better to have a software solution that tears down things cleanly.

@bzbarsky-apple
Copy link
Contributor

Currently, the API is synchronous and we would need to introduce kind of expiration in progress state

For what it's worth, failsafe cleanup is in fact async from the failsafe's point of view. It has a "not armed, but not fully disarmed; disarming in progress" state. What's missing is a way to keep it in that state for a bit while async disarm work happens. We could add such a way....

@jmartinez-silabs
Copy link
Member

In the case, there is a failsafe and the device thread attachment was complete, I feel like we should create a mechanism to unregister the services. Once the removeSrpService is sent then you can delete the thread dataset.

How would the new commissioning attempts start before the thread dataset is cleared? I believe you wouldn't be advertising already.

Unsure about the complexity it adds to the failsafe expiry. I am not familiar with that part.

@stale
Copy link

stale bot commented Jan 20, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the stale Stale issue or PR label Jan 20, 2023
@Damian-Nordic Damian-Nordic removed the stale Stale issue or PR label Jan 20, 2023
@stale
Copy link

stale bot commented Jul 22, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants