[Thread] Changing/reverting operational dataset leaves stale SRP services #21101

Damian-Nordic · 2022-07-22T08:46:49Z

Problem

Consider the following scenario:

Commissioning of a Thread device begins.
In the network commissioning phase, the device joins a Thread network and registers its DNS services using SRP.
The commissioner fails to discover the operational service, or communicate with the device due to a network failure.
Failsafe on the device expires and the device removes its Thread operational dataset, hence it detaches from the Thread network.
The registered services will still be discoverable for the period of lease time which is configured to 2 hours by default.

It was raised by @bluebin14 in the following slack discussion: https://csamembers.slack.com/archives/G014G30SVV0/p1657552482903829?thread_ts=1657202586.638369&cid=G014G30SVV0

Proposed Solution

Actively unregister services when detaching from the network
- It complicates design of the failsafe expiry. Currently, the API is synchronous and we would need to introduce kind of expiration in progress state. What should we do if the next commissioning attempt starts when the network configuration has not been fully reverted yet?
- If the commissioning could not be completed due to a Thread network failure, it's likely that unregistering will fail, too.
Use very short lease time for SRP services registered during the commissioning. After the commissioning re-register the services using long lease time
- That could work, but SRP servers are free to reject services with too short lease time. No specific minimum is defined, but e.g. OTBR currently only accepts lease time >= 30 minutes.
Maybe it's not a big issue after all
- If the commissioner assigns a unique node ID to the device in each commissioning attempt, stale operational services appearing in rare failure situations, for the period of approximately 2 hours, shouldn't impact the user experience.
- The commissionable node service is potentially a bigger problem, but when the commissioning starts, CM key of the service is switched to 0, so the commissioner shouldn't use the stale service to initiate the commissioning. But we could also delay registering the commissionable node service if there's scenario in which it could improve the user experience.

The text was updated successfully, but these errors were encountered:

Damian-Nordic · 2022-07-22T08:49:03Z

@bluebin14 @tcarmelveilleux @jwhui @abtink @LuDuda @jmartinez-silabs I'm not sure what's the best way forward, so your comments are welcome :)

bluebin14 · 2022-07-22T12:27:19Z

Once the Thread network is successfully joined, for consistency failsafe expiry should not cause deleting Thread network credentials, just like removing the last peer doesn't. Then you don't need an "expiry in progress state" or other workarounds that may introduce new bugs. There is a removeNetwork command, the controller can use it if needed.

Damian-Nordic · 2022-07-22T12:35:30Z

That would be against the spec which mandates that the failsafe expiry should have the following effect (among others):

Reset the configuration of all Network Commissioning Networks attribute to their state prior to the Fail-Safe being armed.

bluebin14 · 2022-07-22T14:15:37Z

Leaving a dangling SRP advertisement is much worse from usability perspective than adhering to a spec inconsistency as compared to removeFabric behavior for last peer. The spec requirement even implies resetting network configuration when there are other accessing peers, so it should be removed from the spec.

Damian-Nordic · 2022-07-22T14:19:40Z

Could elaborate more on how exactly it impacts the usability? Or what are your assumptions? Do you assume that the commissioner keeps using the same node ID on each commissioning attempt?

bluebin14 · 2022-07-22T14:33:10Z

It impacts SRP server operation when the situation repeats, eventually being refused (as seen in uart log). Physically resetting a border router such as HomePod to clear the entries is a major disruptive operation. Better to have a software solution that tears down things cleanly.

bzbarsky-apple · 2022-07-22T14:56:39Z

Currently, the API is synchronous and we would need to introduce kind of expiration in progress state

For what it's worth, failsafe cleanup is in fact async from the failsafe's point of view. It has a "not armed, but not fully disarmed; disarming in progress" state. What's missing is a way to keep it in that state for a bit while async disarm work happens. We could add such a way....

jmartinez-silabs · 2022-07-22T18:22:19Z

In the case, there is a failsafe and the device thread attachment was complete, I feel like we should create a mechanism to unregister the services. Once the removeSrpService is sent then you can delete the thread dataset.

How would the new commissioning attempts start before the thread dataset is cleared? I believe you wouldn't be advertising already.

Unsure about the complexity it adds to the failsafe expiry. I am not familiar with that part.

stale · 2023-01-20T07:31:45Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

stale · 2023-07-22T04:28:22Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

bzbarsky-apple mentioned this issue Aug 9, 2022

[SQA] [Thread]TC-CNET-4.12 Getting Run Command Failure after connect-network with THREAD_2ND_OPERATIONALDATASET #20803

Open

stale bot added the stale Stale issue or PR label Jan 20, 2023

Damian-Nordic removed the stale Stale issue or PR label Jan 20, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Thread] Changing/reverting operational dataset leaves stale SRP services #21101

[Thread] Changing/reverting operational dataset leaves stale SRP services #21101

Damian-Nordic commented Jul 22, 2022

Damian-Nordic commented Jul 22, 2022

bluebin14 commented Jul 22, 2022

Damian-Nordic commented Jul 22, 2022

bluebin14 commented Jul 22, 2022

Damian-Nordic commented Jul 22, 2022

bluebin14 commented Jul 22, 2022

bzbarsky-apple commented Jul 22, 2022

jmartinez-silabs commented Jul 22, 2022

stale bot commented Jan 20, 2023

stale bot commented Jul 22, 2023

[Thread] Changing/reverting operational dataset leaves stale SRP services #21101

[Thread] Changing/reverting operational dataset leaves stale SRP services #21101

Comments

Damian-Nordic commented Jul 22, 2022

Problem

Proposed Solution

Damian-Nordic commented Jul 22, 2022

bluebin14 commented Jul 22, 2022

Damian-Nordic commented Jul 22, 2022

bluebin14 commented Jul 22, 2022

Damian-Nordic commented Jul 22, 2022

bluebin14 commented Jul 22, 2022

bzbarsky-apple commented Jul 22, 2022

jmartinez-silabs commented Jul 22, 2022

stale bot commented Jan 20, 2023

stale bot commented Jul 22, 2023