
chip-tool is still racy on Darwin #7557

Open
mrjerryjohns opened this issue Jun 11, 2021 · 11 comments

@mrjerryjohns (Contributor)

Problem

#7478 attempts to fix data races detected by tsan in chip-tool; however, the bulk of the work there was validated on POSIX-based platforms.

On Darwin, which uses dispatch queues, the logic changes in chip-tool don't make any meaningful dent in the races that still exist there.

Proposed Solution

Not sure, need to figure something out!

@mrjerryjohns (Contributor Author) commented Jun 13, 2021

Here is my analysis of the state of threading and dispatch queues on Darwin, especially in light of the changes that were made recently in #7405.

On the whole, the Darwin implementation deviates from the API contracts stipulated in PlatformManager.h, exposing a number of race conditions in applications like chip-tool that have a separate application/main dispatch queue from the dispatch queue that runs parts of CHIP.

Namely (a lifecycle sketch follows this list):

  1. _InitChipStack() is expected to initialize the CHIP stack without actually starting any threading contexts or tasks to service CHIP events. This is how most applications are structured: they call _InitChipStack() to initialize the stack, do a bunch more stack interactions, and then either start the CHIP task OR block on the main runloop by calling RunEventLoop(). However, in the Darwin implementation, the dispatch queue is created and made runnable immediately, resulting in races between code executing on the dispatch queue and APIs being invoked into the stack by the main thread.
  2. _StartEventLoopTask() should actually spin up the task/dispatch queue. On Darwin, this is a no-op.
  3. _RunEventLoop() is called by applications that do not want to start up a 'separate threading context', but rather just block on running the event loop until it's stopped. This is not implemented on Darwin. This is the model that servers use, and the only reason our server example apps are not failing is that the shell has a blocking loop that keeps the app from terminating early.
  4. LockChipStack() and UnlockChipStack() are not implemented. In the 'external synchronization' threading model (which #7478 adopts, using locks to fix races in one example application), this is problematic.
  5. PostEvent() doesn't grab the CHIP lock while dispatching, resulting in races with the main thread invoking the SDK.
  6. Shutdown() doesn't actually stop the dispatch queue synchronously, causing segfaults like the one here.
  7. The dispatch code in SystemLayer, SystemTimer, {Raw|TCP|UDP}Endpoint doesn't grab locks to serialize access to the SDK critical section.
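
For reference, here is a minimal sketch of the application lifecycle this contract implies, using the public PlatformManager API; numbers in comments refer to the items above, and error handling is elided (this is an illustration, not the actual chip-tool code):

```cpp
#include <platform/CHIPDeviceLayer.h>

using namespace chip::DeviceLayer;

int main()
{
    // (1) Initialize the stack; per the contract, nothing may start
    // servicing CHIP events yet.
    PlatformMgr().InitChipStack();

    // ... further stack setup, safe because no event loop is running ...

    // (2) Client model: spin up a separate task/queue to service events,
    PlatformMgr().StartEventLoopTask();

    // (4) ... and serialize any subsequent stack access from this thread:
    PlatformMgr().LockChipStack();
    // ... call into the SDK ...
    PlatformMgr().UnlockChipStack();

    // (3) Server model (instead of the two calls above): block right here
    // until the event loop is stopped:
    // PlatformMgr().RunEventLoop();

    // (6) Tear down; this must stop event servicing synchronously.
    PlatformMgr().Shutdown();
    return 0;
}
```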

A lot of the API contracts are not well documented in PlatformManager.h, which I think causes problems like these. I'll put up a PR to provide clarifications there so that we're all on the same page.

@mrjerryjohns (Contributor Author) commented Jun 13, 2021

Most of the above was confirmed with tsan, which reported a number of race conditions in those areas.

I've since made a number of fixes to the Darwin platform implementation that keep the spirit of what's being achieved there, while making it compliant with the updated API contracts I've defined in #7478.

I was able to repro the segfault in #7574 quite reliably with the test runner, and with the above fixes I was able to run up to 1000 iterations crash-free, as well as tsan-error-free. While this doesn't mean it's completely race-free, it's still a positive leg up!

Will put it up soon...

@mrjerryjohns (Contributor Author)

FYI @msandstedt

@vivien-apple (Contributor)

> Here is my analysis of the state of threading and dispatch queues on Darwin, especially in light of the changes that were made recently in #7405.
>
> On the whole, the Darwin implementation deviates from the API contracts stipulated in PlatformManager.h, exposing a number of race conditions in applications like chip-tool that have a separate application/main dispatch queue from the dispatch queue that runs parts of CHIP.
>
> Namely:
>
> 1. _InitChipStack() is expected to initialize the CHIP stack without actually starting any threading contexts or tasks to service CHIP events. This is how most applications are structured: they call _InitChipStack() to initialize the stack, do a bunch more stack interactions, and then either start the CHIP task OR block on the main runloop by calling RunEventLoop(). However, in the Darwin implementation, the dispatch queue is created and made runnable immediately, resulting in races between code executing on the dispatch queue and APIs being invoked into the stack by the main thread.
> 2. _StartEventLoopTask() should actually spin up the task/dispatch queue. On Darwin, this is a no-op.

If this is the API contract (that was pretty unclear ;) ), it can be implemented by calling dispatch_resume on the dispatch sources that were created previously. By default, dispatch sources created with dispatch_source_create are in a suspended state. The current code calls dispatch_resume right after they have been created; deferring that call would effectively decouple those parts of the code.
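
As a standalone illustration of that libdispatch behaviour (not CHIP code; assumes clang's blocks extension, as used elsewhere in this thread):

```cpp
#include <dispatch/dispatch.h>
#include <cstdio>

int main()
{
    dispatch_queue_t queue = dispatch_queue_create("com.example.work", DISPATCH_QUEUE_SERIAL);
    dispatch_source_t timer = dispatch_source_create(DISPATCH_SOURCE_TYPE_TIMER, 0, 0, queue);

    dispatch_source_set_timer(timer, DISPATCH_TIME_NOW, NSEC_PER_SEC, 0);
    dispatch_source_set_event_handler(timer, ^{ std::printf("tick\n"); });

    // Sources are created suspended: nothing fires until this call.
    // Deferring it to _StartEventLoopTask() is exactly the decoupling
    // point described above.
    dispatch_resume(timer);

    dispatch_main(); // park the main thread; the timer fires on `queue`
}
```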

> 3. _RunEventLoop() is called by applications that do not want to start up a 'separate threading context', but rather just block on running the event loop until it's stopped. This is not implemented on Darwin. This is the model that servers use, and the only reason our server example apps are not failing is that the shell has a blocking loop that keeps the app from terminating early.

At least it has let us find this issue ;)

That said, we could call something like dispatch_main(), or run an NSRunLoop, inside _RunEventLoop.
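
A sketch of what that could look like, with mWorkQueue as a hypothetical member holding the serial queue created in _InitChipStack() and kept suspended until now:

```cpp
// Illustrative only; not the actual Darwin implementation.
void PlatformManagerImpl::_RunEventLoop()
{
    dispatch_resume(mWorkQueue); // make the work queue runnable
    dispatch_main();             // block this thread; GCD worker threads
                                 // now drive the stack via mWorkQueue
}
```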

> 4. LockChipStack() and UnlockChipStack() are not implemented. In the 'external synchronization' threading model (which #7478 adopts, using locks to fix races in one example application), this is problematic.
> 5. PostEvent() doesn't grab the CHIP lock while dispatching, resulting in races with the main thread invoking the SDK.
> 7. The dispatch code in SystemLayer, SystemTimer, {Raw|TCP|UDP}Endpoint doesn't grab locks to serialize access to the SDK critical section.

One of the goals of dispatch queues is to avoid locking by serializing access to the stack instead of accessing it concurrently. If we do need locking with dispatch queues, something is wrong somewhere.

One thing that is effectively missing in chip-tool, and that has been done in https://github.com/project-chip/connectedhomeip/blob/master/src/darwin/Framework/CHIP/CHIPDeviceController.mm, is serializing access to the CHIP stack for the Device Commissioner API.

This is the part where I would expect a nicer API that does not force us to wrap calls into the CHIP stack in dispatch_sync or LockChipStack()/UnlockChipStack().
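
The serialization pattern in question looks roughly like this (names are illustrative, not the actual CHIPDeviceController.mm code):

```cpp
#include <dispatch/dispatch.h>

// All CHIP stack access funnels through one serial queue, so calls are
// serialized with the event loop instead of being guarded by a mutex.
static dispatch_queue_t chipWorkQueue =
    dispatch_queue_create("com.example.chip.work", DISPATCH_QUEUE_SERIAL);

void SendCommandSerialized()
{
    dispatch_sync(chipWorkQueue, ^{
        // Safe: this block is the only code touching the stack right now.
        // e.g. commissioner.SendCommand(...); // hypothetical SDK call
    });
}
```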

> 6. Shutdown() doesn't actually stop the dispatch queue synchronously, causing segfaults like the one here.

There are 2 issues that I can think of with the way Shutdown is called at the moment.

  1. Shutdown is called on the main thread because a message response has led chip-tool to call cvWaitingForResponse.notify_all, and the stack is cleaned up while there may still be some work left to do afterwards, such as sending an ack on the main CHIP dispatch queue.
  2. Because Shutdown is currently stopping the dispatch queues, something may happen in the meantime (e.g. a new message on the sockets) and be dispatched asynchronously onto the main work queue, so it will effectively be processed after Shutdown, and that's not good.

The root of the issue at the moment is that Shutdown is both stopping the dispatch queues and cleaning up some state.

Following your API, I would expect a call to StopEventLoopTask to dispatch_suspend the various dispatch queues; then dispatch_async(chipWorkQueue, ^{ mCommissioner.Shutdown(); }); should be safe.
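
In code, that ordering might look like the following (a sketch assuming the sources and queue are members of the Darwin platform implementation; all names are illustrative):

```cpp
void PlatformManagerImpl::_StopEventLoopTask()
{
    // Stop new event delivery first: a suspended source enqueues no
    // further blocks onto the work queue.
    dispatch_suspend(mTimerSource);
    dispatch_suspend(mSocketSource);
}

void ShutdownStack()
{
    PlatformMgr().StopEventLoopTask();
    // Cleanup runs as the last block on the work queue, so work already
    // in flight cannot observe a half-torn-down stack afterwards.
    dispatch_async(chipWorkQueue, ^{ mCommissioner.Shutdown(); });
}
```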

> A lot of the API contracts are not well documented in PlatformManager.h, which I think causes problems like these. I'll put up a PR to provide clarifications there so that we're all on the same page.

Pretty sure we can honour the API contract without resorting to LockChipStack and UnlockChipStack on Darwin.

@tcarmelveilleux (Contributor) commented Jun 14, 2021

External synchronization doesn't require locks. External synchronization can be achieved by serializing all stack API calls and callbacks to a single context, thereby ensuring no locks are needed. LockChipStack/UnlockChipStack are helpers which can be no-ops if you can guarantee that your code never calls CHIP stack code from multiple thread contexts.

However, the question is: is there any global state being mutated across dispatch contexts?

If the answer is yes, then the dispatch queues are being used in a racy way. The usual way for this method to work is to ensure that each dispatch queue only touches context that arrives with the work enqueued on it, and messaging between dispatch queues must be explicitly designed if communication or state is to be exchanged between dispatch-queue contexts. If global state is mutated by work within several dispatch queues, then the implementation detail of "where is this dispatch queue running" collapses into "each dispatch queue is effectively a thread". In that case, dispatch queues don't magically make the code safe, and the shared state has to be protected.
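
A small sketch of that design, with all names hypothetical: each queue owns its state exclusively, and cross-queue communication happens by enqueuing messages rather than by mutating shared globals:

```cpp
#include <dispatch/dispatch.h>

static dispatch_queue_t appQueue  = dispatch_queue_create("app",  DISPATCH_QUEUE_SERIAL);
static dispatch_queue_t chipQueue = dispatch_queue_create("chip", DISPATCH_QUEUE_SERIAL);

// Owned exclusively by chipQueue: only blocks running on chipQueue touch
// it, so no lock is needed even though multiple GCD threads are involved.
static int sPendingRequests = 0;

void AppRequestsSend()
{
    dispatch_async(chipQueue, ^{
        sPendingRequests++; // safe: serialized by chipQueue
        // ... call into the stack ...
        dispatch_async(appQueue, ^{
            // Deliver the result back to the app context as a message,
            // instead of mutating app-side state from here.
        });
    });
}
```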

@woody-apple (Contributor)

The way things are used here for v1 is safe; moving this out.


@andy31415 (Contributor)

I believe the virtual device build failure is probably a separate issue, not related to chip-tool. It looks like it tries to access LevelControl::minLevel.
