
chip-tool is still racy on Darwin #7557

Open
mrjerryjohns opened this issue Jun 11, 2021 · 11 comments

@mrjerryjohns (Contributor)

Problem

#7478 attempts to fix data races detected by tsan in chip-tool; however, the bulk of the work there was validated on POSIX-based platforms.

On Darwin, which uses dispatch queues, the logic changes in chip-tool don't make any meaningful dent in the races that still exist there.

Proposed Solution

Not sure, need to figure something out!

@mrjerryjohns (Contributor Author) commented Jun 13, 2021

Here is my analysis of the state of threading and dispatch queues on Darwin, especially in light of the changes that were made recently in #7405.

On the whole, the Darwin implementation deviates from the API contracts stipulated in PlatformManager.h, exposing a number of race conditions in applications like chip-tool that have a separate application/main dispatch queue from the dispatch queue that runs parts of CHIP.

Namely (a lifecycle sketch follows this list):

  1. _InitChipStack() is expected to initialize the CHIP stack without actually starting any threading contexts or tasks to service CHIP events. This is how most applications are structured: they call _InitChipStack() to initialize the stack, do a bunch more stack interactions, and then either start the CHIP task OR block on the main runloop by calling RunEventLoop(). However, in the Darwin implementation, the dispatch queue is created and made runnable immediately, resulting in races between code executing on the dispatch queue and APIs being invoked into the stack by the main thread.
  2. _StartEventLoopTask() should actually spin up the task/dispatch queue. On Darwin, this is a no-op.
  3. _RunEventLoop() is called by applications that do not want to start up a 'separate threading context', but rather just block on running the event loop until it's stopped. This is not implemented on Darwin. This is the model that servers use, and the only reason our server example apps are not failing is that the shell has a blocking loop that keeps the app from terminating early.
  4. LockChipStack() and UnlockChipStack() are not implemented. In the 'external synchronization' threading model (which #7478 adopts, using locks to fix races in one example application), this is problematic.
  5. PostEvent() doesn't grab the CHIP lock while dispatching, resulting in races with the main thread invoking the SDK.
  6. Shutdown() doesn't actually stop the dispatch queue synchronously, causing segfaults like the one here.
  7. The dispatch code in SystemLayer, SystemTimer, {Raw|TCP|UDP}Endpoint doesn't grab locks to serialize access to the SDK critical section.
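
For reference, here is a minimal sketch of the application lifecycle this contract implies, using the public PlatformManager API; numbers in comments refer to the items above, and error handling is elided (this is an illustration, not the actual chip-tool code):

```cpp
#include <platform/CHIPDeviceLayer.h>

using namespace chip::DeviceLayer;

int main()
{
    // (1) Initialize the stack; per the contract, nothing may start
    // servicing CHIP events yet.
    PlatformMgr().InitChipStack();

    // ... further stack setup, safe because no event loop is running ...

    // (2) Client model: spin up a separate task/queue to service events,
    PlatformMgr().StartEventLoopTask();

    // (4) ... and serialize any subsequent stack access from this thread:
    PlatformMgr().LockChipStack();
    // ... call into the SDK ...
    PlatformMgr().UnlockChipStack();

    // (3) Server model (instead of the two calls above): block right here
    // until the event loop is stopped:
    // PlatformMgr().RunEventLoop();

    // (6) Tear down; this must stop event servicing synchronously.
    PlatformMgr().Shutdown();
    return 0;
}
```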

A lot of the API contracts are not well documented in PlatformManager.h, which I think causes problems like these. I'll put up a PR to provide clarifications there so that we're all on the same page.

@mrjerryjohns (Contributor Author) commented Jun 13, 2021

Most of the above was confirmed with tsan, which reported a number of race conditions in those areas.

I've since made a number of fixes to the Darwin platform implementation that keep the spirit of what's being achieved there, while making it compliant with the updated API contracts I've defined in #7478.

I was able to repro the segfault in #7574 quite reliably with the test runner, and with the above fixes I was able to run up to 1000 iterations crash-free, as well as tsan-error-free. While this doesn't mean it's completely race-free, it's still a positive leg up!

Will put it up soon...

@mrjerryjohns (Contributor Author)

FYI @msandstedt

@vivien-apple (Contributor)

> Here is my analysis of the state of threading and dispatch queues on Darwin, especially in light of the changes that were made recently in #7405.
>
> On the whole, the Darwin implementation deviates from the API contracts stipulated in PlatformManager.h, exposing a number of race conditions in applications like chip-tool that have a separate application/main dispatch queue from the dispatch queue that runs parts of CHIP.
>
> Namely:
>
> 1. _InitChipStack() is expected to initialize the CHIP stack without actually starting any threading contexts or tasks to service CHIP events. This is how most applications are structured: they call _InitChipStack() to initialize the stack, do a bunch more stack interactions, and then either start the CHIP task OR block on the main runloop by calling RunEventLoop(). However, in the Darwin implementation, the dispatch queue is created and made runnable immediately, resulting in races between code executing on the dispatch queue and APIs being invoked into the stack by the main thread.
> 2. _StartEventLoopTask() should actually spin up the task/dispatch queue. On Darwin, this is a no-op.

If this is the API contract (that was pretty unclear ;) ), it can be implemented by calling dispatch_resume on the dispatch sources that were created previously. By default, dispatch sources created with dispatch_source_create are in a suspended state. The current code calls dispatch_resume right after they have been created; deferring that call would effectively decouple those parts of the code.
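
As a standalone illustration of that libdispatch behaviour (not CHIP code; assumes clang's blocks extension, as used elsewhere in this thread):

```cpp
#include <dispatch/dispatch.h>
#include <cstdio>

int main()
{
    dispatch_queue_t queue = dispatch_queue_create("com.example.work", DISPATCH_QUEUE_SERIAL);
    dispatch_source_t timer = dispatch_source_create(DISPATCH_SOURCE_TYPE_TIMER, 0, 0, queue);

    dispatch_source_set_timer(timer, DISPATCH_TIME_NOW, NSEC_PER_SEC, 0);
    dispatch_source_set_event_handler(timer, ^{ std::printf("tick\n"); });

    // Sources are created suspended: nothing fires until this call.
    // Deferring it to _StartEventLoopTask() is exactly the decoupling
    // point described above.
    dispatch_resume(timer);

    dispatch_main(); // park the main thread; the timer fires on `queue`
}
```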

> 3. _RunEventLoop() is called by applications that do not want to start up a 'separate threading context', but rather just block on running the event loop until it's stopped. This is not implemented on Darwin. This is the model that servers use, and the only reason our server example apps are not failing is that the shell has a blocking loop that keeps the app from terminating early.

At least it has let us find this issue ;)

That said, we could call something like dispatch_main(), or run an NSRunLoop, inside _RunEventLoop.
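
A sketch of what that could look like, with mWorkQueue as a hypothetical member holding the serial queue created in _InitChipStack() and kept suspended until now:

```cpp
// Illustrative only; not the actual Darwin implementation.
void PlatformManagerImpl::_RunEventLoop()
{
    dispatch_resume(mWorkQueue); // make the work queue runnable
    dispatch_main();             // block this thread; GCD worker threads
                                 // now drive the stack via mWorkQueue
}
```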

> 4. LockChipStack() and UnlockChipStack() are not implemented. In the 'external synchronization' threading model (which #7478 adopts, using locks to fix races in one example application), this is problematic.
> 5. PostEvent() doesn't grab the CHIP lock while dispatching, resulting in races with the main thread invoking the SDK.
> 7. The dispatch code in SystemLayer, SystemTimer, {Raw|TCP|UDP}Endpoint doesn't grab locks to serialize access to the SDK critical section.

One of the goals of dispatch queues is to avoid locking by serializing access to the stack instead of accessing it concurrently. If we do need locking with dispatch queues, something is wrong somewhere.

One thing that is effectively missing in chip-tool, and that has been done in https://github.com/project-chip/connectedhomeip/blob/master/src/darwin/Framework/CHIP/CHIPDeviceController.mm, is serializing access to the CHIP stack for the Device Commissioner API.

This is the part where I would expect a nicer API that does not force us to wrap calls into the CHIP stack in dispatch_sync or LockChipStack()/UnlockChipStack().
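
The serialization pattern in question looks roughly like this (names are illustrative, not the actual CHIPDeviceController.mm code):

```cpp
#include <dispatch/dispatch.h>

// All CHIP stack access funnels through one serial queue, so calls are
// serialized with the event loop instead of being guarded by a mutex.
static dispatch_queue_t chipWorkQueue =
    dispatch_queue_create("com.example.chip.work", DISPATCH_QUEUE_SERIAL);

void SendCommandSerialized()
{
    dispatch_sync(chipWorkQueue, ^{
        // Safe: this block is the only code touching the stack right now.
        // e.g. commissioner.SendCommand(...); // hypothetical SDK call
    });
}
```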

> 6. Shutdown() doesn't actually stop the dispatch queue synchronously, causing segfaults like the one here.

There are 2 issues that I can think of with the way Shutdown is called at the moment.

  1. Shutdown is called on the main thread because a message response has led chip-tool to call cvWaitingForResponse.notify_all, and the stack is cleaned up while there may still be some work left to do afterwards, such as sending an ack on the main CHIP dispatch queue.
  2. Because Shutdown is currently stopping the dispatch queues, something may happen in the meantime (e.g. a new message on the sockets) and be dispatched asynchronously onto the main work queue, so it will effectively be processed after Shutdown, and that's not good.

The root of the issue at the moment is that Shutdown is both stopping the dispatch queues and cleaning up some state.

Following your API, I would expect a call to StopEventLoopTask to dispatch_suspend the various dispatch queues; then dispatch_async(chipWorkQueue, ^{ mCommissioner.Shutdown(); }); should be safe.
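
In code, that ordering might look like the following (a sketch assuming the sources and queue are members of the Darwin platform implementation; all names are illustrative):

```cpp
void PlatformManagerImpl::_StopEventLoopTask()
{
    // Stop new event delivery first: a suspended source enqueues no
    // further blocks onto the work queue.
    dispatch_suspend(mTimerSource);
    dispatch_suspend(mSocketSource);
}

void ShutdownStack()
{
    PlatformMgr().StopEventLoopTask();
    // Cleanup runs as the last block on the work queue, so work already
    // in flight cannot observe a half-torn-down stack afterwards.
    dispatch_async(chipWorkQueue, ^{ mCommissioner.Shutdown(); });
}
```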

> A lot of the API contracts are not well documented in PlatformManager.h, which I think causes problems like these. I'll put up a PR to provide clarifications there so that we're all on the same page.

Pretty sure we can honour the API contract without resorting to LockChipStack and UnlockChipStack on Darwin.

@tcarmelveilleux (Contributor) commented Jun 14, 2021

External synchronization doesn't require locks. External synchronization can be achieved by serializing all stack API calls and callbacks to a single context, thereby ensuring no locks are needed. LockChipStack/UnlockChipStack are helpers which can be no-ops if you can guarantee that your code never calls CHIP stack code from multiple thread contexts.

However, the question is: is there any global state being mutated across dispatch contexts?

If the answer is yes, then the dispatch queues are being used in a racy way. The usual way for this method to work is to ensure that each dispatch queue only touches context that arrives with the work enqueued on it, and messaging between dispatch queues must be explicitly designed if communication or state is to be exchanged between dispatch-queue contexts. If global state is mutated by work within several dispatch queues, then the implementation detail of "where is this dispatch queue running" collapses into "each dispatch queue is effectively a thread". In that case, dispatch queues don't magically make the code safe, and the shared state has to be protected.
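
A small sketch of that design, with all names hypothetical: each queue owns its state exclusively, and cross-queue communication happens by enqueuing messages rather than by mutating shared globals:

```cpp
#include <dispatch/dispatch.h>

static dispatch_queue_t appQueue  = dispatch_queue_create("app",  DISPATCH_QUEUE_SERIAL);
static dispatch_queue_t chipQueue = dispatch_queue_create("chip", DISPATCH_QUEUE_SERIAL);

// Owned exclusively by chipQueue: only blocks running on chipQueue touch
// it, so no lock is needed even though multiple GCD threads are involved.
static int sPendingRequests = 0;

void AppRequestsSend()
{
    dispatch_async(chipQueue, ^{
        sPendingRequests++; // safe: serialized by chipQueue
        // ... call into the stack ...
        dispatch_async(appQueue, ^{
            // Deliver the result back to the app context as a message,
            // instead of mutating app-side state from here.
        });
    });
}
```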

@woody-apple (Contributor)

The way things are used here for v1 is safe; moving this out.


@andy31415 (Contributor)

I believe the virtual device build failure is probably a separate issue, not related to chip-tool. It looks like it tries to access LevelControl::minLevel.
