Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exception and Reset Log possible race conditions #411

Closed
skliper opened this issue Nov 18, 2019 · 11 comments · Fixed by #653 or #692
Closed

Exception and Reset Log possible race conditions #411

skliper opened this issue Nov 18, 2019 · 11 comments · Fixed by #653 or #692
Assignees
Labels
Milestone

Comments

@skliper
Copy link
Contributor

skliper commented Nov 18, 2019

Describe the bug
CFE_ES_ClearERLogCmd and CFE_ES_WriteToERLog both modify shared CFE_ES_ResetDataPtr values. CFE_ES_ProcessCoreException and CFE_ES_ResetCFE both use CFE_ES_WriteToERLog (both are API's, so could be out of ES context).

To Reproduce
Looks to me like if CFE_ES_ClearERLogCmd gets interrupted by the processing of an app core exception, the log could get corrupted.

Expected behavior
No race.

Code snips
See functions above.

Versions

  • latest cFS bundle dev branch

Additional context
Not observed, via code review.

Reporter Info
Jacob Hageman - NASA/GSFC

@jphickey
Copy link
Contributor

The potential race conditions in exception handling was also noted in a previous trac ticket, now in issue #76

@skliper skliper added the bug label Nov 18, 2019
@skliper
Copy link
Contributor Author

skliper commented Feb 10, 2020

@jphickey could you take this and #76? Deprecate CFE_ES_ProcessCoreException() helps. Then re-architecture the recording/reporting.

@jphickey
Copy link
Contributor

Sure, I'm happy to take it on, but this is one area that reality of varying system architectures is somewhat mismatched to the original API/requirements (which had a bit of a simplistic VxWorks/flat memory/non-MMU view of the world).

A reason this has been sitting in the "todo" bucket for so long is that it first needs some community agreement on what the requirements for exception handling should be. It likely means moving at least some aspects of the exception handling to the PSP so will require coordinated change to the requirement, PSP, and CFE at the same time.

Yes, will likely have to deprecate CFE_ES_ProcessCoreException() and replace it with something more abstract to get this fixed.

@skliper
Copy link
Contributor Author

skliper commented Feb 11, 2020

Sounds good. It would help if you could initiate this conversation and work it with the architects and/or coordinate with @astrogeco to bring it up at the CCB.

Note #509 removes the exception handling requirements from cFE (so it can be implemented in the PSP layer). The requirements related to the ER Log remain since we still need a way to command a clear of the log and command writing the log to the file (cFE capabilities), but utilizing the PSP under the hood.

@skliper skliper added this to the 6.8.0 milestone Feb 26, 2020
@jphickey
Copy link
Contributor

Suggestions/Proposal for fixing this issue

The root of the issue is that the exception handler within CFE (CFE_ES_ProcessCoreException) is invoked directly by the PSP during the exception handling, which is an async event - that is, an ISR or signal handler context. However, the code inside this function is not async-signal safe. Furthermore, it is not possible to do any locking - particularly in POSIX where it is not safe to use any pthread primitives from a signal handler.

This is another area where the "better" approach likely depends on the mode of operation and goals, in particular whether it is a "debug" build or an actual deployment build.

On a "debug" build it is preferable to do an abort() or equivalent immediately when an exception occurs, such that the kernel can generate a core file which can be examined in a debugger for post-mortem analysis.

On a flight-like or "release" build it is preferable to capture and log as much as possible internally. Not all platforms can support this, however, and the solution is platform-dependent. Suggesting a sequence that the MCP750 could do as an example (even though it is generally still a debug platform).

Proposed sequence for VxWorks MCP750 (for a flight-like example)
Exception Hook (ISR equivalent):

  • Confirm global PSP "exception occurred" flag is not set. If set, return and do nothing (or maybe go direct to a watchdog reset?).
  • Exception Hook Stores the relevant data (task ID, vector number, and stack frame) which are passed by the kernel to the hook.
  • Exception Hook also reads the timebase used by the performance log to get precise timestamp of when exception occurred.
  • Set PSP global flag indicating that an exception has occurred, and context data was stored.
  • Exception Hook Wakes up another regular task to perform logging duties (see below).
  • Exit/return from Exception Hook
  • The task which triggered the exception will remain suspended by the kernel (it is not resumed)

Background Task:

  • A background task detects that the exception flag is set, and invokes CFE_ES_ProcessCoreException with the relevant exception data
  • OK if this uses OSAL primitives (it is a normal task - but see risks)
  • May collect data from the performance log, to capture the context of what was happening when the exception occurred (correlated by timestamp).
  • Write all data to an exception file, containing board-specific data (registers/stack) and performance log data.
  • Trigger a final action, such as CFE_PSP_Restart() to restart the processor, or trigger a hardware watchdog.

Concerns/considerations with this sequence:

  • Exception could occur while a task is holding a lock. Because task is suspended after exception, lock will never be released. This is a biggest concern if the background task needs to take the same lock (deadlock would ensue)
  • Exception could occur while handling an exception. There is a window between checking the flag and setting the flag, where if a second exception occurs it will corrupt the data. This is likely not possible on MCP750 though, as it is single-core and the exception hook is an ISR. But If that happens, the system is probably too far gone to do anything useful - a watchdog is probably needed.

If the hardware supports it, it might be useful to trigger the watchdog directly in the exception hook but in a "delay" fashion, so if the background task never completes, the system still gets a reset (again, requires HW that can actually do this, not all can).

Proposed Sequence for DEBUG builds
Do nothing! This would be the only mode used on pc-linux.

The default behavior for handling this type of exception (i.e. if the PSP/OSAL did not register any handler for this event, the kernel would would handle it equivalently to an "abort" or debugger trap). This will either generate a core dump or otherwise let the developer debug the situation with full context info.

Please review and provide comments/thoughts, particularly @skliper, @acdumore and @jwilmot, I'd like to hear your thoughts on this approach. I'll then move forward with implementing it.

@skliper
Copy link
Contributor Author

skliper commented Apr 8, 2020

I like it.

@ghost
Copy link

ghost commented Apr 9, 2020

The proposed solution sounds good to me too. I think the original implementation worked for VxWorks with the exception hook if I remember correctly, but I had to hack the implementation for RTEMS and FreeRTOS because of the problems with calling CFE_ES_ProcessCoreException from an ISR/Exception handler.
I like the idea of waking up the background task to log the information call the reset. The watchdog (if available) should provide a backup. Could you implement the logger as another job in the new ES background task?

I suppose another potential implementation could be to let the PSP handle everything. Make sure the PSP can access the Performance Log, Exception and Reset Log, syslog, etc, so the PSP could just log the exception data from ES and reset directly. I guess it would have to stop the scheduler to do this. (is that even possible on Linux?) But is that just moving the problems?

@skliper
Copy link
Contributor Author

skliper commented Apr 10, 2020

Could you make the different behavior selectable independent of debug mode? There may be users who would like to control this behavior explicitly or enforce it be consistent between build types.

@skliper
Copy link
Contributor Author

skliper commented Apr 10, 2020

I'm fine if it defaults to behaving this way, just asking that it not be required to behave this way (allow folks to configure it to behave more flight-like while in debug mode).

@jphickey
Copy link
Contributor

It is based on your PSP and its capabilities, not the BUILDTYPE of the build.

In the pc-linux / POSIX world it doesn't offer much of a user space exception self-recovery option at all. Exceptions are handled by the kernel. This in turn will generate a signal to the thread, but AFAIK there is no way to suspend the thread without killing it, which will actually kill the whole process. You can configure the floating point modes which determines whether or not certain arithmetic things are exceptions in the first place, but once an exception occurs it is going to abort.

Maybe there is some signal handler technique I'm not aware of for the arithmetic/floating point exception stuff, but for the other stuff like SIGSEGV/SIGBUS/SIGILL, it really is futile to try to continue running because the task is in an indeterminate state.

So for pc-linux, I think the only choice should be to just use the default signal behavior for these exceptions, which will abort the CFE and generate a coredump. Any recovery could be done in a parent/watchdog process that detects this abnormal exit and does a "processor restart" of the CFE to get things running again. In theory, this could detect the presence of the core file and make an ER log entry after it restarts, or it could be logged by the external process.

My recommendation is to save that for another day, or let someone who actually wants that to implement it.

The other, better way to make pc-linux more robust WRT exceptions is to run each app/module in a separate memory-protected process. This would be FAR more reliable. I think its doable but it would be a good challenge. But then the exception handling would "just work".

When using the mcp750/vxworks, this is the only platform that can really implement exception handling as it is currently designed because of the fact that VxWorks suspends the individual task and has the exception hook feature. So this will be employed when using this PSP.

I've been thinking about @acudmore 's suggestion of letting the PSP do the log entirely and I think that makes a lot of sense. most everything related to the exception is platform-specific so it may make more sense to just keep it entirely "owned" by the PSP and only offer a simplified API that ES can call if it needs to add some entry into it.

@skliper
Copy link
Contributor Author

skliper commented Apr 10, 2020

Proposed Sequence for DEBUG builds

This "DEBUG builds" is what threw me (this sounded like a reference to BUILDTYPE), completely on-board with PSP handled. It was the initial direction I thought this was going to begin with per
#411 (comment).

What about retaining the clear/write to file commands?

jphickey added a commit to jphickey/cFE that referenced this issue Apr 29, 2020
Move exception handling to a PSP function.  In this approach
the CFE only logs data after the event as a background job.

Replaces the CFE_ES_ProcessCoreException with a simple notification
that causes the ES background job to run, which in turn polls the
PSP for logged exceptions and writes entries to the ES ER log.

Both the PSP execption scan and the ER log file dump are converted
to background jobs.
jphickey added a commit to jphickey/cFE that referenced this issue Apr 29, 2020
Move exception handling to a PSP function.  In this approach
the CFE only logs data after the event as a background job.

Replaces the CFE_ES_ProcessCoreException with a simple notification
that causes the ES background job to run, which in turn polls the
PSP for logged exceptions and writes entries to the ES ER log.

Both the PSP execption scan and the ER log file dump are converted
to background jobs.
jphickey added a commit to jphickey/cFE that referenced this issue Apr 30, 2020
Move exception handling to a PSP function.  In this approach
the CFE only logs data after the event as a background job.

Replaces the CFE_ES_ProcessCoreException with a simple notification
that causes the ES background job to run, which in turn polls the
PSP for logged exceptions and writes entries to the ES ER log.

Both the PSP execption scan and the ER log file dump are converted
to background jobs.
jphickey added a commit to jphickey/cFE that referenced this issue May 6, 2020
Move exception handling to a PSP function.  In this approach
the CFE only logs data after the event as a background job.

Replaces the CFE_ES_ProcessCoreException with a simple notification
that causes the ES background job to run, which in turn polls the
PSP for logged exceptions and writes entries to the ES ER log.

Both the PSP execption scan and the ER log file dump are converted
to background jobs.
jphickey added a commit to jphickey/cFE that referenced this issue May 6, 2020
Move exception handling to a PSP function.  In this approach
the CFE only logs data after the event as a background job.

Replaces the CFE_ES_ProcessCoreException with a simple notification
that causes the ES background job to run, which in turn polls the
PSP for logged exceptions and writes entries to the ES ER log.

Both the PSP execption scan and the ER log file dump are converted
to background jobs.
jphickey added a commit to jphickey/cFE that referenced this issue May 6, 2020
Move exception handling to a PSP function.  In this approach
the CFE only logs data after the event as a background job.

Replaces the CFE_ES_ProcessCoreException with a simple notification
that causes the ES background job to run, which in turn polls the
PSP for logged exceptions and writes entries to the ES ER log.

Both the PSP execption scan and the ER log file dump are converted
to background jobs.
jphickey added a commit to jphickey/cFE that referenced this issue May 6, 2020
Correct items flagged as warnings in documenation build, and
remove now-unused definition in sample platform config
astrogeco added a commit that referenced this issue May 8, 2020
Fix #411, rework exception handling in CFE
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
2 participants