Add service "health" parameter to HK (and provide a suggested pattern for apps to follow) #1469
Comments
The concept here isn't to duplicate what's provided in events/syslog, but to report "real" health issues at a higher level: basically, things that mean the system really isn't behaving or configured correctly. This allows for continued ops (if a monitor doesn't trigger a reset) when required to perform recovery options, while also providing situational awareness.
Another case of syslog-only reporting of an unhealthy system: cFE/modules/tbl/fsw/src/cfe_tbl_internal.c Lines 750 to 759 in 5e41330
There are also numerous cases where the code tests for conditions that should never happen, with inconsistent responses. One idea from recent discussions is to add an API with a "soft exception" sort of concept, where we capture context and have a configurable response (and a persistent reporting mechanism if applicable). Something like "system tainted" or similar. This could reduce event clutter and provide more consistent reporting. One example is the ID match failure case here, where a cleanup action will happen and a message will get reported, but it's not really obvious that something might be seriously broken (there are similar cases where mutex actions fail): cFE/modules/es/fsw/src/cfe_es_apps.c Line 1150 in e5d4ed9
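To make the "soft exception" idea a bit more concrete, here is a minimal sketch of what such an API might look like. Every name in it (`SoftExc_Report`, `SOFTEXC_CHECK`, the response enum) is hypothetical and not part of cFE; it only illustrates capturing context, applying a configurable response, and latching a "tainted" flag. A real implementation would need locking and would likely report through events or telemetry rather than `stderr`.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical API sketch; none of these names exist in cFE. */

typedef enum
{
    SOFTEXC_RESPONSE_COUNT = 0, /* count the occurrence only */
    SOFTEXC_RESPONSE_LOG,       /* count and write a log line */
    SOFTEXC_RESPONSE_TAINT      /* count, log, and latch the "tainted" flag */
} SoftExc_Response_t;

typedef struct
{
    const char *File;   /* source file of the most recent soft exception */
    int         Line;   /* source line of the most recent soft exception */
    int32_t     Status; /* status code supplied by the caller */
} SoftExc_Context_t;

typedef struct
{
    SoftExc_Response_t Response; /* configured response */
    uint32_t           Count;    /* number of soft exceptions seen */
    int                Tainted;  /* latched once set; never cleared here */
    SoftExc_Context_t  Last;     /* context of the most recent occurrence */
} SoftExc_State_t;

static SoftExc_State_t SoftExc_State = {SOFTEXC_RESPONSE_TAINT, 0, 0, {NULL, 0, 0}};

/* Select how subsequent soft exceptions are handled. */
void SoftExc_SetResponse(SoftExc_Response_t Response)
{
    SoftExc_State.Response = Response;
}

/* Record a "should never happen" condition and apply the configured response. */
void SoftExc_Report(const char *File, int Line, int32_t Status)
{
    SoftExc_State.Count++;
    SoftExc_State.Last.File   = File;
    SoftExc_State.Last.Line   = Line;
    SoftExc_State.Last.Status = Status;

    if (SoftExc_State.Response >= SOFTEXC_RESPONSE_LOG)
    {
        fprintf(stderr, "soft exception at %s:%d, status=%ld\n", File, Line, (long)Status);
    }

    if (SoftExc_State.Response == SOFTEXC_RESPONSE_TAINT)
    {
        SoftExc_State.Tainted = 1;
    }
}

/* Convenience wrapper: report when Cond is false, capturing the call site. */
#define SOFTEXC_CHECK(Cond, Status)                            \
    do                                                         \
    {                                                          \
        if (!(Cond))                                           \
        {                                                      \
            SoftExc_Report(__FILE__, __LINE__, (Status));      \
        }                                                      \
    } while (0)
```

A call site such as the ID match failure could then use `SOFTEXC_CHECK(IdsMatch, Status);` (with `IdsMatch` and `Status` standing in for whatever the local code has) alongside its existing cleanup, so the response policy lives in one place instead of being decided per call site.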
I've got a health reporting library prototyped that manages status bits and a counter, which works nicely for HK reporting. If anyone ever wants to help advance this, collaborate, or just get a status update, let me know. The APIs are still maturing, but I'm using it in a real-world use case and the concept seems helpful, especially for automation/autonomy.
Is your feature request related to a problem? Please describe.
Historically, syslog or events are used to report issues, and telemetry status reporting is likely scattered and/or inconsistent. It's not easy to be sure everything is "healthy" at a glance. One example is system startup synchronization: there isn't an easy way to tell (especially with spotty comm) that startup synchronization was successful. There are also other cases where operation continues "best effort" in failure conditions, since there isn't anything that can really be done from within the system.
Describe the solution you'd like
Add an app/service health summary parameter to HK: 0 is healthy, and nonzero bits indicate specific issues that have been encountered. Latch on the condition, and clear with a reset command. Proper startup synchronization is an easy first condition to add, but the code should be scrubbed for others to include in the summary. This addition reduces the dependency on syslog/events for a monitoring system (like HS or an "external" monitor) or the ground to take appropriate action.
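As a rough illustration of the latched health word described above, here is a minimal sketch. The struct layout, bit assignments, and function names (`SAMPLE_HealthSet`, `SAMPLE_HealthResetCmd`, and so on) are all made up for this example and are not existing cFE definitions; a real service would also need to protect the word with a lock or atomic access if it is set from multiple tasks.

```c
#include <stdint.h>

/* Hypothetical bit assignments; 0 in the health word means "healthy". */
#define SAMPLE_HEALTH_STARTUP_SYNC_FAILED (1u << 0)
#define SAMPLE_HEALTH_CDS_RESTORE_FAILED  (1u << 1)
#define SAMPLE_HEALTH_MUTEX_OP_FAILED     (1u << 2)

/* Hypothetical HK payload with the added health summary parameter. */
typedef struct
{
    uint8_t  CommandCounter;
    uint8_t  CommandErrorCounter;
    uint32_t Health; /* latched condition bits, cleared only by reset command */
} SAMPLE_HkTlm_Payload_t;

static SAMPLE_HkTlm_Payload_t SAMPLE_Hk;

/* Latch one or more conditions; bits stay set until explicitly reset. */
void SAMPLE_HealthSet(uint32_t Bits)
{
    SAMPLE_Hk.Health |= Bits;
}

/* Handler for a (hypothetical) reset command: clears all latched bits. */
void SAMPLE_HealthResetCmd(void)
{
    SAMPLE_Hk.Health = 0;
}

/* Single check a monitor or ground script could use at a glance. */
int SAMPLE_IsHealthy(void)
{
    return SAMPLE_Hk.Health == 0;
}
```

With something like this, a failed startup sync would call `SAMPLE_HealthSet(SAMPLE_HEALTH_STARTUP_SYNC_FAILED)` once, and the bit would remain visible in every subsequent HK packet, even if the corresponding event or syslog entry was missed.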
Additionally, many of the CDS "errors" are simply written to the system log (or not) and initialization continues. When these things fail, something is wrong or something got corrupted, and that needs to be more obvious (examples):
cFE/modules/tbl/fsw/src/cfe_tbl_internal.c
Lines 155 to 163 in 84ba9a9
cFE/modules/tbl/fsw/src/cfe_tbl_internal.c
Lines 167 to 173 in 84ba9a9
cFE/modules/tbl/fsw/src/cfe_tbl_internal.c
Lines 177 to 188 in 84ba9a9
Describe alternatives you've considered
None
Additional context
#1466 would allow apps to add the sync status; note also that #1467 would provide the syslog. Spawned from issues discussed at code review.
Requester Info
Jacob Hageman - NASA/GSFC