Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CFE intermittently showing invalid message ID errors #1073

Closed
jphickey opened this issue Jan 6, 2021 · 8 comments · Fixed by #1092 or #1109
Closed

CFE intermittently showing invalid message ID errors #1073

jphickey opened this issue Jan 6, 2021 · 8 comments · Fixed by #1092 or #1109
Assignees
Labels
Milestone

Comments

@jphickey
Copy link
Contributor

jphickey commented Jan 6, 2021

Describe the bug
When leaving CFE running for long periods of time, I am seeing occasional errors related to bad message IDs/Commands appear.

For instance:
EVS Port1 66/1/CFE_EVS 5: Invalid command packet, Message ID = 0x00001810

I was actually able to catch one of these in a core dump, and I can confirm that SBBufPtr->Msg looks like a perfectly well-formed message with MsgID of 0x1810. This particular example was caught at

"Invalid command packet, Message ID = 0x%08X",
.

(gdb) print /x *((CFE_MSG_CommandHeader_t*)SBBufPtr)
$2 = {Msg = {CCSDS = {Pri = {StreamId = {0x18, 0x10}, Sequence = {0xc0, 0x0}, Length = {0x0, 0x1}}}, Byte = {0x18, 0x10, 0xc0, 0x0, 0x0, 0x1}}, Sec = {FunctionCode = 0x0, Checksum = 0x0}}

What is particularly mysterious is that 0x1810 is CFE_TIME_TONE_CMD_MID ... These MIDs should be delivered only to CFE_TIME - EVS does not subscribe to these - yet for some reason it appears to be delivered to EVS....

To Reproduce
Build CFE in default debug config (SIMULATION=native, etc). Start up CFE core and just let it run. It may take several hours for the first error to appear. CFE appears to continue running normally afterwards though.

Expected behavior
EVS (or other apps!) should not randomly see messages it didn't subscribe to.

System observed on:
Ubuntu 20.04 (native)

Additional context
Not sure when this started because everything seems fine when debugging CFE for shorter time periods. But over the last couple weeks I've noticed these messages randomly appearing in terminal windows where I've started CFE and forgotten about it (overnight or longer).

Although I cite EVS here (because its the one I actually caught in a core file) this phenomenon occurs in other apps too (SB, TIME, etc) all randomly getting a delivered message ID they didn't subscribe to.

Reporter Info
Joseph Hickey, Vantage Systems, Inc.

@jphickey jphickey self-assigned this Jan 6, 2021
@jphickey jphickey added the bug label Jan 6, 2021
@jphickey
Copy link
Contributor Author

jphickey commented Jan 6, 2021

I am starting to dig deeper into this - don't have any more info just yet - but was wondering if perhaps other users have noticed this too? Given the intermittent nature and that it takes hours to reproduce this, it might be a while before I can track it down sufficiently.

Seems like it could be a race condition where the buffer is perhaps reused before it was freed, or the routing in SB delivered it to the wrong place somehow.

@astrogeco
Copy link
Contributor

Might be worth sending a message to the mailing list. I'll post on the internal MS Teams channel.

@jphickey
Copy link
Contributor Author

jphickey commented Jan 6, 2021

I am pretty sure it is something fairly recently added. As I said I cannot pinpoint exactly when (yet) but I do frequently leave CFE running in one of the many terminal windows buried on my desktop - and this didn't happen before. So only users of bleeding edge dev builds might see this - Not official release users.

And for users that do have a build of the latest code and have the ability to let it run for an extended period - please try it and comment if you see this too. I am wondering if it might be affected by multi-core systems or something like that.

@skliper
Copy link
Contributor

skliper commented Jan 7, 2021

Ran overnight (SIMULATION=native on Ubuntu 18.04 VM) w/ no occurrences yet. Just the flywheel toggle.

@skliper skliper added this to the 7.0.0 milestone Jan 7, 2021
@skliper
Copy link
Contributor

skliper commented Jan 7, 2021

I'll do a debug build with the CFE_TIME_TONE_CMD_MID sent at a high rate, maybe some extra reporting to see if I can increase the frequency of the bug.

@jphickey
Copy link
Contributor Author

jphickey commented Jan 7, 2021

I also ran two instances overnight (in VM) - one was a vanilla/unmodified "main" build and the other had debug instrumentation added. Since last night the unmodified build has accumulated 3 errors:

1980-012-14:11:19.65671 ES Startup: CFE_ES_Main entering OPERATIONAL state
EVS Port1 66/1/CFE_TIME 21: Stop FLYWHEEL
EVS Port1 66/1/CI_LAB_APP 4: CI: invalid command packet,MID = 0x1810
EVS Port1 66/1/CFE_SB 43: Invalid Cmd, Unexpected Msg Id: 0x1810
EVS Port1 66/1/SAMPLE_APP 5: SAMPLE: invalid command packet,MID = 0x1810

Of course the build with debug instrumentation added has accumulated none yet :-/....

@skliper
Copy link
Contributor

skliper commented Jan 7, 2021

Typical. I updated HK messages to "1 Hz" in sch_lab and then I made "1 Hz" 100 Hz from the time service timer config, no errors yet but hopefully I can trigger one... all we send is HK messages and the time at tone message (that I can think of) and it happens in a burst. I haven't reviewed priorities in a while...

@jphickey
Copy link
Contributor Author

jphickey commented Jan 8, 2021

OK - I was able to catch several of these incidents and get core files of the system state when it happened, along with one where I had also started the perf log. With another day of runtime, both of my builds eventually accumulated some errors.

This core files confirm that SCH_LAB was actively running and generating the SEND_HK requests, while CFE_TIME "tone task" was also actively running to generate the CFE_TIME_TONE_CMD_MID (0x1810) message. Ultimately it looks like the same buffer was issued to both messages at the same time.

After doing some inspection on CFE SB, it appears to be missing locks here. In particular getting the buffer from the pool used to be done while holding the SB global lock prior to the recent SB refactor, but in the current CFE_SB_TransmitMsg it is not locked. Incidentally the inverse operation (CFE_SB_DecrBufUseCnt) is still done under lock.

I also see other areas of SB that are accessing globals and therefore should be locked, too... will need to a scrub of this code and make sure it has all the locks it needs.

jphickey added a commit to jphickey/cFE that referenced this issue Jan 13, 2021
Significant refactor of many SB API calls to address inconsistencies
with respect to locking and unlocking of global data structures.

First this updates the definition of CFE_SB_PipeId_t to use the
CFE_ES_ResourceID_t base type, and a new ID range.  Notably this
prevents direct access to the CFE_SB.PipeTbl global, forcing
code to go through the proper lookup routine, which should only
be done while locked.

All API implementations follow the same general pattern:

- Initial checks/queries while unlocked
- Lock SB global
- Lookups and/or modifications to the pipe table/routing info
- Unlock SB global
- Invoke other subsystems (e.g. OSAL)
- Re-lock SB global (if needed) do final update, and unlock again
- Send all events

All error counters should be updated at the end, while still locked.
All event processing is deferred to the end of each function, after
all other processing is done.
jphickey added a commit to jphickey/cFE that referenced this issue Jan 13, 2021
Significant refactor of many SB API calls to address inconsistencies
with respect to locking and unlocking of global data structures.

First this updates the definition of CFE_SB_PipeId_t to use the
CFE_ES_ResourceID_t base type, and a new ID range.  Notably this
prevents direct access to the CFE_SB.PipeTbl global, forcing
code to go through the proper lookup routine, which should only
be done while locked.

All API implementations follow the same general pattern:

- Initial checks/queries while unlocked
- Lock SB global
- Lookups and/or modifications to the pipe table/routing info
- Unlock SB global
- Invoke other subsystems (e.g. OSAL)
- Re-lock SB global (if needed) do final update, and unlock again
- Send all events

All error counters should be updated at the end, while still locked.
All event processing is deferred to the end of each function, after
all other processing is done.
jphickey added a commit to jphickey/cFE that referenced this issue Jan 13, 2021
Significant refactor of many SB API calls to address inconsistencies
with respect to locking and unlocking of global data structures.

First this updates the definition of CFE_SB_PipeId_t to use the
CFE_ES_ResourceID_t base type, and a new ID range.  Notably this
prevents direct access to the CFE_SB.PipeTbl global, forcing
code to go through the proper lookup routine, which should only
be done while locked.

All API implementations follow the same general pattern:

- Initial checks/queries while unlocked
- Lock SB global
- Lookups and/or modifications to the pipe table/routing info
- Unlock SB global
- Invoke other subsystems (e.g. OSAL)
- Re-lock SB global (if needed) do final update, and unlock again
- Send all events

All error counters should be updated at the end, while still locked.
All event processing is deferred to the end of each function, after
all other processing is done.
jphickey added a commit to jphickey/cFE that referenced this issue Jan 15, 2021
Significant refactor of many SB API calls to address inconsistencies
with respect to locking and unlocking of global data structures.

First this updates the definition of CFE_SB_PipeId_t to use the
CFE_ES_ResourceID_t base type, and a new ID range.  Notably this
prevents direct access to the CFE_SB.PipeTbl global, forcing
code to go through the proper lookup routine, which should only
be done while locked.

All API implementations follow the same general pattern:

- Initial checks/queries while unlocked
- Lock SB global
- Lookups and/or modifications to the pipe table/routing info
- Unlock SB global
- Invoke other subsystems (e.g. OSAL)
- Re-lock SB global (if needed) do final update, and unlock again
- Send all events

All error counters should be updated at the end, while still locked.
All event processing is deferred to the end of each function, after
all other processing is done.
astrogeco added a commit that referenced this issue Jan 25, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants