[BUG] Memory Leak on Secondary Core Power Cycle #9005

Closed
tmleman opened this issue Apr 5, 2024 · 0 comments · Fixed by #9006
Labels
* bug: Something isn't working as expected
* LNL: Applies to Lunar Lake platform
* MTL: Applies to Meteor Lake platform
* regression identified: Identified the commit or PR that introduced a regression
* Zephyr: Issues only observed with Zephyr integrated

Comments

tmleman (Contributor) commented Apr 5, 2024

Describe the bug
A regression has been detected in the power flow code, introduced by commit 5f1e690, causing a memory leak on multicore platforms using Zephyr as the RTOS. The issue manifests as repeated memory allocations for secondary cores that are not freed upon powering down. This results in a gradual depletion of available memory, eventually leading to a firmware exception notification from the DSP. The problem was first identified on the LunarLake platform and is likely to affect other multicore platforms due to the shared power flow code.

To Reproduce
Steps to reproduce the behavior:

  1. Power up all secondary cores of the DSP on a multicore platform (e.g., LunarLake).
  2. Power down all the secondary cores.
  3. Repeat steps 1 and 2 until a firmware exception notification is received from the DSP.
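
The loop described by these steps can be modelled on a host machine. The sketch below is not SOF code: the pool size, per-core allocation size, core count, and all function names are hypothetical, chosen only to show how a power-up path that allocates on every call, paired with a power-down path that frees nothing, exhausts a fixed memory budget after a bounded number of cycles.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: a fixed heap budget tracked as a byte counter. */
#define POOL_SIZE        4096u  /* assumed budget, not an SOF value */
#define CORE_CTX_SIZE     256u  /* assumed per-core allocation size */
#define SECONDARY_CORES     3u  /* assumed number of secondary cores */

static size_t pool_used;

/* Power-up path: allocates per-core context on every call (the leak source). */
static bool core_power_up(void)
{
	if (pool_used + CORE_CTX_SIZE > POOL_SIZE)
		return false;	/* stands in for the firmware exception */
	pool_used += CORE_CTX_SIZE;
	return true;
}

/* Power-down path: intentionally releases nothing, mirroring the bug. */
static void core_power_down(void)
{
}

/* Repeat steps 1 and 2 until an allocation fails; return completed cycles. */
static unsigned int cycles_until_exhaustion(void)
{
	unsigned int cycles = 0;

	pool_used = 0;
	for (;;) {
		for (unsigned int c = 0; c < SECONDARY_CORES; c++)
			if (!core_power_up())
				return cycles;
		for (unsigned int c = 0; c < SECONDARY_CORES; c++)
			core_power_down();
		cycles++;
	}
}
```

With these assumed numbers each cycle consumes 768 bytes, so the model completes five full cycles and fails during the sixth power-up. The real cycle count on hardware depends on the actual heap size and allocation sizes, which are not stated in this report.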

Reproduction Rate
The issue is reproducible 10/10 times when following the above manual sequence on the LunarLake platform. The reproduction rate is expected to be consistent across other multicore platforms using Zephyr as the RTOS.

Expected behavior
Upon powering down the secondary cores, the system should either release the resources allocated during the power-up phase or ensure that they are reused during the next power-up.

Impact
This memory leak is a critical issue that leads to resource exhaustion and potential DSP panic after an extended number of power cycles. It is a showstopper for the reliability and stability of the DSP on all affected multicore platforms.

Environment
* SOF: main
* Platform: LunarLake (and potentially all multicore platforms using Zephyr RTOS)

tmleman added the bug, Zephyr, MTL, LNL, and regression identified labels Apr 5, 2024
tmleman added a commit to tmleman/sof that referenced this issue Apr 5, 2024
This patch refines the initialization process for secondary cores in a
multicore environment when using Zephyr as the RTOS. The patch
introduces a `check_restore` function specifically for Zephyr, which
checks if basic core structures (IDC, notifier, schedulers) have been
previously allocated and are still present in memory, indicating that
the system is not undergoing a cold boot.

By adding this check, the system avoids unnecessary re-allocation of
these structures during the power-up sequence of secondary cores,
effectively preventing the memory leak observed during repeated power
cycle tests.

fix thesofproject#9005

Signed-off-by: Tomasz Leman <tomasz.m.leman@intel.com>
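
The idea in this commit message can be sketched as follows. Only the name `check_restore` and the list of structures (IDC, notifier, schedulers) come from the commit message; the `core_ctx` struct, the `ctx_alloc` helper, the allocation sizes, and `secondary_core_init` are hypothetical stand-ins, and the actual SOF implementation may look quite different.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-core pointers; the real SOF structures differ. */
struct core_ctx {
	void *idc;
	void *notifier;
	void *schedulers;
};

static unsigned int alloc_count;	/* counts allocations, for illustration */

/* Hypothetical allocator wrapper standing in for the firmware heap. */
static void *ctx_alloc(size_t size)
{
	alloc_count++;
	return calloc(1, size);
}

/* True when all basic structures are still present, i.e. not a cold boot. */
static bool check_restore(const struct core_ctx *ctx)
{
	return ctx->idc && ctx->notifier && ctx->schedulers;
}

/* Secondary core init: allocate only what is missing. Returns 0 on success. */
static int secondary_core_init(struct core_ctx *ctx)
{
	if (check_restore(ctx))
		return 0;	/* warm path: reuse the existing structures */

	/* Cold path: sizes below are arbitrary placeholders. */
	if (!ctx->idc)
		ctx->idc = ctx_alloc(64);
	if (!ctx->notifier)
		ctx->notifier = ctx_alloc(64);
	if (!ctx->schedulers)
		ctx->schedulers = ctx_alloc(64);

	return check_restore(ctx) ? 0 : -1;
}
```

In this sketch the first call to `secondary_core_init` on a zeroed context performs three allocations, and every later call returns early without allocating, so repeated power cycles no longer grow memory usage.
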
tmleman added a commit to tmleman/sof that referenced this issue Apr 8, 2024
tmleman added a commit to tmleman/sof that referenced this issue Apr 8, 2024
tmleman added a commit to tmleman/sof that referenced this issue Apr 9, 2024
lgirdwood pushed a commit that referenced this issue Apr 9, 2024
eddy1021 pushed a commit to eddy1021/sof that referenced this issue Jul 15, 2024