Infinite recursion on list_windows.go #23984
Labels
hcc/jira
stage/accepted
Confirmed, and intend to work on. No timeline commitment though.
theme/client
theme/platform-windows
type/bug
Nomad version
Nomad v1.8.3
BuildDate 2024-08-13T07:37:30Z
Revision 63b636e
Operating system and Environment details
OS Name: Microsoft Windows Server 2022 Datacenter
OS Version: 10.0.20348 N/A Build 20348
Issue
It seems that we are hitting an infinite recursion issue running Nomad on Windows (as a client). We have a large cluster of long-running services using raw_exec. When the services come up, everything looks good, and they can stay stable for quite a few hours.
However, sometimes after a few hours, we start receiving a lot of these errors in the respective allocations:
I pulled the logs from a particular client in which one of these allocations failed and I found this go stack trace:
Our jobs are configured to restart a failed task up to three times before rescheduling; after three consecutive failures, a new allocation is placed. As you can see in the screenshot, even the new allocations can fail at first, but eventually one of them sticks and things stabilize. The catch is that after a while the service becomes unstable again for the same reason, and the whole cycle starts over.
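For reference, the restart-then-reschedule behavior described above corresponds roughly to a group configuration like the following. This is a sketch, not our exact job spec; the group name and values are illustrative:

```hcl
group "runner" {
  restart {
    attempts = 3
    mode     = "fail"   # after 3 failed restarts, give up on this allocation...
  }
  reschedule {
    unlimited = true    # ...and let the scheduler place a new allocation
  }
}
```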
Reproduction steps
We don't have a synthetic project in which we can reliably reproduce this, as this is happening exclusively on our production workload.
Here are some guesses and some more information about what this process is actually doing.
The main process executed in this Nomad allocation is a Node application that spawns a GitHub Actions runner configured to execute only one particular kind of job: a Unity build. Unity is a game engine that, while building, can consume a lot of resources and put the machine under significant stress. Although our machines are beefy, well beyond what we have observed this process to use, this could explain part of the behaviour. Moreover, the actual Unity process runs in a Docker for Windows container.
Nomad runs as a Windows Service under the NT AUTHORITY\SYSTEM user. This is an example of the process tree that our Nomad client spawns for each allocation:

Moreover, it looks like when this happens, only the Nomad executor dies while the underlying process is left alive. This becomes problematic, as these processes then live outside our Nomad cluster and consume resources on the machines.
Finally, we started experiencing this issue after upgrading Nomad from version 1.7.7 to 1.8.3. We are considering downgrading because of it, but it would be nice to understand whether we can do anything to mitigate it at all.
If it helps, these are normal EC2 machines in the AWS Cloud.
Expected Result
The Nomad allocations don't crash with a Go panic and run normally.
Actual Result
The Nomad allocations crash with a Go panic, and the underlying spawned process is left alive, consuming resources on the machine.