
Error generating reports #115

Open
chzgustavo opened this issue Nov 2, 2022 · 6 comments
Labels
documentation Improvements or additions to documentation

Comments

@chzgustavo

Hello, I am using this tool; congratulations, it is very good. However, I have noticed that when a segmentation fault occurs, the handler sometimes generates all the files with another namespace's name.

I attach evidence.

  • cluster: EKS v1.21
  • core dump version:
NAME                    NAMESPACE       REVISION        UPDATED                                 STATUS          CHART                        APP VERSION
core-dump-handler       observe         1               2022-07-01 04:33:50.377219926 +0000 UTC deployed        core-dump-handler-v8.6.0     v8.6.0    
  • pod-info file: it contains the namespace env-1f1de3e2bda8, when in fact this pod is in the namespace: env-e4e2facbcb22
    [screenshot attached]

One idea is to update core-dump-handler to the newest version; I don't know whether that would solve this problem.

@chzgustavo
Author

chzgustavo commented Nov 2, 2022

Do you have any idea how I could debug this error?

Regards,
Gustavo.

@No9
Collaborator

No9 commented Nov 2, 2022

Hi @chzgustavo
Thanks for the feedback; I really appreciate it.

Do you have pods with the same name running in different namespaces?

Background

The information from CRI-O is currently queried using the hostname of the crashing container, which is assumed to be unique.

This container hostname is then used to match to the pod.
https://github.com/IBM/core-dump-handler/blob/main/core-dump-composer/src/main.rs#L75

It isn't ideal, but using the hostname is the only way I am aware of to capture the crashing container's information.

This isn't an issue in most deployment scenarios, as people tend to use ReplicaSets/Deployments, which generate a unique ID for each pod.

However, if you are creating pods directly in each namespace then you can hit a name clash.

Possible Solution

If that sounds like the problem I would suggest giving each pod a unique name when provisioning.

@chzgustavo
Author

chzgustavo commented Nov 3, 2022

Yes, indeed, I have many pods with the same name running in different namespaces.
The pods that generate segmentation faults belong to StatefulSet resources.

@chzgustavo
Author

They all have the same hostname (but are in different namespaces); is there any other possible solution for this case?
Thanks for your help!

@No9
Collaborator

No9 commented Nov 3, 2022

Sorry I'm not aware of another possible solution.

StatefulSets intentionally label their pods with ordinal numbers:
https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#pod-identity.

If you're using Helm you can add the namespace to the StatefulSet name, which would resolve this.
I know it's clunky, but it should resolve it handily enough.
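A minimal sketch of that workaround, assuming a Helm chart (all names and values here are hypothetical): embedding the built-in `.Release.Namespace` in the StatefulSet name makes every pod hostname (`<statefulset-name>-<ordinal>`) unique across namespaces.

```yaml
# templates/statefulset.yaml -- illustrative only
apiVersion: apps/v1
kind: StatefulSet
metadata:
  # e.g. "db-env-e4e2facbcb22"; pods become db-env-e4e2facbcb22-0, -1, ...
  name: {{ printf "%s-%s" .Values.name .Release.Namespace }}
spec:
  serviceName: {{ printf "%s-%s" .Values.name .Release.Namespace }}
  selector:
    matchLabels:
      app: {{ .Values.name }}
  template:
    metadata:
      labels:
        app: {{ .Values.name }}
    spec:
      containers:
        - name: {{ .Values.name }}
          image: {{ .Values.image }}
```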

The underlying issue here is that kernel.core_pattern is per host, not per container, so it's not possible to feed dynamic info from the pod to the kernel at runtime.
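For context, a pipe-style core pattern looks roughly like the fragment below (the file path and composer flags are hypothetical; the `%` specifiers are the standard ones from the kernel's core(5) documentation). There is exactly one value per node, shared by every container on it, and `%h` (the crashing process's hostname) is the only identity the handler receives.

```
# /etc/sysctl.d/50-coredump.conf -- one node-wide value for all containers
# %c = core limit, %e = executable, %p = PID, %s = signal,
# %t = timestamp, %h = hostname of the crashing process
kernel.core_pattern = |/usr/local/bin/core-dump-composer -c %c -e %e -p %p -s %s -t %t -h %h
```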

As systemd becomes more pod aware there may be a possibility to do something there but the last time I looked it just seemed to pass through to the system code.

[Edit]
I will add this to the FAQ as it seems like it would be a fairly common scenario that will trip others up.

[Edit2]
I'll double-check the statuses in the responses from CRI-O; it may be possible to detect whether a pod is crashing and, if it isn't, move on to the next pod. I seem to remember looking at this when I wrote it and finding it wasn't possible, but I'll double-check.
I won't get to that for a bit though, as I have to look at #114 first.
I won't get to that for a bit though as I have to look at #114 first.

@No9 added the documentation label Nov 3, 2022
No9 referenced this issue in Ninja-Kiwi/k8s-core-dump-handler Nov 17, 2022
…ariable from a core dump and use that as the podname

If the environment variable can't be found then the composer will default back to hostname

Signed-off-by: Tom Haygarth <tom@ninjakiwi.com>
@jesuslinares

Hi @No9,

Thanks for the information. We are still hitting this bug in production, since we didn't apply the "clunky" workaround.

Have you made any progress on fixing it?

This project is very useful for us, thanks for the good work.
