test-process-uid-gid.js (& euid-egid) failing in Node.js 10.x on Alpine 3.9 with Ubuntu 18.04 hosts #29977

Closed
rvagg opened this issue Oct 15, 2019 · 10 comments
Labels
build Issues and PRs related to build files or the CI. process Issues and PRs related to the process subsystem.

Comments

@rvagg
Member

rvagg commented Oct 15, 2019

@BethGriggs picked this up @ nodejs/build#1945 (comment)

We've switched our Alpine 3.9 containers from Ubuntu 16.04 hosts to Ubuntu 18.04 hosts. That moves the kernel from ~4.4.0 to ~4.15.0, and these two tests are now reliably failing with segfaults: test-process-uid-gid.js & test-process-euid-egid.js.

With some effort I've managed to generate a core dump but I don't think it's very helpful:

#0  0x00007f9b32eb1568 in __clone () from /lib/ld-musl-x86_64.so.1
#1  0x00007f9b32eaed56 in ?? () from /lib/ld-musl-x86_64.so.1
#2  0x00007f9b304e4b54 in ?? ()
#3  0x0000000000000000 in ?? ()

The latest Alpine is 3.10 and it's working fine for 10.x (and above), so this is specifically for the last gen Alpine with last gen Node. How much does this matter, and do we have someone with expertise and time to dive into this? I've taken an Alpine 3.9 container out of CI for messing around with this and am happy to give someone access & instructions if you want to toy with it.

If I manage to make a debug build core dump then I'll paste that in, maybe it'll be more interesting?

For my own reference, to help clean up:

  • https://ci.nodejs.org/computer/test-digitalocean-alpine39_container-x64-2/ is offline for this
  • have forced host core dumps to /var/crash and mounted that in the container
  • started container with --ulimit core=99999999999:99999999999
  • installed gdb in the container to inspect
  • saved the offending (release build) node as ~iojs/node-10-segfault, uid-gid.js core dump is /var/crash/core.node.78 and euid-egid.js core dump is /var/crash/core.node.86.
@rvagg
Member Author

rvagg commented Oct 15, 2019

welp, V8 doesn't support DEBUG against musl apparently, so the above is the best we're going to get.

../deps/v8/src/external-reference-table.cc:14:10: fatal error: execinfo.h: No such file or directory
 #include <execinfo.h>
          ^~~~~~~~~~~~
compilation terminated.

@nodejs/releasers I'd say for now that these two failures on 10.x on alpine-last-latest-x64 are acceptable and should not hold up a release.

@bnoordhuis
Member

I can take a look if you want. Do I have access? What's the address to ssh to?

@jbergstroem
Member

#include <execinfo.h>

This lives in libexecinfo-dev in alpine.

@rvagg
Member Author

rvagg commented Oct 16, 2019

libexecinfo-dev was 👍 but it fails later

/home/iojs/build/workspace/node-test-commit-linux/nodes/alpine-last-latest-x64/out/../deps/v8/src/external-reference-table.cc:45: undefined reference to `backtrace_symbols'

installing libunwind doesn't help either.

@bnoordhuis you should have access, it's test-digitalocean-ubuntu1804_docker-x64-2 (root@159.89.183.200 - but if you haven't run ansible-playbook playbooks/write-ssh-config.yml in build/ansible then maybe you should so you get the ssh config all set up).

I don't know how much help you need with Docker & Alpine but here's a quick rundown to get you in and working:

  • docker ps will show you the running containers, the one I've taken offline for debugging this is node-ci:test-digitalocean-alpine39_container-x64-2 which is tagged as node-ci-test-digitalocean-alpine39_container-x64-2
  • docker exec -ti node-ci-test-digitalocean-alpine39_container-x64-2 bash will get you into the container as user iojs. You can use docker exec -ti -u root to get in as root instead, and su iojs from there as needed.
  • /home/iojs/build/workspace/node-test-commit-linux/nodes/alpine-last-latest-x64 inside the container is where there's a v10.x checked out, you can mess with that. /home/iojs/ is mounted from /home/iojs/test-digitalocean-alpine39_container-x64-2/ on the host if that helps.
  • If you need to install packages, apk add (I've already added libexecinfo-dev as @jbergstroem suggested). I don't mind what you do to this container, I'll reset it all when you're done. Add editors or whatever else you like.
  • /lib/systemd/system/jenkins-test-digitalocean-alpine39_container-x64-2.service on the host has the startup for this container, you can see my custom ulimit in there as well as the mount of /var/crash. You're welcome to modify this and restart as you need, I'll assume you're comfortable with systemd. I don't mind what you do here, I'll reset it when you're done.

@sam-github
Contributor

Maybe obvious, or too difficult to arrange, but I suspect a -lexecinfo is needed.

@rvagg
Member Author

rvagg commented Oct 21, 2019

I've had to rerun ansible on this machine to update the sharedlibs containers, so the systemd config is reset and Dockerfile was reset along with its image too. Feel free to edit them again as needed, that alpine39 container is still marked as offline.

@rvagg
Member Author

rvagg commented Oct 23, 2019

Same failure on both Alpine 3.9 and Alpine 3.10 for 10.x here: https://ci.nodejs.org/job/node-test-commit-linux/30365/

Perhaps there's something particular about the base machines that's causing it to be flaky. It might explain why this didn't show up when I ran the initial tests after updating the container hosts.
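
To put a number on "flaky", something like the following could be run on each host. This is only a sketch: `tallyRuns` is a hypothetical helper, and the `-e` child is a stand-in so the snippet runs anywhere. In a v10.x checkout you would pass the test file instead (assumed to live at test/parallel/test-process-uid-gid.js).

```javascript
'use strict';
const { spawnSync } = require('child_process');

// Hypothetical helper: run `node <args>` `runs` times with the current
// node binary and count how each child process ended.
function tallyRuns(args, runs) {
  const tally = {};
  for (let i = 0; i < runs; i++) {
    const { status, signal } = spawnSync(process.execPath, args, {
      stdio: 'ignore',
    });
    // A child killed by a signal (e.g. SIGSEGV) reports a signal name
    // and a null exit status; otherwise record the exit code.
    const key = signal !== null ? `signal:${signal}` : `exit:${status}`;
    tally[key] = (tally[key] || 0) + 1;
  }
  return tally;
}

// Stand-in child that always exits cleanly; swap in the test path to
// measure the real failure rate.
const tally = tallyRuns(['-e', 'process.exit(0)'], 5);
console.log(tally);
```

Comparing the tallies across the different base machines would show whether the segfault rate really depends on the host.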

@rvagg
Member Author

rvagg commented Oct 23, 2019

@targos
Member

targos commented Dec 26, 2020

@rvagg should this stay open?

@targos targos added build Issues and PRs related to build files or the CI. v10.x process Issues and PRs related to the process subsystem. labels Dec 26, 2020
@rvagg
Member Author

rvagg commented Dec 31, 2020

🤷

@rvagg rvagg closed this as completed Dec 31, 2020