[ocaml5-issue] Crashes and hangs on ppc64 trunk/5.2 #380

Closed
jmid opened this issue Aug 14, 2023 · 9 comments
Labels
ocaml5-issue A potential issue in the OCaml5 compiler/runtime

Comments

@jmid
Collaborator

jmid commented Aug 14, 2023

We're seeing both crashes and hangs on the native ppc64 backend running sequential STM tests(!) through multicoretests-ci.

This run https://ocaml-multicoretests.ci.dev:8100/job/2023-08-14/091418-ci-ocluster-build-25f6ae hung to the point of timing out on a sequential STM array test:

random seed: 123267297
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s STM Array test sequential
[ ]    0    0    0    0 / 1000     0.0s STM Array test sequential (generating)2023-08-14 14:14.18: Cancelling: Timeout (300.0 minutes)
Job cancelled

On another run https://ocaml-multicoretests.ci.dev:8100/job/2023-08-14/113100-ci-ocluster-build-a0a174 we triggered two segfault crashes:

random seed: 440184295
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s STM Array test sequential
[ ]    0    0    0    0 / 1000     0.0s STM Array test sequential (generating)File "src/array/dune", line 4, characters 7-16:
4 |  (name stm_tests)
           ^^^^^^^^^
(cd _build/default/src/array && ./stm_tests.exe --verbose)
Command got signal SEGV.

[...]

random seed: 301756713
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s STM Bytes test sequential
[ ]    0    0    0    0 / 1000     0.0s STM Bytes test sequential (generating)File "src/bytes/dune", line 4, characters 7-16:
4 |  (name stm_tests)
           ^^^^^^^^^
(cd _build/default/src/bytes && ./stm_tests.exe --verbose)
Command got signal SEGV.

In both cases the tested version was reported as 5.2.0+dev0-2023-04-11

jmid added the ocaml5-issue label Aug 14, 2023
@jmid
Collaborator Author

jmid commented Aug 15, 2023

I also just spotted this in the CI run for #379 where it crashed on sequential STM Bytes and Float Array tests:
https://ocaml-multicoretests.ci.dev:8100/job/2023-07-28/174223-ci-ocluster-build-1474f8

random seed: 289298402
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s STM Bytes test sequential
[ ]    0    0    0    0 / 1000     0.0s STM Bytes test sequential (generating)File "src/bytes/dune", line 4, characters 7-16:
4 |  (name stm_tests)
           ^^^^^^^^^
(cd _build/default/src/bytes && ./stm_tests.exe --verbose)
Command got signal SEGV.

[...]

random seed: 44278272
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s STM Float Array test sequential
[ ]    0    0    0    0 / 1000     0.0s STM Float Array test sequential (generating)File "src/floatarray/dune", line 4, characters 7-16:
4 |  (name stm_tests)
           ^^^^^^^^^
(cd _build/default/src/floatarray && ./stm_tests.exe --verbose)
Command got signal SEGV.

@jmid
Collaborator Author

jmid commented Aug 16, 2023

2 more crashes of sequential STM Array and Bytes tests after merging #375 to main:
https://ocaml-multicoretests.ci.dev:8100/job/2023-08-15/064105-ci-ocluster-build-4b9f79

random seed: 374624147
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s STM Array test sequential
[ ]    0    0    0    0 / 1000     0.0s STM Array test sequential (generating)File "src/array/dune", line 4, characters 7-16:
4 |  (name stm_tests)
           ^^^^^^^^^
(cd _build/default/src/array && ./stm_tests.exe --verbose)
Command got signal SEGV.

[...]

random seed: 457483824
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s STM Bytes test sequential
[ ]    0    0    0    0 / 1000     0.0s STM Bytes test sequential (generating)File "src/bytes/dune", line 4, characters 7-16:
4 |  (name stm_tests)
           ^^^^^^^^^
(cd _build/default/src/bytes && ./stm_tests.exe --verbose)
Command got signal SEGV.

@jmid
Collaborator Author

jmid commented Aug 18, 2023

Observed a triple crash in sequential STM tests of Array, Bytes, and Float.Array after merging #381 to main
https://ocaml-multicoretests.ci.dev:8100/job/2023-08-17/073031-ci-ocluster-build-744e2c

random seed: 282178331
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s STM Array test sequential
[ ]    0    0    0    0 / 1000     0.0s STM Array test sequential (generating)File "src/array/dune", line 4, characters 7-16:
4 |  (name stm_tests)
           ^^^^^^^^^
(cd _build/default/src/array && ./stm_tests.exe --verbose)
Command got signal SEGV.

[...]

random seed: 211356154
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s STM Bytes test sequential
[ ]    0    0    0    0 / 1000     0.0s STM Bytes test sequential (generating)File "src/bytes/dune", line 4, characters 7-16:
4 |  (name stm_tests)
           ^^^^^^^^^
(cd _build/default/src/bytes && ./stm_tests.exe --verbose)
Command got signal SEGV.

[...]

random seed: 449604618
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s STM Float Array test sequential
[ ]    0    0    0    0 / 1000     0.0s STM Float Array test sequential (generating)File "src/floatarray/dune", line 4, characters 7-16:
4 |  (name stm_tests)
           ^^^^^^^^^
(cd _build/default/src/floatarray && ./stm_tests.exe --verbose)
Command got signal SEGV.

@jmid
Collaborator Author

jmid commented Sep 7, 2023

Triggered again on merge of #390 to main on ppc64le trunk/5.2:
https://ocaml-multicoretests.ci.dev:8100/job/2023-09-06/062220-ci-ocluster-build-5e2e68

random seed: 416437602
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s STM Array test sequential
[ ]    0    0    0    0 / 1000     0.0s STM Array test sequential (generating)File "src/array/dune", line 4, characters 7-16:
4 |  (name stm_tests)
           ^^^^^^^^^
(cd _build/default/src/array && ./stm_tests.exe --verbose)
Command got signal SEGV.

[...]

random seed: 400694331
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s STM Bytes test sequential
[ ]    0    0    0    0 / 1000     0.0s STM Bytes test sequential (generating)File "src/bytes/dune", line 4, characters 7-16:
4 |  (name stm_tests)
           ^^^^^^^^^
(cd _build/default/src/bytes && ./stm_tests.exe --verbose)
Command got signal SEGV.

[...]

random seed: 278261962
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s STM Float Array test sequential
[ ]    0    0    0    0 / 1000     0.0s STM Float Array test sequential (generating)File "src/floatarray/dune", line 4, characters 7-16:
4 |  (name stm_tests)
           ^^^^^^^^^
(cd _build/default/src/floatarray && ./stm_tests.exe --verbose)
Command got signal SEGV.

@dustanddreams

Should hopefully be fixed with this OCaml PR.

@jmid
Collaborator Author

jmid commented Oct 12, 2023

We saw segfaults triggered again with trunk/5.2 on the merge of the 0.3 branch to main. AFAIU these are due to running the tests on a 3-week-old Docker image:
https://ocaml-multicoretests.ci.dev:8100/job/2023-10-11/132452-ci-ocluster-build-ed923b

random seed: 522156132
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s STM Array test sequential
[ ]    0    0    0    0 / 1000     0.0s STM Array test sequential (generating)File "src/array/dune", line 4, characters 7-16:
4 |  (name stm_tests)
           ^^^^^^^^^
(cd _build/default/src/array && ./stm_tests.exe --verbose)
Command got signal SEGV.

[...]

random seed: 138990997
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s STM Bytes test sequential
[ ]    0    0    0    0 / 1000     0.0s STM Bytes test sequential (generating)File "src/bytes/dune", line 4, characters 7-16:
4 |  (name stm_tests)
           ^^^^^^^^^
(cd _build/default/src/bytes && ./stm_tests.exe --verbose)
Command got signal SEGV.

[...]

random seed: 417927763
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s STM Float Array test sequential
[ ]    0    0    0    0 / 1000     0.0s STM Float Array test sequential (generating)File "src/floatarray/dune", line 4, characters 7-16:
4 |  (name stm_tests)
           ^^^^^^^^^
(cd _build/default/src/floatarray && ./stm_tests.exe --verbose)
Command got signal SEGV.

@dustanddreams

We saw segfaults triggered again with trunk/5.2 on the merge of the 0.3 branch to main - these are due to running tests on a 3-week-old Docker image AFAIU https://ocaml-multicoretests.ci.dev:8100/job/2023-10-11/132452-ci-ocluster-build-ed923b

Unfortunately there doesn't seem to be an easy way to know what git revision had been used in this build.

The power backend fix got merged on September 25th, which is only two weeks and a day ago, so it is likely missing from that image.

@jmid
Collaborator Author

jmid commented Oct 12, 2023

That makes sense then 👍

There's a copy of the opam-repo included in the Docker image AFAIU.
For the above image, that gives us a rough indicator of when the image was created and hence how old the 5.2-trunk compiler built from https://github.com/ocaml/ocaml/archive/trunk.tar.gz is:

/src: (run (cache (opam-archives (target /home/opam/.opam/download-cache)))
           (network host)
           (shell "cd ~/opam-repository && (git cat-file -e 52faa0368bad47b21b52c2289d1d97c6e2bf429b || git fetch origin master) && git reset -q --hard 52faa0368bad47b21b52c2289d1d97c6e2bf429b && git log --no-decorate -n1 --oneline && opam update -u"))
52faa0368b Merge pull request #24427 from mtelvers/freebsd-emacs

This PR was merged 3 weeks ago: ocaml/opam-repository#24427 😃

I've notified the CI folks. There's an effort to set up a callback from https://github.com/ocaml/opam-repository to multicoretests-ci to influence its caching decision AFAIU.
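
For anyone wanting to repeat this check on another job log, here is a rough sketch of the reasoning above. It pulls the pinned opam-repository commit out of the logged shell line and compares a commit date against the merge date of the power backend fix. The helper names are made up for illustration, and the commit date has to be looked up separately (for example with `git show -s --format=%cI <hash>` in an opam-repository checkout); it is not something the log itself provides.

```python
import re
from datetime import date
from typing import Optional

# Relevant part of the shell line from the CI log quoted above.
LOG_LINE = (
    "git reset -q --hard 52faa0368bad47b21b52c2289d1d97c6e2bf429b"
    " && git log --no-decorate -n1 --oneline && opam update -u"
)

# Merge date of the power backend fix, as stated earlier in this thread.
FIX_MERGED = date(2023, 9, 25)


def pinned_commit(log_line: str) -> Optional[str]:
    """Return the 40-hex-digit opam-repository commit the image was reset to."""
    match = re.search(r"git reset -q --hard ([0-9a-f]{40})\b", log_line)
    return match.group(1) if match else None


def image_predates_fix(repo_commit_date: date, fix_merged: date = FIX_MERGED) -> bool:
    """True if the opam-repository snapshot is older than the fix.

    Assumption: the snapshot date is only a rough proxy for when the
    trunk compiler in the image was built.
    """
    return repo_commit_date < fix_merged


if __name__ == "__main__":
    print(pinned_commit(LOG_LINE))
```

If the looked-up commit date is earlier than September 25th, the image most likely still contains a trunk compiler without the fix.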

@jmid
Collaborator Author

jmid commented Nov 1, 2023

Closing this as 3 more weeks have gone by - and the latest run no longer triggered it, so I suspect the CI caches have been refreshed.

jmid closed this as completed Nov 1, 2023