Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Evaluation error in flake regression test suite #11141

Closed
edolstra opened this issue Jul 19, 2024 · 7 comments · Fixed by #11150 or #11152
Closed

Evaluation error in flake regression test suite #11141

edolstra opened this issue Jul 19, 2024 · 7 comments · Fixed by #11150 or #11152
Labels
bug regression Something doesn't work anymore release-blocker

Comments

@edolstra
Copy link
Member

Describe the bug

The following test has recently started failing:

$ ./flake-regressions/eval-flake.sh  flake-regressions/tests/Anomalocaridid/toolbox/0.1.269+rev-c41c2725b8099d3a37b0beab6afdc99470dfaefa
Evaluating https://api.flakehub.com/f/pinned/Anomalocaridid/toolbox/0.1.269%2Brev-c41c2725b8099d3a37b0beab6afdc99470dfaefa/018d8efd-c3f8-7d1d-adb3-b0101178042b/source.tar.gz...
Flake https://api.flakehub.com/f/pinned/Anomalocaridid/toolbox/0.1.269%2Brev-c41c2725b8099d3a37b0beab6afdc99470dfaefa/018d8efd-c3f8-7d1d-adb3-b0101178042b/source.tar.gz failed to evaluate:
warning: creating lock file '/home/eelco/Dev/nix/flake-regressions/tests/Anomalocaridid/toolbox/0.1.269+rev-c41c2725b8099d3a37b0beab6afdc99470dfaefa/tmp-flake/flake.lock': 
• Added input 'flake':
    'https://api.flakehub.com/f/pinned/Anomalocaridid/toolbox/0.1.269%2Brev-c41c2725b8099d3a37b0beab6afdc99470dfaefa/018d8efd-c3f8-7d1d-adb3-b0101178042b/source.tar.gz?narHash=sha256-915Vs2YX3p3SgUzC7EOeyvpYoiUFniNv01Uj0zJnDos%3D' (2024-02-09)
• Added input 'flake/nixpkgs':
    'github:NixOS/nixpkgs/7c9cc5a6e5d38010801741ac830a3f8fd667a7a0?narHash=sha256-SaTWPkI8a5xSHX/rrKzUe%2B/uVNy6zCGMXgoeMb7T9rg%3D' (2023-10-19)
error:
       … while evaluating attribute 'inventory'
         at /nix/store/badnsj7h4f7n1b2f8fnillpxqsf96ac9-source/flake.nix:113:25:
          112|                 inherit docs;
          113|                 inherit inventory;
             |                         ^
          114|               };

       … while evaluating attribute 'checks'

       … while evaluating attribute 'children'
         at /nix/store/badnsj7h4f7n1b2f8fnillpxqsf96ac9-source/flake.nix:20:48:
           19|
           20|               mkChildren = children: { inherit children; };
             |                                                ^
           21|

       … while evaluating attribute 'x86_64-linux'

       … while evaluating attribute 'children'
         at /nix/store/badnsj7h4f7n1b2f8fnillpxqsf96ac9-source/flake.nix:20:48:
           19|
           20|               mkChildren = children: { inherit children; };
             |                                                ^
           21|

       … while evaluating attribute 'whiskers'

       … while evaluating attribute 'derivation'
         at /nix/store/badnsj7h4f7n1b2f8fnillpxqsf96ac9-source/flake.nix:77:35:
           76|                                 {
           77|                                   derivation =
             |                                   ^
           78|                                     if attrs ? derivation

       … while calling the 'unsafeDiscardStringContext' builtin
         at /nix/store/badnsj7h4f7n1b2f8fnillpxqsf96ac9-source/flake.nix:79:42:
           78|                                     if attrs ? derivation
           79|                                     then builtins.unsafeDiscardStringContext attrs.derivation.drvPath
             |                                          ^
           80|                                     else null;

       … while evaluating the attribute 'derivation.drvPath'
         at /nix/store/4jfc6vrkmq7z5pb651jh5b4kra5f1kwp-source/lib/customisation.nix:223:7:
          222|     in commonAttrs // {
          223|       drvPath = assert condition; drv.drvPath;
             |       ^
          224|       outPath = assert condition; drv.outPath;

       … while calling the 'derivationStrict' builtin
         at <nix/derivation-internal.nix>:9:12:
            8|
            9|   strict = derivationStrict drvAttrs;
             |            ^
           10|

       … while evaluating the derivation attribute 'name'
         at /nix/store/4jfc6vrkmq7z5pb651jh5b4kra5f1kwp-source/pkgs/stdenv/generic/make-derivation.nix:300:7:
          299|     // (lib.optionalAttrs (attrs ? name || (attrs ? pname && attrs ? version)) {
          300|       name =
             |       ^
          301|         let

       … while evaluating the `name` attribute passed to builtins.derivationStrict

       … from call site
         at /nix/store/4jfc6vrkmq7z5pb651jh5b4kra5f1kwp-source/pkgs/stdenv/generic/make-derivation.nix:318:9:
          317|         in
          318|         lib.strings.sanitizeDerivationName (
             |         ^
          319|           if attrs ? name

       … while calling anonymous lambda
         at /nix/store/4jfc6vrkmq7z5pb651jh5b4kra5f1kwp-source/lib/strings.nix:1157:3:
         1156|   in
         1157|   string:
             |   ^
         1158|   # First detect the common case of already valid strings, to speed those up

       … while evaluating a branch condition
         at /nix/store/4jfc6vrkmq7z5pb651jh5b4kra5f1kwp-source/lib/strings.nix:1159:3:
         1158|   # First detect the common case of already valid strings, to speed those up
         1159|   if stringLength string <= 207 && okRegex string != null
             |   ^
         1160|   then unsafeDiscardStringContext string

       … in the left operand of the AND (&&) operator
         at /nix/store/4jfc6vrkmq7z5pb651jh5b4kra5f1kwp-source/lib/strings.nix:1159:33:
         1158|   # First detect the common case of already valid strings, to speed those up
         1159|   if stringLength string <= 207 && okRegex string != null
             |                                 ^
         1160|   then unsafeDiscardStringContext string

       … in the argument of the not operator
         at /nix/store/4jfc6vrkmq7z5pb651jh5b4kra5f1kwp-source/lib/strings.nix:1159:26:
         1158|   # First detect the common case of already valid strings, to speed those up
         1159|   if stringLength string <= 207 && okRegex string != null
             |                          ^
         1160|   then unsafeDiscardStringContext string

       … while calling the 'lessThan' builtin
         at /nix/store/4jfc6vrkmq7z5pb651jh5b4kra5f1kwp-source/lib/strings.nix:1159:26:
         1158|   # First detect the common case of already valid strings, to speed those up
         1159|   if stringLength string <= 207 && okRegex string != null
             |                          ^
         1160|   then unsafeDiscardStringContext string

       … while calling the 'stringLength' builtin
         at /nix/store/4jfc6vrkmq7z5pb651jh5b4kra5f1kwp-source/lib/strings.nix:1159:6:
         1158|   # First detect the common case of already valid strings, to speed those up
         1159|   if stringLength string <= 207 && okRegex string != null
             |      ^
         1160|   then unsafeDiscardStringContext string

       … in the condition of the assert statement
         at /nix/store/4jfc6vrkmq7z5pb651jh5b4kra5f1kwp-source/pkgs/stdenv/generic/make-derivation.nix:323:13:
          322|             # we cannot coerce null to a string below
          323|             assert lib.assertMsg (attrs ? version && attrs.version != null) "The ‘version’ attribute cannot be null.";
             |             ^
          324|             "${attrs.pname}${staticMarker}${hostSuffix}-${attrs.version}"

       … from call site
         at /nix/store/4jfc6vrkmq7z5pb651jh5b4kra5f1kwp-source/pkgs/stdenv/generic/make-derivation.nix:323:20:
          322|             # we cannot coerce null to a string below
          323|             assert lib.assertMsg (attrs ? version && attrs.version != null) "The ‘version’ attribute cannot be null.";
             |                    ^
          324|             "${attrs.pname}${staticMarker}${hostSuffix}-${attrs.version}"

       … while calling 'assertMsg'
         at /nix/store/4jfc6vrkmq7z5pb651jh5b4kra5f1kwp-source/lib/asserts.nix:23:5:
           22|     # Message to throw in case `pred` fails
           23|     msg:
             |     ^
           24|     pred || builtins.throw msg;

       … in the left operand of the OR (||) operator
         at /nix/store/4jfc6vrkmq7z5pb651jh5b4kra5f1kwp-source/lib/asserts.nix:24:10:
           23|     msg:
           24|     pred || builtins.throw msg;
             |          ^
           25|

       … in the right operand of the AND (&&) operator
         at /nix/store/4jfc6vrkmq7z5pb651jh5b4kra5f1kwp-source/pkgs/stdenv/generic/make-derivation.nix:323:51:
          322|             # we cannot coerce null to a string below
          323|             assert lib.assertMsg (attrs ? version && attrs.version != null) "The ‘version’ attribute cannot be null.";
             |                                                   ^
          324|             "${attrs.pname}${staticMarker}${hostSuffix}-${attrs.version}"

       … from call site
         at /nix/store/4jfc6vrkmq7z5pb651jh5b4kra5f1kwp-source/pkgs/stdenv/generic/make-derivation.nix:323:54:
          322|             # we cannot coerce null to a string below
          323|             assert lib.assertMsg (attrs ? version && attrs.version != null) "The ‘version’ attribute cannot be null.";
             |                                                      ^
          324|             "${attrs.pname}${staticMarker}${hostSuffix}-${attrs.version}"

       … while calling anonymous lambda
         at /nix/store/4jfc6vrkmq7z5pb651jh5b4kra5f1kwp-source/lib/attrsets.nix:846:24:
          845|     let f = attrPath:
          846|       zipAttrsWith (n: values:
             |                        ^
          847|         let here = attrPath ++ [n]; in

       … while calling the 'head' builtin
         at /nix/store/4jfc6vrkmq7z5pb651jh5b4kra5f1kwp-source/lib/attrsets.nix:850:11:
          849|         || pred here (elemAt values 1) (head values) then
          850|           head values
             |           ^
          851|         else

       error: attribute 'package' missing
       at /nix/store/lnrdwxk9d0ffv78w87swjy497js2i13s-source/nix/default.nix:11:18:
           10|         inherit pname;
           11|         inherit (memberCargoToml.package) version;
             |                  ^
           12|         src = pkgs.nix-gitignore.gitignoreSource [] ../.;

git bisect pointed to b230c01 (#11014), however since that didn't change anything in the evaluator, and mysterious "attribute missing" errors are often GC-related, I ran the test script with GC_INITIAL_HEAP_SIZE=10G and that made it succeed. So we have a GC bug somewhere.

Steps To Reproduce

  1. Go to '...'
  2. Click on '....'
  3. Scroll down to '....'
  4. See error

Expected behavior

A clear and concise description of what you expected to happen.

nix-env --version output

Additional context

Add any other context about the problem here.

Priorities

Add 👍 to issues you find important.

@edolstra edolstra added bug release-blocker regression Something doesn't work anymore labels Jul 19, 2024
@tomberek
Copy link
Contributor

One of these commits causes the GC to do a use-after-free.
2edcdf8
2477e4e
8df206b
b311f51

from the merge commit: https://github.com/NixOS/nix/pull/10835/commits

Ran into a segfault on a particular dirty NixOS evaluation (ask @djacu for access). It had random errors that seemed to indicate a bad GC, or use-after-free. Either segfaults or errors related to values being of the wrong type. This makes it seem that GC was freeing something it should not, then re-using the memory, causing values to be corrupted. Running it with GC_DONT_GC would result in proper evals. Tested and bisected with @djacu and @JeremiahSecrist

@roberth : this commit (master...djacu:nix:djacu/fix-gc-issue) fixes the error, but we're not quite sure that is the intended outcome of removing the patch.

image

@roberth
Copy link
Member

roberth commented Jul 22, 2024

The sp_corrector callback is supposed to replace the old stack correcting patch, so that the GC would work properly when the stack pointer points into a coroutine stack instead of an OS-provided thread stack.
I presume this only happens on Linux? On Darwin, we currently still disable all this logic, and instead disable GC collection whenever/while a coroutine exists (which is bad for memory use when you should be collecting during source filtering).

Possible workarounds:

  • disable GC collection during coroutines like on mac
  • revert to bwdgc 8.2

@roberth
Copy link
Member

roberth commented Jul 22, 2024

I've tried to reproduce by creating and checking garbage while filtering sources, and while my test does trigger garbage collections during source filtering, it may not depend on the main stack getting marked.

Reproducer work in progress

# GC_INITIAL_HEAP_SIZE=$[1024 * 1024] NIX_SHOW_STATS=1 nix eval -f gc-coroutine-test.nix -vvvv

gc-coroutine-test.nix

let
  inherit (builtins)
    foldl'
    isList
    ;

  # Generate a tree of numbers, n deep, such that the numbers add up to (1 + salt) * 10^n.
  # The salting makes the numbers all different, increasing the likelihood of catching
  # any memory corruptions that might be caused by the GC or otherwise.
  garbage = salt: n:
    if n == 0
    then [(1 + salt)]
    else [
      (garbage (10 * salt + 1) (n - 1))
      (garbage (10 * salt - 1) (n - 1))
      (garbage (10 * salt + 2) (n - 1))
      (garbage (10 * salt - 2) (n - 1))
      (garbage (10 * salt + 3) (n - 1))
      (garbage (10 * salt - 3) (n - 1))
      (garbage (10 * salt + 4) (n - 1))
      (garbage (10 * salt - 4) (n - 1))
      (garbage (10 * salt + 5) (n - 1))
      (garbage (10 * salt - 5) (n - 1))
    ];

  pow = base: n:
    if n == 0
    then 1
    else base * (pow base (n - 1));

  sumNestedLists = l:
    if isList l
    then foldl' (a: b: a + sumNestedLists b) 0 l
    else l;

in
  assert sumNestedLists (garbage 0 3) == pow 10 3;
  assert sumNestedLists (garbage 0 6) == pow 10 6;
  builtins.path {
    path = ./src;
    filter = path: type:
      # We're not doing common subexpression elimination, so this reallocates
      # the fairly big tree over and over, producing a lot of garbage during
      # source filtering, whose filter runs in a coroutine.
      assert sumNestedLists (garbage 0 4) == pow 10 4;
      true;
  }

@roberth
Copy link
Member

roberth commented Jul 22, 2024

Reproducer:

# Run:
# GC_INITIAL_HEAP_SIZE=$[1024 * 1024] NIX_SHOW_STATS=1 nix eval -f gc-coroutine-test.nix  -vvvv

let
  inherit (builtins)
    foldl'
    isList
    ;

  # Generate a tree of numbers, n deep, such that the numbers add up to (1 + salt) * 10^n.
  # The salting makes the numbers all different, increasing the likelihood of catching
  # any memory corruptions that might be caused by the GC or otherwise.
  garbage = salt: n:
    if n == 0
    then [(1 + salt)]
    else [
      (garbage (10 * salt + 1) (n - 1))
      (garbage (10 * salt - 1) (n - 1))
      (garbage (10 * salt + 2) (n - 1))
      (garbage (10 * salt - 2) (n - 1))
      (garbage (10 * salt + 3) (n - 1))
      (garbage (10 * salt - 3) (n - 1))
      (garbage (10 * salt + 4) (n - 1))
      (garbage (10 * salt - 4) (n - 1))
      (garbage (10 * salt + 5) (n - 1))
      (garbage (10 * salt - 5) (n - 1))
    ];

  pow = base: n:
    if n == 0
    then 1
    else base * (pow base (n - 1));

  sumNestedLists = l:
    if isList l
    then foldl' (a: b: a + sumNestedLists b) 0 l
    else l;

in
  assert sumNestedLists (garbage 0 3) == pow 10 3;
  assert sumNestedLists (garbage 0 6) == pow 10 6;
  builtins.foldl'
    (a: b:
      assert
        "${
          builtins.path {
            path = ./src;
            filter = path: type:
              # We're not doing common subexpression elimination, so this reallocates
              # the fairly big tree over and over, producing a lot of garbage during
              # source filtering, whose filter runs in a coroutine.
              assert sumNestedLists (garbage 0 3) == pow 10 3;
              true;
          }
        }"
          == "${./src}";

      # These asserts don't seem necessary, as the lambda value get corrupted first
      assert a.okay;
      assert b.okay;
      { okay = true; }
    )
    { okay = true; }
    [ { okay = true; } { okay = true; } { okay = true; } ]

roberth added a commit to hercules-ci/nix that referenced this issue Jul 22, 2024
edolstra added a commit to DeterminateSystems/nix-src that referenced this issue Jul 22, 2024
This hopefully avoids the need for all our Boehm GC coroutine
workarounds, since the GC roots will be on the main stack of the
thread.

Fixes NixOS#11141.
@djacu
Copy link
Member

djacu commented Jul 22, 2024

The sp_corrector callback is supposed to replace the old stack correcting patch, so that the GC would work properly when the stack pointer points into a coroutine stack instead of an OS-provided thread stack.
I presume this only happens on Linux? On Darwin, we currently still disable all this logic, and instead disable GC collection whenever/while a coroutine exists (which is bad for memory use when you should be collecting during source filtering).

Possible workarounds:

  • disable GC collection during coroutines like on mac
  • revert to bwdgc 8.2

Not certain about Darwin. I only tested this on my Linux mint machine. @tomberek and Jeremiah also tested it on their machines but I'm not sure what OS they were using at the time

@JeremiahSecrist
Copy link

Not certain about Darwin. I only tested this on my Linux mint machine. @tomberek and Jeremiah also tested it on their machines but I'm not sure what OS they were using at the time

I was using nixos at the time.

edolstra added a commit that referenced this issue Jul 22, 2024
…ector

Fix issue #11141 broken stack pointer corrector
@nixos-discourse
Copy link

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/2024-07-22-nix-team-meeting-minutes-163/49544/1

edolstra added a commit to DeterminateSystems/nix-src that referenced this issue Jul 24, 2024
This hopefully avoids the need for all our Boehm GC coroutine
workarounds, since the GC roots will be on the main stack of the
thread.

Fixes NixOS#11141.
tomberek pushed a commit to tomberek/nix that referenced this issue Aug 17, 2024
This hopefully avoids the need for all our Boehm GC coroutine
workarounds, since the GC roots will be on the main stack of the
thread.

Fixes NixOS#11141.
domenkozar pushed a commit to domenkozar/nix that referenced this issue Sep 10, 2024
This hopefully avoids the need for all our Boehm GC coroutine
workarounds, since the GC roots will be on the main stack of the
thread.

Fixes NixOS#11141.
domenkozar pushed a commit to domenkozar/nix that referenced this issue Sep 10, 2024
This hopefully avoids the need for all our Boehm GC coroutine
workarounds, since the GC roots will be on the main stack of the
thread.

Fixes NixOS#11141.

(cherry picked from commit 048271e)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug regression Something doesn't work anymore release-blocker
Projects
None yet
6 participants