
ci: investigate linux-race failures in GitHub actions #2159

Closed · nicktrav opened this issue Dec 1, 2022 · 5 comments · Fixed by #2236

nicktrav commented Dec 1, 2022

Currently, the linux-race job is consistently failing in GitHub Actions (with exit code 143, i.e. SIGTERM).

For example, a CI run with verbose logging enabled fails with the following:

2022-12-01T17:39:21.3663768Z make: *** [Makefile:22: test] Terminated
2022-12-01T17:39:21.5137635Z ##[error]Process completed with exit code 143.
2022-12-01T17:39:21.5192954Z ##[error]The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.
2022-12-01T17:39:21.7086950Z Cleaning up orphan processes

Initial indications point at 7d9a5b2, though the same failure mode has been observed on branches without this commit (see here).

Pure speculation: we're crossing some kind of GitHub Actions resource limit when running race, which results in the platform sending a SIGTERM to the make process and failing the test.

See this internal thread for more context.

nicktrav added a commit to nicktrav/pebble that referenced this issue Dec 1, 2022
The `linux-race` job is currently consistently failing in CI (both pre-
and post-merge). This is being tracked in cockroachdb#2159.

Temporarily skip the job while the cause is investigated.

Touches cockroachdb#2159.
nicktrav added a commit that referenced this issue Dec 1, 2022
The `linux-race` job is currently consistently failing in CI (both pre-
and post-merge). This is being tracked in #2159.

Temporarily skip the job while the cause is investigated.

Touches #2159.

nicktrav commented Dec 1, 2022

I did some searching, and it looks like other projects run into the same issue with jobs that consume a lot of resources.

I'm going to open a support case with GitHub and see if they can poke around and check whether the VMs running our jobs are being killed for the same reason.

nicktrav commented Dec 2, 2022

The memory usage of the sstable package tests is markedly higher after 7d9a5b2.

Running:

$ go test -tags 'invariants' -race -timeout 20m -count 1 -run . ./sstable

Current:

$ while $(pgrep sstable > /dev/null); do pmap -x $(pgrep sstable) | grep total; sleep 1; done
total kB         1561548   48580   47556
total kB         12747176   57172   56084
total kB         34971432 5513232 5512144
total kB         46083048 10562376 10561288
total kB         46092112 7531108 7530020
total kB         46104464 7953620 7942876
total kB         1899826908 7649968 7639160
total kB         1899826404 7836268 7825396
total kB         1899828180 8066388 8055516
...

Before commit:

$ while $(pgrep sstable > /dev/null); do pmap -x $(pgrep sstable) | grep total; sleep 1; done
total kB         1637896   78916   77808
total kB         2090584  218160  217052
total kB         2536000  236340  235232
total kB         1881287484  154604  153432
total kB         1881302720  221436  210664
total kB         1881287740  196204  185432
total kB         1881287740  157572  146800
...
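For anyone else reproducing this: the pmap loop above samples the whole process, including the race detector's shadow memory. An in-process sampler along these lines could complement it by tracking just the Go heap (a minimal sketch, assuming it lives in a _test.go file in the package under test; it is not part of pebble):

package sstable_test

import (
	"runtime"
	"testing"
	"time"
)

// logHeapUsage starts a goroutine that logs Go heap usage at the given
// interval until the calling test finishes. It only sees the Go heap, not
// race-detector shadow memory, so it complements pmap rather than replacing it.
func logHeapUsage(t *testing.T, interval time.Duration) {
	t.Helper()
	done := make(chan struct{})
	stopped := make(chan struct{})
	t.Cleanup(func() {
		close(done)
		<-stopped // don't log after the test has completed
	})
	go func() {
		defer close(stopped)
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		var ms runtime.MemStats
		for {
			select {
			case <-done:
				return
			case <-ticker.C:
				runtime.ReadMemStats(&ms)
				t.Logf("heap alloc=%d MiB, sys=%d MiB", ms.HeapAlloc>>20, ms.Sys>>20)
			}
		}
	}()
}

Calling logHeapUsage(t, time.Second) at the top of a test and running with -v would print the heap curve alongside the test output.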

nicktrav commented Dec 2, 2022

Narrowing in on the issue, it looks like TestReader provides a cleaner reproducer.

On master:

$ go test -tags '' -race -timeout 20m -count 1 -run TestReader$ -v ./sstable | grep -E '^--- PASS'
--- PASS: TestReader (4.62s)
$ while $(pgrep sstable > /dev/null); do pmap -x $(pgrep sstable) | grep total; sleep 0.5; done
total kB         1705820   47684   46448
total kB         1853284   48480   37588
total kB         13039160 2255992 2245100
total kB         35262392 3500896 3490004
total kB         35262392 8288664 8277772
total kB         35262392 6127036 6116144
total kB         57485624 3711220 3700264
total kB         57485624 9722268 9711312

Running the same test without the new table format (TableFormatPebblev3), memory usage is much lower and the test completes in a fraction of the time:

diff --git a/sstable/reader_test.go b/sstable/reader_test.go
index c94e1fdf..a768cce1 100644
--- a/sstable/reader_test.go
+++ b/sstable/reader_test.go
@@ -204,7 +204,7 @@ func TestReader(t *testing.T) {
                "prefixFilter": "testdata/prefixreader",
        }

-       for _, format := range []TableFormat{TableFormatPebblev2, TableFormatPebblev3} {
+       for _, format := range []TableFormat{TableFormatPebblev2} {
                for dName, blockSize := range blockSizes {
                        for iName, indexBlockSize := range blockSizes {
                                for lName, tableOpt := range writerOpts {
$ go test -tags '' -race -timeout 20m -count 1 -run TestReader$ -v ./sstable | grep -E '^--- PASS'
--- PASS: TestReader (0.94s)
$ while $(pgrep sstable > /dev/null); do pmap -x $(pgrep sstable) | grep total; sleep 0.5; done
total kB         1704404   44844   43672
total kB         1778144   46276   45104

Note that I'm also running with invariants disabled, which rules out the effects of scrambling some byte slices in the value blocks.

@sumeerbhola - mind if I send this your way to take a look?

nicktrav commented

I poked at this some more with pprof. As expected, it points at the new value block code:

github.com/cockroachdb/pebble/sstable.(*valueBlockWriter).addValue
/home/nickt/Development/pebble/sstable/value_block.go

  Total:         3GB        3GB (flat, cum) 99.85%
    463            .          .           	if cap(w.buf.b) < blockLen { 
    464            .          .           		size := w.blockSize + w.blockSize/2 
    465            .          .           		if size < blockLen { 
    466            .          .           			size = blockLen + blockLen/2 
    467            .          .           		} 
    468          3GB        3GB           		buf := make([]byte, blockLen, size) 
    469            .          .           		_ = copy(buf, w.buf.b) 
    470            .          .           		w.buf.b = buf 
    471            .          .           	} else { 
    472            .          .           		w.buf.b = w.buf.b[:blockLen] 
    473            .          .           	} 

@sumeerbhola - is there anything for us to dig into here around pooling some of these byte slices, or making use of some of the sugar we have in bytealloc? I suspect we're not seeing the effects of this yet because these codepaths aren't enabled, though I assume we'll run into memory usage issues eventually.
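For illustration only, here's the general shape of the pooling idea (a sketch with hypothetical names; it doesn't use pebble's bytealloc and isn't a proposed patch): hand block buffers back to a sync.Pool when a value block is flushed, so subsequent blocks reuse already-grown slices instead of allocating fresh ones.

package blockbufpool // hypothetical package name, for illustration only

import "sync"

// blockBufPool recycles value-block buffers across blocks.
var blockBufPool = sync.Pool{
	New: func() any { return make([]byte, 0, 32<<10) }, // start at 32 KiB
}

// getBlockBuf returns a zero-length buffer with capacity of at least n bytes,
// reusing a pooled slice when one is large enough.
func getBlockBuf(n int) []byte {
	b := blockBufPool.Get().([]byte)
	if cap(b) < n {
		b = make([]byte, 0, n)
	}
	return b[:0]
}

// putBlockBuf returns a buffer to the pool once its block has been written out.
func putBlockBuf(b []byte) {
	blockBufPool.Put(b[:0])
}

Pooling would amortize repeated buffer allocations across value blocks, but it wouldn't by itself shrink the 3GB allocation above, since that comes from sizing the new buffer off the configured block size; that part is what the eventual fix changes.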

sumeerbhola commented

Thanks for investigating! I will send out a fix soon.

sumeerbhola added a commit to sumeerbhola/pebble that referenced this issue Jan 11, 2023
Tests can set a high maximum block size e.g. TestReader sets the block size
to 2GB. This resulted in a 3GB byte slice being allocated. The new logic
is similar to the behavior in blockWriter.

Fixes cockroachdb#2159
nicktrav pushed a commit that referenced this issue Jan 11, 2023
Tests can set a high maximum block size e.g. TestReader sets the block size
to 2GB. This resulted in a 3GB byte slice being allocated. The new logic
is similar to the behavior in blockWriter.

Fixes #2159
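For reference, the growth strategy the commit message describes might look roughly like this (an illustrative sketch, not the actual patch; the function name is made up): grow the buffer geometrically from its current capacity and the data that has to fit, so a 2GB test block size no longer forces a multi-GB up-front allocation.

// growForBlock returns a buffer of length blockLen, growing geometrically from
// the buffer's current capacity rather than from the configured block size.
func growForBlock(buf []byte, blockLen int) []byte {
	if cap(buf) >= blockLen {
		return buf[:blockLen]
	}
	newCap := 2 * cap(buf)
	if newCap < 1024 {
		newCap = 1024
	}
	if newCap < blockLen {
		newCap = blockLen
	}
	newBuf := make([]byte, blockLen, newCap)
	copy(newBuf, buf)
	return newBuf
}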