
ci: investigate linux-race failures in GitHub actions #2159

Closed · nicktrav opened this issue Dec 1, 2022 · 5 comments · Fixed by #2236

nicktrav commented Dec 1, 2022

Currently, the linux-race job is consistently failing in GitHub Actions (with exit code 143, i.e. SIGTERM).

For example, a CI run with verbose logging enabled fails with the following:

2022-12-01T17:39:21.3663768Z make: *** [Makefile:22: test] Terminated
2022-12-01T17:39:21.5137635Z ##[error]Process completed with exit code 143.
2022-12-01T17:39:21.5192954Z ##[error]The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.
2022-12-01T17:39:21.7086950Z Cleaning up orphan processes

Initial indications point at 7d9a5b2, though the same failure mode has been observed on branches without this commit (see here).

Pure speculation: we're crossing some kind of GitHub Actions resource limit when running race, which results in the platform sending a SIGTERM to the make process and failing the test.

See this internal thread for more context.

nicktrav added a commit to nicktrav/pebble that referenced this issue Dec 1, 2022
The `linux-race` job is currently consistently failing in CI (both pre-
and post-merge). This is being tracked in cockroachdb#2159.

Temporarily skip the job while the cause is investigated.

Touches cockroachdb#2159.
nicktrav added a commit that referenced this issue Dec 1, 2022
The `linux-race` job is currently consistently failing in CI (both pre-
and post-merge). This is being tracked in #2159.

Temporarily skip the job while the cause is investigated.

Touches #2159.

nicktrav commented Dec 1, 2022

I did some searching, and it looks like other projects run into the same issue with jobs that consume a lot of resources.

I'm going to open a support case with GitHub and see if they can poke around and check whether the VMs running our jobs are being killed for the same reason.

nicktrav commented Dec 2, 2022

The memory usage of the sstable package tests is markedly higher after 7d9a5b2.

Running:

$ go test -tags 'invariants' -race -timeout 20m -count 1 -run . ./sstable

Current:

$ while $(pgrep sstable > /dev/null); do pmap -x $(pgrep sstable) | grep total; sleep 1; done
total kB         1561548   48580   47556
total kB         12747176   57172   56084
total kB         34971432 5513232 5512144
total kB         46083048 10562376 10561288
total kB         46092112 7531108 7530020
total kB         46104464 7953620 7942876
total kB         1899826908 7649968 7639160
total kB         1899826404 7836268 7825396
total kB         1899828180 8066388 8055516
...

Before commit:

$ while $(pgrep sstable > /dev/null); do pmap -x $(pgrep sstable) | grep total; sleep 1; done
total kB         1637896   78916   77808
total kB         2090584  218160  217052
total kB         2536000  236340  235232
total kB         1881287484  154604  153432
total kB         1881302720  221436  210664
total kB         1881287740  196204  185432
total kB         1881287740  157572  146800
...
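For anyone else reproducing this: the pmap loop above samples the whole process, including the race detector's shadow memory. An in-process sampler along these lines could complement it by tracking just the Go heap (a minimal sketch, assuming it lives in a _test.go file in the package under test; it is not part of pebble):

package sstable_test

import (
	"runtime"
	"testing"
	"time"
)

// logHeapUsage starts a goroutine that logs Go heap usage at the given
// interval until the calling test finishes. It only sees the Go heap, not
// race-detector shadow memory, so it complements pmap rather than replacing it.
func logHeapUsage(t *testing.T, interval time.Duration) {
	t.Helper()
	done := make(chan struct{})
	stopped := make(chan struct{})
	t.Cleanup(func() {
		close(done)
		<-stopped // don't log after the test has completed
	})
	go func() {
		defer close(stopped)
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		var ms runtime.MemStats
		for {
			select {
			case <-done:
				return
			case <-ticker.C:
				runtime.ReadMemStats(&ms)
				t.Logf("heap alloc=%d MiB, sys=%d MiB", ms.HeapAlloc>>20, ms.Sys>>20)
			}
		}
	}()
}

Calling logHeapUsage(t, time.Second) at the top of a test and running with -v would print the heap curve alongside the test output.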

nicktrav commented Dec 2, 2022

Narrowing in on the issue, it looks like TestReader provides a cleaner reproducer.

On master:

$ go test -tags '' -race -timeout 20m -count 1 -run TestReader$ -v ./sstable | grep -E '^--- PASS'
--- PASS: TestReader (4.62s)
$ while $(pgrep sstable > /dev/null); do pmap -x $(pgrep sstable) | grep total; sleep 0.5; done
total kB         1705820   47684   46448
total kB         1853284   48480   37588
total kB         13039160 2255992 2245100
total kB         35262392 3500896 3490004
total kB         35262392 8288664 8277772
total kB         35262392 6127036 6116144
total kB         57485624 3711220 3700264
total kB         57485624 9722268 9711312

Running the same test without the new table format (TableFormatPebblev3), memory usage is much lower and the test completes in a fraction of the time:

diff --git a/sstable/reader_test.go b/sstable/reader_test.go
index c94e1fdf..a768cce1 100644
--- a/sstable/reader_test.go
+++ b/sstable/reader_test.go
@@ -204,7 +204,7 @@ func TestReader(t *testing.T) {
                "prefixFilter": "testdata/prefixreader",
        }

-       for _, format := range []TableFormat{TableFormatPebblev2, TableFormatPebblev3} {
+       for _, format := range []TableFormat{TableFormatPebblev2} {
                for dName, blockSize := range blockSizes {
                        for iName, indexBlockSize := range blockSizes {
                                for lName, tableOpt := range writerOpts {
$ go test -tags '' -race -timeout 20m -count 1 -run TestReader$ -v ./sstable | grep -E '^--- PASS'
--- PASS: TestReader (0.94s)
$ while $(pgrep sstable > /dev/null); do pmap -x $(pgrep sstable) | grep total; sleep 0.5; done
total kB         1704404   44844   43672
total kB         1778144   46276   45104

Note that I'm also running with invariants disabled, which rules out the effects of scrambling some byte slices in the value blocks.

@sumeerbhola - mind if I send this your way to take a look?

nicktrav commented

I poked at this some more with pprof. As expected, it points at the new value block code:

github.com/cockroachdb/pebble/sstable.(*valueBlockWriter).addValue
/home/nickt/Development/pebble/sstable/value_block.go

  Total:         3GB        3GB (flat, cum) 99.85%
    463            .          .           	if cap(w.buf.b) < blockLen { 
    464            .          .           		size := w.blockSize + w.blockSize/2 
    465            .          .           		if size < blockLen { 
    466            .          .           			size = blockLen + blockLen/2 
    467            .          .           		} 
    468          3GB        3GB           		buf := make([]byte, blockLen, size) 
    469            .          .           		_ = copy(buf, w.buf.b) 
    470            .          .           		w.buf.b = buf 
    471            .          .           	} else { 
    472            .          .           		w.buf.b = w.buf.b[:blockLen] 
    473            .          .           	} 

@sumeerbhola - is there anything for us to dig into here around pooling some of these byte slices, or making use of some of the sugar we have in bytealloc? I suspect we're not seeing the effects of this yet because these codepaths aren't enabled, though I assume we'll run into memory usage issues eventually.
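For illustration only, here's the general shape of the pooling idea (a sketch with hypothetical names; it doesn't use pebble's bytealloc and isn't a proposed patch): hand block buffers back to a sync.Pool when a value block is flushed, so subsequent blocks reuse already-grown slices instead of allocating fresh ones.

package blockbufpool // hypothetical package name, for illustration only

import "sync"

// blockBufPool recycles value-block buffers across blocks.
var blockBufPool = sync.Pool{
	New: func() any { return make([]byte, 0, 32<<10) }, // start at 32 KiB
}

// getBlockBuf returns a zero-length buffer with capacity of at least n bytes,
// reusing a pooled slice when one is large enough.
func getBlockBuf(n int) []byte {
	b := blockBufPool.Get().([]byte)
	if cap(b) < n {
		b = make([]byte, 0, n)
	}
	return b[:0]
}

// putBlockBuf returns a buffer to the pool once its block has been written out.
func putBlockBuf(b []byte) {
	blockBufPool.Put(b[:0])
}

Pooling would amortize repeated buffer allocations across value blocks, but it wouldn't by itself shrink the 3GB allocation above, since that comes from sizing the new buffer off the configured block size; that part is what the eventual fix changes.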

sumeerbhola commented

Thanks for investigating! I will send out a fix soon.

sumeerbhola added a commit to sumeerbhola/pebble that referenced this issue Jan 11, 2023
Tests can set a high maximum block size e.g. TestReader sets the block size
to 2GB. This resulted in a 3GB byte slice being allocated. The new logic
is similar to the behavior in blockWriter.

Fixes cockroachdb#2159
nicktrav pushed a commit that referenced this issue Jan 11, 2023
Tests can set a high maximum block size e.g. TestReader sets the block size
to 2GB. This resulted in a 3GB byte slice being allocated. The new logic
is similar to the behavior in blockWriter.

Fixes #2159
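For reference, the growth strategy the commit message describes might look roughly like this (an illustrative sketch, not the actual patch; the function name is made up): grow the buffer geometrically from its current capacity and the data that has to fit, so a 2GB test block size no longer forces a multi-GB up-front allocation.

// growForBlock returns a buffer of length blockLen, growing geometrically from
// the buffer's current capacity rather than from the configured block size.
func growForBlock(buf []byte, blockLen int) []byte {
	if cap(buf) >= blockLen {
		return buf[:blockLen]
	}
	newCap := 2 * cap(buf)
	if newCap < 1024 {
		newCap = 1024
	}
	if newCap < blockLen {
		newCap = blockLen
	}
	newBuf := make([]byte, blockLen, newCap)
	copy(newBuf, buf)
	return newBuf
}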