Bazel CI Flaky Test: //src/test/shell/bazel:starlark_repository_test (test_download_failure_message) #21238

Closed
meteorcloudy opened this issue Feb 7, 2024 · 11 comments
Labels: breakage, flaky test, P1, team-ExternalDeps, type: bug

Comments

@meteorcloudy
Member

meteorcloudy commented Feb 7, 2024

Description of the bug:

This test often times out in Bazel postsubmit: https://buildkite.com/bazel/bazel-bazel/builds/26661#018d837a-6b44-4938-be56-d6bf3c695381

** test_download_failure_message ***********************************************
-- Test timed out at 2024-02-07 12:47:51 UTC --
Terminated
-- Test log: -----------------------------------------------------------
$TEST_TMPDIR defined: output root default is '/var/lib/buildkite-agent/.cache/bazel/_bazel_buildkite-agent/ec321eb2cc2d0f8f91b676b6d4c66c29/sandbox/linux-sandbox/8170/execroot/_main/_tmp/2cb27242228a13b00c5d2dbd63a30e9a' and max_idle_secs default is '15'.
Computing main repo mapping: 
Loading: 
Loading: 0 packages loaded
Analyzing: target //:it (1 packages loaded, 0 targets configured)
Analyzing: target //:it (1 packages loaded, 0 targets configured)
[0 / 1] [Prepa] BazelWorkspaceStatusAction stable-status.txt
WARNING: Download from http://does.not.exist.example.com/some/file.tar failed: class com.google.devtools.build.lib.bazel.repository.downloader.UnrecoverableHttpException Unknown host: does.not.exist.example.com
INFO: Repository this_is_the_root_cause instantiated at:
  /var/lib/buildkite-agent/.cache/bazel/_bazel_buildkite-agent/ec321eb2cc2d0f8f91b676b6d4c66c29/sandbox/linux-sandbox/8170/execroot/_main/_tmp/2cb27242228a13b00c5d2dbd63a30e9a/workspace/WORKSPACE:65:11: in <toplevel>
  /var/lib/buildkite-agent/.cache/bazel/_bazel_buildkite-agent/ec321eb2cc2d0f8f91b676b6d4c66c29/sandbox/linux-sandbox/8170/execroot/_main/_tmp/2cb27242228a13b00c5d2dbd63a30e9a/workspace/root.bzl:4:15: in root_cause
Repository rule http_archive defined at:
  /var/lib/buildkite-agent/.cache/bazel/_bazel_buildkite-agent/ec321eb2cc2d0f8f91b676b6d4c66c29/sandbox/linux-sandbox/8170/execroot/_main/_tmp/2cb27242228a13b00c5d2dbd63a30e9a/root/86e00b674c10623bdea7d66ec785c5ae/external/bazel_tools/tools/build_defs/repo/http.bzl:375:31: in <toplevel>
ERROR: An error occurred during the fetch of repository 'this_is_the_root_cause':
   Traceback (most recent call last):
	File "/var/lib/buildkite-agent/.cache/bazel/_bazel_buildkite-agent/ec321eb2cc2d0f8f91b676b6d4c66c29/sandbox/linux-sandbox/8170/execroot/_main/_tmp/2cb27242228a13b00c5d2dbd63a30e9a/root/86e00b674c10623bdea7d66ec785c5ae/external/bazel_tools/tools/build_defs/repo/http.bzl", line 139, column 45, in _http_archive_impl
		download_info = ctx.download_and_extract(
Error in download_and_extract: java.io.IOException: Error downloading [http://does.not.exist.example.com/some/file.tar] to /var/lib/buildkite-agent/.cache/bazel/_bazel_buildkite-agent/ec321eb2cc2d0f8f91b676b6d4c66c29/sandbox/linux-sandbox/8170/execroot/_main/_tmp/2cb27242228a13b00c5d2dbd63a30e9a/root/86e00b674c10623bdea7d66ec785c5ae/external/this_is_the_root_cause/temp18096450050676409068/file.tar: Unknown host: does.not.exist.example.com
ERROR: /var/lib/buildkite-agent/.cache/bazel/_bazel_buildkite-agent/ec321eb2cc2d0f8f91b676b6d4c66c29/sandbox/linux-sandbox/8170/execroot/_main/_tmp/2cb27242228a13b00c5d2dbd63a30e9a/workspace/WORKSPACE:65:11: fetching http_archive rule //external:this_is_the_root_cause: Traceback (most recent call last):
	File "/var/lib/buildkite-agent/.cache/bazel/_bazel_buildkite-agent/ec321eb2cc2d0f8f91b676b6d4c66c29/sandbox/linux-sandbox/8170/execroot/_main/_tmp/2cb27242228a13b00c5d2dbd63a30e9a/root/86e00b674c10623bdea7d66ec785c5ae/external/bazel_tools/tools/build_defs/repo/http.bzl", line 139, column 45, in _http_archive_impl
		download_info = ctx.download_and_extract(
Error in download_and_extract: java.io.IOException: Error downloading [http://does.not.exist.example.com/some/file.tar] to /var/lib/buildkite-agent/.cache/bazel/_bazel_buildkite-agent/ec321eb2cc2d0f8f91b676b6d4c66c29/sandbox/linux-sandbox/8170/execroot/_main/_tmp/2cb27242228a13b00c5d2dbd63a30e9a/root/86e00b674c10623bdea7d66ec785c5ae/external/this_is_the_root_cause/temp18096450050676409068/file.tar: Unknown host: does.not.exist.example.com
Analyzing: target //:it (5 packages loaded, 6 targets configured)
[1 / 1] checking cached actions

Bazel caught terminate signal; cancelling pending invocation.

------------------------------------------------------------------------
test_download_failure_message FAILED: terminated by signal TERM.
/var/lib/buildkite-agent/.cache/bazel/_bazel_buildkite-agent/ec321eb2cc2d0f8f91b676b6d4c66c29/sandbox/linux-sandbox/8170/execroot/_main/bazel-out/k8-fastbuild/bin/src/test/shell/bazel/starlark_repository_test.runfiles/_main/src/test/shell/bazel/starlark_repository_test:2707: in call to main
$TEST_TMPDIR defined: output root default is '/var/lib/buildkite-agent/.cache/bazel/_bazel_buildkite-agent/ec321eb2cc2d0f8f91b676b6d4c66c29/sandbox/linux-sandbox/8170/execroot/_main/_tmp/2cb27242228a13b00c5d2dbd63a30e9a' and max_idle_secs default is '15'.
Another command (pid=826) is running. Waiting for it to complete on the server (server_pid=432)...

Which category does this issue belong to?

No response

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

This can be easily reproduced within the Docker image gcr.io/bazel-public/centos7-java11-devtoolset10 by running:

bazel test //src/test/shell/bazel:starlark_repository_test --test_filter=test_download_failure_message --runs_per_test=20

Increasing --runs_per_test increases the chance of reproducing this issue.
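
A minimal sketch of that escalation, assuming `bazel` is on the PATH inside the container. The helper name `escalate_runs` and the `BAZEL` override are hypothetical and not part of Bazel; the flags are only the ones quoted above. The helper reruns the command with growing `--runs_per_test` values and stops at the first non-zero exit. It treats any failure as a reproduction, so the test log still needs to be checked for the timeout.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rerun the repro command with each --runs_per_test value
# given as an argument and stop at the first run that exits non-zero.
# BAZEL is an assumed override for the binary to invoke (defaults to "bazel").
escalate_runs() {
  local bazel="${BAZEL:-bazel}" runs
  for runs in "$@"; do
    # Send Bazel's own output to stderr so stdout carries only the verdict.
    if ! "$bazel" test //src/test/shell/bazel:starlark_repository_test \
        --test_filter=test_download_failure_message \
        --runs_per_test="$runs" >&2; then
      printf 'reproduced at --runs_per_test=%s\n' "$runs"
      return 0
    fi
  done
  printf 'not reproduced\n'
  return 1
}

# Usage (not executed here):
#   escalate_runs 20 40 80
```

Passing the run counts as arguments keeps the sketch free of any guess about how many runs are enough on a given machine.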

Which operating system are you running Bazel on?

Linux

What is the output of bazel info release?

7.0.2

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse HEAD ?

No response

Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.

No response

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

No response

@meteorcloudy
Member Author

This doesn't seem to happen after I revert 57c1801
/cc @Wyverald

@meteorcloudy
Member Author

This is not reproducible if I add --local_test_jobs=1, so it's likely a deadlock that occurs while running multiple Bazel instances at the same time?
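
To compare the two configurations side by side, something like the following could be used. This is only a sketch: `compare_parallelism` and the `BAZEL` override are hypothetical names, and the only flags used are the ones quoted in this thread. Bazel documents exit code 3 as "build OK, but some tests failed or timed out", so a 3 for the default run and a 0 for the serialized run would match the observation above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run the same test twice, once with default test
# parallelism and once with --local_test_jobs=1, and print each exit status.
# BAZEL is an assumed override for the binary to invoke (defaults to "bazel").
compare_parallelism() {
  local bazel="${BAZEL:-bazel}" extra status
  for extra in "" "--local_test_jobs=1"; do
    status=0
    # ${extra:+...} expands to nothing for the default variant.
    "$bazel" test //src/test/shell/bazel:starlark_repository_test \
      --test_filter=test_download_failure_message \
      --runs_per_test=20 ${extra:+"$extra"} >&2 || status=$?
    printf '%s: exit %s\n' "${extra:-default}" "$status"
  done
}

# Usage (not executed here):
#   compare_parallelism
```

Because the flake is probabilistic, a single clean default run proves little; the comparison is more convincing when repeated several times.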

@fmeum
Collaborator

fmeum commented Feb 7, 2024

Is the setup "multiple Bazel instances" or "multiple Bazel builds running against the same server"? If the latter, this could be caused by https://cs.opensource.google/bazel/bazel/+/master:src/main/java/com/google/devtools/build/lib/bazel/repository/downloader/HttpDownloader.java;l=53;bpv=1;bpt=1?q=Httpdownloader&ss=bazel%2Fbazel.

@meteorcloudy
Member Author

If we don't set --local_test_jobs=1, running the test with --runs_per_test=20 runs the same Bazel binary in different workspaces in parallel, so I believe they should be different Bazel servers.

meteorcloudy added the team-ExternalDeps label and removed the untriaged label on Feb 8, 2024
@Wyverald
Member

Wyverald commented Feb 8, 2024

@justinhorvitz I may need some help here with the Skyframe internals -- this is a deadlock that shows up with the usage of the Loom+repo fetching stuff, for context.

This line in the jstack output looks particularly suspect: https://gist.github.com/meteorcloudy/a7b812947d3e328daa5c8015a2d4ac2e#file-jstack-bazel-log-L153

Looks like ParallelEvaluator.bubbleErrorUp is calling SkyFunction.compute in a somewhat special way. I tried digging around but got a bit lost. Justin, if this rings a bell to you, I'd appreciate some tips; otherwise I'll keep digging.
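
For anyone who wants to look for the same frame in their own thread dump, here is a rough sketch. It assumes the dump was captured with the JDK's `jstack <pid>` against the hung Bazel server (the server pid appears in the "Another command ... is running" message in the log) and that thread entries are separated by blank lines, as in plain `jstack` output. The helper name `threads_in_method` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the header line of every thread in a jstack dump
# whose stack mentions the given method name.
# Assumes one blank-line-separated block per thread (plain `jstack <pid>`).
threads_in_method() {
  local method="$1" dump="$2"
  # RS= switches awk to paragraph mode (records split on blank lines);
  # -F'\n' makes each line a field, so $1 is the thread header.
  # index() does a literal substring match, not a regex match.
  awk -v RS= -F'\n' -v pat="$method" 'index($0, pat) { print $1 }' "$dump"
}

# Usage (not executed here):
#   jstack "$server_pid" > jstack-bazel.log
#   threads_in_method bubbleErrorUp jstack-bazel.log
```

Dumps taken with `jstack -l` add extra blank-line-separated sections per thread, so the block splitting here may need adjusting for that format.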

@justinhorvitz
Contributor

@fmeum
Collaborator

fmeum commented Feb 12, 2024

@Wyverald Should we mark this as a blocker for 7.1.0? It looks like this could cause hangs in production.

@meteorcloudy
Member Author

I think so; at least, the test flake also exists on release-7.1.0.

@meteorcloudy
Member Author

@bazel-io fork 7.1.0

@meteorcloudy
Member Author

There is a fix being submitted

Wyverald added a commit that referenced this issue Feb 12, 2024
…ror bubbling

For some reason, using worker threads for repo fetching during Skyframe error bubbling frequently causes deadlocks on Linux. I wasn't able to find out why the deadlock happens, but this CL is the immediate solution to the problem, and shouldn't be a performance concern since no Skyframe restarts should happen during error bubbling anyway.

Tested on Linux; with this CL, `bazel test //src/test/shell/bazel:starlark_repository_test --test_filter=test_download_failure_message --runs_per_test=20` finishes just fine. (On an M1 macbook, I can't trigger the deadlock even without this CL.)

Fixes #21238

PiperOrigin-RevId: 606305306
Change-Id: I6f47a144b29030011f6c10c2b37f6874190fed0e
github-merge-queue bot pushed a commit that referenced this issue Feb 12, 2024
#21305)

…ror bubbling
