Bazel at HEAD gets DEADLINE_EXCEEDED with remote strategy #3248

Closed
philwo opened this issue Jun 22, 2017 · 7 comments

@philwo
Member

philwo commented Jun 22, 2017

I think this is a release blocker.

Reproduction steps:

# Start remote_worker:
$ rm -rf /usr/local/google/tmp/worker
$ mkdir /usr/local/google/tmp/worker
$ bazel build //src/tools/remote_worker
$ bazel-bin/src/tools/remote_worker/remote_worker --work_path=/usr/local/google/tmp

# Start build using remote strategy ("bazel" is Bazel built from HEAD)
$ bazel clean
$ bazel --host_jvm_args=-Dbazel.DigestFunction=SHA1 --blazerc=/dev/null build --spawn_strategy=remote --strategy=Javac=remote --remote_cache=localhost:8080 --remote_executor=localhost:8080 //src:bazel
[...]
WARNING: CppLink remote work failed (io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED).
ERROR: /usr/local/google/home/philwo/src/bazel/third_party/protobuf/3.2.0/BUILD:529:1: Linking of rule '//third_party/protobuf/3.2.0:protoc' failed: io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED.
WARNING: Genrule remote work failed (io.grpc.StatusRuntimeException: CANCELLED).
WARNING: Javac remote work failed (io.grpc.StatusRuntimeException: CANCELLED).
WARNING: CppCompile remote work failed (io.grpc.StatusRuntimeException: CANCELLED).
WARNING: Javac remote work failed (io.grpc.StatusRuntimeException: CANCELLED).
WARNING: CppCompile remote work failed (io.grpc.StatusRuntimeException: CANCELLED).
WARNING: CppCompile remote work failed (io.grpc.StatusRuntimeException: CANCELLED).
WARNING: CppLink remote work failed (io.grpc.StatusRuntimeException: CANCELLED).
Target //src:bazel failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 66.525s, Critical Path: 10.37s
philwo added the "P1 I'll work on this now. (Assignee required)", "Release blocker", and "type: bug" labels on Jun 22, 2017
@philwo
Member Author

philwo commented Jun 22, 2017

FYI @ola-rozenfeld

@ola-rozenfeld
Contributor

The default one-minute gRPC deadline should probably be increased, although this is the first time I've seen it not suffice for src:bazel. I don't think it's a release blocker, actually. But if you do, the best solution is to increase the default timeout, imo.

ulfjack added this to the 0.6 milestone on Jun 22, 2017
@ulfjack
Contributor

ulfjack commented Jun 26, 2017

Something is preventing the remote worker from responding in a timely manner, and we don't think that we should increase the default timeout without understanding what. Our hypothesis is that we're blocking the gRPC threads with action execution. If that's correct, we may need to change it to run actions in a separate thread pool.
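
To illustrate the hypothesis above, here is a minimal Java sketch that uses only java.util.concurrent. It is not the remote_worker code and does not use gRPC: the class HandlerPoolSketch, the method pingMeetsDeadline, the single-thread "handler" pool, and the timings are all hypothetical stand-ins. The idea is that a long-running action executed directly on a request-handling thread can starve a cheap request until its deadline passes, while handing the action off to a separate pool keeps the handler free.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: a tiny "handler" pool stands in for the server's request threads.
class HandlerPoolSketch {

    // Sleeps to simulate a long-running action; restores the interrupt flag if interrupted.
    static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Returns true if a cheap request queued behind a slow action finishes within the deadline.
    static boolean pingMeetsDeadline(boolean offloadActions, long actionMillis, long deadlineMillis)
            throws Exception {
        // Assumption: one handler thread, to make the starvation easy to see.
        ExecutorService handlers = Executors.newFixedThreadPool(1);
        // Separate pool that is used for action execution only when offloadActions is true.
        ExecutorService actions = Executors.newCachedThreadPool();
        try {
            Runnable slowAction = () -> sleepQuietly(actionMillis);

            // An "execute" request arrives first.
            handlers.execute(() -> {
                if (offloadActions) {
                    actions.execute(slowAction); // hand off; the handler thread returns at once
                } else {
                    slowAction.run();            // runs inline and blocks the handler thread
                }
            });

            // A cheap request arrives next and must wait for a free handler thread.
            Future<String> ping = handlers.submit(() -> "pong");
            try {
                ping.get(deadlineMillis, TimeUnit.MILLISECONDS);
                return true;
            } catch (TimeoutException e) {
                return false; // loosely analogous to the client seeing DEADLINE_EXCEEDED
            }
        } finally {
            handlers.shutdownNow();
            actions.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("inline action, ping meets deadline: "
                + pingMeetsDeadline(false, 2000, 300));
        System.out.println("offloaded action, ping meets deadline: "
                + pingMeetsDeadline(true, 2000, 300));
    }
}
```

With these timings the inline variant should miss the deadline and the offloaded variant should meet it, which is the behavior the hypothesis predicts. Whether this is what actually happens in the remote worker is not confirmed here.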

@philwo
Member Author

philwo commented Jun 26, 2017

If this is a server-side problem, I don't think it's a release blocker and we should remove the tag.
We don't really bundle or ship the remote worker with specific Bazel versions anyway.

If it's a client-side problem with our RemoteSpawnStrategy, it probably is a blocker for 0.5.2.

@ulfjack
Contributor

ulfjack commented Jun 26, 2017

I haven't been able to reproduce so far.

@ulfjack
Contributor

ulfjack commented Jun 26, 2017

I wonder if 716b527 may have fixed this accidentally?

I saw a non-zero exit twice:

ERROR: /.../Projects/os-bazel/src/main/java/com/google/devtools/build/lib/BUILD:1205:1: Building Java resource jar failed: singlejar failed: error executing command external/bazel_tools/tools/jdk/singlejar/singlejar --normalize --dont_change_compression --exclude_build_data --output ... (remaining 19 argument(s) skipped): Exit -1

@ola-rozenfeld
Contributor

The non-zero exit is #3251, I believe. I still see it a lot.


4 participants