Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remote Worker failures on bazel build #3251

Closed
ola-rozenfeld opened this issue Jun 22, 2017 · 52 comments
Closed

Remote Worker failures on bazel build #3251

ola-rozenfeld opened this issue Jun 22, 2017 · 52 comments
Assignees
Labels
P1 I'll work on this now. (Assignee required) type: bug

Comments

@ola-rozenfeld
Copy link
Contributor

When trying to build Bazel using the RemoteWorker, I get intermittent errors such as:

ERROR: /usr/local/google/home/olaola/full-bazel/third_party/BUILD:75:1: Extracting interface //third_party:android_common_25_0_0 failed: ijar failed: error executing command
(cd /usr/local/google/home/olaola/.cache/bazel/_bazel_olaola/098d19699a9b31c0ee0da4e8c41e90b9/execroot/io_bazel &&
exec env -
PATH=/usr/local/google/home/olaola/google-cloud-sdk/bin:/usr/local/google/home/olaola/work/bin:/usr/local/google/home/olaola/depot_tools:/usr/local/google/home/olaola/bin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/google/home/olaola/google-cloud-sdk/bin:/usr/lib/google-golang/bin:/usr/local/buildtools/java/jdk/bin:/usr/local/sbin
external/bazel_tools/tools/jdk/ijar/ijar third_party/android_common/com.android_annotations_25.0.0.jar bazel-out/local-fastbuild/genfiles/third_party/_ijar/android_common_25_0_0/third_party/android_common/com.android_annotations_25.0.0-ijar.jar): Exit -1.
ERROR: /usr/local/google/home/olaola/full-bazel/third_party/BUILD:75:1: output 'third_party/_ijar/android_common_25_0_0/third_party/android_common/com.android.tools.lint_lint-api_25.0.0-ijar.jar' was not created.

The command actually failed on the RemoteWorker, this is not a gRPC issue.

@ola-rozenfeld
Copy link
Contributor Author

Another manifestation of the same error (I think):

ERROR: /usr/local/google/home/olaola/.cache/bazel/_bazel_olaola/098d19699a9b31c0ee0da4e8c41e90b9/external/com_google_protobuf_java/BUILD.bazel:1:1: Extracting interface @com_google_protobuf_java//:protobuf failed: ijar failed: error executing command
(cd /usr/local/google/home/olaola/.cache/bazel/_bazel_olaola/098d19699a9b31c0ee0da4e8c41e90b9/execroot/io_bazel &&
exec env -
PATH=/usr/local/google/home/olaola/google-cloud-sdk/bin:/usr/local/google/home/olaola/work/bin:/usr/local/google/home/olaola/depot_tools:/usr/local/google/home/olaola/bin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/google/home/olaola/google-cloud-sdk/bin:/usr/lib/google-golang/bin:/usr/local/buildtools/java/jdk/bin:/usr/local/sbin
external/bazel_tools/tools/jdk/ijar/ijar external/com_google_protobuf_java/libprotobuf_java.jar bazel-out/host/genfiles/external/com_google_protobuf_java/_ijar/protobuf/external/com_google_protobuf_java/libprotobuf_java-ijar.jar): Exit -1.

@ulfjack
Copy link
Contributor

ulfjack commented Jun 28, 2017

No idea. There are some cases where we still swallow exceptions from the underlying libraries, and I suspect it's one of those. I'll try to see if I can reproduce this with a patched binary that produces more debugging output.

@ulfjack ulfjack assigned ulfjack and unassigned philwo and iirina Jun 28, 2017
@ulfjack ulfjack added P1 I'll work on this now. (Assignee required) and removed P2 We'll consider working on this in future. (Assignee optional) labels Jun 28, 2017
@ulfjack
Copy link
Contributor

ulfjack commented Jun 28, 2017

I got a stack trace.

java.io.IOException: Cannot run program "/tmp/remote.pTK0EkckA/build-0121c395-7267-4941-abb9-405621684cdf/external/bazel_tools/tools/jdk/ijar/ijar" (in directory "/tmp/remote.pTK0EkckA/build-0121c395-7267-4941-abb9-405621684cdf"): error=26, Text file busy
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
	at com.google.devtools.build.lib.shell.JavaSubprocessFactory.create(JavaSubprocessFactory.java:116)
	at com.google.devtools.build.lib.shell.SubprocessBuilder.start(SubprocessBuilder.java:194)
	at com.google.devtools.build.lib.shell.Command.startProcess(Command.java:738)
	at com.google.devtools.build.lib.shell.Command.doExecute(Command.java:705)
	at com.google.devtools.build.lib.shell.Command.execute(Command.java:418)
	at com.google.devtools.build.remote.WatcherServer.execute(WatcherServer.java:250)
	at com.google.devtools.build.remote.WatcherServer.watch(WatcherServer.java:318)
	at com.google.watcher.v1.WatcherGrpc$MethodHandlers.invoke(WatcherGrpc.java:214)
	at io.grpc.stub.ServerCalls$1$1.onHalfClose(ServerCalls.java:148)
	at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:262)
	at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$2.runInContext(ServerImpl.java:572)
	at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:52)
	at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:117)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: error=26, Text file busy
	at java.lang.UNIXProcess.forkAndExec(Native Method)
	at java.lang.UNIXProcess.<init>(UNIXProcess.java:247)
	at java.lang.ProcessImpl.start(ProcessImpl.java:134)
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
	... 16 more

@ulfjack
Copy link
Contributor

ulfjack commented Jun 28, 2017

Well, it's pretty obvious what's going wrong now. Every time we're running an action we download all the relevant inputs and put them all into the same directory, silently overwriting any existing files. If we run two actions in parallel, we can be writing to a file just as we want to execute it, and Linux complains about that.

@philwo
Copy link
Member

philwo commented Jun 28, 2017

ehehehe... nice find, @ulfjack.

@ulfjack
Copy link
Contributor

ulfjack commented Jun 28, 2017

Err, no, that's not right. It is putting them in different directories.

@ulfjack
Copy link
Contributor

ulfjack commented Jun 28, 2017

The RemoteWorker is running the action for every watch request it gets. My new hypothesis is that we get the same watch request twice....

@ulfjack
Copy link
Contributor

ulfjack commented Jun 28, 2017

WatcherServer has a check-then-act race, in which case it can throw a NPE. If we can get multiple watch requests for the same action, then it can also return Code.NOT_FOUND.

@ola-rozenfeld
Copy link
Contributor Author

ola-rozenfeld commented Jun 28, 2017 via email

@ulfjack
Copy link
Contributor

ulfjack commented Jun 28, 2017

I don't know yet. I added some more debug output, and am running builds trying to reproduce, but no success so far.

@ola-rozenfeld
Copy link
Contributor Author

ola-rozenfeld commented Jun 28, 2017 via email

@ulfjack
Copy link
Contributor

ulfjack commented Jun 30, 2017

I haven't been able to reproduce the error in the last two days. :-( I only have logging changes in the code compared to the stack trace I posted - how can that be?

@ulfjack
Copy link
Contributor

ulfjack commented Jun 30, 2017

I can reproduce reliably by increasing the number of threads. I think it might be this bug:
https://bugs.openjdk.java.net/browse/JDK-8068370

@ulfjack
Copy link
Contributor

ulfjack commented Jun 30, 2017

As a workaround, I can make it so the remote worker retires an action if i get the "error=26, Text file busy" error. That might be good enough for now. Unfortunately, if upstream doesn't want to fix this, we'll have to write our own.

@ulfjack
Copy link
Contributor

ulfjack commented Jun 30, 2017

I get a new error now:

ERROR: /usr/local/google/home/ulfjack/.cache/bazel/_bazel_ulfjack/d25975f09249f56db00255081e2885e2/external/com_google_protobuf/BUILD.bazel:423:1: C++ compilation of rule '@com_google_protobuf//:protoc_lib' failed: io.grpc.StatusRuntimeException: UNKNOWN

This is without the retry logic.

@ulfjack
Copy link
Contributor

ulfjack commented Jun 30, 2017

More fun exceptions:

    12:30:13.594  ... Start 7c34e712-3c3f-495e-b899-f36714f231f7
    12:30:13.594  ... Working dir /tmp/remote.2q8eAE27O/build-7c34e712-3c3f-495e-b899-f36714f231f7
    12:30:13.594  ... 2 input files & 1 output files
    12:30:13.604  ... external/bazel_tools/tools/jdk/ijar/ijar third_party/asm/asm-commons-5.1.jar bazel-out/host/genfiles/third_party/_ijar/asm-commons/third_par
                      ty/asm/asm-commons-5.1-ijar.jar
    12:30:13.604  ... path=/tmp/remote.2q8eAE27O/build-7c34e712-3c3f-495e-b899-f36714f231f7

java.lang.NullPointerException
	at com.google.devtools.build.remote.WatcherServer.execute(WatcherServer.java:271)
	at com.google.devtools.build.remote.WatcherServer.watch(WatcherServer.java:341)
	at com.google.watcher.v1.WatcherGrpc$MethodHandlers.invoke(WatcherGrpc.java:214)
	at io.grpc.stub.ServerCalls$1$1.onHalfClose(ServerCalls.java:148)
	at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:262)
	at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$2.runInContext(ServerImpl.java:572)
	at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:52)
	at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:117)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

@ulfjack
Copy link
Contributor

ulfjack commented Jun 30, 2017

Ignore that last one, that's a bug in my change.

@ulfjack
Copy link
Contributor

ulfjack commented Jun 30, 2017

Well, testing is difficult, since it's using up all memory in my machine, and making it impossible for me to use the machine until I physically switch it off and on again.

@ola-rozenfeld
Copy link
Contributor Author

ola-rozenfeld commented Jun 30, 2017 via email

@ulfjack
Copy link
Contributor

ulfjack commented Jun 30, 2017

This is happening with as few as a dozen simultaneous requests, with a single client on a single machine. I basically haven't been able to build bazel on my machine with the remote worker if I change jobs to a bit more than I have cores in my machine. That's a really bad experience, because I want to have a higher number of jobs when I get cache hits, as that significantly improves performance when a lot of the requests are cache hits. But I can't know ahead of time whether the actions will be cache hits or not, and manually tweaking jobs on the client isn't a good UI either.

@ola-rozenfeld
Copy link
Contributor Author

ola-rozenfeld commented Jun 30, 2017 via email

@ola-rozenfeld
Copy link
Contributor Author

ola-rozenfeld commented Jun 30, 2017 via email

@ulfjack
Copy link
Contributor

ulfjack commented Jun 30, 2017

It definitely does not take 6 minutes to the DEADLINE_EXCEEDED. It takes a bit more than 60 seconds, so maybe retries don't work properly?

I don't think we'd need to increase the timeout - we'd queue the action for execution, but report back regularly via the Watcher API. I thought that's why we're using a long running operation?

@ola-rozenfeld
Copy link
Contributor Author

ola-rozenfeld commented Jun 30, 2017 via email

@ola-rozenfeld
Copy link
Contributor Author

ola-rozenfeld commented Jun 30, 2017 via email

@ulfjack
Copy link
Contributor

ulfjack commented Jun 30, 2017

I get this fun error in Bazel:

Server terminated abruptly (error code: 14, error message: '', log file: '/usr/local/google/home/ulfjack/.cache/bazel/_bazel_ulfjack/5c706281d73fa9baac5a257912930c73/server/jvm.out')

@ulfjack
Copy link
Contributor

ulfjack commented Jun 30, 2017

Here's the --verbose_failures output from a DEADLINE_EXCEEDED, which I just managed to reproduce again...

WARNING: Turbine remote work failed (after 6 attempts: io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED)
after 6 attempts: io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED
	at com.google.devtools.build.lib.remote.Retrier.onFailures(Retrier.java:333)
	at com.google.devtools.build.lib.remote.Retrier.onFailure(Retrier.java:312)
	at com.google.devtools.build.lib.remote.Retrier.execute(Retrier.java:291)
	at com.google.devtools.build.lib.remote.GrpcActionCache.download(GrpcActionCache.java:219)
	at com.google.devtools.build.lib.remote.RemoteSpawnStrategy.exec(RemoteSpawnStrategy.java:292)
	at com.google.devtools.build.lib.rules.java.JavaHeaderCompileAction.internalExecute(JavaHeaderCompileAction.java:136)
	at com.google.devtools.build.lib.analysis.actions.SpawnAction.execute(SpawnAction.java:270)
	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.executeActionTask(SkyframeActionExecutor.java:803)
	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.prepareScheduleExecuteAndCompleteAction(SkyframeActionExecutor.java:746)
	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.access$900(SkyframeActionExecutor.java:103)
	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.call(SkyframeActionExecutor.java:648)
	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.call(SkyframeActionExecutor.java:605)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.executeAction(SkyframeActionExecutor.java:395)
	at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.checkCacheAndExecuteIfNeeded(ActionExecutionFunction.java:440)
	at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.compute(ActionExecutionFunction.java:201)
	at com.google.devtools.build.skyframe.ParallelEvaluator$Evaluate.run(ParallelEvaluator.java:401)
	at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:352)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED
	at io.grpc.Status.asRuntimeException(Status.java:531)
	... 21 more

@ulfjack
Copy link
Contributor

ulfjack commented Jun 30, 2017

INFO: Elapsed time: 71.334s, Critical Path: 61.36s

@ulfjack
Copy link
Contributor

ulfjack commented Jun 30, 2017

Ok, I have a set of changes that make the remote worker not completely lock up my machine. But I still get DEADLINE_EXCEEDED when building from scratch, and also the odd Bazel server crash with no error messages and no output (except as noted above).

@ulfjack
Copy link
Contributor

ulfjack commented Jul 3, 2017

And apparently, netty is allocating huge amounts of direct memory. Maybe we're holding it wrong?

io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 16777216 byte(s) of direct memory (used: 8120172847, max: 8135442432)

@ulfjack
Copy link
Contributor

ulfjack commented Jul 3, 2017

Ok, I can at least confirm that the "Server terminated abruptly" and the "OutOfDirectMemoryError" are connected. We're actually redirecting stdout / stderr within Java to go over gRPC, so we can loose error messages if gRPC crashes. I've removed the redirection (instead, it goes to an output file), and now I can see the exception stack traces.

@ulfjack
Copy link
Contributor

ulfjack commented Jul 3, 2017

Actually, there are more cases where we get "Server terminated abruptly". Apparently the remote execution use of gRPC is starving out the client/server connection and the client is loosing the connection, even though the server is still running.

@ulfjack
Copy link
Contributor

ulfjack commented Jul 3, 2017

Maybe it's getting 'deadline exceeded'?

@ulfjack
Copy link
Contributor

ulfjack commented Jul 3, 2017

Status 14 is UNAVAILABLE according to StatusCode.

@ulfjack
Copy link
Contributor

ulfjack commented Jul 3, 2017

io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 16777216 byte(s) of direct memory (used: 8120172847, max: 8135442432)
at io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:624)
at io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:578)
at io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:718)
at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:707)
at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:239)
at io.netty.buffer.PoolArena.allocate(PoolArena.java:209)
at io.netty.buffer.PoolArena.allocate(PoolArena.java:141)
at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:287)
at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:179)
at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:170)
at io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:131)
at io.netty.channel.DefaultMaxMessagesRecvByteBufAllocator$MaxMessageHandle.allocate(DefaultMaxMessagesRecvByteBufAllocator.java:73)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:117)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:642)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:565)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:479)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:441)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:745)

@ulfjack
Copy link
Contributor

ulfjack commented Jul 3, 2017

We think the main problem is that there's no flow control at the application level - we're queueing all the chunks at once which means reading all the files for upload into memory at the same time.

@ulfjack
Copy link
Contributor

ulfjack commented Jul 3, 2017

The DEADLINE_EXCEEDED problem is because the stubs are memoized. gRPC computes the deadline when the stubs are created not when they're used.

@ulfjack
Copy link
Contributor

ulfjack commented Jul 3, 2017

So the deadline isn't applied per call, but per build, which isn't right.

@ulfjack
Copy link
Contributor

ulfjack commented Jul 3, 2017

New error:
Caused by: java.lang.IllegalArgumentException: errors must be non-empty
at com.google.common.base.Preconditions.checkArgument(Preconditions.java:122)
at com.google.devtools.build.lib.util.Preconditions.checkArgument(Preconditions.java:38)
at com.google.devtools.build.lib.remote.Retrier.onFailures(Retrier.java:326)
at com.google.devtools.build.lib.remote.GrpcActionCache.uploadChunks(GrpcActionCache.java:475)
at com.google.devtools.build.lib.remote.GrpcActionCache.ensureInputsPresent(GrpcActionCache.java:188)
at com.google.devtools.build.lib.remote.RemoteSpawnStrategy.exec(RemoteSpawnStrategy.java:278)
at com.google.devtools.build.lib.analysis.actions.SpawnAction.internalExecute(SpawnAction.java:263)
at com.google.devtools.build.lib.analysis.actions.SpawnAction.execute(SpawnAction.java:270)
at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.executeActionTask(SkyframeActionExecutor.java:803)
at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.prepareScheduleExecuteAndCompleteAction(SkyframeActionExecutor.java:746)
at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.access$900(SkyframeActionExecutor.java:103)

@ola-rozenfeld
Copy link
Contributor Author

ola-rozenfeld commented Jul 3, 2017 via email

@ola-rozenfeld
Copy link
Contributor Author

re: errors: I don't see yet how that could happen, but I do know how to fix it (I think). Will add it to the the cl in review.

@ola-rozenfeld
Copy link
Contributor Author

Wait, I just realized I have always had the assumption that the uploaded digests are different. That is incorrect, and in fact I should filter the same ones out before they get into the Chunker. I'm still not sure how that could cause errors to have less values than failedDigests (we modify them at the same time, one is a list, the other is a set), but I am now seeing more evidence that the observer threads do not terminate when I thought they would.

@ulfjack
Copy link
Contributor

ulfjack commented Jul 3, 2017

It might be a bug in my own code. I only rewrote the chunker, the uploader, the downloader, the byte stream server, and the watch server, and the execution server. I might have a few CLs to send out this week...

@ola-rozenfeld
Copy link
Contributor Author

Might be, but now I think it is a bug in my code. Here is what I think happens: the Chunker chunks 5 inputs, out of which there are only 4 different digests (2 inputs happen to have same content). Suppose the doubled file actually fails on upload. Then, the number of failedDigests is actually 1, but the number of uploads I will need to execute again (in the current code) is 2! That screws up my whole multithreading assumption, because I am instantiating the finishLatch to await on one item, when in reality I spawn two threads.
So yeah, fixing, and you might have a few CLs to review this week as well :-)

@ulfjack
Copy link
Contributor

ulfjack commented Jul 4, 2017

I'm afraid we need to rewrite this code entirely to add flow control, for which I have a partial CL, so I wouldn't spend much time on this right now if I was you.

@ulfjack
Copy link
Contributor

ulfjack commented Jul 11, 2017

The OOMs got fixed in dc24004.

The deadline problem got fixed in aff0c67.

The "Text file busy" is still unfixed, but since I introduced a thread pool for the remote executor in bc16cc2, I haven't seen any failures. We can open a new bug for that since we know fairly well what's going on.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P1 I'll work on this now. (Assignee required) type: bug
Projects
None yet
Development

No branches or pull requests

5 participants