Remote Worker failures on bazel build #3251

ola-rozenfeld · 2017-06-22T23:44:22Z

When trying to build Bazel using the RemoteWorker, I get intermittent errors such as:

ERROR: /usr/local/google/home/olaola/full-bazel/third_party/BUILD:75:1: Extracting interface //third_party:android_common_25_0_0 failed: ijar failed: error executing command
(cd /usr/local/google/home/olaola/.cache/bazel/_bazel_olaola/098d19699a9b31c0ee0da4e8c41e90b9/execroot/io_bazel &&
exec env -
PATH=/usr/local/google/home/olaola/google-cloud-sdk/bin:/usr/local/google/home/olaola/work/bin:/usr/local/google/home/olaola/depot_tools:/usr/local/google/home/olaola/bin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/google/home/olaola/google-cloud-sdk/bin:/usr/lib/google-golang/bin:/usr/local/buildtools/java/jdk/bin:/usr/local/sbin
external/bazel_tools/tools/jdk/ijar/ijar third_party/android_common/com.android_annotations_25.0.0.jar bazel-out/local-fastbuild/genfiles/third_party/_ijar/android_common_25_0_0/third_party/android_common/com.android_annotations_25.0.0-ijar.jar): Exit -1.
ERROR: /usr/local/google/home/olaola/full-bazel/third_party/BUILD:75:1: output 'third_party/_ijar/android_common_25_0_0/third_party/android_common/com.android.tools.lint_lint-api_25.0.0-ijar.jar' was not created.

The command actually failed on the RemoteWorker, this is not a gRPC issue.

ola-rozenfeld · 2017-06-23T00:50:14Z

Another manifestation of the same error (I think):

ERROR: /usr/local/google/home/olaola/.cache/bazel/_bazel_olaola/098d19699a9b31c0ee0da4e8c41e90b9/external/com_google_protobuf_java/BUILD.bazel:1:1: Extracting interface @com_google_protobuf_java//:protobuf failed: ijar failed: error executing command
(cd /usr/local/google/home/olaola/.cache/bazel/_bazel_olaola/098d19699a9b31c0ee0da4e8c41e90b9/execroot/io_bazel &&
exec env -
PATH=/usr/local/google/home/olaola/google-cloud-sdk/bin:/usr/local/google/home/olaola/work/bin:/usr/local/google/home/olaola/depot_tools:/usr/local/google/home/olaola/bin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/google/home/olaola/google-cloud-sdk/bin:/usr/lib/google-golang/bin:/usr/local/buildtools/java/jdk/bin:/usr/local/sbin
external/bazel_tools/tools/jdk/ijar/ijar external/com_google_protobuf_java/libprotobuf_java.jar bazel-out/host/genfiles/external/com_google_protobuf_java/_ijar/protobuf/external/com_google_protobuf_java/libprotobuf_java-ijar.jar): Exit -1.

ulfjack · 2017-06-28T09:31:41Z

No idea. There are some cases where we still swallow exceptions from the underlying libraries, and I suspect it's one of those. I'll try to see if I can reproduce this with a patched binary that produces more debugging output.

ulfjack · 2017-06-28T09:51:18Z

I got a stack trace.

java.io.IOException: Cannot run program "/tmp/remote.pTK0EkckA/build-0121c395-7267-4941-abb9-405621684cdf/external/bazel_tools/tools/jdk/ijar/ijar" (in directory "/tmp/remote.pTK0EkckA/build-0121c395-7267-4941-abb9-405621684cdf"): error=26, Text file busy
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
	at com.google.devtools.build.lib.shell.JavaSubprocessFactory.create(JavaSubprocessFactory.java:116)
	at com.google.devtools.build.lib.shell.SubprocessBuilder.start(SubprocessBuilder.java:194)
	at com.google.devtools.build.lib.shell.Command.startProcess(Command.java:738)
	at com.google.devtools.build.lib.shell.Command.doExecute(Command.java:705)
	at com.google.devtools.build.lib.shell.Command.execute(Command.java:418)
	at com.google.devtools.build.remote.WatcherServer.execute(WatcherServer.java:250)
	at com.google.devtools.build.remote.WatcherServer.watch(WatcherServer.java:318)
	at com.google.watcher.v1.WatcherGrpc$MethodHandlers.invoke(WatcherGrpc.java:214)
	at io.grpc.stub.ServerCalls$1$1.onHalfClose(ServerCalls.java:148)
	at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:262)
	at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$2.runInContext(ServerImpl.java:572)
	at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:52)
	at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:117)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: error=26, Text file busy
	at java.lang.UNIXProcess.forkAndExec(Native Method)
	at java.lang.UNIXProcess.<init>(UNIXProcess.java:247)
	at java.lang.ProcessImpl.start(ProcessImpl.java:134)
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
	... 16 more

ulfjack · 2017-06-28T11:13:50Z

Well, it's pretty obvious what's going wrong now. Every time we're running an action we download all the relevant inputs and put them all into the same directory, silently overwriting any existing files. If we run two actions in parallel, we can be writing to a file just as we want to execute it, and Linux complains about that.

philwo · 2017-06-28T11:18:47Z

ehehehe... nice find, @ulfjack.

ulfjack · 2017-06-28T11:23:18Z

Err, no, that's not right. It is putting them in different directories.

ulfjack · 2017-06-28T11:27:51Z

The RemoteWorker is running the action for every watch request it gets. My new hypothesis is that we get the same watch request twice....

ulfjack · 2017-06-28T11:42:35Z

WatcherServer has a check-then-act race, in which case it can throw a NPE. If we can get multiple watch requests for the same action, then it can also return Code.NOT_FOUND.

ola-rozenfeld · 2017-06-28T13:56:23Z

Yeah, my WatcherServer is very simplistic, it strongly counts on the client only calling one request. But how is this assumption violated? I indeed only call watch once per action, GrpcRemoteExecutor.java line 96

…

On Wed, Jun 28, 2017 at 7:42 AM, Ulf Adams ***@***.***> wrote: WatcherServer has a check-then-act race, in which case it can throw a NPE. If we can get multiple watch requests for the same action, then it can also return Code.NOT_FOUND. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#3251 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AYoKuCZ7ZVfpPoM3JKWS1-iLReK7bDbgks5sIjw2gaJpZM4OC-X_> .

ulfjack · 2017-06-28T14:09:41Z

I don't know yet. I added some more debug output, and am running builds trying to reproduce, but no success so far.

ola-rozenfeld · 2017-06-28T14:18:52Z

In the mean time, I will improve the Watch server to properly synchronize the operations map, and see if it fixes this. Of course, understanding what actually caused the error would be much better, but I'll settle for second best :-)

…

On Wed, Jun 28, 2017 at 10:10 AM, Ulf Adams ***@***.***> wrote: I don't know yet. I added some more debug output, and am running builds trying to reproduce, but no success so far. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#3251 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AYoKuBYgStExnrFNGyWlKTqM4eo0ZlZtks5sIl7AgaJpZM4OC-X_> .

ulfjack · 2017-06-30T07:54:44Z

I haven't been able to reproduce the error in the last two days. :-( I only have logging changes in the code compared to the stack trace I posted - how can that be?

ulfjack · 2017-06-30T08:32:34Z

I can reproduce reliably by increasing the number of threads. I think it might be this bug:
https://bugs.openjdk.java.net/browse/JDK-8068370

ulfjack · 2017-06-30T08:51:12Z

As a workaround, I can make it so the remote worker retires an action if i get the "error=26, Text file busy" error. That might be good enough for now. Unfortunately, if upstream doesn't want to fix this, we'll have to write our own.

ulfjack · 2017-06-30T09:59:32Z

I get a new error now:

ERROR: /usr/local/google/home/ulfjack/.cache/bazel/_bazel_ulfjack/d25975f09249f56db00255081e2885e2/external/com_google_protobuf/BUILD.bazel:423:1: C++ compilation of rule '@com_google_protobuf//:protoc_lib' failed: io.grpc.StatusRuntimeException: UNKNOWN

This is without the retry logic.

ulfjack · 2017-06-30T10:31:08Z

More fun exceptions:

    12:30:13.594  ... Start 7c34e712-3c3f-495e-b899-f36714f231f7
    12:30:13.594  ... Working dir /tmp/remote.2q8eAE27O/build-7c34e712-3c3f-495e-b899-f36714f231f7
    12:30:13.594  ... 2 input files & 1 output files
    12:30:13.604  ... external/bazel_tools/tools/jdk/ijar/ijar third_party/asm/asm-commons-5.1.jar bazel-out/host/genfiles/third_party/_ijar/asm-commons/third_par
                      ty/asm/asm-commons-5.1-ijar.jar
    12:30:13.604  ... path=/tmp/remote.2q8eAE27O/build-7c34e712-3c3f-495e-b899-f36714f231f7

java.lang.NullPointerException
	at com.google.devtools.build.remote.WatcherServer.execute(WatcherServer.java:271)
	at com.google.devtools.build.remote.WatcherServer.watch(WatcherServer.java:341)
	at com.google.watcher.v1.WatcherGrpc$MethodHandlers.invoke(WatcherGrpc.java:214)
	at io.grpc.stub.ServerCalls$1$1.onHalfClose(ServerCalls.java:148)
	at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:262)
	at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$2.runInContext(ServerImpl.java:572)
	at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:52)
	at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:117)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

ulfjack · 2017-06-30T11:04:36Z

Ignore that last one, that's a bug in my change.

ulfjack · 2017-06-30T13:05:43Z

Well, testing is difficult, since it's using up all memory in my machine, and making it impossible for me to use the machine until I physically switch it off and on again.

ola-rozenfeld · 2017-06-30T13:15:11Z

That's a great find! Doesn't look like they are fixing this. If retrying this error works, great. Now, if we only had a generic retrier... :-)

…

On Fri, Jun 30, 2017 at 9:06 AM, Ulf Adams ***@***.***> wrote: Well, testing is difficult, since it's using up all memory in my machine, and making it impossible for me to use the machine until I physically switch it off and on again. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#3251 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AYoKuPxBHuGOsycTbd1CT8Lcw70v6PMvks5sJPLAgaJpZM4OC-X_> .

ulfjack · 2017-06-30T14:37:19Z

This is happening with as few as a dozen simultaneous requests, with a single client on a single machine. I basically haven't been able to build bazel on my machine with the remote worker if I change jobs to a bit more than I have cores in my machine. That's a really bad experience, because I want to have a higher number of jobs when I get cache hits, as that significantly improves performance when a lot of the requests are cache hits. But I can't know ahead of time whether the actions will be cache hits or not, and manually tweaking jobs on the client isn't a good UI either.

ola-rozenfeld · 2017-06-30T14:46:49Z

I wonder if this behavior is also related to the switching back to a single gRPC channel change. We have experienced a much higher volume of 502 due to this on our end, too (I think you're cc-ed on the internal bug). I'm considering implementing a gRPC channel pool (want to get some advice from gRPC folks before doing that). Actually, let me check if that makes a difference.

…

On Fri, Jun 30, 2017 at 10:37 AM, Ulf Adams ***@***.***> wrote: This is happening with as few as a dozen simultaneous requests, with a single client on a single machine. I basically haven't been able to build bazel on my machine with the remote worker if I change jobs to a bit more than I have cores in my machine. That's a really bad experience, because I want to have a higher number of jobs when I get cache hits, as that significantly improves performance when a lot of the requests are cache hits. But I can't know ahead of time whether the actions will be cache hits or not, and manually tweaking jobs on the client isn't a good UI either. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#3251 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AYoKuMlWrbpuLOXvsluf-nQF55Sx21dxks5sJQgpgaJpZM4OC-X_> .

ola-rozenfeld · 2017-06-30T14:53:59Z

Also: re: queueing -- if we had explicit queueing in the RemoteWorker, we'd still have to increase the remote_timeout in Bazel to account for the queueing delays. I am wondering, however, if a thread actually gets starved somewhere, because 6 attempts on a 60 seconds timeout should equal lots of time. Does that really take 6 minutes before failing? Can you please paste the --verbose_failures output?

…

On Fri, Jun 30, 2017 at 10:46 AM, Ola Rozenfeld ***@***.***> wrote: I wonder if this behavior is also related to the switching back to a single gRPC channel change. We have experienced a much higher volume of 502 due to this on our end, too (I think you're cc-ed on the internal bug). I'm considering implementing a gRPC channel pool (want to get some advice from gRPC folks before doing that). Actually, let me check if that makes a difference. On Fri, Jun 30, 2017 at 10:37 AM, Ulf Adams ***@***.***> wrote: > This is happening with as few as a dozen simultaneous requests, with a > single client on a single machine. I basically haven't been able to build > bazel on my machine with the remote worker if I change jobs to a bit more > than I have cores in my machine. That's a really bad experience, because I > want to have a higher number of jobs when I get cache hits, as that > significantly improves performance when a lot of the requests are cache > hits. But I can't know ahead of time whether the actions will be cache hits > or not, and manually tweaking jobs on the client isn't a good UI either. > > — > You are receiving this because you authored the thread. > Reply to this email directly, view it on GitHub > <#3251 (comment)>, > or mute the thread > <https://github.com/notifications/unsubscribe-auth/AYoKuMlWrbpuLOXvsluf-nQF55Sx21dxks5sJQgpgaJpZM4OC-X_> > . >

ulfjack · 2017-06-30T15:04:09Z

It definitely does not take 6 minutes to the DEADLINE_EXCEEDED. It takes a bit more than 60 seconds, so maybe retries don't work properly?

I don't think we'd need to increase the timeout - we'd queue the action for execution, but report back regularly via the Watcher API. I thought that's why we're using a long running operation?

ola-rozenfeld · 2017-06-30T15:14:09Z

Yeah, there's definitely something wrong there. I am looking into it as well. Sorry, you're right, that's why I didn't set a timeout on the Watcher API at all. That's why I wanted to see the stack trace -- I suspect it's not the Watcher call that hits this, or that I have a bug in dealing with command execution timeout (which throws DEADLINE_EXCEEDED, but then is supposed to put it onto the Operation error field instead of propagating via onError).

…

On Fri, Jun 30, 2017 at 11:04 AM, Ulf Adams ***@***.***> wrote: It definitely does *not* take 6 minutes to the DEADLINE_EXCEEDED. It takes a bit more than 60 seconds, so maybe retries don't work properly? I don't think we'd need to increase the timeout - we'd queue the action for execution, but report back regularly via the Watcher API. I thought that's why we're using a long running operation? — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#3251 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AYoKuMJzLcQZjwMrjT-Hy4_8TUQG1dUxks5sJQ5zgaJpZM4OC-X_> .

ola-rozenfeld · 2017-06-30T15:16:44Z

Btw, another small issue I need to fix -- when the server fails on an operation, e.g. IOException, it gets raised as INTERNAL_ERROR, but then a simple retry obviously hits a NOT_FOUND -- which hides the problem. The bug is that I need to retry both execute and watch on a NOT_FOUND.

…

On Fri, Jun 30, 2017 at 11:13 AM, Ola Rozenfeld ***@***.***> wrote: Yeah, there's definitely something wrong there. I am looking into it as well. Sorry, you're right, that's why I didn't set a timeout on the Watcher API at all. That's why I wanted to see the stack trace -- I suspect it's not the Watcher call that hits this, or that I have a bug in dealing with command execution timeout (which throws DEADLINE_EXCEEDED, but then is supposed to put it onto the Operation error field instead of propagating via onError). On Fri, Jun 30, 2017 at 11:04 AM, Ulf Adams ***@***.***> wrote: > It definitely does *not* take 6 minutes to the DEADLINE_EXCEEDED. It > takes a bit more than 60 seconds, so maybe retries don't work properly? > > I don't think we'd need to increase the timeout - we'd queue the action > for execution, but report back regularly via the Watcher API. I thought > that's why we're using a long running operation? > > — > You are receiving this because you authored the thread. > Reply to this email directly, view it on GitHub > <#3251 (comment)>, > or mute the thread > <https://github.com/notifications/unsubscribe-auth/AYoKuMJzLcQZjwMrjT-Hy4_8TUQG1dUxks5sJQ5zgaJpZM4OC-X_> > . >

ulfjack · 2017-06-30T15:49:05Z

I get this fun error in Bazel:

Server terminated abruptly (error code: 14, error message: '', log file: '/usr/local/google/home/ulfjack/.cache/bazel/_bazel_ulfjack/5c706281d73fa9baac5a257912930c73/server/jvm.out')

ulfjack · 2017-06-30T15:52:48Z

Here's the --verbose_failures output from a DEADLINE_EXCEEDED, which I just managed to reproduce again...

WARNING: Turbine remote work failed (after 6 attempts: io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED)
after 6 attempts: io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED
	at com.google.devtools.build.lib.remote.Retrier.onFailures(Retrier.java:333)
	at com.google.devtools.build.lib.remote.Retrier.onFailure(Retrier.java:312)
	at com.google.devtools.build.lib.remote.Retrier.execute(Retrier.java:291)
	at com.google.devtools.build.lib.remote.GrpcActionCache.download(GrpcActionCache.java:219)
	at com.google.devtools.build.lib.remote.RemoteSpawnStrategy.exec(RemoteSpawnStrategy.java:292)
	at com.google.devtools.build.lib.rules.java.JavaHeaderCompileAction.internalExecute(JavaHeaderCompileAction.java:136)
	at com.google.devtools.build.lib.analysis.actions.SpawnAction.execute(SpawnAction.java:270)
	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.executeActionTask(SkyframeActionExecutor.java:803)
	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.prepareScheduleExecuteAndCompleteAction(SkyframeActionExecutor.java:746)
	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.access$900(SkyframeActionExecutor.java:103)
	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.call(SkyframeActionExecutor.java:648)
	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.call(SkyframeActionExecutor.java:605)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.executeAction(SkyframeActionExecutor.java:395)
	at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.checkCacheAndExecuteIfNeeded(ActionExecutionFunction.java:440)
	at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.compute(ActionExecutionFunction.java:201)
	at com.google.devtools.build.skyframe.ParallelEvaluator$Evaluate.run(ParallelEvaluator.java:401)
	at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:352)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED
	at io.grpc.Status.asRuntimeException(Status.java:531)
	... 21 more

ulfjack · 2017-06-30T15:53:22Z

INFO: Elapsed time: 71.334s, Critical Path: 61.36s

ulfjack · 2017-06-30T15:57:58Z

Ok, I have a set of changes that make the remote worker not completely lock up my machine. But I still get DEADLINE_EXCEEDED when building from scratch, and also the odd Bazel server crash with no error messages and no output (except as noted above).

ulfjack · 2017-07-03T08:08:25Z

And apparently, netty is allocating huge amounts of direct memory. Maybe we're holding it wrong?

io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 16777216 byte(s) of direct memory (used: 8120172847, max: 8135442432)

ulfjack · 2017-07-03T08:51:13Z

Ok, I can at least confirm that the "Server terminated abruptly" and the "OutOfDirectMemoryError" are connected. We're actually redirecting stdout / stderr within Java to go over gRPC, so we can loose error messages if gRPC crashes. I've removed the redirection (instead, it goes to an output file), and now I can see the exception stack traces.

ulfjack · 2017-07-03T10:09:39Z

Actually, there are more cases where we get "Server terminated abruptly". Apparently the remote execution use of gRPC is starving out the client/server connection and the client is loosing the connection, even though the server is still running.

ulfjack · 2017-07-03T11:16:28Z

Maybe it's getting 'deadline exceeded'?

ulfjack · 2017-07-03T11:21:09Z

Status 14 is UNAVAILABLE according to StatusCode.

ulfjack · 2017-07-03T12:49:07Z

io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 16777216 byte(s) of direct memory (used: 8120172847, max: 8135442432)
at io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:624)
at io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:578)
at io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:718)
at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:707)
at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:239)
at io.netty.buffer.PoolArena.allocate(PoolArena.java:209)
at io.netty.buffer.PoolArena.allocate(PoolArena.java:141)
at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:287)
at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:179)
at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:170)
at io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:131)
at io.netty.channel.DefaultMaxMessagesRecvByteBufAllocator$MaxMessageHandle.allocate(DefaultMaxMessagesRecvByteBufAllocator.java:73)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:117)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:642)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:565)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:479)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:441)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:745)

ulfjack · 2017-07-03T17:13:15Z

We think the main problem is that there's no flow control at the application level - we're queueing all the chunks at once which means reading all the files for upload into memory at the same time.

ulfjack · 2017-07-03T18:47:54Z

The DEADLINE_EXCEEDED problem is because the stubs are memoized. gRPC computes the deadline when the stubs are created not when they're used.

ulfjack · 2017-07-03T18:49:06Z

So the deadline isn't applied per call, but per build, which isn't right.

ulfjack · 2017-07-03T18:57:51Z

New error:
Caused by: java.lang.IllegalArgumentException: errors must be non-empty
at com.google.common.base.Preconditions.checkArgument(Preconditions.java:122)
at com.google.devtools.build.lib.util.Preconditions.checkArgument(Preconditions.java:38)
at com.google.devtools.build.lib.remote.Retrier.onFailures(Retrier.java:326)
at com.google.devtools.build.lib.remote.GrpcActionCache.uploadChunks(GrpcActionCache.java:475)
at com.google.devtools.build.lib.remote.GrpcActionCache.ensureInputsPresent(GrpcActionCache.java:188)
at com.google.devtools.build.lib.remote.RemoteSpawnStrategy.exec(RemoteSpawnStrategy.java:278)
at com.google.devtools.build.lib.analysis.actions.SpawnAction.internalExecute(SpawnAction.java:263)
at com.google.devtools.build.lib.analysis.actions.SpawnAction.execute(SpawnAction.java:270)
at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.executeActionTask(SkyframeActionExecutor.java:803)
at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.prepareScheduleExecuteAndCompleteAction(SkyframeActionExecutor.java:746)
at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.access$900(SkyframeActionExecutor.java:103)

ola-rozenfeld · 2017-07-03T19:03:20Z

Ouch, sorry about that! Let me send you a CL right away.

…

On Mon, Jul 3, 2017 at 2:49 PM, Ulf Adams ***@***.***> wrote: So the deadline isn't applied per call, but per build, which isn't right. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#3251 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AYoKuGPQBlnzSObrbHy7iHJ19V12HwYdks5sKTesgaJpZM4OC-X_> .

ola-rozenfeld · 2017-07-03T19:06:45Z

re: errors: I don't see yet how that could happen, but I do know how to fix it (I think). Will add it to the the cl in review.

ola-rozenfeld · 2017-07-03T20:01:11Z

Wait, I just realized I have always had the assumption that the uploaded digests are different. That is incorrect, and in fact I should filter the same ones out before they get into the Chunker. I'm still not sure how that could cause errors to have less values than failedDigests (we modify them at the same time, one is a list, the other is a set), but I am now seeing more evidence that the observer threads do not terminate when I thought they would.

ulfjack · 2017-07-03T20:09:35Z

It might be a bug in my own code. I only rewrote the chunker, the uploader, the downloader, the byte stream server, and the watch server, and the execution server. I might have a few CLs to send out this week...

ola-rozenfeld · 2017-07-03T20:16:09Z

Might be, but now I think it is a bug in my code. Here is what I think happens: the Chunker chunks 5 inputs, out of which there are only 4 different digests (2 inputs happen to have same content). Suppose the doubled file actually fails on upload. Then, the number of failedDigests is actually 1, but the number of uploads I will need to execute again (in the current code) is 2! That screws up my whole multithreading assumption, because I am instantiating the finishLatch to await on one item, when in reality I spawn two threads.
So yeah, fixing, and you might have a few CLs to review this week as well :-)

ulfjack · 2017-07-04T07:57:48Z

I'm afraid we need to rewrite this code entirely to add flow control, for which I have a partial CL, so I wouldn't spend much time on this right now if I was you.

ulfjack · 2017-07-11T11:20:14Z

The OOMs got fixed in dc24004.

The deadline problem got fixed in aff0c67.

The "Text file busy" is still unfixed, but since I introduced a thread pool for the remote executor in bc16cc2, I haven't seen any failures. We can open a new bug for that since we know fairly well what's going on.

hlopko assigned philwo Jun 26, 2017

hlopko added category: rules > java P2 We'll consider working on this in future. (Assignee optional) type: bug labels Jun 26, 2017

hlopko assigned iirina Jun 26, 2017

ola-rozenfeld mentioned this issue Jun 26, 2017

Bazel at HEAD gets DEADLINE_EXCEEDED with remote strategy #3248

Closed

ulfjack assigned ulfjack and unassigned philwo and iirina Jun 28, 2017

ulfjack added P1 I'll work on this now. (Assignee required) and removed P2 We'll consider working on this in future. (Assignee optional) labels Jun 28, 2017

ulfjack closed this as completed Jul 11, 2017

ulfjack mentioned this issue Jul 11, 2017

Remote worker execution sometimes fails with exit -1 #3356

Closed

nelhage mentioned this issue Oct 10, 2017

Default local resources should respect cgroup limits on Linux #3886

Closed

Remote Worker failures on bazel build #3251

Remote Worker failures on bazel build #3251

Comments

ola-rozenfeld commented Jun 22, 2017

ola-rozenfeld commented Jun 23, 2017

ulfjack commented Jun 28, 2017

ulfjack commented Jun 28, 2017

ulfjack commented Jun 28, 2017

philwo commented Jun 28, 2017

ulfjack commented Jun 28, 2017

ulfjack commented Jun 28, 2017

ulfjack commented Jun 28, 2017

ola-rozenfeld commented Jun 28, 2017 via email

ulfjack commented Jun 28, 2017

ola-rozenfeld commented Jun 28, 2017 via email

ulfjack commented Jun 30, 2017

ulfjack commented Jun 30, 2017

ulfjack commented Jun 30, 2017

ulfjack commented Jun 30, 2017

ulfjack commented Jun 30, 2017

ulfjack commented Jun 30, 2017

ulfjack commented Jun 30, 2017

ola-rozenfeld commented Jun 30, 2017 via email

ulfjack commented Jun 30, 2017

ola-rozenfeld commented Jun 30, 2017 via email

ola-rozenfeld commented Jun 30, 2017 via email

ulfjack commented Jun 30, 2017

ola-rozenfeld commented Jun 30, 2017 via email

ola-rozenfeld commented Jun 30, 2017 via email

ulfjack commented Jun 30, 2017

ulfjack commented Jun 30, 2017

ulfjack commented Jun 30, 2017

ulfjack commented Jun 30, 2017

ulfjack commented Jul 3, 2017

ulfjack commented Jul 3, 2017

ulfjack commented Jul 3, 2017

ulfjack commented Jul 3, 2017

ulfjack commented Jul 3, 2017

ulfjack commented Jul 3, 2017

ulfjack commented Jul 3, 2017

ulfjack commented Jul 3, 2017

ulfjack commented Jul 3, 2017

ulfjack commented Jul 3, 2017

ola-rozenfeld commented Jul 3, 2017 via email

ola-rozenfeld commented Jul 3, 2017

ola-rozenfeld commented Jul 3, 2017

ulfjack commented Jul 3, 2017

ola-rozenfeld commented Jul 3, 2017

ulfjack commented Jul 4, 2017

ulfjack commented Jul 11, 2017