Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flaky Test: WorkerTaskManagerTest and stuck CompactionTaskRunTest #14406

Closed
abhishekrb19 opened this issue Jun 12, 2023 · 2 comments
Closed

Flaky Test: WorkerTaskManagerTest and stuck CompactionTaskRunTest #14406

abhishekrb19 opened this issue Jun 12, 2023 · 2 comments
Labels

Comments

@abhishekrb19
Copy link
Contributor

Link to failed GHA runner: https://github.com/apache/druid/actions/runs/5196957717/jobs/9379949692

At first glance, the "indexing module test" seems greyed out. It seems like the "Operation was canceled" after running for close to 6 hours -- I think the timeout is just a symptom (see more details below).

It's useful to look at the raw logs as it includes timestamps to understand the timeline of events. From the logs, the first indexing unit test that failed WorkerTaskManagerTest:

2023-06-07T13:51:10.3414768Z [ERROR] org.apache.druid.indexing.worker.WorkerTaskManagerTest.testTaskRun[restoreTasksOnRestart = false, useMultipleBaseTaskDirPaths = {1}]  Time elapsed: 0.243 s  <<< FAILURE!
2023-06-07T13:51:10.3415800Z java.lang.AssertionError
2023-06-07T13:51:10.3416587Z 	at org.junit.Assert.fail(Assert.java:87)
2023-06-07T13:51:10.3417001Z 	at org.junit.Assert.assertTrue(Assert.java:42)
2023-06-07T13:51:10.3417590Z 	at org.junit.Assert.assertTrue(Assert.java:53)
2023-06-07T13:51:10.3418398Z 	at org.apache.druid.indexing.worker.WorkerTaskManagerTest.testTaskRun(WorkerTaskManagerTest.java:220)
2023-06-07T13:51:10.3419078Z 	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
2023-06-07T13:51:10.3420106Z 	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
2023-06-07T13:51:10.3421402Z 	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
2023-06-07T13:51:10.3422175Z 	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
2023-06-07T13:51:10.3422901Z 	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
2023-06-07T13:51:10.3423510Z 	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
2023-06-07T13:51:10.3424150Z 	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
2023-06-07T13:51:10.3425192Z 	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
2023-06-07T13:51:10.3425886Z 	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:299)
2023-06-07T13:51:10.3426583Z 	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:293)
2023-06-07T13:51:10.3427179Z 	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
2023-06-07T13:51:10.3427633Z 	at java.base/java.lang.Thread.run(Thread.java:829)

Shortly following the above failure, we can observe a series of indexing unit tests fail - RangePartitionAdjustingCorePartitionSizeTest , MultiPhaseParallelIndexingRowStatsTest, etc.

Finally, the indexing test module gets stuck in CompactionTaskRunTest -- for almost 5h30m before what seems like the runner comes along and kills the job around the 6h mark:

2023-06-07T14:18:54.8086117Z [INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.002 s - in org.apache.druid.indexing.common.task.SpecificSegmentsSpecTest
2023-06-07T14:18:54.8090339Z [INFO] Running org.apache.druid.indexing.common.task.CompactionTaskRunTest
2023-06-07T19:45:01.7866391Z ##[error]The operation was canceled.

Potential causes

  1. Given the location of the first test failure, WorkerTaskManagerTest, it seems like completed task directory creation failed -- this makes me wonder if the GHA runner somehow was out of storage space.
  2. The assertion failure is also happening in code that recently changed (about 2 months ago), so there could be a bug that got recently introduced: https://github.com/apache/druid/blob/master/indexing-service/src/test/java/org/apache/druid/indexing/worker/WorkerTaskManagerTest.java#L220

It might be worthwhile looking into why CompactionTaskRunTest was stuck and didn't seem to make any progress for 5h 30m. I wonder if it was blocked on some IO related to 1. The test framework should've timed out for such long-running tests.

Copy link

This issue has been marked as stale due to 280 days of inactivity.
It will be closed in 4 weeks if no further activity occurs. If this issue is still
relevant, please simply write any comment. Even if closed, you can still revive the
issue at any time or discuss it on the dev@druid.apache.org list.
Thank you for your contributions.

@github-actions github-actions bot added the stale label Mar 19, 2024
Copy link

This issue has been closed due to lack of activity. If you think that
is incorrect, or the issue requires additional review, you can revive the issue at
any time.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Apr 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

1 participant