Coordinator cannot read task logs from Peon #16518

Open
ksharma-qc opened this issue May 30, 2024 · 0 comments

I have a Druid cluster running on Kubernetes, with port 8100 exposed on both the data-service pod and its Service.

When I run an ingestion task I see the error below, although the task itself runs and ingests the data.

image

image

Affected Version

29.0.1

Description

In the coordinator logs I see the following error, which implies that it cannot reach the data service on port 8100.

[0] coordinator-overlord.log: [[1717080225.505805896, {}], {"log"=>"2024-05-30T14:43:45,505 WARN [qtp1421004802-150] org.apache.druid.indexing.overlord.http.OverlordResource - Failed to stream task reports for task query-9e607dc2-55e8-4b61-9a6a-1953fb8f914e"}]
...
[87] coordinator-overlord.log: [[1717080225.505858796, {}], {"log"=>"   at java.lang.Thread.run(Thread.java:842) ~[?:?]"}]
[88] coordinator-overlord.log: [[1717080225.505859347, {}], {"log"=>"Caused by: java.net.ConnectException: Connection refused: druid-data-0.data-pods/10.244.3.10:8100"}]
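
The WARN is logged by OverlordResource while streaming task reports, so it may be possible to reproduce it outside the console by requesting the report from the Overlord task API directly. The following is a minimal sketch, not a verified reproduction: the OVERLORD address is an assumption (adjust it to wherever the coordinator-overlord service is reachable), and task_url and fetch_task are hypothetical helper names.

```shell
# Sketch only. OVERLORD is an assumed address for the combined
# coordinator-overlord service; change it to match your cluster.
OVERLORD="${OVERLORD:-http://localhost:8081}"

# Build an Overlord task API URL.
# $1 = task id, $2 = resource: reports | log | status
task_url() {
  echo "$OVERLORD/druid/indexer/v1/task/$1/$2"
}

# Fetch a task resource and print the HTTP status code on the last line.
fetch_task() {
  curl -sS -w '\n%{http_code}\n' "$(task_url "$1" "$2")"
}
```

For example, `fetch_task query-9e607dc2-55e8-4b61-9a6a-1953fb8f914e reports` requests the report for the task named in the log above, and replacing `reports` with `log` requests its log. Running both during and after the task may show whether only the reports call depends on the Peon being alive.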

The port is indeed open and the Peon service is listening on it during task execution, which I've verified by running this:

while true; do
  nc -vz druid-data-0.data-pods 8100
  sleep 1
done

This prints:

# Before running ingestion task
nc: connect to druid-data-0.data-pods (10.244.3.14) port 8100 (tcp) failed: Connection refused
nc: connect to druid-data-0.data-pods (10.244.3.14) port 8100 (tcp) failed: Connection refused
nc: connect to druid-data-0.data-pods (10.244.3.14) port 8100 (tcp) failed: Connection refused
nc: connect to druid-data-0.data-pods (10.244.3.14) port 8100 (tcp) failed: Connection refused

# During the task
Connection to druid-data-0.data-pods (10.244.3.14) 8100 port [tcp/*] succeeded!
Connection to druid-data-0.data-pods (10.244.3.14) 8100 port [tcp/*] succeeded!
Connection to druid-data-0.data-pods (10.244.3.14) 8100 port [tcp/*] succeeded!

# After the task
nc: connect to druid-data-0.data-pods (10.244.3.14) port 8100 (tcp) failed: Connection refused
nc: connect to druid-data-0.data-pods (10.244.3.14) port 8100 (tcp) failed: Connection refused
nc: connect to druid-data-0.data-pods (10.244.3.14) port 8100 (tcp) failed: Connection refused
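
To line the port state up against the timestamp of the WARN in the coordinator log, the same probe can print a timestamped line only when the state changes. This is a rough sketch assuming `nc` supports `-z` and `-w` and that both clocks are compared in the same timezone; probe, only_changes, and watch_port are hypothetical helper names.

```shell
# Sketch only: report when a TCP port flips between open and closed.

# Print "open" if a TCP connect succeeds within 1 second, else "closed".
# $1 = host, $2 = port
probe() {
  if nc -z -w 1 "$1" "$2" >/dev/null 2>&1; then
    echo open
  else
    echo closed
  fi
}

# Read "<timestamp> <state>" lines on stdin and print only the lines
# whose state differs from the previous line.
only_changes() {
  prev=""
  while read -r ts state; do
    if [ "$state" != "$prev" ]; then
      echo "$ts $state"
      prev=$state
    fi
  done
}

# Emit one "<UTC timestamp> <state>" line per second, forever.
# $1 = host, $2 = port
watch_port() {
  while true; do
    echo "$(date -u +%Y-%m-%dT%H:%M:%S) $(probe "$1" "$2")"
    sleep 1
  done
}
```

Running `watch_port druid-data-0.data-pods 8100 | only_changes` while a task executes should print one line when the Peon starts listening and another when it exits. Those two timestamps can then be compared with the time of the "Connection refused" entry.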

To be honest, I'm confused as to why the Coordinator is even trying to reach the Peon for task logs. Given that Peons are ephemeral and exit once the task is done, it would make much more sense to get those logs from the MiddleManager, whose job is to manage the Peons.

What am I missing?
