High kafka partition count causing lag metrics to be dropped #15655

Open
karthikgurram87 opened this issue Jan 10, 2024 · 0 comments


karthikgurram87 commented Jan 10, 2024

Setup
Source: a single Kafka topic
Partitions: 400
Metrics Emitter: StatsDEmitter
Issue: Overlord dropping metrics

Description

We have a Druid setup that consumes from a Kafka topic with approximately 400 partitions. We recently upgraded to Druid 25 and enabled the ingest/kafka/partitionLag metric. Since then we have noticed that the following metrics

  • ingest/kafka/partitionLag
  • ingest/kafka/maxLag
  • ingest/notices/queueSize

are frequently being dropped at the Overlord. We have not seen this issue with any other metrics.

We use the StatsDEmitter to send the metrics, which eventually end up in Datadog. We have ruled out the other places where the metrics could be dropped and rely on dogstatsd.client.packets_dropped to detect drops. However, the telemetry metrics available in the StatsDProcessor do not carry any tags that would let us associate the dropped packets with a specific node. Attaching a screenshot.
[Screenshot 2024-01-10 at 3:01:06 PM]
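
For context, this is roughly what a dogstatsd client with telemetry enabled looks like (a simplified sketch: the builder options are assumed from recent java-dogstatsd-client releases, the prefix/host/port/tag values are placeholders, and this is not the exact StatsDEmitter wiring). The constantTags show up on the metrics the client sends for us, but the datadog.dogstatsd.client.* telemetry it produces does not appear to carry anything node-specific, which is why the dropped-packet counts cannot be attributed to a particular node.

```java
import com.timgroup.statsd.NonBlockingStatsDClientBuilder;
import com.timgroup.statsd.StatsDClient;

// Sketch only: builder method names are assumed from recent java-dogstatsd-client
// releases, and the prefix/host/port/tags are placeholders.
public class TaggedStatsDClientSketch
{
  public static StatsDClient build()
  {
    return new NonBlockingStatsDClientBuilder()
        .prefix("druid")
        .hostname("127.0.0.1")
        .port(8125)
        // node-identifying tags attached to the metrics we emit ourselves
        .constantTags("druid_service:overlord", "druid_host:overlord-0")
        // telemetry is where datadog.dogstatsd.client.packets_dropped comes from,
        // but (as described above) it has no node-identifying tags of its own
        .enableTelemetry(true)
        .build();
  }
}
```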

ingest/kafka/maxLag is crucial to us as we rely on it extensively for alerting. We use ingest/kafka/partitionLag to identify the partitions that lag the most.

The metrics are not dropped if we disable the partitionLag metric. The high number of partitions causes some of the metrics to be dropped in the StatsDSender because its outbound queue frequently becomes full, as indicated in the screenshot above.
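
To put rough numbers on that burst (the figures below are assumptions for illustration, not measurements): each lag-reporting cycle emits one partitionLag gauge per partition, so a 400-partition topic pushes roughly 400 packets into the client's outbound queue at essentially the same instant, on top of the aggregate lag metrics and everything else the Overlord emits.

```java
// Back-of-the-envelope sketch; the reporting cadence is an assumption, not a measured value.
public class LagBurstEstimate
{
  public static void main(String[] args)
  {
    int partitions = 400;              // partitions on our topic
    int lagReportsPerMinute = 2;       // assumed supervisor lag-reporting cadence
    int packetsPerBurst = partitions;  // one ingest/kafka/partitionLag gauge per partition

    System.out.printf(
        "~%d statsd packets per burst, ~%d per minute for partitionLag alone%n",
        packetsPerBurst,
        packetsPerBurst * lagReportsPerMinute
    );
  }
}
```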

Proposal

  1. Pass the tags available in the StatsDEmitter on to the telemetry metrics in the StatsDSender.
  2. Emit partitionLag from a new ScheduledExecutorService with a configurable emissionPeriod in SeekableStreamSupervisor, putting a random delay between each emit so that the total delay stays below emissionPeriod (rough sketch below).

We could also emit only the top n lagging partitions, but that would not produce a continuous timeseries graph, so it is not our preferred option.
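
To make item 2 concrete, here is a rough sketch (class and member names such as PartitionLagEmitter, lagProvider, and LagGauge are illustrative, not existing SeekableStreamSupervisor code): a dedicated ScheduledExecutorService fires every emissionPeriod and spreads the per-partition gauges across the period with random delays, so the cumulative delay stays below emissionPeriod and the StatsD outbound queue never receives the whole burst at once.

```java
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Illustrative sketch only; none of these names exist in Druid today.
public class PartitionLagEmitter
{
  private final ScheduledExecutorService exec = Executors.newSingleThreadScheduledExecutor();
  private final Supplier<Map<Integer, Long>> lagProvider; // partition -> lag, supplied by the supervisor
  private final LagGauge gauge;                           // wraps whatever actually emits the metric
  private final long emissionPeriodMillis;                // the proposed configurable emissionPeriod

  public PartitionLagEmitter(Supplier<Map<Integer, Long>> lagProvider, LagGauge gauge, long emissionPeriodMillis)
  {
    this.lagProvider = lagProvider;
    this.gauge = gauge;
    this.emissionPeriodMillis = emissionPeriodMillis;
  }

  public void start()
  {
    exec.scheduleAtFixedRate(this::emitOnce, 0, emissionPeriodMillis, TimeUnit.MILLISECONDS);
  }

  private void emitOnce()
  {
    final Map<Integer, Long> lags = lagProvider.get();
    if (lags.isEmpty()) {
      return;
    }
    // Spread the per-partition gauges over the period instead of emitting one burst.
    // Each random delay is at most emissionPeriod / #partitions, so the cumulative
    // delay of the last gauge still stays below emissionPeriod.
    final long maxDelayPerPartition = Math.max(1, emissionPeriodMillis / lags.size());
    long delayMillis = 0;
    for (Map.Entry<Integer, Long> entry : lags.entrySet()) {
      delayMillis += ThreadLocalRandom.current().nextLong(1, maxDelayPerPartition + 1);
      final int partition = entry.getKey();
      final long lag = entry.getValue();
      exec.schedule(() -> gauge.emit(partition, lag), delayMillis, TimeUnit.MILLISECONDS);
    }
  }

  public interface LagGauge
  {
    void emit(int partition, long lag);
  }
}
```

This keeps partitionLag as a continuous timeseries for every partition while smoothing the packet rate that the StatsDSender's outbound queue has to absorb.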
