Setup
Source: a single Kafka topic
Partitions: 400
Metrics Emitter: StatsDEmitter
Issue: Overlord dropping metrics
Description
We have a Druid setup that consumes from a Kafka topic with approximately 400 partitions. We recently upgraded to Druid 25 and enabled the ingest/kafka/partitionLag metric. Since then, we have noticed that the following metrics are frequently dropped at the Overlord:
ingest/kafka/partitionLag
ingest/kafka/maxLag
ingest/notices/queueSize
We haven't seen this issue with any other metrics.
We are using StatsDEmitter to send the metrics, which eventually end up in Datadog. We ruled out all other places where the metrics could be dropped. We rely on dogstatsd.client.packets_dropped to detect whether metrics are being dropped, but the telemetry metrics available in StatsDProcessor carry no tags that would let us attribute the dropped packets to a specific node. Screenshot attached.
ingest/kafka/maxLag is crucial to us, as we rely on it extensively for alerting. We use ingest/kafka/partitionLag to identify the partitions that lag the most.
The metrics are not dropped if we disable the partitionLag metric. The high number of partitions causes some metrics to be dropped in StatsDSender, as the outbound queue frequently becomes full, as shown in the screenshot above.
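As a possible stopgap while a proper fix lands, enlarging the emitter's outbound queue may reduce drops. This assumes your Druid version's statsd-emitter extension exposes the client queue size; verify the property name against the extension docs for your release:

```properties
# Hypothetical mitigation: larger queue absorbs the per-partition emit burst.
# Check the statsd-emitter docs for your Druid version before relying on this.
druid.emitter.statsd.queueSize=8192
```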
Proposal
Pass the tags available in StatsDEmitter on to the telemetry metrics in StatsDSender.
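A rough sketch of what this would mean at the wire level (the helper and its placement are hypothetical; the real change would live inside the emitter/sender): append node-identifying tags, e.g. host and role, to the dogstatsd telemetry lines so packets_dropped can be attributed to a specific node.

```java
import java.util.List;

public class TelemetryTagger {
    // dogstatsd wire format: <name>:<value>|<type>[|#tag1,tag2]
    // Appends node-identifying tags to a telemetry line, preserving any
    // tags the client already attached.
    static String tagTelemetry(String telemetryLine, List<String> nodeTags) {
        String joined = String.join(",", nodeTags);
        return telemetryLine.contains("|#")
            ? telemetryLine + "," + joined
            : telemetryLine + "|#" + joined;
    }
}
```

For example, `tagTelemetry("datadog.dogstatsd.client.packets_dropped:3|c", List.of("host:overlord-1"))` produces `datadog.dogstatsd.client.packets_dropped:3|c|#host:overlord-1`.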
Emit partitionLag in a new ScheduledExecutorService with a configurable emissionPeriod in SeekableStreamSupervisor. Put a random delay between each emit so that the total delay is less than emissionPeriod.
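A minimal sketch of this idea (the class, `jitterFor`, and the stand-in `emit` method are hypothetical, not Druid's actual SeekableStreamSupervisor API): each partition gets one slot of the period plus a random jitter within the slot, so the per-partition emits are spread out instead of arriving as one burst that fills the StatsD outbound queue.

```java
import java.util.List;
import java.util.Random;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.IntToLongFunction;

public class PartitionLagEmitter {
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();
    private final Random random = new Random();
    private final long emissionPeriodMillis;

    public PartitionLagEmitter(long emissionPeriodMillis) {
        this.emissionPeriodMillis = emissionPeriodMillis;
    }

    // Random delay for a partition, bounded so every emit lands within
    // the current emission period (each partition gets one slot of it).
    long jitterFor(int partition, int totalPartitions) {
        long slot = emissionPeriodMillis / totalPartitions;
        return partition * slot + (long) (random.nextDouble() * slot);
    }

    // Once per emission period, schedule one jittered emit per partition.
    public void start(List<Integer> partitions, IntToLongFunction lagFor) {
        scheduler.scheduleAtFixedRate(() -> {
            for (int p : partitions) {
                scheduler.schedule(
                    () -> emit(p, lagFor.applyAsLong(p)),
                    jitterFor(p, partitions.size()),
                    TimeUnit.MILLISECONDS);
            }
        }, 0, emissionPeriodMillis, TimeUnit.MILLISECONDS);
    }

    void emit(int partition, long lag) {
        // Stand-in for the real StatsDEmitter call.
        System.out.printf("ingest/kafka/partitionLag partition=%d value=%d%n",
            partition, lag);
    }
}
```

With 400 partitions and a 60 s emissionPeriod, each partition's emit falls inside its own 150 ms slot, so the client's queue drains between emits.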
We could also emit only the top n lagging partitions, but that would not produce a continuous time-series graph, so it is not preferred.