Segments becoming frequently unavailable when replica = 1 for large datasource #14548

Closed
uditsharma opened this issue Jul 7, 2023 · 2 comments


Affected Version

26.0.0

Description

We have noticed that one of our datasources, which holds 3 TB of data across roughly 30K segments, frequently has unavailable segments. From our findings it looks like a coordinator balancing issue: the coordinator loads a segment onto a new historical, and after the load completes on the new one the segment ends up being dropped from both places.
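For anyone wanting to confirm the symptom, the coordinator exposes a load-status endpoint that reports per-datasource segment availability. A minimal sketch, assuming the coordinator is reachable at localhost:8081 (adjust for your cluster) and the Python requests library is installed:

    # Minimal availability check via the coordinator API.
    # Assumption: coordinator at localhost:8081; adjust for your cluster.
    import requests

    COORDINATOR = "http://localhost:8081"

    # GET /druid/coordinator/v1/loadstatus returns, per datasource, the
    # percentage of its segments that are fully available on historicals.
    resp = requests.get(f"{COORDINATOR}/druid/coordinator/v1/loadstatus", timeout=10)
    resp.raise_for_status()

    for datasource, pct_available in resp.json().items():
        if pct_available < 100.0:
            print(f"{datasource}: only {pct_available:.2f}% of segments available")

Anything below 100% for a datasource matches the "unavailable segments" symptom described above.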

  • Cluster size:
    • 12 historical
    • 3 broker
    • 6 MM
  • Configurations in use

coordinator config

    druid.service=druid/coordinator
    druid.plaintextPort=8081
    druid.indexer.logs.kill.enabled=true
    druid.indexer.logs.kill.durationToRetain=259200000
    druid.indexer.logs.kill.delay=21600000
    
    druid.extensions.loadList=["druid-google-extensions", "postgresql-metadata-storage", "druid-kafka-indexing-service", "druid-datasketches", "kafka-emitter","druid-multi-stage-query"]
    druid.coordinator.loadqueuepeon.type=curator
    druid.serverview.type=batch

    druid.coordinator.startDelay=PT10S
    druid.coordinator.period=PT200S
    druid.coordinator.period.indexingPeriod=PT180S
  • Steps to reproduce the problem

Not sure if I have any steps to reproduce it, as this happens when the coordinator does its re-balancing.

  • Finding
    This is what we have observed in the logs for a specific segment (let me know if the complete logs are needed; I will try to get them). A sketch for tracing this from the coordinator's load queues follows below.

    1. The coordinator asks a new historical to load the segment.
    2. Next, it asks that same new historical to drop the segment it just loaded, because it now sees replica count = 2.
    3. Next, it asks the older historical to drop its copy as well; I am assuming a callback came in saying the new node had loaded the segment, so the old copy should be dropped.
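A hedged sketch of how one might trace this churn from the outside, by polling the coordinator's per-server load/drop queues. The coordinator address and segment id below are placeholders, and the response field names follow the documented GET /druid/coordinator/v1/loadqueue endpoint (which lists the segment ids queued to load and drop for each historical); verify the shape against your Druid version:

    # Poll the coordinator's load/drop queues to watch a specific segment get
    # loaded on one historical and then queued for drop again.
    import time

    import requests

    COORDINATOR = "http://localhost:8081"     # assumption: adjust as needed
    SEGMENT_ID = "<segment id being traced>"  # placeholder

    while True:
        # The response maps each server to the ids of segments it has been
        # asked to load and to drop.
        queues = requests.get(
            f"{COORDINATOR}/druid/coordinator/v1/loadqueue", timeout=10
        ).json()
        for server, queue in queues.items():
            if SEGMENT_ID in queue.get("segmentsToLoad", []):
                print(f"{server}: LOAD queued for {SEGMENT_ID}")
            if SEGMENT_ID in queue.get("segmentsToDrop", []):
                print(f"{server}: DROP queued for {SEGMENT_ID}")
        time.sleep(30)

Seeing a LOAD on the new historical followed shortly by DROPs on both servers would match the sequence described above.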


This issue has been marked as stale due to 280 days of inactivity.
It will be closed in 4 weeks if no further activity occurs. If this issue is still
relevant, please simply write any comment. Even if closed, you can still revive the
issue at any time or discuss it on the dev@druid.apache.org list.
Thank you for your contributions.

@github-actions github-actions bot added the stale label Apr 13, 2024

This issue has been closed due to lack of activity. If you think that
is incorrect, or the issue requires additional review, you can revive the issue at
any time.

@github-actions github-actions bot closed this as not planned May 11, 2024