
Consumer.get_watermark_offsets Ignores Timeout Parameter in Some Situations #413

Open · 4 of 7 tasks

bpowers39 opened this issue Jun 28, 2018 · 3 comments

@bpowers39

Description

get_watermark_offsets ignores the timeout parameter and blocks forever when the Kafka broker for the selected partition is down. Once the broker is back up, the function returns. I suspect this is a bug in librdkafka, but I'm posting it here first in case it's an issue with the Python bindings.

How to reproduce

Create a topic with one partition and run the example below. Once the example has consumed a few messages, kill the broker hosting the partition. The call to get_watermark_offsets will block until the broker comes back up.

from confluent_kafka import Consumer
from uuid import uuid4

if __name__ == '__main__':
    client = Consumer({'bootstrap.servers': 'gateway:9092', 'group.id': str(uuid4()),
                       'default.topic.config': {'auto.offset.reset': 'smallest'}})

    def assigned(consumer, partitions):
        print("Assigned:", partitions)

    client.subscribe(['ibbot'], on_assign=assigned)

    while True:
        msg = client.poll(timeout=1)

        # Blocks here once the partition leader is down, despite timeout=1.
        for partition in client.assignment():
            print(client.get_watermark_offsets(partition, timeout=1))

        if msg is not None:
            if msg.error():
                print("Error: ", msg.error())
            else:
                print("Data")

    client.close()

Note that this example probably depends on a bug in rdkafka, so it may not reproduce 100% of the time. You can also reproduce it with a multi-partition, multi-broker setup; in that case the function only blocks for as long as it takes for a new leader to be elected.
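
As an aside, when the locally known watermarks are good enough, passing cached=True to get_watermark_offsets avoids the broker round trip entirely, so it does not block even while the leader is down (the cached values may be OFFSET_INVALID until messages have been fetched for the partition). This is only a workaround sketch reusing the configuration above, not a fix for the timeout being ignored:

from confluent_kafka import Consumer, OFFSET_INVALID
from uuid import uuid4

client = Consumer({'bootstrap.servers': 'gateway:9092', 'group.id': str(uuid4()),
                   'default.topic.config': {'auto.offset.reset': 'smallest'}})
client.subscribe(['ibbot'])
client.poll(timeout=1)

for partition in client.assignment():
    # cached=True reads the consumer's locally cached watermarks instead of
    # querying the broker, so it returns immediately even if the leader is down.
    low, high = client.get_watermark_offsets(partition, cached=True)
    if OFFSET_INVALID in (low, high):
        print("Watermarks not cached yet for", partition)
    else:
        print("Cached watermarks:", low, high)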

Checklist

Please provide the following information:

  • confluent-kafka-python and librdkafka version (confluent_kafka.version() and confluent_kafka.libversion()): confluent-kafka-python: ('0.11.4', 721920), librdkafka: ('0.11.4', 722175)
  • Apache Kafka broker version: 0.11.0.2
  • Client configuration: {'bootstrap.servers': 'gateway:9092', 'group.id': str(uuid4()), 'default.topic.config': {'auto.offset.reset': 'smallest'}}
  • Operating system: RHEL 7
  • Provide client logs (with 'debug': '..' as necessary)
  • Provide broker log excerpts
  • Critical issue
@rnpridgeon
Contributor

Thanks for reporting this @bpowers39. At a glance, it appears that once the broker transitions to the down state it not only stops servicing the broker ops queue but also stops scanning for timeouts. As a result, your request sits on the queue until a connection can be reestablished.

case RD_KAFKA_BROKER_STATE_DOWN:
https://github.com/edenhill/librdkafka/blob/master/src/rdkafka_broker.c#L3480-L3507

rd_kafka_broker_connect:
https://github.com/edenhill/librdkafka/blob/master/src/rdkafka_broker.c#L1457-L1490

case RD_KAFKA_BROKER_STATE_UP; rd_kafka_broker_bufq_timeout_scan:
https://github.com/edenhill/librdkafka/blob/master/src/rdkafka_broker.c#L3547-L3573

I would agree this can cause some unexpected results. Perhaps adding a timeout scan at some configurable interval during the down state could help here. I'll open a librdkafka issue to report this behavior (or if you would like to, that's okay too) so we can work out the best approach to take here.
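
Until something along those lines lands in librdkafka, one purely client-side mitigation is to bound the wait in the application so the polling loop stays responsive even while the request is parked on the broker ops queue. This is only a sketch (the helper name is made up, and it assumes calling into the consumer from a helper thread is acceptable in your setup); the worker thread itself still stays blocked until the connection recovers:

from concurrent.futures import ThreadPoolExecutor, TimeoutError

_watermark_executor = ThreadPoolExecutor(max_workers=1)

def get_watermark_offsets_bounded(consumer, partition, deadline=1.0):
    # Hypothetical helper: runs the potentially hanging call on a worker
    # thread and gives up after `deadline` seconds. The worker thread stays
    # blocked until librdkafka returns, so repeated timeouts queue up behind
    # it; this keeps the caller responsive, it does not fix the hang.
    future = _watermark_executor.submit(
        consumer.get_watermark_offsets, partition, timeout=deadline)
    try:
        return future.result(timeout=deadline)
    except TimeoutError:
        return None  # treat as "unknown" and retry on the next loop iteration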

@bpowers39
Author

Thanks for the quick reply! Feel free to open the issue; you understand it better than I do. This is particularly problematic in combination with #412, since the whole process stops when it happens.

@chinmaychandak

I'm curious, is this being worked on currently? This issue can really be problematic in some cases.
