
Tasks are in the queued status after restarting the redis container #28941

Closed
Romsik788 opened this issue Jan 14, 2023 · 4 comments

Labels: area:core, kind:bug

Comments

Romsik788 commented Jan 14, 2023

Apache Airflow version

2.5.0

What happened

I am using this docker-compose file to deploy Airflow locally on my machine.
When I kill and restart the redis container, the Airflow worker reconnects (according to the logs), but any tasks started after that get stuck in the queued state indefinitely and never run.
Last task: (screenshot)
I saw a fix that was adopted in Airflow 2.3.1 (#23690), but it doesn't seem to work, or I'm probably doing something wrong.
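For reference, a minimal sketch of a possible mitigation on the docker-compose side, assuming the [celery] stalled_task_timeout option listed in the Airflow 2.4/2.5 configuration reference (later versions consolidate it into [scheduler] task_queued_timeout); the option name and the value below are assumptions to verify against your version, not a confirmed description of the fix in #23690:

# Sketch only: added to the shared environment block of the quick-start docker-compose
x-airflow-common:
  environment:
    # Time out tasks that stay queued in Celery longer than this many seconds,
    # instead of letting them hang forever after the broker loses their messages.
    AIRFLOW__CELERY__STALLED_TASK_TIMEOUT: "300"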

What you think should happen instead

After restarting redis, the tasks should run as before and not hang in the queue indefinitely.

How to reproduce

  1. Deploy Airflow using the official docker-compose file.
  2. Stop and start the redis container.
  3. Run a task of some kind.

Operating System

Ubuntu 22.04 LTS

Versions of Apache Airflow Providers

No response

Deployment

Docker-Compose

Deployment details

No response

Anything else

Part of the logs after I restarted the redis container. You can see that the worker lost the connection but then reconnected:

airflow-worker_1     | [2023-01-14 13:51:04,008: ERROR/MainProcess] consumer: Cannot connect to redis://redis:6379/0: Error -3 connecting to redis:6379. Temporary failure in name resolution..
airflow-worker_1     | Trying again in 8.00 seconds... (4/100)
airflow-worker_1     | 
redis_1              | 1:M 14 Jan 2023 13:51:08.059 * monotonic clock: POSIX clock_gettime
redis_1              | 1:M 14 Jan 2023 13:51:08.060 * Running mode=standalone, port=6379.
redis_1              | 1:M 14 Jan 2023 13:51:08.060 # Server initialized
redis_1              | 1:M 14 Jan 2023 13:51:08.060 # WARNING Memory overcommit must be enabled! Without it, a background save or replication may fail under low memory condition. Being disabled, it can can also cause failures without low memory condition, see https://github.com/jemalloc/jemalloc/issues/1328. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
redis_1              | 1:M 14 Jan 2023 13:51:08.060 * Loading RDB produced by version 7.0.7
redis_1              | 1:M 14 Jan 2023 13:51:08.060 * RDB age 95 seconds
redis_1              | 1:M 14 Jan 2023 13:51:08.060 * RDB memory usage when created 1.42 Mb
redis_1              | 1:M 14 Jan 2023 13:51:08.060 * Done loading RDB, keys loaded: 3, keys expired: 0.
redis_1              | 1:M 14 Jan 2023 13:51:08.060 * DB loaded from disk: 0.000 seconds
redis_1              | 1:M 14 Jan 2023 13:51:08.060 * Ready to accept connections
airflow-worker_1     | [2023-01-14 13:51:12,021: INFO/MainProcess] Connected to redis://redis:6379/0
airflow-worker_1     | [2023-01-14 13:51:12,028: INFO/MainProcess] mingle: searching for neighbors
airflow-worker_1     | [2023-01-14 13:51:13,037: INFO/MainProcess] mingle: all alone

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct
Romsik788 added the area:core and kind:bug labels on Jan 14, 2023

boring-cyborg bot commented Jan 14, 2023

Thanks for opening your first issue here! Be sure to follow the issue template!

Romsik788 changed the title from "Tasks are in the queued status after restarting the redis container." to "Tasks are in the queued status after restarting the redis container" on Jan 14, 2023
potiuk (Member) commented Jan 15, 2023

Airflow's docker-compose is not fool-proof. Redis is an in-memory data store, and killing and restarting it will cause those problems. Our docker-compose is not an "official" way of running a production-ready deployment. Quite the contrary - we are very explicit that it is a quick-start only and that you should make it much more robust if you want to make it production-ready (and what you are describing is part of a production-ready deployment):

In the docker-compose:

 # WARNING: This configuration is for local development. Do not use it in a production deployment.

and in the docs:

This procedure can be useful for learning and exploration. However, adapting it for use in real-world situations can be complicated. Making changes to this procedure will require specialized expertise in Docker & Docker Compose, and the Airflow community may not be able to help you.

For that reason, we recommend using Kubernetes with the Official Airflow Community Helm Chart when you are ready to run Airflow in production.

Our docker-compose uses redis in its most basic form, which is without persistence - all data is kept in memory. If you need persistence in order to survive restarts, you need to configure it and modify the docker-compose.

Redis documents its persistence options here: https://redis.io/docs/management/persistence/

But it's up to you to make it robust and production-ready and resilient to any kind of failures, I am afraid.
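For example, a minimal sketch of such a modification to the quick-start file, assuming the service is named redis as in the quick-start; the image tag, volume name, and AOF setting are illustrative, not the official defaults:

services:
  redis:
    image: redis:7
    # Turn on append-only-file persistence so queued Celery messages survive a restart
    command: ["redis-server", "--appendonly", "yes"]
    volumes:
      - redis-data:/data   # /data is the data directory of the official Redis image
volumes:
  redis-data: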

potiuk closed this as completed on Jan 15, 2023

Romsik788 (Author) commented Jan 16, 2023

> But it's up to you to make it robust and production-ready and resilient to any kind of failures, I am afraid.

Ok, I have deployed Apache Airflow on kubernetes 1.22 using the official helm chart with the following settings:

elasticsearch:
  enabled: true                 # enable the Elasticsearch task-log integration
workers:
  persistence:
    storageClassName: oci-bv    # OCI Block Volume storage class
  podAnnotations:
    log_format: json
redis:
  persistence:
    storageClassName: oci-bv    # persist the Celery broker data on a block volume
images:
  useDefaultImageForMigration: true   # run DB migrations with the default Airflow image
createUserJob:
  useHelmHooks: false           # disable Helm hooks (e.g. when deploying via Argo CD)
migrateDatabaseJob:
  useHelmHooks: false

But I still get the same result. I also checked whether there were any problems with writing data to disk - I did not find any.
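One thing that may be worth making explicit in those values is the persistence flag itself; a sketch, assuming the chart exposes enabled and size next to storageClassName (these keys are assumptions here, check the chart's values.yaml):

redis:
  persistence:
    enabled: true              # make persistence explicit rather than relying on the default
    size: 1Gi                  # illustrative size
    storageClassName: oci-bv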

potiuk (Member) commented Jan 16, 2023

Then it needs to be looked at by someone who knows celery and redis better than I do. I do not know redis that well - I guess there are delays in processing and saving the stored data. If you kill redis abruptly, with kill -9 for example, then (obviously, like any other software) it might lose some data that it keeps in memory, and absolutely nothing can be done about that. There will be hanging tasks in this case which you will have to clear. That's the usual recovery mechanism from catastrophic failures. No system in the world can really be made resilient to this unless you take on a lot of operational overhead and redundancy (and if you would like to do that, then it is more of a deployment issue).

I think you should make sure that you are stopping redis in the "gentle" way that gives it a chance to flush everything to disk, and verify that it actually restores the data from there.

Please then open a new issue for the Helm chart, ideally showing all the logs (including debug logs showing redis storing and restoring the data, to make sure that it actually happens). If you can reproduce it knowing that redis is storing/restoring the queue, then I think that's something that someone who is a celery expert should take a look at, so it's worth opening an issue.
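For the Docker Compose case discussed earlier, a sketch of what a "gentle" stop can look like, assuming the quick-start service name redis: docker compose stop sends SIGTERM, and Redis writes a final RDB snapshot on a clean shutdown when save points are configured, so the grace period below gives it time to finish before Compose escalates to SIGKILL. The values are illustrative.

services:
  redis:
    # Snapshot to disk after 60 seconds if at least one key has changed
    command: ["redis-server", "--save", "60", "1"]
    # Allow up to 30 s for the final RDB write before SIGKILL
    stop_grace_period: 30s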
