
Tasks are in the queued status after restarting the redis container #28941

Closed
Romsik788 opened this issue Jan 14, 2023 · 4 comments

Labels: area:core, kind:bug

Comments

Romsik788 commented Jan 14, 2023

Apache Airflow version

2.5.0

What happened

I am using this docker-compose file to deploy Airflow locally on my machine.
When I kill and restart the redis container, the Airflow worker reconnects (according to the logs), but any tasks started after that get stuck in the queued state indefinitely and never run.
Last task: (screenshot)
I saw a fix that was adopted in Airflow 2.3.1 (#23690), but it doesn't seem to work, or I'm probably doing something wrong.
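For reference, a minimal sketch of a possible mitigation on the docker-compose side, assuming the [celery] stalled_task_timeout option listed in the Airflow 2.4/2.5 configuration reference (later versions consolidate it into [scheduler] task_queued_timeout); the option name and the value below are assumptions to verify against your version, not a confirmed description of the fix in #23690:

# Sketch only: added to the shared environment block of the quick-start docker-compose
x-airflow-common:
  environment:
    # Time out tasks that stay queued in Celery longer than this many seconds,
    # instead of letting them hang forever after the broker loses their messages.
    AIRFLOW__CELERY__STALLED_TASK_TIMEOUT: "300"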

What you think should happen instead

After restarting redis, the tasks should run as before and not hang in the queue indefinitely.

How to reproduce

  1. Deploy Airflow using the official docker-compose file.
  2. Stop and start the redis container.
  3. Run a task of some kind.

Operating System

Ubuntu 22.04 LTS

Versions of Apache Airflow Providers

No response

Deployment

Docker-Compose

Deployment details

No response

Anything else

Part of the logs after I restarted the redis container. You can see that the worker lost the connection but then reconnected:

airflow-worker_1     | [2023-01-14 13:51:04,008: ERROR/MainProcess] consumer: Cannot connect to redis://redis:6379/0: Error -3 connecting to redis:6379. Temporary failure in name resolution..
airflow-worker_1     | Trying again in 8.00 seconds... (4/100)
airflow-worker_1     | 
redis_1              | 1:M 14 Jan 2023 13:51:08.059 * monotonic clock: POSIX clock_gettime
redis_1              | 1:M 14 Jan 2023 13:51:08.060 * Running mode=standalone, port=6379.
redis_1              | 1:M 14 Jan 2023 13:51:08.060 # Server initialized
redis_1              | 1:M 14 Jan 2023 13:51:08.060 # WARNING Memory overcommit must be enabled! Without it, a background save or replication may fail under low memory condition. Being disabled, it can can also cause failures without low memory condition, see https://github.com/jemalloc/jemalloc/issues/1328. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
redis_1              | 1:M 14 Jan 2023 13:51:08.060 * Loading RDB produced by version 7.0.7
redis_1              | 1:M 14 Jan 2023 13:51:08.060 * RDB age 95 seconds
redis_1              | 1:M 14 Jan 2023 13:51:08.060 * RDB memory usage when created 1.42 Mb
redis_1              | 1:M 14 Jan 2023 13:51:08.060 * Done loading RDB, keys loaded: 3, keys expired: 0.
redis_1              | 1:M 14 Jan 2023 13:51:08.060 * DB loaded from disk: 0.000 seconds
redis_1              | 1:M 14 Jan 2023 13:51:08.060 * Ready to accept connections
airflow-worker_1     | [2023-01-14 13:51:12,021: INFO/MainProcess] Connected to redis://redis:6379/0
airflow-worker_1     | [2023-01-14 13:51:12,028: INFO/MainProcess] mingle: searching for neighbors
airflow-worker_1     | [2023-01-14 13:51:13,037: INFO/MainProcess] mingle: all alone

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct
Romsik788 added the area:core and kind:bug labels on Jan 14, 2023

boring-cyborg bot commented Jan 14, 2023

Thanks for opening your first issue here! Be sure to follow the issue template!

Romsik788 changed the title from "Tasks are in the queued status after restarting the redis container." to "Tasks are in the queued status after restarting the redis container" on Jan 14, 2023
potiuk (Member) commented Jan 15, 2023

Airflow's docker-compose is not fool-proof. Redis is an in-memory data store, and killing and restarting it will cause those problems. Our docker-compose is not an "official" way of running a production-ready deployment. Quite the contrary - we are very explicit that it is a quick-start only and that you should make it much more robust if you want to make it production-ready (and what you are describing is part of a production-ready deployment):

In the docker-compose:

 # WARNING: This configuration is for local development. Do not use it in a production deployment.

and in the docs:

This procedure can be useful for learning and exploration. However, adapting it for use in real-world situations can be complicated. Making changes to this procedure will require specialized expertise in Docker & Docker Compose, and the Airflow community may not be able to help you.

For that reason, we recommend using Kubernetes with the Official Airflow Community Helm Chart when you are ready to run Airflow in production.

Our docker-compose uses redis in its most basic form, which is without persistence - all data is kept in memory. If you need persistence in order to survive restarts, you need to configure it and modify the docker-compose.

Redis documents its persistence options here: https://redis.io/docs/management/persistence/

But it's up to you to make it robust and production-ready and resilient to any kind of failures, I am afraid.
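For example, a minimal sketch of such a modification to the quick-start file, assuming the service is named redis as in the quick-start; the image tag, volume name, and AOF setting are illustrative, not the official defaults:

services:
  redis:
    image: redis:7
    # Turn on append-only-file persistence so queued Celery messages survive a restart
    command: ["redis-server", "--appendonly", "yes"]
    volumes:
      - redis-data:/data   # /data is the data directory of the official Redis image
volumes:
  redis-data: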

potiuk closed this as completed on Jan 15, 2023

Romsik788 (Author) commented Jan 16, 2023

> But it's up to you to make it robust and production-ready and resilient to any kind of failures, I am afraid.

Ok, I have deployed Apache Airflow on kubernetes 1.22 using the official helm chart with the following settings:

elasticsearch:
  enabled: true                 # enable the Elasticsearch task-log integration
workers:
  persistence:
    storageClassName: oci-bv    # OCI Block Volume storage class
  podAnnotations:
    log_format: json
redis:
  persistence:
    storageClassName: oci-bv    # persist the Celery broker data on a block volume
images:
  useDefaultImageForMigration: true   # run DB migrations with the default Airflow image
createUserJob:
  useHelmHooks: false           # disable Helm hooks (e.g. when deploying via Argo CD)
migrateDatabaseJob:
  useHelmHooks: false

But I still get the same result. I also checked whether there were any problems with writing data to disk - I did not find any.
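One thing that may be worth making explicit in those values is the persistence flag itself; a sketch, assuming the chart exposes enabled and size next to storageClassName (these keys are assumptions here, check the chart's values.yaml):

redis:
  persistence:
    enabled: true              # make persistence explicit rather than relying on the default
    size: 1Gi                  # illustrative size
    storageClassName: oci-bv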

potiuk (Member) commented Jan 16, 2023

Then it needs to be looked at by someone who knows celery and redis better than I do. I do not know redis that well - I guess there are delays in processing and saving the stored data. If you kill redis abruptly, with kill -9 for example, then (obviously, like any other software) it might lose some data that it keeps in memory, and absolutely nothing can be done about that. There will be hanging tasks in this case which you will have to clear. That's the usual recovery mechanism from catastrophic failures. No system in the world can really be made resilient to this unless you take on a lot of operational overhead and redundancy (and if you would like to do that, then it is more of a deployment issue).

I think you should make sure that you are stopping redis in the "gentle" way that gives it a chance to flush everything to disk, and verify that it actually restores the data from there.

Please then open a new issue for the Helm chart, ideally showing all the logs (including debug logs showing redis storing and restoring the data, to make sure that it actually happens). If you can reproduce it knowing that redis is storing/restoring the queue, then I think that's something that someone who is a celery expert should take a look at, so it's worth opening an issue.
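For the Docker Compose case discussed earlier, a sketch of what a "gentle" stop can look like, assuming the quick-start service name redis: docker compose stop sends SIGTERM, and Redis writes a final RDB snapshot on a clean shutdown when save points are configured, so the grace period below gives it time to finish before Compose escalates to SIGKILL. The values are illustrative.

services:
  redis:
    # Snapshot to disk after 60 seconds if at least one key has changed
    command: ["redis-server", "--save", "60", "1"]
    # Allow up to 30 s for the final RDB write before SIGKILL
    stop_grace_period: 30s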
