Tasks are in the queued status after restarting the redis container #28941
Comments
Thanks for opening your first issue here! Be sure to follow the issue template!
Airflow's docker-compose is not fool-proof. Redis is an in-memory data store, and killing and restarting it will cause exactly these problems. Our docker-compose is not an "official" way of running a production-ready deployment. Quite the contrary: we are very explicit that it is a quick-start only, and that you should make it much more robust if you want it to be production-ready (and what you are describing is part of a production-ready deployment). In the docker-compose:
and in the docs:
Our docker-compose uses Redis in its most basic form, which is without persistence: all data is kept in memory. If you need persistence in order to survive restarts, you need to configure it and modify the docker-compose. Redis has several persistence options: https://redis.io/docs/management/persistence/ But it's up to you to make it robust, production-ready, and resilient to any kind of failure, I'm afraid.
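As a rough illustration only, a minimal sketch of such a modification, assuming the quick-start docker-compose layout (the service name `redis` matches the quick-start file; the volume name `redis-db-volume` and the image tag are illustrative, not something the quick-start ships):

```yaml
# docker-compose.yaml (excerpt) - enable Redis append-only-file persistence
# so the Celery broker state can survive a container restart.
services:
  redis:
    image: redis:7-bookworm
    # --appendonly yes makes Redis log every write to an AOF file on disk
    command: redis-server --appendonly yes
    volumes:
      # persist /data (where Redis writes the AOF) across container restarts
      - redis-db-volume:/data

volumes:
  redis-db-volume:
```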
OK, I have deployed Apache Airflow on Kubernetes 1.22 using the official Helm chart with the following settings:
But I still get the same result. I also checked whether there were any problems with writing data to disk, and I did not find any.
Then it needs to be looked at by someone who knows Celery and Redis better than I do. I do not know Redis that well; I guess there are delays in processing and saving the stored data. If you kill Redis abruptly, with kill -9 for example, then (like any other software, obviously) it might lose some data that it keeps in memory, and absolutely nothing can be done about that. There will be hanging tasks in this case, which you will have to clear. That's the usual recovery mechanism from catastrophic failures. No system in the world can really be made resilient to them unless you take on a lot of operational overhead and redundancy (and if you would like to do that, then it is more of a deployment issue).

I think you should make sure that you are stopping Redis in a "gentle" way that gives it a chance to flush everything to disk, and verify that it actually restores its state from there.

Please then open a new issue for the Helm chart, ideally showing all the logs (including debug logs showing Redis storing and restoring the data, to make sure that actually happens). If you can reproduce it knowing that Redis is storing/restoring the queue, then that's something someone who is a Celery expert should take a look at, so it's worth opening an issue.
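For reference, a sketch of what a "gentle" stop and a verification might look like with the quick-start compose setup (standard Docker and Redis commands; the service name `redis` is from the quick-start file):

```bash
# Stop Redis gracefully: docker sends SIGTERM, giving Redis time to flush to disk
docker compose stop redis

# Alternatively, ask Redis itself to save its state and shut down cleanly
docker compose exec redis redis-cli SHUTDOWN SAVE

# After restarting, check that Redis actually loaded its persisted state
docker compose start redis
docker compose exec redis redis-cli INFO persistence | grep -E 'loading|aof_enabled|rdb_last_save_time'
```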
Apache Airflow version
2.5.0
What happened
I am using this docker-compose file to deploy Airflow locally on my machine.
When I kill and restart the redis container, the Airflow worker reconnects (according to the logs), but once tasks are started, they all go into the queued state and stay there indefinitely.
Last task:
I saw a solution that was adopted in Airflow 2.3.1, but it doesn't seem to work, or I'm probably doing something wrong: #23690
What you think should happen instead
After restarting redis, the tasks should run as before and not hang in the queue indefinitely.
How to reproduce
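Roughly the following, based on the description above (assuming the quick-start docker-compose setup; `example_bash_operator` stands in for any DAG):

```bash
# Start Airflow with the quick-start docker-compose
docker compose up -d

# Kill the redis container abruptly (no chance to flush state), then restart it
docker compose kill redis
docker compose start redis

# Trigger any DAG; its tasks go into "queued" and stay there indefinitely
docker compose exec airflow-webserver airflow dags trigger example_bash_operator
```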
Operating System
Ubuntu 22.04 LTS
Versions of Apache Airflow Providers
No response
Deployment
Docker-Compose
Deployment details
No response
Anything else
Part of the logs from after I restarted the redis container. You can see that the worker lost the connection but then restored it:
Are you willing to submit PR?
Code of Conduct