
Docker Compose CeleryExecutor example #8621

Closed · wants to merge 4 commits into from

Conversation

resdevd commented Apr 29, 2020

This is a docker-compose template, ideal for local development, using the official Airflow Docker images, with easy deployment instructions. Addresses issue #8605.


Make sure to mark the boxes below before creating PR: [x]

  • Description above provides context of the change
  • Unit tests coverage for changes (not needed for documentation changes)
  • Target Github ISSUE in description if exists
  • Commits follow "How to write a good git commit message"
  • Relevant documentation is updated including usage instructions.
  • I will engage committers as explained in Contribution Workflow Example.

In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in UPDATING.md.
Read the Pull Request Guidelines for more information.

boring-cyborg bot commented Apr 29, 2020

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contribution Guide (https://github.com/apache/airflow/blob/master/CONTRIBUTING.rst)
Here are some useful points:

  • Pay attention to the quality of your code (flake8, pylint and type annotations). Our pre-commits will help you with that.
  • In case of a new feature add useful documentation (in docstrings or in the docs/ directory). Adding a new operator? Check this short guide. Consider adding an example DAG that shows how users should use it.
  • Consider using the Breeze environment for testing locally; it's a heavy Docker image, but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Be sure to read the Airflow Coding style.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: dev@airflow.apache.org
    Slack: https://apache-airflow-slack.herokuapp.com/

resdevd (Author) commented Apr 29, 2020

@potiuk FYI

image: apache/airflow:1.10.10
container_name: airflow_flower_cont
restart: always
volumes:

resdevd (Author) commented Apr 29, 2020

I did consider using extension fields, but IMO we shouldn't use them for this use case, for the following reasons (see also the sketch at the end of this comment):

  • Most production environments like ECS/Fargate and Docker Swarm don't support them, unlike Docker env variables.
  • Users might want to use different images and volumes for different Airflow services. (Airflow workers might need Java, specific Python packages, etc.)
  • Much of the code duplication is already eliminated in the form of the .env master file.
  • Readability and adoption of extension fields is still lagging behind with the majority of docker/docker-compose users.
  • Container names should be unique.
  • Advanced users can always adjust the config according to their needs.

So I believe that, especially for this use case, extension fields come under "premature optimization".

“The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming.” - Donald Knuth
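
For reference, here is a minimal sketch of the .env-based approach described above. The file names, image tag, and values are illustrative placeholders, not the exact contents of this PR:

# .env (kept out of version control via .gitignore; values are placeholders)
AIRFLOW__CORE__EXECUTOR=CeleryExecutor
AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@postgres:5432/airflow
AIRFLOW__CORE__FERNET_KEY=<generated-fernet-key>
AIRFLOW__CORE__LOAD_EXAMPLES=False

# docker-compose.yml (excerpt)
services:
  webserver:
    image: apache/airflow:1.10.10
    env_file:
      - .env
    command: webserver
  scheduler:
    image: apache/airflow:1.10.10
    env_file:
      - .env
    command: scheduler

Each service points at the same .env file via env_file, so the shared settings live in one place without relying on extension fields.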

BasPH (Contributor) commented Apr 30, 2020

I think we should decide whether the docker-compose file will serve as a "hello world" example, or as something more complex but closer to a production deployment.

My ideas:

  • Users will likely want to use it as an example and change it for their specific deployment
  • I think these should be as simple as possible to run, i.e. a one-liner in the readme (see the example after this list)
  • I think the folder name "templates" is incorrect here, no templating is being done? Something like "examples/docker", or even "docker-compose" in the root of the project makes more sense IMO
  • There are already plenty of example DAGs, let's not add yet another example
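
For instance, the readme one-liner could be as simple as the following (assuming the compose file sits in the directory you run it from):

docker-compose up -d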

To address your points about the extension fields:

> Most production environments like ECS/Fargate and Docker Swarm don't support them, unlike Docker env variables.

I think the goal should be a simple example docker-compose file. People will then extract bits and pieces from the example into their deployment. There's a ton of different ways to do deployment, so I would not aim for creating a docker-compose file with the goal of being close to one single deployment method. Instead, aim for something easy to comprehend.

> Users might want to use different images and volumes for different Airflow services. (Airflow workers might need Java, specific Python packages, etc.)

Okay, why does this change the argument for using the extension fields?

> Much of the code duplication is already eliminated in the form of the .env master file.

If we want the docker-compose file to serve as an example, I think it should be as self-explanatory as possible. Storing env vars in a separate file does not help that cause, I think using extension fields and having everything in the same file will be clearer.

> Readability and adoption of extension fields is still lagging behind with the majority of docker/docker-compose users.

How so?

> Container names should be unique.

You still have unique container names.

> Advanced users can always adjust the config according to their needs.

Yes.

See the following docker-compose file using extension fields; I think it's pretty readable:

version: '3.4'

# ========================== AIRFLOW ENVIRONMENT VARIABLES ===========================
x-environment: &airflow_environment
  - AIRFLOW__CORE__EXECUTOR=LocalExecutor
  - AIRFLOW__CORE__LOAD_DEFAULT_CONNECTIONS=False
  - AIRFLOW__CORE__LOAD_EXAMPLES=False
  - AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql://airflow:airflow@postgres:5432/airflow
  - AIRFLOW__CORE__STORE_DAG_CODE=True
  - AIRFLOW__CORE__STORE_SERIALIZED_DAGS=True
  - AIRFLOW__WEBSERVER__RBAC=True
# ========================== /AIRFLOW ENVIRONMENT VARIABLES ==========================

services:
  postgres:
    image: postgres:12-alpine
    environment:
      - POSTGRES_USER=airflow
      - POSTGRES_PASSWORD=airflow
      - POSTGRES_DB=airflow
    ports:
      - "5432:5432"

  webserver:
    image: apache/airflow:1.10.10-python3.7
    ports:
      - "8080:8080"
    environment: *airflow_environment
    command: webserver

  scheduler:
    image: apache/airflow:1.10.10-python3.7
    environment: *airflow_environment
    command: scheduler

For the initialization I've used the following in the past; it worked quite okay. We could make it a bit more sophisticated, e.g. by checking some condition once a second instead of sleeping 5 seconds (see the sketch after this snippet):

  initdb_adduser:
    image: apache/airflow:1.10.10-python3.7
    depends_on:
      - postgres
    environment: *airflow_environment
    entrypoint: /bin/bash
    # The webserver initializes permissions, so sleep for that to (approximately) be finished
    # No disaster if the webserver isn't finished by then, but create_user will start spitting out errors until the permissions exist
    command: -c 'airflow initdb && sleep 5 && airflow create_user --role Admin --username airflow --password airflow -e airflow@airflow.com -f airflow -l airflow'
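
One way to flesh out that "check a condition once a second" idea, sketched here rather than taken from the PR: poll the webserver's port before creating the user. This assumes the service is called webserver and listens on 8080 as in the example above, and treats a reachable port as an approximation of "the webserver has finished initializing permissions":

  initdb_adduser:
    image: apache/airflow:1.10.10-python3.7
    depends_on:
      - postgres
    environment: *airflow_environment
    entrypoint: /bin/bash
    # Retry a TCP connection to the webserver once a second instead of a fixed sleep,
    # then create the admin user once the webserver is reachable
    command: -c 'airflow initdb && until bash -c "</dev/tcp/webserver/8080" 2>/dev/null; do sleep 1; done && airflow create_user --role Admin --username airflow --password airflow -e airflow@airflow.com -f airflow -l airflow'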

resdevd (Author) commented Apr 30, 2020

> I think we should decide whether the docker-compose file will serve as a "hello world" example, or as something more complex but closer to a production deployment.

IMO the simplest way to run Airflow is still pip install plus the airflow commands. A "hello world" example should not require users to install/learn Docker and docker-compose.

> I think the folder name "templates" is incorrect here, no templating is being done? Something like "examples/docker", or even "docker-compose" in the root of the project makes more sense IMO.

Sure, like @turbaszek mentioned, we could use a name like "deploy" etc., where more docker-compose config with LocalExecutor, and potentially k8s/helm config, can be added in the future. IMO "examples" would be confusing, as it might refer to example DAGs and operators.

> There are already plenty of example DAGs, let's not add yet another example.

It wasn't about adding an example; the goal was to show the usage of Docker volumes for DAGs and plugins.

> I think the goal should be a simple example docker-compose file. People will then extract bits and pieces from the example into their deployment. There's a ton of different ways to do deployment, so I would not aim for creating a docker-compose file with the goal of being close to one single deployment method. Instead, aim for something easy to comprehend.

This PR addresses #8605, "Add Production-ready docker compose for the production image".
IMO CeleryExecutor is probably the most advanced config in terms of docker-compose, and it's easier to change from CeleryExecutor to LocalExecutor than vice versa (see the sketch below for the CeleryExecutor-specific pieces).
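
For context, a minimal sketch of the CeleryExecutor-specific services that sit on top of a LocalExecutor-style compose file. The image tag, broker choice (Redis), and settings are illustrative, not the exact contents of this PR:

  redis:
    image: redis:5-alpine

  worker:
    image: apache/airflow:1.10.10
    depends_on:
      - redis
      - postgres
    environment:
      - AIRFLOW__CORE__EXECUTOR=CeleryExecutor
      - AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@postgres:5432/airflow
      - AIRFLOW__CELERY__BROKER_URL=redis://redis:6379/0
      - AIRFLOW__CELERY__RESULT_BACKEND=db+postgresql://airflow:airflow@postgres:5432/airflow
    command: worker

  flower:
    image: apache/airflow:1.10.10
    depends_on:
      - redis
    environment:
      - AIRFLOW__CELERY__BROKER_URL=redis://redis:6379/0
    ports:
      - "5555:5555"
    command: flower

Going the other way mostly means dropping the redis, worker, and flower services and switching the executor back to LocalExecutor, which is why starting from CeleryExecutor is the more general template.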

> Okay, why does this change the argument for using the extension fields?

I was referring to the code context in the original review where the extension fields were suggested.

> If we want the docker-compose file to serve as an example, I think it should be as self-explanatory as possible. Storing env vars in a separate file does not help that cause, I think using extension fields and having everything in the same file will be clearer.

As mentioned above, this PR addresses an issue where the goal is to have a docker-compose setup close to production. Having Fernet keys and other sensitive info live in the docker-compose file is not a best practice either (in terms of .gitignore config and migration to ECS/Fargate, Docker Swarm, etc.).

> Readability and adoption of extension fields is still lagging behind with the majority of docker/docker-compose users.

> How so?

Glad you asked. Prior to Airflow's official Docker images, the two most popular Airflow docker-compose setups were https://github.com/puckel/docker-airflow and https://github.com/bitnami/bitnami-docker-airflow, and I don't think extension fields are used there.
Not only that, I don't think you'd find them used extensively even in Docker's official https://github.com/docker/awesome-compose repo. You'll notice the same in some of the most popular docker-compose setups, such as https://github.com/deviantony/docker-elk and https://github.com/vegasbrianc/prometheus.

For context, extension fields were introduced in docker-compose file format 3.4, in 2017.

> You still have unique container names.

I was referring to the code context in the original review where the extension fields were suggested.

> For the initialization I've used the following in the past; it worked quite okay. We could make it a bit more sophisticated, e.g. by checking some condition once a second instead of sleeping 5 seconds:

This is a separate issue, #8606. In that case, if env variables were used for this purpose, we could get rid of the init script entirely. Besides, containers share CPU, memory, network and IO; I'm not sure adding another container for a one-time action is the best way to go.

BasPH mentioned this pull request May 3, 2020
gpongracz commented:
Looks like the entrypoint.sh is not doing an initdb or upgradedb prior to the command, as is done in puckel:
https://github.com/puckel/docker-airflow/blob/master/script/entrypoint.sh#L110
I think the puckel entrypoint script is very elegant and may help inform the design of this one.
In the meantime I have written a docker-compose file for a LocalExecutor setup that overcomes this by setting restart to on-failure and starting a container which runs an initdb and then stops after successfully completing - a bit of a kludge, but it gets the current image working.
This is certainly not a production script, but it works with the current image.
I've commented out a couple of optional mounts, for dags and, in my case, AWS keys.
If a feature similar to the puckel entrypoint above is developed, the webserver and scheduler definitions in the following docker-compose could be collapsed into a single container.

docker-compose.yml (working example mitigating for entrypoint.sh lacking initdb)
version: '3.7'
services:
    postgres:
        image: postgres:10
        restart: on-failure
        environment:
            - POSTGRES_USER=airflow
            - POSTGRES_PASSWORD=airflow
            - POSTGRES_DB=airflow
        logging:
            options:
                max-size: 10m
                max-file: "3"
        ports:
            - "5432:5432"
    initdb:
        image: apache/airflow:latest
        restart: on-failure
        depends_on:
            - postgres
        environment:
            - AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@postgres:5432/airflow
            - AIRFLOW__CORE__EXECUTOR=LocalExecutor
        logging:
            options:
                max-size: 10m
                max-file: "3"
        command: initdb
    webserver:
        image: apache/airflow:latest
        restart: on-failure
        depends_on:
            - postgres
        environment:
            - AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@postgres:5432/airflow
            - AIRFLOW__CORE__LOAD_EXAMPLES=True
            - AIRFLOW__CORE__EXECUTOR=LocalExecutor
            - AIRFLOW__WEBSERVER__BASE_URL=http://localhost:8080
        logging:
            options:
                max-size: 10m
                max-file: "3"
        # volumes:
        #     - ./dags:/opt/airflow/dags
        #     - ~/.aws:/home/airflow/.aws
        ports:
            - "8080:8080"
        command: webserver
    scheduler:
        image: apache/airflow:latest
        restart: on-failure
        depends_on:
            - postgres
        environment:
            - AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@postgres:5432/airflow
            - AIRFLOW__CORE__LOAD_EXAMPLES=True
            - AIRFLOW__CORE__EXECUTOR=LocalExecutor
            - AIRFLOW__WEBSERVER__BASE_URL=http://localhost:8080
        logging:
            options:
                max-size: 10m
                max-file: "3"
        # volumes:
        #     - ./dags:/opt/airflow/dags
        #     - ~/.aws:/home/airflow/.aws
        command: scheduler

resdevd (Author) commented May 4, 2020

entrypoint.sh is maintained in the Dockerfile itself; this PR is a docker-compose setup using the official Airflow Docker image.
If you explore the code, you can see the one-time setup of initdb and inituser done via docker exec into the main Airflow container through a shell script.
The idea is that users who need the initdb and inituser setup can run that script once after deploying docker-compose, or they can just use docker-compose by replacing the env variables with an existing DB connection string.
In this PR the containers are in restart mode, so users can run that script while docker-compose is active. IMO we shouldn't waste resources by having a container run just for a one-time action, and with this PR users won't have to rebuild images to change the entrypoint. (A sketch of what such a one-time script can look like is below.)
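
For illustration only, a minimal sketch of such a one-time init script, run from the host while docker-compose is up. The container name and user details here are placeholders, not the exact script shipped in this PR:

#!/usr/bin/env bash
# One-time initialization: run initdb and create an admin user inside the
# already-running webserver container ("airflow_webserver_cont" is a placeholder name).
set -euo pipefail

docker exec airflow_webserver_cont airflow initdb
docker exec airflow_webserver_cont airflow create_user \
    --role Admin \
    --username airflow \
    --password airflow \
    -e airflow@example.com \
    -f airflow \
    -l airflow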

stale bot commented Jun 20, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

The stale bot added the "stale" label (Stale PRs per the .github/workflows/stale.yml policy file) on Jun 20, 2020.
The stale bot closed this on Jun 28, 2020.