Airflow as Pipeline Orchestrator

Example of deploying and using Airflow with Docker-compose to schedule and monitor workflows.

Airflow Webserver (screenshot)

Table of Contents

  • About the project
  • Built With
  • Architecture Diagram
  • Missing Parts
  • Getting Started
    • Prerequisites
    • Executing the project
  • Contributing
  • Acknowledgements

About the project

This project shows an example of building a production Airflow deployment with Docker, managed by Docker-compose, using ELT pipelines to ingest data and move it through Data Lake stages.

Built With

Some of the technologies used to build this project have already been mentioned, but not all of them. Here is a summary:

  • Python: the programming language Airflow is built with, also used here to connect to GCP and manipulate data;
  • Pip: the package installer for Python;
  • Airflow: a platform created by the community to programmatically author, schedule and monitor workflows; in our case, data pipelines;
  • Docker and Docker-compose: Docker packages the code and all its dependencies into images that run as containers, so the application runs quickly and reliably from one computing environment to another; Docker-compose manages the multi-container setup;
  • Cloud Storage: an object storage service from Google Cloud Platform (GCP), commonly used as a Data Lake thanks to characteristics such as unlimited storage, no minimum object size and easy transfer, much like any Hadoop data repository;
  • YAML file: a serialization language often used as a format for configuration files; its object serialization abilities also make it a viable replacement for formats like JSON. In this project we use it as a configuration file to build our Airflow DAGs (see the sketch after this list).
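
Below is a minimal sketch of how a YAML file can drive DAG generation. The file path dags/config/pipelines.yaml, the config layout and the run_step callable are assumptions for illustration, not the project's actual code, and the operator import path differs between Airflow 1.10.x and 2.x.

# A hypothetical pipelines.yaml might look like:
#   github_pipeline:
#     schedule: "@daily"
#     steps: [extract_raw, refine_data]
from datetime import datetime

import yaml
from airflow import DAG
from airflow.operators.python import PythonOperator  # Airflow 1.10.x: airflow.operators.python_operator


def run_step(step_name, **_):
    # Placeholder callable; the real project would dispatch to its own extract/transform functions.
    print(f"Running step: {step_name}")


with open("/usr/local/airflow/dags/config/pipelines.yaml") as f:  # assumed path inside the container
    config = yaml.safe_load(f)

for dag_id, params in config.items():
    dag = DAG(
        dag_id=dag_id,
        schedule_interval=params.get("schedule", "@daily"),
        start_date=datetime(2020, 1, 1),
        catchup=False,
    )
    previous = None
    for step in params["steps"]:
        task = PythonOperator(task_id=step, python_callable=run_step,
                              op_kwargs={"step_name": step}, dag=dag)
        if previous is not None:
            previous >> task
        previous = task
    globals()[dag_id] = dag  # expose each generated DAG so Airflow's DAG bag picks it up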

Architecture Diagram

This is an overview of the project's architecture.

Project Architecture (diagram)

We extract data from GitHub and store it in a Storage stage called Raw Data, exactly as the data is found. We then read from the Raw Data stage, apply some data manipulation (turning it into tabular data) and store the result in a second stage called Refined Data. All of this is done with Python code orchestrated by Airflow running inside a Docker container, as sketched below.
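
The following sketch shows what such a two-stage DAG could look like. The bucket names, blob paths, GitHub endpoint and DAG id are assumptions for illustration; the project's real pipeline code may differ.

import json
from datetime import datetime

import pandas as pd
import requests
from airflow import DAG
from airflow.operators.python import PythonOperator  # Airflow 1.10.x: airflow.operators.python_operator
from google.cloud import storage

RAW_BUCKET = "example-raw-data"          # hypothetical bucket for the Raw Data stage
REFINED_BUCKET = "example-refined-data"  # hypothetical bucket for the Refined Data stage


def extract_to_raw(**_):
    # Pull repository metadata from the public GitHub API and land it untouched in the Raw Data stage.
    response = requests.get("https://api.github.com/users/francamacdowell/repos")
    response.raise_for_status()
    client = storage.Client()
    client.bucket(RAW_BUCKET).blob("github/repos.json").upload_from_string(response.text)


def refine_data(**_):
    # Read the raw JSON, flatten it into tabular form and write it to the Refined Data stage as CSV.
    client = storage.Client()
    raw = client.bucket(RAW_BUCKET).blob("github/repos.json").download_as_text()  # download_as_string() on older clients
    table = pd.json_normalize(json.loads(raw))
    client.bucket(REFINED_BUCKET).blob("github/repos.csv").upload_from_string(table.to_csv(index=False))


with DAG(dag_id="github_elt", start_date=datetime(2020, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    extract = PythonOperator(task_id="extract_to_raw", python_callable=extract_to_raw)
    refine = PythonOperator(task_id="refine_data", python_callable=refine_data)
    extract >> refine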

Missing Parts

With this portfolio project, I would also like to show how to deploy the project to a VM (Compute Engine on GCP or EC2 on AWS).

Getting Started

To execute this project, you basically need to:

  • Start the containers
  • Access the Airflow webserver
  • Control and inspect your DAGs and tasks

Prerequisites

You will need to:

  • Install Docker: I'm using version 19.03.12
  • Install Docker-compose: version 1.26.2
  • Configure google_credentials.json: to access GCP you have to create a project and generate a service account key as a JSON file. Rename it to google_credentials.json and put it in the root folder (a minimal connection check is sketched after this list).
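
Before starting the containers, you can sanity-check the key with a few lines of Python. This is a minimal sketch assuming the key sits at the repository root as google_credentials.json; inside the container, the environment variable would point to wherever docker-compose mounts the file.

import os

from google.cloud import storage

# The GCP client libraries read the service account key from this environment variable.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "./google_credentials.json"

client = storage.Client()
for bucket in client.list_buckets():
    print(bucket.name)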

Executing the project

After completing the prerequisites, we are ready to execute the project with a single command inside the project directory:

sudo docker-compose up --build -d

Now open in your browser:

localhost:8080

Meaning of the docker-compose command and its parameters

  • up: builds, (re)creates, starts, and attaches to the containers for a service.
  • --build: makes Docker build images before starting the containers.
  • -d: detached mode: runs the containers in the background and prints the new container names.

You will now have the Airflow webserver running on localhost on port 8080.

Contributing

  1. Fork it (https://github.com/francamacdowell/orchestrator-airflow)
  2. Create your feature branch (git checkout -b feature/fooBar)
  3. Commit your changes (git commit -am 'Add some fooBar')
  4. Push to the branch (git push origin feature/fooBar)
  5. Create a new Pull Request

Acknowledgements
