Move "prepare-provider-documentation" to Breeze #35586

Merged

Conversation

potiuk
Member

@potiuk potiuk commented Nov 12, 2023

This PR moves the functionality of preparing provider documentation from a Python script inside the Breeze CI image to the Breeze Python package.

This is the first in a series of moves that will simplify how provider packages are built and prepared, with the aim of improving the security of the supply chain and making the release process easier to debug and modify.

Historically, the release process has been run inside Breeze for several reasons: isolating package preparation from the host environment, the need to keep a separate virtualenv, and because we run verification of provider packages during the release process, which requires the CI environment with all its dependencies.

So far the process looked like this:

  • bash breeze parsed the arguments
  • bash breeze started the docker bash script with packages as parameters
  • the bash script in the CI image looped over the packages and ran python prepare_provider_packages.py (twice) to generate docs and update the changelog (this is an interactive process where the release manager decides on bumping versions). Those Python scripts also performed verification of provider.yaml files
  • the bash script summarized the packages and displayed status of preparation

However, after moving to Python-based Breeze, we can simplify all of this and run those steps in Breeze's internal Python code, with no need to go to Docker and use bash scripts. We also do not have to verify provider.yaml files, because that is already done extensively in pre-commit.
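
As a rough illustration of the simplified flow (not the actual Breeze code), a single Python process can loop over the providers and report problems through ordinary exceptions, with the summary produced in the same process. The names `PrepareDocsError`, `prepare_docs_for` and `prepare_all` are hypothetical and exist only for this sketch.

```python
from __future__ import annotations


class PrepareDocsError(Exception):
    """Hypothetical error raised when docs for a provider cannot be prepared."""


def prepare_docs_for(provider_id: str) -> str:
    # Placeholder for the real work (changelog update, docs generation).
    if not provider_id:
        raise PrepareDocsError("empty provider id")
    return f"docs prepared for {provider_id}"


def prepare_all(provider_ids: list[str]) -> tuple[list[str], dict[str, str]]:
    """Process every provider in one process and collect a summary."""
    succeeded: list[str] = []
    failed: dict[str, str] = {}
    for provider_id in provider_ids:
        try:
            prepare_docs_for(provider_id)
        except PrepareDocsError as error:
            # A regular exception replaces an exit code interpreted in Bash.
            failed[provider_id] = str(error)
        else:
            succeeded.append(provider_id)
    return succeeded, failed


succeeded, failed = prepare_all(["airbyte", "", "amazon"])
print("succeeded:", succeeded)
print("failed:", failed)
```

The loop, the error handling and the summary all live in one place, which is the property the description above is after.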

This PR moves all this logic to inside Breeze.

Some duplicated code still remains in the original in-container prepare_provider_packages.py. This duplication will be removed by subsequent PRs, where other release management commands for provider packages will also be moved to Breeze as a follow-up to this PR.

This PR has the following changes:

  • moved the provider documentation code from dev/provider_packages to dev/breeze/ (and from in-container to in-breeze-venv execution)
  • completely removed the intermediate bash script and the Python scripts called from it, moving the logic to Breeze entirely
  • added better diagnostics of what happens when packages are classified with particular types of changes (added a special style to show it)
  • cleaned up and clarified the prepare-provider-documentation command line flags
  • introduced an explicit "non-interactive" mode that is used to run and test the command in CI and to test it locally
  • replaced str with Path where files were used in the moved code
  • added unit tests covering unit-testable parts of the moved code
  • refactored the moved code to use utils available in Breeze
  • split the code into packages and versions (reusable utils) and specific code for preparing package documentation
  • cached provider.yaml information retrieved from providers
  • moved provider documentation templates to Breeze
  • better error handling: errors are now regular exceptions in the Python process rather than exit codes returned by Python sub-scripts and interpreted in Bash
  • when the release manager classifies a package, only relevant sections are generated (Features/Breaking changes) based on the decision, and changes are automatically "guessed" only if the release manager chose the section where they would fall
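
To make the last item more concrete, here is a minimal sketch of generating only the changelog sections that match the release manager's decision. The enum, the section titles and the mapping between them are assumptions made for illustration; the real templates in Breeze may differ.

```python
from __future__ import annotations

from enum import Enum


class TypeOfChange(Enum):
    # Hypothetical names; the four categories come from the PR discussion.
    DOCUMENTATION = "documentation"
    BUGFIX = "bugfix"
    FEATURE = "feature"
    BREAKING_CHANGE = "breaking"


# Which sections are worth generating for each decision (assumed mapping).
SECTIONS_FOR_CHANGE = {
    TypeOfChange.DOCUMENTATION: [],
    TypeOfChange.BUGFIX: ["Bug Fixes"],
    TypeOfChange.FEATURE: ["Features", "Bug Fixes"],
    TypeOfChange.BREAKING_CHANGE: ["Breaking changes", "Features", "Bug Fixes"],
}


def render_changelog_sections(type_of_change: TypeOfChange) -> str:
    """Return reStructuredText headings for the relevant sections only."""
    lines: list[str] = []
    for title in SECTIONS_FOR_CHANGE[type_of_change]:
        lines += [title, "~" * len(title), ""]
    return "\n".join(lines)


print(render_changelog_sections(TypeOfChange.BUGFIX))
```

With this shape, choosing "Bugfix" never produces empty "Breaking changes" or "Features" headings that would have to be deleted by hand.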


@potiuk
Member Author

potiuk commented Nov 12, 2023

I know there is a lot of code moved here, but I think the easiest way to review (apart from the regular code-quality check) is to look at the result of this one run.

Anyone should be able to review the output of the CI command in the "Prepare providers" job. The way it is run in CI is that it "pretends" you are a release manager and performs release documentation preparation, randomly making decisions on the type of change (Breaking/Feature/Bugfix/Doc-only), and in the output you will see the generated documentation and the output of such a process run.

You can also do this manually to check it:

  • breeze release-management prepare-provider-documentation -> the usual command that Elad runs
  • breeze release-management prepare-provider-documentation --non-interactive -> this one will do the same as CI does, i.e. it will make random decisions and produce output as the release manager would see it, for the complete set of providers.
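
The idea behind the non-interactive mode can be sketched as follows. This is only an illustration under assumptions: the function name, the prompt text and the way the random generator is passed in are hypothetical and not taken from Breeze.

```python
from __future__ import annotations

import random

# The four categories named in the discussion above.
TYPES_OF_CHANGE = ["Breaking", "Feature", "Bugfix", "Doc-only"]


def decide_type_of_change(non_interactive: bool, rng: random.Random | None = None) -> str:
    """Ask the release manager, or pick at random when running unattended."""
    if non_interactive:
        chooser = rng if rng is not None else random
        return chooser.choice(TYPES_OF_CHANGE)
    answer = input(f"Type of change {TYPES_OF_CHANGE}: ").strip()
    if answer not in TYPES_OF_CHANGE:
        raise ValueError(f"Unknown type of change: {answer!r}")
    return answer


# A seeded generator makes the "random" CI decisions reproducible in tests.
decision = decide_type_of_change(non_interactive=True, rng=random.Random(0))
print(decision)
```

Passing a seeded `random.Random` is a design choice for this sketch: it keeps the random path testable without changing the interactive path.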

@eladkal -> You might appreciate that I fixed the few annoying things in this PR:

  • when you make a decision about the type of change, only the relevant "sections" will be generated in the Changelog (i.e. when you choose "Bugfix", you will no longer see empty "Breaking changes" and "Features" sections, so you will not have to delete them any more)

  • also, when you make a decision, it will spell out what is going to happen (i.e. to which version number the version is bumped, or whether the "documentation-only" hash has been updated).

@ashb -> I am finally cleaning it up. This is something you (rightfully) complained about in the past: we had this historically weird, complex sequence of host -> python -> docker -> bash -> loop -> multiple python commands in a loop -> summary in bash. The cleanup is also needed to strengthen the security of our release process, which I will explain soon in follow-up PRs.

After this (and a few follow-up PRs for package preparation, much smaller than this one), all the logic for provider release will finally be in a single place (Breeze) and it will be easier to debug and maintain.

@potiuk potiuk force-pushed the move-prepare-providers-documentation-to-breeze branch 5 times, most recently from 01c0e1b to d9d2dda Compare November 12, 2023 18:42
@potiuk
Member Author

potiuk commented Nov 12, 2023

All right.

I think that one is already going to be green... One example for easier review (but there are more in the CI job output):

https://github.com/apache/airflow/actions/runs/6842478060/job/18604229396?pr=35586

Airbyte

Bumping patchlevel:

Screenshot 2023-11-12 at 20 33 36

Commits:

Screenshot 2023-11-12 at 20 34 29

Changelog:

Screenshot 2023-11-12 at 20 36 07

@potiuk potiuk force-pushed the move-prepare-providers-documentation-to-breeze branch from d9d2dda to 547cd12 Compare November 12, 2023 19:42
@potiuk
Member Author

potiuk commented Nov 12, 2023

Looks green now :).

@potiuk potiuk changed the title Move prepare-provider-packages to breeze Move prepare provider documentation to Breeze Nov 13, 2023
@potiuk potiuk force-pushed the move-prepare-providers-documentation-to-breeze branch from 547cd12 to fddf030 Compare November 13, 2023 08:29
@potiuk potiuk changed the title Move prepare provider documentation to Breeze Move "prepare-provider-documentation" to Breeze Nov 13, 2023
potiuk added a commit to potiuk/airflow that referenced this pull request Nov 13, 2023
This is a follow-up after apache#35586 and it depends on this one. It
moves the whole functionality of preparing provider packages to
breeze, removing the need of doing it in the Breeze CI image.

Since we have Python breeze with its own environment managed via
`pipx` we can now make sure that all the necessary packages are
installed in this environment and run package building in the
same environment Breeze uses.

Previously we have been running all the package building inside the
CI image for two reasons:

* we could rely on the same version of build tools (wheel/setuptools)
  being installed in the CI image
* security of the provider package preparation that used setuptools
  pre PEP-517 way of building packages that executed setup.py code

The second reason was about isolating execution of potentially
arbitrary code in setup.py from the HOST environment in CI, where the
host environment might have access to secrets and tokens that would
allow such code to break out of the sandbox for PRs coming from forks.
The setup.py file has been prepared by breeze using JINJA templates,
but it was potentially possible to manipulate the provider package
directory structure and get "Python" injection into the generated
setup.py, so it was safer to run it in the isolated Breeze CI
environment.

This PR makes it secure to run the build in the Host environment:
instead of generating setup.cfg and setup.py, we generate a
declarative pyproject.toml with all the necessary information and
use a PEP-517 compliant way of building provider packages. No
arbitrary code can be executed via an imperative setup.py this way,
so we can safely build provider packages on the host.

We are using flit as the build tool. It is one of the popular build
tools, created by the Python Packaging team. It is simple and not
too opinionated, and it supports PEP-517 as well as PEP-621, so most of
the project metadata in pyproject.toml can be added to the PEP-621
compliant "project" section of pyproject.toml.
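
A minimal sketch of what generating such a declarative file could look like is shown below. It uses plain `str.format` rather than the JINJA templates mentioned above, and the version, description and dependency values are made up for illustration. The `[build-system]` entries are the standard way to select flit's build backend.

```python
from __future__ import annotations

import json

PYPROJECT_TEMPLATE = """\
[build-system]
requires = ["flit_core >=3.2,<4"]
build-backend = "flit_core.buildapi"

[project]
name = {name}
version = {version}
description = {description}
dependencies = [
{dependencies}
]
"""


def toml_string(value: str) -> str:
    # json.dumps yields a double-quoted, escaped string; for typical metadata
    # values this is also a valid TOML basic string (an approximation).
    return json.dumps(value, ensure_ascii=False)


def render_pyproject(name: str, version: str, description: str, dependencies: list[str]) -> str:
    """Render static project metadata; nothing here is executed at build time."""
    dependency_lines = "\n".join(f"    {toml_string(dep)}," for dep in dependencies)
    return PYPROJECT_TEMPLATE.format(
        name=toml_string(name),
        version=toml_string(version),
        description=toml_string(description),
        dependencies=dependency_lines,
    )


# Hypothetical values, for illustration only.
pyproject = render_pyproject(
    name="apache-airflow-providers-airbyte",
    version="1.0.0",
    description="Example provider package",
    dependencies=["apache-airflow>=2.5.0"],
)
print(pyproject)
```

Because every value ends up as quoted data in a TOML file, a manipulated input can at worst produce wrong metadata, not code that runs during the build.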

Together with this change we improve the process of generating the
extracted sources for the providers. Originally we copied the whole
sources of Airflow to a single directory (provider_packages) and ran
provider package builds from that single directory. This made it
impossible to parallelise such builds: all providers had to be built
sequentially.

We change the approach now: instead of copying all airflow
sources once to a single directory, we build providers in separate
subdirectories of files/provider_packages/PROVIDER_ID and we only
copy the relevant sources there (i.e. only the provider's subfolder from
"airflow/providers"). This is quite a bit faster (each provider
gets built using only its own sources, so just scanning the
directory is faster), and it also allows running package preparation
in parallel because each provider is fully isolated from others.
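
The per-provider layout can be sketched like this. It is a simplified illustration, not the Breeze implementation: the function names are hypothetical, provider ids are assumed to be single path segments, and the demo runs in a temporary directory.

```python
from __future__ import annotations

import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def copy_provider_sources(sources_root: Path, target_root: Path, provider_id: str) -> Path:
    """Copy only one provider's subfolder into its own build directory."""
    source = sources_root / "airflow" / "providers" / provider_id
    target = target_root / provider_id / "airflow" / "providers" / provider_id
    shutil.copytree(source, target)  # creates missing parent directories
    return target


def prepare_in_parallel(sources_root: Path, target_root: Path, provider_ids: list[str]) -> list[Path]:
    # Each provider writes to a separate directory, so copies cannot collide.
    with ThreadPoolExecutor() as executor:
        return list(
            executor.map(
                lambda provider_id: copy_provider_sources(sources_root, target_root, provider_id),
                provider_ids,
            )
        )


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for provider_id in ("airbyte", "amazon"):
        package = root / "src" / "airflow" / "providers" / provider_id
        package.mkdir(parents=True)
        (package / "__init__.py").write_text("")
    built = prepare_in_parallel(root / "src", root / "files" / "provider_packages", ["airbyte", "amazon"])
    copied_ok = all((path / "__init__.py").exists() for path in built)
    print([path.relative_to(root).as_posix() for path in built])
```

Since no two workers share a target directory, the same pattern extends from copying sources to running the package build itself in parallel.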

This PR also removes the no longer needed `prepare_providers_package.py`
and the unneeded `provider_packages` folder previously used to prepare
providers, as well as the bash script to build the providers and some
unused bash functions.
@potiuk
Member Author

potiuk commented Nov 13, 2023

Side effect: preparing provider documentation goes down from 14 minutes to 3.

@potiuk potiuk force-pushed the move-prepare-providers-documentation-to-breeze branch from 844b297 to c261c23 Compare November 15, 2023 00:49
@potiuk potiuk force-pushed the move-prepare-providers-documentation-to-breeze branch from c261c23 to 9416085 Compare November 15, 2023 00:55
@potiuk
Member Author

potiuk commented Nov 15, 2023

I left a few suggestions for the documentation piece.

Used the opportunity of jsonschema fix rebase to apply those 👍

@potiuk potiuk merged commit e755b79 into apache:main Nov 15, 2023
71 checks passed
@potiuk potiuk deleted the move-prepare-providers-documentation-to-breeze branch November 15, 2023 02:35
potiuk added a commit that referenced this pull request Nov 15, 2023
potiuk added a commit to potiuk/airflow that referenced this pull request Nov 16, 2023
This PR fixes some of the new templates added in apache#35586 and reapplies
them across the board for all providers. There were a few small
issues with those templates:

* wrong location of the original template in the comment
* extra comment and lines in the INDEX template
* changes in __init__.py with shorter warning and avoiding
  "too long line" have been reapplied to all providers
potiuk added a commit that referenced this pull request Nov 16, 2023
potiuk added a commit that referenced this pull request Nov 16, 2023
This is a follow-up after #35586 and it depends on this one. It
moves the whole functionality of preparing provider packages to
breeze, removing the need of doing it in the Breeze CI image.

Since we have Python breeze with its own environment managed via
`pipx` we can now make sure that all the necessary packages are
installed in this environment and run package building in the
same environment Breeze uses.

Previously we have been running all the package building inside the
CI image for two reasons:

* we could rely on the same version of build tools (wheel/setuptools)
  being installed in the CI image
* security of the provider package preparation that used setuptools
  pre PEP-517 way of building packages that executed setup.py code

In order to isolate execution of potentially arbitrary code
in setup.py from the HOST environment in CI - where the host
environment might have access to secrets and tokens that would allow
it to break out of the sandbox for PRs coming from forks. The setup.py
file has been prepared by breeze using JINJA templates but it was
potentially possible to manipulate provider package directory structure
and get "Python" injection into generated setup.py, so it was safer
to run it in the isolated Breeze CI environment.

This PR makes it secure to run it in the Host environment,
because instead of generating setup.cfg and setup.py we generate
pyproject.toml with all the necessary information and we are using
PEP-517 compliant way of building provider packages - no arbitrary
code executed via setup.py is possible this way on the host,
so we can safely build provider packages in the host. We are
generating declarative pyproject.toml for that rather than imperative
setup.py, so we are safe to run the build process in the host without
being afraid of executing arbitrary code.

We are using flit as build tool - this is one of the popular build
tools - created by Python Packaging team. It is simple and not
too opinionated, it supports PEP-517 as well as PEP-621, so most of
the project mnetadata in pyproject toml can be added to PEP-621
compliant "project" section of pyproject.toml.

Together with the change we improves the process of generation of the
extracted sources for the providers. Originally we copied the whole
sources of Airflow to a single directory (provider_packages) and run
sequentially provider packages building from that single directory,
however it made it impossible to parallelise such builds - all
providers had to be built sequentially.

We change the approach now - instead of copying all airflow
sources once to the single directory, we build providers in separate
subdirectories of files/provider_packages/PROVIDER_ID and we only
copy there relevant sources (i.e. only provider's subfolder from
the "airflow/providers". This is quite a bit faster (each provider
only gets built using only its own sources so just scanning the
directory is faster) but it also allows to run package preparation
in parallel because each provider is fully isolated from others.

This PR also excludes not-needed `prepare_providers_package.py`
and unneded `provider_packages` folder used to prepare providers
before as well as bash script to build the providers and some
unused bash functions.
potiuk added a commit that referenced this pull request Nov 16, 2023
This is a follow-up after #35586 and it depends on this one. It
moves the whole functionality of preparing provider packages to
breeze, removing the need of doing it in the Breeze CI image.

Since we have Python breeze with its own environment managed via
`pipx` we can now make sure that all the necessary packages are
installed in this environment and run package building in the
same environment Breeze uses.

Previously we have been running all the package building inside the
CI image for two reasons:

* we could rely on the same version of build tools (wheel/setuptools)
  being installed in the CI image
* security of the provider package preparation that used setuptools
  pre PEP-517 way of building packages that executed setup.py code

In order to isolate execution of potentially arbitrary code
in setup.py from the HOST environment in CI - where the host
environment might have access to secrets and tokens that would allow
it to break out of the sandbox for PRs coming from forks. The setup.py
file has been prepared by breeze using JINJA templates but it was
potentially possible to manipulate provider package directory structure
and get "Python" injection into generated setup.py, so it was safer
to run it in the isolated Breeze CI environment.

This PR makes it secure to run it in the Host environment,
because instead of generating setup.cfg and setup.py we generate
pyproject.toml with all the necessary information and we are using
PEP-517 compliant way of building provider packages - no arbitrary
code executed via setup.py is possible this way on the host,
so we can safely build provider packages in the host. We are
generating declarative pyproject.toml for that rather than imperative
setup.py, so we are safe to run the build process in the host without
being afraid of executing arbitrary code.

We are using flit as build tool - this is one of the popular build
tools - created by Python Packaging team. It is simple and not
too opinionated, it supports PEP-517 as well as PEP-621, so most of
the project mnetadata in pyproject toml can be added to PEP-621
compliant "project" section of pyproject.toml.

Together with the change we improves the process of generation of the
extracted sources for the providers. Originally we copied the whole
sources of Airflow to a single directory (provider_packages) and run
sequentially provider packages building from that single directory,
however it made it impossible to parallelise such builds - all
providers had to be built sequentially.

We change the approach now - instead of copying all airflow
sources once to a single directory, we build providers in separate
subdirectories of files/provider_packages/PROVIDER_ID and we only
copy the relevant sources there (i.e. only the provider's subfolder
from "airflow/providers"). This is quite a bit faster (each provider
is built using only its own sources, so just scanning the directory
is faster) and it also allows running package preparation in parallel
because each provider is fully isolated from the others.
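
The per-provider layout described above can be sketched roughly as
follows. The function names, the `max_workers` default and the use of
a thread pool are assumptions made for illustration; the actual Breeze
code may organise this differently, and the package build step itself
is left out.

```python
from __future__ import annotations

import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def copy_provider_sources(airflow_root: Path, target_root: Path, provider_id: str) -> Path:
    """Copy only one provider's subfolder into its own build directory."""
    # "cncf.kubernetes" maps to airflow/providers/cncf/kubernetes
    relative = Path("airflow", "providers", *provider_id.split("."))
    build_dir = target_root / provider_id
    # Start from a clean directory so stale files never leak into a build.
    shutil.rmtree(build_dir, ignore_errors=True)
    shutil.copytree(airflow_root / relative, build_dir / relative)
    return build_dir


def prepare_providers(
    airflow_root: Path,
    target_root: Path,
    provider_ids: list[str],
    max_workers: int = 4,
) -> dict[str, Path]:
    """Prepare isolated source directories for several providers in parallel."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Directories do not overlap, so the copies cannot interfere with each other.
        build_dirs = executor.map(
            lambda provider_id: copy_provider_sources(airflow_root, target_root, provider_id),
            provider_ids,
        )
        return dict(zip(provider_ids, build_dirs))
```

Each returned directory contains a single provider's sources, so a
build started in one of them never has to scan the others.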

This PR also removes the no-longer-needed `prepare_provider_packages.py`
and the unneeded `provider_packages` folder previously used to prepare
providers, as well as the bash script to build the providers and some
unused bash functions.
potiuk added a commit that referenced this pull request Nov 18, 2023
…35617)

ephraimbuddy added the changelog:skip label (Changes that should be skipped from the changelog: CI, tests, etc.) Nov 20, 2023
ephraimbuddy added this to the Airflow 2.8.0 milestone Nov 20, 2023
Labels
area:dev-tools changelog:skip Changes that should be skipped from the changelog (CI, tests, etc..)

6 participants