Airflow DAG to ping pull request reviewers #1540
Comments
Zack and I discussed my taking this on in private. It's a high-value DevEx improvement that should help move PRs along and ease some of the communication burden, lack of direction, and isolation folks have been feeling.
@AetherUnbound and @stacimc could y'all advise on the structure and approach I should take here, specifically with the DAGs, to make it as easy as possible to add them to the DAG pull in Airflow? In the meantime I will work on the generic script to comb through open PRs and ping reviewers.
It sounds like this could be run on a
I'm not sure I understand this part of the question or if that was helpful, sorry! If you're asking about how to make sure Airflow picks up the new DAG, you should be able to just add a new file under the maintenance directory (or wherever you think it should go), and as long as you define the DAG at the top level (not inside a function, for example), Airflow will pick it up. (Fun fact, although I imagine you won't run into it: the words 'Dag' and 'Airflow' have to appear at least once in the file or Airflow won't find it 😅)
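The "fun fact" above refers to Airflow's quick pre-filter: before fully parsing a file in the DAGs folder, the scheduler checks that its text mentions both words. A rough sketch of that heuristic (illustrative, not Airflow's actual implementation):

```python
def might_contain_dag(source: str) -> bool:
    """Sketch of Airflow's pre-filter: a .py file in the DAGs folder is only
    parsed for DAG objects if its text mentions both 'airflow' and 'dag'
    (case-insensitive). Files mentioning neither are skipped entirely."""
    text = source.lower()
    return "airflow" in text and "dag" in text


# A file that imports and instantiates a DAG at the top level passes trivially:
sample = "from airflow import DAG\ndag = DAG('ping_pr_reviewers')"
```

The other half of the rule is that the `DAG` object must be assigned at module top level; a DAG constructed only inside a function body is never collected.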
Where is the maintenance directory? In the catalog repo? I think we wanted to put these DAGs into this repository (openverse) to avoid mixing application code with non-application code (some folks had qualms about this). If you and Madison think it should just go in the catalog repo though, that's fine by me; we can move this issue over there 🙂
Ohh now I understand your question! I didn't even register what repo I was in 😄 Yeah I see the utility in keeping this in this repository, particularly since I imagine it's not likely to be the only DAG we ever want that doesn't quite fit within the Catalog. Astronomer doesn't have much to say about managing DAGs across repos, other than to not recommend it:
But it's definitely possible 🤷‍♀️ I remember reading about a solution that uses an S3 bucket sitting between the DAG source repos and Airflow, pulling from each of the repos. I believe Madison had a specific approach she was thinking of for this use case (sorry Madison if I'm misremembering 😆)? Edit: found the article I was thinking of
I think if this is a high-value item we'd like to get out quickly, putting it in the catalog makes sense. If we move to a monorepo setup, this could get easier to separate in some ways (and harder in others!).
@sarayourfriend do you remember where folks voiced concerns about keeping the DAGs in the catalog? I would personally be fine with implementing this in the Catalog and perhaps beefing up the sync/moving it back here if & when we start building up more DAGs of that type.
Out of curiosity, in which ways would it be harder?
I don't remember what issue the discussion happened in, sorry! @dhruvkb was the one I was chatting about it with, so maybe he remembers? IIRC the discussion was more or less about this kind of DAG specifically, though I could be misremembering.
Also, it sounds like for expediency it'll be best to implement this in the catalog repository for now 👍 Moving this issue over there.
I'm trying to add the PyGithub package to the catalog dependencies so it can be used in the DAG I'm writing for this issue; however, there's a nasty version conflict 😕
Airflow has a line in their dependency configuration pinning PyGithub to 1.54.1.
This is pretty unfortunate! Rather than using the popular PyGithub package, it seems like it'll be easier to just build the requests directly 😢
You could pin PyGithub to 1.54.1, but that's probably not the best idea.
That version is pretty old at this point 😢 Luckily we really only need to make a few targeted requests:
I'm hopeful that this won't be too terribly cumbersome.
@AetherUnbound and @stacimc could y'all advise me on something I'm not clear on: given this is basically just a script we'd otherwise run as a cron job, is there value in trying to make it "DAG-like" and split the steps into separate operators? For simplicity's sake it'd be nice to just have a single operator, if I can do things that way.
RE PyGithub, Airflow dependencies can be a complicated mess, so we use a constraints file to lock in dependencies. It does look like PyGithub is currently pinned to 1.54.1, as Zack mentioned. That's probably fine, but an
Edit: sorry, misread the question! More response in the comment below 👇🏼
Oh, to answer your original question - the
Okay, I've written it as a single function to pass to the operator. I'm not seeing the benefit of the TaskFlow API in this case: it's a single function, and there isn't really a "data" pipeline 🤔 We're not saving any of the data, as I think it's wiser to just treat GitHub itself as the source of truth.
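For illustration, the single-callable shape described above might look like the following. All names here are hypothetical, and the Airflow wiring is shown in comments since the surrounding DAG boilerplate is repo-specific:

```python
def ping_pr_reviewers(fetch_open_prs, post_comment, dry_run: bool = True) -> list[str]:
    """Single entry point for the DAG: find open PRs with requested reviewers
    and ping them in a comment.

    The GitHub calls (fetch_open_prs, post_comment) are injected so the core
    logic stays testable without network access. Returns the messages that
    were (or, in a dry run, would have been) posted."""
    messages = []
    for pr in fetch_open_prs():
        reviewers = [f"@{name}" for name in pr.get("requested_reviewers", [])]
        if not reviewers:
            continue  # nobody to ping on this PR
        body = f"Gentle ping for review: {' '.join(reviewers)}"
        if not dry_run:
            post_comment(pr["number"], body)
        messages.append(body)
    return messages


# Airflow wiring (sketch; lives in the DAG file, defined at module top level):
# with DAG("ping_pr_reviewers", schedule_interval="@daily", ...) as dag:
#     PythonOperator(task_id="ping", python_callable=ping_pr_reviewers, ...)
```

Since the whole job is one callable, a single `PythonOperator` task keeps it simple; splitting into multiple operators would only pay off if the fetch and comment steps needed independent retries or visibility.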
@AetherUnbound something I'm unclear about with the "deployless Airflow" setup we have now, with the DAG pull job and all: how do we get new variables into the production Airflow? Do we define them through the Airflow admin instead of the .env file? Or is there a way to pull new environment variables in and have Airflow pick them up?
@AetherUnbound and @rwidom (I remember y'all working on this) I need some help with the URI-encoded connection strings. For the GitHub connection I need to be able to pass headers as a query param in the final URL-encoded string, but it seems that's not possible with the current "automatic URL encoding" script 🤔 In any case, I'm having a heck of a time trying to get it to work with the automatic encoding. Looking at the example at the bottom of this page, for instance, I'm unable to get it to work: https://airflow.apache.org/docs/apache-airflow-providers-http/stable/connections/http.html Instead of using the headers param, it treats the connection string as the base URL, and I end up making requests that look like this:
As you can see, that's not a properly formed URI! Here's the template I'm using for the connection string:
Obviously with
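Stepping back, the shape of the problem can be sketched with the standard library: Airflow-style connection URIs carry extras (such as request headers) as query parameters, and every component has to be percent-encoded or the parser treats the raw text as part of the base URL. The helper and key names below are illustrative, not Airflow's API:

```python
from urllib.parse import quote, urlencode


def build_http_conn_uri(host: str, extras: dict) -> str:
    """Build an Airflow-style HTTP connection URI.

    The host (which for HTTP connections may itself be a full URL, colons,
    slashes and all) is percent-encoded, and the extras (e.g. headers) are
    carried as percent-encoded query parameters."""
    query = urlencode(extras, quote_via=quote)  # quote, not quote_plus: spaces become %20
    return f"http://{quote(host, safe='')}?{query}"


# The GitHub connection would need an Authorization header in the extras:
uri = build_http_conn_uri(
    "https://api.github.com",
    {"Authorization": "token SOME_TOKEN"},
)
```

The point of encoding each piece separately is exactly the failure described above: if the header value is left raw, the `://` and spaces make the string parse as a mangled base URL rather than as extras.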
For now, we're adding them through the UI. In order to pull them from environment variables, we'd either need to 1) redeploy Airflow entirely or 2) leverage the Airflow CLI during the DAG sync process to update the connections/variables automatically. More discussion on this here: WordPress/openverse-catalog#209. Down the road we can definitely do the latter, but we're using the UI to avoid the former currently 🙂
On the URI encoding: wow, uncovering a lot here! This was originally the solution we used, since Airflow apparently didn't support it at the time. It turns out, though, that you can also refrain from URL-encoding the variable by putting the
Note above the trailing
To summarize:
Does that sound okay @sarayourfriend?
Another approach could be that the
Actually, I think this approach would be best, because then we can store the token along with the connection rather than in a separate variable!
Originally I was passing the token as an Airflow variable to the DAG, and the HttpHook implementation just passes the Authorization header explicitly. The
I'd rather not get bogged down in changing the way connection strings are managed... And I still don't really understand the advantage of the connection-string syntax over using explicitly configured secrets in the hook implementation class through regular variables.
Honestly, for HTTP connections I could see it going either way. I think for Basic Auth it works great, but for anything outside that (Slack and this being perfect examples) it gets messy. The advantage of having it all managed in the connection string is that you only need to know the connection ID to get everything working; Airflow will pull in the rest from the connection information. If we're going to go the Variable route though, maybe it's best to just use
Okay, sounds good. Luckily it'll be an easy refactor 🙂
It looks like the request to leave the comments is malformed. These logs appear when attempting to run a non-dry run of the DAG in production Airflow:
This is a test comment!!!
Problem
We have a lot of pull requests that go stale; it would be nice to not have to manually ping reviewers for pull requests via Slack DM or `@`-mention in GitHub.
Description
Write a script + DAG to schedule the script that can automatically ping PR reviewers.
Alternatives
Additional context
This work will require some new folder structure in this repo for DAGs.
Implementation