
Airflow DAG to ping pull request reviewers #1540

zackkrida opened this issue Jun 8, 2022 · 26 comments · Fixed by WordPress/openverse-catalog#553 or WordPress/openverse-catalog#570

@zackkrida (Member)

Problem

We have a lot of pull requests that go stale; it would be nice not to have to manually ping reviewers for pull requests via Slack DM or @-mention on GitHub.

Description

Write a script that can automatically ping PR reviewers, plus a DAG to schedule it.

Alternatives

Additional context

This work will require some new folder structure in this repo for DAGs.

Implementation

  • 🙋 I would be interested in implementing this feature.
@zackkrida added the 🚦 status: awaiting triage and ✨ goal: improvement labels on Jun 8, 2022
@sarayourfriend self-assigned this on Jun 8, 2022
@sarayourfriend added the 🟧 priority: high, 🤖 aspect: dx, and 🧰 goal: internal improvement labels and removed the 🚦 status: awaiting triage and ✨ goal: improvement labels on Jun 8, 2022
@sarayourfriend (Contributor)

Zack and I discussed privately that I would take this on. It's a high-value DevEx improvement that should help move PRs along and ease some of the communication burden and the lack of direction and isolation folks have been feeling.

@sarayourfriend (Contributor)

@AetherUnbound and @stacimc, could y'all advise on the structure and approach I should take for this, specifically with the DAGs, to make it as easy as possible to add them to the DAG pull in Airflow?

In the meantime I will work on the generic script to comb through open PRs and ping reviewers.

@stacimc (Contributor) commented Jun 8, 2022

It sounds like this could be run on a @daily schedule. The maintenance directory might be a good place to put something like this. Depending on what you think makes sense for notifications, the Slack API allows pinging users.

specifically with the DAGs to make it as easy as possible to add them to the DAG pull in Airflow?

I'm not sure I understand this part of the question or if that was helpful, sorry!

If you're asking about how to make sure Airflow picks up the new DAG, you should be able to just add a new file under the maintenance directory (or wherever you think it should go) and as long as you define the dag at the top level (and not inside a function for example), Airflow will pick it up.

(Fun fact, although I imagine you won't run into it: the words 'Dag' and 'Airflow' have to appear at least once in the file or Airflow won't find it 😅)
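
For illustration, a minimal top-level DAG file along these lines might look like the following; the module path, dag_id, and callable here are placeholders, not the eventual implementation:

    # e.g. openverse_catalog/dags/maintenance/pr_review_reminders.py (hypothetical path)
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator


    def ping_reviewers():
        # Placeholder for the routine that combs open PRs and pings reviewers.
        ...


    # Defining the DAG at the top level of the module is what lets Airflow discover it.
    with DAG(
        dag_id="pr_review_reminders",
        schedule_interval="@daily",
        start_date=datetime(2022, 6, 1),
        catchup=False,
    ) as dag:
        PythonOperator(task_id="ping_reviewers", python_callable=ping_reviewers)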

@sarayourfriend (Contributor)

Where is the maintenance directory? In the catalog repo? I think we wanted to put these DAGs into this repository (the openverse one) to avoid mixing application code with non-application code (some folks had qualms about this).

If you and Madison think it should just go in the catalog repo though that's fine by me, we can move this issue over there 🙂

@stacimc (Contributor) commented Jun 8, 2022

Ohh, now I understand your question! I didn't even register what repo I was in 😄 Yeah, I see the utility in keeping this in this repository, particularly since I imagine it's unlikely to be the only DAG we'll ever want that doesn't quite fit within the Catalog.

Astronomer doesn't have much to say about managing DAGs across repos, other than to not recommend it:

This is a less common pattern and is not recommended for project organization unless it is specifically required. In this case, deploying the files from different repositories together into one Airflow deployment should be managed by your CI/CD tool.

But it's definitely possible 🤷‍♀️ I remember reading about a solution that uses an S3 bucket sitting in between the DAG source repos and Airflow, and pulls from each of the repos. I believe Madison had a specific approach she was thinking of for this use case (sorry Madison if misremembering 😆)?

Edit: found the article I was thinking of

@AetherUnbound (Contributor) commented Jun 8, 2022

I think if this is a high-value item we'd like to get out quickly, putting it in maintenance (or even a maintainers or other top-level directory to encapsulate DevEx DAGs, for now) in the catalog repo would be good. The DAG sync works by doing a git pull, so we'd also need it to do a separate git pull and copy files into the DAG directory, and then report that update, which might be difficult. There are definitely more complex ways we could sync the DAGs, but I'm not sure we want to tackle that before this issue 😅 Also many of the "git sync" recommendations we'll find for Airflow involve using the Kubernetes Executor, which we are not set up to do yet either haha.

If we move to a monorepo setup, this could get easier to separate in some ways (and harder in others!).

@stacimc (Contributor) commented Jun 8, 2022

Also many of the "git sync" recommendations we'll find for Airflow involve using the Kubernetes Executor, which we are not set up to do yet either haha.
+1 😄

@sarayourfriend do you remember where folks voiced concerns about keeping the DAGs in the catalog? I would personally be fine with implementing this in the Catalog and perhaps beefing up the sync/moving it back here if & when we start building up more DAGs of that type.

If we move to a monorepo setup, this could get easier to separate in some ways (and harder in others!)

Out of curiosity, in which ways would it be harder?

@sarayourfriend (Contributor)

I don't remember what issue the discussion happened in, sorry! @dhruvkb was the one who I was chatting about it with, so maybe he remembers? IIRC the discussion was more or less about this kind of DAG specifically, though I could be misremembering.

@sarayourfriend (Contributor)

Also, it sounds like for expediency it'll be best to implement this in the catalog repository for now 👍 Moving this issue over there.

@sarayourfriend transferred this issue from WordPress/openverse on Jun 9, 2022
@sarayourfriend (Contributor)

I'm trying to add the PyGithub package to the catalog dependencies so it can be used in the DAG I'm writing for this issue; however, there's a nasty version conflict 😕

ERROR: Cannot install -r requirements_prod.txt (line 8) and apache-airflow because these package versions have conflicting dependencies.

The conflict is caused by:
    pygithub 1.55 depends on pyjwt>=2.0
    flask-appbuilder 3.4.5 depends on PyJWT<2.0.0 and >=1.7.1

Airflow has this line in their dependency configuration:

    # We are tightly coupled with FAB version because we vendored in part of FAB code related to security manager
    # This is done as part of preparation to removing FAB as dependency, but we are not ready for it yet
    # Every time we update FAB version here, please make sure that you review the classes and models in
    # `airflow/www/fab_security` with their upstream counterparts. In particular, make sure any breaking changes,
    # for example any new methods, are accounted for.
    flask-appbuilder==3.4.5

This is pretty unfortunate!

Rather than using the popular PyGithub package, it seems like it'll be easier to just build the requests directly 😢

@zackkrida (Member, Author)

You could pin PyGithub to 1.54.1 but that's probably not the best idea.

@sarayourfriend (Contributor)

That version is pretty old at this point 😢 Luckily, we really only need to make a few targeted requests (roughly sketched below):

  1. Get list of open PRs
  2. Get comments on PRs
  3. Post comments on PRs

I'm hopeful that this won't be too terribly cumbersome.
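
For reference, those three calls with plain requests against the GitHub REST API could be sketched roughly like this; the env var, repo constant, and function names are illustrative, not the eventual implementation:

    import os

    import requests

    API = "https://api.github.com"
    REPO = "WordPress/openverse"
    # Hypothetical env var; in practice the token would come from wherever we store secrets.
    HEADERS = {"Authorization": f"token {os.environ['GITHUB_TOKEN']}"}


    def list_open_prs():
        # 1. Get the list of open PRs
        response = requests.get(
            f"{API}/repos/{REPO}/pulls", params={"state": "open"}, headers=HEADERS
        )
        response.raise_for_status()
        return response.json()


    def get_pr_comments(pr_number):
        # 2. Get the (issue) comments on a PR
        response = requests.get(
            f"{API}/repos/{REPO}/issues/{pr_number}/comments", headers=HEADERS
        )
        response.raise_for_status()
        return response.json()


    def post_pr_comment(pr_number, body):
        # 3. Post a comment on a PR; the payload is JSON with a "body" key
        response = requests.post(
            f"{API}/repos/{REPO}/issues/{pr_number}/comments",
            json={"body": body},
            headers=HEADERS,
        )
        response.raise_for_status()
        return response.json()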

@sarayourfriend (Contributor)

@AetherUnbound and @stacimc, could y'all advise me on something I'm not clear on: given this is basically just a script we'd otherwise run as a cron job, is there value in trying to make it "DAG-like" and splitting the steps into separate operators? For simplicity's sake it'd be nice to just have a single PythonOperator that runs through the routine in a single callable, but maybe that's not very resilient or something? I also don't want to overcomplicate things, though.

If I can do everything in a single PythonOperator with one function, what is the benefit of using the HttpHook over requests directly? I've read the documentation describing why to use hooks, but it's still not super clear to me that it offers a benefit in this case (a single function that does everything all at once, without needing to DAG-ify the process through multiple discrete tasks). In fact, it seems like it actually significantly complicates things 🙂 Maybe the HttpHook approach also does some Airflow magic to make the process more debuggable or something?

@AetherUnbound (Contributor) commented Jun 9, 2022

Re PyGithub: Airflow dependencies can be a complicated mess, so we use a constraints file to lock in dependencies. It does look like PyGithub is currently pinned to 1.54.1, as Zack mentioned. That's probably fine, but an HttpHook would also be totally reasonable.

As for the "dagification", go with what is easiest! Using a PythonOperator with the HttpHook is an idiomatic way of writing a DAG 😄 You could split that up into separate functions, passing parameters between them using XComs if you'd like. Or it could all be just one operator. Since this is a pretty straightforward DAG, you could also try using the TaskFlow API! I haven't used it yet, but I've been wanting to for some time, might be the perfect opportunity.

Edit: sorry, misread the question! More response in the comment below 👇🏼
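
Purely for illustration, a TaskFlow version of this DAG might be sketched roughly like so; the dag/task names and the two-task split are hypothetical, not the actual implementation:

    from datetime import datetime

    from airflow.decorators import dag, task


    @dag(schedule_interval="@daily", start_date=datetime(2022, 6, 1), catchup=False)
    def pr_review_reminders():
        @task
        def get_prs_to_ping():
            # Return the PRs that need a reviewer reminder; the return value is
            # handed to the next task via XCom automatically.
            return []

        @task
        def post_reminders(prs):
            for pr in prs:
                ...  # post a reminder comment on the PR

        post_reminders(get_prs_to_ping())


    pr_review_reminders_dag = pr_review_reminders()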

@AetherUnbound (Contributor) commented Jun 9, 2022

Oh, to answer your original question: the HttpHook is set up with an Airflow Connection. If there is an API key we need to use, or anything else that needs access to secrets, we can store that in an http-type Connection. The hook can then use that connection by name, which saves us from having to manage the secrets another way. If you're accessing all public URLs, requests is fine! But for something like Slack, using an HttpHook makes it easier to keep the webhook URL secure.
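
As a rough illustration of that pattern (the connection id "github" and the endpoint are just examples, not the real setup):

    from airflow.providers.http.hooks.http import HttpHook

    # The base URL and any credentials live on the Connection named "github",
    # so nothing secret needs to appear in the DAG code itself.
    github_hook = HttpHook(method="GET", http_conn_id="github")

    # run() issues the request against the connection's host plus this endpoint;
    # extra headers can also be passed to run() if needed.
    response = github_hook.run(endpoint="/repos/WordPress/openverse/pulls")
    open_prs = response.json()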

@sarayourfriend (Contributor)

Okay. I've written it as a single function passed to a PythonOperator as its python_callable. It uses HttpHook, but the secret is pulled in the DAG definition file using Variable and passed to the callable with op_args. Is that okay, or does it need to happen another way?

I'm not seeing the benefit of the TaskFlow API in this case. It's a single function, there isn't really a "data" pipeline 🤔 We're not saving any of the data as I think it's wiser to just treat GitHub itself as the source of truth.
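
For context, the Variable + op_args pattern described above looks roughly like this; the Variable name, dag_id, and task id are placeholders:

    from datetime import datetime

    from airflow import DAG
    from airflow.models import Variable
    from airflow.operators.python import PythonOperator


    def post_reminders(github_pat):
        # Use the token to authorize the GitHub requests made inside the callable.
        ...


    with DAG(
        dag_id="pr_review_reminders",
        schedule_interval="@daily",
        start_date=datetime(2022, 6, 1),
        catchup=False,
    ):
        PythonOperator(
            task_id="post_reminders",
            python_callable=post_reminders,
            # The secret is read here, in the DAG definition file, and handed to
            # the callable; "GITHUB_PAT" is a hypothetical Variable name.
            op_args=[Variable.get("GITHUB_PAT")],
        )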

@sarayourfriend (Contributor)

@AetherUnbound something I'm unclear about with our "deployless Airflow" setup that we have now with the DAG pull job and all... how do we get new variables into the production Airflow? Do we define them through the Airflow admin instead of the .env file? Or is there a way to pull new environment variables in and have Airflow pick them up?

@sarayourfriend (Contributor)

@AetherUnbound and @rwidom (I remember y'all working on this), I need some help with the URI-encoded connection strings. For the GitHub connection I need to be able to pass headers as a query param in the final URL-encoded string, but it seems that's not possible with the current "automatic URL encoding" script 🤔 Or in any case, I'm having a heck of a time trying to get it to work with the automatic encoding. Looking at the example at the bottom of this page, for instance, I'm unable to get it to work: https://airflow.apache.org/docs/apache-airflow-providers-http/stable/connections/http.html

Instead of using the headers param, it treats the whole connection string as the base URL, and I end up making requests that look like this:

[2022-06-10, 10:46:03 UTC] {http.py:129} INFO - Sending 'GET' to url: https://api.github.com?headers=%7B'Authorization':'token%20<redacted>'%7D/repos/WordPress/openverse/pulls

As you can see, that's not a properly formed URI!

Here's the template I'm using for the connection string:

AIRFLOW_CONN_GITHUB="https://api.github.com?headers=%7B'Authorization':'token%20%3Cnot_set%3E'%7D"

Obviously with not_set replaced by an actual PAT in my .env file.

@AetherUnbound (Contributor)

@AetherUnbound something I'm unclear about with our "deployless Airflow" setup that we have now with the DAG pull job and all... how do we get new variables into the production Airflow? Do we define them through the Airflow admin instead of the .env file? Or is there a way to pull new environment variables in and have Airflow pick them up?

For now, we're adding them through the UI. In order to pull them from environment variables, we'd either need to 1) redeploy Airflow entirely, or 2) leverage the Airflow CLI during the DAG sync process to update the connections/variables automatically. More discussion on this here: WordPress/openverse-catalog#209. Down the road we can definitely do the latter, but we're using the UI to avoid the former currently 🙂

@AetherUnbound (Contributor)

On the URI encoding: wow, we're uncovering a lot here! This was originally the solution we used, since Airflow apparently didn't support https as the initial scheme for HTTP connections: https://stackoverflow.com/a/69387598/3277713

It turns out, though, that you can also refrain from URL-encoding the variable by putting the schema after the URL, which is referenced in the original post. An example of that is also shown in the docs you linked:

For example:

export AIRFLOW_CONN_HTTP_DEFAULT='http://username:password@servvice.com:80/https?headers=header'

Note the trailing /https above, which I believe directs the HttpHook to use https as the scheme instead of http. Now this makes things difficult! Our Slack URLs are of the form https://hooks.slack.com/services/<secret1>/<secret2>/<secret3>. I'd love to throw out the connection string encoding bits we have in the Docker entrypoint (not because they aren't excellent @rwidom 😉 rather, deleted code is debugged code!), but I worry doing so would lead to Airflow interpreting /services as the scheme to use for our Slack connections. Those connections aren't used locally, though, and I believe we define them through the UI anyway in production! And unfortunately we can't just specify hooks.slack.com as the HTTP connection in the environment variable, since we still need to store the secrets somewhere.

To summarize:

  • I believe this issue is probably blocking this work, so we should jump on it ASAP
  • We should remove the URL-encoding steps of the entrypoint and revert to unencoded connections by default, with encoded connections where we need to specify a full path (e.g. Slack)
  • Ensure that the header extras will function appropriately after all this

Does that sound okay @sarayourfriend?

@AetherUnbound (Contributor) commented Jun 10, 2022

Another approach could be to have the HttpHook you construct explicitly override the auth_type parameter, and store the API token in the password field of the connection environment variable. requests doesn't seem to have its own class to handle this, unfortunately: https://requests.readthedocs.io/en/latest/api/#authentication. The auth_type class just gets called with the login/password, so it seems like we could potentially make a shim for it ourselves to handle the Authorization: token use case.

Actually, I think this approach would be best because then we can store the token along with the connection rather than in a separate variable!
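
A shim along those lines might look roughly like this; the class name and connection id are hypothetical, and per the above, HttpHook instantiates auth_type with the connection's login and password:

    from airflow.providers.http.hooks.http import HttpHook
    from requests.auth import AuthBase


    class GitHubTokenAuth(AuthBase):
        """Attach an "Authorization: token <PAT>" header to every request."""

        def __init__(self, login, password):
            # HttpHook calls auth_type(conn.login, conn.password); only the
            # password field (where the PAT would be stored) matters here.
            self.token = password

        def __call__(self, request):
            request.headers["Authorization"] = f"token {self.token}"
            return request


    # "github" is a hypothetical connection id whose password field holds the PAT.
    github_hook = HttpHook(method="GET", http_conn_id="github", auth_type=GitHubTokenAuth)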

@sarayourfriend (Contributor)

Originally I was passing the token as an Airflow Variable to the DAG, and then the HttpHook implementation just passes the Authorization header explicitly. The run method accepts headers 🤷 It's a faster solution, anyway.

I'd rather not get bogged down in changing the way connection strings are managed... And I still do not really understand the advantage of the connection string syntax over using explicitly configured secrets in the hook implementation class just through regular variables?

@AetherUnbound (Contributor)

I'd rather not get bogged down in changing the way connection strings are managed... And I still do not really understand the advantage of the connection string syntax over using explicitly configured secrets in the hook implementation class just through regular variables?

Honestly, for HTTP connections I could see it going either way. I think for Basic Auth it works great, but for anything outside that (Slack and this being the perfect examples) it gets messy. The advantage of having that all managed in the connection string is that you only need to know the connection ID to get everything working; Airflow will pull in the rest from the connection information. If we're going to go the Variable route though, maybe it's best to just use requests directly. It seems a bit unnecessary to have an entire HTTP connection that points to github.com 😅

@sarayourfriend (Contributor)

Okay, sounds good. Luckily it'll be an easy refactor 🙂

@sarayourfriend (Contributor)

It looks like the request to leave the comments is malformed. These logs appear when attempting to run a non-dry run of the DAG in production Airflow:

[2022-06-27, 00:33:17 UTC] {pr_review_reminders.py:165} INFO - Pinging @dhruvkb, @stacimc to review Add basic API load testing script
[2022-06-27, 00:33:17 UTC] {taskinstance.py:1889} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/airflow/operators/python.py", line 171, in execute
    return_value = self.execute_callable()
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/airflow/operators/python.py", line 189, in execute_callable
    return self.python_callable(*self.op_args, **self.op_kwargs)
  File "/usr/local/airflow/openverse_catalog/dags/maintenance/pr_review_reminders/pr_review_reminders.py", line 167, in post_reminders
    gh.post_issue_comment(base_repo_name(pr), pr["number"], comment_body)
  File "/usr/local/airflow/openverse_catalog/dags/common/github.py", line 40, in post_issue_comment
    return self._make_request(
  File "/usr/local/airflow/openverse_catalog/dags/common/github.py", line 16, in _make_request
    response.raise_for_status()
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/requests/models.py", line 960, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: https://api.github.com/repos/WordPress/openverse/issues/257/comments

@openverse-bot (Collaborator)

This is a test comment!!!
