-
Notifications
You must be signed in to change notification settings - Fork 51
Merge popularity calculations and data refresh into a single DAG #496
Merge popularity calculations and data refresh into a single DAG #496
Conversation
@@ -0,0 +1,174 @@ | |||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note for reviewers: This file just pulls out all the old data_refresh steps into a new subfactory.
If `force_refresh_metrics` is explictly configured to False (rather than omitted), then do not refresh the popularity metrics even if this is the first successful run of the month. This option could be helpful if, for example, the first dagrun of the month succeeds during the popularity steps but fails during the data refresh. When we manually re-run the DAG, we can save time by skipping this step.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few more small things!
openverse_catalog/dags/data_refresh/refresh_popularity_metrics_task_factory.py
Outdated
Show resolved
Hide resolved
python_callable=sql.update_db_view, | ||
op_args=[POSTGRES_CONN_ID, media_type], | ||
trigger_rule=TriggerRule.NONE_FAILED_MIN_ONE_SUCCESS, | ||
doc_md=create_refresh_view_data_task.__doc__, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💖
I cannot test this. On the first step, when I trigger
The ingestion server seems to finish the task, at least that's what I understand from all the logs, and their last lines:
I really love being able to run both the catalog and the API ingestion server locally! I hope I can actually finish the refresh, though :) |
Thanks for testing @obulat! I wonder, can you check for
But with the changes in #480 we can now just configure:
If you haven't updated your connection variables to the improved format, you'll see those connection errors. |
|
||
You can find more background information on this process in the following | ||
issues and related PRs: | ||
|
||
- [[Feature] Data refresh orchestration DAG]( | ||
https://github.com/WordPress/openverse-catalog/issues/353) | ||
- [[Feature] Merge popularity calculations and data refresh into a single DAG]( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice to have these PRs linked here
Thank you, @stacimc! This fixed it, and I managed to run all of the steps you describe in the Testing instructions. It all works really well, and the workflow is much simpler now. One thing I don't understand is this: I have not ingested any data, I was running |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for such detailed instructions for testing! It all works great and simplifies a lot in the process ❤️
@obulat The process syncs the catalog with the API DB, but as far as I understand it isn't smart enough to detect whether there have actually been any changes. Actually, when you're testing locally the data refresh server doesn't even really connect to your local catalog by default. It connects to a mock
|
I just realized that we needed to update the @AetherUnbound, @obulat -- just wanted to give you an opportunity to object/add feedback to the last commit if you'd like before I merge. Sorry about that! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call on the timeouts! I love being able to pass around the data classes themselves 😄
Have you tried running both the catalog and the API locally? I have wanted to try it for a year now :) I managed to get it to work, but maybe there's is a simpler way? Simply setting the upstream_db values didn't work. I looked at the local networks (
Based on https://docs.docker.com/compose/networking/#use-a-pre-existing-network And it worked, and now the local catalog and the API are talking to each other. And none of this is related to this PR, of course, 😂 |
Fixes
Fixes WordPress/openverse#1603 by @AetherUnbound
Description
For each media type, this PR merges the following DAGs into a single DAG:
refresh_all_*_popularity_data
: A DAG that currently runs@monthly
. It updates the underlying popularity DB tablesrefresh_*_view_data
: A DAG that currently runs@daily
, except on the first of the month (when the previous DAG runs). It refreshes the materialized view that calculates popularity for records.*_data_refresh
: A DAG that runs@weekly
. It syncs data (including popularity data) in the Catalog up to the API db and elasticsearch cluster, so that it is accessible via the frontend.The new, single DAG combines all the steps and looks like this:
![Screen Shot 2022-05-02 at 11 11 41 AM](https://user-images.githubusercontent.com/63313398/166301808-92987258-0987-442e-a1a5-82a0de5216a1.png)
Key benefits:
How the DAG works
Here's the full DAG, with the TaskGroups expanded so you can see all the steps:
![Screen Shot 2022-05-05 at 3 21 37 PM](https://user-images.githubusercontent.com/63313398/167034839-4f34edbe-7abf-4bed-91ef-ff14b4822ce6.png)
The steps are:
month_check
: a branching operator that checks whether this is the first time this DAG has run this month. If so, it proceeds to the full popularity recalculation; if not, we go directly to the matview refresh.refresh_popularity_metrics_and_constants
: this TaskGroup is skipped if this isn't the first run of the month. In this step, we update the underlying popularity DB tables:update_media_popularity_metrics_table
: this adds any new popularity metrics and updates the configured percentile.update_media_popularity_constants_view
: this completely recalculates the popularity constant for each provider.update_materialized_popularity_view
: this always runs. It refreshes the materialized popularity view, which updates popularity data for existing records and adds data for records that were ingested since the last refreshdata_refresh
: Triggers the actual data refresh on the ingestion server, and awaits its completion. Read more in Add data refresh to Airflow #397.Only the
wait_for_data_refresh
step (first step of the remote data refresh) is restricted to thedata_refresh
pool with one worker slot. This works to ensure that the remote data refreshes cannot occur concurrently, but the popularity refresh steps preceding it can.There is also an option to pass a
force_refresh_metrics
option in the DagRun config, which can be set totrue
to make the metrics refresh tasks run even if it's not the first of the month. We might use this if a new metric is added/constants need to be recalculated mid-month. We can also set the option tofalse
if for some reason we want to skip this time-consuming step.Testing Instructions
Prereqs for setup
just up
from the api repo)openverse-catalog/.env
:AIRFLOW_CONN_DATA_REFRESH="http://172.17.0.1:8001"
data_refresh
pool through the Airflow UI (Admin
->Pools
). Name itdata_refresh
and give it1
SlotTests
First run of the month
If you have any runs for
audio_data_refresh
from this month, delete them. Then runaudio_data_refresh
(encouraged because it is shorter). Ensure that all tasks, including therefresh_all_popularity_data
tasks, runSubsequent runs
Once it has finished, re-run
audio_data_refresh
and make sure therefresh_all_popularity_data
tasks skip, but all others run successfullyForce refresh_all_popularity_data tasks to run with config
In Airflow, click on the run button and select "Trigger DAG with config". You'll be prompted to provide config data as json.
{"force_refresh_metrics": true}
, and verify that therefresh_all_popularity_data
tasks do not skip even though it is not the first run of the month{"force_refresh_metrics": false}
and verify they skip againMultiple DAGs
The
data_refresh
steps should not run concurrently, but all previous steps can. To test this we can take advantage of how slow the image refresh is compared to audio.wait_for_data_refresh
step. Immediately start the audio refresh. It should complete the popularity calculation steps and get towait_for_data_refresh
before the image refresh is complete; thus itswait_for_data_refresh
should enterup_for_reschedule
, and only succeed when image finishes.Checklist
Update index.md
).main
) or a parent feature branch.Developer Certificate of Origin
Developer Certificate of Origin