Split GISAID profile to "six-month" and "all-time" builds #910
Commits on Apr 30, 2022
-
Split GISAID profile to "six-month" and "all-time" builds
This commit splits the existing regional builds ("global", "africa", etc.) in the "nextstrain-gisaid" profile into "six-month" builds that focus subsampling on the previous six months and "all-time" builds that subsample evenly across time. It uses the new relative dates functionality in "augur filter" and "augur frequencies" to make these subsampling strategies easier to implement and more obvious.

The general subsampling logic is cleaned up in a few ways:

1. North America and Oceania are subsampled and traits reconstructed at the "division" level, while Africa, Asia, Europe and South America are subsampled and traits reconstructed at the "country" level. Previously this behavior had been inconsistent between subsampling, traits, etc.
2. For global builds, all regions are now sampled at equal frequency except for Oceania, which is sampled at 33%. The previous overemphasis on Europe and North America is no longer justified.
3. There is a consistent 4:1 emphasis on recent vs early samples for the "six-month" builds and a consistent 4:1 emphasis on focal vs context for the regional builds.

Frequencies timespans are set to match subsampling ranges.

The description.md footer text is updated to describe this split and to provide a table of links to region x time period combinations.
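The weighting described above can be made concrete with a little arithmetic. The following is a minimal sketch, not the workflow's actual configuration: the function names and the sample budgets are hypothetical, and it assumes that "33%" means Oceania gets a relative weight of 0.33 against a weight of 1 for every other region.

```python
def split_by_ratio(total: int, first: int, second: int) -> tuple[int, int]:
    """Split an integer budget into two parts in the ratio first:second."""
    first_share = total * first // (first + second)
    return first_share, total - first_share


def allocate_by_weight(total: int, weights: dict[str, float]) -> dict[str, int]:
    """Give each key a share of the budget proportional to its weight."""
    weight_sum = sum(weights.values())
    return {name: round(total * w / weight_sum) for name, w in weights.items()}


# Assumed relative weights for a global build: every region equal,
# except Oceania at 0.33 (our reading of "33%").
GLOBAL_WEIGHTS = {
    "Africa": 1.0,
    "Asia": 1.0,
    "Europe": 1.0,
    "North America": 1.0,
    "South America": 1.0,
    "Oceania": 0.33,
}

# Hypothetical budget of 5000 sequences for a "six-month" build,
# split 4:1 between recent and early samples.
recent, early = split_by_ratio(5000, 4, 1)

# Hypothetical budget of 5330 sequences spread across regions.
per_region = allocate_by_weight(5330, GLOBAL_WEIGHTS)
```

The same `split_by_ratio` helper would apply to the 4:1 focal vs context emphasis in the regional builds.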
Commit: 638470f
-
Split Nextstrain open builds to generate "6m" and "all-time" targets
Follow the same logic from Nextstrain GISAID and split Nextstrain open to produce "6m" targets that focus subsampling on the previous 6 months as well as "all-time" targets that subsample evenly since pandemic start. Remove subsampling_ranges.smk as it's no longer referenced.
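To illustrate what "the previous 6 months" means for a given run date, here is a small sketch that computes a cutoff date and labels a sample as recent or early. It is an illustration only: the helper names are hypothetical, and the relative date handling in augur may differ in its details (for example, in how month ends are treated).

```python
import calendar
from datetime import date


def months_before(day: date, months: int) -> date:
    """Return the date `months` calendar months before `day`.

    The day of month is clamped to the last day of the target month,
    so 31 August minus 6 months gives 28 (or 29) February.
    """
    index = day.year * 12 + (day.month - 1) - months
    year, month_index = divmod(index, 12)
    month = month_index + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(day.day, last_day))


def time_bucket(sample_date: date, run_date: date, months: int = 6) -> str:
    """Label a sample 'recent' if it falls within the last `months` months."""
    return "recent" if sample_date >= months_before(run_date, months) else "early"
```

For a run on 2022-04-30, the cutoff would be 2021-10-30, so a sample dated 2021-12-01 counts as recent and one dated 2021-06-01 as early.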
Commit: 4a35ef5
-
Increase compute resources for Nextstrain builds
To compensate for doubling build targets from 7 regional builds to 14, this commit doubles computational resources from 36 CPUs to 72 CPUs. These specific CPU numbers are keyed to AWS EC2 instance sizes: a c5.9xlarge has 36 CPUs, a c5.12xlarge has 48 CPUs, and a c5.18xlarge has 72 CPUs. We should pick one of these and not a number in between.

Finally, this reduces `--set-threads tree` from 16 to 8. There are often close to 7 trees ready to run simultaneously. With 36 CPUs, we'd get situations where 2 trees took up 32 CPUs, leaving 4 idle. With this commit, we'll have 72 CPUs and want to run 14 trees simultaneously. At 8 CPUs per tree, the trees should fit into the available resources better.
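The packing argument can be checked with a few lines of arithmetic. This is a simplified sketch with a hypothetical helper name: it only counts how many tree jobs fit at once and ignores any other rules competing for the same CPUs.

```python
def tree_packing(total_cpus: int, threads_per_tree: int, trees_wanted: int) -> tuple[int, int]:
    """Return (trees running at once, CPUs left idle by tree jobs)."""
    slots = total_cpus // threads_per_tree
    running = min(slots, trees_wanted)
    idle = total_cpus - running * threads_per_tree
    return running, idle


# Before: 36 CPUs, 16 threads per tree, about 7 trees waiting.
before = tree_packing(36, 16, 7)

# After: 72 CPUs, 8 threads per tree, 14 trees waiting.
after = tree_packing(72, 8, 14)

# For comparison: 72 CPUs while keeping 16 threads per tree.
unchanged_threads = tree_packing(72, 16, 14)
```

Under these assumptions the old setup runs 2 trees with 4 CPUs idle, the new one runs 9 trees with none idle, and doubling CPUs without lowering the thread count would run only 4 trees with 8 CPUs idle.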
Commit: 206a97e
-
Fix parsing of build names with complex prefixes
Tells Snakemake what the `prefix` wildcard's literal value is, preventing Snakemake from interpreting part of the build name as the prefix. When Snakemake misinterprets the build name, this causes key errors downstream that are difficult to debug.
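The ambiguity can be reproduced with plain regular expressions, since an unconstrained Snakemake wildcard matches `.+` by default. The sketch below is an illustration, not the workflow's rule: the path layout, the prefix `ncov_gisaid`, and the build name `global_6m` are hypothetical values chosen to show the effect.

```python
import re

# Hypothetical output path: <prefix>_<build_name>.json
path = "auspice/ncov_gisaid_global_6m.json"

# Unconstrained: both wildcards behave like ".+", so the greedy prefix
# swallows part of the build name.
unconstrained = re.fullmatch(
    r"auspice/(?P<prefix>.+)_(?P<build_name>.+)\.json", path
)

# Constrained: the prefix may only match its known literal value,
# which leaves the whole build name intact.
constrained = re.fullmatch(
    r"auspice/(?P<prefix>ncov_gisaid)_(?P<build_name>.+)\.json", path
)

wrong_build = unconstrained.group("build_name")   # "6m"
right_build = constrained.group("build_name")     # "global_6m"
```

Looking up `wrong_build` in a dictionary of configured builds is the kind of step that would raise a key error far from the rule that caused it.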
Commit: e9ffad1
-
Escape regexp metachars when using auspice_json_prefix as a wildcard constraint
Avoids accidentally treating the fixed string as a regex, which could lead to very weird Snakemake DAG issues when matching the "prefix" wildcard.
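A short sketch of why escaping matters when a fixed string is used as a regex constraint. The prefix value `ncov.v2` is hypothetical, chosen only because it contains a metacharacter; `re.escape` is the standard-library call for this, though whether the commit uses exactly that call is an assumption.

```python
import re

# Hypothetical prefix containing a regex metacharacter (".").
prefix = "ncov.v2"

# Used as-is, "." matches any character, so unrelated names also match.
loose_hit = re.fullmatch(prefix, "ncovXv2") is not None

# Escaped, the pattern matches only the literal string.
literal = re.escape(prefix)
strict_hit = re.fullmatch(literal, "ncovXv2") is not None
exact_hit = re.fullmatch(literal, "ncov.v2") is not None
```

Here `loose_hit` is true while `strict_hit` is false, so only the escaped pattern pins the wildcard to the intended value.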
Commit: bd2d766
-
Upload files for 6m builds to the same URLs as the previous builds
Avoids (for now) changes that would break downstream usage, such as external builds or other analyses based on the Open data, and the fetches GISAID makes to re-serve the files themselves. 6m builds are, per @trvrb, "effectively the same files as we're currently providing (with subsampling targeting recent viruses)."¹ In the future, once we work out naming more generally and other APIs, we'll provide files for the all-time builds too.

The build description template is updated to handle the new build names. The templating method changed to make it easier to support dynamic template vars. Since the `build_description` rule runs for _all_ builds, not just our own, it's important that we maintain backwards compat. This will mostly maintain it, except in a slight edge case where `$BUILD` will now be substituted in addition to `${BUILD}`.

The `upload` rule is expected to get less usage outside of our own builds, but I believe it does get some, so it will maintain backwards-compatible behaviour (as long as someone's current build names don't already match our new ones).

¹ #910 (comment)
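The `$BUILD` edge case matches how Python's standard `string.Template` behaves, which treats `$name` and `${name}` as the same placeholder. The sketch below assumes that kind of templating for illustration; the function name and template text are hypothetical, and the commit's actual implementation may differ.

```python
from string import Template


def render_description(template_text: str, build_name: str) -> str:
    """Fill in the BUILD placeholder, leaving unknown placeholders untouched."""
    # safe_substitute does not raise on placeholders it has no value for,
    # which keeps templates written for other builds working.
    return Template(template_text).safe_substitute(BUILD=build_name)


rendered = render_description(
    "Braced: ${BUILD}, bare: $BUILD, other: ${UNKNOWN}", "global_6m"
)
```

Both the braced and the bare form are replaced, while `${UNKNOWN}` passes through unchanged. A template that previously relied on a literal `$BUILD` surviving is the edge case the commit message describes.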
Commit: b172c2a
-
Commit: 7bbc46e