Split GISAID profile to "six-month" and "all-time" builds #910

Merged 7 commits on Apr 30, 2022

Commits on Apr 30, 2022

  1. Split GISAID profile to "six-month" and "all-time" builds

    This commit splits the existing regional builds ("global", "africa", etc.) in the "nextstrain-gisaid" profile into "six-month" builds that focus subsampling on the previous six months and "all-time" builds that subsample evenly across time. It uses the new relative dates functionality in "augur filter" and "augur frequencies" to make these subsampling strategies easier to implement and more obvious.
    
    The general subsampling logic is cleaned up in a few ways:
    1. North America and Oceania are subsampled and traits reconstructed at the "division" level, while Africa, Asia, Europe and South America are subsampled and traits reconstructed at the "country" level. Previously this behavior had been inconsistent between subsampling, traits, etc.
    2. For global builds, all regions are now sampled at equal frequency except for Oceania, which is sampled at 33%. The previous overemphasis on Europe and North America is no longer justified.
    3. There is a consistent 4:1 emphasis on recent vs early samples for the "six-month" builds and a consistent 4:1 emphasis on focal vs context for the regional builds.
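
The weighting rules in the list above can be sketched as simple arithmetic. This is an illustrative sketch only, not the workflow's actual configuration: the region keys, the reading of "33%" as a relative weight of 0.33, and both function names are assumptions made for the example.

```python
from typing import Dict, Tuple

# Hypothetical relative weights: every region equal, Oceania at 0.33.
REGION_WEIGHTS: Dict[str, float] = {
    "africa": 1.0,
    "asia": 1.0,
    "europe": 1.0,
    "north-america": 1.0,
    "oceania": 0.33,
    "south-america": 1.0,
}


def allocate_by_region(total: int, weights: Dict[str, float] = REGION_WEIGHTS) -> Dict[str, int]:
    """Split a total sample budget across regions in proportion to their weights."""
    weight_sum = sum(weights.values())
    return {region: round(total * w / weight_sum) for region, w in weights.items()}


def split_four_to_one(total: int) -> Tuple[int, int]:
    """Split a budget 4:1, e.g. recent vs early or focal vs context."""
    major = total * 4 // 5
    return major, total - major


if __name__ == "__main__":
    print(allocate_by_region(5330))
    print(split_four_to_one(5000))
```

With a budget of 5000 sequences, the 4:1 split gives 4000 to the emphasized group and 1000 to the other.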
    
    Frequency timespans are set to match subsampling ranges.
    
    The description.md footer text is updated to describe this split and to provide a table of links to region x time period combinations.
    trvrb committed Apr 30, 2022
    638470f
  2. Split Nextstrain open builds to generate "6m" and "all-time" targets

    Follow the same logic from Nextstrain GISAID and split Nextstrain open to produce "6m" targets that focus subsampling on the previous 6 months as well as "all-time" targets that subsample evenly since pandemic start.
    
    Remove subsampling_ranges.smk as it's no longer referenced.
    trvrb committed Apr 30, 2022
    4a35ef5
  3. Increase compute resources for Nextstrain builds

    To compensate for doubling build targets from 7 regional builds to 14 regional builds, this commit doubles computational resources from 36 CPUs to 72 CPUs.
    
    These specific CPU numbers are keyed to AWS EC2 instance sizes. A c5.9xlarge is 36 CPUs, a c5.12xlarge is 48 CPUs and a c5.18xlarge is 72 CPUs. We should be picking one of these and not a number in between.
    
    Finally, this reduces `--set-threads tree` from 16 to 8. There are often close to 7 trees ready to run simultaneously. With 36 CPUs, we'd get situations where 2 trees took up 32 CPUs, leaving 4 idle.
    
    With this commit, we'll have 72 CPUs and want to simultaneously run 14 trees. If each tree uses 8 CPUs, the jobs should fit the available resources better.
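
The packing argument in this commit message can be checked with a small calculation. This is a rough sketch that assumes tree jobs are the only jobs competing for CPUs, which is a simplification; the function name is hypothetical, and the instance sizes are the ones quoted in the message.

```python
from typing import Tuple

# CPU counts for the EC2 instance sizes named in the commit message.
EC2_C5_CPUS = {"c5.9xlarge": 36, "c5.12xlarge": 48, "c5.18xlarge": 72}


def concurrent_trees(total_cpus: int, threads_per_tree: int, trees_wanted: int) -> Tuple[int, int]:
    """Return (trees running at once, idle CPUs) for a given thread setting."""
    running = min(trees_wanted, total_cpus // threads_per_tree)
    idle = total_cpus - running * threads_per_tree
    return running, idle


if __name__ == "__main__":
    # Old setup: 36 CPUs, 16 threads per tree, about 7 trees waiting.
    print(concurrent_trees(EC2_C5_CPUS["c5.9xlarge"], 16, 7))   # (2, 4)
    # New setup: 72 CPUs, 8 threads per tree, 14 trees waiting.
    print(concurrent_trees(EC2_C5_CPUS["c5.18xlarge"], 8, 14))  # (9, 0)
```

Under these assumptions the old setting runs 2 trees with 4 CPUs idle, while the new setting runs 9 trees with none idle.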
    trvrb committed Apr 30, 2022
    206a97e
  4. Fix parsing of build names with complex prefixes

    Tells Snakemake what the `prefix` wildcard's literal value is,
    preventing Snakemake from interpreting part of the build name as the
    prefix. When Snakemake misinterprets the build name, this causes key
    errors downstream that are difficult to debug.
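
The ambiguity described here can be reproduced outside Snakemake. The sketch below uses Python's `re` module as a stand-in for wildcard matching; the file name, the `{prefix}_{build_name}` layout, and the prefix value `ncov_gisaid` are hypothetical and are not taken from the workflow.

```python
import re

# Hypothetical output path of the form auspice/{prefix}_{build_name}.json
path = "auspice/ncov_gisaid_global_6m.json"

# Unconstrained: both wildcards behave like ".+", so the greedy prefix
# swallows part of the build name.
loose = re.fullmatch(r"auspice/(?P<prefix>.+)_(?P<build_name>.+)\.json", path)

# Constrained: the prefix is pinned to its literal value, so the rest of
# the name is parsed as the build name.
strict = re.fullmatch(r"auspice/(?P<prefix>ncov_gisaid)_(?P<build_name>.+)\.json", path)

if __name__ == "__main__":
    print(loose.group("prefix"), loose.group("build_name"))
    print(strict.group("prefix"), strict.group("build_name"))
```

In the unconstrained case the build name comes out as `6m`, which is the kind of misparse that would surface later as a key error when the build is looked up in the config.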
    huddlej authored and trvrb committed Apr 30, 2022
    e9ffad1
  5. Escape regexp metachars when using auspice_json_prefix as a wildcard constraint
    
    Avoids accidentally treating the fixed string as a regex which could
    lead to very weird Snakemake DAG issues when matching the "prefix"
    wildcard.
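
A minimal illustration of why the fixed string needs escaping, using Python's `re.escape`. The prefix value `ncov.open` is made up for the example, and whether the workflow uses `re.escape` specifically is an assumption.

```python
import re

# Hypothetical prefix containing a regex metacharacter (".").
prefix = "ncov.open"

# Used directly as a pattern, "." matches any character, so an
# unrelated string is accepted as the prefix.
unescaped_hit = re.fullmatch(prefix, "ncovXopen") is not None

# Escaped, the pattern only matches the literal prefix.
escaped = re.escape(prefix)
escaped_hit = re.fullmatch(escaped, "ncovXopen") is not None
literal_hit = re.fullmatch(escaped, "ncov.open") is not None

if __name__ == "__main__":
    print(unescaped_hit, escaped_hit, literal_hit)
```

The unescaped pattern accepts `ncovXopen`; the escaped pattern rejects it and still accepts the literal prefix.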
    tsibley authored and trvrb committed Apr 30, 2022
    bd2d766
  6. Upload files for 6m builds to the same URLs as the previous builds

    Avoids (for now) changes that would break downstream usage like external
    builds or other analyses based on the Open data and the fetches GISAID
    makes to re-serve the files themselves.
    
    6m builds are, per @trvrb, "effectively the same files as we're
    currently providing (with subsampling targeting recent viruses)."¹  In
    the future, once we work out naming more generally and other APIs, we'll
    provide files for the all-time builds too.
    
    The build description template is updated to handle the new build names.
    The templating method changed to make it easier to support dynamic
    template vars.  Since the `build_description` rule runs for _all_ builds,
    not just our own, it's important that we maintain backwards compat.
    This will mostly maintain it except in a slight edge case where
    `$BUILD` will now be substituted in addition to `${BUILD}`.
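
The edge case can be illustrated with Python's `string.Template`, which treats `$BUILD` and `${BUILD}` as the same placeholder. Whether the workflow's new templating method is `string.Template` is an assumption; the template text and build name below are invented for the example.

```python
from string import Template

# Hypothetical description text using both placeholder spellings,
# plus a variable that is not supplied.
text = "Braced: ${BUILD}. Bare: $BUILD. Unknown: $OTHER."

# safe_substitute fills known variables and leaves unknown ones untouched.
rendered = Template(text).safe_substitute(BUILD="global_6m")

if __name__ == "__main__":
    print(rendered)
```

A description file that previously contained a bare `$BUILD` as literal text would now have it replaced, which is the backwards-compatibility caveat noted above.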
    
    The `upload` rule is expected to get less usage outside of our own
    builds, but I believe it does get some so it will maintain backwards
    compat behaviour (as long as someone's current build names don't already
    match our new ones).
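
One way to picture the upload behaviour is a name mapping that sends 6m builds to the previous names and leaves everything else alone. This is a sketch under assumptions: the `_6m` suffix, the example build names, and the function name are hypothetical, not the rule's actual logic.

```python
def upload_basename(build_name: str, suffix: str = "_6m") -> str:
    """Map a hypothetical 6m build name to the name used before the split.

    Names without the suffix are returned unchanged, so existing
    external builds keep their current upload locations.
    """
    if build_name.endswith(suffix):
        return build_name[: -len(suffix)]
    return build_name


if __name__ == "__main__":
    for name in ("global_6m", "global_all-time", "my-custom-build"):
        print(name, "->", upload_basename(name))
```

An external build whose name already ends in the new suffix would be remapped as well, which matches the caveat in the parenthetical above.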
    
    ¹ #910 (comment)
    tsibley authored and trvrb committed Apr 30, 2022
    b172c2a
  7. Update change log

    trvrb committed Apr 30, 2022
    7bbc46e