-
Notifications
You must be signed in to change notification settings - Fork 403
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow numbers in build names #524
Conversation
@huddlej -- I still get this error here:
|
Is there a reason this hasn't been merged yet? Or did it just fall through the cracks? I have had no issues using the code |
I think this works fine for most people's builds, however, in the 'official' Nextstrain builds we also generate dated files - I think this is where the error comes in. (Unless something changed since Jan) |
7cbf6ca
to
3cbddd0
Compare
Fixes multiple input function exceptions that occur when running the Nextstrain profiles with older versions of Snakemake (pre-5.20.0). Since Nextstrain builds produce dated JSONs (e.g., with "YYYY-MM-DD.json" suffixes), the change to allow numbers in build names makes these dated JSONs ambiguous to create from Snakemake's perspective. Despite the earlier inclusion of rule orders to prioritize the "dated_json" rule over the now-ambiguous "finalize" rule, Snakemake still needs to walk through the graph of possible ways to create the dated JSON files before it accounts for the rule order. If any exceptions occur during this graph walk, pre-5.20.0 Snakemake versions would exit with an "InputFunctionException". As of 5.20.0, Snakemake catches these exceptions internally and keeps searching the graph for valid paths [1]. Because Snakemake initially interprets the value of the "build_name" wildcard to be something like "europe_2021-11-24" instead of "europe", all dictionary lookups in the workflow that expect "europe" use the dated name instead and raise key errors. The solution in this commit is to replace hardcoded key lookups with `get` method calls and a default empty dictionary. Although we should really update the Nextstrain Docker base image to use the latest Snakemake (6.10.0 from Oct 2021) instead of 5.10.0 (from Jan 2020), this commit fixes the input function exceptions that were breaking the CI build in the Docker image and likely the cause of the error that @rneher mentioned in the original PR comments [2]. [1] snakemake/snakemake#479 [2] #524 (comment)
I was just testing out a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ce0a763 is some good detective work @huddlej!
I've tested locally using snakemake 7.3.7 and can build a DAG without issue for the core nextstrain builds, however when changing the example profile to use a numeric build name I get an error:
## my_profiles/example/builds.yaml changed to include
builds:
north-america_usa_washington_king-county_123: # name of the build; this can be anything
# error
Failed validating 'pattern' in schema['properties']['builds']['propertyNames']:
OrderedDict([('pattern', '^(?:[_a-zA-Z-](?!(tip-frequencies)))+$')])
Updating workflow/schemas/config.schema.yaml
lines 39-46 to be the following allowed things to work as expected:
builds:
type: object
minProperties: 1
propertyNames:
# Allow build names to contain alpha characters, underscores, and hyphens
# but not special strings used for Nextstrain builds. Also used in the
# workflow's wildcard_constraints.
pattern: "^(?:[a-zA-Z0-9-_](?!(tip-frequencies)))+$"
Reduces the constraints on build names for most builds and adds specific constraints to Nextstrain-dated builds. These constraints prevent dates from being in Nextstrain build names so the standard Nextstrain builds can be saved as dated copies. To prevent ambiguity between the dated JSONs rule and the main workflow's finalize rule, this commit also specifies a preferred rule order for the Nextstrain builds.
Fixes multiple input function exceptions that occur when running the Nextstrain profiles with older versions of Snakemake (pre-5.20.0). Since Nextstrain builds produce dated JSONs (e.g., with "YYYY-MM-DD.json" suffixes), the change to allow numbers in build names makes these dated JSONs ambiguous to create from Snakemake's perspective. Despite the earlier inclusion of rule orders to prioritize the "dated_json" rule over the now-ambiguous "finalize" rule, Snakemake still needs to walk through the graph of possible ways to create the dated JSON files before it accounts for the rule order. If any exceptions occur during this graph walk, pre-5.20.0 Snakemake versions would exit with an "InputFunctionException". As of 5.20.0, Snakemake catches these exceptions internally and keeps searching the graph for valid paths [1]. Because Snakemake initially interprets the value of the "build_name" wildcard to be something like "europe_2021-11-24" instead of "europe", all dictionary lookups in the workflow that expect "europe" use the dated name instead and raise key errors. The solution in this commit is to replace hardcoded key lookups with `get` method calls and a default empty dictionary. Although we should really update the Nextstrain Docker base image to use the latest Snakemake (6.10.0 from Oct 2021) instead of 5.10.0 (from Jan 2020), this commit fixes the input function exceptions that were breaking the CI build in the Docker image and likely the cause of the error that @rneher mentioned in the original PR comments [2]. [1] snakemake/snakemake#479 [2] #524 (comment)
Updates the config schema to reflect the regular expressions used in wildcard constraints.
Always place the hyphen at the end of the character set in our regular expressions for build names and origins, to ensure that hyphens always get treated as literals and not as part of a range.
ce0a763
to
6721486
Compare
Thank you for testing this out, @jameshadfield! I've rebased onto master (to fix conflicts in the main workflow) and pushed commits to fix the schema validation and update the docs. I confirmed I could build DAGs for Nextstrain GISAID and open builds and have queued up a trial run for open builds. I also modified the |
Use leading hyphen in regex character sets to follow the convention of literal hyphens in either first or last position of a set and to support easier appending of additional characters to the end of the set without accidentally implementing a range with a trailing hyphen. Also, use a compressed syntax for dates.
Also, fix the ordering of the recent features such that the latest feature is always at the top.
This is a second attempt to add support for numbers in build names after #522 unexpectedly broke the Nextstrain team's builds. This PR reduces the constraints on build names for most builds and adds rule-specific wildcard constraints to Nextstrain-dated builds. These constraints prevent dates from being in Nextstrain build names so the standard Nextstrain builds can be saved as dated copies. To prevent ambiguity between the dated JSONs rule and the main workflow's finalize rule, this commit also specifies a preferred rule order for the Nextstrain builds.
(Updated on November 24, 2021)
I tested this new code with the following build config (
builds.yaml
):And then the following command:
I also confirmed that the Nextstrain profile's dryrun worked as expected:
These changes should allow people to name their builds nearly anything they like including dates (e.g.,
global-2020-12-22
) without breaking the existing Nextstrain naming scheme for dated builds.Resolves #482