Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow numbers in build names #524

Merged
merged 7 commits into from
Apr 12, 2022
Merged

Allow numbers in build names #524

merged 7 commits into from
Apr 12, 2022

Conversation

huddlej
Copy link
Contributor

@huddlej huddlej commented Dec 22, 2020

This is a second attempt to add support for numbers in build names after #522 unexpectedly broke the Nextstrain team's builds. This PR reduces the constraints on build names for most builds and adds rule-specific wildcard constraints to Nextstrain-dated builds. These constraints prevent dates from being in Nextstrain build names so the standard Nextstrain builds can be saved as dated copies. To prevent ambiguity between the dated JSONs rule and the main workflow's finalize rule, this commit also specifies a preferred rule order for the Nextstrain builds.

(Updated on November 24, 2021)
I tested this new code with the following build config (builds.yaml):

inputs:
  - name: example-data
    metadata: data/example_metadata.tsv
    sequences: data/example_sequences.fasta.gz

default_build_name: my-build-2021-11-24

And then the following command:

snakemake --configfile builds.yaml -p -j 1

I also confirmed that the Nextstrain profile's dryrun worked as expected:

snakemake --profile nextstrain_profiles/nextstrain-gisaid all_regions -np

These changes should allow people to name their builds nearly anything they like including dates (e.g., global-2020-12-22) without breaking the existing Nextstrain naming scheme for dated builds.

Resolves #482

@rneher
Copy link
Member

rneher commented Jan 10, 2021

@huddlej -- I still get this error here:

\:> snakemake --profile nextstrain_profiles/nextstrain all_regions -n
Building DAG of jobs...
InputFunctionException in line 963 of /home/richard/Projects/nextstrain/ncov/workflow/snakemake_rules/main_workflow.smk:
KeyError: 'global_2021-01-10'
Wildcards:
build_name=global_2021-01-10

@huddlej huddlej self-assigned this Jan 14, 2021
@jacaravas
Copy link

Is there a reason this hasn't been merged yet? Or did it just fall through the cracks? I have had no issues using the code
build_name = r'(?:[_a-zA-Z0-9-](?!(tip-frequencies)))+',
in the Snakefile "wildcard_constraints" for the last several months.

@emmahodcroft
Copy link
Member

I think this works fine for most people's builds, however, in the 'official' Nextstrain builds we also generate dated files - I think this is where the error comes in. (Unless something changed since Jan)

huddlej added a commit that referenced this pull request Nov 25, 2021
Fixes multiple input function exceptions that occur when running the
Nextstrain profiles with older versions of Snakemake (pre-5.20.0). Since
Nextstrain builds produce dated JSONs (e.g., with "YYYY-MM-DD.json"
suffixes), the change to allow numbers in build names makes these dated
JSONs ambiguous to create from Snakemake's perspective.

Despite the earlier inclusion of rule orders to prioritize the
"dated_json" rule over the now-ambiguous "finalize" rule, Snakemake
still needs to walk through the graph of possible ways to create the
dated JSON files before it accounts for the rule order. If any
exceptions occur during this graph walk, pre-5.20.0 Snakemake versions
would exit with an "InputFunctionException". As of 5.20.0, Snakemake
catches these exceptions internally and keeps searching the graph for
valid paths [1]. Because Snakemake initially interprets the value of the
"build_name" wildcard to be something like "europe_2021-11-24" instead
of "europe", all dictionary lookups in the workflow that expect "europe"
use the dated name instead and raise key errors. The solution in this
commit is to replace hardcoded key lookups with `get` method calls and a
default empty dictionary.

Although we should really update the Nextstrain Docker base image to use
the latest Snakemake (6.10.0 from Oct 2021) instead of 5.10.0 (from Jan
2020), this commit fixes the input function exceptions that were
breaking the CI build in the Docker image and likely the cause of the
error that @rneher mentioned in the original PR comments [2].

[1] snakemake/snakemake#479
[2] #524 (comment)
@huddlej huddlej requested review from rneher and removed request for trvrb November 25, 2021 06:48
@victorlin victorlin added the source: office hours Issue mentioned during office hours label Mar 10, 2022
@trvrb
Copy link
Member

trvrb commented Apr 9, 2022

I was just testing out a ncov/gisaid/global/6m build and ran into this issue. I'd suggest bumping priority of addressing this.

Copy link
Member

@jameshadfield jameshadfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ce0a763 is some good detective work @huddlej!

I've tested locally using snakemake 7.3.7 and can build a DAG without issue for the core nextstrain builds, however when changing the example profile to use a numeric build name I get an error:

## my_profiles/example/builds.yaml changed to include
builds:
    north-america_usa_washington_king-county_123: # name of the build; this can be anything
    
# error
Failed validating 'pattern' in schema['properties']['builds']['propertyNames']:
    OrderedDict([('pattern', '^(?:[_a-zA-Z-](?!(tip-frequencies)))+$')])

Updating workflow/schemas/config.schema.yaml lines 39-46 to be the following allowed things to work as expected:

  builds:
    type: object
    minProperties: 1
    propertyNames:
      # Allow build names to contain alpha characters, underscores, and hyphens
      # but not special strings used for Nextstrain builds.  Also used in the
      # workflow's wildcard_constraints.
      pattern: "^(?:[a-zA-Z0-9-_](?!(tip-frequencies)))+$"

Reduces the constraints on build names for most builds and adds specific
constraints to Nextstrain-dated builds. These constraints prevent dates
from being in Nextstrain build names so the standard Nextstrain builds
can be saved as dated copies. To prevent ambiguity between the dated
JSONs rule and the main workflow's finalize rule, this commit also
specifies a preferred rule order for the Nextstrain builds.
Fixes multiple input function exceptions that occur when running the
Nextstrain profiles with older versions of Snakemake (pre-5.20.0). Since
Nextstrain builds produce dated JSONs (e.g., with "YYYY-MM-DD.json"
suffixes), the change to allow numbers in build names makes these dated
JSONs ambiguous to create from Snakemake's perspective.

Despite the earlier inclusion of rule orders to prioritize the
"dated_json" rule over the now-ambiguous "finalize" rule, Snakemake
still needs to walk through the graph of possible ways to create the
dated JSON files before it accounts for the rule order. If any
exceptions occur during this graph walk, pre-5.20.0 Snakemake versions
would exit with an "InputFunctionException". As of 5.20.0, Snakemake
catches these exceptions internally and keeps searching the graph for
valid paths [1]. Because Snakemake initially interprets the value of the
"build_name" wildcard to be something like "europe_2021-11-24" instead
of "europe", all dictionary lookups in the workflow that expect "europe"
use the dated name instead and raise key errors. The solution in this
commit is to replace hardcoded key lookups with `get` method calls and a
default empty dictionary.

Although we should really update the Nextstrain Docker base image to use
the latest Snakemake (6.10.0 from Oct 2021) instead of 5.10.0 (from Jan
2020), this commit fixes the input function exceptions that were
breaking the CI build in the Docker image and likely the cause of the
error that @rneher mentioned in the original PR comments [2].

[1] snakemake/snakemake#479
[2] #524 (comment)
Updates the config schema to reflect the regular expressions used in
wildcard constraints.
Always place the hyphen at the end of the character set in our regular
expressions for build names and origins, to ensure that hyphens always
get treated as literals and not as part of a range.
@huddlej
Copy link
Contributor Author

huddlej commented Apr 12, 2022

Thank you for testing this out, @jameshadfield! I've rebased onto master (to fix conflicts in the main workflow) and pushed commits to fix the schema validation and update the docs. I confirmed I could build DAGs for Nextstrain GISAID and open builds and have queued up a trial run for open builds.

I also modified the getting_started profile's builds.yaml to name the build global_6m and confirmed that this built as expected with the both CLI/Docker and Snakemake/Conda. If the open builds work as expected, I'll plan to merge this.

workflow/snakemake_rules/export_for_nextstrain.smk Outdated Show resolved Hide resolved
Snakefile Outdated Show resolved Hide resolved
Use leading hyphen in regex character sets to follow the convention of
literal hyphens in either first or last position of a set and to support
easier appending of additional characters to the end of the set without
accidentally implementing a range with a trailing hyphen. Also, use a
compressed syntax for dates.
Also, fix the ordering of the recent features such that the latest
feature is always at the top.
@huddlej huddlej merged commit f317e90 into master Apr 12, 2022
@huddlej huddlej deleted the allow-numeric-build-names branch April 12, 2022 22:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
source: office hours Issue mentioned during office hours
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

Build name wildcard does not support numbers
8 participants