Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use a config file to define hardcoded parameters #9

Merged
merged 3 commits into from
Jan 17, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions CHANGES.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
# CHANGELOG
* 11 January 2024: Use a config file to define hardcoded parameters and file paths, add a change log. [PR #9](https://github.com/nextstrain/measles/pull/9)
38 changes: 15 additions & 23 deletions Snakefile
Original file line number Diff line number Diff line change
@@ -1,17 +1,9 @@
configfile: "config/config.yaml"

rule all:
input:
auspice_json = "auspice/measles.json",

rule files:
params:
input_fasta = "data/measles.fasta",
dropped_strains = "config/dropped_strains.txt",
reference = "config/measles_reference.gb",
colors = "config/colors.tsv",
auspice_config = "config/auspice_config.json"

files = rules.files.params

rule download:
"""Downloading sequences and metadata from data.nextstrain.org"""
output:
Expand Down Expand Up @@ -51,14 +43,14 @@ rule filter:
input:
sequences = "data/sequences.fasta",
metadata = "data/metadata.tsv",
exclude = files.dropped_strains
exclude = config["files"]["exclude"]
output:
sequences = "results/filtered.fasta"
params:
group_by = "country year month",
sequences_per_group = 20,
min_date = 1950,
min_length = 5000
group_by = config["filter"]["group_by"],
sequences_per_group = config["filter"]["sequences_per_group"],
min_date = config["filter"]["min_date"],
min_length = config["filter"]["min_length"]
shell:
"""
augur filter \
Expand All @@ -79,7 +71,7 @@ rule align:
"""
input:
sequences = "results/filtered.fasta",
reference = files.reference
reference = config["files"]["reference"]
output:
alignment = "results/aligned.fasta"
shell:
Expand Down Expand Up @@ -121,9 +113,9 @@ rule refine:
tree = "results/tree.nwk",
node_data = "results/branch_lengths.json"
params:
coalescent = "opt",
date_inference = "marginal",
clock_filter_iqd = 4
coalescent = config["refine"]["coalescent"],
date_inference = config["refine"]["date_inference"],
clock_filter_iqd = config["refine"]["clock_filter_iqd"]
shell:
"""
augur refine \
Expand All @@ -147,7 +139,7 @@ rule ancestral:
output:
node_data = "results/nt_muts.json"
params:
inference = "joint"
inference = config["ancestral"]["inference"]
shell:
"""
augur ancestral \
Expand All @@ -162,7 +154,7 @@ rule translate:
input:
tree = "results/tree.nwk",
node_data = "results/nt_muts.json",
reference = files.reference
reference = config["files"]["reference"]
output:
node_data = "results/aa_muts.json"
shell:
Expand All @@ -182,8 +174,8 @@ rule export:
branch_lengths = "results/branch_lengths.json",
nt_muts = "results/nt_muts.json",
aa_muts = "results/aa_muts.json",
colors = files.colors,
auspice_config = files.auspice_config
colors = config["files"]["colors"],
auspice_config = config["files"]["auspice_config"]
output:
auspice_json = rules.all.input.auspice_json
shell:
Expand Down
16 changes: 16 additions & 0 deletions config/config.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,16 @@
files:
exclude: "config/dropped_strains.txt"
reference: "config/measles_reference.gb"
colors: "config/colors.tsv"
auspice_config: "config/auspice_config.json"
Comment on lines +1 to +5
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

non-blocking

I would follow the pattern we use for the ncov workflow config files where we define a top-level "files" key in the YAML such that most of the entries in the files rule map like this:

@huddlej, can you share some details on the reason ncov was originally set up like this?
Specifically, why group under the top level files key instead of adding the files to specific param groups like so?

filter:
    exclude: "config/dropped_strains.txt"
    ...
align:
    reference: "config/measles_reference.gb"
export:
    colors: "config/colors.tsv"
    auspice_config: "config/auspice_config.json"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@joverlee521 That's a good question! The ncov pattern came from the same impulse we felt in this PR to replace a files rule in the Snakefile with a YAML config. I think I wanted to emphasize the placement of the configuration details instead of suggesting their reorganization. (My memories from this time period are hazy though!)

The original reason for grouping all the config file paths into one section of the Snakefile must date back farther, but I suspect the main reason for that structure is that the same files can be needed by multiple rules. For example, we might need the reference file for align, ancestral, and translate. In the flu workflow, we need to reference the list of strains from include in both augur filter rules and a later flag outliers rule. Even files that seem less ambiguously connected to a specific rule like the Auspice config JSON is to export, we also use the color scales from those JSONs in seasonal flu across multiple rules.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Thanks for the context @huddlej!

Grouping files together makes sense if expected to be used multiple times. I also see the other side of it being confusing to need to edit configs for the filter rule under both files and filter. It's also not be immediately obvious looking at the config what rules use which files (unless you have detailed docs like ncov).

Currently, I see configs using a mix of top level files key, top level file name keys, and rule specific file name keys. We should come to some consensus on this schema as a group...I'll write up an issue on this in pathogen-repo-template.

Copy link
Member

@jameshadfield jameshadfield Jan 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's also not be immediately obvious looking at the config what rules use which files (unless you have detailed docs like ncov).

I have spent plenty of time following the strings to work out which rules use which files (mainly in ncov), and often finding a bunch of files no longer used by any rule. Docs are great, but from experience they inevitably fall out of sync. How about something like the following (or a variant thereof)?

files:
  exclude: &files_exclude 'config/exclude_accessions.txt'
filter:
  exclude: *files_exclude

P.S. Thanks @joverlee521 for the links to examples in your comment, it makes a big difference ✨

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can continue discussion in nextstrain/pathogen-repo-guide#26

filter:
group_by: "country year month"
sequences_per_group: 20
min_date: 1950
min_length: 5000
refine:
coalescent: "opt"
date_inference: "marginal"
clock_filter_iqd: 4
ancestral:
inference: "joint"