Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generalize Ingest #6

Closed
wants to merge 29 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
29 commits
Select commit Hold shift + click to select a range
95b1718
Ingest: Copy ingest from monkeypox repo
Nov 17, 2022
eefe5c1
Ingest: Ignore ingest cache directories
j23414 Mar 28, 2023
aa8b868
Ingest: Remove Nextclade
Nov 17, 2022
a71cc55
Remove bin/scripts duplication
Nov 17, 2022
05c181b
fix: Use curl for downloading files
j23414 Mar 28, 2023
49b0764
Ingest: Replace monkeypox text and parameters with dengue
Nov 17, 2022
29368b6
Dengue-specific-ingest: Add dengue serotype wildcards
Nov 17, 2022
c2ad70f
Dengue-specific-ingest: Add download filters
Nov 17, 2022
159b928
Add post processing script
Nov 17, 2022
9935631
Replace post processing R script with python
Nov 30, 2022
03c0819
[ingest] Simplify finding strain name
j23414 Apr 24, 2023
931f5ae
zstd compress output files
Nov 17, 2022
205165a
fix: makes the compress rule more generic
j23414 Mar 28, 2023
5d7aa6d
Build: Index by genbank accession instead of duplicate strain names
Feb 23, 2023
7c2da29
fix: remove entries where accession is not found
j23414 Mar 28, 2023
269837e
Ingest: Compromise by duplicating scripts
Mar 25, 2023
cc8731c
Ingest: Replace monkeypox text and parameters with dengue in scripts
j23414 Jun 6, 2023
0a9a2a5
Ingest: Compromise by allowing redundant data pull by serotype
Mar 25, 2023
641c020
[wip] attempt at limiting concurrent deploys
j23414 Mar 29, 2023
0ec11a9
Build: parameterize threads in align rule
j23414 Mar 28, 2023
dec2ec3
docs: Add documentation on running ingest
j23414 Apr 18, 2023
e80947d
fix: wildcards paired with optional.yaml
j23414 Jun 6, 2023
30fbeef
refactor: move post_process_metadata to rule transform
j23414 Jun 9, 2023
d25a3e7
cleanup some unused metadata columns
j23414 Jun 9, 2023
a309e5f
mark temp intermediate files
j23414 Jun 9, 2023
5c6baf4
Switch to augur curate format-dates
j23414 Aug 3, 2023
f82298f
Switch to augur curate titlecase
j23414 Aug 3, 2023
ac34243
Use "accession" column as ID column directly
j23414 Aug 14, 2023
9dbccc4
fixup! Use "accession" column as ID column directly
j23414 Aug 14, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,8 @@ environment*

# Snakemake state dir
/.snakemake
ingest/.snakemake
ingest/logs
Comment on lines 12 to +14
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can consolidate the two .snakemake directories here and we probably want to exclude all logs/ directories as well:

Suggested change
/.snakemake
ingest/.snakemake
ingest/logs
.snakemake/
logs/


# Local config overrides
/config_local.yaml
Expand Down
9 changes: 8 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -54,7 +54,14 @@ This build starts by pulling preprocessed sequence and metadata files from:
* https://data.nextstrain.org/files/dengue/sequences_denv4.fasta.zst
* https://data.nextstrain.org/files/dengue/metadata_denv4.tsv.zst

The above datasets have been preprocessed and cleaned from GenBank and are updated at regular intervals.
The above datasets have been preprocessed and cleaned from GenBank and are updated at regular intervals from the ingest folder.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This section is a little unclear to me. It seems to imply that one should run the code here (nextstrain build ingest) to pull the preprocessed sequence and metadata files. I'd instead replace the code block with a sentence along the lines of "See ./ingest/README.md for how these files are generated."


```
nextstrain build ingest

# Upload final dataset and trigger slack notifications
nextstrain build ingest --configfiles config/config.yaml config/optional.yaml
```

### Using example data

Expand Down
61 changes: 49 additions & 12 deletions Snakefile
Original file line number Diff line number Diff line change
@@ -1,3 +1,5 @@
configfile: "config/config_dengue.yaml"

serotypes = ['all', 'denv1', 'denv2', 'denv3', 'denv4']

rule all:
Expand Down Expand Up @@ -73,14 +75,15 @@ rule decompress:
sequences = "data/sequences_{serotype}.fasta.zst",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As it stands, the data files produced by a local ingest must go up to S3 then be downloaded here to run the actual build. This isn't ideal -- we should be able to run a full build locally without touching S3.

(There's the larger question of whether we should maintain multiple Snakefiles, which would largely solve this, but I'll discuss this elsewhere.)

metadata = "data/metadata_{serotype}.tsv.zst"
output:
sequences = "results/sequences_{serotype}.fasta",
metadata = "results/metadata_{serotype}.tsv"
sequences = "data/sequences_{serotype}.fasta",
metadata = "data/metadata_{serotype}.tsv"
shell:
"""
zstd -d -c {input.sequences} > {output.sequences}
zstd -d -c {input.metadata} > {output.metadata}
"""


rule filter:
"""
Filtering to
Expand All @@ -90,20 +93,22 @@ rule filter:
- excluding strains with missing region, country or date metadata
"""
input:
sequences = "results/sequences_{serotype}.fasta",
metadata = "results/metadata_{serotype}.tsv",
sequences = "data/sequences_{serotype}.fasta",
metadata = "data/metadata_{serotype}.tsv",
exclude = files.dropped_strains
output:
sequences = "results/filtered_{serotype}.fasta"
params:
group_by = "year region",
sequences_per_group = filter_sequences_per_group,
min_length = 5000
min_length = 5000,
strain_id = config.get("strain_id_field", "strain"),
shell:
"""
augur filter \
--sequences {input.sequences} \
--metadata {input.metadata} \
--metadata-id {params.strain_id} \
--exclude {input.exclude} \
--output {output.sequences} \
--group-by {params.group_by} \
Expand All @@ -122,6 +127,8 @@ rule align:
reference = files.reference
output:
alignment = "results/aligned_{serotype}.fasta"
params:
threads = 1
shell:
"""
augur align \
Expand All @@ -130,7 +137,7 @@ rule align:
--output {output.alignment} \
--fill-gaps \
--remove-reference \
--nthreads 1
--nthreads {params.threads}
"""

rule tree:
Expand Down Expand Up @@ -158,20 +165,22 @@ rule refine:
input:
tree = "results/tree-raw_{serotype}.nwk",
alignment = "results/aligned_{serotype}.fasta",
metadata = "results/metadata_{serotype}.tsv"
metadata = "data/metadata_{serotype}.tsv"
output:
tree = "results/tree_{serotype}.nwk",
node_data = "results/branch-lengths_{serotype}.json"
params:
coalescent = "const",
date_inference = "marginal",
clock_filter_iqd = 4
clock_filter_iqd = 4,
strain_id = config.get("strain_id_field", "strain"),
shell:
"""
augur refine \
--tree {input.tree} \
--alignment {input.alignment} \
--metadata {input.metadata} \
--metadata-id {params.strain_id} \
--output-tree {output.tree} \
--output-node-data {output.node_data} \
--timetree \
Expand Down Expand Up @@ -223,17 +232,19 @@ rule traits:
"""
input:
tree = "results/tree_{serotype}.nwk",
metadata = "results/metadata_{serotype}.tsv"
metadata = "data/metadata_{serotype}.tsv"
output:
node_data = "results/traits_{serotype}.json",
params:
columns = traits_columns,
sampling_bias_correction = 3
sampling_bias_correction = 3,
strain_id = config.get("strain_id_field", "strain"),
shell:
"""
augur traits \
--tree {input.tree} \
--metadata {input.metadata} \
--metadata-id {params.strain_id} \
--output {output.node_data} \
--columns {params.columns} \
--confidence \
Expand Down Expand Up @@ -262,26 +273,52 @@ rule export:
"""Exporting data files for for auspice"""
input:
tree = "results/tree_{serotype}.nwk",
metadata = "results/metadata_{serotype}.tsv",
metadata = "data/metadata_{serotype}.tsv",
branch_lengths = "results/branch-lengths_{serotype}.json",
traits = "results/traits_{serotype}.json",
clades = "results/clades_{serotype}.json",
nt_muts = "results/nt-muts_{serotype}.json",
aa_muts = "results/aa-muts_{serotype}.json",
auspice_config = files.auspice_config
output:
auspice_json = "auspice/dengue_{serotype}.json"
auspice_json = "results/raw_dengue_{serotype}.json",
root_sequence = "results/raw_dengue_{serotype}_root-sequence.json",
params:
strain_id = config.get("strain_id_field", "strain"),
shell:
"""
augur export v2 \
--tree {input.tree} \
--metadata {input.metadata} \
--metadata-id {params.strain_id} \
--node-data {input.branch_lengths} {input.traits} {input.clades} {input.nt_muts} {input.aa_muts} \
--auspice-config {input.auspice_config} \
--include-root-sequence \
--output {output.auspice_json}
"""

rule final_strain_name:
input:
auspice_json="results/raw_dengue_{serotype}.json",
metadata="data/metadata_{serotype}.tsv",
root_sequence="results/raw_dengue_{serotype}_root-sequence.json",
output:
auspice_json="auspice/dengue_{serotype}.json",
root_sequence="auspice/dengue_{serotype}_root-sequence.json",
params:
display_strain_field=config.get("display_strain_field", "strain"),
strain_id=config.get("strain_id_field", "strain"),
shell:
"""
python3 bin/set_final_strain_name.py \
--metadata {input.metadata} \
--metadata-id-columns {params.strain_id} \
--input-auspice-json {input.auspice_json} \
--display-strain-name {params.display_strain_field} \
--output {output.auspice_json}
cp {input.root_sequence} {output.root_sequence}
"""

rule clean:
"""Removing directories: {params}"""
params:
Expand Down
38 changes: 38 additions & 0 deletions bin/set_final_strain_name.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,38 @@
import pandas as pd
import json, argparse
from augur.io import read_metadata

def replace_name_recursive(node, lookup):
if node["name"] in lookup:
node["name"] = lookup[node["name"]]

if "children" in node:
for child in node["children"]:
replace_name_recursive(child, lookup)

if __name__=="__main__":
parser = argparse.ArgumentParser(
description="Swaps out the strain names in the Auspice JSON with the final strain name",
formatter_class=argparse.ArgumentDefaultsHelpFormatter
)

parser.add_argument('--input-auspice-json', type=str, required=True, help="input auspice_json")
parser.add_argument('--metadata', type=str, required=True, help="input data")
parser.add_argument('--metadata-id-columns', nargs="+", help="names of possible metadata columns containing identifier information, ordered by priority. Only one ID column will be inferred.")
parser.add_argument('--display-strain-name', type=str, required=True, help="field to use as strain name in auspice")
parser.add_argument('--output', type=str, metavar="JSON", required=True, help="output Auspice JSON")
args = parser.parse_args()

metadata = read_metadata(args.metadata, id_columns=args.metadata_id_columns)
name_lookup = {}
for ri, row in metadata.iterrows():
strain_id = row.name
name_lookup[strain_id] = args.display_strain_name if pd.isna(row[args.display_strain_name]) else row[args.display_strain_name]
j23414 marked this conversation as resolved.
Show resolved Hide resolved

with open(args.input_auspice_json, 'r') as fh:
data = json.load(fh)

replace_name_recursive(data['tree'], name_lookup)

with open(args.output, 'w') as fh:
json.dump(data, fh)
j23414 marked this conversation as resolved.
Show resolved Hide resolved
2 changes: 2 additions & 0 deletions config/config_dengue.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
strain_id_field: "accession"
display_strain_field: "strain"
110 changes: 41 additions & 69 deletions config/dropped_strains.txt
Original file line number Diff line number Diff line change
@@ -1,69 +1,41 @@
DENV/SPAIN/EEB17/2009
DENV1/FRANCE/00475/2008
DENV1/MALAYSIA/P1244/1972
DENV1/VIETNAM/BIDV3990/2008
DENV1/VIETNAM/BIDV992/2006
DENV2/AUSTRALIA/QML22/2015
DENV2/BURKINA_FASO/DAKAR2039/1980
DENV2/BURKINA_FASO/DAKARA2022/1980
DENV2/COTE_D_IVOIRE/DAKAR510/1980
DENV2/COTE_D_IVOIRE/DAKAR578/1980
DENV2/COTE_D_IVOIRE/DAKARA1247/1980
DENV2/GUINEA/PM33974/1981
DENV2/HAITI/DENGUEVIRUS2HOMOSAPIENS1/2016
DENV2/MALAYSIA/DKD811/2008
DENV2/MALAYSIA/P81407/1970
DENV2/MALAYSIA/SAB/2015
DENV2/NIGERIA/IBH11208/1966
DENV2/NIGERIA/IBH11234/1966
DENV2/NIGERIA/IBH11664/1966
DENV2/SENEGAL/0674/1970
DENV2/SENEGAL/DAKAR0761/1974
DENV2/SENEGAL/DAKAR141069/1999
DENV2/SENEGAL/DAKAR141070/1999
DENV2/SENEGAL/DAKARD75505/1999
DENV2/TRINIDAD_AND_TOBAGO/NA/1953
DENV4/MALAYSIA/P215/1975
DENV4/MALAYSIA/P514/1975
DENV4/MALAYSIA/P731120/1973
D2Sab2015 # miscategorized
QML22 # miscategorized
DAK_Ar_A1247 # sylvatic
Dak_Ar_2039 # sylvatic
Dak_Ar_578 # sylvatic
DAK_Ar_510 # sylvatic
PM33974 # sylvatic
Dak_Ar_A2022 # sylvatic
Dak_Ar_141069 # sylvatic
Dak_Ar_141070 # sylvatic
Dak_Ar_D75505 # sylvatic
Dak_HD_10674 # sylvatic
Dak_Ar_D20761 # sylvatic
IBH11664 # sylvatic
IBH11208 # sylvatic
IBH11234 # sylvatic
P8_1407 # sylvatic
P75_514 # sylvatic
P73_1120 # sylvatic
P75_215 # sylvatic
DKD811 # sylvatic
ZS01/01 # metadata issue
Vero # cell line
MS13002673 # too divergent
MS11011405 # too divergent
V43257 # too divergent
KDC0574A2_06/02/2011 # too divergent
00178/03 # too divergent
00759/12 # too divergent
00988/11 # too divergent
01113/10 # too divergent
01224/04 # too divergent
01231/10 # too divergent
01488/09 # too divergent
01542/04 # too divergent
dev1 # too divergent
DKE_121 # too divergent
SENDAK_HD_10674 # sylvatic
DENV2_1_DAK_HD_76395 # sylvatic
DENV3/PUERTORICO/1963/PRS_228762_AC27 # too divergent
PR_6 # too divergent
KY923048 # D2Sab2015 # miscategorized
KX274130 # QML22 # miscategorized
EF105383 # DAK_Ar_A1247 # sylvatic
EF105382 # Dak_Ar_2039 # sylvatic
EF105380 # Dak_Ar_578 # sylvatic
EF105381 # DAK_Ar_510 # sylvatic
EF105378 # PM33974 # sylvatic
EF105386 # Dak_Ar_A2022 # sylvatic
EF105389 # Dak_Ar_141069 # sylvatic
EF105390 # Dak_Ar_141070 # sylvatic
EF457904 # Dak_Ar_D75505 # sylvatic
EF105384 # Dak_HD_10674 # sylvatic
EF105385 # Dak_Ar_D20761 # sylvatic
EF105388 # IBH11664 # sylvatic
EF105387 # IBH11208 # sylvatic
EU003591 # IBH11234 # sylvatic
EF105379 # P8_1407 # sylvatic
JF262779 # P75_514 # sylvatic
JF262780 # P73_1120 # sylvatic
EF457906 # P75_215 # sylvatic
FJ467493 # DKD811 # sylvatic
EF051521 # ZS01/01 # metadata issue
MT929160 # Vero # cell line
MH048676 # MS13002673 # too divergent
MH048674 # MS11011405 # too divergent
MT597439 # V43257 # too divergent
MN448607 # KDC0574A2_06/02/2011 # too divergent
ON046268 # 00178/03 # too divergent
ON046278 # 00759/12 # too divergent
ON046276 # 00988/11 # too divergent
ON046273 # 01113/10 # too divergent
ON046270 # 01224/04 # too divergent
ON046274 # 01231/10 # too divergent
ON046272 # 01488/09 # too divergent
ON046271 # 01542/04 # too divergent
MZ284953 # dev1 # too divergent
MZ215848 # DKE_121 # too divergent
MW946564 # SENDAK_HD_10674 # sylvatic
OK605757 # DENV2_1_DAK_HD_76395 # sylvatic
MW945427 # DENV3/PUERTORICO/1963/PRS_228762_AC27 # too divergent
OM258630 # PR_6 # too divergent
Binary file modified example_data/sequences_all.fasta.zst
Binary file not shown.
Binary file modified example_data/sequences_denv1.fasta.zst
Binary file not shown.
Binary file modified example_data/sequences_denv2.fasta.zst
Binary file not shown.
Binary file modified example_data/sequences_denv3.fasta.zst
Binary file not shown.
Binary file modified example_data/sequences_denv4.fasta.zst
Binary file not shown.
Loading