-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Generalize Ingest #6
Changes from all commits
95b1718
eefe5c1
aa8b868
a71cc55
05c181b
49b0764
29368b6
c2ad70f
159b928
9935631
03c0819
931f5ae
205165a
5d7aa6d
7c2da29
269837e
cc8731c
0a9a2a5
641c020
0ec11a9
dec2ec3
e80947d
30fbeef
d25a3e7
a309e5f
5c6baf4
f82298f
ac34243
9dbccc4
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -54,7 +54,14 @@ This build starts by pulling preprocessed sequence and metadata files from: | |
* https://data.nextstrain.org/files/dengue/sequences_denv4.fasta.zst | ||
* https://data.nextstrain.org/files/dengue/metadata_denv4.tsv.zst | ||
|
||
The above datasets have been preprocessed and cleaned from GenBank and are updated at regular intervals. | ||
The above datasets have been preprocessed and cleaned from GenBank and are updated at regular intervals from the ingest folder. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This section is a little unclear to me. It seems to imply that one should run the code here ( |
||
|
||
``` | ||
nextstrain build ingest | ||
|
||
# Upload final dataset and trigger slack notifications | ||
nextstrain build ingest --configfiles config/config.yaml config/optional.yaml | ||
``` | ||
|
||
### Using example data | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,3 +1,5 @@ | ||
configfile: "config/config_dengue.yaml" | ||
|
||
serotypes = ['all', 'denv1', 'denv2', 'denv3', 'denv4'] | ||
|
||
rule all: | ||
|
@@ -73,14 +75,15 @@ rule decompress: | |
sequences = "data/sequences_{serotype}.fasta.zst", | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. As it stands, the data files produced by a local ingest must go up to S3 then be downloaded here to run the actual build. This isn't ideal -- we should be able to run a full build locally without touching S3. (There's the larger question of whether we should maintain multiple Snakefiles, which would largely solve this, but I'll discuss this elsewhere.) |
||
metadata = "data/metadata_{serotype}.tsv.zst" | ||
output: | ||
sequences = "results/sequences_{serotype}.fasta", | ||
metadata = "results/metadata_{serotype}.tsv" | ||
sequences = "data/sequences_{serotype}.fasta", | ||
metadata = "data/metadata_{serotype}.tsv" | ||
shell: | ||
""" | ||
zstd -d -c {input.sequences} > {output.sequences} | ||
zstd -d -c {input.metadata} > {output.metadata} | ||
""" | ||
|
||
|
||
rule filter: | ||
""" | ||
Filtering to | ||
|
@@ -90,20 +93,22 @@ rule filter: | |
- excluding strains with missing region, country or date metadata | ||
""" | ||
input: | ||
sequences = "results/sequences_{serotype}.fasta", | ||
metadata = "results/metadata_{serotype}.tsv", | ||
sequences = "data/sequences_{serotype}.fasta", | ||
metadata = "data/metadata_{serotype}.tsv", | ||
exclude = files.dropped_strains | ||
output: | ||
sequences = "results/filtered_{serotype}.fasta" | ||
params: | ||
group_by = "year region", | ||
sequences_per_group = filter_sequences_per_group, | ||
min_length = 5000 | ||
min_length = 5000, | ||
strain_id = config.get("strain_id_field", "strain"), | ||
shell: | ||
""" | ||
augur filter \ | ||
--sequences {input.sequences} \ | ||
--metadata {input.metadata} \ | ||
--metadata-id {params.strain_id} \ | ||
--exclude {input.exclude} \ | ||
--output {output.sequences} \ | ||
--group-by {params.group_by} \ | ||
|
@@ -122,6 +127,8 @@ rule align: | |
reference = files.reference | ||
output: | ||
alignment = "results/aligned_{serotype}.fasta" | ||
params: | ||
threads = 1 | ||
shell: | ||
""" | ||
augur align \ | ||
|
@@ -130,7 +137,7 @@ rule align: | |
--output {output.alignment} \ | ||
--fill-gaps \ | ||
--remove-reference \ | ||
--nthreads 1 | ||
--nthreads {params.threads} | ||
""" | ||
|
||
rule tree: | ||
|
@@ -158,20 +165,22 @@ rule refine: | |
input: | ||
tree = "results/tree-raw_{serotype}.nwk", | ||
alignment = "results/aligned_{serotype}.fasta", | ||
metadata = "results/metadata_{serotype}.tsv" | ||
metadata = "data/metadata_{serotype}.tsv" | ||
output: | ||
tree = "results/tree_{serotype}.nwk", | ||
node_data = "results/branch-lengths_{serotype}.json" | ||
params: | ||
coalescent = "const", | ||
date_inference = "marginal", | ||
clock_filter_iqd = 4 | ||
clock_filter_iqd = 4, | ||
strain_id = config.get("strain_id_field", "strain"), | ||
shell: | ||
""" | ||
augur refine \ | ||
--tree {input.tree} \ | ||
--alignment {input.alignment} \ | ||
--metadata {input.metadata} \ | ||
--metadata-id {params.strain_id} \ | ||
--output-tree {output.tree} \ | ||
--output-node-data {output.node_data} \ | ||
--timetree \ | ||
|
@@ -223,17 +232,19 @@ rule traits: | |
""" | ||
input: | ||
tree = "results/tree_{serotype}.nwk", | ||
metadata = "results/metadata_{serotype}.tsv" | ||
metadata = "data/metadata_{serotype}.tsv" | ||
output: | ||
node_data = "results/traits_{serotype}.json", | ||
params: | ||
columns = traits_columns, | ||
sampling_bias_correction = 3 | ||
sampling_bias_correction = 3, | ||
strain_id = config.get("strain_id_field", "strain"), | ||
shell: | ||
""" | ||
augur traits \ | ||
--tree {input.tree} \ | ||
--metadata {input.metadata} \ | ||
--metadata-id {params.strain_id} \ | ||
--output {output.node_data} \ | ||
--columns {params.columns} \ | ||
--confidence \ | ||
|
@@ -262,26 +273,52 @@ rule export: | |
"""Exporting data files for for auspice""" | ||
input: | ||
tree = "results/tree_{serotype}.nwk", | ||
metadata = "results/metadata_{serotype}.tsv", | ||
metadata = "data/metadata_{serotype}.tsv", | ||
branch_lengths = "results/branch-lengths_{serotype}.json", | ||
traits = "results/traits_{serotype}.json", | ||
clades = "results/clades_{serotype}.json", | ||
nt_muts = "results/nt-muts_{serotype}.json", | ||
aa_muts = "results/aa-muts_{serotype}.json", | ||
auspice_config = files.auspice_config | ||
output: | ||
auspice_json = "auspice/dengue_{serotype}.json" | ||
auspice_json = "results/raw_dengue_{serotype}.json", | ||
root_sequence = "results/raw_dengue_{serotype}_root-sequence.json", | ||
params: | ||
strain_id = config.get("strain_id_field", "strain"), | ||
shell: | ||
""" | ||
augur export v2 \ | ||
--tree {input.tree} \ | ||
--metadata {input.metadata} \ | ||
--metadata-id {params.strain_id} \ | ||
--node-data {input.branch_lengths} {input.traits} {input.clades} {input.nt_muts} {input.aa_muts} \ | ||
--auspice-config {input.auspice_config} \ | ||
--include-root-sequence \ | ||
--output {output.auspice_json} | ||
""" | ||
|
||
rule final_strain_name: | ||
input: | ||
auspice_json="results/raw_dengue_{serotype}.json", | ||
metadata="data/metadata_{serotype}.tsv", | ||
root_sequence="results/raw_dengue_{serotype}_root-sequence.json", | ||
output: | ||
auspice_json="auspice/dengue_{serotype}.json", | ||
root_sequence="auspice/dengue_{serotype}_root-sequence.json", | ||
params: | ||
display_strain_field=config.get("display_strain_field", "strain"), | ||
strain_id=config.get("strain_id_field", "strain"), | ||
shell: | ||
""" | ||
python3 bin/set_final_strain_name.py \ | ||
--metadata {input.metadata} \ | ||
--metadata-id-columns {params.strain_id} \ | ||
--input-auspice-json {input.auspice_json} \ | ||
--display-strain-name {params.display_strain_field} \ | ||
--output {output.auspice_json} | ||
cp {input.root_sequence} {output.root_sequence} | ||
""" | ||
|
||
rule clean: | ||
"""Removing directories: {params}""" | ||
params: | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,38 @@ | ||
import pandas as pd | ||
import json, argparse | ||
from augur.io import read_metadata | ||
|
||
def replace_name_recursive(node, lookup): | ||
if node["name"] in lookup: | ||
node["name"] = lookup[node["name"]] | ||
|
||
if "children" in node: | ||
for child in node["children"]: | ||
replace_name_recursive(child, lookup) | ||
|
||
if __name__=="__main__": | ||
parser = argparse.ArgumentParser( | ||
description="Swaps out the strain names in the Auspice JSON with the final strain name", | ||
formatter_class=argparse.ArgumentDefaultsHelpFormatter | ||
) | ||
|
||
parser.add_argument('--input-auspice-json', type=str, required=True, help="input auspice_json") | ||
parser.add_argument('--metadata', type=str, required=True, help="input data") | ||
parser.add_argument('--metadata-id-columns', nargs="+", help="names of possible metadata columns containing identifier information, ordered by priority. Only one ID column will be inferred.") | ||
parser.add_argument('--display-strain-name', type=str, required=True, help="field to use as strain name in auspice") | ||
parser.add_argument('--output', type=str, metavar="JSON", required=True, help="output Auspice JSON") | ||
args = parser.parse_args() | ||
|
||
metadata = read_metadata(args.metadata, id_columns=args.metadata_id_columns) | ||
name_lookup = {} | ||
for ri, row in metadata.iterrows(): | ||
strain_id = row.name | ||
name_lookup[strain_id] = args.display_strain_name if pd.isna(row[args.display_strain_name]) else row[args.display_strain_name] | ||
j23414 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
with open(args.input_auspice_json, 'r') as fh: | ||
data = json.load(fh) | ||
|
||
replace_name_recursive(data['tree'], name_lookup) | ||
|
||
with open(args.output, 'w') as fh: | ||
json.dump(data, fh) | ||
j23414 marked this conversation as resolved.
Show resolved
Hide resolved
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
strain_id_field: "accession" | ||
display_strain_field: "strain" |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,69 +1,41 @@ | ||
DENV/SPAIN/EEB17/2009 | ||
DENV1/FRANCE/00475/2008 | ||
DENV1/MALAYSIA/P1244/1972 | ||
DENV1/VIETNAM/BIDV3990/2008 | ||
DENV1/VIETNAM/BIDV992/2006 | ||
DENV2/AUSTRALIA/QML22/2015 | ||
DENV2/BURKINA_FASO/DAKAR2039/1980 | ||
DENV2/BURKINA_FASO/DAKARA2022/1980 | ||
DENV2/COTE_D_IVOIRE/DAKAR510/1980 | ||
DENV2/COTE_D_IVOIRE/DAKAR578/1980 | ||
DENV2/COTE_D_IVOIRE/DAKARA1247/1980 | ||
DENV2/GUINEA/PM33974/1981 | ||
DENV2/HAITI/DENGUEVIRUS2HOMOSAPIENS1/2016 | ||
DENV2/MALAYSIA/DKD811/2008 | ||
DENV2/MALAYSIA/P81407/1970 | ||
DENV2/MALAYSIA/SAB/2015 | ||
DENV2/NIGERIA/IBH11208/1966 | ||
DENV2/NIGERIA/IBH11234/1966 | ||
DENV2/NIGERIA/IBH11664/1966 | ||
DENV2/SENEGAL/0674/1970 | ||
DENV2/SENEGAL/DAKAR0761/1974 | ||
DENV2/SENEGAL/DAKAR141069/1999 | ||
DENV2/SENEGAL/DAKAR141070/1999 | ||
DENV2/SENEGAL/DAKARD75505/1999 | ||
DENV2/TRINIDAD_AND_TOBAGO/NA/1953 | ||
DENV4/MALAYSIA/P215/1975 | ||
DENV4/MALAYSIA/P514/1975 | ||
DENV4/MALAYSIA/P731120/1973 | ||
D2Sab2015 # miscategorized | ||
QML22 # miscategorized | ||
DAK_Ar_A1247 # sylvatic | ||
Dak_Ar_2039 # sylvatic | ||
Dak_Ar_578 # sylvatic | ||
DAK_Ar_510 # sylvatic | ||
PM33974 # sylvatic | ||
Dak_Ar_A2022 # sylvatic | ||
Dak_Ar_141069 # sylvatic | ||
Dak_Ar_141070 # sylvatic | ||
Dak_Ar_D75505 # sylvatic | ||
Dak_HD_10674 # sylvatic | ||
Dak_Ar_D20761 # sylvatic | ||
IBH11664 # sylvatic | ||
IBH11208 # sylvatic | ||
IBH11234 # sylvatic | ||
P8_1407 # sylvatic | ||
P75_514 # sylvatic | ||
P73_1120 # sylvatic | ||
P75_215 # sylvatic | ||
DKD811 # sylvatic | ||
ZS01/01 # metadata issue | ||
Vero # cell line | ||
MS13002673 # too divergent | ||
MS11011405 # too divergent | ||
V43257 # too divergent | ||
KDC0574A2_06/02/2011 # too divergent | ||
00178/03 # too divergent | ||
00759/12 # too divergent | ||
00988/11 # too divergent | ||
01113/10 # too divergent | ||
01224/04 # too divergent | ||
01231/10 # too divergent | ||
01488/09 # too divergent | ||
01542/04 # too divergent | ||
dev1 # too divergent | ||
DKE_121 # too divergent | ||
SENDAK_HD_10674 # sylvatic | ||
DENV2_1_DAK_HD_76395 # sylvatic | ||
DENV3/PUERTORICO/1963/PRS_228762_AC27 # too divergent | ||
PR_6 # too divergent | ||
KY923048 # D2Sab2015 # miscategorized | ||
KX274130 # QML22 # miscategorized | ||
EF105383 # DAK_Ar_A1247 # sylvatic | ||
EF105382 # Dak_Ar_2039 # sylvatic | ||
EF105380 # Dak_Ar_578 # sylvatic | ||
EF105381 # DAK_Ar_510 # sylvatic | ||
EF105378 # PM33974 # sylvatic | ||
EF105386 # Dak_Ar_A2022 # sylvatic | ||
EF105389 # Dak_Ar_141069 # sylvatic | ||
EF105390 # Dak_Ar_141070 # sylvatic | ||
EF457904 # Dak_Ar_D75505 # sylvatic | ||
EF105384 # Dak_HD_10674 # sylvatic | ||
EF105385 # Dak_Ar_D20761 # sylvatic | ||
EF105388 # IBH11664 # sylvatic | ||
EF105387 # IBH11208 # sylvatic | ||
EU003591 # IBH11234 # sylvatic | ||
EF105379 # P8_1407 # sylvatic | ||
JF262779 # P75_514 # sylvatic | ||
JF262780 # P73_1120 # sylvatic | ||
EF457906 # P75_215 # sylvatic | ||
FJ467493 # DKD811 # sylvatic | ||
EF051521 # ZS01/01 # metadata issue | ||
MT929160 # Vero # cell line | ||
MH048676 # MS13002673 # too divergent | ||
MH048674 # MS11011405 # too divergent | ||
MT597439 # V43257 # too divergent | ||
MN448607 # KDC0574A2_06/02/2011 # too divergent | ||
ON046268 # 00178/03 # too divergent | ||
ON046278 # 00759/12 # too divergent | ||
ON046276 # 00988/11 # too divergent | ||
ON046273 # 01113/10 # too divergent | ||
ON046270 # 01224/04 # too divergent | ||
ON046274 # 01231/10 # too divergent | ||
ON046272 # 01488/09 # too divergent | ||
ON046271 # 01542/04 # too divergent | ||
MZ284953 # dev1 # too divergent | ||
MZ215848 # DKE_121 # too divergent | ||
MW946564 # SENDAK_HD_10674 # sylvatic | ||
OK605757 # DENV2_1_DAK_HD_76395 # sylvatic | ||
MW945427 # DENV3/PUERTORICO/1963/PRS_228762_AC27 # too divergent | ||
OM258630 # PR_6 # too divergent |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can consolidate the two
.snakemake
directories here and we probably want to exclude alllogs/
directories as well: