ingest: Create new accession field without version
Resolves #39

Create a new accession field without the version number so that
annotations do not need to be updated when the version number is updated.
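
For illustration, the version stripping described above can be sketched in Python. The regular expression is the one passed to `csvtk mutate` in this commit; the function name `strip_version` and the fallback for values without a dot are assumptions made for this sketch and are not part of the commit.

```python
import re

# Same pattern as the csvtk mutate step in this commit:
# lazily capture everything before the first dot.
VERSION_PATTERN = re.compile(r"^(.+?)\.")


def strip_version(accession_version: str) -> str:
    """Return the accession without its version suffix.

    Hypothetical helper. Returning the input unchanged when it contains
    no dot is a choice made for this sketch, not a statement about how
    csvtk treats unmatched values.
    """
    match = VERSION_PATTERN.match(accession_version)
    return match.group(1) if match else accession_version


if __name__ == "__main__":
    print(strip_version("NC_045512.2"))
```

Because the capture group is lazy, only the text before the first dot is kept, so annotations keyed on the unversioned value should not need to change when a record's version is bumped.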
joverlee521 committed Apr 17, 2024
1 parent 0f799d7 commit ceb7fbc
Showing 2 changed files with 6 additions and 1 deletion.
2 changes: 2 additions & 0 deletions ingest/defaults/config.yaml
@@ -46,6 +46,7 @@ curate:
# This is the first step in the pipeline, so any references to field names in the configs below should use the new field names
field_map:
accession: accession
+ accession_version: accession_version
sourcedb: database
sra-accs: sra_accessions
isolate-lineage: strain
@@ -100,6 +101,7 @@ curate:
# The list of metadata columns to keep in the final output of the curation pipeline.
metadata_columns: [
"accession",
+ "accession_version",
"strain",
"date",
"region",
5 changes: 4 additions & 1 deletion ingest/rules/fetch_from_ncbi.smk
@@ -97,6 +97,9 @@ rule format_ncbi_dataset_report:
--fields {params.ncbi_datasets_fields:q} \
--elide-header \
| csvtk add-header -t -l -n {params.ncbi_datasets_fields:q} \
+ | csvtk rename -t -f accession -n accession_version \
+ | csvtk -tl mutate -f accession_version -n accession -p "^(.+?)\." \
+ | tsv-select -H -f accession --rest last \
> {output.ncbi_dataset_tsv}
"""

@@ -120,7 +123,7 @@ rule format_ncbi_datasets_ndjson:
augur curate passthru \
--metadata {input.ncbi_dataset_tsv} \
--fasta {input.ncbi_dataset_sequences} \
- --seq-id-column accession \
+ --seq-id-column accession_version \
--seq-field sequence \
--unmatched-reporting warn \
--duplicate-reporting warn \
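
As a rough illustration of what the three new pipeline steps in `format_ncbi_dataset_report` do to the TSV (rename `accession` to `accession_version`, derive a new `accession` column, move it to the front), here is a minimal Python sketch. The function name `add_unversioned_accession` and the sample rows are hypothetical. The sketch uses `str.split` as an approximation of the regular expression in the commit and does no quote handling, so it is not a drop-in replacement for `csvtk` or `tsv-select`.

```python
def add_unversioned_accession(tsv_text: str) -> str:
    """Approximate the rename / mutate / select steps on a small TSV string.

    Hypothetical helper for illustration only; assumes a header row that
    contains an "accession" column and plain tab-separated values.
    """
    lines = tsv_text.splitlines()
    header = lines[0].split("\t")
    idx = header.index("accession")

    # Rename the original column, then put the new unversioned column first
    # and keep the remaining columns in their original order.
    renamed = ["accession_version" if i == idx else name
               for i, name in enumerate(header)]
    out_lines = ["\t".join(["accession"] + renamed)]

    for line in lines[1:]:
        fields = line.split("\t")
        # Keep everything before the first dot; values without a dot pass through.
        unversioned = fields[idx].split(".", 1)[0]
        out_lines.append("\t".join([unversioned] + fields))

    return "\n".join(out_lines) + "\n"


if __name__ == "__main__":
    sample = "accession\tsourcedb\nNC_045512.2\tRefSeq\n"
    print(add_unversioned_accession(sample), end="")
```

The versioned value stays available as `accession_version`, which is the column the second hunk now passes to `--seq-id-column`.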