Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parameterize ncbi_id in fetch_sequences #146

Merged
merged 1 commit into from
Mar 28, 2023
Merged

Parameterize ncbi_id in fetch_sequences #146

merged 1 commit into from
Mar 28, 2023

Conversation

j23414
Copy link
Contributor

@j23414 j23414 commented Feb 8, 2023

Description of proposed changes

Could we parameterize the NCBI Taxon ID in the fetch sequences scripts?

Historically the NCBI Taxon ID has been hardcoded in ingest/bin/genbank-url:

https://github.com/nextstrain/monkeypox/blob/b54768ec17872eb0d898e29527785642f6b98c0d/ingest/bin/genbank-url#L21

This PR generalizes that and related scripts such that the NCBI Taxon ID can be defined in a snakemake rule:

https://github.com/nextstrain/monkeypox/blob/b54768ec17872eb0d898e29527785642f6b98c0d/ingest/workflow/snakemake_rules/fetch_sequences.smk#L17-L23

View example here! 👀

Open to discussion or other ways to parameterize this. Can test the code via:

git clone https://github.com/j23414/monkeypox.git
cd monkeypox/ingest
nextstrain build  . data/sequences.ndjson

or simpler

./bin/fetch-from-genbank 10244 > genbank.ndjson

Related issue(s)

Related commit: nextstrain/dengue@5d3281d

Testing

  • Checks pass

@j23414
Copy link
Contributor Author

j23414 commented Mar 16, 2023

Speaking of NCBI datasets, parameterizing the NCBI Taxon ID feeds in nicely to the datasets command:

 f"datasets download virus genome taxon {params.ncbi_id} --no-progressbar --filename {output.dataset_package}

https://github.com/nextstrain/ncov-ingest/blob/master/workflow/snakemake_rules/fetch_sequences.smk#L53

or with Zika example:

datasets download virus genome taxon 64320 --filename zika_datasets.zip

unzip zika_datasets.zip
#> Archive:  zika_datasets.zip
#>   inflating: README.md               
#>   inflating: ncbi_dataset/data/data_report.jsonl  
#>   inflating: ncbi_dataset/data/genomic.fna  
#>   inflating: ncbi_dataset/data/virus_dataset.md  
#>   inflating: ncbi_dataset/data/dataset_catalog.json  

@j23414 j23414 requested a review from a team March 25, 2023 08:36
Copy link
Member

@tsibley tsibley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for the parameterization. I've left a number of specific comments on small things which I think are important enough to mention, but none of them I consider blocking (e.g. if for some reason you can't/won't address them).

Additionally, I think it would be good practice to put some of the rationale you included in your PR description into the commit message itself, as that's more durable and tightly linked to the source code.

ingest/bin/fetch-from-genbank Outdated Show resolved Hide resolved
ingest/bin/genbank-url Outdated Show resolved Hide resolved
ingest/bin/genbank-url Outdated Show resolved Hide resolved
ingest/bin/genbank-url Outdated Show resolved Hide resolved
ingest/bin/genbank-url Outdated Show resolved Hide resolved
Historically the NCBI Taxon ID has been hardcoded in ingest/bin/genbank-url
(e.g. 'VirusLineageId_ss:(10244)') and this commit generalizes the script to
take an NCBI Taxon ID which could be defined at the Snakemake rule level.

Even if we move to use NCBI datasets, parameterizing the NCBI Taxon ID would
feed nicely into the datasets command:

f"datasets download virus genome taxon {params.ncbi_id} --no-progressbar --filename {output.dataset_package}
@j23414 j23414 merged commit 6a72efc into nextstrain:master Mar 28, 2023
j23414 added a commit to nextstrain/dengue that referenced this pull request Mar 28, 2023
j23414 added a commit to nextstrain/dengue that referenced this pull request Apr 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

2 participants