Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ingest: Create new accession field without version #40

Merged
merged 2 commits into from
May 3, 2024

Conversation

joverlee521
Copy link
Contributor

Resolves #39

Create a new accession field without the version number so that annotations do not need to be updated when the version number is updated.

Related issue(s)

Related to nextstrain/measles#24

Checklist

  • Checks pass

Resolves #39

Create a new accession field without the version number so that
annotations do not need to be updated when the version number is updated.
@@ -46,6 +46,7 @@ curate:
# This is the first step in the pipeline, so any references to field names in the configs below should use the new field names
field_map:
accession: accession
accession_version: accession_version
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I chose the column name accession_version to be clearer, but open to changing it to match the existing pattern accession_rev.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Accession_version is what NCBI actually calls it and what will be used by a similar project I'm working on, so I fully support accession_version - it's the right term!

@joverlee521 joverlee521 requested a review from a team April 17, 2024 21:03
Copy link
Contributor

@j23414 j23414 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree

@@ -97,6 +97,9 @@ rule format_ncbi_dataset_report:
--fields {params.ncbi_datasets_fields:q} \
--elide-header \
| csvtk add-header -t -l -n {params.ncbi_datasets_fields:q} \
| csvtk rename -t -f accession -n accession_version \
| csvtk -tl mutate -f accession_version -n accession -p "^(.+?)\." \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would refrain from adding more csvtk to any nextstrain code base when used with TSV files, and phase it out

There are rare but disastrous bugs when tsv fields contain single or double quotes: shenwei356/csvtk#130

csvtk is just not a TSV tool, it mostly works, but it disobeys the standard.

We can now use `csvtk fix-quotes` and `csvtk del-quotes` to work around
quoting issues (e.g. internal quotes in the submitter.affiliation).

Copied commit from Zika repo:

* nextstrain/zika#58
@j23414
Copy link
Contributor

j23414 commented May 2, 2024

Tried to address the csvtk comment with c0b9e50 . Feel free to suggest other changes, otherwise I think this PR is ready for review.

@joverlee521
Copy link
Contributor Author

Thanks for porting over the fix for csvtk @j23414! I'll merge this now to keep the repo up-to-date.

@joverlee521 joverlee521 merged commit 6e9e77b into main May 3, 2024
@joverlee521 joverlee521 deleted the accession-version branch May 3, 2024 16:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ingest: GenBank accession should not include version number
3 participants