Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add transform strain names script #20

Merged
merged 2 commits into from
Sep 18, 2023
Merged

Add transform strain names script #20

merged 2 commits into from
Sep 18, 2023

Conversation

j23414
Copy link
Contributor

@j23414 j23414 commented Sep 14, 2023

Description of proposed changes

Since these are used for multiple pathogen ingest pipelines, copied the following scripts from monkeypox:

Added some tests and documentation.

Related issue(s)

Subset of scripts listed in #1

Checklist

  • Checks pass
  • If adding a script, add an entry for it in the README.

@j23414 j23414 changed the title Transform strain names Add transform strain names and join metadata clades scripts Sep 14, 2023
@j23414 j23414 self-assigned this Sep 14, 2023
Copy link
Contributor

@joverlee521 joverlee521 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank for working on these @j23414!

I've left a comment on modifications to the join-metadata-and-clades script that I think are needed for it to be used for other pathogens.

I would also like a little more detail in the commit messages. It would be helpful to point to the permalink of original scripts, similar to 3b69a10


column_map = {
"clade": "clade",
"outbreak": "outbreak",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe "outbreak" is a very specific column for mpox.

I expect that each pathogen can have unique columns in the Nextclade output (e.g. ncov-ingest includes SC2 specific columns). We should probably make the column map customizable to support these. It might be easiest to add an input for a TSV column map that can override the default defined in the script?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added column map customization in 7df1636
I moved the old columns to default but also open to dropping the defaults altogether.

join-metadata-and-clades.py Outdated Show resolved Hide resolved
Comment on lines 19 to 25
Check whether join-metadata-clades script produces an output metadata file, but do not assess the accuracy or validity of that output file.

$ python $TESTDIR/../../join-metadata-and-clades.py \
> --id-field strain \
> --metadata metadata_raw.tsv \
> --nextclade nextclade.tsv \
> -o test_metadata.tsv
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should do at least some minor checking of the output file here. I think at minimum we want to ensure that all of the sequences in the metadata file are included in the output.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a minor check in a5fa32e
I'm happy to discuss a more detailed checking strategy; I was unsure on level of detail.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General strategy is just to ensure that the output has the expected format. If we wanted cover all of our bases, we could include an expected output file and just diff the output with the expected output file.

@j23414 j23414 force-pushed the transform_strain_names branch 2 times, most recently from 08c108a to 6a64136 Compare September 15, 2023 21:48
@j23414
Copy link
Contributor Author

j23414 commented Sep 15, 2023

@joverlee521, ready for review. Thanks for the comments and still open for discussion!

Additional edits include:

  • parameterizing the nextclade id so it's not hardcoded as "seqName" c2f5e0e
  • dropping an unused variable 6a64136

@j23414 j23414 changed the title Add transform strain names and join metadata clades scripts Add transform strain names script Sep 15, 2023
@j23414
Copy link
Contributor Author

j23414 commented Sep 15, 2023

Through discussion with @joverlee521, realized that join-metadata-and-clades.py requires more thought. Therefore the commits were moved to a draft PR #23.

This PR has been rebased to only involve the transform-strain-name script.

@joverlee521
Copy link
Contributor

OH, one last thing to note: I had dropped the automated Cram tests in 6e955d7.

We should add it back since this PR adds new Cram tests.

@j23414 j23414 merged commit c02fa81 into main Sep 18, 2023
2 checks passed
@j23414 j23414 deleted the transform_strain_names branch September 18, 2023 19:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

2 participants