[wip]: Add the functionality of join metadata and clades #23

Closed
wants to merge 7 commits into from

Conversation

j23414
Contributor

@j23414 j23414 commented Sep 15, 2023

Description of proposed changes

After some discussion with @joverlee521, I moved join-metadata-and-clades.py from PR #20 to this draft.

Some of the functionality may be replaceable by csvtk, but certain pathogen repositories include customized calculations.

mpx/join-metadata-and-clades.py
    rsv/join-metadata-and-clades.py       3 lines different
    dengue/join-metadata-and-clades.py    IDENTICAL
    ncov/join-metadata-and-clades         114 lines different

This draft is a placeholder: the functionality of join-metadata-and-clades requires more discussion and thought.

Related issue(s)

Subset of scripts listed in #1

Checklist

  • Checks pass
  • If adding a script, add an entry for it in the README.

@j23414 j23414 mentioned this pull request Sep 15, 2023
2 tasks
Add a minor check to join-metadata-and-clades to ensure that all of the
sequences in the metadata file are included in the output.
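A check like the one this commit message describes could look like the following minimal sketch. The function name `find_missing_ids`, the `accession` ID column, and the in-memory rows are assumptions made for illustration; the actual script may implement the check differently.

```python
def find_missing_ids(metadata_rows, output_rows, id_field="accession"):
    """Return metadata IDs that are absent from the joined output, in metadata order."""
    output_ids = {row[id_field] for row in output_rows}
    return [row[id_field] for row in metadata_rows if row[id_field] not in output_ids]


# Hypothetical records standing in for the metadata file and the joined output.
metadata = [{"accession": "A1"}, {"accession": "A2"}, {"accession": "A3"}]
joined = [{"accession": "A1", "clade": "I"}, {"accession": "A3", "clade": "II"}]

missing = find_missing_ids(metadata, joined)
if missing:
    print(f"WARNING: {len(missing)} metadata record(s) missing from output: {missing}")
```

Reporting the missing IDs, rather than only comparing row counts, makes it easier to see which records were dropped by the join.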
Each pathogen can have unique columns in the Nextclade output
(e.g. ncov-ingest includes SC2 specific columns). This change makes the
nextclade column map customizable to support these.
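One way to picture a customizable column map is sketched below. The function `rename_nextclade_columns` and both maps are hypothetical, and the column names are only examples of what a pathogen's Nextclade output might contain; this is not the script's actual implementation.

```python
def rename_nextclade_columns(rows, column_map):
    """Keep only the mapped Nextclade columns and rename them to metadata names."""
    return [
        {new: row[old] for old, new in column_map.items() if old in row}
        for row in rows
    ]


# Hypothetical Nextclade output row that includes one SC2-specific column.
nextclade_rows = [
    {"seqName": "A1", "clade": "21K", "qc.overallScore": "0.0", "Nextclade_pango": "BA.1"}
]

# A generic map and an SC2-style map that adds the extra column (both assumed).
generic_map = {"seqName": "seqName", "clade": "clade"}
sc2_map = {**generic_map, "Nextclade_pango": "Nextclade_pango"}

print(rename_nextclade_columns(nextclade_rows, generic_map))
print(rename_nextclade_columns(nextclade_rows, sc2_map))
```

Passing the map in as an argument lets each pathogen repository supply its own columns without changing the shared join logic.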
@j23414
Contributor Author

j23414 commented Sep 19, 2023

We could combine the two files without performing complex calculations by using a combination of csvtk rename and tsv-join as follows:

# Rename columns in the input.nextclade file
cat {input.nextclade} \
| csvtk -tl rename2 \
  -F \
  -f '*' \
  -p '(.+)' \
  -r '{{kv}}' \
  -k {input.nextclade_field_map} \
  > results/nextclade_renamed.tsv
  
# Join the renamed nextclade file with the input.metadata file
cat {input.metadata} \
 | tsv-join -H \
 --filter-file results/nextclade_renamed.tsv \
 --key-fields seqName \
 --data-fields accession \
 --append-fields `awk '{{print $2}}' results/nextclade_renamed.tsv | tr '\n' ','` \
 --allow-duplicate-keys \
 --write-all -1 \
 > {output.metadata}
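For comparing outputs, the join logic of the commands above can be sketched in plain Python. This is an illustration of a left join on differently named key columns, not a reimplementation of tsv-join; the function `left_join`, the `fill` placeholder, and the sample data are assumptions.

```python
import csv
import io


def left_join(metadata_tsv, nextclade_tsv, metadata_key="accession",
              nextclade_key="seqName", fill=""):
    """Append Nextclade columns to every metadata row, matching on differing key names."""
    nextclade_reader = csv.DictReader(io.StringIO(nextclade_tsv), delimiter="\t")
    append_fields = [f for f in nextclade_reader.fieldnames if f != nextclade_key]
    # If a key repeats in the Nextclade rows, the last row wins (a simplification).
    lookup = {row[nextclade_key]: row for row in nextclade_reader}

    joined = []
    for row in csv.DictReader(io.StringIO(metadata_tsv), delimiter="\t"):
        match = lookup.get(row[metadata_key])
        for field in append_fields:
            # Unmatched metadata rows are kept and padded with the fill value.
            row[field] = match[field] if match else fill
        joined.append(row)
    return joined


# Hypothetical inputs: A2 has no Nextclade result.
metadata_tsv = "accession\tcountry\nA1\tUSA\nA2\tUK\n"
nextclade_tsv = "seqName\tclade\nA1\tIIb\n"

for row in left_join(metadata_tsv, nextclade_tsv, fill="?"):
    print(row)
```

Every metadata row is kept, so the output always has as many records as the metadata input.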

@joverlee521
Contributor

@j23414 Would you be up for replacing join-metadata-and-clades in monkeypox with your csvtk/tsv-join example? It would be nice to do a full test run there to see how the outputs compare.

@huddlej

huddlej commented Sep 20, 2023

I found myself needing to implement something similar in the flu_frequencies workflow and appreciated csvtk join's option to specify different names for the join columns in the two inputs (e.g., seqName in Nextclade input and strain in metadata). I originally planned to use tsv-join, but I was frustrated by the need to rename columns to have a matching key column name in both inputs.

Regardless of which tools you end up using here, you can drop the cat from the code above, so that this:

cat {input.nextclade} \
| csvtk -tl rename2 \
  -F \
  -f '*' \
  -p '(.+)' \
  -r '{{kv}}' \
  -k {input.nextclade_field_map} \
  > results/nextclade_renamed.tsv

becomes this:

csvtk -tl rename2 \
  -F \
  -f '*' \
  -p '(.+)' \
  -r '{{kv}}' \
  -k {input.nextclade_field_map} \
  {input.nextclade} > results/nextclade_renamed.tsv

j23414 added a commit to nextstrain/mpox that referenced this pull request Oct 4, 2023
As part of centralizing ingest scripts, replace the join-metadata-and-clades.py
script with csvtk and tsv append when there aren't any customized calculations.

nextstrain/ingest#23
j23414 added a commit to nextstrain/mpox that referenced this pull request Oct 10, 2023
As part of centralizing ingest scripts, replace the join-metadata-and-clades.py
script with csvtk and tsv append when there aren't any customized calculations.

nextstrain/ingest#23
j23414 added a commit to nextstrain/mpox that referenced this pull request Oct 10, 2023
As part of centralizing ingest scripts, replace the join-metadata-and-clades.py
script with csvtk and tsv append when there aren't any customized calculations.

nextstrain/ingest#23
j23414 added a commit to nextstrain/mpox that referenced this pull request Oct 11, 2023
As part of centralizing ingest scripts, replace the join-metadata-and-clades.py
script with csvtk and tsv append when there aren't any customized calculations.

nextstrain/ingest#23
j23414 added a commit to nextstrain/mpox that referenced this pull request Oct 12, 2023
As part of centralizing ingest scripts, replace the join-metadata-and-clades.py
script with csvtk and tsv append when there aren't any customized calculations.

nextstrain/ingest#23

Relatedly, this commit also adds a nextclade config section where mapping
fields from the nextclade output to be appended to the metadata can be specified.

Co-authored-by: Jover Lee <joverlee521@gmail.com>
@j23414
Contributor Author

j23414 commented Oct 13, 2023

Closing, since this script has been replaced with csvtk and tsv-utils.

@j23414 j23414 closed this Oct 13, 2023
@j23414 j23414 deleted the join_metadata_and_clades branch October 13, 2023 00:05