Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add gene coverage columns during ingest workflow #36

Merged
merged 15 commits into from
Apr 24, 2024

Conversation

j23414
Copy link
Contributor

@j23414 j23414 commented Mar 22, 2024

Description of proposed changes

pathogen-repo-guide (6)

Several approaches were explored to add {gene}_coverage columns during ingest workflow (as opposed to during phylogenetic workflow). The different approaches were summarized by @joverlee521 and @jameshadfield and copied here for context of this PR, along with added comments from @j23414 in [comments]:

  1. Generalize RSV's extend-metadata to take gene coordinates as input to calculate gene coverage. This will require gene coordinates to be maintained in the config YAML. Follow current pattern of outputing gene coverage columns that can be used for filter [ Opened an issue: Generalize the "extend-metadata.py" script for any {gene}_coverage columns rsv#57 ~ @j23414 ]
  2. Use Nextclade's failedCdses column to determine if E gene has coverage. Outputs E gene included with True/False that can be used for filter. [I went ahead and appended the failedCdses column from Nextclade, so we can still use this method for other genes ~ @j23414 ]
  3. We briefly talked about whether it would be possible for Nextclade to output {CDS}_coverage columns in addition to the full genome coverage column. This will allow the workflow to use the Nextclade columns for filter without having to maintain the gene coordinates or parse the dataset GFF file to get the coordinates
  4. Use the output (translated) CDS alignments from nextclade to add columns to the metadata with amino acid length or similar. This could then be used via augur filter --query .... This approach would be made obsolete by (3), but it's pretty easy to do right now. I [@jameshadfield] think it's preferable to (1) in both the case of compound CDSs and the case where a genome alignment extends both sides of the CDS but actually has very little coverage over the CDS itself. [This PR is following approach 4 ~ @j23414 ]

New Metadata

To view the new "E_coverage" columns, feel free to download the new metadata at:

wget https://data.nextstrain.org/files/workflows/dengue/metadata_all.tsv.zst
zstd -d metadata_all.tsv.zst 

The new {gene}_coverage columns are the rightmost columns.

Related issue(s)

Checklist

  • Checks pass

@j23414 j23414 requested a review from a team March 23, 2024 16:09
Copy link
Contributor

@joverlee521 joverlee521 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for walking through all the methods you've tried and where we've landed!

I've left some minor comments, but I think my main question is whether or not we still need the E_indicator column? It seems extra now that we can have the E_coverage column.

ingest/rules/nextclade.smk Outdated Show resolved Hide resolved
ingest/rules/nextclade.smk Outdated Show resolved Hide resolved
ingest/rules/nextclade.smk Outdated Show resolved Hide resolved
ingest/rules/nextclade.smk Outdated Show resolved Hide resolved
j23414 added a commit that referenced this pull request Mar 27, 2024
#36 (comment)

Since we are not using the E_indicator column, drop it.
We have separate steps to calculate the E_coverage column.
@j23414 j23414 requested a review from a team April 5, 2024 20:01
@j23414
Copy link
Contributor Author

j23414 commented Apr 5, 2024

Thanks @joverlee521 ! This PR is ready for the next round of reviews

j23414 added a commit that referenced this pull request Apr 10, 2024
#36 (comment)

Since we are not using the E_indicator column, drop it.
We have separate steps to calculate the E_coverage column.
j23414 added a commit that referenced this pull request Apr 15, 2024
#36 (comment)

Since we are not using the E_indicator column, drop it.
We have separate steps to calculate the E_coverage column.
j23414 and others added 14 commits April 16, 2024 15:09
This is using the Nextclade "coverage" as "genome_coverage" and the Nextclade "failedCdses"
to check if E_coverage is present or not.

fixup: use 1 instead of true
This can be one gene or a set of genes, can then be used to calculate gene_coverage columns.
Move intermediate files to the "data" folder
Adds the following rules for gene coverage

* calculate_gene_coverage: calls a python script which takes a Nextclade CDS translation FASTA and calculates (valid AA)/(total length). The percentage is rounded to 3 significant figures.
* aggregate_gene_coverage_by_gene: combines the gene_coverage files by gene (e.g. ["E", "NS1"] ) across all serotypes (e.g. denv1-4)
* appends_gene_coverage_columns: Add each gene_coverage column (e.g. "E_coverage", "NS1_coverage") to the the final metadata
Co-authored-by: Jover Lee <joverlee521@gmail.com>
#36 (comment)

Since we are not using the E_indicator column, drop it.
We have separate steps to calculate the E_coverage column.
…g params so they don't get out of sync between rules
Encode serotype and gene as part of the directory structure where possible.
As suggested by #36 (comment)
Merge ID should be the first item in the map
Copy link
Contributor

@joverlee521 joverlee521 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for continuing to push on this @j23414! This looks good to merge to me 👍

@j23414 j23414 merged commit 4ee7ec5 into main Apr 24, 2024
32 checks passed
@j23414 j23414 deleted the add-gene-coverage-columns branch April 24, 2024 23:05
@j23414 j23414 mentioned this pull request May 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants