Preserve original contig names when using Polypolish #7

Closed
jmtsuji opened this issue Feb 28, 2022 · 7 comments
Labels
enhancement New feature or request

Comments

@jmtsuji

jmtsuji commented Feb 28, 2022

Thanks so much for your work on Polypolish, and congrats on its recent publication!

I have a very minor feature request. When I run polypolish, I notice that it always appends _polypolish on the ends of contig names in the FastA file. E.g.,

>contig_1

Becomes

>contig_1_polypolish

Adding the _polypolish suffix to sequence names is nice because it helps the user to track data provenance. However, it can also be a bit cumbersome sometimes, e.g., if the user wants to programmatically compare assembly statistics (pre-polishing) with other analyses done after polishing. Would it be possible to add a flag to polypolish to preserve the original contig names in the FastA file during analysis?

Thanks!

@jmtsuji
Author

jmtsuji commented Mar 15, 2022

For those interested, here is a simple workaround to remove the _polypolish suffix on FastA header names using awk:

./polypolish assembly.fasta R1.sam R2.sam 2>polypolish.log | \
  awk '{ if ($0 ~ /^>/) { gsub("_polypolish$", ""); print } \
      else { print } }' \
  > polypolish.fasta

This issue can potentially be closed now given that an awk one-liner can accomplish the desired task.
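To sanity-check the workaround without running Polypolish, the same awk logic can be wrapped in a small shell function and fed a hand-written FASTA. This is only a sketch: the function name `strip_polypolish_suffix` and the sample records are made up for illustration.

```sh
# Hypothetical helper (not part of Polypolish): strip a trailing
# "_polypolish" from FASTA header lines only; sequence lines pass through.
strip_polypolish_suffix() {
    awk '/^>/ { sub(/_polypolish$/, "") } { print }'
}

# Made-up sample input standing in for Polypolish output.
printf '>contig_1_polypolish\nACGT\n>contig_2_polypolish\nGGCC\n' \
    | strip_polypolish_suffix
```

Because the pattern is anchored with `$` and applied only to lines starting with `>`, a name that contains `_polypolish` somewhere other than the end is left alone.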

@rrwick
Owner

rrwick commented Mar 29, 2022

Thanks! I've added a new bit on the Polypolish FAQ addressing this case:
Can I prevent Polypolish from changing contig names?

I'll close the issue now 😄

@rrwick rrwick closed this as completed Mar 29, 2022
@jmtsuji
Author

jmtsuji commented Mar 29, 2022

Thanks!

@d-yarmosh

d-yarmosh commented Feb 8, 2023

If anyone wants a slightly easier-to-read way to do the same thing as the awk command above:

./polypolish assembly.fasta R1.sam R2.sam | sed 's/_polypolish//g' > polypolish.fasta

Further, it appears that polypolish keeps the original headers, provided they don't contain spaces. Unicycler's headers do contain spaces, and the data after the spaces is pretty useful, so you could also first replace all spaces with underscores:

sed -i 's/ /_/g' assembly.fasta

and then apply the above sed with a small addition:

./polypolish assembly.fasta R1.sam R2.sam | sed 's/_polypolish//g' | sed 's/_/ /g' > polypolish.fasta
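One caveat with the underscore round trip is that the final `sed 's/_/ /g'` turns every underscore into a space, including any that were already part of a contig name. An alternative, sketched below under assumptions, leaves the assembly untouched and copies the full header lines back from the original FASTA after polishing. The function name `restore_headers` and the file names are hypothetical, and the sketch assumes the polished names are the original names (the text before the first whitespace) plus `_polypolish`.

```sh
# Hypothetical helper: restore full original headers in a polished FASTA.
#   $1 = original assembly FASTA (headers may carry descriptions)
#   $2 = polished FASTA whose names end in "_polypolish" (assumed)
restore_headers() {
    awk '
        # First file: remember each full header, keyed by contig name.
        NR == FNR {
            if (/^>/) {
                name = substr($1, 2)
                full[name] = substr($0, 2)
            }
            next
        }
        # Second file: strip the suffix, then look up the original header.
        /^>/ {
            name = substr($1, 2)
            sub(/_polypolish$/, "", name)
            if (name in full) print ">" full[name]
            else print ">" name
            next
        }
        # Sequence lines are printed unchanged.
        { print }
    ' "$1" "$2"
}
```

Usage would look like `restore_headers assembly.fasta polypolish.fasta > polypolish_renamed.fasta`. Names that cannot be found in the original file are printed with only the suffix removed.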

@hkaspersen

@rrwick Just to add here, do you have any plans on looking into the above issue that happens when spaces are used in contig names? I was wondering if this was an easy fix within polypolish itself, or perhaps in unicycler!

@rrwick
Owner

rrwick commented Jun 9, 2023

Polypolish doesn't currently save any of the sequence description strings (anything after the first whitespace in the FASTA header), but this could potentially be added. I'll reopen the issue and tag it as an enhancement request. Thanks!

@rrwick rrwick reopened this Jun 9, 2023
@rrwick rrwick added the enhancement New feature or request label Jun 9, 2023
@rrwick
Owner

rrwick commented Jan 15, 2024

I have just released Polypolish v0.6.0, which addresses this issue. Contig descriptions are now kept, and polypolish is added to the end of the description instead of to the contig name. See this FAQ entry for an example of how it now works.

@rrwick rrwick closed this as completed Jan 15, 2024