Filter primer dimers #530
Comments
Based on trimming primers in #552, we found that primer dimers show up like this:
[start of LEFT_PRIMER][overlap][reversed start of RIGHT_PRIMER][reversed read 2 ADAPTER][garbage]
Trimming the adapters drops the last two sections. Trimming primers drops the first two sections, because it matches the left primer. That leaves you with a small section like this:
[reversed start of RIGHT_PRIMER]
We can check all reads that are shorter than the longest primer to see if they match the reversed start of one of the right primers, then just completely trim that read. It's probably worth caching the matches we find, since they're usually repeated a lot. It might also be worth caching the ones that don't match.
Can cutadapt be used somehow for this step? Maybe an X on both ends?
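The check described above could be sketched roughly like this in Python. This is only an illustration under some assumptions, not the code that went into the pipeline: `make_dimer_checker` and `is_dimer_remnant` are hypothetical names, "reversed start" is interpreted as the reverse complement of a primer prefix, and only exact matches are considered. `functools.lru_cache` remembers both the matches and the non-matches, since the same short remnants tend to repeat.

```python
from functools import lru_cache

# Translation table for complementing DNA bases (N stays N).
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def make_dimer_checker(right_primers):
    """Build a cached check for leftover primer dimer fragments.

    Assumption: a leftover fragment is exactly the reverse complement
    of the first len(read) bases of one of the right primers.
    """
    primers = tuple(right_primers)
    longest = max(len(primer) for primer in primers)

    @lru_cache(maxsize=None)  # caches hits and misses alike
    def is_dimer_remnant(read_seq):
        size = len(read_seq)
        # Only reads shorter than the longest primer are candidates.
        if size == 0 or size >= longest:
            return False
        return any(
            len(primer) >= size
            and reverse_complement(primer[:size]) == read_seq
            for primer in primers
        )

    return is_dimer_remnant


# Usage with a made-up primer sequence:
check = make_dimer_checker(["AAACCCGGT"])
trimmed_reads = ["GTTT", "GTTT", "AAAC", "GTTTGGGCC"]
kept = [read for read in trimmed_reads if not check(read)]
```

Reads that the check flags would be dropped entirely rather than trimmed further. Allowing a mismatch or two would need a different comparison than plain string equality.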
Here are some performance measurements, updated from last time. Time in minutes for cutadapt and other trimming steps:
Here are the results when retrying some of the problem samples listed above:
Error message for 90611A-HCV_S136 from 22 Nov 2019:
So far, I've been unable to reproduce the error.
This seems to be a general improvement, particularly on samples with primer dimer spikes, so I'm closing the issue.
There are a few problems that we suspect are caused by primer dimers:
Some samples take a long time for the de novo assembler to finish. 90542B-RELOAD-HCV_S83 from 06 Nov 2019 is the slowest, but 90611A-HCV_S136 from 22 Nov 2019 took 45 hours to finish.
Some contig seeds seem to just be a bunch of repeated primer dimers. 90612A-HCV_S146 from 22 Nov 2019 is a good example. More examples are 73051ANS5A1-HCV-NS5a_S89, 73051ANS5A2-HCV-NS5a_S101, 73060ANS5A-HCV-NS5a_S77, and 73076B-HCV_S27 from 15 Jul 2016.
Possibly related: random primer samples seem very slow and assemble a ton of unknown contigs. There are some examples named "RP" in the 13 Jul 2016 and 28 Jul 2016 runs.
Some random primer samples didn't assemble at all. Are these at all related to primer dimer problems?
Some samples show huge spikes in coverage, and the sequence of the spike looks like two primers overlapping. One example is sample NCOV7-IL-Unknown_S34 from the 20 Apr 2020 run. It had many spikes in coverage, and one of the biggest was at amino acid 547 of the spike gene. Looking in the trimmed reads, I found a ton of them with sequence "AGGCACAGGTGT". Looking at the original read, I found this pattern:
Find a tool that can detect primer dimers, and filter them out before running the assembler. Maybe put them back in for the mapping step? It sounds like you need to predict primer dimers based on the primers, and then search for them. Hopefully there are some existing tools that we can use. Would it be good enough to just filter out short reads before assembling, and then add them back in afterward?
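As a rough illustration of that last idea, here is a minimal Python sketch that holds short reads back from the assembler and keeps them for a later mapping step. The name `split_by_length`, the `(name, sequence)` tuple layout, and the threshold value are all hypothetical, and paired reads would need extra handling to keep mates together.

```python
def split_by_length(reads, min_length):
    """Split (name, sequence) reads into those long enough to assemble
    and those held back for the mapping step."""
    for_assembly = []
    held_back = []
    for name, seq in reads:
        if len(seq) >= min_length:
            for_assembly.append((name, seq))
        else:
            held_back.append((name, seq))
    return for_assembly, held_back


# Usage with made-up reads and an arbitrary threshold:
reads = [("read1", "ACGT"), ("read2", "ACGTACGTACGT")]
for_assembly, held_back = split_by_length(reads, min_length=10)
# Assemble from for_assembly only, then map for_assembly + held_back.
```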