Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Filter primer dimers #530

Closed
3 tasks done
donkirkby opened this issue Feb 7, 2020 · 4 comments
Closed
3 tasks done

Filter primer dimers #530

donkirkby opened this issue Feb 7, 2020 · 4 comments

Comments

@donkirkby
Copy link
Member

donkirkby commented Feb 7, 2020

There are a few problems that we suspect are caused by primer dimers:

  1. Some samples take a long time for the de novo assembler to finish. 90542B-RELOAD-HCV_S83 from 06 Nov 2019 is the slowest, but 90611A-HCV_S136 from 22 Nov 2019 took 45 hours to finish.

  2. Some contig seeds seem to just be a bunch of repeated primer dimers. 90612A-HCV_S146 from 22 Nov 2019 is a good example. More examples are 73051ANS5A1-HCV-NS5a_S89, 73051ANS5A2-HCV-NS5a_S101, 73060ANS5A-HCV-NS5a_S77, and 73076B-HCV_S27 from 15 Jul 2016.

  3. Possibly related: random primer samples seem very slow and assemble a ton of unknown contigs. There are some examples named "RP" in the 13 Jul 2016 and 28 Jul 2016 runs.

  4. Some random primer samples didn't assemble at all. Are these at all related to primer dimer problems?

    • 26 Oct 2015 - D62201-HCV_S3
    • 02 Nov 2015 - D62218-HCV_S4
    • 06 Nov 2015 - D62226-HCV_S4
  5. Some samples show huge spikes in coverage, and the sequence of the spike looks like two primers overlapping. One example is sample NCOV7-IL-Unknown_S34 from the 20 Apr 2020 run. It had many spikes in coverage, one of the biggest was at amino acid 547 of the spike gene. Looking in the trimmed reads, I found a ton of them with sequence "AGGCACAGGTGT". Looking at the original read, I found this pattern:

                       [ nCoV-2019_76_RIGHT ]
    [nCoV-2019_18_LEFT          ]            [Nextera_XT_Read2                ]
    TGGAAATACCCACAAGTTAATGGTTTAACAGGCACAGGTGTCTGTCTCTTATACACATCTCCGAGCCCACGAGAC...random.garbage...
    

Find a tool that can detect primer dimers, and filter them out before running the assembler. Maybe put them back in for the mapping step? It sounds like you need to predict primer dimers based on the primer, and then search for them. Hopefully there are some existing tools that we can use. Would it be good enough to just filter out short reads before assembling, and then adding them back in after?

  • Implement a basic Python version.
  • See what effect it has on the samples listed above.
  • Measure effect on timing, decide if we need to find a more efficient tool.
@donkirkby donkirkby added this to the 7.13 milestone Feb 7, 2020
@donkirkby
Copy link
Member Author

Based on trimming primers in #552, we found that primer dimers show up like this:

[start of LEFT_PRIMER][overlap][reversed start of RIGHT_PRIMER][reversed read 2 ADAPTER][garbage]

Trimming the adapters drops the last two sections. Trimming primers drops the first two sections, because it matches the left primer. That leaves you with a small section like this:

[reversed start of RIGHT_PRIMER]

We can check all reads that are shorter than the longest primer to see if they match the reversed start of one of the right primers, then just completely trim that read. It's probably worth caching the matches we find, since they're usually repeated a lot. It might also be worth caching the ones that don't match.

Can cutadapt be used somehow for this step? Maybe an X on both ends?

@donkirkby
Copy link
Member Author

donkirkby commented Jun 4, 2020

Here are some performance measurements, updated from last time. Time in minutes for cutadapt and other trimming steps:

sample read count censor adapters left primer right primer primer dimer skipped censor
S8CT25MIXATOH-NT-Unknown_S29 108,287 0 1 1 0 0
P90786A-HCV_S86 475,577 0 4 4 1 2
S3CT27-IP-Unknown_S8 482,083 0 4 4 1 2
HIV3428H2-E21-HIV_S39 544,536 1 4 4 1 2
POS1TO1K-IP-Unknown_S2 1,632,059 2 0 3 3 1 1
POS1TO1KMIX-IP-Unknown_S4 1,751,820 12 2 12 12 4 6
POS1TO20K-IP-Unknown_S1 2,228,806 10 1 10 10 4 5
POS1TO20KMIX-IP-Unknown_S3 4,376,862 17 3 16 16 6 8

@donkirkby
Copy link
Member Author

donkirkby commented Jun 22, 2020

Here are the results when retrying some of the problem samples listed above:

  1. Slow samples 90542B-RELOAD-HCV_S83 from 06 Nov 2019 and 90611A-HCV_S136 from 22 Nov 2019 - the 22 Nov 2019 sample finished in 16 hours instead of 45, but the results still looked terrible. It also failed on the first try (see the error below). Rerunning with the primer trimming disabled took 29 hours. (Better hardware than the 45 hour run.)
  2. Repeated primer dimers in 90612A-HCV_S146 from 22 Nov 2019 and 73051ANS5A1-HCV-NS5a_S89, 73051ANS5A2-HCV-NS5a_S101, 73060ANS5A-HCV-NS5a_S77, and 73076B-HCV_S27 from 15 Jul 2016 - the 15 Jul 2016 samples didn't improve much, and the repeated primers seem to be in the individual reads. Read @M01841:246:000000000-ARF87:1:2112:10748:5978 had eight copies of a primer back to back. Very strange.
  3. Random primer samples named "RP" in the 13 Jul 2016 and 28 Jul 2016 runs - the 28 Jul 2016 run went from 12 days down to 17 hours. There were still tons of contigs, though. 5 out of 8 samples assembled 50 contigs.
  4. Random primer samples didn't assemble at all: 26 Oct 2015 - D62201-HCV_S3, 02 Nov 2015 - D62218-HCV_S4, and 06 Nov 2015 - D62226-HCV_S4 - 06 Nov 2015 sample still doesn't assemble at all.
  5. Spikes in coverage: NCOV7-IL-Unknown_S34 from the 20 Apr 2020 run - much improved, but some spikes remain, usually with too many read errors to be trimmed.

Error message for 90611A-HCV_S136 from 22 Nov 2019:

2020-06-17 19:10:13,064[WARNING]micall.core.denovo.denovo(): iva failed to assemble.                                                                                                                       
Traceback (most recent call last):                                                                   
  File "/home/don/git/MiCall/micall/core/denovo.py", line 200, in denovo                                                                                                                                   
    run([IVA, '--fr', joined_path, '-t', '2', iva_out_path],                                                                                                                                               
  File "/usr/lib/python3.8/subprocess.py", line 512, in run                                                                                                                                                
    raise CalledProcessError(retcode, process.args,                                                                                                                                                        
subprocess.CalledProcessError: Command '['iva', '--fr', '/home/don/data/RAW_DATA/MiSeq/runs/191122_M04401_0155_000000000-CLKG8/micall-denovo-results/scratch/90611A-HCV_S136/assembly_7gf_3zfx/joined.fastq
', '-t', '2', '/home/don/data/RAW_DATA/MiSeq/runs/191122_M04401_0155_000000000-CLKG8/micall-denovo-results/scratch/90611A-HCV_S136/assembly_7gf_3zfx/iva_out']' returned non-zero exit status 1.
2020-06-17 19:10:13,065[WARNING]micall.core.denovo.denovo(): Traceback (most recent call last):                                                                                                            
  File "/home/don/git/MiCall/venv_micall/bin/iva", line 286, in <module>                                                                                                                                   
    assembly.read_pair_extend(reads_prefix, 'iteration')                                                                                                                                                   
  File "/home/don/git/MiCall/venv_micall/lib/python3.8/site-packages/iva/assembly.py", line 427, in read_pair_extend
    self._merge_overlapping_contigs(list(self.contigs.keys()))                                                                                                                                             
  File "/home/don/git/MiCall/venv_micall/lib/python3.8/site-packages/iva/assembly.py", line 515, in _merge_overlapping_contigs
    simple_path = assembly_graph.find_simple_path(connected_component)                                                                                                                                     
  File "/home/don/git/MiCall/venv_micall/lib/python3.8/site-packages/iva/graph.py", line 98, in find_simple_path
    assert len(degree_one_nodes) == 2                                                                
AssertionError                                    

So far, I've been unable to reproduce the error.

@donkirkby
Copy link
Member Author

This seems to be a general improvement, and particularly on samples with primer dimer spikes, so I'm closing the issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant