Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Valid V3 reads ending up in the unmapped files #228

Closed
6 tasks done
ArtPoon opened this issue Jul 15, 2015 · 15 comments
Closed
6 tasks done

Valid V3 reads ending up in the unmapped files #228

ArtPoon opened this issue Jul 15, 2015 · 15 comments
Assignees
Labels

Comments

@ArtPoon
Copy link
Contributor

ArtPoon commented Jul 15, 2015

See sample 63249A-V3.3 from run 150529, it has too many unmapped reads. When it was processed in production under Micall 6.8, it had 136199 unmapped reads. When Art processed it on his workstation, it only had 43125 unmapped reads. Try to see what the difference is, and fix the problem.

Updated Summary

There were several bugs in the G2P clipping, so we fixed those. The Micall 7.0 pipeline now calls 19% X4 compared to the 454 calling 23% in this sample.

  • compare 454 data with Micall
  • compare Ruby G2P with Python, and fix a bunch of bugs
  • Run the whole 150529 run with Micall 7.0
  • Get Conan to compare 454 calls with new Micall results.
  • Remove the is_ruby_compatible flag from sam_g2p.py.
  • Process a bunch of other runs and get Conan to compare results.
    • 150130
    • 150311
    • 150316
    • 150415
    • 150612
@ArtPoon ArtPoon added the bug label Jul 15, 2015
@ArtPoon ArtPoon self-assigned this Jul 15, 2015
@ArtPoon ArtPoon added this to the 7.0 - standalone milestone Jul 15, 2015
@ArtPoon
Copy link
Contributor Author

ArtPoon commented Jul 15, 2015

I was unable to reproduce this when re-running this sample locally.
The original remap_counts file (pipeline release 6.8) read:

sample_name,type,count,filtered_count
63249A-V3-3-V3LOOP_S93.censored1.fastq,raw,697420,
63249A-V3-3-V3LOOP_S93.censored1.fastq,prelim HLA-B-seed,2,2
63249A-V3-3-V3LOOP_S93.censored1.fastq,prelim HCV-6V-seed,1,0
63249A-V3-3-V3LOOP_S93.censored1.fastq,prelim HIV1B-env-seed,12149,12129
63249A-V3-3-V3LOOP_S93.censored1.fastq,prelim HCV-1A-H77-seed,2,2
63249A-V3-3-V3LOOP_S93.censored1.fastq,remap HIV1B-env-seed,561221,
63249A-V3-3-V3LOOP_S93.censored1.fastq,unmapped,136199,

Using master branch on my workstation (commit 67566cd) I obtained the following:

type,count,filtered_count
raw,697420,
prelim HLA-B-seed,2,2
prelim HCV-6V-seed,1,0
prelim HIV1B-env-seed,12149,12129
prelim HCV-1A-H77-seed,2,2
remap HIV1B-env-seed,654295,
unmapped,43125,

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Jul 15, 2015

I just checked out release 6.8.1 on my workstation (commit aa76593) and still could not reproduce this. Hopefully it is not stochastic. I will repeat to check.

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Jul 15, 2015

On Don's suggestion, copying the required sample files to a temp folder on cluster and running 6.8 from there. It may be a samtools or bowtie2 versioning issue.

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Jul 15, 2015

Reproduced issue on cluster. Either samtools or bowtie2 is at fault.
Attempt re-run with different version of samtools version 0.1.19. If this works, then reprocess all MOTIVATE V3 samples with pipeline release 6.8.1 but patched to call this version of samtools.

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Jul 15, 2015

I ran the development pipeline with remap.py using samtools-0.1.19 but the same number of reads remained unmapped. Now trying with samtools-1.1

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Jul 15, 2015

samtools-1.1 failed to remedy this issue. Now checking if it is a bowtie2 versioning problem.

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Jul 16, 2015

Running bowtie2-2.2.3 (same as my workstation) did not resolve the problem either: 136199 reads unmapped.

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Jul 16, 2015

Calling remap.py directly on the command line rather than through run_processor.py gets the correct result. This does not appear to be due to the latter running processes on a compute node -- if I call remap.py via bpsh on a compute node then I still get the correct result.

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Jul 23, 2015

I just pulled the master branch in the development folder and now calling remap.py directly gives me 136199 unmapped reads. I have tried adjusting the maximum pileup size and this has no effect.

@donkirkby
Copy link
Member

I just reproduced the problem on my workstation. I ran the code from the v6.8.1 tag with bowtie2 2.2.1 and samtools 0.1.18. I saw 136199 unmapped reads in sample 63249A-V3-3.

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Jul 24, 2015

Funny - I've reverted back to 6.8.1 in the development folder and now calling remap.py directly yields 136199 unmapped reads.

@donkirkby donkirkby assigned donkirkby and unassigned ArtPoon Aug 20, 2015
@donkirkby
Copy link
Member

I ran it once on my workstation with standalone Micall 7.0 and bowtie2 2.2.1, and only saw 2800 unmapped reads, so that's a good sign. However, when I ran it with the data that had been censored for high error rates in the phiX data, I saw 108352 unmapped reads.

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Aug 20, 2015

Going back to the pipeline 6.8 results: the X4 variant of interest appears 5185 times in the file 63249A-V3-3-V3LOOP_S93.unmapped1.fastq (exact match), which contains 18905 reads in total (27%). It appears in 104107 reads in the raw FASTQ R1 file (348710 reads total, so 30%).

@donkirkby
Copy link
Member

I ran the 63249A sample with Micall 7.0, and looked at all the variants that were successfully called in the g2p.csv file. Both with the censored data and the uncensored data, I counted 73% called as X4 (fpr score ≤ 3.5).

  • With the uncensored data, 121838 reads were called X4 out of 166530 successful calls.
  • With the censored data, 101903 reads were called X4 out of 139697 successful calls.

It looks like the censoring is affecting the X4 reads at the same rate as the other reads, so the only remaining question is why the Micall results disagree with the 454 results. According to Conan, the 454 called 23% of the reads X4.

When I calculated the same totals for the g2p.csv file from Micall 6.8 using censored data, I found 534 reads called X4 out of 99612 successful calls. That's less than 1%.

@donkirkby
Copy link
Member

After a few bug fixes, I reran the 63249A sample with Micall 7.0 as above. I counted the following:

  • 19% called as X4 in the uncensored data (35215 out of 182600 successful calls)
  • 19% called as X4 in the censored data (24259 out of 128574 successful calls)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants