Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible issue with mate CIGAR string #18

Open
TDDB-limagrain opened this issue Feb 24, 2022 · 10 comments
Open

Possible issue with mate CIGAR string #18

TDDB-limagrain opened this issue Feb 24, 2022 · 10 comments

Comments

@TDDB-limagrain
Copy link

TDDB-limagrain commented Feb 24, 2022

Hi @ksahlin !
sorry for bothering you again!

I am trying to run GATK haplotypecaller on reads mapped with Strobealign. The Haplotypecaller ends with an error:
java.lang.IllegalArgumentException: Sequence [VC HC @ Chr02:1-49670989 Q. of type=SYMBOLIC alleles=[C*, <NON_REF>] attr={END=49670989} GT=[[CB10 C*/C* GQ 0 DP 0 PL 0,0,0 {MIN_DP=0}]] filters= added out of order currentReferenceIndex: 0, referenceIndex:6
Besides I aligned the same reads with bwa mem and successfully ran the Haplotycaller. Therefore, I wondered whether something could have happened with the alignment formats.

I ran the Picard tool ValidateSamFile, giving me:

Strobealign:

ERROR:MATE_CIGAR_STRING_INVALID_PRESENCE        341946
ERROR:MISMATCH_MATE_CIGAR_STRING        68204774

Bwa mem:

ERROR:MATE_CIGAR_STRING_INVALID_PRESENCE        249950

I still need to dive into the details, especially to look for this kind of error: MISMATCH_MATE_CIGAR_STRING.
That might also be a GATK-related issue... so I will keep you in touch with additional information.

(I finally succeeded in running Strelka on the same reads, thanks for adjusting alignment formatting! compared to array-based genotyping data, the combination strobealign+strelka performed very similarly to bwa mem + strelka, but of course at much higher speed for the mapping!).

@TDDB-limagrain
Copy link
Author

here is an example of a MISMATCH_MATE_CIGAR_STRING:

output given by Picard ValidateSamFile:

ERROR::MISMATCH_MATE_CIGAR_STRING:Record 46, Read name A00604:215:HNL2GDSXY:2:2272:26576:36996, Mate CIGAR string does not match CIGAR string of mate
A00604:215:HNL2GDSXY:2:2272:26576:36996 99      Chr1    6006    0       147M4S  =       6010    147     CCCTATAATAGCTAAATATGCAAAAGCCACTTTTGTACAAGTAGATACACTTTTTTTGAGTAGACATAAGTTCTGTTTTGGGACCTAGAGAACGAATTTTCACGAGATTTTTTGCAGGGATCCTCATAAAAATGGTACAACATTGCCAGAT    FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFF:F    AS:i:294        MQ:i:0  MC:Z:143=       ms:i:5107       MD:Z:147        NM:i:0     RG:Z:CB10
A00604:215:HNL2GDSXY:2:2272:26576:36996 147     Chr1    6010    0       143M    =       6006    -147    ATAATAGCTAAATATGCAAAAGCCACTTTTGTACAAGTAGATACACTTTTTTTGAGTAGACATAAGTTCTGTTTTGGGACCTAGAGAACGAATTTTCACGAGATTTTTTGCAGGGATCCTCATAAAAATGGTACAACATTGCC    FF,FFF,FFF,FFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF    AS:i:286        MQ:i:0  MC:Z:147=4S     ms:i:5551       MD:Z:143        NM:i:0  RG:Z:CB10

@ksahlin
Copy link
Owner

ksahlin commented Feb 24, 2022

Hi @TDDB-limagrain,

As for the MISMATCH_MATE_CIGAR_STRING, perhaps it is because the cigar reports 147M4S while the mate cigar string field reports MC:Z:147=4S (i.e., M in one and =/X in the other)? I think this is because you have converted strobealigns =/X cigars to M after it has added the field MC:Z:. Perhaps one can perform the translation of =/X after the cigar is modified, so that you would get MC:Z:147M4S? (if you need M cigars for downstream callers?)

It would be interesting to see what an example of ERROR:MATE_CIGAR_STRING_INVALID_PRESENCE is.

As for the error in haplotype caller, I do not know. Do you think you could post an issue with the developers? (even though it is likely some misreporting in strobealign). Maybe they know what will cause this error. Looks like there is an issue that has to do with sorting of some kind.

@ksahlin
Copy link
Owner

ksahlin commented Feb 25, 2022

Which program do you use to add on the fields MC:Z: ms:i: etc? Perhaps the haplotype caller error is caused by some invalid tag added by this program, much like the added mate cigar? I may add these fields within strobealign instead in the next version to remove such downstream step.

@TDDB-limagrain
Copy link
Author

TDDB-limagrain commented Feb 25, 2022

I am piping those following commands:

strobealign -t 4 $reffasta $forward $reverse | \
        samtools fixmate -u -m - - | \
        samtools addreplacerg -u -r "@RG\tID:${line}\tSM:${line}\tLB:Solution\tPL:illumina\tPU:none" - - |
        samtools sort  -u -T bam/$line - | \
        samtools markdup -@2 --reference $reffasta --write-index --output-fmt cram,version=3.0 - bam/$line.sorted.cram

The new fields probably came from the mark duplicate step http://www.htslib.org/doc/samtools-markdup.html.

These are the same commands I piped to bwa-mem and here are the same reads:

A00604:215:HNL2GDSXY:2:2272:26576:36996 163     Chr11   56372617        0       143M    =       56372617        147     GGCAATGTTGTACCATTTTTATGAGGATCCCTGCAAAAAATCTCGTGAAAATTCGTTCTCTAGGTCCCAAAACAGAACTTATGTCTACTCAAAAAAAGTGTATCTACTTGTACAAAAGTGGCTTTTGCATATTTAGCTATTAT      FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFF,FFF,FFF,FF AS:i:143    XS:i:143 MQ:i:0  MC:Z:4S147M     ms:i:5551       MD:Z:143        NM:i:0  RG:Z:CB10
A00604:215:HNL2GDSXY:2:2272:26576:36996 83      Chr11   56372617        0       4S147M  =       56372617        -147    ATCTGGCAATGTTGTACCATTTTTATGAGGATCCCTGCAAAAAATCTCGTGAAAATTCGTTCTCTAGGTCCCAAAACAGAACTTATGTCTACTCAAAAAAAGTGTATCTACTTGTACAAAAGTGGCTTTTGCATATTTAGCTATTATAGGG      F:FFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF      AS:i:147        XS:i:147        MQ:i:0  MC:Z:143M       ms:i:5107       MD:Z:147        NM:i:0  RG:Z:CB10

That looks quite weird they are located on different chromosome! Bwa mem mapped it on Chr11 while strobealign mapped it on Chr1.

I had a look at this issue on the GATK github broadinstitute/gatk#7199 (comment), but no response from the developer so far.

@ksahlin
Copy link
Owner

ksahlin commented Feb 25, 2022

So as for the MISMATCH_MATE_CIGAR_STRING I believe this is simply the =/X cigars conversion to M - so no problem there (unless some caller gets hung up on this warning). Maybe I'll implement to support output of M in addition to X/= cigars.

As for the map location: Both mates within the read pair are mapped with the same 'fit' (m1 perfect match and m2 perfect except 4 softclipps). They even have the same insert distance (template length 147). This is likely because of a perfect repeat where the location is assigned "at random", so I also don't see a potential issue here.

@TDDB-limagrain
Copy link
Author

regarding the ERROR::MATE_CIGAR_STRING_INVALID_PRESENCE , Picard ValidateSamFile returns for example:

ERROR::MATE_CIGAR_STRING_INVALID_PRESENCE:Record 2259, Read name A00604:215:HNL2GDSXY:2:1373:14136:8938, Mate CIGAR String (MC Attribute) present for a read whose mate is unmapped

with the following details from the strobealign-related bam file:

A00604:215:HNL2GDSXY:2:1373:14136:8938  1097    Chr1    7107    0       151M    =       7107    0       CACCAACCATGTCAAGATCAATTCCTTTTGATCCCAAACTTCTCCCACATCACTTTGAAGGCATTTGTGAGGATTGAGTCCAGCCACTTTTCCAAAACCATGATTTTCGTTTTTCACCAAAATCACCACCAAGCCACAAAATCAAGCCAAA      FFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,:FFF:FFFFFFFFFFFFFFFFFFFFFFFFFFF:F:FFFFFFFFFFFF:FF AS:i:302    MC:Z:*   ms:i:5587       MD:Z:151        NM:i:0  RG:Z:CB10
A00604:215:HNL2GDSXY:2:1373:14136:8938  133     Chr1    7107    0       *       =       7107    0       GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG      FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF MQ:i:0  MC:Z:151=    ms:i:5478       RG:Z:CB10

@TDDB-limagrain
Copy link
Author

Ok thanks for the anwser!

@ksahlin
Copy link
Owner

ksahlin commented Feb 25, 2022

Ah, I see. This is also an error caused by the program adding the attributes to the bam file. For some reason, picard does not like that the cigar of the mateMC:Z:151= is added as a field/attribute of the unmapped read. No idea why this is undesired.

@TDDB-limagrain
Copy link
Author

ok, seems that samtools command I add are introducing some confusing fields!
I will try to troubleshoot the error I had with the Haplotypecaller because it seems to more a GATK related issue. Thanks!

@marcelm
Copy link
Collaborator

marcelm commented Jun 7, 2023

Hi, I’m going through some open strobealign issues and was wondering whether this one can be closed.

Note that in version 0.10.0, strobealign has switched to outputting M CIGAR operations by default instead of =/X, which may help here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants