MAPQ scores #25

ksahlin · 2022-06-01T10:37:16Z

Note to developer:

It is not clear if strobealign has optimal mapq scores for variant calling. For example, having an accurate MAPQ score is crucial for SNP calling. See this tweet thread and this subthread.

For info, a relevant paper about this is found here.

marcelm · 2022-09-22T22:47:45Z

Link to mentioned paper: A tandem simulation framework for predicting mapping quality

The current MAPQ computation function in StrobeAlign uses the formula from minimap2, see Section 2.3.1 in the paper.

PaulPyl · 2022-09-23T06:39:46Z

I don't see any section numbers in the linked paper, both links above point to the same one: A tandem simulation framework for predicting mapping quality

marcelm · 2022-09-23T09:26:43Z

I don't see any section numbers in the linked paper

Sorry, I shouldn’t comment so late at night – I have fixed the link now.

marcelm · 2022-09-23T10:21:31Z

@ksahlin sent me this comment by e-mail and asked me to post it here:

Good catch with the min_matches bug! Since our “hits” (pairs of linked seeds) are different to minimap2’s minimizer hits, me should probably not use this formula in the future. The heuristic min(M/10,1) is probably derived by Heng from testing. We will not have the same number of matches for our seeds and long reads, therefore the magic constant 10 may be off.

Just mentioning this for caution. Again, mimicking what BWA-MEM is doing will probably be better (in case possible). Although I heard that BWA-MEM always tries a second mapping location to compute MAPQ(?). This is probably something we want to avoid for all reads, but use whenever we still have to align twice due to multiple mappings. Or in case WFA alignment is so much faster than SSW that we can “afford it”.

(min_matches bug is fixed in #84).

marcelm · 2022-09-23T10:28:18Z

Copying @PaulPyl’s comment from #79:

I didn't have a lot of time this week to check it more clearly but I was wondering about the current implementation of MAPQ as // MAPQ = 40(1−s2/s1) ·min{1,|M|/10} · log s1 in the get_MAPQ function.

Where did this definition arise? As I understand it we would want to put our best guess for the probability that the alignment is wrong in the MAPQ score. So here we take 1 minus the ratio of best to second-best alignment and scale that to 40, i.e. if the alignments are equally good then we end up with MAPQ 0. Since 0*.

Now here is a fun question, if I have exactly two alignments and they are equally good and not discarded because of bad alignment score, would I assume the probability of a given alignment being wrong as 1, implying a MAPQ of 0, or would I assume the probability is 1/<#equally good alignments> and in this case 0.5 with a resulting MAPQ of around 3?

It seems like there is a bit of an issue with this, in the sense that there are no 'correct' MAPQ scores we can benchmark against, since that really depends on the algorithm and our choices.

For reference: -10*log10(c(0.0001,0.5,0.9999,1.0)) 40 3.01029 0.00043 0

marcelm · 2022-09-23T10:32:57Z

I temporarily changed get_MAPQ to not cap the mapping qualities at 60. With this, we can see what the minimap2 formula outputs. The median score is 2013, so it’s no surprise that nearly all MAPQs are set to 60. I agree with @ksahlin that the minimap2 heuristic just doesn’t work the way in StrobeAlign as it does in minimap2.

marcelm · 2022-09-24T19:17:51Z

Oh, made a mistake when computing the median, it’s actually ~1780. And with the fix in #84 merged, this went down to 378.

ksahlin · 2023-02-22T20:34:55Z

After getting more evidence that MAPQ scores are not computed accurately from this benchmark I looked at how I describe the MAPQ score calculation in the paper. It is, in retrospect, obvious that we should not simply compute the MAPQ score from the seed-score ratio of the first and second best hit. We should compute the score from the ratio of the alignment scores whenever possible.

Guiding anecdata: IIRC, strobealign's variant calling sensitivity improved significantly without similar decrease in precision when simply putting all non-zero mapq scores values to highest possible confidence (60?). While this is not a good idea, it could suggest that the current mapq values may be too conservative.

First, there was a bug in MAPQ computation due to initializing best_score to -1000. In the first iteration, min_mapq_diff is updated thus: min_mapq_diff = std::max(0, alignment.score - best_score); This is just alignment.score + 1000, so most of the time, min_mapq_diff ends up being something like 1600 or so for typical read lengths. (This is clamped to 60 in the output, but still.) Here, we keep track of the highest and second-highest alignment score in best_score and second_best_score. Then, mapq is computed such that it is * 60 for second_best_score=0 and * 0 for second_best_score=best_score, where anything in between is interpolated. See #25

ksahlin added feature New feature or request high priority Extra attention is needed labels Jun 1, 2022

ksahlin mentioned this issue Sep 14, 2022

-N 1 outputs no extra alignment #66

Closed

ksahlin mentioned this issue Sep 22, 2022

Adressing Issue #75 with some initial refactoring, both functions now… #79

Closed

marcelm mentioned this issue Sep 24, 2022

Refactor mapq computation #84

Merged

marcelm mentioned this issue Jun 9, 2023

Reads get MAPQ=60 although they are multimappers #293

Closed

marcelm mentioned this issue Aug 21, 2023

Rework single-end MAPQ computation #328

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MAPQ scores #25

MAPQ scores #25

ksahlin commented Jun 1, 2022

marcelm commented Sep 22, 2022 •

edited

Loading

PaulPyl commented Sep 23, 2022

marcelm commented Sep 23, 2022

marcelm commented Sep 23, 2022

marcelm commented Sep 23, 2022

marcelm commented Sep 23, 2022

marcelm commented Sep 24, 2022

ksahlin commented Feb 22, 2023 •

edited

Loading

MAPQ scores #25

MAPQ scores #25

Comments

ksahlin commented Jun 1, 2022

marcelm commented Sep 22, 2022 • edited Loading

PaulPyl commented Sep 23, 2022

marcelm commented Sep 23, 2022

marcelm commented Sep 23, 2022

marcelm commented Sep 23, 2022

marcelm commented Sep 23, 2022

marcelm commented Sep 24, 2022

ksahlin commented Feb 22, 2023 • edited Loading

marcelm commented Sep 22, 2022 •

edited

Loading

ksahlin commented Feb 22, 2023 •

edited

Loading