Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MAPQ scores #25

Open
ksahlin opened this issue Jun 1, 2022 · 8 comments
Open

MAPQ scores #25

ksahlin opened this issue Jun 1, 2022 · 8 comments
Labels
feature New feature or request high priority Extra attention is needed

Comments

@ksahlin
Copy link
Owner

ksahlin commented Jun 1, 2022

Note to developer:

It is not clear if strobealign has optimal mapq scores for variant calling. For example, having an accurate MAPQ score is crucial for SNP calling. See this tweet thread and this subthread.

For info, a relevant paper about this is found here.

@marcelm
Copy link
Collaborator

marcelm commented Sep 22, 2022

Link to mentioned paper: A tandem simulation framework for predicting mapping quality

The current MAPQ computation function in StrobeAlign uses the formula from minimap2, see Section 2.3.1 in the paper.

@PaulPyl
Copy link
Collaborator

PaulPyl commented Sep 23, 2022

I don't see any section numbers in the linked paper, both links above point to the same one: A tandem simulation framework for predicting mapping quality

@marcelm
Copy link
Collaborator

marcelm commented Sep 23, 2022

I don't see any section numbers in the linked paper

Sorry, I shouldn’t comment so late at night – I have fixed the link now.

@marcelm
Copy link
Collaborator

marcelm commented Sep 23, 2022

@ksahlin sent me this comment by e-mail and asked me to post it here:

Good catch with the min_matches bug! Since our “hits” (pairs of linked seeds) are different to minimap2’s minimizer hits, me should probably not use this formula in the future. The heuristic min(M/10,1) is probably derived by Heng from testing. We will not have the same number of matches for our seeds and long reads, therefore the magic constant 10 may be off.

Just mentioning this for caution. Again, mimicking what BWA-MEM is doing will probably be better (in case possible). Although I heard that BWA-MEM always tries a second mapping location to compute MAPQ(?). This is probably something we want to avoid for all reads, but use whenever we still have to align twice due to multiple mappings. Or in case WFA alignment is so much faster than SSW that we can “afford it”.

(min_matches bug is fixed in #84).

@marcelm
Copy link
Collaborator

marcelm commented Sep 23, 2022

Copying @PaulPyl’s comment from #79:

I didn't have a lot of time this week to check it more clearly but I was wondering about the current implementation of MAPQ as // MAPQ = 40(1−s2/s1) ·min{1,|M|/10} · log s1 in the get_MAPQ function.

Where did this definition arise? As I understand it we would want to put our best guess for the probability that the alignment is wrong in the MAPQ score. So here we take 1 minus the ratio of best to second-best alignment and scale that to 40, i.e. if the alignments are equally good then we end up with MAPQ 0. Since 0*.

Now here is a fun question, if I have exactly two alignments and they are equally good and not discarded because of bad alignment score, would I assume the probability of a given alignment being wrong as 1, implying a MAPQ of 0, or would I assume the probability is 1/<#equally good alignments> and in this case 0.5 with a resulting MAPQ of around 3?

It seems like there is a bit of an issue with this, in the sense that there are no 'correct' MAPQ scores we can benchmark against, since that really depends on the algorithm and our choices.

For reference: -10*log10(c(0.0001,0.5,0.9999,1.0)) 40 3.01029 0.00043 0

@marcelm
Copy link
Collaborator

marcelm commented Sep 23, 2022

I temporarily changed get_MAPQ to not cap the mapping qualities at 60. With this, we can see what the minimap2 formula outputs. The median score is 2013, so it’s no surprise that nearly all MAPQs are set to 60. I agree with @ksahlin that the minimap2 heuristic just doesn’t work the way in StrobeAlign as it does in minimap2.

@marcelm
Copy link
Collaborator

marcelm commented Sep 24, 2022

Oh, made a mistake when computing the median, it’s actually ~1780. And with the fix in #84 merged, this went down to 378.

@ksahlin
Copy link
Owner Author

ksahlin commented Feb 22, 2023

After getting more evidence that MAPQ scores are not computed accurately from this benchmark I looked at how I describe the MAPQ score calculation in the paper. It is, in retrospect, obvious that we should not simply compute the MAPQ score from the seed-score ratio of the first and second best hit. We should compute the score from the ratio of the alignment scores whenever possible.

Guiding anecdata: IIRC, strobealign's variant calling sensitivity improved significantly without similar decrease in precision when simply putting all non-zero mapq scores values to highest possible confidence (60?). While this is not a good idea, it could suggest that the current mapq values may be too conservative.

marcelm added a commit that referenced this issue Aug 20, 2023
First, there was a bug in MAPQ computation due to initializing
best_score to -1000. In the first iteration, min_mapq_diff is updated thus:

min_mapq_diff = std::max(0, alignment.score - best_score);

This is just alignment.score + 1000, so most of the time, min_mapq_diff
ends up being something like 1600 or so for typical read lengths. (This is
clamped to 60 in the output, but still.)

Here, we keep track of the highest and second-highest alignment score in
best_score and second_best_score. Then, mapq is computed such that it is
* 60 for second_best_score=0 and
* 0 for second_best_score=best_score,
where anything in between is interpolated.

See #25
marcelm added a commit that referenced this issue Aug 21, 2023
First, there was a bug in MAPQ computation due to initializing
best_score to -1000. In the first iteration, min_mapq_diff is updated thus:

min_mapq_diff = std::max(0, alignment.score - best_score);

This is just alignment.score + 1000, so most of the time, min_mapq_diff
ends up being something like 1600 or so for typical read lengths. (This is
clamped to 60 in the output, but still.)

Here, we keep track of the highest and second-highest alignment score in
best_score and second_best_score. Then, mapq is computed such that it is
* 60 for second_best_score=0 and
* 0 for second_best_score=best_score,
where anything in between is interpolated.

See #25
marcelm added a commit that referenced this issue Aug 21, 2023
First, there was a bug in MAPQ computation due to initializing
best_score to -1000. In the first iteration, min_mapq_diff is updated thus:

min_mapq_diff = std::max(0, alignment.score - best_score);

This is just alignment.score + 1000, so most of the time, min_mapq_diff
ends up being something like 1600 or so for typical read lengths. (This is
clamped to 60 in the output, but still.)

Here, we keep track of the highest and second-highest alignment score in
best_score and second_best_score. Then, mapq is computed such that it is
* 60 for second_best_score=0 and
* 0 for second_best_score=best_score,
where anything in between is interpolated.

See #25
marcelm added a commit that referenced this issue Aug 23, 2023
First, there was a bug in MAPQ computation due to initializing
best_score to -1000. In the first iteration, min_mapq_diff is updated thus:

min_mapq_diff = std::max(0, alignment.score - best_score);

This is just alignment.score + 1000, so most of the time, min_mapq_diff
ends up being something like 1600 or so for typical read lengths. (This is
clamped to 60 in the output, but still.)

Here, we keep track of the highest and second-highest alignment score in
best_score and second_best_score. Then, mapq is computed such that it is
* 60 for second_best_score=0 and
* 0 for second_best_score=best_score,
where anything in between is interpolated.

See #25
marcelm added a commit that referenced this issue Nov 13, 2023
First, there was a bug in MAPQ computation due to initializing
best_score to -1000. In the first iteration, min_mapq_diff is updated thus:

min_mapq_diff = std::max(0, alignment.score - best_score);

This is just alignment.score + 1000, so most of the time, min_mapq_diff
ends up being something like 1600 or so for typical read lengths. (This is
clamped to 60 in the output, but still.)

Here, we keep track of the highest and second-highest alignment score in
best_score and second_best_score. Then, mapq is computed such that it is
* 60 for second_best_score=0 and
* 0 for second_best_score=best_score,
where anything in between is interpolated.

See #25
marcelm added a commit that referenced this issue Nov 15, 2023
First, there was a bug in MAPQ computation due to initializing
best_score to -1000. In the first iteration, min_mapq_diff is updated thus:

min_mapq_diff = std::max(0, alignment.score - best_score);

This is just alignment.score + 1000, so most of the time, min_mapq_diff
ends up being something like 1600 or so for typical read lengths. (This is
clamped to 60 in the output, but still.)

Here, we keep track of the highest and second-highest alignment score in
best_score and second_best_score. Then, mapq is computed such that it is
* 60 for second_best_score=0 and
* 0 for second_best_score=best_score,
where anything in between is interpolated.

See #25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature New feature or request high priority Extra attention is needed
Projects
None yet
Development

No branches or pull requests

3 participants