Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make rescue_mate more efficient by returning position in has_shared_substring? #286

Open
ksahlin opened this issue Jun 1, 2023 · 1 comment

Comments

@ksahlin
Copy link
Owner

ksahlin commented Jun 1, 2023

The pairwise alignment done in rescue_mate cuts out a much larger segment of the reference compared to when we can use seeds. This makes the pairwise alignment much slower for those instances (anecdata). More precisely, we cut out a reference segment of length r/2 + mu + 5*sigma where mu and sigma are fragment size distribution parameters.

Before we rescue, we check whether the mate to be rescued share at least one submer of size 2k/3 with the reference segment using ref_seq.find(submer). Since this operation returns the location of the hit, the location of this hit can be used to cut out the reference appropriately.

Since we already compute the location of the seed, I believe my suggestion comes with close to no compute overhead. We would therefore strictly improve speed (through faster pairwise alignment).

To the potential downsides. The rescue functionality is likely invoked for:

  1. Repetitive regions (due to masked seeds)
  2. In regions containing larger indels and other types of SVs.
  3. Too many mutations/read error between read and reference.

The downsides with proposed approach is that:
(i) if we stop at the first seed hit (if many) in repeated regions, the first hit may not lead to the optimal alignment.
(ii) If the mate span an indel, we may be able to align over it with our cut out segment of length r/2 + mu + 5*sigma, but not with a shorter reference segment.

@ksahlin
Copy link
Owner Author

ksahlin commented Jun 1, 2023

Related, we could perhaps print timings for ssw alignment (normal) vs ssw alignmnt (rescue) form more fine grained logging. I just looked at the timings for aligning the BIO150 dataset (Illumina 150PE reads from GIAB) to human.

Some relevant lines below. It seems like the rescue alignment amounts to near half of the times we call SSW, and the extension alignment takes the majority of the time (compare to total mapping time below). I am not sure about the library parameters but I think its something like mu=300-350, sigma=30-50, r=150, meaning we cut out about a size 525-675 on the reference.

Total calls to ssw: 237776607
Calls to ksw (rescue mode): 105966142
Total time mapping: 3929.15 s.
Total time base level alignment (ssw): 2269.20 s.

(One could perhaps also enlarge the initial cut-out segment sent to has_shared_substring to find more segments that could be rescued over larger deletions.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant