Base level alignment #26

Open
ksahlin opened this issue Jun 1, 2022 · 3 comments
Labels
optimization Further information is requested

Comments

@ksahlin
Owner

ksahlin commented Jun 1, 2022

Note to developer:

The extension step (nucleotide-level alignment) is the bottleneck in strobealign. There are three different ways to reduce this:

  1. Direction 1 (change the alignment module):
    1. Change to base-level alignment with WFA (WFA publ), as is done in Accelalign.
  2. Direction 2 (speed up the current module, SSW):
    1. By using 8-bit slots in the alignment matrix?
    2. By not computing the alignment twice (does SSW do this?)
  3. Direction 3 (partitioned SSW):
    1. Finish implementing partitioned SW (split the alignment into several small Hamming or SW alignments) if seeds are in the middle.
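
To make Direction 3 concrete, here is a minimal Python sketch of the partitioned idea. It is not strobealign's implementation. It assumes the seeds are available as exact-match anchors `(read_pos, ref_pos, length)`; the function names, the anchor format, and the `max_mismatch_rate` threshold are all hypothetical. Segments between anchors that have equal length and few mismatches are scored with a Hamming distance, and everything else falls back to a small dynamic program (a plain edit distance here, standing in for SW).

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a: str, b: str) -> int:
    """Plain O(len(a) * len(b)) dynamic program, standing in for the SW fallback."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def partitioned_cost(read, ref, anchors, max_mismatch_rate=0.1):
    """Score the segments between exact-match anchors one at a time.

    anchors: non-overlapping (read_pos, ref_pos, length) tuples sorted by read_pos
             (hypothetical format, assumed for this sketch).
    Returns (total_cost, number_of_segments_that_needed_the_dp_fallback).
    """
    total, dp_calls = 0, 0
    read_pos, ref_pos = 0, 0
    # A zero-length sentinel anchor at the end closes the final segment.
    for r_start, g_start, length in list(anchors) + [(len(read), len(ref), 0)]:
        read_seg = read[read_pos:r_start]
        ref_seg = ref[ref_pos:g_start]
        use_dp = len(read_seg) != len(ref_seg)
        if not use_dp:
            mismatches = hamming(read_seg, ref_seg)
            # Too many mismatches hints at an indel, so fall back to the DP.
            use_dp = mismatches > max_mismatch_rate * len(read_seg)
        if use_dp:
            total += edit_distance(read_seg, ref_seg)
            dp_calls += 1
        else:
            total += mismatches
        read_pos, ref_pos = r_start + length, g_start + length
    return total, dp_calls
```

Accepting the Hamming score gives an upper bound on the segment's edit distance rather than the optimum, which is the trade-off for skipping the DP on that segment.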
@ksahlin ksahlin added the optimization Further information is requested label Jun 1, 2022
@ksahlin
Owner Author

ksahlin commented Sep 9, 2022

I think WFA is worth exploring, as we currently rely on SSW, which has its issues as a local aligner, as mentioned in issue #54. The method seems to have great performance; see Table 2 in the WFA paper for timings compared to e.g. SeqAn and ksw2. Furthermore, the implementation has matured with WFA2, which provides different penalty models (including dual gap cost penalties!), traceback CIGAR, etc.
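
For reference, a small Python sketch of what a dual (two-piece) affine gap cost means: a gap is charged under two affine models and the cheaper one is used, so short gaps and long gaps can be penalized differently. The function names and the parameter values below are illustrative assumptions, not settings taken from WFA2 or strobealign, and the convention that a gap of length `l` costs `open + l * extend` is assumed here (some tools count the first base differently).

```python
def affine_gap(length: int, gap_open: int, gap_extend: int) -> int:
    """Cost of one gap under a single affine model (assumed convention: open + length * extend)."""
    return gap_open + length * gap_extend


def dual_affine_gap(length: int, open1: int, extend1: int, open2: int, extend2: int) -> int:
    """Cost of one gap under a two-piece affine model: the cheaper of the two pieces."""
    return min(affine_gap(length, open1, extend1), affine_gap(length, open2, extend2))


# Illustrative parameters only: piece 1 is cheap to open, piece 2 is cheap to extend.
short_gap = dual_affine_gap(5, 4, 2, 24, 1)   # piece 1 wins: 4 + 5 * 2
long_gap = dual_affine_gap(30, 4, 2, 24, 1)   # piece 2 wins: 24 + 30 * 1
```

With these example values the two pieces cross over at a gap length of 20, so longer indels are penalized less steeply than a single affine model would.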

Also, in strobealign the extension step takes over 50% of the runtime for many biological datasets; see the 'aln' field for BIO150 and BIO250 in the attached figure for extension with SSW.

image

@ekg

ekg commented Feb 8, 2023

This will also make it easy to scale to longer sequences, provided your seed chaining can do it. BiWFA uses space on the order of the divergence and is consequently very cache-friendly and actually fast even for ~500bp sequences.
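
To illustrate why the memory can scale with the divergence rather than with the sequence lengths, here is a score-only Python sketch of the wavefront idea for unit-cost edit distance. It is a simplification and an assumption-laden illustration: it is neither the gap-affine WFA nor BiWFA (which, as I understand it, additionally runs wavefronts from both ends to recover the alignment itself in similar space), and the function name is hypothetical. Only the previous wavefront is kept, and at score `s` it holds at most `2s + 1` diagonals.

```python
def wavefront_edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance via furthest-reaching points per diagonal.

    Diagonal k means j - i == k, where i indexes a and j indexes b.
    Only the previous wavefront is stored (at most 2*s + 1 entries at score s).
    """
    n, m = len(a), len(b)
    target_diag = m - n

    def extend(i: int, k: int) -> int:
        # Slide along diagonal k for free while characters match.
        j = i + k
        while i < n and j < m and a[i] == b[j]:
            i += 1
            j += 1
        return i

    wavefront = {0: extend(0, 0)}  # diagonal -> furthest offset i in a
    score = 0
    while wavefront.get(target_diag, -1) < n:
        score += 1
        nxt = {}
        for k in range(-score, score + 1):
            if k < -n or k > m:
                continue
            candidates = []
            if k in wavefront:        # mismatch: advance i and j on the same diagonal
                candidates.append(wavefront[k] + 1)
            if k - 1 in wavefront:    # extra character in b: j advances, i stays
                candidates.append(wavefront[k - 1])
            if k + 1 in wavefront:    # extra character in a: i advances, j stays
                candidates.append(wavefront[k + 1] + 1)
            # Drop moves that would step outside either sequence.
            valid = [i for i in candidates if i <= n and i + k <= m]
            if valid:
                nxt[k] = extend(max(valid), k)
        wavefront = nxt
    return score
```

The work per score step is proportional to the number of live diagonals plus the matching characters slid over, which is why similar sequences finish quickly.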

@ksahlin
Owner Author

ksahlin commented Feb 8, 2023

That's a good point. As mentioned in #24, extending to alignments of long reads is one objective. Strobealign's seeds should be very suitable for long reads, so BiWFA may be the way to go from the start. We are currently exploring WFA (#229), but we have yet to find a way to use the library efficiently compared to SSW.
