GitHub - vbar/sefira: state of the art in tree comparison

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 66 Commits
bin		bin
lib		lib
scripts		scripts
sefira.test		sefira.test
testdata		testdata
.gitignore		.gitignore
INSTALL		INSTALL
LICENSE		LICENSE
Makefile.am		Makefile.am
README		README
configure.ac		configure.ac

Repository files navigation

sefira - state of the art in tree comparison

Finding the largest common subtree of two ordered labeled trees is a
problem with increasing applications (i.e. in XML processing and
computational biology), on which substantial progress has been made in
recent years. However, papers on tree comparison may not be available
without subscription to the appropriate journals and even when they
are, they generally don't contain executable code, so using the new
algorithms is non-trivial. Also, there isn't one "best" algorithm yet
- distance definitions differ, and even for the same edit distance,
time complexity can be measured on different tree features. The goal
of the sefira library is to provide well tested and reasonably
optimized implementations of some tree comparison algorithms, so that
they can be tried out in practical applications.

Scripting languages being impractical for fast implementations of
memory-intensive algorithms, sefira is a C++ (C would probably be
feasible, but the implementations depend heavily on C++ containers)
library with common classes plus executables - one per implemented
algorithm - wrapping the classes into command-line programs. The
common model of all algorithm implementations is the transformation of
two XML trees (sefira depends on libxml2) to their largest common
subtree (also XML).

When only the distance is required, it can be trivially computed from
tree sizes, although keeping the intermediate subtrees wastes quite a
lot of memory in that case - using a simplified algorithm may be worth
a rewrite. Implementations generally use unsigned short indices,
limiting the size of one input tree somewhere around 65536 DOM nodes
(on 32bit systems; sefira has never been tried on anything else,
although obviously you're welcome to try it out on as big an iron as
you want) - beyond that, the algorithms would probably be hopelessly
slow anyway...

Error handling is implemented using C++ exceptions, throwing instances
of the standard C++ exception classes. Checking of parameters is far
from comprehensive, but there is some - performance-hyperconscious
users may want to remove it, but really, the current implementation
presents many more interesting optimization opportunities... sefira
also has unified compile-time configurable logging (to standard error
via std::cerr, so it can be redirected in one place) and many objects
implement their own serialization formats by overloading operator
<<. The implemented algorithms are as follows:

sefira-optimistic

This algorithm is from

S. Mozes, D. Tsur, O. Weimann, and M. Ziv-Ukelson
Fast algorithms for computing tree LCS
probably a preliminary version which appeared in Proc. 19th Symposium
on Combinatorial Pattern Matching (CPM), LNCS 5029, 230-243, 2008,
but I got it at
http://www.cs.bgu.ac.il/~dekelts/publications/treelcs.pdf .

This algorithm is specialized for the LCS problem and measures its
complexity in the compared trees size, height and the number of pairs
from the first and second tree having the same label - in other words,
it's maximally efficient for totally different trees and looses its
edge as node names start to repeat.

sefira-straight

This algorithm is from

Erik D. Demaine, Shay Mozes, Benjamin Rossman, and Oren Weimann,
An Optimal Decomposition Algorithm for Tree Edit Distance
in Proceedings of the 34th International Colloquium on Automata,
Languages and Programming (ICALP 2007), Wroclaw, Poland, July 9, 2007.

This algorithm is formulated for a generic distance metric, but sefira
implements only the size comparison - a generalization is left as an
exercise for the reader. sefira-straight is the memoizing version,
using n^3 in both space and time (where n is the size of the larger
input tree).

sefira-systematic

This algorithm is the same as sefira-straight, i.e. also from

Erik D. Demaine, Shay Mozes, Benjamin Rossman, and Oren Weimann,
An Optimal Decomposition Algorithm for Tree Edit Distance
in Proceedings of the 34th International Colloquium on Automata,
Languages and Programming (ICALP 2007), Wroclaw, Poland, July 9, 2007,
except it's the memory-efficient variant, as described in
http://erikdemaine.org/papers/TreeEdit_ICALP2007/paper.pdf .

It uses n^2 space and n^3 time.