Guided Forest Edit Distance: Better Structure Comparisons by Using Domain-knowledge
Z.S. Peng, H.F. Ting
The Forest Edit Distance
Edit distance of two ordered, labeled forests
Edit operations between E and F
Relabeling node i in E by the label of node j in F
Example: Relabel (3,5)
Cost of the operation: (3,5)
[Figure: forests E and F, before and after the relabeling]
Delete node i from E
Example: Delete (2,-)
Cost of the operation: (2,-)
[Figure: forests E and F, before and after the deletion]
Delete node j from F
Cost of the operation: (-,j)
The edit distance (E,F) between E and F is the minimum cost of edit operations that transform E to E' and F to F' such that E' = F'.
[Figure: forests E and F, and the forests E' and F' obtained from them]
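The definition above can be illustrated with a small memoized recursion on the rightmost roots of the two forests. This is only a sketch under assumptions that are not on the slides: every relabel and every deletion costs 1 (relabeling to the same label costs 0), a tree is a `(label, children)` tuple, and the name `forest_edit_distance` is hypothetical. It is the textbook recursion, not the faster algorithms discussed later.

```python
from functools import lru_cache

# A tree is (label, children); children and forests are tuples of trees.

def forest_size(forest):
    """Number of nodes in a forest."""
    return sum(1 + forest_size(children) for _, children in forest)

@lru_cache(maxsize=None)
def forest_edit_distance(e, f):
    """Edit distance between ordered, labeled forests e and f (unit costs assumed)."""
    if not e:
        return forest_size(f)          # delete every remaining node of f
    if not f:
        return forest_size(e)          # delete every remaining node of e
    (label_e, kids_e), (label_f, kids_f) = e[-1], f[-1]
    # Delete the rightmost root of e: its children take its place.
    delete_e = forest_edit_distance(e[:-1] + kids_e, f) + 1
    # Delete the rightmost root of f: its children take its place.
    delete_f = forest_edit_distance(e, f[:-1] + kids_f) + 1
    # Pair the two rightmost roots (relabel if the labels differ).
    relabel = (forest_edit_distance(e[:-1], f[:-1])
               + forest_edit_distance(kids_e, kids_f)
               + (0 if label_e == label_f else 1))
    return min(delete_e, delete_f, relabel)

E = (("a", (("b", ()), ("c", ()))),)
F = (("a", (("b", ()),)),)
print(forest_edit_distance(E, F))
```

Deleting node c from E makes the two forests equal, so the example needs a single operation.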
The guided edit distance (E,F,G) between E and F with respect to a third forest G is the minimum cost of edit operations that transform E to E' and F to F' such that E' = F' and E' includes G as a subforest.
[Figure: forests E, F and the guiding forest G]
Application 1: RNA comparisons
Cherry small circular viroid-like RNA GI: between base 287 and base 337. The Hammerhead motif of the RNA is printed in bold.
Application 2: Comparing XML documents
XML documents with the same Document Type Definition (DTD) should be aligned with this DTD to get more accurate results.
The algorithms
(E,F)
  Tai 1979:
  Zhang and Shasha 1989: where
  Klein 1998:
(E,F,G)
  This paper:
Special Cases
Constrained Longest Common Subsequence
Constrained Sequence Alignment
[Figure: example sequences for the two special cases]
The algorithms
Constrained Longest Common Subsequence
  Tsai 2003:
Constrained Sequence Alignment
  Chin et al.:
This paper: where
Since G has one leaf, the time becomes
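For readers unfamiliar with the first special case, the following sketch computes the length of a constrained longest common subsequence with the standard three-index dynamic program. It is not claimed to be the algorithm of any of the works listed above; the function name `constrained_lcs_length` and the choice to return `None` when no valid subsequence exists are assumptions made for illustration.

```python
def constrained_lcs_length(a, b, p):
    """Length of the longest common subsequence of a and b that contains
    p as a subsequence, or None if no such common subsequence exists."""
    neg = float("-inf")                      # marks infeasible sub-problems
    n, m, r = len(a), len(b), len(p)
    # table[i][j][k]: best length for a[:i], b[:j] that already covers p[:k]
    table = [[[neg] * (r + 1) for _ in range(m + 1)] for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            table[i][j][0] = 0               # empty subsequence covers p[:0]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            for k in range(r + 1):
                best = max(table[i - 1][j][k], table[i][j - 1][k])
                if a[i - 1] == b[j - 1]:
                    # Match the two characters without advancing in p.
                    best = max(best, table[i - 1][j - 1][k] + 1)
                    # Match them and use the character for p[k-1] as well.
                    if k > 0 and a[i - 1] == p[k - 1]:
                        best = max(best, table[i - 1][j - 1][k - 1] + 1)
                table[i][j][k] = best
    result = table[n][m][r]
    return None if result == neg else result

print(constrained_lcs_length("abc", "acb", "b"))
```

In the example, the common subsequences of "abc" and "acb" that contain "b" are "b" and "ab", so the longest has length 2.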
Our algorithm for computing (E,F,G)
Dynamic Programming
The sub-problems
Post-order numbering (naming) of the nodes
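A post-order numbering visits the trees of a forest from left to right and numbers every node after all of its children. The sketch below shows one way to produce it; the `(label, children)` representation and the name `postorder_labels` are assumptions for illustration, not notation from the slides.

```python
def postorder_labels(forest):
    """Return the node labels of an ordered forest in post-order.
    Node number i (counting from 1) is the entry at index i - 1."""
    order = []

    def visit(node):
        label, children = node
        for child in children:      # children first, left to right
            visit(child)
        order.append(label)         # then the node itself

    for tree in forest:             # trees of the forest, left to right
        visit(tree)
    return order

forest = [("a", [("b", []), ("c", [])]), ("d", [])]
print(postorder_labels(forest))
```

With this numbering, the nodes of any subtree occupy a contiguous range of numbers ending at the subtree's root.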
A "consecutive" sub-forest
[Figure: consecutive sub-forests of E, F and G]
is equal to the minimum of the following:
[Figures: the cases of the recurrence, each illustrated on E, F and G]
The order for solving the sub-problems
for i = 1 to |E|
  for j = 1 to |F|
    for h = 1 to |G|
      for k = 1 to (|G| - h + 1)
        if k is a leaf then find
The time complexity
Sparsify the dynamic program using a clever trick of Zhang and Shasha
Key-root: a node is a key-root if it is the root or has a left sibling.
No. of key-roots ≤ no. of leaves
[Figure: key-roots marked in E, F and G]
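The key-root definition can be checked on a small forest. The sketch below collects the post-order numbers of the key-roots and of the leaves; it assumes a `(label, children)` representation and treats the root of every tree in the forest as a key-root, which is an interpretation of "the root" for forests rather than something stated on the slide. The name `key_roots_and_leaves` is hypothetical.

```python
def key_roots_and_leaves(forest):
    """Return (key_roots, leaves) as lists of post-order numbers (from 1).
    A node is a key-root if it is a tree root or has a left sibling."""
    key_roots, leaves = [], []
    counter = 0

    def visit(node, is_key_root):
        nonlocal counter
        _, children = node
        for position, child in enumerate(children):
            # Every child except the leftmost one has a left sibling.
            visit(child, position > 0)
        counter += 1                      # post-order number of this node
        if is_key_root:
            key_roots.append(counter)
        if not children:
            leaves.append(counter)

    for tree in forest:
        visit(tree, True)
    return key_roots, leaves

forest = [("a", [("b", []), ("c", [("d", [])])]), ("e", [])]
key_roots, leaves = key_roots_and_leaves(forest)
print(key_roots, leaves)
```

Here the post-order numbering is b=1, d=2, c=3, a=4, e=5, so the key-roots are nodes 3, 4 and 5 and the leaves are nodes 1, 2 and 5, consistent with the bound above.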
To compute (E,F,G) = (E|| 1..|E|, F|| 1..|F|, G|| 1..|G|)
for i = 1 to |E|
  for j = 1 to |F|
    for h = 1 to |G|
      for k = 1 to (|G| - h + 1)
        if k is a leaf and i and j are key-roots, find
The new running time
Thank you