Dr. David Dailey, Dr. Beverly Gocal, Dr. Deborah Whitfield

Introduction
Graph distance
String Distance
  ◦ Definitions
  ◦ Examples
  ◦ Implementation
  ◦ Theoretical Results
  ◦ String Space Examples

Distance
  ◦ may be defined for any structure
Overlap of the substructures of two structures
  ◦ Strings
  ◦ Graphs
  ◦ Algebraic structures
  ◦ Semi-groups
  ◦ Trees
Web site and web page similarity

Past 15 years
  ◦ Over 20 papers on graph similarity
  ◦ Several more on string similarity
Semi-Group: Let T = (S, A) together with the concatenation operation, where A consists of the set of axioms
  ◦ ∀ x, y ∈ S, xy ∈ S
  ◦ ∀ x, y, z ∈ S, x(yz) = (xy)z

Graph: Let T = (S, A) together with a relation ~, where A consists of the set of axioms
  ◦ ∀ x, y ∈ S, x ~ y ⇒ y ~ x
  ◦ ∀ x ∈ S, ¬(x ~ x)
String: Let T = (S, A) together with an associative operation (expressed by concatenation).
  ◦ Then let S^n be defined recursively by S^1 = S and S^n = S × S^(n-1), and let S* be defined as the infinite union of ordered tuples: S^1 ∪ S^2 ∪ … ∪ S^n ∪ …

Levenshtein distance calculates the minimum number of transformations
Largest shared substructure
Smallest superstructure
All of these approaches are relative

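For reference, the Levenshtein distance mentioned above can be sketched with the standard dynamic-programming recurrence. This is a minimal illustration, assuming the usual three unit-cost transformations (insert, delete, substitute); the function name `levenshtein` is chosen here and does not come from the slides.

```python
def levenshtein(s: str, t: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn s into t (unit costs assumed)."""
    # previous[j] holds the distance between the prefix of s seen so far and t[:j].
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]  # turning s[:i] into the empty string takes i deletions
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(
                previous[j] + 1,         # delete cs
                current[j - 1] + 1,      # insert ct
                previous[j - 1] + cost,  # substitute cs by ct, or keep a match
            ))
        previous = current
    return previous[-1]
```

For example, `levenshtein("aabc", "abcd")` evaluates to 2 (delete one a, append d).
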
Enumerate all substructures within T and U
Union those two sets: Z = T* ∪ U*
Consider a |Z|-dimensional vector space
For each z ∈ Z, let z(T) be the number of occurrences of structure z as a substructure of T
Calculate the Minkowski distance d(T, U)

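The steps above can be sketched for strings as follows. This is an illustration under stated assumptions, not the authors' implementation: substructures are taken to be non-empty contiguous substrings, and the Minkowski order `p` defaults to 1 (sum of absolute differences), which is the value consistent with the worked examples. The names `substring_counts` and `substructure_distance` are hypothetical.

```python
from collections import Counter


def substring_counts(s: str) -> Counter:
    """Count every occurrence of every non-empty contiguous substring of s."""
    return Counter(s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1))


def substructure_distance(t: str, u: str, p: float = 1.0) -> float:
    """Minkowski distance of order p between the substring-count vectors of t and u.

    Assumption: p = 1 by default; the slides do not state the order explicitly.
    """
    ct, cu = substring_counts(t), substring_counts(u)
    z = set(ct) | set(cu)  # Z = T* ∪ U*, one dimension per distinct substring
    # A Counter returns 0 for a missing key, so substrings absent from one
    # string contribute their full count from the other.
    return sum(abs(ct[k] - cu[k]) ** p for k in z) ** (1.0 / p)
```

Under these assumptions, `substructure_distance("aabc", "abcd")` gives 8.0, matching the value reported for that pair.
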
Alphabet S = {a, b, c}, α = abaac and β = cbaac
α* = {a, b, c, ab, ba, aa, ac, aba, baa, aac, abaa, baac, abaac}
β* = {a, b, c, cb, ba, aa, ac, cba, baa, aac, cbaa, baac, cbaac}
Z = {a, b, c, ab, cb, ba, aa, ac, cba, aba, baa, aac, cbaa, abaa, baac, cbaac, abaac}
  ◦ Unique to α*: ab, aba, abaa, abaac
  ◦ Unique to β*: cb, cba, cbaa, cbaac
Equal frequency: I = {b, ba, aa, ac, baa, aac, baac}
Different frequency: D = {a, c} (a occurs 3 times in α and 2 times in β; c occurs once in α and twice in β)
Unique: O = {ab, cb, cba, aba, cbaa, abaa, cbaac, abaac}
|I| = 7, |D| = 2, and |O| = 8

|I| + |D| + |O| = |Z| = 17
Contribution of O is |O| = 8
Contribution of I is 0: these substrings appear equally often
Contribution of D, in this case, is |3 - 2| + |1 - 2| = 2
d(α, β) = contribution(I) + contribution(D) + contribution(O) = 0 + 2 + 8 = 10

A = aabc, B = abcd
S = {a, a, aa, aab, aabc, ab, abc, b, bc, c}
T = {a, ab, abc, abcd, b, bc, bcd, c, cd, d}
Counts for S and T
  ◦ S: a:2 aa:1 aab:1 aabc:1 ab:1 abc:1 b:1 bc:1 c:1
  ◦ T: a:1 ab:1 abc:1 abcd:1 b:1 bc:1 bcd:1 c:1 cd:1 d:1
Differences: a:1 aa:1 aab:1 aabc:1 ab:0 abc:0 abcd:1 b:0 bc:0 bcd:1 c:0 cd:1 d:1
Distance (aabc, abcd) = 8

Too tedious by hand
ipt/StringDistances.html
Distance (aabc, abcd) = 8

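The hand tabulation for aabc and abcd can be reproduced mechanically. The sketch below is not the code behind the page referenced above; it is a small Python illustration, assuming occurrence counts of contiguous substrings and a sum of absolute differences. The helper names are hypothetical.

```python
from collections import Counter


def substring_counts(s: str) -> Counter:
    """Count every occurrence of every non-empty contiguous substring of s."""
    return Counter(s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1))


def difference_table(a: str, b: str) -> dict:
    """Map each substring of a or b to (count in a, count in b, |difference|)."""
    ca, cb = substring_counts(a), substring_counts(b)
    return {z: (ca[z], cb[z], abs(ca[z] - cb[z])) for z in sorted(set(ca) | set(cb))}


table = difference_table("aabc", "abcd")
for z, (count_a, count_b, diff) in table.items():
    print(f"{z:<5} {count_a} {count_b} {diff}")

# The distance is the sum of the per-substring differences.
print("distance:", sum(diff for _, _, diff in table.values()))
```

The final line prints `distance: 8`, in agreement with the value on the slide.
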
Conjecture: if |α| = |β| = n and α and β share no substrings in common (i.e., |I ∪ D| = 0), then d(α, β) = n(n+1)
Lemma: if α = a^n, then d( ) = n^2 + n(n+1)/2
Conjecture: if |α| = |β| = n, then d( ) = d( ) = d( ) = d( ) = n^2 + n(n+1)/2

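The first conjecture can be checked by brute force for small n. This is only a sketch under assumptions: the distance is the sum of absolute differences of substring occurrence counts, and "no substrings in common" is realized by drawing α from the alphabet {a, b} and β from {c, d}, a choice made here purely for illustration. All function names are hypothetical.

```python
from collections import Counter
from itertools import product


def substring_counts(s: str) -> Counter:
    """Count every occurrence of every non-empty contiguous substring of s."""
    return Counter(s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1))


def distance(a: str, b: str) -> int:
    """Sum of absolute differences of substring counts (Minkowski order 1 assumed)."""
    ca, cb = substring_counts(a), substring_counts(b)
    return sum(abs(ca[z] - cb[z]) for z in set(ca) | set(cb))


def holds_for_disjoint_alphabets(n: int) -> bool:
    """Check d(alpha, beta) == n(n+1) for all alpha over {a, b} and beta over {c, d}."""
    for alpha in map("".join, product("ab", repeat=n)):
        for beta in map("".join, product("cd", repeat=n)):
            if distance(alpha, beta) != n * (n + 1):
                return False
    return True


for n in range(1, 5):
    print(n, holds_for_disjoint_alphabets(n))
```

Under these assumptions the check succeeds for every n tried, which is what one would expect: a string of length n has n(n+1)/2 substring occurrences in total, and with nothing shared the two totals simply add.
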
Pretty pics
Exhaustive substructure vector space
Calculate distance
Interesting observations
Used to study structure similarity based on size