Dr. David Dailey, Dr. Beverly Gocal, Dr. Deborah Whitfield


1
Dr. David Dailey, david.dailey@sru.edu
Dr. Beverly Gocal, beverly.gocal@sru.edu
Dr. Deborah Whitfield, deborah.whitfield@sru.edu

2
• Introduction
• Graph distance
• String distance
  ◦ Definitions
  ◦ Examples
  ◦ Implementation
  ◦ Theoretical results
  ◦ String space examples

3
• Distance
  ◦ may be defined for any structure
• Overlap of the substructures of two structures
  ◦ Strings
  ◦ Graphs
  ◦ Algebraic structures
  ◦ Semi-groups
  ◦ Trees
• Web site and web page similarity

4
• Past 15 years
  ◦ Over 20 papers on graph similarity
  ◦ Several more on string similarity
• Semi-group: Let T = (S, A) together with the concatenation operation, where A consists of the set of axioms
  ◦ ∀ x, y ∈ S, xy ∈ S
  ◦ ∀ x, y, z ∈ S, x(yz) = (xy)z

5
• Graph: Let T = (S, A) together with a relation ~, where A consists of the set of axioms
  ◦ ∀ x, y ∈ S, x ~ y → y ~ x
  ◦ ∀ x, ¬(x ~ x)
• String: Let T = (S, A) together with an associative operation (expressed by concatenation).
  ◦ Then let S^n be defined recursively by
    S^1 = S and
    S^n = S × S^(n-1), and
    S* be defined as the infinite union of ordered tuples: S^1 ∪ S^2 ∪ … ∪ S^n ∪ …

6
• Levenshtein distance calculates the minimum number of single-character edits (insertions, deletions, substitutions) that transform one string into the other
• Largest shared substructure
• Smallest superstructure
• All of these approaches are relative
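For comparison with the substructure-based distance developed on the following slides, here is a minimal sketch of Levenshtein distance using the standard dynamic-programming recurrence. The function name `levenshtein` is only illustrative and is not taken from the presentation.

```python
def levenshtein(s: str, t: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn s into t."""
    # prev[j] holds the distance between the first i-1 characters of s
    # and the first j characters of t.
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]  # turning s[:i] into the empty string takes i deletions
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(levenshtein("aabc", "abcd"))
```

The two strings used here are the ones from the worked substring example later in the deck, so the two measures can be compared on the same input.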

7
• Enumerate all substructures within T and U
• Union those two sets: Z = T* ∪ U*
• Work in a |Z|-dimensional vector space, with one coordinate per z ∈ Z
• Let z(T) be the number of occurrences of structure z as a substructure of T
• Calculate the Minkowski distance d(T, U) between the two count vectors
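The steps above can be sketched for the string case as follows. This is an illustration under stated assumptions, not the authors' implementation: substructures are taken to be contiguous substrings, the names `substring_counts` and `substructure_distance` are hypothetical, and the Minkowski order defaults to p = 1 because the worked examples in the deck sum absolute differences.

```python
from collections import Counter


def substring_counts(s: str) -> Counter:
    """Count every occurrence of every non-empty contiguous substring of s."""
    return Counter(s[i:j]
                   for i in range(len(s))
                   for j in range(i + 1, len(s) + 1))


def substructure_distance(t: str, u: str, p: float = 1) -> float:
    """Minkowski distance of order p between the substring-count vectors."""
    ct, cu = substring_counts(t), substring_counts(u)
    z = sorted(set(ct) | set(cu))   # Z = T* ∪ U*, one coordinate per element
    vt = [ct[k] for k in z]         # a Counter returns 0 for absent substrings
    vu = [cu[k] for k in z]
    return sum(abs(a - b) ** p for a, b in zip(vt, vu)) ** (1 / p)


print(substructure_distance("aabc", "abcd"))
```

Building the vectors explicitly mirrors the |Z|-dimensional vector space on the slide; for long strings one would iterate over the counters directly instead.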

8

9
• Alphabet S = {a, b, c}, α = abaac and β = cbaac
• α* = {a, b, c, ab, ba, aa, ac, aba, baa, aac, abaa, baac, abaac}
• β* = {a, b, c, cb, ba, aa, ac, cba, baa, aac, cbaa, baac, cbaac}
• Z = {a, b, c, ab, cb, ba, aa, ac, cba, aba, baa, aac, cbaa, abaa, baac, cbaac, abaac} (ab, aba, abaa, abaac are unique to α*; cb, cba, cbaa, cbaac are unique to β*)
• Equal frequency: I = {b, ba, aa, ac, baa, aac, baac}
• Different frequency: D = {a, c}
• Unique: O = {ab, cb, cba, aba, cbaa, abaa, cbaac, abaac}
• |I| = 7, |D| = 2, and |O| = 8

10
• |I| + |D| + |O| = |Z| = 17
• Contribution of O is |O| = 8
• Contribution of I is 0: substrings appear equally often
• Contribution of D, in this case, is 2: a occurs 3 times in α and 2 times in β, and c occurs once in α and twice in β
• d(α, β) = contribution(I) + contribution(D) + contribution(O) = 0 + 2 + 8 = 10
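The split of Z into equal-frequency, different-frequency, and unique substrings can be checked mechanically. The sketch below assumes contiguous substrings and absolute-difference (p = 1) contributions; the function names are hypothetical. It sums occurrence counts for the unique set, which equals |O| only when every unique substring occurs exactly once.

```python
from collections import Counter


def substring_counts(s: str) -> Counter:
    """Count every occurrence of every non-empty contiguous substring of s."""
    return Counter(s[i:j]
                   for i in range(len(s))
                   for j in range(i + 1, len(s) + 1))


def partition(alpha: str, beta: str):
    """Split Z into (I, D, O): equal frequency, different frequency, unique."""
    ca, cb = substring_counts(alpha), substring_counts(beta)
    equal, different, unique = set(), set(), set()
    for z in set(ca) | set(cb):
        if ca[z] == 0 or cb[z] == 0:
            unique.add(z)        # occurs in only one of the two strings
        elif ca[z] == cb[z]:
            equal.add(z)         # occurs in both, equally often
        else:
            different.add(z)     # occurs in both, with different counts
    return equal, different, unique


def contributions(alpha: str, beta: str):
    """Per-set contributions to the distance: (I, D, O)."""
    ca, cb = substring_counts(alpha), substring_counts(beta)
    equal, different, unique = partition(alpha, beta)
    return (0,
            sum(abs(ca[z] - cb[z]) for z in different),
            sum(ca[z] + cb[z] for z in unique))


i_set, d_set, o_set = partition("abaac", "cbaac")
print(sorted(i_set), sorted(d_set), sorted(o_set))
print(contributions("abaac", "cbaac"))
```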

11
• A = aabc, B = abcd
• S = {a, a, aa, aab, aabc, ab, abc, b, bc, c}
• T = {a, ab, abc, abcd, b, bc, bcd, c, cd, d}
• Counts for S and T
  ◦ a:2 aa:1 aab:1 aabc:1 ab:1 abc:1 b:1 bc:1 c:1
  ◦ a:1 ab:1 abc:1 abcd:1 b:1 bc:1 bcd:1 c:1 cd:1 d:1
• Differences: a:1 aa:1 aab:1 aabc:1 ab:0 abc:0 abcd:1 b:0 bc:0 bcd:1 c:0 cd:1 d:1
• Distance (aabc, abcd) = 8
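The difference row for this example can be regenerated with a few lines of code. This is a sketch assuming contiguous substrings and absolute differences; `difference_table` is a hypothetical helper name.

```python
from collections import Counter


def substring_counts(s: str) -> Counter:
    """Count every occurrence of every non-empty contiguous substring of s."""
    return Counter(s[i:j]
                   for i in range(len(s))
                   for j in range(i + 1, len(s) + 1))


def difference_table(a: str, b: str) -> dict:
    """Map each substring of a or b to the absolute difference of its counts."""
    ca, cb = substring_counts(a), substring_counts(b)
    return {z: abs(ca[z] - cb[z]) for z in sorted(set(ca) | set(cb))}


table = difference_table("aabc", "abcd")
print(" ".join(f"{z}:{diff}" for z, diff in table.items()))
print("distance =", sum(table.values()))
```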

12
• Too tedious by hand
• http://srufaculty.sru.edu/david.dailey/javascript/StringDistances.html
• Distance (aabc, abcd) = 8

13
• Conjecture: if |α| = |β| = n and α and β share no substrings in common (i.e., |I ∪ D| = 0), then d(α, β) = n(n+1)
• Lemma: if   = a^n then d(  ) = n^2 + n(n+1)/2
• Conjecture: if |α| = |β| = n, then d(  ) = d(  ) = d(  ) = d(  ) = n^2 + n(n+1)/2
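The first conjecture can be checked numerically for small n. A string of length n has n(n+1)/2 substring occurrences, and when nothing is shared every occurrence in either string adds 1 to the sum of absolute differences. The sketch below assumes contiguous substrings and p = 1; the helper names and test strings are illustrative. Two strings share no substrings exactly when they share no characters, so disjoint alphabets are enough to satisfy the hypothesis.

```python
from collections import Counter


def substring_counts(s: str) -> Counter:
    """Count every occurrence of every non-empty contiguous substring of s."""
    return Counter(s[i:j]
                   for i in range(len(s))
                   for j in range(i + 1, len(s) + 1))


def distance(a: str, b: str) -> int:
    """Sum of absolute differences of substring counts (Minkowski, p = 1)."""
    ca, cb = substring_counts(a), substring_counts(b)
    return sum(abs(ca[z] - cb[z]) for z in set(ca) | set(cb))


def check_disjoint_conjecture(max_n: int = 8) -> bool:
    """Check d(alpha, beta) == n(n+1) for sample pairs with disjoint alphabets."""
    for n in range(1, max_n + 1):
        pairs = [("abcdefgh"[:n], "stuvwxyz"[:n]),  # all characters distinct
                 ("a" * n, "b" * n)]                # a single repeated character
        for alpha, beta in pairs:
            assert set(alpha).isdisjoint(beta)      # no shared substrings
            if distance(alpha, beta) != n * (n + 1):
                return False
    return True


print(check_disjoint_conjecture())
```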

14
• Pretty pics

15
• Exhaustive substructure vector space
• Calculate distance
• Interesting observations used to study structure similarity based on size

