Finding Syntactic Similarities Between XML Documents
Davood Rafiei, University of Alberta
Joint work with Daniel Moise, University of Alberta, and Dabo Sun, University of Alberta

Motivations
Ranked retrievals
–e.g. query: book[author=‘Abiteboul’ and year=‘2000’]
DTD extraction
–useful for query processing
Clustering
–for efficient storage and indexing
–for efficient retrievals (similar documents are expected to match the same queries more often)

Problem Statement
How to measure similarity (or distance) between XML documents
Desired properties
–The distance must be a metric
–Documents generated by the same DTD are expected to have a smaller distance
–Documents with more common tags are expected to be more similar
–Interested in syntactic similarity only

Examples
Similar documents
Non-similar documents
[Example XML documents; the markup is lost and only these text values survive: Abiteboul 2000 John 20 Abiteboul 2000 John 1994 George Animal Farm]

Related Work
Structural Similarity
–Edit distance between ordered trees (Nierman and Jagadish [11], Zhang et al. [21, 23], Chawathe et al. [96])
–Edit distance between unordered trees: NP-complete (Zhang et al. [22])
Specialized Solutions (Flesca et al. [5], Zaki and Aggarwal [20])

Related Work (Cont.)
More Syntactic Similarity
–Based on common parent-child tags (Lian et al. [10]); e.g. of non-similar documents: [example documents; markup lost, surviving text values: A T 2006 A T 2006]
–Use parent-child tags, twigs, content terms, semantic relationships (Theobald et al. [18])

Structural Sketch
For every path in d there is a path in t, and vice versa, and t is minimal.
[Figure: document tree d and its sketch tree t; surviving node values: John, Mary, u200]

Sketch Similarity
Problems of matching trees
Sketch tree is not unique
[Figure: document tree d and sketch tree t; surviving node values: John, Mary, u200]

Path Sets
Root paths: user/person/name, user/person/id
Path set: user/person/name, user/person/id, user/person, person/name, person/id, user, person, name, id
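
The example on this slide suggests that the path set holds every contiguous sub-path of every root path. A minimal sketch under that assumption follows; the function names `root_paths` and `path_set` and the sample document are hypothetical, chosen only to reproduce the two root paths shown above.

```python
import xml.etree.ElementTree as ET


def root_paths(xml_text):
    """Collect the distinct root-to-leaf tag paths of an XML document."""
    found = set()

    def walk(node, prefix):
        path = prefix + [node.tag]
        children = list(node)
        if not children:
            # A leaf element closes one root path.
            found.add("/".join(path))
        for child in children:
            walk(child, path)

    walk(ET.fromstring(xml_text), [])
    return found


def path_set(paths):
    """Return every contiguous sub-path of the given root paths.

    Assumption: this is how the path set on the slide is derived.
    """
    result = set()
    for path in paths:
        tags = path.split("/")
        for start in range(len(tags)):
            for end in range(start + 1, len(tags) + 1):
                result.add("/".join(tags[start:end]))
    return result


# Hypothetical document; only the tag structure matters here.
doc = (
    "<user>"
    "<person><name>John</name><id>u200</id></person>"
    "<person><name>Mary</name></person>"
    "</user>"
)

for p in sorted(path_set(root_paths(doc))):
    print(p)
```

Text values are ignored on purpose, since the slides restrict attention to syntactic similarity.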

Similar Path Sets
Standard set comparisons apply
–E.g. Cosine, Jaccard, Dice
Path set size: at most nl(l+1)/2
–for n root paths, each of length l
Fast similarity comparison
–Cost: linear in the size of the path set
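
The three set comparisons named on this slide can be sketched as follows. These are the standard set-based definitions, not code from the talk; the handling of empty sets is a convention picked for this sketch, and the second path set is a made-up example.

```python
import math


def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def dice(a, b):
    """2 |A ∩ B| / (|A| + |B|); two empty sets count as identical."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 1.0


def cosine(a, b):
    """|A ∩ B| / sqrt(|A| |B|), i.e. cosine of the binary vectors."""
    if not a and not b:
        return 1.0
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


# Path set from the Path Sets slide.
d1 = {
    "user/person/name", "user/person/id", "user/person",
    "person/name", "person/id", "user", "person", "name", "id",
}
# Hypothetical second document without any id element.
d2 = {
    "user/person/name", "user/person", "person/name",
    "user", "person", "name",
}

print(jaccard(d1, d2), dice(d1, d2), cosine(d1, d2))
```

With hash-based sets, each measure needs one intersection, so the cost stays linear in the size of the path sets.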

Evaluation
Effectiveness in clustering documents generated by the same DTD
–Count the mis-clusterings
For result comparison
–Used the same dataset and setting as some earlier work
Also used a larger dataset

Real Data
XML files of ACM SIGMOD Record since March 1999
Four DTDs (total of 989 XML files)
–ProceedingsPage: 17 XML files
–IndexTermsPage: 920 XML files
–OrdinaryIssuePage: 51 XML files
–SigmodRecord: 1 XML file

Synthetic Data
Generated using the IBM XML Generator
DTDs
–Set A: the set used by Nierman and Jagadish
–Set B: set A plus 5 more DTDs
Parameters
–M: max repeat for + or *
–P: probability of an optional attribute

Example Clusters

Mis-Clusterings
Cosine was used for similarity measurements
–Also tried Jaccard and Dice coefficients, but the results weren’t better.
Vector type     Real data  DS1-DS4  DS5  DS6  DS7  DS8
Binary vector   0          0        33   30   25   29
Freq. vec.      0          0        0    0    0    0
N. freq. vec.   0          0        0    0    0    0
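
To illustrate why a frequency vector can separate documents that a binary vector cannot, here is a small sketch. It assumes the frequency vector counts how often each path occurs in a document; the slides do not define the normalized variant, so it is left out. The helper names and the two occurrence lists are hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as Counters."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def binary_vector(path_occurrences):
    """1 for every path that occurs at least once."""
    return Counter(set(path_occurrences))


def frequency_vector(path_occurrences):
    """Number of occurrences of every path (assumed definition)."""
    return Counter(path_occurrences)


# Hypothetical documents: same paths, very different repetition counts.
doc_a = ["user/person"] * 5 + ["user/person/name"]
doc_b = ["user/person"] + ["user/person/name"] * 5

binary_sim = cosine(binary_vector(doc_a), binary_vector(doc_b))
frequency_sim = cosine(frequency_vector(doc_a), frequency_vector(doc_b))
print(binary_sim, frequency_sim)
```

The binary vectors of the two documents coincide, while the frequency vectors point in clearly different directions.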

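Comparison
Our results
Vector type     Real data  DS1-DS4  DS5  DS6  DS7  DS8
Binary vector   0          0        33   30   25   29
Freq. vec.      0          0        0    0    0    0
N. freq. vec.   0          0        0    0    0    0
Earlier results
[Columns: real data, DS1, DS2, DS3, DS4; the per-dataset values are run together in the source and are kept as found]
Nierman    0102119
Chawathe   31683025
Shasha     31693239
Tag Freq.  322213540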

Tag Frequency
[Columns: real data, DS1, DS2, DS3, DS4; the per-dataset values are run together in the source and are kept as found]
City block  24208200211240
Euclidean   2406200
Cosine      6838333935
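
For reference, the tag-frequency baseline can be sketched as a vector of tag counts compared with the three measures named on this slide. This is an illustration using the textbook definitions of the distances, not the setup used in the experiments; the two sample documents and all function names are hypothetical.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter


def tag_frequencies(xml_text):
    """Count how often each tag occurs in the document."""
    return Counter(el.tag for el in ET.fromstring(xml_text).iter())


def city_block(u, v):
    """Sum of absolute count differences (L1 distance)."""
    return sum(abs(u[k] - v[k]) for k in u.keys() | v.keys())


def euclidean(u, v):
    """Square root of the summed squared count differences (L2)."""
    return math.sqrt(sum((u[k] - v[k]) ** 2 for k in u.keys() | v.keys()))


def cosine_distance(u, v):
    """1 minus cosine similarity; 1.0 if either vector is empty."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return 1.0 - dot / norm if norm else 1.0


# Hypothetical documents that share two of their three tags.
a = tag_frequencies("<book><author>Abiteboul</author><year>2000</year></book>")
b = tag_frequencies("<book><author>George</author><title>Animal Farm</title></book>")

print(city_block(a, b), euclidean(a, b), cosine_distance(a, b))
```

Tag counts ignore nesting entirely, which is the main difference from the path-set representation.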

Conclusions
Presented a method for clustering documents generated by the same DTD
Compared to tree-edit distance-based methods, our method is
–more effective (based on our evaluations)
–and also much more efficient

Future Work
Detecting documents with similar structures and related tag names, e.g. [example documents; markup lost, surviving text values: Abiteboul 2000]
Possible solutions:
–allow users to specify relabeling rules
–learn relabeling rules from training data
