Download presentation
Presentation is loading. Please wait.
Published byElla Thornton Modified over 9 years ago
1
Measuring the Structural Similarity of Semistructured Documents Using Entropy Sven Helmer University of London, Birkbeck VLDB’07, September 23-28, 2007, Vienna, Austria 2008. 2. 5 Summarized and Presented by Seungseok Kang IDS Lab., Seoul National University
2
Copyright 2007 by CEBT Outline Introduction Measuring Entropy Information Distance Measuring Structural Similarity Extract structural information Ziv-Lempel Encoding Ziv-Merhav Crossparsing Competitors Tree-editing distance Discrete Fourier Transformation Path shingles Experiments Clustering Quality Document Collections Results Conclusion IDS Lab. Seminar - 2Center for E-Business Technology
3
Copyright 2007 by CEBT Introduction Semistructured document HTML, XHTML, XML, … In traditional IR detecting similarities used widely For querying For clustering Lots of similarity measures for text documents New challenges with semistructured documents Measuring structural similarity Semistructured documents show great structural diversity Measuring structural similarity can integrate heterogeneous data sources IDS Lab. Seminar - 3Center for E-Business Technology
4
Copyright 2007 by CEBT Measuring Entropy Bennet et al. Introduced concept of universal information metric Based on Kolmogorov complexity – Given data object x, Kolmogorov complexity K(x) means the length of shortest program that outputs x Conditional Kolmogorov complexity K(x|y) – Generalized form of Kolmogorov complexity – Length of the shortest program with input y that outputs x IDS Lab. Seminar - 4Center for E-Business Technology
5
Copyright 2007 by CEBT Information Distance Similarity of two data objects can be measured by normalized information distance Based on Conditional Kolmogorov complexity Has some nice property – Almost a metric property – Lower bound for admissible distance Unfortunately, Kolmogorov complexity is not computable in general – Can be approximated by compression – C(x) : length of compressed x IDS Lab. Seminar - 5Center for E-Business Technology
6
Copyright 2007 by CEBT Measuring Structural Similarity Just Compressing XML files does not get the job done Extract structural information Tags – List element/attribute names in document order Pairwise – Like tags, but with names of parents Path – Like tags, but with full path to root Family order – Family-order traversal of document Except path, all extractions can be done in linear time IDS Lab. Seminar - 6Center for E-Business Technology
7
Copyright 2007 by CEBT Measuring Structural Similarity (cont.) After extracting structural information Come up with similarity measure – NCD with GZIP (Zip-Lempel Encoding) – Ziv-Merhav crossparsing Can be done in linear time We can calcurate the actual value for the NCD by encoding or crossparsing Crossparsing is alternative way to measure relative entropy between sequences in files – Instead of compressing IDS Lab. Seminar - 7Center for E-Business Technology
8
Copyright 2007 by CEBT Measuring Structural Similarity (cont.) Ziv-Lempel Encoding Build a directory for encoding by utilizing substring that have already appeared in the previous encoded part of the document – ex) String to encode “aababbccccd”,,,, Ziv-merhav Crossparsing Find the longest prefix of string x that appears as a string in y until we have parsed the whole document x – ex) x=“abbbbaaabba”, y=“baababaabba” c(x|y) = 3, the three phrases being “abb”,”bba”,”aabba” IDS Lab. Seminar - 8Center for E-Business Technology
9
Copyright 2007 by CEBT Competitors Tree-editing distance Nierman and Jagadish Discrete Fourier Transformation Flesca et al. Path shingles Buttler IDS Lab. Seminar - 9Center for E-Business Technology
10
Copyright 2007 by CEBT Tree-editing distance Measuring the minimum editing distance between string Levenshtein distance Five different edit operations Relabel Insert Delete Insert Tree Delete Tree Quadratic runtime O(|d 1 ||d 2 |) – |d n | : size of document d IDS Lab. Seminar - 10Center for E-Business Technology
11
Copyright 2007 by CEBT Discrete Fourier Transformation Called DFT technique The feature of an XML document relevant to its structure are described via a numerical encoding Numerical values are interpreted as a time series Rotate document by 90 ◦, interpret indentations as time series Use DFT transform to compute similarity – Tag encoding – Document encoding – Document similarity Runtime O(NlogN) – where N = max((|d 1 ||d 2 |) IDS Lab. Seminar - 11Center for E-Business Technology
12
Copyright 2007 by CEBT Path Shingles Extract structural information using Full Path variant Compute a hash value h j for each path A shingle of width w is the combination of w consecutive hash values Compute similarity between two documents using Dice coefficient on the two sets of shingles Not linear run time IDS Lab. Seminar - 12Center for E-Business Technology
13
Copyright 2007 by CEBT Clustering quality Measure quality of similarity measure by clustering Hierarchical agglomerative clustering Quality expressed in number of misclustering in dendrogram IDS Lab. Seminar - 13Center for E-Business Technology
14
Copyright 2007 by CEBT Document Collections Use three different document collections for experiment Real data set – SIGMOD record, INEX 2005, music sheets encoded in XML Synthetically generated data sets from DFT paper Own synthetically generated various data sets – Element names – Element frequencies – Element positions – Element depths IDS Lab. Seminar - 14Center for E-Business Technology
15
Copyright 2007 by CEBT Overall Results Number of misclustering for different methods IDS Lab. Seminar - 15Center for E-Business Technology
16
Copyright 2007 by CEBT Detailed results Different method have different strengths and weaknesses Tree-edit – Generally good, has problems with largely varying document sizes DFT – Good at frequencies, bad at element names, position, and depth GZIP/Ziv-Merhav – Bad at frequencies, good at element names, position, and depth DFT and GZIP/Ziv-Merhav are complementary to each other Performance of hybrid becomes even better Hybrid approach does not have linear run time IDS Lab. Seminar - 16Center for E-Business Technology
17
Copyright 2007 by CEBT Runtime Shallow documents Crossparsing/pairwise, shingle path algorithm Deep documents Crossparsing/pairwise IDS Lab. Seminar - 17Center for E-Business Technology
18
Copyright 2007 by CEBT Conclusion Determining the similarity of document Based on measuring entropy Encoding/Crossparsing approach totally different to previous approaches Can be done in linear time Important for large document collection Future work More sophisticated ways of encoding document structure Different entropy measure better suited for structural information IDS Lab. Seminar - 18Center for E-Business Technology
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.