Measuring the Structural Similarity of Semistructured Documents Using Entropy Sven Helmer University of London, Birkbeck VLDB’07, September 23-28, 2007,

Measuring the Structural Similarity of Semistructured Documents Using Entropy Sven Helmer University of London, Birkbeck VLDB’07, September 23-28, 2007, Vienna, Austria Summarized and Presented by Seungseok Kang IDS Lab., Seoul National University

Copyright © 2007 by CEBT

Outline
• Introduction
• Measuring Entropy
• Information Distance
• Measuring Structural Similarity
  – Extract structural information
  – Ziv-Lempel Encoding
  – Ziv-Merhav Crossparsing
• Competitors
  – Tree-editing distance
  – Discrete Fourier Transformation
  – Path shingles
• Experiments
  – Clustering Quality
  – Document Collections
  – Results
• Conclusion

IDS Lab. Seminar, Center for E-Business Technology

Introduction
• Semistructured documents
  – HTML, XHTML, XML, …
• In traditional IR, detecting similarities is widely used
  – For querying
  – For clustering
  – Lots of similarity measures exist for text documents
• New challenges with semistructured documents
  – Measuring structural similarity
  – Semistructured documents show great structural diversity
  – Measuring structural similarity can help integrate heterogeneous data sources

Measuring Entropy
• Bennett et al.
  – Introduced the concept of a universal information metric
  – Based on Kolmogorov complexity
    – Given a data object x, the Kolmogorov complexity K(x) is the length of the shortest program that outputs x
  – Conditional Kolmogorov complexity K(x|y)
    – Generalized form of Kolmogorov complexity
    – Length of the shortest program that outputs x when given y as input

Information Distance
• Similarity of two data objects can be measured by the normalized information distance
  – Based on conditional Kolmogorov complexity
  – Has some nice properties
    – Almost a metric
    – Lower bound for admissible distances
  – Unfortunately, Kolmogorov complexity is not computable in general
    – Can be approximated by compression
    – C(x): length of compressed x
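
The slide names the normalized information distance without writing it out. As it is commonly defined in the information-distance literature (this is not taken from the slides), it and its compression-based approximation, the normalized compression distance (NCD), read as follows, where xy denotes the concatenation of x and y:

```latex
\mathrm{NID}(x, y) = \frac{\max\{K(x \mid y),\; K(y \mid x)\}}{\max\{K(x),\; K(y)\}}
\qquad
\mathrm{NCD}(x, y) = \frac{C(xy) - \min\{C(x),\; C(y)\}}{\max\{C(x),\; C(y)\}}
```

Values near 0 indicate very similar objects and values near 1 indicate unrelated ones; with a real compressor the NCD can slightly exceed 1.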

Measuring Structural Similarity
• Just compressing XML files does not get the job done
• Extract structural information
  – Tags: list element/attribute names in document order
  – Pairwise: like tags, but with names of parents
  – Path: like tags, but with full path to root
  – Family order: family-order traversal of document
• Except for path, all extractions can be done in linear time
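
The first three extraction variants can be sketched as follows. This is a minimal illustration, not the author's implementation: the function name `extract`, the `@` prefix for attribute names, and the `/` separator are assumptions made here, and the family-order variant is left out because the slides do not define the traversal.

```python
import xml.etree.ElementTree as ET


def extract(xml_text, mode="tags"):
    """Return the structural sequence of a document.

    mode is "tags", "pairwise" or "path" (names chosen for this sketch).
    Attributes are emitted right after their element, prefixed with "@".
    """
    if mode not in ("tags", "pairwise", "path"):
        raise ValueError("unknown mode: " + mode)
    out = []

    def emit(name, ancestors):
        if mode == "tags":
            out.append(name)
        elif mode == "pairwise":
            parent = ancestors[-1] if ancestors else ""
            out.append(parent + "/" + name)
        else:  # full path from the root down to this name
            out.append("/" + "/".join(ancestors + [name]))

    def visit(elem, ancestors):
        emit(elem.tag, ancestors)
        here = ancestors + [elem.tag]
        for attr in elem.attrib:  # attributes belong to the current element
            emit("@" + attr, here)
        for child in elem:  # children are visited in document order
            visit(child, here)

    visit(ET.fromstring(xml_text), [])
    return out


doc = "<a id='1'><b><c/></b><b/></a>"
tags = extract(doc, "tags")
pairs = extract(doc, "pairwise")
paths = extract(doc, "path")
```

The path variant writes out every ancestor for every node, so its output grows with document depth, which is consistent with the slide's remark that path is the one extraction that is not linear.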

Measuring Structural Similarity (cont.)
• After extracting structural information
  – Come up with a similarity measure
    – NCD (normalized compression distance) with GZIP (Ziv-Lempel encoding)
    – Ziv-Merhav crossparsing
  – Can be done in linear time
• We can calculate the actual value for the NCD by encoding or crossparsing
  – Crossparsing is an alternative way to measure the relative entropy between sequences in files
    – Instead of compressing
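
A compression-based NCD can be sketched with Python's `zlib` module, which uses the same DEFLATE (Ziv-Lempel plus Huffman) scheme as GZIP. The helper names and the sample tag sequences are made up for this illustration, and the formula is the commonly used NCD definition rather than something quoted from the slides.

```python
import zlib


def compressed_len(data: bytes) -> int:
    """C(x): length of x after DEFLATE compression at the highest level."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte sequences."""
    cx, cy = compressed_len(x), compressed_len(y)
    cxy = compressed_len(x + y)  # C(xy): compress the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical tag sequences standing in for extracted structure.
bib = b"article title author author year " * 50
score = b"score part measure note pitch duration " * 50

same = ncd(bib, bib)
different = ncd(bib, score)
```

Here `same` should come out clearly smaller than `different`, because the second copy of an identical sequence compresses almost for free while an unrelated sequence does not.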

Measuring Structural Similarity (cont.)
• Ziv-Lempel encoding
  – Build a dictionary for encoding by utilizing substrings that have already appeared in the previously encoded part of the document
  – ex) String to encode: "aababbccccd"
• Ziv-Merhav crossparsing
  – Repeatedly find the longest prefix of the unparsed remainder of x that appears as a substring in y, until the whole document x has been parsed
  – ex) x = "abbbbaaabba", y = "baababaabba": c(x|y) = 3, the three phrases being "abb", "bba", "aabba"
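
The crossparsing example can be reproduced with a short greedy parser. This is a naive sketch for illustration only: it rescans y for every candidate phrase and is therefore not the linear-time procedure the slides refer to. The function name `cross_parse` and the rule that a symbol missing from y forms a phrase of length one are assumptions of this sketch.

```python
def cross_parse(x, y):
    """Split x into phrases, each the longest prefix of the rest of x found in y."""
    phrases = []
    i = 0
    while i < len(x):
        # Assumption: a symbol that never occurs in y still forms a phrase.
        length = 1
        # Extend the phrase while the longer candidate is a substring of y.
        while i + length < len(x) and x[i:i + length + 1] in y:
            length += 1
        phrases.append(x[i:i + length])
        i += length
    return phrases


phrases = cross_parse("abbbbaaabba", "baababaabba")
c_xy = len(phrases)  # the phrase count c(x|y) from the slide
```

The fewer phrases x needs, the more of its structure is already present in y, so a small c(x|y) relative to the length of x signals similar sequences.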

Competitors
• Tree-editing distance
  – Nierman and Jagadish
• Discrete Fourier Transformation
  – Flesca et al.
• Path shingles
  – Buttler

Tree-editing distance
• Measures the minimum editing distance between documents, analogous to the Levenshtein distance between strings
• Five different edit operations
  – Relabel
  – Insert
  – Delete
  – Insert Tree
  – Delete Tree
• Quadratic runtime O(|d_1||d_2|)
  – |d_n|: size of document d_n
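
To see where the quadratic cost comes from, it helps to look at the string case the slide alludes to. The sketch below is the classic Levenshtein dynamic program over two sequences, not the tree-edit algorithm of Nierman and Jagadish: it has no tree-level insert or delete operations and only illustrates that every pair of positions is visited once.

```python
def levenshtein(a, b):
    """Minimum number of single-item insertions, deletions and relabelings."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, item_a in enumerate(a, 1):
        cur = [i]  # turning a[:i] into the empty sequence takes i deletions
        for j, item_b in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                        # delete item_a
                cur[j - 1] + 1,                     # insert item_b
                prev[j - 1] + (item_a != item_b),   # relabel, or free match
            ))
        prev = cur
    return prev[-1]


d = levenshtein("kitten", "sitting")
```

The two nested loops fill one cell per pair of positions, which gives the O(|d_1||d_2|) behaviour; the function also accepts lists, so it can be applied to tag sequences as well as strings.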

Discrete Fourier Transformation
• Called the DFT technique
• The features of an XML document relevant to its structure are described via a numerical encoding
• Numerical values are interpreted as a time series
  – Rotate document by 90°, interpret indentations as time series
  – Use DFT to compute similarity
    – Tag encoding
    – Document encoding
    – Document similarity
• Runtime O(N log N)
  – where N = max(|d_1|, |d_2|)
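
The idea of comparing documents in the frequency domain can be illustrated roughly as below. This is a loose sketch and not the encoding of Flesca et al.: it assumes the document has already been turned into a sequence of nesting depths, zero-pads the shorter signal, and compares magnitude spectra with a Euclidean distance. It also uses a naive O(N²) transform for brevity, where an FFT would give the O(N log N) runtime quoted on the slide.

```python
import cmath


def dft_magnitudes(signal, n):
    """Magnitudes of the length-n DFT of signal, zero-padded to n samples."""
    padded = list(signal) + [0] * (n - len(signal))
    return [
        abs(sum(v * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, v in enumerate(padded)))
        for k in range(n)
    ]


def dft_distance(s1, s2):
    """Euclidean distance between the magnitude spectra of two signals."""
    n = max(len(s1), len(s2))
    m1, m2 = dft_magnitudes(s1, n), dft_magnitudes(s2, n)
    return sum((a - b) ** 2 for a, b in zip(m1, m2)) ** 0.5


# Hypothetical depth sequences of two small documents.
same = dft_distance([0, 1, 2, 1, 0], [0, 1, 2, 1, 0])
different = dft_distance([0, 1, 2, 1], [0, 1, 1, 1])
```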

Path Shingles
• Extract structural information using the full path variant
• Compute a hash value h_j for each path
• A shingle of width w is the combination of w consecutive hash values
• Compute similarity between two documents using the Dice coefficient on the two sets of shingles
• Run time is not linear
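
The shingling steps can be sketched as follows, assuming the full paths have already been extracted as strings. The choice of CRC-32 as the hash function, the helper names, and the convention that two empty shingle sets are maximally similar are assumptions of this sketch, not details from the slides.

```python
import zlib


def shingles(paths, w):
    """Set of width-w shingles: tuples of w consecutive path hashes."""
    hashes = [zlib.crc32(p.encode("utf-8")) for p in paths]
    return {tuple(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}


def dice(a, b):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) of two sets."""
    if not a and not b:
        return 1.0  # assumption: two empty documents count as identical
    return 2 * len(a & b) / (len(a) + len(b))


doc1 = ["/a", "/a/b", "/a/b/c", "/a/d"]
doc2 = ["/a", "/a/b", "/a/b/c", "/a/e"]
sim = dice(shingles(doc1, 2), shingles(doc2, 2))
```

With width 2, each document yields three shingles and the two documents share two of them, so the similarity works out to 2·2 / (3 + 3) = 2/3.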

Clustering quality
• Measure the quality of a similarity measure by clustering
• Hierarchical agglomerative clustering
• Quality expressed as the number of misclusterings in the dendrogram

Document Collections
• Use three different document collections for the experiments
  – Real data sets
    – SIGMOD Record, INEX 2005, music sheets encoded in XML
  – Synthetically generated data sets from the DFT paper
  – Own synthetically generated data sets, varying
    – Element names
    – Element frequencies
    – Element positions
    – Element depths

Overall Results
• Number of misclusterings for different methods

Detailed results
• Different methods have different strengths and weaknesses
  – Tree-edit
    – Generally good, has problems with largely varying document sizes
  – DFT
    – Good at frequencies, bad at element names, positions, and depths
  – GZIP/Ziv-Merhav
    – Bad at frequencies, good at element names, positions, and depths
• DFT and GZIP/Ziv-Merhav are complementary to each other
  – Performance of a hybrid becomes even better
  – Hybrid approach does not have linear run time

Runtime
• Shallow documents
  – Crossparsing/pairwise, path shingles algorithm
• Deep documents
  – Crossparsing/pairwise

Conclusion
• Determining the similarity of documents
  – Based on measuring entropy
• Encoding/crossparsing approach is totally different from previous approaches
• Can be done in linear time
  – Important for large document collections
• Future work
  – More sophisticated ways of encoding document structure
  – Different entropy measures better suited for structural information