Compact Encodings for All Local Path Information in Web Taxonomies with Application to WordNet Svetlana Strunjaš-Yoshikawa Joint with Fred Annexstein and.


Compact Encodings for All Local Path Information in Web Taxonomies with Application to WordNet Svetlana Strunjaš-Yoshikawa Joint with Fred Annexstein and Kenneth Berman University of Cincinnati

Introduction
Consider the Lowest Common Ancestor (LCA) query problem:
– Find the most specific common generalization (least common subsumer) among two or more terms or attributes in large hierarchical/classification data sets
– Constraint: evaluate queries without indirection
– Goal: compact labeling schemes for taxonomies

Introduction (cont’d)
Applications:
– Fast classification of sets and similarity, e.g. prediction sets similar to Google Sets (given “Bush” and “Clinton” it predicts all other US presidents)
– Fast answers to ancestor queries in XML search, e.g. testing whether two terms share a parent node without loading the XML file (see [1], [2])
– Fast navigation through voluminous web taxonomies (see [3])

Data Model
Structural properties found in well-known web taxonomies:
– large variance in out-degree (Δ), i.e., some nodes have many subclasses
– small range and variance of in-degree (δ)
– small (logarithmic) depth (σ)
– small number (>1) of paths from the root
See the paper for a table of statistical values for the WordNet, ODP, and Math taxonomies.

Our Approach
Given: large, rooted web taxonomies, represented abstractly as directed acyclic graphs (DAGs) with the statistics above
Problem: label each node of the DAG so that all local path information for each taxonomy element is preserved in the encoding
Our labeling scheme is a variable-length, prefix-based scheme, built up in two stages.

Our Approach (cont’d)
1. Greedy Dewey Labeling for Trees (TGDL)
- Identifies a breadth-first tree T in the DAG
- Encodes path information for the paths in T
- Labels each node with the concatenation of the edge labels on its path from the root in T
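The first stage can be illustrated with a minimal sketch, under stated assumptions. It builds a breadth-first tree in a small DAG and gives every node the concatenation of the edge labels on its tree path. The edge labels here are plain child ordinals, which is a stand-in and not the authors' greedy label assignment; the function names and the toy taxonomy are hypothetical. With such prefix-based labels, the lowest common ancestor of two nodes in the tree is found from the labels alone, as their longest common prefix.

```python
from collections import deque

def bfs_tree_labels(dag, root):
    """Map each reachable node to the tuple of edge labels on its BFS-tree path."""
    labels = {root: ()}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        ordinal = 1  # assumption: edge label = ordinal of the tree child
        for v in dag.get(u, []):
            if v not in labels:  # first discovery makes (u, v) a tree edge
                labels[v] = labels[u] + (ordinal,)
                ordinal += 1
                queue.append(v)
    return labels

def tree_lca_label(a, b):
    """Longest common prefix of two labels, i.e. the label of their tree LCA."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return tuple(prefix)

# Hypothetical toy taxonomy; "dog" has two parents, so this is a DAG.
dag = {
    "entity": ["animal", "pet"],
    "animal": ["dog", "cat"],
    "pet": ["dog"],
}
labels = bfs_tree_labels(dag, "entity")
lca_dog_cat = tree_lca_label(labels["dog"], labels["cat"])  # label of "animal"
```

The edge from "pet" to "dog" is not part of the breadth-first tree, so the tree labels alone do not record it; that gap is what the second stage of the scheme addresses.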

GDL example

TGDL example

Analysis of the Length for TGDL Labels
Performed in 2 steps:
First step: assume that the delimiting labels are empty; each node v is then labeled with at most [bound missing from the transcript] bits
Second step: estimate an upper bound on node label length under different edge-delimiting schemes

Delimiting schemes
They encode the length of each tree-edge label
Two approaches tested:
- Unary Length Encoding
- Fixed Binary Length Encoding

Unary Length Encoding (ULE)
Comparable to the Elias gamma code [table comparing Gamma and ULE codewords not preserved in the transcript]
ULE assigns a zero prefix of |e|-1 bits to an edge label e whose GDL label has length |e|
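A minimal sketch of the unary length prefix, assuming (this is not stated on the slide) that every GDL edge label is a non-empty bit string beginning with 1, so that the run of zeros can be told apart from the label itself. Under that assumption the ULE codeword of a label coincides with the Elias gamma codeword of the integer whose binary representation is that label. All function names are hypothetical.

```python
def ule_encode(label):
    """Prefix a label of length k with k-1 zeros (assumes label starts with '1')."""
    if not label or label[0] != "1":
        raise ValueError("sketch assumes a non-empty label starting with '1'")
    return "0" * (len(label) - 1) + label

def ule_decode_stream(bits):
    """Split a concatenation of ULE codewords back into the edge labels."""
    labels = []
    i = 0
    while i < len(bits):
        zeros = 0
        while bits[i + zeros] == "0":  # count the unary length prefix
            zeros += 1
        start = i + zeros
        length = zeros + 1             # prefix of k-1 zeros means k label bits
        labels.append(bits[start:start + length])
        i = start + length
    return labels

def elias_gamma(n):
    """Standard Elias gamma code of a positive integer, for comparison."""
    binary = format(n, "b")
    return "0" * (len(binary) - 1) + binary

path = ["1", "101", "11"]                      # hypothetical edge labels
encoded = "".join(ule_encode(e) for e in path)  # "100101011"
decoded = ule_decode_stream(encoded)
```

Each edge costs 2|e|-1 bits, so the unary prefix roughly doubles the label length in exchange for self-delimiting codewords.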

Unary Length Encoding (ULE) Analysis
Theorem: For an arbitrary node v in a tree T, the TGDL label length with ULE delimiters is bounded above by [bound missing from the transcript] bits. The bound depends on the depth of v in T and on n, the number of nodes in T.

Fixed Binary Length Encoding (FBLE)
For an edge e, this encoding is the binary representation of the length of GDL(e)
Encoded with a fixed number of bits [expression missing from the transcript]
- Δ is the maximum node out-degree in T
- uses 4 bits in our application

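FBLE example
- 4 bits will encode delimiters for any T with maximum out-degree < 2^16
- Let e be an edge in T with a given GDL label of length 10 (the example bit string is missing from the transcript). FBLE then produces the delimiter 1010, the binary representation of 10, so the label for e combines the delimiter 1010 with GDL(e).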
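The fixed-width delimiter can be sketched as follows. The sketch assumes that the 4-bit delimiter is written immediately before the edge label, which the transcript does not confirm, and it uses a made-up 10-bit label in place of the example that was lost. Function names and the WIDTH constant are hypothetical.

```python
WIDTH = 4  # delimiter width in bits, as in the slide's application

def fble_encode(label, width=WIDTH):
    """Write the label length as a fixed-width binary delimiter, then the label."""
    if len(label) >= 2 ** width:
        raise ValueError("label too long for the chosen delimiter width")
    return format(len(label), "0{}b".format(width)) + label

def fble_decode_stream(bits, width=WIDTH):
    """Split a concatenation of FBLE-delimited edge labels back into labels."""
    labels = []
    i = 0
    while i < len(bits):
        length = int(bits[i:i + width], 2)  # read the fixed-width delimiter
        i += width
        labels.append(bits[i:i + length])
        i += length
    return labels

example = "1100110011"                 # hypothetical 10-bit GDL label
codeword = fble_encode(example)        # delimiter 1010 followed by the label
path = [example, "01", "1"]            # hypothetical edge labels on a tree path
node_label = "".join(fble_encode(e) for e in path)
decoded = fble_decode_stream(node_label)
```

Unlike the unary prefix, the overhead is a constant 4 bits per edge regardless of label length, and the labels need not start with a 1 bit.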

Fixed Binary Length Encoding (FBLE) Analysis Upper bound on TGDL label length with FBLE of delimiters is bits, for an arbitrary node v in a tree T

Our Approach (cont’d 2)
2. Extended Greedy Dewey Labeling for DAGs (EGDL)
- Augments the codes generated in step 1
- Used for inferring paths that are not part of the breadth-first tree
- Adds the TGDL node-label pairs of non-tree edges
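To make the idea of label pairs for non-tree edges concrete, the sketch below collects, for a small DAG, the pair (label of tail, label of head) for every edge that is not in the breadth-first tree. It only illustrates which pairs exist; how the scheme attaches them to node labels is not described in the transcript and is not modeled here. As before, ordinal edge labels stand in for the greedy labels, and all names and the toy data are hypothetical.

```python
from collections import deque

def tree_labels_and_extra_pairs(dag, root):
    """Return BFS-tree labels plus the label pairs of all non-tree edges."""
    labels = {root: ()}
    tree_parent = {root: None}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        ordinal = 1  # assumption: ordinal edge labels, not the greedy ones
        for v in dag.get(u, []):
            if v not in labels:
                labels[v] = labels[u] + (ordinal,)
                tree_parent[v] = u
                ordinal += 1
                queue.append(v)
    # An edge (u, v) is a non-tree edge when u is not v's parent in the BFS tree.
    extra_pairs = [
        (labels[u], labels[v])
        for u in labels
        for v in dag.get(u, [])
        if tree_parent[v] != u
    ]
    return labels, extra_pairs

dag = {
    "entity": ["animal", "pet"],
    "animal": ["dog", "cat"],
    "pet": ["dog"],  # second parent of "dog": a non-tree edge
}
labels, extra_pairs = tree_labels_and_extra_pairs(dag, "entity")
```

In this toy DAG the only extra pair links the label of "pet" to the label of "dog", which is the information needed to recover the path entity, pet, dog that the tree labels miss.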

EGDL Labeling - Example [figure: DAG with EGDL node labels; not preserved in the transcript]

Experimental Results for the WordNet taxonomy (n = 80K)

Experimental Results: Label Lengths [table of encoding lengths and WordNet 2.1 statistics; not preserved in the transcript]

References
[1] Budanitsky, A., Hirst, G. Semantic distance in WordNet: An experimental, application-oriented evaluation of five measures. Workshop on WordNet and Other Lexical Resources, Second Meeting of the North American Chapter of the Association for Computational Linguistics, Pittsburgh, PA.
[2] Resnik, P. Using Information Content to Evaluate Semantic Similarity in a Taxonomy. In Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI), pages 448–453.
[3] Christophides, V., Plexousakis, D. On Labeling Schemes for the Semantic Web. In Proceedings of the 12th International Conference on World Wide Web, pages 544–555, Budapest, Hungary.
[4] Abiteboul, S., Kaplan, H., Milo, T. Compact labeling schemes for ancestor queries. In Proceedings of the Twelfth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 547–556, Washington, D.C.
[5] Strunjaš-Yoshikawa, S., Annexstein, F., Berman, K. Compact Encodings for All Local Path Information in Web Taxonomies with Application to WordNet. In Proceedings of the 32nd International Conference on Current Trends in Theory and Practice of Computer Science, Měřín, Czech Republic, January 21-27, 2006.