PODS 2002: Algorithmics and Applications of Tree and Graph Searching. Dennis Shasha, Courant Institute, NYU. Joint work with Jason Wang.

Presentation transcript:

PODS 2002: Algorithmics and Applications of Tree and Graph Searching. Dennis Shasha, Courant Institute, NYU. Joint work with Jason Wang and Rosalba Giugno.

Usefulness. Trees and graphs represent data in many domains: linguistics, vision, chemistry, and the web (even sociology). Tree and graph searching algorithms are used to retrieve information from such data.

Tree Inclusion. [Figure: (a) a query tree with nodes Book, Editor, Chapter, Title, XML, and a "?" node; (b) a data tree with nodes Book, Editor, Chapter, Title, Author, Name and values XML, John, Mary, Jack, OLAP.]

TreeBASE Search Engine

Vision Application: Handwriting Characters Representation. From pixels to a small attributed graph. [Figure: an attributed graph with node labels l1 to l5 and edge labels e1 to e5.] D. Geiger, R. Giugno, D. Shasha, ongoing work at New York University.

Vision Application: Handwriting Characters Recognition. [Figure: a query graph (node labels l1 to l5, edge labels e1 to e5) compared against several database graphs; one database graph is marked "Best Match".]

Vision Application: Region Adjacency Graphs. J. Lladós, E. Martí, and J.J. Villanueva. Symbol Recognition by Error-Tolerant Subgraph Matching between Region Adjacency Graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(10):1137-1143, 2001.

Chemistry Application. Protein structure search. Daylight, MDL, BCI.

Algorithmic Questions. Why can't I search for trees or graphs at the speed of keyword searches? (This needs the proper data structure.) Why can't I compare trees (or graphs) as easily as I can compare strings?

Tree Searching. Given a small tree t, is it present in a bigger tree T? [Figure: t and T.]

Present but not identical. "Happy families are all alike; every unhappy family is unhappy in its own way." (Anna Karenina, Leo Tolstoy.) The choices: preserving sibling order or not; preserving ancestor order or not; distinguishing between parent and ancestor or not; allowing mismatches or not.

Sibling Order. Order of the children of a node. [Figure: is A with children B, C equal ("=?") to A with children C, B?]

Ancestor Order. Order between children and parent. [Figure: two trees over the nodes A, B, C, compared with "=?".]

Ancestor Distance. Can children become grandchildren? [Figure: a tree over the nodes A, B, C compared ("=?") with a tree over the nodes A, B, X, C.]

Mismatches. Can there be relabelings, inserts, and deletes? If so, how many? [Figure: how far is a tree over A, B, C from a tree over A, X, C?]

Bottom Line. There is no one definition of inexact or subtree matching (the Tolstoy problem). You must ask the question that is appropriate to your application.

TreeSearch Query Language. The query language is simply a tree decorated with single-length don't cares (?) and variable-length don't cares (*). [Figure: a query tree with nodes A, *, B, C, ?, D; the * is annotated ">= 0, on each side" and the ? is annotated "= 1".]

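Exact Match. The query matches exactly if it is contained in the data tree, regardless of sibling order or of other nodes. [Figure: the query tree (A, *, B, C, ?, D) equals a part of a data tree with nodes X, Y, A, W, Z, C, B, X, Q, D, U.]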

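Inexact Match. The match is inexact if node labels are missing or differ. Differences higher in the tree cost more. [Figure: the query tree (A, *, B, C, ?, D) against a data tree with nodes X, Y, A, W, Z, C, B, X, Q, E, U; the data tree has E where the query has D, so they differ by 1.]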

Treesearch Conceptual Algorithm. Take all paths in the query tree. Filter using subpaths. Find out where each real path is in the data tree. Distance = number of paths that differ; higher nodes are more important. Implementation: hashing and a suffix array. Takes a few seconds on several thousand trees.

Treesearch Data Preparation. Take the nodes and parent-child pairs of each data tree and hash them; this is used for filtering. Take all paths in the data trees and place them in a suffix array. (In the worst case this needs O(number of nodes * number of nodes) space, but usually less.)
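The pair-hashing step can be sketched as follows. This is a minimal illustration, not the Treesearch implementation: the tree encoding (`LABELS`, `EDGES`) and the function names are assumptions made here, and a Python dict stands in for the hash table. The example tree is t1 as given on the "Tree == Set of Paths" slide.

```python
from collections import Counter

# Tree t1 (assumed encoding): node id -> label, plus (parent, child) edges.
LABELS = {0: "A", 1: "A", 2: "C", 3: "C", 4: "B", 5: "C", 6: "E"}
EDGES = [(0, 1), (0, 2), (0, 3), (1, 4), (1, 5), (2, 6)]

def pair_index(labels, edges):
    """Hash each parent-child label pair to the set of edges that carry it."""
    index = {}
    for parent, child in edges:
        key = labels[parent] + labels[child]
        index.setdefault(key, set()).add((parent, child))
    return index

def pair_counts(labels, edges):
    """Multiset of parent-child label pairs, which is what the filter compares."""
    return Counter(labels[p] + labels[c] for p, c in edges)

def node_counts(labels):
    """Multiset of node labels, hashed the same way."""
    return Counter(labels.values())
```

Keeping counts rather than plain sets matters because the filter tests for a supermultiset: a query with two AC pairs needs a data tree with at least two.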

Treesearch Processing. Take the nodes and parent-child pairs of the query tree and hash them. Accept data trees that have a supermultiset of both. (If mismatches are allowed, liberalize the test.) Match the query tree against the data trees that survive the filter: do one path at a time, then intersect the results to find matches.
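The filter can be sketched as a multiset comparison. The rule used here for liberalizing the test (count the query pair occurrences the data tree cannot cover and compare with the maximum distance) is an assumption for illustration; the slides do not spell it out. The counts are those read off the deck's pair tables for the query and trees t1, t2, t3, and the function names are hypothetical.

```python
from collections import Counter

def missing_pairs(query_counts, data_counts):
    """Number of query pair occurrences that the data tree cannot cover."""
    # Counter subtraction keeps only positive differences.
    return sum((query_counts - data_counts).values())

def survives_filter(query_counts, data_counts, max_distance=0):
    """Supermultiset test, relaxed by max_distance (assumed relaxation rule)."""
    return missing_pairs(query_counts, data_counts) <= max_distance

# Parent-child pair counts as read from the slides (illustrative).
QUERY = Counter({"AA": 1, "AB": 1, "AC": 2})
T1 = Counter({"AA": 1, "AB": 1, "AC": 3})
T2 = Counter({"AC": 2})
T3 = Counter({"AA": 1, "AC": 2})

survivors = [name for name, tree in (("t1", T1), ("t2", T2), ("t3", T3))
             if survives_filter(QUERY, tree, max_distance=1)]
```

The same test would be applied to the node-label multisets; only the trees that pass both go on to path matching.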

Tree == Set of "Paths". [Figure: a tree with nodes 0:A, 1:A, 2:C, 3:C, 4:B, 5:C, 6:E.]
Paths: AAC (0,1,5); ACE (0,2,6); AAB (0,1,4); AC (0,3).
Parent-Child Pairs: AA = {(0,1)}; AB = {(1,4)}; AC = {(0,2), (0,3), (1,5)}; CE = {(2,6)}.
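Decomposing a tree into its root-to-leaf paths can be sketched as below. The encoding of the tree (`LABELS`, `EDGES`) and the function name are assumptions for illustration; the node ids and labels follow the slide's example.

```python
# Example tree (assumed encoding): node id -> label, plus (parent, child) edges.
LABELS = {0: "A", 1: "A", 2: "C", 3: "C", 4: "B", 5: "C", 6: "E"}
EDGES = [(0, 1), (0, 2), (0, 3), (1, 4), (1, 5), (2, 6)]

def root_to_leaf_paths(labels, edges, root=0):
    """Return sorted (label_string, node_ids) for every root-to-leaf path."""
    children = {}
    for parent, child in edges:
        children.setdefault(parent, []).append(child)
    paths = []
    stack = [(root, (root,))]
    while stack:
        node, ids = stack.pop()
        kids = children.get(node, [])
        if not kids:  # a leaf closes one path
            paths.append(("".join(labels[i] for i in ids), ids))
        for kid in kids:
            stack.append((kid, ids + (kid,)))
    return sorted(paths)
```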

Parent-Child Pairs of 3 Data Trees. [Figure: data trees t1, t2, t3.]
Key     t1   t2   t3
h(AA)   1    0    1
h(AB)   1    0    0
h(AC)   3    2    2
…       …    …    …

Patterns in a Query. [Figure: the query tree with nodes 0:A, 1:A, 2:C, 3:C, 4:B.]
Paths: AAB (0,1,4); AAC (0,1,3); AC (0,2).
Parent-Child Pairs: AA = {(0,1)}; AB = {(1,4)}; AC = {(0,2), (1,3)}.

Filter the Database (Max distance = 1). [Figure: the query tree beside data trees t1, t2, t3; t2 is marked "Discarded".]
Key     Query   t1   t2   t3
h(AA)   1       1    0    1
h(AB)   1       1    0    0
h(AC)   2       3    2    2
…       …       …    …    …

Path Matching (tree t3, Max distance = 1). Select the set of paths in t3 that match the paths of the query (they may not run from root to leaf). With the query paths written leaf to root: CAA = {(7,3,1)}, BAA = Ø, CA = {(4,1), (7,3)}. Count the matched paths whose labels correspond to identical starting roots: |Node(1)| = 2, |Node(3)| = 1. Remove roots that do not satisfy the Max distance restriction. Node(1) matches the query tree within distance 1. [Figure: the query tree and tree t3.]
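The counting step on this slide can be reproduced with a few lines. This is a sketch under two assumptions: each hit is written leaf to root, so its last node id is the root end, and the distance of a candidate root is the number of query paths with no hit at that root. The function name is hypothetical.

```python
from collections import Counter

# Hits of the three query paths in tree t3, as listed on the slide.
MATCHES = {"CAA": [(7, 3, 1)], "BAA": [], "CA": [(4, 1), (7, 3)]}

def roots_within_distance(matches, max_distance):
    """Return {root_id: distance} for roots missing at most max_distance query paths."""
    matched = Counter()
    for hits in matches.values():
        # Count each query path at most once per root.
        for root in {ids[-1] for ids in hits}:
            matched[root] += 1
    total = len(matches)
    return {root: total - n for root, n in matched.items()
            if total - n <= max_distance}
```

With a maximum distance of 1, node 1 is kept (two of three paths matched) and node 3 is removed (one of three).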

Matching Query with Wildcards. Partition the query into subtrees. Find matching candidate subtrees. Glue the subtrees based on the matching semantics of the wildcards. [Figure: a query tree with nodes A, *, ?, B, C partitioned into subtrees, and candidate subtrees with nodes A, E, B, C, E.]

Complexity: Building the Database. M is the number of trees and N is the number of nodes of the biggest tree. The space/time complexity is O(MN^2). This worst case arises for trees that are narrow at the top and bushy at the bottom. In practice it is much better.
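A small experiment shows why a tree that is narrow at the top and bushy at the bottom is the bad case. The "broom" shape below (a chain of N/2 nodes with N/2 leaves hanging off its last node) is an illustrative construction chosen here, not one taken from the slides; the sketch simply sums the lengths of all root-to-leaf paths.

```python
def broom(n):
    """Parent map of a tree with a chain of n//2 nodes and n - n//2 leaves below it."""
    handle = n // 2
    parent = {0: None}
    for i in range(1, handle):
        parent[i] = i - 1
    for leaf in range(handle, n):
        parent[leaf] = handle - 1
    return parent

def total_path_length(parent):
    """Sum of the node counts of all root-to-leaf paths."""
    has_child = {p for p in parent.values() if p is not None}
    total = 0
    for node in parent:
        if node not in has_child:  # leaf: walk up to the root
            while node is not None:
                total += 1
                node = parent[node]
    return total
```

Every one of the N/2 leaves repeats the whole chain, so the stored paths hold about N^2/4 labels: 2550 for N = 100 and 250500 for N = 1000.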

Complexity: Tree Search. The current implementation is linear in the number of trees in the database that survive the filter, because there is one suffix array for each tree. One larger suffix array would be possible, but filtering is very effective in practice. The time complexity of searching for a path of length L is O(L log S), where S is the size of the suffix array.
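The O(L log S) bound comes from binary search: about log S comparisons, each looking at no more than L characters. A minimal sketch follows, assuming single-character labels and storing the suffix strings themselves for clarity (a real suffix array stores positions only); the function names are hypothetical.

```python
from bisect import bisect_left

def build_suffix_array(paths):
    """Sorted (suffix, path_index, offset) entries for every suffix of every path."""
    return sorted((p[i:], idx, i)
                  for idx, p in enumerate(paths)
                  for i in range(len(p)))

def find(suffixes, pattern):
    """All (path_index, offset) positions where pattern occurs in some path."""
    # (pattern,) sorts just before every entry whose suffix starts with pattern.
    pos = bisect_left(suffixes, (pattern,))
    hits = []
    while pos < len(suffixes) and suffixes[pos][0].startswith(pattern):
        hits.append(suffixes[pos][1:])
        pos += 1
    return hits

# Root-to-leaf paths of the example tree from the "Tree == Set of Paths" slide.
SUFFIXES = build_suffix_array(["AAB", "AAC", "AC", "ACE"])
```

Here `find(SUFFIXES, "AC")` locates the first candidate by binary search and then scans the adjacent entries, so reporting the hits adds time proportional to their number.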

Filtering on 1528 trees

Scalability

Parallel Processing. [Figure; its caption reads "… trees were used".]

Treesearch Review. Ancestor order matters; sibling order doesn't. Don't cares: * and ?. The distance metric is based on the number of path differences. The system is available; please see our web site.

Related Work.
S. Amer-Yahia, S. Cho, L.V.S. Lakshmanan, and D. Srivastava. Minimization of tree pattern queries. SIGMOD.
Z. Chen, H. V. Jagadish, F. Korn, N. Koudas, S. Muthukrishnan, R. T. Ng, and D. Srivastava. Counting twig matches in a tree. ICDE.
J. Cracraft and M. Donoghue. Assembling the tree of life: Research needs in phylogenetics and phyloinformatics. NSF Workshop Report, Yale University, 2000.

Tree Edit. Order of children matters. [Figure: a tree A with children B, C is turned into A' with children C, B by relabeling A to A', del(B), and ins(B).]

Tree Edit in General. Operations are relabel (A -> A'), delete (X), and insert (B). [Figure: a tree over A, X, C is turned into a tree over A', C, B by relabeling A to A', del(X), and ins(B).]

Review of Tree Edit. Generalizes string editing distance (with *) to trees. Time: O(|T1| |T2| depth(T1) depth(T2)). The basis for XMLdiff from IBM alphaWorks. See "Approximate Tree Pattern Matching" in Pattern Matching in Strings, Trees, and Arrays, A. Apostolico and Z. Galil (eds.), pp. …, Oxford University Press.
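For orientation, the string case that tree edit generalizes is the classic dynamic program below. It is a sketch of string edit distance only, with unit costs assumed for relabel, insert, and delete; it is not the tree algorithm and does not handle the * wildcard.

```python
def edit_distance(s, t):
    """Minimum number of relabels, inserts, and deletes turning s into t."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, a in enumerate(s, start=1):
        cur = [i]
        for j, b in enumerate(t, start=1):
            cur.append(min(prev[j] + 1,              # delete a
                           cur[j - 1] + 1,           # insert b
                           prev[j - 1] + (a != b)))  # relabel a to b, or keep
        prev = cur
    return prev[-1]
```

A string can be read as a tree in which every node has one child, which is why the tree version uses the same three operations.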

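Graph Matching Algorithms: Brute Force. [Figure: graph Ga (nodes 1, 2, 3), graph Gb (nodes 4, 5, 6, 7), and the full search tree of node pairings. From the root, node 1 is paired with each of 4, 5, 6, 7; each branch then pairs node 2 and node 3 with the remaining nodes, for example (1,4), (2,5), (3,6) or (1,4), (2,5), (3,7).]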
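Brute-force matching enumerates every injective assignment of query nodes to data nodes and checks the edges afterwards. The sketch below does this for unlabeled, undirected graphs. The node numbering follows the slide (Ga has nodes 1 to 3, Gb has nodes 4 to 7), but the edge lists are hypothetical, since the figure's edges are not recoverable from the transcript.

```python
from itertools import permutations

# Hypothetical edges: Ga is a triangle, Gb is a triangle with a pendant node.
GA_NODES, GA_EDGES = [1, 2, 3], [(1, 2), (2, 3), (1, 3)]
GB_NODES, GB_EDGES = [4, 5, 6, 7], [(4, 5), (5, 6), (4, 6), (6, 7)]

def brute_force_matches(q_nodes, q_edges, d_nodes, d_edges):
    """Try every injective mapping of query nodes; keep those preserving all edges."""
    d_edge_set = {frozenset(e) for e in d_edges}
    matches = []
    for image in permutations(d_nodes, len(q_nodes)):
        mapping = dict(zip(q_nodes, image))
        if all(frozenset((mapping[u], mapping[v])) in d_edge_set
               for u, v in q_edges):
            matches.append(mapping)
    return matches
```

With 3 query nodes and 4 data nodes this tests 4 * 3 * 2 = 24 assignments, which is the number of leaves in the full search tree; the count grows factorially with graph size.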

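Graph Matching Algorithms. [Figure: two pruned search trees over pairings of Ga and Gb nodes. Exact matching (Ullmann's algorithm): branches such as (1,4), (1,5), (2,4), (2,6), (3,4), (3,7) are explored and branches with bad connectivity are cut. Inexact matching (Nilsson's algorithm): the tree also contains pairings such as (1,_) and (2,_), which stand for a delete.]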
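The idea of cutting a branch as soon as its connectivity is bad can be sketched with plain backtracking. This is a simplified illustration in the spirit of the exact-matching tree, not Ullmann's algorithm itself (which also refines a candidate matrix), and the edge lists are hypothetical.

```python
# Hypothetical edges: Ga is a triangle, Gb is a triangle with a pendant node.
GA_NODES, GA_EDGES = [1, 2, 3], [(1, 2), (2, 3), (1, 3)]
GB_NODES, GB_EDGES = [4, 5, 6, 7], [(4, 5), (5, 6), (4, 6), (6, 7)]

def adjacency(nodes, edges):
    adj = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def backtrack_matches(q_nodes, q_edges, d_nodes, d_edges):
    """Extend a partial mapping one query node at a time, pruning bad branches."""
    q_adj, d_adj = adjacency(q_nodes, q_edges), adjacency(d_nodes, d_edges)
    results, visited = [], [0]

    def extend(mapping):
        if len(mapping) == len(q_nodes):
            results.append(dict(mapping))
            return
        q = q_nodes[len(mapping)]
        for d in d_nodes:
            if d in mapping.values():
                continue
            # Prune: every already-mapped neighbour of q must be adjacent to d.
            if all(mapping[n] in d_adj[d] for n in q_adj[q] if n in mapping):
                visited[0] += 1
                mapping[q] = d
                extend(mapping)
                del mapping[q]

    extend({})
    return results, visited[0]
```

The second return value counts the pairings actually explored, which makes it easy to compare against the size of the unpruned search tree.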

Complexity of Graph Matching Algorithms. Matching graphs of the same size (graph isomorphism): difficult and time consuming, but not proved to be NP-complete. Matching a small graph in a big graph (subgraph isomorphism): NP-complete.

Steps in Graph Searching. STEP 1: Filter the search space. We need indexing techniques to find the most relevant graphs and then the most relevant subgraphs. Filtering answers two questions quickly: How similar is the query to a database graph? Could a database graph G contain the query?

Steps in Graph Searching (continued).
STEP 2: Formulate the query
–Use wildcards
–Decompose the query into simple structures: a set of paths, a set of labels
STEP 3: Matching
–Traditional (sub)graph-to-graph matching techniques
–Combine the set of paths (from step 2)
–Application-specific techniques

Filtering Techniques (STEP 1). Content based: a bit vector of features. This is application dependent; use it when the feature set is rich, e.g. "the graph contains 5 benzene rings". Structure based (on the representation of the data): subgraph relations. Keep track of the paths (all or some) in the database graphs. Examples: DataGuide, 1-index, XISS, ATreeGrep, GraphGrep, Daylight Fingerprint, Dictionary Fingerprints (BCI).

Daylight Fingerprint (STEP 1). A fixed-size bit vector. For each graph in the database: find all the paths in the graph of length one up to a limit length; each path is used as a seed to compute a random number r, which is ORed into the fingerprint: fingerprint := fingerprint | r. [Daylight; BCI]
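A path-seeded fingerprint of this kind can be sketched as follows. This is an illustration of the idea, not Daylight's implementation: the vector size, the number of bits set per path, the choice to measure path length in nodes, and the function names are all assumptions made here. Python's `random.Random` accepts a string seed, so each path deterministically yields the same r.

```python
import random

def label_paths(labels, adj, max_len):
    """Label strings of all simple paths with 1..max_len nodes."""
    found = set()

    def walk(path):
        found.add("/".join(labels[n] for n in path))
        if len(path) < max_len:
            for nxt in adj[path[-1]]:
                if nxt not in path:
                    walk(path + [nxt])

    for start in labels:
        walk([start])
    return found

def fingerprint(labels, adj, max_len=3, size=64, bits_per_path=2):
    """OR together a pseudo-random bit pattern seeded by each path."""
    fp = 0
    for path in label_paths(labels, adj, max_len):
        rng = random.Random(path)      # the path string is the seed
        r = 0
        for _ in range(bits_per_path):
            r |= 1 << rng.randrange(size)
        fp |= r                        # fingerprint := fingerprint | r
    return fp

# Hypothetical data graph C-C-O-N and query graph C-C-O.
DATA = ({0: "C", 1: "C", 2: "O", 3: "N"}, {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]})
QUERY = ({0: "C", 1: "C", 2: "O"}, {0: [1], 1: [0, 2], 2: [1]})
```

Because every path of a subgraph is also a path of the containing graph, the query's bits must all be set in the data graph's fingerprint: `fingerprint(*QUERY) & fingerprint(*DATA) == fingerprint(*QUERY)`. A graph failing that test can be discarded without any matching; a graph passing it may still be a false positive.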

Daylight Fingerprint: Similarity (STEP 1). The similarity of two graphs is computed by comparing their fingerprints. Some similarity measures are the Tanimoto coefficient (the number of bits set in both fingerprints divided by the number of bits set in either) and Euclidean distance (geometric distance).
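Both measures are short to compute on fingerprints held as Python integers. The sketch below assumes that representation; treating two empty fingerprints as identical is a convention chosen here.

```python
def tanimoto(fp_a, fp_b):
    """Bits set in both fingerprints divided by bits set in either."""
    union = bin(fp_a | fp_b).count("1")
    if union == 0:
        return 1.0  # two empty fingerprints: treated as identical (assumption)
    return bin(fp_a & fp_b).count("1") / union

def euclidean(fp_a, fp_b):
    """Geometric distance between the two bit vectors."""
    # For 0/1 vectors the squared distance is the number of differing bits.
    return bin(fp_a ^ fp_b).count("1") ** 0.5
```

For example, 0b1100 and 0b1010 share one bit out of three set overall, so their Tanimoto coefficient is 1/3, and they differ in two positions, so their Euclidean distance is the square root of 2.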

T-Index (Milo/Suciu, ICDT 99) (STEP 1). A non-deterministic automaton (right graph) whose states represent the equivalence classes (left graph) produced by the Rabin-Scott algorithm (Aho) and whose transitions correspond to edges between objects in those classes. [Figure: left, a data graph with nodes Book, Editor, Chapter, Name, Title, Author, Keyword and values John, XML, Mary, Jack, OLAP; right, the index graph over the same labels, with states grouping objects such as {3,4} and {7,8}.]

LORE. Nodes: V-index, T-index, L-index (node labels, incoming labels, outgoing labels). DataGuide for root-to-leaf paths. [Figure: a data graph with nodes Book, Editor, Chapter, Name, Title, Author and values John, XML, Mary, Jack, OLAP, beside its DataGuide with states grouping objects such as {3,4}, {6,9}, and {7,8}.]

SUBDUE (STEP 3). Finds similar repetitive subgraphs in a single-graph database. It uses:
–An improvement over the inexact graph matching method proposed by Nilsson
–Minimum description length of subgraphs
–Domain-dependent knowledge
Applications: protein databases, image databases, Chinese character databases, CAD circuit data, and software source code.
An extension of SUBDUE (WebSUBDUE) has been applied to hypertext data.

GraphGrep. Glide: an interface to represent graphs, inspired by SMILES and XPath (STEP 2). Fingerprinting: to filter the database (STEP 1). A subgraph matching algorithm (STEP 3). References: D. Weininger, SMILES. Introduction and Encoding Rules, Journal of Chemical Information and Computer Sciences, 28:31, 1988. J. Clark and S. DeRose, XML Path Language (XPath).

Glide: query graph language. [Figure: the graph drawn for each expression.]
Node: a/
Edge: a/b/
Path: a/b/c/f/
Branches: a/(h/c/)b/

Glide: query graph language. [Figure: a cycle over c, f, i and a graph over a, h, c, d, i with two cycles.]
Cycle: c%1/f/i%1/
Cycles (c returns to a and starts its own cycle): a%1/h/c%1%2/d/i%2/

Glide: wildcards. [Figure: for each wildcard, the graphs between a and c that it matches.]
1. "." as in a/./c/
2. "*" as in a/*/c/
3. "?" as in a/?/c/
4. "+" as in a/+/c/
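The transcript lost the figure that defines the four wildcards, so the sketch below assumes the usual regular-expression reading: "." is exactly one node, "*" is zero or more nodes, "?" is zero or one node, and "+" is one or more nodes. Under that assumption a linear Glide path can be translated into a regular expression over "/"-terminated labels. Branches and cycle marks (%n) are not handled, and the function names are hypothetical.

```python
import re

# Assumed wildcard semantics, borrowed from regular expressions.
WILDCARDS = {
    ".": r"[^/]+/",        # exactly one node
    "*": r"(?:[^/]+/)*",   # zero or more nodes
    "?": r"(?:[^/]+/)?",   # zero or one node
    "+": r"(?:[^/]+/)+",   # one or more nodes
}

def glide_path_to_regex(expr):
    """Translate a linear Glide path such as 'a/*/c/' into a compiled regex."""
    parts = [p for p in expr.split("/") if p]
    return re.compile("".join(WILDCARDS.get(p, re.escape(p) + "/") for p in parts))

def matches(expr, path):
    """True if the concrete label path, e.g. 'a/b/c/', fits the Glide expression."""
    return glide_path_to_regex(expr).fullmatch(path) is not None
```

Under these assumed semantics, `matches("a/*/c/", "a/c/")` holds while `matches("a/+/c/", "a/c/")` does not, since "+" requires at least one intermediate node.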

Query Graphs in Glide. [Figure: the two query graphs, the first over a, b, c, d and the second over a, b, c, d, m, n, o, o.]
With wildcards: a%1/(./*/b/)?/c/d%1/
Without wildcards: a%1/(m/o/o/b/)n/c/d%1/