Published by Alvin Patterson. Modified over 9 years ago.
1
Algorithmics and Applications of Tree and Graph Searching Dennis Shasha, Jason T. L. Wang, and Rosalba Giugno Presenters: Jerod Watson & Christan Grant
2
Introduction Searching in Trees Approximate Containment Queries Path-Only Searches Extension to Trees Searching in Graphs Keygraph Searching in Graph DBs GraphGrep Subgraph Matching Conclusion
3
Introduction Modern search engines Keyword-based queries Impressive speed Several research efforts have attempted to generalize keyword search to keytree and keygraph searching
4
XQuery
5
AQUA Query
6
Query expressed as a tree pattern, termed “query tree”. DB can be represented as a single tree or as a set of trees. Each tree can be ordered or unordered. Queries are often concerned with the parent-child, ancestor-descendant, or path relationships among nodes. Queries can be expressed by containment mappings.
7
Query tree may contain fixed length don’t cares (FLDCs) ex. “?” Query tree may contain variable length don’t cares (VLDCs) ex. “*” This class of queries referred to as approximate containment (AC) queries
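As an illustration of the two kinds of don't cares, here is a minimal Python sketch that matches a single label path against a pattern. It assumes "?" stands for exactly one label and "*" for zero or more labels; the function name `matches` and the list-of-labels representation are hypothetical and not taken from the presentation, which works on whole trees rather than single paths.

```python
from typing import Sequence


def matches(pattern: Sequence[str], path: Sequence[str]) -> bool:
    """Return True if the label path matches the pattern.

    Assumed semantics: "?" (FLDC) stands for exactly one label,
    "*" (VLDC) stands for zero or more labels, and any other
    symbol must match a label literally.
    """
    # reachable[j] is True when the pattern read so far can match path[:j].
    reachable = [True] + [False] * len(path)
    for symbol in pattern:
        nxt = [False] * (len(path) + 1)
        if symbol == "*":
            # A VLDC may absorb any number of labels, including none.
            seen = False
            for j in range(len(path) + 1):
                seen = seen or reachable[j]
                nxt[j] = seen
        else:
            # A literal label or an FLDC consumes exactly one label.
            for j in range(1, len(path) + 1):
                if reachable[j - 1] and symbol in ("?", path[j - 1]):
                    nxt[j] = True
        reachable = nxt
    return reachable[len(path)]
```

For example, the pattern John, ?, Mary matches only paths with exactly one node between John and Mary, while John, *, Mary also matches when Mary is a direct child of John.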
8
Path-Only Searches Many AC queries are concerned with paths only. Ex. “Find the descendants of Mary who is a child of John” XISS is an indexing and querying system designed to support regular path expressions
9
Extension to Trees Pathfix algorithm Phase 1: Encodes each root-to-leaf path of every data tree into a suffix array DB Phase 2: Compares the query tree Q with each data tree D in the DB allowing a difference of DIFF
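Phase 1 can be pictured with a small Python sketch. It is a simplification under stated assumptions: trees are nested `(label, children)` tuples, the "suffix array" is a sorted list of path suffixes searched with `bisect`, and only exact lookup of one query path is shown. The DIFF-tolerant comparison of Phase 2 is not implemented, and all names here are hypothetical.

```python
from bisect import bisect_left


def root_to_leaf_paths(tree):
    """Yield every root-to-leaf label path of a (label, children) tree."""
    label, children = tree
    if not children:
        return [(label,)]
    return [(label,) + rest
            for child in children
            for rest in root_to_leaf_paths(child)]


def build_suffix_index(trees):
    """Sorted (suffix, tree_id) pairs for every suffix of every path."""
    entries = set()
    for tree_id, tree in enumerate(trees):
        for path in root_to_leaf_paths(tree):
            for start in range(len(path)):
                entries.add((path[start:], tree_id))
    return sorted(entries)


def trees_containing(index, query_path):
    """Ids of trees where query_path occurs contiguously on some path."""
    query_path = tuple(query_path)
    # All suffixes that start with query_path are adjacent in sorted order.
    pos = bisect_left(index, (query_path,))
    hits = set()
    while pos < len(index) and index[pos][0][:len(query_path)] == query_path:
        hits.add(index[pos][1])
        pos += 1
    return hits
```

Storing every suffix lets a query path be found anywhere along a root-to-leaf path with one binary search, at the cost of a larger index.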
10
Handling Don’t Cares Partition the query into connected subtrees having no don’t cares. Match each of those don’t-care-free subtrees with data trees in the DB. For the matched subtrees that belong to the same data tree, determine whether they combine to match the query based on the matching semantics of the don’t cares. Filtering
11
Implementation ATreeGrep
13
Graphs
14
Graphs Abstract data type of elements (nodes or vertices) interconnected by edges. A graph generalizes a tree: there is no constraint on the number of paths between two nodes. No root. Graph may contain cycles.
15
Keygraph Searching Searching for a particular graph or order of elements inside a large graph (e.g. the Internet). Searching for a particular graph or structure among several graphs (e.g. chemical compounds). Use indexing to reduce complexity.
16
Keygraph Searching Three basic steps 1. Reduce the search space by filtering 2. Formulate query into simple structures 3. Match
17
Keygraph Searching (survey) A* algorithm GraphDB Daylight Lore
18
A* Seminal work by Nilsson (1980). Route-finding algorithm that keeps track of its visited nodes and the distance it has traveled. Applications: protein databases (discovery and search), image databases, Chinese character databases, CAD circuit data, and software source code.
19
A* Pseudocode
function A*(start, goal)
    var closed := the empty set
    // q is a priority queue of paths, ordered by path cost plus heuristic estimate
    var q := make_queue(path(start))
    while q is not empty
        var p := remove_first(q)
        var x := the last node of p
        if x in closed
            continue
        if x = goal
            return p
        add x to closed
        foreach y in successors(x)
            enqueue(q, p, y)
    return failure
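The pseudocode can be turned into a short runnable Python sketch. It assumes the caller supplies `successors(x)`, yielding `(neighbor, edge_cost)` pairs, and `heuristic(x)`, an estimate of the remaining cost to the goal; both names and the `(path, cost)` return value are illustrative choices, not part of the slides. Because a node is closed the first time it is expanded, the result is guaranteed optimal only when the heuristic is consistent.

```python
import heapq
from itertools import count


def a_star(start, goal, successors, heuristic):
    """Return (path, cost) from start to goal, or None if unreachable."""
    closed = set()
    tie = count()  # tie-breaker so the heap never has to compare paths
    # Each queue entry: (cost so far + heuristic, tie, cost so far, path).
    queue = [(heuristic(start), next(tie), 0, [start])]
    while queue:
        _, _, cost, path = heapq.heappop(queue)
        x = path[-1]
        if x in closed:
            continue
        if x == goal:
            return path, cost
        closed.add(x)
        for y, step in successors(x):
            if y not in closed:
                new_cost = cost + step
                heapq.heappush(
                    queue,
                    (new_cost + heuristic(y), next(tie), new_cost, path + [y]),
                )
    return None
```

With a heuristic that always returns 0, this behaves like uniform-cost (Dijkstra-style) search.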
20
GraphDB Specifies a data model and query model. 1. Queries are in the form of regular expressions 2. Nodes are classes representing data objects 3. Edges are classes to store paths in the database 4. Path classes and indexing data structures are used to index the database. Provides graph search operations for: the shortest path between two nodes; subgraphs within a given range of a starting node.
21
GraphDB
22
Daylight "Provide the best known computer algorithms for chemical information processing to those who need them." Uses fingerprinting to index and prune.
23
ChemDB (Contains 6.5 million unique structures or subgraphs)
24
Lore Database management system for XML. Data modeled as a rooted, labeled graph. Indexed in four ways for fast regular expression use: Vindex, Tindex, Lindex, Pindex (DataGuide)
25
Lore 1) Vindex: for each label l, indexes all nodes with incoming edges labeled l whose atomic value satisfies some condition. 2) Tindex: a text index over all nodes with incoming l-labeled edges whose string values contain specific words. 3) Lindex: a link index over nodes with outgoing l-labeled edges. 4) Pindex (DataGuide): indexes all nodes reachable from the root through a labeled path. The DataGuide is used by all queries that start from the root. Other queries traverse paths using indexes 1-3, pruning whatever does not match.
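A toy Python sketch of the first three indexes may help. It is not Lore's implementation: the edge list of `(parent, label, child)` triples, the `values` dictionary for atomic nodes, and the function name `build_indexes` are all assumptions made for illustration, and the Pindex (DataGuide) is left out.

```python
from collections import defaultdict


def build_indexes(edges, values):
    """Build toy Vindex, Tindex and Lindex structures.

    edges:  iterable of (parent, label, child) triples
    values: {node: atomic value} for nodes that carry a value
    """
    vindex = defaultdict(list)  # label -> [(value, node)] via incoming edges
    tindex = defaultdict(set)   # (label, word) -> nodes whose text has word
    lindex = defaultdict(set)   # label -> nodes with an outgoing such edge
    for parent, label, child in edges:
        lindex[label].add(parent)
        if child in values:
            value = values[child]
            vindex[label].append((value, child))
            if isinstance(value, str):
                for word in value.lower().split():
                    tindex[(label, word)].add(child)
    return vindex, tindex, lindex
```

A condition such as "age greater than 30" then becomes a scan of `vindex["age"]`, and a keyword lookup becomes a single dictionary access in `tindex`.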
26
T-index (1999) A data structure to index semistructured database nodes that are reachable from several regular path expressions. T-index may be more efficient than P-index because it relaxes some constraints. Reportedly, in a graph of size 1500 the T-index is 13% of the database.
27
GraphGrep Uses variable length paths (cyclic or acyclic) to index DB. This provides for efficient filtering. Nodes have ids (numbers) and labels (letters).
28
GraphGrep Index Construction 1. Choose a maximum indexing path length l_p 2. Create “path-representation” 3. Create fingerprint
29
GraphGrep Filtering the Database 1. The query graph is parsed and its fingerprint is built 2. Fingerprints are compared: (a) if a graph has at least one value in its fingerprint that is less than the corresponding value in the query fingerprint, it is discarded; (b) remaining graphs may contain one or more subgraphs matching the query
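The fingerprint-and-filter idea can be sketched in Python as follows. This is a simplified stand-in, not GraphGrep's code: it counts only simple paths (no repeated nodes) of at most `lp` nodes, keys the counts directly by label sequence instead of hashing them, and uses hypothetical names (`fingerprint`, `may_contain`) and a plain dictionary representation for graphs.

```python
from collections import Counter


def fingerprint(labels, adj, lp):
    """Count label-paths over all simple paths with at most lp nodes.

    labels: {node_id: label}
    adj:    {node_id: [neighbor ids]}
    """
    counts = Counter()

    def extend(path):
        counts[tuple(labels[n] for n in path)] += 1
        if len(path) < lp:
            for nxt in adj[path[-1]]:
                if nxt not in path:
                    extend(path + [nxt])

    for node in labels:
        extend([node])
    return counts


def may_contain(graph_fp, query_fp):
    """Filter test: discard a graph if any of its counts is too small."""
    return all(graph_fp[key] >= need for key, need in query_fp.items())
```

The filter never discards a true answer under these assumptions, because every simple path of the query maps to a distinct simple path with the same labels in any graph that contains the query. Graphs that pass still have to go through the matching step.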
30
GraphGrep Filtering the Database Takes time linear in the size of the database, but discards 99% of the database!
31
GraphGrep Finding Subgraphs Matching with Queries The branches of a depth-first traversal tree of the query are decomposed into sequences of overlapping label-paths (patterns)
32
GraphGrep Overlaps 1. The last node in a pattern coincides with the first node of the next pattern (e.g. ABCB with l_p = 3 is split into ABC and CB) 2. If a node has branches, it is included in the first pattern of every branch 3. The first node in a cycle is visited twice
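Rule 1 can be shown with a few lines of Python. The sketch handles only a single unbranched path and, following the slide's example, treats the length bound as the maximum number of nodes per pattern; branches and cycles (rules 2 and 3) are not covered, and the function name is hypothetical.

```python
def split_into_patterns(path, lp):
    """Split a label path into patterns of at most lp nodes.

    Consecutive patterns overlap in exactly one node: the last node of
    one pattern is the first node of the next.
    """
    if lp < 2:
        raise ValueError("lp must be at least 2 for patterns to overlap")
    if not path:
        return []
    patterns = []
    start = 0
    # Keep cutting while at least one more node remains after `start`;
    # a single-node path still produces one pattern.
    while start < len(path) - 1 or not patterns:
        patterns.append(path[start:start + lp])
        start += lp - 1
    return patterns
```

Calling `split_into_patterns("ABCB", 3)` reproduces the example on the slide: ABC followed by CB.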
33
GraphGrep Matching Example 1. Select the set of paths 2. Combine lists with constraints 3. Remove lists with equal node ids in non-overlapping positions
34
GraphGrep Techniques for Queries with Wildcards Consider the parts of the query graph that lie between wildcards (like Pathfix). Take the Cartesian product of the components that match. An entry in the Cartesian product is valid if a path (with length given by the wildcards) exists between its nodes.
35
GraphGrep 1 GHz Pentium III. NCI databases (1,000 – 16,000 nodes). Average 20 nodes in db (max 270 nodes). Queries of 13-189 nodes. l_p = 4 and 10
36
GraphGrep Running time is linear in the size of the DB. Different values of l_p influence running time.
37
Conclusions / Questions Searching in Trees Introduces ATreeGrep Searching in Graphs Introduces GraphGrep
38
Thanks to: God Class Wikipedia Various other Googled sources