Published by Autumn Alvarez. Modified over 11 years ago.
1
XML data management and approximate string matching
Jiaheng Lu, Key Lab of Data Engineering and Knowledge Engineering, Renmin University of China
November 22, 2010. Presentation at TeleCom ParisTech
2
Research experience
Associate Professor, Renmin University of China: XML data management, cloud data management, approximate search
Post-doc, University of California, Irvine: data integration, approximate string matching
PhD, National University of Singapore: XML data management
3
Outline
XML data management: XML twig query processing; XML keyword search; graphical and interactive XML query processing
Approximate string matching: approximate string search; approximate member extraction
4
XML twig query processing
XPath: Section[Title]/Paragraph//Figure
Twig pattern (figure): Section is the root, with a child Title and a child Paragraph; Figure is a descendant of Paragraph.
5
XML twig query processing (Cont.)
Problem statement: given a query twig pattern Q and an XML database D, we need to compute ALL the answers to Q in D.
E.g. consider the query Section with branches title and figure, over a document with elements s1, s2, p1, t1, t2, f1 (tree figure not reproduced).
Query solutions: (s1, t1, f1), (s2, t2, f1), (s1, t2, f1)
6
Previous work: TwigStack
TwigStack [1] is a holistic algorithm for XML twig matching based on the containment labeling scheme.
Two steps in TwigStack: (1) intermediate path solutions are output for each root-to-leaf path of the query; and (2) these intermediate path solutions are merged to get the final results.
[1] N. Bruno, N. Koudas, and D. Srivastava. Holistic twig joins: optimal XML pattern matching. In Proceedings of ACM SIGMOD, 2002.
7
Running example: TwigStack algorithm
Query: s with two descendant branches, t and f (paths s//t and s//f). Labels are (start, end, level).
Data streams: s: (1,12,1), (4,11,2); t: (2,3,2), (5,6,3); f: (8,9,4)
Output intermediate path solutions:
s//t: (1,12,1)(2,3,2); (1,12,1)(5,6,3); (4,11,2)(5,6,3)
s//f: (1,12,1)(8,9,4); (4,11,2)(8,9,4)
Final results: (1,12,1)(2,3,2)(8,9,4); (1,12,1)(5,6,3)(8,9,4); (4,11,2)(5,6,3)(8,9,4)
(Figure: state of the stacks for s, t and f.)
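The two steps of the running example can be replayed with a naive sketch. This is not TwigStack itself (no stacks, no streaming cursors); it only checks containment between labels and then joins the path solutions on s. The function names are illustrative, and the (start, end, level) reading of the labels is an assumption.

```python
# Containment labels (start, end, level) taken from the running example.
streams = {
    "s": [(1, 12, 1), (4, 11, 2)],
    "t": [(2, 3, 2), (5, 6, 3)],
    "f": [(8, 9, 4)],
}

def is_ancestor(a, d):
    """a is an ancestor of d iff the interval of a strictly contains that of d."""
    return a[0] < d[0] and d[1] < a[1]

def path_solutions(anc_tag, desc_tag):
    """Step 1 (naive): all (ancestor, descendant) pairs for one query path."""
    return [(a, d) for a in streams[anc_tag] for d in streams[desc_tag]
            if is_ancestor(a, d)]

def merge(st_paths, sf_paths):
    """Step 2: join the two lists of path solutions on the shared s element."""
    return [(s, t, f) for (s, t) in st_paths for (s2, f) in sf_paths if s == s2]

s_t = path_solutions("s", "t")
s_f = path_solutions("s", "f")
final = merge(s_t, s_f)
for row in final:
    print(row)
```

The three printed rows are the three final results listed on the slide.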
8
Limitations of TwigStack
(1) TwigStack may output many useless intermediate results for queries with parent-child relationships.
(2) TwigStack cannot process XML twig queries with ordered predicates, such as the preceding and following axes in XPath.
(3) TwigStack cannot answer queries with wildcards in branching nodes. E.g. for a query whose branching node is the wildcard * with branches B and C, the parent of B should be an ancestor of C.
9
XML twig query processing (Cont.)
Several efficient pattern matching algorithms: TJFast (VLDB 05) (citation: 173), iTwigJoin (SIGMOD 05), TwigStackList (CIKM 04), TreeMatch (TKDE 10)
10
Motivation: a new labeling scheme
TwigStackList and iTwigJoin are both based on the containment labeling scheme.
Why not try the Dewey labeling scheme for XML twig pattern queries?
Oh, it is really a novel idea!
11
Original Dewey labeling scheme
In the Dewey labeling scheme, each element is represented by an integer sequence: (i) the root is labeled with the empty string ε; (ii) for a non-root element u, label(u) = label(s).x, where u is the x-th child of s.
For example (figure): a document with elements s1, s2, t1, t2, f1, f2 carries the labels ε, 1, 2, 3, 2.1, 2.2.
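A minimal sketch of the labeling rule, assuming a document given as nested (tag, children) pairs. The tree shape below is hypothetical; it was chosen only so that the label set matches the one on the slide, since the figure's actual element-to-label mapping is not recoverable.

```python
def dewey_labels(tree, label=()):
    """Yield (label, tag) in document order; the x-th child of s gets label(s).x."""
    tag, children = tree
    yield label, tag
    for x, child in enumerate(children, start=1):
        yield from dewey_labels(child, label + (x,))

def show(label):
    """Render a label tuple; the root's empty sequence is written as ε."""
    return ".".join(map(str, label)) or "ε"

# Hypothetical document: a root s with children t, s, f; the inner s has children t, f.
doc = ("s", [("t", []), ("s", [("t", []), ("f", [])]), ("f", [])])

labels = [(show(lab), tag) for lab, tag in dewey_labels(doc)]
print(labels)
```

With tuple labels, u is an ancestor of v exactly when the label of u is a proper prefix of the label of v.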
12
Main problem of the original Dewey
If we use the original Dewey labeling scheme to answer a twig query, we need to read the labels for all query nodes. Thus, this is not a better solution than previous algorithms.
Solution: extend the original Dewey labeling scheme so that, given the label of any element e, we can derive the path of e from this label alone.
13
Modular function
We need to know some schema information: a DTD (Document Type Definition) or an XML Schema.
Given the DTD rule book → author, title, chapter*, our solution uses a modular function to create a mapping between an element tag and an integer. We define X_author mod 3 = 0, X_title mod 3 = 1, X_chapter mod 3 = 2, where X_t is the last integer of the label of an element with tag t. The modulus 3 is the number of distinct tags under book.
Example (figure): book is labeled ε; its children author, title, chapter, chapter are labeled 0, 1, 2, 5. Why is the second chapter labeled 5 and not 3, as in the original Dewey? Because 3 mod 3 = 0 would denote author; 5 is the next integer with 5 mod 3 = 2.
14
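Derive element tag
From a label, we can derive the tag name of the element.
Recall the DTD rule book → author, title, chapter* and the definitions X_author mod 3 = 0, X_title mod 3 = 1, X_chapter mod 3 = 2.
Example (figure): under book (ε), the children labeled 0, 1, 2 and 5 are shown with their tags hidden (?); taking each label mod 3 recovers author, title, chapter and chapter.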
15
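More examples for assigning labels
Let us consider a more complicated DTD rule: a → (b | c)*, d?, c+
We define X_b mod 3 = 0, X_c mod 3 = 1, X_d mod 3 = 2. (Why do we use mod 3 instead of 4? Because only three distinct tags, b, c and d, can appear under a, although c occurs twice in the rule.)
Example (figure): a is labeled ε; its children b, d, c, c are labeled 0, 2, 4, 7.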
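Both labeling examples can be reproduced with a short sketch. It assumes the assignment rule is "take the smallest integer larger than the left sibling's component that has the right remainder"; this agrees with the two figures, but the rule itself and the function name are assumptions, not taken from the slides.

```python
def assign_components(child_tags, tag_order):
    """Last label component for each child, given the distinct child tags in DTD order.

    Assumed rule: a child with tag index k (under n distinct tags) gets the smallest
    integer x greater than its left sibling's component such that x mod n == k.
    """
    n = len(tag_order)
    index = {tag: i for i, tag in enumerate(tag_order)}
    components, prev = [], -1
    for tag in child_tags:
        x = prev + 1
        while x % n != index[tag]:
            x += 1
        components.append(x)
        prev = x
    return components

# book -> author, title, chapter*
book = assign_components(["author", "title", "chapter", "chapter"],
                         ["author", "title", "chapter"])
# a -> (b | c)*, d?, c+   with b = 0, c = 1, d = 2 (mod 3)
a = assign_components(["b", "d", "c", "c"], ["b", "c", "d"])
print(book, a)
```

The components stay increasing in document order, so sibling order is still readable from the labels.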
16
Derive the path from a label
By following a finite state transducer (FST), we can recursively derive the whole path from any extended Dewey label.
For example, DTD: book → author, title, chapter*; chapter → (paragraph | section)*; section → (paragraph | section)*
FST (figure): from book, mod 3 = 0 leads to author, mod 3 = 1 to title, mod 3 = 2 to chapter; from chapter and from section, mod 2 = 0 leads to paragraph and mod 2 = 1 to section.
Question: given the label 5.1.0, what is the corresponding path?
17
Derive the path from a label (Cont.)
DTD: book → author, title, chapter*; chapter → (paragraph | section)*; section → (paragraph | section)*
Following the FST transitions for 5.1.0 (the red path in the figure): 5 mod 3 = 2 gives chapter, 1 mod 2 = 1 gives section, 0 mod 2 = 0 gives paragraph.
So 5.1.0 denotes book/chapter/section/paragraph.
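The transducer can be sketched as a table from each parent tag to its child tags in DTD order; one transition is then a single modulo operation. The table name and function name are illustrative.

```python
# Child tags per parent tag, in the order fixed by the DTD of the example.
FST = {
    "book": ["author", "title", "chapter"],
    "chapter": ["paragraph", "section"],
    "section": ["paragraph", "section"],
}

def derive_path(label, root="book"):
    """Decode an extended Dewey label (root labeled ε) into its tag path."""
    path = [root]
    for x in (int(c) for c in label.split(".")):
        children = FST[path[-1]]                 # state = tag of the current element
        path.append(children[x % len(children)])
    return "/".join(path)

print(derive_path("5.1.0"))
```

No document access is needed: the label and the schema alone determine the path.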
18
Two properties of extended Dewey
Find ancestor labels: from the label of any element, we can derive the labels of all its ancestors.
Find ancestor names: from the label of any element, we can derive the tag names of all its ancestors.
These two properties enable us to design a new and efficient algorithm for XML twig pattern matching.
19
A new algorithm: TJFast
For each leaf node n in the query, there is a corresponding input stream T_n. T_n contains the extended Dewey labels of the elements with tag n, arranged in document order.
For each branching node b of the twig pattern, there is a corresponding set S_b, which contains elements that may participate in query answers. (What is the difference compared to TwigStackList?)
At any point of the computation, the size of set S_b is bounded by the depth of the XML document.
20
An example for TJFast algorithm
Query: A with a descendant D (A//D) and a child B that has a descendant C (A/B//C).
DTD: a → a*, d*, b*; b → d*, c*; d → c*
Document (extended Dewey labels): a1 = 0 (root), a2 = 0.0, d1 = 0.0.1, a3 = 0.3, d2 = 0.3.1, b1 = 0.3.2, c1 = 0.3.2.1, b2 = 0.5, d3 = 0.5.0, c2 = 0.5.0.0
Streams: T_D: 0.0.1, 0.3.1, 0.5.0; T_C: 0.3.2.1, 0.5.0.0
Set for the branching node A: { }
Why are there only two streams?
21
An example for TJFast algorithm (Cont.)
Streams: T_D: 0.0.1, 0.3.1, 0.5.0; T_C: 0.3.2.1, 0.5.0.0. Set for A: { }
Using the finite state transducer of the extended Dewey labeling scheme, we derive the paths of the first label in each stream: 0.0.1 gives a1/a2/d1 and 0.3.2.1 gives a1/a3/b1/c1.
22
An example for TJFast algorithm (Cont.)
Streams: T_D: 0.0.1, 0.3.1, 0.5.0; T_C: 0.3.2.1, 0.5.0.0. Set for A: { }
Both a1 and a3 may participate in query answers. (Why not a2?)
23
An example for TJFast algorithm (Cont.)
Then we insert a1 and a3 into the set for A: {a1, a3}
Output path solutions: A//D: (a1, d1); A/B//C: (a3, b1, c1)
24
An example for TJFast algorithm (Cont.)
Move the cursor of T_D from d1 to d2. Set for A: {a1, a3}
Output path solutions: A//D: (a1, d1), (a1, d2), (a3, d2); A/B//C: (a3, b1, c1)
25
An example for TJFast algorithm (Cont.)
Move the cursor of stream T_D from d2 to d3. Set for A: {a1, a3}
Output path solutions: A//D: (a1, d1), (a1, d2), (a3, d2), (a1, d3); A/B//C: (a3, b1, c1)
26
An example for TJFast algorithm (Cont.)
Move the cursor of stream T_C from c1 to c2. Set for A: {a1, a3}
Output path solutions: A//D: (a1, d1), (a1, d2), (a3, d2), (a1, d3); A/B//C: (a3, b1, c1), (a1, b2, c2)
27
Sort and merge-join in TJFast
Phase 1: output the intermediate path solutions for A//D and A/B//C.
Phase 2: sort the path solutions and merge-join them to obtain the final solutions.
(Figure: the listed path solutions and joined results are not reproduced.)
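The key idea of the example, matching a whole root-to-leaf query path from a leaf label alone, can be sketched as follows. This is a naive illustration, not TJFast: it keeps no set S_b for the branching node, so phase 1 also emits (a2, d1), which the slides filter out; the join in phase 2 discards it anyway. Elements are identified by their labels, and all function names are hypothetical.

```python
# Child tags per parent in DTD order: a -> a*, d*, b*; b -> d*, c*; d -> c*
CHILD_TAGS = {"a": ["a", "d", "b"], "b": ["d", "c"], "d": ["c"]}

T_D = ["0.0.1", "0.3.1", "0.5.0"]      # stream of the leaf query node D
T_C = ["0.3.2.1", "0.5.0.0"]           # stream of the leaf query node C

def tag_path(label):
    """Tag names from the root to the element (the first component is the root a)."""
    tags = ["a"]
    for x in (int(c) for c in label.split(".")[1:]):
        kids = CHILD_TAGS[tags[-1]]
        tags.append(kids[x % len(kids)])
    return tags

def ancestor_labels(label):
    """Labels of all ancestors and of the element itself, root first (label prefixes)."""
    parts = label.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]

def match_a_desc_d(label):
    """All (A, D) pairs for the path A//D ending at this D label."""
    tags, labs = tag_path(label), ancestor_labels(label)
    if tags[-1] != "d":
        return []
    return [(labs[i], label) for i in range(len(tags) - 1) if tags[i] == "a"]

def match_a_child_b_desc_c(label):
    """All (A, B, C) triples for the path A/B//C ending at this C label."""
    tags, labs = tag_path(label), ancestor_labels(label)
    if tags[-1] != "c":
        return []
    return [(labs[i], labs[i + 1], label) for i in range(len(tags) - 2)
            if tags[i] == "a" and tags[i + 1] == "b"]

# Phase 1: path solutions derived from leaf labels only.
paths_d = [p for lab in T_D for p in match_a_desc_d(lab)]
paths_c = [p for lab in T_C for p in match_a_child_b_desc_c(lab)]
# Phase 2: join on the branching node A (a nested loop stands in for sort and merge-join).
final = sorted((a, d, b, c) for (a, d) in paths_d for (a2, b, c) in paths_c if a == a2)
for row in final:
    print(row)
```

Under the label assignment given for the example document, the four joined rows correspond to (a1, d1, b2, c2), (a1, d2, b2, c2), (a1, d3, b2, c2) and (a3, d2, b1, c1).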
28
TJFast+L
By applying the extended Dewey labeling scheme to the tag+level streaming scheme, we propose the TJFast+L algorithm, an extension of TJFast.
Two benefits of TJFast+L over TJFast: it reduces I/O cost by reading fewer elements, and it enlarges the optimal query classes.
29
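Optimal query classes
(Figure: two example twig patterns over the nodes A, B, C, D illustrating the query classes "only P-C in all edges" and "only A-D in branching edges", with regions marked as the optimal class of TJFast and the optimal class of TJFast+L.)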
30
XML twig query processing: publications
Jiaheng Lu, Ting Chen, Tok Wang Ling: Efficient processing of XML twig patterns with parent child edges: a look-ahead approach. CIKM 2004:533-542
Jiaheng Lu, Tok Wang Ling, Chee Yong Chan, Ting Chen: From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching. VLDB 2005:193-204
Jiaheng Lu, Tok Wang Ling: Labeling and Querying Dynamic XML Trees. APWeb 2004:180-189
Jiaheng Lu, Ting Chen, Tok Wang Ling: TJFast: effective processing of XML twig pattern matching. WWW (Special interest tracks and posters) 2005:1118-1119
Jiaheng Lu, Tok Wang Ling, Tian Yu, Changqing Li, Wei Ni: Efficient Processing of Ordered XML Twig Pattern. DEXA 2005:300-309
Jiaheng Lu: Benchmarking Holistic Approaches to XML Tree Pattern Query Processing - (Extended Abstract of Invited Talk). DASFAA Workshops 2010:170-178
Tian Yu, Tok Wang Ling, Jiaheng Lu: TwigStackList-: A Holistic Twig Join Algorithm for Twig Query with Not-Predicates on XML Data. DASFAA 2006:249-263
Zhifeng Bao, Tok Wang Ling, Jiaheng Lu, Bo Chen: SemanticTwig: A Semantic Approach to Optimize XML Query Processing. DASFAA 2008:282-298
Ting Chen, Jiaheng Lu, Tok Wang Ling: On Boosting Holism in XML Twig Pattern Matching using Structural Indexing Techniques. SIGMOD 2005:455-466
……
31
Outline
XML data management: XML twig query processing; XML keyword search; graphical and interactive XML query processing
32
XQuery vs. keyword search
Task: query papers by Mike.
XQuery (complicated): for $a in doc("bib.xml")//author, $n in $a/name where $n = "Mike" return $a//inproceedings
Keyword search: Mike inproceedings
33
The proposed keyword search returns the set of smallest trees containing all keywords.
(Figure: a bib document with two author subtrees, Mike and John, each with name, publications (inproceedings and article entries with title and year) and hobby, e.g. "Paper folding" and "Read book". Keywords shown: Mike, hobby, article, 2009, Paper.)
34
XML keyword search
–Search intention identification
–Query result retrieval
–Result ranking
–Extend the original TF*IDF from text databases to XML databases, while capturing the hierarchical structure of XML data
–Detailed paper: Effective XML Keyword Search with Relevance Oriented Ranking. ICDE 2009:517-528 (one of the best papers, invited to the TKDE journal)
35
XML keyword search
Inspired by IR-style keyword search on the web
Enables users to access information in XML databases
XML data is modeled as a rooted, labeled tree
Recent research efforts: efficiency and effectiveness
36
Capture the user's search intention: identify the target that the user intends to search for, and infer the predicate constraints that the user intends to search via.
Result ranking: rank the query results according to their objective relevance to the user's search intention.
37
State of the art
Search semantics design:
LCA (Lowest Common Ancestor): node v is an LCA of keyword set K = {w_1, w_2, …, w_k} if the sub-tree rooted at v contains at least one occurrence of all keywords in K, after excluding the sub-elements that already contain all keywords in K.
SLCA (Smallest LCA): node v is an SLCA of keyword set K = {w_1, w_2, …, w_k} if (1) v is an LCA of K and (2) no proper descendant of v is an LCA of K.
XSeek: infers the search intention based on the concept of objects and an analysis of the matching between keywords and data nodes.
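A brute-force sketch of the SLCA semantics, assuming nodes are identified by Dewey labels stored as tuples and that each keyword comes with the list of nodes containing it. It uses the common reading of SLCA (a node whose sub-tree contains every keyword while no proper descendant's sub-tree does); the data below is hypothetical and the code is meant for clarity, not efficiency.

```python
def slca(keyword_lists):
    """keyword_lists: for each keyword, the Dewey labels (tuples) of its matching nodes."""
    def covers(v, label):
        """True if v is label itself or one of its ancestors (prefix test)."""
        return label[:len(v)] == v

    # Every ancestor-or-self of a match is a candidate.
    candidates = {lab[:i] for labs in keyword_lists for lab in labs
                  for i in range(len(lab) + 1)}
    # Nodes whose sub-tree contains at least one match of every keyword.
    contains_all = {v for v in candidates
                    if all(any(covers(v, lab) for lab in labs) for labs in keyword_lists)}
    # Keep only the smallest ones: no proper descendant also contains all keywords.
    return sorted(v for v in contains_all
                  if not any(u != v and covers(v, u) for u in contains_all))

# Hypothetical bib document: author 1 is (1,), author 2 is (2,).
mike = [(1, 1)]                       # name of author 1 contains "Mike"
xml = [(1, 2, 1, 1), (2, 2, 1, 1)]    # one title under each author contains "XML"
print(slca([mike, xml]))
```

Here the root also contains both keywords, but it is not returned because its descendant (1,) already does.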
38
State of the art (cont)
Efficient result retrieval: designed based on a certain search semantics (XKSearch, Multiway SLCA, etc.)
Result ranking: XRANK, XKSearch, EASE. They only consider structural compactness of the matching results, keyword proximity, and similarity at node level.
39
Problems unaddressed
Existing work does not address the user's search intention adequately!
Meaningfulness of query results: SLCA is less meaningful in many cases.
Keyword ambiguity problems: 1. A keyword can appear both as an XML node type and as the text value of some other nodes. 2. A keyword can appear in the text values of different XML node types and carry different meanings.
Neither SLCA nor XSeek addresses keyword ambiguity well.
40
Problems: keyword ambiguity
Q = customer, interest, art
Ambiguity 1: customer, interest; Ambiguity 2: art
Intention: find customers whose interest is art.
Less relevant or irrelevant results are also returned: C1, C3, B1.
(Figure: the subtree of customer C2, John Martin, whose interest is art.)
41
Problems: keyword ambiguity (cont)
Q = customer, interest, art
art can be the value of an interest node (C2, C4), of a name node (C3), of a street node of a customer (C1), or of a title node of a book (B1).
customer can be the tag name of a customer node, or (part of) the value of the title of a book (B1).
How should we rank C1 to C4 and B1?
(Figure: the storeDB tree. Customers: C1, Mary Smith, street "1 Art Street", interest fashion; C2, John Martin, interest art; C3, Art Smith, interest rock music; C4, interest art. Books: B1, titled "Art of Customer Interest Care"; B2.)
42
Objectives & challenges
Challenges:
I. How to decide which sub-tree(s) with appropriate node types capture the information the user wants
II. How to return sub-trees of an appropriate size (i.e. containing enough but not overwhelming information)
III. How to rank those sub-trees by their relevance
Objective: address the following as a single problem
–Search intention identification
–Query result retrieval
–Result ranking
–Extend the original TF*IDF from text databases to XML databases, while capturing the hierarchical structure of XML data
43
Challenges
Difficulty in applying TF*IDF to XML:
An XML database carries semantic information, while a text database contains pure text. XML TF*IDF must be aware of the underlying semantics.
All contents of XML data are stored in leaf nodes only.
What is the analogy of a flat document in XML? Sub-trees classified according to their prefix paths.
The normalization factor is not simply the size of the sub-tree: the structure of sub-trees may also affect the ranks.
44
Our approach
Extend IR-style keyword search techniques (like TF*IDF) from text databases to XML databases, capturing the hierarchical structure of the XML document by analyzing statistics of the underlying XML data.
Major contributions:
1. Identify the user's desired search-for node and search-via node(s) in a heuristic way: define XML TF (term frequency) and XML DF (document frequency); confidence formulas for search-for/via candidates.
2. Define XML TF*IDF similarity: propose 3 guidelines specifically for XML keyword search; take keyword ambiguity problems into account.
3. Design a keyword search engine, XReal.
45
XML keyword search: publications
Zhifeng Bao, Jiaheng Lu, Tok Wang Ling: XReal: an interactive XML keyword searching. CIKM 2010:1933-1934
Zhifeng Bao, Jiaheng Lu, Tok Wang Ling, Liang Xu, Huayu Wu: An Effective Object-Level XML Keyword Search. DASFAA 2010:93-109
Zhifeng Bao, Jiaheng Lu, Tok Wang Ling, Bo Chen: Towards an Effective XML Keyword Search. TKDE, 22(8):1077-1092 (2010)
Zhifeng Bao, Bo Chen, Tok Wang Ling, Jiaheng Lu: Demonstrating Effective Ranked XML Keyword Search with Meaningful Result Display. DASFAA 2009:750-754
Zhifeng Bao, Tok Wang Ling, Bo Chen, Jiaheng Lu: Effective XML Keyword Search with Relevance Oriented Ranking. ICDE 2009:517-528
Bo Chen, Jiaheng Lu, Tok Wang Ling: Exploiting ID References for Effective Keyword Search in XML Documents. DASFAA 2008:529-537
Jianjun Xu, Jiaheng Lu, Wei Wang, Baile Shi: Effective Keyword Search in XML Documents Based on MIU. DASFAA 2006:702-716
……
46
Outline
XML data management: XML twig query processing; XML keyword search; graphical and interactive XML query processing
47
Graphical and interactive XML search
Auto-completion XML search
Order-sensitive XML twig query
XML query suggestion
Demo online: http://datasearch.ruc.edu.cn:8080/LotusX/
48
Outline
XML data management: XML twig query processing; XML keyword search; XML keyword refinement; graphical and interactive XML query processing
Approximate string matching: approximate string search; approximate member extraction
49
Motivation: data cleaning
Real-world data is dirty: typos, inconsistent representations (PO Box vs. P.O. Box).
Approach: approximately check against a clean dictionary.
Example (screenshot; source: http://en.wikipedia.org/wiki/Heisenberg's_microscope, Jan 2008): a misspelled name that should clearly be Niels Bohr.
50
Motivation: record linkage
Table 1 (columns Name, Hobbies, Address; other values shown as …): Brad Pitt; Forest Whittacker; George Bush; Angelina Jolie; Arnold Schwarzenegger
Table 2 (columns Phone, Age, Name; other values shown as …): Brad Pitt; Arnold Schwarzeneger; George Bush; Angelina Jolie; Forrest Whittaker
We want to link records belonging to the same entity, but there is no exact match: the same entity may have similar representations, e.g. Arnold Schwarzeneger versus Arnold Schwarzenegger, Forrest Whittaker versus Forest Whittacker.
51
Motivation: query relaxation
Errors in queries, errors in data: bring the query and meaningful results closer together.
Example: actual queries gathered by Google, http://www.google.com/jobs/britney.html
52
What is approximate string search?
String collection (people): Brad Pitt, Forest Whittacker, George Bush, Angelina Jolie, Arnold Schwarzeneger, …
Queries against the collection: find all entries similar to Forrest Whitaker; find all entries similar to Arnold Schwarzenegger; find all entries similar to Brittany Spears.
What do we mean by similar to? Edit distance, Jaccard similarity, cosine similarity, Dice, etc.
The similar-to predicate can help the applications described above. How can we support these types of queries efficiently?
53
Approximate query answering
Main idea: use q-grams as signatures for a string. Sliding a window of length 2 over irvine gives the 2-grams {ir, rv, vi, in, ne}.
Intuition: similar strings share a certain number of grams.
An inverted index on grams supports finding all data strings that share enough grams with a query.
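The sliding window and the gram index can be sketched in a few lines. The function names and the tiny string collection are illustrative only.

```python
from collections import defaultdict

def qgrams(s, q=2):
    """Slide a window of length q over s and collect the grams in order."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def build_gram_index(strings, q=2):
    """Inverted index: gram -> sorted ids of the strings containing it."""
    index = defaultdict(set)
    for sid, s in enumerate(strings, start=1):
        for g in qgrams(s, q):
            index[g].add(sid)
    return {g: sorted(ids) for g, ids in index.items()}

print(qgrams("irvine"))
index = build_gram_index(["irvine", "irving", "vine"])   # hypothetical collection
print(index["vi"], index["ne"])
```

A string of length n has n - q + 1 grams, which is where the 5 grams of irvine come from.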
54
Approximate query example
Query: irvine, edit distance 1. 2-grams: {ir, rv, vi, in, ne}
Look up the query grams in the inverted index, which maps each 2-gram (tf, vi, ir, ef, rv, ne, un, …, in) to an inverted list of string IDs. Lists shown in the figure: 1 3 4 5 7 9; 5 9; 1 5; 1 2 3 9; 3 9; 7 9; 5 6 9; 1 2 4 5 6.
Each edit operation can destroy at most q grams, so answers must share at least T = 5 - 1 * 2 = 3 grams with the query.
T-occurrence problem: find the elements occurring at least T = 3 times among the inverted lists. This is called list merging; T is called the merging threshold.
Candidates = {1, 5, 9}. They may include false positives, so we need to compute the real similarity.
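A count-based sketch of the T-occurrence step. The slide does not preserve which inverted list belongs to which gram, so the assignment below is an assumption, chosen only so that the lists are of the kind shown in the figure; the counting uses a plain counter rather than any of the optimized list-merging algorithms.

```python
from collections import Counter

def merging_threshold(num_query_grams, q, k):
    """Each of the k edit operations can destroy at most q grams."""
    return num_query_grams - k * q

def t_occurrence(lists, T):
    """Ids that occur in at least T of the given inverted lists."""
    counts = Counter(sid for lst in lists for sid in lst)
    return sorted(sid for sid, c in counts.items() if c >= T)

# Hypothetical gram-to-list assignment for the query grams of "irvine".
index = {
    "ir": [1, 3, 4, 5, 7, 9],
    "rv": [5, 9],
    "vi": [1, 5],
    "in": [1, 2, 3, 9],
    "ne": [7, 9],
}
query_grams = ["ir", "rv", "vi", "in", "ne"]
T = merging_threshold(len(query_grams), q=2, k=1)
candidates = t_occurrence([index[g] for g in query_grams], T)
print(T, candidates)
```

The candidates still have to be verified with the real edit distance, since sharing T grams is necessary but not sufficient.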
55
Outline
XML data management: XML twig query processing; XML keyword search; XML keyword refinement; graphical and interactive XML query processing
Approximate string matching: approximate string search; approximate member extraction
56
Introduction: an example
We have a dictionary of strings we are interested in, e.g. product names or postal addresses, and we want to locate their approximate occurrences in documents.
See the meaning of approximate occurrence in the following example:
57
Problem definition
Given a dictionary R and a threshold δ, extract all proper substrings m from the input documents S such that there exists r ∈ R with Similarity(r, m) ≥ δ (or Distance(r, m) ≤ k). Here we call r a piece of evidence for m.
Similarity() is a function measuring the similarity of two strings. Strings are viewed as sets of tokens (words).
An example for Sim(): Jaccard similarity.
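The problem definition can be made concrete with a brute-force sketch. It assumes the plain, unweighted Jaccard similarity on token sets (the slide's own formula is not preserved and may be weighted), and it enumerates every token substring up to a hypothetical maximum length, which is exactly the cost that pre-pruning is meant to avoid.

```python
def jaccard(r, m):
    """Unweighted Jaccard similarity of two strings viewed as token sets."""
    a, b = set(r.split()), set(m.split())
    return len(a & b) / len(a | b) if a | b else 0.0

def extract(document, dictionary, delta, max_len=6):
    """Naive extraction: test every token substring against every dictionary string."""
    tokens = document.split()
    hits = []
    for i in range(len(tokens)):
        for j in range(i + 1, min(len(tokens), i + max_len) + 1):
            m = " ".join(tokens[i:j])
            for r in dictionary:
                if jaccard(r, m) >= delta:
                    hits.append((m, r))       # r is a piece of evidence for m
    return hits

R = ["canon eos 5d digital camera"]
hits = extract("the canon eos 5d camera review", R, delta=0.8)
print(hits)
```

The substring "canon eos 5d camera" shares 4 of the 5 distinct tokens with the dictionary entry, so its similarity is 0.8 and it is extracted.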
58
Why pre-pruning is needed
We need evidence to decide whether a substring m should be extracted.
Simple verification against all dictionary strings may be inefficient.
Pre-pruning followed by post-verification is beneficial.
But should it be tuned for running time or for filtering power? Less time or fewer survivors?
59
The issue of compromise comes again
A balance between the two stages should be reached: more (less) filtration time gives stronger (weaker) filtration power, hence fewer (more) candidates and less (more) verification time.
Overall performance = T_f + T_v
60
State-of-the-art techniques: K-signature scheme
Proposed by Chakrabarti et al. (SIGMOD 2008).
Choose several top-weighted tokens in a string as signatures to represent it: s => Sig(s).
Observation: if r cannot match m, r is likely to have insufficient signature overlap with m.
K is a parameter for tuning the filtration power.
Potential evidence loss: a counter-example was found when k=3. We tried and only proved that it works for k=1 and k=
61
State-of-the-art techniques: Inverted Signature-based Hashtable
Proposed by Chakrabarti et al. (SIGMOD 2008).
Each dictionary string is encoded into a solid 0-1 matrix, with a 1 for each occurrence of a tuple (1-rectangle).
Bitwise-or all solid matrices to get the matrix of R.
Observation: if m is an approximate member of R, the matrix of m must have enough intersections with that of R.
Formalized into an NP-complete problem; the solution causes too weak filtering power.
62
Our proposed theorem
If Sim(m, r) ≥ δ, what do we have?
wt(Sig(m) ∩ Sig(r)) ≥ τ(m)
wt(Sig(m) ∩ Sig(r)) ≥ min{τ(m), τ(r)}
So the threshold does not remain constant and involves the unknown evidence.
Our solution: use inverted lists to count sig-token overlaps. Note that sig-tokens usually have low document frequency (e.g. with IDF as weights).
63
Our algorithms and evaluations: EvSCAN, filtration by SIL
Signature-based Inverted Lists (SIL): lists indexed by sig-tokens. Each sig-token of a string creates a node (containing the string's id) in the corresponding list.
E.g. R = {r1 = "canon eos 5d digital camera", r2 = "nikon digital slr camera", r3 = "canon slr camera"}, with wt(digital, camera, canon, nikon, slr, eos, 5d) = (1, 1, 2, 2, 2, 7, 9).
(Figure: lists for the tokens 5d (9.0), canon (2.0), camera (1.0), eos (7.0), nikon (2.0) and slr (2.0), each holding string ids.)
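A rough sketch of signature-based inverted lists on the example dictionary. How signatures are actually selected is not stated on the slide; here each string simply keeps its k top-weighted tokens, with k = 2 as a hypothetical choice, so the resulting lists differ from the figure (which also has a list for camera). The overlap function only accumulates the weight of shared sig-tokens per dictionary string; the threshold test is left out.

```python
from collections import defaultdict

# Token weights and dictionary from the example.
WT = {"digital": 1, "camera": 1, "canon": 2, "nikon": 2, "slr": 2, "eos": 7, "5d": 9}
R = {1: "canon eos 5d digital camera", 2: "nikon digital slr camera", 3: "canon slr camera"}

def sig(s, k):
    """Assumed signature: the k top-weighted distinct tokens (ties broken alphabetically)."""
    tokens = sorted(set(s.split()), key=lambda t: (-WT.get(t, 0), t))
    return tokens[:k]

def build_lists(dictionary, k):
    """One inverted list per sig-token, holding the ids of the strings it represents."""
    lists = defaultdict(list)
    for rid, r in sorted(dictionary.items()):
        for t in sig(r, k):
            lists[t].append(rid)
    return dict(lists)

def overlap_weights(m, lists, k):
    """Total weight of sig-tokens that m shares with each dictionary string."""
    acc = defaultdict(float)
    for t in sig(m, k):
        for rid in lists.get(t, []):
            acc[rid] += WT[t]
    return dict(acc)

lists = build_lists(R, k=2)
print(lists)
print(overlap_weights("canon eos 5d camera", lists, k=2))
```

Only dictionary strings that appear in some scanned list are ever touched, which is what makes the filtration cheap when sig-tokens are rare.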
64
Approximate string matching: publications
Jiaheng Lu, Jialong Han, Xiaofeng Meng: Efficient algorithms for approximate member extraction using signature-based inverted lists. CIKM 2009:315-324
Alexander Behm, Shengyue Ji, Chen Li, Jiaheng Lu: Space-Constrained Gram-Based Indexing for Efficient Approximate String Search. ICDE 2009:604-615
Chen Li, Jiaheng Lu, Yiming Lu: Efficient Merging and Filtering Algorithms for Approximate String Searches. ICDE 2008:257-266
Yuanzhe Cai, Gao Cong, Xu Jia, Hongyan Liu, Jun He, Jiaheng Lu, Xiaoyong Du: Efficient Algorithm for Computing Link-Based Similarity in Real World Networks. ICDM 2009:734-739
……
65
Thank you Q&A