Published by Daniel Buckley. Modified over 11 years ago.
1
Reverse Spatial and Textual k Nearest Neighbor Search
Jiaheng Lu, Renmin University of China
Sep 6, 2011
Presentation at HP Labs China
2
Research experience
Associate Professor, Renmin University of China: XML data management, spatial data management, cloud data management
Post-doc, University of California, Irvine: data integration, approximate string matching
PhD, National University of Singapore: XML data management
3
Outline
XML data management
  XML twig query processing
  XML keyword search
Approximate string matching
Reverse Spatial and Textual k Nearest Neighbor Search (SIGMOD 2011)
4
XML twig query processing
XPath: Section[Title]/Paragraph//Figure
Twig pattern: Section has a child Title and a child Paragraph; Paragraph has a descendant Figure.
5
XML twig query processing (cont.)
Problem statement: given a query twig pattern Q and an XML database D, compute all the answers to Q in D.
Example:
  Query: a Section node with a title branch and a figure branch (drawn as a twig on the slide)
  Document: a tree with nodes s1, s2, t1, t2, p1, f1 (drawn on the slide)
  Query solutions: (s1, t1, f1), (s2, t2, f1), (s1, t2, f1)
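The problem statement can be made concrete with a naive evaluator. The sketch below is not one of the algorithms from the talk; it simply enumerates all matches by brute force. The document tree in `CHILDREN` is an assumption (the slide's drawing is not preserved), chosen so that the three listed solutions come out, and both query edges are assumed to be ancestor-descendant edges.

```python
from itertools import product

# Hypothetical document tree (an assumption, the slide figure is not preserved):
# s1 contains t1 and s2; s2 contains t2 and p1; p1 contains f1.
CHILDREN = {
    "s1": ["t1", "s2"],
    "s2": ["t2", "p1"],
    "p1": ["f1"],
    "t1": [], "t2": [], "f1": [],
}
TAG = {"s1": "section", "s2": "section", "t1": "title",
       "t2": "title", "p1": "paragraph", "f1": "figure"}


def descendants(node):
    """Yield every proper descendant of `node` in document order."""
    for child in CHILDREN[node]:
        yield child
        yield from descendants(child)


def twig_matches():
    """Brute-force evaluation of the twig: section with a title
    descendant and a figure descendant (both edges assumed to be '//')."""
    results = []
    for s in (n for n in CHILDREN if TAG[n] == "section"):
        desc = list(descendants(s))
        titles = [d for d in desc if TAG[d] == "title"]
        figures = [d for d in desc if TAG[d] == "figure"]
        # Every (title, figure) combination under the same section is one answer.
        results.extend((s, t, f) for t, f in product(titles, figures))
    return results


if __name__ == "__main__":
    for match in twig_matches():
        print(match)
```

Holistic twig joins exist precisely to avoid this kind of enumeration of intermediate combinations on large documents; the sketch only pins down what "all the answers" means.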
6
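An example for the TJFast algorithm
[Figure: a query with nodes A, B, C, D and a document tree with nodes a1 to a3, b1, b2, c1, c2, d1 to d3, labeled with extended Dewey labels such as 0.0, 0.0.1, 0.3, 0.3.1, 0.3.2, 0.3.2.1, 0.5, 0.5.0.0. The label streams TD and TC hold the lists (0.3.2.1, 0.5.0.0) and (0.0.1, 0.3.1, 0.5.0). A set, initially empty, is kept for the branching node A.]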
7
XML twig query processing (cont.)
Several efficient pattern matching algorithms:
  TJFast (VLDB 05)
  iTwigJoin (SIGMOD 05)
  TwigStackList (CIKM 04)
  TreeMatch (TKDE 10)
Current work: distributed XML twig pattern processing
8
XML twig query processing
Jiaheng Lu, Ting Chen, Tok Wang Ling: Efficient processing of XML twig patterns with parent child edges: a look-ahead approach. CIKM 2004:533-542
Jiaheng Lu, Tok Wang Ling, Chee Yong Chan, Ting Chen: From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching. VLDB 2005:193-204
Jiaheng Lu, Tok Wang Ling: Labeling and Querying Dynamic XML Trees. APWeb 2004:180-189
Jiaheng Lu, Ting Chen, Tok Wang Ling: TJFast: effective processing of XML twig pattern matching. WWW (Special interest tracks and posters) 2005:1118-1119
Jiaheng Lu, Tok Wang Ling, Tian Yu, Changqing Li, Wei Ni: Efficient Processing of Ordered XML Twig Pattern. DEXA 2005:300-309
Jiaheng Lu: Benchmarking Holistic Approaches to XML Tree Pattern Query Processing - (Extended Abstract of Invited Talk). DASFAA Workshops 2010:170-178
Tian Yu, Tok Wang Ling, Jiaheng Lu: TwigStackList-: A Holistic Twig Join Algorithm for Twig Query with Not-Predicates on XML Data. DASFAA 2006:249-263
Zhifeng Bao, Tok Wang Ling, Jiaheng Lu, Bo Chen: SemanticTwig: A Semantic Approach to Optimize XML Query Processing. DASFAA 2008:282-298
Ting Chen, Jiaheng Lu, Tok Wang Ling: On Boosting Holism in XML Twig Pattern Matching using Structural Indexing Techniques. SIGMOD 2005:455-466
……
9
XML keyword search
XQuery vs. keyword search
Task: query papers by Mike
XQuery (complicated): for $a in doc("bib.xml")//author, $n in $a/name where $n = "Mike" return $a//inproceedings
Keyword search: Mike inproceedings
10
XML keyword search
The proposed keyword search returns the set of smallest trees containing all keywords.
[Figure: an example bib document with two author elements, each with name, publications (inproceedings and article entries, each with title and year), and hobby. Example keywords: Mike, hobby, article, 2009, Paper.]
11
Effectiveness
Capture the user's search intention
  Identify the target that the user intends to search for
  Infer the predicate constraint that the user intends to search via
Result ranking
  Rank the query results according to their objective relevance to the user's search intention
12
XML keyword search
Zhifeng Bao, Jiaheng Lu, Tok Wang Ling: XReal: an interactive XML keyword searching. CIKM 2010:1933-1934
Zhifeng Bao, Jiaheng Lu, Tok Wang Ling, Liang Xu, Huayu Wu: An Effective Object-Level XML Keyword Search. DASFAA 2010:93-109
Zhifeng Bao, Jiaheng Lu, Tok Wang Ling, Bo Chen: Towards an Effective XML Keyword Search. TKDE, 22(8):1077-1092 (2010)
Zhifeng Bao, Bo Chen, Tok Wang Ling, Jiaheng Lu: Demonstrating Effective Ranked XML Keyword Search with Meaningful Result Display. DASFAA 2009:750-754
Zhifeng Bao, Tok Wang Ling, Bo Chen, Jiaheng Lu: Effective XML Keyword Search with Relevance Oriented Ranking. ICDE 2009:517-528
Bo Chen, Jiaheng Lu, Tok Wang Ling: Exploiting ID References for Effective Keyword Search in XML Documents. DASFAA 2008:529-537
Jianjun Xu, Jiaheng Lu, Wei Wang, Baile Shi: Effective Keyword Search in XML Documents Based on MIU. DASFAA 2006:702-716
……
13
Outline
XML data management
  XML twig query processing
  XML keyword search
Approximate string matching
Reverse Spatial and Textual k Nearest Neighbor Search
14
Motivation: Data Cleaning
Source: http://en.wikipedia.org/wiki/Heisenberg's_microscope, Jan 2008
Real-world data is dirty
  Typos
  Inconsistent representations (PO Box vs. P.O. Box)
Approximately check against a clean dictionary
[Screenshot annotation: should clearly be "Niels Bohr"]
15
Motivation: Record Linkage
Table 1 (columns: Name, Hobbies, Address)
  Brad Pitt, …, …
  Forest Whittacker, …, …
  George Bush, …, …
  Angelina Jolie, …, …
  Arnold Schwarzenegger, …, …
Table 2 (columns: Phone, Age, Name)
  …, …, Brad Pitt
  …, …, Arnold Schwarzeneger
  …, …, George Bush
  …, …, Angelina Jolie
  …, …, Forrest Whittaker
We want to link records belonging to the same entity.
No exact match! The same entity may have similar representations:
  Arnold Schwarzeneger versus Arnold Schwarzenegger
  Forrest Whittaker versus Forest Whittacker
16
Motivation: Query Relaxation
Actual queries gathered by Google: http://www.google.com/jobs/britney.html
Errors in queries
Errors in data
Bring query and meaningful results closer together
17
What is Approximate String Search?
String collection (people):
  Brad Pitt
  Forest Whittacker
  George Bush
  Angelina Jolie
  Arnold Schwarzeneger
  …
Queries against the collection:
  Find all entries similar to "Forrest Whitaker"
  Find all entries similar to "Arnold Schwarzenegger"
  Find all entries similar to "Brittany Spears"
What do we mean by "similar to"?
  Edit distance
  Jaccard similarity
  Cosine similarity
  Dice
  Etc.
The "similar to" predicate can help the applications described above.
How can we support these types of queries efficiently?
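Two of the listed measures can be sketched in a few lines. This is a textbook dynamic-programming edit distance and a set-based Jaccard similarity, given only as an illustration of what "similar to" can mean; the function names are hypothetical and not taken from the talk.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning `a` into `b`."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def jaccard(x: set, y: set) -> float:
    """Jaccard similarity |x & y| / |x | y| of two sets (e.g. sets of grams)."""
    if not x and not y:
        return 1.0
    return len(x & y) / len(x | y)


if __name__ == "__main__":
    print(edit_distance("Arnold Schwarzeneger", "Arnold Schwarzenegger"))
    print(edit_distance("Forrest Whittaker", "Forest Whittacker"))
```

Computing such a measure against every string in the collection is the expensive part, which is what the gram-based index on the following slides is meant to avoid.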
18
Approximate Query Answering
Main idea: use q-grams as signatures for a string.
  Example: the 2-grams of "irvine", taken with a sliding window, are {ir, rv, vi, in, ne}.
Intuition: similar strings share a certain number of grams.
An inverted index on grams supports finding all data strings that share enough grams with a query.
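The sliding-window extraction is a one-liner. A minimal sketch, assuming plain (unpadded) grams as in the slide's example; the function name `qgrams` is hypothetical.

```python
def qgrams(s: str, q: int = 2) -> list:
    """Slide a window of width q over s and return the grams in order."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


if __name__ == "__main__":
    print(qgrams("irvine"))  # ['ir', 'rv', 'vi', 'in', 'ne']
```

A string of length n yields n - q + 1 grams, so "irvine" has 6 - 2 + 1 = 5.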
19
Approximate Query Example
Query: "irvine", edit distance 1
2-grams of the query: {ir, rv, vi, in, ne}
Look up the query grams in the inverted lists of string IDs.
[Figure: an inverted index over 2-grams (tf, vi, ir, ef, rv, ne, un, in, …), each with a sorted list of string IDs.]
Each edit operation can destroy at most q grams.
Answers must share at least T = 5 - 1 * 2 = 3 grams.
T-occurrence problem: find the elements occurring at least T = 3 times among the inverted lists. This is called list merging; T is called the merging threshold.
Candidates = {1, 5, 9}
Candidates may include false positives, so the real similarity still needs to be computed.
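The count filter can be sketched as follows. The string collection here is hypothetical (the slide's inverted lists are not recoverable), and the merge is done with a simple counter rather than any of the optimized list-merging algorithms from the cited papers. Grams are treated as sets, which is a simplification.

```python
from collections import Counter, defaultdict


def qgrams(s, q=2):
    """Sliding-window q-grams of s."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_index(strings, q=2):
    """Inverted index: gram -> list of ids of strings containing it."""
    index = defaultdict(list)
    for sid, s in strings.items():
        for g in set(qgrams(s, q)):
            index[g].append(sid)
    return index


def t_occurrence(index, query, k, q=2):
    """Ids occurring on at least T = |grams(query)| - k*q inverted lists."""
    grams = set(qgrams(query, q))
    threshold = len(grams) - k * q
    if threshold <= 0:
        raise ValueError("count filter gives no pruning for this query")
    counts = Counter(sid for g in grams for sid in index.get(g, []))
    return sorted(sid for sid, c in counts.items() if c >= threshold)


# Hypothetical collection, used only for illustration.
STRINGS = {1: "irvine", 2: "irving", 3: "marvin", 4: "ervine", 5: "alpine"}

if __name__ == "__main__":
    index = build_index(STRINGS)
    print(t_occurrence(index, "irvine", k=1))
```

With this toy data, "marvin" shares three grams with the query and passes the filter even though it is more than one edit away, which is the kind of false positive the final verification step removes.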
20
Approximate string matching
Jiaheng Lu, Jialong Han, Xiaofeng Meng: Efficient algorithms for approximate member extraction using signature-based inverted lists. CIKM 2009:315-324
Alexander Behm, Shengyue Ji, Chen Li, Jiaheng Lu: Space-Constrained Gram-Based Indexing for Efficient Approximate String Search. ICDE 2009:604-615
Chen Li, Jiaheng Lu, Yiming Lu: Efficient Merging and Filtering Algorithms for Approximate String Searches. ICDE 2008:257-266
Yuanzhe Cai, Gao Cong, Xu Jia, Hongyan Liu, Jun He, Jiaheng Lu, Xiaoyong Du: Efficient Algorithm for Computing Link-Based Similarity in Real World Networks. ICDM 2009:734-739
……
21
Outline
XML data management
  XML twig query processing
  XML keyword search
Approximate string matching
Reverse Spatial and Textual k Nearest Neighbor Search (SIGMOD 2011)
22
Motivation
If we add a new shop at Q, which shops will be influenced?
Influence factors:
  Spatial distance. Results: D, F
  Textual similarity (services, products, ...). Results: F, C
[Figure: a map of shops labeled with categories such as food, clothes, and sports.]
23
Problems of finding influential sets
Traditional query: reverse k nearest neighbor query (RkNN)
Our new query: reverse spatial and textual k nearest neighbor query (RSTkNN)
24
Problem Statement
Spatial-textual similarity: describes the similarity between objects based on both spatial proximity and textual similarity.
Spatial-textual similarity function: (formula shown as an image on the slide)
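Since the slide's formula is not preserved, the sketch below only illustrates one plausible shape for such a function: a weighted sum, with the weight alpha that appears in the later example slides, of a normalized spatial proximity and a textual similarity. The distance normalization by `d_min` and `d_max`, the use of extended Jaccard for text, and all names are assumptions, not the definition from the talk.

```python
import math


def spatial_sim(p, q, d_min, d_max):
    """Assumed normalization: 1 at distance d_min, 0 at distance d_max."""
    return 1.0 - (math.dist(p, q) - d_min) / (d_max - d_min)


def extended_jaccard(u, v):
    """Extended Jaccard between sparse term-weight dicts (assumed measure)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    denom = (sum(w * w for w in u.values())
             + sum(w * w for w in v.values()) - dot)
    return dot / denom if denom else 0.0


def sim_st(o1, o2, alpha, d_min, d_max):
    """Assumed form: alpha * spatial + (1 - alpha) * textual."""
    return (alpha * spatial_sim(o1["loc"], o2["loc"], d_min, d_max)
            + (1 - alpha) * extended_jaccard(o1["vec"], o2["vec"]))


if __name__ == "__main__":
    shop_a = {"loc": (0.0, 0.0), "vec": {"food": 1.0}}
    shop_b = {"loc": (3.0, 4.0), "vec": {"food": 1.0, "sports": 1.0}}
    print(sim_st(shop_a, shop_b, alpha=0.6, d_min=0.0, d_max=10.0))
```

With alpha close to 1 the measure behaves like a purely spatial one; with alpha close to 0 only the text matters.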
25
Problem Statement (cont.)
RSTkNN query: find the objects that have the query object as one of their k most spatial-textually similar objects.
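The definition translates directly into a quadratic brute-force check, sketched below. This is a naive reference for the definition only, not the index-based algorithm from the talk. The similarity function is passed in as `sim`; the 1-D toy similarity in the usage example is purely illustrative.

```python
def rstknn_bruteforce(objects, query, k, sim):
    """Return ids of objects that have `query` among their k most
    similar objects, where similarity is given by `sim(a, b)`."""
    result = []
    for oid, obj in objects.items():
        sim_to_query = sim(obj, query)
        # Count the other data objects that are strictly more similar
        # to obj than the query is.
        closer = sum(1 for other_id, other in objects.items()
                     if other_id != oid and sim(obj, other) > sim_to_query)
        # The query is in obj's top-k if fewer than k objects beat it.
        if closer < k:
            result.append(oid)
    return result


if __name__ == "__main__":
    # Toy data: points on a line, similarity = negative distance.
    points = {"a": 0.0, "b": 1.0, "c": 10.0}
    print(rstknn_bruteforce(points, 1.5, k=1, sim=lambda x, y: -abs(x - y)))
```

In the toy run, "c" is far from the query yet still appears in the result because nothing else is closer to it, which is what distinguishes a reverse query from an ordinary kNN query.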
26
Related Work
Pre-computing the kNN for each object (Korn et al., SIGMOD 2000; Yang et al., ICDE 2001)
(Hyper) Voronoi cell/plane pruning strategy (Tao et al., VLDB 2004; Wu et al., PVLDB 2008; Kriegel et al., ICDE 2009)
60-degree pruning method (Stanoi et al., SIGMOD 2000)
Branch and bound, based on Lp-norm metric space (Achtert et al., SIGMOD 2006; Achtert et al., EDBT 2009)
Challenging features:
  Euclidean geometric properties are lost.
  The text space is high-dimensional.
  k and α differ from query to query.
27
Intersection and Union R-tree (IUR-tree)
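The slide body is not preserved, so the following is only a guess at what the "intersection" and "union" in the name refer to: per-node summaries of the text vectors below the node, taken here as the per-term minimum and maximum weights. That reading, and the function names, are assumptions.

```python
def intersection_vector(vectors):
    """Per-term minimum weight over all vectors. A term missing from
    any vector has minimum weight 0 and is therefore dropped."""
    common = set(vectors[0]).intersection(*vectors[1:])
    return {t: min(v[t] for v in vectors) for t in common}


def union_vector(vectors):
    """Per-term maximum weight over all vectors."""
    terms = set().union(*vectors)
    return {t: max(v.get(t, 0.0) for v in vectors) for t in terms}


if __name__ == "__main__":
    docs = [{"food": 2.0, "clothes": 1.0}, {"food": 1.0, "sports": 3.0}]
    print(intersection_vector(docs))
    print(union_vector(docs))
```

Under this reading, the two vectors bracket every text vector in the subtree term by term, which is what would let a search derive lower and upper similarity bounds for a whole node without opening it.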
28
Overview of Search Algorithm
RSTkNN algorithm:
  Traverse the IUR-tree starting from the root.
  Progressively update the lower and upper bounds.
  Apply the search strategy: put unrelated entries into Pruned; report result entries in Ans; add candidate objects to Cnd.
Final verification:
  For each object in Cnd, decide whether it is a result by tightening its bounds, expanding entries in Pruned as needed.
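The three-way decision in the search strategy can be sketched as a comparison of two pairs of bounds. The slide does not spell out the rule, so the conditions below are an assumption: each entry carries a lower and an upper bound on the similarity to its k-th most similar object, and those are compared with the minimum and maximum similarity between the entry and the query. All names are hypothetical.

```python
def classify(min_sim_to_query, max_sim_to_query, knn_lower, knn_upper):
    """Decide what to do with an index entry (assumed rule).

    knn_lower / knn_upper bound the similarity between the entry and
    its k-th most similar object; the other two arguments bound the
    similarity between the entry and the query.
    """
    if max_sim_to_query < knn_lower:
        # At least k objects are certainly more similar than the query.
        return "pruned"
    if min_sim_to_query >= knn_upper:
        # The query is certainly among the k most similar objects.
        return "answer"
    # Bounds overlap: keep for refinement or final verification.
    return "candidate"


if __name__ == "__main__":
    print(classify(0.10, 0.20, knn_lower=0.30, knn_upper=0.50))
    print(classify(0.70, 0.80, knn_lower=0.30, knn_upper=0.50))
    print(classify(0.20, 0.60, knn_lower=0.30, knn_upper=0.50))
```

Tighter bounds shrink the "candidate" case, which is why the algorithm keeps updating them as the traversal proceeds.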
29
Example: Execution of the RSTkNN Algorithm on IUR-tree, given k=2, alpha=0.6
IUR-tree: root N4 with children N1 (p1), N2 (p2, p3), N3 (p4, p5)
Step: EnQueue(U, N4); initialize N4.CLs
U: N4 (0, 0)
30
Example: Execution of the RSTkNN Algorithm on IUR-tree, given k=2, alpha=0.6
IUR-tree: root N4 with children N1 (p1), N2 (p2, p3), N3 (p4, p5)
Step: DeQueue(U, N4)
Mutual-effect: N1 and N2; N1 and N3; N2 and N3
Actions: EnQueue(U, N2); EnQueue(U, N3); Pruned.add(N1)
U: N4 (0, 0) before the step; N3 (0.323, 0.619), N2 (0.21, 0.619) after it
Pruned: N1 (0.37, 0.432)
31
Example: Execution of the RSTkNN Algorithm on IUR-tree, given k=2, alpha=0.6
IUR-tree: root N4 with children N1 (p1), N2 (p2, p3), N3 (p4, p5)
Step: DeQueue(U, N3)
Mutual-effect: p4 with N2; p5 with p4, N2
Actions: Answer.add(p4); Candidate.add(p5)
U: N3 (0.323, 0.619), N2 (0.21, 0.619) before the step
Pruned: N1 (0.37, 0.432)
Answer: p4 (0.21, 0.619)
Candidate: p5 (0.374, 0.374)
32
Example: Execution of the RSTkNN Algorithm on IUR-tree, given k=2, alpha=0.6
IUR-tree: root N4 with children N1 (p1), N2 (p2, p3), N3 (p4, p5)
Step: DeQueue(U, N2)
Mutual-effect: p2 with p4, p5; p3 with p2, p4, p5
Actions: Answer.add(p2, p3); Pruned.add(p5)
U: N2 (0.21, 0.619) before the step
Pruned: N1 (0.37, 0.432), then p5 (0.374, 0.374) moved here from Candidate
Answer: p4, p2, p3
Since U and Cand are both empty, the algorithm ends. Results: p2, p3, p4.
33
Cluster IUR-tree: CIUR-tree
IUR-tree: the texts in an index node could be very different.
CIUR-tree: an enhanced IUR-tree that incorporates textual clusters.
34
Optimizations
Motivation
  To give a tighter bound during CIUR-tree traversal
  To purify the textual description in the index node
Outlier Detection and Extraction (ODE-CIUR)
  Extract subtrees with outlier clusters
  Treat the outliers separately and calculate their bounds separately
Text-entropy based optimization (TE-CIUR)
  Define TextEntropy to describe the distribution of text clusters in an entry of the CIUR-tree
  Visit first the entries with higher TextEntropy, i.e. those with more diverse texts
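The slide does not give the TextEntropy formula. A minimal sketch, assuming it is the Shannon entropy of the cluster-size distribution inside an entry, and that entries are then visited in decreasing order of that value; the function names and the example counts are hypothetical.

```python
import math


def text_entropy(cluster_counts):
    """Shannon entropy (base 2) of the distribution of objects over
    text clusters within one entry. Assumed definition."""
    total = sum(cluster_counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in cluster_counts.values() if c > 0)


def visit_order(entries):
    """Order entry names so that more textually diverse entries come first."""
    return sorted(entries, key=lambda name: text_entropy(entries[name]),
                  reverse=True)


if __name__ == "__main__":
    # Hypothetical entries: cluster id -> number of objects in that cluster.
    entries = {
        "N1": {"c1": 8},
        "N2": {"c1": 4, "c2": 4},
        "N3": {"c1": 6, "c2": 1, "c3": 1},
    }
    print(visit_order(entries))
```

An entry whose objects all fall in one cluster has entropy 0 and is visited last under this ordering.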
35
Experimental Study
Experimental setup
  OS: Windows XP; CPU: 2.0 GHz; Memory: 4 GB
  Page size: 4 KB; Language: C/C++
Compared methods
  baseline, IUR-tree, ODE-CIUR, TE-CIUR, and ODE-TE
Datasets
  ShopBranches (Shop): extended from a small real dataset
  GeographicNames (GN): real data
  CaliforniaDBpedia (CD): generated by combining locations in California with documents from DBpedia
Metrics
  Total query time
  Number of page accesses
Statistics                      Shop      CD         GN
Total # of objects              304,008   1,555,209  1,868,821
Total unique words in dataset   3933      21,578     222,409
Average # words per object      45        47         4
36
Scalability
[Figures: (1) log-scale version; (2) linear-scale version]
37
Effect of k
[Figure: query time]
38
Conclusion
Proposed a new query problem, RSTkNN.
Presented a hybrid index, the IUR-tree.
Presented the enhanced variant, the CIUR-tree, and two optimizations, ODE-CIUR and TE-CIUR, to further improve search processing.
39
Current and future work
Distributed XML query processing
Cloud-based SQL processing
Spatial and temporal keyword search
40
Thank you Q&A