
On the Intersection of Inverted Lists Yangjun Chen and Weixin Shen Dept. Applied Computer Science, University of Winnipeg 515 Portage Ave. Winnipeg, Manitoba, Canada R3B 2E9

Outline
Motivation
New Index Structure
- Trie over word sequences
- Interval sequence assigned to words
- Sublists assigned to intervals
Algorithm
- Basic algorithm based on linear search
- Algorithm based on binary search
Experiments
Summary

Motivation
Evaluation of conjunctive queries in text databases and search engines: $w_1 \wedge w_2 \wedge \dots \wedge w_k$, where each $w_i$ is a word. The task is to find all the documents containing these words.
Example: cat $\wedge$ dog

New Index Structures

Documents and word sequences:

| DocId | words | sorted word sequence |
|---|---|---|
| 1 | a, f, d | d, f, a |
| 2 | a, d | d, a |
| 3 | a, e, d | e, d, a |
| 4 | f, b, a | f, a, b |
| 5 | c, d, e | e, d, c |
| 6 | d, f, e, c | e, d, f, c |
| 7 | f, d, e, a | e, d, f, a |
| 8 | f, d, e, b | e, d, f, b |
| 9 | e, c | e, c |
| 10 | a, e, f | e, f, a |
| 11 | f, e, c | e, f, c |

Inverted lists:

| word | inverted list |
|---|---|
| e | {3, 5, 6, 7, 8, 9, 10, 11} |
| d | {1, 2, 3, 5, 6, 7, 8} |
| f | {1, 4, 6, 7, 8, 10, 11} |
| a | {1, 2, 3, 4, 7, 10} |
| c | {5, 6, 9, 11} |
| b | {4, 8} |

e $\wedge$ d = {3, 5, 6, 7, 8, 9, 10, 11} $\cap$ {1, 2, 3, 5, 6, 7, 8} = {3, 5, 6, 7, 8}

New Index Structures
[Fig. 1: Trie over the sorted word sequences of documents 1 to 11. Each node $v_0, \dots, v_{19}$ is labeled with a word and an interval; the root $v_0$ carries the interval [1, 20].]

New Index Structures

Trie. Assume that $S = \{s_1, \dots, s_n\}$. If $|S| = 0$, the trie is, of course, empty. For $|S| = 1$, trie($S$) is a single node. If $|S| > 1$, $S$ is split into $m$ (possibly empty) subsets $S_1, S_2, \dots, S_m$ so that a string is in $S_j$ if its first word is $w_j$ ($1 \le j \le m$). The tries trie($S_1$), trie($S_2$), …, trie($S_m$) are constructed in the same way, except that at the $k$th step the splitting of sets is based on the $k$th words in the sequences. They are then connected from their respective roots to a single node to create trie($S$).

Tree encoding. Label each node $v$ in a trie with an interval $I_v = [\alpha_v, \beta_v]$, where $\beta_v$ denotes the rank of $v$ in a post-order traversal of the trie. Here the ranks are assumed to begin with 1, and all the children of a node are assumed to be ordered and fixed during the traversal. Furthermore, $\alpha_v$ denotes the lowest rank of any node $u$ in $T[v]$ (the subtree rooted at $v$, including $v$). Thus, for any node $u$ in $T[v]$, we have $I_u \subseteq I_v$, since the post-order traversal enters a node before all of its children and leaves after having visited all of its children.
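The trie construction and the interval encoding can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the names `build_trie` and `interval_sequences` are hypothetical, the trie is a plain nested dictionary, and the children of a node are visited in insertion order. Document 4 is listed before document 3 so that the children of the root are visited in the order d, f, e, which is the order implied by the interval sequences shown on the slides.

```python
from collections import defaultdict

# Sorted word sequences of the 11 example documents (hypothetical constant name).
SORTED_SEQUENCES = [
    ["d", "f", "a"], ["d", "a"], ["f", "a", "b"], ["e", "d", "a"],
    ["e", "d", "c"], ["e", "d", "f", "c"], ["e", "d", "f", "a"],
    ["e", "d", "f", "b"], ["e", "c"], ["e", "f", "a"], ["e", "f", "c"],
]

def build_trie(sequences):
    """Insert every word sequence into a nested-dict trie."""
    root = {}
    for seq in sequences:
        node = root
        for word in seq:
            node = node.setdefault(word, {})
    return root

def interval_sequences(trie):
    """Assign [alpha, beta] to every node by a post-order traversal and
    collect, for each word, the intervals of the nodes labeled with it."""
    seqs = defaultdict(list)
    rank = 0

    def visit(word, children):
        nonlocal rank
        alpha = rank + 1              # lowest rank inside this subtree
        for child_word, sub in children.items():
            visit(child_word, sub)
        rank += 1                     # beta: post-order rank of this node
        seqs[word].append((alpha, rank))

    visit(None, trie)                 # the root carries no word (key None)
    return dict(seqs)

seqs = interval_sequences(build_trie(SORTED_SEQUENCES))
```

Because a word occurs at most once in a sequence, two nodes with the same label are never nested, so the post-order traversal already yields each interval sequence in increasing order.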

New Index Structures
More than one node may be labeled with the same word. We associate each word $w$ with an interval sequence of the form $L_w = I_w^1, I_w^2, \dots, I_w^k$, where $k$ is the number of all those nodes labeled with $w$ and each $I_w^i = [I_w^i[1], I_w^i[2]]$ ($1 \le i \le k$) is an interval associated with a certain node labeled with $w$.

New Index Structures

| word | interval sequence $L_w$ | inverted list |
|---|---|---|
| e | [8, 19] | {3, 5, 6, 7, 8, 9, 10, 11} |
| d | [1, 4][8, 14] | {1, 2, 3, 5, 6, 7, 8} |
| f | [1, 2][5, 7][10, 13][16, 18] | {1, 4, 6, 7, 8, 10, 11} |
| a | [1, 1][3, 3][5, 6][8, 8][11, 11][16, 16] | {1, 2, 3, 4, 7, 10} |
| c | [9, 9][10, 10][15, 15][17, 17] | {5, 6, 9, 11} |
| b | [5, 5][12, 12] | {4, 8} |

In general, an interval sequence is shorter than the corresponding inverted list. The longer an inverted list, the shorter the corresponding interval sequence.

New Index Structures
Assignment of DocIDs to intervals
[Fig. 2: The trie with a set of DocIDs assigned to each node.]
For example, node $v_{14}$ has the interval [10, 13]. The set {6, 7, 8} assigned to $v_{14}$ can be considered as the set assigned to [10, 13].
$L_d$: [1, 4] with {1, 2}, [8, 14] with {3, 5, 6, 7, 8}
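One possible way to materialize the set assigned to an interval is sketched below. It rests on an assumption that is not stated on the slides: each document is attached to the trie node where its sorted word sequence ends, and the set of an interval [α, β] consists of the documents whose end node has a post-order rank between α and β. The table `END_RANK` and the function `docs_for_interval` are hypothetical names; the ranks were worked out by hand from the example trie.

```python
from bisect import bisect_left, bisect_right

# DocId -> post-order rank of the node where the document's sorted
# word sequence ends (derived by hand from the example; an assumption).
END_RANK = {1: 1, 2: 3, 3: 8, 4: 5, 5: 9, 6: 10,
            7: 11, 8: 12, 9: 15, 10: 16, 11: 17}

# Documents ordered by end rank, so an interval maps to a contiguous slice.
_BY_RANK = sorted((rank, doc) for doc, rank in END_RANK.items())
_RANKS = [rank for rank, _ in _BY_RANK]

def docs_for_interval(interval):
    """Return the sorted DocIds whose end rank lies inside the interval."""
    alpha, beta = interval
    lo = bisect_left(_RANKS, alpha)
    hi = bisect_right(_RANKS, beta)
    return sorted(doc for _, doc in _BY_RANK[lo:hi])

docs_f3 = docs_for_interval((10, 13))
docs_d1 = docs_for_interval((1, 4))
docs_d2 = docs_for_interval((8, 14))
docs_e = docs_for_interval((8, 19))
```

Under this assumption the sets agree with the ones shown on the slide: {6, 7, 8} for [10, 13], and {1, 2} and {3, 5, 6, 7, 8} for the two intervals of d.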

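Query Evaluation
$Q = w \wedge w'$?
Assume that the frequency of $w$ is higher than that of $w'$.
[Figure: the interval sequences $L_w$ and $L_{w'}$; the intervals of $L_{w'}$ that are contained in intervals of $L_w$ carry the sublists $S_1$, $S_2$, $S_3$.]
Answer: $S_1 \uplus S_2 \uplus S_3$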

BASIC EVALUATION ALGORITHM
conj($L_w$, $L_{w'}$): find all those intervals in $L_{w'}$ that are each contained in some interval in $L_w$, and store them in a new sequence $L$.
1. Let $L_w = I_1, \dots, I_k$. Let $L_{w'} = J_1, \dots, J_{k'}$. $L \leftarrow \emptyset$.
2. Step through $L_w$ and $L_{w'}$ from left to right. Let $I_p$ and $J_q$ be the intervals currently encountered. We do one of the following checks:
i) If $I_p \supseteq J_q$, append $J_q$ to the end of $L$. Move to $J_{q+1}$ if $q < k'$ (then, in the next step, we check $I_p$ against $J_{q+1}$). If $q = k'$, stop.
ii) If $I_p[1] > J_q[2]$, move to $J_{q+1}$ if $q < k'$. If $q = k'$, stop.
iii) If $I_p[2] < J_q[1]$, move to $I_{p+1}$ if $p < k$ (then, in the next step, we check $J_q$ against $I_{p+1}$). If $p = k$, stop.

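BASIC ALGORITHM
d $\wedge$ b?
$L_d$: [1, 4][8, 14]
$L_b$: [5, 5][12, 12]
1st step: $p$ points to [1, 4] and $q$ to [5, 5]; since [1, 4] lies to the left of [5, 5], $p$ moves on.
2nd step: $p$ points to [8, 14] and $q$ to [5, 5]; since [8, 14] lies to the right of [5, 5], $q$ moves on.
3rd step: $p$ points to [8, 14] and $q$ to [12, 12]; [12, 12] is contained in [8, 14].
In $L_b$, only [12, 12] is contained in an interval of $L_d$, namely [8, 14]. Return the subset associated with [12, 12] as the result. It is {8}.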

BASIC ALGORITHM
$Q$ = d $\wedge$ f $\wedge$ a.
$L_d$ = [1, 4][8, 14]
$L_f$ = [1, 2][5, 7][10, 13][16, 18]
$L_a$ = [1, 1][3, 3][5, 6][8, 8][11, 11][16, 16]
$L$ = conj($L_d$, $L_f$) = [1, 2][10, 13].
$L$ = conj($L$, $L_a$) = [1, 1][11, 11].
The sublists associated with [1, 1] and [11, 11] are {1} and {7}. The result: {1} $\uplus$ {7} = {1, 7}.
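The linear merge described for conj can be sketched as follows. This is a minimal sketch rather than the authors' code: intervals are modeled as `(start, end)` tuples, and it assumes, as the slides do, that the first sequence belongs to the more frequent word, so that two intervals are either disjoint or the first contains the second. Any other configuration raises an error instead of being silently skipped.

```python
def conj(outer, inner):
    """Return the intervals of `inner` contained in some interval of `outer`.
    Both arguments are lists of (start, end) tuples sorted from left to right."""
    result = []
    p = q = 0
    while p < len(outer) and q < len(inner):
        i, j = outer[p], inner[q]
        if i[0] <= j[0] and j[1] <= i[1]:   # case i): I_p contains J_q
            result.append(j)
            q += 1
        elif i[0] > j[1]:                   # case ii): I_p is right of J_q
            q += 1
        elif i[1] < j[0]:                   # case iii): I_p is left of J_q
            p += 1
        else:                               # excluded by the assumption above
            raise ValueError(f"unexpected overlap between {i} and {j}")
    return result

L_d = [(1, 4), (8, 14)]
L_f = [(1, 2), (5, 7), (10, 13), (16, 18)]
L_a = [(1, 1), (3, 3), (5, 6), (8, 8), (11, 11), (16, 16)]
L_b = [(5, 5), (12, 12)]

d_and_b = conj(L_d, L_b)
d_and_f = conj(L_d, L_f)
d_and_f_and_a = conj(d_and_f, L_a)
```

Run on the two examples from the slides, `d_and_b` holds only the interval [12, 12], and chaining the calls for d, f and a leaves [1, 1] and [11, 11], whose sublists {1} and {7} form the answer.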

Algorithm based on binary search
Let $L_1 = I_1, \dots, I_n$ and let $L_2 = J_1, \dots, J_m$ be two interval sequences with $m = |L_2| \le n = |L_1|$.
Let $l = \lfloor \log_2(n/m) \rfloor$. Then $2^l$ is the largest power of two not exceeding $n/m$. Let $t = n - 2^l + 1$.
[Fig. 3: Set intersection based on binary search; the last interval $J_m$ of $L_2$ is compared with $I_t$ in $L_1$.]
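The choice of the probe position t can be checked with a small helper. The function name `pivot_position` is hypothetical; the sketch computes the floor of the base-2 logarithm of n/m with integer arithmetic, which gives the same value as the real-valued formula because 2^l is an integer.

```python
def pivot_position(n, m):
    """Return (l, t) for sequence lengths n >= m >= 1, where 2**l is the
    largest power of two not exceeding n / m and t is a 1-based index."""
    if not 1 <= m <= n:
        raise ValueError("expected 1 <= m <= n")
    l = (n // m).bit_length() - 1   # floor(log2(n / m)) without floats
    t = n - 2 ** l + 1
    return l, t

first_probe = pivot_position(6, 2)    # |L_a| = 6, |L_d| = 2
second_probe = pivot_position(3, 1)   # 3 intervals against 1 interval
```

For n = 6 and m = 2 this gives l = 1 and t = 5; for n = 3 and m = 1 it gives l = 1 and t = 2.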

Algorithm based on binary search
Compare $J_m$ and $I_t$. If $J_m[1] > I_t[2]$ ($J_m$ appears to the right of $I_t$), we should look for the intervals in $L_1$ covered by $J_m$ somewhere to the right of $I_t$.
- Using the traditional binary search, we try to find an interval $I$ covered by $J_m$ with $l$ more comparisons.
- Around $I$, we then find (by a simple linear search) the left-most interval $x$ in $L_1$ that is covered by $J_m$; and then, with $l$ more comparisons, we find the right-most interval $y$ covered by $J_m$ in a similar way.
- Obviously, all the intervals between $x$ and $y$, including $x$ and $y$, are covered by $J_m$.

Algorithm based on binary search
This information allows us to reduce the problem to the situation illustrated in Fig. 3. To complete the whole operation, it is sufficient to apply the above process to $L_1'$ and $L_2'$, where $L_1' = I_1, \dots, I_{x-1}$ and $L_2' = J_1, \dots, J_{m-1}$.
[Fig. 4: $L_1$ and $L_2$ with the positions $t$, $x$ and $y$, and the reduced sequences $L_1'$ and $L_2'$.]

Algorithm based on binary search
If, on the other hand, $J_m[2] < I_t[1]$ ($J_m$ appears to the left of $I_t$), we should check the intervals to the left of $I_t$, and the problem immediately reduces to checking $L_2' = L_2$ against $L_1' = L_1[1 .. t - 1]$. We can complete the operation by applying the above process to $L_1'$ and $L_2'$.
[Fig. 5: $L_1$ and $L_2$ with the position $t$, and the reduced sequences $L_1'$ and $L_2' = L_2$.]

Algorithm based on binary search
However, $L_2'$ may become larger than $L_1'$. So in the recursive call, the roles of $L_2$ and $L_1$ may be reversed: we then check each interval $I$ in $L_2'$ against $L_1'$ to find an interval $I$ in $L_2'$ such that $I \supseteq$ the last interval in $L_1'$.
[Fig. 6: $L_1$ and $L_2$ with the positions $t$, $x$ and $y$, and the reduced sequences $L_1'$ and $L_2'$ with exchanged roles.]

Algorithm based on binary search
If $J_m \supseteq I_t$, we check linearly $I_{t-1}, I_{t-2}, \dots$ until we meet a left-most interval $x$ which can be covered by $J_m$. Then we check $I_{t+1}, I_{t+2}, \dots$ until we meet a right-most interval $y$ which can be covered by $J_m$. All the encountered nodes, except $x$ and $y$, must be covered by $J_m$. This reduces the problem to checking $L_2' = L_2[1 .. m - 1]$ against $L_1' = L_1[1 .. x]$.
[Fig. 7: $L_1$ and $L_2$ with the positions $t$, $x$ and $y$, and the reduced sequences $L_1'$ and $L_2'$.]

Algorithm based on binary search
If $J_m \subseteq I_t$ (we may have this case due to the role interchange), we add $J_m$ to the result, and the problem reduces to checking $L_2' = L_2[1 .. m - 1]$ against $L_1' = L_1[1 .. t]$.
[Fig. 8: $L_1$ and $L_2$ with the position $t$, and the reduced sequences $L_1'$ and $L_2'$.]

Algorithm based on binary search
Example 2. Consider $L_d$ = [1, 4][8, 14] and $L_a$ = [1, 1][3, 3][5, 6][8, 8][11, 11][16, 16] ($m = 2$, $n = 6$). By our algorithm based on binary search, the following operations are conducted:
Step 1: checking $L_d[2]$ = [8, 14] against $L_a$. $l = \lfloor \log_2(6/2) \rfloor = 1$, $t = n - 2^l + 1 = 6 - 2 + 1 = 5$, $L_a[5]$ = [11, 11]. Since [11, 11] $\subseteq$ [8, 14], we call linearSearch( ) to find $x = 4$ and $y = 5$.
Step 2: checking $L_d[1]$ = [1, 4] against $L_a[1 .. 3]$. $l = \lfloor \log_2(3/1) \rfloor = 1$, $t = 3 - 2 + 1 = 2$, $L_a[2]$ = [3, 3]. Since [3, 3] $\subseteq$ [1, 4], we call linearSearch( ) to find $x = 1$ and $y = 2$.
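The outcome of Example 2 can be cross-checked with a much simpler binary-search variant. The sketch below is not the setIntersect procedure from the slides: it looks up every interval of the shorter sequence in the longer one with Python's standard `bisect` module, which costs O(m log n) plus the size of the output. The function name `contained_intervals` is hypothetical, and the sketch assumes that intervals are either disjoint or nested, as they are in the tree encoding.

```python
from bisect import bisect_left, bisect_right

def contained_intervals(short, long_):
    """For each (start, end) in `short`, collect the intervals of `long_`
    that it covers. Both lists are sorted from left to right."""
    starts = [iv[0] for iv in long_]
    result = []
    for a, b in short:
        lo = bisect_left(starts, a)    # first interval starting at or after a
        hi = bisect_right(starts, b)   # one past the last interval starting <= b
        # With disjoint-or-nested intervals the end check never fails;
        # it is kept as a guard against malformed input.
        result.extend(iv for iv in long_[lo:hi] if iv[1] <= b)
    return result

L_d = [(1, 4), (8, 14)]
L_a = [(1, 1), (3, 3), (5, 6), (8, 8), (11, 11), (16, 16)]
covered = contained_intervals(L_d, L_a)
```

The intervals found for [8, 14] are the 4th and 5th entries of the longer sequence and those found for [1, 4] are the 1st and 2nd, which matches the values of x and y in the two steps of the example.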

IMPROVEMENTS
Search control by using LCAs (least common ancestors)
[Fig. 9: Illustration of $T_w$ for $w$ = a. The tree is built over the intervals of $L_a$ = [1, 1][3, 3][5, 6][8, 8][11, 11][16, 16] together with the intervals [1, 4], [8, 14], [8, 19] and [1, 20] of their LCAs.]

IMPROVEMENTS
$L_a$ = [1, 1][3, 3][5, 6][8, 8][11, 11][16, 16]
[Fig. 10: The intervals $I_a^1, \dots, I_a^6$ of $L_a$ arranged together with the intervals of their LCAs: [1, 1] [1, 4] [3, 3] [5, 6] [1, 20] [8, 8] [8, 14] [11, 11] [16, 16] [8, 19].]

Experiments
In the experiments, we have tested seven methods:
- Inverted files with melding [6] (IFm for short)
- Inverted files with adaptive [16] (IFa for short)
- Hashing-based (RanGroupScan in [19]; Hb for short)
- Skip-list-based [33] (SkipL for short)
- Interval-based (linear search, discussed in the paper; Ib for short)
- setIntersect (binary search, discussed in the paper; sI for short)
- setIntersect with LCAs (discussed in the paper; sIL for short)

Experiments
For the experiments with real data, we use part of the Wikipedia data, which contains more than 10 million text documents. We numbered the documents as they were stored, by assigning them a sequential number indicating their order in the indexing process. The characteristics of this collection are shown in Table 1.

Table 1: Characteristics of Wikipedia Data

| Characteristic | Value |
|---|---|
| Pages | 10,500,000 |
| Size (gigabytes) | 16.25 |
| Word occurrences (without markup) | 1,567,324,812 |
| Distinct words (after stemming) | 3,603,556 |

Experiments Two-word queries: Two inverted lists of the same length: 5 million elements on average for 20 queries

Experiments Queries with more than two words: inverted lists of the same length: 5 million elements on average for 20 queries

Summary
An efficient algorithm for the intersection of inverted lists
- Trie over sorted word sequences
- Tree encoding
- Interval sequences associated with words
- Binary search of interval sequences
Computational complexity
Time: $O(m(\log(n/m) + 1))$ with $m \le n$, where $n$ and $m$ are respectively the lengths of the two interval sequences taking part in the intersection.

Thank you.