Database Group – CSE - UNSW 1 Efficient Error-tolerant Query Autocompletion Chuan Xiao 1, Jianbin Qin 2, Wei Wang 2, Yoshiharu Ishikawa 1, Koji Tsuda 3,


1 Database Group – CSE - UNSW. Efficient Error-tolerant Query Autocompletion.
Chuan Xiao (1), Jianbin Qin (2), Wei Wang (2), Yoshiharu Ishikawa (1), Koji Tsuda (3), Kunihiko Sadakane (4)
(1) Nagoya University, Japan; (2) University of New South Wales, Australia; (3) AIST and JST ERATO, Japan; (4) NII, Japan
Presenter: Jianbin Qin, jqin@cse.unsw.edu.au

4 [Figure: the user (mobile phone, browser) sends the query string q to the core, an edit distance prefix searcher with an index over the target string set S, which returns the results R.]
Target string set S = {s1, s2, …, sn}. Edit distance threshold τ. Return the set of result strings R containing all strings s ∈ S such that ∃ s′ ≼ s (s′ is a prefix of s) with ed(s′, q) ≤ τ.
Example: τ = 1, q = “abc”, S = {“acdefg”, “cda”, …}. Then R = {“acdefg”}, as ed(“abc”, “ac”) = 1 ≤ τ.
Challenges: the string set S is usually very large, and query response time is critical.
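The definition above can be checked directly with a brute-force search. The following is only a minimal sketch for illustration, not the method presented in these slides: the function names `edit_distance` and `prefix_edit_search` are made up here, and the code simply tries every prefix of every string.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # match or substitute
        prev = cur
    return prev[-1]


def prefix_edit_search(strings, q, tau):
    """Return every s in strings that has some prefix s' with ed(s', q) <= tau."""
    result = []
    for s in strings:
        # s[:k] ranges over all prefixes of s, including the empty one.
        if any(edit_distance(s[:k], q) <= tau for k in range(len(s) + 1)):
            result.append(s)
    return result


# The example from the slide: tau = 1, q = "abc".
print(prefix_edit_search(["acdefg", "cda"], "abc", 1))  # ['acdefg']
```

This costs one edit distance computation per prefix per string, which is why an index is needed once S is large.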

5 Existing approach: directly index the string set S into a trie, and simulate the edit distance calculation when traversing the trie.
SiD String
S1 “abcd”
S2 “abdc”
S4 “bcd”
Example: τ = 1, when the user types in q = “”, q = “a”, q = “ab”, q = “abc”. [Figure: trie over S with numbered nodes, each marked ED = 0, ED = 1, or ED > 1 after every keystroke.]
Drawback: too many nodes are tracked during the process, O(|Σ|^τ).
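A rough sketch of simulating the edit distance calculation on a trie is given below. It is a simplified, non-incremental variant (it recomputes everything for the full query instead of updating active nodes per keystroke), so it should not be read as ICAN or ICPAN. The dictionary-based trie layout and the names `build_trie` and `fuzzy_prefix_search` are assumptions made for this example.

```python
def build_trie(strings):
    """Index {id: string} into a trie; each node stores the ids of strings below it."""
    root = {"children": {}, "ids": []}
    for sid, s in strings.items():
        node = root
        node["ids"].append(sid)
        for ch in s:
            node = node["children"].setdefault(ch, {"children": {}, "ids": []})
            node["ids"].append(sid)
    return root


def fuzzy_prefix_search(root, q, tau):
    """Ids of strings having a prefix within edit distance tau of q."""
    results = set()

    def visit(node, row):
        # row[j] = edit distance between this node's prefix and q[:j].
        if row[-1] <= tau:          # active node: every string below it matches
            results.update(node["ids"])
            return
        if min(row) > tau:          # no descendant can get back under tau
            return
        for ch, child in node["children"].items():
            new_row = [row[0] + 1]
            for j in range(1, len(q) + 1):
                cost = 0 if q[j - 1] == ch else 1
                new_row.append(min(new_row[j - 1] + 1, row[j] + 1, row[j - 1] + cost))
            visit(child, new_row)

    visit(root, list(range(len(q) + 1)))
    return results


trie = build_trie({"S1": "abcd", "S2": "abdc", "S4": "bcd"})
print(sorted(fuzzy_prefix_search(trie, "abc", 1)))  # ['S1', 'S2', 'S4']
```

Even with pruning, the traversal has to follow every branch whose distance is still within τ, which is the source of the many tracked nodes noted on the slide.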

6 We offer another option that trades space for runtime performance: build a deletion variants trie, which transforms an edit prefix search problem into an exact prefix search problem.
Index: up to 20× larger.
Error-tolerant prefix searcher: up to 1000× faster. One server can serve up to 1000 times more users simultaneously.

7 Deletion neighborhood generation. ⟨x, Dx⟩ is called a variant-list pair; Dx is the deletion list. V(s, k) is the union of the 0- to k-variant-list pairs, called the k-variant family of s.
Example: s = abcd, 2-variant family of s, V(s, 2):
0-variants: abcd {}
1-variants: bcd {1}, acd {2}, abd {3}, abc {4}
2-variants: cd {1,1}, bd {1,2}, bc {1,3}, ad {2,2}, ac {2,3}, ab {3,3}
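The family V(s, k) can be generated by choosing which positions to delete. The sketch below assumes, based only on the example (cd {1,1}, ab {3,3}), that each entry of a deletion list is a 1-based position in the string left after the earlier deletions; the function name `variant_family` is made up for this illustration.

```python
from itertools import combinations


def variant_family(s, k):
    """List of (x, Dx) pairs for 0 to k deletions, in the order used on the slide."""
    family = []
    for i in range(min(k, len(s)) + 1):
        # positions are 1-based positions in the original string s
        for positions in combinations(range(1, len(s) + 1), i):
            removed = set(positions)
            x = "".join(ch for p, ch in enumerate(s, start=1) if p not in removed)
            # After n earlier deletions, original position p has shifted to p - n.
            dx = tuple(p - n for n, p in enumerate(positions))
            family.append((x, dx))
    return family


for x, dx in variant_family("abcd", 2):
    print(x, set(dx) if not dx else list(dx))
```

For s = abcd and k = 2 this yields 1 + 4 + 6 = 11 pairs, matching the slide.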

8 Variants matching principle: given two strings s and t, ED(s, t) ≤ τ iff there exist ⟨x, Dx⟩ ∈ V(s, τ) and ⟨y, Dy⟩ ∈ V(t, τ) such that x = y and |Dx ∪ Dy| ≤ τ.
Two conditions need to be satisfied:
1. x = y: identical check (processed efficiently with an index).
2. |Dx ∪ Dy| ≤ τ: deletion list union size check (no efficient methods).
s = abcd, V(s, 2): abcd {}, bcd {1}, acd {2}, abd {3}, abc {4}, cd {1,1}, bd {1,2}, bc {1,3}, ad {2,2}, ac {2,3}, ab {3,3}
q = abxd, V(q, 2): abxd {}, bxd {1}, axd {2}, abd {3}, abx {4}, xd {1,1}, bd {1,2}, bx {1,3}, ad {2,2}, ax {2,3}, ab {3,3}
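The identical check on its own is a necessary condition: if ED(s, q) ≤ τ, the two strings share at least one variant with at most τ deletions on each side. The sketch below uses it as a candidate filter and then verifies with a plain edit distance computation, which is a swapped-in stand-in for the deletion list union size check, not the slide's method. All function names are assumptions for this example.

```python
from itertools import combinations


def deletion_variants(s, k):
    """All strings obtained from s by deleting at most k characters."""
    out = set()
    for i in range(min(k, len(s)) + 1):
        for keep in combinations(range(len(s)), len(s) - i):
            out.add("".join(s[p] for p in keep))
    return out


def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def matches(s, q, tau):
    """Identical check as a filter, then verification by edit distance."""
    shared = deletion_variants(s, tau) & deletion_variants(q, tau)
    if not shared:
        return False               # no common variant, so ED(s, q) > tau
    return edit_distance(s, q) <= tau


# s = abcd and q = abxd from the slide share these variants for tau = 2:
print(sorted(deletion_variants("abcd", 2) & deletion_variants("abxd", 2)))
# ['ab', 'abd', 'ad', 'bd']
```

A shared variant alone only bounds the distance by 2τ, which is why a second condition is needed after the identical check.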

9 Enumerated variants matching principle: given two strings s and q, ED(s, q) ≤ τ iff there exist ⟨x, Dx⟩ ∈ V(s, τ) and ⟨y, Dy⟩ ∈ EnumV(q, τ) such that x = y and Dx = Dy.
s = abcd, V(s, 2): abcd {}, bcd {1}, acd {2}, abd {3}, abc {4}, cd {1,1}, bd {1,2}, bc {1,3}, ad {2,2}, ac {2,3}, ab {3,3}
q = abxd, V(q, 2): abxd {}, bxd {1}, axd {2}, abd {3}, abx {4}, xd {1,1}, bd {1,2}, bx {1,3}, ad {2,2}, ax {2,3}, ab {3,3}
q = abxd, enumerated 2-variants family of q, EnumV(q, 2):
abxd {} → abxd {}, abxd {1}, abxd {2}, abxd {3,4}, ……, abxd {4,4}
abd {3} → abd {}, abd {1}, abd {2}, ……, abd {3,3}, ……, abd {3}, ……
ab {3,3} → ab {}, ab {3}, ab {3,3}

10 Then we encode a variant and its deletion list together by marking deleted positions with #, for example bd {1,2} becomes #b#d: s = “abcd” = {abcd, #bcd, a#cd, ab#d, abc#, ##cd, #b#d, …}
SiD String
S1 “abcd”
S2 “abdc”
S3 “bcd”
S1: abcd, #bcd, a#cd, ab#d, abc#, …
S2: abdc, #bdc, a#dc, ab#c, abd#, …
S3: bcd, #cd, b#d, bc#, …
[Figure: trie over the #-marked variants of S1, S2, and S3.]
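A small sketch of the data side of this encoding follows: each string's #-marked variants are inserted into a trie, and a marked prefix is then looked up exactly. It is illustrative only. The trie layout and the names `marked_variants`, `build_variant_trie`, and `exact_prefix_lookup` are assumptions, and the query-side enumeration of marked prefixes is not covered here.

```python
from itertools import combinations

MARK = "#"


def marked_variants(s, k):
    """Variants of s with up to k positions replaced by the deletion mark."""
    out = []
    for i in range(min(k, len(s)) + 1):
        for positions in combinations(range(len(s)), i):
            chars = list(s)
            for p in positions:
                chars[p] = MARK
            out.append("".join(chars))
    return out


def build_variant_trie(strings, k):
    """Trie over all marked variants; each node keeps the ids that reach it."""
    root = {"children": {}, "ids": set()}
    for sid, s in strings.items():
        root["ids"].add(sid)
        for v in marked_variants(s, k):
            node = root
            for ch in v:
                node = node["children"].setdefault(ch, {"children": {}, "ids": set()})
                node["ids"].add(sid)
    return root


def exact_prefix_lookup(root, prefix):
    """Ids of strings having a marked variant that starts with prefix."""
    node = root
    for ch in prefix:
        node = node["children"].get(ch)
        if node is None:
            return set()
    return node["ids"]


print(marked_variants("abcd", 2)[:7])
# ['abcd', '#bcd', 'a#cd', 'ab#d', 'abc#', '##cd', '#b#d']
trie = build_variant_trie({"S1": "abcd", "S2": "abdc", "S3": "bcd"}, 1)
print(sorted(exact_prefix_lookup(trie, "ab#")))  # ['S1', 'S2']
```

Storing every marked variant is what makes the index larger than a plain trie, while each lookup becomes an ordinary walk down one path.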

11 Example: t = 2, q = abc. Variants of q: abc, #bc, a#c, ab#, ##c, #b#, a##. Enumeration for each variant:
abc: abc, #abc, a#bc, ab#c, abc#, ##abc, #a#bc, #ab#c, #abc#, a##bc, a#b#c, a#bc#, ab##c, ab#c#, abc##
#bc: bc, #bc, ##bc, #b#c, #bc#
a#c: ac, a#c, #a#c, a##c, a#c#
ab#: ab, ab#, #ab#, a#b#, ab##
##c: c, #c, ##c
a##: a, a#, a##
#b#: b, #b, b#, #b#

13 [Figure: trie over the #-marked variants of S1, S2, and S3.]
q = “”: EnumV = { ξ, # }
q = “a”: EnumV = { a, a#, # }
q = “ab”: EnumV = { ab, ab#, a#, #b }
q = “abc”: EnumV = { abc, abc#, ab#c, ab#, a#c, #bc }
O(τ · (|q| + τ)^τ)

14 Dataset: DBLP, 351,207 terms, average length 8, |Σ| = 27. The prefix length is the query length. The times and sizes are all interval counts, averaged over 1000 queries. Edit distance threshold τ = 3. IncNGTrie: our algorithm. ICAN and ICPAN: previous direct trie methods.

15 DirectTrie: original trie.
NoReduction: IncNGTrie before compression.
StringMerge: merge branches reaching the same string.
SubtreeMerge: merge subtrees with identical content.

16 An alternative way to solve the edit prefix search problem. Our method is independent of the character set size. It gains up to a 1000-fold improvement in query performance. Data-adaptive enumeration method.

20 The core component is the prefix edit similarity search. A string Q t-edit prefix matches another string S if there exists a prefix of S whose edit distance to Q is within t. R = {s | s ∈ S, ∃ s′ ∈ P(s) such that ed(s′, Q) ≤ t}, where P(s) denotes all the prefixes of s.
Example: if t = 1, Q = “abc” t-edit prefix matches “acdefghtijk”, as “ac” is a prefix of “acdefghtijk” and ed(Q, “ac”) ≤ 1.
[Figure: the user client sends Q to the core (fuzzy prefix searcher with an index over the target string set, and a result ranker), which returns R.]

21 Index data strings into a trie (radix tree). Exact prefix match. t-edit prefix match. Q = “p”.
SiD String
S1 abcd
S2 abdc
S4 bade
S5 bcd
[Figure: trie over S with nodes numbered 0 to 12.]

22 [Figure: trie over the #-marked variants of S1, S2, and S3.]
q = “”: EnumV = { ξ 1, # 1, # 2 }
q = “a”: EnumV = { a 2, a# 2, a# 3, # 2 }
q = “ab”: EnumV = { ab 3, ab# 3, ab# 4, a# 3, #b 3 }
q = “abc”: EnumV = { abc 4, abc# 4, abc# 5, ab#c 4, ab# 4, a#c 4, #bc 4 }

23 Index data strings into a trie (radix tree). Keep active nodes while traversing the trie: for each query character Q[i] entered, traverse the trie and incrementally maintain all the nodes n such that ed(n, Q[1..i]) ≤ t (also called active nodes or states).
Id String
s1 cab
s2 eat
s3 map
[Figure: trie over s1, s2, and s3 with nodes numbered 0 to 9, shown for Q = Ø and for Q = “p”.]

24 Embed the second condition into the first condition and process it efficiently with the index.
s = “abcd”: 0-variant-list = { … }, 1-variant-list = { … }, 2-variant-list = { … }
q = “abxd”: 0-variant-list = { … }, 1-variant-list = { … }, 2-variant-list = { … }

25 Index data strings into a trie (radix tree). Exact prefix match. t-edit prefix match.
Extended from exact prefix search methods: directly index the strings S into a trie, then find the node that exactly matches the query q.
SiD String
S1 “abcd”
S2 “abdc”
S3 “bade”
S4 “bcd”
Example: the user types q = “”, q = “a”, q = “ab”, q = “abc”. [Figure: trie over S with numbered nodes.]

26 Index data strings into a trie (radix tree). Exact prefix match. t-edit prefix match.
Directly index the strings S into a trie, and simulate the edit distance calculation while traversing the trie.
SiD String
S1 “abcd”
S2 “abdc”
S3 “bade”
S4 “bcd”
Example: when t = 1, the user types q = “”, q = “a”, q = “ab”, q = “abc”. [Figure: trie over S with numbered nodes, each marked ED = 0, ED = 1, or ED > 1.]
Drawback: too many nodes are tracked during the process.

27 Index data strings into a trie (radix tree). Exact prefix match. t-edit prefix match.
Extended from exact prefix search methods: directly index the strings S into a trie, then find the node that exactly matches the query q.
SiD String
S1 “abcd”
S2 “abdc”
S3 “bcd”
Example: the user types q = “”, q = “a”, q = “ab”, q = “abc”. [Figure: trie over S with numbered nodes; leaves labeled S1, S2, and S4.]

28 [Figure: trie over #-marked variants, with leaves labeled S1, S2, and S3.]

30 k-matching variant: given two i-deletion-marked variants (0 ≤ i ≤ k) x and y, if y has the same string content as x (not counting the mark symbol) and the size of the union of their deletion-position lists is ≤ k, then y is called a k-matching variant of x.
Problem transformation. c#b: cb, c#b, #cb, cb#, c##b, #c#b, c#b#

