1
Online Spelling Correction for Query Completion
Huizhong Duan, UIUC
Bo-June (Paul) Hsu, Microsoft
WWW 2011, March 31, 2011
2
Background
Query misspellings are common (>10%):
- Typing quickly: exxit, mis[s]pell
- Inconsistent rules: concieve, conceirge
- Keyboard adjacency: imporyant
- Ambiguous word breaking: silver_light
- New words: kinnect
3
Spelling Correction
Goal: help users formulate their intent
- Offline: after entering the query
- Online: while entering the query
  - Inform users of potential errors
  - Help express information needs
  - Reduce effort to input the query
4
Motivation
Existing search engines offer limited online spelling correction.
- Offline spelling correction (see paper): model based on (weighted) edit distance (sketched below); data from query similarity, click logs, …
- Auto completion with error tolerance (Chaudhuri & Kaushik, 09): fuzzy search over a trie with a pre-specified maximum edit distance
  - Poor model for phonetic and transposition errors
  - Linear lookup time is not sufficient for interactive use
Goal: improve the error model and reduce correction time.
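For concreteness, the baseline's error model is a (weighted) edit distance. Below is a minimal sketch of such a distance; the default costs and the substitution-cost hook (where keyboard-adjacency or other weights would plug in) are illustrative, not the baseline's actual parameters.

```python
def weighted_edit_distance(source, target, sub_cost=None, ins_cost=1.0, del_cost=1.0):
    """Baseline-style weighted edit distance: standard Levenshtein DP where the
    substitution cost may depend on the character pair (e.g., cheaper for
    keyboard-adjacent keys). Costs here are illustrative."""
    if sub_cost is None:
        sub_cost = lambda a, b: 0.0 if a == b else 1.0
    m, n = len(source), len(target)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + del_cost
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + ins_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + del_cost,                       # delete a source char
                d[i][j - 1] + ins_cost,                       # insert a target char
                d[i - 1][j - 1] + sub_cost(source[i - 1], target[j - 1]),
            )
    return d[m][n]

print(weighted_edit_distance("imporyant", "important"))   # 1.0 (y -> t)
```

A distance like this charges a transposition such as concieve → conceive as two independent substitutions and gives phonetic confusions no special treatment, which is exactly the weakness noted above.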
5
Outline: Introduction, Model, Search, Evaluation, Conclusion
6
Offline Spelling Correction
Training: query correction pairs (faecbok ← facebook, kinnect ← kinect, …) and a query histogram (facebook 0.01, kinect 0.005, …) are used to learn transformation probabilities (ec ← ec 0.1, nn ← n 0.2, …) and a probability trie over queries.
Decoding: the fully entered query is corrected afterwards, e.g. elefnat → elephant.
(Diagram: training/decoding pipeline with the query trie and its node probabilities, e.g. a 0.4, b 0.2, c 0.2.)
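The pipeline above can be read as a noisy-channel ranker: the query histogram provides a prior over intended queries and the correction pairs train an error (transformation) model. A minimal sketch of the scoring step under that reading, with error_model_prob as a crude stand-in for the trained transformation model and the toy numbers taken from the slide:

```python
import math

# Toy query prior from the slide's query histogram.
QUERY_PRIOR = {"facebook": 0.01, "kinect": 0.005}

def error_model_prob(observed: str, intended: str) -> float:
    """Stand-in for the trained transformation model P(observed | intended):
    a crude score that just favors small character differences."""
    diff = sum(a != b for a, b in zip(observed, intended))
    diff += abs(len(observed) - len(intended))
    return math.exp(-diff)

def rank_corrections(observed: str, k: int = 5):
    """Noisy-channel ranking: prior(intended) * P(observed | intended)."""
    scored = [(q, p * error_model_prob(observed, q)) for q, p in QUERY_PRIOR.items()]
    scored.sort(key=lambda x: x[1], reverse=True)
    return scored[:k]

print(rank_corrections("faecbok"))   # facebook ranks first
print(rank_corrections("kinnect"))   # kinect ranks first
```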
7
Online Spelling Correction
Training: query correction pairs and a query histogram, as in the offline case, are used to learn transformation probabilities (ae ← ea 0.1, nn ← n 0.2, …) and the query trie.
Decoding: the partial query is corrected and completed while it is being entered, e.g. elefn → elephant.
(Diagram: training/decoding pipeline with the query trie and its node probabilities.)
8
9
Joint-sequence modeling (Bisani & Ney, 08)
- Learn common error patterns from spelling correction pairs, without segmentation labels
- Adjust the correction likelihood by interpolating the model with an identity transformation model (sketched below)
- Training: Expectation Maximization (E-step, M-step) with pruning and smoothing
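The interpolation can be sketched directly; the transformation units, their probabilities, and the weight λ below are illustrative placeholders for what the joint-sequence model would actually learn.

```python
# Toy transformation probabilities as a joint-sequence model might learn them,
# keyed by (intended_unit, observed_unit).
LEARNED = {("n", "nn"): 0.2, ("ea", "ae"): 0.1, ("e", "e"): 0.7}

def identity_prob(intended: str, observed: str) -> float:
    """Identity transformation model: all probability mass on exact copies."""
    return 1.0 if intended == observed else 0.0

def interpolated_prob(intended: str, observed: str, lam: float = 0.8) -> float:
    """Interpolate the learned error model with the identity model, so rare or
    noisy transformation patterns cannot dominate the correction likelihood."""
    learned = LEARNED.get((intended, observed), 0.0)
    return lam * learned + (1.0 - lam) * identity_prob(intended, observed)

print(interpolated_prob("e", "e"))    # 0.8 * 0.7 + 0.2 * 1.0 = 0.76
print(interpolated_prob("n", "nn"))   # 0.8 * 0.2 = 0.16 (pure error pattern)
```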
10
- Estimate the query prior from the empirical query frequency
- Add a future score for A* search (see the sketch below)

Query log:
  Query   Prob
  a       0.4
  ab      0.2
  ac      0.2
  abc     0.1
  abcc    0.1

(Diagram: the query log compiled into a trie; each node carries a probability, e.g. a 0.4, b 0.2, c 0.2, with $ edges marking end of query.)
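One way to realize this in code: store each query's empirical probability at its trie node and propagate a future score, the best probability reachable at or below each node, so an A* search can use it as a bound. A minimal sketch under those assumptions:

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.prob = 0.0      # probability of a query ending exactly here ($ edge)
        self.future = 0.0    # best probability of any completion at or below this node

def build_trie(query_probs):
    root = TrieNode()
    for query, p in query_probs.items():
        node = root
        node.future = max(node.future, p)
        for ch in query:
            node = node.children.setdefault(ch, TrieNode())
            node.future = max(node.future, p)   # propagate the future score
        node.prob = p
    return root

# Query probabilities from the slide's table.
trie = build_trie({"a": 0.4, "ab": 0.2, "ac": 0.2, "abc": 0.1, "abcc": 0.1})

node = trie.children["a"].children["b"]
print(node.prob, node.future)   # 0.2 0.2  (best completion under "ab")
```

The future score is an optimistic upper bound on the probability of any completion below the node, which is what an A*-style priority needs.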
11
Outline: Introduction, Model, Search, Evaluation, Conclusion
12
(Diagram: A* search over the query trie; each node carries its probability from the query log, e.g. a 0.4, b 0.2, c 0.2, c 0.1, with $ marking end of query.)
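A self-contained sketch of the best-first (A*-style) enumeration over such a trie, repeating the tiny trie from the previous sketch so it runs on its own. It is reduced to exact-prefix completion so the search mechanics stay visible; the paper's algorithm additionally expands error transformations of the typed prefix.

```python
import heapq

# Repeats the trie from the previous sketch so this block runs on its own.
class TrieNode:
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.prob = 0.0      # probability of a query ending exactly here ($)
        self.future = 0.0    # best probability reachable at or below this node

def build_trie(query_probs):
    root = TrieNode()
    for query, p in query_probs.items():
        node = root
        node.future = max(node.future, p)
        for ch in query:
            node = node.children.setdefault(ch, TrieNode())
            node.future = max(node.future, p)
        node.prob = p
    return root

def top_completions(root, prefix, k=3):
    """Best-first (A*-style) enumeration of the k most probable queries
    extending a typed prefix. The future score is the optimistic priority;
    error tolerance is omitted to keep the search mechanics visible."""
    node = root
    for ch in prefix:                       # exact walk down to the prefix node
        if ch not in node.children:
            return []
        node = node.children[ch]

    results = []
    # Heap entries: (negated priority, query text, node). A node of None marks
    # a finished query whose exact probability is the priority.
    frontier = [(-node.future, prefix, node)]
    while frontier and len(results) < k:
        neg_score, text, cur = heapq.heappop(frontier)
        if cur is None:                     # exact score popped: emit in order
            results.append((text, -neg_score))
            continue
        if cur.prob > 0:                    # the query may end here
            heapq.heappush(frontier, (-cur.prob, text, None))
        for ch, child in cur.children.items():
            heapq.heappush(frontier, (-child.future, text + ch, child))
    return results

trie = build_trie({"a": 0.4, "ab": 0.2, "ac": 0.2, "abc": 0.1, "abcc": 0.1})
print(top_completions(trie, "a"))   # [('a', 0.4), ('ab', 0.2), ('ac', 0.2)]
```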
13
Outline: Introduction, Model, Search, Evaluation, Conclusion
14
Data Sets

            Correctly Spelled    Misspelled       Total
  Unique    101,640 (70%)        44,226 (30%)     145,866
  Total     1,126,524 (80%)      283,854 (20%)    1,410,378

            Correctly Spelled    Misspelled       Total
  Unique    7,585 (76%)          2,374 (24%)      9,959
15
Metrics

Offline:
- Recall@K = # correct in top K / # queries
- Precision@K = (# correct / # suggested) in top K

Online:
- MinKeyStrokes (MKS) = # characters typed + # arrow keys + 1 enter key
- Penalized MKS (PMKS) = MKS + 0.1 × # suggested queries
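A sketch of how the online metrics could be computed for one target query, assuming a suggest(prefix) function that returns the ranked suggestion list after each keystroke. The keystroke simulation and the PMKS bookkeeping are this sketch's reading of the formulas above, not the paper's exact procedure.

```python
def min_keystrokes(target, suggest, max_k=10):
    """MKS: characters typed + arrow keys to reach the target suggestion + 1 enter.
    PMKS additionally charges 0.1 per suggested query shown before acceptance
    (one possible reading of the slide's formula)."""
    best_mks = len(target) + 1          # fallback: type the whole query, press enter
    best_pmks = float(best_mks)
    shown = 0
    for i in range(1, len(target) + 1):
        prefix = target[:i]
        suggestions = suggest(prefix)[:max_k]
        shown += len(suggestions)
        if target in suggestions:
            rank = suggestions.index(target) + 1      # 1-based rank in the list
            mks = i + rank + 1                        # chars + arrow keys + enter
            if mks < best_mks:
                best_mks = mks
                best_pmks = mks + 0.1 * shown
    return best_mks, best_pmks

# Toy usage with a hypothetical prefix-match suggester.
def toy_suggest(prefix):
    return [q for q in ("facebook", "face", "kinect") if q.startswith(prefix)]

print(min_keystrokes("facebook", toy_suggest))   # e.g. (3, 3.2)
```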
16
Results

              All Queries                  Misspelled Queries
              R@1      R@10     MKS        R@1      R@10     MKS
  Proposed    0.918*   0.976    11.86*     0.677*   0.900*   11.96*
  Edit Dist   0.899    0.973    13.39      0.579    0.887    14.53
  Google      N/A      N/A      13.01      N/A      N/A      13.49

- Baseline: weighted edit distance (Chaudhuri and Kaushik, 09); Google Suggest (August 2010) shown for comparison
- The proposed system outperforms the baseline on all metrics (p < 0.05) except R@10
- Google Suggest saves users 0.4 keystrokes over the baseline; the proposed system reduces keystrokes by a further 1.1
- 1.5 keystroke savings for misspelled queries!
17
Risk Pruning

- Apply a threshold to preserve suggestion relevance
- Risk = geometric mean of the transformation probability per character of the input query (sketched below)
- Prune suggestions with many high-risk words

                  All Queries
                R@1     R@10    P@1     P@10    MKS     PMKS
  No Pruning    0.918   0.976   0.920   0.262   11.86   19.60
  With Pruning  0.916   0.969   0.927   0.304   11.87   19.42

Pruning high-risk suggestions hurts recall and MKS slightly, but improves precision and PMKS significantly.
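The risk score can be sketched straight from the definition; char_trans_probs stands for the per-character transformation probabilities produced while decoding the input query (an assumed interface). Treating a low geometric mean as high risk, pruning whole suggestions rather than counting high-risk words, and the threshold value are all simplifications in this sketch.

```python
import math

def risk_score(char_trans_probs):
    """Geometric mean of the transformation probability assigned to each
    character of the input query (probabilities assumed strictly positive)."""
    if not char_trans_probs:
        return 0.0
    return math.exp(sum(math.log(p) for p in char_trans_probs) / len(char_trans_probs))

def keep_suggestion(char_trans_probs, threshold=0.2):
    """Prune a suggestion when its per-character transformations are, on
    (geometric) average, too improbable. The threshold is illustrative."""
    return risk_score(char_trans_probs) >= threshold

print(keep_suggestion([0.9, 0.8, 0.7, 0.9]))     # True: a confident correction
print(keep_suggestion([0.9, 0.01, 0.02, 0.9]))   # False: relies on unlikely edits
```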
18
Beam Pruning
Prune search paths to speed up correction:
- Absolute: limit the maximum number of paths expanded per query position
- Relative: keep only paths within a probability threshold of the best path at each query position
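Both variants can be sketched as filters over the set of live search paths at one query position; the (probability, partial-correction) path representation and the parameter values are illustrative.

```python
def absolute_beam(paths, max_paths=100):
    """Absolute pruning: keep at most max_paths highest-scoring paths
    at the current query position."""
    return sorted(paths, key=lambda p: p[0], reverse=True)[:max_paths]

def relative_beam(paths, ratio=1e-3):
    """Relative pruning: keep only paths whose probability is within a
    factor ratio of the best path at the current query position."""
    if not paths:
        return paths
    best = max(score for score, _ in paths)
    return [p for p in paths if p[0] >= best * ratio]

# Toy usage: paths as (probability, partial-correction) pairs.
paths = [(0.30, "face"), (0.10, "fase"), (0.0001, "fqce"), (0.00001, "fzce")]
print(absolute_beam(paths, max_paths=2))   # [(0.3, 'face'), (0.1, 'fase')]
print(relative_beam(paths, ratio=1e-3))    # drops the two lowest-probability paths
```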
19
Example
20
Outline: Introduction, Model, Search, Evaluation, Conclusion
21
Summary
- Modeled transformations using an unsupervised joint-sequence model trained from spelling correction pairs
- Proposed an efficient A* search algorithm with a modified trie data structure and beam pruning techniques
- Applied risk pruning to preserve suggestion relevance
- Defined metrics for evaluating online spelling correction

Future Work
- Explore additional sources of spelling correction pairs
- Utilize an n-gram language model as the query prior
- Extend the technique to other applications