Query Segmentation Using Conditional Random Fields
Xiaohui Yu and Huxia Shi, York University
KEYS’09 (SIGMOD Workshop)
Presented by Jaehui Park, IDS Lab., Seoul National University

Copyright 2008 by CEBT

INTRODUCTION
  effective search of text information in relational databases
  keyword search
    – one of the challenges: assembling keyword-matching tuples from different tables into one view
    – exponential search space (w.r.t. the number of keywords)

  Product
    id    name    color    manufac.   size
    id1   T41     black    cid1       12
    id2   T60     Silver   cid1       7
    id3   Mini    Blue     cid3       4
    id4   vaio    gray     cid2       7
    id5   vaio    pred     cid2       2
    id6   xnote   black    cid4       7

  Company
    id     name      headquarter
    cid1   IBM       China
    cid2   SONY      Japan
    cid3   DELL      USA
    cid4   samsung   Korea

  joined view (Product, Company)
    id1   T41   Black   cid1   12   cid1   IBM   China

  query: T41 IBM …

INTRODUCTION
  ex) “Green Mile Tom Hanks”
    – segments: [Green Mile] [Tom Hanks]
    – segment-matching (not keyword-matching) reduces the search space
  Conditional Random Fields (CRF)
    – probabilistic model for segmenting and labeling sequence data
    – normalized model that combines multiple feature functions
    – outperforms the Hidden Markov Model and the Maximum-Entropy Markov Model in real-world labeling tasks
    – relaxes the independence assumption
    – avoids the label bias problem
    – models a conditional probability distribution over a label sequence given a keyword sequence, e.g. for the query “Green Mile Tom Hanks”

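For reference, the conditional distribution mentioned above is usually written in the standard linear-chain CRF form shown below. This is the general textbook form, not a formula recovered from the slide; the feature functions f_k, weights λ_k, and normalizer Z(x) are notation assumed here for illustration.

```latex
p(\mathbf{y} \mid \mathbf{x})
  = \frac{1}{Z(\mathbf{x})}
    \exp\left( \sum_{t=1}^{n} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, \mathbf{x}, t) \right),
\qquad
Z(\mathbf{x})
  = \sum_{\tilde{\mathbf{y}}}
    \exp\left( \sum_{t=1}^{n} \sum_{k} \lambda_k \, f_k(\tilde{y}_{t-1}, \tilde{y}_t, \mathbf{x}, t) \right)
```

Here x is the keyword sequence, y is a candidate label sequence, and Z(x) sums over all label sequences so that the probabilities are normalized.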
PROBLEM DEFINITION
  query
    – an ordered keyword sequence, x = (x_1, ..., x_n)
  segment
    – a subsequence of keywords in the query, S = (x_i, ..., x_j)
    – is valid if this subsequence appears at least once in the database D
  segmentation
    – a sequence of non-overlapping segments that completely cover all keywords in the query, Š = (S_1, ..., S_m)
    – is valid iff all S ∈ Š are valid
    – ex) query: Star Wars Clone, which can have more than one valid segmentation
  goal: to find the optimal segmentation

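The definitions above can be made concrete with a small sketch that enumerates every valid segmentation of a query. The database is replaced by a hypothetical list of attribute values (`TOY_VALUES`), and validity is assumed to mean that the keywords occur contiguously in at least one value; both are illustrative assumptions, not details from the slides.

```python
from typing import List, Sequence, Tuple

# Hypothetical stand-in for the database D: a few attribute values.
TOY_VALUES = ["star wars", "clone wars", "star", "clone"]


def is_valid(segment: Sequence[str], values: Sequence[str] = TOY_VALUES) -> bool:
    """Assumed validity test: the keywords occur contiguously in some value."""
    seg = [w.lower() for w in segment]
    for value in values:
        tokens = value.lower().split()
        for i in range(len(tokens) - len(seg) + 1):
            if tokens[i:i + len(seg)] == seg:
                return True
    return False


def valid_segmentations(query: Sequence[str]) -> List[List[Tuple[str, ...]]]:
    """Enumerate all ways to cover the query with non-overlapping valid segments."""
    if not query:
        return [[]]  # exactly one way to segment an empty query
    results = []
    for end in range(1, len(query) + 1):
        head = tuple(query[:end])
        if is_valid(head):
            # Every valid segmentation of the remainder extends this head.
            for rest in valid_segmentations(query[end:]):
                results.append([head] + rest)
    return results


segmentations = valid_segmentations(["Star", "Wars", "Clone"])
```

With this toy data the query has two valid segmentations, which is why a criterion for picking the optimal one is needed.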
NON-STATISTICAL ALGORITHMS
  greedy search
    – start with the first keyword in a given query
    – keep including the next keyword in the current segment until adding the new keyword would make the segment no longer valid
    – the keyword that breaks validity starts the next segment

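The greedy search described above can be sketched as follows. The validity test (`appears_in_toy_db`) and the toy data are hypothetical stand-ins for a lookup against the real database.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical stand-in for the database D: a handful of attribute values.
TOY_VALUES = ["the green mile", "tom hanks"]


def appears_in_toy_db(segment: Sequence[str]) -> bool:
    """Assumed validity test: the keywords occur contiguously in some value."""
    seg = [w.lower() for w in segment]
    for value in TOY_VALUES:
        tokens = value.split()
        if any(tokens[i:i + len(seg)] == seg
               for i in range(len(tokens) - len(seg) + 1)):
            return True
    return False


def greedy_segment(
    query: Sequence[str],
    is_valid: Callable[[Sequence[str]], bool] = appears_in_toy_db,
) -> List[Tuple[str, ...]]:
    """Extend the current segment until the next keyword would make it invalid."""
    segments: List[Tuple[str, ...]] = []
    current: List[str] = []
    for keyword in query:
        if current and not is_valid(current + [keyword]):
            # Close the current segment and start a new one at this keyword.
            segments.append(tuple(current))
            current = [keyword]
        else:
            current.append(keyword)
    if current:
        segments.append(tuple(current))
    return segments


result = greedy_segment(["Green", "Mile", "Tom", "Hanks"])
```

Note that in this sketch a keyword that matches nothing in the database still ends up in a segment of its own; how such keywords should be handled is not stated on the slide.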
NON-STATISTICAL ALGORITHMS
  Keyword Query Cleaning [VLDB 2008]
    – dynamic programming
    – expands each keyword to a set of similar tokens
    – scoring function
        – TF-IDF (IR sense)
        – favors longer segments
        – penalizes spelling corrections

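The dynamic-programming idea can be illustrated with a generic sketch that picks the segmentation maximizing a sum of per-segment scores. This is not the Keyword Query Cleaning algorithm itself: token expansion and the TF-IDF score are omitted, and `toy_score` (squared segment length for valid segments, which merely favors longer segments) and the `VALID` set are assumptions made for illustration.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Segment = Tuple[str, ...]

# Hypothetical set of valid segments standing in for database lookups.
VALID = {
    ("green",), ("mile",), ("tom",), ("hanks",),
    ("green", "mile"), ("tom", "hanks"),
}


def toy_score(segment: Segment) -> Optional[float]:
    """Assumed score: squared length for valid segments, None for invalid ones."""
    return float(len(segment) ** 2) if segment in VALID else None


def best_segmentation(
    query: Sequence[str],
    score: Callable[[Segment], Optional[float]] = toy_score,
) -> List[Segment]:
    """Return the segmentation with the highest total score, or [] if none exists."""
    n = len(query)
    best = [float("-inf")] * (n + 1)  # best[i]: best total score for query[:i]
    back = [0] * (n + 1)              # back[i]: start index of the last segment
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            s = score(tuple(query[start:end]))
            if s is None:
                continue
            candidate = best[start] + s
            if candidate > best[end]:
                best[end] = candidate
                back[end] = start
    if best[n] == float("-inf"):
        return []
    # Walk the back-pointers to recover the segments.
    segments: List[Segment] = []
    end = n
    while end > 0:
        start = back[end]
        segments.append(tuple(query[start:end]))
        end = start
    segments.reverse()
    return segments


result = best_segmentation(["green", "mile", "tom", "hanks"])
```

The table `best` has one entry per query prefix, so the search considers each (start, end) pair once instead of enumerating all segmentations.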
QUERY SEGMENTATION USING CRF
  compute the label for each keyword in a given query
  group adjacent keywords with the same label into the same segment
    – conditional probability p(y | x)
        – y: label sequence
        – x: keyword sequence
    – best label sequence y’ = argmax_y p(y | x)
  training set
    – D = {(x_k, y_k)}, k = 1, ..., N, obtained from query logs

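The grouping step above (adjacent keywords with the same label form one segment) can be sketched as follows. The label names `title` and `actor` are hypothetical; the slides do not list the label set.

```python
from itertools import groupby
from typing import List, Sequence, Tuple


def segments_from_labels(
    keywords: Sequence[str], labels: Sequence[str]
) -> List[Tuple[str, Tuple[str, ...]]]:
    """Group runs of adjacent keywords that share a label into segments."""
    if len(keywords) != len(labels):
        raise ValueError("keywords and labels must have the same length")
    pairs = zip(keywords, labels)
    return [
        (label, tuple(keyword for keyword, _ in run))
        for label, run in groupby(pairs, key=lambda pair: pair[1])
    ]


result = segments_from_labels(
    ["Green", "Mile", "Tom", "Hanks"],
    ["title", "title", "actor", "actor"],  # hypothetical predicted labels
)
```

Because only label changes create boundaries, two adjacent segments with the same label would be merged into one, which is the situation the later slides address.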
QUERY SEGMENTATION USING CRF
  the query is segmented based on those labels
  invalid segments are further broken down into valid segments
    – ex) “Green Mile Tom Hanks”, “Johnny Depp Orlando Bloom”
  MaxScore, MaxTerm algorithms
    – compute the optimal segmentation S’
    – the optimal segmentation for each invalid segment is computed through a tree search procedure
    – MaxTerm: the valid segment of maximum length
    – ex) the search goes from the finest segments up to MaxTerm

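As a small illustration of the MaxTerm definition only (the valid segment of maximum length), the sketch below scans an invalid segment for its longest valid contiguous sub-segment. It does not reproduce the tree search of the MaxScore and MaxTerm algorithms; the `VALID` set is a hypothetical stand-in for database lookups.

```python
from typing import Callable, Optional, Sequence, Tuple

Segment = Tuple[str, ...]

# Hypothetical set of valid segments standing in for database lookups.
VALID = {
    ("johnny",), ("depp",), ("orlando",), ("bloom",),
    ("johnny", "depp"), ("orlando", "bloom"),
}


def max_term(
    keywords: Sequence[str],
    is_valid: Callable[[Segment], bool] = lambda seg: seg in VALID,
) -> Optional[Segment]:
    """Return the longest valid contiguous sub-segment (leftmost on ties)."""
    n = len(keywords)
    for length in range(n, 0, -1):           # try the longest candidates first
        for start in range(n - length + 1):
            candidate = tuple(keywords[start:start + length])
            if is_valid(candidate):
                return candidate
    return None


result = max_term(["johnny", "depp", "orlando", "bloom"])
```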
QUERY SEGMENTATION USING CRF
  enhanced CRF model
    – column-position pairing: exact position of a keyword in a segment
    – ex) “Green Mile Tom Hanks”
        – start position: segmentation boundary
        – other position
    – adapting to user preferences

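A minimal sketch of how column-position labels could be decoded into segments, assuming each label is a (column, position) pair with position either `start` or `other`. The pair encoding and the column name `actor` are assumptions for illustration; the slides only name the two kinds of position.

```python
from typing import List, Sequence, Tuple


def segments_from_position_labels(
    keywords: Sequence[str], labels: Sequence[Tuple[str, str]]
) -> List[Tuple[str, Tuple[str, ...]]]:
    """Open a new segment at every 'start' label or whenever the column changes."""
    if len(keywords) != len(labels):
        raise ValueError("keywords and labels must have the same length")
    open_segments: List[Tuple[str, List[str]]] = []
    for keyword, (column, position) in zip(keywords, labels):
        starts_new = (
            position == "start"
            or not open_segments
            or open_segments[-1][0] != column
        )
        if starts_new:
            open_segments.append((column, [keyword]))
        else:
            open_segments[-1][1].append(keyword)
    return [(column, tuple(words)) for column, words in open_segments]


result = segments_from_position_labels(
    ["Johnny", "Depp", "Orlando", "Bloom"],
    [("actor", "start"), ("actor", "other"),
     ("actor", "start"), ("actor", "other")],  # hypothetical labels
)
```

With the position encoded in the label, two adjacent segments from the same column are kept apart without a separate repair step.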
EXPERIMENTS
  dataset: IMDB, FoodMark
    – the training set and test queries are generated by random sampling
    – 10-fold cross validation
  segmentation accuracy
    – A_x = 1 - (|S_x - S’_x| + |S’_x - S_x|) / |S_x|
    – S_x: true segment set
    – S’_x: predicted segment set
  accuracy
    – the CRF model is not sensitive to query length

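The accuracy measure above translates directly into set differences. The sketch below treats each segment as a tuple of keywords; the example segments are made up for illustration.

```python
from typing import Iterable, Sequence


def segmentation_accuracy(
    true_segments: Iterable[Sequence[str]],
    predicted_segments: Iterable[Sequence[str]],
) -> float:
    """A_x = 1 - (|S_x - S'_x| + |S'_x - S_x|) / |S_x|."""
    true_set = {tuple(seg) for seg in true_segments}
    pred_set = {tuple(seg) for seg in predicted_segments}
    if not true_set:
        raise ValueError("the true segment set must not be empty")
    missed = len(true_set - pred_set)    # true segments that were not predicted
    spurious = len(pred_set - true_set)  # predicted segments that are not true
    return 1 - (missed + spurious) / len(true_set)


# Made-up example: the last two true segments are merged in the prediction.
truth = [("green", "mile"), ("tom", "hanks"), ("1999",), ("drama",)]
predicted = [("green", "mile"), ("tom", "hanks"), ("1999", "drama")]
accuracy = segmentation_accuracy(truth, predicted)
```

As written, the measure can drop below zero when the prediction has many spurious segments, since both kinds of error are counted against |S_x| alone.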
EXPERIMENTS
  ambiguous connection
    – deciding which segment a keyword should belong to when that keyword can form valid segments with both the preceding and the following keywords
    – ex) in (..., k_(i-1), k_i, k_(i+1), ...), keyword k_i can join either neighbor
  ambiguity level
    – the number of ambiguous connections

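The ambiguity level can be computed with a single pass over the query. This sketch assumes that "forming a valid segment with a neighbor" means the two-keyword segment is valid; the `VALID_PAIRS` set is a hypothetical stand-in for database lookups.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical valid two-keyword segments standing in for database lookups.
VALID_PAIRS = {("new", "york"), ("york", "times")}


def ambiguity_level(
    query: Sequence[str],
    is_valid: Callable[[Tuple[str, str]], bool] = lambda pair: pair in VALID_PAIRS,
) -> int:
    """Count keywords that form a valid pair with both of their neighbors."""
    count = 0
    for i in range(1, len(query) - 1):
        with_previous = is_valid((query[i - 1], query[i]))
        with_next = is_valid((query[i], query[i + 1]))
        if with_previous and with_next:
            count += 1
    return count


level = ambiguity_level(["new", "york", "times", "square"])
```

In this toy query only "york" is ambiguous, because it can attach to "new" or to "times".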
EXPERIMENTS
  efficient query segmentation
    – two or three orders of magnitude improvement over keyword query cleaning
    – less than 0.02 seconds to segment one medium query (except for keyword query cleaning)

CONCLUSION
  CRF-based models for query segmentation
  experiments have demonstrated the effectiveness of the proposed approach
  future work
    – accommodating spelling errors
    – online segmentation of queries in a streaming fashion

DISCUSSION
  a subsequence of a valid segment is also valid
  strong assumptions
    – prefix as a subsequence of keywords
    – only adjacent keywords are considered for the same segment
  not clear
    – the sum over all segments is always constant (MaxScore algorithm)
        – it cannot solve the problem of two terms in a segment
    – the tree merging algorithm is hard to follow (MaxTermSearch algorithm)