Named Entity Recognition in Query
Jiafeng Guo, Gu Xu, Xueqi Cheng, Hang Li (ACM SIGIR 2009)
Speaker: Yi-Lin,Hsu
Advisor: Dr. Koh, Jia-ling
Date: 2009/11/16
Outline
– Introduction to NERQ
– NERQ Problem
– Implementation
– WS-LDA
– Experimental Results
– Conclusion and Future Work
2009/10/22
Introduction to NERQ
Named entity recognition (NER) is a subtask of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, and locations, expressions of time, quantities, monetary values, and percentages.
Introduction to NERQ
NERQ involves two tasks:
– 1. Detection of the named entity in a given query
– 2. Classification of the named entity into predefined classes
– Example: mine movie titles
– Applications: Web search, etc.
Challenges
– Queries are usually very short
– Queries are not necessarily in standard form
Query Data
New data source for NER
– About 70% of search queries contain named entities.
– Rich context for determining the classes of entities.
Query Context
– “harry potter walkthrough” → “harry potter cheats” (contexts in the same class)
Wisdom of crowds
– Very large-scale data that keeps growing
– Frequent updates with emerging named entities
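NERQ Problem
A query containing one named entity is represented as a triple (e, t, c):
– e: the named entity
– t: the context of e, written α#β, where # marks the position of e and α and β are the words to its left and right
– c: the class of e
– Example: “harry potter walkthrough” → (“harry potter”, “# walkthrough”, “Game”)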
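The α#β notation can be made concrete with a small sketch. The function below is illustrative only: the name `candidate_splits` and the whitespace tokenization are assumptions, not something taken from the paper. It treats every contiguous token span as a possible entity e and writes the remaining words as a context t with # in the entity's place.

```python
def candidate_splits(query):
    """Enumerate (entity, context) pairs for a whitespace-tokenized query.

    Hypothetical helper: every contiguous token span is treated as a
    possible entity e; the remaining tokens form the context t, with '#'
    marking where e stood (the alpha#beta notation).
    """
    tokens = query.split()
    n = len(tokens)
    pairs = []
    for start in range(n):
        for end in range(start + 1, n + 1):
            entity = " ".join(tokens[start:end])
            context = " ".join(tokens[:start] + ["#"] + tokens[end:])
            pairs.append((entity, context))
    return pairs


if __name__ == "__main__":
    for entity, context in candidate_splits("harry potter walkthrough"):
        print(f"e = {entity!r:28} t = {context!r}")
```

A three-word query yields six such pairs; pairing each with a class c gives the candidate triples among which the model has to choose.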
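Probabilistic Approach
$$(e,t,c)^* = \arg\max_{(e,t,c)} \Pr(q,e,t,c) = \arg\max_{(e,t,c)} \Pr(q \mid e,t,c)\,\Pr(e,t,c) = \arg\max_{(e,t,c) \in G(q)} \Pr(e,t,c) \quad (1)$$
$$\Pr(e,t,c) = \Pr(e)\,\Pr(c \mid e)\,\Pr(t \mid e,c) = \Pr(e)\,\Pr(c \mid e)\,\Pr(t \mid c) \quad (2)$$
– G(q): the set of triples (e, t, c) that can be generated from the query q
– Assumption made in the last step of (2): the context t depends only on the class c, not on the specific entity e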
Topic Model for NERQ
Given training data $T = \{(e_i, t_i, c_i) \mid i = 1, \dots, N\}$, the learning problem can be formalized as: [equation not preserved]
Implementation
– Offline Training
– Online Prediction
Offline Training
– Seeds: Harry Potter
– Scan the query log for the seed named entities and collect the queries that contain them
– Query log: Harry Potter trail; Harry Potter walk through; Harry Potter cheats
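The seed-scanning step can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `collect_contexts`, the lowercasing, and the whitespace tokenization are choices made here, and the last entry of the toy query log is a made-up non-matching query added for contrast.

```python
from collections import Counter


def collect_contexts(query_log, seeds):
    """Count the contexts in which each seed entity occurs in the query log.

    Hypothetical helper: matching is done on lowercased whitespace tokens,
    and only the first occurrence of a seed in a query is used.
    """
    counts = {seed: Counter() for seed in seeds}
    for query in query_log:
        q_tokens = query.lower().split()
        for seed in seeds:
            s_tokens = seed.lower().split()
            k = len(s_tokens)
            for i in range(len(q_tokens) - k + 1):
                if q_tokens[i:i + k] == s_tokens:
                    # Replace the matched entity with the '#' placeholder.
                    context = " ".join(q_tokens[:i] + ["#"] + q_tokens[i + k:])
                    counts[seed][context] += 1
                    break
    return counts


# The first three queries come from the slide; the last one is invented.
QUERY_LOG = [
    "Harry Potter trail",
    "Harry Potter walk through",
    "Harry Potter cheats",
    "weather tomorrow",
]

if __name__ == "__main__":
    print(collect_contexts(QUERY_LOG, ["Harry Potter"]))
```

The resulting context counts per seed are the kind of (entity, context) training pairs that the learning step consumes.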
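Offline Training
– Example: the query “Harry Potter trails” is decomposed into named entity “Harry Potter”, context “trails”, and class “movie”; “New Moon” also appears in the slide diagram as a named entity
– Pr(e): the total frequency of queries containing e in the query log
– Pr(c|e): estimated by WS-LDA
– Pr(t|c): fixed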
Online Prediction
– Example query: harry potter trails
– Find the most likely triple (e, t, c) in G(q)
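Online prediction can be sketched as an exhaustive search over G(q), scoring each triple as Pr(e) Pr(c|e) Pr(t|c). Everything concrete below is an assumption made for illustration: the function name `predict`, the dictionary layout, and all probability values are invented toy numbers, not learned parameters from the paper.

```python
def predict(query, p_e, p_c_given_e, p_t_given_c):
    """Return the best (e, t, c) triple for a query and its score.

    Score = Pr(e) * Pr(c|e) * Pr(t|c). Triples with an unknown entity or
    an unseen context get score 0 and are never selected.
    """
    tokens = query.split()
    best, best_score = None, 0.0
    for start in range(len(tokens)):
        for end in range(start + 1, len(tokens) + 1):
            e = " ".join(tokens[start:end])
            if e not in p_e:
                continue
            t = " ".join(tokens[:start] + ["#"] + tokens[end:])
            for c, p_c in p_c_given_e.get(e, {}).items():
                score = p_e[e] * p_c * p_t_given_c.get(c, {}).get(t, 0.0)
                if score > best_score:
                    best, best_score = (e, t, c), score
    return best, best_score


# Toy parameters, invented purely for this example.
P_E = {"harry potter": 0.6, "harry": 0.4}
P_C_GIVEN_E = {
    "harry potter": {"Movie": 0.5, "Book": 0.3, "Game": 0.2},
    "harry": {"Music": 1.0},
}
P_T_GIVEN_C = {
    "Movie": {"# trails": 0.2},
    "Book": {"# trails": 0.01},
    "Game": {"# trails": 0.02, "# cheats": 0.3},
    "Music": {"# potter trails": 0.0001},
}

if __name__ == "__main__":
    print(predict("harry potter trails", P_E, P_C_GIVEN_E, P_T_GIVEN_C))
    print(predict("harry potter cheats", P_E, P_C_GIVEN_E, P_T_GIVEN_C))
```

With these toy numbers the same entity is assigned to different classes depending on its context, which is the behavior the context term Pr(t|c) is meant to provide.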
WS-LDA
WS-LDA
Introduce weak supervision
– Objective: LDA log likelihood + soft constraints
– The soft constraints are defined from two quantities: the document probability on the i-th class and the document binary label on the i-th class
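One way to read the two quantities named on this slide is as a sum, over classes, of the binary label times the document's probability on that class, added to the log likelihood with a weight. The sketch below encodes that reading only; the exact functional form, the weight `lam`, and both function names are assumptions here and should be checked against the paper.

```python
def soft_constraint(class_probs, labels):
    """Sum over classes of (binary label) * (document probability on class).

    Assumed form: the value is larger when the document's probability
    mass falls on the classes it is labeled with.
    """
    return sum(y * z for y, z in zip(labels, class_probs))


def ws_lda_objective(log_likelihood, class_probs, labels, lam=1.0):
    """LDA log likelihood plus a weighted soft-constraint term.

    `lam` is a hypothetical trade-off weight between fitting the data
    and agreeing with the weak labels.
    """
    return log_likelihood + lam * soft_constraint(class_probs, labels)


if __name__ == "__main__":
    # A document (named entity) labeled with classes 0 and 2 out of four.
    probs = [0.7, 0.1, 0.1, 0.1]
    labels = [1, 0, 1, 0]
    print(soft_constraint(probs, labels))
    print(ws_lda_objective(-10.0, probs, labels, lam=2.0))
```

Because the labels only reward mass on the annotated classes rather than forcing a single class, the supervision stays soft: an entity labeled with several classes can still distribute its probability among them.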
WS-LDA
Objective function: [equation not preserved]
Experiments
– A real data set consisting of 6 billion queries, of which 930 million are unique
– Four semantic classes: “Movie”, “Game”, “Book”, and “Music”
– 4 human annotators
– 180 named entities were selected from the web sites of Amazon, GameSpot, and Lyrics: 120 for training and 60 for testing
– Finally, we obtain 432,304 contexts and about 1.5 million named entities
Experiments
Randomly sampled 400 queries from the recognition results (0.14 million) for evaluation.
Example Queries
pics of fight club | braveheart quote
watch gladiator online | american beauty company
12 angry men characters | mario kart guide
pc mass effect | crysis mods
mother teresa images | condemned screenshots
4 minutes lyric | king kong
the black swan summary | blackwater novel
new moon | rehab the song
nineteen minutes synopsis | umbrella chords
all summer long video | girlfriend lyrics
Experiments
The performance of NERQ is evaluated in terms of Top N accuracy.
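Top N accuracy can be sketched as below, assuming the usual definition: a query counts as correct when its annotated triple appears among the first N ranked predictions. The function name and the toy data are invented for illustration.

```python
def top_n_accuracy(ranked_predictions, gold, n):
    """Fraction of queries whose gold triple is among the top-n predictions.

    ranked_predictions: one list of (e, t, c) triples per query, best first.
    gold: one annotated (e, t, c) triple per query, in the same order.
    """
    if not gold:
        return 0.0
    hits = sum(
        1 for preds, truth in zip(ranked_predictions, gold) if truth in preds[:n]
    )
    return hits / len(gold)


if __name__ == "__main__":
    movie = ("harry potter", "# trails", "Movie")
    game = ("harry potter", "# trails", "Game")
    # Toy rankings for two evaluation queries (invented data).
    ranked = [[movie, game], [game, movie]]
    gold = [movie, movie]
    print(top_n_accuracy(ranked, gold, n=1))
    print(top_n_accuracy(ranked, gold, n=2))
```

Larger N is more forgiving, so reporting several values of N shows how often the right answer is at least near the top of the ranking.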
Experiments
We performed experiments to compare the WS-LDA approach with two baseline methods: Determ and LDA.
– Determ learns the contexts of a class by simply aggregating all the contexts of the named entities belonging to that class.
– LDA and WS-LDA take a probabilistic approach.
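The Determ baseline as described above can be sketched in a few lines. The data layout, the function name `determ_contexts`, and the toy entities are assumptions for illustration; in particular, letting an entity carry several class labels and counting its contexts under each of them is one plausible reading of "belonging to that class".

```python
from collections import Counter, defaultdict


def determ_contexts(entity_contexts, entity_classes):
    """Aggregate context counts per class.

    entity_contexts: entity -> list of contexts observed with it.
    entity_classes: entity -> list of class labels assigned to it.
    Every context of an entity is counted once for each of its classes.
    """
    by_class = defaultdict(Counter)
    for entity, contexts in entity_contexts.items():
        for cls in entity_classes.get(entity, []):
            by_class[cls].update(contexts)
    return {cls: dict(counter) for cls, counter in by_class.items()}


if __name__ == "__main__":
    # Toy data, invented for this example.
    contexts = {
        "harry potter": ["# trailer", "# cheats"],
        "gladiator": ["watch # online", "# trailer"],
        "mario kart": ["# guide", "# cheats"],
    }
    classes = {
        "harry potter": ["Movie", "Game"],
        "gladiator": ["Movie"],
        "mario kart": ["Game"],
    }
    print(determ_contexts(contexts, classes))
```

In this toy run "# cheats" ends up counted under Movie only because "harry potter" is also labeled Movie, which illustrates why plain aggregation can attach a context to the wrong class when an entity is ambiguous.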
Experiments
Learned contexts for each class (Movie Contexts, Game Contexts, Book Contexts, Music Contexts), compared across Determ, LDA, and WS-LDA [table contents not preserved]
Table 5: Comparisons on Learned Named Entities of Each Class (columns: Movie, Game, Book, Music, Average-Class) [table contents not preserved]
Experiments
Comparisons between WS-LDA and LDA
Conclusion
– Formalized the problem of NERQ
– Proposed a novel method for NERQ
– Developed a new topic model called WS-LDA
Future Work:
– We plan to add more classes and conduct experiments with them.
– The proposed method focuses on queries with a single named entity.
– Some queries contain named entities outside the predefined classes (e.g. american beauty company).
– Some contexts were not learned in our approach because they are uncommon (e.g. lyrics for # by chris brown).