Named Entity Mining From Click-Through Data Using Weakly Supervised LDA. Gu Xu (1), Shuang-Hong Yang (1,2), Hang Li (1). (1) Microsoft Research Asia, China; (2) College of Computing, Georgia Tech, USA
Talk Outline: Named Entity Mining – Exploiting click-through data – Applying Latent Dirichlet Allocation – Developing a weakly supervised learning approach. Weakly Supervised LDA. Experimental Results. Summary.
Named Entity Mining: Named Entity Mining (NEM) – To mine information about the named entities of a given class from a large amount of data. – Example: mine movie titles from a collection of text data – Applications: Web search, etc. Three challenges and the technique used to address each – Suitable data source for NEM: click-through data – Ambiguity in classes of named entities: LDA (topic model) – Supervision from human knowledge: weakly supervised learning
Click-through Data: Query context – [movie] trailer, [game] cheats. Click context – imdb.com for movies, gamespot.com for games – Wisdom of crowds. Very large-scale data that keeps growing. Frequently updated with emerging named entities. New data source for NEM – Over 70% of queries contain named entities. – Rich context for determining the classes of entities. Table (Click-Through Data): each query is listed with its clicked sites and their click frequencies: Query_1: (Site_11, Freq_11), (Site_12, Freq_12), ……
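The table layout and the notion of query context described on this slide can be sketched in Python. This is a minimal illustration, not the authors' implementation: the sample records, the function names `build_click_through` and `query_context`, and the choice of `#` as the entity placeholder (borrowed from the "# trailer" notation used later in the talk) are all assumptions.

```python
from collections import defaultdict

# Hypothetical click-through records: (query, clicked site, click frequency).
records = [
    ("harry potter trailer", "imdb.com", 120),
    ("harry potter trailer", "movies.yahoo.com", 45),
    ("harry potter cheats", "cheats.ign.com", 60),
]

def build_click_through(records):
    """Group records as query -> {site: total click frequency}."""
    table = defaultdict(dict)
    for query, site, freq in records:
        table[query][site] = table[query].get(site, 0) + freq
    return dict(table)

def query_context(query, entity, placeholder="#"):
    """Replace the entity inside the query with a placeholder.

    Returns None when the entity does not occur in the query, or when the
    query is the bare entity and therefore carries no context.
    """
    q_tokens = query.lower().split()
    e_tokens = entity.lower().split()
    n = len(e_tokens)
    if n == 0:
        return None
    for i in range(len(q_tokens) - n + 1):
        if q_tokens[i:i + n] == e_tokens:
            rest = q_tokens[:i] + [placeholder] + q_tokens[i + n:]
            return " ".join(rest) if len(rest) > 1 else None
    return None

click_through = build_click_through(records)
```

For example, `query_context("harry potter trailer", "Harry Potter")` yields the context `# trailer`, while the sites stored under that query in `click_through` are its click context.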
Latent Dirichlet Allocation: Deal with ambiguity in classes of named entities – Classes of named entities are ambiguous. Harry Potter: book, movie and game – Topic models (LDA): classes of named entities as topics. Figure (classes as topics): Movie topic: query contexts # trailer, # dvd, # movie; click contexts imdb.com, movies.yahoo.com, disney.go.com. Game topic: query contexts # cheats, # walkthrough, # game; click contexts gamespot.com, cheats.ign.com, gamefaqs.com. Example for Harry Potter (query, clicked site): harry potter trailer, imdb.com; harry potter dvd, movies.yahoo.com; harry potter cheats, cheats.ign.com; harry potter game, gamespot.com
Weakly Supervised Learning: Supervise LDA training with examples – LDA is an unsupervised model. Topics in LDA are latent and do not align with predefined semantic classes such as book, movie and game. – Human labels are inaccurate and partial. A label is a binary indicator rather than a proportion: it indicates that a named entity belongs to certain classes, but does not exclude the possibility that it belongs to other classes. – Weakly supervised LDA: supervise LDA training with partial labels
Weakly Supervised LDA Overview: Create a virtual document for each seed and train WS-LDA to obtain the query contexts and websites of each class. Find new named entities as well as their classes by using the obtained query contexts and clicked websites. Figure: seeds (e.g. Harry Potter) are matched against click-through data (harry potter book, harry potter cheats, harry potter trailer, …) to form one virtual document per seed (# book, # cheats, # trailer, …); the learned contexts and websites then yield newly discovered entities.
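The second step of the overview, finding new entities from learned query contexts, might look roughly like the sketch below. It covers only candidate extraction by pattern matching. Assigning a class to each candidate, which the talk does with the trained WS-LDA model, is not shown. The function names, the sample data, and the scoring by summed click frequency are assumptions made for illustration.

```python
from collections import Counter

def match_context(query, context, placeholder="#"):
    """Return the text that fills the placeholder if the query fits the
    context pattern (e.g. "# trailer"), otherwise None."""
    prefix, _, suffix = context.partition(placeholder)
    prefix, suffix = prefix.strip(), suffix.strip()
    q = query.lower().strip()
    if prefix:
        if not q.startswith(prefix + " "):
            return None
        q = q[len(prefix) + 1:]
    if suffix:
        if not q.endswith(" " + suffix):
            return None
        q = q[:-(len(suffix) + 1)]
    return q.strip() or None

def discover_candidates(click_through, contexts, seeds):
    """Count click-weighted occurrences of non-seed strings that appear
    in any of the given contexts."""
    known = {s.lower() for s in seeds}
    counts = Counter()
    for query, sites in click_through.items():
        for context in contexts:
            candidate = match_context(query, context)
            if candidate and candidate not in known:
                counts[candidate] += sum(sites.values())
    return counts

# Hypothetical data: query -> {site: click frequency}.
click_through = {
    "harry potter trailer": {"imdb.com": 120},
    "kung fu panda trailer": {"imdb.com": 80},
    "kung fu panda dvd": {"movies.yahoo.com": 20},
    "halo 3 cheats": {"cheats.ign.com": 50},
}
candidates = discover_candidates(
    click_through, contexts=["# trailer", "# dvd"], seeds=["Harry Potter"]
)
```

With the movie-like contexts above, the only non-seed string that surfaces is "kung fu panda"; "halo 3" is ignored because "# cheats" is not among the supplied contexts.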
Weakly Supervised LDA (cont.): LDA with two types of virtual words – w_1: query context – w_2: click context. Figure: a virtual document (# book, # cheats, # trailer, …).
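As a rough illustration of a virtual document with two word types, the sketch below collects, for one seed entity, the query contexts (w_1) and the clicked sites (w_2) from click-through data. The function names, the sample data, and the decision to weight each word by click frequency are assumptions, not details confirmed by the talk.

```python
from collections import Counter

def find_context(query, entity, placeholder="#"):
    """Return the query with the entity replaced by a placeholder, or None
    if the entity is absent or the query has no other words."""
    q_tokens = query.lower().split()
    e_tokens = entity.lower().split()
    n = len(e_tokens)
    if n == 0:
        return None
    for i in range(len(q_tokens) - n + 1):
        if q_tokens[i:i + n] == e_tokens:
            rest = q_tokens[:i] + [placeholder] + q_tokens[i + n:]
            return " ".join(rest) if len(rest) > 1 else None
    return None

def build_virtual_document(seed, click_through):
    """Collect the two kinds of virtual words for one seed entity."""
    w1 = Counter()  # query-context words, e.g. "# trailer"
    w2 = Counter()  # click-context words, i.e. clicked sites
    for query, sites in click_through.items():
        context = find_context(query, seed)
        if context is None:
            continue
        w1[context] += sum(sites.values())  # assumed click-frequency weighting
        for site, freq in sites.items():
            w2[site] += freq
    return {"w1": w1, "w2": w2}

# Hypothetical data: query -> {site: click frequency}.
click_through = {
    "harry potter trailer": {"imdb.com": 3},
    "harry potter cheats": {"cheats.ign.com": 2},
    "halo 3 cheats": {"gamefaqs.com": 5},
}
doc = build_virtual_document("Harry Potter", click_through)
```

Keeping the two word types in separate counters mirrors the slide's distinction between w_1 and w_2, so a model can use a separate word distribution for each type.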
Weakly Supervised LDA (cont.): Introduce weak supervision – Objective: LDA log likelihood + soft constraints – The LDA term is the LDA probability of the document; the soft-constraint term combines, for each class i, the document probability on the i-th class with the document binary label on the i-th class.
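The slide's formula did not survive, so the following is only one plausible reading of "LDA log likelihood + soft constraints". It assumes the constraint is the sum over classes of the binary label times the document's probability on that class, scaled by a hypothetical trade-off coefficient `lam`. The function name and the numbers are illustrative.

```python
def ws_lda_objective(log_likelihood, class_probs, labels, lam=1.0):
    """Per-document objective under the assumed form:
    log likelihood + lam * sum_i labels[i] * class_probs[i].

    class_probs[i]: document probability on the i-th class
    labels[i]:      binary label of the document on the i-th class
    lam:            hypothetical weight of the soft constraints
    """
    if len(class_probs) != len(labels):
        raise ValueError("class_probs and labels must have the same length")
    constraint = sum(y * p for y, p in zip(labels, class_probs))
    return log_likelihood + lam * constraint

# A seed labeled only with class 0 (e.g. movie), same likelihood in both cases.
aligned = ws_lda_objective(-10.0, [0.7, 0.2, 0.1], [1, 0, 0], lam=2.0)
misaligned = ws_lda_objective(-10.0, [0.1, 0.2, 0.7], [1, 0, 0], lam=2.0)
```

Under this reading the constraint is soft: a topic assignment that disagrees with the label is not forbidden, it simply scores lower (`misaligned < aligned`), and classes with label 0 are neither rewarded nor penalized, which matches the partial nature of the labels.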
Experimental Results: Dataset – Seed named entities: about 1,000 seeds for each class, and 3,767 unique named entities in total – Click-through data: 1.5 billion query-URL pairs, containing 240 million unique queries and 17 million unique URLs
Experimental Results (cont.): Top contexts and websites. Table columns: Movie Contexts, Game Contexts, Book Contexts, Music Contexts; Movie Websites, Game Websites, Book Websites, Music Websites.
Experimental Results (cont.) Accuracy of Mined Entities
Summary: Proposed click-through data as a new data source for NEM. Employed a topic model to deal with ambiguity in the classes of named entities. Devised weakly supervised LDA for modeling click-through data – Two types of virtual words – Weakly supervised learning introduced into LDA. Experiments on large-scale data verified the effectiveness of the proposed approach.
THANKS