Named Entity Mining From Click-Through Data Using Weakly Supervised LDA
Gu Xu (1), Shuang-Hong Yang (1,2), Hang Li (1)
(1) Microsoft Research Asia, China
(2) College of Computing, Georgia Tech, USA

Talk Outline
Named Entity Mining
– Exploiting click-through data
– Applying Latent Dirichlet Allocation
– Developing a weakly supervised learning approach
Weakly Supervised LDA
Experimental Results
Summary

Named Entity Mining
Named Entity Mining (NEM)
– To mine the information of named entities of a class from a large amount of data.
– Example: mine movie titles from a textual data collection
– Applications: Web search, etc.
Three Challenges
– Suitable data source for NEM → Click-through Data
– Ambiguity in classes of named entities → LDA (Topic Model)
– Supervision from human knowledge → Weakly Supervised Learning

Click-through Data
Query context
– [movie] trailer, [game] cheats
Click context
– imdb.com for movies, gamespot.com for games
– Wisdom of crowds
Very large-scale data that keeps growing
Frequently updated with emerging named entities
New data source for NEM
– Over 70% of queries contain named entities.
– Rich context for determining the classes of entities.

Click-Through Data
Query      Clicked site   Frequency
Query_1    Site_11        Freq_11
           Site_12        Freq_12
           …              …
Query_…    …              …
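The click-through table can be held in memory as a nested mapping from query to clicked sites. The following is a minimal sketch, assuming tab-separated records of the form query, site, frequency; the record layout, the function name `load_click_through`, and the sample rows are illustrative assumptions, not details from the talk.

```python
from collections import defaultdict


def load_click_through(lines):
    """Aggregate 'query<TAB>site<TAB>freq' records into {query: {site: freq}}."""
    table = defaultdict(lambda: defaultdict(int))
    for line in lines:
        query, site, freq = line.rstrip("\n").split("\t")
        # Repeated (query, site) pairs are summed into one frequency.
        table[query][site] += int(freq)
    return {query: dict(sites) for query, sites in table.items()}


# Hypothetical sample rows, for illustration only.
sample = [
    "harry potter trailer\timdb.com\t120",
    "harry potter trailer\tmovies.yahoo.com\t45",
    "harry potter cheats\tcheats.ign.com\t80",
]
click_through = load_click_through(sample)
```

Each query then carries both kinds of evidence described on the slide: its own wording (query context) and the sites users clicked (click context).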

Latent Dirichlet Allocation
Deal with ambiguity in classes of named entities
– Classes of named entities are ambiguous.
  Harry Potter: Book, Movie and Game
– Topic models (LDA)
  Classes of named entities as topics

Topic    Query context                   Click context
Movie    # trailer, # dvd, # movie       imdb.com, movies.yahoo.com, disney.go.com
Game     # cheats, # walkthrough, # game gamespot.com, cheats.ign.com, gamefaqs.com

Example: Harry Potter
– harry potter trailer → imdb.com
– harry potter dvd → movies.yahoo.com
– harry potter cheats → cheats.ign.com
– harry potter game → gamespot.com

Weakly Supervised Learning
Supervise LDA training with examples
– LDA is an unsupervised model.
  Topics in LDA are latent and do not align with predefined semantic classes such as book, movie and game.
– Human labels are inaccurate and partial.
  Labels are binary indicators rather than proportions.
  Labels only indicate that a named entity belongs to certain classes; they do not exclude the possibility that it belongs to other classes.
– Weakly supervised LDA
  Supervise LDA training with partial labels

Weakly Supervised LDA Overview
Create a virtual document for each seed and train WS-LDA
– Yields contexts and websites
Find new named entities as well as their classes by using the obtained query contexts and clicked websites
– Yields newly discovered entities
[Diagram: seeds (e.g. Harry Potter) and click-through data (harry potter book, harry potter cheats, harry potter trailer, …) are combined into a virtual document (# book, # cheats, # trailer, …)]
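The second step, finding new entities from learned contexts, can be illustrated with simple pattern matching. This is a minimal sketch under stated assumptions: contexts are written with `#` as the entity placeholder, matching is done on whitespace tokens, and the helper names `match_context` and `extract_candidates` are hypothetical. The talk does not describe how candidates are scored, so no ranking is attempted here.

```python
def match_context(query, context):
    """Return the text that fills '#' if the query fits the context, else None."""
    prefix, _, suffix = context.partition("#")
    prefix, suffix = prefix.split(), suffix.split()
    tokens = query.split()
    # At least one token must remain to fill the placeholder.
    if len(tokens) - len(prefix) - len(suffix) <= 0:
        return None
    if tokens[:len(prefix)] != prefix:
        return None
    if suffix and tokens[-len(suffix):] != suffix:
        return None
    return " ".join(tokens[len(prefix):len(tokens) - len(suffix)])


def extract_candidates(queries, contexts):
    """Map each candidate entity to the set of contexts it was seen with."""
    candidates = {}
    for query in queries:
        for context in contexts:
            entity = match_context(query, context)
            if entity is not None:
                candidates.setdefault(entity, set()).add(context)
    return candidates


# Hypothetical queries and contexts, for illustration only.
found = extract_candidates(
    ["kung fu panda trailer", "kung fu panda dvd", "halo cheats", "trailer"],
    ["# trailer", "# dvd", "# cheats"],
)
```

In this toy run, "kung fu panda" is seen with two movie-like contexts and "halo" with one game-like context, while the bare query "trailer" yields no candidate.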

Weakly Supervised LDA (cont.)
LDA with two types of virtual words
– w1: Query context
– w2: Click context
[Diagram: virtual document containing # book, # cheats, # trailer, …]
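A virtual document with the two word types might be assembled as follows. This is a minimal sketch, assuming click-through data is available as a `{query: {site: frequency}}` mapping and that each word is weighted by click frequency; the function name `build_virtual_document`, the weighting, and the sample data are assumptions for illustration rather than the authors' implementation.

```python
from collections import Counter


def build_virtual_document(seed, click_through):
    """Return (query_context_counts, click_context_counts) for one seed entity."""
    query_contexts = Counter()  # w1: query with the seed replaced by '#'
    click_contexts = Counter()  # w2: sites clicked for queries containing the seed
    seed_tokens = seed.lower().split()
    k = len(seed_tokens)
    for query, sites in click_through.items():
        tokens = query.lower().split()
        for i in range(len(tokens) - k + 1):
            if tokens[i:i + k] == seed_tokens:
                context = " ".join(tokens[:i] + ["#"] + tokens[i + k:])
                query_contexts[context] += sum(sites.values())
                click_contexts.update(sites)  # adds the per-site frequencies
                break  # use only the first occurrence of the seed in a query
    return query_contexts, click_contexts


# Hypothetical click-through data, for illustration only.
click_through = {
    "harry potter trailer": {"imdb.com": 120, "movies.yahoo.com": 45},
    "harry potter cheats": {"cheats.ign.com": 80},
    "star wars game": {"gamespot.com": 10},
}
w1, w2 = build_virtual_document("Harry Potter", click_through)
```

Keeping w1 and w2 in separate counters mirrors the slide's distinction between the two word types, so a topic model can use a separate word distribution for each.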

Weakly Supervised LDA (cont.)
Introduce weak supervision
– Objective: LDA log likelihood + soft constraints
– Soft constraints
[Equation labels: LDA probability; soft constraints; document probability on i-th class; document binary label on i-th class]
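Only the labels of the slide's formula remain in this transcript. One form consistent with those labels is sketched below; it is an assumption, not a transcription of the slide. The symbols $\lambda$ (trade-off weight), $\alpha, \beta$ (LDA hyperparameters and topic-word distributions), and $\bar{z}_i$ (the document's probability on the $i$-th class, for example the fraction of its words assigned to topic $i$) are introduced here for illustration.

```latex
\begin{align}
O(\mathbf{w}, \mathbf{y})
  &= \underbrace{\log p(\mathbf{w} \mid \alpha, \beta)}_{\text{LDA log likelihood}}
   + \lambda \, \underbrace{C(\mathbf{y}, \bar{\mathbf{z}})}_{\text{soft constraints}}, \\
C(\mathbf{y}, \bar{\mathbf{z}})
  &= \sum_{i} y_i \, \bar{z}_i ,
  \qquad y_i \in \{0, 1\}, \quad \lambda \ge 0 .
\end{align}
```

Under this form, a document is rewarded for placing probability on classes it is labeled with ($y_i = 1$) and is not penalized for mass on unlabeled classes, which matches the point that labels do not exclude membership in other classes.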

Experimental Results
Dataset
– Seed named entities
  About 1,000 seeds for each class, and 3,767 unique named entities in total
– Click-through data
  1.5 billion query-URL pairs, containing 240 million unique queries and 17 million unique URLs

Experimental Results (cont.)
Top contexts and websites
[Table: top contexts and top websites for the Movie, Game, Book and Music classes]

Experimental Results (cont.)
Accuracy of mined entities

Summary
Proposed click-through data as a new data source for NEM
Employed a topic model to deal with ambiguity in classes of named entities
Devised weakly supervised LDA for modeling click-through data
– Two types of virtual words
– Weakly supervised learning introduced into LDA
Experiments on large-scale data verified the effectiveness of the proposed approach

THANKS