Personal Name Classification in Web queries Dou Shen*, Toby Walker*, Zijian Zheng*, Qiang Yang**, Ying Li* *Microsoft Corporation ** Hong Kong University.

Slides:



Advertisements
Similar presentations
A Comparison of Implicit and Explicit Links for Web Page Classification Dou Shen 1 Jian-Tao Sun 2 Qiang Yang 1 Zheng Chen 2 1 Department of Computer Science.
Advertisements

CWS: A Comparative Web Search System Jian-Tao Sun, Xuanhui Wang, § Dou Shen Hua-Jun Zeng, Zheng Chen Microsoft Research Asia University of Illinois at.
Query Classification Using Asymmetrical Learning Zheng Zhu Birkbeck College, University of London.
Date: 2013/1/17 Author: Yang Liu, Ruihua Song, Yu Chen, Jian-Yun Nie and Ji-Rong Wen Source: SIGIR12 Advisor: Jia-ling Koh Speaker: Chen-Yu Huang Adaptive.
Document Summarization using Conditional Random Fields Dou Shen, Jian-Tao Sun, Hua Li, Qiang Yang, Zheng Chen IJCAI 2007 Hao-Chin Chang Department of Computer.
Personalized Query Classification Bin Cao, Qiang Yang, Derek Hao Hu, et al. Computer Science and Engineering Hong Kong UST.
Large-Scale Entity-Based Online Social Network Profile Linkage.
Implicit Queries for Vitor R. Carvalho (Joint work with Joshua Goodman, at Microsoft Research)
Political Party, Gender, and Age Classification Based on Political Blogs Michelle Hewlett and Elizabeth Lingg.
Learning to Cluster Web Search Results SIGIR 04. ABSTRACT Organizing Web search results into clusters facilitates users quick browsing through search.
Learning on Probabilistic Labels Peng Peng, Raymond Chi-wing Wong, Philip S. Yu CSE, HKUST 1.
Person Name Disambiguation by Bootstrapping Presenter: Lijie Zhang Advisor: Weining Zhang.
Transfer Learning for WiFi-based Indoor Localization
The use of unlabeled data to improve supervised learning for text summarization MR Amini, P Gallinari (SIGIR 2002) Slides prepared by Jon Elsas for the.
Context-Aware Query Classification Huanhuan Cao 1, Derek Hao Hu 2, Dou Shen 3, Daxin Jiang 4, Jian-Tao Sun 4, Enhong Chen 1 and Qiang Yang 2 1 University.
A Markov Random Field Model for Term Dependencies Donald Metzler and W. Bruce Croft University of Massachusetts, Amherst Center for Intelligent Information.
Author: Jason Weston et., al PANS Presented by Tie Wang Protein Ranking: From Local to global structure in protein similarity network.
OCFS: Optimal Orthogonal Centroid Feature Selection for Text Categorization Jun Yan, Ning Liu, Benyu Zhang, Shuicheng Yan, Zheng Chen, and Weiguo Fan et.
Language Model. Major role: Language Models help a speech recognizer figure out how likely a word sequence is, independent of the acoustics. A lot of.
Text Classification Using Stochastic Keyword Generation Cong Li, Ji-Rong Wen and Hang Li Microsoft Research Asia August 22nd, 2003.
Finding Advertising Keywords on Web Pages Scott Wen-tau YihJoshua Goodman Microsoft Research Vitor R. Carvalho Carnegie Mellon University.
Large-Scale Cost-sensitive Online Social Network Profile Linkage.
Wang, Z., et al. Presented by: Kayla Henneman October 27, 2014 WHO IS HERE: LOCATION AWARE FACE RECOGNITION.
Transfer Learning From Multiple Source Domains via Consensus Regularization Ping Luo, Fuzhen Zhuang, Hui Xiong, Yuhong Xiong, Qing He.
Temporal Event Map Construction For Event Search Qing Li Department of Computer Science City University of Hong Kong.
 Clustering of Web Documents Jinfeng Chen. Zhong Su, Qiang Yang, HongHiang Zhang, Xiaowei Xu and Yuhen Hu, Correlation- based Document Clustering using.
Automatically Identifying Localizable Queries Center for E-Business Technology Seoul National University Seoul, Korea Nam, Kwang-hyun Intelligent Database.
 An important problem in sponsored search advertising is keyword generation, which bridges the gap between the keywords bidded by advertisers and queried.
Exploiting Ontologies for Automatic Image Annotation M. Srikanth, J. Varner, M. Bowden, D. Moldovan Language Computer Corporation
1 Cross-Lingual Query Suggestion Using Query Logs of Different Languages SIGIR 07.
AnswerBus Question Answering System Zhiping Zheng School of Information, University of Michigan HLT 2002.
Detecting Semantic Cloaking on the Web Baoning Wu and Brian D. Davison Lehigh University, USA WWW 2006.
Ruirui Li, Ben Kao, Bin Bi, Reynold Cheng, Eric Lo Speaker: Ruirui Li 1 The University of Hong Kong.
A search-based Chinese Word Segmentation Method ——WWW 2007 Xin-Jing Wang: IBM China Wen Liu: Huazhong Univ. China Yong Qin: IBM China.
A Probabilistic Graphical Model for Joint Answer Ranking in Question Answering Jeongwoo Ko, Luo Si, Eric Nyberg (SIGIR ’ 07) Speaker: Cho, Chin Wei Advisor:
Enron Corpus: A New Dataset for Classification By Bryan Klimt and Yiming Yang CEAS 2004 Presented by Will Lee.
Probabilistic Query Expansion Using Query Logs Hang Cui Tianjin University, China Ji-Rong Wen Microsoft Research Asia, China Jian-Yun Nie University of.
Multi-Task Learning for Boosting with Application to Web Search Ranking Olivier Chapelle et al. Presenter: Wei Cheng.
Adding Semantics to Clustering Hua Li, Dou Shen, Benyu Zhang, Zheng Chen, Qiang Yang Microsoft Research Asia, Beijing, P.R.China Department of Computer.
Binxing Jiao et. al (SIGIR ’10) Presenter : Lin, Yi-Jhen Advisor: Dr. Koh. Jia-ling Date: 2011/4/25 VISUAL SUMMARIZATION OF WEB PAGES.
Opinion Holders in Opinion Text from Online Newspapers Youngho Kim, Yuchul Jung and Sung-Hyon Myaeng Reporter: Chia-Ying Lee Advisor: Prof. Hsin-Hsi Chen.
Exploiting Context Analysis for Combining Multiple Entity Resolution Systems -Ramu Bandaru Zhaoqi Chen Dmitri V.kalashnikov Sharad Mehrotra.
Wikipedia as Sense Inventory to Improve Diversity in Web Search Results Celina SantamariaJulio GonzaloJavier Artiles nlp.uned.es UNED,c/Juan del Rosal,
A Novel Pattern Learning Method for Open Domain Question Answering IJCNLP 2004 Yongping Du, Xuanjing Huang, Xin Li, Lide Wu.
Personalizing Web Search using Long Term Browsing History Nicolaas Matthijs, Cambridge Filip Radlinski, Microsoft In Proceedings of WSDM
21/11/20151Gianluca Demartini Ranking Clusters for Web Search Gianluca Demartini Paul–Alexandru Chirita Ingo Brunkhorst Wolfgang Nejdl L3S Info Lunch Hannover,
Date : 2013/03/18 Author : Jeffrey Pound, Alexander K. Hudek, Ihab F. Ilyas, Grant Weddell Source : CIKM’12 Speaker : Er-Gang Liu Advisor : Prof. Jia-Ling.
GrammAds: Keyword and Ad Creative Generator for Online Advertising Campaigns Author : Stamatina Thomaidou, Konstantinos Leymonis, and Michalis Vazirgiannis.
1 A Web Search Engine-Based Approach to Measure Semantic Similarity between Words Presenter: Guan-Yu Chen IEEE Trans. on Knowledge & Data Engineering,
CoCQA : Co-Training Over Questions and Answers with an Application to Predicting Question Subjectivity Orientation Baoli Li, Yandong Liu, and Eugene Agichtein.
Date: 2015/11/19 Author: Reza Zafarani, Huan Liu Source: CIKM '15
Exploring in the Weblog Space by Detecting Informative and Affective Articles Xiaochuan Ni, Gui-Rong Xue, Xiao Ling, Yong Yu Shanghai Jiao-Tong University.
School of Computer Science 1 Information Extraction with HMM Structures Learned by Stochastic Optimization Dayne Freitag and Andrew McCallum Presented.
Comparative Experiments on Sentiment Classification for Online Product Reviews Hang Cui, Vibhu Mittal, and Mayur Datar AAAI 2006.
1 Overview of Information Retrieval and our Solutions Qiang Yang Department of Computer Science and Engineering The Hong Kong University of Science and.
Mining Dependency Relations for Query Expansion in Passage Retrieval Renxu Sun, Chai-Huat Ong, Tat-Seng Chua National University of Singapore SIGIR2006.
Date: 2012/11/29 Author: Chen Wang, Keping Bi, Yunhua Hu, Hang Li, Guihong Cao Source: WSDM’12 Advisor: Jia-ling, Koh Speaker: Shun-Chen, Cheng.
11 A Classification-based Approach to Question Routing in Community Question Answering Tom Chao Zhou 1, Michael R. Lyu 1, Irwin King 1,2 1 The Chinese.
Personalization Services in CADAL Zhang yin Zhuang Yuting Wu Jiangqin College of Computer Science, Zhejiang University November 19,2006.
Identifying “Best Bet” Web Search Results by Mining Past User Behavior Author: Eugene Agichtein, Zijian Zheng (Microsoft Research) Source: KDD2006 Reporter:
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Text Classification Improved through Multigram Models.
Predicting Short-Term Interests Using Activity-Based Search Context CIKM’10 Advisor: Jia Ling, Koh Speaker: Yu Cheng, Hsieh.
To Personalize or Not to Personalize: Modeling Queries with Variation in User Intent Presented by Jaime Teevan, Susan T. Dumais, Daniel J. Liebling Microsoft.
UIC at TREC 2006: Blog Track Wei Zhang Clement Yu Department of Computer Science University of Illinois at Chicago.
A Simple Approach for Author Profiling in MapReduce
CRF &SVM in Medication Extraction
CIKM Competition 2014 Second Place Solution
CIKM Competition 2014 Second Place Solution
Biased Random Walk based Social Regularization for Word Embeddings
Socialized Word Embeddings
Presentation transcript:

Personal Name Classification in Web queries Dou Shen*, Toby Walker*, Zijian Zheng*, Qiang Yang**, Ying Li* *Microsoft Corporation ** Hong Kong University of Science and Technology

Introduction Goal: To detect whether a Web query is a personal name, without referring to any other context information; Motivation 2~4% of daily Web queries are personal names ~6% of jumping queries are personal names Users tend to test a search engine by their names Applications Paid Search “toby walker”  “Wheeled Walkers Sale $89.”Wheeled Walkers Sale $89. Query Suggestion Show the profile-related information once a query is determined as a personal name X

Related Work Improve Personal Name Search Dozier studied some specific strategies for personal name search [1] Wan et al. studied a person resolution system to improve people search performance [2] Personal Name Extraction/Recognition As a special case of named entities, personal name extraction from document/webpage/ s has been widely studied recently [3, 4, 5] Web Query Enrichment Use click-trough data [6] User Web Search [7]

Overview of Our Solution Offline Training Online Classifier

Offline Training

Terminology Candidate Dictionaries Term Context: “…toby walker…”; “…city of walker…” Name Term Context: “…toby walker…” Name Context: “…Dr. Qiang Yang’s student…” Probability Estimation Methods Relative Frequency Context Probability Co-Occurrence in Search Snippets Co-Occurrence in Bigrams

Probability Estimation Methods (1) Relative Frequency Get a set of names Get the relative frequency of each term Context Probability Assumption: if a term is name term, its term contexts should be name contexts Train a unigram model over some name contexts Given a term, get its term contexts through search engines and calculate the probability of term contexts using the trained model

Probability Estimation Methods (2) Co-Occurrence in Search Snippets (S-CoOcc) Get term contexts through search engines Identify name term contexts using some rules followed by a last name term, such as “john smith”; followed by a first name term and then a last name term, such as “John Maynard Smith”; followed by a special kind of verbs such as “did, said, announced, claimed…” as in “John said”; …… Estimate the term probability as the ratio between name term contexts and term contexts Golden Dictionaries

Probability Estimation Methods (3) Co-Occurrence in Bigrams (B-coOcc) Get term contexts from a bigram file Estimate the term probability as in S-coOcc

Online Classifier (1) Grammar matching Geometric average Tricks p(title) = 1; p(suffix)=1 [ ] [ ] [ ] q = [t 1 ] t 2 [t 3 ] t 4 [t 5 ]

Online Classifier (2) Personal Name Grammars ::= [ ] [ ] [ ] ::= dr | doctor | ms, … ::= sr.| jr.| III … ::= | - ; ::= | ; ::= | - ;

Experiments Data Sets Dictionaries Name Context: Top 200, 000 personal names from WP 10,000,000 name contexts Term Context 2,000,000 candidate terms 80,000,000 term contexts Testing Data Sets Validation dataset: 2,000 queries, 81 names Test dataset: 10,000 queries, 232 names

Baselines Dictionary Look-up Supervised Methods Classifiers SVM, Logistic Regression Features f 1 : the length of the query; f 2 : whether the query contains a title term; f 3 : whether the query contains a suffix term; f 4 : the probability of a term being a first name term f 5 : the probability of a term being generated by a character level bigram model trained on first-name terms;

Experiment Results (1) Comparison among our method and the baselines Our Methods Dictionary Look-up Supervised Methods

Experiment Results (2) Comparison of different ways of constructing probabilistic dictionaries. Candidate Dictionaries: WP Golden Dictionaries: CENSUS

Experiment Results (3) Effect of Golden Dictionaries and Candidate Dictionaries

Experiment Results (4) Effect of Enlarging Golden Dictionaries Golden dictionaries: CENSUS and its expansions Candidate dictionaries: WP S-coOcc

Experiment Results (5) Effect of Enlarging Candidate Dictionaries Gold dictionaries: CENSUS Candidate dictionaries: CENSUS and its expansions B-coOcc

Conclusion and Future Work Conclusion Put forward an easy but effective method for personal name classification in Web queries Exploit the methods of enlarging golden dictionaries and candidate dictionaries. Future work Instead of using rules, try to define name term contexts using existing named entity recognition algorithms Validate the contribution of personal name classification to Web Search and Advertising Validate the classifier in non-US names

References [1] C. Dozier. Assigning belief scores to names in queries. HLT '01. [2] X. Wan, J. Gao, M. Li, and B. Ding. Person resolution in person search results: Webhawk. CIKM '05 [3] Z. Chen, L. Wenyin, and F. Zhang. A new statistical approach to personal name extraction. ICML '02: [4] H. L. Chieu and H. T. Ng. Named entity recognition with a maximum entropy approach. HLT-NAACL ‘03 [5] V. Krishnan and C. D. Manning. An effective two-stage model for exploiting non-local dependencies in named entity recognition. ACL- COLLING ’06 [6] J.-R. Wen, J.-Y. Nie, and H.-J. Zhang. Clustering user queries of a search engine. WWW ‘01 [7] D. Shen, R. Pan, J.-T. Sun, J. J. Pan, K. Wu, J. Yin, and Q. Yang. Query enrichment for web-query classification. TOIS ‘06.

Thanks!