Person Name Disambiguation by Bootstrapping (SIGIR'10) Yoshida M., Ikeda M., Ono S., Sato I., Nakagawa H. Supervisor: Koh Jia-Ling Presenter: Nonhlanhla Shongwe

Preview
- Introduction
- Two-stage clustering algorithm
  - First-stage clustering
  - Second-stage clustering
- Experiments
  - Evaluation measures
  - Results
- Conclusion

Introduction
- World Wide Web (WWW)
  - Used for learning about real-world entities, e.g., people
- Problem
  - Ambiguity in names
  - This causes search results to contain web pages about several different entities


Introduction (cont'd)
- First typical approach
  - Define similarities between documents based on features extracted from the documents
  - Cluster the web pages returned by a search engine for a person-name query using this similarity
  - Use named entities (NEs)

Introduction (cont'd)
- Named entities (NEs)
  - Good features for distinguishing people
  - Represent real-world concepts related to the people
  - e.g., Microsoft is related to Bill Gates
- Compound key words (CKWs) and URLs
  - Have similar properties to NEs
  - e.g., the CKW "chief software architect" indicates Bill Gates
- Challenge
  - These features show high precision but low recall

Introduction (cont'd)
- Second typical approach
  - Use weak features to calculate document similarities
- Two categories of features represent a document
  - Strong features, e.g., NEs, CKWs, URLs
    - Able to clearly distinguish between clusters
  - Weak features, e.g., single words
    - Unable to clearly distinguish between clusters
- Challenge
  - Improves recall but worsens precision

Two-stage clustering algorithm
- First-stage clustering
  - Use strong features (e.g., NEs, CKWs, URLs) to represent documents
- Second-stage clustering
  - Use bootstrapping to extract instances of weak features (e.g., words)
- For example
  - A computer scientist and a baseball player share the same name
  - Computer scientist -> "memory" and "algorithm"
  - Baseball player -> "ball" and "batting"

First-stage clustering algorithm
- Extract three types of strong features
  - Named entities
  - Compound key words
  - URLs
- Named entity features
  - Person names (used as is)
  - Organization names (filtered out)
  - Place names (filtered out)

First-stage clustering algorithm (cont'd)
- Compound key word features
  - Calculate an importance score for compound words in a document
  - CW denotes a compound word: CW = (W_1, W_2, ..., W_L), where each W_i is a simple noun
  - f(CW) = number of independent occurrences of CW in a document
  - "Independent" means CW is not part of a longer compound word

First-stage clustering algorithm (cont'd)
- Example
  - "Data mining is the process of extracting patterns from data. Data mining is seen as an increasingly important tool by modern business to transform data into business intelligence giving an informational advantage"
  - f(CW) for the compound words "data" and "data mining":
  - f(data) = 2 (the two occurrences of "data" inside "data mining" are not independent and are not counted)
  - f(data mining) = 2
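The independent-occurrence count above can be sketched as follows (a minimal illustration, not the authors' implementation; the text and compound-word list come from the slide's example):

```python
import re

text = ("Data mining is the process of extracting patterns from data. "
        "Data mining is seen as an increasingly important tool by modern "
        "business to transform data into business intelligence giving an "
        "informational advantage")

def f(cw, longer_cws, text):
    """Count occurrences of cw that are not part of any longer compound word."""
    count = 0
    for m in re.finditer(re.escape(cw), text, flags=re.IGNORECASE):
        # Skip this match if it falls inside a match of a longer compound word.
        inside = any(
            any(m.start() >= lm.start() and m.end() <= lm.end()
                for lm in re.finditer(re.escape(longer), text, flags=re.IGNORECASE))
            for longer in longer_cws)
        if not inside:
            count += 1
    return count

print(f("data mining", [], text))        # 2
print(f("data", ["data mining"], text))  # 2
```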

First-stage clustering algorithm (cont'd)
- Compound key word features
  - Score(CW) = f(CW) × LR(CW)
  - LR(CW) = ( Π_{i=1..L} (LN(W_i) + 1)(RN(W_i) + 1) )^(1/2L)
  - where, for each simple noun W_i:
    - LN(W_i) = frequency of nouns that precede W_i
    - RN(W_i) = frequency of nouns that succeed W_i

First-stage clustering algorithm (cont'd)
- Compound key word features: example for CW = "information security"
  - f(information) = 9, LN(information) = 3, RN(information) = 6
  - f(security) = 3, LN(security) = 3, RN(security) = 2
  - LR(information security) = {(3+1)(6+1) × (3+1)(2+1)}^(1/4) = (28 × 12)^(1/4) ≈ 4.28
  - f(information security) = 3
  - Score(information security) = 3 × 4.28 ≈ 12.84
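A small sketch reproducing the slide's numbers, using Score(CW) = f(CW) × LR(CW) with LR(CW) = (Π (LN(W_i)+1)(RN(W_i)+1))^(1/2L); the exact form of the formula is inferred from the worked example:

```python
def lr(ln_rn_pairs):
    """LR(CW) = (prod of (LN(Wi)+1)(RN(Wi)+1))^(1/(2L))."""
    prod = 1.0
    for ln, rn in ln_rn_pairs:
        prod *= (ln + 1) * (rn + 1)
    L = len(ln_rn_pairs)
    return prod ** (1.0 / (2 * L))

def score(f_cw, ln_rn_pairs):
    """Score(CW) = f(CW) * LR(CW)."""
    return f_cw * lr(ln_rn_pairs)

# (LN, RN) for "information" and "security", and f(CW) = 3, from the slide.
pairs = [(3, 6), (3, 2)]
print(round(lr(pairs), 2))        # 4.28
print(round(score(3, pairs), 2))  # 12.84
```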

First-stage clustering algorithm (cont'd)
- Link features
  - Extract URLs (in <a> tags) from each document and use them as link features
  - Include the URL of the document itself
  - Filter out stop words

First-stage clustering algorithm (cont'd)
- Document similarity
  - How to calculate document similarities from the extracted features
  - f_x and f_y are the sets of features extracted from documents d_x and d_y
  - θ_overlap is a threshold value that avoids too-small denominators
  - Calculate similarities by
    - NEs (sim_NE)
    - CKWs (sim_CKW)
  - Calculate similarity of URLs (sim_URL)

First-stage clustering algorithm (cont'd)

        Document 1                    Document 2                     Document 3
NE:     Steve Jobs, Pixar, Disney,   Bill Gates, Microsoft          George Bush, White House
        Apple, Macintosh
CKW:    American business magnate,   American business magnate,     President, CEO, oil business,
        CEO, computer                Chief Software Architect,      republican
                                     computer, CEO
URL:    fastcompany.com              Wikipedia.com

            Overlap(D1, D2)   Overlap(D1, D3)   Overlap(D2, D3)
Sim(NE)          0                 0                 0
Sim(CKW)     3/4 = 0.75        1/4 = 0.25        1/4 = 0.25
Sim(URL)         1                0.25              0.25
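One plausible form of the overlap similarity that reproduces the table's CKW values (the exact denominator and the value θ_overlap = 4 are assumptions, not taken from the paper):

```python
def overlap_sim(fx, fy, theta=4):
    """sim = |fx & fy| / max(min(|fx|, |fy|), theta); theta avoids tiny denominators."""
    inter = len(set(fx) & set(fy))
    denom = max(min(len(set(fx)), len(set(fy))), theta)
    return inter / denom

# CKW feature sets from the example table above.
ckw1 = {"american business magnate", "ceo", "computer"}
ckw2 = {"american business magnate", "chief software architect", "computer", "ceo"}
ckw3 = {"president", "ceo", "oil business", "republican"}

print(overlap_sim(ckw1, ckw2))  # 0.75
print(overlap_sim(ckw1, ckw3))  # 0.25
print(overlap_sim(ckw2, ckw3))  # 0.25
```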

First-stage clustering algorithm (cont'd)
- Calculation of merged similarities
  - Different similarity scores are calculated for the different types of named entities and merged with per-type weights:
    - Person names: α_P = 0.78
    - Location names: α_L = 0.16
    - Organization names: α_O =

First-stage clustering algorithm (cont'd)
- Merging the different similarities: sim_NE, sim_CKW, and sim_URL
  - Calculated by taking the maximum of the given similarity values
  - Sim_max(D1, D2) = max(0, 0.75, 1) = 1
  - Sim_max(D1, D3) = max(0, 0.25, 0.25) = 0.25
  - Sim_max(D2, D3) = max(0, 0.25, 0.25) = 0.25

First-stage clustering algorithm (cont'd)
- Clustering algorithm
  - Uses the standard hierarchical agglomerative clustering (HAC) algorithm
  - Starts with one-in-one clusters and merges the most-similar cluster pairs
  - Uses the average-link clustering approach
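A minimal average-link HAC sketch over a pairwise similarity matrix (the similarity values and the stopping threshold are toy assumptions):

```python
def hac_average(sim, threshold):
    """Merge the most-similar cluster pair until no pair's average similarity
    reaches the threshold; returns clusters as lists of document indices."""
    clusters = [[i] for i in range(len(sim))]

    def avg_sim(a, b):
        # Average-link: mean pairwise similarity between the two clusters.
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best, pair = -1.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = avg_sim(clusters[x], clusters[y])
                if s > best:
                    best, pair = s, (x, y)
        if best < threshold:
            break
        x, y = pair
        clusters[x] += clusters.pop(y)
    return clusters

# Toy similarities: documents 0 and 1 are similar; document 2 is not.
sim = [[1.0, 0.8, 0.1],
       [0.8, 1.0, 0.2],
       [0.1, 0.2, 1.0]]
print(hac_average(sim, threshold=0.5))  # [[0, 1], [2]]
```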

Second-stage clustering algorithm
- Uses bootstrapping
  - Used to extract a set of instances (e.g., seed country names find the pattern "*'s prime minister")
  - Espresso was used to find word pairs in documents that stand in a given relation
  - Uses a HITS-like matrix multiplication


Second-stage clustering algorithm (cont'd)
- Algorithm (STEP 1)
  - A feature-document matrix P(f, d) is generated, with rows f1–f4 and columns d1–d4, along with the initial document-cluster matrix from the first-stage clusters:

          C1  C2
      d1   1   0
      d2   1   0
      d3   0   1
      d4   0   1

Second-stage clustering algorithm (cont'd)
- Algorithm (STEP 2)
  - A feature-cluster matrix is calculated from P and the current document-cluster matrix:

      (feature-cluster matrix) = P × (document-cluster matrix)

  - Each entry scores how strongly feature f_i is associated with cluster C_j

Second-stage clustering algorithm (cont'd)
- Algorithm (STEP 3)
  - A document-cluster matrix is calculated from P^T and the current feature-cluster matrix:

      (document-cluster matrix) = P^T × (feature-cluster matrix)

  - Each entry scores how strongly document d_i is associated with cluster C_j

Second-stage clustering algorithm (cont'd)
- Algorithm (STEP 4)
  - The resulting document-cluster matrix gives the final scores of d1–d4 for clusters C1 and C2
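The four steps can be sketched as a HITS-like iteration (the 0/1 values of P are hypothetical, and the row normalization is an assumption; the slides do not specify a scheme):

```python
def matmul(A, B):
    """Plain-Python matrix product of A (m x k) and B (k x n)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(row) for row in zip(*A)]

# STEP 1: feature-document matrix P (rows f1..f4, columns d1..d4)
# and the initial document-cluster matrix D from the first stage.
P = [[1, 1, 0, 0],   # f1 occurs in d1, d2
     [1, 0, 0, 0],   # f2 occurs in d1
     [0, 0, 1, 1],   # f3 occurs in d3, d4
     [0, 0, 0, 1]]   # f4 occurs in d4
D = [[1, 0], [1, 0], [0, 1], [0, 1]]  # d1, d2 -> C1; d3, d4 -> C2

for _ in range(10):
    F = matmul(P, D)                  # STEP 2: feature-cluster scores
    D_new = matmul(transpose(P), F)   # STEP 3: document-cluster scores
    # Normalize rows so the scores stay bounded across iterations.
    D = [[v / sum(row) for v in row] for row in D_new]

# STEP 4: read off each document's highest-scoring cluster.
assignments = [row.index(max(row)) for row in D]
print(assignments)  # [0, 0, 1, 1]
```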

Experiments
- Evaluation measures
  - L(e) = correct class of element e
  - C(e) = machine-predicted class of element e
  - Extended B-Cubed precision (BEP) and recall (BER)
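For intuition, a sketch of the standard single-label B-Cubed measures, which the extended variant generalizes (the mappings L and C below are hypothetical toy data):

```python
def bcubed(L, C):
    """Standard B-Cubed precision/recall; L maps each element to its correct
    class, C to its machine-predicted cluster."""
    items = list(L)

    def prec(e):
        # Fraction of e's cluster that shares e's correct class.
        cluster = [x for x in items if C[x] == C[e]]
        return sum(L[x] == L[e] for x in cluster) / len(cluster)

    def rec(e):
        # Fraction of e's correct class that landed in e's cluster.
        klass = [x for x in items if L[x] == L[e]]
        return sum(C[x] == C[e] for x in klass) / len(klass)

    n = len(items)
    return (sum(prec(e) for e in items) / n,
            sum(rec(e) for e in items) / n)

L = {"d1": "a", "d2": "a", "d3": "b", "d4": "b"}
C = {"d1": 0, "d2": 0, "d3": 0, "d4": 1}
p, r = bcubed(L, C)
print(round(p, 2), round(r, 2))  # 0.67 0.75
```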

Experiments (cont'd)
- Baselines
  - All-in-one: all documents belong to one cluster
  - One-in-one: each document is a size-one cluster
  - Combined: a mixture of the two
- Second-stage clustering variants
  - TOPIC: estimates latent topics using a probabilistic generative model
  - CKW: extracts CKWs from the resulting first-stage clusters
  - Bootstrap: uses unigram and bigram features (excluding stop words); TF-IDF scores were used as the weights for each feature

Conclusion
- Introduced the problem caused by name ambiguity
- Explained a two-stage clustering method
  - First-stage clustering using HAC
  - Second-stage clustering using bootstrapping
- Presented results showing that this method outperforms previously proposed methods

Thank you for listening