Person Name Disambiguation by Bootstrapping Presenter: Lijie Zhang Advisor: Weining Zhang.


Outline
- Introduction
- Motivation
- Two-stage Clustering Algorithm
- Experiments

People Name Disambiguation
- Given a target name (query $q$), a search engine returns a set of web pages $P = \{d_1, d_2, \ldots, d_n\}$.
- Task: cluster the web pages in $P$ such that each cluster refers to a single person.

Example: People Name Disambiguation

People Name Disambiguation
A typical solution:
- Extract a set of features from each document returned by the search engine.
- Cluster the documents using a similarity metric over the feature sets.
Two types of features:
- Strong features, such as named entities (NEs), compound key words (CKWs), and URLs. They have a very strong ability to distinguish between clusters.
  - NE: Paul Allen, Microsoft (indicate the person Bill Gates)
  - CKW: chief software architect (a concept strongly related to Bill Gates)
- Weak features: single words

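People Name Disambiguation
Evaluation metric: F measure
- Treat each cluster as if it were the result of a query and each class as if it were the desired set of documents for that query.
- For class $i$ and cluster $j$, where $n_{ij}$ is the number of documents of class $i$ in cluster $j$, $n_i$ is the size of class $i$, and $n_j$ is the size of cluster $j$:
  $\mathrm{Recall}(i, j) = n_{ij} / n_i$, $\mathrm{Precision}(i, j) = n_{ij} / n_j$
  $F(i, j) = \dfrac{2 \cdot \mathrm{Recall}(i, j) \cdot \mathrm{Precision}(i, j)}{\mathrm{Precision}(i, j) + \mathrm{Recall}(i, j)}$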
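The per-pair F measure above can be sketched in a few lines of Python. This is only an illustration: the function name `f_measure_table`, the input format (two parallel lists of gold classes and predicted clusters), and the toy data are assumptions, not part of the presentation.

```python
from collections import Counter

def f_measure_table(classes, clusters):
    """Return {(i, j): F(i, j)} for every class i and cluster j that share documents.

    classes[k] and clusters[k] are the gold class and the predicted cluster
    of document k (hypothetical input format).
    """
    n_i = Counter(classes)                   # size of each class
    n_j = Counter(clusters)                  # size of each cluster
    n_ij = Counter(zip(classes, clusters))   # documents of class i found in cluster j
    table = {}
    for (i, j), n in n_ij.items():
        recall = n / n_i[i]
        precision = n / n_j[j]
        table[(i, j)] = 2 * recall * precision / (precision + recall)
    return table

# Toy data: three pages about one person, one page about another.
classes = ["gates", "gates", "gates", "other"]
clusters = [0, 0, 1, 1]
scores = f_measure_table(classes, clusters)
```

Pairs with no shared documents are omitted from the table, since both recall and precision are zero there and the ratio is undefined.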

Motivation
- Problem with current systems: using only strong features achieves high precision but low recall.
- Proposed solution: a two-stage clustering algorithm that uses bootstrapping to improve recall.
  - 1st stage: strong features
  - 2nd stage: weak features

Two-stage Clustering Algorithm
Input: one query string
Output: a set of clusters
1. Preprocess the documents returned by the search engine
2. First-stage clustering
3. Second-stage clustering

Preprocessing a Document
- Convert HTML files to text files
  - Remove HTML tags
  - Keep sentences
- Extract text around the query string
  - Using a window size
- Extract strong features (NEs, CKWs, URLs)
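The first two preprocessing steps can be sketched with the Python standard library. This is a rough illustration under stated assumptions: the helper names `html_to_text` and `context_windows`, the token-based window, and the default window size are hypothetical, and sentence filtering and strong-feature extraction are not covered.

```python
import re
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects text nodes and skips the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def html_to_text(html):
    """Strip HTML tags and collapse whitespace."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()

def context_windows(text, query, window=5):
    """Return the tokens within `window` tokens of each occurrence of the query."""
    tokens = text.split()
    q = [w.lower() for w in query.split()]
    found = []
    for k in range(len(tokens) - len(q) + 1):
        candidate = [t.strip(".,;:!?").lower() for t in tokens[k:k + len(q)]]
        if candidate == q:
            lo = max(0, k - window)
            hi = min(len(tokens), k + len(q) + window)
            found.append(" ".join(tokens[lo:hi]))
    return found

page = ("<html><head><style>p{}</style></head><body>"
        "<p>Yesterday Bill Gates met Paul Allen at Microsoft.</p></body></html>")
text = html_to_text(page)
windows = context_windows(text, "Bill Gates", window=2)
```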

Extract Strong Features
- Use Stanford NER to identify NEs
  - NE features: a set of sets of names, including names of persons, organizations, and places
- Compound Key Word (CKW) features: a set of CKWs
  - Extract compound words (CWs): $w_1 w_2 \ldots w_l$
  - Score each CW: [scoring formula not preserved in this transcript]
  - Select CKWs using a threshold on the scores
- Extract URLs from the original HTML files
  - Exclude URLs with high frequencies

Two-stage Clustering Algorithm
Input: one or more query strings
Output: a set of clusters
1. Preprocess the documents returned by the search engine
2. 1st-stage clustering
3. 2nd-stage clustering

First Stage Clustering
1. Calculate the similarities between documents based on the strong features (NEs, CKWs, URLs)
2. Cluster the documents with the standard hierarchical agglomerative clustering (HAC) algorithm

Document Similarities
Similarity for NE features and CKW features: [formula not preserved in this transcript]
- The formula includes a term that avoids denominator values that are too small.
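The slide's formula is missing from this transcript, so the following is only one plausible form, not necessarily the presenter's: an overlap coefficient between two feature sets whose denominator is bounded from below. The function name `overlap_similarity` and the parameter `floor` (including its default value) are assumptions made for illustration.

```python
def overlap_similarity(fx, fy, floor=4):
    """Overlap coefficient with a lower bound on the denominator.

    fx, fy: feature sets (e.g. NEs or CKWs) of two documents.
    floor:  hypothetical parameter that keeps the denominator from becoming
            too small when one document has very few features.
    """
    fx, fy = set(fx), set(fy)
    shared = len(fx & fy)
    return shared / max(min(len(fx), len(fy)), floor)

a = {"Paul Allen", "Microsoft"}
b = {"Microsoft", "Seattle", "Windows"}
sim_bounded = overlap_similarity(a, b, floor=4)   # denominator is the floor
sim_plain = overlap_similarity(a, b, floor=1)     # plain overlap coefficient
```

Without the bound, two documents that each contain a single shared feature would receive the maximum similarity of 1, which is the kind of effect a denominator floor guards against.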

Document Similarities
Similarity for URLs: [formula not preserved in this transcript]

Document Similarities
- Similarity for NE: [formula not preserved in this transcript]
- Similarities for NE, CKW, and URL: [formula not preserved in this transcript]

HAC Algorithm
- Start from one-in-one clustering, i.e., each document is its own cluster.
- Iteratively merge the most similar pair of clusters, as long as their similarity is above a threshold.
- Cluster similarity: [formula not preserved in this transcript]
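The merge loop described above can be sketched as follows. Because the cluster-similarity formula is not preserved here, the sketch assumes average linkage (the mean pairwise document similarity); the function name `hac`, the callable `sim`, and the toy similarity matrix are likewise illustrative assumptions.

```python
from itertools import combinations

def hac(n_docs, sim, threshold):
    """Threshold-stopped agglomerative clustering over documents 0..n_docs-1.

    sim(a, b) returns a document-level similarity. Cluster similarity is
    taken to be the average pairwise similarity (an assumption).
    """
    clusters = [[d] for d in range(n_docs)]   # one-in-one start

    def cluster_sim(c1, c2):
        return sum(sim(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the most similar pair of clusters.
        pairs = combinations(range(len(clusters)), 2)
        i, j = max(pairs, key=lambda p: cluster_sim(clusters[p[0]], clusters[p[1]]))
        if cluster_sim(clusters[i], clusters[j]) <= threshold:
            break                              # no pair is above the threshold
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]                        # safe because i < j
    return clusters

# Toy symmetric similarity matrix for four documents.
S = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.2, 0.0],
    [0.1, 0.2, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
result = hac(4, lambda a, b: S[a][b], threshold=0.5)
```

Documents that never exceed the threshold with any cluster remain singletons, which is exactly the set the second stage targets.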

Second Stage Clustering
Goal: cluster the documents that are still in one-in-one clusters after the first-stage clustering.
Idea of the bootstrapping algorithm:
- Given some seed instances, find patterns that are useful for extracting those seed instances.
- Use these patterns to harvest new instances; from the harvested instances, new patterns are induced.
Correspondence:
- Instances correspond to documents
- Patterns correspond to weak features: 1-grams and 2-grams in the experiments
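A heavily simplified sketch of this bootstrapping loop is shown below. It is not the presenter's algorithm: the pattern weighting (share of a cluster's documents containing an n-gram), the document score, the acceptance threshold, and all names (`ngrams`, `bootstrap`, `threshold`, `max_rounds`) are assumptions chosen to illustrate the alternation between inducing patterns and harvesting instances, with first-stage clusters assumed to act as seeds.

```python
from collections import Counter

def ngrams(text, n_values=(1, 2)):
    """Weak features of a document: its 1-grams and 2-grams."""
    tokens = text.lower().split()
    return {" ".join(tokens[k:k + n])
            for n in n_values for k in range(len(tokens) - n + 1)}

def bootstrap(docs, seed_clusters, threshold=0.25, max_rounds=5):
    """docs: {doc_id: text}; seed_clusters: list of sets of doc_ids.

    Returns (clusters, unassigned) after growing the seeds with documents
    that were left unclustered.
    """
    feats = {d: ngrams(t) for d, t in docs.items()}
    clusters = [set(c) for c in seed_clusters]
    unassigned = set(docs) - set().union(*clusters)
    if not clusters:
        return clusters, unassigned
    for _ in range(max_rounds):
        # Pattern induction: weight each n-gram by the share of the
        # cluster's documents that contain it.
        weights = []
        for c in clusters:
            counts = Counter(p for d in c for p in feats[d])
            weights.append({p: k / len(c) for p, k in counts.items()})
        # Instance harvesting: attach a document to its best-scoring cluster.
        moved = {}
        for d in sorted(unassigned):
            if not feats[d]:
                continue
            scores = [sum(w.get(p, 0.0) for p in feats[d]) / len(feats[d])
                      for w in weights]
            best = max(range(len(clusters)), key=scores.__getitem__)
            if scores[best] >= threshold:
                moved[d] = best
        if not moved:
            break
        for d, k in moved.items():
            clusters[k].add(d)
        unassigned -= set(moved)
    return clusters, unassigned

docs = {
    "a": "microsoft chief software architect",
    "b": "microsoft software founder",
    "c": "chief software architect at microsoft",
    "d": "jazz guitar player",
    "e": "jazz guitar teacher",
    "f": "guitar teacher in boston",
    "g": "weather forecast today",
}
clusters, leftover = bootstrap(docs, [{"a", "b"}, {"d", "e"}])
```

Each round recomputes the pattern weights from the enlarged clusters, so newly attached documents can contribute patterns that pull in further documents in later rounds.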

Second Stage Clustering

Experiments Setup
Dataset: WePS-2
- 30 names, each with 150 pages
- The same page can refer to two or more entities
Evaluation metrics [5]
- Multiplicity precision and recall between documents $e$ and $e'$: [formulas not preserved in this transcript]
- $C(e)$ is the set of predicted clusters of $e$; $L(e)$ is the set of clusters assigned to $e$ by the gold standard

Example of Evaluation Metrics
- Case 1: $L(1)=\{A,B\}$, $L(2)=\{A,B\}$; $C(1)=\{ct1, ct2\}$, $C(2)=\{ct1, ct2\}$
- Case 2: $L(1)=\{A,B\}$, $L(2)=\{A,B\}$; $C(1)=\{ct1\}$, $C(2)=\{ct1, ct2\}$
- Case 3: $L(1)=\{A,B\}$, $L(2)=\{A,B\}$; $C(1)=\{ct1, ct2, ct3\}$, $C(2)=\{ct1, ct2, ct3\}$
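The three configurations above can be checked with a short script. It uses the multiplicity definitions from Amigó et al. [5], which are not reproduced in this transcript: precision is $\min(|C(e) \cap C(e')|, |L(e) \cap L(e')|) / |C(e) \cap C(e')|$, and recall divides the same numerator by $|L(e) \cap L(e')|$. The function name `multiplicity` and the dictionary input format are illustrative assumptions.

```python
def multiplicity(C, L, e, e2):
    """Return (multiplicity precision, multiplicity recall) for documents e and e2.

    C[x] is the set of predicted clusters of x; L[x] is its set of gold
    categories. Assumes e and e2 share at least one cluster and one category.
    """
    shared_clusters = len(C[e] & C[e2])
    shared_labels = len(L[e] & L[e2])
    m = min(shared_clusters, shared_labels)
    return m / shared_clusters, m / shared_labels

L = {1: {"A", "B"}, 2: {"A", "B"}}
cases = [
    {1: {"ct1", "ct2"}, 2: {"ct1", "ct2"}},                # as many shared clusters as categories
    {1: {"ct1"}, 2: {"ct1", "ct2"}},                       # too few shared clusters
    {1: {"ct1", "ct2", "ct3"}, 2: {"ct1", "ct2", "ct3"}},  # too many shared clusters
]
results = [multiplicity(C, L, 1, 2) for C in cases]
```

Sharing too few clusters lowers recall, while sharing too many lowers precision; only the first configuration is perfect on both.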

Experiments Setup
Evaluation metrics
- Extended B-Cubed precision (BEP) and recall (BER)
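As a sketch of how BEP and BER aggregate the pairwise scores, the code below follows the extended B-Cubed definition in Amigó et al. [5]: for each document, multiplicity precision is averaged over all documents sharing a cluster with it (itself included), multiplicity recall over all documents sharing a gold category, and both are then averaged over documents. The function name `extended_bcubed` and the input format are assumptions, and every document is assumed to belong to at least one cluster and one category.

```python
def extended_bcubed(C, L):
    """Return (BEP, BER) for predicted clusters C and gold categories L.

    C[x] and L[x] are non-empty sets of cluster / category identifiers
    for document x (hypothetical input format).
    """
    items = list(C)

    def mean(values):
        return sum(values) / len(values)

    precisions, recalls = [], []
    for e in items:
        p, r = [], []
        for other in items:                    # includes e itself
            shared_c = len(C[e] & C[other])
            shared_l = len(L[e] & L[other])
            m = min(shared_c, shared_l)
            if shared_c:                       # pair shares a predicted cluster
                p.append(m / shared_c)
            if shared_l:                       # pair shares a gold category
                r.append(m / shared_l)
        precisions.append(mean(p))
        recalls.append(mean(r))
    return mean(precisions), mean(recalls)

# Two pages about the same two people; page 1 was put in only one cluster.
C = {1: {"ct1"}, 2: {"ct1", "ct2"}}
L = {1: {"A", "B"}, 2: {"A", "B"}}
bep, ber = extended_bcubed(C, L)
```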

Experiments Setup
Baselines:
- First-stage clustering: all-in-one, one-in-one, and a combined baseline (each document belongs to one cluster from all-in-one and one from one-in-one)
- Second-stage clustering: TOPIC algorithm, CKW algorithm

Experiments Results

References
[1] A. Bagga and B. Baldwin. Entity-based cross-document coreferencing using the vector space model. In Proceedings of COLING-ACL 1998, pages 79–85.
[2] C. Niu, W. Li, and R. K. Srihari. Weakly supervised learning for cross-document person name disambiguation supported by information extraction. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-2004), pages 598–605.
[3] X. Liu, Y. Gong, W. Xu, and S. Zhu. Document clustering with cluster refinement and model selection capabilities. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 191–198.
[4] X. Wan, M. L. J. Gao, and B. Ding. Person resolution in person search results: WebHawk. In Proceedings of CIKM 2005, pages 163–170.
[5] E. Amigo, J. Gonzalo, J. Artiles, and F. Verdejo. A comparison of extrinsic clustering evaluation metrics based on formal constraints. Information Retrieval, 12(4).
[6] M. Yoshida, M. Ikeda, S. Ono, I. Sato, and H. Nakagawa. Person name disambiguation by bootstrapping. In Proceedings of SIGIR, 2010.