Cold-Start KBP Something from Nothing Sean Monahan, Dean Carpenter Language Computer.



What is Cold-Start KBP?
Corpus of interest
– Read about one entity
– Want to know information about that entity
  E.g. spouse, employment
– Search the corpus for other mentions
– Extract the relevant facts
For all the entities in the corpus

Overview
Goal: Generate a Wikipedia-like KB from scratch
Need many technologies to create it.
What are the hard parts?
– Scalability

Wikipedia Cold-Start Infobox

Wikipedia Cold-Start Summary

Wikipedia Cold-Start Entity Links

Wikipedia Cold-Start Cross Language Links

Why is Cold-Start Hard?
Clustering is harder than Entity Linking
– In Entity Linking you have a KB
Relation extraction
– The last several years at TAC have shown how hard this is
How do you test it?
How do you scale?

System Diagram
[Diagram components: Corpus, Lorify, KB Entries, Entity Clustering, Entity Linking, Infobox Extraction, In-Doc Coref, Entity Extraction, Zoning, Information Fusion]


Entity Clustering
NIL Clustering or Cross-Document Coreference
– Comparison Space
  All pairs or subset
– Model similarity
  Vector space or ML Classifier
– Perform clustering
  Hierarchical Agglomerative or Statistical
We chose a statistical clustering algorithm based on MCMC Metropolis-Hastings
– (Singh et al. 2011)

MCMC Clustering
Start with size one clusters
Propose moving an entity from one cluster to another cluster
– Use similarity function to judge which cluster is better
– Don’t always make optimal decision
Temperature parameter controls the level of randomness
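The "don't always make the optimal decision" step can be illustrated with a standard Metropolis-style acceptance rule. This is a minimal sketch, not the authors' implementation: the function name `accept_move`, the score arguments, and the exact exponential form are assumptions, since the slides only say that a temperature parameter controls the randomness.

```python
import math
import random


def accept_move(score_current, score_proposed, temperature, rng=random):
    """Decide whether to move a mention to the proposed cluster.

    Hypothetical helper: the scores are whatever the similarity function
    assigns to the mention in its current and proposed clusters.
    """
    delta = score_proposed - score_current
    if delta >= 0:
        # The proposed cluster is at least as good: always move.
        return True
    if temperature <= 0:
        # Zero temperature: purely greedy, never accept a worse move.
        return False
    # Worse moves are accepted with probability exp(delta / T), so a
    # higher temperature means more random exploration.
    return rng.random() < math.exp(delta / temperature)
```

With a high temperature almost every proposal is accepted, and at zero temperature only non-worsening moves are accepted.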

Proposal System
Limits which pairs of entities can be clustered together
– Require some evidence
Each proposal links two entity mentions in the following ways
– String/phonemic similarity
– Alias Relation in text
– Link to Knowledge Base
Cold-Start statistics
– Cold-Start Entity Mentions: 85,289
– 12,000 total proposal tags
– # Pairs (naïve): 3.6 billion
– # Pairs (proposal): 20 million
92% recall over training data
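One way to read the proposal statistics above is as a blocking scheme: mentions are only paired when they share at least one proposal tag. The sketch below assumes that reading. The tag strings and mention ids are made up for illustration, and `proposal_pairs` is a hypothetical helper, not the system's code. The naive count for 85,289 mentions is n(n-1)/2, which matches the 3.6 billion figure.

```python
from collections import defaultdict
from itertools import combinations


def naive_pair_count(n):
    """Number of unordered pairs among n mentions."""
    return n * (n - 1) // 2


def proposal_pairs(mention_tags):
    """Return the candidate pairs of mentions that share at least one tag.

    mention_tags maps a mention id to a set of tags (for example a
    normalized name, an alias, or a KB link); the tag format is an assumption.
    """
    by_tag = defaultdict(list)
    for mention, tags in mention_tags.items():
        for tag in tags:
            by_tag[tag].append(mention)
    pairs = set()
    for members in by_tag.values():
        # Only mentions grouped under the same tag are ever compared.
        for a, b in combinations(sorted(members), 2):
            pairs.add((a, b))
    return pairs


# Made-up example data.
example = {
    "m1": {"name:obama"},
    "m2": {"name:obama", "kb:Barack_Obama"},
    "m3": {"kb:Barack_Obama"},
    "m4": {"name:paris"},
}
candidate_pairs = proposal_pairs(example)
```

Here only two of the six possible pairs survive, which is the same kind of reduction as going from 3.6 billion naive pairs to 20 million proposed pairs.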

Movement Step temperature

Performance of Base Model
KBP NIL Clustering 2011 P/R/F: 0.794/0.843/0.818
KBP NIL Clustering 2012 P/R/F: 0.257/0.376/0.305
[Figure; labels: minutes, Mentions, Clusters / Mentions, Percentage Moves Accepted]
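The F values above are consistent with the usual harmonic mean of precision and recall. The slides do not state which variant of the metric is used, so treating F as F1 here is an assumption, but under it the two base-model scores come out to the reported 0.818 and 0.305.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1); assumed to be the F above."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Base model figures from the slide, rounded to three decimals.
base_2011 = round(f_measure(0.794, 0.843), 3)
base_2012 = round(f_measure(0.257, 0.376), 3)
```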

Singleton Step

With Singletons
KBP NIL Clustering 2011 P/R/F: 0.844/0.803/0.823
KBP NIL Clustering 2012 P/R/F: 0.596/0.627/0.611
[Figure; labels: Mentions, Clusters / Mentions, Percentage Moves Accepted, minutes]

Convergence

Thermostat
KBP NIL Clustering 2011 P/R/F: 0.861/0.824/0.842
KBP NIL Clustering 2012 P/R/F: 0.644/0.669/0.657
[Figure; labels: Mentions, Acceptance Ratio, Temperature, Clusters / Mentions, minutes]

Temperature: Steady vs. Dropping vs. Zero
[Figure panels: Constant Temperature, No temperature, Dropping Temperature; labels: Movement Acceptance Ratios, minutes]

Clustering Algorithm
Assign each mention to default cluster
while temperature >= 0 do
    for N iterations do
        Propose movement or singleton, compute similarity, decide to move
    end for
    Drop temperature
end while
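The loop above can be sketched end to end on toy data. Everything specific here is an assumption made for illustration: the default cluster is taken to be a singleton per mention, the score of a mention in a cluster is the sum of (similarity - bias) over the other members, the temperature drops linearly, and the similarity function, mentions, and parameter values are invented. It is not the authors' implementation.

```python
import math
import random


def affinity(mention, cluster_members, sim, bias):
    """Sum of (similarity - bias) between a mention and the other members."""
    return sum(sim(mention, other) - bias
               for other in cluster_members if other != mention)


def mcmc_cluster(mentions, pairs, sim, bias=0.5, start_temp=1.0,
                 temp_step=0.25, iters_per_temp=200, singleton_rate=0.2, seed=0):
    rng = random.Random(seed)
    # Assumed default: every mention starts in its own cluster.
    assignment = {m: i for i, m in enumerate(mentions)}
    members = {i: {m} for i, m in enumerate(mentions)}
    next_id = len(mentions)
    temperature = start_temp
    while temperature >= 0:
        for _ in range(iters_per_temp):
            if rng.random() < singleton_rate:
                # Singleton step: propose moving a mention to a new cluster.
                mention = rng.choice(mentions)
                target = next_id
                next_id += 1
                members[target] = set()
            else:
                # Movement step: use a proposal pair to pick the target cluster.
                mention, other = rng.sample(rng.choice(pairs), 2)
                target = assignment[other]
            source = assignment[mention]
            if target != source:
                delta = (affinity(mention, members[target], sim, bias)
                         - affinity(mention, members[source], sim, bias))
                if delta >= 0 or (temperature > 0
                                  and rng.random() < math.exp(delta / temperature)):
                    members[source].discard(mention)
                    members[target].add(mention)
                    assignment[mention] = target
            # Remove clusters left empty by the move or by a rejected proposal.
            for cluster in (source, target):
                if cluster in members and not members[cluster]:
                    del members[cluster]
        temperature -= temp_step  # drop temperature
    return assignment


def same_last_token(a, b):
    """Toy similarity: 1.0 when the last tokens match, else 0.0."""
    return 1.0 if a.split()[-1].lower() == b.split()[-1].lower() else 0.0


mentions = ["Barack Obama", "Obama", "B. Obama", "Paris", "City of Paris"]
pairs = [("Barack Obama", "Obama"), ("Barack Obama", "B. Obama"),
         ("Obama", "B. Obama"), ("Paris", "City of Paris")]
result = mcmc_cluster(mentions, pairs, same_last_token)
```

Because no proposal pair links an Obama mention to a Paris mention, the two groups can never be merged, which is the role the proposal system plays in limiting the comparison space.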

MCMC Clustering
Requires some similarity function
A proposal model
A movement model
Two parameters
– Temperature controls time to cluster
– Bias determines size of clusters
Scalable to large data sets
To do streaming clustering, add new data and adjust temperature function

Producing Final KB
Once the clustering is completed
– Each cluster becomes a KB entry
– Fact extraction is run over each mention
  Information is shared between mentions
– The KB is stored in a Riak database
  Riak is a distributed key/value store
  The Riak database is exported to a TSV file
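The final export step could look roughly like the sketch below. The column layout (entry id, slot, value) and the in-memory dictionary standing in for the key/value store are assumptions; the slides do not describe the actual schema, and no Riak client calls are shown.

```python
import csv
import io


def kb_to_tsv(kb_entries):
    """Serialize KB entries to tab-separated text.

    kb_entries maps an entry id to a list of (slot, value) facts; this
    layout is hypothetical and only illustrates the export step.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["entry_id", "slot", "value"])
    for entry_id, facts in sorted(kb_entries.items()):
        for slot, value in facts:
            writer.writerow([entry_id, slot, value])
    return buffer.getvalue()


# Made-up entry: one cluster that became one KB entry.
example_kb = {"E1": [("name", "Jane Doe"), ("spouse", "John Doe")]}
tsv_text = kb_to_tsv(example_kb)
```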

Results
Combined LDC queries and derived queries at hop level 0.
System / F1 / P / R / Linking / Zoning
lcc No Yes
lcc Yes
lcc No
lcc Yes No

Thanks!