Term Informativeness for Named Entity Detection. Jason D. M. Rennie (MIT), Tommi Jaakkola (MIT)

Presentation transcript:

Information Extraction
President Bush signed the Central America Free Trade Agreement into law Tuesday…
Who / What / When

Named Entity Detection
President Bush signed the Central America Free Trade Agreement into law Tuesday, hailing the seven-nation pact as an open-door policy that will benefit U.S. exporters and seed prosperity and democracy in Central America and the Dominican Republic.

Informal Communication
Other sources of information:
–E-mail
–Web bulletin boards
–Mailing lists
More specialized, up-to-date information
But harder to extract

IE for Informal Comm.
SUBJECT: Two New Ipswich Seafood Joints to Open Soon. ALL HOUNDS ON DECK! #1 Across from the new HS, at the old White Cap Seafood is a renovated new joint and the sign says "Salt Box". I suspect they are opening soon; they look ready. Lets hope its great as there is too much 'just average' around here. #2: In the…

NED for Informal Comm.
Subject: finale harvard square
has anyone been to the recently opened finale in harvard square?

Restaurant Bulletin Board
Gathered from a restaurant bulletin board:
–6 sets of ~100 posts
–132 threads
–Applied Ratnaparkhi's POS tagger
–Hand-labeled each token as in/out of a restaurant name

Detecting Named Entities
[diagram labels: Named Entity, Informative, Bursty]

Quantifying Informativeness
[figure: Document 1, Document 2, Document 3; example words “the”, “clandestine”, “Brazil”]

A Little History…
Z-measure [Brookes, 1968]
Inverse Doc. Freq. [Jones, 1973]
x^I [Bookstein & Swanson, 1974]
Residual IDF [Church & Gale, 1995]
Gain [Papineni, 2001]

Main Idea
Informative words are:
–Rare (IDF)
–Modal (Mixture Score)
Rarity and modality are independent qualities.
We quantify informativeness using the product of IDF and Mixture Score.
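The rarity half of the score can be sketched as follows. This is a minimal illustration assuming the common definition IDF = -log(df / N), where df is the number of documents containing the term and N is the number of documents; the function name `idf`, the natural-log base, and the toy corpus are assumptions, not taken from the slides.

```python
import math

def idf(term, documents):
    """Inverse document frequency: -log(fraction of documents containing term)."""
    doc_freq = sum(1 for doc in documents if term in doc)
    if doc_freq == 0:
        raise ValueError(f"term {term!r} does not occur in any document")
    return -math.log(doc_freq / len(documents))

# Toy corpus (hypothetical): each document is a list of tokens.
docs = [
    ["the", "clandestine", "meeting"],
    ["the", "brazil", "brazil", "trip"],
    ["the", "weather"],
]

idf_the = idf("the", docs)        # occurs in every document, so IDF is zero
idf_brazil = idf("brazil", docs)  # occurs in one of three documents
```

A word that appears in every document gets an IDF of zero, while a word confined to few documents gets a larger value, regardless of how often it repeats inside those documents.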

Binomial Distribution
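For reference, the standard binomial probability of seeing a term k times in a document of n tokens, when each token is that term independently with probability θ, is shown below. The symbols k, n, and θ are chosen here for illustration and are not taken from the slide.

```latex
P(k \mid n, \theta) = \binom{n}{k}\, \theta^{k} (1 - \theta)^{n - k}
```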

Term Frequency Distributions
[figure: term frequency distributions of “the” and “Brazil”]

Mixture Models
[figure: example two-component mixture; surviving labels: 0.1%, 5%, 10%, 90%]

Modality
Modal words fit a mixture much better than a single binomial.
We separately fit the binomial and mixture models to each term frequency distribution.
We quantify modality by comparing the fitness of the two models.
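The contrast between the two models can be illustrated with a small sketch. It assumes each document is summarized as a (term count, document length) pair, that the mixture has two binomial components, and that the binomial coefficient can be dropped because it is the same under both models. The function names, the toy data, and the hand-picked mixture parameters are all hypothetical.

```python
import math

def binomial_loglik(data, theta):
    """Log-likelihood of (count, length) pairs under a single rate theta.

    Assumes 0 < theta < 1. The binomial coefficient is omitted because it
    is identical under both models and cancels when they are compared.
    """
    return sum(k * math.log(theta) + (n - k) * math.log(1 - theta)
               for k, n in data)

def mixture_loglik(data, lam, theta1, theta2):
    """Log-likelihood under a two-component mixture with weight lam on theta1."""
    total = 0.0
    for k, n in data:
        a = math.log(lam) + k * math.log(theta1) + (n - k) * math.log(1 - theta1)
        b = math.log(1 - lam) + k * math.log(theta2) + (n - k) * math.log(1 - theta2)
        m = max(a, b)  # log-sum-exp for numerical stability
        total += m + math.log(math.exp(a - m) + math.exp(b - m))
    return total

# Toy "bursty" term: absent from 8 documents, 5 occurrences in each of 2 others.
bursty = [(0, 100)] * 8 + [(5, 100)] * 2

# Maximum-likelihood rate for the single binomial: total count / total length.
theta_mle = sum(k for k, _ in bursty) / sum(n for _, n in bursty)

single = binomial_loglik(bursty, theta_mle)
mixed = mixture_loglik(bursty, lam=0.2, theta1=0.05, theta2=0.001)
```

On this toy data the mixture assigns a noticeably higher log-likelihood than the best single binomial, which is the behavior the slide describes for modal words.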

Learning Mixture Parameters
Use gradient descent to learn the mixture parameters: the mixing weight and the two component parameters (subscripted 1 and 2).
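One way such a fit could look is sketched below. It is not the authors' implementation: it assumes a two-component binomial mixture over (term count, document length) pairs, performs gradient ascent on the log-likelihood (equivalently, gradient descent on its negative), and keeps every parameter inside (0, 1) by optimizing its logit. The names `lam`, `theta1`, `theta2`, the step size, the iteration count, and the starting point are all assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    return math.log(p / (1.0 - p))

def loglik_and_grad(data, a, b, c):
    """Mixture log-likelihood and its gradient w.r.t. the logits a, b, c."""
    lam, t1, t2 = sigmoid(a), sigmoid(b), sigmoid(c)
    ll = ga = gb = gc = 0.0
    for k, n in data:
        l1 = math.log(lam) + k * math.log(t1) + (n - k) * math.log(1 - t1)
        l2 = math.log(1 - lam) + k * math.log(t2) + (n - k) * math.log(1 - t2)
        m = max(l1, l2)
        log_mix = m + math.log(math.exp(l1 - m) + math.exp(l2 - m))
        r1 = math.exp(l1 - log_mix)     # responsibility of component 1
        ll += log_mix
        ga += r1 - lam                  # d ll / d a
        gb += r1 * (k - n * t1)         # d ll / d b
        gc += (1 - r1) * (k - n * t2)   # d ll / d c
    return ll, (ga, gb, gc)

def fit_mixture(data, lam=0.5, theta1=0.1, theta2=0.01, lr=0.005, steps=4000):
    """Fixed-step gradient ascent; returns (lam, theta1, theta2)."""
    a, b, c = logit(lam), logit(theta1), logit(theta2)
    for _ in range(steps):
        _, (ga, gb, gc) = loglik_and_grad(data, a, b, c)
        a, b, c = a + lr * ga, b + lr * gb, c + lr * gc
    return sigmoid(a), sigmoid(b), sigmoid(c)

# Toy bursty term: absent from 8 documents, 5 occurrences in each of 2 others.
data = [(0, 100)] * 8 + [(5, 100)] * 2
lam, theta1, theta2 = fit_mixture(data)
```

The logit parameterization is a design choice made here to avoid explicit constraints; a fixed step size is the simplest option and would likely need tuning on real term frequency data.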

Comparing Fitness Use log-odds to compare fitness of the two models
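A minimal sketch of the comparison, assuming "log-odds" means the log of the likelihood ratio between the fitted mixture and the fitted binomial, and that the final informativeness score multiplies this by IDF = -log(df / N). The function names and the example numbers are hypothetical and are not results from the talk.

```python
import math

def mixture_score(loglik_mixture, loglik_binomial):
    """Log-odds of the two fitted models: log P_mix(data) - log P_binom(data)."""
    return loglik_mixture - loglik_binomial

def idf_times_mixture(doc_freq, num_docs, loglik_mixture, loglik_binomial):
    """Product of rarity (IDF) and modality (mixture score)."""
    idf = -math.log(doc_freq / num_docs)
    return idf * mixture_score(loglik_mixture, loglik_binomial)

# Hypothetical log-likelihoods for one term that occurs in 2 of 10 documents.
score = mixture_score(-45.5, -56.0)
informativeness = idf_times_mixture(2, 10, -45.5, -56.0)
```

A score near zero means the mixture explains the term's counts no better than a single binomial; a large positive score marks the term as modal.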

Top Mixture Score Words
Token     Score   Rest. Occur.
sichaun   …       …/52
fish      50.59   7/73
was       48.79   0/483
speed     …       …/19
tacos     43.77   4/19

Independence
Are rareness (IDF) and modality (Mixture Score) independent?

Correlation Coefficient
Score Pair     Corr. Coefficient
IDF/Mixture    …
IDF/RIDF       .4113
Mixture/RIDF   .7380
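The slide does not say which correlation coefficient was used. As an illustration only, the sketch below assumes the Pearson coefficient computed over per-word score pairs; the function name and the sample values are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("correlation is undefined for a constant list")
    return cov / (dev_x * dev_y)

# Hypothetical per-word scores, in the same word order in both lists.
idf_scores = [1.0, 2.0, 3.0]
other_scores = [2.0, 4.0, 6.0]
r = pearson(idf_scores, other_scores)
```

A coefficient near zero would be consistent with two scores carrying largely separate information, though it does not by itself establish independence.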

Top Words Overlap Plot
Two sorted lists:
–Sorted by IDF
–Sorted by Mixture Score
Look at % overlap among the top N in both lists.
Plot % overlap as we vary N.
Independent scores would produce a line along the diagonal: with a vocabulary of V words, the expected overlap among the top N is N/V.
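The overlap computation described on this slide can be sketched as follows. The dictionaries of per-word scores are hypothetical toy values, and the function name `top_n_overlap` is an assumption.

```python
def top_n_overlap(scores_a, scores_b, n):
    """Percent of words shared by the top-n lists of two score dictionaries."""
    top_a = set(sorted(scores_a, key=scores_a.get, reverse=True)[:n])
    top_b = set(sorted(scores_b, key=scores_b.get, reverse=True)[:n])
    return 100.0 * len(top_a & top_b) / n

# Hypothetical scores for a four-word vocabulary.
idf_scores = {"the": 0.0, "was": 0.5, "fish": 2.0, "tokyo": 3.0}
mixture_scores = {"the": 0.1, "was": 4.0, "fish": 5.0, "tokyo": 3.0}

# One point of the overlap curve for each N from 1 to the vocabulary size.
curve = [top_n_overlap(idf_scores, mixture_scores, n)
         for n in range(1, len(idf_scores) + 1)]
```

Plotting `curve` against N gives the overlap plot; the curve always reaches 100% when N equals the vocabulary size, since both lists then contain every word.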

Overlap Plot
[figure: percent overlap vs. number of top words, for IDF/Mixture and IDF/RIDF]

Top IDF*Mixture Words
Token     Score   Rest. Occur.
sichaun   …       …/52
villa     …       …/11
tokyo     …       …/11
ribs      …       …/13
speed     …       …/19

Intro to NED Experiments
Task: identify restaurant names.
Use standard NED features (capitalization, punctuation, POS) as “Baseline”.
Add informativeness score as an additional feature.
Use F1 Breakeven as performance metric.
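The slides do not spell out how F1 breakeven is computed. One common reading, assumed here, is the point on the precision/recall curve where precision equals recall (and therefore equals F1), which is reached by labeling as positive exactly as many top-scored tokens as there are true positives. The function name and the toy scores are hypothetical.

```python
def f1_breakeven(scores, labels):
    """Precision (= recall = F1) when the top-P scored items are predicted
    positive, where P is the number of true positives in `labels`."""
    num_positive = sum(labels)
    if num_positive == 0:
        raise ValueError("need at least one positive label")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = sum(label for _, label in ranked[:num_positive])
    return hits / num_positive

# Hypothetical classifier scores per token; label 1 = inside a restaurant name.
token_scores = [0.9, 0.8, 0.7, 0.4, 0.2]
token_labels = [1, 0, 1, 1, 0]
breakeven = f1_breakeven(token_scores, token_labels)
```

With three positive tokens, the top three scored tokens contain two positives, so precision and recall are both 2/3 at the breakeven point.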

NED Experiments
Feature Set    F1 Breakeven
Baseline       55.0%
IDF            56.0%
Mixture        56.0%
IDF,Mixture    56.9%
Residual IDF   57.4%
IDF*RIDF       58.5%
IDF*Mixture    59.3%
Higher is better; rows run from worst to best.

Summary
Traditional syntax-based features are not enough for IE in e-mail & bulletin boards.
We used term occurrence statistics to construct an informativeness score (IDF*Mixture).
We found IDF*Mixture to be useful for identifying topic-centric words and named entities.

Discussion
Phrases
Foreign languages, speech
Co-reference resolution, context tracking
Collaborative filtering