Named Entity Recognition and the Stanford NER Software Jenny Rose Finkel Stanford University March 9, 2007

Named Entity Recognition  Example inputs:
Germany’s representative to the European Union’s veterinary committee Werner Zwingman said on Wednesday consumers should …
IL-2 gene expression and NF-kappa B activation through CD28 requires reactive oxygen production by 5-lipoxygenase.

Why NER?  Question Answering  Textual Entailment  Coreference Resolution  Computational Semantics ……

NER Data/Bake-Offs  CoNLL-2002 and CoNLL-2003 (British newswire)  Multiple languages: Spanish, Dutch, English, German  4 entities: Person, Location, Organization, Misc  MUC-6 and MUC-7 (American newswire)  7 entities: Person, Location, Organization, Time, Date, Percent, Money  ACE  5 entities: Location, Organization, Person, FAC, GPE  BBN (Penn Treebank)  22 entities: Animal, Cardinal, Date, Disease, …

Hidden Markov Models (HMMs)  Generative  Find parameters to maximize P(X,Y)  Assumes features are independent  When labeling X_i, future observations are taken into account (forward-backward)
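
For reference, the joint distribution a first-order HMM maximizes factors as (notation added here, not from the slide):

P(X, Y) = \prod_{i=1}^{n} P(y_i \mid y_{i-1}) \, P(x_i \mid y_i)

Each observation x_i depends only on its own label y_i, which is where the feature-independence assumption comes from.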

MaxEnt Markov Models (MEMMs)  Discriminative  Find parameters to maximize P(Y|X)  No longer assume that features are independent  Do not take future observations into account (no forward-backward)
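
In the same (added) notation, a MEMM models the conditional distribution with a locally normalized classifier at each position:

P(Y \mid X) = \prod_{i=1}^{n} P(y_i \mid y_{i-1}, X)
            = \prod_{i=1}^{n} \frac{\exp\big(\sum_k \lambda_k f_k(y_{i-1}, y_i, X, i)\big)}{Z(y_{i-1}, X, i)}

Because each factor is normalized on its own, decoding proceeds left to right and future observations are not taken into account.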

Conditional Random Fields (CRFs)  Discriminative  Doesn’t assume that features are independent  When labeling Y_i, future observations are taken into account  The best of both worlds!
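
The corresponding globally normalized CRF, in the same added notation:

P(Y \mid X) = \frac{1}{Z(X)} \prod_{i=1}^{n} \exp\big(\sum_k \lambda_k f_k(y_{i-1}, y_i, X, i)\big),
\qquad Z(X) = \sum_{Y'} \prod_{i=1}^{n} \exp\big(\sum_k \lambda_k f_k(y'_{i-1}, y'_i, X, i)\big)

The single partition function Z(X) over whole label sequences is the "global" normalization in the table on the next slide, and computing it is the main reason CRF training is the slowest of the three models.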

Model Trade-offs

Model   Speed        Discrim vs. Generative   Normalization
HMM     very fast    generative               local
MEMM    mid-range    discriminative           local
CRF     kinda slow   discriminative           global

Stanford NER  CRF  Features are more important than model  How to train a new model

Our Features  Word features: current word, previous word, next word, all words within a window  Orthographic features:  Jenny → Xxxx  IL-2 → XX-#  Prefixes and Suffixes:  Jenny → <J, <Je, <Jen, …, ny>, y>  Label sequences  Lots of feature conjunctions
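
A minimal sketch of the word-shape idea behind these orthographic features (this is not the Stanford implementation; the run-capping rule is an assumption chosen so the slide's two examples come out as shown):

// Word-shape sketch: uppercase -> X, lowercase -> x, digit -> #,
// other characters kept as-is, with runs of one class capped at three
// so that "Jenny" -> "Xxxx" and "IL-2" -> "XX-#".
public class WordShapeSketch {

  static String shape(String word) {
    StringBuilder sb = new StringBuilder();
    char prevClass = 0;
    int runLength = 0;
    for (char c : word.toCharArray()) {
      char cls;
      if (Character.isUpperCase(c)) cls = 'X';
      else if (Character.isLowerCase(c)) cls = 'x';
      else if (Character.isDigit(c)) cls = '#';
      else cls = c;                        // keep '-', '.', etc. as themselves
      if (cls == prevClass) runLength++;
      else { prevClass = cls; runLength = 1; }
      if (runLength <= 3) sb.append(cls);  // collapse long runs of one class
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.println(shape("Jenny"));    // Xxxx
    System.out.println(shape("IL-2"));     // XX-#
  }
}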

Distributional Similarity Features  Large, unannotated corpus  Each word appears in a set of contexts: induce a distribution over contexts for each word  Cluster words based on how similar their distributions are  Use cluster IDs as features  Great way to combat sparsity  We used Alexander Clark’s distributional similarity code (easy to use, works great!)  200 clusters, used 100 million words from the English Gigaword corpus

Training New Models  Reading data:  edu.stanford.nlp.sequences.DocumentReaderAndWriter  Interface for specifying input/output format  edu.stanford.nlp.sequences.ColumnDocumentReaderAndWriter reads one token per line with its label in a second column, e.g.:

Germany         LOCATION
's              O
representative  O
to              O
the             O
European        ORGANIZATION
Union           ORGANIZATION

Training New Models  Creating features  edu.stanford.nlp.sequences.FeatureFactory  Interface for extracting features from data  Makes sense if doing something very different (e.g., Chinese NER)  edu.stanford.nlp.sequences.NERFeatureFactory  Easiest option: just add new features here  Lots of built in stuff: computes orthographic features on- the-fly  Specifying features  edu.stanford.nlp.sequences.SeqClassifierFlags  Stores global flags  Initialized from Properties file

Training New Models  Other useful stuff  useObservedSequencesOnly  Speeds up training/testing  Makes sense in some applications, but not all  window  How many previous tags do you want to be able to condition on?  feature pruning  Remove rare features  Optimizer: LBFGS

Distributed Models  Trained on CoNLL, MUC and ACE  Entities: Person, Location, Organization  Trained on both British and American newswire, so robust across both domains  Models with and without the distributional similarity features
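
For completeness, a hedged sketch of loading one of these serialized models and tagging text from Java. The model file name is a placeholder; getClassifierNoExceptions and classifyToString are from the public CRFClassifier API, which may differ slightly in older releases.

import edu.stanford.nlp.ie.crf.CRFClassifier;

// Sketch: load a pretrained (serialized) NER model and tag some text.
public class TagTextSketch {
  public static void main(String[] args) {
    // "ner-model.ser.gz" is a placeholder for whichever distributed model you use
    CRFClassifier classifier = CRFClassifier.getClassifierNoExceptions("ner-model.ser.gz");

    String text = "Germany's representative to the European Union's veterinary committee "
                + "Werner Zwingman said on Wednesday consumers should ...";

    // Prints the text with an entity label attached to each token;
    // the exact output format depends on the classifier's flags.
    System.out.println(classifier.classifyToString(text));
  }
}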

Incorporating NER into Systems  NER is a component technology  Common approach:  Label data  Pipe output to next stage  Better approach:  Sample output at each stage  Pipe sampled output to next stage  Repeat several times  Vote for final output  Sampling NER outputs is fast
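
The sample-and-vote idea can be sketched generically. Everything below (the SamplingPipeline interface, method names, vote counting) is hypothetical scaffolding invented for illustration, not actual pipeline code from the talk.

import java.util.HashMap;
import java.util.Map;
import java.util.Random;

// Hypothetical sketch of sample-and-vote: run the pipeline k times, drawing a
// sample from each stage's output distribution instead of taking its single
// best output, then take a majority vote over the final answers.
interface SamplingPipeline<D> {
  String sampleAnswer(D document, Random rng);  // one full sampled run, e.g. "Yes"/"No"
}

class SampleAndVote {
  static <D> String vote(D document, SamplingPipeline<D> pipeline, int k, long seed) {
    Random rng = new Random(seed);
    Map<String, Integer> counts = new HashMap<String, Integer>();
    for (int i = 0; i < k; i++) {
      String answer = pipeline.sampleAnswer(document, rng);
      Integer c = counts.get(answer);
      counts.put(answer, c == null ? 1 : c + 1);
    }
    String best = null;
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      if (best == null || e.getValue() > counts.get(best)) best = e.getKey();
    }
    return best;  // majority answer across the k sampled runs
  }
}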

Textual Entailment Pipeline  Topological sort of annotators: Parser, Coreference, NE Recognizer, SR Labeler, RTE (shown as a pipeline diagram on the slide)

Sampling Example (animated sequence of slides): the same pipeline (Parser, Coreference, NE Recognizer, SR Labeler, RTE) is run repeatedly, sampling a different output at each stage; successive runs yield different semantic-role samples (ARG-TMP, ARG-LOC, ARG2, …) and different RTE answers (Yes, No, Yes, No), which are collected for the final vote.

Conclusions  NER is a useful technology  Stanford NER Software  Has pretrained models for English newswire  Easy to train new models   Questions?