MaxEnt POS Tagging Shallow Processing Techniques for NLP Ling570 November 21, 2011
Roadmap MaxEnt POS Tagging Features Beam Search vs Viterbi Named Entity Tagging
MaxEnt Feature Template Words: Current word: w0 Previous word: w-1 Word two back: w-2 Next word: w+1 Next next word: w+2 Tags: Previous tag: t-1 Previous tag pair: t-2 t-1 How many features? 5|V| + |T| + |T|^2
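A minimal sketch (not the course code) of how these word- and tag-context templates could be turned into binary features; the feature-name strings follow the "well-heeled" example slide below, and context_features is a hypothetical helper:

    def context_features(words, tags, i):
        # Word-context and tag-context features for position i.
        # Boundary positions fall back to BOS/EOS pseudo-tokens.
        feats = {}
        feats['curW=' + words[i]] = 1
        feats['prevW=' + (words[i-1] if i >= 1 else 'BOS')] = 1
        feats['prev2W=' + (words[i-2] if i >= 2 else 'BOS')] = 1
        feats['nextW=' + (words[i+1] if i + 1 < len(words) else 'EOS')] = 1
        feats['next2W=' + (words[i+2] if i + 2 < len(words) else 'EOS')] = 1
        feats['preT=' + (tags[i-1] if i >= 1 else 'BOS')] = 1
        feats['pre2T=' + ('-'.join(tags[i-2:i]) if i >= 2 else 'BOS')] = 1
        return feats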
Representing Orthographic Patterns How can we represent morphological patterns as features? Character sequences Which sequences? Prefixes/suffixes e.g. suffix(wi)=ing or prefix(wi)=well Specific characters or character types Which? is-capitalized is-hyphenated
MaxEnt Feature Set
Examples well-heeled: rare word JJ prevW=about:1 prev2W=stories:1 nextW=communities:1 next2W=and:1 pref=w:1 pref=we:1 pref=wel:1 pref=well:1 suff=d:1 suff=ed:1 suff=led:1 suff=eled:1 is-hyphenated:1 preT=IN:1 pre2T=NNS-IN:1
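A sketch of the orthographic features for rare words (again a hypothetical helper, not the course code); applied to "well-heeled" it produces the pref=, suff=, and is-hyphenated features shown above:

    def orthographic_features(word, max_len=4):
        # Prefix/suffix features up to length 4, plus character-type features.
        feats = {}
        for n in range(1, min(max_len, len(word)) + 1):
            feats['pref=' + word[:n]] = 1
            feats['suff=' + word[-n:]] = 1
        if '-' in word:
            feats['is-hyphenated'] = 1
        if word[0].isupper():
            feats['is-capitalized'] = 1
        return feats

    # orthographic_features('well-heeled') -> pref=w, pref=we, pref=wel, pref=well,
    #                                          suff=d, suff=ed, suff=led, suff=eled,
    #                                          is-hyphenated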
Finding Features In training, where do features come from? Where do features come from in testing? Tag features come from the classification of prior words.

    Instance     w-1     w0      w-1 w0       w0 w+1       t-1    y
    x1 (Time)    --      Time    --           Time flies   BOS    N
    x2 (flies)   Time    flies   Time flies   flies like   N      N
    x3 (like)    flies   like    flies like   like an      N      V
Sequence Labeling
Goal: Find most probable labeling of a sequence Many sequence labeling tasks POS tagging Word segmentation Named entity tagging Story/spoken sentence segmentation Pitch accent detection Dialog act tagging
Solving Sequence Labeling Direct: Use a sequence labeling algorithm E.g. HMM, CRF, MEMM Via classification: Use a classification algorithm Issue: What about tag features? Features that use class labels depend on prior classifications Solutions: Don't use features that depend on class labels (loses info) Use some other process to generate class labels, then use them as features Perform incremental classification to get labels, then use those labels as features for instances later in the sequence (see the sketch below)
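A minimal sketch of the third solution, incremental (left-to-right) classification, assuming the context_features helper sketched earlier; classify stands in for any trained classifier mapping a feature dict to a tag:

    def tag_incrementally(words, classify):
        # Left-to-right tagging: tags predicted so far supply the tag features
        # (preT, pre2T) for later positions.
        tags = []
        for i in range(len(words)):
            feats = context_features(words, tags, i)
            tags.append(classify(feats))
        return tags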
HMM Trellis time flies like an arrow Adapted from F. Xia
Viterbi Initialization: δ(1, t) = P(t | BOS) * P(w1 | t) Recursion: δ(i, t) = max over t' of [ δ(i-1, t') * P(t | t') ] * P(wi | t), storing a backpointer to the best t' Termination: best score = max over t of δ(n, t); follow backpointers to recover the tag sequence
Viterbi trellis for "time flies like an arrow" (rows: tags BOS, N, V, D, P; columns 1-6: BOS, time, flies, like, an, arrow). Each cell [tag, i] holds the best path probability ending in that tag at position i, e.g. [N, 6] = [D, 5] * P(N|D) * P(arrow|N)
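For concreteness, a minimal bigram-HMM Viterbi sketch matching the initialization/recursion/termination above; trans and emit are assumed probability tables, not part of the original slides:

    def viterbi(words, tags, trans, emit):
        # Bigram HMM Viterbi. trans[(t_prev, t)] = P(t | t_prev),
        # emit[(t, w)] = P(w | t); 'BOS' is the start state.
        n = len(words)
        delta = [{} for _ in range(n)]   # best path probability per (position, tag)
        back = [{} for _ in range(n)]    # backpointers
        for t in tags:                   # initialization
            delta[0][t] = trans.get(('BOS', t), 0.0) * emit.get((t, words[0]), 0.0)
            back[0][t] = 'BOS'
        for i in range(1, n):            # recursion
            for t in tags:
                prev = max(tags, key=lambda tp: delta[i-1][tp] * trans.get((tp, t), 0.0))
                delta[i][t] = delta[i-1][prev] * trans.get((prev, t), 0.0) * emit.get((t, words[i]), 0.0)
                back[i][t] = prev
        best = max(tags, key=lambda t: delta[n-1][t])   # termination
        seq = [best]
        for i in range(n - 1, 0, -1):    # follow backpointers
            seq.append(back[i][seq[-1]])
        return list(reversed(seq))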
Decoding Goal: Identify the highest probability tag sequence Issues: Features include tags from previous words, which are not immediately available The model uses tag history: just knowing the single highest-probability preceding tag is insufficient
Decoding Approach: Retain multiple candidate tag sequences Essentially search through tagging choices Which sequences? All sequences? No. Why not? How many sequences? Branching factor: N (# tags); Depth: T (# words), so N^T sequences (e.g., with ~45 Penn Treebank tags and a 20-word sentence, far too many to enumerate) Instead: keep the top K highest probability sequences
Breadth-First Search time flies like an arrow
Breadth-first Search Is it efficient? No, it tries everything
Beam Search Intuition: Breadth-first search explores all paths Lots of paths are (pretty obviously) bad Why explore bad paths? Restrict to (apparently best) paths Approach: Perform breadth-first search, but Retain only k ‘best’ paths thus far k: beam width
Beam Search, k=3 time flies like an arrow
Beam Search W = {w1, w2, …, wn}: test sentence sij: jth highest prob. sequence up to & including word wi Generate tags for w1, keep top k, set s1j accordingly for i = 2 to n: Extension: add tags for wi to each s(i-1)j Beam selection: Sort sequences by probability Keep only top k sequences Return highest probability sequence sn1
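A minimal sketch of this algorithm, reusing the context_features helper sketched earlier; score is a stand-in for the trained MaxEnt model's P(tag | history):

    def beam_search(words, tags, score, k=3):
        # Beam search decoding. `score(feats, tag)` returns P(tag | history),
        # e.g. from a trained MaxEnt model. Each beam entry is
        # (tag sequence so far, probability of that sequence).
        beam = [([], 1.0)]
        for i in range(len(words)):
            candidates = []
            for seq, p in beam:                       # extension step
                feats = context_features(words, seq, i)
                for t in tags:
                    candidates.append((seq + [t], p * score(feats, t)))
            candidates.sort(key=lambda c: c[1], reverse=True)
            beam = candidates[:k]                     # beam selection: keep top k
        return beam[0][0]                             # highest-probability sequence

With k = 1 this degenerates to greedy left-to-right tagging; per the slides, a beam of 3-5 is near optimal in practice.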
POS Tagging Overall accuracy: 96.3+% Unseen word accuracy: 86.2% Comparable to HMM or TBL tagging accuracy Provides: Probabilistic framework Better able to model different information sources Topline (ceiling) accuracy is 96-97%, limited by annotation consistency issues
Beam Search Beam search decoding: Variant of breadth-first search At each layer, keep only the top k sequences Advantages: Efficient in practice: a beam of 3-5 is near optimal Empirically, the beam explores only 5-10% of the search space, i.e. prunes 90-95% Simple to implement: just extensions + sorting, no dynamic programming Running time: O(kT) [vs. O(N^T)] Disadvantage: Not guaranteed optimal (or complete)
Viterbi Decoding Viterbi search: Exploits dynamic programming, memoization Requires small history window Efficient search: O(N^2 T) Advantage: Exact: optimal solution is returned Disadvantage: Limited window of context
Beam vs Viterbi Dynamic programming vs heuristic search Guaranteed optimal vs no guarantee Different context window
MaxEnt POS Tagging Part of speech tagging by classification: Feature design word and tag context features orthographic features for rare words Sequence classification problems: Tag features depend on prior classification Beam search decoding Efficient, but inexact Near optimal in practice
Named Entity Recognition
Roadmap Named Entity Recognition Definition Motivation Challenges Common Approach
Named Entity Recognition Task: Identify Named Entities in (typically) unstructured text Typical entities: Person names Locations Organizations Dates Times
Example Microsoft released Windows Vista in 2007.
Entities: Often application/domain specific Business intelligence: products, companies, features Biomedical: genes, proteins, diseases, drugs, …
Why NER? Machine translation: Person names typically not translated Possibly transliterated (e.g., Waldheim) Numbers: 9/11: date vs. ratio 911: emergency phone number vs. simple number
Why NER? Information extraction: MUC task: joint ventures/mergers Focus on company names, person names (CEO), valuations Information retrieval: Named entities are the focus of retrieval In some data sets, 60+% of queries target NEs Text-to-speech: Phone numbers (vs. other digit strings); conventions differ by language
Challenges Ambiguity "Washington chose …": Washington could be D.C., the state, George (Washington), etc. Most digit strings are ambiguous cat: (95 results) CAT(erpillar) stock ticker Computerized Axial Tomography Chloramphenicol Acetyl Transferase small furry mammal
Evaluation Precision Recall F-measure
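The standard definitions (not spelled out on the slide), with counts over system-proposed vs. gold-standard entities:

    Precision P = # correctly identified entities / # entities proposed by the system
    Recall    R = # correctly identified entities / # entities in the gold standard
    F-measure F1 = 2PR / (P + R)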
Resources Online: Name lists (baby names, who's who, newswire services) Gazetteers, etc. Tools: LingPipe OpenNLP Stanford NLP toolkit