Partial Parsing CSCI-GA.2590 – Lecture 5A Ralph Grishman NYU.

Slides:



Advertisements
Similar presentations
School of something FACULTY OF OTHER School of Computing FACULTY OF ENGINEERING Chunking: Shallow Parsing Eric Atwell, Language Research Group.
Advertisements

An Introduction to GATE
 Christel Kemke 2007/08 COMP 4060 Natural Language Processing Feature Structures and Unification.
Sequence Classification: Chunking Shallow Processing Techniques for NLP Ling570 November 28, 2011.
Part-Of-Speech Tagging and Chunking using CRF & TBL
Chunk Parsing CS1573: AI Application Development, Spring 2003 (modified from Steven Bird’s notes)
Shallow Parsing CS 4705 Julia Hirschberg 1. Shallow or Partial Parsing Sometimes we don’t need a complete parse tree –Information extraction –Question.
Natural Language Processing - Feature Structures - Feature Structures and Unification.
Ang Sun Ralph Grishman Wei Xu Bonan Min November 15, 2011 TAC 2011 Workshop Gaithersburg, Maryland USA.
Stemming, tagging and chunking Text analysis short of parsing.
1 SIMS 290-2: Applied Natural Language Processing Marti Hearst Sept 20, 2004.
Visual Web Information Extraction With Lixto Robert Baumgartner Sergio Flesca Georg Gottlob.
Event Extraction: Learning from Corpora Prepared by Ralph Grishman Based on research and slides by Roman Yangarber NYU.
1 CSC 594 Topics in AI – Applied Natural Language Processing Fall 2009/ Shallow Parsing.
Tutorial 1 Scanner & Parser
 Christel Kemke 2007/08 COMP 4060 Natural Language Processing Feature Structures and Unification.
1 SIMS 290-2: Applied Natural Language Processing Marti Hearst Sept 22, 2004.
1 Introduction to Computational Linguistics Eleni Miltsakaki AUTH Fall 2005-Lecture 2.
1 I256: Applied Natural Language Processing Marti Hearst Sept 25, 2006.
Part of speech (POS) tagging
Machine Learning in Natural Language Processing Noriko Tomuro November 16, 2006.
A Memory-Based Approach to Semantic Role Labeling Beata Kouchnir Tübingen University 05/07/04.
Context-Free Grammar CSCI-GA.2590 – Lecture 3 Ralph Grishman NYU.
NATURAL LANGUAGE TOOLKIT(NLTK) April Corbet. Overview 1. What is NLTK? 2. NLTK Basic Functionalities 3. Part of Speech Tagging 4. Chunking and Trees 5.
 2004 Prentice Hall, Inc. All rights reserved. Chapter 25 – Perl and CGI (Common Gateway Interface) Outline 25.1 Introduction 25.2 Perl 25.3 String Processing.
March 2006 CLINT-CS 1 Introduction to Computational Linguistics Chunk Parsing.
ELN – Natural Language Processing Giuseppe Attardi
LING 388: Language and Computers Sandiway Fong Lecture 17.
April 2005CSA2050:NLTK1 CSA2050: Introduction to Computational Linguistics NLTK.
Semantic Grammar CSCI-GA.2590 – Lecture 5B Ralph Grishman NYU.
Chapter 1 Introduction Dr. Frank Lee. 1.1 Why Study Compiler? To write more efficient code in a high-level language To provide solid foundation in parsing.
Information Extraction From Medical Records by Alexander Barsky.
October 2005CSA3180: Text Processing II1 CSA3180: Natural Language Processing Text Processing 2 Shallow Parsing and Chunking Python and NLTK NLTK Exercises.
What it’s ? “parsing” Parsing or syntactic analysis is the process of analysing a string of symbols, either in natural language or in computer languages,
Lecture 6 Hidden Markov Models Topics Smoothing again: Readings: Chapters January 16, 2013 CSCE 771 Natural Language Processing.
Evaluation CSCI-GA.2590 – Lecture 6A Ralph Grishman NYU.
Open Information Extraction using Wikipedia
Mastering the Pipeline CSCI-GA.2590 Ralph Grishman NYU.
Introduction to GATE Developer Ian Roberts. University of Sheffield NLP Overview The GATE component model (CREOLE) Documents, annotations and corpora.
CS774. Markov Random Field : Theory and Application Lecture 19 Kyomin Jung KAIST Nov
A Cascaded Finite-State Parser for German Michael Schiehlen Institut für Maschinelle Sprachverarbeitung Universität Stuttgart
©2003 Paula Matuszek Taken primarily from a presentation by Lin Lin. CSC 9010: Text Mining Applications.
Syllabus Text Books Classes Reading Material Assignments Grades Links Forum Text Books עיבוד שפות טבעיות - שיעור שבע Partial Parsing אורן גליקמן.
CPS 506 Comparative Programming Languages Syntax Specification.
Artificial Intelligence: Natural Language
CSA2050 Introduction to Computational Linguistics Parsing I.
Bag-of-Words Methods for Text Mining CSCI-GA.2590 – Lecture 2A Ralph Grishman NYU.
Combining GATE and UIMA Ian Roberts. University of Sheffield NLP 2 Overview Introduction to UIMA Comparison with GATE Mapping annotations between GATE.
MedKAT Medical Knowledge Analysis Tool December 2009.
Cross Language Clone Analysis Team 2 February 3, 2011.
Supertagging CMSC Natural Language Processing January 31, 2006.
February 2007CSA3050: Tagging III and Chunking 1 CSA2050: Natural Language Processing Tagging 3 and Chunking Transformation Based Tagging Chunking.
POS Tagger and Chunker for Tamil
Shallow Parsing for South Asian Languages -Himanshu Agrawal.
Chunk Parsing. Also called chunking, light parsing, or partial parsing. Method: Assign some additional structure to input over tagging Used when full.
©2012 Paula Matuszek CSC 9010: Information Extraction Overview Dr. Paula Matuszek (610) Spring, 2012.
Basic Syntactic Structures of English CSCI-GA.2590 – Lecture 2B Ralph Grishman NYU.
©2012 Paula Matuszek GATE and ANNIE Information taken primarily from the GATE user manual, gate.ac.uk/sale/tao, and GATE training materials,
Part-of-Speech Tagging CSCI-GA.2590 – Lecture 4 Ralph Grishman NYU.
Combining GATE and UIMA Ian Roberts. 2 Overview Introduction to UIMA Comparison with GATE Mapping annotations between GATE and UIMA.
Problem Solving with NLTK MSE 2400 EaLiCaRA Dr. Tom Way.
Natural Language Processing (NLP)
Formal Language Theory
Machine Learning in Natural Language Processing
Introduction Task: extracting relational facts from text
Chunk Parsing CS1573: AI Application Development, Spring 2003
Natural Language Processing (NLP)
Artificial Intelligence 2004 Speech & Natural Language Processing
PHP –Regular Expressions
Natural Language Processing (NLP)
Presentation transcript:

Partial Parsing CSCI-GA.2590 – Lecture 5A Ralph Grishman NYU

Road Map Goal: information extraction Paths – POS tags  partial parses  semantic grammar – POS tags  full parses  semantic interp. rules 3/3/15NYU2

Road Map Goal: information extraction Paths – POS tags  partial parses  semantic grammar – POS tags  full parses  semantic interp. rules 3/3/15NYU3

Partial Parses (Chunks) Strategy: – identify as much local syntactic structure as we can simply and deterministically general (not task specific) result are termed chunks or partial parses – build rest of structure using semantic patterns task specific 3/3/15NYU4

Partial Parses (Chunks) I fed the hungry young man in the house with a spoon 3/3/15NYU5 definitely part of the NP headed by ‘man’ attachment uncertain: need semantics or context

Partial Parses (Chunks) We will build and use two kinds of chunks: noun groups – head + left modifiers of an NP verb groups – head verb + auxiliaries and modals [“eats”, “will eat”, “can eat”, …] (+ embedded adverbs) (we will use the terms ‘noun groups’ and ‘noun chunks’ interchangeably) 3/3/15NYU6

Chunk patterns Jet provides a regular expression language which can match specific words or parts of speech We will write our first chunker using these patterns 3/3/15NYU7

Chunk patterns: noun groups ng := det-pos? [constit cat=adj]* [constit cat=n] | proper-noun | [constit cat=pro]; det-pos :=[constit cat=det] | [constit cat=det]? [constit cat=n number=singular] "'s"; proper-noun :=([token case=cap] | [undefinedCap])+; 3/3/15NYU8

Chunk patterns (verb groups) vg :=[constit cat=tv] | [constit cat=w] vg-inf | tv-vbe vg-ving; vg-inf :=[constit cat=v] | "be" vg-ving; vg-ven :=[constit cat=ven] | "been" vg-ving; vg-ving :=[constit cat=ving]; tv-vbe :="is" | "are" | "was" | "were"; 3/3/15NYU9

Assembling the pipeline tokenizer  POS-tagger  chunker 3/3/15NYU10

Tipster Architecture Want a uniform data structure for passing information from one stage of the pipeline to the next In the Tipster architecture, basic structure is the document with a set of annotations, each consisting of a type a span zero or more features Each annotator (tokenizer, tagger, chunker) reads current annotations and adds one or more new types of annotations 3/3/15NYU11

Tipster Architecture Offset annotation means original document is never modified – benefit in displaying provenance of extracted information Document + annotations widely used – JET – GATE (gate.ac.uk – Univ. of Sheffield) – UIMA (uima.apache.org) but not universally: NLTK Python toolkit 3/3/15NYU12

Adding Annotations My cat is sleeping 3/3/15NYU13 typestartendfeatures

Adding Annotations: tokenizer My cat is sleeping 3/3/15NYU14 typestartendfeatures token03case=forcedCap token token1019

Adding Annotations: POS tagger My cat is sleeping 3/3/15NYU15 typestartendfeatures token03case=forcedCap token token1019 constit03cat=det constit37cat=n number=singular constit710cat=tv number=singular constit1019cat=ving

Adding Annotations: chunker My cat is sleeping 3/3/15NYU16 typestartendfeatures token03case=forcedCap token token1019 constit03cat=det constit37cat=n number=singular constit710cat=tv number=singular constit1019cat=ving ng07 vg719

Jet pattern language (1) Matching an annotation: [type feature = value …] – must be able to unify features in pattern and annotation – can have nested features: feature = [f1 = v1 f2 = v2] Matching a string “word” Optionality and repetition X ? (optionality) X * (zero or more X’s) X + (one or more X’s) 3/3/15NYU17

Jet pattern language (2) Binding a variable to a pattern element: [constit cat=n] : Head Adding an annotation when pattern add [ng]; Writing output when pattern write “head =“, Head; 3/3/15NYU18

Setting Up The Pipeline # JET properties file to run chunk patterns Jet.dataPath = data Tags.fileName = pos_hmm.txt Pattern.fileName1 = chunkPatterns.txt Pattern.trace = on processSentence = tagJet, pat(chunks) 3/3/15NYU19 pipeline stages (tokenization implicit for interactive use)

Processing a Document # JET properties file # apply chunkPatterns to article.txt Jet.dataPath = data Tags.fileName = pos_hmm.txt Pattern.fileName1 = chunkPatterns.txt Pattern.trace = on JetTest.fileName1 = article.txt processSentence = tokenize, tagJet, pat(chunks) WriteSGML.type = ngroup 3/3/15NYU20 split doc into sentences, then runs processSentence on each

Viewing Annotations can be activated through Jet menu: tools : process documents and view … provides color-coded display of annotations 3/3/15NYU21

Corpus-Trained Chunkers We know two ways of building a sequence classifier which can assign a tag to each token in a sequence of tokens: HMMs and TBL Can we use a sequence classifier to do chunking? Assign N and O tags: I gave a book to the new student. N O N N O N N N sequence of one or more consecutive N s = a noun group 3/3/15NYU22

A Problem How about I gave the new student a book N O N N N N N 2 noun groups or 3? 3/3/15NYU23

BIO Tags A solution: 3 tags – B: first word of a noun group – I: second or subsequent word of a noun group – O: not part of a noun group I gave the new student a book B O B I I B I To tag noun and verb groups, need 5 tags: B-N, I-N, B-V, I-V, and O 3/3/15NYU24