March 2006 CLINT-CS 1 Introduction to Computational Linguistics Chunk Parsing.

Slides:



Advertisements
Similar presentations
School of something FACULTY OF OTHER School of Computing FACULTY OF ENGINEERING Chunking: Shallow Parsing Eric Atwell, Language Research Group.
Advertisements

C O N T E X T - F R E E LANGUAGES ( use a grammar to describe a language) 1.
Sequence Classification: Chunking Shallow Processing Techniques for NLP Ling570 November 28, 2011.
Grammars, constituency and order A grammar describes the legal strings of a language in terms of constituency and order. For example, a grammar for a fragment.
Chunk/Shallow Parsing Miriam Butt October PP-Attachment Recall the PP-Attachment Problem (demonstrated with XLE ): The ambiguity increases exponentially.
Grammars.
Chunk Parsing CS1573: AI Application Development, Spring 2003 (modified from Steven Bird’s notes)
Shallow Parsing CS 4705 Julia Hirschberg 1. Shallow or Partial Parsing Sometimes we don’t need a complete parse tree –Information extraction –Question.
INTERPRETER Main Topics What is an Interpreter. Why should we learn about them.
CSC 3130: Automata theory and formal languages Andrej Bogdanov The Chinese University of Hong Kong Context-free.
176 Formal Languages and Applications: We know that Pascal programming language is defined in terms of a CFG. All the other programming languages are context-free.
Stemming, tagging and chunking Text analysis short of parsing.
1 SIMS 290-2: Applied Natural Language Processing Marti Hearst Sept 20, 2004.
Annotation Types for UIMA Edward Loper. UIMA Unified Information Management Architecture Analytics framework –Consists of components that perform specific.
1 CSC 594 Topics in AI – Applied Natural Language Processing Fall 2009/ Shallow Parsing.
1 SIMS 290-2: Applied Natural Language Processing Marti Hearst Sept 22, 2004.
1 I256: Applied Natural Language Processing Marti Hearst Sept 25, 2006.
Context-Free Grammar CSCI-GA.2590 – Lecture 3 Ralph Grishman NYU.
SI485i : NLP Set 9 Advanced PCFGs Some slides from Chris Manning.
Chapter 4 Query Languages.... Introduction Cover different kinds of queries posed to text retrieval systems Keyword-based query languages  include simple.
Chapter 2 Syntax A language that is simple to parse for the compiler is also simple to parse for the human programmer. N. Wirth.
Introduction n Keyword-based query answering considers that the documents are flat i.e., a word in the title has the same weight as a word in the body.
Modern Information Retrieval Chap. 02: Modeling (Structured Text Models)
McEnery, T., Xiao, R. and Y.Tono Corpus-based language studies. Routledge. Unit A 2. Representativeness, balance and sampling (pp13-21)
Methods in Computational Linguistics II with reference to Matt Huenerfauth’s Language Technology material Lecture 4: Matching Things. Regular Expressions.
October 2005CSA3180: Text Processing II1 CSA3180: Natural Language Processing Text Processing 2 Shallow Parsing and Chunking Python and NLTK NLTK Exercises.
Partial Parsing CSCI-GA.2590 – Lecture 5A Ralph Grishman NYU.
Syntax The study of how words are ordered and grouped together Key concept: constituent = a sequence of words that acts as a unit he the man the short.
Ling 570 Day 17: Named Entity Recognition Chunking.
1 Chapter 3 Describing Syntax and Semantics. 3.1 Introduction Providing a concise yet understandable description of a programming language is difficult.
CSA2050: Introduction to Computational Linguistics Part of Speech (POS) Tagging II Transformation Based Tagging Brill (1995)
Lexical Analysis Hira Waseem Lecture
An Intelligent Analyzer and Understander of English Yorick Wilks 1975, ACM.
COP 4620 / 5625 Programming Language Translation / Compiler Writing Fall 2003 Lecture 3, 09/11/2003 Prof. Roy Levow.
Interpreter CS 124 Reference: Gamma et al (“Gang-of-4”), Design Patterns Some material taken from:
Copyright © Curt Hill Languages and Grammars This is not English Class. But there is a resemblance.
11 Chapter 4 Grammars and Parsing Grammar Grammars, or more precisely, context-free grammars, are the formalism for describing the structure of.
Information extraction 2 Day 37 LING Computational Linguistics Harry Howard Tulane University.
When you read a sentence, your mind breaks it into tokens—individual words and punctuation marks that convey meaning. Compilers also perform tokenization.
Syllabus Text Books Classes Reading Material Assignments Grades Links Forum Text Books עיבוד שפות טבעיות - שיעור שבע Partial Parsing אורן גליקמן.
Grammars Grammars can get quite complex, but are essential. Syntax: the form of the text that is valid Semantics: the meaning of the form – Sometimes semantics.
CSA2050 Introduction to Computational Linguistics Parsing I.
May 2008CLINT-LIN Regular Expressions1 Introduction to Computational Linguistics Regular Expressions (Tutorial derived from NLTK)
February 2007CSA3050: Tagging III and Chunking 1 CSA2050: Natural Language Processing Tagging 3 and Chunking Transformation Based Tagging Chunking.
1Computer Sciences Department. Book: INTRODUCTION TO THE THEORY OF COMPUTATION, SECOND EDITION, by: MICHAEL SIPSER Reference 3Computer Sciences Department.
The Interpreter Pattern (Behavioral) ©SoftMoore ConsultingSlide 1.
October 2005CSA3180: Text Processing II1 CSA3180: Natural Language Processing Text Processing 2 Python and NLTK Shallow Parsing and Chunking NLTK Lite.
1 Introduction to Computational Linguistics Eleni Miltsakaki AUTH Fall 2005-Lecture 3.
Parsing and Code Generation Set 24. Parser Construction Most of the work involved in constructing a parser is carried out automatically by a program,
Chunk Parsing. Also called chunking, light parsing, or partial parsing. Method: Assign some additional structure to input over tagging Used when full.
GRAMMARS & PARSING. Parser Construction Most of the work involved in constructing a parser is carried out automatically by a program, referred to as a.
CSA2050: Introduction to Computational Linguistics Part of Speech (POS) Tagging II Transformation Based Tagging Brill (1995)
CS 404Ahmed Ezzat 1 CS 404 Introduction to Compiler Design Lecture 1 Ahmed Ezzat.
CSE 311 Foundations of Computing I Lecture 19 Recursive Definitions: Context-Free Grammars and Languages Autumn 2012 CSE
May 2006CLINT-LIN Regular Expressions1 Introduction to Computational Linguistics Regular Expressions (Tutorial derived from NLTK)
CS 2130 Lecture 18 Bottom-Up Parsing or Shift-Reduce Parsing Warning: The precedence table given for the Wff grammar is in error.
Chapter 3 – Describing Syntax
Parsing 2 of 4: Scanner and Parsing
Programming Languages Translator
Automata and Languages What do these have in common?
Syntax Analysis Chapter 4.
Formal Language Theory
Lecture 7: Introduction to Parsing (Syntax Analysis)
CSE 311: Foundations of Computing
Compilers Principles, Techniques, & Tools Taught by Jing Zhang
Chunk Parsing CS1573: AI Application Development, Spring 2003
Compilers Principles, Techniques, & Tools Taught by Jing Zhang
High-Level Programming Language
CSCE 590 Web Scraping – NLTK IE
COMPILER CONSTRUCTION
Presentation transcript:

March 2006 CLINT-CS 1 Introduction to Computational Linguistics Chunk Parsing

2March 2006CLINT-CS Sources This presentation is largely based on the NLTK Chunking Tutorial by Steven Bird, Ewan Klein and Edward Loper version (2006)

3March 2006CLINT-CS Two Kinds of Parsing full parsing with formal grammars (HPSG, LFG, TAG,...) to be compared with robust parsing chunk parsing in the context of tagging (also called partial, shallow, or robust parsing. Chunk parsing is an efficient and robust approach to parsing natural language It is a popular alternative to full parsing

4March 2006CLINT-CS Example Chunks [I begin] [with an intuition] : [when I read] [a sentence], [I read it] [a chunk] [at a time].

5March 2006CLINT-CS Abney (1991): Parsing by Chunks These chunks correspond in some way to prosodic patterns. … the strongest stresses in the sentence fall one to a chunk, and pauses are most likely to fall between chunks. The typical chunk consists of a single content word surrounded by a constellation of function words, matching a fixed template.

6March 2006CLINT-CS Content and Function Words [I begin] [with an intuition] : [when I read] [a sentence], [I read it] [a chunk] [at a time].

7March 2006CLINT-CS Inter and Intra Chunk Structure Within chunks A simple context-free or finite state grammar is quite adequate to describe the structure of chunks. Between chunks within a sentence By contrast, the relationships between chunks are mediated more by lexical selection than by rigid templates. The order in which chunks occur is much more flexible than the order of words within chunks.

8March 2006CLINT-CS Summary Chunks are non-overlapping regions of text Usually consisting of a head word (such as a noun) and the adjacent modifiers and function words (such as adjectives and determiners).

9March 2006CLINT-CS Chunking and Tagging Chunking Segmentation – identifying tokens Labelling – identifying the correct tag Chunk Parsing Segmentation – identifying strings of tokens Labelling- identifying the correct chunk type

10March 2006CLINT-CS Segmentation and Labelling

11March 2006CLINT-CS Chunking vs Full Parsing Both can be used to build hierarchical structures Chunking involves hierarchies of limited depth (usually 2) Complexity of full parsing typically O(n 3 ), versus O(n) for chunking Chunking leaves gaps between chunks Chunking can often give imperfect results: I turned off the spectroroute

12March 2006CLINT-CS Representing Chunks IOB Tags vs Trees Each token is tagged with one of three special chunk tags, INSIDE, OUTSIDE, or BEGIN. A token is tagged as BEGIN if it is at the beginning of a chunk, and contained within that chunk. Subsequent tokens within the chunk are tagged INSIDE. All other tokens are tagged OUTSIDE.

13March 2006CLINT-CS Example IOB Tags

14March 2006CLINT-CS Example Tree

15March 2006CLINT-CS Chunk Parsing A chunk parser finds contiguous, non- overlapping spans of related tokens and groups them together into chunks. It then combines these individual chunks together, along with the intervening tokens, to form a chunk structure. A chunk structure is a two-level tree that spans the entire text, and contains both chunks and un-chunked tokens.

16March 2006CLINT-CS Example Chunk Structure (S: (NP: 'I') 'saw' (NP: 'the' 'big' 'dog') 'on' (NP: 'the' 'hill'))

17March 2006CLINT-CS Chunking with Regular Expressions NLTK-Lite provides a regular expression chunk parser, parse.RegexpChunk to define the kinds of chunk we are interested in, and then to chunk a tagged text. The chunk parser begins with a structure in which no tokens are chunked Each regular-expression pattern (or chunk rule) is applied in turn, successively updating the chunk structure. Once all of the rules have been applied, the resulting chunk structure is returned.

18March 2006CLINT-CS Tag Strings A tag string is a string consisting of tags delimited with angle-brackets, e.g., tag strings do not contain any whitespace.

19March 2006CLINT-CS Chunking with Regular Expressions The NLTK chunk parser operates using chunk definitions that are supplied in the form of tag patterns which are regular expressions over tags, e.g. <DT><JJ>?<NN> NB: RE operators come after the tag. ? means an optional element

20March 2006CLINT-CS Tag Patterns Angle brackets group, so + matches one or more repetitions of the tag string ; and matches the tag strings or. The operator * within angle brackets is a wildcard, so that matches any single tag starting with NN. * matches zero or more repetitions of

21March 2006CLINT-CS Another Example <DT>?<JJ.*>*<NN.*> In this case the DT is optional It is followed by zero or more instances of It is followed by zero or more instances of matches any type of adjective matches any type of adjective The adjectives are followed by, i.e. any type of NN.

22March 2006CLINT-CS Creating a Chunk Parser in NLTK First create one or more rules rule1 = parse.ChunkRule(' +', 'Chunk sequences of DT and NN') Then create the parser chunkparser = parse.RegexpChunk([rule1], chunk_node='NP', top_node='S') Note that RegexpChunk has optional second and third arguments that specify the node labels for chunks and for the top-level node, respectively.

23March 2006CLINT-CS First Match Takes Precedence If a tag pattern matches at multiple overlapping locations, the rst match takes precedence. For example, if we apply a rule that matches two consecutive nouns to a text containing three consecutive nouns, then the first two nouns will be chunked

24March 2006CLINT-CS Example >>> from nltk_lite import tag >>> text = "dog/NN cat/NN mouse/NN" >>> nouns = tag.string2tags(text) >>> rule = parse.ChunkRule(' ', 'Chunk two consecutive nouns') >>> parser = parse.RegexpChunk([rule], chunk_node='NP', top_node='S') >>> parser.parse(nouns) (S: (NP: ('dog', 'NN') ('cat', 'NN')) ('mouse', 'NN'))

25March 2006CLINT-CS Two Rule Example >>> sent = tag.string2tags("the/DT little/JJ cat/NN sat/VBD on/IN the/DT mat/NN") >>> rule1 = parse.ChunkRule(' ', 'Chunk det+adj+noun') >>> rule2 = parse.ChunkRule(' +', 'Chunk sequences of NN and DT') >>> chunkparser = parse.RegexpChunk([rule1, rule2], chunk_node='NP', top_node='S') >>> chunk_tree = chunkparser.parse(sent, trace=1)

26March 2006CLINT-CS Trace Input: Chunk det+adj+noun: { } Chunk sequences of NN and DT: { }

27March 2006CLINT-CS Rule Interaction When a ChunkRule is applied to a chunking hypothesis, it will only create chunks that do not partially overlap with chunks already in the hypothesis. Thus, if we apply these two rules in reverse order, we will get a different result:

28March 2006CLINT-CS Reverse Order of Rules >>> chunkparser = parse.RegexpChunk([rule2, rule1], chunk_node='NP', top_node='S') Input: Chunk sequences of NN and DT: { } { } { } Chunk det+adj+noun: { } { } { }

29March 2006CLINT-CS Chinking Rules Sometimes it is easier to define what we don't want to include in a chunk than we do want to include. Chinking is the process of removing a sequence of tokens from a chunk. If the sequence of tokens spans an entire chunk, then the whole chunk is removed; if the sequence of tokens appears in the middle of the chunk, these tokens are removed, leaving two chunks where there was only one before. If the sequence is at the beginning or end of the chunk, these tokens are removed, and a smaller chunk remains.

30March 2006CLINT-CS Creating Chink Rules ChinkRules are created with the ChinkRule constructor, >>> chink_rule = parse.ChinkRule(' +', 'Chink sequences of VBD and IN') To show how it works, we first define chunkall_rule = parse.ChunkRule(' +', 'Chunk everything') and then put the two together

31March 2006CLINT-CS Running Chink Rules >>> chunkparser = parse.RegexpChunk([chunkall_rule, chink_rule], chunk_node='NP', top_node='S') Input: Chunk everything: { } Chink sequences of VBD and IN: { } { }

32March 2006CLINT-CS The Unchunk Rule Unchunk rules are very similar to ChinkRule except that it will only remove a chunk if the pattern matches an entire chunk. >>> unchunk_rule = parse.UnChunkRule(' +', 'Unchunk sequences of NN and DT')

33March 2006CLINT-CS Chunkparser with Unchunk Rule >>> chunk_rule = parse.ChunkRule(' +','Chunk sequences of NN, JJ, and DT') >>> chunkparser = parse.RegexpChunk([chunk_rule, unchunk_rule], chunk_node='NP', top_node='S')

34March 2006CLINT-CS Unchunk Parser Trace Input: Chunk sequences of NN, JJ, and DT: { } Unchunk sequences of NN and DT: { }

35March 2006CLINT-CS Merge Rules MergeRules are used to merge two contiguous chunks. Each MergeRule is parameterized by two tag patterns: a left pattern and a right pattern. A MergeRule will merge two contiguous chunks C1 and C2 if the end of C1 matches the left pattern, and the beginning of C2 matches the right pattern.

36March 2006CLINT-CS Split Rules SplitRules are used to split a single chunk into two smaller chunks. Each SplitRule is parameterized by two tag patterns: a left pattern and a right pattern. A SplitRule will split a chunk at any point p, where the left pattern matches the chunk to the left of p, and the right pattern matches the chunk to the right of p.

37March 2006CLINT-CS Evaluating Chunk Parsers Essentially, evaluation is about the comparing the behaviour of a chunk parser against a standard. Typically this involves the following phases Save already chunked text Save already chunked text Unchunk it Unchunk it Chunk it using chunk parser Chunk it using chunk parser Compare the result with the original chunked text Compare the result with the original chunked text

38March 2006CLINT-CS Evaluation Metrics Metrics are typically based on the following sets: guessed: The set of chunks returned by the chunk parser. correct: The correct set of chunks, as defined in the test corpus. From these we can define five useful measures

39March 2006CLINT-CS Evaluation Metrics Precision: what percentage of guessed chunks were correct? Recall: what percentage of guessed chunks were guessed F-Measure: harmonic mean of Precision and recall Missed Chunks: what correct chunks were not guessed Incorrect Chunks: what guessed chunks were not correct