Using Treebanks tgrep2 Lecture 2: 07/12/2011

Using Corpora
- For discovery
- For evaluation of theories
- For identifying tendencies
  – distribution of a class of words
  – distribution of structural configurations
  – frequency of a certain distribution

Why Treebanks
Raw corpora are not enough for most linguistic purposes. Let’s start with the rawest of them all: the web, which I’ll call ‘Google’.
- Convenient
- Potentially inexhaustible
- Varied and free-form

Problems with ‘Google’
- Quality control
  – hard to establish the identity of the author, making it difficult to keep track of variation
  – the text could be computer generated
- What does Google count?
  – Google counts are notoriously unreliable and change from minute to minute
  – problem of repeated elements
  – no clear estimate of sample size, so it is difficult to go beyond order-of-magnitude estimates of frequency

Sentences
Sentences are an important unit of linguistic organization. They are not an important unit of organization for most search engines. Consequently, it is not straightforward to restrict a search to a single sentence, a task that is crucial for linguistic purposes.

Selecting Texts
Using the web or a search engine directly is inadequate for any but the most basic linguistic purposes. The next step forward is to judiciously assemble a set of texts (possibly from the web) and use an appropriate search language.
– Systems based on Regular Expressions are fast, easily available, and easy to use.

The need for annotation
Even after the creation of a corpus (a set of texts), there are still many basic linguistic investigations that cannot be conducted:
- generalizing searches
  – e.g. when we want to examine all sequences of Det and N
- identifying a subset of cases when there is ambiguity in part of speech
  – e.g. ‘to’

Step 1: POS tags
Part-of-speech (POS) tagging is the most basic kind of annotation.
– POS tagging makes corpora much more linguistically useful.
– POS tagging can often be done automatically with high reliability, allowing us to use large texts for linguistic purposes.

Step 2: Beyond POS tags
(1) Ann likes Bill and Tim likes Nina.
(2) Ann likes Bill and Tim, who are her mentors.
Assume you want to search for coordinated noun phrases. You want to get (2) but not (1). But a search for the POS sequence ‘Noun Conj Noun’ will catch both. We need structural information.
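The point can be checked with a flat search over POS tags. This is a minimal sketch: the hand tagging and the tag names (Noun, Verb, Conj, and so on) are illustrative assumptions, not the output of a real tagger.

```python
import re

# Hand-tagged versions of sentences (1) and (2); tags are illustrative only.
s1 = [("Ann", "Noun"), ("likes", "Verb"), ("Bill", "Noun"), ("and", "Conj"),
      ("Tim", "Noun"), ("likes", "Verb"), ("Nina", "Noun")]
s2 = [("Ann", "Noun"), ("likes", "Verb"), ("Bill", "Noun"), ("and", "Conj"),
      ("Tim", "Noun"), (",", "Punct"), ("who", "Pron"), ("are", "Verb"),
      ("her", "Det"), ("mentors", "Noun")]

def has_tag_sequence(tagged, pattern):
    """True if the space-separated tag string contains the given tag sequence."""
    tags = " ".join(tag for _, tag in tagged)
    return re.search(r"\b" + pattern + r"\b", tags) is not None

# The flat pattern cannot tell NP coordination (2) from clause coordination (1).
hits = [has_tag_sequence(s, "Noun Conj Noun") for s in (s1, s2)]
print(hits)
```

Both sentences match, because "Bill and Tim" is a noun, a conjunction, and a noun in each of them; only the bracketing differs.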

Step 3: Structural Information
We need structural information for a corpus to be fully linguistically useful. We also need structural information if we want to train parsers on the corpus. The nature of this structural information is quite underdetermined.

Structural and Other Information
One could include structural information, leading to a set of syntactic trees of the familiar sort: hence the term ‘treebank’. But the information can be quite different, and there is no commitment that the formal objects involved are ‘trees’.
Other alternatives: theta-roles, semantic argument information (PropBank), etc.

Searching a Treebank
For linguistic purposes, we need a way to extract linguistically interesting patterns. What counts as a ‘linguistically interesting pattern’ will vary greatly depending upon your theoretical interests and the nature of the treebank. Here we will assume that the formal objects are trees and discuss a general and powerful way to search for trees with certain properties.

Verbs Accounts
Class material is in: /data/home/verbs/shared/LSA7800_076
To reduce typing, set up a link:
ln -s /data/home/verbs/shared/LSA7800_076 lsa

The Unix you’ll need
ls -l                lists files in current directory
cd DIRECTORYNAME     go to designated directory
cat FILENAME         read file and display on screen, or
> FILENAME           direct output into designated file
more FILENAME        scroll designated file
wc                   count number of lines/words/characters in input, provided by
|                    pipe output of one program into another
cat FILENAME | wc

Regular Expressions and grep
Regular Expressions are a powerful and fast way to express search patterns.
grep – program for using Regular Expression search patterns
– Syntax: grep RegExPattern FILENAME
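What grep does can be sketched in a few lines of Python. This is only an approximation for illustration: the function name `grep` and the sample lines are made up here, and Python's `re` syntax is close to, but not identical with, what `grep -E` accepts.

```python
import re

def grep(pattern, lines):
    """Return the lines that contain a match for pattern anywhere in the line."""
    rx = re.compile(pattern)
    return [line for line in lines if rx.search(line)]

# Sample input standing in for the contents of FILENAME (invented for this sketch).
lines = ["Ann likes Bill and Tim.", "Tim liked Nina.", "Nina sleeps."]

matches = grep(r"like(s|d)", lines)
for line in matches:
    print(line)
```

Like grep, the sketch reports whole lines, not just the matched substring.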

Very Basic RegEx
Words are RegExs that match themselves as well as superstrings of themselves.
If A and B are RegEx, then
1. AB
2. A|B
3. A* (also A? and A+)
are also RegEx.

Regular expression summary
.          Matches any character (aka wildcard)
^abc       Matches some pattern abc at the start of a string
abc$       Matches some pattern abc at the end of a string
[abc]      Matches one of a set of characters
[A-Z0-9]   Matches one of a range of characters
ed|ing|s   Matches one of the specified strings (aka disjunction)
*          Zero or more of previous item (aka closure)
+          One or more of previous item (aka closure)
?          Zero or one of the previous item (aka optionality)
{n}        Exactly n repeats
{n,}       At least n repeats
{,n}       At most n repeats
{m,n}      At least m and at most n repeats
a(b|c)+    Parentheses indicate the scope of the operators
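Several of these operators can be tried out with Python's `re` module, whose syntax covers everything in the summary. The word list is invented for this sketch.

```python
import re

words = ["walked", "walking", "walks", "walk", "abc", "aab", "Zed9"]

def matching(pattern):
    """Words in which the pattern matches somewhere (re.search, like grep)."""
    return [w for w in words if re.search(pattern, w)]

suffixed = matching(r"(ed|ing|s)$")       # disjunction anchored at the end
optional_s = matching(r"^walks?$")        # ? makes the final s optional
starts_a_b = matching(r"^a+b")            # one or more a, then b, at the start
wildcard = matching(r"^a.b$")             # . stands for any single character
cap_or_digit = matching(r"[A-Z0-9]")      # character ranges

# Parentheses fix the scope of +: an a followed by one or more b or c.
scope_ok = re.fullmatch(r"a(b|c)+", "abcbc") is not None
scope_bad = re.fullmatch(r"a(b|c)+", "a") is not None

# {m,n}: at least m and at most n repeats.
repeat_ok = re.fullmatch(r"l{2,3}", "ll") is not None
repeat_bad = re.fullmatch(r"l{2,3}", "llll") is not None
```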

TGrep2
A general way of writing Regular Expressions over trees
- tgrep2, grep for trees, written by Douglas Rohde, builds upon tgrep
- To use TGrep2, a special TGrep2 corpus file needs to be created
  – wsj.t2c in lsa directory, has 49,209 sentences.

TGrep2 Syntax 1
Basic Pattern Syntax
– Regular expression syntax can be used to select words or node labels:
– Ex: /^NP/ matches any node label that begins with NP, such as NP-SBJ
Command syntax: lsa/tgrep -c lsa/wsj.t2c PATTERN

At the beginning
>lsa/tgrep -c lsa/wsj.t2c 'S << Vinken' | more
(S (NP-SBJ (NP (NNP Pierre) (NNP Vinken)) (, ,) (ADJP (NP (CD 61) (NNS years)) (JJ old)) (, ,)) (VP (MD will) (VP (VB join) (NP (DT the) (NN board)) (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director))) (NP-TMP (NNP Nov.) (CD 29)))) (. .))
(S (NP-SBJ (NNP Mr.) (NNP Vinken)) (VP (VBZ is) (NP-PRD (NP (NN chairman)) (PP (IN of) (NP (NP (NNP Elsevier) (NNP N.V.)) (, ,) (NP (DT the) (NNP Dutch) (VBG publishing) (NN group)))))) (. .))
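To see what such bracketed output encodes, here is a minimal sketch of a reader for this notation. It is not part of TGrep2; the representation of a node as a `(label, children)` pair and the function names are assumptions made for illustration, and the input is a shortened version of the second tree.

```python
import re

def parse(text):
    """Parse one bracketed tree into (label, children); a word has no children."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                          # skip the opening "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())   # nested constituent
            else:
                children.append((tokens[pos], []))   # a word
                pos += 1
        pos += 1                          # skip the closing ")"
        return (label, children)

    return node()

def leaves(tree):
    """Words of the tree, from left to right."""
    if not tree[1]:
        return [tree[0]]
    return [w for child in tree[1] for w in leaves(child)]

tree = parse("(S (NP-SBJ (NNP Mr.) (NNP Vinken)) "
             "(VP (VBZ is) (NP-PRD (NN chairman))) (. .))")
top_labels = [child[0] for child in tree[1]]
print(top_labels, leaves(tree))
```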

Tree Relationships
1. Immediate domination (<)
2. Domination (<<)
3. Sisterhood ($)
4. Immediate Precedence (.)
5. Precedence (..)
6. First/nth/last child
7. First/Last descendant

Immediate Domination
A < B - A is the parent of B, retrieves subtree rooted at A.
A > B - A is the daughter of B, retrieves subtree rooted at A.
NP < PP - Matches any NP that immediately dominates a PP.

Domination
A << B - A dominates B, retrieves subtree rooted at A.
A >> B - A is dominated by B, retrieves subtree rooted at A.
NP << PP - Matches any NP that dominates a PP.
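The difference between < and << can be sketched in Python over a hand-built tree. This is an illustration of the two relations, not TGrep2's implementation; the `(label, children)` representation, the function names, and the simplified tree are assumptions of the sketch.

```python
# A node is (label, children); a word is a node with no children.
tree = ("S", [
    ("NP", [("NNP", [("Vinken", [])])]),
    ("VP", [("VBZ", [("is", [])]),
            ("NP", [("NP", [("NN", [("chairman", [])])]),
                    ("PP", [("IN", [("of", [])]),
                            ("NP", [("NNP", [("Elsevier", [])])])])])]),
])

def subtrees(node):
    """Yield the node itself and every node below it."""
    yield node
    for child in node[1]:
        yield from subtrees(child)

def imm_dominates(a, b_label):
    """A < B: some child of A is labelled B."""
    return any(child[0] == b_label for child in a[1])

def dominates(a, b_label):
    """A << B: some node properly below A is labelled B."""
    return any(d[0] == b_label for child in a[1] for d in subtrees(child))

def search(tree, a_label, relation, b_label):
    """All nodes labelled A that stand in the relation to some B."""
    return [n for n in subtrees(tree) if n[0] == a_label and relation(n, b_label)]

vp_imm = search(tree, "VP", imm_dominates, "PP")   # VP < PP
vp_dom = search(tree, "VP", dominates, "PP")       # VP << PP
s_word = search(tree, "S", dominates, "Vinken")    # S << Vinken
print(len(vp_imm), len(vp_dom), len(s_word))
```

The VP here dominates a PP but does not immediately dominate one, so only the << query finds it.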

Combining Search Patterns 1
Default interpretation is &
VP < NP < PP
- a VP that immediately dominates an NP and a PP
- Ex: (You can [see comets with a telescope])
VP < (NP < PP)
- a VP that immediately dominates [an NP that immediately dominates a PP]
- Ex: (I [praised [the students [with good grades]]])
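The effect of the parentheses can be sketched in Python on the two example VPs. The bracketings below are hand-built assumptions for illustration, and the two functions test only the root VP rather than searching a whole corpus.

```python
# A node is (label, children); a word is a node with no children.
see = ("VP", [("VB", [("see", [])]),
              ("NP", [("NNS", [("comets", [])])]),
              ("PP", [("IN", [("with", [])]),
                      ("NP", [("DT", [("a", [])]), ("NN", [("telescope", [])])])])])

praised = ("VP", [("VBD", [("praised", [])]),
                  ("NP", [("NP", [("DT", [("the", [])]), ("NNS", [("students", [])])]),
                          ("PP", [("IN", [("with", [])]),
                                  ("NP", [("JJ", [("good", [])]),
                                          ("NNS", [("grades", [])])])])])])

def has_child(node, label, condition=lambda child: True):
    """True if node has a child with this label that also satisfies condition."""
    return any(c[0] == label and condition(c) for c in node[1])

def flat(vp):
    """VP < NP < PP: both relations are stated about the VP."""
    return vp[0] == "VP" and has_child(vp, "NP") and has_child(vp, "PP")

def nested(vp):
    """VP < (NP < PP): the PP relation is stated about the NP."""
    return vp[0] == "VP" and has_child(vp, "NP", lambda np: has_child(np, "PP"))

results = {"see": (flat(see), nested(see)),
           "praised": (flat(praised), nested(praised))}
print(results)
```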

Combining Search Patterns 2
For disjunction (OR), use |
NP < PP | << AP
- an NP that immediately dominates a PP OR dominates an AP

Sisterhood and Precedence 1
A $ B - A is a sister of B
A . B - A immediately precedes B
A .. B - A precedes B
A , B - A immediately follows B
A ,, B - A follows B

Sisterhood and Precedence 2
Combining the two:
A $. B - A is a sister of B and immediately precedes B
A $.. B - A is a sister of B and precedes B
A $, B - A is a sister of B and immediately follows B
A $,, B - A is a sister of B and follows B
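These relations can be sketched in Python by recording, for every node, its parent and the span of words it covers. The record layout, the function names, and the small tree are assumptions of this sketch; precedence is taken to mean that the last word of A comes before the first word of B.

```python
# A node is (label, children); a word is a node with no children.
tree = ("S", [
    ("NP", [("NNP", [("Ann", [])])]),
    ("VP", [("VBZ", [("likes", [])]),
            ("NP", [("NNP", [("Bill", [])]),
                    ("CC", [("and", [])]),
                    ("NNP", [("Tim", [])])])]),
])

def annotate(node, parent=None, start=0, out=None):
    """Flatten the tree into records with a parent link and a word span."""
    out = [] if out is None else out
    rec = {"label": node[0], "parent": parent, "first": start}
    out.append(rec)
    pos = start
    for child in node[1]:
        pos = annotate(child, rec, pos, out)[1]
    if not node[1]:              # a word covers exactly one position
        pos = start + 1
    rec["last"] = pos - 1
    return out, pos

def sisters(a, b):               # A $ B
    return a is not b and a["parent"] is not None and a["parent"] is b["parent"]

def imm_precedes(a, b):          # A . B
    return a["last"] + 1 == b["first"]

def precedes(a, b):              # A .. B
    return a["last"] < b["first"]

def pairs(nodes, a_label, b_label, *relations):
    """All (A, B) pairs with the given labels for which every relation holds."""
    return [(a, b) for a in nodes if a["label"] == a_label
            for b in nodes if b["label"] == b_label
            and all(rel(a, b) for rel in relations)]

nodes, _ = annotate(tree)
nnp_before_cc = pairs(nodes, "NNP", "CC", sisters, imm_precedes)    # NNP $. CC
nnp_before_nnp = pairs(nodes, "NNP", "NNP", precedes)               # NNP .. NNP
nnp_sister_nnp = pairs(nodes, "NNP", "NNP", sisters, precedes)      # NNP $.. NNP
print(len(nnp_before_cc), len(nnp_before_nnp), len(nnp_sister_nnp))
```

Adding the sisterhood requirement narrows three preceding NNP pairs down to the single pair inside the coordinated NP.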

Which child?
We can pick out a particular child, counting from left to right (start at 1):
A <N B - B is the Nth child of A
A >N B - A is the Nth child of B
A <1 B (also used: A <, B) - B is the first child of A
A >1 B (also used: A >, B) - A is the first child of B

Which child?
We can pick out a particular child, counting from right to left (start at -1):
A <-N B - B is the Nth child of A from the right
A >-N B - A is the Nth child of B from the right
A <-1 B (also used: A <` B) - B is the last child of A
A >-1 B (also used: A >` B) - A is the last child of B
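Child positions can be sketched in Python as list indexing, with positive numbers counting from the left and negative numbers from the right. The function name `nth_child` and the example VP are assumptions made for illustration.

```python
# A node is (label, children); a word is a node with no children.
vp = ("VP", [("VB", [("see", [])]),
             ("NP", [("NNS", [("comets", [])])]),
             ("PP", [("IN", [("with", [])]),
                     ("NP", [("DT", [("a", [])]), ("NN", [("telescope", [])])])])])

def nth_child(parent, n):
    """Child picked out by <N (n > 0, from the left) or <-N (n < 0, from the right).

    Returns None when n is 0 or the parent has too few children.
    """
    kids = parent[1]
    i = n - 1 if n > 0 else n        # 1-based from the left, -1-based from the right
    if n == 0 or not -len(kids) <= i < len(kids):
        return None
    return kids[i]

first = nth_child(vp, 1)     # VP <1 B
second = nth_child(vp, 2)    # VP <2 B
last = nth_child(vp, -1)     # VP <-1 B
missing = nth_child(vp, 4)   # the VP has only three children
print(first[0], second[0], last[0], missing)
```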

Which descendant?
A <<, B - B is *a* left-most descendant of an A
A >>, B - A is *a* left-most descendant of a B
A <<` B - B is *a* right-most descendant of an A
A >>` B - A is *a* right-most descendant of a B
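The reason for saying *a* left-most descendant is that every node on the left edge below A qualifies, not only the first word. A minimal Python sketch, with function names and tree assumed for illustration:

```python
# A node is (label, children); a word is a node with no children.
tree = ("S", [
    ("NP", [("NNP", [("Mr.", [])]), ("NNP", [("Vinken", [])])]),
    ("VP", [("VBZ", [("is", [])]),
            ("NP", [("NN", [("chairman", [])])])]),
])

def leftmost_descendants(node):
    """Every node on the left edge below node (the B's of A <<, B)."""
    found = []
    while node[1]:
        node = node[1][0]        # step to the first child
        found.append(node)
    return found

def rightmost_descendants(node):
    """Every node on the right edge below node (the B's of A <<` B)."""
    found = []
    while node[1]:
        node = node[1][-1]       # step to the last child
        found.append(node)
    return found

left = [n[0] for n in leftmost_descendants(tree)]
right = [n[0] for n in rightmost_descendants(tree)]
print(left, right)
```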

Uniqueness
A <: B - B is the only child of A
A >: B - A is the only child of B
A <<: B - there is a single path of descent from A and B is on it
A >>: B - there is a single path of descent from B and A is on it
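A sketch of the two uniqueness relations in Python. It reads "a single path of descent" as an unbroken chain of nodes that each have exactly one child; the function names and the example VP are assumptions.

```python
# A node is (label, children); a word is a node with no children.
vp = ("VP", [("VBZ", [("is", [])]),
             ("NP", [("NN", [("chairman", [])])])])
np = vp[1][1]

def only_child(a):
    """The B of A <: B, or None if A does not have exactly one child."""
    return a[1][0] if len(a[1]) == 1 else None

def single_path(a):
    """All B with A <<: B: descend for as long as each node has one child."""
    path = []
    while len(a[1]) == 1:
        a = a[1][0]
        path.append(a)
    return path

np_only = only_child(np)
np_path = [n[0] for n in single_path(np)]
vp_path = single_path(vp)
print(np_only[0], np_path, vp_path)
```

The VP branches at once, so nothing stands in the <<: relation to it, while the NP reaches both NN and the word through unary steps.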

Combining Links
NP << (PP . VP)
NP <` (PP <, (IN < on))
S < (A < B) < C
S < ((A < B) < C)
S < (A < B < C)

! Negating Links !
! before any link relationship negates it
A !.. B - A does not precede B
A [< B | . C] [< D | . E]
A [< B | ![. C !, F]] | ![< D !.. E]
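Negation and bracketed alternatives correspond to ordinary boolean logic over link tests. The sketch below illustrates this for three simple queries in Python; the helper names and the tree are assumptions, and it does not attempt the more deeply nested patterns above.

```python
# A node is (label, children); a word is a node with no children.
tree = ("S", [
    ("NP", [("NNP", [("Vinken", [])])]),
    ("VP", [("VBZ", [("is", [])]),
            ("NP", [("NP", [("NN", [("chairman", [])])]),
                    ("PP", [("IN", [("of", [])]),
                            ("NP", [("NNP", [("Elsevier", [])])])])])]),
])

def subtrees(node):
    """Yield the node itself and every node below it."""
    yield node
    for child in node[1]:
        yield from subtrees(child)

def has_child(node, label):
    """The link test A < B for a fixed label B."""
    return any(child[0] == label for child in node[1])

nps = [n for n in subtrees(tree) if n[0] == "NP"]

with_pp = [n for n in nps if has_child(n, "PP")]                       # NP < PP
without_pp = [n for n in nps if not has_child(n, "PP")]                # NP !< PP
pp_or_nn = [n for n in nps if has_child(n, "PP") or has_child(n, "NN")]  # NP [< PP | < NN]
print(len(nps), len(with_pp), len(without_pp), len(pp_or_nn))
```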

= Naming Node labels =
Any node label can be given a name using =
S=foo > =foo))
NP < (PP=pp < (IN < on)) | < (NP < =pp)

: Segmented Patterns :
Sometimes it is useful to break patterns up into segments
S < (NP=n1 .. (VP=v < PP)) < (NP=n2 !.. VP)
S < NP=n1 < NP=n2 : =n1 .. VP=v : =v < PP : =n2 !.. VP
S << (VP=v < NP) : =v < /^PP/
S << (VP=v < NP < /^PP/)

@ Macros
Patterns that are likely to be reused can be defined as macros using @
NP NN | #CNP – core NP
Macros make subsequent modification easy!

Heavy NP Shift
>tgrep -c wsj.t2c 'VP <-1 (NP <: *)' | more
>tgrep -c wsj.t2c 'VP <-1 (NP <: *) < PP'
>tgrep -c wsj.t2c 'VP <2 (PP $. NP=foo) <-1 (=foo <: *)'
>tgrep -c wsj.t2c 'VP <-1 (NP <: *) < PP'
>tgrep -c wsj.t2c 'VP <-1 (NP <: PRP)'
>tgrep -c wsj.t2c 'VP <1 /^V/ <2 PP <-1 (NP <: PRP)'

Adjective Ordering
>tgrep -c wsj.t2c '/^NP/ <2 /^JJ/ <3 /^JJ/'
>tgrep -c wsj.t2c '/^NP/ <2 /^JJ/ <3 /^JJ/ <4 /^JJ/ <5 /^JJ/'
(NP (DT the) (JJ first) (JJ negative) (JJ compound) (JJ annual) (NN growth) (NN rate))
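What these two patterns test can be sketched in Python on the example NP: a regular expression on the node label plus positional tests on the children. The helper names are assumptions, and the sketch checks a single hand-built tree rather than querying wsj.t2c.

```python
import re

# The example NP; a node is (label, children), a word has no children.
np = ("NP", [("DT", [("the", [])]), ("JJ", [("first", [])]),
             ("JJ", [("negative", [])]), ("JJ", [("compound", [])]),
             ("JJ", [("annual", [])]), ("NN", [("growth", [])]),
             ("NN", [("rate", [])])])

def child_matches(node, n, regex):
    """True if the nth child (counting from 1) exists and its label matches regex."""
    kids = node[1]
    return 1 <= n <= len(kids) and re.search(regex, kids[n - 1][0]) is not None

def adjective_run(node, start=2):
    """Words of the consecutive JJ children beginning at child position start."""
    words = []
    n = start
    while child_matches(node, n, r"^JJ"):
        words.append(node[1][n - 1][1][0][0])   # the word under the JJ node
        n += 1
    return words

# /^NP/ <2 /^JJ/ <3 /^JJ/
two_adjectives = (re.search(r"^NP", np[0]) is not None
                  and child_matches(np, 2, r"^JJ")
                  and child_matches(np, 3, r"^JJ"))
run = adjective_run(np)
print(two_adjectives, run)
```

Collecting the adjective words in order, as `adjective_run` does, is the kind of output one would then tabulate to study ordering preferences.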