Regular Expressions and Xkwic, LING 5200 Computational Corpus Linguistics, Martha Palmer, February 28, 2006


LING 5200, 2006 BASED on Kevin Cohen’s LING
grep/egrep
  x+ : one or more x (egrep); plain grep needs xx* instead
  (xxx|yyy) : xxx OR yyy
  ? : the preceding character, character set, or group is optional (matches once, or nothing)
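The three operators above can be tried on a tiny word list before moving to a real corpus. This is a minimal sketch: the word list is made up for illustration, and `grep -E` is used as the current spelling of `egrep`.

```shell
# Hypothetical word list, one word per line (not taken from the course corpora).
words=$(printf '%s\n' color colour colouur cat dog bird)

# Basic grep has no +, so "one or more o" is spelled oo*.
plus_basic=$(printf '%s\n' "$words" | grep -c 'oo*')

# Extended syntax (grep -E, i.e. egrep) accepts o+ directly.
plus_ext=$(printf '%s\n' "$words" | grep -E -c 'o+')

# (cat|dog): the whole line must be one of the two alternatives.
alt=$(printf '%s\n' "$words" | grep -E -c '^(cat|dog)$')

# u? makes the u optional: zero or one occurrence, so "colouur" is excluded.
opt=$(printf '%s\n' "$words" | grep -E -c '^colou?r$')

echo "oo*: $plus_basic  o+: $plus_ext  (cat|dog): $alt  colou?r: $opt"
```

On this list the two spellings of "one or more o" agree (4 lines each), and the alternation and the optional-u pattern each match 2 lines.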

More grepping/egrepping
File: /corpora/celex/english/epw/epw.cd
Find all capitalized words (the . matches the \ field separator):
  grep ^'[0-9][0-9]*.[A-Z]' epw.cd | wc -l
OR
  egrep ^'[0-9]+.[A-Z]' epw.cd | wc -l

Homework 3
Please give me command AND results!
1. In the file /corpora/celex/english/epw/epw.cd, find all words that contain only upper-case letters, e.g. USSR and VTOL.
ANS: 158
  grep '^[0-9][0-9]*\\[A-Z][A-Z]*\\' epw.cd | wc -l
  egrep '^[0-9]+\\[A-Z]+\\' epw.cd | wc -l
  egrep ^'[0-9]+[\][A-Z]+\\' epw.cd | wc -l
  egrep ^'[0-9]+.[A-Z]+\\' epw.cd | wc -l
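To see why the closing \\ matters in question 1, the grep and egrep versions can be run on a handful of made-up entries. This is a sketch only: the lines below are hypothetical and keep just an ID\Word\number layout, and `LC_ALL=C` is an added precaution so that [A-Z] means ASCII upper case.

```shell
# Make [A-Z] mean ASCII upper case only, regardless of the user's locale.
export LC_ALL=C

# Hypothetical entries in an ID\Word\... layout; the real epw.cd has more fields.
sample=$(mktemp)
cat > "$sample" <<'EOF'
1\USSR\10
2\VTOL\3
3\Union Jack\5
4\abbey\7
5\AA\52
EOF

# Basic grep: [A-Z][A-Z]* stands in for "one or more upper-case letters".
upper_basic=$(grep -c '^[0-9][0-9]*\\[A-Z][A-Z]*\\' "$sample")

# Extended grep: + says the same thing more compactly.
upper_ext=$(grep -E -c '^[0-9]+\\[A-Z]+\\' "$sample")

rm -f "$sample"
echo "basic: $upper_basic  extended: $upper_ext"
```

Both commands count USSR, VTOL and AA. "Union Jack" is rejected because the upper-case run must be followed directly by the next \ separator.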

Homework 3
2. How many entries have a syllable that ends with a 4-consonant cluster?
ANS: 45
  egrep 'CCCC]' epw.cd                           56   (why not \] ?)
  grep 'CCCC]' epw.cd                            56
  grep 'CCCC]' epw.cd | grep -v 'ed[ \\]'        36
  egrep 'CCCC]\\' epw.cd                         45
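On the "why not \]" question: outside a bracket expression, an unmatched ] is an ordinary character, so it needs no backslash. The sketch below also shows how appending \\ narrows the match to a cluster that closes the whole CV field. The entries are hypothetical and reduced to an ID\Word\CV-pattern\x layout, so the counts say nothing about epw.cd itself.

```shell
# Hypothetical entries: ID\Word\CV-pattern\placeholder (not real CELEX lines).
sample=$(mktemp)
cat > "$sample" <<'EOF'
1\texts\[CVCCCC]\x
2\sculpts\[CCVCCCC]\x
3\madeup\[CVCCCC][CVC]\x
4\abbey\[V][CV]\x
EOF

# A bare ] is literal here: any syllable ending in four consonants.
any_syllable=$(grep -c 'CCCC]' "$sample")

# Adding \\ requires the field separator right after the ],
# so only a cluster in the last syllable of the field counts.
last_syllable=$(grep -E -c 'CCCC]\\' "$sample")

rm -f "$sample"
echo "CCCC]: $any_syllable  CCCC]\\\\: $last_syllable"
```

Here the first pattern matches three lines and the second only two, because the made-up third entry has its cluster in a non-final syllable.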

Homework 3
3. Find all multi-word terms in which only the first letter is capitalized, e.g. Colorado potato beetle.
ANS: 238/243
  egrep ^'[0-9]+.[A-Z][a-z]+( [a-z]+)+\\' epw.cd | wc -l
  egrep ^'[0-9]+\\[A-Z][a-z]*( [a-z]+)+\\' epw.cd | wc -l

Homework 3
4. Find all multi-word terms in which the first letter (and only the first letter) of each word is capitalized, e.g. Union Jacks and Royal Automobile Club. Note: your regex should be able to accommodate an arbitrary number of words.
ANS: 296/298
  egrep ^'[0-9]+.[A-Z][a-z]+( [A-Z][a-z]*)+\\' epw.cd
  egrep ^'[0-9]+.[A-Z][a-z]*( [A-Z][a-z]*)+\\' epw.cd
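The grouped repetition ( ...)+ is what lets the patterns in questions 3 and 4 handle any number of words. A small sketch on made-up entries (hypothetical ID\Word\number lines, with `LC_ALL=C` added so that the letter ranges are ASCII-only) shows the two patterns picking out different terms:

```shell
# Keep [A-Z] and [a-z] strictly ASCII, whatever the user's locale is.
export LC_ALL=C

# Hypothetical entries in an ID\Word\... layout.
sample=$(mktemp)
cat > "$sample" <<'EOF'
1\Union Jacks\7
2\Royal Automobile Club\2
3\Colorado potato beetle\1
4\USSR\10
5\London\40
EOF

# Question 4: every word capitalized; ( [A-Z][a-z]*)+ repeats once per extra word.
each_cap=$(grep -E -c '^[0-9]+\\[A-Z][a-z]*( [A-Z][a-z]*)+\\' "$sample")

# Question 3: only the first word capitalized; later words are all lower case.
first_cap=$(grep -E -c '^[0-9]+\\[A-Z][a-z]*( [a-z]+)+\\' "$sample")

rm -f "$sample"
echo "each word capitalized: $each_cap  only first capitalized: $first_cap"
```

Single-word entries such as London fail both patterns because the group must match at least once.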

Homework 3
5. Find all disyllabic words that contain only vowels.
ANS: 4
  egrep '\\\[V+\]\[V+\]\\' epw.cd
5\AA\52\5\1\P\"1-'1\[VV][VV]\[eI][eI]
6\AA\95\6\1\P\"1-'1\[VV][VV]\[eI][eI]
\i.e.\424\22210\1\P\"2-'i\[VV][VV]\[aI][i:]
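The pattern for question 5 works because the leading and trailing \\ pin exactly two bracketed syllables between field separators. As a rough check, the sketch below runs it on the first two output lines shown above plus two made-up entries (labelled made-up, with placeholder fields) that should be rejected:

```shell
# Two lines copied from the slide output, plus two hypothetical non-matching entries.
sample=$(mktemp)
cat > "$sample" <<'EOF'
5\AA\52\5\1\P\"1-'1\[VV][VV]\[eI][eI]
6\AA\95\6\1\P\"1-'1\[VV][VV]\[eI][eI]
7\made-up\1\7\1\P\x\[V][CV]\[x][xx]
8\made-up\1\8\1\P\x\[V][V][V]\[x][x][x]
EOF

# \\ then exactly two vowel-only syllables, then \\ again.
vowel_disyllables=$(grep -E -c '\\\[V+\]\[V+\]\\' "$sample")

rm -f "$sample"
echo "vowel-only disyllables: $vowel_disyllables"
```

The entry with a consonant in its second syllable and the three-syllable entry both fail, so only the two slide lines are counted.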

Homework 3
6. Multiword expressions (Find a similar phrase in the wsj/raw corpus, and search for all variants of it in the entire corpus.)
  egrep -i '.tip of the *[a-z] iceberg'
  egrep '[Tt]he tip of (a|the).* iceberg'
  patriarchical / a more alarming
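The second pattern above can be tried on a few sentences to see which variants of the phrase it accepts. This is a sketch under obvious assumptions: the sentences are invented for illustration and do not come from the wsj/raw corpus.

```shell
# Hypothetical sentences, one per line (not from the wsj/raw corpus).
sample=$(mktemp)
cat > "$sample" <<'EOF'
That is only the tip of the iceberg.
The tip of a more alarming iceberg emerged.
It was the tip of the patriarchical iceberg.
An iceberg drifted past the ship.
She gave a tip of the hat.
EOF

# (a|the) picks the determiner; .* allows any modifiers before " iceberg".
variants=$(grep -E -c '[Tt]he tip of (a|the).* iceberg' "$sample")

rm -f "$sample"
echo "variants found: $variants"
```

The first three sentences match. Because .* can match nothing, the plain phrase "the tip of the iceberg" is found along with the modified variants.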

Homework 3
6. Other multiword expressions
  war on (inflation/drugs/the dictator)
  fight the war on the expenditure side rather
  rule of (the day/journalism/Ferdinand Marcos)
  cream of the (British) crop

Searching the treebank
  cat ??/* | egrep -i '(push|pull)[a-z]*'
OR xkwic?

XWin 32
See
Load on laptops; bring laptops to class if any issues
Go to Feb 9 Emacs & Xkwic lecture