LING/C SC/PSYC 438/538 Computational Linguistics Sandiway Fong Lecture 3: 8/28.


Today’s Lecture
–regexp: Recap
–Perl: Recap
–More on Perl and regexps
–Homework 1

regexp: Recap
Repetition abbreviations:
–a exactly one a
–a? a optional
–a* zero or more a’s
–a+ one or more a’s
–a{n,m} between n and m a’s
–a{n,} at least n a’s
–a{n} exactly n a’s
Metacharacters:
–{}[]()^$.|*+?\
–may be escaped by prefixing the metacharacter with a backslash (\)
Concatenation
–two regexps may be concatenated to form a new regexp
Disjunction
–infix operator: | (vertical bar)
–[set of characters] matches one of the characters
–[^set of characters] matches one character that is not in the set
–[char1-char2] dash (-) is shorthand for a range of characters (in ASCII order)
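The recap above can be tried out with a short script. This is only a sketch: the helper name matches() and the test strings are made up for illustration and are not from the lecture.

```perl
use strict;
use warnings;

# Return 1 if $string matches $regexp anywhere, else 0.
# (matches() is a hypothetical helper, not part of the lecture.)
sub matches {
    my ($regexp, $string) = @_;
    return $string =~ /$regexp/ ? 1 : 0;
}

# Each row: a pattern and a string to test it against.
my @examples = (
    [ 'colou?r',  'color'  ],   # u is optional
    [ 'colou?r',  'colour' ],
    [ '^a{2,3}$', 'aaaa'   ],   # only 2 or 3 a's allowed, so this fails
    [ 'gr[ae]y',  'grey'   ],   # one character from the set
    [ '[^0-9]',   '2007'   ],   # needs one non-digit, so this fails
    [ 'cat|dog',  'hotdog' ],   # disjunction
    [ '3\.14',    '3x14'   ],   # escaped period is a literal period, so this fails
);

for my $ex (@examples) {
    my ($re, $str) = @$ex;
    printf "%-10s %-8s %s\n", $re, $str, matches($re, $str) ? "match" : "no match";
}
```

The patterns are written in single quotes so that the backslash in 3\.14 reaches the regexp engine unchanged.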

regexp: Recap
Range Abbreviations:
–period (.) stands for any character (except newline)
–\d (digit) = [0-9]
–\s (whitespace character) = space (SP), tab (HT), carriage return (CR), newline (LF) or form feed (FF)
–\w (word character) = [0-9a-zA-Z_]
–uppercase versions, e.g. \D and \W denote negation...
Line-oriented metacharacters:
–caret (^) at the beginning of a regexp string matches the “beginning of a line”
–dollar sign ($) at the end of a regexp string matches the “end of the line”
Word-oriented metacharacters:
–a word is any sequence of digits [0-9], underscores (_) and letters [a-zA-Z]
–\b matches a word boundary
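A small sketch of these abbreviations and anchors in use. The sample sentence and the variable names are invented for illustration.

```perl
use strict;
use warnings;

# A made-up sample line (not from the course data).
my $line = "The 3 cats bathe at 10am";

# \b[tT]he\b matches "The" as a whole word, but not the "the" inside "bathe".
my @the = $line =~ /\b[tT]he\b/g;

# \d+ picks out runs of digits.
my @numbers = $line =~ /\d+/g;

# \w+ picks out runs of word characters (letters, digits, underscore).
my @words = $line =~ /\w+/g;

# ^ and $ anchor the match to the beginning and end of the line.
my $starts_with_the = $line =~ /^The/ ? 1 : 0;
my $ends_with_digit = $line =~ /\d$/  ? 1 : 0;

print "whole-word the: @the\n";
print "numbers: @numbers\n";
print "word count: ", scalar(@words), "\n";
print "starts with The: $starts_with_the, ends with digit: $ends_with_digit\n";
```

Note that 10am counts as a single word under \w+, since digits and letters are both word characters.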

Perl: Recap
Example
–Perl program (match.pl) to read in a text file line by line (using a while loop) and print those lines that successfully match the regexp \b[tT]he\b enclosed by /.../

    open (F,$ARGV[0]) or die "$ARGV[0] not found!\n";
    while (<F>) {
        print $_ if (/\b[tT]he\b/);
    }

Usage example
–input file (text.txt)
–command: perl match.pl text.txt

More Perl Reference: – /perlintro.html

More Perl
Variables:
–prefixed by $
–e.g. $count, $i
Assignment and arithmetic expressions:
–e.g.
–$count = 0;
–$i = "this";
–$count = $count + 1;
–$count++; (auto-increment)
–$i = $i . " moment"; (string concatenation)
Print:
–print $count;
–print "Count: ", $count, "\n";
Conditionals:
–if ($count == 1000) {... } else {...}
Iteration:
–$i = 10;
–while ($i>0) { $i-- }
–for ($i=0; $i <= $max; $i++) {... }
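The fragments above can be put together into one runnable script. This is a sketch only: the extra variables ($max, $sum, $j, $k, $steps) and the use of strict with my declarations are additions, not part of the slide.

```perl
use strict;
use warnings;

# Assignment and arithmetic.
my $count = 0;
my $i = "this";
$count = $count + 1;
$count++;                       # auto-increment
$i = $i . " moment";            # string concatenation with the dot operator

print "Count: ", $count, "\n";
print $i, "\n";

# Conditional.
if ($count == 1000) {
    print "reached 1000\n";
} else {
    print "not there yet\n";
}

# for loop: add up 0..$max.
my $max = 3;
my $sum = 0;
for (my $j = 0; $j <= $max; $j++) {
    $sum = $sum + $j;
}
print "Sum: $sum\n";

# while loop: count down from 10, recording the number of steps.
my $k = 10;
my $steps = 0;
while ($k > 0) {
    $k--;
    $steps++;
}
print "Steps: $steps\n";
```

Straight double quotes are required here; the curly quotes that word processors insert are not valid Perl string delimiters.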

Perl and regexps
Grouping
–uses the metacharacters ( and ) to delimit a group
–inside a regexp, each group can be referenced using backreferences \1, \2, and so on...
–outside a regexp, each group is stored in a variable $1, $2, and so on...
–hint: this may be very useful for your homework
Examples:
–doubled vowel
–([aeiou])\1
–matches heed and book
–but not head
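The doubled-vowel example can be checked with a few lines of Perl. A minimal sketch, assuming a helper named doubled_vowel() (a made-up name) that uses \1 inside the regexp and $1 outside it:

```perl
use strict;
use warnings;

# Return the doubled vowel found in $word (e.g. "ee"), or "" if there is none.
# (doubled_vowel() is a hypothetical helper for illustration.)
sub doubled_vowel {
    my ($word) = @_;
    if ($word =~ /([aeiou])\1/) {
        # Inside the regexp, \1 refers back to the group;
        # after a successful match, $1 holds what the group captured.
        return $1 . $1;
    }
    return "";
}

for my $word (qw(heed book head)) {
    my $found = doubled_vowel($word);
    print $word, ": ", ($found ne "" ? "doubled $found" : "no doubled vowel"), "\n";
}
```

In head, the vowels e and a are adjacent but different, so \1 cannot match the second one.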

Homework 1
Due next Tuesday
–submit by
–midnight deadline

Homework 1
Use file wsj2000.txt
–on course homepage
–contains the 1st 2000 lines from the Wall Street Journal (WSJ) section of the Penn Treebank
–each sentence occupies one line...
–make sure your lines end with the right newline marker for your platform
–a space separates each word and punctuation symbol
Excerpt (1st 5 lines):
Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.
Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.
Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields PLC, was named a nonexecutive director of this British industrial conglomerate.
A form of asbestos once used to make Kent cigarette filters has caused a high percentage of cancer deaths among a group of workers exposed to it more than 30 years ago, researchers reported.
The asbestos fiber, crocidolite, is unusually resilient once it enters the lungs, with even brief exposures to it causing symptoms that show up decades later, researchers said.

Homework 1
Question 1: Using Perl and wsj2000.txt
–What is the maximum number of consonants occurring in a row within a word?
–How many words are there with that maximum number?
–List those words
–Give your Perl program

Homework 1
Question 2: Using your Perl program for Question 1
–modify your Perl program to report the sentence number as well as the word encountered in Question 1
–submit your modified program
Example:
–676 Pennsylvania
–means that the word Pennsylvania occurs on line number 676
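One way to get the "line number, word" output format is Perl's built-in variable $. which holds the current input line number. The sketch below is not a homework solution: it searches for a different, made-up property (words containing a doubled lowercase letter), reads from an in-memory string instead of wsj2000.txt, and the helper name report_matches() is hypothetical.

```perl
use strict;
use warnings;

# Return a list of "line-number word" strings for every word in $text
# that contains a doubled lowercase letter.
# (report_matches() and the doubled-letter test are illustrative assumptions.)
sub report_matches {
    my ($text) = @_;
    my @report;
    # Open the string as an in-memory file so it can be read line by line.
    open(my $fh, '<', \$text) or die "cannot open in-memory file: $!";
    while (my $line = <$fh>) {
        chomp $line;
        # Words are separated by single spaces, as in the course file.
        for my $word (split / /, $line) {
            # $. is the number of the line most recently read from $fh.
            push @report, join(" ", $., $word) if $word =~ /([a-z])\1/;
        }
    }
    close($fh);
    return @report;
}

# Made-up sample input, one sentence per line.
my $sample = "the board met today\na meeting in Pennsylvania\nno news\n";
print "$_\n" for report_matches($sample);
```

To read a real file instead, the open call would take a filename (for example $ARGV[0]) in place of the reference to $text.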

Homework 1
Question 3: (optional 438/mandatory 538) using Perl and wsj2000.txt
–find the words with the longest palindrome sequence of letters as a substring
–give your Perl code
Example:
–common has a palindrome sequence of length 2: ommo
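As a starting point, a string can be tested for being a palindrome by comparing it with its reverse. This is only a sketch of that single check, with a made-up helper name is_palindrome(); it does not search for the longest palindromic substring, which is what the question asks for.

```perl
use strict;
use warnings;

# Return 1 if $s reads the same forwards and backwards, else 0.
# (is_palindrome() is a hypothetical helper for illustration.)
sub is_palindrome {
    my ($s) = @_;
    # reverse must be called in scalar context to reverse a string
    # rather than a list.
    return $s eq scalar reverse($s) ? 1 : 0;
}

for my $candidate (qw(ommo common)) {
    print $candidate, ": ", (is_palindrome($candidate) ? "palindrome" : "not a palindrome"), "\n";
}
```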