LING/C SC/PSYC 438/538 Computational Linguistics Sandiway Fong Lecture 3: 8/28.

1 LING/C SC/PSYC 438/538 Computational Linguistics Sandiway Fong Lecture 3: 8/28

2 Today’s Lecture
–regexp: Recap
–Perl: Recap
–More on Perl and regexps
–Homework 1

3 regexp: Recap
Repetition abbreviations:
–a exactly one a
–a? a optional
–a* zero or more a’s
–a+ one or more a’s
–a{n,m} between n and m a’s
–a{n,} at least n a’s
–a{n} exactly n a’s
Metacharacters:
–{}[]()^$.|*+?\
–may be escaped by prefixing the metacharacter with a backslash (\)
Concatenation
–two regexps may be concatenated to form a new regexp
Disjunction
–infix operator: | (vertical bar)
–[set of characters] match one of the characters
–[^set of characters] match one character that is not in the set
–[char1-char2] dash (-) shorthand for a range of characters (ASCII)
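The repetition abbreviations above can be tried out with a short Perl sketch. The helper name full_match and the test strings are illustrative assumptions, not part of the lecture; the pattern is anchored with ^ and $ so that it has to cover the whole string.

```perl
use strict;
use warnings;

# Return 1 if $pattern matches the whole of $string, 0 otherwise.
# (full_match is a hypothetical helper name, not from the slides.)
sub full_match {
    my ($pattern, $string) = @_;
    return $string =~ /^(?:$pattern)$/ ? 1 : 0;
}

# Try each repetition operator against strings with 0, 1, 2 and 4 a's.
for my $s ('b', 'ab', 'aab', 'aaaab') {
    printf "%-6s a?b:%d  a*b:%d  a+b:%d  a{2,3}b:%d\n", $s,
        full_match('a?b',     $s),
        full_match('a*b',     $s),
        full_match('a+b',     $s),
        full_match('a{2,3}b', $s);
}

# Disjunction and character sets.
print full_match('[tT]he',  'The'), "\n";   # one of t or T, then "he"
print full_match('cat|dog', 'dog'), "\n";   # either alternative
print full_match('[^0-9]',  '7'),   "\n";   # a single non-digit character
```

Only a*b accepts all four strings, and a{2,3}b accepts only aab, since b has no a’s and aaaab has four.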

4 regexp: Recap
Range Abbreviations:
–period (.) stands for any character (except newline)
–\d (digit) = [0-9]
–\s (whitespace character) = space (SP), tab (HT), carriage return (CR), newline (LF) or form feed (FF)
–\w (word character) = [0-9a-zA-Z_]
–uppercase versions, e.g. \D and \W denote negation...
Line-oriented metacharacters:
–caret (^) at the beginning of a regexp string matches the “beginning of a line”
–dollar sign ($) at the end of a regexp string matches the “end of the line”
Word-oriented metacharacters:
–a word is any sequence of digits [0-9], underscores (_) and letters [a-zA-Z]
–\b matches a word boundary
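A minimal Perl sketch of the range abbreviations and the line- and word-oriented metacharacters follows. The helper names count_the and is_all_digits, and the sample sentence, are assumptions made for illustration.

```perl
use strict;
use warnings;

# Count whole-word occurrences of "the" or "The" in a line.
# \b on both sides keeps "other" and "there" from counting.
# (count_the is a hypothetical helper name.)
sub count_the {
    my ($line) = @_;
    my $count = () = $line =~ /\b[tT]he\b/g;
    return $count;
}

# Return 1 if the token consists only of digits, 0 otherwise.
# ^ and $ anchor the match to the start and end of the string.
sub is_all_digits {
    my ($token) = @_;
    return $token =~ /^\d+$/ ? 1 : 0;
}

my $line = "The other day the cat bathed there.";
print count_the($line), "\n";        # 2
print is_all_digits("1989"), "\n";   # 1
print is_all_digits("61st"), "\n";   # 0
```

Without the \b markers the same sentence would yield five matches, because other, bathed and there each contain the letters the.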

5 Perl: Recap
Example
–Perl program (match.pl) to read in a text file line by line (using a while loop) and print those lines that successfully match the regexp \b[tT]he\b enclosed by /.../
    open (F,$ARGV[0]) or die "$ARGV[0] not found!\n";
    while (<F>) {
        print $_ if (/\b[tT]he\b/);
    }
Usage example
–input file (text.txt)
–command: perl match.pl text.txt

6 More Perl
Reference:
–http://perldoc.perl.org/perlintro.html

7 More Perl
Variables:
–prefixed by $
–e.g. $count, $i
Assignment and arithmetic expressions:
–e.g.
–$count = 0;
–$i = "this";
–$count = $count + 1;
–$count++; (auto-increment)
–$i = $i . " moment"; (string concatenation)
Print:
–print $count;
–print "Count: ", $count, "\n";
Conditionals:
–if ($count == 1000) {... } else {...}
Iteration:
–$i = 10;
–while ($i>0) { $i-- }
–for ($i=0; $i <= $max; $i++) {... }
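The constructs on this slide can be combined into one small runnable program. This is a sketch, not code from the lecture: the name sum_to is hypothetical, and the my declarations are an addition required by use strict (the slide examples omit them).

```perl
use strict;
use warnings;

# Sum the integers 0..$max with a C-style for loop.
# (sum_to is a hypothetical helper name.)
sub sum_to {
    my ($max) = @_;
    my $total = 0;
    for (my $i = 0; $i <= $max; $i++) {
        $total = $total + $i;
    }
    return $total;
}

# Assignment and string concatenation with the dot operator.
my $i = "this";
$i = $i . " moment";
print $i, "\n";                  # this moment

# Count down with a while loop, then test the result.
my $count = 10;
while ($count > 0) { $count-- }
if ($count == 0) {
    print "Count: ", $count, "\n";
} else {
    print "unexpected count\n";
}

print sum_to(4), "\n";           # 0+1+2+3+4 = 10
```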

8 Perl and regexps
Grouping
–uses the metacharacters ( and ) to delimit a group
–inside a regexp, each group can be referenced using backreferences \1, \2, and so on...
–outside a regexp, each group is stored in a variable $1, $2, and so on...
–hint: this may be very useful for your homework
Examples:
–doubled vowel
–([aeiou])\1
–matches
–heed and book
–but not head
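The doubled-vowel example can be run as follows. This is a minimal sketch: the helper name doubled_vowel is an assumption. It shows both uses of a group, \1 inside the regexp and $1 after a successful match.

```perl
use strict;
use warnings;

# Return the vowel that appears doubled in $word, or undef if none does.
# Inside the regexp, \1 must repeat whatever group 1 matched;
# after a successful match, the same text is available as $1.
# (doubled_vowel is a hypothetical helper name.)
sub doubled_vowel {
    my ($word) = @_;
    return $word =~ /([aeiou])\1/ ? $1 : undef;
}

for my $word ('heed', 'book', 'head') {
    my $vowel = doubled_vowel($word);
    if (defined $vowel) {
        print "$word: doubled $vowel\n";
    } else {
        print "$word: no doubled vowel\n";
    }
}
```

The word head fails because e followed by a is two different vowels, and \1 requires the same character that the group captured.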

9 Homework 1
Due next Tuesday
–submit by email
–midnight deadline

10 Homework 1
Use file wsj2000.txt
–on course homepage
–contains the 1st 2000 lines from the Wall Street Journal (WSJ) section of the Penn Treebank
–each sentence occupies one line...
–make sure your lines end with the right newline marker for your platform
–a space separates each word and punctuation symbol
Excerpt (1st 5 lines):
Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.
Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.
Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields PLC, was named a nonexecutive director of this British industrial conglomerate.
A form of asbestos once used to make Kent cigarette filters has caused a high percentage of cancer deaths among a group of workers exposed to it more than 30 years ago, researchers reported.
The asbestos fiber, crocidolite, is unusually resilient once it enters the lungs, with even brief exposures to it causing symptoms that show up decades later, researchers said.

11 Homework 1
Question 1: Using Perl and wsj2000.txt
–What is the maximum number of consonants occurring in a row within a word?
–How many words are there with that maximum number?
–List those words
–Give your Perl program

12 Homework 1
Question 2: Using your Perl program for Question 1
–modify your Perl program to report the sentence number as well as the word encountered in Question 1
–submit your modified program
Example:
–676 Pennsylvania
–means on line number 676 the word Pennsylvania occurs

13 Homework 1
Question 3: (optional 438/mandatory 538) using Perl and wsj2000.txt
–find the words with the longest palindrome sequence of letters as a substring
–give your Perl code
Example:
–common has a palindrome sequence of length 4: ommo

