LING/C SC/PSYC 438/538 Lecture 12 Sandiway Fong
Administrivia
  Homework 9
  Perl regex
  Python re
    import re
    slightly complicated string handling: use raw strings (r'...')
    https://docs.python.org/3/library/re.html
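To illustrate why raw strings are recommended for Python regex patterns, here is a minimal sketch. The sample text and variable names are made up for illustration.

```python
import re

# In a normal string literal "\b" is a backspace character (ASCII 8);
# in a raw string it stays as backslash + b, which re reads as a word boundary.
plain = "\bcat\b"
raw = r"\bcat\b"

text = "the cat sat"
plain_match = re.search(plain, text)   # searches for backspace + "cat" + backspace
raw_match = re.search(raw, text)       # searches for the word "cat"

print(len(plain), len(raw))
print(plain_match)
print(raw_match.group(0))
```

The two literals look identical apart from the r prefix, yet they have different lengths and only the raw one finds the word.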
File I/O Summary
  Common:
    open a filehandle (concept comes from the underlying OS)
    streams: STDIN STDOUT STDERR (Perl)
    streams: sys.stdin sys.stdout sys.stderr (Python)
    close
  Perl: https://perldoc.perl.org/perlopentut.html
    <filehandle> (context-dependent: reads one line in scalar context, the whole file in list context)
    print filehandle String
  Python: https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files
    methods: .read() .readline() .readlines() .write(String) (no newline added)
    function: print(*objects, sep=' ', end='\n', file=sys.stdout, flush=False)
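The Python half of the summary can be tried out with a short sketch like the one below. The scratch file name demo.txt and its contents are assumptions made up for the demonstration.

```python
import os
import sys
import tempfile

# Hypothetical scratch file, created only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# .write() adds no newline; print() appends end='\n' by default.
with open(path, "w") as fh:
    fh.write("first line\n")
    print("second", "line", sep=" ", file=fh)

with open(path) as fh:
    whole = fh.read()            # entire file as one string

with open(path) as fh:
    first = fh.readline()        # one line, newline included
    rest = fh.readlines()        # the remaining lines, as a list

print(repr(whole))
print(repr(first), rest, file=sys.stderr)   # same function, different stream
```

Using with closes the filehandle automatically, which plays the role of the explicit close in the summary.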
Regular Expressions to the rescue https://xkcd.com/208/
Regular Expressions from Hell
  Email validation: RFC 5322:
  (?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
Homework 9
  File: hw9.txt (56 lines)
  Contents: each line has 3 fields
    name of state or US territory (in alphabetical order)
    population
    area (sq. miles)
  fields are separated by a tab (\t)
  Source: Wikipedia
Homework 9 Question 1
  Using Perl:
    supply the file hw9.txt on the command line (DO NOT MODIFY hw9.txt)
    read the file
    use regex to extract the information
    create hash table(s) indexed by name containing population and land area
    print a table of states/territories inversely ranked by land area
    print a table of states/territories ranked by population (i.e. 1st is highest population)
    compute the density (population per sq. mile)
    print a table of states/territories ranked by density (i.e. 1st is highest density)
Homework 9 Question 1
  Hints:
    note that some state/territory names consist of more than one word
    note that numeric values may have commas
    read about @ARGV
    read about split
    read about tr: $num =~ tr/,//d deletes the pesky commas in $num
    revisit sort parameters: https://perldoc.perl.org/functions/sort.html
    if you need to trim whitespace from the ends: $line =~ s/^\s+|\s+$//g;
    for nicely-formatted lists, read about printf FORMAT: http://perldoc.perl.org/functions/sprintf.html
Homework 9: Question 2
  538 only (optional for 438):
    Do the same exercise as Question 1 in Python3 using a dictionary or dictionaries
    In your opinion, which code is simpler?
  These may prove useful:
    str.strip() str.replace() str.split() sys.argv int()
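As a rough illustration of how the listed string methods fit together, the sketch below parses a single record. The record is made up (it is not taken from hw9.txt) and merely follows the tab-separated name, population, area layout described earlier; treating the numbers as integers is also an assumption.

```python
# Made-up record in the same shape as a hw9.txt line: name<TAB>population<TAB>area
line = "New Example\t1,234,567\t8,910\n"

# strip() drops the trailing newline; split("\t") keeps multi-word names intact.
name, population, area = line.strip().split("\t")

# replace() deletes the commas so int() can parse the digits.
population = int(population.replace(",", ""))
area = int(area.replace(",", ""))

table = {name: (population, area)}   # dictionary indexed by name
print(table)
```

In a full script the file name would come from sys.argv[1] rather than being hard-coded.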
Homework 9
  Usual submission rule: ONE PDF file
  Submit code/run/comments
  Email subject heading: 438/538 Homework 9 Your Name
  Due by midnight next Monday (review in class on Tuesday)
regex Read textbook chapter 2: section 1 on Regular Expressions
Perl regex Read up on the syntax of Perl regular expressions Online tutorials http://perldoc.perl.org/perlrequick.html http://perldoc.perl.org/perlretut.html
Perl regex
  Perl regex matching:
    $s =~ /foo/ (/…/ contains a regex)
    can use in a conditional: e.g. if ($s =~ /foo/) … evaluates to true/false depending on what's in $s
    can also use as a statement: e.g. $s =~ /foo/;
    global variable $& contains the match
  Perl regex match and substitute:
    $s =~ s/foo/bar/
    s/…match…/…substitute…/ contains two expressions
    will modify $s by replacing the first occurrence of match with substitute
    s/…match…/…substitute…/g global substitution (replaces every occurrence)
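For comparison, a sketch of rough Python counterparts using the re module. The sample string is made up, and the Perl forms in the comments are approximate equivalents rather than exact ones.

```python
import re

s = "foo and more foo"

m = re.search(r"foo", s)                    # roughly: $s =~ /foo/
matched = m.group(0) if m else None         # roughly: $&

once = re.sub(r"foo", "bar", s, count=1)    # roughly: $s =~ s/foo/bar/
every = re.sub(r"foo", "bar", s)            # roughly: $s =~ s/foo/bar/g

print(matched, once, every, sep="\n")
print(s)                                    # s itself is unchanged
```

One difference worth noting: Perl's s/// modifies $s in place, whereas re.sub returns a new string and leaves the original untouched.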
Perl regex
  Most useful with the code template for reading in a file line-by-line:
    open(my $fh, '<', $ARGV[0]) or die "$ARGV[0] not found!\n";
    while (my $line = <$fh>) {
        # do RE stuff with $line
    }
    close($fh);
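The same line-by-line template might look as follows in Python. This is only a sketch: the function name matching_lines is hypothetical, and "RE stuff" is reduced here to collecting the lines that match.

```python
import re

def matching_lines(path, pattern):
    """Return the lines of the file at `path` that contain a match for `pattern`."""
    regex = re.compile(pattern)
    hits = []
    with open(path) as fh:       # the file is closed when the block ends
        for line in fh:          # one line per iteration, like while ($line = <$fh>)
            if regex.search(line):
                hits.append(line.rstrip("\n"))
    return hits
```

In a script the path would typically come from the command line, e.g. matching_lines(sys.argv[1], r"foo") after import sys.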
Chapter 2: JM spaces matter! character class: Perl lingo
Chapter 2: JM
  range: follows the order of characters in the ASCII table
  backslash + lowercase letter for a class (e.g. \d, \w, \s)
  uppercase variant for everything but that class (e.g. \D, \W, \S)
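A small sketch of ranges and class shorthands, written with Python's re module, which shares this notation with Perl. The sample text is made up.

```python
import re

text = "Room 42, Floor 7_b"

digits = re.findall(r"[0-9]+", text)     # explicit range
same = re.findall(r"\d+", text)          # shorthand class, same result on this text
non_digits = re.findall(r"\D+", text)    # uppercase variant: everything but digits
uppers = re.findall(r"[A-Z]", text)      # range over uppercase ASCII letters

print(digits, same, non_digits, uppers, sep="\n")
```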
Chapter 2: JM
Chapter 2: JM Can use (…) if > 1 char Sheeptalk
Perl regex
  \S+ing\b
    \s is a whitespace, so \S is a non-whitespace
    + is repetition (1 or more)
    \b is a word boundary (words are made up of \w characters)
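The pattern \S+ing\b behaves the same way in Python's re module, so it can be tried out with a sketch like this one. The sample sentence is made up.

```python
import re

text = "She was singing and the ringing kept going, ingot aside."

# \S+ needs at least one non-whitespace character before "ing",
# and \b requires the match to end at a word boundary.
words = re.findall(r"\S+ing\b", text)
bare = re.findall(r"\S+ing\b", "ing")   # nothing precedes "ing", so no match

print(words)
print(bare)
```

Note that "going," matches as "going" because the comma is not a \w character, while "ingot" fails on both counts.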
Perl regex global variables \b or \b{wb} other boundary metacharacters: ^ (beginning of line), $ (end of line)
Perl regex: Unicode and \b \b{wb} Note: global match in while-loop Note: .*? is the non-greedy version of .*
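The two notes can be illustrated with Python's re module, as a sketch on a made-up string. re.finditer stands in here for the Perl while-loop over a /g match; \b{wb} is left out because Python's re module does not offer it.

```python
import re

text = "<b>one</b> and <b>two</b>"

greedy = re.findall(r"<b>(.*)</b>", text)   # .* grabs as much as possible
lazy = re.findall(r"<b>(.*?)</b>", text)    # .*? grabs as little as possible

# Visit each match in turn, recording where it starts and what it captured.
spans = [(m.start(), m.group(1)) for m in re.finditer(r"<b>(.*?)</b>", text)]

print(greedy, lazy, spans, sep="\n")
```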
Perl regex: Unicode and \w
  \w is [0-9A-Za-z_]
  Definition is expanded for Unicode:
    use utf8;
    use open qw(:std :utf8);
    my $str = "school école École šola trường स्कूल škole โรงเรียน";
    @words = ($str =~ /(\w+)/g);    # list context
    foreach $word (@words) { print "$word\n" }
  Pragma: https://perldoc.perl.org/open.html
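Python 3 also expands \w for Unicode strings by default, and the re.ASCII flag restores the narrow [0-9A-Za-z_] definition. The sketch below uses only the Latin-script words from the list above, on the assumption that Python and Perl may treat combining marks in other scripts differently. The text is normalized first so that each accented letter is a single code point.

```python
import re
import unicodedata

# Latin-script subset of the word list, normalized to precomposed characters.
text = unicodedata.normalize("NFC", "school école École šola škole")

unicode_words = re.findall(r"\w+", text)                  # Unicode-aware \w
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)    # \w limited to [0-9A-Za-z_]

print(unicode_words)
print(ascii_words)
```

With re.ASCII the accented letters stop counting as word characters, so each affected word loses its first letter.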
Chapter 2: JM Why? * means zero or more repetitions of the previous char/expr . means any single character ? means previous char/expr is optional
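The three operators can be checked quickly with Python's re module, which uses the same notation. This is a sketch: the helper name full and the test words are made up for illustration.

```python
import re

def full(pattern, text):
    """True if the whole of `text` matches `pattern`."""
    return re.fullmatch(pattern, text) is not None

# ? makes the previous character optional
optional = [full(r"colou?r", w) for w in ("color", "colour", "colouur")]
# * allows zero or more repetitions of the previous character
star = [full(r"ba*!", w) for w in ("b!", "ba!", "baaa!")]
# . stands for exactly one character of any kind
any_char = [full(r"b.g", w) for w in ("bag", "b g", "bg")]

print(optional, star, any_char, sep="\n")
```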
Chapter 2: JM
  Precedence of operators
  Precedence Hierarchy:
  Example: Column 1 Column 2 Column 3 …
    /Column [0-9]+ */ (the character before * is a space)
    /(Column [0-9]+ *)*/
    /house(cat(s|)|)/ (| = disjunction; ? = optional)
  Perl: in a regular expression, the text matched within a pair of parentheses is stored in global variables $1 (and $2 and so on).
    (?: … ) group but exclude from storage
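The effect of grouping and capturing can be seen in a short Python sketch, where m.group(1) plays the role of Perl's $1. The input strings are taken from the examples on this slide; the variable names are made up.

```python
import re

text = "Column 1 Column 2 Column 3"

# Without parentheses, * applies only to the space before it.
single = re.match(r"Column [0-9]+ *", text).group(0)

# With parentheses, * repeats the whole group.
repeated = re.match(r"(Column [0-9]+ *)*", text)
whole = repeated.group(0)
last = repeated.group(1)          # a repeated group keeps its last iteration

accepted = [w for w in ("house", "housecat", "housecats", "houses")
            if re.fullmatch(r"house(cat(s|)|)", w)]

# (?: … ) groups without storing, so the inner (s|) becomes group 1.
inner = re.fullmatch(r"house(?:cat(s|)|)", "housecats").group(1)

print(repr(single), repr(whole), repr(last), accepted, repr(inner), sep="\n")
```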
Online regex tester https://regex101.com
Perl regex
  http://perldoc.perl.org/perlretut.html
  matching in scalar context returns 1 (true) or "" (empty string if false)
  A shortcut: in list context, matching returns a list (of the captured groups)
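Python has no scalar/list context, but a loosely comparable pair of behaviours can be sketched with the re module: re.search returns a match object or None, and groups() or re.findall hand back the captured pieces. The sample strings are made up.

```python
import re

m = re.search(r"(\d+)-(\d+)", "pages 10-25")
found = m is not None             # truth value of the match
low, high = m.groups()            # captured groups, like ($1, $2) in list context

missing = re.search(r"(\d+)-(\d+)", "no numbers here")   # None when nothing matches

all_numbers = re.findall(r"\d+", "pages 10-25 and 31")   # every match, as a list

print(found, low, high, missing, all_numbers)
```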