LING/C SC/PSYC 438/538 Lecture 10 Sandiway Fong
Administrivia: Homework 4. Perl regex. Python re: import re; string handling is slightly complicated, so use raw strings (r"...") for patterns. https://docs.python.org/3/library/re.html
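A minimal sketch of why raw strings matter for Python's re module, assuming the slide's "use raw" refers to raw string literals. The sample text and variable names are made up for illustration.

```python
import re

text = "sing a song"

# In a normal string literal, "\b" is the backspace character (0x08),
# so this pattern looks for backspace + "s" and finds nothing.
no_raw = re.findall("\bs\\w*", text)

# In a raw string literal the backslash reaches the regex engine intact,
# so \b means "word boundary".
raw = re.findall(r"\bs\w*", text)

print(no_raw)  # []
print(raw)     # ['sing', 'song']
```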
Regular Expressions to the rescue https://xkcd.com/208/
Regular Expressions from Hell Email validation: RFC 5322: (?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
File I/O Summary
Common (Perl and Python): open a filehandle (the concept comes from the underlying OS); close it when done
Perl: https://perldoc.perl.org/perlopentut.html
  <filehandle> (depending on context, reads one line or the whole file)
  print filehandle String
Python: https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files
  .read() .readline() .readlines() (methods)
  .write(String) (method; adds no newline)
  print(*objects, sep=' ', end='\n', file=sys.stdout, flush=False) (function)
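The Python calls listed in the summary can be sketched as follows. The scratch file name demo.txt and its contents are hypothetical, chosen only to make the example self-contained.

```python
import os
import tempfile

# Hypothetical scratch file used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# .write() adds no newline; print() appends end='\n' by default.
with open(path, "w", encoding="utf-8") as f:
    f.write("first line\n")
    print("second line", file=f)

with open(path, encoding="utf-8") as f:
    whole = f.read()        # entire file as one string

with open(path, encoding="utf-8") as f:
    one = f.readline()      # one line, newline included
    rest = f.readlines()    # remaining lines as a list of strings

print(repr(whole))
print(repr(one), rest)
```

The with statement closes the filehandle automatically, which plays the role of the explicit close in the summary.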
Homework 4 File: population.txt Contents: rank, name, continent, population (2016), population (2017); fields are separated by a tab (\t). Source: Wikipedia
Homework 4: Question 1
Using Perl:
  read the file
  create hash table(s) indexed by country name containing the following information: continent/2016 population/2017 population
Tasks:
  Compute and print the country that decreased in population.
  Compute and print the country with the smallest positive increase in population.
  Print a table of countries in Asia and 2016 population, ranked by 2016 population.
  Print a table of countries in Africa and 2016 population, ranked inversely by 2016 population.
Hints:
  read about split
  read about tr: $num =~ tr/,//d deletes the pesky commas in $num
  revisit sort parameters: https://perldoc.perl.org/functions/sort.html
  if you need to trim whitespace from the ends: $line =~ s/^\s+|\s+$//g;
  for nicely-formatted lists, read http://perldoc.perl.org/functions/sprintf.html about printf FORMAT
Homework 4: Question 2 Review Do the same exercise in Python3 using a dictionary or dictionaries These may prove useful: str.strip() str.replace() str.split() sys.argv int()
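A rough sketch of how the listed string methods might combine on a single record. The record below is made up (the country name and figures are placeholders, not data from population.txt), and to_int is a hypothetical helper.

```python
# Made-up record in the tab-separated layout described for population.txt;
# the country name and figures are placeholders, not real data.
line = "1\tExampleland\tAsia\t1,234,567\t1,240,000\n"

# strip() drops the trailing newline; split("\t") separates the five fields.
rank, name, continent, pop2016, pop2017 = line.strip().split("\t")

def to_int(s):
    """Hypothetical helper: remove thousands separators, then convert."""
    return int(s.replace(",", ""))

# Dictionary indexed by country name, as the exercise suggests.
countries = {
    name: {"continent": continent, "2016": to_int(pop2016), "2017": to_int(pop2017)}
}

change = countries[name]["2017"] - countries[name]["2016"]
print(name, change)
```

In a script, the file name would typically come from sys.argv[1] rather than a hard-coded string.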
Homework 4: Question 3 In Your Opinion: which code is simpler?
Homework 4 Usual submission rule: ONE PDF file. Submit code/run/comments. Email subject heading: 438/538 Homework 4 Your Name. Due by midnight next Monday (review in class on Tuesday)
regex Read textbook chapter 2: section 1 on Regular Expressions
Perl regex Read up on the syntax of Perl regular expressions Online tutorials http://perldoc.perl.org/perlrequick.html http://perldoc.perl.org/perlretut.html
Perl regex
Perl regex matching: $a =~ /foo/ (/…/ contains a regex)
  can be used in a conditional, e.g. if ($a =~ /foo/) …, which evaluates to true/false depending on what’s in $a
  can also be used as a statement, e.g. $a =~ /foo/;
  after a successful match, the variable $& contains the matched text
Perl regex match and substitute: $a =~ s/foo/bar/
  s/…match…/…substitute…/ contains two expressions
  modifies $a by replacing the first occurrence of match with substitute
  s/…match…/…substitute…/g global substitution (replaces every occurrence)
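For comparison, a sketch of the corresponding operations in Python's re module. The sample string is made up; the mapping to Perl in the comments is approximate.

```python
import re

a = "foo foo foo"

m = re.search(r"foo", a)      # roughly $a =~ /foo/
if m:
    matched = m.group(0)      # plays the role of Perl's $&

once = re.sub(r"foo", "bar", a, count=1)   # roughly s/foo/bar/
every = re.sub(r"foo", "bar", a)           # roughly s/foo/bar/g

print(matched, "|", once, "|", every)
```

One difference worth noting: Python strings are immutable, so re.sub returns a new string instead of modifying a in place.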
Perl regex
Most useful with the code template for reading in a file line-by-line:
  open($txtfile, $ARGV[0]) or die "$ARGV[0] not found!\n";
  while ($line = <$txtfile>) {
      # do RE stuff with $line
  }
  close($txtfile);
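The same line-by-line template might look like this in Python. To keep the sketch self-contained it writes a small made-up file first; in a real script the path would more likely come from sys.argv[1].

```python
import os
import re
import tempfile

# Made-up sample file so the sketch runs on its own.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("foo one\nbar two\nfood three\n")

hits = []
with open(path, encoding="utf-8") as txtfile:
    for line in txtfile:                 # one line per iteration
        if re.search(r"foo", line):      # do RE stuff with line
            hits.append(line.rstrip("\n"))

print(hits)  # ['foo one', 'food three']
```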
Chapter 2: JM character class: Perl lingo
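Chapter 2: JM range: defined by character order in the ASCII table (e.g. [a-z], [0-9]); backslash + lowercase letter names a class (e.g. \d, \w, \s); the uppercase variant (\D, \W, \S) matches everything but that class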
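A small sketch of a class, its uppercase complement, and a range, written with Python's re module. The sample string is made up.

```python
import re

s = "Room 42, Floor 7"

digits = re.findall(r"\d+", s)       # \d: the digit class
non_digits = re.findall(r"\D+", s)   # \D: everything except digits
lower = re.findall(r"[a-z]+", s)     # range a-z, by character code order

print(digits)      # ['42', '7']
print(non_digits)  # ['Room ', ', Floor ']
print(lower)       # ['oom', 'loor']
```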
Chapter 2: JM Sheeptalk
Perl regex \S+ing\b: \s is a whitespace character, so \S is a non-whitespace character; + is repetition (1 or more); \b is a word boundary (words are made up of \w characters)
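The pattern \S+ing\b behaves the same way in Python's re module, so it can be tried out as below. The sample sentence is made up.

```python
import re

text = "She was singing and dancing, not resting. A ringer!"

# \S+ needs at least one non-whitespace character before "ing",
# and \b requires the "ing" to end at a word boundary.
hits = re.findall(r"\S+ing\b", text)

print(hits)  # ['singing', 'dancing', 'resting']
```

"ringer!" is not matched because the "ing" in it is followed by another word character, so there is no boundary after the g.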
Perl regex \b or \b{wb} global variables
Perl regex: Unicode and \b \b{wb} Note: global match in while-loop
Perl regex: Unicode and \w
\w is [0-9A-Za-z_]
Definition is expanded for Unicode:
  use utf8;
  use open qw(:std :utf8);    # pragma: https://perldoc.perl.org/open.html
  my $str = "school école École šola trường स्कूल škole โรงเรียน";
  @words = ($str =~ /(\w+)/g);    # list context
  foreach $word (@words) {
      print "$word\n"
  }
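A comparable sketch in Python, where \w is Unicode-aware by default for str patterns and re.ASCII restricts it to [0-9A-Za-z_]. The sample is a Latin-script subset of the slide's string; how scripts with combining marks are handled can differ between regex engines, so they are left out here.

```python
import re
import unicodedata

# Subset of the slide's sample string, normalized so that accented
# letters are single precomposed code points.
s = unicodedata.normalize("NFC", "school école École šola")

words = re.findall(r"\w+", s)                        # Unicode-aware \w
ascii_only = re.findall(r"\w+", s, flags=re.ASCII)   # \w as [0-9A-Za-z_]

for word in words:
    print(word)
print(ascii_only)
```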
Chapter 2: JM
Precedence of operators
Example: Column 1 Column 2 Column 3 …
  /Column [0-9]+ */ (note the space before *: the * applies only to that space)
  /(Column [0-9]+ *)*/ (the outer * repeats the whole parenthesized group)
  /house(cat(s|)|)/ (| = disjunction; ? = optional)
Perl: in a regular expression, the text matched within a pair of parentheses is stored in the global variables $1 (and $2 and so on)
Precedence Hierarchy: (table)
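The effect of the parentheses in the examples above can be checked with Python's re module, as in this sketch. Match.group(n) stands in for Perl's $1, $2; the variable names are illustrative.

```python
import re

s = "Column 1 Column 2 Column 3"

# Without parentheses, * applies only to the preceding space.
one = re.match(r"Column [0-9]+ *", s).group(0)

# With parentheses, * repeats the whole group.
many = re.match(r"(Column [0-9]+ *)*", s).group(0)

# Nested groups with empty alternatives: house, housecat, housecats.
m = re.fullmatch(r"house(cat(s|)|)", "housecats")
outer, inner = m.group(1), m.group(2)   # like Perl's $1 and $2

print(repr(one))    # 'Column 1 '
print(repr(many))   # 'Column 1 Column 2 Column 3'
print(outer, inner) # cats s
```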
Perl regex http://perldoc.perl.org/perlretut.html In scalar context a match returns 1 (true) or "" (the empty string, if false). A shortcut: in list context a match returns a list
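Python has no scalar/list context, but a loosely analogous pair of idioms exists: the truth value of the match object, and groups() or findall() for the list of captured text. A sketch with a made-up input line:

```python
import re

line = "rank 12 of 50"

m = re.search(r"(\d+) of (\d+)", line)
found = bool(m)               # truth value, like the scalar result in Perl
first, second = m.groups()    # captured groups, like a list-context match

# All matches at once, like @nums = ($line =~ /(\d+)/g) in Perl.
all_numbers = re.findall(r"\d+", line)

print(found, first, second, all_numbers)
```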