1
LING/C SC/PSYC 438/538 Lecture 12 Sandiway Fong
2
Administrivia Homework 9 Perl regex Python re
import re
slightly complicated string handling: use raw strings
g/3/library/re.html
3
File I/O Summary
Common:
  open
  filehandle (concept comes from the underlying OS)
  streams: STDIN STDOUT STDERR (Perl)
  streams: sys.stdin sys.stdout sys.stderr (Python)
  close
Perl:
  <filehandle> (context: reads a line or the whole file)
  print filehandle String
Python:
  .read() .readline() .readlines() (methods)
  .write(String) (no newline)
  print(*objects, sep=' ', end='\n', file=sys.stdout, flush=False) (function)
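The Python side of the summary can be tried out with a short sketch. It assumes nothing beyond the standard library; `copy_lines` is a hypothetical helper name, and `io.StringIO` stands in for a real file so the example runs without anything on disk.

```python
import io
import sys

def copy_lines(src, dst=sys.stdout):
    """Read src line by line and write each line to dst; return the line count."""
    count = 0
    for line in src:          # iterating a file object yields one line at a time
        dst.write(line)       # .write() adds no newline; the line keeps its own
        count += 1
    return count

# io.StringIO stands in for a file opened with open(path)
demo = io.StringIO("first line\nsecond line\n")
print(demo.readline(), end="")   # .readline() returns one line, newline included
rest = demo.readlines()          # .readlines() returns the remaining lines as a list
print(rest, file=sys.stderr)     # print() can target any stream via file=
```

With a real file, `open(path)` returns the same kind of object, and a `with` block closes it automatically.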
4
Regular Expressions to the rescue
5
Regular Expressions from Hell
validation: RFC 5322: (?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~- ]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01- 9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1- 9]?[0-9]))\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-z0-9- ]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01- \x09\x0b\x0c\x0e-\x7f])+)\])
6
Homework 9
File: hw9.txt
56 lines
Contents: each line has 3 fields
  name of state or US territory (in alphabetical order)
  population
  area (sq. miles)
fields are separated by a tab (\t)
Source: Wikipedia
7
Homework 9 Question 1
Using Perl
  supply the file hw9.txt on the command line
  DO NOT MODIFY hw9.txt
  read the file
  use regex to extract the information
  create hash table(s) indexed by name containing population and land area
  Print a table of states/territories inversely ranked by land area
  Print a table of states/territories ranked by population (i.e. 1st is highest population)
  compute the density (population per sq. mile)
  Print a table of states/territories ranked by density (i.e. 1st is highest density)
8
Homework 9 Question 1
Hints:
  note that some state/territory names consist of more than one word
  note that numeric values may have commas
  read about split
  read about tr: $num =~ tr/,//d deletes the pesky commas in $num
  revisit sort parameters
  if you need to trim whitespace from the ends: $line =~ s/^\s+|\s+$//g;
  for nicely-formatted lists, read about printf FORMAT
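For comparison, the trimming, splitting and comma-deleting hints can be sketched with Python's re module. This is an illustration only: `parse_line` is a hypothetical helper, the sample line is made up, and it assumes both numeric fields are whole numbers.

```python
import re

# Hypothetical line in the hw9.txt layout: name <TAB> population <TAB> area.
# The name and numbers are made up for illustration.
sample = "  New Example\t1,234,567\t89,012\n"

def parse_line(line):
    """Return (name, population, area) from one tab-separated line."""
    line = re.sub(r"^\s+|\s+$", "", line)      # same idea as s/^\s+|\s+$//g
    name, pop, area = re.split(r"\t", line)    # split on the tab separator
    pop = int(pop.replace(",", ""))            # same idea as tr/,//d
    area = int(area.replace(",", ""))          # assumes whole-number areas
    return name, pop, area

print(parse_line(sample))
```

Splitting on the tab rather than on whitespace keeps multi-word names intact.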
9
Homework 9: Question 2
538 only (optional for 438):
  Do the same exercise as Question 1 in Python3 using a dictionary or dictionaries
  In your opinion, which code is simpler?
These may prove useful:
  str.strip()
  str.replace()
  str.split()
  sys.argv
  int()
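A minimal sketch of how the listed string methods and a dictionary might fit together is shown here. It is not a model answer: `rank_by_density` is a hypothetical name, the rows are made up rather than taken from hw9.txt, and whole-number fields are assumed.

```python
def rank_by_density(lines):
    """Build a dict keyed by name, then return names sorted by density, highest first."""
    table = {}
    for line in lines:
        name, pop, area = line.strip().split("\t")
        table[name] = (int(pop.replace(",", "")), int(area.replace(",", "")))
    return sorted(table, key=lambda n: table[n][0] / table[n][1], reverse=True)

# Made-up rows, not real census data
rows = ["Alpha\t1,000\t10\n", "Beta Island\t500\t2\n", "Gamma\t9,000\t300\n"]
for rank, name in enumerate(rank_by_density(rows), start=1):
    print(f"{rank:>2}  {name}")
```

In a script, the lines would come from the file named in sys.argv[1] instead of a hard-coded list.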
10
Homework 9
Usual submission rule: ONE PDF file
  Submit code/run/comments
  subject heading: 438/538 Homework 9 Your Name
  Due date: by midnight of next Monday (review in class on Tuesday)
11
regex Read textbook chapter 2: section 1 on Regular Expressions
12
Perl regex Read up on the syntax of Perl regular expressions
Online tutorials
13
Perl regex
Perl regex matching:
  $s =~ /foo/ (/…/ contains a regex)
  can use in a conditional: e.g. if ($s =~ /foo/) …
    evaluates to true/false depending on what’s in $s
  can also use as a statement: e.g. $s =~ /foo/;
  global variable $& contains the match
Perl regex match and substitute:
  $s =~ s/foo/bar/
  s/…match…/…substitute…/ contains two expressions
  will modify $s by looking for the first occurrence of match and replacing that with substitute
  s/…match…/…substitute…/g global substitution (replaces every occurrence)
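The same match and substitute operations can be sketched in Python with re.search and re.sub. The string is an arbitrary example; the mapping to Perl in the comments is approximate.

```python
import re

s = "foo and more foo"

# Matching: re.search returns a match object (truthy) or None (falsy)
m = re.search(r"foo", s)
if m:
    print(m.group(0))          # the matched text, comparable to Perl's $&

# Substitution: Python strings are immutable, so re.sub returns a new string
once = re.sub(r"foo", "bar", s, count=1)   # like s/foo/bar/  (first occurrence)
every = re.sub(r"foo", "bar", s)           # like s/foo/bar/g (all occurrences)
print(once)
print(every)
```

Unlike Perl's s///, which modifies $s in place, the original string s is left unchanged here.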
14
Perl regex
Most useful with the code template for reading in a file line-by-line:
  open($fh, $ARGV[0]) or die "$ARGV[0] not found!\n";
  while ($line = <$fh>) {
    # do RE stuff with $line
  }
  close($fh);
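A rough Python counterpart of this template is sketched below. `matching_lines` is a hypothetical helper, and io.StringIO replaces the file so the sketch runs without anything on disk.

```python
import io
import re

def matching_lines(fh, pattern):
    """Yield each line of fh (newline stripped) that contains a match for pattern."""
    regex = re.compile(pattern)
    for line in fh:                    # plays the role of while ($line = <$fh>)
        if regex.search(line):         # "do RE stuff" with the line
            yield line.rstrip("\n")

# In a script this would be:  with open(sys.argv[1]) as fh: ...
# io.StringIO is used here so the sketch runs without a file on disk.
fh = io.StringIO("foo bar\nbaz\nfood\n")
print(list(matching_lines(fh, r"foo")))
```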
15
Chapter 2: JM spaces matter! character class: Perl lingo
16
Chapter 2: JM range: in ASCII table
backslash + lowercase letter names a character class
the uppercase variant matches everything but that class
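Ranges and the lowercase/uppercase class pairs behave the same way in Python's re module, so they can be tried out quickly. The sample text is arbitrary.

```python
import re

text = "Room 42, Floor 3"

digits = re.findall(r"\d+", text)       # \d: digit characters
non_digits = re.findall(r"\D+", text)   # \D: everything except digits
words = re.findall(r"\w+", text)        # \w: letters, digits, underscore
ranged = re.findall(r"[A-Z]", text)     # range: uppercase A through Z
print(digits, non_digits, words, ranged)
```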
17
Chapter 2: JM
18
Chapter 2: JM Can use (…) if > 1 char Sheeptalk
19
Perl regex
\S+ing\b
  \s is a whitespace, so \S is a non-whitespace
  + is repetition (1 or more)
  \b is a word boundary (words are made up of \w characters)
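The pattern carries over unchanged to Python, which makes it easy to check what it does and does not match. The sentence is an arbitrary example.

```python
import re

text = "She is singing and dancing in the ring"

# One or more non-whitespace characters, then "ing", then a word boundary
ing_words = re.findall(r"\S+ing\b", text)
print(ing_words)

# "kingdom" contains "ing" but no word boundary follows it, so there is no match
print(re.search(r"\S+ing\b", "kingdom"))
```

Note that "in" is not matched: \S+ needs at least one character before "ing".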
20
Perl regex global variables \b or \b{wb}
other boundary metacharacters: ^ (beginning of line), $ (end of line)
21
Perl regex: Unicode and \b
\b{wb} Note: global match in while-loop Note: .*? is the non-greedy version of .*
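The two notes can be illustrated in Python, where .* and .*? have the same greedy and non-greedy meaning, and re.finditer plays a role similar to a global match in a while-loop. The tagged string is an arbitrary example; \b{wb} itself is Perl-specific and is not shown.

```python
import re

s = "<b>bold</b> and <i>italic</i>"

greedy = re.search(r"<.*>", s).group(0)    # .* grabs as much as possible
lazy = re.search(r"<.*?>", s).group(0)     # .*? grabs as little as possible
print(greedy)
print(lazy)

# re.finditer walks through successive matches, much like
# while ($s =~ /.../g) { ... } in Perl
tags = []
for m in re.finditer(r"<.*?>", s):
    tags.append(m.group(0))
print(tags)
```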
22
Perl regex: Unicode and \w
\w is [0-9A-Za-z_]
Definition is expanded for Unicode:
  use utf8;                  # pragma
  use open qw(:std :utf8);   # pragma
  my $str = "school école École šola trường स्कूल škole โรงเรียน";
  @words = ($str =~ /(\w+)/g);   # list context
  foreach $word (@words) { print "$word\n" }
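Python 3 makes a similar distinction: for str patterns \w is Unicode-aware by default, and the re.ASCII flag narrows it to [a-zA-Z0-9_]. The sketch below uses only two accented words, written as escapes, and makes no claim about how combining marks in other scripts are handled.

```python
import re

# Escapes are used so the sketch does not depend on the file's encoding:
# \u00e9 is e-acute (école), \u0161 is s-caron (šola)
s = "school \u00e9cole \u0161ola"

unicode_words = re.findall(r"\w+", s)            # str patterns are Unicode-aware
ascii_words = re.findall(r"\w+", s, re.ASCII)    # re.ASCII: \w is [a-zA-Z0-9_]
print(unicode_words)
print(ascii_words)
```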
23
Chapter 2: JM
Why?
  * means zero or more repetitions of the previous char/expr
  . means any single character
  ? means previous char/expr is optional
24
Chapter 2: JM
Precedence of operators
Example: Column 1 Column 2 Column 3 …
  /Column [0-9]+ */
  /(Column [0-9]+ *)*/
  /house(cat(s|)|)/ (| = disjunction; ? = optional)
Perl: in a regular expression, the text matched within a pair of parentheses is stored in global variables $1 (and $2 and so on).
  (?: … ) group but exclude from storage
Precedence Hierarchy: space
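The effect of parentheses on repetition and on stored groups can be checked with a short Python sketch. Python has no $1 variables; m.group(1) is the nearest equivalent.

```python
import re

row = "Column 1 Column 2 Column 3"

# Without parentheses, * applies only to the space before it
one = re.fullmatch(r"Column [0-9]+ *", row)
# With parentheses, * repeats the whole group
many = re.fullmatch(r"(Column [0-9]+ *)*", row)
print(one is None, many is not None)

# Captured groups: m.group(1), m.group(2) correspond to Perl's $1, $2
m = re.fullmatch(r"house(cat(s|)|)", "housecats")
print(m.group(1), m.group(2))

# (?: ... ) groups without storing, so the inner (s|) becomes group 1
n = re.fullmatch(r"house(?:cat(s|)|)", "housecats")
print(n.groups())
```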
25
Online regex tester
26
Perl regex
returns 1 (true) or "" (empty if false)
A shortcut: list context for matching returns a list
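Python draws a comparable line between testing for a match and collecting what matched. The sketch below is an approximate analogy, not an exact equivalent of Perl's scalar and list contexts; the date-like string is an arbitrary example.

```python
import re

s = "2025-03-14"

# Scalar-like use: a match object is truthy, None is falsy
ok = bool(re.search(r"\d+", s))

# List-like use: .groups() gives every captured group at once, similar to
# ($y, $m, $d) = $s =~ /(\d+)-(\d+)-(\d+)/ in Perl
year, month, day = re.search(r"(\d+)-(\d+)-(\d+)", s).groups()

# re.findall returns all matches as a list, similar to a /g match in list context
parts = re.findall(r"\d+", s)
print(ok, year, month, day, parts)
```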