CSCE 590 Web Scraping Lecture 4

Slides:



Advertisements
Similar presentations
Python: Regular Expressions
Advertisements

ISBN Regular expressions Mastering Regular Expressions by Jeffrey E. F. Friedl –(on reserve.
1 Regular Expressions & Automata Nelson Padua-Perez Bill Pugh Department of Computer Science University of Maryland, College Park.
Data types and variables
Regular expressions Mastering Regular Expressions by Jeffrey E. F. Friedl Linux editors and commands (e.g.
Chapter 2 Data Types, Declarations, and Displays
Regular Expressions & Automata Fawzi Emad Chau-Wen Tseng Department of Computer Science University of Maryland, College Park.
1 Overview Regular expressions Notation Patterns Java support.
Scripting Languages Chapter 8 More About Regular Expressions.
CPSC 388 – Compiler Design and Construction
Lesson 3 – Regular Expressions Sandeepa Harshanganie Kannangara MBCS | B.Sc. (special) in MIT.
Last Updated March 2006 Slide 1 Regular Expressions.
Language Recognizer Connecting Type 3 languages and Finite State Automata Copyright © – Curt Hill.
Regular Expression Darby Tien-Hao Chang (a.k.a. dirty) Department of Electrical Engineering, National Cheng Kung University.
System Programming Regular Expressions Regular Expressions
Pattern matching with regular expressions A common file processing requirement is to match strings within the file to a standard form, e.g. address.
 Text Manipulation and Data Collection. General Programming Practice Find a string within a text Find a string ‘man’ from a ‘A successful man’
Introduction to Computing Using Python Regular expressions Suppose we need to find all addresses in a web page How do we recognize addresses?
REGULAR EXPRESSIONS. Lexical Analysis Lexical analysers can be constructed by programs such as LEX These programs employ as input a description of the.
Python Regular Expressions Easy text processing. Regular Expression  A way of identifying certain String patterns  Formally, a RE is:  a letter or.
1 CSC 594 Topics in AI – Text Mining and Analytics Fall 2015/16 4. Document Search and Regular Expressions.
COMP3190: Principle of Programming Languages DFA and its equivalent, scanner.
Regular Expressions Regular Expressions. Regular Expressions  Regular expressions are a powerful string manipulation tool  All modern languages have.
Working with Forms and Regular Expressions Validating a Web Form with JavaScript.
When you read a sentence, your mind breaks it into tokens—individual words and punctuation marks that convey meaning. Compilers also perform tokenization.
Regular Expression What is Regex? Meta characters Pattern matching Functions in re module Usage of regex object String substitution.
©Brooks/Cole, 2001 Chapter 9 Regular Expressions ( 정규수식 )
CSC 4630 Meeting 21 April 4, Return to Perl Where are we? What is confusing? What practice do you need?
Introducing Python CS 4320, SPRING Lexical Structure Two aspects of Python syntax may be challenging to Java programmers Indenting ◦Indenting is.
May 2008CLINT-LIN Regular Expressions1 Introduction to Computational Linguistics Regular Expressions (Tutorial derived from NLTK)
CS 330 Programming Languages 10 / 02 / 2007 Instructor: Michael Eckmann.
Overview of Previous Lesson(s) Over View  Symbol tables are data structures that are used by compilers to hold information about source-program constructs.
Copyright © Curt Hill Regular Expressions Providing a Search Pattern.
Regular Expressions CS 2204 Class meeting 6 Created by Doug Bowman, 2001 Modified by Mir Farooq Ali, 2002.
LING/C SC/PSYC 438/538 Lecture 8 Sandiway Fong. Adminstrivia Homework 4 not yet graded …
1 Lecture 9 Shell Programming – Command substitution Regular expressions and grep Use of exit, for loop and expr commands COP 3353 Introduction to UNIX.
Notes on Python Regular Expressions and parser generators (by D. Parson) These are the Python supplements to the author’s slides for Chapter 1 and Section.
CS 203: Introduction to Formal Languages and Automata
CGS – 4854 Summer 2012 Web Site Construction and Management Instructor: Francisco R. Ortega Chapter 5 Regular Expressions.
Standard Types and Regular Expressions CS 480/680 – Comparative Languages.
Exercise Solution for Exercise (a) {1,2} {3,4} a b {6} a {5,6,1} {6,2} {4} {3} {5,6} { } b a b a a b b a a b a,b b b a.
1 Compiler Construction (CS-636) Muhammad Bilal Bashir UIIT, Rawalpindi.
CS 330 Programming Languages 09 / 30 / 2008 Instructor: Michael Eckmann.
ICS611 Lex Set 3. Lex and Yacc Lex is a program that generates lexical analyzers Converting the source code into the symbols (tokens) is the work of the.
Lecture 2 Compiler Design Lexical Analysis By lecturer Noor Dhia
Lesson 4 String Manipulation. Lesson 4 In many applications you will need to do some kind of manipulation or parsing of strings, whether you are Attempting.
Regular Expressions Copyright Doug Maxwell (
RE Tutorial.
Regular Expressions Upsorn Praphamontripong CS 1110
Theory of Computation Lecture #
Strings and Serialization
Looking for Patterns - Finding them with Regular Expressions
Notes on Python Regular Expressions and parser generators (by D
Chapter 3 Lexical Analysis.
CSC 594 Topics in AI – Natural Language Processing
Regular Expressions and perl
Lecture 9 Shell Programming – Command substitution
Python regular expressions
CSC 594 Topics in AI – Natural Language Processing
Pattern Matching in Strings
LING/C SC/PSYC 438/538 Lecture 12 Sandiway Fong.
CS 1111 Introduction to Programming Fall 2018
Compiler Construction
Regular Expressions and Grep
CSCE 590 Web Scraping Lecture 5
CSCI The UNIX System Regular Expressions
Lecture 5 Scanning.
Python regular expressions
LING 388: Computers and Language
Presentation transcript:

CSCE 590 Web Scraping Lecture 4 Topics Beautiful Soup Libraries, installing, environments Readings: Chapter 1 Python tutorial January 10, 2017

Overview Last Time: Today: References Code in text: Webpage of book: BeautifulSoup Installing Libraries Today: Regular Expressions References Code in text: . Webpage of book:

Reading and writing files f = open (filename, “r”) all_of_it = f.read() line = f.readline() for line in f: print(line, end='') f.close()

Writing files f = open (filename, “w”) f.write(“Some string”) f.write(“xbar=“, end=“ “) f.write(xbar)

Csv files Input file = eggs.new import csv with open('eggs.csv', newline='') as csvfile: spamreader = csv.reader(csvfile, delimiter=',', quotechar='|') for row in spamreader: print(', '.join(row)) print (row[1]) Produces the output one, two, three, four two five, six six Input file = eggs.new one, two, three, four five, six seven, eight, nine

Regular Expressions Base regular expressions Any character “a” is a regular expression that matches the string “a” Recursive Def.: If r and s are regular expressions that matches the set of strings R and S respectively then: Then rs is a regular expression that matches string for which the first part matches r and the second part matches s Then r | s is a regular expression that matches any string that matches either r or s r* is a regular expression that matches zero or more occurrences of strings that match r.

Examples of base regular expressions (c | b)at matches the set of strings {cat, bat} a*b matches {b, ab, aab, aaab, aaaab …}

Regular expressions, Languages and recognizers Let r be a regular expression then L(r) is the set of all strings that match r (this is called the language denoted by r) Finite automata are machines that recognize languages (sets of strings) Example

Kleene Closure (*) and Alternation '|' A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B. To match a literal '|', use \|, or enclose it inside a character class, as in [|]. '*' Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible. ab* will match ‘a’, ‘ab’, or ‘a’ followed by any number of ‘b’s.

Positional Special Characters ‘^’, ‘$’ '^' (Caret.) Matches the start of the string, and in MULTILINE mode also matches immediately after each newline. '$' Matches the end of the string or just before the newline at the end of the string, and in MULTILINE mode also matches before a newline. foo matches both ‘foo’ and ‘foobar’, while the regular expression foo$ matches only ‘foo’. More interestingly, searching for foo.$ in 'foo1\nfoo2\n' matches ‘foo2’ normally, but ‘foo1’ in MULTILINE mode; searching for a single $ in 'foo\n' will find two (empty) matches: one just before the newline, and one at the end of the string.

Variations on repetition '+' Causes the resulting RE to match 1 or more repetitions of the preceding RE. ab+ will match ‘a’ followed by any non-zero number of ‘b’s; it will not match just ‘a’. '?' Causes the resulting RE to match 0 or 1 repetitions of the preceding RE. ab? will match either ‘a’ or ‘ab’.

Non-Greedy -- minimal matching *?, +?, ?? The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn’t desired; if the RE <.*> is matched against <a> b <c>, it will match the entire string, and not just <a>. Adding ? after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using the RE <.*?> will match only <a>.

{m,n} Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as many repetitions as possible. For example, a{3,5} will match from 3 to 5 'a' characters. Omitting m specifies a lower bound of zero, and omitting n specifies an infinite upper bound. As an example, a{4,}b will match aaaab or a thousand 'a' characters followed by a b, but not aaab. {m,n}? Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as few repetitions as possible. This is the non-greedy version of the previous qualifier. For example, on the 6-character string 'aaaaaa', a{3,5} will match 5 'a' characters, while a{3,5}? will only match 3 characters.

Quoting ‘\’ Either escapes special characters (permitting you to match characters like '*', '?', and so forth), or signals a special sequence; special sequences are discussed below.

Character Classes [ … ] Used to indicate a set of characters. In a set: Characters can be listed individually, e.g. [amk] will match 'a', 'm', or 'k'. Ranges of characters – for example [a-m] will match any lowercase ASCII letter from a to m, [0-5][0-9] will match all the two-digits numbers from 00 to 59, and [0-9A-Fa-f] will match any hexadecimal digit. If - is escaped (e.g. [a\-z]) or if it’s placed as the first or last character (e.g. [a-]), it will match a literal '-'. Special characters lose their special meaning inside sets. For example, [(+*)] will match any of the literal characters '(', '+', '*', or ')'. Character classes such as \w or \S (defined below) are also accepted inside a set, although the characters they match depends on whether ASCII or LOCALE mode is in force.

Complemented Character Classes [^ … ] Characters that are not within a range can be matched by complementing the set. If the first character of the set is '^', all the characters that are not in the set will be matched. For example, [^5] will match any character except '5', and [^^] will match any character except '^'. ^ has no special meaning if it’s not the first character in the set. To match a literal ']' inside a set, precede it with a backslash, or place it at the beginning of the set. For example, both [()[\]{}] and []()[{}] will both match a parenthesis.

Grouping (...) Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group; The contents of a group can be retrieved after a match has been performed, and can be matched later in the string with the \number special sequence

Grouping examples >>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist") >>> m.group(0) # The entire match 'Isaac Newton' >>> m.group(1) # The first parenthesized subgroup. 'Isaac' >>> m.group(2) # The second parenthesized subgroup. 'Newton' >>> m.group(1, 2) # Multiple arguments give us a tuple. ('Isaac', 'Newton')

Special characters classes \w matches any character that could be part of a word \W not … \s Matches Unicode whitespace characters \S not … \d Matches any Unicode decimal digit \D not … \b Matches the empty string, but only at the beginning or end of a word \B not … \A Matches only at the start of the string \Z Matches only at the end of the string \\

Standard escape characters \a \b \f \n \r \t \u \U \v \x \\

Machines for matching

re.compile re.compile(pattern, flags=0) Compile a regular expression pattern into a regular expression object, which can be used for matching using its match() and search() methods, described below. The sequence prog = re.compile(pattern) result = prog.match(string) is logically equivalent to result = re.match(pattern, string)

Matching re.match(pattern, string, flags=0) If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match. re.search(pattern, string, flags=0) Scan through string looking for the first location where the regular expression pattern produces a match, and return a corresponding match object. Return None if no position in the string matches the pattern Note that even in MULTILINE mode, re.match() will only match at the beginning of the string and not at the beginning of each line. If you want to locate a match anywhere in string, use search() instead (see also search() vs. match()).

Examples

re.fullmatch(pattern, string, flags=0) If the whole string matches the regular expression pattern, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.

re.split() re.split(pattern, string, maxsplit=0, flags=0) Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list. If maxsplit is nonzero, at most maxsplit splits occur, and the remainder of the string is returned as the final element of the list.

Split Examples >>> re.split('\W+', 'Words, words, words.') ['Words', 'words', 'words', ''] >>> re.split('(\W+)', 'Words, words, words.') ['Words', ', ', 'words', ', ', 'words', '.', ''] >>> re.split('\W+', 'Words, words, words.', 1) ['Words', 'words, words.'] >>> re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE) ['0', '3', '9']

re.sub(pattern, replacement, text) sub() replaces every occurrence of a pattern with a string or the result of a function. This example demonstrates using sub() with a function to “munge” text, or randomize the order of all the characters in each word of a sentence except for the first and last characters: >>> def repl(m): ... inner_word = list(m.group(2)) ... random.shuffle(inner_word) ... return m.group(1) + "".join(inner_word) + m.group(3) >>> text = "Professor Abdolmalek, please report your absences promptly." >>> re.sub(r"(\w)(\w+)(\w)", repl, text) 'Poefsrosr Aealmlobdk, pslaee reorpt your abnseces plmrptoy.' 'Pofsroser Aodlambelk, plasee reoprt yuor asnebces potlmrpy.'

Pythex.org

Displaymatch helper function def displaymatch(match): if match is None: return None return '<Match: %r, groups=%r>' % (match.group(), match.groups())

Displaymatch examples >>> valid = re.compile(r"^[a2-9tjqk]{5}$") >>> displaymatch(valid.match("akt5q")) # Valid. "<Match: 'akt5q', groups=()>" >>> displaymatch(valid.match("akt5e")) # Invalid. >>> displaymatch(valid.match("akt")) # Invalid. >>> displaymatch(valid.match("727ak")) # Valid. "<Match: '727ak', groups=()>“ >>> pair = re.compile(r".*(.).*\1") >>> displaymatch(pair.match("717ak")) # Pair of 7s. "<Match: '717', groups=('7',)>" >>> displaymatch(pair.match("718ak")) # No pairs. >>> displaymatch(pair.match("354aa")) # Pair of aces. "<Match: '354aa', groups=('a',)>"

re.findall findall() matches all occurrences of a pattern, not just the first one as search() does Find the Adverbs example >>> text = "He was carefully disguised but captured quickly by police." >>> re.findall(r"\w+ly", text) ['carefully', 'quickly']

m.start and m.end >>> text = "He was carefully disguised but captured quickly by police." >>> for m in re.finditer(r"\w+ly", text): ... print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0))) 07-16: carefully 40-47: quickly

Raw strings (again) r“string” helps keep regular expressions sane. Without it, every backslash ('\') in a regular expression would have to be prefixed with another one to escape it. >>> re.match(r"\W(.)\1\W", " ff ") <_sre.SRE_Match object; span=(0, 4), match=' ff '> >>> re.match("\\W(.)\\1\\W", " ff ") >>> re.match(r"\\", r"\\") <_sre.SRE_Match object; span=(0, 1), match='\\'> >>> re.match("\\\\", r"\\")

Phonebook example >>> text = """Ross McFluff: 834.345.1254 155 Elm Street ... ... Ronald Heathmore: 892.345.3428 436 Finley Avenue ... Frank Burger: 925.541.7625 662 South Dogwood Way ... Heather Albrecht: 548.326.4584 919 Park Place"""

Using split to remove extra lines >>> entries = re.split("\n+", text) >>> entries ['Ross McFluff: 834.345.1254 155 Elm Street', 'Ronald Heathmore: 892.345.3428 436 Finley Avenue', 'Frank Burger: 925.541.7625 662 South Dogwood Way', 'Heather Albrecht: 548.326.4584 919 Park Place']

maxsplit parameter of split() >>> [re.split(":? ", entry, 3) for entry in entries] [['Ross', 'McFluff', '834.345.1254', '155 Elm Street'], ['Ronald', 'Heathmore', '892.345.3428', '436 Finley Avenue'], ['Frank', 'Burger', '925.541.7625', '662 South Dogwood Way'], ['Heather', 'Albrecht', '548.326.4584', '919 Park Place']]

Now maxsplit = 4 >>> [re.split(":? ", entry, 4) for entry in entries] [['Ross', 'McFluff', '834.345.1254', '155', 'Elm Street'], ['Ronald', 'Heathmore', '892.345.3428', '436', 'Finley Avenue'], ['Frank', 'Burger', '925.541.7625', '662', 'South Dogwood Way'], ['Heather', 'Albrecht', '548.326.4584', '919', 'Park Place']]