LING/C SC/PSYC 438/538 Lecture 12 Sandiway Fong.

Slides:



Advertisements
Similar presentations
Regular Expressions Pattern and Match objects Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein.
Advertisements

LING/C SC/PSYC 438/538 Lecture 11 Sandiway Fong. Administrivia Homework 3 graded.
LING/C SC/PSYC 438/538 Lecture 6 9/13 Sandiway Fong.
LING/C SC/PSYC 438/538 Lecture 12 Sandiway Fong. Administrivia We'll postpone Homework 4 review until next week …
LING/C SC/PSYC 438/538 Lecture 4 Sandiway Fong. Administrivia Homework 1 graded – you should have gotten an from me.
LING/C SC/PSYC 438/538 Lecture 8 Sandiway Fong. Adminstrivia.
Modern Information Retrieval Chapter 4 Query Languages.
Regular Expressions Comp 2400: Fall 2008 Prof. Chris GauthierDickey.
LING/C SC/PSYC 438/538 Lecture 5 9/8 Sandiway Fong.
CS324e - Elements of Graphics and Visualization Java Intro / Review.
LING/C SC/PSYC 438/538 Computational Linguistics Sandiway Fong Lecture 4: 8/30.
LING/C SC/PSYC 438/538 Lecture 4 Sandiway Fong. Continuing with Perl Homework 3: first Perl homework – due Sunday by midnight – one PDF file, by .
LING/C SC/PSYC 438/538 Lecture 7 9/15 Sandiway Fong.
LING 388: Language and Computers Sandiway Fong Lecture 6: 9/15.
Agenda Regular Expressions (Appendix A in Text) –Definition / Purpose –Commands that Use Regular Expressions –Using Regular Expressions –Using the Replacement.
LING/C SC/PSYC 438/538 Lecture 3 8/30 Sandiway Fong.
Regular Expressions in PHP. Supported RE’s The most important set of regex functions start with preg. These functions are a PHP wrapper around the PCRE.
LING/C SC/PSYC 438/538 Lecture 3 Sandiway Fong. Administrivia Homework 2 graded.
Michael Olivero Microsoft Student Ambassador for FIU Special Guest: FIU Graduate Student Miguel Gonzalez.
Javascript’s RegExp. RegExp object Javascript has an Object which compiles Regular Expressions into a Finite State Machine The F.S.M. is internal, and.
Pushdown Automata Chapters Generators vs. Recognizers For Regular Languages: –regular expressions are generators –FAs are recognizers For Context-free.
Pattern Matching II. Greedy Matching When dealing with quantifiers, Perl’s pattern matcher is by default greedy. For example, –$_ = “Bob sat next to the.
LING/C SC/PSYC 438/538 Lecture 8 Sandiway Fong. Adminstrivia Homework 4 not yet graded …
LING/C SC/PSYC 438/538 Lecture 14 Sandiway Fong. Administrivia Homework 6 graded.
Michael Kovalchik CS 265, Fall  Parenthesis group parts of expressions together  “/CS265|CS270/” => “/CS(265|270)/”  Groups can be nested  “/Perl|Pearl/”
LING/C SC/PSYC 438/538 Lecture 9 Sandiway Fong. Adminstrivia Homework 4 graded Homework 5 out today – Due Saturday night by midnight – (Gives me Sunday.
LING/C SC/PSYC 438/538 Online Lecture 7 Sandiway Fong.
Dept. of Animal Breeding and Genetics Programming basics & introduction to PERL Mats Pettersson.
Lecture # 16. Applications of Incrementing and Complementing machines 1’s complementing and incrementing machines which are basically Mealy machines are.
Prof. Alfred J Bird, Ph.D., NBCT Office – McCormick 3rd floor 607.
Parallel embedded system design lab 이청용 Chapter 2 (2.6~2.7)
LING/C SC/PSYC 438/538 Lecture 5 Sandiway Fong.
CSE 374 Programming Concepts & Tools
Regular Expressions in Perl
Chapter 2 Scanning – Part 1 June 10, 2018 Prof. Abdelaziz Khamis.
Lexical Analysis (Sections )
LING/C SC/PSYC 438/538 Lecture 11 Sandiway Fong.
LING/C SC/PSYC 438/538 Lecture 10 Sandiway Fong.
LING/C SC/PSYC 438/538 Lecture 4 Sandiway Fong.
Grep Allows you to filter text based upon several different regular expression variants Basic Extended Perl.
LING/C SC/PSYC 438/538 Lecture 8 Sandiway Fong.
LING/C SC/PSYC 438/538 Lecture 7 Sandiway Fong.
LING 388: Computers and Language
LING/C SC/PSYC 438/538 Lecture 4 Sandiway Fong.
LING/C SC/PSYC 438/538 Lecture 3 Sandiway Fong.
LING/C SC/PSYC 438/538 Lecture 6 Sandiway Fong.
LING 408/508: Computational Techniques for Linguists
LING/C SC/PSYC 438/538 Lecture 10 Sandiway Fong.
LING 388: Computers and Language
CS 3304 Comparative Languages
Regular Expressions
i206: Lecture 19: Regular Expressions, cont.
Time Complexity We use a multitape Turing machine
LING/C SC/PSYC 438/538 Lecture 21 Sandiway Fong.
Lists in Python Outputting lists.
Programming Languages
LING/C SC/PSYC 438/538 Lecture 15 Sandiway Fong.
LING/C SC/PSYC 438/538 Lecture 18 Sandiway Fong.
LING/C SC/PSYC 438/538 Lecture 13 Sandiway Fong.
LING/C SC/PSYC 438/538 Lecture 11 Sandiway Fong.
LING/C SC/PSYC 438/538 Lecture 17 Sandiway Fong.
String Processing 1 MIS 3406 Department of MIS Fox School of Business
1.5 Regular Expressions (REs)
LING/C SC/PSYC 438/538 Lecture 5 Sandiway Fong.
LING/C SC/PSYC 438/538 Lecture 7 Sandiway Fong.
LING/C SC/PSYC 438/538 Lecture 4 Sandiway Fong.
LING/C SC/PSYC 438/538 Lecture 3 Sandiway Fong.
LING/C SC/PSYC 438/538 Lecture 8 Sandiway Fong.
Chapter 13 Control Structures
LING/C SC/PSYC 438/538 Lecture 12 Sandiway Fong.
Presentation transcript:

LING/C SC/PSYC 438/538 Lecture 12 Sandiway Fong

Chapter 2: JM s/([0-9]+)/<\1>/ Backreferences technically give Perl regexs more expressive power than Finite State Automata. I.e. no finite state machine can ever do this.

Shortest vs. Greedy Matching default behavior in Perl RE match: take the longest possible matching string aka greedy matching This behavior can be changed, see next slide

Shortest vs. Greedy Matching from http://www.perl.com/doc/manual/html/pod/perlre.html Example: $_ = "The food is under the bar in the barn."; if ( /foo(.*?)bar/ ) { print ”matched <$1>\n"; } Output: greedy (.*): matched <d is under the bar in the > shortest (.*?): matched <d is under the > Notes: ? immediately following a repetition operator like * (or +) makes the operator work in non-greedy mode (.*?) (.*)

Regex: exponential time Regex search is supposed to be fast but searching is not necessarily proportional to the length of the string (or corpus) being searched in fact, Perl RE matching can can take exponential time (in length) non-deterministic may need to backtrack (revisit) if it matches incorrectly part of the way through Consider a?a?a?aaa matching against the string aaa Let's do a perl –e time length exponential time length linear

Regex: exponential time Now consider scaling up a?a?a?aaa, i.e. (a?)nan matching against an Reference: https://swtch.com/~rsc/regexp/regexp1.html

Global Matching: scalar context g flag in the condition of a while-loop \w is a word character []A-Za-z0-9_]

Global Matching: scalar context Python: please read https://docs.python.org/3/howto/regex.html

Global Matching: list context g flag in list context Disjunction: /(\w+)s|(\w+[a-rt-z])/g; /\w+[a-rt-z]/g

Split @array = split /re/, string splits string into a list of substrings split by re. Each substring is stored as an element of @array. Examples (from perlrequick tutorial):

Split m allows you to redefine / the regex delimiter character

Python: split

Matched Positions

Matched Positions

Variables in regex Example: qr/regex/ $text =~ /(a?){$n}a{$n}/ $n defined earlier {..} is an repetition operator qr/regex/ This operator quotes (and possibly compiles) regex as a regular expression. $re = qr/regex/; $line =~ m!$re $re!;

Zipf's Law Zipf's law states that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. See: http://demonstrations.wolfram.com/ZipfsLawAppliedToWordAndLetterFrequencies/ Brown Corpus (1,015,945 words): only 135 words are needed to account for half the corpus. On a Log – Log scale: almost straight line http://www.learnholistically.it/esp-clil/wfk2.htm https://finnaarupnielsen.wordpress.com/2013/10/22/zipf-plot-for-word-counts-in-brown-corpus/

Character Frequency Counting Sample code is rather interesting (-e flag): (?{ Perl code }) Slightly modified but easier to read: note: lowercase

Character Frequency Counting