Ruby Regular Expressions

Slides:



Advertisements
Similar presentations
LING/C SC/PSYC 438/538 Lecture 11 Sandiway Fong. Administrivia Homework 3 graded.
Advertisements

Regular Expressions and DFAs COP 3402 (Summer 2014)
Regular Expressions using Ruby Assignment: Midterm Class: CPSC5135U – Programming Languages Teacher: Dr. Woolbright Student: James Bowman.
AND FINITE AUTOMATA… Ruby Regular Expressions. Why Learn Regular Expressions? RegEx are part of many programmer’s tools  vi, grep, PHP, Perl They provide.
1 Introduction to Computability Theory Lecture3: Regular Expressions Prof. Amos Israeli.
1 Introduction to Computability Theory Lecture3: Regular Expressions Prof. Amos Israeli.
Finite Automata and Regular Expressions i206 Fall 2010 John Chuang Some slides adapted from Marti Hearst.
LING 388: Language and Computers Sandiway Fong Lecture 2: 8/23.
1 Foundations of Software Design Lecture 23: Finite Automata and Context-Free Grammars Marti Hearst Fall 2002.
Regular expressions Mastering Regular Expressions by Jeffrey E. F. Friedl Linux editors and commands (e.g.
Regular Expressions and Automata Chapter 2. Regular Expressions Standard notation for characterizing text sequences Used in all kinds of text processing.
1 Scanning Aaron Bloomfield CS 415 Fall Parsing & Scanning In real compilers the recognizer is split into two phases –Scanner: translate input.
CPSC 388 – Compiler Design and Construction
Topic #3: Lexical Analysis
Language Recognizer Connecting Type 3 languages and Finite State Automata Copyright © – Curt Hill.
Regular Expression Darby Tien-Hao Chang (a.k.a. dirty) Department of Electrical Engineering, National Cheng Kung University.
AND OTHER LANGUAGES… Ruby Regular Expressions. Why Learn Regular Expressions? RegEx are part of many programmer’s tools  vi, grep, PHP, Perl They provide.
Perl and Regular Expressions Regular Expressions are available as part of the programming languages Java, JScript, Visual Basic and VBScript, JavaScript,
Lexical Analysis I Specifying Tokens Lecture 2 CS 4318/5531 Spring 2010 Apan Qasem Texas State University *some slides adopted from Cooper and Torczon.
LING 388: Language and Computers Sandiway Fong Lecture 6: 9/15.
4b 4b Lexical analysis Finite Automata. Finite Automata (FA) FA also called Finite State Machine (FSM) –Abstract model of a computing entity. –Decides.
1 Course Overview PART I: overview material 1Introduction 2Language processors (tombstone diagrams, bootstrapping) 3Architecture of a compiler PART II:
COMP3190: Principle of Programming Languages DFA and its equivalent, scanner.
1 Course Overview PART I: overview material 1Introduction 2Language processors (tombstone diagrams, bootstrapping) 3Architecture of a compiler PART II:
Regular Expressions in Perl CS/BIO 271 – Introduction to Bioinformatics.
12. Regular Expressions. 2 Motto: I don't play accurately-any one can play accurately- but I play with wonderful expression. As far as the piano is concerned,
Javascript’s RegExp. RegExp object Javascript has an Object which compiles Regular Expressions into a Finite State Machine The F.S.M. is internal, and.
Overview of Previous Lesson(s) Over View  Symbol tables are data structures that are used by compilers to hold information about source-program constructs.
Copyright © Curt Hill Regular Expressions Providing a Search Pattern.
LING/C SC/PSYC 438/538 Lecture 8 Sandiway Fong. Adminstrivia Homework 4 not yet graded …
CIT 383: Administrative ScriptingSlide #1 CIT 383: Administrative Scripting Regular Expressions.
CMSC 330: Organization of Programming Languages Theory of Regular Expressions Finite Automata.
Standard Types and Regular Expressions CS 480/680 – Comparative Languages.
CSC3315 (Spring 2009)1 CSC 3315 Lexical and Syntax Analysis Hamid Harroud School of Science and Engineering, Akhawayn University
Pattern Matching: Simple Patterns. Introduction Programmers often need to scan a file, directory, etc. for a specific substring. –Find all files that.
Deterministic Finite Automata Nondeterministic Finite Automata.
COMP3190: Principle of Programming Languages DFA and its equivalent, scanner.
Department of Software & Media Technology
Regular Expressions Upsorn Praphamontripong CS 1110
CS510 Compiler Lecture 2.
Looking for Patterns - Finding them with Regular Expressions
Chapter 3 Lexical Analysis.
CSC 594 Topics in AI – Natural Language Processing
Chapter 2 Scanning – Part 1 June 10, 2018 Prof. Abdelaziz Khamis.
Finite-State Machines (FSMs)
Lexical analysis Finite Automata
Compilers Welcome to a journey to CS419 Lecture5: Lexical Analysis:
Regular Expression - Intro
RegExps & DFAs CS 536.
Finite-State Machines (FSMs)
Recognizer for a Language
CSC 594 Topics in AI – Natural Language Processing
Some slides by Elsa L Gunter, NJIT, and by Costas Busch
CSC NLP - Regex, Finite State Automata
Regular Expressions.
4b Lexical analysis Finite Automata
Regular Expressions
Regular Expressions
Regular Expressions and Automata in Language Analysis
Compiler Construction
4b Lexical analysis Finite Automata
CPSC 503 Computational Linguistics
CSE 303 Concepts and Tools for Software Development
CIT 383: Administrative Scripting
Regular Expression in Java 101
Regular Expression: Pattern Matching
Lecture 5 Scanning.
Regular Expressions and Lexical Analysis
Regular Expressions.
Presentation transcript:

Ruby Regular Expressions and finite automata…

Why Learn Regular Expressions? RegEx are part of many programmer’s tools vi, grep, PHP, Perl They provide powerful search (via pattern matching) capabilities Simple regex are easy, but more advanced patterns can be created as needed (use with care, may not be efficient) ruby syntax closely follows Perl 5 Handy resource: rubular.com From: http://www.websiterepairguy.com/articles/re/12_re.html

Outline Regular expression basics Finite state automata how to create a pattern how to match using =~ Finite state automata Working with match data Working with named capture Regular expression objects Regexp.new/Regex.compile/Regex.union

Regular Expressions the Basics

Regular Expression patterns Constructed as /pattern/ /pattern/options %r{pattern} %r{pattern}options Options provide additional info about how pattern match should be done, for example: i – ignore case m – multiline, newline is an ordinary character to match u,e,s,n – specifies encoding, such as UTF-8 (u) From: http://www.ruby-doc.org/docs/ProgrammingRuby/html/language.html#UJ

Pattern Matching =~ is pattern match operator string =~ pattern OR pattern =~ string Returns the index of the first match Returns nil if no matches Note that nil doesn’t show when printing, but you can test for it

Literal characters /ruby/ /ruby/i

Character classes /[0-9]/ match digit /[^0-9]/ match any non-digit /[aeiou]/ match vowel /[Rr]uby/ match Ruby or ruby

Anchors – location of exp /^Ruby/ # Ruby at start of line /Ruby$/ # Ruby at end of line /\ARuby/ # Ruby at start of line /Ruby\Z/ # Ruby at end of line /\bRuby\b/ # Matches Ruby at word boundary Using \A and \Z are preferred in Ruby (vs $ and ^) http://stackoverflow.com/questions/577653/difference-between-a-z-and-in-ruby-regular-expressions

Alternatives /cow|pig|sheep/ # match cow or pig or sheep

Special character classes /./ #match any character except newline /./m # match any character, multiline /\d/ # matches digit, equivalent to [0-9] /\D/ #match non-digit, equivalent to [^0-9] /\s/ #match whitespace /[ \r\t\n\f]/ \f is form feed /\S/ # non-whitespace /\w/ # match single word chars /[A-Za-z0-9_]/ /\W/ # non-word characters NOTE: must escape any special characters used to create patterns, such as . \ + etc.

Repetition + matches one or more occurrences of preceding expression e.g., /[0-9]+/ matches “1” “11” or “1234” but not empty string ? matches zero or one occurrence of preceding expression e.g., /-?[0-9]+/ matches signed number with optional leading minus sign * matches zero or more copies of preceding expression e.g., /yes[!]*/ matches “yes” “yes!” “yes!!” etc.

More Repetition /\d{3}/ # matches 3 digits /\d{3,}/ # matches 3 or more digits /\d{3,5}/ # matches 3, 4 or 5 digits

Non-greedy Repetition Assume s = <ruby>perl> /<.*>/ # greedy repetition, matches <ruby>perl> /<.*?>/ # non-greedy, matches <ruby> Where might you want to use non-greedy repetition? Extra info, good to know but not on exams etc.

Grouping () can be used to create groups /\D\d+/ # matches non-digit followed by digits, e.g., a1111 /(\D\d)+/ # matches a1b2a3… ([Rr]uby(,\s)?)+ Would this recognize (play with this in rubular) “Ruby” “Ruby, ruby” “Ruby and ruby” “RUBY”

Finite State Automata a brief intro

Finite Automata – formal definition Formally a finite automata is a five-tuple(S,S,, s0, SF) where S is the set of states, including error state Se. S must be finite.  is the alphabet or character set used by recognizer. Typically union of edge labels (transitions between states). (s,c) is a function that encodes transitions (i.e., character c in changes to state s in S. ) s0 is the designated start state SF is the set of final states, drawn with double circle in transition diagram Theory of Computation view – we won’t be too formal in csci400

Simple Example Finite automata to recognize fee and fie: S = {s0, s1, s2, s3, s4, s5, se}  = {f, e, i} (s,c) set of transitions shown above s0 = s0 SF= { s3, s5} Set of words accepted by a finite automata F forms a language L(F). Can also be described by regular expressions. S0 S4 S1 f S3 S5 S2 e i What type of program might need to recognize fee/fie/etc.?

Finite Automata & Regular Expressions /fee|fie/ /f[ei]e/ Note: events/transitions are on the lines. Putting them in the nodes/circles is the #1 mistake. Note 2: end states should be in double lines, see next slide S0 S4 S1 f S3 S5 S2 e i

Another Example: Pascal Identifier Pascal id is a letter followed optionally by letters and digits /[A-Za-z][A-Za-z0-9]*/ A-Za-z0-9 A-Za-z S1 S0

Quick Exercise Go to rubular.com and review RegEx quick reference (same material as prior slides, but more concise) Look up the rules and create both FSA and RE to recognize: C identifier Perl identifier Ruby method identifier Turn in for class participation

RegExp to FSA ? = 0 or 1 [A-Z]?x + = 1 or more [A-Z]+ () = group 1-2 S0 S1 S2

Reg Exp to FSA * = 0 or more [A-Z]+[0-9]* A-Z 0-9 S0 S1 S2 A-Z 0-9

RegExp in Ruby some handy features

MatchData After a successful match, a MatchData object is created. Accessed as $~. Example: "I love petting cats and dogs" =~ /cats/ puts "full string: #{$~.string}" puts "match: #{$~.to_s}" puts "pre: #{$~.pre_match}" puts "post: #{$~.post_match}"

Named Captures str = "Ruby 1.9" if /(?<lang>\w+) (?<ver>\d+\.(\d+)+)/ =~ str puts lang puts ver end Read more: http://blog.bignerdranch.com/1575-refactoring-regular-expressions-with-ruby-1-9-named-captures/ http://www.ruby-doc.org/core-1.9.3/Regexp.html (look for Capturing)

Regexp class Can create regular expressions using Regexp.new or Regexp.compile (synonymous) ruby_pattern = Regexp.new("ruby", Regexp::IGNORECASE) puts ruby_pattern.match("I love Ruby!") => Ruby puts ruby_pattern =~ "I love Ruby!“ => 7

Regexp Union Creates patterns that match any word in a list lang_pattern = Regexp.union("Ruby", "Perl", /Java(Script)?/) puts lang_pattern.match("I know JavaScript") => JavaScript Automatically escapes as needed pattern = Regexp.union("()","[]","{}")

Resources

Some Resources http://www.bluebox.net/about/blog/2013/02/using-regular-expressions-in-ruby-part-1-of-3/ http://www.ruby-doc.org/core-2.0.0/Regexp.html http://rubular.com/ http://coding.smashingmagazine.com/2009/06/01/essential-guide-to-regular-expressions-tools-tutorials-and-resources/ http://www.ralfebert.de/archive/ruby/regex_cheat_sheet/ http://stackoverflow.com/questions/577653/difference-between-a-z-and-in-ruby-regular-expressions (thanks, Austin and Santi)

Topic Exploration http://www.codinghorror.com/blog/2005/02/regex-use-vs-regex-abuse.html http://programmers.stackexchange.com/questions/113237/when-you-should-not-use-regular-expressions http://coding.smashingmagazine.com/2009/05/06/introduction-to-advanced-regular-expressions/ http://stackoverflow.com/questions/5413165/ruby-generating-new-regexps-from-strings A little more motivation to use… http://blog.stevenlevithan.com/archives/10-reasons-to-learn-and-use-regular-expressions http://www.websiterepairguy.com/articles/re/12_re.html No longer required – so explore on your own.