LING 388: Language and Computers Sandiway Fong Lecture 2: 8/23.

Administrivia Class:

Administrivia Did you bring your laptop?

Administrivia Did you install Perl yet? –Active State Perl – –install free version

Regular Expressions regular expressions are used in string pattern-matching –important tool in automated searching –formally equivalent to finite-state automata (FSA) and regular grammars popular implementations –Unix grep command line program returns lines matching a regular expression standard part of all Unix-based systems –including MacOS X (command-line interface in Terminal) many shareware/freeware implementations available for Windows XP –just Google and see... –grep functionality is built into many programming languages e.g. Perl –wildcard search in Microsoft Word limited version of regular expressions (not full power) with differences in notation
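As a rough sketch of what grep does ("returns lines matching a regular expression"), the filter below keeps only the lines in which a pattern is found. It uses Python's `re` module rather than grep or Perl, which is an assumption of this example; the helper name `grep` and the sample lines are made up for illustration.

```python
import re

def grep(pattern, lines):
    """Return the lines that contain a match for pattern, roughly as grep does."""
    regex = re.compile(pattern)
    # search() looks for the pattern anywhere in the line, not only at the start
    return [line for line in lines if regex.search(line)]

# Hypothetical sample input
sample = ["the cat sat", "a dog barked", "cats and dogs"]
matches = grep(r"cat", sample)
print(matches)
```

The same helper works with any of the operators introduced later, for example `grep(r"dogs?", sample)`.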

Regular Expressions Historical note –grep : name comes from Unix ed command –g/re/p –“search globally for lines matching the regular expression, and print them” –[Source: –ed is an obscure and difficult-to-use text edit program on Unix systems –doesn’t need a screen display –would work on an ancient teletype

Regular Expressions Formally –a regular expression (regexp) is formed from: an alphabet (= set of characters) operators –a regexp is shorthand for a set of strings (possibly infinite set) (strings are of finite length) Formally –a set of strings is called a language –a language that can be defined by a regular expression is called a regular language (not all languages are regular)
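To make "a regexp is shorthand for a set of strings" concrete, the sketch below enumerates the finite slice of a language up to a length bound. It assumes Python's `re` module as the regexp engine; the function name `language`, the alphabet {a, b} and the bound are illustrative choices, not part of the lecture.

```python
import re
from itertools import product

def language(pattern, alphabet, max_len):
    """List every string over alphabet, up to max_len characters, that the
    whole pattern matches (a finite slice of a possibly infinite language)."""
    regex = re.compile(pattern)
    found = []
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            # fullmatch() requires the entire string to match the pattern
            if regex.fullmatch(s):
                found.append(s)
    return found

ab_star = language(r"(ab)*", "ab", 4)
print(ab_star)  # ['', 'ab', 'abab']
```

Raising the bound keeps producing new members (ababab, ...), which is the sense in which the set is infinite even though each string is finite.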

Regular Expressions alphabet –e.g. {a,b,c,...,z} set of lower case English letters Note: case is important operators –a single symbol a –a^n exactly n occurrences of a, n a positive integer –e.g. a^3 = aaa –a* zero or more occurrences of a –a+ one or more occurrences of a –concatenation two regexps may be concatenated, and the result is also a regexp e.g. abc –disjunction infix operator: | (vertical bar) e.g. a|b –parentheses may be used for disambiguation e.g. gupp(y|ies)
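The operators on this slide can be tried out directly. The sketch below assumes Python's `re` module, where "exactly n occurrences" is written `a{n}`; the helper `accepts` and the test strings are illustrative.

```python
import re

def accepts(pattern, s):
    """True if the whole string s is in the language of pattern."""
    return re.fullmatch(pattern, s) is not None

checks = {
    "exactly three a's": accepts(r"a{3}", "aaa"),
    "a* accepts the empty string": accepts(r"a*", ""),
    "a+ rejects the empty string": not accepts(r"a+", ""),
    "concatenation abc": accepts(r"abc", "abc"),
    "disjunction a|b": accepts(r"a|b", "b"),
    "guppy": accepts(r"gupp(y|ies)", "guppy"),
    "guppies": accepts(r"gupp(y|ies)", "guppies"),
    "gupp alone is rejected": not accepts(r"gupp(y|ies)", "gupp"),
}
for name, ok in checks.items():
    print(name, ok)
```

Without the parentheses, `guppy|ies` would mean "guppy or ies", which is why they are needed for disambiguation.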

Regular Expressions Technically, a+ is not necessary –aa* = a+ “a concatenated with a* (zero or more occurrences of a)” = “one or more occurrences of a” Disjunction –[set of characters] set of characters enclosed in square brackets means match one of the characters –e.g. [aeiou] matches any of the vowels a, e, i, o or u but not d –dash (-) shorthand for a range –e.g. [a-e] matches a, b, c, d or e
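A quick check of the equivalence aa* = a+ and of the bracket notation, assuming Python's `re` module behaves like the grep/Perl notation here; the sample strings are made up.

```python
import re

# aa* and a+ should accept exactly the same strings
samples = ["", "a", "aa", "aaaa", "ab", "b"]
same = all(
    bool(re.fullmatch(r"aa*", s)) == bool(re.fullmatch(r"a+", s))
    for s in samples
)

# [aeiou] matches one character from the set; [a-e] is a range
vowels = re.findall(r"[aeiou]", "regular expressions")
a_to_e = re.findall(r"[a-e]", "abcdefg")
d_is_vowel = re.search(r"[aeiou]", "d") is not None

print(same, vowels, a_to_e, d_is_vowel)
```

Agreement on a handful of samples is only a spot check, not a proof; the equivalence itself follows from the definitions of * and +.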

Regular Expressions Range defined over a computer character set –typically ASCII –originally a 7 bit character set –2^7 = 128 (0-127) different characters –ASCII = American Standard Code for Information Interchange –e.g. [A-z] [0-9A-Za-z] rogramming/ascii_table/PROGRAM MING_ascii_table.shtml
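Because a range is defined over character codes, [A-z] covers more than letters: in ASCII, six punctuation characters sit between Z (90) and a (97). The sketch below, which assumes Python's `re` module, lists them and compares with the alphanumeric range.

```python
import re

# All 128 ASCII characters (codes 0-127)
ascii_chars = [chr(code) for code in range(128)]

in_A_z = [c for c in ascii_chars if re.fullmatch(r"[A-z]", c)]
extras = [c for c in in_A_z if not c.isalpha()]
alnum = [c for c in ascii_chars if re.fullmatch(r"[0-9A-Za-z]", c)]

print(len(in_A_z))  # 58 characters, codes 65 through 122
print(extras)       # the non-letters caught by the range
print(len(alnum))   # 10 digits + 26 upper case + 26 lower case
```

To match letters only, the two ranges are written separately, as in [A-Za-z].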

Regular Expressions: Microsoft Word terminology: –wildcard search

Regular Expressions: Microsoft Word Note: zero or more times is missing in Microsoft Word

Regular Expressions Perl uses the same notation as grep (textbook also uses grep notation) More shorthand –question mark (?) means the previous regexp is optional –e.g. colou?r –matches color or colour –metacharacters or operators like ? have a function –to match a question mark, escape it using a backslash (\) –e.g. why\? –? in Microsoft Word means match any character More shorthand –period (.) stands for any character (except newline) –e.g. e.t matches eat as well as eet –caret sign (^) as the first character of a range of characters [^set of characters] means don’t match any of the characters mentioned (after the caret) –e.g. [^aeiou] –any character except for one of the vowels listed
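The shorthand on this slide can be checked with a few one-liners. Python's `re` module is used here as a stand-in for Perl and grep, which is an assumption of this sketch; the variable names are illustrative.

```python
import re

# ? makes the previous regexp optional
color_ok = re.fullmatch(r"colou?r", "color") is not None
colour_ok = re.fullmatch(r"colou?r", "colour") is not None

# Escaped, \? matches a literal question mark
literal_q = re.search(r"why\?", "why?") is not None
no_q = re.search(r"why\?", "why") is not None
# Unescaped, the ? makes the y optional, so "wh" alone is accepted
unescaped = re.fullmatch(r"why?", "wh") is not None

# . stands for any single character except newline
eat_ok = re.fullmatch(r"e.t", "eat") is not None
eet_ok = re.fullmatch(r"e.t", "eet") is not None
newline_ok = re.fullmatch(r"e.t", "e\nt") is not None

# [^aeiou] matches any single character that is not a listed vowel
non_vowels = re.findall(r"[^aeiou]", "audio")

print(color_ok, colour_ok, literal_q, no_q, unescaped)
print(eat_ok, eet_ok, newline_ok, non_vowels)
```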

Regular Expressions Text files in Unix consist of sequences of lines separated by a newline character (LF = line feed) Typically, text files are read a line at a time by programs Matching in Perl and grep is line-oriented (can be changed in Perl) Differences in platforms for line breaking: –Unix (including Mac OS X): LF –Windows (DOS): CR LF –classic Mac OS (9 and earlier): CR
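Line-oriented matching depends on splitting text into lines first, so the line-break convention matters. As a small sketch in Python (an assumption of this example, since the lecture uses Perl and grep), `str.splitlines()` recognises all three conventions listed above; the sample strings are made up.

```python
# The same two-line text under the three line-break conventions
text_lf = "one\ntwo\n"        # LF only
text_crlf = "one\r\ntwo\r\n"  # CR LF
text_cr = "one\rtwo\r"        # CR only

lines_lf = text_lf.splitlines()
lines_crlf = text_crlf.splitlines()
lines_cr = text_cr.splitlines()

# Splitting on LF alone leaves a stray CR at the end of each CR LF line
naive = text_crlf.split("\n")

print(lines_lf, lines_crlf, lines_cr)
print(naive)
```

The stray carriage return in the naive split is a common reason an end-of-line pattern fails to match on a file that came from another platform.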

Regular Expressions Line-oriented metacharacters: –caret (^) at the beginning of a regexp string matches the “beginning of a line” –e.g. ^The matches lines beginning with the sequence The –Note: the caret is very overloaded... [^ab] a^b –dollar sign ($) at the end of a regexp string matches the “end of the line” –e.g. end\.$ –matches lines ending in the sequence end. –e.g. ^$ matches blank lines only –e.g. ^ $ matches lines containing exactly one space
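The anchors can be tried on a small list of lines. This sketch assumes Python's `re` module and lines that have already had their line breaks removed; the helper `select` and the sample lines are hypothetical.

```python
import re

# Hypothetical lines, with line breaks already stripped
lines = ["The end.", "In the end", "", " ", "The start", "  "]

def select(pattern):
    """Return the lines in which pattern is found."""
    return [line for line in lines if re.search(pattern, line)]

starts_with_the = select(r"^The")
ends_with_end = select(r"end\.$")
blank = select(r"^$")
one_space = select(r"^ $")

print(starts_with_the, ends_with_end, blank, one_space)
```

Note that "In the end" is not selected by end\.$ because the escaped period requires a literal full stop at the end of the line.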

Regular Expressions Word-oriented metacharacters: –a word is any sequence of digits [0-9], underscores (_) and letters [a-zA-Z] –(historical reasons for this) –\b matches a word boundary, e.g. a space or beginning or end of a line or a non-word character –e.g. the –matches the, they, breathe and other –but \bthe will only match the and they –the\b will match the and breathe –\bthe\b will only match the –(in grep, \< and \> can also be used to match the beginning and end of a word) –e.g. \b99 –matches 99 but not 299 –also matches $99
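The word-boundary examples from this slide, run through Python's `re` module (used here as a stand-in for Perl, which is an assumption of this sketch); the helper `hits` is illustrative.

```python
import re

words = ["the", "they", "breathe", "other"]

def hits(pattern):
    """Return the words in which pattern is found."""
    return [w for w in words if re.search(pattern, w)]

anywhere = hits(r"the")
at_start = hits(r"\bthe")
at_end = hits(r"the\b")
whole_word = hits(r"\bthe\b")

# \b99: a boundary exists before 99 when the preceding character
# is not a word character (or there is none)
plain_99 = re.search(r"\b99", "99") is not None
in_299 = re.search(r"\b99", "299") is not None
dollar_99 = re.search(r"\b99", "$99") is not None

print(anywhere, at_start, at_end, whole_word)
print(plain_99, in_299, dollar_99)
```

$99 matches because $ is not a word character, so there is a boundary between it and the first 9.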

Regular Expressions Range abbreviations: –\d (digit) = [0-9] –\s (whitespace character) = space (SP), tab (HT), carriage return (CR), newline (LF) or form feed (FF) –\w (word character) = [0-9a-zA-Z_] uppercase versions denote negation –e.g. \W means a non-word character \D means a non-digit Repetition abbreviations: –a? a optional –a* zero or more a’s –a+ one or more a’s –a{n,m} between n and m a’s –a{n,} at least n a’s –a{n} exactly n a’s –e.g. \d{7,} matches numbers with at least 7 digits –e.g. \d{3}-\d{4} –matches 7 digit telephone numbers with a separating dash
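A short sketch of the abbreviations, assuming Python's `re` module. The `re.ASCII` flag is passed so that \d and \w mean exactly [0-9] and [0-9a-zA-Z_] as defined on the slide (Python 3 otherwise also accepts non-ASCII digits and letters); the sample strings are made up.

```python
import re

# 7 digit telephone number with a separating dash
phone = re.compile(r"\d{3}-\d{4}", re.ASCII)
phone_ok = phone.fullmatch("555-1234") is not None
no_dash = phone.fullmatch("5551234") is not None

# Numbers with at least 7 digits
long_numbers = re.findall(
    r"\d{7,}", "id 1234567 and 123456 and 123456789", re.ASCII
)

# \w is a word character, \W its negation
word_runs = re.findall(r"\w+", "a_b c-9", re.ASCII)
non_word = re.findall(r"\W", "a_b c-9", re.ASCII)

# a{n,m}: between n and m a's
two_to_three = [s for s in ["a", "aa", "aaa", "aaaa"]
                if re.fullmatch(r"a{2,3}", s)]

print(phone_ok, no_dash, long_numbers)
print(word_runs, non_word, two_to_three)
```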

Reading Perl Quick Intro – Perl Regular Expressions (RE) –perlrequick - Perl regular expressions quick start –perlretut - Perl regular expressions tutorial