Finding the needle(s) in the textual haystack

Consider the text below. How would you identify … Proper names? … Addresses? … Dates?

From: Gow, Joe
Subject: Reminder About Open Forums Today
Date: March 25, :44:08 AM CDT
Bcc:

Hello, everyone. I just wanted to send a quick reminder about the two campus wide Open Forums we're holding today from 2 to 3 and 3 to 4 p.m. in the Cleary Center. I'll host the first session from 2 to 3, and we'll cover any topics you'd like to discuss. Then from 3 to 4 Vice Chancellor Bob Hetzel will lead a conversation about the plans for a new Cowley Science Building. Please join us!

Thanks,
Joe

Joe Gow, Chancellor
University of Wisconsin-La Crosse

What do you think of when you see the following? MM/DD/YYYY This is a (string) pattern. Are there different patterns for this same thing? How would you describe the pattern of a credit card number?

Regular expressions are “formulas” for string patterns. Regular expressions follow a standard notation. Regular expressions can be used in various computer applications and programming languages. Applying a regular expression to a string (piece of text) is called pattern matching. - The regular expression might match the string (or part of it) or it might not.
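As a minimal sketch of pattern matching, assuming Python and its standard re module (the slides do not name a language), the snippet below applies two patterns to one sentence from the sample e-mail. One pattern matches part of the string and the other does not.

```python
import re

text = "Then from 3 to 4 Vice Chancellor Bob Hetzel will lead a conversation."

# re.search scans the string and returns a match object,
# or None when the pattern matches nowhere in the string.
found = re.search("Hetzel", text)
missing = re.search("Smith", text)

print(found.group())  # the part of the string that matched
print(missing)        # None, because "Smith" does not occur
```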

Regular expressions use a standard pattern language.

– Any (non-meta) character is a pattern. The character pattern represents itself.
– The '.' (period) is a pattern. The period (a meta character) represents "any character".
– If A and B are both patterns, then so are:
   AB : This represents the pattern A followed by pattern B. F. matches Fa, FR and F3, but not fa or aF.
   A|B : This represents either the pattern A or the pattern B. P|Q matches P and Q, but not R.
– Parentheses are special; they form a pattern group. Anything in parentheses is a group. A group is one "thing".

(red|blue) fish matches what strings?
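The examples above can be checked with a short sketch, again assuming Python's re module. The helper name matches is made up for illustration; it keeps only the candidates that a pattern matches in full.

```python
import re

def matches(pattern, candidates):
    """Return the candidates that the pattern matches from start to end."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

# The period stands for any single character.
dot = matches("F.", ["Fa", "FR", "F3", "fa", "aF"])

# The vertical bar chooses between two patterns.
either = matches("P|Q", ["P", "Q", "R"])

# Parentheses group the alternatives, so " fish" follows either color.
group = matches("(red|blue) fish", ["red fish", "blue fish", "green fish", "red"])

print(dot)
print(either)
print(group)
```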

How would you write an expression for the time on a digital 12-hour clock? [HINT: Let’s divide & conquer]

A regular expression matching any possible hour: 1|2|3|4|5|6|7|8|9|10|11|12
A regular expression matching any possible minute: (0|1|2|3|4|5)(0|1|2|3|4|5|6|7|8|9)
A regular expression matching any possible time: (1|2|3|4|5|6|7|8|9|10|11|12):(0|1|2|3|4|5)(0|1|2|3|4|5|6|7|8|9)
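The divide-and-conquer idea can be sketched in Python (an assumed language; the names HOUR, MINUTE, TIME and is_time are illustrative). Each part is written once, and the parts are then joined with a colon.

```python
import re

# The two parts, written with alternation only.
HOUR = "(1|2|3|4|5|6|7|8|9|10|11|12)"
MINUTE = "(0|1|2|3|4|5)(0|1|2|3|4|5|6|7|8|9)"

# Combine the parts: an hour, a colon, then a minute.
TIME = re.compile(HOUR + ":" + MINUTE)

def is_time(s):
    """True when the whole string is a 12-hour clock time."""
    return TIME.fullmatch(s) is not None

for candidate in ["12:59", "1:05", "13:00", "7:60", "0:30"]:
    print(candidate, is_time(candidate))
```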

Quantifiers are used to allow and constrain repetitions. If re is a regular expression (pattern), then so are:

– re* represents zero or more repetitions of re
– re+ represents one or more repetitions of re
– re? represents zero or one occurrences of re
– re{n} represents exactly n repetitions of re (n is some positive integer)
– re{m,n} represents at least m and no more than n repetitions of re (n, m are positive integers, m ≤ n)

Write a regular expression for Social Security Numbers
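Here is one possible sketch of the quantifiers in Python (an assumed language). The Social Security Number layout of three digits, two digits and four digits separated by dashes is an assumption, since the slides do not spell out the format. The names full, DIGIT and SSN are illustrative.

```python
import re

def full(pattern, s):
    """True when the pattern matches the whole string."""
    return re.fullmatch(pattern, s) is not None

print(full("ab*", "a"))         # zero repetitions of b are allowed
print(full("ab+", "a"))         # at least one b is required
print(full("ab?", "abb"))       # at most one b is allowed
print(full("a{2,3}", "aaaa"))   # four repetitions exceed the upper bound

# A digit written with alternation, then repeated with {n}.
# Assumed layout: ddd-dd-dddd.
DIGIT = "(0|1|2|3|4|5|6|7|8|9)"
SSN = DIGIT + "{3}-" + DIGIT + "{2}-" + DIGIT + "{4}"

print(full(SSN, "123-45-6789"))
print(full(SSN, "123-456-789"))
```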

Text: I sometimes wonder if the manufacturers of foolproof items keep a fool or two on their payroll.
Pattern: o{2}1?
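The pattern can be applied to the text with Python's re.findall (Python is an assumed language). This sketch uses the pattern exactly as it is written above, with the digit 1 as its last character.

```python
import re

text = ("I sometimes wonder if the manufacturers of foolproof items "
        "keep a fool or two on their payroll.")

# o{2} requires exactly two o characters in a row; 1? allows an optional digit 1.
hits = re.findall("o{2}1?", text)
print(hits)
print(len(hits))
```

No digit 1 occurs in the text, so the optional part matches zero times and every hit is just the doubled o: twice in "foolproof" and once in "fool".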

Some characters have special meaning in regular expressions, and others have no printable form. Such characters can still be represented using a 2-character notation, known as an escape code.

– \+ represents +
– \. represents .
– The same technique works for * ? ( ) { } [ ] \ ^ $ |

– \n represents the new line character
– \t represents the tab character
– \r represents the carriage return character
– \v represents the vertical tab character
– \f represents the form feed character
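Here is a small sketch of escape codes in Python (an assumed language). Raw strings (r"...") keep the backslashes intact for the regular expression engine. re.escape is a standard-library helper that adds the escapes automatically.

```python
import re

def full(pattern, s):
    """True when the pattern matches the whole string."""
    return re.fullmatch(pattern, s) is not None

print(full(r"1\+1", "1+1"))     # escaped + is a literal plus sign
print(full(r"1+1", "1+1"))      # unescaped + is a quantifier, so this fails
print(full(r"B\.C\.", "B.C."))  # escaped periods match only periods
print(full(r"B\.C\.", "BxCx"))
print(full(r"a\tb", "a\tb"))    # \t matches a tab character

# re.escape builds a pattern that matches its argument literally.
literal = "(a+b)*"
print(full(re.escape(literal), literal))
```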

There are also two “location” symbols.

– ^ matches the start of a line, including the position right after a \n
– $ matches the end of a line, including the position right before a \n
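The location symbols can be sketched in Python (an assumed language). In Python's re module the per-line behavior described above needs the re.MULTILINE flag. Without it, ^ matches only at the start of the string and $ only at its end (or just before a final newline). The sample text is made up for illustration.

```python
import re

text = "Right now.\nNot right now.\nRight now!"

# With MULTILINE, ^ and $ apply to every line. The unescaped period matches
# any character, so the line ending in "!" is found as well.
per_line = re.findall(r"^Right now.$", text, re.MULTILINE)

# Without the flag, the anchors refer to the whole string.
whole_string = re.findall(r"^Right now.$", text)

# An escaped dollar sign is an ordinary character, not a location symbol.
literal_dollar = re.findall(r"^Right now.\$", "Right now.$")

print(per_line)
print(whole_string)
print(literal_dollar)
```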

(snow|rain)(flake|drop)
g(rr|ee)*
W.* W
B\.C\.
^Right now.$
^Right now.\$

Square brackets enclose a character class (a set of characters). The class will match any one character from the set.

Within brackets…
– specific characters can be listed
– ranges are denoted using -

Examples
– [aDb] matches a or D or b and nothing else
– [c-e] matches c or d or e and nothing else
– [a-z] matches any lowercase letter and nothing else
– [a-zA-Z0-9] matches any alphabetic or numeric symbol
– [a+*] matches a or + or * and nothing else
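The character class examples can be tried out with a short Python sketch (Python is an assumed language, and the helper name matches is illustrative). Note that + and * lose their special meaning inside brackets.

```python
import re

def matches(pattern, candidates):
    """Return the candidates that the pattern matches from start to end."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

listed = matches("[aDb]", ["a", "D", "b", "d", "ab"])
ranged = matches("[c-e]", ["b", "c", "d", "e", "f"])
alnum = matches("[a-zA-Z0-9]", ["q", "Q", "7", "_"])
symbols = matches("[a+*]", ["a", "+", "*", "aa"])

print(listed)
print(ranged)
print(alnum)
print(symbols)
```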

Which of the following match [a-z][0-9]* ?
abc
1z93
a-9

Which of the following match [0-9]*[02468] ?

Give a pattern for social security numbers using character classes.
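The answer to the first question depends on whether the pattern must cover the whole string or only part of it. The Python sketch below (an assumed language) shows both readings. It also gives one possible character-class pattern for Social Security Numbers, assuming the ddd-dd-dddd layout.

```python
import re

PATTERN = "[a-z][0-9]*"
candidates = ["abc", "1z93", "a-9"]

# Part of each string that the pattern matches when scanning left to right.
partial = {s: re.search(PATTERN, s).group() for s in candidates}

# Candidates that the pattern matches from start to end.
whole = [s for s in candidates if re.fullmatch(PATTERN, s)]

print(partial)
print(whole)

# Assumed layout: three digits, two digits, four digits, separated by dashes.
SSN_CLASS = "[0-9]{3}-[0-9]{2}-[0-9]{4}"
print(re.fullmatch(SSN_CLASS, "123-45-6789") is not None)
```

Every candidate contains a lowercase letter, so each one has a partial match, but none of them is matched in full.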

Create a regular expression to match phone numbers. The phone numbers can take on the following forms: x1234

Divide and conquer

Note that each phone number has at most four parts.
– prefix (the number 1)
– area code
– trunk (first three digits)
– rest (next 4 digits)
– extension (last digits; may be between 1 and 4 in length)

Consider defining each of these parts
– what is the prefix?
– what is the area code?
– what is the trunk?
– what is the rest?
– what is the extension?

We need to 'conquer' by combining the solutions for the parts.

Rules:
– The prefix is optional
– One of the following must occur between the prefix and the area code: space, comma, dash, period
– One of the following must occur between the area code and the trunk: space, comma, dash, period
– One of the following must occur between the trunk and the rest: space, comma, dash, period
– An ‘x’ must occur between the rest and the extension.
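One way to combine the parts is sketched below in Python (an assumed language). Two assumptions go beyond the rules above: the area code is taken to be three digits and required, and the extension is treated as optional. All constant names and the sample numbers are made up for illustration.

```python
import re

SEP = r"[ ,.-]"         # space, comma, period or dash (dash last, so it is literal)
PREFIX = r"1"
AREA = r"[0-9]{3}"      # assumption: three digits, always present
TRUNK = r"[0-9]{3}"
REST = r"[0-9]{4}"
EXT = r"[0-9]{1,4}"

# Optional prefix with its separator, the three main parts,
# then an optional extension introduced by x (assumed optional).
PHONE = re.compile(
    rf"(?:{PREFIX}{SEP})?{AREA}{SEP}{TRUNK}{SEP}{REST}(?:x{EXT})?"
)

def is_phone(s):
    """True when the whole string follows the phone number rules."""
    return PHONE.fullmatch(s) is not None

for candidate in ["1-608-555-1234x12", "608.555.1234", "6085551234"]:
    print(candidate, is_phone(candidate))
```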

Suppose the rules for some system are that a user name must begin with a capital letter, followed by lowercase letters and/or dashes and/or periods. The length of user names is restricted to 3 to 16 characters.

Examples
– Dave
– D.-riley
– Rdave

Invalid
– dave : doesn’t begin with a capital letter
– DDR3 : capital letters and digits not permitted after first symbol
– R : too short
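Here is a possible pattern for these rules, sketched in Python (an assumed language). One capital letter is followed by 2 to 15 further symbols, which gives a total length of 3 to 16. The names USERNAME and is_username are illustrative.

```python
import re

# First symbol: a capital letter. Remaining symbols: lowercase letters,
# periods or dashes, between 2 and 15 of them.
USERNAME = re.compile(r"[A-Z][a-z.-]{2,15}")

def is_username(s):
    """True when the whole string is a valid user name."""
    return USERNAME.fullmatch(s) is not None

for candidate in ["Dave", "D.-riley", "Rdave", "dave", "DDR3", "R"]:
    print(candidate, is_username(candidate))
```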

Every computer network connection has a unique MAC address that is expressed as six numbers separated by colons. Each number consists of two hexadecimal digits.

Examples
– 10:22:93:04:91:00
– AF:0C:AA:ED:B7:21

Invalid
– 10:22:93:04:91 : too short
– 10:22:013:04:91 : numbers must be two digits long, not three
– AG:0C:AA:ED:B7:21 : the letter “G” is not a hexadecimal digit
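A MAC address pattern might be sketched as follows in Python (an assumed language). Accepting lowercase a through f is an assumption, since the examples above use only capitals. The names HEX_PAIR, MAC and is_mac are illustrative.

```python
import re

HEX_PAIR = r"[0-9A-Fa-f]{2}"   # assumption: lowercase hex digits are accepted too

# One pair, followed by exactly five more pairs that each start with a colon.
MAC = re.compile(HEX_PAIR + r"(?::" + HEX_PAIR + r"){5}")

def is_mac(s):
    """True when the whole string is a MAC address."""
    return MAC.fullmatch(s) is not None

for candidate in ["10:22:93:04:91:00", "AF:0C:AA:ED:B7:21",
                  "10:22:93:04:91", "10:22:013:04:91", "AG:0C:AA:ED:B7:21"]:
    print(candidate, is_mac(candidate))
```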

Internet addresses are referred to as IP numbers. A common address consists of four positive integers separated by periods. These integers must each be within the range of

Examples

Invalid
– no number can be greater than
– too few numbers
– :17.2 separators must be periods
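The numeric range did not survive in the text above. The Python sketch below (an assumed language) uses the standard IPv4 range of 0 to 255 for each number, which is an assumption about what the slide showed. It also rejects leading zeros, a further assumption. The sample addresses are made up for illustration.

```python
import re

# One number from 0 to 255 (assumed range), without leading zeros:
# 250-255, 200-249, 100-199, or 0-99.
OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])"

# Four numbers separated by literal periods.
IPV4 = re.compile(OCTET + r"(?:\." + OCTET + r"){3}")

def is_ip(s):
    """True when the whole string is a dotted IPv4 address."""
    return IPV4.fullmatch(s) is not None

for candidate in ["192.168.0.1", "256.1.1.1", "1.2.3", "10.0:17.2"]:
    print(candidate, is_ip(candidate))
```

A range check cannot be written with a single digit class, so the number is split into alternatives by magnitude.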

An e-mail address consists of two strings separated by @: localString@domainString

localString
– Must be one or more of the following characters: alphabetic, digits (0 through 9), or any of these !#$&'+-_/=?^`{|}~
– Periods are permitted but with the following restrictions: the first and last characters cannot be periods and there cannot be any consecutive periods.
– Note: There is another unusual notation for selected characters only allowed inside double quotes, which we will ignore.

domainString
– Must be one or more of the following characters: alphabetic, digits, dashes or periods.
– Alternatively, the domain could be written as a pair of square brackets enclosing four numbers separated by periods, where each of the four numbers is a non-negative number of one to three digits. e.g., [ ]
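These rules can be combined into one pattern, sketched below in Python (an assumed language). It assumes the two strings are joined by an @ sign, and it ignores the double-quote notation, as the rules do. All names and the sample addresses are made up for illustration.

```python
import re

# Any one character allowed in the local part, other than the period.
LOCAL_CHAR = r"[A-Za-z0-9!#$&'+\-_/=?^`{|}~]"

# Runs of allowed characters joined by single periods, so the local part
# cannot start or end with a period or contain two periods in a row.
LOCAL = rf"{LOCAL_CHAR}+(?:\.{LOCAL_CHAR}+)*"

# Either a name made of letters, digits, dashes and periods ...
DOMAIN_NAME = r"[A-Za-z0-9.-]+"
# ... or square brackets enclosing four numbers of one to three digits.
DOMAIN_IP = r"\[[0-9]{1,3}(?:\.[0-9]{1,3}){3}\]"

EMAIL = re.compile(rf"{LOCAL}@(?:{DOMAIN_NAME}|{DOMAIN_IP})")

def is_email(s):
    """True when the whole string follows the address rules above."""
    return EMAIL.fullmatch(s) is not None

for candidate in ["joe.gow@example.edu", "joe@[10.0.0.1]",
                  ".joe@example.edu", "joe..gow@example.edu"]:
    print(candidate, is_email(candidate))
```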