Regex is Fun David Clawson SplunkYoda.

Regular Expressions
“A regular expression is a pattern which specifies a set of strings of characters; it is said to match certain strings.” (Ken Thompson)
QED text editor: written by Ken in the 1970s.
Invented in the 1940s. Help celebrate its 70th year.

Types of Regular Expressions

How is Regex used in Python? Python “re”: Python's built-in "re" module provides solid support for regular expressions, with a modern and fairly complete regex flavor. Atomic grouping and possessive quantifiers were missing before Python 3.11 and are supported from 3.11 onward; Unicode property escapes such as \p{L} are still not supported by "re". Using Regular Expressions in Python: the first thing to do is to import the module into your script with “import re”.

How is Regex used in Python? Call re.search(regex, subject) to apply a regex pattern to a subject string. The function returns None if the matching attempt fails, and a Match object otherwise. The Match object stores details about the part of the string matched by the regular expression pattern. Since None evaluates to False, you can easily use re.search() in an if statement.
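
A minimal sketch of the re.search() pattern described above. The log line is made up for illustration; only re.search() and the Match methods come from the standard library.

```python
import re

# Hypothetical log line, invented for this example.
subject = "EventCode=592 user=alice"

match = re.search(r"EventCode=(\d+)", subject)
if match:
    # group(0) is the whole match, group(1) the first capturing group.
    whole = match.group(0)
    code = match.group(1)
    span = match.span()
else:
    whole = code = span = None

# A failed attempt returns None, which is falsy in an if statement.
no_match = re.search(r"EventCode=(\d+)", "nothing to see here")
```

Because None is falsy, the same if statement handles both the match and the no-match case without a separate check.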

How is Regex used in Python? Do not confuse re.search() with re.match(). Both functions apply a pattern to a string, with one important distinction: re.search() attempts the pattern at successive positions throughout the string until it finds a match, while re.match() only attempts the pattern at the very start of the string.
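
The difference is easiest to see side by side. A small sketch, using an invented subject string:

```python
import re

subject = "severity 3 event"  # illustrative input

# re.search() scans forward until the pattern matches somewhere.
found_anywhere = re.search(r"\d", subject)

# re.match() only tries position 0, where there is no digit.
found_at_start = re.match(r"\d", subject)

# re.match() succeeds when the pattern fits at the very start.
starts_with_word = re.match(r"\w+", subject)
```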

How is Regex used in Python? To get all matches from a string, call re.findall(regex, subject). This returns a list of all non-overlapping regex matches in the string. "Non-overlapping" means that the string is searched from left to right, and the next match attempt starts beyond the previous match. If the regex contains exactly one capturing group, re.findall() returns a list of the strings matched by that group. If it contains two or more capturing groups, re.findall() returns a list of tuples, with each tuple containing the text matched by each capturing group. The overall regex match is not included, unless you place the entire regex inside a capturing group.
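
A short sketch of how the return shape of re.findall() depends on the number of capturing groups. The key=value string is invented for the example.

```python
import re

subject = "a=1, b=22, c=333"  # illustrative input

# No groups: a list of whole matches.
no_groups = re.findall(r"\d+", subject)

# One group: a list of strings, one per match, holding that group only.
one_group = re.findall(r"(\w)=\d+", subject)

# Two groups: a list of tuples, one tuple per match.
two_groups = re.findall(r"(\w)=(\d+)", subject)
```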

How is Regex used in Python? More efficient than re.findall() is re.finditer(regex, subject). It returns an iterator that enables you to loop over the regex matches in the subject string: for m in re.finditer(regex, subject). The for-loop variable m is a Match object with the details of the current match.
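
A sketch of the re.finditer() loop, collecting each match together with its position. The subject reuses the addresses from the sample Apache log; the IP pattern is deliberately loose and is only an assumption for illustration.

```python
import re

subject = "10.23.10.11 www.iamcool.com 10.100.0.11"

# Loose dotted-quad pattern: it does not validate the 0-255 range.
ip_pattern = r"\d+\.\d+\.\d+\.\d+"

positions = []
for m in re.finditer(ip_pattern, subject):
    # Each m is a Match object for the current match.
    positions.append((m.group(), m.start(), m.end()))
```

Unlike re.findall(), each item here carries start and end offsets, which is useful when the position of a match matters as much as its text.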

How is Regex used in Splunk?
Field extraction: | rex field=_raw "%UC_CALLMANAGER-(?<Severity>\d+)-EndPointUnregistered:"
Configure line breaking: LINE_BREAKER = ([\r\n]+) (the LINE_BREAKER regex needs a capturing group)
Filtering and routing data to queues: REGEX = (?m)^EventCode=(592|593)
Many more...

Regex Testing Tools
RegExr: http://gskinner.com/RegExr/
Reggy: http://reggyapp.com/
RegexPal: http://regexpal.com/
Regex Buddy: http://www.regexbuddy.com/
Lars Olav Torvik: http://regex.larsolavtorvik.com/
Rubular: http://rubular.com/

Regex Reference Texts
http://www.regular-expressions.info/reference.html - from the creators of RegexBuddy
Introducing Regular Expressions by Michael Fitzgerald
Mastering Regular Expressions by Jeffrey Friedl
Regular Expressions Cookbook by Jan Goyvaerts
Regular Expressions Pocket Reference by Tony Stubblebine

Basic Concepts of Regular Expressions Because Knowing leads to Doing

Simple Pattern Matching Matching String Literals Matching Digits and Non-Digits Matching Word and Non-Word Characters Matching Whitespace Matching Any Character

Matching String Literals
Sample Apache log:
10.23.10.11 www.iamcool.com 10.100.0.11 - - [06/Dec/2012:14:39:03 -0800] "GET /Facelift/answers/swelling HTTP/1.1" 301 20 14932 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
A literal string match of the first IP address would be: 10.23.10.11 (written as 10\.23\.10\.11 when the dots must match only literal dots, since an unescaped dot matches any character).

Matching Digits and Non-Digits: \d, \D, [0-9], [^0-9]
\d - match a digit
\D - match a non-digit (letters, whitespace, punctuation, and any other character that is not a digit)
[0-9] - match any single digit (this is called a character class)
[^0-9] - match any single character that is not a digit
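
A quick sketch of \d, \D and the equivalent character class, using an invented date string. For plain ASCII input like this, \d and [0-9] behave the same in Python.

```python
import re

subject = "Dec 06, 2012"  # illustrative input

digits = re.findall(r"\d", subject)

# \D matches everything that is not a digit, letters included.
non_digits = "".join(re.findall(r"\D", subject))

# The character class [0-9] gives the same result on this input.
same_as_class = re.findall(r"[0-9]", subject) == digits
```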

Matching Words and Non-Words: \w or \W
\w - match any word character; essentially the same as the character class [a-zA-Z0-9_] (note that the underscore is included)
\W - match any non-word character
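
A small sketch showing that \w treats the underscore as a word character, while symbols such as = and the hyphen are non-word characters. The input string is invented.

```python
import re

subject = "user_name=matty-01"  # illustrative input

# The underscore counts as a word character, so user_name stays whole.
words = re.findall(r"\w+", subject)

# = and - are the only non-word characters here.
non_words = re.findall(r"\W", subject)
```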

Matching Whitespace: \s or \S
\s - match whitespace (spaces, tabs, line feeds and carriage returns)
\S - match any character that is not whitespace; same as [^\s]

Character shorthands for whitespace
Shorthand - Description
\f - form feed
\h - horizontal whitespace
\H - not horizontal whitespace
\n - newline
\r - carriage return
\t - horizontal tab
\v - vertical tab (whitespace)
\V - not vertical whitespace

Matching Any Character
Dot (.) - matches any character except line-ending characters
\b - matches a word boundary without consuming any characters
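
A sketch of both behaviours in Python: the dot skipping a newline, and \b matching a position rather than a character. The inputs are invented.

```python
import re

# By default the dot does not match a newline.
dot_stops_at_newline = re.search(r"a.c", "a\nc")

# Any other character is fine.
dot_matches_other = re.search(r"a.c", "a-c").group()

# \b keeps "cat" from matching inside "concat".
whole_words = re.findall(r"\bcat\b", "cat concat cat.")
```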

Boundaries and Alternation Matching the Beginning and End of Line List of Regex Special Character Alternation and Regex Options Subpatterns Capturing and Named Groups Character Classes Negated Character Classes

Matching Beginning and End of Line: ^ or $
^ - matches the beginning of a line
$ - matches the end of a line
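
One detail worth seeing in code: in Python, ^ and $ anchor to the whole string unless multiline mode is switched on. A sketch with an invented two-line event:

```python
import re

text = "EventCode=592\nEventCode=593"  # illustrative two-line input

# Without (?m), ^ only matches at the very start of the string.
start_only = re.findall(r"^EventCode=(\d+)", text)

# With (?m), ^ and $ match at the start and end of every line.
every_line = re.findall(r"(?m)^EventCode=(\d+)$", text)
```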

List of Regex Special Characters: . ^ * + ? | ( ) { } [ ] \ -
. - matches any character
^ - matches the beginning of the line
* - matches zero or more
+ - matches one or more
? - matches zero or one
| - used for alternation (choice of patterns to match)
() - used for grouping
{} - used as a quantifier
[] - used with character classes
\ - used to make a character literal or as a special regex character
- - the hyphen is used in a character class range

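Alternation and Options: | and (?...)
| - gives a choice of alternate patterns to match, e.g. (THE|The|the)
(?i) - case insensitive
(?J) - allow duplicate group names
(?m) - multiline: ^ and $ match at the start and end of every line
(?s) - single line (dotall): the dot also matches line endings
(?U) - ungreedy: quantifiers are lazy by default
(?x) - extended: ignore whitespace and comments in the pattern
(?-…) - unset or turn off options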
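
A sketch of the inline options that Python's re module accepts: (?i), (?m), (?s) and (?x). The (?J) and (?U) options are PCRE features and are not available in Python's re. The inputs are invented.

```python
import re

# (?i): one pattern instead of the (THE|The|the) alternation.
case_insensitive = re.findall(r"(?i)the", "THE The the")

# (?s): the dot now matches the newline as well.
dot_matches_newline = re.search(r"(?s)a.b", "a\nb") is not None

# (?x): whitespace and # comments inside the pattern are ignored.
verbose = re.search(
    r"""(?x)
    (\d{3})   # first block of digits
    -         # literal hyphen
    (\d{4})   # second block of digits
    """,
    "555-1234",
)
```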

Subpatterns: group(s) within a group
(THE|The|the) - has three subpatterns
[tT]h(e|eir) - matches the, The, their, Their

Capturing and Named Groups: (), (?<name>…) or (?P<name>…)
Groups store their content in memory: in (it is) (time to eat), the first group is available as $1 and the second as $2.
(?<Severity>\d) - Splunk creates a field named Severity from this named group.
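
A sketch of the same ideas in Python, where named groups must use the (?P<name>…) spelling and numbered groups are read with group(1), group(2) rather than $1, $2. The event text after the colon is invented.

```python
import re

# Hypothetical event line; only the message prefix comes from the slides.
line = "%UC_CALLMANAGER-3-EndPointUnregistered: device offline"

m = re.search(r"%UC_CALLMANAGER-(?P<Severity>\d+)-EndPointUnregistered:", line)
severity = m.group("Severity")

# Numbered groups: group(1) and group(2) play the role of $1 and $2.
m2 = re.search(r"(it is) (time to eat)", "it is time to eat")
first, second = m2.group(1), m2.group(2)
```

m.groupdict() returns the named groups as a dictionary, which is roughly the Python analogue of Splunk turning a named group into a field.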

Character Classes: []
[aeiou] - matches only the characters inside the brackets
[0-9] - matches a range of characters, using a hyphen
[a-zA-Z0-9] - matches all alphanumeric characters

Negated Character Classes: [^…]
*** Super important - especially for Splunk field extractions ***
[^aeiou] - matches any character that is not a lowercase vowel (consonants, but also digits, spaces and punctuation)
[^\s] - matches any character that is not whitespace
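
A sketch of why negated classes suit field extraction: "everything up to the next whitespace" is a natural way to grab a value. The src/dst line is invented.

```python
import re

line = "src=10.23.10.11 dst=10.100.0.11"  # illustrative input

# Capture everything after src= up to the next whitespace character.
src = re.search(r"src=([^\s]+)", line).group(1)

# [^aeiou] matches anything that is not a lowercase vowel,
# including digits and spaces.
not_vowels = re.findall(r"[^aeiou]", "a1 b")
```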

Quantifiers Greedy, Lazy, Possessive Matching a certain number of times

Greedy, Lazy, Possessive: * + ?
* - match zero or more times
.* - will match all of the characters in the subject text (want to avoid this)
+ - match one or more times
\d+ - match all of the digits until there aren't any more (greedy)
? - match 0 or 1 of the preceding token
colou?r - matches either color or colour
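
A sketch contrasting a greedy and a lazy quantifier on a shortened, invented piece of an access-log line, plus the optional-character example from the slide.

```python
import re

subject = '"GET /a" 301 "-"'  # illustrative input

# Greedy: .* runs to the end, then backs up to the last quote.
greedy = re.search(r'".*"', subject).group()

# Lazy: .*? stops at the first closing quote it can.
lazy = re.search(r'".*?"', subject).group()

# ? makes the preceding token optional.
colours = re.findall(r"colou?r", "color colour")
```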

Matching a Certain Number of Times: {}
\d{3} - matches exactly 3 digits
\d{1,3} - matches a range of 1 to 3 digits
\d{1,} - same as \d+
\d{0,} - same as \d*
\d{0,1} - same as \d?
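
A sketch of counted quantifiers on an invented string of numbers. Note that \d{3} on its own will happily match the first three digits of a longer number, so word boundaries are added here when whole numbers are wanted.

```python
import re

subject = "7 42 123 4567"  # illustrative input

# Unanchored: also picks up the first three digits of 4567.
three_anywhere = re.findall(r"\d{3}", subject)

# With \b on both sides, only whole 3-digit numbers match.
exactly_three = re.findall(r"\b\d{3}\b", subject)
one_to_three = re.findall(r"\b\d{1,3}\b", subject)

# {1,} is just another spelling of +.
same_as_plus = re.findall(r"\d{1,}", subject) == re.findall(r"\d+", subject)
```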

Any Thoughts, Ideas, Feedback? splunkyoda@splunk.com

Optimized Regular Expressions Because fast is elegant!

Optimize Regular Expressions
Good: (whiskey)
Better: (?:whiskey)
Capture groups add unnecessary overhead and impact overall performance; use them only when necessary.
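
A sketch of what changes between the two forms in Python: both match the same text, but only the first stores a group. It makes no timing claim; the trailing " sour" is added purely for illustration.

```python
import re

capturing = re.search(r"(whiskey) sour", "whiskey sour")
non_capturing = re.search(r"(?:whiskey) sour", "whiskey sour")

# Same overall match either way.
same_match = capturing.group(0) == non_capturing.group(0)

# Only the capturing version keeps the group's text.
captured = capturing.groups()
not_captured = non_capturing.groups()
```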

Optimize Regular Expressions
Good: splunk|splash
Better: spl(?:unk|ash)
Try to “factor” on the left when you can, while exposing required text. Less alternation is better.
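
Before swapping in a factored pattern, it is worth checking that it accepts exactly the same strings. A sketch of such a check over a few invented test words; it verifies equivalence only and does not measure speed.

```python
import re

plain = re.compile(r"splunk|splash")
factored = re.compile(r"spl(?:unk|ash)")

# Illustrative test words: two that should match, two that should not.
words = ["splunk", "splash", "split", "unk"]

results = [
    (bool(plain.fullmatch(w)), bool(factored.fullmatch(w))) for w in words
]
patterns_agree = all(a == b for a, b in results)
```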

Optimize Regular Expressions
Good: (?:aussie$|gypsie$)
Better: (?:aus|gyp)sie$
Try to “factor” on the right when the input text is close to the end of the line. Most regex engines will anchor at the end of the line when “$” is present.

Optimize Regular Expressions
Good: 0{3,7}
Better: 0000{0,4}
Typically, exposing required or literal text makes the engine execute the regex faster.

Optimize Regular Expressions
Good: (.)*
Better: .*
Useless parentheses add unnecessary overhead. As above, use them only when necessary.

Optimize Regular Expressions
Good: matty[:]
Better: matty:
The character class/set (indicated by []) adds unnecessary overhead when it is not needed.

Optimize Regular Expressions
Good: ^genti|^collar
Better: ^(?:genti|collar)
Anchoring the regex at the beginning of the line will result in improved performance with most regex engines.

Optimize Regular Expressions
Good: delaney$|connery$
Better: (delaney|connery)$
I said, anchor the regex!

Optimize Regular Expressions
Good: ^src.*:
Better: ^src[^:]*:
Using a negated character class/set instead of lazy/greedy quantifiers will typically result in faster regexes. Lazy/greedy quantifiers make the regex engine backtrack, which ultimately impacts overall performance.
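
Besides speed, the two patterns can match different text when the line contains more than one colon, which is often the real reason to prefer the negated class. A sketch with an invented line:

```python
import re

line = "src_ip: 10.23.10.11 port: 443"  # illustrative input

# Greedy .* runs to the end of the line and backtracks to the LAST colon.
greedy = re.search(r"^src.*:", line).group()

# [^:]* cannot cross a colon, so it stops at the FIRST one.
negated = re.search(r"^src[^:]*:", line).group()
```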

Optimize Regular Expressions
Good: bride|brian
Better: bri(?:de|an)
Full alternation is more expensive than partial alternation. Also, in this case the regex engine will alternate only AFTER ‘bri’ has been matched.

Optimize Regular Expressions
Good: (?:edu|com|net|…)
Better: (?:com|edu|net|…)
Leading the engine to a match by placing the most popular match first may result in faster execution in some engines.

Optimize Regular Expressions
Good: ^.*(answer)
Better: ^.{42}(answer)
Specifying an exact position inside the string, and so leading the engine to a match, will improve performance drastically compared to using a simple greedy/lazy quantifier.

Optimize Regular Expressions
Good: .*?a
Better: ^.*a
If ‘a’ is near the end of the input string, the anchored greedy version will match faster, as less backtracking will be required.

Optimize Regular Expressions
Good: .*a
Better: ^.*?a
If ‘a’ is near the beginning of the input string, the anchored lazy version will match faster.

Optimize Regular Expressions
Good: :[^:]*:
Better: :[^:]*+:
Example: in ‘ :destination’ the second regex fails faster, because the possessive quantifier does not backtrack.

Optimize Regular Expressions
Good: :[^:]*:
Better: :(?>[^:]*):
Same as above, using different notation. Explanation: atomic grouping and possessive quantifiers instruct the regex engine not to keep the states captured by * or +, therefore preventing it from unsuccessfully backtracking and in turn failing faster.
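
A sketch of the three spellings in Python. Possessive quantifiers and atomic groups are only accepted by the re module from Python 3.11 onward, so the example guards on the interpreter version; on older versions those two patterns are simply skipped. It shows that all three reach the same result and does not measure how fast they fail.

```python
import re
import sys

subject = ":destination"  # no second colon, so every pattern must fail

plain = re.search(r":[^:]*:", subject)

if sys.version_info >= (3, 11):
    # Possessive quantifier and atomic group: same result, no backtracking
    # into [^:]* once it has consumed its characters.
    possessive = re.search(r":[^:]*+:", subject)
    atomic = re.search(r":(?>[^:]*):", subject)
else:
    # Older interpreters reject this syntax, so skip it.
    possessive = atomic = None

# When a second colon is present, the plain pattern matches as usual.
matched = re.search(r":[^:]*:", ":a:b").group()
```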

Python for the Masses