Editing Tons of Text? RegEx to the Rescue! Eric Cressey Senior UX Content Writer Symantec Corporation
Why regular expressions?
Content maintenance adds up Properties and source code XML, HTML, structured content Localization Bugs
Sometimes we solve big problems Remove legacy HTML from 10,000+ page Flare project – One week of work instead of more than a month – Saved 4 weeks of work Update KB URLs in 20,000+ s and files – Two weeks of work instead of months – Saved 10 weeks of work across multiple departments – No errors or missed references
Agenda 1.Basics 2.Syntax and examples 3.Tips for massive projects
Encountering regex in the wild can be scary ^[2-9]\d{2}-\d{3}-\d{4}$ ^#?([a-f0-9]{6}|[a-f0-9]{3})$ ^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$
What are they? Searches and regular expressions find patterns in text Add logic, precision, and flexibility to searches
Why use them?
Are there prerequisites? Your text editor must support regular expressions
How do you make them? 1.Start with the text you’re looking for 2.Identify a pattern 3.Add special characters 4.Test the regular expression to see if it matches what you want
Best practices 1.Use version control 2.Use a basic text editor 3.Test before replacing 4.Test again before committing
Syntax and examples 1.Dealing with variance 2.Using positional context 3.Matching unknown content 4.Putting it all together: HTML patterns
Matching name variations
Start with what you know Regex Eric Text to match Eric Erik
Identify a pattern Regex Eri Text to match Eric Erik
Add special characters and syntax Regex Eri[ck] Square brackets define a set of allowed characters Text to match Eric Erik
This pattern also works Regex Eri(c|k) Or (Eric|Erik) Parentheses group content together The pipe specifies OR logic Text to match Eric Erik
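The two patterns above can be compared side by side. This is a minimal sketch in Python (the slides do not name a language); the extra sample "Erin" is an assumption added to show a non-match.

```python
import re

# "Erin" is an extra sample added here to show a non-match.
text = "Eric Erik Erin"

# Square brackets: exactly one character from the set.
with_set = re.findall(r"Eri[ck]", text)

# Parentheses plus a pipe: the same OR logic, written as a group.
# finditer is used because findall would return only the group's contents.
with_group = [m.group(0) for m in re.finditer(r"Eri(c|k)", text)]

print(with_set)
print(with_group)
```

Both approaches find the same two names, so the choice is mostly about readability. The set is shorter when the variation is a single character.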
Matching URL variations
Start with what you know Regex Include optional content Text to match symantec.com
Identify a pattern in the text you want to match symantec.com
Escape special characters with a backslash Regex Text to match symantec.com
Add groups to logical sections with parentheses Regex ( Text to match symantec.com
Indicate number of times to match each group Regex (https?:\/\/)?(www\.)?symantec\.com +, *, or ? specifies how many times to match a group or character + one or more * zero or more ? zero or one Text to match symantec.com
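To see what the optional groups buy you, here is a small Python sketch that tries the finished pattern against a few URL variations. The sample strings are assumptions chosen for illustration; only symantec.com appears on the slide.

```python
import re

# The pattern from the slide: optional protocol, optional "www.", escaped dots.
url = re.compile(r"(https?:\/\/)?(www\.)?symantec\.com")

# Illustrative samples; the last one shows why the dot is escaped.
samples = [
    "symantec.com",
    "www.symantec.com",
    "http://symantec.com",
    "https://www.symantec.com",
    "symantecXcom",
]

# fullmatch requires the whole string to fit the pattern.
results = {s: url.fullmatch(s) is not None for s in samples}

for sample, matched in results.items():
    print(sample, matched)
```

Without the backslash, the period would match any character and "symantecXcom" would be accepted too.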
Find first name when followed by last name
Start with what you know Regex Eric Text to match Eric Eric Creasey Eric Cressey Eric C
Add special characters and syntax Regex Eric(?= Cressey) (?=) is a positive lookahead. Eric is returned only if the next characters match the lookahead content Text to match Eric Eric Creasey Eric Cressey Eric C
How do positive lookaheads work? Eric(?= Cressey) 1.Finds “Eric” as usual 2.Evaluates the following content to see if it matches the lookahead content 3.If the content is the same, “Eric” is a match Eric Eric Creasey Eric Cressey Eric C
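The three steps above can be checked with a short Python sketch. It treats each sample on the slide as its own line, which is an assumption about how the text was laid out.

```python
import re

lines = ["Eric", "Eric Creasey", "Eric Cressey", "Eric C"]
pattern = re.compile(r"Eric(?= Cressey)")

# Keep only the lines where "Eric" is followed by " Cressey".
matched_lines = [line for line in lines if pattern.search(line)]

# The lookahead content is checked but is not part of the match itself.
match = pattern.search("Eric Cressey")

print(matched_lines)
print(match.group(0), match.span())
```

The span covers only the four characters of "Eric", which is what makes lookaheads useful in replacements: the last name is required but left untouched.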
Find first name not followed by last name
There are negative lookaheads Regex Eric(?! Cressey) (?!) is a negative lookahead Eric is matched if the next characters do not match the lookahead content Text to match Eric Eric Creasey Eric Cressey Eric C
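A quick way to confirm which samples a negative lookahead keeps is to run it, as in this Python sketch. Treating each sample as a separate line is an assumption.

```python
import re

lines = ["Eric", "Eric Creasey", "Eric Cressey", "Eric C"]
pattern = re.compile(r"Eric(?! Cressey)")

# "Eric" matches wherever the text that follows is anything other than " Cressey".
not_followed = [line for line in lines if pattern.search(line)]

print(not_followed)
```

Note that "Eric C" and the misspelled "Eric Creasey" both match, because the lookahead content has to match in full to block the result.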
Find last name when it follows first name
There are also lookbehinds Regex (?<=Eric )Cressey (?<=) is a positive lookbehind Cressey is matched if the previous characters match the lookbehind content Text to match Eric Eric Creasey Eric Cressey Eric C Erik Cressey
How do positive lookbehinds work? (?<=Eric )Cressey 1.Evaluates each character to see if it follows “Eric ” 2.It gets to “C” and then evaluates the rest of the expression 3.Only the match outside the lookbehind is returned Eric Eric Creasey Eric Cressey Eric C Erik Cressey
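The same behavior can be observed in a Python sketch. It assumes one sample per line. Python's built-in re module only accepts fixed-width lookbehinds, and “Eric ” qualifies.

```python
import re

lines = ["Eric", "Eric Creasey", "Eric Cressey", "Eric C", "Erik Cressey"]
pattern = re.compile(r"(?<=Eric )Cressey")

# Keep only the lines where "Cressey" comes directly after "Eric ".
matched_lines = [line for line in lines if pattern.search(line)]

# Only the part outside the lookbehind is returned.
match = pattern.search("Eric Cressey")

print(matched_lines)
print(match.group(0), match.start())
```

"Erik Cressey" is skipped because the characters before the last name do not match the lookbehind content.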
Find last name when it doesn’t follow first name
There are negative lookbehinds Regex (?<!Eric )Cressey (?<!) is a negative lookbehind Cressey is matched if the previous characters do not match the lookbehind content Text to match Eric Eric Creasey Eric Cressey Eric C Bill Cressey Cressey
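Here is the negative lookbehind run against the slide's samples, as a Python sketch with one sample per line (an assumption about the layout).

```python
import re

lines = ["Eric", "Eric Creasey", "Eric Cressey", "Eric C", "Bill Cressey", "Cressey"]
pattern = re.compile(r"(?<!Eric )Cressey")

# "Cressey" matches unless the five characters before it are "Eric ".
matched_lines = [line for line in lines if pattern.search(line)]

print(matched_lines)
```

The bare "Cressey" matches as well: with nothing in front of it, the lookbehind content cannot be present.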
Get the value for a given string ID
Start with what you know Regex stringID= Text to match stringID=Hello, world! stringID= 안녕하세요, 세계 stringID=Hola món
Add special characters and syntax Regex (?<=stringID=).* The positive lookbehind means the match must come directly after “stringID=”. The period (.) matches any character except a line break. The asterisk (*) is greedy and matches the previous character as many times as possible. Text to match stringID=Hello, world! stringID= 안녕하세요, 세계 stringID=Hola món
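As a rough illustration, this Python sketch pulls the values out of the three sample strings. It assumes each string sits on its own line, as it would in a properties file.

```python
import re

# The three samples from the slide, one per line.
text = "stringID=Hello, world!\nstringID= 안녕하세요, 세계\nstringID=Hola món"

# The lookbehind anchors the match right after "stringID=";
# .* then runs to the end of the line because . does not match a newline.
values = re.findall(r"(?<=stringID=).*", text)

for value in values:
    print(repr(value))
```

The leading space in the second value is preserved, since everything after the equals sign counts as part of the match.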
Make sure your ampersands are encoded
Start with what you know Regex & Text to match &amp; &
Add special characters and syntax Regex &(?!amp;) Only matches an ampersand when it is not followed by amp; Useful if you don’t want to replace all occurrences Text to match &amp; &
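Used in a replacement, the pattern encodes only the bare ampersands. This is a small Python sketch; the sample sentence is made up for illustration.

```python
import re

# Hypothetical sample: one ampersand is already encoded, one is not.
text = "Tom &amp; Jerry & friends"

# Replace only the ampersands that are not already followed by "amp;".
encoded, count = re.subn(r"&(?!amp;)", "&amp;", text)

print(encoded)
print(count)
```

One caveat: as written, the pattern would also match the ampersand in other entities such as `&lt;`, so the lookahead may need more alternatives if your content contains them.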
Get the content in an HTML tag
Start with what you know Regex.* Text to match Hello, world This is an example
Add special characters and syntax Regex (? ).*(?= ) You can use lookaheads and lookbehinds together Text to match Hello, world This is an example
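The tag used in this example was lost from the slide text, so the Python sketch below assumes a plain `<p>` element purely for illustration. It shows a lookbehind and a lookahead working together to return only the content between the tags.

```python
import re

# Assumed markup: the original tag names are not recoverable from the slide.
html = "<p>Hello, world</p>\n<p>This is an example</p>"

# The lookbehind requires an opening tag before the match,
# and the lookahead requires a closing tag after it.
contents = re.findall(r"(?<=<p>).*(?=</p>)", html)

print(contents)
```

Because both tags sit inside lookarounds, they are required but not returned, so a replacement would change the text and leave the markup alone.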
Get a paragraph with a specific class
Start with what you know Regex Text to match Hello, world This is the second paragraph
Add syntax to match unknown content Regex.* Greedy matches return the longest match Text to match Hello, world This is the second paragraph
Temper greedy matches Regex.*? *? Lazy matches return the shortest match Text to match Hello, world This is the second paragraph
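The difference between greedy and lazy is easiest to see on two paragraphs in a row. The markup below is an assumption (the slide's tags did not survive), including the class name "intro".

```python
import re

# Assumed markup for illustration; "intro" is a made-up class name.
html = '<p class="intro">Hello, world</p><p>This is the second paragraph</p>'

# Greedy: .* grabs as much as it can, so the match runs to the LAST </p>.
greedy = re.search(r'<p class="intro">.*</p>', html).group(0)

# Lazy: .*? grabs as little as it can, so the match stops at the FIRST </p>.
lazy = re.search(r'<p class="intro">.*?</p>', html).group(0)

print(greedy)
print(lazy)
```

For HTML patterns the lazy version is usually the one you want, since the greedy one swallows every paragraph up to the final closing tag.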
Get a paragraph based on one of many attributes
Use lazy matches to fill in unknown content Regex.*? Text to match Hello, world This is the second paragraph Goodbye
Multi-line replacements
Sometimes you want to insert multiple lines of text Find Hello Text to match Hello Replace with Hello Hi What’s up
You can use whitespace special characters in replacement text Find Hello Result Hello Hi What’s up Replace with Hello\nHi\nWhat’s up
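The same replacement can be done in code. This Python sketch relies on the fact that `re.sub` converts `\n` in the replacement template into a real line break; text editors differ in how they handle this, so test in yours first.

```python
import re

# \n in the replacement template becomes an actual newline.
result = re.sub(r"Hello", r"Hello\nHi\nWhat’s up", "Hello")

print(result)
```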
Add tags around content
You can reference groups in replacement text Regex (.*?) Text to match This sentence has some legacy content we want to replace. Replacement $1 Updated text This sentence has some legacy content we want to replace.
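The tags on this slide were lost, so the following Python sketch uses assumed ones: a legacy `<b>` element swapped for `<strong>`. Many editors write the group reference as `$1`; Python's re module writes it as `\1`.

```python
import re

# Assumed markup: the original tags are not recoverable from the slide.
text = "This sentence has some <b>legacy content</b> we want to replace."

# (.*?) captures the content lazily; \1 puts it back inside the new tags.
updated = re.sub(r"<b>(.*?)</b>", r"<strong>\1</strong>", text)

print(updated)
```

The captured content is carried over unchanged, so only the wrapper around it is rewritten.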
Updating URLs
Groups are numbered sequentially Text to match Replacement Updated text Regex
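The URLs from this slide did not survive, so here is a hedged Python sketch with entirely hypothetical addresses on example.com. It shows how groups are numbered from left to right and how a replacement can keep some groups while rewriting the text around them.

```python
import re

# Hypothetical old-style KB URL, for illustration only.
text = "See http://kb.example.com/article/123 for details."

# Group 1: protocol, group 2: domain, group 3: article number.
pattern = re.compile(r"(https?://)kb\.(example\.com)/article/(\d+)")

groups = pattern.search(text).groups()

# Reuse all three groups in a new (equally hypothetical) URL layout.
updated = pattern.sub(r"\1support.\2/kb/\3", text)

print(groups)
print(updated)
```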
Let’s recap. Here’s what we’ve learned so far. Groups – OR logic – Using groups in replacement text Lookaheads and lookbehinds Special characters – frequency (*,+,?) – newlines (\n) – any character (.) – escape with backslash (\) if necessary
Tips for massive projects
The manual approach doesn’t scale well when… Multiple regex operations are needed Regex must be applied in a specific order You need to match a pattern within a pattern You are working with many files in many directories
Steps for manually editing files in a directory 1.Get all files in a directory 2.For each file: – If the extension is .properties, .xml, or .txt 1.Get the text. 2.Use regex to find and update URLs. 3.Save the file. 3.For each directory: 1.Repeat directory steps above.
Pseudo code for programmatically editing files
Get all files in a directory
For each file in directory
    If the extension is .properties, .xml, or .txt
        Get the text
        Use regex to find and update URLs
        Save the file
For each directory, repeat directory steps above
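One way the pseudo code might translate into a real program is sketched below in Python. This is not the program from the talk: the function name `update_urls`, the URL pattern, and the replacement are all hypothetical, and the sketch assumes the files are UTF-8 encoded.

```python
import re
from pathlib import Path

# Extensions named in the pseudo code.
EXTENSIONS = {".properties", ".xml", ".txt"}

# Hypothetical URL pattern and replacement, for illustration only.
OLD_URL = re.compile(r"https?://kb\.example\.com/")
NEW_URL = "https://support.example.com/"


def update_urls(root: Path) -> list[Path]:
    """Rewrite matching URLs in every eligible file under root."""
    changed = []
    # rglob("*") visits the directory and all of its subdirectories.
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in EXTENSIONS:
            continue
        text = path.read_text(encoding="utf-8")
        updated, count = OLD_URL.subn(NEW_URL, text)
        if count:  # Save only when something actually changed.
            path.write_text(updated, encoding="utf-8")
            changed.append(path)
    return changed
```

Calling `update_urls(Path("docs"))` on a working copy under version control would return the list of files it rewrote, which makes it easy to review the diff before committing.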
Benefits of the programmatic approach Write each regex once You can perform them in a specific order Agile! Easy to update the program when requirements evolve Easy to test and iterate
You don’t have to start from scratch Get my basic program on GitHub Add regular expressions Visit eric.cressey.org for helpful resources Feel free to ask me if you have questions
Takeaways If there’s a pattern, use regular expressions You only need to know a small part of regex syntax to automate most repetitive tasks You can save days or weeks of time on large projects
Resources Notepad++ - free text editor with regex support regex101.com - great for writing and testing your regex eric.cressey.org - more regex tutorials
Thank you! Copyright © 2015 Symantec Corporation. All rights reserved. Symantec and the Symantec Logo are trademarks or registered trademarks of Symantec Corporation or its affiliates in the U.S. and other countries. Other names may be trademarks of their respective owners. This document is provided for informational purposes only and is not intended as advertising. All warranties relating to the information in this document, either express or implied, are disclaimed to the maximum extent allowed by law. The information in this document is subject to change without notice. Eric Cressey