1
Editing Tons of Text? RegEx to the Rescue!
Eric Cressey, Senior UX Content Writer, Symantec Corporation
Welcome everyone! I’m thrilled and honored to have you here. Today I’m going to talk about what a lot of people consider to be a very dry subject: regular expressions.
2
Why regular expressions?
I get excited when I learn new things, and I try to find ways to use them. This cartoon nicely captures how I think it will go when I learn a new tool. I discovered regular expressions as I was learning JavaScript on my quest to become an API tech writer. Once I realized what they were for, their usefulness for many of my daily tasks became obvious, and I started thinking about how I could use them to be more efficient as a writer.
I really dislike repetitive text editing tasks, and I’m sure you do too. While they’re not always a big part of our jobs as technical communicators, they’re always there. As the content we own grows, the time these tasks take grows with it. Maybe you update legacy content in your content management system, or you need to remove a chunk of text from structured content like properties files, XML, or HTML. If you work with localization, you may perform these repetitive tasks across huge sets of localized files.
3
Content maintenance adds up
Properties and source code
XML, HTML, structured content
Localization
Bugs
4
Sometimes we solve big problems
Remove legacy HTML from 10,000+ page Flare project
One week of work instead of more than a month
Saved 4 weeks of work
Update KB URLs in 20,000+ s and files
Two weeks of work instead of months
Saved 10 weeks of work across multiple departments
No errors or missed references
At my previous job, I had to go through a pretty large Flare project and remove some legacy content, like inline styles and other HTML bits, from a 10,000 page project. Some obvious find and replace reduced the workload, but I was still looking at months of manual effort. I decided my time was better spent figuring out how to write regular expressions that would automate the tasks. Instead of taking months or longer, I completed the project in about a week. I’ve been a regex evangelist ever since.
For example, after a few months at Symantec, I got a project to update URL references to our knowledge base articles. This was a massive project. Across all our websites and s, URL links needed to be modified to a new format. I immediately thought regular expressions would help, because the changes followed a logical pattern.
5
Agenda
Basics
Syntax and examples
Tips for massive projects
6
Encountering regex in the wild can be scary
^[2-9]\d{2}-\d{3}-\d{4}$
^#?([a-f0-9]{6}|[a-f0-9]{3})$
^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$
If you’ve seen regular expressions before, chances are you’ve seen intimidating ones like these. You might have seen a regular expression written by a developer. These regular expressions find a phone number with area code, a hex value, and an IP address. Of course, seeing something like this is enough to make you say “No thanks, that’s not for me.” They’re often poorly explained and difficult to understand. And generally, these aren’t the regular expressions we’re looking for.
It’s important to understand that developers and technical communicators solve different problems with regular expressions. Our content is often structured, and typically we do the same sort of tasks repeatedly. As a result, we don’t need to go as deep into regex as a developer does to get value from it.
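To make those three patterns a little less scary, here is a minimal Python sketch that runs each one against a couple of inputs. The sample strings and the helper name `matches` are made up for illustration; the talk itself does not include them.

```python
import re

# The three patterns from the slide, written as Python raw strings.
PHONE = re.compile(r"^[2-9]\d{2}-\d{3}-\d{4}$")
HEX_VALUE = re.compile(r"^#?([a-f0-9]{6}|[a-f0-9]{3})$")
IP_ADDRESS = re.compile(
    r"^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}"
    r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$"
)

def matches(pattern, text):
    """Return True when the string fits the pattern (hypothetical helper)."""
    return pattern.search(text) is not None

# Hypothetical sample inputs, one that fits and one that does not.
print(matches(PHONE, "425-555-0123"), matches(PHONE, "125-555-0123"))
print(matches(HEX_VALUE, "#1a2b3c"), matches(HEX_VALUE, "#1a2b"))
print(matches(IP_ADDRESS, "192.168.0.1"), matches(IP_ADDRESS, "256.1.1.1"))
```

Each line should print `True False`: the first input fits the pattern and the second breaks one of its rules (a leading 1, a four-digit hex value, an octet above 255).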
7
What are they?
Searches and regular expressions find patterns in text
Add logic, precision, and flexibility to searches
So, putting aside what you already know about regular expressions, here’s a friendlier definition. Regular expressions are searches with some extra syntax. If you think about it, when you do a basic search, you’re searching for a pattern of characters in a specific order. If I search for the word “cat” in a document, the search engine scans the text to find the letters c, a, and t together in order. You add regular expression syntax to a basic search to enable it to match more patterns. Instead of only matching a sequence of characters, regular expressions can match based on surrounding content, account for variance, and even match content based on incomplete information.
Regular expressions are a great tool to have in your toolkit, but they’re not magic. If there’s not an identifiable pattern, regular expressions won’t help.
8
Why use them?
The big reason to use regular expressions is efficiency. Simply put, using regular expressions allows us to finish mindless tasks faster, so we can spend more time on more valuable tasks, like writing.
9
Are there prerequisites?
Your text editor must support regular expressions
All you need is a text editor that supports regular expressions. I like Notepad++, shown here. Many authoring tools also support regular expressions as a search option.
10
How do you make them?
Start with the text you’re looking for
Identify a pattern
Add special characters
Test the regular expression to see if it matches what you want
11
Best practices
Use version control
Use a basic text editor
Test before replacing
Test again before committing
12
Syntax and examples
1 Dealing with variance
2 Using positional context
3 Matching unknown content
4 Putting it all together: HTML patterns
Copyright © 2014 Symantec Corporation
13
Matching name variations
14
Start with what you know
Regex: Eric
Text to match:
Eric
Erik
15
Identify a pattern
Regex: Eri
Text to match:
Eric
Erik
16
Add special character and syntax
Regex: Eri[ck]
Square brackets define a set of allowed characters
Text to match:
Eric
Erik
17
This pattern also works
Regex: Eri(c|k) or (Eric|Erik)
Parentheses group content together
The pipe specifies OR logic
Text to match:
Eric
Erik
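The name patterns on the last few slides can be tried out in Python as a quick check. This is only a sketch: "Erin" is an extra name added here to show a non-match, and `full_matches` is a hypothetical helper, needed because Python's `re.findall` returns only the group contents when a pattern contains parentheses.

```python
import re

def full_matches(pattern, text):
    """Return the full text of every match, even when the pattern has groups."""
    return [m.group(0) for m in re.finditer(pattern, text)]

# "Erin" is a hypothetical extra name that none of the patterns should match.
text = "Eric Erik Erin"

set_result = full_matches(r"Eri[ck]", text)       # character set
group_result = full_matches(r"Eri(c|k)", text)    # group with OR
words_result = full_matches(r"(Eric|Erik)", text) # OR between whole words

print(set_result)
print(group_result)
print(words_result)
```

All three lists come out as `['Eric', 'Erik']`, which is the point of the slide: there is often more than one pattern that does the job.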
18
Matching URL variations
19
Start with what you know
Regex Text to match Include optional content symantec.com
20
Identify a pattern in the text you want to match
symantec.com
21
Escape special characters with a backslash
Regex Text to match symantec.com
22
Add groups to logical sections with parentheses
Regex Text to match ( symantec.com
23
Indicate number of times to match each group
Regex: (https?:\/\/)?(www\.)?symantec\.com
+, *, or ? specifies how many times to match a group or character
+ one or more
* zero or more
? zero or one
Text to match: symantec.com
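Here is a small Python sketch of the finished URL pattern. The URL variations in `samples` are assumptions made for this example (the transcript only preserves `symantec.com`), and `fullmatch` is used so each sample has to fit the pattern from start to end.

```python
import re

# The pattern from the slide: optional protocol, optional "www.", then the domain.
URL = re.compile(r"(https?:\/\/)?(www\.)?symantec\.com")

# Hypothetical sample inputs; only the last one should fail.
samples = [
    "symantec.com",
    "www.symantec.com",
    "http://symantec.com",
    "https://www.symantec.com",
    "symantecXcom",  # fails: the escaped dot only matches a literal "."
]

results = {s: URL.fullmatch(s) is not None for s in samples}
for sample, ok in results.items():
    print(f"{sample}: {ok}")
```

Without the backslash before the dot, `symantecXcom` would match too, because an unescaped `.` stands for any character.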
24
Find first name when followed by last name
25
Start with what you know
Regex: Eric
Text to match:
Eric
Eric Creasey
Eric Cressey
Eric C
26
Add special characters and syntax
Regex: Eric(?= Cressey)
(?=) is a positive lookahead. Eric is returned only if the next characters match the lookahead content
Text to match:
Eric
Eric Creasey
Eric Cressey
Eric C
27
How do positive lookaheads work?
Regex: Eric(?= Cressey)
Finds “Eric” as usual
Evaluates the following content to see if it matches the lookahead content
If the content is the same, “Eric” is a match
Text to match:
Eric
Eric Creasey
Eric Cressey
Eric C
28
Find first name not followed by last name
29
There are negative lookaheads
Regex: Eric(?! Cressey)
(?!) is a negative lookahead
Eric is matched if the next characters do not match the lookahead content
Text to match:
Eric
Eric Creasey
Eric Cressey
Eric C
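As a rough illustration, the two lookahead patterns can be run against the sample names from the slides in Python. The variable names are invented for this sketch.

```python
import re

# The sample lines from the slides.
names = ["Eric", "Eric Creasey", "Eric Cressey", "Eric C"]

# Positive lookahead: "Eric" only when " Cressey" comes next.
followed = [n for n in names if re.search(r"Eric(?= Cressey)", n)]

# Negative lookahead: "Eric" only when " Cressey" does NOT come next.
not_followed = [n for n in names if re.search(r"Eric(?! Cressey)", n)]

# The lookahead content is checked but is not part of the match itself.
matched_text = re.search(r"Eric(?= Cressey)", "Eric Cressey").group(0)

print(followed)      # lines where the positive lookahead succeeds
print(not_followed)  # lines where the negative lookahead succeeds
print(matched_text)  # only "Eric" is returned
```

The positive version picks out only "Eric Cressey", the negative version picks out the other three lines, and in both cases the text that would be replaced is just "Eric".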
30
Find last name when it follows first name
31
There are also lookbehinds
Regex: (?<=Eric )Cressey
(?<=) is a positive lookbehind
Cressey is matched if the previous characters match the lookbehind content
Text to match:
Eric
Eric Creasey
Eric Cressey
Eric C
Erik Cressey
32
How do positive lookbehinds work?
Regex: (?<=Eric )Cressey
Evaluates each character to see if it follows “Eric ”
It gets to “C” and then evaluates the rest of the expression
Only the match outside the lookbehind is returned
Text to match:
Eric
Eric Creasey
Eric Cressey
Eric C
Erik Cressey
33
Find last name when it doesn’t follow first name
34
There are negative lookbehinds
Regex: (?<!Eric )Cressey
(?<!) is a negative lookbehind
Cressey is matched if the previous characters do not match the lookbehind content
Text to match:
Eric
Eric Creasey
Eric Cressey
Eric C
Bill Cressey
Cressey
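The lookbehind patterns work the same way in a Python sketch. The list below combines the sample names from the two lookbehind slides; the variable names are assumptions for this example. Note that some regex engines do not support lookbehinds, and Python requires the lookbehind content to have a fixed length, which "Eric " does.

```python
import re

# Sample lines gathered from the lookbehind slides.
names = [
    "Eric", "Eric Creasey", "Eric Cressey", "Eric C",
    "Erik Cressey", "Bill Cressey", "Cressey",
]

# Positive lookbehind: "Cressey" only when it follows "Eric ".
after_eric = [n for n in names if re.search(r"(?<=Eric )Cressey", n)]

# Negative lookbehind: "Cressey" only when it does NOT follow "Eric ".
not_after_eric = [n for n in names if re.search(r"(?<!Eric )Cressey", n)]

print(after_eric)
print(not_after_eric)
```

"Eric Creasey" appears in neither list because it never contains the text "Cressey" in the first place.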
35
Get the value for a given string ID
36
Start with what you know
Regex: stringID=
Text to match:
stringID=Hello, world!
stringID=안녕하세요 , 세계
stringID=Hola món
37
Add special characters and syntax
Regex: (?<=stringID=).*
Positive lookbehind means content must follow the string ID
. (period) matches any character
* is greedy and matches the previous character as many times as possible
Text to match:
stringID=Hello, world!
stringID=안녕하세요 , 세계
stringID=Hola món
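A short Python sketch of the string ID pattern, using the three sample lines from the slide joined into one block of text as they might appear in a properties file. By default the period does not match a line break in Python, so each match stops at the end of its line.

```python
import re

# The three sample lines from the slide, as they might appear in one file.
text = "stringID=Hello, world!\nstringID=안녕하세요 , 세계\nstringID=Hola món"

# The lookbehind anchors the match right after "stringID=";
# ".*" then takes everything up to the end of that line.
values = re.findall(r"(?<=stringID=).*", text)

for value in values:
    print(value)
```

Because the pattern never names the characters it expects in the value, it works just as well for the localized strings as for the English one.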
38
Make sure your ampersands are encoded
39
Start with what you know
Regex Text to match & & &
40
Add special characters and syntax
Regex Text to match &(?!amp;) Only matches ampersand when not followed by amp; Useful if you don’t want to replace all occurrences & &
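Here is a minimal sketch of the ampersand fix in Python. The sample sentence is invented for illustration. One limitation to keep in mind: this pattern only knows about `amp;`, so other entities such as `&lt;` would also have their ampersand replaced.

```python
import re

# Hypothetical sample text with a mix of bare and already-encoded ampersands.
raw = "Tom & Jerry &amp; friends & more"

# Replace "&" only when it is not already the start of "&amp;".
encoded = re.sub(r"&(?!amp;)", "&amp;", raw)

# Running the same replacement again changes nothing, so it is safe to repeat.
encoded_twice = re.sub(r"&(?!amp;)", "&amp;", encoded)

print(encoded)
print(encoded == encoded_twice)
```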
41
Get the content in an HTML tag
42
Start with what you know
Regex: <p>.*</p>
Text to match:
<p>Hello, world</p>
<p>This is an example</p>
43
Add special characters and syntax
Regex: (?<=<p>).*(?=<\/p>)
You can use lookaheads and lookbehinds together
Text to match:
<p>Hello, world</p>
<p>This is an example</p>
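As a quick Python sketch, the combined lookbehind and lookahead pattern (written with no whitespace between the lookbehind and `.*`) returns only the text between the tags for the two sample lines from the slide:

```python
import re

lines = ["<p>Hello, world</p>", "<p>This is an example</p>"]

# Lookbehind for the opening tag, lookahead for the closing tag.
# Neither tag is part of the returned match.
CONTENT = re.compile(r"(?<=<p>).*(?=<\/p>)")

contents = [CONTENT.search(line).group(0) for line in lines]

for content in contents:
    print(content)
```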
44
Get a paragraph with a specific class
45
Start with what you know
Regex: <p class="foo">
Text to match:
<p class="foo">Hello, world</p><p>This is the second paragraph</p>
46
Add syntax to match unknown content
Regex: <p class="foo">.*<\/p>
Greedy matches return the longest match
Text to match:
<p class="foo">Hello, world</p><p>This is the second paragraph</p>
47
Temper greedy matches
Regex: <p class="foo">.*?<\/p>
*? Lazy matches return the shortest match
Text to match:
<p class="foo">Hello, world</p><p>This is the second paragraph</p>
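The difference between the greedy and lazy versions is easiest to see side by side. This Python sketch runs both patterns against the sample text from the slides; the variable names are made up for the example.

```python
import re

html = '<p class="foo">Hello, world</p><p>This is the second paragraph</p>'

# Greedy: ".*" grabs as much as it can, so the match runs to the LAST </p>.
greedy = re.search(r'<p class="foo">.*<\/p>', html).group(0)

# Lazy: ".*?" grabs as little as it can, so the match stops at the FIRST </p>.
lazy = re.search(r'<p class="foo">.*?<\/p>', html).group(0)

print(greedy)
print(lazy)
```

The greedy match returns both paragraphs, while the lazy match returns only the paragraph with the `foo` class.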
48
Get a paragraph based on one of many attributes
49
Use lazy matches to fill in unknown content
Regex: <p.*?class=".*?foo.*?".*?>.*?<\/p>
Text to match:
<p class="foo" id="first">Hello, world</p><p>This is the second paragraph</p><p id="third" class="bar foo">Goodbye</p>
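One behavior worth testing before you replace anything: the period can also match `>`, so a lazy `.*?` is allowed to run past the end of a tag that has no class attribute and keep going until it finds one. The Python sketch below shows this on the sample text (with straight quotes). The tighter variant is a suggestion that is not from the talk: it swaps `.*?` for `[^>]*` and `[^"]*` so that each piece has to stay inside its own tag or attribute value.

```python
import re

html = (
    '<p class="foo" id="first">Hello, world</p>'
    "<p>This is the second paragraph</p>"
    '<p id="third" class="bar foo">Goodbye</p>'
)

# The pattern from the slide.
slide_matches = [
    m.group(0)
    for m in re.finditer(r'<p.*?class=".*?foo.*?".*?>.*?<\/p>', html)
]

# A tighter variant (an assumption, not from the talk):
# [^>]* cannot leave the opening tag, [^"]* cannot leave the attribute value.
tight_matches = [
    m.group(0)
    for m in re.finditer(r'<p[^>]*class="[^"]*foo[^"]*"[^>]*>.*?<\/p>', html)
]

for match in slide_matches:
    print("slide:", match)
for match in tight_matches:
    print("tight:", match)
```

With the slide pattern, the second match starts at the classless second paragraph and swallows it along with the third. The tighter variant returns just the first and third paragraphs.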
50
Multi-line replacements
51
Sometimes you want to insert multiple lines of text
Find: Hello
Text to match: Hello
Replace with:
Hello
Hi
What’s up
52
You can use white space special characters in replacement text
Find: Hello
Replace with: Hello\nHi\nWhat’s up
Result:
Hello
Hi
What’s up
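The same multi-line replacement can be sketched in Python. When the replacement is written as a raw string, `re.sub` itself turns `\n` into a line break, much like a text editor's regex replace does.

```python
import re

text = "Hello"

# "\n" in the replacement becomes a real line break in the result.
result = re.sub(r"Hello", r"Hello\nHi\nWhat's up", text)

print(result)
```

The result is three lines of text, "Hello", "Hi", and "What's up", where there used to be one.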
53
Add tags around content
54
You can reference groups in replacement text
Regex: <span.*?>(.*?)<\/span>
Text to match: <p>This sentence has some <span class="bold">legacy content</span> we want to replace.</p>
Replacement: <strong>$1</strong>
Updated text: <p>This sentence has some <strong>legacy content</strong> we want to replace.</p>
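Group references look slightly different from tool to tool. The slide uses `$1`; Python's `re` module writes the same reference as `\1` (or `\g<1>`). With that one change, the span replacement can be sketched like this:

```python
import re

text = (
    '<p>This sentence has some <span class="bold">legacy content</span> '
    "we want to replace.</p>"
)

# Group 1 captures whatever sits between the opening and closing span tags.
# "\1" in the replacement puts that captured text back inside <strong>.
updated = re.sub(r"<span.*?>(.*?)<\/span>", r"<strong>\1</strong>", text)

print(updated)
```

The tags change, but the content captured by the group is carried over untouched.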
55
Updating URLs
56
Groups are numbered sequentially
Text to match Regex Replacement Updated text
57
Let’s recap. Here’s what we’ve learned so far.
Groups
OR logic
Using groups in replacement text
Lookaheads and lookbehinds
Special characters
  frequency (*, +, ?)
  newlines (\n)
  any character (.)
  escape with backslash (\) if necessary
58
Tips for massive projects
All of that is great for daily tasks and will save you minutes or hours of time. But what happens when you get a project like I mentioned earlier, where you’re updating thousands of files across many directories? What happens when you have to perform multiple regular expressions?
59
The manual approach doesn’t scale well when…
Multiple regex operations are needed
Regex must be applied in a specific order
You need to match a pattern within a pattern
You are working with many files in many directories
The task was simple enough, but there were a few things that made doing it manually impossible. First, this also included localized URLs, so it wasn’t just a few regular expressions that I could perform in Notepad++; it was updating URLs by locale to specific formats. Second, because I wanted to be as surgical with the updates as possible, I wanted to find URLs and then perform regex just on URL content, so no other content was in danger of being changed.
Once you start having to do multiple regular expressions to complete a task, you start to lose the “hands-off” benefit that draws us to regular expressions in the first place. Regular expressions will still save you time, but because you’re doing more of them or doing them in a specific order, you’re more likely to introduce human error and it takes longer to do the work.
60
Steps for manually editing files in a directory
Get all files in a directory
For each file:
  If the extension is .properties, .xml, or .txt
    Get the text.
    Use regex to find and update URLs.
    Save the file.
For each directory:
  Repeat directory steps above.
If I perform this task manually, I need to go through a folder looking for s and text files that need to be updated. Then I need to apply a series of regular expressions to catch all of the cases for the URL updates. That’s a lot of work if I have a lot of directories to go through. Looking closely at this procedure, it’s pretty obvious that it is the same low-value work we’re already trying to avoid by using regular expressions. So what can we do to get the value back?
61
Pseudo code for programmatically editing files
Get all files in a directory
For each file in directory
  If the extension is .properties, .xml, or .txt
    Get the text
    Use regex to find and update URLs
    Save the file
For each directory, repeat directory steps above
Luckily, computers are awesome at boring, repetitive stuff like this. That’s why we like regular expressions, and that’s why using a simple program or script to automate these tasks is the way to go. It’s pretty easy to translate that manual workflow into the pseudo code shown here.
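To give a feel for how little code the pseudo code on this slide turns into, here is a minimal Python sketch. It is not the program mentioned later in the talk. The URL rule in `RULES` is purely hypothetical (the talk does not give the old or new URL formats), the function names are invented, and the sketch assumes the files are UTF-8 text.

```python
import re
from pathlib import Path

# File types to update, taken from the pseudo code.
EXTENSIONS = {".properties", ".xml", ".txt"}

# Hypothetical rule for illustration only: real patterns and replacements
# depend on your own URL formats. Rules run in the order listed.
RULES = [
    (
        re.compile(r"http://kb\.example\.com/article/(\d+)"),
        r"https://support.example.com/kb/\1",
    ),
]

def update_text(text):
    """Apply every regex rule, in order, to one file's text."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text

def update_directory(root):
    """Walk root and all of its subdirectories, rewriting matching files."""
    changed = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in EXTENSIONS:
            original = path.read_text(encoding="utf-8")
            updated = update_text(original)
            if updated != original:
                path.write_text(updated, encoding="utf-8")
                changed.append(path)
    return changed

# Example call (the folder name is hypothetical):
# changed_files = update_directory("content")
```

Returning the list of changed files makes it easy to review what the script touched, which pairs well with the earlier advice to use version control and test before committing.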
62
Benefits of the programmatic approach
Write each regex once
You can perform them in a specific order
Agile! Easy to update the program when requirements evolve
Easy to test and iterate
Okay, so back to the URL update project. I went with the programmatic approach and it saved me a bunch of time. Better yet, because everything was bundled so nicely in a simple program, other departments could use it to update their files, too. It was a huge win for us, and we estimate it saved about 10 weeks of work. Because all of the time for this was spent writing the program and regular expressions, I knew all of the outliers and had planned for them before it was time to update the files. As a result, we eliminated human error. All of the URLs were updated. That’s pretty huge.
I appreciate that we’re not programmers. You don’t need to be a programmer to go with this approach. You don’t really even need to know much about coding at all. Really, the most challenging part of this approach is coming up with the regular expressions. If you have those, it’s easy to plug them into a generic program that goes through files and updates them.
Of course, I wouldn’t advocate this approach without good reason. There are tons of benefits to doing it this way. But the biggest reason to do it is that it streamlines your work, allowing you to focus on building regular expressions. You write them once in the program and the program applies them to the files and folders you specify. Computers do this stuff really quickly, so once you’re ready to run the program, it’ll probably take less than a minute to update all of your files. This makes it really easy to test and iterate and make sure your regular expressions are doing what you want them to do.
63
You don’t have to start from scratch
Get my basic program on GitHub
Add regular expressions
Visit eric.cressey.org for helpful resources
Feel free to ask me if you have questions
So you might still be thinking: That worked for you, but I can’t write a program. If you are, I’m really excited because that’s what I thought you’d say. I don’t want you to have to write the program either, because that may not be that valuable for you. So, I’m sharing the basic program I used with you. All you’ll have to do is just add the regular expressions you want to perform and point it to the right directory and files. And now that you’ve gone through this presentation, you’re well equipped to write those regular expressions.
I know the programmatic approach is daunting, especially if you don’t feel particularly code-savvy. It’s worth considering because of how efficient it is. Plus, I think time spent developing new skills is a lot more valuable than time spent doing mindless work. So even if you don’t save much time, you’re still actively developing skills that will help you be more efficient and possibly open up career opportunities for you.
I want you to succeed, so if you have a massive project like the ones I’ve discussed, feel free to reach out and I’ll try to give you some tips and help you get started with the program I wrote. I can’t guarantee support, of course, but I’ll help as much as I can.
64
Takeaways
If there’s a pattern, use regular expressions
You only need to know a small part of regex syntax to automate most repetitive tasks
You can save days or weeks of time on large projects
65
Resources
Notepad++ - free text editor with regex support
regex101.com - great for writing and testing your regex
eric.cressey.org - more regex tutorials
66
eric_cressey@symantec.com eric.cressey.org @Eric_Cressey