Advanced Find and Replace with Regular Expressions Robert Kiffe Senior Customer Support Engineer
Agenda Review: Global Find and Replace Introduction to Regular Expressions Challenge #1 Solution Advanced Regular Expressions Challenge #2 Hands On Q & A
Global Find and Replace Location: Content > Find and Replace Administrators Only (User Level 10) Searches a single site Adjust ‘Scope’ to limit searchable content Literal Text or Regex patterns
Global Find and Replace Simple search with results list Preview Replace Safe multi-step process Perform ‘sample’ find/replace and display results list Select pages from results to perform the actual find/replace operation (Optional) Publish selected results
Regular Expressions Regular Expression A pattern that ‘describes’ a certain amount of text The concept arose in the 1950s when the American mathematician Stephen Cole Kleene formalized the description of a regular language. (Thanks Wikipedia) Now used in almost every major programming language
Literal Characters Literal Text Matches Most characters match exactly themselves Case Sensitive Text: Robert does not like to be called robert. Pattern: Robert Matches: only the capitalized “Robert” at the start, not the lowercase “robert”.
Special Characters Symbol characters that have a special purpose (explained later) Full List: \ ^ $ . | ? * + ( ) [ { To match them as literal characters, you must ‘escape’ them by adding “\” in front Text: Rob does not like to be called Robert? Pattern: Robert\? Matches: “Robert?”, including the question mark.
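
As a rough illustration of escaping, the sketch below uses Python's standard `re` module rather than the Find and Replace screen itself, so treat the tooling as an assumption; the text and patterns come from the slide. It compares the unescaped and escaped forms of the same pattern.

```python
import re

text = "Rob does not like to be called Robert?"

# Unescaped: "?" acts as a quantifier, so the final "t" becomes optional
# and the question mark in the text is never matched.
unescaped = re.findall(r"Robert?", text)

# Escaped: "\?" matches a literal question mark.
escaped = re.findall(r"Robert\?", text)

# re.escape() adds the backslashes for you when the search text is literal.
auto_pattern = re.escape("Robert?")
auto = re.findall(auto_pattern, text)

print(unescaped)  # ['Robert']
print(escaped)    # ['Robert?']
print(auto)       # ['Robert?']
```
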
Special Character: Period ‘Wildcard’ Character Matches any single character except newline. Text: Robert does not like to be called oberth, Bobert, or Goobert. Pattern: .obert Matches: “Robert”, “ obert” (the space before “oberth” counts as the wildcard character), “Bobert”, and “oobert” (inside “Goobert”).
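
The wildcard behaviour can be checked with a short sketch, again assuming Python's `re` module as a stand-in for the product's regex engine. Note that the period happily matches a space, and that escaping it restores the literal meaning.

```python
import re

text = "Robert does not like to be called oberth, Bobert, or Goobert."

# "." matches any one character (except newline), including a space.
wildcard = re.findall(r".obert", text)

# "\." matches only a literal period.
literal_periods = re.findall(r"\.", text)

print(wildcard)         # ['Robert', ' obert', 'Bobert', 'oobert']
print(literal_periods)  # ['.']
```
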
Special Characters: Quantifiers Symbol characters that define how many of the previous character(s) to match ? (0 or 1) * (0 or More) + (1 or More) Use Curly Brackets to indicate an exact number or range {3} (Exactly 3) {3,} (3 or More) {3,5} (3, 4, or 5) Only modifies the previous character (or group)
Special Characters: Quantifiers Example ? : 0 or 1 Text: Robert does not like to be called Roberta. Pattern: Roberta? Matches: “Robert” and “Roberta” (the “?” makes the final “a” optional).
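
A minimal sketch of the quantifiers described above, assuming Python's `re` module. The `Roberta?` example is from the slide; the `o{2,3}` and `(ab){2}` strings are made-up test inputs chosen only to show ranges and group quantifiers.

```python
import re

# "?" makes only the preceding character optional.
optional_a = re.findall(r"Roberta?", "Robert does not like to be called Roberta.")

# Curly brackets give an exact count or a range (here: 2 or 3 of "o").
ranged = re.findall(r"o{2,3}", "o oo ooo oooo")

# A quantifier after a group repeats the whole group, not just one character.
group_repeat = re.fullmatch(r"(ab){2}", "abab") is not None   # "ab" twice
char_repeat = re.fullmatch(r"ab{2}", "abab") is not None      # "a" then "bb"

print(optional_a)    # ['Robert', 'Roberta']
print(ranged)        # ['oo', 'ooo', 'ooo']
print(group_repeat)  # True
print(char_repeat)   # False
```
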
Special Characters: Parentheses Capture Groups Encapsulate a character sequence using parentheses: “(…)” Add a quantifier to affect the whole group Replace In the ‘replace field’, refer to your groups using the “dollar sign” followed by the group number: $# (for example, $1 for the first group) Count the opening parenthesis characters, “(” , from left to right to determine the correct #
Special Characters: Parentheses Capture Group: Example Text: I like https://school.edu but not https://www.school.edu. FIND: https://www\.(school\.edu) REPLACE: https://$1 Result: I like https://school.edu but not https://school.edu.
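
The same find/replace can be tried outside the product with the sketch below. It assumes Python's `re` module, where the group reference is written `\1` instead of the `$1` used in the replace field; the text and pattern are from the slide.

```python
import re

text = "I like https://school.edu but not https://www.school.edu."

# Group 1 captures "school.edu"; the "www." prefix sits outside the group.
find = r"https://www\.(school\.edu)"

# Python writes the group reference as \1 (the slide's replace field uses $1).
replace = r"https://\1"

result = re.sub(find, replace, text)
print(result)  # I like https://school.edu but not https://school.edu.
```
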
Challenge #1 Find All Links to a Particular Domain Problem is that it can have many formats: Root-relative “/” /about/contact.html Absolute (either protocol) http://www.gallena.com/about/contact.html https://www.gallena.com/about/contact.html No Subdomain http://gallena.com/about/contact.html Examples: <a href="/about/"> <a href="http://www.gallena.com/about/">
Challenge #1: Tips Use a quantifier (e.g. ‘?’) to make a part of the URL optional: a? Combine a quantifier with parentheses to make a substring of the URL optional: (abc)?
Challenge #1: Solution Steps to Build the Regex Pattern: href="https?://www\.gallena\.com/ (HTTP or HTTPS protocol) href="https?://(www\.)?gallena\.com/ (+Subdomain optional) href="(https?://(www\.)?gallena\.com)?/ (+Root-relative) Example Matches: <a href="http://www.gallena.com/about/">About</a> <a href="http://gallena.com/records/index.html">Records</a> <a href="/academics/index.html">Academics</a> <a href="https://www.gallena.com/portal/">Portal Login</a>
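
To sanity-check the final pattern, the sketch below runs it against the four example links plus one external link that should not match. It assumes Python's `re` module; the `example.com` link is a hypothetical addition for the negative case.

```python
import re

# Final pattern from the solution steps.
pattern = re.compile(r'href="(https?://(www\.)?gallena\.com)?/')

internal_links = [
    '<a href="http://www.gallena.com/about/">About</a>',
    '<a href="http://gallena.com/records/index.html">Records</a>',
    '<a href="/academics/index.html">Academics</a>',
    '<a href="https://www.gallena.com/portal/">Portal Login</a>',
]
# Hypothetical external link, added only to show a non-match.
external_link = '<a href="https://www.example.com/">Elsewhere</a>'

internal_hits = [bool(pattern.search(link)) for link in internal_links]
external_hit = bool(pattern.search(external_link))

print(internal_hits)  # [True, True, True, True]
print(external_hit)   # False
```

One caveat worth knowing: because the domain group is optional, the pattern also matches any value that starts with a slash, which includes protocol-relative links such as `href="//other.example/"`.
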
Special Characters: Square Brackets Character Sets Characters encased inside square brackets define all possible matches for a single text character: [abc] A quantifier placed directly after the set will affect the whole character set Placing a “-” between characters indicates a ‘range’ Placing a “^” as the first item in the set creates a ‘negative pattern’ Quantifier characters become literal matches: ? + * { } Period character becomes literal match: .
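Character Sets: Examples Text: Robert does not like to be called robert. Pattern: [Rr]obert Matches: “Robert” and “robert”. Text: Robert does not like to be called Richard. Pattern: [A-Z][a-z]+ Matches: “Robert” and “Richard” (a capital letter followed by one or more lowercase letters). Text: Robert does not like to be called Roberta. Pattern: [^A-Z .]+ Matches: every run of characters that contains no capital letter, space, or period: “obert”, “does”, “not”, “like”, “to”, “be”, “called”, “oberta”.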
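
The three character-set examples can be reproduced with the sketch below, assuming Python's `re` module; the texts and patterns are taken from the slide.

```python
import re

# A set matches one character out of the listed options.
either_case = re.findall(r"[Rr]obert", "Robert does not like to be called robert.")

# Ranges: one capital letter followed by one or more lowercase letters.
capitalized = re.findall(r"[A-Z][a-z]+", "Robert does not like to be called Richard.")

# Negated set: runs of anything that is not a capital, a space, or a period.
# Inside the brackets the period is a literal character, not a wildcard.
negated = re.findall(r"[^A-Z .]+", "Robert does not like to be called Roberta.")

print(either_case)  # ['Robert', 'robert']
print(capitalized)  # ['Robert', 'Richard']
print(negated)      # ['obert', 'does', 'not', 'like', 'to', 'be', 'called', 'oberta']
```
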
Shorthand Character Classes Certain characters can reference a range of characters when ‘escaped’ by a backslash (\) Common Examples: \d matches all digit characters: [0-9] \w matches all ‘word’ characters: [A-Za-z0-9_] \s matches all ‘space’ characters (including line breaks) Using the capital letter will ‘invert’ the match \S matches all non-space characters: [^\s]
Character Classes: Example Text: Jenny’s number is 867-5309. Pattern: \d{3}-\d{4} Matches: “867-5309”.
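
A short sketch of the shorthand classes, assuming Python's `re` module. The phone-number example is from the slide; the `\w` and `\S` inputs are made-up strings used only for illustration.

```python
import re

# \d{3}-\d{4}: three digits, a hyphen, then four digits.
phone = re.findall(r"\d{3}-\d{4}", "Jenny’s number is 867-5309.")

# \w covers letters, digits, and underscore, so the hyphen splits the match.
words = re.findall(r"\w+", "user_1 logged-in")

# \S is the inverse of \s: runs of non-space characters (line breaks count as space).
chunks = re.findall(r"\S+", "a b\nc")

print(phone)   # ['867-5309']
print(words)   # ['user_1', 'logged', 'in']
print(chunks)  # ['a', 'b', 'c']
```
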
Greedy Matches Quantifiers are ‘greedy’ by default: they match as much text as possible, so a careless (or purposeful) pattern could match beyond an expected result Add a “?” after the quantifier to make it ‘lazy’, so the pattern stops at the first successful match Text: Robert likes dogs! Robert likes cats! Pattern: Robert likes .*! Matches: the entire text as one match, through the last “!”. Pattern: Robert likes .*?! Matches: “Robert likes dogs!” and “Robert likes cats!” as two separate matches.
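
The difference between the two patterns is easiest to see side by side. The sketch below assumes Python's `re` module; the text and patterns are from the slide.

```python
import re

text = "Robert likes dogs! Robert likes cats!"

# Greedy: ".*" grabs as much as it can, so the match runs to the last "!".
greedy = re.findall(r"Robert likes .*!", text)

# Lazy: ".*?" stops at the first "!" that lets the pattern succeed.
lazy = re.findall(r"Robert likes .*?!", text)

print(greedy)  # ['Robert likes dogs! Robert likes cats!']
print(lazy)    # ['Robert likes dogs!', 'Robert likes cats!']
```
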
Challenge #2 Set External Links to Create a New Window Need to add the attribute target="_blank" Links will start with “http” or “https” Examples: <a href="http://www.omniupdate.com/">OmniUpdate</a> <a href="https://petitions.whitehouse.gov/">Petitions</a> Desired Result: <a href="http://www.omniupdate.com/" target="_blank">OmniUpdate</a> <a href="https://petitions.whitehouse.gov/" target="_blank">Petitions</a>
Challenge #2: Tips Remember lessons learned from Challenge #1: (abc)? Remember syntax requirements of HTML (or XML) HTML/XML have special characters that can only be used in certain places Use a “Not” to match any character not in the set: [^abc] Use capture groups to re-place content as needed: (abc) -> $1
Challenge #2: Solution Steps to Build the Regex Pattern FIND: <a href="http://www\.omniupdate\.com/">OmniUpdate</a> (Starting Pattern) <a\s*href="http://www\.omniupdate\.com/"\s*> (Account for whitespace) <a\s*href="https?://[^"]+"\s*> (Match any absolute URL) (<a\s*href="https?://[^"]+"\s*)> (Capture Group) REPLACE: $1 target="_blank"> (Use capture group, then end anchor tag) Example Match/Replace: <a href="http://www.omniupdate.com/about/">About</a> (Original) <a href="http://www.omniupdate.com/about/"> (Full Match) <a href="http://www.omniupdate.com/about/" (Capture, $1) <a href="http://www.omniupdate.com/about/" target="_blank">About</a> (Result)
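
The solution can be rehearsed on a small snippet before running it against a site. The sketch below assumes Python's `re` module, where the replacement is written `\1` rather than `$1`; the `/about/` link is a hypothetical internal link added to show that root-relative links are left alone.

```python
import re

# FIND pattern from the solution: group 1 is everything up to the closing ">".
find = r'(<a\s*href="https?://[^"]+"\s*)>'

# REPLACE: re-emit group 1, then add the attribute and close the tag.
replace = r'\1 target="_blank">'

html = (
    '<a href="http://www.omniupdate.com/">OmniUpdate</a>\n'
    '<a href="https://petitions.whitehouse.gov/">Petitions</a>\n'
    '<a href="/about/">About</a>'  # hypothetical internal link, should not change
)

result = re.sub(find, replace, html)
print(result)
```

Because the pattern expects `href` to be the only attribute, links that already carry `target` or other attributes are not matched, so running the replace a second time does not add the attribute twice.
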
Thank you. Robert Kiffe Sr. Customer Support Engineer OmniUpdate 805-484-9400 ext 223 rkiffe@omniupdate.com outc18.com/surveys