7 Searching and Regular Expressions (Regex) Mauro Jaskelioff
Introduction
Shell metacharacters
–What are they?
–Why they are not the same as regular expressions!
More about regular expressions
–Searching file contents using: grep, egrep, fgrep
Shell Metacharacters
Special characters are characters that have some meaning to the shell
Also known as metacharacters
They are interpreted by the shell for expansion unless they are quoted or escaped (more on this later)
E.g.: $ file ../*
(gives the file type for all files in the directory one level up)
Filename Expansion
The * metacharacter matches multiple files. It means any string of zero or more characters, except that it never matches a leading dot, so hidden files have to be matched explicitly. E.g.:
–*.txt matches any filename ending in .txt
–myfile.* matches all filenames that start with myfile. followed by any suffix
–*.* matches all filenames that contain a dot (hidden files excluded)
–* matches all non-hidden files
–UST/* matches all non-hidden files in the UST directory
–.* matches all hidden files (in many shells also the special entries . and ..)
–*ology matches all filenames with ology at the end (or a filename of just ology ☺)
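The patterns above can be tried safely in a scratch directory. The following is a minimal sketch, not part of the original slides: the file names are made up for illustration, and echo is used only because it prints whatever arguments the shell hands it after expansion.

```shell
# Scratch directory with a few made-up files, including one hidden file.
demo_dir=$(mktemp -d)
touch "$demo_dir/notes.txt" "$demo_dir/report.txt" "$demo_dir/myfile.c" \
      "$demo_dir/myfile.h" "$demo_dir/readme" "$demo_dir/.hidden"

# Each command substitution runs in a subshell, so the cd does not leak out.
txt_files=$(cd "$demo_dir" && echo *.txt)      # names ending in .txt
myfile_any=$(cd "$demo_dir" && echo myfile.*)  # myfile. plus any suffix
star_only=$(cd "$demo_dir" && echo *)          # every non-hidden name

printf '%s\n' "$txt_files" "$myfile_any" "$star_only"
rm -rf "$demo_dir"
```

Note that .hidden never appears in the output of the plain * pattern.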
Filename Expansion (2)
The previous example: $ file ../*
1. The shell expands the metacharacters in the command line: $ file ../file1 ../file2 ../file3
2. The command is executed.
Commands don’t interpret shell metacharacters
The interpretation is done by the shell
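One way to see that the shell, not the command, does the expansion is to count the arguments a command receives. This is a sketch under assumptions: count_args is a hypothetical helper written for this example, and the three file names mirror the ones on the slide.

```shell
# count_args is a hypothetical helper: it prints how many arguments it was given.
count_args() { echo "$#"; }

demo_dir=$(mktemp -d)
touch "$demo_dir/file1" "$demo_dir/file2" "$demo_dir/file3"

# Unquoted: the shell replaces * with the three file names before the call.
expanded=$(count_args "$demo_dir"/*)
# Quoted: no expansion happens, so the helper sees one literal argument.
quoted=$(count_args "$demo_dir/*")

echo "unquoted: $expanded argument(s), quoted: $quoted argument(s)"
rm -rf "$demo_dir"
```

The helper itself contains no pattern-matching logic at all; the difference comes entirely from what the shell passes in.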
Other Filename Metacharacters
? matches any single character
[abc…] matches any one of the enclosed characters. A hyphen can be used to specify a range, e.g. [a-z]
[!abc…] matches any single character not enclosed
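A short sketch of these three metacharacters, using made-up file names (cat, cot, cut, coat) in a scratch directory:

```shell
demo_dir=$(mktemp -d)
touch "$demo_dir/cat" "$demo_dir/cot" "$demo_dir/cut" "$demo_dir/coat"

single=$(cd "$demo_dir" && echo c?t)        # exactly one character between c and t
set_match=$(cd "$demo_dir" && echo c[ao]t)  # that character must be a or o
negated=$(cd "$demo_dir" && echo c[!ao]t)   # that character must not be a or o

printf '%s\n' "$single" "$set_match" "$negated"
rm -rf "$demo_dir"
```

The file coat is matched by none of the three patterns, because each of them allows only one character between c and t.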
Command substitution
The shell also supports substituting the output of a command into the command line
$ ls -l `cat filenames`
The command should be enclosed in backquotes (`)
~]$ cat filenames
temp temp2
~]$ ls -l `cat filenames`
-rw-r--r-- 1 zlizmj Domain U 6 Mar 21 03:00 temp
-rw-r--r-- 1 zlizmj Domain U 567 Mar 30 11:14 temp2
~]$ ls -l temp temp2
-rw-r--r-- 1 zlizmj Domain U 6 Mar 21 03:00 temp
-rw-r--r-- 1 zlizmj Domain U 567 Mar 30 11:14 temp2
~]$
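The session above can be reproduced in a scratch directory. The sketch below also shows the $(...) form, which is not on the slide but is the POSIX equivalent of backquotes and is easier to nest. The file contents are assumptions made for the example.

```shell
demo_dir=$(mktemp -d)
touch "$demo_dir/temp" "$demo_dir/temp2"
printf 'temp\ntemp2\n' > "$demo_dir/filenames"   # the list of names to look up

# Backquote form, as on the slide: cat's output becomes the arguments of ls.
backquote_form=$(cd "$demo_dir" && ls `cat filenames`)
# $(...) form: same substitution, written with the newer syntax.
dollar_form=$(cd "$demo_dir" && ls $(cat filenames))

printf '%s\n' "$backquote_form"
rm -rf "$demo_dir"
```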
Avoiding Shell Expansion
What happens if we actually want to pass a metacharacter to the command? (i.e. we don’t want the shell to interpret it as a metacharacter)
For example, we may have a file named temp*
The character needs to be quoted or escaped
–We can quote an argument with single quotes (') or with double quotes (")
–We escape characters with the backslash character (\)
Single or Double Quotes?
Double quotes ("): everything between " and " is taken literally, except for:
–$ : variable substitution will occur
–` : command substitution will occur
–\ : escapes a following $, `, " or \
–" : marks the end of the double-quoted string
–' : doesn’t have special meaning
Single quotes ('): everything between ' and ' is taken literally; the next ' always ends the string.
–You cannot embed another ' within such a quoted string, not even with a backslash. Close the quotes, add an escaped \', and reopen them.
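The difference between the two kinds of quotes is easiest to see side by side. A minimal sketch, where the variable name and the strings are made up for illustration:

```shell
name=Mauro

double="Hello $name"                # $name is substituted inside double quotes
single='Hello $name'                # inside single quotes it stays literal
with_cmd="Today is `echo Monday`"   # command substitution also happens in double quotes
apostrophe="I'm $name"              # ' has no special meaning inside double quotes

printf '%s\n' "$double" "$single" "$with_cmd" "$apostrophe"
```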
Escaping a Character
The character following a backslash \ is taken literally.
$ echo I\'m Mauro
I'm Mauro
$
Use \ within " " to escape ", $, ` and \ when necessary. Within ' ' the backslash has no special meaning.
How to escape \? With another backslash: \\
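The escaping rules can be checked with a few assignments. This is a sketch with made-up strings; printf is used for output because some versions of echo interpret backslashes themselves.

```shell
greeting=$(echo I\'m Mauro)                   # the slide's example: \' is a literal apostrophe
in_double="Price: \$5, said \"Mauro\""        # \ escapes $ and " inside double quotes
backslash=\\                                  # a backslash escapes a backslash
in_single='It'\''s literal: $HOME'            # close the quotes, add \', reopen them

printf '%s\n' "$greeting" "$in_double" "$backslash" "$in_single"
```

The last line shows the usual workaround for placing an apostrophe in single-quoted text, since no escape works inside the quotes themselves.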
Regular Expressions
Also called regex
For describing a set of strings using a pattern
–Follows a set of rules
–Used for finding occurrences of strings in files
Contain normal characters mixed with special characters (called metacharacters)
These metacharacters are NOT the same as shell metacharacters which are used for filename expansion!
Regular Expressions
Regular expressions must be put inside quotes, otherwise the shell will interpret metacharacters for filename expansion
E.g.:
–grep '[Ff]red' myfile.txt
–Searches the file myfile.txt for lines containing either Fred or fred
Fixed Patterns vs. Regular Expressions
To search a file for the word computer:
–grep computer myfile.txt
–Will only match lines containing the exact string computer
–A fixed pattern, not a regular expression
Suppose we want to find occurrences (including potential misspellings) of:
–computer, computor, Computer, Computor, Computers, and so on…
–grep '[cC]omput[eo]rs*' myfile.txt
–Uses a regular expression
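To compare the two searches, the sketch below runs both against a small sample file. The file contents are invented for the example, and grep -c is used to count matching lines instead of printing them.

```shell
sample=$(mktemp)
printf '%s\n' 'My computer is old' 'Two Computors arrived' \
              'No match here' 'computers are fast' > "$sample"

# Fixed pattern: only lines containing the exact string computer.
fixed_count=$(grep -c computer "$sample")
# Regular expression: also accepts a capital C, the -or spelling, and a plural s.
regex_count=$(grep -c '[cC]omput[eo]rs*' "$sample")

echo "fixed: $fixed_count line(s), regex: $regex_count line(s)"
rm -f "$sample"
```

The fixed pattern misses the line with Computors, which the regular expression picks up.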
Three versions of grep
grep: supports the most common metacharacters.
egrep: (extended grep) supports an extended set of metacharacters. It’s more expressive but may be slower. Equivalent to grep -E.
fgrep: (fast grep, also read as fixed-string grep) doesn’t support metacharacters. It’s less expressive but faster. Equivalent to grep -F.
Regex Metacharacters
.    Matches any single character except newline. E.g. c.t matches cat, cbt, cct …
[ ]  Matches one character between [ and ]. E.g. [abc] matches a, b or c
-    Indicates a range (inside [ ]). E.g. [a-z] matches any one character from a to z
*    Matches zero or more occurrences of the preceding character. E.g. 12* matches 1, 12, 122, 1222 …
+    Matches one or more occurrences of the preceding character. NOTE: for use with egrep. E.g. 12+ matches 12, 122, 1222 …
?    Matches zero or one occurrence of the preceding character. NOTE: for use with egrep. E.g. 12? matches 1 and 12
\    Treats the next character literally. E.g. \* will match the character * and NOT the metacharacter *
^    Matches the start of the line. E.g. ^Fred will match only lines that have the word Fred at the start of the line
$    Matches the end of the line. E.g. Fred$ will match only lines that have the word Fred at the end of the line
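The basic metacharacters in the table can be exercised with plain grep. This is a sketch on an invented sample file; each count is the number of matching lines.

```shell
sample=$(mktemp)
printf '%s\n' 'Fred went home' 'I saw Fred' 'cat' 'cbt' 'ct' \
              '1' '122' 'a*b' > "$sample"

dot=$(grep -c 'c.t' "$sample")         # cat and cbt; ct has nothing between c and t
star=$(grep -c '^12*$' "$sample")      # whole lines made of 1 followed by any number of 2s
literal=$(grep -c 'a\*b' "$sample")    # \* is a literal asterisk: only the line a*b
at_start=$(grep -c '^Fred' "$sample")  # Fred at the start of the line
at_end=$(grep -c 'Fred$' "$sample")    # Fred at the end of the line

echo "dot=$dot star=$star literal=$literal start=$at_start end=$at_end"
rm -f "$sample"
```

Every pattern is wrapped in single quotes so that the shell passes it to grep untouched.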
grep Revisited
Used to search a file for a pattern (remember STDIN, STDOUT, etc. are also treated as files in UNIX)
cat myfile.txt | grep "chocolate"
who | grep zlizmj
grep 'pingu' penguinNames.txt
grep '[Ww]ib*le' wobble.txt
egrep
Extended grep. Slower but greater functionality
Includes additional metacharacters, e.g.:
–+ matches one or more of its preceding character. E.g. abc+ means abc, abcc, abccc, …
–? matches zero or one of its preceding character. E.g. abc? means ab or abc
–| an alternative. E.g. A|B means A or B
egrep Example
egrep '(bio|geo)logy' subjects.txt
–will search the file subjects.txt for all lines that contain the words biology or geology
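The extended metacharacters can be tried the same way. The sketch below uses grep -E, the standard equivalent of egrep, because some current systems print a warning for the egrep name. The file contents are invented for the example.

```shell
subjects=$(mktemp)
printf '%s\n' biology geology zoology abccc abcc abc ab > "$subjects"

alt=$(grep -E -c '(bio|geo)logy' "$subjects")   # biology or geology, not zoology
plus=$(grep -E -c '^abc+$' "$subjects")         # ab followed by one or more c
opt=$(grep -E -c '^abc?$' "$subjects")          # ab followed by at most one c

echo "alternation=$alt plus=$plus optional=$opt"
rm -f "$subjects"
```

The ^ and $ anchors are added here so that abc+ and abc? are tested against whole lines, not substrings.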
fgrep
Fast grep
Does not use regular expressions
–Used for matching an exact string, not a pattern
–$, *, [, ^, |, (, ), and \ are interpreted literally
–(but still have special meaning to the shell)
–Enclose entire string in quotes
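A small sketch of the difference between a fixed-string search and a regular expression search for the same text. It uses grep -F, the standard equivalent of fgrep, and an invented two-line file.

```shell
sample=$(mktemp)
printf '%s\n' 'a*b' 'aab' > "$sample"

# Fixed string: * is an ordinary character, so only the line a*b matches.
fixed=$(grep -F -c 'a*b' "$sample")
# Regular expression: a* means zero or more a, so both lines contain a match.
regex=$(grep -c 'a*b' "$sample")

echo "fixed: $fixed line(s), regex: $regex line(s)"
rm -f "$sample"
```

The quotes are still needed in both cases, because * would otherwise be expanded by the shell before grep runs.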
Summary
The shell performs filename expansion and command substitution.
Shell metacharacters are not the same as regular expressions!
Regular expressions allow us to search for a pattern in a file
Commands used for searching:
–grep
–egrep
–fgrep (does not use regular expressions)