grep & regular expression


1 grep & regular expression
CSRU3130, Spring 2011 Xiaolan Zhang

2 Outline
grep
regular expression
egrep, fgrep
cut, paste, comp, uniq
sort

3 Filter programs
Filter: a program that takes input, transforms the input, and produces output.
Default: input=stdin, output=stdout
Examples: grep, sed, awk
Typical use:
  $ program pattern_action filenames
The program scans the files, looking for lines matching the pattern, performing the action on matching lines, and printing each transformed line.

4 grep family for pattern matching
grep comes from the search command of ed (the Unix text editor): “global regular expression print”, or g/re/p. It was so useful that it was written as a standalone utility. Two other variants, egrep and fgrep, complete the grep family:
grep - uses regular expressions for pattern matching
fgrep - fixed-string (fast, file) grep; does not use regular expressions, only matches fixed strings, but can get search strings from a file
egrep - extended grep; uses a more powerful set of regular expressions but traditionally does not support backreferences; generally the fastest member of the grep family
Current POSIX and GNU grep provide the same behavior as grep -E (egrep) and grep -F (fgrep).

5 grep Syntax
Syntax
  grep [-hilnv] [-e expression] [filename], or
  grep [-hilnv] expression [filename]
Options
  -h  Do not display filenames
  -i  Ignore case
  -l  List only filenames containing matching lines
  -n  Precede each matching line with its line number
  -v  Negate matches
  -x  Match whole line only (historically fgrep only; POSIX grep accepts it in every mode)
  -e expression  Specify expression as option
  -f filename  Take the regular expression (egrep) or a list of strings (fgrep) from filename
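The options above can be tried on a small file. The following is a minimal sketch; the file name menu.txt and its contents are made up for illustration and are not part of the original slides.

```shell
#!/bin/sh
# Hypothetical sample data (not from the slides).
tmpdir=$(mktemp -d)
printf 'Apple pie\nbanana split\napple tart\ncherry cake\n' > "$tmpdir/menu.txt"

# -i: ignore case, so both "Apple" and "apple" match
ci_matches=$(grep -i 'apple' "$tmpdir/menu.txt")
# -n: prefix each matching line with its line number
numbered=$(grep -n 'apple' "$tmpdir/menu.txt")
# -v: print the lines that do NOT match
inverted=$(grep -v -i 'apple' "$tmpdir/menu.txt")
# -l: print only the names of files containing a match
files=$(cd "$tmpdir" && grep -l 'cherry' menu.txt)

printf '%s\n' "$ci_matches" "$numbered" "$inverted" "$files"
rm -rf "$tmpdir"
```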

6 A quick exercise How many users on storm have the same first name or last name as you? So far we have used an exact string as the pattern; we can also specify the pattern as a regular expression. How many users have no password? How many users are student accounts?

7 What Is a Regular Expression?
A regular expression (regex) describes a set of possible input strings, i.e., a pattern. Languages accepted by finite state machines (FSMs) are regular languages; that is, a language is regular if there is some FSM that accepts it. Regular expressions describe regular languages, e.g., ab*: all words that start with a, followed by zero or more b. Regular expressions are endemic to Unix: vi, ed, sed, emacs, awk, tcl, perl, Python, grep, egrep, fgrep, compilers.

8 Deterministic Finite-State Automata
An FSA is a quintuple (Σ, Q, qinit, F, δM):
Σ is the input alphabet
Q is a finite, nonempty set of states
qinit ∈ Q is the initial state or start state
F ⊆ Q is a (possibly empty) set of accepting or terminal states
δM: Q × Σ → Q is the transition function
Useful for: describing how protocols work, …, defining a language

9 Accepting words A deterministic finite-state automaton M accepts a word w ∈ Σ* if the unique path starting at qinit and labeled by w leads to some member of F, i.e., to some accepting state of M. Does this FSA accept the following words? ba, abbb, abbab
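The FSA diagram for this slide is not part of the transcript, so the sketch below simulates an assumed automaton instead: a two-state DFA for the language ab* mentioned earlier, with any missing transition treated as a move to a non-accepting dead state. The function name and state names are hypothetical.

```shell
#!/bin/sh
# Assumed DFA for ab*: q0 --a--> q1, q1 --b--> q1, accepting state q1.
accepts_ab_star() {
    word=$1
    state=q0
    while [ -n "$word" ]; do
        rest=${word#?}          # word without its first character
        ch=${word%"$rest"}      # the first character itself
        case "$state:$ch" in
            q0:a) state=q1 ;;
            q1:b) state=q1 ;;
            *)    state=dead ;; # no transition defined: reject
        esac
        word=$rest
    done
    [ "$state" = q1 ]
}

for w in ba abbb abbab; do
    if accepts_ab_star "$w"; then echo "$w: accepted"; else echo "$w: rejected"; fi
done
```

For this assumed automaton the result agrees with grep -x 'ab*' applied to each word.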

10 FSM and Regular Expression
A language is regular if there is some FSM that accepts it. Some languages are not regular. Example:

11 grep regex operators(special chars)
^ (Caret) = match expression at the start of a line, as in ^A.
$ (Dollar) = match expression at the end of a line, as in A$.
\ (Backslash) = turn off the special meaning of the next character, as in \^.
[ ] (Brackets) = match any one of the enclosed characters, as in [aeiou]. Use a hyphen "-" for a range, as in [0-9].
[^ ] = match any one character except those enclosed in [ ], as in [^0-9].
. (Period) = match a single character of any value, except end of line.
* (Asterisk) = match zero or more of the preceding character or expression.
\{x,y\} = match x to y occurrences of the preceding.
\{x\} = match exactly x occurrences of the preceding.
\{x,\} = match x or more occurrences of the preceding.
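A minimal sketch of several of these operators in action; the sample file and its contents are invented for illustration.

```shell
#!/bin/sh
# Hypothetical sample data (not from the slides).
tmpdir=$(mktemp -d)
printf 'Alpha\nbeta 42\nA1\ngamma.\n7777\n' > "$tmpdir/sample.txt"

starts_with_A=$(grep '^A' "$tmpdir/sample.txt")             # ^ anchors at line start
ends_with_dot=$(grep '\.$' "$tmpdir/sample.txt")            # \. is a literal period, $ anchors at line end
has_digit=$(grep '[0-9]' "$tmpdir/sample.txt")              # bracket range
digits_2_to_4=$(grep '^[0-9]\{2,4\}$' "$tmpdir/sample.txt") # interval: 2 to 4 digits, whole line

printf '%s\n' "$starts_with_A" "$ends_with_dot" "$has_digit" "$digits_2_to_4"
rm -rf "$tmpdir"
```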

12 Regular expression: a string of literal characters to match
Regular expression: cks
UNIX Tools rocks. (match)
UNIX Tools sucks. (match)
UNIX Tools is okay. (no match)

13 Protecting Regex Metacharacters
Some regex operators (special chars) also have special meaning for bash: globbing (filename expansion) and variable reference. E.g.:
  grep e.* .bash_profile   ## suppose files e.txt and e.trace.txt exist under the current dir
  Actual command executed is: grep e.trace.txt e.txt .bash_profile
  grep $PATH file   ## $PATH will be replaced by the value of PATH…
Solution: always single-quote regexes so the shell won't interpret special characters:
  grep 'e.*' .bash_profile
Double quotes differ from single quotes: they allow variable substitution, whereas single quotes do not.
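One way to see what the shell does to an unquoted pattern is to put echo in front of the command. This sketch assumes two files named e.txt and e.trace.txt (hypothetical names chosen so that the glob e.* matches them) in a scratch directory.

```shell
#!/bin/sh
tmpdir=$(mktemp -d)
touch "$tmpdir/e.txt" "$tmpdir/e.trace.txt"

# Unquoted: the shell expands e.* to the matching file names before grep ever runs.
unquoted=$(cd "$tmpdir" && echo grep e.* .bash_profile)
# Single-quoted: the pattern reaches grep unchanged.
quoted=$(cd "$tmpdir" && echo grep 'e.*' .bash_profile)

echo "$unquoted"
echo "$quoted"
rm -rf "$tmpdir"
```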

14 Escaping Special Characters
To match a special character literally, escape it with \ (backslash). E.g., to search for the character sequence 'a*b*':
  grep 'a*b*'   ## matches zero or more 'a's followed by zero or more 'b's, not what we want
  grep 'a\*b\*'   ## the asterisks are treated as regular characters
A hyphen used as the first char of a pattern must be kept from being read as an option, by escaping it or by passing the pattern with -e:
  ls -l | grep '\-rwxrwxrwx'   ## list all regular files that are readable, writable and executable by all
  ls -l | grep -e '-rwxrwxrwx'
To look for references to the shell variable SHELL in a file:
  grep '\$SHELL' file.txt
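A small sketch of the escaping rules, run against an invented file. It uses -e for the pattern that starts with a hyphen, since -e is a standard grep option that works regardless of how a particular grep treats \-.

```shell
#!/bin/sh
# Hypothetical sample data (not from the slides).
tmpdir=$(mktemp -d)
printf 'a*b*\naabb\nplain\n-rwxrwxrwx file\n$SHELL is set\n' > "$tmpdir/escape.txt"

# Unescaped: a*b* can match the empty string, so every line matches.
unescaped_count=$(grep -c 'a*b*' "$tmpdir/escape.txt")
# Escaped: only the literal text a*b* matches.
literal=$(grep 'a\*b\*' "$tmpdir/escape.txt")
# Pattern starting with a hyphen, passed with -e so it is not parsed as an option.
dash=$(grep -e '-rwxrwxrwx' "$tmpdir/escape.txt")
# \$ is a literal dollar sign; single quotes keep the shell from expanding $SHELL.
dollar=$(grep '\$SHELL' "$tmpdir/escape.txt")

printf '%s\n' "$unescaped_count" "$literal" "$dash" "$dollar"
rm -rf "$tmpdir"
```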

15 Regex special char: Period (.)
Period . in a regex matches any single character.
  grep 'o.' file.txt
How to list files whose names are 5 characters long?
  ls | grep '.....'   ## actually lists files with names 5 or more chars long. Why?
How to list normal files that are executable by their owners?
  ls -l | grep '^-..x'
Example: regular expression o. applied to the input line "For me to poop on." (the slide highlights two matches, match 1 and match 2).

16 Character Classes
Character classes [] can be used to match any character from a specific set of characters.
[aeiou] will match any of the characters a, e, i, o, or u
[kK]orn will match korn or Korn
Ranges can be specified in character classes:
[1-9] is the same as [123456789]
[abcde] is equivalent to [a-e]
You can also combine multiple ranges:
[abcde123456789] is equivalent to [a-e1-9]
Note: - has a special meaning in a character class only if it is used within a range; [-123] would match the characters -, 1, 2, or 3

17 Character Classes (cont’d)
Character classes can be negated with the [^ ] syntax:
[^0-9]   ## match any non-digit char
[^aeiou]   ## match any char other than a, e, i, o, u
Commonly used character classes can be referred to by name (alpha, lower, upper, alnum, digit, punct, cntrl). Syntax: [:name:]
[a-zA-Z]   [[:alpha:]]
[a-zA-Z0-9]   [[:alnum:]]
[45a-z]   [45[:lower:]]
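The bracket forms from the two character-class slides can be checked against a few invented lines, as in this sketch.

```shell
#!/bin/sh
# Hypothetical sample data (not from the slides).
tmpdir=$(mktemp -d)
printf 'korn\nKorn\nborn\nx-ray\n2024\n' > "$tmpdir/classes.txt"

korn=$(grep '[kK]orn' "$tmpdir/classes.txt")                              # either case of k
all_digits=$(grep '^[[:digit:]][[:digit:]]*$' "$tmpdir/classes.txt")      # named class, one or more digits
non_alpha=$(grep '[^[:alpha:]]' "$tmpdir/classes.txt")                    # negated class: any non-letter
dash_or_digit=$(grep '[-0-9]' "$tmpdir/classes.txt")                      # leading - is literal

printf '%s\n' "$korn" "$all_digits" "$non_alpha" "$dash_or_digit"
rm -rf "$tmpdir"
```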

18 Anchors Anchors are used to match at the beginning or end of a line (or both).
^ means beginning of the line
$ means end of the line
To display directories only:
  ls -l | grep '^d'   ## list all lines that start with the letter d
To display all lines that end with a period:
  grep '\.$' .bash_profile   ## lines that end with .

19 Exercise To display all empty lines
  grep '^$' .bash_profile   ## empty lines
How to list files whose names are 5 characters long?
  ls | grep '^.....$'   ## now it's right
Find all executable files under the current directory?
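The difference between the unanchored and the anchored five-dot pattern can be seen on a scratch directory. The file names below are made up for the demonstration.

```shell
#!/bin/sh
tmpdir=$(mktemp -d)
touch "$tmpdir/abc" "$tmpdir/abcde" "$tmpdir/abcdefg" "$tmpdir/12345"

# Unanchored: any name containing at least five characters matches.
loose_count=$(ls "$tmpdir" | grep -c '.....')
# Anchored at both ends: exactly five characters.
strict=$(ls "$tmpdir" | grep '^.....$' | sort)
strict_count=$(printf '%s\n' "$strict" | wc -l | tr -d ' ')

echo "unanchored: $loose_count, anchored: $strict_count"
rm -rf "$tmpdir"
```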

20 Repetition Asterisk *: match zero or more occurrences of the character or character class preceding it.
x*   ## matches zero or more x
grep 'x*' .bash_profile   ## displays all lines, as all lines have zero or more x
[a-zA-Z]*
abc*   ## matches ab, abc, abccc, …
.*x   ## matches anything up to and including the last x on the line

21 egrep regex egrep - extended grep, uses a more powerful set of regular expressions than grep.

22 egrep:Repetition Ranges
In grep regex, the asterisk (*) matches 0 or more occurrences of the character preceding it. In egrep regex, you can also specify a range for the number of occurrences: the { } notation specifies a range of repetitions for the immediately preceding regex.
{n} means exactly n occurrences
{n,} means at least n occurrences
{n,m} means at least n occurrences but no more than m occurrences
Example:
.{0,} same as .*
a{2,} same as aaa*
.{6} same as ......

23 egrep: Subexpressions
( ) groups part of an expression into a sub-expression. Sub-expressions are treated like a single character: * or { } can be applied to them. Example:
a* matches 0 or more occurrences of a
abc* matches ab, abc, abcc, abccc, …
(abc)* matches the empty string, abc, abcabc, abcabcabc, …
(abc){2,3} matches abcabc or abcabcabc

24 egrep: Alternation The alternation character | matches one or another sub-expression.
(T|Fl)an will match ‘Tan’ or ‘Flan’
^(From|Subject): will match lines starting with From or Subject, followed by a :
Sub-expressions are used to limit the scope of alternation: At(ten|nine)tion matches “Attention” or “Atninetion”, not “Atten” or “ninetion” as would happen without the parentheses (Atten|ninetion).

25 Egrep: Repetition Shorthands
Asterisk * zero or more occurrences of the immediately preceding character + (plus) means “one or more” abc+d will match ‘abcd’, ‘abccd’, or ‘abccccccd’ but will not match ‘abd’ Equivalent to {1,} ‘?’ (question mark) specifies an optional character, the single character that immediately precedes it July? will match ‘Jul’ or ‘July’ Equivalent to {0,1}

26 Egrep: Repetition Shorthands
The *, ?, and + are known as quantifiers because they specify the quantity of a match Quantifiers can also be used with subexpressions (a*c)+ will match ‘c’, ‘ac’, ‘aac’ or ‘aacaacac’ but will not match ‘a’ or a blank line
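The examples from the egrep slides (repetition ranges, sub-expressions, alternation, + and ?) can be checked with a small helper. This is a sketch: the helper name matches is hypothetical, and it uses grep -E, the standard spelling of egrep, with -x so that the whole string has to match.

```shell
#!/bin/sh
# matches REGEX STRING: prints yes if the whole STRING matches the extended REGEX.
matches() {
    if printf '%s\n' "$2" | grep -E -q -x -e "$1"; then echo yes; else echo no; fi
}

echo "(abc){2,3} vs abcabc:  $(matches '(abc){2,3}' abcabc)"
echo "(abc){2,3} vs abc:     $(matches '(abc){2,3}' abc)"
echo "(T|Fl)an vs Flan:      $(matches '(T|Fl)an' Flan)"
echo "abc+d vs abccd:        $(matches 'abc+d' abccd)"
echo "abc+d vs abd:          $(matches 'abc+d' abd)"
echo "July? vs Jul:          $(matches 'July?' Jul)"
echo "(a*c)+ vs aacaacac:    $(matches '(a*c)+' aacaacac)"
echo "(a*c)+ vs a:           $(matches '(a*c)+' a)"
```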

27 egrep Examples Find all lines with signed numbers
  $ egrep '[-+][0-9]+\.?[0-9]*' *.c
  bsearch.c: return -1;
  compile.c: strchr("+1-2*3", t->op)[1] - '0', dst,
  convert.c: Print integers in a given base 2-16 (default 10)
  convert.c: sscanf(argv[i+1], "%d", &base);
  strcmp.c: return -1;
  strcmp.c: return +1;

28 grep: Backreferences Sometimes it is handy to be able to refer to a match that was made earlier in a regex, e.g., to find lines starting and ending with the same word. This is done using backreferences.
Use \( and \) to mark a sub-expression that we want to back-reference.
Use \n (backreference specifier) to refer to the n-th marked sub-expression.
Ex: to search for lines that start with the same character twice:
  grep '^\(.\)\1' file.txt

29 Back-references Recall /etc/passwd stores info. about user account
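  ~]$ head /etc/passwd
  root:x:0:0:root:/root:/bin/bash
  bin:x:1:1:bin:/bin:/sbin/nologin
To find accounts whose uid is the same as the group id (the trailing : keeps a uid such as 10 from matching the first digits of a gid such as 100):
  grep '^[^:]*:[^:]*:\([0-9]*\):\1:' /etc/passwd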
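The uid/gid comparison can be tried without touching the real /etc/passwd. The sketch below uses an invented passwd-style file; the accounts alice and bob are hypothetical. It also shows why a colon after \1 matters: without it, a uid of 10 is accepted as a prefix of a gid of 100.

```shell
#!/bin/sh
tmpdir=$(mktemp -d)
# Hypothetical passwd-style sample (name:password:uid:gid:...).
cat > "$tmpdir/passwd.sample" <<'EOF'
root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin
alice:x:1000:100:Alice:/home/alice:/bin/bash
bob:x:10:100:Bob:/home/bob:/bin/bash
EOF

# Without a colon after \1, bob (uid 10, gid 100) is a false positive.
loose=$(grep '^[^:]*:[^:]*:\([0-9]*\):\1' "$tmpdir/passwd.sample" | cut -d: -f1)
# With the colon, the gid field must equal the uid exactly.
strict=$(grep '^[^:]*:[^:]*:\([0-9]*\):\1:' "$tmpdir/passwd.sample" | cut -d: -f1)

echo "without trailing colon:" $loose
echo "with trailing colon:   " $strict
rm -rf "$tmpdir"
```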

30 Exercise with back-references
Note: one regex can have multiple backreferences.
Find five-letter palindromes in wordlist:
  grep '^\(.\)\(.\).\2\1$' wordlist
One can download a list of English words online for spell-checking purposes.
egrep as a simple spelling checker: specify the plausible alternatives you know:
  egrep "n(ie|ei)ther" wordlist
  neither
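A sketch of the palindrome search on a tiny invented word list. Backreferences in grep are written \1 and \2, and the ^ and $ anchors restrict the match to words of exactly five letters.

```shell
#!/bin/sh
tmpdir=$(mktemp -d)
# Hypothetical word list (not the course's wordlist file).
printf 'level\nradar\nhello\nrefer\nlevels\nmadam\n' > "$tmpdir/words.txt"

# \1 and \2 repeat the first and second captured letters in reverse order.
palindromes=$(grep '^\(.\)\(.\).\2\1$' "$tmpdir/words.txt")

printf '%s\n' "$palindromes"
rm -rf "$tmpdir"
```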

31 A good help with Crossword
How many words have 3 a's one letter apart?
  egrep a.a.a wordlist | wc -l
  54
  egrep u.u.u wordlist
  cumulus
Words of 7 letters that start with g, whose 4th letter is a and 7th letter is h:
  egrep '^g..a..h$' wordlist

32 Practical Regex Examples
Variable names in C: [a-zA-Z_][a-zA-Z_0-9]*
Dollar amount with optional cents: \$[0-9]+(\.[0-9][0-9])?
Time of day: (1[012]|[1-9]):[0-5][0-9] (am|pm)
HTML headers <h1> <H1> <h2> …: <[hH][1-4]>
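These patterns can be exercised with whole-string matching. The sketch below reuses a hypothetical helper named matches built on grep -E -x; the test strings are invented.

```shell
#!/bin/sh
# matches REGEX STRING: prints yes if the whole STRING matches the extended REGEX.
matches() {
    if printf '%s\n' "$2" | grep -E -q -x -e "$1"; then echo yes; else echo no; fi
}

c_ident='[a-zA-Z_][a-zA-Z_0-9]*'
dollars='\$[0-9]+(\.[0-9][0-9])?'
time_of_day='(1[012]|[1-9]):[0-5][0-9] (am|pm)'
html_header='<[hH][1-4]>'

echo "identifier _tmp1:   $(matches "$c_ident" _tmp1)"
echo "identifier 9lives:  $(matches "$c_ident" 9lives)"
echo "amount \$12.50:      $(matches "$dollars" '$12.50')"
echo "amount \$12.5:       $(matches "$dollars" '$12.5')"
echo "time 12:59 pm:      $(matches "$time_of_day" '12:59 pm')"
echo "time 13:00 pm:      $(matches "$time_of_day" '13:00 pm')"
echo "header <H2>:        $(matches "$html_header" '<H2>')"
echo "header <h5>:        $(matches "$html_header" '<h5>')"
```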

33 Quick Reference
Example: input line "This is one line of text", regular expression o.*o
[Quick-reference table of which operators are supported by fgrep, grep, and egrep]

34 Examples Interesting examples of grep commands
To search for lines that have no digit character:
  grep -v '[0-9]' filename
Look for users with uid=0 (root permission):
  grep '^[^:]*:[^:]*:0:' /etc/passwd
To search for users without passwords:
  grep '^[^:]*::' /etc/passwd
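A sketch of these three searches against invented data rather than the real /etc/passwd; the accounts guest and toor are hypothetical. It also contrasts -v '[0-9]' with a pattern anchored as '^[0-9]*$', which only excludes lines consisting entirely of digits.

```shell
#!/bin/sh
tmpdir=$(mktemp -d)
printf 'abc\n123\na1\nxyz\n' > "$tmpdir/lines.txt"
cat > "$tmpdir/passwd.sample" <<'EOF'
root:x:0:0:root:/root:/bin/bash
guest::500:500:Guest:/home/guest:/bin/bash
toor:x:0:0:alt root:/root:/bin/sh
EOF

no_digit=$(grep -v '[0-9]' "$tmpdir/lines.txt")           # lines containing no digit at all
not_all_digits=$(grep -v '^[0-9]*$' "$tmpdir/lines.txt")  # lines that are not purely digits
uid_zero=$(grep '^[^:]*:[^:]*:0:' "$tmpdir/passwd.sample" | cut -d: -f1)
no_password=$(grep '^[^:]*::' "$tmpdir/passwd.sample" | cut -d: -f1)

printf '%s\n' "$no_digit" "$not_all_digits" "$uid_zero" "$no_password"
rm -rf "$tmpdir"
```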

35 Specify pattern in files
-f option: useful for complicated patterns; there is also no need to worry about shell interpretation. Example:
  $ cat alphvowels
  ^[^aeiou]*a[^aeiou]*e[^aeiou]*i[^aeiou]*o[^aeiou]*u[^aeiou]*$
  $ egrep -f alphvowels /usr/share/dict/words
  abstemious
  ...
  tragedious
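The same pattern file can be tried on a few words piped in on standard input, which avoids depending on /usr/share/dict/words being installed. This is a sketch with an invented word list; grep -E stands in for egrep.

```shell
#!/bin/sh
tmpdir=$(mktemp -d)
# The pattern goes into a file, so no shell quoting is needed when grep reads it.
printf '%s\n' '^[^aeiou]*a[^aeiou]*e[^aeiou]*i[^aeiou]*o[^aeiou]*u[^aeiou]*$' > "$tmpdir/alphvowels"

# Words whose vowels are exactly a, e, i, o, u, in that order.
ordered=$(printf 'abstemious\nfacetious\neducation\nsequoia\n' | grep -E -f "$tmpdir/alphvowels")

printf '%s\n' "$ordered"
rm -rf "$tmpdir"
```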

36 Command cut cut displays selected columns or fields from each line of a file.
Delimiter-based cut: the most typical usage of the cut command is cutting one of several columns from a file (often a log file) to create a new file. For example:
  cut -d ' ' -f 2-7
retrieves the second to seventh fields, assuming that each field is separated by a single (note: single) blank. Fields are counted starting from one.
Character column cut:
  cut -c 4,5,20 foo   # prints characters 4, 5, and 20 of each line of foo
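Both modes of cut are shown in this minimal sketch on invented input.

```shell
#!/bin/sh
tmpdir=$(mktemp -d)
# Hypothetical sample data (not from the slides).
printf 'one two three four\nalpha beta gamma delta\n' > "$tmpdir/fields.txt"

# Delimiter-based: fields 2 to 3, with a single blank as the separator.
fields=$(cut -d ' ' -f 2-3 "$tmpdir/fields.txt")
# Character-based: characters 1 and 3 of each line.
chars=$(cut -c 1,3 "$tmpdir/fields.txt")
# A different delimiter: login name and shell from a passwd-style line.
login_shell=$(printf 'root:x:0:0:root:/root:/bin/bash\n' | cut -d: -f1,7)

printf '%s\n' "$fields" "$chars" "$login_shell"
rm -rf "$tmpdir"
```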

37 Command paste Useful for merging two files together
Suppose f1 stores world population info and f2 stores world economy (GDP) info. To generate a file that merges the two files' corresponding lines:
  paste f1 f2 > pop_GDP
To make sure the info for the same country is merged, sort the files first (if the same set of countries is listed in both files, this solves the problem). Otherwise…
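A sketch of the sort-then-paste idea. The country names and numbers are fictional placeholders, and both files are assumed to list the same set of countries.

```shell
#!/bin/sh
tmpdir=$(mktemp -d)
# Fictional placeholder data, deliberately in different orders.
printf 'Freedonia 12\nAtlantis 10\n' > "$tmpdir/f1"    # population
printf 'Atlantis 300\nFreedonia 450\n' > "$tmpdir/f2"  # GDP

# Sort both files so that corresponding lines refer to the same country.
sort "$tmpdir/f1" > "$tmpdir/f1.sorted"
sort "$tmpdir/f2" > "$tmpdir/f2.sorted"

# paste joins line i of each file with a tab in between.
merged=$(paste "$tmpdir/f1.sorted" "$tmpdir/f2.sorted")

printf '%s\n' "$merged"
rm -rf "$tmpdir"
```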

38 Command tr tr - translate, squeeze, and/or delete characters from standard input, writing to standard output.
  cat file | tr 'a-z' 'A-Z'   ## translate all lowercase letters to upper case
  cat file | tr -sc A-Za-z '\n'   ## replace each run of non-letter characters with a single newline
  ## -c: complement
  ## -s: squeeze
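A quick sketch of both tr uses on invented input; the sets are quoted so the shell cannot treat them as glob patterns.

```shell
#!/bin/sh
# Translate lowercase letters to upper case.
upper=$(printf 'hello world\n' | tr 'a-z' 'A-Z')

# -c complements the first set (everything that is NOT a letter),
# -s squeezes each run of replacement newlines into one: one word per line.
words=$(printf 'one, two;  three!\n' | tr -sc 'A-Za-z' '\n')

printf '%s\n' "$upper" "$words"
```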

39 Command tr and uniq uniq: report or omit repeated lines
-c: precede each unique line with the number of occurrences. uniq only collapses adjacent duplicates, so sort the input first. Useful for word counts.

40 wf (word frequency) Ex: get a word frequency count on a set of files given on the command line. (No file names means that standard input is used.)
  #!/bin/bash
  cat "$@" | tr -sc A-Za-z '\012' | tr A-Z a-z | sort | uniq -c | sort -nr -k 1
The final sort -nr -k 1 lists words (and counts) from most frequent to least frequent, rather than alphabetically. What is being generated at the second command?
* Command tee can be inserted into the pipeline to save a stream of input/output into a file.
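The pipeline can be wrapped in a function and run on a short invented sentence to see each stage's effect. The function name wf follows the slide title; the use of awk to pick out the first row is an addition for the demonstration.

```shell
#!/bin/sh
# wf [FILE...]: word frequency count; reads standard input when no files are given.
wf() {
    cat "$@" |
        tr -sc 'A-Za-z' '\n' |   # one word per line
        tr 'A-Z' 'a-z' |         # fold to lower case
        sort |                   # bring identical words together
        uniq -c |                # count each distinct word
        sort -nr -k 1            # most frequent first
}

report=$(printf 'The cat and the hat and the bat\n' | wf)
top_word=$(printf '%s\n' "$report" | head -n 1 | awk '{print $2}')
top_count=$(printf '%s\n' "$report" | head -n 1 | awk '{print $1}')

printf '%s\n' "$report"
echo "most frequent: $top_word ($top_count)"
```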

41 Command tee tee - read from standard input and write to standard output and files.
  tee [OPTION]... [FILE]...
Copies standard input to each FILE, and also to standard output.
Option: -a, --append   append to the given FILEs, do not overwrite
Useful for inserting into pipes for testing, and for storing intermediate results.

42 Capture intermediate result in file
  #!/bin/bash
  cat "$@" | tr -sc A-Za-z '\012' | tr A-Z a-z | sort | tee aftersort | uniq -c | sort -nr -k 1
For example: the added tee aftersort stage stores the output of the sort command in the file aftersort and feeds it to the next command in the pipeline (uniq)…

43 Usage of tee In a shell script, you sometimes need to process standard input more than once, e.g., count the number of lines and then search for a pattern:
  #!/bin/bash
  # usage: tee_ex pattern
  echo Number of lines `wc -l`
  echo Searching for $1
  grep $1
Problem: standard input to the script (which might be redirected from a file or pipe) is consumed by wc, the first command in the script that reads standard input. The subsequent command (grep here) gets nothing.

44 tee to the rescue
  #!/bin/bash
  # Usage: tee_ex pattern
  echo Number of lines `tee tmp | wc -l`
  echo Searching for $1
  grep "$1" tmp
  rm tmp
Use tee to save a copy of standard input to the file tmp while also copying standard input to standard output, i.e., feeding it into the pipe to wc.

45 Another solution
  #!/bin/bash
  # Usage: tee_ex pattern
  # save standard input to file for later processing
  cat > tmpfile
  echo Number of lines `wc -l < tmpfile`
  echo Searching for $1
  grep "$1" tmpfile
  rm tmpfile   ## always clean up temporary files created
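The tee approach can be packaged as a function so that it is easy to test with piped input. This is a sketch: the function name count_and_search is hypothetical, and it uses mktemp for the temporary file instead of a fixed name so that concurrent runs do not clash.

```shell
#!/bin/sh
# count_and_search PATTERN: reads standard input once, reports the line count,
# then searches the saved copy for PATTERN.
count_and_search() {
    pattern=$1
    tmpfile=$(mktemp)
    # tee saves a copy of standard input while wc counts the same stream.
    line_count=$(tee "$tmpfile" | wc -l | tr -d ' ')
    echo "Number of lines: $line_count"
    echo "Searching for $pattern"
    grep -e "$pattern" "$tmpfile"
    rm -f "$tmpfile"   # always clean up the temporary file
}

report=$(printf 'red apple\ngreen pear\nred plum\n' | count_and_search red)
printf '%s\n' "$report"
```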

46 Summary Regular expression and Finite state automata
Single-quote search patterns so that the shell does not interpret characters that have special meaning to it: *, ., $, ?, …
Be sure to distinguish regex from shell globbing.
We looked at grep regex and egrep regex; egrep regex is generally a superset of grep regex, except for backreferences.
Some other useful filter commands

