Filters using regular expressions

Filters using regular expressions
Course code: 10CS44 UNIT-5 Filters using regular expressions Engineered for Tomorrow Prepared by : Thanu Kurian Department :Computer Science Date :

Filters using Regular Expressions –grep and sed
Engineered for Tomorrow Filters using Regular Expressions –grep and sed The filters are the commands which accepts data from standard input, manipulate it and write the results to standard output. i.e. filters filter or manipulate data in different ways. grep and sed are two important filters.

grep: Searching for a Pattern
Engineered for Tomorrow grep: Searching for a Pattern grep searches a file for a pattern and displays both matching and nonmatching lines. grep scans its input for a pattern and displays lines containing the pattern, the line numbers or filenames where the pattern occurs. The command uses the following syntax: grep options pattern filename(s) grep searches for pattern in one or more filenames(s), or the standard input if no filename is specified. Example: to display lines containing the string sales from the file emp.lst ; $ grep “sales” emp.lst 2233 | a.k shukla | g.m |sales |12/12/52|6000 1006 | pqr agarwal | director | sales |09/08/38|6500 3456 | s.n sharma | manager | sales |03/06/70|5000

grep can also save the standard output in a file as shown below:
Engineered for Tomorrow grep can also save the standard output in a file as shown below: who | grep kumar > foo grep returns the prompt in case the pattern cannot be located: $ grep president emp.lst // no quoting necessary here $ _ // no president found

grep options: i) Ignoring case (-i):
Engineered for Tomorrow grep options: i) Ignoring case (-i): when we are looking for a name and not sure of the case , use the - i (ignore) option. $ grep -i ‘agarwal’ emp.lst 3564 abc Agarwal | executive|personnel | 06/07/47 |7500 This locates the name Agarwal using the name agarwal

ii) Deleting lines (-v):
Engineered for Tomorrow ii) Deleting lines (-v): The inverse (-v) option selects all lines except those containing the pattern. So, we can create a file otherlist containing all but directors: $ grep –v ‘director’ emp.lst > otherlist $ wc –l otherlist 11 otherlist iii) Displaying Line Numbers (-n): The –n (number) option displays the line numbers containing the pattern , along with the lines: $ grep –n ‘marketing’ emp.lst 3 : 5678 |sumit chowdury | d.g.m. |marketing | 19/05/04 |6000 15 : 3490| kumar sharma | d.g.m |marketing |12/06/11|3490

iv) Counting Lines containing Pattern (-c):
Engineered for Tomorrow iv) Counting Lines containing Pattern (-c): the count (-c ) option displays the number of lines containing the pattern . $ grep –c ‘director’ emp.lst 3 when this command is used with multiple files, the filename is prefixed to the line count: $ grep -c director emp*.lst emp1.lst :4 emp2.lst : 2 empold.lst : 2

v) Displaying Filenames (-l):
Engineered for Tomorrow v) Displaying Filenames (-l): the list (-l ) option displays only the names of files containing the pattern $ grep –l ‘manager’ *.lst desig.lst emp.lst emp1.lst emp2.lst

vii) Taking Patterns from a File (-f):
Engineered for Tomorrow vii) Taking Patterns from a File (-f): we can place all the three patterns in a separate file, one pattern per line. $ grep –f pattern.lst emp.lst Basic Regular Expression (BRE):-An introduction The basic regular expression character subset:

Engineered for Tomorrow

there are two categories of regular expressions: basic and extended
Engineered for Tomorrow Regular Expression: An ambiguous expression containing some special characters which is expanded by a command to match more than one string. there are two categories of regular expressions: basic and extended grep supports basic regular expressions by default and extended regular expressions (ERE) with the – E option. The Character Class: The character class [aA] matches the letter a in both uppercase and lowercase. The model [ar][ar] matches any of the 4 patterns: aa ar ra rr $ grep “ [aA] g [ar] [ar] wal” emp.lst 3546 | xyz Agarwal | executive |Personel | 07/06/46 | 7500 0110 | v.k agrawal | g.m |Marketing | 09/03/56 |9000 The pattern [a-zA-Z0-9] matches a single alphanumeric character. The pattern [^a-zA-Z] matches a single nonalphabetic character string.

The * match zero or more occurrences of the previous character.
Engineered for Tomorrow The * match zero or more occurrences of the previous character. The pattern g* matches a null character or the following strings: g gg ggg gggg ……… $ grep “[aA]gg*[ar][ar] wal” emp.lst 2470 |anil aggarwal |manager | sales |05/07/79|5000 3564 | xyz Agarwal |executive|pesonnel |07/06/47|7500 0110 | v.k. agrawal |g.m. |marketing| 12/35/60|9000

The Dot matches a single character.
Engineered for Tomorrow The Dot matches a single character. The Regular Expression .* : signifies any number of characters, or none. $ grep “j.* saxena” emp.lst 2345 | j.b. saxena |g.m. |marketing |03/12/45 |6700

Specifying Pattern locations(^ and $)
Engineered for Tomorrow Specifying Pattern locations(^ and $) Two characters are used to match a pattern at the beginning or end of line ^ for matching at beginning of line $ for matching at end of line Example: to extract those lines where the emp-id begins with a 2. $ grep “^2” emp.lst 2233| a.k. Shukla |g.m. | sales| 12/12/56 | 6000 2365| anil agarwal | manager | sales | 01/05/90|5000 2345 | j.b. Saxena | g.m.|marketing |03/12/45|8000 Example: to select the lines where the salary lies between 7000 and 7999, we have to use $ at the end of the pattern: $ grep “7…$” emp.lst 9876 |jai sharma |director | production | 03/12/50 | 7000 3564 | xyz agarwal | executive |personnel |07/06/27|7500

Extended Regular Expressions (ERE) and egrep:
Engineered for Tomorrow Extended Regular Expressions (ERE) and egrep: ERE makes it possible to match dissimilar patterns with a single expression. The + and ?: the ERE set includes two special characters: + and ?. +: matches one or more occurrences of the previous character. ?: matches zero or more occurrences of the previous character. i.e. b+ matches b, bb, bbb , etc. b? matches either single instance of b or nothing. $ grep -E “[aA]gg? arwal” emp.lst 2476 |anil aggarwal |manager |sales | 01/05/59 |5000 3564 | v.k. Agarwal |executive |personnel |06/07/37 |4300

sed: Stream Editor is a multipurpose tool which combines the work of several filters. sed performs noninteractive operations on a data stream. the syntax is: sed options ‘address action’ file(s) addressing is done in 2 ways: By one or two line numbers (like 3,7) By specifying a /- enclosed pattern which occurs in a line (like /From:/ In the first form, address specifies either one line number to select a single line or a set of two (3,7) to select a group of contiguous lines. the second form uses one or two patterns. the action component can be a simple display (print) or an editing function like insertion, deletion or substitution of text. sed processes several instructions in a sequential manner. each instruction operates on the output of previous instruction

lines can be addressed using line numbers and context.
Engineered for Tomorrow Line Addressing: lines can be addressed using line numbers and context. in line addressing, sed can be used to display the first 3 lines of a file as follows: $ sed ‘3q’ emp.lst 2233 | a.k shukla |g.m. |sales |12/12/52|6000 9876 |jai sharma |director |production |03/12/50 |7000 5678 |sumit gupta |d.g.m |marketing |04/19/45|6700

We can display the selected lines using p (print) command.
Engineered for Tomorrow We can display the selected lines using p (print) command. $ sed –n ‘1,2p’ emp.lst 2233 | a.k shukla |g.m. |sales |12/12/52|6000 9876 |jai sharma |director |production |03/12/50 |7000 Prints the first 2 lines. The –n option makes sure that lines are not printed twice when using the p (print) command. To print the last line of the file, use the $: $ sed –n ‘$p’ emp.lst 01110 |v.k agarwal |g.m. |marketing |31/12/40|9000

Selecting Lines from anywhere: to select lines 9 through 11, use:
Engineered for Tomorrow Selecting Lines from anywhere: to select lines 9 through 11, use: $ sed –n ‘9,11p’ emp.lst Selecting multiple groups of lines: sed -n ‘1,2p 7,9p $p’ emp.lst Negating the Action (!): sed has a negation operator (!), which can be used with any action. selecting the first 2 lines is the same as not selecting lines 3 through the end . This is written as: sed -n ‘3,$!p’ emp.lst // don’t print lines 3 to end.

Using Multiple Instructions ( -e and –f):
Engineered for Tomorrow Using Multiple Instructions ( -e and –f): the -e option allows us to enter as many instructions as we wish , each preceded by the option. sed -n -e ‘1,2p’ -e ‘7,9p’ -e ‘$p’ emp.lst the above 3 instructions can be stored in a file , with each instruction on a separate line as shown below: $ cat instr.fil 1,2p 7,9p $p We can now use the -f option to direct sed to take its instructions from the file using the command : sed -n –f instr.fil emp.lst

Writing Selected lines to a File (w):
Engineered for Tomorrow Writing Selected lines to a File (w): we can use the w (write) command to write the selected lines to a separate file. save the lines of directors in dlist this way: sed -n ‘/director/w dlist’ emp.lst //no display when using -n option. we can store lines pertaining to the directors, managers and executives in 3 separate files as shown : sed -n ‘/director/w dlist /manager/w mlist /executive/w elist’ emp.lst

Commands are : i (insert), a (append), c (change) and d (delete).
Engineered for Tomorrow Text Editing: Commands are : i (insert), a (append), c (change) and d (delete). Inserting and Changing Lines ( i, a, c): the i command inserts text. $ sed ‘1i\ > # include <stdio.h> \ > # include <unistd.h> > ‘ foo.c >> $$

First enter the instruction 1i, which inserts text at line number 1.
Engineered for Tomorrow First enter the instruction 1i, which inserts text at line number 1. Then enter a \ before pressing [Enter]. Now we can input as many lines as we wish. Each line except the last has to be terminated by the \ before hitting the [Enter]. Here , the concatenated output of two lines of inserted text and existing lines to standard output , which is redirected to a temporary file ,$$. Now this file can be moved to another file , foo.c as shown: $ mv $$ foo.c ; head -n 2 foo.c # include <stdio.h> # include <unistd.h>

sed -n ‘/director/!p’ emp.lst > olist
Engineered for Tomorrow Deleting Lines (d): sed uses the d (delete) command to select lines not containing the pattern sed ‘/director/d’ emp.lst > olist // -n option not to be used with d command. sed -n ‘/director/!p’ emp.lst > olist selects all lines except those containing director , and saves them in olist.

the g flag at the end makes the substitution global.
Engineered for Tomorrow the g flag at the end makes the substitution global. use g (global) flag to replace all the pipes: $ sed ‘s/ |/ :/g’ emp.lst | head –n 2 2233 : a.k. Shukla :g.m :sales :12/12/52:6000 9876 : sharma :director :production :12/03/50:7000 we can replace the word director with member of a file using: $ sed ‘1,5s/director/member /’ emp.lst to add the 2 prefix to all emp-ids: $ sed ‘s/^/2/’ emp.lst | head -n 1 | a.k shukla |g.m. |sales | 12/12/52|6000 we can add the suffix .00 to the salary : $ sed ‘s/$/0.00/’ emp.lst |head -n 1 | a.k shukla |g.m. |sales | 12/12/52|

The Remembered Pattern (//):
Engineered for Tomorrow The Remembered Pattern (//): the two slashes (//) represents the expression used for scanning a pattern (the remembered pattern). sed ‘/director/s//member/’ emp.lst Here, sed “remembers” the scanned pattern, and stores it in // (2 front slashes). When we use // in the target string, it means that source pattern is to be removed. For example, sed ‘s/|//g’ emp.lst //removes every | from file.

Basic Regular Expressions (BRE) – Revisited
Engineered for Tomorrow Basic Regular Expressions (BRE) – Revisited Three more additional types of expressions are: The repeated patterns - & The interval regular expression (IRE) – { } The tagged regular expression (TRE) – ( )

The repeated patterns - &
Engineered for Tomorrow The repeated patterns - & To make the entire source pattern appear at the destination also sed ‘s/director/executive director/’ emp.lst sed ‘s/director/executive &/’ emp.lst sed ‘/director/s//executive &/’ emp.lst Replaces director with executive director where & is a repeated pattern

The interval RE - { } sed and grep uses IRE that uses an integer to specify the number of characters preceding a pattern. The IRE uses an escaped pair of curly braces and takes three forms: ch\{m\} – the ch can occur m times ch\{m,n\} – ch can occur between m and n times ch\{m,\} – ch can occur at least m times The value of m and n can't exceed 255. Let teledir.txt maintains landline and mobile phone numbers. To select only mobile numbers, use IRE to indicate that a numerical can occur 10 times. grep ‘[0-9]\{10\}’ teledir.txt Line length between 101 and 150 grep ‘^.\{101,150\}$’ foo Line length at least 101 sed –n ‘/.{101,\}/p’ foo

The Tagged Regular Expression (TRE)
Engineered for Tomorrow The Tagged Regular Expression (TRE) You have to identify the segments of a line that you wish to extract and enclose each segment with a matched pair of escaped parenthesis. If we need to extract a number, $[0-9]*$. If we need to extract non alphabetic characters, $[^a-zA-Z]*$ Every grouped pattern automatically acquires the numeric label n, where n signifies the nth group from the left. sed ‘s/ \ (a-z]*\) *\ ([a-z]*\) / \2, \1/’ teledir.txt To get surname first followed by a , and then the name and rest of the line. sed does not use compulsorily a / to delimit patterns for substitution. We can use only any character provided it doesn’t occur in the entire command line. Choosing a different delimiter has allowed us to get away without escaping the / which actually occurs in the pattern.

Filters using regular expressions

Similar presentations

Presentation on theme: "Filters using regular expressions"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Filters using regular expressions

Similar presentations

Presentation on theme: "Filters using regular expressions"— Presentation transcript:

Similar presentations

About project

Feedback