The awk command
Introduction

Awk is a programming language used for manipulating data and generating reports. The data may come from standard input, from one or more files, or from the output of a process. Awk can be used at the command line for simple operations, or it can be written into programs for larger applications. Awk scans a file (or input) line by line, from the first to the last line, searching for lines that match a specified pattern and performing selected actions (enclosed in curly braces) on those lines.
Awk takes its name from the initials of the last names of its authors: Alfred Aho, Peter Weinberger, and Brian Kernighan. There are a number of versions of awk: old awk, new awk (nawk), GNU awk (gawk), POSIX awk, and so on. Awk combines features of several filters, but it has two features that set it apart:

1. It can identify and manipulate individual fields in a line.
2. It can perform computation, unlike the other standard UNIX filters.

Further, awk accepts extended regular expressions (EREs) for pattern matching, and has C-type programming constructs and several built-in variables and functions.
awk Preliminaries

The awk command follows the general syntax:

awk 'selection_criteria { action }' file(s)

Note the use of single quotes and curly braces. The selection_criteria (a form of addressing) filters the input and selects lines for the action component to act on. The action is enclosed within curly braces. Together, the selection_criteria and the action constitute an awk program, which is surrounded by a pair of single quotes. These programs are often one-liners, though they can span several lines as well.

Ex: to select the directors from the file, the awk command is:

$ awk '/dir./ {print}' emp.lst
7898 | akash |dir. |mark. | 11/06/70 |9000
Unlike other filters, awk uses a contiguous sequence of spaces and tabs as the default field delimiter. In the example below, the delimiter is changed to | with the -F option, and a , (comma) separates the field specifiers in the print statement:

$ awk -F"|" '/dir./ {print $2,$3,$4,$6}' emp.lst
akash dir. mark

Fields in awk are numbered $1, $2, etc. Awk also addresses the entire line as $0.

Ex: to display the number of records in the file e.lst:

$ awk '{print $0}' e.lst | wc -l
6
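As a hedged illustration of field selection, the sketch below builds a small pipe-delimited file and prints three fields from the matching line. The sample records and the shell variable names data and result are invented for this example; they are not the contents of emp.lst, and the fields carry no padding spaces.

```shell
# Sample records invented for illustration (not the original emp.lst).
data=$(mktemp)
cat > "$data" <<'EOF'
2233|shukla|g.m|sales|12/12/52|26000
7898|akash|dir.|mark.|11/06/70|9000
3456|tiwary|g.m|product|05/02/89|23000
EOF

# -F'|' makes | the field delimiter; the commas in print insert the
# output field separator (a space by default) between the fields.
result=$(awk -F'|' '/dir\./ { print $2, $3, $6 }' "$data")
printf '%s\n' "$result"   # akash dir. 9000

rm -f "$data"
```

Escaping the dot as \. makes it match a literal period; an unescaped . matches any single character.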
The action section is represented by the statement { print }, which has the effect of printing all the selected lines. If the selection_criteria is missing, the action applies to all lines of the file. If the action is missing, the entire selected line is printed. Either the selection_criteria or the action may be omitted, but not both, and whatever is present must be enclosed within a pair of single quotes. All context patterns have to be enclosed within a pair of /'s.

The print statement, when used without any field specifiers, prints the entire line, though you can also use the variable $0 to indicate that explicitly. Since print is the default action of awk, there is no need to specify it if you want to print the entire line. All three forms are equivalent:

$ awk '/dir/' emp.lst
$ awk '/dir/ {print}' emp.lst
$ awk '/dir/ {print $0}' emp.lst
For pattern matching, awk uses regular expressions of the egrep variety, with the same requirement that all these expressions be bounded on either side by a /. This lets you locate both 'sharma' and 'sarma':

$ awk -F"|" '/[Ss]h*arma/' e.lst
9876 | sharma | mgr |product| 12/03/60 | | Sarma | dir.| sales | 05/09/60 |25000

Awk can also select lines by line number, either a single line or a range.

Ex: to select lines 3 to 6 from a file, use the built-in variable NR to specify the line numbers:

$ awk -F"|" 'NR==3,NR==6 {print NR, $2, $3,$6}' e.lst
3 akash dir tiwary g.m kumar mgr Sarma dir
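The two selection styles described above, an egrep-style pattern and an NR range, can be sketched as follows. The records and the variable names data, names, and range are assumptions made for this example, not the original e.lst.

```shell
# Sample records invented for illustration (not the original e.lst).
data=$(mktemp)
cat > "$data" <<'EOF'
2233|shukla|g.m|sales|12/12/52|26000
9876|sharma|d.g.m|product|12/03/60|16000
7898|akash|dir.|mark.|11/06/70|9000
1265|Sarma|dir.|sales|05/09/60|25000
EOF

# h* matches zero or more h characters, so both spellings are found.
names=$(awk -F'|' '/[Ss]h*arma/ { print $2 }' "$data")
printf '%s\n' "$names"

# A pair of conditions separated by a comma selects a range of lines.
range=$(awk -F'|' 'NR==2, NR==3 { print NR, $2 }' "$data")
printf '%s\n' "$range"

rm -f "$data"
```

With this sample file the first command prints sharma and Sarma, and the second prints lines 2 and 3 prefixed with their record numbers.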
Formatting output with printf

Awk uses the print and printf statements to write to standard output. The print statement produces unformatted output.

Ex: to print all fields except the 4th, assign an empty string to the field you don't want:

$ awk -F"|" '{ $4=""; print}' e.lst | head
shukla g.m 12/12/ sharma mgr 12/03/

When placing multiple statements on a single line, use ; as their delimiter. Here print is the same as print $0. Assigning to a field makes awk rebuild $0 with the output field separator (a space by default), which is why the | delimiters no longer appear in the output.

With the C-like printf statement, you can use awk as a stream formatter. printf uses a quoted format specifier and a field list:

%s - String
%d - Integer
%f - Floating point number
To produce formatted output from unformatted input, using a regular expression to select the lines:

$ awk -F"|" '/[sS]h*arma/{ printf("%-20s %-12s %6d\n",$2,$3,$6) }' e.lst
sharma mgr Sarma dir
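To make the effect of the format specifiers visible, here is a minimal sketch that formats a single invented record. The widths 10, 6, and 6 and the literal | separators in the format string are choices made for this example only, so that the padding can be seen.

```shell
# One invented record, piped straight into awk.
line=$(echo '7898|akash|dir.|mark.|11/06/70|9000' |
    awk -F'|' '{ printf "%-10s|%-6s|%6d\n", $2, $3, $6 }')
printf '%s\n' "$line"
```

%-10s left-justifies the name in a 10-character column, %-6s does the same for the designation in 6 characters, and %6d right-justifies the salary in 6 characters. Unlike print, printf adds no newline of its own, so the \n must be given explicitly.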
The Logical And Relational Operators

To print three fields for the directors and the managers, you can write a separate pattern and action for each, one per line:

$ awk -F"|" '/director/ { printf "%-20s %-12s %d\n", $2,$3,$6 }
> /manager/ { printf "%-20s %-12s %d\n", $2,$3,$6 }' emp.lst

But this method of repeating the print action on each line can be tedious. Awk also provides the || and && logical operators, which let a single action serve both conditions:

$ awk -F"|" '$3==" mgr " || $3=="dir. "{ printf("%-20s %-12s %6d\n",$2,$3,$6) }' e1.lst
akash dir kumar mgr 15000

Note that == compares the whole field, so any spaces that pad the field in the file must also appear in the quoted string.
If you want to print only those lines for persons who are neither director nor manager, use the != and && operators:

$ awk -F"|" '$3!=" dir." && $3!=" mgr" { printf "%-20s %-12s %d\n", $2,$3,$6}' e1.lst

While using the operators == and != for string matching, remember that they can handle only fixed strings, not regular expressions.

How to match regular expressions: awk offers the ~ and !~ operators to match and negate a match, respectively.

$ awk -F"|" '$3 ~/g.m/ {print}' e1.lst
2233 | shukla | g.m | sales | 12/12/52 | | sharma |d.g.m|product| 12/03 60 | | tiwary |g.m |product| 05/02/89 |23000
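The || and && forms discussed above can be sketched as a pair of complementary selections. The records below are invented and deliberately carry no padding spaces, so the fixed strings "mgr" and "dir." match the third field exactly; with padded data the strings would need the same padding.

```shell
# Sample records invented for illustration (no padding around fields).
data=$(mktemp)
cat > "$data" <<'EOF'
2233|shukla|g.m|sales|12/12/52|26000
7898|akash|dir.|mark.|11/06/70|9000
5678|kumar|mgr|admin|19/04/43|15000
EOF

# Either condition may hold: directors or managers.
senior=$(awk -F'|' '$3 == "mgr" || $3 == "dir." { print $2 }' "$data")
printf '%s\n' "$senior"

# Both conditions must hold: neither manager nor director.
others=$(awk -F'|' '$3 != "mgr" && $3 != "dir." { print $2 }' "$data")
printf '%s\n' "$others"

rm -f "$data"
```

Every record falls into exactly one of the two selections, which is a quick way to check that the negated condition was written with && rather than ||.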
The previous example prints the d.g.m's as well as the g.m's, since the pattern g.m is embedded in the larger string. To avoid this, use the regular expression characters ^ and $, which here indicate the beginning and the end of a field, respectively.

$ awk -F"|" '$3 ~/^g.m/ {print}' e1.lst
3456 | tiwary |g.m |product| 05/02/89 |23000
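The difference between an unanchored and an anchored field match can be sketched as follows. The records are invented and unpadded, which is an assumption of this example; in a file whose fields are padded with spaces, ^ and $ would have to allow for the padding.

```shell
# Sample records invented for illustration (no padding around fields).
data=$(mktemp)
cat > "$data" <<'EOF'
2233|shukla|g.m|sales|12/12/52|26000
9876|sharma|d.g.m|product|12/03/60|16000
3456|tiwary|g.m|product|05/02/89|23000
EOF

# Unanchored: g.m found anywhere in field 3, so d.g.m matches too.
loose=$(awk -F'|' '$3 ~ /g\.m/ { print $2 }' "$data")
printf '%s\n' "$loose"

# Anchored at both ends: field 3 must be exactly g.m.
strict=$(awk -F'|' '$3 ~ /^g\.m$/ { print $2 }' "$data")
printf '%s\n' "$strict"

rm -f "$data"
```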
The relational and regular expression matching operators used by awk

Operator    Significance
<           less than
<=          less than or equal to
==          equal to
!=          not equal to
>=          greater than or equal to
>           greater than
~           matches a regular expression
!~          doesn't match a regular expression
Number Processing

Awk uses the arithmetic operators +, -, *, /, and % (modulus). It also overcomes one of the major limitations of the shell: the inability to handle decimal numbers. You can use awk to print a pay-slip for the directors:

$ awk -F"|" '$3~/^dir./ {
> printf "%-20s %-12s,%d %d %d\n", $2,$3,$6,$6*0.4,$6*0.15}' e1.lst
akash dir.,

While awk has certain built-in variables, like NR and $0, it also permits users to define variables of their own choice. A user-defined variable has a special feature: no type declaration is needed, and it is initialized by default to zero or a null string, depending on its type. Awk identifies the type of a variable from the context in which it is used.
$ awk -F"|" '$6>=15000 {
> cnt = cnt+1
> print cnt,$2,$3,$6}' e1.lst
1 shukla g.m sharma d.g.m tiwary g.m kumar mgr 15000
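A short sketch of an undeclared counter combined with decimal arithmetic is given below. The records, the 15% figure applied to the salary, and the variable names are assumptions made for this example.

```shell
# Sample records invented for illustration.
data=$(mktemp)
cat > "$data" <<'EOF'
2233|shukla|g.m|sales|12/12/52|26000
7898|akash|dir.|mark.|11/06/70|9000
3456|tiwary|g.m|product|05/02/89|23000
5678|kumar|mgr|admin|19/04/43|15000
EOF

# cnt is never declared: it starts out as zero in a numeric context.
# %.2f is used for the product so the decimal result is shown as is.
result=$(awk -F'|' '$6 >= 15000 {
    cnt = cnt + 1
    printf "%d %s %.2f\n", cnt, $2, $6 * 0.15
}' "$data")
printf '%s\n' "$result"

rm -f "$data"
```

Printing a computed decimal with %d truncates it toward zero, so %.2f (or plain print) is the safer choice whenever the fractional part matters.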
THE -f OPTION

Awk offers the -f option to take the program from the file that follows this option. The program stored in the file is not enclosed in single quotes, since the shell never interprets it.

$ cat q1.awk
$6>=15000 { print ++count,$2,$3,$6}
$ awk -F"|" -f q1.awk e1.lst
1 shukla g.m sharma d.g.m tiwary g.m kumar mgr 15000
THE BEGIN AND END SECTIONS

If you need to print something before processing the first line, for example a heading, use the BEGIN section. Similarly, if you want to print totals after processing is over, do it in the END section. Both sections are optional, and take the form:

BEGIN { action }
END { action }

When both are present, they are separated by the body of the awk program. Each uses a pair of curly braces to enclose its action. You can use these two sections to print a suitable heading at the beginning, and the average salary at the end.
$ cat q2.awk
BEGIN {
    printf "\n\t\t EMPLOYEE ABSTRACT \n\n"
}
$6>15000 {    # the # character is used for comments
    count++; tot+=$6
    printf "%3d%-20s%-12s%d\n", count,$2,$3,$6
}
END {
    printf "\n\t The average basic pay is %6d\n", tot/count
}
$ awk -F"|" -f q2.awk e1.lst
EMPLOYEE ABSTRACT
1 shukla g.m tiwary g.m
The average basic pay is 21500
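The heading, body, and summary structure can be sketched in a compact form as follows. The records are invented, the output layout is simplified compared with q2.awk, and the count > 0 guard in the END section is an addition of this example: it avoids a division by zero when no record is selected.

```shell
# Sample records invented for illustration.
data=$(mktemp)
cat > "$data" <<'EOF'
2233|shukla|g.m|sales|12/12/52|26000
7898|akash|dir.|mark.|11/06/70|9000
3456|tiwary|g.m|product|05/02/89|23000
EOF

report=$(awk -F'|' '
BEGIN { print "EMPLOYEE ABSTRACT" }              # runs before the first line
$6 > 15000 { count++; tot += $6; print count, $2, $6 }
END { if (count > 0) printf "Average basic pay: %d\n", tot / count }
' "$data")
printf '%s\n' "$report"

rm -f "$data"
```

With this sample file two records are selected, so the average is (26000 + 23000) / 2 = 24500.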
Positional Parameters

The program q1.awk could take a more generalized form if the number is replaced with a variable. To do that, the entire awk command (not just the program) should be stored in a shell script, and the value supplied as an argument to the script. Inside the awk program, this argument is then compared with the field. Such arguments are known as positional parameters, and are identified by the shell as $1, $2, $3, etc., in the order in which they are presented on the command line.

Because the awk program is itself single-quoted, a positional parameter used inside it must be enclosed in its own pair of single quotes: the first quote closes the awk program, the shell expands the parameter, and the second quote reopens the program. This is how the shell's positional parameter $1 is distinguished from awk's field identifier $1.
$ cat q1.awk
awk -F"|" '$6>='$1' { print $2,$3,$6}' e1.lst
$ q1.awk 15000
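An alternative to splicing the shell's $1 between single quotes is awk's -v option, which assigns a value to an awk variable before the program starts. This is a different technique from the one in q1.awk, shown here only as a sketch; the function name above, the awk variable limit, and the sample records are all hypothetical.

```shell
# Sample records invented for illustration.
data=$(mktemp)
cat > "$data" <<'EOF'
2233|shukla|g.m|sales|12/12/52|26000
7898|akash|dir.|mark.|11/06/70|9000
5678|kumar|mgr|admin|19/04/43|15000
EOF

# Hypothetical helper: $1 is the salary threshold, $2 the data file.
# Inside the single quotes, $2 and $6 are awk fields, not shell parameters.
above() {
    awk -F'|' -v limit="$1" '$6 >= limit { print $2, $6 }' "$2"
}

result=$(above 15000 "$data")
printf '%s\n' "$result"

rm -f "$data"
```

Keeping the awk program in one unbroken pair of single quotes means the shell never rewrites the program text, which makes quoting mistakes less likely.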
BUILT-IN VARIABLES

Variable    Function
NR          Cumulative number of records read
FS          The input field separator
OFS         The output field separator
NF          Number of fields in the current record
FILENAME    The current input file
ARGC        Number of arguments in the command line
ARGV        The list of arguments
NR stores the record number of the current line.

FS defines the input field separator. This is an alternative to the -F option of the command. When used, it must be set in the BEGIN section so that the body of the program knows its value before it starts processing. The default output field separator, a space, can be reassigned using the variable OFS in the BEGIN section.

Ex:

$ awk 'BEGIN {FS="|";OFS="~"} $6>15000 {print $1,$2,$3,$6}' e1.lst
2233 ~ shukla ~ g.m ~ ~ tiwary ~g.m ~23000
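Setting both separators in the BEGIN section can be sketched as follows, using invented, unpadded records. The output separator only appears where print is given a comma-separated list.

```shell
# Sample records invented for illustration.
data=$(mktemp)
cat > "$data" <<'EOF'
2233|shukla|g.m|sales|12/12/52|26000
7898|akash|dir.|mark.|11/06/70|9000
3456|tiwary|g.m|product|05/02/89|23000
EOF

# FS splits the input on |; OFS joins the printed fields with ~.
result=$(awk 'BEGIN { FS = "|"; OFS = "~" }
$6 > 15000 { print $1, $2, $6 }' "$data")
printf '%s\n' "$result"

rm -f "$data"
```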
NF is used to clean up a database by locating records that don't contain the right number of fields.

Ex: to locate those records not having 6 fields, which have crept in due to faulty data entry:

$ awk 'BEGIN {FS="|"}
> NF!=6 {
> print "record no ",NR," has ",NF, " fields"}' emp.lst

FILENAME stores the name of the current file being processed. By default, awk doesn't print the filename, but you can instruct it to do so:

$ awk -F "|" '$6<15000 {print FILENAME,$0}' e1.lst
e1.lst 7898 | akash |dir. |mark. | 11/06/70 |9000
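The NF check can be sketched on a small invented file in which the second record has been cut short. The records and variable names are assumptions made for this example.

```shell
# Sample records invented for illustration; the second line is
# deliberately truncated to 4 fields.
data=$(mktemp)
cat > "$data" <<'EOF'
2233|shukla|g.m|sales|12/12/52|26000
7898|akash|dir.|mark.
3456|tiwary|g.m|product|05/02/89|23000
EOF

# NF is recomputed for every record, so it can be used as a condition.
bad=$(awk 'BEGIN { FS = "|" }
NF != 6 { print "record no", NR, "has", NF, "fields" }' "$data")
printf '%s\n' "$bad"   # record no 2 has 4 fields

rm -f "$data"
```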
When an awk program is used within a shell script, you can arrange to pass parameters to the script. The array ARGV[ ] stores the entire list of arguments, and the number of such arguments is stored in the variable ARGC. For the command

$ empfind.awk 3500 7000 director

ARGC takes the value 4, while the array ARGV[ ] is filled up with the words in the command line:

ARGV[0] = empfind.awk
ARGV[1] = 3500
ARGV[2] = 7000
ARGV[3] = director

Note that awk implementations commonly store the name of the awk command itself, rather than the name of the script, in ARGV[0].
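To see how the argument list is filled, here is a minimal sketch that passes three words directly to awk and prints ARGC followed by ARGV[1] onward. ARGV[0] is left out on purpose because its exact value depends on how awk was invoked. Since the program consists of a BEGIN section only, awk never tries to open the arguments as input files.

```shell
# ARGC counts the command name plus the three arguments that follow
# the program text.
args=$(awk 'BEGIN {
    print ARGC
    for (i = 1; i < ARGC; i++)
        print ARGV[i]
}' 3500 7000 director)
printf '%s\n' "$args"
```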
FUNCTIONS

Awk has several built-in functions, performing both arithmetic and string operations. Parameters are passed to a function in C style: delimited by commas and enclosed by a matched pair of parentheses.
Built-in functions in awk

Function            Description
int(x)              Returns the integer value of x
sqrt(x)             Returns the square root of x
index(s1,s2)        Returns the position of the string s2 in the string s1
length()            Returns the length of the argument (the complete record if none is given)
substr(s1,s2,s3)    Returns the portion of the string s1 of length s3, starting from position s2
split(s,a)          Splits the string s into the array a; returns the number of fields
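Each function in the table can be tried without any input file by calling it from a BEGIN section. The argument values below are arbitrary choices for this sketch, and parts is a hypothetical array name.

```shell
funcs=$(awk 'BEGIN {
    print int(7.9)                    # fractional part dropped: 7
    print sqrt(16)                    # 4
    print index("product", "duct")    # "duct" starts at position 4
    print length("awk")               # 3
    print substr("12/03/60", 7, 2)    # 2 characters from position 7: 60
    n = split("sales and marketing", parts)   # split on the default FS
    print n, parts[2]                 # 3 and
}')
printf '%s\n' "$funcs"
```

String positions in awk start at 1, not 0, which is why index returns 4 for a match beginning at the fourth character.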
Control flow - THE if statement

In the if statement, the condition itself must be enclosed in parentheses.

$ awk -F"|" '{ if ($6 >15000) print($2,$6)}' e1.lst
shukla tiwary

An else branch can be added. Each assignment must end with a newline or a semicolon before the else keyword:

$ awk -F"|" '{ if ($6 >15000)
>       commission = 0.15*$6
>   else
>       commission = 0.10*$6 }
> { print($2,$6,commission) }' e1.lst
shukla sharma akash tiwary kumar
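The if-else form can be sketched on two invented records, one on each side of the 15000 threshold. The variable name commission follows the text; the sample data and the two-decimal output format are assumptions of this example.

```shell
# Sample records invented for illustration.
data=$(mktemp)
cat > "$data" <<'EOF'
2233|shukla|g.m|sales|12/12/52|26000
7898|akash|dir.|mark.|11/06/70|9000
EOF

result=$(awk -F'|' '{
    if ($6 > 15000)
        commission = 0.15 * $6    # higher rate above the threshold
    else
        commission = 0.10 * $6
    printf "%s %d %.2f\n", $2, $6, commission
}' "$data")
printf '%s\n' "$result"

rm -f "$data"
```

Placing the printf inside the same action as the if keeps the whole calculation in one block; the two-block form in the text behaves the same way because both blocks run for every line.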