awk: search for and process a pattern in a file

Format

    awk [-Fc] -f program-file [file-list]
    awk program [file-list]

Summary

The awk utility is a pattern-scanning and processing language. It searches one or more files to see if they contain lines that match specified patterns and then performs actions, such as writing the line to the standard output or incrementing a counter, each time it finds a match. You can use awk to generate reports or filter text. It works equally well with numbers and text; when you mix the two, awk will almost always come up with the right answer. The authors of awk (Alfred V. Aho, Peter J. Weinberger, and Brian W. Kernighan) designed it to be easy to use and, to this end, they sacrificed execution speed.

The awk utility takes its input from files you specify on the command line or from its standard input.

The awk utility takes many of its constructs from the C programming language. It includes the following features:

    flexible format
    conditional execution
    looping statements
    numeric variables
    string variables
    regular expressions
    C's printf

Arguments

The first format uses a program-file, which is the pathname of a file containing an awk program. See "Description" below.

The second format uses a program, which is an awk program included on the command line. This format allows you to write simple, short awk programs without having to create a separate program-file. To prevent the shell from interpreting the awk commands as shell commands, it is a good idea to enclose the program in single quotation marks.

The file-list contains pathnames of the ordinary files that awk processes. These are the input files.

Options

If you do not use the -f option, awk uses the first command line argument as its program.

    -f program-file    Causes awk to read its program from program-file, the argument that follows -f, instead of from the command line.
    -Fc                Specifies an input field separator c, to be used in place of the default separators ([SPACE] and [TAB]). The field separator can be any single character.

Description

An awk program consists of one or more program lines containing a pattern and/or action in the following format:

    pattern { action }

The pattern selects lines from the input file. The awk utility performs the action on all lines that the pattern selects. You must enclose the action within braces so that awk can differentiate it from the pattern. If a program line does not contain a pattern, awk selects all lines in the input file. If a program line does not contain an action, awk copies the selected lines to its standard output.

To start, awk compares the first line in the input file (from the file-list) with each pattern in the program-file or program. If a pattern selects the line (if there is a match), awk takes the action associated with the pattern. If the line is not selected, awk takes no action. When awk has completed its comparisons for the first line of the input file, it repeats the process for the next line of input. It continues this process, comparing subsequent lines in the input file, until it has read the entire file-list.

If several patterns select the same line, awk takes the actions associated with each of the patterns in the order in which they appear. It is therefore possible for awk to send a single line from the input file to its standard output more than once.
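The following is a minimal sketch of a line being selected by more than one pattern. The cars_sample helper is hypothetical; it only generates three tab-separated records taken from the cars file used in the examples, so that the snippet runs on its own.

```shell
# Hypothetical helper: emits three records from the cars data, TAB separated.
cars_sample() {
    printf '%s\t%s\t%s\t%s\t%s\n' \
        chevy nova 79 60 3000 \
        ford mustang 65 45 10000 \
        chevy impala 65 85 1550
}

# Two program lines. The chevy nova record matches both patterns, so awk
# runs both actions on it, in the order the program lines appear.
out=$(cars_sample | awk '
/chevy/ { print "rule 1:", $1, $2 }
/nova/  { print "rule 2:", $1, $2 }
')
printf '%s\n' "$out"
```

The chevy nova record produces two output lines, the chevy impala record one, and the ford mustang record none.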

Patterns

You can use a regular expression (refer to Appendix A), enclosed within slashes, as a pattern. The ~ operator tests to see if a field or variable matches a regular expression. The !~ operator tests for no match.

You can process arithmetic and character relational expressions with the following relational operators.

    Operator    Meaning
    <           less than
    <=          less than or equal to
    ==          equal to
    !=          not equal to
    >=          greater than or equal to
    >           greater than

You can combine any of the patterns described above using the Boolean operators || (OR) or && (AND).
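As a rough illustration of combining these operators, the sketch below mixes !~, relational comparisons, && and || in one pattern. The cars_sample helper is hypothetical and simply emits a few tab-separated records from the cars data used in the examples.

```shell
# Hypothetical helper: emits five records from the cars data, TAB separated.
cars_sample() {
    printf '%s\t%s\t%s\t%s\t%s\n' \
        chevy nova 79 60 3000 \
        ford mustang 65 45 10000 \
        honda accord 81 30 6000 \
        ford thundbd 84 10 17000 \
        chevy impala 65 85 1550
}

# Select records whose first field does NOT begin with "ch" and that are
# either older than model year 70 or priced above 10000.
out=$(cars_sample | awk '
$1 !~ /^ch/ && ($3 < 70 || $5 > 10000) { print $1, $2 }
')
printf '%s\n' "$out"
```

Only the two ford records pass: the mustang because of its year, the thundbd because of its price.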

The comma is the range operator. If you separate two patterns with a comma on a single awk program line, awk selects a range of lines beginning with the first line that contains the first pattern. The last line awk selects is the next subsequent line that contains the second pattern. After awk finds the second pattern, it starts the process over by looking for the first pattern again.

Two special patterns, BEGIN and END, allow you to execute commands before awk starts its processing and after it finishes. The awk utility executes the actions associated with the BEGIN pattern before, and with the END pattern after, it processes all the files in the file-list.

Actions

The action portion of an awk command causes awk to take action when it matches a pattern. If you do not specify an action, awk performs the default action, which is the Print command (explicitly represented as {print}). This action copies the record (normally a line; see "Variables" below) from the input file to awk's standard output.

You can follow a Print command with arguments, causing awk to print just the arguments you specify. The arguments can be variables or string constants. Using awk, you can send the output from a Print command to a file (>), append it to a file (>>), or pipe it to the input of another program (|).

Unless you separate items in a Print command with commas, awk catenates them. Commas cause awk to separate the items with the output field separator (normally a [SPACE]; see "Variables" below).

You can include several actions on one line within a set of braces by separating them with semicolons.
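A small sketch of these points, using one record from the cars data as input: it shows BEGIN and END actions around the main action, and the difference between separating Print items with a comma and simply placing them side by side.

```shell
out=$(printf 'ford\tmustang\t65\t45\t10000\n' | awk '
BEGIN { print "start" }     # runs once, before any input is read
{
    print $1, $2            # comma: items separated by OFS (a space by default)
    print $1 $2             # no comma: items are catenated
}
END   { print "end" }       # runs once, after the last input line
')
printf '%s\n' "$out"
```

The two Print commands in the middle action produce "ford mustang" and "fordmustang" respectively.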

Comments

The awk utility disregards anything on a program line following a pound sign (#). You can document an awk program by preceding comments with this symbol.

Variables

You declare and initialize user variables when you use them (that is, you do not have to declare them before you use them). In addition, awk maintains program variables for your use. You can use both user and program variables in the pattern and in the action portion of an awk program. Following is a list of program variables.

    Variable    Represents
    NR          record number of current record
    $0          the current record (as a single variable)
    NF          number of fields in the current record
    $1-$n       fields in the current record
    FS          input field separator (default: [SPACE] or [TAB])
    OFS         output field separator (default: [SPACE])
    RS          input record separator (default: [NEWLINE])
    ORS         output record separator (default: [NEWLINE])
    FILENAME    name of the current input file

The input and output record separators are, by default, [NEWLINE] characters. Thus, awk takes each line in the input file to be a separate record and appends a [NEWLINE] to the end of each record that it sends to its standard output. The input field separators are, by default, [SPACE]s and [TAB]s. The output field separator is a [SPACE]. You can change the value of any of the separators at any time by assigning a new value to its associated variable. Also, the input field separator can be set on the command line using the -F option.

Functions

The functions that awk provides for manipulating numbers and strings follow.

    Name                    Function
    length(str)             returns the number of characters in str; if you do not supply an argument, it returns the number of characters in the current input record
    int(num)                returns the integer portion of num
    index(str1, str2)       returns the index of str2 in str1, or 0 if str2 is not present
    split(str, arr, del)    places elements of str, delimited by del, in the array arr[1]...arr[n]; returns the number of elements in the array
    sprintf(fmt, args)      formats args according to fmt and returns the formatted string; mimics the C programming language function of the same name
    substr(str, pos, len)   returns a substring of str that begins at pos and is len characters long
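The functions listed above can be tried out with a short sketch like the one below. The input record "chevy impala" and the literal arguments are chosen only for illustration.

```shell
out=$(printf 'chevy impala\n' | awk '{
    print length                    # 12: characters in the current record
    print length($1)                # 5: characters in the first field
    print int(7.9)                  # 7: integer portion
    print index($0, "imp")          # 7: position where "imp" starts
    n = split("1965:85:1550", parts, ":")
    print n, parts[1], parts[3]     # 3 1965 1550
    s = sprintf("%05d", 42)         # format into a string instead of printing
    print s                         # 00042
    print substr($0, 7, 3)          # imp
}')
printf '%s\n' "$out"
```

Note that string positions in index and substr start at 1, not 0.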

Operators

The following awk arithmetic operators are from the C programming language.

    Operator    Function
    *           multiplies the expression preceding the operator by the expression following it
    /           divides the expression preceding the operator by the expression following it
    %           takes the remainder after dividing the expression preceding the operator by the expression following it
    +           adds the expression preceding the operator and the expression following it
    -           subtracts the expression following the operator from the expression preceding it
    =           assigns the value of the expression following the operator to the variable preceding it
    ++          increments the variable preceding the operator
    --          decrements the variable preceding the operator
    +=          adds the expression following the operator to the variable preceding it and assigns the result to the variable preceding the operator
    -=          subtracts the expression following the operator from the variable preceding it and assigns the result to the variable preceding the operator
    *=          multiplies the variable preceding the operator by the expression following it and assigns the result to the variable preceding the operator
    /=          divides the variable preceding the operator by the expression following it and assigns the result to the variable preceding the operator
    %=          takes the remainder, after dividing the variable preceding the operator by the expression following it, and assigns the result to the variable preceding the operator

Associative Arrays

An associative array is one of awk's most powerful features. An associative array uses strings as its indexes. Using an associative array, you can mimic a traditional array by using numeric strings as indexes. You assign a value to an element of an associative array just as you would assign a value to any other awk variable. The format is shown below.

    array[string] = value

The array is the name of the array, string is the index of the element of the array you are assigning a value to, and value is the value you are assigning to the element of the array.

There is a special For structure you can use with an awk array. The format is:

    for (elem in array) action

The elem is a variable that takes on the index of each of the elements in the array, in turn, as the For structure loops through them; array is the name of the array; and action is the action that awk takes for each element in the array. You can use the elem variable in this action, for example as array[elem] to get the element's value. The "Examples" section contains programs that use associative arrays.

Printf

You can use the Printf command in place of Print to control the format of the output that awk generates. The awk version of Printf is similar to that of the C language. A Printf command takes the following format:

    printf "control-string", arg1, arg2, ..., argn

The control-string determines how Printf will format arg1-n. The arg1-n can be variables or other expressions. Within the control-string, you can use \n to indicate a [NEWLINE] and \t to indicate a [TAB].

The control-string contains conversion specifications, one for each argument (arg1-n). A conversion specification has the following format:
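As a sketch of an associative array together with the For structure, the program below counts how many records there are for each make. The cars_sample helper and the array name count are hypothetical; the records come from the cars data used in the examples. Because awk does not promise any particular order for "for (elem in array)", the output is piped through sort.

```shell
# Hypothetical helper: emits five records from the cars data, TAB separated.
cars_sample() {
    printf '%s\t%s\t%s\t%s\t%s\n' \
        chevy nova 79 60 3000 \
        ford mustang 65 45 10000 \
        chevy impala 65 85 1550 \
        ford bronco 83 25 9500 \
        fiat 600 65 115 450
}

# count[] is indexed by the make (a string). In the END action, the loop
# variable takes on each index in turn; count[make] is the stored value.
out=$(cars_sample | awk '
    { count[$1]++ }
END { for (make in count) print make, count[make] }
' | sort)
printf '%s\n' "$out"
```

An element that has never been assigned starts out as zero (or the empty string), which is why count[$1]++ needs no initialization.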

    %[-][x[.y]]conv

The - causes Printf to left justify the argument. The x is the minimum field width, and the .y is the number of places to the right of a decimal point in a number. The conv is a letter from the following list. Refer to the following "Examples" section for examples of how to use printf.

    conv    Conversion
    d       decimal
    e       exponential notation
    f       floating-point number
    g       use f or e, whichever is shorter
    o       unsigned octal
    s       string of characters
    x       unsigned hexadecimal
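The sketch below exercises a few of these conversion specifications. The square brackets are not part of the syntax; they are included only to make the padding visible.

```shell
out=$(awk 'BEGIN {
    printf "[%-8s]\n", "nova"       # left justified in a field of 8
    printf "[%8s]\n", "nova"        # right justified in a field of 8
    printf "[%5d]\n", 73            # decimal, minimum width 5
    printf "[%8.2f]\n", 2500        # width 8, two decimal places
    printf "[%o] [%x]\n", 8, 255    # octal and hexadecimal
}')
printf '%s\n' "$out"
```

Because the whole program is a BEGIN action, awk runs it without reading any input.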

Examples

A simple awk program is shown below.

    { print }

This program consists of one program line that is an action. It uses no pattern. Because the pattern is missing, awk selects all lines in the input file. Without any arguments, the Print command prints each selected line in its entirety. This program copies the input file to its standard output.

The following program has a pattern part without an explicit action.

    /jenny/

In this case, awk selects all lines from the input file that contain the string jenny. When you do not specify an action, awk assumes the action to be Print. This program copies all the lines in the input file that contain jenny to its standard output.

The following examples work with the cars data file. From left to right, the columns in the file contain each car's make, model, year of manufacture, mileage, and price. All white space in this file is composed of single [TAB]s (there are no [SPACE]s in the file).

    $ cat cars
    plym    fury    77      73      2500
    chevy   nova    79      60      3000
    ford    mustang 65      45      10000
    volvo   gl      78      102     9850
    ford    ltd     83      15      10500
    chevy   nova    80      50      3500
    fiat    600     65      115     450
    honda   accord  81      30      6000
    ford    thundbd 84      10      17000
    toyota  tercel  82      180     750
    chevy   impala  65      85      1550
    ford    bronco  83      25      9500

The first example below selects all lines that contain the string chevy. The slashes indicate that chevy is a regular expression. This example has no action part.

Although neither awk nor shell syntax requires single quotation marks on the command line, it is a good idea to use them, because they prevent many problems. If the awk program you create on the command line includes [SPACE]s or any special characters that the shell will interpret, you must quote them. Always enclosing the program in single quotation marks is the easiest way of making sure you have quoted any characters that need to be quoted.

    $ awk '/chevy/' cars
    chevy   nova    79      60      3000
    chevy   nova    80      50      3500
    chevy   impala  65      85      1550

The next example selects all lines from the file (it has no pattern part). The braces enclose the action part. You must always use braces to delimit the action part, so that awk can distinguish the pattern part from the action part. This example prints the third field ($3), a [SPACE] (indicated by the comma), and the first field ($1) of each selected line.

    $ awk '{print $3, $1}' cars
    77 plym
    79 chevy
    65 ford
    78 volvo
    83 ford
    80 chevy
    65 fiat
    81 honda
    84 ford
    82 toyota
    65 chevy
    83 ford

The next example includes both a pattern and an action part. It selects all lines that contain the string chevy and prints the third and first fields from the lines it selects.

    $ awk '/chevy/ {print $3, $1}' cars
    79 chevy
    80 chevy
    65 chevy

The next example selects lines that contain a match for the regular expression h. Because there is no explicit action, it prints all the lines it selects.

    $ awk '/h/' cars
    chevy   nova    79      60      3000
    chevy   nova    80      50      3500
    honda   accord  81      30      6000
    ford    thundbd 84      10      17000
    chevy   impala  65      85      1550

The next pattern uses the matches operator (~) to select all lines that contain the letter h in the first field.

    $ awk '$1 ~ /h/' cars
    chevy   nova    79      60      3000
    chevy   nova    80      50      3500
    honda   accord  81      30      6000
    chevy   impala  65      85      1550

The caret (^) in a regular expression forces a match at the beginning of the line or, in this case, the beginning of the first field.

    $ awk '$1 ~ /^h/' cars
    honda   accord  81      30      6000

A pair of brackets surrounds a character class definition (refer to Appendix A, "Regular Expressions"). Below, awk selects all lines that have a second field that begins with t or m. Then it prints the third and second fields, a dollar sign, and the fifth field.

    $ awk '$2 ~ /^[tm]/ {print $3, $2, "$" $5}' cars
    65 mustang $10000
    84 thundbd $17000
    82 tercel $750

The next example shows three roles that a dollar sign can play in an awk program. A dollar sign followed by a number forms the name of a field. Within a regular expression, a dollar sign forces a match at the end of a line or field (5$). Within a string, you can use a dollar sign as itself.

    $ awk '$3 ~ /5$/ {print $3, $1, "$" $5}' cars
    65 ford $10000
    65 fiat $450
    65 chevy $1550

Below, the equals relational operator (==) causes awk to perform a numeric comparison between the third field in each line and the number 65. The awk command takes the default action, Print, on each line that matches.

    $ awk '$3 == 65' cars
    ford    mustang 65      45      10000
    fiat    600     65      115     450
    chevy   impala  65      85      1550

The next example finds all cars priced at or under $3000.

    $ awk '$5 <= 3000' cars
    plym    fury    77      73      2500
    chevy   nova    79      60      3000
    fiat    600     65      115     450
    toyota  tercel  82      180     750
    chevy   impala  65      85      1550

When you use double quotation marks, awk performs textual comparisons, using the ASCII collating sequence as the basis of the comparison. Below, awk shows that the strings 450 and 750 fall in the range that lies between the strings 2000 and 9000.

    $ awk '$5 >= "2000" && $5 < "9000"' cars
    plym    fury    77      73      2500
    chevy   nova    79      60      3000
    chevy   nova    80      50      3500
    fiat    600     65      115     450
    honda   accord  81      30      6000
    toyota  tercel  82      180     750

When you need a numeric comparison, do not use quotation marks. The next example gives the correct results. It is the same as the previous example but omits the double quotation marks.

    $ awk '$5 >= 2000 && $5 < 9000' cars
    plym    fury    77      73      2500
    chevy   nova    79      60      3000
    chevy   nova    80      50      3500
    honda   accord  81      30      6000

Next, the range operator (,) selects a group of lines. The first line it selects is the one specified by the pattern before the comma. The last line is the one selected by the pattern after the comma. If there is no line that matches the pattern after the comma, awk selects every line up to the end of the file. The example selects all lines starting with the line that contains volvo and concluding with the line that contains fiat.

    $ awk '/volvo/, /fiat/' cars
    volvo   gl      78      102     9850
    ford    ltd     83      15      10500
    chevy   nova    80      50      3500
    fiat    600     65      115     450

After the range operator finds its first group of lines, it starts the process over, looking for a line that matches the pattern before the comma. In the following example, awk finds three groups of lines that fall between chevy and ford. Although the fifth line in the file contains ford, awk does not select it because, at the time it is processing the fifth line, it is searching for chevy.

    $ awk '/chevy/, /ford/' cars
    chevy   nova    79      60      3000
    ford    mustang 65      45      10000
    chevy   nova    80      50      3500
    fiat    600     65      115     450
    honda   accord  81      30      6000
    ford    thundbd 84      10      17000
    chevy   impala  65      85      1550
    ford    bronco  83      25      9500

When you are writing a longer awk program, it is convenient to put the program in a file and reference the file on the command line. Use the -f option, followed by the name of the file containing the awk program.

Following is an awk program that has two actions and uses the BEGIN pattern. The awk utility performs the action associated with BEGIN before it processes any of the lines of the data file. The pr_header awk program uses BEGIN to print a header. The second action, {print}, has no pattern part and prints all the lines in the file.

    $ cat pr_header
    BEGIN   {print "Make    Model   Year    Miles   Price"}
            {print}

    $ awk -f pr_header cars
    Make    Model   Year    Miles   Price
    plym    fury    77      73      2500
    chevy   nova    79      60      3000
    ford    mustang 65      45      10000
    volvo   gl      78      102     9850
    ford    ltd     83      15      10500
    chevy   nova    80      50      3500
    fiat    600     65      115     450
    honda   accord  81      30      6000
    ford    thundbd 84      10      17000
    toyota  tercel  82      180     750
    chevy   impala  65      85      1550
    ford    bronco  83      25      9500

In the previous and following examples, the white space in the headers is composed of single [TAB]s, so that the titles line up with the columns of data.

    $ cat pr_header2
    BEGIN   {
            print "Make    Model   Year    Miles   Price"
            print "-----------------------------"
            }
            {print}

    $ awk -f pr_header2 cars
    Make    Model   Year    Miles   Price
    -----------------------------
    plym    fury    77      73      2500
    chevy   nova    79      60      3000
    ford    mustang 65      45      10000
    volvo   gl      78      102     9850
    ford    ltd     83      15      10500
    chevy   nova    80      50      3500
    fiat    600     65      115     450
    honda   accord  81      30      6000
    ford    thundbd 84      10      17000
    toyota  tercel  82      180     750
    chevy   impala  65      85      1550
    ford    bronco  83      25      9500

When you call the length function without an argument, it returns the number of characters in the current line, including field separators. The $0 variable always contains the value of the current line. In the next example, awk prepends the length to each line, and then a pipe sends the output from awk to sort, so that the lines of the cars file appear in order of length. Because the formatting of the report depends on [TAB]s, including three extra characters at the beginning of each line throws off the format of some lines, such as the last one. A remedy for this situation will be covered shortly.

    $ awk '{print length, $0}' cars | sort
    19 fiat 600     65      115     450
    20 ford ltd     83      15      10500
    20 plym fury    77      73      2500
    20 volvo        gl      78      102     9850
    21 chevy        nova    79      60      3000
    21 chevy        nova    80      50      3500
    22 ford bronco  83      25      9500
    23 chevy        impala  65      85      1550
    23 honda        accord  81      30      6000
    24 ford mustang 65      45      10000
    24 ford thundbd 84      10      17000
    24 toyota       tercel  82      180     750

The NR variable contains the record (line) number of the current line. The following pattern selects all lines that contain more than 23 characters. The action prints the line number of all the selected lines.

    $ awk 'length > 23 {print NR}' cars
    3
    9
    10

You can combine the range operator (,) and the NR variable to display a group of lines of a file based on their line numbers. The next example displays lines 2 through 4.

    $ awk 'NR == 2, NR == 4' cars
    chevy   nova    79      60      3000
    ford    mustang 65      45      10000
    volvo   gl      78      102     9850

The END pattern works in a manner similar to the BEGIN pattern, except awk takes the actions associated with it after it has processed the last of its input lines. The following report displays information only after it has processed the entire data file. The NR variable retains its value after awk has finished processing the data file, so that an action associated with an END pattern can use it.

    $ awk 'END {print NR, "cars for sale."}' cars
    12 cars for sale.

The next example uses If commands to change the values of some of the first fields. As long as awk does not make any changes to a record, it leaves the entire record, including separators, intact. Once it makes a change to a record, it replaces all separators in that record with the output field separator. The default output field separator is a [SPACE].

    $ cat separ_demo
            {
            if ($1 ~ /ply/) $1 = "plymouth"
            if ($1 ~ /chev/) $1 = "chevrolet"
            print
            }

    $ awk -f separ_demo cars
    plymouth fury 77 73 2500
    chevrolet nova 79 60 3000
    ford    mustang 65      45      10000
    volvo   gl      78      102     9850
    ford    ltd     83      15      10500
    chevrolet nova 80 50 3500
    fiat    600     65      115     450
    honda   accord  81      30      6000
    ford    thundbd 84      10      17000
    toyota  tercel  82      180     750
    chevrolet impala 65 85 1550
    ford    bronco  83      25      9500

You can change the default value of the output field separator by assigning a value to the OFS variable. There is one [TAB] character between the quotation marks in the following example. This fix improves the appearance of the report but does not properly line up the columns.

    $ cat ofs_demo
    BEGIN   {OFS = "[TAB]"}
            {
            if ($1 ~ /ply/) $1 = "plymouth"
            if ($1 ~ /chev/) $1 = "chevrolet"
            print
            }

    $ awk -f ofs_demo cars
    plymouth        fury    77      73      2500
    chevrolet       nova    79      60      3000
    ford    mustang 65      45      10000
    volvo   gl      78      102     9850
    ford    ltd     83      15      10500
    chevrolet       nova    80      50      3500
    fiat    600     65      115     450
    honda   accord  81      30      6000
    ford    thundbd 84      10      17000
    toyota  tercel  82      180     750
    chevrolet       impala  65      85      1550
    ford    bronco  83      25      9500

You can use Printf to refine the output format (refer to the "Printf" section above). The following example uses a backslash at the end of a program line to mask the following [NEWLINE] from awk. You can use this technique to continue a long line over one or more lines without affecting the outcome of the program.

    $ cat printf_demo
    BEGIN   {
            print "                         Miles"
            print "Make       Model    Year (000)     Price"
            print \
            "----------------------------------------"
            }
            {
            if ($1 ~ /ply/) $1 = "plymouth"
            if ($1 ~ /chev/) $1 = "chevrolet"
            printf "%-10s %-8s 19%2d %5d$ %8.2f\n",\
                    $1, $2, $3, $4, $5
            }

    $ awk -f printf_demo cars
                             Miles
    Make       Model    Year (000)     Price
    ----------------------------------------
    plymouth   fury     1977    73$  2500.00
    chevrolet  nova     1979    60$  3000.00
    ford       mustang  1965    45$ 10000.00
    volvo      gl       1978   102$  9850.00
    ford       ltd      1983    15$ 10500.00
    chevrolet  nova     1980    50$  3500.00
    fiat       600      1965   115$   450.00
    honda      accord   1981    30$  6000.00
    ford       thundbd  1984    10$ 17000.00
    toyota     tercel   1982   180$   750.00
    chevrolet  impala   1965    85$  1550.00
    ford       bronco   1983    25$  9500.00

The next example creates two new files, one with all the lines that contain chevy and the other with lines containing ford.

    $ cat redirect-out
    /chevy/ {print > "chevfile"}
    /ford/  {print > "fordfile"}
    END     {print "done."}

    $ awk -f redirect-out cars
    done.

    $ cat chevfile
    chevy   nova    79      60      3000
    chevy   nova    80      50      3500
    chevy   impala  65      85      1550

The summary program produces a summary report on all cars and newer cars. The first two lines of declarations are not required; awk automatically declares and initializes variables as you use them. After awk reads all the input data, it computes and displays averages.

    $ cat summary
    BEGIN   {
            yearsum = 0 ; costsum = 0
            newcostsum = 0 ; newcount = 0
            }
            {
            yearsum += $3
            costsum += $5
            }
    $3 > 80 {newcostsum += $5 ; newcount ++}
    END     {
            printf "Average age of cars is %3.1f years\n",\
                    90 - (yearsum/NR)
            printf "Average cost of cars is $%7.2f\n",\
                    costsum/NR
            printf "Average cost of newer cars is $%7.2f\n",\
                    newcostsum/newcount
            }

    $ awk -f summary cars
    Average age of cars is 13.2 years
    Average cost of cars is $6216.67
    Average cost of newer cars is $8750.00

Following, grep shows the format of a line from the passwd file that the next example uses.

    $ grep 'mark' /etc/passwd
    mark:4zvDGYGEbYHJg:107:ext 112:/home/mark:/bin/csh

The next example demonstrates a technique for finding the largest number in a field. Because it works with the passwd file, which delimits fields with colons (:), it changes the input field separator (FS) before reading any data. (Alternatively, the -F option could be used on the command line to change the input field separator.) This example reads the passwd file and determines the next available user ID number (field 3). The numbers do not have to be in order in the passwd file for this program to work.

The pattern causes awk to select records that contain a user ID number greater than any previous user ID number that it has processed. Each time it selects a record, it assigns the value of the new user ID number to the saveit variable. Then awk uses the new value of saveit to test the user ID of all subsequent records. Finally, awk adds 1 to the value of saveit and displays the result.

    $ cat find_uid
    BEGIN           {FS = ":"
                    saveit = 0}
    $3 > saveit     {saveit = $3}
    END             {print "Next available UID is " saveit + 1}

    $ awk -f find_uid /etc/passwd
    Next available UID is 192
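The -F alternative mentioned above might look like the following sketch. The passwd_sample helper and its four entries are hypothetical stand-ins for a real passwd file; only the position of the user ID in field 3 matters here. Adding 0 to $3 is a precaution of this sketch, not part of the original program: it forces a numeric comparison even if an awk implementation were to treat the saved value as a string.

```shell
# Hypothetical helper: emits made-up, colon-delimited passwd-style entries.
passwd_sample() {
    printf '%s\n' \
        'root:x:0:0:root:/root:/bin/sh' \
        'mark:x:107:100:mark:/home/mark:/bin/csh' \
        'alex:x:191:100:alex:/home/alex:/bin/sh' \
        'jenny:x:42:100:jenny:/home/jenny:/bin/sh'
}

# -F: sets the input field separator on the command line, so the program
# needs no BEGIN action to assign FS.
out=$(passwd_sample | awk -F: '
$3 + 0 > saveit { saveit = $3 + 0 }
END             { print "Next available UID is " saveit + 1 }
')
printf '%s\n' "$out"
```

With this sample input the largest user ID is 191, so the sketch reports 192, even though the entries are not in numeric order.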

The next example shows another report based on the cars file. This report uses nested If Else statements to substitute values based on the contents of the price field. The program has no pattern part; it processes every record.

    $ cat price_range
            {
            if ($5 <= 5000) $5 = "inexpensive"
                else if ($5 > 5000 && $5 < 10000) $5 = "please ask"
                    else if ($5 >= 10000) $5 = "expensive"
            printf "%-10s %-8s 19%2d %5d %-12s\n",\
                    $1, $2, $3, $4, $5
            }

    $ awk -f price_range cars
    plym       fury     1977    73 inexpensive
    chevy      nova     1979    60 inexpensive
    ford       mustang  1965    45 expensive
    volvo      gl       1978   102 please ask
    ford       ltd      1983    15 expensive
    chevy      nova     1980    50 inexpensive
    fiat       600      1965   115 inexpensive
    honda      accord   1981    30 please ask
    ford       thundbd  1984    10 expensive
    toyota     tercel   1982   180 inexpensive
    chevy      impala   1965    85 inexpensive
    ford       bronco   1983    25 please ask

Problem 1) Find the number of annotated genes on each strand of the E. coli genome sequence.

Problem 2) Find the number of putatively identified, hypothetical, and unknown genes in the E. coli genome sequence.
