Advanced Text Processing

Lecture Overview
- Character manipulation commands: cut, paste, tr
- Line manipulation commands: sort, uniq, diff
- Regular expressions and grep
- Text replacement using sed

Cutting Lines – cut
- The cut command extracts sections from each line of the input file:
    cut options [files]
- Command line options for cut:
    -c – output only these characters
    -f – output only these fields
    -d – use this character as the field delimiter

Cutting Lines – cut
- With cut, exactly one kind of selection option (such as -c or -f) must be specified
- The value given with -c or -f can be:
    A number – specifies a single character position
    A range – specifies a sequence of positions
    A comma-separated list – specifies multiple positions or ranges

cut – Examples
- Given a file called 'my_phones.txt':
    ADAMS, Andrew 7583
    BARRETT, Bruce 6466
    BAYES, Ryan 6585
    BECK, Bill 6346
    BENNETT, Peter 7456
    GRAHAM, Linda 6141
    HARMER, Peter 7484
    MAKORTOFF, Peter 7328
    MEASDAY, David 6494
    NAKAMURA, Satoshi 6453
    REEVE, Shirley 7391
    ROSNER, David 6830

cut – Examples
    head -3 my_phones.txt | cut -c3-16
    AMS, Andrew 75
    RRETT, Bruce 6
    YES, Ryan 6585

    head -3 my_phones.txt | cut -d" " -f2
    Andrew
    Bruce
    Ryan

    head -3 my_phones.txt | cut -c1-3,10,12,15-18
    ADAde7583
    BARBu 646
    BAYa 85
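The cut examples above can be reproduced with a small sketch like the following. It assumes a POSIX shell; the temporary file and the variable names are made up for illustration, and the three data lines are copied from my_phones.txt.

```shell
# Sample data copied from the slides; the temp file exists only for this sketch.
phones=$(mktemp)
printf '%s\n' 'ADAMS, Andrew 7583' 'BARRETT, Bruce 6466' 'BAYES, Ryan 6585' > "$phones"

# Characters 3 through 16 of every line.
by_chars=$(cut -c3-16 "$phones")

# Second space-delimited field (the first name).
first_names=$(cut -d' ' -f2 "$phones")

# A comma-separated list selects several fields at once (fields 1 and 3).
last_and_number=$(cut -d' ' -f1,3 "$phones")

printf '%s\n' "$by_chars" "$first_names" "$last_and_number"
rm -f "$phones"
```

Note that with -f, cut joins the selected fields using the same delimiter that was given with -d.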

Merging Files – paste
- The paste command merges multiple files by concatenating corresponding lines:
    paste [options] [files]
- Command line options for paste:
    -d – provide a list of separator characters
    -s – paste one file at a time instead of in parallel (each file becomes a single line)

paste – Examples
- Assume that we are given 3 input files, each with one entry per line:
    first.txt – first names (Andrew, Bruce, Ryan, Bill, Peter, Linda, ...)
    last.txt – last names (ADAMS, BARRETT, BAYES, BECK, BENNETT, GRAHAM, HARMER, MAKORTOFF, MEASDAY, NAKAMURA, ...)
    num.txt – phone numbers (7583, 6466, 6585, ...)

paste – Examples (output fields are separated by tabs unless -d is given)
    paste first.txt last.txt num.txt | head -3
    Andrew  ADAMS   7583
    Bruce   BARRETT 6466
    Ryan    BAYES   6585

    paste -d" :" first.txt last.txt num.txt | head -3
    Andrew ADAMS:7583
    Bruce BARRETT:6466
    Ryan BAYES:6585

    paste -s last.txt first.txt num.txt | cut -f1-5,10
    ADAMS   BARRETT BAYES   BECK    BENNETT NAKAMURA
    Andrew  Bruce   Ryan    Bill    Peter   Satoshi
    7583    6466    6585    6346    7456    6453
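A minimal sketch of parallel and serial pasting, assuming a POSIX shell. The three-line files below are shortened stand-ins for first.txt, last.txt and num.txt, created in a temporary directory only for this illustration.

```shell
dir=$(mktemp -d)
printf '%s\n' Andrew Bruce Ryan > "$dir/first.txt"
printf '%s\n' ADAMS BARRETT BAYES > "$dir/last.txt"
printf '%s\n' 7583 6466 6585 > "$dir/num.txt"

# Parallel merge: a space goes after column 1 and a colon after column 2.
merged=$(paste -d' :' "$dir/first.txt" "$dir/last.txt" "$dir/num.txt")

# Serial merge: each input file becomes one tab-separated output line.
serial=$(paste -s "$dir/last.txt" "$dir/first.txt")

printf '%s\n' "$merged" "$serial"
rm -rf "$dir"
```

The separator list given with -d is used cyclically, one character per column boundary.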

Translating Characters – tr
- The tr command is used to translate between one character set and another:
    tr [options] set1 [set2]
- Input is read from standard input and written to standard output (no files)
- With no options, tr accepts two character sets with equal lengths, and replaces each character with the corresponding one

Deleting or Squeezing Characters – tr
- Sets contain literal characters, or character ranges, such as: 'a-z' or 'DEFa-z'
- With command line options, tr can also be used to delete or squeeze characters
- Command line options for tr:
    -d – delete characters in set1
    -s – replace each run of a repeated character from set1 with a single occurrence

Defining Sets for tr
- tr has some interpreted sequences to simplify the definition of sets:
    [:alpha:] – all letters
    [:digit:] – all digits
    [:alnum:] – all letters and digits
    [:space:] – all whitespace
    [:punct:] – all punctuation characters
    [CHAR*REPEAT] – REPEAT copies of CHAR (in set2)
    [CHAR*] – in set2, copies of CHAR until the length of set1 is reached

tr – Examples
- Change lower case to capital, and replace the digits 6, 7, 8 with the letters x, y, z:
    head -3 padded_phones.txt
    ADAMS      Andrew     7583
    BARRETT    Bruce      6466
    BAYES      Ryan       6585

    head -3 padded_phones.txt | tr 'a-z678' 'A-Zxyz'
    ADAMS      ANDREW     y5z3
    BARRETT    BRUCE      x4xx
    BAYES      RYAN       x5z5

tr – Examples
- Squeeze sequences of spaces into one:
    head -3 padded_phones.txt | tr -s " "
    ADAMS Andrew 7583
    BARRETT Bruce 6466
    BAYES Ryan 6585
- Delete spaces, and digits 7 and 8:
    head -3 padded_phones.txt | tr -d " 78"
    ADAMSAndrew53
    BARRETTBruce6466
    BAYESRyan655
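The three uses of tr shown above (translate, squeeze, delete) can be tried on a single line, as in this sketch. It assumes a POSIX shell; the amount of padding in the sample line is an assumption, since the exact layout of padded_phones.txt is not given.

```shell
# One padded line; the number of spaces is made up for this sketch.
line='ADAMS      Andrew     7583'

# Translate: lower case to upper case (set1 and set2 have equal length).
upper=$(printf '%s\n' "$line" | tr 'a-z' 'A-Z')

# Squeeze: collapse every run of spaces into a single space.
squeezed=$(printf '%s\n' "$line" | tr -s ' ')

# Delete: remove spaces and the digits 7 and 8.
deleted=$(printf '%s\n' "$line" | tr -d ' 78')

printf '%s\n' "$upper" "$squeezed" "$deleted"
```

Because tr only reads standard input, the line is fed through a pipe rather than passed as a file argument.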

Reading from Standard Input
- Many UNIX commands accept one or more input files listed in the command line (tr is one of the few that don't)
- If no input file is given, these commands will read from the standard input
- Alternately, if the file list contains a '-', the standard input will be inserted in its place

Standard Input – Example
    cat last.txt | tr "A-Z" "a-z" | \
      paste -d"_" first.txt - number.txt | head -10
    Andrew_adams_7583
    Imelda_aguilar_6518
    Daniel_albers_7540
    Pierre_amaudruz_7567
    Friedhelm_ames_7581
    Willy_andersson_6238
    Andrei_andreyev_6491
    Jonathan_aoki_6820
    Donald_arseneau_6295
    Danny_ashery_6188
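A small sketch of the '-' placeholder, assuming a POSIX shell. The two-line files are shortened stand-ins created in a temporary directory for illustration only.

```shell
dir=$(mktemp -d)
printf '%s\n' Andrew Bruce > "$dir/first.txt"
printf '%s\n' ADAMS BARRETT > "$dir/last.txt"
printf '%s\n' 7583 6466 > "$dir/number.txt"

# The '-' tells paste to take its second column from standard input,
# which here is the lower-cased content of last.txt.
joined=$(tr 'A-Z' 'a-z' < "$dir/last.txt" |
         paste -d'_' "$dir/first.txt" - "$dir/number.txt")

printf '%s\n' "$joined"
rm -rf "$dir"
```

Redirecting with < feeds the file to tr directly, which has the same effect as the cat pipeline on the slide.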


Sorting Files – sort
- The sort command reorders the lines in a file (or files), and sends the result to the standard output:
    sort [options] [files]
- Command line options for sort:
    -f – ignore case (fold lowercase to uppercase)
    -r – sort in reverse order
    -n – sort in numeric order

Sorting Files – sort
- With no options given, the input is sorted in the collating order of the current locale (plain ASCII code order in the C locale)
- The sort command has many more options for selecting which fields to sort by, and for changing the way input is treated
- As always, you should read the man pages for the full details

sort – Example: Using Ignore-Case

    Input     sort      sort -f
    Bruce     Andrew    Andrew
    Ryan      Bruce     bill
    peter     Ryan      Bruce
    Andrew    bill      peter
    bill      peter     Ryan

sort – Example: Sorting Numbers
    [Example: the same list of numbers sorted with sort and with sort -n]
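The difference between the default ordering and -n can be sketched as follows, assuming a POSIX shell. The four numbers are made-up sample values, not the data from the original slide, and LC_ALL=C is set so that the default ordering is plain ASCII order.

```shell
# Made-up sample values for this sketch.
numbers=$(printf '%s\n' 10 9 100 25)

# Default ordering compares character by character, so "100" sorts before "25".
lexical=$(printf '%s\n' "$numbers" | LC_ALL=C sort)

# -n compares the lines by numeric value.
numeric=$(printf '%s\n' "$numbers" | sort -n)

# -n and -r combined: largest number first.
reverse=$(printf '%s\n' "$numbers" | sort -nr)

printf '%s\n' "$lexical" "$numeric" "$reverse"
```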

Removing Duplicate Lines – uniq
- The uniq command removes adjacent duplicate lines from its input file
    If the input is sorted, this removes all duplicate lines
- Command line options for uniq:
    -i – ignore case
    -c – prefix lines by the number of occurrences
    -d – only print duplicate lines
    -u – only print unique lines

uniq – Example

    Input     uniq      uniq -c
    Andrew    Andrew    1 Andrew
    Bill      Bill      1 Bill
    David     David     2 David
    David     Peter     3 Peter
    Peter     Ryan      1 Ryan
    Peter
    Peter
    Ryan

uniq – Example

    Input     uniq -u   uniq -d
    Andrew    Andrew    David
    Bill      Bill      Peter
    David     Ryan
    David
    Peter
    Peter
    Peter
    Ryan
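The uniq options can be compared side by side with a sketch like this, assuming a POSIX shell. The input list is the one implied by the counts in the example above; the sed step only strips the leading padding that uniq -c prints, so the result is easier to compare.

```shell
names=$(printf '%s\n' Andrew Bill David David Peter Peter Peter Ryan)

# Count occurrences of each adjacent group; strip the leading padding.
counted=$(printf '%s\n' "$names" | uniq -c | sed 's/^ *//')

# Only lines that are repeated (printed once per group).
only_dups=$(printf '%s\n' "$names" | uniq -d)

# Only lines that are never repeated.
only_unique=$(printf '%s\n' "$names" | uniq -u)

printf '%s\n' "$counted" "$only_dups" "$only_unique"
```

If the input were not already grouped, it would need to go through sort first, because uniq only compares adjacent lines.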

Example – File Processing Using Pipes
- Task – go over the book "War and Peace" and count the appearances of each word
- Step 1: remove all punctuation marks
    cat war_and_peace.txt | tr -d '[:punct:]'
- Step 2: put each word in a separate line
    cat war_and_peace.txt | tr -d '[:punct:]' | tr " " "\n"
- Step 3: sort words
    cat war_and_peace.txt | tr -d '[:punct:]' | tr " " "\n" | sort

Example – File Processing Using Pipes
- Step 4: count appearances of each word
    cat war_and_peace.txt | tr -d '[:punct:]' | tr " " "\n" | sort | uniq -c
- Step 5: sort result by number of appearances
    cat war_and_peace.txt | tr -d '[:punct:]' | tr " " "\n" | sort | uniq -c | sort -nr
- Step 6: write output to file
    cat war_and_peace.txt | tr -d '[:punct:]' | tr " " "\n" | sort | uniq -c | sort -nr > words.txt
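The six steps can be tried on a short string instead of a whole book. This is a sketch assuming a POSIX shell; the sample sentence is made up, and the final sed and head steps are additions that only tidy the output for inspection.

```shell
# Made-up sample text standing in for war_and_peace.txt.
text='the cat saw the dog. the dog saw the cat, the end!'

counts=$(printf '%s\n' "$text" |
    tr -d '[:punct:]' |    # step 1: remove punctuation
    tr ' ' '\n' |          # step 2: one word per line
    sort |                 # step 3: group identical words together
    uniq -c |              # step 4: count each word
    sort -nr |             # step 5: most frequent first
    sed 's/^ *//')         # strip the padding added by uniq -c

# The most frequent word and its count.
top=$(printf '%s\n' "$counts" | head -n 1)

printf '%s\n' "$counts"
```

Here "the" appears five times, so the first line of the result is "5 the"; words with equal counts may appear in either order depending on the locale.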

Comparing Text Files – diff
- The diff command takes two input files, and compares them
- The output contains only the different lines, with their line numbers
- Command line options for diff:
    -i – ignore case
    -b – ignore changes in amount of white space
    -B – ignore insertion or deletion of blank lines

diff – Examples
- First file:
    ADAMS Andrew 7583
    BARRETT Bruce 6466
    BAYES Ryan 6585
    BECK Bill 6346
    BENNETT Peter 7456
- Second file (the BAYES line differs only in the amount of white space):
    ADAMS Andrew 7583
    BARRETT Bruce 3333
    BAYES Ryan   6585
    BECK Bill 6346
    Bennett peter 7456
- Output of diff on the two files:
    2,3c2,3
    < BARRETT Bruce 6466
    < BAYES Ryan 6585
    ---
    > BARRETT Bruce 3333
    > BAYES Ryan   6585
    5c5
    < BENNETT Peter 7456
    ---
    > Bennett peter 7456

diff – Examples
- With the two files from the previous example, diff -b ignores the white space difference on line 3:
    2c2
    < BARRETT Bruce 6466
    ---
    > BARRETT Bruce 3333
    5c5
    < BENNETT Peter 7456
    ---
    > Bennett peter 7456
- diff -bi also ignores the case difference on line 5:
    2c2
    < BARRETT Bruce 6466
    ---
    > BARRETT Bruce 3333

Maintaining Output Consistency
- During program development, assume that we have reached the correct output
- We want to verify that it does not change
- Create reference output file:
    prog > prog.out
- After changing the program, compare output (the '-' makes diff read the first file from standard input):
    prog | diff - prog.out
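The check can be automated by testing the exit status of diff, which is 0 when the inputs are identical and 1 when they differ. The sketch below assumes a POSIX shell; the shell function prog is a hypothetical stand-in for the real program, and the status variables are made up for illustration.

```shell
dir=$(mktemp -d)

# Hypothetical stand-in for the program under development.
prog() { printf '%s\n' 'result: 42'; }

# Create the reference output once the output is known to be correct.
prog > "$dir/prog.out"

# Compare a fresh run against the reference; diff reads the run from stdin.
if prog | diff - "$dir/prog.out" > /dev/null; then
    status_before=unchanged
else
    status_before=changed
fi

# Simulate a change in the program that alters its output.
prog() { printf '%s\n' 'result: 43'; }

if prog | diff - "$dir/prog.out" > /dev/null; then
    status_after=unchanged
else
    status_after=changed
fi

printf '%s\n' "$status_before" "$status_after"
rm -rf "$dir"
```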


Searching For Matching Patterns – grep
- The grep command searches files for patterns, and prints matching lines:
    grep [options] regexp [files]
- The mandatory regexp argument defines a regular expression
- A regular expression is a formula for matching strings that follow some pattern

Searching For Matching Patterns – grep
- The simplest regular expression is just a sequence of characters
- This regular expression matches only a single string – itself
- The following command prints all lines that contain word, from any of the files:
    grep word files

Searching For Matching Patterns – grep
- The power of grep lies in using more sophisticated regular expressions
- Command line options for grep:
    -v – print all lines that don't match
    -c – print only a count of matched lines
    -n – print line numbers
    -h – don't print file names (for multiple files)
    -l – print file name but not matching line
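A short sketch of these grep options, assuming a POSIX shell. The file bugs.txt is recreated in a temporary directory with a few lines from the later examples, and the pattern 'ba' is chosen only for illustration.

```shell
dir=$(mktemp -d)
printf '%s\n' 'big boy' 'bad bug' 'bag' 'better' > "$dir/bugs.txt"

matching=$(grep 'ba' "$dir/bugs.txt")       # lines containing "ba"
inverted=$(grep -v 'ba' "$dir/bugs.txt")    # lines that do not match
count=$(grep -c 'ba' "$dir/bugs.txt")       # number of matching lines
numbered=$(grep -n 'ba' "$dir/bugs.txt")    # matches prefixed by line number
names=$(cd "$dir" && grep -l 'ba' bugs.txt) # only the names of matching files

printf '%s\n' "$matching" "$inverted" "$count" "$numbered" "$names"
rm -rf "$dir"
```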

Regular Expressions
- Regular expressions are a powerful tool for searching and selecting text
- They reached most users through UNIX tools such as ed and grep (and originate further back in automata theory)
- They have since been copied into many other tools and languages such as awk, sed, perl and Java

Regular Expressions vs. Filename Expansion
- Note that regular expressions are different from filename expansion
- Filename expansion uses some regular expression concepts and symbols, but:
    Filename expansion is done by the shell
    Regular expressions are passed as arguments to specific commands or utilities

Matching a Single Character
- A period (.) matches any single character
- For example:

    Regular Expression   Matches               Doesn't Match
    b.g                  bag, debug, bigger    brag, bg, bad
    U..X                 UNIX                  unix
    .                    a, b, c               (an empty line)

Matching a Character Class
- Square brackets ([]) match any single character within the brackets
- If the first character following the left bracket is a '^', the expression matches any character not in the brackets
- A '-' can be used to indicate a range, such as: [a-z]

Matching a Character Class

    Regular Expression   Matches                   Doesn't Match
    [Bb]ill              Bill, bill, got billed    Dill, ill, kill
    t[aeiou].k           talk, stack, stink        track, take
    number [^0-5]        number xxx, number 8:     number 59

Matching a Character Class
- The same predefined character classes used for tr can also be used here
- For portability reasons, [:alpha:] is always preferable to [A-Za-z]
- Note: the brackets are part of the symbolic names, and must be included in addition to the enclosing brackets, i.e. [[:alpha:]]

Matching Repetitions
- An asterisk (*) represents zero or more matches of the regular expression it follows

    Regular Expression   Matches                    Doesn't Match
    ab*c                 ac, abc, aaabbbc           ab, bc
    t.*ing               thing, string, thinking    king

Matching Special Characters
- Sometimes we want to literally match a character that has a special meaning, such as '*' or '['
- There are two ways to do that:
    Precede the character with a '\'
    Use square brackets – characters such as '*' and '.' are taken literally inside them

Matching Special Characters

    Regular Expression   Matches                   Doesn't Match
    a\.c                 a.c                       abc
    \.\.\.*              the end..., more.....     abc, stop.
    [*.]                 * start *, Sys.print      Hello world, abc
    C:\\bin              C:\bin                    C:\\bin
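Both ways of matching a special character literally can be checked with grep, as in this sketch. It assumes a POSIX shell; the sample lines are taken from the table above, and the variable names are made up.

```shell
samples=$(printf '%s\n' 'a.c' 'abc' 'Sys.print' 'Hello world' '* start *')

# Unescaped, the period matches any character, so both a.c and abc match.
any_char=$(printf '%s\n' "$samples" | grep 'a.c')

# Escaped with a backslash, the period only matches a literal period.
escaped=$(printf '%s\n' "$samples" | grep 'a\.c')

# Inside square brackets, * and . stand for themselves.
bracketed=$(printf '%s\n' "$samples" | grep '[*.]')

printf '%s\n' "$any_char" "$escaped" "$bracketed"
```

The single quotes keep the shell from interpreting the backslash and the asterisk before grep sees them.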

Matching the Beginning or the End of a Line
- A regular expression that begins with a caret (^) can match a string only at the beginning of a line
- Similarly, a regular expression that ends with a dollar sign ($) can match a string only at the end of a line

Matching the Beginning or the End of a Line

    Regular Expression   Matches                    Doesn't Match
    ^T                   This line, That bug        START, My Tag
    ^num.*[0-9]$         num5, num99, number 1      my num1, the number 6, num 6a
    ^t.*k$               talk, track, tk            stack, take
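The anchored patterns from the table can be tested line by line with grep. This is a sketch assuming a POSIX shell, with sample lines taken from the table and made-up variable names.

```shell
lines=$(printf '%s\n' 'talk' 'stack' 'take' 'num5' 'my num1' 'num 6a')

# Must start with t and end with k.
t_to_k=$(printf '%s\n' "$lines" | grep '^t.*k$')

# Must start with "num" and end with a digit.
num_digit=$(printf '%s\n' "$lines" | grep '^num.*[0-9]$')

# Without the anchors, "num" followed later by a digit may appear anywhere in the line.
unanchored=$(printf '%s\n' "$lines" | grep 'num.*[0-9]')

printf '%s\n' "$t_to_k" "$num_digit" "$unanchored"
```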

Using Regular Expressions with grep – Examples
    cat bugs.txt
    big boy
    bad bug
    bag
    bigger bag
    better
    boogie nights

    grep 'b.g' bugs.txt
    big boy
    bad bug
    bag
    bigger bag

    grep 'b.g.' bugs.txt
    big boy
    bigger bag

    grep 'b.*g.' bugs.txt
    big boy
    bigger bag
    boogie nights

Using Regular Expressions with grep – Examples
    cat f.txt
    ADAMS,
    Andrew
    7583
    BARRETT,
    Bruce
    6466
    BAYES,
    Ryan
    6585

    grep '[[:alpha:]],' f.txt
    ADAMS,
    BARRETT,
    BAYES,

    grep '^[C-Z][[:lower:]]*$' f.txt
    Ryan

    grep '^[^[:alpha:]0-3]*$' f.txt
    6466
    6585

Pipes and Regular Expressions – Example
- Task: create a file containing the names of all source files in the current directory, sorted by the number of lines in each file
- Step 1: count lines in each file
    wc -l *
- Step 2: leave only '.c' and '.h' files
    wc -l * | grep '\.[ch]$'
- Step 3: sort in reverse order (largest first)
    wc -l * | grep '\.[ch]$' | sort -nr

Pipes and Regular Expressions – Example
- Step 4: squeeze leading spaces (into one)
    wc -l * | grep '\.[ch]$' | sort -nr | tr -s " "
- Step 5: remove number field
    wc -l * | grep '\.[ch]$' | sort -nr | tr -s " " | cut -d" " -f3
- Step 6: write output to file
    wc -l * | grep '\.[ch]$' | sort -nr | tr -s " " | cut -d" " -f3 > sorted_source_files.txt

Which grep to Use?
- In addition to grep itself, there are two more variants of it: egrep and fgrep (equivalent to grep -E and grep -F)
    Use grep for most standard text finding tasks
    Use egrep for complex tasks, where basic regular expressions are just not enough, and you need to use extended regular expressions
    Use fgrep when only fixed strings are searched, and speed is of the essence

Extended Regular Expressions – egrep
- Extended regular expressions support all basic regular expression syntax, plus some additional special characters:
    + – similar to '*', but at least one appearance
    ? – similar to '*', but zero or one appearances
    () – grouping
    a|b – the OR operator – matches either regular expression a or regular expression b

Extended Regular Expressions – egrep

    Regular Expression   Matches           Doesn't Match
    num6+                num666, num654    num566, number
    num6?5               num65, num555     num6, num665
    Barret|Bennet        Barret, Bennet
    B(arr|enn)et         Barret, Bennet
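The extended operators can be tried with grep -E, which accepts the same extended syntax as egrep. This sketch assumes a POSIX shell; most sample words come from the table above, and "Burnet" is a made-up non-matching name added for contrast.

```shell
words=$(printf '%s\n' num666 num566 num65 num555 num6 Barret Bennet Burnet)

# + : "num" followed by one or more 6s.
plus=$(printf '%s\n' "$words" | grep -E 'num6+')

# ? : "num", an optional 6, then a 5.
optional=$(printf '%s\n' "$words" | grep -E 'num6?5')

# () and | : either "arr" or "enn" between "B" and "et".
grouped=$(printf '%s\n' "$words" | grep -E 'B(arr|enn)et')

printf '%s\n' "$plus" "$optional" "$grouped"
```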


Stream Editor – sed
- sed is a scripted editor for text streams, which supports basic regular expressions
- It performs transformations on an input stream, based on simple instructions
- sed has many commands, but the most commonly used is the substitute command:
    sed 's/pattern/replacement/[g]' [file]

Stream Editor – sed
- pattern is any basic regular expression
- replacement is a string that will replace one or more matches of pattern
- The optional g flag defines whether the operation is global – without it only the first match in every line is replaced
- The special character '&' can be used inside replacement to refer to the matched text

Using Regular Expressions with sed – Examples
    cat bugs.txt
    big boy
    bad bug
    bag
    bigger bag
    better

    sed 's/b.g/XXX/' bugs.txt
    XXX boy
    bad XXX
    XXX
    XXXger bag
    better

    sed 's/b.g/XXX/g' bugs.txt
    XXX boy
    bad XXX
    XXX
    XXXger XXX
    better
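The effect of the g flag and of '&' can be seen on a single line, as in this sketch. It assumes a POSIX shell; the line comes from bugs.txt, and the square brackets in the last replacement are an arbitrary choice of decoration.

```shell
line='bigger bag'

# Without g, only the first match on the line is replaced.
first_only=$(printf '%s\n' "$line" | sed 's/b.g/XXX/')

# With g, every match on the line is replaced.
all_matches=$(printf '%s\n' "$line" | sed 's/b.g/XXX/g')

# & in the replacement stands for the text that was matched.
wrapped=$(printf '%s\n' "$line" | sed 's/b.g/[&]/g')

printf '%s\n' "$first_only" "$all_matches" "$wrapped"
```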

sed – Examples
    head -2 my_phones.txt
    ADAMS, Andrew 7583
    BARRETT, Bruce 6466

    head -2 my_phones.txt | sed 's/ [[:upper:]]/ /g'
    ADAMS, ndrew 7583
    BARRETT, ruce 6466

    head -2 my_phones.txt | sed 's/[[:digit:]]*$/###/g'
    ADAMS, Andrew ###
    BARRETT, Bruce ###

Matching and Reusing Portions of a Pattern in sed
- It is also possible to use portions of the matching pattern
- Within the pattern, portions should be enclosed between '\(' and '\)'
- In replacement, the special sequences '\1', '\2', etc. can be used to refer to the matched portions

Matching and Reusing Portions of a Pattern in sed – Examples
- Remove the first name from each line:
    head -2 my_phones.txt | sed 's/ [[:upper:]][[:lower:]]* / /'
    ADAMS, 7583
    BARRETT, 6466
- Replace first name with initial:
    head -2 my_phones.txt | sed 's/ \([[:upper:]]\)[[:lower:]]* / \1. /'
    ADAMS, A. 7583
    BARRETT, B. 6466

Matching and Reusing Portions of a Pattern in sed – Examples
- Switch between first and last names:
    head -2 my_phones.txt | sed 's/\(.*\), \(.*\) /\2 \1 /'
    Andrew ADAMS 7583
    Bruce BARRETT 6466
- Switch names and parenthesize number:
    head -2 my_phones.txt | sed 's/\(.*\), \(.*\) \(.*\)/\2 \1: (03-555\3)/'
    Andrew ADAMS: (03-5557583)
    Bruce BARRETT: (03-5556466)
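The grouping examples can be checked on one line of my_phones.txt, as sketched below. This assumes a POSIX shell; the variable names are made up for illustration.

```shell
line='ADAMS, Andrew 7583'

# \1 = last name, \2 = first name, \3 = number; print first name first.
swapped=$(printf '%s\n' "$line" | sed 's/\(.*\), \(.*\) \(.*\)/\2 \1 \3/')

# Keep only the first letter of the first name, captured as \1.
initial=$(printf '%s\n' "$line" | sed 's/ \([[:upper:]]\)[[:lower:]]* / \1. /')

# Reuse all three groups and add literal text around the number.
dialable=$(printf '%s\n' "$line" | sed 's/\(.*\), \(.*\) \(.*\)/\2 \1: (03-555\3)/')

printf '%s\n' "$swapped" "$initial" "$dialable"
```

Groups are numbered by the position of their opening '\(' in the pattern, counting from the left.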