2.1.2.4.2 – Command-Line Data Analysis and Reporting – Rediscovering the Prompt (9/4/2015)


Command-Line Data Analysis and Reporting - Session ii
· redirection
· more on sort
· join
· process substitution

Command Line Glue
· the pipe "|" sends the output of one process to another
  · STDOUT of one process becomes STDIN of the next
· the pipe is a composition operator
  · f | g applies function f, then function g
  · g(f(x)) or (g·f)(x)
· the pipe allows complex text processing from building blocks like sort, cut, uniq, etc.
  · each element in a pipe is simple and tractable and has a limited mandate
  · selecting/permuting elements and using command-line parameters at each step offers both flexibility and power
· the redirects ">" and "<" send stdin/stdout/stderr to/from a file
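The building-block idea can be sketched with a short pipeline. This is only an illustration: the word list is made-up sample data, and the stages are the sort and uniq filters named on the slide.

```shell
# Hypothetical sample data: one word per line.
words=$(printf '%s\n' pear apple pear fig apple pear)

# Each stage has one small job: sort groups duplicates, uniq -c counts
# each run, sort -nr puts the largest count first, head keeps the top line.
top=$(printf '%s\n' "$words" | LC_ALL=C sort | uniq -c | sort -nr | head -n 1)

# Strip the leading padding that uniq -c adds, then print the result.
top=$(printf '%s\n' "$top" | sed 's/^ *//')
echo "$top"
```

Swapping one stage (for example `sort -n` instead of `sort -nr`) changes the report without touching the other stages.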

Redirection and Pipe Syntax

source              target        command
stdout              file          prog > file
stderr              file          prog 2> file
stdout and stderr   file          prog &> file
                                  prog > file 2>&1
stdout              end of file   prog >> file
stderr              end of file   prog 2>> file
stdout and stderr   end of file   prog &>> file
                                  prog >> file 2>&1
file                stdin         prog < file
stdout              process       prog | prog2
stdout and stderr   process       prog 2>&1 | prog2
file                file          prog < file > file2

UPT 43.1
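A few rows of the table can be tried out as follows. This is a minimal sketch: `prog` is a hypothetical stand-in function that writes one line to each stream, and the file names are arbitrary.

```shell
tmp=$(mktemp -d)

# Hypothetical stand-in for "prog": one line to stdout, one to stderr.
prog() { echo "to stdout"; echo "to stderr" >&2; }

prog  > "$tmp/out.txt" 2> "$tmp/err.txt"    # stdout and stderr to separate files
prog >> "$tmp/out.txt" 2>> "$tmp/err.txt"   # append a second copy of each
prog  > "$tmp/both.txt" 2>&1                # both streams into one file

out_lines=$(wc -l < "$tmp/out.txt" | tr -d ' ')
err_first=$(head -n 1 "$tmp/err.txt")
both_lines=$(wc -l < "$tmp/both.txt" | tr -d ' ')
echo "$out_lines / $err_first / $both_lines"
rm -rf "$tmp"
```

Only the portable forms are used here; `&>` and `&>>` are bash shorthands for the `> file 2>&1` and `>> file 2>&1` rows.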

Pipe vs Redirect
· don't confuse the pipe "|" with a redirect (">", "<", etc.)
  · a pipe sends the output of one process to another
  · a redirect uses the standard I/O facility to send data to/from a file
· don't use cat with a single argument; use a redirect

    # this is worse
    cat file.txt | prog
    # this is better
    prog < file.txt

    # dude, where's my script?
    prog1 > prog2
    # this is what you meant
    prog1 | prog2

File Descriptors
· every process is given three places to/from which information can be sent
· these places are called open files, and the kernel gives a file descriptor to each
  · fd 0 = standard input
  · fd 1 = standard output
  · fd 2 = standard error
· prog 2> file redirects standard error
  · [n]> redirects file descriptor n to a file
  · 1> is just the same as > (n=1 by default), and redirects standard output
· prog > file 2>&1 redirects both standard output and standard error
  · [n]>&[m] makes descriptor n point to the same place as descriptor m
  · here standard error is pointed to wherever standard output already points (the file)
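Because `[n]>&[m]` copies wherever descriptor m points at that moment, the order of redirections matters. The sketch below contrasts the two orders; `prog` is again a hypothetical two-stream function and the file names are arbitrary.

```shell
tmp=$(mktemp -d)
prog() { echo "out"; echo "err" >&2; }   # hypothetical two-stream command

# stdout -> file first, then stderr -> wherever stdout points (the file)
prog > "$tmp/a.txt" 2>&1

# stderr -> the current stdout (captured by $(...)), then stdout -> file
leaked=$(prog 2>&1 > "$tmp/b.txt")

a=$(cat "$tmp/a.txt")
b=$(cat "$tmp/b.txt")
rm -rf "$tmp"
```

In the second form only standard output lands in the file, and the error line ends up in `leaked`.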

File Descriptors (cont'd)
· BASH supports additional file descriptors (3, 4, ... up to ulimit -n)
· swapping standard output with standard error
  · how do you swap the standard output and error of a process?
  · prog 2>&1 1>&2
    · nope, this doesn't work because by the time bash gets to 1>&2, stderr already points to stdout
    · analogous to swapping variable values: you need a temporary variable to hold a value
  · prog 3>&2 2>&1 1>&3
    · this works (see table)

    redirect   stdin   stdout    stderr
    (start)    fd0     fd1       fd2
    3>&2       fd0     fd1       fd2 fd3
    2>&1       fd0     fd1 fd2   fd3
    1>&3       fd0     fd2       fd1 fd3

· more complicated
  · send stdout to a file and stderr to a process
  · prog 3>&1 > file 2>&3 | prog2

(UPT)
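The three-step swap can be checked with command substitution, which captures only fd 1. This is a sketch under assumptions: `prog` is a hypothetical function, and the trailing `3>&-` (closing the spare descriptor) is an addition that the slide does not show.

```shell
prog() { echo "normal output"; echo "error output" >&2; }  # hypothetical

# Without the swap, command substitution captures stdout only.
plain=$(prog 2>/dev/null)

# With the swap, the old stderr arrives on fd 1 and is captured, while the
# old stdout now sits on fd 2 and is discarded by the outer 2>/dev/null.
swapped=$( { prog 3>&2 2>&1 1>&3 3>&-; } 2>/dev/null )
echo "$plain | $swapped"
```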

Idioms From Last Time

head FILE                  first 10 lines in a file
tail FILE                  last 10 lines in a file
head -NUM FILE             first NUM lines in a file
tail -NUM FILE             last NUM lines in a file
head -NUM FILE | tail -1   NUMth line
wc -l FILE                 number of lines in a file
sort FILE                  sort lines asciibetically by first column
sort +COL FILE             sort lines asciibetically by COL column
sort -n FILE               sort lines numerically in ascending order
sort -nr FILE              sort lines numerically in descending order
sort +COL1 +COL2 FILE      sort lines in a file first by field COL1 then COL2
grep ^CHR FILE             report lines that start with character CHR (^ is the start-of-line anchor)
grep -v ^CHR FILE          lines that don't start with CHR
sed 's/REGEX/STRING/'      replace first match of REGEX on each line with STRING
sed 's/^ *//'              remove leading spaces
uniq -c FILE               report number of adjacent duplicate lines
cat -n FILE                prefix lines with their number
tr CHR1 CHR2 < FILE        replace all instances of CHR1 with CHR2 (tr reads only standard input)
tr ABCD 1234 < FILE        replace A->1, B->2, C->3, D->4
tr -d CHR1                 delete instances of CHR1
fold -w NUM                split a line into multiple lines every NUM characters
expand -t NUM FILE         replace each tab with NUM spaces
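Three of these idioms in action, as a small sketch on a made-up four-line file. It uses the `-n NUM` spelling of the head/tail count, which current implementations accept alongside the `-NUM` shorthand listed above.

```shell
tmp=$(mktemp -d)
printf '%s\n' alpha beta gamma delta > "$tmp/lines.txt"   # hypothetical file

third=$(head -n 3 "$tmp/lines.txt" | tail -n 1)      # 3rd line of the file
count=$(wc -l < "$tmp/lines.txt" | tr -d ' ')        # number of lines
upper=$(tr a-z A-Z < "$tmp/lines.txt" | head -n 1)   # tr takes its input from stdin
echo "$third $count $upper"
rm -rf "$tmp"
```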

More on Sort
· sort orders lines in a file based on values in a column or columns
  · forward or reverse (-r)
  · asciibetic or numerical (-n)
  · return all lines or only those with unique field values (-u)
· sort -u returns all unique values of a field, without counting the number of times each value appears

idioms

sort FILE               sort lines asciibetically by first column
sort +COL FILE          sort lines asciibetically by COL column
sort -n FILE            sort lines numerically in ascending order
sort -nr FILE           sort lines numerically in descending order
sort +COL1 +COL2 FILE   sort lines in a file first by field COL1 then COL2
sort -u                 sort, but return only first line of a run with the same field value

    # animals.txt
    # sheep
    # pig
    # sheep
    # horse
    # pig

    > sort -u animals.txt
    horse
    pig
    sheep

    > sort animals.txt | uniq -c
    1 horse
    2 pig
    2 sheep

(UPT)
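The difference between the two commands can be reproduced with the five animal lines listed on the slide. In this sketch the temporary path is arbitrary, and the `sed` step only strips the padding that `uniq -c` puts before each count.

```shell
tmp=$(mktemp -d)
printf '%s\n' sheep pig sheep horse pig > "$tmp/animals.txt"

# Unique values only, no counts.
unique=$(LC_ALL=C sort -u "$tmp/animals.txt")

# Unique values with the number of times each one appears.
counts=$(LC_ALL=C sort "$tmp/animals.txt" | uniq -c | sed 's/^ *//')

printf '%s\n---\n%s\n' "$unique" "$counts"
rm -rf "$tmp"
```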

sort's flags
· to tell sort which fields to sort by, specify the field start (n) and end (m) positions using +n -m
  · sort +0 -1
    · start sorting on field 0, stop sorting on field 1
    · i.e. sort by field 0 only
  · sort +0 -2
    · start sorting on field 0, stop sorting on field 2
    · i.e. sort by field 0 and 1
  · sort +0 -1 +2 -3
    · sort by field 0 and 2
· to mix sorting schemes, add "n" to the field number
  · sort +0 -1 +1n -2
    · sort field 0 asciibetically, but field 1 numerically
· to ask for reverse sort, add "r" to the field number
  · sort +0 -1 +1nr -2
    · sort field 0 asciibetically, but field 1 in reverse numerical order
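The `+n` style notation on this slide is the historical, zero-based key syntax; current sort implementations take keys through `-k` with one-based field numbers instead. The sketch below shows what are assumed to be the `-k` equivalents of the slide's orderings, on a made-up two-column file.

```shell
tmp=$(mktemp -d)
# Hypothetical two-column data: a letter and a number.
printf '%s\n' 'b 10' 'a 9' 'b 2' 'a 100' 'a 10' > "$tmp/nums.txt"

# Field 1 asciibetically, then field 2 numerically (ascending).
asc=$(LC_ALL=C sort -k1,1 -k2,2n "$tmp/nums.txt")

# Field 1 asciibetically, then field 2 numerically in reverse.
desc=$(LC_ALL=C sort -k1,1 -k2,2nr "$tmp/nums.txt")

# Fields 1 through 2 as one asciibetic key: "100" sorts before "9".
ascii=$(LC_ALL=C sort -k1,2 "$tmp/nums.txt")

printf '%s\n\n%s\n\n%s\n' "$asc" "$desc" "$ascii"
rm -rf "$tmp"
```

`LC_ALL=C` is set so that the asciibetic comparisons are byte-wise and do not depend on the local collation rules.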


sort (cont'd)
· the -u flag in sort is handy in identifying min/max lines associated with the same key
· each letter appears about 300 times
· what are the minimum and maximum values for each letter?
  · sort by character (asciibetic), then number (numerical)

    # nums.txt: 10,000 lines with a letter and a number
    b 741
    c 53
    s 511
    a 238
    i 9

    # minimum values for each letter
    sort +0 -1 +1n -2 nums.txt | sort -u -k 1,1
    # maximum values for each letter
    sort +0 -1 +1rn -2 nums.txt | sort -u -k 1,1
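A miniature version of the min/max idiom, written with `-k` keys. Two things here are assumptions rather than part of the slide: the seven-line data set is invented, and `-s` (stable sort, a GNU/BSD extension) is added so that the line kept by `-u` is the first of each run, since POSIX does not say which of the equal-key lines survives.

```shell
tmp=$(mktemp -d)
# Hypothetical miniature of nums.txt: a letter and a number per line.
printf '%s\n' 'b 741' 'c 53' 'a 238' 'b 7' 'a 985' 'c 600' 'a 14' > "$tmp/nums.txt"

# Order by letter, then by number; the second sort keeps one line per letter.
min=$(LC_ALL=C sort -k1,1 -k2,2n  "$tmp/nums.txt" | LC_ALL=C sort -s -u -k1,1)
max=$(LC_ALL=C sort -k1,1 -k2,2nr "$tmp/nums.txt" | LC_ALL=C sort -s -u -k1,1)

printf 'min:\n%s\nmax:\n%s\n' "$min" "$max"
rm -rf "$tmp"
```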

sort (cont'd)

    # first: number at the first appearance of a letter
    sort -u -k 1,1 nums.txt
    # max: maximum number for a letter (reverse numerical order, then keep the first line per letter)
    sort +0 -1 +1nr -2 nums.txt | sort -u -k 1,1
    # min: minimum number for a letter (numerical order, then keep the first line per letter)
    sort +0 -1 +1n -2 nums.txt | sort -u -k 1,1

    letter  first  max  min
    a       238    985
    b       741    993  2
    c       53     995  3
    d       168    996  5
    e       903    995  1
    f       424    999  0
    g       736    995  2
    h       720    999  0
    i       9      999  4
    j       99     991  0
    k       124    998  0
    l       305    983  1
    m       484    999  2
    n       837    997  0
    o       78     999  8
    p       329    999  2
    q       63     999  3
    r       910    987  1
    s       511    995  3
    t       431    998  3
    u       229    999  0
    v       976    995  0
    w       705    999  0
    x       671    998  6
    y       81     999  4
    z       913    999  3

What's the Deal with Zero Padding
· by default, sort acts asciibetically (alphanumeric)
  · 0 comes before 1: great
  · 1 comes before 11: great
  · 11 comes before 2: oops
  · the problem is caused by strings of different lengths
· sort permits sorting asciibetically on one field and numerically on another
  · sort +0 -1 +1n -2
    · field 1 ASCII, field 2 numerical
  · sort +0 -2
    · fields 1,2 ASCII
· by padding numerical fields with leading zeroes, asciibetic sorting becomes equivalent to numerical
  · 1, 2, 10, 11, 22
  · 01, 02, 10, 11, 22
· if you combine character and numerical fields in a report, consider zero-padding the numbers
· leading zeroes are easily removed with sed 's/\([^0-9]\)0\{1,\}\([0-9]\)/\1\2/g'
  · in sed's basic regular expressions a bare + is a literal plus sign, so the repeat is written \{1,\}
  · requiring a digit after the zeroes keeps a lone 0 intact
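The padding argument can be verified directly. In this sketch the number list is the one from the slide, padding is done with awk's printf, and the sed expression is a POSIX basic-regex variant chosen for this example (it uses `\{1,\}` because `+` is not a repeat operator in basic regular expressions).

```shell
nums=$(printf '%s\n' 1 2 10 11 22)

# Asciibetic vs numerical order of the raw values.
ascii=$(printf '%s\n' "$nums" | LC_ALL=C sort | paste -s -d ' ' -)
numeric=$(printf '%s\n' "$nums" | LC_ALL=C sort -n | paste -s -d ' ' -)

# Pad to two digits, then sort asciibetically: same order as numerical.
padded=$(printf '%s\n' "$nums" | awk '{printf("%02d\n", $1)}' | LC_ALL=C sort | paste -s -d ' ' -)

# Strip the padding again: zeroes that follow a non-digit and precede a digit.
stripped=$(echo 'a 0238 b 0009 c 0000' | sed 's/\([^0-9]\)0\{1,\}\([0-9]\)/\1\2/g')

printf '%s\n%s\n%s\n%s\n' "$ascii" "$numeric" "$padded" "$stripped"
```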

More on grep
· there are a number of variants of grep
  · egrep (grep -E) is extended grep, supporting extended regular expression patterns
  · fgrep (grep -F) interprets the pattern as a list of fixed strings, each of which can be matched
  · grep -P supports Perl-type regular expressions
  · agrep supports approximate matching
· the feature set of regular expressions is different for the greps, sed and perl
  · different RE engines (DFA, NFA), different functionality, different performance
  · perl has non-POSIX extensions to its RE engine

UPT 32.20
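How the same input is matched under basic, fixed-string and extended interpretation can be seen in a small sketch. The four input lines are made up for the illustration; `grep -P` and agrep are left out because they are not available everywhere.

```shell
# Hypothetical input lines.
text=$(printf '%s\n' 'a.c' 'abc' 'ac' 'a+c')

basic=$(printf '%s\n' "$text" | grep -c 'a.c')       # . matches any character
fixed=$(printf '%s\n' "$text" | grep -c -F 'a.c')    # literal string "a.c"
ext=$(printf '%s\n' "$text" | grep -c -E '^ab?c$')   # ? marks "b" as optional
echo "$basic $fixed $ext"
```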

agrep - Approximate grep
· text matching, with support for approximate matching
· a match error is one of: deletion, insertion, or substitution
  · the weight of each can be set with -D, -I and -S
· how many non-overlapping 7-mers from the first 1 Mb of chr7 match GATTACA
  · with no errors
  · with N errors (agrep supports N=1..8)

    cat chr7.fa | grep -v ">" | tr -d "\n" | fold -w 1000 | head | tr -d "\n" | fold -w 7 | grep -v N | tr atgc ATGC > 7mers.txt
    wc -l 7mers.txt

    agrep -c GATTACA 7mers.txt
    28
    agrep -c -1 GATTACA 7mers.txt
    318
    agrep -c -2 GATTACA 7mers.txt
    5464
    agrep -c -3 GATTACA 7mers.txt
    39442

agrep (cont'd)
· what are the most frequent/infrequent 7-mers matching GATTACA with one error?

    agrep -1 GATTACA 7mers.txt | grep -v GATTACA | sort | uniq -c | sort -nr | head
       ATTACAG
    19 GGATTAC
    13 GATCACA

    agrep -1 GATTACA 7mers.txt | grep -v GATTACA | sort | uniq -c | sort -nr | grep -w 1
    1 GATTAAT
    1 GATTAAG
    1 GATTAAC
    1 GATACAG
    1 GATACAC
    1 CGTTACA
    1 CGATTAC
    1 CGATACA

agrep (cont'd)
· agrep supports discovery of supersequences: strings that contain your query, but not necessarily in a contiguous stretch
  · 7-mers with 5 Gs
    · GGGGGTA, GGTGGGA, TGGAGGG, etc.
  · 7-mers with 3 Gs followed by a C then a T
    · GGAGCAT, AGGGCGT, GGGGCCT

    agrep -c -p GGGGG 7mers.txt
    4026
    agrep -c -p GGGCT 7mers.txt
    2341

join
· joins two files on lines with a common field
· join will not sort
  · lines must be either sorted or already in the corresponding order

    awk '{printf("%s %04d\n",$1,$2)}' max.txt
    awk '{printf("%s %04d\n",$1,$2)}' min.txt

    join min.txt max.txt
    a
    b
    c
    d
    e
    f
    g

join (cont'd)
· let's start with two files with some animal data
· unmatched lines are not reported

    # colors                 # sounds
    sheep white              sheep meeh
    pig pink                 pig oink
    dog brown                dog woof
    cat black                cat meow
    parrot green             parrot i_love_you
    canary yellow            canary chirp
    hippo grey               man hello
    zebra black_white        chicken pakawk

    join sounds.txt colors.txt
    sheep meeh white
    pig oink pink
    dog woof brown
    cat meow black
    parrot i_love_you green
    canary chirp yellow

join (cont'd)
· you can get a list of lines that didn't make it into the join
  · join -v 1 (unmatched lines of file 1) or join -v 2 (unmatched lines of file 2)
· you can select to join on different fields by
  · join -1 NUM1 -2 NUM2
  · will join based on field NUM1 in file 1 and NUM2 in file 2

    join -v 1 sounds.txt colors.txt
    man hello
    chicken pakawk

    join -v 2 sounds.txt colors.txt
    hippo grey
    zebra black_white
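The matched and unmatched cases can be tried on a shortened version of the animal files. This is a sketch with made-up subsets of the data; both inputs are sorted on the join field first, since join expects that.

```shell
tmp=$(mktemp -d)
printf '%s\n' 'sheep white' 'pig pink' 'dog brown' 'hippo grey' > "$tmp/colors.txt"
printf '%s\n' 'sheep meeh' 'pig oink' 'dog woof' 'man hello'   > "$tmp/sounds.txt"

# Sort both files on field 1 so join sees matching keys in the same order.
LC_ALL=C sort -k1,1 "$tmp/sounds.txt" > "$tmp/s.sorted"
LC_ALL=C sort -k1,1 "$tmp/colors.txt" > "$tmp/c.sorted"

both=$(LC_ALL=C join "$tmp/s.sorted" "$tmp/c.sorted")               # matched lines
only_sounds=$(LC_ALL=C join -v 1 "$tmp/s.sorted" "$tmp/c.sorted")   # unpaired, file 1
only_colors=$(LC_ALL=C join -v 2 "$tmp/s.sorted" "$tmp/c.sorted")   # unpaired, file 2

printf '%s\n--\n%s\n--\n%s\n' "$both" "$only_sounds" "$only_colors"
rm -rf "$tmp"
```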

Process Substitution
· sometimes (often) the files are not sorted and you need to sort them first

    sort sounds.txt > tmp.1
    sort colors.txt > tmp.2
    join tmp.1 tmp.2

· that's a lot of temporary files
· use process substitution
  · <(process) runs process, connects its output to a pipe, and substitutes a file name for that pipe (e.g. /dev/fd/63)

    join <(sort sounds.txt) <(sort colors.txt)

· let's sample some random lines (25%) and count the number of lines in the output
  · sample is a perl prompt tool (covered next time)

    > wc <(sample -r 0.25 colors.txt)
    /dev/fd/63
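The temp-file-free join can be sketched as below. Process substitution is a bash feature rather than POSIX sh, so this example calls bash explicitly in case the surrounding script runs under a plain sh; the three-line data files are made up, and their paths are passed in as positional parameters.

```shell
tmp=$(mktemp -d)
printf '%s\n' 'sheep white' 'pig pink' 'dog brown' > "$tmp/colors.txt"
printf '%s\n' 'pig oink' 'dog woof' 'sheep meeh'   > "$tmp/sounds.txt"

# Each <(...) is replaced by a file name connected to the output of the
# sort inside it, so join reads two sorted streams without temp files.
joined=$(LC_ALL=C bash -c 'join <(sort -k1,1 "$1") <(sort -k1,1 "$2")' _ \
         "$tmp/sounds.txt" "$tmp/colors.txt")

echo "$joined"
rm -rf "$tmp"
```

In an interactive bash session the inner command can be typed directly, as on the slide.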

Process Substitution (cont'd)
· the >( ) substitution is a little more arcane

    ls -l <(true)
    lr-x martink users :54 /dev/fd/63 -> pipe:[ ]
    ls -l <(true)
    lr-x martink users :55 /dev/fd/63 -> pipe:[ ]
    ls -l <(true)
    lr-x martink users :55 /dev/fd/63 -> pipe:[ ]

    tar cvf >(gzip -c > archive.tgz) *txt

· Perl prompt tools next time