18.12.2001. CSA405: Unix Trix for Empirical CL. How to use Unix as a toolbox for NLP applications.


Slide 1: CSA405: Unix Trix for Empirical CL
How to use Unix as a toolbox for NLP applications

Slide 2: Acknowledgements
The content of this lecture is inspired by Gerald Gazdar (University of Sussex) and Ken Church (AT&T). Thanks.

Slide 3: Unix Tools
grep : search for a pattern
sort : sort a file
uniq : eliminate adjacent duplicate lines
tr : translate characters
wc : count lines, words and characters
sed : edit strings (stream editor)
awk : pattern-based programming language
cut : cut out selected fields of each line of a file
paste : merge corresponding or subsequent lines of files
comm : select or reject lines common to two files
join : relational database operator
Use man <command> for further details of these.

Slide 4: Text
I was intrigued by the article "Cloning a human being `a long way off`" (December 3). I attended the well-presented lecture by Dr Bruce Campbell, wherein the cutting edge of the new cloning technology for the harvesting of human stem cells was explained. This involves the transfer of the nucleus from an adult human mature cell, such as skin, hair or mucosa, into the denucleated human ovum of a female of the species, which is then allowed to start developing for a few days to the stage where the placental precursor cells separate from the cells destined to become the foetus.

Slide 5: Punctuation 1
Command:
    sed -f markpunct.sed
Contents of markpunct.sed:
    s/"/ xzzdoublequotezzx /g
    s/'/ xzzquotezzx /g
    s/`/ xzzquotezzx /g
    s/(/ xzzleftparenzzx /g
Output:
    I was intrigued by the article xzzdoublequotezzx Cloning a human being xzzquotezzx

Slide 6: Punctuation 2
Command:
    sed -f angle.sed
Contents of angle.sed:
    s/xzz/</g
    s/zzx/>/g
Output:
    I was intrigued by the article <doublequote> Cloning a human being <quote>
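The two sed passes from the punctuation slides can be sketched as a pair of shell functions. This is only an illustration: the function names mark_punct and to_angle are hypothetical, and the sed commands are passed inline with -e rather than read from markpunct.sed and angle.sed.

```shell
# mark_punct and to_angle are hypothetical names, not from the slides.
# They apply the same substitutions as markpunct.sed and angle.sed.
mark_punct() {
  sed -e 's/"/ xzzdoublequotezzx /g' \
      -e "s/'/ xzzquotezzx /g" \
      -e 's/`/ xzzquotezzx /g' \
      -e 's/(/ xzzleftparenzzx /g'
}

to_angle() {
  # Turn the xzz...zzx placeholders into <...> tags.
  sed -e 's/xzz/</g' -e 's/zzx/>/g'
}

# Example: mark the punctuation, then convert the markers to tags.
printf '%s\n' 'the article "Cloning a human being' | mark_punct | to_angle
```

Using a placeholder such as xzz...zzx in the first pass keeps the marker purely alphabetic, so it survives any intermediate step that treats punctuation specially.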

Slide 7: Case
Command:
    tr 'A-Z' 'a-z'
Output:
    i was intrigued by the article "cloning a human being `a long way off`" (december 3). i attended the well-presented lecture by dr bruce campbell, wherein the cutting edge of the new cloning technology for the harvesting of human stem cells was explained.

Slide 8: Tokenisation
Command:
    tr -sc 'a-zA-Z' '\012'
Output:
    I
    was
    intrigued
    by
    the
    article
    Cloning
    a
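As a minimal sketch, the case-folding and tokenisation steps can be wrapped in one shell function. The name tokenise is hypothetical, and because the text is lowercased first, the letter set is reduced to 'a-z' here.

```shell
# tokenise: hypothetical helper, not from the slides.
# Step 1 lowercases the text; step 2 replaces every run of
# non-letters (-c complements the set, -s squeezes repeats)
# with a single newline, giving one word per line.
tokenise() {
  tr 'A-Z' 'a-z' | tr -sc 'a-z' '\012'
}

printf 'I was intrigued by the article.\n' | tokenise
```

Note that input beginning with a non-letter produces an empty first line, since the leading run of non-letters is also turned into a newline.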

Slide 9: Sorting
Command:
    tr 'A-Z' 'a-z' | tr -sc 'a-zA-Z' '\012' | sort
Output:
    a
    a
    a
    a
    adult
    allowed
    an
    article
    as
    attended
    become
    being
    bruce
    by
    by

Slide 10: Making a Wordlist
Command:
    tr 'A-Z' 'a-z' | tr -sc 'a-zA-Z' '\012' | sort | uniq
Output:
    a
    adult
    allowed
    an
    article
    as
    attended
    become
    being
    bruce
    by
    campbell
    cell
    cells
    cloning

Slide 11: Counting
Command:
    tr 'A-Z' 'a-z' | tr -sc 'a-zA-Z' '\012' | sort | uniq -c
Output:
    4 a
    1 adult
    1 allowed
    1 an
    1 article
    1 as
    1 attended
    1 become
    1 being
    1 bruce
    2 by
    1 campbell
    1 cell
    3 cells
    2 cloning

Slide 12: Sorted Frequency List
Command:
    tr 'A-Z' 'a-z' | tr -sc 'a-zA-Z' '\012' | sort | uniq -c | sort -r
Output:
    13 the
    5 of
    4 human
    4 a
    3 to
    3 cells
    2 was
    2 i
    2 from
    2 for
    2 cloning
    2 by
    1 which
    1 wherein

Slide 13: Sorted Frequency List
Command:
    tr 'A-Z' 'a-z' | tr -sc 'a-zA-Z' '\012' | sort | uniq -c | sort -r | cat -n
Output:
    1 13 the
    2 5 of
    3 4 human
    4 4 a
    5 3 to
    6 3 cells
    7 2 was
    8 2 i
    9 2 from
    10 2 for
    11 2 cloning
    12 2 by
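The ranked frequency list can be sketched as a single reusable function. The name freq_list is hypothetical. One deliberate difference from the slides: this sketch uses sort -nr instead of sort -r, so that the counts are compared numerically rather than relying on uniq -c right-aligning them.

```shell
# freq_list: hypothetical helper, not from the slides.
# Reads text on standard input and prints: rank, count, word.
freq_list() {
  tr 'A-Z' 'a-z' |          # fold case
    tr -sc 'a-z' '\012' |   # one word per line
    sort |                  # group identical words together
    uniq -c |               # count each group
    sort -nr |              # highest count first (numeric compare)
    cat -n                  # prefix each line with its rank
}

printf 'the cat and the dog and the bird\n' | freq_list
```

In the example, "the" (3 occurrences) is ranked first and "and" (2 occurrences) second; the order among words with equal counts depends on how sort breaks ties.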

Slide 14: Zipf
Principle of least effort: people act so as to minimise their probable average rate of work.
The speaker's effort is conserved by having a small number of very frequent words, whilst the hearer's effort demands a large number of rare words.
Consequence (according to Zipf): a relationship between word frequency and rank:
    Frequency × Rank = constant
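To inspect the Frequency × Rank relationship on real counts, a small awk step can be appended to a frequency list. This is a sketch under assumptions: the name zipf_table is hypothetical, and the input is assumed to be "count word" lines already sorted by descending count, as produced by uniq -c followed by a reverse sort.

```shell
# zipf_table: hypothetical helper, not from the slides.
# Input:  "count word" lines, highest count first.
# Output: rank, frequency, rank * frequency, word (tab-separated).
zipf_table() {
  awk '{ printf "%d\t%d\t%d\t%s\n", NR, $1, NR * $1, $2 }'
}

# The top three entries of the sorted frequency list from the slides.
printf '13 the\n5 of\n4 human\n' | zipf_table
```

For these three entries the products are 13, 10 and 12. On a text this short the product is only roughly stable; the relationship is usually examined on much larger corpora.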

Slide 15: Zipf Curve
[Figure: frequency plotted against rank]

Slide 16: paste and tail
paste: by default, paste concatenates the corresponding lines of the input files. The NEWLINE character of every line except the line from the last input file is replaced with a TAB character.
tail: the tail utility copies the named file to standard output, beginning at a designated place.
These two utilities can be used to work with n-grams.

Slide 17: Bigrams
Commands:
    tr 'A-Z' 'a-z' | tr -sc 'a-zA-Z' '\012' > foo
    tr 'A-Z' 'a-z' | tr -sc 'a-zA-Z' '\012' | tail -n +2 > foo1
    paste foo foo1 | sort
Output (excerpts):
    :
    human being
    human mature
    human ovum
    human stem
    :
    the article
    the cells
    the cutting
    the denucleated
    the foetus
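The bigram recipe can be sketched as one function that tokenises once and reuses the word list. Assumptions to note: the name bigrams is hypothetical, temporary files replace foo and foo1, tail -n +2 is used as the current spelling of the older tail +2, and the final uniq -c | sort -nr counting step is an extension that the slide does not include.

```shell
# bigrams: hypothetical helper, not from the slides.
# Reads text on standard input and prints counted bigrams,
# most frequent first.
bigrams() {
  tmp=$(mktemp -d) || return 1
  # One lowercase word per line.
  tr 'A-Z' 'a-z' | tr -sc 'a-z' '\012' > "$tmp/words"
  # The same list shifted up by one line (starts at word 2).
  tail -n +2 "$tmp/words" > "$tmp/next"
  # Pair word i with word i+1, then count identical pairs.
  paste "$tmp/words" "$tmp/next" | sort | uniq -c | sort -nr
  rm -rf "$tmp"
}

printf 'the human cell and the human ovum\n' | bigrams
```

Because the shifted list is one line shorter, the last word is paired with an empty field; that final line can be ignored or filtered out.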

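Slide 18: grep
    grep '[A-Z]'      lines containing an uppercase character
    grep '^[A-Z]'     lines starting with an uppercase character
    grep '[A-Z]$'     lines ending with an uppercase character
    grep '[^aeiou]'   lines containing a non-vowel
    grep '[.]'        lines containing a literal period (inside brackets, "." is not a wildcard; grep '.' matches lines containing any character)
    grep '[A-Z]*'     lines containing 0 or more uppercase characters (that is, every line)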
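A quick way to check what each pattern matches is to count matching lines in a small fixed sample. This is only a sketch: the function name count_matches and the four sample lines (the last one empty) are made up for illustration, and LC_ALL=C is set so that the range A-Z is not affected by the locale.

```shell
# count_matches: hypothetical helper, not from the slides.
# Prints how many of four sample lines match the given pattern.
count_matches() {
  printf '%s\n' 'Hello world' 'lower case.' 'ends with X' '' |
    LC_ALL=C grep -c -- "$1"
}

for pattern in '[A-Z]' '^[A-Z]' '[A-Z]$' '[^aeiou]' '[.]' '.' '[A-Z]*'; do
  printf '%s\t%s\n' "$pattern" "$(count_matches "$pattern")"
done
```

On this sample, '[.]' matches only the line with a full stop, '.' matches the three non-empty lines, and '[A-Z]*' matches all four lines, because zero occurrences also count as a match.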