CS 124/LINGUIST 180 From Languages to Information Unix for Poets (in 2013) Christopher Manning Stanford University.



Unix for Poets (based on Ken Church's presentation)
Text is available like never before
  The Web
  Dictionaries, corpora, etc.
  Billions and billions of words
What can we do with it all?
It is better to do something simple than nothing at all.
You can do simple things from a Unix command line
DIY is more satisfying than begging for "help"

Exercises to be addressed
1. Count words in a text
2. Sort a list of words in various ways
   1. ASCII order
   2. "rhyming" order
3. Extract useful info from a dictionary
4. Compute ngram statistics
5. Work with parts of speech in tagged text

Tools
grep: search for a pattern (regular expression)
sort
uniq -c (count duplicates)
tr (translate characters)
wc (word – or line – count)
sed (edit string -- replacement)
cat (send file(s) in stream)
echo (send text in stream)
cut (columns in tab-separated files)
paste (paste columns)
head
tail
rev (reverse the characters of each line)
comm
join
shuf (shuffle lines of text)

Prerequisites
ssh into a corn
cp /afs/ir/class/cs124/nyt_ txt .
man, e.g., man tr (shows command options; not friendly)
Input/output redirection: > < |
CTRL-C

Exercise 1: Count words in a text
Input: text file (nyt_ txt)
Output: list of words in the file with freq counts
Algorithm
1. Tokenize (tr)
2. Sort (sort)
3. Count duplicates (uniq -c)

Solution to Exercise 1
    tr -sc 'A-Za-z' '\n' < nyt_ txt | sort | uniq -c
         a
    1271 A
       3 AA
       3 AAA
       1 Aalborg
       1 Aaliyah
       1 Aalto
       2 aardvark
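The three-step algorithm (tokenize, sort, count) can be tried on any small file. The following is a minimal sketch; sample.txt and its contents are made up for illustration and stand in for the NYT file.

```shell
# Hypothetical sample text standing in for the NYT file.
printf 'The cat saw the dog.\nThe dog saw the bird!\n' > sample.txt

# 1. tr -sc: -c selects every character that is NOT a letter, each is
#    replaced by a newline, and -s squeezes runs of newlines into one,
#    so every word ends up on its own line.
# 2. sort puts identical words next to each other.
# 3. uniq -c collapses adjacent duplicates and prefixes each with its count.
counts=$(tr -sc 'A-Za-z' '\n' < sample.txt | sort | uniq -c)
printf '%s\n' "$counts"
```

Note that "The" and "the" are counted separately here, which is what the first extended exercise addresses.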

Some of the output
    tr -sc 'A-Za-z' '\n' < nyt_ txt | sort | uniq -c | head -n
         a
    1271 A
       3 AA
       3 AAA
       1 Aalborg
    tr -sc 'A-Za-z' '\n' < nyt_ txt | sort | uniq -c | head
head with no options gives you the first 10 lines
tail does the same with the end of the input
(You can omit the "-n" but it's discouraged.)

Extended Counting Exercises
1. Merge upper and lower case by downcasing everything
   Hint: Put in a second tr command
2. How common are different sequences of vowels (e.g., ieu)?
   Hint: Put in a second tr command
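One possible way to follow both hints is sketched below. The sample sentence is invented, and treating only a, e, i, o, u as vowels is an assumption.

```shell
# Hypothetical sample text.
printf 'The Queen said adieu. The queen is beautiful.\n' > sample.txt

# Exercise 1: a first tr downcases, a second tr tokenizes,
# so "The" and "the" are merged into one entry.
lower_counts=$(tr 'A-Z' 'a-z' < sample.txt | tr -sc 'a-z' '\n' | sort | uniq -c)
printf '%s\n' "$lower_counts"

# Exercise 2: after downcasing, turn every run of non-vowels into a single
# newline, which leaves one vowel sequence per line.
# Text that starts with a consonant also yields one empty line, which
# shows up in the counts with an empty key.
vowel_counts=$(tr 'A-Z' 'a-z' < sample.txt | tr -sc 'aeiou' '\n' | sort | uniq -c)
printf '%s\n' "$vowel_counts"
```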

Sorting and reversing lines of text
sort
sort -f     Ignore case
sort -n     Numeric order
sort -r     Reverse sort
sort -nr    Reverse numeric sort
echo "Hello" | rev
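The difference between the default sort and the numeric sort is easiest to see on a few lines that begin with numbers. This is a small sketch; counts.txt is a hypothetical file.

```shell
# Hypothetical file of "count word" lines.
printf '3 the\n10 cat\n2 Dog\n' > counts.txt

sort counts.txt       # character order: "10 cat" comes before "2 Dog"
sort -n counts.txt    # numeric order: 2, 3, 10
sort -nr counts.txt   # reverse numeric order: 10, 3, 2

# rev reverses the characters within each line.
reversed=$(echo "Hello" | rev)
echo "$reversed"
```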

Counting and sorting exercises
Find the 50 most common words in the NYT
  Hint: Use sort a second time, then head
Find the words in the NYT that end in "zz"
  Hint: Look at the end of a list of reversed words
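Both hints can be sketched on a toy input. The sentence below is invented, and the sizes (top 2, last 3) are chosen to fit it; on the NYT file you would use head -n 50 and inspect a longer tail.

```shell
# Hypothetical sample text.
printf 'The jazz band played jazz and the buzz was the fizz of the night\n' > sample.txt
tr -sc 'A-Za-z' '\n' < sample.txt > sample.words

# Most common words: count, then sort a second time on the counts
# (numeric, largest first), then keep the top of the list.
top=$(sort sample.words | uniq -c | sort -nr | head -n 2)
printf '%s\n' "$top"

# Words ending in "zz": reverse every distinct word, sort, and look at the
# end of the list, where the reversed words starting with "zz" collect.
# A final rev turns them back around.
zz=$(sort sample.words | uniq | rev | sort | tail -n 3 | rev)
printf '%s\n' "$zz"
```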

Lesson
Piping commands together can be simple yet powerful in Unix
It gives flexibility.
Traditional Unix philosophy: small tools that can be composed

Bigrams = word pairs and their counts
Algorithm
1. Tokenize by word
2. Print word i and word i+1 on the same line
3. Count

Bigrams
    tr -sc 'A-Za-z' '\n' < nyt_ txt > nyt.words
    tail -n +2 nyt.words > nyt.nextwords
    paste nyt.words nyt.nextwords > nyt.bigrams
    head -n 5 nyt.bigrams
    KBR     said
    said    Friday
    Friday  the
    the     global
    global  economic
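The same recipe, run end to end on a toy sentence, including the counting step. This is a sketch; the sample.* file names and the sentence are made up.

```shell
# Hypothetical sample text.
printf 'the cat sat on the mat and the cat slept\n' > sample.txt

# One word per line.
tr -sc 'A-Za-z' '\n' < sample.txt > sample.words
# The same list starting at line 2, so line i holds word i+1.
tail -n +2 sample.words > sample.nextwords
# paste joins the two files line by line with a tab in between.
# The last line pairs the final word with nothing.
paste sample.words sample.nextwords > sample.bigrams

# Count the pairs and show the most frequent one.
top_bigram=$(sort sample.bigrams | uniq -c | sort -nr | head -n 1)
printf '%s\n' "$top_bigram"
```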

Exercises
Find the 10 most common bigrams
  (For you to look at:) What part-of-speech pattern are most of them?
Find the 10 most common trigrams
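For trigrams, one approach is to extend the bigram recipe with a third shifted copy of the word list. A sketch on invented text follows; the file names are hypothetical.

```shell
# Hypothetical sample text.
printf 'the cat sat on the mat and the cat sat down\n' > sample.txt

tr -sc 'A-Za-z' '\n' < sample.txt > tri.w1   # word i
tail -n +2 tri.w1 > tri.w2                   # word i+1
tail -n +3 tri.w1 > tri.w3                   # word i+2

# paste accepts any number of files; each output line is one trigram.
# The last two lines are incomplete because the shifted lists run out.
top_trigram=$(paste tri.w1 tri.w2 tri.w3 | sort | uniq -c | sort -nr | head -n 1)
printf '%s\n' "$top_trigram"
```

On the real data, head -n 10 at the end of the pipeline gives the 10 most common trigrams.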

grep
Grep finds patterns specified as regular expressions
    grep rebuilt nyt_ txt
    Conn and Johnson, has been rebuilt, among the first of the 222
    move into their rebuilt home, sleeping under the same roof for the
    the part of town that was wiped away and is being rebuilt. That is
    to laser trace what was there and rebuilt it with accuracy," she
    home - is expected to be rebuilt by spring. Braasch promises that a
    the anonymous places where the country will have to be rebuilt,
    "The party will not be rebuilt without moderates being a part of

grep
Grep finds patterns specified as regular expressions
  globally search for regular expression and print
Finding words ending in -ing:
    grep 'ing$' nyt.words | sort | uniq -c

grep
grep is a filter: you keep only some lines of the input
grep gh         keep lines containing "gh"
grep '^con'     keep lines beginning with "con"
grep 'ing$'     keep lines ending with "ing"
grep -v gh      keep lines NOT containing "gh"
grep -P         Perl regular expressions (extended syntax)
grep -P '^[A-Z]+$' nyt.words | sort | uniq -c     ALL UPPERCASE
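Each filter can be checked against a short word list. In this sketch words.txt is a made-up file, and the last command swaps the slide's grep -P '^[A-Z]+$' for the POSIX class [[:upper:]] with grep -E, an assumption made so the example does not depend on Perl regex support or on locale-specific ranges.

```shell
# Hypothetical word list, one word per line.
printf 'conduct\nsinging\nhigh\nring\nNATO\nUS\nlaugh\ncontrol\n' > words.txt

grep gh words.txt       # lines containing "gh": high, laugh
grep '^con' words.txt   # lines beginning with "con": conduct, control
grep 'ing$' words.txt   # lines ending with "ing": singing, ring
grep -v gh words.txt    # lines NOT containing "gh"

# All-uppercase words (portable stand-in for grep -P '^[A-Z]+$').
upper=$(grep -E '^[[:upper:]]+$' words.txt)
printf '%s\n' "$upper"
```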

Counting lines, words, characters
    wc nyt_ txt
    nyt_ txt
    wc -l nyt.words
    nyt.words
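With no options, wc prints the line, word, and byte counts followed by the file name; -l, -w, and -c select one count each. A small sketch on a hypothetical two-line file:

```shell
# Hypothetical file: 2 lines, 3 words, 14 bytes.
printf 'one two\nthree\n' > tiny.txt

wc tiny.txt   # lines, words, bytes, then the file name

# Reading from standard input leaves the file name out of the output.
# tr -d ' ' strips the padding some wc versions add.
lines=$(wc -l < tiny.txt | tr -d ' ')
words=$(wc -w < tiny.txt | tr -d ' ')
bytes=$(wc -c < tiny.txt | tr -d ' ')
echo "$lines $words $bytes"
```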

grep & wc exercises
How many all uppercase words are there in this NYT file?
How many 4-letter words?
How many different words are there with no vowels?
  What subtypes do they belong to?
How many "1 syllable" words are there?
  That is, ones with exactly one vowel
Type/token distinction: different words (types) vs. instances (tokens)
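The type/token distinction comes down to whether you deduplicate before counting. The sketch below shows one way to approach these exercises on an invented word list; it assumes a, e, i, o, u are the only vowels, so "y" counts as a consonant.

```shell
# Hypothetical tokenized text, one word per line (9 tokens, 7 types).
printf 'the\nTV\ncat\nthat\nthe\nrhythm\nqueen\nTV\nwith\n' > toy.words

# Tokens: every line counts. Types: deduplicate first.
tokens=$(wc -l < toy.words | tr -d ' ')
types=$(sort toy.words | uniq | wc -l | tr -d ' ')

# 4-letter tokens: lines of exactly four characters (grep -c counts matches).
four=$(grep -c '^....$' toy.words)

# Types with no vowels: -v inverts the match, -c counts.
no_vowel=$(sort toy.words | uniq | grep -vc '[AEIOUaeiou]')

# Types with exactly one vowel: non-vowels, one vowel, non-vowels.
one_vowel=$(sort toy.words | uniq | grep -c '^[^AEIOUaeiou]*[AEIOUaeiou][^AEIOUaeiou]*$')

echo "tokens=$tokens types=$types four=$four no_vowel=$no_vowel one_vowel=$one_vowel"
```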

sed
sed is a simple string (i.e., lines of a file) editor
You can match lines of a file by regex or line numbers and make changes
Not much used in 2013, but the general regex replace function still comes in handy
    sed 's/George Bush/Dubya/' nyt_ txt | less
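One detail worth knowing about the s/old/new/ command: by default it replaces only the first match on each line, and the g flag replaces all of them. A sketch on an invented sentence:

```shell
# Hypothetical input line with two occurrences of the pattern.
line='George Bush spoke. Later George Bush left.'

# Without a flag, only the first match on the line is replaced.
first_only=$(printf '%s\n' "$line" | sed 's/George Bush/Dubya/')
echo "$first_only"

# With the g flag, every match on the line is replaced.
all=$(printf '%s\n' "$line" | sed 's/George Bush/Dubya/g')
echo "$all"
```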

sed exercises
Count frequency of word initial consonant sequences
  Take tokenized words
  Delete the first vowel through the end of the word
  Sort and count
Count word final consonant sequences
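A possible solution sketch for both exercises, on made-up text. It downcases first so one vowel set suffices, and again assumes a, e, i, o, u are the only vowels.

```shell
# Hypothetical sample text.
printf 'The string theory stream strikes three thrushes\n' > sample.txt
tr 'A-Z' 'a-z' < sample.txt | tr -sc 'a-z' '\n' > lower.words

# Initial consonant sequences: delete from the first vowel to the end.
initial=$(sed 's/[aeiou].*//' lower.words | sort | uniq -c)
printf '%s\n' "$initial"

# Final consonant sequences: .* is greedy, so this deletes everything up to
# and including the LAST vowel. Words ending in a vowel become empty lines.
final=$(sed 's/.*[aeiou]//' lower.words | sort | uniq -c)
printf '%s\n' "$final"
```

Words with no vowel at all pass through both commands unchanged.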

awk
Ken Church's slides then describe awk, a simple programming language for short programs on data usually in fields
I honestly don't think it's worth learning awk in 2013
Better to write little programs in your favorite scripting language, be that Python, or Perl, or groovy, or …

shuf
Randomly permutes (shuffles) the lines of a file
Exercises
Print 10 random word tokens from the NYT excerpt
  10 instances of words that appear, each word instance equally likely
Print 10 random word types from the NYT excerpt
  10 different words that appear, each different word equally likely
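The two exercises differ only in whether the word list is deduplicated before sampling. A sketch with a tiny invented word list, sampling 3 lines instead of 10 (shuf -n N prints N randomly chosen lines):

```shell
# Hypothetical tokenized text: "the" occurs 6 times, four other words once.
printf 'the\nthe\nthe\nthe\nthe\nthe\ncat\ndog\nsat\nran\n' > toy.words

# Random tokens: sample from the raw list, so frequent words such as
# "the" are more likely to be drawn, possibly more than once.
tokens=$(shuf -n 3 toy.words)
printf '%s\n' "$tokens"

# Random types: deduplicate first, so every distinct word is equally likely
# and no word can appear twice.
types=$(sort toy.words | uniq | shuf -n 3)
printf '%s\n' "$types"
```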

cut – tab separated files
    cp /afs/ir/class/cs124/parses.conll .
    head -n 5 parses.conll
    1   Influential   _   JJ    JJ    _   2    amod    _   _
    2   members       _   NNS   NNS   _   10   nsubj   _   _
    3   of            _   IN    IN    _   2    prep    _   _
    4   the           _   DT    DT    _   6    det     _   _
    5   House         _   NNP   NNP   _   6    nn      _   _

cut – tab separated files
Frequency of different parts of speech:
    cut -f 4 parses.conll | sort | uniq -c | sort -nr
Get just words and their parts of speech:
    cut -f 2,4 parses.conll
You can deal with comma separated files with: cut -d,
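These commands can be tried on a small tab-separated file. In this sketch toy.conll is a made-up file with only four columns (id, word, placeholder, tag), not the class's parses.conll.

```shell
# Hypothetical tab-separated sample: id, word, placeholder, POS tag.
printf '1\tInfluential\t_\tJJ\n2\tmembers\t_\tNNS\n3\tof\t_\tIN\n' > toy.conll
printf '4\tthe\t_\tDT\n5\tHouse\t_\tNNP\n6\tthe\t_\tDT\n' >> toy.conll

# Column 4 only, then the usual count-and-rank pipeline.
pos_counts=$(cut -f 4 toy.conll | sort | uniq -c | sort -nr)
printf '%s\n' "$pos_counts"

# Columns 2 and 4: each word with its tag, still tab-separated.
cut -f 2,4 toy.conll

# -d changes the delimiter from tab to another character, here a comma.
second=$(printf 'a,b,c\n' | cut -d, -f 2)
echo "$second"
```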

cut exercises
How often is 'that' used as a determiner (DT) "that man" versus a complementizer (IN) "I know that he is rich" versus a relative (WDT) "The class that I love"
  Hint: With grep -P, you can use '\t' for a tab character
What determiners occur in the data? What are the 5 most common?
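One way to approach these exercises is sketched below on invented data. Instead of grep -P with '\t' as in the hint, it puts a literal tab into a shell variable, an assumption made so the example also runs where grep lacks Perl regex support.

```shell
# Hypothetical tagged sample: id, word, placeholder, POS tag.
printf '1\tI\t_\tPRP\n2\tknow\t_\tVBP\n3\tthat\t_\tIN\n4\tthe\t_\tDT\n' > that.conll
printf '5\tman\t_\tNN\n6\tsaw\t_\tVBD\n7\tthat\t_\tDT\n8\tdog\t_\tNN\n' >> that.conll
printf '9\tthat\t_\tWDT\n10\tthe\t_\tDT\n11\tclass\t_\tNN\n' >> that.conll
printf '12\tloves\t_\tVBZ\n13\tthat\t_\tIN\n' >> that.conll

tab=$(printf '\t')   # a literal tab character for use inside patterns

# Tags assigned to "that": keep word and tag, select the rows whose word is
# exactly "that" (any case), then count the tags.
that_tags=$(cut -f 2,4 that.conll | grep -i "^that${tab}" | cut -f 2 | sort | uniq -c)
printf '%s\n' "$that_tags"

# Most common determiners: rows whose tag is exactly DT (the tab in front
# keeps WDT out), then count the words.
top_dets=$(cut -f 2,4 that.conll | grep "${tab}DT\$" | cut -f 1 | sort | uniq -c | sort -nr | head -n 5)
printf '%s\n' "$top_dets"
```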