LIN 6932 Unix Lecture 6 Hana Filip
HW6 - Part II
solutions posted on my website (see syllabus)
Text Processing Command Line Utility Programs
sed, wc, awk, comm, cut, ex, iconv, join, paste, sort, tr, uniq, xargs
TextPro Lexicon File
Lexicon file “core.text”
Background: TextPro
An information extraction system used at SRI International, Menlo Park, CA
Developed by Doug Appelt
Copy “machen.txt” into your account
> cd ..
> cd c6932aab
> ls
… machen.txt …
> cp machen.txt ~c6932aad
> cd
> ls
… machen.txt …
tr: translate or delete characters
Example 1: delete (-d) all the newline characters from “machen.txt” and redirect the output to a file named “machen-cont.txt”.
% cat machen.txt | tr -d "\n" > machen-cont.txt
Example 2: delete (-d) all characters from “machen.txt” except for alphabetical characters, newlines, and spaces, and redirect the output to a file named “machen-alpha.txt”.
% cat machen.txt | tr -c -d "[:alpha:]\n " > machen-alpha.txt
Try also:
% cat machen.txt | tr -c -d "[:alpha:]\n" > machen-alpha.txt
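To see what the two tr calls do without touching machen.txt, here is a small sketch. The file sample.txt and its two lines are made up for illustration, and the commands are written for a POSIX sh rather than tcsh.

```shell
# Work in a scratch directory so nothing in your account is overwritten.
tmp=$(mktemp -d)

# Hypothetical two-line sample text (stands in for machen.txt).
printf 'Was machen Sie, Herr Kant?\nIch mache nichts!\n' > "$tmp/sample.txt"

# -d deletes every newline, so the two lines are joined into one.
joined=$(tr -d '\n' < "$tmp/sample.txt")

# -c complements the set: delete everything EXCEPT letters, newlines, spaces.
alpha=$(tr -c -d '[:alpha:]\n ' < "$tmp/sample.txt")

printf '%s\n' "$joined"
printf '%s\n' "$alpha"
```

The first printf shows one long line with the punctuation intact; the second keeps the line break but drops the comma, question mark, and exclamation mark.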
tr can be used to make a wordlist from a text.
This can be done by replacing every space with a newline:
% cat machen.txt | tr " " "\n" | less
or, equivalently, with the octal code for newline:
% cat machen.txt | tr " " "\012" | less
We can combine the command above with the delete functionality of tr to make a wordlist without unwanted characters:
% cat machen.txt | tr " " "\n" | tr -c -d "[:alpha:]\n" > lex
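The two-step wordlist pipeline can be tried on a one-line sample. This is only a sketch: sample.txt and its sentence are invented, and the output goes to a scratch directory instead of your real lex file.

```shell
tmp=$(mktemp -d)

# Hypothetical sample sentence (stands in for machen.txt).
printf 'Wir machen das. Sie machen es auch!\n' > "$tmp/sample.txt"

# Step 1: put one word on each line (every space becomes a newline).
# Step 2: delete everything that is neither a letter nor a newline.
tr ' ' '\n' < "$tmp/sample.txt" | tr -c -d '[:alpha:]\n' > "$tmp/lex"

cat "$tmp/lex"
```

The result has seven lines, one word each, with "machen" still appearing twice because nothing has removed duplicates yet.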
sort
prints the lines of its input, or of the concatenation of all files listed in its argument list, in sorted order. (The -r flag reverses the sort order.)
% sort -r movie_characters
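A quick way to compare the two sort orders, sketched on a made-up file. The name names.txt and its four entries are assumptions standing in for movie_characters.

```shell
tmp=$(mktemp -d)

# Hypothetical stand-in for movie_characters: one name per line.
printf 'Rick\nIlsa\nVictor\nSam\n' > "$tmp/names.txt"

# Default order: ascending.
ascending=$(sort "$tmp/names.txt")

# -r reverses the order.
descending=$(sort -r "$tmp/names.txt")

printf '%s\n' "$ascending"
printf '%s\n' "$descending"
```

Note that sort only prints the sorted lines; names.txt itself is left unchanged.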
uniq
takes a text file and outputs the file with adjacent identical lines collapsed to one
it is a kind of filter program
typically it is used after sort
% cat machen.txt | tr " " "\n" | tr -c -d "[:alpha:]\n" | sort | uniq > lex
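Why uniq normally follows sort can be seen on a four-line sample. This is a sketch with an invented file words.txt; the last command uses uniq -c, which prefixes each line with its count and is an addition not used in the lecture commands.

```shell
tmp=$(mktemp -d)

# Hypothetical word list with a non-adjacent duplicate.
printf 'machen\ndas\nmachen\nmachen\n' > "$tmp/words.txt"

# uniq alone collapses only ADJACENT duplicates,
# so the first "machen" survives as a separate line.
unsorted=$(uniq "$tmp/words.txt")

# After sort, identical lines sit next to each other and collapse fully.
sorted=$(sort "$tmp/words.txt" | uniq)

# uniq -c adds a count; sort -rn puts the most frequent word first.
counts=$(sort "$tmp/words.txt" | uniq -c | sort -rn)

printf '%s\n' "$unsorted"
printf '%s\n' "$sorted"
printf '%s\n' "$counts"
```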
sed = stream editor
a special editor for automatically modifying files
a find and replace program: it reads text from standard input and writes the result to standard output (normally the screen)
the search pattern is a regular expression (see references), essentially the same as a grep regular expression
often used in a program to make changes in a file
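A minimal sketch of sed as a filter: with no file argument it reads standard input and writes to standard output. The sample text and the pattern are made up; the same basic regular expression is also handed to grep to show that the syntax carries over.

```shell
# The same basic regular expression: one or more digits.
pattern='[0-9][0-9]*'

# sed reads from the pipe and writes the edited line to standard output.
out=$(printf 'room 12, floor 3\n' | sed "s/$pattern/N/g")

# grep -c counts the lines that match the same pattern.
matches=$(printf 'room 12, floor 3\n' | grep -c "$pattern")

printf '%s\n' "$out"
printf '%s\n' "$matches"
```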
sed: simple example 1
% sed 's/United States/USA/' new-usa-gaz.text
  s                  substitute command
  /../../            delimiter
  United States      regular expression pattern string
  USA                replacement string
  new-usa-gaz.text   new_file
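The substitute command can be tried on a small stand-in file. In this sketch gaz.txt and its two lines are invented. It also shows one detail worth knowing: without the g flag, only the first match on each line is replaced.

```shell
tmp=$(mktemp -d)

# Hypothetical gazetteer lines; the second has two matches.
printf 'United States of America\nUnited States, United States\n' > "$tmp/gaz.txt"

# Without a flag: only the first match on each line is replaced.
first_only=$(sed 's/United States/USA/' "$tmp/gaz.txt")

# With g (global): every match on each line is replaced.
all=$(sed 's/United States/USA/g' "$tmp/gaz.txt")

printf '%s\n' "$first_only"
printf '%s\n' "$all"
```

In both cases sed prints the result and leaves gaz.txt as it was; redirect the output with > to keep it.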
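sed: simple example 2: switch two words around
% sed 's/\(United\) \(States\)/\2 \1/' usa-switch-gaz.text
  \(        start of a group (here: one word)
  \)        end of a group
  /../../   delimiter
  \1        register 1: the text matched by the first group
  \2        register 2: the text matched by the second group
The space between the two groups has to appear in the pattern and in the replacement; without it the pattern would only match "UnitedStates".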
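Swapping with registers can be checked directly on the command line. This is a sketch on invented input: the first command swaps two fixed words separated by a space, the second is a more general variant (an assumption, not from the lecture) that swaps the first two words of any line.

```shell
# Fixed words: group 1 = United, group 2 = States.
swapped=$(printf 'United States\n' | sed 's/\(United\) \(States\)/\2 \1/')

# General variant: [^ ]* matches a run of non-space characters,
# so the two groups capture the first two words of the line.
generic=$(printf 'machen wir\n' | sed 's/^\([^ ]*\) \([^ ]*\)/\2 \1/')

printf '%s\n' "$swapped"
printf '%s\n' "$generic"
```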
Multiple sed commands may also be stored in a script file. The "-f" option is used on the command line to access the commands in the script:
% sed -f sedscript.sed [file]
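What such a script file might look like, as a sketch: the contents of sedscript.sed and the input file gaz.txt below are invented for illustration. Each line of the script is one sed command, and lines starting with # are comments.

```shell
tmp=$(mktemp -d)

# Hypothetical script file: one sed command per line.
cat > "$tmp/sedscript.sed" <<'EOF'
# abbreviate the country name everywhere
s/United States/USA/g
# remove trailing spaces
s/ *$//
EOF

# Hypothetical input; the first line ends in two spaces.
printf 'United States  \nparts of the United States\n' > "$tmp/gaz.txt"

# -f tells sed to read its commands from the script file.
result=$(sed -f "$tmp/sedscript.sed" "$tmp/gaz.txt")

printf '%s\n' "$result"
```

The commands are applied in order to every input line, just as if they had been joined with ; on the command line.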
% sed 's/^/LexEntry: /g;s/$/ ;./' lex > newlex
  ^   matches the beginning of the line
  $   matches the end of the line
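The effect of the two anchors can be seen on a two-word list. In this sketch the contents of lex are invented and the files live in a scratch directory; the g flag is left out because ^ can match only once per line.

```shell
tmp=$(mktemp -d)

# Hypothetical two-entry word list.
printf 'das\nmachen\n' > "$tmp/lex"

# s/^/.../ inserts text before the first character of each line,
# s/$/.../ appends text after the last character.
sed 's/^/LexEntry: /;s/$/ ;./' "$tmp/lex" > "$tmp/newlex"

cat "$tmp/newlex"
```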
Text Processing Command Line Utility Programs & shell script
#! /usr/local/bin/tcsh
# usage: make_lex filename1; make_lex filename1 filename2, …
# first, make sure the user typed in at least one argument
if ( $# < 1 ) then
    echo "This script needs at least 1 argument."
    echo "Exiting...(annoyed)"
    exit 666
endif
# for each file: build a sorted wordlist without duplicates, then wrap every word as a lexicon entry
# note: mylex and newlex are overwritten on every pass, so newlex ends up holding only the last file
foreach name ($*)
    cat $name | tr " " "\n" | tr -c -d "[:alpha:]\n" | sort | uniq > mylex
    sed 's/^/LexEntry: /g;s/$/ ;./' mylex > newlex
    echo "Your new lexical file is called 'newlex'."
end
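For comparison, here is a rough POSIX sh counterpart of the tcsh script above, written as a function. It is a sketch under assumptions, not the lecture's script: the function form, the sample files a.txt and b.txt, and the extra /^$/d command (which drops empty lines) are all additions. Because the loop in the tcsh script writes to the same two files on each pass, this variant concatenates all input files first so that every file contributes to one lexicon.

```shell
# make_lex FILE... : print one "LexEntry: word ;." line per distinct word.
make_lex() {
    # Make sure the user gave at least one argument.
    if [ "$#" -lt 1 ]; then
        echo "This script needs at least 1 argument." >&2
        return 1
    fi
    # cat joins all files; tr makes the wordlist; sort | uniq removes
    # duplicates; sed drops empty lines and wraps each remaining word.
    cat "$@" | tr ' ' '\n' | tr -c -d '[:alpha:]\n' | sort | uniq |
        sed '/^$/d;s/^/LexEntry: /;s/$/ ;./'
}

tmp=$(mktemp -d)

# Two hypothetical input texts that share the word "machen".
printf 'Wir machen das.\n' > "$tmp/a.txt"
printf 'Sie machen es!\n' > "$tmp/b.txt"

make_lex "$tmp/a.txt" "$tmp/b.txt" > "$tmp/newlex"
cat "$tmp/newlex"
```

The lexicon has five entries, with "machen" listed once even though it occurs in both files. The order of the entries depends on the sort locale.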