LING 408/508: Computational Techniques for Linguists


LING 408/508: Computational Techniques for Linguists Lecture 11

Today's Topics
- Make sure everyone is on board with associative arrays in awk
- A fundamental data structure, the same as:
  - dict in Python
  - hash in Perl
- Reading assignment: https://www.gnu.org/software/gawk/manual/html_node/Regexp.html
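As a quick refresher on associative arrays, here is a minimal sketch. The helper name count_keys and the sample input are made up for illustration. A key that has not been seen yet behaves as 0 in a numeric context, so += 1 needs no initialization. The order of "for (k in count)" is unspecified, which is why the result is piped through sort.

```shell
# count_keys is a hypothetical helper name used only for this sketch.
count_keys() {
  awk '{ count[$1] += 1 }                         # key = first field; unseen keys start at 0
       END { for (k in count) print k, count[k] }' | sort
}

printf 'noun\nverb\nnoun\ndet\nnoun\n' | count_keys
# det 1
# noun 3
# verb 1
```

This corresponds roughly to count[key] = count.get(key, 0) + 1 with a Python dict.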

awk: regex
- awk doesn't support Perl shorthand classes such as \d (digit); use a POSIX character class such as [[:digit:]] instead
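A small sketch of [[:digit:]] in use, assuming input lines whose last field may be a phone-style number. The helper name phone_lines and the sample lines are invented for this example. Where Perl would use ^\d+-\d+$, the awk pattern spells out the character class.

```shell
# phone_lines is a hypothetical helper; the input lines are made-up sample data.
phone_lines() {
  # Keep lines whose last field looks like digits-digits, e.g. 555-0100.
  awk '$NF ~ /^[[:digit:]]+-[[:digit:]]+$/ { print $1, $NF }'
}

printf 'Police 555-0100\nOffice hours\nHealth 555-0199\n' | phone_lines
# Police 555-0100
# Health 555-0199
```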

awk: regex
Build a table of all the words used (case-sensitive, so Campus and campus are counted separately):

  awk 'NR!=1 {for (i=1; i<=NF; i++) {word[$i]+=1}} END {for (x in word) { printf "%12s %d\n", x, word[x]}}' uanumbers.txt | sort -k2 -n

- NR = number of the current line (lines read so far)
- NF = number of fields in the current line
- NR!=1 (pattern): skip the 1st line (!= means not equal to)
- variable[key]: associative array; variable[key] = value stores value under key
- word = associative array of frequencies
- | = pipe (output of awk into sort)
- sort -k2 -n = command to sort on field 2 numerically (-n), lowest count first; add -r to list the highest counts first

Output (excerpt, listed from highest count to lowest):

       Arizona 3
           and 3
            of 3
    Management 2
        Safety 2
        Office 2
  Institutional 1
  800-362-0101 1
  800-222-1222 1
  520-626-6850 1
  …
  520-621-1790 1
   emergencies 1
        during 1
        campus 1
        Police 1
        Poison 1
        Health 1
        Center 1
        Campus 1
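To see what NR, NF and the NR != 1 pattern do in isolation, here is a minimal sketch on made-up input (uanumbers.txt is not reproduced here; show_fields is a hypothetical helper name). For every line except the first it prints the line number, the number of fields, and the last field.

```shell
# show_fields is a hypothetical helper; the three input lines are invented sample data.
show_fields() {
  # NR != 1 skips the header line; $NF is the last field of the current line.
  awk 'NR != 1 { print NR, NF, $NF }'
}

printf 'Name Number\nCampus Police 555-0100\nPoison Center 555-0199 recorded\n' | show_fields
# 2 3 555-0100
# 3 4 recorded
```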

awk: regex
Build a table of all the words used (case-insensitive):

  awk 'NR!=1 {for (i=1; i<=NF; i++) {word[tolower($i)]+=1}} END {for (x in word) { printf "%12s %d\n", x, word[x]}}' uanumbers.txt | sort -k2 -nr

- sort -k2 -nr: as before, but -r reverses the order (highest count first)
- tolower(string): Return a copy of string, with each uppercase character in the string replaced with its corresponding lowercase character. Nonalphabetic characters are left unchanged. For example, tolower("MiXeD cAsE 123") returns "mixed case 123".
  https://www.gnu.org/software/gawk/manual/html_node/String-Functions.html

Output (excerpt):

       arizona 3
           and 3
            of 3
    management 2
        safety 2
        office 2
        campus 2
  institutional 1
  800-362-0101 1
  …
  520-621-1790 1
   information 1
   emergencies 1
    university 1
     biosafety 1
      students 1
      recorded 1
      chemical 1
      (tucson) 1
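The effect of tolower() on the array keys can be checked on a tiny made-up input (fold_counts is a hypothetical helper name): Campus and campus end up under the same key.

```shell
# fold_counts is a hypothetical helper; the input is invented sample data.
fold_counts() {
  awk '{ for (i = 1; i <= NF; i++) word[tolower($i)] += 1 }   # lowercase the key before counting
       END { for (x in word) printf "%s %d\n", x, word[x] }' | sort
}

printf 'Campus Police\ncampus health\n' | fold_counts
# campus 2
# health 1
# police 1
```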

awk: regex
Build a table of all the words used (no numbers, no punctuation):

  gawk 'NR!=1 {for (i=1; i<=NF; i++) {gsub(/[^A-Za-z]/, "", $i); word[tolower($i)]+=1}} END {for (x in word) { printf "%12s %d\n", x, word[x]}}' uanumbers.txt | sort -k2 -nr

- gsub(regexp, replacement [, target]): Search target for all of the longest, leftmost, nonoverlapping matching substrings it can find and replace them with replacement. The ‘g’ in gsub() stands for “global,” which means replace everywhere.
  https://www.gnu.org/software/gawk/manual/html_node/String-Functions.html
- Here gsub(/[^A-Za-z]/, "", $i) deletes every character in field i that is not a letter.

Output (excerpt):

       arizona 3
           and 3
            of 3
    management 2
        safety 2
        office 2
        campus 2
  institutional 1
   information 1
   emergencies 1
    university 1
    facilities 1
    department 1
    biological 1
     radiation 1
     committee 1
     biosafety 1
      students 1
      recorded 1
      chemical 1
       updates 1
       service 1
        tucson 1
        police 1
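One side effect of stripping non-letters is that a token made up only of digits and punctuation, such as a phone number, becomes the empty string and is then counted under the empty key. The following is a minimal sketch of one way to skip such tokens; it is a variant for illustration, not the command from the slide. The helper name letters_only_counts and the input are made up, and the sketch works on a copy (w) instead of modifying $i.

```shell
# letters_only_counts is a hypothetical helper; the input is invented sample data.
letters_only_counts() {
  awk '{
         for (i = 1; i <= NF; i++) {
           w = tolower($i)
           gsub(/[^a-z]/, "", w)        # delete everything that is not a letter
           if (w != "") word[w] += 1    # skip tokens that had no letters at all
         }
       }
       END { for (x in word) printf "%s %d\n", x, word[x] }' | sort
}

printf 'Tucson (Tucson) police,\n555-0100 Police\n' | letters_only_counts
# police 2
# tucson 2
```

Passing w as the third argument makes gsub() edit that variable in place; without a third argument it would operate on the whole line ($0).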