LING/C SC/PSYC 438/538 Lecture 15 Sandiway Fong.

Slides:



Advertisements
Similar presentations
4b Lexical analysis Finite Automata
Advertisements

LING/C SC/PSYC 438/538 Lecture 11 Sandiway Fong. Administrivia Homework 3 graded.
LING/C SC/PSYC 438/538 Lecture 12 Sandiway Fong. Administrivia We'll postpone Homework 4 review until next week …
LING 438/538 Computational Linguistics Sandiway Fong Lecture 8: 9/29.
Finite Automata Great Theoretical Ideas In Computer Science Anupam Gupta Danny Sleator CS Fall 2010 Lecture 20Oct 28, 2010Carnegie Mellon University.
CS5371 Theory of Computation
LING/C SC/PSYC 438/538 Computational Linguistics Sandiway Fong Lecture 12: 10/4.
LING 438/538 Computational Linguistics Sandiway Fong Lecture 10: 9/26.
LING 388 Language and Computers Lecture 4 9/11/03 Sandiway FONG.
LING 438/538 Computational Linguistics Sandiway Fong Lecture 11: 10/3.
LING 438/538 Computational Linguistics Sandiway Fong Lecture 12: 10/5.
1 Foundations of Software Design Lecture 23: Finite Automata and Context-Free Grammars Marti Hearst Fall 2002.
LING/C SC/PSYC 438/538 Computational Linguistics Sandiway Fong Lecture 11: 10/2.
LING/C SC/PSYC 438/538 Lecture 8 Sandiway Fong. Adminstrivia.
Regular Expressions and Automata Chapter 2. Regular Expressions Standard notation for characterizing text sequences Used in all kinds of text processing.
Finite State Machines Data Structures and Algorithms for Information Processing 1.
INHERENT LIMITATIONS OF COMPUTER PROGRAMS CSci 4011.
CPSC 388 – Compiler Design and Construction Scanners – Finite State Automata.
Introduction to CS Theory Lecture 3 – Regular Languages Piotr Faliszewski
1 Regular Expressions. 2 Regular expressions describe regular languages Example: describes the language.
Lecture # 3 Chapter #3: Lexical Analysis. Role of Lexical Analyzer It is the first phase of compiler Its main task is to read the input characters and.
LING/C SC/PSYC 438/538 Lecture 7 9/15 Sandiway Fong.
Lexical Analysis I Specifying Tokens Lecture 2 CS 4318/5531 Spring 2010 Apan Qasem Texas State University *some slides adopted from Cooper and Torczon.
4b 4b Lexical analysis Finite Automata. Finite Automata (FA) FA also called Finite State Machine (FSM) –Abstract model of a computing entity. –Decides.
TRANSITION DIAGRAM BASED LEXICAL ANALYZER and FINITE AUTOMATA Class date : 12 August, 2013 Prepared by : Karimgailiu R Panmei Roll no. : 11CS10020 GROUP.
LING/C SC/PSYC 438/538 Lecture 13 Sandiway Fong. Administrivia Reading Homework – Chapter 3 of JM: Words and Transducers.
Overview of Previous Lesson(s) Over View  Symbol tables are data structures that are used by compilers to hold information about source-program constructs.
LING/C SC/PSYC 438/538 Lecture 8 Sandiway Fong. Adminstrivia Homework 4 not yet graded …
LING/C SC/PSYC 438/538 Lecture 11 Sandiway Fong. Administrivia Homework 5 graded.
Finite Automata Chapter 1. Automatic Door Example Top View.
Overview of Previous Lesson(s) Over View  A token is a pair consisting of a token name and an optional attribute value.  A pattern is a description.
1 Course Overview Why this course “formal languages and automata theory?” What do computers really do? What are the practical benefits/application of formal.
Finite Automata Great Theoretical Ideas In Computer Science Victor Adamchik Danny Sleator CS Spring 2010 Lecture 20Mar 30, 2010Carnegie Mellon.
CS 404Ahmed Ezzat 1 CS 404 Introduction to Compiler Design Lecture 1 Ahmed Ezzat.
Akram Salah ISSR Basic Concepts Languages Grammar Automata (Automaton)
Lecture 2 Compiler Design Lexical Analysis By lecturer Noor Dhia
Regular Languages REGULAR LANGUAGES (AKA TYPE 3) REGULAR GRAMMARS FINITE STATE MACHINES.
Theory of Computation Automata Theory Dr. Ayman Srour.
Department of Software & Media Technology
Topic 3: Automata Theory 1. OutlineOutline Finite state machine, Regular expressions, DFA, NDFA, and their equivalence, Grammars and Chomsky hierarchy.
WELCOME TO A JOURNEY TO CS419 Dr. Hussien Sharaf Dr. Mohammad Nassef Department of Computer Science, Faculty of Computers and Information, Cairo University.
Lecture Three: Finite Automata Finite Automata, Lecture 3, slide 1 Amjad Ali.
CS510 Compiler Lecture 2.
Finite State Machines Dr K R Bond 2009
Chapter 3 Lexical Analysis.
CS 404 Introduction to Compiler Design
Lexical analysis Finite Automata
Compilers Welcome to a journey to CS419 Lecture5: Lexical Analysis:
LING/C SC/PSYC 438/538 Lecture 11 Sandiway Fong.
LING/C SC/PSYC 438/538 Lecture 10 Sandiway Fong.
LING/C SC/PSYC 438/538 Lecture 4 Sandiway Fong.
CSE 105 theory of computation
LING/C SC/PSYC 438/538 Lecture 17 Sandiway Fong.
LING/C SC/PSYC 438/538 Lecture 3 Sandiway Fong.
Intro to Data Structures
LING/C SC/PSYC 438/538 Lecture 12 Sandiway Fong.
CS 3304 Comparative Languages
4b Lexical analysis Finite Automata
Finite Automata & Language Theory
Regular Expressions
LING/C SC/PSYC 438/538 Lecture 21 Sandiway Fong.
LING/C SC/PSYC 438/538 Lecture 18 Sandiway Fong.
4b Lexical analysis Finite Automata
LING/C SC/PSYC 438/538 Lecture 13 Sandiway Fong.
LING/C SC/PSYC 438/538 Lecture 11 Sandiway Fong.
LING/C SC/PSYC 438/538 Lecture 17 Sandiway Fong.
Discrete Maths 13. Grammars Objectives
Lecture 5 Scanning.
LING/C SC/PSYC 438/538 Lecture 7 Sandiway Fong.
LING/C SC/PSYC 438/538 Lecture 3 Sandiway Fong.
Presentation transcript:

LING/C SC/PSYC 438/538 Lecture 15 Sandiway Fong

Today's Topics Erratum Homework 5 review Regular languages: FSA

Prime Number Testing using Perl Regular Expressions Results are still right so far though: wrt. prime vs. non-prime But we predict it will report an incorrect result for 1,073,938,441 It should claim (incorrectly) that this is prime since 1073938441 = 327712 (32766 is the limit for the number of bundles) https://primes.utm.edu/lists/small/10000.txt

Homework 5 Review File: hw5.pos Source: Penn Treebank POS (Part-of-Speech) tagged sentences

Homework 5 Review One of you noticed some inconsistencies in the Penn Treebank WSJ tagging data: [ Valley/NNP Federal/NNP Savings/NNPS ] [ your/PRP$ Foster/NNP Savings/NNP Institution/NNP ] [ First/NNP Securities/NNPS Group/NNP ] Neither/DT First/NNP [ Securities/NNP ] [ Biscayne/NNP Securities/NNP Corp./NNP ]

Homework 5 Review There are more: [ the/DT Netherlands/NNP ] [ the/DT Netherlands/NNPS ] [ the/DT new/JJ Colorado/NNP Springs/NNPS ] [ Colorado/NNP Springs/NNP ] [ International/NNP Business/NNP Machines/NNPS Corp./NNP ] [ International/NNP Business/NNP Machines/NNP Corp./NNP ] [ Northeast/NNP Utilities/NNP ] [ Eastern/NNP Utilities/NNPS ] [ Duchossois/NNP Industries/NNPS Inc./NNP ] [ Free/NNP State/NNP Glass/NNP Industries/NNP ] perl -ne 'while (m!(\S+?s)/NNPS!g) {$s{$1}+=1}; while (m!(\S+?s)/NNP\b!g) {if ($s{$1}) {$t{$1}+=1}}; END {foreach $k (keys %t) {print "$k\n"}}' hw5.pos

Homework 5 Review Question 1: How many proper noun (tokens) are there? matching for NNP includes NNPS (both are proper nouns) perl -ne 'while (m!/NNP!g) {$c+=1} END {print "$c\n"}' hw5.pos 5199 python3 -c 'import sys, re; f = open(sys.argv[1]); print(len([m for g in (re.findall(r"\b\S+?/NNP",l) for l in f) if g for m in g]))' hw5.pos re. findall(regex,string) The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result. Example: … ['Las/NNP', 'Vegas/NNP'], ['Robert/NNP', 'Gerhard/NNP', 'Smith/NNP'], ['Carson/NNP', 'City/NNP'], ['Nev./NNP'], ['Mr./NNP', 'Cutrer/NNP'], ['U.S./NNP'], ['London/NNP'], ['Tokyo/NNP'], ['Japan/NNP'], ['U.S./NNP'], …

Homework 5 Review Question 2: How many different proper nouns (types)? perl -ne 'while (m!([^/\s]+?)/NNP!g) {$w{$1}+=1} END {$c = keys %w; print "$c\n"}' hw5.pos 1689 1689[^/\s]+? vs. 1690 \S+ Examples: [ Macmillan\/McGraw-Hill/NNP School/NNP Publishing/NNP Co./NNP 's/POS Scoring/NNP High/NNP ] [ the/DT Iran\/Contra/NNP affair/NN ] [ Bridgestone\/Firestone/NNP Inc/NNP ] python3 -c 'import sys, re; f = open(sys.argv[1]); print(len(set([m for g in (re.findall(r"\b(\S+?)/NNP",l) for l in f) if g for m in g])))' hw5.pos 1690

Homework 5 Review Question 3: How many different plural proper nouns (types)? perl -ne 'while (m!([^/\s]+?)/NNPS!g) {$w{$1}+=1} END {$c = keys %w; print "$c\n"}' hw5.pos 47 (94 total) python3 -c 'import sys, re; f = open(sys.argv[1]); print(len(set([m for g in (re.findall(r"\b(\S+?)/NNPS",l) for l in f) if g for m in g])))' hw5.pos 47

Homework 5 Review Question 4: What is the most frequent proper noun and its frequency? perl -ne 'while (m!([^/\s]+?)/NNP!g) {$w{$1}+=1} END {@a = sort {$w{$b} <=> $w{$a}} keys %w; print "$a[0]:$w{$a[0]}\n"}' hw5.pos Mr.:223 Question 5: What is the most frequent plural proper noun and its frequency? perl -ne 'while (m!([^/\s]+?)/NNPS!g) {$w{$1}+=1} END {@a = sort {$w{$b} <=> $w{$a}} keys %w; print "$a[0]:$w{$a[0]}\n"}' hw5.pos Containers:16

Homework 5 Review Question 6: Print the top 5 proper nouns and frequencies perl -ne 'while (m!([^/\s]+?)/NNP!g) {$w{$1}+=1} END {@a = sort {$w{$b} <=> $w{$a}} keys %w; foreach $k (@a[0..4]) { printf "%-8s %d\n", $k, $w{$k}}}' hw5.pos Mr.      223 U.S.     148 New      93 Japan    67 Corp.    59

Homework 5 Review Determiners (DT): perl -ne 'while (m!([^/\s]+?)/DT!g) {$w{lc($1)}+=1} END {@a = sort {$w{$b} <=> $w{$a}} keys %w; foreach $k (@a) { printf "%-8s %d\n", $k, $w{$k}}}' hw5.pos the      2277 a        994 an       164 this     104 some     67 that     55 these    50 any      49 all      47 no       46 those 32 another 31 both 28 each 21 every 11 neither 7 half 3 la 1 le 1 either 1 del 1

Homework 5 Review

Regular Languages Three formalisms All formally equivalent (no difference in expressive power) i.e. if you can encode it using a RE, you can do it using a FSA or regular grammar, and so on … Regular Languages Note: Perl regexs are more powerful than the math characterization, e.g. backreferences, recursive regexs Regular Grammars FSA Expressions talk about formal equivalence next time

Regular Languages A regular language is the set of strings (including possibly the empty string) (set itself could also be empty) (set can be infinite) generated by a RE/FSA/Regular Grammar Note: in formal language theory: a language = set of strings

Regular Languages Example: L is a regular language Note: Notation: Language: L = { a+b+ } “one or more a’s followed by one or more b’s” L is a regular language described by a regular expression (we’ll define it formally next time) Note: infinite set of strings belonging to language L e.g. abbb, aaaab, aabb, *abab, * Notation: is the empty string (or string with zero length), sometimes ε is used instead * means string is not in the language

Finite State Automata (FSA) L = { a+b+ } can be also be generated by the following FSA s x y a b > Conventions (used here): > Indicates start state Red circle indicates end (accepting) state we accept a input string only when we’re in an end state and we’re at the end of the string

Finite State Automata (FSA) L = { a+b+ } can be also be generated by the following FSA s x y a b > There is a natural correspondence between components of the FSA and the regex defining L Note: L = {a+b+} L = {aa*bb*}

Finite State Automata (FSA) L = { a+b+ } can be also be generated by the following FSA s x y a b > deterministic FSA (DFSA) no ambiguity about where to go at any given state i.e. for each input symbol in the alphabet at any given state, there is a unique “action” to take non-deterministic FSA (NDFSA) no restriction on ambiguity (surprisingly, no increase in power)

Finite State Automata (FSA) more formally (Q,s,f,Σ,) set of states (Q): {s,x,y} must be a finite set start state (s): s end state(s) (f): y alphabet (Σ): {a, b} transition function : signature: character × state → state (a,s)=x (a,x)=x (b,x)=y (b,y)=y s x y a b >

Finite State Automata (FSA) In Perl transition function : (a,s)=x (a,x)=x (b,x)=y (b,y)=y We can simulate our 2D transition table using a hash whose elements are themselves (anonymized) hashes %transitiontable = ( s => { a => "x" }, x => { a => "x", b => "y" y => { } ); Example: print "$transitiontable{s}{a}\n"; s x y a b > Syntactic sugar for %transitiontable = ( "s", { "a", "x", }, "x", { "a", "x" , "b", "y" }, "y", { "b", "y" }, );

Finite State Automata (FSA) Given transition table encoded as a hash How to build a decider (Accept/Reject) in Perl? Complications: How about ε-transitions? Multiple end states? Multiple start states? Non-deterministic FSA?

Finite State Automata (FSA) %transitiontable = ( s => { a => "x" }, x => { a => "x", b => "y" y => { } ); @input = @ARGV; $state = "s"; foreach $c (@input) { $state = $transitiontable{$state}{$c}; if ($state eq "y") { print "Accept\n"; } else { print "Reject\n" } Example runs: perl fsm.prl a b a b Reject perl fsm.prl a a a b b Accept

Finite State Automata (FSA) this is pseudo-code not any real programming language but can be easily translated

In Python Python dictionary = Perl hash key:value sys.argv = @ARGV but numbered from 1 tab indentation used for grouping statements

In Python Python has a data structure called a tuple: (e1,..,en) Note: Python lists use [..] In Python, tuples (but not lists) can also be dictionary keys Note: Many other ways of encoding FSA in Python, e.g. using object-oriented programming (classes) https://wiki.python.org/moin/FiniteStateMachine#FSA_-_Finite_State_Automation_in_Python

Finite State Automata (FSA) practical applications can be encoded and run efficiently on a computer widely used encode regular expressions (e.g. Perl regex) morphological analyzers Different word forms, e.g. want, wanted, unwanted (suffixation/prefixation) see chapter 3 of textbook speech recognizers Markov models = FSA + probabilities and much more …