Presentation is loading. Please wait.

Presentation is loading. Please wait.

LING/C SC/PSYC 438/538 Lecture 15 Sandiway Fong.

Similar presentations


Presentation on theme: "LING/C SC/PSYC 438/538 Lecture 15 Sandiway Fong."— Presentation transcript:

1 LING/C SC/PSYC 438/538 Lecture 15 Sandiway Fong

2 Today's Topics Erratum Homework 5 review Regular languages: FSA

3 Prime Number Testing using Perl Regular Expressions
Results are still right so far though: wrt. prime vs. non-prime But we predict it will report an incorrect result for 1,073,938,441 It should claim (incorrectly) that this is prime since = (32766 is the limit for the number of bundles)

4 Homework 5 Review File: hw5.pos
Source: Penn Treebank POS (Part-of-Speech) tagged sentences

5 Homework 5 Review One of you noticed some inconsistencies in the Penn Treebank WSJ tagging data: [ Valley/NNP Federal/NNP Savings/NNPS ] [ your/PRP$ Foster/NNP Savings/NNP Institution/NNP ] [ First/NNP Securities/NNPS Group/NNP ] Neither/DT First/NNP [ Securities/NNP ] [ Biscayne/NNP Securities/NNP Corp./NNP ]

6 Homework 5 Review There are more: [ the/DT Netherlands/NNP ]
[ the/DT Netherlands/NNPS ] [ the/DT new/JJ Colorado/NNP Springs/NNPS ] [ Colorado/NNP Springs/NNP ] [ International/NNP Business/NNP Machines/NNPS Corp./NNP ] [ International/NNP Business/NNP Machines/NNP Corp./NNP ] [ Northeast/NNP Utilities/NNP ] [ Eastern/NNP Utilities/NNPS ] [ Duchossois/NNP Industries/NNPS Inc./NNP ] [ Free/NNP State/NNP Glass/NNP Industries/NNP ] perl -ne 'while (m!(\S+?s)/NNPS!g) {$s{$1}+=1}; while (m!(\S+?s)/NNP\b!g) {if ($s{$1}) {$t{$1}+=1}}; END {foreach $k (keys %t) {print "$k\n"}}' hw5.pos

7 Homework 5 Review Question 1: How many proper noun (tokens) are there?
matching for NNP includes NNPS (both are proper nouns) perl -ne 'while (m!/NNP!g) {$c+=1} END {print "$c\n"}' hw5.pos 5199 python3 -c 'import sys, re; f = open(sys.argv[1]); print(len([m for g in (re.findall(r"\b\S+?/NNP",l) for l in f) if g for m in g]))' hw5.pos re. findall(regex,string) The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result. Example: … ['Las/NNP', 'Vegas/NNP'], ['Robert/NNP', 'Gerhard/NNP', 'Smith/NNP'], ['Carson/NNP', 'City/NNP'], ['Nev./NNP'], ['Mr./NNP', 'Cutrer/NNP'], ['U.S./NNP'], ['London/NNP'], ['Tokyo/NNP'], ['Japan/NNP'], ['U.S./NNP'], …

8 Homework 5 Review Question 2: How many different proper nouns (types)?
perl -ne 'while (m!([^/\s]+?)/NNP!g) {$w{$1}+=1} END {$c = keys %w; print "$c\n"}' hw5.pos 1689 1689[^/\s]+? vs \S+ Examples: [ Macmillan\/McGraw-Hill/NNP School/NNP Publishing/NNP Co./NNP 's/POS Scoring/NNP High/NNP ] [ the/DT Iran\/Contra/NNP affair/NN ] [ Bridgestone\/Firestone/NNP Inc/NNP ] python3 -c 'import sys, re; f = open(sys.argv[1]); print(len(set([m for g in (re.findall(r"\b(\S+?)/NNP",l) for l in f) if g for m in g])))' hw5.pos 1690

9 Homework 5 Review Question 3: How many different plural proper nouns (types)? perl -ne 'while (m!([^/\s]+?)/NNPS!g) {$w{$1}+=1} END {$c = keys %w; print "$c\n"}' hw5.pos 47 (94 total) python3 -c 'import sys, re; f = open(sys.argv[1]); print(len(set([m for g in (re.findall(r"\b(\S+?)/NNPS",l) for l in f) if g for m in g])))' hw5.pos 47

10 Homework 5 Review Question 4: What is the most frequent proper noun and its frequency? perl -ne 'while (m!([^/\s]+?)/NNP!g) {$w{$1}+=1} END = sort {$w{$b} <=> $w{$a}} keys %w; print "$a[0]:$w{$a[0]}\n"}' hw5.pos Mr.:223 Question 5: What is the most frequent plural proper noun and its frequency? perl -ne 'while (m!([^/\s]+?)/NNPS!g) {$w{$1}+=1} END = sort {$w{$b} <=> $w{$a}} keys %w; print "$a[0]:$w{$a[0]}\n"}' hw5.pos Containers:16

11 Homework 5 Review Question 6: Print the top 5 proper nouns and frequencies perl -ne 'while (m!([^/\s]+?)/NNP!g) {$w{$1}+=1} END = sort {$w{$b} <=> $w{$a}} keys %w; foreach $k { printf "%-8s %d\n", $k, $w{$k}}}' hw5.pos Mr.      223 U.S.     148 New      93 Japan    67 Corp.    59

12 Homework 5 Review Determiners (DT):
perl -ne 'while (m!([^/\s]+?)/DT!g) {$w{lc($1)}+=1} END = sort {$w{$b} <=> $w{$a}} keys %w; foreach $k { printf "%-8s %d\n", $k, $w{$k}}}' hw5.pos the      2277 a        994 an       164 this     104 some     67 that     55 these    50 any      49 all      47 no       46 those 32 another 31 both 28 each 21 every 11 neither 7 half 3 la 1 le 1 either 1 del 1

13 Homework 5 Review

14 Regular Languages Three formalisms
All formally equivalent (no difference in expressive power) i.e. if you can encode it using a RE, you can do it using a FSA or regular grammar, and so on … Regular Languages Note: Perl regexs are more powerful than the math characterization, e.g. backreferences, recursive regexs Regular Grammars FSA Expressions talk about formal equivalence next time

15 Regular Languages A regular language is the set of strings
(including possibly the empty string) (set itself could also be empty) (set can be infinite) generated by a RE/FSA/Regular Grammar Note: in formal language theory: a language = set of strings

16 Regular Languages Example: L is a regular language Note: Notation:
Language: L = { a+b+ } “one or more a’s followed by one or more b’s” L is a regular language described by a regular expression (we’ll define it formally next time) Note: infinite set of strings belonging to language L e.g. abbb, aaaab, aabb, *abab, * Notation: is the empty string (or string with zero length), sometimes ε is used instead * means string is not in the language

17 Finite State Automata (FSA)
L = { a+b+ } can be also be generated by the following FSA s x y a b > Conventions (used here): > Indicates start state Red circle indicates end (accepting) state we accept a input string only when we’re in an end state and we’re at the end of the string

18 Finite State Automata (FSA)
L = { a+b+ } can be also be generated by the following FSA s x y a b > There is a natural correspondence between components of the FSA and the regex defining L Note: L = {a+b+} L = {aa*bb*}

19 Finite State Automata (FSA)
L = { a+b+ } can be also be generated by the following FSA s x y a b > deterministic FSA (DFSA) no ambiguity about where to go at any given state i.e. for each input symbol in the alphabet at any given state, there is a unique “action” to take non-deterministic FSA (NDFSA) no restriction on ambiguity (surprisingly, no increase in power)

20 Finite State Automata (FSA)
more formally (Q,s,f,Σ,) set of states (Q): {s,x,y} must be a finite set start state (s): s end state(s) (f): y alphabet (Σ): {a, b} transition function : signature: character × state → state (a,s)=x (a,x)=x (b,x)=y (b,y)=y s x y a b >

21 Finite State Automata (FSA)
In Perl transition function : (a,s)=x (a,x)=x (b,x)=y (b,y)=y We can simulate our 2D transition table using a hash whose elements are themselves (anonymized) hashes %transitiontable = ( s => { a => "x" }, x => { a => "x", b => "y" y => { } ); Example: print "$transitiontable{s}{a}\n"; s x y a b > Syntactic sugar for %transitiontable = ( "s", { "a", "x", }, "x", { "a", "x" , "b", "y" }, "y", { "b", "y" }, );

22 Finite State Automata (FSA)
Given transition table encoded as a hash How to build a decider (Accept/Reject) in Perl? Complications: How about ε-transitions? Multiple end states? Multiple start states? Non-deterministic FSA?

23 Finite State Automata (FSA)
%transitiontable = ( s => { a => "x" }, x => { a => "x", b => "y" y => { } $state = "s"; foreach $c { $state = $transitiontable{$state}{$c}; if ($state eq "y") { print "Accept\n"; } else { print "Reject\n" } Example runs: perl fsm.prl a b a b Reject perl fsm.prl a a a b b Accept

24 Finite State Automata (FSA)
this is pseudo-code not any real programming language but can be easily translated

25 In Python Python dictionary = Perl hash
key:value sys.argv = @ARGV but numbered from 1 tab indentation used for grouping statements

26 In Python Python has a data structure called a tuple: (e1,..,en)
Note: Python lists use [..] In Python, tuples (but not lists) can also be dictionary keys Note: Many other ways of encoding FSA in Python, e.g. using object-oriented programming (classes)

27 Finite State Automata (FSA)
practical applications can be encoded and run efficiently on a computer widely used encode regular expressions (e.g. Perl regex) morphological analyzers Different word forms, e.g. want, wanted, unwanted (suffixation/prefixation) see chapter 3 of textbook speech recognizers Markov models = FSA + probabilities and much more …


Download ppt "LING/C SC/PSYC 438/538 Lecture 15 Sandiway Fong."

Similar presentations


Ads by Google