1
LING 438/538 Computational Linguistics Sandiway Fong Lecture 10: 9/26
2
2 Administrivia reminder –no class this Thursday
3
3 Last Time
+ left & right recursive rules
introduced Finite State Automata (FSA) and regular expressions (RE)
formally equivalent in terms of generative capacity or power
–Regular Grammars, FSA and Regular Expressions all describe the Regular Languages
regular grammar
  s --> [a],b.
  b --> [a],b.
  b --> [b],c.
  b --> [b].
  c --> [b],c.
  c --> [b].
regular expression
  a+b+
[FSA diagram: s -a-> x, x -a-> x, x -b-> y, y -b-> y]
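The claimed equivalence can be spot-checked by brute force. The following is a minimal Python sketch (not part of the lecture, which uses Prolog): it assumes the machine has states s, x, y with y as the only end state, as in the diagram, and compares it with the regular expression a+b+ on every string over {a, b} up to a chosen length. The names `DELTA`, `fsa_accepts`, `regex_accepts` and `agree_up_to` are made up for this illustration.

```python
import re
from itertools import product

# Transition table for the machine in the diagram: (state, symbol) -> next state.
DELTA = {("s", "a"): "x", ("x", "a"): "x", ("x", "b"): "y", ("y", "b"): "y"}
END_STATES = {"y"}


def fsa_accepts(text: str) -> bool:
    """Run the deterministic machine; a missing transition means rejection."""
    state = "s"
    for symbol in text:
        state = DELTA.get((state, symbol))
        if state is None:
            return False
    return state in END_STATES


def regex_accepts(text: str) -> bool:
    """Whole-string match against the regular expression a+b+."""
    return re.fullmatch(r"a+b+", text) is not None


def agree_up_to(max_len: int) -> bool:
    """Compare both recognizers on every string over {a, b} up to max_len."""
    return all(
        fsa_accepts("".join(chars)) == regex_accepts("".join(chars))
        for n in range(max_len + 1)
        for chars in product("ab", repeat=n)
    )
```

Agreement on short strings is evidence, not a proof, that the two descriptions define the same language.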
4
4 Last Time
FSA
–gave a formal definition (Q,s,f,Σ,δ)
many practical applications
–can be encoded and run efficiently on a computer
–implement regular expressions
–compress large dictionaries
–build morphological analyzers (suffixation): see chapter 3 of textbook
–speech recognizers: (Hidden) Markov models = FSA + probabilities
5
5 Today’s Lecture from the textbook –Chapter 2: Regular Expressions and Finite State Automata how to implement FSA –in Prolog –from first principles –two ways
6
6 Regular Expressions
pattern-matching using regular expressions
–important tool in automated searching
popular implementations
–Unix grep: returns lines matching a regular expression
–standard part of all Unix-based systems, including MacOS X (command-line interface in Terminal)
–many shareware/freeware implementations available for Windows XP (just Google and see...)
–wildcard search in Microsoft Word: limited version with differences in notation
7
7 Regular Expressions One of the most popular programs for searching files and returning lines that match a regular expression pattern is called GREP –name comes from Unix ed command g/re/p –“search globally for lines matching the regular expression, and print them” –[Source: http://en.wikipedia.org/wiki/Grep]
8
8 Regular Expressions: GNU grep
terminology:
–metacharacter: character with special meaning, not interpreted literally, e.g. ^ vs. a
–must be quoted or escaped using the backslash \ to get literal meaning, e.g. \^
excerpts from the manpage
–A list of characters enclosed by [ and ] matches any single character in that list;
–if the first character of the list is the caret ^ then it matches any character not in the list.
Examples
–the regular expression [0123456789] matches any single digit.
–A range of characters may be specified by giving the first and last characters, separated by a hyphen.
–[0-9]
–[a-z]
–[A-Za-z]
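Bracket lists, ranges, negation and escaping can be tried out interactively. The sketch below uses Python's `re` module rather than grep (an assumption made for convenience; the bracket notation shown here is the same in both). The helper `first_match` and the sample strings are made up for this illustration.

```python
import re


def first_match(pattern: str, text: str):
    """Return the first substring of text that matches pattern, or None."""
    found = re.search(pattern, text)
    return found.group(0) if found else None


digit = first_match(r"[0-9]", "room 42")      # a range: any single digit
non_digit = first_match(r"[^0-9]", "42nd")    # leading ^ negates the list
letter = first_match(r"[A-Za-z]", "42nd")     # two ranges in one list
literal_caret = first_match(r"\^", "x^2")     # backslash gives ^ its literal meaning
```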
9
9 Regular Expressions: grep
excerpts from the manpage
–The caret ^ and the dollar sign $ are metacharacters that respectively match the empty string at the beginning and end of a line.
–The symbol \b matches the empty string at the edge of a word.
–The symbols \< and \> respectively match the empty string at the beginning and end of a word.
–The period . matches any single character.
–Finally, certain named classes of characters are predefined. Their names are self explanatory, and they are [:alnum:], [:alpha:], [:cntrl:], [:digit:], [:graph:], [:lower:], [:print:], [:punct:], [:space:], [:upper:], and [:xdigit:]. For example, [[:alnum:]] (alphanumeric) means [0-9A-Za-z].
–The symbol \w is a synonym for [_[:alnum:]] and \W is a synonym for [^_[:alnum:]].
terminology
–word: unbroken sequence of digits, underscores and letters
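The anchors and the word-edge symbol can be illustrated with a short Python sketch. This uses Python's `re` module, not grep: Python supports ^, $, \b, . and \w, but it has no \< or \> and no named classes such as [[:alnum:]], so only the shared subset is shown. The sample line is made up.

```python
import re

line = "the cat_2 sat"

starts_with_the = re.search(r"^the", line) is not None   # ^ anchors at the start of the line
ends_with_sat = re.search(r"sat$", line) is not None     # $ anchors at the end of the line
cat_at_start = re.search(r"^cat", line) is not None      # "cat" occurs, but not at the start

# \b matches the empty string at a word edge. "cat_2" counts as one word
# because letters, digits and underscores are all word characters.
words = re.findall(r"\b\w+\b", line)

# . matches any single character (in Python, any character except a newline by default).
dot_matches = re.fullmatch(r"c.t", "cat") is not None
```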
10
10 Regular Expressions: grep
Excerpts from the manpage
–A regular expression may be followed by one of several repetition operators:
  ?      The preceding item is optional and matched at most once.
  *      The preceding item will be matched zero or more times.
  +      The preceding item will be matched one or more times.
  {n}    The preceding item is matched exactly n times.
  {n,}   The preceding item is matched n or more times.
  {n,m}  The preceding item is matched at least n times, but not more than m times.
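Each repetition operator can be checked against a small set of strings. The sketch below uses Python's `re` module, whose repetition operators are written the same way as in grep's extended syntax (with `grep -E`; in grep's basic syntax ?, + and the braces need a backslash). The list `CANDIDATES` and the helper `accepted` are made up for this illustration.

```python
import re

CANDIDATES = ["", "a", "aa", "aaa", "aaaa"]


def accepted(pattern: str) -> list:
    """Return the candidate strings that the pattern matches in full."""
    return [text for text in CANDIDATES if re.fullmatch(pattern, text)]


optional = accepted(r"a?")           # at most once
zero_or_more = accepted(r"a*")       # zero or more times
one_or_more = accepted(r"a+")        # one or more times
exactly_two = accepted(r"a{2}")      # exactly n times
two_or_more = accepted(r"a{2,}")     # n or more times
two_to_three = accepted(r"a{2,3}")   # at least n, at most m times
```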
11
11 Regular Expressions: GNU grep
Excerpts from the manpage
concatenation
–Two regular expressions may be concatenated; the resulting regular expression matches any string formed by concatenating two substrings that respectively match the concatenated subexpressions.
disjunction
–Two regular expressions may be joined by the infix operator |; the resulting regular expression matches any string matching either subexpression.
12
12 Regular Expressions: Examples
Regular Expression
–\b99
–matches 99 in “there are 99 bottles …” but not in “there are 299 bottles …”
–Note: in $99 the $ is not a word character, so a word edge falls between $ and 99, and \b99 will match 99 here
Regular Expression
–beds?
examples
–bed
–beds
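Both examples can be reproduced in a few lines. This is a Python sketch using the `re` module (an assumption; the slide is about grep, but \b and ? behave the same way for these cases). The helper names `finds_99` and `is_bed_form` are made up.

```python
import re


def finds_99(text: str) -> bool:
    """True if 99 occurs in text starting at a word edge."""
    return re.search(r"\b99", text) is not None


def is_bed_form(word: str) -> bool:
    """True if the whole word is bed or beds (the final s is optional)."""
    return re.fullmatch(r"beds?", word) is not None


in_99_bottles = finds_99("there are 99 bottles")     # word edge before 99
in_299_bottles = finds_99("there are 299 bottles")   # 99 is preceded by the digit 2
in_price = finds_99("it costs $99")                  # $ is not a word character
```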
13
13 Regular Expressions: Examples
example
–guppy
–guppies
Regular Expression
–gupp(y|ies)
–| = disjunction
–( ) = parentheses indicate scope
example
–the (whole word, case insensitive)
–the25
Regular Expression (pg. 29)
–(^|[^a-zA-Z])[tT]he[^a-zA-Z]
–^ = beginning of line
–[^ ] = negation
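The two patterns can be exercised on a handful of inputs. The following Python sketch uses the `re` module with the patterns copied from the slide; the helper names and test strings are made up for this illustration.

```python
import re

GUPPY = re.compile(r"gupp(y|ies)")
THE = re.compile(r"(^|[^a-zA-Z])[tT]he[^a-zA-Z]")


def is_guppy_form(word: str) -> bool:
    """True if the whole word is guppy or guppies."""
    return GUPPY.fullmatch(word) is not None


def has_the(line: str) -> bool:
    """True if the line contains the/The not embedded in a longer run of letters."""
    return THE.search(line) is not None


plain = has_the("the cat")        # ^ satisfies the left context
embedded = has_the("other")       # letters on both sides, so no match
with_digits = has_the("the25")    # a digit is an acceptable right context
at_line_end = has_the("see the")  # nothing follows, so [^a-zA-Z] has nothing to consume
```

As written, the pattern requires one non-letter character after the word, so an occurrence at the very end of a line is not matched.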
14
14 Regular Expressions: Microsoft Word terminology: –wildcard search
15
15 Regular Expressions: Microsoft Word Note: zero or more times is missing in Microsoft Word
16
16 Finite State Automata (FSA)
more formally
–(Q,s,f,Σ,δ)
1. set of states (Q): {s,x,y} (must be a finite set)
2. start state (s): s
3. end state(s) (f): y
4. alphabet (Σ): {a, b}
5. transition function δ
   signature: character × state → state
   δ(a,s)=x
   δ(a,x)=x
   δ(b,x)=y
   δ(b,y)=y
[diagram: s -a-> x, x -a-> x, x -b-> y, y -b-> y]
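The five components map directly onto a small data structure. The sketch below is in Python (the lecture itself implements FSA in Prolog); the class name `FSA`, its field names and the two methods are made up for this illustration. The transition table is keyed by (character, state), following the signature given on the slide.

```python
from dataclasses import dataclass


@dataclass
class FSA:
    states: set       # Q
    start: str        # s
    end_states: set   # f
    alphabet: set     # Sigma
    delta: dict       # transition function: (character, state) -> state

    def is_well_formed(self) -> bool:
        """Check that every component only mentions known states and symbols."""
        if self.start not in self.states or not self.end_states <= self.states:
            return False
        return all(
            symbol in self.alphabet and state in self.states and target in self.states
            for (symbol, state), target in self.delta.items()
        )

    def accepts(self, text: str) -> bool:
        """Follow delta one character at a time; a missing entry means rejection."""
        state = self.start
        for symbol in text:
            if (symbol, state) not in self.delta:
                return False
            state = self.delta[(symbol, state)]
        return state in self.end_states


MACHINE = FSA(
    states={"s", "x", "y"},
    start="s",
    end_states={"y"},
    alphabet={"a", "b"},
    delta={("a", "s"): "x", ("a", "x"): "x", ("b", "x"): "y", ("b", "y"): "y"},
)
```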
17
17 Finite State Automata (FSA)
directly implement the formal definition
–define a predicate fsa/2
–takes two arguments
–S = a start state
–L = string (as a list) we’re interested in testing
Prolog code (for any FSA)
  fsa(S,L) :- L = [C|M], transition(S,C,T), fsa(T,M).
  fsa(E,[]) :- end_state(E).
18
18 Finite State Automata (FSA)
Prolog code (for any FSA)
  fsa(S,L) :- L = [C|M], transition(S,C,T), fsa(T,M).
  fsa(E,[]) :- end_state(E).
Facts (FSA-particular)
  end_state(y).
  transition(s,a,x).
  transition(x,a,x).
  transition(x,b,y).
  transition(y,b,y).
transition function δ:
  δ(a,s)=x
  δ(a,x)=x
  δ(b,x)=y
  δ(b,y)=y
[diagram: s -a-> x, x -a-> x, x -b-> y, y -b-> y]
19
19 Finite State Automata (FSA)
computation tree
?- fsa(s,[a,a,b]).
   ?- transition(s,a,T).      T=x
   ?- fsa(x,[a,b]).
      ?- transition(x,a,T').     T'=x
      ?- fsa(x,[b]).
         ?- transition(x,b,T'').    T''=y
         ?- fsa(y,[]).
            ?- end_state(y).
            Yes
Prolog code
  fsa(S,L) :- L = [C|M], transition(S,C,T), fsa(T,M).
  fsa(E,[]) :- end_state(E).
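The sequence of fsa/2 calls in the computation tree can be reproduced with a recursive recognizer that records each call. This is a Python sketch that mirrors the two Prolog clauses, not the Prolog program itself; the names `TRANSITIONS`, `END_STATES` and the `trace` parameter are made up for this illustration, and the machine is assumed to be the deterministic one with facts transition(s,a,x), transition(x,a,x), transition(x,b,y), transition(y,b,y) and end_state(y).

```python
# (state, character) -> target state, one entry per transition/3 fact.
TRANSITIONS = {("s", "a"): "x", ("x", "a"): "x", ("x", "b"): "y", ("y", "b"): "y"}
END_STATES = {"y"}


def fsa(state: str, symbols: list, trace: list) -> bool:
    """Recognize symbols starting from state, appending each call to trace."""
    trace.append((state, list(symbols)))
    if not symbols:                               # fsa(E,[]) :- end_state(E).
        return state in END_STATES
    first, rest = symbols[0], symbols[1:]         # L = [C|M]
    target = TRANSITIONS.get((state, first))      # transition(S,C,T)
    if target is None:
        return False
    return fsa(target, rest, trace)               # fsa(T,M)


calls = []
result = fsa("s", ["a", "a", "b"], calls)
```

After the call, `calls` holds one (state, remaining input) pair per fsa/2 goal in the tree.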
20
20 Finite State Automata (FSA)
deterministic FSA (DFSA)
–no ambiguity about where to go at any given state
non-deterministic FSA (NDFSA)
–no restriction on ambiguity (surprisingly, no increase in formal power)
textbook
–D-RECOGNIZE (FIGURE 2.13)
–ND-RECOGNIZE (FIGURE 2.21)
Prolog code
  fsa(S,L) :- L = [C|M], transition(S,C,T), fsa(T,M).
  fsa(E,[]) :- end_state(E).
21
21 Finite State Automata (FSA)
Prolog
–no change in code
–Prolog computation rule takes care of choice point management
example
–one change
–“a” instead of “b” from x to y
–non-deterministic
–what regular language does this machine accept?
Prolog code
  fsa(S,L) :- L = [C|M], transition(S,C,T), fsa(T,M).
  fsa(E,[]) :- end_state(E).
[diagram: s -a-> x, x -a-> x, x -a-> y, y -b-> y]
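Outside Prolog the choice points have to be handled explicitly. The following Python sketch imitates Prolog's behaviour by trying each matching transition in turn and moving on to the next one when a branch fails. It assumes the modified machine has the facts transition(s,a,x), transition(x,a,x), transition(x,a,y), transition(y,b,y) and end_state(y); the names `TRANSITIONS`, `END_STATES` and `fsa` are chosen for this illustration.

```python
# One (source, character, target) triple per transition/3 fact.
# Two facts share source x and character a, which makes the machine non-deterministic.
TRANSITIONS = [("s", "a", "x"), ("x", "a", "x"), ("x", "a", "y"), ("y", "b", "y")]
END_STATES = {"y"}


def fsa(state: str, symbols: str) -> bool:
    """Depth-first search over all transition choices, in the order listed."""
    if not symbols:
        return state in END_STATES
    first, rest = symbols[0], symbols[1:]
    # any() stops at the first branch that succeeds; a failed branch simply
    # yields False and the next candidate is tried, which plays the role of
    # backtracking to a choice point.
    return any(
        fsa(target, rest)
        for source, symbol, target in TRANSITIONS
        if source == state and symbol == first
    )
```

Calling `fsa("s", ...)` on a few short strings is one way to explore the question of which language the machine accepts.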
22
22 Finite State Automata (FSA)
another possible Prolog encoding strategy
–define one predicate for each state
–each predicate takes one argument (the input string)
–consume input character
–call next state with remaining input string
query
–?- s(L).
–call start state s
23
23 Finite State Automata (FSA)
–state s: (start state)
  s([a|L]) :- x(L).
  match input string beginning with a and call state x with remainder of input
–state x:
  x([a|L]) :- x(L).
  x([b|L]) :- y(L).
–state y: (end state)
  y([]).
  y([b|L]) :- y(L).
[diagram: s -a-> x, x -a-> x, x -b-> y, y -b-> y]
24
24 Finite State Automata (FSA)
example:
1. ?- s([a,a,b]).
2. ?- x([a,b]).
3. ?- x([b]).
4. ?- y([]).
Yes
Prolog code
  s([a|L]) :- x(L).
  x([a|L]) :- x(L).
  x([b|L]) :- y(L).
  y([]).
  y([b|L]) :- y(L).
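The one-predicate-per-state strategy carries over to other languages as one function per state. The sketch below is a Python rendering of the five clauses (an illustration, not the lecture's code); each function consumes one character and calls the function for the next state with the remaining input.

```python
def s(text: str) -> bool:
    """Start state: consume an a and continue in state x."""
    return text.startswith("a") and x(text[1:])


def x(text: str) -> bool:
    """State x: loop on a, move to state y on b."""
    if text.startswith("a"):
        return x(text[1:])
    if text.startswith("b"):
        return y(text[1:])
    return False


def y(text: str) -> bool:
    """End state: accept when the input is used up, otherwise loop on b."""
    if text == "":
        return True
    return text.startswith("b") and y(text[1:])
```

Here the query ?- s([a,a,b]). corresponds to the call `s("aab")`.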
25
25 Finite State Automata (FSA)
Note:
–non-deterministic properties of Prolog’s computation rule still apply here
26
26 Finite State Automata (FSA)
example
1. ?- s([a,b,a]).
2. ?- x([b,a]).
3. ?- y([a]).
No
Prolog code
  s([a|L]) :- x(L).
  x([a|L]) :- x(L).
  x([b|L]) :- y(L).
  y([]).
  y([b|L]) :- y(L).
27
27 Next Time
bit more on FSA...
read (if you haven’t yet)
–Chapter 3: Morphology and Finite State Transducers