LING 438/538 Computational Linguistics Sandiway Fong Lecture 10: 9/26.

Slides:



Advertisements
Similar presentations
LING/C SC/PSYC 438/538 Lecture 11 Sandiway Fong. Administrivia Homework 3 graded.
Advertisements

LING/C SC/PSYC 438/538 Computational Linguistics Sandiway Fong Lecture 13: 10/9.
LING 388: Language and Computers Sandiway Fong Lecture 9: 9/27.
LING 388: Language and Computers Sandiway Fong 9/29 Lecture 11.
LING 438/538 Computational Linguistics Sandiway Fong Lecture 8: 9/29.
1 Introduction to Computability Theory Lecture3: Regular Expressions Prof. Amos Israeli.
LING/C SC/PSYC 438/538 Computational Linguistics Sandiway Fong Lecture 12: 10/4.
LING 388: Language and Computers Sandiway Fong Lecture 12: 10/5.
LING 388 Language and Computers Lecture 14 10/16/03 Sandiway FONG.
LING/C SC/PSYC 438/538 Computational Linguistics Sandiway Fong Lecture 2: 8/23.
LING 388: Language and Computers Sandiway Fong Lecture 2: 8/23.
LING 388 Language and Computers Lecture 4 9/11/03 Sandiway FONG.
LING 438/538 Computational Linguistics Sandiway Fong Lecture 11: 10/3.
Computational Language Finite State Machines and Regular Expressions.
LING 388: Language and Computers Sandiway Fong Lecture 6: 9/13.
LING 388: Language and Computers Sandiway Fong Lecture 11: 10/3.
LING 388: Language and Computers Sandiway Fong Lecture 3: 8/28.
LING 438/538 Computational Linguistics Sandiway Fong Lecture 12: 10/5.
1 Foundations of Software Design Lecture 23: Finite Automata and Context-Free Grammars Marti Hearst Fall 2002.
LING 388: Language and Computers Sandiway Fong Lecture 10: 9/26.
LING 388 Language and Computers Lecture 7 9/23/03 Sandiway FONG.
LING/C SC/PSYC 438/538 Computational Linguistics Sandiway Fong Lecture 11: 10/2.
LING 438/538 Computational Linguistics Sandiway Fong Lecture 14: 10/12.
Regular Expressions and Automata Chapter 2. Regular Expressions Standard notation for characterizing text sequences Used in all kinds of text processing.
Regular Expressions Comp 2400: Fall 2008 Prof. Chris GauthierDickey.
Finite State Machines Data Structures and Algorithms for Information Processing 1.
LING 388 Language and Computers Lecture 6 9/18/03 Sandiway FONG.
Regular Expressions in ColdFusion Applications Dave Fauth DOMAIN technologies Knowledge Engineering : Systems Integration : Web.
Topic #3: Lexical Analysis
CPSC 388 – Compiler Design and Construction Scanners – Finite State Automata.
Finite-State Machines with No Output Longin Jan Latecki Temple University Based on Slides by Elsa L Gunter, NJIT, and by Costas Busch Costas Busch.
Finite-State Machines with No Output
Chapter 2. Regular Expressions and Automata From: Chapter 2 of An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition,
PZ02B Programming Language design and Implementation -4th Edition Copyright©Prentice Hall, PZ02B - Regular grammars Programming Language Design.
REGULAR EXPRESSIONS. Lexical Analysis Lexical analysers can be constructed by programs such as LEX These programs employ as input a description of the.
Introduction to CS Theory Lecture 3 – Regular Languages Piotr Faliszewski
1 Regular Expressions. 2 Regular expressions describe regular languages Example: describes the language.
LING 388: Language and Computers Sandiway Fong 9/20 Lecture 8.
LING/C SC/PSYC 438/538 Lecture 7 9/15 Sandiway Fong.
Introduction to Unix – CS 21 Lecture 6. Lecture Overview Homework questions More on wildcards Regular expressions Using grep Quiz #1.
LING 388: Language and Computers Sandiway Fong Lecture 6: 9/15.
COMP313A Programming Languages Lexical Analysis. Lecture Outline Lexical Analysis The language of Lexical Analysis Regular Expressions.
COP 4620 / 5625 Programming Language Translation / Compiler Writing Fall 2003 Lecture 3, 09/11/2003 Prof. Roy Levow.
COMP3190: Principle of Programming Languages DFA and its equivalent, scanner.
Corpus Linguistics- Practical utilities (Lecture 7) Albert Gatt.
2. Regular Expressions and Automata 2007 년 3 월 31 일 인공지능 연구실 이경택 Text: Speech and Language Processing Page.33 ~ 56.
LING 388: Language and Computers Sandiway Fong 9/27 Lecture 10.
Sys Prog & Scrip - Heriot Watt Univ 1 Systems Programming & Scripting Lecture 12: Introduction to Scripting & Regular Expressions.
May 2008CLINT-LIN Regular Expressions1 Introduction to Computational Linguistics Regular Expressions (Tutorial derived from NLTK)
ICS312 LEX Set 25. LEX Lex is a program that generates lexical analyzers Converting the source code into the symbols (tokens) is the work of the C program.
Overview of Previous Lesson(s) Over View  Symbol tables are data structures that are used by compilers to hold information about source-program constructs.
CSCI 330 UNIX and Network Programming Unit IV Shell, Part 2.
Natural Language Processing Lecture 4 : Regular Expressions and Automata.
LING/C SC/PSYC 438/538 Lecture 15 Sandiway Fong. Did you install SWI Prolog?
What is grep ?  % man grep  DESCRIPTION  The grep utility searches text files for a pattern and prints all lines that contain that pattern. It uses.
1 Compiler Construction (CS-636) Muhammad Bilal Bashir UIIT, Rawalpindi.
BİL711 Natural Language Processing1 Regular Expressions & FSAs Any regular expression can be realized as a finite state automaton (FSA) There are two kinds.
CS 154 Formal Languages and Computability February 11 Class Meeting Department of Computer Science San Jose State University Spring 2016 Instructor: Ron.
CS 404Ahmed Ezzat 1 CS 404 Introduction to Compiler Design Lecture 1 Ahmed Ezzat.
Gollis University Faculty of Computer Engineering Chapter Five: Retrieval, Functions Instructor: Mukhtar M Ali “Hakaale” BCS.
May 2006CLINT-LIN Regular Expressions1 Introduction to Computational Linguistics Regular Expressions (Tutorial derived from NLTK)
ICS611 Lex Set 3. Lex and Yacc Lex is a program that generates lexical analyzers Converting the source code into the symbols (tokens) is the work of the.
Akram Salah ISSR Basic Concepts Languages Grammar Automata (Automaton)
COMP3190: Principle of Programming Languages DFA and its equivalent, scanner.
Department of Software & Media Technology
Topic 3: Automata Theory 1. OutlineOutline Finite state machine, Regular expressions, DFA, NDFA, and their equivalence, Grammars and Chomsky hierarchy.
Looking for Patterns - Finding them with Regular Expressions
Chapter 2 Scanning – Part 1 June 10, 2018 Prof. Abdelaziz Khamis.
LING/C SC/PSYC 438/538 Lecture 11 Sandiway Fong.
LING/C SC/PSYC 438/538 Lecture 15 Sandiway Fong.
Presentation transcript:

LING 438/538 Computational Linguistics Sandiway Fong Lecture 10: 9/26

2 Administrivia reminder –no class this Thursday

3 + left & right recursive rules Last Time introduced Finite State Automata (FSA) and regular expressions (RE) formally equivalent – in terms of generative capacity or power Regular Grammars FSA Regular Expressions Regular Languages regular grammar s --> [a],b. b --> [a],b. b --> [b],c. b --> [b]. c --> [b],c. c --> [b]. regular expression a + b + sx y a a b b

4 Last Time FSA –gave a formal definition (Q,s,f,Σ,  ) many practical applications –can be encoded and run efficiently on a computer –implement regular expressions –compress large dictionaries –build morphological analyzers (suffixation) see chapter 3 of textbook –speech recognizers (Hidden) Markov models = FSA + probabilities

5 Today’s Lecture from the textbook –Chapter 2: Regular Expressions and Finite State Automata how to implement FSA –in Prolog –from first principles –two ways

6 Regular Expressions pattern-matching using regular expressions –important tool in automated searching popular implementations –Unix grep returns lines matching a regular expression standard part of all Unix-based systems –including MacOS X (command-line interface in Terminal) many shareware/freeware implementations available for Windows XP –just Google and see... –wildcard search in Microsoft Word limited version with differences in notation

7 Regular Expressions One of the most popular programs for searching files and returning lines that match a regular expression pattern is called GREP –name comes from Unix ed command g/re/p –“search globally for lines matching the regular expression, and print them” –[Source:

8 Regular Expressions: GNU grep terminology: –metacharacter character with special meaning, not interpreted literally, e.g. ^ vs. a must be quoted or escaped using the backslash \ to get literal meaning, e.g. \^ excerpts from the manpage –A list of characters enclosed by [ and ] matches any single character in that list; –if the first character of the list is the caret ^ then it matches any character not in the list. Examples –the regular expression [ ] matches any single digit. –A range of characters may be specified by giving the first and last characters, separated by a hyphen. –[0-9] –[a-z] –[A-Za-z]

9 Regular Expressions: grep excerpts from the manpage –The caret ^ and the dollar sign $ are metacharacters that respectively match the empty string at the beginning and end of a line. –The symbol \b matches the empty string at the edge of a word –The symbols \ respectively match the empty string at the beginning and end of a word. –The period. matches any single character. –Finally, certain named classes of characters are predefined. Their names are self explanatory, and they are [:alnum:], [:alpha:], [:cntrl:], [:digit:], [:graph:], [:lower:], [:print:], [:punct:], [:space:], [:upper:], and [:xdigit:]. For example, [[:alnum:]] (alphanumeric) means [0-9A- Za-z] –The symbol \w is a synonym for [[:alnum:]] and –\W is a synonym for [^[:alnum]]. terminology –word unbroken sequence of digits, underscores and letters

10 Regular Expressions: grep Excerpts from the manpage –A regular expression may be followed by one of several repetition operators: ? The preceding item is optional and matched at most once. * The preceding item will be matched zero or more times. + The preceding item will be matched one or more times. {n} The preceding item is matched exactly n times {n,} The preceding item is matched n or more times. {n,m} The preceding item is matched at least n times, but not more than m times.

11 Regular Expressions: GNU grep concatenation –Two regular expressions may be concatenated; the resulting regular expression matches any string formed by concatenating two substrings that respectively match the concatenated subexpressions. disjunction – Two regular expressions may be joined by the infix operator |; the resulting regular expression matches any string matching either subexpression. Excerpts from the manpage

12 Regular Expressions: Examples Regular Expression –\b99 matches 99 in “there are 99 bottles …” but not in “there are 299 bottles …” –Note: $99 contains two words so \b99 will match 99 here Regular Expression –beds? examples –bed –beds

13 Regular Expressions: Examples example –guppy –guppies Regular Expression –gupp(y|ies) –| = disjunction –( ) = parentheses indicate scope example –the (whole word, case insensitive) –the25 Regular Expression (pg. 29) (^|[^a-zA-Z])[tT]he[^a- zA-Z] –^ = beginning of line –[^ ] = negation

14 Regular Expressions: Microsoft Word terminology: –wildcard search

15 Regular Expressions: Microsoft Word Note: zero or more times is missing in Microsoft Word

16 Finite State Automata (FSA) more formally –(Q,s,f,Σ,  ) 1.set of states (Q): {s,x,y}must be a finite set 2.start state (s): s 3.end state(s) (f): y 4.alphabet ( Σ ): {a, b} 5.transition function  : signature: character × state → state  (a,s)=x  (a,x)=x  (b,x)=y  (b,y)=y sx y a a b b

17 Finite State Automata (FSA) directly implement the formal definition –define a predicate fsa/2 –takes two arguments –S = a start state –L = string (as a list) we’re interested in testing Prolog code (for any FSA) –fsa(S,L) :- L = [C|M], transition(S,C,T), fsa(T,M). –fsa(E,[]):- end_state(E).

18 Finite State Automata (FSA) Prolog code (for any FSA) –fsa(S,L) :- L = [C|M], transition(S,C,T), fsa(T,M). –fsa(E,[]):- end_state(E). Facts (FSA-particular) –end_state(y). –transition(s,a,x). –transition(x,a,x). –transition(x,b,y). –transition(y,b,y). sx y a a b b transition function  :  (a,s)=x  (a,x)=x  (b,x)=y  (b,y)=y

19 Finite State Automata (FSA) computation tree ?- fsa(s,[a,a,b]). ?- transition(s,a,T). T=x ?- fsa(x,[a,b]). ?- transition(x,a,T’). T’=x ?- fsa(x,[b]). ?- transition(x,b,T”). T”=y ?- fsa(y,[]). ?- end_state(y). Yes fsa(S,L) :- L = [C|M], L = [C|M],transition(S,C,T),fsa(T,M). fsa(E,[]) :-. fsa(E,[]) :- end_state(E).

20 Finite State Automata (FSA) deterministic FSA (DFSA) –no ambiguity about where to go at any given state non-deterministic FSA (NDFSA) –no restriction on ambiguity (surprisingly, no increase in formal power) textbook –D-RECOGNIZE (FIGURE 2.13) –ND-RECOGNIZE (FIGURE 2.21) fsa(S,L) :- L = [C|M], L = [C|M],transition(S,C,T),fsa(T,M). fsa(E,[]) :-. fsa(E,[]) :- end_state(E).

21 Finite State Automata (FSA) Prolog –no change in code –Prolog computation rule takes care of choice point management example –one change –“a” instead of “b” from x to y –non-deterministic –what regular language does this machine accept? fsa(S,L) :- L = [C|M], L = [C|M],transition(S,C,T),fsa(T,M). fsa(E,[]) :-. fsa(E,[]) :- end_state(E). sx y a a b a

22 Finite State Automata (FSA) another possible Prolog encoding strategy –define one predicate for each state taking one argument (the input string) consume input character call next state with remaining input string –query ?- s(L). call start state s

23 Finite State Automata (FSA) –state s: (start state) s([a|L]) :- x(L). match input string beginning with a and call state x with remainder of input –state x: x([a|L]) :- x(L). x([b|L]) :- y(L). –state y: (end state) y([]). y([b|L]) :- y(L). sx y a a b b

24 Finite State Automata (FSA) example: 1.?- s([a,a,b]). 2.?- x([a,b]). 3.?- x([b]). 4.?- y([]). Yes s([a|L]) :- x(L). x([a|L]) :- x(L). x([b|L]) :- y(L). y([]). y([b|L]) :- y(L).

25 Finite State Automata (FSA) Note: –non-deterministic properties of Prolog’s computation rule still applies here

26 Finite State Automata (FSA) example 1.?- s([a,b,a]). 2.?- x([b,a]). 3.?- y([a]). No s([a|L]) :- x(L). x([a|L]) :- x(L). x([b|L]) :- y(L). y([]). y([b|L]) :- y(L).

27 Next Time bit more on FSA...read (if you haven’t yet) –Chapter 3: Morphology and Finite State Transducers