October 2007, Natural Language Processing
CSA3050: Natural Language Algorithms
Words and Finite State Machinery
Acknowledgement
  Material derived from/copied from
    – Jurafsky and Martin, Speech and Language Processing, Prentice Hall 2000
    – Richard Sproat, Lecture notes
Outline
  Words
  Regular Languages
  Regular Expressions
  Finite State Automata
What is a Word?
  A series of speech sounds that symbolizes meaning without being divisible into smaller units
  Any segment of written or printed discourse ordinarily appearing between spaces or between a space and a punctuation mark
  A set of linguistic forms produced by combining a single base with various inflectional elements without change in the part-of-speech elements
  The smallest meaningful element of language. When written it stands alone with a space on either side of it.
Information Associated with Words
  Spelling
    – orthographic
    – phonological
  Syntax
    – part of speech (POS)
    – valency
  Semantics
    – meaning
    – relationship to other words
Properties of Words
  Sequence
    – characters: pollution
    – phonemes
  Delimitation
    – whitespace
    – other?
  Structure
    – simple ("atomic") words
    – complex ("molecular") words
Complex Words
  Complex words have subparts, e.g. "enlargement": en + large + ment
  Some subparts are valid words: large
  Others are prefixes and suffixes: en, ment
  N.B. The complex word can be built in different ways:
    (en + large) + ment
    en + (large + ment)
Morphological Processes
  affixation
    – prefix
    – suffix
    – circumfix: għandi - mgħandix
    – infix: phenidine - phenetidine
  other morphological processes
    – redoubling (mexa; mexxa)
    – vowel change (swim; swam)
Complex Words Formed by Concatenation
  prefixes  +  roots    +  suffixes
  dis          large       ed
  re           charge      ing
  un           infect      ee
  en           code        er
               decide      ly
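A minimal sketch of the concatenation idea, assuming every word takes exactly one prefix, one root and one suffix (the slide does not say whether affixes are optional). The function name `concatenate` is made up for illustration.

```python
from itertools import product

# Affix and root lists copied from the slide.
PREFIXES = ["dis", "re", "un", "en"]
ROOTS = ["large", "charge", "infect", "code", "decide"]
SUFFIXES = ["ed", "ing", "ee", "er", "ly"]

def concatenate(prefixes, roots, suffixes):
    """Return every prefix + root + suffix string (pure concatenation)."""
    return ["".join(parts) for parts in product(prefixes, roots, suffixes)]

candidates = concatenate(PREFIXES, ROOTS, SUFFIXES)
print(len(candidates))   # 100 (4 * 5 * 5 combinations)
print(candidates[:3])    # ['dislargeed', 'dislargeing', 'dislargeee']
```

Pure concatenation overgenerates ("dislargeee") and ignores spelling changes at morpheme boundaries ("enlargeed" rather than "enlarged"), so it is only a starting point.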
The Language of Words
  What kind of formal language is the language of words?
  One which can be constructed out of
    – A characteristic set of basic symbols (alphabet)
    – A characteristic set of combining operations
        Union (disjunction)
        Concatenation
        Iteration
  Regular Language; Regular Sets
Regular Languages
  A regular language is a language over a finite alphabet that can be built from the empty set, the empty string and single symbols by finitely many applications of the following operations:
    – Set union
    – Concatenation
    – Iteration (Kleene star, i.e. zero or more repetitions)
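The three operations can be sketched on finite sets of strings. This is an illustration only: a true Kleene star is infinite, so the hypothetical `star` helper below takes a `max_repeats` bound that is not part of the definition.

```python
def union(a, b):
    """Set union of two languages."""
    return a | b

def concat(a, b):
    """Every string of a followed by every string of b."""
    return {x + y for x in a for y in b}

def star(a, max_repeats):
    """Bounded approximation of Kleene star: at most max_repeats pieces from a."""
    result = {""}          # zero repetitions gives the empty string
    layer = {""}
    for _ in range(max_repeats):
        layer = concat(layer, a)
        result |= layer
    return result

A = {"a"}
B = {"b"}
print(sorted(union(A, B)))           # ['a', 'b']
print(sorted(concat(A, B)))          # ['ab']
print(sorted(star(A, 3), key=len))   # ['', 'a', 'aa', 'aaa']
```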
Some things that are regular languages
  Zero or more a’s followed by zero or more b’s
  The set of words in an English dictionary
  Dates
  URLs
  English?
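Two of these examples can be written directly as patterns for Python's `re` module. The date format below (DD/MM/YYYY, with no calendar validation) is an assumption chosen for illustration, as is the helper name `matches`.

```python
import re

# Zero or more a's followed by zero or more b's.
A_STAR_B_STAR = re.compile(r"a*b*")

# Hypothetical date format: DD/MM/YYYY, digits only, no calendar validation.
SIMPLE_DATE = re.compile(r"[0-3][0-9]/[01][0-9]/[0-9]{4}")

def matches(pattern, text):
    """True if the whole of text belongs to the language of pattern."""
    return pattern.fullmatch(text) is not None

print(matches(A_STAR_B_STAR, "aabbb"))     # True
print(matches(A_STAR_B_STAR, "aba"))       # False
print(matches(SIMPLE_DATE, "31/10/2007"))  # True
```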
Some things that are not regular languages
  Zero or more a’s followed by exactly the same number of b’s
  The set of all English palindromes (e.g. Madam I'm Adam)
  The set that includes all sentences of the form
    – the cat slept
    – the cat the dog bit slept
    – the cat the dog the man fed bit slept
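The first example can be recognised with a counter, which hints at why no finite automaton suffices: the counter may grow without bound, whereas an automaton has a fixed number of states. A sketch, with `is_anbn` as an illustrative name:

```python
def is_anbn(text):
    """Accept n a's followed by exactly n b's (n >= 0) using an unbounded counter."""
    count = 0
    i = 0
    # Count the leading a's.
    while i < len(text) and text[i] == "a":
        count += 1
        i += 1
    # Cancel one a for each following b.
    while i < len(text) and text[i] == "b":
        count -= 1
        i += 1
    # Accept only if the whole input was consumed and the counts match.
    return i == len(text) and count == 0

print(is_anbn("aabb"))  # True
print(is_anbn("aab"))   # False
```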
Some special regular languages
  The universal language (Σ*)
  The empty language (Ø)
  Note: the empty language Ø is not the same as {ε}, the language containing only the empty string
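The distinction shows up clearly under concatenation. A small sketch using Python sets as finite languages (the universal language is infinite and is not modelled here):

```python
EMPTY_LANGUAGE = set()         # contains no strings at all
EMPTY_STRING_LANGUAGE = {""}   # contains exactly one string, the empty string

def concat(a, b):
    """Every string of a followed by every string of b."""
    return {x + y for x in a for y in b}

L = {"ab", "c"}
print(len(EMPTY_LANGUAGE), len(EMPTY_STRING_LANGUAGE))  # 0 1
print(concat(L, EMPTY_STRING_LANGUAGE) == L)            # True
print(concat(L, EMPTY_LANGUAGE) == set())               # True
```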
Some closure properties of regular languages
  Regular languages are closed under:
    – Intersection
    – Complementation
    – Difference
    – Reversal
    – Power
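Two of these closure properties can be sketched on deterministic automata: complementation swaps final and non-final states, and intersection runs two machines in parallel (the product construction). The tuple layout and the example machines `EVEN_A` and `ENDS_B` are assumptions for illustration; both automata must be complete and share one alphabet.

```python
from itertools import product

# A complete DFA is a tuple (states, alphabet, delta, start, finals),
# where delta is a dict keyed by (state, symbol).
def accepts(dfa, text):
    states, alphabet, delta, start, finals = dfa
    state = start
    for symbol in text:
        state = delta[(state, symbol)]   # KeyError if symbol is outside the alphabet
    return state in finals

def complement(dfa):
    """Same machine with final and non-final states swapped."""
    states, alphabet, delta, start, finals = dfa
    return (states, alphabet, delta, start, states - finals)

def intersection(d1, d2):
    """Product construction: track a pair of states, one from each machine."""
    s1, alphabet, t1, q1, f1 = d1
    s2, _, t2, q2, f2 = d2
    states = set(product(s1, s2))
    delta = {((p, q), a): (t1[(p, a)], t2[(q, a)])
             for (p, q) in states for a in alphabet}
    finals = {(p, q) for (p, q) in states if p in f1 and q in f2}
    return (states, alphabet, delta, (q1, q2), finals)

SIGMA = {"a", "b"}
# Strings with an even number of a's.
EVEN_A = ({0, 1}, SIGMA,
          {(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 1}, 0, {0})
# Strings ending in b.
ENDS_B = ({0, 1}, SIGMA,
          {(0, "a"): 0, (0, "b"): 1, (1, "a"): 0, (1, "b"): 1}, 0, {1})

both = intersection(EVEN_A, ENDS_B)
print(accepts(both, "aab"))                # True
print(accepts(both, "ab"))                 # False
print(accepts(complement(EVEN_A), "ab"))   # True
```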
Characterising Classes of Set
  [diagram relating: CLASS OF SETS or LANGUAGES, NOTATION, MACHINE]
Regular Expressions
  Notation for describing regular sets
  Used extensively in the Unix operating system (grep, sed, etc.) and also in some Microsoft products (Word)
  Xerox Finite State tools use a somewhat different notation with similar function.
Regular Expressions
  a        a simple symbol
  A B      concatenation
  A | B    alternation operator
  A & B    intersection operator
  A*       Kleene star
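For comparison, here is how the same operators look in Python's `re` module, which is not the notation of the slide. Concatenation is written by juxtaposition without a space, and `re` has no intersection operator, so the hypothetical helper `in_intersection` simply tests both patterns.

```python
import re

def full(pattern, text):
    """True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

def in_intersection(pattern_a, pattern_b, text):
    # Python's re has no intersection operator; test both patterns instead.
    return full(pattern_a, text) and full(pattern_b, text)

print(full("a", "a"))            # simple symbol: True
print(full("ab", "ab"))          # concatenation: True
print(full("a|b", "b"))          # alternation: True
print(full("(ab)*", "abab"))     # Kleene star: True
print(in_intersection("a*b*", "(ab)*", "ab"))    # True
print(in_intersection("a*b*", "(ab)*", "abab"))  # False
```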
Finite Automaton
  A finite automaton is a quintuple (Q, Σ, q0, F, δ) where:
    Q is a finite set of states
    Σ is an alphabet of symbols
    q0 ∈ Q is a start state
    F ⊆ Q is the set of final states
    δ is a transition relation δ(q, σ, q') between a state q ∈ Q, a symbol σ ∈ Σ and a state q' ∈ Q
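The quintuple maps naturally onto a small data structure. The sketch below stores δ as a set of (q, σ, q') triples and ignores ε-transitions; the class name and the example automaton (for strings of the form b, two or more a's, then !) are assumptions for illustration, not taken from the slides.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FiniteAutomaton:
    states: frozenset
    alphabet: frozenset
    start: str
    finals: frozenset
    delta: frozenset   # set of (state, symbol, next_state) triples

    def step(self, current, symbol):
        """All states reachable from the set `current` on one symbol."""
        return {r for (q, s, r) in self.delta if q in current and s == symbol}

    def accepts(self, text):
        current = {self.start}
        for symbol in text:
            current = self.step(current, symbol)
        return bool(current & self.finals)

# Hypothetical example automaton.
SHEEP = FiniteAutomaton(
    states=frozenset({"q0", "q1", "q2", "q3", "q4"}),
    alphabet=frozenset({"b", "a", "!"}),
    start="q0",
    finals=frozenset({"q4"}),
    delta=frozenset({
        ("q0", "b", "q1"), ("q1", "a", "q2"), ("q2", "a", "q3"),
        ("q3", "a", "q3"), ("q3", "!", "q4"),
    }),
)

print(SHEEP.accepts("baaa!"))  # True
print(SHEEP.accepts("ba!"))    # False
```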
Representation of FSAs: State Diagram
  [figure]
Representation of FSAs: State Table
  [figure]
Mr. Kleene
Kleene’s theorem
  The languages accepted by NFAs are exactly the languages described by regular expressions.
  Kleene’s Theorem, part 1: To each regular expression there corresponds an NFA.
  Kleene’s Theorem, part 2: To each NFA there corresponds a regular expression.
Converting a Regular Expression to an NFA
  The NFA representing the empty string is:
    1 --ε--> 2
  The NFA representing a single character is:
    1 --a--> 2
Regular Expression to NFA
  [diagram from Leonidas Fegaras, Univ. Texas]
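One common way to carry out this conversion is Thompson's construction, which glues small NFA fragments together with ε-transitions. The sketch below follows that style and may differ in detail from the diagram; all names (`Fragment`, `symbol`, `union`, and so on) are illustrative, and each fragment should be used only once because fragments share state numbers.

```python
from itertools import count

EPS = None        # label used for ε-transitions
_ids = count()    # source of fresh state numbers

class Fragment:
    """An NFA fragment with one start state and one final state."""
    def __init__(self, start, final, edges):
        self.start, self.final, self.edges = start, final, edges

def _new_states():
    return next(_ids), next(_ids)

def symbol(a):
    s, f = _new_states()
    return Fragment(s, f, [(s, a, f)])

def epsilon():
    return symbol(EPS)

def concat(x, y):
    # Link the final state of x to the start state of y.
    return Fragment(x.start, y.final, x.edges + y.edges + [(x.final, EPS, y.start)])

def union(x, y):
    s, f = _new_states()
    extra = [(s, EPS, x.start), (s, EPS, y.start),
             (x.final, EPS, f), (y.final, EPS, f)]
    return Fragment(s, f, x.edges + y.edges + extra)

def star(x):
    s, f = _new_states()
    extra = [(s, EPS, x.start), (s, EPS, f),
             (x.final, EPS, x.start), (x.final, EPS, f)]
    return Fragment(s, f, x.edges + extra)

def closure(states, edges):
    """All states reachable from `states` using only ε-transitions."""
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for (p, label, r) in edges:
            if p == q and label is EPS and r not in seen:
                seen.add(r)
                stack.append(r)
    return seen

def accepts(frag, text):
    current = closure({frag.start}, frag.edges)
    for ch in text:
        moved = {r for (p, label, r) in frag.edges if p in current and label == ch}
        current = closure(moved, frag.edges)
    return frag.final in current

# (a|b)* a b b
nfa = concat(concat(concat(star(union(symbol("a"), symbol("b"))),
                           symbol("a")), symbol("b")), symbol("b"))
print(accepts(nfa, "aabb"))  # True
print(accepts(nfa, "ab"))    # False
```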
Deterministic Finite Automata
  In a deterministic finite automaton (DFA), every state/symbol pair maps to a unique state
  In other words, δ is a function
  Why do we care about DFAs?
  Efficiency: recognition follows exactly one transition per input symbol, so running time is linear in the length of the input
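Whether δ is a function can be checked mechanically. A sketch, assuming δ is given as a set of (state, symbol, next_state) triples and using the strict reading above, where every state/symbol pair must have exactly one successor (the function name is illustrative):

```python
def is_deterministic(states, alphabet, delta):
    """True if every (state, symbol) pair has exactly one successor in delta."""
    successors = {}
    for (q, a, r) in delta:
        successors.setdefault((q, a), set()).add(r)
    return all(len(successors.get((q, a), ())) == 1
               for q in states for a in alphabet)

# State 0 has two a-successors, so this relation is not a function.
nfa_delta = {(0, "a", 0), (0, "a", 1), (0, "b", 0), (1, "a", 1), (1, "b", 1)}
dfa_delta = {(0, "a", 1), (0, "b", 0), (1, "a", 1), (1, "b", 0)}

print(is_deterministic({0, 1}, {"a", "b"}, nfa_delta))  # False
print(is_deterministic({0, 1}, {"a", "b"}, dfa_delta))  # True
```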
Equivalence of NFAs and DFAs
Subset Construction for Determinisation
  States which are connected by an ε-transition will be represented by the same state in the DFA.
  If there are multiple transitions on the same symbol, then we can regard a transition as moving from a state to a set of states (i.e. the union of all those states reachable by a transition on the current symbol). Thus these states will be combined into a single DFA state.
  more details ml
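The two ideas above (ε-closure, and grouping the targets of same-symbol transitions) can be sketched as follows. The representation is an assumption: `delta` maps (state, label) to a set of states, `None` marks an ε-transition, and each DFA state is a frozenset of NFA states. The example NFA accepts strings over {a, b} ending in "ab".

```python
EPS = None  # label for ε-transitions

def eps_closure(states, delta):
    """All NFA states reachable from `states` by ε-transitions alone."""
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for r in delta.get((q, EPS), ()):
            if r not in seen:
                seen.add(r)
                stack.append(r)
    return frozenset(seen)

def determinize(alphabet, delta, start, finals):
    """Subset construction; returns (states, delta, start, finals) of a DFA."""
    d_start = eps_closure({start}, delta)
    d_delta, d_finals = {}, set()
    todo, seen = [d_start], {d_start}
    while todo:
        subset = todo.pop()
        if subset & finals:              # final if it contains any NFA final state
            d_finals.add(subset)
        for a in sorted(alphabet):
            moved = set()
            for q in subset:             # union of all targets on symbol a
                moved |= delta.get((q, a), set())
            target = eps_closure(moved, delta)
            d_delta[(subset, a)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    return seen, d_delta, d_start, d_finals

def run(d_delta, d_start, d_finals, text):
    state = d_start
    for ch in text:
        state = d_delta[(state, ch)]
    return state in d_finals

nfa_delta = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}}
states, d_delta, d_start, d_finals = determinize({"a", "b"}, nfa_delta, 0, {2})
print(len(states))                             # 3
print(run(d_delta, d_start, d_finals, "aab"))  # True
print(run(d_delta, d_start, d_finals, "aba"))  # False
```

If some subset has no outgoing transition on a symbol, the empty frozenset appears as a non-accepting dead state.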
Subset construction for determinization
  [worked example: sequence of diagrams]