October 2004, CSA3050 NL Algorithms
CSA3050: Natural Language Algorithms
Words, Strings and Regular Expressions
Finite State Automata
This lecture
Outline
– Words
– The language of words
– FSAs in Prolog
Acknowledgement
– Jurafsky and Martin, Speech and Language Processing, Prentice Hall 2000
– Blackburn and Striegnitz, NLP Techniques in Prolog
What is a Word?
– A series of speech sounds that symbolizes meaning without being divisible into smaller units
– Any segment of written or printed discourse ordinarily appearing between spaces or between a space and a punctuation mark
– A set of linguistic forms produced by combining a single base with various inflectional elements without change in the part-of-speech elements
– A number of bytes processed as a unit
Information Associated with Words
Spelling
– orthographic
– phonological
Syntax
– POS
– Valency
Semantics
– Meaning
– Relationship to other words
Properties of Words
Sequence
– characters, e.g. pollution
– phonemes
Delimitation
– whitespace
– other?
Structure
– simple ("atomic") words
– complex ("molecular") words
Complex Words
enlargement
en + large + ment
(en + large) + ment
en + (large + ment)
affixation
– prefix
– suffix
– infix
Sets Underlie the Formation of Complex Words
prefixes: dis, re, un, en
+
roots: large, charge, infect, code, decide
+
suffixes: ed, ing, ee, er, ly
Structure of Complex Words
Complex words are made by concatenating elements chosen from
– a set of prefixes
– a set of roots
– a set of suffixes
The set of valid words for a given human language (e.g. English, Maltese) can be regarded as a formal language.
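The concatenation idea can be sketched in Prolog as follows. The predicate names prefix/1, root/1, suffix/1 and word/1 are hypothetical, and the sets are the ones listed on the previous slide. This is only an illustration: plain concatenation overgenerates (it also yields strings such as "undecidely") and ignores spelling changes (it yields "enlargeed" rather than "enlarged").

```prolog
% Hypothetical encoding of the three sets from the slides.
prefix(dis). prefix(re). prefix(un). prefix(en).
root(large). root(charge). root(infect). root(code). root(decide).
suffix(ed). suffix(ing). suffix(ee). suffix(er). suffix(ly).

% word(?W): W is a prefix, a root and a suffix written one after another.
% No spelling rules are applied, so this is plain string concatenation.
word(W) :-
    prefix(P),
    root(R),
    suffix(S),
    atom_concat(P, R, PR),
    atom_concat(PR, S, W).
```

A query such as `?- word(disinfected).` succeeds, while `?- word(W).` enumerates all 4 × 5 × 5 = 100 concatenations on backtracking.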
The Language of Words
What kind of formal language is the language of words?
One which can be constructed out of
– A characteristic set of basic symbols (alphabet)
– A characteristic set of combining operations: Union (disjunction), Concatenation, Closure (iteration)
Regular Language; Regular Sets
Characterising Classes of Set
[Diagram with three linked labels: class of sets or languages, notation, machine]
Regular Expressions
– Notation for describing regular sets
– Used extensively in the Unix operating system (grep, sed, etc.) and also in some Microsoft products (Word)
– Xerox Finite State tools use a somewhat different notation with a similar function
Regular Expressions
a        simple symbol
A B      concatenation
A | B    alternation operator
A & B    intersection operator
A*       Kleene star
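To make the operators concrete, here is a small matcher sketched in Prolog. The term names sym/1, seq/2, alt/2, and/2 and star/1, and the predicates matches/2 and match/3, are assumptions made for this illustration; they are not the notation of grep or of the Xerox tools. Strings are lists of symbols, as in the Prolog examples later in the lecture.

```prolog
% matches(+RE, +String): the whole String (a list of symbols) is in the set described by RE.
matches(RE, String) :-
    match(RE, String, []).

% match(+RE, +S0, -S): RE matches a front part of S0, leaving S.
match(sym(C), [C|Rest], Rest).            % a      simple symbol
match(seq(A, B), S0, S) :-                % A B    concatenation
    match(A, S0, S1),
    match(B, S1, S).
match(alt(A, _), S0, S) :-                % A | B  alternation, left branch
    match(A, S0, S).
match(alt(_, B), S0, S) :-                % A | B  alternation, right branch
    match(B, S0, S).
match(and(A, B), S0, S) :-                % A & B  both match the same part
    match(A, S0, S),
    match(B, S0, S).
match(star(_), S, S).                     % A*     zero repetitions
match(star(A), S0, S) :-                  % A*     one more repetition
    match(A, S0, S1),
    S1 \== S0,                            % each repetition must consume input
    match(star(A), S1, S).
```

For example, `matches(seq(sym(h), seq(sym(a), seq(star(seq(sym(h), sym(a))), sym(!)))), [h,a,h,a,!])` succeeds, since the expression describes "ha" followed by any number of further "ha" and a final "!".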
Finite Automaton
A finite automaton comprises
– A finite set of states Q
– An alphabet of symbols I
– A start state q0 ∈ Q
– A set of final states F ⊆ Q
– A transition function δ(q,i) which maps a state q ∈ Q and a symbol i ∈ I to a new state q' ∈ Q
Encoding FSAs in Prolog
Three predicates
– initial/1: initial(s) means s is an initial state
– final/1: final(f) means f is a final state
– arc/3: arc(s,t,c) means there is an arc from s to t labelled c
Example 1: FSA
initial(1).
final(4).
arc(1,2,h).
arc(2,3,a).
arc(3,4,!).
arc(3,2,h).
[State diagram: 1 -h-> 2 -a-> 3 -!-> 4, plus an arc 3 -h-> 2]
Example 2: FSA with jump arc
initial(1).
final(4).
arc(1,2,h).
arc(2,3,a).
arc(3,4,!).
arc(3,1,#).
[State diagram: 1 -h-> 2 -a-> 3 -!-> 4, plus a jump arc 3 -#-> 1]
Example 3: NDA
initial(1).
final(4).
arc(1,2,h).
arc(2,3,a).
arc(3,4,!).
arc(2,1,a).
[State diagram: 1 -h-> 2 -a-> 3 -!-> 4, plus a second a-arc 2 -a-> 1]
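A Recogniser
% Succeed when the input is used up and we are in a final state.
recognize1(Node,[]) :-
    final(Node).
% Otherwise follow an arc, consume its label, and carry on from the next node.
recognize1(Node1,String) :-
    arc(Node1,Node2,Label),
    traverse1(Label,String,NewString),
    recognize1(Node2,NewString).
% Consuming Label removes it from the front of the string.
traverse1(Label,[Label|Symbols],Symbols).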
Trace
Call: (7) test1([h, a, !])
Call: (8) initial(_L181)
Exit: (8) initial(1)
Call: (8) recognize1(1, [h, a, !])
Call: (9) arc(1, _L199, _L200)
Exit: (9) arc(1, 2, h)
Call: (9) traverse1(h, [h, a, !], _L201)
Exit: (9) traverse1(h, [h, a, !], [a, !])
Call: (9) recognize1(2, [a, !])
Call: (10) recognize1(3, [!])
Call: (11) recognize1(4, [])
Call: (12) final(4)
Exit: (12) final(4)
Exit: (11) recognize1(4, [])
Exit: (10) recognize1(3, [!])
Exit: (9) recognize1(2, [a, !])
Exit: (8) recognize1(1, [h, a, !])
Exit: (7) test1([h, a, !])
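The trace calls a driver test1/1 whose definition does not appear on the slides. Judging from the trace, it looks up an initial state and then calls recognize1/2; the definition below is a reconstruction on that assumption, combined with the Example 1 facts and the recogniser so that the whole program can be loaded and run.

```prolog
% Example 1 automaton.
initial(1).
final(4).
arc(1,2,h).
arc(2,3,a).
arc(3,4,!).
arc(3,2,h).

% Recogniser from the slides.
recognize1(Node, []) :-
    final(Node).
recognize1(Node1, String) :-
    arc(Node1, Node2, Label),
    traverse1(Label, String, NewString),
    recognize1(Node2, NewString).

traverse1(Label, [Label|Symbols], Symbols).

% Assumed driver, reconstructed from the trace: start in an initial state.
test1(String) :-
    initial(Node),
    recognize1(Node, String).
```

With this loaded, `?- test1([h,a,!]).` succeeds and `?- test1([h,a]).` fails, because state 3 is not final.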
Generation
?- test1(X).
X = [h, a, !] ;
X = [h, a, h, a, !] ;
X = [h, a, h, a, h, a, !] ;
X = [h, a, h, a, h, a, h, a, !] ;
etc.
3 Related Frameworks
[Diagram linking regular languages/sets, regular expressions and finite state networks, with arrows labelled "describe" and "recognise"]
Regular Operations
Operations
– Concatenation
– Union
– Closure
Over What
– Languages
– Expressions
– FS Automata
Concatenation over Reg. Expression and Language
Regular Expression
E1 = [a|b]
E2 = [c|d]
E1 E2 = [a|b] [c|d]
Language
L1 = {"a", "b"}
L2 = {"c", "d"}
L1 L2 = {"ac", "ad", "bc", "bd"}
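For finite languages, the concatenation L1 L2 can be computed directly by pairing every string of L1 with every string of L2. The sketch below represents a language as a list of atoms; the predicate name concat_lang/3 is hypothetical.

```prolog
:- use_module(library(lists)).

% concat_lang(+L1, +L2, -L): L holds every string of L1 followed by every string of L2.
% Languages are finite lists of atoms, one atom per string.
concat_lang(L1, L2, L) :-
    findall(W,
            ( member(X, L1),
              member(Y, L2),
              atom_concat(X, Y, W) ),
            L).
```

The query `?- concat_lang([a,b], [c,d], L).` gives `L = [ac, ad, bc, bd]`, matching the slide.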
Concatenation over FS Automata
[Diagram: an automaton with arcs a, b and an automaton with arcs c, d, joined to form their concatenation]
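One standard way to concatenate two automata is to add a jump arc from every final state of the first to every initial state of the second. The sketch below assumes a hypothetical term representation fsa(Initials, Finals, Arcs), so that several automata can coexist, and uses '#' as the jump label as in Example 2. The predicates concat_fsa/3, accepts/2, walk/4 and step/3 are illustrative names, not part of the lecture's code.

```prolog
:- use_module(library(lists)).

% An automaton is fsa(Initials, Finals, Arcs); each arc is arc(From, To, Label).
% The label '#' marks a jump arc, which consumes no input.

% concat_fsa(+A, +B, -C): C accepts a string of A followed by a string of B.
% Assumes A and B use disjoint state names.
concat_fsa(fsa(I1, F1, Arcs1), fsa(I2, F2, Arcs2), fsa(I1, F2, Arcs)) :-
    findall(arc(F, I, '#'), (member(F, F1), member(I, I2)), Jumps),
    append(Arcs1, Jumps, Arcs12),
    append(Arcs12, Arcs2, Arcs).

% accepts(+FSA, +String): String (a list of symbols) is recognised by FSA.
accepts(fsa(Initials, Finals, Arcs), String) :-
    member(Start, Initials),
    walk(Start, String, Finals, Arcs).

walk(State, [], Finals, _) :-
    memberchk(State, Finals).
walk(State, String, Finals, Arcs) :-
    member(arc(State, Next, Label), Arcs),
    step(Label, String, Rest),
    walk(Next, Rest, Finals, Arcs).

% A jump arc leaves the string unchanged; any other label consumes one symbol.
step('#', String, String) :- !.
step(Label, [Label|Rest], Rest).
```

With `A = fsa([1],[2],[arc(1,2,a),arc(1,2,b)])` and `B = fsa([3],[4],[arc(3,4,c),arc(3,4,d)])`, the result of `concat_fsa(A, B, C)` accepts exactly ac, ad, bc and bd. Note that walk/4 would loop on an automaton containing a cycle made only of jump arcs; concatenation as defined here never creates one.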
Issues
– Handling jump arcs
– Handling non-determinism
– Computing operations over networks
– Maintaining multiple states in DB
– Representation
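On the non-determinism issue: recognize1/2 copes with Example 3 by backtracking over the alternative arcs. A common alternative, sketched here as one possible approach rather than the lecture's own solution, is to track the set of all states reachable after each symbol. The predicates recognize_set/1 and run/2 are hypothetical names, the facts are those of Example 3, and jump arcs are not handled.

```prolog
:- use_module(library(lists)).

% Example 3 automaton (non-deterministic: state 2 has two a-arcs).
initial(1).
final(4).
arc(1,2,h).
arc(2,3,a).
arc(3,4,!).
arc(2,1,a).

% recognize_set(+String): simulate all paths at once using a set of current states.
recognize_set(String) :-
    findall(S, initial(S), Starts),
    sort(Starts, States),
    run(States, String).

% Input used up: accept if any current state is final.
run(States, []) :-
    member(S, States),
    final(S),
    !.
% Otherwise move every current state along arcs labelled with the next symbol.
run(States, [Symbol|Rest]) :-
    findall(T, (member(S, States), arc(S, T, Symbol)), Targets),
    sort(Targets, Next),        % sort/2 also removes duplicate states
    Next \== [],
    run(Next, Rest).
```

For `[h,a,h,a,!]` the state sets visited are {1}, {2}, {1,3}, {2}, {1,3}, {4}, so the string is accepted without any backtracking over arcs.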