Published by Clara Stephens. Modified over 9 years ago.
1
6/10/2015 CPSC503 Winter 2009
CPSC 503 Computational Linguistics
Lecture 3
Giuseppe Carenini
2
NLP research at UBC
TOPICS:
– Generation and Summarization of Evaluative Text (e.g., customer reviews)
– Summarization of conversations (emails, blogs, meetings)
– Subjectivity Detection, Domain Adaptation, Rhetorical Parsing
PEOPLE: G. Carenini & R. Ng (Profs), G. Murray (Postdoc) + Students
SUPPORT: NSERC, Google, BObjects (now SAP)
COLLABORATIONS: MSResearch
http://people.cs.ubc.ca/~rjoty/Webpage/
3
Linguistic Knowledge:
– (English) Morphology
– Syntax
– Semantics
– Pragmatics
– Discourse and Dialogue
Formalisms and associated Algorithms:
– State Machines (no prob.): Finite State Automata (and Regular Expressions), Finite State Transducers
– Rule systems (and prob. version), e.g., (Prob.) Context-Free Grammars
– Logical formalisms (First-Order Logics)
– AI planners
4
Computational tasks in Morphology
– Recognition: recognize whether a string is an English/… word (FSA)
– Parsing/Generation: word ↔ stem, class, lexical features …. e.g., bought ↔ buy +V +PAST-PART; buy +V +PAST
– Stemming: word → stem …. e.g.,
5
Today (Sept 16)
– Finite State Transducers (FSTs) and Morphological Parsing
– Stemming (Porter Stemmer)
6
FST definition (Recap.)
– Q: a finite set of states
– I, O: input and output alphabets (which may include ε)
– Σ: a finite alphabet of complex symbols i:o, with $i \in I$ and $o \in O$
– $q_0$: the start state
– F: a set of accept/final states ($F \subseteq Q$)
– δ: a transition relation that maps $Q \times \Sigma$ to $2^Q$
E.g., $|Q| = 3$; $I = \{a, b, c, \varepsilon\}$; $O = \{a, b\}$; $|\Sigma| = ?$; $0 \le |\delta| \le\ ?$
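As a hedged illustration of the definition above, the sketch below stores an FST as plain Python data and counts the complex symbols for the slide's example. The class name `FST`, the helpers `add_arc` and `num_arcs`, and the use of the empty string for ε are assumptions made for this sketch. The upper bound on |δ| additionally assumes that each (source state, symbol, target state) triple counts as one transition.

```python
from dataclasses import dataclass, field
from itertools import product

EPS = ""  # assumption: the empty string stands for ε


@dataclass
class FST:
    states: set
    start: object
    finals: set
    # delta maps (state, (i, o)) to a set of next states, i.e. Q x Σ -> 2^Q
    delta: dict = field(default_factory=dict)

    def add_arc(self, src, i, o, dst):
        """Add one transition src --i:o--> dst."""
        self.delta.setdefault((src, (i, o)), set()).add(dst)

    def num_arcs(self):
        """Count (source, symbol, target) triples."""
        return sum(len(targets) for targets in self.delta.values())


# The slide's example: |Q| = 3, I = {a, b, c, ε}, O = {a, b}
Q = {0, 1, 2}
I = {"a", "b", "c", EPS}
O = {"a", "b"}
SIGMA = set(product(I, O))  # every complex symbol i:o

# Under the counting assumption above, at most |Q| * |Σ| * |Q| arcs exist
MAX_ARCS = len(Q) * len(SIGMA) * len(Q)

# A tiny hypothetical machine with two arcs
fst = FST(states=Q, start=0, finals={2})
fst.add_arc(0, "a", "b", 1)
fst.add_arc(1, EPS, "a", 2)
print(len(SIGMA), MAX_ARCS, fst.num_arcs())
```

Under these assumptions the example has |Σ| = 4 × 2 = 8 complex symbols and at most 3 × 8 × 3 = 72 transitions. A different counting convention for δ would give a different bound.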
7
FSTs can be used as…
– Translators: input one string from I, output another from O (or vice versa)
– Recognizers: input a string from $I \times O$
– Generators: output a string from $I \times O$
Terminology warning! E.g., if $I = \{a, b\}$; $O = \{a, b, \varepsilon\}$; ……
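The three uses can be sketched on a toy transducer. Everything here is hypothetical: the arcs, the state numbering, and the function names `translate`, `recognize`, and `generate` are assumptions for illustration, and ε is only allowed on the output side to keep the search simple.

```python
EPS = ""  # assumption: empty string stands for ε (output side only here)

# Toy arcs (hypothetical): (source, input symbol, output symbol, target)
ARCS = [(0, "a", "b", 1), (1, "b", "a", 1), (1, "b", EPS, 2)]
START, FINALS = 0, {2}


def translate(inp):
    """Translator: return every output string paired with the input."""
    configs = {(START, "")}  # (state, output so far)
    for ch in inp:
        configs = {(dst, out + o)
                   for (state, out) in configs
                   for (src, i, o, dst) in ARCS
                   if src == state and i == ch}
    return {out for (state, out) in configs if state in FINALS}


def recognize(inp, out):
    """Recognizer: accept the pair (inp, out) if the FST relates them."""
    return out in translate(inp)


def generate(max_len):
    """Generator: enumerate accepted (input, output) pairs up to max_len arcs."""
    pairs, frontier = set(), {(START, "", "")}
    for _ in range(max_len):
        frontier = {(dst, inp + i, out + o)
                    for (state, inp, out) in frontier
                    for (src, i, o, dst) in ARCS if src == state}
        pairs |= {(inp, out) for (s, inp, out) in frontier if s in FINALS}
    return pairs
```

The same arc table serves all three roles. Only the question asked of it changes: map one side to the other, check a pair, or enumerate pairs.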
8
FST: inflectional morphology of plural
[Figure: FST with one branch for some regular nouns and one for some irregular nouns (arc o:i)]
Notes: X abbreviates X:X; arc labels read lexical:surface
9
Examples
[Figure: lexical and surface tape pairs; recoverable labels: "cat", "+N +PL", "mic", "e"]
10
Computational Morphology: Problems/Challenges
1. Ambiguity: one word can correspond to multiple structures (more critical in morphologically richer languages)
2. Spelling changes: may occur when two morphemes are combined, e.g., butterfly + -s → butterflies
11
Ambiguity: more complex example
What's the right parse for "unionizable"?
– Union-ize-able
– Un-ion-ize-able
Each would represent a valid path through an FST for derivational morphology.
Both Adj……
12
Dealing with Morphological Ambiguity
– Find all the possible outputs (all paths) and return them all (without choosing)
– Then use part-of-speech tagging to choose…… look at the neighboring words
13
(2) Spelling Changes
When morphemes are combined inflectionally, the spelling at the boundaries may change.
Examples:
– E-insertion: when -s is added to a word, -e is inserted if the word ends in -s, -z, -sh, -ch, -x (e.g., kiss, miss, waltz, bush, watch, rich, box)
– Y-replacement: when -s or -ed is added to a word ending in -y, the -y changes to -ie or -i respectively (e.g., butterfly, try)
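The two rules above can be sketched as plain string rewriting. This is a minimal illustration, not a transducer: the function names `add_s` and `add_ed` are hypothetical, and restricting Y-replacement to a consonant followed by -y (so that "boy" keeps its -y) is an assumption that goes beyond the slide's wording. Other spelling rules are not covered.

```python
import re


def add_s(stem):
    """Attach -s, applying Y-replacement or E-insertion where the rules fire."""
    # Y-replacement (assumed to need a preceding consonant): butterfly -> butterflies
    if re.search(r"[^aeiou]y$", stem):
        return stem[:-1] + "ies"
    # E-insertion after -s, -z, -sh, -ch, -x: kiss -> kisses
    if re.search(r"(s|z|sh|ch|x)$", stem):
        return stem + "es"
    return stem + "s"


def add_ed(stem):
    """Attach -ed, applying Y-replacement where the rule fires."""
    # Y-replacement: try -> tried
    if re.search(r"[^aeiou]y$", stem):
        return stem[:-1] + "ied"
    return stem + "ed"


for word in ["kiss", "waltz", "bush", "watch", "box", "butterfly", "cat"]:
    print(word, "->", add_s(word))
print("try ->", add_ed("try"))
```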
14
Solution: Multi-Tape Machines
– Add an intermediate tape
– Use the output of one tape machine as the input to the next
– Add intermediate symbols:
  – ^ morpheme boundary
  – # word boundary
15
Multi-Level Tape Machines
– FST-1 translates between the lexical and the intermediate level
– FST-2 handles the spelling changes (due to one rule) to the surface tape
[Figure: cascade of FST-1 and FST-2]
16
FST-1 for inflectional morphology of plural (Lexical ↔ Intermediate)
[Figure: FST with one branch for some regular nouns and one for some irregular nouns; recoverable arc labels: o:i, +PL:^s#, +PL:^, ε:s, ε:#, #]
17
Example
[Figure: lexical and intermediate tape pairs for "fox" and "mouse"; recoverable tags: +N, +PL]
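The lexical-to-intermediate step can be approximated in a few lines. This is a sketch under assumptions, not the FST from the slides: it uses a lookup table where the real machine has arcs such as o:i, the function name `to_intermediate` and the table `IRREGULAR_PLURALS` are hypothetical, and treating a bare +N tag as singular is an assumption.

```python
# Hypothetical irregular-noun lexicon; a real FST encodes these as arcs
# such as o:i rather than as a lookup table.
IRREGULAR_PLURALS = {"mouse": "mice", "goose": "geese"}


def to_intermediate(lexical):
    """Map a lexical string such as 'fox +N +PL' to the intermediate tape."""
    stem, *tags = lexical.split()
    if tags == ["+N"]:
        return stem + "#"  # assumed singular: only the word boundary is added
    if tags == ["+N", "+PL"]:
        if stem in IRREGULAR_PLURALS:
            return IRREGULAR_PLURALS[stem] + "#"  # no ^s on irregular plurals
        return stem + "^s#"  # regular plural: morpheme boundary, s, word boundary
    raise ValueError(f"unsupported analysis: {lexical!r}")


print(to_intermediate("fox +N +PL"))
print(to_intermediate("mouse +N +PL"))
```

The intermediate string keeps the ^ and # boundary symbols so that a later spelling-rule stage can see where the morphemes meet.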
18
FST-2 for E-insertion (Intermediate ↔ Surface)
E-insertion: when -s is added to a word, -e is inserted if the word ends in -s, -z, -sh, -ch, -x
…as in fox^s# ↔ foxes
[Figure: E-insertion transducer; recoverable arc label: #:ε]
19
Examples
[Figure: intermediate and surface tape pairs for "fox^s#" and "box^ing#"]
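The intermediate-to-surface step for these examples can be sketched with a regular-expression rewrite. This stands in for the E-insertion transducer rather than reproducing it, and the function name `to_surface` is hypothetical. It assumes ^ marks morpheme boundaries and # the word boundary.

```python
import re


def to_surface(intermediate):
    """Apply E-insertion, then delete the boundary symbols ^ and #."""
    # Insert e after x, s, z, sh, ch when the next morpheme is exactly -s
    with_e = re.sub(r"(x|s|z|sh|ch)\^(?=s#)", r"\1^e", intermediate)
    # Boundary symbols map to ε on the surface tape
    return with_e.replace("^", "").replace("#", "")


for form in ["fox^s#", "box^ing#", "cat^s#"]:
    print(form, "->", to_surface(form))
```

The lookahead `(?=s#)` keeps the rule from firing on "box^ing#", where the suffix is not -s.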
20
Where are we?
21
Final Scheme: Part 1
22
Final Scheme: Part 2
23
Intersection(FST1, FST2) = FST3
– States of FST1 and FST2: $Q_1$ and $Q_2$
– States of intersection: $Q_1 \times Q_2$
– Transitions of FST1 and FST2: $\delta_1$, $\delta_2$
– Transitions of intersection: $\delta_3$
For all i, j, n, m, a, b:
$\delta_3((q_{1i}, q_{2j}), a{:}b) = (q_{1n}, q_{2m})$ iff
– $\delta_1(q_{1i}, a{:}b) = q_{1n}$ AND
– $\delta_2(q_{2j}, a{:}b) = q_{2m}$
[Figure: arc a:b from $q_{1i}$ to $q_{1n}$ and arc a:b from $q_{2j}$ to $q_{2m}$ yield arc a:b from $(q_{1i}, q_{2j})$ to $(q_{1n}, q_{2m})$]
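The definition above translates almost directly into a product construction. The sketch below is a minimal version under assumptions: transitions are stored as a dict from (state, (a, b)) to a single next state, ε is treated as an ordinary symbol, and the function name `intersect` and the example state names are hypothetical.

```python
def intersect(delta1, delta2):
    """Build delta3 over pair states from two deterministic transition dicts."""
    delta3 = {}
    for (q1, sym1), n1 in delta1.items():
        for (q2, sym2), n2 in delta2.items():
            # Both machines must take an arc with the same label a:b
            if sym1 == sym2:
                delta3[((q1, q2), sym1)] = (n1, n2)
    return delta3


# Hypothetical example machines
d1 = {("p0", ("a", "b")): "p1", ("p0", ("c", "c")): "p0"}
d2 = {("q0", ("a", "b")): "q1"}
print(intersect(d1, d2))
```

Only the a:b arc survives, because the c:c arc of the first machine has no matching arc in the second.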
24
Composition(FST1, FST2) = FST3
– States of FST1 and FST2: $Q_1$ and $Q_2$
– States of composition: $Q_1 \times Q_2$
– Transitions of FST1 and FST2: $\delta_1$, $\delta_2$
– Transitions of composition: $\delta_3$
For all i, j, n, m, a, b:
$\delta_3((q_{1i}, q_{2j}), a{:}b) = (q_{1n}, q_{2m})$ iff there exists c such that
– $\delta_1(q_{1i}, a{:}c) = q_{1n}$ AND
– $\delta_2(q_{2j}, c{:}b) = q_{2m}$
[Figure: arc a:c from $q_{1i}$ to $q_{1n}$ and arc c:b from $q_{2j}$ to $q_{2m}$ yield arc a:b from $(q_{1i}, q_{2j})$ to $(q_{1n}, q_{2m})$]
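Composition differs from intersection only in how arcs are matched: the output symbol of the first machine must equal the input symbol of the second. The sketch below assumes the same dict representation of transitions, ignores the special handling that ε normally needs, and collects targets in a set because different intermediate symbols c can lead to different targets for the same a:b. The function name `compose` and the example arcs are hypothetical.

```python
def compose(delta1, delta2):
    """Build delta3 where a:c in FST1 and c:b in FST2 combine into a:b."""
    delta3 = {}
    for (q1, (a, c1)), n1 in delta1.items():
        for (q2, (c2, b)), n2 in delta2.items():
            # The intermediate symbol c must agree on both sides
            if c1 == c2:
                delta3.setdefault(((q1, q2), (a, b)), set()).add((n1, n2))
    return delta3


# Hypothetical example: FST1 maps a to c, FST2 maps c to b
d1 = {(0, ("a", "c")): 1}
d2 = {(0, ("c", "b")): 1, (0, ("x", "y")): 0}
print(compose(d1, d2))
```

The intermediate symbol c disappears from the result, which is what lets a cascade of machines be compiled into a single one.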
25
FSTs in Practice
– Install an FST package…… (pointers)
– Describe your "formal language" (e.g., lexicon, morphotactics, and rules) in a RegExp-like notation (pointer)
– Your specification is compiled into a single FST
Ref: "Finite State Morphology" (Beesley and Karttunen, 2003, CSLI Publications)
Complexity/Coverage:
– FSTs for the morphology of a natural language may have $10^5$ to $10^7$ states and arcs
– Spanish (1996): $46 \times 10^3$ stems; $3.4 \times 10^6$ word forms
– Arabic (2002?): $131 \times 10^3$ stems; $7.7 \times 10^6$ word forms
26
Other important applications of FSTs in NLP
From segmenting words into morphemes to…
– Tokenization:
  – Finding word boundaries in text (?!) …maxmatch
  – Finding sentence boundaries: punctuation… but "." is ambiguous; look at the example in Fig. 3.22
– Shallow syntactic parsing: e.g., find only noun phrases
– Phonological Rules…… (Chpt. 11)
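The maxmatch idea mentioned above is a greedy longest-match segmentation against a word list. The sketch below is a minimal version; the function name `max_match`, the fallback of emitting a single character when nothing matches, and the toy lexicon are assumptions for illustration.

```python
def max_match(text, lexicon):
    """Greedily split text into the longest lexicon words, left to right."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first and shrink until a word matches
        for j in range(len(text), i, -1):
            if text[i:j] in lexicon:
                break
        else:
            j = i + 1  # no word matched: emit a single character
        tokens.append(text[i:j])
        i = j
    return tokens


# Hypothetical lexicon chosen to show how greedy matching can go wrong
LEXICON = {"the", "theta", "table", "bled", "down", "own", "there"}
print(max_match("thetabledownthere", LEXICON))
```

With this lexicon the greedy strategy commits to "theta" and never recovers "the table down there", which illustrates why word-boundary detection is harder than it first looks.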
27
Computational tasks in Morphology
– Recognition: recognize whether a string is an English word (FSA)
– Parsing/Generation: word ↔ stem, class, lexical features …. e.g., bought ↔ buy +V +PAST-PART; buy +V +PAST
– Stemming: word → stem …. e.g.,
28
Stemmer
E.g., the Porter algorithm, which is based on a series of sets of simple cascaded rewrite rules of the form (condition) S1 → S2:
– ATIONAL → ATE (relational → relate)
– (*v*) ING → ε if the stem contains a vowel (motoring → motor)
Cascade of rules applied to "computerization":
– -ization → -ize: computerize
– -ize → ε: computer
Errors occur:
– organization → organ, university → universe
Code is freely available in most languages: Python, Java,…
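The cascade idea can be sketched with just the rules quoted above. This is a toy illustration and not the Porter algorithm: the real stemmer has many more rules and uses conditions on the stem that are omitted here. The names `STAGES`, `has_vowel`, and `toy_stem`, and the grouping of rules into three stages, are assumptions.

```python
def has_vowel(stem):
    """The (*v*) condition: the stem contains a vowel."""
    return any(ch in "aeiou" for ch in stem)


# Each stage is a list of (suffix, replacement, condition on the stem)
STAGES = [
    [("ing", "", has_vowel)],                           # motoring -> motor
    [("ational", "ate", None), ("ization", "ize", None)],
    [("ize", "", None)],                                # computerize -> computer
]


def toy_stem(word):
    """Run the word through every stage, applying at most one rule per stage."""
    for stage in STAGES:
        for suffix, replacement, condition in stage:
            if word.endswith(suffix):
                stem = word[:-len(suffix)]
                if condition is None or condition(stem):
                    word = stem + replacement
                break  # the first matching suffix decides this stage
    return word


for w in ["relational", "motoring", "computerization", "organization", "sing"]:
    print(w, "->", toy_stem(w))
```

The output of one stage feeds the next, so "computerization" passes through "computerize" on its way to "computer". The same mechanism turns "organization" into "organ", the kind of error noted on the slide.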
29
Stemming is mainly used in Information Retrieval
1. Run a stemmer on the documents to be indexed
2. Run a stemmer on users' queries
3. Compute similarity between queries and documents (based on the stems they contain)
Seems to work especially well with smaller documents
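The three steps above can be sketched end to end. The slide does not say which stemmer or similarity measure is used, so both are assumptions here: `crude_stem` is a hypothetical suffix stripper standing in for a real stemmer, and `similarity` uses Jaccard overlap of stem sets.

```python
def crude_stem(word):
    """Hypothetical stand-in for a real stemmer: strip a few common suffixes."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word


def stems(text):
    """Steps 1 and 2: reduce a document or a query to its set of stems."""
    return {crude_stem(w) for w in text.lower().split()}


def similarity(query, document):
    """Step 3: Jaccard overlap between the two stem sets (an assumed measure)."""
    q, d = stems(query), stems(document)
    return len(q & d) / len(q | d) if q | d else 0.0


docs = ["indexing documents", "cats chase mice"]
for doc in docs:
    print(doc, similarity("index document", doc))
```

Because both sides are stemmed, the query "index document" matches "indexing documents" even though no surface word is shared.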
30
Porter as an FST
The original exposition of the Porter stemmer did not describe it as a transducer but…
– Each stage is a separate transducer
– The stages can be composed to get one big transducer
31
Next Time
Read handout:
– Probability
– Stats
– Information theory
Next Lecture:
– Finish Chpt. 3, 3.10-11
– Start Probabilistic Models for NLP (Chpt. 4, 4.1 – 4.2 and 5.9!)