Morphological Recognition We take the sub-lexicon of each stem class and expand each arc (e.g. the reg-noun arc) with all the morphemes that make up the set of stems in that word class (here, reg-noun). This yields an FSA that can be used for morphological recognition.
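The expansion described above can be sketched in a few lines. This is only an illustration: the stem list, the suffix set, and the function names `build_recognizer` and `recognize` are hypothetical, and the FSA is stored as a simple letter trie rather than as separate sub-lexicon arcs.

```python
def build_recognizer(stems, suffixes=("", "s")):
    """Build a letter-level FSA (stored as a trie) accepting stem + suffix."""
    transitions = {}      # (state, letter) -> next state
    finals = set()
    next_state = 1        # state 0 is the start state

    def add_word(word):
        nonlocal next_state
        state = 0
        for ch in word:
            key = (state, ch)
            if key not in transitions:
                transitions[key] = next_state
                next_state += 1
            state = transitions[key]
        finals.add(state)

    # Expanding the stem-class arc: one letter path per stem, then the suffix.
    for stem in stems:
        for suffix in suffixes:
            add_word(stem + suffix)
    return transitions, finals


def recognize(word, fsa):
    """Return True if the FSA ends in a final state after reading word."""
    transitions, finals = fsa
    state = 0
    for ch in word:
        state = transitions.get((state, ch))
        if state is None:
            return False
    return state in finals


fsa = build_recognizer(["cat", "dog", "fox"])   # toy reg-noun sub-lexicon
print(recognize("cats", fsa), recognize("catz", fsa))
```

Note that plain concatenation also accepts "foxs", which is the problem the orthographic rules address.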

Two-level Morphology Ideally, for morphological parsing we would like to input a word and get as output its stem together with its morphological information, e.g. cats -> cat + N + PL. Two-level morphology represents a word as a correspondence between the lexical level and the surface level.

Finite State Transducer (FST) An FST is an automaton that performs the mapping between the two levels. It has two tapes and recognizes or generates pairs of strings; it therefore defines a relation between strings. Another view of an FST is as a machine that reads one string and generates another string.

Formal FST definition An extension of the FSA definition:
– Q: a finite set of states (q0, q1, q2, …)
– Σ: a finite alphabet of complex symbols, i.e. pairs i:o where i is a symbol from the input alphabet and o is a symbol from the output alphabet (ε may be part of both the input and output alphabets)
– q0: the start state
– F: the set of final states (a subset of Q)
– δ(q, i:o): the transition function from states and complex symbols to states. Given a state q and a complex symbol i:o, it returns a new state q'.
E.g. Σ = {a:a, b:b, !:!, a:!, a:b, b:a, a:ε, ε:!}
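The five components of the definition map naturally onto a small data structure. The sketch below is illustrative only: the class name `FST`, the method `accepts`, and the toy transducer for cat are assumptions, and ε is represented by the empty string.

```python
from dataclasses import dataclass

EPS = ""  # stands in for ε in this sketch


@dataclass
class FST:
    states: set      # Q
    alphabet: set    # Σ, a set of (i, o) pairs
    start: str       # q0
    finals: set      # F, a subset of Q
    delta: dict      # δ: (state, (i, o)) -> state

    def accepts(self, pairs):
        """Check whether a sequence of i:o pairs leads to a final state."""
        state = self.start
        for pair in pairs:
            if pair not in self.alphabet:
                return False
            state = self.delta.get((state, pair))
            if state is None:
                return False
        return state in self.finals


# Hypothetical toy transducer relating lexical cat+N(+PL) to surface cat(s).
pairs = [("c", "c"), ("a", "a"), ("t", "t"), ("+N", EPS), ("+PL", "s")]
toy = FST(
    states={"q0", "q1", "q2", "q3", "q4", "q5"},
    alphabet=set(pairs),
    start="q0",
    finals={"q4", "q5"},
    delta={(f"q{n}", pair): f"q{n + 1}" for n, pair in enumerate(pairs)},
)
print(toy.accepts(pairs))
```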

Useful FST Properties Inversion: The inversion of a transducer simply switches the input and output labels of the transducer (the two tapes). Therefore it is very easy to transform an FST from a parser into a generator. Composition: Given two FSTs, T1 that maps from I to C and T2 that maps from C to O, their composition is a new transducer T1 ∘ T2 that maps from I to O. Therefore, if we have a number of FSTs that run serially, it is possible to build a single new FST that maps from the initial input to the final output.
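Both properties can be shown on transducers stored as sets of arcs. This is a minimal sketch under a simplifying assumption: neither transducer uses ε, so composition reduces to a plain product construction. The arc format `(source, input, output, target)` and the function names are hypothetical.

```python
def invert(arcs):
    """Swap the input and output label on every arc."""
    return {(src, o, i, dst) for (src, i, o, dst) in arcs}


def compose(arcs1, arcs2):
    """Product construction for T1 ∘ T2, assuming neither FST uses ε.

    The output label of a T1 arc must match the input label of a T2 arc.
    States of the result are pairs (state of T1, state of T2).
    """
    return {((p, q), i, o, (p2, q2))
            for (p, i, c1, p2) in arcs1
            for (q, c2, o, q2) in arcs2
            if c1 == c2}


def transduce(arcs, start, finals, symbols):
    """Return every output sequence the ε-free FST produces for symbols."""
    frontier = {(start, ())}
    for sym in symbols:
        frontier = {(dst, out + (o,))
                    for (state, out) in frontier
                    for (src, i, o, dst) in arcs
                    if src == state and i == sym}
    return {out for (state, out) in frontier if state in finals}


T1 = {(0, "a", "b", 0)}   # maps a to b
T2 = {(0, "b", "c", 0)}   # maps b to c
T = compose(T1, T2)       # maps a to c in a single pass
print(transduce(T, (0, 0), {(0, 0)}, "aa"))
print(transduce(invert(T1), 0, {0}, "bb"))
```

In the composed machine the start state is the pair of start states, and a pair is final when both of its components are final.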

Finite State Transducers It is convenient to view an FST as having two tapes:
– The upper or lexical tape
– The lower or surface tape
Each symbol a:b in the FST alphabet expresses how a symbol from one tape is mapped to a symbol on the other tape. Symbols such as a:a are called default pairs and are written simply as a.

FST Morphotactics FST for English plural formation. ^ marks a morpheme boundary and # a word boundary.

FST Lexicon

Combining FST Lexicon and Morphotactics The two FSTs for lexicon and morphotactics can be cascaded, i.e. the input is run through the lexicon FST and then the output is run through the morphotactics FST. Based on the composition property, it is possible to compose these two FSTs into a single FST that maps directly from the lexical to the surface level (without any reference to word classes).

Orthographic Rules The previous FST will accept the word foxs and reject the word foxes. We need a way to deal with the spelling changes that often take place at morpheme boundaries. This is done by introducing orthographic rules. E.g. for English:
– e is inserted after -s, -z, -x, -ch, -sh before -s.
– -y becomes -ie before -s.
Formal rule notation: a -> b / c __ d means "rewrite a as b when it occurs between c and d". For e-insertion:
– ε -> e / {x,s,z}^ __ s#

Orthographic Rules and FST The spelling rule can be seen as taking a simple concatenation of morphemes (intermediate level) and producing the surface form of the word.

Orthographic Rules and FST The previous orthographic rule can be represented as a FST.

Orthographic Rules and FST Transition table for the previous FST.
State/Input  s:s  x:x  z:z  ^:ε  ε:e  #    other
q0:          1    1    1    0    -    0    0
q1:          1    1    1    2    -    0    0
q2:          5    1    1    0    3    0    0
q3           4    -    -    -    -    -    -
q4           -    -    -    -    -    0    -
q5           1    1    1    2    -    -    0
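The transition table can be run directly as a generator from the intermediate level to the surface level. The sketch below rests on two assumptions that the slide does not state: the colon after q0, q1 and q2 is taken to mark accepting states, and the boundary symbols ^ and # are dropped from the output. The names `ROWS`, `TABLE` and `generate` are hypothetical.

```python
COLUMNS = ["s", "x", "z", "^", "eps:e", "#", "other"]
ROWS = {   # None stands for "-" (no transition)
    0: [1, 1, 1, 0, None, 0, 0],
    1: [1, 1, 1, 2, None, 0, 0],
    2: [5, 1, 1, 0, 3, 0, 0],
    3: [4, None, None, None, None, None, None],
    4: [None, None, None, None, None, 0, None],
    5: [1, 1, 1, 2, None, None, 0],
}
FINALS = {0, 1, 2}   # assumption: states written with a colon are accepting
TABLE = {q: dict(zip(COLUMNS, row)) for q, row in ROWS.items()}


def generate(intermediate):
    """Return all surface strings for an intermediate string like 'fox^s#'."""
    results = set()
    stack = [(0, 0, "")]   # (state, position in input, output so far)
    while stack:
        state, pos, out = stack.pop()
        row = TABLE[state]
        # ε:e arc: consume no input, emit e.
        if row["eps:e"] is not None:
            stack.append((row["eps:e"], pos, out + "e"))
        if pos == len(intermediate):
            if state in FINALS:
                results.add(out)
            continue
        ch = intermediate[pos]
        column = ch if ch in "sxz^#" else "other"
        nxt = row[column]
        if nxt is not None:
            emitted = "" if ch in "^#" else ch   # boundaries are not written
            stack.append((nxt, pos + 1, out + emitted))
    return results


print(generate("fox^s#"), generate("cat^s#"))
```

The search keeps several candidate paths alive because in q2 the machine may either insert e or read the next symbol. For fox^s# the path without e dies in q5, which has no transition on #.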

Combining FST Lexicon and Rules First the lexicon FST maps between the lexical level and the intermediate level, which is just a concatenation of morphemes. Then a number of spelling-rule FSTs, run in parallel or as a cascade, map from the intermediate level to the surface level. The lexicon FST and the orthographic-rule FSTs together form a cascade, which can be run top-down (generation) or bottom-up (parsing).
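The top-down (generation) direction of this cascade can be sketched as follows. To keep the example short, the spelling rules are written as regular-expression rewrites instead of real FSTs, and only the two rules stated earlier (e-insertion after x, s, z and y to ie before s) are included. The toy lexicon and all function names are hypothetical.

```python
import re

# Hypothetical toy lexicon: stems plus a map from tags to morphemes.
STEMS = {"cat", "fox", "pony"}
TAG_TO_MORPH = {"+N+SG": "#", "+N+PL": "^s#"}


def lexical_to_intermediate(lexical):
    """Lexical level -> intermediate level, e.g. fox+N+PL -> fox^s#."""
    stem, _, tags = lexical.partition("+")
    morph = TAG_TO_MORPH.get("+" + tags)
    if stem not in STEMS or morph is None:
        raise ValueError(f"not in the toy lexicon: {lexical}")
    return stem + morph


def e_insertion(form):
    # ε -> e / {x,s,z}^ __ s#
    return re.sub(r"(?<=[xsz])\^(?=s#)", "^e", form)


def y_replacement(form):
    # -y becomes -ie before -s (simplified exactly as the rule is stated)
    return re.sub(r"y(?=\^s#)", "ie", form)


def generate(lexical, rules=(e_insertion, y_replacement)):
    """Run the cascade: lexicon step first, then each spelling rule."""
    form = lexical_to_intermediate(lexical)
    for rule in rules:
        form = rule(form)
    # Drop the boundary symbols to obtain the surface form.
    return form.replace("^", "").replace("#", "")


for word in ("cat+N+PL", "fox+N+PL", "pony+N+PL"):
    print(word, "->", generate(word))
```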

FST Parsing Parsing is more complicated than generation because of ambiguity. E.g. foxes may be parsed both as fox+V+3SG and as fox+N+PL. Disambiguation cannot be performed at the lexical level, so the FST should return both parses. Ambiguities also occur during parsing because of ε arcs or multiple possible paths. This is similar to the situation for non-deterministic FSAs (NFSA), and similar search techniques must be employed.
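A small agenda-based search shows how both parses of foxes are recovered. The transducer below is a hand-built toy covering only the stem fox; its arcs, state numbers, and the function name `parse` are hypothetical. Arcs are `(source, surface symbol, lexical output, target)` and the empty string stands for ε.

```python
ARCS = [
    (0, "f", "f", 1), (1, "o", "o", 2), (2, "x", "x", 3),
    (3, "", "+N", 4), (4, "", "+SG", 9),           # fox   -> fox+N+SG
    (4, "e", "", 5), (5, "s", "+PL", 9),           # foxes -> fox+N+PL
    (3, "", "+V", 6), (6, "e", "", 7), (7, "s", "+3SG", 9),  # fox+V+3SG
]
FINALS = {9}


def parse(surface):
    """Return every lexical string the toy FST relates to surface."""
    parses = set()
    agenda = [(0, 0, "")]   # (state, position in surface, lexical output)
    while agenda:
        state, pos, out = agenda.pop()
        if pos == len(surface) and state in FINALS:
            parses.add(out)
        for src, inp, lex, dst in ARCS:
            if src != state:
                continue
            if inp == "":
                # ε on the surface side: follow the arc without consuming input.
                agenda.append((dst, pos, out + lex))
            elif pos < len(surface) and surface[pos] == inp:
                agenda.append((dst, pos + 1, out + lex))
    return parses


print(sorted(parse("foxes")))
```

Every alternative is pushed onto the agenda instead of committing to one arc, so dead ends are simply discarded and all surviving paths are reported. The toy machine has no ε cycles, which is what guarantees that the loop terminates.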
