Advanced Filtering and Flag Diacritics

1 Advanced Filtering and Flag Diacritics
Thursday PM
Kenneth R. Beesley
Xerox Research Centre Europe

2 Advanced Filtering & Flag Diacritics
When specifying morphotactics in lexc or regular expressions, it is often convenient and attractive to start with a grammar that overgenerates and overrecognizes.

Linguists at Xerox typically use lexc to define the morphotactics (word structure) of a language. Lexc source files consist of LEXICONs, which in turn consist of morpheme entries. The linguist specifies which LEXICONs contain morphemes that can start a word; in addition, each morpheme entry includes a “continuation class” that defines the LEXICONs (morpheme classes) that can come next in a valid word. The lexc continuation-class approach is adequate for defining dependencies between contiguous morphemes. But where a language has discontiguous or “separated” dependencies between morphemes, lexc and pure regular expressions in general are a poor device for morphotactic description. When using lexc or regular expressions to define natural-language morphotactics, it is therefore often convenient and appropriate to begin with a description that overgenerates and overrecognizes, and subsequently filter out the bad words using one of several methods.

3 Advanced Filtering & Flag Diacritics
An initial overgenerating network must subsequently be constrained by one of:
- Composition of filters at compile time, known as “composing in” the restrictions; this can easily result in an explosion in the size of the network; or
- Simulation of composition at runtime, which is slow; or
- Flag Diacritics, recognized and applied at runtime; this method can avoid size explosions and still run very fast.

4 Goal of this Presentation
- Quick review of the linguistic problem of separated dependencies
- Illustrate the traditional solutions and their problems
- Acquaint you with Flag Diacritics

Linguists doing practical development of morphological analyzers and similar networks should be aware of Flag Diacritics and use them where appropriate. Flag Diacritics are a relatively recent innovation and are slowly being retrofitted into existing morphological analyzers. Any new work should consider using Flag Diacritics from the very start.

5 Continuation Classes and Concatenation
The “continuation classes” of lexc translate into concatenation:

  LEXICON Foo
  root    Suff ;
  root    Suffx ;

  LEXICON Suff
  ard     # ;

  LEXICON Suffx
  xarc    # ;

So the co-occurrence restrictions between one morpheme and the very next morpheme(s) are usually easy to handle.
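
The same point can be made directly in xfst notation. The following one-liner is a sketch (not from the original slides) of the language defined by the lexc fragment above, written as plain concatenation and union:

  ! Sketch only: the lexc continuation classes above amount to this concatenation
  read regex {root} [ {ard} | {xarc} ] ;   ! accepts "rootard" and "rootxarc"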

6 prefix1+prefix2+prefix3+stem+suffix1+suffix2+suffix3
Constraining “separated dependencies” can be awkward in lexc or regular expressions:

  prefix1+prefix2+prefix3+stem+suffix1+suffix2+suffix3

- prefix1 may be incompatible with suffix3
- prefix1 may be optional, but if it is present, it might require the presence of suffix3
- suffix2 may be optional, but if it is present, it might require a previous prefix2
- etc., etc., etc.

7 Quick Review of Composition
- Transducers have an upper-side language and a lower-side language.
- By Xerox convention, the upper-side language contains analysis strings, usually consisting of a root and tags, e.g. Root[Tag1][Tag2][Tag3].
- The lower-side language usually consists of orthographical strings.
- If you compose a filter or rules on the top of the network, it must match the upper-side language.
- If you compose a filter or rules on the bottom of the network, it must match the lower-side language.
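
As an illustration of the two directions of composition, here is a hypothetical xfst sketch; the file name lex.fst and the tags +Tag1 and +Tag3 are invented for the example, not taken from the slides:

  ! Assume lex.fst is a previously saved lexical transducer with tags on its upper side.
  define Lex       @"lex.fst" ;
  define TagFilter ~$[ %+Tag1 ?* %+Tag3 ] ;   ! "no +Tag1 ... +Tag3 in the same string"

  read regex TagFilter .o. Lex ;   ! composed on top: constrains the upper-side (analysis) language
  clear stack
  read regex Lex .o. TagFilter ;   ! composed on the bottom: constrains the lower-side (orthographic) language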

8 Constraining Separated Dependencies within Words via Composition of Filters
E.g. assume that the presence of one morpheme, “foo”, precludes the co-occurrence of another morpheme, “fum”, anywhere later in the word. Let “foo” be spelled “foo^X” on the upper side, and let “fum” be spelled “fum^Y” on the upper side, where ^X and ^Y are declared multicharacter symbols that we will use as features.

An overgenerating lexicon mylex.fst contains ungrammatical strings like

  … f o o ^X … f u m ^Y …

One solution: eliminate such ungrammatical strings by the compile-time composition of a suitable filter, then map the features to epsilon:

  0 <- [ %^X | %^Y ]
  .o.
  ~$[ %^X ?* %^Y ]
  .o.
  mylex.fst

The original method of removing overgeneration and overrecognition from networks is to “compose in” filters.
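
To make the compile-time step concrete, here is a hypothetical xfst session built around the filter above; the file names are illustrative, and @"mylex.fst" simply loads the previously saved overgenerating lexicon into the regular expression:

  xfst[]: read regex 0 <- [ %^X | %^Y ]
                 .o. ~$[ %^X ?* %^Y ]
                 .o. @"mylex.fst" ;
  xfst[]: save stack mylex-filtered.fst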

9 A Second, Equivalent Solution
Again, assume that the presence of one morpheme, “foo”, precludes the co-occurrence of another morpheme, “fum”, anywhere later in the word. Let “foo” be spelled “foo^X” on the upper side, and let “fum” be spelled “fum^Y” on the upper side, where ^X and ^Y are declared multicharacter symbols that we will use as features.

We can state the restriction as “^Y occurs only in words where it is not preceded by ^X”:

  0 <- [ %^X | %^Y ]
  .o.
  %^Y => .#. ~$[%^X] _
  .o.
  mylex.fst

The original method of removing overgeneration and overrecognition from networks is to “compose in” filters.

10 More Separated Dependencies within Words
Assume that the presence of one morpheme, e.g. “fie”, requires the co-occurrence of another morpheme, e.g. “fee”, somewhere later in the same word. Let “fie” be spelled “fie^X” on the upper side, and let “fee” be spelled “fee^Y” on the upper side.

The overgenerating lexicon mylex.fst contains ungrammatical strings like

  … f i e ^X …

where fee^Y does not occur after fie^X.

One solution: eliminate such ungrammatical strings at compile time by the composition of a suitable filter, e.g.

  0 <- [ %^X | %^Y ]
  .o.
  ~[ ?* %^X ~$[%^Y] ]
  .o.
  mylex.fst

11 Another equivalent solution:
Again, assume that the presence of one morpheme, e.g. “fie”, requires the co-occurrence of another morpheme, e.g. “fee”, somewhere later in the same word. Let “fie” be spelled “fie^X” on the upper side, and let “fee” be spelled “fee^Y” on the upper side.

If ^X appears, it must be followed by a ^Y:

  0 <- [ %^X | %^Y ]
  .o.
  %^X => _ $[%^Y]
  .o.
  mylex.fst

12 More Separated Dependencies within Words
Assume that the presence of one morpheme, e.g. “fee”, requires the co-occurrence of another morpheme, e.g. “fie”, somewhere earlier in the same word. So fie is usually optional, but it is required with a following fee. Let “fee” be spelled “fee^Y” on the upper side, and let “fie” be spelled “fie^X” on the upper side.

The overgenerating lexicon mylex.fst contains ungrammatical strings like

  … f e e ^Y …

where fie^X does not occur before fee^Y.

One solution: eliminate such ungrammatical strings at compile time by the composition of a suitable filter, e.g.

  0 <- [ %^X | %^Y ]
  .o.
  ~[ ~$[%^X] %^Y ?* ]
  .o.
  mylex.fst

13 Another equivalent solution
Again, assume that the presence of one morpheme, e.g. “fee”, requires the co-occurrence of another morpheme, e.g. “fie”, somewhere earlier in the same word. Let “fee” be spelled “fee^Y” on the upper side, and let “fie” be spelled “fie^X” on the upper side.

^Y, if it appears, must be preceded by ^X:

  0 <- [ %^X | %^Y ]
  .o.
  %^Y => $[%^X] _
  .o.
  mylex.fst

14 Problems with the Traditional “composing in” of Constraints
When you “compose in” such restrictions for separated dependencies, the overgeneration and overrecognition are eliminated, but the resulting transducer tends to get bigger, sometimes very big. The general problem is that all the states and arcs between the two co-restricted morphemes need to be copied.

15 Arabic Articles and Case Endings
A bare Arabic stem can generally take any one of six case endings:

  [Network diagram: the stem “kaatib” followed by six case endings; definite: +Def+Nom:u, +Def+Gen:i, +Def+Acc:a; indefinite: +Indef+Nom:uN, +Indef+Gen:iN, +Indef+Acc:aN; the +Def and +Indef tags themselves are realized as ε.]

Assume that the subnetwork represented here by “kaatib” contains all the noun stems and is very large.

16 Arabic Articles and Case Endings
But an Arabic noun can also, optionally, take the al- prefix, which is an overt definite article, e.g. kaatibu or alkaatibu. Using lexc or xfst, we could easily make it an optional prefix, thus:

  [Network diagram: an optional Art+:al prefix path (Art+ on the upper side, “a l” on the lower side, with an ε path for its absence) added in front of the “kaatib” stem subnetwork and the same six case endings: +Def+Nom:u, +Def+Gen:i, +Def+Acc:a, +Indef+Nom:uN, +Indef+Gen:iN, +Indef+Acc:aN.]

Unfortunately, this straightforward solution overgenerates. The overt al- prefix can in fact co-occur only with +Def case endings. This is a classic “separated dependency”.
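
In lexc terms, the overgenerating network could be described roughly as follows. This is a sketch for illustration only: the lexicon names are invented, and “kaatib” stands in for the entire (very large) noun-stem sublexicon.

  Multichar_Symbols Art+ +Def +Indef +Nom +Gen +Acc

  LEXICON Root
                  Article ;

  LEXICON Article
  Art+:al         Stems ;   ! optional overt definite article
                  Stems ;   ! or no article at all

  LEXICON Stems
  kaatib          Case ;    ! stands in for the full stem lexicon

  LEXICON Case
  +Def+Nom:u      # ;
  +Def+Gen:i      # ;
  +Def+Acc:a      # ;
  +Indef+Nom:uN   # ;       ! nothing yet prevents al- from
  +Indef+Gen:iN   # ;       ! combining with these indefinite endings
  +Indef+Acc:aN   # ;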

17 Arabic Articles and Case Endings
You can filter out the bad strings using compile-time composition:

  ~$[ Art%+ ?* %+Indef ]
  .o.
  [the overgenerating network]

  [Network diagram: the filter composed on top of the network with the optional Art+:al prefix, the “kaatib” stem subnetwork, and the six case endings +Def+Nom:u, +Def+Gen:i, +Def+Acc:a, +Indef+Nom:uN, +Indef+Gen:iN, +Indef+Acc:aN.]
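
In xfst, assuming the overgenerating lexicon has been compiled and saved as arabic.fst (the file names are invented for this sketch), the compile-time filtering would look something like:

  xfst[]: read regex ~$[ Art%+ ?* %+Indef ] .o. @"arabic.fst" ;
  xfst[]: save stack arabic-filtered.fst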

18 But this almost doubles the size of the network!
To impose this constraint in pure finite-state terms, the whole noun-stem structure is duplicated in the course of composition.

  [Network diagram: two copies of the “kaatib” stem subnetwork; the copy reached through the Art+:al prefix continues only into the +Def endings (+Nom:u, +Gen:i, +Acc:a), while the copy without the prefix continues into both the +Def and the +Indef endings (uN, iN, aN).]

19 An Alternative Solution: Simulation of Composition at Runtime
Using Xerox utilities like ‘lookup’, you can specify “lookup strategies” that involve compositions that are simulated at runtime, e.g.

  0 <- [ %^X | %^Y ]
  .o.
  ~[ ?* %^X ~$[%^Y] ]
  .o.
  mylex.fst

This keeps the transducers small and produces the same results as compile-time composition, but it usually runs more slowly.

Instead of composing in the filters at compile time, one can use the ‘lookup’ facility to keep the networks separate and SIMULATE the composition at runtime. The ‘lookup’ facility is maintained by Tamas Gaal. It requires the definition of a file containing “lookup strategies”. In general, the simulation of composition at runtime is less efficient than “compiling in” the same restrictions at compile time.

20 Flag Diacritics: A Practical Alternative
What are flag diacritics?
- Simple feature-like symbols for imposing constraints
- Especially useful for enforcing “separated dependencies” between morphemes

Motivations
- Keep networks smaller (prevent “blow-ups”)
- Keep networks maximally efficient at runtime

How to use them
- Syntax
- Semantics

When to use them

21 What Are Flag Diacritics?
As far as regular expressions, lexc and networks are concerned, Flag Diacritics are just multicharacter symbols, defined or declared like any other multicharacter symbols. The linguist can add Flag Diacritic symbols to any strings in the network; they usually become part of the spelling of a morpheme. Flag Diacritics have a distinctive spelling: they are delimited by at-signs (@) and contain two or three fields separated by periods (full stops), e.g. @U.feature.value@.

Flag Diacritics allow simple, efficient and highly valuable feature constraints at runtime.

22 The Semantics of Flag Diacritics
During the application of a network, Flag Diacritic symbols are not matched against the input strings or included in the output strings. In this sense, Flag Diacritics are treated like epsilons. But unlike epsilons, a network path labeled with a Flag Diacritic is successfully traversed at runtime only if the operation indicated by the Flag Diacritic is successful. The operations involve feature-setting and feature-unification. The Flag Diacritics are “noticed”, and the feature-like operations are performed, by “flag-sensitive” runtime code, e.g. ‘apply up’ and ‘apply down’.

23 Arabic Articles and Case Endings with Flag Diacritics: Start with an overgenerating network
[Network diagram: the same optional Art+:al prefix, “kaatib” stem subnetwork and six case endings as before, but now with @U.ART.YES@ on the article path and @U.ART.NO@ on the +Indef endings.]

The network contains illegal paths like:

  Upper:  Art+ k a a t i b +Indef +Nom
  Lower:  a l  k a a t i b uN

This diagram is the same as the old overgenerating FST that recognizes and generates illegal words that contain both an overt definite-article prefix and an indefinite case suffix, but now Flag Diacritics have been introduced by the linguist. The definite-article prefix now contains the symbol @U.ART.YES@, intended to indicate that the definite-article prefix is present. The indefinite case suffixes now incorporate the symbol @U.ART.NO@, intended to indicate that the definite-article prefix must be absent. If the Flag Diacritics were simply treated as epsilons, the network would still accept and generate illegal words such as alkaatibuN.
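
As a concrete illustration, the earlier lexc sketch can be extended with these flags. This remains a sketch: the feature name ART and the values YES/NO follow the discussion on the next slide, but the layout is illustrative, and “kaatib” again stands in for the full stem lexicon.

  Multichar_Symbols Art+ +Def +Indef +Nom +Gen +Acc
                    @U.ART.YES@ @U.ART.NO@

  LEXICON Root
                                     Article ;

  LEXICON Article
  @U.ART.YES@Art+:@U.ART.YES@al      Stems ;   ! overt al-: set ART = YES (flag on both sides)
                                     Stems ;   ! no article: ART stays unset

  LEXICON Stems
  kaatib                             Case ;    ! stands in for the full stem lexicon

  LEXICON Case
  +Def+Nom:u                          # ;
  +Def+Gen:i                          # ;
  +Def+Acc:a                          # ;
  @U.ART.NO@+Indef+Nom:@U.ART.NO@uN   # ;      ! requires ART to unify with NO;
  @U.ART.NO@+Indef+Gen:@U.ART.NO@iN   # ;      ! fails if ART was set to YES by al-
  @U.ART.NO@+Indef+Acc:@U.ART.NO@aN   # ;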

24 Arabic Articles and Case Endings
The network still contains illegal paths like:

  Upper:  Art+ k a a t i b +Indef +Nom
  Lower:  a l  k a a t i b uN

But while exploring this path, looking up the bad word *alkaatibuN, the flag-sensitive ‘apply up’ routine will
- find @U.ART.YES@ on the lower side,
- treat it as an epsilon (it consumes no input),
- but remember the feature setting ART = YES,
- eventually find @U.ART.NO@ on the lower side, treat it as an epsilon, but
- try to “unify” ART = NO with the stored value ART = YES, and FAIL.

The illegal path is therefore blocked at runtime.

Lookup proceeds in the usual way, matching symbols from the input against the lower side of the network. When looking up alkaatibuN, the ‘a’ and ‘l’ are matched and consumed in the usual way. @U.ART.YES@ is then found by the lookup routine and is treated like an epsilon, not being matched against any input. However, the lookup routine interprets the Flag Diacritic as an instruction to unify the feature setting ART = YES with the current set of features; as ART has not yet been set, it simply stores ART = YES in its memory. The lookup continues to match ‘k’, ‘a’, ‘a’, ‘t’, ‘i’, ‘b’ in the usual way before finding @U.ART.NO@, which calls for a unification with the value ART = NO. This fails to unify with the stored value ART = YES, and the path is blocked. Thus the illegal word alkaatibuN is not analyzed. Lookup continues by backtracking to look for any other possible solutions.
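
A hypothetical interactive check of this behaviour, assuming the flagged lexc sketch above has been saved as arabic-flags.lexc (the file name is invented; the word forms are those from the slides):

  xfst[]: read lexc < arabic-flags.lexc
  xfst[]: apply up alkaatibu
  ! -> analyzed: on this path @U.ART.YES@ unifies with the still-unset ART
  xfst[]: apply up alkaatibuN
  ! -> no analysis: @U.ART.NO@ on the +Indef ending fails to unify with ART = YES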

25 Flag Diacritics
Each flag diacritic signals the runtime apply routine to perform a little feature-based operation. The arc labeled with the flag diacritic is traversed only if the feature-based operation is successful; otherwise the algorithm abandons the path and backtracks for other solutions. The application routines contain a very small amount of memory for storing feature values.

If used correctly, Flag Diacritics allow your network to contain illegal paths that are noticed and rejected at runtime. The result is to get the restrictions you need, without the network blowing up in size, and with minimal loss of speed.

26 The basic @U.feature.value@ flags
All features start out with neutral (unset) values. The spelling is @U.feature.value@, where the feature and value strings are chosen by the linguist; they have no inherent meaning to the system.

If the application routine finds @U.X.Y@ and there is no stored value for X, then it simply sets feature X = Y in its little memory. If the application routine finds @U.X.Y@ and there is a previously set value for X, then the routine will attempt to unify the new value with the old one; if successful, the arc is traversed, otherwise the operation fails.

All you need in many practical applications are @U.feature.value@ flags.
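
To see a @U...@ flag at work outside the Arabic example, here is a minimal, self-contained lexc sketch (morphemes and feature name invented) that enforces the earlier “foo precludes fum” restriction without any compile-time filter:

  Multichar_Symbols @U.F.ON@ @U.F.OFF@

  LEXICON Root
  @U.F.ON@foo     Stems ;   ! "foo" sets F = ON
                  Stems ;   ! no "foo": F stays unset

  LEXICON Stems
  blah            Suffixes ;

  LEXICON Suffixes
  @U.F.OFF@fum    # ;       ! "fum" requires F to unify with OFF; fails after "foo"
                  # ;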

27 You do not need to declare flag diacritics
As far as networks and sigmas are concerned, a Flag Diacritic is just a normal multicharacter symbol. Flag Diacritics do have a distinctive spelling, surrounded by at-sign (@) delimiters. Flag Diacritics are “noticed” or “obeyed” only by application routines like ‘apply up’ and ‘apply down’ that have been rewritten to be “sensitive to flag diacritics.”

28 Other Feature-Diacritic Types
Positive Reset @P.feature.value@: (re)sets feature = value; always succeeds.
Negative Reset @N.feature.value@: (re)sets feature # value (i.e. to the complement of value).

There are other kinds of Flag Diacritics besides the ‘U’ type. More examples are available in “The Book”.

29 Other Feature-Diacritic Types
Require @R.feature.value@: succeeds iff currently feature = value.
Require @R.feature@: succeeds iff feature is currently set to a non-neutral value.
Disallow @D.feature.value@: succeeds iff feature is currently set to something other than value.
Disallow @D.feature@: succeeds iff feature is neutral/unset.
Clear @C.feature@: (re)sets feature back to the neutral/unset value; always succeeds.

There are other kinds of Flag Diacritics besides the ‘U’ type. More examples are available in “The Book”.
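
These non-unifying flag types can be combined. For instance, the earlier “fee requires an earlier fie” dependency could be sketched in lexc with a positive set on the prefix and a require test on the suffix (morphemes and feature name invented for illustration):

  Multichar_Symbols @P.FIE.ON@ @R.FIE.ON@

  LEXICON Root
  @P.FIE.ON@fie   Stems ;   ! "fie" sets FIE = ON (always succeeds)
                  Stems ;   ! "fie" is optional

  LEXICON Stems
  blah            Suffixes ;

  LEXICON Suffixes
  @R.FIE.ON@fee   # ;       ! "fee" requires FIE = ON, i.e. an earlier "fie"
                  # ;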

30 A Trap for the Unwary
‘apply up’ matches the input against the lower side of a transducer and notices Flag Diacritics only on the lower side; ‘apply down’ matches the input against the upper side of a transducer and notices Flag Diacritics only on the upper side. So typically you want to define your networks so that Flag Diacritics are visible on both sides of a transducer. But experts may want to build systems with different restrictions on analysis and generation.

31 Some Success Stories
- Hungarian Morphological Analyzer: was 35 Megabytes; now 5 Megabytes, after adding 5 Flag Diacritic attributes.
- French Morphological Analyzer: was 11 Megabytes; now 5 Megabytes, after adding 37 Flag Diacritic attributes.
- New German Morphological Analyzer: now just ,934 arcs; explodes to 2,247,984 arcs after eliminating just two Flag Diacritics (among many).

32 When to Use Flag Diacritics
Use them when you need to keep networks smaller, especially when there are separated dependencies. I also find them very useful for keeping lexc descriptions simpler and avoiding the proliferation of continuation classes.

You can always remove Flag Diacritics from a network using ‘eliminate flag’:

  xfst[]: eliminate flag attrname

The effect of ‘eliminate flag’ is the same as composing in the restrictions, usually resulting in an increase in size.

While Flag Diacritics may initially appear to be a complication, which linguists may be tempted to leave for later, they can considerably simplify a lexc grammar and should be used, where appropriate, from the beginning. Retrofitting Flag Diacritics into an existing system is a nuisance that should be avoided where possible.
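
A hypothetical session checking the size effect, assuming a flagged network has been saved (the file name is invented; the attribute name ART is taken from the Arabic example):

  xfst[]: read regex @"arabic-flags.fst" ;
  xfst[]: print size
  xfst[]: eliminate flag ART
  xfst[]: print size     ! typically larger: the ART restrictions are now composed in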

33 More Information on Flag Diacritics
“The Book” (Beesley & Karttunen, 2003) contains a whole chapter on Flag Diacritics. Sonja Bosch and Laurette Pretorius have used them successfully to constrain the combination of class prefixes with Zulu roots. Contact me if you need help.

Flag Diacritics have been a difficult subject for a number of our linguist students. Be sure to read the available documentation: there is now a whole chapter in The Book. You may also want to consult with XRCE developers who have already used Flag Diacritics successfully.

