Automated Compounding as a Means for Maximizing Lexical Coverage
Vincent Vandeghinste
Centrum voor Computerlinguïstiek, K.U. Leuven
Maximizing Lexical Coverage
Target: reduction of the number of out-of-vocabulary (OOV) words
Means:
–accurate content and organization of the recognizer lexicon
–handling a number of productive word formation processes
Evaluation:
–implementation of a test tool
–test results
Conclusions
Lexicon: Content & Organization
Starting point: CGN-lexicon ( entries)
Reduction to one entry per wordform per POS ( entries)
Removal of compounds ( entries)
Selection of the most frequent entries (40,000) => Basic Word List (BWL)
Quasi-Word List (QWL): compounding word parts which don’t appear in BWL
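The reduction steps on this slide can be sketched as a small pipeline. This is a minimal illustration, not the procedure actually used: the `(wordform, pos, frequency)` tuple layout, the `is_compound` predicate, and the choice to sum the frequencies of duplicate entries are all assumptions made for the example.

```python
from collections import Counter


def build_bwl(entries, is_compound, size=40_000):
    """Sketch of the BWL selection.

    entries: iterable of (wordform, pos, frequency) tuples (assumed layout).
    is_compound: hypothetical predicate telling whether a wordform is a compound.
    size: number of most frequent entries to keep (40,000 on the slide).
    """
    merged = Counter()
    for wordform, pos, freq in entries:
        # One entry per wordform per POS; summing duplicates is an assumption.
        merged[(wordform, pos)] += freq

    # Remove compounds so that only simplex words remain.
    simplex = {key: f for key, f in merged.items() if not is_compound(key[0])}

    # Keep the most frequent entries; ties are broken alphabetically.
    ranked = sorted(simplex.items(), key=lambda kv: (-kv[1], kv[0]))
    return [key for key, _ in ranked[:size]]
```

Word parts that only occur inside the removed compounds and are missing from the resulting list would then be candidates for the QWL.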
Lexicon Accuracy
Careful selection of the words in BWL:
–no compounds
–frequent words
Organization of the lexicon: maximal applicability of compounding rules through the lexicon split into BWL and QWL
Word Formation Processes
Input: a number of word parts that can or cannot be compounded
Hybrid approach: rule-based + statistical filters
Output:
–compound + morpho-syntactic info + confidence measure
–no compounding possible with the given word parts
Word Formation Processes: Input
From BWL: full words, which can be part of a compound or can be words by themselves
From QWL: ‘words’ that can only be part of a compound
Between 2 and 5 word parts per input
Word Formation Processes: Rules
Making use of rules for word formation, e.g.: modifier (N) + head (N) => compound (N)
Input from QWL: word part is N and can only be a modifier
Input from BWL: word is looked up in CGN; its morpho-syntactic info is used in the rules
Rules use 2 word parts
When the input has more than 2 word parts: recursion in the rules
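A binary rule applied recursively could look like the sketch below. Only the N + N => N rule comes from the slide; the PREP + V rule, the toy lexicon contents, the POS assigned to each word, and the choice to try every split point are assumptions made for illustration.

```python
from functools import lru_cache

# Rule table: (modifier POS, head POS) -> POS of the compound.
# N + N => N is taken from the slide; PREP + V => V is a hypothetical extra rule.
RULES = {("N", "N"): "N", ("PREP", "V"): "V"}

# Toy lexicons (assumed contents); real POS info would be looked up in CGN.
BWL = {"frequentie": {"N"}, "tabel": {"N"}, "bouw": {"N"},
       "binnen": {"PREP"}, "kijken": {"V"}}
QWL = {"scheeps"}  # word parts that are N and can only be a modifier


@lru_cache(maxsize=None)
def _options(parts):
    """Return a frozenset of (pos, modifier_only) analyses for a tuple of parts."""
    if len(parts) == 1:
        found = {(pos, False) for pos in BWL.get(parts[0], ())}
        if parts[0] in QWL:
            found.add(("N", True))
        return frozenset(found)
    found = set()
    for i in range(1, len(parts)):  # recursion: try every binary split
        for mod_pos, _ in _options(parts[:i]):
            for head_pos, modifier_only in _options(parts[i:]):
                if modifier_only:
                    continue  # a QWL part can never be the head
                if (mod_pos, head_pos) in RULES:
                    found.add((RULES[(mod_pos, head_pos)], False))
    return frozenset(found)


def compound_pos(parts):
    """POS tags the parts can receive as one word; an empty set means no compound."""
    return {pos for pos, modifier_only in _options(tuple(parts)) if not modifier_only}
```

With these toy lexicons, `compound_pos(["scheeps", "bouw", "tabel"])` is analysed as a QWL modifier attached to the two-part compound that follows it.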
Word Formation Processes: Statistics
Relative Frequency Threshold Parameter
Confidence Measure of the Compounding Probability
Relative Frequency Threshold
Makes use of the relative frequency (RF) of a POS for a word form
Makes use of a threshold value T (0.05%)
If RF > T: the POS is used for this wordform
If RF < T: the POS is rejected for this wordform
Example: RF(bij (PREP)) = > T, RF(bij (N)) = < T, so only bij (PREP) is used
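The filter can be sketched as follows. The sketch assumes that RF is the count of a wordform with a given POS divided by the total count of that wordform, and that a value exactly equal to the threshold is rejected; the slide specifies neither. The counts in the usage note are invented.

```python
THRESHOLD = 0.0005  # 0.05 %, the threshold value given on the slide


def accepted_pos(pos_counts, threshold=THRESHOLD):
    """Return the POS tags whose relative frequency exceeds the threshold.

    pos_counts: mapping POS -> count of the wordform with that POS (assumed input).
    """
    total = sum(pos_counts.values())
    if total == 0:
        return set()
    # Assumption: RF equal to the threshold is treated as rejected.
    return {pos for pos, count in pos_counts.items() if count / total > threshold}
```

For a wordform with hypothetical counts `{"PREP": 99990, "N": 10}`, the N reading has RF 0.0001 and is dropped, so only PREP is passed on to the rules.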
Confidence Measure of Compounding Probability
Estimation of: P(comp(w1=mod, w2=head)) / P(comp(w1=*, w2=head))
where:
–P(comp(w1=mod, w2=head)) is the probability that two consecutive word parts form a compound rather than being 2 separate words
–P(comp(w1=*, w2=head)) is the probability of w2 being a head, with any modifier
Confidence Measure of Compounding Probability (2)
If the compound is found in the frequency list, the ratio is estimated like this:
[Fr(comp(w1=mod, w2=head)) / Fr(comp(w1=*, w2=head))] x (1 - D_head)
where:
–Fr(comp(w1=mod, w2=head)) is the frequency of the compound that consists of w1 + w2
–Fr(comp(w1=*, w2=head)) is the frequency of the 2nd word part as a head, with any modifier
–D_head is the discount parameter: the amount of probability reserved for words not in the frequency list
Confidence Measure of Compounding Probability (3)
The discount parameter is estimated as:
D_head = #diff(mod | head) / Fr(comp(w1=*, w2=head))
where:
–#diff(mod | head) is the number of different modifiers occurring with the given head
–Fr(comp(w1=*, w2=head)) is the frequency of the 2nd word part as a head, with any modifier
(1 - D_head) is the amount of probability reserved for words that can be found in the frequency list
Confidence Measure of Compounding Probability (4)
If the compound is not found in the frequency list, the ratio is estimated like this:
D_head x [Fr(comp(w1=mod, w2=*)) / Fr(*)]
where:
–Fr(comp(w1=mod, w2=*)) is the frequency of the 1st word part as a modifier of any head
–Fr(*) is the total frequency of all words in the frequency list (= )
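Taken together, the seen and unseen cases form one piecewise estimate. The block below only restates the formulas from the slides in LaTeX; the name \(\widehat{C}\) for the confidence measure is introduced here for convenience and is not the author's notation.

```latex
\[
\widehat{C}(mod, head) =
\begin{cases}
\dfrac{Fr(comp_{w_1=mod,\,w_2=head})}{Fr(comp_{w_1=*,\,w_2=head})}\,\bigl(1 - D_{head}\bigr)
  & \text{if the compound is in the frequency list} \\[2ex]
D_{head}\,\dfrac{Fr(comp_{w_1=mod,\,w_2=*})}{Fr(*)}
  & \text{otherwise}
\end{cases}
\]
\[
D_{head} = \frac{\#diff(mod \mid head)}{Fr(comp_{w_1=*,\,w_2=head})}
\]
```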
Confidence Measures: Examples
binnen + kijken
–binnenkijken occurs in the frequency list
–Fr(w1=binnen, w2=kijken) = 10
–Fr(w1=*, w2=kijken) = 2188
–#diff(mod | head=kijken) = 21
–(10 / 2188) x (1 - 21/2188) =
frequentie + tabel
–frequentietabel does not occur in the frequency list
–Fr(w1=*, w2=tabel) = 141
–#diff(mod | head=tabel) = 17
–Fr(w1=frequentie, w2=*) = 15
–(17 / 141) x (15 / ) = 2.26e-8
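The two cases of the confidence measure can be checked with a few lines of code. This is a sketch under the definitions given on the slides; the function and parameter names are chosen here, and `total_freq` in the second call is a placeholder value, not the size of the actual frequency list.

```python
def discount(n_distinct_modifiers, head_freq):
    """D_head: probability mass reserved for compounds not in the frequency list."""
    return n_distinct_modifiers / head_freq


def confidence_seen(compound_freq, head_freq, n_distinct_modifiers):
    """Confidence for a compound that occurs in the frequency list."""
    return (compound_freq / head_freq) * (1 - discount(n_distinct_modifiers, head_freq))


def confidence_unseen(modifier_freq, head_freq, n_distinct_modifiers, total_freq):
    """Confidence for a compound that does not occur in the frequency list."""
    return discount(n_distinct_modifiers, head_freq) * (modifier_freq / total_freq)


# binnen + kijken, using the counts from the slide: roughly 0.0045
binnenkijken = confidence_seen(compound_freq=10, head_freq=2188, n_distinct_modifiers=21)

# frequentie + tabel; total_freq is a hypothetical placeholder
frequentietabel = confidence_unseen(modifier_freq=15, head_freq=141,
                                    n_distinct_modifiers=17, total_freq=1_000_000)
```

Because the unseen case divides by the total frequency of the whole list, unseen compounds receive far lower confidence values than attested ones, which matters when a single threshold is applied to both.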
Evaluation
Test System
Test Results
The Test System
Takes a regular text as input
Converts punctuation marks into #
For the test system, a BWL of entries was used
Every word is checked in BWL:
–if the word is not present in BWL, it is split into a modifier (QWL or BWL) and a head (BWL)
–no compounding rules are used in the splitting procedure
–if no possible split is found, a split into 3 parts is tried
If a word can’t be found in BWL and can’t be split, it is classified as an OOV word
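The splitting procedure can be sketched as below. The slide does not say which split is chosen when several are possible, nor where a middle part of a 3-part split may come from; the sketch assumes the first split found is returned and that every non-final part may come from BWL or QWL. The function names are chosen for this example.

```python
def split_word(word, bwl, qwl, max_parts=3):
    """Return [word] if it is in BWL, a list of parts if it can be split, else None (OOV)."""
    if word in bwl:
        return [word]
    for n_parts in range(2, max_parts + 1):  # try 2 parts first, then 3
        parts = _split(word, n_parts, bwl, qwl)
        if parts:
            return parts
    return None  # classified as an OOV word


def _split(word, n_parts, bwl, qwl):
    """Split word into exactly n_parts; the last part (the head) must be in BWL."""
    if n_parts == 1:
        return [word] if word in bwl else None
    for i in range(1, len(word)):
        first, rest = word[:i], word[i:]
        if first in bwl or first in qwl:  # non-final parts: BWL or QWL
            tail = _split(rest, n_parts - 1, bwl, qwl)
            if tail:
                return [first] + tail
    return None
```

As on the slide, no compounding rules are consulted here: the split only checks lexicon membership, and the rules and statistical filters are applied afterwards to the resulting parts.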
The Test System (2)
For every 2 consecutive word parts, it was tested whether they can be compounded or not
Results are compared with the original text
False compounding and false identification of noncompounds can be counted this way
The same was done for every 3 consecutive word parts
A threshold was set on the Confidence Measure: if Confidence Measure < Threshold, the compound is rejected
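Counting the outcomes against the original text could be done as in the following sketch. The input layout, with `None` standing for "no rule applies", and the outcome labels are assumptions made for the example; only the rejection criterion (confidence below threshold) comes from the slide.

```python
def evaluate(pairs, threshold):
    """Count outcomes for consecutive word-part sequences.

    pairs: iterable of (confidence, is_compound_in_text); confidence is None
           when no compounding rule applies (assumed representation).
    """
    counts = {"true_compound": 0, "false_compound": 0,
              "true_noncompound": 0, "false_noncompound": 0}
    for confidence, is_compound_in_text in pairs:
        # A compound is rejected when its confidence is below the threshold.
        predicted = confidence is not None and confidence >= threshold
        if predicted and is_compound_in_text:
            counts["true_compound"] += 1
        elif predicted:
            counts["false_compound"] += 1      # false compounding
        elif is_compound_in_text:
            counts["false_noncompound"] += 1   # compound wrongly left apart
        else:
            counts["true_noncompound"] += 1
    return counts
```

Running this for a range of threshold values on one test text gives the trade-off between false compounding and false identification of noncompounds.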
Test Results
3 test texts were used:
–Thuis (dialogue of a soap series): 3415 words, 3.08% OOV, 1.47% compounds
–Aspe (chapter of a novel): 4589 words, 3.77% OOV, 6.08% compounds
–Interview (transcript of spontaneous speech): 4645 words, 0.84% OOV, 2.95% compounds
Most of the OOV words are proper nouns or non-standard Dutch
Test Results (2)
Correct identification of noncompounds and compounds is:
–dependent on the test text
–dependent on the parameter thresholds
There is a nearly perfect negative correlation (-0.98) between the optimal confidence threshold and the share of compounds in the test text
Test Results (3)
Conclusions
Identifying compoundability can be done with an accuracy of %
Lexical coverage can be assured with OOV rates between 0.8% and 3.8% and a lexicon with a total size of entries (BWL+QWL)
Conclusions (2)
Capturing already existing compounds by automated compounding proves to be successful
Capturing newly formed compounds proves to be a lot harder: the accuracy is a lot lower
Automated compounding proves to be a useful means for maximizing lexical coverage