1 Automated Compounding as a means for Maximizing Lexical Coverage
Vincent Vandeghinste
Centrum voor Computerlinguïstiek, K.U. Leuven

2 Maximizing Lexical Coverage
Target: reduction of the number of OOV words
Means:
– accurate content and organization of the recognizer lexicon
– taking care of a number of productive word-formation processes
Evaluation:
– implementation of a test tool
– test results
Conclusions

3 Lexicon: Content & Organization
Starting point: CGN lexicon (570.000 entries)
Reduction to one entry per word form per POS (300.000 entries)
Removal of compounds (160.000 entries)
Selection of the most frequent entries (40.000) => Basic Word List (BWL)
Quasi-Word List (QWL): compounding word parts that do not appear in the BWL
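A minimal sketch of these reduction steps, assuming the CGN lexicon has already been loaded into records with hypothetical fields wordform, pos, freq, is_compound and compound_parts (the real CGN format differs):

```python
def build_word_lists(entries, bwl_size=40_000):
    """Derive a Basic Word List (BWL) and Quasi-Word List (QWL) from lexicon records.
    'entries' is a list of dicts with hypothetical keys: wordform, pos, freq,
    is_compound, compound_parts."""
    # One entry per (wordform, POS): keep the most frequent reading.
    best = {}
    for e in entries:
        key = (e["wordform"], e["pos"])
        if key not in best or e["freq"] > best[key]["freq"]:
            best[key] = e

    # Remove compounds, then keep the most frequent entries as the BWL.
    simplex = [e for e in best.values() if not e["is_compound"]]
    simplex.sort(key=lambda e: e["freq"], reverse=True)
    bwl = {e["wordform"] for e in simplex[:bwl_size]}

    # QWL: compounding word parts that do not occur in the BWL themselves.
    qwl = {part
           for e in best.values() if e["is_compound"]
           for part in e.get("compound_parts", [])
           if part not in bwl}
    return bwl, qwl
```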

4 Lexicon Accuracy
Careful selection of the words in the BWL:
– no compounds
– frequent words
Organization of the lexicon: maximal applicability of the compounding rules through the lexicon split into BWL and QWL

5 Word Formation Processes
Input: a number of word parts that can or cannot be compounded
Hybrid approach: rule-based + statistical filters
Output:
– compound + morpho-syntactic info + confidence measure, or
– no compounding possible with the given word parts

6 Word Formation Processes: Input
From BWL: full words that can be part of a compound or can be words by themselves
From QWL: ‘words’ that can only be part of a compound
2 up to 5 word parts

7 Word Formation Processes: Rules
Making use of rules for word formation, e.g. modifier (N) + head (N) => compound (N)
Input from QWL: the word part is N and can only be a modifier
Input from BWL: the word is looked up in the CGN lexicon; its morpho-syntactic info is used in the rules
Rules operate on 2 word parts
When the input consists of more than 2 word parts, the rules are applied recursively
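A sketch of this rule-based step under simplifying assumptions: the lexicons are reduced to hypothetical dicts mapping word (parts) to a single POS tag, and the rule set contains only the N + N => N example from this slide:

```python
# A compounding rule maps (modifier POS, head POS) to the POS of the compound,
# e.g. modifier (N) + head (N) => compound (N).
RULES = {("N", "N"): "N"}

def compound(parts, bwl, qwl, rules=RULES):
    """Combine 2-5 word parts with 2-part rules; inputs of more than 2 parts
    are handled recursively by folding the leftmost pair into a new modifier."""
    mod_pos = bwl.get(parts[0], qwl.get(parts[0]))  # modifier may come from BWL or QWL
    head_pos = bwl.get(parts[1])                    # head must be a full BWL word
    if mod_pos is None or head_pos is None:
        return None
    new_pos = rules.get((mod_pos, head_pos))
    if new_pos is None:
        return None                                 # no compounding possible
    new_form = parts[0] + parts[1]
    if len(parts) == 2:
        return new_form, new_pos
    # Recursion: the freshly built compound acts as the modifier of the rest.
    return compound([new_form] + list(parts[2:]), {**bwl, new_form: new_pos}, qwl, rules)

# Example with a tiny made-up lexicon:
print(compound(["frequentie", "tabel"], {"frequentie": "N", "tabel": "N"}, {}))
# -> ('frequentietabel', 'N')
```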

8 Word Formation Processes: Statistics
Relative Frequency Threshold Parameter
Confidence Measure of the Compounding Probability

9 Relative Frequency Threshold
Makes use of the relative frequency of a POS for a word form
Makes use of a threshold value (0.05%)
If RF > threshold: the POS is used for this word form
If RF < threshold: the POS is rejected for this word form
Example: RF(bij (PREP)) = 0.999 > T and RF(bij (N)) = 0.0004 < T, so only bij (PREP) is used
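A minimal sketch of this filter; the relative frequencies are passed in directly, as in the bij example:

```python
def filter_pos(rel_freq, threshold=0.0005):
    """Keep only the POS readings of a word form whose relative frequency
    exceeds the threshold (0.05% on the slide); the rest are rejected."""
    return {pos for pos, rf in rel_freq.items() if rf > threshold}

# Example from the slide: 'bij' is kept as a preposition, rejected as a noun.
print(filter_pos({"PREP": 0.999, "N": 0.0004}))   # -> {'PREP'}
```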

10 Confidence Measure of Compounding Probability
Probability estimation of: P(comp(w1=mod, w2=head)) / P(comp(w1=*, w2=head))
where:
– P(comp(w1=mod, w2=head)) is the probability that the two consecutive word parts form a compound rather than being 2 separate words
– P(comp(w1=*, w2=head)) is the probability of w2 being a head, with any modifier

11 Confidence Measure of Compounding Probability (2)
If the compound is found in the frequency list, the ratio is estimated as:
[Fr(comp(w1=mod, w2=head)) / Fr(comp(w1=*, w2=head))] x (1 - D_head)
where:
– Fr(comp(w1=mod, w2=head)) is the frequency of the compound consisting of w1 + w2
– Fr(comp(w1=*, w2=head)) is the frequency of the 2nd word part as a head, with any modifier
– D_head is the discount parameter: the amount of probability mass reserved for words not in the frequency list

12 Confidence Measure of Compounding Probability (3)
The discount parameter is estimated as:
D_head = #diff(mod | head) / Fr(comp(w1=*, w2=head))
where:
– #diff(mod | head) is the number of different modifiers occurring with the given head
– Fr(comp(w1=*, w2=head)) is the frequency of the 2nd word part as a head, with any modifier
(1 - D_head) is the amount of probability mass reserved for words that are found in the frequency list

13 Confidence Measure of Compounding Probability (4)
If the compound is not found in the frequency list, the ratio is estimated as:
D_head x [Fr(comp(w1=mod, w2=*)) / Fr(*)]
where:
– Fr(comp(w1=mod, w2=*)) is the frequency of the 1st word part as a modifier of any head
– Fr(*) is the total frequency of all words in the frequency list (= 79.862.581)
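A sketch that puts the estimates of slides 11-13 together; the dictionaries are hypothetical stand-ins for the counts extracted from the frequency list:

```python
TOTAL_FREQ = 79_862_581   # total frequency of all words in the frequency list

def confidence(mod, head, comp_freq, head_freq, mod_freq, diff_mods, total=TOTAL_FREQ):
    """Estimate P(comp(w1=mod, w2=head)) / P(comp(w1=*, w2=head)).
    comp_freq[(mod, head)]: frequency of the compound mod+head
    head_freq[head]:        frequency of 'head' as a head, with any modifier
    mod_freq[mod]:          frequency of 'mod' as a modifier of any head
    diff_mods[head]:        number of different modifiers seen with 'head'"""
    fr_head = head_freq.get(head, 0)
    if fr_head == 0:
        return 0.0
    # Discount: probability mass reserved for compounds not in the frequency list.
    d_head = diff_mods.get(head, 0) / fr_head
    if (mod, head) in comp_freq:
        # Seen compound: its share of all compounds with this head,
        # scaled by the non-discounted mass (1 - D_head).
        return (comp_freq[(mod, head)] / fr_head) * (1 - d_head)
    # Unseen compound: discounted mass times how often 'mod' occurs as a modifier.
    return d_head * (mod_freq.get(mod, 0) / total)
```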

14 Confidence Measures: Examples
binnen + kijken
– binnenkijken occurs in the frequency list
– Fr(w1=binnen, w2=kijken) = 10
– Fr(w1=*, w2=kijken) = 2188
– #diff(mod | head=kijken) = 21
– (10 / 2188) x (1 - 21/2188) = 0.0045
frequentie + tabel
– frequentietabel does not occur in the frequency list
– Fr(w1=*, w2=tabel) = 141
– #diff(mod | head=tabel) = 17
– Fr(w1=frequentie, w2=*) = 15
– (17 / 141) x (15 / 79.862.581) = 2.26e-8
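Plugging the counts from this slide into the confidence sketch above reproduces both values (the dictionaries are toy stand-ins for the frequency list):

```python
comp_freq = {("binnen", "kijken"): 10}
head_freq = {"kijken": 2188, "tabel": 141}
mod_freq  = {"frequentie": 15}
diff_mods = {"kijken": 21, "tabel": 17}

print(confidence("binnen", "kijken", comp_freq, head_freq, mod_freq, diff_mods))
# -> 0.00452...  (slide: 0.0045)
print(confidence("frequentie", "tabel", comp_freq, head_freq, mod_freq, diff_mods))
# -> 2.26e-08    (slide: 2.26e-8)
```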

15 Evaluation
Test System
Test Results

16 The Test System
Takes a regular text as input
Converts punctuation marks into #
For the test system, a BWL of 35.000 entries was used
Every word is checked against the BWL:
– if the word is not present in the BWL, it is split up into a modifier (QWL or BWL) and a head (BWL)
– no compounding rules are used in the split-up procedure
– if no possible split-up is found, a split-up into 3 parts is tried
If a word can't be found in the BWL and can't be split up, it is classified as an OOV word
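A sketch of this split-up procedure, with the BWL and QWL reduced to hypothetical sets of word forms:

```python
def split_word(word, bwl, qwl, max_parts=3):
    """Split a word that is not in the BWL into a modifier (QWL or BWL) and a
    head (BWL); if no 2-way split exists, try a 3-way split. No compounding
    rules are applied at this stage."""
    for i in range(1, len(word)):
        mod, head = word[:i], word[i:]
        if (mod in bwl or mod in qwl) and head in bwl:
            return [mod, head]
    if max_parts >= 3:
        for i in range(1, len(word) - 1):
            mod = word[:i]
            if mod in bwl or mod in qwl:
                rest = split_word(word[i:], bwl, qwl, max_parts=2)
                if rest:
                    return [mod] + rest
    return None   # cannot be split: the word is classified as OOV

# Example with a tiny made-up lexicon:
print(split_word("frequentietabel", {"frequentie", "tabel"}, set()))
# -> ['frequentie', 'tabel']
```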

17 The Test System (2)
For every 2 consecutive word parts, it was tested whether they can be compounded or not
The results are compared with the original text
False compounding and false identification of non-compounds can be counted this way
The same was done for every 3 consecutive word parts
A threshold was set on the confidence measure: if the confidence measure < threshold, the compound is rejected
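A sketch of the pairwise part of this evaluation, reusing the compound() and confidence sketches above; the data structures and threshold handling are illustrative:

```python
def evaluate_pairs(pairs, gold_is_compound, bwl, qwl, conf_fn, threshold):
    """pairs: (w1, w2) tuples of consecutive word parts from the test text;
    gold_is_compound[i] is True when the original text writes them as one word.
    conf_fn(w1, w2) returns the confidence measure for the pair."""
    false_compounds = false_noncompounds = 0
    for (w1, w2), gold in zip(pairs, gold_is_compound):
        predicted = (compound([w1, w2], bwl, qwl) is not None
                     and conf_fn(w1, w2) >= threshold)   # reject below the threshold
        if predicted and not gold:
            false_compounds += 1        # two separate words falsely glued together
        elif not predicted and gold:
            false_noncompounds += 1     # a real compound was missed
    return false_compounds, false_noncompounds
```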

18 Test Results
3 test texts were used:
– Thuis (dialogue from a soap series): 3415 words, 3.08% OOV, 1.47% compounds
– Aspe (chapter of a novel): 4589 words, 3.77% OOV, 6.08% compounds
– Interview (transcript of spontaneous speech): 4645 words, 0.84% OOV, 2.95% compounds
Most of the OOVs are proper nouns or non-standard Dutch

19 Test Results (2)
Correct identification of non-compounds and compounds:
– depends on the test text
– depends on the parameter thresholds
There is a nearly perfect negative correlation (-0.98) between the optimal confidence threshold and the proportion of compounds in the test text

20 Test Results (3)

21 Conclusions
Identifying compoundability can be done with an accuracy of 94.5 - 98.5%
Lexical coverage can be ensured with OOV rates between 0.8 and 3.8% and a lexicon with a total size of 36.000 entries (BWL + QWL)

22 Conclusions (2)
Capturing already existing compounds by automated compounding proves successful
Capturing newly formed compounds proves much harder: the accuracy is considerably lower
Automated compounding proves to be a useful means for maximizing lexical coverage

