1 Automated Compounding as a means for Maximizing Lexical Coverage
Vincent Vandeghinste
Centrum voor Computerlinguïstiek, K.U. Leuven

2 Maximizing Lexical Coverage
Target: reduction of the number of OOV words
Means:
– accurate content and organization of the recognizer lexicon
– taking care of a number of productive word-formation processes
Evaluation:
– implementation of a test tool
– test results
Conclusions

3 Lexicon: Content & Organization
Starting point: CGN lexicon (570.000 entries)
Reduction to one entry per word form per POS (300.000 entries)
Removal of compounds (160.000 entries)
Selection of the most frequent entries (40.000) => Basic Word List (BWL)
Quasi-Word List (QWL): compounding word parts that do not appear in the BWL
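A minimal sketch of these reduction steps, assuming the CGN lexicon has already been loaded into records with hypothetical fields wordform, pos, freq, is_compound and compound_parts (the real CGN format differs):

```python
def build_word_lists(entries, bwl_size=40_000):
    """Derive a Basic Word List (BWL) and Quasi-Word List (QWL) from lexicon records.
    'entries' is a list of dicts with hypothetical keys: wordform, pos, freq,
    is_compound, compound_parts."""
    # One entry per (wordform, POS): keep the most frequent reading.
    best = {}
    for e in entries:
        key = (e["wordform"], e["pos"])
        if key not in best or e["freq"] > best[key]["freq"]:
            best[key] = e

    # Remove compounds, then keep the most frequent entries as the BWL.
    simplex = [e for e in best.values() if not e["is_compound"]]
    simplex.sort(key=lambda e: e["freq"], reverse=True)
    bwl = {e["wordform"] for e in simplex[:bwl_size]}

    # QWL: compounding word parts that do not occur in the BWL themselves.
    qwl = {part
           for e in best.values() if e["is_compound"]
           for part in e.get("compound_parts", [])
           if part not in bwl}
    return bwl, qwl
```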

4 Lexicon Accuracy
Careful selection of the words in the BWL:
– no compounds
– frequent words
Organization of the lexicon: maximal applicability of the compounding rules through the lexicon split into BWL and QWL

5 Word Formation Processes
Input: a number of word parts that can or cannot be compounded
Hybrid approach: rule-based + statistical filters
Output:
– compound + morpho-syntactic info + confidence measure, or
– no compounding possible with the given word parts

6 Word Formation Processes: Input
From BWL: full words that can be part of a compound or can be words by themselves
From QWL: ‘words’ that can only be part of a compound
2 up to 5 word parts

7 Word Formation Processes: Rules
Making use of rules for word formation, e.g. modifier (N) + head (N) => compound (N)
Input from QWL: the word part is N and can only be a modifier
Input from BWL: the word is looked up in the CGN lexicon; its morpho-syntactic info is used in the rules
Rules operate on 2 word parts
When the input consists of more than 2 word parts, the rules are applied recursively
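A sketch of this rule-based step under simplifying assumptions: the lexicons are reduced to hypothetical dicts mapping word (parts) to a single POS tag, and the rule set contains only the N + N => N example from this slide:

```python
# A compounding rule maps (modifier POS, head POS) to the POS of the compound,
# e.g. modifier (N) + head (N) => compound (N).
RULES = {("N", "N"): "N"}

def compound(parts, bwl, qwl, rules=RULES):
    """Combine 2-5 word parts with 2-part rules; inputs of more than 2 parts
    are handled recursively by folding the leftmost pair into a new modifier."""
    mod_pos = bwl.get(parts[0], qwl.get(parts[0]))  # modifier may come from BWL or QWL
    head_pos = bwl.get(parts[1])                    # head must be a full BWL word
    if mod_pos is None or head_pos is None:
        return None
    new_pos = rules.get((mod_pos, head_pos))
    if new_pos is None:
        return None                                 # no compounding possible
    new_form = parts[0] + parts[1]
    if len(parts) == 2:
        return new_form, new_pos
    # Recursion: the freshly built compound acts as the modifier of the rest.
    return compound([new_form] + list(parts[2:]), {**bwl, new_form: new_pos}, qwl, rules)

# Example with a tiny made-up lexicon:
print(compound(["frequentie", "tabel"], {"frequentie": "N", "tabel": "N"}, {}))
# -> ('frequentietabel', 'N')
```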

8 Word Formation Processes: Statistics
Relative Frequency Threshold Parameter
Confidence Measure of the Compounding Probability

9 Relative Frequency Threshold
Makes use of the relative frequency of a POS for a word form
Makes use of a threshold value (0.05%)
If RF > threshold: the POS is used for this word form
If RF < threshold: the POS is rejected for this word form
Example: RF(bij (PREP)) = 0.999 > T and RF(bij (N)) = 0.0004 < T, so only bij (PREP) is used
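A minimal sketch of this filter; the relative frequencies are passed in directly, as in the bij example:

```python
def filter_pos(rel_freq, threshold=0.0005):
    """Keep only the POS readings of a word form whose relative frequency
    exceeds the threshold (0.05% on the slide); the rest are rejected."""
    return {pos for pos, rf in rel_freq.items() if rf > threshold}

# Example from the slide: 'bij' is kept as a preposition, rejected as a noun.
print(filter_pos({"PREP": 0.999, "N": 0.0004}))   # -> {'PREP'}
```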

10 Confidence Measure of Compounding Probability
Probability estimation of: P(comp(w1=mod, w2=head)) / P(comp(w1=*, w2=head))
where:
– P(comp(w1=mod, w2=head)) is the probability that the two consecutive word parts form a compound rather than being 2 separate words
– P(comp(w1=*, w2=head)) is the probability of w2 being a head, with any modifier

11 Confidence Measure of Compounding Probability (2)
If the compound is found in the frequency list, the ratio is estimated as:
[Fr(comp(w1=mod, w2=head)) / Fr(comp(w1=*, w2=head))] x (1 - D_head)
where:
– Fr(comp(w1=mod, w2=head)) is the frequency of the compound consisting of w1 + w2
– Fr(comp(w1=*, w2=head)) is the frequency of the 2nd word part as a head, with any modifier
– D_head is the discount parameter: the amount of probability mass reserved for words not in the frequency list

12 Confidence Measure of Compounding Probability (3)
The discount parameter is estimated as:
D_head = #diff(mod | head) / Fr(comp(w1=*, w2=head))
where:
– #diff(mod | head) is the number of different modifiers occurring with the given head
– Fr(comp(w1=*, w2=head)) is the frequency of the 2nd word part as a head, with any modifier
(1 - D_head) is the amount of probability mass reserved for words that are found in the frequency list

13 Confidence Measure of Compounding Probability (4)
If the compound is not found in the frequency list, the ratio is estimated as:
D_head x [Fr(comp(w1=mod, w2=*)) / Fr(*)]
where:
– Fr(comp(w1=mod, w2=*)) is the frequency of the 1st word part as a modifier of any head
– Fr(*) is the total frequency of all words in the frequency list (= 79.862.581)
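A sketch that puts the estimates of slides 11-13 together; the dictionaries are hypothetical stand-ins for the counts extracted from the frequency list:

```python
TOTAL_FREQ = 79_862_581   # total frequency of all words in the frequency list

def confidence(mod, head, comp_freq, head_freq, mod_freq, diff_mods, total=TOTAL_FREQ):
    """Estimate P(comp(w1=mod, w2=head)) / P(comp(w1=*, w2=head)).
    comp_freq[(mod, head)]: frequency of the compound mod+head
    head_freq[head]:        frequency of 'head' as a head, with any modifier
    mod_freq[mod]:          frequency of 'mod' as a modifier of any head
    diff_mods[head]:        number of different modifiers seen with 'head'"""
    fr_head = head_freq.get(head, 0)
    if fr_head == 0:
        return 0.0
    # Discount: probability mass reserved for compounds not in the frequency list.
    d_head = diff_mods.get(head, 0) / fr_head
    if (mod, head) in comp_freq:
        # Seen compound: its share of all compounds with this head,
        # scaled by the non-discounted mass (1 - D_head).
        return (comp_freq[(mod, head)] / fr_head) * (1 - d_head)
    # Unseen compound: discounted mass times how often 'mod' occurs as a modifier.
    return d_head * (mod_freq.get(mod, 0) / total)
```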

14 Confidence Measures: Examples
binnen + kijken
– binnenkijken occurs in the frequency list
– Fr(w1=binnen, w2=kijken) = 10
– Fr(w1=*, w2=kijken) = 2188
– #diff(mod | head=kijken) = 21
– (10 / 2188) x (1 - 21/2188) = 0.0045
frequentie + tabel
– frequentietabel does not occur in the frequency list
– Fr(w1=*, w2=tabel) = 141
– #diff(mod | head=tabel) = 17
– Fr(w1=frequentie, w2=*) = 15
– (17 / 141) x (15 / 79.862.581) = 2.26e-8
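Plugging the counts from this slide into the confidence sketch above reproduces both values (the dictionaries are toy stand-ins for the frequency list):

```python
comp_freq = {("binnen", "kijken"): 10}
head_freq = {"kijken": 2188, "tabel": 141}
mod_freq  = {"frequentie": 15}
diff_mods = {"kijken": 21, "tabel": 17}

print(confidence("binnen", "kijken", comp_freq, head_freq, mod_freq, diff_mods))
# -> 0.00452...  (slide: 0.0045)
print(confidence("frequentie", "tabel", comp_freq, head_freq, mod_freq, diff_mods))
# -> 2.26e-08    (slide: 2.26e-8)
```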

15 Evaluation
Test System
Test Results

16 The Test System
Takes a regular text as input
Converts punctuation marks into #
For the test system, a BWL of 35.000 entries was used
Every word is checked against the BWL:
– if the word is not present in the BWL, it is split up into a modifier (QWL or BWL) and a head (BWL)
– no compounding rules are used in the split-up procedure
– if no possible split-up is found, a split-up into 3 parts is tried
If a word can't be found in the BWL and can't be split up, it is classified as an OOV word
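A sketch of this split-up procedure, with the BWL and QWL reduced to hypothetical sets of word forms:

```python
def split_word(word, bwl, qwl, max_parts=3):
    """Split a word that is not in the BWL into a modifier (QWL or BWL) and a
    head (BWL); if no 2-way split exists, try a 3-way split. No compounding
    rules are applied at this stage."""
    for i in range(1, len(word)):
        mod, head = word[:i], word[i:]
        if (mod in bwl or mod in qwl) and head in bwl:
            return [mod, head]
    if max_parts >= 3:
        for i in range(1, len(word) - 1):
            mod = word[:i]
            if mod in bwl or mod in qwl:
                rest = split_word(word[i:], bwl, qwl, max_parts=2)
                if rest:
                    return [mod] + rest
    return None   # cannot be split: the word is classified as OOV

# Example with a tiny made-up lexicon:
print(split_word("frequentietabel", {"frequentie", "tabel"}, set()))
# -> ['frequentie', 'tabel']
```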

17 The Test System (2)
For every 2 consecutive word parts, it was tested whether they can be compounded or not
The results are compared with the original text
False compounding and false identification of non-compounds can be counted this way
The same was done for every 3 consecutive word parts
A threshold was set on the confidence measure: if the confidence measure < threshold, the compound is rejected
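A sketch of the pairwise part of this evaluation, reusing the compound() and confidence sketches above; the data structures and threshold handling are illustrative:

```python
def evaluate_pairs(pairs, gold_is_compound, bwl, qwl, conf_fn, threshold):
    """pairs: (w1, w2) tuples of consecutive word parts from the test text;
    gold_is_compound[i] is True when the original text writes them as one word.
    conf_fn(w1, w2) returns the confidence measure for the pair."""
    false_compounds = false_noncompounds = 0
    for (w1, w2), gold in zip(pairs, gold_is_compound):
        predicted = (compound([w1, w2], bwl, qwl) is not None
                     and conf_fn(w1, w2) >= threshold)   # reject below the threshold
        if predicted and not gold:
            false_compounds += 1        # two separate words falsely glued together
        elif not predicted and gold:
            false_noncompounds += 1     # a real compound was missed
    return false_compounds, false_noncompounds
```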

18 Test Results
3 test texts were used:
– Thuis (dialogue from a soap series): 3415 words, 3.08% OOV, 1.47% compounds
– Aspe (chapter of a novel): 4589 words, 3.77% OOV, 6.08% compounds
– Interview (transcript of spontaneous speech): 4645 words, 0.84% OOV, 2.95% compounds
Most of the OOVs are proper nouns or non-standard Dutch

19 Test Results (2)
Correct identification of non-compounds and compounds:
– depends on the test text
– depends on the parameter thresholds
There is a nearly perfect negative correlation (-0.98) between the optimal confidence threshold and the proportion of compounds in the test text

20 Test Results (3)

21 Conclusions
Identifying compoundability can be done with an accuracy of 94.5 - 98.5%
Lexical coverage can be ensured with OOV rates between 0.8 and 3.8% and a lexicon with a total size of 36.000 entries (BWL + QWL)

22 Conclusions (2)
Capturing already existing compounds by automated compounding proves successful
Capturing newly formed compounds proves much harder: the accuracy is considerably lower
Automated compounding proves to be a useful means for maximizing lexical coverage

