1
Keyword extraction for metadata annotation of Learning Objects
Lothar Lemnitzer, Paola Monachesi
RANLP, Borovets 2007
2
Outline
- A quantitative view on the corpora
- The keyword extractor
- Evaluation of the KWE
3
Creation of a learning objects archive
- Collection of the learning material
- IST domains for the LOs:
  1. Use of computers in education, with sub-domains
  2. Calimera documents (parallel corpus developed in the Calimera FP5 project, http://www.calimera.org/)
- Result: a multilingual, partially parallel, partially comparable, domain-specific corpus
4
Corpus statistics – full corpus
- Measuring the size of the corpora (# of documents, # of tokens)
- Measuring the token / type ratio
- Measuring the type / lemma ratio
5
Language      # of documents   # of tokens
Bulgarian     55               218900
Czech         1343             962103
Dutch         77               505779
English       125              1449658
German        36               265837
Polish        35               299071
Portuguese    29               244702
Romanian      69               484689
6
Language      Token / type   Type / lemma
Bulgarian     9.65           2.78
Czech         18.37          1.86
Dutch         14.18          1.15
English       34.93          2.8 (tbc)
German        8.76           1.38
Polish        7.46           1.78
Portuguese    12.27          1.42
Romanian      12.43          1.54
7
Corpus statistics – full corpus
- The Bulgarian, German and Polish corpora have a very low number of tokens per type (probably a data sparseness problem)
- English has by far the highest ratio
- Czech, Dutch, Portuguese and Romanian are in between
- The type / lemma ratio reflects the richness of the inflectional paradigms
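The two ratios discussed here can be computed directly from a lemmatised corpus. The following is a minimal sketch, not the project's own tooling: the function name `corpus_ratios`, the input format (pairs of word form and lemma) and the lowercasing of forms are assumptions made for illustration.

```python
def corpus_ratios(tagged_tokens):
    """Return (tokens per type, types per lemma).

    tagged_tokens: iterable of (word_form, lemma) pairs.
    Lowercasing before counting is an assumption of this sketch.
    """
    pairs = list(tagged_tokens)
    if not pairs:
        raise ValueError("empty corpus")
    forms = [form.lower() for form, _ in pairs]
    types = set(forms)                              # distinct word forms
    lemmas = {lemma.lower() for _, lemma in pairs}  # distinct lemmas
    return len(forms) / len(types), len(types) / len(lemmas)


sample = [("The", "the"), ("cats", "cat"), ("see", "see"),
          ("the", "the"), ("cat", "cat"), ("sees", "see")]
token_type, type_lemma = corpus_ratios(sample)
print(f"token/type = {token_type:.2f}, type/lemma = {type_lemma:.2f}")
```

A low token / type ratio means that most word forms are seen only a few times, which is the sparseness issue raised for the smaller corpora.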
8
Reflection
- The corpora are heterogeneous with respect to the token / type ratio
- Does the data sparseness of some corpora, compared to others, influence the information extraction process?
- If yes, how can we counter this effect?
- How does the quality of the linguistic annotation influence the extraction task?
9
Corpus statistics – annotated subcorpus
- Measuring the lengths of the annotated documents
- Measuring the distribution of manually marked keywords over documents
- Measuring the share of keyphrases
10
Language      # of annotated documents   Average length (# of tokens)
Bulgarian     55                         3980
Czech         465                        672
Dutch         72                         6912
English       36                         9707
German        34                         8201
Polish        25                         4432
Portuguese    29                         8438
Romanian      41                         3375
11
Language      # of keywords   Average # of keywords per doc.
Bulgarian     3236            77
Czech         1640            3.5
Dutch         1706            24
English       1174            26
German        1344            39.5
Polish        1033            41
Portuguese    997             34
Romanian      2555            62
12
Language      Keyphrases
Bulgarian     43 %
Czech         27 %
Dutch         25 %
English       62 %
German        10 %
Polish        67 %
Portuguese    14 %
Romanian      30 %
13
Reflection
- Did the human annotators annotate keywords or domain terms?
- Was the task adequately contextualised?
- What do the varying shares of keyphrases tell us?
14
Keyword extraction
- Good keywords have a typical, non-random distribution in and across documents
- Keywords tend to appear more often at certain places in texts (headings etc.)
- Keywords are often highlighted / emphasised by authors
- Keywords express / represent the topic(s) of a text
15
Modelling Keywordiness
- Linguistic filtering of KW candidates, based on part of speech and morphology
- Distributional measures are used to identify unevenly distributed words
  - TFIDF
  - (Adjusted) RIDF
- Knowledge of text structure is used to identify salient regions (e.g., headings)
- Layout features of texts are used to identify emphasised words and weight them higher
- Finding chains of semantically related words
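To make the distributional measures concrete, here is a minimal sketch of TFIDF and of plain residual IDF (RIDF) in its common textbook form: observed IDF minus the IDF expected if the term were Poisson-distributed over documents. The slides do not spell out the "adjusted" variant, so it is not reproduced; the function name and the input format (documents as lists of tokens) are assumptions.

```python
import math
from collections import Counter


def distribution_scores(docs, target_index):
    """Score the terms of docs[target_index] by TFIDF and RIDF.

    docs: list of documents, each a list of (already filtered) tokens.
    Returns {term: {"tfidf": float, "ridf": float}}.
    """
    n_docs = len(docs)
    doc_freq = Counter()   # number of documents containing the term
    coll_freq = Counter()  # total occurrences in the collection
    for doc in docs:
        coll_freq.update(doc)
        doc_freq.update(set(doc))

    scores = {}
    for term, tf in Counter(docs[target_index]).items():
        idf = math.log2(n_docs / doc_freq[term])
        # IDF predicted by a Poisson model with rate coll_freq / n_docs
        expected_idf = -math.log2(1.0 - math.exp(-coll_freq[term] / n_docs))
        scores[term] = {"tfidf": tf * idf, "ridf": idf - expected_idf}
    return scores


docs = [["kw", "kw", "kw", "the"], ["the", "a"], ["the", "b"], ["the", "c"]]
for term, s in distribution_scores(docs, 0).items():
    print(term, round(s["tfidf"], 3), round(s["ridf"], 3))
```

A term concentrated in few documents gets a higher RIDF than a term spread evenly over the collection, which is the "unevenly distributed" property the measures are meant to capture.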
16
Challenges
- Treating multi-word keywords (= keyphrases)
- Assigning a combined weight which takes into account all the aforementioned factors
- Multilinguality: finding good settings for all languages, balancing language-dependent and language-independent features
17
Treatment of keyphrases
- Keyphrases have to be restricted with respect to length (max. 3 words) and frequency (min. 2 occurrences)
- Keyphrase patterns must be restricted with respect to linguistic categories ("style of learning" is acceptable; "of learning styles" is not)
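The length, frequency and pattern restrictions can be sketched as a filter over part-of-speech-tagged n-grams. This is an illustration under assumptions: the tag names and the rule "a phrase may not start or end with a function word" are hypothetical stand-ins for the language-specific patterns, which the slides do not list.

```python
from collections import Counter

MAX_LEN = 3    # maximal keyphrase length in words (from the slide)
MIN_FREQ = 2   # minimal number of occurrences (from the slide)
# Hypothetical tag set: phrases may not start or end with these categories.
BAD_EDGE_TAGS = {"PREP", "DET", "CONJ"}


def keyphrase_candidates(tagged_sentences):
    """Return {phrase: count} for multi-word candidates passing all filters.

    tagged_sentences: list of sentences, each a list of (word, tag) pairs.
    """
    counts = Counter()
    for sent in tagged_sentences:
        for n in range(2, MAX_LEN + 1):
            for i in range(len(sent) - n + 1):
                gram = sent[i:i + n]
                if gram[0][1] in BAD_EDGE_TAGS or gram[-1][1] in BAD_EDGE_TAGS:
                    continue  # e.g. rejects "of learning styles"
                counts[" ".join(word.lower() for word, _ in gram)] += 1
    return {phrase: c for phrase, c in counts.items() if c >= MIN_FREQ}


sentences = [
    [("style", "NOUN"), ("of", "PREP"), ("learning", "NOUN")],
    [("the", "DET"), ("style", "NOUN"), ("of", "PREP"),
     ("learning", "NOUN"), ("styles", "NOUN")],
]
print(keyphrase_candidates(sentences))
```

In this toy input only "style of learning" survives: it occurs twice and has content words at both edges, while "learning styles" occurs only once.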
18
KWE Evaluation 1
- Human annotators marked n keywords in document d
- The first n choices of the KWE for document d are extracted
- The overlap between both sets is measured
- Partial matches are measured as well
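The overlap measurement can be sketched as follows. The slides do not say how much credit a partial match receives or how a partial match is defined, so the weight of 0.5 and the "shares at least one word" criterion below are assumptions, as are the function names.

```python
PARTIAL_WEIGHT = 0.5  # assumed credit for a partial match


def match_score(candidate, gold):
    """1.0 for an exact match, PARTIAL_WEIGHT if the phrases share a word."""
    if candidate == gold:
        return 1.0
    shared = set(candidate.split()) & set(gold.split())
    return PARTIAL_WEIGHT if shared else 0.0


def evaluate(ranked_extracted, gold_keywords):
    """Compare the first n extracted keywords with the n manual keywords.

    Returns (precision, recall, f_measure).
    """
    n = len(gold_keywords)
    top_n = ranked_extracted[:n]
    if not top_n or not gold_keywords:
        return 0.0, 0.0, 0.0
    precision = sum(max(match_score(c, g) for g in gold_keywords)
                    for c in top_n) / len(top_n)
    recall = sum(max(match_score(c, g) for c in top_n)
                 for g in gold_keywords) / n
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


gold = ["learning object", "metadata", "keyword extraction"]
ranked = ["metadata", "keyword", "corpus", "learning object"]
print(evaluate(ranked, gold))
```

Because only the first n system keywords are considered, "learning object" at rank 4 earns no credit in this example, while "keyword" earns partial credit against "keyword extraction".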
19
KWE Evaluation – Overlap
Settings:
- All three statistics have been tested
- Maximal keyphrase length set to 3
20
Language      Best method     F-measure
Bulgarian     TFIDF/ADRIDF    0.25
Czech         TFIDF/ADRIDF    0.18
Dutch         TFIDF           0.29
English       ADRIDF          0.33
German        TFIDF           0.16
Polish        ADRIDF          0.26
Portuguese    TFIDF           0.22
Romanian      TFIDF/ADRIDF    0.15
21
Reflection
- Is it correct to use the human annotation as the "gold standard"?
- Is it correct to give a weight to partial matches?
22
KWE Evaluation – IAA
- Participants read a text (Calimera "Multimedia")
- Participants assign keywords to that text (ideally not more than 15)
- The KWE produces keywords for the text
- IAA is measured among the human annotators
- IAA is measured between the KWE and the human annotators
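The slides do not name the agreement coefficient behind the IAA figures. Purely as an illustration of the two comparisons (humans with each other, KWE with the humans), the sketch below uses mean pairwise Jaccard overlap of keyword sets; this choice of measure and all function names are assumptions, not the authors' method.

```python
from itertools import combinations


def jaccard(a, b):
    """Set overlap in [0, 1]; two empty sets count as full agreement."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def agreement_among_humans(keyword_sets):
    """Mean Jaccard overlap over all pairs of human annotators."""
    pairs = list(combinations(keyword_sets, 2))
    if not pairs:
        raise ValueError("need at least two annotators")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def agreement_system_vs_humans(system_keywords, keyword_sets):
    """Mean Jaccard overlap between the extractor and each human."""
    if not keyword_sets:
        raise ValueError("need at least one annotator")
    return sum(jaccard(system_keywords, h) for h in keyword_sets) / len(keyword_sets)


humans = [{"metadata", "corpus"}, {"corpus", "keyword"}, {"metadata", "corpus"}]
print(agreement_among_humans(humans))
print(agreement_system_vs_humans({"corpus"}, humans))
```

A chance-corrected coefficient would be a more rigorous choice; the sketch only shows how the two figures per language relate to the same set of annotations.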
23
Language      IAA human annotators   IAA of KWE with best settings
Bulgarian     0.10                   0.37
Czech         0.23                   0.39
Dutch         0.16                   0.28
English       0.09                   0.43
German        0.25                   0.23
Polish        0.28                   0.20
Portuguese    0.18                   0.19
Romanian      0.20                   0.26
24
KWE Evaluation – Judging adequacy
- Participants read a text (Calimera "Multimedia")
- Participants see 20 keywords generated by the KWE and rate them
- Scale: 1 to 4 (excellent to not acceptable); 5 = not sure
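A possible way to aggregate such ratings is sketched below. Two points are assumptions rather than facts from the slides: "not sure" votes (5) are simply left out of the means, and the summary counts how many keywords reach a mean rating of at most 2.0 or 2.5. The function names are hypothetical.

```python
NOT_SURE = 5  # rating value meaning "not sure"


def keyword_means(ratings):
    """Mean rating per keyword, ignoring NOT_SURE votes (an assumption).

    ratings: {keyword: [rating given by each participant]}
    Keywords with only NOT_SURE votes are dropped.
    """
    means = {}
    for keyword, votes in ratings.items():
        valid = [v for v in votes if v != NOT_SURE]
        if valid:
            means[keyword] = sum(valid) / len(valid)
    return means


def summarise(ratings):
    """Overall average plus counts of keywords rated <= 2.0 and <= 2.5."""
    values = list(keyword_means(ratings).values())
    if not values:
        raise ValueError("no usable ratings")
    return {
        "average": sum(values) / len(values),
        "n_le_2.0": sum(1 for v in values if v <= 2.0),
        "n_le_2.5": sum(1 for v in values if v <= 2.5),
    }


ratings = {"metadata": [1, 2, 5], "corpus": [3, 2, 2], "page": [4, 4, 3]}
print(summarise(ratings))
```

Lower values are better on this scale, so a keyword list with many items at or below 2.0 is one the participants largely found good.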
25
Language      Average   <= 2.0   <= 2.5
Bulgarian     2.21      9        15
Czech         2.22      8        13
Dutch         1.93      11       14
English       2.15      10       14
German        2.06      12       15
Polish        1.95      13       17
Portuguese    2.34      6        11
Romanian      2.14      15       16
26
Language      20 kw   First 5 kw   First 10 kw
Bulgarian     2.21    2.54         2.12
Czech         2.22    1.96 (column unclear)
Dutch         1.93    1.68         1.64
English       2.15    2.52         2.22
German        2.06    1.96 (column unclear)
Polish        1.95    2.06         2.1
Portuguese    2.34    2.08         1.94
Romanian      2.14    1.8          2.06
27
Language      New keywords suggested   Average per participant
Bulgarian     21                       5.25
Czech         None
Dutch         12                       2.4
English       22                       4.4
German        None
Polish        45                       7.5
Portuguese    7                        1.4
Romanian      None
28
Reflection
- How should we treat the "not sure" decisions (quite substantial for a few judges)?
- What do the added keywords tell us?
- Where are they in the ordered list of recommendations?
29
Conclusions
- Evaluating a KWE in a multilingual environment and with diverse corpora is more difficult than expected
- We now have the facilities for a controlled development / improvement of the KWE
- Quantitative evaluation has to be accompanied by validation of the tool