1
Metadata generation and glossary creation in eLearning
Lothar Lemnitzer
Review meeting, Zürich, 25 January 2008
2
Outline
- Demonstration of the functionalities
- Where we stand
- Evaluation of the tools
- Consequences for the development of the tools in the final phase
3
Demo
We simulate a tutor who adds a learning object and then generates and edits additional data.
4
Where we stand (1)
Achievements of the first year of the project:
- Annotated corpora of learning objects
- Stand-alone prototype of the keyword extractor (KWE)
- Stand-alone prototype of the glossary candidate detector (GCD)
5
Where we stand (2)
Achievements of the second year of the project:
- Quantitative evaluation of the corpora and tools
- Validation of the tools in user-centred usage scenarios for all languages
- Further development of the tools in response to the evaluation results
6
Evaluation – rationale
Quantitative evaluation is needed to:
- inform the further development of the tools (formative)
- find the optimal settings / parameters for each language (summative)
7
Evaluation (1)
Evaluation is applied to:
- the corpora of learning objects
- the keyword extractor
- the glossary candidate detector
In the following, I will focus on the tool evaluation.
8
Evaluation (2)
Evaluation of the tools comprises:
1. measuring recall and precision against the manual annotation
2. measuring agreement on each task between different annotators
3. measuring acceptance of keywords / definitions (rated on a scale)
9
KWE Evaluation – step 1
- One human annotator marked n keywords in document d
- The first n choices of the KWE for document d are extracted
- Measure the overlap between both sets; partial matches are also counted (see the sketch below)
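As a minimal sketch of this computation (not the project's actual implementation): the half-credit rule for partial matches and the toy keyword lists below are illustrative assumptions. The F-measure corresponds to the values reported on the next slide.

```python
# Sketch: overlap between human keywords and the top-n KWE keywords.
# The partial-match weight (0.5 for keywords sharing a token) is an
# illustrative assumption, not the project's documented scoring rule.

def match_score(gold: set[str], system: set[str]) -> float:
    """Count exact matches fully and token-level partial matches at half weight."""
    exact = gold & system
    score = float(len(exact))
    for g in gold - exact:
        g_tokens = set(g.lower().split())
        # Partial match: the system proposed a keyword sharing a token with g.
        if any(g_tokens & set(s.lower().split()) for s in system - exact):
            score += 0.5
    return score

def precision_recall_f(gold: set[str], system: set[str]) -> tuple[float, float, float]:
    s = match_score(gold, system)
    p = s / len(system) if system else 0.0
    r = s / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

gold = {"keyword extraction", "tf-idf", "learning object"}
system = {"keyword extraction", "learning objects", "metadata"}
print(precision_recall_f(gold, system))  # (0.5, 0.5, 0.5)
```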
10
Language     Best method     F-measure
Bulgarian    TFIDF/ADRIDF    0.25
Czech        TFIDF/ADRIDF    0.18
Dutch        TFIDF           0.29
English      ADRIDF          0.33
German       TFIDF           0.16
Polish       ADRIDF          0.26
Portuguese   TFIDF           0.22
Romanian     TFIDF/ADRIDF    0.15
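TFIDF in this table is standard term frequency × inverse document frequency; ADRIDF (as I read the name, an adjusted residual-IDF measure) is the project's variant and is not reproduced here. A minimal TF-IDF ranking sketch over a toy tokenized corpus:

```python
import math
from collections import Counter

def tfidf_keywords(docs: list[list[str]], doc_index: int, n: int = 10):
    """Rank the terms of one document by TF-IDF against the whole collection.
    docs: tokenized documents; returns the top-n (term, score) pairs."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # document frequency per term
    tf = Counter(docs[doc_index])    # term frequency in the target document
    num_docs = len(docs)
    scores = {
        term: (freq / len(docs[doc_index])) * math.log(num_docs / df[term])
        for term, freq in tf.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]

docs = [
    "the keyword extractor ranks terms of a learning object".split(),
    "a glossary candidate detector finds defining contexts".split(),
    "the corpus contains annotated learning objects".split(),
]
print(tfidf_keywords(docs, 0, n=5))
```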
11
KWE Evaluation – step 2
- Measure Inter-Annotator Agreement (IAA)
- Participants read a text (Calimera "Multimedia")
- Participants assign keywords to that text (ideally not more than 15)
- The KWE produces keywords for the same text
12
KWE Evaluation – step 2
1. Agreement is measured between human annotators
2. Agreement is measured between the KWE and the human annotators
We have tested two measures / approaches:
- kappa according to Bruce / Wiebe
- AC1, an alternative agreement measure suggested by Debra Haley at the OU, based on Gwet
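For the binary decision "candidate is / is not a keyword", the two measures can be sketched as below. The kappa shown is the standard two-rater chance-corrected form, which may differ in detail from the Bruce / Wiebe formulation; the AC1 is Gwet's two-rater, two-category case.

```python
def kappa(a: list[int], b: list[int]) -> float:
    """Chance-corrected agreement for two annotators on binary labels."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n   # observed agreement
    pa, pb = sum(a) / n, sum(b) / n              # positive rates per annotator
    pe = pa * pb + (1 - pa) * (1 - pb)           # chance agreement from marginals
    return (po - pe) / (1 - pe)

def ac1(a: list[int], b: list[int]) -> float:
    """Gwet's AC1 for two annotators, two categories."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n
    pi = (sum(a) / n + sum(b) / n) / 2           # mean prevalence
    pe = 2 * pi * (1 - pi)                       # AC1 chance agreement
    return (po - pe) / (1 - pe)

# 1 = marked as keyword, 0 = not marked, one entry per candidate term
ann1 = [1, 1, 0, 0, 1, 0, 1, 0]
ann2 = [1, 0, 0, 0, 1, 0, 1, 1]
print(kappa(ann1, ann2), ac1(ann1, ann2))  # 0.5 0.5 on this toy data
```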
13
Language     IAA human annotators   IAA of KWE with best settings
Bulgarian    0.63                   0.99
Czech        0.71                   0.78
Dutch        0.67                   0.72
English      0.62                   0.82
German       0.64                   0.63
Polish       0.63                   0.67
Portuguese   0.58                   0.67
Romanian     0.59                   0.61
14
KWE Evaluation – step 3
- Humans judge the adequacy of keywords
- Participants read a text (Calimera "Multimedia")
- Participants see 20 keywords generated by the KWE and rate them
- Scale: 1 – 4 (excellent – not acceptable); 5 = not sure
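The averages in the following table could be computed along these lines; treating "5 = not sure" as a missing value is an assumption about the procedure, not something the slides state.

```python
def average_rating(ratings: list[int]) -> float:
    """Mean of the 1-4 adequacy ratings; '5 = not sure' is treated as missing."""
    valid = [r for r in ratings if 1 <= r <= 4]
    return sum(valid) / len(valid) if valid else float("nan")

# One participant's ratings for a set of generated keywords (toy data)
print(average_rating([1, 2, 2, 4, 5, 3, 1, 2]))  # the 5 is ignored -> ~2.14
```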
15
Language     20 kw   First 5 kw   First 10 kw
Bulgarian    2.21    2.54         2.12
Czech        2.22    1.96         -
Dutch        1.93    1.68         1.64
English      2.15    2.52         2.22
German       2.06    1.96         -
Polish       1.95    2.06         2.1
Portuguese   2.34    2.08         1.94
Romanian     2.14    1.8          2.06
16
GCD Evaluation – step 1
- A human annotator marked definitions in document d
- The GCD extracts defining contexts from the same document d
- Measure the overlap between both sets
- Overlap is measured at the sentence level; partial overlap counts
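A sketch of this scoring, assuming definitions are represented as sets of sentence indices; the representation and the "at least one shared sentence" reading of partial overlap are illustrative assumptions.

```python
def gcd_recall_precision(gold: list[set[int]], extracted: list[set[int]]):
    """Sentence-level overlap: partial overlap (>= 1 shared sentence) counts."""
    found = sum(any(g & e for e in extracted) for g in gold)       # gold defs found
    correct = sum(any(e & g for g in gold) for e in extracted)     # extractions that hit
    recall = found / len(gold) if gold else 0.0
    precision = correct / len(extracted) if extracted else 0.0
    return recall, precision

# Definitions as sets of sentence indices within document d (toy data)
gold = [{3, 4}, {10}, {17, 18}]
extracted = [{4}, {10, 11}, {25}]
print(gcd_recall_precision(gold, extracted))  # (0.666..., 0.666...)
```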
17
Is-definitions:

Language     Recall   Precision
Bulgarian    0.64     0.18
Czech        0.48     0.29
Dutch        0.92     0.21
English      0.58     0.17
German       0.55     0.37
Polish       0.74     0.22
Portuguese   0.69     0.30
Romanian     1.0      0.53
18
GCD Evaluation – step 2
- Measure Inter-Annotator Agreement
- Experiments run for Polish and Dutch
- A prevalence-adjusted version of kappa was used as the measure
- Polish: 0.42; Dutch: 0.44
- IAA is rather low for this task
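The slides do not give the exact formula. If the measure is the common prevalence- and bias-adjusted kappa (PABAK), it reduces to a function of the observed agreement alone; the sketch below rests on that assumption.

```python
def pabak(a: list[int], b: list[int]) -> float:
    """Prevalence- and bias-adjusted kappa for two annotators, two categories:
    PABAK = 2 * observed_agreement - 1 (assumption: this is the variant used)."""
    po = sum(x == y for x, y in zip(a, b)) / len(a)
    return 2 * po - 1

# 1 = sentence marked as part of a definition, 0 = not marked
print(pabak([1, 0, 0, 1, 0, 0], [1, 0, 1, 1, 0, 0]))  # 0.666...
```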
19
GCD Evaluation – step 3
- Judging the quality of extracted definitions
- Participants read a text
- Participants rate the quality of the definitions the GCD extracted from that text
- Scale: 1 – 4 (excellent – not acceptable); 5 = not sure
20
Language     # defin.   # testers   Av. value
Bulgarian    25         7           2.7
Czech        24         6           3.1
Dutch        14         6           2.8
English      10         4           3.3
German       5          5           2.1
Polish       11         5           2.7
Portuguese   36         6           2.2
Romanian     9          7           3.0
21
GCD Evaluation – step 3
Further findings:
- relatively high variance (many '1' and '4' ratings)
- disagreement between users about the quality of individual definitions
22
Individual user feedback – KWE
- The quality of the generated keywords remains an issue
- There is variance in the responses from the different language groups
- We suspect a correlation between the users' language and their satisfaction
- The performance of the KWE relies on language-specific settings, which we have to investigate further
23
Individual user feedback – GCD
- Not all the suggested definitions are real definitions
- The terms are fine, but the definitions cited are often not what would be expected
- Some terms proposed in the glossary did not make any sense
- The ability to see the context in which a definition was found is useful
24
Consequences – KWE
- Use non-distributional information to rank keywords (layout, lexical chains); see the sketch below
- Present the first 10 keywords to the user, more keywords on demand
- For keyphrases, present the most frequent attested form
- Users can add their own keywords
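A naive sketch of chain-based ranking support: words are chained when they repeat or share a WordNet synset, and a keyword's score can be boosted by the size of its chain. The nltk/WordNet setup and the greedy chaining rule are assumptions for illustration, not the KWE's actual chainer.

```python
# Requires: pip install nltk; then nltk.download("wordnet") once.
from nltk.corpus import wordnet as wn

def lexical_chains(tokens: list[str]) -> list[set[str]]:
    """Greedy chaining: a word joins a chain if the chain already contains
    the same word or a word sharing a WordNet synset with it."""
    chains: list[set[str]] = []
    for tok in tokens:
        syns = set(wn.synsets(tok))
        for chain in chains:
            if tok in chain or any(syns & set(wn.synsets(w)) for w in chain):
                chain.add(tok)
                break
        else:
            chains.append({tok})
    return chains

def chain_boost(term: str, chains: list[set[str]]) -> int:
    """Boost factor: size of the chain the term belongs to (1 if unchained)."""
    return max((len(c) for c in chains if term in c), default=1)

tokens = ["metadata", "keyword", "glossary", "definition", "keywords"]
chains = lexical_chains(tokens)
print(chains, chain_boost("glossary", chains))
```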
25
Consequences – GCD
- Split definitions into types and tackle the most important types
- Use machine learning alongside local grammars (a toy local-grammar sketch follows)
- Look into the part of the grammars that extracts the defined term
- Users can add their own definitions
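As an illustration of the local-grammar side, here is a toy pattern for is-definitions over raw text. The project's local grammars operate on linguistically annotated input, so this regular expression is only a stand-in for the idea.

```python
import re

# Toy local grammar for is-definitions: "<Term> is a/an/the <description>."
# The real grammars work on annotated input; this regex is a rough stand-in.
IS_DEF = re.compile(
    r"(?P<term>[A-Z][\w-]*(?:\s+[\w-]+){0,3})\s+is\s+(?:a|an|the)\s+(?P<gloss>[^.]+)\."
)

def extract_definitions(text: str) -> list[tuple[str, str]]:
    """Return (defined term, defining context) pairs matched by the grammar."""
    return [(m.group("term"), m.group("gloss")) for m in IS_DEF.finditer(text)]

text = ("Metadata is a set of data that describes another resource. "
        "We met yesterday. A glossary is a list of terms with definitions.")
print(extract_definitions(text))
```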
26
Plans for the final phase
- KWE: work with lexical chains
- GCD: extend the ML experiments
- Finalize the documentation of the tools
27
Validation
User scenarios with the NLP tools embedded:
1. A content provider adds keywords and a glossary for a new learning object
2. A student uses keywords and definitions extracted from a learning object to prepare a presentation of the content of that learning object
28
Validation
3. Students use keywords and definitions extracted from a learning object to prepare a quiz / exam about the content of that learning object
29
Validation
We want to get feedback about:
- the users' general attitude towards the tools
- the users' satisfaction with the results obtained by the tools in the particular situation of use (scenario)
30
User feedback
- Participants appreciate the option to add their own data
- Participants found it easy to use the functions
31
Plans for the next phase
- Improve the precision of the extraction results:
  - KWE: implement a lexical chainer
  - GCD: use machine learning in combination with the local grammars, or substituting these grammars
- Finalize the documentation of the tools
32
Corpus statistics – full corpus
- Measuring the size of the corpora (# of documents, tokens)
- Measuring the token / type ratio
- Measuring the type / lemma ratio (both ratios sketched below)
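A sketch of the two ratios; the (token, lemma) pair input is an assumed stand-in for the annotated corpus format.

```python
def corpus_ratios(tokens_with_lemmas: list[tuple[str, str]]):
    """Token/type and type/lemma ratios over (token, lemma) pairs.
    Types are distinct lower-cased token forms."""
    tokens = [t.lower() for t, _ in tokens_with_lemmas]
    types = set(tokens)
    lemmas = {l.lower() for _, l in tokens_with_lemmas}
    return len(tokens) / len(types), len(types) / len(lemmas)

pairs = [("Learning", "learning"), ("objects", "object"),
         ("contain", "contain"), ("learning", "learning"),
         ("object", "object")]
print(corpus_ratios(pairs))  # (1.25, 1.333...)
```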
33
Language     # of documents   # of tokens
Bulgarian    55               218900
Czech        134              3962103
Dutch        77               505779
English      125              1449658
German       36               265837
Polish       35               299071
Portuguese   29               244702
Romanian     69               484689
34
Language     Token / type   Types / lemma
Bulgarian    9.65           2.78
Czech        18.37          1.86
Dutch        14.18          1.15
English      34.93          2.8 (tbc)
German       8.76           1.38
Polish       7.46           1.78
Portuguese   12.27          1.42
Romanian     12.43          1.54
35
Corpus statistics – full corpus
- The Bulgarian, German and Polish corpora have a very low number of tokens per type (probably problems with data sparseness)
- English has by far the highest ratio
- Czech, Dutch, Portuguese and Romanian are in between
- The type / lemma ratio reflects the richness of the inflectional paradigms
36
To do
- Please check / verify these numbers
- Report, for the M24 deliverable, on improvements / re-analysis of the corpora (I am aware of such activities for Bulgarian, German, and English)
37
Corpus statistics – annotated subcorpus
- Measuring the lengths of the annotated documents
- Measuring the distribution of manually marked keywords over the documents
- Measuring the share of keyphrases (a one-function sketch follows)
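The share of keyphrases could be measured as the fraction of manually assigned keywords that consist of more than one word; the whitespace test below is the assumed multi-word criterion.

```python
def keyphrase_share(keywords: list[str]) -> float:
    """Fraction of keyword entries that are multi-word keyphrases."""
    return sum(len(kw.split()) > 1 for kw in keywords) / len(keywords)

print(keyphrase_share(["metadata", "keyword extraction", "glossary",
                       "learning object"]))  # 0.5
```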
38
Language     # of annotated documents   Average length (# of tokens)
Bulgarian    55                         3980
Czech        46                         5672
Dutch        72                         6912
English      36                         9707
German       34                         8201
Polish       25                         4432
Portuguese   29                         8438
Romanian     41                         3375
39
Language     # of keywords   Average # of keywords per doc.
Bulgarian    3236            77
Czech        1640            3.5
Dutch        1706            24
English      1174            26
German       1344            39.5
Polish       1033            41
Portuguese   997             34
Romanian     2555            62
40
Keyphrases
Share of keyphrases among the manually assigned keywords:

Language     Keyphrases
Bulgarian    43 %
Czech        27 %
Dutch        25 %
English      62 %
German       10 %
Polish       67 %
Portuguese   14 %
Romanian     30 %