Supporting e-learning with automatic glossary extraction Experiments with Portuguese Rosa Del Gaudio, António Branco RANLP, Borovets 2007
Presentation Plan ● LT4eL project ● ILIAS ● Corpus ● Tool ● Grammars ● Copula ● Other Verbs ● Punctuation ● Results ● Conclusion
LT4eL ● Improve retrieval and accessibility of learning objects (LO) in learning management systems ● Employ language technology resources and tools for the semi-automatic generation of descriptive metadata ● Develop new functionalities such as a keyword extractor, a glossary candidate detector, and semantic search, tuned for the various languages addressed in the project (Bulgarian, Czech, Dutch, English, German, Maltese, Polish, Portuguese, Romanian)
ILIAS
Objective ● Build a glossary automatically to support the e-learning process. In practice, this means extracting definitions from unstructured text (scientific papers, encyclopedias, web pages) ● Better access to information for students ● Accelerate the work of the tutor
ILIAS: Glossary Candidate Detector
The Corpus ● Size measured in tokens ● Document types: tutorials, PhD theses, scientific papers ● 3 domains evenly represented: e-learning, technology for non-experts, Calimera
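XML format ● Example sentence: Intranet é uma rede desenvolvida para processamento de informações em uma empresa ou organização. (An intranet is a network developed for processing information in a company or organization.)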
LxTransduce ● Input: simple text or XML ● Regular expressions ● Substitution and markup ● Output: the same file with changes ● Match tree using elements ● Quick ● Unicode friendly ● Freeware ● Easy to integrate in other tools (Java)
Rules in lxtransduce
First development phase ● Less than 50% of the corpus ● Focus on the verb ● Precision: correct automatic / all automatic ● Recall: correct automatic / manually marked ● F2 = 3 * (precision * recall) / (2 * precision + recall) ● [Table: Precision, Recall and F2 for each grammar]
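The three scores on this slide can be illustrated with a small sketch. It assumes the F2 formula is meant to be grouped as 3 * (P * R) / (2 * P + R), and that evaluation is done over sets of sentence identifiers; the function name and the sample identifiers are hypothetical, not taken from the project.

```python
def precision_recall_f2(manual, automatic):
    """Sentence-level scores from two sets of sentence ids.

    manual:    ids of sentences marked as definitions by annotators
    automatic: ids of sentences marked as definitions by a grammar
    """
    correct = manual & automatic  # automatically found and manually marked
    precision = len(correct) / len(automatic) if automatic else 0.0
    recall = len(correct) / len(manual) if manual else 0.0
    denominator = 2 * precision + recall
    # Assumed grouping of the slide's formula: 3 * (P * R) / (2 * P + R)
    f2 = 3 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f2


# Hypothetical sentence ids, for illustration only.
manual = {1, 2, 3, 4}
automatic = {2, 3, 4, 5, 6, 7}
p, r, f2 = precision_recall_f2(manual, automatic)
print(f"precision={p:.2f} recall={r:.2f} F2={f2:.2f}")
```

With this weighting, a grammar that finds most definitions is penalised less for false positives than it would be under a balanced F-measure.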
Second development phase ● 75% of the corpus for development ● 25% of the corpus for testing ● Specific grammar/rules for each type
Copula baseline grammar ● Verb “to be”, third person singular or plural, present indicative
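The idea behind the baseline can be sketched in a few lines of Python. This is only an approximation: the project's grammar is written for lxtransduce and works on annotated XML, whereas the sketch below matches the surface forms "é" and "são" on raw text, which is an assumption made for illustration. The function name is hypothetical.

```python
import re

# Third person singular ("é") and plural ("são") present indicative of "ser".
# Matching surface forms on raw text is a simplification of the real grammar.
COPULA = re.compile(r"\b(?:é|são)\b", re.IGNORECASE)


def is_definition_candidate(sentence):
    """Flag a sentence as a candidate if it contains a copula form."""
    return COPULA.search(sentence) is not None


example = ("Intranet é uma rede desenvolvida para processamento de "
           "informações em uma empresa ou organização.")
print(is_definition_candidate(example))
```

Because almost any sentence with "é" or "são" is flagged, a baseline of this kind is expected to favour recall over precision.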
Copula baseline results ● Sentence-level results ● Problem with precision
Copula Grammar
Rules for is_type ● [lxtransduce rule listing; only fragments survive: a query testing for ’V’ and ’3’]
Comparing Results ● Include the patterns that were excluded ● Try to gather the syntactic patterns of non-definitions and compare them with the syntactic patterns of definitions
Other_Verbs grammar ● Collect verbs in a lexicon ● Three different categories: reflexive, active, passive ● 22 different verbs
Results for verb_type ● Analyze each verb separately, as with is_type ● Richer syntactic patterns
Punctuation Grammar ● Preliminary work ● Definitions introduced by a colon (the most frequent pattern)
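A colon-introduced definition could be detected roughly as follows. This is a sketch under stated assumptions, not the project's lxtransduce grammar: it works on raw text, the limit of five words for the term is arbitrary, and the function name and example sentence are hypothetical.

```python
import re

# A short term (at most five words, an assumed limit), a colon, then the
# defining text. The colon must be followed by whitespace, so "10:30" fails.
COLON_DEFINITION = re.compile(
    r"^\s*(?P<term>\w[\w-]*(?:\s+[\w-]+){0,4})\s*:\s+(?P<definition>\S.*)$"
)


def split_colon_definition(sentence):
    """Return (term, definition) if the sentence fits the pattern, else None."""
    match = COLON_DEFINITION.match(sentence)
    if match is None:
        return None
    return match.group("term"), match.group("definition").strip()


# Hypothetical example sentence, not taken from the corpus.
print(split_colon_definition("Intranet: rede interna de uma organização."))
```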
All-in-one ● Combination of the previous grammars ● The definition type is not taken into account when calculating precision and recall
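One way to read this slide is that a sentence counts as found if any grammar flags it, whatever type label it receives. The sketch below assumes each grammar's output can be represented as a mapping from sentence id to type label; that representation, the function names, and the sample data are hypothetical.

```python
def combine(*grammar_outputs):
    """Union of the sentence ids flagged by any grammar; type labels are dropped."""
    flagged = set()
    for output in grammar_outputs:
        flagged.update(output)  # iterating a dict yields its keys (sentence ids)
    return flagged


def evaluate_ignoring_type(manual, automatic_ids):
    """Precision and recall over sentence ids only, ignoring definition type."""
    manual_ids = set(manual)
    correct = manual_ids & automatic_ids
    precision = len(correct) / len(automatic_ids) if automatic_ids else 0.0
    recall = len(correct) / len(manual_ids) if manual_ids else 0.0
    return precision, recall


# Hypothetical outputs: sentence id -> type assigned by each grammar.
is_out = {1: "is_type", 4: "is_type"}
verb_out = {2: "verb_type", 4: "verb_type"}
punc_out = {5: "punc_type"}
# Hypothetical manual annotation: sentence 2 is an is_type definition, but it
# still counts as correct although only the verb grammar found it.
manual = {1: "is_type", 2: "is_type", 3: "verb_type"}

flagged = combine(is_out, verb_out, punc_out)
precision, recall = evaluate_ignoring_type(manual, flagged)
print(sorted(flagged), precision, recall)
```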
Conclusions and Future Work ● Overall results: Recall 86%, Precision 14% ● Differences among domains: the style of a document influences the results ● Improve the rules for verb_type and punc_type ● Combine with other techniques such as machine learning (ML)