Keyword extraction for metadata annotation of Learning Objects Lothar Lemnitzer, Paola Monachesi RANLP, Borovets 2007.

Outline
- A quantitative view on the corpora
- The keyword extractor
- Evaluation of the KWE

Creation of a learning objects archive
Collection of the learning material. IST domains for the LOs:
1. Use of computers in education, with sub-domains
2. Calimera documents (parallel corpus developed in the Calimera FP5 project)
Result: a multilingual, partially parallel, partially comparable, domain-specific corpus

Corpus statistics – full corpus
- Measuring lengths of corpora (# of documents, tokens)
- Measuring token / type ratio
- Measuring type / lemma ratio
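The two ratios on this slide are straightforward to compute; a minimal sketch, assuming tokenised and lemmatised input (the token and lemma lists here are invented toy data):

```python
def corpus_ratios(tokens, lemmas):
    """Token/type and type/lemma ratios for a tokenised, lemmatised corpus."""
    token_type = len(tokens) / len(set(tokens))
    type_lemma = len(set(tokens)) / len(set(lemmas))
    return token_type, type_lemma

# Invented toy corpus: parallel token and lemma streams
toks = ["learning", "objects", "learning", "object", "learn"]
lems = ["learning", "object", "learning", "object", "learn"]
tt, tl = corpus_ratios(toks, lems)
```

A low token/type ratio signals sparse data (many words occur only once or twice); a high type/lemma ratio signals a rich inflectional paradigm, as noted for the corpora below.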

[Table: # of documents and # of tokens per language: Bulgarian, Czech, Dutch, English, German, Polish, Portuguese, Romanian]

[Table: token/type and type/lemma ratios per language; English marked "(tbc)"]

Corpus statistics – full corpus
- Bulgarian, German and Polish corpora have a very low number of tokens per type (probably problems with sparseness)
- English has by far the highest ratio
- Czech, Dutch, Portuguese and Romanian are in between
- The type / lemma ratio reflects the richness of inflectional paradigms

Reflection
- The corpora are heterogeneous with respect to the type / token ratio
- Does the data sparseness of some corpora, compared to others, influence the information extraction process? If yes, how can we counter this effect?
- How does the quality of the linguistic annotation influence the extraction task?

Corpus statistics – annotated subcorpus
- Measuring lengths of annotated documents
- Measuring distribution of manually marked keywords over documents
- Measuring the share of keyphrases

[Table: # of annotated documents and average length (# of tokens) per language: Bulgarian, Czech, Dutch, English, German, Polish, Portuguese, Romanian]

[Table: # of keywords and average # of keywords per document, per language: Bulgarian, Czech, Dutch, English, German, Polish, Portuguese, Romanian]

Keyphrases
Bulgarian    43 %
Czech        27 %
Dutch        25 %
English      62 %
German       10 %
Polish       67 %
Portuguese   14 %
Romanian     30 %

Reflection
- Did the human annotators annotate keywords or domain terms?
- Was the task adequately contextualised?
- What do the varying shares of keyphrases tell us?

Keyword extraction
- Good keywords have a typical, non-random distribution in and across documents
- Keywords tend to appear more often at certain places in texts (headings etc.)
- Keywords are often highlighted / emphasised by authors
- Keywords express / represent the topic(s) of a text

Modelling Keywordiness
- Linguistic filtering of KW candidates, based on part of speech and morphology
- Distributional measures are used to identify unevenly distributed words:
  - TFIDF
  - (Adjusted) RIDF
- Knowledge of text structure used to identify salient regions (e.g., headings)
- Layout features of texts used to identify emphasised words and weight them higher
- Finding chains of semantically related words
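The two distributional measures named above can be sketched as follows. This is a standard formulation of TFIDF and of residual IDF (RIDF); the "adjusted" RIDF variant used in the project is not specified on the slide, so plain residual IDF against a Poisson model is shown, on invented toy documents:

```python
import math

def tfidf(term, doc, docs):
    """TF * IDF of a term in one document, over the whole collection."""
    df = sum(1 for d in docs if term in d)
    if df == 0:
        return 0.0
    return doc.count(term) * math.log(len(docs) / df)

def ridf(term, docs):
    """Residual IDF: observed IDF minus the IDF expected under a Poisson
    model with the term's mean corpus frequency (a standard formulation;
    the project's 'adjusted' variant may differ)."""
    N = len(docs)
    df = sum(1 for d in docs if term in d)
    if df == 0:
        return 0.0
    cf = sum(d.count(term) for d in docs)  # collection frequency
    expected_idf = -math.log(1.0 - math.exp(-cf / N))
    return -math.log(df / N) - expected_idf

# Toy documents, each a list of tokens
docs = [["keyword", "extraction", "keyword"],
        ["the", "learning", "object"],
        ["keyword", "learning"]]
```

A term that is bursty (concentrated in few documents relative to its total frequency) scores a higher RIDF than one spread as a Poisson process would predict, which is why RIDF favours topical words over function words.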

Challenges
- Treating multi-word keywords (= keyphrases)
- Assigning a combined weight which takes into account all the aforementioned factors
- Multilinguality: finding good settings for all languages, balancing language-dependent and language-independent features
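One simple way to combine the factors is a weighted linear sum; the actual combination scheme is not given on the slide, so both the factor names and the weights below are hypothetical:

```python
def combined_weight(scores, weights):
    """Hypothetical linear combination of per-factor keyword scores
    (distributional measure, heading position, layout emphasis,
    lexical-chain membership)."""
    return sum(weights[k] * scores.get(k, 0.0) for k in weights)

# Invented per-candidate factor scores and per-language weights
factors = {"tfidf": 0.8, "heading": 1.0, "emphasis": 0.0, "chain": 0.5}
w = {"tfidf": 0.5, "heading": 0.2, "emphasis": 0.1, "chain": 0.2}
score = combined_weight(factors, w)
```

Tuning the weight vector per language is one place where the language-dependent / language-independent balance mentioned above plays out.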

Treatment of keyphrases
- Keyphrases have to be restricted with respect to length (max 3 words) and frequency (min 2 occurrences)
- Keyphrase patterns must be restricted with respect to linguistic categories ("style of learning" is acceptable; "of learning styles" is not)
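The restrictions above can be sketched as a candidate filter. The pattern check here is a simplification covering only the start/end constraints illustrated by the slide's example, and the candidate dictionary is invented toy data:

```python
def is_valid_pattern(pos_tags):
    """Hypothetical linguistic filter: a keyphrase must start with a
    content word and end with a noun, so "style of learning"
    (NOUN ADP NOUN) passes while "of learning styles" (ADP NOUN NOUN)
    does not."""
    return pos_tags[0] in ("NOUN", "ADJ") and pos_tags[-1] == "NOUN"

def filter_keyphrases(candidates, max_len=3, min_freq=2):
    """candidates maps a phrase to (list of POS tags, corpus frequency)."""
    kept = []
    for phrase, (pos, freq) in candidates.items():
        if len(pos) <= max_len and freq >= min_freq and is_valid_pattern(pos):
            kept.append(phrase)
    return kept

# Invented candidates
cands = {
    "style of learning": (["NOUN", "ADP", "NOUN"], 3),
    "of learning styles": (["ADP", "NOUN", "NOUN"], 3),
    "use of computers in education": (["NOUN", "ADP", "NOUN", "ADP", "NOUN"], 4),  # too long
    "hapax phrase": (["ADJ", "NOUN"], 1),  # too infrequent
}
```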

KWE Evaluation 1
- Human annotators marked n keywords in document d
- First n choices of the KWE for document d extracted
- Measure overlap between both sets, also counting partial matches
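The overlap measure can be sketched as a set-based F-measure; the partial-match credit the slide mentions is omitted in this sketch:

```python
def overlap_f1(gold, predicted):
    """Exact-match F-measure between the annotators' keyword set and the
    extractor's top-n choices."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Since the extractor is cut off at exactly n = |gold| keywords, precision equals recall here, and the F-measure reduces to the plain overlap ratio.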

KWE Evaluation – Overlap Settings
- All three statistics have been tested
- Maximal keyphrase length set to 3

Language     Best method      F-Measure
Bulgarian    TFIDF/ADRIDF     0.25
Czech        TFIDF/ADRIDF     0.18
Dutch        TFIDF            0.29
English      ADRIDF           0.33
German       TFIDF            0.16
Polish       ADRIDF           0.26
Portuguese   TFIDF            0.22
Romanian     TFIDF/ADRIDF     0.15

Reflection
- Is it correct to use the human annotation as "gold standard"?
- Is it correct to give a weight to partial matches?

KWE Evaluation – IAA
- Participants read text (Calimera "Multimedia")
- Participants assign keywords to that text (ideally not more than 15)
- KWE produces keywords for the text
- IAA is measured over human annotators
- IAA is measured for KWE / human annotators
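One plausible way to operationalise this is mean pairwise overlap between annotators' keyword sets, with the KWE scored against each annotator the same way; the exact agreement measure used in the project is not stated on the slide, and the keyword sets below are invented:

```python
from itertools import combinations

def pairwise_iaa(keyword_sets):
    """Mean pairwise Jaccard overlap between annotators' keyword sets."""
    scores = []
    for a, b in combinations(keyword_sets, 2):
        union = a | b
        scores.append(len(a & b) / len(union) if union else 0.0)
    return sum(scores) / len(scores)

# Invented annotator keyword sets for one text
annotators = [{"multimedia", "retrieval", "archive"},
              {"multimedia", "retrieval", "metadata"},
              {"multimedia", "library"}]

# Scoring the extractor's (invented) output against each annotator
kwe = {"multimedia", "retrieval", "digital"}
kwe_vs_humans = sum(len(kwe & a) / len(kwe | a) for a in annotators) / len(annotators)
```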

[Table: IAA among human annotators and IAA of the KWE with best settings, per language: Bulgarian, Czech, Dutch, English, German, Polish, Portuguese, Romanian]

KWE Evaluation – Judging adequacy
- Participants read text (Calimera "Multimedia")
- Participants see 20 KW generated by the KWE and rate them
- Scale 1 – 4 (excellent – not acceptable); 5 = not sure

Language     Average   <= 2,0   <= 2,5
Bulgarian    2,21      9        15
Czech        2,22      8        13
Dutch        1,
English      2,
German       2,
Polish       1,
Portuguese   2,34      6        11
Romanian     2,14      15       16

Language     20 kw   First 5 kw   First 10 kw
Bulgarian    2,21    2,54         2,12
Czech        2,22    1,96
Dutch        1,93    1,68         1,64
English      2,15    2,52         2,22
German       2,06    1,96
Polish       1,95    2,06         2,1
Portuguese   2,34    2,08         1,94
Romanian     2,14    1,8          2,06

Language     New keywords suggested   Average per participant
Bulgarian
Czech        None
Dutch        12                       2.4
English      22                       4.4
German       None
Polish       45                       7.5
Portuguese   7                        1.4
Romanian     None

Reflection
- How should we treat the "not sure" decisions (quite substantial for a few judges)?
- What do the added keywords tell us? Where are they in the ordered list of recommendations?

Conclusions
- Evaluation of a KWE in a multilingual environment and with diverse corpora is more difficult than expected beforehand
- Now we have the facilities for a controlled development / improvement of the KWE
- Quantitative evaluation has to be accompanied by validation of the tool