WP 2: Semi-automatic metadata generation driven by Language Technology Resources Lothar Lemnitzer Project review, Utrecht, 1 Feb 2007.

Our Background
– Experience in corpus annotation and information extraction from texts
– Experience in grammar development
– Experience in statistical modelling
– Experience in eLearning

WP2 Dependencies
– WP 1: collection and preparation of LOs
– WP 3: WP 2 results are input to this WP
– WP 4: integration of tools
– WP 5: evaluation and validation

M12 Deliverables – Language Resources:
– > running words per language, with structural and linguistic annotation
– > 1000 manually selected keywords
– > 300 manually selected definitions
– Local grammars for definitory contexts

M12 Deliverables – Documentation:
– Guidelines for linguistic annotation
– Guidelines for keyword annotation
– Guidelines for the annotation of definitions
– (Guidelines for evaluation)

M12 Deliverables – Tools:
– Prototype keyword extractor
– Prototype glossary candidate detector

Architecture (diagram): user documents (PDF, DOC, HTML, SCORM, XML) are stored in a repository; CONVERTOR 1 turns SCORM and pseudo-structured documents into basic XML, while CONVERTOR 2 converts HTML documents; the linguistic processor (lemmatizer, POS tagger, partial parser) adds linguistic annotation and keyword metadata to the XML; language-specific lexicons (BG, CZ, DT, EN, GE, MT, PL, PT, RO) connect the annotated documents to the ontology, which supports the glossary and crosslingual retrieval in the LMS, taking the user profile into account.

Linguistically annotated learning objects
– Structural annotation: par, s, chunk, tok
– Linguistic annotation: base, ctag, msd attributes (→ Example 1)
– Specific annotation: marked term, defining text

Part of the DTD
<!ATTLIST markedTerm
  %a.ana;
  kw (y|n) "n"
  dt (y|n) "n"
  status CDATA #IMPLIED
  comment CDATA #IMPLIED >
<!ATTLIST definingText
  id ID #IMPLIED
  xml:lang CDATA #IMPLIED
  lang CDATA #IMPLIED
  rend CDATA #IMPLIED
  type CDATA #IMPLIED
  wsd CDATA #IMPLIED
  def IDREF #IMPLIED
  continue CDATA #IMPLIED
  part CDATA #IMPLIED
  status CDATA #IMPLIED
  comment CDATA #IMPLIED >

Linguistically annotated learning objects – Use:
– Linguistically annotated texts are input to the extraction tools
– Marked terms and defining texts are used as training material and/or as a gold standard for the evaluation

Characteristics of keywords
1. Good keywords have a typical, non-random distribution within and across LOs
2. Keywords tend to appear more often at certain places in texts (headings etc.)
3. Keywords are often highlighted / emphasised by authors

Distributional features of keywords
We use the following metrics to measure keywordiness by distribution:
– Term frequency / inverse document frequency (tf*idf)
– Residual inverse document frequency (RIDF)
– An adjusted version of RIDF (adjusted by term frequency) to model the inter-text distribution of keywords
– Term burstiness to model the intra-text distribution of keywords
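The first two metrics above can be sketched as follows. This is a toy illustration of tf*idf and of RIDF (observed IDF minus the IDF a Poisson model would predict from the collection frequency), not the project's actual extractor; the function and field names are invented for the example.

```python
import math
from collections import Counter

def keyword_scores(docs):
    """Score terms by tf*idf and residual IDF (RIDF) over a small
    collection of whitespace-tokenized documents (toy sketch)."""
    N = len(docs)
    tokenized = [d.lower().split() for d in docs]
    df = Counter()  # document frequency: in how many docs a term occurs
    cf = Counter()  # collection frequency: total occurrences of a term
    for toks in tokenized:
        cf.update(toks)
        df.update(set(toks))
    scores = {}
    for term in df:
        idf = math.log2(N / df[term])
        # IDF predicted by a Poisson model with mean cf/N; terms whose
        # observed IDF exceeds the prediction are "bursty", keyword-like
        expected_idf = -math.log2(1 - math.exp(-cf[term] / N))
        tf = max(toks.count(term) for toks in tokenized)
        scores[term] = {"tfidf": tf * idf, "ridf": idf - expected_idf}
    return scores
```

A term concentrated in one document gets a high RIDF, while a term spread evenly over the collection gets a low or negative one, which is exactly the distributional signal the slide describes.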

Structural and layout features of keywords
We will use:
– Knowledge of text structure to identify salient regions (e.g., headings)
– Layout features of texts to identify emphasised words (→ Example 2)
Words with such features are weighted higher.
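The weighting idea can be sketched as below; the weight values and region labels are invented for illustration, since the slide does not specify the actual weighting scheme.

```python
# Illustrative weights, not the project's actual values
STRUCTURE_WEIGHTS = {"heading": 2.0, "emphasis": 1.5, "body": 1.0}

def weighted_score(base_score, regions):
    """Boost a term's keywordiness score when it occurs in a salient
    region (a heading) or in author-emphasised text; the strongest
    applicable boost wins."""
    boost = max(STRUCTURE_WEIGHTS.get(r, 1.0) for r in regions)
    return base_score * boost
```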

Complex keywords
– Complex, multi-word keywords are relevant; there are marked differences between languages
– The keyword extractor allows the extraction of n-grams of any length
– Evaluation showed that including bigrams and even trigrams improves the results; with longer n-grams, performance begins to drop
– The maximum keyword length can be specified as a parameter
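Candidate n-gram collection with a configurable maximum length can be sketched as follows; the function name and signature are illustrative, not the extractor's actual API.

```python
def candidate_ngrams(tokens, max_len=3):
    """Collect all n-gram keyword candidates from a token list, up to
    a configurable maximum length (mirroring the extractor's
    maximum-keyword-length parameter)."""
    grams = []
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(tuple(tokens[i:i + n]))
    return grams
```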

Complex keywords

Language   Single-word keywords   Multi-word keywords
German     91 %                   9 %
Polish     35 %                   65 %

Language settings for the keyword extractor
– Selection of single keywords is restricted to a few ctag categories and/or msd values (nouns, proper nouns, unknown words and some verbs for most languages)
– Multiword patterns are restricted with respect to the position of function words ("style of learning" is acceptable; "of learning behaviours" is not)
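The positional restriction on function words can be sketched as a simple filter; the function-word list here is a tiny illustrative sample, not the project's actual language-specific lists.

```python
# Illustrative English function words; the real extractor uses
# language-specific lists and annotations
FUNCTION_WORDS = {"of", "the", "a", "an", "in", "on", "for"}

def is_valid_multiword(ngram):
    """Accept a multiword candidate only if it neither starts nor ends
    with a function word: 'style of learning' passes, while
    'of learning behaviours' is rejected."""
    return (ngram[0] not in FUNCTION_WORDS
            and ngram[-1] not in FUNCTION_WORDS)
```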

Output of the keyword extractor
A list, ordered by "keywordiness" value, with the elements:
– Normalized form of the keyword
– (Statistical figures)
– List of attested forms of the keyword
(→ Example 3)

Evaluation strategy
We will proceed in three steps:
1. Manually assigned keywords will be used to measure precision and recall of the keyword extractor
2. Human annotators will judge and rate the results from the extractor
3. The same document(s) will be annotated by several test persons in order to estimate inter-annotator agreement on this task
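Step 1 uses the standard set-based definitions of precision and recall, which can be written out as:

```python
def precision_recall(extracted, gold):
    """Precision and recall of extracted keywords against a manually
    annotated gold standard (step 1 of the evaluation strategy)."""
    extracted, gold = set(extracted), set(gold)
    tp = len(extracted & gold)  # true positives
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall
```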

Summary
With the keyword extractor,
1. We use several known statistical metrics in combination with qualitative and linguistic features
2. We place special emphasis on multiword keywords
3. We evaluate the impact of these features on the performance of the tool for eight languages
4. We integrate the tool into an eLearning application
5. We have a prototype user interface for the tool

Identification of definitory contexts
Empirical approach based on the linguistic annotation of LOs
Workflow:
– Definitory contexts are searched for and marked in LOs
– Recurrent patterns are characterized quantitatively and qualitatively (→ Example 4)
– Local grammars are drafted on the basis of these recurrent patterns
– Extraction of definitory contexts is performed by lxtransduce (University of Edinburgh – LTG)
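One of the recurrent patterns ("<term> is a/an <definition>") can be illustrated with a toy regex; the project's actual local grammars are lxtransduce rules over linguistically annotated XML, not regular expressions over raw text, and the pattern and names below are invented for the example.

```python
import re

# Toy local grammar for a single definition pattern:
# "<term> is a/an <definition>", with a term of up to three words
DEF_PATTERN = re.compile(
    r"^(?P<term>(?:\w+\s+){0,2}\w+)\s+is\s+an?\s+(?P<definition>.+)$",
    re.IGNORECASE,
)

def match_definition(sentence):
    """Return (term, definition) if the sentence matches the toy
    definition pattern, else None."""
    m = DEF_PATTERN.match(sentence.strip())
    return (m.group("term"), m.group("definition")) if m else None
```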

Characteristics of local grammars
– Grammar rules match and wrap subtrees of the XML document tree
– One grammar rule refers to subrules which match substructures
– Rules can refer to lexical lists to further constrain categories
– The defined term should be identified and marked

Example
Een vette letter is een letter die zwarter wordt afgedrukt dan de andere letters. ("A bold letter is a letter that is printed darker than the other letters.")

Output of the glossary candidate detector
– Ordered list of words
– Defined term marked
– (Larger context: one preceding and one following sentence)
(→ Example 10)

Evaluation strategy
We will proceed in two steps:
1. Manually marked definitory contexts will be used to measure precision and recall of the glossary candidate detector
2. Human annotators will judge the results from the glossary candidate detector and rate their quality / completeness

Results

             Precision   Recall
Own LOs      21.5 %      34.9 %
Verbs only   34.1 %      29.0 %

Evaluation
Questions to be answered by a user-centered evaluation:
– Is there a preference for higher recall or for higher precision?
– Do users profit from seeing a larger context?

Integration of functionalities (architecture diagram): the keyword extractor, definitory-context detector and ontology code run as Java classes with their data on a Java web server (Tomcat) hosting the application logic and user interface; they are exposed as web services (Axis) and servlets/JSP, and the ILIAS server uses them through SOAP (nuSOAP); code and data are maintained on a development server (CVS) with nightly updates; LOs come from the ILIAS content portal via a migration tool; the functionalities can also be accessed directly, evaluated within ILIAS, and used by third-party tools.

User Interface Prototypes