Corpus-based Terminology Extraction applied to Information Access Anselmo Peñas, Felisa Verdejo and Julio Gonzalo NLP Group, Dpto. Lenguajes y Sistemas.

Corpus-based Terminology Extraction applied to Information Access Anselmo Peñas, Felisa Verdejo and Julio Gonzalo NLP Group, Dpto. Lenguajes y Sistemas Informáticos, UNED, Spain Corpus Linguistics 2001, Lancaster, UK

Content
  Introduction
  Resources, Tools and Corpora
  Terminology Extraction (TE)
  Evaluation of the TE procedure
  Terminology-based Information Access
  Conclusions

Introduction: Framework
  The European Treasury Browser (ETB) project
    – Web site of educational resources (primary and secondary school)
    – Context of new technologies
    – Objective: to build the structures to organise and retrieve educational resources
  Similar systems
    – The Educational Resources Information Centre
    – The British Education Index

Introduction: Use of Thesauri
  Thesauri
    – Definition: controlled vocabulary, structured by relations
    – Structure: descriptors and relations (NT: narrower term, BT: broader term, RT: related term)
  Existing educational thesauri
    – Do not cover primary and secondary school vocabulary within the new technologies context
    – Construction of a multilingual thesaurus is needed for the purposes of the ETB project
  Terminology lists

Objectives of the work
  To build the Spanish list of candidate terms for the ETB multilingual thesaurus
  To develop a general procedure to obtain terminology lists
    – Automatically
    – Independently of the application domain
  To explore effective ways of Information Retrieval that use terminology lists instead of a thesaurus to bridge the gap between the users' language and the collection's language

Resources and Tools
  Resources
    – Semantic network: EuroWordNet
    – Monolingual dictionary (VOX)
    – Bilingual dictionary (VOX)
  Tools
    – Tokeniser
    – Morphological analyser
    – POS tagger
    – Shallow parser (based on syntactic patterns)

Corpora
  Corpus of educational resources: 1,075 documents (670,646 words) from
    – Programa de Nuevas Tecnologías
    – Aldea Global
  Corpus of international news: 7,364 documents (2.9 million words)
  Pre-processing: HTML tag treatment, language detection, detection of repeated pages and chunks, etc.

Terminology Extraction (TE)
  Terminology list: list of mono-lexical and poly-lexical terms that are common in a specific domain
  Steps of Terminology Extraction
    1. Term detection
    2. Term weighting
    3. Term selection

1. Term Detection (mono-lexical)
  (Over both corpora: Educational Resources and International News)
  Processing
    – Tokenising, lemmatising, tagging
    – Removal of erroneous strings, abbreviations and words from other languages
    – Extraction of nouns, verbs and adjectives
  Result: list of candidate lemmas, each with its
    – Term frequency (any form) in both collections
    – Document frequency in both collections
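The counting step behind this result can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the tokeniser, lemmatiser and tagger have already produced (lemma, tag) pairs, and the tag names `N`, `V`, `A` and the function `count_lemmas` are hypothetical.

```python
from collections import Counter

# Assumed tag names for nouns, verbs and adjectives.
CONTENT_TAGS = {"N", "V", "A"}

def count_lemmas(documents):
    """documents: list of documents, each a list of (lemma, tag) pairs.
    Returns (term_freq, doc_freq) counters over content-word lemmas."""
    term_freq, doc_freq = Counter(), Counter()
    for doc in documents:
        lemmas = [lemma for lemma, tag in doc if tag in CONTENT_TAGS]
        term_freq.update(lemmas)       # every occurrence counts
        doc_freq.update(set(lemmas))   # each document counts once per lemma
    return term_freq, doc_freq

# Toy input (hypothetical data, already lemmatised and tagged).
docs = [
    [("educación", "N"), ("a", "Prep"), ("distancia", "N")],
    [("educación", "N"), ("educación", "N"), ("ser", "V")],
]
tf, df = count_lemmas(docs)
```

Running the same function over each corpus separately would give the per-collection frequencies the slide lists.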

1. Term Detection (poly-lexical)
  (Over the Educational Resources corpus)
  Processing
    – Tokenising, lemmatising, tagging
    – Shallow parsing (syntactic pattern recognition)
  Result: list of candidate terminological phrases, each with its
    – Term frequency in the collection
    – Document frequency in the collection
  Example
    ... como/CS en/Prep la/Art educación/N a/Prep distancia/N ,/Punc el/Art ministerio/N ...
    Pattern: N Prep N
    Detected term: educación a distancia
  Syntactic patterns for Spanish terminological phrases
    – N N
    – N A
    – N [A] Prep N [A]
    – N [A] Prep Art N [A]
    – N [A] Prep V
    – N [A] Prep V N [A]
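A small sketch of how such patterns might be matched over tagged text is given below. It is an illustration only: the split of the pattern list into six entries is one reading of the slide, square brackets are taken to mark optional elements, and the function names are hypothetical.

```python
# Assumed reading of the slide's pattern list; [X] marks an optional tag.
PATTERNS = [
    "N N",
    "N A",
    "N [A] Prep N [A]",
    "N [A] Prep Art N [A]",
    "N [A] Prep V",
    "N [A] Prep V N [A]",
]

def parse_pattern(pattern):
    """Turn 'N [A] Prep N' into [('N', False), ('A', True), ...]."""
    return [(p.strip("[]"), p.startswith("[")) for p in pattern.split()]

def match_at(tags, start, pattern):
    """Return every end index at which the pattern matches tags from start."""
    ends = {start}
    for tag, optional in pattern:
        advanced = {e + 1 for e in ends if e < len(tags) and tags[e] == tag}
        # An optional element may be consumed or skipped.
        ends = (advanced | ends) if optional else advanced
    return ends

def detect_phrases(tagged):
    """tagged: list of (word, tag) pairs. Returns the set of matched phrases."""
    words = [w for w, _ in tagged]
    tags = [t for _, t in tagged]
    found = set()
    for pattern in map(parse_pattern, PATTERNS):
        for start in range(len(tags)):
            for end in match_at(tags, start, pattern):
                found.add(" ".join(words[start:end]))
    return found

# The tagged fragment from the slide.
tagged = [
    ("como", "CS"), ("en", "Prep"), ("la", "Art"), ("educación", "N"),
    ("a", "Prep"), ("distancia", "N"), (",", "Punc"), ("el", "Art"),
    ("ministerio", "N"),
]
phrases = detect_phrases(tagged)
```

On this fragment only the N [A] Prep N [A] pattern fires, yielding the phrase detected on the slide.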

2. Term weighting
  Empirical measure
    Proportional to
      – term frequency
      – document frequency
    Inversely proportional to
      – term frequency in the other domain
  Normalisation in the domain corpus
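The slide does not give the formula itself. The sketch below is therefore not the authors' measure, only one simple function with the stated proportionalities; the name `term_weight`, the add-one smoothing in the denominator and the normalisation by corpus size are all assumptions.

```python
def term_weight(tf_domain, df_domain, tf_other, n_docs_domain):
    """Illustrative weight (an assumption, not the measure used in the talk).

    Grows with term frequency and document frequency in the domain corpus,
    shrinks with term frequency in the contrast corpus.
    """
    df_ratio = df_domain / n_docs_domain  # normalise by domain corpus size
    return tf_domain * df_ratio / (1 + tf_other)  # +1 avoids division by zero

# A term absent from the contrast corpus outweighs one that is frequent there.
domain_specific = term_weight(tf_domain=10, df_domain=5, tf_other=0, n_docs_domain=10)
general_word = term_weight(tf_domain=10, df_domain=5, tf_other=9, n_docs_domain=10)
```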

3. Term Selection
  Removal of terms that are infrequent in the study domain
  Removal of terms that are very frequent in other domains
  Ranking of terms according to their weight
  Selection of the top terms for the terminology list (thresholds to obtain 2,000 / 3,000 terms from the 75,000 detected terms)
  Addition of phrases with relevant components
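The selection steps can be sketched as a filter followed by a ranked cut. The threshold names, the toy numbers and the reading of "phrases with relevant components" as "phrases containing at least one selected term" are assumptions made for illustration.

```python
def select_terms(weights, tf_domain, tf_other, min_domain_tf, max_other_tf, top_n):
    """Filter by frequency thresholds, rank by weight, keep the top_n terms."""
    kept = [
        t for t in weights
        if tf_domain.get(t, 0) >= min_domain_tf   # drop infrequent domain terms
        and tf_other.get(t, 0) <= max_other_tf    # drop terms frequent elsewhere
    ]
    kept.sort(key=lambda t: weights[t], reverse=True)
    return kept[:top_n]

def add_phrases(selected, phrases):
    """Keep phrases with at least one selected component (assumed reading)."""
    chosen = set(selected)
    return [p for p in phrases if any(w in chosen for w in p.split())]

# Hypothetical toy data.
weights = {"currículo": 3.0, "aula": 2.0, "gobierno": 1.5, "alumno": 0.5}
tf_domain = {"currículo": 40, "aula": 30, "gobierno": 25, "alumno": 1}
tf_other = {"gobierno": 900}

selected = select_terms(weights, tf_domain, tf_other,
                        min_domain_tf=5, max_other_tf=100, top_n=10)
extra = add_phrases(selected, ["aula virtual", "política exterior"])
```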

Evaluation: Visual exploration
  Automatic generation of result pages in HTML
  Purpose
    – To support decisions during prototype development
    – To evaluate the measures and techniques and to suggest improvements or modifications
    – To give further information to documentalists in order to assist final decisions in thesaurus construction

Evaluation: Visual exploration

Evaluation: Precision
  Manual classification of the 2,856 selected terms
  Example candidates: Proyecto curricular; Ciencias sociales; Sistema operativo; Proyectos curriculares (Proyecto curricular); Profesorado materiales ¿?; Alumnos ingleses; Biblioteca nacional
  With little effort, a large number of accurate terms is proposed to documentalists

Evaluation: Precision
  [Plot: precision against number of selected candidates]
  Precision: percentage of selected terms that are appropriate terms
  Higher precision at the top of the ranking
  Precision increases as the number of selected candidates decreases
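Precision at a given cut-off of the ranking can be computed as in the sketch below. The term names and the manual judgements are placeholders, not data from the evaluation.

```python
def precision_at(ranked_terms, judged_correct, k):
    """Fraction of the top-k ranked terms that were judged appropriate."""
    top = ranked_terms[:k]
    if not top:
        return 0.0
    return sum(t in judged_correct for t in top) / len(top)

# Placeholder ranking and judgements (hypothetical).
ranked = ["t1", "t2", "t3", "t4"]
correct = {"t1", "t2", "t4"}

p_at_2 = precision_at(ranked, correct, 2)
p_at_4 = precision_at(ranked, correct, 4)
```

Evaluating the function at increasing values of `k` would trace a curve like the one on the slide.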

Terminology-based Information Access
  Terminology Extraction in Information Retrieval provides
    – At indexing: a way to add poly-lexical terms to the indexes without an explosion of n-grams
    – Term browsing: a way to navigate through the terminology and access the documents from the terms (without the use of thesauri)

Terminology-based Information Access
  A difference from TE: terminology list truncation (since the query supplies the relevant terms, the task is now concerned with recall rather than precision of terms)
  A new task: to retrieve terminology
    – Poly-lexical terms are retrieved from mono-lexical ones
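Retrieving poly-lexical terms from mono-lexical ones can be pictured as a lookup in an index from component words to phrases. The sketch below is a minimal illustration under the assumption that query words and phrase components are already lemmatised; the function names and the ranking by number of shared words are hypothetical.

```python
from collections import defaultdict

def build_component_index(phrases):
    """Map each component word to the set of phrases that contain it."""
    index = defaultdict(set)
    for phrase in phrases:
        for word in phrase.split():
            index[word].add(phrase)
    return index

def retrieve_terms(query_words, index):
    """Return phrases sharing words with the query, best matches first."""
    hits = defaultdict(int)
    for word in set(query_words):
        for phrase in index.get(word, ()):
            hits[phrase] += 1
    # Sort by number of shared words, then alphabetically for stable output.
    return sorted(hits, key=lambda p: (-hits[p], p))

# Hypothetical terminology list.
terminology = ["educación a distancia", "educación infantil", "sistema operativo"]
index = build_component_index(terminology)
results = retrieve_terms(["educación", "distancia"], index)
```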

Terminology-based Information Access
  Terminology retrieval
    – To bridge the gap between collection terminology and query terms
    – Requires query expansion and query translation
    – But these produce noise in the retrieval
    – However, phrases provide an excellent way to reduce ambiguity (Ballesteros & Croft, 1998)

Terminology-based Information Access
  Example query: Tratados de Prohibición de Pruebas Nucleares
  Tratados
    – Expansion: acuerdo, capitulación, concertación, convenio, cuidar, pacto, manejar, procesar
    – Translation: accord, discourse, handle, manage, pact, process, treat, treatise, treaty
  Prohibición
    – Expansion: embargo, entredicho, interdicción, interdicto, proscripción
    – Translation: ban, interdiction, prohibition, proscription
  Pruebas
    – Expansion: cata, catadura, degustación, ensayo, escandallo, experimento, gustación, muestreo, tanteo
    – Translation: demonstrate, establish, exhibit, experiment, experimentation, fall, fitting, indicate, point, present, proof, prove, run, sample, sampling, shew, show, taste, test, trial, try
  Nucleares
    – Translation: nuclear
  Candidate phrases after expansion and translation
    – Nuclear test ban treaty?
    – Nuclear fitting interdiction manage?
    – Nuclear taste proscription process?
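The effect shown in this example can be sketched in a few lines: word-by-word translation multiplies the candidate phrases, and checking them against the phrases actually found in the collection removes the noise. The sketch assumes the translations are already arranged in target-language word order, uses only a subset of the translations from the slide, and treats the one-entry terminology set as a hypothetical collection.

```python
from itertools import product

def candidate_phrases(alternatives):
    """Combine one alternative per word position into candidate phrases."""
    return [" ".join(combo) for combo in product(*alternatives)]

def filter_by_terminology(candidates, terminology):
    """Keep only candidates attested as phrases in the collection."""
    return [c for c in candidates if c in terminology]

# Three translations per ambiguous word, taken from the slide.
alternatives = [
    ["nuclear"],
    ["test", "fitting", "taste"],
    ["ban", "interdiction", "proscription"],
    ["treaty", "manage", "process"],
]
candidates = candidate_phrases(alternatives)

# Hypothetical terminology extracted from the target collection.
terminology = {"nuclear test ban treaty"}
attested = filter_by_terminology(candidates, terminology)
```

Even with three alternatives per word there are 27 candidates, of which only the attested phrase survives the filter.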

Conclusions
  Extraction of relevant terms in Spanish for the ETB project domain (primary and secondary school / new technologies)
    – Automatic process using free resources such as web pages
    – Exploring contexts and statistical data via the Internet
  Development of a search engine based on terminology extraction
    – Using terminology lists as an intermediate approach between free searching and thesaurus-guided searching
    – Without requiring thesaurus construction
    – Bridging the distance between the terms used in the query and the terminology used in the collection (even in different languages)

Thanks for your attention