Irion Technologies (c)

Presentation transcript:

Irion Technologies (c) MEANING WP8 Validation ©Irion Technologies 20-9-2018 Irion Technologies (c)

Validation in MEANING
Industrial partners: (Reuters), EFE, Irion
Integration of MEANING results in end-user applications:
  Cross-lingual retrieval
  Text classification system
Application to real user cases:
  Reuters news collection
  EFE Fototeca database: news pictures with Spanish & English captions
Evaluations:
  Text classification benchmark (Reuters)
  Information retrieval benchmark (Reuters & EFE)
  Task-based evaluation by end-users (EFE)

Validation in MEANING: Baseline
Corporate Semantic Network + Wordnet Domains
Text classification on Reuters news:
  Without Wordnets: R 67.8%, P 70.4%, C 83.2%
  Wordnets: R 75.6%, P 65.9%, C 99.5%
  Wordnets + WSD: R 79.2%, P 71.5%, C 100%
Information retrieval with paraphrased English queries on Reuters news:
  Without Wordnets: R 29%
  Wordnets: R 25%
  Wordnets + WSD: R 32%
Details in MEANING Deliverable 8.1

Validation in MEANING: Phase 3
Integration of MEANING (MCR)
EFE Fototeca database
Evaluation:
  Information retrieval benchmark (EFE)
  Task-based evaluation by end-users (EFE)
MEANING Deliverables 8.2, 8.3, 8.4

MEANING-full effects in information retrieval
WP8 Validation

Overview
  TwentyOne search system
  The EFE data and indexes built with MEANING
  Evaluation
  Conclusions

TwentyOne search system
Conceptual phrasal search

Value of linguistic phrases
Traditional string-based page retrieval systems cannot differentiate linguistic contexts:
  "animal party" & "party animal"
  "Java Internet servers" & "Internet servers on Java"
  "Good service but bad volley" & "Attend a service in the cathedral"
User queries and linguistic phrases express complex concepts that should be matched as a whole.

TwentyOne: two-stage retrieval system
  A vector space model is used to retrieve all relevant pages from a large collection.
  Within the relevant pages, we compare the concepts expressed in the query with the concepts expressed in the linguistic phrases.
  We list the pages with the best matching phrases.
  We use the vector space score when the phrase scores are equal.
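The two-stage ranking described on this slide can be sketched as follows. This is only an illustration under stated assumptions, not the TwentyOne implementation: the `Page` record, its field names, the `min_vector_score` cut-off, and the example scores are all hypothetical, and both scores are assumed to be computed elsewhere.

```python
from dataclasses import dataclass


@dataclass
class Page:
    page_id: str
    vector_score: float  # stage 1: vector space similarity to the query
    phrase_score: float  # stage 2: score of the best matching phrase on the page


def rank_pages(pages, min_vector_score=0.0):
    """Keep pages retrieved in stage 1, then order them by phrase score.

    The vector space score is only used to break ties between equal
    phrase scores, as the slide describes.
    """
    candidates = [p for p in pages if p.vector_score > min_vector_score]
    return sorted(
        candidates,
        key=lambda p: (p.phrase_score, p.vector_score),
        reverse=True,
    )


# Hypothetical scores for four pages; page "d" is not retrieved in stage 1.
pages = [
    Page("a", vector_score=0.40, phrase_score=1.00),
    Page("b", vector_score=0.70, phrase_score=0.33),
    Page("c", vector_score=0.55, phrase_score=1.00),
    Page("d", vector_score=0.00, phrase_score=0.00),
]
ranked_ids = [p.page_id for p in rank_pages(pages)]
print(ranked_ids)
```

Sorting on the tuple `(phrase_score, vector_score)` keeps the phrase match as the primary criterion, so a page with a high vector score but a weak phrase match (page "b") still ranks below pages whose phrases match the query fully.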

Conceptual phrase matching
[Diagram: a document phrase and a query are each broken down into word forms (Word form 1..N), the concepts they express (Concept 1..N, Concept M) and a domain (e.g. Domain = economy, Domain = politics). Example: "human right activist-leader" matched against "mensenrechtenactivistenleider" (human rights activist leader), Domain = politics.]
Phrase score: number of matching concepts
  all concepts, same wording -> 100%
  1 out of 3 concepts, same wording -> 33%
Matching criteria:
  matching conceptual relation: party animal; animal party
  matching domains
  fuzzy word match: potatos, potatoes, Afganistan & afghanistan; café, cafe, Café, CaFé, CAFÉ, café-noir
  flexion and derivation: depart, departure, departures, departing, departings
  multiwords and compounds: mensenrechtenactivistenleider, human rights
  original word, synonym or translation: café, pub, bar, coffee shop, tea room; United States of America, US, USA, VS, Amerika; Pays-Bas, Holland, the Netherlands
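The phrase score on this slide (share of matching concepts, e.g. 1 out of 3 -> 33%) can be illustrated with a small sketch. The lexicon, the concept identifiers, and the choice of the query concepts as the denominator are assumptions made for illustration; the slide does not specify them.

```python
# Hypothetical lexicon: word form -> set of concept identifiers.
# A compound maps to several concepts, synonyms share one concept.
LEXICON = {
    "rights": {"c_rights"},
    "activist": {"c_activist"},
    "leader": {"c_leader"},
    "mensenrechtenactivistenleider": {"c_rights", "c_activist", "c_leader"},
    "café": {"c_cafe"},
    "pub": {"c_cafe"},
    "bar": {"c_cafe"},
}


def concepts(text):
    """Collect the concepts expressed by the known word forms in a text."""
    found = set()
    for word in text.lower().split():
        found |= LEXICON.get(word, set())
    return found


def phrase_score(query, phrase):
    """Fraction of the query's concepts that also occur in the phrase."""
    query_concepts = concepts(query)
    if not query_concepts:
        return 0.0
    return len(query_concepts & concepts(phrase)) / len(query_concepts)


full_match = phrase_score("rights activist leader", "mensenrechtenactivistenleider")
partial_match = phrase_score("rights activist leader", "opposition leader")
synonym_match = phrase_score("pub", "café")
print(full_match, partial_match, synonym_match)
```

This sketch only counts shared concepts. It ignores the other criteria listed on the slide (conceptual relation, domain, wording), so "party animal" and "animal party" would receive the same score here.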

Cross-lingual retrieval
[Diagram: a query and XML pages/phrases pass through NLP components: Lid; Syn (tokenization, tagging, parsing); Nam (named entity recognition); Con (concept recognition). Expansion uses the multilingual semantic network. Index languages: ES, EN, CA, BA, IT.]

Domain-based WSD (IRST-Trento, Magnini 2002)
[Diagram labels: TwentyOne Classify; text classifier trained on texts grouped by domains; unseen document with phrases "financial scandal Juventus" and "Players boycott the match"; more contexts + domain; WordNet/Semnet (set of concepts, domain, synsets, glosses, examples); concept selection; export of sport words; Microworld: Sport; Nanoworld: Finance; Nanoworld: Sport; IST-project MEANING.]

Effectiveness of domain disambiguation
2nd-level domains (163 -> 57); NPs classified in a window of 10 NPs; threshold was set to 60.

                       Nanoworlds                      Microworlds
                       Spanish          English        Spanish          English
disambiguated words    238,671          26,279         44,652           3,097
total concepts         1,691,079        314,394        220,574          18,541
excluded               879,317 (52%)    205,221 (65%)  105,620 (48%)    10,603 (57%)
selected               811,762 (48%)    109,173 (35%)  114,954 (52%)    7,938 (43%)
polysemy               7.1              12.0           4.9              6.0
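The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below assumes that the four value columns are Nanoworlds Spanish, Nanoworlds English, Microworlds Spanish and Microworlds English, that "selected" equals total concepts minus excluded concepts, and that polysemy is total concepts divided by disambiguated words; these readings are assumptions, although they reproduce the numbers shown.

```python
# (disambiguated words, total concepts, excluded concepts) per column,
# copied from the slide; the column labels are an assumed reading.
COUNTS = {
    "Nanoworlds Spanish": (238_671, 1_691_079, 879_317),
    "Nanoworlds English": (26_279, 314_394, 205_221),
    "Microworlds Spanish": (44_652, 220_574, 105_620),
    "Microworlds English": (3_097, 18_541, 10_603),
}


def summarise(words, total, excluded):
    """Derive the remaining rows of the table from the three raw counts."""
    selected = total - excluded
    return {
        "selected": selected,
        "excluded_pct": round(100 * excluded / total),
        "selected_pct": round(100 * selected / total),
        "polysemy": round(total / words, 1),  # average concepts per word
    }


summary = {name: summarise(*counts) for name, counts in COUNTS.items()}
for name, row in summary.items():
    print(name, row)
```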

EFE data and indexes
Fototeca database for finding news pictures from captions

EFE data
  29,511 XML files (26,546 Spanish, 2,965 English), 29,943 images
  Content: caption and descriptions (mostly capitalized!)
  Meta information, other fields

Indexes
  NO: no usage of wordnet
  FULL: wordnets used for full expansion
  MEANING: wordnets used for expansion after disambiguation with MEANING data

NO index
  Spanish source string -> Spanish normalisation -> Spanish index
    -> no translation -> English, Basque, Catalan and Italian index
  English source string -> English normalisation -> English index
    -> no translation -> Spanish, Basque, Catalan and Italian index

FULL index
  Spanish source string -> Spanish-WN -> all meanings -> synonym expansion -> normalisation -> Spanish index
    -> translation -> normalisation -> English, Basque, Catalan and Italian index
  English source string -> English-WN -> all meanings -> synonym expansion -> normalisation -> English index
    -> translations -> normalisation -> Spanish, Basque, Catalan and Italian index

MEANING index
  Spanish source string -> Spanish-WN -> WSD -> selection of meanings -> synonym expansion -> normalisation -> Spanish index
    -> translation -> normalisation -> English, Basque, Catalan and Italian index
  English source string -> English-WN -> WSD -> selection of meanings -> synonym expansion -> normalisation -> English index
    -> translations -> normalisation -> Spanish, Basque, Catalan and Italian index
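The difference between the three indexes (NO, FULL, MEANING) can be sketched for the monolingual part of the pipeline. The toy wordnet, its sense labels, and the `chosen_senses` argument standing in for the WSD output are hypothetical, and the translation step into the other languages is left out.

```python
# Hypothetical toy wordnet: word -> {sense label: synonyms of that sense}.
TOY_WORDNET = {
    "cell": {
        "prison": ["jail"],
        "biology": ["neuron"],
        "device": ["phone", "battery"],
    },
}


def normalise(term):
    """Stand-in for the normalisation step: lower-casing only."""
    return term.strip().lower()


def index_terms(tokens, mode, chosen_senses=None):
    """Return the terms indexed for a source string under one index type.

    mode "NO":      source words only.
    mode "FULL":    synonyms of all meanings are added.
    mode "MEANING": synonyms of the meaning selected by WSD are added;
                    words without a WSD decision keep all meanings
                    (an assumption of this sketch).
    """
    chosen_senses = chosen_senses or {}
    terms = set()
    for token in tokens:
        word = normalise(token)
        terms.add(word)
        if mode == "NO":
            continue
        senses = TOY_WORDNET.get(word, {})
        if mode == "MEANING" and word in chosen_senses:
            senses = {s: syn for s, syn in senses.items() if s == chosen_senses[word]}
        for synonyms in senses.values():
            terms.update(normalise(s) for s in synonyms)
    return terms


tokens = ["Police", "cell"]
no_terms = index_terms(tokens, "NO")
full_terms = index_terms(tokens, "FULL")
meaning_terms = index_terms(tokens, "MEANING", chosen_senses={"cell": "prison"})
print(sorted(no_terms), sorted(full_terms), sorted(meaning_terms))
```

In this sketch the MEANING index is always a subset of the FULL index, which mirrors the trade-off evaluated in the rest of the deck: fewer noisy expansions at the risk of dropping a correct one.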

Evaluation
MEANING-full effects in information retrieval

Evaluation set-up
  Sets of paraphrased queries with translations to all languages
  Automatic measurement of recall, where we accept a top-10 result
  Number of results is limited to a maximum of 25, searched with Boolean AND
  Applied to all 3 indexes:
    no wordnets
    wordnets & no disambiguation
    wordnets & disambiguation
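One way to read the recall measure in this set-up is: a query counts as successful when a correct result appears within the top 10 of at most 25 returned results. The sketch below implements that reading; the function name, the data layout, and the toy result lists are assumptions, not the evaluation scripts used in the project.

```python
def recall_at_k(results_by_query, relevant_by_query, k=10, max_results=25):
    """Share of queries with at least one correct result in the top k.

    results_by_query:  query id -> ranked list of result ids
    relevant_by_query: query id -> set of correct result ids
    """
    if not relevant_by_query:
        return 0.0
    hits = 0
    for query_id, relevant in relevant_by_query.items():
        ranked = results_by_query.get(query_id, [])[:max_results]
        if any(doc_id in relevant for doc_id in ranked[:k]):
            hits += 1
    return hits / len(relevant_by_query)


# Toy data: q1 has its correct result at rank 2, q2 and q3 have none.
results = {"q1": ["d3", "d1"], "q2": ["d9"], "q3": []}
relevant = {"q1": {"d1"}, "q2": {"d2"}, "q3": {"d5"}}

top10 = recall_at_k(results, relevant, k=10)
top1 = recall_at_k(results, relevant, k=1)
print(top10, top1)
```

Calling the same function with `k=1` gives a first-position score, which is a plausible reading of the "p1" rows in the results slide.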

Disambiguating effect of phrase matching
Context is highly determining.
Wordnet expansions:
  Maximize recall -> generate all possible synonyms for all possible meanings, e.g. police cell -> jail
  Maximize noise -> generate all possible synonyms for unintended/irrelevant meanings, e.g. cell -> neuron, phone, battery
Chances are low that user queries contain phrases where unintended meanings are combined with similar context words:
  police cell division; police phone; police neuron; police battery

Queries
<TESTIN>
  <DBS_ID>EFE_1</DBS_ID>
  <DOC_ID>11</DOC_ID>
  <PAG_TITLE></PAG_TITLE>
  <PAG_ID>231</PAG_ID>
  <NPS>
    <NP ID="16">Un grupo de cargueros transporta una imagen adornada con flores violetas</NP>
  </NPS>
  <SOURCE_LNG>es</SOURCE_LNG>
  <BOOLEAN>AND</BOOLEAN>
  <QUERY_ES>flores violetas</QUERY_ES>
  <QUERY_EN>violet flowers</QUERY_EN>
  <QUERY_CA>flor violeta</QUERY_CA>
  <QUERY_BA>lore bioleta</QUERY_BA>
  <QUERY_IT>fiori di viola</QUERY_IT>
  <QUERY_SY>flores moradas</QUERY_SY>
</TESTIN>
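A test record like the one on this slide can be read with Python's standard `xml.etree.ElementTree` module. The sketch below is one possible reader; the function name and the shape of the returned dictionary are assumptions, and only the element names come from the slide.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<TESTIN>
  <DBS_ID>EFE_1</DBS_ID>
  <DOC_ID>11</DOC_ID>
  <PAG_TITLE></PAG_TITLE>
  <PAG_ID>231</PAG_ID>
  <NPS>
    <NP ID="16">Un grupo de cargueros transporta una imagen adornada con flores violetas</NP>
  </NPS>
  <SOURCE_LNG>es</SOURCE_LNG>
  <BOOLEAN>AND</BOOLEAN>
  <QUERY_ES>flores violetas</QUERY_ES>
  <QUERY_EN>violet flowers</QUERY_EN>
  <QUERY_CA>flor violeta</QUERY_CA>
  <QUERY_BA>lore bioleta</QUERY_BA>
  <QUERY_IT>fiori di viola</QUERY_IT>
  <QUERY_SY>flores moradas</QUERY_SY>
</TESTIN>"""


def read_test_query(xml_text):
    """Extract the target document and the per-language queries."""
    root = ET.fromstring(xml_text)
    queries = {
        child.tag[len("QUERY_"):].lower(): child.text
        for child in root
        if child.tag.startswith("QUERY_")
    }
    return {
        "doc_id": root.findtext("DOC_ID"),
        "page_id": root.findtext("PAG_ID"),
        "source_language": root.findtext("SOURCE_LNG"),
        "boolean": root.findtext("BOOLEAN"),
        "queries": queries,
    }


parsed = read_test_query(SAMPLE)
print(parsed["doc_id"], parsed["queries"]["en"])
```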

Queries
Table headers: Multi word queries, Single word queries, Unique strings, Postings
  Spanish original: 58, 105, 77, 92
  Spanish paraphrase: 94, 69, 96
  English: 57, 74
  Catalan: 60
  Basque: 104, 65
  Italian: 56
Words with ambiguity or synonyms
Other meanings and synonyms also occur in the documents
Words relate to the pictures
There can be multiple correct results for each query

Results for multi word queries
Table headers: Spanish original (105 Q), paraphrase (94 Q), English, Catalan, Basque (104 Q), Italian
  NO: 99 0.94; 14 0.15; 2 0.02; 31 0.3; 1 0.01; 3 0.03
  p1: 60 0.57; 9 0.1; 21 0.2
  FULL: 96 0.91; 71 0.76; 39 0.37; 70 0.67; 50 0.48; 55 0.52; 38 0.4; 16; 44 0.42; 27 0.26; 19 0.18
  MEANING: 97 0.92; 61 0.65; 68; 46 0.44; 32; 0.41; 48 0.46; 20 0.19
FULL has better overall scores; MEANING has better scores for the 1st position.
Wordnet indexes (FULL & MEANING) outperform NO for paraphrased & cross-lingual queries.
High recall for original queries, due to conceptual phrase search.
FULL and MEANING are very close to NO: no negative effects from expansion.
MEANING removed correct cases but has better precision -> FULL introduces more noise in the ranking.

Conclusions
  Integrated MEANING in TwentyOne Search for a real end-user application: the EFE picture database
  Showed that wordnets are useful for mono- and cross-lingual retrieval
  No significant improvement from WSD (top-10 scores are lower, top-1 scores are better)
  WSD can be improved in many ways
  Query-phrase matching is very effective, so we can afford to maximize recall with wordnets
  The type of queries (short & ambiguous) and type of retrieval (small captions & page-phrase matching) are important for experimental results

Thank you for your attention!