TRUST & QRISTAL (TRUST = Text Retrieval Using Semantic Technologies) (QRISTAL = Questions-Réponses Intégrant un Système de Traitement Automatique des Langues)


TRUST & QRISTAL
TRUST = Text Retrieval Using Semantic Technologies
QRISTAL = Questions-Réponses Intégrant un Système de Traitement Automatique des Langues (Question Answering Integrating a Natural Language Processing System)
M-CAST presentation, 10 January 2005, Synapse Développement, D. LAURENT

1. TRUST Presentation
2. QRISTAL Presentation
3. QRISTAL Evaluation

1. TRUST Presentation

TRUST is an R&D project co-financed by the EU Commission, under Synapse's technological leadership, addressing a multilingual QA system. It was submitted by a consortium of 6 SMEs:
Synapse Développement, Toulouse, France
Expert System Solutions, Modena, Italy
Priberam, Lisbon, Portugal
TiP, Katowice, Poland
Convis, Berlin, Germany & Paris, France
Sémiosphère, Toulouse, France (coordination)

TRUST started in November 2001 and was completed in October. It was designed as an industrial project with the aim of commercialising, in both B2B and B2C markets, QA software allowing any user to retrieve one or several answers to a general-purpose or factual question. It was designed to answer questions on a finite corpus (hard disk, set of documents…), or questions addressed to the Internet via a meta-engine using the most popular engines (Google, MSN, Altavista, AOL, etc.).

The targeted languages were French, Italian, Polish and Portuguese. English was not part of TRUST but was developed in parallel. The pivot language, allowing a question asked in one language to be answered in another, is English. All partners owned a syntactic analyser and substantial linguistic resources. Synapse, as technology transferor, had at its disposal a previously commercialised indexing and retrieval engine (called Chercheur).

Processing pipeline (slide diagram):
Resources: general ontology, documents, dictionary of derived forms, ontology of question types.
Indexing: segmentation into blocks, spelling correction, syntactic analysis, conceptual analysis, anaphora resolution; indexes of block keywords, named entities, heads of derivation, concepts, domains, and question-answer types.
Question processing: spelling correction, syntactic analysis, conceptual analysis, keyword extraction, question typing, translation if multilingual; search in the indexes (synonyms + converses), block selection, block ranking, block extraction.
Answer extraction: spelling correction, syntactic analysis, conceptual analysis, answer typing, block keywords, anaphora resolution, metaphor detection, sentence selection, sentence ranking, coherence and justification, extraction of the answer(s).

Trust Engine Description

At completion, the TRUST engine had very original features: the indexation is carried out on words, expressions and named entities, but also on concepts, domains and question-answer types. The excerpt search and the answer extraction use a very deep and precise syntactic, conceptual and semantic analysis.

A Modular Architecture (slide diagram):
French, Italian, Portuguese, Polish and English linguistic modules; indexing engine; text-block extraction engine; index; documents; display of results.

Document Indexation

TRUST indexes numerous document formats (.html, .doc, .pdf, .ps, .sgml, .xml, .hlp, .dbx, etc.) as well as archived/compressed files (.zip) and ASCII texts. An automated spelling check may be carried out beforehand. Beyond the usual indexation of terms, a semantic and syntactic analysis performs the indexation of concepts and of the typology of answers (e.g. a date of birth, a title or an occupation for a person, etc.).

Simple words are indexed by « head of derivation », i.e. words such as « symétrie », « symétriques », « asymétrie », « dissymétrique », « symétriseraient » or « symétrisable » will all be indexed under the same heading « symétrie ». This technique reduces the size of the indexes and facilitates the grouping of neighbouring notions, thus avoiding the classical « term expansion » process at query time.
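The grouping idea can be sketched as a lookup table from derived forms to their head, applied at indexing time. This is a minimal illustration, not Synapse's actual derivation dictionary; the sample entries mirror the « symétrie » family cited above.

```python
# Sketch of "head of derivation" indexing: all morphological variants of a
# word are indexed under one canonical head, so no query-time term
# expansion is needed. The mapping is a tiny illustrative sample.
DERIVATION_HEADS = {
    "symétrie": "symétrie",
    "symétriques": "symétrie",
    "asymétrie": "symétrie",
    "dissymétrique": "symétrie",
    "symétriseraient": "symétrie",
    "symétrisable": "symétrie",
}

def head_of_derivation(word: str) -> str:
    """Return the derivation head for a word (the word itself if unknown)."""
    return DERIVATION_HEADS.get(word.lower(), word.lower())

def index_terms(words):
    """Group a token list under derivation heads, as the indexer would."""
    index = {}
    for w in words:
        index.setdefault(head_of_derivation(w), []).append(w)
    return index

# All symmetry variants land under the single head "symétrie":
print(index_terms(["asymétrie", "symétriques", "table"]))
```

A query for « symétrie » then retrieves blocks containing any of the variants without expanding the query itself.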

Technical Characteristics

Indexation is currently performed in 1 Ko blocks, i.e. the texts are sliced into 1 Ko blocks, and each head of derivation is indexed and allocated an occurrence number (e.g. if found 3 times in a block, its occurrence count is 3). The indexation speed differs greatly according to the language: about 300 Mo/hour in French and Polish, about 240 Mo/hour in Portuguese, about 100 Mo/hour in English and about 10 Mo/hour in Italian.
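The block slicing and per-block occurrence counting described above can be sketched as follows. This is an illustrative simplification (character-based slicing, raw-token counting); the real engine works on bytes and on heads of derivation.

```python
from collections import Counter

BLOCK_SIZE = 1024  # 1 Ko blocks, as described in the slides

def slice_into_blocks(text: str, block_size: int = BLOCK_SIZE):
    """Slice text into fixed-size blocks (character-based for simplicity)."""
    return [text[i:i + block_size] for i in range(0, len(text), block_size)]

def block_occurrences(block: str) -> Counter:
    """Count term occurrences within one block; the real indexer would
    count heads of derivation rather than raw tokens."""
    return Counter(block.lower().split())

blocks = slice_into_blocks("le chat dort " * 400)  # 5200 chars -> 6 blocks
counts = block_occurrences(blocks[0])
```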

Conceptual Indexation and Ontology

TRUST shares a common ontology across the linguistic modules of all the attached languages. This ontology, developed by Synapse, comprises 5 hierarchical levels:
28 categories at the top level
94 categories at the second level
256 categories at the third level
3387 categories at the fourth level
terms (including meanings for words) and « syntagmes » at the base level.
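One way such a hierarchy is useful at search time is that a term indexed with its full ancestor chain can match a query concept at any level. The sketch below assumes this chain representation; the category names are invented for illustration, only the 5-level structure comes from the slides.

```python
# Each term carries its full 5-level ancestor chain, so a question about a
# broad concept ("animals") can match a block mentioning a specific term.
# Category names here are illustrative inventions.
CONCEPT_CHAINS = {
    "chat":  ["living beings", "animals", "mammals", "felines", "chat"],
    "chien": ["living beings", "animals", "mammals", "canines", "chien"],
}

def concept_match(query_concept: str, term: str) -> bool:
    """True if the query concept appears anywhere on the term's chain."""
    return query_concept in CONCEPT_CHAINS.get(term, [])

# A question about "animals" matches a text that only says "chat":
assert concept_match("animals", "chat")
```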

Indexation and Types of Questions

TRUST indexes the types of questions: when analysing each block of text, each linguistic module attempts to detect the possible answers to each type of question (person, date, event, cause, aim, etc.). The present taxonomy of question types comprises 86 different categories. It goes beyond the purely « factual », including notions such as « usefulness », « comparison » and « judgment », but also categories like « yes/no » or classification.
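A toy question-type detector in the spirit of that taxonomy can be sketched with surface patterns. The patterns and the four category labels below are illustrative assumptions; the real system uses full syntactic and semantic analysis over 86 categories.

```python
import re

# Minimal question-type detector; patterns and labels are illustrative only.
QUESTION_TYPE_PATTERNS = [
    (r"^(qui|who)\b", "person"),
    (r"^(quand|when)\b", "date"),
    (r"^(pourquoi|why)\b", "cause"),
    (r"\b(est-ce que|is|are|does)\b.*\?$", "yes/no"),
]

def question_type(question: str) -> str:
    q = question.lower().strip()
    for pattern, qtype in QUESTION_TYPE_PATTERNS:
        if re.search(pattern, q):
            return qtype
    return "other"

assert question_type("Qui a écrit Notre-Dame de Paris ?") == "person"
assert question_type("Quand est né Victor Hugo ?") == "date"
```

At indexing time, the symmetric operation tags each block with the question types it could answer, so question and block can later be joined on type.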

Analysis of the Question

When the question is keyed in by the user, its language is detected automatically and the matching linguistic module performs the semantic and syntactic analysis of the question. When some words of the question have several meanings, the most probable meaning is chosen, but the user may force the meaning of each word. The same linguistic module also determines the domain, the concepts and, above all, the type of the question.

The Text Search

From the data obtained through the analysis of the question (heads of derivation, named entities, domains, concepts, question type), the search engine extracts from the index the blocks of text best matching this set of data. The different available data are balanced against each other, so that a disambiguation error concerning the meaning or the type of the question does not prevent retrieval of the blocks of text that may contain an answer.
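This balancing can be pictured as a weighted combination of evidence sources, so that no single wrong signal zeroes out a block's score. The feature names and weights below are illustrative assumptions, not values from the TRUST project.

```python
# Sketch of balanced block scoring: several evidence sources are combined
# with weights so a single disambiguation error (e.g. a wrong question
# type) cannot eliminate a relevant block. Weights are illustrative.
WEIGHTS = {
    "keyword_overlap": 0.4,
    "named_entities": 0.25,
    "concept_match": 0.2,
    "question_type_match": 0.15,
}

def score_block(features: dict) -> float:
    """Weighted sum of per-feature scores, each in [0, 1]."""
    return sum(WEIGHTS[name] * features.get(name, 0.0) for name in WEIGHTS)

# A block with strong keyword and entity evidence still scores 0.7 even
# though the question-type match failed entirely (0.0):
s = score_block({"keyword_overlap": 1.0, "named_entities": 0.8,
                 "concept_match": 0.5, "question_type_match": 0.0})
```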

Extraction of Answers

For a given question, after possible spell-checking and syntactic, semantic and conceptual analysis, then detection of the question type, the heads of derivation, named entities, concepts, domains and question-answer types are compared to the corresponding indexes. The best-ranked blocks are analysed and answers extracted. The extraction of the answer is performed by searching for named entities or syntactic groups in « answer position ».

Response Time

For a question keyed in on a closed corpus (hard disk, corpus, Intranet), the answer is provided in French in less than 3 seconds; with other languages it can take up to 10 seconds. For a question keyed in on the Internet, the response time may be anywhere between 2 and 14 seconds, depending on the language used, the number of pages analysed (user-definable) and the type of the question (a few answers are retrieved very quickly from the available summary or short description alone).

2. QRISTAL Presentation

QRISTAL (acronym of Questions-Réponses Intégrant un Système de Traitement Automatique des Langues) is the B2C version of TRUST. It is priced at 99 and sold in retail computer outlets and in large consumer-market distributors such as Virgin Stores or FNAC. The fruit of 6 years of development, QRISTAL performs beyond the limits set for TRUST, but undoubtedly arises from that project.

QRISTAL may be used for 2 major functions:
Provide exact answers to questions on « closed corpora » (hard disk, Intranet, etc.), these being previously indexed so as to extract the answers from the blocks of text corresponding to the analysis of the question.
Provide exact answers to questions addressed to the Internet (web). In this case, Qristal converts the questions into « understandable requests » for the standard engines, retrieves the returned pages and their short descriptions, analyses them and computes the answers.
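The conversion into an « understandable request » can be sketched as stripping a natural-language question down to its content words before sending it to a standard engine. The stopword list and heuristic below are illustrative assumptions, not Qristal's actual query-reformulation logic.

```python
# Sketch: turn a natural-language question into a keyword query for a
# standard web engine. Stopword list is a tiny illustrative sample.
STOPWORDS = {"qui", "que", "quoi", "est", "le", "la", "les", "de", "du",
             "what", "who", "is", "the", "of", "a"}

def to_web_query(question: str) -> str:
    """Keep content words only, preserving their original order."""
    tokens = question.lower().replace("?", "").split()
    return " ".join(t for t in tokens if t not in STOPWORDS)

assert to_web_query("Who is the president of France?") == "president france"
assert to_web_query("Qui est le président de la France ?") == "président france"
```

The returned pages (or just their snippets, for fast answers) are then re-analysed with the full linguistic pipeline to extract the exact answer.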

In Qristal, special attention has been given to « user self-definability ». By design, Qristal targets users unfamiliar with SQL or web requests who wish to obtain an answer directly while formulating their questions in ordinary natural language. The interface must therefore be very user-friendly and as simple as possible, so that users can tailor Qristal to their habits and wishes. For more experienced users, question files as well as work on several indexes permit more advanced usage.

Commercialisation

QRISTAL has been commercialised since December 2004 and registered a base of more than 700 users in that single month. Users are satisfied with the results obtained in French, while their judgment of the results in the other languages is (a bit unfairly) critical. Qristal appears to be very « reliable and stable » and user-friendly, as the very small number of calls to customer support attests. Users' expectations are very high, and satisfying them will require a great deal of effort on our part.

Press

Article in « La Dépêche du Midi » of 4 January 2005

Perspectives

QRISTAL will be updated in the coming years with the following improvements:
improve the rate of exact answers, eliminate noise
use the notoriety of pages to rank them
carry out more precise inferences to extract the answers
allow « user profiles »
include other languages (German, Spanish…)
better differentiate the answer mode (single, all…)
better situate the answers in their context

3. QRISTAL Evaluation

QRISTAL was evaluated within a contest called EQUER, a campaign for the evaluation of QA systems within the EVALDA project. The EVALDA and Technolangue projects were initiated by the French Ministries for Industry, Research and Culture. The EQUER campaign was organised by ELDA (Evaluations and Language resources Distribution Agency) and was deployed between January 2003 and December.

The EQUER campaign, very similar in principle to TREC-QA (USA) or NTCIR (Japan), included 2 different tests:
500 all-domain, general-purpose questions, mainly factual, on a journalistic and administrative corpus of 1.5 Go.
200 questions, very often non-factual, on a medical corpus of about 50 Mo made of scientific articles and web pages.

The 500 general-purpose questions were divided into:
407 simple factual questions (e.g. Comment s'appelle le fils de Juliette Binoche ? – What is the name of Juliette Binoche's son?)
31 questions having a list as answer (e.g. Quels sont les trois pays qui bordent la Bosnie-Herzégovine ? – Which three countries border Bosnia-Herzegovina?)
32 questions having a definition as answer (e.g. Qu'est-ce que la NSA ? – What is the NSA?)
30 binary questions, with Yes/No as answer (e.g. La carte d'identité existe-t-elle au Royaume-Uni ? – Does the identity card exist in the United Kingdom?)

The EQUER contestants were:
4 commercial companies (the first 2 being very large firms):
Commissariat à l'Énergie Atomique, Saclay, France
France Telecom, Lannion, France
Sinequa, Paris, France
Synapse Développement, Toulouse, France
3 university laboratories:
LIA & SMART, Avignon, France
LIMSI, Orsay, France
Université de Neuchâtel, Switzerland

Procedure and Scoring Metrics

The metric used to score the results was MRR (Mean Reciprocal Rank), i.e. 1 for an exact answer in first position, 1/2 for an exact answer in second position, 1/3 for an exact answer in third position, etc. Only 5 answers were taken into account, except for binary questions, where a single exact, justified answer was accepted. For the questions having a list as answer, the metric used was NIAP (Non-Interpolated Average Precision).
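The MRR scoring rule described above can be written out directly; this small sketch follows the slide's definition, including the cutoff at 5 answers.

```python
# Mean Reciprocal Rank as used in EQUER: 1 for an exact answer at rank 1,
# 1/2 at rank 2, 1/3 at rank 3, etc., counting only the first 5 answers.
# A rank of 0 means no exact answer was returned among the first five.
def reciprocal_rank(rank_of_first_correct: int, cutoff: int = 5) -> float:
    if 1 <= rank_of_first_correct <= cutoff:
        return 1.0 / rank_of_first_correct
    return 0.0

def mean_reciprocal_rank(ranks) -> float:
    """Average the reciprocal ranks over all questions."""
    return sum(reciprocal_rank(r) for r in ranks) / len(ranks)

# Three questions answered at ranks 1, 2 and not at all:
mrr = mean_reciprocal_rank([1, 2, 0])  # (1 + 0.5 + 0) / 3 = 0.5
```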

The Synapse QA system evaluated during the EQUER campaign was a « pre-version » of QRISTAL, lacking some of the functionalities for extracting the exact answer. With EQUER, Synapse participated in its first ever campaign for the evaluation of QA systems, while many other contestants had experience of participating in TREC-QA or CLEF-QA for English- or French-language QA.

Technical Performance

The full set of 500 questions on the general corpus was processed in 23 minutes and 17 seconds, hence less than 3 seconds per question. The speed of the linguistic analysis of the blocks was about 400 Mo/hour for the indexation; the speed of analysis and extraction of the answer was about 230 Mo/hour. Over the 500 questions, the correct question type was determined in 98% of cases. These speed tests were carried out on a Pentium 3 GHz with 1 Go of RAM.
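The per-question timing claim above checks out arithmetically:

```python
# 500 questions processed in 23 min 17 s -> under 3 s per question.
total_seconds = 23 * 60 + 17        # 1397 s
per_question = total_seconds / 500  # 2.794 s
assert per_question < 3.0           # matches the "less than 3 s" claim
```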

(Graph: MRR for exact answers of the best systems at EQUER, TREC-QA and NTCIR.)

As shown in the previous graph, the best EQUER QA system (i.e. Synapse) performs as well as the best systems in TREC or NTCIR (MRR of 0.58 versus 0.68 and 0.61) for exact answers. This level is, in all cases, superior to that of the second-best system in TREC or NTCIR. These results confirm the theoretical options and the quality of the resources developed within TRUST and implemented in QRISTAL.

Other Evaluations

During the evaluation, a set of 100 text references originating from a standard engine was provided for each question. With this data, the Synapse engine scored 0.64 (versus 0.70) for « passages » and 0.48 (versus 0.58) for exact answers. An in-house test later showed that disabling the « question type » function made the MRR fall from 0.70 to 0.46 for « passages », underlining the importance of this functionality.

After extraction of the texts enclosing the answers from the 1.5 Go general corpus, we managed to reduce it to 180 Mo. It is noticeable that the results for the 500 questions are very close on each of the 2 corpora. This leads us to think that the size of the corpus may be negligible for the quality of the results, contrary to a commonly accepted idea in information retrieval. The corpus of questions included « reformulations ». A benchmark comparing the answers to the questions in their original form versus after « reformulation » showed that the results are very close to each other (93% of first-position answers are identical).

Future Evaluations

Synapse intends to participate in CLEF-QA, in both the monolingual and multilingual options, in 2005. Currently, no other evaluation campaign is planned in France to follow up EQUER, but an evaluation on transcripts of an oral corpus should take place in the coming months.

End. Thank you!