1
Website Term Browser
An interactive, multilingual text search system based on linguistic techniques
Anselmo Peñas Padilla
Advisors: Julio Gonzalo Arroyo, María Felisa Verdejo Maíllo
Departamento de Lenguajes y Sistemas Informáticos
UNIVERSIDAD NACIONAL DE EDUCACIÓN A DISTANCIA
Doctoral thesis
2
Structure
I. Problem definition and goals
II. Experiments in Lexical Ambiguity and Indexing
III. Website Term Browser
IV. Evaluation framework
3
Classic Information Retrieval (I. Problem definition and goals)
Retrieve documents relevant to the user's information need
Presupposes:
- Static information needs
- Value is found in the retrieved set of documents (not in the search process)
Ignores:
- The task (purpose) that gives rise to the information need
- Changes in the information needs
- Interactivity
- Imprecise information needs
Users develop strategies without system aid
Help users to express and refine their information needs
[Diagram labels: Information need, Query formulation, Search engine, Docs., Document ranking, Refinement]
4
Language barriers (I. Problem definition and goals)
Problems in query formulation
- Users don't know the appropriate domain terminology
- Users can't express their information need in a foreign language
Translinguality
Natural Language characteristics
- Lexical ambiguity
- Terminology variation
Help users to overcome language barriers
5
General approaches (I. Problem definition and goals)
[Diagram, labels in reading order]
Natural Language Processing
- Disambiguation
- Conceptual indexing
Terminology
- Controlled vocabularies indexing & browsing
String Processing
- Free text indexing
Information Retrieval
- Phrase indexing & browsing (Phind)
- Keyphrase navigation (Phrasier)
6
Natural Language Processing (I. Problem definition and goals)
Help users to express and refine their information needs?
- Open field in IR
Help users to overcome language barriers?
- Phrase extraction and normalization
- Explicit disambiguation (POS, WSD)
  Bad strategies or too much error in automatic processing?
- Conceptual indexing
7
Goals (I. Problem definition and goals)
Study the role of automatic linguistic techniques within the classic IR model
- Phrase indexing, POS tagging, WSD
- Semantic distinction of phrases
- Viability of conceptual indexing
Section II: Experiments in Lexical Ambiguity and Indexing
8
Goals (I. Problem definition and goals)
Develop a model
- to help users express and refine their information needs
- to help users overcome language barriers
Bring the collection terminology to users
- Morpho-syntactic, semantic & translingual variations
- Without the need for thesaurus construction
Establish an appropriate evaluation framework
Sections III & IV: Website Term Browser
9
Proposed approach
[Diagram, labels in reading order]
Natural Language Processing
- Disambiguation
- Conceptual indexing
Terminology
- Controlled vocabularies indexing & browsing
String Processing
- Free text indexing
Information Retrieval
- Phrase indexing & browsing (Phind)
- Keyphrase navigation (Phrasier)
Automatic Terminology Extraction
Terminology Retrieval & Term browsing (WTB)
10
Structure
I. Problem definition and goals
II. Experiments in Lexical Ambiguity and Indexing
III. Website Term Browser
IV. Evaluation framework
11
Contents (II. Experiments in Lexical Ambiguity and Indexing)
- Morpho-syntactic ambiguity in IR
- Phrase indexing
- Semantic distinction of lexical compounds in IR
- Conceptual indexing
- ITEM Search Engine
- Conclusions
Test collection: IR-SEMCOR, hand annotated
12
Morpho-syntactic ambiguity in IR (II. Experiments in Lexical Ambiguity and Indexing)
Texts
  ...particle crosses the wall...
  ...canadian red cross...
  ...boat to cross mississippi river...
Plain query: cross
  ...particl cross the wall...  (match)
  ...canadian red cross...  (match)
  ...boat to cross mississippi river...  (match)
POS-tagged query: cross_N
  ...particl_N cross_V the_D wall_N...
  ...canadian_ADJ red_ADJ cross_N...  (match)
  ...boat_N to_TO cross_V mississippi_N river_N...
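The effect of POS tagging on matching can be sketched in a few lines of Python. This is only a toy illustration, not the indexing code used in the experiments: the `stem_TAG` token format is copied from the slide, while the function names and the in-memory list of texts are assumptions made for the example.

```python
# Toy illustration: tagged tokens are written as "stem_TAG", as on the slide.
TAGGED_TEXTS = [
    "particl_N cross_V the_D wall_N",
    "canadian_ADJ red_ADJ cross_N",
    "boat_N to_TO cross_V mississippi_N river_N",
]


def plain_matches(query_stem, texts):
    """Return the texts that contain the stem, ignoring POS tags."""
    return [
        text for text in texts
        if any(tok.rsplit("_", 1)[0] == query_stem for tok in text.split())
    ]


def tagged_matches(query_token, texts):
    """Return the texts that contain the exact stem_TAG token."""
    return [text for text in texts if query_token in text.split()]


plain = plain_matches("cross", TAGGED_TEXTS)     # all three texts
tagged = tagged_matches("cross_N", TAGGED_TEXTS)  # only the noun reading
```

The plain query matches every occurrence of the stem, whereas the tagged query keeps only the text where "cross" is a noun.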
14
Phrase indexing (II. Experiments in Lexical Ambiguity and Indexing)
Texts
  ...a guide for the fisher who...
  ...information on cat care...
  ...arboreal carnivorous called fisher cat...
Plain indexing, query: fisher
  ...a guide for the fisher who...  (match)
  ...arboreal carnivorous called fisher cat...  (match)
  ...information on cat care...
Phrase indexing, query: fisher
  ...a guide for the fisher who...  (match)
  ...arboreal carnivorous called fisher_cat...
  ...information on cat care...
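A minimal Python sketch of the same idea follows. It assumes a tiny hand-written phrase lexicon (`PHRASES`) and two-word phrases only; how phrases were actually detected in the experiments is not shown on the slide, so the helper names here are hypothetical.

```python
# Assumed phrase lexicon for this toy example; a real system would obtain
# phrases from a phrase-detection step.
PHRASES = {("fisher", "cat")}

TEXTS = [
    "a guide for the fisher who",
    "information on cat care",
    "arboreal carnivorous called fisher cat",
]


def index_terms(text, phrases=frozenset()):
    """Split text into index terms, joining known two-word phrases with '_'."""
    words = text.split()
    terms, i = [], 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if pair in phrases:
            terms.append("_".join(pair))
            i += 2
        else:
            terms.append(words[i])
            i += 1
    return terms


def matches(query, texts, phrases=frozenset()):
    """Return the texts whose index terms contain the query term."""
    return [text for text in texts if query in index_terms(text, phrases)]


plain_hits = matches("fisher", TEXTS)            # guide text and fisher cat text
phrase_hits = matches("fisher", TEXTS, PHRASES)  # guide text only
```

With phrase indexing, "fisher cat" becomes the single term `fisher_cat`, so the query "fisher" no longer retrieves the text about the animal.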
16
Semantic distinction of compounds (II. Experiments in Lexical Ambiguity and Indexing)
Automatic classification through WordNet
Types of lexical compounds
- Endocentric: one component is a hyperonym (purchasing department is_a department)
- Appositional: all components are hyperonyms (aspirin powder is_a powder, is_a aspirin)
- Exocentric: no components are hyperonyms (fisher cat)
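The three-way classification can be sketched as follows. The sketch replaces WordNet with a small hand-written table (`TOY_HYPERNYMS`, an assumption of this example) that lists, for each compound, words that are its hyperonyms (hypernyms); the entries follow the examples on the slide, and the function name is hypothetical.

```python
# Hand-written stand-in for WordNet hypernym lookups (an assumption of this
# sketch): compound -> set of words that are hypernyms of the compound.
TOY_HYPERNYMS = {
    "purchasing department": {"department"},
    "aspirin powder": {"aspirin", "powder"},
    "fisher cat": {"marten", "carnivore"},
}


def classify_compound(compound, hypernyms=TOY_HYPERNYMS):
    """Classify a compound by how many of its components are its hypernyms."""
    components = compound.split()
    known = hypernyms.get(compound, set())
    hits = sum(1 for component in components if component in known)
    if hits == 0:
        return "exocentric"
    if hits == len(components):
        return "appositional"
    return "endocentric"


labels = {c: classify_compound(c) for c in TOY_HYPERNYMS}
```

A compound that is missing from the table falls into the exocentric class in this sketch, which is a simplification; a real lookup would need to distinguish "no hypernym among the components" from "compound not in the lexicon".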
18
Conceptual Indexing (II. Experiments in Lexical Ambiguity and Indexing)
This model can improve text retrieval (Gonzalo 1998; Gonzalo 1999), depending on the WSD error rate
Query: spring
Texts (grouped by sense)
  ...spring... ...muelle...
  ...spring... ...fountain... ...fuente...
  ...spring... ...springtime... ...primavera...
WSD maps the words to entries of the conceptual index: n03114639, n05727069, n09151839
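A minimal sketch of synset-based indexing, assuming the texts are already sense-tagged (that is, WSD is done and error free). The three synset identifiers are the ones on the slide, but the slide does not say which identifier belongs to which sense, so the assignment below, the document ids, and the function names are all assumptions made for illustration.

```python
from collections import defaultdict

# Synset ids as shown on the slide; which id belongs to which sense is an
# assumption made only for this example.
MECHANICAL, WATER, SEASON = "n03114639", "n05727069", "n09151839"

# Hypothetical sense-tagged documents: each token is (word, synset_id).
DOCS = {
    "d1": [("spring", MECHANICAL)],
    "d2": [("muelle", MECHANICAL)],
    "d3": [("fountain", WATER)],
    "d4": [("fuente", WATER)],
    "d5": [("springtime", SEASON)],
    "d6": [("primavera", SEASON)],
    "d7": [("spring", SEASON)],
}


def build_conceptual_index(docs):
    """Map each synset id to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for _word, synset in tokens:
            index[synset].add(doc_id)
    return index


def search(index, query_synset):
    """Return the ids of documents indexed under the query synset."""
    return sorted(index.get(query_synset, set()))


INDEX = build_conceptual_index(DOCS)
season_hits = search(INDEX, SEASON)
```

Because English and Spanish words share a synset id, a query for "spring" in the season sense retrieves the "springtime" and "primavera" documents and skips the ones about "muelle" or "fuente".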
19
[Figure: synset indexing with no errors in WSD]
20
Conceptual Indexing (II. Experiments in Lexical Ambiguity and Indexing)
Although explicit disambiguation strategies applied to indexing
- POS tagging
- Phrase indexing
- Word Sense Disambiguation
don't produce a significant improvement in IR,
conceptual indexing based on synsets
- Needs automatic WSD accuracy near the state of the art (60%)
- Permits Cross-Language Information Retrieval
Qualitative evaluation justifies developing a prototype
21
[Interface screenshot; callouts:]
- Textual representation: the query is translated into the target language
- Conceptual representation: query and documents are compared at a conceptual level
- Selection of query language
- Selection of WSD strategy
- Selection of newspaper determines the target language
- Retrieved documents
22
ITEM Search Engine (II. Experiments in Lexical Ambiguity and Indexing)
Conceptual indexing seems attractive, but there are some unsolved challenges:
- Low accuracy in Word Sense Disambiguation due to
  - Unrestricted domains in EWN
  - Fine-grained sense distinctions
- Indexing units ≠ translation units
  - Loss of information in word-by-word disambiguation
- High cost, low benefit
  - Users perceive a slower and less transparent system
23
Conclusions (II. Experiments in Lexical Ambiguity and Indexing)
Don't subordinate NLP to the classic IR model
- Even an improvement of 10% wouldn't change users' perception
Think of users
- Find new paradigms in Information Access
  - At a higher level, closer to users
  - Consider users' tasks
  - Consider users' interaction
New places for NLP techniques in IR
- Interaction over partial NLP processing
A proposal: Terminology Retrieval & Term Browsing
24
Structure
I. Problem definition and goals
II. Experiments in Lexical Ambiguity and Indexing
III. Website Term Browser
IV. Evaluation framework
25
Contents (III. Website Term Browser)
- Terminology Retrieval
  - Term extraction
  - Indexing
  - Retrieval model
  - Query expansion and translation
- Website Term Browser interface
26
Terminology Retrieval (III. Transition to an interactive model)
Term Browsing
- Navigate through relevant terminology
- Access information from retrieved terms
Terminology Retrieval
- Retrieve relevant terms related to the query
- Phrase extraction, phrase indexing, phrase retrieval
Recall is more important than precision in term extraction
- Relaxing linguistic processing is possible
- Premise: don't lose phrases
[Diagram labels: Lemma, Document, Phrase]
27
Term extraction (III. Transition to an interactive model)
Syntactic pattern (Spanish, English, French, Italian, Catalan)
  [ phr_content ] [ phr_closed | phr_content ]* [ phr_content ]
  phr_content: noun, adjective, number, infinitive, participle
  phr_closed: article, preposition, conjunction
Needs POS tagging
- High computational cost
- Tagging oriented to phrase detection
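The syntactic pattern translates directly into a regular expression over tag classes. The sketch below is an illustration under stated assumptions, not the WTB extractor: it assumes each word arrives with one of the tag names listed on the slide, maps tags to the classes `C` (phr_content), `F` (phr_closed) and `X` (anything else), and returns only the maximal, non-overlapping matches.

```python
import re

CONTENT = {"noun", "adjective", "number", "infinitive", "participle"}
CLOSED = {"article", "preposition", "conjunction"}

# [phr_content] [phr_closed | phr_content]* [phr_content] over class letters.
PATTERN = re.compile(r"C[CF]*C")


def tag_class(tag):
    """Map a POS tag to C (content), F (closed class) or X (other)."""
    if tag in CONTENT:
        return "C"
    if tag in CLOSED:
        return "F"
    return "X"


def extract_phrases(tagged):
    """tagged: list of (word, tag). Return maximal, non-overlapping phrases."""
    classes = "".join(tag_class(tag) for _word, tag in tagged)
    return [
        " ".join(word for word, _tag in tagged[m.start():m.end()])
        for m in PATTERN.finditer(classes)
    ]


SENTENCE = [
    ("the", "article"), ("degradation", "noun"), ("of", "preposition"),
    ("the", "article"), ("ozone", "noun"), ("layer", "noun"),
    ("is", "verb"), ("severe", "adjective"),
]
phrases = extract_phrases(SENTENCE)
```

The pattern requires a content word at both ends, so the leading article is dropped and the isolated adjective "severe" yields no phrase. A recall-oriented extractor would probably also keep sub-phrases such as "ozone layer"; this sketch does not.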
28
Indexing (III. Transition to an interactive model)
Steps
1. Text pre-processing and listing of words
2. Word tagging (oriented to phrase detection)
3. Phrase detection & lemmatization of components
4. Document indexing & statistics (document frequency)
5. Phrase selection (subsumption & lexicalization degree)
6. Phrase indexing
[Diagram labels: Lemma, Document, Phrase]
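Steps 3, 4 and 6 suggest two inverted indexes: one from lemmas to the phrases that contain them, and one from phrases to documents. The following sketch assumes phrases are already detected and lemmatized (tuples of lemmas per document); the data layout and function names are assumptions, and step 5 is left out because the slide does not define the selection criteria.

```python
from collections import defaultdict


def build_indexes(doc_phrases):
    """doc_phrases: doc_id -> iterable of phrases, each a tuple of lemmas."""
    lemma_to_phrases = defaultdict(set)
    phrase_to_docs = defaultdict(set)
    for doc_id, phrases in doc_phrases.items():
        for phrase in phrases:
            phrase_to_docs[phrase].add(doc_id)
            for lemma in phrase:
                lemma_to_phrases[lemma].add(phrase)
    return lemma_to_phrases, phrase_to_docs


def document_frequency(phrase, phrase_to_docs):
    """Number of documents in which the phrase occurs."""
    return len(phrase_to_docs.get(phrase, ()))


# Hypothetical input: lemmatized phrases already detected in two documents.
DOC_PHRASES = {
    "doc1": [("ozone", "layer"), ("ozone", "hole")],
    "doc2": [("ozone", "layer")],
}
lemma_index, phrase_index = build_indexes(DOC_PHRASES)
```

Looking up a lemma returns every phrase that contains it, and each phrase leads back to its documents, which matches the Lemma, Phrase, Document relation named in the diagram.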
29
Retrieval model (III. Transition to an interactive model)
[Diagram, reconstructed from its labels]
query
Tokenising: tok 1, tok 2, tok 3
Lemmatising (Lexicon): lem 11, lem 12, ... / lem 21, lem 22, ... / lem 31, lem 32, ...
Expansion / Translation (EWN & Dic.): exp 11, exp 12, ..., tran 11, tran 12, ... / exp 21, exp 22, ..., tran 21, tran 22, ... / exp 31, exp 32, ..., tran 31, tran 32, ...
Phrase retrieval (Phrase index), then Term ranking: terms
Document retrieval (Document index, using lem 11, lem 12, ..., lem 31, lem 32, ...), then Document ranking: documents
30
Query expansion and translation (III. Transition to an interactive model)
Query: Tratados de Prohibición de Pruebas Nucleares
Tratados
  Expansion: acuerdo, capitulación, concertación, convenio, cuidar, pacto, manejar, procesar
  Translation: accord, discourse, handle, manage, pact, process, treat, treatise, treaty
Prohibición
  Expansion: embargo, entredicho, interdicción, interdicto, proscripción
  Translation: ban, interdiction, prohibition, proscription
Pruebas
  Expansion: cata, catadura, degustación, ensayo, escandallo, experimento, gustación, muestreo, tanteo
  Translation: demonstrate, establish, exhibit, experiment, experimentation, fall, fitting, indicate, point, present, proof, prove, run, sample, sampling, shew, show, taste, test, trial, try
Nucleares
  nuclear
Ambiguity reduction
  Nuclear taste proscription process?
  Nuclear test ban treaty?
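One way to read the ambiguity-reduction step is that the phrase index acts as a filter: only combinations of candidates that actually occur as phrases in the collection can be retrieved. The sketch below illustrates that reading. The candidate lists are subsets of those on the slide; the contents of `PHRASE_INDEX` and the scoring by number of covered query words are assumptions, not the ranking function of WTB.

```python
# Candidate English translations per Spanish query word (subsets of the
# lists on the slide).
CANDIDATES = {
    "tratados": {"pact", "process", "treatise", "treaty"},
    "prohibición": {"ban", "interdiction", "prohibition", "proscription"},
    "pruebas": {"proof", "sample", "taste", "test", "trial"},
    "nucleares": {"nuclear"},
}

# Hypothetical phrase index: phrases assumed to occur in the collection.
PHRASE_INDEX = [
    ("nuclear", "test", "ban", "treaty"),
    ("nuclear", "test"),
    ("wine", "taste"),
]


def covered_words(phrase, candidates):
    """Query words with at least one candidate among the phrase's lemmas."""
    lemmas = set(phrase)
    return {word for word, options in candidates.items() if options & lemmas}


def rank_phrases(phrase_index, candidates):
    """Rank indexed phrases by how many query words they cover (assumed score)."""
    scored = [(len(covered_words(p, candidates)), p) for p in phrase_index]
    scored = [(score, p) for score, p in scored if score > 0]
    return [p for _score, p in sorted(scored, key=lambda sp: -sp[0])]


ranked = rank_phrases(PHRASE_INDEX, CANDIDATES)
```

An implausible combination such as "nuclear taste proscription process" is never produced, because retrieval only returns phrases stored in the index, while "nuclear test ban treaty" covers all four query words and ranks first.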
31
[Interface screenshot; callouts:]
- Query in Spanish
- Hierarchy of terms (Catalan, English, Spanish)
- Ranking of documents
33
Structure
I. Problem definition and goals
II. Experiments in Lexical Ambiguity and Indexing
III. Website Term Browser
IV. Evaluation framework
34
Evaluation of Terminology Retrieval (IV. Evaluation framework)
Compare Terminology Retrieval against a hand-crafted multilingual thesaurus
36
Evaluation of Terminology Retrieval (IV. Evaluation framework)
Recall of mono-lexical terms (lemmas)
- Monolingual: 85% - 95%
- Translingual: 55% - 65%
Recall of poly-lexical terms (phrases)
- Monolingual: 40% - 65%
- Translingual: 10% - 45%
Loss of recall due to
- Phrase extraction (mainly POS tagging): 3% - 17%
- Phrase indexing (mainly lemmatization): 2% - 34%
- Phrase selection: 12% - 37%
- Lack of connections between different languages in EWN
- Gaps in the EWN adjective hierarchies
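Recall here can be read as the share of thesaurus terms that Terminology Retrieval also returns. A minimal sketch of that computation, assuming both sides are already normalized to comparable strings; the function name and the example terms are hypothetical and are not data from the evaluation.

```python
def recall(retrieved_terms, thesaurus_terms):
    """Fraction of thesaurus terms that also appear among the retrieved terms."""
    thesaurus = set(thesaurus_terms)
    if not thesaurus:
        raise ValueError("thesaurus_terms must not be empty")
    return len(thesaurus & set(retrieved_terms)) / len(thesaurus)


# Hypothetical example data, for illustration only.
THESAURUS = {"ozone layer", "greenhouse effect", "acid rain", "climate change"}
RETRIEVED = {"ozone layer", "acid rain", "ozone hole"}
example_recall = recall(RETRIEVED, THESAURUS)  # 2 of 4 thesaurus terms found
```

Measuring recall separately after each processing stage (extraction, indexing, selection) is what allows the loss figures above to be attributed to individual stages.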
37
Usefulness of Term Browsing (IV. Evaluation framework)
Previous experiences in interactivity evaluation (TREC) need:
- Precise queries
- Laboratory conditions
- Controlled users
No differences are found between systems
Identifying the better approaches is not possible
A new framework is proposed here
- Real work environment
- Record users' interaction
- Compare the use of
  - the term area provided by WTB
  - the document ranking provided by Google
38
[Interface screenshot; labeled actions: QUERY, RECONSULT WITH TERM, EXPLORE TERM, EXPLORE DOCUMENT]
39
Usefulness of Term Browsing (IV. Evaluation framework)
LOG FILE
539
2001/03/14 12:10:33 QUERY UNED 193.146.241.164 ozone hole
2001/03/14 12:11:20 EXPLORE_TERM 539684: degradación de la capa de ozono
2001/03/14 12:11:29 EXPLORE_DOC http://www.uned.es/doctorado/0108.htm...
EXPLORE_TERM
RECONSULT
EXPLORE_DOC
...
2318 sessions with interaction
An average of 5.16 actions per session
EXPLORE_TERM is used in 65%
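Statistics like these can be derived from such a log with a short script. The sketch below assumes one action per line in the layout "date time ACTION rest", as the excerpt suggests; the full log format is not shown on the slide, so the regular expression, the function names, and the second sample session are assumptions. Whether the reported 5.16 average counts the initial QUERY as an action is also not stated.

```python
import re

# Assumed line layout: "YYYY/MM/DD HH:MM:SS ACTION rest-of-line".
LOG_LINE = re.compile(r"^\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} ([A-Z_]+)\b")

SAMPLE_SESSION = [
    "2001/03/14 12:10:33 QUERY UNED 193.146.241.164 ozone hole",
    "2001/03/14 12:11:20 EXPLORE_TERM 539684: degradación de la capa de ozono",
    "2001/03/14 12:11:29 EXPLORE_DOC http://www.uned.es/doctorado/0108.htm...",
]


def session_actions(lines):
    """Extract the action name from every well-formed log line."""
    actions = []
    for line in lines:
        match = LOG_LINE.match(line)
        if match:
            actions.append(match.group(1))
    return actions


def summarize(sessions):
    """Return (mean actions per session, share of sessions using EXPLORE_TERM)."""
    mean_actions = sum(len(s) for s in sessions) / len(sessions)
    term_share = sum(1 for s in sessions if "EXPLORE_TERM" in s) / len(sessions)
    return mean_actions, term_share


actions = session_actions(SAMPLE_SESSION)
# Second session is hypothetical, added only to make the averages non-trivial.
stats = summarize([actions, ["QUERY", "EXPLORE_DOC"]])
```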
40
Usefulness of Term Browsing (IV. Evaluation framework)

                                  All queries   1 word queries   >1 word queries
First action after QUERY
  EXPLORE_DOC                     42%           47%              39%
  EXPLORE_TERM                    51%           45%              55%
  RECONSULT                       7%            8%               6%
Last action before finishing the session with EXPLORE_DOC
  QUERY                           50%           57%              46%
  EXPLORE_TERM                    44%           38%              47%
  RECONSULT                       6%            5%               7%
41
Structure
I. Problem definition and goals
II. Experiments in Lexical Ambiguity and Indexing
III. Website Term Browser
IV. Evaluation framework
42
Conclusions
Lexical ambiguity has been studied using IR-Semcor
- Evaluation free of automatic processing errors
- Explicit disambiguation at indexing doesn't seem to improve retrieval (POS, WSD, semantic distinction of lexical compounds)
- Conceptual indexing based on EuroWordNet synsets still has some challenges to solve
Think of users to find new places for NLP
43
Conclusions
A search model based on extraction, retrieval and browsing of terminology has been developed
- User oriented
- Interaction over terminological information
  - An intermediate way between free searching and thesaurus-guided searching
  - Without the need for thesaurus construction
- Bringing the collection terminology to users
  - Morpho-syntactic & semantic variations
  - Translinguality
44
Conclusions
An evaluation framework for Terminology Retrieval and Term Browsing has been established
- Points the way to improve Terminology Retrieval
- Users appreciate Term Browsing
- WTB phrasal information can substantially complement the document ranking provided by search engines