Irion Technologies (c)

Presentation on theme: "Irion Technologies (c)"— Presentation transcript:

1 Crossing the chasm between statistical processing and language understanding
Piek Vossen, ©Irion Technologies
Guest lecture, Internet Information, Universiteit van Amsterdam

2 Overview
- Irion Technologies
- Why use language technology?
- TwentyOne Search
- The MEANING project

3 Irion in a Nutshell

4 Company, people and mission
The company
- Irion was founded in 2000 as a spin-off of TNO Multimedia Technology
- 5 investors: Parcom Ventures, FLV, Twinning, TNO, Van Dale
The people
- The Irion team comes from a variety of cultures and languages.
- The team consists of language technologists, software engineers, IT specialists, and sales people.
The mission
- To enable equal access to the web for everybody, regardless of the language they speak
- To fulfill the promises of information and knowledge management by making systems that can (really) understand language
- To become the world's best and most versatile language technology provider

5 Business Model
[Diagram: technology partners (TNO, CMU, Van Dale, DFKI, Xerox) supply prototypes, lingware, and algorithms; Irion builds software tools; VARs build end-user applications for the End User; pull strategy.]
- Lease model
- No projects
- No customisation

6 Main products
- TwentyOne Search: a cross-lingual search engine that can search conceptually
- TwentyOne Dialogue: a dialogue system that lets a user communicate with a computer as with a human
- TwentyOne Classify: an automatic classification system for classification, filtering, routing, and knowledge management
- TwentyOne Concepts: a semantic network application that makes it possible to improve any search engine with language technology
- TwentyOne Capture: a "cockpit" for easily storing data in various formats as XML
- TwentyOne Language Tools: a collection of language technology tools, such as automatic summary generation, proper name recognition, and language identification

7 General TwentyOne Architecture
[Diagram: the four stages of the TwentyOne architecture, with an XML Publishing Platform]
1. Capture: paper documents, Web documents, word processor documents, databases, AV documents
2. NLP (Natural Language Processing): summarization, translation, linguistic analysis, etc.
3. Classification: categorization, classification, filtering, routing
4. Indexing & Retrieval: query, interactive natural language dialogue

8 Why use language technology?

9 Why do Internet search engines not use it?
Is it worth the effort?
- Traditional search engines search for literal strings;
- Low recall is not perceived as a problem:
  - There is too much information on the Internet to handle anyway;
  - There is redundancy of information, i.e. it is expressed in every conceivable way and every conceivable language;
  - Whatever you type in, you always get many results;
- Google approach:
  - All content words should occur (Boolean AND);
  - Pigeon ranking: pages to which many people link are on top, showing what others know;

10 Why should language technology be used (1)?
There is still not enough recall:
- Internet search only fulfills one type of information need;
- Quality-based and non-redundant data collections need more recall;
- Nobody knows what we miss on the Internet;
There is not enough precision:
- Information continues to grow, so superficial techniques will no longer work;
- Information pollution;
- There is more need for qualitative information search because people realize it represents capital/power;

11 Why should language technology be used (2)?
- Cross-lingual retrieval is not possible unless you map words across languages;
- Very specific questions still give no results if the query is formulated differently from the answer, e.g. Google hit counts as Dutch terms are added to the query:
  huis [house] (5,280,000) + rood dak [red roof] (37,100) + garage (3,830) + vrijstaand [detached] (85) + tuin [garden] (63) + zwembad [swimming pool] (32) + kelder [cellar] (13) + zolder [attic] (3) / vliering [loft] (0)
- Small-scale specialized indexes (intranets, archives) have no redundancy, so there will be no results for queries formulated differently;
- What is the use of 100,000 results if the quality of the first results cannot be differentiated?

12 Where are we heading?
There is a growing need for more precision and more complex applications to find more fine-grained facts regardless of ‘form’:
- Information retrieval (IR): documents
- Classification: topics
- Information extraction (IE): facts
- Multimodal human-machine interfaces (speech, mobile, chat);
- Question-answering systems (QA): simple human-machine interface
- Dialogue systems: iterative human-machine interface
- Intelligent machines (reasoning, decisions): intelligent human-machine interface
- Summarization -> multi-document summaries -> language generation -> machine translation

13 How to use language technology?

14 How to get the correct recall?
[Diagram: handling the ambiguous word "cell"]
- No NLP: cell [prison], cell [phone], cell [tissue]
- Morphology + Wordnet, full expansion (index & query): police cells, jail, prison, mobile phone, neuron, cell growth
- Disambiguation by domain (Communication, Legal, Biology), at index and query side: e.g. Biology: cell-division, neuron, cellular
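The contrast between full expansion and expansion after domain disambiguation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `SENSES` table, the domain labels attached to each sense, and the `expand` function are hypothetical and built from the example words on the slide, not taken from the TwentyOne software.

```python
# Toy sense inventory for "cell", using example words from the slide.
# The grouping of related terms under domains is an assumption for illustration.
SENSES = {
    "cell": [
        {"domain": "legal", "related": ["jail", "prison"]},
        {"domain": "communication", "related": ["mobile phone"]},
        {"domain": "biology", "related": ["neuron", "cell-division"]},
    ],
}


def expand(word, domain=None):
    """Return the word plus related terms; keep only one domain if it is given."""
    terms = [word]
    for sense in SENSES.get(word, []):
        if domain is None or sense["domain"] == domain:
            terms.extend(sense["related"])
    return terms


full = expand("cell")                             # full expansion: all senses
biology_only = expand("cell", domain="biology")   # expansion after disambiguation
print(full)
print(biology_only)
```

Full expansion maximizes recall but also retrieves prison and phone documents for a biology query; restricting expansion to the disambiguated domain trades some recall for less noise.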

15 How to get more precision?

16 Why should language technology be used (3)?
Words out of context:
- The traditional search paradigm focuses on document/page retrieval and not on phrase retrieval;
- Dominant meanings will overrule other meanings: “Internet services on Java” gives no results for the island Java, only for the software;
- Compositional differences are neglected: “toxic medication” versus “medication against toxication”, “animal party” versus “party animal”

17 Words in context
Word combinations should be taken into account:
- Representing concepts in documents expressed by phrases;
- Representing concepts in queries expressed by phrases;
- Robust matching of queries and document phrases based on concept combination

18 TwentyOne search system
Conceptual phrasal search

19 Approach
- A multilingual wordnet database and morpho-syntactic processing are used to decompose text (para)phrases into ranges of concept elements: -> maximum recall;
- Word-sense disambiguation at index and query side: -> reduce noise;
- Synonym selection: -> reduce more noise;
- Match query phrases with document phrases: -> match concept combinations in context;
- Intelligent dialogues to create context at the user side: -> match intended meanings

20 Conceptual phrase matching
[Diagram: a query and a document phrase are each decomposed from word forms (Word form1 .. Word formN) into concepts (Concept1..N, ConceptN, ConceptM) with a context domain, e.g. Domain = politics for the query and ?Context Domain = economy for the document phrase]
Example: human right activist-leader / mensenrechtenactivistenleider (human rights activist leader), Domain = politics
- all concepts, same wording: -> 100%
- 1 out of 3 concepts, same wording: -> 33%
Phrase-score:
- number of matching concepts
- matching conceptual relation: party animal; animal party
- matching domains
Word forms are matched robustly:
- fuzzy word match: potatos, potatoes, Afganistan & afghanistan; café, cafe, Café, CaFé, CAFÉ
- flexion and derivation: depart, departure, departures, departing, departings
- multiwords and compounds: café-noir, mensenrechtenactivistenleider, human rights
- original word, synonym or translation: café, pub, bar, coffee shop, tea room; United States of America, US, USA, VS, Amerika; Pays-Bas, Holland, the Netherlands
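The "number of matching concepts" part of the phrase score can be illustrated with a minimal Python sketch. The `CONCEPT_OF` lexicon, the concept identifiers, and both function names are hypothetical; the slide also lists conceptual relations, domains, and wording as scoring factors, and none of those are modelled here.

```python
# Hypothetical mini-lexicon mapping word forms (including synonyms) to concept ids.
CONCEPT_OF = {
    "café": "c_cafe", "cafe": "c_cafe", "pub": "c_cafe", "bar": "c_cafe",
    "human": "c_human", "rights": "c_right", "right": "c_right",
    "activist": "c_activist", "leader": "c_leader",
}


def concepts(phrase):
    """Map each known word form in a phrase to its concept identifier."""
    return {CONCEPT_OF[w] for w in phrase.lower().split() if w in CONCEPT_OF}


def phrase_score(query, document_phrase):
    """Fraction of query concepts that also occur in the document phrase."""
    query_concepts = concepts(query)
    if not query_concepts:
        return 0.0
    shared = query_concepts & concepts(document_phrase)
    return len(shared) / len(query_concepts)


print(phrase_score("human rights leader", "human rights leader"))  # all concepts match
print(phrase_score("human rights leader", "party leader"))         # 1 out of 3 concepts
print(phrase_score("pub", "Café"))                                 # synonym match
```

Because this sketch treats a phrase as an unordered set of concepts, it would give "party animal" and "animal party" the same score; distinguishing them requires the conceptual-relation factor named on the slide.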

21 MEANING
Developing Multilingual Web-scale Language Technologies

22 MEANING Objectives
- Funded by the European Union as project IST
- 3-year project: April April 2005
- Large-scale (lexical) knowledge bases
- Automatic enrichment of EuroWordNet (EWN)
- Mixed approach: knowledge bases (KB) + machine learning (ML)
- Applied to question answering (Q/A) and cross-lingual information retrieval (CLIR)
- Problem: structural and lexical ambiguity

23 MEANING Approach
- Automatic collection of sense examples (Leacock et al. 98, Mihalcea and Moldovan 99)
- Large-scale WSD (boosting, SVM, transductive methods)
- Large-scale knowledge acquisition (McCarthy 01, Agirre & Martinez 02)

24 MEANING Architecture
[Diagram: for each language (English, Italian, Spanish, Catalan, Basque), a Web corpus feeds WSD and acquisition (ACQ) on the local EuroWordNet (EWN); results are UPLOADed to the Multilingual Central Repository and PORTed back to the local wordnets.]

25 Validation in MEANING: Phase 3
- Integration of MEANING (MCR)
- EFE Fototeca: database with pictures and captions in Spanish and English
- Evaluation:
  - Information retrieval benchmark (EFE)
  - Task-based evaluation by end-users (EFE)
- MEANING Deliverables 8.2, 8.3, 8.4

26 MEANING-full effects in Information retrieval
WP8 Validation ©Irion Technologies

27 Cross-lingual retrieval
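[Diagram: the query and the XML pages/phrases pass through the same NLP steps: language identification (Lid), tokenization, tagging, and parsing (Syn), named entity recognition (Nam), and concept recognition (Con); concepts are expanded through the Multilingual Semantic Network into an index covering ES, EN, CA, BA, and IT.]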

28 Multilingual Wordnet database (SemNet linked to Wordnet)
[Diagram: vocabularies of languages linked to concepts, domains, and relations]
Domains: Music, Culture, Finance, Clothing, Sport (Ball sports, Winter)
Word senses and relations:
- bank: 1 rec:12345 - financial institute; 2 rec:54321 - river side
- violin: 1 rec:9876 - string instrument; type-of rec:25876 - string instrument
- violist: 2 rec:65438 - musician playing a violin; type-of rec:42654 - musician
- string: 1 rec:35576 - string of an instrument; part-of rec:25876 - string instrument; 2 rec:29551 - underwear

29 Domain based concept selection
[Diagram: domain-based concept selection, IST project MEANING]
- Train: WordNet/SemNet concepts grouped by domain (synsets, glosses, examples), plus more contexts per domain (e.g. Sport words), train a text classifier (TwentyOne Classify)
- Classify: an unseen document is classified by domain (microworld, e.g. Sport), and so is each phrase within it (nanoworld), e.g. "financial scandal Juventus" -> Finance, "Players boycott the match" -> Sport
- Concept selection: concepts from the assigned domains are selected
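Once a phrase has been assigned a domain, concept selection reduces to filtering a word's senses by that domain. The Python sketch below is an assumption-laden illustration: the record ids and glosses are borrowed from the wordnet database slide, but the domain attached to each sense, the fallback of keeping all senses when none matches, and the name `select_concepts` are hypothetical.

```python
# (record id, gloss, domain) triples; the domain labels are assumed for illustration.
SENSES = {
    "string": [
        ("rec:35576", "string of an instrument", "music"),
        ("rec:29551", "underwear", "clothing"),
    ],
    "bank": [
        ("rec:12345", "financial institute", "finance"),
        ("rec:54321", "river side", None),
    ],
}


def select_concepts(word, domain):
    """Keep the senses whose domain matches; leave the word unaffected otherwise."""
    senses = SENSES.get(word, [])
    matching = [sense for sense in senses if sense[2] == domain]
    # Assumption: if the domain excludes every sense, no disambiguation is applied.
    return matching or senses


print(select_concepts("string", "music"))  # one sense selected, one excluded
print(select_concepts("bank", "sport"))    # no sense matches, both are kept
```

Keeping all senses when the domain does not discriminate mirrors the "unaffected concepts" category reported in the disambiguation statistics, although the actual selection rule may differ.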

30 Irion Technologies (c)
10\cache_en\151.naw(13):
<MICROWORLD>art;architecture;</MICROWORLD>
<NP ID="6">
  <WRD POS="0"><WF>epa </WF></WRD>
  <NAME>
    <WRD POS="99"><WF>Pete</WF></WRD>
    <WRD POS="99"><WF>Townsend</WF></WRD>
  </NAME>
  <WRD POS="0"><WF>performs</WF></WRD>
  <WRD POS="99"><WF>on</WF></WRD>
  <WRD POS="71"><WF>the</WF></WRD>
  <WRD POS="0"><WF>stage</WF></WRD>
  <WRD POS="6"><WF>during</WF></WRD>
  <WRD POS="71"><WF>the</WF></WRD>
  <WRD POS="99"><WF>Ronnie</WF></WRD>
  <WRD POS="99"><WF>Lane</WF></WRD>
  <WRD POS="99"><WF>Tribute</WF></WRD>
  <WRD POS="0"><WF>concert</WF></WRD>
  <WRD POS="6"><WF>at</WF></WRD>
  <WRD POS="71"><WF>The</WF></WRD>
  <WRD POS="99"><WF>Royal</WF></WRD>
  <WRD POS="99"><WF>Albert</WF></WRD>
  <WRD POS="99"><WF>Hall</WF></WRD>
  <WRD POS="6"><WF>in</WF></WRD>
  <WRD POS="0"><WF>central</WF></WRD>
  <WRD POS="99"><WF>London</WF></WRD></NAME>
  <PHR>epa Pete Townsend performs on the stage during the Ronnie Lane Tribute concert at The Royal Albert Hall in central London</PHR>
  <NW>art;architecture;</NW>
</NP>

31 Effectiveness of Domain disambiguation
                               Spanish               English
total concepts               2,769,753               403,124
disambiguated in microworlds   220,574   7.96%        18,541   4.60%
disambiguated in nanoworlds  1,691,079  61.06%       314,394  77.99%
unaffected concepts            858,100  30.98%        70,189  17.41%
- 2nd-level domains (163 -> 57);
- NPs classified in a window of 10 NPs;
- Threshold was set to 60;
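The percentages on this slide can be re-derived from the counts, which is a quick consistency check. The short Python sketch below assumes the English nanoworld count reads 314,394, the value given for the same cell on the following slide; the dictionary layout and the name `shares` are illustrative only.

```python
# Counts as reported on the slide (English nanoworld count taken as 314,394).
spanish = {"total": 2_769_753, "microworlds": 220_574,
           "nanoworlds": 1_691_079, "unaffected": 858_100}
english = {"total": 403_124, "microworlds": 18_541,
           "nanoworlds": 314_394, "unaffected": 70_189}


def shares(counts):
    """Percentage of the total for every category, rounded to two decimals."""
    total = counts["total"]
    return {name: round(100 * value / total, 2)
            for name, value in counts.items() if name != "total"}


for language, counts in (("Spanish", spanish), ("English", english)):
    parts = sum(value for name, value in counts.items() if name != "total")
    print(language, shares(counts), "categories sum to total:", parts == counts["total"])
```

With these counts the three categories add up exactly to the total for both languages, so microworld disambiguation, nanoworld disambiguation, and unaffected concepts partition the concept set.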

32 Effectiveness of Domain disambiguation
- 2nd-level domains (163 -> 57);
- NPs classified in a window of 10 NPs;
- Threshold was set to 60;
                      Nanoworlds                     Microworlds
                      Spanish          English       Spanish         English
disambiguated words     238,671         26,279        44,652           3,097
total concepts        1,691,079        314,394       220,574          18,541
excluded                879,317 52%    205,221 65%   105,620 48%      10,603 57%
selected                811,762        109,173 35%   114,954           7,938 43%
polysemy                    7.1           12.0           4.9             6.0

33 EFE data and indexes
Fototeca database for finding news pictures from captions

34 EFE DATA
- 29,511 XML files (26,546 Spanish, 2,965 English), 29,943 images;
- Content: captions and descriptions (mostly capitalized!);
- Meta information, other fields;

35 Indexes
- NO: no usage of wordnet (
- FULL: wordnets used for full expansion (
- MEANING: wordnets used for expansion after disambiguation with MEANING data (

36 Evaluation
MEANING-full effects in information retrieval

37 Evaluation set up
- Sets of paraphrased queries with translations to all languages;
- Automatic measurement of recall, where we accept a top-10 result;
- Number of results is limited to a maximum of 25, searched with Boolean AND;
- Applied to all 3 indexes:
  - no wordnets
  - wordnets & no disambiguation
  - wordnets & disambiguation
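The recall measure described above (a query counts as answered when the expected picture appears among the first 10 of at most 25 results) can be sketched in Python as follows. The function name, the data layout, and the toy result lists are hypothetical, and the sketch assumes a single expected document per query, whereas the real test set may allow several correct results.

```python
MAX_RESULTS = 25  # the evaluation caps each result list at 25 hits


def recall_at(results_per_query, expected, k):
    """Share of queries whose expected document is among the first k results."""
    if not results_per_query:
        return 0.0
    hits = sum(
        1 for query, ranked in results_per_query.items()
        if expected[query] in ranked[:MAX_RESULTS][:k]
    )
    return hits / len(results_per_query)


# Toy ranked result lists (document ids) for three queries.
results = {
    "flores violetas": ["231", "17", "80"],
    "violet flowers": ["5", "9", "12", "231"],
    "lore bioleta": ["44", "45"],
}
expected = {query: "231" for query in results}

print(recall_at(results, expected, k=10))  # top-10 score: 2 of 3 queries
print(recall_at(results, expected, k=1))   # first-position score: 1 of 3 queries
```

Reporting the measure at both k=10 and k=1 separates overall recall from ranking quality, which is the distinction behind the top-10 and first-position scores.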

38 Queries
<TESTIN>
  <DBS_ID>EFE_1</DBS_ID>
  <DOC_ID>11</DOC_ID>
  <PAG_TITLE></PAG_TITLE>
  <PAG_ID>231</PAG_ID>
  <NPS>
    <NP ID="16">Un grupo de cargueros transporta una imagen adornada con flores violetas</NP>
  </NPS>
  <SOURCE_LNG>es</SOURCE_LNG>
  <BOOLEAN>AND</BOOLEAN>
  <QUERY_ES>flores violetas</QUERY_ES>
  <QUERY_EN>violet flowers</QUERY_EN>
  <QUERY_CA>flor violeta</QUERY_CA>
  <QUERY_BA>lore bioleta</QUERY_BA>
  <QUERY_IT>fiori di viola</QUERY_IT>
  <QUERY_SY>flores moradas</QUERY_SY>
</TESTIN>
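A test record of this shape can be read with Python's standard `xml.etree.ElementTree` module. The sketch below is only an illustration of how the per-language queries might be collected; the function name `read_queries` and the returned dictionary layout are assumptions, and `QUERY_SY` is treated simply as one more query variant (it appears to hold the Spanish paraphrase).

```python
import xml.etree.ElementTree as ET

TESTIN = """<TESTIN>
  <DBS_ID>EFE_1</DBS_ID>
  <DOC_ID>11</DOC_ID>
  <PAG_ID>231</PAG_ID>
  <SOURCE_LNG>es</SOURCE_LNG>
  <BOOLEAN>AND</BOOLEAN>
  <QUERY_ES>flores violetas</QUERY_ES>
  <QUERY_EN>violet flowers</QUERY_EN>
  <QUERY_CA>flor violeta</QUERY_CA>
  <QUERY_BA>lore bioleta</QUERY_BA>
  <QUERY_IT>fiori di viola</QUERY_IT>
  <QUERY_SY>flores moradas</QUERY_SY>
</TESTIN>"""


def read_queries(xml_text):
    """Collect the QUERY_* elements of one test record, keyed by their suffix."""
    root = ET.fromstring(xml_text)
    prefix = "QUERY_"
    queries = {
        child.tag[len(prefix):].lower(): child.text
        for child in root
        if child.tag.startswith(prefix)
    }
    return {
        "doc_id": root.findtext("DOC_ID"),
        "source": root.findtext("SOURCE_LNG"),
        "boolean": root.findtext("BOOLEAN"),
        "queries": queries,
    }


record = read_queries(TESTIN)
print(record["doc_id"], record["boolean"])
for language, text in record["queries"].items():
    print(language, "->", text)
```

Each record thus yields one expected document plus six query formulations, which is what allows recall to be measured automatically per language.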

39 Queries
Columns: Multi word queries, Single word queries, Unique strings, Postings
Spanish original: 58 105 77 92
Spanish paraphrase: 94 69 96
English: 57 74
Catalan: 60
Basque: 104 65
Italian: 56
- Words with ambiguity or synonyms
- Other meanings and synonyms also occur in the documents
- Words relate to the pictures
- There can be multiple correct results for each query

40 Results for multi word queries
Columns: Spanish original (105 Q), Spanish paraphrase (94 Q), English, Catalan, Basque (104 Q), Italian
NO:       99 0.94   14 0.15   2 0.02   31 0.3   1 0.01   3 0.03
p1:       60 0.57   9 0.1   21 0.2
FULL:     96 0.91   71 0.76   39 0.37   70 0.67   50 0.48   55 0.52
          38 0.4   16   44 0.42   27 0.26   19 0.18
MEANING:  97 0.92   61 0.65   68   46 0.44   32 0.41   48 0.46   20 0.19
- FULL has better overall scores; MEANING has better scores for the 1st position
- Wordnet indexes (FULL & MEANING) outperform NO for paraphrased & cross-lingual queries
- High recall for original queries, due to conceptual phrase search
- FULL and MEANING are very close to NO: no negative effects from expansion
- MEANING removed correct cases but has better precision -> FULL introduces more noise in the ranking

41 Conclusions
- Integrated MEANING in TwentyOne Search and for a real end-user application: the EFE picture database
- Showed that wordnets are useful for mono- and cross-lingual retrieval
- No significant improvement for WSD (top-10 scores are lower, top-1 scores are better)
- WSD can be improved in many ways
- Query-phrase matching is very effective, so we can afford to maximize recall with wordnets
- The type of queries (short & ambiguous) and the type of retrieval (small captions & page-phrase matching) are important for experimental results

42 Thank you for your attention!

