JRC-Ispra, Slide 1: Multilingual text analysis applications based on automatic Eurovoc indexing
Ralf Steinberger
Addressing the Language Barrier Problem in the Enlarged EU: Automatic Eurovoc Descriptor Assignment
JRC Workshop, Ispra, 16/17 September
Slide 2: Applications mentioned so far
Thesaurus indexing (summarise main concepts of document)
– Fully automatic
– Interactive
– Monolingual and cross-lingual
Document retrieval
– Monolingual and cross-lingual
Eurovoc indexing can be used for MUCH MORE …
Slide 3: Main goals of JRC's Language Technology (LT) activity
Gather potentially user-relevant documents
Analyse texts in various languages
– Extract information from texts (Eurovoc)
– Identify similarity between documents (Eurovoc)
– Classify documents (Eurovoc)
Visualise contents
– of individual documents (Eurovoc)
– of whole document collections (Eurovoc)
Slide 4: Eurovoc indexing as part of a tool set
Slide 5: (Cross-lingual) document similarity calculation
[Figure labels: English Text; English Text, "Resolution on radioactive waste"; Spanish Text, "Resolución sobre los residuos radioactivos"; monolingual]
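One way to picture the similarity calculation: if every document, whatever its language, is represented by a weighted list of language-independent Eurovoc descriptors, two documents can be compared directly. The sketch below assumes cosine similarity over sparse descriptor vectors; the slides do not name the measure actually used, and the descriptor IDs and weights are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors given as {descriptor_id: weight}."""
    dot = sum(w * b[d] for d, w in a.items() if d in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical descriptor IDs and weights, for illustration only.
english_doc = {"d_radioactive_waste": 0.9, "d_nuclear_safety": 0.6, "d_environment": 0.3}
spanish_doc = {"d_radioactive_waste": 0.8, "d_nuclear_safety": 0.5, "d_public_health": 0.2}
unrelated_doc = {"d_fisheries": 0.7, "d_common_market": 0.4}

print(cosine_similarity(english_doc, spanish_doc))    # high: mostly shared descriptors
print(cosine_similarity(english_doc, unrelated_doc))  # 0.0: no shared descriptors
```

Because the comparison only looks at descriptor IDs, the same function serves the monolingual and the cross-lingual case.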
Slide 6: (Multilingual) text classification
Most current approaches to text classification are monolingual
Text classification, via Eurovoc, is multilingual
[Figure labels: Category 1, Category 2, Category 3; Es, Fr, Es]
Slide 7: (Multilingual) document map
© Cartia's ThemeScape
Slide 8: Translation Spotting
Why?
– To test document similarity calculation
– To compile a collection of parallel texts (for the training and testing of other multilingual text analysis applications)
– To detect cross-lingual document plagiarism
Slide 9: Translation Spotting - Results
Task: find Spanish translations of an English source document in a parallel text collection
[Figure labels: DS considering the length of documents; DS correcting the monolingual bias (83%); Simple document similarity (DS)]
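The translation spotting task can be read as nearest-neighbour retrieval followed by a hit-rate count. The sketch below is only an illustration of that evaluation loop: it uses plain Jaccard overlap of descriptor sets as a stand-in similarity, does not model the length or monolingual-bias corrections named on the slide, and all document and descriptor IDs are hypothetical.

```python
def overlap(a, b):
    """Jaccard overlap between two sets of descriptor IDs (a stand-in similarity)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def spot_translation(source, candidates):
    """Return the ID of the candidate document most similar to the source."""
    return max(candidates, key=lambda doc_id: overlap(source, candidates[doc_id]))

# Toy parallel collection: the same key marks a document and its translation.
english = {
    "doc1": {"radioactive_waste", "nuclear_safety", "environment"},
    "doc2": {"fisheries", "quota", "common_policy"},
}
spanish = {
    "doc1": {"radioactive_waste", "nuclear_safety", "public_health"},
    "doc2": {"fisheries", "quota", "environment"},
}

# Count how often the top-ranked Spanish document is the true translation.
hits = sum(spot_translation(descs, spanish) == doc_id
           for doc_id, descs in english.items())
accuracy = hits / len(english)
print(accuracy)
```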
Slide 10: (Multilingual) clustering of documents
To organise unknown document collections
Algorithm:
– Find pairs of texts that are most similar
– Group them in one cluster, repeat the operation until only one cluster remains
[Figure labels: 90%, 80%, 75%, 40%, 10%]
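The two-step algorithm above is bottom-up (agglomerative) clustering. A minimal sketch follows, under two assumptions not stated on the slide: similarity is Jaccard overlap of descriptor sets, and the similarity between clusters is the average over their member pairs (average linkage). Document names and descriptors are hypothetical.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard overlap between two sets of descriptor IDs."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def average_link(c1, c2, docs):
    """Mean pairwise similarity between the members of two clusters."""
    pairs = [(i, j) for i in c1 for j in c2]
    return sum(jaccard(docs[i], docs[j]) for i, j in pairs) / len(pairs)

def cluster(docs):
    """Merge the two most similar clusters until one remains; return the merge history."""
    clusters = [(name,) for name in docs]
    merges = []
    while len(clusters) > 1:
        c1, c2 = max(combinations(clusters, 2),
                     key=lambda pair: average_link(pair[0], pair[1], docs))
        sim = average_link(c1, c2, docs)
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
        merges.append((c1, c2, sim))
    return merges

docs = {
    "en_waste": {"radioactive_waste", "nuclear_safety", "environment"},
    "es_waste": {"radioactive_waste", "nuclear_safety", "public_health"},
    "en_fish": {"fisheries", "quota"},
    "fr_fish": {"fisheries", "quota", "environment"},
}

merges = cluster(docs)
for left, right, sim in merges:
    print(left, right, round(sim, 2))
```

The merge history, read from first to last, is the cluster tree: each entry joins two subtrees at the similarity level shown, so later merges sit higher in the tree at lower similarity.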
Slide 11: Building a (multilingual) cluster tree
Slide 12: Application to (multilingual) news analysis
The EMM system in JRC's Web Technology sector retrieves about 20,000 news articles per day in ~20 languages (4000 articles in English)
Cluster related news stories and identify duplicates (news topic identification)
Identify keywords, people's names, place names, main sentences (information extraction)
Find related news stories over time (news topic tracking)
Find related news stories in other languages (cross-lingual topic tracking, mainly via Eurovoc and place names)
Slide 13: Detection of the major news of the day (EMM)
Slide 14: Establish links to related news over time
Slide 15: Establish links to related news in other languages
Slide 16: Subject-specific summarisation (1)
Title: "Resolution on the 10th anniversary of the Chernobyl accident"
Eurovoc descriptors:
Slide 17: Subject-specific summarisation (2)
Eurovoc descriptors:
Slide 18: Further JRC LT applications
Recognition and translation of:
– Place names; + visualisation
– People's names; + retrieval of images and further information
– Dates
– Products
Recognition of text language
Slide 19: Place name recognition / Cross-lingual display
Slide 20: Place name recognition / Visualisation
18 references (Boston, American, America, New York)
11 references (Vietnam)
5 references (Iraq)
+ 1 reference to Sweden (Andre Heinz (…) Swedish based environmental consultant)
Slide 21: Place name recognition / Disambiguation
Requires disambiguation:
– 14 Paris, 7 Birminghams
– cities called And, Annan
– name variants (exonyms)
[Figure: zoom on Europe]
Slide 22: Recognising names, places, … - News navigation
Top-mentioned personalities, En/Fr news, 26 July 2004
Slide 23: Automatic recognition of name variants
Slide 24: Automatic link to online encyclopaedia
Slide 25: News clusters mentioning a person
Slide 26: Persons talked about in same news clusters
Slide 27: Countries talked about in same news clusters
Slide 28: Frequent keywords for these news clusters
Slide 29: Recognising products and product groups: sample text
Slide 30: Recognising products and product groups: identified products
Slide 31: Recognising products and product groups: cross-lingual display of products found
Slide 32
Multilingual Information Extraction
– Language recognition (demo)
– Keywords (monolingual; cross-lingual)
– Geographical place names (intro; new EU languages; demo)
– Products and product groups (slides; demo JRC, demo CIS)
– Names of people (demo news names, demo recognition, related names, Cyrillic/Greek fuzzy name matching, demo fuzzy matching)
– Dates (demo recognition)
– Terminology extraction
– Summarisation (standard sentence extraction; subject-specific summarisation)
Cross-lingual navigation and classification
– Document similarity (monolingual; cross-lingual; translation spotting)
– Bottom-up document clustering; topic detection (demo news analysis)
– Classification (multi-monolingual and cross-lingual; pre-classification clustering)
– Relevance-ranking of documents (slides)
– News topic tracking (monolingual historical; cross-lingual; demo news analysis)
– Navigate text collections via people, countries, keywords, clusters, across languages (slides; demo news names)
Visualisation of textual contents
– Individual documents (document profile)
– Whole document collections (document map)
– Geographical information (maps; animated maps, demo)
– Clustering (ascii, star, tree), key-word-in-context (KWIC), search, …
Further tools
– Document Gathering (Lang-Tech crawler; WT's EMM system)
– Document format conversion (PDF, MS-Word, PS, HTML, XML)
– Character set conversion (UTF-8, ISO-Latin, HTML, …)
Projects
IDoRA for OLAF (slides)
Cross-lingual Indexing (EUROVOC)
Breaking News – Detection and Visualisation (BNDV / State-of-the-World)
SVM for Text Classification
Modus Operandi
Ad-hoc analyses (REACH, AM, INFSO project proposals, ADMIN job descriptions, ENV Public Consultation Sustainable Development)
JRC Introduction
Multilingual and crosslingual text analysis