Language resources, standardization and modern trends in NLP Simon Krek Jožef Stefan Institute, Artificial Intelligence Laboratory, Slovenia.

COST Action

Working Groups / Objectives: WG1: Integrated interface to European dictionary content; WG2: Retro-digitized dictionaries; WG3: Innovative e-dictionaries; WG4: Lexicography and lexicology from a pan-European perspective.

Innovative e-dictionaries: The third working group will focus on the development of digitally born dictionaries, with an emphasis on the latest developments in e-lexicography and the interface between lexicography and computational linguistics. Work will be carried out on: the analysis of the possible impact of automatic acquisition of lexical data; the analysis of the interface between dictionary and computational lexica (cf. wordnets) and syntactically and semantically annotated corpora (cf. FrameNet, SemCor, Senseval); and the investigation of the possible use of dictionary content for computational linguistic applications.

Electronic lexicography in the 21st century. The first eLex conference: New challenges, new applications, Louvain-la-Neuve (Belgium), 22 to 24 October 2009. The second eLex conference: New applications for new users, Bled (Slovenia), 10 to 12 November 2011. The third eLex conference: Thinking outside the paper, Tallinn (Estonia), 17 to 19 October 2013. The fourth eLex conference: Linking Lexical Data in the digital age, Herstmonceux Castle (UK), 11 to 13 August 2015.

eLex 2011: Language data for digital natives: old wine in a new bottle or...? Text mining is a challenge. Content is a problem. Presentation is a bigger problem.

What is in the middle? (Web, Mobile) Design; Lexicography; Natural Language Processing; ? Text mining is a challenge. Content is a problem. Presentation is a bigger problem.

Sinclair: Floating dictionary (2001). »A few years ago I felt that the time was ripe to plan a new kind of dictionary, one that would never exist on paper, but would be automatic or almost automatic in its self-updating. It would, so to speak, float on top of a corpus, rather like a jellyfish, its tendrils constantly sensing the state of the language. As well as reporting on the settled usage and meanings of the words and phrases of a language, like a normal dictionary does, the floating dictionary, when interrogated, dips into the corpus and checks this information, offering instances that match its criteria for the senses; also it explores further to see if there are any instances that conflict with the criteria, and may signify a development of a sense or the emergence of a new usage altogether. Within the limits of its powers, it organises this evidence as a comment on the existing dictionary entry.«
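The checking step Sinclair describes (offer corpus instances that match the criteria for each sense, and flag instances that conflict with all of them) can be sketched in a few lines. This is only a toy illustration, not a method from the talk: the `Sense` class, the cue-word criteria and the sample sentences are all hypothetical, and real sense criteria would be far richer than word overlap.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field


@dataclass
class Sense:
    label: str
    cue_words: set[str]  # hypothetical stand-in for "criteria for the senses"


@dataclass
class EntryReport:
    matched: dict[str, list[str]] = field(default_factory=dict)
    unmatched: list[str] = field(default_factory=list)  # candidates for a new usage


def check_entry(headword: str, senses: list[Sense], corpus_lines: list[str]) -> EntryReport:
    """Sort corpus lines containing the headword by the senses whose cues they share."""
    report = EntryReport(matched={s.label: [] for s in senses})
    for line in corpus_lines:
        tokens = set(re.findall(r"\w+", line.lower()))
        if headword not in tokens:
            continue  # the line says nothing about this entry
        hits = [s for s in senses if s.cue_words & tokens]
        if hits:
            for s in hits:
                report.matched[s.label].append(line)
        else:
            report.unmatched.append(line)  # conflicts with every listed sense
    return report


senses = [
    Sense("rodent", {"cheese", "cat", "trap"}),
    Sense("device", {"click", "cursor", "keyboard"}),
]
corpus = [
    "The cat chased the mouse.",
    "Move the mouse to click the icon.",
    "She is such a mouse at parties.",
    "No headword here.",
]
report = check_entry("mouse", senses, corpus)
for line in report.unmatched:
    print("possible new usage:", line)
```

The `unmatched` list plays the role of the "comment on the existing dictionary entry": evidence that a lexicographer would review, not an automatic sense change.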

Does dictionary content know itself? The language technology (LT) community now has a basic idea of how to store various types of information, as does the Semantic Web (SW) community: RDF, RDFa, RDFS, OWL, SKOS, and more. Standardization in human-oriented dictionary encoding was never really successful (XML, TEI?). The question is: if different types of lexicographic information intended for human users have to know each other, will the format be dictated by LT standards? (Probably yes.)
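To make the Semantic Web option concrete, here is a minimal sketch that writes one dictionary sense as RDF in N-Triples syntax using SKOS properties (`skos:prefLabel`, `skos:definition`, `skos:exactMatch`). Treating a sense as a `skos:Concept` is just one possible modeling choice, not something the talk prescribes; the `http://example.org/lexicon/` base URI, the sense identifiers and the sample definition are hypothetical.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "http://example.org/lexicon/"  # hypothetical namespace for illustration


def literal(text: str, lang: str) -> str:
    """Format a language-tagged literal, escaping backslashes and quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@{lang}'


def sense_to_ntriples(sense_id: str, label: str, definition: str, lang: str) -> list[str]:
    """Describe one dictionary sense as a skos:Concept with a label and definition."""
    subject = f"<{BASE}{sense_id}>"
    return [
        f"{subject} <{RDF_TYPE}> <{SKOS}Concept> .",
        f"{subject} <{SKOS}prefLabel> {literal(label, lang)} .",
        f"{subject} <{SKOS}definition> {literal(definition, lang)} .",
    ]


def link_senses(sense_a: str, sense_b: str) -> str:
    """State that two senses (e.g. in different languages) denote the same concept."""
    return f"<{BASE}{sense_a}> <{SKOS}exactMatch> <{BASE}{sense_b}> ."


lines = sense_to_ntriples("en/jellyfish-1", "jellyfish", "A free-swimming marine animal.", "en")
lines.append(link_senses("en/jellyfish-1", "sl/meduza-1"))
print("\n".join(lines))
```

Once senses from different dictionaries are expressed this way, "knowing each other" reduces to links such as `skos:exactMatch` between their identifiers.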

Similar domain, different task. EU projects: The goal of the XLike project is to develop technology to monitor and aggregate knowledge that is currently spread across mainstream and social media, and to enable cross-lingual services for publishers, media monitoring and business intelligence. xLiMe proposes to extract knowledge from different media channels and languages and relate it to cross-lingual, cross-media knowledge bases. By doing this in near real-time we will provide a continuously updated and comprehensive view on knowledge diffusion across media.

Services. Newsfeed: a clean, continuous, real-time aggregated stream of semantically enriched news articles from RSS-enabled sites across the world. EventRegistry: a system that can analyze news articles and identify world events; it can identify groups of articles in different languages that describe the same event.
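One way to picture how articles in different languages could be grouped into the same event is to compare the language-independent concepts attached to them during semantic enrichment. The sketch below is an assumption-laden toy, not the EventRegistry algorithm: the concept identifiers, the Jaccard similarity, the greedy single-pass clustering and the 0.5 threshold are all chosen only for illustration.

```python
def jaccard(a: set, b: set) -> float:
    """Overlap of two concept sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_articles(articles, threshold=0.5):
    """Greedy single-pass clustering of (article_id, language, concepts) tuples.

    The language is ignored on purpose: only shared concepts decide the grouping.
    """
    clusters = []  # each cluster: {"concepts": set, "members": [article ids]}
    for art_id, _lang, concepts in articles:
        best, best_score = None, 0.0
        for cluster in clusters:
            score = jaccard(concepts, cluster["concepts"])
            if score > best_score:
                best, best_score = cluster, score
        if best is not None and best_score >= threshold:
            best["members"].append(art_id)
            best["concepts"] |= concepts
        else:
            clusters.append({"concepts": set(concepts), "members": [art_id]})
    return [cluster["members"] for cluster in clusters]


# Hypothetical articles, already annotated with hypothetical concept identifiers.
articles = [
    ("en-1", "en", {"c:earthquake", "c:nepal", "c:kathmandu"}),
    ("sl-1", "sl", {"c:earthquake", "c:nepal"}),
    ("de-1", "de", {"c:election", "c:germany"}),
    ("fr-1", "fr", {"c:election", "c:germany", "c:bundestag"}),
]
events = group_articles(articles)
print(events)
```

The point of the sketch is that the cross-lingual step happens before clustering: once every article is mapped to shared concept identifiers, the grouping itself no longer needs to know the language.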

EventRegistry system architecture

ENeL perspective: Complex story about events = complex story about words/languages. Slovene, Estonian, English, German, French, Hungarian, Croatian, Basque, Swedish, … Cross-lingual horizontal axis; diachronic vertical axis; …

Cross-lingual synchronic horizontal axis: "Never without data". Existing lexical resources (dictionaries, BabelNet, AnyNet, Linked Data, etc.). Corpora, the Web and NLP. Definition extraction (and generation): RANLP 2009, International workshop on definition extraction; Language Technology for eLearning. Extraction of grammatical or lexical information: Kookkurrenzdatenbank; Sketch Engine. Extraction of good (dictionary) examples: ENeL Vienna workshop. Extraction of translation equivalents: Linguee etc. Extraction of Multi-word Expressions (Parseme).

Automatically Constructed Dictionary Content Complex multimodal information extraction

Explain, combine, exemplify. Definitions: found, generated. Combinations: collocations (as subject, as object), multi-word expressions. Knowledge-Rich Contexts.
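Ranking candidate collocations (for instance verbs that take a noun as object) needs an association measure. The sketch below uses logDice, defined as 14 + log2(2·f(x,y) / (f(x) + f(y))), as one commonly used option; the talk does not say which measure any particular tool applies. The observation pairs are invented, and the marginal frequencies are computed from those pairs alone, which is a simplification of corpus-wide counts.

```python
import math
from collections import Counter


def logdice_scores(pairs):
    """Score (headword, collocate) pairs with logDice.

    `pairs` is an iterable of co-occurrence observations, one tuple per
    occurrence. Frequencies of the individual words are taken from the
    same observations (a simplifying assumption).
    """
    pair_freq = Counter(pairs)
    head_freq = Counter()
    coll_freq = Counter()
    for (head, coll), n in pair_freq.items():
        head_freq[head] += n
        coll_freq[coll] += n
    return {
        (head, coll): 14 + math.log2(2 * n / (head_freq[head] + coll_freq[coll]))
        for (head, coll), n in pair_freq.items()
    }


# Hypothetical adjective + noun observations.
observations = [("tea", "strong")] * 3 + [("tea", "hot"), ("wind", "strong")]
scores = logdice_scores(observations)
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 2))
```

A pair that always occurs together reaches the maximum of 14, so scores are comparable across headwords of very different frequency, which is convenient when the same threshold is applied to a whole dictionary.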

Real-time data: Streaming, Twitter, News Feeds.

Sounds, graphics and visuals. Sounds: Speech Synthesis, Recorded / Speech Recognition. Graphics: Images, Videos.

Multi-lingual, cross-lingual: (hidden) parallel corpora; hub language.

ENeL: WG1: Integrated interface to European dictionary content; WG2: Retro-digitized dictionaries; WG3: Innovative e-dictionaries; WG4: Lexicography and lexicology from a pan-European perspective.

Retro-digitization. Digital Agenda for Europe (Europe 2020 Strategy, one of the pillars). Commission’s Recommendation on the digitization and online accessibility of cultural material and digital preservation: Put in place solid plans for their investments in digitization and foster public-private partnerships to share the gigantic cost of digitization (recently estimated at €100 billion). Make 30 million objects available through Europeana by 2015, including all Europe's masterpieces which are no longer protected by copyright, and all material digitized with public funding.

Retro-digitized dictionaries: encode and enrich dictionary data (standards and tools) (the question is: if different types of lexicographic information intended for human users have to know each other, will the format be dictated by LT standards?): definitions, examples, etymology, other types of information; linking dictionary data with historical corpora.

Lexical Cloud

Integrated interface to European (dictionary / lexical) content: Any dictionary, Anypedia, AnyNet, Any corpus, Any base.

Conclusion: any word/concept in any language on any device offers a story about its current life and its history. What is a "concept" (in the sense of "event")? X-Nets? Wikipedia? What is the central format? What is the appropriate context? EU projects? ICT? Cultural Heritage? Infrastructure (e.g. CLARIN)?