CLARIN-PL – Research User-driven Language Technology Infrastructure
Maciej Piasecki, Wrocław University of Technology, G4.19 Research Group



Basic Notions
- Language Technology (LT)
  - language resources and tools
  - robust in terms of quality and coverage
  - multipurpose
  - component-based
- Language Technology Infrastructure
  - a software framework (architecture or platform)
  - for combining language tools with language resources into processing chains (or pipelines)
  - the defined processing chains are then applied to language data sources
  - interoperability, also with external systems
Humanistyka Cyfrowa Warszawa CLARIN-PL
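
The notion of a processing chain can be illustrated with a minimal Python sketch. The tool names (`segment`, `lowercase`, `count_tokens`) and the `chain` helper are hypothetical and do not correspond to any CLARIN-PL component; they only show how independent tools could be composed into one chain and then applied to a data source.

```python
from collections import Counter
from functools import reduce
from typing import Callable

# A "tool" is any function that takes the previous tool's output.
Tool = Callable[[object], object]

def chain(*tools: Tool) -> Tool:
    """Compose tools left to right into a single processing chain."""
    def run(data):
        return reduce(lambda value, tool: tool(value), tools, data)
    return run

def segment(text: str) -> list[str]:
    # Naive whitespace tokenisation; a real segmenter is language-aware.
    return text.split()

def lowercase(tokens: list[str]) -> list[str]:
    return [t.lower() for t in tokens]

def count_tokens(tokens: list[str]) -> Counter:
    return Counter(tokens)

# Define the chain once, then apply it to any text source.
pipeline = chain(segment, lowercase, count_tokens)
result = pipeline("Ala ma kota a kot ma Ale")
```

Because each tool only depends on the output of the previous one, a tool could in principle be swapped for a better implementation without touching the rest of the chain.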

LT in Humanities and Social Sciences: Barriers
- Physical – language tools and resources are not accessible on the Internet
- Informational – descriptions are not available, or there is no means of searching them
- Technological – lack of commonly accepted standards for LT, lack of a common platform, a variety of technological solutions, insufficiently powerful user computers
- Knowledge-related – the use of LT requires programming skills or knowledge of natural language engineering
- Legal – licences for language resources and tools (LRTs) limit their applications

CLARIN Support for Humanities & Social Sciences
- CLARIN is an ERIC-type consortium of
  - 11 countries (Austria, Bulgaria, Czech Republic, Denmark, Estonia, Germany, Lithuania, The Netherlands, Poland, Portugal, Sweden) and the Dutch Language Union
  - 1 observer: Norway
- Focus area:
  - Supporting research in Humanities and Social Sciences
  - Users: researchers, PhD students, students and scientific institutions
- CLARIN Mission
  - To significantly lower the barriers to the use of Language Technology in Humanities & Social Sciences (H&SS)
  - To facilitate or enable research methods based on automated analysis of text and speech resources

CLARIN Offer
- Integration of different LT components into one interoperable system
  - Common, flexible meta-data standard (CMDI)
  - Central searching for resources (Virtual Language Observatory)
  - One sign-on and one login for the distributed infrastructure
  - Decreased Physical and Informational Barriers
- Common standards: promoting, co-ordinating, harmonising
- Web Services for Language Tools and Resources
  - Decreased Technological Barrier
- Installation-free access via Web Applications
  - Decreased Knowledge Barrier
- Common licences and promotion of open access
  - Decreased Legal Barrier

CLARIN: Portal

CLARIN: Virtual Language Observatory

CLARIN: Federated Content Search – Searching Corpora

LTI Development Paradigms
- Bottom-up
  - a collected-offer approach
  - based on linking together already existing Language Resources and Tools
  - focused on accessibility, technical interoperability and processing chains
- Top-down
  - following the user-centred design paradigm
  - research applications for H&SS are the starting point
- Bi-directional
  - linking of Language Resources and Tools
  - combined with the development of research applications

Bi-directional LTI Development
- Idea
  - development of the necessary elements
    - a distributed network infrastructure
    - a basic LT processing chain
  - combined with a user-centred approach to the development of research applications
- Top-down part
  - close co-operation with key users from the H&SS domain
  - modelled on Agile-like lightweight software design methods, with an emphasis on prototyping
  - amendments to the shape of the technical basis: LRTs, standards
  - inspirations, identification of further user needs, next iterations

CLARIN-PL: the Consortium
- Polish scientific consortium
  - Wrocław University of Technology, G4.19 Research Group
  - Institute of Computer Science, Polish Academy of Sciences
  - Polish-Japanese Institute of Information Technology, Chair of Multimedia
  - University of Łódź, PELCRA group at the Chair of English Language and Applied Linguistics
  - Institute of Slavic Studies, Polish Academy of Sciences
  - Wrocław University
- Goal: implementation of the Polish part of the CLARIN ERIC LTI
- Follows the bi-directional approach to LTI development

CLARIN-PL: Mission
- Starting point
  - Several publicly available language resources and tools for Polish
  - But many were still lacking
  - Deeper technological barrier: restricted applications
- CLARIN-PL Pillars:
  - CLARIN-PL Language Technology Centre
    - the Polish node of the CLARIN distributed infrastructure
  - Complete set of the basic Language Resources & Tools for Polish
  - Research applications for H&SS
    - first set for key users and selected H&SS sub-domains

CLARIN-PL Language Technology Centre
- Located at Wrocław University of Technology
- Based on a modified D-Space system from Lindat (Czech CLARIN)
- One sign-on, one login (a member of the Pioneer.id Federation)
- Advanced repository system for language resources
- Persistent Identifiers for resources and tools
- Rich CMDI meta-data – CLARIN-wide visibility in the central search
- Interface for Federated Content Search
- Depositing service for researchers from H&SS
- Application for the Data Seal of Approval
- Adherence to all CLARIN specifications on standards and protocols
- Web Services for LRTs:
  - the basic processing chain for Polish
  - prototype system for flexible composition of natural language processing chains
  - support for developers: SOAP & REST interfaces
- Web Applications for LRTs
- Knowledge Sharing: expertise and support for the users

CLARIN-PL: Language Resources
1. Polish Morphological Dictionary
2. Polish Speech Corpora
3. Annotated Polish Corpora
4. Bilingual Corpora
5. Polish Historical Corpus
6. Semantic lexicon
   - Wordnet for Polish
   - formal description of lexical meanings
7. Dictionary of Multiword Expressions
8. Bilingual semantic lexicon
9. Lexicon of Proper Names
10. Syntactic-semantic Valency Dictionary
11. Robust syntactic-semantic grammar

CLARIN-PL: Language Resources
1. Polish Morphological Dictionary
2. Polish Speech Corpora
3. Annotated Polish Corpora
4. Bilingual Corpora
5. Polish Historical Corpus
6. Semantic lexicon
   - plWordNet 3.0
   - formal description of lexical meanings
7. Dictionary of Multiword Expressions
8. Bilingual semantic lexicon
9. Lexicon of Proper Names
10. Syntactic-semantic Valency Dictionary
11. Robust syntactic-semantic grammar

CLARIN-PL: Language Resources
- Starting point – a set of large resources
  - a huge National Corpus of Polish (1 billion tokens)
  - plWordNet 2.1 – a very large wordnet for Polish
  - Korpus Politechniki Wrocławskiej (Wrocław University of Technology Corpus) – an open Polish corpus with rich annotation
- Expanded resources
  - plWordNet 3.0 – a huge semantic lexicon of Polish
    - a comprehensive description of the Polish lexico-semantic system (~ lemmas, ~ senses)
    - fully mapped to English Princeton WordNet
    - described formally by mapping to an ontology
  - Dictionary of multiword expressions described syntactically
  - NELexicon 2.0 – a huge lexicon of Polish Proper Names (2.5 million)

CLARIN-PL: Language Resources for Polish
- Expanded resources
  - Conversational corpus (following PELCRA and NKJP)
  - A large semantic valency lexicon for Polish predicative lexical units
- Newly built resources
  - Transcribed training-testing Polish speech corpus
  - Bilingual corpora: Polish-English, Polish-Bulgarian-Russian, Polish-Lithuanian
  - Polish historical corpus (for the years )
  - Corpora annotated for: meta-data, anaphora, time expressions, spatial expressions, semantic relations and situations

plWordNet 2.2 in CLARIN-PL


CLARIN-PL: Language Tools for Polish
- Systems for searching corpora, especially Polish corpora
  - Spokes for conversational and bilingual corpora
  - Poliqarp 2.0 for richly annotated corpora
  - Historical corpora [New]
- Text mining (information extraction)
  - Recognition and classification of Proper Names
  - Recognition of anaphoric links
  - Recognition and classification of time expressions and spatial expressions [New]
  - Situation recognition [New]
- Extraction of multiword expressions (collocations)
- A generic set of morpho-syntactic tools for Polish that can be adapted to a domain specified by the user [New]

CLARIN-PL: Language Tools for Polish
- Word Sense Disambiguation based on plWordNet
- Shallow semantic parser [New]
- Deep syntactic-semantic parser [New]
- Tools for the extraction of semantic-pragmatic information from documents and collections of documents, e.g.
  - keywords [New]
  - semantic relations between text fragments
  - text summaries

Basic Language Tools for Polish
1. Segmentation into tokens and sentences
2. Morphological analysis
3. Morphological guessing of unknown words (both without context and context-sensitive)
4. Morpho-syntactic tagging
5. Word Sense Disambiguation
6. Chunker and shallow syntactic parser
7. Named Entity Recognition and disambiguation
8. Co-reference and anaphora resolution
9. Temporal expression recognition
10. Semantic relation recognition
11. Event recognition
12. Shallow semantic parser
13. Deep syntactic parser with disambiguated output: dependency and constituent
14. Deep semantic parser



CLARIN-PL: Processing Chain for Polish

CLARIN-PL: Recognition and classification of Proper Names

Bi-directional – Top-down Part: First Applications
- Approaching users
  - already active, interested, working on large textual and speech resources, …
  - covering a maximal variety of research areas, e.g. linguistics, literary studies, psychology, political studies and sociology
  - matching the available language tools for Polish
  - a first set of several prototype applications illustrating the possibilities and facilitating the identification of needs
- First applications
  - Spokes – searching corpora of conversational data
  - A system for collecting Polish text corpora from the Web
  - An open textometric and stylometric system focused on Polish
  - Semantic text classification for sociology
  - Literary Map

Spokes (University of Łódź)

System for Collecting Polish Text Corpora from the Web
- Requests from users revealed gaps in the available technology
  - existing corpus-building systems were too sensitive to the text encoding errors found on the Web
  - and were not designed for informal corpora such as blogs
- A system for collecting Polish text corpora from the Web had to be constructed:
  - based on tools from Masaryk University in Brno
  - detects texts containing a large number of errors (by morphological analysis)
  - supports semi-automated extraction of texts from blogs, forum posts, etc.
  - integrated with processing tools
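
The error-detection step could be sketched as follows. This is only an illustration under stated assumptions: the morphological analyser is replaced by a tiny, hypothetical set of known word forms (`KNOWN_FORMS`), and the threshold of 0.3 is an arbitrary value, not one reported in the slides.

```python
import re

# Hypothetical stand-in for a morphological analyser: a tiny set of known forms.
KNOWN_FORMS = {"to", "jest", "dobry", "tekst", "o", "języku", "polskim"}

def unknown_ratio(text: str, known_forms: set[str] = KNOWN_FORMS) -> float:
    """Fraction of word tokens that the stand-in analyser does not recognise."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 1.0  # treat empty documents as unusable
    unknown = sum(1 for t in tokens if t not in known_forms)
    return unknown / len(tokens)

def keep_document(text: str, max_unknown: float = 0.3) -> bool:
    """Keep a document only if its share of unrecognised tokens is low enough."""
    return unknown_ratio(text) <= max_unknown

clean = "To jest dobry tekst o języku polskim"
broken = "To j?st d?bry t?kst o j?zyku p?lskim"  # simulated encoding damage
```

With these inputs, `keep_document(clean)` accepts the first text and `keep_document(broken)` rejects the second, because the damaged characters split words into fragments the analyser cannot recognise.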

Open Textometric and Stylometric System
- System designed for the characteristic features of Polish
  - such as rich inflection and weakly constrained word order
- Based on several existing components, including Stylo (Eder & Rybicki)
- Enables the use of features defined at any level of linguistic structure:
  - from the level of word forms
  - up to the level of semantic-pragmatic structures
- Available as a Web Application and a Web Service
- Stylometric techniques appear to be applicable to many tasks in H&SS
  - sociology (features characteristic of different subgroups), political studies (similarities and differences between political parties), literary studies, …
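
At the lowest level mentioned above, word forms, a stylometric comparison might look like the sketch below. It is a deliberately simplified illustration, not the method used by Stylo or by the CLARIN-PL system: the vocabulary of function words is an assumption, and texts are compared by the Manhattan distance between their relative-frequency profiles.

```python
from collections import Counter

def profile(text: str, vocabulary: list[str]) -> list[float]:
    """Relative frequency of each vocabulary word form in the text."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero for empty texts
    return [counts[w] / total for w in vocabulary]

def manhattan(a: list[float], b: list[float]) -> float:
    """Sum of absolute differences between two frequency profiles."""
    return sum(abs(x - y) for x, y in zip(a, b))

vocabulary = ["i", "w", "nie", "na"]  # illustrative Polish function words
text_a = "i w i na nie i"
text_b = "na i i w i nie"        # same word proportions as text_a
text_c = "nie nie nie na w w"    # different proportions

pa, pb, pc = (profile(t, vocabulary) for t in (text_a, text_b, text_c))
same_style = manhattan(pa, pb)
different_style = manhattan(pa, pc)
```

Texts with the same proportions of function words end up at distance zero, while `text_c` lies further away. Features from higher levels of linguistic structure would replace the word-form counts but could reuse the same distance step.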

Semantic Text Classification for Sociology
- Users: Collegium Civitas, Warsaw
- Goal
  - Support for large-scale analysis of source materials
  - Automatic annotation of documents and text fragments with pre-defined semantic categories
  - Definition of categories by examples
  - Automated semantic grouping of documents and text fragments
- Support for
  - Corpus building
  - Manual annotation of the learning sub-corpus
  - Automated annotation process
  - Statistical analysis of the results
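
"Definition of categories by examples" can be illustrated with a minimal sketch. The categories, the example texts and the word-overlap scoring are all assumptions made for illustration; the slides do not say which classification method the actual system uses.

```python
from collections import Counter

def bag(text: str) -> Counter:
    """Bag-of-words representation of a text."""
    return Counter(text.lower().split())

def train(examples: dict[str, list[str]]) -> dict[str, Counter]:
    """Merge the example texts of each category into one word-count profile."""
    profiles = {}
    for category, texts in examples.items():
        merged = Counter()
        for text in texts:
            merged.update(bag(text))
        profiles[category] = merged
    return profiles

def classify(text: str, profiles: dict[str, Counter]) -> str:
    """Pick the category whose profile shares the most word occurrences."""
    words = bag(text)
    def overlap(profile: Counter) -> int:
        return sum(min(count, profile[w]) for w, count in words.items())
    return max(profiles, key=lambda c: overlap(profiles[c]))

# Hypothetical categories, each defined only by a few example fragments.
examples = {
    "work": ["praca etat pensja", "szef praca firma"],
    "family": ["matka ojciec dom", "dzieci dom rodzina"],
}
profiles = train(examples)
label = classify("nowa praca i dobra pensja", profiles)
```

Here the manually annotated examples play the role of the learning sub-corpus, and `classify` stands in for the automated annotation step.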

GeTClasS – Generalised Text Classification for Sociology

Literary Map
- Users: Digital Humanities Centre of the Institute of Literary Research (Polish Academy of Sciences)
- Goal
  - Support for using maps in literary criticism
  - A tool for identifying all geographical names in a literary text (or a corpus) and mapping them onto a geographical map
- Tasks
  1. Identification and semantic classification of the referring language expressions
  2. Disambiguation of the referents
  3. Mapping the referents onto a map (geo-location)
  4. Recognition of the semantic relations and statistical analysis
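
Tasks 1 and 3 can be sketched with a simple gazetteer lookup. The gazetteer below is a hypothetical three-entry table with rounded coordinates. The sketch matches base forms only, so it skips disambiguation (task 2), and a real system for Polish would first have to handle inflected forms.

```python
import re

# Hypothetical mini-gazetteer: place name -> (latitude, longitude), rounded.
GAZETTEER = {
    "Warszawa": (52.23, 21.01),
    "Kraków": (50.06, 19.94),
    "Wrocław": (51.11, 17.03),
}

def locate_places(text: str, gazetteer=GAZETTEER) -> list[tuple[str, float, float]]:
    """Return (name, lat, lon) for each gazetteer name found in the text, in order."""
    found = []
    for token in re.findall(r"\w+", text):
        if token in gazetteer:
            lat, lon = gazetteer[token]
            found.append((token, lat, lon))
    return found

points = locate_places("Pociąg jedzie przez Kraków i Wrocław do miasta Warszawa")
```

The resulting list of coordinates is the kind of data that could then be plotted on a map or fed into the statistical analysis of task 4.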

Literary Map

Conclusions
- Application of LT to research in Humanities & Social Sciences seems to be much more challenging than in commercial systems!
- LT for Polish has reached a stage at which valuable support can be provided for research applications
- The bi-directional approach combines
  - development of a basic, universal set of language tools and resources
  - with inspirations from research applications

Thank you very much for your attention! Supported by the Polish Ministry of Science and Higher Education [CLARIN-PL]

Bi-directional: bottom-up part
- LRTs and LRT chains can be useful …
  - if the required tools and resources exist,
  - and they are robust!
- What is the minimal set of LRTs?
- What kind of LRTs can be called robust?
  - automated applications in H&SS seem to require high-quality language tools and, mostly, large-coverage resources
- BLARK – The Basic Language Resource Kit
  - “the minimal set of language resources that is necessary to do any precompetitive research and education at all” (Krauwer, 2003), together with basic processing chains
  - a possible reference point for comparing LRTs across languages
PALC 2014 Łódź CLARIN-PL