Digital Editions & Language Resources Portal
Workshop - Save the data, 2. 12. 2014, Wien
Matej Ďurčo, ICLTT / ACDH, ÖAW

Presentation transcript:

Slide 2: What kind of data?
- Text

Slide 3: What kind of data?
- Dictionaries
  - Persian – English Dictionary
  - German – Russian Dictionary
  - Dictionary of Bavarian dialects in Austria
  - Cooperation with Austrian dictionary and the dictionary of German variants
  - Full word-form corpus-based lexical database of German
- Databases, e.g. prosopographic, bibliographic data, …
- Audio: speech recordings (project Tunico)

Slide 4: Complexity, Formats
- Sources: plain text, images (need OCR), Word documents (need conversion), audio (needs transcribing), born-digital material from the web (needs cleaning)
- Multi-level enrichment: structural markup, linguistic / semantic annotation (stand-off)
- Linking:
  - Combining lexicographic material with information from corpora (encoding in TEI)
  - Semantic representation of lexicographic resources in RDF
  - Audio with aligned transcription
- XML, TEI
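The stand-off annotation mentioned on this slide can be illustrated with a minimal sketch: the base text stays untouched, and each annotation only records character offsets into it. The class name `Annotation`, the layer names, and the sample sentence are assumptions made for illustration; they are not taken from the presentation or from any ICLTT tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    start: int   # inclusive character offset into the base text
    end: int     # exclusive character offset into the base text
    layer: str   # illustrative layer name, e.g. "entity" or "pos"
    value: str   # the label assigned to the span


def annotated_spans(text, annotations, layer):
    """Return (surface string, label) pairs for one layer, in text order."""
    selected = sorted(
        (a for a in annotations if a.layer == layer),
        key=lambda a: a.start,
    )
    return [(text[a.start:a.end], a.value) for a in selected]


# The base text is never modified; all layers point into it by offset.
BASE_TEXT = "Karl Kraus published Die Fackel in Wien."
ANNOTATIONS = [
    Annotation(0, 10, "entity", "person"),
    Annotation(21, 31, "entity", "work"),
    Annotation(35, 39, "entity", "place"),
    Annotation(11, 20, "pos", "VERB"),
]

entities = annotated_spans(BASE_TEXT, ANNOTATIONS, "entity")
verbs = annotated_spans(BASE_TEXT, ANNOTATIONS, "pos")
```

Because the layers are independent of the text and of each other, a linguistic layer and a semantic layer can be added or replaced without touching the structural markup.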

Slide 5: Size?
- Qualitative vs. quantitative
- K. Kraus, „Die Fackel“ (1899 – 1936): ~ pages, ~ 6 mio. tokens
- AAC: ~ 500 mio. tokens + facsimiles, 40 TB!
- AMC: ~ 8 billion tokens in over 35 mio. articles of recent journalistic texts (complete newspapers & magazines in Austria over the last 20 years!) – 100 GB
- entries prosopographic database
- A number of smaller editions/corpora: 5 – 50 works/resources, rich annotation, < t
- Multiple dictionaries with a few thousand entries

Slide 6: Metadata
- Bibliographic information encoded as teiHeader
- CMDI: metadata infrastructure used within CLARIN
- Allows for flexible „profiles“ specific to the type of resource and project/context:
  - Lexical Resource
  - TextCorpus
  - Collection
  - teiHeader (emulated in CMDI)
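As a rough illustration of bibliographic information encoded as a teiHeader, the following sketch builds a minimal header with Python's standard library. The element names (`fileDesc`, `titleStmt`, `publicationStmt`, `sourceDesc`) and the namespace are those of TEI P5; the helper names and all sample values are hypothetical, and a real project header would carry far more detail.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI_NS)  # serialize TEI as the default namespace


def tei(tag):
    """Qualify a tag name with the TEI namespace."""
    return f"{{{TEI_NS}}}{tag}"


def build_tei_header(title, publisher, source_note):
    """Build a minimal teiHeader: fileDesc with its three required parts."""
    header = ET.Element(tei("teiHeader"))
    file_desc = ET.SubElement(header, tei("fileDesc"))

    title_stmt = ET.SubElement(file_desc, tei("titleStmt"))
    ET.SubElement(title_stmt, tei("title")).text = title

    pub_stmt = ET.SubElement(file_desc, tei("publicationStmt"))
    ET.SubElement(pub_stmt, tei("publisher")).text = publisher

    source_desc = ET.SubElement(file_desc, tei("sourceDesc"))
    ET.SubElement(source_desc, tei("p")).text = source_note
    return header


# Sample values are placeholders, not data from the presentation.
header = build_tei_header("Example dictionary", "Example publisher", "Born digital.")
xml_text = ET.tostring(header, encoding="unicode")
```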

Slide 7: Requirements on online availability
Varying combinations of:
- full-text search
- semantic search (search for persons, places, search by categories and classifications)
- full view (e.g. text and facsimile of individual pages)
- specialized visualizations (temporal, spatial, graph)
- raw data available for download
- stable references to resources and resource fragments
BUT before publication: collaborative editing → VRE!

Slide 8: Solutions
- Publishing framework: corpus_shell
- Repository for digital objects (Fedora-based)
- Viennese Lexicographic Editor
- Collaborative environment for lexicographic work
- oXygen, XML database eXist
- Apache Solr, Sketch Engine (/NoSke), DDC for fast advanced (linguistic) search capabilities
- Most recently: Language Resources Portal

Slide 9: CLARIN European Research Infrastructures
- „Under construction“, but many real services already available
- Stable organisational structure has been set up: general assembly, board of directors, national coordinators, thematic committees, …
- Network of Centres (real ones with computing and storage, not virtual)
- Certification process (centre assessment)
  - Types: A (infrastructure), B (LRT data/services), C (metadata)
  - currently 14 centres certified (+ 4 pending)
- Coordinated through SCCTC (Standing Committee on CLARIN Technical Centres)

Slide 10: CLARIN Infrastructure
- Federated Identity: AAI, single sign-on
- Persistent Identifier
- CMDI (Component Metadata Infrastructure): flexible framework for creation and publication of metadata
- FCS (Federated Content Search): distributed system for searching in the content of the resources (corpora, …)
- Fostering the use of standards: CLARIN Standards Committee (SCS)
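CLARIN's Federated Content Search builds on the SRU protocol with CQL queries, so a search against a single endpoint is essentially an HTTP GET with a few standard parameters. The sketch below only constructs such a request URL and does not send it. The endpoint address is a placeholder, the function name is hypothetical, and the parameter set shown (SRU 1.2 `searchRetrieve`) is an assumption about what a given endpoint accepts.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_search_url(endpoint, cql_query, max_records=10):
    """Build an SRU 1.2 searchRetrieve URL for a CQL query (no request is sent)."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.2",
        "query": cql_query,
        "maximumRecords": str(max_records),
    }
    return f"{endpoint}?{urlencode(params)}"


# "https://fcs.example.org/sru" is a placeholder, not a real CLARIN endpoint.
url = build_search_url("https://fcs.example.org/sru", '"Fackel"', max_records=5)
decoded = parse_qs(urlsplit(url).query)
```

A federated client would send the same query to every registered endpoint and merge the responses.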

Slide 11: corpus_shell Publishing Framework
- modular framework for publishing a wide range of language resources
- designed to operate in a distributed and heterogeneous environment
- distributed setup
- FCS-based
- integration with CLARIN metadata infrastructure (reusing)
- specialized resource viewers for specific types of resources
- multiple implementations (PHP, Perl, XQuery)
- cooperation/integration with SADE (Scalable Architecture for Digital Editions, BBAW, Berlin)
- open source (code on GitHub)
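The distributed setup described on this slide can be pictured as a fan-out: one query goes to several independent endpoints, and the hits are merged. The following sketch imitates that pattern with in-memory stand-ins instead of network calls. It is not corpus_shell code (which, per the slide, is implemented in PHP, Perl and XQuery); every name and sample document here is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def make_stub_endpoint(name, documents):
    """Return a search function standing in for one remote endpoint."""
    def search(term):
        # Case-insensitive substring match; a real endpoint would run a proper query.
        return [(name, doc) for doc in documents if term.lower() in doc.lower()]
    return search


def federated_search(endpoints, term):
    """Query all endpoints concurrently and merge hits in endpoint order."""
    with ThreadPoolExecutor(max_workers=len(endpoints)) as pool:
        result_lists = list(pool.map(lambda endpoint: endpoint(term), endpoints))
    return [hit for hits in result_lists for hit in hits]


# Two illustrative endpoints with made-up content.
ENDPOINTS = [
    make_stub_endpoint("corpus-a", ["Die Fackel, issue 1", "A dictionary entry"]),
    make_stub_endpoint("corpus-b", ["Notes on Die Fackel"]),
]

hits = federated_search(ENDPOINTS, "fackel")
```

Keeping each endpoint behind the same call signature is what lets heterogeneous resources sit behind one search interface.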

Slide 12: corpus_shell instances
- vicav

Slide 13: corpus_shell instances
- ABaC:us

Slide 14: Lexicography suite
- Dictionary Server
  - Open and freely available software that can be readily distributed (MySQL, PHP)
  - Integrated with corpus_shell (FCS as common protocol)
  - Connected to the clients through a REST-style web service
- Viennese Lexicographic Editor (VLE): the corresponding client
- DictGate: a service for hosting lexicographic data for smaller lexicographic projects

Slide 15: Lexicography suite
- Viennese Lexicographic Editor (VLE)
  - XML editor specialized for editing lexicographic data
  - Generic: support for any (XML) format (LMF, TEI, TBX, RDF)
  - Making use of cognate technologies (XSLT, XPath, XSD)
  - Various editing modes, configurable keyboard layouts
  - Optimised corpus-dictionary interface
  - On-the-fly data visualisations
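To give a feel for how XPath is used on lexicographic XML, here is a small sketch that queries a TEI-style dictionary entry with the XPath subset in Python's standard library. The sample entry is invented for illustration, and this is not how VLE itself is implemented; it only shows the kind of path expression such a tool relies on.

```python
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# An invented entry using TEI dictionary elements (entry, form, orth, sense, cit).
ENTRY_XML = """
<entry xmlns="http://www.tei-c.org/ns/1.0" xml:id="example_001">
  <form type="lemma"><orth>Haus</orth></form>
  <gramGrp><pos>noun</pos></gramGrp>
  <sense><cit type="translation" xml:lang="en"><quote>house</quote></cit></sense>
  <sense><cit type="translation" xml:lang="en"><quote>home</quote></cit></sense>
</entry>
""".strip()

entry = ET.fromstring(ENTRY_XML)

# Select the lemma's orthographic form via an attribute predicate.
lemma = entry.findtext("tei:form[@type='lemma']/tei:orth", namespaces=NS)

# Collect all translation equivalents, at any depth below the entry.
translations = [
    quote.text
    for quote in entry.findall(".//tei:cit[@type='translation']/tei:quote", NS)
]
```

The same expressions work unchanged in XSLT stylesheets, which is one reason a generic editor can support several XML formats by swapping paths and schemas.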

Slide 16: CLARIN Centre Vienna (clarin.oeaw.ac.at)
- First Austrian node in the network of CLARIN Centres
- DSA (Data Seal of Approval) and CLARIN Centre B status, April 2014
- Language Resources Portal
- Mission: national depositing and publishing service for digital language resources
- Tools: corpus_shell, lexicographic suite, …
- Infrastructure services: „Knowledge Hub“, mostly about metadata (under development)

Slide 17: CLARIN Centre Vienna (clarin.oeaw.ac.at)

Matej Ďurčo
Austrian Centre for Digital Humanities
Österreichische Akademie der Wissenschaften
Thank you!