Digital Editions & Language Resources Portal
Workshop - Save the data, , Vienna
Matej Ďurčo
ICLTT / ACDH, ÖAW

What kind of data?
- Text

What kind of data?
- Dictionaries
  - Persian – English Dictionary
  - German – Russian Dictionary
  - Dictionary of Bavarian dialects in Austria
  - Cooperation with Austrian dictionary and the dictionary of German variants
  - Full word-form corpus-based lexical database of German
- Databases, e.g. prosopographic, bibliographic data, …
- Audio: speech recordings (project Tunico)

Complexity, Formats
- Sources: plain text, images (need OCR), Word documents (need conversion), audio (needs transcribing), born-digital web content (needs cleaning)
- Multi-level enrichment: structural markup, linguistic / semantic annotation (stand-off)
- Linking:
  - Combining lexicographic material with information from corpora (encoding in TEI)
  - Semantic representation of lexicographic resources in RDF
  - Audio with aligned transcription
- Formats: XML, TEI
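
The stand-off annotation named on this slide keeps the annotations apart from the base text and points into it by position. The following is a minimal sketch of that idea, not the format used in the projects: the `Annotation` class, the layer names, and the sample sentence are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One stand-off annotation: it refers to the text by offsets only."""
    start: int   # inclusive character offset into the base text
    end: int     # exclusive character offset into the base text
    layer: str   # hypothetical layer name, e.g. "entity" or "pos"
    value: str   # the label assigned on that layer


def covered_text(text: str, ann: Annotation) -> str:
    """Return the stretch of base text an annotation points at."""
    return text[ann.start:ann.end]


def layer_view(text: str, annotations: list, layer: str) -> list:
    """List (covered text, label) pairs for one layer, in text order."""
    selected = [a for a in annotations if a.layer == layer]
    selected.sort(key=lambda a: a.start)
    return [(covered_text(text, a), a.value) for a in selected]


# The base text is never modified; each layer can be added or dropped freely.
TEXT = "Karl Kraus published Die Fackel in Vienna."
ANNOTATIONS = [
    Annotation(0, 10, "entity", "person"),
    Annotation(21, 31, "entity", "work"),
    Annotation(35, 41, "entity", "place"),
    Annotation(11, 20, "pos", "VERB"),
]

if __name__ == "__main__":
    for span, label in layer_view(TEXT, ANNOTATIONS, "entity"):
        print(f"{label}: {span}")
```

Because every layer refers to the same untouched text, structural, linguistic and semantic layers can overlap without conflicting markup, which is the usual motivation for the stand-off approach.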

Size?
qualitative vs. quantitative
- K. Kraus "Die Fackel" (1899 – 1936): ~ pages, ~ 6 million tokens
- AAC: ~ 500 million tokens + facsimiles, 40 TB!
- AMC: ~ 8 billion tokens in over 35 million articles of recent journalistic texts (complete newspapers & magazines in Austria over the last 20 years!), 100 GB
- entries prosopographic database
- A number of smaller editions/corpora: 5 – 50 works/resources, rich annotation, < t
- Multiple dictionaries with a few thousand entries

Metadata
- Bibliographic information encoded as teiHeader
- CMDI: Component Metadata Infrastructure used within CLARIN
- Allows for flexible "profiles" specific to the type of resource and project/context:
  - Lexical Resource
  - TextCorpus
  - Collection
  - teiHeader (emulated in CMDI)
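
As a rough illustration of bibliographic information carried in a teiHeader, the sketch below reads title and author from a small header. The sample record is invented for this example and is not taken from any of the resources mentioned; only the TEI namespace and the `fileDesc/titleStmt` element path follow the TEI guidelines.

```python
import xml.etree.ElementTree as ET

# TEI P5 namespace; the prefix "tei" is only a local alias for the lookups.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented sample header, for illustration only.
SAMPLE_HEADER = """<teiHeader xmlns="http://www.tei-c.org/ns/1.0">
  <fileDesc>
    <titleStmt>
      <title>Die Fackel</title>
      <author>Karl Kraus</author>
    </titleStmt>
    <publicationStmt><p>Illustrative record only.</p></publicationStmt>
    <sourceDesc><p>Illustrative record only.</p></sourceDesc>
  </fileDesc>
</teiHeader>"""


def bibliographic_summary(header_xml: str) -> dict:
    """Extract title and author from the titleStmt of a teiHeader."""
    root = ET.fromstring(header_xml)
    base = "tei:fileDesc/tei:titleStmt/"
    title = root.findtext(base + "tei:title", default="", namespaces=TEI_NS)
    author = root.findtext(base + "tei:author", default="", namespaces=TEI_NS)
    return {"title": title.strip(), "author": author.strip()}


if __name__ == "__main__":
    print(bibliographic_summary(SAMPLE_HEADER))
```

A missing element yields an empty string rather than an error, which is convenient when headers from different projects are filled in to different degrees.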

Requirements on online availability
Varying combinations of:
- full-text search
- semantic search (search for persons, places; search by categories and classifications)
- full view (e.g. text and facsimile of individual pages)
- specialized visualizations (temporal, spatial, graph)
- raw data available for download
- stable references to resources and resource fragments
BUT before publication: collaborative editing (VRE)!

Solutions
- Publishing framework: corpus_shell
- Repository for digital objects (Fedora-based)
- Viennese Lexicographic Editor
- Collaborative environment for lexicographic work: oXygen, XML database eXist
- Apache Solr, Sketch Engine (/NoSke), DDC for fast advanced (linguistic) search capabilities
- Most recently: Language Resources Portal

CLARIN European Research Infrastructures
- "Under construction", but many real services are already available
- Stable organisational structure has been set up: general assembly, board of directors, national coordinators, thematic committees, …
- Network of Centres (real ones with computing and storage, not virtual)
- Certification process (centre assessment)
  - Types: A (infrastructure), B (LRT data/services), C (metadata)
  - currently 14 centres certified (+ 4 pending)
- Coordinated through SCCTC (Standing Committee on CLARIN Technical Centres)

CLARIN Infrastructure
- Federated Identity: AAI, single sign-on
- Persistent Identifiers
- CMDI (Component Metadata Infrastructure): flexible framework for creation and publication of metadata
- FCS (Federated Content Search): distributed system for searching in the content of the resources (corpora, …)
- Fostering the use of standards: CLARIN Standards Committee (SCS)
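
Federated Content Search builds on the SRU protocol, so a search is, at its simplest, an HTTP request with a handful of query parameters. The sketch below only assembles such a request URL; the endpoint address is a placeholder and the SRU version is an assumption, since the slides do not say which endpoints or protocol version were in use.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical endpoint, used only for illustration.
EXAMPLE_ENDPOINT = "https://example.org/fcs"


def build_search_url(endpoint: str, query: str, max_records: int = 10) -> str:
    """Assemble an SRU searchRetrieve URL for one endpoint."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.2",              # assumed SRU version
        "query": query,                # query string passed through unchanged
        "maximumRecords": str(max_records),
    }
    return f"{endpoint}?{urlencode(params)}"


def read_back(url: str) -> dict:
    """Decode the query parameters of a URL into a flat dictionary."""
    return {key: values[0] for key, values in parse_qs(urlsplit(url).query).items()}


if __name__ == "__main__":
    url = build_search_url(EXAMPLE_ENDPOINT, "Fackel", max_records=5)
    print(url)
    print(read_back(url))
```

In a federated setup the same parameter set would be sent to each participating endpoint and the results merged by the aggregating client; that part is left out here.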

corpus_shell Publishing Framework
- modular framework for publishing a wide range of language resources
- designed to operate in a distributed and heterogeneous environment
- distributed setup, FCS-based
- integration with the CLARIN metadata infrastructure (reusing)
- specialized resource viewers for specific types of resources
- multiple implementations (PHP, Perl, XQuery)
- cooperation/integration with SADE (Scalable Architecture for Digital Editions, BBAW, Berlin)
- open source (code on GitHub)

corpus_shell instances
- vicav

corpus_shell instances
- ABaC:us

Lexicography suite
- Dictionary Server
  - Open and freely available software that can be readily distributed (MySQL, PHP)
  - Integrated with corpus_shell (FCS as common protocol)
  - Connected to the clients through a REST-style web service
- Viennese Lexicographic Editor
  - The corresponding client
- DictGate
  - A service for hosting lexicographic data for smaller lexicographic projects

Lexicography suite
Viennese Lexicographic Editor (VLE)
- XML editor specialized for editing lexicographic data
- Generic: support for any (XML) format (LMF, TEI, TBX, RDF)
- Making use of cognate technologies (XSLT, XPath, XSD)
- Various editing modes, configurable keyboard layouts
- Optimised corpus-dictionary interface
- On-the-fly data visualisations

CLARIN Centre Vienna
clarin.oeaw.ac.at
- First Austrian node in the network of CLARIN Centres
- DSA (Data Seal of Approval) and CLARIN Centre B status, April 2014
- Language Resources Portal
- Mission: national depositing and publishing service for digital language resources
- Tools: corpus_shell, lexicographic suite, …
- Infrastructure Services: "Knowledge Hub", mostly about metadata (under development)

Thank you!
Matej Ďurčo
Austrian Centre for Digital Humanities
Österreichische Akademie der Wissenschaften