Interoperability aspects in the Virtual Language Observatory
Dieter Van Uytvanck
Max Planck Institute for Psycholinguistics
Metadata in Context workshop, Nijmegen
Metadata in Context Nijmegen
Overview
- Context sketch
- VLO: ideas, sources, modalities
- Interoperability issues
- Future plans
Context sketch
Lots of resources somewhere out there:
- Data collections
- Corpora
- Lexica
- Grammars
- Multimedia recordings
- Software
- Web applications / services
Old-school linguistic resources:
- Books
- Articles
- CD-ROMs
It’s like a jungle, sometimes...
VLO: the idea
Researcher: “where do I start?”
- Provide a single entry point giving access to all information
- Because of the large amount of data: drill-down paradigm (decrease the search space gradually)
- Multiple ways of exploring:
  - Full-text search
  - Facet browsing
  - Geographic overlay
- Unified interface, links to the original context
- Available via
VLO: the sources
VLO: the sources – LRT inventory
- Initiated by CLARIN
- Ad-hoc, low-barrier, user-driven inventory of Language Resources and Tools
- Number of records (approximate): Resources: 848, Tools: 180
- You can add new entries yourself!
VLO: the sources – OLAC catalogue
- OLAC data providers: http://catalog.clarin.eu
- Metadata as harvested from 40 OLAC providers (among them several CLARIN centres)
- Quality and quantity differ hugely
VLO: the sources – MPI catalogue
- About metadata records
- Broad spectrum:
  - Experimental data
  - Spoken Dutch corpus
  - Sign Language corpora
  - Endangered languages documentation
- Archive in principle open for externally created linguistic data collections (e.g. endangered languages, see Donated Corpora), provided these collections comply with the technical requirements (archivable formats, metadata, …)
VLO: the sources – DFKI tool registry
- Contains information about 292 (linguistic) software packages
- You can add entries yourself
VLO: the modalities – GIS
VLO: the modalities – Hierarchical catalogue
VLO: the modalities – Facet browser
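The drill-down paradigm behind a facet browser can be illustrated with a minimal sketch. It is not the VLO implementation: the records, their field values, and the function names `facet_counts` and `drill_down` are hypothetical. It only shows how counting values per facet and then filtering on a chosen value shrinks the search space step by step.

```python
from collections import Counter


def facet_counts(records, facet):
    """Count how many records carry each value of the given facet."""
    return Counter(r[facet] for r in records if r.get(facet))


def drill_down(records, **selection):
    """Keep only the records that match every selected facet value."""
    return [
        r for r in records
        if all(r.get(facet) == value for facet, value in selection.items())
    ]


# Hypothetical records, invented purely for illustration.
RECORDS = [
    {"title": "A", "continent": "Europe", "country": "Netherlands", "language": "Dutch"},
    {"title": "B", "continent": "Europe", "country": "Germany", "language": "German"},
    {"title": "C", "continent": "Oceania", "country": "French Polynesia", "language": "Marquesan"},
    {"title": "D", "continent": "Europe", "country": "Netherlands", "language": "Dutch Sign Language"},
]

# Each step narrows the set that the next facet is counted over.
europe = drill_down(RECORDS, continent="Europe")
netherlands = drill_down(europe, country="Netherlands")
```

Recomputing `facet_counts` on the narrowed set after each selection is what lets the interface show only the values that still lead to results.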
Interaction between modalities
… all leading to the data
Interoperability issues (1)
The facets to which all of the metadata records are mapped are currently:
- country
- continent
- origin
- language
- organization
- genre
- subject
Interoperability issues (2)
Observations: lots of inconsistencies and errors, e.g. for one organisation (record counts in parentheses):
- MPI (5)
- MPI for Psycholinguistics (Nijmegen, Netherlands), Académie Marquisienne (Tuhuna 'Eo 'Enata) (2)
- MPI for Psycholinguistics (Nijmegen, Netherlands), Académie Marquisienne (Tuhuna 'Eo 'Enata) (39)
- Max Planck Institute for Psycholinguistics (Nijmegen, Netherlands) (112)
- Max Planck Institute for Psycholinguistics (13849)
- Max Planck Institute for Psycholinguistics & Volkswagen Stiftung (12)
- Max Planck Institute for Psycholinguistics, Nijmegen, Netherlands (2)
- Max Planck Institute for Psycholinguistics, Postbus 310, 6500 AH Nijmegen, The Netherlands (15)
Facets help to detect them
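One way such variants could be merged is a hand-curated alias table, sketched below. This is an assumption about a possible curation step, not a description of how the VLO handles it: the table `ALIASES` and the function names are hypothetical, and only the variant strings and counts are taken from the list above. Joint attributions are deliberately left unmerged, since deciding how to split them needs a human curator.

```python
from collections import Counter

CANONICAL = "Max Planck Institute for Psycholinguistics"

# Hypothetical hand-curated alias table: normalised variant -> canonical name.
ALIASES = {
    "mpi": CANONICAL,
    "max planck institute for psycholinguistics": CANONICAL,
    "max planck institute for psycholinguistics (nijmegen, netherlands)": CANONICAL,
    "max planck institute for psycholinguistics, nijmegen, netherlands": CANONICAL,
}


def normalise_key(name):
    """Collapse whitespace and ignore case so trivial variants compare equal."""
    return " ".join(name.split()).casefold()


def canonical_organisation(name):
    """Return the canonical name if the variant is known, else the name itself."""
    return ALIASES.get(normalise_key(name), name.strip())


def merge_facet_counts(counts):
    """Sum the record counts of all variants that share a canonical name."""
    merged = Counter()
    for name, n in counts.items():
        merged[canonical_organisation(name)] += n
    return merged


# Variant strings and counts as listed on the slide (subset).
observed = {
    "MPI": 5,
    "Max Planck Institute for Psycholinguistics (Nijmegen, Netherlands)": 112,
    "Max Planck Institute for Psycholinguistics": 13849,
    "Max Planck Institute for Psycholinguistics & Volkswagen Stiftung": 12,
    "Max Planck Institute for Psycholinguistics, Nijmegen, Netherlands": 2,
}

merged = merge_facet_counts(observed)
```

A lookup table is used rather than fuzzy string matching because an ambiguous abbreviation such as "MPI" can only be resolved with knowledge of the source archive.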
Interoperability issues (3)
Because of the distributed approach:
- Distributed responsibilities
- Loss of specificity by converting all metadata records to a common subset
- Important to provide a link to the original record (also for the context!)
Need for high-quality and well-maintained controlled vocabularies and relevant Persistent Identifiers:
- MIME types
- Organisation names
- ISO language codes (cf. ISOcat)
- Domain-specific vocabularies
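The loss of specificity, and the reason for keeping a link to the original record, can be made concrete with a small sketch. Everything here is illustrative: the source field names in `EXAMPLE_FIELD_MAP`, the sample record, the URL, and the function `to_common_subset` are assumptions, and only the facet names come from the presentation.

```python
FACETS = ("country", "continent", "origin", "language",
          "organization", "genre", "subject")

# Hypothetical mapping from one source schema's field names to facet names.
EXAMPLE_FIELD_MAP = {
    "publisher": "organization",
    "language": "language",
    "type": "genre",
    "subject": "subject",
}


def to_common_subset(record, field_map, original_url):
    """Project a source record onto the common facets.

    Fields without a facet are not carried over; they are listed in
    'dropped_fields' and remain reachable only through 'original_url'.
    """
    facets = {facet: None for facet in FACETS}
    dropped = []
    for field, value in record.items():
        facet = field_map.get(field)
        if facet is None:
            dropped.append(field)
        else:
            facets[facet] = value
    return {
        "facets": facets,
        "original_url": original_url,
        "dropped_fields": sorted(dropped),
    }


# Hypothetical source record and URL, invented for illustration.
record = {
    "publisher": "Max Planck Institute for Psycholinguistics",
    "language": "nld",
    "type": "corpus",
    "recording_quality": "high",
}
common = to_common_subset(record, EXAMPLE_FIELD_MAP, "https://example.org/records/42")
```

In this sketch the `recording_quality` field has no facet, so it survives only in the original record, which is why the link back matters.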
Interoperability issues (4)
Metadata exchange protocols exist (e.g. OAI-PMH), but:
- They are not always used
- For the VLO one still has to rely on non-continuous information flows like CSV files
- Clearly an undesirable situation in the longer term
Granularity: how to indicate it in a standardized way?
User feedback
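For contrast with one-off CSV exports, the sketch below shows roughly how a harvester pages through OAI-PMH `ListRecords` responses. The verb, the `metadataPrefix` and `resumptionToken` arguments, and the XML namespace are part of OAI-PMH 2.0; the base URL `https://example.org/oai`, the sample response, and the function names are hypothetical. No network request is made here.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL.

    In OAI-PMH a resumptionToken is an exclusive argument: when it is
    present, no other argument besides the verb may be sent.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_page(xml_text):
    """Return the record identifiers on one page and the token for the next."""
    root = ET.fromstring(xml_text)
    identifiers = [
        el.text for el in root.findall(".//oai:header/oai:identifier", OAI_NS)
    ]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    # A missing or empty resumptionToken means the list is complete.
    token = token_el.text if token_el is not None and token_el.text else None
    return identifiers, token


# Hypothetical, heavily trimmed response used instead of a live endpoint.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:1</identifier></header></record>
    <record><header><identifier>oai:example.org:2</identifier></header></record>
    <resumptionToken>page2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

ids, token = parse_page(SAMPLE)
first_url = list_records_url("https://example.org/oai")
next_url = list_records_url("https://example.org/oai", resumption_token=token)
```

A real harvester would fetch `first_url`, parse the page, and repeat with `next_url` until no token is returned, which is what makes the information flow continuous and repeatable.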
Future steps
- Curate the metadata: correct typographical errors, add information, use consistent terminology, etc.
- Process CMDI- and ISOcat-based metadata
- Use (emerging) standards to refer to persons, projects, resources... in a persistent and interoperable way
Thank you for your attention
CLARIN has received funding from the European Community's Seventh Framework Programme under grant agreement n°