CODA – CATCHPlus Open Document Annotation Hennie Brugman OAC II Project Review meeting Chicago – July 26-27, 2012.



Annotation context
Audiovisual – ASR, language, gesture, oral history
Text – semantic annotation
Music – lyrics, music notation
Linguistic annotation – named entities
Image annotation
Programs: CATCH, CATCHPlus, CLARIN

CODA main use cases
Queen’s Cabinet (Henny van Schie/National Archive, Lambert Schomaker/Univ Groningen)
–Line strip and word zone annotations
–ML: search in manuscript images
–Add Named Entity annotations
Sailing Letters (Nicoline van de Sijs/Meertens + consortium, Lambert Schomaker)
–Support manual annotation
–Line strip detection service


Line annotation tools (catchplus)

godefroit navis-SAL7316_0195-line-026 -y1=2094-y2=2317-zone-HUMAN -x=1145-y=105-w=315-h=116 -unshear=0.0-version=ortho mceunen Wed Jan 26 16:37:
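The label above packs several key=value fields into one string. A minimal sketch of pulling those fields apart is given here; the function name is hypothetical, and reading x, y, w and h as the pixel box of a word zone is an assumption, since the slide does not define the fields.

```python
import re

# Label copied from the slide; the trailing user name and the
# truncated timestamp are left out.
LABEL = ("navis-SAL7316_0195-line-026 -y1=2094-y2=2317-zone-HUMAN "
         "-x=1145-y=105-w=315-h=116 -unshear=0.0-version=ortho")


def parse_label_fields(label: str) -> dict[str, str]:
    """Collect every key=value pair found in the label, as strings."""
    return dict(re.findall(r"(\w+)=([\w.]+)", label))


fields = parse_label_fields(LABEL)

# Assumption: x, y, w, h describe the word zone box in pixels.
zone_box = tuple(int(fields[key]) for key in ("x", "y", "w", "h"))
```

Parts without an equals sign, such as the page identifier and the zone type, are ignored by this sketch and would need separate handling.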

OAC representation
[Diagram: ImageAnnotation and TextAnnotations connected by hasBody, hasTarget and constrains. Node labels: imageScan.jpg, linestrip.jpg, Canvas1, page:0, zone:2, line:1, ia:1, ia:2, ct:1, ct:2, cb:1, cb:2, ib:0, Named Entity. Inline text (cnt:chars): “Dit is een beschrijving van Den Haag. En dit is een tweede zin.”]

OAC representation – Named Entities
[Diagram: ImageAnnotation, TextAnnotations and EntityAnnotation connected by hasBody, hasTarget and constrains. Node labels: imageScan.jpg, Canvas1, ia:1, ta:0, ta:1, ta:2, ea:1, ct:1, ct:2, ct:3, ct:4, cb:1, cb:2, ib:0, ib:1. Literal values (cnt:chars): “Dit is een beschrijving van Den Haag. En dit is een tweede zin.” and “location”.]
! Annotation of annotations?
! Annotation of segments of inline text?
InlineTextConstraint:
<constrains xmlns=" rdf:resource=" %3Ad bf-40a8-a648-2cd5ebb9acfd"/>
<constrainedBy xmlns=" rdf:resource="urn:uuid:4f6b7d ab6-be89-a0feec9e7208"/>
"<textsegment offset="279" range="2"/>" UTF-8
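The textsegment constraint on this slide selects part of an inline text by offset and range. A minimal sketch of how such a constraint could be built and resolved follows, assuming that offset is a zero-based character index and that range is a length in characters; the slide does not state either, and both function names are hypothetical.

```python
import xml.etree.ElementTree as ET

TEXT = "Dit is een beschrijving van Den Haag. En dit is een tweede zin."


def make_segment(text: str, phrase: str) -> str:
    """Build a textsegment element for the first occurrence of phrase."""
    start = text.index(phrase)
    return f'<textsegment offset="{start}" range="{len(phrase)}"/>'


def resolve_segment(text: str, segment_xml: str) -> str:
    """Return the characters selected by a textsegment element."""
    segment = ET.fromstring(segment_xml)
    offset = int(segment.get("offset"))
    length = int(segment.get("range"))  # assumed: length, not end index
    return text[offset:offset + length]


segment_xml = make_segment(TEXT, "Den Haag")
selected = resolve_segment(TEXT, segment_xml)
```

Under these assumptions the constraint only stays valid while the inline text is unchanged, which is one reason the conversion works on the text exactly as published by the annotation service.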

KdK-2-OAC conversion
Implicit line and page text
Word and line order
Text offsets and ranges
Spatial information
Identifiers and ‘annotatability’
Redundant text for searchability
! Need for explicit representation of Sequence?
! Search on text of ConstrainedTarget/Body?

KdK2OAC conclusions
Bidirectional mapping is possible
Compatible with SharedCanvas model
OAC + Canvas links everything together
Implicit information made explicit
Supports alternative text segmentations
OAC representation is extremely verbose
! For many annotation tasks OA may be overkill

Open Annotation Service (OAS)
Upload annotation RDF using SRU/Update
Inlines external text and XML Bodies and authors
Indexes OA and DC properties
Assigns resolvable HTTP URIs and resolves those
Implementation: RDF store in combination with Solr, production quality software components (Meresco)
Built-in OAI-PMH data provider and harvester for ‘annotation sets’
Query: SRU/CQL, SPARQL, OAI-PMH
Simple management dashboard (authentication and authorization, collection management, harvesting)
Easy installation and Open Source
! Model does not support Annotation “sets”
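One of the query routes listed above is SRU with CQL. A minimal sketch of composing an SRU searchRetrieve request URL follows. The parameter names are those of the SRU standard; the endpoint URL, the function name and the dc.creator index in the example query are assumptions, because the slides do not say which address or index names the service exposes.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def sru_search_url(base_url: str, cql_query: str,
                   start_record: int = 1, maximum_records: int = 10) -> str:
    """Compose an SRU searchRetrieve URL for a CQL query."""
    params = {
        "version": "1.2",
        "operation": "searchRetrieve",
        "query": cql_query,
        "startRecord": start_record,
        "maximumRecords": maximum_records,
    }
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint and index name, for illustration only.
url = sru_search_url("http://example.org/sru", 'dc.creator = "frog"')
decoded = parse_qs(urlsplit(url).query)
```

The sketch only builds the URL; sending the request and reading the XML response are left out.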

OAS: issues
Annotation publication
Searchability: ‘harvest and index’
Text search on external bodies
Annotation boundaries
‘Bypassing’ oac:constrains
! In RDF, what are the boundaries of an annotation?

Entity Recognition service
[Diagram: components service, frog, converter and OAS. Labels: URL or text, resolve, source_text, FoLiA_document, URL or ID, entity annotations.]

‘frog’ and FoLiA
‘Frog’ tool generates FoLiA XML document with
–Segmentation of text in paragraphs, sentences and words (tokens) – XML hierarchy
–Part of speech, lemma, morphology, chunking, dependency structure and named entities
Mix of inline and standoff annotation
–‘Frog’ does not keep track of character offsets
–Explicit ordering: numbering system in ids
Trained for Dutch
Widely used for Dutch corpora
Made available by: Tilburg University

FoLiA-2-OAC conversion
Reconstruct character offsets after tokenization
Operates on inline text as published by OAS
Construct and add entity text from tokens + sequence (the+hague != hague+the)
Two approaches
1. Minimal: extract entity annotations and tokens, and convert to OAC
2. Maximal: full conversion to OAC
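Reconstructing character offsets after tokenization can be sketched as a left-to-right alignment of the tokens against the published inline text. This is an illustration, not the converter's actual code; the function names are hypothetical, and the token list is written by hand rather than taken from real tokenizer output.

```python
TEXT = "Dit is een beschrijving van Den Haag. En dit is een tweede zin."
TOKENS = ["Dit", "is", "een", "beschrijving", "van", "Den", "Haag", ".",
          "En", "dit", "is", "een", "tweede", "zin", "."]


def align_tokens(text: str, tokens: list[str]) -> list[tuple[int, int]]:
    """Return (start, end) character spans for tokens, in token order."""
    spans = []
    cursor = 0
    for token in tokens:
        # Search only to the right of the previous token to keep order.
        start = text.find(token, cursor)
        if start < 0:
            raise ValueError(f"token {token!r} not found after {cursor}")
        end = start + len(token)
        spans.append((start, end))
        cursor = end
    return spans


def entity_span(spans: list[tuple[int, int]],
                first: int, last: int) -> tuple[int, int]:
    """Span from the first to the last token of an entity, inclusive."""
    return spans[first][0], spans[last][1]


spans = align_tokens(TEXT, TOKENS)
start, end = entity_span(spans, 5, 6)  # tokens "Den" and "Haag"
entity_text = TEXT[start:end]
```

Token order matters here in the way the slide notes: the entity text is taken from the span between its first and last token. The simple forward search can also pick a wrong position when a token occurs inside an earlier word, or when the tokenizer has normalized characters, so a real conversion would need stricter checks.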

Linguistic Annotation
! Mix-in domain semantics as subtypes/subproperties?
! Maximal OA mapping or embed linguistic standards?
! Layers, hierarchies (syntax) and Documents
! Sequence (e.g. entities, morpheme breakup)

Synchronized viewing client demo
[Demo/screenshot]

Summary of OA issues
! Annotation of annotations?
! Annotation of segments of inline text?
! Need for explicit representation of Sequence?
! Search on ConstrainedTarget/Body?
! For many annotation tasks OA may be overkill
! Model does not support Annotation sets
! In RDF, what are the boundaries of an annotation?

Future work
Finalize and integrate software (with web services)
Upgrade to new OA spec (incl OAS)
Line strip detection web service
Possible applications
–AV annotation in CATCHPlus
–Nederlab

Questions?