CODA – CATCHPlus Open Document Annotation Hennie Brugman OAC II Project Review meeting Chicago – July 26-27, 2012.


1 CODA – CATCHPlus Open Document Annotation Hennie Brugman OAC II Project Review meeting Chicago – July 26-27, 2012

2 Annotation context
Audiovisual – ASR, language, gesture, oral history
Text – semantic annotation
Music – lyrics, music notation
Linguistic annotation – named entities
Image annotation
Programs: CATCH, CATCHPlus, CLARIN

3 CODA main use cases
Queen’s Cabinet (Henny van Schie/National Archive, Lambert Schomaker/Univ Groningen)
  – Line strip and word zone annotations
  – ML: search in manuscript images
  – Add Named Entity annotations
Sailing Letters (Nicoline van der Sijs/Meertens + consortium, Lambert Schomaker)
  – Support manual annotation
  – Line strip detection service

6 Line annotation tools (catchplus)

7 godefroit navis-SAL7316_0195-line-026 -y1=2094-y2=2317-zone-HUMAN -x=1145-y=105-w=315-h=116 -unshear=0.0-version=ortho mceunen Wed Jan 26 16:37:01 2011
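The sample record above packs line number, zone type and pixel geometry into a single name. A minimal sketch of how such a name could be split into fields follows. It assumes the spaces inside the name are line-wrapping residue from the slide, that `key=value` pairs are separated by hyphens, and that values are never negative. The function name `parse_line_strip_name` and the field typing are hypothetical and not part of the CATCHPlus tools.

```python
import re

# Keys assumed to hold integer pixel values (an assumption based on the sample).
INT_FIELDS = ("y1", "y2", "x", "y", "w", "h")


def parse_line_strip_name(name: str) -> dict:
    """Split a line-strip name into a dictionary of typed fields."""
    compact = "".join(name.split())  # drop wrapping whitespace
    record = {}

    line_match = re.search(r"-line-(\d+)", compact)
    if line_match:
        record["line"] = int(line_match.group(1))

    zone_match = re.search(r"-zone-([A-Za-z]+)", compact)
    if zone_match:
        record["zone"] = zone_match.group(1)

    # Every "-key=value" pair; values run up to the next hyphen.
    for key, value in re.findall(r"-(\w+)=([^-]+)", compact):
        if key in INT_FIELDS:
            record[key] = int(value)
        elif key == "unshear":
            record[key] = float(value)
        else:
            record[key] = value
    return record


name = ("navis-SAL7316_0195-line-026 -y1=2094-y2=2317-zone-HUMAN "
        "-x=1145-y=105-w=315-h=116 -unshear=0.0-version=ortho")
record = parse_line_strip_name(name)
print(record)
```

The parsed x, y, w and h values are the kind of spatial information that a conversion to OAC would need to express as a constraint on the image target.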

8 OAC representation
[Diagram; recoverable labels: ImageAnnotation, TextAnnotations, Named Entity; hasBody, hasTarget, constrains; imageScan.jpg, linestrip.jpg, Canvas1; cnt:chars; ia:1, ia:2; page:0, zone:2, line:1; ct:1, ct:2; cb:1, cb:2; ib:0]
Inline text: “Dit is een beschrijving van Den Haag. En dit is een tweede zin.”

9 OAC representation – Named Entities
[Diagram; recoverable labels: ImageAnnotation, TextAnnotations, EntityAnnotation; hasBody, hasTarget, constrains; imageScan.jpg, Canvas1; cnt:chars; ia:1, ta:0, ta:1, ta:2, ea:1; ct:1, ct:2, ct:3, ct:4; cb:1, cb:2; ib:0, ib:1; “location”]
Inline text: “Dit is een beschrijving van Den Haag. En dit is een tweede zin.”
! Annotation of annotations?
! Annotation of segments of inline text?
InlineTextConstraint:
<constrains xmlns="http://www.openannotation.org/ns/" rdf:resource="http://oas.dev.seecr.nl:8000/resolve/urn%3Auuid%3Ad8741024-18bf-40a8-a648-2cd5ebb9acfd"/>
<constrainedBy xmlns="http://www.openannotation.org/ns/" rdf:resource="urn:uuid:4f6b7d34-2329-4ab6-be89-a0feec9e7208"/>
"<textsegment offset="279" range="2"/>" UTF-8
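To illustrate how an inline text constraint of this kind could select a segment, here is a minimal sketch. It assumes that `offset` is a zero-based character position and that `range` is a length in characters; the slide does not state these semantics, so both are assumptions. The function name `apply_text_segment` is hypothetical, and the offset and range values in the example are chosen to select “Den Haag” in the sample sentence, not taken from the slide.

```python
import xml.etree.ElementTree as ET


def apply_text_segment(text: str, constraint_xml: str) -> str:
    """Return the part of `text` selected by a <textsegment/> element.

    Assumes offset is zero-based and range is a character count.
    """
    element = ET.fromstring(constraint_xml)
    offset = int(element.get("offset"))
    length = int(element.get("range"))
    if offset < 0 or length < 0 or offset + length > len(text):
        raise ValueError("text segment falls outside the body text")
    return text[offset:offset + length]


body = "Dit is een beschrijving van Den Haag. En dit is een tweede zin."
constraint = '<textsegment offset="28" range="8"/>'
selected = apply_text_segment(body, constraint)
print(selected)
```

An entity annotation with the body “location” could then target this segment instead of the whole inline text.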

10 KdK-2-OAC conversion
Implicit line and page text
Word and line order
Text offsets and ranges
Spatial information
Identifiers and ‘annotatability’
Redundant text for searchability
! Need for explicit representation of Sequence?
! Search on text of ConstrainedTarget/Body?
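As an illustration of making implicit line and page text explicit, the sketch below assembles page text from words given in reading order and records an offset and range for each word. The input layout (a list of lines, each a list of word strings), the separators (a space between words, a newline between lines) and the function name `build_page_text` are assumptions for this example and not the actual KdK-2-OAC converter.

```python
def build_page_text(lines):
    """Join ordered words into page text and record each word's position.

    `lines` is a list of lines in reading order; each line is a list of
    word strings in reading order.
    """
    text = ""
    segments = []
    for line_no, words in enumerate(lines):
        if line_no > 0:
            text += "\n"  # assumed line separator
        for word_no, word in enumerate(words):
            if word_no > 0:
                text += " "  # assumed word separator
            segments.append({
                "line": line_no,
                "word": word_no,
                "offset": len(text),  # zero-based start in page text
                "range": len(word),   # length in characters
            })
            text += word
    return text, segments


page_text, segments = build_page_text([["Dit", "is"], ["Den", "Haag"]])
for segment in segments:
    start = segment["offset"]
    print(segment, repr(page_text[start:start + segment["range"]]))
```

Because word and line order is encoded only in the list positions here, the sketch also shows why the slide asks whether the model needs an explicit representation of sequence.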

11 KdK2OAC conclusions
Bidirectional mapping is possible
Compatible with SharedCanvas model
OAC + Canvas links everything together
Implicit information made explicit
Supports alternative text segmentations
OAC representation is extremely verbose
! For many annotation tasks OA may be overkill

13 Open Annotation Service (OAS)
Upload annotation RDF using SRU/Update
Inlines external text and XML Bodies and authors
Indexes OA and DC properties
Assigns resolvable http URIs and resolves those
Implementation: RDF store in combination with Solr, production quality software components (Meresco)
Built-in OAI-PMH data provider and harvester for ‘annotation sets’
Query: SRU/CQL, SPARQL, OAI-PMH
Simple management dashboard (authentication and authorization, collection management, harvesting)
Easy installation and Open Source
! Model does not support Annotation “sets”
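A query over SRU/CQL comes down to an HTTP GET with a few standard parameters. The sketch below only builds such a URL and sends nothing. The base URL `http://example.org/sru` and the CQL index `dc.creator` are placeholders: the slides do not give the real OAS endpoint path or list the index names it exposes. The parameters `operation`, `version`, `query` and `maximumRecords` are standard SRU searchRetrieve parameters.

```python
from urllib.parse import urlencode, urlparse, parse_qs


def build_sru_query_url(base_url: str, cql_query: str,
                        maximum_records: int = 10) -> str:
    """Build an SRU 1.1 searchRetrieve URL for a CQL query."""
    params = {
        "version": "1.1",
        "operation": "searchRetrieve",
        "query": cql_query,
        "maximumRecords": str(maximum_records),
    }
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint and index name, for illustration only.
url = build_sru_query_url("http://example.org/sru",
                          'dc.creator = "annotator"')
print(url)

# Round trip: decoding the URL gives back the original CQL query.
decoded = parse_qs(urlparse(url).query)
print(decoded["query"][0])
```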

14 OAS: issues
Annotation publication
Searchability: ‘harvest and index’
Text search on external bodies
Annotation boundaries
‘Bypassing’ oac:constrains
! In RDF, what are the boundaries of an annotation?

16 Entity Recognition service
[Diagram; recoverable labels: service, frog, converter, OAS resolve; URL or text, URL or ID, source_text, FoLiA_document, entity annotations]

17 ‘frog’ and FoLiA
‘Frog’ tool generates FoLiA XML document with
  – Segmentation of text in paragraphs, sentences and words (tokens) – XML hierarchy
  – Part of speech, lemma, morphology, chunking, dependency structure and named entities
Mix of inline and standoff annotation
  – ‘Frog’ does not keep track of character offsets
  – Explicit ordering: numbering system in ids
Trained for Dutch
Widely used for Dutch corpora
Made available by: ILK @ Tilburg University

18 FoLiA-2-OAC conversion
Reconstruct character offsets after tokenization
Operates on inline text as published by OAS
Construct and add entity text from tokens + sequence (the+hague != hague+the)
Two approaches
1. Minimal: extract entity annotations and tokens, and convert to OAC
2. Maximal: full conversion to OAC
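Reconstructing character offsets after tokenization can be sketched as a left-to-right search for each token in the original inline text. The sketch assumes the tokens appear in the text in order and unchanged; a tokenizer that normalizes characters would need extra handling. The function names `reconstruct_offsets` and `entity_span` are hypothetical and do not come from the CODA software.

```python
def reconstruct_offsets(text, tokens):
    """Return (offset, length) for each token, searching left to right."""
    spans = []
    cursor = 0
    for token in tokens:
        start = text.find(token, cursor)
        if start == -1:
            raise ValueError(f"token {token!r} not found after position {cursor}")
        spans.append((start, len(token)))
        cursor = start + len(token)  # never match before the previous token
    return spans


def entity_span(spans, first, last):
    """Offset and range covering tokens first..last (inclusive)."""
    start = spans[first][0]
    end = spans[last][0] + spans[last][1]
    return start, end - start


text = "Dit is een beschrijving van Den Haag."
tokens = ["Dit", "is", "een", "beschrijving", "van", "Den", "Haag", "."]
spans = reconstruct_offsets(text, tokens)

# Entity made of the tokens "Den" and "Haag", in that order.
offset, length = entity_span(spans, 5, 6)
print(offset, length, repr(text[offset:offset + length]))
```

Taking the entity text from the original string, as the last line does, preserves the token sequence and the whitespace between tokens.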

19 Linguistic Annotation
! Mix-in domain semantics as subtypes/subproperties?
! Maximal OA mapping or embed linguistic standards?
! Layers, hierarchies (syntax) and Documents
! Sequence (e.g. entities, morpheme breakup)

21 Synchronized viewing client
[Demo/screenshot]

22 Summary of OA issues
! Annotation of annotations?
! Annotation of segments of inline text?
! Need for explicit representation of Sequence?
! Search on ConstrainedTarget/Body?
! For many annotation tasks OA may be overkill
! Model does not support Annotation sets
! In RDF, what are the boundaries of an annotation?

23 Future work
Finalize and integrate software (with web services)
Upgrade to new OA spec (incl OAS)
Line strip detection web service
Possible applications
  – AV annotation in CATCHPlus
  – Nederlab

24 Questions?

