Jan Christoph Meister University of Hamburg
CATMA - an integrated textual markup and analysis tool CLARIN's Turn Towards The Literary Text
Text vs. sentence, or: What's so different about processing texts?
- structural complexity: min TEXT > 2 (SENTENCE)
- structural activity: TEXT processing actualizes paradigmatic cross-reference across sentences
- structural dynamic: TEXT processing represents & simulates cognitive and empirical processes
TEXT yields more INTERPRETATIONS than SENTENCE
+ CONTINGENCY: The more complex & dynamic structure, when activated during processing, results in a higher degree of contingency in functional "outcome"
The what and why of MarkUp
procedural, descriptive & discursive function
- discursive markup: enables human readers to interpret a text and to explore its hermeneutic potential in collaboration. "What might this text mean to us?"
- declarative markup: informs a human reader how to process a text as a communicative device. "How is this text put together and how does it function in its communicative universe?"
- procedural markup: instructs a (natural or artificial) text processor how to handle a text as a structured character string. "What is the correct operation to perform on this input?"
[Diagram labels: performative function, discursive function]
Hermeneutic "must haves" of discursive markup
- facilitate collaboration & non-deterministic annotation
- allow for multiple markup
- allow for overlap
- allow for concurrent tagging
- conceptualize markup as dynamic & recursive
- allow for extensibility
- allow for multiple (and even contradictory) markup
- seamlessly integrate markup and analysis & support the hermeneutic loop
MarkUp types & data models
There is no such thing as "no-mark up". (Coombs, Renear, DeRose 1987)
[The slide repeats this sentence once per markup type:]
- opaque: implicit
- linear: inline, deterministic
- nested: inline, deterministic sequential
- relational: stand off, descriptive
- network: stand off, discursive
Implementation in CATMA
The CATMA/CLÉA approach to markup
- text range based model: a tag references a text range with a start and an end offset
- external standoff markup:
  - markup is stored in external files or databases to facilitate tagging and exchange of markup by multiple users
  - markup is stored in a standoff manner to allow overlapping markup
- tolerates non-deterministic tagging & supports analytical operations that exploit semantic ambiguity
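The range-based standoff idea can be illustrated with a minimal Python sketch. This is not CATMA's actual implementation: the class name `TagReference`, its fields, the tag names and the user names are all assumptions made for illustration. It only shows why keeping offsets outside the text lets several users tag overlapping ranges.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TagReference:
    """A standoff tag: it points into the text instead of being embedded in it."""
    tag: str      # tag name (hypothetical examples below)
    start: int    # offset of the first tagged character
    end: int      # offset just past the last tagged character
    author: str   # who created the markup (hypothetical user names)


def overlaps(a: TagReference, b: TagReference) -> bool:
    """True if the half-open ranges [start, end) share at least one character."""
    return a.start < b.end and b.start < a.end


text = "Jan Christoph Meister University of Hamburg"

# The text itself is never modified, so overlapping ranges cannot conflict.
markup = [
    TagReference("person", 0, 21, author="user_a"),
    TagReference("speaker_and_affiliation", 0, 43, author="user_b"),
    TagReference("institution", 22, 43, author="user_a"),
]

for ref in markup:
    print(ref.author, ref.tag, repr(text[ref.start:ref.end]))
```

Because each reference carries only offsets, a second user's markup can be stored and exchanged separately from the first user's, which inline markup would not allow without nesting rules.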
Example of overlapping markup in CATMA
(NB: In CATMA, tag sets can be imported/exported; tags can be created/manipulated ad hoc during markup)
TEI feature structure tag declaration & overlapping markup
[Example tag shown on the slide: Keynote_speaker&affiliation]
Question 1: How can we model a collaborative markup practice?
Answer 1: CATMA's "n meta-data sets to 1 object-data instance" model
[Diagram labels: TEXT (object-data); user markup 1..n (meta-data: procedural, declarative, hermeneutic); Tagsets]
Question 2: But how, on top of that, can we also model the recursive routines that characterize the humanistic workflow?
Example of recursion: a simple query across the object data/meta-data divide
Step 1: object data query
Step 2: refinement by adding an additional meta-data constraint
... which is why (reg="\b\S*\Qez\E(?=\W)") where (tag="Keynote_speaker&affiliation") generates this:
[Screenshot of the query result]
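The two-step query can be approximated in Python as a rough sketch. Several things here are assumptions: the sample text, the tagged range, the function names, and the reading of "where" as "keep only hits that lie inside a range carrying the tag". Python's `re` module has no `\Q...\E` quoting, so `re.escape` stands in for it.

```python
import re

# Hypothetical sample text; the slide's actual source text is not reproduced here.
text = "Keynote: Maria Lopez (Madrid). Thanks to Juan Perez, who chaired."

# Hypothetical standoff markup: tag name -> list of (start, end) ranges.
tagged = "Maria Lopez (Madrid)"
tag_start = text.index(tagged)
tags = {"Keynote_speaker&affiliation": [(tag_start, tag_start + len(tagged))]}

# Equivalent of \b\S*\Qez\E(?=\W): a token ending in the literal "ez".
PATTERN = re.compile(r"\b\S*" + re.escape("ez") + r"(?=\W)")


def object_data_query(source):
    """Step 1: query the text itself and return (start, end) hits."""
    return [(m.start(), m.end()) for m in PATTERN.finditer(source)]


def refine_by_tag(hits, tag_ranges):
    """Step 2: keep only hits that fall inside one of the tagged ranges."""
    return [
        (s, e) for (s, e) in hits
        if any(ts <= s and e <= te for ts, te in tag_ranges)
    ]


step1 = object_data_query(text)
step2 = refine_by_tag(step1, tags["Keynote_speaker&affiliation"])

print([text[s:e] for s, e in step1])
print([text[s:e] for s, e in step2])
```

The point of the sketch is that the output of step 2 is again a list of ranges, so it can feed a further query or a new round of tagging.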
Answer 2: CATMA's dynamic data model, e.g. (n meta-data sets to 1 object instance)^n
[Diagram labels: TEXT (object-data); markup 1..n (meta-data: procedural, declarative, hermeneutic); Tagsets; markup in turn treated as object-data for further markup 1..n]
Question 3: How can we implement this practice in a system?
Answer 3: Call the big sister, CLÉA!
CLÉA Data Base Model
CATMA/CLÉA: User and resource administration
Manage corpora & source documents, markup collections and tag libraries
Annotate texts or corpora using pre-defined or ready-made tags
Build and execute queries on source text & tags, or any combination thereof
Visualize results
What's in it for CLARIN?
- Import any text or corpus into CATMA/CLÉA
- Run standard analytical procedures automatically or interactively on upload (indexing, POS tagging etc.)
- Annotate and analyse texts or corpora collaboratively
- Share and export markup from the CATMA/CLÉA database in multiple formats
CLÉA = Collaborative Literature Éxploration and Annotation
Mille grazie to my CATMA/CLÉA development team: Evelyn Gius, Malte Meister, Marco Petris, Lena Schüch
and to our funders: University of Hamburg (2009), Google DH Awards ( ), BMBF ( )
Tag definition
- each Tag can have additional user defined properties
- each Tag has a type
- each Tag has a color
Tag instance
- a Tag instance can have individual values for the user defined properties
- each Tag instance is of a type
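The relation between a tag definition and its instances can be sketched as two small Python classes. The class and field names, the colour value, the `certainty` property and the idea of a default value per property are illustrative assumptions, not CATMA's actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class TagDefinition:
    """Declares a tag once: its type, colour and user defined properties."""
    type_name: str
    color: str
    # property name -> default value (the default is an assumption of this sketch)
    property_defaults: Dict[str, str] = field(default_factory=dict)


@dataclass
class TagInstance:
    """One concrete use of a tag, with its own property values."""
    definition: TagDefinition
    values: Dict[str, str] = field(default_factory=dict)

    @property
    def type_name(self) -> str:
        # An instance takes its type from the definition it refers to.
        return self.definition.type_name

    def value(self, name: str) -> str:
        """Return the instance value, falling back to the declared default."""
        if name not in self.definition.property_defaults:
            raise KeyError(f"undeclared property: {name}")
        return self.values.get(name, self.definition.property_defaults[name])


speaker = TagDefinition(
    type_name="Keynote_speaker&affiliation",
    color="#ff0000",
    property_defaults={"certainty": "unknown"},
)
first = TagInstance(speaker, {"certainty": "high"})
second = TagInstance(speaker)

print(first.type_name, first.value("certainty"), second.value("certainty"))
```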
Tag referencing
- The content of a range is referenced by a pointer to an external entity.
- The URI is based on RFC 5147 (URI fragment identifiers for the text/plain media type).
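As a rough illustration of RFC 5147 style pointers, the sketch below builds and parses a `#char=start,end` fragment. The document URI and the helper names are hypothetical, and the parser only handles the closed-range form; RFC 5147 also defines line ranges, open-ended ranges and integrity checks, which are left out here. Whether CATMA emits exactly this form is not stated on the slide.

```python
import re
from typing import Tuple

# Only the closed character range form "#char=<start>,<end>" is handled here.
_CHAR_RANGE = re.compile(r"#char=(\d+),(\d+)$")


def char_range_uri(document_uri: str, start: int, end: int) -> str:
    """Build a pointer to the characters between positions start and end."""
    return f"{document_uri}#char={start},{end}"


def parse_char_range(uri: str) -> Tuple[str, int, int]:
    """Split a pointer back into document URI, start and end position."""
    match = _CHAR_RANGE.search(uri)
    if match is None:
        raise ValueError(f"no char range fragment in {uri!r}")
    return uri[:match.start()], int(match.group(1)), int(match.group(2))


text = "Jan Christoph Meister University of Hamburg"

# Hypothetical document location.
pointer = char_range_uri("http://example.org/keynote.txt", 22, 43)
document_uri, start, end = parse_char_range(pointer)

# RFC 5147 positions count the gaps between characters, starting at 0,
# so a range maps directly onto a Python slice.
print(pointer, "->", text[start:end])
```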
Potential problems and possible solutions
- Problem: range references based on character offsets are vulnerable to modifications of the content.
  Possible solution: automated adjustments with checksums and context information, plus version tracking and revision history in the source document header.
- Problem: the encoding of the tags is machine readable but not interoperable out of the box.
  Possible solution: defining the feature structure encoding of tags in terms of the Open Annotation framework.
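The first proposed solution (checksums plus context information) could work roughly as in the following Python sketch. It is an assumption-laden illustration, not the CATMA/CLÉA mechanism: the `Anchor` fields, the choice of SHA-256, the 10-character context window and the brute-force rescan are all choices made here for brevity.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional, Tuple


def _digest(fragment: str) -> str:
    return hashlib.sha256(fragment.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class Anchor:
    start: int
    end: int
    checksum: str  # digest of the tagged text at the time of tagging
    prefix: str    # a few characters of context before the range


def make_anchor(text: str, start: int, end: int, context: int = 10) -> Anchor:
    return Anchor(start, end, _digest(text[start:end]),
                  text[max(0, start - context):start])


def resolve(anchor: Anchor, text: str) -> Optional[Tuple[int, int]]:
    """Return the current (start, end) of the anchored text, or None if lost."""
    length = anchor.end - anchor.start
    # Fast path: the stored offsets still point at the same content.
    if _digest(text[anchor.start:anchor.end]) == anchor.checksum:
        return anchor.start, anchor.end
    # Slow path: rescan for spans with the same checksum (fine for a sketch,
    # too slow for large documents).
    candidates = [i for i in range(len(text) - length + 1)
                  if _digest(text[i:i + length]) == anchor.checksum]
    # Prefer a candidate preceded by the stored context.
    for i in candidates:
        if text[max(0, i - len(anchor.prefix)):i] == anchor.prefix:
            return i, i + length
    return (candidates[0], candidates[0] + length) if candidates else None


original = "Jan Christoph Meister University of Hamburg"
anchor = make_anchor(original, 22, 43)

edited = "Prof. " + original  # an insertion shifts every later offset by 6
print(resolve(anchor, edited))
```

If the tagged passage itself is rewritten, the checksum no longer matches anywhere and `resolve` returns `None`; that is the case where version tracking in the source document header would be needed.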