Linguistic Annotation and Standoff Markup Henry S. Thompson HCRC Language Technology Group World Wide Web Consortium Markup Technology Ltd. University.

1 Linguistic Annotation and Standoff Markup
Henry S. Thompson
HCRC Language Technology Group
World Wide Web Consortium
Markup Technology Ltd.
University of Edinburgh
© 2001 Henry S. Thompson

Annotation; Standoff Markup, Henry S. Thompson, IRCS, Philadelphia, 2001-12-12

2 Ontology and 'ontology'
- It's the 'in' word just now
- Philosophy
  - The nature of being(s)
- Computing Industry
  - Scholastic taxonomy
  - I.e. (a description of) a data model
- Where does XML fit in?

4 XML is ASCII for the 21st century
- ASCII (ISO 646) solved a fundamental interchange problem for flat text documents
  - What bits encode what characters
    - (For a pretty parochial definition of 'character')
- Unicode/ISO 10646 extends that solution to the whole world
- XML thought it was doing the same for simple tree-structured documents
  - The emphasis in the XML design was on simplifying SGML to move it to the Web
  - XML didn't touch SGML's architectural vision
    - flexible linearisation/transfer syntax
    - for tree-structured documents with internal links

5 The essence of XML
- It's a markup language used for annotating text
- It is concerned with logical structure
  - to identify sections, titles, section headers, chapters, paragraphs, …
- It is not concerned with appearance
  - you say 'this is a subtitle' not 'this is in bold, 14pt, centered'
  - you say 'this is an example' not 'this is in verbatim, indented by 5pts, ragged right'
- It is authored and consumed by people

6 XML marked up text
Internet-based Application Architectures for the 21st Century: The Role of XML
Let's skip straight to an example of XML syntax for a simple bit of structure:
<tip><emph>Never</emph> stand up in a canoe!</tip>
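To make the structure in the canoe example concrete, here is a minimal sketch that parses it with Python's standard-library ElementTree. The helper name `describe` is an illustration only, not something from the talk.

```python
import xml.etree.ElementTree as ET

# The example from the slide, parsed into a tree.
doc = ET.fromstring("<tip><emph>Never</emph> stand up in a canoe!</tip>")

def describe(elem, depth=0):
    """Return one line per element: its name, its leading text, and its tail text."""
    lines = [f"{'  ' * depth}{elem.tag}: text={elem.text!r} tail={elem.tail!r}"]
    for child in elem:
        lines.extend(describe(child, depth + 1))
    return lines

outline = describe(doc)
print("\n".join(outline))

# Dropping the markup recovers the plain sentence.
plain_text = "".join(doc.itertext())
print(plain_text)
```

The markup says that "Never" is emphasised and that the whole sentence is a tip; it says nothing about fonts or layout.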

7 Connecting structure and form
- There is a stylesheet language called XSLT which will allow us to write simple style rules which will produce the formatted presentation from the structured version
- For example [...] will do part of the transformation job
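The stylesheet rule shown on the original slide did not survive in this transcript. As a rough stand-in for the idea of per-element style rules, the sketch below uses plain Python rather than XSLT; the `RULES` table and the chosen formatting (a "TIP:" prefix, upper case for emphasis) are assumptions made purely for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical style rules: element name -> function applied to the rendered content.
RULES = {
    "tip": lambda content: f"TIP: {content}",
    "emph": lambda content: content.upper(),
}

def render(elem):
    """Render an element's children first, then apply the element's own rule."""
    parts = [elem.text or ""]
    for child in elem:
        parts.append(render(child))
        parts.append(child.tail or "")
    content = "".join(parts)
    rule = RULES.get(elem.tag, lambda c: c)  # no rule: pass content through unchanged
    return rule(content)

formatted = render(ET.fromstring("<tip><emph>Never</emph> stand up in a canoe!</tip>"))
print(formatted)
```

The structured version stays untouched; changing the presentation only means changing the rule table.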

8 The essence of XML, mark two
- It's a markup language used for transferring data
- It is concerned with data models
  - to convert between application-appropriate and transfer-appropriate forms
- It is not concerned with human beings
  - It's produced and consumed by programs

9 What just happened!?
- The whole transfer syntax story just went meta, that's what happened!
- XML has been a runaway success, on a much greater scale than its designers anticipated
  - Not for the reason they had hoped
    - Because separation of form from content is right
  - But for a reason they barely thought about
    - Data must travel the web
- Tree structured documents are a useable transfer syntax for just about anything
  - So data-oriented web users think of XML as a transfer mechanism for their data

12 What's missing?
- The relationship between transfer syntax and tree-structured data is well-defined
  - Defined by XML 1.0 + XML Namespaces + XML Infoset
- The relation between tree-structured data and application data is not
  - Left up to each application
    - an XML application = syntax and semantics
  - No official or even consensus standard for expressing the relation
    - So more-or-less ad-hoc scripting solutions predominate
    - And it's easy to confuse domain analysis ('ontology') and document design (DSD)
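A small sketch of the kind of ad-hoc scripting this slide describes: a hand-written mapping from a parsed tree to application objects. The `order`/`item` vocabulary, the attribute names, and the `Order` and `Item` classes are all invented for illustration; nothing here comes from the talk.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

@dataclass
class Item:
    sku: str
    qty: int

@dataclass
class Order:
    order_id: int
    items: list

def order_from_xml(text):
    """Map the tree onto application data; every decision here is ad hoc."""
    root = ET.fromstring(text)
    items = [
        # That 'qty' is an integer is application knowledge the tree does not carry.
        Item(sku=e.get("sku"), qty=int(e.get("qty")))
        for e in root.findall("item")
    ]
    return Order(order_id=int(root.get("id")), items=items)

order = order_from_xml(
    '<order id="17"><item sku="A1" qty="2"/><item sku="B7" qty="1"/></order>'
)
print(order)
```

The parser gets us to the tree for free; the step from the tree to `Order` is where each application writes its own code.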

13 A twin-track declarative approach
- My colleague Ari Krupnikov and I are working on an approach based on annotating W3C XML Schemas with data-binding information
- In looking at existing uses of markup, a number of pre-existing patterns of practice emerged
  - A data-oriented example
- Raises the question of what aspects of the XML Data Model map to what aspects of the application data, at a generic/ontological level

14 Where does annotation fit in?
- It's not just a lexical accident
- Annotation is markup
  - In practice, often quite literally, with coloured pens
- The question of semantics remains
- So we're in the curious state of using trees both as data model and as external representation
- There's a tension between two views of XML documents:
  - Opaque transfer mechanism
  - Repurposable information store
    - XML Query/XPath/XSLT

15 Overlapping Hierarchies
- In either case, overlapping more-or-less orthogonal annotations are a challenge
- Consider for example annotating poetry
- There is a verse/stanza/line perspective
- And a sentence/clause/phrase perspective
- Ordinary trees can't handle this
- Initially we thought of it as a markup problem
  - But latterly we've embraced schizophrenia
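The poetry case can be illustrated with character spans. In the sketch below the two-line verse is made up for the purpose, and `span_of` and `crosses` are hypothetical helpers: two spans cross when they overlap without one containing the other, which is exactly what a single tree cannot represent.

```python
# An invented two-line verse in which a sentence runs over the line break.
text = "The river runs. It bends\naway from town."

def span_of(fragment):
    """Return the (start, end) character offsets of fragment within text."""
    start = text.index(fragment)
    return (start, start + len(fragment))

# Line perspective and sentence perspective over the same characters.
lines = [span_of("The river runs. It bends"), span_of("away from town.")]
sentences = [span_of("The river runs."), span_of("It bends\naway from town.")]

def crosses(a, b):
    """True if the spans overlap but neither contains the other."""
    overlap = a[0] < b[1] and b[0] < a[1]
    nested = (a[0] <= b[0] and b[1] <= a[1]) or (b[0] <= a[0] and a[1] <= b[1])
    return overlap and not nested

conflicts = [(l, s) for l in lines for s in sentences if crosses(l, s)]
print(conflicts)
```

The first line and the second sentence cross, so no single nesting of elements can hold both perspectives at once.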

16 A little sloppiness is a good thing
- Just as in ordinary natural language communication
- XML trees can bear a wide range of interpretations
- Lack of absolute precision is not necessarily a flaw
  - The ontology of linguistic artefacts in general, and annotation in particular, is just not clear
  - XML/Trees seem to be at a useful point with respect to concreteness and precision

17 What's standoff markup?
- Separating markup from the material marked up
- Three obvious reasons
  - Base material may be read-only and large
    - Or not freely distributable
    - Or of necessity somewhere else
  - Markup may involve multiple overlapping hierarchies
  - Multiple analysts may be at work simultaneously

18 Where's the beef?
- At the data model level, there's not much difference:
  - Instead of a parent function from node to node
  - We have a children function from node to node sequence
- At the document level, this means using reference mechanisms (URIs) instead of containment
- In practice, the distinction provides a lot of leverage
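A minimal sketch of the children-function view, under stated assumptions: plain string ids stand in for URI references, the base text is an invented verse, and the names `tokens`, `verse`, `syntax`, and `words` are illustrative only. Two hierarchies point at the same base material without either one containing it.

```python
import re

# Base material, treated as read-only.
BASE = "The river runs. It bends\naway from town."

# Leaf annotations: token id -> (start, end) offsets into BASE.
tokens = {f"t{i}": m.span() for i, m in enumerate(re.finditer(r"\S+", BASE))}

# Two independent hierarchies, each a children function: node id -> sequence of ids.
verse = {
    "poem": ["line1", "line2"],
    "line1": ["t0", "t1", "t2", "t3", "t4"],
    "line2": ["t5", "t6", "t7"],
}
syntax = {
    "text": ["s1", "s2"],
    "s1": ["t0", "t1", "t2"],
    "s2": ["t3", "t4", "t5", "t6", "t7"],
}

def words(node, children):
    """Resolve a node to the base-text words it covers by following references."""
    if node in tokens:
        start, end = tokens[node]
        return [BASE[start:end]]
    return [w for child in children[node] for w in words(child, children)]

print(words("line1", verse))
print(words("s2", syntax))
```

Because each hierarchy only refers to the tokens, the line that crosses a sentence boundary causes no conflict, and a further analyst could add a third hierarchy without touching the base or the other two.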

