Standards for digital encoding Tomaž Erjavec Institut für Informationsverarbeitung Geisteswissenschaftliche Fakultät Karl-Franzens-Universität Graz Tomaž.

Slides:



Advertisements
Similar presentations
What is XML? a meta language that allows you to create and format your own document markups a method for putting structured data into a text file; these.
Advertisements

History Leading to XHTML
CS 898N – Advanced World Wide Web Technologies Lecture 21: XML Chin-Chih Chang
Document Type Definitions
Lecture 14 XML Validation. a simple element containing text attribute; attributes provide additional information about an element and consist of a name.
Thayer School of Engineering Dartmouth Lecture 2 Overview Web Services concept XML introduction Visual Studio.net.
1 XML and QUERY Shilpi Ahuja CSE Data Mining 4 th April 2002.
XML Introduction What is XML –XML is the eXtensible Markup Language –Became a W3C Recommendation in 1998 –Tag-based syntax, like HTML –You get to make.
Sistemi basati su conoscenza XML Prof. M.T. PAZIENZA a.a
Tutorial 11 Creating XML Document
COS 381 Day 14. Agenda Questions?? Resources Source Code Available for examples in Text Book in Blackboard
Introducing XHTML: Module B: HTML to XHTML. Goals Understand how XHTML evolved as a language for Web delivery Understand the importance of DTDs Understand.
Fundamentals of Web DevelopmentRandy Connolly and Ricardo HoarFundamentals of Web DevelopmentRandy Connolly and Ricardo Hoar Fundamentals of Web DevelopmentRandy.
Introduction to XML This material is based heavily on the tutorial by the same name at
Introducing HTML & XHTML:. Goals  Understand hyperlinking  Understand how tags are formed and used.  Understand HTML as a markup language  Understand.
Topics The "bigger picture" –The "XML sales pitch" –XML/XHTML vs. SGML/HTML –XML in electronic publishing –XML and the future, web 2.0 XML basics: –Building.
4/20/2017.
XML – Extensible Markup Language Sivakumar Kuttuva & Janusz Zalewski.
Lecture 15 XML Validation. a simple element containing text attribute; attributes provide additional information about an element and consist of a name.
Why XML ? Problems with HTML HTML design - HTML is intended for presentation of information as Web pages. - HTML contains a fixed set of markup tags. This.
XML eXtensible Markup Language by Darrell Payne. Experience Logicon / Sterling Federal C, C++, JavaScript/Jscript, Shell Script, Perl XML Training XML.
XML CPSC 315 – Programming Studio Fall 2008 Project 3, Lecture 1.
XP 1 CREATING AN XML DOCUMENT. XP 2 INTRODUCING XML XML stands for Extensible Markup Language. A markup language specifies the structure and content of.
CISC 3140 (CIS 20.2) Design & Implementation of Software Application II Instructor : M. Meyer Address: Course Page:
1 herbert van de sompel CS 502 Computing Methods for Digital Libraries Cornell University – Computer Science Herbert Van de Sompel
1 © Netskills Quality Internet Training, University of Newcastle Introducing XML © Netskills, Quality Internet Training University.
Introduction to XML. What is XML? Extensible Markup Language XML Easier-to-use subset of SGML (Standard Generalized Markup Language) XML is a.
XML 1 Enterprise Applications CE00465-M XML. 2 Enterprise Applications CE00465-M XML Overview Extensible Mark-up Language (XML) is a meta-language that.
XML Extensible Markup Language. What is XML? ● meta-markup language ● a language for defining a family of languages ● semantic/structured mark-up language.
XHTML. Introduction to XHTML What Is XHTML? – XHTML stands for EXtensible HyperText Markup Language – XHTML is almost identical to HTML 4.01 – XHTML is.
XML - Why: The HTML-Dilemma HTML, SGML, XML - How: Syntax, Concept, Language Elements Basics Well-formed XML-Documents (without DTD) Valid XML-Documents.
TEXT ENCODING INITIATIVE (TEI) Inf 384C Block II, Module C.
What is XML?  XML stands for EXtensible Markup Language  XML is a markup language much like HTML  XML was designed to carry data, not to display data.
 XML is designed to describe data and to focus on what data is. HTML is designed to display data and to focus on how data looks.  XML is created to structure,
Session IV Chapter 9 – XML Schemas
1 Tutorial 13 Validating Documents with DTDs Working with Document Type Definitions.
Lecture 6 XML DTD Content of.xml fileContent of.dtd file.
How do I use HTML and XML to present information?.
XML A web enabled data description language 4/22/2001 By Mark Lawson & Edward Ryan L’Herault.
1 Chapter 10: XML What is XML What is XML Basic Components of XML Basic Components of XML XPath XPath XQuery XQuery.
Waqas Anwar Next SlidePrevious Slide. Waqas Anwar Next SlidePrevious Slide XML XML stands for EXtensible Markup Language.
Introduction to XML This presentation covers introductory features of XML. What XML is and what it is not? What does it do? Put different related technologies.
XML Instructor: Charles Moen CSCI/CINF XML  Extensible Markup Language  A set of rules that allow you to create your own markup language  Designed.
Lecture 16 Introduction to XML Boriana Koleva Room: C54
An Introduction to XML Sandeep Bhattaram
McGraw-Hill/Irwin © 2004 by The McGraw-Hill Companies, Inc. All rights reserved. Understanding How XML Works Ellen Pearlman Eileen Mullin Programming the.
XML Introduction. What is XML? XML stands for eXtensible Markup Language XML stands for eXtensible Markup Language XML is a markup language much like.
XML for Text Markup An introduction to XML markup.
1 XML eXtensible Markup Language. 2 XML vs. HTML HTML is a HyperText Markup language HTML is a HyperText Markup language Designed for a specific application,
1 Indexing The syntax for creating a index is: CREATE [UNIQUE] INDEX index_name ON table_name (column1, column2,... column_n) [ COMPUTE STATISTICS ]; Why.
1 herbert van de sompel CS 502 Computing Methods for Digital Libraries Cornell University – Computer Science Herbert Van de Sompel
COMP9321 Web Application Engineering Semester 2, 2015 Dr. Amin Beheshti Service Oriented Computing Group, CSE, UNSW Australia Week 4 1COMP9321, 15s2, Week.
225 City Avenue, Suite 106 Bala Cynwyd, PA , phone , fax presents… XML Syntax v2.0.
What is XML? eXtensible Markup Language eXtensible Markup Language A subset of SGML (Standard Generalized Markup Language) A subset of SGML (Standard Generalized.
Introduction to DTD A Document Type Definition (DTD) defines the legal building blocks of an XML document. It defines the document structure with a list.
XML CSC1310 Fall HTML (TIM BERNERS-LEE) HyperText Markup Language  HTML (HyperText Markup Language): December  Markup  Markup is a symbol.
Games: XML Presented by: Idham bin Mat Desa Mohd Sharizal bin Hamzah Mohd Radzuan bin Mohd Shaari Shukor bin Nordin.
XML Validation. a simple element containing text attribute; attributes provide additional information about an element and consist of a name value pair;
Tutorial 9 Working with XHTML. New Perspectives on HTML, XHTML, and XML, Comprehensive, 3rd Edition 2 Objectives Describe the history and theory of XHTML.
C Copyright © 2011, Oracle and/or its affiliates. All rights reserved. Introduction to XML Standards.
CIS 228 The Internet 9/20/11 XHTML 1.0. “Quirks” Mode Today, all browsers support standards Compliant pages are displayed similarly There are multiple.
Jackson, Web Technologies: A Computer Science Perspective, © 2007 Prentice-Hall, Inc. All rights reserved Chapter 7 Representing Web Data:
XML – Basic Concepts (modified version from Dr. Praveen Madiraju) 2015, Fall Pusan National University Ki-Joune Li.
XML Introduction to XML Extensible Markup Language.
XML Notes taken from w3schools. What is XML? XML stands for EXtensible Markup Language. XML was designed to store and transport data. XML was designed.
1 XML eXtensible Markup Language. 2 Introduction and Motivation Dr. Praveen Madiraju Modified from Dr.Sagiv’s slides.
Extensible Markup Language (XML) Pat Morin COMP 2405.
Unit 4 Representing Web Data: XML
CIS 228 The Internet 9/20/11 XHTML 1.0.
Chapter 7 Representing Web Data: XML
Presentation transcript:

Standards for digital encoding Tomaž Erjavec Institut für Informationsverarbeitung Geisteswissenschaftliche Fakultät Karl-Franzens-Universität Graz Tomaž Erjavec

Overview 1. a few words about me 2. a few words about you 3. a short introduction to standards 4. some words on XML Practicum: writing a small document in XML (recipes?)

Lecturer Tomaž Erjavec Department of Knowledge Technologies Jožef Stefan Institute Ljubljana Tomaž Erjavec Department of Knowledge Technologies Jožef Stefan Institute Ljubljana corpora and other language resources, standards, annotation, text-critical editions corpora and other language resources, standards, annotation, text-critical editions Web page for this course: Web page for this course:

Students background: field of study, exposure to XML/TEI? background: field of study, exposure to XML/TEI? s? s? expectations? expectations?

Standards dictionary: an obligatory uniform regulation for measurment, quantity or quality // that which specifies how something can or must be dictionary: an obligatory uniform regulation for measurment, quantity or quality // that which specifies how something can or must be consensually accepted regulations, which are public and contain explicit definitions consensually accepted regulations, which are public and contain explicit definitions the main purpuse is to harmonise industrial practice in various fields in order to enable interchange the main purpuse is to harmonise industrial practice in various fields in order to enable interchange

Some history XVIII century: in France each region (village) has its own units of measurement; also, different objects (say a field or forest) are measured differently XVIII century: in France each region (village) has its own units of measurement; also, different objects (say a field or forest) are measured differently how to definine a uniform system of measurements: search for a single unit from which it would be possible to derive all other measures how to definine a uniform system of measurements: search for a single unit from which it would be possible to derive all other measures meter: one ten-millionth of the length of the meridian through Paris, from the North Pole to the equator meter: one ten-millionth of the length of the meridian through Paris, from the North Pole to the equator the importance of standardisation grows with the industrial revolution: mechanical and electrical engineering, construction work… the importance of standardisation grows with the industrial revolution: mechanical and electrical engineering, construction work… today, standards encompas even such “soft” fields as the organisation of bussines (ISO 9000) today, standards encompas even such “soft” fields as the organisation of bussines (ISO 9000) big bussiness: companies that check compliance with standards big bussiness: companies that check compliance with standards

Standardisation bodies publish standards according to strictly defined procedures: national standards: DIN, ANSI, SIST national standards: DIN, ANSI, SIST international standards: IEC, ISO international standards: IEC, ISO ISO International Organization for Standardization, Geneva (1947) ISO International Organization for Standardization, Geneva (1947) ISO ISO Technical Committees are composed of members from participating countries, who then develop and approve standards from their field ISO Technical Committees are composed of members from participating countries, who then develop and approve standards from their field ISO TCs can be further composed sub-committees (SC) and these can containing Working Groups (WG) ISO TCs can be further composed sub-committees (SC) and these can containing Working Groups (WG)

ISO TC 37 Technical Committee on Terminology Technical Committee on Terminology important for all other standards, as each standard must contain a section on terminology important for all other standards, as each standard must contain a section on terminology basic definitions,…, ISO 639, MARTIF basic definitions,…, ISO 639, MARTIFISO 639ISO 639 in 2001 name of TC 37 changed to: … and other language and content resources in 2001 name of TC 37 changed to: … and other language and content resources ISO TC 34 SC4: Language Resources Management ISO TC 34 SC4: Language Resources ManagementSC4

W3C The World Wide Web Consortium The World Wide Web ConsortiumWorld Wide Web ConsortiumWorld Wide Web Consortium first recommendation was HTML (1992) first recommendation was HTML (1992) best known versions of HTML: 3.2, 4.1 best known versions of HTML: 3.2, 4.1 XML 1.0 released February 1998 XML 1.0 released February 1998 XML 1.0 XML 1.0 Many XML related standards: Many XML related standards:  DOM Level 1 V1.0 (October 1998)  XML Namespaces V1.0 (January 1999)  XPath V1.0 (November 1999)  XSLT V1.0 (November 1999)  XHTML V1.0 (January 2000)  XML Schema V1.0 (May 2001)  XLink V1.0 (June 2001)  XPointer V1.0 (September 2001)  XSL V1.0 (October 2001)  XML Information Set V1.0 (October 2001)  XPath 2.0 WD (April 2002)

Why standards for encoding of digital data? The encoding of digital data is typically bound to a particular piece of software e.g. a text editor. Problems: longevity: rapid advances in technology make programs obsolete very soon, and the data bound to these programs becomes unreadable longevity: rapid advances in technology make programs obsolete very soon, and the data bound to these programs becomes unreadable interchange: difficult to use data on other platforms interchange: difficult to use data on other platforms exploitation: difficult to re-use the data for other purposes exploitation: difficult to re-use the data for other purposes intelligibility: the data are understandable only to the program (no public and stable specifications of the format) intelligibility: the data are understandable only to the program (no public and stable specifications of the format) validation: we don’t know whether certain data is written according to the format specification or not validation: we don’t know whether certain data is written according to the format specification or not

Language data text editors: very loose encoding, too oriented to the visual appearance of text text editors: very loose encoding, too oriented to the visual appearance of text databases: too rigid encoding, does not allow for mixture of content (text) and structure (markup) databases: too rigid encoding, does not allow for mixture of content (text) and structure (markup) ISO 8879 ISO 8879 SGML (Standard Generalised Markup Language), 1986 defined a language for the representation of texts that will be processed by computer programs

SGML it defined an encoding which is: very general, as it is a “ metalanguage ” (a language for describing other languages) and lets you design your own customised markup languages for different types of documents very general, as it is a “ metalanguage ” (a language for describing other languages) and lets you design your own customised markup languages for different types of documents interchangable between computer platforms resistant to changes in technology enables the use of documents for various purposes enables automatic validation whether a certain document is compliant with the standard

Problems with SGML the standard is very complex the standard is very complex software for using it was either very expensive or very “academic” software for using it was either very expensive or very “academic” the conversion of existing documents into SGML was expensive the conversion of existing documents into SGML was expensive so, the use of SGML was limited to large companies or academia so, the use of SGML was limited to large companies or academia

The Web HTML was an applicatoin of SGML HTML was an applicatoin of SGML but SGML compliant HTML is used by very few web pages.. but SGML compliant HTML is used by very few web pages.. HTML is also not expressive enough for the encoding of arbitrary web data HTML is also not expressive enough for the encoding of arbitrary web data the need for a new standard for encoding web data that would have all the advantages of SGML without its weaknesess the need for a new standard for encoding web data that would have all the advantages of SGML without its weaknesess  eXtended Markup Language, XML (1998)  eXtended Markup Language, XML (1998)

XML now XML became very popular, and is becoming the universal medium for interchange of (language) data XML became very popular, and is becoming the universal medium for interchange of (language) data many related standards many related standards many freely available tools for processing XML many freely available tools for processing XML many programs support import and export of data in XML many programs support import and export of data in XML

What is XML? XML is a definition of device-independent, system-independent methods of storing and processing texts in electronic form XML is a definition of device-independent, system-independent methods of storing and processing texts in electronic form XML is a project of W3C; hence, it is an open and non-proprietary specification XML is a project of W3C; hence, it is an open and non-proprietary specification XML is a subset of SGML XML is a subset of SGML XML is a “ metalanguage ” -- a language for describing other languages -- which lets you design your own customised markup languages for different types of documents XML is a “ metalanguage ” -- a language for describing other languages -- which lets you design your own customised markup languages for different types of documents

What is a Markup Language? markup (equivalently, encoding) markup (equivalently, encoding)  making explicit an interpretation of text markup language markup language  a set of markup conventions used together for encoding texts. A markup language must specify: A markup language must specify:  how markup is to be distinguished from text,  what the markup means,  what markup is allowed,  what markup is required

Structure of XML documents <poem> The SICK ROSE The SICK ROSE O Rose thou art sick. O Rose thou art sick. The invisible worm, The invisible worm, That flies in the night That flies in the night In the howling storm: In the howling storm: Has found out thy bed Has found out thy bed Of crimson joy: Of crimson joy: And his dark secret love And his dark secret love Does thy life destroy. Does thy life destroy. </poem> document = text + mark-up document = text + mark-up element = start tag + content + end tag element = start tag + content + end tag generic identifier = name of the tag generic identifier = name of the tag element contains text or elements or both (or nothing) element contains text or elements or both (or nothing)

XML data model The SICK ROSE O Rose thou art sick. The invisible worm, That flies in the night In the howling storm: Has found out thy bed Of crimson joy: And his dark secret love Does thy life destroy. The SICK ROSE O Rose thou art sick. The invisible worm, That flies in the night In the howling storm: Has found out thy bed Of crimson joy: And his dark secret love Does thy life destroy. ~ The SICK ROSE poemtitle stanza linelineline O Rose thou art sick The invisible worm, That flies in the night linelineline Has found out thy bed Of crimson joy And his dark secret love stanza line In the howling storm : line Does thy life destroy. serialization data model

Empty elements elements with content: … elements with content: … empty elements have no content: empty elements have no content: used for indicating “points” in the document, for example page breaks used for indicating “points” in the document, for example page breaks formally = formally =

Attributes used to describe properties of elements Example:... Example:... given as attribute-value pairs inside the start-tag given as attribute-value pairs inside the start-tag value must be inside matching quotation marks, single or double; value must be inside matching quotation marks, single or double; order in which attribute-value pairs are supplied inside a tag has no significance; order in which attribute-value pairs are supplied inside a tag has no significance; an XML processor can use the values of the attributes in any way it chooses; the id attribute is a slightly special case in that, by convention, it is always used to supply a unique value to identify a particular element occurrence, which may be used for cross reference purposes. an XML processor can use the values of the attributes in any way it chooses; the id attribute is a slightly special case in that, by convention, it is always used to supply a unique value to identify a particular element occurrence, which may be used for cross reference purposes.

Comments Comments can appear anywhere in text (but not in markup) Comments can appear anywhere in text (but not in markup) Comments start with Comments start with Comments cannot be nested and cannot contain -- Comments cannot be nested and cannot contain -- e.g. The SLICK ROSE O Rose thou art sick. e.g. The SLICK ROSE O Rose thou art sick. Note that in XML 'meta-markup' starts with <! or <? Note that in XML 'meta-markup' starts with <! or <?

Example: annotated corpus

Example: dictionary

Entities XML documents can also contain entity references, which are, when processing the document, substituted by their interpretation (the entity) XML documents can also contain entity references, which are, when processing the document, substituted by their interpretation (the entity) an entity reference starts with the character ampersand and ends with the semicolon: &…; an entity reference starts with the character ampersand and ends with the semicolon: &…; a few entities are predefinirane in XML: & lt ; = & amp ; = & & apos ; = ' & quot ; = " a few entities are predefinirane in XML: & lt ; = & amp ; = & & apos ; = ' & quot ; = " < and & are “magic” characters and must always be escaped when using them in the text: < and & are “magic” characters and must always be escaped when using them in the text: 1 < 2 must be written as 1 < 2 1 < 2 must be written as 1 < 2 Procter & Gamble must be written as Procter & Gamble Procter & Gamble must be written as Procter & Gamble entities are also used for other purposes entities are also used for other purposes

XML declaration Every XML document must begin with an XML declaration which does two things: specifies that this is an XML document, and which version of the XML standard it follows specifies which character encoding the document uses:   The default, and recommended, encoding is UTF-8

Minimal requirements the document starts with the XML declaration the document starts with the XML declaration tags and entities are correctly written Wrong: 1 &lt 2 tags and entities are correctly written Wrong: 1 &lt 2 the document must be a tree: the document must be a tree:  every start tag has a matching end-tag ( ≠ ≠ )  elements are correctly nested Wrong: … … …  elements are correctly nested Wrong: … … …  the document has a single top-level element  a well-formed XML document

Splot the mistake Hello world! Ho world! Ho

Another bad XML document <HTML><HEAD><TITLE>Links</TITLE></HEAD><BODY> Interesting WWW links Interesting WWW links <UL> W3C XML W3C XML Cover's pages Cover's pages </ul> Google Google </FORM></BODY></HTML>

Defining the rules A valid XML document conforms to rules which are stated in an external schema (“element grammar”) of some sort. A schema specifies:   names for all elements used   names and datatypes and (occasionally) default values for their attributes   rules about how elements can nest   and a few other things, depending on the schema   language n.b. A schema does not specify anything about what elements "mean"

In XML a schema is optional! XML allows you to make up your own tags, and doesn’t require a schema... The XML concept is dangerously powerful:   XML elements are light in semantics   one man’s is another’s (or is it?)   the appearance of interchangeability may be worse than its absence But XML is too good to ignore   mainstream software development   proliferation of tools   the language of the web

What can a schema (or DTD) do for you? ensure that your documents use only predefined elements, attributes, and entities enforce structural rules such as ‘every chapter must begin with a heading’ or ‘recipes must include an ingredient list’ make sure that the same thing is always called by the same name schema languages vary in the amount of validation they support

Schema languages Schemas can be written in:   XML DTD Language (inherited from SGML)   The W3C schema language (main successor of DTDs)   The ISO Relax NG schema language (mostly used by latest version of TEI)

A simple DTD XML document: <city> Graz Graz 285, ,470 Austria Austria DTD:

A more complex DTD <anothology> The SICK ROSE The SICK ROSE O Rose thou art sick. O Rose thou art sick. The invisible worm, The invisible worm, That flies in the night That flies in the night In the howling storm: In the howling storm: Has found out thy bed Has found out thy bed Of crimson joy: Of crimson joy: And his dark secret love And his dark secret love Does thy life destroy. Does thy life destroy. </anothology> An element definition gives: the name of the element the name of the element its content model its content model

Content Model Operators ( open bracket for grouping ( open bracket for grouping ) close bracket ) close bracket, follows, follows | or | or ? maybe ? maybe * repeated 0 or more times * repeated 0 or more times + repeated once or more times + repeated once or more times <!ELEMENT poem (title?, (line+ (line+ | (refrain?, (stanza, refrain?)+) (refrain?, (stanza, refrain?)+)) )>

Mixed content If an element contains #PCDATA and element content, #PCDATA must always appear as the first option in an alternation; the group containing it must use the star operator; it may appear once only, and in the outermost model group.

Content model ambiguity XML parsing is deterministic so content model must not be ambiguous.

Empty Content Empty elements do not have content. To distinguish them from those with content in well-formed XML documents, they have a special form: the tag ends with a slash. In the DTD: In the DTD: In the document:... The page ends here. Here starts a new one.... In the document:... The page ends here. Here starts a new one....

Attributes In the DTD: In the DTD: attribute name; type default <!ATTLIST table typeCDATA #IMPLIED allowed id ID #REQUIRED necessary status (draft| revised | revised | final ) "draft" default value final ) "draft" default value> In the XML document: In the XML document:

Entities in the DTD: &xml-url; "> in the DTD: &xml-url; "> in the document: Read about XML at &xml-ref;. in the document: Read about XML at &xml-ref;. after processing: Read about XML at after processing: Read about XML at

Character references Character references are used for cases where certain characters cannot represented (entered, stored, transmitted, displayed) directly. Character references are used for cases where certain characters cannot represented (entered, stored, transmitted, displayed) directly. Character reference starts with Character reference starts with  &# followed by the decimal number of the character, or by  &#x followed by the hexadecimal number of the character, and ending with ;, e.g.: Saarbrücken When processing, such references are substituted by their codepoint When processing, such references are substituted by their codepoint Codepoints can be found on the Unicode Web pages Codepoints can be found on the Unicode Web pagesUnicode Web pagesUnicode Web pages

External Entities External entity references are substituted by the contents of files: External entity references are substituted by the contents of files: External entities are referenced in the document just as internal ones are: &Chap1; &Chap2; External entities are referenced in the document just as internal ones are: &Chap1; &Chap2;

The Document Type Declaration Specifies: the root element of the document, the root element of the document, the external entity containing the DTD the external entity containing the DTD and/or the (part of the) DTD contained in the internal subset and/or the (part of the) DTD contained in the internal subsete.g. ]> ]>

A Complete Valid XML Document <!DOCTYPE anthology [ ]><anthology> The SICK ROSE The SICK ROSE O Rose thou art sick. O Rose thou art sick. The invisible worm, The invisible worm, That flies in the night That flies in the night In the howling storm: In the howling storm: Has found out thy bed Has found out thy bed Of crimson joy: Of crimson joy: And his dark secret love And his dark secret love Does thy life destroy. Does thy life destroy. </anthology>