Beyond HTML: Extensible Markup Language Timothy W. Cole Grainger Engineering Library Information Center University of Illinois at Urbana-Champaign American Association of Law Libraries 19 July 2000 t-cole3@uiuc.edu http://dli.grainger.uiuc.edu/Publications/TWCole/AALL_2000/
Ordered Hierarchy of Content Objects A Definition of Text in Computer Terms Premise: A Text is the Sum of its Components So a <BOOK> Could Be Defined as Containing: <FRONT_MATTER> <CHAPTER>s <BACK_MATTER> <FRONT_MATTER> Could Contain: <BOOK_TITLE> <AUTHOR>s <PUBLISHER> While Each <CHAPTER> Could Contain: <CHAPTER_TITLE> <SECTION>s And Each <SECTION> Could Contain: <SECTION_TITLE> <PARAGRAPH>s Components Chosen Reflect Anticipated Use
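A minimal sketch of the <BOOK> hierarchy described above, written as an XML instance and walked with Python's standard library. The element names come from the slide; the titles, names, and the outline() helper are placeholders introduced for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance of the <BOOK> hierarchy; all text values are placeholders.
BOOK_XML = """
<BOOK>
  <FRONT_MATTER>
    <BOOK_TITLE>Sample Title</BOOK_TITLE>
    <AUTHOR>First Author</AUTHOR>
    <AUTHOR>Second Author</AUTHOR>
    <PUBLISHER>Sample Publisher</PUBLISHER>
  </FRONT_MATTER>
  <CHAPTER>
    <CHAPTER_TITLE>Chapter One</CHAPTER_TITLE>
    <SECTION>
      <SECTION_TITLE>Section 1.1</SECTION_TITLE>
      <PARAGRAPH>First paragraph.</PARAGRAPH>
      <PARAGRAPH>Second paragraph.</PARAGRAPH>
    </SECTION>
  </CHAPTER>
  <BACK_MATTER/>
</BOOK>
"""

def outline(element, depth=0):
    """Return the element names as an indented outline, in document order."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

book = ET.fromstring(BOOK_XML)
book_outline = outline(book)
print("\n".join(book_outline))
```

Because every content object sits inside exactly one parent, the whole text can be addressed by path, for example book.findall("./CHAPTER/SECTION/PARAGRAPH").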
Ordered Hierarchy of Content Objects (continued) OHCO is a Useful, Albeit Imperfect Model More Powerful Than Model of Text as a Stream of Characters & Formatting Instructions Does Not Allow for Overlapping Content Objects OHCO Model is Inherent in XML, HTML XML Designed for Descriptive Content Objects, Not Presentational Content Objects XML Syntax is Fixed, But Semantics is Extensible
XML Basics: Markup & Content Consider: <?xml version='1.0' ?> <!-- This is an Example --> <author sequence='first'> <LName> Colè </LName>, <FName> Tim </FName> </author> Would Display As: Colè, Tim This example illustrates: XML Processing Instructions XML Comments (Ignored by XML Applications) XML Element Markup, Including an Attribute XML Content, Including an Entity
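The example can be checked with a small sketch like the one below. It assumes the accented character was written in the source as the numeric entity &#232; (the slide's exact entity is not recoverable), and the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Assumption: the accented character is written as the numeric entity &#232;.
SOURCE = (
    "<?xml version='1.0' ?>"
    "<!-- This is an Example -->"
    "<author sequence='first'>"
    "<LName>Col&#232;</LName>, <FName>Tim</FName>"
    "</author>"
)

author = ET.fromstring(SOURCE)
sequence = author.get("sequence")        # the attribute value
displayed = "".join(author.itertext())   # character content with markup removed
print(sequence, displayed)
```

The parser resolves the entity to a single character and discards the comment, so only the element markup, the attribute, and the content reach the application.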
XML Basics (continued) “Well-Formed” XML Rules: XML Element Markup is Case-Sensitive All XML Tags Must Be Closed Hierarchical Nesting; No Overlapping Elements All XML Attribute Values Must Be Quoted Enforces Stricter Syntax than HTML Facilitates Fast, Efficient Parsing Extensible Semantics Provide Flexibility “Well-Formed” More Lightweight Than SGML
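The well-formedness rules above can be exercised with a non-validating parser. This is a sketch only: the sample strings and the is_well_formed() helper are invented for illustration, one sample per rule.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """True if the parser accepts the text, False if it reports a syntax error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# One hypothetical sample per rule on the slide.
samples = {
    "well-formed": "<author><LName>Cole</LName></author>",
    "case mismatch": "<author><LName>Cole</lname></author>",
    "unclosed tag": "<author><LName>Cole</author>",
    "overlapping elements": "<a><b></a></b>",
    "unquoted attribute": "<author sequence=first>Cole</author>",
}
results = {label: is_well_formed(text) for label, text in samples.items()}

for label, ok in results.items():
    print(label, "->", "accepted" if ok else "rejected")
```

An HTML browser of the period would typically tolerate all five samples; an XML parser rejects the last four outright, which is what makes parsing fast and predictable.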
Is It Valid Or Well-Formed? When Does It Matter? All Web Browsers Need Is Well-Formed XML Authoring Tools Need To Validate Otherwise Tower of Babel Ensues Indexing Agents & Schema-Specific Rendering Agents May Need To Validate Illustrations: Malformed XML Well-Formed But Invalid XML Valid XML
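The three cases can be told apart roughly as sketched below. Python's standard library does not include a validating parser, so the content model here is a hand-written stand-in (a hypothetical rule that <author> contains exactly <LName> followed by <FName>); a real application would validate against a DTD or XML Schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical content model, standing in for a DTD or schema:
# <author> must contain exactly <LName>, then <FName>.
EXPECTED_CHILDREN = ["LName", "FName"]

def classify(text):
    """Return 'malformed', 'well-formed but invalid', or 'valid'."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return "malformed"
    children = [child.tag for child in root]
    if root.tag == "author" and children == EXPECTED_CHILDREN:
        return "valid"
    return "well-formed but invalid"

print(classify("<author><LName>Cole</LName>"))
print(classify("<author><FName>Tim</FName></author>"))
print(classify("<author><LName>Cole</LName><FName>Tim</FName></author>"))
```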
Library Uses of XML: Using XML for Primary Sources Facilitates Searching Full-Text Searching & Field-Specific Searching More Meaningful Proximity Searching Better Retrieval / Browsing Selective Views / Suppression of Personal Data Re-Ordered & Piecemeal Views Illustration -- Illinois Agronomy Handbook Search Browsing
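A rough sketch of the difference between full-text searching, field-specific searching, and a selective view. The records, element names, and helper functions are all hypothetical and are not taken from the Illinois Agronomy Handbook.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical records; element names and values are placeholders.
RECORDS = ET.fromstring("""
<records>
  <record>
    <title>Corn Hybrids</title>
    <author>Smith</author>
    <contact>smith@example.org</contact>
    <body>Notes that mention Jones in passing.</body>
  </record>
  <record>
    <title>Soil Testing</title>
    <author>Jones</author>
    <contact>jones@example.org</contact>
    <body>Sampling depth and timing.</body>
  </record>
</records>
""")

def full_text_search(root, term):
    """Titles of records where the term appears anywhere in the record."""
    return [r.findtext("title") for r in root.findall("record")
            if term in "".join(r.itertext())]

def field_search(root, field, term):
    """Titles of records where the term appears in one named element."""
    return [r.findtext("title") for r in root.findall("record")
            if term in (r.findtext(field) or "")]

def public_view(root, suppressed=("contact",)):
    """Copy of the tree with the suppressed elements removed."""
    view = copy.deepcopy(root)
    for record in view.findall("record"):
        for child in list(record):
            if child.tag in suppressed:
                record.remove(child)
    return view

print(full_text_search(RECORDS, "Jones"))
print(field_search(RECORDS, "author", "Jones"))
```

The full-text search returns both records, while the field-specific search returns only the one whose author is Jones; public_view() shows how personal data can be withheld without altering the source.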
Library Uses of XML: XML for Metadata & Wrapping Facilitates Interchange, Normalization, ... Simpler than Fixed Fields, Record Headers, Etc. XML Implementations of Metadata Standards, e.g.: RDF, EAD, DC, FGDC, US-MARC Easier Routing / Handling of Specialized Content In Combination with Primary Source XML Automatic Extraction of Metadata From Source Facilitates Authority Control
Library Uses of XML: XML for Document Management Smarter Documents XML Namespaces -- Integrating Multiple XML Schemas (Including XHTML) Rights Management, Technical Requirements,… Facilitates Enhanced Linking Between Docs. Creation of Links From Marked Up Content Easy to Add or Modify Links Over Time XLink & XPointer Promise More Robust Linking Metadata File from Illinois DLIB Testbed Schema Integrates RDF, DC, & Project Design
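As an illustration of how namespaces let one document draw on several schemas, the sketch below mixes Dublin Core and XHTML elements inside a wrapper. The <item> wrapper, the text values, and the link target are hypothetical; only the two namespace URIs are standard.

```python
import xml.etree.ElementTree as ET

NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "xhtml": "http://www.w3.org/1999/xhtml",
}

# Hypothetical wrapper document combining two vocabularies.
DOC = ET.fromstring("""
<item xmlns:dc="http://purl.org/dc/elements/1.1/"
      xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <dc:title>Sample Report</dc:title>
  <dc:rights>Placeholder rights statement</dc:rights>
  <xhtml:p>See the <xhtml:a href="part2.xml">second part</xhtml:a>.</xhtml:p>
</item>
""")

title = DOC.findtext("dc:title", namespaces=NS)
links = [a.get("href") for a in DOC.iterfind(".//xhtml:a", NS)]
print(title, links)
```

Each element is identified by its namespace URI plus its local name, so a Dublin Core title cannot be confused with a title element from another schema.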
Components of XML Implementations DTDs & XML Schemas Use Either to: Define Content Models Declare Attributes & Entities DTDs Inherited from SGML DTDs Themselves Not Well-Formed XML Limits on Detail of Content Model Definitions Minimal Data Typing XML Schemas Are Well-Formed XML Data Typing & Better Content Models Supported Not Yet in Widespread Use
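A small sketch of what a DTD's declarations look like, using an internal subset so the example is self-contained. The content model, attribute, and the uiuc entity are hypothetical. Note that the parser used here reads the declarations and expands the entity but does not validate the content model.

```python
import xml.etree.ElementTree as ET

# Hypothetical DTD in the internal subset: content models, an attribute, an entity.
# The DTD syntax itself (<!ELEMENT ...>) is not well-formed XML element markup.
DOC = """<!DOCTYPE author [
  <!ELEMENT author (LName, FName, Affil)>
  <!ELEMENT LName (#PCDATA)>
  <!ELEMENT FName (#PCDATA)>
  <!ELEMENT Affil (#PCDATA)>
  <!ATTLIST author sequence CDATA #REQUIRED>
  <!ENTITY uiuc "University of Illinois at Urbana-Champaign">
]>
<author sequence="first">
  <LName>Cole</LName>
  <FName>Tim</FName>
  <Affil>&uiuc;</Affil>
</author>"""

root = ET.fromstring(DOC)
affiliation = root.findtext("Affil")   # the declared entity has been expanded
print(affiliation)
```

The content model can say that <LName> is character data, but a DTD has no way to require, say, a date or an integer; that kind of data typing is what XML Schemas add.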
Components of XML Implementations Encoding & Entities (Using Characters Not on Your Keyboard) Computers Use 1s and 0s, but Characters form the Basis of Human-Readable Texts Coded Character Sets (CCS) Assign Integer Values to Characters -- ASCII, ISO 8859, Unicode Character Encoding Schemes (CES) Map Those Integers to Bytes -- 7-bit, 8-bit, UTF-8 Bytes Are Then Rendered as Glyphs by Your Computer, Using Font Appropriate to CCS/CES Font Unavailable Or CCS/CES Misunderstood Results in Incorrect Character(s) on Screen
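The distinction between a coded character set and a character encoding scheme can be seen with a single character. The sketch below uses è as the example; the variable names are illustrative only.

```python
char = "\u00e8"  # è, LATIN SMALL LETTER E WITH GRAVE

# Coded character set: Unicode (like ISO 8859-1) assigns this character the integer 232.
code_point = ord(char)

# Character encoding schemes: the same integer becomes different bytes.
latin1_bytes = char.encode("iso-8859-1")   # one byte
utf8_bytes = char.encode("utf-8")          # two bytes

# CCS/CES misunderstood: UTF-8 bytes interpreted as ISO 8859-1.
misread = utf8_bytes.decode("iso-8859-1")

print(code_point, latin1_bytes, utf8_bytes, misread)
```

The last value is the familiar two-character garble that appears on screen when the receiving software guesses the wrong encoding.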
Components of XML Implementations Encoding & Entities (continued) Common Ways to Deal With This Problem: Select CCS/CES Appropriate to Language Use Default CCS/CES, but Override Default Font Use XML/HTML Named or Numeric Entity HTML Understands Non-Extensible Set of Named Entities XML Understands Numeric Entities Corresponding to Unicode CCS, All Named Entities Must Be Declared in DTD Use Unicode for CCS, UTF-8 for CES - XML Defaults An Illustration in HTML
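The difference between HTML's built-in named entities and XML's declared ones can be sketched as follows. The <LName> element and its content are placeholders; the point is which entity forms each parser accepts.

```python
import html
import xml.etree.ElementTree as ET

# HTML understands a fixed set of named entities, including &egrave;.
html_named = html.unescape("Col&egrave;")

# XML understands numeric entities that refer to Unicode code points.
xml_numeric = ET.fromstring("<LName>Col&#232;</LName>").text

# The same named entity is an error in XML unless it has been declared.
try:
    ET.fromstring("<LName>Col&egrave;</LName>")
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False

# Declaring the entity in a DTD (here, an internal subset) makes it usable.
declared = ET.fromstring(
    '<!DOCTYPE LName [<!ENTITY egrave "&#232;">]><LName>Col&egrave;</LName>'
).text

print(html_named, xml_numeric, undeclared_ok, declared)
```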
Components of XML Implementations Presentation - CSS Style Sheets XML Content Objects Have No Style Use Cascading Style Sheets (CSS) Work Like CSS for HTML, Except: Must Be Explicit About Everything No Special Treatment of Class & ID Attributes Attach CSS to XML Using Special XML PI CSS Does Define Formatting CSS DOES NOT Reorganize or Add Content Simple XML-CSS Example; The CSS Used
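A sketch of how a style sheet is attached with the xml-stylesheet processing instruction. The file name author.css, the style rules, and the sample document are hypothetical; the Python code only confirms that the instruction is present in the parsed document, since rendering is the browser's job.

```python
from xml.dom import minidom

# Hypothetical style sheet. XML elements have no default style, so even
# block versus inline display has to be stated explicitly.
CSS = """
author { display: block; margin: 1em; }
LName  { display: inline; font-weight: bold; }
FName  { display: inline; font-style: italic; }
"""

# The processing instruction sits before the root element.
XML = (
    '<?xml version="1.0"?>'
    '<?xml-stylesheet type="text/css" href="author.css"?>'
    "<author><LName>Cole</LName>, <FName>Tim</FName></author>"
)

doc = minidom.parseString(XML)
pi = doc.firstChild                      # the xml-stylesheet instruction
print(pi.target, pi.data)
print(doc.documentElement.tagName)
```

The style rules select on element names directly; nothing in them changes the order of <LName> and <FName> or adds text, which is the limitation the slide points out.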
Components of XML Implementations Transformations - XSLT Style Sheets Some Characteristics of XSLT Style Sheets XSLT Files Are Well-Formed XML XSLT Transform to Another Schema, Or to XHTML XSLT Objects Have Implicit Functionality Attach XSLT To Document Using XML PI XSLT Can Reorganize & Add Content Still Need CSS for Presentation -- CSS Style Sheets Work on the Output of XSLT Processing Supplement XSLT With Script To Manipulate & Modify Actual Content Simple XSLT Example; The XSLT Style Sheet
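The sketch below shows a small, hypothetical XSLT style sheet that reorders a name and adds a label. Python's standard library has no XSLT processor, so the code only confirms that the style sheet is itself well-formed XML, and transform_by_hand() is a plainly labeled stand-in that builds the same output the template describes.

```python
import xml.etree.ElementTree as ET

XSL_NS = "http://www.w3.org/1999/XSL/Transform"

# Hypothetical style sheet: emits first name before last name and adds "Author: ".
STYLESHEET = """<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="author">
    <p>Author: <xsl:value-of select="FName"/><xsl:text> </xsl:text><xsl:value-of select="LName"/></p>
  </xsl:template>
</xsl:stylesheet>"""

# An XSLT file is well-formed XML, so an ordinary XML parser can read it.
sheet = ET.fromstring(STYLESHEET)
patterns = [t.get("match") for t in sheet.iter("{%s}template" % XSL_NS)]

def transform_by_hand(author):
    """Stand-in for an XSLT processor: builds the <p> the template describes."""
    p = ET.Element("p")
    p.text = "Author: %s %s" % (author.findtext("FName"), author.findtext("LName"))
    return p

source = ET.fromstring("<author><LName>Cole</LName><FName>Tim</FName></author>")
result = ET.tostring(transform_by_hand(source), encoding="unicode")
print(patterns, result)
```

Unlike CSS, the transformation changes the order of the content and adds new text; a CSS style sheet would then be applied to the resulting <p> for presentation.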
The State-of-the-Art in XML Tools XML Authoring Add-Ons to Established Word Processors, e.g.: WordPerfect 9 / WordPerfect 2000 Tools With SGML Roots, e.g.: ArborText’s Epic (was Adept) Editor SoftQuad’s XMetaL Editor New XML Tools, e.g.: Vervet Logic’s XMLPro Extensibility’s XML Authority / XML Turbo So Far, There Are Fewer Authoring Tools Customized for Specialized XML Schemas
The State-of-the-Art in XML Tools (continued) XML Presentation Tools: Latest Releases of Netscape Navigator/Mozilla, and Microsoft’s Internet Explorer Support XML -- But Support is Generic, Partial, & Uneven Plug-Ins, Standalones Available / In Work for Advanced XML Schemas (CML, MML, VML,…) XML Database Integration Tools: Add-Ons to Established DBMS Available/In Work Microsoft SQL Server-XML Technology Preview Illustration; With Query & CSS; XML Source File; XML Query Language Specification In Work
Developing XML Applications: The Politics of XML Evolution of XML XML Formalized as W3C Recommendation 2/98 Numerous Ancillary Specs Released & In Work Namespaces, XSLT, XLink/XPointer, XML Signature Numerous Early Implementors (Chemistry, Biology, Multimedia, Metadata) Prerequisites for Community Implementations Identify Target(s) of Opportunity Define Horizontal & Vertical Content Objects Consensus Building & Community Buy-In Test Implementations & Tool Building
Developing XML Applications: The Politics of XML (continued) Status of XML In Legal Community LegalXML Has Identified Targets Begun Process of Defining Content Objects & Building Consensus Progress in Some Areas, e.g.: Court Filing (see also XML Court Interface) Less Visible Progress in Other Workgroups, e.g.: Reference, Public Law, Users Presence (& Vested Interests) of Extensive Non-XML Legal Automation Systems In Place Lessens Motivation
Developing XML Applications: The Politics of XML (continued) Status of XML In Publishing & Libraries Extensive XML Work in Metadata Unfortunately Has Led to Competing Stds. Many Publishers Have Been Using SGML for a Decade or More -- But Only Internally Perceived Tradeoff (probably overrated): Publicly Releasing Primary Sources in XML vs. Control of Product & Marketplace Problems with Early SGML Web Experiments No One Wants to be First But No One Wants to be Last Either
Future Directions Continued Evolution of Standards, Tools Continued Development of Community Implementations -- Selected Disciplines Increased Use of XML Behind the Scenes Carryover from SGML Trends Integration of XML with Databases XML Unlikely to Replace HTML, Other Document Formats, But Will Co-Exist Magnitude of Role in Law Libraries Uncertain, but Likely to Have At Least Some Role