Documents and Document Modeling Week 12 Lecture notes INF 380E: Perspectives on Information 1.



The Computational Perspective (review)
Computing systems let us store and manipulate information.
The only thing a computing system can do is follow instructions.
An algorithm is a set of instructions: an ordered set of unambiguous, executable steps that defines a terminating process.
A program is a representation of an algorithm encoded for use by a computing system.

Abstraction
Abstraction is the “distinction between the external properties of an entity and the details of the entity’s internal composition”.
Abstraction allows development of algorithmic processing for classes of objects.
We can define a data structure that encodes information, and algorithms to process and manipulate that information.

Abstraction
How do we decide on a general data structure for some variety of information object?
– We determine the important, essential features that we need to represent in order to achieve our desired functions.
– We define an abstract class that generalizes over the individual objects.

Abstraction for document processing
Document processing systems use computers to manipulate and process information in documents
– frequently using XML and similar technologies
Document processing relies on document modeling
– “A set of techniques for designing systems and representing information in order to make more efficient and more functional the creation, management, and exploitation of document-like content”

Early Text Processing
Files with text and markup:
.pa odd;.font Times;.size 14;.it;.ce;.in +5 -5;.sk 3p b ;.sk 2p a;.kp next;.toc include; Assembling Your Silex […]
were “batch” processed, creating formatted text:
Assembling Your Silex

Abstracting from the specific
Phase 1: Simple Macros
– The text and macro “call” in the main source file:
:format17;Assembling Your Silex [...]
– The macro “expansion” in a (possibly) separate location:
format17 = { “.pa odd;.font Times;.size 14;.it;.ce;.in +5 -5;.sk 3p b ;.sk 2p a;.kp next;.toc include” }

... to the general
Phase 2: Descriptive Markup
– The text and descriptive markup in the main source file:
:title;Assembling Your Silex [...]
– The macro “expansion” in a (possibly) separate location:
title = { “.pa odd;.font Times;.size 14;.it;.ce;.in +5 -5;.sk 3p b ;.sk 2p a;.kp next;.toc include” }

Indirection
Indirection in information systems allows for greater modularity
– e.g. separating the formatting instructions from the textual component they apply to
In this case, it was a first step towards recognizing the "document genre" and defining an abstract class around it.
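
The indirection in the macro slides can be sketched in a few lines of Python. This is only an illustration: the `MACROS` table and the `expand` function are hypothetical names, not part of any historical text formatter, and the formatting codes are copied from the slides. The point is that the source file names the component ("title") while the formatting lives in one separate, replaceable place.

```python
# Hypothetical macro table: a descriptive name maps to low-level formatting codes.
# The codes are copied from the slides; the table and function names are assumptions.
MACROS = {
    "title": ".pa odd;.font Times;.size 14;.it;.ce;.in +5 -5;.sk 3p b;.sk 2p a;.kp next;.toc include",
}

def expand(line: str, macros: dict[str, str]) -> str:
    """Replace a leading ':name;' macro call with the formatting codes it stands for."""
    if not line.startswith(":"):
        return line  # plain text, nothing to expand
    name, sep, text = line[1:].partition(";")
    if not sep or name not in macros:
        return line  # not a known macro call; leave it untouched
    return f"{macros[name]}; {text}"

source = ":title;Assembling Your Silex"
print(expand(source, MACROS))
```

Changing how every title looks now means editing one entry in the table, not every title in every source file.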

Typical text components
title
author
date
abstract
section, subsection, subsubsection
section title, subsection title, etc.
paragraph
extract (long quotation)
equation
diagram
footnote

Genre-specific text components
Playscripts: act, scene, stage direction, line, character, cast list
Poetry: title, author, verse, stanza, couplet, line, half-line
Scientific article: title, author, affiliation, address, date submitted, date revised, keywords, abstract, introduction, methodology, results, discussion, conclusion, diagram, equation, plate, graph, chart, bibliography, bibliography item, ...

Document modeling
Analysis: Conceptually distinguish the “logical” structure of a document from its appearance.
– Sometimes called “document analysis”
Markup: Identify the logical components of a document with descriptive markup (“tags”) from a given markup vocabulary.
To exploit the marked-up document:
– Develop and apply a stylesheet that associates each markup tag (type) with the appropriate processing instructions
– Or invoke or adapt an existing stylesheet
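
As a rough illustration of the stylesheet idea, the sketch below associates each tag with a processing rule and applies the rules to a small marked-up document. Everything here is an assumption made for the example: the `STYLESHEET` table, the `render` function, the element names, and the sample text. A real system would typically use a dedicated stylesheet language; here a plain Python dictionary stands in for one.

```python
import xml.etree.ElementTree as ET

# Hypothetical stylesheet: each tag (element type) maps to a rendering rule.
STYLESHEET = {
    "title": lambda text: text.upper(),
    "author": lambda text: f"by {text}",
    "para": lambda text: f"    {text}",
}

def render(xml_text: str, stylesheet) -> list[str]:
    """Render each child of the root element with the rule registered for its tag."""
    root = ET.fromstring(xml_text)
    lines = []
    for child in root:
        rule = stylesheet.get(child.tag, lambda text: text)  # default: pass text through
        lines.append(rule((child.text or "").strip()))
    return lines

doc = (
    "<article><title>Assembling Your Silex</title>"
    "<author>A. Writer</author><para>Unpack the parts.</para></article>"
)
print("\n".join(render(doc, STYLESHEET)))
```

The document itself says nothing about fonts or indentation; swapping in a different stylesheet changes the output without touching the markup.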

Document modeling activity
Form a group with someone you didn't work with last week.
Identify the content objects / textual components of a recipe.
Do they consist of:
– other components
– character data
– a mixture of character data and other components
How do they occur?
– exactly once
– zero or one time
– one or more times
– zero, one, or more times
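
The four occurrence options in the activity correspond to the DTD occurrence indicators: no suffix (exactly once), `?` (zero or one), `+` (one or more), and `*` (zero or more). The sketch below is one possible way to experiment with them in Python. The recipe element names and the `RECIPE_MODEL` list are illustrative assumptions, not an answer key for the activity, and the checker only looks at the direct children of the root element.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical content model for a recipe, using DTD-style occurrence indicators:
# no suffix = exactly once, ? = zero or one, + = one or more, * = zero or more.
RECIPE_MODEL = ["title", "yield?", "ingredient+", "step+", "note*"]

def model_to_regex(model: list[str]) -> re.Pattern:
    """Translate the content model into a regex over space-terminated tag names."""
    parts = []
    for item in model:
        name, suffix = (item[:-1], item[-1]) if item[-1] in "?+*" else (item, "")
        parts.append(f"(?:{re.escape(name)} ){suffix}")
    return re.compile("".join(parts))

def children_match(xml_text: str, model: list[str]) -> bool:
    """Check that the root's children occur in the order and number the model allows."""
    root = ET.fromstring(xml_text)
    sequence = "".join(f"{child.tag} " for child in root)
    return model_to_regex(model).fullmatch(sequence) is not None

good = ("<recipe><title>Tea</title><ingredient>water</ingredient>"
        "<ingredient>tea</ingredient><step>Boil</step></recipe>")
bad = "<recipe><title>Tea</title><step>Boil</step></recipe>"  # no ingredient
print(children_match(good, RECIPE_MODEL), children_match(bad, RECIPE_MODEL))
```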

<!DOCTYPE recipe [ ]>

Two kinds of abstraction in document modeling
Genre level
– Determining the "document model" for some type of document
– Results in a schema: an XML vocabulary for marking up members of that document genre
Document level
– Determining what documents are in general
– Results in the high-level XML data structure: a tree (a kind of directed acyclic graph)

Data structures
Generally, “the conceptual shape or arrangement of data”.
You want a data structure that
– fits the conceptual shape of your information
– lets you access and manipulate information according to your needs

The best abstraction is one that captures what the thing really is
"No hardware improvements or programming ingenuity can completely overcome a flawed representation."
We need representations that will aid us in “collecting, preserving, organizing (arranging), representing (describing), selecting (retrieving), reproducing (copying), and disseminating documents”.

The OHCO Vision of What Text is
Text is an Ordered Hierarchy of Content Objects, the grammar of which is determined by genre
– content objects = things such as chapters, paragraphs, sentences, stanzas, lines, speeches, equations, titles, headings, abstracts
– hierarchy = sentences inside paragraphs, paragraphs inside sections, sections inside chapters, etc., nesting with no overlaps
– ordered = objects precede or follow one another
Formal structure: a tree with ordered branches
– syntax expressible with the context-free generative grammars developed in linguistics
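
One way to see the "ordered hierarchy" concretely is to parse a small marked-up text and walk it depth-first, left to right. The sketch below uses Python's standard `xml.etree.ElementTree`; the `outline` function and the sample poem are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

def outline(element, depth=0):
    """Return (depth, tag) pairs in document order: a depth-first, left-to-right walk."""
    pairs = [(depth, element.tag)]
    for child in element:  # children are kept in document order
        pairs.extend(outline(child, depth + 1))
    return pairs

poem = "<poem><title>Song</title><stanza><line>one</line><line>two</line></stanza></poem>"
for depth, tag in outline(ET.fromstring(poem)):
    print("  " * depth + tag)
```

The indentation in the printed outline shows the hierarchy, and the order of the lines shows the ordering of the content objects.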

The Two Things in the XML World
Document Instances
– particular documents, marked up with a markup language
Schemas
– One for each document type (class, category, genre)
– Often playing the role of data structure standard
– Define a markup language for document structures by specifying its vocabulary and syntax (grammar), including: what elements can occur in documents of a particular type, what patterns these elements may form, and what other information can be included about these elements
Rules for applying the markup to documents are not a formal part of XML per se, but are... which kind of standard?
– (from Gilliland...)

Documents and Languages
An XML document is like a sentence in a formal language.
The schema (e.g. DTD) is a formal grammar for the language.
– DTDs are based on BNF meta-grammars.
– They define “context free grammars” (type 2 in the Chomsky hierarchy).
– If a document conforms to the schema it is in the language; otherwise it isn’t.
An XML document is a linearized parse tree
– using a form of the “labeled nested bracket” linearization technique.
Grammars are a technique for information modeling (for describing schemas and instances of schemas) that is well-suited for documents and text.
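
To make the "linearized parse tree" point concrete, the sketch below writes the same small tree out twice: once in labeled nested bracket notation and once as XML. The `(label, children)` tuple representation, the two function names, and the sample tree are assumptions chosen for the example.

```python
from xml.sax.saxutils import escape

def to_brackets(node) -> str:
    """Linearize a (label, children) tree as labeled nested brackets."""
    label, children = node
    inner = " ".join(c if isinstance(c, str) else to_brackets(c) for c in children)
    return f"[{label} {inner}]"

def to_xml(node) -> str:
    """Linearize the same tree as XML: start tag, content, end tag."""
    label, children = node
    inner = "".join(escape(c) if isinstance(c, str) else to_xml(c) for c in children)
    return f"<{label}>{inner}</{label}>"

tree = ("poem", [("title", ["Song"]),
                 ("stanza", [("line", ["one"]), ("line", ["two"])])])
print(to_brackets(tree))
print(to_xml(tree))
```

Both outputs encode the same tree; XML start and end tags play the role of the labeled opening and closing brackets.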

XML languages
XML is a “meta-grammar” that lets you define document markup languages.
The elements of an XML vocabulary are based on abstractions of the specific components that appear in some genre of document.
A document model is an abstract conceptualization of a class of documents
– it identifies the possible components of a document and the relationships those components may have.
A schema (small "s") is (more or less) a formal specification of a document model in a particular document model specification language.
EAD is an XML language
– with elements that are based on the components of a finding aid
TEI is an XML language
EML (Ecological Metadata Language) is an XML language

XML data structure
The underlying data structure for an XML document is a “tree”
– a rooted directed acyclic graph in which every node except the root has exactly one parent
– with ordered branches
This hierarchical structure can be parsed and validated against a grammar
– the grammar is an abstraction of a class of documents
Good for documents or data with hierarchical structure
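
What separates a tree from a general directed acyclic graph is that every node other than the root has exactly one parent. A minimal sketch of that property, using the parent-map idiom with Python's standard `xml.etree.ElementTree` (the `parent_map` name and the sample document are assumptions):

```python
import xml.etree.ElementTree as ET

def parent_map(root):
    """Map each non-root element to its single parent element."""
    return {child: parent for parent in root.iter() for child in parent}

doc = ET.fromstring(
    "<poem><title>Song</title><stanza><line>one</line><line>two</line></stanza></poem>"
)
parents = parent_map(doc)
for element in doc.iter():
    parent = parents.get(element)
    print(element.tag, "<-", parent.tag if parent is not None else "(root)")
```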

XML schemas

Validity and Well-formedness
Document instances can be
Well-Formed
– elements must be bounded by both start and end tags (or written as an empty-element tag)
– elements must nest, no overlaps
– attribute values must be quoted
– attribute/value assignments must not be “minimized”
– all “<” and “&” in content must be escaped
Valid
– Document instance correctly matches the rules given in a schema
– "If documents are of known types, a special-purpose program (called a parser), once provided with an unambiguous definition of a document type, can check that any document claiming to be of that type does in fact conform to the specification."
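
The difference between the two checks can be tried out in Python. Well-formedness is tested by simply attempting to parse with the standard `xml.etree.ElementTree`. That module does not validate against a DTD, so the `is_valid` function below is only a toy stand-in: it checks which elements may appear inside which, using a hypothetical `ALLOWED` table, and ignores order and occurrence counts.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text: str) -> bool:
    """A document is well-formed if an XML parser can parse it at all."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Hypothetical, simplified "schema": which child elements each element may contain.
ALLOWED = {"poem": {"title", "stanza"}, "stanza": {"line"}, "title": set(), "line": set()}

def is_valid(xml_text: str, allowed: dict[str, set[str]]) -> bool:
    """Toy validity check: every element is declared and appears only where allowed."""
    if not is_well_formed(xml_text):
        return False  # validity presupposes well-formedness
    root = ET.fromstring(xml_text)
    for parent in root.iter():
        if parent.tag not in allowed:
            return False
        if any(child.tag not in allowed[parent.tag] for child in parent):
            return False
    return True

overlapping = "<p><b>bold <i>both</b> italic</i></p>"   # elements overlap
misplaced = "<poem><line>one</line></poem>"             # well-formed, but line is outside a stanza
print(is_well_formed(overlapping), is_well_formed(misplaced), is_valid(misplaced, ALLOWED))
```

A document can pass the first check and fail the second, but never the other way around.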

Example DTD for poems
<!ELEMENT author (#PCDATA)>
<!ELEMENT line (#PCDATA | italic | persname)*>
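
The `line` declaration above describes mixed content: character data interleaved with `italic` and `persname` elements. The sketch below shows how such a line looks once parsed with Python's standard `xml.etree.ElementTree`, where character data is held in each element's `text` and `tail`. The sample line and the `mixed_content` helper are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

line = ET.fromstring(
    "<line>Sing, <persname>Muse</persname>, of <italic>wrath</italic></line>"
)

def mixed_content(element):
    """List the element's content in order: strings for character data, tags for child elements."""
    parts = []
    if element.text:
        parts.append(element.text)        # character data before the first child
    for child in element:
        parts.append(f"<{child.tag}>")
        if child.tail:
            parts.append(child.tail)      # character data following this child
    return parts

print(mixed_content(line))
print("".join(line.itertext()))
```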

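XML Processing
[Diagram: a schema (DTD, XSD, RNG, etc.) and a document instance are read by an XML parser, which reports well-formedness (yes/no), validity (yes/no), and errors, and passes expanded and reorganized parsed data to an XML application (e.g. a formatter). The application combines the parsed data with other information (e.g. a stylesheet) to produce output (e.g. formatted output).]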

Discussion break: interoperability
Find a partner or two and discuss:
Recall the levels of interoperability
– if I send you a well-formed XML document, which level have we achieved?
– if I send you a schema and a valid XML document, which level have we achieved?

What is a document, though?
DeRose et al. look inside documents to ask "the question of essentials"
– "What is it which, if changed, makes a document essentially different, and what is it which can change, yet a document remains 'the same'?"
Buckland looks outside to ask how documents have been treated as an object of study.
– Just what is it that we are working on when we're “collecting, preserving, organizing (arranging), representing (describing), selecting (retrieving), reproducing (copying), and disseminating documents”?

A challenge: the indeterminacy of documents
Objects can be treated as informative, as evidence of some assertion
– even if they were not created for that purpose,
– or even if they were not created by people at all.

Discussion
Let's discuss:
Recall Suzanne Briet's requirements for something to be a document.
– Do these requirements fit your own intuitions?
– Does a digital environment present particular challenges for understanding or applying them?
– What readings or concepts from earlier in the semester might we apply here?