1 XML Major Sources: ppt CIS550 Course Notes, U. Penn, source for many slides Yaron Kanza’s slides, source.

Presentation transcript:

1 XML Major Sources: ppt CIS550 Course Notes, U. Penn, source for many slides Yaron Kanza’s slides, source for many slides Brian Travis, XML Day At Microsoft Tech·Ed 99 XML Black Book Other sources ….

2 Part I: Background What is the difference between the world of documents and information retrieval and the world of databases and query interfaces?

3 Documents vs Databases
Document world: plenty of small documents; usually static; implicit structure (section, paragraph, toc); tagging; human friendly; content: form/layout, annotation; paradigms: “Save as”, wysiwyg; meta-data: author name, date, subject.
Database world: a few large databases; usually dynamic; explicit structure (schema); records; machine friendly; content: schema, data, methods; paradigms: Atomicity, Concurrency, Isolation, Durability; meta-data: schema description.

4 What to do with them
Documents: editing, printing, spell-checking, counting words, retrieving (IR), searching.
Databases: updating, cleaning, querying, composing/transforming.

5 HTML Lingua franca for publishing hypertext on the World Wide Web. Designed to describe how a Web browser should arrange text, images and push-buttons on a page. Easy to learn, but does not convey structure. Fixed tag set. [Figure: an annotated HTML fragment with the text “Welcome to the XML course” and “Introduction”, labelling the opening tag, text (PCDATA), closing tag, “bachelor” tag, attribute name and attribute value.]

6 Thin red line The line between the document world and the database world is not clear. In some cases, both approaches are legitimate. An interesting middle ground is data formats -- of which XML is an example

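7 The Structure of XML XML consists of tags and text. Tags come in pairs: an opening tag and a matching closing tag. They must be properly nested: closing tags in the reverse order of their opening tags is good; closing tags that cross are bad. (You can’t cross tags in HTML either.)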
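The nesting rule can be checked mechanically. A minimal sketch in Python, assuming the standard-library xml.etree.ElementTree parser; the date/day/month tag names and the is_well_formed helper are illustrative and do not come from the slides:

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Hypothetical tag names, chosen only for illustration.
good = "<date><day>12</day><month>May</month></date>"
bad = "<date><day>12</date></day>"   # closing tags cross: not properly nested

print(is_well_formed(good))
print(is_well_formed(bad))
```

Any conforming XML parser should reject the second string, since the check is part of parsing rather than a separate validation step.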

8 XML text XML has only one “basic” type -- text. It is bounded by tags; e.g. The Big Sleep, placed between a pair of tags, is still text. XML text is called PCDATA (for parsed character data). It uses Unicode, and a character can be written as a numeric character reference, e.g. &#x05DE; for the Hebrew letter Mem. Later we shall see how new types are specified by XML-data

9 XML structure Nesting tags can be used to express various structures. E.g. A tuple (record) : Jeff Cohen

10 XML structure (cont.) We can represent a list by using the same tag repeatedly:...

11 XML structure (cont.) We can represent a list by using the same tag repeatedly: Yossi Orr Irma Levy
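A list built from a repeated tag is easy to read back as a sequence. A small sketch, assuming Python's xml.etree.ElementTree; the element names addresses, person and name are assumptions, since the slide's markup was lost, while the two names are the ones on the slide:

```python
import xml.etree.ElementTree as ET

# Element names are assumed for illustration; only the text is from the slide.
doc = """
<addresses>
  <person><name>Yossi Orr</name></person>
  <person><name>Irma Levy</name></person>
</addresses>
"""

root = ET.fromstring(doc)

# findall returns the repeated <person> children in document order.
names = [p.findtext("name") for p in root.findall("person")]
print(names)
```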

12 Terminology The segment of an XML document between an opening and a corresponding closing tag is called an element. Malcolm Atchison (215) element not an element element, a sub-element of

13 XML is tree-like person name tel Malcolm Atchison (215) Semistructured data models typically put the labels on the edges
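The tree view can be made concrete by walking the parsed document. A sketch under the assumption that the person element has name and tel children as drawn; the outline helper is hypothetical, and the telephone number is truncated in the slide, so a placeholder stands in for it:

```python
import xml.etree.ElementTree as ET

def outline(elem, depth=0):
    """Return one indented line per element, in document order."""
    text = (elem.text or "").strip()
    line = "  " * depth + elem.tag + (": " + text if text else "")
    lines = [line]
    for child in elem:
        lines.extend(outline(child, depth + 1))
    return lines

# The tel value is cut off in the slide, so "..." is kept as a placeholder.
person = ET.fromstring(
    "<person><name>Malcolm Atchison</name><tel>(215) ...</tel></person>"
)
print("\n".join(outline(person)))
```

Here the labels sit on the nodes, which is how most XML APIs present the tree; the edge-labelled view mentioned on the slide carries the same information.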

14 Mixed Content An element may contain a mixture of sub-elements and PCDATA British Airways World’s favorite airline Data of this form is not typically generated from databases. It is needed for consistency with HTML
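Mixed content shows up in APIs as text that lives between sub-elements. A sketch using Python's xml.etree.ElementTree, which stores text that follows a child in that child's tail attribute; the airline and name tags are assumptions, since the original markup did not survive:

```python
import xml.etree.ElementTree as ET

# Tag names are assumed; only the text comes from the slide.
airline = ET.fromstring(
    "<airline><name>British Airways</name> World’s favorite airline</airline>"
)

name = airline.find("name")
print(repr(name.text))   # text inside the sub-element
print(repr(name.tail))   # PCDATA that follows the sub-element
```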

15 A Complete XML Document Jeff Cohen

16 The Header Tag You can leave out the encoding attribute and the processor will use the UTF-8 default.
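The effect of the encoding declaration can be seen by parsing raw bytes. A minimal sketch, assuming Python's xml.etree.ElementTree; the word element and its content are invented for illustration:

```python
import xml.etree.ElementTree as ET

# No encoding declared: the parser assumes UTF-8.
utf8_doc = b'<?xml version="1.0"?><word>caf\xc3\xa9</word>'

# The same text stored as Latin-1, so the encoding has to be declared.
latin1_doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><word>caf\xe9</word>'

utf8_text = ET.fromstring(utf8_doc).text
latin1_text = ET.fromstring(latin1_doc).text
print(utf8_text == latin1_text)
```

Both documents decode to the same string; only the byte representation and the declaration differ.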

17 Two ways of representing a DB projects: title budget managedBy employees: name ssn age

18 Project and Employee relations in XML Pattern recognition Joe Joe Sandra Auto guided vehicle Sandra : Projects and employees are intermixed

19 Project and Employee relations in XML (cont’d) Employees follow projects: Pattern recognition Joe Auto guided vehicles Sandra : Joe Sandra :

20 Project and Employee relations in XML (cont’d) Or without “separator” tags … Pattern recognition Joe Auto guided vehicles Sandra : Joe Sandra :

21 Attributes An (opening) tag may contain attributes. These are typically used to describe the content of an element cheese fromage branza A food made …

22 Attributes (cont’d) Another common use for attributes is to express dimension or type A document that obeys the “nested tags” rule and does not repeat an attribute within a tag is said to be well-formed.
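The attribute half of the well-formedness rule can be observed directly. A sketch assuming Python's xml.etree.ElementTree; the height element with dimension and accuracy attributes is borrowed from a later slide, and the attribute values are invented:

```python
import xml.etree.ElementTree as ET

def well_formed(text):
    """Return True if text parses without a well-formedness error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Attribute values are illustrative only.
ok = '<height dimension="cm" accuracy="1">180</height>'
repeated = '<height dimension="cm" dimension="in">180</height>'

print(well_formed(ok))
print(well_formed(repeated))   # same attribute twice in one tag
```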

23 Attributes (cont’d) Jeff Cohen Irma Levy

24 When to use attributes It’s not always clear when to use attributes F. MacNiel F. MacNiel

25 Using IDs Jeff Cohen Irma Levy

26 ODL schema class Movie ( extent Movies, key title ) { attribute string title; attribute string director; relationship set casts inverse Actor::acted_In; attribute int budget; } ; class Actor ( extent Actors, key name ) { attribute string name; relationship set acted_In inverse Movie::casts; attribute int age; attribute set directed; } ;

27 An example Waking Ned Divine Kirk Jones III 100,000 Dragonheart Rob Cohen 110,000 Moondance Dagmar Hirtz 90,000 : David Kelly Sean Connery 68 Ian Bannen :

28 Part II: Document Type Descriptors Imposing structure on XML documents

29 <!ATTLIST person friend (yes | no) #IMPLIED id ID #REQUIRED knows IDREFS #IMPLIED>

30 In XMLSpy Grid View

31 Document Type Descriptors Document Type Descriptors (DTDs; the XML specification calls them Document Type Definitions) impose structure on an XML document. There is some relationship between a DTD and a schema, but it is not close -- hence the need for additional “typing” systems. The DTD is a syntactic specification.

32 Example: The Address Book MacNiel, John Dr. John MacNiel 1234 Huron Street Rome, OH (321) Exactly one name At most one greeting As many address lines as needed (in order) Mixed telephones and faxes As many as needed

33 Specifying the structure
name: to specify a name element
greet?: to specify an optional (0 or 1) greet element
name,greet?: to specify a name followed by an optional greet

34 Specifying the structure (cont)
addr*: to specify 0 or more address lines
tel | fax: a tel or a fax element
(tel | fax)*: 0 or more repeats of tel or fax
*: 0 or more elements

35 Specifying the structure (cont) So the whole structure of a person entry is specified by name, greet?, addr*, (tel | fax)*, * This is known as a regular expression. Why is it important?
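One reason the regular-expression view matters is that checking an element against its content model reduces to ordinary pattern matching over the sequence of child tags. A hand-translated sketch in Python; the final term of the slide's expression did not survive, so it is left out, and PERSON_MODEL and matches_model are hypothetical names:

```python
import re
import xml.etree.ElementTree as ET

# Content model from the slide, minus its lost final term:
#   name, greet?, addr*, (tel | fax)*
# Each child tag is written as "tag " so tag names cannot run together.
PERSON_MODEL = re.compile(r"name (greet )?(addr )*((tel|fax) )*")

def matches_model(elem):
    """Check the sequence of child tags of elem against PERSON_MODEL."""
    child_tags = "".join(child.tag + " " for child in elem)
    return PERSON_MODEL.fullmatch(child_tags) is not None

valid = ET.fromstring(
    "<person><name/><greet/><addr/><addr/><tel/><fax/><tel/></person>"
)
invalid = ET.fromstring("<person><name/><tel/><addr/></person>")

print(matches_model(valid))
print(matches_model(invalid))   # addr after tel breaks the required order
```

A validating parser does the equivalent work from the DTD itself; the translation is done by hand here only to keep the sketch within the standard library.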

36 Regular Expressions Each regular expression determines a corresponding finite state automaton. Let’s start with a simpler example: name, addr*, This suggests a simple parsing program name addr
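The "simple parsing program" suggested by the automaton can be written as a table-driven loop. A sketch for the part of the simpler example that survives, name followed by zero or more addr; the state numbering and the accepts function are assumptions made for illustration:

```python
# States: 0 = start, 1 = name seen (accepting; further addr elements allowed).
# Transition table for the truncated example "name, addr*".
TRANSITIONS = {
    (0, "name"): 1,
    (1, "addr"): 1,
}
ACCEPTING = {1}

def accepts(tags):
    """Run the automaton over a sequence of child tag names."""
    state = 0
    for tag in tags:
        state = TRANSITIONS.get((state, tag))
        if state is None:       # no transition defined: reject immediately
            return False
    return state in ACCEPTING

print(accepts(["name", "addr", "addr"]))
print(accepts(["addr", "name"]))
```

Because each (state, tag) pair has at most one successor, the input is checked in a single left-to-right pass with no backtracking.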

37 Another example name,address*,(tel | fax)*, * name address tel fax Adding in the optional greet further complicates things

38 Internal DTD for the address book <!DOCTYPE addressbook [ <!ELEMENT person (name, greet?, address*, (fax | tel)*, *)> ]>

39 Rest of the address book Jeff Cohen Dr. Cohen

40 Our relational DB revisited projects: title budget managedBy employees: name ssn age

41 Two DTDs for the relational DB <!DOCTYPE db [... ]> <!DOCTYPE db [... ]>

42 Recursive DTDs <!DOCTYPE genealogy [ <!ELEMENT person ( name, dateOfBirth, person, -- mother person )> -- father... ]> What is the problem with this? XMLSpy does not notice it!

43 Recursive DTDs cont’d. <!DOCTYPE genealogy [ <!ELEMENT person ( name, dateOfBirth, person?, -- mother person? )> -- father... ]> What is now the problem with this?

44 Some things are hard to specify Each employee element is to contain name, age and ssn elements in some order. <!ELEMENT employee ( (name, age, ssn) | (age, ssn, name) | (ssn, name, age) |... )> Suppose there were many more fields !
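The cost of spelling out every ordering can be made visible by generating the content model. A short sketch in Python; any_order_model is a hypothetical helper, not part of any DTD tool:

```python
from itertools import permutations

def any_order_model(fields):
    """Build a DTD content model that lists every ordering of fields."""
    groups = ["(" + ", ".join(p) + ")" for p in permutations(fields)]
    return "(" + " | ".join(groups) + ")"

model = any_order_model(["name", "age", "ssn"])
print(model)

# The number of alternatives grows as n!: 3 fields give 6, 6 fields give 720.
six_field_alternatives = len(list(permutations(range(6))))
print(six_field_alternatives)
```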

45 General Definitions of Entities ANY declares that the element can have any content. EMPTY declares that the element has no content.

46 Summary of XML regular expressions
A: the tag A occurs
e1,e2: the expression e1 followed by e2
e*: 0 or more occurrences of e
e?: optional -- 0 or 1 occurrences
e+: 1 or more occurrences
e1 | e2: either e1 or e2
(e): grouping

47 Deterministic Requirement Content models in element type declarations should be deterministic. Formally, the Glushkov automaton of the content model is deterministic. The states of this automaton are the positions of the regular expression (semantic actions), and its transitions are based on the ‘follows set’. The associated automata are succinct. A regular language may not have an associated deterministic grammar, e.g.,

48 Specifying attributes in the DTD <!ATTLIST height dimension CDATA #REQUIRED accuracy CDATA #IMPLIED > The dimension attribute is required; the accuracy attribute is optional. CDATA is the “type” of the attribute -- it means string, may take any literal string as a value.

49 Specifying ID and IDREF attributes <!DOCTYPE family [ <!ATTLIST person id ID #REQUIRED mother IDREF #IMPLIED father IDREF #IMPLIED children IDREFS #IMPLIED> ]>

50 Some conforming data Jane Doe John Doe Mary Doe Jack Doe

51 Consistency of ID and IDREF attribute values If an attribute is declared as ID –the associated values must all be distinct (no confusion) If an attribute is declared as IDREF –the associated value must exist as the value of some ID attribute (no dangling “pointers”) Similarly for all the values of an IDREFS attribute ID and IDREF attributes are not typed
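The two consistency rules can be sketched as a small checker. This is an illustration only: a validating parser would learn which attributes are ID, IDREF or IDREFS from the DTD, whereas here they are hard-coded to mirror the family ATTLIST, and the id values in the sample data are invented:

```python
import xml.etree.ElementTree as ET

# Normally taken from the DTD; hard-coded here to mirror the family ATTLIST.
ID_ATTR = "id"
IDREF_ATTRS = ("mother", "father")
IDREFS_ATTRS = ("children",)

def check_ids(root):
    """Return a list of ID/IDREF consistency problems (empty if none)."""
    problems, seen = [], set()
    for elem in root.iter():                 # pass 1: IDs must be distinct
        value = elem.get(ID_ATTR)
        if value is not None:
            if value in seen:
                problems.append("duplicate ID: " + value)
            seen.add(value)
    for elem in root.iter():                 # pass 2: no dangling references
        refs = [elem.get(a) for a in IDREF_ATTRS if elem.get(a) is not None]
        for a in IDREFS_ATTRS:
            refs.extend((elem.get(a) or "").split())
        for ref in refs:
            if ref not in seen:
                problems.append("dangling reference: " + ref)
    return problems

# Hypothetical data; "bob" is deliberately not declared as an ID.
family = ET.fromstring("""
<family>
  <person id="jane" children="mary jack"><name>Jane Doe</name></person>
  <person id="john" children="mary jack"><name>John Doe</name></person>
  <person id="mary" mother="jane" father="john"><name>Mary Doe</name></person>
  <person id="jack" mother="jane" father="bob"><name>Jack Doe</name></person>
</family>
""")
print(check_ids(family))
```

Note that the checker cannot tell whether a reference points at the right kind of element, which is the sense in which ID and IDREF attributes are not typed.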

52 Formally
Validity constraint: One ID per Element Type. No element type may have more than one ID attribute specified.
Validity constraint: ID Attribute Default. An ID attribute must have a declared default of #IMPLIED or #REQUIRED.
Validity constraint: IDREF. Values of type IDREF must match the Name production, and values of type IDREFS must match Names; each Name must match the value of an ID attribute on some element in the XML document; i.e. IDREF values must match the value of some ID attribute.

53 A useful abbreviation When an element has empty content we can use <tag/> for <tag></tag>. For example: Jane Doe...
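That the short and long forms of an empty element are interchangeable can be checked with a parser. A minimal sketch, assuming Python's xml.etree.ElementTree; the person element and its id value are invented for illustration:

```python
import xml.etree.ElementTree as ET

short = ET.fromstring('<person id="jane"/>')
long = ET.fromstring('<person id="jane"></person>')

# Both forms yield an element with the same tag, attributes and no content.
same = (short.tag, short.attrib, short.text, len(short)) == \
       (long.tag, long.attrib, long.text, len(long))
print(same)
```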

54 An alternative specification <!DOCTYPE family [ ]>

55 The revised data Jane Doe John Doe Ami Doe Tami Doe

56 ODL schema class Movie ( extent Movies, key title ) { attribute string title; attribute string director; relationship set cast inverse Actor::acted_In; attribute int budget; } ; class Actor ( extent Actors, key name ) { attribute string name; relationship set acted_In inverse Movie::cast; attribute int age; attribute set directed; } ;

57 Schema.dtd <!DOCTYPE db [

58 Schema.dtd (cont’d) ]>

59 Data Oh God! Woody Allen $2M George Burns

60 Constraints on IDs and IDREFs ID stands for identifier. No two ID attributes may have the same value (of type CDATA). IDREF stands for identifier reference. Every value associated with an IDREF attribute must exist as an ID attribute value. IDREFS specifies several (0 or more) identifiers.

61 Connecting the document with its DTD In line: … ]>... Another file : A URL: <!DOCTYPE db SYSTEM "

62 Connecting the document with its DTD Both: file c:/schema.dtd: file to be validated <!DOCTYPE db SYSTEM "c:/schema.dtd" [ ]> Oh God! Woody Allen $2M George Burns

63 Well-formed and Valid Documents Well-formed applies to any document (with or without a DTD): proper nesting of tags and unique attributes Valid specifies that the document conforms to the DTD: conforms to regular expression grammar, types of attributes correct, and constraints on references satisfied

64 DTDs vs. Schemas (or Types) By database (or programming language) standards DTDs are rather weak specifications. –Only one base type -- PCDATA –No useful “abstractions” e.g., sets –IDREFs are untyped. You point to something, but you don’t know what! –No constraints e.g., child is inverse of parent –No methods –Tag definitions are global Some of the XML extensions impose something like a schema or type on an XML document. We may see these later.

65 Part III: Entities To take storage into account

66 What are Entities An entity is a shortcut to a set of information. You might think of an entity as being a bit like a macro. Entities allow a document to be divided among several storage units.

67 Why use entities: Entities save typing. Entities can reduce errors. Entities are easy to update. Entities can act as placeholders for TBD information.

68 Defining Entities You can define entities in your local document as part of the DOCTYPE definition. You can also link to external files that contain the entity data. This, too, is done through the DOCTYPE definition. A third option is to define the entities in your external DTD. Use a local definition when the entity is being used only in this one particular file. Use a linked, external file when the entity is being used in many document sets.

69 Kinds of Entities
There are two kinds of entities: general entities and parameter entities. Each may be internal or external, and parsed or unparsed.
Possibilities (first 4 are parsed):
1. Internal Parameter
2. External Parameter
3. Internal General
4. External General
5. External General Unparsed

70 General entities The definition of general entities in the DTD The usage of the entity in the document is by &Name;
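Declaration and use of an internal general entity can be tried out with a parser that reads the internal DTD subset. A sketch assuming Python's xml.etree.ElementTree, which expands such entities while parsing; the entity name wa and the mdb/director element names are invented for illustration:

```python
import xml.etree.ElementTree as ET

# The entity name "wa" and its replacement text are assumptions.
doc = """<!DOCTYPE mdb [
  <!ENTITY wa "Woody Allen">
]>
<mdb><director>&wa;</director></mdb>"""

root = ET.fromstring(doc)

# The reference &wa; has been replaced by the entity's text.
director = root.findtext("director")
print(director)
```

Changing the replacement text in the declaration changes every place that references the entity, which is the macro-like behaviour described earlier.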

71 Example <!DOCTYPE mdb [ ]> Oh God! Woody Allen $2M

72 Browser View

73 Non-parsed Entities <!DOCTYPE mdb [ <!ATTLIST movie id ID #REQUIRED opinion CDATA #IMPLIED starimage ENTITY #IMPLIED> ]>

74 Data Oh God! Woody Allen $2M

75 Parameter Entities Parameter entities are used only within DTDs. They carry information for use in the markup declaration. Internal entities - references are within the DTD. External entities - references draw information from outside files. Parameter Entity declaration: Can’t use in internal DTD subset

76 Parameter Entity Example <!ATTLIST person friend (yes | no) #IMPLIED id ID #REQUIRED knows IDREFS #IMPLIED>

77 Entities Definition Local Definition: <!DOCTYPE [ <!ENTITY copyright "Copyright 2000, As The World Spins Corp. All rights reserved. Please do not copy or use without authorization. For authorization contact ]> Global Definition: <!DOCTYPE [ <!ENTITY copyright SYSTEM "

78 Example <!DOCTYPE [ <!ENTITY copyright "Copyright 2000, As The World Spins Corp. All rights reserved. Please do not copy or use without authorization. For authorization contact ]>

79 Example (cont.) Mini-globe revolutionizes keychain industry Today As The World Spins introduces a new approach to key chains. With the new MINI-GLOBE keys can be kept inside a chain, called for upon demand, and stored safely. Never more will consumers lose a key or stand at a door flipping through a stack of keys seeking the right one. &trademark;&copyright;

80 Using CDATA <HEAD1> Entering a Kennel Club Member Enter the member by the name on his or her papers. Use the NAME tag. The NAME tag has two attributes. Common (all in lowercase, please!) is the dog's call name. Breed (also in all lowercase) is the dog's breed. Please see the breed reference guide for acceptable breeds. Your entry should look something like this: Sir Fredrick of Ledyard's End ]]> </EXAMPLE>

81

82 Namespaces Namespaces are a way of preventing name clashes among elements from more than one source within the same XML document. They are also useful in identifying elements that are meaningful for a particular XML application. See

83 Namespaces URIs are either URLs or URNs. An XML namespace is, literally, identified by a URI reference. The reference need not point to an actual resource! A URI reference may be associated with more than one prefix. Prefixes are used in XML documents in forming element and attribute names (prefix:localname). Two prefixes that are associated with the same URI are said to be in the same namespace. Declaring a namespace means identifying a namespace used in the document. DTDs are unaware of namespaces.
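That two prefixes bound to the same URI name the same namespace can be seen after parsing, when prefixes have been resolved. A sketch assuming Python's xml.etree.ElementTree, which reports names as {uri}localname; the URI and the element names are made up, and nothing is fetched from the URI:

```python
import xml.etree.ElementTree as ET

# The URI is only an identifier; it need not point to a real resource.
doc = """
<root xmlns:a="http://example.org/ns" xmlns:b="http://example.org/ns">
  <a:item/>
  <b:item/>
</root>
"""

root = ET.fromstring(doc)
tags = [child.tag for child in root]
print(tags)
```

After parsing, the prefixes a and b are gone and the two children carry identical qualified names.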

84 Example Defining the Namespace ATDB: Using a tag from the ATDB Namespace This is an xml tag. ATDB:myTag is a qualified name. Using a tag not from the namespace: This is a ‘made in Israel’ tag.

85 Scope of Namespaces A prefix is associated with the namespace in the element scope in which it is defined. Example (birthdate is associated with no namespace): John Smith Technion City 234

86 Default Namespaces A default namespace applies to all elements in its scope. However, it does not override explicit prefixes (their non-prefixed child elements are default-bound). Example (name and birthdate are bound): John Smith Technion City 234 Non-prefixed attribute names are associated with no namespace even when in scope.
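The difference between elements and attributes under a default namespace can be observed after parsing. A sketch assuming Python's xml.etree.ElementTree; the namespace URI, the person element and the id attribute are assumptions, since the slide's markup was lost:

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: a default namespace declared on the root element.
doc = """
<person xmlns="http://example.org/people" id="234">
  <name>John Smith</name>
</person>
"""

person = ET.fromstring(doc)
print(person.tag)            # element picks up the default namespace
print(person[0].tag)         # so does the non-prefixed child
print(list(person.attrib))   # the attribute name stays unqualified
```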

87 Summary XML is a new data format. Its main virtues are widespread acceptance and the (important) ability to handle semistructured data (data without schema). DTDs provide some useful syntactic constraints on documents. As schemas they are weak. How to store large XML documents? How to query them? How to map between XML and other representations?