2005 http://www.cs.huji.ac.il/~dbi 1 XML eXtensible Markup Language
2
2005 http://www.cs.huji.ac.il/~dbi 2 Introduction and Motivation
3
2005 http://www.cs.huji.ac.il/~dbi 3 XML vs. HTML HTML is a HyperText Markup language –Designed for a specific application, namely, presenting and linking hypertext documents XML describes structure and content (“semantics”) –The presentation is defined separately from the structure and the content
4
2005 http://www.cs.huji.ac.il/~dbi 4 An Address Book as an XML document Donald Duck 04-828-1345 donald@cs.technion.ac.il Miki Mouse 03-426-1142 miki@yahoo.com
5
2005 http://www.cs.huji.ac.il/~dbi 5 Main Features of XML No fixed set of tags –New tags can be added for new applications An agreed upon set of tags can be used in many applications –Namespaces facilitate uniform and coherent descriptions of data For example, a namespace for address books determines whether to use or
6
2005 http://www.cs.huji.ac.il/~dbi 6 Main Features of XML (cont’d) XML has the concept of a schema –DTD and the more expressive XML Schema XML is a data model –Similar to the semistructured data model XML supports internationalization (Unicode) and platform independence (an XML file is just a character file)Unicode
7
2005 http://www.cs.huji.ac.il/~dbi 7 XML is Self-Describing Data Traditionally, a data file is just a bit stream Only a program that reads or writes this file has the details about –How to break the bit stream into records –How to break each record into fields –The type of each data field Over the years, companies retained valuable data (e.g., on magnetic tapes), but lost the programs that have the above information –As a result, the data was practically lost It cannot happen with XML data
8
2005 http://www.cs.huji.ac.il/~dbi 8 XML is the Standard for Data Exchange Web services (e.g., ecommerce) require exchanging data between various applications that run on different platforms XML (augmented with namespaces) is the preferred syntax for data exchange on the Web
9
2005 http://www.cs.huji.ac.il/~dbi 9 XML is not Alone XML Schemas strengthen the data-modeling capabilities of XML (in comparison to XML with only DTDs) XPath is a language for accessing parts of XML documents XLink and XPointer support cross-references XSLT is a language for transforming XML documents into other XML documents (including XHTML, for displaying XML files)XSLT –Limited styling of XML can be done with CSS aloneCSS XQuery is a lanaguage for querying XML documents
10
2005 http://www.cs.huji.ac.il/~dbi 10 The Two Facets of XML Some XML files are just text documents with tags that denote their structure and include some metadata (e.g., an attribute that gives the name of the person who did the proofreading) –See an example on the next slide –XML is a subset of SGML (Standard Generalized Markup Language) Other XML documents are similar to database files (e.g., an address book)
11
2005 http://www.cs.huji.ac.il/~dbi 11 XML can Describe the Structure of a Document Complexity of Computations M. O. Rabin Hebrew University …
12
2005 http://www.cs.huji.ac.il/~dbi 12 XML Syntax W3Schools Resources on XML Syntax
13
2005 http://www.cs.huji.ac.il/~dbi 13 The Structure of XML XML consists of tags and text Tags come in pairs... They must be properly nested –good......... –bad......... (You can’t do......... in HTML)
14
2005 http://www.cs.huji.ac.il/~dbi 14 A Useful Abbreviation Abbreviating elements with empty contents: for For example: Lisa Simpson... Note that a tag may have a set of attributes, each consisting of a name and a value
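
As a small illustration of the abbreviation, the following Python sketch (standard library only) parses the short and the long form of an empty element and compares the results. The element name married and the attribute id are assumptions made up for this example; they are not taken from the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical documents: "married" and the "id" attribute are illustrative only.
SHORT = '<person id="p1"><name>Lisa Simpson</name><married/></person>'
LONG = '<person id="p1"><name>Lisa Simpson</name><married></married></person>'

def summary(text):
    """Return (attributes of the root, tag, text, child count) for the second child."""
    root = ET.fromstring(text)
    el = root[1]
    return (dict(root.attrib), el.tag, el.text or "", len(el))

print(summary(SHORT))
print(summary(LONG))
```

Both calls give the same summary, which is the sense in which the two spellings are interchangeable.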
15
2005 http://www.cs.huji.ac.il/~dbi 15 XML Text XML has only one “basic” type – text It is bounded by tags, e.g., The Big Sleep 1935 – 1935 is still text XML text is called PCDATA –(for parsed character data) It uses a 16-bit encoding, e.g., \&\#x0152 for the Hebrew letter Mem
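
A minimal Python sketch of both points, using only the standard library: element content always arrives as text, and a numeric character reference is resolved to the Unicode character it names (U+05DE is the Hebrew letter Mem). The tag names year and letter are assumptions chosen for the example.

```python
import unicodedata
import xml.etree.ElementTree as ET

# The content of an element is text, even when it looks like a number.
year = ET.fromstring("<year>1935</year>")
print(type(year.text).__name__, year.text)

# A numeric character reference is replaced by the character it denotes.
letter = ET.fromstring("<letter>&#x05DE;</letter>")
print(unicodedata.name(letter.text))
```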
16
2005 http://www.cs.huji.ac.il/~dbi 16 XML Structure Nesting tags can be used to express various structures, e.g., a tuple (record): Lisa Simpson 02-828-1234 054-470-777 lisa@cs.huji.ac.il
17
2005 http://www.cs.huji.ac.il/~dbi 17 XML Structure (cont’d) We can represent a list by using the same tag repeatedly: …
18
2005 http://www.cs.huji.ac.il/~dbi 18 XML Structure (cont’d) Donald Duck 04-828-1345 donald@cs.technion.ac.il Miki Mouse 03-426-1142 miki@yahoo.com
19
2005 http://www.cs.huji.ac.il/~dbi 19 Terminology The segment of an XML document between an opening and a corresponding closing tag is called an element Bart Simpson 02 – 444 7777 051 – 011 022 bart@tau.ac.il element element, a sub-element of not an element
20
2005 http://www.cs.huji.ac.il/~dbi 20 An XML Document is a Tree person name email tel Bart Simpson 02 – 444 7777 051 – 011 022 bart@tau.ac.il Note that semistructured data models typically put the labels on the edges, and are arbitrary graphs and not just trees Leaves are either empty or contain PCDATA
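
The tree view can be made concrete with a short Python sketch based on the standard library's ElementTree. The markup below is an assumed reconstruction that uses the node labels person, name, tel and email mentioned on the slide; the exact original markup is not known.

```python
import xml.etree.ElementTree as ET

# Assumed markup, built from the labels person, name, tel and email.
DOC = """<person>
  <name>Bart Simpson</name>
  <tel>02 - 444 7777</tel>
  <tel>051 - 011 022</tel>
  <email>bart@tau.ac.il</email>
</person>"""

def walk(el, depth=0):
    """Yield (depth, tag, text) for every element, parents before children."""
    yield depth, el.tag, (el.text or "").strip()
    for child in el:
        yield from walk(child, depth + 1)

def leaves(el):
    """Return the elements that have no sub-elements."""
    if len(el) == 0:
        return [el]
    return [leaf for child in el for leaf in leaves(child)]

root = ET.fromstring(DOC)
for depth, tag, text in walk(root):
    print("  " * depth + tag + (": " + text if text else ""))
print([leaf.tag for leaf in leaves(root)])
```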
21
2005 http://www.cs.huji.ac.il/~dbi 21 Mixed Content An element may contain a mixture of sub- elements and PCDATA British Airways World’s favorite airline How many leaves are there in the corresponding tree? How many leaves are empty?
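
One way to see mixed content in practice is the following Python sketch. ElementTree stores the text before the first sub-element in text and the text that follows a sub-element in that sub-element's tail. The tag names motto and em are assumptions invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: "motto" and "em" are made-up tag names.
motto = ET.fromstring("<motto>World's <em>favorite</em> airline</motto>")
em = motto[0]

print(repr(motto.text))           # text before the sub-element
print(repr(em.text))              # text inside the sub-element
print(repr(em.tail))              # text after the sub-element
print("".join(motto.itertext()))  # all PCDATA in document order
```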
22
2005 http://www.cs.huji.ac.il/~dbi 22 The Header Tag –Standalone=“no” means that there is an external DTD –You can leave out the encoding attribute and the processor will use the UTF-8 default
23
2005 http://www.cs.huji.ac.il/~dbi 23 Processing Instructions Hello, world!
24
2005 http://www.cs.huji.ac.il/~dbi 24 Using CDATA Entering a Kennel Club Member Enter the member by the name on his or her papers. Use the NAME tag. The NAME tag has two attributes. Common (all in lowercase, please!) is the dog's call name. Breed (also in all lowercase) is the dog's breed. Please see the breed reference guide for acceptable breeds. Your entry should look something like this: Sir Fredrick of Ledyard's End ]]> We want to see the text as is, even though it includes tags
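
The effect of a CDATA section can be sketched in Python as follows. The enclosing tag instructions and the attribute values freddy and springer-spaniel are assumptions for illustration; only the NAME tag, its Common and Breed attributes and the dog's name come from the slide text.

```python
import xml.etree.ElementTree as ET

# Assumed example: the wrapper tag and the attribute values are made up.
DOC = (
    "<instructions><![CDATA["
    '<NAME common="freddy" breed="springer-spaniel">'
    "Sir Fredrick of Ledyard's End</NAME>"
    "]]></instructions>"
)

root = ET.fromstring(DOC)
print(len(root))   # number of sub-elements seen by the parser
print(root.text)   # the CDATA content, delivered as plain text
```

The parser reports no NAME sub-element: everything between the CDATA delimiters is treated as character data, so the tags are shown as is.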
25
2005 http://www.cs.huji.ac.il/~dbi 25 A Complete XML Document Lisa Simpson 02-828-1234 054-470-777 lisa@cs.huji.ac.il
26
2005 http://www.cs.huji.ac.il/~dbi 26 Well-Formed XML Documents An XML document (with or without a DTD) is well-formed if –Tags are syntactically correct –Every tag has an end tag –Tags are properly nested –There is a root tag –A start tag does not have two occurrences of the same attribute An XML document must be well formed
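
The conditions above can be tried out with any XML parser, since a conforming parser must reject a document that is not well formed. The following is a minimal Python sketch using the standard library; the helper name is_well_formed and the tiny test documents are assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts `text` as a well-formed document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True

# Hypothetical documents, one per condition.
CASES = {
    "well formed": "<a><b x='1'/></a>",
    "missing end tag": "<a><b></a>",
    "bad nesting": "<a><b></a></b>",
    "two roots": "<a/><b/>",
    "duplicate attribute": "<a x='1' x='2'/>",
}

for label, text in CASES.items():
    print(label, is_well_formed(text))
```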
27
2005 http://www.cs.huji.ac.il/~dbi 27 DTD (Document Type Definition) Imposing Structure on XML Documents (W3Schools on DTDs)W3Schools on DTDs
28
2005 http://www.cs.huji.ac.il/~dbi 28 Motivation A DTD adds syntactical requirements in addition to the well-formed requirement It helps in eliminating errors when creating or editing XML documents It clarifies the intended semantics It simplifies the processing of XML documents
29
2005 http://www.cs.huji.ac.il/~dbi 29 An Example In an address book, where can a phone number appear? –Under, under or under both? If we have to check for all possibilities, processing takes longer and it may not be clear to whom a phone number belongs –We would like to know that a phone number is allowed to appear under both a department and the manager of that department –If we don’t know that and there is only one phone number, we may not know whether it serves both the department and its manager or just one of them
30
2005 http://www.cs.huji.ac.il/~dbi 30 Document Type Definitions Document Type Definitions (DTDs) impose structure on XML documents There is some relationship between a DTD and a schema, but it is not close –Hence, the need for additional “typing” systems (XML schemas) The DTD is a syntactic specification
31
2005 http://www.cs.huji.ac.il/~dbi 31 Example: An Address Book Homer Simpson Dr. H. Simpson 1234 Springwater Road Springfield USA, 98765 (321) 786 2543 (321) 786 2544 homer@math.springfield.edu Mixed telephones and faxes As many as needed As many address lines as needed (in order) At most one greetingExactly one name
32
2005 http://www.cs.huji.ac.il/~dbi 32 Specifying the Structure name to specify a name element greet? to specify an optional (0 or 1) greet elements name, greet? to specify a name followed by an optional greet
33
2005 http://www.cs.huji.ac.il/~dbi 33 Specifying the Structure (cont’d) addr*to specify 0 or more address lines tel | faxa tel or a fax element (tel | fax)* 0 or more repeats of tel or fax email*0 or more email elements
34
2005 http://www.cs.huji.ac.il/~dbi 34 Specifying the Structure (cont’d) So the whole structure of a person entry is specified by name, greet?, addr*, (tel | fax)*, email* This is known as a regular expression Why is it important?
35
2005 http://www.cs.huji.ac.il/~dbi 35 Summary of Regular Expressions AThe tag (i.e., element) A occurs e1,e2The expression e1 followed by e2 e*0 or more occurrences of e e?Optional: 0 or 1 occurrences e+1 or more occurrences e1 | e2either e1 or e2 (e)grouping
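
Because a content model is a regular expression over element names, an ordinary regular-expression engine can check a sequence of child tags against it. The sketch below does this in Python for the person model name, greet?, addr*, (tel | fax)*, email*. Encoding the child sequence as a string in which every tag is followed by a comma is an assumption of this sketch, not something a real validator is required to do, and the names PERSON_MODEL and conforms are made up.

```python
import re

# name, greet?, addr*, (tel | fax)*, email*
# Each child tag is written as "tag," so that tags cannot run together.
PERSON_MODEL = re.compile(r"name,(greet,)?(addr,)*((tel|fax),)*(email,)*")

def conforms(child_tags):
    """Return True if the sequence of child tags matches the person model."""
    encoded = "".join(tag + "," for tag in child_tags)
    return PERSON_MODEL.fullmatch(encoded) is not None

print(conforms(["name", "greet", "addr", "addr", "tel", "fax", "tel", "email"]))
print(conforms(["name"]))
print(conforms(["greet", "name"]))
print(conforms(["name", "email", "tel"]))
```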
36
2005 http://www.cs.huji.ac.il/~dbi 36 The Definition of an Element Consists of Exactly One of the Following A regular expression (as defined earlier) EMPTY means that the element has no content ANY means that the content can be any mixture of PCDATA and elements defined in the DTD Mixed content which is defined as described on the next slide (#PCDATA)
37
2005 http://www.cs.huji.ac.il/~dbi 37 The Definition of Mixed Content Mixed content is described by a repeatable OR group (#PCDATA | element-name | …)* –Inside the group, no regular expressions – just element names –#PCDATA must be first, followed by 0 or more element names that are separated by | –The group can be repeated 0 or more times
38
2005 http://www.cs.huji.ac.il/~dbi 38 An Address-Book XML Document with an Internal DTD <!DOCTYPE addressbook [ <!ELEMENT person (name, greet?, address*, (fax | tel)*, email*)> ]> The name of the DTD is addressbook “Internal” means that the DTD and the XML Document are in the same file The syntax of a DTD is not XML syntax
39
2005 http://www.cs.huji.ac.il/~dbi 39 The Rest of the Address-Book XML Document Jeff Cohen Dr. Cohen jc@penny.com
40
2005 http://www.cs.huji.ac.il/~dbi 40 Regular Expressions Each regular expression determines a corresponding finite-state automaton Let’s start with a simpler example: name, addr*, email name addr email This suggests a simple parsing program A double circle denotes an accepting state
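
The "simple parsing program" can be sketched as a table-driven automaton. The Python code below is one possible automaton for name, addr*, email; the state names start, after_name and done are assumptions of this sketch, and the figure on the slide may have been drawn differently.

```python
# Transition table of a deterministic automaton for: name, addr*, email
# The state names are made up for this sketch.
TRANSITIONS = {
    ("start", "name"): "after_name",
    ("after_name", "addr"): "after_name",  # loop for addr*
    ("after_name", "email"): "done",
}
ACCEPTING = {"done"}  # the "double circle" state

def accepts(child_tags):
    """Run the automaton over the child tags and report acceptance."""
    state = "start"
    for tag in child_tags:
        state = TRANSITIONS.get((state, tag))
        if state is None:  # no transition for this tag: reject
            return False
    return state in ACCEPTING

print(accepts(["name", "addr", "addr", "email"]))
print(accepts(["name", "email"]))
print(accepts(["name"]))
print(accepts(["addr", "email"]))
```

Each child tag is inspected exactly once, so the check is linear in the number of children.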
41
2005 http://www.cs.huji.ac.il/~dbi 41 Another Example name,address*,(tel | fax)*,email* name address tel fax email Adding in the optional greet further complicates things email
42
2005 http://www.cs.huji.ac.il/~dbi 42 Deterministic Requirement: Content Models must be Deterministic If element-type declarations are deterministic, it is easier to parse XML documents W3C XML recommendation requires the Glushkov automaton to be deterministic The states of this automaton are the positions of the regular expression (semantic actions) The transitions are based on the “follows set”
43
2005 http://www.cs.huji.ac.il/~dbi 43 Deterministic Requirement (cont’d) The associated automata are succinct A regular language may not have an associated deterministic grammar, e.g., <!ELEMENT ndeter ((movie|director)*,movie,(movie|director))> This is not allowed in a DTD
44
2005 http://www.cs.huji.ac.il/~dbi 44 Some Things are Hard to Specify Each employee element should contain name, age and ssn elements in some order <!ELEMENT employee ( (name, age, ssn) | (age, ssn, name) | (ssn, name, age) |... )> Suppose that there were many more fields!
45
2005 http://www.cs.huji.ac.il/~dbi 45 Some Things are Hard to Specify (cont’d) <!ELEMENT employee ( (name, age, ssn) | (age, ssn, name) | (ssn, name, age) |... )> Suppose there were many more fields! There are n! different orders of n elements It is not even polynomial
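
To get a feeling for the growth, the Python sketch below spells out the "any order" content model by enumerating every permutation of the field names and then prints n! for a few values of n. The function name any_order_model is an assumption of this sketch.

```python
from itertools import permutations
from math import factorial

def any_order_model(fields):
    """Build the DTD content model that allows `fields` in any order."""
    groups = ["(" + ", ".join(order) + ")" for order in permutations(fields)]
    return "(" + " | ".join(groups) + ")"

model = any_order_model(["name", "age", "ssn"])
print("<!ELEMENT employee " + model + ">")

# The number of alternatives is n!, which outgrows every polynomial.
for n in (3, 5, 10):
    print(n, "fields:", factorial(n), "alternatives")
```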
46
2005 http://www.cs.huji.ac.il/~dbi 46 Specifying Attributes in the DTD <!ATTLIST height dimension CDATA #REQUIRED “cm” accuracy CDATA #IMPLIED > The dimension attribute is required The accuracy attribute is optional CDATA is the “type” of the attribute – it means “character data,” and may take any literal string as a value
47
2005 http://www.cs.huji.ac.il/~dbi 47 The Format of an Attribute Definition The default value is given inside quotes
48
2005 http://www.cs.huji.ac.il/~dbi 48 Summary of Attribute Types CDATA (value | … | … ) is an enumeration of allowed values ID, IDREF, IDRERS –to be explained later ENTITY, ENTITIES –to be explained later NMTOKEN, NMTOKENS, NOTATION
49
2005 http://www.cs.huji.ac.il/~dbi 49 Summary of Attribute Default Values #REQUIRED means that the attribute must by included in the element #IMPLIED #FIXED “value” –The given value (inside quotes) is the only possible one “value” –The default value of the attribute if none is given
50
2005 http://www.cs.huji.ac.il/~dbi 50 Recursive DTDs <DOCTYPE genealogy [ <!ELEMENTperson ( name, dateOfBirth, person, -- mother person )> -- father... ]> What is the problem with this? A parser does not notice it! Each person should have a father and a mother. This leads to either infinite data or a person that is a descendent of herself.
51
2005 http://www.cs.huji.ac.il/~dbi 51 Recursive DTDs (cont’d) <DOCTYPE genealogy [ <!ELEMENTperson ( name, dateOfBirth, person?, -- mother person? )> -- father... ]> What is now the problem with this? If a person only has a mother, how can you tell that he has a mother and does not have a father?
52
2005 http://www.cs.huji.ac.il/~dbi 52 Using ID and IDREF Attributes <!DOCTYPE family [ <!ATTLIST person id ID #REQUIRED mother IDREF #IMPLIED father IDREF #IMPLIED children IDREFS #IMPLIED> ]>
53
2005 http://www.cs.huji.ac.il/~dbi 53 IDs and IDREFs ID stands for identifier –No two ID attributes may have the same value (of type CDATA) IDREF stands for identifier reference –Every value associated with an IDREF attribute must exist as the value of some ID attribute IDREFS specifies several (0 or more) identifier references
54
2005 http://www.cs.huji.ac.il/~dbi 54 Some Conforming Data Lisa Simpson Bart Simpson Marge Simpson Homer Simpson
55
2005 http://www.cs.huji.ac.il/~dbi 55 ID References do not Have Types The attributes mother and father are references to IDs of other elements However, those are not necessarily person elements! The mother attribute is not necessarily a reference to a female person
56
2005 http://www.cs.huji.ac.il/~dbi 56 An Alternative Specification <!DOCTYPE family [ ]>
57
2005 http://www.cs.huji.ac.il/~dbi 57 The Revised Data Marge Simpson Homer Simpson Bart Simpson Lisa Simpson
58
2005 http://www.cs.huji.ac.il/~dbi 58 Consistency of ID and IDREF Attribute Values If an attribute is declared as ID –The associated value must be distinct, i.e., different elements (in the given document) must have different values for the ID attribute (no confusion) Even if the two elements have different element names If an attribute is declared as IDREF –The associated value must exist as the value of some ID attribute (no dangling “pointers”) Similarly for all the values of an IDREFS attribute ID, IDREF and IDREFS attributes are not typed
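
The two consistency rules can be sketched as a small checker in Python. The standard library's ElementTree does not read attribute types from a DTD, so the sets ID_ATTRS, IDREF_ATTRS and IDREFS_ATTRS below restate the declarations for person by hand; they, the function name check_references, and the identifier values such as lisa and marge are assumptions of this sketch.

```python
import xml.etree.ElementTree as ET

# Attribute types, restated by hand (the parser used here ignores the DTD).
ID_ATTRS = {"id"}
IDREF_ATTRS = {"mother", "father"}
IDREFS_ATTRS = {"children"}

# Hypothetical data: the identifier values are made up.
DOC = """<family>
  <person id="lisa" mother="marge" father="homer"><name>Lisa Simpson</name></person>
  <person id="bart" mother="marge" father="homer"><name>Bart Simpson</name></person>
  <person id="marge" children="bart lisa"><name>Marge Simpson</name></person>
  <person id="homer" children="bart lisa"><name>Homer Simpson</name></person>
</family>"""

def check_references(root):
    """Return a list of ID/IDREF consistency errors (empty if consistent)."""
    errors, ids = [], set()
    for el in root.iter():                     # pass 1: collect IDs
        for attr in ID_ATTRS:
            if attr in el.attrib:
                value = el.attrib[attr]
                if value in ids:
                    errors.append("duplicate ID " + repr(value))
                ids.add(value)
    for el in root.iter():                     # pass 2: resolve references
        for attr, value in el.attrib.items():
            if attr in IDREF_ATTRS:
                refs = [value]
            elif attr in IDREFS_ATTRS:
                refs = value.split()
            else:
                continue
            for ref in refs:
                if ref not in ids:
                    errors.append("dangling reference " + repr(ref))
    return errors

print(check_references(ET.fromstring(DOC)))
broken = DOC.replace('father="homer"', 'father="ned"', 1)
print(check_references(ET.fromstring(broken)))
```

Note that the checker never asks whether a referenced element is a person, which mirrors the point that these attributes are not typed.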
59
2005 http://www.cs.huji.ac.il/~dbi 59 Adding a DTD to the Document A DTD can be internal –The DTD is part of the document file or external –The DTD and the document are on separate files –An external DTD may reside In the local file system (where the document is) In a remote file system
60
2005 http://www.cs.huji.ac.il/~dbi 60 Connecting a Document with its DTD An internal DTD: … ]>... A DTD from the local file system: A DTD from a remote file system:
61
2005 http://www.cs.huji.ac.il/~dbi 61 Well-Formed XML Documents An XML document (with or without a DTD) is well-formed if –Tags are syntactically correct –Every tag has an end tag –Tags are properly nested –There is a root tag –A start tag does not have two occurrences of the same attribute An XML document must be well formed
62
2005 http://www.cs.huji.ac.il/~dbi 62 Valid Documents A well-formed XML document isvalid if it conforms to its DTD, that is, –The document conforms to the regular- expression grammar, –The types of attributes are correct, and –The constraints on references are satisfied
63
2005 http://www.cs.huji.ac.il/~dbi 63 DTDs are CFGs (Context-Free Grammars) Checking validity and parsing a document according to a DTD is in polynomial time, using a dynamic-programming algorithm –A element has the same rules regardless of whether it is under a element or a element Note that XML Schemas are capable of describing context-sensitive structures –The complexity is higher
64
2005 http://www.cs.huji.ac.il/~dbi 64 Validating XML Parsers A validating parser is one that checks validity It also converts the XML document to some normal form –For example, it adds attributes that have fixed values (according to the DTD) if these attributes do not appear explicitly
65
2005 http://www.cs.huji.ac.il/~dbi 65 DTDs vs. Schemas (or Types) DTDs are rather weak specifications by DB & programming-language standards –Only one base type – PCDATA –No useful “abstractions”, e.g., sets –IDREFs are untyped – the type of the object being referenced is not known –No constraints, e.g., child is inverse of parent –No inheritance –No methods –Tag definitions are global Some extensions of XML impose a schema or types on an XML document More problems with DTD