CIS 670 Fall 2001 (LN 5): XML

1 XML
* Introduction to XML
  – XML basics
  – DTDs
  – XML and semistructured data
* Query languages for XML: XML-QL, XQL, XSL
* XML extensions: XML-Data, XLink, XPointer

2 XML’s history: SGML, HTML, XML
SGML: Standard Generalized Markup Language (Charles Goldfarb; ISO 8879, 1986)
* DTDs (Document Type Definitions)
* a powerful and flexible tool for structuring information, but
  – a complete, generic implementation of SGML proved extremely difficult
  – tools for working with SGML documents proved expensive
* two children that have outpaced SGML:
  – HTML: HyperText Markup Language (Tim Berners-Lee, 1991). Describes presentation.
  – XML: eXtensible Markup Language (W3C, 1998). Describes content.

3 From HTML to XML
HTML is good for presentation (human friendly), but it does not help programs extract data automatically (not computer friendly). Why? HTML tags are:
* predefined and fixed
* descriptions of display format, not of the structure of the data
(Example HTML page; only its text survives: Book, SGML, Goldfarb, 1986, XML, W3C.)

4 XML: a first glance
XML tags are:
* user defined
* descriptions of the structure of the data
(Example XML document; only its text survives: SGML, Goldfarb, 1986, XML, W3C.)
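
A minimal sketch of why structure-describing tags help programs. The element names (bib, book, title, author, year) are assumptions made for illustration; only the text values come from the slide. Python's standard xml.etree.ElementTree module is used as the parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the tag names are assumed, the text values are the slide's.
DOC = """
<bib>
  <book>
    <title>SGML</title>
    <author>Goldfarb</author>
    <year>1986</year>
  </book>
  <book>
    <title>XML</title>
    <author>W3C</author>
  </book>
</bib>
"""

def titles_by_author(xml_text):
    """Map each book's author to its title by following the tag structure."""
    root = ET.fromstring(xml_text)
    return {book.findtext("author"): book.findtext("title")
            for book in root.iter("book")}

result = titles_by_author(DOC)
```

Because the tags name the data rather than its layout, the program needs no knowledge of fonts or tables to find an author.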

5 XML vs. HTML
* user-defined new tags, describing structure instead of display
* structures can be arbitrarily nested (even recursively defined)
* optional description of its grammar (DTD), so validation is possible
XML presentation:
* the XML standard does not define how data should be displayed
* style sheets provide browsers with a set of formatting rules to be applied to particular elements
  – CSS (Cascading Style Sheets), originally for HTML
  – XSL (eXtensible Stylesheet Language), for XML

6 Web applications
* data exchange
* data transformation
* data integration
* data extraction
* E-commerce

7 XML basics (1)
* XML consists of tags and text
* tags come in pairs (markup):
  – a start tag, of the form <name>
  – an end tag, of the form </name>
* tags must be properly nested:
  – an inner element closes before its enclosing element does (good)
  – overlapping elements (bad)
* XML has only one “basic” type: text, called PCDATA (Parsed Character DATA)
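
A small sketch of the nesting rule, assuming two hypothetical tags a and b. A parser accepts the properly nested string and rejects the overlapping one.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

good = "<a><b>text</b></a>"   # b closes before a: properly nested
bad = "<a><b>text</a></b>"    # a closes while b is still open: overlapping
```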

8 XML basics (2)
* nested tags can be used to express various structures, e.g., “records”: Wenfei Fan, (215) 204-6485, fan@joda.cis.temple.edu, wenfei@acm.org
* a list is represented by using the same tag repeatedly

9 XML basics (3)
XML data is ordered!
* How to represent sets in XML?
* How to represent an unordered pair (a, b) in XML?
* Can one directly represent the following in a conventional database?
  – … … …
  – Wenfei Fan, (215) 204-6485, fan@joda.cis.temple.edu, wenfei@acm.org
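
A sketch of what "ordered" means in practice, assuming a hypothetical pair element with children a and b. The two documents hold the same set of children but are different as XML data.

```python
import xml.etree.ElementTree as ET

def child_tags(xml_text):
    """List the tags of the root's children in document order."""
    return [child.tag for child in ET.fromstring(xml_text)]

ab = child_tags("<pair><a/><b/></pair>")
ba = child_tags("<pair><b/><a/></pair>")

differ_as_xml = ab != ba             # document order is part of the data
equal_as_sets = set(ab) == set(ba)   # an application must ignore order itself
```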

10 XML elements (1)
* element: the segment between a start tag and its corresponding end tag
* subelement: the relation between an element and its component elements, e.g., a record with components Wenfei Fan, (215) 204-6485, fan@joda.cis.temple.edu, wenfei@acm.org

11 XML elements (2)
* root element: an XML document consists of a single element called the root element, which encloses everything else
* empty element: a special element indicating non-textual content, written <name></name> or simply <name/>
  – sound effect (to be generated by the application): Everyone taking CIS 670 can get an A. However, …
  – its attributes carry information

12 XML elements (3)
* mixed content: an element may contain a mixture of subelements and PCDATA
(Example; only its text survives: This is my daughter Grace Fan nuonuo_fan@yahoo.com I am not too sure of the following crying Ph.D Don’t even think about it:)

13 XML attributes (1)
A start tag may contain attributes describing “properties” of the element (e.g., dimension or type).
(Example; only its text survives: 2400 96 M05-+C$ …)
References (meaningful only when a DTD is present):
(Example; only its text survives: Grace Fan nuonuo_fan@yahoo.com)

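14 XML attributes (2)
* XML attributes cannot be nested
* XML attributes must be unique: one can’t write the same attribute name twice in one start tag
* XML attributes are not ordered: the Grace Fan element with its attributes listed in one order is the same as the element with them listed in another order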
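
A sketch of the uniqueness and ordering rules. The element name person and the attribute names id and father are assumptions borrowed from later slides; the original example markup did not survive.

```python
import xml.etree.ElementTree as ET

# Same attributes, different order: the parsed attribute maps are equal.
first = ET.fromstring('<person id="898" father="332">Grace Fan</person>')
second = ET.fromstring('<person father="332" id="898">Grace Fan</person>')
same_element = first.attrib == second.attrib

def parses(xml_text):
    """Return True if the text is well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Repeating an attribute name in one start tag is a well-formedness error.
duplicate_ok = parses('<person id="898" id="899"/>')
```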

15 Attributes vs. subelements
When to use attributes? Not always clear. A research problem. An incomplete guideline:
* attributes cannot nest (flat structure)
* subelements cannot represent references
* subelements are more easily displayed (without the use of complex style sheets)
* subelements are often easier to read
* attributes are more compact

16 Other XML constructs (1)
* XML declaration: version information must be provided: <?xml version="1.0"?>
* comments: <!-- comment text -->
* CDATA: escape blocks containing characters that would otherwise be recognized as markup, e.g., <![CDATA[ this is not an element ]]>
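
A sketch showing the three constructs in one document. The element names note and b are hypothetical. Inside the CDATA block the angle brackets are kept as plain text, so the parser sees no child element.

```python
import xml.etree.ElementTree as ET

DOC = ('<?xml version="1.0"?>'
       '<!-- a comment, dropped by the default parser -->'
       '<note><![CDATA[<b>this is not an element</b>]]></note>')

root = ET.fromstring(DOC)
text = root.text          # the CDATA content, markup characters included
child_count = len(root)   # number of subelements under <note>
```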

17 Other XML constructs (2)
* PI (Processing Instruction): for applications, not parsers
  – Example: associate a CSS style sheet with an XML document
  – Example: associate an XSL style sheet with an XML document: <?xml-stylesheet href="http://www.cis.temple.edu/~fan/book.xsl" type="text/xsl"?>

18 Other XML constructs (3)
* Entities: macros (defined in a DTD), e.g., declared in 670.dtd and used in 670.xml: XML is about &XML-flavor; while HTML is about &HTML-flavor;
* Document type declaration (links a document to its DTD): <!DOCTYPE 670 SYSTEM "http://www.cis.temple.edu/670/670.dtd"> (strictly, the name after DOCTYPE is the root element name, and XML names may not begin with a digit)
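
A sketch of entities as macros. The entity names come from the slide; their replacement texts ("content", "presentation") and the root name course are assumptions, and the entities are declared in an internal DTD subset here rather than in a separate 670.dtd so that the example is self-contained.

```python
import xml.etree.ElementTree as ET

DOC = """<!DOCTYPE course [
  <!ENTITY XML-flavor "content">
  <!ENTITY HTML-flavor "presentation">
]>
<course>XML is about &XML-flavor; while HTML is about &HTML-flavor;</course>"""

# The parser expands each entity reference to its declared replacement text.
expanded = ET.fromstring(DOC).text
```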

19 A complete XML document
(Example document; only its text survives: SGML, Goldfarb, 1986, XML.)

20 Well-formed XML documents
A document is well-formed if it satisfies two constraints (when only elements and attributes are considered):
* tags have to nest properly
* attributes have to be unique
There are also constraints on other constructs (e.g., entities); see the XML specification. These are very weak constraints: they do little more than ensure that XML data will parse into a labeled tree.
XML is tree-like: a rooted directed tree (graph) with labels on vertices!
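
A sketch of the "labeled tree" view: a well-formed document is turned into nested (label, children) pairs. The tag names db, book, and title are hypothetical, and text content is ignored to keep the tree shape visible.

```python
import xml.etree.ElementTree as ET

def as_tree(elem):
    """Represent an element as (label, [subtrees]) in document order."""
    return (elem.tag, [as_tree(child) for child in elem])

tree = as_tree(ET.fromstring("<db><book><title>XML</title></book><book/></db>"))
```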

21 We are here
Introduction to XML
* XML basics
* DTDs
* XML and semistructured data

22 Document Type Definitions (DTDs)
* A DTD imposes structure on an XML document
* The DTD is a syntactic specification (grammar)
* There is some relationship between a DTD and a schema, but it is not close; hence the need for additional “typing” systems (extensions)
* DTDs are optional: an XML document may come without a DTD
* DTDs are somewhat unsatisfactory, and several proposals have been made for better schema formalisms

23 A DTD
<!DOCTYPE db [ … ]>

24 Element declarations (1)
For each element type E, there is a declaration of the form <!ELEMENT E P>, where P is a regular expression, i.e.,
P ::= EMPTY | ANY | #PCDATA | E’ | P1, P2 | P1 | P2 | P? | P+ | P*
  – E’: element type
  – P1, P2: concatenation
  – P1 | P2: disjunction
  – P?: optional
  – P+: one or more occurrences
  – P*: the Kleene closure
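
Since a content model P is a regular expression over element names, it can be checked with an ordinary regex engine. The sketch below is an illustration, not a DTD validator: it handles only element names with the operators , | ? + * and parentheses (no EMPTY, ANY, or #PCDATA), and the function names and sample models are made up for the example.

```python
import re

def model_to_regex(model):
    """Translate an element-only DTD content model into a Python regex."""
    parts = []
    for tok in re.findall(r"[A-Za-z_][\w.-]*|[()|?+*,]", model):
        if tok == ",":
            continue                                  # concatenation is implicit
        elif tok in "()|?+*":
            parts.append("(?:" if tok == "(" else tok)
        else:
            parts.append("(?:" + re.escape(tok) + " )")  # one child element
    return re.compile("".join(parts))

def matches(model, children):
    """Check a sequence of child element names against a content model."""
    sequence = "".join(name + " " for name in children)
    return model_to_regex(model).fullmatch(sequence) is not None
```

For instance, matches("(title, author+, year?)", ["title", "author", "author"]) holds, while the same children in another order fail. An unordered pair of a and b has to be spelled out as ((a, b) | (b, a)), because concatenation is ordered.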

25 Element declarations (2)
* Extended context free grammar: Why is it called extended?
* single root
* subelements are ordered: two declarations that list the same subelements in different orders are different. Why?
* recursive definition, e.g., section, binary tree: <!ELEMENT node (leaf | (node, node))>
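
A sketch of how the recursive binary-tree declaration constrains a document, written as a hand-coded recursive check. It assumes nothing about the content of leaf, which the slide does not declare.

```python
import xml.etree.ElementTree as ET

def is_node(elem):
    """Check <!ELEMENT node (leaf | (node, node))> recursively."""
    if elem.tag != "node":
        return False
    kids = [child.tag for child in elem]
    if kids == ["leaf"]:
        return True                       # base case: a single leaf
    if kids == ["node", "node"]:
        return all(is_node(child) for child in elem)
    return False

single = ET.fromstring("<node><leaf/></node>")
balanced = ET.fromstring(
    "<node><node><leaf/></node><node><leaf/></node></node>")
lopsided = ET.fromstring("<node><node><leaf/></node></node>")
```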

26 Element declarations (3)
* more on recursive DTDs: What is the problem with this? How to fix it?
  – attributes
  – optional (e.g., father?, mother?)
* more on ordering: How to declare E to be an unordered pair (a, b)?

27 Element declarations (4)
* EMPTY element: observe that it has attributes
* ANY: may contain any content (discouraged)
* mixed content

28 Element declarations (5)
* global definition: the type associated with an element is unique; only one declaration for name is allowed. To avoid name clashes, one may use two distinct tags, e.g., personname, coursename.
* namespace: define two namespaces: <MYNAMESPACE xmlns:person="~fan/person.dtd" xmlns:course="~fan/course.dtd"> … …
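
A sketch of how the two namespaces keep two name elements apart. The child elements and their text are assumptions; the prefixes and namespace names are the slide's. ElementTree expands each prefixed tag to the form {namespace}local-name.

```python
import xml.etree.ElementTree as ET

DOC = """<MYNAMESPACE xmlns:person="~fan/person.dtd" xmlns:course="~fan/course.dtd">
  <person:name>Wenfei Fan</person:name>
  <course:name>CIS 670</course:name>
</MYNAMESPACE>"""

root = ET.fromstring(DOC)
tags = [child.tag for child in root]   # fully qualified tag names

# Prefixes used in queries are local to this mapping, not to the document.
ns = {"p": "~fan/person.dtd", "c": "~fan/course.dtd"}
person_name = root.findtext("p:name", namespaces=ns)
course_name = root.findtext("c:name", namespaces=ns)
```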

29 Attribute declarations (1)
General syntax: <!ATTLIST element_name attribute-name attribute-type default-declaration>
Example:
<!ATTLIST book isbn ID #REQUIRED>
<!ATTLIST ref to IDREFS #IMPLIED>
Note: it is OK for several element types to define an attribute of the same name.

30 Attribute declarations (2)
<!ATTLIST element_name attribute-name attribute-type default-declaration>
* attribute types (10):
  – CDATA
  – ID, IDREF, IDREFS
  – ENTITY, ENTITIES
  – NMTOKEN, NMTOKENS
  – enumerated, notation
* default declarations (4):
  – #REQUIRED, #IMPLIED
  – "default value", #FIXED "default value"

31 Specifying ID and IDREF attributes
<!ATTLIST person
    id       ID     #REQUIRED
    father   IDREF  #IMPLIED
    mother   IDREF  #IMPLIED
    children IDREFS #IMPLIED>
e.g., <person id="898" father="332" mother="336" children="982 984 986"> ….

32 XML reference mechanism
* ID attribute: unique within the entire document.
  – An element can have at most one ID attribute.
  – No default (or fixed default) value is allowed.
  – #REQUIRED: a value must be provided
  – #IMPLIED: a value is optional
* IDREF attribute: its value must be the ID value of some element in the document.
* IDREFS attribute: its value is a set (written as a space-separated list), each member of which is the ID value of some element in the document.
<person id="898" father="332" mother="336" children="982 984 986">
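
A hand-written sketch of the two reference constraints (distinct IDs, no dangling IDREF/IDREFS), assuming the person attributes from the declaration above. It is not a validating parser. The sample IDs are given a letter prefix (p898 rather than 898) because a validating parser requires ID values to be XML names, which cannot begin with a digit.

```python
import xml.etree.ElementTree as ET

def reference_errors(root):
    """Return the violated ID/IDREF constraints (an empty list if none)."""
    ids, errors = set(), []
    for elem in root.iter("person"):
        pid = elem.get("id")
        if pid is None:
            errors.append("missing ID")
        elif pid in ids:
            errors.append("duplicate ID " + pid)
        else:
            ids.add(pid)
    for elem in root.iter("person"):
        refs = [elem.get(a) for a in ("father", "mother") if elem.get(a)]
        refs += elem.get("children", "").split()      # IDREFS: several values
        errors += ["dangling reference " + r for r in refs if r not in ids]
    return errors

family = ET.fromstring("""<db>
  <person id="p332" children="p898"/>
  <person id="p336" children="p898"/>
  <person id="p898" father="p332" mother="p336"/>
</db>""")
broken = ET.fromstring('<db><person id="p1" father="p9"/><person id="p1"/></db>')
```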

33 ID vs. object identifiers in OODBs
* ID is unique within the whole document, like an oid
* ID is not system-generated and can be changed, unlike an oid and somewhat like keys in relational DBs
* IDREF (IDREFS) are untyped, a big problem: you point to something, but you don’t know what it is!
* No inverse constraints, e.g., that child is the inverse of parent.
This makes it difficult to translate object-oriented databases into an XML encoding.

34 ID/IDREF vs. key/foreign key in RDBs
* keys are unique within the same relation, while IDs are unique within the whole document
* a relation may have several different keys, while an element can have at most one ID
* keys can consist of several attributes, while IDs must be single-valued, e.g., enroll (sid: string, cid: string, grade: string)
* foreign keys are typed, while IDREF (IDREFS) is not
This makes it difficult to translate RDBs into XML. Why is it a problem? To exchange/integrate/transform data, we need to translate legacy data into an XML encoding while preserving the original semantics of the data.

35 CDATA attributes
CDATA: string
<!ATTLIST snore volume CDATA #IMPLIED>
* default value: used when no value is given
  <!ATTLIST snore volume CDATA "normal">
* fixed default value: fixed and may not be changed
  <!ATTLIST snore volume CDATA #FIXED "normal">

36 Enumerated types
We specify a range of values and don’t want the volume to fall outside it.
<!ATTLIST snore volume (silent | quiet | normal | loud | loudest) "normal">
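
A sketch that emulates, by hand, what the enumerated declaration asks of a validating parser: use "normal" when volume is absent, and reject values outside the range. The function name is made up for the example, and the documents carry no DTD.

```python
import xml.etree.ElementTree as ET

ALLOWED = ("silent", "quiet", "normal", "loud", "loudest")
DEFAULT = "normal"

def volume_of(xml_text):
    """Return the snore volume, applying the default and the range check."""
    elem = ET.fromstring(xml_text)
    volume = elem.get("volume", DEFAULT)   # default when no value is given
    if volume not in ALLOWED:
        raise ValueError("volume out of range: " + volume)
    return volume

implicit = volume_of("<snore/>")
explicit = volume_of('<snore volume="loud"/>')
```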

37 NOTATIONS
Notations allow documents to identify the types of content they will contain.
In attributes:
<!ATTLIST picture
    source CDATA #REQUIRED
    type NOTATION (gif | jpg) #REQUIRED>

38 NMTOKEN, NMTOKENS
* NMTOKEN: name token, a restricted form of string built from the same characters as element names, but with no restriction on the first character. No additional constraints: it does not have to match another attribute or declaration.
  <!ATTLIST picture width NMTOKEN #REQUIRED>
* NMTOKENS: a list of name tokens.

39 ENTITY, ENTITIES
* ENTITY (attribute): must match the name of an entity
* ENTITIES: multiple ENTITY values, each of which must be the name of an entity.
Parameter entities: their use is limited to the DTD.
  – syntax: <!ENTITY % entity-name "replacement text">
  – use: %entity-name;
e.g.,
<!ATTLIST DB-user skills %basics;>
<!ATTLIST DBA skills %basics; others CDATA #REQUIRED>
Research: sub-typing and inheritance by means of entities?

40 Valid XML documents
A valid XML document must have a DTD.
* The document is well-formed
* It conforms to the DTD:
  – elements observe the grammar (nested only in the way described by the DTD)
  – elements have only the attributes specified by the DTD
  – ID/IDREF attributes satisfy their constraints: ID values must be distinct; IDREF/IDREFS values must be existing ID values

41 We are here
Introduction to XML
  – XML basics
  – DTDs
  – XML and semistructured data

42 Graph representation
XML data can be represented as a rooted node-labeled directed tree, while semistructured data is a rooted edge-labeled directed graph.
* If references are not considered (no DTD), an XML document is usually modeled as a tree.
* References (IDREF/IDREFS) can be viewed as edges and thus lead to graphs (cyclic structure): the XML-QL model.
* When IDREF/IDREFS attributes are simply treated as text data, XML data is modeled as a tree: the XSL and XQL model. This model does not take validation into account.
Research: ordered graph/tree?
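
A sketch contrasting the two models on one document. The attribute names (id, father, children) are assumed to be declared as ID, IDREF, and IDREFS. With references treated as text the document is a tree; once they are followed as edges, nodes gain extra incoming edges and the structure becomes a cyclic graph.

```python
import xml.etree.ElementTree as ET

DOC = """<db>
  <person id="p1" children="p2"/>
  <person id="p2" father="p1"/>
</db>"""

def edges(root, follow_refs):
    """Parent-child edges, plus IDREF/IDREFS edges when follow_refs is set."""
    by_id = {e.get("id"): e for e in root.iter() if e.get("id")}
    result = []
    for elem in root.iter():
        result += [(elem, child) for child in elem]
        if follow_refs:
            refs = (elem.get("father", "") + " " + elem.get("children", "")).split()
            result += [(elem, by_id[r]) for r in refs if r in by_id]
    return result

def is_tree(root, follow_refs):
    """In a rooted tree every node but the root has exactly one incoming edge."""
    incoming = {}
    for _, target in edges(root, follow_refs):
        incoming[id(target)] = incoming.get(id(target), 0) + 1
    return all(incoming.get(id(e), 0) == (0 if e is root else 1)
               for e in root.iter())

root = ET.fromstring(DOC)
```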

43 Node-labeled vs. edge-labeled (1)
Consider an ssd expression {a: {b: “string”}, a: {c: “string”}}
Edge-labeled graph: (figure; surviving labels: a, a, c, b, “string”)
We may encode it in XML in either of two ways (with some DTD). (Both encodings were lost; only the text “string” survives from each.)

44 Node-labeled vs. edge-labeled (2)
Node-labeled graphs, one per XML encoding: (figures; surviving labels: a, a, c, b, “string”)
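
A sketch of one possible translation from the ssd expression to XML, moving each edge label onto the node it points to. The nested-pair representation of the ssd expression and the wrapper element root (needed because XML requires a single root) are assumptions; this is not necessarily either of the encodings on the original slide.

```python
import xml.etree.ElementTree as ET

# {a: {b: "string"}, a: {c: "string"}} as nested (label, value) pairs
SSD = [("a", [("b", "string")]), ("a", [("c", "string")])]

def to_xml(label, value):
    """Turn an edge label and its target into a node-labeled element."""
    elem = ET.Element(label)
    if isinstance(value, str):
        elem.text = value                        # leaf: the string value
    else:
        for child_label, child_value in value:
            elem.append(to_xml(child_label, child_value))
    return elem

xml_text = ET.tostring(to_xml("root", SSD), encoding="unicode")
```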

45 XML vs. semistructured data
* similarities:
  – both are best described by a graphical representation
  – both are schema-less, self-describing
* differences:
  – XML is ordered, semistructured data is not.
  – XML data is modeled as a node-labeled graph (tree), semistructured data as an edge-labeled graph.
  – XML can mix text and elements; when translating it to a semistructured data expression, we have to add some surrounding tag for the PCDATA.
  – XML has other constructs (PIs, comments).
  – XML may have an (optional) DTD.

46 Query languages
* for semistructured data:
  – Lorel
  – UnQL
* for XML:
  – XML-QL
  – XQL
  – XSLT (transformation language)
XML data can be treated and queried as semistructured data (e.g., Lorel).

47 DTDs vs. schemas (types)
* By database (or programming language) standards, XML DTDs are rather weak specifications.
  – Only one base type: PCDATA.
  – No useful “abstractions”, e.g., sets.
  – IDREFs are not typed or scoped: you point to something, but you don’t know what!
  – No constraints (e.g., inverse) or methods.
  – No sub-typing or inheritance.
* XML extensions to overcome the limitations:
  – type systems: XML-Data, XML-Schema, SOX, DCD
  – metadata: RDF
  – constraints

48 Summary
* XML is a new data exchange format. Its main virtues include widespread acceptance.
* DTDs provide some useful syntactic constraints on documents. As schemas they are weak.
* Research topics include:
  – How will we query XML data?
  – How will we navigate the Web (XLink, XPointer)?
  – How will we extend XML DTDs to capture the semantics of the data?
  – How will we map between XML and other representations, esp. structured databases?
  – How will we compress/store XML data?

