XML
Name: Niki Sardjono
Class: CS 157A
Instructor: Prof. S. M. Lee
Introduction
XML stands for Extensible Markup Language. Its roots are in document management, and it is derived from the Standard Generalized Markup Language (SGML). XML can represent database data as well as other kinds of structured data.
Background
XML has its roots in document markup languages. Markup refers to anything in a document that is not meant to be part of the printed output. In the markup language family (HTML, SGML, and XML), markup takes the form of tags enclosed in angle brackets, <>. Tags are always used in pairs, a start-tag <tag> and an end-tag </tag>, which delimit the beginning and end of the portion of the document to which the tag refers. An example would be the text Database enclosed in such a pair of tags.
Unlike HTML, however, XML does not prescribe the set of allowed tags, and tags can be specialized as needed. Compared to storing data in a database, XML can be inefficient, since tag names are repeated throughout the document. However, XML has advantages when it is used to exchange data:
- The presence of tags makes the message self-documenting (a schema need not be consulted to understand the meaning of the text).
- The format of the document is not rigid.
- The XML format is widely accepted.
In this sense, XML is becoming the dominant format for data exchange.
Structure of XML Data
The fundamental construct in an XML document is the element: a pair of matching start- and end-tags and the text between them. An XML document must have a single root element that encompasses all other elements in the document. Text is said to appear in the context of an element if it appears between the start-tag and end-tag of that element. Tags are properly nested if every start-tag has a unique matching end-tag that is in the context of the same parent element.
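The rules above (a single root element, properly nested tags) can be checked with any XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the tag names and values in DOC are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the tag names and values are illustrative only.
DOC = """<bank>
  <account>
    <account-number>A-101</account-number>
    <balance>500</balance>
  </account>
</bank>"""


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text as a well-formed document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


root = ET.fromstring(DOC)
root_tag = root.tag                          # the single root element
child_tags = [child.tag for child in root]   # elements nested directly under it

# Improperly nested: <b> is closed only after its parent <a> has been closed.
badly_nested = "<a><b>text</a></b>"
# Two top-level elements: there is no single root element.
two_roots = "<a>x</a><b>y</b>"

print(root_tag, child_tags)
print(is_well_formed(badly_nested), is_well_formed(two_roots))
```

Both malformed strings are rejected by the parser, which is why an application receiving XML can rely on the nesting structure without checking it by hand.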
Nested representations are widely used in XML data interchange applications to avoid joins.
XML also specifies the notion of an attribute. Attribute values are strings that do not contain markup, and a given attribute name can appear only once in a given tag.
Example would be: A-120 Perryridge 400
A namespace mechanism has been introduced in XML to allow organizations to specify globally unique names to be used as element tags. The idea is to prepend each tag or attribute with a universal resource identifier (for example, a Web address). Because using long identifiers would be inconvenient, the namespace standard provides a way to define an abbreviation for an identifier.
Example: ………….
We can use a default namespace in the example above by using xmlns instead of xmlns:FB…. in the root element.
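To make attributes and namespace abbreviations concrete, here is a small sketch in Python using xml.etree.ElementTree. The prefix FB follows the slide, but the namespace URI, the tag names, and the attribute branch-city are assumptions made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI and tag names, for illustration only.
FB_URI = "http://www.example.com/FirstBank"

DOC = f"""<bank xmlns:FB="{FB_URI}">
  <FB:branch branch-city="Brooklyn">
    <FB:branchname>Downtown</FB:branchname>
  </FB:branch>
</bank>"""

root = ET.fromstring(DOC)

# ElementTree expands a prefixed tag to "{namespace-URI}local-name".
branch = root.find(f"{{{FB_URI}}}branch")
branch_tag = branch.tag
branch_city = branch.get("branch-city")   # attribute values are plain strings

# The same lookup using a prefix-to-URI map instead of the expanded name.
name = root.find("FB:branch/FB:branchname", {"FB": FB_URI}).text

# Default namespace: xmlns without a prefix applies to unprefixed elements.
DEFAULT_DOC = f'<bank xmlns="{FB_URI}"><branch/></bank>'
default_root = ET.fromstring(DEFAULT_DOC)
default_child_tag = default_root[0].tag

print(branch_tag, branch_city, name)
print(default_child_tag == branch_tag)
```

The last comparison shows that the prefixed form and the default-namespace form name the same element once the abbreviation is expanded.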
XML Document Schema
Document type definition
A DTD (Document Type Definition) is an optional part of an XML document. The main purpose of a DTD is to constrain and type the information present in the document. In practice, however, it does not constrain basic types such as integer or string; it only constrains the appearance of subelements and attributes within an element. A DTD is a list of rules for what pattern of subelements may appear within an element. Operators used are:
– + specifies one or more
– | specifies or
– * specifies zero or more
– ? specifies optional elements
Attributes can be declared with several types, such as:
– CDATA: character data.
– ID: a unique identifier for the element.
– IDREF: a reference to an element; its value must appear as the ID attribute of some element in the document.
– IDREFS: a list of such references.
Limitations of DTDs as a schema mechanism:
– Individual text elements and attributes cannot be further typed, which is problematic for data processing and exchange applications.
– It is difficult to use a DTD to specify unordered sets of subelements.
– ID and IDREF are untyped, so it is impossible to specify the type of element to which an IDREF or IDREFS attribute should refer.
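The ID and IDREFS rules can be illustrated with a short Python sketch. Python's standard library does not validate against a DTD, so the attribute declarations are mimicked here by two hand-written sets; those sets, the attribute names, and the data are all assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# A DTD would declare which attributes have type ID or IDREFS.
# Here that declaration is mimicked by two sets (an assumption of this sketch).
ID_ATTRS = {"account-number", "customer-id"}
IDREFS_ATTRS = {"owners", "accounts"}

# Hypothetical data; A-999 deliberately refers to no existing element.
DOC = """<bank-2>
  <account account-number="A-401" owners="C100 C102"/>
  <customer customer-id="C100" accounts="A-401"/>
  <customer customer-id="C102" accounts="A-401 A-999"/>
</bank-2>"""


def dangling_references(root):
    """Return IDREFS tokens that match no ID attribute value in the document."""
    ids = {value
           for el in root.iter()
           for attr, value in el.attrib.items() if attr in ID_ATTRS}
    refs = [token
            for el in root.iter()
            for attr, value in el.attrib.items() if attr in IDREFS_ATTRS
            for token in value.split()]
    return [token for token in refs if token not in ids]


dangling = dangling_references(ET.fromstring(DOC))
print(dangling)
```

Note that the check only asks whether a referenced value exists as some ID. An owners attribute that pointed at an account instead of a customer would pass, which is the lack of typing described in the last limitation above.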
XMLSchema
XMLSchema is a more sophisticated schema language than DTD. Benefits compared to DTDs:
– Allows user-defined types to be created.
– Allows the text that appears in elements to be constrained to specific types.
– Allows types to be restricted to create specialized types, for instance by specifying minimum and maximum values.
– Allows complex types to be extended by using a form of inheritance.
– Is a superset of DTDs.
– Allows uniqueness and foreign key constraints.
– Is integrated with namespaces to allow different parts of a document to conform to different schemas.
– Is itself specified in XML syntax.
Its disadvantage is that XMLSchema is significantly more complicated than DTDs.
Querying and Transformation
Querying and transformation are essential for extracting information from large bodies of XML data and for converting it between different representations (schemas) in XML. Several languages provide increasing degrees of querying and transformation capability:
– XPath is a language for path expressions and is a building block for the other two languages.
– XSLT is a transformation language (part of the XSL style sheet system, which is used to control the formatting of XML data into HTML or other formats). It can also generate XML as output.
– XQuery is the standard for querying XML data.
All of these languages use the tree model of XML data, where nodes correspond to elements and attributes.
XPath
A path expression in XPath is a sequence of location steps separated by “/”. Example would be: /bank-2/customer/name/text()
As in a directory path, the initial / denotes the root of the document, which sits above the top-level element. The expression is evaluated from left to right. When an element name appears before the next “/”, it refers to all elements of that name that are children of elements in the current element set. Attributes can also be accessed by using the “@” character. Example would be: /bank-2/account/@account-number, which returns the set of all values of the account-number attributes of account elements. IDREF links, however, are not followed by default.
XPath supports a number of other features:
– Selection predicates may follow any step in a path and are contained in square brackets. Example: /bank-2/account[balance > 400]
– Several functions can be used as part of predicates, including tests on the position of the current node in sibling order and a count of the number of matches. Example: /bank-2/account[count(customer) > 2]
– The function id(“foo”) returns the nodes (if any) with an attribute of type ID and value foo.
– The | operator allows expression results to be unioned. For example, the union of an expression that selects the customers of accounts with one that selects the customers of loans returns customers with either accounts or loans. However, the | operator cannot be nested inside other operators.
– Multiple levels of nodes can be skipped by using “//”.
– A step need not select from the children of the nodes in the current node set; it can also move to parents, siblings, ancestors, or descendants.
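The path expressions above can be tried out in Python. This is only a sketch: xml.etree.ElementTree implements a limited subset of XPath (no numeric comparisons and no count()), so the balance predicate is applied in ordinary Python after the step is selected. The data values are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical data following the bank-2 examples on the slides.
DOC = """<bank-2>
  <account account-number="A-401"><balance>500</balance></account>
  <account account-number="A-402"><balance>300</balance></account>
  <customer><name>Joe</name></customer>
  <customer><name>Mary</name></customer>
</bank-2>"""

root = ET.fromstring(DOC)   # root is the <bank-2> element itself

# /bank-2/customer/name/text()
names = [n.text for n in root.findall("./customer/name")]

# /bank-2/account/@account-number
numbers = [a.get("account-number") for a in root.findall("./account")]

# /bank-2/account[balance > 400]
# ElementTree has no numeric comparison predicates, so the test is done
# in Python on the nodes selected by the step.
rich = [a.get("account-number")
        for a in root.findall("./account")
        if int(a.findtext("balance")) > 400]

# //name: skip intermediate levels with the descendant shorthand.
all_names = [n.text for n in root.findall(".//name")]

print(names, numbers, rich, all_names)
```

A full XPath engine would evaluate the predicate inside the path itself; the split shown here is a limitation of the chosen library, not of XPath.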
XSLT
XSL (Extensible Stylesheet Language) was originally designed for generating HTML from XML. The language, however, includes a general-purpose transformation mechanism called XSL Transformations (XSLT). An XSLT transformation is expressed as a series of recursive rules, called templates. Structural recursion is important in XSLT because the data is tree-structured, so template rules can be applied recursively to subtrees. XSLT also has a feature called keys, which is similar in purpose to id() but can use values other than ID attributes. Example: an xsl:key declaration, where name distinguishes different keys, match specifies which nodes the key applies to, and use specifies the expression to be used as the value of the key.
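The idea of structural recursion can be sketched without an XSLT processor. The Python function below is an analogy, not XSLT: it picks a rule by tag name (a stand-in for template matching) and otherwise recurses into the subtrees. The output tag names person and people and the sample data are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical input data.
DOC = """<bank-2>
  <customer><name>Joe</name><city>Rye</city></customer>
  <customer><name>Mary</name><city>Harrison</city></customer>
</bank-2>"""


def apply_templates(node):
    """Mimic template matching: pick a rule by tag, else recurse on children."""
    if node.tag == "customer":
        # Rule for customer: emit a new element holding only the name text.
        out = ET.Element("person")
        out.text = node.findtext("name")
        return [out]
    # Default rule: emit nothing for this node, but process its subtrees.
    results = []
    for child in node:
        results.extend(apply_templates(child))
    return results


people = ET.Element("people")
people.extend(apply_templates(ET.fromstring(DOC)))
output = ET.tostring(people, encoding="unicode")
print(output)
```

The result is itself XML, which mirrors the point that a transformation can produce XML as well as HTML.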
XQuery
Developed by the World Wide Web Consortium (W3C). Queries are organized into “FLWR” expressions, comprising for, let, where, and return clauses.
– for: gives a series of variables that range over the results of XPath expressions. When more than one variable is specified, the result includes the Cartesian product of the possible values the variables can take.
– let: allows complicated expressions to be assigned to variable names for simplicity of representation.
– where: performs additional tests on the joined tuples from the for section.
– return: allows the construction of results in XML.
Example:
for $x in /bank-2/account
let $acctno := $x/@account-number
where $x/balance > 400
return $acctno
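The four clauses map closely onto an ordinary loop, which may help in reading FLWR expressions. The sketch below emulates the example query in Python rather than running XQuery; it assumes the account number is stored as an account-number attribute, and the data is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical data following the bank-2 examples on the slides.
DOC = """<bank-2>
  <account account-number="A-401"><balance>500</balance></account>
  <account account-number="A-402"><balance>300</balance></account>
  <customer><name>Joe</name></customer>
  <customer><name>Mary</name></customer>
</bank-2>"""

root = ET.fromstring(DOC)


def accounts_over(root, limit):
    """Emulate: for $x in /bank-2/account ... where $x/balance > limit."""
    results = []
    for x in root.findall("./account"):           # for: iterate over a path
        acctno = x.get("account-number")          # let: bind a name (assumed attribute)
        if int(x.findtext("balance")) > limit:    # where: filter the bindings
            results.append(acctno)                # return: build the result
    return results


over_400 = accounts_over(root, 400)

# Two for-variables range over the Cartesian product of their values.
pairs = [(a.get("account-number"), c.findtext("name"))
         for a in root.findall("./account")
         for c in root.findall("./customer")]

print(over_400)
print(len(pairs))
```

With two accounts and two customers the second comprehension yields four pairs, which a where clause would then narrow down in a join.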
Application Program Interface
There are two standard APIs: DOM (Document Object Model) and SAX (Simple API for XML). DOM treats XML content as a tree and can be used to access XML data stored in databases. XML databases can also be built using DOM as their primary interface for accessing and modifying data. However, DOM does not support any form of declarative querying. SAX is an event model that provides a common interface between parsers and applications.
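The difference between the two APIs is easiest to see side by side. The following sketch uses Python's standard xml.dom.minidom and xml.sax modules on the same hypothetical document; the handler class name AccountCollector is made up for this example.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical data following the bank-2 examples on the slides.
DOC = """<bank-2>
  <account account-number="A-401"><balance>500</balance></account>
  <account account-number="A-402"><balance>300</balance></account>
</bank-2>"""

# DOM: the whole document becomes a tree that can be navigated procedurally.
dom = minidom.parseString(DOC)
accounts = dom.getElementsByTagName("account")
dom_numbers = [a.getAttribute("account-number") for a in accounts]
first_balance = accounts[0].getElementsByTagName("balance")[0].firstChild.data


# SAX: the parser reports events and the application supplies the callbacks.
class AccountCollector(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.numbers = []

    def startElement(self, name, attrs):
        # Called once for every start-tag the parser encounters.
        if name == "account":
            self.numbers.append(attrs.getValue("account-number"))


handler = AccountCollector()
xml.sax.parseString(DOC.encode("utf-8"), handler)
sax_numbers = handler.numbers

print(dom_numbers, first_balance, sax_numbers)
```

DOM keeps the full tree in memory and allows modification, while the SAX handler sees each tag once as it streams past and keeps only what it chooses to record.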
Storage of XML Data
One option is to use a relational database. If the XML data was generated from a relational schema, converting it back is straightforward. If not, there are several alternatives:
– Store as string: store each child element of the top-level element as a string in a separate tuple in the database. This is easy to use, but the database system does not know the schema of the stored elements. A partial solution is to store different types of elements in different relations, and to store the values of some critical elements as attributes of the relation to enable indexing. The drawback of this approach is that a large part of the XML information remains stored within strings.
– Tree representation: model the XML data as a tree in which every element and attribute is given a unique identifier. Each tuple inserted into the nodes relation consists of the identifier (id), the type (attribute or element), the name of the element or attribute (label), and the text value of the element or attribute (value). The advantage is that all XML information can be represented directly in relational form, and many XML queries can be translated into relational queries and executed inside the database system. The drawback is that each element is broken up into many pieces, so a large number of joins is required to reassemble elements.
– Map to relations: XML elements whose schema is known are mapped to relations and attributes. Elements whose schema is unknown are stored as strings or in the tree representation.
There are also nonrelational data stores:
– Store in flat files: lacks data isolation, integrity checks, atomicity, concurrent access, and security.
– Store in an XML database.
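The tree representation described above can be sketched with Python's sqlite3 module. The nodes fields (id, type, label, value) follow the slide; the child relation that records parent links, the function name shred, and the sample data are assumptions of this sketch.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical data following the bank-2 examples on the slides.
DOC = """<bank-2>
  <account account-number="A-401"><balance>500</balance></account>
</bank-2>"""


def shred(root, conn):
    """Store every element and attribute as a nodes row plus a child link."""
    conn.execute("CREATE TABLE nodes "
                 "(id INTEGER PRIMARY KEY, type TEXT, label TEXT, value TEXT)")
    # Assumed helper relation recording which node is nested in which.
    conn.execute("CREATE TABLE child (child_id INTEGER, parent_id INTEGER)")
    next_id = 0

    def insert(kind, label, value, parent_id):
        nonlocal next_id
        next_id += 1
        conn.execute("INSERT INTO nodes VALUES (?, ?, ?, ?)",
                     (next_id, kind, label, value))
        if parent_id is not None:
            conn.execute("INSERT INTO child VALUES (?, ?)",
                         (next_id, parent_id))
        return next_id

    def visit(element, parent_id):
        text = (element.text or "").strip() or None
        node_id = insert("element", element.tag, text, parent_id)
        for name, value in element.attrib.items():
            insert("attribute", name, value, node_id)
        for sub in element:
            visit(sub, node_id)

    visit(root, None)


conn = sqlite3.connect(":memory:")
shred(ET.fromstring(DOC), conn)

# Reassembling even one fact needs joins: the balance of each account.
rows = conn.execute("""
    SELECT b.value
    FROM nodes a
    JOIN child c ON c.parent_id = a.id
    JOIN nodes b ON b.id = c.child_id
    WHERE a.label = 'account' AND b.label = 'balance'
""").fetchall()
node_count = conn.execute("SELECT COUNT(*) FROM nodes").fetchone()[0]

print(rows, node_count)
```

One small account already produces four nodes rows, and reading back a single balance takes two joins, which illustrates the drawback noted for this representation.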
XML Applications
The central goal of XML is to make it easy to communicate information on the Web and between applications.