XML, distributed databases, and OLAP/warehousing
  The semantic web and a lot more
What is XML?
  A framework for declarative languages
  A syntax and two major constructs: elements & attributes
  Elements:
    Have begin and end tags
    Can be embedded
    Can be put in lists (homogeneous or heterogeneous)
  Attributes:
    Are assigned to elements
    Are strings
    Are put in quotes
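A minimal sketch of the two constructs, using Python's standard `xml.etree.ElementTree` parser. The document and all of its element and attribute names (`library`, `book`, `isbn`, and so on) are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: every tag and attribute name here is illustrative.
doc = """
<library>
  <book isbn="123" lang="en">
    <title>First title</title>
    <author>A. Writer</author>
    <author>B. Writer</author>
  </book>
  <journal issn="456">
    <title>Second title</title>
  </journal>
</library>
"""

root = ET.fromstring(doc)

# Elements nest: <library> holds a heterogeneous list of child elements.
child_tags = [child.tag for child in root]

# Attributes are quoted strings attached to an element.
book = root.find("book")
isbn = book.get("isbn")

# A homogeneous list: repeated <author> elements inside one <book>.
authors = [a.text for a in book.findall("author")]

print(child_tags, isbn, authors)
```

Note that `isbn` comes back as a string even though it looks numeric, since attribute values are always strings.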
What is XML for?
  Initially, as a cornerstone of the semantic web
    Automatic searching of the web (versus interactive)
    Self-describing data
  Has been adapted to a wide variety of application domains
    As a means for specifying the structure of data
    As a catch-all for nontraditional data
XML documents
  An instance of XML is a language
  An instance of an XML language is a document
  Documents are hierarchical & list-oriented
  XML documents can be parsed in a single, linear pass
  There is no notion of a fixed schema
  Does not leverage metadata for set-oriented queries
  Order matters in a set of documents
  Order matters in a series of elements in a document
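The single-pass and ordering points can be sketched with `iterparse` from Python's standard library, which reads the input front to back and hands over each element once its end tag has been seen. The `orders` document is a hypothetical example.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical document with three sibling elements in a fixed order.
doc = b"<orders><order id='1'/><order id='2'/><order id='3'/></orders>"

seen = []
# One linear pass: each element is reported when its end tag is reached.
for event, elem in ET.iterparse(io.BytesIO(doc), events=("end",)):
    if elem.tag == "order":
        seen.append(elem.get("id"))
        elem.clear()  # drop the element's contents once it has been handled

print(seen)
```

The ids arrive in document order, which is the sense in which order matters among sibling elements.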
Is it a generalized HTML?
  Sort of, but perhaps more of a meta alternative to HTML
  The real point is to allow HTML pages to be located and searched automatically
  This is done by allowing language developers to create their own names for documents, elements, & attributes
What else is part of the XML philosophy?
  Namespaces
    Associated with URLs
    Can be referenced in a nested fashion in an XML document
  Widely distributed sharing of data, XML languages, and namespaces
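A small sketch of namespaces, again with `xml.etree.ElementTree`. The two `example.com` namespace URLs and the prefixes `c` and `p` are assumptions chosen for the example; the second namespace is declared on a nested element to show nested referencing.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: a default namespace on the root, a second one declared deeper.
doc = """<catalog xmlns="http://example.com/catalog">
  <item xmlns:p="http://example.com/pricing">
    <name>Widget</name>
    <p:price currency="USD">9.50</p:price>
  </item>
</catalog>"""

root = ET.fromstring(doc)

# Prefixes used in queries are local to this mapping; only the URLs must match.
ns = {"c": "http://example.com/catalog", "p": "http://example.com/pricing"}

name = root.find("c:item/c:name", ns).text
price = root.find("c:item/p:price", ns)

# ElementTree stores the namespace URL inside the tag, in braces.
print(root.tag, name, price.tag, price.get("currency"))
```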
What’s missing, from a database user’s and a programmer’s perspective?
  No innate notion of a query language
  No objects
  Very limited data structuring capabilities
  Yet another impedance mismatch problem
  No way to store XML documents in a relational database, at least not natively
  No way to make a database out of a set of documents
So, in response to the database community’s desires…
  A hierarchical query language: XPath
  A specification format for schemas: DTDs
    But uses a different syntax
    Does not accommodate namespaces
So, in response to the database community’s desires, phase 2…
  XML Schema
    More atomic or “basic” types
    Like DTDs, but with an XML syntax
    Supports namespaces
    Adds primary keys and foreign keys
    Adds more constructs for structuring data
    Simple types: primitive types, list and union, & restriction
    Attributes can be of simple types
    Complex types: compositors all (unordered) and sequence (ordered), and choice
    Extension and restriction
    Integrity constraints
Query language 1: XPath
  Follows hierarchy of XML documents
  Uses syntax borrowed from Unix file system
    / for root
    . for current node
    @ for value of an attribute
    [1], [2], etc., for siblings
    // for self or descendant of
    .//x for all descendants to find an element of a specific type x
  Augmented with URLs to create XPointer
  Relational database systems generally have an XML data type now
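A few of these path forms can be tried with `xml.etree.ElementTree`, which implements only a limited subset of XPath (relative paths, `.//x`, positional and attribute predicates), so this is a sketch and not a full XPath engine. The `library` document is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: <book> appears at two different depths.
doc = """
<library>
  <shelf>
    <book isbn="111"><title>Alpha</title></book>
    <book isbn="222"><title>Beta</title></book>
  </shelf>
  <book isbn="333"><title>Gamma</title></book>
</library>
"""
root = ET.fromstring(doc)

# .//book : every <book> below the current node, at any depth.
all_isbns = [b.get("isbn") for b in root.findall(".//book")]

# shelf/book[2] : the second <book> child of <shelf> (positions start at 1).
second = root.find("shelf/book[2]").get("isbn")

# [@isbn='333'] : select by attribute value, then step down to <title>.
gamma = root.find(".//book[@isbn='333']/title").text

print(all_isbns, second, gamma)
```

In this subset the `@` form is available only inside predicates; the attribute value itself is read with `get`.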
Distributed Databases & Distributed Transactions: homogeneous and heterogeneous
  See page 689: multiple DBs vs. a distributed DB
  Homogeneous distributed DBs
    Single unified schema
    Designed top down
    Distribution by row, column, table, by table selection
  Issues of distribution
    Redundancy: availability vs. keeping copies up to date
    Hidden joins with column distribution
    Hidden unions with table selection distribution
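The hidden joins and hidden unions can be illustrated in plain Python. This is a sketch under simple assumptions: the `employees` table, its columns, and the `east`/`west` split are all hypothetical, and each list stands in for a fragment stored on a different node.

```python
# Hypothetical table, held as a list of rows.
employees = [
    {"id": 1, "name": "Ann", "site": "east", "salary": 100},
    {"id": 2, "name": "Bob", "site": "west", "salary": 120},
    {"id": 3, "name": "Cy", "site": "east", "salary": 90},
]

# Distribution by table selection: each node stores the rows matching a predicate.
east = [r for r in employees if r["site"] == "east"]
west = [r for r in employees if r["site"] == "west"]

# Distribution by column: each node stores some columns plus the key.
public = [{"id": r["id"], "name": r["name"], "site": r["site"]} for r in employees]
payroll = [{"id": r["id"], "salary": r["salary"]} for r in employees]


def by_id(rows):
    """Sort rows by key so results can be compared."""
    return sorted(rows, key=lambda r: r["id"])


# Hidden union: a query over the whole table must union the row fragments.
rebuilt_from_rows = by_id(east + west)

# Hidden join: a query touching both column groups must join them on the key.
salary_of = {r["id"]: r["salary"] for r in payroll}
rebuilt_from_columns = by_id({**r, "salary": salary_of[r["id"]]} for r in public)

print(rebuilt_from_rows == by_id(employees), rebuilt_from_columns == by_id(employees))
```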
Executing distributed transactions
  Each node has a master and a client module
    Masters are all identical and contain distributed data info
    Clients are like single site databases with a prepare to commit
  3 basic strategies for query fragment execution
    Bring data to procedure
    Send procedure to data
    Meet in a 3rd place
  Estimating costs
    Data shipping
    Result shipping
    Wait times on nodes
  Integrity constraint enforcement
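The "prepare to commit" step is the kind of step used by a two-phase commit protocol. The slides do not name a protocol, so treating it as two-phase commit is an assumption here, and the `Participant` class and `run_transaction` function are hypothetical names for a simplified sketch with no failures, timeouts, or logging.

```python
class Participant:
    """A node that votes in the prepare phase (hypothetical interface)."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "idle"

    def prepare(self):
        # Vote yes only if the local work can be made durable.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def run_transaction(participants):
    # Phase 1: ask every node to prepare and collect all votes.
    votes = [p.prepare() for p in participants]
    # Phase 2: commit only if every node voted yes; otherwise abort everywhere.
    if all(votes):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.abort()
    return "aborted"


ok = run_transaction([Participant("a"), Participant("b")])
nodes = [Participant("a"), Participant("b", can_commit=False)]
failed = run_transaction(nodes)
print(ok, failed, [p.state for p in nodes])
```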
Heterogeneous distributed databases
  Forms of heterogeneity
    Model
    Schema
    Database product
    Namespace
    Table structure (implications for object identities)
    Keys and foreign keys
    Units
    SQL dialect
  Semantic issues relating to varying interpretations of data
Integrating heterogeneous databases
  After the fact
  Stability is never achieved
  Mappings are complex
  Data may have conflicts, redundancy, and gaps
  Closed world vs. open world
Engineering for nonstop change
  Mediators around databases
  Gateways connecting old apps and new databases
  Gateways connecting new apps and old databases
  A stability of instability
OLAP
  Standard model
    N dimension tables
    1 fact table (PK is union of keys of dimension tables)
  Hypercube visualization
  Multidimensional table result visualizations
  Star and constellation schemas
  Terminology
    Drilling down: stepping down nested attributes
    Rolling up: moving up nested attributes
    Pivot: group by
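A tiny star schema can be sketched with Python's built-in `sqlite3` module. The `store`, `product`, and `sales` tables and their contents are hypothetical; `region` and `city` play the role of nested attributes of the store dimension.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Two dimension tables and one fact table; all names are illustrative.
CREATE TABLE store (store_id INTEGER PRIMARY KEY, city TEXT, region TEXT);
CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE sales (
    store_id INTEGER REFERENCES store,
    product_id INTEGER REFERENCES product,
    amount INTEGER,
    PRIMARY KEY (store_id, product_id)  -- union of the dimension keys
);
INSERT INTO store VALUES (1, 'Denver', 'West'), (2, 'Boulder', 'West'), (3, 'Boston', 'East');
INSERT INTO product VALUES (10, 'pen'), (20, 'ink');
INSERT INTO sales VALUES (1, 10, 5), (1, 20, 7), (2, 10, 3), (3, 20, 4);
""")

# Rolled up: totals at the coarser level (region).
by_region = con.execute("""
    SELECT s.region, SUM(f.amount)
    FROM sales f JOIN store s USING (store_id)
    GROUP BY s.region ORDER BY s.region
""").fetchall()

# Drilled down: the same totals one level finer (city within region).
by_city = con.execute("""
    SELECT s.region, s.city, SUM(f.amount)
    FROM sales f JOIN store s USING (store_id)
    GROUP BY s.region, s.city ORDER BY s.region, s.city
""").fetchall()

print(by_region)
print(by_city)
```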
Specialized operators
  Cube operator and 4 equivalent queries
  Viewing results
    See page 722
    Equivalent: see page 723
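One way to read "4 equivalent queries": with two dimensions, a cube covers every subset of the dimensions, which is 2^2 = 4 separate group-by queries. The sketch below emulates that in plain Python; the fact rows, the `group_by` helper, and the `"ALL"` placeholder are assumptions for illustration, not the textbook's notation.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows: (region, product, amount).
facts = [
    ("West", "pen", 5),
    ("West", "ink", 7),
    ("East", "pen", 3),
    ("East", "ink", 4),
]
dims = ("region", "product")


def group_by(kept):
    """Sum amount over the dimensions in `kept`; the others collapse to 'ALL'."""
    totals = defaultdict(int)
    for region, product, amount in facts:
        row = {"region": region, "product": product}
        key = tuple(row[d] if d in kept else "ALL" for d in dims)
        totals[key] += amount
    return dict(totals)


# Every subset of the dimensions corresponds to one GROUP BY query.
groupings = [c for n in range(len(dims), -1, -1) for c in combinations(dims, n)]

# The cube is the union of the results of those queries.
cube = {}
for kept in groupings:
    cube.update(group_by(kept))

print(len(groupings), cube[("West", "ALL")], cube[("ALL", "pen")], cube[("ALL", "ALL")])
```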
Populating the warehouse
  Transformation
  Integration
  Cleaning
Data mining
  Effectively an open world application
  Association, classification, clustering (page 730)
  Association: confidence and support (page 731)
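Support and confidence can be computed directly from a list of transactions. This sketch uses the usual definitions (support is the fraction of transactions containing an itemset; confidence of X -> Y is support of X and Y together divided by support of X); the basket data is made up.

```python
# Hypothetical market-basket data: one set of items per transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(lhs, rhs):
    """Of the transactions containing `lhs`, the fraction that also contain `rhs`."""
    return support(lhs | rhs) / support(lhs)


s = support({"bread", "milk"})       # 2 of 4 transactions
c = confidence({"bread"}, {"milk"})  # 2 of the 3 transactions with bread

print(s, round(c, 3))
```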