
1 Semi-Structured Data Models By Chris Bennett

2 Semi-Structured Data
- What is it?
  - Data whose structure is not necessarily determined in advance (often implicit in the data)
  - Descriptive, not prescriptive
  - Self-describing and flexible in structure
- Where does it come from?
  - Data that cannot be (or simply is not) modeled naturally or usefully with a standard data model
  - Merging multiple data sources, sparse user annotations, rapidly evolving schemas specific to given communities
  - Raw data is often semi-structured
  - Frequently a product of a rapidly evolving schema
- Examples
  - HTML, XML, BibTeX, integrated data sources, etc.

3 Semi-Structured Data
- This is great: infinite flexibility! Is there a catch? Always a tradeoff…
- In this case, retrieval and query performance can suffer greatly compared to more structured data models

4 Semi-Structured Data
So we know what it is. How do we…
- Model it? Directed labeled graphs
- Query it? Many proposals, all of which include regular path expressions (Lorel, XML Query, …)
- Store it? A big challenge (Haystack Model)

5 Semi-Structured Data Models
- What do they do?
  - Provide a common framework
  - In effect, they add some structure
- Why?
  - Semi-structured data is often irregular or missing, similar concepts are represented using different types, heterogeneous sets are present, or object structure is not fully known
  - Standardize information exchange
  - Data verification (both internal and external)
- Examples
  - OEM, XML DTD, XML Schema…

6 OEM – Object Exchange Model
- Developed at Stanford (mid 90s)
- Precursor to today's accepted semi-structured data formats (XML)
- Each object is described by (label, type, value, object-ID)
- Main feature: self-describing
- Requires a good bit of human intervention, though
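To make the (label, type, value, object-ID) description concrete, here is a minimal Python sketch of such self-describing objects. The class name `OEMObject`, the `&`-prefixed IDs, the type name `"set"` for nested objects, and the sample data are illustrative assumptions, not taken from the slides.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class OEMObject:
    """One self-describing object: every value carries its own label and type."""
    oid: str                               # object-ID (hypothetical "&..." naming)
    label: str                             # what the object represents
    type: str                              # "string", "integer", ... or "set"
    value: Union[str, int, float, list]    # atomic value, or a list of sub-objects


def describe(obj, depth=0):
    """Return the object and its sub-objects as indented text lines."""
    pad = "  " * depth
    if obj.type == "set":
        lines = [f"{pad}<{obj.oid}, {obj.label}, set>"]
        for child in obj.value:
            lines.extend(describe(child, depth + 1))
        return lines
    return [f"{pad}<{obj.oid}, {obj.label}, {obj.type}, {obj.value!r}>"]


# Hypothetical sample data: a bibliography holding one book.
book = OEMObject("&b1", "book", "set", [
    OEMObject("&t1", "topic", "string", "databases"),
    OEMObject("&c1", "internal-call-no", "integer", 42),
])
biblio = OEMObject("&r1", "biblio", "set", [book])

print("\n".join(describe(biblio)))
```

Because each object carries its own label, no separate schema is needed to interpret the data, which is the sense in which the model is self-describing.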

7 Object-Oriented Model versus OEM
- OEM is an information exchange model (it does not specify object storage issues)
- OEM is much simpler (supports object nesting; omits classes, methods, inheritance)
- OEM uses labels in place of a schema

8 Advantages of OEM
- Simple model makes transforming and merging data simpler
- Advanced features can be "emulated" (implies human intervention)
- More suitable for heterogeneity
- Hindsight: without some structure, extreme heterogeneity mandates more than a little human intervention

9 Components of OEM
- Query language: OEM-QL, with the typical SELECT-FROM-WHERE form
- Translator: translates OEM-QL to a specific data source and back
- Mediator: collects the work of the translators, then merges and/or combines the results into OEM structures

10 OEM-QL
- SELECT – FROM – WHERE
- Adaptation of an SQL-like language for OO models
- General form: SELECT fetch-expression FROM object WHERE condition
- Expressions in the SELECT and WHERE clauses use the notion of a path, which describes a traversal through an object using sub-object structure and labels

11 OEM-QL
    SELECT biblio.?.topic
    FROM root
    WHERE biblio.?.internal-call-no
- ? denotes a match to any label
- Returns the topic of books for which an internal call number exists
- The question mark lets the user say that the intermediate "node" in the path through the object can be named anything
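The wildcard path in this query can be sketched in Python as a recursive walk over nested objects. This is only an illustration of the idea, not OEM-QL's actual evaluation strategy: the nested-dict representation, the function names `follow` and `topics_with_call_no`, and the sample data are assumptions, as is the reading that both `?` occurrences bind to the same intermediate object.

```python
def follow(obj, path):
    """Yield every value reached from obj by following the labels in path.

    Objects are dicts mapping a label to a list of sub-objects;
    the label "?" matches any single label.
    """
    if not path:
        yield obj
        return
    if not isinstance(obj, dict):
        return  # atomic values have no sub-objects to traverse
    head, rest = path[0], path[1:]
    for label, children in obj.items():
        if head == "?" or head == label:
            for child in children:
                yield from follow(child, rest)


# Hypothetical data: the intermediate label differs ("book" vs "article").
root = {
    "biblio": [{
        "book": [
            {"topic": ["databases"], "internal-call-no": [17]},
            {"topic": ["graphs"]},
        ],
        "article": [
            {"topic": ["XML"], "internal-call-no": [23]},
        ],
    }]
}


def topics_with_call_no(root):
    """Topics of entries under biblio.? that have an internal-call-no."""
    results = []
    for biblio in follow(root, ["biblio"]):
        for entry in follow(biblio, ["?"]):
            if any(True for _ in follow(entry, ["internal-call-no"])):
                results.extend(follow(entry, ["topic"]))
    return results


print(topics_with_call_no(root))
```

The entry with topic "graphs" is skipped because it has no internal call number, while both "book" and "article" entries are considered since `?` accepts either label.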

12 XML DTD – Document Type Definition
- Let there be (a little) more structure…
- DTDs define the legal building blocks of an XML document
- A DTD defines the document structure with a list of legal elements and/or attributes, and it can be declared inline or external to the XML document

13 XML DTD Example <!DOCTYPE note [ ]>

14 XML DTD Advantages
- An application can use a standard DTD to verify that data received from the outside world is valid
- It is flexible enough to let you nest elements and state how often each may occur:
  - + means at least one occurrence
  - * means zero or more occurrences
  - ? means zero or one occurrence
- Example:
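The slide's own example did not survive extraction. As a stand-in, the following Python sketch shows what the three occurrence indicators mean by translating a flat content model into a regular expression over the sequence of child element names. It is not a DTD validator: the helper names, the comma-separated model syntax without parentheses or choices, and the sample element names are all assumptions made for illustration.

```python
import re


def content_model_to_regex(model):
    """Translate a flat model such as "to+, cc*, subject?, body" into a regex.

    Each child name is matched with a trailing space so that one name can
    never be confused with the prefix of another.
    """
    parts = []
    for item in model.split(","):
        item = item.strip()
        suffix = item[-1] if item[-1] in "+*?" else ""   # occurrence indicator
        name = item.rstrip("+*?")
        parts.append(f"(?:{re.escape(name)} ){suffix}")
    return re.compile("".join(parts))


def children_match(model, child_names):
    """True if the ordered child names satisfy the content model."""
    sequence = "".join(name + " " for name in child_names)
    return content_model_to_regex(model).fullmatch(sequence) is not None


MODEL = "to+, cc*, subject?, body"   # hypothetical content model

print(children_match(MODEL, ["to", "to", "body"]))                   # to repeated
print(children_match(MODEL, ["cc", "body"]))                         # to missing
print(children_match(MODEL, ["to", "subject", "subject", "body"]))   # subject twice
```

A real DTD content model also allows grouping and alternatives, which this sketch deliberately leaves out.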

15 DTD Drawbacks
- What about constraints? DTDs offer little help in constraining the value of a particular attribute or element (they constrain only the use of markup)
- Automated processing of XML documents requires more rigorous and comprehensive facilities in this area
- Requirements cover constraints on how the component parts of an application fit together, the document structure, attributes, data typing, and so on

16 XML Schema
Well-formed is not enough! Let there be more structure!
- XML Schema is an XML-based alternative (and ultimate successor) to DTDs
- Schemas express shared vocabularies and allow machines to carry out rules made by people
- They provide a means for defining the structure, content and semantics of XML documents

17 Successor to DTDs
XML Schemas:
- Are extensible to future additions
- Are richer and more useful than DTDs
- Are written in XML
- Support data types
- Support namespaces

18 XML Schema Advantages
- Better validation, restriction, and type conversion
- Extensible: reuse and modify existing data types, reference multiple schemas

19 XML Schema Details
Defines…
- Elements that can appear in a document
- Attributes that can appear in a document
- Which elements are child elements
- Order of child elements
- Number of child elements
- Whether an element is empty or can include text
- Data types for elements and attributes
- Default and fixed values for elements and attributes

20 XML Schema Components
- Primary components: simple type definitions, complex type definitions, attribute declarations, and element declarations
- Secondary components, which must have names: attribute group definitions, identity-constraint definitions, model group definitions, and notation declarations
- "Helper" components, which provide small parts of other components and are not independent of their context: annotations, model groups, particles, wildcards, and attribute uses

21 XML Namespaces (W3C Documentation)
- A namespace is a collection of names, identified by a URI reference, which are used in XML documents as element types and attribute names
- XML namespaces differ from the "namespaces" conventionally used in computing disciplines in that the XML version has internal structure and is not, mathematically speaking, a set
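A short sketch of how a URI-identified namespace keeps two same-named elements apart, using Python's standard `xml.etree.ElementTree`. The document and the `http://example.org/...` URIs are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: two different "name" elements, told apart by namespace.
doc = """<order xmlns="http://example.org/orders"
       xmlns:c="http://example.org/customers">
  <c:name>Alice</c:name>
  <name>Widget</name>
</order>"""

root = ET.fromstring(doc)

# ElementTree exposes each element type as "{namespace-URI}local-name".
tags = [child.tag for child in root]
print(tags)

# Lookups take a prefix-to-URI mapping; the prefix itself is arbitrary.
customer = root.find("cust:name", {"cust": "http://example.org/customers"}).text
print(customer)
```

Note that the lookup uses the prefix `cust` while the document uses `c`: only the URI identifies the namespace.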

22 XML Schema Example
W3C XML Schema Primer (examples)

23 Querying Semi-Structured Data
- Keys:
  - Semi-structured data is modeled on directed graphs
  - The user cannot have full knowledge of the data structure, but we should exploit what structure we do know exists
- Examples
  - Lorel: developed at Stanford (1997) as part of the Lore (Lightweight Object Repository) project
  - XPath: a W3C standard; a language for addressing parts of an XML document

24 Lore System (Stanford Link)
- Successor to OEM
- Fully functional DBMS for XML with a declarative query language, multiple indexing techniques, a cost-based query optimizer, multi-user support, logging, and recovery
- Novel features include DataGuides, management of external data, and proximity search

25 Lore – Novel Features
- DataGuides
  - Structural summary of all paths in the database
  - Used by the query optimizer to exploit known structure
- Management of external data
- Proximity search
  - Ranks database objects based on their proximity to other objects
  - Measures proximity by distances in the graph linking the objects together
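One way to picture a structural summary of all paths is the sketch below, which groups the objects reachable by each label path so that every distinct path appears exactly once. It resembles the subset construction used for automata and is an assumption about how such a summary could be built; the slides do not describe Lore's actual algorithm. The edge-list graph format, the function names, and the sample data are hypothetical.

```python
def build_dataguide(graph, root):
    """Summarize a labeled graph so each distinct label path occurs once.

    graph maps an object ID to a list of (label, child ID) edges. Each
    summary node is the frozenset of objects reachable by one label path.
    """
    start = frozenset([root])
    guide = {}
    pending = [start]
    while pending:
        current = pending.pop()
        if current in guide:
            continue  # already summarized (also guards against cycles)
        targets = {}
        for oid in current:
            for label, child in graph.get(oid, []):
                targets.setdefault(label, set()).add(child)
        guide[current] = {label: frozenset(oids) for label, oids in targets.items()}
        pending.extend(guide[current].values())
    return start, guide


def label_paths(start, guide, max_depth):
    """List every label path of length <= max_depth, shortest first."""
    paths, frontier = [], [((), start)]
    for _ in range(max_depth):
        next_frontier = []
        for prefix, node in frontier:
            for label, target in sorted(guide[node].items()):
                path = prefix + (label,)
                paths.append(".".join(path))
                next_frontier.append((path, target))
        frontier = next_frontier
    return paths


# Hypothetical database: two restaurants with different structure.
graph = {
    "g": [("restaurant", "r1"), ("restaurant", "r2")],
    "r1": [("name", "n1"), ("address", "a1")],
    "r2": [("name", "n2"), ("zipcode", "z2")],
    "a1": [("zipcode", "z1")],
}

start, guide = build_dataguide(graph, "g")
paths = label_paths(start, guide, 3)
print(paths)
```

A query optimizer could consult such a summary to learn, for instance, that `restaurant.address.zipcode` exists somewhere in the data without scanning every object.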

26 Lorel – Lore Query Language
- Based on OQL
- Provides powerful path traversal operators
- Makes extensive use of type coercion to help yield "intuitive" results for all queries over XML data
- Permits a flexible form of declarative navigational access
- Particularly suited to cases where details of the structure are not known

27 Lorel – Coercion Rules
Columns: Value | Atomic Object | Set of Objects | Complex Object
Value: coerce | Dereference | Existential with == | False
Atomic Object: Existential with == | False
Set of Objects: Existential with == on both sides | False
Complex Object: Value =

28 Lorel Example
Find the names and zip codes of all "cheap" restaurants:
    select Guide.restaurant.name, Guide.restaurant.(.address)?.zipcode
    where Guide.restaurant.% grep "cheap"
- The ? after .address means the address is optional in the path expression
- The % will match any subobject of restaurant
- The comparison operator grep returns true if the string "cheap" appears anywhere in the subobject value
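The effect of the optional step and the `%` wildcard can be imitated in plain Python over nested dictionaries. This is a rough sketch of the query's meaning under simplifying assumptions: each label has a single value, results are returned as (name, zipcode) pairs, and the sample restaurants and helper names are invented for illustration.

```python
# Hypothetical data with irregular structure: the zipcode sits either
# under an address subobject or directly under the restaurant.
guide = {
    "restaurant": [
        {"name": "Saigon", "price": "cheap",
         "address": {"street": "Main St", "zipcode": "94301"}},
        {"name": "Chef Chu", "category": "cheap eats", "zipcode": "94022"},
        {"name": "Bistro", "price": "expensive",
         "address": {"zipcode": "94305"}},
    ]
}


def zipcodes(restaurant):
    """(.address)?.zipcode: reach zipcode with or without the address step."""
    found = []
    if "zipcode" in restaurant:
        found.append(restaurant["zipcode"])
    address = restaurant.get("address")
    if isinstance(address, dict) and "zipcode" in address:
        found.append(address["zipcode"])
    return found


def is_cheap(restaurant):
    """% grep "cheap": any direct subobject whose string value contains it."""
    return any(isinstance(v, str) and "cheap" in v for v in restaurant.values())


def cheap_restaurants(guide):
    rows = []
    for restaurant in guide["restaurant"]:
        if is_cheap(restaurant):
            for zipcode in zipcodes(restaurant):
                rows.append((restaurant["name"], zipcode))
    return rows


print(cheap_restaurants(guide))
```

The second restaurant matches even though "cheap" appears under a differently named label and its zipcode has no enclosing address, which is the kind of irregularity the path operators are meant to absorb.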

29 Lorel – Another example
    select X.name
    from John.name JN, John.child X, X.name XN
    where JN == XN
- "Retrieve the children of John bearing his name"
- == expects atomic values, so the operands are coerced
Rewritten:
    select X.name
    from John.child X
    where John.name == X.name

30 Lorel – Constructing Results
- Select-from-where in Lorel has the same semantics as in SQL: the result is a bag (multiset), or a set if 'distinct' is used
- The result is always a collection of OEM objects (duplicate elimination is by OID)
- For each assignment of the variables in the from clause that passes the condition of the where clause, a value is generated according to the expressions in the select clause
- Results can refer to database objects or to new objects created by coercion

31 Lorel – Data Updates
- Create and delete database names
  - Deletion is implicit when an object becomes unreachable
- Create a new atomic or complex object
- Modify the value of an existing atomic or complex object
- Bulk load an OEM database

32 Lorel – Updates cont'd…
- Assigning names to objects
    Name myFavorite := element (select Guide.Restaurant where Guide.Restaurant.name = "Saigon")
- Creating objects
    new_oem (int, 5)
    new_oem (complex, struct(a:{new_oem(int,5)}, b:{X,Y}))

33 XPath Features
- XPath operates on the abstract, logical structure of an XML document, rather than its surface syntax
- Provides basic facilities for manipulation of strings, numbers and booleans
- XPath uses a compact, non-XML syntax to facilitate use of XPath within URIs and XML attribute values

34 XPath – How It Works (W3C XPath Information)
- XPath models an XML document as a tree of nodes: root nodes, element nodes, text nodes, attribute nodes, namespace nodes, processing instruction nodes, comment nodes
- Evaluation occurs with respect to a "context", which consists of:
  - a node (the context node)
  - a pair of non-zero positive integers (the context position and the context size)
  - a set of variable bindings
  - a function library
  - the set of namespace declarations in scope for the expression

35 XPath – How It Works
- Location path: selects a set of nodes relative to the context node
- An expression that is a location path results in a node set
- Examples of location paths
- Includes functions for node sets, strings, numbers, etc.

36 XPath – Generic Example
Simple: employee[@secretary and @assistant]
Selects all the employee children of the context node that have both a secretary attribute and an assistant attribute
(W3C School Examples)
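The same selection can be tried with Python's standard `xml.etree.ElementTree`, with one caveat: that module implements only a subset of XPath and does not accept `and` inside a predicate, so the sketch below chains two predicates instead, which selects the same elements. The sample document is made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; <staff> plays the role of the context node.
doc = """<staff>
  <employee secretary="Ann" assistant="Bob">Carol</employee>
  <employee secretary="Dan">Eve</employee>
  <employee>Frank</employee>
</staff>"""

context = ET.fromstring(doc)

# Chained predicates: employee children having both attributes.
matches = context.findall("employee[@secretary][@assistant]")
names = [element.text for element in matches]
print(names)
```

A full XPath 1.0 engine would accept the slide's expression as written; the chained form is used here only because of the standard library's limited predicate support.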

