IS432 Semi-Structured Data


1 IS432 Semi-Structured Data
Lecture 1: SSD & XML Dr. Gamal Al-Shorbagy

2 In this lecture
What semi-structured data is: why we need it, how it is represented and processed, related technologies.
What XML is: XML syntax, the XML Query data model, comparison of XML with semi-structured data.
Papers: "XML, Java, and the future of the Web" by Jon Bosak, Sun Microsystems; "W3C XML Query Data Model" by Mary Fernandez and Jonathan Robie.

3 Unstructured
The data: a document/page for a common user.
Type of data? Difficult to identify. Is there any order? No particular format or sequence. Does it follow any rules? Can we predict anything about the data?
Management and representation: unmanageable by nature; often found as text, video, sound, images.
Query and search: brute force, like finding a needle in a haystack.

4 Structured
The data: a table for organizations.
Data follows a certain model, e.g. relational: entities with the same attributes, order, and relations.
Schema and data are separate: first the schema, then the data.
Data elements are strongly "typed" and "ordered".
Corporate ownership.
Management and representation: a specialized DBMS engine handles management, storage, and query formulation; data is represented as entity/tuples or class/objects.
Query and search: optimized via indexes, trees, …

5 group of tables

6 The data
name: Some Body
name: (first name: Ceaser, last name: The Great)
name: Ranjeet Singh, affiliation: Punjab

7 Semi-Structured
The data: a graph/web for advanced users.
Structure and data are mixed.
Schema-less and self-describing (prescription vs. description).
The schema may evolve over time and may be larger than the data itself.
Irregular, incomplete, evolving structure: entities may have different or missing attributes (example: Person).
Ownership is often shared among organizations.
Management and representation: data representation and exchange on the WWW; labeled directed graph representation.
Query and search: getting better …

8 Kinds of Data
Unstructured: [figure not recoverable]
Structured:
Title  Author FN  Author LN  Publisher  Page
A      D          Edd        IEEE       233
B      Ted        Hee        ACM        553
Semi-structured: [Figure: labeled directed graph rooted at Bib, with arcs labeled paper, book, references, author, title, year, http, publisher, page, firstname, lastname, first, last; object identifiers such as &96, &243, &206, &25; atomic values "Serge", "Abiteboul", "Victor", "Vianu", 1997, 122, 133]

9 Why is semi-structured data important?
Scenario: organization A publishes movie data on its web pages (HTML), generated from a DBMS. A second organization B wants some movie information but can access only the web data.
[Figure: DBMS, A, B, HTML]
Semi-structured data matters when we want to treat Web sources like a database but cannot constrain these sources with a schema.

10 Why is semi-structured data important?
Scenario: Electronic Data Interchange (ISO standard), the standard computer-to-computer interchange of strictly formatted messages.
Semi-structured data matters when we want a flexible format for data exchange between disparate systems/databases.

11 Semi-Structured Data (Pros & Cons)
Advantages: no need to update the schema continuously; easy to discover new data and load it; easy to integrate heterogeneous data; easy to query without knowing data types.
Disadvantages: loss of type information; harder storage, query optimization, and management.

12 Managing Semi-structured Data
How do we model it? (Directed labeled graphs.)
How do we query it? (Many proposals, all of which include regular path expressions.)
How do we optimize queries? (We are beginning to understand.)
How do we store the data? (Looking for patterns.)
Also: integrity constraints, views, updates, …

13 Semi Structured Data: OEM
Object Exchange Model (OEM): data in OEM is schema-less and self-describing. It can be thought of as a labeled directed graph whose nodes are objects, each consisting of a unique object identifier (for example, &7), a descriptive textual label (street), a type (string), and a value ("22 Deer Rd").
Objects are atomic or complex. An atomic object contains a value of a base type (e.g., integer or string) and has no outgoing edges in the diagram. All other objects are complex objects, whose type is a set of object identifiers.
Lore: an OEM-conforming data storage system. Lorel: the Lore query language.
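The OEM description above can be sketched in a few lines of Python. This is only an illustration, not Lore's actual storage format: the class name `OemObject` and the helper `children` are hypothetical, labels are placed on the edges (as in the graph drawings), and the object identifiers &6, &14, &15 are borrowed from the Lorel answer shown later in the lecture.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union

# Hypothetical model: an atomic object stores a base-type value,
# a complex object stores a list of (label, target oid) edges.
@dataclass
class OemObject:
    oid: str
    value: Union[str, int, List[Tuple[str, str]]]

    def is_atomic(self) -> bool:
        # Atomic objects have no outgoing edges.
        return not isinstance(self.value, list)


def children(db: Dict[str, OemObject], oid: str, label: str) -> List[OemObject]:
    """Follow every outgoing edge of `oid` that carries `label`."""
    obj = db[oid]
    if obj.is_atomic():
        return []
    return [db[target] for (lbl, target) in obj.value if lbl == label]


objects = [
    OemObject("&6", [("street", "&14"), ("type", "&15")]),  # complex object
    OemObject("&14", "18 Dale Rd"),                         # atomic string
    OemObject("&15", 1),                                    # atomic integer
]
db = {o.oid: o for o in objects}

streets = [o.value for o in children(db, "&6", "street")]
print(streets)
```

Because no schema is declared anywhere, another object could carry different labels or omit `street` entirely; `children` would simply return an empty list.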

14 Semi-structured data model example
[Figure: OEM graph. The root Bib (&o1) is a complex object with paper, paper, and book arcs; the objects &o12, &o24, &o29 are linked by references arcs and carry author, title, year, http, publisher, and page arcs; the leaves are atomic objects such as "Serge", "Abiteboul", "Victor", "Vianu", 1997, 122, 133.]
Object Exchange Model (OEM): nodes are objects; labels on the arcs are attribute names.

15 Querying Semi structured Data
Important features: the ability to navigate the data (regular path expressions), querying the attribute names (arc variables), creating new structures, and type coercion.
Languages: Lorel (Stanford), UnQL (U. Penn).

16 Lore and Lorel
Lore (Lightweight Object Repository): a DBMS with an external data manager.
Lorel (the Lore language) is designed to: return meaningful results even when some data is absent; operate uniformly over single-valued and set-valued data; accept data with different types; return heterogeneous objects; allow the object structure to be only partially known.
Example: find all properties with an annual rent.
SELECT DreamHome.PropertyForRent
FROM DreamHome.PropertyForRent.annualRent
Answer: PropertyForRent &6, street &14 "18 Dale Rd", type &15 1, annualRent & OverseenBy &4

17 Data Models Timeline
Network Data Models (1964)
Hierarchical Data Models (1968)
Relational Data Models (1970)
Object-oriented Data Models (~1985)
Object-relational Data Models (~1990)
Semi-structured Data Models (XML 1.0) (~1998)

18 XML
A W3C standard to complement HTML (version 2, 10/2000).
Origins: structured text, SGML.
Motivation: HTML describes presentation; XML describes content.

19 XML – An Embodiment of Semi structured Data
Meta-language: a de facto language to represent semi-structured data and to create new languages (WAP, VoiceXML, MathML).
Extensibility: create new elements; create new languages (WML, WAP).
Markup: text markup. Element = data + markup. Document = nested elements.
<note>
  <to>Rana </to>
  <from>Tunga </from>
  <heading>Hello </heading>
  <body>What’s up ! </body>
</note>

20 From HTML to XML HTML describes the presentation

21 HTML
<h1> Bibliography </h1>
<p> <i> Foundations of Databases </i> Abiteboul, Hull, Vianu <br> Addison Wesley, 1995
<p> <i> Data on the Web </i> Abiteboul, Buneman, Suciu <br> Morgan Kaufmann, 1999

22 XML
XML describes the content.
<bibliography>
  <book>
    <title> Foundations… </title>
    <author> Abiteboul </author>
    <author> Hull </author>
    <author> Vianu </author>
    <publisher> Addison Wesley </publisher>
    <year> 1995 </year>
  </book>
</bibliography>

23 XML vs. HTML
XML and HTML were designed with different goals: XML was designed to describe data and to focus on what data is; HTML was designed to display data and to focus on how data looks. It is important to understand that XML is not a replacement for HTML.

24 XML Data Model
Several competing models:
Document Object Model (DOM) (2/2001): a class hierarchy (node, element, attribute, …); objects have behavior; defines an API to inspect/modify the document.
XSL data model.
Infoset.
PSV (post schema validation).
XML Query data model (next).

25 Why XML
Portability: language neutrality, platform independence.
Program-data decoupling: logic and notation, data and metadata, information and structure, content and form.

26 Why XML
Data evolution: schema update not required.
Integration: a priori knowledge of the schema is not necessary; sharing between incompatible formats; interoperability without rebuilding the systems.
Report concrete examples.

27 How computers understand XML
Parsers: software to understand XML; a parser removes the markup and retrieves the data.
Document Object Model (DOM): models a document as a tree.
Simple API for XML (SAX): sequential access.

28 What XML is not
It may be a little hard to understand, but XML does not DO anything. XML was created to structure, store, and send information.
<note>
  <to>Rana </to>
  <from>Tunga </from>
  <heading>Hello </heading>
  <body>What’s up ! </body>
</note>
The note has a header, a message body, and sender and receiver information. Still, this XML document does not DO anything: it is just information wrapped in XML tags. Someone must write a piece of software to send, receive, or display it.

29 XML Terminology
tags: book, title, author, …
start tag: <book>, end tag: </book>
elements: <book>…</book>, <author>…</author>
elements are nested
empty element: <red></red>, abbreviated <red/>
an XML document has a single root element
well-formed XML document: one whose tags match

30 More XML: Attributes
<book price = "55" currency = "USD">
  <title> Foundations of Databases </title>
  <author> Abiteboul </author>
  <year> 1995 </year>
</book>
Attributes are an alternative way to represent data.

31 Parsers and Well-formed XML Documents
An XML parser processes an XML document: it reads the document, checks its syntax, reports errors (if any), and allows programmatic access to the document's contents.

32 Parsers and Well-formed XML Documents (cont.)
XML document syntax: a document is considered well formed if it is syntactically correct:
a single root element;
each element has a start tag and an end tag;
tags are properly nested;
attribute (discussed later) values are in quotes;
proper capitalization (XML is case sensitive).
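The well-formedness rules above can be tried out with any conforming parser. The sketch below uses Python's standard `xml.etree.ElementTree`, which raises `ParseError` on documents that are not well formed; the helper name `is_well_formed` and the sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


good = "<note><to>Rana</to><from>Tunga</from></note>"
bad_nesting = "<note><to>Rana</note></to>"   # tags not properly nested
bad_case = "<note><to>Rana</To></note>"      # <to> and </To> differ in case
two_roots = "<note/><note/>"                 # more than one root element
unquoted = "<book price=55/>"                # attribute value not in quotes

for sample in (good, bad_nesting, bad_case, two_roots, unquoted):
    print(is_well_formed(sample), sample)
```

Well-formedness is purely syntactic: the check says nothing about whether the document conforms to a DTD or schema.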

33 Parsers and Well-formed XML Documents (cont.)
XML parsers support:
Document Object Model (DOM): builds a tree structure containing the document data in memory.
Simple API for XML (SAX): generates events when tags, comments, etc. are encountered (events are notifications to the application).
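The contrast between the two parser styles can be sketched with Python's standard library, which ships both a DOM implementation (`xml.dom.minidom`) and a SAX one (`xml.sax`). The document string and the handler class `TagCollector` are illustrative only.

```python
import xml.sax
from xml.dom import minidom

DOC = "<note><to>Rana</to><from>Tunga</from><heading>Hello</heading></note>"

# DOM: the whole document is parsed into an in-memory tree first,
# then the application walks the tree.
dom = minidom.parseString(DOC)
dom_tags = [
    node.tagName
    for node in dom.documentElement.childNodes
    if node.nodeType == node.ELEMENT_NODE
]


# SAX: the parser calls back into the application as it reads,
# and no tree is kept.
class TagCollector(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))


handler = TagCollector()
xml.sax.parseString(DOC.encode("utf-8"), handler)

print(dom_tags)
print(handler.events)
```

DOM is convenient when the application needs random access to the document; SAX suits large documents that are read once from start to end.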

34 Parsing an XML Document with msxml
An XML document contains data; it does not contain formatting information.
Load the XML document into Internet Explorer 5.0; the document is parsed by msxml.
IE places plus (+) or minus (-) signs next to container elements.
A plus sign indicates that all child elements are hidden; clicking it expands the container element and displays its children.
A minus sign indicates that all child elements are visible; clicking it collapses the container element and hides its children.
An error is generated if the document is not well formed.

35 XML document shown in IE5.

36 Characters
Character set: the characters that may be represented in an XML document, e.g., the ASCII character set: letters of the English alphabet, digits (0-9), and punctuation characters such as !, - and ?

37 Character Set
XML documents may contain carriage returns, line feeds, and Unicode characters (Section 5.5.4). Unicode enables computers to process characters for several languages.

38 Characters vs. Markup
XML must differentiate between markup text and character data.
Markup text: enclosed in angle brackets (< and >), e.g., child elements.
Character data: text between a start tag and an end tag, e.g., Fig. 5.1, line 7: Welcome to XML!

39 White Space, Entity References and Built-in Entities
Whitespace characters: spaces, tabs, line feeds, and carriage returns. Whitespace is either significant (preserved by the application) or insignificant (not preserved by the application).
Normalization: runs of whitespace are collapsed into a single whitespace character; sometimes whitespace is removed entirely.
<markup>This    is   character    data</markup>
after normalization, becomes
<markup>This is character data</markup>
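As a rough illustration of the normalization step, the sketch below collapses every run of whitespace into a single space. The function name `normalize_whitespace` is made up, and the behaviour is an assumption about what an application might do: real XML processors apply more specific rules (for example, attribute-value normalization), and this version also strips leading and trailing whitespace.

```python
def normalize_whitespace(text: str) -> str:
    # str.split() with no argument splits on runs of spaces, tabs,
    # line feeds and carriage returns, and drops empty pieces.
    return " ".join(text.split())


raw = "This    is\tcharacter \r\n   data"
normalized = normalize_whitespace(raw)
print(normalized)
```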

40 White Space, Entity References and Built-in Entities (cont.)
XML-reserved characters: ampersand (&), left-angle bracket (<), right-angle bracket (>), apostrophe (') and double quote (").
Entity references allow XML-reserved characters to be used in character data. They begin with an ampersand (&) and end with a semicolon (;), which prevents the parser from misinterpreting character data as markup.

41 White Space, Entity References and Built-in Entities (cont.)
Built-in entities: ampersand (&amp;), left-angle bracket (&lt;), right-angle bracket (&gt;), apostrophe (&apos;), quotation mark (&quot;).
To mark up the characters "<>&" in element message:
<message>&lt;&gt;&amp;</message>
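A short sketch of the round trip between reserved characters and the built-in entities, using `escape` and `unescape` from Python's standard `xml.sax.saxutils`. By default `escape` handles only &, < and >; the quote characters need the optional mapping shown here.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, unescape

raw = "<>&"
escaped = escape(raw)                        # reserved characters -> entities
doc = "<message>" + escaped + "</message>"

# The parser turns the entity references back into plain characters.
parsed_text = ET.fromstring(doc).text

# Apostrophe and quotation mark are escaped only on request.
quoted = escape("'\"", {"'": "&apos;", '"': "&quot;"})

print(escaped, parsed_text, quoted, unescape(escaped))
```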

42 More XML: Oids and References
<person id="o555"> <name> Jane </name> </person>
<person id="o456"> <name> Mary </name> <children idref="o123 o555"/> </person>
<person id="o123" mother="o456"> <name> John </name> </person>
Oids and references in XML are just syntax.
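Because oids and references are "just syntax", resolving them is left to the application. The sketch below shows one way to do that with `xml.etree.ElementTree`; the wrapping `<people>` root element and the helper `name_of` are assumptions added so that the fragment parses as a single document.

```python
import xml.etree.ElementTree as ET

DOC = """<people>
  <person id="o555"><name>Jane</name></person>
  <person id="o456"><name>Mary</name><children idref="o123 o555"/></person>
  <person id="o123" mother="o456"><name>John</name></person>
</people>"""

root = ET.fromstring(DOC)

# Index every person by its id so that references can be followed.
by_id = {person.get("id"): person for person in root.iter("person")}


def name_of(oid: str) -> str:
    return by_id[oid].findtext("name")


mary = by_id["o456"]
# idref holds a whitespace-separated list of ids.
child_names = [name_of(ref) for ref in mary.find("children").get("idref").split()]
mother_name = name_of(by_id["o123"].get("mother"))

print(child_names, mother_name)
```

Without a DTD declaring these attributes as ID/IDREF, the parser treats them as ordinary strings, which is why the lookup table is built by hand here.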

43 More XML: CDATA Section
Syntax: <![CDATA[ .....any text here... ]]>
Example:
<example>
  <![CDATA[ some text here </notAtag> <>]]>
</example>
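To see what a CDATA section buys, the sketch below parses the same text twice with `xml.etree.ElementTree`: once wrapped in CDATA and once written with entity references. Both give the application identical character data. The surrounding whitespace of the slide example is dropped here to keep the comparison exact.

```python
import xml.etree.ElementTree as ET

with_cdata = "<example><![CDATA[ some text here </notAtag> <>]]></example>"
with_entities = "<example> some text here &lt;/notAtag&gt; &lt;&gt;</example>"

# Inside CDATA, < and > are plain characters, so </notAtag> is not a tag.
cdata_text = ET.fromstring(with_cdata).text
entity_text = ET.fromstring(with_entities).text

print(repr(cdata_text))
print(cdata_text == entity_text)
```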

44 Using a CDATA section

45 More XML: Entity References
Syntax: &entityname;
Example: <element> this is less than &lt; </element>
Some entities: &lt; (<), &gt; (>), &amp; (&), &apos; ('), &quot; ("), &#38; (Unicode char)

46 More XML: Processing Instructions
Syntax: <?target argument?>
Example:
<product>
  <name> Alarm Clock </name>
  <?ringBell 20?>
  <price> </price>
</product>
What do they mean?

47 More XML: Comments
Syntax: <!-- .... Comment text... -->
Yes, they are part of the data model!

48 XML Namespaces
http://www.w3.org/TR/REC-xml-names (1/99)
name ::= [prefix:]localpart
<book xmlns:isbn="…">
  <title> … </title>
  <number> 15 </number>
  <isbn:number> …. </isbn:number>
</book>

49 XML Namespaces
syntactic: <number>, <isbn:number>
semantic: provide a URL for the schema
<tag xmlns:mystyle = "…">
  <mystyle:title> … </mystyle:title>
  <mystyle:number> … </mystyle:number>
</tag>
The URL given in xmlns:mystyle is where the names are defined.

50 XML Namespaces
Naming collisions: two different elements have the same name.
<subject>Math</subject>
<subject>Thrombosis</subject>
Namespaces differentiate elements that have the same name.
<school:subject>Math</school:subject>
<medical:subject>Thrombosis</medical:subject>
school and medical are namespace prefixes: they are prepended to element and attribute names and tied to a uniform resource identifier (URI), a series of characters for differentiating names.

51 XML Namespaces
Creating namespaces: use the xmlns keyword.
xmlns:text = "urn:deitel:textInfo"
xmlns:image = "urn:deitel:imageInfo"
This creates two namespace prefixes, text and image: urn:deitel:textInfo is the URI for prefix text, and urn:deitel:imageInfo is the URI for prefix image.
Default namespaces: child elements of this namespace do not need a prefix.
xmlns = "urn:deitel:textInfo"

52 Element directory contains two namespace prefixes
<?xml version = "1.0"?>
<!-- Fig. 5.8 : namespace.xml -->
<!-- Namespaces -->
<directory xmlns:text = "urn:deitel:textInfo"
           xmlns:image = "urn:deitel:imageInfo">
   <text:file filename = "book.xml">
      <text:description>A book list</text:description>
   </text:file>
   <image:file filename = "funny.jpg">
      <image:description>A funny picture</image:description>
      <image:size width = "200" height = "100"/>
   </image:file>
</directory>
Prefix text is used to describe elements file and description. Prefix image is applied to describe elements file, description and size.

53 urn:deitel:textInfo is default namespace
<?xml version = "1.0"?>
<!-- Fig. 5.9 : defaultnamespace.xml -->
<!-- Using Default Namespaces -->
<directory xmlns = "urn:deitel:textInfo"
           xmlns:image = "urn:deitel:imageInfo">
   <file filename = "book.xml">
      <description>A book list</description>
   </file>
   <image:file filename = "funny.jpg">
      <image:description>A funny picture</image:description>
      <image:size width = "200" height = "100"/>
   </image:file>
</directory>
Element file is in the default namespace. The image prefix specifies the namespace explicitly.
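How a namespace-aware parser sees the default-namespace document can be sketched with `xml.etree.ElementTree`, which expands every element name to `{URI}localname`. The short prefixes `t` and `img` in the lookup map are arbitrary choices for this sketch; only the URIs have to match the document.

```python
import xml.etree.ElementTree as ET

DOC = """<directory xmlns="urn:deitel:textInfo"
           xmlns:image="urn:deitel:imageInfo">
   <file filename="book.xml">
      <description>A book list</description>
   </file>
   <image:file filename="funny.jpg">
      <image:description>A funny picture</image:description>
      <image:size width="200" height="100"/>
   </image:file>
</directory>"""

root = ET.fromstring(DOC)

# Both children have the local name "file" but different namespace URIs.
tags = [child.tag for child in root]

# Prefixes used in queries are local to this map, not to the document.
ns = {"t": "urn:deitel:textInfo", "img": "urn:deitel:imageInfo"}
text_description = root.find("t:file/t:description", ns).text
image_size = root.find("img:file/img:size", ns)

print(tags)
print(text_description, image_size.get("width"), image_size.get("height"))
```

Note that the unprefixed attributes (`filename`, `width`, `height`) are not placed in the default namespace; a default namespace applies to element names only.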

54 XML Stylesheet
Extensible Stylesheet Language (XSL): a language for document transformation, defined by the W3C.
Transformation: converting XML to another form.
Formatting objects: layout of an XML document.

55 XML Path
WHY: to access particular parts of an XML document; to navigate within an XML document.
WHAT: analogous to the SELECT statement in SQL.
HOW: it views an XML document as a tree. The root of the tree is a node that does not correspond to anything in the document. Internal nodes are elements. Leaves are attributes, text nodes, or comments.
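A few path expressions can be tried with `xml.etree.ElementTree`, which supports only a limited subset of XPath (it can select elements, but not attribute or text nodes directly). The sample document is assembled from the bibliography examples earlier in the lecture; the exact combination of books and attributes is an assumption made for this sketch.

```python
import xml.etree.ElementTree as ET

DOC = """<bibliography>
  <book price="55" currency="USD">
    <title>Foundations of Databases</title>
    <author>Abiteboul</author><author>Hull</author><author>Vianu</author>
    <year>1995</year>
  </book>
  <book>
    <title>Data on the Web</title>
    <author>Abiteboul</author><author>Buneman</author><author>Suciu</author>
    <year>1999</year>
  </book>
</bibliography>"""

root = ET.fromstring(DOC)

# Child steps: every title of every book under the root.
titles = [t.text for t in root.findall("./book/title")]

# Predicate on an attribute: books that carry a price attribute.
priced = [b.findtext("title") for b in root.findall("./book[@price]")]

# Predicate on a child's text: books published in 1999.
from_1999 = [b.findtext("title") for b in root.findall("./book[year='1999']")]

# Descendant step: all authors anywhere in the tree, de-duplicated.
authors = sorted({a.text for a in root.findall(".//author")})

print(titles, priced, from_1999, authors, sep="\n")
```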

56 Xml path

57 XML Query
WHY: we need to extract parts of XML documents (database); to transform documents into different forms, such as another XML form, HTML (to display in a Web browser), or others (e.g. bibtex); and to relate (join) parts of the same or different documents.
WHAT: XQuery can be used to extract information to use in a Web Service, generate summary reports, transform XML data to XHTML, and search Web documents for relevant information.
HOW: the XML-QL language; XQuery, the W3C standard, which is very powerful, fairly intuitive, and SQL-style.

58 XML Query Data Model
http://www.w3.org/TR/query-datamodel/ (2/2001)
Describes XML as a tree with specialized nodes. Uses a functional-style notation (think ML).

59 XML Query Data Model Node ::= DocNode | ElemNode | ValueNode | AttrNode | NSNode | PINode | CommentNode | InfoItemNode | RefNode

60 XML Query Data Model
Element node (simplified definition):
elemNode : (QNameValue, {AttrNode}, [ElemNode | ValueNode]) → ElemNode
QNameValue means "a tag name"; {...} means "set of ..."; [...] means "list of ...".

61 XML Query Data Model Reads: “give me a tag, a set of attributes, a list of elements/values, and I will return an element”

62 XML Query Data Model
Example:
book1 = elemNode(book, {price2, currency3}, [title4, author5, author6, author7, year8])
price2 = attrNode(…) /* next */
currency3 = attrNode(…)
title4 = elemNode(title, string9)
…
<book price = "55" currency = "USD">
  <title> Foundations … </title>
  <author> Abiteboul </author>
  <author> Hull </author>
  <author> Vianu </author>
  <year> 1995 </year>
</book>

63 XML Query Data Model
Attribute node:
attrNode : (QNameValue, ValueNode) → AttrNode

64 XML Query Data Model
Example:
price2 = attrNode(price, string10)
string10 = valueNode(…) /* next */
currency3 = attrNode(currency, string11)
string11 = valueNode(…)
<book price = "55" currency = "USD">
  <title> Foundations … </title>
  <author> Abiteboul </author>
  <author> Hull </author>
  <author> Vianu </author>
  <year> 1995 </year>
</book>

65 XML Query Data Model
Value node:
ValueNode = StringValue | BoolValue | FloatValue …
stringValue : string → StringValue
boolValue : boolean → BoolValue
floatValue : float → FloatValue

66 XML Query Data Model
Example:
price2 = attrNode(price, string10)
string10 = valueNode(stringValue("55"))
currency3 = attrNode(currency, string11)
string11 = valueNode(stringValue("USD"))
title4 = elemNode(title, string9)
string9 = valueNode(stringValue("Foundations…"))
<book price = "55" currency = "USD">
  <title> Foundations … </title>
  <author> Abiteboul </author>
  <author> Hull </author>
  <author> Vianu </author>
  <year> 1995 </year>
</book>
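The constructor-style notation of the data model maps naturally onto small immutable records. The sketch below is an informal Python rendering, not the W3C definition: the class and function names mirror the slides, `stringValue` is folded into `value_node` for brevity, and every element is built with the full three-argument form (name, attribute set, child list).

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Union


@dataclass(frozen=True)
class ValueNode:
    value: Union[str, bool, float]


@dataclass(frozen=True)
class AttrNode:
    name: str
    value: ValueNode


@dataclass
class ElemNode:
    name: str
    # {AttrNode}: a set, so attribute order carries no meaning.
    attributes: FrozenSet[AttrNode] = frozenset()
    # [ElemNode | ValueNode]: a list, so child order is preserved.
    children: List[Union["ElemNode", ValueNode]] = field(default_factory=list)


def value_node(v):
    return ValueNode(v)


def attr_node(name, value):
    return AttrNode(name, value)


def elem_node(name, attrs, children):
    return ElemNode(name, frozenset(attrs), list(children))


price2 = attr_node("price", value_node("55"))
currency3 = attr_node("currency", value_node("USD"))
title4 = elem_node("title", [], [value_node("Foundations…")])
author5 = elem_node("author", [], [value_node("Abiteboul")])
author6 = elem_node("author", [], [value_node("Hull")])
author7 = elem_node("author", [], [value_node("Vianu")])
year8 = elem_node("year", [], [value_node("1995")])

book1 = elem_node("book", {price2, currency3},
                  [title4, author5, author6, author7, year8])

print(book1.name, sorted(a.name for a in book1.attributes))
print([child.name for child in book1.children])
```

The set-versus-list distinction is the point of the model: swapping `price` and `currency` yields the same element, while swapping two `author` children does not.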

67 Semi-structured Data vs. XML
Both are best described by a graph. Both are schema-less and self-describing.
Attributes ---> tags
Objects ---> elements
Atomic values ---> CDATA (characters)
Order? Assumed in XML.
XML attributes (fixable).
References in XML.

68 Similarities and Differences
<person id="o123"> <name> Alan </name> <age> 42 </age> < > </ > </person>
{ person: &o123 { name: "Alan", age: 42, } }
<person father="o123"> … </person>
{ person: { father: &o123 … } }
[Figure: tree with nodes person, name, age, Alan, 42, and a father edge]
Similar on trees, different on graphs.

69 More Differences
XML is ordered; ssd is not.
XML can mix text and elements:
<talk> Making Java easier to type and easier to type <speaker> Phil Wadler </speaker> </talk>
XML has lots of other stuff: entities, processing instructions, comments.
Very important: these differences make XML data management harder.
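Mixed content is one reason XML does not map cleanly onto a label/value graph. The sketch below parses the talk example with `xml.etree.ElementTree`, which has to split the character data into the parent's `text` (before the first child) and each child's `tail` (after that child).

```python
import xml.etree.ElementTree as ET

doc = ("<talk> Making Java easier to type and easier to type "
       "<speaker> Phil Wadler </speaker> </talk>")

talk = ET.fromstring(doc)
speaker = talk.find("speaker")

leading_text = talk.text     # text before the <speaker> child
speaker_text = speaker.text  # text inside <speaker>
trailing_text = speaker.tail # text after </speaker>, still inside <talk>

print(repr(leading_text), repr(speaker_text), repr(trailing_text))
```

A pure semi-structured model would have no obvious place for `leading_text` and `trailing_text`, since a node there is either atomic or a set of labeled children, never both.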

