UMR Web Data Management Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla

copy-right@sanjaymadria, UMR Web Data Management Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla madrias@umr.edu

copy-right@sanjaymadria, UMR WWW Huge, widely distributed, heterogeneous collection of semistructured multimedia documents in the form of web pages connected via hyperlinks.

copy-right@sanjaymadria, UMR World Wide Web Web is fast growing More business organizations putting information in the Web Business on the highway Myriad of raw data to be processed for information

copy-right@sanjaymadria, UMR As WWW grows, more chaotic it becomes Web is fast growing, distributed, non- administered global information resource WWW allows access to text, image, video, sound and graphic data More business organizations creating web servers More chaotic environment to locate information of interest Lost in hyperspace syndrome

copy-right@sanjaymadria, UMR Characteristics of WWW WWW is a set of directed graphs Data in the WWW has a heterogeneous nature, self-describing and schema less Unstructured information, deeply nested No central authority to manage information Dynamic verses static information Web information discoveries - search engines

copy-right@sanjaymadria, UMR Web is Growing! In 1994, WWW grew by 1758 % !! June 1993 - 130 June 1994 - 1265 Dec. 1994 - 11,576 April 1995 - 15,768 July 1995 - 23,000+ 2000 - !!!!!

copy-right@sanjaymadria, UMR ‘COM’ domains are increasing! As of July 1995, 6.64 million host computers on the Internet: –1.74 million are ‘com’ domains –1.41 million are ‘edu’ domains –0.30 million are ‘net’ –0.27 million are ‘gov’ –0.22 million are ‘mil’ –0.20 million are ‘org’

copy-right@sanjaymadria, UMR Top web countries 1. Canada (1) 80% 9. New Zealand(7)101 2. US (4) 140% 10. Sweden (9) 101% 3. Ireland (3) 110% 11. Israel (12) 112% 4. Iceland (2) 68% 12. Cyprus (8) 72% 5. UK (14) 336 % 13. Hong Kong (15)148% 6. Malta (5) 155% 14. Norway (10) 64% 7. Australia (6) 133% 15. Switzerland (13) 75% 8. Singapore (11) 207% 16. Denmark (16) 105%

copy-right@sanjaymadria, UMR How users find web sites Indexes and search engines 75 UseNet newsgroups 44 Cool lists 27 New lists 24 Listservers 23 Print ads 21 Word-of-mouth and e-mail 17 Linked web advertisement 4

copy-right@sanjaymadria, UMR Limitations of Search Engines Do not exploit hyperlinks Search is limited to string matching Queries are evaluated on archived data rather than up-to-date data; no indexing on current data Low accuracy Replicated results No further manipulation possible

copy-right@sanjaymadria, UMR Limitations of Search Engines ERROR 404! No efficient document management Query results cannot be further manipulated No efficient means for knowledge discovery

copy-right@sanjaymadria, UMR More PROBLEMS specifying/understanding what information is wanted the high degree of variability of accessible information the variability in conceptual vocabulary or “ontology” used to describe information complexity of querying unstructured data

copy-right@sanjaymadria, UMR complexity of querying structured data uncontrolled nature of web-based information content determining which information sources to search/query

copy-right@sanjaymadria, UMR Search Engine Capabilities –Selection of language –Keywords with disjunction, adjacency, presence, absence,... –Word stemming (Hotbot)Hotbot –Similarity search (Excite)Excite –Natural language (LycosPro)LycosPro –Restrict by modification date (Hotbot) or range of dates (AltaVista)HotbotAltaVista –Restrict result types (e.g., must include images) (Hotbot)Hotbot –Restrict by geographical source (content or domain) (Hotbot)Hotbot –Restrict within various structured regions of a document (titles or URLs) (LycosPro); (summary, first heading, title, URL) (Opentext)LycosProOpentext

copy-right@sanjaymadria, UMR SEARCH & RETRIEVAL Search Engines Search engine% web covered Hotbot34 AltaVista28 Northern Light20 Excite14 Infoseek10 Lycos3 + using several search engines is better than using only one +Source: Lawrence, S., and Giles, C.L., “Searching the World Wide Web,” Science 280, pp. 98-100, 1998.

copy-right@sanjaymadria, UMR Key Objectives Design a suitable data model to represent web information Development of web algebra and query language, query optimization Maintenance of Web data - view maintenance Development of knowledge discovery and web mining tools Web warehouse Data integration, secondary storages, indexes

copy-right@sanjaymadria, UMR Web Data Representation HTML - Hypertext Markup Language –fixed grammar, no regular expressions –Simple representation of data –good for simple data –difficult to extract information SGML - Standard Generalized Markup Language - good for publishing deeply structured document XML - Extended Markup Language -a subset of SGML

copy-right@sanjaymadria, UMR Terminology HTML - Hypertext Mark-up Language HTTP - Hypertext Transmission Protocol URL - Uniform Resource Locator example - := :// / /filen ame>[ ] where – is http, ftp, gopher –host is internet address … –#location is a textual label in the file.

copy-right@sanjaymadria, UMR Links are specified as Anhor Text “destination URL is the URL of the destination document and Anchor Text is the text that appears as an anchor when displayed. Example: Nanyang Technological University http://www.ntu.edu.sg/ Absolute and relative URL New York is relative NCSA's Beginner's Guide to HTML absolute address

copy-right@sanjaymadria, UMR World Wide Web Prevalent, persistent and informative HTML documents (soon, XML) created by humans or applications. Can database technology help? Persistent HTML documents!!! Accessed day in and day out by humans and applications.

copy-right@sanjaymadria, UMR Current Research Projects Web Query System –W3QS, WebSQL, AKIRA, NetQL, RAW, WebLog, Araneus Semistructured Data Management –LOREL, UnQL, WebOQL, Florid Website Management System –STRUDEL, Araneus Web Warehouse –WHOWEDA

copy-right@sanjaymadria, UMR Main Tasks Modeling and Querying the Web –view web as directed graph – content and link based queries –example - find the page that contain the word “clinton” which has a link from a page containing word “monica”.

copy-right@sanjaymadria, UMR Information Extraction and integration –wrapper - program to extract a structured representation of the data; a set of tuples from HTML pages. –Mediator - integration of data-softwares that access multiple source from a uniform interface Web Site Construction and Restructuring –creating sites –modeling the structure of web sites –restructuring data

copy-right@sanjaymadria, UMR MEDIATOR ARCHITECTURE User Interface Mediator (Query/Search/ Retrieval/Result) Wrapper...

copy-right@sanjaymadria, UMR What to Model Structure of Web sites Internal structure of web pages Contents of web sites in finer granularities

copy-right@sanjaymadria, UMR Data Representation of Web Data Graph Data Models Semistructured Data Models (also graph based)

copy-right@sanjaymadria, UMR Graph Data Model Labeled graph data model where node represents web pages and arcs represent links between pages. Labels on arcs can be viewed as attribute names. Regular path expression queries

copy-right@sanjaymadria, UMR Semistructured Data Models Irregular data structure, no fixed schema known and may be implicit in the data Schema may be large and may change frequently Schema is descriptive rather than perspective; describes the current state of data, but violations of schema is still tolerated

copy-right@sanjaymadria, UMR Data is not strongly typed; for different objects the values of the same attributes may be of differing types. (heterogenious sources) No restriction on the set of arcs that emanate from a given node in a graph or on the types of the values of attributes Ability to query the schemas; acr variables which get bound to labels on arcs, rather than nodes in the graph

copy-right@sanjaymadria, UMR Graph based Query Languages Use graph to model databases Support regular path expressions and graph construction in queries. Examples Graph Log for hypertext queries graph query language for OO

copy-right@sanjaymadria, UMR Query Languages for Semi- Structured data Use labeled graphs Query the schema of data Ability to accommodate irregularities in the data, such as missing links etc. Examples : Lorel (Stanford), UnQL (AT&T), STRUQL (AT&T)

copy-right@sanjaymadria, UMR Comparison of Query Systems

copy-right@sanjaymadria, UMR Types of Query Languages First Generation Second generation

copy-right@sanjaymadria, UMR First Generation Query Languages Combine the content-based queries of search engines with structure-based queries Combine conditions on text pattern in documents with graph pattern describing link structures Examples - W3QL (TECHNION, Israel) WebSQL (Toronto), WebLOG (Concordia)

copy-right@sanjaymadria, UMR Second generation languages Called web data manipulation languages Web pages as atomic objects with properties that they contain or do not contain certain text patterns and they point to other objects Useful for data wrapping, transformation, and restructuring Useful for web site transformation and restructuring

copy-right@sanjaymadria, UMR How they Differ? Provide access to the structure of web objects they manipulate - return structure Model internal structures of web documents as well as the external links that connect them Support references to model hyperlinks and some support to ordered collections of records for more natural data representation Ability to create new complex structures as a result of a query

copy-right@sanjaymadria, UMR Examples Web OQL STRUQL Florid

copy-right@sanjaymadria, UMR W3QS (WWW Query System) at Technion - Israel Content queries Structural Queries Interfacing with user written programs and standard UNIX utilities Uses existing WWW indexes and search Services Provides view update facility

copy-right@sanjaymadria, UMR W3QS Accessible via any WWW browsers API can be used by programs running anywhere in the Internet Support queries on the web structure by specifying starting page, a search domain and depth of links. File content analysis tools and filling up of forms automatically

copy-right@sanjaymadria, UMR File Types Strict Inner Structure files such as Unix environment files - Semantics of the data is clearly linked to the syntax Semi-structured files - text files containing formatting codes such as Latex or HTML files- possible to use formatting codes to analyze their semantic content Raw Files - no relation between meaning of file and its inner structure

copy-right@sanjaymadria, UMR Content Queries Queries based on the content of a single node of hypertext SQLCOND is used to evaluate boolean expressions Example - node-format = Latex and Node.author =“Sanjay”

copy-right@sanjaymadria, UMR Structure Queries Information conveyed in the hypertext organization itself is conveyed. The result is a set of nodes and links from the hypertext structure that satisfy a given graph pattern; graph with nodes and edges are annotated with conditions. Components are pattern definition, search engines and form completion

copy-right@sanjaymadria, UMR Node1.title =“Good article” Node2.author = “Sanjay” Link1. rev=“doc” Structure Query Answer URL http://../myarticles.html URL http:///.tex Good articles \author {sanjay} A HREF=//..rev=“doc”>

copy-right@sanjaymadria, UMR Search for an article Select cp n2/* result from n1, l2, n2 where n1 in importantindexs.url Fill n1.form as IN importantindexes.fil with Keyword = “sanjay” SQLCOND (n2.format =Laytex) and (n2.author=“sanjay”)

copy-right@sanjaymadria, UMR Query to search hypertext pattern Return all the articles cited in the first chapter of the book. Each chapter includes several pointers to the bibliography, for example [Relativity] means link [Relativity] leads to the label ref2 in the references.html file. In the references.html file the labeled link looks like [relativity, sanjay] this link points to relative.tex

copy-right@sanjaymadria, UMR Select cp art/* result from Ind, l1,chap,l2,ref,l3 art where SQLCOND (ind.url = “http://) And (chap.url =/.chapter- 1.html/) AND l2.HREF = /.\#13.Name/) USING BFS.

copy-right@sanjaymadria, UMR Url http://cs.tech/bookindex.html INDEX Chapter 1 Chapter 2 … References Url http://…/Chapter-1.html ref 1 ref 2 ref 3 l1 ref 1 ref 2 ref 3 article http://relati ve.tex l3

copy-right@sanjaymadria, UMR WebSQL-University of Toronto Model web as relational database Use two relations Document and Anchor Document relation has one tuple for each document in the web and the anchor relation has one tuple for each anchor in each document

copy-right@sanjaymadria, UMR WebSQL SQL-like query language for extracting information from the web. Capable of systematic processing of either all the links in a page, all the pages that can be reached from a given URL through paths that match a pattern, or a combination of both. Provides transparent access to index servers

copy-right@sanjaymadria, UMR Document

copy-right@sanjaymadria, UMR Anchor

copy-right@sanjaymadria, UMR Give documents’s URLs which contain same title and keyword(s) Select d1.url, d2.url from document d1 such that d1 MENTIONS “keyword1” and document d2 such that d2 MENTIONS “keyword1” where d1.title = d2.title and NOT (d1.url = d2.url)

copy-right@sanjaymadria, UMR Find Labels of all Hyperlinks to Postscript Files SELECT a.label FROM Anchor a SUCH THAT base = "http://www.SomeDoc.html" WHERE a.href CONTAINS ".ps.Z";

copy-right@sanjaymadria, UMR Documents about Databases SELECT Document d.url, d.title FROM d SUCH THAT "http://www.OtherDoc.html" ->|=> d WHERE d.title CONTAINS "databases"; N ote : -> path of length one within same server => path of length of one but different server

copy-right@sanjaymadria, UMR Retrieve all the documents in the same server that are pointed to from the document Whose URL is given Select d.url, d.title from Document d SUCH THAT “http//www. Cs.in -> d

copy-right@sanjaymadria, UMR Find all broken links in a page SELECT a.href FROM Anchor a SUCH THAT base = "http://the.document.to.test" WHERE protocol(a.href) = "http" AND doc(a.href) = null;

copy-right@sanjaymadria, UMR Web OQL (University of Toronto) Provides a framework that supports a large class of data Restructuring operations. Simple semistructured data model for documents and record-based data OQL-like syntax and regular expressions Serves as a two-way bridge between databases and the Web.

copy-right@sanjaymadria, UMR DATA MODEL Hypertrees are Ordered arc labeled trees with two types of arcs ; internal and external Internal arcs represent structured objects External arcs to represent refrences (huperlinks) among objects. Records as labels in the arcs Sets of related hypertrees as Web

copy-right@sanjaymadria, UMR ARCHITECTURE Wrappers map all data sources to trees The mapping can be done all at once or on demand

copy-right@sanjaymadria, UMR Example Extract from cspapers (paper database) title and URL of the full version of papers of “Smith” select [y.title,y’.URL] from x in cs papers, y in x’ where y.authors ~”smith”

copy-right@sanjaymadria, UMR Web Creation Create a new page for each research Group (using the group name as URL). Each page contains the publications of the corresponding group. Select x’ as x.group from x in cspapers Select q 1 as s 1, q 2 as s 2,...q m as s m where q’s are queries and each S’s is either a string query or keyword schema. “as” clause create a URL’s s 1,..s m assigned to each new page resulting from each query.

copy-right@sanjaymadria, UMR ARANEUS Data Model called ADM for Web Documents - nested web objects, page schemas Several languages for wrapping, querying, creating and updating web sites - object algebra Methods and Techniques for Web Site Design and Implementation Presentation in SIGMOD’99 Software is available at their home site

copy-right@sanjaymadria, UMR Wrappers - map logical access to attribute values in a page at the ADM level tp physical access to text in the HTML source using EDITOR ULIXES - SQL-like query languages PENELOPE - manipulation language Site integration, semantic heterogeneities Materialized views http://poincare.dia.uniroma3.it:8080/Araneous

copy-right@sanjaymadria, UMR Lore - motivation The data may be irregular and thus not conform to a rigid schema. Relational data model has null values, and OO models have inheritance and complex objects. Both have difficulties in designing schemas to incorporate irregular data. It may be difficult to decide in advance on a single, correct schema, The structure of the data may evolve rapidly, data elements may change types, or data not conforming to previous structure may be added.

copy-right@sanjaymadria, UMR Thus, there is a need for management of semi-structured data! Lore system manages semi-structured data. The data managed by Lore is not confined to a schema and it may be irregular or incomplete. OEM is the Lore’s data model. OEM - object Exchange Model - graph based self-describing object instance model where nodes are objects and edges are labeled with attribute names and leaf nodes have atomic values Lore is light weight object repository and Lorel is Lore’s query language.

copy-right@sanjaymadria, UMR Object Exchange Model - OEM Motivation - information exchange and extraction Why a new data model? … it not a new model. Each value exchanged is given an explicit label. Object  temp-in-Fahrenheit, integer, 80  - “temp- in-Fahrenheit” is the label. Each object is self- describing, with a label, type and value.  set-of-temps, set, {cmpnt1, cmpnt2}  cmpnt1 is  temp-in-Fahrenheit, integer, 80  cmpnt2 is  temp-in-Celsius, integer, 20 

copy-right@sanjaymadria, UMR Labels Plays two roles – identifying an object (component) –identifying the meaning of an object (component) Person-name both identifies cmpnt1 and coveys its meaning.  person-record, set, {cmpnt1, cmpnt2, cmpnt3}  cmpnt1 is  person-name, string, ``Fred’’  cmpnt2 is  office-num-in-bldg-5, integer, 333  cmpnt3 is  department, string, ``toy’’  In relational data this corresponds to ….

copy-right@sanjaymadria, UMR Labels - Issues What does the label mean? –Database of labels –Ontology of labels - within each source Labels are relative (more specific) to the source of the data object. Similar labels from different sources need to be resolved. Labels provide the flexibility in representing object structure

copy-right@sanjaymadria, UMR Self-describing data models Have been in existence for a long time? Why additional interest now? Use the ``nature’’ of self-describing data model for information exchange, and to extend the model to include object nesting. To provide an appropriate object request language (query facility)

copy-right@sanjaymadria, UMR OEM - Specification Each object in OEM has the following structure: –Label: A variable character string describing what the object represents. –Type: The data type of the object’s value. Each is either an atom type, or type set. –Value: A variable-length value of the object. –Object-ID: A unique variable-length identifier for the object or null. LabelTypeValueObject-ID

copy-right@sanjaymadria, UMR OEM - Summary OEM is an information exchange model. It does not specify how objects are stored at source. OEM does specify how objects are received at a client, but after objects are received they can be stored in any way the client likes. Note the schema-less nature of OEM is particularly useful when a client does not know in advance the labels or structure of OEM objects. Each source has a distinguished object with lexical identifier ``root’’.

copy-right@sanjaymadria, UMR Example doc1 is auths1 is auth11 is topic1 is call-no1 is doc2 is auths2 is auth21 is auth22 is auth23 is topic2 is call-no1 is docn is authsn is topic1 is call-no1 is biblio is the root object.

copy-right@sanjaymadria, UMR OEM - QL SELECT Fetch-expression FROM Object WHERE Condition The result of this query is itself an object, with special label ``answer’’:  answer, set, {obj1, obj2, …, objn}  Each returned obji is a component of object specified in the From clause of the query, where the component is located by the Fetch- expression and satisfies the Condition.

copy-right@sanjaymadria, UMR Path The notion of path is used in both Fetch- Expression in the Select clause and the condition in the Where clause. Path describes traversals through an object using subobject structure and labels. Example: ``biblio.doc.auth’’ Paths are used in Fetch-Expression to specify which components are are returned in the answer object. Paths are used in the condition to qualify the fetched objects or other (related) components in the same object structure.

copy-right@sanjaymadria, UMR Queries - Simple Retrieve the topic of each document for which `Ùllman’’ is one of the authors: SELECT biblio.doc.topic FROM root WHERE biblio.doc.auth-set.auth-ln = `Ùllman’’ Intuitively, the query’s where clause finds all paths through subobject structure with the sequence of labels [biblio,doc,auth-set,auth-ln] such that the object at the end of the path has value `Ùllman.’’ obj1 is obj2 is

copy-right@sanjaymadria, UMR Queries - ``wild-cards’’ Retrieve all documents with internal call number: SELECT biblio.?.topic FROM root WHERE biblio.?.internal-call-no ``?’’ label matches any label. For this query, the doc labels can be replaced by any other strings and query would produce the same result. By convention, two occurrences of ? In the same query must match the same label unless variables are used. obj1 is

copy-right@sanjaymadria, UMR Queries - ``wild-paths’’ Retrieve all documents with internal call number: SELECT *.topic FROM root WHERE *.internal-call-no Symbol ``*’’ matches any path of length one or more. The use of * followed by a single label is a convenient and common way to locate objects with a certain label in complex structure. Similar to ?, two occurrences of * in the same query must match the same sequence of labels, unless variables are used. obj1 is

copy-right@sanjaymadria, UMR Queries - variables Retrieve each document for which both ``Hopcroft’’ and `Àho’’ are co-authors: SELECT biblio.doc FROM root WHERE biblio.doc.auth-set.auth-ln(a1)=`Àho’’ and biblio.doc.auth-set.auth-ln(a1)=``Hopcroft’’ Here, the query finds all the paths with structure [biblio, doc, auth-set], and with two distinct path completions with label auth with values `Àho’’ and ``Hopcroft’’ obj1 is the complete doc2

copy-right@sanjaymadria, UMR DBGroup &1 &2&3&4&5&6 &11&8 &9 &10&12&13&14&7&15&16 &17&18&19&20 Member Office Age NameProject Name Age Office RoomBuilding Room “Clark”“Smith ” 46“Gates 252” “Lore”“Tsimmis ” “Jones”28 An OEM Database “CIS”“411”“CIS”252

copy-right@sanjaymadria, UMR Lorel Queries - Simple Path Expression Retrieve the offices of members with age greater than 30 years: Query SELECT DBGroup.Member.Office WHERE DBGroup.Member.Age > 30 ResultOffice “Gates 252” Office Building “CIS” Room “411”

copy-right@sanjaymadria, UMR Queries - General Path Expression Query SELECT DBGroup.Member.Name WHERE DBGroup.Member.Office(.Room%|.Cubicle)? Like “%252” ResultName “Jones” Name “Smith” Room% matches all labels starting from Room, like Room68. “|” stands for disjunction. “?” indicates that the label pattern is optional. “like %252” specifies that the data value should end with string “252”.

copy-right@sanjaymadria, UMR Queries - SubQueries Retrieve Lore project members who work on other projects Query SELECT M.Name, ( SELECT M.Project.Title WHERE M.Project.Title != “Lore”) FROM DBGroup.Member M WHERE M.Project.Title = “Lore” ResultMember Name “Jones” Title “Tsimmis”

copy-right@sanjaymadria, UMR Lore - Summary Lore does facilitate query and updates on semi- structural databases There has been more work done on optimization using: data guides (vldb97). The system is up and running: http://WWW- DB.Stanford.EDU/lore/demo/ How is this related to WWW? XML-QL and related work provides the answer.

copy-right@sanjaymadria, UMR Extraction and Integration OEM and subsequent LORE(L) can be used for extracting information from multiple information sources. OEM helps navigate through unknown objects by SELECT ? FROM root Thus help browsing and schema discovery Efficient implementations are possible using partial fetch mechanism. Push and Pull information delivery systems are possible. How is this different from WebIR?

copy-right@sanjaymadria, UMR STRUDEL Web Site Management System web Site from multiple sources STruQL - based on OEM, graphs, regular expressions, result as graph Example - return all the postscript papers from homepages: Where homepages(p), p “paper” q ispostscript(q) collect postscriptpages(p) Where C1,...Ck Create N1,...Nn link L1,...Lp, Collect G1, …Gq

copy-right@sanjaymadria, UMR Supported by Strudel: a Website Management System with StruQL as query language where Biblio(X), X -> “paper” -> P, P -> “author” ->A, P -> “title” -> T, P -> “year” -> Y create Root(), HomePage(A), YearPage(A,Y), PubPage(P) linkRoot() -> “person” -> HomePage (A), HomePage(A) ->”yearentry” -> YearPage(A,Y), YearPage(A,Y) -> “publication” -> PubPage(P), PubPage(P) -> “author” -> HomePage(A), PubPage(P) -> “title” ->T Complex Constructors

copy-right@sanjaymadria, UMR WebDB View WWW as multimedia documents in the form of web pages WQL supports selection, aggregation, sorting, summary, grouping projection on title, URL, keywords, tables, forms, images etc.

copy-right@sanjaymadria, UMR Some More Results UnQL - AT&T AKIRA- Pennstate NoDose - SIGMOD’98

copy-right@sanjaymadria, UMR HTML to XML HTML documents Emerging Web Standards - XML XML good for data interchange across platforms enterprise wide conversion HTML to XML - IBM, Microsoft

copy-right@sanjaymadria, UMR XML - Motivation In HTML, both the tag semantics and tags are fixed. There is limited and strict interpretation of tags. HTML is widely successful in disseminating documents across internet. Though data can be disseminated through HTML, its extraction is painful, and laborious. EDI has been a predominate mode of exchanging data among businesses. But it has very rigid format that requires highly customized applications.

copy-right@sanjaymadria, UMR XML - Introduction XML aims to provide ease of authoring HTML documents with ease of data exchange that is possible with EDI. Tags are used to markup documents. XML is a meta-language for describing markup languages. XML provides a facility to define tags and structural relationships between them. No pre-defined tag set implied no preconceived semantics, semantics of XML document is will be defined by applications that process them or style sheets (XSL).

copy-right@sanjaymadria, UMR XML - Goals Straightforward to use over internet Support wide variety of applications, authoring, browsing, content analysis, etc. Easy to write programs that process XML documents and validate them. XML documents must be human-legible and reasonably clear. Design of XML shall be formal and concise - expressed as EBNF (extended Backus Naur Form) - amenable to modern compiler tools and techniques.

copy-right@sanjaymadria, UMR XML-features Some structure - not rigid Extensibility - User defined tags nested elements validation - documents may specify their own grammar DTP (Document Type Descriptor) - schema exists with data as tag names Application -EDI - extraction, conversion,, transformation, integration can be modeled using DOM

copy-right@sanjaymadria, UMR More terminology RDF - Resource Description Framework - a method to describe metdata for XML documents XSL - Extensible Stylesheet Language - language for transforming and formatting XML. Transformation Language - XSLT, XPath, XPointer

copy-right@sanjaymadria, UMR Example-HTML Print - Sanjay Madria Web Warehouse Tutorial, ADBIS’99 HTML Sanjay Madria Web Warehouse Tutorial, ADBIS’99 Very difficult to understand, structure is hidden, describes only appearance

copy-right@sanjaymadria, UMR XML Sanjay Madria Web Warehouse Tutorial ADBIS’99 another format:

copy-right@sanjaymadria, UMR XML Data database systems John Korth 5.87 DTD <!ELEMENT author(#PCDATA)|lastname)*

copy-right@sanjaymadria, UMR Firma Karl- Heinz Rosowski Maikstraße 14 22041 Hamburg 721 99 64 21110111 Firma Karl-Heinz Rosowski Maikstraße 14 22041 Hamburg 721 99 64 21110111 … HTML Version XML Version

copy-right@sanjaymadria, UMR XML - Document - Continued is the XML declaration. Elements:Most common form of markup. …. For example Jack Lemon Attributes: are name-value pairs that occur inside start-tags after the element name. For example: attaches value 12359 to attribute id of Address element. Entity References: to handle special characters of XML like “<“ in the XML documents.

copy-right@sanjaymadria, UMR Comments: CDATA Sections: a CDATA (string of characters) section instructs the parser to ignore most markup characters. For example source code,, between [CDATA[ and ]] all character data is passed to an application, with out interpretation.

copy-right@sanjaymadria, UMR XML - DTD - Element Type Declarations Element type declarations: identify the names of elements and the nature of their content. A typical element type declaration looks like: Address is the element name, and (Name, Street, ZIP?, City, Tel+, Fax*, Email?) is the content model. Every address must contain, Name, Street, City and Tel. ZIP and Email are optional, whereas there can be zero or more Fax numbers.

copy-right@sanjaymadria, UMR The declarations for Name, Street, ZIP …, must also be given. For example Attribute List Declarations: identify which elements may have attributes, what values the attributes may hold, and what value is default. Attribute values appear only within start-tags and empty-element tags.

copy-right@sanjaymadria, UMR XML - Summary HTML describes presentation XML describes content XML vs. HTML –users define new tags –arbitrary nesting –validation is possible

copy-right@sanjaymadria, UMR XML and Semi Structural Data Model XML data is fundamentally different than relational and object oriented data. XML is not rigidly structured. In relational and OO data model every data instance has a schema which is separate and independent of the data. XML data is self describing and can naturally model irregularities that cannot be modeled by relational or OO data model.

copy-right@sanjaymadria, UMR For example, data items may have missing elements or multiple occurrences of the same element; elements may have atomic values in some data items and structured values in others; and collections of elements can have heterogeneous structure. Even XML data that has an associated DTD is self-describing (the schema is always stored with the data) and, except for very restricted forms of DTDs, may have all the irregularities described above. XML is an instance of semistructured data.

copy-right@sanjaymadria, UMR XML-QL Regular path expression pattern matching used edge labeled graphs extract data from existing XML documents and construct new XML documents support for ordered and unordered views on XML document simple and declarative

copy-right@sanjaymadria, UMR XML-QL The simplest XML-QL queries extract data from an XML document. Consider the following DTD:

copy-right@sanjaymadria, UMR XML-QL Example Data An Introduction to DB Systems Date Addison-Wesley Foundations for OR Databases Date Darwen Addison-Wesley

copy-right@sanjaymadria, UMR Matching Data Using Patterns XML uses element patterns to match data in an XML document. Find all authors of books whose publisher is Addison-Wesley in XML document www.a.b.c/bib.xml WHERE Addison- Wesley $t $a IN “www.a.b.c/bib.xml” CONSTRUCT $a matches every element in the XML document that has at least one element, one element, and one publisher element whose is Addison-Wesley. For each such match it binds $t and $a to every title and author pair.

copy-right@sanjaymadria, UMR XML-QL Constructing XML Data Often we would like format the result. Find all authors and titles of books whose publisher is Addison-Wesley in XML document www.a.b.c/bib.xml WHERE Addison-Wesley $t $a IN “www.a.b.c/bib.xml” CONSTRUCT $a $t

copy-right@sanjaymadria, UMR Constructing XML Data -cont. Result of the query: Date Introduction to Database Systems Date Foundations for OR Databases Darwen Foundations for OR Databases One result for each author, duplicating title information.

copy-right@sanjaymadria, UMR XML-QL Nested Queries. WHERE $t Addison-Wesley CONTENT_AS $p IN “www.a.b.c/bib.xml” CONSTRUCT $t WHERE $a in $p CONSTRUCT $a Date Introduction to Database Systems Date Darwen Foundations for OR Databases

copy-right@sanjaymadria, UMR XML-QL Join Queries XML queries cab express “joins” by matching two or more elements that contain same value. Find all articles that have at least one author who has written a book since 1995. WHERE $f // firstname $f $l // lastname $l CONTENT_AS $a IN "www.a.b.c/bib.xml" $f // join on same firstname $f $l // join on same lastname $l IN "www.a.b.c/bib.xml", y > 1995 CONSTRUCT $a

copy-right@sanjaymadria, UMR XML-QL Data Model for XML XML graph G in which each node is represented by a unique string called object identifier (OID), G’s edges are labelled with element tags, G’s nodes are labeled with sets of attribute value pairs, G’s leaves are labeled with one string value, and G has a distinguished node called root.

copy-right@sanjaymadria, UMR XML-QL Data Model for XML The model allows several edges between the same two nodes with the following restriction: between any two nodes there can be at most one edge with a given label a node cannot have two leaf children with the same label and same string value XML graphs are not only derived from XML documents, but are also generated by queries.

copy-right@sanjaymadria, UMR XML- Element Identity, Ids, and IDREFS For element sharing XML reserves an attribute of type ID which allows a unique key to be associated with an element. An attribute of type IDREF allows an element to refer to another element with the designated key, and one of the type IDREFS may refer to multiple elements.

copy-right@sanjaymadria, UMR John Smith...... 1995

copy-right@sanjaymadria, UMR XML- Element Identity, Ids, and IDREFS

copy-right@sanjaymadria, UMR The following query produces all lastname, title pairs by joining the author element's IDREF attribute value with the person element's ID attribute value. WHERE ELEMENT_AS $t, ELEMENT_AS $l CONSTRUCT $t $l The idiom ELEMENT_AS $t binds $t to a element with arbitrary contents. The element expression matches a element with empty contents.

copy-right@sanjaymadria, UMR XML-QL- Advanced Examples Tag Variables Regular Path Expressions Transforming XML Data (from one DTD to another) Integrating Data from different XML sources Embedding queries in data XML-QL check http://www3.org/TR/NOTE-xml-ql

copy-right@sanjaymadria, UMR Summary Even before you blink your eye…. Lot of work has gone in web data models and query languages Some problems are addressed: Semi-structural semi-structural data model based query languages schema inference from semi-structural data model efficient processing of queries on semi-structural data efficient indexing and storage structures integration with XML Traditional WebSQL/WebOQL Web Warehousing Which way will you go?

copy-right@sanjaymadria, UMR Further issues Distributed query processing Continuous result processing with push/pull result replenishment Labels, labels every where, with XML more labels every where … how are semantics of queries across multiple information sources handled IR gives too many relevant/irrelevant results Query Processing requires some schema knowledge that is difficult to handle across multiple sources Can these two be bridged? Cooperative solutions. Next Agents, Agents everywhere, What are they doing? Will it work or Will it be a fad?

UMR Web Data Management Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla

Similar presentations

Presentation on theme: "UMR Web Data Management Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

UMR Web Data Management Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla

Similar presentations

Presentation on theme: "UMR Web Data Management Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla"— Presentation transcript:

Similar presentations

About project

Feedback