Xyleme, January 2001 -- Zurich
1 A Dynamic Warehouse for the XML data of the Web, Serge Abiteboul, INRIA & Xyleme SA



2 Organization The Web and XML Xyleme 1. Data Acquisition and Maintenance 2. XML Repository 3. Semantic Data Integration 4. Query Processing 5. Query Subscription Conclusion

3 The Web and XML

4 The Web today Terabytes of data Private web: not publicly available pages Deep web: data hidden behind forms A lot of public pages –1 billion in [06/2000] –several millions of servers

5 The Web today Browsing Search engines –Google indexes more than 1 billion pages 11/00 –in: list of words –out: sorted list of URLs based on occurrence of words in documents based on the link structure of the web

6 The Web today Queries: keywords to retrieve URLs –Imprecise –Query results cannot be directly processed –Difficult to extract data of interest Applications: based on hand-made wrappers –Expensive –Incomplete –Short-lived, not adapted to the Web's constant changes

7 The Coming of XML
HTML: comes from SGML; hypertext language; fixed number of tags; content and presentation are mixed; very difficult to extract data from a page; old standard.
XML: also comes from SGML; semistructured data; number of tags not fixed; content and presentation not mixed; very easy to extract data; new standard.

8 HTML = Hypertext Language Information system table (Ref, Name, Price): X23 Camera; R2D2 Robot; Z25 PC. Published as HTML: The X23 new camera replaces the X22. It comes equipped with a flash (worth by itself $ ) and provides great quality for only $. Text + presentation. Where is the data? Hard.

9 XML = Semistructured Data Information system table (Ref, Name, Price): X23 Camera; R2D2 Robot; Z25 PC. Published as XML: camera … Robot …... Data + structure. Semistructured: more flexible. Easy.

10 XML: Tree Types Semantics and structure are in paths –product-table/product/reference –product-table/product/price Tree nodes in the figure: product-table, product, reference, designation, description, price
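
The idea that semantics and structure live in paths can be tried with Python's standard `xml.etree.ElementTree` module. This is only a sketch: the document below is a made-up instance of the product-table tree, and the descriptions and prices are illustrative values, not taken from the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalogue shaped like the product-table tree of the slide.
# Descriptions and prices are invented for illustration.
CATALOGUE = """
<product-table>
  <product>
    <reference>X23</reference>
    <designation>Camera</designation>
    <description>Comes equipped with a flash</description>
    <price>100</price>
  </product>
  <product>
    <reference>R2D2</reference>
    <designation>Robot</designation>
    <description>Household robot</description>
    <price>200</price>
  </product>
</product-table>
"""

def values_at(root, path):
    """Return the text of every element reached by a path below the root."""
    return [element.text for element in root.findall(path)]

root = ET.fromstring(CATALOGUE)
# The root element is product-table, so paths are written relative to it.
references = values_at(root, "product/reference")
prices = values_at(root, "product/price")
print(list(zip(references, prices)))
```

No wrapper is needed to locate the data: the path alone says which values are references and which are prices.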

11 XML Very active/noisy field - standards –schema (XML schema), stylesheet (XSL), resource description (RDF...) –WML (wap), MathML, SMIL (multimedia), RSS (news), RDF (metadata)... How fast will XML conquer the web? –so far rather slow (about 1% now of the visible web; much more in intranets) –much faster since the arrival of Explorer 5.5

12 Xyleme: A Dynamic Warehouse for the XML Data of the Web

13 Xyleme Warehouse –Xyleme stores huge quantities of data (terabytes) –Xyleme is not a search engine (only an index) or a mediator (only virtual data) XML –Xyleme is focused on XML, i.e., trees Dynamic –Xyleme is interested in data evolution/changes

14 Xyleme September 1999: a group of researchers from –Inria Rocquencourt, Verso Group –U. of Mannheim, Database Group –U. of Orsay, IASI Group –CNAM, Vertigo Group September 2000: creation of a start-up November 2000: about 15 people

15 Corporate Information Today Web Information System manual searches using browsers ad-hoc applications written by web-experts tailored for specific tasks and data. I.e. inflexible and expensive manual updates

16 Corporate Information with Xyleme Figure components: Web, Information System, Xyleme warehouse (Repository, Query Engine). Labelled flows: crawling & interpreting data, publishing, updates, queries, searches

17 Five Challenges 1. Data Acquisition and Maintenance discover data of interest and maintain it up to date 2. Repository store this data and index it so that it can be processed efficiently 3. Query Processing support efficiently an SQL-style query language

18 Five Challenges - continued 4. Semantic Integration Understand DTD and tags, partition the Web into semantic domains, provide a simple view of each domain 5. Change Control Monitor the web and offer services such as Query Subscription

19 Challenges - continued Scale to the web Size of data: millions/billions of pages Size of index: terabytes Number of customers –thousands of simultaneous queries –millions of subscriptions

20 Functional Architecture Figure components: Internet, Web Interface, Acquisition & Crawler, Loader, Repository and Index Manager, Change Control, Query Processor, Semantic Module, User Interface, Xyleme Interface

21 Architecture Cluster of PCs Developed with Linux and C++ Communications –local: Corba –external: HTTP Distribution between autonomous machines

22 Architecture Figure components: Internet, Ethernet, Acquisition and Maintenance, Loader | Query, Change Control and Semantic Integration, Index, Repository

23 1. Data Acquisition and Maintenance

24 Goals Discover XML pages on the web that are of interest for customers –For this, crawl the web (HTML+XML) Maintain them up to date Do this under bounded resources

25 Life Cycle of a page in Xyleme The URL of D is discovered as a link in another page (or published by a customer) The page scheduler decides to read D –The metadata of D is read: type, last_date_update... –The document D is loaded The document D is (re)read regularly

26 Main Issues Loading of pages –we can load up to 5 million pages/day on a standard PC –the main cost is the Internet connection Metadata management Page scheduling –decide which page to read or refresh next

27 Metadata Management Example: management of the link matrix –page i points to page j –for 1 billion URLs, about 30 children/URL –the matrix is very sparse (about 30 nonzero entries per row) For each page that is read, –find the IDs of the 30 children –50 pages/second → 1500 database calls/second

28 Page Scheduling Decide which page to read next –discovery (read first) and refresh (read again) Based on: Importance of the page –read important pages often –also used to order query results Change rate of the page –don't read a page that is probably up-to-date

29 Page Scheduling for Refresh Determine the refresh frequency $f_i$ for each page $i$ so as to minimize a cost function:
Minimize $\sum_{i=1}^{N} \mathrm{cost}_i(f_i)$ under the constraint $\sum_{i=1}^{N} f_i \le G$
where $\mathrm{cost}_i(f_i)$, the penalty for page $i$, depends on the estimated importance and staleness of the page

30 Cost Function cost i (f i ), penalty for page i, depends on the estimated importance and staleness of the page Importance of the page –link structure –pub/sub Staleness of the data –penalty for being out of date –penalty for aging
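
The slides do not give the actual cost function, so the following is only a sketch under a loudly stated assumption: the penalty is taken to be cost_i(f_i) = w_i / f_i, where w_i is a hypothetical importance weight, and G is read as the overall refresh budget. With that assumed cost, minimizing the sum of penalties while the frequencies add up to G gives f_i proportional to sqrt(w_i) (a Lagrange multiplier argument). Since this cost decreases with f_i, the whole budget is spent.

```python
import math

def refresh_frequencies(importance, budget):
    """Split a refresh budget G over pages, assuming cost_i(f_i) = w_i / f_i.

    Under that assumed cost, minimizing sum_i w_i / f_i with sum_i f_i = G
    yields f_i proportional to sqrt(w_i).
    """
    roots = [math.sqrt(w) for w in importance]
    total = sum(roots)
    return [budget * r / total for r in roots]

def total_cost(importance, frequencies):
    """Sum of the assumed per-page penalties w_i / f_i."""
    return sum(w / f for w, f in zip(importance, frequencies))

importance = [16.0, 4.0, 1.0]  # hypothetical importance weights w_i
budget = 70.0                  # hypothetical budget G (refreshes per day)
optimal = refresh_frequencies(importance, budget)
uniform = [budget / len(importance)] * len(importance)
print(optimal)
print(total_cost(importance, optimal), total_cost(importance, uniform))
```

Under this assumed cost, the important page is refreshed more often than the others but not in direct proportion to its weight, and the allocation beats a uniform split of the same budget.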

31 Evaluation of Change Rate Based on the Last Date of Change –provided by the HTTP header of the page –in general reliable but … Based on the number M of changes detected the last N times the page was refreshed –limits: do not know the actual number of changes The first one is more precise

32 Page Importance: Link Structure Intuition: a page is important if many important pages reference it: a fixpoint Link Matrix –$M(i,j) = 1$ if page $i$ refers to page $j$ –$M$ is an $N \times N$ matrix –$\mathrm{out}(i)$: the outdegree of page $i$ Fixpoint –$W_0(k) = 1/N$ (initialization) –$W_m(k) = \sum_i M(i,k)\, W_{m-1}(i)/\mathrm{out}(i)$

33 Page Importance: Algorithm $M(i,-)$ is stored as a list; $W_m$ is computed line by line: for i = 1 to N do [ read $M(i,-)$ ; process the line ], where processing line $i$ performs $W_m(k) \mathrel{+}= W_{m-1}(i)/\mathrm{out}(i)$ for each $k$ in $M(i,-)$

34 Page Importance: Fixpoint Techniques for fixpoint convergence Some results –convergence is fast (OK after about 10 iterations) –single precision suffices –possible on a standard PC Distribution and incremental evaluation
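
The line-by-line fixpoint computation can be sketched as follows. Each row M(i,-) is kept as a list of children, W_0(k) = 1/N, and every page passes W_{m-1}(i)/out(i) to each page it points to. The four-page graph and the handling of pages without out-links are assumptions of this sketch, not part of the slides.

```python
def page_importance(links, iterations=10):
    """Iterate the importance fixpoint; links[i] lists the pages i points to."""
    n = len(links)
    weights = [1.0 / n] * n                      # W_0(k) = 1/N
    for _ in range(iterations):
        new_weights = [0.0] * n
        for i, children in enumerate(links):     # read M(i,-), process the line
            if not children:
                continue  # assumption: a page without out-links passes nothing on
            share = weights[i] / len(children)   # W_{m-1}(i) / out(i)
            for k in children:
                new_weights[k] += share          # contribution to W_m(k)
        weights = new_weights
    return weights

# Hypothetical 4-page web in which every page has at least one out-link.
links = [[1, 2], [2], [0], [0, 2]]
weights = page_importance(links)
print([round(w, 3) for w in weights])
```

Only one row of the matrix is needed in memory at a time, which is what makes the computation feasible on a standard PC for a very sparse matrix.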

35 Page Importance: Refresh Standard importance for HTML/XML pages HTML pages are useful only to discover XML Taking pub/sub into account Figure legend: circle = HTML, square = XML, triangle = pub/sub

36 2. XML Repository

37 Storing XML documents Relational store (e.g., Oracle 8i) –binary large objects: elements cannot be accessed directly –strongly typed data and tables: efficient –otherwise: too many joins and inefficient Object database store (ODMG) –better adapted Native XML storage: Natix

38 Natix Repository Goal –minimize I/O for direct access and scanning –efficient direct accesses using indexing –good compaction but not at the cost of access Efficient storage of trees –use fixed length storage pages –variable length records inside a page Main issue: tree balancing

39 Tree Balancing Figure labels: Record 1, Record 2, Record 3

40 Tree Balancing - continued Large collections may use several records

41 3. Semantic Data Integration

42 Web Heterogeneity Semantic domains, e.g., cinema Many possible types for data in this domain, many DTDs Semantic Integration –one abstract DTD for the domain –gives the illusion that the system maintains a homogeneous database for this domain 1 domain = 1 abstract DTD

43 Cluster DTDs and Documents The relationship is not visible unless one knows the relationship between story and tale.

44 Discover the Domains Cluster DTDs sharing similar « tags », using data mining techniques (frequent item sets) and linguistic tools (e.g., thesaurus, heuristics to extract words from composite words or abbreviations, etc.), to obtain domains. Figure: many concrete DTDs (cdtd1 to cdtd10) are grouped into fewer abstract DTDs (adtd1, adtd2, adtd4)

45 Wordnet: Useful Relationships Synonyms: one concept, two terms. Hypernyms / Hyponyms: two concepts linked through generalization/specialization, e.g., vehicle & car. Meronyms / Holonyms: two concepts linked through composition/inclusion, e.g., country & city

46 Choose an Abstract DTD / Domain Automatically –The analysis of a cluster leads to « clusters of tags » –Use a thesaurus (e.g., Wordnet) to build a hierarchy from the clusters of tags Manually –Performed by a domain expert Hybrid

47 Mapping Concrete to Abstract For each concrete DTD in a domain, find how it relates to the abstract DTD: – Associate concrete tags to abstract tags using linguistic tools –Provide relationships between paths in the concrete and abstract DTD E.g.: cdtd3/œuvre/nom/prénom and adtd2/book/author/name/firstname Possibly automatic, manual or hybrid

48 4. Query Processing

49 Xyleme Query Language Today: A mix of OQL and XQL Tomorrow: the future W3C standard Example:
    select product/name, product/price
    from doc in catalogue, product in doc/product
    where product//components contains "flash"
    and product/description contains "camera"
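
To make the meaning of the example query concrete, here is a rough Python equivalent evaluated with the standard `xml.etree.ElementTree` module over a single document. It is not how Xyleme evaluates queries; the catalogue content is invented, and `contains` is assumed to mean a simple substring test on element text.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalogue document; all values are invented for illustration.
CATALOGUE = """
<catalogue>
  <product>
    <name>X23</name><price>100</price>
    <description>compact camera</description>
    <parts><components>flash, lens</components></parts>
  </product>
  <product>
    <name>Z25</name><price>900</price>
    <description>desktop PC</description>
    <parts><components>keyboard</components></parts>
  </product>
</catalogue>
"""

def contains(elements, word):
    """Assumed semantics of 'contains': the word occurs in some element's text."""
    return any(word in (element.text or "") for element in elements)

def run_query(doc):
    """select product/name, product/price ... where ... (see the slide)."""
    results = []
    for product in doc.findall("product"):                 # product in doc/product
        has_flash = contains(product.findall(".//components"), "flash")
        is_camera = contains(product.findall("description"), "camera")
        if has_flash and is_camera:
            results.append((product.findtext("name"), product.findtext("price")))
    return results

doc = ET.fromstring(CATALOGUE)
print(run_query(doc))
```

The `.//components` path plays the role of `product//components`: the element may sit at any depth below the product.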

50 Data Distribution Cluster of documents = physical collection of documents (semantic domain) Distribution Storage machine –in charge of a cluster of documents Index machine –index for a cluster

51 Step 0: Indexing Standard inverted index –word → documents that contain this word Xyleme index –word → elements that contain this word (document + element identifier) Goal: more work can be performed without accessing data
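
The difference between a standard inverted index and an element-level index can be sketched in a few lines. The element identifier used here is simply the element's position in document order, which is an assumption of this sketch; the two documents are invented.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def build_indexes(documents):
    """Build a document-level and an element-level inverted index.

    documents maps a document id to an XML string. Only the text directly
    inside an element (before its first child) is indexed in this sketch.
    """
    by_document = defaultdict(set)   # word -> documents containing it
    by_element = defaultdict(set)    # word -> (document, element id, tag)
    for doc_id, xml_text in documents.items():
        root = ET.fromstring(xml_text)
        # root.iter() walks the tree in document order (root first).
        for element_id, element in enumerate(root.iter()):
            for word in (element.text or "").lower().split():
                by_document[word].add(doc_id)
                by_element[word].add((doc_id, element_id, element.tag))
    return by_document, by_element

# Hypothetical documents.
documents = {
    "d1": "<product><name>camera</name>"
          "<description>small camera with flash</description></product>",
    "d2": "<product><name>robot</name></product>",
}
by_document, by_element = build_indexes(documents)
print(sorted(by_element["camera"]))
```

With the element-level index, a condition such as "description contains camera" can be checked from the index entries alone, without fetching the document.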

52 Step 1: Localization Query on an abstract DTD Localization of the machines that host concrete DTDs that will participate in the query: global query on abstract DTD → union of queries on local machines → local queries. Example: catalogue/product/price is relevant for machine 56 and machine 45

53 Step 2: Optimization Algebraic rewriting Linear search strategy based on simple heuristics –use in memory indexes –minimize communication Optimization of the global plan Optimization of the local plans

54 Step 3: Execution A plan usually consists of: 1. parallel translation from abstract queries to concrete patterns on the relevant index machines 2. parallel index scans to identify the relevant elements for a concrete pattern 3. parallel construction of resulting elements 4. pipeline evaluation (i.e., no intermediate data structure) Note: step 2 requires smart indexes

55 Execution: Abstract2Concrete For each concrete pattern, the local plan is optimized dynamically. For catalogue/product/price: scan the relevant concrete patterns (d1//camera/price, d2/product/cost, d3/piano/price, ...); for each concrete pattern, scan the element ids (&234, &177)
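
Localization and abstract-to-concrete translation amount to looking up, for an abstract path, the concrete patterns held by each machine. The table below is a hypothetical illustration: the path, the patterns and the machine numbers 45 and 56 appear on the slides, but which pattern lives on which machine is an assumption.

```python
# Hypothetical mapping table: abstract path -> concrete patterns per machine.
# The assignment of patterns to machines 45 and 56 is invented.
MAPPINGS = {
    "catalogue/product/price": {
        45: ["d1//camera/price", "d2/product/cost"],
        56: ["d3/piano/price"],
    },
}

def localize(abstract_path):
    """Return the local queries, keyed by the machine that must run them."""
    return MAPPINGS.get(abstract_path, {})

def global_plan(abstract_path):
    """Express the global query as a union of (machine, concrete pattern) scans."""
    return [
        (machine, pattern)
        for machine, patterns in sorted(localize(abstract_path).items())
        for pattern in patterns
    ]

plan = global_plan("catalogue/product/price")
print(plan)
```

Each machine then only has to scan its own index for its own concrete patterns, and the scans can run in parallel.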

56 Element Identifiers Essential for query processing Identifier = (preorder rank, postorder rank) –X ancestor of Y ⇔ pre(X) < pre(Y) and post(X) > post(Y) –E.g., 2 < 5 and 4 > 2 => (2,4) is an ancestor of (5,2) Figure: tree with nodes A, B, C, D, E, F, G and a Text leaf
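
With (preorder rank, postorder rank) identifiers, the ancestor test needs no access to the tree: X is an ancestor of Y exactly when pre(X) < pre(Y) and post(X) > post(Y). The sketch below labels a small tree and checks the relation. The shape of the tree is an assumption; only the node names come from the slide's figure.

```python
def label(tree):
    """Assign (preorder rank, postorder rank) to every node of a nested tree.

    A tree is a (name, children) pair; ranks start at 1.
    """
    labels = {}
    counters = {"pre": 0, "post": 0}

    def visit(node):
        name, children = node
        counters["pre"] += 1
        pre = counters["pre"]           # rank when the node is entered
        for child in children:
            visit(child)
        counters["post"] += 1           # rank when the node is left
        labels[name] = (pre, counters["post"])

    visit(tree)
    return labels

def is_ancestor(x, y):
    """X is an ancestor of Y iff pre(X) < pre(Y) and post(X) > post(Y)."""
    return x[0] < y[0] and x[1] > y[1]

# Hypothetical tree shape using the node names of the figure.
tree = ("A", [("B", [("C", []), ("D", [])]), ("E", [("F", []), ("G", [])])])
labels = label(tree)
print(labels)
print(is_ancestor(labels["B"], labels["D"]), is_ancestor(labels["B"], labels["F"]))
```

The test is two integer comparisons per pair, so ancestor/descendant joins can be evaluated directly on index entries.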

57 Patterns and Indexes product name description camera (d1, 12, 200), (d1, 201, 400) (d1,1,11), (d1, 205,224)(d1,228, 237) (d1, 229), (d2, 14) Heuristics: to perform joins, start with the smallest cardinality (to minimize the size of intermediate results)
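
The smallest-cardinality heuristic can be illustrated with a deliberately simplified join. Here posting lists are reduced to document ids and the join is a plain intersection, which ignores the structural (element-level) part of the real join; the posting lists are invented.

```python
def join_smallest_first(posting_lists):
    """Intersect posting lists, starting with the smallest one.

    Returns the result and the size of each intermediate result.
    Simplification: postings are document ids only.
    """
    if not posting_lists:
        return set(), []
    ordered = sorted(posting_lists, key=len)   # smallest cardinality first
    result = set(ordered[0])
    sizes = [len(result)]
    for postings in ordered[1:]:
        result &= set(postings)
        sizes.append(len(result))
    return result, sizes

# Hypothetical posting lists for the words of a pattern.
postings = {
    "product": ["d1", "d2", "d3", "d4"],
    "description": ["d1", "d2", "d3"],
    "camera": ["d1", "d2"],
}
result, sizes = join_smallest_first(list(postings.values()))
print(sorted(result), sizes)
```

Starting from the rarest term bounds every intermediate result by the size of the smallest list.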

58 5. Change Control

59 The Web changes all the time Data acquisition + maintenance –keep the warehouse up-to-date Version management –representation and storage of changes Change monitoring –query subscription

60 Versions Version some documents or some sites. Version some continuous queries. Continuous query: a query that is evaluated regularly, e.g., get each Monday the list of movies showing in Paris

61 Representing Versions: Deltas Version storage –current document –persistent identifiers for elements –description of changes - completed deltas Deltas are XML documents Changes can be processed like other data –exchanged: send me changes since June 1st! –queried: what are the products inserted since 2/1/99?

62 Completed Delta DVD 500 <move xid=16 new_parent=11 new_position=2 old_parent=11 old_position=1 />... persistent identifier
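
A move operation that records both the old and the new parent and position carries enough information to be applied in either direction. The sketch below applies such an operation to a small tree and then undoes it. The document, the second product and the 1-based positions are assumptions; the attribute names follow the `move` element of the slide.

```python
import xml.etree.ElementTree as ET

def apply_move(root, move):
    """Apply a move operation; positions are assumed to be 1-based."""
    by_xid = {element.get("xid"): element for element in root.iter()}
    node = by_xid[move["xid"]]
    by_xid[move["old_parent"]].remove(node)
    by_xid[move["new_parent"]].insert(int(move["new_position"]) - 1, node)

def invert(move):
    """Swap old and new fields to obtain the reverse operation."""
    return {
        "xid": move["xid"],
        "old_parent": move["new_parent"],
        "old_position": move["new_position"],
        "new_parent": move["old_parent"],
        "new_position": move["old_position"],
    }

# Hypothetical versioned document with persistent identifiers (xid).
root = ET.fromstring(
    '<catalog xid="11"><product xid="16">DVD</product>'
    '<product xid="17">VCR</product></catalog>'
)
move = {"xid": "16", "old_parent": "11", "old_position": "1",
        "new_parent": "11", "new_position": "2"}

apply_move(root, move)
after = [child.get("xid") for child in root]
apply_move(root, invert(move))
restored = [child.get("xid") for child in root]
print(after, restored)
```

Because the persistent identifiers do not change when an element moves, the same operation can be replayed on any copy of the same version.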

63 Query Subscription Users may subscribe to certain events, e.g., –changes in a page or a set of pages –changes in pages from a particular semantic domain, containing some specific words, or with a particular DTD –changes of particular elements somewhere (new products in a catalog) Users may request to be notified –immediately at the time the event is detected –regularly, e.g., weekly –after a certain number of event detections

64 Example
    subscription myPariscope
    % what are the new movie entries in Pariscope site
    monitoring newMovies
    select URL where URL extends and new(self)
    % manage the changes in the movies showing in Paris
    continuous delta Showing
    select... from... where
    when daily
    notify daily % send me a daily report

65 Step 1: Atomic Event Detection Figure components: HTML parser, XML loader, metadata manager, loading of document d (5 million pages/day), document & alerts (d/46; d/46,67), complex event detection. Atomic event 46: URL matches pattern. Atomic event 67: XML document contains the tag painter

66 Step 2: Complex Event Detection HTML parser XML loader complex event detection complex event 12: 67 & 46 (XML document contains the tag painter and URL matches pattern) Millions of alerts of pages/day Millions of subscriptions
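
Treating a complex event as a conjunction of atomic events, detection reduces to a subset test between each subscription and the alert raised for a document. The sketch below indexes subscriptions by atomic event so that only candidate subscriptions are examined; this data structure is an assumption for illustration, not Xyleme's actual algorithm. Events 12, 46 and 67 come from the slides, while event 13 and atomic event 99 are invented.

```python
from collections import defaultdict

class ComplexEventDetector:
    """Detect complex events defined as conjunctions of atomic events."""

    def __init__(self):
        self.required = {}                 # complex event id -> set of atomic ids
        self.by_atomic = defaultdict(set)  # atomic id -> complex events using it

    def subscribe(self, complex_id, atomic_ids):
        self.required[complex_id] = set(atomic_ids)
        for atomic_id in atomic_ids:
            self.by_atomic[atomic_id].add(complex_id)

    def detect(self, alert):
        """alert is the set of atomic events raised for one document."""
        alert = set(alert)
        candidates = set()
        for atomic_id in alert:
            candidates |= self.by_atomic.get(atomic_id, set())
        # A complex event fires when all of its atomic events are in the alert.
        return sorted(c for c in candidates if self.required[c] <= alert)

detector = ComplexEventDetector()
detector.subscribe(12, {67, 46})   # complex event 12: 67 & 46 (from the slide)
detector.subscribe(13, {46, 99})   # hypothetical second subscription
print(detector.detect({46, 67}))
```

Indexing by atomic event keeps the work per alert proportional to the number of subscriptions that mention its events rather than to the total number of subscriptions.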

67 Step 3: Notification Processor Figure components: notification processor, continuous queries, complex event detection, clock triggers, alerts, notification/monitoring, notification/results. Millions of notifications/day

68 Conclusion

69 One Question Only The web is turning from a large collection of documents into a huge knowledge base When will I be able to get the precise knowledge I need? Database + Knowledge Base + Linguistic +...

70 Thank you