1 A Dynamic Warehouse for the XML Data of the Web. Serge Abiteboul, INRIA & Xyleme SA. Xyleme, January, Zurich.
2 Organization The Web and XML Xyleme 1. Data Acquisition and Maintenance 2. XML Repository 3. Semantic Data Integration 4. Query Processing 5. Query Subscription Conclusion
3 The Web and XML
4 The Web today Terabytes of data Private web: pages that are not publicly available Deep web: data hidden behind forms A lot of public pages –1 billion in [06/2000] –several million servers
5 The Web today Browsing Search engines –Google indexes more than 1 billion pages 11/00 –in: list of words –out: sorted list of URLs based on occurrence of words in documents based on the link structure of the web
6 The Web today Queries: keywords to retrieve URLs –Imprecise –Query results cannot be directly processed –Difficult to extract data of interest Applications: based on hand-made wrappers –Expensive –Incomplete –Short-lived, not adapted to the Web's constant changes
7 The Coming of XML
HTML
–comes from SGML
–hypertext language
–fixed number of tags
–content and presentation are mixed
–very difficult to extract data from a page
–old standard
XML
–also comes from SGML
–semistructured data
–number of tags is not fixed
–content and presentation are not mixed
–very easy to extract data from a page
–new standard
8 HTML = Hypertext Language
Information System: Ref, Name, Price; X23, Camera; R2D2, Robot; Z25, PC
HTML: The X23 new camera replaces the X22. It comes equipped with a flash (worth by itself $ ) and provides great quality for only $.
Text + presentation
Where is the data? Hard.
9 XML = Semistructured Data
Information System: Ref, Name, Price; X23, Camera; R2D2, Robot; Z25, PC
XML: camera … Robot …...
Data + Structure
Semistructured: more flexible
Finding the data: easy
10 XML: Tree Types
Semantics and structure are in paths
–product-table/product/reference
–product-table/product/price
Figure: tree with root product-table, child product, and leaves reference, designation, description, price
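A minimal sketch of the idea that semantics and structure live in paths: the snippet below lists the root-to-leaf tag paths of a small document using Python's standard `xml.etree.ElementTree`. The sample document and its values are invented for illustration; only the tag names come from the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; only the tag names come from the slide.
SAMPLE = """
<product-table>
  <product>
    <reference>X23</reference>
    <designation>Camera</designation>
    <description>New camera with a flash</description>
    <price>100</price>
  </product>
</product-table>
""".strip()

def leaf_paths(element, prefix=""):
    """Yield the tag path of every leaf element, e.g. 'a/b/c'."""
    path = f"{prefix}/{element.tag}" if prefix else element.tag
    children = list(element)
    if not children:
        yield path
    for child in children:
        yield from leaf_paths(child, path)

paths = sorted(set(leaf_paths(ET.fromstring(SAMPLE))))
for p in paths:
    print(p)
```

Two documents with different layouts but the same paths can then be treated alike, which is what the later slides on indexing and semantic integration rely on.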
11 XML Very active/noisy field - standards –schema (XML schema), stylesheet (XSL), resource description (RDF...) –WML (wap), MathML, SMIL (multimedia), RSS (news), RDF (metadata)... How fast will XML conquer the web? –so far rather slow (about 1% of the visible web now; much more in intranets) –much faster since the arrival of Explorer 5.5
12 Xyleme: A Dynamic Warehouse for the XML Data of the Web
13 Xyleme Warehouse –Xyleme stores huge quantities of data (terabytes) –Xyleme is not a search engine (index only) or a mediator (virtual data only) XML –Xyleme is focused on XML, i.e., trees Dynamic –Xyleme is interested in data evolution/changes
14 Xyleme September 1999: a group of researchers from –Inria Rocquencourt, Verso Group –U. of Mannheim, Database Group –U. of Orsay, IASI Group –CNAM, Vertigo Group September 2000: creation of a start-up November 2000: about 15 people
15 Corporate Information Today
Figure: Web and Information System, connected by
–manual searches using browsers
–ad-hoc applications written by web experts, tailored for specific tasks and data, i.e., inflexible and expensive
–manual updates
16 Corporate Information with Xyleme
Figure: Web, Xyleme warehouse (Repository, Query Engine), Information System; flows: crawling & interpreting data, publishing, updates, queries, searches
17 Five Challenges 1. Data Acquisition and Maintenance: discover data of interest and keep it up to date 2. Repository: store this data and index it so that it can be processed efficiently 3. Query Processing: efficiently support an SQL-style query language
18 Five Challenges - continued 4. Semantic Integration: understand DTDs and tags, partition the Web into semantic domains, provide a simple view of each domain 5. Change Control: monitor the web and offer services such as Query Subscription
19 Challenges - continued Scale to the web Size of data: millions/billions of pages Size of index: terabytes Number of customers –thousands of simultaneous queries –millions of subscriptions
20 Functional Architecture
Figure: User Interface, Xyleme Interface, Web Interface; Query Processor, Semantic Module, Change Control; Repository and Index Manager; Loader, Acquisition & Crawler; Internet
21 Architecture Cluster of PCs Developed with Linux and C++ Communications –local: Corba –external: HTTP Distribution between autonomous machines
22 Architecture
Figure: Internet; Ethernet connecting Acquisition and Maintenance (×2), Change Control and Semantic Integration (×2), Index, Loader | Query, Repository (×3)
23 1. Data Acquisition and Maintenance
24 Goals Discover XML pages on the web that are of interest for customers –For this, crawl the web (HTML+XML) Keep them up to date Do this under bounded resources
25 Life Cycle of a page in Xyleme The URL of D is discovered as a link in another page (or published by a customer) The page scheduler decides to read D –The metadata of D is read: type, last_date_update... –The document D is loaded The document D is (re)read regularly
26 Main Issues Loading of pages –we can load up to 5 million pages/day on a standard PC –main cost is Internet connection Metadata management Page scheduling –decide which page to read or refresh next
27 Metadata Management Example: management of the link matrix –page i points to page j –for 1 billion URLs, about 30 children/URL –matrix has about 30 billion edges (very sparse) For each page that is read, –find the IDs of the 30 children –50 pages/second × 30 children = 1500 database calls/second
28 Page Scheduling Decide which page to read next –discovery (read first) and refresh (read again) Based on: Importance of the page –read important pages often –also used to order query results Change rate of the page –don't read a page that is probably up to date
29 Page Scheduling for Refresh
Determine the refresh frequency $f_i$ of each page $i$ so as to minimize a cost function:
Minimize $\sum_{i=1}^{N} \mathrm{cost}_i(f_i)$ under the constraint $\sum_{i=1}^{N} f_i \le G$
where $\mathrm{cost}_i(f_i)$, the penalty for page $i$, depends on the estimated importance and staleness of the page
30 Cost Function
$\mathrm{cost}_i(f_i)$, the penalty for page $i$, depends on the estimated importance and staleness of the page
Importance of the page
–link structure
–pub/sub
Staleness of the data
–penalty for being out of date
–penalty for aging
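The slides do not give the form of the cost function, so the following is only a sketch under a stated assumption: $\mathrm{cost}_i(f_i) = w_i / f_i$, where the weight $w_i$ (hypothetical) combines importance and staleness penalty. Because this assumed cost decreases with $f_i$, the whole budget $G$ is used, and a Lagrange multiplier argument gives $f_i \propto \sqrt{w_i}$. The function name, weights and budget below are illustrative.

```python
import math

def allocate_refresh(weights, budget):
    """Split a refresh budget G across pages.

    Assumption (not from the slides): cost_i(f_i) = w_i / f_i with w_i > 0.
    Minimizing sum(w_i / f_i) subject to sum(f_i) = G yields
    f_i = G * sqrt(w_i) / sum_j sqrt(w_j).
    """
    roots = [math.sqrt(w) for w in weights]
    total = sum(roots)
    if total == 0:
        raise ValueError("at least one weight must be positive")
    return [budget * r / total for r in roots]

# Three hypothetical pages with weights 1, 4 and 9; budget of 60 refreshes/day.
freqs = allocate_refresh([1.0, 4.0, 9.0], budget=60.0)
print(freqs)
```

Under this assumption a page that is nine times more important is refreshed only three times more often, which illustrates why the allocation is an optimization problem and not a simple proportional split.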
31 Evaluation of Change Rate Based on the Last Date of Change –provided by HTTP header of the page –in general reliable but … Based on the number M of changes detected the last N times the page was refreshed –limits: the actual number of changes is not known The first one is more precise
32 Page Importance: Link Structure
Intuition: a page is important if many important pages reference it (a fixpoint definition)
Link Matrix
–$M(i,j) = 1$ if page $i$ refers to page $j$, 0 otherwise
–$M$ is an $N \times N$ matrix
–$\mathrm{out}(i)$: the outdegree of page $i$
Fixpoint
–$W_0(k) = 1/N$ (initialization)
–$W_m(k) = \sum_i M(i,k) \, W_{m-1}(i) / \mathrm{out}(i)$
33 Page Importance: Algorithm
$M(i,-)$, row $i$ of the link matrix, is stored as a list
Computation of $W_m$, line by line:
for i = 1 to N do [ read M(i,-) ; process the line ]
Processing line $i$: for each $k$ in $M(i,-)$, $W_m(k) \mathrel{+}= W_{m-1}(i) / \mathrm{out}(i)$
34 Page Importance: Fixpoint Techniques for fixpoint convergence Some results –convergence is fast (OK after about 10 iterations) –single precision suffices –possible on a standard PC Distribution and incremental evaluation
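The fixpoint computation can be sketched as follows, with each row M(i,-) held as a list of target pages. This is a simplified illustration, not Xyleme's implementation: there is no damping factor, and pages without outgoing links simply distribute nothing (an assumption, since the slides do not say how they are handled). The 4-page graph is hypothetical.

```python
def page_importance(links, iterations=10):
    """Iterate W_m(k) = sum_i M(i,k) * W_{m-1}(i) / out(i).

    links[i] is the list of pages that page i points to, i.e. row M(i,-).
    """
    n = len(links)
    w = [1.0 / n] * n                      # W_0(k) = 1/N
    for _ in range(iterations):
        nxt = [0.0] * n
        for i, row in enumerate(links):    # one sequential pass over the rows
            if not row:
                continue                   # no out-links: nothing to distribute
            share = w[i] / len(row)        # W_{m-1}(i) / out(i)
            for k in row:
                nxt[k] += share
        w = nxt
    return w

# Hypothetical graph: 0 -> 1, 2;  1 -> 2;  2 -> 0;  3 -> 2
links = [[1, 2], [2], [0], [2]]
scores = page_importance(links)
print(scores)
```

Each iteration reads the matrix once, row by row, so only the two weight vectors need to stay in memory while the rows can be streamed from disk.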
35 Page Importance: Refresh Standard importance for HTML/XML pages HTML pages are useful only to discover XML Taking pub/sub into account Figure legend: circle = HTML, square = XML, triangle = pub/sub
36 2. XML Repository
37 Storing XML documents
Relational store (e.g., Oracle 8i)
–binary large objects: elements cannot be accessed directly
–strongly typed data stored in tables: efficient
–otherwise: too many joins, inefficient
Object database store (ODMG)
–better adapted
Native XML storage: Natix
38 Natix Repository Goal –minimize I/O for direct access and scanning –efficient direct accesses using indexing –good compaction but not at the cost of access Efficient storage of trees –use fixed length storage pages –variable length records inside a page Main issue: tree balancing
39 Tree Balancing
Figure: a document tree split across Record 1, Record 2 and Record 3
40 Tree Balancing - continued Large collections may use several records
41 3. Semantic Data Integration
42 Web Heterogeneity Semantic domains, e.g., cinema Many possible types for data in this domain, many DTDs Semantic Integration –one abstract DTD for the domain –gives the illusion that the system maintains a homogeneous database for this domain 1 domain = 1 abstract DTD
43 Cluster DTDs and Documents The relationship between documents is not visible unless one knows the relationship between tags such as story and tale.
44 Discover the Domains
Cluster DTDs sharing similar « tags » using data mining techniques (frequent item sets) and linguistic tools (e.g., thesaurus, heuristics to extract words from composite words or abbreviations, etc.) to obtain domains
Figure: many concrete DTDs (cdtd1 ... cdtd10) grouped under fewer abstract DTDs (adtd1, adtd2, adtd4)
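As a rough illustration of grouping DTDs by shared tags, the sketch below swaps in a much simpler technique than the frequent-item-set mining and linguistic tools named on the slide: greedy clustering on the Jaccard similarity of tag sets. The DTD contents, the threshold and the function names are all hypothetical.

```python
def jaccard(a, b):
    """Fraction of tags shared by two tag sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster_dtds(dtds, threshold=0.5):
    """Greedy clustering: a DTD joins the first cluster whose first member
    is similar enough, otherwise it starts a new cluster."""
    clusters = []  # list of lists of DTD names
    for name, tags in dtds.items():
        for cluster in clusters:
            if jaccard(tags, dtds[cluster[0]]) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters

# Hypothetical concrete DTDs, reduced to their sets of tags.
dtds = {
    "cdtd1": {"movie", "title", "actor", "director"},
    "cdtd2": {"movie", "title", "actor", "year"},
    "cdtd3": {"book", "author", "title", "publisher"},
}
clusters = cluster_dtds(dtds)
print(clusters)
```

Exact tag matching misses pairs such as story and tale; that gap is what a thesaurus is meant to close.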
45 WordNet: Useful Relationships
Synonyms: one concept, two terms
Hypernyms / Hyponyms: two concepts linked through generalization/specialization, e.g., vehicle & car
Meronyms / Holonyms: two concepts linked through composition/inclusion, e.g., country & city
46 Choose an Abstract DTD / Domain Automatically –The analysis of a cluster leads to « clusters of tags » –Use a thesaurus (e.g., WordNet) to build a hierarchy from the clusters of tags Manually –Performed by a domain expert Hybrid
47 Mapping Concrete to Abstract For each concrete DTD in a domain, find how it relates to the abstract DTD: – Associate concrete tags to abstract tags using linguistic tools –Provide relationships between paths in the concrete and abstract DTD E.g.: cdtd3/œuvre/nom/prénom and adtd2/book/author/name/firstname Possibly automatic, manual or hybrid
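One simple way to hold the path relationships is a table from each abstract path to the concrete paths it stands for. The sketch below is an assumption about representation only, not Xyleme's data structure; the first pair is the slide's example, while `cdtd5/livre/auteur/prenom` and the function names are hypothetical.

```python
from collections import defaultdict

# abstract path -> list of concrete paths
mappings = defaultdict(list)

def add_mapping(abstract_path, concrete_path):
    """Record that a concrete path corresponds to an abstract path."""
    mappings[abstract_path].append(concrete_path)

def to_concrete(abstract_path):
    """Return the concrete paths an abstract path stands for (may be empty)."""
    return list(mappings.get(abstract_path, []))

# The pair from the slide, plus one hypothetical extra concrete DTD.
add_mapping("adtd2/book/author/name/firstname", "cdtd3/œuvre/nom/prénom")
add_mapping("adtd2/book/author/name/firstname", "cdtd5/livre/auteur/prenom")

print(to_concrete("adtd2/book/author/name/firstname"))
```

A query written against the abstract DTD can then be expanded into one concrete pattern per mapped path.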
48 4. Query Processing
49 Xyleme Query Language
Today: a mix of OQL and XQL
Tomorrow: the future W3C standard
Example
select product/name, product/price
from doc in catalogue, product in doc/product
where product//components contains flash
and product/description contains camera
50 Data Distribution Cluster of documents = physical collection of documents ( semantic domain) Distribution Storage machine –in charge of a cluster of documents Index machine –index for a cluster
51 Step 0: Indexing
Standard inverted index
–word → documents that contain this word
Xyleme index
–word → elements that contain this word (document + element identifier)
Goal: more work can be performed without accessing data
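The difference between the two kinds of index can be sketched with an inverted index whose postings are (document, element identifier) pairs. For simplicity the element identifier here is just the preorder rank of the element, and a word is attached only to the innermost element whose text contains it; both choices are assumptions of this sketch, as is the sample document.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def index_document(index, doc_id, xml_text):
    """Add a (doc_id, element_id) posting for every word in element text.

    element_id is the preorder rank of the element within the document.
    """
    root = ET.fromstring(xml_text)
    for element_id, element in enumerate(root.iter(), start=1):
        for word in (element.text or "").lower().split():
            index[word].add((doc_id, element_id))

index = defaultdict(set)
index_document(
    index,
    "d1",
    "<product><name>Camera</name><description>with flash</description></product>",
)
print(sorted(index["flash"]))
```

With postings at element level, a condition such as "description contains flash" can be checked against the index alone, without fetching the document.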
52 Step 1: Localization
Query on an abstract DTD
Localization of the machines that host concrete DTDs that will participate in the query
global query on abstract DTD → union of queries on local machines (local queries)
Example: catalogue/product/price is relevant for machine 56 and machine 45
53 Step 2: Optimization Algebraic rewriting Linear search strategy based on simple heuristics –use in memory indexes –minimize communication Optimization of the global plan Optimization of the local plans
54 Step 3: Execution A plan usually consists of: 1. parallel translation from abstract queries to concrete patterns on the relevant index machines 2. parallel index scans to identify the relevant elements for a concrete pattern 3. parallel construction of resulting elements 4. pipeline evaluation (i.e., no intermediate data structure) Note: step 2 requires smart indexes
55 Execution: Abstract2Concrete
For each concrete pattern, the local plan is optimized dynamically
for catalogue/product/price, scan the relevant concrete patterns: d1//camera/price, d2/product/cost, d3/piano/price...
for each concrete pattern, scan the element ids: &234, &177
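56 Element Identifiers
Essential for query processing
Identifier = (preorder rank, postorder rank)
–X ancestor of Y $\Leftrightarrow$ $\mathrm{pre}(X) < \mathrm{pre}(Y)$ and $\mathrm{post}(X) > \mathrm{post}(Y)$
–E.g., $2 < 5$ and $4 > 2$ $\Rightarrow$ (2,4) is an ancestor of (5,2)
Figure: tree with nodes A, B, C, D, E, F, G and Text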
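A short sketch of how such identifiers can be assigned and used: one traversal numbers every node with its preorder and postorder rank, after which the ancestor test is two integer comparisons. The tree below is hypothetical (it is not the one in the slide's figure), and node labels are assumed unique so they can serve as dictionary keys.

```python
def number_tree(tree):
    """Assign (preorder rank, postorder rank) to every node.

    tree is a nested tuple (label, [children]); labels are assumed unique.
    """
    ids = {}
    counters = {"pre": 0, "post": 0}

    def visit(node):
        label, children = node
        counters["pre"] += 1
        pre = counters["pre"]          # rank when the node is first reached
        for child in children:
            visit(child)
        counters["post"] += 1          # rank when the node is left
        ids[label] = (pre, counters["post"])

    visit(tree)
    return ids

def is_ancestor(x, y):
    """x is an ancestor of y iff pre(x) < pre(y) and post(x) > post(y)."""
    return x[0] < y[0] and x[1] > y[1]

# Hypothetical tree: A has children B and E; B has children C and D.
tree = ("A", [("B", [("C", []), ("D", [])]), ("E", [])])
ids = number_tree(tree)
print(ids)
print(is_ancestor(ids["B"], ids["D"]))
```

Because the test needs only the two identifiers, structural conditions such as product//components can be evaluated on index entries without touching the stored documents.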
57 Patterns and Indexes Pattern: product, name, description, camera Index entries: (d1, 12, 200), (d1, 201, 400); (d1, 1, 11), (d1, 205, 224), (d1, 228, 237); (d1, 229), (d2, 14) Heuristic: to perform joins, start with the smallest cardinality (to minimize the size of intermediate results)
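The smallest-cardinality heuristic can be illustrated on a deliberately simplified case: intersecting sets of document identifiers. Xyleme's joins are structural and operate on element identifiers, so this is only a sketch of the ordering idea; the posting sets and the function name are hypothetical.

```python
def join_smallest_first(posting_sets):
    """Intersect posting sets, starting with the smallest one so that
    intermediate results never exceed the smallest input."""
    if not posting_sets:
        return set()
    ordered = sorted(posting_sets, key=len)
    result = set(ordered[0])
    for postings in ordered[1:]:
        result &= postings
        if not result:
            break                      # nothing left to join
    return result

# Hypothetical postings: documents containing each term of the pattern.
product = {"d1", "d2", "d3", "d4"}
description = {"d1", "d2", "d4"}
camera = {"d1", "d2"}
matches = join_smallest_first([product, description, camera])
print(sorted(matches))
```

Starting from `camera` bounds every intermediate result by two entries, whereas starting from `product` would carry four entries into the first join.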
58 5. Change Control
59 The Web changes all the time Data acquisition + maintenance –keep the warehouse up-to-date Version management –representation and storage of changes Change monitoring –query subscription
60 Versions Version some documents or some sites Version some continuous queries continuous query: query that is evaluated regularly get each Monday the list of movies showing in Paris
61 Representing Versions: Deltas Version storage –current document –persistent identifiers for elements –description of changes - completed deltas Deltas are XML documents Changes can be processed like other data –exchanged: send me changes since June 1st! –queried: what are the products inserted since 2/1/99?
62 Completed Delta DVD 500 <move xid=16 new_parent=11 new_position=2 old_parent=11 old_position=1 />... persistent identifier
63 Query Subscription
Users may subscribe to certain events, e.g.,
–changes in a page or a set of pages
–changes in pages from a particular semantic domain, containing some specific words, or with a particular DTD
–changes of particular elements somewhere (new products in a catalog)
Users may request to be notified
–immediately, at the time the event is detected
–regularly, e.g., weekly
–after a certain number of event detections
64 Example
subscription myPariscope
% what are the new movie entries in the Pariscope site
monitoring newMovies
select URL
where URL extends and new(self)
% manage the changes in the movies showing in Paris
continuous delta Showing
select... from... where
when daily
notify daily % send me a daily report
65 Step 1: Atomic Event Detection
Figure: loading of a document d (5 million pages/day); HTML parser, XML loader, metadata manager; document & alerts sent to complex event detection
atomic event 46: URL matches pattern
atomic event 67: XML document contains the tag painter
Example alerts for document d: d/46; d/46,67
66 Step 2: Complex Event Detection
Figure: HTML parser, XML loader, complex event detection
complex event 12: 67 & 46 (XML document contains the tag painter and URL matches pattern)
Millions of alerts per day
Millions of subscriptions
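Treating a complex event as a conjunction of atomic events, detection amounts to finding every subscription whose atomic events are all present in a document's alert. The sketch below indexes complex events by their atomic events so that only candidates sharing at least one event are checked. The class and method names are hypothetical, as is complex event 13; Xyleme's actual algorithm is not described on the slide.

```python
from collections import defaultdict

class ComplexEventDetector:
    """Complex event = conjunction of atomic event ids."""

    def __init__(self):
        self.required = {}                 # complex id -> frozenset of atomic ids
        self.by_atomic = defaultdict(set)  # atomic id -> complex ids that use it

    def subscribe(self, complex_id, atomic_ids):
        self.required[complex_id] = frozenset(atomic_ids)
        for atomic_id in atomic_ids:
            self.by_atomic[atomic_id].add(complex_id)

    def detect(self, atomic_events):
        """Return the complex events fully covered by one document's alert."""
        seen = set(atomic_events)
        candidates = set()
        for atomic_id in seen:
            candidates |= self.by_atomic.get(atomic_id, set())
        return {c for c in candidates if self.required[c] <= seen}

detector = ComplexEventDetector()
detector.subscribe(12, [67, 46])   # complex event 12 from the slide
detector.subscribe(13, [46, 99])   # hypothetical second subscription
print(detector.detect([46, 67]))
```

With millions of subscriptions, restricting the check to candidates reached through the atomic-event index avoids scanning every subscription for every alert.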
67 Step 3: Notification Processor
Figure: complex event detection, alerts, clock, triggers, continuous queries, notification processor, notification/monitoring, notification/results
Millions of notifications/day
68 Conclusion
69 One Question Only The web is turning from a large collection of documents into a huge knowledge base When will I be able to get the precise knowledge I need? Database + Knowledge Base + Linguistic +...
70 Thank you