Part One XML and Databases Soumen Chakrabarti CSE, IIT Bombay.


2 Part One XML and Databases Soumen Chakrabarti CSE, IIT Bombay

3 Form and content
The Web today
–HTML generated by hand, WYSIWYG editors, ‘webified’ databases
–HTML specifies rendering for human reading
–Screen scraping required to consolidate data
The Web in the future
–Common interchange format (XML)
–Concentrate on content, not form
–Represent data class broader than relations

4 Role of databases
Contribute
–Data storage and indexing
–Query processing and optimization
–Views, transformations, integration
Adopt
–Search modalities
–Content-based approximate search
–Linguistic analysis

5 Features of semi-structured data
No explicit schema, or volatile schema
Schema size comparable to data size
Structure changes without notice
Heterogeneous, deeply nested, irregular
Has nature of documents rather than tables

6 Semi-structured data model example
Object Exchange Model (OEM)
[Figure: OEM graph rooted at Bib (&o1) with objects &o12, &o24, &o29, &o43, &96, &243, &206, &25. Top-level edge labels: paper, book, paper. Other edge labels: references, author, title, year, http, publisher, page, firstname, lastname, first, last. Atomic values: “Serge”, “Abiteboul”, 1997, “Victor”, “Vianu”, 122, 133. Legend: complex object, atomic object.]

7 Syntax
{ paper: { author: “Abiteboul”,
           author: { firstname: “Victor”, lastname: “Vianu” },
           title: “Regular path queries …”,
           page: { first: 122, last: 133 } } }

8 Some observations
Missing or additional attributes
Multiple attributes
Different types in different objects
Heterogeneous collections

9 Object IDs and references
[Figure: objects Jane, Mary, and John with object IDs o555, o456, and o123, connected by children and mother references.]

10 Names and acronyms
OEM (Object Exchange Model): a semi-structured data model from Stanford, 1995
Lore: a system for storing data adhering to the OEM
Lorel: a query language for Lore
XML (eXtensible Markup Language): a simplification of SGML and a generalization of HTML
XML-QL: a query language for XML

11 Lorel query examples
select Bib.paper.title from Bib.paper where Bib.paper.year > 1995
select X.title from Bib.paper X, Bib.(paper|book) Y where Y.author.lastname? = “Ullman” and Y.reference+ X
Callouts: alternative (paper|book); transitive closure (reference+); navigating partially known structures
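
To make the path operators concrete, here is a rough Python sketch that evaluates both queries over a small edge-labeled graph. It is not Lore's implementation. The graph contents, the object names, and the helpers `step`, `plus`, and `optional` are all hypothetical.

```python
# Hypothetical OEM graph: edges[oid] lists (label, target oid) pairs,
# atoms[oid] holds the value of an atomic object.
edges = {
    "bib": [("paper", "p1"), ("paper", "p2"), ("book", "b1")],
    "p1": [("title", "t1"), ("year", "y1")],
    "p2": [("title", "t2"), ("year", "y2"), ("reference", "p1")],
    "b1": [("title", "t3"), ("author", "a1"), ("reference", "p2")],
    "a1": [("lastname", "n1")],
}
atoms = {"t1": "A", "y1": 1994, "t2": "B", "y2": 1997, "t3": "C", "n1": "Ullman"}


def step(nodes, labels):
    """One edge step: targets reachable by any of the given labels (alternation)."""
    return {t for n in nodes for (lbl, t) in edges.get(n, []) if lbl in labels}


def plus(nodes, label):
    """label+ : one or more steps along the label (transitive closure)."""
    reached, frontier = set(), step(nodes, {label})
    while frontier:
        reached |= frontier
        frontier = step(frontier, {label}) - reached
    return reached


def optional(nodes, label):
    """label? : zero or one step along the label."""
    return set(nodes) | step(nodes, {label})


def recent_titles(root, min_year):
    """select Bib.paper.title from Bib.paper where Bib.paper.year > min_year"""
    result = []
    for paper in sorted(step({root}, {"paper"})):
        if any(atoms[y] > min_year for y in step({paper}, {"year"})):
            result += [atoms[t] for t in sorted(step({paper}, {"title"}))]
    return result


def titles_cited_by(root, lastname):
    """Titles of papers X reachable by reference+ from a paper or book Y by lastname."""
    papers = step({root}, {"paper"})
    titles = set()
    for y in step({root}, {"paper", "book"}):
        names = optional(step({y}, {"author"}), "lastname")
        if any(atoms.get(n) == lastname for n in names):
            for x in plus({y}, "reference") & papers:
                titles |= {atoms[t] for t in step({x}, {"title"})}
    return sorted(titles)


recent = recent_titles("bib", 1995)
cited = titles_cited_by("bib", "Ullman")
```

In this toy graph the book b1 references p2, which references p1, so both papers are reached through reference+.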

12 XML-QL query examples
where Morgan Kaufmann $a in “www.a.b.c/bib.xml” construct $a
where $a in “www.a.b.c/bib.xml” construct $a $l

13 XML storage in ternary relation
[Figure: object graph over &o1 to &o6 with edge labels paper, title, author, year and atomic values “The Calculus”, “…”, “1986”, stored in two tables named Ref and Val.]
Too many joins
Label name storage redundant
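
The join problem can be illustrated with a small sketch using Python's sqlite3 module. The column names (src, label, dst, oid, value), the exact edge layout, and the author value are assumptions made for illustration. The slide only names the tables Ref and Val.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Ref (src TEXT, label TEXT, dst TEXT);  -- one row per edge
    CREATE TABLE Val (oid TEXT, value TEXT);            -- one row per atomic object
""")
# Assumed layout: o1 -paper-> o2, and o2 has title, author, and year children.
conn.executemany("INSERT INTO Ref VALUES (?, ?, ?)", [
    ("o1", "paper", "o2"),
    ("o2", "title", "o4"),
    ("o2", "author", "o5"),
    ("o2", "year", "o6"),
])
conn.executemany("INSERT INTO Val VALUES (?, ?)", [
    ("o4", "The Calculus"),
    ("o5", "(hypothetical author)"),
    ("o6", "1986"),
])

# "Title of every paper whose year is 1986": each path step becomes a
# self-join on Ref, and each value lookup becomes a join with Val.
rows = conn.execute("""
    SELECT tv.value
    FROM Ref p
    JOIN Ref t  ON t.src = p.dst AND t.label = 'title'
    JOIN Val tv ON tv.oid = t.dst
    JOIN Ref y  ON y.src = p.dst AND y.label = 'year'
    JOIN Val yv ON yv.oid = y.dst
    WHERE p.label = 'paper' AND yv.value = '1986'
""").fetchall()
titles = [row[0] for row in rows]
```

Even this two-attribute lookup needs four joins, and every Ref row repeats a label string, which is the redundancy the slide points out.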

14 Storage optimization through mining
[Figure: table with labels paper, author, title, year, fn, ln and rows Paper1, Paper2.]
Inline common cases
Tolerate a few nulls

15 Schema extraction
Schema: a template for type/semantics specification
Conformance
–Does the data conform to a given schema?
Classification
–If so, which objects belong to which classes/types?
Applications
–Storage and query optimization

16 Graph simulation
Given two edge-labeled graphs G1 and G2, a simulation is a relation R between nodes such that if (x1, x2) is in R and (x1, a, y1) is in G1, then there exists (x2, a, y2) in G2 (same label) such that (y1, y2) is in R
[Figure: edge x1 -a-> y1 in G1 and edge x2 -a-> y2 in G2, with R relating x1 to x2 and y1 to y2.]
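
The definition suggests a simple fixed-point computation: start with every pair of nodes and delete pairs that violate the condition until nothing changes. The sketch below is a naive, unoptimized version of that idea. The function name `max_simulation` and the example graphs are hypothetical.

```python
def max_simulation(g1, g2):
    """Return the largest simulation from g1 to g2.

    Each graph is a set of (source, label, target) triples.
    """
    nodes1 = {n for (x, _, y) in g1 for n in (x, y)}
    nodes2 = {n for (x, _, y) in g2 for n in (x, y)}
    rel = {(a, b) for a in nodes1 for b in nodes2}  # start with all pairs
    changed = True
    while changed:
        changed = False
        for (x1, x2) in list(rel):
            for (src, label, y1) in g1:
                if src != x1:
                    continue
                # x2 must have an edge with the same label to some y2 with (y1, y2) in rel
                matched = any(
                    (y1, y2) in rel
                    for (src2, label2, y2) in g2
                    if src2 == x2 and label2 == label
                )
                if not matched:
                    rel.discard((x1, x2))
                    changed = True
                    break
    return rel


# Hypothetical data graph and schema graph
data = {("r", "employee", "p1"), ("p1", "manages", "p2")}
schema = {("Root", "employee", "Emp"), ("Emp", "manages", "Emp")}
sim = max_simulation(data, schema)
```

Here r is simulated by Root but not by Emp, because Emp has no outgoing employee edge. The leaf p2 has no outgoing edges, so it is simulated by every schema node.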

17 Upper and lower bound schema
Lower bound schema
–Conformance: find simulation R from S to D
–Classification: check if (c,x) in R
–Used in storage optimization
Upper bound schema (data guides)
–Conformance: find simulation R from D to S
–Classification: check if (x,c) in R
–Used in path index generation and query optimization

18 Sample data
[Figure: data graph with root &r, company &c, and employees &p1 to &p8. Edge labels: company, employee, worksfor, manages, managedby.]

19 Lower bound schema
Root: &r
Bosses: &p1, &p4, &p6
Regulars: &p2, &p3, &p5, &p7, &p8
Company: &c
Edge labels: company, employee, manages, managedby, worksfor

20 Storage using lower bound schema
[Figure: lower-bound schema with nodes Root, Company, Employee, and string. Edge labels: company, person, works-for, c.e.o., address, name, managed-by.]
Store rest in overflow graph

21 Upper bound schema (DataGuides)
Root: &r
Employees: &p1, &p2, &p3, &p4, &p5, &p6, &p7, &p8
Bosses: &p1, &p4, &p6
Regulars: &p2, &p3, &p5, &p7, &p8
Company: &c
Edge labels: company, employee, manages, managedby, worksfor
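
One common way to build a DataGuide resembles the subset construction for automata: each schema node is the set of data objects reached by some label path from the root. The sketch below assumes that construction. The function name `build_dataguide` is hypothetical, and the graph is a three-employee reduction of the sample data, not the slide's exact figure.

```python
from collections import deque


def build_dataguide(edges, root):
    """Build a deterministic path summary of a data graph.

    edges maps a node to a list of (label, target) pairs. Returns (start, guide),
    where guide maps each state (a frozenset of data nodes) to {label: next state}.
    """
    start = frozenset([root])
    guide = {}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state in guide:  # already expanded
            continue
        by_label = {}
        for node in state:
            for label, target in edges.get(node, []):
                by_label.setdefault(label, set()).add(target)
        guide[state] = {lbl: frozenset(ts) for lbl, ts in by_label.items()}
        for nxt in guide[state].values():
            if nxt not in guide:
                queue.append(nxt)
    return start, guide


# Hypothetical reduced sample: one boss (p1) and two regular employees (p2, p3)
edges = {
    "r": [("company", "c"), ("employee", "p1"), ("employee", "p2"), ("employee", "p3")],
    "p1": [("manages", "p2"), ("manages", "p3"), ("worksfor", "c")],
    "p2": [("managedby", "p1"), ("worksfor", "c")],
    "p3": [("managedby", "p1"), ("worksfor", "c")],
}
start, guide = build_dataguide(edges, "r")
```

On this input the states correspond to Root, Company, Employees, Bosses, and Regulars, and every label path that occurs in the data occurs exactly once in the guide.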

22 Query optimization issues
Select x from A.B x where exists y in x.C: y=5
[Figure: example trees with nodes labeled A, B, C, D and leaf values 4 and 5.]

23 What makes the problem difficult
Selectivity estimation
Index selection
Access cost models
Clustering choices

24 Part Two Information Retrieval and Databases Soumen Chakrabarti CSE, IIT Bombay

25 Information retrieval (IR)
Search
–‘Inverted’ index
–Boolean match
–Relevance ranking
Classification
–Learn topics from examples
Clustering
–Discover topics from a document collection
Never done inside a relational database
[Figure: inverted index for the terms cat and dog with positional postings D5: 3, 37, 50; D7: 9, 20; D7: 7, 90, 400; D20: 22, 533.]
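
A positional inverted index and a Boolean AND match can be sketched in a few lines of Python. The documents, their contents, and the helper names `build_index` and `boolean_and` are made up for illustration. Real engines add tokenization, compression, and ranking on top of this.

```python
from collections import defaultdict


def build_index(docs):
    """Map each word to {doc_id: [positions]} (a positional inverted index)."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index


def boolean_and(index, *terms):
    """Return the ids of documents that contain every term."""
    postings = [set(index[t]) if t in index else set() for t in terms]
    return set.intersection(*postings) if postings else set()


# Hypothetical three-document collection
docs = {
    "D5": "the cat sat",
    "D7": "cat chases dog",
    "D20": "dog barks at dog",
}
index = build_index(docs)
both = boolean_and(index, "cat", "dog")
```

The position lists are what make phrase and proximity queries possible, at the cost of a larger index.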

26 Current style of loose integration
RDBMS provides hooks
Declare some columns as textual with keyword index
Inserts, updates, and deletes trigger external program, e.g., Verity search engine
Search engine maintains separate indices
Simple query rewriting to combine relational and text-match where-clauses

27 Reasons
Space
–BLOB vs. pure relational representation
–Average English word is only 5 bytes
Time
–Most text engines are resigned to a flexible (i.e., no) model for data consistency
–Much faster read-only access than relational database lookups

28 New features desired
Operations that are more complex than keyword search can benefit from tighter coupling with RDBMS
Approximate search is essential (Anand Rajaraman, Amazon.com, SIGMOD 99)
–Misspelling book title, author name common
–Variant of OEM edge label (author/writer/poet)
Similarity extends to structure as well (‘Travolta’ NEAR ‘Cage’ = ‘Face/Off’)

29 Case study: generalized ‘like’
SQL has limited string matching constructs
–like ‘%x’, ‘x%’, ‘%x%’
–x must be exact match
Need more lenient match
–Applications: LDAP, IR
String edit distance is not suitable
–“Given query, order strings in database in increasing order of edit distance and pick top 5”

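30 Sliding-window matching
[Figure: database strings “nascent” and “pascal” broken into 3-grams (nas, asc, sce, cen, ent, pas, sca, cal); query “rascal” with first 3-gram ras.]
Given a query, scan to get a set of 3-grams
Similarity of string in database to query = number of shared 3-grams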
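
The similarity measure can be sketched directly in Python. The function names `trigrams`, `similarity`, and `top_matches` are hypothetical, and this version scans every string. A real system would look up candidates through an index on 3-grams instead.

```python
def trigrams(s):
    """Set of 3-grams obtained by sliding a window of width 3 over s."""
    return {s[i:i + 3] for i in range(len(s) - 2)}


def similarity(query, candidate):
    """Number of 3-grams shared by the query and a database string."""
    return len(trigrams(query) & trigrams(candidate))


def top_matches(query, strings, k=5):
    """Rank database strings by shared 3-grams, best first, ties broken alphabetically."""
    return sorted(strings, key=lambda s: (-similarity(query, s), s))[:k]


database = ["nascent", "pascal"]
ranked = top_matches("rascal", database)
```

For the query “rascal”, “pascal” shares asc, sca, and cal, while “nascent” shares only asc, so “pascal” ranks first.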

31 Issues
Minimally disruptive architecture
Low storage overheads
Fast query processing
Good selectivity estimates
Combining with other predicates for ranking
Efficiently handling updates

