Omar Benjelloun - New Bases for New Data (Stanford University, January 27th, 2006)

1 New Bases for New Data
Omar Benjelloun, Stanford University, January 27th, 2006

2 Relational databases are great
- A simple, understandable model for data
- High-level, declarative language for queries and updates: SQL
- Efficient optimization techniques
- Relational databases are the cornerstone of the management of homogeneous, regular, exact, centralized information
[Figure: sample relation Boss(Emp, Manager) with the names Joe, Bill and Steve]

3 … but data has changed
- Data is distributed, behind applications, dynamically changing
- Data is heterogeneous
- Data may be uncertain
- Today
  - Data is stored in relational databases (or XML)
  - Techniques for data integration, data exchange …
  - Lots of code
- Traditional Database Management Systems (DBMS’s) are too rigid
  - New characteristics should be represented in the data
- New bases are needed
  - Foundations (models and languages)
  - Processing and optimization techniques

4 Applications
- Information integration
  - Data is distributed over multiple heterogeneous, independent sources
  - Conflicting information from the sources: inconsistency, uncertainty
  - Varying and evolving reliability of sources
  - Where data came from can be critical information
- Scientific data management
- Receptor (e.g., sensor) data management
- Data cleaning (entity resolution)
- And many others…

5 Agenda
- Distributed and dynamic data: Active XML
  - A “glue” language to connect data and programs
  - XML documents with embedded calls to Web services
  - Distributed interactions through the exchange of AXML data
  - Techniques to query and control the exchange of AXML data
- Uncertain data: ULDB’s
  - An extension of the relational model with uncertainty and lineage
  - Efficient query evaluation
  - Computing probabilities
- Conclusion

6 Active XML

7 Distributed data management
- Information is everywhere: data warehouses, databases, Web sites, PCs, PDAs, cell phones, home appliances, cars…
[Figure: peers connected through the Internet, exchanging XML via Web services]

8 The golden triangle of distributed data management
- XML: a standard for data representation & exchange
  - Extensible Markup Language
  - Labeled ordered trees
  - Rich types: XML Schema
- Query languages
  - XPath, XQuery
- Web services
  - Standards for distributed computing
[Figure: triangle labeled XML, XPath, XQuery, SOAP, WSDL]

9 What is Active XML (AXML)?
AXML is a declarative language for distributed information management and an infrastructure to support this language, in a peer-to-peer framework.

10 Active XML documents
- XML documents with embedded calls to Web services
- Intensional
  - Some of the data is given explicitly
  - Some is given intensionally (i.e. the means to acquire data when needed are given)
- Dynamic
  - If the external sources change, the same document will provide different information
  - Reaction to world changes

11 Not a new idea in databases, nor on the Web
- Mixing calls to data is an old idea
  - Procedural attributes in relational systems
  - Basis of Object-oriented Databases
- In Web programming
  - Sun’s JSP, PHP+MySQL
- Calls to Web services inside documents
  - Macromedia FLEX, Apache Jelly, Microsoft XAML
- What is new is the exploitation of the idea…

12 Web services in brief
- A number of standards
  - XML
  - SOAP: Exchange of messages between applications
  - WSDL: Description of service interfaces (e.g. input/output types)
  - UDDI: Advertisement and discovery of services
  - … other proposed standards (choreography, security, etc.)
- For us: means to provide, invoke and describe remote functions with XML input/output.
- They make AXML documents universally understandable.

13 A sample AXML document
[Figure: document tree rooted at newspaper, with children title (“Le Monde”), date (“06/10/2003”), a call to GetTemp with parameter city (“Paris”), and a call to GetEvents with parameter “Exhibits”]
- AXML documents may contain calls:
  - to any existing Web services (e-bay.net, google.com…)
  - to any AXML Web services (to be defined)

14 Materialization
[Figure: the sample newspaper document; a SOAP call sends GetTemp (city “Paris”) to Y! and the answer, temp “16°C”, takes the place of the call]
- Replacing the call by its result is not the only option
- Calls are not necessarily RPC-style synchronous invocations
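
To make the materialization step concrete, here is a minimal Python sketch. It is not the AXML peer's implementation: the `<call service="...">` serialization, the `materialize` helper and the `get_temp` stub (which stands in for the remote SOAP service) are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical serialization: <call service="..."> marks an embedded service call.
DOC = """<newspaper>
  <title>Le Monde</title>
  <date>06/10/2003</date>
  <call service="GetTemp"><city>Paris</city></call>
  <call service="GetEvents">Exhibits</call>
</newspaper>"""


def get_temp(call):
    """Stub standing in for a remote Web service; always answers 16°C."""
    temp = ET.Element("temp")
    temp.text = "16°C"
    return temp


def materialize(root, services):
    """Replace, in place, every call whose service has a local handler."""
    for parent in list(root.iter()):
        for index, child in enumerate(list(parent)):
            name = child.get("service")
            if child.tag == "call" and name in services:
                parent.remove(child)
                parent.insert(index, services[name](child))
    return root


# Only GetTemp is materialized; the GetEvents call stays intensional.
root = materialize(ET.fromstring(DOC), {"GetTemp": get_temp})
remaining = [c.get("service") for c in root.iter("call")]
```

Passing only some handlers leaves the other calls in the document, which is one way to picture the choice between materializing a call and forwarding it untouched.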

15 AXML Web services
- Parameters: AXML data
- Result: AXML data
- Distribute computations: by sending as parameters data containing service calls, one can delegate some work to other peers.
- Partial computations: by returning data containing service calls, one can give to the receiver the control of these calls.
- Great flexibility

16 Distributed interactions

17 Exchanging Active XML

18 To call or not to call?
- Materialization can be performed
  - by the sender, before sending a document…
  - or by the receiver, after receiving it.
[Figure: the newspaper document shown twice, with the GetTemp call (city “Paris”) answered by Y! as temp “16°C” either before or after the exchange]

19 Why control the materialization of calls?
- For added functionality, e.g.
  - Intensional data makes it possible to get up-to-date information.
- For security reasons or capabilities, e.g.
  - I don’t trust this Web service/domain,
  - I don’t have the right credentials to invoke it,
  - It costs money,
  - Maybe the receiver doesn’t know Active XML!
- For performance reasons, e.g.
  - A proxy can invoke all the services on behalf of a PDA.
- … and many more reasons you can think of!

20 How to control it? Using types
- We extend XML Schema with intensional types: XMLSchema_int
- Static analysis algorithms use signatures of services: WSDL_int
[Figure: a sender and a receiver, each with its own capabilities, ACL, cost…, agree on a data exchange schema; document trees containing the calls f, g, q and r are rewritten to conform to it]

21 The extended schema language
- To simplify, we use here a DTD-like syntax
- Data:
    newspaper = title.date.(GetTemp|temp).(GetEvents|exhibit*)
    title = data
    date = data
    temp = data
    city = data
    exhibit = title.(GetDate|date)
- Functions:
    GetTemp(city) -> temp
    GetEvents(data) -> (exhibit|performance)*
    GetDate(title) -> date
- Rewriting: replace call(s) by an arbitrary output of the service.
[Figure: the sample newspaper document tree, with title “Le Monde”, date “06/10/2003”, GetTemp (city “Paris”) and GetEvents (“Exhibits”)]
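
As a rough illustration of what the `newspaper` type accepts, the content model can be transcribed into an ordinary regular expression over child labels. The encoding below (each label followed by a space) and the `conforms` helper are assumptions of this sketch, not part of the schema language.

```python
import re

# newspaper = title.date.(GetTemp|temp).(GetEvents|exhibit*)
# Each child label is encoded as "label " so that labels cannot run together.
NEWSPAPER = re.compile(r"title date (GetTemp |temp )(GetEvents |(exhibit )*)")


def conforms(children):
    """True if the sequence of child labels matches the newspaper type."""
    encoded = "".join(label + " " for label in children)
    return NEWSPAPER.fullmatch(encoded) is not None


intensional = conforms(["title", "date", "GetTemp", "GetEvents"])
materialized = conforms(["title", "date", "temp", "exhibit", "exhibit"])
rejected = conforms(["title", "date", "temp", "performance"])
```

Both the fully intensional document and a fully materialized one match, while a `performance` child does not, since the type only allows `exhibit*` in that position.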

22 Rewritings
- The goal: given an AXML document d and a schema s, can we rewrite d so that it matches s?
- Safe rewriting: one that for sure leads to s (we know without making any call)
- Possible rewriting: one that may lead to s (depending on the answers of services)
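
The difference between safe and possible rewritings can be sketched by brute force. This is not the automata-based algorithm of the talk: it assumes each service has a small finite list of possible outputs (`OUTPUTS` is a hypothetical sample of the real, infinite output types), works left to right at depth 1, and checks targets with plain regular expressions.

```python
import re

# Assumption: finite samples of what each service may return.
OUTPUTS = {
    "GetTemp": [["temp"]],
    "GetEvents": [[], ["exhibit"], ["exhibit", "exhibit"], ["performance"]],
}


def rewritable(word, target, safe, i=0, prefix=()):
    """Left-to-right search: at each call, either keep it or invoke it.

    safe=True  -> an invoked call must work for ALL sampled outputs.
    safe=False -> it is enough that SOME sampled output works.
    """
    if i == len(word):
        return re.fullmatch(target, " ".join(prefix)) is not None
    symbol = word[i]
    # Option 1: leave the symbol (data or call) as it is.
    if rewritable(word, target, safe, i + 1, prefix + (symbol,)):
        return True
    if symbol not in OUTPUTS:
        return False
    # Option 2: invoke the call and continue with its output.
    quantifier = all if safe else any
    return quantifier(
        rewritable(word, target, safe, i + 1, prefix + tuple(out))
        for out in OUTPUTS[symbol]
    )


WORD = ["title", "date", "GetTemp", "GetEvents"]
STRICT = r"title date temp( exhibit)*"                # no calls allowed
LOOSE = r"title date temp( GetEvents|( exhibit)*)"    # GetEvents may stay

strict_safe = rewritable(WORD, STRICT, safe=True)
strict_possible = rewritable(WORD, STRICT, safe=False)
loose_safe = rewritable(WORD, LOOSE, safe=True)
```

Against `STRICT`, the rewriting is only possible, because GetEvents might return a `performance`; against `LOOSE`, it is safe, since GetTemp always yields `temp` and GetEvents can be left uncalled.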

23 Difficulties
- Infinite search space
  - Vertical
  - Horizontal
- Main problem
  - The result of a Web service call is unknown
  - We just know a signature (input/output types)
- We want a very efficient solution
- Foundations of the problem
  - String & tree automata, with existential and universal transitions.

24 Results
- The general problem is undecidable [MSS03]
- Restrictions on the considered rewritings
  - Left-to-right: No “going back and forth”
  - K-depth: bound on the nesting of function calls
  - (Search space still infinite but finitely representable)
- Under these restrictions
  - We have algorithms to find safe/possible rewritings.
  - They are PTIME (for deterministic schemas).
  - We can also do it between schemas.
- Implementation
  - demo at VLDB 2003 (customizable news syndication)

25 Safe rewriting algorithm (flavor)
- Build an FSA that accepts all k-depth rewritings of the initial word.
- Build an FSA that recognizes the complement of the target type.
[Figure: two automata, one over states q0 to q7 and one over states p0 to p6, with transitions labeled title, date, temp, exhibit, performance, GetTemp, GetEvents and *]

26 Safe rewriting algorithm
- Compute the intersection of these languages:
  - A smart marking determines whether a safe rewriting exists.
  - Then run the word on the marked automaton to find an actual rewriting.
- Optimizations:
  - lazy construction of the automata
  - parallel evaluation of calls
[Figure: product automaton over state pairs such as (q0,p0), (q1,p1), (q2,p2), (q3,p3), (q4,p4), with transitions labeled title, date, temp, GetEvents, GetTemp, performance and exhibit]

27 Querying Active XML

28 Querying AXML Data
- Given a (tree pattern) query: /newspaper[temp > 18°C]/exhibits//exhibit[location=“Le Louvre”]
- Materialize the document?
- Call only the services that may contribute data to the query answer.
- The problem: lazy evaluation of service calls
- To call or not to call, this time when evaluating a query
[Figure: newspaper document with title “Le Monde”, calls to getDate, GetTemp (city “Paris”) and GetEvents (“Exhibits”), an exhibits element containing a call to GetExhibits (“Paris”), and the value temp “19°C”]

29 Lazy evaluation
- Difficulties:
  - Calls can be found everywhere in the document
  - May appear dynamically (as a result of previous calls)
  - May become (ir)relevant due to previous invocations
  - Need to take signatures of calls into consideration
- A possible approach: modify the query processor
  - Top-down evaluation
  - Trigger the calls found on the way
  - Not so great:
    - Computation is blocked
    - Optimization opportunities are lost

30 NFQ’s
- Given a query to evaluate:
  - Derive a set of “node-focused” queries (NFQ), that find the relevant calls when evaluated on the document.
  - Need to be reevaluated, as the document evolves!
[Figure: the tree pattern query (newspaper, temp > 18°C, exhibits, exhibit, location “Le Louvre”) next to NFQs derived from it, in which some nodes are replaced by *, etc.]
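
The idea of finding only the relevant calls can be sketched for a much simpler query class. The code below is not the NFQ derivation: it assumes linear child-only paths (no predicates, no `//`), a hypothetical `<call service="...">` serialization, and a `SIGNATURES` table giving one output label per service, none of which come from the talk.

```python
import xml.etree.ElementTree as ET

DOC = """<newspaper>
  <title>Le Monde</title>
  <call service="GetTemp"><city>Paris</city></call>
  <exhibits>
    <call service="GetExhibits">Paris</call>
  </exhibits>
</newspaper>"""

# Assumed signatures: the label of the elements each service returns.
SIGNATURES = {"GetTemp": "temp", "GetExhibits": "exhibit"}


def relevant_calls(node, path, depth=0):
    """Names of calls that could supply the next step of a linear path."""
    if depth >= len(path) or node.tag != path[depth]:
        return []
    found = []
    for child in node:
        if child.tag == "call":
            service = child.get("service")
            wanted = path[depth + 1] if depth + 1 < len(path) else None
            # Services with no known signature are conservatively kept.
            if wanted is not None and SIGNATURES.get(service, wanted) == wanted:
                found.append(service)
        else:
            found.extend(relevant_calls(child, path, depth + 1))
    return found


root = ET.fromstring(DOC)
for_exhibits = relevant_calls(root, ["newspaper", "exhibits", "exhibit"])
for_temp = relevant_calls(root, ["newspaper", "temp"])
```

Because invoking a call may add new calls to the document, such a check would have to be rerun after each invocation, which matches the remark that NFQs need to be reevaluated as the document evolves.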

31 Optimizations
- Service calls sequencing
  - Analysis of the relationship between calls (through the NFQ’s)
  - Layering, and parallelization inside each layer.
- Filtering by type analysis
  - Match output types of services to the data expected by queries
- “Pushing” queries to capable services
- Acceleration:
  - Via relaxation:
    - NFQ approximation
    - Superset of the relevant calls
  - Via a special access structure, similar to a DataGuide:
    - Restricted to paths that lead to service calls
    - Indexes the calls
- Experimental assessment
  - 10x speed-up when combining optimizations

32 There is more…
- The AXML peer system
  - Manages persistent AXML documents
  - Provides AXML services
  - Open source
- Language extensions to control the activation of calls
- Continuous services
- Theoretical foundations
- …check out http://www.activexml.net

33 Uncertain data

34 Basic Premise
- Traditional relational DB
  - Every data item’s value must be exact
  - Every data item is in the database or not
  - Where data came from and how it evolves is not important
- ULDB’s relax these constraints by making
  1. Data
  2. Uncertainty
  3. Lineage
  all first-class interrelated concepts

35 Previous work
- Models for uncertainty
  - Labeled nulls, c-tables, probabilistic models, ...
  - Trade-off between expressiveness, simplicity of representation, and complexity of operations
  - We investigated this space in [DBHM06]
- Models for lineage
  - In relational databases, data warehouses
  - Definition of lineage can be tricky for complex queries
- First to consider lineage together with uncertainty

36 Uncertainty
[Figure: relation SAW(Witness, Car) built from the tuples (Granny, VW), (Granny, BMW), (Cop, Ford) and (Cop, VW), annotated with the labels “x-tuple”, “alternate” and “maybe” (‘?’), next to the possible worlds it represents]
- Simple formalism
  - not complete
  - not closed under joins
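
The possible-worlds reading of x-tuples can be sketched in a few lines of Python. The data below is illustrative only (the exact x-tuples of the slide cannot be recovered from the transcript), and the representation of an x-tuple as an `(alternates, maybe)` pair is an assumption of this sketch.

```python
from itertools import product

# Illustrative data: Granny saw a VW or a BMW; the Cop maybe saw a Ford.
SAW = [
    ([("Granny", "VW"), ("Granny", "BMW")], False),  # exactly one alternate holds
    ([("Cop", "Ford")], True),                       # '?': may be absent
]


def possible_worlds(xtuples):
    """Each world picks one alternate per x-tuple, or nothing if it is a maybe."""
    choices = [list(alts) + ([None] if maybe else []) for alts, maybe in xtuples]
    return [
        frozenset(t for t in pick if t is not None)
        for pick in product(*choices)
    ]


worlds = possible_worlds(SAW)
```

Here two alternates times two options for the maybe x-tuple give four possible worlds, each an ordinary relation.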

37 Lineage
- SAW(Witness, Car): (Granny, VW), (Cop, Ford)
- OWNS(Suspect, Car): (Chris, VW), (Chris, BMW), (Mike, VW), (Mike, Ford)
- ACCUSES(Witness, Suspect), computed from SAW and OWNS and projected on witness, suspect: (Granny, Chris), (Granny, Mike), (Cop, Mike)

38 ULDB’s
[Figure: the SAW, OWNS and ACCUSES relations of the Lineage slide, with the alternate (Granny, BMW) added to SAW, a second (Granny, Chris) tuple in ACCUSES, and ‘?’ annotations]

39 ULDB’s
[Figure: next step of the same example, with the same relations and ‘?’ annotations]

40 Properties
- ULDB’s are simple
  - x-tuples: set of alternate tuples, with or without ‘?’
  - lineage: associates with each alternate a set of alternates / external symbols
- ULDB’s are expressive
  - Complete: can represent any finite set of possible worlds (with lineage)
  - Simple implementation of monotonic queries, with correct lineages
  - Natural probabilistic extension
- ULDB’s are efficient
  - Query processing can use existing query optimizers
  - Tuple certainty/membership can be tested in polynomial time

41 Query processing

42 Querying ULDB’s
[Figure: commuting diagram. At the level of ULDB’s, the algorithm takes D to Q(D). At the level of relational databases (with lineage), D stands for the possible worlds D_1, D_2, …, D_n, and the query semantics yields Q(D_1), Q(D_2), …, Q(D_n).]
- Q(D_i): add query result as new relation and lineage to D_i

43 Algorithm
[Figure: the algorithm run on the SAW, OWNS and ACCUSES example (projection on witness, suspect), with additional tuples involving Kid, such as (Kid, Ford) and (Kid, Mike), and ‘?’ annotations on the results]

44 Properties
- Efficient algorithm
  - Query processing phase can use standard query optimizer
  - Lineages are easy to propagate
  - “Grouping” phase requires a single pass on the result
- Initial prototype
  - represents a ULDB as a relational DB
  - uses simple query rewriting techniques
- Algorithm works for any monotonic query (including SPJU queries)
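
A toy version of a join with lineage propagation may help picture the two phases. It is a sketch under assumptions, not the prototype's algorithm: x-tuples are `(alternates, maybe)` pairs, lineage is a set of `(relation, x-tuple index, alternate index)` triples, the data is illustrative, and grouping is folded into the nested loop instead of being a separate pass.

```python
from itertools import product

# Illustrative data. SAW has one uncertain x-tuple; OWNS is certain.
SAW = [([("Granny", "VW"), ("Granny", "BMW")], False), ([("Cop", "Ford")], False)]
OWNS = [([("Chris", "VW")], False), ([("Chris", "BMW")], False),
        ([("Mike", "VW")], False), ([("Mike", "Ford")], False)]


def accuses(saw, owns):
    """Join SAW and OWNS on car, project on (witness, suspect), keep lineage."""
    result = []
    for (i, (s_alts, s_maybe)), (j, (o_alts, o_maybe)) in product(
            enumerate(saw), enumerate(owns)):
        alts, all_join = [], True
        for (a, (witness, car)), (b, (suspect, owned)) in product(
                enumerate(s_alts), enumerate(o_alts)):
            if car == owned:
                lineage = {("SAW", i, a), ("OWNS", j, b)}
                alts.append(((witness, suspect), lineage))
            else:
                all_join = False
        if alts:
            # The result may be absent if an input may be absent, or if some
            # combination of input alternates does not join.
            result.append((alts, s_maybe or o_maybe or not all_join))
    return result


ACCUSES = accuses(SAW, OWNS)
```

With this data, (Granny, Chris) appears in two result x-tuples with different lineage, one through the VW alternate and one through the BMW alternate, so the lineage records that the two cannot hold in the same world.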

45 Probabilities

46 Probabilistic ULDB’s
- Semantics: As before, with a probability for each possible world
- Without lineages
  - Alternates of the same x-tuple correspond to disjoint events
  - Alternates of different x-tuples correspond to independent events
- Lineages
  - Capture correlations
  - Help propagate probabilities for query results
[Figure: SAW(Witness, Car) with the alternates (Granny, VW), (Granny, BMW), (Cop, Ford) and (Cop, VW), a ‘?’ annotation, and the probabilities 0.2, 0.5, 0.3, 0.7, 0.3]

47 Probabilistic query answering
- Compute queries as before
- Compute probabilities on demand
  - Traverse lineages transitively to the leaves
  - Combine probabilities of reached alternates
- Optimizations: memoize probabilities, efficiently detect ‘closest independent ancestors’
[Figure: lineage graph over ‘?’-annotated alternates, with the probabilities 0.2, 0.3, 0.4, 0.1, 0.3, 0.5 and 1]
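
The traversal to the leaves can be sketched as follows, assuming purely conjunctive lineage (a derived alternate exists exactly when all alternates in its lineage do). The identifiers, probabilities and the `BASE`/`LINEAGE` dictionaries are made up for illustration, and the ‘closest independent ancestors’ optimization is not attempted.

```python
from functools import lru_cache
from math import prod

# Base alternates: id -> (x-tuple it belongs to, probability). Illustrative.
BASE = {
    "saw_vw": ("saw_granny", 0.7),
    "saw_bmw": ("saw_granny", 0.3),
    "owns_vw": ("owns_chris_vw", 0.5),
}
# Derived alternates: id -> alternates their lineage points to.
LINEAGE = {
    "accuses": frozenset({"saw_vw", "owns_vw"}),
    "contradiction": frozenset({"accuses", "saw_bmw"}),
}


@lru_cache(maxsize=None)
def leaves(alt):
    """Base alternates reached by following lineage transitively (memoized)."""
    if alt in BASE:
        return frozenset({alt})
    return frozenset().union(*(leaves(parent) for parent in LINEAGE[alt]))


def probability(alt):
    reached = leaves(alt)
    xtuples = [BASE[leaf][0] for leaf in reached]
    # Two different alternates of one x-tuple are disjoint events.
    if len(set(xtuples)) < len(xtuples):
        return 0.0
    # Alternates of different x-tuples are independent events.
    return prod(BASE[leaf][1] for leaf in reached)


p_accuses = probability("accuses")
p_contradiction = probability("contradiction")
```

Multiplying at the leaves, rather than at each intermediate step, avoids counting a shared ancestor twice, which is why the lineage has to be followed all the way down.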

48 Future work
- Richer queries
  - Duplicate elimination, difference, aggregation
  - Supported through new kinds of lineages (e.g., disjunctive, negative)
  - Querying the uncertainty and the lineage
- More operations
  - Updates (and their lineage), close to versioning
  - “Uncertain operations”, e.g., entity resolution, inconsistency repairs
- More optimization techniques
- More theory

49 Conclusion

50 New “Bases” for new data
- The database way
  - Simple models
  - Declarative languages
  - Optimization techniques
- … for new features of data
  - Distribution and decentralization: Active XML
  - Uncertainty and lineage: ULDB’s
- There are more challenges
  - Real-world side effects, semantic reasoning and strong requirements
  - security, privacy, personalization
- Big challenge: Doing it all in a coherent way
  - One “big” model?
  - Integration of models?

51 Thank you

