1 Querying and storing XML Week 7 Typechecking and Static Analysis March 5-8, 2013.

Slides:



Advertisements
Similar presentations
XML: Extensible Markup Language
Advertisements

XDuce Tabuchi Naoshi, M1, Yonelab.
Equivalence of Extended Symbolic Finite Transducers Presented By: Loris D’Antoni Joint work with: Margus Veanes.
COMP-421 Compiler Design Presented by Dr Ioanna Dionysiou.
Pushdown Automata Chapter 12. Recognizing Context-Free Languages Two notions of recognition: (1) Say yes or no, just like with FSMs (2) Say yes or no,
Pushdown Automata Chapter 12. Recognizing Context-Free Languages We need a device similar to an FSM except that it needs more power. The insight: Precisely.
Pushdown Automata Part II: PDAs and CFG Chapter 12.
CMPT 354, Simon Fraser University, Fall 2008, Martin Ester 52 Database Systems I Relational Algebra.
Finite Automata Great Theoretical Ideas In Computer Science Anupam Gupta Danny Sleator CS Fall 2010 Lecture 20Oct 28, 2010Carnegie Mellon University.
Validating Streaming XML Documents Luc Segoufin & Victor Vianu Presented by Harel Paz.
Containment and Equivalence for an XPath Fragment By Gerom e Mikla Dan Suciu Presented By Roy Ionas.
Finite state automaton (FSA)
XML –Query Languages, Extracting from Relational Databases ADVANCED DATABASES Khawaja Mohiuddin Assistant Professor Department of Computer Sciences Bahria.
CS Chapter 2. LanguageMachineGrammar RegularFinite AutomatonRegular Expression, Regular Grammar Context-FreePushdown AutomatonContext-Free Grammar.
Spring 2005Daria Barger – DB Seminar 1 Efficient Incremental Validation of XML Documents Denilson Barbosa Alberto O.Mendelson Leonid Libkin Laurent Mignet.
Finite State Machines Data Structures and Algorithms for Information Processing 1.
Dynamic Programming Introduction to Algorithms Dynamic Programming CSE 680 Prof. Roger Crawfis.
On the Use of Regular Expressions for Searching Text Charles L.A. Clarke and Gordon V. Cormack Fast Text Searching.
Formal Language Finite set of alphabets Σ: e.g., {0, 1}, {a, b, c}, { ‘{‘, ‘}’ } Language L is a subset of strings on Σ, e.g., {00, 110, 01} a finite language,
Xpath Query Evaluation. Goal Evaluating an Xpath query against a given document – To find all matches We will also consider the use of types Complexity.
Lecture 6 of Advanced Databases XML Schema, Querying & Transformation Instructor: Mr.Ahmed Al Astal.
1 Static Type Analysis of Path Expressions in XQuery Using Rho-Calculus Wang Zhen (Selina) Oct 26, 2006.
Streaming Tree Transducers Loris D'Antoni University of Pennsylvania Joint work with Rajeev Alur 1.
XML Typing and Query Evaluation. Plan We will put some formal model underlying XML Trees and queries on them – Keeping in mind the practical aspects but.
Basics of automata theory
DECIDABILITY OF PRESBURGER ARITHMETIC USING FINITE AUTOMATA Presented by : Shubha Jain Reference : Paper by Alexandre Boudet and Hubert Comon.
1Computer Sciences Department. Book: INTRODUCTION TO THE THEORY OF COMPUTATION, SECOND EDITION, by: MICHAEL SIPSER Reference 3Computer Sciences Department.
CSCE350 Algorithms and Data Structure Lecture 17 Jianjun Hu Department of Computer Science and Engineering University of South Carolina
A Summary of XISS and Index Fabric Ho Wai Shing. Contents Definition of Terms XISS (Li and Moon, VLDB2001) Numbering Scheme Indices Stored Join Algorithms.
Pushdown Automata (PDA) Intro
Introduction to CS Theory Lecture 3 – Regular Languages Piotr Faliszewski
Lecture # 3 Chapter #3: Lexical Analysis. Role of Lexical Analyzer It is the first phase of compiler Its main task is to read the input characters and.
Querying and Storing XML Week 3 XSLT, DTDs, Schemas, Constraints January 29-Feb 1, 2013.
Automating Construction of Lexers. Example in javacc TOKEN: { ( | | "_")* > | ( )* > | } SKIP: { " " | "\n" | "\t" } --> get automatically generated code.
XML Data Management 10. Deterministic DTDs and Schemas Werner Nutt.
Lesson 3 CDT301 – Compiler Theory, Spring 2011 Teacher: Linus Källberg.
Optimization in XSLT and XQuery Michael Kay. 2 Challenges XSLT/XQuery are high-level declarative languages: performance depends on good optimization Performance.
Managing XML and Semistructured Data Lecture 13: XDuce and Regular Tree Languages Prof. Dan Suciu Spring 2001.
Computing languages by (bounded) local sets Dora Giammarresi Università di Roma “Tor Vergata” Italy.
Regular Expressions and Languages A regular expression is a notation to represent languages, i.e. a set of strings, where the set is either finite or contains.
Management of XML and Semistructured Data Lecture 11: Schemas Wednesday, May 2nd, 2001.
QED: A Novel Quaternary Encoding to Completely Avoid Re-labeling in XML Updates Changqing Li,Tok Wang Ling.
Sept. 27, 2002 ISDB’02 Transforming XPath Queries for Bottom-Up Query Processing Yoshiharu Ishikawa Takaaki Nagai Hiroyuki Kitagawa University of Tsukuba.
CMSC 330: Organization of Programming Languages Finite Automata NFAs  DFAs.
1 Approximate Data Exchange Michel de Rougemont Adrien Vieilleribière University Paris II & LRI University Paris-Sud & LRI ICDT 2007.
1 XDuce XDuce: A statically Type XML Processing: Hosoya and Pierce Presented by: Guy Korland Based on presentation by: Tabuchi
Lecture 17 Undecidability Topics:  TM variations  Undecidability June 25, 2015 CSCE 355 Foundations of Computation.
Overview of Previous Lesson(s) Over View  A token is a pair consisting of a token name and an optional attribute value.  A pattern is a description.
Chapter 5 Finite Automata Finite State Automata n Capable of recognizing numerous symbol patterns, the class of regular languages n Suitable for.
Finite Automata Great Theoretical Ideas In Computer Science Victor Adamchik Danny Sleator CS Spring 2010 Lecture 20Mar 30, 2010Carnegie Mellon.
Operational Semantics Mooly Sagiv Tel Aviv University Sunday Scrieber 8 Monday Schrieber.
CSCI 4325 / 6339 Theory of Computation Zhixiang Chen.
CS 404Ahmed Ezzat 1 CS 404 Introduction to Compiler Design Lecture 1 Ahmed Ezzat.
LECTURE 5 Scanning. SYNTAX ANALYSIS We know from our previous lectures that the process of verifying the syntax of the program is performed in two stages:
Tree Automata First: A reminder on Automata on words Typing semistructured data.
Pushdown Automata Chapter 12. Recognizing Context-Free Languages Two notions of recognition: (1) Say yes or no, just like with FSMs (2) Say yes or no,
Certifying and Synthesizing Membership Equational Proofs Patrick Lincoln (SRI) joint work with Steven Eker (SRI), Jose Meseguer (Urbana) and Grigore Rosu.
1 Context-Free Languages & Grammars (CFLs & CFGs) Reading: Chapter 5.
LECTURE 10 Semantic Analysis. REVIEW So far, we’ve covered the following: Compilation methods: compilation vs. interpretation. The overall compilation.
WELCOME TO A JOURNEY TO CS419 Dr. Hussien Sharaf Dr. Mohammad Nassef Department of Computer Science, Faculty of Computers and Information, Cairo University.
XML: Extensible Markup Language
RE-Tree: An Efficient Index Structure for Regular Expressions
Chapter 2 Finite Automata
Hierarchy of languages
Instructor: Aaron Roth
Compilers Principles, Techniques, & Tools Taught by Jing Zhang
CSCI 2670 Introduction to Theory of Computing
Lecture 5 Scanning.
COMPILER CONSTRUCTION
CSCE 355 Foundations of Computation
Presentation transcript:

1 Querying and storing XML Week 7 Typechecking and Static Analysis March 5-8, 2013

QSX March 5-8, Translating updates to queries

March 5-8, 2013 QSX 3 From XML updates to queries [Fan,Cong 2007] Find an automatic method to rewrite XML updates to queries Input: an XML update ΔT Output: an equivalent XML query Q, referred to as a complement query, such that for any XML tree T, Q(T) = T + ΔT Motivation: Updates and queries can be evaluated/optimized in a uniform framework: no need for implementing updates separately Immediate support of XML updates by existing XML query engine No need to hack the engine Composition of updates and queries becomes trivial: composition of queries Always possible? Certainly: XQuery and XSLT are Turing-complete But how efficient?

March 5-8, 2013 QSX 4 Rewrite XML updates to queries let $xp = doc(T)/p declare function local:insert ($n, $xp) { if $n[element] then element { fn:local-name($n) } // copy the element tag { for $c in $n/* return local:insert($c, $xp) }, //recursive call: children {if $n in $xp then {e} // insertion } else $n// copy the node if it is not an element } Rewriting insert e into p to a complement query in XQuery: Recursively traverse T and insert the subtree eSimilarly for deletions

March 5-8, 2013 QSX 5 Updating virtual views How can one update a virtual XML view of distributed sources? Rewrite update Δ to an equivalent complement query Q Querying the “updated” virtual view: Q(V) + Δ = Q’(Q(V))

QSX March 5-8, Regular expression types

March 5-8, 2013 QSX 7 Why type-check XML transformations? Goal: Check that valid input -> valid output e.g. check that XSLT/XQuery produces valid HTML Can we prove this once and for all? rather than revalidating after every run? Typechecking can also support other goals: optimization/static analysis identifying silly bugs ("dead" code that never matches anything; missing cases)

March 5-8, 2013 QSX 8 Example: XSLT Turns a sequence of People into a table HTML Tables have to have at least one row! Phone Book

March 5-8, 2013 QSX 9 Influential approach: XDuce A functional language for transforming XML based on pattern matching type Person = person[name[String], [String]*, phone[String]?] fun person2row(var p as Person) : Tr = match p with person[name[var name], [var ]] -> tr[td[name],td[ ]] |...

March 5-8, 2013 QSX 10 XDuce, continued fun people2rows(var xs as Person+) : Tr+ = match xs with var p as Person, var ps as Person* -> person2row p,people2rows ps | var p as Person -> person2row p fun people2table(var xs as Person+) : Table = table[people2rows xs]

March 5-8, 2013 QSX 11 Regular expression types τ ::= String | X | a[τ] | τ, τ | τ|τ | τ* a[τ] means "element labeled a with content τ" X is a type variable Can model DTDs, XML Schemas as special case at least as far as element structure goes does not model attributes, or XML Schema interleaving

March 5-8, 2013 QSX 12 Recursive type definitions Consider type definitions type X 1 = τ 1 type X 2 = τ 2... Definitions may be recursive, but recursion must be "guarded" by element tag e.g. type X = X,X not allowed

March 5-8, 2013 QSX 13 Meaning of types Suppose E(X) is the definition of X for each type variable X (i.e. type X = E(X) declared) Define τ E as follows: X E = E(X) E String E = Σ * a[ τ ] E = {a[v] | v ∈ τ E } τ 1, τ 2E = {v 1,v 2 | v 1 ∈ τ 1E, v 2 ∈ τ 2E } τ 1 | τ 2E = τ 1E ∪ τ 2E τ * E = {v 1,...,v n | v 1,...,v n ∈ τ E }

March 5-8, 2013 QSX 14 DTDs as a special case DTD rules: a → b,c*,(d|e)?,f b → c,d... becomes: type A = a[B,C*,(D|E)?,F] type B = b[C,D]... (just introduce new type Elt for each element name elt ) Same idea works for XML Schemas

March 5-8, 2013 QSX 15 Complications Typechecking undecidable for any Turing-complete language and for many simpler ones Schema languages for XML use regular languages/tree automata decision problems typically EXPTIME-complete Nevertheless, efficient-in-practice techniques can be developed

March 5-8, 2013 QSX 16 Other checks Typing: All results are contained in declared types Exhaustiveness: every possible input matches some pattern match (x:Person) with person[name[var n]] ->... Irredundancy: no pattern is "vacuous" match (x:Person) with person[name[var n],var es as *] ->... | person[name[var n], [var e]] ->... All can be reduced to subtype checks

March 5-8, 2013 QSX 17 Subtyping as inclusion We can define a semantic subtyping relation: τ <: τ' ⟺ τ E ⊆ τ' E i.e. every tree matching τ also matches τ'. How can we check this? One solution: tree automata

QSX March 5-8, Type inclusion & tree automata

March 5-8, 2013 QSX 19 XML trees as binary trees First-child, next-sibling encoding Leaves are "nil" / () r ca aba a r c aa b a a

March 5-8, 2013 QSX 20 Binary tree automata (bottom-up) M = (Q,Σ,δ,q 0,F) is a (bottom-up) binary tree automaton if: q 0 ∈ Q -- initial state (leaf nodes) F ⊆ Q -- set of final states δ : Σ × Q × Q → Q -- transition function

March 5-8, 2013 QSX 21 Binary tree automata (top-down) M = (Q,Σ,δ,q 0,F) is a (top-down) binary tree automaton if: q 0 ∈ Q -- initial state (root node) F ⊆ Q -- set of final states (for all leaf nodes) δ : Q × Σ → Q × Q -- transition function Determinstic top-down TA are strictly less expressive than bottom-up TA however, nondeterministic versions equivalent

March 5-8, 2013 QSX 22 DTDs and ambiguity Recall DTDs have to be "unambiguous" Example: NOT (a+b)+(a+c) but a+b+c is equivalent NOT (a+b) * a but (b * a) + is equivalent These restrictions (and similar ones for XML Schemas) lead to deterministic, top- down (tree) automata.

March 5-8, 2013 QSX 23 Translating types First give each subexpression its own type name type Person = person[name[String], [String]*, phone[String]?] type Person = person[Name, s, PhoneOpt] type Name = name[String] type s = [String], s | () type PhoneOpt = phone[String] | ()

March 5-8, 2013 QSX 24 To binary form Next reorganize into binary steps type Person = person[Q1],Empty type Q1 = name[String], Q2 | name[String], Q3 type Q2 = [String],Q2 | [String],Q3 type Q3 = phone[String],Empty | () type Empty = ()

March 5-8, 2013 QSX 25 To tree automaton type Person = person[Q1],Empty type Q1 = name[String], Q2 | name[String], Q3 type Q2 = [String],Q2 | [String],Q3 type Q3 = phone[String],Empty | () type Empty = () Q = {Person,Q1,Q2,Q3,String,Empty} Σ = {person,name, ,phone,string} q 0 = Person F = {Empty,Q3,String} q1q1 q2q2 aδ(q 1,q 2,a) Q1EmptypersonPerson StringQ2nameQ1 StringQ3nameQ1 StringQ2 Q2 StringQ3 Q3 StringEmptyphoneQ3

March 5-8, 2013 QSX 26 Checking containment Like sequential case, tree automata / languages closed under union, intersection, negation some of the constructions incur exponential blowup Containment testing is EXPTIME-complete Naive algorithm: test L(M) ⊆ L(N) by constructing automaton M' for L(M) ∩ L(N ̅ ), testing emptiness Does not perform well on problems encountered in programming Hosoya, Vouillon & Pierce [2005] give practical algorithm aimed at XDuce typing problems

QSX March 5-8, Types for XQuery

March 5-8, 2013 QSX 28 Motivation [Collazzo et al. 2006] For XDuce, want to ensure valid result find silly bugs in pattern matching ensure exhaustiveness & irredundancy For XQuery, still want result validity, but different issues functions, but no pattern matching (as such) but does have for/iteration & path expressions parts of queries can have "silly" bugs

March 5-8, 2013 QSX 29 Example Schema (we'll use XDuce notation): type Person = person[name[String], [String]*, phone[String]?] Query: return all name/phone of people with for $y in $x/person[emial] return $y/name,$y/phone Silly bug : "emial" misspelled; path will never select anything (in this type anyway)

March 5-8, 2013 QSX 30 XQuery core language: μXQ Tree variables x ̅ have single tree values v ̅ Forest variables x have arbitrary values v

March 5-8, 2013 QSX 31 Typing rules for XQuery: Basics

March 5-8, 2013 QSX 32 Typing rules for XQuery: Let, If

March 5-8, 2013 QSX 33 Path steps

March 5-8, 2013 QSX 34 Comprehensions: simple idea (what XQuery does)

March 5-8, 2013 QSX 35 Example type Doc = a[ b[d[]]*, c[e[], f[]]? ] prime( b[d[]]*, c[e[], f[]]? ) = b[d[]]|c[e[],f[]] quantifier( b[d[]]*, c[e[], f[]]? ) = * $doc : Doc ⊢ $doc/a/* : b[d[]]*, c[e[], f[]]? $doc : Doc ⊢ for $x in $doc/a/* return $x/* : (d[] | e[],f[])* $doc : Doc, $x : b[d[]] | c[e[], f[]] ⊢ $x/* : d[] | e[],f[] Overapproximation: e,f,d and e,f,e,f can never happen

March 5-8, 2013 QSX 36 Comprehensions: more precisely

March 5-8, 2013 QSX 37 Example type Doc = a[ b[d[]]*, c[e[], f[]]? ] prime( b[d[]]*, c[e[], f[]]? ) = b[d[]]|c[e[],f[]] quantifier( b[d[]]*, c[e[], f[]]? ) = * $doc : Doc ⊢ $doc/a/* : b[d[]]*, c[e[], f[]]? $doc : Doc ⊢ for $x in $doc/a/* return $x/* : d[]*, (e[],f[])? $doc : Doc ⊢ $x in b[d[]]*, c[e[], f[]]? → $x/* : d[]*, (e[],f[])? Remembers more of the regexp structure (but overall system still approximate)

March 5-8, 2013 QSX 38 Path errors Detect path errors by: Noticing when a non-() expression has type () Tracking which parts of expression have this property (propagating sets of expression locations through typing judgements)

March 5-8, 2013 QSX 39 Path errors For expressions that visit same subexpression multiple times only error if subexpression never produces useful output take intersection e.g. for $x in $doc/person[emial]... not an error if $doc : person[ [...] | emial[...]]

March 5-8, 2013 QSX 40 Subtleties [Collazzo et al. 2006] also considers splitting E.g. considering a[T 1 |...|T n ] as a[T 1 ]|...|a[T n ] This complicates system, but can make results more precise

QSX March 5-8, Schema-based query- update independence analysis

March 5-8, 2013 QSX 42 Motivation [Benedikt & Cheney, 2009] Suppose we want to maintain multiple (materialized) views or constraints expressed by queries Q 1, Q 2,... When the database is updated If we can determine (quickly) that query and update are independent then can skip view/constraint maintenance

March 5-8, 2013 QSX 43 Query-update independence DB U(DB) Q Q Q(U(DB)) = ?? U Q(DB)

March 5-8, 2013 QSX 44 Approach a static analysis that safely detects independence and takes schema into account for XQuery Update with all XPath axes fast enough to be useful as an optimization

March 5-8, 2013 QSX 45 Independence example d a for $x in /c/a return d[$x] / ca aba

March 5-8, 2013 QSX 46 Independence example for $x in /c/a return d[$x] / ca aba d a for $x in /c/a return d[$x] / ca afooa d a = bar

March 5-8, 2013 QSX 47 Independence non- example for $x in /a return d[$x] / ca aba a for $x in /a return d[$x] / ca afooa ba d a fooa d  bar bar

March 5-8, 2013 QSX 48 Approach Exploits a schema S that describes the input Statically calculate: c = Copied nodes of Q a = Accessed Nodes of Q u = Updated Nodes of U c, a, u are sets of type names in S

March 5-8, 2013 QSX 49 Independence test We show that Q and I are independent modulo S if: a is disjoint from u that is, the update has no impact on accessed nodes and c //* is disjoint from u that is, the update does not impact any copied nodes or their descendants

March 5-8, 2013 QSX 50 / ca a Analysis example Accesse d nodes for $x in /c/a return d[$x] / ca aba d a

March 5-8, 2013 QSX 51 Analysis example for $x in /c/a return d[$x] / ca a ba d a Copied node a a

March 5-8, 2013 QSX 52 Analysis example for $x in /c/a return d[$x] / ca a ba d a Updated node / ca a fooa bar

March 5-8, 2013 QSX 53 Analysis example for $x in /c/a return d[$x] / ca a ba d a for $x in /c/a return d[$x] / ca afooa d a Updated nodes disjoint from accessed or copied nodes bar =

March 5-8, 2013 QSX 54 Updated node avoids accessed & copied nodes Analysis example II / c a aba / ca afooa for $x in /a return d[$x] a ba d bar Updated node is a descendant of a copied node!

March 5-8, 2013 QSX 55 Analysis example II / c a aba / ca afooa for $x in /a return d[$x] a ba d for $x in /a return d[$x] a fooa d Query and update are not independent bar bar ≠

March 5-8, 2013 QSX 56 Correctness We can use type names for sets of accessed, copied, updated nodes Symbolically evaluate query/update over schema However, there is a complication: Aliasing How do we know that it is safe to use type names to refer to and change “parts” of schema?

March 5-8, 2013 QSX 57 Problem For example: doc tt bb

March 5-8, 2013 QSX 58 Problem For example: doc tt bb TU S B B

March 5-8, 2013 QSX 59 Problem For example: has multiple typings doc tt bb UT S B B

March 5-8, 2013 QSX 60 Solution: Closure under Aliasing Keep track of which type names “alias” or, can refer to the same node currently, based on tree language overlap test Close sets a, c, u under aliasing example: any T could be a U, and vice versa Note: This may be overkill unnecessary (has no effect) for DTDs probably unnecessary for unambiguous XML Schema

March 5-8, 2013 QSX 61 Summary Independence analysis can statically determine whether view needs to be maintained Subsequent work explores more precise static information: path-based independence analysis ("destabilizers") - no schema required [Cheney & Benedikt VLDB 2010] independence based on more precise type analysis [Bidoit-Tollu et al. VLDB 2012] Future work: combine path- and schema-based approach? what about XML views of relations? can static analysis help with incremental maintenance?

March 5-8, 2013 QSX 62 Next week Provenance Why and Where: A characterization of data provenance Provenance management in curated databases Annotated XML: queries and provenance