XML Data Management 10. Deterministic DTDs and Schemas Werner Nutt.

How Expressive can a Schema Be?
– Arbitrarily deep binary tree with A elements, and a single B element
– What would documents look like that satisfy this schema?
– How would one check validity? What would be the cost?
– What are the pros and cons of allowing such schemas?
This schema is a frequent example in teaching material on XML Schema.

Let’s see what SAXON says …

Here is the Full Error Message from Eclipse
cos-element-consistent: Error for type 'oneB'. Multiple elements with name 'A', with different types, appear in the model group.
cos-element-consistent: Error for type 'onlyAs'. Multiple elements with name 'A', with different types, appear in the model group.
I.e., in a given context, elements with the same name must have the same content. Easy to check!
cos-nonambig: A and A (or elements from their substitution group) violate "Unique Particle Attribution". During validation against this schema, ambiguity would be created for those two particles.
That’s more subtle...

The Country Example in XML Schema <xsd:schema xmlns:xsd=" targetNamespace=" xmlns=" elementFormDefault="qualified"> As DTD:

Also this is not validated … cos-nonambig: king and king (or elements from their substitution group) violate "Unique Particle Attribution". During validation against this schema, ambiguity would be created for those two particles. Let’s check what this means!

What the W3C Standard Explains … Schema Component Constraint: Unique Particle Attribution A content model must be formed such that during ·validation· of an element information item sequence, the particle contained directly, indirectly or ·implicitly· therein with which to attempt to ·validate· each item in the sequence in turn can be uniquely determined without examining the content or attributes of that item, and without any information about the items in the remainder of the sequence.

Questions and Ideas
Questions:
– How can one make the standard formal?
– How can a validator implement the standard?
Ideas:
– Content models are specified by regular expressions
– A regular expression E can be translated into a finite state automaton A (Glushkov automaton) that checks which strings satisfy E
⇒ Construct A from E and check whether A is deterministic

Formalization
Alphabet Σ (i.e., set of symbols): the element names occurring in the content model
Regular expressions over Σ are generated with the rule
e, f → a | (e · f) | (e | f) | (e)+ | (e)*
where e, f are expressions and a ∈ Σ
Language L(e) of an expression e (inductively defined)
Exercise: Which of the following are in the language defined by a* · (b | c) · a+ ?
– aba
– abca
– aab
– aaacaaa
In the following, we denote concatenation by a dot, no longer by a comma.
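One way to check answers to the exercise mechanically is to translate the expression into the syntax of a regex library. The sketch below assumes Python's `re` module and single-letter element names; the helper name `in_language` is made up for illustration. Concatenation is implicit in Python's syntax, so the dot disappears.

```python
import re

# The slide's expression a* . (b | c) . a+ written in Python regex syntax.
CONTENT_MODEL = re.compile(r"a*(b|c)a+")

def in_language(word: str) -> bool:
    """Return True if the whole word, not just a prefix, matches."""
    return CONTENT_MODEL.fullmatch(word) is not None

# Candidate words from the exercise.
results = {w: in_language(w) for w in ["aba", "abca", "aab", "aaacaaa"]}
```

Using `fullmatch` rather than `match` matters here: membership in L(e) is about the entire string.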

Regular Expressions and DTDs
These are formalizations of DTDs and validation:
A DTD is a pair (d, s) where
– s ∈ Σ is the start symbol
– d maps every Σ-symbol to a regular expression over Σ
A document tree t satisfies d (t is valid wrt d) iff
– the root of t is labeled s
– for every node n in t, with symbol a, the string formed by the names of the children of n satisfies d(a)
⇒ Validation is checking whether a string satisfies a regexp
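The two validity conditions can be sketched directly in code. This is a minimal illustration, not a real DTD validator: it assumes single-letter element names so that the children of a node form a plain string, represents a tree as a hypothetical `(name, children)` pair, and writes each d(a) in Python regex syntax.

```python
import re
from typing import Dict, List, Tuple

Tree = Tuple[str, List["Tree"]]  # (element name, list of child trees)

def check_node(node: Tree, d: Dict[str, str]) -> bool:
    """Check that the children of every node satisfy d(name of the node)."""
    name, children = node
    if name not in d:
        return False
    child_string = "".join(child[0] for child in children)
    if re.fullmatch(d[name], child_string) is None:
        return False
    return all(check_node(child, d) for child in children)

def is_valid(tree: Tree, d: Dict[str, str], s: str) -> bool:
    """The root must be labeled s and every node must satisfy its rule."""
    return tree[0] == s and check_node(tree, d)

# Hypothetical DTD: r has content a* (b | c) a+, the other elements are empty.
d = {"r": "a*(b|c)a+", "a": "", "b": "", "c": ""}
good = ("r", [("a", []), ("b", []), ("a", [])])
bad = ("r", [("a", []), ("b", [])])
```

With these definitions, `is_valid(good, d, "r")` holds and `is_valid(bad, d, "r")` does not, because the string "ab" lacks the trailing a.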

Markings
Distinguish between the different occurrences of a symbol in a regexp by using numbers: markings of regexps
Examples:
– a₁* · (b₂ | c₃) · a₄+ is a marking of a* · (b | c) · a+
– king₁ | queen₂ | king₃ · queen₄ is a marking of king | queen | king · queen
Definition: A marking e′ of a regular expression e is an assignment of numbers to every symbol in e, so that different occurrences of the same symbol get different numbers.

Unmarked Version
Consider a regular expression e and a marking e′ of e
Definition: For w ∈ L(e′), we denote by w# the corresponding unmarked string in L(e).
Example: If w = b₂a₁a₃, then w# = baa
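Marking and unmarking are simple enough to sketch in a few lines. The representation below is an assumption made for illustration: element names are single letters, a marked symbol is a `(letter, number)` pair, and occurrences are numbered from left to right (one of many possible markings).

```python
from typing import List, Tuple, Union

Marked = Tuple[str, int]  # (letter, number)

def mark(expr: str) -> List[Union[Marked, str]]:
    """Number every letter occurrence from left to right; keep operators."""
    out: List[Union[Marked, str]] = []
    counter = 0
    for ch in expr:
        if ch.isalpha():
            counter += 1
            out.append((ch, counter))
        else:
            out.append(ch)
    return out

def unmark(word: List[Marked]) -> str:
    """Compute w# by dropping the numbers of a marked word w."""
    return "".join(letter for letter, _ in word)

marked_expr = mark("a*(b|c)a+")
w = [("b", 2), ("a", 1), ("a", 3)]
```

Here `unmark(w)` reproduces the slide's example and yields the string baa.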

“Unique Particle Attribution”: Formalization
Brüggemann-Klein/Wood [1998]
Definition: A regular expression r is deterministic iff there are no strings uxv, uyw ∈ L(r′) with
– |x| = |y| = 1
– x ≠ y (x and y are different marked symbols)
– x# = y# (their unmarking is the same).
Example: (a | b)* a is not deterministic, because for the marking (a₁ | b₂)* a₃ there are the strings b₂a₁a₃ and b₂a₃, i.e., u = b₂, x = a₁, v = a₃, y = a₃, and w is the empty word.
How can we check whether r is deterministic?

Finite State Automata
Regular languages can also be defined using automata
A finite state automaton (FSA) consists of:
– a set of states Q
– an alphabet Σ (i.e., a set of symbols)
– a transition function δ, which maps every pair (q, a) to a set of states q′
– an initial state q₀
– a set of accepting states F
A word a₁…aₙ is in the language defined by an automaton if there is a path from q₀ to a state in F with edges labeled a₁, …, aₙ
The automaton is deterministic if every pair (q, a) is mapped to at most one state
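The acceptance condition and the determinism condition can both be sketched over a dictionary that plays the role of the transition function. The automaton used as data is a hypothetical one for (a | b)* a, chosen for illustration; it is not the automaton drawn on the next slide.

```python
from typing import Dict, Set, Tuple

Delta = Dict[Tuple[str, str], Set[str]]  # (state, letter) -> set of states

def accepts(delta: Delta, q0: str, finals: Set[str], word: str) -> bool:
    """Follow all paths labeled by word; accept if one ends in a final state."""
    current = {q0}
    for a in word:
        current = {t for q in current for t in delta.get((q, a), set())}
    return bool(current & finals)

def is_deterministic(delta: Delta) -> bool:
    """Every pair (q, a) is mapped to at most one state."""
    return all(len(targets) <= 1 for targets in delta.values())

# Hypothetical automaton for (a | b)* a with states q0 and q1.
delta: Delta = {("q0", "a"): {"q0", "q1"}, ("q0", "b"): {"q0"}}
finals = {"q1"}
```

Tracking a set of current states is what lets the same function handle non-deterministic automata: a word is accepted as soon as one of the possible paths ends in F.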

Which Language Does this FSA Define?
[Figure: automaton with states q₀, q₁, q₂, q₃ and edges labeled a, a, b, b, c]

Non-Deterministic Automata An automaton is non-deterministic if there is a state q and a letter a such that there are at least two transitions from q via edges labeled with a What words are in the language of a non-deterministic automaton? We now create a Glushkov automaton from a regular expression

Creating a Glushkov Automaton from a Regular Expression
Expression: a* · (b | c) · a+
Step 1: Create a marking of the expression: a₁* · (b₁ | c₁) · a₂+

Creating a Glushkov Automaton from a Regular Expression
Step 2: Create a state q₀ and create a state for each subscripted letter of a₁* · (b₁ | c₁) · a₂+
Step 3: Choose as accepting states all subscripted letters with which it is possible to end a word
How do we find these states?
[Figure: states q₀, a₁, b₁, c₁, a₂]

Creating a Glushkov Automaton from a Regular Expression
Step 4: Create a transition from a state lᵢ to a state kⱼ if there is a word in which kⱼ follows lᵢ. Label the transition with k
Expression: a₁* · (b₁ | c₁) · a₂+
How do we find these transitions?
[Figure: states q₀, a₁, b₁, c₁, a₂]
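Steps 2 to 4 can be sketched as follows for the running example. The sets of starting letters, ending letters, and successor letters are written out by hand here as an assumption of the sketch; the function and variable names are illustrative only. A state is either the string "q0" or a `(letter, number)` pair.

```python
def glushkov(first, last, follow, nullable):
    """Build (delta, finals); delta maps (state, letter) to a set of states."""
    delta = {}
    # q0 has a transition to every subscripted letter that can start a word.
    for pos in first:
        delta.setdefault(("q0", pos[0]), set()).add(pos)
    # Step 4: a transition from l_i to k_j, labeled k, if k_j can follow l_i.
    for src, targets in follow.items():
        for pos in targets:
            delta.setdefault((src, pos[0]), set()).add(pos)
    # Step 3: accepting states are the letters that can end a word;
    # q0 is accepting too if the empty word is in the language.
    finals = set(last) | ({"q0"} if nullable else set())
    return delta, finals

def accepts(delta, finals, word):
    """Run the automaton on an unmarked word, starting in q0."""
    current = {"q0"}
    for a in word:
        current = {t for q in current for t in delta.get((q, a), set())}
    return bool(current & finals)

# Hand-derived data for a1* . (b1 | c1) . a2+ (assumed, not computed).
a1, b1, c1, a2 = ("a", 1), ("b", 1), ("c", 1), ("a", 2)
first = {a1, b1, c1}
last = {a2}
follow = {a1: {a1, b1, c1}, b1: {a2}, c1: {a2}, a2: {a2}}
delta, finals = glushkov(first, last, follow, nullable=False)
```

Every transition into a state kⱼ carries the label k, so the automaton reads unmarked words even though its states are marked symbols.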

Exercises
What are the Glushkov automata of
– a* · b · (a · b)*
– (a | b)* · a · (a | b)
– (a | b)* · a ?

Recognizing Deterministic Regular Expressions Theorem (Book et al 1971, Brüggemann-Klein, Wood, 1998) A regular expression is deterministic (one-unambiguous) iff its Glushkov automaton is deterministic.
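The theorem turns the determinism test into a check on the automaton: no state may have two outgoing transitions with the same letter to different marked symbols. A minimal sketch, assuming the set of starting symbols and the successor sets of the marked expression are already known (they are written out by hand below, and all names are illustrative):

```python
def conflicts(first, follow):
    """List triples (state, p, q) where distinct marked symbols p and q
    carry the same letter and are both reachable from the same state."""
    found = []
    # Outgoing transitions of q0 come from first, the others from follow.
    for state, targets in [("q0", first)] + sorted(follow.items()):
        seen = {}
        for pos in sorted(targets):
            letter = pos[0]
            if letter in seen:
                found.append((state, seen[letter], pos))
            else:
                seen[letter] = pos
    return found

# (a | b)* a with marking (a1 | b2)* a3: hand-derived sets.
a1, b2, a3 = ("a", 1), ("b", 2), ("a", 3)
first_nd = {a1, b2, a3}
follow_nd = {a1: {a1, b2, a3}, b2: {a1, b2, a3}, a3: set()}

# a* . (b | c) . a+ with marking a1* . (b2 | c3) . a4+: hand-derived sets.
c3, a4 = ("c", 3), ("a", 4)
first_d = {a1, b2, c3}
follow_d = {a1: {a1, b2, c3}, b2: {a4}, c3: {a4}, a4: {a4}}
```

An empty result means the Glushkov automaton is deterministic. For the first expression the clash between a₁ and a₃ shows up already in q₀, which matches the earlier example with the strings b₂a₁a₃ and b₂a₃.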

Construction of the Glushkov Automaton
For an arbitrary alphabet Σ and a language L ⊆ Σ* we define two sets
– first(L) = { a ∈ Σ | ∃u ∈ Σ*. a·u ∈ L }
– last(L) = { a ∈ Σ | ∃u ∈ Σ*. u·a ∈ L }
and the function
– follow(L, a) = { b ∈ Σ | ∃u, v ∈ Σ*. u·a·b·v ∈ L }.
Consider an expression e and its marking e′
We can construct the Glushkov automaton for e if we know the sets first(L(e′)), last(L(e′)), the function follow(L(e′), ·), and if we know whether ε ∈ L(e′), where ε is the empty word. Why?

Construction of the Glushkov Automaton
Where do we get this info?
If e = a₁, then
– first(L(e)) = {a₁}
– last(L(e)) = {a₁}
– follow(L(e), lᵢ) = ∅ for every lᵢ
– Also, ε ∉ L(e)
If e = (f | g), then
– first(L(e)) = first(L(f)) ∪ first(L(g))
– last(L(e)) = last(L(f)) ∪ last(L(g))
– follow(L(e), lᵢ) = follow(L(f), lᵢ) if lᵢ occurs in f
– follow(L(e), lᵢ) = follow(L(g), lᵢ) if lᵢ occurs in g
– Also, ε ∈ L(e) if ε ∈ L(f) or ε ∈ L(g)
For e = f*, f+, f · g, exercise!

Construction of the Glushkov Automaton
If e = (f · g), then
– first(L(e)) = first(L(f)) ∪ first(L(g)) if ε ∈ L(f)
– first(L(e)) = first(L(f)) otherwise
– last(L(e)) = last(L(f)) ∪ last(L(g)) if ε ∈ L(g)
– last(L(e)) = last(L(g)) otherwise
– follow(L(e), lᵢ) = follow(L(f), lᵢ) if lᵢ occurs in f but lᵢ ∉ last(L(f))
– follow(L(e), lᵢ) = follow(L(f), lᵢ) ∪ first(L(g)) if lᵢ ∈ last(L(f))
– follow(L(e), lᵢ) = follow(L(g), lᵢ) if lᵢ occurs in g
– Also, ε ∈ L(e) if ε ∈ L(f) and ε ∈ L(g)

Construction of the Glushkov Automaton
If e = (f*), then
– first(L(e)) = first(L(f))
– last(L(e)) = last(L(f))
– follow(L(e), lᵢ) = follow(L(f), lᵢ) if lᵢ occurs in f but lᵢ ∉ last(L(f))
– follow(L(e), lᵢ) = follow(L(f), lᵢ) ∪ first(L(f)) if lᵢ ∈ last(L(f))
– Also, ε ∈ L(e) always, since the star allows zero repetitions
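The inductive rules for symbols, alternation, concatenation, and star can be collected into one recursive function. The sketch below is a straightforward rendering of those rules, not the optimized algorithm of Brüggemann-Klein and Wood; the tuple encoding of expressions (`"sym"`, `"alt"`, `"cat"`, `"star"`, `"plus"`) is an assumption made for illustration, and the `"plus"` case treats f+ like f* except that it is nullable only if f is.

```python
def analyse(e, follow):
    """Return (nullable, first, last) of a marked expression e and
    fill the dictionary follow in place."""
    kind = e[0]
    if kind == "sym":                      # e = a_i
        pos = (e[1], e[2])
        follow.setdefault(pos, set())
        return False, {pos}, {pos}
    if kind == "alt":                      # e = (f | g)
        n1, f1, l1 = analyse(e[1], follow)
        n2, f2, l2 = analyse(e[2], follow)
        return n1 or n2, f1 | f2, l1 | l2
    if kind == "cat":                      # e = (f . g)
        n1, f1, l1 = analyse(e[1], follow)
        n2, f2, l2 = analyse(e[2], follow)
        for pos in l1:                     # last symbols of f continue into g
            follow[pos] |= f2
        first = (f1 | f2) if n1 else f1
        last = (l1 | l2) if n2 else l2
        return n1 and n2, first, last
    if kind in ("star", "plus"):           # e = f* or e = f+
        n, f, l = analyse(e[1], follow)
        for pos in l:                      # last symbols of f loop back
            follow[pos] |= f
        return (kind == "star") or n, f, l
    raise ValueError("unknown expression kind: %r" % (kind,))

# a1* . (b2 | c3) . a4+ in the assumed tuple encoding.
expr = ("cat",
        ("cat", ("star", ("sym", "a", 1)),
                ("alt", ("sym", "b", 2), ("sym", "c", 3))),
        ("plus", ("sym", "a", 4)))
follow = {}
nullable, first, last = analyse(expr, follow)
```

The result can be fed directly into a Glushkov construction: `first` gives the transitions out of q₀, `follow` the remaining transitions, and `last` together with `nullable` the accepting states.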

Recognizing Deterministic Regular Expressions
Observation: For each operator, first, last, and follow can be computed in quadratic time.
⇒ This yields an O(n³) algorithm.
Theorem (Brüggemann-Klein, Wood, 1998): There is an O(n²) algorithm to check whether a regexp is deterministic.

More Results
Theorems (Brüggemann-Klein, Wood, 1998)
– Not every regular language can be denoted by a deterministic regular expression. E.g., (a | b)* a (a | b)
– Deterministic regular languages are not closed under union, concatenation, or Kleene-star. I.e., there is no easy syntactic characterization
– If it exists, an equivalent deterministic regular expression can be constructed in exponential time. It is possible to help users, but that is costly

Theory for XML Schema
– XML Schema allows schemas where the same element appears with different types
– However, it is illegal to have two elements of the same name, but different types, in one content model
– Also, content models must be deterministic
Consequence: Documents can be validated in a deterministic top-down pass

References
This material draws upon slides by Sara Cohen and Frank Neven, notes by Leonid Libkin, and the papers by A. Brüggemann-Klein and D. Wood