The Next 700 Data Description Languages Yitzhak Mandelbaum Princeton University Computer Science Collaborators: Kathleen Fisher and David Walker.

“The Next 700 …”
Program(s) → Programming Language(s) → PL Semantics
Data Format(s) → Data Description Language(s) → DDL Semantics

What Data Needs Describing?
Much data lives in databases and common formats like XML, but much data is ad hoc.
Ad hoc data lacks readily available parsing, querying, analysis, or transformation tools.
It’s all over the place: finance, telecommunications, chemistry, physics, biology, etc.

Ad Hoc Data in Biology
!autogenerated-by: DAG-Edit version rev 3
!saved-by: gocvs
!date: Fri Mar 18 21:00:28 PST 2005
!version: $Revision: $
!type: % is_a is a
!type: < part_of part of
!type: ^ inverse_of inverse of
!type: | disjoint_from disjoint from
$Gene_Ontology ; GO:
<biological_process ; GO:
%behavior ; GO: ; synonym:behaviour
%adult behavior ; GO: ; synonym:adult behaviour
%adult feeding behavior ; GO: ; synonym:adult feeding behaviour % feeding behavior ; GO:
%adult locomotory behavior ; GO: ;...
from

Ad Hoc Data in Chemistry C5=CC=CC=C5)=O)C1

Ad Hoc Data from Web Server Logs (CLF) [15/Oct/1997:18:46: ] "GET /tk/p.txt HTTP/1.0" tj62.aol.com - - [16/Oct/1997:14:32: ] "POST HTTP/1.0"

Ad Hoc Data: DNS packets : 9192 d8fb d r : f 6d00 esearch.att.com : 00fc 0001 c00c e ' : 036e 7331 c00c 0a68 6f73 746d ns1...hostmaste : 72c0 0c77 64e e r..wd.I : 36ee e 10c0 0c00 0f e : a00 0a05 6c69 6e75 78c0 0cc0 0c linux : 0f e c00 0a07 6d61 696c mail : 6d61 6ec0 0cc0 0c e 1000 man : 0487 cf1a 16c0 0c e a0: e73 30c0 0cc0 0c e..ns b0: c0 2e03 5f67 63c0 0c _gc...! c0: d c c X.....d...phys d0: f.research.att.co

Data Description Languages
Data description languages describe many ad hoc formats and provide the following features:
–Descriptions serve as documentation, including the semantics of the data.
–A compiler generates tools from a description: parser, printer, query engine, converter to XML, statistical profiler, etc.
–Parsers include robust error detection and recovery.
–Parsers can handle high data volume, e.g. > 1GB/second Netflow traffic from Cisco routers.

Many Data Description Languages
Logical Descriptions
–ASN.1
–ASDL
Physical Descriptions
–PacketTypes (SIGCOMM ‘00)
–DataScript (GPCE ‘02)
–PADS (PLDI ‘05): basis for current work

Contributions
A core data description calculus (DDC)
–Based on dependent type theory
–Simple, orthogonal, composable types
–Types are transducers from external data source to internal data representation.
Encodings of high-level DDLs in low-level DDC
–Explain semantics of PADS language in particular.
[Diagram: PacketTypes, PADS, DataScript, DDC]

Base Types and Sequences
C(e): base type, can be parameterized by expression e.
Σx:T.T’: dependent product, describes a sequence of values.
–Variable x gives a name to the first value in the sequence.
Examples:
“123hello|” parsed as int * string(‘|’) * char yields (123, “hello”, ‘|’)
“3513” parsed as Σwidth:int_fw(1). int_fw(width) yields (3, 513)
“:hello:” parsed as Σterm:char. string(term) * char yields (‘:’, “hello”, ‘:’)
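
The dependent product can be sketched with a few parser combinators. This is only an illustration, not the DDC semantics itself: the encoding of a parser as a Python function from (data, offset) to (new offset, value) or None, and the names `p_int`, `int_fw`, `p_char`, `string_until` and `sigma`, are all assumptions made for this sketch.

```python
from typing import Any, Callable, Optional, Tuple

# Hypothetical encoding: a parser maps (data, offset) to (new_offset, value),
# or to None when the data does not match.
Parser = Callable[[str, int], Optional[Tuple[int, Any]]]


def is_digit(ch: str) -> bool:
    return "0" <= ch <= "9"


def p_int(data: str, off: int):
    """Base type int: the longest run of ASCII digits starting at `off`."""
    end = off
    while end < len(data) and is_digit(data[end]):
        end += 1
    return (end, int(data[off:end])) if end > off else None


def int_fw(width: int) -> Parser:
    """Base type int_fw(width): an integer of exactly `width` digits."""
    def parse(data, off):
        chunk = data[off:off + width]
        if width > 0 and len(chunk) == width and all(is_digit(c) for c in chunk):
            return off + width, int(chunk)
        return None
    return parse


def p_char(data: str, off: int):
    """Base type char: any single character."""
    return (off + 1, data[off]) if off < len(data) else None


def string_until(term: str) -> Parser:
    """Base type string(term): characters up to, but not including, `term`."""
    def parse(data, off):
        end = data.find(term, off)
        return (end, data[off:end]) if end >= 0 else None
    return parse


def sigma(first: Parser, rest: Callable[[Any], Parser]) -> Parser:
    """Dependent product: the second parser is built from the first value."""
    def parse(data, off):
        r1 = first(data, off)
        if r1 is None:
            return None
        off1, x = r1
        r2 = rest(x)(data, off1)
        if r2 is None:
            return None
        off2, y = r2
        return off2, (x, y)
    return parse


# The three slide examples; a plain product ignores the bound variable.
ex1 = sigma(p_int, lambda _n: sigma(string_until("|"), lambda _s: p_char))
ex2 = sigma(int_fw(1), lambda width: int_fw(width))
ex3 = sigma(p_char, lambda term: sigma(string_until(term), lambda _s: p_char))
```

In this sketch `ex2("3513", 0)` returns `(4, (3, 513))`: the first digit fixes the width of the second field. Results come back as nested pairs rather than the flat tuples shown on the slide.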

Constraints
{x:T | e}: set types constrain the type T and express relationships between elements of the data.
Examples:
‘a’ parsed as {c:char | c = ‘a’} (abbreviated S_c(‘a’)) yields inl ‘a’
“101”, “82” parsed as {x:int | x > 100} yield inl 101, inr error(82)
“43|105|67” parsed as Σmin:int. S_c(‘|’) * Σmax:{m:int | min ≤ m}. S_c(‘|’) * {avg:int | min ≤ avg & avg ≤ max} yields (43, inl ‘|’, inl 105, inl ‘|’, inl 67)
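
A set type can be sketched as a wrapper that parses the underlying type and then tags the value according to the constraint. The tuple encoding of `inl` / `inr error(...)` and the helper names `set_type`, `p_int`, `p_char` and `s_c` are assumptions of this sketch, not part of the calculus.

```python
def p_int(data, off):
    """Base type int: the longest run of ASCII digits starting at `off`."""
    end = off
    while end < len(data) and "0" <= data[end] <= "9":
        end += 1
    return (end, int(data[off:end])) if end > off else None


def p_char(data, off):
    """Base type char: any single character."""
    return (off + 1, data[off]) if off < len(data) else None


def set_type(base, pred):
    """{x:T | e}: parse T, then tag the value by whether the constraint holds."""
    def parse(data, off):
        r = base(data, off)
        if r is None:
            return None
        new_off, value = r
        if pred(value):
            return new_off, ("inl", value)
        # The data was read, but it violates the constraint.
        return new_off, ("inr", ("error", value))
    return parse


def s_c(ch):
    """Singleton character type S_c(ch) = {c:char | c = ch}."""
    return set_type(p_char, lambda c: c == ch)


big = set_type(p_int, lambda x: x > 100)
```

With these definitions `big("101", 0)` gives `(3, ("inl", 101))` and `big("82", 0)` gives `(2, ("inr", ("error", 82)))`, mirroring the second slide example: a constraint violation is recorded in the result instead of aborting the parse.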

Unions and the Empty String
true: matches the empty string.
T + T’: deterministic, exclusive or: try T; on failure, try T’.
Examples:
“54”, “n/a” parsed as int + S_s(“n/a”) yield inl 54, inr “n/a”
“2341”, “” parsed as int + true yield inl 2341, inr ()
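
The "try T; on failure, try T’" behaviour is ordered choice. A minimal sketch, assuming parsers are Python functions from (data, offset) to (new offset, value) or None; the names `union`, `lit`, `p_true` and `p_int` are hypothetical.

```python
def p_int(data, off):
    """Base type int: the longest run of ASCII digits starting at `off`."""
    end = off
    while end < len(data) and "0" <= data[end] <= "9":
        end += 1
    return (end, int(data[off:end])) if end > off else None


def lit(s):
    """Matches exactly the string `s` (a stand-in for the singleton S_s(s))."""
    def parse(data, off):
        return (off + len(s), s) if data.startswith(s, off) else None
    return parse


def p_true(data, off):
    """true: matches the empty string and consumes nothing."""
    return off, ()


def union(left, right):
    """T + T': try `left`; only if it fails, try `right`."""
    def parse(data, off):
        r = left(data, off)
        if r is not None:
            return r[0], ("inl", r[1])
        r = right(data, off)
        if r is not None:
            return r[0], ("inr", r[1])
        return None
    return parse


int_or_na = union(p_int, lit("n/a"))
optional_int = union(p_int, p_true)
```

Here `optional_int("", 0)` returns `(0, ("inr", ()))`: pairing a type with `true` is how an optional field is expressed.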

Array Features
What features do we need to handle data sequences?
–Elements
–Separator between elements
–Termination condition (“are we done yet?”)
–Terminator after sequence
Examples: “ ” “Bill|Cathy|Jane|Bob;”

False and Arrays
T seq(T_s; e, T_t) specifies:
–Element type T.
–Separator type T_s.
–Termination condition e.
–Terminator type T_t.
false: reads nothing, flagging an error.
Example: IP address.
“ ” parsed as int seq(S_c(‘.’); len 4, false) yields [192,168,1,1]
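
The four ingredients of a sequence type can be sketched as one combinator. Two choices here are assumptions of the sketch rather than statements about the DDC: the terminator is treated as lookahead (checked but not consumed), and `false` is modelled as a parser that always fails. The input `"192.168.1.1"` is chosen to match the result shown on the slide.

```python
def p_int(data, off):
    """Base type int: the longest run of ASCII digits starting at `off`."""
    end = off
    while end < len(data) and "0" <= data[end] <= "9":
        end += 1
    return (end, int(data[off:end])) if end > off else None


def p_word(data, off):
    """Hypothetical base type: a non-empty run of letters."""
    end = off
    while end < len(data) and data[end].isalpha():
        end += 1
    return (end, data[off:end]) if end > off else None


def lit(s):
    """Matches exactly the string `s`."""
    def parse(data, off):
        return (off + len(s), s) if data.startswith(s, off) else None
    return parse


def p_false(data, off):
    """false: never matches, so it can never end a sequence."""
    return None


def seq(elem, sep, done, term):
    """T seq(T_s; e, T_t): elements, separators, stop condition, terminator."""
    def parse(data, off):
        items = []
        while True:
            # Stop when the condition holds or the terminator is next.
            if done(items) or term(data, off) is not None:
                return off, items
            if items:  # a separator is required between elements
                s = sep(data, off)
                if s is None:
                    return None
                off = s[0]
            e = elem(data, off)
            if e is None:
                return None
            off, value = e
            items.append(value)
    return parse


ip = seq(p_int, lit("."), lambda items: len(items) == 4, p_false)
names = seq(p_word, lit("|"), lambda items: False, lit(";"))
```

`ip("192.168.1.1", 0)` stops after four elements because of the length condition, while `names("Bill|Cathy|Jane|Bob;", 0)` stops at offset 19 when it sees the `;` terminator.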

Abstraction and Application
Can parameterize types over values: λx.T
Correspondingly, can apply types to values: T e
Example: IP address with terminator
IP_addr = λterm. int seq(S_c(‘.’); len 4, S_c(term))
“ |” parsed as IP_addr ‘|’ * S_c(‘|’) yields ([1,2,3,4], inl ‘|’)
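
In a host language with first-class functions, type abstraction and application correspond to an ordinary function that returns a parser. The sketch below assumes the same hypothetical encoding as elsewhere (parsers map (data, offset) to (new offset, value) or None), treats the terminator as lookahead, and uses the input `"1.2.3.4|"`, chosen to match the result on the slide.

```python
def p_int(data, off):
    """Base type int: the longest run of ASCII digits starting at `off`."""
    end = off
    while end < len(data) and "0" <= data[end] <= "9":
        end += 1
    return (end, int(data[off:end])) if end > off else None


def lit(s):
    """Matches exactly the string `s`."""
    def parse(data, off):
        return (off + len(s), s) if data.startswith(s, off) else None
    return parse


def seq(elem, sep, done, term):
    """Sequence of `elem` separated by `sep`; `term` is only looked at."""
    def parse(data, off):
        items = []
        while True:
            if done(items) or term(data, off) is not None:
                return off, items
            if items:
                s = sep(data, off)
                if s is None:
                    return None
                off = s[0]
            e = elem(data, off)
            if e is None:
                return None
            off, value = e
            items.append(value)
    return parse


def pair(p1, p2):
    """Non-dependent product T * T'."""
    def parse(data, off):
        r1 = p1(data, off)
        if r1 is None:
            return None
        r2 = p2(data, r1[0])
        if r2 is None:
            return None
        return r2[0], (r1[1], r2[1])
    return parse


def ip_addr(term):
    """Abstraction over `term`: returns the parser for one choice of terminator."""
    return seq(p_int, lit("."), lambda items: len(items) == 4, lit(term))


# Application: ip_addr("|") is the type applied to the value "|".
ip_then_bar = pair(ip_addr("|"), lit("|"))
```

`ip_then_bar("1.2.3.4|", 0)` returns `(8, ([1, 2, 3, 4], "|"))`; the `inl` tag on the final character is omitted because `lit` is a plain matcher rather than a set type.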

Absorb, Compute and Scan
Absorb, Compute and Scan are active types.
–absorb(T): consume data from source; produce nothing.
–compute(e:s): consume nothing; output result of computation e.
–scan(T): scan data source for type T.
Examples:
“|” parsed as absorb(S_c(‘|’)) yields ()
“10|12” parsed as Σwidth:int. S_c(‘|’) * Σlength:int. area:compute(width × length:int) yields (10,12,120)
“^%$!&_|” parsed as scan(S_c(‘|’)) yields (6, inl ‘|’)
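
The three active types can be sketched as small combinators. As before, the parser encoding and every name (`absorb`, `compute`, `scan`, `sigma`, `lit`, `p_int`) are assumptions of this sketch; in particular `scan` is modelled as returning the number of characters skipped together with the value found.

```python
def p_int(data, off):
    """Base type int: the longest run of ASCII digits starting at `off`."""
    end = off
    while end < len(data) and "0" <= data[end] <= "9":
        end += 1
    return (end, int(data[off:end])) if end > off else None


def lit(s):
    """Matches exactly the string `s`."""
    def parse(data, off):
        return (off + len(s), s) if data.startswith(s, off) else None
    return parse


def sigma(first, rest):
    """Dependent product: `rest` builds the second parser from the first value."""
    def parse(data, off):
        r1 = first(data, off)
        if r1 is None:
            return None
        r2 = rest(r1[1])(data, r1[0])
        if r2 is None:
            return None
        return r2[0], (r1[1], r2[1])
    return parse


def absorb(p):
    """absorb(T): consume what T matches, but produce only ()."""
    def parse(data, off):
        r = p(data, off)
        return None if r is None else (r[0], ())
    return parse


def compute(thunk):
    """compute(e): consume nothing; the value is the result of `thunk()`."""
    def parse(data, off):
        return off, thunk()
    return parse


def scan(p):
    """scan(T): skip forward to the first offset where T matches."""
    def parse(data, off):
        for start in range(off, len(data) + 1):
            r = p(data, start)
            if r is not None:
                return r[0], (start - off, r[1])
        return None
    return parse


# width, an absorbed '|', length, then a computed area.
area = sigma(p_int, lambda w:
       sigma(absorb(lit("|")), lambda _u:
       sigma(p_int, lambda l: compute(lambda: w * l))))
```

`scan(lit("|"))("^%$!&_|", 0)` returns `(7, (6, "|"))`: six characters were skipped before the match. `area("10|12", 0)` yields the nested value `(10, ((), (12, 120)))`, where `()` is the placeholder left by `absorb`.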

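Type Kinding
Kinding ensures types are well formed.
$$\frac{\Gamma \vdash T : s \to k \qquad \Gamma \vdash e : s}{\Gamma \vdash T\,e : k}$$
$$\frac{\Gamma \vdash T : \mathsf{type} \qquad \Gamma \vdash T' : \mathsf{type}}{\Gamma \vdash T + T' : \mathsf{type}}$$
$$\frac{\Gamma \vdash T : \mathsf{type} \qquad \Gamma, x{:}s \vdash e : \mathsf{bool} \quad (s = \ldots)}{\Gamma \vdash \{x{:}T \mid e\} : \mathsf{type}}$$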

Parsing Semantics of Types
Semantics expressed as parsing functions written in the polymorphic λ-calculus.
–Sem(T): DDC Type → Function
–Input: data and offset. Output: new offset, value and parse descriptor.
–For specifics, see upcoming technical report.
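
The shape of such a parsing function can be sketched as follows. The parse descriptor here is a deliberately tiny stand-in: the `PD` record with an error count and an error code, and the string codes `"ok"` and `"syntax"`, are assumptions of this sketch and not the descriptor defined by the authors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PD:
    """Hypothetical parse descriptor: how many errors, and what kind."""
    nerr: int
    code: str  # "ok" or "syntax"


def parse_int(data, off):
    """Always returns (new_offset, representation, parse descriptor)."""
    end = off
    while end < len(data) and "0" <= data[end] <= "9":
        end += 1
    if end == off:
        # Nothing usable was read: no value, one error, offset unchanged.
        return off, ("inr", "noval"), PD(1, "syntax")
    return end, ("inl", int(data[off:end])), PD(0, "ok")


def parse_lit(ch):
    """Parser for a single expected character."""
    def parse(data, off):
        if off < len(data) and data[off] == ch:
            return off + 1, ("inl", ch), PD(0, "ok")
        return off, ("inr", "noval"), PD(1, "syntax")
    return parse


def pair(p1, p2):
    """Product: run both parsers and combine their descriptors."""
    def parse(data, off):
        off1, rep1, pd1 = p1(data, off)
        off2, rep2, pd2 = p2(data, off1)
        code = pd1.code if pd1.nerr else pd2.code
        return off2, (rep1, rep2), PD(pd1.nerr + pd2.nerr, code)
    return parse


record = pair(parse_int, pair(parse_lit("|"), parse_int))
```

Unlike a parser that simply fails, `record("12|ab", 0)` still returns a result: the offset reached, the partial representation, and a descriptor with `nerr == 1`. Keeping errors in a separate descriptor is what lets downstream tools decide how to treat bad records.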

Types of Parser Output
Parsers produce values with the following type in the host language (DDC type on the left, host language type on the right):
Base types: [C(e)]rep = I(C) + noval; [true]rep = unit
Products: [Σx:T.T’]rep = [T]rep * [T’]rep (dependency erased)
Abs. and App.: [λx.T]rep = [T e]rep = [T]rep
Union: [T + T’]rep = [T]rep + [T’]rep + noval
Set types: [{x:T | e}]rep = [T]rep + ([T]rep error)
Here noval marks an unrecoverable error, and the error branch of a set type marks a semantic error.

Properties of the Calculus
Theorem: If Γ ⊢ T : k then
–[T] = F (well-formed types yield parsers)
–Γ ⊢ F : bits * offset → offset * [T]rep * [T]pd (a T-parser returns values with types that correspond to T)
Theorem: Parsers report errors accurately.
–Errors in parse descriptor correspond to actual errors in data.
–Parsers check all semantic constraints.
–More …

Making Use of the Calculus
IPADS is encoded in the DDC by a translation judgment Γ ⊢ t ⇓ T.
IPADS:
t ::= C(e) | Pfun(x:s) = t | t e | Pstruct{fields} | Punion{fields} | Pswitch e of {alts t_def;} | Popt t | t Pwhere x.e | Palt{fields} | t [t; e, t] | Pcompute e | Plit c
fields ::= ε | fields x : t;
alts ::= ε | alts e => t;

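Example: Popt and Plit
$$\frac{\Gamma \vdash t \Downarrow T}{\Gamma \vdash \mathtt{Popt}\ t \Downarrow T + \mathtt{true}}$$
$$\frac{\Gamma \vdash c : \mathtt{char}}{\Gamma \vdash \mathtt{Plit}\ c \Downarrow \mathtt{scan}(\mathtt{absorb}(\{x{:}\mathtt{char} \mid x = c\}))}$$
DDC constructs used: true, T_1 + T_2, C(e), {x:T | e}, absorb(T), scan(T)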
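
The two translation rules can be read as a recursive function over syntax trees. The sketch below covers only this fragment; the class names, the string form of the constraint, and the reduction of base types to bare names are all assumptions made for illustration.

```python
from dataclasses import dataclass


# Source fragment (IPADS). Class names are hypothetical.
@dataclass(frozen=True)
class Base:
    name: str      # a base type C(e), reduced to its name here

@dataclass(frozen=True)
class Popt:
    t: object

@dataclass(frozen=True)
class Plit:
    c: str


# Target fragment (DDC).
@dataclass(frozen=True)
class BaseT:
    name: str

@dataclass(frozen=True)
class TrueT:
    pass

@dataclass(frozen=True)
class Sum:
    left: object
    right: object

@dataclass(frozen=True)
class SetT:
    var: str
    base: object
    pred: str      # the constraint, kept as text in this sketch

@dataclass(frozen=True)
class Absorb:
    t: object

@dataclass(frozen=True)
class Scan:
    t: object


def translate(t):
    """Map an IPADS term to a DDC type, one rule per constructor."""
    if isinstance(t, Base):
        return BaseT(t.name)
    if isinstance(t, Popt):
        # Popt t  becomes  T + true
        return Sum(translate(t.t), TrueT())
    if isinstance(t, Plit):
        # Plit c  becomes  scan(absorb({x:char | x = c}))
        return Scan(Absorb(SetT("x", BaseT("char"), f"x = {t.c!r}")))
    raise TypeError(f"unsupported IPADS term: {t!r}")
```

For example, `translate(Popt(Base("int")))` produces `Sum(BaseT("int"), TrueT())`, the DDC reading of an optional integer.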

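Example: Pswitch
$$\frac{\Gamma \vdash t_i \Downarrow T_i \ (i = 1 \ldots n) \qquad \Gamma \vdash t_{def} \Downarrow T_{def}}{\Gamma \vdash \mathtt{Pswitch}\ e\ \mathtt{of}\ \{e_1 \Rightarrow t_1;\ e_2 \Rightarrow t_2;\ \ldots\ t_{def}\} \Downarrow (\lambda c.\{x{:}T_1 \mid c = e_1\} + \{x{:}T_2 \mid c = e_2\} + \ldots + T_{def})\ e}$$
DDC constructs used: T + T’, λx.T, {x:T | e}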
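
Read operationally, the encoding says that a switch is a union of guarded branches, abstracted over the selector and then applied to it. A minimal sketch under the usual assumptions (parsers map (data, offset) to (new offset, value) or None; all names are hypothetical, and the `inl`/`inr` tags of the union are flattened away):

```python
def p_int(data, off):
    """Base type int: the longest run of ASCII digits starting at `off`."""
    end = off
    while end < len(data) and "0" <= data[end] <= "9":
        end += 1
    return (end, int(data[off:end])) if end > off else None


def p_word(data, off):
    """Hypothetical base type: a non-empty run of letters."""
    end = off
    while end < len(data) and data[end].isalpha():
        end += 1
    return (end, data[off:end]) if end > off else None


def p_true(data, off):
    """true: matches the empty string."""
    return off, ()


def pswitch(alts, default):
    """Returns a function of the selector c (the abstraction in the rule)."""
    def typ(c):
        def parse(data, off):
            # Union of set types: branch i succeeds only if its type
            # parses AND the guard c = e_i holds; otherwise try the next.
            for e_i, p_i in alts:
                r = p_i(data, off)
                if r is not None and c == e_i:
                    return r
            return default(data, off)
        return parse
    return typ


# Selector 1 expects an int, selector 2 a word, anything else reads nothing.
sw = pswitch([(1, p_int), (2, p_word)], p_true)
```

Applying `sw` to a selector gives an ordinary parser: `sw(1)("42", 0)` returns `(2, 42)`, while `sw(3)("42", 0)` falls through to the default and returns `(0, ())`. A direct implementation would test the selector before parsing; the sketch follows the order implied by the encoding instead.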

Future work
What is the set of languages recognized by the DDC?
How does the expressive power of the DDC relate to CFGs and regular expressions?
Implement recursive types in the PADS system based on the recursive types of the DDC.
Add polymorphism to DDC and PADS.

Summary
Data description languages are well-suited to describing ad hoc data.
No one DDL will ever be right: different domains and applications will demand different languages with differing levels of expressiveness and abstraction.
Our work defines the first semantics for data description languages.
For more information, visit

Cut slides follow

A Brief History
In the beginning, there was just one program (maybe two). No need for a programming language.
That program was copied and changed until there were many programs. The high-level programming language was invented.
Nice, but not right for all situations: many new programming languages appeared.
How do these languages relate to each other?
–Programming language semantics was born.

A Brief History
In the beginning, there was just one data format (binary). No need for a data description language.
That format evolved until there were many formats. The data description language was invented.
One language did not suit all, and many new data description languages appeared.
–This is where we are today.
We’d like to help answer that question by devising the first data description language semantics.