The Next 700 Data Description Languages Yitzhak Mandelbaum Princeton University Computer Science Collaborators: Kathleen Fisher and David Walker
“The Next 700 …”
Program(s)        Programming Language(s)        PL Semantics
Data Format(s)    Data Description Language(s)   DDL Semantics
What Data Needs Describing? Much data lives in databases and in common formats like XML, but much data is ad hoc. Ad hoc data lacks readily available tools for parsing, querying, analysis, or transformation. It’s all over the place: financial, telecom, chemistry, physics, biology, etc.
Ad Hoc Data in Biology !autogenerated-by: DAG-Edit version rev 3 !saved-by: gocvs !date: Fri Mar 18 21:00:28 PST 2005 !version: $Revision: $ !type: % is_a is a !type: < part_of part of !type: ^ inverse_of inverse of !type: | disjoint_from disjoint from $Gene_Ontology ; GO: <biological_process ; GO: %behavior ; GO: ; synonym:behaviour %adult behavior ; GO: ; synonym:adult behaviour %adult feeding behavior ; GO: ; synonym:adult feeding behaviour % feeding behavior ; GO: %adult locomotory behavior ; GO: ;... from
Ad Hoc Data in Chemistry C5=CC=CC=C5)=O)C1
Ad Hoc Data from Web Server Logs (CLF) [15/Oct/1997:18:46: ] "GET /tk/p.txt HTTP/1.0" tj62.aol.com - - [16/Oct/1997:14:32: ] "POST HTTP/1.0"
Ad Hoc Data: DNS packets : 9192 d8fb d r : f 6d00 esearch.att.com : 00fc 0001 c00c e ' : 036e 7331 c00c 0a68 6f73 746d ns1...hostmaste : 72c0 0c77 64e e r..wd.I : 36ee e 10c0 0c00 0f e : a00 0a05 6c69 6e75 78c0 0cc0 0c linux : 0f e c00 0a07 6d61 696c mail : 6d61 6ec0 0cc0 0c e 1000 man : 0487 cf1a 16c0 0c e a0: e73 30c0 0cc0 0c e..ns b0: c0 2e03 5f67 63c0 0c _gc...! c0: d c c X.....d...phys d0: f.research.att.co
Data Description Languages Data description languages describe many ad hoc formats and provide the following features: –Descriptions serve as documentation, including the semantics of the data. –A compiler generates tools from a description: parser, printer, query engine, converter to XML, statistical profiler, etc. –Parsers include robust error detection and recovery. –Parsers can handle high data volumes, such as > 1GB/second of Netflow traffic from Cisco routers.
Many Data Description Languages Logical Descriptions –ASN.1 –ASDL Physical Descriptions –PacketTypes (SIGCOMM ‘00) –DataScript (GPCE ‘02) –PADS (PLDI ‘05): basis for current work
Contributions A core data description calculus (DDC) –Based on dependent type theory –Simple, orthogonal, composable types –Types are transducers from an external data source to an internal data representation. Encodings of high-level DDLs (PacketTypes, PADS, DataScript) in the low-level DDC –Explains the semantics of the PADS language in particular.
Base Types and Sequences
C(e): base type, which can be parameterized by an expression e.
Σx:T.T’: dependent product, which describes a sequence of values.
–Variable x gives a name to the first value in the sequence.
Examples (input, type → parsed value):
“123hello|”   int * string(‘|’) * char   → (123, “hello”, ‘|’)
“3513”   Σwidth:int_fw(1). int_fw(width)   → (3, 513)
“:hello:”   Σterm:char. string(term) * char   → (‘:’, “hello”, ‘:’)
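The three examples above can be sketched with a few parser combinators. This is a minimal illustration, not the calculus's own semantics: the names `p_int`, `p_int_fw`, `p_char`, `p_string`, and `sigma` are hypothetical, a parser is assumed to be a function from (data, offset) to (value, new offset), and errors are raised rather than recorded.

```python
# Hypothetical sketch: a parser maps (data, offset) to (value, new_offset).
DIGITS = "0123456789"

def p_int(data, off):
    """int: a maximal, non-empty run of decimal digits."""
    end = off
    while end < len(data) and data[end] in DIGITS:
        end += 1
    if end == off:
        raise ValueError(f"expected a digit at offset {off}")
    return int(data[off:end]), end

def p_int_fw(width):
    """int_fw(width): an integer of exactly `width` digits."""
    def parse(data, off):
        chunk = data[off:off + width]
        if len(chunk) != width or any(c not in DIGITS for c in chunk):
            raise ValueError(f"expected {width} digits at offset {off}")
        return int(chunk), off + width
    return parse

def p_char(data, off):
    """char: any single character."""
    if off >= len(data):
        raise ValueError("unexpected end of input")
    return data[off], off + 1

def p_string(term):
    """string(term): the characters up to, but not including, `term`."""
    def parse(data, off):
        end = data.find(term, off)
        if end < 0:
            raise ValueError(f"terminator {term!r} not found")
        return data[off:end], end
    return parse

def sigma(first, rest_of):
    """Sigma x:T.T': parse T, bind its value to x, then parse T', which may use x."""
    def parse(data, off):
        x, off = first(data, off)
        y, off = rest_of(x)(data, off)
        return (x, y), off
    return parse

# A non-dependent product T * T' is a sigma whose second part ignores x.
ex1 = sigma(p_int, lambda _: sigma(p_string("|"), lambda _: p_char))
ex2 = sigma(p_int_fw(1), lambda width: p_int_fw(width))
ex3 = sigma(p_char, lambda term: sigma(p_string(term), lambda _: p_char))

print(ex1("123hello|", 0)[0])  # (123, ('hello', '|'))
print(ex2("3513", 0)[0])       # (3, 513)
print(ex3(":hello:", 0)[0])    # (':', ('hello', ':'))
```

The sketch returns nested pairs where the slide shows flat tuples; the second and third examples show the point of the dependency, since the width and the terminator are read from the data itself.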
Constraints
{x:T | e}: set types constrain the type T and express relationships between elements of the data.
Examples (input, type → parsed value):
‘a’   {c:char | c = ‘a’} (abbrev: S_c(‘a’))   → inl ‘a’
“101”, “82”   {x:int | x > 100}   → inl 101, inr error(82)
“43|105|67”   Σmin:int. S_c(‘|’) * Σmax:{m:int | min ≤ m}. S_c(‘|’) * {avg:int | min ≤ avg & avg ≤ max}   → (43, inl ‘|’, inl 105, inl ‘|’, inl 67)
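A set type can be sketched as a wrapper that parses the underlying type and then tags the value according to the constraint. The helper names (`set_type`, `untag`, `s_c`, `parse_min_max_avg`) and the encoding of inl/inr as tagged tuples are assumptions made for illustration.

```python
DIGITS = "0123456789"

def p_int(data, off):
    """int: a maximal, non-empty run of decimal digits."""
    end = off
    while end < len(data) and data[end] in DIGITS:
        end += 1
    if end == off:
        raise ValueError(f"expected a digit at offset {off}")
    return int(data[off:end]), end

def p_char(data, off):
    """char: any single character."""
    if off >= len(data):
        raise ValueError("unexpected end of input")
    return data[off], off + 1

def set_type(parser, pred):
    """{x:T | e}: parse T, then tag the value inl if e holds, else inr error."""
    def parse(data, off):
        value, off = parser(data, off)
        tagged = ("inl", value) if pred(value) else ("inr", ("error", value))
        return tagged, off
    return parse

def untag(tagged):
    """Recover the underlying value from either tag."""
    tag, payload = tagged
    return payload if tag == "inl" else payload[1]

def s_c(ch):
    """S_c(ch): abbreviation for {c:char | c = ch}."""
    return set_type(p_char, lambda c: c == ch)

def parse_min_max_avg(data):
    """min, max, and avg separated by '|'; later constraints mention earlier fields."""
    lo, off = p_int(data, 0)
    bar1, off = s_c("|")(data, off)
    hi, off = set_type(p_int, lambda m: lo <= m)(data, off)
    bar2, off = s_c("|")(data, off)
    hi_val = untag(hi)
    avg, off = set_type(p_int, lambda a: lo <= a <= hi_val)(data, off)
    return (lo, bar1, hi, bar2, avg)

over_100 = set_type(p_int, lambda x: x > 100)
print(over_100("101", 0)[0])           # ('inl', 101)
print(over_100("82", 0)[0])            # ('inr', ('error', 82))
print(parse_min_max_avg("43|105|67"))
```

Note that a failed constraint still yields the parsed value, wrapped as an error, so parsing can continue past semantically bad data.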
Unions and the Empty String
true: matches the empty string.
T + T’: deterministic, exclusive or: try T; on failure, try T’.
Examples (input, type → parsed value):
“54”, “n/a”   int + S_s(“n/a”)   → inl 54, inr “n/a”
“2341”, “”   int + true   → inl 2341, inr ()
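The ordered choice above can be sketched as a combinator that tries the left parser and falls back to the right one from the same offset. The names `ParseError`, `lit`, `p_true`, and `union` are hypothetical, and failure is modeled with an exception, which is a simplification of how the calculus reports errors.

```python
class ParseError(Exception):
    """Raised when a parser cannot match at the given offset."""

def p_int(data, off):
    """int: a maximal, non-empty run of decimal digits."""
    end = off
    while end < len(data) and data[end] in "0123456789":
        end += 1
    if end == off:
        raise ParseError(f"expected a digit at offset {off}")
    return int(data[off:end]), end

def lit(s):
    """Match the literal string s exactly."""
    def parse(data, off):
        if data.startswith(s, off):
            return s, off + len(s)
        raise ParseError(f"expected {s!r} at offset {off}")
    return parse

def p_true(data, off):
    """true: match the empty string, consuming nothing."""
    return (), off

def union(left, right):
    """T + T': try T; only if it fails, try T' from the same offset."""
    def parse(data, off):
        try:
            value, new_off = left(data, off)
            return ("inl", value), new_off
        except ParseError:
            value, new_off = right(data, off)
            return ("inr", value), new_off
    return parse

int_or_na = union(p_int, lit("n/a"))
opt_int = union(p_int, p_true)

print(int_or_na("54", 0)[0])   # ('inl', 54)
print(int_or_na("n/a", 0)[0])  # ('inr', 'n/a')
print(opt_int("", 0)[0])       # ('inr', ())
```

Because the left branch always wins when it succeeds, the choice is deterministic; `T + true` gives an optional T.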
Array Features What features do we need to handle data sequences? –Elements –Separator between elements –Termination condition (“are we done yet?”) –Terminator after sequence Examples: “ ” “Bill|Cathy|Jane|Bob;”
False and Arrays
T seq(T_s; e, T_t) specifies:
–Element type T.
–Separator type T_s.
–Termination condition e.
–Terminator type T_t.
false: reads nothing, flagging an error.
Example: IP address.
“ ”   int seq(S_c(‘.’); len 4, false)   → [192,168,1,1]
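A sequence type with these four parts can be sketched as a loop. Several choices here are assumptions, not taken from the slides: the termination condition is a predicate over the elements read so far, "len 4" is read as "the length equals 4", the terminator is checked by lookahead without being consumed, and at least one element is required. All function names are hypothetical.

```python
class ParseError(Exception):
    """Raised when a parser cannot match at the given offset."""

def p_int(data, off):
    """int: a maximal, non-empty run of decimal digits."""
    end = off
    while end < len(data) and data[end] in "0123456789":
        end += 1
    if end == off:
        raise ParseError(f"expected a digit at offset {off}")
    return int(data[off:end]), end

def p_word(data, off):
    """A maximal, non-empty run of letters."""
    end = off
    while end < len(data) and data[end].isalpha():
        end += 1
    if end == off:
        raise ParseError(f"expected a letter at offset {off}")
    return data[off:end], end

def lit(s):
    """Match the literal string s exactly."""
    def parse(data, off):
        if data.startswith(s, off):
            return s, off + len(s)
        raise ParseError(f"expected {s!r} at offset {off}")
    return parse

def p_false(data, off):
    """false: never matches."""
    raise ParseError("false never matches")

def seq(elem, sep, done, term):
    """T seq(T_s; e, T_t): elements split by separators, ended by e or T_t."""
    def parse(data, off):
        items = []
        while True:
            value, off = elem(data, off)
            items.append(value)
            if done(items):          # termination condition e
                return items, off
            try:
                term(data, off)      # assumption: lookahead only, not consumed
                return items, off
            except ParseError:
                pass
            _, off = sep(data, off)  # otherwise a separator must follow
    return parse

ip = seq(p_int, lit("."), lambda xs: len(xs) == 4, p_false)
names = seq(p_word, lit("|"), lambda xs: False, lit(";"))

print(ip("192.168.1.1", 0)[0])              # [192, 168, 1, 1]
print(names("Bill|Cathy|Jane|Bob;", 0)[0])  # ['Bill', 'Cathy', 'Jane', 'Bob']
```

Using `false` as the terminator means the sequence can only end through the termination condition, which is why it suits the fixed-length IP address.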
Abstraction and Application
Can parameterize types over values: λx.T
Correspondingly, can apply types to values: T e
Example: IP address with terminator.
IP_addr = λterm. int seq(S_c(‘.’); len 4, S_c(term))
“ |”   IP_addr ‘|’ * S_c(‘|’)   → ([1,2,3,4], inl ‘|’)
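In a host language, type abstraction corresponds roughly to a function that returns a parser, and type application to calling it. The sketch below assumes an input such as "1.2.3.4|", reads "len 4" as "the length equals 4", and treats the sequence terminator as lookahead only; `ip_addr`, `seq`, `lit`, and `ip_then_bar` are hypothetical names.

```python
class ParseError(Exception):
    """Raised when a parser cannot match at the given offset."""

def p_int(data, off):
    """int: a maximal, non-empty run of decimal digits."""
    end = off
    while end < len(data) and data[end] in "0123456789":
        end += 1
    if end == off:
        raise ParseError(f"expected a digit at offset {off}")
    return int(data[off:end]), end

def lit(s):
    """Match the literal string s exactly."""
    def parse(data, off):
        if data.startswith(s, off):
            return s, off + len(s)
        raise ParseError(f"expected {s!r} at offset {off}")
    return parse

def seq(elem, sep, done, term):
    """Elements split by separators; stop on `done` or when `term` is next."""
    def parse(data, off):
        items = []
        while True:
            value, off = elem(data, off)
            items.append(value)
            if done(items):
                return items, off
            try:
                term(data, off)      # assumption: lookahead only
                return items, off
            except ParseError:
                pass
            _, off = sep(data, off)
    return parse

def ip_addr(term):
    """Abstraction (lambda term. ...): a parser parameterized by its terminator."""
    return seq(p_int, lit("."), lambda xs: len(xs) == 4, lit(term))

def ip_then_bar(data):
    """Application: IP_addr '|' followed by the literal '|'."""
    addr, off = ip_addr("|")(data, 0)
    bar, off = lit("|")(data, off)
    return addr, bar

print(ip_then_bar("1.2.3.4|"))  # ([1, 2, 3, 4], '|')
```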
Absorb, Compute and Scan
Absorb, Compute and Scan are active types.
–absorb(T): consume data from the source; produce nothing.
–compute(e:σ): consume nothing; output the result of computation e.
–scan(T): scan the data source for type T.
Examples (input, type → parsed value):
“|”   absorb(S_c(‘|’))   → ()
“10|12”   Σwidth:int. S_c(‘|’) * Σlength:int. area:compute(width × length:int)   → (10,12,120)
“^%$!&_|”   scan(S_c(‘|’))   → (6, inl ‘|’)
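The three active types can be sketched as small combinators. The names and the exception-based failure model are assumptions; `scan` is assumed to return the number of characters skipped together with the matched value, as the last example suggests. The area example uses `absorb` on the separator so that the result matches the three-field tuple shown.

```python
class ParseError(Exception):
    """Raised when a parser cannot match at the given offset."""

def p_int(data, off):
    """int: a maximal, non-empty run of decimal digits."""
    end = off
    while end < len(data) and data[end] in "0123456789":
        end += 1
    if end == off:
        raise ParseError(f"expected a digit at offset {off}")
    return int(data[off:end]), end

def lit(s):
    """Match the literal string s exactly."""
    def parse(data, off):
        if data.startswith(s, off):
            return s, off + len(s)
        raise ParseError(f"expected {s!r} at offset {off}")
    return parse

def absorb(parser):
    """absorb(T): consume what T matches, but produce only unit."""
    def parse(data, off):
        _, new_off = parser(data, off)
        return (), new_off
    return parse

def compute(thunk):
    """compute(e): consume nothing; the value is the result of e."""
    def parse(data, off):
        return thunk(), off
    return parse

def scan(parser):
    """scan(T): skip characters until T matches; report how many were skipped."""
    def parse(data, off):
        for skipped in range(len(data) - off + 1):
            try:
                value, new_off = parser(data, off + skipped)
                return (skipped, value), new_off
            except ParseError:
                continue
        raise ParseError("scan reached end of input")
    return parse

def parse_area(data):
    """width '|' length, plus a computed area field."""
    width, off = p_int(data, 0)
    _, off = absorb(lit("|"))(data, off)
    length, off = p_int(data, off)
    area, off = compute(lambda: width * length)(data, off)
    return (width, length, area)

print(absorb(lit("|"))("|", 0)[0])     # ()
print(parse_area("10|12"))             # (10, 12, 120)
print(scan(lit("|"))("^%$!&_|", 0)[0]) # (6, '|')
```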
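Type Kinding
Kinding ensures types are well formed.

Γ ⊢ T : σ → κ     Γ ⊢ e : σ
───────────────────────────
Γ ⊢ T e : κ

Γ ⊢ T : type     Γ ⊢ T’ : type
──────────────────────────────
Γ ⊢ T + T’ : type

Γ ⊢ T : type     Γ, x:σ ⊢ e : bool     (σ = …)
──────────────────────────────────────────────
Γ ⊢ {x:T | e} : type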
Parsing Semantics of Types Semantics are expressed as parsing functions written in the polymorphic λ-calculus. –Sem(T): maps a DDC type to a parsing function. –Input: data and offset. Output: new offset, value, and parse descriptor. –For specifics, see the upcoming technical report.
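The shape of such a parsing function can be sketched as follows. The parse descriptor fields (`nerr`, `code`, `start`, `end`), the `NOVAL` marker, and the function names are all hypothetical; the technical report defines the actual structure.

```python
from dataclasses import dataclass

NOVAL = None  # hypothetical stand-in for "no value could be produced"

@dataclass
class PD:
    """Hypothetical parse descriptor: what went wrong, and where."""
    nerr: int   # number of errors detected
    code: str   # "ok", "syntax", or "semantic"
    start: int  # offset where parsing began
    end: int    # offset where parsing stopped

def p_int(data, off):
    """Parsing function for int: returns (new offset, value, parse descriptor)."""
    end = off
    while end < len(data) and data[end] in "0123456789":
        end += 1
    if end == off:
        return off, NOVAL, PD(1, "syntax", off, off)
    return end, int(data[off:end]), PD(0, "ok", off, end)

def constrained(parser, pred):
    """Parsing function for {x:T | e}: keep the value, record a semantic error."""
    def parse(data, off):
        new_off, rep, pd = parser(data, off)
        if pd.nerr == 0 and not pred(rep):
            pd = PD(1, "semantic", pd.start, pd.end)
        return new_off, rep, pd
    return parse

print(p_int("42x", 0))
print(p_int("x", 0))
print(constrained(p_int, lambda x: x > 100)("82", 0))
```

Returning a descriptor instead of raising lets a caller distinguish data that could not be read at all from data that was read but violates a constraint.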
Types of Parser Output
Parsers produce values with the following types in the host language:

Construct        DDC                     Host Language               Note
Base types       [C(e)]rep               I(C) + noval                noval: unrecoverable error
Base types       [true]rep               unit
Products         [Σx:T.T’]rep            [T]rep * [T’]rep            dependency erased
Abs. and App.    [λx.T]rep, [T e]rep     [T]rep
Union            [T + T’]rep             [T]rep + [T’]rep + noval
Set types        [{x:T | e}]rep          [T]rep + ([T]rep error)     error: semantic error
Properties of the Calculus
Theorem: If ⊢ T : κ then
–[T] = F (well-formed types yield parsers)
–⊢ F : bits * offset → offset * [T]rep * [T]pd (a T-parser returns values with types that correspond to T)
Theorem: Parsers report errors accurately.
–Errors in the parse descriptor correspond to actual errors in the data.
–Parsers check all semantic constraints.
–More …
Making Use of the Calculus
IPADS → DDC, via the judgment ⊢ t ⇒ T (IPADS type t translates to DDC type T).
IPADS grammar:
t ::= C(e) | Pfun(x:σ) = t | t e | Pstruct{fields} | Punion{fields} | Pswitch e of {alts t_def;} | Popt t | t Pwhere x.e | Palt{fields} | t [t; e, t] | Pcompute e | Plit c
fields ::= (empty) | fields x : t;
alts ::= (empty) | alts e => t;
Example: Popt and Plit

⊢ t ⇒ T
───────────────────
⊢ Popt t ⇒ T + true

⊢ c : char
─────────────────────────────────────────
⊢ Plit c ⇒ scan(absorb({x:char | x = c}))

DDC constructs used: true, T1 + T2, C(e), {x:T | e}, absorb(T), scan(T)
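Read as code, the two rules above say that `Popt` and `Plit` need no primitive support: each is a composition of lower-level combinators. The sketch below uses hypothetical names throughout and one simplification worth flagging: a failed character constraint raises an exception so that `scan` can move on, rather than producing an error value.

```python
class ParseError(Exception):
    """Raised when a parser cannot match at the given offset."""

def p_true(data, off):
    """true: match the empty string."""
    return (), off

def union(left, right):
    """T + T': try T; only if it fails, try T'."""
    def parse(data, off):
        try:
            value, new_off = left(data, off)
            return ("inl", value), new_off
        except ParseError:
            value, new_off = right(data, off)
            return ("inr", value), new_off
    return parse

def char_eq(c):
    """{x:char | x = c}; simplification: a failed constraint raises."""
    def parse(data, off):
        if off < len(data) and data[off] == c:
            return c, off + 1
        raise ParseError(f"expected {c!r} at offset {off}")
    return parse

def absorb(parser):
    """absorb(T): consume what T matches; produce unit."""
    def parse(data, off):
        _, new_off = parser(data, off)
        return (), new_off
    return parse

def scan(parser):
    """scan(T): skip characters until T matches."""
    def parse(data, off):
        for skipped in range(len(data) - off + 1):
            try:
                value, new_off = parser(data, off + skipped)
                return (skipped, value), new_off
            except ParseError:
                continue
        raise ParseError("scan reached end of input")
    return parse

def popt(t):
    """Popt t  =>  T + true"""
    return union(t, p_true)

def plit(c):
    """Plit c  =>  scan(absorb({x:char | x = c}))"""
    return scan(absorb(char_eq(c)))

print(plit("|")("ab|cd", 0))       # ((2, ()), 3)
print(popt(char_eq("x"))("xy", 0)) # (('inl', 'x'), 1)
print(popt(char_eq("x"))("y", 0))  # (('inr', ()), 0)
```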
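Example: Pswitch

⊢ t_i ⇒ T_i (i = 1…n)     ⊢ t_def ⇒ T_def
──────────────────────────────────────────────────────────────────────────────────────────────
⊢ Pswitch e of {e_1 => t_1; e_2 => t_2; … t_def} ⇒ (λc.{x:T_1 | c = e_1} + {x:T_2 | c = e_2} + … + T_def) e

DDC constructs used: T + T’, λx.T, {x:T | e}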
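The encoding above can be sketched the same way: each branch is a set type whose constraint ignores the parsed value and only compares the selector with the branch label, and the branches are chained with ordered choice. Names are hypothetical; the sketch tags results with a branch index instead of nested inl/inr, and a failed guard raises so the next branch is tried.

```python
class ParseError(Exception):
    """Raised when a parser cannot match at the given offset."""

def p_digit(data, off):
    """A single decimal digit."""
    if off < len(data) and data[off] in "0123456789":
        return data[off], off + 1
    raise ParseError(f"expected a digit at offset {off}")

def p_letter(data, off):
    """A single letter."""
    if off < len(data) and data[off].isalpha():
        return data[off], off + 1
    raise ParseError(f"expected a letter at offset {off}")

def p_any(data, off):
    """Any single character."""
    if off < len(data):
        return data[off], off + 1
    raise ParseError("unexpected end of input")

def guarded(parser, guard):
    """{x:T | c = e_i}: parse T, then fail if the guard (independent of x) is false."""
    def parse(data, off):
        value, new_off = parser(data, off)
        if not guard:
            raise ParseError("switch guard is false")
        return value, new_off
    return parse

def first_of(parsers):
    """n-ary ordered choice; the result is tagged with the branch index."""
    def parse(data, off):
        for index, parser in enumerate(parsers):
            try:
                value, new_off = parser(data, off)
                return (index, value), new_off
            except ParseError:
                continue
        raise ParseError("no branch matched")
    return parse

def pswitch(branches, default):
    """lambda c. {x:T_1 | c = e_1} + ... + T_def, returned as a function of c."""
    def applied(c):
        return first_of([guarded(t, c == e) for e, t in branches] + [default])
    return applied

switch = pswitch([(0, p_digit), (1, p_letter)], p_any)

print(switch(0)("7", 0))  # ((0, '7'), 1)
print(switch(1)("a", 0))  # ((1, 'a'), 1)
print(switch(5)("?", 0))  # ((2, '?'), 1)
```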
Future work What is the set of languages recognized by the DDC? How does the expressive power of the DDC relate to CFGs and regular expressions? Implement recursive types in the PADS system based on the recursive types of the DDC. Add polymorphism to the DDC and PADS.
Summary Data description languages are well-suited to describing ad hoc data. No one DDL will ever be right: different domains and applications will demand different languages with differing levels of expressiveness and abstraction. Our work defines the first semantics for data description languages. For more information, visit
Cut slides follow
A Brief History In the beginning, there was just one program (maybe two). No need for a programming language. That program was copied and changed until there were many programs. The high-level programming language was invented. Nice, but not right for all situations, so many new programming languages appeared. How do these languages relate to each other? –Programming language semantics was born.
A Brief History In the beginning, there was just one data format (binary). No need for a data description language. That format evolved until there were many formats. The data description language was invented. One language did not suit all, and many new data description languages appeared. –This is where we are today. How do these languages relate to each other? We’d like to help answer that question by devising the first data description language semantics.