Pads/ML: A Functional Data Description Language David Walker Princeton University with: Yitzhak Mandelbaum (Princeton), Kathleen Fisher and Mary Fernandez.

Presentation transcript:

Pads/ML: A Functional Data Description Language David Walker Princeton University with: Yitzhak Mandelbaum (Princeton), Kathleen Fisher and Mary Fernandez (AT&T)

2 Data, data everywhere!
Incredible amounts of data are stored in well-behaved formats, such as databases and XML.
These formats come with: tools, schema, browsers, query languages, standards, libraries, books and documentation, conversion tools, vendor support, consultants…

3 Ad hoc Data Vast amounts of data in ad hoc formats. Ad hoc data is semi-structured: –Not free text. –Not as rigid as data in relational databases. Examples from many different areas: –Physics –Computer system maintenance and administration –Biology –Finance –Government –Healthcare –More!

4 Ad Hoc Data in Biology
format-version: 1.0
date: 11:11: :24
auto-generated-by: DAG-Edit rev 3
default-namespace: gene_ontology
subsetdef: goslim_goa "GOA and proteome slim"
[Term]
id: GO:
name: mitochondrion inheritance
namespace: biological_process
def: "The distribution of mitochondria\, including the mitochondrial genome\, into daughter cells after mitosis or meiosis\, mediated by interactions between mitochondria and the cytoskeleton." [PMID: ,PMID: , SGD:mcc]
is_a: GO: ! organelle inheritance
is_a: GO: ! mitochondrion distribution

5 Ad Hoc Data in Chemistry C5=CC=CC=C5)=O)C1

6 Ad Hoc Data in Finance HA START OF TEST CYCLE aA BXYZ U1AB B HL START OF OPEN INTEREST d FZYX G1AB HM END OF OPEN INTEREST HE START OF SUMMARY f NYZX B1QB B B HF END OF SUMMARY

7 Ad Hoc Data from Web Server Logs (CLF)
[15/Oct/1997:18:46: ] "GET /tk/p.txt HTTP/1.0"
tj62.aol.com - - [16/Oct/1997:14:32: ] "POST HTTP/1.0"
[15/Oct/1997:18:53: ] "GET /tr/img/gift.gif HTTP/1.0"
[15/Oct/1997:18:39: ] "GET /tr/img/wool.gif HTTP/1.0"
[16/Oct/1997:12:59: ] "GET / HTTP/1.0"
ekf - [17/Oct/1997:10:08: ] "GET /img/new.gif HTTP/1.0" 304 -

8 Ad Hoc Data: DNS packets : 9192 d8fb d r : f 6d00 esearch.att.com : 00fc 0001 c00c e ' : 036e 7331 c00c 0a68 6f73 746d ns1...hostmaste : 72c0 0c77 64e e r..wd.I : 36ee e 10c0 0c00 0f e : a00 0a05 6c69 6e75 78c0 0cc0 0c linux : 0f e c00 0a07 6d61 696c mail : 6d61 6ec0 0cc0 0c e 1000 man : 0487 cf1a 16c0 0c e a0: e73 30c0 0cc0 0c e..ns b0: c0 2e03 5f67 63c0 0c _gc...! c0: d c c X.....d...phys d0: f.research.att.co

9 Properties of Ad hoc Data
Data arrives “as is”: you don’t choose the format.
Documentation is often out-of-date or nonexistent.
– Hijacked fields.
– Undocumented “missing value” representations.
Data is buggy.
– Missing data, “extra” data, …
– Human error, malfunctioning machines, software bugs (e.g. race conditions on log entries), …
– Errors are sometimes the most interesting portion of the data.
Data sources often have high volume.
– Data might not fit into main memory.
Data can be created by malicious sources attempting to exploit software vulnerabilities.
– cf. the Ethereal network monitoring system

10 The Goal(s)
What can we do about ad hoc data?
– How do we read it into programs?
– How do we detect errors?
– How do we correct errors?
– How do we query it?
– How do we discover its structure and properties?
– How do we view it?
– How do we transform it into standard formats like CSV and XML?
– How do we merge multiple data sources?
In short: how do we do all the things we take for granted when dealing with standard formats, in a fault-tolerant and efficient yet nearly effortless way?

11 Enter Pads
Pads: a system for Processing Ad hoc Data Sources.
Three main components:
– a data description language for concise and precise specifications of ad hoc data formats and semantic properties
– a compiler that automatically generates a suite of programming libraries & end-to-end applications
– a visual interface to support both novice and expert users

12 One Description, Many Tools
[Diagram: a data description (type T) is fed to the compiler, which generates a parser, printer, query engine, visual data browser, XML translator, ..., delivered as a programming library or a complete application.]

13 Some Advantages Over Ad Hoc Methods
Big bang for buck: 1 description, many tools.
Descriptions document data sources.
– The documentation IS the tool generator, so documentation is automatically kept up-to-date with the implementation.
Descriptions are easy to write, easy to understand.
– Descriptions are high-level & declarative.
– Description syntax exploits programmer intuition concerning types.
Tools are robust.
– Error handling code is generated automatically; it doesn’t clutter documentation.
Descriptions & generated tools can be analyzed and reasoned about.
– e.g. data size, tool termination & safety properties, coherence of generated parsers & printers

14 The PADS Project
PADS/C [PLDI 05; POPL 06]
– Based on C type structure.
– Generates C libraries (too bad C doesn’t actually support libraries...).
– LaunchPADS visual interface [Daly et al., SIGMOD 06]
PADS/ML (Mandelbaum’s thesis)
– Based on the ML type structure: polymorphic, dependent datatypes.
– Generates ML modules: better reuse & library structure; functional data processing = far greater programmer productivity.
– New framework for tool development: format-independent algorithms architected using functors vs macros.
– Implementation status: version 1.0 up and running; many more exciting things to do.
Describe real formats:
– Newick tree-structured data
– Reglens galaxy catalogues
– Palm PDA databases
– AT&T call-detail data
– AT&T billing data
– Web server logs
– Gene ontologies
– DNS packets
– OPRA data
– More …

15 Outline Motivation and PADS Overview Data Description in PADS/ML Implementation architecture The Semantics of PADS Conclusions

16 Base Types and Records Base types: C (e). –Describe atomic portions of data. –Parameterized by host-language expression. –Examples: Pint, Pchar, Pstring_FW(n), Pstring(c). Tuples and Records: t * t’ and {x:t; y:t’}. –Record fields are dependent: field names can be referenced by types of later fields. –Example to follow.

17 Base Types and Records
Movie-director Bowling Score (MBS) Format:
122Joe|Wright|45|95|79
n/aEd|Wood|10|47|31
124Chris|Nolan|80|93|85
125Tim|Burton|30|82|71
126George|Lucas|32|62|40
The prefix 125Tim| of the fourth entry is described by Pint * Pstring(‘|’) * Pchar.

18 Base Types and Records
Bookshelf Listing (BL) Format:
13C Programming31Types and Programming Languages20Twenty Years of PLDI36Modern Compiler Implementation in ML 27Elements o f ML Programming
The first entry, 13C Programming, is described by:
{ width: Pint; title: Pstring_FW(width) }
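The dependent record above can be illustrated outside PADS/ML with a small hand-written parser. The following is only a sketch in Python, not what the PADS/ML compiler generates: the function names `parse_pint`, `parse_string_fw`, `parse_bl_entry` and `parse_bookshelf` are hypothetical, and errors are raised as exceptions rather than recorded in a parse descriptor. It shows the one point the slide makes: the value read for `width` decides how many characters `title` consumes.

```python
def parse_pint(data, pos):
    """Read a run of ASCII digits starting at pos; return (value, new_pos)."""
    end = pos
    while end < len(data) and data[end].isdigit():
        end += 1
    if end == pos:
        raise ValueError(f"expected digits at offset {pos}")
    return int(data[pos:end]), end


def parse_string_fw(data, pos, width):
    """Read exactly `width` characters (fixed-width string)."""
    if pos + width > len(data):
        raise ValueError(f"expected {width} characters at offset {pos}")
    return data[pos:pos + width], pos + width


def parse_bl_entry(data, pos):
    """Mirror of { width: Pint; title: Pstring_FW(width) }."""
    width, pos = parse_pint(data, pos)
    # The earlier field `width` parameterizes the type of the later field.
    title, pos = parse_string_fw(data, pos, width)
    return {"width": width, "title": title}, pos


def parse_bookshelf(data):
    """Parse a sequence of back-to-back entries until the input is exhausted."""
    entries, pos = [], 0
    while pos < len(data):
        entry, pos = parse_bl_entry(data, pos)
        entries.append(entry)
    return entries
```

For instance, `parse_bookshelf("13C Programming20Twenty Years of PLDI")` yields two entries whose titles are `C Programming` and `Twenty Years of PLDI`.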

19 Constraints
Constrained types: [x:t | e].
– Enforce the constraint e on the underlying type t.
Example entry: 125Tim|Burton|30|82|71
The separator can be described as Pchar, more precisely as [c:Pchar | c = ‘|’], or simply as the literal ‘|’.
ptype Scores = { min:Pint; ‘|’; max: [m:Pint | min ≤ m]; ‘|’; avg: [a:Pint | min ≤ a & a ≤ max] }
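A constrained type pairs an ordinary parse with a check. The sketch below, in Python rather than PADS/ML, mirrors the `Scores` description under two assumptions that are not in the slides: the function name `parse_scores` is made up, and constraint violations are reported as a list of messages alongside the parsed value instead of in a generated parse descriptor. The point is that a violated constraint is recorded, and the data is still returned.

```python
def parse_scores(text):
    """Parse 'min|max|avg' and check the constraints of the Scores description.

    Returns (rep, errors): rep is a dict (or None if the text cannot be read
    at all) and errors lists every violated constraint.
    """
    fields = text.split("|")
    if len(fields) != 3:
        return None, [f"expected 3 fields, found {len(fields)}"]
    try:
        lo, hi, avg = (int(f) for f in fields)
    except ValueError:
        return None, ["non-integer score"]

    errors = []
    if not lo <= hi:                      # [m:Pint | min <= m]
        errors.append("max violates min <= max")
    if not (lo <= avg <= hi):             # [a:Pint | min <= a & a <= max]
        errors.append("avg violates min <= avg <= max")
    return {"min": lo, "max": hi, "avg": avg}, errors
```

Keeping the representation even when a constraint fails matches the earlier observation that errors are sometimes the most interesting portion of the data.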

20 Datatypes
Describe alternatives in data source with datatypes.
– Parser tries each alternative in order.
pdatatype Id = None of “n/a” | Some of Pint
Example data:
122Joe|Wright|45|95|79
n/aEd|Wood|10|47|31
124Chris|Nolan|80|93|85
125Tim|Burton|30|82|71
126George|Lucas|32|62|40
The entries n/aEd|Wood|10|47|31 and 124Chris|Nolan|80|93|85 exercise the None and Some alternatives respectively.
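Ordered choice can be sketched as a function that tries each alternative in turn and returns the first that matches. This Python version is an illustration only: `parse_id` is a hypothetical name, and the tagged tuples `("None",)` and `("Some", n)` are an assumed stand-in for the ML datatype constructors.

```python
def parse_id(data, pos=0):
    """Mirror of: pdatatype Id = None of "n/a" | Some of Pint.

    Alternatives are tried in the order they are written.
    Returns (tagged_value, new_pos).
    """
    # Alternative 1: the literal "n/a".
    if data.startswith("n/a", pos):
        return ("None",), pos + 3

    # Alternative 2: an integer.
    end = pos
    while end < len(data) and data[end].isdigit():
        end += 1
    if end > pos:
        return ("Some", int(data[pos:end])), end

    raise ValueError(f"no alternative of Id matches at offset {pos}")
```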

21 Recursive Datatypes
Describe inductively-defined formats.
pdatatype IntList = Cons of Pint * ‘|’ * IntList | Last of Pint
Example data: 79|4031|71|

22 Polymorphic Types
Parameterize types by other types.
Before:
pdatatype IntList = Cons of Pint * ‘|’ * IntList | Last of Pint
After:
pdatatype (Elt) List = Cons of Elt * ‘|’ * (Elt) List | Last of Elt
ptype IntList = Pint List
ptype CharList = Pchar List

23 Dependent Types
Parameterize types by values.
Before:
pdatatype IntList = Cons of Pint * ‘|’ * IntList | Nil of Pint
After:
pdatatype (Elt) List (x:char) = Cons of Elt * x * (Elt) List(x) | Nil of Elt
ptype IntListBar = Pint List(‘|’)
ptype CharListComma = Pchar List(‘,’)
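Type parameters and value parameters both become ordinary function arguments when the description is read as a parser. The Python sketch below is an assumption-laden illustration, not PADS/ML output: `pint`, `pchar` and `plist` are hypothetical names, the result is a flat Python list instead of nested `Cons`/`Nil` values, and the recursion is unrolled into a loop. When the element after a separator fails to parse, the loop stops before that separator, which imitates falling back from the `Cons` alternative to `Nil`.

```python
def pint(data, pos):
    """Element parser for integers: returns (value, new_pos)."""
    end = pos
    while end < len(data) and data[end].isdigit():
        end += 1
    if end == pos:
        raise ValueError(f"expected digits at offset {pos}")
    return int(data[pos:end]), end


def pchar(data, pos):
    """Element parser for a single character."""
    if pos >= len(data):
        raise ValueError("unexpected end of input")
    return data[pos], pos + 1


def plist(elt, sep):
    """Build a parser for (Elt) List(sep) from an element parser and a separator."""
    def parse(data, pos=0):
        value, pos = elt(data, pos)          # every list has at least one element
        items = [value]
        while pos < len(data) and data[pos] == sep:
            try:
                value, nxt = elt(data, pos + 1)
            except ValueError:
                break                        # leave the separator unconsumed
            items.append(value)
            pos = nxt
        return items, pos
    return parse


int_list_bar = plist(pint, "|")      # ptype IntListBar = Pint List('|')
char_list_comma = plist(pchar, ",")  # ptype CharListComma = Pchar List(',')
```

One generic definition serves both instantiations, which is the reuse argument the slide makes for polymorphic, dependent descriptions.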

24 More Dependent Types
“Switched” datatypes:
pdatatype GuidedOption (tag: int) =
  pmatch tag of
    0 => Zero of Pstring
  | 1 => One of Pint
  | 2 => Two of Pint * Pint
  | _ => None
ptype source = {tag: Pint; payload: GuidedOption (tag)}
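A switched datatype selects a branch from a value already parsed, so no trial and error is needed. The Python sketch below illustrates this with several assumptions that are not in the slide: a `:` literal is inserted between the tag and the payload and a `,` between the two integers of the `Two` branch so that the toy input is unambiguous, the `Pstring` payload runs to the end of the line, and `parse_source` is a hypothetical name.

```python
def parse_source(line):
    """Mirror of {tag: Pint; payload: GuidedOption(tag)} on an assumed
    'tag:payload' layout (the ':' and ',' literals are assumptions)."""
    tag_text, sep, rest = line.partition(":")
    if not sep or not tag_text.isdigit():
        raise ValueError("expected '<integer tag>:<payload>'")
    tag = int(tag_text)

    # The branch is chosen by the tag value, not by trying alternatives.
    if tag == 0:
        payload = ("Zero", rest)
    elif tag == 1:
        payload = ("One", int(rest))
    elif tag == 2:
        a, b = rest.split(",")
        payload = ("Two", int(a), int(b))
    else:
        payload = ("None",)
    return {"tag": tag, "payload": payload}
```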

PADS/ML Regulus Format:
ptype Timestamp = Ptimestamp_explicit_FW(8, "%H:%M:%S", gmt)
ptype Pip = Puint8 * ’.’ * Puint8 * ’.’ * Puint8 * ’.’ * Puint8
ptype (Alpha) Pnvp(p : string -> bool) =
  { name : [name : Pstring(’=’) | p name]; ’=’; value : Alpha }
ptype (Alpha) Nvp(name:string) = Alpha Pnvp(fun s -> s = name)
ptype SVString = Pstring_SE("/;|\\|/")
ptype Nvp_a = SVString Pnvp(fun _ -> true)
ptype Details = {
  source : Pip Nvp("src_addr"); ’;’;
  dest : Pip Nvp("dest_addr"); ’;’;
  start_time : Timestamp Nvp("start_time"); ’;’;
  end_time : Timestamp Nvp("end_time"); ’;’;
  cycle_time : Puint32 Nvp("cycle_time") }
ptype Semicolon = Pcharlit(’;’)
ptype Vbar = Pcharlit(’|’)
pdatatype Info(alarm_code : int) =
  Pmatch alarm_code with > Details of Details
  | _ -> Generic of (Nvp_a,Semicolon,Vbar) Plist
pdatatype Service =
    Dom of "DOMESTIC"
  | Int of "INTERNATIONAL"
  | Spec of "SPECIAL"
ptype Raw_alarm = {
  alarm : [ i : Puint32 | i = 2 or i = 3]; ’:’;
  start : Timestamp Popt; ’|’;
  clear : Timestamp Popt; ’|’;
  code : Puint32; ’|’;
  src_dns : SVString Nvp("dns1"); ’;’;
  dest_dns : SVString Nvp("dns2"); ’|’;
  info : Info(code); ’|’;
  service : Service }
let checkCorr ra =...
ptype Alarm = [x:Raw_alarm | checkCorr x]
ptype Source = (Alarm,Peor,Peof) Plist

Sample Regulus Data:
2: ||5001|dns1=abc.com;dns2=xyz.com|c=slow link;w=lost packets|INTERNATIONAL
3:| |5074|dns1=bob.com;dns2=alice.com|src_addr= ; dst_addr= ;start_time= ;end_time= ;cycle_time=17412|SPECIAL

26 Outline Motivation and PADS Overview Data Description in PADS/ML Implementation architecture The Semantics of PADS Conclusions

27 Parsing With PADS
[Diagram: the compiler turns a data description (type T) into a parser. The parser produces a data rep (type ~ T) and a parse descriptor (type ~ T), both of which are consumed by user code.]

28 Example: MBS Representation
Example data: n/aEd|Wood|10|47|31
Description:
ptype MBS-Entry = { id: Id; first: Pstring(‘|’); ‘|’; last: Pstring(‘|’); ‘|’; scores: Scores }
pdatatype Id = None of “n/a” | Some of Pint
Generated representation:
type MBS-Entry = { id:Id; first:string; last:string; scores:Scores }
datatype Id = None | Some of int
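The split between a data representation and a parse descriptor can be made concrete with a hand-written parser for one MBS entry. This is a Python sketch under stated assumptions, not the code PADS/ML generates: the class names `Scores`, `MBSEntry` and `ParseDescriptor`, the fields `nerr` and `messages`, and the function `parse_mbs_entry` are all hypothetical, and the parse descriptor here is a flat error log rather than a structure shaped like the description.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Scores:
    min: int
    max: int
    avg: int


@dataclass
class MBSEntry:
    id: Optional[int]      # None stands for the "n/a" alternative
    first: str
    last: str
    scores: Scores


@dataclass
class ParseDescriptor:
    nerr: int = 0
    messages: list = field(default_factory=list)

    def error(self, message):
        self.nerr += 1
        self.messages.append(message)


def parse_mbs_entry(line):
    """Return (rep, pd): the parsed entry (or None) and its error record."""
    pd = ParseDescriptor()

    # id: None of "n/a" | Some of Pint
    if line.startswith("n/a"):
        ident, rest = None, line[3:]
    else:
        n = 0
        while n < len(line) and line[n].isdigit():
            n += 1
        if n == 0:
            pd.error("id: expected 'n/a' or an integer")
            ident, rest = None, line
        else:
            ident, rest = int(line[:n]), line[n:]

    parts = rest.split("|")
    if len(parts) != 5:
        pd.error(f"expected 5 '|'-separated fields, found {len(parts)}")
        return None, pd
    try:
        lo, hi, avg = (int(p) for p in parts[2:])
    except ValueError:
        pd.error("scores: non-integer value")
        return None, pd

    # Constraint violations are recorded but the data is still returned.
    if not lo <= hi:
        pd.error("scores.max violates min <= max")
    if not lo <= avg <= hi:
        pd.error("scores.avg violates min <= avg <= max")
    return MBSEntry(ident, parts[0], parts[1], Scores(lo, hi, avg)), pd
```

User code can then inspect `pd.nerr` first and decide whether to trust, repair, or discard the entry.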

29 Tool Generation With PADS/ML
[Diagram: the compiler turns a data description (type T) into a parser and a format-specific traversal functor. The parser produces a data rep (type ~ T) and a parse descriptor (type ~ T); the traversal functor is combined with a format-independent tool module.]
Tools in this pattern: accumulator, debugger, histograms, clusters, format converters.

30 Types as Modules
PADS/ML generates a module for each type/description.
Parameterized types ==> Functors
Recursive types ==> Recursive modules
– Sigh: the combination of recursive modules & functors is not supported in O’Caml, so we’re reduced to a bit of a hack for recursion.
Signature of a generated module:
sig
  type rep
  type pd
  val parser : Pads.handle -> rep * pd
  module Traverse (tool : TOOL) : sig... end
end
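The idea of writing a tool once and applying it to any format can be approximated in Python, with the caveat that this is only an analogy. In PADS/ML the traversal is generated per description as a functor and applied to a tool module; in the sketch below a single dynamic `traverse` function over dicts and lists stands in for the generated traversal, and the `Accumulator` class with its `on_base` callback is a hypothetical tool interface.

```python
class Accumulator:
    """Format-independent tool: counts how often each base value occurs,
    grouped by the path of the field it was found in."""

    def __init__(self):
        self.counts = {}

    def on_base(self, path, value):
        per_field = self.counts.setdefault(path, {})
        per_field[value] = per_field.get(value, 0) + 1


def traverse(rep, tool, path=""):
    """Walk a representation and hand every base value to the tool.

    Records are modelled as dicts and sequences as lists or tuples; this
    dynamic walk replaces the format-specific traversal a compiler would emit.
    """
    if isinstance(rep, dict):
        for name, child in rep.items():
            traverse(child, tool, f"{path}/{name}")
    elif isinstance(rep, (list, tuple)):
        for child in rep:
            traverse(child, tool, f"{path}[]")
    else:
        tool.on_base(path, rep)
```

The tool never mentions MBS fields by name, so the same `Accumulator` could be run over representations of any other described format.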

31 Outline Motivation and PADS Overview Data Description in PADS/ML Implementation architecture The Semantics of PADS Conclusions

32 Motivation To crystallize design principles. –Example: error counting methodology in PADS/C. To ensure system correctness. –Example: parsers return data of expected type. As basis for evolution and experimentation. –Critical to design of PADS/ML. To communicate core ideas. –Designing the next 700 data description languages.

33 PADS and DDC
Developed a semantic framework based on the Data Description Calculus (DDC).
Explain PADS/ML and other languages with DDC.
Give a denotational semantics to DDC.
[Diagram: PADS/C, PADS/ML and “The Next 700” languages are all explained in terms of DDC.]

34 Data Description Calculus
DDC: calculus of dependent types for describing data.
Expressions e with type σ drawn from F-omega.
A kinding judgment specifies well-formed descriptions.
t ::= unit | bottom | C(e) | Σ x:t.t | t + t | t & t | {x:t | e} | t seq(t,e,t) | λx.t | t e | λα.t | t t | α | μα.t | compute (e:σ) | absorb(t) | scan(t)

35 Choosing a Semantics Semantics of REs, CFGs given as sets of strings but fails to account for: –Relationship between internal and external data. –Error handling. –Types of representation and parse descriptor. DDC –Denotational semantics of types as parsers in F-omega

36 A 3-Fold Semantics
Interpretations of a description t:
– Parser: [ t ]
– Representation: [ t ] rep
– Parse descriptor: [ t ] pd
Examples:
[ {x:t | e} ] rep = [ t ] rep + [ t ] rep
[ Σ x:t.t’ ] rep = [ t ] rep * [ t’ ] rep
[ {x:t | e} ] pd = hdr * [ t ] pd
[ Σ x:t.t’ ] pd = hdr * [ t ] pd * [ t’ ] pd

37 Type Correctness
Interpretations of a description t: parser [ t ], representation [ t ] rep, parse descriptor [ t ] pd.
Theorem: [ t ] : bits → [ t ] rep * [ t ] pd

38 Outline Motivation and PADS Overview Data Description in PADS/ML Implementation architecture The Semantics of PADS Conclusions

39 Related Work
Parser generator technology:
– Lex & Yacc: no dependency; semantic actions entwined with data description; no higher-level tools.
– Parser combinators: semantic actions entwined with data description; no higher-level tools.

40 Reminder: One Description, Many Tools
[Diagram: a data description (type T) is fed to the compiler, which generates a parser, printer, query engine, visual data browser, XML translator, ..., delivered as a programming library or a complete application.]

41 Parser combinators: One algorithm, One Tool
[Diagram: a single tool, the parser.]

42 Related Work Other “data description” languages –Data Format Description Language (DFDL) –Binary Format Description Language (BFD) –PacketTypes [SIGCOMM ’00] –DataScript [GPCE ’02] None have a well-defined semantics or Pads tool support

43 Current & Future Work
Tools and Applications
– Description inference.
– Support for specific domains (microbiology).
Language Design
– Transformation language for ad hoc data.
– Description language for distributed data: describe locations, versions, timing, relationships, etc.
Theory
– Analyze data descriptions for interesting properties, e.g. equivalence, data size, termination, emptiness (always fails).
– Coherence of parsing & printing.

44 Summary The PADS vision: reliable, efficient and effortless ad hoc data processing PADS/ML: –Data description based on polymorphic, dependent datatypes –“Types as modules” implementation –Solid theoretical basis. Visit

45 The End Questions?

46 Cut slides follow

47 Switched Datatypes
Choose branch based on parameter.
pdatatype Id = match t with 0 => Name of Pstring_FW(3) | 1 => Num of Pint

48 PADS/C vs. PADS/ML Next-generation PADS language and compiler. –Based on PADS/C experience and insights from semantics. Targeted at ML. –Functional languages better suited to data transformation. Higher level of abstraction than PADS/C. –Key features: polymorphic, recursive datatypes. Improved compiler design: –New framework for tool development. –Greater focus on modularity.

49 Existing Approaches
C, Perl, or shell scripts: most popular.
– Time consuming & error prone to hand code parsers.
– Difficult to maintain (worse than the ad hoc data itself in some cases!).
– Often incomplete, particularly with respect to errors. Error code, if written, swamps main-line computation. If not written, errors can corrupt “good” data.
Lex & Yacc
– Good match for programming languages.
– Bad match for ad hoc data.
PADS: the compiler converts descriptions into robust, format-specific tools.

50 Parsing With PADS Robust parser at the core of generated tools.

51 Using Ad hoc Data
Parsing only brings you part way.
– Queries must be written in ML.
– A lot of work.
What about a declarative query?
Example question: can Ed Wood bowl?
122Joe|Wright|45|95|79
n/aEd|Wood|10|47|31
124Chris|Nolan|80|93|85
125Tim|Burton|30|82|71
126George|Lucas|32|62|40

52 From Ad hoc Data To XML XML –Encoding for semi-structured data. –Good match! XQuery –Declarative XML query language for semi-structured sources. –Standardized by W3C, many implementations.

53 PADX = PADS + XQuery Galax [Fernandez, et al.] –Complete, open-source XQuery implementation. PADX –Integrates PADS and Galax. –Supports declarative queries over ad hoc data sources.

54 Using PADX User describes format in PADS. PADX provides –XML “view” of data in XML Schema. –Customized XQuery engine. Query PADS-specific and other XML sources. User provides –Ad hoc data –Queries expressed in XQuery.

55 Describing MBS Format
Example Movie-director Bowling Score data:
n/aEd|Wood|10|47|31
PADS/ML description:
ptype MBS-Entry = { id:Id; first:Pstring(‘|’); ‘|’; last:Pstring(‘|’); ‘|’; scores:Scores }

56 Viewing and Querying MBS
Description:
ptype MBS-Entry = { id:Id; first:Pstring(‘|’); ‘|’; last:Pstring(‘|’); ‘|’; scores:Scores }
Data: n/aEd|Wood|10|47|31
Virtual XML view of the entry: n/a Ed Wood
Query: What is Ed Wood’s maximum score?
$pads/Psource/MBS-Entry[first = "Ed"][last = "Wood"]/scores/max
Query: Which directors have scored less than 50?
$pads/Psource/MBS-Entry[scores/min < 50]
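The two queries above can be imitated with Python's standard `xml.etree.ElementTree` module to show what the XML view buys. This is a sketch with several assumptions: PADX exposes a virtual view and evaluates XQuery through Galax, whereas this code materializes a small XML tree and filters it in plain Python; the element names are taken from the query paths, but the encoding of `id` as text and the helper names `mbs_to_xml`, `max_score` and `directors_with_min_below` are made up for illustration.

```python
import xml.etree.ElementTree as ET


def mbs_to_xml(entries):
    """Build a Psource tree from (id, first, last, (min, max, avg)) tuples."""
    root = ET.Element("Psource")
    for ident, first, last, (lo, hi, avg) in entries:
        entry = ET.SubElement(root, "MBS-Entry")
        ET.SubElement(entry, "id").text = "n/a" if ident is None else str(ident)
        ET.SubElement(entry, "first").text = first
        ET.SubElement(entry, "last").text = last
        scores = ET.SubElement(entry, "scores")
        for name, value in (("min", lo), ("max", hi), ("avg", avg)):
            ET.SubElement(scores, name).text = str(value)
    return root


def max_score(root, first, last):
    """Analogue of MBS-Entry[first = ...][last = ...]/scores/max."""
    for entry in root.findall("MBS-Entry"):
        if entry.findtext("first") == first and entry.findtext("last") == last:
            return int(entry.findtext("scores/max"))
    return None


def directors_with_min_below(root, limit):
    """Analogue of MBS-Entry[scores/min < limit]; returns last names."""
    return [
        entry.findtext("last")
        for entry in root.findall("MBS-Entry")
        if int(entry.findtext("scores/min")) < limit
    ]
```

With the sample entries for Wright, Wood and Burton loaded, `max_score(root, "Ed", "Wood")` answers the first query and `directors_with_min_below(root, 50)` the second.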

57 Challenges & Solutions Semantics –Map PADS language to XML Schema. Re-engineer Galax Data Model –Create abstract data model. –Generate description-specific concrete data models. Efficiently query large-scale data sources. –Provide lazy access to data. –Implement custom memory-management.