David Walker Princeton University Computer Science Pads: Simplified Data Processing For Scientists.

Presentation transcript:

David Walker Princeton University Computer Science Pads: Simplified Data Processing For Scientists

2 Computer Science in the 21st Century One part computation to determine the answer to your problem. One part communication to tell someone about it.

Who: actress Jennifer Aniston and actor Brad Pitt When: July 29, 2000 Where: The nuptials took place on the grounds of TV producer Marcy Carsey's Malibu estate The Ceremony: As the sun sank low in the California sky, two hundred assembled guests watched as John Aniston, known to daytime television fans for his work on Days of Our Lives, walked his daughter down the aisle. Shielded by a flower-bedecked canopy, the bride and groom were able to say....

4

5 Our Common Communication Infrastructure Behind the scenes, much of this information is represented in standardized data formats. Standardized data formats: – Web pages in HTML – Pictures in JPEG – Movies in MPEG – “Universal” information format XML – Standard relational database formats A plethora of data processing tools: – Visualizers (browsers display JPEG, HTML, ...) – Query languages allow users to extract information (SQL, XQuery) – Programmers get easy access through standard libraries (e.g., Java XML libraries such as JAXP) – Many applications handle it natively and convert back and forth (e.g., MS Word)

6 Ad Hoc Data Massive amounts of data are stored in XML, HTML or relational databases but there’s even more data that isn’t An ad hoc data format is any nonstandard data format for which convenient parsing, querying, visualizing, transformation tools are not available – ad hoc data is everywhere.

7 Ad Hoc data from Date: 3/21/2005 1:00PM PACIFIC Investor's Business Daily ® Stock List Name: DAVE Stock Company Price Price Volume EPS RS Symbol Name Price Change % Change % Change Rating Rating AET Aetna Inc % 31% GE General Electric Co % -8% HD Home Depot Inc % 63% IBM Intl Business Machines % -13% INTC Intel Corp % -47% Data provided by William O'Neil + Co., Inc. © All Rights Reserved. Investor's Business Daily is a registered trademark of Investor's Business Daily, Inc. Reproduction or redistribution other than for personal use is prohibited. All prices are delayed at least 20 minutes.

8 Ad Hoc data from !autogenerated-by: DAG-Edit version rev 3 !saved-by: gocvs !date: Fri Mar 18 21:00:28 PST 2005 !version: $Revision: $ !type: % is_a is a !type: < part_of part of !type: ^ inverse_of inverse of !type: | disjoint_from disjoint from $Gene_Ontology ; GO: <biological_process ; GO: %behavior ; GO: ; synonym:behaviour %adult behavior ; GO: ; synonym:adult behaviour %adult feeding behavior ; GO: ; synonym:adult feeding behaviour % feeding behavior ; GO: %adult locomotory behavior ; GO: ;...

9 Ad Hoc Data From Steve Kleinstein (Immune Response Simulation Data) (~6:0:0:0:0~1:0:0:0:1,1:1:0:0:0) (~6:0:0:0:0~1:1:0:0:0) (~5:0:0:0:0~1:1:0:0:0) (~2:0:0:0:0~1:1:0:0:0,1:1:0:0:0,1:0:0:1:0) (~1:0:0:0:0~1:0:0:1:0) (~13:0:0:0:0~2:0:0:0:1,1:0:0:1:0,2:0:0:1:0) :0:0:0:0....

10 Ad Hoc Data in Chemistry C5=CC=CC=C5)=O)C1

11 Ad Hoc Data from Web Server Logs (CLF) [15/Oct/1997:18:46: ] "GET /tk/p.txt HTTP/1.0" tj62.aol.com - - [16/Oct/1997:14:32: ] "POST HTTP/1.0"

12 Ad Hoc Data: DNS packets : 9192 d8fb d r : f 6d00 esearch.att.com : 00fc 0001 c00c e ' : 036e 7331 c00c 0a68 6f73 746d ns1...hostmaste : 72c0 0c77 64e e r..wd.I : 36ee e 10c0 0c00 0f e : a00 0a05 6c69 6e75 78c0 0cc0 0c linux : 0f e c00 0a07 6d61 696c mail : 6d61 6ec0 0cc0 0c e 1000 man : 0487 cf1a 16c0 0c e a0: e73 30c0 0cc0 0c e..ns b0: c0 2e03 5f67 63c0 0c _gc...! c0: d c c X.....d...phys d0: f.research.att.co

13 Who uses ad hoc data? Ad hoc data sources are everywhere – containing valuable information of all kinds – everybody wants it: chemists, physicists, biologists, economists, computer scientists, network administrators,... just about anyone who writes their own programs

14 The challenge of ad hoc data What can we do about ad hoc data? – how do we read it into programs? – how do we detect errors? – how do we correct errors? – how do we query it? – how do we view it? – how do we gather statistics on it? – how do we load it into a database? – how do we transform it into a standard format like XML? – how do we combine multiple ad hoc data sources? – how do we filter, normalize and transform it? In short: how do we do all the things we take for granted when dealing with standard formats in a reliable, fault-tolerant and efficient, yet effortless way?

15 Most people use C / Perl / Shell scripts But: – Writing hand-coded parsers is time consuming & error prone. – Reading and maintaining them in the face of even small format changes can be difficult. – Such programs are often incomplete, particularly with respect to errors. – Not all that efficient unless the author invests extra effort For reliable, fault-tolerant, efficient data processing, we can do better!

16 Why not use traditional parsers? Overall, a very heavy-weight solution – people just do not do it – specifying a lexer and parser separately can be a barrier data specs as Lex and Yacc files are relatively complicated – lexing and parsing tools only solve a small part of the problem internal data structures built by hand printer by hand transforms by hand viewers by hand query engine by hand Error processing is fairly rigid We can do better!

17 Enter Pads Pads: a system for Processing Ad hoc Data Sources Two main components: – a data description language for concise and precise specifications of ad hoc data formats and properties – a compiler that automatically generates a suite of data processing tools robust libraries for C programming – parser that flags all errors and automatically recovers – printing utilities an interface that allows users to query ad hoc data converter to XML a statistical profiler – collects stats on common values appearing in all parts of the data; records error stats visual interface & viewer (coming soon!)

18 The rest of the talk Introduction to ad hoc data sources (check) Pads Tools Pads Language Pads Semantics Wrap-up

19 Pads Tool Generation Architecture [Diagram labels: Pads Compiler; Gene Ontology description; Statistical Profiler Tool (gene data, Profile: ACE 25%, BKJ 25%, ...); XML Formatter Tool (gene data); Viewer Tool (gene data)]

20 Pads Tool Generation Architecture [Diagram labels: Pads Compiler; Gene Ontology description; Gene Ontology Generated Parser; Pads Base Library; Glue code for statistical profile; Gene Ontology Statistical Profiler]

21 Pads Programmer Tools [Diagram labels: Pads Compiler; Gene Ontology description; Gene Ontology Generated Parser; Pads Base Library; Ad Hoc User Program in C; Ad Hoc User Program]

22 The Statistical Profiler Tool for each part of a data source, profiler reports errors & most common values. from example weblog data:.length : uint good: bad: 3824 pcnt-bad: min: 35 max: avg: top 10 values out of 1000 distinct values: tracked % of values val: 3082 count: 1254 %-of-good: val: 170 count: 1148 %-of-good: val: 43 count: 1018 %-of-good:

23 The Statistical Profiler Tool Ad hoc data is often poorly documented, or its documentation is out of date. Even the documentation of weblog data from our textbook was missing some information: good: bad: 3824 pcnt-bad: – web servers sometimes return a ‘-’ instead of the length in bytes, which wasn’t mentioned in the textbook. Data descriptions can be written in an iterative fashion – use the profiler at each stage to uncover additional information about the data and refine the description
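The kind of report described on the two profiler slides can be imitated in a few lines of Python. This is only a sketch of the idea, not the Pads profiler: the function name `profile_uint` and the report keys are made up for illustration, and the input is assumed to be the raw text of one field (here, content length) from each record.

```python
from collections import Counter

def profile_uint(raw_values, top=10):
    """Summarize one unsigned-integer field: error counts, range, top values."""
    good = []
    bad = 0
    for raw in raw_values:
        # A value is "good" if it is a non-empty run of ASCII digits.
        if raw.isascii() and raw.isdigit():
            good.append(int(raw))
        else:
            bad += 1  # e.g. the '-' some web servers emit instead of a length
    total = len(good) + bad
    return {
        "good": len(good),
        "bad": bad,
        "pcnt_bad": 100.0 * bad / total if total else 0.0,
        "min": min(good) if good else None,
        "max": max(good) if good else None,
        "avg": sum(good) / len(good) if good else None,
        "top": Counter(good).most_common(top),
    }

report = profile_uint(["3082", "170", "-", "3082"])
```

Running such a summary after each revision of a description is one way to notice undocumented cases like the '-' placeholder.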

Pads Language

25 PADS language Based on Type Theory – in most modern programming languages, types (int, bool, struct, object...) describe program data the source of most of my research – in Pads, types describe physical data formats, semantic properties of data, and a mapping into an internal program representation (ie, a parser) Can describe ASCII, binary, and mixed data formats.

26 PADS language Basic Types – Rich and extensible. – Pint8, Puint8, Pint16,... – Pstring(:term-char:) – Pstring_FW(:size:) – Pstring_ME(:reg_exp:) – Pdate,... Supports user-defined compound types to describe data source structure: – Pstruct, Parray, Punion, Ptypedef, Penum

27 Example: CLF web log Common Log Format from Web Protocols and Practice. (Bala and Rexford) Fields: – IP address of remote host – Remote identity (usually ‘-’ to indicate name not collected) – Authenticated user (usually ‘-’ to indicate name not collected) – Time associated with request – Request – Response code – Content length [15/Oct/1997:18:46: ] "GET /turkey/amnty1.gif HTTP/1.0"
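For comparison with the Pads descriptions that follow, here is a sketch of a hand-written reader for these seven fields, using Python's `re` module. The sample line is hypothetical: the host address, time zone, response code and length are illustrative values, not taken from the slides, and the group names are chosen here for readability.

```python
import re

# Hypothetical sample record (illustrative values, not from the slides).
SAMPLE = ('192.0.2.1 - - [15/Oct/1997:18:46:51 -0700] '
          '"GET /turkey/amnty1.gif HTTP/1.0" 200 3082')

CLF = re.compile(
    r'(?P<host>\S+) (?P<remote_id>\S+) (?P<auth>\S+) '
    r'\[(?P<date>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<response>\d{3}) (?P<length>\d+|-)$'
)

def parse_clf(line):
    """Return a dict of the seven CLF fields, or None if the line does not match."""
    m = CLF.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["response"] = int(rec["response"])
    # Some servers log '-' in place of a byte count.
    rec["length"] = None if rec["length"] == "-" else int(rec["length"])
    return rec

rec = parse_clf(SAMPLE)
```

Note what this reader does not do: it reports only match or no match, with no indication of which field was malformed.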

28 Example: Pstruct
For reading a sequence of different data elements:
Pstruct http_weblog {
  host client;                 /- Client requesting service
  ' ';
  auth_id remoteID;            /- Remote identity
  ' ';
  auth_id auth;                /- Name of authenticated user
  " [";
  Pdate(:']':) date;           /- Timestamp of request
  "] ";
  http_request request;        /- Request
  ' ';
  Puint16_FW(:3:) response;    /- 3-digit response code
  ' ';
  Puint32 contentLength;       /- Bytes in response
};
[15/Oct/1997:18:46: ] "GET /turkey/amnty1.gif HTTP/1.0"
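The sequencing idea behind a struct description (parse each field in order, skipping the literals between them) can be sketched with small parser functions. Everything here is an assumption made for illustration: the helper names `literal`, `until` and `pstruct`, the convention that a parser takes `(text, pos)` and returns `(value, new_pos)` or `None`, and the sample input. A generated Pads parser records errors and recovers rather than returning `None`.

```python
def literal(s):
    """Parser that consumes the exact string s and yields no value."""
    def p(text, pos):
        return (None, pos + len(s)) if text.startswith(s, pos) else None
    return p

def until(term):
    """Parser that reads up to (not including) the next occurrence of term."""
    def p(text, pos):
        end = text.find(term, pos)
        return None if end == -1 else (text[pos:end], end)
    return p

def pstruct(*fields):
    """Run (name, parser) pairs in order; unnamed fields are skipped literals."""
    def p(text, pos):
        record = {}
        for name, parser in fields:
            result = parser(text, pos)
            if result is None:
                return None
            value, pos = result
            if name is not None:
                record[name] = value
        return record, pos
    return p

# First four fields of a log entry, with the literals that separate them.
entry = pstruct(
    ("client", until(" ")), (None, literal(" ")),
    ("remoteID", until(" ")), (None, literal(" ")),
    ("auth", until(" ")), (None, literal(" [")),
    ("date", until("]")), (None, literal("] ")),
)
```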

29 Example: Punion
Punion auth_id {
  Pchar unavailable : unavailable == '-';
  Pstring(:' ':) id;
};
Union declarations allow the user to describe variations. The implementation tries branches in order and stops when it finds a branch whose constraints are all true.
[15/Oct/1997:18:46: ] "GET /turkey/amnty1.gif HTTP/1.0"
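The "try branches in order, stop at the first that succeeds" rule is an ordered choice. A minimal sketch follows, assuming a parser is a function from `(text, pos)` to `(value, new_pos)` or `None`; the function names and the tagged-tuple result are illustrative choices, not the representation Pads generates.

```python
def parse_unavailable(text, pos):
    # Branch 1: a single character constrained to be '-'.
    if pos < len(text) and text[pos] == "-":
        return ("unavailable", "-"), pos + 1
    return None

def parse_id(text, pos):
    # Branch 2: a non-empty string terminated by (not including) a space.
    end = text.find(" ", pos)
    if end == -1:
        end = len(text)
    if end == pos:
        return None
    return ("id", text[pos:end]), end

def parse_union(branches, text, pos):
    """Return the result of the first branch that succeeds, else None."""
    for branch in branches:
        result = branch(text, pos)
        if result is not None:
            return result
    return None

AUTH_ID = [parse_unavailable, parse_id]
```

Because the choice is ordered, an input such as `-bob` matches the first branch and leaves `bob` unread, so branch order is part of the description's meaning.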

30 Example: Parray
Parray nIP {
  Puint8[4] : Psep('.') && Pterm(' ');
};
Array declarations allow the user to specify:
– Size (fixed, lower-bounded, upper-bounded, unbounded)
– Boolean-valued constraints
– Psep and Pterm predicates
An array terminates upon reaching EOF/EOR, reaching the terminator, or reaching the maximum size.
[15/Oct/1997:18:46: ] "GET /turkey/amnty1.gif HTTP/1.0"
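A sketch of how a fixed-size array with a separator and a terminator might be read, using a dotted IP address as input. The helper names, the `(text, pos)` to `(value, new_pos)` or `None` convention, and the decision to reject arrays shorter than the fixed size are assumptions of this example.

```python
DIGITS = "0123456789"

def parse_uint8(text, pos):
    """Read up to three ASCII digits and require the value to fit in 8 bits."""
    end = pos
    while end < len(text) and end - pos < 3 and text[end] in DIGITS:
        end += 1
    if end == pos:
        return None
    value = int(text[pos:end])
    return (value, end) if value <= 255 else None

def parse_array(text, pos, size=4, sep=".", term=" "):
    """Read `size` elements separated by `sep`; stop early at `term` or end of input."""
    items = []
    while len(items) < size:
        if pos >= len(text) or text[pos] == term:
            break                      # end of input or terminator reached
        if items:
            if text[pos] != sep:
                break                  # separator missing between elements
            pos += 1
        result = parse_uint8(text, pos)
        if result is None:
            return None                # malformed element
        value, pos = result
        items.append(value)
    return (items, pos) if len(items) == size else None
```

The terminator itself is left unconsumed, so an enclosing struct can match it as a literal.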

31 Example: User constraints
int checkVersion(http_v version, method_t meth) {
  if ((version.major == 1) && (version.minor == 0)) return 1;
  if ((meth == LINK) || (meth == UNLINK)) return 0;
  return 1;
}
Pstruct http_request {
  '\"';
  method_t meth;               /- Request method
  ' ';
  Pstring(:' ':) req_uri;      /- Requested uri.
  ' ';
  http_v version : checkVersion(version, meth);
                               /- HTTP version number of request
  '\"';
};
[15/Oct/1997:18:46: ] "GET /turkey/amnty1.gif HTTP/1.0"

32 Example: Parameterization & Dependency
“Early” data often affects parsing of later data:
– Lengths of sequences
– Branches of switched unions
To accommodate this usage, we allow PADS types to be parameterized:
Pstruct packet_t (: Puint32 length :) {
  ...
  Pstring_FW(: length :) payload;
};
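One way to picture a parameterized type is as a function that takes the parameter and returns a parser. In the sketch below, `payload_fw(length)` plays the role of a fixed-width string type applied to `length`, and `parse_packet` reads the length first and then instantiates it. The 4-byte big-endian length header, the function names and the result shape are all assumptions made for this example; the slide elides the rest of the packet layout.

```python
import struct

def payload_fw(length):
    """Parameterized parser: a fixed-width payload of `length` bytes."""
    def parser(data, pos):
        end = pos + length
        if end > len(data):
            return None               # not enough bytes for the payload
        return data[pos:end], end
    return parser

def parse_packet(data, pos=0):
    """Read an (assumed) 4-byte big-endian length, then that many payload bytes."""
    if pos + 4 > len(data):
        return None
    (length,) = struct.unpack_from(">I", data, pos)
    result = payload_fw(length)(data, pos + 4)   # earlier data drives later parsing
    if result is None:
        return None
    payload, end = result
    return {"length": length, "payload": payload}, end
```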

Pads Semantics

34 Semantics: The Big Picture As a theorist, I want to be able to describe the meanings (semantics) of programs and programming languages Why bother? What is the point? – communication spread ideas, techniques and algorithms often means extracting the essence of a language and reducing it to a simple set of mathematical relations – verification prove properties of implementations particularly security-relevant or safety-critical applications – generalization the mathematics brings out the central principles and invariants leads to more general, compositional, scalable solutions – it’s just fun immensely satisfying to come up with the perfect formal system where all parts compose and blend seamlessly together

35 Semantics for Pads: Goals – Communication Pads descriptions can be incorporated into just about any language. ML? Java? Perl? Matlab? Language designers need a precise specification to do so – Verification In some cases, we find the implementation incomplete or making arbitrary choices (eg: error correction semantics) Every once in a while, the implementation is outright wrong (eg: array semantics) – Generalization Semantics allows us to compare and contrast Pads with related languages & add features (eg: intersection types & overlays from PacketTypes; recursive types; more)

36 Semantics for Pads: Overview Pads is a large language and if we tried to formalize the whole thing right from the get-go, we wouldn’t succeed – we’d get lost in details and make mistakes – we’d be unable to structure our proofs of key properties – we wouldn’t communicate the essential elements to our fellow researchers Strategy: – pick out the key ingredients & eliminate the ugly, but unimportant details – develop an idealized version of the real language each type in our idealized version of Pads represents a single, simple pure idea each type composes with all others we give a semantics to each individual construct; we get a semantics for complex objects by putting several simple individual ones together

37 Semantics for Pads: Overview Part 1: Specify idealized (abstract) syntax of types
T ::= True            (parse nothing successfully)
    | False           (parse nothing unsuccessfully)
    | {x:T | P(x)}    (constrained type; parse data as T and check P)
    | C (arg)         (parse parameterized base type; eg: string(:' ':))
    | T1 ∪ T2         (union type; parse one or the other)
    | T1 ∩ T2         (intersection type; parse data as both T1 and T2)
    | Σ x:T1.T2       (dependent pair; parse T1, call it x, then parse T2)
    | T seq(arg)      (sequence type; parse Ts until finding arg)
    | λx.T            (type parameterized by argument x)
    | T (arg)         (parameterized type applied to argument)
    | hide T          (skip data described by T; eg: absorb '|')
    | spoof (arg)     (parse nothing; add arg to internal representation)
Highlighted group: basics

38 Semantics for Pads: Overview Part 1: Specify idealized (abstract) syntax of types
T ::= True            (parse nothing successfully)
    | False           (parse nothing unsuccessfully)
    | {x:T | P(x)}    (constrained type; parse data as T and check P)
    | C (arg)         (parse parameterized base type; eg: string(:' ':))
    | T1 ∪ T2         (union type; parse one or the other)
    | T1 ∩ T2         (intersection type; parse data as both T1 and T2)
    | Σ x:T1.T2       (dependent pair; parse T1, call it x, then parse T2)
    | T seq(arg)      (sequence type; parse Ts until finding arg)
    | λx.T            (type parameterized by argument x)
    | T (arg)         (parameterized type applied to argument)
    | hide T          (skip data described by T; eg: absorb '|')
    | spoof (arg)     (parse nothing; add arg to internal representation)
Highlighted groups: basics, structured types

39 Semantics for Pads: Overview Part 1: Specify idealized (abstract) syntax of types
T ::= True            (parse nothing successfully)
    | False           (parse nothing unsuccessfully)
    | {x:T | P(x)}    (constrained type; parse data as T and check P)
    | C (arg)         (parse parameterized base type; eg: string(:' ':))
    | T1 ∪ T2         (union type; parse one or the other)
    | T1 ∩ T2         (intersection type; parse data as both T1 and T2)
    | Σ x:T1.T2       (dependent pair; parse T1, call it x, then parse T2)
    | T seq(arg)      (sequence type; parse Ts until finding arg)
    | λx.T            (type parameterized by argument x)
    | T (arg)         (parameterized type applied to argument)
    | hide T          (skip data described by T; eg: absorb '|')
    | spoof (arg)     (parse nothing; add arg to internal representation)
Highlighted groups: basics, structured types, parameterized types

40 Semantics for Pads: Overview Part 1: Specify idealized (abstract) syntax of types
T ::= True            (parse nothing successfully)
    | False           (parse nothing unsuccessfully)
    | {x:T | P(x)}    (constrained type; parse data as T and check P)
    | C (arg)         (parse parameterized base type; eg: string(:' ':))
    | T1 ∪ T2         (union type; parse one or the other)
    | T1 ∩ T2         (intersection type; parse data as both T1 and T2)
    | Σ x:T1.T2       (dependent pair; parse T1, call it x, then parse T2)
    | T seq(arg)      (sequence type; parse Ts until finding arg)
    | λx.T            (type parameterized by argument x)
    | T (arg)         (parameterized type applied to argument)
    | absorb T        (skip data described by T; eg: absorb '|')
    | compute (arg)   (parse nothing; add arg to internal representation)
Highlighted groups: basics, structured types, parameterized types, transforms

41 Semantics for Pads: Overview Part 2: Specify denotational semantics of types – in general, a denotational semantics describes one language (poorly understood) in terms of another language (better understood) – in our case, we specify the meaning of Pads types (poorly understood) in terms of the polymorphic λ-calculus (better understood, at least by me) semantics(T) = λbits.e a parser function mapping external bits to data structures in the λ-calculus
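To make the "a type denotes a parser function" reading concrete, here is a sketch in which each idealized type constructor is a Python function that builds a parser from input to a representation. This is a strong simplification and an assumption of this example: failure is modeled as `None`, whereas the talk describes parsers that flag errors and recover. The base types `digit` and `fixed` are hypothetical stand-ins for parameterized base types.

```python
def true_t(bits, pos):
    return (), pos                      # parse nothing successfully

def false_t(bits, pos):
    return None                         # parse nothing unsuccessfully

def digit(bits, pos):                   # hypothetical base type: one decimal digit
    if pos < len(bits) and bits[pos] in "0123456789":
        return int(bits[pos]), pos + 1
    return None

def fixed(n):                           # hypothetical base type: n characters
    def p(bits, pos):
        return (bits[pos:pos + n], pos + n) if pos + n <= len(bits) else None
    return p

def constrained(t, pred):               # {x:T | P(x)}
    def p(bits, pos):
        r = t(bits, pos)
        return r if r is not None and pred(r[0]) else None
    return p

def union(t1, t2):                      # union type: try t1, then t2
    def p(bits, pos):
        r = t1(bits, pos)
        return r if r is not None else t2(bits, pos)
    return p

def dep_pair(t1, make_t2):              # dependent pair: t2 may depend on x
    def p(bits, pos):
        r1 = t1(bits, pos)
        if r1 is None:
            return None
        x, pos1 = r1
        r2 = make_t2(x)(bits, pos1)
        if r2 is None:
            return None
        y, pos2 = r2
        return (x, y), pos2
    return p

# A digit n followed by exactly n characters.
counted = dep_pair(digit, fixed)
```

Each constructor is defined once and composes with every other, which mirrors the slide's point that the semantics of a complex description is assembled from the semantics of simple parts.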

42 Semantics for Pads: Overview Part 3: Prove Pads has the required properties – Theorem: Parsers never generate “bad” internal representations of external data. ie, representations are well-typed in the implementation language. – Theorem: Parsers check all semantic constraints.

Wrap-up

44 Challenges of Ad Hoc Data Revisited Data arrives “as is” – Format determined by data source, not consumers. The Pads language allows consumers to describe data in just about any format. – Often has little documentation. A Pads description can serve as documentation for data source. The statistical profiler helps analysts understand data. – Some percentage of data is “buggy.” Constraints allow consumers to express expectations about data. Parsers check for errors and say where errors are located. Ad hoc data is a rich source of information for chemists, biologists, computer scientists, if they could only get at it. – Pads generates a collection of useful tools automatically from data descriptions Pads is our answer to the challenge of ad hoc data sources.

45 Related work DataScript [Back: GPCE 2002] & PacketTypes [McCann & Chandra: SIGCOMM 2000] – Primarily for networking data – Binary data formats only – Stop on first error – No value-added tools (Profiler; XML conversion; Query engine) – No semantics

46 Current and Future Work Pads Language – recursion and pointers (eg: for tree- and graph-structured data) – integrated pre- and post-processing (eg: encryption, compression) – composition and reuse (via polymorphism, modules) – multi-source data integration Pads Compiler – parsing and querying optimization (eg: dealing with massive data sets) Pads Tools – new architecture for robust & reliable tool generation – application-specific customization error correction, data normalization, ignoring or rearranging components – general data transformation – visual interface for nonprogrammers Pads Applications – genomics data (with Olga Troyanskaya) – networking and telephony data (AT&T) – a great domain for interdisciplinary undergraduate research projects

47 Pads Summary The overarching goal of Pads is to make understanding, querying and transforming ad hoc data an effortless task. We do so with new programming language technology based on the principles of Type Theory. AT&T Research: Kathleen Fisher Mary Fernandez Joel Gottlieb Robert Gruber (now Google) Ricardo Medel (summer intern) Princeton: Joe Kovba (UGrad) Yitzhak Mandelbaum (Grad) David Walker

End!