PADS: A System for Managing Ad Hoc Data
Kathleen Fisher, AT&T Labs Research


Data, data, everywhere!
Incredible amounts of data are stored in well-behaved formats such as databases and XML, which come with a rich set of tools:
– Schema
– Browsers
– Query languages
– Standards
– Libraries
– Books, documentation
– Conversion tools
– Vendor support
– Consultants…

… but not all data is well-behaved!
Vast amounts of chaotic ad hoc data.
Tools: Perl? Awk? C?

Ad Hoc Data in Genetics
format-version: 1.0
date: 11:11: :24
auto-generated-by: DAG-Edit rev 3
default-namespace: gene_ontology
subsetdef: goslim_goa "GOA and proteome slim"
[Term]
id: GO:
name: mitochondrion inheritance
namespace: biological_process
def: "The distribution of mitochondria\, including the mitochondrial genome\, into daughter cells after mitosis or meiosis\, mediated by interactions between mitochondria and the cytoskeleton." [PMID: ,PMID: , SGD:mcc]
is_a: GO: ! organelle inheritance
is_a: GO: ! mitochondrion distribution

Ad Hoc Data in Biology: Newick Format
((raccoon: ,bear: ): ,((sea_lion: , seal: ): ,((monkey: ,cat: ): , weasel: ): ): ,dog: );
(Bovine: ,(Gibbon: ,(Orang: ,(Gorilla: ,(Chimp: , Human: ): ): ): ): ,Mouse: ):0. 10;
(Bovine: ,(Hylobates: ,(Pongo: ,(G._Gorilla: , (P._paniscus: ,H._sapiens: ): ): ): ): , Rodent: );

Ad Hoc Data in Business
HA START OF TEST CYCLE
aA BXYZ U1AB B
HL START OF OPEN INTEREST
d FZYX G1AB
HM END OF OPEN INTEREST
HE START OF SUMMARY
f NYZX B1QB B B
HF END OF SUMMARY
k LYXW B1KB G
HB END OF TEST CYCLE

Ad Hoc Data in Finance
Date: 3/21/2005 1:00PM PACIFIC
Investor's Business Daily ®
Stock List Name: DAVE
Stock Company Price Price Volume EPS RS
Symbol Name Price Change % Change % Change Rating Rating
AET Aetna Inc % 31%
GE General Electric Co % -8%
HD Home Depot Inc % 63%
IBM Intl Business Machines % -13%
INTC Intel Corp % -47%
Data provided by William O'Neil + Co., Inc. © All Rights Reserved.
Investor's Business Daily is a registered trademark of Investor's Business Daily, Inc.
Reproduction or redistribution other than for personal use is prohibited.
All prices are delayed at least 20 minutes.

Ad Hoc Binary Data: DNS packets
: 9192 d8fb d r
: f 6d00 esearch.att.com
: 00fc 0001 c00c e '
: 036e 7331 c00c 0a68 6f73 746d ns1...hostmaste
: 72c0 0c77 64e e r..wd.I
: 36ee e 10c0 0c00 0f e
: a00 0a05 6c69 6e75 78c0 0cc0 0c linux
: 0f e c00 0a07 6d61 696c mail
: 6d61 6ec0 0cc0 0c e 1000 man
: 0487 cf1a 16c0 0c e
a0: e73 30c0 0cc0 0c e..ns
b0: c0 2e03 5f67 63c0 0c _gc...!
c0: d c c X.....d...phys
d0: f.research.att.co

Ad Hoc Data from AT&T
Name & Use | Representation | Size
Web server logs (CLF): Measure web workloads | Fixed-column ASCII records | ≤12 GB/week
Sirius data: Monitor service activation | Variable-width ASCII records | 2.2 GB/week
Call detail: Detect fraud | Fixed-width binary records | ~7 GB/day
Altair data: Track billing process | Various Cobol data formats | ~4000 files/day
Regulus data: Monitor IP network | ASCII | ≥15 sources, ~15 GB/day
Netflow: Monitor IP network | Data-dependent number of fixed-width binary records | >1 Gigabit/second

Technical Challenges of Ad Hoc Data
Data arrives “as is.”
Documentation is often out-of-date or nonexistent.
– Hijacked fields.
– Undocumented “missing value” representations.
Data is buggy.
– Missing data, human error, malfunctioning machines, race conditions on log entries, “extra” data, …
– Processing must detect relevant errors and respond in application-specific ways.
– Errors are sometimes the most interesting portion of the data.
Data sources often have high volume.
– Data may not fit into main memory.

Existing Approaches
Lex/Yacc
– Over- and under-kill.
Perl/C
– Code brittle with respect to changes in input format.
– If written, error-detection code swamps main-line computation. If not written, errors can corrupt “good” data.
– Everything has to be coded by hand.
– Analysis often ends up interwoven with parsing.
– Example (Perl): qr/^(\d+)\|(?:[^|]*\|){12}(?:[^|]*\|[^|]*\|)*$STATE\|/;
Data description languages (PacketTypes, DataScript)
– Binary data
– Focus on correct data.

Our Approach: PADS
Data expert writes declarative description of data source:
– Physical format information
– Semantic constraints
Many data consumers use description and parser.
Description serves as living documentation.
Parser exhaustively detects errors without cluttering user code.
From description, we generate auxiliary tools.
PLDI 2005

PADS Architecture

PADS Language
Provides rich and extensible set of base types.
– Pint8, Puint8, … // -123, 44
– Pstring(:'|':) // hello |
– Pstring_FW(:3:) // catdog
– Pstring_ME(:"/a*/":) // aaaaaab
– Pdate, Ptime, Pip, …
Provides type constructors to describe data source structure: Pstruct, Parray, Punion, Ptypedef, Penum.
Allows arbitrary predicates to describe expected properties.
Type-based model: types indicate how to process associated data.
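
To make the base types above concrete, here is a minimal C sketch of what a terminated string (in the spirit of Pstring(:'|':)) and a fixed-width string (in the spirit of Pstring_FW(:3:)) might do when reading input. The function names read_terminated and read_fixed_width are hypothetical and are not part of the PADS library; real PADS base types also record errors in parse descriptors, which this sketch omits.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical analogue of Pstring(:'|':): copy characters up to, but not
 * including, the terminator. Returns the number of characters consumed,
 * or -1 if the terminator is missing or `out` is too small. */
long read_terminated(const char *in, char term, char *out, size_t cap)
{
    const char *end = strchr(in, term);
    if (end == NULL)
        return -1;
    size_t n = (size_t)(end - in);
    if (n + 1 > cap)
        return -1;
    memcpy(out, in, n);
    out[n] = '\0';
    return (long)n;
}

/* Hypothetical analogue of Pstring_FW(:width:): copy exactly `width`
 * characters. Returns `width`, or -1 if the input is shorter than `width`
 * or `out` is too small. */
long read_fixed_width(const char *in, size_t width, char *out, size_t cap)
{
    if (strlen(in) < width || width + 1 > cap)
        return -1;
    memcpy(out, in, width);
    out[width] = '\0';
    return (long)width;
}
```

Under these assumptions, reading "catdog" twice with a width of 3 would yield "cat" and then "dog", which matches the slide's comment for Pstring_FW(:3:).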

Running Example: Web Server Logs
Common Log Format from Web Protocols and Practice.
Fields:
– IP address of remote host
– Remote identity (usually ‘-’ to indicate name not collected)
– Authenticated user (usually ‘-’ to indicate name not collected)
– Time associated with request
– Request (request method, request-uri, and protocol version)
– Response code
– Content length
[15/Oct/1997:18:46: ] "GET /turkey/amnty1.gif HTTP/1.0"

Example: Pstruct
Precord Pstruct http_weblog {
  host client;               /- Client requesting service
  ' ';
  auth_id remoteID;          /- Remote identity
  ' ';
  auth_id auth;              /- Name of authenticated user
  " [";
  Pdate(:']':) date;         /- Timestamp of request
  "] ";
  http_request request;      /- Request
  ' ';
  Puint16_FW(:3:) response;  /- 3-digit response code
  ' ';
  Puint32 contentLength;     /- Bytes in response
};

Example: Parray
Parray Phostname {
  Pstring_SE(:"/[. ]/":)[] : Psep('.') && Pterm(Pnosep);
};
Array declarations allow the user to specify:
– Size (fixed, lower-bounded, upper-bounded, unbounded)
– Psep, Pterm, and termination predicates
– Constraints over sequence of array elements
Array terminates upon reaching EOF, reaching the terminator, reaching the maximum size, or satisfying the termination predicate.
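
The separator and terminator behaviour of Phostname can be sketched in plain C as follows. This is an illustration under stated assumptions, not the code PADS generates: parse_hostname is a hypothetical name, each element ends at '.' or ' ' (as in the Pstring_SE pattern), elements are separated by '.', and the array ends as soon as no separator follows an element (the role of Pterm(Pnosep)).

```c
#include <stddef.h>

/* Hypothetical sketch of a separated, terminator-by-no-separator array.
 * Records the start offset and length of each label in `in`.
 * Returns the number of labels, or -1 on an empty label or if more than
 * `max_labels` labels are present. */
int parse_hostname(const char *in, size_t starts[], size_t lens[],
                   int max_labels)
{
    int count = 0;
    size_t pos = 0;

    for (;;) {
        size_t begin = pos;

        /* One element: characters up to '.', ' ' or end of input. */
        while (in[pos] != '\0' && in[pos] != '.' && in[pos] != ' ')
            pos++;
        if (pos == begin)
            return -1;              /* empty element: treat as an error */
        if (count == max_labels)
            return -1;              /* upper bound on array size exceeded */

        starts[count] = begin;
        lens[count] = pos - begin;
        count++;

        if (in[pos] != '.')
            break;                  /* no separator follows: array ends */
        pos++;                      /* consume the separator */
    }
    return count;
}
```

Keeping offsets and lengths instead of copying strings avoids allocation, which is one reasonable choice when the input is a large log that may not fit in memory.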

Example: Punion
Punion auth_id {
  Pchar unavailable : unavailable == '-';
  Pstring(:' ':) id;
};
Union declarations allow the user to describe variations.
Implementation tries branches in order.
Stops when it finds a branch whose constraints are all true.
Switched unions jump to a particular branch based on a selector.
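
The "try branches in order" rule can be sketched as a tagged union in C. The types and the function parse_auth_id below are hypothetical illustrations of ordered choice for the auth_id description, not PADS output; the 64-byte buffer is an arbitrary assumption.

```c
#include <stddef.h>
#include <string.h>

/* Which branch of the union matched (AUTH_ERROR if none did). */
typedef enum { AUTH_UNAVAILABLE, AUTH_ID, AUTH_ERROR } auth_tag;

typedef struct {
    auth_tag tag;
    char id[64];   /* filled only when tag == AUTH_ID */
} auth_id;

/* Hypothetical ordered-choice parser: branch 1 is a single character
 * constrained to be '-'; branch 2 is a string terminated by a space. */
auth_id parse_auth_id(const char *in)
{
    auth_id r;
    r.id[0] = '\0';

    /* Branch 1: Pchar unavailable : unavailable == '-' */
    if (in[0] == '-') {
        r.tag = AUTH_UNAVAILABLE;
        return r;
    }

    /* Branch 2: Pstring(:' ':) id */
    const char *sp = strchr(in, ' ');
    if (sp != NULL && (size_t)(sp - in) < sizeof r.id) {
        size_t n = (size_t)(sp - in);
        memcpy(r.id, in, n);
        r.id[n] = '\0';
        r.tag = AUTH_ID;
        return r;
    }

    /* No branch matched. */
    r.tag = AUTH_ERROR;
    return r;
}
```

Because branches are tried in order, an input starting with '-' never reaches the second branch, mirroring the rule that parsing stops at the first branch whose constraints hold.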

Example: Ptypedef
Ptypedef Puint16_FW(:3:) response : response x => { 100 <= x && x < 600 };
Typedefs allow the user to add constraints to existing types.
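
A rough C equivalent of this constrained type separates the syntactic step (read three digits) from the semantic step (check the range). The function read_response and its return codes are hypothetical and chosen only for illustration; they are not the PADS error codes.

```c
#include <ctype.h>

/* Hypothetical sketch of a 3-digit fixed-width unsigned integer with the
 * constraint 100 <= x && x < 600.
 * Returns 0 on success, 1 on a syntactic error (a non-digit among the
 * first three characters), 2 on a semantic error (value out of range).
 * On return codes 0 and 2, *out holds the parsed value. */
int read_response(const char *in, unsigned *out)
{
    unsigned v = 0;

    for (int i = 0; i < 3; i++) {
        if (!isdigit((unsigned char)in[i]))
            return 1;
        v = v * 10 + (unsigned)(in[i] - '0');
    }
    *out = v;
    return (100 <= v && v < 600) ? 0 : 2;
}
```

Distinguishing the two failure kinds lets a caller keep a well-formed but out-of-range value for inspection, which fits the earlier point that errors are sometimes the most interesting part of the data.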

Example: Penum
Penum method {
  GET, PUT, POST, HEAD, DELETE,
  LINK,    /- Unused after HTTP 1.0
  UNLINK   /- Unused after HTTP 1.0
};
Enumerations are strings on disk, 32-bit integers in memory.

Example: User Constraints
int chkVersion(http_v version, method meth) { … };
Pstruct http_request {
  '\"'; method meth;
  ' ';  Pstring(:' ':) req_uri;
  ' ';  http_v version : chkVersion(version, meth);
  '\"';
};

Dependencies
“Early” data often affects parsing of later data:
– Lengths of sequences
– Branches of switched unions
To accommodate this usage, we allow PADS types to be parameterized:
Punion packets_t (: Puint8 which, Puint8 length :) {
  Pswitch (which) {
    Pcase 1: header_t header;
    Pcase 2: body_t body;
    Pcase 3: trailer_t trailer;
    Pdefault: Pstring_FW(: length :) unknown;
  }
};
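
The dependency idea can be sketched in C as a parser whose later steps are driven by values read earlier. The byte layout assumed here (one selector byte, one length byte, then the payload) and the names packet and parse_packet are hypothetical; the slide does not say where which and length come from, and parsing of header_t, body_t, and trailer_t is left out.

```c
#include <stddef.h>
#include <string.h>

typedef enum { PKT_HEADER, PKT_BODY, PKT_TRAILER, PKT_UNKNOWN } packet_kind;

typedef struct {
    packet_kind kind;
    unsigned char unknown[255];  /* payload of the default branch */
    size_t unknown_len;
} packet;

/* Hypothetical switched-union parser. buf[0] selects the branch and
 * buf[1] gives the width of the fixed-width string in the default branch.
 * Returns 0 on success, -1 if the input is too short. */
int parse_packet(const unsigned char *buf, size_t n, packet *out)
{
    if (n < 2)
        return -1;

    unsigned char which = buf[0];
    unsigned char length = buf[1];
    out->unknown_len = 0;

    switch (which) {
    case 1:
        out->kind = PKT_HEADER;   /* header parsing elided */
        return 0;
    case 2:
        out->kind = PKT_BODY;     /* body parsing elided */
        return 0;
    case 3:
        out->kind = PKT_TRAILER;  /* trailer parsing elided */
        return 0;
    default:
        /* The earlier `length` value decides how many bytes to read. */
        if (n - 2 < (size_t)length)
            return -1;
        memcpy(out->unknown, buf + 2, length);
        out->unknown_len = length;
        out->kind = PKT_UNKNOWN;
        return 0;
    }
}
```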

Common Log Format in PADS
Parray Phostname {
  Pstring_SE(:"/[. ]/":) [] : Psep('.') && Pterm(Pnosep);
};
Punion host {
  Pip ip;          /-
  Phostname host;  /-
};
Punion auth_id {
  Pchar unauthorized : unauthorized == '-';
  Pstring(:' ':) id;
};
Penum method {
  GET, PUT, POST, HEAD, DELETE, LINK, UNLINK
};
Pstruct version {
  "HTTP/";
  Puint8 major; '.';
  Puint8 minor;
};
int chkVersion(version v, method m) {
  if ((v.major == 1) && (v.minor == 1)) return 1;
  if ((m == LINK) || (m == UNLINK)) return 0;
  return 1;
};
Pstruct request {
  '\"'; method meth;
  ' ';  Pstring(:' ':) req_uri;
  ' ';  version version : chkVersion(version, meth);
  '\"';
};
Ptypedef Puint16_FW(:3:) response : response x => { 100 <= x && x < 600 };
Punion length {
  Pchar unavailable : unavailable == '-';
  Puint32 len;
};
Precord Pstruct entry {
  host client;
  ' ';   auth_id remoteID;
  ' ';   auth_id auth;
  " [";  Pdate(:']':) date;
  "] ";  request request;
  ' ';   response response;
  ' ';   length length;
};
Psource Parray clf {
  entry [];
}

PADS Parsing
Perror_t entry_read(P_t *pdc, entry_m* mask, entry_pd* pd, entry* rep);
Invariant: If mask is “check and set” and parse descriptor reports no errors, then the in-memory representation is correct.

Using the Generated Code
P_t *p;
entry rep;
entry_pd pd;
entry_m mask;
P_open(&p, 0, 0);
P_io_fopen(p, "clf/data/ ");
entry_m_init(p, &mask, P_CheckAndSet);
...
while (!P_io_at_eof(p)) {
  /* Parse one record; pd describes any errors found. */
  entry_read(p, &mask, &pd, &rep);
  if (pd.nerr > 0) {
    /* Records with errors go to the error file. */
    entry_write2io(p, ERR_FILE, &pd, &rep);
  } else {
    cnvIPAddress(&rep);
    if (entry_verify(&rep)) {
      /* Transformed record is still valid: write it out. */
      entry_write2io(p, CLEAN_FILE, &pd, &rep);
    } else {
      error(2, "Data transform failed.");
    }
  }
}

Leverage!
Convert PADS description into a collection of tools:
– Accumulators
– Histograms
– Clustering tool
– Formatters
– Translator into XML, with corresponding XML Schema.
– XQueries using Galax’s data interface
– Ad hoc data management console
– …
Long term goal: Provide a compelling suite of tools to overcome inertia of a new language and system.

Accumulators
Statistical profile of data:
– Bird’s eye view of 4000 daily feeds.
– Used to vet data (and debug PADS descriptions).
.length : uint32
good: bad: 3824 pcnt-bad:
min: 35 max: avg:
top 10 values out of 1000 distinct values:
tracked % of values
val: 3082 count: 1254 %-of-good:
val: 170 count: 1148 %-of-good:
SUMMING count: 9655 %-of-good:
Not all lengths were legal!
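
The kind of profile an accumulator reports (good and bad counts, minimum, maximum, average) can be sketched in a few lines of C. The struct acc_u32 and its functions are hypothetical names for illustration; the generated PADS accumulators also track the most frequent values, which this sketch leaves out.

```c
#include <limits.h>

/* Hypothetical running profile of one unsigned integer field. */
typedef struct {
    unsigned long good;   /* values parsed without error */
    unsigned long bad;    /* values whose parse reported errors */
    unsigned min, max;    /* range of the good values */
    double sum;           /* sum of the good values */
} acc_u32;

void acc_init(acc_u32 *a)
{
    a->good = 0;
    a->bad = 0;
    a->min = UINT_MAX;
    a->max = 0;
    a->sum = 0.0;
}

/* Add one observation; nerr > 0 marks it as bad, so it is only counted. */
void acc_add(acc_u32 *a, unsigned v, int nerr)
{
    if (nerr > 0) {
        a->bad++;
        return;
    }
    a->good++;
    if (v < a->min) a->min = v;
    if (v > a->max) a->max = v;
    a->sum += v;
}

/* Average of the good values, or 0 if there are none. */
double acc_avg(const acc_u32 *a)
{
    return a->good ? a->sum / (double)a->good : 0.0;
}

/* Percentage of observations that were bad, or 0 if there are none. */
double acc_pcnt_bad(const acc_u32 *a)
{
    unsigned long total = a->good + a->bad;
    return total ? 100.0 * (double)a->bad / (double)total : 0.0;
}
```

In a loop like the one on the "Using the Generated Code" slide, acc_add would be called once per record with the field value and the error count from its parse descriptor.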

Pretty Printer
Customizable program to reformat data:
– Users can override pretty printing on a per type basis.
– Used to normalize monitoring data before loading into a relational database.
[15/Oct/1997:18:46: ] "GET /tk/p.txt HTTP/1.0"
tj62.aol.com - - [16/Oct/1997:14:32: ] "POST HTTP/1.0"
|-|-|10/16/97:01:46:51|GET|/tk/p.txt|1|0|200|
|-|-|10/16/97:01:46:51|GET|/tk/p.txt|1|0|200|30
– Normalize time zones
– Normalize delimiters
– Drop unnecessary values
– Filter/repair errors
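
The delimiter-normalization step can be illustrated with a small C formatter that writes one parsed entry as a '|'-delimited line, similar in shape to the output shown above. The struct clf_entry and the function format_entry are hypothetical; the field order is an assumption read off the sample output, and time-zone normalization and error repair are not shown.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical in-memory form of one log entry after parsing. */
typedef struct {
    const char *client;
    const char *remote_id;
    const char *auth;
    const char *date;      /* assumed to be normalized already */
    const char *method;
    const char *uri;
    unsigned major, minor; /* HTTP version */
    unsigned response;
    unsigned length;
} clf_entry;

/* Write the entry as a single '|'-delimited line into `out`.
 * Returns the number of characters written, or -1 if `out` is too small
 * or formatting fails. */
int format_entry(const clf_entry *e, char *out, size_t cap)
{
    int n = snprintf(out, cap, "%s|%s|%s|%s|%s|%s|%u|%u|%u|%u",
                     e->client, e->remote_id, e->auth, e->date,
                     e->method, e->uri,
                     e->major, e->minor, e->response, e->length);
    return (n >= 0 && (size_t)n < cap) ? n : -1;
}
```

A single delimiter and a fixed field order make the output straightforward to bulk-load into a relational table, which is the use the slide describes.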

XML Conversion
PADS compiler generates canonical XML Schema:
…
<xs:element name="pd" type="entry_pd" minOccurs="0" maxOccurs="1"/>
…
Converter maps ad hoc data to XML conforming to schema.

XQuery Integration
XQueries can run over ad hoc data with PADS descriptions without converting data to XML.
PADS compiler generates a description-specific instance of the Galax data API, conforming to the generated XML Schema.
$pads/Psource/elt[date/rep >= xs:dateTime(" :00:00:00") and date/rep < xs:dateTime(" :00:00:00")]

LaunchPADS: GUI for ad hoc data
SIGMOD 2006 Demo

Formal Theory
A core data description calculus (DDC)
– Based on dependent type theory
– Simple, orthogonal, composable types
– Types transduce external data source to internal representation.
Encodings of high-level DDLs (PacketTypes, PADS, DataScript) in low-level DDC
POPL 2006

Future Research Directions
Design
– How can we specify error-aware data transformations?
– Can we infer a data transformation between two descriptions?
– How can we express application-specific information?
– Can we automatically generate conforming data?
– Can we integrate with a data visualization system?
Implementation
– How can we specialize generated libraries to incorporate application-specific information?
– How can we optimize XQueries over PADS data sources?
Theory
– What is the expressiveness of PADS vs. context-free grammars?
– How do we add parametric polymorphism to PADS?
Engineering
– How do we build the system to make it easy to add new base types?
– New libraries and tools? New language bindings?

Try it!
Available for download with a CPL license.
Demo of accumulators, format program, and XML conversion.
Send us feedback!
(Growing!) PADS Team:
– Kathleen Fisher (AT&T)
– Mary Fernandez (AT&T)
– Joel Gottlieb (AT&T)
– David Walker (Princeton)
– Yitzhak Mandelbaum (Princeton)
– Mark Daly (Princeton)
– Robert Gruber (Google)
– Martin Strauss (Michigan)
– Xuan Zheng (Michigan)