PADS: Processing Arbitrary Data Streams Kathleen Fisher Robert Gruber.



February

The big picture
Abundance of ad hoc data streams from which valuable information can be extracted.
– Call-detail data, web logs, provisioning streams, tcpdump data, http packets, dns packets, etc.
Desired operations:
– Programmatic manipulation
– Format translation (into XML, relational database, etc.)
– Declarative interaction: filtering, querying, aggregation, statistical profiling

Technical challenges
Data arrives “as is.”
– Format determined by data source, not consumers.
– Often has little documentation.
– Some percentage of data is “buggy.”
Often streams have high volume.
– Detect relevant errors (without necessarily halting program).
– Control how data is read (e.g. read header but skip body vs. read entire record).
Parsing routines must be written to support any of the desired operations.

Why not use C / Perl / Shell scripts…?
Writing hand-coded parsers is time consuming & error prone.
Reading them a few months later is difficult.
Maintaining them in the face of even small format changes can be difficult.
Programs break in subtle and machine-specific ways (endianness, word sizes).
Such programs are often incomplete, particularly with respect to errors.
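As an illustration of the endianness point, here is a minimal sketch of how a hand-written parser has to decode a binary integer to stay portable. The helper name read_u32_be is hypothetical and is not part of PADS; the field is assumed to be a 32-bit unsigned value stored in big-endian (network) order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a PADS function): decode a 32-bit big-endian
 * unsigned integer from a byte buffer. It assembles the value byte by byte,
 * so the result is the same on little- and big-endian hosts. */
uint32_t read_u32_be(const unsigned char *buf, size_t len)
{
    assert(buf != NULL && len >= 4);
    return ((uint32_t)buf[0] << 24) |
           ((uint32_t)buf[1] << 16) |
           ((uint32_t)buf[2] << 8)  |
            (uint32_t)buf[3];
}
```

Casting the raw bytes to a uint32_t pointer instead would give different answers on different machines, which is the kind of subtle breakage the slide refers to.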

Why not use traditional parsers?
Specifying a lexer and parser separately can be a barrier.
Need to handle data-dependent parsing.
Need more flexible error processing.
Need support for multiple entry points.

Solution: PADS System (In Progress)
One person writes declarative description of data source:
– Physical format information
– Semantic constraints
Many people use PADS data description and generated library.
PADS system generates
– C library interface for processing data.
    Reading (original / binary / XML / …)
    Writing (original / binary / XML / …)
    Auxiliary functions …
– Querying support

PADS architecture
[Diagram. Labels: PADS description, application-specific customizations, PADS Compiler, generated library, PADS Library, user code, application.]

Outline
Motivation
Explain PADS language by example
Describe generated library
Discuss how design has evolved through interactions with various users.

PADS language
Can describe ASCII, EBCDIC (Cobol), binary, and mixed data formats.
Allows arbitrary boolean constraint expressions to describe expected properties of data.
Type-based model: each type indicates how to process associated data.
Provides rich and extensible set of base types.
– Pint8, Puint8, Pa_int8, Pa_uint8, Pb_int8, … Pint16, …
– Pstring(:term-char:), Pstring_FW(:size:), Pstring_ME(:reg_exp:)
Supports user-defined compound types to describe data source structure:
– Pstruct, Parray, Punion, Ptypedef, Penum

Example: CLF web log
Common Log Format from Web Protocols and Practice.
[15/Oct/1997:18:46: ] "GET /turkey/amnty1.gif HTTP/1.0"
Fields:
– IP address of remote host, either resolved (as above) or symbolic
– Remote identity (usually ‘-’ to indicate name not collected)
– Authenticated user (usually ‘-’ to indicate name not collected)
– Time associated with request
– Request (request method, request-uri, and protocol version)
– Response code
– Content length

Example: Pstruct
    Precord Pstruct http_weblog {
        host client;                /- Client requesting service
        ' ';
        auth_id remoteID;           /- Remote identity
        ' ';
        auth_id auth;               /- Name of authenticated user
        " [";
        Pdate(:']':) date;          /- Timestamp of request
        "] ";
        http_request request;       /- Request
        ' ';
        Puint16_FW(:3:) response;   /- 3-digit response code
        ' ';
        Puint32 contentLength;      /- Bytes in response
    };
[15/Oct/1997:18:46: ] "GET /turkey/amnty1.gif HTTP/1.0"

Example: Punion
    Punion id {
        Pchar unavailable : unavailable == '-';
        Pstring(:' ':) id;
    };
[15/Oct/1997:18:46: ] "GET /turkey/amnty1.gif HTTP/1.0"
Union declarations allow the user to describe variations.
Implementation tries branches in order.
Stops when it finds a branch whose constraints are all true.
Switched version branches on supplied tag.
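The ordered-choice behaviour described for unions can be sketched in plain C. This is only an illustration of the idea, not the code PADS generates: the names id_tag, id_rep, and id_read are hypothetical, and the sketch assumes that the string branch must be non-empty and fit in a 64-byte buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory form of the union: a tag plus the branch data. */
typedef enum { ID_UNAVAILABLE, ID_NAME, ID_ERROR } id_tag;

typedef struct {
    id_tag tag;
    char   name[64];   /* meaningful only when tag == ID_NAME */
} id_rep;

/* Try the branches in declaration order and stop at the first one whose
 * constraint holds. Returns the number of bytes consumed, or 0 if no
 * branch matches. */
size_t id_read(const char *in, id_rep *out)
{
    assert(in != NULL && out != NULL);
    out->name[0] = '\0';

    /* Branch 1: a single character constrained to be '-'. */
    if (in[0] == '-') {
        out->tag = ID_UNAVAILABLE;
        return 1;
    }

    /* Branch 2: a string terminated by a space (assumed non-empty here). */
    size_t n = strcspn(in, " ");
    if (n > 0 && in[n] == ' ' && n < sizeof out->name) {
        memcpy(out->name, in, n);
        out->name[n] = '\0';
        out->tag = ID_NAME;
        return n;
    }

    out->tag = ID_ERROR;
    return 0;
}
```

Because the branches are tried in order, an input that starts with '-' always takes the first branch, even if the second branch could also have matched it.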

Example: Parray
    Parray nIP {
        Puint8[4]: Psep('.') && Pterm(' ');
    };
[15/Oct/1997:18:46: ] "GET /turkey/amnty1.gif HTTP/1.0"
Array declarations allow the user to specify:
– Size (fixed, lower-bounded, upper-bounded, unbounded)
– Boolean-valued constraints
– Psep and Pterm predicates
Array terminates upon exhausting EOF/EOR, reaching terminator, or reaching maximum size.
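To make the separator and terminator roles concrete, the following is a rough hand-written equivalent of the nIP description above: four 8-bit values separated by '.' and followed by ' '. The function name nip_read is hypothetical, and the sketch assumes ASCII decimal digits; it is not the reader PADS would generate.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hand-written reader for "a.b.c.d " (four bytes, '.' as
 * separator, ' ' as terminator). Returns the number of bytes consumed,
 * not counting the terminator, or 0 on any error. */
size_t nip_read(const char *in, uint8_t out[4])
{
    assert(in != NULL && out != NULL);
    size_t pos = 0;
    for (int i = 0; i < 4; i++) {
        if (i > 0) {
            if (in[pos] != '.') return 0;      /* missing separator */
            pos++;
        }
        unsigned value = 0;
        size_t digits = 0;
        while (in[pos] >= '0' && in[pos] <= '9' && digits < 3) {
            value = value * 10 + (unsigned)(in[pos] - '0');
            pos++;
            digits++;
        }
        if (digits == 0 || value > 255) return 0;  /* does not fit in 8 bits */
        out[i] = (uint8_t)value;
    }
    return in[pos] == ' ' ? pos : 0;           /* terminator must follow */
}
```

Even this small format needs explicit checks for a missing separator, an out-of-range element, and a missing terminator, all of which the one-line Parray declaration expresses declaratively.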

Example: Penum
    Penum http_method {
        GET, PUT, POST, HEAD, DELETE,
        LINK,     /- Unused after HTTP 1.0
        UNLINK    /- Unused after HTTP 1.0
    };
[15/Oct/1997:18:46: ] "GET /turkey/amnty1.gif HTTP/1.0"
Enumerations are strings on disk, 32-bit integers in memory.
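The "strings on disk, 32-bit integers in memory" mapping might look roughly like this in C. The table method_names and the function http_method_from_text are hypothetical names chosen for this sketch; the generated PADS code may be organized quite differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef enum { GET, PUT, POST, HEAD, DELETE, LINK, UNLINK } http_method;

/* On-disk spellings, indexed by the in-memory enum value. */
static const char *const method_names[] = {
    "GET", "PUT", "POST", "HEAD", "DELETE", "LINK", "UNLINK"
};

/* Hypothetical helper: map the first `len` bytes of `text` to the 32-bit
 * in-memory value. Returns 1 on success and 0 if no method name matches. */
int http_method_from_text(const char *text, size_t len, int32_t *out)
{
    assert(text != NULL && out != NULL);
    size_t count = sizeof method_names / sizeof method_names[0];
    for (size_t i = 0; i < count; i++) {
        if (strlen(method_names[i]) == len &&
            memcmp(method_names[i], text, len) == 0) {
            *out = (int32_t)i;
            return 1;
        }
    }
    return 0;
}
```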

Example: Ptypedef
    Ptypedef Puint32 bid :: bid x => {x >= 100};
Typedefs allow the user to add constraints to existing types.

Advanced features: User constraints
    int checkVersion(http_v version, method_t meth) {
        if ((version.major == 1) && (version.minor == 0)) return 1;
        if ((meth == LINK) || (meth == UNLINK)) return 0;
        return 1;
    }

    Pstruct http_request {
        '\"';
        method_t meth;            /- Request method
        ' ';
        Pstring(:' ':) req_uri;   /- Requested uri.
        ' ';
        http_v version : checkVersion(version, meth);
                                  /- HTTP version number of request
        '\"';
    };
[15/Oct/1997:18:46: ] "GET /turkey/amnty1.gif HTTP/1.0"

Advanced features: Sharing information
“Early” data often affects parsing of later data:
– Lengths of sequences
– Branches of switched unions
To accommodate this usage, we allow PADS types to be parameterized:
    Punion packets_t (: Puint8 which, Puint8 length :) {
        Pswitch (which) {
            Pcase 1: header_t header;
            Pcase 2: body_t body;
            Pcase 3: trailer_t trailer;
            Pdefault: Pstring_FW(: length :) unknown;
        }
    };
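A minimal C sketch of the same data-dependent idea follows. The slides do not define header_t, body_t, or trailer_t, so the sketch only records which branch the earlier tag selects and uses the earlier length solely for the default fixed-width branch. All names here (packet_kind, packet_rep, packet_read) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { PK_UNKNOWN = 0, PK_HEADER = 1, PK_BODY = 2, PK_TRAILER = 3 } packet_kind;

typedef struct {
    packet_kind          kind;
    const unsigned char *unknown;      /* set only for the default branch */
    uint8_t              unknown_len;
} packet_rep;

/* `which` and `length` play the role of the type parameters: both were
 * parsed earlier in the stream and steer how the following bytes are read.
 * Returns 1 on success, 0 if `length` overruns the available data. */
int packet_read(const unsigned char *in, size_t avail,
                uint8_t which, uint8_t length, packet_rep *out)
{
    assert(in != NULL && out != NULL);
    out->unknown = NULL;
    out->unknown_len = 0;
    switch (which) {
    case 1: out->kind = PK_HEADER;  return 1;  /* header parsing omitted */
    case 2: out->kind = PK_BODY;    return 1;  /* body parsing omitted */
    case 3: out->kind = PK_TRAILER; return 1;  /* trailer parsing omitted */
    default:
        if (length > avail) return 0;          /* earlier length is inconsistent */
        out->kind = PK_UNKNOWN;
        out->unknown = in;                     /* fixed-width payload of `length` bytes */
        out->unknown_len = length;
        return 1;
    }
}
```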

PADS compiler
Converts description to C header and implementation files.
For each built-in/user-defined type:
– Functions (read, write, auxiliary functions)
– In-memory representation
– Mask (check constraints, set representation, suppress printing)
– Parse descriptor
Reading invariant: If mask is check and set and parse descriptor reports no errors, then in-memory representation satisfies all constraints in data description.
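The interplay of mask, parse descriptor, and reading invariant can be sketched for the bid typedef shown earlier (a Puint32 constrained to be at least 100). The mask bits M_SET and M_CHECK, the descriptor bid_pd, and the function bid_read are hypothetical simplifications, assuming ASCII decimal input of at most nine digits; the real generated types carry more information.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { M_SET = 1u << 0, M_CHECK = 1u << 1 };   /* hypothetical mask bits */

typedef struct { int nerr; } bid_pd;            /* hypothetical parse descriptor */

/* Read an unsigned decimal value. The mask decides whether the constraint
 * (value >= 100) is checked and whether the representation is filled in;
 * the parse descriptor reports how many errors were seen. */
void bid_read(const char *in, unsigned mask, bid_pd *pd, uint32_t *rep)
{
    assert(in != NULL && pd != NULL && rep != NULL);
    pd->nerr = 0;

    uint32_t value = 0;
    size_t n = 0;
    while (in[n] >= '0' && in[n] <= '9' && n < 9) {   /* 9 digits cannot overflow */
        value = value * 10 + (uint32_t)(in[n] - '0');
        n++;
    }
    if (n == 0) {                 /* syntactic error: no digits at all */
        pd->nerr++;
        return;
    }
    if ((mask & M_CHECK) && value < 100)
        pd->nerr++;               /* semantic constraint violated */
    if (mask & M_SET)
        *rep = value;
}
```

With both bits set and pd.nerr equal to 0 after the call, the stored value is guaranteed to be at least 100, which is the reading invariant in miniature. Dropping M_CHECK skips the test (and the guarantee); dropping M_SET leaves the representation untouched.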

Generated representation
    Pstruct http_request {
        '\"';
        http_method meth;         /- Request method
        ' ';
        Pstring(:' ':) req_uri;   /- Requested uri.
        ' ';
        http_version version : check(version, meth);
        '\"';
    };

    typedef struct {
        http_method meth;         /* Request method */
        Pstring req_uri;          /* Requested uri */
        http_version version;     /* check(version, meth) */
    } http_request;

Generated mask (m)
    Pstruct http_request {
        '\"';
        http_method meth;         /- Request method
        ' ';
        Pstring(:' ':) req_uri;   /- Requested uri.
        ' ';
        http_version version : check(version, meth);
        '\"';
    };

    typedef struct {
        P_base_m structLevel;
        http_method_m meth;
        P_base_m req_uri;         /* Check, Set, Print,… */
        http_version_m version;
    } http_request_m;

Generated parse descriptor (pd)
    Pstruct http_request {
        '\"';
        http_method meth;         /- Request method
        ' ';
        Pstring(:' ':) req_uri;   /- Requested uri.
        ' ';
        http_version version : check(version, meth);
        '\"';
    };

    typedef struct {
        int nerr;
        P_errCode_t errCode;
        P_loc_t loc;
        int panic;                /* Structural error */
        http_method_pd meth;
        P_base_pd req_uri;
        http_version_pd version;
    } http_request_pd;

Sample use
    PDC_t *pdc;
    http_weblog entry;
    http_weblog_m mask;
    http_weblog_pd pd;

    P_open(&pdc, 0 /* PADS disc */, 0 /* PADS IO disc */);
    P_IO_fopen(pdc, fileName);
    ... call init functions ...
    http_weblog_mask(&mask, PCheck & PSet);
    while (!P_IO_at_EOF(pdc)) {
        http_weblog_read(pdc, &mask, &pd, &entry);
        if (pd.nerr != 0) { ... Error handling ... }
        ... Process/query entry ...
    };
    ... call cleanup functions ...
    P_IO_fclose(pdc);
    P_close(pdc);

Expanding our horizons
Design to this point primarily based on our experiences with Hancock data streams.
To get feedback on the expressiveness of our language and the utility of the generated libraries, we started collecting other users:
– Data analysts (Chris Volinsky, David Poole)
– Cobol gurus (Andrew Hume, Bethany Robinson, …)
– Internet miner (Trevor Jim)

Data analysts: The domain
Data: ASCII files, several gigabytes in size, sequence of provisioning records, each of which is a sequence of state, time-stamp pairs.
    customer_id | order# | state1 | ts1 | … | staten | tsn
Desired application: aggregation queries.
– How many records that go through state 3 end in state 5?
– What is the average length of time spent in state 4?
– …
Original applications written in brittle awk and perl code.
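Once records are parsed into memory, the first aggregation query above reduces to a short loop. The sketch below assumes a hypothetical in-memory layout (event, order_record) for already-parsed provisioning records; it does not use the PADS library and ignores the customer_id and order# fields.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical parsed form: one (state, time-stamp) pair per event. */
typedef struct { int state; long ts; } event;

typedef struct {
    const event *events;
    size_t       count;
} order_record;

/* Returns 1 if the record passes through `via` and its final state is `last`. */
int passes_through_and_ends(const order_record *r, int via, int last)
{
    assert(r != NULL);
    if (r->count == 0 || r->events[r->count - 1].state != last)
        return 0;
    for (size_t i = 0; i < r->count; i++)
        if (r->events[i].state == via)
            return 1;
    return 0;
}

/* "How many records that go through state `via` end in state `last`?" */
size_t count_matching(const order_record *recs, size_t n, int via, int last)
{
    size_t hits = 0;
    for (size_t i = 0; i < n; i++)
        hits += (size_t)passes_through_and_ends(&recs[i], via, last);
    return hits;
}
```

The query logic itself is small; most of the effort in the original awk and perl programs presumably went into getting the records into a usable form in the first place.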

Data analysts: Lessons learned
PADS expressive enough to describe data sources.
PADS descriptions much less brittle than originals.
Generated code faster than original implementations.
Support for declarative querying could be a big win, as analysts can then generate desired aggregates with only declarative programming.
– Data and desired queries “semi-structured” rather than relational.

Developing support for declarative querying
Wanted to leverage existing query language.
XQuery, standardized query language for XML, is appropriately expressive for our queries.
Modified Galax, open-source implementation of XQuery, to create data API, allowing Galax to “read” data not in XML format (Mary Fernandez, Jérôme Simeon).
Extended PADSC to generate data API.
Currently, can run queries over PADS data if data fits in memory.
Next: extend Galax API and generated library to support streaming interface.

Cobol gurus: The domain
Data: EBCDIC encoded files with Cobol copy-book descriptions.
Thousands of files arriving on daily basis, hundreds of relevant copy-books, with more formats arriving regularly.
Desired application: developing a “bird’s eye” view of data.
Original applications: hand-written summary programs on an ad hoc basis.

Cobol gurus: Lessons learned
Wrote a translator that converts from Cobol copy-books to PADSL.
– Added Palternates, which parse a block of data multiple times, making all parses available in memory.
– Generated description uses Palternates, Parrays, Pstructs, and Punions.
– Uses parameters to control data-dependent array lengths.
– Successfully translated available copy-books.
Added accumulators to generated library.
– Aggregate parsed data, including errors.
– Generate reports, providing bird’s eye view.
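An accumulator of the kind described can be sketched as a small struct that is updated once per parsed value. The type u32_acc and its functions are hypothetical and track only a few statistics for a single 32-bit field; the accumulators generated by PADS are per-type and are assumed to record more than this.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical accumulator for one 32-bit unsigned field. */
typedef struct {
    uint64_t good;       /* values parsed without error */
    uint64_t bad;        /* values whose parse reported errors */
    uint32_t min, max;   /* range of the good values */
    uint64_t sum;        /* sum of the good values */
} u32_acc;

void u32_acc_init(u32_acc *a)
{
    assert(a != NULL);
    a->good = 0;
    a->bad = 0;
    a->sum = 0;
    a->min = UINT32_MAX;
    a->max = 0;
}

/* Fold one parsed value into the summary; `nerr` is the error count
 * reported for that value (0 means it parsed cleanly). */
void u32_acc_add(u32_acc *a, uint32_t value, int nerr)
{
    assert(a != NULL);
    if (nerr != 0) {
        a->bad++;
        return;
    }
    a->good++;
    a->sum += value;
    if (value < a->min) a->min = value;
    if (value > a->max) a->max = value;
}

/* Mean of the good values, or 0.0 if there were none. */
double u32_acc_mean(const u32_acc *a)
{
    assert(a != NULL);
    return a->good ? (double)a->sum / (double)a->good : 0.0;
}
```

Counting erroneous values separately is what lets a report show both the shape of the good data and how much of the stream was "buggy".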

Internet miner: The domain
Data: data formats described by RFCs. Some binary (dns, for example), some ASCII (http, for example).
Desired applications: detecting security-related protocol violations, data mining, semi-automatic generation of reference implementations.
Original applications: being developed with PADS.

Internet miners: Lessons learned
Constraints are a big win, because they allow semantic conditions to be checked.
Parameterization used a lot, particularly to check buffer lengths in binary formats (dns).
Additional features needed:
– Predicates over parsed prefixes to express array termination.
– Regular expression literals and user-defined character classes.
– Positional information in constraint language.
– Recursive declarations.
– In-line declarations.
Developing script to semi-automate translation of RFC EBNF (Trevor Jim).
– URI spec successfully translated, HTTP close.

Related work
ASN.1, ASDL
– Describe logical representation, generate physical.
DataScript [Back: CGSE 2002] & PacketTypes [McCann & Chandra: SIGCOMM 2000]
– Binary only
– Stop on first error

PADS to do
Finalize initial release: documentation and release process.
Conduct careful performance study and tune library accordingly.
Address various issues mentioned previously:
– Generate streaming API for Galax
– Add recursive specifications
– Support in-line declarations
Develop a formal semantics.
Write more applications that use PADS descriptions.

Longer term …
Allow library generation to be customized with application-specific information:
– Repair errors, ignore certain fields, customize in-memory representation, etc.
Support data translation
– Requires mapping from one in-memory representation to another.
Leverage semantic information from specification
– Test data generation
– In-memory completion functions
– …

Technical challenges revisited
Data arrives “as is.”
– Format determined by data source, not consumers.
    PADS language allows consumers to describe data as it is.
– Often has little documentation.
    PADS description can serve as documentation for data source.
– Some percentage of data is “buggy.”
    Constraints allow consumers to express expectations about data.
    Generated code reports errors when constraints violated.
Often streams are high volume.
– Detect relevant errors (without necessarily halting program)
    Masks specify relevancy; returned descriptors characterize errors.
– Control how data is read
    Multiple entry points allow different levels of granularity.

Getting PADS
PADS will be available shortly for download with a non-commercial-use license.

Example: arrays and unions
    Parray nIP {
        Puint8 [4] : Psep('.');
    };

    Parray sIP {
        Pstring_SE(:"[. ]":) [] : Psep('.') && Pterm ('.');
    };

    Punion host {
        nIP resolved;   /-
        sIP symbolic;   /-
    };

    Punion auth_id {
        Pchar unauthorized : unauthorized == '-';   /- non-authenticated http session
        Pstring(:' ':) id;                          /- login supplied during authentication
    };
[15/Oct/1997:18:46: ] "GET /turkey/amnty1.gif HTTP/1.0"

Generated type declarations
    typedef struct {
        host client;          /* Client requesting service */
        auth_id remoteID;     /* Remote identity */
        …
    } http_weblog;

    typedef struct {
        host_m client;
        auth_id_m remoteID;
        …
    } http_weblog_m;

    typedef struct {
        int nerr;
        int errCode;
        PDC_loc loc;
        int panic;
        host_pd client;
        auth_id_pd remoteID;
        …
    } http_weblog_pd;

Generated accumulator representation (acc)
    Pstruct http_request {
        '\"';
        http_method meth;         /- Request method
        ' ';
        Pstring(:' ':) req_uri;   /- Requested uri.
        ' ';
        http_version version : check(version, meth);
        '\"';
    };

    typedef struct {
        PDC_uint64_acc nerr;
        http_method_acc meth;
        PDC_string_acc req_uri;
        http_version_acc version;
    } http_request_acc;

Generated read function
    Pstruct http_request {
        '\"';
        http_method meth;         /- Request method
        ' ';
        Pstring(:' ':) req_uri;   /- Requested uri.
        ' ';
        http_version version : check(version, meth);
        '\"';
    };

    PDC_error_t http_request_read(PDC_t *pdc, http_request_m *m,
                                  http_request_pd *pd, http_request *rep);
We also generate initialization and cleanup functions for representations and error descriptors for variable-width data.