PADS: Processing Arbitrary Data Streams Kathleen Fisher Robert Gruber.

Slides:



Advertisements
Similar presentations
Dr. Leo Obrst MITRE Information Semantics Information Discovery & Understanding Command & Control Center February 6, 2014February 6, 2014February 6, 2014.
Advertisements

Introduction to the BinX Library eDIKT project team Ted Wen Robert Carroll
MIT Lincoln Laboratory A Service-Oriented Approach to Application Development Robert Darneille & Gary Schorer WPI MQP Presentations ICS Group 10 October.
CGI & HTML forms CGI Common Gateway Interface  A web server is only a pipe between user-agents  and content – it does not generate content.
SOAP.
DSLs: The Good, the Bad, and the Ugly Kathleen Fisher AT&T Labs Research.
Multi-Model Digital Video Library Professor: Michael Lyu Member: Jacky Ma Joan Chung Multi-Model Digital Video Library LYU9904 Multi-Model Digital Video.
1 Introduction to XML. XML eXtensible implies that users define tag content Markup implies it is a coded document Language implies it is a metalanguage.
David Walker Princeton University Computer Science Pads: Simplified Data Processing For Scientists.
Honors Compilers The Course Project Feb 28th 2002.
Encapsulation by Subprograms and Type Definitions
Kathleen Fisher AT&T Labs Research Robert Gruber Google PADS: A Domain-Specific Language for Processing Ad Hoc Data.
The World Wide Web and the Internet Dr Jim Briggs 1WUCM1.
David Walker Princeton University In Collaboration with AT&T Research Pads: Simplified Data Processing For Scientists.
How Clients and Servers Work Together. Objectives Learn about the interaction of clients and servers Explore the features and functions of Web servers.
COBOL for the 21 st Century Stern, Stern, Ley Chapter 1 INTRODUCTION TO STRUCTURED PROGRAM DESIGN IN COBOL.
Mapping Physical Formats to Logical Models to Extract Data and Metadata Tara Talbott IPAW ‘06.
CVSQL 2 The Design. System Overview System Components CVSQL Server –Three network interfaces –Modular data source provider framework –Decoupled SQL parsing.
Overview of Mini-Edit and other Tools Access DB Oracle DB You Need to Send Entries From Your Std To the Registry You Need to Get Back Updated Entries From.
INTRODUCTION TO WEB DATABASE PROGRAMMING
FALL 2005CSI 4118 – UNIVERSITY OF OTTAWA1 Part 4 Web technologies: HTTP, CGI, PHP,Java applets)
A Scalable Application Architecture for composing News Portals on the Internet Serpil TOK, Zeki BAYRAM. Eastern MediterraneanUniversity Famagusta Famagusta.
Aurora: A Conceptual Model for Web-content Adaptation to Support the Universal Accessibility of Web-based Services Anita W. Huang, Neel Sundaresan Presented.
Structured COBOL Programming, Stern & Stern, 9th edition
Comp2513 Forms and CGI Server Applications Daniel L. Silver, Ph.D.
Avro Apache Course: Distributed class Student ID: AM Name: Azzaya Galbazar
Interoperability Scenario Producing summary versions of compound multimedia historical documents.
FTP (File Transfer Protocol) & Telnet
Project Overview Bibliographic merging, Endeca, and Web application.
The OSI Model and the TCP/IP Protocol Suite Outline: 1.Protocol Layers 2.OSI Model 3.TCP/IP Model 4.Addressing 1.
Exploring an Open Source Automation Framework Implementation.
RELATIONAL FAULT TOLERANT INTERFACE TO HETEROGENEOUS DISTRIBUTED DATABASES Prof. Osama Abulnaja Afraa Khalifah
(Business) Process Centric Exchanges
Website Development with PHP and MySQL Saving Data.
ETL Extract. Design Logical before Physical Have a plan Identify Data source candidates Analyze source systems with data- profiling tools Receive walk-through.
JSP Filters 23-Oct-15. JSP - FILTERS A filter is an object that can transform a request or modify a response. Filters are not servlets; they don't actually.
Distributed Information Retrieval Using a Multi-Agent System and The Role of Logic Programming.
1 Cisco Unified Application Environment Developers Conference 2008© 2008 Cisco Systems, Inc. All rights reserved.Cisco Public Introduction to Etch Scott.
INT-5: Integrate over the Web with OpenEdge® Web Services
Lesson Overview 3.1 Components of the DBMS 3.1 Components of the DBMS 3.2 Components of The Database Application 3.2 Components of The Database Application.
Database Management Systems.  Database management system (DBMS)  Store large collections of data  Organize the data  Becomes a data storage system.
Scaling Heterogeneous Databases and Design of DISCO Anthony Tomasic Louiqa Raschid Patrick Valduriez Presented by: Nazia Khatir Texas A&M University.
Experiment Management System CSE 423 Aaron Kloc Jordan Harstad Robert Sorensen Robert Trevino Nicolas Tjioe Status Report Presentation Industry Mentor:
Implementing an RDF Schema for Pathology Images, From the Association for Pathology Informatics Jules J. Berman, Ph.D., M.D. APIII, Pittsburgh, PA Monday,
Issues in Ontology-based Information integration By Zhan Cui, Dean Jones and Paul O’Brien.
PADS: Processing Arbitrary Data Streams Kathleen Fisher Robert Gruber.
The PADS-Galax Project Enabling XQuery over Ad-hoc Data Sources Yitzhak Mandelbaum.
ASP-2-1 SERVER AND CLIENT SIDE SCRITPING Colorado Technical University IT420 Tim Peterson.
Hui Song Reflecting Running Systems as Standard Models A series of small applications of model-driven techniques.
WebWatcher A Lightweight Tool for Analyzing Web Server Logs Hervé DEBAR IBM Zurich Research Laboratory Global Security Analysis Laboratory
3/18/2002AIM AB Review of WSRP/WSIA Adaptation Description Language, Past and Present Directions. Ravi Konuru, IBM.
SOAP, Web Service, WSDL Week 14 Web site:
 Project Team: Suzana Vaserman David Fleish Moran Zafir Tzvika Stein  Academic adviser: Dr. Mayer Goldberg  Technical adviser: Mr. Guy Wiener.
XML and Distributed Applications By Quddus Chong Presentation for CS551 – Fall 2001.
A Semi-Automated Digital Preservation System based on Semantic Web Services Jane Hunter Sharmin Choudhury DSTC PTY LTD, Brisbane, Australia Slides by Ananta.
Architecture Review 10/11/2004
XML: Extensible Markup Language
Lecture 1 Introduction Richard Gesick.
z/Ware 2.0 Technical Overview
Node.js Express Web Services
Layered Architectures
Data Modeling II XML Schema & JAXB Marc Dumontier May 4, 2004
PDAP Query Language International Planetary Data Alliance
WEB API.
Chapter 9 Web Services: JAX-RPC, WSDL, XML Schema, and SOAP
Data Model.
Techniques to Invoke Web Services from SAS
OPeNDAP/Hyrax Interfaces
Presentation transcript:

PADS: Processing Arbitrary Data Streams Kathleen Fisher Robert Gruber

Data Binding, June The big picture Plethora of high-volume data streams, from which valuable information can be extracted. – Call-detail data, web logs, provisioning streams, tcpdump data, etc. Desired operations: – Programmatic manipulation – Format translation (into XML, relational database, etc.) – Declarative interaction Filtering, querying, aggregation, statistical profiling

Data Binding, June Technical challenges Data arrives “as is.” – Format determined by data source, not consumers. – Often has little documentation. – Some percentage of data is “buggy.” Often streams have high volume. – Detect relevant errors (without necessarily halting program) – Control how data is read (e.g. read header but skip body vs. read entire record). Parsing routines must be written to support any of the desired operations.

Data Binding, June Why not use C / Perl / Shell scripts… ? Problems with hand-coded parsers: Writing them is time consuming and error prone. Reading them a few months later is difficult. Maintaining them in the face of even small format changes can be difficult. Programs break in subtle and machine-specific ways (endien- ness, word-sizes). Such programs are often incomplete, particularly with respect to errors.

Data Binding, June Solution: PADS System (In Progress) One person writes declarative description of data source: – Physical format information – Semantic constraints. Many people use PADS data description and generated library. PADS system generates – C library interface for processing data. Reading ( original / binary / XML / …) Writing ( original / binary / XML / … ) Accumulators … – Application for querying stream.

Data Binding, June PADS language Can describe ASCII, EBCDIC (Cobol), binary, and mixed data formats. Allows arbitrary boolean constraint expressions to describe expected properties of data. Type-based model: each type indicates how to read associated data. Provides rich and extensible set of base types. – Pa_uint8, Pa_int8, Pa_uint16, …, Pe_uint8, …, Pb_int8, …, Pint8 – Pstring(:term-char:), Pstring_FW(:size:), Pstring_RE(:reg_exp:) Supports user-defined compound types to describe file structure: – Pstruct, Parray, Punion, Ptypedef, Penum

Data Binding, June PADS compiler Converts description to C header and implementation files. For each built-in/user-defined type: – Functions (read, accumulate, write, test data generation) – In-memory representation – Error description – Mask (check constraints, set representation, suppress printing) Reading invariant: If mask is check and set and error description reports no errors, then in-memory representation satisfies all constraints in data description.

Data Binding, June Example: CLF web log Common Log Format from Web Protocols and Practice. Fields: – IP address of remote host, either resolved (as above) or symbolic – Remote identity (usually ‘-’ to indicate name not collected) – Authenticated user (usually ‘-’ to indicate name not collected) – Time associated with request – Request (request method, request-uri, and protocol version) – Response code – Content length [15/Oct/1997:18:46: ] "GET /turkey/amnty1.gif HTTP/1.0"

Data Binding, June Example: CLF web log in PADS Precord Pstruct http_weblog { host client; /- Client requesting service ' '; auth_id remoteID; /- Remote identity ' '; auth_id auth; /- Name of authenticated user “ [”; Pdate(:']':) date; /- Timestamp of request “] ”; http_request request; /- Request ' '; Puint16_FW(:3:) response; /- 3-digit response code ' '; Puint32 contentLength; /- Bytes in response }; [15/Oct/1997:18:46: ] "GET /turkey/amnty1.gif HTTP/1.0"

Data Binding, June PADSL example: user constraint int checkVersion(http_v version, method_t meth) { if ((version.major == 1) && (version.minor == 1)) return 1; if ((meth == LINK) || (meth == UNLINK)) return 0; return 1; } Pstruct http_request { '\"'; method_t meth; /- Request method ' '; Pstring(:' ':) req_uri; /- Requested uri. ' '; http_v version : checkVersion(version, meth); /- HTTP version number of request '\"'; }; [15/Oct/1997:18:46: ] "GET /turkey/amnty1.gif HTTP/1.0"

Data Binding, June PADSL example: arrays and unions Parray nIP { Puint8 [4] : Psep == '.'; }; Parray sIP { Pstring(:"[. ]":) [] : Psep == '.' && Pterm == ' '; } Punion host { nIP resolved; / sIP symbolic; /- }; Punion auth_id { Pchar unauthorized : unauthorized == '-'; /- non-authenticated http session Pstring(:' ':) id; /- login supplied during authentication }; [15/Oct/1997:18:46: ] "GET /turkey/amnty1.gif HTTP/1.0"

Data Binding, June Generated type declarations typedef struct { host client; /* Client requesting service */ auth_id remoteID; /* Remote identity */ … } http_weblog; typedef struct { host_m client; auth_id_m remoteID; … } http_weblog_m; typedef struct { int nerr; int errCode; PDC_loc loc; int panic; host_ed client; auth_id_ed remoteID; …; } http_weblog_ed;

Data Binding, June Sample use PDC_t *pdc; http_weblog entry; http_weblog_m mask; http_weblog_ed ed; PDC_open(&pdc, 0 /* PADS disc */, 0 /* PADS IO disc */); PDC_IO_fopen(pdc, fileName);... call init functions... http_weblog_mask(&mask, PCheck & PSet); while (!PDC_IO_at_EOF(pdc)) { http_weblog_read(pdc, &mask, &ed, &entry); if (ed.nerr != 0) {... Error handling... }... Process/query entry... };... call cleanup functions... PDC_IO_fclose(pdc); PDC_close(pdc);

Data Binding, June Related work ASN.1, ASDL – Describe logical representation, generate physical. DataScript [Back: CGSE 2002] & PacketTypes [McCann & Chandra: SIGCOMM 2000] – Binary only – Stop on first error

Data Binding, June PADS to do Allow library generation to be customized with application- specific information: – Repair errors, ignore certain fields, customize in-memory representation, etc. Explore declarative querying via integration with XQuery (joint work with Mary Fernandez and Ricardo Medel). Support data translation – Requires mapping from one in-memory representation to another. Develop user-base and integrate feedback. – What would you want in such a tool?

Data Binding, June Getting PADS PADS will be available shortly for download with a non- commercial-use license.

Data Binding, June PADS architecture PADS Compiler Application-specific customizations PADS data description C library Hancock stream description Query tool … PADS Library

Data Binding, June Technical challenges revisited Data arrives “as is.” – Format determined by data source, not consumers. PADS language allows consumers to describe data as it is. – Often has little documentation. PADS description can serve as documentation for data source. – Some percentage of data is “buggy.” Constraints allow consumers to express expectations about data. Generated code reports errors when constraints violated. Often streams are high volume. – Detect relevant errors (without necessarily halting program) Masks specify relevancy; returned descriptors characterize errors. – Control how data is read Multiple entry-points allow different levels of granularity.