PADS: Processing Arbitrary Data Streams Kathleen Fisher Robert Gruber.

1 PADS: Processing Arbitrary Data Streams
Kathleen Fisher, Robert Gruber

2 The big picture
Data Binding, June 2003
Plethora of high-volume data streams, from which valuable information can be extracted.
– Call-detail data, web logs, provisioning streams, tcpdump data, etc.
Desired operations:
– Programmatic manipulation
– Format translation (into XML, relational database, etc.)
– Declarative interaction
  Filtering, querying, aggregation, statistical profiling

3 Technical challenges
Data arrives “as is.”
– Format determined by data source, not consumers.
– Often has little documentation.
– Some percentage of data is “buggy.”
Often streams have high volume.
– Detect relevant errors (without necessarily halting program)
– Control how data is read (e.g. read header but skip body vs. read entire record).
Parsing routines must be written to support any of the desired operations.

4 Why not use C / Perl / shell scripts…?
Problems with hand-coded parsers:
– Writing them is time-consuming and error-prone.
– Reading them a few months later is difficult.
– Maintaining them in the face of even small format changes can be difficult.
– Programs break in subtle and machine-specific ways (endianness, word sizes).
– Such programs are often incomplete, particularly with respect to errors.
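
The endianness and word-size point can be made concrete with a small sketch. Casting a byte buffer to an integer pointer gives different answers on different hosts, while assembling the value byte by byte does not. The helper name `read_be32` is hypothetical and is not part of PADS.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of PADS): decode a 32-bit big-endian
 * unsigned integer from a byte buffer. The result does not depend on
 * the host byte order or on the width of int or long. */
uint32_t read_be32(const unsigned char *buf, size_t len)
{
    assert(buf != NULL && len >= 4);   /* caller must supply 4 bytes */
    return ((uint32_t)buf[0] << 24) |
           ((uint32_t)buf[1] << 16) |
           ((uint32_t)buf[2] << 8)  |
            (uint32_t)buf[3];
}
```

A hand-coded parser has to repeat this care for every binary field, which is one reason such parsers tend to break when moved between machines.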

5 Solution: PADS System (In Progress)
One person writes declarative description of data source:
– Physical format information
– Semantic constraints.
Many people use PADS data description and generated library.
PADS system generates
– C library interface for processing data.
  Reading (original / binary / XML / …)
  Writing (original / binary / XML / …)
  Accumulators
  …
– Application for querying stream.

6 PADS language
Can describe ASCII, EBCDIC (Cobol), binary, and mixed data formats.
Allows arbitrary boolean constraint expressions to describe expected properties of data.
Type-based model: each type indicates how to read associated data.
Provides rich and extensible set of base types.
– Pa_uint8, Pa_int8, Pa_uint16, …, Pe_uint8, …, Pb_int8, …, Pint8
– Pstring(:term-char:), Pstring_FW(:size:), Pstring_RE(:reg_exp:)
Supports user-defined compound types to describe file structure:
– Pstruct, Parray, Punion, Ptypedef, Penum
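
To illustrate "each type indicates how to read associated data", the sketch below contrasts a terminator-delimited string with a fixed-width one. It is a rough C analogue only: the function names are hypothetical, and the assumption that the terminator is left unconsumed is inferred from the descriptions in this deck, where a literal follows each Pstring field.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical analogue of a Pstring(:term:) read: the string is
 * everything up to, but not including, the terminator character.
 * Returns its length, or -1 if the terminator never appears. */
long string_term_len(const char *s, char term)
{
    const char *p;
    assert(s != NULL);
    p = strchr(s, term);
    return p ? (long)(p - s) : -1L;
}

/* Hypothetical analogue of a Pstring_FW(:size:) read: exactly `size`
 * bytes are taken. Returns size, or -1 if the input is too short. */
long string_fw_len(const char *s, size_t size)
{
    assert(s != NULL);
    return strlen(s) >= size ? (long)size : -1L;
}
```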

7 PADS compiler
Converts description to C header and implementation files.
For each built-in/user-defined type:
– Functions (read, accumulate, write, test data generation)
– In-memory representation
– Error description
– Mask (check constraints, set representation, suppress printing)
Reading invariant: If mask is check and set and error description reports no errors, then in-memory representation satisfies all constraints in data description.
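
The relationship between mask, error description, and in-memory representation can be sketched for a single fixed-width numeric field. Everything here is a hypothetical stand-in rather than the generated PADS API: the mask bits `M_CHECK` and `M_SET`, the `field_ed` struct, the error codes, and the range constraint passed as `lo` and `hi`.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* All names are hypothetical stand-ins, not the generated PADS API. */
enum { M_CHECK = 1, M_SET = 2 };                     /* mask bits */
enum { ERR_NONE = 0, ERR_SYNTAX = 1, ERR_CONSTRAINT = 2 };

typedef struct { int nerr; int errCode; } field_ed;  /* error description */

/* Read a decimal field of exactly `width` digits from s.
 * With M_SET, the value is stored in *rep. With M_CHECK, the
 * constraint lo <= value <= hi is also tested. Errors are recorded
 * in *ed instead of halting. Returns the number of bytes consumed. */
size_t read_uint_fw(const char *s, size_t width, unsigned mask,
                    unsigned lo, unsigned hi, field_ed *ed, unsigned *rep)
{
    unsigned value = 0;
    size_t i;

    assert(s != NULL && ed != NULL && rep != NULL);
    ed->nerr = 0;
    ed->errCode = ERR_NONE;

    for (i = 0; i < width; i++) {
        if (!isdigit((unsigned char)s[i])) {   /* also stops at '\0' */
            ed->nerr = 1;
            ed->errCode = ERR_SYNTAX;
            return i;                          /* stop at the bad byte */
        }
        value = value * 10u + (unsigned)(s[i] - '0');
    }
    if ((mask & M_CHECK) && (value < lo || value > hi)) {
        ed->nerr = 1;
        ed->errCode = ERR_CONSTRAINT;
    }
    if (mask & M_SET)
        *rep = value;
    return width;
}
```

In this sketch the invariant reads: when the mask is `M_CHECK | M_SET` and `ed->nerr` is 0, `*rep` lies within `lo..hi`. Dropping `M_CHECK` lets an application skip constraint checks it does not care about.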

8 Example: CLF web log
Common Log Format from Web Protocols and Practice.
207.136.97.50 - - [15/Oct/1997:18:46:51 -0700] "GET /turkey/amnty1.gif HTTP/1.0" 200 3013
Fields:
– IP address of remote host, either resolved (as above) or symbolic
– Remote identity (usually ‘-’ to indicate name not collected)
– Authenticated user (usually ‘-’ to indicate name not collected)
– Time associated with request
– Request (request method, request-uri, and protocol version)
– Response code
– Content length
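
A hand-coded reader for the sample record, of the kind slide 4 warns about, might look like the following sketch. The struct `clf_entry`, its buffer sizes, and the function `parse_clf` are assumptions made for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical in-memory form of one CLF record; sizes are arbitrary. */
typedef struct {
    char host[64];
    char remote_id[32];
    char auth[32];
    char date[32];
    char method[16];
    char uri[256];
    char protocol[16];
    unsigned response;
    unsigned long content_length;
} clf_entry;

/* Parse one CLF line with sscanf.
 * Returns 1 if all nine fields were converted, 0 otherwise. */
int parse_clf(const char *line, clf_entry *e)
{
    int n;
    assert(line != NULL && e != NULL);
    n = sscanf(line,
               "%63s %31s %31s [%31[^]]] \"%15s %255s %15[^\"]\" %3u %lu",
               e->host, e->remote_id, e->auth, e->date,
               e->method, e->uri, e->protocol,
               &e->response, &e->content_length);
    return n == 9;
}
```

The format string is compact but hard to read, reports only a count on failure rather than which field was malformed, and ignores anything that follows the content length.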

9 Example: CLF web log in PADS
    Precord Pstruct http_weblog {
      host client;               /- Client requesting service
      ' ';
      auth_id remoteID;          /- Remote identity
      ' ';
      auth_id auth;              /- Name of authenticated user
      " [";
      Pdate(:']':) date;         /- Timestamp of request
      "] ";
      http_request request;      /- Request
      ' ';
      Puint16_FW(:3:) response;  /- 3-digit response code
      ' ';
      Puint32 contentLength;     /- Bytes in response
    };
207.136.97.50 - - [15/Oct/1997:18:46:51 -0700] "GET /turkey/amnty1.gif HTTP/1.0" 200 3013

10 PADSL example: user constraint
    int checkVersion(http_v version, method_t meth) {
      if ((version.major == 1) && (version.minor == 1)) return 1;
      if ((meth == LINK) || (meth == UNLINK)) return 0;
      return 1;
    }
    Pstruct http_request {
      '\"';
      method_t meth;             /- Request method
      ' ';
      Pstring(:' ':) req_uri;    /- Requested uri.
      ' ';
      http_v version : checkVersion(version, meth);
                                 /- HTTP version number of request
      '\"';
    };
207.136.97.50 - - [15/Oct/1997:18:46:51 -0700] "GET /turkey/amnty1.gif HTTP/1.0" 200 3013
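
The constraint function is ordinary C, so it can be exercised on its own. The sketch below repeats it with assumed definitions of `http_v` and `method_t`, since the slide does not show them; the enum members other than LINK and UNLINK and the field types are guesses.

```c
#include <assert.h>

/* Assumed definitions; the slide does not show these types. */
typedef enum { GET, POST, LINK, UNLINK } method_t;
typedef struct { unsigned major; unsigned minor; } http_v;

/* Same logic as on the slide: version 1.1 accepts every method,
 * any other version rejects LINK and UNLINK. */
int checkVersion(http_v version, method_t meth)
{
    if ((version.major == 1) && (version.minor == 1)) return 1;
    if ((meth == LINK) || (meth == UNLINK)) return 0;
    return 1;
}

/* Usage: the cases the constraint distinguishes. */
void checkVersion_examples(void)
{
    http_v v10 = { 1, 0 };
    http_v v11 = { 1, 1 };
    assert(checkVersion(v10, GET) == 1);     /* sample record passes */
    assert(checkVersion(v10, LINK) == 0);    /* flagged as an error  */
    assert(checkVersion(v11, UNLINK) == 1);  /* accepted under 1.1   */
}
```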

11 PADSL example: arrays and unions
    Parray nIP {
      Puint8 [4] : Psep == '.';
    };
    Parray sIP {
      Pstring(:"[. ]":) [] : Psep == '.' && Pterm == ' ';
    };
    Punion host {
      nIP resolved;              /- 135.207.23.32
      sIP symbolic;              /- www.research.att.com
    };
    Punion auth_id {
      Pchar unauthorized : unauthorized == '-';
                                 /- non-authenticated http session
      Pstring(:' ':) id;         /- login supplied during authentication
    };
207.136.97.50 - - [15/Oct/1997:18:46:51 -0700] "GET /turkey/amnty1.gif HTTP/1.0" 200 3013
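
The host union can be read as an ordered choice: try the numeric form first and fall back to the symbolic one. The sketch below is a hand-written C analogue under that assumption, not the code PADS generates. All names are hypothetical, and the symbolic branch is simplified to "everything up to the next space".

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

typedef enum { HOST_RESOLVED, HOST_SYMBOLIC, HOST_ERROR } host_tag;

typedef struct {
    host_tag tag;
    unsigned char resolved[4];   /* valid when tag == HOST_RESOLVED */
    size_t symbolic_len;         /* valid when tag == HOST_SYMBOLIC */
} host_rep;

/* Analogue of nIP: four integers in 0..255 separated by '.'.
 * Returns the number of bytes consumed, or 0 on failure. */
size_t read_nIP(const char *s, unsigned char out[4])
{
    size_t pos = 0;
    int i;
    assert(s != NULL && out != NULL);
    for (i = 0; i < 4; i++) {
        unsigned v = 0;
        size_t digits = 0;
        if (i > 0) {
            if (s[pos] != '.') return 0;     /* separator expected */
            pos++;
        }
        while (digits < 3 && isdigit((unsigned char)s[pos])) {
            v = v * 10u + (unsigned)(s[pos] - '0');
            pos++;
            digits++;
        }
        if (digits == 0 || v > 255) return 0;
        out[i] = (unsigned char)v;
    }
    return pos;
}

/* Analogue of Punion host: first branch that reads cleanly wins. */
size_t read_host(const char *s, host_rep *h)
{
    size_t n;
    assert(s != NULL && h != NULL);
    n = read_nIP(s, h->resolved);
    if (n > 0) {
        h->tag = HOST_RESOLVED;
        return n;
    }
    while (s[n] != ' ' && s[n] != '\0')      /* simplified sIP branch */
        n++;
    h->tag = (n > 0) ? HOST_SYMBOLIC : HOST_ERROR;
    h->symbolic_len = n;
    return n;
}
```

The tag plays the role of the union discriminator, so an application can tell which branch matched without re-parsing the text.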

12 Generated type declarations
    typedef struct {
      host client;               /* Client requesting service */
      auth_id remoteID;          /* Remote identity */
      …
    } http_weblog;
    typedef struct {
      host_m client;
      auth_id_m remoteID;
      …
    } http_weblog_m;
    typedef struct {
      int nerr;
      int errCode;
      PDC_loc loc;
      int panic;
      host_ed client;
      auth_id_ed remoteID;
      …;
    } http_weblog_ed;

13 Sample use
    PDC_t *pdc;
    http_weblog entry;
    http_weblog_m mask;
    http_weblog_ed ed;
    PDC_open(&pdc, 0 /* PADS disc */, 0 /* PADS IO disc */);
    PDC_IO_fopen(pdc, fileName);
    ... call init functions ...
    http_weblog_mask(&mask, PCheck & PSet);
    while (!PDC_IO_at_EOF(pdc)) {
      http_weblog_read(pdc, &mask, &ed, &entry);
      if (ed.nerr != 0) {
        ... Error handling ...
      }
      ... Process/query entry ...
    };
    ... call cleanup functions ...
    PDC_IO_fclose(pdc);
    PDC_close(pdc);

14 Related work
ASN.1, ASDL
– Describe logical representation, generate physical.
DataScript [Back: CGSE 2002] & PacketTypes [McCann & Chandra: SIGCOMM 2000]
– Binary only
– Stop on first error

15 PADS to do
Allow library generation to be customized with application-specific information:
– Repair errors, ignore certain fields, customize in-memory representation, etc.
Explore declarative querying via integration with XQuery (joint work with Mary Fernandez and Ricardo Medel).
Support data translation
– Requires mapping from one in-memory representation to another.
Develop user base and integrate feedback.
– What would you want in such a tool?

16 Getting PADS
PADS will be available shortly for download with a non-commercial-use license.
http://www.research.att.com/projects/pads

17 PADS architecture
[Architecture diagram; labels: PADS Compiler, Application-specific customizations, PADS data description, C library, Hancock stream description, Query tool, …, PADS Library]

18 Technical challenges revisited
Data arrives “as is.”
– Format determined by data source, not consumers.
  PADS language allows consumers to describe data as it is.
– Often has little documentation.
  PADS description can serve as documentation for data source.
– Some percentage of data is “buggy.”
  Constraints allow consumers to express expectations about data.
  Generated code reports errors when constraints violated.
Often streams are high volume.
– Detect relevant errors (without necessarily halting program)
  Masks specify relevancy; returned descriptors characterize errors.
– Control how data is read
  Multiple entry-points allow different levels of granularity.

