Published by Isabel Perkins. Modified over 9 years ago.
1
From Dirt to Shovels: Automatic Tool Generation for Ad Hoc Data
David Walker, Princeton University
with David Burke, Kathleen Fisher, Peter White & Kenny Q. Zhu
2
who am I? why am I here?
3
Our Common Communication Infrastructure
Much information is represented in standardized data formats:
  Web pages in HTML
  Pictures in JPEG
  Movies in MPEG
  “Universal” information format XML
  Standard relational database formats
A plethora of data processing tools:
  Visualizers (browsers display JPEG, HTML, ...)
  Query languages allow users to extract information (SQL, XQuery)
  Programmers get easy access through standard libraries
    ► Java XML libraries (JAXP)
  Many applications handle it natively and convert back and forth
    ► MS Word
4
Ad Hoc Data
Massive amounts of data are stored in XML, HTML or relational databases, but there’s even more data that isn’t.
An ad hoc data format is any nonstandard but structured data format for which convenient parsing, querying, visualizing, and transformation tools are not available (not natural language).
5
Ad Hoc Data from Web Server Logs (CLF)
207.136.97.49 - - [15/Oct/1997:18:46:51 -0700] "GET /tk/p.txt HTTP/1.0" 200 30
244.133.108.200 - - [16/Oct/1997:14:32:22 -0700] "POST /scpt/ddorg/confirm HTTP/1.0" 200 941
6
Ad Hoc Data from Crashreporter.log
Sat Jun 24 06:38:46 2006 crashdump[2164]: Started writing crash report to: /Logs/Crash/Exit/ pro.crash.log
Sun Jun 25 07:23:46 2006 crashreporterd[120]: mach_msg() reply failed: (ipc/send) invalid destination port
7
AT&T Phone Call Provisioning Data
9152272|9152272|1|2813640092|2813640092|2813640092|2813640092||no_ii152272|EDTF_6|0|MARVINS1|UNO|10|1000295291
9152272|9152272|1|2813640092|2813640092|2813640092|2813640092||no_ii15222|EDTF_6|0|MARVINS1|UNO|10|1000295291|20|1000295291|17|1001649600|19|1001649600|27|1001649600|29|1001649600|IA0288|1001714400|IE0288|1001714400|EDTF_CRTE|1001908800|EDTF_OS_1|1001995201|16|1021309814|26|1054589982
9152271|9152271|1|0|0|0|0||no_ii152271|EDTF_1|0|SC1MF1F|UNO|EDTF_CRTE|1001649600|EDTF_OS_10|1001649601
9152270|9152270|1|0|0|0|0||no_ii152270|EDTF_1|0|marshak1|UNO|EDTF_CRTE|1001563200|EDTF_OS_10|1001649601
8
Ad Hoc data from DNS Packets
00000000: 9192 d8fb 8480 0001 05d8 0000 0000 0872  ...............r
00000010: 6573 6561 7263 6803 6174 7403 636f 6d00  esearch.att.com.
00000020: 00fc 0001 c00c 0006 0001 0000 0e10 0027  ...............'
00000030: 036e 7331 c00c 0a68 6f73 746d 6173 7465  .ns1...hostmaste
00000040: 72c0 0c77 64e5 4900 000e 1000 0003 8400  r..wd.I.........
00000050: 36ee 8000 000e 10c0 0c00 0f00 0100 000e  6...............
00000060: 1000 0a00 0a05 6c69 6e75 78c0 0cc0 0c00  ......linux.....
00000070: 0f00 0100 000e 1000 0c00 0a07 6d61 696c  ............mail
00000080: 6d61 6ec0 0cc0 0c00 0100 0100 000e 1000  man.............
00000090: 0487 cf1a 16c0 0c00 0200 0100 000e 1000  ................
000000a0: 0603 6e73 30c0 0cc0 0c00 0200 0100 000e  ..ns0...........
000000b0: 1000 02c0 2e03 5f67 63c0 0c00 2100 0100  ......_gc...!...
000000c0: 0002 5800 1d00 0000 640c c404 7068 7973  ..X.....d...phys
000000d0: 0872 6573 6561 7263 6803 6174 7403 636f  .research.att.co
9
Ad Hoc data from www.investors.com
Date: 3/21/2005 1:00PM PACIFIC
Investor's Business Daily ® Stock List
Name: DAVE

Stock   Company                   Price   Price    Price      Volume     EPS     RS
Symbol  Name                              Change   % Change   % Change   Rating  Rating
AET     Aetna Inc                 73.68   -0.22    0%         31%        64      93
GE      General Electric Co       36.01   0.13     0%         -8%        59      56
HD      Home Depot Inc            37.99   -0.89    -2%        63%        84      38
IBM     Intl Business Machines    89.51   0.23     0%         -13%       66      35
INTC    Intel Corp                23.50   0.09     0%         -47%       39      33

Data provided by William O'Neil + Co., Inc. © 2005. All Rights Reserved. Investor's Business Daily is a registered trademark of Investor's Business Daily, Inc. Reproduction or redistribution other than for personal use is prohibited. All prices are delayed at least 20 minutes.
10
Ad Hoc data from www.geneontology.org !autogenerated-by: DAG-Edit version 1.419 rev 3 !saved-by: gocvs !date: Fri Mar 18 21:00:28 PST 2005 !version: $Revision: 3.223 $ !type: % is_a is a !type: < part_of part of !type: ^ inverse_of inverse of !type: | disjoint_from disjoint from $Gene_Ontology ; GO:0003673 <biological_process ; GO:0008150 %behavior ; GO:0007610 ; synonym:behaviour %adult behavior ; GO:0030534 ; synonym:adult behaviour %adult feeding behavior ; GO:0008343 ; synonym:adult feeding behaviour % feeding behavior ; GO:0007631 %adult locomotory behavior ; GO:0008344 ;...
11
The Challenge of Ad Hoc Data
Data arrives “as is.”
Documentation is often out-of-date or nonexistent.
Data is buggy.
  Missing data, “extra” data, …
  Human error, malfunctioning machines, software bugs (e.g. race conditions on log entries), …
  Errors are sometimes the most interesting portion of the data.
Data sources may be enormous.
  AT&T sources can generate up to 2GB/second.
There are no software libraries, manuals, or armies of consultants to help you.
12
Goal: An end-to-end, real-time data analysis, transformation and programming framework
[Figure: pipeline from Raw Data (email, ASCII log files, binary traces) to External Systems, in three stages.]
Data Entry: Create Format Description
  Description libraries
  Automatic inference
  Manual customization
  Visual support
Data Analysis
  database queries
  grep support
  google-style search
  binary viewer/editor
  anomaly detection
  statistical classification
  format-independent algorithms
  plug-and-play
Data Exit: Data Transformation
  export to XML, HTML, S, database, Excel
  language support for custom rewriting
  plug-and-play
13
The PADS System (version 1.0) [PLDI 05, POPL 06, POPL 07]
[Figure: system architecture. Components: “Ad Hoc” Data Source; PADS Data Description; PADS Compiler; Generated Libraries (Parsing, Printing, Traversal); PADS Runtime System (I/O, Error Handling); tools: XML Converter, Data Profiler, Graphing Tool, Query Engine, Custom App; outputs: Analysis Report, XML, Graph Information, ?. Legend: generic description-directed programs, coded once; written by hand.]
14
Trivial Example
Description:
  type payload = union {
    int32 i;
    stringFW(3) s2;
  };
  type source = struct {
    '\"'; payload p1; ","; payload p2; '\"';
  }
Data sources:
  “0, 24”
  “foo, 16”
  “bar, end”
Key points to know:
  Descriptions are based on programming language “types”
  Broad collection of “base types” (ints, strings, dates, IP addresses, ...)
  Structured types include “structs,” “unions” and “arrays”
  ... but PADS has many other features: dependency, constraints, recursion, ...
  PADS has formal semantics & proven properties
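To make the description concrete, here is a minimal Python sketch of a hand-written parser that mirrors it: the union tries its branches in order, and the struct reads its fields in sequence. This is not code generated by PADS; the function names are hypothetical, the int32 range is not checked, and skipping the blank after the comma is an assumption made so that the sample records parse.

```python
import re
from typing import Optional, Tuple, Union

Payload = Union[int, str]
INT_RE = re.compile(r"-?\d+")

def parse_payload(text: str, pos: int) -> Optional[Tuple[Payload, int]]:
    """union { int32 i; stringFW(3) s2; }: try each branch in order."""
    m = INT_RE.match(text, pos)
    if m:                                   # branch 1: an integer
        return int(m.group()), m.end()
    if pos + 3 <= len(text):                # branch 2: fixed-width string of 3 chars
        return text[pos:pos + 3], pos + 3
    return None

def parse_source(text: str) -> Optional[Tuple[Payload, Payload]]:
    """struct { '"'; payload p1; ","; payload p2; '"'; }"""
    if not text.startswith('"'):
        return None
    first = parse_payload(text, 1)
    if first is None:
        return None
    p1, pos = first
    if not text.startswith(",", pos):
        return None
    pos += 1
    # Assumption: the sample records carry a blank after the comma, so skip it.
    while pos < len(text) and text[pos] == " ":
        pos += 1
    second = parse_payload(text, pos)
    if second is None:
        return None
    p2, pos = second
    if text[pos:] != '"':                   # closing quote must end the record
        return None
    return p1, p2
```

Calling `parse_source` on each of the three sample records returns a pair of payload values; a record that does not fit the description returns `None`, which is where a real tool would record an error instead.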
15
The PADS System (version 2.0)
[Figure: Raw Data → Format Inference (Tokenization → Structure Discovery → Format Refinement, with a Scoring Function) → Data Description → PADS Compiler → Profiler and XMLifier → Analysis Report and XML.]
16
Structure Discovery: Overview
Top-down, divide-and-conquer algorithm:
  Compute various statistics from tokenized data
  Guess a top-level type constructor
  Partition tokenized data into smaller chunks
  Recursively analyze and compute types from smaller chunks
Example (tokenize):
  “0, 24”     →  “ INT, INT ”
  “foo, 16”   →  “ STR, INT ”
  “bar, end”  →  “ STR, STR ”
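The tokenize step in the example can be sketched in a few lines of Python. The token classes here (INT, STR, punctuation kept as itself, whitespace dropped) are an assumption chosen to reproduce the slide's output; the actual tokenization scheme used by the inference system is not given here.

```python
import re

# One alternative per token class; the first alternative that matches wins.
TOKEN_RE = re.compile(
    r"(?P<INT>-?\d+)|(?P<STR>[A-Za-z]+)|(?P<WS>\s+)|(?P<PUNCT>.)"
)

def tokenize(record):
    """Turn a raw record into a list of token names.

    Integers become "INT", alphabetic runs become "STR", whitespace is
    dropped, and any other character is kept as a literal token.
    """
    tokens = []
    for m in TOKEN_RE.finditer(record):
        kind = m.lastgroup          # name of the alternative that matched
        if kind == "WS":
            continue
        tokens.append(m.group() if kind == "PUNCT" else kind)
    return tokens
```

With these classes, the three records yield quote, INT or STR, comma, INT or STR, quote, matching the tokenized sources shown above.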
17
Structure Discovery: Overview
[Figure: the “discover” step applied to the tokenized sources “ INT, INT ”, “ STR, INT ”, “ STR, STR ”. Candidate structure so far: a struct made of quote, ?, comma, ?, quote, where each ? is a chunk still to be analyzed; the sources remaining for the ? chunks are INT and STR tokens.]
18
Structure Discovery: Overview
[Figure: a further “discover” step. In the candidate struct (quote, ?, comma, ?, quote), one ? chunk is refined into a union with an INT branch and a branch still to be analyzed (?); the remaining sources are STR, INT and STR tokens.]
19
Structure Discovery: Details
Compute a frequency-distribution histogram for each token (and recompute it at every level of recursion).
Example sources: “ INT, INT ” “ STR, INT ” “ STR, STR ”
[Figure: one histogram per token; x-axis: number of occurrences per source; y-axis: percentage of sources.]
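A minimal sketch of the histogram computation, assuming each source has already been tokenized into a list of token names. The function name and the dictionary layout are hypothetical; fractions are used where the slide's axis says percentage.

```python
from collections import Counter

def token_histograms(chunks):
    """Map each token to {occurrences per source: fraction of sources}.

    `chunks` is a list of token lists, one list per source.
    """
    n = len(chunks)
    per_chunk = [Counter(chunk) for chunk in chunks]
    tokens = set().union(*chunks) if chunks else set()
    hists = {}
    for tok in tokens:
        # Counter lookups return 0 for tokens absent from a source.
        counts = Counter(pc[tok] for pc in per_chunk)
        hists[tok] = {k: v / n for k, v in sorted(counts.items())}
    return hists
```

For the three example sources, the comma occurs exactly once in every source and the quote exactly twice, while INT and STR each occur zero, one, or two times in a third of the sources.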
20
Structure Discovery: Details
Cluster tokens into groups with similar histograms
  Similar histograms
    ► strong evidence that tokens coexist in the same description component
    ► use symmetric relative entropy to measure similarity
  Only the “shape” of the histogram matters
    ► normalize histograms by sorting columns in descending size
    ► result: comma & quote grouped together
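The similarity measure can be sketched as follows. Two points are assumptions rather than details from the talk: all columns are sorted (no column is treated specially), and the relative entropy is symmetrized by comparing each histogram against the average of the two, which keeps every logarithm finite.

```python
import math

def normalize(hist):
    """Keep only the shape: column heights sorted in descending order."""
    return sorted(hist.values(), reverse=True)

def relative_entropy(p, q):
    """Relative entropy of p with respect to q, in bits; zero-height columns of p contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def symmetric_relative_entropy(h1, h2):
    """Average of the relative entropies of each histogram to their mean."""
    p, q = normalize(h1), normalize(h2)
    width = max(len(p), len(q))
    p += [0.0] * (width - len(p))       # pad the narrower histogram
    q += [0.0] * (width - len(q))
    avg = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * relative_entropy(p, avg) + 0.5 * relative_entropy(q, avg)
```

A comma that appears once in every source and a quote that appears twice in every source have the same shape after normalization, so their distance is zero and they land in one group; INT, whose counts are spread over several columns, is farther from both.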
21
Structure Discovery: Details
Find the most promising token group to divide and conquer:
  Structs == groups with high coverage & low “residual mass”
  Arrays == groups with high coverage, sufficient width & high “residual mass”
  Unions == other token groups
The struct involving comma and quote is identified in the histogram above.
The overall procedure gives a good starting point for the rewriting system.
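The three rules above can be turned into a small decision function. The definitions used here are assumptions, not the talk's: coverage is the fraction of sources containing the token at least once, residual mass is the fraction of sources outside the most common nonzero count, and width is the number of distinct nonzero counts. The thresholds are hypothetical.

```python
def classify(hist, min_coverage=0.9, max_residual=0.1, min_width=3):
    """Guess a type constructor from one token's histogram.

    `hist` maps occurrences-per-source to the fraction of sources.
    All thresholds are illustrative, not taken from the PADS system.
    """
    nonzero = {k: v for k, v in hist.items() if k > 0 and v > 0}
    coverage = sum(nonzero.values())
    if not nonzero or coverage < min_coverage:
        return "union"                       # token is missing from many sources
    residual = 1.0 - max(nonzero.values())   # mass outside the tallest column
    if residual <= max_residual:
        return "struct"                      # same count almost everywhere
    if len(nonzero) >= min_width:
        return "array"                       # present everywhere, count varies
    return "union"
```

Under these assumptions, a comma seen once in every source is classified as a struct separator, a delimiter whose count varies widely from source to source suggests an array, and a token such as INT that is absent from a third of the sources falls through to a union.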
22
Format Refinement
Reanalyze example data with the aid of the rough description
Rewrite the format description to:
  simplify presentation
    ► merge & rewrite structures
  improve precision
    ► reorganize description structure
    ► add constraints (sortedness, uniqueness, linear relations, functional dependencies)
  fill in missing details
    ► find completions where structure discovery bottoms out
    ► refine base types (termination conditions for strings, integer sizes)
23
Format Refinement
Three main sub-phases
  Phase 1: Tagging/Table generation
    ► Convert rough description into tagged description + relational table
  Phase 2: Constraint inference
    ► Analyze table and infer constraints
    ► Use TANE algorithm [Huhtala et al. 99]
  Phase 3: Format rewriting
    ► Use inferred constraints & type isomorphisms to rewrite rough description
    ► Greedy search to optimize information-theoretic score
24
Refinement: Simple Example
25
“0, 24” “foo, beg” “bar, end” “0, 56” “baz, middle” “0, 12” “0, 33” …
26
“0, 24” “foo, beg” “bar, end” “0, 56” “baz, middle” “0, 12” “0, 33” …
structure discovery:
  struct { '"'; union { int; alpha }; ","; union { int; alpha }; '"' }
27
“0, 24” “foo, beg” “bar, end” “0, 56” “baz, middle” “0, 12” “0, 33” …
structure discovery:
  struct { '"'; union { int; alpha }; ","; union { int; alpha }; '"' }
tagging/table gen:
  struct { '"'; union (id1) { int (id3); alpha (id4) }; ","; union (id2) { int (id5); alpha (id6) }; '"' }
  id1  id2  id3  id4  id5  id6
  1    1    0    --   24   --
  2    2    --   foo  --   beg
  ...
28
“0, 24” “foo, beg” “bar, end” “0, 56” “baz, middle” “0, 12” “0, 33” …
constraint inference (on the tagged description and table from the previous slide):
  id3 = 0
  id1 = id2 (first union is “int” whenever second union is “int”)
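The two constraints on this slide (id3 = 0 and id1 = id2) can be recovered from a table by brute force, as the sketch below shows. This is a deliberately naive stand-in, not the TANE algorithm named earlier in the talk. The function names are hypothetical, and the table layout (one dict per record, `None` for a field in an untaken union branch) is an assumption.

```python
from itertools import combinations, permutations

def constant_columns(rows):
    """Columns whose present (non-None) values are all the same."""
    consts = {}
    for col in rows[0]:
        values = {r[col] for r in rows if r[col] is not None}
        if len(values) == 1:
            consts[col] = values.pop()
    return consts

def equal_columns(rows):
    """Pairs of columns that hold equal values in every row."""
    return [(a, b) for a, b in combinations(rows[0], 2)
            if all(r[a] == r[b] for r in rows)]

def functional_dependencies(rows):
    """Pairs (a, b) such that the value in column a determines column b."""
    fds = []
    for a, b in permutations(rows[0], 2):
        seen = {}
        # setdefault stores the first b seen for each a; later rows must agree.
        if all(seen.setdefault(r[a], r[b]) == r[b] for r in rows):
            fds.append((a, b))
    return fds
```

Checking every pair of columns is quadratic in the number of columns and only finds single-column dependencies, which is acceptable for a six-column toy table but is exactly the cost a dedicated dependency-discovery algorithm is designed to avoid.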
29
“0, 24” “foo, beg” “bar, end” “0, 56” “baz, middle” “0, 12” “0, 33” …
rule-based structure rewriting (using the inferred constraints id3 = 0 and id1 = id2):
  struct { '"'; union { struct { 0; ","; int }; struct { str; ","; str } }; '"' }
more accurate:
  -- first int = 0
  -- rules out “int, alpha-string” records
30
Biggest Weakness
The degree of success often hinges on the inference system having a tokenization scheme that matches the tokenization scheme of the data source.
Good tokens capture high-level, human abstractions compactly.
Open questions:
  Techniques for learning tokenizations from data directly?
  Techniques for using multiple, ambiguous tokenization schemes simultaneously?
31
Related Work
Most common domains for grammar inference:
  XML/HTML
  natural language
Systems that focus on ad hoc data are rare, and the few that exist don’t support the PADS tool suite:
  Rufus system ’93, TSIMMIS ’94, Potter’s Wheel ’01
Top-down structure discovery
  Arasu & Garcia-Molina ’03 (extracting data from web pages)
Grammar induction using MDL & grammar rewriting search
  Stolcke and Omohundro ’94, “Inducing probabilistic grammars...”
  T. W. Hong ’02, Ph.D. thesis on information extraction from web pages
  Higuera ’01, “Current trends in grammar induction”
32
Conclusions
Still a work in progress, but we are able to produce XML and statistical reports fully automatically from ad hoc data sources.
We’ve tested on approximately 15 real, mostly systemy data sources (web logs, crash reports, AT&T phone call data, etc.) with what we believe is relatively good success.
For papers & software, see our website at: http://www.padsproj.org/
Contact: dpw@cs.princeton.edu
33
End