From Dirt to Shovels: Fully Automatic Tool Generation from ASCII Data

David Walker, Pamela Dragosh, Mary Fernandez, Kathleen Fisher, Andrew Forrest, Bob Gruber, Yitzhak Mandelbaum, Peter White, Kenny Q. Zhu
Data, data, everywhere
AT&T and other information technology companies spend huge amounts of time and energy processing "ad hoc data."
Ad hoc data = data in non-standard formats with no a priori data processing tools/libraries available
– not free text; not HTML; not XML
Common problems: no documentation, evolving formats, huge volume, error-filled data...
Examples: web logs, network monitoring, billing info, router configs, call details
Data, data, everywhere
Web server common log format:

  [15/Oct/1997:18:46: ] "GET /tk/p.txt HTTP/1.0"
  tj62.aol.com - - [16/Oct/1997:14:32: ] "POST HTTP/1.0"
  [15/Oct/1997:18:53: ] "GET /tr/img/gift.gif HTTP/1.0"
  [15/Oct/1997:18:39: ] "GET /tr/img/wool.gif HTTP/1.0"
  [16/Oct/1997:12:59: ] "GET / HTTP/1.0"
  ekf - [17/Oct/1997:10:08: ] "GET /img/new.gif HTTP/1.0"
Data, data, everywhere
AT&T phone call provisioning data:

  | |1| | | | ||no_ii |EDTF_6|0|MARVINS1|UNO|10|
  | |1| | | | ||no_ii1522 2|EDTF_6|0|MARVINS1|UNO|10|
  |20| |17| |19| |27| |29| |IA0288| |IE0288| |EDTF_CRTE| |EDTF_OS_1| |16| |26|
Data, data, everywhere

  HA START OF TEST CYCLE
  aA BXYZ U1AB B
  HE START OF SUMMARY
  f NYZX B1QB B B
  HF END OF SUMMARY
  k LYXW B1KB G
  HB END OF TEST CYCLE
Data, data, everywhere
Gene Ontology data:

  format-version: 1.0
  date: 11:11: :24
  auto-generated-by: DAG-Edit rev 3
  default-namespace: gene_ontology
  subsetdef: goslim_goa "GOA and proteome slim"

  [Term]
  id: GO:
  name: mitochondrion inheritance
  namespace: biological_process
  def: "The distribution of mitochondria\, including the mitochondrial genome\, into daughter cells after mitosis or meiosis\, mediated by interactions between mitochondria and the cytoskeleton." [PMID: , PMID: , SGD:mcc]
  is_a: GO: ! organelle inheritance
  is_a: GO: ! mitochondrion distribution
Goal
We want to create the arrow from raw data (ASCII log files, billing info, call detail) to standard formats & schemas (XML, CSV), visual information, and end-user tools.
Half-way there: The PADS System 1.0 [FG PLDI 05, FMW POPL 06, MFWFG POPL 07]
An "ad hoc" data source is described once in a PADS data description. The PADS compiler turns the description into generated libraries (parsing, printing, traversal) over the PADS runtime system (I/O, error handling). Generic description-directed programs, coded once, then provide tools on top: an XML converter, a data profiler (analysis report), a graphing tool, a query engine, and custom apps.
PADS Language Overview
Data formats are described using a specialized language of types. A formal semantics gives meaning to descriptions in terms of both the external format and the internal data structures generated.
Rich base type library:
– integers: Pint8, Puint32, ...
– strings: Pstring('|'), Pstring_FW(3), ...
– systems data: Pdate, Ptime, Pip, ...
Type constructors describe complex data sources:
– sequences: Pstruct, Parray
– choices: Punion, Penum, Pswitch
– constraints: arbitrary predicates describe expected semantic properties
– parameterization: allows definition of generic descriptions
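The constructors above form a small algebra of types. As a way to picture it, the following Python sketch models a tiny fragment of such a description language as plain dataclasses; the class and field names mirror PADS constructor names, but this encoding is an illustration, not the PADS implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Pint:              # base type: an integer
    pass

@dataclass
class Pstring:           # base type: a string up to a terminator
    terminator: str

@dataclass
class Pstruct:           # a fixed sequence of fields
    fields: List[object]

@dataclass
class Punion:            # a choice among alternatives
    branches: List[object]

@dataclass
class Parray:            # a repeated element with a separator
    element: object
    separator: str

# A toy description for a file of "int,name" lines:
description = Parray(
    element=Pstruct(fields=[Pint(), Pstring(terminator=",")]),
    separator="\n",
)
```

Because descriptions are just values of a type algebra, tools like parsers, printers, and profilers can be written once by recursion over this structure.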
The Last Mile: The PADS System 2.0
The format inference engine runs chunking & tokenization, then structure discovery, then format refinement (guided by a scoring function) to turn raw data into a PADS data description. The PADS compiler then feeds the description to the profiler (analysis report) and the XMLifier (XML).
Chunking Process
Convert raw input into a sequence of "chunks." Supported divisions:
– various forms of "newline"
– file boundaries
Also possible: user-defined "paragraphs"
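The chunking step can be sketched in a few lines; the function name and interface here are illustrative, not the system's actual API, and only the newline-style boundary is shown.

```python
def chunk(raw: str, boundary: str = "newline"):
    """Split raw input into a sequence of chunks.

    Only the newline boundary is sketched; the real system also
    supports file boundaries and could support user-defined
    paragraphs."""
    if boundary == "newline":
        # splitlines() treats \n, \r\n, and \r uniformly as "newline".
        return raw.splitlines()
    raise ValueError(f"unsupported boundary: {boundary}")

chunks = chunk('"GET /tk/p.txt HTTP/1.0"\n"GET / HTTP/1.0"\n')
# one chunk per web-log record
```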
Tokenization
Tokens/base types are expressed as regular expressions.
Basic tokens: integer, white space, punctuation, strings
Distinctive tokens: IP addresses, dates, times, MAC addresses, ...
Histograms
Clustering
Group tokens into clusters with similar frequency distributions. Two frequency distributions are similar if they have the same shape (within some error tolerance) when the columns are sorted by height.
Rank clusters by a metric that rewards high coverage and narrower distributions. Choose the cluster with the highest score.
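The histogram computation and the shape comparison can be sketched as follows; the function names, and the use of a simple per-column tolerance as the "error tolerance", are assumptions for illustration.

```python
from collections import Counter

def freq_dist(token, chunks, tokenize):
    """Histogram for one token: count its occurrences in each chunk,
    then tally how many chunks produced each count."""
    return Counter(
        sum(1 for t, _ in tokenize(c) if t == token) for c in chunks
    )

def similar(h1, h2, tol=0.01):
    """Two distributions are similar if, with columns sorted by
    height, their normalized shapes agree within tol."""
    a = sorted(h1.values(), reverse=True)
    b = sorted(h2.values(), reverse=True)
    if len(a) != len(b):
        return False
    n1, n2 = sum(a), sum(b)
    return all(abs(x / n1 - y / n2) <= tol for x, y in zip(a, b))
```

Note that `similar` deliberately ignores *which* counts occurred and compares only the sorted column heights, matching the "same shape" criterion above.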
Partition chunks In our example, all the tokens appear in the same order in all chunks, so the union is degenerate.
Find subcontexts
Tokens in selected cluster: Quote(2), Comma, White
Then Recurse...
Inferred type
Structure Discovery Review
Compute the frequency distribution for each token.
Cluster tokens with similar frequency distributions.
Create a hypothesis about the data structure from the cluster distributions:
– Struct
– Array
– Union
– Basic type (bottom out)
Partition the data according to the hypothesis & recurse.
Once structure discovery is complete, later phases massage & rewrite the candidate description to create the final form.
Example chunks:
  "123, 24"
  "345, begin"
  "574, end"
  "9378, 56"
  "12, middle"
  "-12, problem"
  ...
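The hypothesis step can be caricatured as a three-way rule on per-chunk token counts. This is a toy version for illustration only: the real system scores competing hypotheses and handles mixed and noisy evidence.

```python
def hypothesize(counts):
    """counts[i] = occurrences of the cluster's tokens in chunk i.

    Toy hypothesis rule: once per chunk suggests a struct field,
    repetition within chunks suggests an array, and presence in only
    some chunks suggests a union of alternatives."""
    if all(c == 1 for c in counts):
        return "struct"  # exactly once in every chunk
    if all(c >= 2 for c in counts):
        return "array"   # repeated within each chunk
    return "union"       # varies across chunks
```

On the example chunks above, the comma token occurs exactly once per chunk, so it would be hypothesized as a struct separator.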
Testing and Evaluation
Evaluated overall results qualitatively:
– compared with Excel, a manual process with limited facilities for representing hierarchy or variation
– compared with hand-written descriptions; performance varies depending on tokenization choices & complexity
Evaluated accuracy quantitatively:
– for many formats, 95%+ accuracy from 5% of the available data
Evaluated performance quantitatively:
– hours to days to hand-write formats
– after fixing the format, inference appears to scale linearly with data size
– <1 min on 300K of data
Technical Summary
PADS 1.0 is an effective implementation framework for many data processing tasks.
PADS 2.0 improves programmer productivity further by automatically inferring formats & generating many tools & libraries.
End
Execution Time

Data source             | SD (s) | Ref (s) | Tot (s) | HW (h)
1967Transactions.short  |        |         |         |
MER_T01_01.csv          |        |         |         |
Ai                      |        |         |         |
Asl.log                 |        |         |         |
Boot.log                |        |         |         |
Crashreporter.log       |        |         |         |
Crashreporter.log.mod   |        |         |         |
Sirius                  |        |         |         |
Ls-l.txt                |        |         |         |
Netstat-an              |        |         |         |
Page_log                |        |         |         |
quarterlypersonalincome |        |         |         |
Railroad.txt            |        |         |         |
Scrollkeeper.log        |        |         |         |
Windowserver_last.log   |        |         |         |
Yum.txt                 |        |         |         |

SD: structure discovery; Ref: refinement; Tot: total; HW: hand-written
Training Time
Minimum Necessary Training Sizes

Data source             | 90% | 95%
Sirius                  |     |
Transaction.short       | 5   | 5
Ai                      |     |
Asl.log                 | 5   | 10
Scrollkeeper.log        | 5   | 5
Page_log                | 5   | 5
MER_T01_01.csv          | 5   | 5
Crashreporter.log       | 10  | 15
Crashreporter.log.mod   | 5   | 15
Windowserver_last.log   | 5   | 15
Netstat-an              | 25  | 35
Yum.txt                 | 30  | 45
quarterlypersonalincome | 10  |
Boot.log                | 45  | 60
Ls-l.txt                | 50  | 65
Railroad.txt            | 60  | 75
Problem: Tokenization
Technical problem:
– Different data sources assume different tokenization strategies.
– Useful token definitions sometimes overlap, can be ambiguous, and aren't always easily expressed using regular expressions.
– Matching the tokenization of the underlying data source can make a big difference in structure discovery.
Current solution:
– Parameterize the learning system with customizable configuration files.
– Automatically generate the lexer file & basic token types.
Future solutions:
– Use existing PADS descriptions and data sources to learn probabilistic tokenizers.
– Incorporate probabilities into a sophisticated back-end rewriting system: the back end has more context for making final decisions than the tokenizer, which reads one character at a time without lookahead.
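The overlap/ambiguity problem is easy to demonstrate: whether "10.0.0.1" lexes as a single IP token or as a run of integers and dots depends entirely on which pattern the lexer tries first. The patterns below are illustrative.

```python
import re

text = "10.0.0.1"
IP, INT, DOT = r"\d{1,3}(?:\.\d{1,3}){3}", r"\d+", r"\."

# Trying the distinctive token first yields a single IP token ...
ip_first = re.findall(f"{IP}|{INT}|{DOT}", text)
# ... but trying the basic tokens first shatters the same text into
# seven pieces: integers and dots.
int_first = re.findall(f"{INT}|{DOT}|{IP}", text)
```

Both tokenizations are locally defensible, which is why the slide argues for deferring the decision to a back end with more context.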