From Dirt to Shovels: Fully Automatic Tool Generation from ASCII Data David Walker Pamela Dragosh Mary Fernandez Kathleen Fisher Andrew Forrest Bob Gruber.

Presentation transcript:

From Dirt to Shovels: Fully Automatic Tool Generation from ASCII Data
David Walker, Pamela Dragosh, Mary Fernandez, Kathleen Fisher, Andrew Forrest, Bob Gruber, Yitzhak Mandelbaum, Peter White, Kenny Q. Zhu

Data, data, everywhere
AT&T and other information technology companies spend huge amounts of time and energy processing “ad hoc data.”
Ad hoc data = data in non-standard formats with no a priori data processing tools or libraries available
– not free text; not HTML; not XML
Common problems: no documentation, evolving formats, huge volume, error-filled data
Examples: web logs, network monitoring, billing info, router configs, call details

Data, data, everywhere
Web server common log format:
[15/Oct/1997:18:46: ] "GET /tk/p.txt HTTP/1.0"
tj62.aol.com - - [16/Oct/1997:14:32: ] "POST HTTP/1.0"
[15/Oct/1997:18:53: ] "GET /tr/img/gift.gif HTTP/1.0"
[15/Oct/1997:18:39: ] "GET /tr/img/wool.gif HTTP/1.0"
[16/Oct/1997:12:59: ] "GET / HTTP/1.0"
ekf - [17/Oct/1997:10:08: ] "GET /img/new.gif HTTP/1.0"

Data, data, everywhere AT&T phone call provisioning data | |1| | | | ||no_ii |EDTF_6|0|MARVINS1|UNO|10| | |1| | | | ||no_ii1522 2|EDTF_6|0|MARVINS1|UNO|10| |20| |17| |19| |27| |29| |IA0288| |IE0288| |E DTF_CRTE| |EDTF_OS_1| |16| |26|

Data, data, everywhere HA START OF TEST CYCLE aA BXYZ U1AB B HE START OF SUMMARY f NYZX B1QB B B HF END OF SUMMARY k LYXW B1KB G HB END OF TEST CYCLE

Data, data, everywhere
format-version: 1.0
date: 11:11: :24
auto-generated-by: DAG-Edit rev 3
default-namespace: gene_ontology
subsetdef: goslim_goa "GOA and proteome slim"
[Term]
id: GO:
name: mitochondrion inheritance
namespace: biological_process
def: "The distribution of mitochondria\, including the mitochondrial genome\, into daughter cells after mitosis or meiosis\, mediated by interactions between mitochondria and the cytoskeleton." [PMID: , PMID: , SGD:mcc]
is_a: GO: ! organelle inheritance
is_a: GO: ! mitochondrion distribution

Goal
Raw data: ASCII log files, billing info, call detail
Standard formats & schema: XML, CSV
End-user tools: visual information
We want to create this arrow (from raw data to standard formats and end-user tools).

Half-way there: The PADS System 1.0 [FG PLDI 05, FMW POPL 06, MFWFG POPL 07]
[Architecture diagram]
– Inputs: “ad hoc” data source; PADS data description
– PADS compiler → generated libraries (parsing, printing, traversal)
– PADS runtime system (I/O, error handling)
– Generic description-directed programs, coded once: XML converter, data profiler, graphing tool, query engine
– Custom app
– Outputs: XML, analysis report, graph, information, ?

PADS Language Overview
Rich base type library:
– integers: Pint8, Puint32, …
– strings: Pstring('|'), Pstring_FW(3), ...
– systems data: Pdate, Ptime, Pip, …
Type constructors describe complex data sources:
– sequences: Pstruct, Parray
– choices: Punion, Penum, Pswitch
– constraints: arbitrary predicates describe expected semantic properties
– parameterization: allows definition of generic descriptions
Data formats are described using a specialized language of types.
A formal semantics gives meaning to descriptions in terms of both the external format and the internal data structures generated.

The Last Mile: The PADS System 2.0
[Pipeline diagram]
– Raw data
– Format inference engine: chunking & tokenization, structure discovery, format refinement, scoring function
– PADS data description
– PADS compiler
– Generated tools: profiler, XMLifier
– Outputs: analysis report, XML
Covered next: chunking & tokenization, structure discovery

Chunking Process
Convert raw input into a sequence of “chunks.”
Supported divisions:
– Various forms of “newline”
– File boundaries
Also possible: user-defined “paragraphs”
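A minimal Python sketch of newline-based chunking, to make the step concrete. The function name `chunk_by_newline` and the choice to drop empty chunks are assumptions for illustration, not part of the PADS implementation; chunking on file boundaries or user-defined paragraphs is not shown.

```python
import re


def chunk_by_newline(raw: str) -> list[str]:
    """Split raw input into chunks, one per line.

    Accepts the three common newline forms: "\r\n", "\n", and "\r".
    Assumption: empty chunks (blank lines, trailing newline) are dropped.
    """
    # "\r\n" is listed first so that it is consumed as a single separator.
    pieces = re.split(r"\r\n|\n|\r", raw)
    return [piece for piece in pieces if piece != ""]
```

Each returned string is then handed to the tokenizer as one chunk.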

Tokenization
Tokens/base types are expressed as regular expressions.
Basic tokens: integer, white space, punctuation, strings
Distinctive tokens: IP addresses, dates, times, MAC addresses, ...
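As an illustration of tokens expressed as regular expressions, here is a small Python sketch. The token names and patterns in `TOKEN_SPECS` are hypothetical and far simpler than a real token library; the only point is that distinctive tokens (IP address, time) are tried before the basic ones they overlap with.

```python
import re

# Hypothetical token table. Order matters: distinctive tokens come first
# so that "10.0.0.1" is one IP token rather than integers and dots.
TOKEN_SPECS = [
    ("IP", r"\d{1,3}(?:\.\d{1,3}){3}"),
    ("TIME", r"\d{2}:\d{2}:\d{2}"),
    ("INT", r"-?\d+"),
    ("STRING", r"[A-Za-z][A-Za-z0-9_]*"),
    ("WHITE", r"[ \t]+"),
    ("PUNCT", r"[^\sA-Za-z0-9]"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPECS))


def tokenize(chunk):
    """Return a list of (token_kind, text) pairs covering the whole chunk."""
    tokens = []
    pos = 0
    while pos < len(chunk):
        match = TOKEN_RE.match(chunk, pos)
        if match is None:
            # Only reachable for characters no pattern covers (e.g. "\n").
            raise ValueError(f"no token matches at position {pos}")
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens
```

For example, `tokenize("10.0.0.1 - 42")` yields an IP token, white space, a punctuation token for the dash, white space, and an integer.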

Histograms

Clustering
Two frequency distributions are similar if they have the same shape (within some error tolerance) when the columns are sorted by height.
Group tokens with similar frequency distributions into clusters (Cluster 1, Cluster 2, Cluster 3 in the figure).
Rank clusters by a metric that rewards high coverage and narrower distributions. Choose the cluster with the highest score.
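The histogram and clustering steps can be sketched in Python as follows. This is a simplified illustration under stated assumptions: a chunk is a list of token-kind names, the similarity test compares normalized column heights one by one against a fixed `tolerance`, and clusters are formed greedily. The slides do not specify the actual similarity measure, and the scoring metric for ranking clusters is not shown.

```python
from collections import Counter


def token_histogram(chunks, kind):
    """Map 'occurrences per chunk' -> number of chunks with that count."""
    return Counter(chunk.count(kind) for chunk in chunks)


def shape(hist):
    """Column heights as fractions of all chunks, tallest first."""
    total = sum(hist.values())
    return sorted((height / total for height in hist.values()), reverse=True)


def similar(hist_a, hist_b, tolerance=0.1):
    """Assumed similarity test: sorted columns differ by at most `tolerance`."""
    a, b = shape(hist_a), shape(hist_b)
    width = max(len(a), len(b))
    a += [0.0] * (width - len(a))  # pad the shorter shape with empty columns
    b += [0.0] * (width - len(b))
    return all(abs(x - y) <= tolerance for x, y in zip(a, b))


def cluster(histograms, tolerance=0.1):
    """Greedily group token kinds whose histograms have similar shapes."""
    clusters = []
    for kind, hist in histograms.items():
        for group in clusters:
            if similar(hist, histograms[group[0]], tolerance):
                group.append(kind)
                break
        else:
            clusters.append([kind])
    return clusters
```

With two chunks tokenized as quote, integer, comma, white space, integer-or-string, quote, the quote, comma, and white-space tokens land in one cluster because each occurs a fixed number of times in every chunk.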

Partition chunks
In our example, all the tokens appear in the same order in all chunks, so the union is degenerate.

Find subcontexts
Tokens in selected cluster: Quote(2), Comma, White

Then Recurse...

Inferred type

Structure Discovery Review
Compute frequency distribution for each token.
Cluster tokens with similar frequency distributions.
Create hypothesis about data structure from cluster distributions
– Struct
– Array
– Union
– Basic type (bottom out)
Partition data according to hypothesis & recurse.
Once structure discovery is complete, later phases massage & rewrite the candidate description to create the final form.
Example input chunks: “123, 24” “345, begin” “574, end” “9378, 56” “12, middle” “-12, problem” …
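The recursion reviewed above can be sketched in a few lines of Python. This is a deliberately reduced illustration, not the PADS algorithm: chunks are lists of token-kind names, the struct hypothesis is taken whenever some tokens occur the same number of times in every chunk and in the same order, the union hypothesis partitions chunks by their exact token sequence, and the array hypothesis and cluster scoring are omitted. The names `infer` and `split_at` are hypothetical.

```python
def split_at(chunk, fixed):
    """Split a chunk into the subcontexts between its fixed tokens."""
    contexts, current = [], []
    for kind in chunk:
        if kind in fixed:
            contexts.append(current)
            current = []
        else:
            current.append(kind)
    contexts.append(current)
    return contexts


def infer(chunks):
    """Infer a nested description from a list of token-kind lists."""
    if all(len(chunk) == 0 for chunk in chunks):
        return "Empty"
    if all(len(chunk) == 1 for chunk in chunks):  # basic type: bottom out
        kinds = sorted({chunk[0] for chunk in chunks})
        return kinds[0] if len(kinds) == 1 else ("union", kinds)

    # Struct hypothesis: tokens with the same count in every chunk.
    all_kinds = {kind for chunk in chunks for kind in chunk}
    fixed = {k for k in all_kinds if len({c.count(k) for c in chunks}) == 1}
    skeletons = {tuple(k for k in chunk if k in fixed) for chunk in chunks}
    if fixed and len(skeletons) == 1:
        skeleton = next(iter(skeletons))
        per_chunk = [split_at(chunk, fixed) for chunk in chunks]
        fields = []
        for i in range(len(skeleton) + 1):
            sub = infer([contexts[i] for contexts in per_chunk])
            if sub != "Empty":
                fields.append(sub)
            if i < len(skeleton):
                fields.append(skeleton[i])
        return ("struct", fields)

    # Union hypothesis (assumed partition: by exact token sequence).
    groups = {}
    for chunk in chunks:
        groups.setdefault(tuple(chunk), []).append(chunk)
    return ("union", [infer(group) for group in groups.values()])
```

On chunks shaped like “123, 24” and “345, begin”, this sketch returns a struct of quote, integer, comma, white space, a union of integer and string, and a closing quote.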

Testing and Evaluation
Evaluated overall results qualitatively
– Compared with Excel: a manual process with limited facilities for representing hierarchy or variation
– Compared with hand-written descriptions: performance varies depending on tokenization choices & complexity
Evaluated accuracy quantitatively
– For many formats: 95%+ accuracy from 5% of available data
Evaluated performance quantitatively
– Hours to days to hand-write formats
– After fixing the format, appears to scale linearly with data size
– <1 min on 300K data

Technical Summary
PADS 1.0 is an effective implementation framework for many data processing tasks.
PADS 2.0 improves programmer productivity further by automatically inferring formats and generating many tools & libraries.
[Diagram: ASCII log files, binary traces → struct { } → XML, CSV]

End

Execution Time
Columns: Data source, SD (s), Ref (s), Tot (s), HW (h)
SD: structure discovery; Ref: refinement; Tot: total; HW: hand-written
Data sources: 1967Transactions.short, MER_T01_01.cvs, Ai, Asl.log, Boot.log, Crashreporter.log, Crashreporter.log.mod, Sirius, Ls-l.txt, Netstat-an, Page_log, quarterlypersonalincome, Railroad.txt, Scrollkeeper.log, Windowserver_last.log, Yum.txt

Training Time

Minimum Necessary Training Sizes
Percentage of the data needed for training to reach 90% and 95% accuracy.
Data source               90%   95%
Sirius
Transaction.short           5     5
Ai
Asl.log                     5    10
Scrollkeeper.log            5     5
Page_log                    5     5
MER_T01_01.csv              5     5
Crashreporter.log          10    15
Crashreporter.log.mod       5    15
Windowserver_last.log       5    15
Netstat-an                 25    35
Yum.txt                    30    45
quarterlypersonalincome    10
Boot.log                   45    60
Ls-l.txt                   50    65
Railroad.txt               60    75

Problem: Tokenization
Technical problem:
– Different data sources assume different tokenization strategies
– Useful token definitions sometimes overlap, can be ambiguous, and aren't always easily expressed using regular expressions
– Matching the tokenization of the underlying data source can make a big difference in structure discovery
Current solution:
– Parameterize the learning system with customizable configuration files
– Automatically generate lexer file & basic token types
Future solutions:
– Use existing PADS descriptions and data sources to learn probabilistic tokenizers
– Incorporate probabilities into a sophisticated back-end rewriting system
– The back end has more context for making final decisions than the tokenizer, which reads 1 character at a time without lookahead
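To make the overlap problem concrete, the following Python sketch tokenizes the same text under two priority orders. The token names and patterns are hypothetical; the point is only that when definitions overlap, the chosen priority changes the token sequence that structure discovery sees.

```python
import re

# Hypothetical, overlapping token definitions.
PATTERNS = {
    "IP": r"\d{1,3}(?:\.\d{1,3}){3}",
    "FLOAT": r"\d+\.\d+",
    "INT": r"\d+",
    "DOT": r"\.",
}


def tokenize_with_priority(text, priority):
    """Tokenize `text`, trying token kinds in the given priority order.

    Characters that no pattern matches are skipped by finditer.
    """
    regex = re.compile(
        "|".join(f"(?P<{name}>{PATTERNS[name]})" for name in priority)
    )
    return [(m.lastgroup, m.group()) for m in regex.finditer(text)]


ip_first = tokenize_with_priority("10.0.0.1", ["IP", "FLOAT", "INT", "DOT"])
float_first = tokenize_with_priority("10.0.0.1", ["FLOAT", "IP", "INT", "DOT"])
```

Here `ip_first` is a single IP token, while `float_first` is a float, a dot, and another float, which would lead structure discovery to a different description of the same data.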