From Dirt to Shovels: Automatic Tool Generation for Ad Hoc Data
David Walker, Princeton University
with David Burke, Kathleen Fisher, Peter White & Kenny Q. Zhu
who am I? why am I here?
Our Common Communication Infrastructure
Much information is represented in standardized data formats:
  Web pages in HTML; pictures in JPEG; movies in MPEG
  XML, the "universal" information format
  Standard relational database formats
A plethora of data processing tools:
  Visualizers (browsers display HTML, JPEG, ...)
  Query languages let users extract information (SQL, XQuery)
  Programmers get easy access through standard libraries
    ► Java XML libraries --- JAXP
  Many applications handle these formats natively and convert back and forth
    ► MS Word
Ad Hoc Data
Massive amounts of data are stored in XML, HTML or relational databases, but there is even more data that isn't.
An ad hoc data format is any nonstandard but structured data format for which convenient parsing, querying, visualizing and transformation tools are not available. (Not natural language.)
Ad Hoc Data from Web Server Logs (CLF)
207.136.97.49 - - [15/Oct/1997:18:46:51 -0700] "GET /tk/p.txt HTTP/1.0" 200 30
tj62.aol.com - - [16/Oct/1997:14:32:22 -0700] "POST /scpt/ddorg/confirm HTTP/1.0" 200 941
Ad Hoc Data from Crashreporter.log
Sat Jun 24 06:38: crashdump[2164]: Started writing crash report to: /Logs/Crash/Exit/ pro.crash.log
Sun Jun 25 07:23: crashreporterd[120]: mach_msg() reply failed: (ipc/send) invalid destination port
AT&T Phone Call Provisioning Data
| |1| | | | ||no_ii |EDTF_6|0|MARVINS1|UNO|10| | |1| | | | ||no_ii15222| EDTF_6|0|MARVINS1|UNO|10| |20| |17| |19| |27| |29| |IA0288| |IE0288| |EDTF_CRTE| |EDTF_OS_1| |16| |26| |
|1|0|0|0|0||no_ii152271|EDTF_1|0|SC1MF1F|UNO|EDTF_CRTE| |EDTF_OS_10| |
|1|0|0|0|0||no_ii152270|EDTF_1|0|marshak1|UNO|EDTF_CRTE| |EDTF_OS_10|
Ad Hoc Data from DNS Packets
[hex dump of a DNS response packet; the ASCII column shows name fragments such as research.att.com, ns1, hostmaster, linux, mail, _gc and phys]
Ad Hoc Data from a Stock Quote Service
Date: 3/21/2005 1:00PM PACIFIC -- Investor's Business Daily (R) Stock List, Name: DAVE
[table with columns: Stock Symbol, Company Name, Price, Price Change %, Volume Change %, EPS Rating, RS Rating; rows for AET (Aetna Inc), GE (General Electric Co), HD (Home Depot Inc), IBM (Intl Business Machines), INTC (Intel Corp)]
Data provided by William O'Neil + Co., Inc. (c) All Rights Reserved. Investor's Business Daily is a registered trademark of Investor's Business Daily, Inc. Reproduction or redistribution other than for personal use is prohibited. All prices are delayed at least 20 minutes.
Ad Hoc Data from the Gene Ontology
!autogenerated-by: DAG-Edit version rev 3
!saved-by: gocvs
!date: Fri Mar 18 21:00:28 PST 2005
!version: $Revision: $
!type: % is_a is a
!type: < part_of part of
!type: ^ inverse_of inverse of
!type: | disjoint_from disjoint from
$Gene_Ontology ; GO:
 <biological_process ; GO:
  %behavior ; GO: ; synonym:behaviour
   %adult behavior ; GO: ; synonym:adult behaviour
    %adult feeding behavior ; GO: ; synonym:adult feeding behaviour % feeding behavior ; GO:
    %adult locomotory behavior ; GO: ; ...
The Challenge of Ad Hoc Data
Data arrives "as is."
Documentation is often out-of-date or nonexistent.
Data is buggy:
  Missing data, "extra" data, ...
  Human error, malfunctioning machines, software bugs (e.g., race conditions on log entries), ...
  Errors are sometimes the most interesting portion of the data.
Data sources may be enormous:
  AT&T sources can generate up to 2GB/second.
There are no software libraries, manuals, or armies of consultants to help you...
Goal: an end-to-end, real-time data analysis, transformation and programming framework
Raw data (ASCII log files, binary traces) flows through three stages:
Data Entry -- create a format description:
  description libraries, automatic inference, manual customization, visual support
Data Analysis:
  database queries, grep support, google-style search, binary viewer/editor,
  anomaly detection, statistical classification, format-independent algorithms
Data Exit -- data transformation to external systems:
  plug-and-play export to XML, HTML, S, database, Excel;
  language support for custom rewriting
The PADS System (version 1.0) [pldi 05, popl 06, popl 07]
A PADS data description of an "ad hoc" data source -- written by hand -- is fed to the PADS compiler, which produces generated libraries (parsing, printing, traversal) on top of the PADS runtime system (I/O, error handling).
Generic, description-directed programs -- coded once -- link against these libraries: an XML converter, a data profiler, a graphing tool, a query engine, or a custom app, producing analysis reports, XML, graphs, ...
Trivial Example
Data sources:   "0, 24"   "foo, 16"   "bar, end"
Description:
    type payload = union {
        int32       i;
        stringFW(3) s2;
    };
    type source = struct {
        '\"'; payload p1; ","; payload p2; '\"';
    };
Key points to know:
  Descriptions are based on programming language "types"
  Broad collection of "base types" (ints, strings, dates, IP addresses, ...)
  Structured types include "structs," "unions" and "arrays"
  ... but the language has many other features: dependency, constraints, recursion, ...
  It has a formal semantics & proven properties
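To make the description concrete, here is a minimal Python sketch of the parser this description denotes: structs match their fields in sequence, and unions try branches in order, falling back on failure. This is not the generated PADS library; all helper names are invented for illustration.

    # Minimal sketch of description-directed parsing for the example above.
    # NOT the generated PADS code; helper names are invented.

    def expect(s, pos, lit):
        """Match a literal field of a struct."""
        if not s.startswith(lit, pos):
            raise ValueError(f"expected {lit!r} at offset {pos}")
        return pos + len(lit)

    def parse_payload(s, pos):
        """union { int32 i; stringFW(3) s2; }: try branches in order."""
        end = pos
        while end < len(s) and s[end].isdigit():
            end += 1
        if end > pos:                      # int32 branch matched
            return ("i", int(s[pos:end])), end
        if pos + 3 <= len(s):              # fall back to a 3-char string
            return ("s2", s[pos:pos + 3]), pos + 3
        raise ValueError(f"no payload branch matches at offset {pos}")

    def parse_source(line):
        """struct { '\"'; payload p1; ','; payload p2; '\"'; }"""
        pos = expect(line, 0, '"')
        p1, pos = parse_payload(line, pos)
        pos = expect(line, pos, ", ")      # the sample records use ", "
        p2, pos = parse_payload(line, pos)
        expect(line, pos, '"')
        return {"p1": p1, "p2": p2}

    for rec in ['"0, 24"', '"foo, 16"', '"bar, end"']:
        print(parse_source(rec))

Running it on the three sample records yields p1/p2 pairs tagged with the union branch that matched, e.g. ("i", 24) versus ("s2", "end").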
The PADS System (version 2.0)
Format inference replaces the hand-written description. Raw data is tokenized; structure discovery produces a rough description; and format refinement -- guided by a scoring function -- rewrites it into the final PADS data description. The PADS compiler, profiler and XMLifier then proceed as before, producing analysis reports and XML.
Structure Discovery: Overview
Top-down, divide-and-conquer algorithm (sketched in code below):
  Compute various statistics from tokenized data
  Guess a top-level type constructor
  Partition tokenized data into smaller chunks
  Recursively analyze and compute types from smaller chunks
Example:
  tokenize the sources "0, 24", "foo, 16", "bar, end" into
      " INT , INT "    " STR , INT "    " STR , STR "
  discover a top-level struct with literal quote and comma fields,
      struct { '"'; ?; ","; ?; '"' }
  then recurse on the chunks filling each hole: the first ? becomes
      union { INT; STR }
  and likewise for the second.
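In code, the recursive loop looks roughly like the following Python sketch. It is a toy: it branches on the first token class only, whereas the real system chooses constructors from token histograms and clustering, as the next slides explain. The tokenizer and all names are invented.

    # Toy version of the top-down discovery loop (illustration only).
    import re

    def tokens(line):
        """Crude tokenizer: integers, alphabetic words, punctuation."""
        return [("INT" if t.isdigit() else "STR" if t.isalpha() else t)
                for t in re.findall(r"\d+|[A-Za-z]+|\S", line)]

    def discover(chunks):
        """chunks: token sequences that fill the same hole in the format."""
        if not any(chunks):                      # every chunk is exhausted
            return "empty"
        heads = {c[0] for c in chunks if c}
        if len(heads) == 1 and all(chunks):
            # all chunks start alike: peel off a struct field, recurse on the rest
            return ("struct", heads.pop(), discover([c[1:] for c in chunks]))
        # chunks disagree: guess a union, partition by first token, recurse
        groups = {}
        for c in chunks:
            groups.setdefault(c[0] if c else None, []).append(c)
        return ("union", [discover(g) for g in groups.values()])

    lines = ['"0, 24"', '"foo, 16"', '"bar, end"']
    print(discover([tokens(l) for l in lines]))

On the running example this prints a struct peeling off the quote, then a union splitting the INT records from the STR records, mirroring the pictures above.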
Structure Discovery: Details
Compute a frequency distribution histogram for each token, and recompute it at every level of the recursion.
  " INT , INT "    " STR , INT "    " STR , STR "
[histogram per token: x-axis = number of occurrences per source, y-axis = percentage of sources]
Structure Discovery: Details
Cluster tokens into groups with similar histograms
  Similar histograms ► strong evidence tokens coexist in the same description component
  ► use symmetric relative entropy to measure similarity
Only the "shape" of the histogram matters
  ► normalize histograms by sorting columns in descending size
  ► result: comma & quote grouped together
Structure Discovery: Details
Find the most promising token group to divide and conquer:
  Structs == groups with high coverage & low "residual mass"
  Arrays  == groups with high coverage, sufficient width & high "residual mass"
  Unions  == other token groups
The struct involving comma & quote is identified in the histogram above. The overall procedure gives a good starting point for the rewriting system. (A code sketch of the histogram clustering follows.)
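Here is a small Python sketch of the histogram computation and similarity measure, run on the token-class sequences of the running example. It is an illustration under simplifying assumptions (token classes already computed; the coverage/residual-mass classification itself is elided).

    from collections import Counter
    from math import log

    def histogram(chunks, tok):
        """Fraction of records having each per-record occurrence count."""
        counts = Counter(c.count(tok) for c in chunks)
        return {k: v / len(chunks) for k, v in counts.items()}

    def normalize(h):
        """Only the shape matters: sort column heights in descending order."""
        return sorted(h.values(), reverse=True)

    def sym_rel_entropy(p, q, eps=1e-9):
        """Symmetric relative entropy (KL divergence summed both ways)."""
        m = max(len(p), len(q))
        p = p + [0.0] * (m - len(p))
        q = q + [0.0] * (m - len(q))
        kl = lambda a, b: sum(x * log((x + eps) / (y + eps)) for x, y in zip(a, b))
        return kl(p, q) + kl(q, p)

    chunks = [['"', 'INT', ',', 'INT', '"'],
              ['"', 'STR', ',', 'INT', '"'],
              ['"', 'STR', ',', 'STR', '"']]
    hists = {t: normalize(histogram(chunks, t))
             for t in {t for c in chunks for t in c}}
    # Quote and comma have identical histograms (distance 0), so they
    # cluster together; INT and STR fluctuate across records.
    for t in sorted(hists):
        print(t, hists[t], round(sym_rel_entropy(hists['"'], hists[t]), 3))

Quote and comma end up at distance 0 from each other, exactly the comma-and-quote group that the struct rule then selects.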
Format Refinement
Reanalyze example data with the aid of the rough description.
Rewrite the format description to:
  simplify presentation
    ► merge & rewrite structures
  improve precision
    ► reorganize description structure
    ► add constraints (sortedness, uniqueness, linear relations, functional dependencies)
  fill in missing details
    ► find completions where structure discovery bottoms out
    ► refine base types (termination conditions for strings, integer sizes)
Format Refinement
Three main sub-phases:
Phase 1: Tagging/table generation
  ► convert the rough description into a tagged description + relational table
Phase 2: Constraint inference
  ► analyze the table and infer constraints
  ► use the TANE algorithm [Huhtala et al. 99]
Phase 3: Format rewriting
  ► use inferred constraints & type isomorphisms to rewrite the rough description
  ► greedy search to optimize an information-theoretic score (see the toy cost model below)
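To give a flavor of the information-theoretic score driving Phase 3, here is a toy MDL-style cost model in Python. The cost model is entirely invented for illustration (the actual scoring function is defined in the PADS inference work): the score charges bits for writing the description plus bits for encoding the data given the description, so a refined description that pins down a constant wins on both terms.

    from math import log2

    def desc_cost(ty):
        """Bits to write the description: one per constructor or literal."""
        if isinstance(ty, tuple):
            return 1 + sum(desc_cost(t) for t in ty[1:])
        return 1

    def data_cost(ty, col):
        """Bits to encode a column of values under the description."""
        if ty == "int":
            return sum(log2(v + 2) for v in col)
        if ty == "str":
            return sum(8 * len(str(v)) for v in col)
        if ty[0] == "const":
            return 0.0                     # a known literal is free per record
        if ty[0] == "union":               # 1 bit per record picks the branch
            return sum(1 + min(data_cost(b, [v]) for b in ty[1:]) for v in col)
        raise ValueError(ty)

    col = [0, 0, 0, 0, 0]                  # the first field is always 0
    loose = ("union", "int", "str")        # what structure discovery guessed
    tight = ("const", 0)                   # what refinement proposes
    for ty in (loose, tight):
        print(ty, desc_cost(ty) + data_cost(ty, col))

The refined description scores 2.0 bits against 13.0 for the loose one, so the greedy search keeps the rewrite.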
Refinement: Simple Example
Input records:
  "0, 24"  "foo, beg"  "bar, end"  "0, 56"  "baz, middle"  "0, 12"  "0, 33"  ...
Structure discovery produces the rough description:
  struct { '"'; union { int; alpha }; ","; union { int; alpha }; '"' }
Tagging/table generation labels each union and base type,
  struct { '"'; union(id1) { int(id3); alpha(id4) }; ","; union(id2) { int(id5); alpha(id6) }; '"' }
and flattens the parsed records into a relational table, one row per record:
  id1    id2    id3  id4  id5  id6
  int    int    0    --   24   --
  alpha  alpha  --   foo  --   beg
  ...
Constraint inference over the table finds (see the sketch below):
  id3 = 0    (the first int, when present, is always 0)
  id1 = id2  (the first union is "int" whenever the second union is "int")
Rule-based structure rewriting uses these constraints to produce the refined description:
  struct { '"'; union { struct { "0"; ","; int }; struct { str; ","; str } }; '"' }
More accurate: the first int must be 0, and "int, alpha-string" records are ruled out.
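The two constraints above can be found mechanically from the table. A hypothetical Python sketch standing in for Phase 2 (the real system runs the TANE algorithm): scan for columns that are constant whenever present, and for pairs of columns that always agree.

    rows = [   # one row per record; absent fields mean "branch not taken"
        {"id1": "int",   "id2": "int",   "id3": 0,     "id5": 24},
        {"id1": "alpha", "id2": "alpha", "id4": "foo", "id6": "beg"},
        {"id1": "alpha", "id2": "alpha", "id4": "bar", "id6": "end"},
        {"id1": "int",   "id2": "int",   "id3": 0,     "id5": 56},
    ]
    cols = sorted({k for r in rows for k in r})

    # constant columns: finds id3 = 0
    for c in cols:
        vals = {r[c] for r in rows if c in r}
        if len(vals) == 1:
            print(f"{c} = {vals.pop()!r}")

    # always-equal columns: finds id1 = id2 (both unions take the same branch)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if all(r.get(a) == r.get(b) for r in rows):
                print(f"{a} = {b}")

On this table the sketch prints exactly "id3 = 0" and "id1 = id2", the constraints that license the rewrite on this slide.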
Biggest Weakness
The degree of success often hinges on the inference system having a tokenization scheme that matches the tokenization scheme of the data source.
Good tokens capture high-level, human abstractions compactly.
Open questions:
  Techniques for learning tokenizations from data directly?
  Techniques for using multiple, ambiguous tokenization schemes simultaneously?
Related Work
The most common domains for grammar inference are XML/HTML and natural language. Systems that focus on ad hoc data are rare, and the few that exist don't support the PADS tool suite:
  Rufus system '93, TSIMMIS '94, Potter's Wheel '01
Top-down structure discovery:
  Arasu & Garcia-Molina '03 (extracting data from web pages)
Grammar induction using MDL & grammar rewriting search:
  Stolcke and Omohundro '94, "Inducing probabilistic grammars..."
  T. W. Hong '02, Ph.D. thesis on information extraction from web pages
  de la Higuera '01, "Current trends in grammar induction"
Conclusions
Still a work in progress, but we are able to produce XML and statistical reports fully automatically from ad hoc data sources.
We've tested on approximately 15 real, mostly systems-oriented data sources (web logs, crash reports, AT&T phone call data, etc.), with what we believe is relatively good success.
For papers & software, see the PADS project website.
End