From Dirt to Shovels: Automatic Tool Generation from Ad Hoc Data Kenny Zhu Princeton University with Kathleen Fisher, David Walker and Peter White
A System Admin’s Life
Web Server Logs…
System Logs…
Application Configs…
User s
Script Outputs and more…
Automatically Generate Tools from Data! XML converter, data profiler, grapher, etc.
Architecture
Raw Data → LearnPADS format inference (Tokenization → Structure Discovery → Format Refinement, guided by a Scoring Function) → Data Description → PADS Compiler → generated tools (Profiler, XML converter, …) → Analysis Report / XML
Simple End-to-End
Data sources:
"0, 24"
"foo, 16"
"bar, end"
Description:
Punion payload {
    Pint32 i;
    PstringFW(3) s2;
};
Pstruct source {
    '\"';
    payload p1;
    ",";
    payload p2;
    '\"';
};
XML output: 0 24 bar end
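The end-to-end behavior on this example can be sketched in Python. Here `parse_payload`, `to_xml`, and the element names are illustrative stand-ins for what the PADS-generated converter emits, not its actual output:

```python
import re

# A sketch of the generated converter's behavior. parse_payload, to_xml
# and the XML element names are illustrative, not PADS's actual output.
def parse_payload(s):
    """Union branch order matters: try Pint32 first, then PstringFW(3)."""
    if re.fullmatch(r"-?\d+", s):
        return ("Pint32", int(s))
    return ("PstringFW3", s[:3])

def to_xml(line):
    """Strip the surrounding quotes, split on the comma, tag each field."""
    body = line.strip('"')
    fields = [parse_payload(p.strip()) for p in body.split(",")]
    inner = "".join(f"<{tag}>{val}</{tag}>" for tag, val in fields)
    return f"<source>{inner}</source>"

to_xml('"0, 24"')   # '<source><Pint32>0</Pint32><Pint32>24</Pint32></source>'
```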
Tokenization
Parse strings; convert to symbolic tokens
Basic token set skewed towards systems data ► Int, string, date, time, URLs, hostnames, …
A config file allows users to define their own new token types via regular expressions
tokenize:
"0, 24" ► " INT , INT "
"foo, 16" ► " STR , INT "
"bar, end" ► " STR , STR "
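A minimal tokenizer in this spirit. The token names and patterns below are illustrative; LearnPADS's real base-token set (dates, URLs, hostnames, …) and its config-file syntax are richer:

```python
import re

# Illustrative token set; the real LearnPADS base tokens are richer and
# user-extensible via a config file of regular expressions.
TOKEN_PATTERNS = [
    ("INT",   r"-?\d+"),
    ("STR",   r"[A-Za-z]+"),
    ("QUOTE", r"\""),
    ("COMMA", r","),
    ("WS",    r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_PATTERNS))

def tokenize(line):
    """Map a raw line to its sequence of symbolic token names."""
    return [m.lastgroup for m in MASTER.finditer(line) if m.lastgroup != "WS"]

tokenize('"foo, 16"')   # ['QUOTE', 'STR', 'COMMA', 'INT', 'QUOTE']
```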
Structure Discovery: Overview Top-down, divide-and-conquer algorithm: Compute various statistics from tokenized data Guess a top-level description Partition tokenized data into smaller chunks Recursively analyze and compute descriptions from smaller chunks
[Build diagrams: the discover step turns the sources " INT , INT ", " STR , INT ", " STR , STR " into a candidate struct with quote and comma fields, then refines each remaining position into a union of INT and STR]
Structure Discovery: Details
Compute frequency distribution histogram for each token (and recompute at every level of recursion).
[Histograms for " INT , INT ", " STR , INT ", " STR , STR ": x-axis = number of occurrences per source, y-axis = percentage of sources]
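The per-token histograms can be sketched with a hypothetical helper (not LearnPADS code): for each token type, count how many sources contain it 0, 1, 2, … times:

```python
from collections import Counter

def token_histograms(chunks):
    """chunks: list of token lists, one per data source line.
    For each token type, build a histogram mapping
    'occurrences per chunk' -> 'number of chunks with that count'."""
    all_tokens = {t for chunk in chunks for t in chunk}
    return {tok: Counter(chunk.count(tok) for chunk in chunks)
            for tok in all_tokens}

chunks = [
    ["QUOTE", "INT", "COMMA", "INT", "QUOTE"],
    ["QUOTE", "STR", "COMMA", "INT", "QUOTE"],
    ["QUOTE", "STR", "COMMA", "STR", "QUOTE"],
]
token_histograms(chunks)["QUOTE"]   # Counter({2: 3}): exactly twice, always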
Structure Discovery: Details
Cluster tokens with similar histograms into groups
Similar histograms ► tokens with strong regularity coexist in the same description component ► use symmetric relative entropy to measure similarity
Only the "shape" of the histogram matters ► normalize histograms by sorting columns in descending size ► result: comma & quote in one group, int & string in another
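The shape normalization and symmetric relative entropy comparison can be sketched as follows; the zero-padding and `eps` smoothing are our own choices, not necessarily what LearnPADS does:

```python
import math

def normalized_shape(hist, width):
    """Keep only the histogram's 'shape': sort column sizes in
    descending order, pad to a common width, normalize to sum to 1."""
    cols = sorted(hist.values(), reverse=True)
    cols += [0] * (width - len(cols))
    total = sum(cols)
    return [c / total for c in cols]

def symmetric_kl(p, q, eps=1e-9):
    """Symmetric relative entropy; eps smoothing avoids log(0)."""
    p = [x + eps for x in p]
    q = [x + eps for x in q]
    kl = lambda a, b: sum(x * math.log(x / y) for x, y in zip(a, b))
    return kl(p, q) + kl(q, p)

# Shapes from the example: quote and comma look identical (one column),
# so they cluster together; int has a spread-out shape.
quote = normalized_shape({2: 3}, 3)            # [1.0, 0.0, 0.0]
comma = normalized_shape({1: 3}, 3)            # [1.0, 0.0, 0.0]
ints  = normalized_shape({2: 1, 1: 1, 0: 1}, 3)
```

Sorting the columns before comparing is what makes only the shape matter: quote (always twice per source) and comma (always once) become indistinguishable single-column histograms.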
Structure Discovery: Details
Classify the groups into:
Structs == groups with high coverage & low "residual mass"
Arrays == groups with high coverage, sufficient width & high "residual mass"
Unions == other token groups
Pick the group with the strongest signal to divide and conquer (more mathematical details in the paper)
The struct involving comma & quote is identified in the histogram above
The overall procedure gives a good starting point for refinement
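The classification rules can be sketched directly; the threshold values here are illustrative defaults, not the ones LearnPADS uses:

```python
def classify(coverage, residual_mass, width,
             min_coverage=0.9, max_mass=0.1, min_width=3):
    """Classify a token group by its histogram statistics.
    coverage: fraction of chunks containing the group's tokens;
    residual_mass: histogram mass outside the dominant columns;
    width: number of distinct occurrence counts.
    Thresholds are illustrative, not LearnPADS's actual values."""
    if coverage >= min_coverage and residual_mass <= max_mass:
        return "struct"
    if coverage >= min_coverage and width >= min_width and residual_mass > max_mass:
        return "array"
    return "union"

# The quote & comma group: present in every chunk, same count each time.
classify(1.0, 0.0, 1)   # 'struct'
```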
Format Refinement
Reanalyze source data with the aid of the rough description to obtain functional dependencies and constraints
Rewrite the format description to:
simplify presentation ► merge & rewrite structures
improve precision ► add constraints (uniqueness, ranges, functional dependencies)
fill in missing details ► find completions where structure discovery bottoms out ► refine base types (integer sizes, array sizes, separators and terminators)
Rewriting is guided by a local search that optimizes an information-theoretic score (more details in the paper)
Refinement: Simple Example
Source records:
"0, 24"  "foo, beg"  "bar, end"  "0, 56"  "baz, middle"  "0, 12"  "0, 33" …
structure discovery ► struct { '"' ; union (id1) { int (id3) | str (id4) } ; "," ; union (id2) { int (id5) | str (id6) } ; '"' }
tagging / table gen ► table with columns id1 … id6, e.g. the row for "foo, beg" is id1=str, id2=str, id3=-, id4=foo, id5=-, id6=beg
constraint inference ► Constraints: id3 = 0 (the first int is always 0) and id1 = id2 (no "int, str" records)
rule-based structure rewriting ► struct { '"' ; union { struct { "0" ; "," ; int } | struct { str ; "," ; str } } ; '"' } ► greater accuracy
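The constraint-inference pass over the tagged table can be sketched by brute force. This is illustrative code under our own assumptions, not the actual implementation:

```python
def infer_constraints(table):
    """Brute-force sketch of constraint inference.
    table: list of equal-length row tuples. Returns columns holding a
    constant value, and functional dependencies i -> j among the rest."""
    ncols = len(table[0])
    constants = {i: table[0][i] for i in range(ncols)
                 if all(row[i] == table[0][i] for row in table)}
    fds = []
    for i in range(ncols):
        for j in range(ncols):
            if i == j or i in constants or j in constants:
                continue
            mapping = {}  # value in column i -> value it forces in column j
            if all(mapping.setdefault(row[i], row[j]) == row[j] for row in table):
                fds.append((i, j))
    return constants, fds

# The int branch of the example: the first int field is always 0.
infer_constraints([(0, 24), (0, 56), (0, 12), (0, 33)])   # ({0: 0}, [])

# Union tags of the two payloads: each determines the other,
# i.e. there are no "int, str" records.
infer_constraints([("INT", "INT"), ("STR", "STR"), ("INT", "INT")])
```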
Evaluation
Benchmark Formats
Data source ► Description
1967Transactions.short ► Transaction records
MER_T01_01.csv ► Comma-separated records
Ai ► Web server log
Asl.log ► Log file of Mac ASL
Boot.log ► Mac OS boot log
Crashreporter.log ► Original crashreporter daemon log
Crashreporter.log.mod ► Modified crashreporter daemon log
Sirius ► AT&T phone provision data
Ls-l.txt ► Command ls -l output
Netstat-an ► Output from netstat -an
Page_log ► Printer log from CUPS
Quarterlypersonalincome ► Spreadsheet
Railroad.txt ► US railroad info
Scrollkeeper.log ► Log from cataloging system
Windowserver_last.log ► Log from Mac LoginWindow server
Yum.txt ► Log from package installer Yum
Available at
Training Time vs. Training Size
Training Accuracy vs. Training Size
Conclusions
We are able to produce XML and statistical reports fully automatically from ad hoc data sources.
We've tested on approximately 15 real, mostly systems-oriented data sources (web logs, crash reports, AT&T phone call data, etc.) with what we believe is a good success rate.
For papers, online demos & PADS software, see our website at:
LearnPADS On the Web
End
Related Work
Most common domains for grammar inference: XML/HTML, natural language
Systems that focus on ad hoc data are rare, and those that exist do not support a tool suite like PADS: Rufus system '93, TSIMMIS '94, Potter's Wheel '01
Top-down structure discovery: Arasu & Garcia-Molina '03 (extracting data from web pages)
Grammar induction using MDL & grammar rewriting search:
Stolcke and Omohundro '94 "Inducing probabilistic grammars..."
T. W. Hong '02, Ph.D. thesis on information extraction from web pages
Higuera '01 "Current trends in grammar induction"
Garofalakis et al. '00 "XTRACT for inferring DTDs"
Scoring Function
Finding a function to evaluate the "goodness" of a description involves balancing two ideas:
a description must be concise ► people cannot read and understand enormous descriptions
a description must be precise ► imprecise descriptions do not give us much useful information
Note the trade-off:
increasing precision (good) usually increases description size (bad)
decreasing description size (good) usually decreases precision (bad)
Minimum Description Length (MDL) Principle ► normalized information-theoretic scores:
TransmissionBits = BitsForDescription(T) + BitsForData(D given T)
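The MDL trade-off can be illustrated numerically; the bit costs below are made up for the example, not LearnPADS's actual encoding:

```python
def mdl_score(description_bits, data, bits_per_record):
    """Total transmission cost: bits to encode the description T plus
    bits to encode the data D given T. Encoding costs are illustrative."""
    return description_bits + sum(bits_per_record(rec) for rec in data)

# A precise description (a Pint32 field) costs more bits for itself but
# compresses each record; a vague one (arbitrary string) is the reverse.
records = ["12345678"] * 1000
precise = mdl_score(200, records, lambda r: 32)              # 32 bits per int
vague   = mdl_score(20,  records, lambda r: 8 * (len(r) + 1))  # raw bytes
precise < vague   # with enough data, the precise description wins
```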