David Walker Princeton University In Collaboration with AT&T Research Pads: Simplified Data Processing For Scientists.



2 Standard Data Formats
Behind the scenes, much of this information is represented in standardized data formats.
Standardized data formats:
– Web pages in HTML
– Pictures in JPEG
– Movies in MPEG
– “Universal” information format XML
– Standard relational database formats
A plethora of data processing tools:
– Visualizers (browsers display JPEG, HTML, ...)
– Query languages allow users to extract information (SQL, XQuery)
– Programmers get easy access through standard libraries (e.g., the Java XML libraries, JAXP)
– Many applications handle it natively and convert back and forth (e.g., MS Word)

3 Ad Hoc Data Formats
Massive amounts of data are stored in XML, HTML or relational databases, but there’s even more data that isn’t.
An ad hoc data format is any nonstandard data format for which convenient parsing, querying, visualizing and transformation tools are not available.
– Ad hoc data is everywhere.

4 Ad Hoc data from
Date: 3/21/2005 1:00PM PACIFIC
Investor's Business Daily ® Stock List
Name: DAVE
Stock Company Price Price Volume EPS RS
Symbol Name Price Change % Change % Change Rating Rating
AET Aetna Inc % 31%
GE General Electric Co % -8%
HD Home Depot Inc % 63%
IBM Intl Business Machines % -13%
INTC Intel Corp % -47%
Data provided by William O'Neil + Co., Inc. © All Rights Reserved. Investor's Business Daily is a registered trademark of Investor's Business Daily, Inc. Reproduction or redistribution other than for personal use is prohibited. All prices are delayed at least 20 minutes.

5 Ad Hoc data from
!autogenerated-by: DAG-Edit version rev 3
!saved-by: gocvs
!date: Fri Mar 18 21:00:28 PST 2005
!version: $Revision: $
!type: % is_a is a
!type: < part_of part of
!type: ^ inverse_of inverse of
!type: | disjoint_from disjoint from
$Gene_Ontology ; GO:
<biological_process ; GO:
%behavior ; GO: ; synonym:behaviour
%adult behavior ; GO: ; synonym:adult behaviour
%adult feeding behavior ; GO: ; synonym:adult feeding behaviour % feeding behavior ; GO:
%adult locomotory behavior ; GO: ; ...

6 Ad Hoc Data in Chemistry C5=CC=CC=C5)=O)C1

7 The challenge of ad hoc data
What can we do about ad hoc data?
– how do we read it into programs?
– how do we detect errors?
– how do we correct errors?
– how do we query it?
– how do we view it?
– how do we gather statistics on it?
– how do we load it into a database?
– how do we transform it into a standard format like XML?
– how do we combine multiple ad hoc data sources?
– how do we filter, normalize and transform it?
In short: how do we do all the things we take for granted when dealing with standard formats in a reliable, fault-tolerant and efficient, yet effortless way?

8 Enter Pads
Pads: a system for Processing Ad hoc Data Sources
Two main components:
– a data description language for concise and precise specifications of ad hoc data formats and properties
– a compiler that automatically generates a suite of data processing tools:
    – robust libraries for C programming
        – parser that flags all errors and automatically recovers
        – printing utilities
        – constraint checking utilities
    – converter to XML
    – a statistical profiler
        – collects stats on common values appearing in all parts of the data; records error stats
    – visual interface & viewer (coming soon!)
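To make the two-component idea concrete, here is a minimal sketch in Python rather than in Pads itself. The names FIELDS, parse_uint, parse_string and parse_record, the pipe-delimited layout and the sample records are all hypothetical; the real Pads description language and its generated C libraries look different. The sketch only illustrates the general idea of a declarative description driving a parser that flags every error with its location and keeps going.

```python
# Hypothetical sketch: a declarative description drives a generic parser.
# None of these names come from Pads; they only illustrate the idea.

def parse_uint(text):
    """Parse a non-negative decimal integer or raise ValueError."""
    if not text.isdigit():
        raise ValueError(f"expected unsigned integer, got {text!r}")
    return int(text)

def parse_string(text):
    """Accept any non-empty string."""
    if not text:
        raise ValueError("expected non-empty string")
    return text

# The "description": field names paired with the parser for each field.
FIELDS = [("symbol", parse_string), ("name", parse_string), ("volume", parse_uint)]

def parse_record(line, line_no, fields=FIELDS, sep="|"):
    """Parse one record; return (values, errors) instead of stopping at the first error."""
    values, errors = {}, []
    parts = line.rstrip("\n").split(sep)
    if len(parts) != len(fields):
        errors.append((line_no, None, f"expected {len(fields)} fields, got {len(parts)}"))
    for (name, parser), text in zip(fields, parts):
        try:
            values[name] = parser(text)
        except ValueError as exc:
            values[name] = None  # recover: record the error, keep parsing
            errors.append((line_no, name, str(exc)))
    return values, errors

# Made-up sample records (the volumes are invented for illustration).
data = ["GE|General Electric Co|1200", "HD|Home Depot Inc|n/a", "IBM|900"]
results = [parse_record(line, n) for n, line in enumerate(data, start=1)]
all_errors = [err for _, errs in results for err in errs]
```

Each entry in all_errors carries the line number and the field name, so a caller can report where the data went wrong without aborting the whole run.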

9 Pads Tool Generation Architecture
[Diagram: Gene Ontology description → Pads Compiler → Statistical Profiler Tool, XML Formatter Tool and Viewer Tool, each applied to the gene data; sample profile output: ACE 25%, BKJ 25%, ...]
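As a rough illustration of what an XML formatter does once a record has been parsed, the sketch below uses Python's standard xml.etree.ElementTree module. The function record_to_xml, the element names and the flat record layout are assumptions made for this example; they are not the output format of the Pads converter.

```python
import xml.etree.ElementTree as ET

def record_to_xml(tag, record):
    """Turn a flat dict of parsed fields into an XML element (hypothetical layout)."""
    elem = ET.Element(tag)
    for name, value in record.items():
        child = ET.SubElement(elem, name)
        child.text = str(value)
    return elem

# A parsed Gene Ontology term, with assumed field names.
term = {"name": "adult behavior", "synonym": "adult behaviour", "relation": "is_a"}
xml_text = ET.tostring(record_to_xml("term", term), encoding="unicode")
```

Once the data is in XML, the standard tools mentioned earlier (XQuery, browsers, XML libraries) can be applied to it.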

10 Pads Tool Generation Architecture
[Diagram: Gene Ontology description → Pads Compiler → Gene Ontology Generated Parser; the generated parser, the Pads Base Library and the glue code for the statistical profile together form the Gene Ontology Statistical Profiler.]

11 Pads Programmer Tools
[Diagram: Gene Ontology description → Pads Compiler → Gene Ontology Generated Parser; the generated parser, the Pads Base Library and an ad hoc user program in C together form the Ad Hoc User Program.]

12 The Statistical Profiler Tool
For each part of a data source, the profiler reports errors & most common values.
From example weblog data:
.length : uint
good: bad: 3824 pcnt-bad:
min: 35 max: avg:
top 10 values out of 1000 distinct values:
tracked % of values
val: 3082 count: 1254 %-of-good:
val: 170 count: 1148 %-of-good:
val: 43 count: 1018 %-of-good:
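The kind of report shown above can be approximated in a few lines of Python. This is only a sketch: the function profile, its report keys and the sample values are assumptions for illustration (the three lengths are borrowed from the slide, the counts are invented), and it is not the Pads profiler.

```python
from collections import Counter

def profile(values, is_good, top=3):
    """Summarize one field: good/bad counts, range, average and most common values."""
    good = [v for v in values if is_good(v)]
    bad = len(values) - len(good)
    nums = [int(v) for v in good]
    counts = Counter(nums)
    return {
        "good": len(good),
        "bad": bad,
        "pcnt-bad": 100.0 * bad / len(values) if values else 0.0,
        "min": min(nums) if nums else None,
        "max": max(nums) if nums else None,
        "avg": sum(nums) / len(nums) if nums else None,
        "distinct": len(counts),
        "top": counts.most_common(top),
    }

# Made-up sample of a weblog length field; '-' stands for an unparsable value.
lengths = ["3082", "170", "-", "3082", "43", "170", "3082", "-"]
report = profile(lengths, str.isdigit)
```

A high bad count concentrated on a single unexpected value is a hint that the description, not the data, needs revising.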

13 The Statistical Profiler Tool
Ad hoc data is often poorly documented, or its documentation is out of date.
– Even the documentation of weblog data from our textbook was missing some information:
    good: bad: 3824 pcnt-bad:
– The web server sometimes returns a ‘-’ instead of the length in bytes, which wasn’t mentioned in the textbook.
Data descriptions can be written in an iterative fashion.
– Use the profiler at each stage to uncover additional information about the data and refine the description.
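The refinement step can be sketched as two drafts of a description for the length field. The function names and the sample values are hypothetical; in Pads the second draft would presumably be written as a union of alternatives rather than as hand-written code.

```python
def parse_length_v1(text):
    """First draft: the length is always an unsigned integer."""
    if not text.isdigit():
        raise ValueError(f"bad length {text!r}")
    return int(text)

def parse_length_v2(text):
    """Refined draft: the length is either an unsigned integer or a lone '-'."""
    if text == "-":
        return None  # the 'no length reported' alternative
    return parse_length_v1(text)

def count_bad(parser, values):
    """Count how many values the given draft rejects."""
    bad = 0
    for value in values:
        try:
            parser(value)
        except ValueError:
            bad += 1
    return bad

sample = ["3082", "-", "170", "-", "4x3"]  # made-up field values
bad_before = count_bad(parse_length_v1, sample)
bad_after = count_bad(parse_length_v2, sample)
```

After the refinement, only the genuinely malformed value is still reported as bad, which is the signal the iterative workflow is looking for.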

14 PADS language
Based on Type Theory
– in most modern programming languages, types (int, bool, struct, object, ...) describe program data
    – the source of most of my research
– in Pads, types describe physical data formats, semantic properties of data, and a mapping into an internal program representation (i.e., a parser)
– in Pads, types include:
    – base types for ints of different kinds, strings of different kinds, dates, urls, ...
    – structs and arrays for reading sequences
    – unions, switched unions and enums for alternatives
    – parameterized types to express dependencies & constraints
    – recursive types to express recursive hierarchies (coming soon!)
– Can describe ASCII, binary, and mixed data formats.
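One way to picture types that double as parsers is with small parser combinators. The sketch below is written in Python under that assumption; uint, array_of, constrained and counted_list are hypothetical names, not Pads syntax. It shows a base type, a parameterized array whose length depends on an earlier field, and a semantic constraint.

```python
def uint(tokens, pos):
    """Base type: read one unsigned integer token; return (value, next position)."""
    if pos >= len(tokens) or not tokens[pos].isdigit():
        raise ValueError(f"token {pos}: expected unsigned integer")
    return int(tokens[pos]), pos + 1

def array_of(elem, count):
    """Parameterized type: an array of exactly `count` elements of type `elem`."""
    def parse(tokens, pos):
        items = []
        for _ in range(count):
            value, pos = elem(tokens, pos)
            items.append(value)
        return items, pos
    return parse

def constrained(elem, predicate, label):
    """Attach a semantic constraint to an existing type."""
    def parse(tokens, pos):
        value, new_pos = elem(tokens, pos)
        if not predicate(value):
            raise ValueError(f"token {pos}: constraint {label!r} violated by {value!r}")
        return value, new_pos
    return parse

def counted_list(tokens, pos=0):
    """Struct: a count field followed by an array whose length depends on it."""
    n, pos = uint(tokens, pos)
    percent = constrained(uint, lambda v: v <= 100, "at most 100")
    items, pos = array_of(percent, n)(tokens, pos)
    return {"count": n, "items": items}, pos

ok, _ = counted_list("3 25 25 50".split())
try:
    counted_list("2 25 250".split())
    error = None
except ValueError as exc:
    error = str(exc)
```

The dependency between the count field and the array length is what a parameterized type expresses; an ordinary context-free grammar cannot state it directly.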

15 Future Work
Ad Hoc Data Transformation & Integration
– language and compiler support for moving data from the format you are given to the format you really want
    – specifying simple transforms: permuting, dropping, computing fields; normalizing representations of dates, times, places, ...
    – correcting errors
    – integrating multiple sources
Pads Applications
– genomics data (with Olga Troyanskaya, Princeton CS)
– networking and telephony data (AT&T)
– financial data (Richard Liao, Princeton ORFE)
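The simple transforms listed above (permuting, dropping and computing fields, normalizing dates) might look like the following when written by hand in Python. The record layout, the field names, the prices and the two accepted date formats are assumptions for illustration; the slide proposes language and compiler support for this, which the sketch does not attempt to reproduce.

```python
from datetime import datetime

def normalize_date(text, formats=("%m/%d/%Y", "%Y-%m-%d")):
    """Try each known format in turn and return the date in ISO 8601 form."""
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date {text!r}")

def transform(record):
    """Reorder fields, drop 'comment', compute 'change' and normalize the date."""
    return {
        "date": normalize_date(record["date"]),
        "symbol": record["symbol"],
        "change": round(record["close"] - record["open"], 2),
    }

# Made-up input record in the format we are given.
raw = {"symbol": "GE", "comment": "delayed", "open": 10.0, "close": 10.5, "date": "3/21/2005"}
clean = transform(raw)
```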

16 Challenges of Ad Hoc Data Revisited
Data arrives “as is”
– Format determined by data source, not consumers.
    – The Pads language allows consumers to describe data in just about any format.
– Often has little documentation.
    – A Pads description can serve as documentation for a data source.
    – The statistical profiler helps analysts understand data.
– Some percentage of data is “buggy.”
    – Constraints allow consumers to express expectations about data.
    – Parsers check for errors and say where errors are located.
Ad hoc data is a rich source of information for financial analysts, chemists, biologists and computer scientists, if they could only get at it.
– Pads generates a collection of useful tools automatically from data descriptions.

17 Pads Summary
The overarching goal of Pads is to make understanding, analyzing and transforming ad hoc data an effortless task.
We do so with new programming language technology based on the principles of Type Theory.
AT&T Research: Kathleen Fisher, Mary Fernandez, Joel Gottlieb, Robert Gruber (now Google), Ricardo Medel (summer intern)
Princeton: Mark Daly (UGrad), Yitzhak Mandelbaum (Grad), David Walker

End!