Harvard's PASS Takes on The Provenance Challenge
September 13, 2006
Margo Seltzer
Harvard University, Division of Engineering and Applied Sciences



Reminder: What is PASS?
Storage systems (e.g., file systems) in which provenance is a first-class entity.
Provenance:
– is generated and maintained as transparently as possible.
– can be indexed and queried.
– will be created from objects imported from non-PASS sources.
– is maintained in the presence of deletes, copies, renames, etc.

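Collecting Provenance
Example command: % sort a > b
System calls observed: fork; open b (W); exec sort a; open a (R); read a; write b; close a; close b
[Figure: inside the kernel, the task_struct for sort and the inode cache entry for b carry provenance records. Process sort: argv=sort a, name=sort, modules=pasta…, kernel=Linux…, env=USER…, input=a. File b: input=sort a. The records are then passed to the file system.]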
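The capture step on this slide can be sketched in a few lines. This is only an illustration of the idea, not the PASS kernel code: the event tuples, the pid 100, and the function name collect_provenance are all made up for the example. A read makes the process depend on the file, and a write makes the file depend on the process.

```python
from collections import defaultdict


def collect_provenance(events):
    """Turn (pid, op, arg) events into provenance records.

    Returns (inputs, attrs): inputs maps each object to the set of
    objects it was derived from; attrs holds per-process attributes.
    The event format is a hypothetical stand-in for what a kernel
    would observe.
    """
    inputs = defaultdict(set)
    attrs = {}
    for pid, op, arg in events:
        proc = ("proc", pid)
        if op == "exec":
            attrs[proc] = {"argv": arg}
        elif op == "read":
            # The process now depends on the file it read.
            inputs[proc].add(("file", arg))
        elif op == "write":
            # The file now depends on the process that wrote it.
            inputs[("file", arg)].add(proc)
    return dict(inputs), attrs


# Events for `sort a > b`, using an invented pid.
events = [
    (100, "exec", "sort a"),
    (100, "read", "a"),
    (100, "write", "b"),
]
inputs, attrs = collect_provenance(events)
print(inputs)
print(attrs)
```

File b ends up with the sort process as its input, and the sort process ends up with file a as its input, which mirrors the two input records in the figure.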

Things to Keep in Mind
Our focus is provenance collection, not query.
We collect provenance of everything.
Provenance collection is done in the operating system.
Queries are simply queries against the database maintained by the kernel.
Our kernel database is Berkeley DB.

Results Summary
Workflow: we ran the shell script
– Dropped in all the programs and simply ran them on Linux.
– Chose not to run the slicer, because the license worried us.
Query: command-line query tool: nq
– Successfully ran all queries.
– Generated a lot more output than you really want.
– Strategy is to keep everything and provide pruning to let users see what they want.

Query Tool: NQ
General form:
– nq [SELECTION] SEARCH [FILTER] OUTPUT-TYPE
SELECTION: select FIELD … from
FIELD: FIELD-NAME, concat(FIELD-NAME), $ANNOTATION, nameof(FIELD-NAME), typeof(FIELD-NAME)
SEARCH: ancestors FILE*, descendents FILE*, everything
FILTER: depth NUM, anchor EXPR, hide TYPE, where EXPR
OUTPUT-TYPE: report, report html, table
EXPR: existing, nonexisting, EXPR op EXPR
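The ancestors and descendents searches, together with the depth filter, amount to a bounded graph traversal. The sketch below assumes provenance is held as a plain dictionary from each object to its direct inputs; the function names and the toy graph are illustrative and say nothing about how nq itself is implemented.

```python
def reachable(edges, start, depth=None):
    """Objects reachable from start, optionally limited to `depth` hops."""
    seen = set()
    frontier = {start}
    level = 0
    while frontier and (depth is None or level < depth):
        nxt = set()
        for node in frontier:
            for neighbour in edges.get(node, ()):
                if neighbour not in seen and neighbour != start:
                    nxt.add(neighbour)
        seen |= nxt
        frontier = nxt
        level += 1
    return seen


def invert(edges):
    """Flip input edges into output edges, so descendants can be walked."""
    flipped = {}
    for node, parents in edges.items():
        for parent in parents:
            flipped.setdefault(parent, set()).add(node)
    return flipped


# Toy graph: `sort a > b` followed by `uniq b > c` (invented example).
inputs = {
    "b": {"proc:sort"},
    "proc:sort": {"a"},
    "c": {"proc:uniq"},
    "proc:uniq": {"b"},
}

ancestors_c = reachable(inputs, "c")             # like: ancestors c
parents_c = reachable(inputs, "c", depth=1)      # like: ancestors c depth 1
descendants_a = reachable(invert(inputs), "a")   # like: descendents a
print(sorted(ancestors_c), sorted(parents_c), sorted(descendants_a))
```

Walking the same structure in both directions only needs the edges inverted once, which is why ancestors and descendents can share one traversal routine.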

Q1: Provenance of Graphic X
nq 'ancestors atlas-x.gif report'
[passfile; challenge/atlas-x.gif] version 1
type: passfile
name: challenge/atlas-x.gif
input: [proc; pid 2937; /usr/local/bin/convert] version 0
annotation: dim=x
annotation: run=base
annotation: studyModality=mindreading
And 4806 other objects…
Results: QUERIES\q1.html

Q2: Q1, excluding everything prior to softmean
Query: nq 'ancestors atlas-x.gif anchor (type == "proc" && name == "AIR5.2.5/bin/softmean") report'
Result: essentially a subset of Q1
– only 148 objects identified
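One way to read the anchor filter is as a traversal that stops expanding once it reaches an object matching the expression. The sketch below encodes that reading, which is an assumption based on the Q2 description rather than documented nq behavior; the graph and object names are invented.

```python
def ancestors_until(inputs, start, is_anchor):
    """Ancestors of start, without walking past any anchor object.

    Anchor objects themselves are included; their own inputs are not.
    """
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for parent in inputs.get(node, ()):
            if parent in seen:
                continue
            seen.add(parent)
            if not is_anchor(parent):
                stack.append(parent)
    return seen


# Invented chain: in.img -> softmean -> mid.img -> convert -> out.gif
inputs = {
    "out.gif": {"proc:convert"},
    "proc:convert": {"mid.img"},
    "mid.img": {"proc:softmean"},
    "proc:softmean": {"in.img"},
}

pruned = ancestors_until(inputs, "out.gif", lambda n: n == "proc:softmean")
full = ancestors_until(inputs, "out.gif", lambda n: False)
print(sorted(pruned))
print(sorted(full))
```

With the anchor in place the result is a strict subset of the unanchored result, which matches the slide's observation that Q2 returns a subset of Q1.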

Q3: Q2 w/stages
We did not create annotations to map to stages, so this query degenerates to the same one as Query 2.

Q4: align_warp w/specific parameter values
Query: nq 'everything where basename == "align_warp" && concat(argv) ~ "*-m 12*" && freezetime ~ "*Mon*" report'
Results:
– We did our run on Monday
– Returns 8 instances:
  Four from the main workload
  Four from the variant workload used in Query 7
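The where clause in Q4 combines an exact match with two wildcard matches. A rough Python equivalent follows, assuming the ~ operator behaves like shell-style globbing; the record layout, the argument lists, and the timestamp format are invented for the example.

```python
from fnmatch import fnmatchcase


def matches_q4(record):
    """Mimic: basename == "align_warp" && concat(argv) ~ "*-m 12*"
    && freezetime ~ "*Mon*" (glob semantics assumed)."""
    return (
        record["basename"] == "align_warp"
        and fnmatchcase(" ".join(record["argv"]), "*-m 12*")
        and fnmatchcase(record["freezetime"], "*Mon*")
    )


# Hypothetical records; field values are made up.
records = [
    {"basename": "align_warp", "argv": ["align_warp", "-m", "12"],
     "freezetime": "Mon Sep 11 2006"},
    {"basename": "align_warp", "argv": ["align_warp", "-m", "13"],
     "freezetime": "Mon Sep 11 2006"},
    {"basename": "align_warp", "argv": ["align_warp", "-m", "12"],
     "freezetime": "Tue Sep 12 2006"},
    {"basename": "align_warp", "argv": ["align_warp", "-m", "120"],
     "freezetime": "Mon Sep 11 2006"},
]

hits = [matches_q4(r) for r in records]
print(hits)
```

The last record shows a limitation of the pattern: "*-m 12*" also accepts "-m 120", so a stricter pattern would be needed if such values could occur.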

Q5: images with max=4095
Two alternate approaches:
– Three-phase solution
  Create list of header files that are ancestors of align_warp
  Pass list of files to scanheader; grep max=4095
  Find all the descendants of the headers
– Annotation approach
  Run scanheader on all headers
  Make results of scanheader annotations
  Query on the annotations
– We used the first approach

Q5 Continued
Create list of files to query
ALIGN_WARPS=`$NQ $NQOPTS 'select ident from everything where type == "proc" && basename == "align_warp" table'`
$NQ $NQOPTS 'select name from ancestors { '"$ALIGN_WARPS"' } depth where basename ~ "*.hdr" table'
Call scanheader on everything returned above, selecting those files where max=4095

Q5 Continued
Query on the list returned above
nq 'descendents { anatomy1.hdr anatomy2.hdr anatomy3.hdr anatomy4.hdr } where basename ~ "atlas*.gif" || basename ~ "atlas*.jpg" report'
Results
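Each phase of Q5 feeds a list of names into the next query by splicing it between the braces. The sketch below shows that splicing step in Python under the assumption that the names contain no spaces or braces; descendents_query is a made-up helper, and shlex.quote is used only to produce a shell-safe argument.

```python
import shlex


def descendents_query(names, where, output="report"):
    """Build the argument string for a 'descendents { ... } where ...' query."""
    return f"descendents {{ {' '.join(names)} }} where {where} {output}"


# Header files and filter taken from the slide above.
headers = ["anatomy1.hdr", "anatomy2.hdr", "anatomy3.hdr", "anatomy4.hdr"]
where = 'basename ~ "atlas*.gif" || basename ~ "atlas*.jpg"'

query = descendents_query(headers, where)
command = "nq " + shlex.quote(query)
print(command)
```

Quoting the whole query as one argument keeps the shell from interpreting the || and the wildcard characters before nq sees them.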

Q6: images produced by softmean with a particular align_warp parameter
Three-stage query:
– Find align_warp processes
ALIGN_WARPS=`nq 'select ident from everything where type == "proc" && basename == "align_warp" && concat(argv) ~ "*-m 12*" table'`
– Find appropriate softmean processes
SOFTMEANS=`nq 'select ident from descendents { '"$ALIGN_WARPS"' } where type == "proc" && basename == "softmean" table'`
– Find images produced by softmean processes
nq 'select name from descendents { '"$SOFTMEANS"' } depth 1 where type == "passfile" && basename ~ "*.img" report'
Results:
[passfile; challenge/q7/atlas.img] version 1
name: challenge/q7/atlas.img
[passfile; challenge/atlas.img] version 1
name: challenge/atlas.img

Q7: Difference between original and new workflow
We use standard diff of textual output
nq 'ancestors atlas-x.gif report' > q7-a.tmp
nq 'ancestors q7/atlas-x.jpg report' > q7-b.tmp
diff -u q7-a.tmp q7-b.tmp
Result:
[passfile; challenge/atlas-x.gif] version
[passfile; challenge/q7/atlas-x.jpg] version 1
 type: passfile
- name: challenge/atlas-x.gif
- input: [proc; pid 2937; /usr/local/bin/convert] version 0
+ name: challenge/q7/atlas-x.jpg
+ input: [proc; pid 2961; /usr/bin/pnmtojpeg] version 0
 annotation: dim=x
- annotation: run=base
- annotation: studyModality=mindreading
+ annotation: run=q7
+ annotation: studyModality=visual
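The same comparison can be reproduced without the diff binary. The sketch below runs Python's difflib over a few report lines copied from this slide; it only illustrates the text-diff strategy and is not part of the PASS tooling.

```python
import difflib

# Report lines for the two runs, taken from the slide above.
base = [
    "type: passfile",
    "name: challenge/atlas-x.gif",
    "annotation: dim=x",
    "annotation: run=base",
    "annotation: studyModality=mindreading",
]
variant = [
    "type: passfile",
    "name: challenge/q7/atlas-x.jpg",
    "annotation: dim=x",
    "annotation: run=q7",
    "annotation: studyModality=visual",
]

diff = list(difflib.unified_diff(base, variant, "q7-a.tmp", "q7-b.tmp", lineterm=""))

# Keep only added or removed lines, skipping the ---/+++ file headers.
changed = [
    line for line in diff
    if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
]
print("\n".join(diff))
```

A textual diff works here because the report format is stable line by line; process ids and other run-specific values will still show up as differences, as the pid lines on the slide do.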

Q8: Find UChicago align_warp outputs
Three-stage query:
– Find everything annotated with UChicago
INPUTS=`nq 'select ident from everything where $center == "UChicago" table'`
– Find those UChicago objects that are the result of align_warp
WARPS=`nq 'select ident from descendents { '"$INPUTS"' } depth 1 where type == "proc" && basename == "align_warp" table'`
– Now, find all the outputs of those processes
nq 'descendents { '"$WARPS"' } anchor type == "passfile" where type == "passfile" report'

Q8 Continued
Results
[passfile; challenge/q7/warp3.warp] version 1
type: passfile
name: challenge/q7/warp3.warp
[passfile; challenge/q7/warp2.warp] version 1
type: passfile
name: challenge/q7/warp2.warp
[passfile; challenge/warp3.warp] version 1
type: passfile
name: challenge/warp3.warp
[passfile; challenge/warp2.warp] version 1
type: passfile
name: challenge/warp2.warp
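The three-stage pattern used for Q8 (and for Q6) chains small queries, each consuming the identifiers produced by the previous one. The sketch below imitates that chaining on a toy graph. All object names, the choice of which inputs carry center=UChicago, and the one-hop simplification of the final stage are assumptions made for illustration.

```python
def children(outputs, nodes, keep):
    """Direct outputs of `nodes` that satisfy `keep` (a one-hop query)."""
    return {c for n in nodes for c in outputs.get(n, ()) if keep(c)}


# Toy graph: object -> its direct outputs (invented names).
outputs = {
    "anatomy1.img": {"proc:align_warp#1"},
    "anatomy2.img": {"proc:align_warp#2"},
    "anatomy3.img": {"proc:align_warp#3"},
    "proc:align_warp#1": {"warp1.warp"},
    "proc:align_warp#2": {"warp2.warp"},
    "proc:align_warp#3": {"warp3.warp"},
}
# Hypothetical $center annotations on the inputs.
center = {
    "anatomy1.img": "elsewhere",
    "anatomy2.img": "UChicago",
    "anatomy3.img": "UChicago",
}

# Stage 1: everything annotated with UChicago.
stage1 = {name for name, value in center.items() if value == "UChicago"}
# Stage 2: align_warp processes that consumed those objects.
stage2 = children(outputs, stage1, lambda n: n.startswith("proc:align_warp"))
# Stage 3: files written by those processes.
stage3 = children(outputs, stage2, lambda n: n.endswith(".warp"))
print(sorted(stage3))
```

Splitting the query into stages keeps each filter simple, at the cost of passing intermediate identifier lists between invocations.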

Q9: Find user annotations for objects where some annotations have a given value
Setup
– We added annotations to all six output images
– We annotated one set of outputs with modality visual and the other with modality mind-reading.
Query
nq 'select annotations from everything where (basename ~ "atlas*.gif" || basename ~ "atlas*.jpg") && ($studyModality == "speech" || $studyModality == "visual" || $studyModality == "audio") report'

Q9 Continued
Results
[passfile; challenge/q7/atlas-z.jpg] version 1
annotation: dim=z
annotation: run=q7
annotation: studyModality=visual
[passfile; challenge/q7/atlas-y.jpg] version 1
annotation: dim=y
annotation: run=q7
annotation: studyModality=visual
[passfile; challenge/q7/atlas-x.jpg] version 1
annotation: dim=x
annotation: run=q7
annotation: studyModality=visual
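The Q9 selection can be sketched as a filter over annotated objects. The two entries below reuse annotation values shown on the Q7 slide; the dictionary layout and the select function are illustrative assumptions, and glob semantics are again assumed for ~.

```python
from fnmatch import fnmatchcase
from posixpath import basename

# Annotations for one image from each run, as shown on the Q7 slide.
annotations = {
    "challenge/atlas-x.gif": {
        "dim": "x", "run": "base", "studyModality": "mindreading",
    },
    "challenge/q7/atlas-x.jpg": {
        "dim": "x", "run": "q7", "studyModality": "visual",
    },
}

WANTED = {"speech", "visual", "audio"}


def select(annotated):
    """Atlas images whose studyModality is one of the wanted values."""
    hits = {}
    for path, notes in annotated.items():
        name = basename(path)
        is_atlas_image = (
            fnmatchcase(name, "atlas*.gif") or fnmatchcase(name, "atlas*.jpg")
        )
        if is_atlas_image and notes.get("studyModality") in WANTED:
            hits[path] = notes
    return hits


hits = select(annotations)
print(hits)
```

Only the q7 image survives, because the base run was annotated with mindreading, which is not among the three requested modalities.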

Conclusions/Observations
We have the data
We are not UI people
Output is remarkably complete
– Sometimes makes it difficult to extract the information you want.
Output is BIG if you ask for everything, but …
… you can ask for everything and get it