Harvard's PASS Takes on the Provenance Challenge
September 13, 2006
Margo Seltzer
Harvard University, Division of Engineering and Applied Sciences

Reminder: What is PASS?
Storage systems (e.g., file systems) in which provenance is a first-class entity.
Provenance:
– is generated and maintained as transparently as possible.
– can be indexed and queried.
– will be created from objects imported from non-PASS sources.
– is maintained in the presence of deletes, copies, renames, etc.

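Collecting Provenance
Example command: % sort a > b
System calls observed: fork, open b (W), exec sort a, open a (R), read a, write b, close a, close b
[Figure: kernel data structures (task_struct, inode cache) holding provenance records that are sent on to the file system. Labels shown: b; argv=sort a; name=sort; modules=pasta…; kernel=Linux…; env=USER…; sort; input=sort a; input=a]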
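The collection step can be modeled in user space as a rough sketch. The trace format and the `collect_provenance` helper below are hypothetical; PASS does this inside the kernel, and the sketch only shows one plausible reading of how the observed calls turn into `input=` records.

```python
from collections import defaultdict

# Hypothetical trace of the calls issued for `sort a > b`: (call, argument, mode).
TRACE = [
    ("fork", None, None),
    ("open", "b", "W"),
    ("exec", "sort a", None),
    ("open", "a", "R"),
    ("read", "a", None),
    ("write", "b", None),
    ("close", "a", None),
    ("close", "b", None),
]


def collect_provenance(trace):
    """Map each object to the set of objects it was derived from."""
    inputs = defaultdict(set)
    process = None  # label of the current process, known once exec runs
    for call, arg, _mode in trace:
        if call == "exec":
            process = arg
        elif call == "read" and process is not None:
            inputs[process].add(arg)   # a process depends on what it reads
        elif call == "write" and process is not None:
            inputs[arg].add(process)   # a written file depends on its writer
    return dict(inputs)


records = collect_provenance(TRACE)
for obj, deps in records.items():
    for dep in sorted(deps):
        print(f"{obj}: input={dep}")
```

Only reads and writes create edges here; opens and closes are ignored, which is a simplification chosen for the sketch rather than something the slides state.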

Things to Keep in Mind
– Our focus is provenance collection, not query.
– We collect provenance of everything.
– Provenance collection is done in the operating system.
– Queries are simply queries against the database maintained by the kernel.
– Our kernel database is Berkeley DB.

Results Summary
Workflow: we ran the shell script.
– Dropped in all the programs and simply ran them on Linux.
– Chose not to run the slicer, because the license worried us.
Query: command-line query tool, nq.
– Successfully ran all queries.
– Generated a lot more output than you really want.
– Strategy is to keep everything and provide pruning to let users see what they want.

Query Tool: NQ
General form:
– nq [SELECTION] SEARCH [FILTER] OUTPUT-TYPE
SELECTION: select FIELD … from
FIELD: FIELD-NAME, concat(FIELD-NAME), $ANNOTATION, nameof(FIELD-NAME), typeof(FIELD-NAME)
SEARCH: ancestors FILE*, descendents FILE*, everything
FILTER: depth NUM, anchor EXPR, hide TYPE, where EXPR
OUTPUT-TYPE: report, report html, table
EXPR: existing, nonexisting, EXPR op EXPR
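To make the SEARCH and FILTER parts concrete, here is a minimal sketch of `ancestors`, `descendents`, `depth`, and `where` over a toy in-memory graph. The graph contents and function names are illustrative assumptions; nq itself queries the kernel's Berkeley DB database, and its internals are not shown in the slides.

```python
# Toy provenance graph: object -> list of direct inputs (names are illustrative).
INPUTS = {
    "atlas-x.gif": ["convert"],
    "convert": ["atlas.img"],
    "atlas.img": ["softmean"],
    "softmean": ["warped1.img"],
    "warped1.img": [],
}


def ancestors(graph, start, depth=None, where=None):
    """Breadth-first walk over input edges, with optional depth limit and filter."""
    seen, frontier, level = [], [start], 0
    while frontier and (depth is None or level < depth):
        next_frontier = []
        for node in frontier:
            for parent in graph.get(node, []):
                if parent not in seen:
                    seen.append(parent)
                    next_frontier.append(parent)
        frontier, level = next_frontier, level + 1
    return [n for n in seen if where is None or where(n)]


def invert(graph):
    """Flip the input edges so the same walk yields descendents."""
    flipped = {}
    for child, parents in graph.items():
        for parent in parents:
            flipped.setdefault(parent, []).append(child)
    return flipped


print(ancestors(INPUTS, "atlas-x.gif", depth=1))
print(ancestors(invert(INPUTS), "softmean", depth=2))
```

Running the same walk over the inverted graph is a convenience of the sketch; it keeps `depth` and `where` behaving identically in both directions.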

Q1: Provenance of Graphic X
Query:
nq 'ancestors atlas-x.gif report'
Output:
922.0 [passfile; challenge/atlas-x.gif] version 1
  type: passfile
  name: challenge/atlas-x.gif
  input: 922.2 [proc; pid 2937; /usr/local/bin/convert] version 0
  annotation: dim=x
  annotation: run=base
  annotation: studyModality=mindreading
And 4806 other objects…
Results: QUERIES\q1.html

Q2: Q1 excluding everything prior to softmean
Query:
nq 'ancestors atlas-x.gif anchor (type == "proc" && name == "AIR5.2.5/bin/softmean") report'
Result: essentially a subset of Q1
– only 148 objects identified
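The effect of `anchor` in this query can be sketched as a traversal that stops at matching objects. That reading is an assumption drawn from how the queries on these slides use it (the anchor object is reported, but nothing beyond it); the graph and the `anchored_ancestors` helper are hypothetical.

```python
# Toy graph: object -> direct inputs (names are illustrative).
GRAPH = {
    "atlas-x.gif": ["convert"],
    "convert": ["atlas.img"],
    "atlas.img": ["softmean"],
    "softmean": ["resliced1.img", "resliced2.img"],
    "resliced1.img": ["reslice"],
    "resliced2.img": ["reslice"],
}


def anchored_ancestors(graph, start, is_anchor):
    """Collect ancestors, but do not walk past objects matching is_anchor."""
    result, stack = [], [start]
    while stack:
        node = stack.pop()
        for parent in graph.get(node, []):
            if parent in result:
                continue
            result.append(parent)
            if not is_anchor(parent):   # anchors are kept but not expanded
                stack.append(parent)
    return result


pruned = anchored_ancestors(GRAPH, "atlas-x.gif", lambda n: n == "softmean")
full = anchored_ancestors(GRAPH, "atlas-x.gif", lambda n: False)
print(pruned)
print(len(full))
```

With an anchor that never matches, the walk degenerates to the plain ancestor query, which mirrors the relationship between Q1 and Q2.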

Q3: Q2 w/stages
We did not create annotations to map to stages, so this query degenerates to the same one as Query 2.

Q4: align_warp w/specific parameter values
Query:
nq 'everything where basename == "align_warp" && concat(argv) ~ "*-m 12*" && freezetime ~ "*Mon*" report'
Results:
– We did our run on Monday.
– Returns 8 instances:
  – Four from the main workload
  – Four from the variant workload used in Query 7
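The `where` clause can be approximated as a predicate over process records. This sketch assumes that `~` is a shell-style glob match and that `concat(argv)` joins the arguments with spaces; neither is spelled out on the slide, and the records below are made up for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical process records; field names follow the query, values are invented.
PROCS = [
    {"basename": "align_warp",
     "argv": ["align_warp", "anatomy1.img", "reference.img", "warp1.warp", "-m", "12", "-q"],
     "freezetime": "Mon Sep 11 10:00:00 2006"},
    {"basename": "align_warp",
     "argv": ["align_warp", "anatomy2.img", "reference.img", "warp2.warp", "-m", "6"],
     "freezetime": "Mon Sep 11 10:00:05 2006"},
    {"basename": "reslice",
     "argv": ["reslice", "warp1.warp", "resliced1"],
     "freezetime": "Tue Sep 12 09:00:00 2006"},
]


def q4(procs):
    """Keep align_warp runs whose arguments contain '-m 12' and that ran on a Monday."""
    return [
        p for p in procs
        if p["basename"] == "align_warp"
        and fnmatchcase(" ".join(p["argv"]), "*-m 12*")
        and fnmatchcase(p["freezetime"], "*Mon*")
    ]


matches = q4(PROCS)
print([p["argv"][3] for p in matches])
```

Under this glob reading the pattern `*-m 12*` would also match an argument such as `-m 120`, so a stricter pattern may be needed when parameter values share prefixes.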

Q5: images with max=4095
Two alternate approaches:
– Three-phase solution
  – Create list of header files that are ancestors of align_warp
  – Pass list of files to scanheader; grep max=4095
  – Find all the descendants of the headers
– Annotation approach
  – Run scanheader on all headers
  – Make results of scanheader annotations
  – Query on the annotations
– We used the first approach

Q5 Continued
Create list of files to query:
ALIGN_WARPS=`$NQ $NQOPTS 'select ident from everything where type == "proc" && basename == "align_warp" table'`
$NQ $NQOPTS 'select name from ancestors { '"$ALIGN_WARPS"' } depth where basename ~ "*.hdr" table'
Call scanheader on everything returned above, selecting those files where max=4095.

Q5 Continued
Query on the list returned above:
nq 'descendents { anatomy1.hdr anatomy2.hdr anatomy3.hdr anatomy4.hdr } where basename ~ "atlas*.gif" || basename ~ "atlas*.jpg" report'
Results
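The three phases of the Q5 solution can be sketched end to end on a toy graph. Everything here is an illustrative assumption: the graph, the `HEADER_MAX` table standing in for scanheader output, and the helper names. The header values are invented so that the filter has something to exclude.

```python
from fnmatch import fnmatchcase
from os.path import basename

# Toy graph: object -> direct inputs (illustrative names).
INPUTS = {
    "align_warp#1": ["anatomy1.img", "anatomy1.hdr"],
    "align_warp#2": ["anatomy2.img", "anatomy2.hdr"],
    "warp1.warp": ["align_warp#1"],
    "warp2.warp": ["align_warp#2"],
    "atlas-x.gif": ["warp1.warp"],
    "q7/atlas-x.jpg": ["warp2.warp"],
}
# Stand-in for running scanheader on each header (values are made up).
HEADER_MAX = {"anatomy1.hdr": 4095, "anatomy2.hdr": 2048}


def descendants(graph, starts):
    """All objects reachable from starts by following input edges forward."""
    children = {}
    for child, parents in graph.items():
        for parent in parents:
            children.setdefault(parent, set()).add(child)
    found, stack = set(), list(starts)
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


# Phase 1: header files that are direct inputs of align_warp processes.
headers = sorted({i for p, ins in INPUTS.items() if p.startswith("align_warp")
                  for i in ins if i.endswith(".hdr")})
# Phase 2: keep the headers whose maximum is 4095.
selected = [h for h in headers if HEADER_MAX[h] == 4095]
# Phase 3: atlas graphics descended from the selected headers.
images = sorted(o for o in descendants(INPUTS, selected)
                if fnmatchcase(basename(o), "atlas*.gif")
                or fnmatchcase(basename(o), "atlas*.jpg"))
print(images)
```

The middle phase happens outside the provenance store, which is why the slides run it as a separate scanheader step between two nq invocations.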

Q6: images produced by softmean with a particular align_warp parameter
Three-stage query:
– Find align_warp processes
ALIGN_WARPS=`nq 'select ident from everything where type == "proc" && basename == "align_warp" && concat(argv) ~ "*-m 12*" table'`
– Find appropriate softmean processes
SOFTMEANS=`nq 'select ident from descendents { '"$ALIGN_WARPS"' } where type == "proc" && basename == "softmean" table'`
– Find images produced by softmean processes
nq 'select name from descendents { '"$SOFTMEANS"' } depth 1 where type == "passfile" && basename ~ "*.img" report'
Results:
940.0 [passfile; challenge/q7/atlas.img] version 1
  name: challenge/q7/atlas.img
917.0 [passfile; challenge/atlas.img] version 1
  name: challenge/atlas.img

Q7: Difference between original and new workflow
We use standard diff of textual output:
nq 'ancestors atlas-x.gif report' > q7-a.tmp
nq 'ancestors q7/atlas-x.jpg report' > q7-b.tmp
diff -u q7-a.tmp q7-b.tmp
Result:
-922.0 [passfile; challenge/atlas-x.gif] version 1
+945.0 [passfile; challenge/q7/atlas-x.jpg] version 1
   type: passfile
-  name: challenge/atlas-x.gif
-  input: 922.2 [proc; pid 2937; /usr/local/bin/convert] version 0
+  name: challenge/q7/atlas-x.jpg
+  input: 945.2 [proc; pid 2961; /usr/bin/pnmtojpeg] version 0
   annotation: dim=x
-  annotation: run=base
-  annotation: studyModality=mindreading
+  annotation: run=q7
+  annotation: studyModality=visual
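The same comparison can be reproduced without the nq tool by diffing the two report excerpts directly. This sketch uses Python's `difflib` in place of `diff -u`; the report text is the first record of each run as shown on the slide, and the indentation of the fields is an assumption.

```python
import difflib

# First record of each report, as shown on the slide (indentation assumed).
REPORT_A = """\
922.0 [passfile; challenge/atlas-x.gif] version 1
  type: passfile
  name: challenge/atlas-x.gif
  input: 922.2 [proc; pid 2937; /usr/local/bin/convert] version 0
  annotation: dim=x
  annotation: run=base
  annotation: studyModality=mindreading
""".splitlines()

REPORT_B = """\
945.0 [passfile; challenge/q7/atlas-x.jpg] version 1
  type: passfile
  name: challenge/q7/atlas-x.jpg
  input: 945.2 [proc; pid 2961; /usr/bin/pnmtojpeg] version 0
  annotation: dim=x
  annotation: run=q7
  annotation: studyModality=visual
""".splitlines()

# unified_diff yields header lines followed by context, removal, and addition lines.
diff = list(difflib.unified_diff(REPORT_A, REPORT_B, "q7-a.tmp", "q7-b.tmp", lineterm=""))
print("\n".join(diff))
```

A line-oriented diff works here because the report format puts one field per line; it treats renumbered object identifiers as changes, so full reports will show more differences than the workflows themselves have.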

Q8: Find UChicago align_warp outputs
Three-stage query:
– Find everything annotated with UChicago
INPUTS=`nq 'select ident from everything where $center == "UChicago" table'`
– Find the align_warp processes that directly descend from those UChicago objects
WARPS=`nq 'select ident from descendents { '"$INPUTS"' } depth 1 where type == "proc" && basename == "align_warp" table'`
– Now, find all the outputs of those processes
nq 'descendents { '"$WARPS"' } anchor type == "passfile" where type == "passfile" report'

Q8 Continued
Results:
930.0 [passfile; challenge/q7/warp3.warp] version 1
  type: passfile
  name: challenge/q7/warp3.warp
929.0 [passfile; challenge/q7/warp2.warp] version 1
  type: passfile
  name: challenge/q7/warp2.warp
907.0 [passfile; challenge/warp3.warp] version 1
  type: passfile
  name: challenge/warp3.warp
906.0 [passfile; challenge/warp2.warp] version 1
  type: passfile
  name: challenge/warp2.warp

Q9: Find user annotations for objects where some annotations have a given value
Setup:
– We added annotations to all six output images.
– We annotated one set of outputs with modality visual and the other with modality mindreading.
Query:
nq 'select annotations from everything where (basename ~ "atlas*.gif" || basename ~ "atlas*.jpg") && ($studyModality == "speech" || $studyModality == "visual" || $studyModality == "audio") report'
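The annotation filter in this query amounts to a name match combined with a set-membership test. The sketch below assumes `~` is a glob match and models annotations as a plain dictionary; the image records reuse values shown elsewhere in the deck, while the unannotated `atlas.img` entry is hypothetical.

```python
from fnmatch import fnmatchcase
from os.path import basename

OBJECTS = [
    {"name": "challenge/atlas-x.gif",
     "annotations": {"dim": "x", "run": "base", "studyModality": "mindreading"}},
    {"name": "challenge/q7/atlas-x.jpg",
     "annotations": {"dim": "x", "run": "q7", "studyModality": "visual"}},
    {"name": "challenge/q7/atlas-y.jpg",
     "annotations": {"dim": "y", "run": "q7", "studyModality": "visual"}},
    # Hypothetical object with no user annotations.
    {"name": "challenge/atlas.img", "annotations": {}},
]

WANTED = {"speech", "visual", "audio"}


def q9(objects):
    """Return (name, annotations) for atlas graphics with a wanted studyModality."""
    hits = []
    for obj in objects:
        base = basename(obj["name"])
        is_graphic = fnmatchcase(base, "atlas*.gif") or fnmatchcase(base, "atlas*.jpg")
        if is_graphic and obj["annotations"].get("studyModality") in WANTED:
            hits.append((obj["name"], obj["annotations"]))
    return hits


for name, notes in q9(OBJECTS):
    print(name, notes)
```

Objects without a `studyModality` annotation simply fail the membership test, so no separate existence check is needed in this model.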

Q9 Continued
Results:
947.0 [passfile; challenge/q7/atlas-z.jpg] version 1
  annotation: dim=z
  annotation: run=q7
  annotation: studyModality=visual
946.0 [passfile; challenge/q7/atlas-y.jpg] version 1
  annotation: dim=y
  annotation: run=q7
  annotation: studyModality=visual
945.0 [passfile; challenge/q7/atlas-x.jpg] version 1
  annotation: dim=x
  annotation: run=q7
  annotation: studyModality=visual

Conclusions/Observations
– We have the data.
– We are not UI people.
– Output is remarkably complete.
  – Sometimes makes it difficult to extract the information you want.
– Output is BIG if you ask for everything, but …
– … you can ask for everything and get it.
