Harvard's PASS Takes on the Provenance Challenge September 13, 2006 Margo Seltzer Harvard University Division of Engineering and Applied Sciences
Reminder: What is PASS? Storage systems (e.g., file systems) in which provenance is a first class entity. Provenance: –is generated and maintained as transparently as possible. –can be indexed and queried. –will be created from objects imported from non-PASS sources. –is maintained in the presence of deletes, copies, renames, etc.
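Collecting Provenance
Example command: % sort a > b
System calls observed: fork, open b (W), exec sort a, open a (R), read a, write b, close a, close b
[Figure: kernel diagram with the labels task_struct, inode cache, b, sort, and a; recorded attributes argv=sort a, name=sort, modules=pasta…, kernel=Linux…, env=USER…, input=sort, input=a; an arrow labeled "To file system".]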
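The trace on this slide can be read as a stream of events from which dependency records fall out. The following is a minimal sketch in Python, not PASS's kernel code: the event tuples, the pid, and the `collect` helper are all hypothetical, and the only rules assumed are that a process depends on every file it reads and a file depends on every process that writes it.

```python
from collections import defaultdict

# Hypothetical event stream for `sort a > b`: (pid, operation, target).
EVENTS = [
    (100, "exec", "sort a"),
    (100, "open_w", "b"),
    (100, "open_r", "a"),
    (100, "read", "a"),
    (100, "write", "b"),
]

def collect(events):
    """Return {object: set of inputs} derived from exec/read/write events."""
    inputs = defaultdict(set)
    for pid, op, target in events:
        proc = f"proc:{pid}"
        if op == "exec":
            # The command line is recorded as an attribute of the process.
            inputs[proc].add(f"argv:{target}")
        elif op == "read":
            # A process depends on each file it reads.
            inputs[proc].add(f"file:{target}")
        elif op == "write":
            # A file depends on each process that writes it.
            inputs[f"file:{target}"].add(proc)
        # open/close events carry no dependency in this sketch.
    return dict(inputs)

records = collect(EVENTS)
for obj, deps in sorted(records.items()):
    print(obj, "<-", sorted(deps))
```

In this toy model, b ends up depending on the sort process, and the sort process on a, which mirrors the input= labels in the figure.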
Things to Keep in Mind Our focus is provenance collection, not query. We collect provenance of everything. Provenance collection is done in the operating system. Queries are simply queries against the database maintained by the kernel. Our kernel database is Berkeley DB.
Results Summary Workflow: we ran the shell script –Dropped in all the programs and simply ran them on Linux. –Chose not to run the slicer, because the license worried us. Query: command-line query tool: nq –Successfully ran all queries –Generated a lot more output than you really want. –Strategy is to keep everything and provide pruning to let users see what they want.
Query Tool: NQ
General form:
–nq [SELECTION] SEARCH [FILTER] OUTPUT-TYPE
SELECTION: select FIELD … from
FIELD: FIELD-NAME, concat(FIELD-NAME), $ANNOTATION, nameof(FIELD-NAME), typeof(FIELD-NAME)
SEARCH: ancestors FILE*, descendents FILE*, everything
FILTER: depth NUM, anchor EXPR, hide TYPE, where EXPR
OUTPUT-TYPE: report, report html, table
EXPR: existing, nonexisting, EXPR op EXPR
Q1: Provenance of Graphic X
nq 'ancestors atlas-x.gif report'
[passfile; challenge/atlas-x.gif] version 1
type: passfile
name: challenge/atlas-x.gif
input: [proc; pid 2937; /usr/local/bin/convert] version 0
annotation: dim=x
annotation: run=base
annotation: studyModality=mindreading
And 4806 other objects…
Results: QUERIES\q1.html
Q2: Q1, excluding everything prior to softmean
Query: nq 'ancestors atlas-x.gif anchor (type == "proc" && name == "AIR5.2.5/bin/softmean") report'
Result: essentially a subset of Q1
–only 148 objects identified
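One way to picture what the anchor does in Q2 is an ancestor walk that stops at the first object matching the anchor expression. The sketch below is a toy in Python, not nq's implementation: the graph is made up, and reading "anchor" as "report this object but do not walk past it" is an assumption inferred from the Q2 result being a subset of Q1.

```python
from collections import deque

# Hypothetical provenance graph: object -> its direct inputs.
INPUTS = {
    "atlas-x.gif": ["convert"],
    "convert": ["atlas.img"],
    "atlas.img": ["softmean"],
    "softmean": ["resliced1.img"],
    "resliced1.img": ["reslice"],
}

def ancestors(start, inputs, anchor=None):
    """Breadth-first ancestor walk; stops expanding at anchor matches."""
    seen, order = set(), []
    queue = deque(inputs.get(start, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        if anchor is not None and anchor(node):
            continue  # assumed semantics: keep the anchor, skip its inputs
        queue.extend(inputs.get(node, []))
    return order

everything = ancestors("atlas-x.gif", INPUTS)
anchored = ancestors("atlas-x.gif", INPUTS, anchor=lambda n: n == "softmean")
print(len(everything), "objects without anchor;", len(anchored), "with anchor")
```

The anchored walk returns only the objects between the image and softmean, which is the same kind of pruning that takes Q1's output down to the smaller Q2 result.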
Q3: Q2 w/stages We did not create annotations to map to stages, so this query degenerates to the same one as Query 2.
Q4: align_warp w/specific parameter values
Query: nq 'everything where basename == "align_warp" && concat(argv) ~ "*-m 12*" && freezetime ~ "*Mon*" report'
Results:
–We did our run on Monday
–Returns 8 instances: four from the main workload, four from the variant workload used in Query 7
Q5: images with max=4095
Two alternate approaches:
–Three phase solution
  Create list of header files that are ancestors of align_warp
  Pass list of files to scanheader; grep max=4095
  Find all the descendents of the headers
–Annotation approach
  Run scanheader on all headers
  Make results of scanheader annotations
  Query on the annotations
–We used the first approach
Q5 Continued
Create list of files to query:
ALIGN_WARPS=`$NQ $NQOPTS 'select ident from everything where type == "proc" && basename == "align_warp" table'`
$NQ $NQOPTS 'select name from ancestors { '"$ALIGN_WARPS"' } depth where basename ~ "*.hdr" table'
Call scanheader on everything returned above, selecting those files where max=4095
Q5 Continued Query on the list returned above nq 'descendents { anatomy1.hdr anatomy2.hdr anatomy3.hdr anatomy4.hdr } where basename ~ "atlas*.gif" || basename ~ "atlas*.jpg" report' Results
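The three phases of Q5 can be sketched end to end on a toy graph. This is an illustration in Python under stated assumptions, not the nq implementation: the edge list, the object names, and the `HEADER_MAX` table (standing in for scanheader output) are all made up, and `reach` is a hypothetical helper.

```python
from fnmatch import fnmatch

# Toy edges (input -> output); every name here is made up for illustration.
EDGES = [
    ("a.hdr", "align_warp:1"), ("a.img", "align_warp:1"),
    ("b.hdr", "align_warp:2"),
    ("align_warp:1", "w1.warp"), ("align_warp:2", "w2.warp"),
    ("w1.warp", "atlas-x.gif"), ("w2.warp", "atlas-y.gif"),
]
# Stand-in for running scanheader on each header and reading its max value.
HEADER_MAX = {"a.hdr": 4095, "b.hdr": 255}

def reach(starts, edges, forward):
    """Objects reachable from starts, following edges forward or backward."""
    step = {}
    for src, dst in edges:
        key, val = (src, dst) if forward else (dst, src)
        step.setdefault(key, []).append(val)
    seen, stack = set(), list(starts)
    while stack:
        for nxt in step.get(stack.pop(), []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

# Phase 1: header files that are ancestors of align_warp processes.
warps = {n for pair in EDGES for n in pair if n.startswith("align_warp")}
headers = sorted(n for n in reach(warps, EDGES, forward=False)
                 if fnmatch(n, "*.hdr"))
# Phase 2: keep the headers whose max is 4095.
selected = [h for h in headers if HEADER_MAX[h] == 4095]
# Phase 3: atlas images among the descendants of those headers.
images = sorted(n for n in reach(selected, EDGES, forward=True)
                if fnmatch(n, "atlas*.gif") or fnmatch(n, "atlas*.jpg"))
print(images)
```

The middle phase is the only one that looks inside file contents, which is why the slides run it outside the query tool.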
Q6: images produced by softmean with a particular align_warp parameter
Three stage query:
–Find align_warp processes
ALIGN_WARPS=`nq 'select ident from everything where type == "proc" && basename == "align_warp" && concat(argv) ~ "*-m 12*" table'`
–Find appropriate softmean processes
SOFTMEANS=`nq 'select ident from descendents { '"$ALIGN_WARPS"' } where type == "proc" && basename == "softmean" table'`
–Find images produced by softmean processes
nq 'select name from descendents { '"$SOFTMEANS"' } depth 1 where type == "passfile" && basename ~ "*.img" report'
Results:
[passfile; challenge/q7/atlas.img] version 1
name: challenge/q7/atlas.img
[passfile; challenge/atlas.img] version 1
name: challenge/atlas.img
Q7: Difference between original and new workflow
We use standard diff of textual output:
nq 'ancestors atlas-x.gif report' > q7-a.tmp
nq 'ancestors q7/atlas-x.jpg report' > q7-b.tmp
diff -u q7-a.tmp q7-b.tmp
Result:
[passfile; challenge/atlas-x.gif] version
[passfile; challenge/q7/atlas-x.jpg] version 1
 type: passfile
- name: challenge/atlas-x.gif
- input: [proc; pid 2937; /usr/local/bin/convert] version 0
+ name: challenge/q7/atlas-x.jpg
+ input: [proc; pid 2961; /usr/bin/pnmtojpeg] version 0
 annotation: dim=x
- annotation: run=base
- annotation: studyModality=mindreading
+ annotation: run=q7
+ annotation: studyModality=visual
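The same comparison can be reproduced without the shell. The sketch below uses Python's standard `difflib.unified_diff` on two short report excerpts copied from the results on these slides; it illustrates the diff step only and assumes the two reports have already been captured as text.

```python
import difflib

# Excerpts of the two reports, taken from the results shown on the slides.
base = [
    "[passfile; challenge/atlas-x.gif] version 1",
    "type: passfile",
    "name: challenge/atlas-x.gif",
    "annotation: dim=x",
    "annotation: run=base",
]
variant = [
    "[passfile; challenge/q7/atlas-x.jpg] version 1",
    "type: passfile",
    "name: challenge/q7/atlas-x.jpg",
    "annotation: dim=x",
    "annotation: run=q7",
]

# lineterm="" because the input lines carry no trailing newlines.
diff = list(difflib.unified_diff(base, variant,
                                 fromfile="q7-a.tmp", tofile="q7-b.tmp",
                                 lineterm=""))
print("\n".join(diff))
```

Lines common to both runs appear as context with a leading space, so only the objects and annotations that differ between the two workflows stand out.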
Q8: Find UChicago align_warp outputs
Three stage query:
–Find everything annotated with UChicago
INPUTS=`nq 'select ident from everything where $center == "UChicago" table'`
–Find the align_warp processes that directly consume those UChicago objects
WARPS=`nq 'select ident from descendents { '"$INPUTS"' } depth 1 where type == "proc" && basename == "align_warp" table'`
–Now, find all the outputs of those processes
nq 'descendents { '"$WARPS"' } anchor type == "passfile" where type == "passfile" report'
Q8 Continued
Results:
[passfile; challenge/q7/warp3.warp] version 1
type: passfile
name: challenge/q7/warp3.warp
[passfile; challenge/q7/warp2.warp] version 1
type: passfile
name: challenge/q7/warp2.warp
[passfile; challenge/warp3.warp] version 1
type: passfile
name: challenge/warp3.warp
[passfile; challenge/warp2.warp] version 1
type: passfile
name: challenge/warp2.warp
Q9: Find user annotations for objects where some annotations have a given value
Setup:
–We added annotations to all six output images
–We annotated one set of outputs with studyModality=visual and the other with studyModality=mindreading.
Query:
nq 'select annotations from everything where (basename ~ "atlas*.gif" || basename ~ "atlas*.jpg") && ($studyModality == "speech" || $studyModality == "visual" || $studyModality == "audio") report'
Q9 Continued
Results:
[passfile; challenge/q7/atlas-z.jpg] version 1
annotation: dim=z
annotation: run=q7
annotation: studyModality=visual
[passfile; challenge/q7/atlas-y.jpg] version 1
annotation: dim=y
annotation: run=q7
annotation: studyModality=visual
[passfile; challenge/q7/atlas-x.jpg] version 1
annotation: dim=x
annotation: run=q7
annotation: studyModality=visual
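The Q9 filter combines a glob match on the file's basename with a test on one annotation. Here is a minimal sketch of that logic in Python, assuming annotations are held as a plain dictionary per object; the `q9` function and the data layout are hypothetical, while the object names and annotation values are the ones shown in the Q1 and Q9 results.

```python
from fnmatch import fnmatch
from posixpath import basename

# Annotations as shown in the Q1 and Q9 results on the slides.
ANNOTATIONS = {
    "challenge/atlas-x.gif":
        {"dim": "x", "run": "base", "studyModality": "mindreading"},
    "challenge/q7/atlas-x.jpg":
        {"dim": "x", "run": "q7", "studyModality": "visual"},
    "challenge/q7/atlas-y.jpg":
        {"dim": "y", "run": "q7", "studyModality": "visual"},
    "challenge/q7/atlas-z.jpg":
        {"dim": "z", "run": "q7", "studyModality": "visual"},
}
WANTED = {"speech", "visual", "audio"}

def q9(objects):
    """Return annotations of atlas images whose studyModality is wanted."""
    hits = {}
    for name, notes in objects.items():
        base = basename(name)
        is_image = fnmatch(base, "atlas*.gif") or fnmatch(base, "atlas*.jpg")
        if is_image and notes.get("studyModality") in WANTED:
            hits[name] = notes
    return hits

matches = q9(ANNOTATIONS)
for name in sorted(matches):
    print(name, matches[name])
```

Only the q7 images match, because the base run was annotated with studyModality=mindreading, which is not among the three requested values.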
Conclusions/Observations We have the data We are not UI people Output is remarkably complete –Sometimes makes it difficult to extract the information you want. Output is BIG if you ask for everything, but … … you can ask for everything and get it