Workflows for Social Science
Ken Turner
Computing Science and Mathematics
31st January 2012
Workflows in Social Science
• low-level (micro) flows are sequences of steps using some statistical package, e.g.:
  - retrieve datasets D1 and D2
  - recode variable V1
  - cross-tabulate V1 and V2
• high-level (macro) flows combine the capabilities of separate services, e.g.:
  - data retrieval
  - data cleaning
  - data fusion
  - data analysis
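The low-level flow above can be sketched in plain Python rather than a statistical package. This is only an illustration: the records in D1 and D2, the choice of age as V1 and employment as V2, the age bands, and the linking of the two datasets on a shared id are all assumptions, not taken from the slides.

```python
from collections import Counter

# Hypothetical stand-ins for datasets D1 and D2 (invented records).
D1 = [
    {"id": 1, "age": 23},
    {"id": 2, "age": 47},
    {"id": 3, "age": 35},
    {"id": 4, "age": 61},
]
D2 = [
    {"id": 1, "employed": "yes"},
    {"id": 2, "employed": "yes"},
    {"id": 3, "employed": "no"},
    {"id": 4, "employed": "no"},
]


def retrieve():
    """Step 1: retrieve D1 and D2; linking them on 'id' is an assumption."""
    by_id = {row["id"]: dict(row) for row in D1}
    for row in D2:
        by_id[row["id"]].update(row)
    return list(by_id.values())


def recode_age(rows):
    """Step 2: recode V1 (here, age) into two illustrative bands."""
    for row in rows:
        row["age_band"] = "under 40" if row["age"] < 40 else "40 and over"
    return rows


def crosstab(rows, v1, v2):
    """Step 3: cross-tabulate V1 and V2 as a count per (v1, v2) pair."""
    return Counter((row[v1], row[v2]) for row in rows)


table = crosstab(recode_age(retrieve()), "age_band", "employed")
for (band, employed), count in sorted(table.items()):
    print(band, employed, count)
```

Each function corresponds to one step of the micro flow, so the sequence is explicit in the final call chain.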
High-Level Workflows in DAMES
• an approach has been developed for high-level workflows in social science:
  - the services are external, being packages that conform to web/grid computing standards
  - the workflow logic is defined graphically
  - this is automatically analysed, and translated into BPEL (Business Process Execution Language)
• the supporting tools are:
  - CRESS: workflow definition and translation
  - ActiveBPEL: workflow orchestration
Statistical Analysis Services
• services appearing in workflows can be supported by statistical packages:
  - a syntax file (R, Stata, …) is mapped to a web service (with a little help)
  - services to call these are automatically generated
  - an overall workflow using these services can be defined and uploaded to the DAMES portal
• this encourages:
  - modularity and re-use of analyses
  - flexible combination of statistical scripts
CRESS
• Communication Representation Employing Systematic Specification:
  - graphical workflow notation
  - application/language/platform-independent
  - automated analysis and implementation
  - mature, having been developed over 14 years
• supported by other packages:
  - CHIVE: graphical workflow editor
  - MUSTARD: workflow validator
  - CLOVE: workflow verifier
  - MINT: performance analyser
CRESS Methodology
[Figure: CRESS methodology. Stages: Workflow Diagram, Precise Specification, Rigorous Analysis, Implementation Code, Performance Analysis. Transitions: automatic specification, validation/verification, automatic compilation, scenario evaluation, design corrections.]
CRESS Example
• the following example illustrates mapping one occupation to two different schemes
• only an outline is given, omitting the details
• the cooperating services are:
  - lookup: performs parallel mapping (workflow)
  - allocator: finds an available job mapper then does the mapping (workflow)
  - factory: manages mapper resources (partner)
  - mapper: performs a mapping for some scheme (partner)
Parallel Job Translation
[Figure: CRESS workflow diagram with the following nodes]
1 Receive lookup.job.translate schemes
2 Fork
3 Invoke allocator.job.translate mapping1 code1
4 Invoke allocator.job.translate mapping2 code2
5 Join
6 Reply lookup.job.translate codes
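The fork/join shape of this diagram can be sketched with Python's standard thread pool. This is not CRESS or BPEL output; the scheme names, the occupation, the placeholder codes and the function names are hypothetical, and the allocator is reduced to a local table lookup.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical scheme tables with placeholder codes (not real classifications).
SCHEMES = {
    "schemeA": {"nurse": "A-001"},
    "schemeB": {"nurse": "B-001"},
}


def allocator_translate(job, scheme):
    """Stand-in for invoking allocator.job.translate (nodes 3 and 4)."""
    return SCHEMES[scheme][job]


def lookup_translate(job, scheme1, scheme2):
    """Receive (1), Fork (2), two Invokes (3, 4), Join (5), Reply (6)."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Fork: both invocations are submitted before either result is awaited.
        future1 = pool.submit(allocator_translate, job, scheme1)
        future2 = pool.submit(allocator_translate, job, scheme2)
        # Join: block until both branches have finished.
        code1 = future1.result()
        code2 = future2.result()
    # Reply: return both codes together.
    return code1, code2


codes = lookup_translate("nurse", "schemeA", "schemeB")
print(codes)
```

The reply is only produced once both branches have returned, which mirrors the Join node preceding the Reply node.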
Job Mapper Allocation
[Figure: CRESS workflow diagram with the following nodes]
1 Receive allocator.job.translate mapping
2 Invoke factory.job.allocator scheme mapper
3 Invoke mapper.job.translate job mapping
4 Reply allocator.job.translate mapping
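A minimal sketch of the allocation sequence, assuming the factory keeps a pool of free mappers per scheme. The class names, the placeholder table, and in particular the release step are assumptions; the diagram shows only the allocation and the translation.

```python
class Mapper:
    """Hypothetical mapper partner: translates a job title for one scheme."""

    def __init__(self, scheme, table):
        self.scheme = scheme
        self.table = table

    def translate(self, job):
        return self.table[job]


class Factory:
    """Hypothetical factory partner: manages a pool of mappers per scheme."""

    def __init__(self, tables):
        self._free = {s: [Mapper(s, t)] for s, t in tables.items()}

    def allocate(self, scheme):
        # Raises IndexError if no mapper for this scheme is currently free.
        return self._free[scheme].pop()

    def release(self, mapper):
        self._free[mapper.scheme].append(mapper)


def allocate_and_translate(factory, job, scheme):
    """Receive (1), Invoke factory (2), Invoke mapper (3), Reply (4)."""
    mapper = factory.allocate(scheme)      # node 2: scheme in, mapper out
    try:
        return mapper.translate(job)       # node 3: job in, mapping out
    finally:
        factory.release(mapper)            # assumption: not shown in the diagram


factory = Factory({"schemeA": {"nurse": "A-001"}})
mapping = allocate_and_translate(factory, "nurse", "schemeA")
print(mapping)
```

Releasing the mapper in a `finally` clause keeps the pool usable even when a translation fails, which is one plausible way to manage the resource.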
Summary
• low-level workflows define the sequence of basic steps in a statistical package
• high-level workflows invoke external analysis services and combine their results
• workflows can use scripts for various statistical packages mapped to services
• CRESS allows high-level workflows to be defined, analysed and executed