1
1 Accelerating the Scientific Exploration Process with Scientific Workflows. Ilkay Altintas, Assistant Director, National Laboratory for Advanced Data Research; Manager, Scientific Workflow Automation Technologies Laboratory; San Diego Supercomputer Center, UCSD
2
2 Why do we need to know about scientific workflow systems?
3
3 The answer lies in how today's science is being conducted! Traditional scientific process: Observe, Hypothesize, Conduct experiment, Analyze data, Compare results and conclude, Predict. “A process cannot be understood by stopping it. Understanding must move with the flow of the process, must join it and flow with it.” (First Law of Mentat), Frank Herbert, Dune.
4
4 The answer lies in how today's science is being conducted! Today's scientific process: Observe, Hypothesize, Conduct experiment, Analyze data, Compare results and conclude, Predict. More to add to this picture: network, Grid, portals, and more. Observing / Data: microscopes, telescopes, particle accelerators, X-rays, MRIs, microarrays, satellite-based sensors, sensor networks, field studies… Analysis, Prediction / Models and model execution: potentially large computation and visualization.
5
5 Increasing Usage of Technology in Geosciences: The Wishlist. Online data acquisition and access. Managing large databases –Indexing data on spatial and temporal attributes –Quick subsetting operations. Large-scale resource sharing and management. Collaborative and distributed applications. Parallel gridding algorithms on large data sets using high-performance computing. Integrate data with other related data sets, e.g. geologic maps and hydrology models. Provide easy-to-use user interfaces from portals and scientific workflow environments.
6
6 High-Throughput Computational Chemistry Wishlist. Perform many embarrassingly parallel calculations concurrently, with different molecules, configurations, methods and/or other parameters. Use existing data, codes, procedures, and resources with the usual sophistication and accuracy. Let different programs seamlessly communicate with each other. Integrate preparation and analysis steps. Distribute jobs automatically onto a computational grid. Provide interfaces to database infrastructures to obtain starting structures and to allow data mining of results. Stay as general, flexible, manageable, and reusable as possible.
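The kind of embarrassingly parallel parameter sweep described above can be sketched in a few lines. This is an illustrative stand-in only: `run_calculation` and the molecule/method/basis names are placeholders, not part of any real chemistry code or of Kepler.

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import product

def run_calculation(molecule, method, basis):
    """Placeholder for a single chemistry job (in practice an external code
    launched on a grid resource); returns a mock result record here."""
    return {"molecule": molecule, "method": method, "basis": basis, "energy": 0.0}

molecules = ["H2O", "NH3", "CH4"]   # assumed inputs, e.g. pulled from a database
methods = ["HF", "MP2"]             # methods to compare
bases = ["6-31G", "cc-pVDZ"]        # basis sets

if __name__ == "__main__":
    jobs = list(product(molecules, methods, bases))
    # Each (molecule, method, basis) combination is independent, so the sweep
    # parallelizes trivially across local cores (or, in a real system, grid nodes).
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(run_calculation, *zip(*jobs)))
    for record in results:
        print(record)
```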
7
7 SWF Systems Requirements. Design tools, especially for non-expert users. Ease of use: a fairly simple user interface with more complex features hidden in the background. Reusable generic features –generic enough to serve different communities but specific enough to serve one domain (e.g. geosciences), and customizable. Extensibility for the expert user. Registration, publication & provenance of data products and “process products” (= workflows). Dynamic plug-in of data and processes from registries/repositories. Distributed WF execution (e.g. Web and Grid awareness). Semantics awareness. WF deployment –as a web site, as a web service, as “power apps” (a la SCIRun II).
8
8 So, what is a scientific workflow?
9
9 The Big Picture: Supporting the Scientist. From “napkin drawings” (conceptual SWF) to executable workflows (executable SWF). Here: John Blondin, NC State, Astrophysics Terascale Supernova Initiative (SciDAC, DOE).
10
10 Phylogeny Analysis Workflows. Local disk -> Multiple sequence alignment -> Phylogeny analysis -> Tree visualization.
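Read as a pipeline, the slide chains three stages fed from local disk. A minimal sketch of such a chain follows; the functions are placeholder stand-ins for real alignment and tree-inference tools, not actual Kepler actors.

```python
def multiple_sequence_alignment(sequences):
    """Placeholder: a real workflow would call an external aligner here."""
    return ["aligned:" + s for s in sequences]

def phylogeny_analysis(alignment):
    """Placeholder: a real workflow would run a tree-inference program."""
    return "(" + ",".join(alignment) + ");"   # toy Newick-like string

def tree_visualization(tree):
    print("rendering tree:", tree)

# Local disk -> alignment -> phylogeny -> visualization
sequences = ["ACGT", "ACGA", "ACTT"]          # assumed to come from local files
tree_visualization(phylogeny_analysis(multiple_sequence_alignment(sequences)))
```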
11
11 Promoter Identification Workflow Source: Matt Coleman (LLNL)
12
12 Promoter Identification Workflow
13
13
14
14
15
15 Enter initial inputs, Run and Display results
16
16 Custom Output Visualizer
17
17 Mineral Classification Workflow
18
18 Point-in-Polygon algorithm
19
19 Enter initial inputs, Run and Display results
20
20 Output Visualizers: browser display of results
21
21 TSI Workflow-2 (D. Swesty)
22
22 TSI-2 Workflow Overview. Submit batch request at NERSC; check job status (delay and re-check if queued; proceed if running or done); identify new complete files; transfer files to HPSS and verify the transfer completed correctly; transfer files to SB and verify the transfer completed correctly; delete the source file. Then: extract, get variables, remap coordinates, create Chem vars, create neutrino vars, derive other vars, write diagnostic file. Finally: generate plots (Tool-1, Tool-2, Tool-3, Tool-4), generate thumbnails, generate movie, update web page.
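One way to picture the monitoring half of this workflow is a polling loop that watches the batch job, picks up newly completed files, transfers and verifies them, and only then deletes the originals. The sketch below is purely illustrative: `job_status`, `transfer`, and `checksum_matches` are hypothetical stand-ins for the NERSC/HPSS tooling the slide refers to.

```python
import time
from pathlib import Path

def job_status(job_id):
    """Hypothetical: query the batch scheduler; returns Queued/Running/Done."""
    return "Done"

def transfer(path, destination):
    """Hypothetical stand-in for an HPSS or GridFTP transfer call."""
    return True

def checksum_matches(path, destination):
    """Hypothetical verification that the transfer completed correctly."""
    return True

def monitor(job_id, watch_dir, destinations, poll_seconds=60):
    seen = set()
    while True:
        status = job_status(job_id)
        if status == "Queued":
            time.sleep(poll_seconds)              # delay, then re-check, as in the slide
            continue
        for f in Path(watch_dir).glob("*.done"):  # identify newly completed files
            if f in seen:
                continue
            ok = all(transfer(f, d) and checksum_matches(f, d) for d in destinations)
            if ok:
                f.unlink()                        # delete only after verified transfers
                seen.add(f)
        if status == "Done":
            return
        time.sleep(poll_seconds)

monitor("job-1234", "/tmp/tsi_output", ["hpss://archive", "sb://cluster"], poll_seconds=1)
```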
23
23 TSI-2 Executable Workflow Screenshot
24
24 TSI-2 Web Interface for Monitoring
25
25 TSI-2 Workflow Running Interface
26
26 CPES Fusion Simulation Workflow. Fusion simulation codes: (a) GTC; (b) XGC with M3D. E.g. (a) currently runs on 4,800 (soon: 9,600) nodes of a Cray XT3; 9.6 TB RAM; 1.5 TB simulation data/run. GOAL: automate remote simulation job submission; continuous file movement to a secondary analysis cluster for dynamic visualization & simulation control; … with runtime-configurable observables. Workflow steps shown: select JobMgr, submit simulation job, submit FileMover job, execution log (=> data provenance). Overall architect (& prototypical user): Scott Klasky (ORNL). WF design & implementation: Norbert Podhorszki (UC Davis).
27
27 CPES Analysis Workflow. Concurrent analysis pipeline (@ Analysis Cluster): convert; analyze; copy-to-Web-portal; easy configuration and re-purposing. Screenshot labels: pipelined execution model, reusable actor “class”, specialized actor “instances”, inline documentation, easy-to-edit parameter settings, inline display. Overall architect (& prototypical user): Scott Klasky (ORNL). WF design & implementation: Norbert Podhorszki (UC Davis).
28
28 Scientific Workflow Systems. A scientific workflow combines data integration, analysis, and visualization steps into a larger, automated “scientific process”. Mission of scientific workflow systems: promote “scientific discovery” by providing tools and methods to generate scientific workflows; create an extensible and customizable graphical user interface for scientists from different scientific domains; support computational experiment creation, execution, sharing, reuse and provenance; design frameworks which define efficient ways to connect to existing data and integrate heterogeneous data from multiple resources. Make the technology useful on the user's monitor!!!
29
29 Kepler is a Scientific Workflow System … and a cross-project collaboration. It builds upon the open-source Ptolemy II framework: Ptolemy II is a laboratory for investigating design, and Kepler is a problem-solving environment for scientific workflows (Kepler = “Ptolemy II + X” for scientific workflows). 1st beta release (June 2, 2006). www.kepler-project.org
30
30 Kepler is a Team Effort: Ptolemy II, Resurgence, Griddles, SRB, LOOKING, SKIDL, Cipres, NLADR. Contributor names and funding info are at the Kepler website!! Other contributors: Cheshire (UK Text Mining Center), DART (Great Barrier Reef, Australia), National Digital Archives + UCSD-TV (US), …
31
31 Kepler Software Development Practice. How does this all work? –Joint CVS, with special rules! Projects like SDM, Cipres, and Resurgence have their specialized releases out of a common infrastructure. –Open-source (BSD). –Website: Wiki, http://kepler-project.org –Communications: busy IRC channel; mailing lists (Kepler-dev, Kepler-users, Kepler-members); telecons for design discussions. –6-monthly hackathons. –Focus group meetings: workshops and conference calls.
32
32 Kepler Users Batch users –Portals –Other workflow systems as an engine User interface users –Workflow developers –Scientists –Software Developers –Engineers –Researchers
33
33 Actor-Oriented Design: Actors are the Processing Components. Actor: encapsulation of parameterized actions; interface defined by ports and parameters. Port: communication of input and output data, without call-return semantics. Model of computation: communication semantics among ports; flow of control; the implementation is a framework. Examples: Simulink (The MathWorks), LabVIEW (National Instruments), Easy 5x (Boeing), ROOM (real-time object-oriented modeling), ADL (Wright), …
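A toy illustration of the actor abstraction follows; it is not the actual Ptolemy II/Kepler Java API. The point it tries to show is that an actor exposes typed ports and parameters, and the framework, not the actor, decides when to fire it.

```python
class Port:
    """A buffered, one-way communication channel; no call-return semantics."""
    def __init__(self):
        self.tokens = []
    def send(self, token):
        self.tokens.append(token)
    def get(self):
        return self.tokens.pop(0)
    def has_token(self):
        return bool(self.tokens)

class Actor:
    """Encapsulates a parameterized action behind input and output ports."""
    def __init__(self, name, **parameters):
        self.name = name
        self.parameters = parameters
        self.inputs = {}
        self.outputs = {}
    def fire(self):
        raise NotImplementedError

class Scale(Actor):
    """Example actor: multiplies each input token by a 'factor' parameter."""
    def __init__(self, name, factor=1.0):
        super().__init__(name, factor=factor)
        self.inputs["in"] = Port()
        self.outputs["out"] = Port()
    def fire(self):
        value = self.inputs["in"].get()
        self.outputs["out"].send(value * self.parameters["factor"])

s = Scale("scale", factor=3.0)
s.inputs["in"].send(2.0)
s.fire()                       # in a real system the director triggers this
print(s.outputs["out"].get())  # 6.0
```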
34
34 Some actors in place for… Generic Web Service Client and Web Service Harvester; customizable RDBMS query and update; command-line wrapper tools (local, ssh, scp, ftp, etc.); some Grid actors (Globus Job Runner, GridFTP-based file access, Proxy Certificate Generator); SRB support; native R and Matlab support; interaction with Nimrod and APST; communication with ORBs through actors and services; imaging, gridding and vis support; textual and graphical output; …more generic and domain-oriented actors…
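As an example of the simplest of these, a generic command-line wrapper reduces to little more than a subprocess call whose arguments come from ports and parameters. The sketch below is a standalone illustration (it assumes a Unix-like `echo` command for the demo) and is not Kepler's actual command-line actor.

```python
import subprocess

def command_line_actor(command, args, stdin_text=None):
    """Toy stand-in for a command-line wrapper actor: runs a local program,
    feeds it the input token on stdin, and emits stdout as the output token."""
    result = subprocess.run(
        [command, *args],
        input=stdin_text,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout

# Example firing: wrap a harmless local command (no ssh/scp/Globus involved here).
if __name__ == "__main__":
    print(command_line_actor("echo", ["hello from a wrapped command"]))
```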
35
35 Directors are the WF Engines that… implement different computational models and define the semantics of the execution of actors and workflows and of the interactions between actors. Ptolemy and Kepler are unique in combining different execution models within heterogeneous models! Kepler is extending the Ptolemy directors with specialized ones for web-service-based and distributed workflows. Models include: process networks, rendezvous, publish and subscribe, continuous time, finite state machines, dataflow, time-triggered, synchronous/reactive, discrete event, wireless.
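The point that the director, not the actors, owns the execution semantics can be illustrated with two toy "directors" driving the same list of stages: one runs them in a fixed sequential schedule, the other runs them as concurrent threads communicating over queues (roughly the flavor of a process-network model). Neither corresponds to the real Ptolemy II director classes; this is a conceptual sketch only.

```python
import queue
import threading

def sequential_director(stages, items):
    """SDF-flavoured toy: run the stages in a fixed, precomputed order."""
    data = list(items)
    for stage in stages:
        data = [stage(x) for x in data]
    return data

def process_network_director(stages, items):
    """PN-flavoured toy: each stage runs in its own thread, blocking on a queue."""
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    STOP = object()

    def worker(stage, q_in, q_out):
        while True:
            x = q_in.get()
            if x is STOP:
                q_out.put(STOP)
                return
            q_out.put(stage(x))

    threads = [threading.Thread(target=worker, args=(s, queues[i], queues[i + 1]))
               for i, s in enumerate(stages)]
    for t in threads:
        t.start()
    for x in items:
        queues[0].put(x)
    queues[0].put(STOP)

    results = []
    while True:
        x = queues[-1].get()
        if x is STOP:
            break
        results.append(x)
    for t in threads:
        t.join()
    return results

stages = [lambda x: x + 1, lambda x: x * 2]          # the same "workflow graph"...
print(sequential_director(stages, [1, 2, 3]))        # ...run to completion stage by stage
print(process_network_director(stages, [1, 2, 3]))   # ...or streamed through concurrent actors
```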
36
36 Dataflow as a Computation Model Dataflow: Abstract representation of how data flows in the system –Alternative to: stored-program (von Neumann) execution A dataflow program: a graph –nodes represent operations, edges represent data paths Sound, simple, powerful model of parallel computation –NOT having a locus of control makes it simple! –Naturally distributed model of computation: – Asynchronous: Many actors can be ready to fire simultaneously – Execution ("firing") of a node starts when (matching) data is available at a node's input ports. – Locally controlled events – Events correspond to the “firing” of an actor – Actor: – A single instruction – A sequence of instructions – Actors fire when all the inputs are available
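A minimal sketch of the firing rule described above: a node fires as soon as tokens are available on all of its input edges, and no global program counter decides the order. The graph, node names, and scheduling loop are all illustrative assumptions.

```python
from collections import defaultdict

def run_dataflow(nodes, edges, sources):
    """nodes: {name: function}; edges: list of (src, dst) pairs;
    sources: {name: initial token}. Fires any node whose inputs are all ready."""
    inputs_of = defaultdict(list)
    for src, dst in edges:
        inputs_of[dst].append(src)

    tokens = dict(sources)               # node name -> produced token
    fired = set(sources)
    progress = True
    while progress:
        progress = False
        for name, func in nodes.items():
            ready = name not in fired and all(p in tokens for p in inputs_of[name])
            if ready:                    # firing rule: all input tokens present
                args = [tokens[p] for p in inputs_of[name]]
                tokens[name] = func(*args)
                fired.add(name)
                progress = True
    return tokens

# A tiny graph: a and b feed c; c feeds d.
nodes = {"c": lambda x, y: x + y, "d": lambda z: z * 10}
edges = [("a", "c"), ("b", "c"), ("c", "d")]
print(run_dataflow(nodes, edges, {"a": 2, "b": 3}))  # {'a': 2, 'b': 3, 'c': 5, 'd': 50}
```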
37
37 Vergil is the GUI for Kepler. Actor ontology and semantic search for actors: search -> drag and drop -> link via ports. Metadata-based search for datasets. (Screenshot panels: Actor Search, Data Search.)
38
38 Actor Search Kepler Actor Ontology Used in searching actors and creating conceptual views (= folders) Currently more than 200 Kepler actors added!
39
39 Data Search and Usage of Results Kepler DataGrid – Discovery of data resources through local and remote services SRB, Grid and Web Services, Db connections – Registry of datasets on the fly using workflows
40
40 Kepler can be used as a batch execution engine. Configuration phase, then scheduling, monitoring, output processing and translation across portal and Grid. Subset: DB2 query on DataStar. Analyze: move, process; interpolate with Grass RST, Grass IDW, GMT… Visualize: move, render, display with Global Mapper, FlederMaus, ArcIMS.
41
41 Kepler System Architecture (components): Vergil GUI and Kepler GUI extensions; authentication; SMS; actor & data search; type system extensions; provenance framework; smart re-run / failure recovery; Kepler Object Manager; documentation; Kepler core extensions on top of Ptolemy.
42
42 Kepler Authentication Framework. Actors manage data, programs, and computing resources in distributed & heterogeneous environments, under various secure administration. How can ONE system handle all of these authentication jobs? Data: database, SRB, XML, file system, … Programs: command line, MPI parallel, online CGI, web service, Grid application. Resources: mobile device, laptop, desktop, cluster, supercomputer, Grid. Job management: OS, Condor, PBS qsub, GRAM, web portal, …
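The "one system for all authentication jobs" idea can be pictured as a small credential manager that maps a resource's security domain to a stored credential, so individual actors never handle passwords or proxies directly. This is a conceptual sketch only, not the actual Kepler authentication framework; the domains and credential shapes are invented for illustration.

```python
class CredentialManager:
    """Conceptual sketch: one place that resolves a resource to a credential,
    so database, SSH, and grid actors all authenticate through the same interface."""
    def __init__(self):
        self._store = {}              # domain -> credential object (token, proxy, ...)

    def register(self, domain, credential):
        self._store[domain] = credential

    def credential_for(self, resource_url):
        # Naive substring matching; a real system would use per-protocol handlers.
        for domain, cred in self._store.items():
            if domain in resource_url:
                return cred
        raise KeyError("no credential registered for " + resource_url)

manager = CredentialManager()
manager.register("db.example.org", {"user": "alice", "password": "example-password"})
manager.register("grid.example.org", {"proxy": "/tmp/x509up_u1000"})
print(manager.credential_for("srb://grid.example.org/home/alice"))
```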
43
43 Kepler Archives. Purpose: encapsulate WF data and actors in an archive file, inlined or by reference, with version control. Benefits: more robust workflow exchange; easy management of semantic annotations; plug-in architecture (drop in and use); easy documentation updates. A jar-like archive file (.kar) including a manifest; all entities have unique ids (LSIDs); custom object manager and class loader; UI and API to create, define, search and load .kar files.
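Since a .kar file is described as a jar-like archive with a manifest, its structure can be mimicked with a zip file: a manifest listing each entry's LSID plus the inlined entities themselves. The manifest keys and LSIDs below are invented for illustration and are not Kepler's actual manifest format.

```python
import zipfile

def write_kar_like_archive(path, entries):
    """entries: {archive_name: (lsid, bytes_content)}. Writes a jar-style zip
    with a manifest mapping each entry to its unique identifier."""
    manifest_lines = ["Manifest-Version: 1.0"]
    with zipfile.ZipFile(path, "w") as kar:
        for name, (lsid, content) in entries.items():
            kar.writestr(name, content)
            manifest_lines.append("Name: " + name)
            manifest_lines.append("lsid: " + lsid)   # invented key, for illustration
        kar.writestr("META-INF/MANIFEST.MF", "\n".join(manifest_lines) + "\n")

write_kar_like_archive("example.kar", {
    "workflow.xml": ("urn:lsid:example.org:workflow:1:1", b"<entity name='demo'/>"),
    "README.txt": ("urn:lsid:example.org:doc:1:1", b"Sample annotation"),
})
```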
44
44 Kepler Object Manager Designed to access local and distributed objects Objects: data, metadata, annotations, actor classes, supporting libraries, native libraries, etc. archived in kar files Advantages: –Reduce the size of Kepler distribution Only ship the core set of generic actors and domains –Easy exchange of full or partial workflows for collaborations –Publish full workflows with their bound data Becomes a provenance system for derived data objects => Separate workflow repository and distributions easily
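The object manager's job of resolving an identifier to a local copy or a remote source can be sketched as a two-level lookup: check a local cache first, then fall back to a list of repositories. The repository URL scheme and cache layout here are assumptions, not Kepler's real ones.

```python
from pathlib import Path
from urllib.request import urlopen

class ObjectManager:
    """Conceptual sketch: resolve an LSID to bytes, preferring a local cache."""
    def __init__(self, cache_dir, repositories):
        self.cache = Path(cache_dir)
        self.repositories = repositories   # e.g. ["https://repo.example.org/objects/"]

    def get(self, lsid):
        cached = self.cache / lsid.replace(":", "_")
        if cached.exists():                # local hit: no network needed
            return cached.read_bytes()
        for repo in self.repositories:     # otherwise try remote repositories
            try:
                data = urlopen(repo + lsid).read()
                cached.parent.mkdir(parents=True, exist_ok=True)
                cached.write_bytes(data)   # populate the cache for next time
                return data
            except OSError:
                continue
        raise LookupError("object not found locally or remotely: " + lsid)

# Usage would look like:
#   om = ObjectManager("kepler_cache", ["https://repo.example.org/objects/"])
#   actor_bytes = om.get("urn:lsid:example.org:actor:42:1")
```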
45
45 Kepler Provenance Framework OPTIONAL! –Modeled as a separate concern in the system –Listens to the execution and saves information customized by a set of parameters Context: who, what, where, when, and why that is associated with the run Input data and its associated metadata Workflow outputs and intermediate data products Workflow definition (entities, parameters, connections): a specification of what exists in the workflow and can have a context of its own Information about the workflow evolution -- workflow trail Types of Provenance Information: –Data provenance Intermediate and end results including files and db references –Process provenance Keep the wf definition with data and parameters used in the run –Error and execution logs –Workflow design provenance
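The "separate concern that listens to the execution" idea maps naturally onto a listener object that the engine notifies at each step. The record layout below (JSON with who/when/firings fields) and the verbosity levels are assumptions for illustration, not the framework's actual schema.

```python
import getpass
import json
import time

class ProvenanceListener:
    """Conceptual sketch: records run context plus per-actor inputs and outputs."""
    def __init__(self, workflow_name, destination="provenance.json", level="medium"):
        self.level = level                 # e.g. verbose-all / medium / on-error
        self.destination = destination
        self.record = {
            "workflow": workflow_name,
            "who": getpass.getuser(),
            "when": time.strftime("%Y-%m-%dT%H:%M:%S"),
            "firings": [],
        }

    def actor_fired(self, actor_name, inputs, outputs):
        if self.level != "on-error":       # only log normal firings if asked to
            self.record["firings"].append(
                {"actor": actor_name, "inputs": inputs, "outputs": outputs})

    def run_finished(self):
        with open(self.destination, "w") as f:
            json.dump(self.record, f, indent=2)

# An engine would call listener.actor_fired(...) after each firing and
# listener.run_finished() at the end of the run.
listener = ProvenanceListener("demo-workflow")
listener.actor_fired("Scale", {"in": 3}, {"out": 6})
listener.run_finished()
```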
46
46 Kepler Provenance Recording Utility Parametric and customizable –Different report formats –Variable levels of detail Verbose-all, verbose-some, medium, on error –Multiple cache destinations Saves information on –User name, Date, Run, etc…
47
47 What other system functions does provenance relate to? Failure recovery and smart re-runs (re-run only the updated/failed parts); semantic extensions; Kepler Data Grid; reporting and documentation (guided documentation generation and updates); authentication; data registration.
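Smart re-runs follow directly from the provenance cache: hash each step's name and inputs, and re-execute the step only when that signature has no stored result. The cache format and function names below are illustrative assumptions, not Kepler's implementation.

```python
import hashlib
import json
import os

CACHE_DIR = "rerun_cache"          # assumed location for cached step outputs

def _signature(step_name, inputs):
    payload = json.dumps({"step": step_name, "inputs": inputs}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def run_step(step_name, func, inputs):
    """Re-run only steps whose (name, inputs) signature has no cached result."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    cache_file = os.path.join(CACHE_DIR, _signature(step_name, inputs) + ".json")
    if os.path.exists(cache_file):         # unchanged step: reuse previous output
        with open(cache_file) as f:
            return json.load(f)
    result = func(**inputs)                # changed or failed step: execute it
    with open(cache_file, "w") as f:
        json.dump(result, f)
    return result

print(run_step("scale", lambda value, factor: value * factor, {"value": 3, "factor": 2}))
print(run_step("scale", lambda value, factor: value * factor, {"value": 3, "factor": 2}))  # cache hit
```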
48
48 Advantages of Scientific Workflow Systems Formalization of the scientific process Easy to share, adapt and reuse –Deployable, customizable, extensible Management of complexity and usability –Support for hierarchical composition –Interfaces to different technologies from a unified interface –Can be annotated with domain-knowledge Tracking provenance of the data and processes –Keep the association of results to processes –Make it easier to validate/regenerate results and processes –Enable comparison between different workflow versions Execution monitoring and fault tolerance Interaction with multiple tools and resources at once
49
49 Evolving Challenges For Scientific Workflows. Access to heterogeneous data and computational resources and links to different domain knowledge. Interface to multiple analysis tools and workflow systems –One size doesn't fit all! Support computational experiment creation, execution, sharing, reuse and provenance –Manage complexity, user and process interactivity –Extensions for adaptive and dynamic workflows. Track provenance of workflow design (= evolution), execution, and intermediate and final results. Efficient failure recovery and smart re-runs. Support various file and process transport mechanisms –Main memory, Java shared file system, …
50
50 Evolving Challenges For Scientific Workflows Support the full scientific process –Use and control instruments, networks and observatories in observing steps –Scientifically and statistically analyze and control the data collected by the observing steps, –Set up simulations as testbeds for possible observatories Come up with efficient and intuitive workflow deployment methods Do all these in a secure and usable way!!!
51
51 Some success stories…
“Our collaborative efforts with the SDM Center over the last few years have enabled us to publish several peer-reviewed papers and abstracts that were an important factor in maintaining current funding in the low-dose radiation research program (http://lowdose.tricity.wsu.edu/). … Having access to SDM Center scientists and their automated workflow and parallel processing tools over the next two years will be critical for the identification and characterization of regulatory element profiles of IR-responsive genes and will provide valuable understanding of the genetic mechanisms of IR-response and should provide powerful biological indicators of genetic susceptibilities for tissue and genetic damage.” - Matthew A. Coleman, Bioscience Program, Lawrence Livermore National Laboratory, 2005
“We are finally seeing some nice payoffs in terms of easy-to-use computational chemistry software with some unique capabilities. The framework illustrates nicely the interplay between technology and applications, including the compute software, the middleware, and the grid computing capabilities.” - Kim K. Baldridge, PI, Resurgence project, 2004
“During SciDAC I, members of the Terascale Supernova Initiative (TSI), now being expanded to PSI, collaborated extensively with members of your team. And the benefits were palpable. The successful deployment of a scientific workflow management and automation tool, which arose out of a fruitful collaboration between Doug Swesty and Eric Myra of Stony Brook and Terence Critchlow of LLNL, is one example. … Moreover, others of your effort (e.g., Mladen Vouk) are based at PSI partner sites and engaged in helping some of our application scientists (in this case, John Blondin), which will further enhance our overall research exchange.” - Anthony Mezzacappa, PI, DOE SciDAC Petascale Supernova Initiative, 2005
“The CIPRES project has as a key goal the creation of software infrastructure that allows developers in the community to easily contribute new software tools… The modular nature of Kepler met our requirements, as it is a Java platform that allows users to construct linear, looping, and complex workflows from just the kinds of components the CIPRES community is developing. By adopting this tool, we were able to focus on developing appropriate framework and registry tools for our community, and use the friendly Kepler user application interface as an entrée to our services. We are very excited about the progress we have made, and think the tool will be revolutionary for our user base.” - Mark A. Miller, PI, NSF CIPRES project, 2006
It can help you too!
52
52 Scientific Workflows in NBCR Service-oriented architecture facilitates the usage of scientific workflows in NBCR! –Enable access to scientific applications on Grid resources Seamlessly via a number of user interfaces Easily from the perspective of a scientific user –Enable the creation of scientific workflows Possibly with the use of commodity workflow toolkits
53
53 MEME+MAST Workflow using Kepler
54
54 Kepler Opal Web Services Actor
55
55 Demonstration…
56
56 To Sum Up. Kepler is an open-source system and collaboration; a 2.5-year-old project; it grows by application pull from contributors; Beta 2.0 was released on June 24th, 2006. There is a lot more to cover and work on…
57
57 Ilkay Altintas altintas@sdsc.edu +1 (858) 822-5453 http://www.sdsc.edu Questions… Thanks!