Information management and workflow
Peter Fox
Xinformatics – ITEC 6961/CSCI 6960/ERTH-6963-01
Week 11, April 20, 2010
Contents
Review of last class, reading
Information life-cycle
Information visualization
Checking in for project definitions
Discussion of reading
Next class
Management
Creation of logical collections
–The primary goal of a management system is to abstract the physical collection into logical collections. The resulting view is a uniform, homogeneous library collection. (A minimal sketch follows below.)
Physical handling
–This layer maps between the physical and logical views. Here you find items such as replication, backup, and caching.
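To make the logical/physical split concrete, here is a minimal Python sketch; the LogicalCollection class and its methods are invented for illustration, not taken from any particular management system.

```python
# Hypothetical sketch: one logical collection abstracting several physical copies.
class LogicalCollection:
    def __init__(self, name):
        self.name = name
        self.copies = {}  # logical item id -> list of physical locations

    def register(self, item_id, physical_uri):
        """Map a logical item onto one of its physical copies (replication, caching)."""
        self.copies.setdefault(item_id, []).append(physical_uri)

    def locate(self, item_id):
        """Resolve a logical id to the first available physical copy."""
        copies = self.copies.get(item_id)
        if not copies:
            raise KeyError(f"{item_id} not in collection {self.name}")
        return copies[0]

library = LogicalCollection("climate-data")
library.register("temp-2010", "file:///archive/temp2010.nc")            # primary copy
library.register("temp-2010", "http://mirror.example.org/temp2010.nc")  # backup replica
print(library.locate("temp-2010"))  # users see only the logical id
```

The point of the abstraction is that replication, backup, and caching can all change the physical layer without changing the logical view users query against.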
Management
Interoperability support
–Normally the data do not all reside in the same place, or several collections (such as catalogues) must be combined into one logical collection.
Security support
–Access authorization and change verification: the basis of trusting your information.
Ownership
–Defines who is responsible for quality and meaning.
Management
Metadata collection, management, and access
–Metadata are data about data; metainformation is information about information.
Persistence
–Definition of lifetime; deployment of mechanisms to counteract technology obsolescence.
Knowledge and information discovery
–Ability to identify useful relations and information inside the collection.
Management
Dissemination and publication
–Mechanisms to make interested parties aware of changes and additions to the collections.
Logical Collections
Identifying naming conventions and organization
Aligning cataloguing and naming to facilitate search, access, and use
Provision of contextual information
Physical Handling
Where does it come from, and from whom?
How is it transferred into a physical form?
Backup, archiving, and caching...
Formats
Naming conventions
Interoperability Support
Bit/byte- and platform/wire-neutral encodings
Programming or application interface access
Structure and vocabulary (metadata) conventions and standards
Security
What mechanisms exist for securing? Who performs this task?
Change and versioning (yes, the information may change): who does this, and how?
Who has access? How are access methods controlled and audited?
Who and what – authentication and authorization
Encryption and integrity (a checksum sketch follows below)
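One concrete piece of the integrity story is a checksum audit: record a digest when an item is published and re-check it on access. A minimal sketch using Python's standard hashlib (the file path is illustrative):

```python
import hashlib

def sha256_of(path):
    """Compute a SHA-256 digest so later readers can verify the file is unchanged."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):  # stream; works for large files
            h.update(chunk)
    return h.hexdigest()

# Store the digest alongside the item's metadata at publication time, e.g.:
#   expected = sha256_of("collection/item-042.dat")
# and verify on every access:
#   assert sha256_of("collection/item-042.dat") == expected, "integrity check failed"
```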
Ownership
Rights and policies – definition and enforcement
Limitations on access and use
Requirements for acknowledgement and use
Who defines and ensures quality, and how?
To whom may ownership migrate?
How to address replication?
How to address revised/derivative products?
Metadata
How to know what conventions, standards, and best practices exist?
How to use them, and with what tools?
Understanding the costs of incomplete and inconsistent metadata
Understanding the line between metadata and data, and when it is blurred
Knowing where and how to manage metadata and where to store it (and where not to)
Persistence
Where will you put your information so that someone else (e.g., one of your class members) can access it?
What happens after the class, the semester, after you graduate?
What other factors are there to consider?
Discovery
If you choose to allow it (see ownership and security), how does someone find your information?
How would you provide discovery of collections, versus files, versus 'bits'?
How to enable the narrowest/broadest discovery?
Dissemination
Who should do this?
How, and what needs to be put in place?
How to advertise?
How to inform about updates?
How to track use and significance?
Formats
ASCII, UTF-8, ISO 8859-1 (see the encoding sketch below)
Self-describing formats
Table-driven
Markup languages and other web-based formats
Databases
Graphs
Unstructured
Discussion…
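To make the encoding point concrete: the same text produces different bytes under UTF-8 and ISO 8859-1, and decoding with the wrong charset silently corrupts the data, which is why declaring the format matters. A small illustration:

```python
text = "Café"                               # one non-ASCII character
utf8_bytes = text.encode("utf-8")           # b'Caf\xc3\xa9' (two bytes for 'é')
latin1_bytes = text.encode("iso-8859-1")    # b'Caf\xe9' (one byte for 'é')

# Decoding with the wrong charset gives mojibake, with no error raised:
print(utf8_bytes.decode("iso-8859-1"))                 # 'CafÃ©'
# ...or loses data outright:
print(latin1_bytes.decode("utf-8", errors="replace"))  # 'Caf�'
```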
Metadata
Dublin Core (dc.x) – a small example follows below
METS
ISO in general, e.g. ISO/IEC 11179
Geospatial: ISO 19115-2, FGDC
Time: ISO 8601, xsd:dateTime
Z39.50 / ISO 23950
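For a sense of what these standards look like in practice, here is a sketch that serializes a small Dublin Core record with Python's standard library; the record's contents are invented for illustration:

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"   # the Dublin Core element set namespace
ET.register_namespace("dc", DC)

record = ET.Element("metadata")
for element, value in [
    ("title",   "Week 11 lecture dataset"),   # illustrative values only
    ("creator", "Peter Fox"),
    ("date",    "2010-04-20"),                # ISO 8601, as noted above
    ("format",  "text/csv"),
]:
    ET.SubElement(record, f"{{{DC}}}{element}").text = value

print(ET.tostring(record, encoding="unicode"))
```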
Summary of Management
Creation of logical collections
Physical handling
Interoperability support
Security support
Ownership
Metadata collection, management, and access
Persistence
Knowledge and information discovery
Dissemination and publication
Information Workflow
What is it?
Why would you use it?
Some more detail in the context of Kepler
–www.kepler-project.org
Some pointers to other workflow systems
What is a workflow?
General definition: a series of tasks performed to produce a final outcome
Information workflow – an "analysis pipeline"
–Automates tedious jobs that users traditionally performed by hand for each dataset
–Processes large volumes of data/information faster than one could by hand
(A toy pipeline sketch follows below.)
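In code, the simplest information workflow is just a pipeline of functions, each consuming the previous stage's output. A toy sketch (the stages are invented stand-ins for real processing steps):

```python
def acquire(source):
    """Stand-in for reading a dataset from `source`."""
    return [1, 5, 2, 5, 1]

def clean(records):
    """Deduplicate and order the records."""
    return sorted(set(records))

def summarize(records):
    """Reduce the cleaned records to a small report."""
    return {"n": len(records), "max": max(records)}

pipeline = [acquire, clean, summarize]

result = "some-dataset"
for stage in pipeline:
    result = stage(result)   # this hand-off is what a workflow engine automates
print(result)                # {'n': 3, 'max': 5}
```

A workflow system replaces the bare loop with scheduling, logging, and error handling, and lets the same pipeline be re-run over many datasets.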
Background: Business Workflows
Example: planning a trip
Need to perform a series of tasks: book a flight, reserve a hotel room, arrange for a rental car, etc.
Each task may depend on the outcome of a previous task
–The days you reserve the hotel depend on the days of the flight
–If the hotel has a shuttle service, you may not need to rent a car
(A dependency-graph sketch follows below.)
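The essential structure here is a dependency graph: a task may run only after the tasks it depends on have finished. A minimal sketch using Python's standard graphlib (Python 3.9+), encoding the trip example above:

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must complete first.
trip = {
    "book_flight":   set(),
    "reserve_hotel": {"book_flight"},     # hotel dates depend on the flight dates
    "rent_car":      {"reserve_hotel"},   # decided after checking for a shuttle
}

for task in TopologicalSorter(trip).static_order():
    print("run:", task)   # book_flight, reserve_hotel, rent_car
```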
What about information workflows?
Perform a set of transformations/operations on a data or information source
Examples
–Generating images from raw data
–Identifying areas of interest in a large dataset
–Classifying a set of objects
–Querying a web service for more information on a set of objects
–Many others…
More on Workflows
Formal models of the flow of data/information among processing components
May be simple and linear, or more complex
Can process many data/information types:
–Archives
–Web pages
–Streaming/real-time
–Images (e.g., medical or satellite)
–Simulation output
–Observational data
Challenges
Questions:
–What are some challenges for users in implementing workflows?
–What are some challenges in executing these workflows?
–What are the limitations of writing a program?
Challenges
Mastering a programming language
Visualizing the workflow
Sharing/exchanging workflows
Formatting issues
Locating datasets, services, or functions
Kepler Workflow Management System
Graphical interface for developing and executing scientific workflows
Users can create workflows by dragging and dropping components
Automates low-level processing tasks
Provides access to repositories, compute resources, and workflow libraries
Benefits of Workflows
Documentation of aspects of the analysis
Visual communication of analytical steps
Ease of testing/debugging
Reproducibility
Reuse of part or all of a workflow in a different project
Additional Benefits
Integration of multiple computing environments
Automated access to distributed resources via other architectural components, e.g. web services and Grid technologies
System functionality to assist with the integration of heterogeneous components
Why not just use a script?
A script does not specify low-level task scheduling and communication
It may be platform-dependent
It can't be easily reused
It may not have sufficient documentation to be adapted for another purpose
Why is a GUI useful?
No need to learn a programming language
Visual representation of what the workflow does
Allows you to monitor workflow execution
Enables user interaction
Facilitates sharing of workflows
The Kepler Project
Goals
–Produce an open-source workflow system that enables scientists to design and execute scientific workflows
–Support scientists in a variety of disciplines, e.g. biology, ecology, astronomy
–Important features:
access to scientific data
flexible means for executing complex analyses
enabling Grid-based approaches to distributed computation
semantic models of scientific tasks
an effective UI for workflow design
Usage statistics
Source code access
–154 people accessed the source code
–30 members have write permission
Projects using Kepler: SEEK (ecology), SciDAC (molecular biology, …), CPES (plasma simulation), GEON (geosciences), CiPRes (phylogenetics), CalIT2, ROADnet (real-time data), LOOKING (oceanography), CAMERA (metagenomics), Resurgence (computational chemistry), NORIA (ocean observing CI), NEON (ecology observing CI), ChIP-chip (genomics), COMET (environmental science), Cheshire Digital Library (archival), digital preservation (DIGARCH), cell biology (Scripps), DART (X-ray crystallography), Ocean Life, Assembling the Tree of Life, Processing Phylodata (pPOD), FermiLab (particle physics)
[Chart: Kepler downloads by platform (Windows vs. Macintosh); total = 9204, beta = 6675]
Distributed execution
Opportunities for parallel execution
–Fine-grained parallelism
–Coarse-grained parallelism: few or no cycles, limited dependencies among components – 'trivially parallel'
Many science problems fit this mold
–parameter sweeps, iteration of stochastic models (see the sketch below)
Current 'plumbing' approaches to distributed execution
–the workflow acts as a controller: it stages data resources, writes job description files, and controls execution of jobs on nodes
–this requires expert understanding of the Grid system
Scientists need to focus on just the computations
–try to avoid plumbing as much as possible
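A parameter sweep is the canonical trivially parallel case: the same model runs independently at each parameter value, so runs can be farmed out with no communication between them. A local sketch with Python's multiprocessing (the model function is a stand-in for a real simulation):

```python
from multiprocessing import Pool

def model(param):
    """Stand-in for one independent simulation run."""
    return param, param ** 2

if __name__ == "__main__":
    sweep = range(10)   # the parameter values to explore
    with Pool(processes=4) as pool:
        results = pool.map(model, sweep)   # no dependencies, so runs in parallel
    print(results)
```

On a Grid, the same pattern holds; the 'plumbing' described above is what replaces Pool with staged data and job descriptions on remote nodes.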
Distributed Kepler
–A higher-order component for executing a model on one or more remote nodes
–Master and slave controllers handle setup and communication among nodes, and establish data channels
–Extremely easy for a scientist to use: requires no knowledge of grid computing systems
[Diagram: master and slave controllers connected by IN/OUT data channels]
Data Management
Need for integrated management of external data
–EarthGrid access is partial and needs refactoring
–Include other data sources, such as JDBC, OPeNDAP, etc.
–Data need to be first-class objects in Kepler, not just represented as actors
–Need support for data versioning to support provenance
Need to pass data by reference
–workflows contain large data tokens (hundreds of megabytes)
–intelligent handling of unique identifiers (e.g., LSID)
(A pass-by-reference sketch follows below.)
[Diagram: token A carrying the value {1,5,2} versus token B carrying the reference ref-276]
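Passing by reference means a token carries a small identifier while the bulk data stays in a store and is dereferenced only when needed. A hypothetical sketch (TokenStore and its API are invented for illustration):

```python
# Hypothetical pass-by-reference tokens: actors exchange small ids,
# not hundreds of megabytes of data.
class TokenStore:
    def __init__(self):
        self._data = {}
        self._next = 0

    def put(self, payload):
        ref = f"ref-{self._next}"   # unique identifier (cf. LSIDs)
        self._next += 1
        self._data[ref] = payload
        return ref                  # the workflow moves only this string

    def get(self, ref):
        return self._data[ref]      # dereference only where the data is needed

store = TokenStore()
token = store.put([1, 5, 2])            # imagine a 100 MB array instead
print(token, "->", store.get(token))    # ref-0 -> [1, 5, 2]
```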
Science Environment for Ecological Knowledge
SEEK is an NSF-funded, multidisciplinary research project to facilitate:
Access to distributed ecological, environmental, and biodiversity data
–Enable data sharing and reuse
–Enhance data discovery at global scales
Scalable analysis and synthesis
–Taxonomic, spatial, temporal, and conceptual integration of data, addressing data-heterogeneity issues
–Enable communication and collaboration for analysis
–Enable reuse of analytical components
–Support scientific workflow design and modeling
SEEK data access, analysis, mediation
Data Access (EcoGrid)
–Distributed data network for environmental, ecological, and systematics data
–Interoperates diverse environmental data systems
Workflow Tools (Kepler)
–Problem-solving environment for scientific data analysis and visualization: "scientific workflows"
Semantic Mediation (SMS)
–Leverages ontologies for "smart" data/component discovery and integration
Managing Heterogeneity
Data come from heterogeneous sources
–Real-world observations
–Spatial-temporal contexts
–Collection/measurement protocols and procedures
–Many representations of the same information (count, area, density)
–Data, syntax, schema, and semantic heterogeneity
Discovery and "synthesis" (integration) are performed manually
–Discovery is often based on an intuitive notion of "what is out there"
–Synthesis of data is very time-consuming, which limits use
Scientific workflow systems support data analysis
[Screenshot: the Kepler workflow environment]
A simple Kepler workflow (T. McPhillips)
Composite component (sub-workflow)
Loops are often used in scientific workflows, e.g. in genomics and bioinformatics (collections of data, nested data, statistical regressions, …)
A simple Kepler workflow (T. McPhillips)
The workflow lists Nexus files to process (project), reads the text files, parses the Nexus format, and draws phylogenetic trees.
PhylipPars infers trees from discrete, multi-state characters; the workflow runs PhylipPars iteratively to discover all of the most parsimonious trees.
UniqueTrees discards redundant trees in each collection.
A simple Kepler workflow
An example workflow run, executed as a dataflow process network
SMS motivation
Scientific workflow life-cycle
–Resource discovery: discover relevant datasets, actors, or workflow templates
–Workflow design and configuration: data ↔ actor (data binding), data ↔ data (data integration/merging/interlinking), actor ↔ actor (actor/workflow composition)
Challenge: do all this in the presence of …
–100s of workflows and templates
–1,000s of actors (e.g. actors for web services, data analytics, …)
–10,000s of datasets
–1,000,000s of data items
–… highly complex, heterogeneous data
Price to pay for these resources: $$$ (lots). A scientist's time wasted: priceless!
Some other workflow systems
SCIRun
Sciflo
Triana
Taverna
Pegasus
Some commercial tools:
–Windows Workflow Foundation
–Mac OS X Automator
Survey: http://www.isi.edu/~gil/AAAI08TutorialSlides/5-Survey.pdf (from http://www.isi.edu/~gil/AAAI08TutorialSlides/)
See the reading for this week
Summary
The progression toward more formal encoding of scientific workflows (in our context, data-science workflows, i.e. dataflows) is substantially improving data management.
Awareness of preservation and stewardship for valuable data and information resources is receiving renewed attention in the digital age.
Workflows are a potential solution to the data-stewardship challenge.
Discussion
About management? About workflow?
Reading for this week
This week's reading is retrospective.
Check-in for Project Assignment
Analysis of existing information-system content and architecture; critique, redesign, and prototype redeployment
What is next
Week 12 – Information Discovery, Information Integration, review of all course material, and a check on learning objectives (next week)
Break on May 4; no class
Week 13 – Project presentations (May 11, i.e. in 3 weeks)
Note: IDEA surveys will be sent out soon