Published by Osborne Scott. Modified over 9 years ago.
1
Workflows & Tools
2
Data Analysis
Review of typical data analyses
Reproducibility & provenance
Overview of workflows
Computer-based scientific workflows (SWF)
Benefits of SWF
Examples of SWF and associated tools
3
Data Analysis
After completing this lesson, the participant will be able to:
◦ Understand a subset of typical analyses used
◦ Define a workflow
◦ Define an SWF
◦ Discuss the benefits of workflows in general and SWF in particular
◦ Locate resources for using SWF
4
Data Analysis
5
Conducted via personal computer, grid, or cloud computing
Statistics, model runs, parameter estimations, production of graphs/plots, etc.
6
Data Analysis
Processing: subsetting, merging, manipulating
◦ Reduction: important for high-resolution datasets
◦ Transformation: unit conversions, linear and nonlinear algorithms

0711070500276000
0711070600276000
0711070700277003
0711070800282017
0711070900285000
0711071000293000
0711071100301000
0711071200304000

Date       Time   Air temp (C)  Precip (mm)
11-Jul-07  5:00   27.6          000
11-Jul-07  6:00   27.6          000
11-Jul-07  7:00   27.7          003
11-Jul-07  8:00   28.2          017
11-Jul-07  9:00   28.5          000
11-Jul-07  10:00  29.3          000
11-Jul-07  11:00  30.1          000
11-Jul-07  12:00  30.4          000

Recreated from Michener & Brunt (2000)
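The transformation on this slide, from raw fixed-width records to a readable table, can be sketched in a few lines of Python. This is only an illustration: the field layout (MMDDYY, HHMM, air temperature in tenths of a degree C, then a three-digit precipitation field) is an assumption inferred from the two blocks above, and `parse_record` is a hypothetical helper name. The scale of the precipitation field is not stated on the slide, so it is left unscaled.

```python
from datetime import datetime

def parse_record(record: str) -> dict:
    """Split one 16-character raw record into named fields.

    Assumed layout (inferred from the slide, not documented there):
    MMDDYY (6) + HHMM (4) + air temp in tenths of a degree C (3) + precip (3).
    """
    if len(record) != 16 or not record.isdigit():
        raise ValueError(f"unexpected record format: {record!r}")
    timestamp = datetime.strptime(record[:10], "%m%d%y%H%M")
    return {
        "timestamp": timestamp,
        "air_temp_c": int(record[10:13]) / 10,
        # Scale of the precipitation field is not given on the slide; kept unscaled.
        "precip_raw": int(record[13:16]),
    }

raw = ["0711070500276000", "0711070700277003", "0711070800282017"]
for row in map(parse_record, raw):
    print(row["timestamp"].strftime("%d-%b-%y %H:%M"), row["air_temp_c"], row["precip_raw"])
```

Rejecting records that do not match the expected length is a small quality-control check in its own right: malformed lines fail loudly instead of producing shifted columns.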
7
Data Analysis
Graphical analyses
◦ Visual exploration of data: search for patterns
◦ Quality assurance: outlier detection
[Figure: Box and whisker plot of temperature by month]
[Figure: Scatter plot of August temperatures]
Strasser, unpub. data
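A box and whisker plot flags outliers with a simple rule that can also be applied without drawing anything. The sketch below assumes the common convention of whiskers at 1.5 times the interquartile range; the slide does not say which rule its plot used, and the temperature values are made up for illustration.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR], the usual box-plot whisker rule."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical hourly temperatures (deg C) with one suspicious reading.
temps = [27.6, 27.6, 27.7, 28.2, 28.5, 29.3, 30.1, 30.4, 99.9]
print(iqr_outliers(temps))
```

A flagged value is a candidate for review, not proof of an error; the decision to drop or keep it belongs in the documented quality-control step.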
8
Data Analysis
Statistical analyses
Conventional statistics
◦ Traditionally applied to experimental data
◦ Examples: ANOVA, MANOVA, linear and nonlinear regression
◦ Rely on assumptions: random sampling, random & normally distributed error, independent error terms, homogeneous variance
Descriptive statistics
◦ Traditionally applied to observational or descriptive data
◦ Examples: diversity indices, cluster analysis, quadrat variance, distance methods, principal component analysis, correspondence analysis
[Figure: Example of Principal Component Analysis (Oksanen 2011)]
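As one concrete instance of the descriptive statistics listed above, here is a minimal sketch of a diversity index. It uses the Shannon index, chosen here only as a familiar example since the slide does not name a specific index, and the species counts are hypothetical.

```python
import math

def shannon_index(counts):
    """Shannon diversity H' = -sum(p_i * ln p_i) over taxa with non-zero counts."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must contain at least one individual")
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

# Hypothetical species counts from two sites with the same number of individuals.
even_site = [10, 10, 10, 10]
skewed_site = [37, 1, 1, 1]
print(round(shannon_index(even_site), 3), round(shannon_index(skewed_site), 3))
```

Both sites hold four species and forty individuals, but the evenly distributed site scores higher, which is the behavior the index is meant to capture.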
9
Data Analysis
Statistical analyses (continued)
◦ Temporal analyses: time series
◦ Spatial analyses: for spatial autocorrelation
◦ Nonparametric approaches: useful when conventional assumptions are violated or the underlying distribution is unknown
◦ Other misc. analyses: risk assessment, generalized linear models, mixed models, etc.
Analyses of very large datasets
◦ Data mining & discovery
◦ Online data processing
10
Data Analysis
Re-analysis of outputs
Final visualizations: charts, graphs, simulations, etc.
Science is iterative: the process that results in the final product can be complex
11
Data Analysis
Reproducibility is at the core of the scientific method
Complex process = more difficult to reproduce
Good documentation required for reproducibility
◦ Metadata: data about data
◦ Process metadata: data about the process used to create, manipulate, and analyze data
12
Data Analysis
Process metadata is information about the process used to get to the data outputs
Related concept: data provenance
◦ Data provenance is information about the origins of data
◦ Good provenance = able to follow data throughout entire life cycle (collection, organization & quality control, analyses, visualization)
◦ Allows for
  - Replication & reproducibility
  - Analysis for potential defects, errors in logic, statistical errors
  - Evaluation of hypotheses
13
Data Analysis
A workflow is a formalization of process metadata
Includes a precise description of the scientific procedure
Includes a conceptualized series of data ingestion, transformation, and analytical steps
Three components of a workflow:
1. Inputs: Information or material required
2. Outputs: Information or material produced & potentially used as input in other steps
3. Transformation rules/algorithms (e.g. analyses)
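The three components listed above map naturally onto a small data structure. The following is a sketch under assumptions, not a reference to any particular workflow system: the `Step` class, its field names, and the unit-conversion example are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

@dataclass
class Step:
    """One workflow step: named inputs, named outputs, and a transformation rule."""
    name: str
    inputs: List[str]
    outputs: List[str]
    rule: Callable[..., Dict[str, Any]]

    def run(self, store: Dict[str, Any]) -> None:
        # Pull the declared inputs from the shared store, apply the rule,
        # and publish only the declared outputs for later steps to use.
        args = {key: store[key] for key in self.inputs}
        result = self.rule(**args)
        for key in self.outputs:
            store[key] = result[key]

# Hypothetical step: convert Celsius readings to Fahrenheit.
to_fahrenheit = Step(
    name="unit conversion",
    inputs=["temps_c"],
    outputs=["temps_f"],
    rule=lambda temps_c: {"temps_f": [c * 9 / 5 + 32 for c in temps_c]},
)

store = {"temps_c": [0.0, 100.0]}
to_fahrenheit.run(store)
print(store["temps_f"])
```

Because each step declares its inputs and outputs by name, the output of one step can serve as the input of another, which is the property item 2 describes.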
14
Data Analysis
Simplest form of workflow: flow chart
Data import into Excel → Quality control & data cleaning → Analysis: mean, SD → Graph production
15
Data Analysis
Simplest form of workflow: flow chart
Inputs & Outputs
Temperature data (T), Salinity data (S) → Data import into Excel → Data in Excel format → Quality control & data cleaning → “Clean” T & S data → Analysis: mean, SD → Summary statistics → Graph production
◦ Data import into Excel. Input: raw T and S data. Output: data in Excel format
◦ Quality control & data cleaning. Input: data in Excel format
16
Data Analysis
Simplest form of workflow: flow chart
Transformation Rules
Temperature data (T), Salinity data (S) → Data import into Excel → Data in Excel format → Quality control & data cleaning → “Clean” T & S data → Analysis: mean, SD → Summary statistics → Graph production
Transformation rules describe what is done to/with the data to obtain the relevant outputs for publication.
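The flow chart on this slide can be mirrored as a chain of small functions, one per box. This is a sketch only: it reads CSV text instead of using Excel, the sample readings and the -999 missing-value sentinel are hypothetical, and the graph production step is left out to keep the example dependency-free.

```python
import csv
import io
import statistics

RAW = """temp_c,salinity
27.6,35.1
27.7,35.0
-999,35.2
28.2,
"""  # hypothetical raw T and S readings; -999 and blanks mark bad values

def import_data(text):
    """Step 1: read raw text into rows (stands in for 'Data import into Excel')."""
    return list(csv.DictReader(io.StringIO(text)))

def clean(rows):
    """Step 2: quality control; drop rows with missing or sentinel values."""
    cleaned = []
    for row in rows:
        try:
            t, s = float(row["temp_c"]), float(row["salinity"])
        except ValueError:
            continue  # blank or non-numeric field
        if t != -999 and s != -999:
            cleaned.append({"temp_c": t, "salinity": s})
    return cleaned

def summarize(rows):
    """Step 3: analysis; mean and sample SD for each variable."""
    summary = {}
    for name in ("temp_c", "salinity"):
        values = [r[name] for r in rows]
        summary[name] = (statistics.mean(values), statistics.stdev(values))
    return summary

summary = summarize(clean(import_data(RAW)))
for name, (mean, sd) in summary.items():
    print(f"{name}: mean={mean:.2f} sd={sd:.2f}")
```

Each function's return value is the input of the next, so the code itself records the order of the boxes and the rule applied in each one.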
17
Data Analysis
Science is becoming more computationally intensive
Most transformations are done via computer programs
Sharing workflows benefits science
Defining a scientific workflow system makes documenting workflows easier
18
Data Analysis
A scientific workflow is an “analytical pipeline”
Each step can be implemented in different software systems
Each step and its parameters/requirements are formally recorded
This allows reuse of both individual steps and the overall workflow
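What "formally recorded" could look like in practice is sketched below: a small runner that executes steps in order and logs each step's name, parameters, and a fingerprint of its input and output. This is an illustrative assumption, not how any specific SWF system stores provenance; `fingerprint`, `run_recorded`, and the two example steps are hypothetical names.

```python
import hashlib
import json

def fingerprint(obj) -> str:
    """Short, stable hash of a JSON-serializable value, used to identify data versions."""
    blob = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:12]

def run_recorded(steps, data):
    """Run (name, function, params) steps in order and log what each one did."""
    log = []
    for name, func, params in steps:
        result = func(data, **params)
        log.append({
            "step": name,
            "params": params,
            "input": fingerprint(data),
            "output": fingerprint(result),
        })
        data = result
    return data, log

# Hypothetical steps and parameters.
def drop_above(values, limit):
    return [v for v in values if v <= limit]

def scale(values, factor):
    return [v * factor for v in values]

steps = [
    ("quality control", drop_above, {"limit": 500}),
    ("unit scaling", scale, {"factor": 0.1}),
]
result, log = run_recorded(steps, [276, 277, 999, 282])
print(json.dumps(log, indent=2))
```

The output fingerprint of one step matches the input fingerprint of the next, so the log can be checked for gaps, and the `steps` list can be rerun on new data or shared as-is.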
19
Data Analysis
Single access point for multiple analyses across software packages
Keeps track of analysis and provenance: enables reproducibility
◦ Each step & its parameters/requirements formally recorded
Workflow can be stored
Allows sharing and reuse of individual steps or overall workflow
◦ Automate repetitive tasks
◦ Use across different disciplines and groups
◦ Can run analyses more quickly since not starting from scratch
20
Data Analysis
Kepler
Open-source, free, cross-platform
Drag-and-drop interface for workflow construction
Each step (analysis, manipulation, etc.) in a workflow is represented by an “actor”
Actors connect via inputs and outputs to form a workflow
Possible applications
◦ Theoretical models or observational analyses
◦ Hierarchical modeling
◦ Can have nested workflows
◦ Can access data from web-based sources (e.g. databases)
Downloads and more information at kepler-project.org
21
Data Analysis
[Screenshot callouts: “Drag & drop components from this list”; “Actors in workflow”]
22
Data Analysis
This model shows the solution to the classic Lotka-Volterra predator-prey dynamics model. It uses the Continuous Time domain to solve two coupled differential equations, one that models the predator population and one that models the prey population. The results are plotted as they are calculated, showing both population change and a phase diagram of the dynamics.
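For readers without the workflow software at hand, the two coupled equations can be integrated numerically in plain Python. This sketch is not the Kepler workflow shown on the slide and does not reproduce its solver: it uses a classical fourth-order Runge-Kutta step, and the parameter values, initial populations, and step size are hypothetical.

```python
def lotka_volterra(state, alpha, beta, delta, gamma):
    """Right-hand side of the two coupled equations:
    dx/dt = alpha*x - beta*x*y   (prey)
    dy/dt = delta*x*y - gamma*y  (predator)
    """
    x, y = state
    return (alpha * x - beta * x * y, delta * x * y - gamma * y)

def rk4_step(f, state, dt, **params):
    """Advance the state by one classical fourth-order Runge-Kutta step."""
    def shifted(base, slope, h):
        return tuple(b + h * s for b, s in zip(base, slope))
    k1 = f(state, **params)
    k2 = f(shifted(state, k1, dt / 2), **params)
    k3 = f(shifted(state, k2, dt / 2), **params)
    k4 = f(shifted(state, k3, dt), **params)
    return tuple(
        s + dt / 6 * (a + 2 * b + 2 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )

# Hypothetical parameters and initial populations, chosen only for illustration.
params = dict(alpha=1.0, beta=0.1, delta=0.075, gamma=1.5)
state, dt = (10.0, 5.0), 0.01
trajectory = [state]
for _ in range(2000):
    state = rk4_step(lotka_volterra, state, dt, **params)
    trajectory.append(state)

print("prey range:", min(p for p, _ in trajectory), max(p for p, _ in trajectory))
```

Plotting the prey and predator columns of `trajectory` against time gives the population curves, and plotting one against the other gives the phase diagram the slide mentions.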
23
Data Analysis Resulting output
24
Data Analysis
VisTrails
Open-source
Workflow & provenance management support
Geared toward exploratory computational tasks
◦ Can manage evolving SWF
◦ Maintains detailed history about steps & data
www.vistrails.org
[Screenshot example]
25
Data Analysis
myExperiment
Social networking site to support scientists who use SWF
Allows searching for, sharing, and reuse of SWF
Can comment on and discuss contributed SWF
Gateway to journals and data repositories
www.myexperiment.org
26
Data Analysis
Scientists should document workflows used to create results
◦ Data provenance
◦ Analyses and parameters used
◦ Connections between analyses via inputs and outputs
Documentation can be informal (for example, a flowchart) or formal (for example, Kepler software)
27
Data Analysis
Modern science is computer-intensive
◦ Heterogeneous data, analyses, software
Reproducibility is important
Workflows = process metadata
◦ Necessary for reproducibility, repeatability, validation
There are formal systems for documenting process metadata
◦ Enable storage, sharing, visualization, reuse
28
Data Analysis
Gil, Y, E Deelman, M Ellisman, T Fahringer, G Fox, D Gannon, C Goble, M Livny, L Moreau, and J Myers. Examining the Challenges of Scientific Workflows. Computer 40:24–32, 2007.
Michener, K, J Beach, M Jones, B Ludaescher, D Pennington, R Pereira, A Rajasekar, and M Schildhauer. A knowledge environment for the biodiversity and ecological sciences. Journal of Intelligent Information Systems 29:111–126, August 2007.
Ludäscher, B, I Altintas, S Bowers, J Cummings, T Critchlow, E Deelman, DD Roure, J Freire, C Goble, M Jones, S Klasky, T McPhillips, N Podhorszki, C Silva, I Taylor, and M Vouk. Scientific Process Automation and Workflow Management. Computational Science Series Ch 13. Chapman & Hall, Boca Raton, 2009.
McPhillips, T, S Bowers, D Zinn, and B Ludäscher. Scientific workflow design for mere mortals. Future Generation Computer Systems 25:541–551, 2009.
Ludäscher, B, I Altintas, C Berkley, D Higgins, E Jaeger-Frank, M Jones, E Lee, J Tao, and Y Zhao. Scientific workflow management and the Kepler system. Concurrency and Computation: Practice & Experience 18, 2006.
Michener, W and J Brunt, editors. Ecological Data: Design, Management and Processing. Blackwell Science, 180p, 2000.
29
START QUIZ
30
Data Analysis
◦ Analyses
◦ The output which is the information produced
◦ The input that contains the information
◦ All of the above
31
Data Analysis Review this section Return
32
Data Analysis Proceed to the next question NEXT
33
Data Analysis
◦ Data mining and discovery
◦ Grid computing
◦ Pattern searching and decision trees
◦ Spatial analyses
36
Data Analysis
◦ Scatter plots
◦ Box-and-whisker plots
◦ Plots that show you potential data errors
◦ All graphical formats and analyses
39
Data Analysis
◦ Reproducibility
◦ Repeatability
◦ Validation
◦ All of the above
42
Data Analysis
◦ Scientific workflow
◦ Systematic workflow
◦ Scientific workforce
◦ Systematic work information
45
Data Analysis
◦ Each step can be implemented in different software systems with requirements formally recorded
◦ Single access point for multiple analyses. Workflow can be stored
◦ Allows sharing of individual steps.
◦ All of the above.
48
Data Analysis
◦ Good organization
◦ Good data maintenance
◦ Good provenance
◦ Good metadata
50
Data Analysis You have completed this learning module. Next
51
Data Analysis We want to hear from you! CLICK the arrow to take our short survey.