1
Data Entry & Manipulation
Lecture adapted from: DataONE Education Module: Data Entry and Manipulation. DataONE. Retrieved Feb 4. GEO 802, Data Information Literacy, Winter 2019 – Lecture 3. Gary Seitz, MA
2
Lesson 3 Outline
Best practices for creating data & spreadsheet files
Data entry options
Data manipulation options
Analysis and workflows
Let’s spend some time reviewing the syllabus and getting acquainted with what you can expect for the next 11 weeks. Image credit: Surveying by Luis Prado from The Noun Project
3
Learning Objectives
Recognize inconsistencies that can make a dataset difficult to understand and/or manipulate
Identify data entry tools
Identify validation measures that can be performed as data is entered
Review best practices for data integration
Describe the basic components of a relational database
4
Goals of data entry
Create data sets that are: valid, and organized to support ease of use.
The goals of data entry are to create data that are valid (that is, they have gone through a process to assure quality) and organized to support use of the data or ease of archiving. CC image by Travis S on Flickr
5
Collecting data yourself
This means gathering data and entering it into a database or a spreadsheet – whether you work alone or collaboratively
6
Structured data
If you want your computer to process and analyse your data, it has to be able to read the data. This means the data need to be structured and in a machine-readable form.
7
Machine-readable data
is data (or metadata) in a format that can be understood by a computer. This includes human-readable data that is marked up so that it can also be read by machines (examples: microformats, RDFa), and data file formats intended principally for machines (RDF, XML, JSON).
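To make this concrete, here is a minimal Python sketch (the record and its field names are hypothetical, invented for illustration) showing that a machine-readable JSON record can be parsed directly by a program, whereas the same information in a scanned image could not:

import json

# A single observation encoded as machine-readable JSON (hypothetical fields)
record = '{"site": "Deep Well", "date": "2010-02-13", "species": "PERO", "weight_g": 14}'

data = json.loads(record)                   # the computer parses the structure directly
print(data["species"], data["weight_g"])    # -> PERO 14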
8
Unstructured data
Unstructured data has no fixed underlying structure. For example, PDFs and scanned images may contain information that is pleasing to the human eye because it is laid out nicely, but they are not machine-readable.
9
Example: poor data entry
From a small mammal trapping study. Inconsistencies between data collection events: location of the date information, inconsistent date format, column names, and order of columns.
Many researchers like to manage their data in Excel, and Excel makes it easy to use poor data entry practices. These are data entered into Excel from a small mammal trapping study. Each block of data represents a different trapping period (2/13, 3/15, and 4/10/2010). Inconsistencies in how the data were entered for each sampling period make the data difficult to analyze and difficult for anyone but the data collector to understand. Note that the date is listed in different places in each block: Date is a column in the first block, but listed in the header in the block on the right. Inconsistent date formats were also used. In one place the date is formatted as day-month-year, with the first three letters of the month spelled out, while elsewhere the format is mm/dd/yyyy. Note also that the order of the columns is inconsistent: Site, Date in the first block, and Site, Plot in the bottom block. Even the columns are named differently: Species is called Species in the first block, and RodentSp in the block on the right. This can be confusing to any user who must try to make sense of these data, and it would be a nightmare to try to write metadata for this spreadsheet.
10
Example: poor data entry
Inconsistencies between data collection events: different site spellings, capitalization, and spaces in site names (hard to filter); codes used for site names in some data, but spelled out in others; a Mean1 value placed in the Weight column; text and numbers in the same column. What is the mean of 12, “escaped < 15”, and 91?
There are other problems with how these data were entered. Naming of sites is inconsistent: for instance, Deep Well is used in the first block vs. DW in the block on the right. The file also contains several typos, such as rioSalado vs. rioSlado. A human can figure out what each of these site names refers to, but the names would have to be harmonized for a statistical program to use them. It would also be easier if one could filter for just Deep Well (with a space) without having to know to filter for DeepWell (no space) as well. Similarly, in one place a species code is capitalized (PERO) and lowercase elsewhere. Further, in the first block of data, a mean was calculated for the weight of the rodents. The value for that mean, called Mean1, is in the same column as the weights of the individual animals. In later manipulations of these data, it would be easy to copy that value as though it represented the weight of a single animal. It is bad practice to mix types of information in one column: raw data should be maintained in one file, and calculations should be done elsewhere. In addition, there is text mixed with numeric data in the Weight column in the block on the right; it says “escaped < 15” (presumably indicating that a rodent weighing less than 15 grams escaped). A statistical program will not know how to deal with text mixed with numeric data. What is the mean of 12, 91, and “escaped < 15”? To analyze all these data using statistical software, and to make them much easier to understand by any user, the data need to be organized into a column for each variable. It is therefore essential that only one type of information be entered into each column, and that spellings, codes, formats, etc. be consistent.
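As an illustration of the last point, here is a minimal Python/pandas sketch (values taken from the example above) of what happens when text such as “escaped < 15” is mixed into a numeric Weight column: the column is no longer numeric, and a mean can only be computed after the text entry is coerced to a missing value.

import pandas as pd

weights = pd.Series([12, "escaped < 15", 91])
print(weights.dtype)                      # object, not a numeric dtype

# Coercing to numbers turns the text entry into NaN (missing)
numeric = pd.to_numeric(weights, errors="coerce")
print(numeric.mean())                     # mean of 12 and 91 only -> 51.5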
11
European Spreadsheet Risks Interest Group (ESRIG)
Errors in data and the tools used to manage it are common enough that there is an international organisation, the European Spreadsheet Risks Interest Group, that advises on how to reduce the errors people make when using spreadsheets. ESRIG tracks “horror stories” where data errors in spreadsheets have led to real consequences. These include billions in missing oil revenue, tens of thousands of Olympic tickets being oversold, and huge salaries being paid because of accidentally inserted zeros. Some errors are deliberate and criminal; others are simply down to the complexity of managing data, and some occur because of the way spreadsheets behave. Read one of these stories and tell us what exactly went wrong.
12
Best practices Columns of data are consistent:
only numbers, dates, or text. Consistent names, codes, and formats (e.g. for dates) are used in each column. Data are all in one table, which is much easier for a statistical program to work with than multiple small tables that each require human intervention.
This shows the same data entered in a way that would make them easy to understand and analyze. The data are not entered in separate blocks arrayed in a single worksheet. They are entered in one table with columns defined by the variables Date, Site, Plot, Species, Weight, Adult, and Comments that are recorded for each sampling event. The columns of data have consistent types: each column contains only numbers, dates, or text. There are consistent names, codes, and formats used in each column. For instance, all dates are in the same format (mm/dd/yyyy), and there are no typos in the site names. Species are all referred to by standard codes. Therefore, if the user wanted to subset the data for species = ‘PERO’, they could easily filter the file for just those data. Additionally, there are only numeric data in the Weight column, so a statistical program or Excel could readily calculate statistics on this column. Preparing metadata for this file would also be straightforward.
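As a minimal sketch (the file name matches the example on the next slide; the column names follow the table described above, so treat this as illustrative rather than exact), a single tidy table like this can be subset and summarised in a line or two:

import pandas as pd

# One table, one variable per column (hypothetical file)
df = pd.read_csv("SEV_SmallMammalData_v.5.25.2010.csv")

pero = df[df["Species"] == "PERO"]   # subset one species
print(pero["Weight"].mean())         # works because Weight contains numbers only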
13
‘SEV_SmallMammalData_v.5.25.2010.csv’
Best practices
Create descriptive column names without spaces or special characters. For example: Soil T30 becomes Soil_Temp_30cm; Species-Code becomes Species_Code. Avoid using -, +, *, ^ in column names; some software may interpret these symbols as operators.
[REMINDER]: Use a descriptive file name. For instance, a file named ‘SEV_SmallMammalData_v.5.25.2010.csv’ indicates the project the data is associated with (SEV), the theme of the data (SmallMammalData), and when this version of the data was created (v.5.25.2010). This name is much more helpful than a file named mydata.xls. A best practice in data entry is to create descriptive column names without spaces or special characters. Statistical programs sometimes have special uses for some characters, so you should avoid using them in your data file.
14
Best practices Missing data
Preferably leave the field empty* (NULL = no value). In numeric fields, use a distinct value such as 9999 to indicate a missing value. In text fields, use NA (“Not Applicable” or “Not Available”). Use data flags in a separate column to qualify a missing value.
Example (stream chemistry data):

Date   Time   NO3_N_Conc   NO3_N_Conc_Flag
       1300   0.013
       1330   0.016
       1400                M1
       1430   0.018
       1500   0.001        E1

M1 = missing; no sample collected
E1 = estimated from grab sample
A preferred way to identify missing data is with an empty field. If for some reason an empty cell is not possible, then use an impossible value such as 9999 in numeric fields, and NA in text fields. Use data flags in a separate column to qualify empty cells. For instance, in this example of stream chemistry data, the flag M1 indicates that the sample was not collected at that interval.
* This is totally up for debate, and is largely discipline- and software-specific.
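A minimal Python sketch (assuming a hypothetical file laid out with the columns shown above) of how sentinel values and flag columns are typically handled on import:

import pandas as pd

# Treat the agreed-upon sentinels 9999 and "NA" as missing values on import
chem = pd.read_csv("stream_chemistry.csv", na_values=[9999, "NA"])

# Keep only values that were actually measured (drop rows flagged M1 = missing)
measured = chem[chem["NO3_N_Conc_Flag"] != "M1"]
print(measured["NO3_N_Conc"].mean())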
15
Best practices: Enter complete lines of data
Sorting an Excel file with empty cells is not a good idea! There are a lot of great things about spreadsheets, but one must be wary of problems that can arise from their use. Spreadsheets, for instance, can sort one column independently of all others. The data entry person for the upper spreadsheet elected to leave empty cells for site, treat, web, plot, and quad. It’s obvious why, and it doesn’t cause the human reader any problems. But if someone happens to sort on Species, it is no longer clear which species maps to which time period or to which measurements. This could make the spreadsheet unusable. It is good practice to fill in all cells when using a spreadsheet for data entry. A best practice is to enter complete lines of data, so that the data can be sorted on any one column without loss of information.
16
Your turn: Spot the faux pas!
17
5 problems:
Headers should be in a single row.
Avoid the use of special characters.
Do not embed charts, graphs, or images.
Do not leave empty rows or columns.
Do not leave empty cells.
Best practice problems: Headers should be in a single row. This spreadsheet contains headers in row 1 and row 2, as well as redundant headers for two columns in rows 14 and 26, which are confusing. There should be a single unique header for each column, preferably in row 1. In addition, many of the columns in this spreadsheet do not have headers at all! Avoid the use of special characters. Special characters are often not exported correctly or not read correctly by other software programs. In this case, it would be easy to remove the percent signs from all of the cells in column H and then indicate in the header that the values in this column are percentages. Do not leave empty rows or columns. These may cause problems when data are exported. Empty rows and columns also tend to indicate the presence of multiple tables in one sheet, which appears to be the case with this example; a sheet should contain only one table of data. Do not leave empty cells. Empty cells sometimes cause problems for other software or when exporting data. They are also confusing, because it is unclear why the cell is empty. Was this not measured? Did the value seem unreliable and so was omitted? Was the value deleted by accident? If a cell must be left empty, make a notation in a comments column about why the cell is empty. Do not embed charts, graphs, or images. They are not included when data are exported. Charts and graphs should be placed in a separate sheet if necessary; they may also be exported as images from Excel. Also note that the chart in this spreadsheet has no labels of any kind!
18
3 problems:
Do not merge cells.
Do not use commas.
Do not use colored text or cell shading.
Best practice problems: Do not use colored text or cell shading. This formatting is lost when the data are exported or the file is opened in another program. Consider adding another column where the information indicated by the coloring and shading can be included in text form. Do not use commas. When data are exported as a comma-separated values (.csv) file, commas within your data cells can cause confusion. They also sometimes indicate that multiple pieces of data (like city, state) are included in a single column. Separate data into multiple fields if appropriate, or use another kind of separator (colon or semicolon, but not a symbol!) if one is necessary. Do not merge cells. Cell merging is typically lost when data are exported and is often misread by other software programs. This may result in the shifting of data cells in the affected rows or columns. In the case shown here, the cells in column 1 should be unmerged and the date information entered into the first cell in each row.
19
2 problems:
Do not mix data types in a single column.
Do not embed comments.
Best practice problems: Do not mix data types in a single column. In this column the first 10 rows are length measurements, then there is one row with the value "2 bobbins," followed by seven rows that are amounts of time. This is very confusing. Create new columns for different types of data, and avoid mixing text and numbers in the same column if possible. Do not embed comments. These are lost when the file is exported. Consider creating a new column where comments can be captured.
20
Introduction to cleaning data
Section 1: Nuts and chewing gum looks at the way data is presented in spreadsheets and how it might cause errors. Section 2: The Invisible Man is in your spreadsheet is concerned with the problems of white spaces and non-printable characters and how they affect our ability to use the data. Section 3: Your data is a witch’s brew deals with consistency in data entry, and how to choose the right unit and format for data. Section 4: Did you bring the wrong suitcase (again)? is about where to put data, and how to structure it. Accompanying these sections is a step-by-step recipe for cleaning a dataset. This is an extensive, handbook-style resource which we refer to in each section. It takes a set of ‘dirty’ data and moves it through the different steps to make it ‘clean’.
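The problem in Section 2 is easy to demonstrate. A minimal Python sketch (with hypothetical values) showing how invisible leading or trailing white space makes apparently identical entries unequal, and how stripping it during cleaning fixes the comparison:

sites = ["Deep Well", "Deep Well ", " Deep Well"]   # note the hidden spaces

print(len(set(sites)))                    # 3 distinct values to the computer

cleaned = [s.strip() for s in sites]      # remove leading/trailing white space
print(len(set(cleaned)))                  # 1 distinct value, as intended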
21
References
Hook, L. A., Santhana Vannan, S. K., Beaty, T. W., Cook, R. B., & Wilson, B. E. Best Practices for Preparing Environmental Data Sets to Share and Archive.
22
Data Quality
23
Types of “bad data”
Incorrect data
Inaccurate data
Business rule violations
Inconsistent data
Incomplete data
Nonintegrated data
24
Incorrect data
For data to be correct (valid), its values must adhere to its domain (valid values). For example, a month must be in the range 1–12, and a person’s age must be less than 130. Taken from: Adelman, S., Abai, M., & Moss, L. T. (2005). Data Strategy. Upper Saddle River, NJ: Addison-Wesley.
25
Inaccurate data
A data value can be correct without being accurate. For example, the city name London and the web country code for France, “.fr”, are both valid values, but used together (as in London, France) the combination is inaccurate, because the city of London is not in France; the accurate country code would be the United Kingdom’s (.uk).
26
Nonintegrated data
Data that have been created separately and not with the intention of future integration. For example, customer data can exist on two or more outsourced systems under different customer numbers, with different spellings of the customer name, and even different phone numbers or addresses. Integrating data from such systems is a challenge.
27
Inconsistent data
Uncontrolled data redundancy results in inconsistencies. Every organization is plagued with redundant and inconsistent data. For example, names or places: “Smith, David” might sit alongside “David Smith”, and “London, UK” alongside “London, England”.
28
Incomplete data
Data might reliably include elements such as name, postal code, gender, age, and AHV number, yet capture elements such as ailment or GP name only haphazardly, or record only an incomplete date of birth.
29
Data Entry Tools
30
Data entry tools Google Forms Spreadsheets Surveys
There are many data entry tools; Google Forms and Excel spreadsheets are commonly used. Data entry tools typically perform data validation, which allows you to control the kind of information that is entered. With data validation, you can provide users with a list of choices and restrict entries to a specific type or size. Using data validation improves the quality of data by preventing the entry of errors.
31
Google Forms
This is an example of a data entry form created in Google Forms. Such forms are easy to create, and free. Here, a form field is being created that will allow the user to select from three locations where data were collected. In practice, Google Forms work best for entering survey data or lots of text data. The advantage of using a data entry form, as opposed to entering data directly into a spreadsheet, is that the form can enforce data entry rules; that is, you can create a pick-list of items for a user to select from. That way, consistent information is entered: a user will always enter Deep Well, instead of DW.
32
Google Forms
Data entered into a Google Form is stored in a Google Sheets spreadsheet. These data can be downloaded for further analysis. Other advantages of using a Google Form to capture data: users can access the form from anywhere* and from multiple devices/machines, and the results are co-located; data are backed up to the cloud; and access permissions can be varied for different group members and collaborators. *Caveat: you must have network/internet/cellular connectivity (depending on the device you are using).
33
Data entry tools: Excel
Excel is a very popular data entry tool. It also allows you to enforce data validation rules. Here, a dropdown list has been generated that allows the user to only select entries from this list. In this way, only defined species codes get entered, and the data is consistent.
34
Excel: data validation
Here is another example of data validation using Excel. Height has been defined to contain values between 11 and 15. When 20 is entered, the user is told that they have entered an illegal value.
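The same rule can be enforced outside of Excel in a script. A minimal, hypothetical Python sketch (the field name and the 11 to 15 range are taken from the example above) that flags out-of-range heights before they enter the dataset:

def validate_height(value, low=11, high=15):
    """Return True if the height falls within the allowed range."""
    return low <= value <= high

for height in [12, 14, 20]:
    if not validate_height(height):
        print(f"Illegal value: {height} is outside the range 11-15")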
35
Data entry tools: Survey Monkey
I am using Qualtrics for a collaborative study of faculty data management plans. Data can be exported and analyzed in Excel, R, Matlab, etc. Advantages: controlled vocabulary; data from 5 locations and people are co-located into one dataset; data entry is FAST; and Qualtrics has filtering and drill-down capabilities for basic data exploration.
36
Databases
37
Spreadsheet vs. Relational Database
Spreadsheets: great for charts, graphs, and calculations; flexible about cell content type (cells in the same column can contain numbers or text); lack record integrity (a column can be sorted independently of all others); easy to use, but harder to maintain as the complexity and size of the data grow.
Databases: easy to query to select portions of the data; data fields are typed (for example, only integers are allowed in integer fields); columns cannot be sorted independently of each other; steeper learning curve than a spreadsheet.
Some researchers are turning to database software instead of spreadsheets for their data management needs. Databases are a powerful option for storing and manipulating datasets. Here, we list some of the pros and cons of spreadsheets vs. databases (which include software such as Oracle, MySQL, SQL Server, and Microsoft Access). Spreadsheets are good at making charts and graphs, and doing calculations. They are easy to use, but they become unwieldy as the number of records grows and a dataset becomes complex. Databases, on the other hand, work well with high volumes of data, and they are much easier to query in order to select data having particular characteristics. They also maintain data integrity; that is, one column cannot be sorted separately from all the others, as it can in a spreadsheet. Databases also enforce data typing, which is a best practice: only data of type ‘text’, for example, can be entered into a column of type ‘text’. This helps prevent data entry errors. Databases do have a steeper learning curve than a spreadsheet such as Excel, but there are many benefits.
38
What is a relational database?
Example tables: Sample sites (*siteID, site_name, latitude, longitude, description); Samples (*sampleID, siteID, sample_date, speciesID, height, flowering, flag, comments); Species (*speciesID, species_name, common_name, family, order).
A relational database is a set of tables, relationships between those tables, and a command language. It matches data stored in tables by using common characteristics found within the data set. This helps preserve data integrity and also makes it possible to flexibly mix and match data to get different combinations of information. A database consists of a set of tables, defined relationships between them (which table is related to which other table), and a powerful command language that facilitates data manipulation. Here, a dataset for plant phenology has been divided into three tables, one describing site information, one describing characteristics of each sample, and one describing the plant species found. Relational databases are currently the predominant choice for storing data such as financial records, medical records, personal information, and manufacturing and logistical data.
39
Database features: explicit control over data types
Columns and constraints: Date <dates only>, Site <text only>, Height <real numbers only>, Flowering <‘y’ and ‘n’ only>. Advantages: quality control and performance.
Database features include explicit control over data types, which has the advantages of quality control and performance. Here, in the plant phenology table, only dates are allowed in the Date column, only text is allowed in the Site column, and only real numbers are allowed in the Height column. If a user tries to enter a ‘?’ under Flowering, the database will reject the entry. This is useful for defining how data are to be entered.
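As a minimal sketch of this idea using SQLite from Python (the table and column names mirror the phenology example, but the exact schema is an assumption for illustration), a typed column plus a CHECK constraint lets the database reject a ‘?’ in the Flowering column:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE phenology (
        obs_date  TEXT NOT NULL,                       -- dates stored as ISO text
        site      TEXT NOT NULL,
        height    REAL,                                -- intended for real numbers
        flowering TEXT CHECK (flowering IN ('y', 'n')) -- only 'y' or 'n' allowed
    )
""")

# This insert is rejected because '?' violates the CHECK constraint
try:
    conn.execute("INSERT INTO phenology VALUES ('2010-02-13', 'Deep Well', 12.5, '?')")
except sqlite3.IntegrityError as err:
    print("Rejected:", err)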
40
Relationships are defined between tables
Observations table:
Date        Site  Species  Flowering?
2/13/2010   A     BOGR2    y
            B     HODR
4/15/2010         BOER4
            C     PLJA     n

Sites table:
Site  Latitude  Longitude
A     34.1      -109.3
B     35.2      -108.6
C     32.6      -107.5

Result of the join:
Date        Site  Species  Flowering?  Latitude  Longitude
2/13/2010   A     BOGR2    y           34.1      -109.3
            B     HODR                 35.2      -108.6
4/15/2010         BOER4
            C     PLJA     n           32.6      -107.5

Relationships can be defined between two sets of data, or in this example between two tables. Suppose that you have two tables used in the plant phenology study, one for observations and one for sites, and you want a table that contains both the observations and the latitude and longitude of your sites. Because both tables contain Site information, they can be joined to create a table containing the information you want. Mix & match data on the fly.
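A minimal Python sketch of the same join, using pandas and a pair of small illustrative tables based on the example above (the exact row pairings are hypothetical; this mirrors what the database does rather than reproducing it exactly):

import pandas as pd

observations = pd.DataFrame({
    "Date": ["2/13/2010", "4/15/2010"],
    "Site": ["A", "C"],
    "Species": ["BOGR2", "PLJA"],
    "Flowering": ["y", "n"],
})
sites = pd.DataFrame({
    "Site": ["A", "B", "C"],
    "Latitude": [34.1, 35.2, 32.6],
    "Longitude": [-109.3, -108.6, -107.5],
})

# Join the two tables on their common Site column
joined = observations.merge(sites, on="Site")
print(joined)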
41
Structured Query Language (SQL)
This table is called SoilTemp:

Date   Plot  Treatment  SensorDepth  Soil_Temperature
       C     R          30           12.8
       B                10           13.2
                                     6.3
       A     N          0            15.1

SQL examples:

Select Date, Plot, Treatment, SensorDepth, Soil_Temperature from SoilTemp where Date = ‘ ’

returns the records collected on the given date:

Date   Plot  Treatment  SensorDepth  Soil_Temperature
       C     R          30           12.8
       B                10           13.2

Select * from SoilTemp where Treatment=‘N’ and SensorDepth=‘0’

returns:

Date   Plot  Treatment  SensorDepth  Soil_Temperature
       A     N          0            15.1

Database features also include a powerful command language called Structured Query Language (SQL). The table at the top of this slide is named SoilTemp in the database. The first example SQL command returns all records collected on a given date. The second select statement returns all records from table SoilTemp where Treatment is N and SensorDepth is 0. From this example you can get a sense of how easy it is to use SQL to subset data based on different criteria. This is only very simple SQL; there is much, much more that can be done with it.
42
If you want to try a database …
… consider trying one of these. Personal, single-user databases can be developed in MS Access, which is stored as a file on the user’s computer; MS Access comes with easy GUI tools to create databases, run queries, and write reports. FileMaker is a similar desktop option. A more robust database that is free, accommodates multiple users, and will run on Windows or Linux is MySQL. GUI interfaces for MySQL include phpMyAdmin (free) and Navicat (inexpensive).
43
To learn more about designing a relational database
Database Design for Mere Mortals: A Hands-On Guide to Relational Database Design (3rd Edition) by Michael J. Hernandez. Addison-Wesley
44
Conclusion
Be aware of best practices when designing data file structures. Choose a data entry method that allows some validation of data as it is entered. Consider investing time in learning how to use a database if datasets are large or complex.
Be aware of best practices when designing data file structures, choose a data entry method that allows validation of data as it is entered, and be sure to invest time in learning how to use a database, especially if the datasets are large or complex. CC image by fo.ol on Flickr
45
Reproducibility Reproducibility at core of scientific method
Complex process = more difficult to reproduce Good documentation required for reproducibility Metadata: data about data Process metadata: data about process used to create, manipulate, and analyze data Reproducibility is at the core of scientific method. If results are not reproducible, the study loses credibility. The complex processes used to create final outputs can be quite difficult to reproduce. In order to maintain scientific integrity, good documentation of the data and the analytical process is essential. Documentation includes metadata, which is data about data, and process metadata, which is data about the process. CC image by Richard Carter on Flickr
46
Ensuring reproducibility: documenting the process
Process metadata: Information about process (analysis, data organization, graphing) used to get to data outputs Related concept: data provenance Origins of data Good provenance = able to follow data throughout entire life cycle Allows for Replication & reproducibility Analysis for potential defects, errors in logic, statistical errors Evaluation of hypotheses Process metadata is information about the process used to create any data outputs. This includes any data cleaning, transformation, reduction, and any analyses performed. A related concept to process metadata is “data provenance”. Provenance means “origin”, so data provenance is a description of the origins of the data. A mark of good provenance is that a person not directly involved with the project is able to follow the data through its life cycle and understand any steps used to create outputs. Good provenance allows for the ability to replicate analyses and reproduce results. Others can identify potential problems, logical, or statistical errors that might affect the study’s outcome. Others are also able to evaluate a study’s hypotheses for themselves. All of these possibilities mean greater accountability and more trustworthy science.
47
Workflows: the basics Precise description of scientific procedure
Conceptualized series of data ingestion, transformation, and analytical steps Three components Inputs: information or material required Outputs: information or material produced & potentially used as input in other steps Transformation rules/algorithms (e.g. analyses) Two types: Informal Formal/Executable A workflow is a formalization of the process metadata. Workflows are commonly used in other fields, including business. In general, a “workflow” is a precise description of the procedures used in a project. It is a conceptualized series of data ingestion, transformation, and analytical steps. A workflow consists of three components. First, there are inputs that contain the information required for the process, for example the raw data. Second the output is information or materials produced, such as final plots of the data. Third, there are transformation rules, algorithms, or analyses that are required to create the outputs from the inputs.
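As a minimal, hypothetical Python sketch of these three components (the file names and column names are invented for illustration), a short commented script is itself an informal workflow: the input, the transformation rules, and the output are all recorded in one place.

import pandas as pd

# Input: raw field data (hypothetical file)
raw = pd.read_csv("raw_small_mammal_data.csv")

# Transformation rules: drop records flagged as missing, then summarise weight by species
clean = raw[raw["Weight_Flag"] != "M1"]
summary = clean.groupby("Species")["Weight"].mean()

# Output: a derived product that later steps (plots, statistics) can use as their input
summary.to_csv("mean_weight_by_species.csv")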
48
Formal/executable workflows
Benefits: Single access point for multiple analyses across software packages Keeps track of analysis and provenance: enables reproducibility Each step & its parameters/requirements formally recorded Workflow can be stored Allows sharing and reuse of individual steps or overall workflow Automate repetitive tasks Use across different disciplines and groups Can run analyses more quickly since not starting from scratch There are many benefits of using scientific workflows. First, they can provide a single access point for multiple analyses across software packages. Second they allow a researcher to keep track of analyses conducted, which enables reproducibility. Third, workflows can be stored as documentation of the research project. A stored workflow is essentially higher-level metadata which offers tremendous potential for scientific advancement. Finally, workflows allow researchers to share and reuse the workflow or its components. This means less time doing repetitive tasks, allows for collaboration across disciplines, and rapid analysis since time is not spent “reinventing the wheel”.
49
Formal/executable workflows
Example: Kepler Software. Open-source, free, cross-platform. Drag-and-drop interface for workflow construction. Steps (analyses, manipulations, etc.) in the workflow are represented by “actors”. Actors connect to form a workflow. Possible applications: theoretical models or observational analyses; hierarchical modeling; nested workflows; accessing data from web-based sources (e.g. databases). Downloads and more information at kepler-project.org.
One example of a scientific workflow software program is Kepler. Kepler is an open-source and free cross-platform program; cross-platform means it can work with any operating system. Kepler uses a drag-and-drop interface for scientists to construct their workflow. Steps in the analytical process are represented by an “actor”. These actors are then connected to form a workflow. Possible applications of Kepler are listed here.
50
Formal/executable workflows
Example: Kepler Software. Actors in a workflow; drag & drop components from the list. Here is a screenshot of the Kepler interface. It has a user-friendly GUI (pronounced “gooey”), or graphical user interface. The list of possible actors is searchable, and you can drag and drop actors into the workflow creation space to the right. Actors are connected via inputs and outputs, represented by black lines.
51
Formal/executable workflows
Example: Kepler Software This model shows the solution to the classic Lotka-Volterra predator prey dynamics model. It uses the Continuous Time domain to solve two coupled differential equations, one that models the predator population and one that models the prey population. The results are plotted as they are calculated showing both population change and a phase diagram of the dynamics. This example workflow solves two coupled equations, one for the predator population, and one for the prey population. The solutions to the equations are then routed to the “Timed plotter” and “XY plotter” in the top of the panel. The text describes what this workflow is calculating.
52
Formal/executable workflows
Example: Kepler Software Output The resulting outputs from this workflow are plots of the predator and prey populations. Kepler and other scientific workflow tools are being developed for use by ecologists and environmental scientists who may not be comfortable creating scripted workflows using the command line. Along with scientific workflows, tools are being developed to facilitate their use, such as VisTrails.
53
Formal/executable workflows
Tutorial / Manual: Kepler Software
docs/trunk/outreach/documentation/shipping/2.5/UserManual.pdf
docs/trunk/outreach/documentation/shipping/2.5/getting-started-guide.pdf
54
Formal/executable workflows
Example: VisTrails. Open-source. Workflow & provenance management support. Geared toward exploratory computational tasks. Can manage evolving scientific workflows. Maintains detailed history about steps & data. VisTrails is an open source workflow tool that provides provenance and management support. It is geared toward exploratory and computational tasks. Using VisTrails, scientists can manage evolving scientific workflows and maintain detailed history about the steps taken and the data consumed and produced. (Screenshot example)
55
Workflows in general
Science is becoming more computationally intensive
Sharing workflows benefits science
Scientific workflow systems make documenting workflows easier
Minimally: document your analysis via informal workflows
Emerging workflow applications (formal/executable workflows) will: link software for executable end-to-end analysis; provide detailed info about data & analysis; facilitate re-use & refinement of complex, multi-step analyses; enable efficient swapping of alternative models & algorithms; help automate tedious tasks
Workflows are beneficial because they document the exact process used to create data outputs. This is especially true with the advent of more computationally intensive processes due to sensor networks, complex statistical programs, and integration of many types of data in a single study. One of the major advantages of workflows is that they allow the analytical process to be shared with other scientists. This would be easier to accomplish if there were a formal way of creating and saving workflows. There are now scientific workflow systems in development that will make documenting workflows easier. This will also increase the ability to share workflows with others. The simplest form of a scientific workflow is using scripts to document the process of analysis. This is often done using scripted programs such as R or Matlab, or, if multiple software packages are used, via the command line using programming languages such as Java, Python, or Perl. However, executing code and analyses via the command line is beyond the expertise of many ecological and environmental scientists.
56
Best practices for data analysis
Scientists should document workflows used to create results:
Data provenance
Analyses and parameters used
Connections between analyses via inputs and outputs
Documentation can be informal (e.g. flowcharts, commented scripts) or formal (e.g. Kepler, VisTrails)
Best practices for data analysis should involve the documentation of workflows, either conceptual or formal, to show how results were obtained. This includes data provenance, the analyses used, the parameters used, and the connections between analyses via inputs and outputs. This documentation can be informal, as in a flowchart or commented script, or more formal, such as Kepler or VisTrails. CC image geek calendar on Flickr
57
Summary Modern science is computer-intensive
Heterogeneous data, analyses, software
Reproducibility is important
Use of informal or formal workflows for documenting process metadata ensures reproducibility, repeatability, validation
In summary, modern science is becoming more and more computationally intensive, and scientists are working with heterogeneous data, analyses, and software. Reproducibility is more important than ever in this environment. Workflows are equivalent to process metadata, also known as provenance. Informal and formal workflows are necessary for reproducibility, repeatability, and validation of your research.
58
Literature
Borer, E. T., Seabloom, E. W., Jones, M. B., & Schildhauer, M. (2009). Some simple guidelines for effective data management. Bulletin of the Ecological Society of America, 90(2), 205–214.
Research Data Management Service Group (n.d.). Preparing tabular data for description and archiving.
DataONE (n.d.). Education modules: Lesson 9 — Analysis and workflows.
Noble, W. S. (2009). A quick guide to organizing computational biology projects. PLoS Computational Biology, 5(7), e