Cornell University June 2016 Sponsored by Cornell Statistical Consulting Unit Instructors Emily Davenport (Cornell University) Erika Mudrak (CSCU) Lynn.

Slides:



Advertisements
Similar presentations
Obesity e-Lab Enabling obesity research using the Health Surveys for England: The Obesity e-Lab project Dexter Canoy The University of Manchester
Advertisements

ICDL Software Applications - Database Concepts. Unit 6 Data and Data Representation Database Concepts –File Structure –Relationships Database Design –Data.
Do files, log files, and workflow in Stata Biostatistics 212 Lecture 2.
Cornell University January 2015 Sponsored by Cornell Statistical Consulting Unit Instructors Emily Davenport (Cornell University) Erika Mudrak (CSCU) Jeramia.
IDigBio is funded by a grant from the National Science Foundation’s Advancing Digitization of Biodiversity Collections Program (Cooperative Agreement EF ).
Data Cleaning, Validation and Enhancement iDigBio Wet Collections Digitization Workshop March 4 – 6, 2013 KU Biodiversity Institute, University of Kansas.
David Novo De Novo Software Metadata (sharing info) technology already exists (banks, insurance companies) metadata – standards are close and relatively.
Important concepts in software engineering The tools to make it easy to apply common sense!
Ann Arbor ASA ‘Up and Running’ Series: SPSS Prepared by volunteers of the Ann Arbor Chapter of the American Statistical Association, in cooperation with.
WV-INBRE West Virginia IDeA Network of Biomedical Research Excellence Managing the NextGen data pipeline Jim Denvir, Ph.D.
Generating new variables and manipulating data with STATA Biostatistics 212 Lecture 3.
Data analysis Incorporating slides from IS208 (© Yale Braunstein) to show you how 208 and 214 are telling you many of the the same things; and how to use.
Introduction to Statistical Computing in Clinical Research Biostatistics 212 Course director: Mark Pletcher Teaching Assistant: Lee Zane.
Reference and Instruction Automated Statistics Gathering and Reporting System Members: Patrick Chen (pyc7) Soo-Yung Cho (sc444) Gregg Herlacher (gah24)
Spreadsheet design an overview of further issues Research Methods Group Wim Buysse – ICRAF-ILRI Research Methods Group October 2004.
RESEARCH HUB AT THE UNIVERSITY LIBRARIES PENN STATE UNIVERSITY TOUR OF STATISTICAL PACKAGES.
Rebecca Boger Earth and Environmental Sciences Brooklyn College.
WEEK 3 (23-26 SEPTEMBER 2013) HOMEWORK DUE TUESDAY: “TEXTING BY THE NUMBERS” DUE WEDNESDAY: HOW ARE YOU UNDERSTANDING DATA AND TECHNOLOGY? DUE THURSDAY:
April 11, 2008 Data Mining Competition 2008 The 4 th Annual Business Intelligence Symposium Hualin Wang Manager of Advanced.
Flow Cytometry and Reproducible Analysis Cliburn Chan Department of Biostatistics and Bioinformatics, DUMC.
ICTR Data Manager Interest Group Meeting April 15, 2011 Results from Data Manager’s Survey Kerry Stewart, Ed.D., FAHA, FAACVPR, FACSM, FSGC Professor,
Biostatistics Analysis Center Center for Clinical Epidemiology and Biostatistics University of Pennsylvania School of Medicine Minimum Documentation Requirements.
1 Yolanda Gil Information Sciences InstituteJanuary 10, 2010 Requirements for caBIG Infrastructure to Support Semantic Workflows Yolanda.
Sterling Chadee Director of Statistics. The processing of the data from the field enumeration began in July 2011 until September All data processors.
Not our data, but we use it in research Wietse Dol, LEI-WUR 6 October 2014.
Reaching Students with Passion-Driven, Project-Based Statistics CAUSE Teaching & Learning webinar June 11, 2013.
Research Services Introduction to research data management - a physical science case study Slides provided by DaMaRO Project, University of Oxford.
Institute for Tribal Environmental Professionals Tribal Air Monitoring Support Center Melinda Ronca-Battista Brenda Sakizzie Jarrell Southern Ute Indian.
Copyright 2010, The World Bank Group. All Rights Reserved. Data Processing and Tabulation, Part I.
Raw Data Cleaning, Validation and Enhancement The Field Museum - Chicago, Illinois iDigBio Entomology Digitization Workshop Deborah Paul, iDigBio April.
IDigBio is funded by a grant from the National Science Foundation’s Advancing Digitization of Biodiversity Collections Program (Cooperative Agreement EF ).
Introduction to STATA for Clinical Researchers Jay Bhattacharya August 2007.
The iPlant Collaborative Community Cyberinfrastructure for Life Science Tools and Services Workshop Objectives.
1 Click to edit Master title style Demographic Analysis Panel Current and Future State FDA/PhUSE CSS - Working Group 5 - Analysis Standards Script Examples.
Questionnaire Development: SPSS and Reliability Personality Lab October 8, 2010.
Data Management In The Lab Lynn Yarmey National Snow and Ice Data Center Version 1.0 Review Date.
Example SPSS Basic Medical Statistics Course October 2010 Wilma Heemsbergen.
Background Cornell Institute for Social and Economic Research (CISER): Data and Computing Support for Social and Economic Researchers at Cornell University.
A Powerful Python Library for Data Analysis BY BADRI PRUDHVI BADRI PRUDHVI.
Not our data, but we use it in research Wietse Dol, LEI-WUR 9 February 2015, Forum C214.
Introduction to R Introductions What is R? RStudio Layout Summary Statistics Your First R Graph 17 September 2014 Sherubtse Training.
TIMOTHY SERVINSKY PROJECT MANAGER CENTER FOR SURVEY RESEARCH Data Preparation: An Introduction to Getting Data Ready for Analysis.
KEEP THIS TEXT BOX this slide includes some ESRI fonts. when you save this presentation, use File > Save As > Tools (upper right) > Save Options > Embed.
PSY6010: Statistics, Psychometrics and Research Design Professor Leora Lawton Spring 2007 Wednesdays 7-10 PM Room 204.
LO3 Know the features and functions of information systems.
Data Organization Quality Assurance and Transformations.
Jerry Reiter Department of Statistical Science and the Information Initiative at Duke Duke University.
I wonder if,……… How to use the scientific process to find answers to what you wonder about.
Impact of the New ASA Undergraduate Curriculum Guidelines on the Hiring of Future Undergraduates Robert Vierkant Mayo Clinic, Rochester, MN.
HRP Copyright © Leland Stanford Junior University. All rights reserved. Warning: This presentation is protected by copyright law and.
R Workshop #2 Basic Data Analysis. What we did last week: Understand the basics of how R works Generated objects (vectors, matrices, etc.) Read in data.
1 PEER Session 02/04/15. 2  Multiple good data management software options exist – quantitative (e.g., SPSS), qualitative (e.g, atlas.ti), mixed (e.g.,
PREPARED BY: PN. SITI HADIJAH BINTI NORSANI. LEARNING OUTCOMES: Upon completion of this course, students should be able to: 1. Understand the structure.
How math majors can help your business Dr. Jonathan D Adler jadler.info.
Software Sustainability Institute Data Carpentry Aleksandra Pawlik Software Sustainability Institute Data Science Club, 17 th March.
The Reproducible Research Advantage Why + how to make your research more reproducible Presentation for the Center for Open Science June 17, 2015 April.
Introduction to the SPSS Interface
SPSS: Using statistical software — a primer
Accessing data – a user’s perspective
Mark Laufersweiler Research Data Specialist
Tools and Services Workshop
Joslynn Lee – Data Science Educator
UF Graduate Linguistics Society Seminar Series
Florida State University
What Are Databases? Organized by Dr. Farrokh Alemi PhD
HMI 7530– Programming in R Introduction
STAT 4030 – Programming in R Introduction
Analytics: Its More than Just Modeling
Department of Computer Science DCC University of Chile
Introduction to the SPSS Interface
Presentation transcript:

Cornell University June 2016 Sponsored by Cornell Statistical Consulting Unit Instructors Emily Davenport (Cornell University) Erika Mudrak (CSCU) Lynn Johnson (CSCU) Assistants Francoise Vermeylen Stephen Parry Kevin Packard David Kent David Bindel

Goal: Develop and teach workshops to help train the next generation of researchers in good data analysis and management practices to enable individual research progress and open and reproducible research.

Community driven effort Staff Executive Director Tracy K. Teal, PhD Associate Director Erin Becker, PhD Program Coordinator Maneesha Sane Steering Committee Members Karen Cranston, PhD, Principal Investigator, Open Tree of Life Hilmar Lapp, Director of Informatics, Duke Center for Genomic & Computational Biology Aleksandra Pawlik, PhD, Training Lead, Software Sustainability Institute Karthik Ram, PhD, rOpenSci co-founder, Berkeley Institute for Data Science Fellow Ethan White, PhD, Associate Professor, University of Florida Greg Wilson, PhD, Co-Founder and Director of Training, Software Carpentry Foundation Open source materials

I usually manage data in Excel and it's terrible and I want to do it better. I'm organizing GIS data and it's becoming a nightmare. My advisor insists that we store 50,000 barcodes in a spreadsheet, and something must be done about that. I'm having a hard time analyzing microarray, SNP or multivariate data with Excel and Access. I want to use public data. I work with faculty at undergrad institutions and want to teach data practices, but I need to learn it myself first. I'm interested in going in to industry and companies are asking for data analysis experience. I'm trying to reboot my lab's workflow to manage data and analysis in a more sustainable way. I'm re-entering data over and over again by hand and know there's a better way. I have overwhelming amounts of data. I'm tired of feeling out of my depth on computation and want to increase my confidence. Sentiments on data within the NSF BIO Centers (BEACON, SESYNC, NESCent, iPlant, iDigBio)

Two kinds of questions Raise your hand for a question that everyone could benefit Sticky note when your code doesn’t work and you need a helper to come

Reproducible Research Well documented and Repeatable

Reproducible Research Data analysis – Data and analysis can be re-created by anyone Including you in the future! Repeat analysis on updated data Repeat analyses on similar datasets – Scripted data management and analysis Manages and analyzes Provides a record of what was done Easy to edit and re-run

Raw Data Cleaned Data Analysis Results FiguresTables Publication Fame Data Cleaning Script Summarizing Script Analysis Script Figure Script Results Formatting Script Working Data

Raw Data Cleaned Data Analysis Results FiguresTables Publication Fame Data Cleaning Script Summarizing Script Analysis Script Figure Script Results Formatting Script Updated Raw Data Working Data

Raw Data Cleaned Data Data Cleaning Script Univariate & Bivariate EDA Find/Replace values Merge grouping labels Re-code variables Fix typos Standardize entries Convert dates Convert variable formats Missing values

Raw Data Cleaned Data Data Cleaning Script Summarizing Script Subset data for particular project Transform variables Average, min, max by group imputation Working Data

Raw Data Cleaned Data Analysis Results Data Cleaning Script Summarizing Script Analysis Script Linear Models Mixed Models Search for Correlates Loop! General Functions Working Data

Raw Data Cleaned Data Analysis Results FiguresTables Data Cleaning Script Summarizing Script Analysis Script Figure Script Results Formatting Script Plotting Table making Working Data

Raw Data Cleaned Data Analysis Results FiguresTables Publication Data Cleaning Script Summarizing Script Analysis Script Figure Script Results Formatting Script Paper Writing Script Working Data

Raw Data Cleaned Data Working Data Analysis Results FiguresTables Data Cleaning Script Summarizing Script Analysis Script Figure Script Results Formatting Script New Raw Data Cleaned Data Working Data Analysis Results FiguresTables

Raw Data Cleaned Data Summarized Data Analysis Results FiguresTables Data Cleaning Script Summarizing Script Analysis Script Figure Script Results Formatting Script Cleaned Data Working Data Analysis Results FiguresTables Re-use and edit scripts for new projects New Raw Data

Raw Data Cleaned Data Analysis Results FiguresTables Publication Fame Data Cleaning Script Summarizing Script Analysis Script Figure Script Results Formatting Script Working Data Univariate & Bivariate EDA Find/Replace values Merge grouping labels Re-code variables Fix typos Standardize entries Convert dates Convert variable formats Missing values Subset data for particular project Transform variables Average, min, max by group imputation Plotting Table making Linear Models Mixed Models Search for Correlates Loop! General Functions

Data Cleaning Script Summarizing Script Analysis Script Results Formatting Script Univariate & Bivariate EDA Find/Replace values Merge grouping labels Re-code variables Fix typos Standardize entries Convert dates Convert variable formats Missing values Subset data for particular project Transform variables Average, min, max by group imputation Linear Models Mixed Models Search for Correlates Loops! General Functions Plotting Table making Excel OpenRefine SQL databases R loops & functions Raw Data R markdown / RStudio Monday morning Monday Afternoon Tuesday Afternoon R dplyr, ggplot Tuesday Morning