Preparing Data for Analysis National Center for Immunization & Respiratory Diseases Influenza Division Nishan Ahmed Regional Training Workshop on Influenza.

Presentation transcript:

Preparing Data for Analysis National Center for Immunization & Respiratory Diseases Influenza Division Nishan Ahmed Regional Training Workshop on Influenza Data Management Phnom Penh, Cambodia July 27 – August 2, 2013

Data Cleaning: What is it?
Check for accuracy of observations and correct or eliminate inaccuracies
  – Important for both simple and complex data
Questions to ask:
  – Are values outside of what you would normally observe?
  – If yes, are values due to inaccuracies in the data or to real changes in activity (e.g. an outbreak, start of influenza season)?
Values can be inaccurate due to many factors:
  – Data entry mistake
  – Incorrect measurement at site
  – Incorrect analysis

Data Cleaning: Why do it?
To prepare your data for regular analysis
  – Steps:
      Prepare a copy for temporary cleaning, but also clean the original data source as corrections are validated
      If data is not cleaned at source, cleaning will need to be done each time analysis is attempted (i.e. records can be temporarily deleted until verified or corrected)
To finalize a dataset for future analysis/create a clean copy to be used for research
  – Typically a more thorough process than cleaning during a flu season

Data Cleaning: Why do it?
To check for validity and consistency of reported variables
  – Ensures that the data collected makes sense
Examples:
  – # of ILI cases is not greater than the # of patient visits
  – The date of onset is before the date of death
  – Only enrolled sites should be reporting & included in analysis of sentinel data
To check for data outliers
  – A facility that normally sees ~100 patient visits will probably not see 1,000 patients during a week
To identify and remove duplicate records
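
The validity checks listed on this slide could be sketched in Python roughly as follows. This is only an illustration, not the workshop's own tooling: the field names (`site_id`, `ili_cases`, `patient_visits`), the site IDs, and the choice to accept an onset date equal to the date of death are all assumptions.

```python
from datetime import date

# Hypothetical set of enrolled sentinel sites; these IDs are made up.
ENROLLED_SITES = {"S01", "S02"}

def check_weekly_report(report):
    """Return a list of validity problems found in one weekly site report."""
    problems = []
    # Only enrolled sites should be reporting.
    if report["site_id"] not in ENROLLED_SITES:
        problems.append("site is not enrolled")
    # ILI cases cannot exceed the total number of patient visits.
    if report["ili_cases"] > report["patient_visits"]:
        problems.append("ILI cases exceed patient visits")
    return problems

def check_case_dates(onset, death):
    """True when no death is recorded or onset is not after the date of death.

    Assumption: onset on the same day as death is treated as valid.
    """
    return death is None or onset <= death

good = {"site_id": "S01", "ili_cases": 12, "patient_visits": 100}
bad = {"site_id": "S09", "ili_cases": 120, "patient_visits": 100}
print(check_weekly_report(good))
print(check_weekly_report(bad))
print(check_case_dates(date(2013, 7, 1), date(2013, 6, 28)))
```

Records that fail a check would normally be followed up with the site rather than silently dropped.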

Methods to identify problems
How do you find data that has problems?
  – Eyeball method
  – Through quick, simple data queries
      Access or Excel queries as you go
  – Statistical methods
  – Through pre-programmed automated processes
      Used for elements that are routinely cleaned
      Example: Automated process for deleting duplicate records

Eyeball Method

Quick and Simple Queries
To find duplicate records, using Access

Quick and Simple Queries
To check validity of variables

Automated Processes: Duplicates
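
The slide's original illustration is not part of this transcript. As a stand-in, an automated duplicate check might look something like the sketch below. It assumes that a duplicate is any record repeating the same key fields (here a hypothetical `site_id` and `week`), and it sets duplicates aside instead of discarding them so they can be reviewed.

```python
def remove_duplicates(records, key_fields):
    """Keep the first record seen for each key; return (kept, removed)."""
    seen = set()
    kept, removed = [], []
    for record in records:
        # Build the key that defines "the same record".
        key = tuple(record[field] for field in key_fields)
        if key in seen:
            removed.append(record)  # set aside for review
        else:
            seen.add(key)
            kept.append(record)
    return kept, removed

# Made-up weekly reports; the first two share the same site and week.
reports = [
    {"site_id": "S01", "week": 30, "ili_cases": 12},
    {"site_id": "S01", "week": 30, "ili_cases": 12},
    {"site_id": "S02", "week": 30, "ili_cases": 7},
]
kept, removed = remove_duplicates(reports, ("site_id", "week"))
print(len(kept), len(removed))
```

Which fields make up the key depends on the dataset; a specimen-level table would likely use a specimen ID instead.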

Basic Statistical Measures
Measures of Center
  – Mean: Sum of the observations divided by the number of observations
  – Median: The middle value in an ordered list
  – Mode: The most frequently occurring value
Measures of Variation or Spread
  – Standard Deviation: measures variation by indicating how far, on average, the observations are from the mean
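
For readers working outside Excel, the same measures can be computed with Python's standard `statistics` module. The visit counts below are made up, and the sketch assumes the sample standard deviation (dividing by n - 1) is wanted; `statistics.pstdev` would give the population version.

```python
import statistics

# Made-up weekly patient-visit counts for one facility.
visits = [95, 100, 100, 105, 110]

mean = statistics.mean(visits)      # sum of observations / number of observations
median = statistics.median(visits)  # middle value of the ordered list
mode = statistics.mode(visits)      # most frequently occurring value
sd = statistics.stdev(visits)       # sample standard deviation (n - 1 denominator)

print(mean, median, mode, round(sd, 2))
```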

Equations in Excel
Mean, Median, Standard Deviation

Data Cleaning Processes
Example: Checking for outliers
  – The US ILI system uses a statistical process to check for outliers:
      Look at # of patient visits over time from a given provider
      That # should be consistent within a certain degree of change (e.g. 4 standard deviations from the mean)
      All values above or below this range are selected and checked manually to verify whether or not the values are reasonable and make sense
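
A rough Python sketch of this kind of check follows. It is not the US ILI system's actual code. It assumes the mean and standard deviation are computed from a provider's earlier weeks only, so that a new extreme value does not inflate its own baseline, and it uses the 4-standard-deviation threshold mentioned above as a default.

```python
import statistics

def flag_outliers(history, new_values, n_sd=4):
    """Return the new values lying more than n_sd standard deviations
    from the mean of the provider's historical counts."""
    mean = statistics.mean(history)
    sd = statistics.stdev(history)  # sample standard deviation
    lower, upper = mean - n_sd * sd, mean + n_sd * sd
    return [v for v in new_values if v < lower or v > upper]

# Made-up history of weekly patient visits for one provider.
history = [95, 100, 100, 105, 110]
flagged = flag_outliers(history, [104, 1000, 20])
print(flagged)
```

Flagged values are candidates for manual review, not automatic deletion: a real outbreak can produce a genuine spike.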

Data Outliers in Excel

Data Cleaning
01002: Data could not be disproved, left in
: Fixed data based on returned workfolder
04108: Data looked OK to surveillance staff, this was the peak of pandemic, and we would have expected numbers to be high

Error Logs
List of errors found during the cleaning process
Helps to keep track of changes made to records during the cleaning process
  – Keep track of how the data has changed over time
  – Used for follow-up on questions to sites
May be manual or automated
  – Based on needs of the data

Example of Error Log
Date | State | Specimen ID | Patient ID | Field | Prior Value | Current Value | Reason for Change | Your Initials | Comments
2/9/11 | MD | A | | SPECIMEN id | A | A b | coinfection H3 and 2009 H1N1 | AB | changed one specimen id to 'b' so would be coded as two separate viruses
2/9/11 | MD | A | | SPECIMEN id | A | A b | coinfection H3 and 2009 H1N1 | AB | changed one specimen id to 'b' so would be coded as two separate viruses
2/9/11 | MD | A | | SPECIMEN id | A | A B | coinfection B and 2009 H1N1 | AB | changed one specimen id to 'b' so would be coded as two separate viruses
2/9/11 | SD | M11VR | | SPECIMEN id | M11VR | M11VR (a, b, c) | coinfection 2009 H1N1, H3, and B | AB | changed one specimen id to 'b' so would be coded as two separate viruses
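
An automated log with the same columns could be kept as a CSV file. The sketch below is one possible approach using Python's `csv` module; the function name `append_log_entry` and the sample entry values are hypothetical, and an in-memory buffer stands in for a real file.

```python
import csv
import io

# Column names taken from the example error log above.
LOG_FIELDS = ["Date", "State", "Specimen ID", "Patient ID", "Field",
              "Prior Value", "Current Value", "Reason for Change",
              "Your Initials", "Comments"]

def append_log_entry(log_file, entry):
    """Write one change record; fields not supplied are left blank.
    Keys that are not log columns raise ValueError (DictWriter default)."""
    writer = csv.DictWriter(log_file, fieldnames=LOG_FIELDS, restval="")
    writer.writerow(entry)

# In-memory stand-in for a log file opened in append mode.
log = io.StringIO()
csv.DictWriter(log, fieldnames=LOG_FIELDS).writeheader()

# Made-up correction, for illustration only.
append_log_entry(log, {
    "Date": "2013-07-29",
    "State": "XX",
    "Field": "ili_cases",
    "Prior Value": "120",
    "Current Value": "12",
    "Reason for Change": "data entry mistake confirmed by site",
    "Your Initials": "ZZ",
})

rows = list(csv.DictReader(io.StringIO(log.getvalue())))
print(rows[0]["Prior Value"], "->", rows[0]["Current Value"])
```

Recording both the prior and the current value keeps every correction reversible and makes follow-up questions to sites easier to answer.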

Conclusions
Preparing data for analysis includes finding and cleaning as many data errors as possible
  – Statistical methods, the eyeball method, and simple queries can all be used to find potential data errors
Data cleaning is important because data errors could alter the interpretation of data (e.g. could cause a perceived increase without a true increase in disease activity)
Error logs are useful in accounting for errors and how they were dealt with