Data: Presentation and Description


Statistics for Health Research. Data: Presentation and Description. Peter T. Donnan, Professor of Epidemiology and Biostatistics

Overview What is Data? Summarising data Displaying data SPSS

Why have you collected data? Most important question! Related to testing hypotheses If you have not got any hypotheses – Get some! Return to later

DATA – Where from? All data is a Sample – a subset of population How was it collected? Potential for bias? What does it represent?

Extrapolating from the sample to population Illustrations Ian Christie, Orthopaedic & Trauma Surgery, Copyright 2002 University of Dundee

Quantitative Data? Observation or measurement of one or more variables. A variable is any quantity measured on a scale. The unit of analysis can be a person or a group (e.g. practice, specimen, cell, time...). Multilevel: patient and practice

Cross-classified 3-level multilevel model: hospital (index k), practice level (index j), patient level (index i)

Statistics Statistics encompasses - Design of study; Methods of collecting, and summarising data; Analysing and drawing appropriate conclusions from data

Variable types
Categorical (qualitative?), e.g. type of drug: 1) selective, 2) non-selective beta-blocker; smoking status: 1) smoker, 2) ex-smoker, 3) non-smoker; deciles of SIMD: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
Numerical (quantitative), e.g. age, birth weight, BP, cholesterol

Categorical
Binary: two categories (yes, no)
Ordinal: categories are mutually exclusive and ordered, e.g. disease stage (mild/moderate/severe)
Nominal: categories are mutually exclusive and unordered, e.g. blood group type (A/B/AB/O)

Numerical
Continuous: takes any value in a range of values, to any degree of precision, e.g. height in m, cholesterol, creatinine
Discrete: integer values, often counts, e.g. number of cigarettes smoked, number of days in hospital

Organisation of data: generally each variable in a separate column and one row per subject

Subject   Age   Gender   Score
1         28    1        15
2         56    2        11
3         43    1        22

1st step in analysis? Look at the data!

Display and summarise data To get a feel for the data To spot errors and missing data Assess the range of values Also ..

Summarising Categorical data

 1. Campylobacter             21. Giardia
 2. Campylobacter             22. Cryptosporidium
 3. Escherichia coli O157     23. Cryptosporidium
 4. Shigella sonnei           24. Campylobacter
 5. Cryptosporidium           25. Shigella sonnei
 6. Giardia                   26. SRSV
 7. Cryptosporidium           27. Cryptosporidium
 8. Campylobacter             28. Campylobacter
 9. Campylobacter             29. Giardia
10. Cryptosporidium           30. Giardia
11. Giardia                   31. Escherichia coli O157
12. Shigella sonnei           32. Shigella sonnei
13. SRSV                      33. Cryptosporidium
14. Giardia                   34. SRSV
15. Escherichia coli O157     35. Campylobacter
16. Campylobacter             36. Campylobacter
17. Giardia                   37. Campylobacter
18. SRSV                      38. Giardia
19. Campylobacter             39. Escherichia coli O157
20. Cryptosporidium           40. Campylobacter

Summarised by frequencies or percentage

Infection                 N (%)
Campylobacter             12 (30.0)
Cryptosporidium            9 (22.5)
Giardia                    8 (20.0)
SRSV                       5 (12.5)
Escherichia coli O157      3 (7.5)
Shigella
Total                     40 (100)
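A minimal sketch of how such a frequency table can be built from raw categorical data. To keep it short it tallies only the first ten entries of the infection list, with spelling normalised; the function name `frequency_table` is an illustrative choice, not something from the slides.

```python
from collections import Counter

# First ten entries of the listed infections (spelling normalised).
infections = [
    "Campylobacter", "Campylobacter", "Escherichia coli O157", "Shigella sonnei",
    "Cryptosporidium", "Giardia", "Cryptosporidium", "Campylobacter",
    "Campylobacter", "Cryptosporidium",
]


def frequency_table(values):
    """Return (category, count, percent) rows, most frequent category first."""
    counts = Counter(values)
    total = sum(counts.values())
    return [(cat, n, round(100 * n / total, 1)) for cat, n in counts.most_common()]


rows = frequency_table(infections)
for cat, n, pct in rows:
    print(f"{cat:<24}{n:>3} ({pct})")
print(f"{'Total':<24}{len(infections):>3} (100)")
```

The same function applied to all 40 entries would give the full table; percentages are always relative to the number of values tallied.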

Numerical data Frequency distributions for continuous variable can be unfeasibly large Grouping may be necessary for presentation

Frequency distribution for continuous variable

Age group (years)   Frequency   Relative (%)   Cumulative relative frequency (%)
0-4                  59         12.2            12.2
5-9                  83         17.1            29.3
10-14                94         19.4            48.7
15-19                72         14.8            63.5
20-24                61         12.6            76.1
25-29                48          9.9            86.0
30-34                36          7.4            93.4
35-49                32          6.6           100
Total               485
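The relative and cumulative columns follow directly from the frequencies. A short sketch, using the age-group counts from the table; the variable names are illustrative.

```python
from itertools import accumulate

age_groups = ["0-4", "5-9", "10-14", "15-19", "20-24", "25-29", "30-34", "35-49"]
frequencies = [59, 83, 94, 72, 61, 48, 36, 32]

total = sum(frequencies)

# Relative frequency: each group's share of the total, as a percentage.
relative = [100 * f / total for f in frequencies]

# Cumulative relative frequency: running count so far, as a percentage.
# Working from running counts avoids adding up already-rounded percentages.
cumulative = [100 * c / total for c in accumulate(frequencies)]

for group, f, rel, cum in zip(age_groups, frequencies, relative, cumulative):
    print(f"{group:<8}{f:>5}{rel:>8.1f}{cum:>8.1f}")
print(f"{'Total':<8}{total:>5}")
```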

Baseline measure cholesterol   N (%)
4.0                            52 (3.1)
4.1                            51 (3.0)
4.2                            49 (2.9)
4.3                            65 (3.9)
4.4                            60 (3.6)
4.5                            80 (4.8)
4.6                            88 (5.2)
4.7                            99 (5.9)
4.8                            94 (5.6)
4.9                            84 (5.0)
5.0                            68 (4.1)
5.1                            66 (3.9)
5.2                            79 (4.7)
5.3                            74 (4.4)
5.4                            75 (4.5)
5.5
5.6                            70 (4.2)
5.7

Baseline group   N (%)
4.0 to 4.4       277 (16.5)
4.5 to 4.9       445 (26.5)
5.0 to 5.4       362 (21.6)
5.5 to 5.9       340 (20.3)
6.0 to 6.9       253 (15.1)
Total            1677

Guide for grouping data
Obtain the min and max values
Choose between 5 and 15 intervals
Summarise but do not obscure the data, especially continuous data
Intervals of equal width are good but not essential
Remember to label tables!
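As a sketch of the grouping step, the single-value cholesterol counts for 4.0 to 5.4 can be collapsed into bands of width 0.5. Values are keyed in tenths (40 means 4.0) so that the banding uses integer arithmetic; that encoding and the function name `group_counts` are assumptions made for the example.

```python
from collections import defaultdict

# Counts from the single-value cholesterol table, keyed by value x 10.
# Only 4.0 to 5.4 is used, where every count is available.
counts_by_tenth = {
    40: 52, 41: 51, 42: 49, 43: 65, 44: 60,
    45: 80, 46: 88, 47: 99, 48: 94, 49: 84,
    50: 68, 51: 66, 52: 79, 53: 74, 54: 75,
}


def group_counts(counts, start, width):
    """Sum counts into equal-width bands [lower, lower + width - 1]."""
    grouped = defaultdict(int)
    for value, n in counts.items():
        lower = start + ((value - start) // width) * width
        grouped[(lower, lower + width - 1)] += n
    return dict(sorted(grouped.items()))


bands = group_counts(counts_by_tenth, start=40, width=5)
for (lower, upper), n in bands.items():
    print(f"{lower / 10:.1f} to {upper / 10:.1f}: {n}")
```

The three band totals agree with the first three rows of the grouped table (277, 445 and 362).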

Take care with missing values. SPSS gives % missing in its output if missing values are left blank in the data. Be careful to report whether a % is a percentage of observed values or a percentage of all subjects: these will differ! A missing-value code (often 9) can be used to make missing values explicit in the output
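A small illustration of why the two denominators give different percentages. The ten responses below are hypothetical values invented for the example, not data from the course; `None` stands in for a blank (missing) cell.

```python
# Hypothetical smoking-status responses; None marks a missing value.
responses = [
    "smoker", "non-smoker", None, "ex", "non-smoker",
    "smoker", None, "non-smoker", "ex", "non-smoker",
]

observed = [r for r in responses if r is not None]
n_smokers = observed.count("smoker")

# Same numerator, two different denominators.
pct_of_observed = 100 * n_smokers / len(observed)    # 2 of 8 observed values
pct_of_all = 100 * n_smokers / len(responses)        # 2 of 10 subjects
pct_missing = 100 * (len(responses) - len(observed)) / len(responses)

print(f"Smokers: {pct_of_observed:.1f}% of observed, {pct_of_all:.1f}% of all subjects")
print(f"Missing: {pct_missing:.1f}%")
```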

Graphs: Simplicity. Consistency. Not duplicating tables or text. Remember a title. Remember to label axes

Graphs – Categorical data Bar charts Pie charts

Bar charts: used to display categorical (or discrete numerical) data. One bar per category. Height of each bar equals its frequency. Each bar the same width and equally spaced. Space between each bar. Vertical axis must start at zero

Most common cancer deaths in UK, 2009 Plots and Statistics from CRUK website http://info.cancerresearchuk.org

Pie charts Displays one variable only Compare 2 groups using 2 charts

But avoid 3-dimensional plots!

Graphs – Numerical data Histograms Frequency polygon Scatter plots Box plots

Histograms: like bar charts but with no spaces between bars. The y axis always begins at zero. The area of each bar represents the frequency in each group

Check data carefully

Florence Nightingale’s ‘Coxcomb’ diagram of Mortality in the Crimean War

Summary measures – Numerical data Central Location (average) Spread or variability (distance of each data point from the average)

Central Location Mean Median Mode - most frequent value

Mean: $\bar{x} = \dfrac{x_1 + x_2 + x_3 + \dots + x_n}{n}$, often written as $\bar{x} = \sum x_i / n$, where Sigma or $\sum$ is ‘Sum of’

2.75 2.86 3.37 2.76 2.62 3.49 3.05 3.12: $\bar{x} = \dfrac{24.02}{8} = 3.00$
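The same arithmetic as a short sketch, using the eight values from the slide:

```python
values = [2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12]

total = sum(values)          # the "sum of" step
mean = total / len(values)   # divide by the number of values

print(f"sum = {total:.2f}, n = {len(values)}, mean = {mean:.2f}")
```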

Mean Advantages Uses all data values Very amenable to statistical analysis; most models use the mean Disadvantages (advantages to politicians and estate agents!) Distorted by outliers Distorted by skewed data

Median: arrange the values in increasing order; the median is the middle value. Easy if there is an odd number of values; for an even number, take the mean of the two middle values:
2.62 2.75 2.76 [2.86 3.05] 3.12 3.37 3.49
Median = (2.86 + 3.05) / 2 = 2.96 litres
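A minimal sketch of the procedure just described: sort, then take the middle value, or the mean of the two middle values when the count is even. The function is written out by hand for illustration; Python's standard library offers `statistics.median` for the same job.

```python
def median(values):
    """Middle value of the sorted data; mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


values = [2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12]
print(sorted(values))
print(f"median = {median(values):.3f}")
```

The unrounded result is 2.955, which the slide reports as 2.96.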

Median Advantages Not distorted by outliers Not distorted by skewed data Disadvantages Ignores most of the information Less amenable to statistical modelling

Measures of spread
17 24 29 36 [47 52] 66 67 81 94: Mean = 51.3, Median = 49.5
50 51 51 51 [51 51] 51 51 51 55: Mean = 51.3, Median = 51

Range
17 24 29 36 [47 52] 66 67 81 94: Range 17 to 94, or 94 - 17 = 77
50 51 51 51 [51 51] 51 51 51 55: Range 50 to 55, or 55 - 50 = 5
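The point of the two samples is that the centre can be the same while the spread is very different. A short sketch comparing them; the names `wide`, `narrow` and `summarise` are illustrative.

```python
from statistics import mean, median

wide = [17, 24, 29, 36, 47, 52, 66, 67, 81, 94]
narrow = [50, 51, 51, 51, 51, 51, 51, 51, 51, 55]


def summarise(values):
    """Return (mean, median, range) for a list of numbers."""
    return mean(values), median(values), max(values) - min(values)


for label, data in (("wide", wide), ("narrow", narrow)):
    m, med, rng = summarise(data)
    print(f"{label:<7} mean = {m}, median = {med}, range = {rng}")
```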

Range from percentiles: data are ordered from smallest to largest value, then divided into equal chunks. Percentiles: data in equal 100ths. Deciles: data in equal 10ths. Quartiles: data in equal 4ths

Interquartile range (IQR): data is ordered and divided into quartiles
4 5 7 | 9 10 12 | 14 19 26 | 39 40 42
Lower quartile = 8, upper quartile = 32.5
Interquartile range (IQR) = 32.5 - 8 = 24.5
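The quartiles on this slide are consistent with taking the median of the lower half and the median of the upper half of the ordered data. The sketch below assumes that convention; statistical packages use several different quartile definitions, so software output may differ slightly from these values.

```python
from statistics import median


def quartiles_by_halves(values):
    """Lower and upper quartile as medians of the lower and upper halves.

    With an odd number of values the middle value is left out of both halves.
    """
    ordered = sorted(values)
    if len(ordered) < 2:
        raise ValueError("need at least two values")
    half = len(ordered) // 2
    return median(ordered[:half]), median(ordered[-half:])


data = [4, 5, 7, 9, 10, 12, 14, 19, 26, 39, 40, 42]
q1, q3 = quartiles_by_halves(data)
iqr = q3 - q1
print(f"lower quartile = {q1}, upper quartile = {q3}, IQR = {iqr}")
```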

IQR in Multiple Box-plots (figure: box plots labelled with outlier, upper quartile, median, lower quartile, IQR and range)

Distribution of data values around the mean (figure: each sample plotted around its mean of 51.3)
17 24 29 36 47 52 66 67 81 94
50 51 51 51 51 51 51 51 51 55

Variance 17 24 29 36 47 52, mean = 34.17 years

x      $x - \bar{x}$   $(x - \bar{x})^2$
17     -17.17          294.69
24     -10.17          103.36
29      -5.17           26.69
36       1.83            3.36
47      12.83          164.69
52      17.83          318.03
Sum      0             910.83

Variance ($s^2$): $s^2 = \dfrac{\sum (x - \bar{x})^2}{n - 1} = \dfrac{910.83}{5} = 182.17$

17 24 29 36 47 52 Mean = 34.17 years Variance = 182.2

Standard deviation ($s$): $s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{182.17} = 13.50$

17 24 29 36 47 52 Mean = 34.17 years SD = 13.50 Coefficient of Variation (CV) = SD / Mean = 0.40 Measure of variability for comparison of different scales
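The variance, standard deviation and CV steps can be reproduced in a few lines. This sketch works at full precision, so its results can differ in the second decimal place from figures rounded by hand at each step.

```python
from math import sqrt

ages = [17, 24, 29, 36, 47, 52]
n = len(ages)

mean = sum(ages) / n
deviations = [x - mean for x in ages]        # (x - mean); these sum to zero
sum_sq = sum(d * d for d in deviations)      # sum of squared deviations

variance = sum_sq / (n - 1)                  # sample variance, divisor n - 1
sd = sqrt(variance)                          # standard deviation
cv = sd / mean                               # coefficient of variation

print(f"mean = {mean:.2f}, sum of squares = {sum_sq:.2f}")
print(f"variance = {variance:.2f}, SD = {sd:.2f}, CV = {cv:.3f}")
```

The standard library functions `statistics.variance` and `statistics.stdev` use the same n - 1 divisor.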

Which central measure goes with which measure of spread? Mean (SD) Median (IQR or Range)

Summary Do not underestimate value of looking at the data Gives a feel for the data before testing or modelling Check for missing data Check for outliers

From Jan 2010 IBM acquired copyright for SPSS

Statistics: Baseline LDL

N Valid                    1383
N Missing                  0
Mean                       3.454363
Median                     3.506214
Std. Deviation             .9889157
Variance                   .978
Skewness                   .039
Std. Error of Skewness     .066
Range                      7.2305
Minimum                    .3345
Maximum                    7.5650
Percentiles 25             2.881000
Percentiles 50             3.506214
Percentiles 75             4.013000

Implementing Kaplan-Meier in SPSS. From Colorectal.sav you need to specify:
Survival time: time from surgery (tfsurg)
Status: dead = 1, censored = 0 (dead)
Factor: e.g. hypertension comorbidity (hyperco)
Select plot of survival

Implementing Kaplan-Meier plot in SPSS

Select options to obtain plot and median survival

Means and Medians for Survival Time

Mean (a)
Hypertension   Estimate   Std. Error   95% CI Lower Bound   95% CI Upper Bound
0              1608.130    79.571      1452.172             1764.089
1              1374.811   118.809      1141.946             1607.676
Overall        1546.238    64.938      1418.959             1673.517

Median
Hypertension   Estimate   Std. Error   95% CI Lower Bound   95% CI Upper Bound
0              1386.000    98.283      1193.366             1578.634
1               909.000   180.498       555.223             1262.777
Overall        1255.000    83.249      1091.831             1418.169

a. Estimation is limited to the largest survival time if it is censored.
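The confidence bounds in this output appear consistent with the usual normal approximation, estimate ± 1.96 × standard error. That this is how the bounds were produced is an inference from the numbers, not something stated on the slide. A quick check using the figures from the table:

```python
Z_95 = 1.96  # normal quantile for a two-sided 95% interval (assumed)


def normal_ci(estimate, std_error, z=Z_95):
    """Return (lower, upper) bounds of estimate +/- z * std_error."""
    margin = z * std_error
    return estimate - margin, estimate + margin


# (estimate, standard error) pairs for median survival time, from the table.
median_survival = {
    "No hypertension": (1386.000, 98.283),
    "Hypertension": (909.000, 180.498),
    "Overall": (1255.000, 83.249),
}

for group, (estimate, se) in median_survival.items():
    lower, upper = normal_ci(estimate, se)
    print(f"{group:<16} {estimate:9.3f}  ({lower:.3f} to {upper:.3f})")
```

The computed bounds agree with the tabulated ones to within rounding of the standard errors.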

Survival curves for women with glioma by diagnosis. Bland J M , Altman D G BMJ 2004;328:1073

Practical Read LDL.sav or colorectal.sav into SPSS (22) and explore the different types of data using appropriate tables and graphs DEBU website (http://medicine.dundee.ac.uk/dundee-epidemiology-and-biostatistics-unit-debu)