Data: Presentation and Description

1 Data: Presentation and Description
Statistics for Health Research. Peter T. Donnan, Professor of Epidemiology and Biostatistics

2 Overview
What is data? Summarising data. Displaying data. SPSS.

3 Why have you collected data?
Most important question! It relates to testing hypotheses. If you have not got any hypotheses, get some! We return to this later.

4 DATA – Where from?
All data is a sample: a subset of the population. How was it collected? Is there potential for bias? What does it represent?

5 Extrapolating from the sample to population
Illustrations Ian Christie, Orthopaedic & Trauma Surgery, Copyright 2002 University of Dundee

6 Quantitative Data?
Observation or measurement of one or more variables. A variable is any quantity measured on a scale. The unit of analysis can be a person, a group (e.g. a practice), a specimen, a cell, a time... Multilevel: patient and practice.

7 Cross-classified 3-level multilevel model
Hospital level (k); Practice level (j); Patient level (i)

8 Statistics
Statistics encompasses: design of the study; methods of collecting and summarising data; analysing and drawing appropriate conclusions from data.

9 Variable types
Categorical (qualitative?). E.g. type of drug: 1) selective, 2) non-selective beta-blocker; smoking status: 1) smoker, 2) ex-smoker, 3) non-smoker; deciles of SIMD: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10.
Numerical (quantitative). E.g. age, birth weight, BP, cholesterol.

10 Categorical
Ordinal: categories are mutually exclusive and ordered, e.g. disease stage (mild/moderate/severe).
Nominal: categories are mutually exclusive and unordered, e.g. blood group type (A/B/AB/O).
Binary: two categories (yes, no).

11 Numerical
Continuous: takes any value in a range of values, to any degree of precision, e.g. height in m, cholesterol, creatinine.
Discrete: integer values, often counts, e.g. number of cigarettes smoked, number of days in hospital.

12 Organisation of data
Generally each variable is in a separate column, with one row per subject. Column headings in the example: Subject, Age, Gender, Score.
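As a rough illustration of the one-row-per-subject layout, the sketch below reads a small table with Python's standard csv module. The column names follow the slide; the three rows of values are invented for illustration and are not from the course data.

```python
import csv
import io

# Hypothetical data, invented for illustration: one row per subject,
# one column per variable.
RAW = """Subject,Age,Gender,Score
1,34,F,7
2,51,M,4
3,29,F,9
"""

rows = list(csv.DictReader(io.StringIO(RAW)))

# Convert the numerical columns from text to numbers; Gender stays categorical.
for row in rows:
    row["Age"] = int(row["Age"])
    row["Score"] = int(row["Score"])

ages = [row["Age"] for row in rows]
print(len(rows), "subjects; ages:", ages)
```

With this layout, each variable can be pulled out as a single column for summarising.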

13 1st step in analysis? Look at the data!

15 Display and summarise data
To get a feel for the data. To spot errors and missing data. To assess the range of values. Also ...

17 Summarising Categorical data
 1. Campylobacter            21. Giardia
 2. Campylobacter            22. Cryptosporidium
 3. Escherichia coli O157    23. Cryptosporidium
 4. Shigella sonnei          24. Campylobacter
 5. Cryptosporidium          25. Shigella sonnei
 6. Giardia                  26. SRSV
 7. Cryptosporidium          27. Cryptosporidium
 8. Campylobacter            28. Campylobacter
 9. Campylobacter            29. Giardia
10. Cryptosporidium          30. Giardia
11. Giardia                  31. Escherichia coli O157
12. Shigella sonnei          32. Shigella sonnei
13. SRSV                     33. Cryptosporidium
14. Giardia                  34. SRSV
15. Escherichia coli O157    35. Campylobacter
16. Campylobacter            36. Campylobacter
17. Giardia                  37. Campylobacter
18. SRSV                     38. Giardia
19. Campylobacter            39. Escherichia coli O157
20. Cryptosporidium          40. Campylobacter

18 Summarised by frequencies or percentages
Infection                 N (%)
Campylobacter             12 (30.0)
Cryptosporidium            9 (22.5)
Giardia                    8 (20.0)
SRSV                       5 (12.5)
Escherichia coli O157      3 (7.5)
Shigella
Total                     40 (100)
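A minimal sketch of how such a frequency table could be produced, using Python's collections.Counter. The eight cases below are a hypothetical sample invented for illustration, not the 40 cases listed in the slides.

```python
from collections import Counter

# Hypothetical sample, invented for illustration.
cases = ["Campylobacter", "Giardia", "Campylobacter", "SRSV",
         "Cryptosporidium", "Giardia", "Campylobacter", "Shigella sonnei"]

counts = Counter(cases)
total = sum(counts.values())

# One row per category, most frequent first: N and percentage of all cases.
table = [(name, n, round(100 * n / total, 1)) for name, n in counts.most_common()]

for name, n, pct in table:
    print(f"{name:<18}{n:>3} ({pct})")
print(f"{'Total':<18}{total:>3} (100)")
```

Sorting by frequency, as in the slide, makes the most common category easy to spot.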

19 Numerical data
Frequency distributions for a continuous variable can be unfeasibly large. Grouping may be necessary for presentation.

20 Frequency distribution for continuous variable
Age group (years)   Frequency   Relative frequency (%)   Cumulative relative frequency (%)
0-4                  59         12.2                      12.2
5-9                  83         17.1                      29.3
10-14                94         19.4                      48.7
15-19                72         14.8                      63.5
20-24                61         12.6                      76.1
25-29                48          9.9                      86.0
30-34                36          7.4                      93.4
35-49                32          6.6                     100
Total               485
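The relative and cumulative columns follow directly from the frequencies. The sketch below recomputes them from the counts on the slide; the variable names are illustrative choices, not anything from the original course material.

```python
from itertools import accumulate

age_groups = ["0-4", "5-9", "10-14", "15-19", "20-24", "25-29", "30-34", "35-49"]
frequencies = [59, 83, 94, 72, 61, 48, 36, 32]

total = sum(frequencies)

# Relative frequency: each count as a percentage of the total.
relative = [round(100 * f / total, 1) for f in frequencies]

# Cumulative relative frequency: accumulate the raw counts first and
# round at the end, so rounding errors do not build up.
cumulative = [round(100 * c / total, 1) for c in accumulate(frequencies)]

for group, f, r, c in zip(age_groups, frequencies, relative, cumulative):
    print(f"{group:<8}{f:>5}{r:>8}{c:>8}")
```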

21 Baseline measure
Cholesterol   N (%)
4.0           52 (3.1)
4.1           51 (3.0)
4.2           49 (2.9)
4.3           65 (3.9)
4.4           60 (3.6)
4.5           80 (4.8)
4.6           88 (5.2)
4.7           99 (5.9)
4.8           94 (5.6)
4.9           84 (5.0)
5.0           68 (4.1)
5.1           66 (3.9)
5.2           79 (4.7)
5.3           74 (4.4)
5.4           75 (4.5)
5.5
5.6           70 (4.2)
5.7

22 Baseline group   N (%)
4.0 to 4.4          277 (16.5)
4.5 to 4.9          445 (26.5)
5.0 to 5.4          362 (21.6)
5.5 to 5.9          340 (20.3)
6.0 to 6.9          253 (15.1)
Total               1677

23 Guide for grouping data
Obtain the min and max values. Choose between 5 and 15 intervals. Summarise but do not obscure the data, especially continuous data. Intervals of equal width are good but not essential. Remember to label tables!
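One way to follow this guide is sketched below: find the minimum and maximum, split the range into a chosen number of equal-width intervals, and count the values in each. The function name group_equal_width and the ten sample values are hypothetical, introduced only for illustration.

```python
def group_equal_width(values, n_intervals):
    """Count values in n_intervals equal-width intervals spanning min to max."""
    if not 5 <= n_intervals <= 15:
        raise ValueError("the guide suggests between 5 and 15 intervals")
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("all values are identical; nothing to group")
    width = (hi - lo) / n_intervals
    counts = [0] * n_intervals
    for v in values:
        # The maximum value belongs to the last interval, not a new one.
        index = min(int((v - lo) / width), n_intervals - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(n_intervals + 1)]
    return edges, counts

# Hypothetical ages, invented for illustration.
values = [0, 3, 7, 12, 18, 21, 25, 33, 40, 50]
edges, counts = group_equal_width(values, 5)
for left, right, n in zip(edges, edges[1:], counts):
    print(f"{left:g} to {right:g}: {n}")
```

Each interval here includes its lower edge and excludes its upper edge, except the last, which also includes the maximum. Other conventions are equally defensible, as long as the table labels make the choice clear.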

24 Take care with missing values
SPSS gives the % missing in its output if missing values are left blank in the data. Be careful to state whether a reported % is a percentage of observed values or a percentage of all subjects. These will differ! A missing code (often 9) can be used to make missing values explicit in the output.
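The difference between the two percentages can be seen in a small sketch. The ten responses below are hypothetical, with None standing in for a blank (missing) cell; this is plain Python, not SPSS output.

```python
# Hypothetical responses; None marks a missing value.
smoking = ["smoker", "non-smoker", None, "ex-smoker", "smoker",
           None, "non-smoker", "smoker", "non-smoker", "non-smoker"]

n_all = len(smoking)
observed = [s for s in smoking if s is not None]
n_observed = len(observed)
n_smokers = observed.count("smoker")

pct_of_all = round(100 * n_smokers / n_all, 1)            # denominator: all subjects
pct_of_observed = round(100 * n_smokers / n_observed, 1)  # denominator: observed only
pct_missing = round(100 * (n_all - n_observed) / n_all, 1)

print(f"Smokers: {pct_of_all}% of all subjects, {pct_of_observed}% of observed values")
print(f"Missing: {pct_missing}%")
```

The same three smokers are 30% of all subjects but 37.5% of observed values, so a report should say which denominator was used.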

25 Graphs
Simplicity. Consistency. Do not duplicate tables or text. Remember a title. Remember to label the axes.

26 Graphs – Categorical data
Bar charts Pie charts

27 Bar charts
Used to display categorical (or discrete numerical) data. One bar per category. The height of each bar equals its frequency. Each bar has the same width, and bars are equally spaced with a space between each bar. The vertical axis must start at zero.

30 Most common cancer deaths in UK, 2009
Plots and Statistics from CRUK website

31 Pie charts
Display one variable only. Compare 2 groups using 2 charts.

33 But avoid 3-dimensional plots!

34 Graphs – Numerical data
Histograms Frequency polygon Scatter plots Box plots

35 Histograms
Like bar charts but with no spaces between the bars. The y axis always begins at zero. The area of each bar represents the frequency in each group.
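Because area rather than height carries the frequency, groups of unequal width need their bar height scaled by the width. The sketch below applies this to the age-group frequencies from slide 20, assuming that "35-49" means ages 35 up to (but not including) 50, so that group is 15 years wide while the others are 5. The function name bar_heights is an illustrative choice.

```python
# (lower bound, upper bound (exclusive), frequency) for each age group in years.
# The bounds are an assumption about how the slide's groups were defined.
groups = [(0, 5, 59), (5, 10, 83), (10, 15, 94), (15, 20, 72),
          (20, 25, 61), (25, 30, 48), (30, 35, 36), (35, 50, 32)]

def bar_heights(groups):
    """Height = frequency / width, so that width * height equals the frequency."""
    return [freq / (upper - lower) for lower, upper, freq in groups]

heights = bar_heights(groups)
for (lower, upper, freq), h in zip(groups, heights):
    print(f"{lower}-{upper - 1}: frequency {freq}, bar height {h:.2f} per year")
```

Plotting the raw frequency of 32 for the wide 35-49 group would exaggerate it; on the per-year scale its bar is much lower than the 30-34 bar.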

38 Check data carefully

40 Florence Nightingale’s ‘Coxcomb’ diagram of Mortality in the Crimean War

41 Summary measures – Numerical data
Central location (average). Spread or variability (distance of each data point from the average).

42 Central Location
Mean. Median. Mode: the most frequent value.

43 Mean
$$\bar{x} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n}$$
Often written as $\bar{x} = \frac{\sum x_i}{n}$, where Sigma or $\sum$ is ‘Sum of’.

44 $\bar{x} = \frac{\sum x_i}{8} = 3.00$

45 Mean
Advantages: uses all data values; very amenable to statistical analysis (most models use the mean).
Disadvantages (advantages to politicians and estate agents!): distorted by outliers; distorted by skewed data.

46 Median
Arrange the values in increasing order. The median is the middle value. This is easy with an odd number of values; for an even number, take the mean of the two middle values: Median = (sum of the two middle values) / 2 = 2.96 litres.
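The odd and even cases can be sketched in a few lines of Python. The six ages are the ones used later in the variance slides; the function below is an illustrative implementation, and Python's own statistics.median gives the same results.

```python
def median(values):
    """Middle value of the sorted data; mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

ages = [52, 17, 36, 24, 47, 29]      # even number of values
print(median(ages))                  # mean of 29 and 36
print(median(ages[:5]))              # odd number of values: the middle one
```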

47 Median
Advantages: not distorted by outliers; not distorted by skewed data.
Disadvantages: ignores most of the information; less amenable to statistical modelling.
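A small sketch of the outlier effect, using Python's standard statistics module. The lengths of stay are hypothetical values invented for illustration.

```python
import statistics

# Hypothetical lengths of hospital stay in days.
stays = [2, 3, 3, 4, 5]
stays_with_outlier = stays + [60]   # one very long stay

mean_before = statistics.mean(stays)
median_before = statistics.median(stays)
mean_after = statistics.mean(stays_with_outlier)
median_after = statistics.median(stays_with_outlier)

print(f"Without outlier: mean {mean_before:.2f}, median {median_before}")
print(f"With outlier:    mean {mean_after:.2f}, median {median_after}")
```

A single extreme value moves the mean from 3.4 to nearly 13 days, while the median moves only from 3 to 3.5.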

48 Measures of spread
Data set 1 [... 47 52 ...]: Mean = Median = 49.5
Data set 2 [... 51 51 ...]: Mean = Median = 51

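49 Range
The range can be reported as the minimum and maximum values, or as their difference.
Data set 1 [... 47 52 ...]: 77
Data set 2 [... 51 51 ...]: 5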

50 Range from percentiles
Data are ordered from smallest to largest value, then divided into equal chunks. Percentiles: data in equal 100ths. Deciles: data in equal 10ths. Quartiles: data in equal 4ths.

51 Interquartile range (IQR)
Data are ordered and divided into quartiles. Interquartile range (IQR) = upper quartile - lower quartile.
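As a hedged sketch, the quartiles and IQR can be computed with Python's statistics.quantiles, here on the six ages used in the variance slides. Several conventions exist for interpolating quartiles, so other software (including SPSS) may report slightly different values for the same data; the numbers below are those of Python's default "exclusive" method.

```python
import statistics

ages = [17, 24, 29, 36, 47, 52]

# n=4 cuts the ordered data into quarters and returns the three cut points.
q1, q2, q3 = statistics.quantiles(ages, n=4)
iqr = q3 - q1

print(f"Lower quartile {q1}, median {q2}, upper quartile {q3}, IQR {iqr}")
```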

52 IQR in Multiple Box-plots
[Figure: box plots, with labels for outlier, range, upper quartile, median, lower quartile and IQR]

53 Distribution of data values around the mean

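54 Variance
Data: 17 24 29 36 47 52; mean = 34.16 years
x     $(x - \bar{x})$
17    17 - 34.16 = -17.16
24    24 - 34.16 = -10.16
29    29 - 34.16 = -5.16
36    36 - 34.16 = 1.84
47    47 - 34.16 = 12.84
52    52 - 34.16 = 17.84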

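55 Variance
Data: 17 24 29 36 47 52; mean = 34.16
x     $(x - \bar{x})$   $(x - \bar{x})^2$
17    -17.16            294.47
24    -10.16            103.23
29     -5.16             26.63
36      1.84              3.39
47     12.84            164.87
52     17.84            318.27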

56 Variance ($s^2$)
$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{\sum (x - \bar{x})^2}{5} = 182.16$$
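The calculation on these slides can be retraced step by step in Python. This is a sketch using the six ages from the slides; note that the slides show figures cut off at two decimals (34.16 and 182.16), whereas conventional rounding of the unrounded results gives 34.17 and 182.17.

```python
ages = [17, 24, 29, 36, 47, 52]
n = len(ages)

mean = sum(ages) / n                                   # 205 / 6
squared_deviations = [(x - mean) ** 2 for x in ages]   # (x - mean)^2 for each value
variance = sum(squared_deviations) / (n - 1)           # sample variance: divide by n - 1

print(f"mean = {mean:.4f}")
print(f"sum of squared deviations = {sum(squared_deviations):.4f}")
print(f"variance = {variance:.4f}")
```

Dividing by n - 1 rather than n gives the sample variance, matching the formula on the slide.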

57 Mean = 34.16 years; Variance = 182.2

58 Standard deviation (s)
$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{182.16} = 13.49$$

59 Mean = 34.16 years; SD = 13.49 years. Coefficient of Variation (CV) = SD / Mean = 0.39. The CV is a measure of variability for comparing different scales.
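For comparison, the same summary can be obtained with Python's standard statistics module. This is a sketch rather than the SPSS route taught in the course. The unrounded results are an SD of about 13.497 and a CV of about 0.395; the slide's 13.49 and 0.39 come from working with figures cut off at two decimals.

```python
import statistics

ages = [17, 24, 29, 36, 47, 52]

mean = statistics.mean(ages)
sd = statistics.stdev(ages)   # sample standard deviation (n - 1 denominator)
cv = sd / mean                # unitless, so comparable across different scales

print(f"mean = {mean:.3f} years, SD = {sd:.3f} years, CV = {cv:.3f}")
```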

60 Which central measure goes with which measure of spread?
Mean goes with SD. Median goes with IQR or range.

61 Summary
Do not underestimate the value of looking at the data. It gives a feel for the data before testing or modelling. Check for missing data. Check for outliers.

62 From Jan 2010 IBM acquired copyright for SPSS

72 Statistics: Baseline LDL
N Valid                   1383
N Missing                 0
Mean
Median
Std. Deviation
Variance                  .978
Skewness                  .039
Std. Error of Skewness    .066
Range
Minimum                   .3345
Maximum
Percentiles

73 Implementing Kaplan-Meier in SPSS
From Colorectal.sav you need to specify:
Survival time: time from surgery (tfsurg)
Status: dead = 1, censored = 0 (dead)
Factor: e.g. hypertension comorbidity (hyperco)
Select plot of survival.
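To show what the Kaplan-Meier procedure computes from a survival time and a status variable, here is a minimal pure-Python sketch of the product-limit estimate. It is not SPSS's implementation, the function name kaplan_meier is hypothetical, and the five follow-up times are invented rather than taken from Colorectal.sav. The status coding follows the slide: 1 = dead, 0 = censored.

```python
def kaplan_meier(times, events):
    """Return [(time, estimated survival probability)] at each time with a death.

    times  : follow-up time for each subject
    events : 1 if the subject died at that time, 0 if censored
    """
    pairs = sorted(zip(times, events))
    at_risk = len(pairs)
    survival = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        deaths = 0
        leaving = 0
        # Gather everyone whose follow-up ends at time t.
        while i < len(pairs) and pairs[i][0] == t:
            deaths += pairs[i][1]
            leaving += 1
            i += 1
        if deaths:
            # Multiply by the proportion of those at risk who survive time t.
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Deaths and censored subjects both leave the risk set after time t.
        at_risk -= leaving
    return curve

# Hypothetical follow-up data, invented for illustration.
times = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(f"t = {t}: S(t) = {s:.2f}")
```

Censored subjects do not lower the curve themselves, but they shrink the risk set, so later deaths produce larger steps. To compare groups by a factor such as hyperco, the same function would be applied to each group separately.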

74 Implementing Kaplan-Meier plot in SPSS

75 Select options to obtain plot and median survival

76 Means and Medians for Survival Time
Rows: one per Hypertension group, plus Overall.
Columns, for both Mean (a) and Median: Estimate; Std. Error; 95% Confidence Interval (Lower Bound, Upper Bound).
a. Estimation is limited to the largest survival time if it is censored.

77 Survival curves for women with glioma by diagnosis.
Bland J M , Altman D G BMJ 2004;328:1073

78 Practical Read LDL.sav or colorectal.sav into SPSS (22) and explore the different types of data using appropriate tables and graphs DEBU website (

