Data: Presentation and Description Peter T. Donnan Professor of Epidemiology and Biostatistics Statistics for Health Research.


1 Data: Presentation and Description Peter T. Donnan Professor of Epidemiology and Biostatistics Statistics for Health Research

2 Overview
What is Data?
Summarising data
Displaying data
SPSS

3 Why have you collected data?
Most important question!
Related to testing hypotheses
If you have not got any hypotheses – Get some!
Return to this later

4 DATA – Where from?
All data is a Sample – a subset of the population
How was it collected?
Potential for bias?

5 Extrapolating from the sample to population (Illustrations: Ian Christie, Orthopaedic & Trauma Surgery. Copyright 2002 University of Dundee)

6 Quantitative Data?
Observation or measurement of one or more variables
Variable is any quantity measured on a scale
Unit of analysis can be person, group (e.g. practice), specimen, time
Multilevel – patient and practice

7 Cross-classified 3-level multilevel model (diagram levels: patient level i, practice level j, hospital level k)

8 Statistics
Statistics encompasses:
1. Design of study
2. Methods of collecting and summarising data
3. Analysing and drawing appropriate conclusions from data

9 Variable types
Categorical (qualitative)
– E.g. type of drug, eye colour, smoker
Numerical (quantitative)
– E.g. age, birth weight, BP

10 Categorical
Nominal
– Categories are mutually exclusive and unordered
– E.g. blood group type (A/B/AB/O)
Ordinal
– Categories are mutually exclusive and ordered
– E.g. disease stage (mild/moderate/severe)
Binary
– Two categories (yes, no)

11 Numerical
Discrete
– Integer values, often counts
– E.g. number of cigarettes smoked
Continuous
– Takes any value in a range of values
– E.g. height in cm, cholesterol

12 Organisation of data
Generally each variable in a separate column and one row per subject
Subject  Age  Gender  Score
1        28   1       15
2        56   2       11
3        43   1       22

13 1st step in analysis? Look at the data!

15 Display and summarise data
To get a feel for the data
To spot errors and missing data
Assess the range of values
Also...

17 Categorical data
1. Campylobacter             21. Giardia
2. Campylobacter             22. Cryptosporidium
3. Escherichia coli O157     23. Cryptosporidium
4. Shigella sonnei           24. Campylobacter
5. Cryptosporidium           25. Shigella sonnei
6. Giardia                   26. SRSV
7. Cryptosporidium           27. Cryptosporidium
8. Campylobacter             28. Campylobacter
9. Campylobacter             29. Giardia
10. Cryptosporidium          30. Giardia
11. Giardia                  31. Escherichia coli O157
12. Shigella sonnei          32. Shigella sonnei
13. SRSV                     33. Cryptosporidium
14. Giardia                  34. SRSV
15. Escherichia coli O157    35. Campylobacter
16. Campylobacter            36. Campylobacter
17. Giardia                  37. Campylobacter
18. SRSV                     38. Giardia
19. Campylobacter            39. Escherichia coli O157
20. Cryptosporidium          40. Campylobacter

18 Summarised by frequencies or percentage
Infection                N (%)
Campylobacter            12 (30.0)
Cryptosporidium          9 (22.5)
Giardia                  8 (20.0)
SRSV                     5 (12.5)
Escherichia coli O157    3 (7.5)
Shigella
Total                    40 (100)
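A tally like this can be produced mechanically. The following is a minimal sketch in Python rather than SPSS; the function name `frequency_table` is hypothetical, and the sample is only the first ten organisms from the slide 17 listing (spelling normalised), so its percentages are not those of the full table.

```python
from collections import Counter


def frequency_table(values):
    """Return (category, count, percent) rows, most frequent first."""
    counts = Counter(values)
    total = len(values)
    return [(label, n, round(100 * n / total, 1))
            for label, n in counts.most_common()]


# First ten organisms from the slide 17 listing (illustrative subset only).
sample = [
    "Campylobacter", "Campylobacter", "Escherichia coli O157",
    "Shigella sonnei", "Cryptosporidium", "Giardia", "Cryptosporidium",
    "Campylobacter", "Campylobacter", "Cryptosporidium",
]

table = frequency_table(sample)
for label, n, pct in table:
    print(f"{label:<22} {n:>2} ({pct})")
```

Each percentage uses the number of values supplied as its denominator, so the rows always add up to 100% apart from rounding.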

19 Numerical data
Frequency distributions for a continuous variable are unfeasibly large
Grouping may be necessary for presentation

20 Frequency distribution for continuous variable
Age group (years)  Frequency  Relative frequency (%)  Cumulative relative frequency (%)
0-4                59         12.2                    12.2
5-9                83         17.1                    29.3
10-14              94         19.4                    48.7
15-19              72         14.8                    63.5
20-24              61         12.6                    76.1
25-29              48         9.9                     86.0
30-34              36         7.4                     93.4
35-49              32         6.6                     100
Total              485        100
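The relative and cumulative columns follow directly from the frequencies. A short Python sketch (not part of the original SPSS workflow) that recomputes them from the counts on this slide:

```python
from itertools import accumulate

age_groups = ["0-4", "5-9", "10-14", "15-19", "20-24", "25-29", "30-34", "35-49"]
frequencies = [59, 83, 94, 72, 61, 48, 36, 32]

total = sum(frequencies)

# Relative frequency: each group's share of the total, as a percentage.
relative = [round(100 * f / total, 1) for f in frequencies]

# Cumulative relative frequency: running total of counts, as a percentage.
cumulative = [round(100 * c / total, 1) for c in accumulate(frequencies)]

for group, f, r, c in zip(age_groups, frequencies, relative, cumulative):
    print(f"{group:<6} {f:>4} {r:>6} {c:>6}")
```

The cumulative column is computed from the running counts rather than by adding rounded percentages, which avoids accumulating rounding error.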

21 Baseline measure  N (%)
4.0   52 (3.1)
4.1   51 (3.0)
4.2   49 (2.9)
4.3   65 (3.9)
4.4   60 (3.6)
4.5   80 (4.8)
4.6   88 (5.2)
4.7   99 (5.9)
4.8   94 (5.6)
4.9   84 (5.0)
5.0   68 (4.1)
5.1   66 (3.9)
5.2   79 (4.7)
5.3   74 (4.4)
5.4   75 (4.5)
5.5   75 (4.5)
5.6   70 (4.2)
5.7   60 (3.6)

22 Baseline group  N (%)
4.0 to 4.4   277 (16.5)
4.5 to 4.9   445 (26.5)
5.0 to 5.4   362 (21.6)
5.5 to 5.9   340 (20.3)
6.0 to 6.9   253 (15.1)
Total        1677
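The step from the ungrouped table (slide 21) to the grouped one (slide 22) is a simple binning operation. A minimal Python sketch, assuming the values are keyed in tenths so the arithmetic stays exact; the function name `group_counts` is hypothetical, and only the rows from 4.0 to 5.4 are included because the transcript of slide 21 stops part-way through the next interval.

```python
# Counts per baseline value from slide 21, keyed in tenths (40 means 4.0)
# so that the grouping uses exact integer arithmetic.
counts_by_tenth = {
    40: 52, 41: 51, 42: 49, 43: 65, 44: 60,
    45: 80, 46: 88, 47: 99, 48: 94, 49: 84,
    50: 68, 51: 66, 52: 79, 53: 74, 54: 75,
}


def group_counts(counts, start=40, width=5):
    """Sum counts into equal-width intervals of `width` tenths."""
    grouped = {}
    for value, n in counts.items():
        lower = start + ((value - start) // width) * width
        label = f"{lower / 10:.1f} to {(lower + width - 1) / 10:.1f}"
        grouped[label] = grouped.get(label, 0) + n
    return grouped


grouped = group_counts(counts_by_tenth)
for label, n in grouped.items():
    print(label, n)
```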

23 Guide for grouping data
Obtain min and max values
Choose between 5 and 15 intervals
Summarise but not obscure data especially continuous data
Intervals of equal width
– Good but not essential
– Remember to label tables!

24 Take care with missing values
SPSS gives % missing in output if missing values are left blank in the data
Be careful when reporting %: as a percentage of observed values or as a percentage of all subjects
These will differ!
Can use a missing code (often 9) to make missing explicit in output
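The two denominators can be compared directly. The sketch below is in Python rather than SPSS; the function name `percent_both_ways` and the smoker data are hypothetical, and missing values are assumed to be stored as `None` (a numeric code such as 9 could be passed as `missing` instead).

```python
def percent_both_ways(values, category, missing=None):
    """Percentage in `category` using observed values vs all subjects."""
    n_all = len(values)
    observed = [v for v in values if v != missing]
    n_cat = sum(1 for v in observed if v == category)
    return {
        "of_observed": round(100 * n_cat / len(observed), 1),
        "of_all": round(100 * n_cat / n_all, 1),
        "missing": round(100 * (n_all - len(observed)) / n_all, 1),
    }


# Hypothetical smoking status for ten subjects, two of them missing.
smoker = ["yes", "no", None, "yes", "no", "no", None, "yes", "no", "no"]
result = percent_both_ways(smoker, "yes")
print(result)
```

Here three smokers are 37.5% of the eight observed values but 30.0% of all ten subjects, so a report should state which denominator was used.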

25 Graphs
Simplicity
Consistency
Not duplicating tables or text
Remember title
Remember to label axes

26 Graphs – Categorical data
Bar charts
Pie charts

27 Bar charts
Used to display categorical (or discrete numerical) data
One bar per category
Height of bar equals its frequency
Each bar same width and equally spaced
Space between each bar
Vertical axis must start at zero

30 Most common cancer deaths in UK, 2009 (plots and statistics from CRUK website, http://info.cancerresearchuk.org)

31 Pie charts
Displays one variable only
Compare 2 groups using 2 charts

33 But avoid 3-dimensional plots!

34 Graphs – Numerical data
Histograms
Frequency polygon
Cumulative frequency polygon
Scatter plots
Box plots

35 Histograms
Like bar charts but no spaces
y axis always begins at zero
Area of bar represents the frequency in each group
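The point that area, not height, represents frequency matters when intervals have unequal widths, as with the 35-49 age group on slide 20. A minimal Python sketch of the bar heights (frequency divided by interval width), assuming ages are recorded in whole years so that 0-4 spans five years; the function name `bar_heights` is hypothetical.

```python
# Age groups from slide 20 as (lower, upper, frequency), ages in whole years.
groups = [
    (0, 4, 59), (5, 9, 83), (10, 14, 94), (15, 19, 72),
    (20, 24, 61), (25, 29, 48), (30, 34, 36), (35, 49, 32),
]


def bar_heights(groups):
    """Height per bar so that height x width equals the group frequency."""
    heights = []
    for lower, upper, freq in groups:
        width = upper - lower + 1  # e.g. 0-4 covers 5 whole years
        heights.append(freq / width)
    return heights


heights = bar_heights(groups)
for (lower, upper, freq), h in zip(groups, heights):
    print(f"{lower}-{upper}: frequency {freq}, height {h:.2f} per year")
```

Plotting the raw frequency of 32 for the 15-year-wide last group would overstate it relative to the 5-year groups; dividing by the width keeps the areas comparable.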

39 Check data carefully

41 Florence Nightingale’s ‘Coxcomb’ diagram of mortality in the Crimean War

42 Summary measures – Numerical data
Central Location (average)
Spread or variability (distance of each data point from the average)

43 Central Location
Mean
Median
Mode - most frequent value

44 Mean
$\bar{x} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n}$
Often written as $\bar{x} = \frac{\sum x_i}{n}$, where Sigma ($\sum$) is ‘sum of’

45 2.75 2.86 3.37 2.76 2.62 3.49 3.05 3.12
$\bar{x} = \frac{24.02}{8} \approx 3.00$ litres
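The same arithmetic can be checked in a few lines of Python (a sketch, not part of the original SPSS material; the variable name `volumes` is an assumption based on the unit being litres):

```python
volumes = [2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12]

total = sum(volumes)         # numerator: sum of all values
mean = total / len(volumes)  # divide by the number of values

print(f"sum = {total:.2f}, n = {len(volumes)}, mean = {mean:.4f} litres")
```

The unrounded mean is 3.0025 litres, which the slide reports as 3 litres.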

46 Mean
Advantages
– Uses all data values
– Very amenable to statistical analysis; most models deal with mean
Disadvantages (advantages to politicians and estate agents!)
– Distorted by outliers
– Distorted by skewed data

47 Median
Arrange values in increasing order
Median is the middle value
2.62 2.75 2.76 [2.86 3.05] 3.12 3.37 3.49
Median $= \frac{2.86 + 3.05}{2} \approx 2.96$ litres
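With an even number of values there is no single middle value, so the two central values are averaged, as above. A small Python sketch of that rule; the function name `median` is illustrative, and the standard library's `statistics.median` applies the same convention.

```python
import statistics


def median(values):
    """Middle value of the sorted data; mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


volumes = [2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12]
print(median(volumes), statistics.median(volumes))
```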

48 Median

49 Median
Advantages
– Not distorted by outliers
– Not distorted by skewed data
Disadvantages
– Ignores most of the information
– Less amenable to statistical modelling
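The contrast between the two measures shows up as soon as one extreme value is added. In this Python sketch the extra value 30.5 is hypothetical (for instance 3.05 mistyped); the other values are the eight from the mean and median slides.

```python
import statistics

volumes = [2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12]
with_outlier = volumes + [30.5]  # hypothetical data-entry error

mean_before = statistics.mean(volumes)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(volumes)
median_after = statistics.median(with_outlier)

print(f"mean:   {mean_before:.2f} -> {mean_after:.2f}")
print(f"median: {median_before:.2f} -> {median_after:.2f}")
```

One outlier roughly doubles the mean, while the median moves only to the next value in the ordered data.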

50 Measures of spread
17 24 29 36 [47 52] 66 67 81 94: Mean = 51.3, Median = 49.5
50 51 51 51 [51 51] 51 51 51 55: Mean = 51.3, Median = 51

51 Range
17 24 29 36 [47 52] 66 67 81 94: Range 17-94 or 77
50 51 51 51 [51 51] 51 51 51 55: Range 50-55 or 5
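The two data sets share a mean but differ greatly in spread, which a short Python sketch makes explicit. The names `wide`, `narrow` and `summarise` are illustrative only.

```python
import statistics

wide = [17, 24, 29, 36, 47, 52, 66, 67, 81, 94]
narrow = [50, 51, 51, 51, 51, 51, 51, 51, 51, 55]


def summarise(values):
    """Central location plus the simplest measure of spread."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "range": max(values) - min(values),
    }


print(summarise(wide))
print(summarise(narrow))
```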

52 Range from percentiles
Data ordered from smallest to largest value
Percentiles
Deciles – data in equal 10ths
Quartiles – data in equal 4ths

53 Interquartile range (IQR)
4 5 7 | 9 10 12 | 14 19 26 | 39 40 42
Lower quartile = 8 (between 7 and 9); upper quartile = 32.5 (between 26 and 39)
Interquartile range (IQR) = 32.5 - 8 = 24.5
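The quartiles on this slide are consistent with taking the median of each half of the ordered data. The Python sketch below assumes that convention; the function name `quartiles_by_halves` is hypothetical. Other conventions exist (for example the interpolation used by `statistics.quantiles`, or by SPSS) and can give slightly different quartiles for the same data.

```python
import statistics


def quartiles_by_halves(values):
    """Lower and upper quartiles as the medians of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    lower_half = ordered[: n // 2]
    upper_half = ordered[(n + 1) // 2:]  # skips the middle value when n is odd
    return statistics.median(lower_half), statistics.median(upper_half)


data = [4, 5, 7, 9, 10, 12, 14, 19, 26, 39, 40, 42]
q1, q3 = quartiles_by_halves(data)
iqr = q3 - q1
print(q1, q3, iqr)
```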

54 Multiple box-plots (figure labels: median, range, IQR, upper quartile, lower quartile, outlier)

55 Distribution of data values around the mean
17 24 29 36 47 [mean 51.3] 52 66 67 81 94
50 51 51 51 51 [mean 51.3] 51 51 51 51 55

56 Variance
17 24 29 36 47 52, mean = 34.16 years
x     $x - \bar{x}$
17    -17.16
24    -10.16
29    -5.16
36    1.83
47    12.83
52    17.83
Sum   0

57 Variance
17 24 29 36 47 52, mean = 34.16
x     $x - \bar{x}$   $(x - \bar{x})^2$
17    -17.16          294.69
24    -10.16          103.36
29    -5.16           26.69
36    1.83            3.36
47    12.83           164.69
52    17.83           318.02
Sum   0               910.81

58 Variance ($s^2$)
$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
$s^2 = \frac{910.81}{5} = 182.16$
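The variance calculation can be followed step by step in Python (a sketch outside the original SPSS material). Because no intermediate value is truncated here, the results differ from the slide figures in the last digit: the sum of squares is about 910.83 and the variance about 182.17.

```python
import statistics

ages = [17, 24, 29, 36, 47, 52]

n = len(ages)
mean = sum(ages) / n

deviations = [x - mean for x in ages]  # distances from the mean; they sum to 0
squared = [d ** 2 for d in deviations]  # squaring removes the sign
sum_sq = sum(squared)
variance = sum_sq / (n - 1)             # sample variance divides by n - 1

print(f"mean = {mean:.2f}, sum of squares = {sum_sq:.2f}, variance = {variance:.2f}")
```

The result agrees with `statistics.variance(ages)`, which also uses the n - 1 denominator.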

59 17 24 29 36 47 52
Mean = 34.16 years
Variance = 182.2

60 Standard deviation (s)
Std deviation $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
Std deviation $= \sqrt{182.16} = 13.49$

61 17 24 29 36 47 52
Mean = 34.16 years, SD = 13.49
Coefficient of Variation (CV) = SD / Mean = 0.39
Measure of variability for comparison of different scales
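A quick Python check of the standard deviation and coefficient of variation, using the standard library (a sketch, not the SPSS procedure). Without truncation the SD is about 13.497 and the CV about 0.395, so the last digit can differ from the figures shown on the slides.

```python
import math
import statistics

ages = [17, 24, 29, 36, 47, 52]

sd = statistics.stdev(ages)      # square root of the sample variance
cv = sd / statistics.mean(ages)  # unit-free: SD relative to the mean

print(f"SD = {sd:.3f} years, CV = {cv:.3f}")
```

Because the CV is a ratio, it stays the same if the ages are re-expressed in months, which is what makes it usable across different scales.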

62 What central measure goes with what measure of spread?
Mean (SD)
Median (IQR)

63 Summary
Do not underestimate value of looking at the data
Gives a feel for the data before testing or modelling
Check for missing data
Check for outliers

64 From Jan 2010 IBM acquired copyright for SPSS

81 Implementing Kaplan-Meier in SPSS
From Colorectal.sav you need to specify:
Survival time – time from surgery (tfsurg)
Status – Dead = 1, censored = 0 (dead)
Factor – e.g. hypertension comorbidity (hyperco)
Select plot of survival
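For readers who want to see what the procedure computes, here is a minimal Python sketch of the Kaplan-Meier (product-limit) estimate. It is not the SPSS implementation; the function name `kaplan_meier` and the five follow-up times are hypothetical, and only the variable names `tfsurg` and `dead` with the coding 1 = dead, 0 = censored come from the slide. A separate curve per level of a factor such as `hyperco` would be obtained by calling the function on each subgroup.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct death time.

    times  -- follow-up time per subject
    events -- 1 if the subject died at that time, 0 if censored
    """
    pairs = sorted(zip(times, events))
    n_at_risk = len(pairs)
    survival = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        deaths = 0
        leaving = 0
        # Collect everyone whose follow-up ends at time t.
        while i < len(pairs) and pairs[i][0] == t:
            deaths += pairs[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            survival *= 1 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= leaving  # deaths and censored both leave the risk set
    return curve


# Hypothetical follow-up data for five subjects.
tfsurg = [2, 3, 3, 5, 8]
dead = [1, 0, 1, 1, 0]

curve = kaplan_meier(tfsurg, dead)
for t, s in curve:
    print(f"t = {t}: S(t) = {s:.2f}")
```

Censored subjects contribute to the number at risk up to their censoring time but never trigger a drop in the curve, which is how the method uses incomplete follow-up.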

82 Implementing Kaplan-Meier plot in SPSS

83 Select options to obtain plot and median survival

84 Survival curves for women with glioma by diagnosis. Bland J M, Altman D G BMJ 2004;328:1073

85 Practical
Read LDL.sav or colorectal.sav into SPSS (19) and explore the different types of data using appropriate tables and graphs
Data available at MyDundee (https://my.dundee.ac.uk/webapps/cmsmain/webui/_xy-2283598_4-t_AueCBgz2) or DEBU website (https://medicine.dundee.ac.uk/dundee-epidemiology-and-biostatistics-unit-debu)

