Andrew Smith. Describing childhood diet with cluster analysis. Young Statisticians meeting, 12th April 2011.
Describing diet with cluster analysis: Pauline M. Emmett, P. Kirstin Newby, Kate Northstone. World Cancer Research Fund, MRC, Wellcome Trust, University of Bristol.
Outline: Introductions; ALSPAC; Food frequency questionnaires; Dietary patterns; Cluster analysis; k-means cluster analysis; Results; 3-cluster solution; Associations with socio-demographic variables.
ALSPAC: Avon Longitudinal Study of Parents and Children. Birth cohort study of 14,541 pregnant women and their children.
Food frequency questionnaires
Dietary patterns: Examine diet as a whole. Analyse multivariate food frequency questionnaire (FFQ) data. Use correlations between foods. PCA; cluster analysis. Image: Paul / FreeDigitalPhotos.net
Cluster analysis: Separate subjects into non-overlapping groups, based on distances between individuals. Unsupervised learning. Image: Boaz Yiftach / FreeDigitalPhotos.net
k-means cluster analysis: Most widely used for dietary patterns. Number of clusters, k, is specified beforehand. Minimises the distance from each subject to his/her cluster mean, summed over all subjects in that cluster and then summed over all clusters.
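The quantity being minimised can be written out as a formula. This is a sketch under one assumption: "distance" is taken to be squared Euclidean distance, as in the standard k-means formulation (the slide says only "distance"). The symbols are illustrative and do not come from the slides: x_i is the vector of input variables for subject i, C_j is the set of subjects in cluster j, and \bar{x}_j is the mean of cluster j.

```latex
W = \sum_{j=1}^{k} \sum_{i \in C_j} \lVert x_i - \bar{x}_j \rVert^2,
\qquad
\bar{x}_j = \frac{1}{\lvert C_j \rvert} \sum_{i \in C_j} x_i
```

The inner sum runs over all subjects in one cluster, and the outer sum runs over all k clusters.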
k-means cluster analysis
Problems with the standard algorithm: Short-sighted. Tends to find solutions that are at a local minimum, so run the algorithm 100 times and choose the solution with the lowest minimum of all.
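The restart strategy can be sketched in plain Python. This is only an illustration, not the authors' code: the slides do not say which software or initialisation was used, so the function names (`kmeans_once`, `kmeans_restarts`), the random-subject initialisation, squared Euclidean distance, and the toy data are all assumptions.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_once(points, k, rng, max_iter=100):
    """One run of the standard algorithm from a random start.

    Returns (within-cluster sum of squares, labels, centres).
    """
    # Assumed initialisation: k distinct subjects chosen at random.
    centres = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each subject joins its nearest cluster mean.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centres[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable: a (possibly local) minimum
        labels = new_labels
        # Update step: each centre becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centre if a cluster empties
                centres[j] = [sum(col) / len(members) for col in zip(*members)]
    wcss = sum(squared_distance(p, centres[lab]) for p, lab in zip(points, labels))
    return wcss, labels, centres


def kmeans_restarts(points, k, n_starts=100, seed=0):
    """Run kmeans_once n_starts times and keep the lowest-objective run."""
    rng = random.Random(seed)
    runs = [kmeans_once(points, k, rng) for _ in range(n_starts)]
    return min(runs, key=lambda run: run[0])


# Hypothetical toy data: three well-separated groups of 30 subjects each.
demo_rng = random.Random(1)
toy = [
    (cx + demo_rng.gauss(0, 0.3), cy + demo_rng.gauss(0, 0.3))
    for cx, cy in [(0, 0), (5, 0), (0, 5)]
    for _ in range(30)
]
best_wcss, best_labels, best_centres = kmeans_restarts(toy, k=3, n_starts=100)
```

Each run stops at whatever minimum its starting point leads to. Keeping the run with the smallest objective gives the "minimum out of all minima", although it is still not guaranteed to be the global minimum.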
Standardising the input variables
Reliability of the cluster solution: Split sample in half. Perform separate analyses on each half. See how many children change clusters. Repeat 5 times: 32 out of 8,279 children changed cluster (0.4%).
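Counting how many children change cluster needs one extra step, because cluster numbers are arbitrary and the two solutions must be aligned before they are compared. The sketch below shows one way to do this. It is an assumption, not the procedure from the slides, which do not describe how the clusters were matched; `count_changed` and the example label lists are hypothetical.

```python
from itertools import permutations


def count_changed(labels_a, labels_b, k):
    """Count subjects whose cluster differs between two solutions.

    Cluster numbers are arbitrary, so every relabelling of the second
    solution is tried and the one with the fewest disagreements is kept.
    Trying all k! relabellings is only practical for small k (here k = 3).
    """
    best = len(labels_a)
    for perm in permutations(range(k)):
        changed = sum(1 for a, b in zip(labels_a, labels_b) if a != perm[b])
        best = min(best, changed)
    return best


# Hypothetical labels for eight subjects under two separate analyses.
full_sample = [0, 0, 0, 1, 1, 2, 2, 2]
half_sample = [2, 2, 2, 0, 0, 1, 1, 0]
n_changed = count_changed(full_sample, half_sample, k=3)

# The percentage reported on the slide: 32 of 8,279 children.
percent_changed = round(100 * 32 / 8279, 1)
```

In the toy example only the last subject ends up in a different cluster once the labels are aligned. The final line reproduces the 0.4% figure from the reported counts.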
Processed: 4,177 children. Image: Suat Eman, Rawich, Master Isolated Images / FreeDigitalPhotos.net
Plant-based: 2,065 children. Image: Suat Eman, Paul, Rob Wiltshire, Simon Howden, winnond / FreeDigitalPhotos.net
Traditional British: 2,037 children. Image: Suat Eman, Filomena Scalise, Maggie Smith / FreeDigitalPhotos.net
Associations with socio-demographic vars Processed Plant-based Traditional British Processed Girls3, Boys2, (0.72, 0.93) 1.03 (0.89, 1.20) 1.18 (1.04, 1.34)
Associations with socio-demographic vars Maternal age Processed Plant-based Traditional British Processed < (0.33, 1.07) 1.07 (0.56, 2.05) 1.57 (1.02, 2.43) , (0.29, 0.92) 1.20 (0.64, 2.28) 1.60 (1.04, 2.46) 31+2, (0.21, 0.67) 1.50 (0.79, 2.88) 1.77 (1.13, 2.76)
Associations with socio-demographic vars Maternal education Processed Plant-based Traditional British Processed CSE Vocational (0.60, 1.17) 1.19 (0.82, 1.72) 1.01 (0.76, 1.32) O level2, (0.51, 0.83) 1.46 (1.10, 1.94) 1.05 (0.86, 1.30) A level1, (0.33, 0.55) 2.01 (1.50, 2.69) 1.18 (0.95, 1.48) Degree1, (0.23, 0.39) 2.75 (2.00, 3.76) 1.22 (0.94, 1.57)
Associations with socio-demographic vars Siblings Processed Plant-based Traditional British Processed 0 older2, older2, (1.03, 1.42) 1.12 (0.94, 1.36) 0.73 (0.62, 0.86) 2+ older (1.28, 1.97) 0.99 (0.76, 1.27) 0.64 (0.52, 0.80)
Associations with socio-demographic vars Siblings Processed Plant-based Traditional British Processed 0 younger2, younger2, (0.86, 1.19) 0.58 (0.48, 0.71) 1.69 (1.44, 1.99) 2+ younger (0.92, 1.57) 0.43 (0.33, 0.58) 1.90 (1.50, 2.40)
Summary: Multivariate methods compress FFQ data into dietary patterns. k-means cluster analysis is widespread but must be applied carefully. Processed, Plant-based and Traditional British clusters were found in 7-year-old children. The clusters are associated with various socio-demographic variables.