BIOSTATISTICS
Exploratory data analysis
− Box plot
− QQ plot
− Classification analysis
Copyright ©2012, Joanna Szyda
INTRODUCTION
− Exploratory data analysis
− Confirmatory data analysis
CONFIRMATORY DATA ANALYSIS
1. formulate a hypothesis
2. determine the maximum type I error
3. select and calculate a statistical test
4. calculate the type I error
5. decide on the hypothesis
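The five steps above can be sketched with a simple test. This is only an illustration, assuming a two-sided one-sample z-test with known standard deviation; the slides do not name a specific test. The function name `z_test`, the sample values, `mu0`, `sigma` and `alpha` are all hypothetical, and "calculate the type I error" is read here as computing the p-value.

```python
from math import sqrt
from statistics import NormalDist, mean

def z_test(sample, mu0, sigma, alpha=0.05):
    """Two-sided one-sample z-test of H0: mean == mu0 (sigma assumed known)."""
    n = len(sample)
    # Step 3: calculate the test statistic.
    z = (mean(sample) - mu0) / (sigma / sqrt(n))
    # Step 4: probability of a statistic at least this extreme under H0 (p-value).
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Step 5: reject H0 when the p-value is below the maximum type I error.
    reject = p_value < alpha
    return z, p_value, reject

# Step 1: H0 says the mean equals 5.0. Step 2: maximum type I error alpha = 0.05.
sample = [5.1, 4.9, 5.6, 5.8, 5.4, 5.2, 5.5, 5.3]  # hypothetical measurements
z, p_value, reject = z_test(sample, mu0=5.0, sigma=0.4, alpha=0.05)
print(z, p_value, reject)
```

The maximum type I error is fixed before looking at the data (step 2), and the decision in step 5 only compares the calculated value with that threshold.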
EXPLORATORY DATA ANALYSIS
− John Tukey
− no preassumed hypothesis
− use of various analytical tools:
  o statistical
  o graphical
− exploration of data structure
− identification of the important variables
− identification of outliers
EXAMPLES OF EXPLORATORY DATA ANALYSIS
BOX PLOT - 5-number data summary
− minimum
− 1st quartile: 25% of the data lie below it
− median: 50% of the data lie below it
− 3rd quartile: 75% of the data lie below it
− maximum
− outliers: marked as separate points
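The five numbers behind a box plot can be computed directly. The sketch below uses only the Python standard library. The quartile convention (`method="inclusive"`) and the 1.5 × IQR rule for flagging outliers are assumptions, since the slide does not say which definitions it uses, and the data values are hypothetical.

```python
from statistics import quantiles

def five_number_summary(data):
    """Return (minimum, 1st quartile, median, 3rd quartile, maximum)."""
    q1, med, q3 = quantiles(data, n=4, method="inclusive")
    return min(data), q1, med, q3, max(data)

def outliers(data, k=1.5):
    """Values further than k * IQR outside the quartiles (assumed rule)."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < low or x > high]

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]  # hypothetical measurements
print(five_number_summary(data))
print(outliers(data))
```

Other quartile conventions give slightly different values for small samples, so results from different software may not agree exactly.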
EXAMPLES - box plot
Quantile:Quantile plot – comparing distributions
[Figure: axes are distribution 1 quantiles and distribution 2 quantiles]
Q:Q plot – comparing distributions
QQ plot of SNP effects
comparing
− a theoretical distribution N
− observed distribution
interpretation
− points on the y=x line → distributions are equal
− steep line → Normal distribution has lower variance
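The interpretation rules above can be checked numerically. The sketch below is a minimal illustration, assuming that the theoretical Normal quantiles are on the x axis and the observed quantiles on the y axis, and that plotting positions (i - 0.5)/n are used. Neither choice is stated on the slide. The function names and the "observed" values are hypothetical: the values are exact quantiles of a Normal distribution with standard deviation 2, not real SNP effects.

```python
from statistics import NormalDist

def normal_qq_points(observed):
    """Pair each sorted observation with the matching standard Normal quantile."""
    ys = sorted(observed)
    n = len(ys)
    std_normal = NormalDist()  # theoretical distribution N(0, 1)
    # Plotting positions (i - 0.5) / n are an assumption; other conventions exist.
    xs = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(xs, ys))

def qq_slope(points):
    """Least-squares slope of observed quantiles (y) on theoretical quantiles (x)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx

# Hypothetical observed values: quantiles of N(0, 2), so no randomness is involved.
wide = NormalDist(0, 2)
observed = [wide.inv_cdf((i - 0.5) / 50) for i in range(1, 51)]
slope = qq_slope(normal_qq_points(observed))
print(round(slope, 3))
```

A slope close to 1 corresponds to points on the y=x line. A slope above 1 is the "steep line" case, where the theoretical Normal distribution has the lower variance.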
Q:Q plot – comparing distributions
QQ plot of SNP effects
Comparison of 2 distributions
Interpretation?
CLASSIFICATION ANALYSIS
CLASSIFICATION METHODS - k nearest neighbors
1. Classification of observations = allocation of observations to a group
2. Classification based on some variables
   Training data set = known classification
   Test data set = unknown classification
3. E.g.
   Taxonomy of organisms on the basis of measurements
   Classification of irises based on flower shape: Iris setosa, Iris versicolor
CLASSIFICATION METHODS - k nearest neighbors
Training data set: Iris setosa and Iris versicolor
sepal length   sepal width   species
4.9            3             Iris-setosa
5              3.6           Iris-setosa
5              3.4           Iris-setosa
7              3.2           Iris-versicolor
6              2.2           Iris-versicolor
[remaining rows: values not recoverable]
CLASSIFICATION METHODS - k nearest neighbors
Training data set: as above (sepal length, sepal width, species)
Test data set
sepal length   sepal width   species
5              2.4           ???
[values not recoverable]     ???
CLASSIFICATION METHODS - k nearest neighbors
Training data set, k=8
columns: sepal length, sepal width, species, distance, nearest neighbors
recoverable rows: 7, 3.2, Iris-versicolor; 5.9, 3, Iris-versicolor
recoverable distances: Iris-setosa 0.37, 0.61, 0.5; Iris-versicolor 0.26, 0.65, 0.01, 0.13, 1.46
Test data set
5              2.4           ??? = Iris-versicolor
[values not recoverable]     ???
CLASSIFICATION METHODS - k nearest neighbors
Training data set, k=8
columns: sepal length, sepal width, species, distance, nearest neighbors
recoverable rows: 5, 3.6, Iris-setosa; 7, 3.2, Iris-versicolor; 5.9, 3, Iris-versicolor
recoverable distances: Iris-setosa 0.16, 0.4, 0.34, 0.34, 0.25; Iris-versicolor 0.04, 0.1, 1.53
Test data set
5              2.4           ??? = Iris-versicolor
[values not recoverable]     ??? = Iris setosa
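The classification rule used in these tables can be sketched as a majority vote among the k closest training observations. The code below is a minimal illustration, not the lecture's own implementation. Squared Euclidean distance is an assumption, although it is consistent with the tabulated distance 0.37 if the test observation is read as sepal length 5, sepal width 2.4 and the training row as 4.9, 3. The toy training set, the group labels and k=3 are hypothetical and are not the iris values from the slides.

```python
from collections import Counter

def squared_distance(a, b):
    """Squared Euclidean distance between two measurement tuples (assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def knn_classify(training, point, k):
    """Assign `point` to the most frequent label among its k nearest training rows."""
    neighbours = sorted(training, key=lambda row: squared_distance(row[0], point))[:k]
    votes = Counter(label for _, label in neighbours)
    # With an odd k and two groups there are no ties; otherwise the label
    # encountered first (the nearer one) wins.
    return votes.most_common(1)[0][0]

# Hypothetical toy training set: (measurements, known classification).
training = [
    ((1.0, 1.0), "group A"),
    ((1.2, 0.8), "group A"),
    ((0.9, 1.3), "group A"),
    ((3.0, 3.2), "group B"),
    ((3.1, 2.9), "group B"),
    ((2.8, 3.0), "group B"),
]

print(knn_classify(training, (1.1, 1.1), k=3))
print(knn_classify(training, (2.9, 3.1), k=3))
print(round(squared_distance((5.0, 2.4), (4.9, 3.0)), 2))
```

The choice of k matters: a small k follows local structure closely, while a large k smooths the decision towards the more frequent group in the neighborhood.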
CLASSIFICATION METHODS - k nearest neighbors
IRISES – FULL DATA SET
− categories: I. setosa, I. versicolor, I. virginica
− 150 individuals
− decision areas based on petal width and petal length
EDA
− Box plot
− QQ plot
− Classification methods