Data analysis, Lecture 10. Tijl De Bie
Let’s do some real data analysis. Dataset: http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic) A biologist comes to you and says: “I have some data on breast cancer here. If you analyse it, I will win the Nobel Prize.” How to start?
Let’s do some real data analysis. Real data is messy: it has missing values. Impute each missing value with the mean of the corresponding feature, computed over the samples where that feature is observed (this is a basic technique for ‘imputation’). [MATLAB intermezzo]
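A minimal sketch of mean imputation. The slides use MATLAB for this intermezzo; the example below uses Python with NumPy instead, and both the function name `impute_with_feature_means` and the toy data are made up for illustration (they are not taken from the breast cancer dataset). Missing values are assumed to be encoded as NaN.

```python
import numpy as np

def impute_with_feature_means(X):
    """Replace each NaN with the mean of the observed values in its column."""
    X = np.asarray(X, dtype=float)
    col_means = np.nanmean(X, axis=0)   # per-feature mean, ignoring NaNs
    missing = np.isnan(X)
    # col_means is broadcast across rows; it is used only where data is missing
    return np.where(missing, col_means, X)

# Toy data (hypothetical): 4 samples, 2 features, NaN marks a missing value
X_toy = np.array([[1.0, 10.0],
                  [np.nan, 20.0],
                  [3.0, np.nan],
                  [5.0, 30.0]])
X_filled = impute_with_feature_means(X_toy)
```

Here the missing entry in the first feature becomes 3.0 (the mean of 1, 3, 5) and the missing entry in the second feature becomes 20.0 (the mean of 10, 20, 30).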
Let’s do some real data analysis. What now? Let’s visualize the data! But how? It is 9-dimensional. Use Principal Component Analysis (PCA) to project it onto a few dimensions that can be plotted. [MATLAB intermezzo]
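Mathematical intermezzo: PCA. Two equivalent views: variance maximization (find the directions along which the projected data has the largest variance) and error minimization (find the low-dimensional subspace with the smallest squared reconstruction error). Both are solved by an eigenvalue problem on the covariance matrix of the data. Do not forget to centre the data first (subtract from each feature its mean in the dataset).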
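A minimal sketch of PCA through the eigenvalue problem, again in Python with NumPy rather than MATLAB. The function name `pca` and the random demo data are assumptions for illustration only. The last lines check both views numerically: the variance of the projected data equals the retained eigenvalues, and the reconstruction error equals the variance that was discarded.

```python
import numpy as np

def pca(X, n_components=2):
    """PCA via the eigendecomposition of the sample covariance matrix."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                    # centre each feature
    cov = Xc.T @ Xc / (X.shape[0] - 1)         # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]             # columns are principal directions
    scores = Xc @ components                   # coordinates in the new basis
    return scores, components, eigvals[order]

# Hypothetical 9-dimensional data with two high-variance features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 9)) * np.array([5, 3, 1, 1, 1, 1, 1, 1, 1])
scores, components, variances = pca(X_demo, n_components=2)

# View 1 (variance maximization): variance kept by the projection
total_variance = X_demo.var(axis=0, ddof=1).sum()
# View 2 (error minimization): error of reconstructing from 2 components
X_rec = scores @ components.T + X_demo.mean(axis=0)
recon_error = ((X_demo - X_rec) ** 2).sum() / (X_demo.shape[0] - 1)
discarded_variance = total_variance - variances.sum()
```

The two columns of `scores` are what one would put on the axes of a 2-D scatter plot.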
Looks interesting… Could we perhaps predict the label from the data? I.e., find a rule that says when a cancer is benign and when it is malignant (important for therapy and more!). This is classification. [MATLAB intermezzo]
Mathematical intermezzo: LSR/FDA. Least Squares Regression (LSR): solve the system of linear equations Xw = y approximately, by minimizing the misfit ||Xw - y||^2, the sum of squared errors (which is the mean squared error up to a factor 1/n for n samples). Fisher Discriminant Analysis (FDA): the same thing, if the labels y are coded as -1/+1.
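A minimal sketch of least squares classification with -1/+1 labels, in Python with NumPy rather than MATLAB. The function names, the added intercept column, and the two synthetic Gaussian classes are all assumptions made for illustration. The sketch also computes the Fisher direction from the within-class scatter matrix so the two weight vectors can be compared; with an intercept in the least squares fit they should point in the same direction.

```python
import numpy as np

def least_squares_classifier(X, y):
    """Minimize ||Xb w - y||^2, where Xb is X with an intercept column appended."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w                                    # last entry is the intercept

def predict(X, w):
    """Classify by the sign of the fitted linear function."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return np.where(Xb @ w >= 0, 1, -1)

def fisher_direction(X, y):
    """Fisher discriminant direction: S_w^{-1} (m_pos - m_neg)."""
    X_pos, X_neg = X[y == 1], X[y == -1]
    m_pos, m_neg = X_pos.mean(axis=0), X_neg.mean(axis=0)
    S_w = (X_pos - m_pos).T @ (X_pos - m_pos) + (X_neg - m_neg).T @ (X_neg - m_neg)
    return np.linalg.solve(S_w, m_pos - m_neg)

# Hypothetical data: two Gaussian classes in 2 dimensions
rng = np.random.default_rng(1)
X_toy = np.vstack([rng.normal(loc=-2.0, size=(50, 2)),
                   rng.normal(loc=+2.0, size=(50, 2))])
y_toy = np.concatenate([-np.ones(50), np.ones(50)])

w = least_squares_classifier(X_toy, y_toy)
accuracy = np.mean(predict(X_toy, w) == y_toy)

# Cosine of the angle between the LSR weights (without intercept) and the FDA direction
d = fisher_direction(X_toy, y_toy)
cosine = (w[:2] @ d) / (np.linalg.norm(w[:2]) * np.linalg.norm(d))
```

Note that `accuracy` here is measured on the training data itself, so it is an optimistic estimate of how well the rule would do on new samples.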
Could there be more? Perhaps there are more than 2 clusters, e.g. cancers requiring different treatments? Let’s cluster the data! 2 clusters (benign vs malignant)? More clusters (other cancer types)? [MATLAB intermezzo]
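The slide does not say which clustering algorithm the intermezzo uses. As one possible illustration, the sketch below implements k-means (Lloyd's algorithm) in Python with NumPy; the choice of k-means, the function names, and the synthetic data are all assumptions. Changing `k` is how one would try 2 clusters versus more.

```python
import numpy as np

def nearest_center(X, centers):
    """Index of the closest center (squared Euclidean distance) for each row of X."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate between assigning points and updating means."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialize the centers with k distinct data points chosen at random
    centers = X[rng.choice(X.shape[0], size=k, replace=False)]
    for _ in range(n_iter):
        labels = nearest_center(X, centers)
        # Move each center to the mean of its points; keep it if it has none
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break                               # assignments have stabilized
        centers = new_centers
    return nearest_center(X, centers), centers

# Hypothetical data: two well-separated groups in 2 dimensions
rng = np.random.default_rng(2)
X_toy = np.vstack([rng.normal(loc=0.0, size=(50, 2)),
                   rng.normal(loc=10.0, size=(50, 2))])
labels, centers = kmeans(X_toy, k=2)

# Within-cluster sum of squares, compared with treating the data as one cluster
wcss = ((X_toy - centers[labels]) ** 2).sum()
total_ss = ((X_toy - X_toy.mean(axis=0)) ** 2).sum()
```

k-means only finds a local optimum that depends on the initial centers, so in practice it is usually run from several seeds and the run with the smallest within-cluster sum of squares is kept.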