1
Population structure identification BNFO 602 Roshan
2
Population structure identification
We are given millions of SNPs for a random sample of a population.
Structure identification is an unsupervised learning problem: no training examples are provided.
Popular methods for this kind of unsupervised learning (clustering):
–K-means
–Spectral clustering
3
Data encoding
Can we reduce the millions of SNPs into a compressed form of two or three dimensions?
Data encoding:
–Consider each individual as a vector of SNPs
–Convert each SNP into a number. Assume the SNP is A/B, where A and B are the nucleotides in alphabetical order
–Set AA to 0, AB to 1, and BB to 2
–Now each individual is a vector of numbers
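The encoding above can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the function names `encode_genotype` and `encode_individual` and the toy genotypes are assumptions made for the example.

```python
def encode_genotype(genotype, alleles):
    """Encode a two-letter genotype as 0 (AA), 1 (AB), or 2 (BB).

    `alleles` holds the two nucleotides of the SNP; A is the one that
    comes first alphabetically and B the other, as on the slide.
    """
    a, b = sorted(alleles)
    if len(genotype) != 2 or any(ch not in (a, b) for ch in genotype):
        raise ValueError(f"genotype {genotype!r} does not match SNP {a}/{b}")
    # The code is simply the number of copies of allele B.
    return sum(1 for ch in genotype if ch == b)


def encode_individual(genotypes, snp_alleles):
    """Turn one individual's genotypes into a vector of numbers."""
    return [encode_genotype(g, al) for g, al in zip(genotypes, snp_alleles)]


# Hypothetical toy data: three SNPs and one individual.
snp_alleles = [("A", "G"), ("C", "T"), ("T", "G")]
individual = ["AG", "CC", "TT"]

encoded = encode_individual(individual, snp_alleles)
print(encoded)  # [1, 0, 2]
```

Counting copies of B means the heterozygote is encoded as 1 regardless of the order in which its two letters are written.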
4
Principal component analysis
We still have very high-dimensional vectors to work with. Can we reduce the dimensionality?
We wish to find a vector w of length 1 such that the variance of the data projected onto w is maximized.
5
PCA
[Figure: original data and projected data]
6
PCA optimization problem
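The formula on this slide did not survive in the transcript. A standard way to write the problem, consistent with the description of w as a unit-length vector that maximizes the variance of the projected data, is sketched below. The symbols n (number of individuals), x̄ (mean vector), and the 1/n normalization are assumptions and may differ from the original slide.

```latex
\max_{w} \; \frac{1}{n}\sum_{i=1}^{n}\left(w^{T}x_i - w^{T}\bar{x}\right)^{2}
  \;=\; w^{T}\Sigma\, w
\quad \text{subject to} \quad w^{T}w = 1,
\qquad \text{where} \quad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\quad
\Sigma = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})(x_i-\bar{x})^{T}.
```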
7
PCA solution
Using Lagrange multipliers we can show that w is the eigenvector of the covariance matrix Σ with the largest eigenvalue (often loosely called the "largest eigenvector").
Now we can compress each vector x_i into the single number w^T x_i. Does this help?
Before looking at examples, what if we want to compute a second projection u^T x_i such that w^T u = 0 and u^T u = 1?
It turns out that u is the eigenvector of Σ with the second largest eigenvalue.
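The two projections can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions: the function name `top_two_components`, the toy genotype matrix, and the 1/n normalization of the covariance matrix are not from the lecture. Forming the full covariance matrix is only practical for a small number of SNPs; with millions of SNPs one would typically work from an SVD of the centered data instead.

```python
import numpy as np


def top_two_components(X):
    """Return w, u, their eigenvalues, and the centered data.

    X has one row per individual and one column per SNP (0/1/2 codes).
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)              # center each SNP column
    cov = Xc.T @ Xc / X.shape[0]         # covariance matrix (1/n normalization)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    w = eigvecs[:, -1]                   # eigenvector with the largest eigenvalue
    u = eigvecs[:, -2]                   # eigenvector with the second largest
    return w, u, eigvals[-1], eigvals[-2], Xc


# Hypothetical toy data: 6 individuals, 4 SNPs.
X = np.array([
    [0, 1, 2, 0],
    [0, 2, 2, 1],
    [1, 1, 2, 0],
    [2, 0, 0, 1],
    [2, 0, 1, 2],
    [1, 0, 0, 2],
])

w, u, lam1, lam2, Xc = top_two_components(X)

# Each individual is compressed to the pair (w^T x_i, u^T x_i).
coords = np.column_stack([Xc @ w, Xc @ u])
print(coords.shape)  # (6, 2)
```

The variance of the first coordinate equals the largest eigenvalue, which is why that eigenvector solves the maximization problem. The sign of each eigenvector is arbitrary, so plots may appear mirrored between runs or libraries.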
8
PCA - real examples
45 Japanese and 45 Han Chinese from the International HapMap Project
PCA applied on 1.7 million SNPs
Taken from “PCA-Correlated SNPs for Structure Identification in Worldwide Human Populations” by Paschou et al. in PLoS Genetics 2007
9
PCA - real examples
255 subjects from four continents: Africa, Europe, Asia, and America
Total of 9,419 SNPs
PCA applied on 30 top PCA-correlated SNPs
Taken from “PCA-Correlated SNPs for Structure Identification in Worldwide Human Populations” by Paschou et al. in PLoS Genetics 2007
10
PCA - real examples
9 indigenous populations: Mbuti, Mende, Burunge, Spanish, Mala, East Asian, South Altaian, Nahua, and Quechua
Total of 9,160 SNPs
Taken from “PCA-Correlated SNPs for Structure Identification in Worldwide Human Populations” by Paschou et al. in PLoS Genetics 2007