Population structure identification BNFO 602 Roshan.


1 Population structure identification BNFO 602 Roshan

2 Population structure identification
We are given millions of SNPs for a random sample of a population.
Structure identification is an unsupervised learning problem: no training examples are provided.
Popular methods for unsupervised learning (also known as clustering):
– K-means
– Spectral clustering
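To make the K-means idea concrete, here is a minimal sketch of Lloyd's algorithm in plain Python. The function name `kmeans`, the toy genotype vectors, and the choice of the first k points as initial centers are illustrative assumptions, not part of the slides. Real analyses would use a library implementation and a better initialization.

```python
def kmeans(points, k, iters=100):
    """Lloyd's algorithm. Assumption: the first k points seed the centers."""
    centers = [list(p) for p in points[:k]]
    labels = None
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center
        # (squared Euclidean distance).
        new_labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centers[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the centers are stable
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Hypothetical toy data: six individuals, four SNPs each, coded 0/1/2.
toy = [
    [0, 0, 1, 2], [0, 1, 1, 2], [0, 0, 2, 2],
    [2, 2, 1, 0], [2, 1, 1, 0], [2, 2, 0, 0],
]
labels, centers = kmeans(toy, k=2)
print(labels)
```

No labels are supplied to the algorithm; the grouping comes only from distances between the vectors, which is what makes the problem unsupervised.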

3 Data encoding
Can we reduce the millions of SNPs into a compressed form of two or three dimensions?
Data encoding:
– Consider each individual as a vector of SNPs
– Convert each SNP into a number. Assume the SNP is A/B, where A and B are the nucleotides in alphabetical order
– Set AA to 0, AB to 1, and BB to 2
– Now each individual is a vector of numbers
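The encoding steps above can be sketched as follows. The function names `encode_snp` and `encode_individual` and the input layout (a two-letter genotype string plus the SNP's two alleles) are assumptions made for illustration. Missing genotypes are not handled because the slides do not discuss them.

```python
def encode_snp(genotype, alleles):
    """Return 0, 1, or 2: the number of copies of the allele that comes
    second alphabetically (AA -> 0, AB -> 1, BB -> 2)."""
    a, b = sorted(alleles)  # A/B in alphabetical order, as on the slide
    if len(genotype) != 2 or any(ch not in (a, b) for ch in genotype):
        raise ValueError(f"genotype {genotype!r} does not match SNP {a}/{b}")
    return sum(1 for ch in genotype if ch == b)


def encode_individual(genotypes, snp_alleles):
    """Turn one individual's genotypes into a vector of numbers."""
    return [encode_snp(g, al) for g, al in zip(genotypes, snp_alleles)]


# Hypothetical individual typed at three SNPs: A/G, C/T, and A/G.
print(encode_individual(["AA", "CT", "GG"], ["AG", "CT", "AG"]))  # [0, 1, 2]
```

Counting copies of the second allele means the heterozygote order does not matter: "AG" and "GA" both map to 1.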

4 Principal component analysis
We still have very high-dimensional vectors to work with. Can we reduce the dimensionality?
We wish to find a vector w of length 1 such that the variance of the data projected onto w is maximized.

5 PCA
[Figure: original data and projected data]

6 PCA optimization problem
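One standard way to state this problem, consistent with the goal of a unit-length w that maximizes the projected variance, is sketched here. The symbols n, x_i, and \bar{x} (the sample mean) are assumed notation rather than the slide's own.

```latex
\[
\max_{w}\; \frac{1}{n}\sum_{i=1}^{n}\left(w^{T}x_i - w^{T}\bar{x}\right)^{2}
  \;=\; w^{T}\Sigma\, w
\quad\text{subject to}\quad w^{T}w = 1,
\]
\[
\text{where}\quad
\Sigma = \frac{1}{n}\sum_{i=1}^{n}\left(x_i-\bar{x}\right)\left(x_i-\bar{x}\right)^{T},
\qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_i .
\]
```

The constraint w^T w = 1 is what keeps the problem bounded: without it, the variance could be made arbitrarily large just by scaling w.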

7 PCA solution
Using Lagrange multipliers we can show that w is given by the largest eigenvector of Σ.
What is the largest eigenvector? It is the eigenvector of Σ with the largest eigenvalue: the stationarity condition is Σw = λw, and the projected variance w^T Σ w then equals λ.
Now we can compress each vector x_i into the single number w^T x_i. Does this help?
Before looking at examples, what if we want to compute a second projection u^T x_i such that w^T u = 0 and u^T u = 1? It turns out that u is given by the eigenvector of Σ with the second-largest eigenvalue.
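As a rough illustration of the two projections, the sketch below finds w and u for a tiny made-up genotype matrix. It uses power iteration with deflation, a technique swapped in here to keep the example dependency-free; the slides do not say how the eigenvectors are computed, and all function names and data are hypothetical. In practice a routine such as an SVD or a symmetric eigen-solver from a numerical library would be used.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(S, v):
    return [dot(row, v) for row in S]


def covariance_matrix(X):
    """Return (covariance matrix, mean-centered rows) for the rows of X."""
    n, d = len(X), len(X[0])
    mean = [sum(row[j] for row in X) / n for j in range(d)]
    centered = [[row[j] - mean[j] for j in range(d)] for row in X]
    cov = [[sum(r[a] * r[b] for r in centered) / n for b in range(d)]
           for a in range(d)]
    return cov, centered


def leading_eigenpair(S, iters=500, seed=0):
    """Power iteration: approximate the largest eigenvalue of the symmetric
    matrix S and a unit eigenvector for it. Assumes the random start is not
    orthogonal to that eigenvector."""
    rng = random.Random(seed)
    w = [rng.uniform(-1.0, 1.0) for _ in S]
    norm = math.sqrt(dot(w, w))
    w = [x / norm for x in w]
    for _ in range(iters):
        v = matvec(S, w)
        norm = math.sqrt(dot(v, v))
        if norm == 0.0:
            break
        w = [x / norm for x in v]
    return dot(w, matvec(S, w)), w


def top_two_components(X):
    """Return w, u, and the 2-D coordinates of each mean-centered row."""
    S, centered = covariance_matrix(X)
    lam1, w = leading_eigenpair(S)
    # Deflation: subtract the first component so that the next power
    # iteration converges to a direction orthogonal to w.
    d = len(S)
    S2 = [[S[a][b] - lam1 * w[a] * w[b] for b in range(d)] for a in range(d)]
    _, u = leading_eigenpair(S2, seed=1)
    coords = [(dot(w, x), dot(u, x)) for x in centered]
    return w, u, coords


# Hypothetical toy data: six individuals, four SNPs each, coded 0/1/2.
X = [
    [0, 0, 1, 2], [0, 1, 1, 2], [0, 0, 2, 2],
    [2, 2, 1, 0], [2, 1, 1, 0], [2, 2, 0, 0],
]
w, u, coords = top_two_components(X)
for pc1, pc2 in coords:
    print(f"{pc1:+.3f} {pc2:+.3f}")
```

The rows are mean-centered before projection, which shifts every w^T x_i by the same constant and does not change the variance. On this toy matrix the first three individuals land on one side of zero along w and the last three on the other, while the sign of w itself is arbitrary.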

8 PCA - real examples
45 Japanese and 45 Han Chinese from the International HapMap Project
PCA applied on 1.7 million SNPs
Taken from “PCA-Correlated SNPs for Structure Identification in Worldwide Human Populations” by Paschou et al. in PLoS Genetics 2007

9 PCA - real examples
255 subjects from four continents: Africa, Europe, Asia, and America
Total of 9,419 SNPs
PCA applied on 30 top PCA-correlated SNPs
Taken from “PCA-Correlated SNPs for Structure Identification in Worldwide Human Populations” by Paschou et al. in PLoS Genetics 2007

10 PCA - real examples
9 indigenous populations: Mbuti, Mende, Burunge, Spanish, Mala, East Asian, South Altaian, Nahua, and Quechua
Total of 9,160 SNPs
Taken from “PCA-Correlated SNPs for Structure Identification in Worldwide Human Populations” by Paschou et al. in PLoS Genetics 2007
