Population structure identification BNFO 602 Roshan.

Population structure identification
We are given millions of SNPs for a random sample of a population.
Structure identification is an unsupervised learning problem: no training examples are provided.
Popular methods for unsupervised learning (also known as clustering):
– K-means (a minimal sketch appears after this list)
– Spectral clustering
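As an illustration of the first method, here is a minimal K-means sketch in Python/NumPy. The function name, the random initialization, and the convergence test are choices made here for illustration, not taken from the course; it assumes the data is a float array X with one row per individual.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Cluster the rows of X (n points x d features) into k groups."""
    rng = np.random.default_rng(seed)
    # Initialize centers with k distinct rows chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center becomes the mean of its points;
        # empty clusters keep their old center.
        new_centers = np.array([X[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):  # converged
            break
        centers = new_centers
    return labels, centers
```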

Data encoding
Can we reduce the millions of SNPs into a compressed form of two or three dimensions?
Data encoding:
– Consider each individual as a vector of SNPs
– Convert each SNP into a number. Assume the SNP is A/B, where A and B are the nucleotides in alphabetical order
– Set AA to 0, AB to 1, and BB to 2
– Now each individual is a vector of numbers (see the sketch after this list)
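A minimal sketch of this 0/1/2 encoding, assuming genotypes arrive as two-character strings such as "AG"; the function name and input format are illustrative assumptions, not from the slides.

```python
def encode_genotypes(genotypes):
    """Encode genotypes as counts of the B allele: AA -> 0, AB -> 1, BB -> 2.

    genotypes: one list per individual of two-character strings
    ("AA", "AG", "GG", ...), one string per SNP.
    """
    n_ind, n_snp = len(genotypes), len(genotypes[0])
    encoded = [[0] * n_snp for _ in range(n_ind)]
    for j in range(n_snp):
        # Alleles seen at this SNP, alphabetically: A first, B second.
        alleles = sorted({base for ind in genotypes for base in ind[j]})
        b = alleles[1] if len(alleles) > 1 else None  # monomorphic SNP -> all 0
        for i, ind in enumerate(genotypes):
            encoded[i][j] = sum(base == b for base in ind[j])
    return encoded

# Example: three individuals typed at two SNPs.
print(encode_genotypes([["AA", "CG"], ["AG", "GG"], ["GG", "CC"]]))
# [[0, 1], [1, 2], [2, 0]]
```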

Principal component analysis
We still have very high-dimensional vectors to work with. Can we reduce the dimensionality?
We wish to find a unit vector w (that is, wᵀw = 1) such that the variance of the data projected onto w is maximized.

PCA
[Figure: side-by-side plots of the original data and the data projected onto the first principal component, labeled "Original data" and "Projected data".]

PCA optimization problem
Find the direction w that maximizes the variance of the projected data: maximize wᵀ∑w subject to wᵀw = 1, where ∑ is the sample covariance matrix of the data.
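Written out (a standard statement of the problem; the slide's own formula did not survive transcription):

```latex
\max_{w}\; w^{T}\Sigma w
\quad \text{subject to} \quad w^{T}w = 1,
\qquad \text{where } \Sigma = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})(x_i-\bar{x})^{T}.
```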

PCA solution
Using Lagrange multipliers we can show that w is given by the largest eigenvector of ∑, that is, the eigenvector corresponding to the largest eigenvalue.
Now we can compress each vector xᵢ into the single number wᵀxᵢ. Does this help?
Before looking at examples: what if we want a second projection uᵀxᵢ such that wᵀu = 0 and uᵀu = 1? It turns out that u is given by the second largest eigenvector of ∑.
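The Lagrange-multiplier step referenced above, sketched under the formulation on the previous slide:

```latex
L(w,\lambda) = w^{T}\Sigma w - \lambda\,(w^{T}w - 1), \qquad
\frac{\partial L}{\partial w} = 2\Sigma w - 2\lambda w = 0
\;\Longrightarrow\; \Sigma w = \lambda w .
```

At a stationary point the variance is wᵀ∑w = λ, so it is maximized by the eigenvector with the largest eigenvalue. A minimal NumPy sketch of both projections follows; the function name and toy data are assumptions, and for millions of SNPs one would take an SVD of the centered data matrix rather than forming the huge covariance matrix.

```python
import numpy as np

def top2_projections(X):
    """Project rows of X onto the eigenvectors of the covariance
    matrix with the two largest eigenvalues (w and u above)."""
    Xc = X - X.mean(axis=0)           # center each SNP column
    cov = np.cov(Xc, rowvar=False)    # sample covariance matrix Sigma
    vals, vecs = np.linalg.eigh(cov)  # eigh returns eigenvalues ascending
    W = vecs[:, np.argsort(vals)[::-1][:2]]  # top-2 eigenvectors as columns
    return Xc @ W                     # row i holds (w^T x_i, u^T x_i)

# Toy data: 0/1/2 genotypes for 6 individuals at 50 SNPs.
X = np.random.default_rng(0).integers(0, 3, size=(6, 50)).astype(float)
print(top2_projections(X).shape)  # (6, 2)
```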

PCA - real examples
45 Japanese and 45 Han Chinese from the International HapMap Project.
PCA applied to 1.7 million SNPs.
Taken from "PCA-Correlated SNPs for Structure Identification in Worldwide Human Populations" by Paschou et al., PLoS Genetics, 2007.

PCA - real examples
255 subjects from four continents: Africa, Europe, Asia, and America.
Total of 9,419 SNPs.
PCA applied to the 30 top PCA-correlated SNPs.
Taken from "PCA-Correlated SNPs for Structure Identification in Worldwide Human Populations" by Paschou et al., PLoS Genetics, 2007.

PCA - real examples
9 indigenous populations: Mbuti, Mende, Burunge, Spanish, Mala, East Asian, South Altaian, Nahua, and Quechua.
Total of 9,160 SNPs.
Taken from "PCA-Correlated SNPs for Structure Identification in Worldwide Human Populations" by Paschou et al., PLoS Genetics, 2007.