Population structure identification BNFO 602 Roshan.

Population structure identification
We are given millions of SNPs for a random sample of a population.
Structure identification is an unsupervised learning problem: no training examples are provided.
Popular methods for unsupervised learning (also known as clustering):
– K-means (a minimal sketch appears after this list)
– Spectral clustering
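As an illustration of the first method, here is a minimal K-means sketch in Python/NumPy. The function name, the random initialization, and the convergence test are choices made here for illustration, not taken from the course; it assumes the data is a float array X with one row per individual.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Cluster the rows of X (n points x d features) into k groups."""
    rng = np.random.default_rng(seed)
    # Initialize centers with k distinct rows chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center becomes the mean of its points;
        # empty clusters keep their old center.
        new_centers = np.array([X[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):  # converged
            break
        centers = new_centers
    return labels, centers
```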

Data encoding
Can we reduce the millions of SNPs into a compressed form of two or three dimensions?
Data encoding:
– Consider each individual as a vector of SNPs
– Convert each SNP into a number. Assume the SNP is A/B, where A and B are the nucleotides in alphabetical order
– Set AA to 0, AB to 1, and BB to 2
– Now each individual is a vector of numbers (see the sketch after this list)
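A minimal sketch of this 0/1/2 encoding, assuming genotypes arrive as two-character strings such as "AG"; the function name and input format are illustrative assumptions, not from the slides.

```python
def encode_genotypes(genotypes):
    """Encode genotypes as counts of the B allele: AA -> 0, AB -> 1, BB -> 2.

    genotypes: one list per individual of two-character strings
    ("AA", "AG", "GG", ...), one string per SNP.
    """
    n_ind, n_snp = len(genotypes), len(genotypes[0])
    encoded = [[0] * n_snp for _ in range(n_ind)]
    for j in range(n_snp):
        # Alleles seen at this SNP, alphabetically: A first, B second.
        alleles = sorted({base for ind in genotypes for base in ind[j]})
        b = alleles[1] if len(alleles) > 1 else None  # monomorphic SNP -> all 0
        for i, ind in enumerate(genotypes):
            encoded[i][j] = sum(base == b for base in ind[j])
    return encoded

# Example: three individuals typed at two SNPs.
print(encode_genotypes([["AA", "CG"], ["AG", "GG"], ["GG", "CC"]]))
# [[0, 1], [1, 2], [2, 0]]
```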

Principal component analysis
We still have very high-dimensional vectors to work with. Can we reduce the dimensionality?
We wish to find a unit vector w (that is, wᵀw = 1) such that the variance of the data projected onto w is maximized.

PCA
[Figure: side-by-side plots of the original data and the data projected onto the first principal component, labeled "Original data" and "Projected data".]

PCA optimization problem
Find the direction w that maximizes the variance of the projected data: maximize wᵀ∑w subject to wᵀw = 1, where ∑ is the sample covariance matrix of the data.
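Written out (a standard statement of the problem; the slide's own formula did not survive transcription):

```latex
\max_{w}\; w^{T}\Sigma w
\quad \text{subject to} \quad w^{T}w = 1,
\qquad \text{where } \Sigma = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})(x_i-\bar{x})^{T}.
```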

PCA solution
Using Lagrange multipliers we can show that w is given by the largest eigenvector of ∑, that is, the eigenvector corresponding to the largest eigenvalue.
Now we can compress each vector xᵢ into the single number wᵀxᵢ. Does this help?
Before looking at examples: what if we want a second projection uᵀxᵢ such that wᵀu = 0 and uᵀu = 1? It turns out that u is given by the second largest eigenvector of ∑.
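The Lagrange-multiplier step referenced above, sketched under the formulation on the previous slide:

```latex
L(w,\lambda) = w^{T}\Sigma w - \lambda\,(w^{T}w - 1), \qquad
\frac{\partial L}{\partial w} = 2\Sigma w - 2\lambda w = 0
\;\Longrightarrow\; \Sigma w = \lambda w .
```

At a stationary point the variance is wᵀ∑w = λ, so it is maximized by the eigenvector with the largest eigenvalue. A minimal NumPy sketch of both projections follows; the function name and toy data are assumptions, and for millions of SNPs one would take an SVD of the centered data matrix rather than forming the huge covariance matrix.

```python
import numpy as np

def top2_projections(X):
    """Project rows of X onto the eigenvectors of the covariance
    matrix with the two largest eigenvalues (w and u above)."""
    Xc = X - X.mean(axis=0)           # center each SNP column
    cov = np.cov(Xc, rowvar=False)    # sample covariance matrix Sigma
    vals, vecs = np.linalg.eigh(cov)  # eigh returns eigenvalues ascending
    W = vecs[:, np.argsort(vals)[::-1][:2]]  # top-2 eigenvectors as columns
    return Xc @ W                     # row i holds (w^T x_i, u^T x_i)

# Toy data: 0/1/2 genotypes for 6 individuals at 50 SNPs.
X = np.random.default_rng(0).integers(0, 3, size=(6, 50)).astype(float)
print(top2_projections(X).shape)  # (6, 2)
```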

PCA - real examples
45 Japanese and 45 Han Chinese from the International HapMap Project.
PCA applied to 1.7 million SNPs.
Taken from "PCA-Correlated SNPs for Structure Identification in Worldwide Human Populations" by Paschou et al., PLoS Genetics, 2007.

PCA - real examples
255 subjects from four continents: Africa, Europe, Asia, and America.
Total of 9,419 SNPs.
PCA applied to the 30 top PCA-correlated SNPs.
Taken from "PCA-Correlated SNPs for Structure Identification in Worldwide Human Populations" by Paschou et al., PLoS Genetics, 2007.

PCA - real examples
9 indigenous populations: Mbuti, Mende, Burunge, Spanish, Mala, East Asian, South Altaian, Nahua, and Quechua.
Total of 9,160 SNPs.
Taken from "PCA-Correlated SNPs for Structure Identification in Worldwide Human Populations" by Paschou et al., PLoS Genetics, 2007.