Presentation is loading. Please wait.

Presentation is loading. Please wait.

Petros Drineas Rensselaer Polytechnic Institute

Similar presentations


Presentation on theme: "Petros Drineas Rensselaer Polytechnic Institute"— Presentation transcript:

1 Identifying Ancestry Informative Markers via the Singular Value Decomposition
Petros Drineas Rensselaer Polytechnic Institute Computer Science Department To access my web page: drineas

2 Population genetics can help translate that historical message.
Human genetic history Much of the biological and evolutionary history of our species is written in our DNA sequences. Population genetics can help translate that historical message.

3 Population genetics can help translate that historical message.
Human genetic history Much of the biological and evolutionary history of our species is written in our DNA sequences. Population genetics can help translate that historical message. The genetic variation among humans is a small portion of the human genome. All humans are almost than 99.9% identical.

4 Our objective Develop unsupervised, efficient algorithms for the selection of a small set of genetic markers that can be used to capture population structure, and predict individual ancestry.

5 Our objective Develop unsupervised, efficient algorithms for the selection of a small set of genetic markers that can be used to capture population structure, and predict individual ancestry. To this end, we employ matrix algorithms and matrix decompositions such as the Singular Value Decomposition (SVD), and the CX decomposition. We provide the first unsupervised algorithm for selecting of such markers.

6 Overview Basic genetics background
The Singular Value Decomposition (SVD) The CX decomposition Selecting Ancestry Informative Markers The HapMap data A worldwide set of populations An admixed population

7 Single Nucleotide Polymorphisms (SNPs)
Single Nucleotide Polymorphisms: the most common type of genetic variation in the genome across different individuals. They are known locations at the human genome where two alternate nucleotide bases (alleles) are observed (out of A, C, G, T). SNPs … AG CT GT GG CT CC CC CC CC AG AG AG AG AG AA CT AA GG GG CC GG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT CT CT AG CT AG GG GT GA AG … … GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA CT AA GG GG CC GG AA GG AA CC AA CC AA GG TT AA TT GG GG GG TT TT CC GG TT GG GG TT GG AA … … GG TT TT GG TT CC CC CC CC GG AA AG AG AA AG CT AA GG GG CC AG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT CT CT AG CT AG GG GT GA AG … … GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA CC GG AA CC CC AG GG CC AC CC AA CG AA GG TT AG CT CG CG CG AT CT CT AG CT AG GT GT GA AG … … GG TT TT GG TT CC CC CC CC GG AA GG GG GG AA CT AA GG GG CT GG AA CC AC CG AA CC AA GG TT GG CC CG CG CG AT CT CT AG CT AG GG TT GG AA … … GG TT TT GG TT CC CC CG CC AG AG AG AG AG AA CT AA GG GG CT GG AG CC CC CG AA CC AA GT TT AG CT CG CG CG AT CT CT AG CT AG GG TT GG AA … … GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA TT AA GG GG CC AG AG CG AA CC AA CG AA GG TT AA TT GG GG GG TT TT CC GG TT GG GT TT GG AA … individuals There are ¼ 10 million SNPs in the human genome, so this table could have ~10 million columns.

8 Two copies of a chromosome (father, mother)
Focus at a specific locus and assay the observed nucleotide bases (alleles). SNP: exactly two alternate alleles appear. T C

9 Two copies of a chromosome (father, mother)
Focus at a specific locus and assay the observed alleles. SNP: exactly two alternate alleles appear. C T An individual could be: Heterozygotic (in our study, CT = TC) Two copies of a chromosome (father, mother) SNPs … AG CT GT GG CT CC CC CC CC AG AG AG AG AG AA CT AA GG GG CC GG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT CT CT AG CT AG GG GT GA AG … … GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA CT AA GG GG CC GG AA GG AA CC AA CC AA GG TT AA TT GG GG GG TT TT CC GG TT GG GG TT GG AA … … GG TT TT GG TT CC CC CC CC GG AA AG AG AA AG CT AA GG GG CC AG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT CT CT AG CT AG GG GT GA AG … … GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA CC GG AA CC CC AG GG CC AC CC AA CG AA GG TT AG CT CG CG CG AT CT CT AG CT AG GT GT GA AG … … GG TT TT GG TT CC CC CC CC GG AA GG GG GG AA CT AA GG GG CT GG AA CC AC CG AA CC AA GG TT GG CC CG CG CG AT CT CT AG CT AG GG TT GG AA … … GG TT TT GG TT CC CC CG CC AG AG AG AG AG AA CT AA GG GG CT GG AG CC CC CG AA CC AA GT TT AG CT CG CG CG AT CT CT AG CT AG GG TT GG AA … … GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA TT AA GG GG CC AG AG CG AA CC AA CG AA GG TT AA TT GG GG GG TT TT CC GG TT GG GT TT GG AA … individuals

10 Two copies of a chromosome (father, mother)
Focus at a specific locus and assay the observed alleles. SNP: exactly two alternate alleles appear. C C Two copies of a chromosome (father, mother) An individual could be: Heterozygotic (in our study, CT = TC) Homozygotic at the first allele, e.g., C SNPs … AG CT GT GG CT CC CC CC CC AG AG AG AG AG AA CT AA GG GG CC GG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT CT CT AG CT AG GG GT GA AG … … GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA CT AA GG GG CC GG AA GG AA CC AA CC AA GG TT AA TT GG GG GG TT TT CC GG TT GG GG TT GG AA … … GG TT TT GG TT CC CC CC CC GG AA AG AG AA AG CT AA GG GG CC AG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT CT CT AG CT AG GG GT GA AG … … GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA CC GG AA CC CC AG GG CC AC CC AA CG AA GG TT AG CT CG CG CG AT CT CT AG CT AG GT GT GA AG … … GG TT TT GG TT CC CC CC CC GG AA GG GG GG AA CT AA GG GG CT GG AA CC AC CG AA CC AA GG TT GG CC CG CG CG AT CT CT AG CT AG GG TT GG AA … … GG TT TT GG TT CC CC CG CC AG AG AG AG AG AA CT AA GG GG CT GG AG CC CC CG AA CC AA GT TT AG CT CG CG CG AT CT CT AG CT AG GG TT GG AA … … GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA TT AA GG GG CC AG AG CG AA CC AA CG AA GG TT AA TT GG GG GG TT TT CC GG TT GG GT TT GG AA … individuals

11 Two copies of a chromosome (father, mother)
Focus at a specific locus and assay the observed alleles. SNP: exactly two alternate alleles appear. T T Two copies of a chromosome (father, mother) An individual could be: Heterozygotic (in our study, CT = TC) Homozygotic at the first allele, e.g., C Homozygotic at the second allele, e.g., T Encode as 0 Encode as +1 Encode as -1 SNPs … AG CT GT GG CT CC CC CC CC AG AG AG AG AG AA CT AA GG GG CC GG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT CT CT AG CT AG GG GT GA AG … … GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA CT AA GG GG CC GG AA GG AA CC AA CC AA GG TT AA TT GG GG GG TT TT CC GG TT GG GG TT GG AA … … GG TT TT GG TT CC CC CC CC GG AA AG AG AA AG CT AA GG GG CC AG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT CT CT AG CT AG GG GT GA AG … … GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA CC GG AA CC CC AG GG CC AC CC AA CG AA GG TT AG CT CG CG CG AT CT CT AG CT AG GT GT GA AG … … GG TT TT GG TT CC CC CC CC GG AA GG GG GG AA CT AA GG GG CT GG AA CC AC CG AA CC AA GG TT GG CC CG CG CG AT CT CT AG CT AG GG TT GG AA … … GG TT TT GG TT CC CC CG CC AG AG AG AG AG AA CT AA GG GG CT GG AG CC CC CG AA CC AA GT TT AG CT CG CG CG AT CT CT AG CT AG GG TT GG AA … … GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA TT AA GG GG CC AG AG CG AA CC AA CG AA GG TT AA TT GG GG GG TT TT CC GG TT GG GT TT GG AA … individuals

12 Two copies of a chromosome (father, mother)
Focus at a specific locus and assay the observed alleles. SNP: exactly two alternate alleles appear. Two copies of a chromosome (father, mother) An individual could be: Heterozygotic (in our study, CT = TC) Homozygotic at the first allele, e.g., C Homozygotic at the second allele, e.g., T Rare (or Minor) Allele Frequency, RAF (or MAF): The frequency of the “less frequent” allele in a SNP e.g., freq(C) = 5/14. SNPs … AG CT GT GG CT CC CC CC CC AG AG AG AG AG AA CT AA GG GG CC GG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT CT CT AG CT AG GG GT GA AG … … GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA CT AA GG GG CC GG AA GG AA CC AA CC AA GG TT AA TT GG GG GG TT TT CC GG TT GG GG TT GG AA … … GG TT TT GG TT CC CC CC CC GG AA AG AG AA AG CT AA GG GG CC AG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT CT CT AG CT AG GG GT GA AG … … GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA CC GG AA CC CC AG GG CC AC CC AA CG AA GG TT AG CT CG CG CG AT CT CT AG CT AG GT GT GA AG … … GG TT TT GG TT CC CC CC CC GG AA GG GG GG AA CT AA GG GG CT GG AA CC AC CG AA CC AA GG TT GG CC CG CG CG AT CT CT AG CT AG GG TT GG AA … … GG TT TT GG TT CC CC CG CC AG AG AG AG AG AA CT AA GG GG CT GG AG CC CC CG AA CC AA GT TT AG CT CG CG CG AT CT CT AG CT AG GG TT GG AA … … GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA TT AA GG GG CC AG AG CG AA CC AA CG AA GG TT AA TT GG GG GG TT TT CC GG TT GG GT TT GG AA … individuals

13 Population Genetics 101 Genetic diversity and population (sub)structure is caused by… Mutation Mutations are changes to the base pair sequence of the DNA. Natural selection Genotypes that correspond to favorable traits and are heritable become more common in successive generations of a population of reproducing organisms.

14 Population Genetics 101 Genetic diversity and population (sub)structure is caused by… Mutation Mutations are changes to the base pair sequence of the DNA. Natural selection Genotypes that correspond to favorable traits and are heritable become more common in successive generations of a population of reproducing organisms.  Mutations increase genetic diversity. Under natural selection, beneficial mutations increase in frequency, and vice versa.

15 Population Genetics 101 (cont’d)
Genetic drift Sampling effects on evolution! Example: say that the RAF of a SNP in a small population is p. The offspring generation would (in expectation) have a RAF of p as well for the same SNP. In reality, it will have a RAF of p’ (a drifted frequency) … Gene flow Transfer of alleles between populations.

16 Population Genetics 101 (cont’d)
Genetic drift Sampling effects on evolution! Example: say that the RAF of a SNP in a small population is p. The offspring generation would (in expectation) have a RAF of p as well for the same SNP. In reality, it will have a RAF of p’ (a drifted frequency) … Gene flow Transfer of alleles between populations. Genetic drift are a stronger force than natural selection in small populations.

17 Population Genetics 101 (cont’d)
Examples: Population bottlenecks Wars, epidemics, natural disasters wipe off a part of the population and lead to genetic drift in offspring population. Founders effect A new population is established by a small number of individuals, carrying only a fraction of the original population's genetic variation. Non-random mating Reduces interaction between (sub)populations. Other demographic events Immigration, etc.

18 Images courtesy of Kenneth Kidd,
Early Homo sapiens sapiens in Africa 150,000 to 100,000 years ago Images courtesy of Kenneth Kidd,

19 colonizing South West Asia,
Homo sapiens sapiens colonizing South West Asia, approx. 100,000 years ago.

20 Homo sapiens sapiens approx. 40,000 years ago.

21 Why study population structure?
History of human populations Genealogy Forensics Mapping causative genes for common complex disorders (e.g. diabetes, heart conditions, obesity, etc.)

22 Why are SNPs really important?
Genome Wide Association Studies (GWAS): Locating causative genes for common complex disorders is based on identifying association between affection status and known SNPs. No prior knowledge about the function of the gene(s) or the etiology of the disorder is necessary.

23 Why are SNPs really important?
Genome Wide Association Studies (GWAS): Locating causative genes for common complex disorders is based on identifying association between affection status and known SNPs. No prior knowledge about the function of the gene(s) or the etiology of the disorder is necessary. The subsequent investigation of candidate genes that are in physical proximity with the associated SNPs is the first step towards understanding the etiological pathway of a disorder and designing a drug. Numerous such studies revealed (and will continue to reveal) correlations between genes and diseases.

24 Recall our objective Develop unsupervised, efficient algorithms for the selection of a small set of SNPs that can be used to capture population structure, and predict individual ancestry. Why? cost efficiency, identification of regions of natural selection. Let’s discuss (briefly) prior work …

25 Inferring population structure
Oceania Africa Europe Middle East Central Asia East Asia America 377 STRPs, Rosenberg et al. Science ’02 Developed a software package called STRUCTURE. Works well, however: It is based on explicit assumptions that may not always hold. It cannot handle large genome-wide datasets (computationally expensive).

26 Selecting ancestry informative markers
Existing methods (Fst, Informativeness, δ) Rosenberg et al. Am J Hum Genet ’03 Allele frequency based. Require prior knowledge of individual ancestry (supervised). Such knowledge may not be available. (E.g., populations of complex ancestry, large multi-centered studies of anonymous samples, etc.)

27 Overview Basic genetics background
The Singular Value Decomposition (SVD) The CX decomposition Selecting Ancestry Informative Markers The HapMap data A worldwide set of populations An admixed population

28 The Singular Value Decomposition (SVD)
Let A be a matrix with m rows (one for each subject) and n columns (one for each SNP). Matrix rows: points (vectors) in a Euclidean space, e.g., given 2 objects (x & d), each described with respect to two features, we get a 2-by-2 matrix. Two objects are “close” if the angle between their corresponding vectors is small.

29 SVD, intuition Let the blue circles represent m data points in a 2-D Euclidean space. Then, the SVD of the m-by-2 matrix of the data will return … 2nd (right) singular vector 2nd (right) singular vector: direction of maximal variance, after removing the projection of the data along the first singular vector. 1st (right) singular vector 1st (right) singular vector: direction of maximal variance,

30 Singular values 2nd (right) singular vector 2 1st (right) singular vector 1: measures how much of the data variance is explained by the first singular vector. 2: measures how much of the data variance is explained by the second singular vector. 1

31 SVD: formal definition
Let 1 ¸ 2 ¸ … ¸  be the entries of . Exact computation of the SVD takes O(min{mn2 , m2n}) time. The top k left/right singular vectors/values can be computed faster using Lanczos/Arnoldi methods. : rank of A U (V): orthogonal matrix containing the left (right) singular vectors of A. S: diagonal matrix containing the singular values of A.

32 Rank-k approximations via the SVD
= U VT features sig. significant noise noise = significant noise objects

33 Rank-k approximations (Ak)
Uk (Vk): orthogonal matrix containing the top k left (right) singular vectors of A. k: diagonal matrix containing the top k singular values of A. Also,

34 PCA and SVD Principal Components Analysis (PCA) essentially amounts to the computation of the Singular Value Decomposition (SVD) of a covariance matrix. SVD is the algorithmic tool behind MultiDimensional Scaling (MDS) and Factor Analysis.

35 Back to data: the HapMap project
(~$130,000,000 funding from NIH) Map approx. 3.1 million SNPs for 270 individuals from 4 populations (YRI, CEU, CHB, JPT), to create a “genetic map” for researchers. Let A be the 90£2.7 million matrix of the CHB and JPT population in HapMap. Run SVD on A, keep the two (left) singular vectors (eigenSNPs). Run a (naïve, e.g., k-means) clustering algorithm to split the data in two clusters.

36 PCA for analyzing population structure was introduced by L
PCA for analyzing population structure was introduced by L. Cavalli-Sforza in the ’70s.

37 (Mathematically: spanning the same subspace.)
PCA for analyzing population structure was introduced by L. Cavalli-Sforza in the ’70s. Not altogether satisfactory: the (top two left) singular vectors (eigenSNPs) are linear combinations of all SNPs, and can not be genotyped! Can we find actual SNPs that capture the “information” in the eigenSNPs? (Mathematically: spanning the same subspace.)

38 Overview Basic genetics background
The Singular Value Decomposition (SVD) The CX decomposition Selecting Ancestry Informative Markers The HapMap data A worldwide set of populations An admixed population

39 CX decomposition Carefully chosen X Goal: make (some norm) of A-CX small. c columns of A Why? If A is an subject-SNP matrix, then selecting representative columns is equivalent to selecting representative SNPs to capture the same structure as the top eigenSNPs. We want c as small as possible!

40 CX decomposition Goal: make (some norm) of A-CX small.
Carefully chosen X Goal: make (some norm) of A-CX small. c columns of A Theory: for any matrix A, we can find C such that is almost equal to the norm of A-Ak with c ¼ k.

41 CX decomposition c columns of A Easy to prove that optimal X = C+A. (C+ is the Moore-Penrose pseudoinverse of C.) Thus, the challenging part is to find good columns (SNPs) of A to include in C.

42 CX decomposition c columns of A Easy to prove that optimal X = C+A. (C+ is the Moore-Penrose pseudoinverse of C.) Thus, the challenging part is to find good columns (SNPs) of A to include in C. From a mathematical perspective, this is a hard combinatorial problem, closely related to the so-called Column Subset Selection Problem (CSSP). The CSSP has been heavily studied in Numerical Linear Algebra.

43 A theorem (Drineas, Mahoney, and Muthukrishnan ’06, ’07)
Given an m-by-n matrix A, there exists an algorithm that picks, in expectation, at most O( k log k / 2 ) columns of A runs in O(mn2) time (m · n), and with probability at least

44 The CX algorithm Input: m-by-n matrix A, integer k
Output: C, the matrix consisting of the selected columns CX algorithm Compute probabilities pj summing to 1 Let c = O(k log k / 2) For each j = 1,2,…,n, pick the j-th column of A with probability min{1,cpj} Let C be the matrix consisting of the chosen columns (C has – in expectation – at most c columns)

45 Subspace sampling (Frobenius norm)
Vk: orthogonal matrix containing the top k right singular vectors of A. S k: diagonal matrix containing the top k singular values of A. Remark: The rows of VkT are orthonormal vectors, but its columns (VkT)(i) are not.

46 Subspace sampling (Frobenius norm)
Vk: orthogonal matrix containing the top k right singular vectors of A. S k: diagonal matrix containing the top k singular values of A. Remark: The rows of VkT are orthonormal vectors, but its columns (VkT)(i) are not. Subspace sampling in O(mn2) time Normalization s.t. the pj sum up to 1

47 Deterministic variant of CX
Input: m-by-n matrix A, integer k, and c (number of SNPs to pick) Output: the selected SNPs CX algorithm Compute the scores pj Pick the columns (SNPs) corresponding to the top c scores.

48 Deterministic variant of CX
Input: m-by-n matrix A, integer k, and c (number of SNPs to pick) Output: the selected SNPs: we will call them PCA Informative Markers or PCAIMs CX algorithm Compute the scores pj Pick the columns (SNPs) corresponding to the top c scores. In order to estimate k for SNP data, we developed a permutation-based test to determine whether a certain principal component is significant or not. (A similar test was presented in Patterson et al. PLoS Genet ’06.)

49 Number of SNPs Misclassifications
50 6 100+ 1 As good as the best existing metric (informativeness). However, our metric is unsupervised!

50 Overview Basic genetics background
The Singular Value Decomposition (SVD) The CX decomposition Selecting Ancestry Informative Markers The HapMap data A worldwide set of populations An admixed population

51 Worldwide data European Americans South Altaians - - Spanish African Americans Japanese Chinese Puerto Rico Mende Nahua Mbuti Mala Burunge Quechua Africa Europe E Asia America 274 individuals, 12 populations, ~10,000 SNPs using the Affymetrix array (Shriver et al. Hum Genom ‘05)

52 (Africa, Europe, Asia, America)
Selecting PCA-correlated SNPs for individual assignment to four continents (Africa, Europe, Asia, America) Afr Eur Asi Ame Africa Europe Asia America * top 30 PCA-correlated SNPs PCA-scores SNPs by chromosomal order

53 (Use a subset of SNPs, cluster the individuals using k-means.)
Correlation coefficient between true and predicted membership of an individual to a particular geographic continent. (Use a subset of SNPs, cluster the individuals using k-means.)

54 Cross-validation on HapMap data

55 All SNPs and 10 principal components: perfect clustering!
B 9 indigenous populations from four different continents (Africa, Europe, Asia, Americas) All SNPs and 10 principal components: perfect clustering! 50 PCAIMs SNPs, almost perfect clustering; 200 PCAIMs SNPs, perfect clustering. Informativeness performed poorly (maybe an artifact of k-means clustering…)

56 Admixed populations Two (independent) Puerto Rican datasets, two major axes of variation: Dataset A: 192 individuals, ~7,000 SNPs Dataset B: 30 individuals, same ~7,000 SNP European - West African axis of variation Native American axis of variation

57 West Africa America Europe

58 Europe-Africa axis of variation
Ancestry coefficient: location of the individual in the Europe - Africa axis. Africa

59 Predicting ancestry using PCAIMs

60 Predicting ancestry using PCAIMs

61 Cross-validation on Puerto-Rican dataset B

62 Conclusions Using linear algebraic techniques (e.g., matrix decompositions) we selected markers that capture population structure. Our technique requires no prior assumptions and builds upon the power of SVD to identify population structure in various settings, including admixed populations. Prior theoretical work and mathematical understanding of the underlying problem was fundamental in designing our algorithm!

63 Acknowledgements Collaborators Students
P. Paschou, Democritus University, Greece E. Ziv, UCSF E. Burchard, UCSF M.W. Mahoney, Yahoo! Research K. K. Kidd, Yale University M. Shriver, Penn State R. Krauss, Oakland Research Institute Students Asif Javed, RPI Jamey Lewis, RPI Funding NSF CAREER award to Petros Drineas NIH U19 AG23122 and K22CA Breast Cancer Research Program grant to E. Ziv Hellenic Endocrine Society Research grant award to P. Paschou Ref: Paschou, Elad, Burchard,…, Mahoney, and Drineas (2007) PLoS Genetics, 3: e160


Download ppt "Petros Drineas Rensselaer Polytechnic Institute"

Similar presentations


Ads by Google