Single nucleotide polymorphisms and applications
Usman Roshan
BNFO 601
SNPs
DNA sequence variations that occur when a single nucleotide is altered. A variant must be present in at least 1% of the population to be called a SNP. SNPs occur every 100 to 300 bases along the 3 billion-base human genome. Many have no effect on cell function, but some could affect disease risk and drug response.
Toy example
SNPs on the chromosome
Bi-allelic SNPs
Most SNPs have one of two nucleotides at a given position. For example:
–A/G denotes the varying nucleotide as either A or G. We call each of these an allele
–Most SNPs have two alleles (bi-allelic)
SNP genotype
We inherit two copies of each chromosome (one from each parent). For a given SNP, the genotype specifies which alleles we carry. Example: for the SNP A/G, one's genotype may be
–AA if both copies of the chromosome have A
–GG if both copies of the chromosome have G
–AG or GA if one copy has A and the other has G
–AA and GG are called homozygous; AG and GA are called heterozygous
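The genotype definitions above can be sketched in a few lines of Python. This is only an illustration: the function names `classify_genotype` and `count_allele` are hypothetical, and genotypes are assumed to be two-letter strings such as "AG".

```python
def classify_genotype(genotype):
    """Return 'homozygous' if both alleles match, otherwise 'heterozygous'."""
    if len(genotype) != 2:
        raise ValueError("a genotype consists of exactly two alleles")
    return "homozygous" if genotype[0] == genotype[1] else "heterozygous"


def count_allele(genotype, allele):
    """Number of copies (0, 1, or 2) of `allele` carried in the genotype."""
    return sum(1 for a in genotype if a == allele)


# The four genotypes possible for the SNP A/G.
for g in ["AA", "AG", "GA", "GG"]:
    print(g, classify_genotype(g), count_allele(g, "G"))
```

Counting copies of one allele turns each genotype into a number (0, 1, or 2), which is a common way to represent SNP data as vectors.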
SNP genotyping
Real SNPs
SNP consortium: snp.cshl.org
SNPedia:
Application of SNPs: association with disease
Experimental design to detect cancer-associated SNPs:
–Pick random humans with and without cancer (say breast cancer)
–Perform SNP genotyping
–Look for associated SNPs
–Also called a genome-wide association study
Case-control example
Study of 100 people:
–Case: 50 subjects with cancer
–Control: 50 subjects without cancer
Count the number of alleles and form a contingency table (each subject contributes two alleles, so each row sums to 100):

           #Allele1   #Allele2
Case           10         90
Control         2         98
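One way to quantify the association in this table is a chi-square test of independence. The slides do not name a specific test, so the choice of test here is an assumption, and the function name `chi_square_2x2` is hypothetical. The sketch uses only the standard library and applies no continuity correction.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value (1 degree of freedom) for a 2x2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Rows: case, control. Columns: allele 1, allele 2.
table = [[10, 90], [2, 98]]
chi2, p = chi_square_2x2(table)
print(f"chi-square = {chi2:.3f}, p-value = {p:.4f}")
```

For this table the statistic is about 5.67, which exceeds the usual 5% critical value of 3.84 for one degree of freedom. With expected counts as small as 6, an exact test may be preferable in practice.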
Effect of population structure on genome-wide association studies
Suppose our sample is drawn from a population consisting of two groups, I and II. Assume that allele 1 is the majority allele in group I and allele 2 is the majority allele in group II. Further assume that most case subjects belong to group I and most control subjects belong to group II. Allele 1 then appears more often among cases simply because of group membership, which leads to a false association between allele 1 and the disease.
Effect of population structure on genome-wide association studies
We can correct for this effect if cases and controls are sampled equally from all sub-populations. To do this we need to know the population structure.
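A small made-up example may make the confounding concrete. All numbers below are invented for illustration: within each group, cases and controls have exactly the same allele frequency (so there is no real effect), yet pooling the groups produces a large chi-square statistic. The helper names `chi_square_2x2` and `allele_table` are hypothetical.

```python
def chi_square_2x2(table):
    """Pearson chi-square statistic for a 2x2 table of counts."""
    (a, b), (c, d) = table
    n = a + b + c + d
    rows = [a + b, c + d]
    cols = [a + c, b + d]
    return sum(
        (table[i][j] - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j] / n)
        for i in range(2)
        for j in range(2)
    )


def allele_table(n_case, n_control, freq_allele1):
    """Allele counts for one group; each subject carries two alleles.

    Cases and controls share the same allele frequency, i.e. no true effect.
    """
    case1 = round(2 * n_case * freq_allele1)
    control1 = round(2 * n_control * freq_allele1)
    return [[case1, 2 * n_case - case1], [control1, 2 * n_control - control1]]


# Group I: mostly cases, allele 1 common. Group II: mostly controls, allele 1 rare.
group1 = allele_table(n_case=80, n_control=20, freq_allele1=0.9)
group2 = allele_table(n_case=20, n_control=80, freq_allele1=0.1)
pooled = [[group1[i][j] + group2[i][j] for j in range(2)] for i in range(2)]

print("group I  :", group1, chi_square_2x2(group1))
print("group II :", group2, chi_square_2x2(group2))
print("pooled   :", pooled, chi_square_2x2(pooled))
```

Within each group the statistic is zero, while the pooled table gives a very large value: the apparent association comes entirely from how cases and controls are distributed across the two groups.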
Population structure prediction
Treated as an unsupervised learning problem (i.e. clustering)
Clustering
Suppose we want to cluster n vectors $x_1, \dots, x_n$ in $\mathbb{R}^d$ into two groups. Define $C_1$ and $C_2$ as the two groups. Our objective is to find $C_1$ and $C_2$ that minimize
$\sum_{i=1}^{2} \sum_{x_j \in C_i} \|x_j - m_i\|^2$
where $m_i$ is the mean of cluster $C_i$.
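K-means algorithm for two clusters
Input: n vectors $x_1, \dots, x_n$ in $\mathbb{R}^d$
Algorithm:
1. Initialize: assign each $x_i$ to $C_1$ or $C_2$ with equal probability and compute the means $m_1$ and $m_2$ of the two clusters
2. Recompute clusters: assign $x_i$ to $C_1$ if $\|x_i - m_1\| < \|x_i - m_2\|$, otherwise assign it to $C_2$
3. Recompute the means $m_1$ and $m_2$
4. Compute the objective of the new clustering (the total squared distance of each point to its cluster mean)
5. If the objective differs from that of the previous clustering by less than a chosen threshold, stop; otherwise go to step 2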
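The steps above can be sketched in plain Python as follows. This is a minimal illustration rather than a reference implementation: the names `two_means`, `tol`, `seed`, and `max_iter` are assumptions, labels 0 and 1 stand for C 1 and C 2, and the sample points are made up. The sketch redraws the random initial assignment if one cluster would be empty and assumes at least two input points.

```python
import random


def mean(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[k] for p in points) / len(points) for k in range(dim)]


def sq_dist(x, m):
    """Squared Euclidean distance between x and m."""
    return sum((a - b) ** 2 for a, b in zip(x, m))


def cluster_means(points, labels):
    return [mean([x for x, c in zip(points, labels) if c == k]) for k in range(2)]


def objective(points, labels, means):
    """Total squared distance of each point to the mean of its cluster."""
    return sum(sq_dist(x, means[c]) for x, c in zip(points, labels))


def two_means(points, tol=1e-9, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: random assignment with equal probability (redrawn if a cluster is empty).
    while True:
        labels = [rng.randrange(2) for _ in points]
        if len(set(labels)) == 2:
            break
    means = cluster_means(points, labels)
    history = [objective(points, labels, means)]
    for _ in range(max_iter):
        # Step 2: assign each point to the closer mean (ties go to the second cluster).
        new_labels = [
            0 if sq_dist(x, means[0]) < sq_dist(x, means[1]) else 1 for x in points
        ]
        if len(set(new_labels)) < 2:
            break  # degenerate case (e.g. identical means): keep the last clustering
        labels = new_labels
        # Step 3: recompute the means.
        means = cluster_means(points, labels)
        # Steps 4 and 5: compute the objective and stop once it barely changes.
        history.append(objective(points, labels, means))
        if history[-2] - history[-1] < tol:
            break
    return labels, means, history


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, means, history = two_means(points)
print("labels:", labels)
print("objective per iteration:", history)
```

Comparing squared distances gives the same assignment as comparing the norms in step 2. The returned `history` list records the objective after each iteration, so one can check that it never increases.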
K-means
Is it guaranteed to find the clustering that optimizes the objective? No: it is only guaranteed to find a local optimum. We can prove that the objective does not increase over subsequent iterations.
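A tiny made-up example illustrates the difference between a local and a global optimum. The four points below are the corners of a 4-by-1 rectangle, and the function `run_two_means` (a hypothetical name) runs the assign-and-recompute loop from a given starting assignment instead of a random one. The sketch assumes both clusters stay non-empty, which holds for this data.

```python
def mean(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[k] for p in points) / len(points) for k in range(dim)]


def sq_dist(x, m):
    """Squared Euclidean distance between x and m."""
    return sum((a - b) ** 2 for a, b in zip(x, m))


def run_two_means(points, labels, max_iter=100):
    """Iterate two-cluster k-means from the given labels until they stop changing."""
    labels = list(labels)
    for _ in range(max_iter):
        means = [mean([x for x, c in zip(points, labels) if c == k]) for k in range(2)]
        new_labels = [
            0 if sq_dist(x, means[0]) < sq_dist(x, means[1]) else 1 for x in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
    means = [mean([x for x, c in zip(points, labels) if c == k]) for k in range(2)]
    obj = sum(sq_dist(x, means[c]) for x, c in zip(points, labels))
    return labels, obj


# Corners of a rectangle that is 4 wide and 1 tall.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

# Start from a bottom/top split, then from a left/right split.
bad_labels, bad_obj = run_two_means(points, [0, 1, 0, 1])
good_labels, good_obj = run_two_means(points, [0, 0, 1, 1])
print("bottom/top start:", bad_labels, bad_obj)
print("left/right start:", good_labels, good_obj)
```

Both starting assignments are fixed points of the algorithm, but the bottom/top split has objective 16 while the left/right split has objective 1. This is why k-means is often run from several random initializations, keeping the clustering with the lowest objective.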
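Proof sketch of convergence of k-means
Let $J(C, m) = \sum_{i=1}^{2} \sum_{x_j \in C_i} \|x_j - m_i\|^2$ denote the objective for clusters $C = (C_1, C_2)$ and centers $m = (m_1, m_2)$, and let the superscript $t$ index iterations. Then
$J(C^{t}, m^{t}) \ge J(C^{t+1}, m^{t}) \ge J(C^{t+1}, m^{t+1})$
Justification of first inequality: by assigning each $x_j$ to the closest mean, the objective decreases or stays the same.
Justification of second inequality: for a given cluster, its mean minimizes the squared error loss.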
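The claim that a cluster's mean minimizes the squared error can be checked with a short derivation. The notation here is an assumption made for this sketch: $C$ is a single cluster, $m$ is its mean, and $c$ is any other candidate center.

```latex
% m is the mean of cluster C, so the deviations from m sum to zero.
\[
  m = \frac{1}{|C|} \sum_{x_j \in C} x_j
  \quad\Longrightarrow\quad
  \sum_{x_j \in C} (x_j - m) = 0 .
\]
% Expand the squared error around m for an arbitrary center c.
\begin{align*}
  \sum_{x_j \in C} \|x_j - c\|^2
    &= \sum_{x_j \in C} \|(x_j - m) + (m - c)\|^2 \\
    &= \sum_{x_j \in C} \|x_j - m\|^2
       + 2\,(m - c)^{\top} \sum_{x_j \in C} (x_j - m)
       + |C|\,\|m - c\|^2 \\
    &= \sum_{x_j \in C} \|x_j - m\|^2 + |C|\,\|m - c\|^2 \\
    &\ge \sum_{x_j \in C} \|x_j - m\|^2 .
\end{align*}
```

The cross term vanishes because the deviations from the mean sum to zero, so replacing any center $c$ by the mean can only lower the squared error, with equality exactly when $c = m$.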