CSE280Vineet Bafna CSE280a: Algorithmic topics in bioinformatics Vineet Bafna.


The scope/syllabus
We will cover topics from the following areas:
– Population genetics
– Computational mass spectrometry
– Biological networks (emphasis on comparative analysis)
– ncRNA
The focus will be on the use of algorithms for analyzing data in these areas.
– Some background in algorithms (mathematical maturity) is helpful.
– Relevant biology will be discussed on a ‘need to know’ basis.

Logistics
This class has little required homework. All reading is optional, but recommended.
1 final exam (20%) and 1 research project (70%).
Students help edit class notes in LaTeX (10%).
– At most one topic per student
– Groups of <= 2 per topic
Project goal:
– Address research problems with minimum preparation
Lectures will be given by instructors and by students.
Most communication is electronic.
– Check

From an individual to a population
Individual genomes vary by about 1 in 1000 bp.
These small variations account for significant phenotype differences.
– Disease susceptibility
– Response to drugs
How can we understand genetic variation in a population, and its consequences?
It took a long time (10-15 yrs) to produce the draft sequence of the human genome.
Soon (within years), entire populations can have their DNA sequenced. Why do we care?

Population Genetics
Individuals in a species (population) are phenotypically different.
Often these differences are inherited (genetic).
Understanding the genetic basis of these differences is a key challenge of biology!
The analysis of these differences involves many interesting algorithmic questions.
We will use these questions to illustrate algorithmic principles, and use algorithms to interpret genetic data.

Population genetics
We are all similar, yet we are different. How substantial are the differences?
– What are the sources of variation?
– As mutations arise, they are either neutral and subject to evolutionary drift, or they are (dis-)advantageous and under selective pressure. Can we tell?
– If you had DNA from many sub-populations (Asian, European, African), could you separate them?
– How can we detect recombination?
– Why are some people more likely to get a disease than others? How is disease gene mapping done?
– Phasing of chromosomes

Computational mass spectrometry
Mass spectrometry is a key technology for measuring active proteins, their interactions, and their post-translational modifications.
Computation plays a key role in interpreting mass spectrometry data.

Back to the genome
Recall that the genome contains protein-coding genes (1% of the genome).
Another 5% might encode RNA and regulatory sites.
How do we find regions of functional interest?

Population genetics basics

Scope of genetics lectures
Basic terminology
Key principles
– Sources of variation
– HW equilibrium
– Linkage
– Coalescent theory
– Recombination/Ancestral Recombination Graph
– Haplotypes/Haplotype phasing
– Population sub-structure
– Structural polymorphisms
– Medical genetics basis: Association mapping/pedigree analysis

Terminology: allele
Allele: A specific variant at a location
– The notion of alleles predates the concepts of gene and DNA.
– Initially, alleles referred to variants that described a measurable trait (round/wrinkled seed).
– Now, an allele might be a nucleotide on a chromosome, with no measurable phenotype.
– As we discuss sources of variation, we will encounter different kinds of alleles.

Terminology
Locus: The location of the allele
– A nucleotide position
– A genetic marker
– A gene
– A chromosomal segment

Terminology
Genotype: genetic makeup of (part of) an individual
Phenotype: A measurable trait in an organism, often the consequence of a genetic variation
Humans are diploid: they have 2 copies of each chromosome.
– They may be heterozygous or homozygous at a location.
– Other organisms (plants) have higher forms of ploidy.
– Additionally, some sites might have 2 allelic forms, or even many allelic forms.
Haplotype: genetic makeup of (part of) a single chromosome

What causes variation in a population?
Mutations (may lead to SNPs)
Recombinations
Other crossover events (gene conversion)
Structural polymorphisms

Single Nucleotide Polymorphisms
Small mutations that are sustained in a population are called SNPs.
SNPs are the most common source of variation studied.
The data is a matrix (rows are individuals, columns are loci). Only the variant positions are kept.
(Figure: an A->G substitution.)

Single Nucleotide Polymorphisms
Infinite Sites Assumption: Each site mutates at most once

Short Tandem Repeats
GCTAGATCATCATCATCATTGCTAG
GCTAGATCATCATCATTGCTAGTTA
GCTAGATCATCATCATCATCATTGC
GCTAGATCATCATCATTGCTAGTTA
GCTAGATCATCATCATCATCATTGC

STR can be used as a DNA fingerprint
Consider a collection of regions with variable length repeats.
Variable length repeats will lead to variable length DNA.
The vector of lengths is a fingerprint.
(Figure: matrix of repeat lengths; axes: loci, individuals.)
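As a rough illustration of the "vector of lengths" idea, the sketch below counts consecutive copies of a repeat unit in the five sequences from the Short Tandem Repeats slide. The choice of `ATC` as the unit and the helper name `longest_repeat_count` are assumptions made for this example. The slide's fingerprint is a vector across several loci for one individual, whereas this example shows a single locus across five sequences.

```python
import re

def longest_repeat_count(seq, unit="ATC"):
    """Largest number of back-to-back copies of `unit` found in `seq`."""
    # Each match is a maximal run of the unit, e.g. "ATCATCATC".
    runs = re.findall(f"(?:{re.escape(unit)})+", seq)
    return max((len(run) // len(unit) for run in runs), default=0)

# The five sequences shown on the Short Tandem Repeats slide.
sequences = [
    "GCTAGATCATCATCATCATTGCTAG",
    "GCTAGATCATCATCATTGCTAGTTA",
    "GCTAGATCATCATCATCATCATTGC",
    "GCTAGATCATCATCATTGCTAGTTA",
    "GCTAGATCATCATCATCATCATTGC",
]

# One repeat count per sequence at this (single) locus.
repeat_counts = [longest_repeat_count(s) for s in sequences]
print(repeat_counts)
```

Repeating the same count at several independent loci would give one vector per individual, which is what the slide calls a fingerprint.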

Structural polymorphisms
Large scale structural changes (deletions/insertions/inversions) may occur in a population.
Copy number variation
Certain diseases (cancers) are marked by an abundance of these events.

Personalized genome sequencing
These variants (of which 1,288,319 were novel) included 3,213,401 single nucleotide polymorphisms (SNPs), 53,823 block substitutions (2–206 bp), 292,102 heterozygous insertion/deletion events (indels) (1–571 bp), 559,473 homozygous indels (1–82,711 bp), 90 inversions, as well as numerous segmental duplications and copy number variation regions. Non-SNP DNA variation accounts for 22% of all events identified in the donor, however they involve 74% of all variant bases. This suggests an important role for non-SNP genetic alterations in defining the diploid genome structure. Moreover, 44% of genes were heterozygous for one or more variants.
PLoS Biology, 2007

Recombination
Not all DNA recombines!

Human DNA
Not all DNA recombines. mtDNA is inherited from the mother, and the Y chromosome from the father.

Gene Conversion
Gene conversion versus single crossover
– Hard to distinguish in a population

Topic 1: Basic Principles
In a ‘stable’ population, the distribution of alleles obeys certain laws.
– Not really, and the deviations are interesting
HW Equilibrium
– Due to mixing in a population
Linkage (dis)-equilibrium
– Due to recombination

Hardy Weinberg equilibrium
Consider a locus with 2 alleles, A and a.
p (respectively, q) is the frequency of A (resp. a) in the population.
3 genotypes: AA, Aa, aa
Q: What is the frequency of each genotype?
If various assumptions are satisfied (such as random mating, no natural selection), then
$P_{AA} = p^2$, $P_{Aa} = 2pq$, $P_{aa} = q^2$

Hardy Weinberg: why?
Assumptions:
– Diploid
– Sexual reproduction
– Random mating
– Bi-allelic sites
– Large population size, …
Why? Each individual randomly picks its two chromosomes. Therefore, Pr(Aa) = pq + qp = 2pq, and so on.
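The "each individual randomly picks two chromosomes" argument can be checked numerically. The sketch below is a minimal simulation under exactly that assumption: it draws two alleles independently for each individual and compares the observed genotype frequencies with $p^2$, $2pq$, $q^2$. The function names, the value p = 0.3, and the sample size are illustrative choices, not part of the lecture.

```python
import random
from collections import Counter

def expected_genotype_frequencies(p):
    """Hardy-Weinberg proportions for allele A at frequency p."""
    q = 1.0 - p
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}

def simulate_genotype_frequencies(p, n_individuals, seed=0):
    """Each individual draws two alleles independently (random mating)."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_individuals):
        # Number of A alleles among the two chromosomes: 0, 1, or 2.
        n_copies_of_A = sum(rng.random() < p for _ in range(2))
        counts[("aa", "Aa", "AA")[n_copies_of_A]] += 1
    return {g: counts[g] / n_individuals for g in ("AA", "Aa", "aa")}

expected = expected_genotype_frequencies(0.3)
observed = simulate_genotype_frequencies(0.3, 100_000)
for genotype in ("AA", "Aa", "aa"):
    print(genotype, round(expected[genotype], 4), round(observed[genotype], 4))
```

With a large sample the observed and expected columns should agree closely; a small sample shows visible sampling noise, which is one reason the large-population assumption appears in the list.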

Hardy Weinberg: Generalizations
Multiple alleles with frequencies
– By HW,
Multiple loci?
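For reference, the standard multi-allele form of the Hardy-Weinberg proportions can be written as follows. The notation (k alleles $A_1, \dots, A_k$ with frequencies $p_1, \dots, p_k$) is assumed here, since the slide's own formula did not survive in the transcript.

```latex
\Pr[A_i A_i] = p_i^2, \qquad
\Pr[A_i A_j] = 2\,p_i\,p_j \quad (i < j), \qquad
\sum_{i} p_i^2 + \sum_{i<j} 2\,p_i\,p_j = \Bigl(\sum_{i} p_i\Bigr)^2 = 1 .
```

The bi-allelic case is recovered with $k = 2$, $p_1 = p$, $p_2 = q$.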

Hardy Weinberg: Implications
The allele frequency does not change from generation to generation. Why?
It is observed that 1 in 10,000 Caucasians have the disease phenylketonuria. The disease mutation(s) are all recessive. What fraction of the population carries the mutation?
Males are 100 times more likely to have the ‘red’ type of color blindness than females. Why?
Conclusion: While the HW assumptions are rarely satisfied, the principle is still important as a baseline assumption, and significant deviations are interesting.
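One possible way to work through the three questions on this slide is sketched below. It assumes Hardy-Weinberg proportions hold, that "carries the mutation" means heterozygous carriers, and that the ‘red’ type of color blindness is X-linked recessive (so males, with one X chromosome, are affected with probability q and females with probability q squared). Variable and function names are illustrative.

```python
import math

def next_generation_allele_frequency(p):
    """Frequency of A among gametes produced by a HW population.

    AA individuals pass on A always, Aa individuals half the time:
    p' = p^2 + (2pq)/2 = p(p + q) = p.
    """
    q = 1.0 - p
    return p * p + (2 * p * q) / 2

# Phenylketonuria: affected individuals are aa, so q^2 = 1/10,000.
affected_fraction = 1 / 10_000
q_pku = math.sqrt(affected_fraction)          # about 0.01
carrier_fraction = 2 * (1 - q_pku) * q_pku    # heterozygotes, about 0.0198

# Color blindness (assumed X-linked recessive):
# males affected with probability q, females with probability q^2,
# so the male-to-female ratio is q / q^2 = 1 / q.
male_to_female_ratio = 100
q_colorblind = 1 / male_to_female_ratio       # 0.01

print(next_generation_allele_frequency(0.3))
print(q_pku, carrier_fraction, q_colorblind)
```

Under these assumptions roughly 2% of the population carries a phenylketonuria mutation even though only 0.01% are affected.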

Recombination

What if there were no recombinations?
Life would be simpler.
Each individual sequence would have a single parent (even for higher ploidy).
The genealogical relationship is expressed as a tree.
This principle is used to track the ancestry of an individual.

The Infinite Sites Assumption
The different sites are linked. A 1 in position 8 implies 0 in position 5, and vice versa.
Some phenotypes could be linked to the polymorphisms.
Some of the linkage is “destroyed” by recombination.

Infinite sites assumption and Perfect Phylogeny
Each site is mutated at most once in the history.
All descendants must carry the mutated value, and all others must carry the ancestral value.
(Figure: a tree with the mutation at site i on one edge; sequences below that edge have 1 in position i, all others have 0 in position i.)

Perfect Phylogeny
Assume an evolutionary model in which no recombination takes place, only mutation.
The evolutionary history is explained by a tree in which every mutation is on an edge of the tree. All the species in one sub-tree contain a 0, and all species in the other contain a 1. Such a tree is called a perfect phylogeny.
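A common way to check whether a binary SNP matrix is consistent with a perfect phylogeny is the four-gamete test, which is not stated on the slide and is brought in here as a standard result: a tree with each site mutating once exists if and only if no pair of columns shows all four combinations 00, 01, 10, 11. The sketch below assumes rows are chromosomes and columns are sites; the function name is hypothetical.

```python
from itertools import combinations

def passes_four_gamete_test(matrix):
    """True if no pair of columns contains all four gametes 00, 01, 10, 11."""
    n_sites = len(matrix[0]) if matrix else 0
    for i, j in combinations(range(n_sites), 2):
        # Distinct (allele at site i, allele at site j) pairs seen in the rows.
        gametes = {(row[i], row[j]) for row in matrix}
        if len(gametes) == 4:
            return False
    return True

# Three gametes at the two sites: explainable by a tree without recombination.
tree_like = [[0, 0], [0, 1], [1, 0]]
# Adding the fourth gamete 11 requires a recurrent mutation or a recombination.
not_tree_like = tree_like + [[1, 1]]

print(passes_four_gamete_test(tree_like))
print(passes_four_gamete_test(not_tree_like))
```

The pairwise scan costs time proportional to the number of rows times the square of the number of sites, which is fine for a sketch; faster constructions exist but are beyond this example.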

Handling recombination
A tree is not sufficient, as a sequence may have 2 parents.
Recombination leads to loss of correlation between columns.

Quiz 1
Allele, locus, genotype, haplotype
Hardy Weinberg equilibrium?
Today: Linkage (dis)-equilibrium

Quiz 2
Recall that a SNP data-set is a ‘binary’ matrix.
– Rows are individuals (chromosomes)
– Columns are alleles at a specific locus
Suppose you have 2 SNP datasets of a contiguous genomic region.
– One from an African population, and one from a European population.
– Can you tell which is which?
– How long does the genomic region have to be?

Recombination, and populations
Think of a population of N individual chromosomes.
The population remains stable from generation to generation.
Without recombination, each individual has exactly one parent chromosome from the previous generation.
With recombination, each individual is derived from one or two parents.
We will formalize this notion later in the context of coalescent theory.

Linkage (Dis)-equilibrium (LD)
Consider sites A & B.
Case 1: No recombination
Each new individual chromosome chooses a parent from the existing haplotypes.
(Figure: haplotypes over sites A and B.)

Linkage (Dis)-equilibrium (LD)
Consider sites A & B.
Case 2: Diploidy and recombination
Each new individual chooses a parent from the existing alleles.
(Figure: haplotypes over sites A and B.)

Linkage (Dis)-equilibrium (LD)
Consider sites A & B.
Case 1: No recombination
Each new individual chooses a parent from the existing haplotypes.
– Pr[(A,B)=(0,1)] = 0.25
Linkage disequilibrium
Case 2: Extensive recombination
Each new individual simply chooses an allele independently at each site.
– Pr[(A,B)=(0,1)] = 0.125
Linkage equilibrium
(Figure: haplotypes over sites A and B.)

LD
In the absence of recombination,
– Correlation between columns
– The joint probability Pr[A=a, B=b] is different from Pr(a)Pr(b)
With extensive recombination
– Pr(a,b) = Pr(a)Pr(b)

Measures of LD
Consider two bi-allelic sites with alleles marked 0 and 1.
Define
– $P_{00}$ = Pr[Allele 0 in locus 1, and 0 in locus 2]
– $P_{0*}$ = Pr[Allele 0 in locus 1]
Linkage equilibrium if $P_{00} = P_{0*}P_{*0}$
$D = |P_{00} - P_{0*}P_{*0}| = |P_{01} - P_{0*}P_{*1}| = \dots$
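The definition of D translates directly into a computation on a haplotype matrix. The sketch below assumes the data is a list of 0/1 rows (one per chromosome) and estimates the probabilities by counting; the function name and the two toy matrices are made up for illustration.

```python
def ld_coefficient(haplotypes, i, j):
    """D = |P00 - P0* P*0| for sites i and j of a 0/1 haplotype matrix."""
    n = len(haplotypes)
    p00 = sum(1 for h in haplotypes if h[i] == 0 and h[j] == 0) / n
    p0_any = sum(1 for h in haplotypes if h[i] == 0) / n   # P0*
    p_any0 = sum(1 for h in haplotypes if h[j] == 0) / n   # P*0
    return abs(p00 - p0_any * p_any0)

# The two sites always agree: strong disequilibrium.
linked = [[0, 0], [0, 0], [1, 1], [1, 1]]
# All four haplotypes equally common: equilibrium.
unlinked = [[0, 0], [0, 1], [1, 0], [1, 1]]

print(ld_coefficient(linked, 0, 1))
print(ld_coefficient(unlinked, 0, 1))
```

In the first matrix knowing the allele at one site determines the other; in the second it tells you nothing, so D is zero.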

LD over time
With random mating, and a fixed recombination rate r between the sites, linkage disequilibrium will disappear.
– Let $D^{(t)}$ = LD at time t
– $P^{(t)}_{00} = (1-r)\,P^{(t-1)}_{00} + r\,P^{(t-1)}_{0*}P^{(t-1)}_{*0}$
– $D^{(t)} = P^{(t)}_{00} - P^{(t)}_{0*}P^{(t)}_{*0} = P^{(t)}_{00} - P^{(t-1)}_{0*}P^{(t-1)}_{*0}$ (Why?)
– $D^{(t)} = (1-r)\,D^{(t-1)} = (1-r)^t D^{(0)}$
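One way to fill in the step marked "(Why?)" is to note that, under random mating, the allele frequencies at each individual site do not change between generations (the Hardy-Weinberg implication), so the marginals at time t equal those at time t-1. Substituting the recurrence for $P^{(t)}_{00}$ then gives the geometric decay. The derivation below uses the slide's symbols and treats D as a signed quantity.

```latex
\begin{aligned}
P^{(t)}_{0*} &= P^{(t-1)}_{0*}, \qquad P^{(t)}_{*0} = P^{(t-1)}_{*0} \\
D^{(t)} &= P^{(t)}_{00} - P^{(t-1)}_{0*}P^{(t-1)}_{*0} \\
        &= (1-r)\,P^{(t-1)}_{00} + r\,P^{(t-1)}_{0*}P^{(t-1)}_{*0} - P^{(t-1)}_{0*}P^{(t-1)}_{*0} \\
        &= (1-r)\bigl(P^{(t-1)}_{00} - P^{(t-1)}_{0*}P^{(t-1)}_{*0}\bigr) \\
        &= (1-r)\,D^{(t-1)} = (1-r)^{t}\,D^{(0)} .
\end{aligned}
```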

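Other measures of LD
D' is obtained by dividing D by the largest value it can take given the allele frequencies.
– If $P_{11} - P_{1*}P_{*1} > 0$: $D_{\max} = \min\{P_{1*}P_{*0},\ P_{0*}P_{*1}\}$; otherwise $D_{\max} = \min\{P_{1*}P_{*1},\ P_{0*}P_{*0}\}$
– Ex: $D' = |P_{11} - P_{1*}P_{*1}| / D_{\max}$
$\rho = D / (P_{1*}P_{0*}P_{*1}P_{*0})^{1/2}$
Let N be the number of individuals.
Show that $\rho^2 N$ is the $\chi^2$ statistic between the two sites.
(Table: 2x2 contingency table of the two sites; cell (i, j) holds the count $P_{ij}N$, and the margins hold $P_{i*}N$ and $P_{*j}N$.)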
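The exercise on this slide can be checked numerically. The sketch below computes D', $\rho^2$, and the Pearson $\chi^2$ statistic of the 2x2 table, using the standard sign-dependent bound for $D_{\max}$ (the largest |D| attainable for the given allele frequencies). It assumes both sites are polymorphic, so no marginal is 0 or 1; the function names and the example frequencies are hypothetical.

```python
def ld_measures(p11, p1_any, p_any1):
    """Return (D', rho^2) from P11 and the marginals P1*, P*1."""
    p0_any, p_any0 = 1.0 - p1_any, 1.0 - p_any1
    d = p11 - p1_any * p_any1          # signed D
    if d >= 0:
        d_max = min(p1_any * p_any0, p0_any * p_any1)
    else:
        d_max = min(p1_any * p_any1, p0_any * p_any0)
    d_prime = abs(d) / d_max
    rho_sq = d * d / (p1_any * p0_any * p_any1 * p_any0)
    return d_prime, rho_sq

def chi_square(p11, p1_any, p_any1, n):
    """Pearson chi-square of the 2x2 table with observed counts P_ij * n."""
    p0_any, p_any0 = 1.0 - p1_any, 1.0 - p_any1
    observed = {
        (1, 1): p11,
        (1, 0): p1_any - p11,
        (0, 1): p_any1 - p11,
        (0, 0): 1.0 - p1_any - p_any1 + p11,
    }
    row = {1: p1_any, 0: p0_any}
    col = {1: p_any1, 0: p_any0}
    total = 0.0
    for (i, j), p_ij in observed.items():
        expected = row[i] * col[j] * n   # count expected under independence
        total += (p_ij * n - expected) ** 2 / expected
    return total

d_prime, rho_sq = ld_measures(p11=0.4, p1_any=0.5, p_any1=0.6)
chi2 = chi_square(p11=0.4, p1_any=0.5, p_any1=0.6, n=100)
print(d_prime, rho_sq * 100, chi2)
```

The last two printed values should agree up to floating-point error, since every cell of the table deviates from its expected count by exactly D times N in absolute value.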

LD over distance
Assumption
– Recombination rate increases linearly with distance
– LD decays exponentially with distance
The assumption is reasonable, but recombination rates vary from region to region, adding to complexity.
This simple fact is the basis of disease association mapping.
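The two bullets can be connected by a short calculation. Assume, for illustration only, that the recombination rate between two sites at distance d is $r = c\,d$ for some small constant c per base pair (c is notation introduced here, not a value from the lecture). Combining this with the decay of LD over t generations gives an approximately exponential dependence on distance:

```latex
D^{(t)}(d) = (1 - c\,d)^{t}\,D^{(0)} \approx e^{-c\,d\,t}\,D^{(0)}
\qquad \text{for } c\,d \ll 1 .
```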

LD and disease mapping
Consider a mutation that is causal for a disease.
The goal of disease gene mapping is to discover which gene (locus) carries the mutation.
Consider every polymorphism, and check:
– There might be too many polymorphisms
– Multiple mutations (even at a single locus) that lead to the same disease
Instead, consider a dense sample of polymorphisms that span the genome.

LD can be used to map disease genes
LD decays with distance from the disease allele.
By plotting LD, one can short-list the region containing the disease gene.
(Figure: samples labeled D and N, with LD plotted along the region.)

269 individuals
– 90 Yorubans
– 90 Europeans (CEPH)
– 44 Japanese
– 45 Chinese
~1M SNPs

LD and disease gene mapping problems
Marker density?
Complex diseases
Population sub-structure

Topic 2: Simulating population data
We described various population genetic concepts (HW, LD), and their applicability.
The values of these parameters depend critically upon the population assumptions.
– What if we do not have infinite populations?
– No random mating (Ex: geographic isolation)
– Sudden growth
– Bottlenecks
– Ad-mixture
It would be nice to have a simulation of such a population to test various ideas.
How would you do this simulation?
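One possible starting point, offered as a sketch rather than the method the lecture goes on to develop, is a forward simulation in the Wright-Fisher style: a fixed number of chromosomes, no recombination, and each chromosome in the new generation copying the allele of a uniformly chosen parent. The function name, parameters, and the example values are assumptions made for illustration.

```python
import random

def wright_fisher(n_chromosomes, p0, generations, seed=None):
    """Trajectory of an allele's frequency in a finite population.

    Assumes constant population size, random mating, no mutation,
    no selection, and a single bi-allelic site.
    """
    rng = random.Random(seed)
    count = round(p0 * n_chromosomes)
    trajectory = [count / n_chromosomes]
    for _ in range(generations):
        p = count / n_chromosomes
        # Each child picks a random parent, so it carries the allele
        # with probability equal to the current frequency p.
        count = sum(1 for _ in range(n_chromosomes) if rng.random() < p)
        trajectory.append(count / n_chromosomes)
    return trajectory

# Small populations drift noticeably; large ones stay close to p0.
small = wright_fisher(n_chromosomes=20, p0=0.5, generations=50, seed=1)
large = wright_fisher(n_chromosomes=2000, p0=0.5, generations=50, seed=1)
print(small[-1], large[-1])
```

The scenarios listed on the slide could be explored by varying the model, for example letting `n_chromosomes` change per generation to mimic growth or a bottleneck, or simulating two populations separately before mixing them.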