Vineet Bafna, CSE280A


We will cover topics from population genetics. The focus will be on the use of algorithms for analyzing genetic data.
– Some background in algorithms (mathematical maturity) is helpful.
– Relevant biology will be discussed on a 'need to know' basis (we'll cover some basics today).

April’08Bafna

It took a long time (10-15 years) to produce the draft sequence of the human genome. Soon (within years), entire populations can have their DNA sequenced. Why do we care?
Individual genomes vary by about 1 in 1000 bp. These small variations account for significant phenotypic differences:
– Disease susceptibility.
– Response to drugs.
How can we understand genetic variation in a population, and its consequences?

Individuals in a species (population) are phenotypically different. Often these differences are inherited (genetic). Understanding the genetic basis of these differences is a key challenge of biology! The analysis of these differences involves many interesting algorithmic questions. We will use these questions to illustrate algorithmic principles, and use algorithms to interpret genetic data.

All reading is available online.
We may hold class on Martin Luther King Day and Presidents' Day to make up for some lost classes.
Grading: 4-5 assignments (50%), 1 final exam (20%), and 1 research project (30%).
Project goal:
– Address research problems with minimum preparation.
Lectures will be given by instructors and by students.
Most communication is electronic.
– Check the class website.

Basic terminology
Key principles
– Sources of variation
– HW equilibrium
– Linkage
– Coalescent theory
– Recombination/Ancestral Recombination Graph
– Haplotypes/Haplotype phasing
– Population sub-structure
– Demographics
– Structural polymorphisms
– Medical genetics basis: Association mapping/pedigree analysis


Live S-type bacteria could be isolated from the mix (heat-killed S-type and live R-type bacteria) and used to infect other cells. Until then, it was assumed that bacterial types stayed fixed from generation to generation. Griffith's experiment suggested that bacteria could be transformed (transformation).

Avery et al. did the following:
– Lysed dead S-cells to isolate components: sugars, proteins, DNA, and RNA.
– Used enzymes to remove sugars, proteins, RNA, and DNA, one component at a time.
– Found that DNA was the only ingredient necessary for transformation.

Proteins are the machines in the molecular factory of the cell.
– Enzymatic reactions
– Cell growth, division
– Switching the production of other proteins on or off
– Carrying a signal from one location to another
Ditto for RNA. RNA also helps in the production of protein.
DNA is the inherited material, and must contain instructions for manufacturing all the proteins, as well as the location of the regulatory switches.
– Variation in DNA -> possible change in instructions -> possible change in molecular machinery

Levene found that all DNA consists of repeated units of phosphate linked to deoxyribose sugar, linked to a nitrogenous base. He assumed that DNA was built from a repeating unit containing all four bases, but a fixed repeating pattern like that could not carry genetic information. His hypothesis was refuted by Chargaff, who found that the different bases appear in different proportions, but that the amount of A matches the amount of T, and the amount of C matches the amount of G.
Image credit: "DNA-Nucleobases" by Sponk

Two antiparallel strands, each forming a helix.
Complementary nucleotides form hydrogen bonds, satisfying Chargaff's rules.
One strand contains all of the information; the other is complementary.

We can think of a strand as a long string of nucleotide bases (A, C, G, T). Specific substrings encode information about how to make proteins. The two complementary strands allow DNA to replicate and cells to make copies of themselves.
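The string view of DNA can be made concrete with a minimal sketch. The helper name `reverse_complement` and the example strand are illustrative assumptions, not part of the lecture. Because the two strands are antiparallel, the second strand is obtained by complementing each base and reading in the reverse direction.

```python
# Illustrative sketch: names and the example strand are assumptions.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(strand: str) -> str:
    """Return the complementary strand, read in its own 5'->3' direction."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))

strand = "GCTAGATCAT"
other = reverse_complement(strand)
print(other)  # ATGATCTAGC
```

Applying the function twice returns the original strand, which is one way to see that either strand alone carries all of the information.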

DNA is inherited by the child from its parents. The copy is not always identical; it may carry mutations. If the mutation lies in a gene, …
– Different proteins are produced
– Different proteins are switched on or off
– Different phenotype.

Darwin suggested the following:
– Organisms compete for finite resources.
– Organisms with favorable mutations are more likely to reproduce, leading to fixation of favorable mutations (natural selection).
– Over time, the accumulation of many changes suitable to an environment leads to speciation.
Kimura (1960s) observed:
– Most mutations are selectively neutral!
– They drift in the population, eventually getting eliminated, or fixed by random chance.
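Drift of a neutral mutation can be illustrated with a small simulation. This is a sketch under stated assumptions, not the lecture's own model: a haploid population of constant size in which every chromosome in the next generation copies a uniformly random parent. The function names, population size, and replicate count are all hypothetical choices.

```python
import random

def drift_until_absorbed(pop_size, start_count, rng):
    """Resample a neutral allele each generation until it is lost (0) or fixed (pop_size)."""
    count, generations = start_count, 0
    while 0 < count < pop_size:
        freq = count / pop_size
        # Each child carries the allele with probability equal to its current frequency.
        count = sum(1 for _ in range(pop_size) if rng.random() < freq)
        generations += 1
    return count, generations

def fixation_fraction(pop_size, start_count, replicates, seed=0):
    """Fraction of independent replicates in which the allele ends up fixed."""
    rng = random.Random(seed)
    fixed = sum(
        1
        for _ in range(replicates)
        if drift_until_absorbed(pop_size, start_count, rng)[0] == pop_size
    )
    return fixed / replicates

estimate = fixation_fraction(pop_size=20, start_count=5, replicates=2000)
print(estimate)
```

With no selection, the allele is fixed with probability equal to its starting frequency, so the estimate should land near 5/20 = 0.25, up to sampling noise.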

We seek to separate the two types of mutations.
Identifying mutations under selection is important for identifying the genetic basis of phenotypes.
Neutral mutations help in reconstructing evolutionary scenarios and providing a baseline for parameter calculations.

Mutations accumulate over time.
In looking at the DNA of different species:
– DNA has had a lot of time to mutate, and is not expected to be identical.
– If the DNA is highly similar, that region is likely to be functionally important.
In comparing DNA from a population:
– There has not been enough time to mutate, so the DNA is expected to be nearly identical.
– The few differences that exist mediate phenotypic differences.

By sampling DNA from a population, we can ask the following:
– What are the sources of variation?
– As mutations arise, they are either neutral and subject to evolutionary drift, or they are (dis-)advantageous and under selective pressure. Can we tell which?
– If you had DNA from many sub-populations (Asian, European, African), can you separate them?
– Why are some people more likely to get a disease than others? How is disease gene mapping done?
– Phasing of chromosomes: how do we separate the maternal and paternal chromosomes?

Allele: a specific variant at a location.
– The notion of alleles predates the concepts of gene and DNA.
– Initially, alleles referred to variants that described a measurable trait (round/wrinkled seed).
– Now, an allele might be a nucleotide on a chromosome, with no measurable phenotype.
– As we discuss sources of variation, we will have different kinds of alleles.

Locus: the location of the allele.
– A nucleotide position
– A genetic marker
– A gene
– A chromosomal segment

Genotype: genetic makeup of (part of) an individual.
Phenotype: a measurable trait in an organism, often the consequence of a genetic variation.
Humans are diploid: they have 2 copies of each chromosome, and 2 alleles at each locus.
– They may be heterozygous or homozygous at a location.
– Other organisms (plants) have higher forms of ploidy.
– Additionally, some sites might have 2 allelic forms, or even many allelic forms.
Haplotype: genetic makeup of (part of) a single chromosome.
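The relationship between haplotypes and genotypes can be sketched in code. The encoding is an assumption made for illustration: each bi-allelic site is 0 or 1 on a chromosome, and an unphased genotype records the number of copies of allele 1 (0, 1, or 2) at each site. The function names are hypothetical. The second function enumerates every haplotype pair consistent with a genotype, which shows why phasing is ambiguous once there are two or more heterozygous sites.

```python
from itertools import product

def genotype(hap_a, hap_b):
    """Unphased genotype: per-site count of allele 1 across the two chromosomes."""
    return [a + b for a, b in zip(hap_a, hap_b)]

def consistent_haplotype_pairs(geno):
    """Enumerate unordered haplotype pairs that would produce this genotype."""
    het_sites = [i for i, g in enumerate(geno) if g == 1]
    pairs = set()
    for bits in product([0, 1], repeat=len(het_sites)):
        # Homozygous sites are forced: genotype 0 -> allele 0, genotype 2 -> allele 1.
        h1 = [g // 2 for g in geno]
        h2 = list(h1)
        # Heterozygous sites: one chromosome gets the bit, the other its opposite.
        for i, b in zip(het_sites, bits):
            h1[i], h2[i] = b, 1 - b
        pairs.add(frozenset([tuple(h1), tuple(h2)]))
    return pairs

maternal = [0, 1, 1, 0]
paternal = [1, 1, 0, 0]
geno = genotype(maternal, paternal)
print(geno)                                   # [1, 2, 1, 0]
print(len(consistent_haplotype_pairs(geno)))  # 2
```

A genotype with k heterozygous sites (k at least 1) is consistent with 2^(k-1) unordered haplotype pairs, so the ambiguity grows quickly with the number of heterozygous sites.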

Mutations (may lead to SNPs)
Recombinations
Other crossover events (gene conversion)
Structural polymorphisms

Small mutations that are sustained in a population are called SNPs (for example, an A->G substitution).
SNPs are the most common source of variation studied.
The data is a matrix (rows are individuals, columns are loci). Only the variant positions are kept.
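The SNP matrix described above can be built from aligned sequences with a short sketch. The toy sequences and the function name `snp_matrix` are assumptions for illustration. Coding alleles relative to the first row is an arbitrary choice made here (it does not identify the ancestral allele), and the sketch assumes every variant site has exactly two alleles.

```python
def snp_matrix(sequences):
    """Keep only columns where the aligned sequences differ.

    Returns the variant column indices and a 0/1 matrix
    (0 = same allele as the first row, 1 = the other allele).
    """
    length = len(sequences[0])
    variant_cols = [
        j for j in range(length) if len({s[j] for s in sequences}) > 1
    ]
    reference = sequences[0]
    matrix = [
        [0 if s[j] == reference[j] else 1 for j in variant_cols]
        for s in sequences
    ]
    return variant_cols, matrix

aligned = ["ACGTA", "ACGTA", "GCGTA", "GCGCA"]  # toy data, one row per chromosome
cols, matrix = snp_matrix(aligned)
print(cols)    # [0, 3]
print(matrix)  # [[0, 0], [0, 0], [1, 0], [1, 1]]
```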

Infinite sites assumption: each site mutates at most once.

GCTAGATCATCATCATCATTGCTAG
GCTAGATCATCATCATTGCTAGTTA
GCTAGATCATCATCATCATCATTGC
GCTAGATCATCATCATTGCTAGTTA
GCTAGATCATCATCATCATCATTGC

Consider a collection of regions with variable-length repeats.
Variable-length repeats will lead to variable-length DNA.
The locations are far enough apart not to be linked.
The vector of lengths is a fingerprint.
[Figure axes: loci, individuals]
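Repeat counting for a fingerprint can be sketched as follows. The five sequences are the ones shown in the lecture for a single locus. The repeat unit "CAT" and the function name are assumptions made for this illustration. A real fingerprint would collect one such count per individual at each of several unlinked loci.

```python
import re

def max_tandem_copies(seq, unit):
    """Length, in copies, of the longest uninterrupted run of `unit` in `seq`."""
    runs = re.findall(f"(?:{re.escape(unit)})+", seq)
    return max((len(run) // len(unit) for run in runs), default=0)

sequences = [
    "GCTAGATCATCATCATCATTGCTAG",
    "GCTAGATCATCATCATTGCTAGTTA",
    "GCTAGATCATCATCATCATCATTGC",
    "GCTAGATCATCATCATTGCTAGTTA",
    "GCTAGATCATCATCATCATCATTGC",
]
copies = [max_tandem_copies(s, "CAT") for s in sequences]
print(copies)  # [4, 3, 5, 3, 5]
```

Sequences with the same copy number at this locus cannot be told apart by it, which is why several unlinked loci are combined into a vector.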

Large-scale structural changes (deletions/insertions/inversions) may occur in a population.
Copy number variation.
Certain diseases (cancers) are marked by an abundance of these events.

These variants (of which 1,288,319 were novel) included 3,213,401 single nucleotide polymorphisms (SNPs), 53,823 block substitutions (2–206 bp), 292,102 heterozygous insertion/deletion events (indels) (1–571 bp), 559,473 homozygous indels (1–82,711 bp), 90 inversions, as well as numerous segmental duplications and copy number variation regions. Non-SNP DNA variation accounts for 22% of all events identified in the donor, however they involve 74% of all variant bases. This suggests an important role for non-SNP genetic alterations in defining the diploid genome structure. Moreover, 44% of genes were heterozygous for one or more variants.
Source: PLoS Biology, 2007

Not all DNA recombines!

Not all DNA recombines. mtDNA is inherited from the mother, and the Y chromosome from the father.

Gene conversion versus single crossover
– Hard to distinguish in a population

Allele
Locus
Recombination
Mutation/Single nucleotide polymorphism
STR (short tandem repeat)
– How is DNA fingerprinting done?
Infinite sites assumption


A possible strategy is to collect cases (affected) and control individuals, and look for a mutation that consistently separates the two classes. Next, identify the gene.

Problem 1: There are many unrelated common mutations, around one every 1000 bp.


Problem 2: We may not sample the causal mutation.

We are guided by two simple facts governing these mutations:
1. Nearby mutations are correlated.
2. Distal mutations are not.

Sample a population of individuals at variant locations across the genome. Typically, these variants are single nucleotide polymorphisms (SNPs).
Create a new bi-allelic variant corresponding to cases and controls, and test for correlations.
By our assumptions, only the proximal variants will be correlated.
Investigate genes near the correlated variants.
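The scan described above can be sketched with one common choice of test, the chi-square statistic on a 2x2 table of allele counts. The lecture does not commit to a particular test, so this choice, the toy data, and the function names are all assumptions. Each row of `snps` is one SNP column (0/1 alleles across sampled chromosomes), and `is_case` plays the role of the new bi-allelic case/control variant.

```python
def allele_table(snp_column, is_case):
    """2x2 counts: rows are (cases, controls), columns are (allele 0, allele 1)."""
    counts = [[0, 0], [0, 0]]
    for allele, case in zip(snp_column, is_case):
        counts[0 if case else 1][allele] += 1
    return counts

def chi_square_2x2(table):
    """Pearson chi-square statistic for a 2x2 table (no continuity correction)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # a row or column is empty, so there is nothing to test
    return n * (a * d - b * c) ** 2 / denom

is_case = [True, True, True, True, False, False, False, False]
snps = [
    [1, 1, 1, 0, 0, 0, 0, 1],  # partially correlated with case status
    [1, 1, 1, 1, 0, 0, 0, 0],  # perfectly correlated
    [0, 1, 0, 1, 0, 1, 0, 1],  # uncorrelated
]
stats = [chi_square_2x2(allele_table(col, is_case)) for col in snps]
print(stats)  # [2.0, 8.0, 0.0]
```

Loci with the largest statistics are the ones to follow up. A genome-wide scan would also need a correction for the number of loci tested, which this sketch leaves out.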

Consider a fixed-size population (of chromosomes) evolving in time. Each individual arises from a unique, randomly chosen parent from the previous generation.
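This idealized model can be simulated directly. The sketch below assumes a constant population size and uniform, independent parent choice. The function names and parameter values are hypothetical. It records each individual's parent and then traces the extant population backward, counting how many distinct ancestors remain in each earlier generation.

```python
import random

def simulate_parents(pop_size, generations, seed=0):
    """parents[t][i] is the index, in generation t, of the parent of individual i in generation t+1."""
    rng = random.Random(seed)
    return [
        [rng.randrange(pop_size) for _ in range(pop_size)]
        for _ in range(generations)
    ]

def ancestors_of_extant(parents):
    """Number of distinct ancestors of the extant population, one generation back first."""
    current = set(range(len(parents[-1])))
    counts = []
    for generation in reversed(parents):
        current = {generation[i] for i in current}
        counts.append(len(current))
    return counts

parents = simulate_parents(pop_size=10, generations=30)
counts = ancestors_of_extant(parents)
print(counts)
```

The counts can only stay the same or shrink going back in time, because lineages merge whenever two individuals share a parent. Once the count reaches 1, the whole extant population descends from a single ancestral chromosome.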

Current (extant) population

Infinite sites assumption: A mutation occurs at most once at a site.

The collection of mutations acquired in the extant population describes the SNPs.

We drop the ancestral chromosomes, and place the mutations on the internal branches.

A causal mutation creates a clade of affected descendants.

Note that the tree (genealogy) is hidden. However, the underlying tree topology introduces a correlation between each pair of SNPs


In our idealized model, we assume that each individual chromosome chooses two parental chromosomes from the previous generation

Proximal SNPs are correlated; distal SNPs are not. The correlation (linkage disequilibrium) decays rapidly after 20-50 kb.
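One standard way to quantify the correlation between two SNPs is the squared correlation coefficient r^2 of their allele indicators. The lecture does not name a specific measure, so this choice, the function name, and the toy columns are assumptions. Each column holds 0/1 alleles for the same set of sampled chromosomes.

```python
def r_squared(col_a, col_b):
    """Squared correlation between two bi-allelic SNP columns (0/1 per chromosome)."""
    n = len(col_a)
    p_a = sum(col_a) / n   # frequency of allele 1 at the first SNP
    p_b = sum(col_b) / n   # frequency of allele 1 at the second SNP
    p_ab = sum(1 for a, b in zip(col_a, col_b) if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b   # deviation from independence
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    return d * d / denom if denom > 0 else 0.0

snp_1 = [0, 0, 1, 1]
snp_2 = [0, 0, 1, 1]  # always co-occurs with snp_1
snp_3 = [0, 1, 0, 1]  # independent of snp_1

print(r_squared(snp_1, snp_2))  # 1.0
print(r_squared(snp_1, snp_3))  # 0.0
```

Computing r^2 for pairs of SNPs at increasing physical distance is one way to see the decay described above in real data.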

Test each polymorphic locus for correlation with case-control status.
The correlation is measured using one of many statistical tests.
A gene near a correlated locus is a candidate for mediating the case phenotype.
Many factors confound the analysis:
– Even the sources of variation are not well understood.
– Understanding the confounding factors requires a knowledge of population genetics.
– Getting around them requires a set of computational and statistical techniques.

Our focus will be on the computational algorithms used to probe genetic data.
Basic terminology
Key principles
– Sources of variation
– HW equilibrium
– Linkage (dis)-equilibrium
– Coalescent theory
– Recombination/Ancestral Recombination Graph
– Haplotypes/Haplotype phasing
– Population sub-structure
– Structural polymorphisms
– Medical genetics basis: Association mapping/pedigree analysis