1 Introduction to Bioinformatics 2 Introduction to Bioinformatics. LECTURE 5: Variation within and between species * Chapter 5: Are Neanderthals among.

Presentation transcript:

1 Introduction to Bioinformatics

2 Introduction to Bioinformatics. LECTURE 5: Variation within and between species * Chapter 5: Are Neanderthals among us?

3 Neandertal, Germany, 1856 Initial interpretations: * bear skull * pathological idiot * Old Dutchman...

4 Introduction to Bioinformatics LECTURE 5: INTER- AND INTRASPECIES VARIATION

7 5.1 Variation in DNA sequences * Even closely related individuals differ in genetic sequences * (point) mutations : copy error at certain location * Sexual reproduction – diploid genome

8 Introduction to Bioinformatics 5.1 VARIATION IN DNA SEQUENCES Diploid chromosomes

9 Introduction to Bioinformatics 5.1 VARIATION IN DNA SEQUENCES Mitosis: diploid reproduction

10 Introduction to Bioinformatics 5.1 VARIATION IN DNA SEQUENCES Meiosis: diploid (=double) → haploid (=single)

11 Introduction to Bioinformatics 5.1 VARIATION IN DNA SEQUENCES * Typing error rate of a very good typist: 1 error / 1K typed letters * All our diploid cells constantly reproduce 7 billion letters * Typical cell copying error rate is ~ 1 error / 1 Gbp
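The comparison on this slide comes down to one multiplication. The sketch below uses only the two rates and the 7 billion letters quoted on the slide; the function and variable names are illustrative.

```python
# Expected number of copy errors = letters copied * error rate per letter.
GENOME_LETTERS = 7e9        # letters copied per cell division (slide value)
CELL_ERROR_RATE = 1 / 1e9   # ~1 error per Gbp (slide value)
TYPIST_ERROR_RATE = 1 / 1e3 # 1 error per 1K typed letters (slide value)


def expected_errors(letters, rate_per_letter):
    """Expected error count when each letter fails independently."""
    return letters * rate_per_letter


cell_errors = expected_errors(GENOME_LETTERS, CELL_ERROR_RATE)
typist_errors = expected_errors(GENOME_LETTERS, TYPIST_ERROR_RATE)

print(f"cell:   about {cell_errors:.0f} errors per genome copy")
print(f"typist: about {typist_errors:.0f} errors per genome copy")
```

With these figures a cell makes a handful of errors per copy, whereas the typist would make millions.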

12 Introduction to Bioinformatics 5.1 VARIATION IN DNA SEQUENCES GERM LINE Reverse time and follow your cells: * Now you count ~ cells * One generation ago you had 2 cells ‘somewhere’ in your parents' bodies * A small number T of generations ago you had (2^T – multiple ancestors) cells * A large number T of generations ago you counted #(fertile ancestors) cells * Congratulations: you are 3.4 billion years old! Fast-forward time and follow your cells: * Only a few cells in your reproductive organs have a chance to live on in the next generations * The rest (including you) will die …

13 Introduction to Bioinformatics 5.1 VARIATION IN DNA SEQUENCES GERM LINE MUTATIONS This potentially immortal lineage of (germ) cells is called the GERM LINE All mutations that we have accumulated are en route on the germ line

14 Introduction to Bioinformatics 5.1 VARIATION IN DNA SEQUENCES * Polymorphism: multiple possibilities for a nucleotide; each variant is an allele * Single Nucleotide Polymorphism – SNP (“snip”): a point mutation, example: AAATAAA vs AAACAAA * Humans: SNP = 1/1500 bases = 0.067% * STR = Short Tandem Repeats (microsatellites), example: CACACACACACACACACA … * Transition - transversion

15 Introduction to Bioinformatics 5.1 VARIATION IN DNA SEQUENCES Purines – Pyrimidines

16 Introduction to Bioinformatics 5.1 VARIATION IN DNA SEQUENCES Transitions – Transversions
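A transition swaps a purine for a purine (A↔G) or a pyrimidine for a pyrimidine (C↔T); every other substitution is a transversion. A minimal sketch of that rule follows. The function name `classify` is illustrative, not from the lecture.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify(a, b):
    """Classify the substitution a -> b as 'none', 'transition' or 'transversion'."""
    if a == b:
        return "none"
    pair = {a, b}
    # Both bases in the same chemical class -> transition.
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"
    return "transversion"


# The SNP example from the slides: AAATAAA vs AAACAAA differs by T -> C.
snp_kind = classify("T", "C")
print(snp_kind)
```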

17 Introduction to Bioinformatics LECTURE 5: INTER- AND INTRASPECIES VARIATION 5.2 Mitochondrial DNA * Mitochondria are inherited only via the maternal line * Very suitable for comparing evolution: the sequence is not reshuffled by recombination

18 Introduction to Bioinformatics 5.2 MITOCHONDRIAL DNA H.sapiens mitochondrion

19 Introduction to Bioinformatics 5.2 MITOCHONDRIAL DNA EM photograph of H. sapiens mtDNA

20 Introduction to Bioinformatics 5.2 MITOCHONDRIAL DNA

21 Introduction to Bioinformatics LECTURE 5: INTER- AND INTRASPECIES VARIATION 5.3 Variation between species * Genetic variation accounts for morphological, physiological and behavioral variation * Genetic variation (i.e. genetic distance) relates to phylogenetic relationship * Necessity to measure distances between sequences: a metric

22 Introduction to Bioinformatics 5.3 VARIATION BETWEEN SPECIES Substitution rate * Mutations originate in single individuals * Mutations can become fixed in a population * Mutation rate: rate at which new mutations arise * Substitution rate: rate at which a species fixes new mutations * For neutral mutations

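23 Introduction to Bioinformatics 5.3 VARIATION BETWEEN SPECIES Substitution rate and mutation rate * For neutral mutations: in a diploid population of N individuals, $2N\mu$ new mutations arise per generation and each becomes fixed with probability $1/(2N)$, so $\rho = 2N\mu \cdot \frac{1}{2N} = \mu$ * From divergence: $\rho = K/(2T)$, where K is the genetic distance between two species that separated a time T ago (substitutions accumulate along both lineages)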
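The two relations on this slide can be checked numerically. The sketch below is only illustrative: the function names and the sample values for N, μ, K and T are assumptions, not data from the lecture.

```python
def neutral_substitution_rate(pop_size, mutation_rate):
    """rho = (new mutations per generation) * (fixation probability of each)."""
    new_mutations = 2 * pop_size * mutation_rate   # 2N gene copies, each mutating at rate mu
    fixation_probability = 1 / (2 * pop_size)      # a neutral allele starts at frequency 1/(2N)
    return new_mutations * fixation_probability


def rate_from_divergence(distance_k, divergence_time):
    """rho = K / (2T): K accumulates along two lineages, each of length T."""
    return distance_k / (2 * divergence_time)


# Hypothetical numbers: the population size cancels, so rho equals mu.
rho_small = neutral_substitution_rate(1_000, 1e-8)
rho_large = neutral_substitution_rate(1_000_000, 1e-8)
rho_div = rate_from_divergence(0.01, 5e6)
print(rho_small, rho_large, rho_div)
```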

24 Introduction to Bioinformatics LECTURE 5: INTER- AND INTRASPECIES VARIATION 5.4 Estimating genetic distance * Substitutions are independent (?) * Substitutions are random * Multiple substitutions may occur * Back-mutations mutate a nucleotide back to an earlier value

25 Introduction to Bioinformatics 5.4 ESTIMATING GENETIC DISTANCE Multiple substitutions and back-mutations conceal the real genetic distance
GACTGATCCACCTCTGATCCTTTGGAACTGATCGT
TTCTGATCCACCTCTGATCCTTTGGAACTGATCGT
TTCTGATCCACCTCTGATCCATCGGAACTGATCGT
GTCTGATCCACCTCTGATCCATTGGAACTGATCGT
observed : 2 (= d) actual : 4 (= K)
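The observed count d on this slide is the number of positions at which the first and last sequences differ. A minimal sketch, with an illustrative function name, that computes this count and the corresponding proportion:

```python
def observed_differences(seq_a, seq_b):
    """Return (count, proportion) of differing sites between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    count = sum(a != b for a, b in zip(seq_a, seq_b))
    return count, count / len(seq_a)


# First and last sequence from the slide.
first = "GACTGATCCACCTCTGATCCTTTGGAACTGATCGT"
last = "GTCTGATCCACCTCTGATCCATTGGAACTGATCGT"

count, proportion = observed_differences(first, last)
print(count, round(proportion, 4))
```

Only the end points are compared, so any site that changed and then changed back contributes nothing to the count.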

26 Introduction to Bioinformatics 5.4 ESTIMATING GENETIC DISTANCE * Saturation: on average one substitution per site * Two random sequences of equal length will match for approximately ¼ of their sites * In saturation therefore the observed proportion of differing sites approaches ¾

27 Introduction to Bioinformatics 5.4 ESTIMATING GENETIC DISTANCE * True genetic distance (proportion): K * Observed proportion of differences: d * Due to back-mutations K ≥ d

28 Introduction to Bioinformatics 5.4 ESTIMATING GENETIC DISTANCE SEQUENCE EVOLUTION is a Markov process: the sequence at generation (= time) t depends only on the sequence at generation t-1

29 Introduction to Bioinformatics 5.4 ESTIMATING GENETIC DISTANCE The Jukes-Cantor model Correction for multiple substitutions Substitution probability per site per second is α Substitution means there are 3 possible replacements (e.g. C → {A,G,T}) Non-substitution means there is 1 possibility (e.g. C → C)

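30 Introduction to Bioinformatics 5.4 THE JUKES-CANTOR MODEL Therefore, the one-step Markov process has the following transition matrix (rows and columns in the order A, C, G, T):
$$M_{JC} = \begin{pmatrix} 1-\alpha & \alpha/3 & \alpha/3 & \alpha/3 \\ \alpha/3 & 1-\alpha & \alpha/3 & \alpha/3 \\ \alpha/3 & \alpha/3 & 1-\alpha & \alpha/3 \\ \alpha/3 & \alpha/3 & \alpha/3 & 1-\alpha \end{pmatrix}$$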

31 Introduction to Bioinformatics 5.4 THE JUKES-CANTOR MODEL After t generations the substitution probability is: $M(t) = M_{JC}^t$ Eigenvalues and eigenvectors of $M_{JC}$: $\lambda_1 = 1$ (multiplicity 1): $v_1 = \tfrac14 (\;)^T$ $\lambda_{2..4} = 1 - 4\alpha/3$ (multiplicity 3): $v_2 = \tfrac14 (\;)^T$, $v_3 = \tfrac14 (\;)^T$, $v_4 = \tfrac14 (\;)^T$

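32 Introduction to Bioinformatics 5.4 THE JUKES-CANTOR MODEL Spectral decomposition of M(t): $M_{JC}^t = \sum_i \lambda_i^t v_i v_i^T$ Define M(t) as:
$$M_{JC}^t = \begin{pmatrix} r(t) & s(t) & s(t) & s(t) \\ s(t) & r(t) & s(t) & s(t) \\ s(t) & s(t) & r(t) & s(t) \\ s(t) & s(t) & s(t) & r(t) \end{pmatrix}$$
Therefore, the substitution probability s(t) per site after t generations (to one specific other nucleotide) is: $s(t) = \tfrac14 - \tfrac14 \left(1 - \tfrac{4\alpha}{3}\right)^t$, and $r(t) = 1 - 3s(t)$ because each row sums to 1.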
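The closed form for s(t) can be checked against a brute-force matrix power. The sketch below uses plain Python lists; the helper names and the values α = 0.01, t = 50 are illustrative choices, not values from the lecture.

```python
def jc_matrix(alpha):
    """One-step Jukes-Cantor matrix: 1 - alpha on the diagonal, alpha/3 elsewhere."""
    return [[1 - alpha if i == j else alpha / 3 for j in range(4)] for i in range(4)]


def matmul(a, b):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)] for i in range(4)]


def matrix_power(m, t):
    """m raised to the non-negative integer power t by repeated multiplication."""
    result = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    for _ in range(t):
        result = matmul(result, m)
    return result


def s_closed_form(alpha, t):
    """s(t) = 1/4 - 1/4 (1 - 4 alpha / 3)^t."""
    return 0.25 - 0.25 * (1 - 4 * alpha / 3) ** t


alpha, t = 0.01, 50   # hypothetical values
m_t = matrix_power(jc_matrix(alpha), t)
s_t = s_closed_form(alpha, t)
print("off-diagonal:", m_t[0][1], "closed form s(t):", s_t)
print("diagonal:    ", m_t[0][0], "1 - 3 s(t):      ", 1 - 3 * s_t)
```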

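33 Introduction to Bioinformatics 5.4 THE JUKES-CANTOR MODEL substitution probability s(t) per site after t generations, for each of the 3 possible replacements: $s(t) = \tfrac14 - \tfrac14 \left(1 - \tfrac{4\alpha}{3}\right)^t$ observed genetic distance d after t generations is the probability that a site differs at all, $d \approx 3s(t)$: $d = \tfrac34 - \tfrac34 \left(1 - \tfrac{4\alpha}{3}\right)^t$ For small α: $\left(1 - \tfrac{4\alpha}{3}\right)^t \approx e^{-4\alpha t/3}$

34 Introduction to Bioinformatics 5.4 THE JUKES-CANTOR MODEL For small α the observed genetic distance is: $d \approx \tfrac34 \left(1 - e^{-4\alpha t/3}\right)$ The actual genetic distance is (of course): $K = \alpha t$ So: $K \approx -\tfrac34 \ln\left(1 - \tfrac43 d\right)$ This is the Jukes-Cantor formula: independent of α and t.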

35 Introduction to Bioinformatics 5.4 THE JUKES-CANTOR MODEL The Jukes-Cantor formula: $K = -\tfrac34 \ln\left(1 - \tfrac43 d\right)$ For small d using ln(1+x) ≈ x : K ≈ d So: actual distance ≈ observed distance For saturation: d ↑ ¾ : K → ∞ So: if the observed distance corresponds to the distance between random sequences, then the actual distance becomes indeterminate
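Both limits on this slide are easy to see numerically. A minimal sketch of the Jukes-Cantor correction follows; the function name and the error handling at saturation are illustrative choices.

```python
import math


def jukes_cantor(d):
    """Jukes-Cantor distance K = -3/4 ln(1 - 4d/3) for an observed proportion d."""
    if not 0 <= d < 0.75:
        # At d >= 3/4 the logarithm is undefined: the sequences look random.
        raise ValueError("d must satisfy 0 <= d < 0.75")
    return -0.75 * math.log(1 - 4 * d / 3)


for d in (0.01, 0.1, 0.3, 0.5, 0.7, 0.74):
    print(f"d = {d:.2f}  ->  K = {jukes_cantor(d):.4f}")
```

For small d the corrected value stays close to d, and it grows without bound as d approaches ¾.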

36 Jukes-Cantor

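37 Introduction to Bioinformatics 5.4 THE JUKES-CANTOR MODEL Variance in K If: K = f(d) then: $\delta K \approx f'(d)\,\delta d$ So: $\mathrm{Var}(K) \approx \left(\frac{dK}{dd}\right)^2 \mathrm{Var}(d)$ Generation of a sequence of length n with substitution rate d is a binomial process: $P(k) = \binom{n}{k} d^k (1-d)^{n-k}$ for k differing sites, and therefore with variance: Var(d) = d(1-d)/n Because of the Jukes-Cantor formula: $\frac{dK}{dd} = \frac{1}{1 - \frac43 d}$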

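38 Introduction to Bioinformatics 5.4 THE JUKES-CANTOR MODEL Variance in K Variance: Var(d) = d(1-d)/n Jukes-Cantor: $K = -\tfrac34 \ln\left(1 - \tfrac43 d\right)$, hence $\frac{dK}{dd} = \frac{1}{1 - \frac43 d}$ So: $\mathrm{Var}(K) \approx \frac{d(1-d)}{n\left(1 - \frac43 d\right)^2}$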
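Combining the binomial variance of d with the derivative of the Jukes-Cantor formula gives a first-order (delta-method) variance for K. The sketch below evaluates it; the function names are illustrative, and n = 1000 is borrowed from the example on a later slide.

```python
def variance_d(d, n):
    """Binomial variance of the observed proportion d over n sites."""
    return d * (1 - d) / n


def variance_k(d, n):
    """First-order variance of the Jukes-Cantor distance: Var(d) * (dK/dd)^2."""
    slope = 1 / (1 - 4 * d / 3)   # derivative of -3/4 ln(1 - 4d/3)
    return variance_d(d, n) * slope ** 2


n = 1000
for d in (0.05, 0.2, 0.4, 0.6, 0.7):
    print(f"d = {d:.2f}  Var(d) = {variance_d(d, n):.2e}  Var(K) = {variance_k(d, n):.2e}")
```

The variance of K rises sharply as d approaches ¾, which is why distance estimates near saturation are unreliable.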

39 Var(K)

40 Introduction to Bioinformatics 5.4 THE JUKES-CANTOR MODEL EXAMPLE 5.4 on page 90 * Create artificial data with n = 1000: generate K* mutations * Count d * With Jukes-Cantor relation reconstruct estimate K(d) * Plot K(d) – K*
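The four steps of this example can be sketched as below. This is not the textbook's code: the mutation procedure (a uniformly chosen site changed to a uniformly chosen different nucleotide), the random seed and the chosen values of K* are all assumptions made for illustration.

```python
import math
import random


def simulate_observed_d(n, k_star, rng):
    """Apply k_star random substitutions to a random sequence; return the observed proportion d."""
    original = [rng.choice("ACGT") for _ in range(n)]
    mutated = list(original)
    for _ in range(k_star):
        pos = rng.randrange(n)
        # A substitution always changes the nucleotide, possibly back to the original one.
        mutated[pos] = rng.choice([b for b in "ACGT" if b != mutated[pos]])
    return sum(a != b for a, b in zip(original, mutated)) / n


def jukes_cantor(d):
    """Jukes-Cantor corrected distance; infinite at or beyond saturation."""
    return -0.75 * math.log(1 - 4 * d / 3) if d < 0.75 else math.inf


rng = random.Random(0)   # fixed seed so the run is repeatable
n = 1000
results = []
for k_star in (50, 200, 500, 1000):
    d = simulate_observed_d(n, k_star, rng)
    k_hat = n * jukes_cantor(d)   # estimated number of substitutions
    results.append((k_star, d, k_hat))
    print(f"K* = {k_star:4d}  observed = {d * n:6.0f}  K(d) = {k_hat:7.1f}  K(d) - K* = {k_hat - k_star:+7.1f}")
```

The observed count can never exceed K*, and the corrected estimate is never below the observed count; how close K(d) comes to K* varies from run to run.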

41 Introduction to Bioinformatics 5.4 EXAMPLE 5.4 on page 90

42 Introduction to Bioinformatics 5.4 EXAMPLE 5.4 on page 90

43 Introduction to Bioinformatics 5.4 EXAMPLE 5.4 on page 90

44 Introduction to Bioinformatics 5.4 EXAMPLE 5.4 on page 90 (= FIG 5.3)

45 Introduction to Bioinformatics 5.4 ESTIMATING GENETIC DISTANCE The Kimura 2-parameter model Include substitution bias in correction factor Transition probability (G↔A and T↔C) per site per second is α Transversion probability (G↔T, G↔C, A↔T, and A↔C) per site per second is β

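46 Introduction to Bioinformatics 5.4 THE KIMURA 2-PARAM MODEL The one-step Markov process substitution matrix now becomes (rows and columns in the order A, C, G, T):
$$M_{K2P} = \begin{pmatrix} 1-\alpha-2\beta & \beta & \alpha & \beta \\ \beta & 1-\alpha-2\beta & \beta & \alpha \\ \alpha & \beta & 1-\alpha-2\beta & \beta \\ \beta & \alpha & \beta & 1-\alpha-2\beta \end{pmatrix}$$
Each nucleotide has one transition partner (probability α) and two transversion partners (probability β each), so the diagonal entry must be 1-α-2β for every row to sum to 1.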

47 Introduction to Bioinformatics 5.4 THE KIMURA 2-PARAM MODEL After t generations the substitution probability is: $M(t) = M_{K2P}^t$ Determine the eigenvalues {λ_i} and eigenvectors {v_i} of $M_{K2P}$

48 Introduction to Bioinformatics 5.4 THE KIMURA 2-PARAM MODEL Spectral decomposition of M(t): $M_{K2P}^t = \sum_i \lambda_i^t v_i v_i^T$ Determine fraction of transitions per site after t generations: P(t) Determine fraction of transversions per site after t generations: Q(t) Genetic distance: $K \approx -\tfrac12 \ln(1 - 2P - Q) - \tfrac14 \ln(1 - 2Q)$ Fraction of substitutions d = P + Q; with equal rates (P = d/3, Q = 2d/3) the formula reduces to Jukes-Cantor
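The Kimura distance needs the observed transition fraction P and transversion fraction Q separately. A minimal sketch under the formula on this slide follows; the function names are illustrative, and the input is assumed to be two aligned sequences containing only A, C, G, T.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def transition_transversion_fractions(seq_a, seq_b):
    """Return (P, Q): fractions of sites differing by a transition / a transversion."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = 0
    for a, b in zip(seq_a, seq_b):
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    n = len(seq_a)
    return transitions / n, transversions / n


def kimura_2p(p, q):
    """K = -1/2 ln(1 - 2P - Q) - 1/4 ln(1 - 2Q)."""
    a, b = 1 - 2 * p - q, 1 - 2 * q
    if a <= 0 or b <= 0:
        raise ValueError("P and Q are too large for the correction to be defined")
    return -0.5 * math.log(a) - 0.25 * math.log(b)


# The SNP pair from the slides differs by a single transition (T <-> C).
p, q = transition_transversion_fractions("AAATAAA", "AAACAAA")
print(p, q, kimura_2p(p, q))
```

Setting P = d/3 and Q = 2d/3 makes both logarithm arguments equal to 1 - 4d/3, which recovers the Jukes-Cantor formula.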

49 Introduction to Bioinformatics 5.4 ESTIMATING GENETIC DISTANCE Other models for nucleotide evolution * Different types of transitions/transversions * Pairwise substitutions: GTR (= General Time Reversible) model * Amino-acid substitution matrices * …

50 Introduction to Bioinformatics 5.4 ESTIMATING GENETIC DISTANCE Other models for nucleotide evolution DEFICIT: all above models assume symmetric substitution probs; prob(A→T) = prob(T→A) Now strong evidence that this assumption is not true Challenge: incorporate this in a self-consistent model

51 Introduction to Bioinformatics LECTURE 5: INTER- AND INTRASPECIES VARIATION 5.5 CASE STUDY: Neanderthals * mtDNA of 206 H. sapiens from different regions * Fragments of mtDNA of 2 H. neanderthalensis, including the original 1856 specimen * All 208 samples from GenBank * A homologous sequence of 800 bp of the HVR (hypervariable region) could be found in all 208 specimens

52 Introduction to Bioinformatics 5.5 CASE STUDY: Neanderthals * Pairwise genetic difference – corrected with Jukes-Cantor formula * d(i,j) is the JC-corrected genetic difference between pair (i,j) * The distance table is symmetric: $d^T = d$ * MDS (Multi Dimensional Scaling): translate distance table d to an nD-map X, here a 2D-map
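The slides do not say which MDS variant was used. The sketch below assumes classical (Torgerson) MDS, which double-centers the squared distance table and keeps the leading eigenvectors; it relies on NumPy, and the four-point distance table is a made-up toy example rather than the mtDNA data.

```python
import numpy as np


def classical_mds(dist, dims=2):
    """Embed a symmetric distance table in `dims` dimensions (classical MDS)."""
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering   # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(gram)             # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:dims]            # indices of the largest ones
    scale = np.sqrt(np.clip(eigvals[order], 0, None))   # clip: non-Euclidean tables can go negative
    return eigvecs[:, order] * scale


# Toy example: distances between the corners of a 3 x 4 rectangle.
points = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0]])
D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

X = classical_mds(D, dims=2)
print(X)
```

The recovered map matches the input only up to rotation and reflection; what it preserves are the pairwise distances, and only approximately when the table is not exactly Euclidean, as with corrected genetic distances.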

53 Introduction to Bioinformatics 5.5 CASE STUDY: Neanderthals distance map d(i,j)

54 Introduction to Bioinformatics 5.5 CASE STUDY: Neanderthals MDS: H. sapiens and H. neanderthalensis are well-separated

55 Introduction to Bioinformatics 5.5 CASE STUDY: Neanderthals phylogenetic tree

56 END of LECTURE 5
