* *Less than 10% Dinosaur content Jeffrey Boucher.

Slides:



Advertisements
Similar presentations
Fa07CSE 182 CSE182-L4: Database filtering. Fa07CSE 182 Summary (through lecture 3) A2 is online We considered the basics of sequence alignment –Opt score.
Advertisements

Phylogenetic Tree A Phylogeny (Phylogenetic tree) or Evolutionary tree represents the evolutionary relationships among a set of organisms or groups of.
1 Introduction to Sequence Analysis Utah State University – Spring 2012 STAT 5570: Statistical Bioinformatics Notes 6.1.
Bioinformatics Phylogenetic analysis and sequence alignment The concept of evolutionary tree Types of phylogenetic trees Measurements of genetic distances.
An Introduction to Phylogenetic Methods
 Aim in building a phylogenetic tree is to use a knowledge of the characters of organisms to build a tree that reflects the relationships between them.
Lecture 3 Molecular Evolution and Phylogeny. Facts on the molecular basis of life Every life forms is genome based Genomes evolves There are large numbers.
1 General Phylogenetics Points that will be covered in this presentation Tree TerminologyTree Terminology General Points About Phylogenetic TreesGeneral.
Phylogenetic Trees Understand the history and diversity of life. Systematics. –Study of biological diversity in evolutionary context. –Phylogeny is evolutionary.
Summer Bioinformatics Workshop 2008 Comparative Genomics and Phylogenetics Chi-Cheng Lin, Ph.D., Professor Department of Computer Science Winona State.
Phylogenetic reconstruction
Maximum Likelihood. Likelihood The likelihood is the probability of the data given the model.
Molecular Evolution Revised 29/12/06
Heuristic alignment algorithms and cost matrices
Bioinformatics and Phylogenetic Analysis
Maximum Likelihood Flips usage of probability function A typical calculation: P(h|n,p) = C(h, n) * p h * (1-p) (n-h) The implied question: Given p of success.
Introduction to bioinformatics
Phylogenetic Analysis. General comments on phylogenetics Phylogenetics is the branch of biology that deals with evolutionary relatedness Uses some measure.
Tutorial 2: Some problems in bioinformatics 1. Alignment pairs of sequences Database searching for sequences Multiple sequence alignment Protein classification.
Sequence similarity.
Pairwise Alignment Global & local alignment Anders Gorm Pedersen Molecular Evolution Group Center for Biological Sequence Analysis.
Similar Sequence Similar Function Charles Yan Spring 2006.
Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.
Sequence Alignments Revisited
TGCAAACTCAAACTCTTTTGTTGTTCTTACTGTATCATTGCCCAGAATAT TCTGCCTGTCTTTAGAGGCTAATACATTGATTAGTGAATTCCAATGGGCA GAATCGTGATGCATTAAAGAGATGCTAATATTTTCACTGCTCCTCAATTT.
Alignment Statistics and Substitution Matrices BMI/CS 576 Colin Dewey Fall 2010.
Multiple Sequence Alignments and Phylogeny.  Within a protein sequence, some regions will be more conserved than others. As more conserved,
Phylogeny Estimation: Traditional and Bayesian Approaches Molecular Evolution, 2003
An Introduction to Bioinformatics
Terminology of phylogenetic trees
Multiple Sequence Alignment May 12, 2009 Announcements Quiz #2 return (average 30) Hand in homework #7 Learning objectives-Understand ClustalW Homework#8-Due.
Molecular phylogenetics
Protein Evolution and Sequence Analysis Protein Evolution and Sequence Analysis.
How to Raise the Dead: The Nuts & Bolts of Ancestral Sequence Reconstruction Jeffrey Boucher Theobald Laboratory.
Evolution and Scoring Rules Example Score = 5 x (# matches) + (-4) x (# mismatches) + + (-7) x (total length of all gaps) Example Score = 5 x (# matches)
Phylogenetic Analysis. General comments on phylogenetics Phylogenetics is the branch of biology that deals with evolutionary relatedness Uses some measure.
Sequence Alignment Goal: line up two or more sequences An alignment of two amino acid sequences: …. Seq1: HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP.
Amino Acid Scoring Matrices Jason Davis. Overview Protein synthesis/evolution Protein synthesis/evolution Computational sequence alignment Computational.
Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.
Bioinformatics 2011 Molecular Evolution Revised 29/12/06.
Pairwise alignment of DNA/protein sequences I519 Introduction to Bioinformatics, Fall 2012.
Construction of Substitution Matrices
Calculating branch lengths from distances. ABC A B C----- a b c.
Sequence Alignment Csc 487/687 Computing for bioinformatics.
Multiple Alignment and Phylogenetic Trees Csc 487/687 Computing for Bioinformatics.
Phylogenetic Analysis Gabor T. Marth Department of Biology, Boston College BI420 – Introduction to Bioinformatics Figures from Higgs & Attwood.
Phylogeny Ch. 7 & 8.
Sequence Alignment.
Burkhard Morgenstern Institut für Mikrobiologie und Genetik Molekulare Evolution und Rekonstruktion von phylogenetischen Bäumen WS 2006/2007.
Construction of Substitution matrices
Ayesha M.Khan Spring Phylogenetic Basics 2 One central field in biology is to infer the relation between species. Do they possess a common ancestor?
Step 3: Tools Database Searching
The statistics of pairwise alignment BMI/CS 576 Colin Dewey Fall 2015.
Building Phylogenies. Phylogenetic (evolutionary) trees Human Gorilla Chimp Gibbon Orangutan Describe evolutionary relationships between species Cannot.
Molecular Evolution. Study of how genes and proteins evolve and how are organisms related based on their DNA sequence Molecular evolution therefore is.
Substitution Matrices and Alignment Statistics BMI/CS 776 Mark Craven February 2002.
9/6/07BCB 444/544 F07 ISU Dobbs - Lab 3 - BLAST1 BCB 444/544 Lab 3 BLAST Scoring Matrices & Alignment Statistics Sept6.
Using BLAST to Identify Species from Proteins
Bioinformatics Overview
Sequence similarity, BLAST alignments & multiple sequence alignments
Introduction to Bioinformatics Resources for DNA Barcoding
Evolutionary genomics can now be applied beyond ‘model’ organisms
Phylogenetic basis of systematics
Multiple Alignment and Phylogenetic Trees
Using BLAST to Identify Species from Proteins
Molecular Evolution.
Chapter 19 Molecular Phylogenetics
Molecular data assisted morphological analyses
Basic Local Alignment Search Tool
Using BLAST to Identify Species from Proteins
Presentation transcript:

* *Less than 10% Dinosaur content Jeffrey Boucher

Talk Outline Talk 1: – “How to Raise the Dead: The Nuts & Bolts of Ancestral Sequence Reconstruction” Talk 2: – Ancestral Sequence Reconstruction Lab Talk 3: – “Ancestral Sequence Reconstruction: What is it Good for?”

How to Raise the Dead: The Nuts and Bolts of Ancestral Sequence Reconstruction Jeffrey Boucher Theobald Laboratory

Orientation for the Talk The Central Dogma: DNA RNA Protein

Orientation for the Talk (cont.) Chemistry of side chains govern structure/function Mutations to sequences occur over time

We Live in The Sequencing Era Since inception, database size has doubled every 18 months. Number of Entries Year

What Can We Learn From This Data? Individually…not much Too many sequences to characterize individually – Today: 1.5 Ε 8 sequences ÷ 7 E 9 people = 1 sequence/50 people – By Ε 9 sequences ÷ 7.5 E 9 people = 1 sequence/6 people >gi| |gb|ABF | pancreatic ribonuclease precursor subtype Na [Nasalis larvatus] MALDKSVILLPLLVVVLLVLGWAQPSLGRESRAEKFQRQHMDSGSSPSSSSTYCNQMMK RRNMTQGRCKPVNTFVHEPLVDVQNVCFQEKVTCKNGQTNCFKSNSRMHITDCRLTNG SKYPNCAYRTTPKERHIIVACEGSPYVPVHFDASVEDST

Bioinformatics! Bioinformatic methods developed to deal with this backlog Methods covered: – Sequence Alignment (& BLAST) – Phylogenetics – Sequence Reconstruction

Sequence Alignment How can we compare sequences? Simple scoring function – 1 for match – 0 for mismatch Orangutan Chimpanzee =

Not All Mismatches Are Created Equal How can scoring function account for this? Aspartate Glutamate Orangutan Chimpanzee Glutamate Leucine Vs. * *

Substitution Matrix Glutamate Aspartate Glutamate Leucine

Calculating A Substitution Matrix How are the rewards/penalties determined? Determined by log-odds scores: S i,j = log p i,j q i * q j p i,j is probability amino acid i transforms to amino acid j q i & q j represent the frequencies of those amino acids Why not just p i,j ?

Neither Are All Matches Cysteine Leucine Cysteine

BLOSUM62 (BLOcks of Amino Acid SUbstitution Matrix) ≥62% Identity <62% Identity STOP How did you get an alignment? You’re talking about ‘How to Make an Alignment’! Blocks used align well with 1/0 scoring function

BLOSUM62 Matrix Calculation S i,j = log p i,j q i * q j p G,A q G q A = 14/900 = = = 21/225 = = = 16/225 = ≥62% Identity <62% Identity G-G G-A A-A = 36

Pairwise Alignment Examples No Gaps allowed: = 14 Gap Penalty of -8: -Penalty heuristically determined = 40 Orangutan Chimpanzee Orangutan Chimpanzee

Pairwise Alignment Examples (cont.) If gap penalty is too low… Alignment of multiple sequences similar method Orangutan Chimpanzee

(& BLAST) Alignment can identify similar sequences BLAST (Basic Local Alignment Search Tool) How does alignment compare to alignment of random sequences? – E-value of 1 E -3 is a 1:1000 chance of alignment of random sequences

Homology vs. Identity Significant BLAST hits inform us about evolutionary relationships Homologous - share a common ancestor – This is binary, not a percentile – Identity is calculated, homology is a hypothesis – Homology does not ensure common function

Visual Depiction of Alignment Scores Suppose alignment of 3 sequences… Orangutan Chimpanzee Mouse O C M MCO M O C

Phylogenetics Relationships between organisms/sequences On the Origin of Species (1859) had 1 figure:

Phylogenetics Prior to 1950s phylogenies based on morphology Sequence data/Analytical methods – Qualitative  Quantitative

Phylogeny TIME A B G F E D C Internal Branch Peripheral Branch Taxa (observed data) Branch lengths represent time/change Node

A Tale of Two Proteins Significant sequence similarity & the same structure Protein X -Binds Single Stranded RNA Protein Y -Binds Double Stranded RNA

TIME A B G F E D C “Gene”alogy Single-Stranded Double-Stranded Last Common Ancestor of All Double-Stranded Last Common Ancestor of All Single-Stranded Last Common Ancestor of All

Back to the Future Resurrecting extinct proteins 1 st proposed Pauling & Zuckerkandl in 1963 In 1990, 1 st Ancestral protein reconstructed, expressed & assayed by S.A. Benner Group – RNaseA from ~5Myr old extinct ruminant

What Took So Long ?

How to Resurrect a Protein 1) Acquire/Align Sequences 2) Construct Phylogeny (from Chang et al. 2002) 3) Infer Ancestral Nodes 4) Synthesize Inferred Sequence

So Really…What Took So Long? Advances in 3 areas were required: – Sequence availability – Phylogenetic reconstruction methods – Improvements in DNA synthesis

Sequence Availability Number of Sequences Year

Advances in 3 areas were required: ✓ Sequence availability – Phylogenetic reconstruction methods – Improvements in DNA synthesis

Advances in Reconstruction Methods Consensus Parsimony Maximum Likelihood

Consensus Advantage: Easy & fast Disadvantages: Ignores phylogenetic relationships X X

Parsimony Parsimony Principle – Best-supported evolutionary inference requires fewest changes – Assumes conservation as model Advantage: – Takes phylogenetic relationships into account Disadvantage: – Ignores evolutionary process & branch lengths

Parsimony A B C D E F G H ABCDEFGHABCDEFGH

Parsimony V V V I L L Example adapted from David Hillis I L {V} {L} {V, I} {V, I, L} Changes = 4 V L I I I V L

Parsimony - Alternate Reconstructions Is conservation the best model? Resolve ambiguous reconstructions

Maximum Likelihood Likelihood: – How surprised we should be by the data – Maximizing the likelihood, minimize your surprise Example: – Roll 20-sided die 9 times: Likelihood = Probability(Data|Model)

Maximum Likelihood Fair Die Model: – 5% chance of rolling a 20 Trick Die Model: – 100% chance of rolling a 20 Likelihood = Probablity(Data|Model) Likelihood = (0.05) 9 = 2 E -11 Likelihood = (1) 9 = 1 Assuming trick model maximizes the likelihood

From Dice to Trees Likelihood= – Data - Sequences/Alignment – Model - Tree topology, Branch lengths & Model of evolution or Choose model that maximizes the likelihood

Improvements Over Parsimony Includes of evolutionary process & branch lengths – Reduction in ambiguous sites Fit of model included in calculation – Removes a priori choices – Use more complex models (when applicable) Confidence in reconstruction – Posterior probabilities

Advances in 3 areas were required: ✓ Sequence availability ✓ Phylogenetic reconstruction methods – Improvements in DNA synthesis

Advances in DNA Synthesis DNA synthesis work starts 1950s late 1970s Automated 1983 PCR nts Fragments 2002 ~200 nts Fragments Advances in Molecular Biology increased speed & fidelity PAST PRESENT

How to Synthesize a Gene ’- -3’ DNA Ligase 600 nts 5’- -3’ FW Primer 5’- 3’- -5’ RV Primer -5’ 5’- -3’ 3’- -5’ Schematic adapted from Fuhrmann et al ’ 3’- DNA Polymerase

On to the Easy Part…