Taking the Bite (Byte?) Out of Phylogeny
Jennifer Galovich, Lucy Kluckhohn Jones, Holly Pinkart



Introduction
The goal is to produce an exercise that will engage allied health students and
– Strengthen math skills and decrease math phobia
– Decrease molecular data phobia
– Increase bioinformatics literacy

Prerequisites
The following will be presented to students prior to this project:
– Basic evolutionary concepts and use of 16S rRNA in determining relationships between prokaryotes
– Introduction to Biology Workbench, BLAST and tree construction

Approach
– Use the theme of food poisoning to engage both nursing and nutrition student populations
– Utilize mathematics and bioinformatics tools

Approach
Students will pick a week in which food poisoning is likely: Christmas, 4th of July, Thanksgiving, etc.
Students will
– Identify a source of food poisoning (e.g., Salmonella) and check the Morbidity and Mortality Weekly Report (MMWR) tables for the number of cases in a specific state or region
– Calculate the proportion of cases represented by that region
– Answer “Is this number of cases unusual based on the data presented for this time period? How can you tell?”
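The proportion calculation can be sketched in a few lines of Python. This is only an illustration: the function name and both case counts are hypothetical placeholders, not values taken from an MMWR table.

```python
def case_proportion(region_cases: int, total_cases: int) -> float:
    """Fraction of all reported cases that come from one region."""
    if total_cases <= 0:
        raise ValueError("total_cases must be positive")
    if not 0 <= region_cases <= total_cases:
        raise ValueError("region_cases must lie between 0 and total_cases")
    return region_cases / total_cases


# Hypothetical counts for one week (replace with numbers read from the table).
region_cases = 12
total_cases = 480

share = case_proportion(region_cases, total_cases)
print(f"Region share of cases: {share:.3f} ({share:.1%})")
```

Repeating the calculation for several neighboring weeks gives students a baseline against which to judge whether the chosen week looks unusual.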

Approach
Students will then address the questions
– “Without culturing the organism, how might you track it in humans or in a food supply?”
– “What relationships (if any) exist between various strains of this organism?”
– “Can this type of data be used to find the original strain?”

Approach
Students will
– Obtain sequence data from NCBI’s GenBank for the organism (or virus) of interest
– BLAST the sequence to find organisms with related sequences
– Collect 8-13 of the closest BLAST results, perform a global alignment, and construct a tree

Questions
Students choose a time period (week) and search MMWR (Morbidity and Mortality Weekly Report) for the number of cases of a particular disease in that week.
1. Given the chosen disease, how many cases of the disease occurred in a particular state (or other locale) during the week?

More Questions about the Scene
2a. How many persons are involved? Is there an index case?
2b. What percent of the population has the disease?
3. What other question might you ask from these data?
4. What microbe causes the disease? What strain, if appropriate?

Now What? (Questions about the microbe)
5. If you want to determine the specific strain of the microbe, can you find the genetic sequence?
6. How has the strain evolved?
7. What is its phylogeny, and what are the closest neighbors?

And Then... (Questions to Investigate)
8a. Why is the answer to the previous question of interest to you if you are a nurse, a dietician, a parent, the mayor, the hospital director, the first responder, a restaurant owner, a cruise ship director, a public health inspector, or other interested person (you choose)?
8b. What other questions are of interest to you in this role?

Finding the Microbe Search MMWR Morbidity Tables

Choose a Week

Choose a Disease &mmwr_week=07&mmwr_table=2F

What Percent of the Residents are Sick? &mmwr_week=01&mmwr_table=2F

Find a Microbe
Use your text, class notes, or other resources to determine the causative agent of the disease you have chosen. Choose a microbe, then find its family tree.
For the salmonellosis example, we have chosen Salmonella enterica, a microbe with many variants, called serovars.

Basics of Tree Construction
Preliminary Exercises
Goal
– Students will practice with small examples before trying to construct a tree

From Sequences to Pairwise Alignment The Needleman-Wunsch Method

We make a table of residue scores, S(i,j). The number S(i,j) is computed by comparing residue i in sequence (1) with residue j in sequence (2), using previously chosen values for matches and mismatches.
Each alignment matrix entry, H(i,j), gives the score of the best alignment of the first i residues in sequence (1) with the first j residues of sequence (2).
We have one row for each residue in sequence (2) and one column for each residue in sequence (1). To get started, we add a 0th row and a 0th column. The upper left corner is position (0,0). We set H(0,0) = 0. The rest of the values in the top row are (reading across) -g, -2g, -3g, etc., where g is the gap penalty. Similarly, the rest of the values in the leftmost column are (reading down) -g, -2g, -3g, etc.
To compute the value of H(i+1,j+1) we first consider the values north, west and northwest. We then find three numbers:
(a) S(i+1,j+1) + (the value immediately northwest)
(b) (the value just north) - g
(c) (the value just west) - g

Distance Matrix
Then we choose the largest of these three numbers to be H(i+1,j+1) and draw an arrow from position (i+1,j+1) to the position that gave us the value of H(i+1,j+1).
Example: Let match = 1, mismatch = -1 and g = 2. Consider the sequences
(1) G A A T T C
(2) G G A T

          G    A    A    T    T    C
     0   -2   -4   -6   -8  -10  -12
G   -2    1
G   -4
A   -6
T   -8

Try This Exercise (at home ok)
a. Complete the table and then follow the arrows to determine the alignment:
– A diagonal arrow corresponds to aligning the two letters.
– A horizontal arrow corresponds to aligning a letter from (1) with a gap, because sequence (1) runs across the columns.
– A vertical arrow corresponds to aligning a letter from (2) with a gap, because sequence (2) runs down the rows.
– (Note that if you have ties, you may have more than one arrow, and so more than one “best” alignment.)
b. Redo this exercise with your own choice of match, mismatch and gap values. Experiment with these values to obtain alignments different from the ones you got in part (a).
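To check a hand-completed table, the fill-and-trace procedure can be sketched in Python. This is a minimal illustration, not part of the original exercise: the function names are made up, the layout follows the slides (rows for sequence (2), columns for sequence (1)), and when several arrows tie the sketch arbitrarily prefers diagonal, then vertical, then horizontal, so it reports only one of the possible best alignments.

```python
def needleman_wunsch(seq1, seq2, match=1, mismatch=-1, g=2):
    """Fill the alignment matrix H.

    Rows correspond to residues of seq2 and columns to residues of seq1,
    with an extra 0th row and 0th column for leading gaps.
    """
    rows, cols = len(seq2) + 1, len(seq1) + 1
    H = [[0] * cols for _ in range(rows)]
    for c in range(1, cols):          # top row: -g, -2g, -3g, ...
        H[0][c] = -c * g
    for r in range(1, rows):          # leftmost column: -g, -2g, -3g, ...
        H[r][0] = -r * g
    for r in range(1, rows):
        for c in range(1, cols):
            s = match if seq1[c - 1] == seq2[r - 1] else mismatch
            H[r][c] = max(H[r - 1][c - 1] + s,   # northwest + residue score
                          H[r - 1][c] - g,       # north - gap penalty
                          H[r][c - 1] - g)       # west - gap penalty
    return H


def traceback(H, seq1, seq2, match=1, mismatch=-1, g=2):
    """Follow one path of arrows from the bottom-right corner back to (0,0)."""
    r, c = len(seq2), len(seq1)
    top, bottom = [], []
    while r > 0 or c > 0:
        if r > 0 and c > 0:
            s = match if seq1[c - 1] == seq2[r - 1] else mismatch
            if H[r][c] == H[r - 1][c - 1] + s:   # diagonal arrow
                top.append(seq1[c - 1])
                bottom.append(seq2[r - 1])
                r, c = r - 1, c - 1
                continue
        if r > 0 and H[r][c] == H[r - 1][c] - g:  # vertical arrow
            top.append("-")
            bottom.append(seq2[r - 1])
            r -= 1
        else:                                     # horizontal arrow
            top.append(seq1[c - 1])
            bottom.append("-")
            c -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))


H = needleman_wunsch("GAATTC", "GGAT")
for row in H:
    print(row)
print(*traceback(H, "GAATTC", "GGAT"), sep="\n")
```

Changing the `match`, `mismatch` and `g` arguments is a quick way to explore part (b) after the table has been worked by hand.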

From Pairwise Alignment to Multiple Alignment
Idea of global progressive alignment: The most similar sequences are aligned first. A consensus is determined and then aligned to the next most similar sequence. The determination of “next most similar” is made using phylogenetic information (a guide tree).

From Alignment to Distance Matrix
There are many different ways of computing the distance between pairs of sequences in a multiple alignment. Each uses different assumptions, which may or may not be reasonable for a given situation. For example, the simplest model, Jukes-Cantor, assumes that mutation occurs at a constant rate, and that each nucleotide is equally likely to mutate into any other nucleotide (at that rate). For protein sequences, the calculation is (even) more complicated.
From distance matrix to tree: Again, there are many different methods available. Biology Workbench uses ClustalW to construct multiple alignments. Clustal uses the neighbor-joining method to find the guide tree. The final tree produced by Workbench is a compilation of these guide trees.
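As an illustration of the Jukes-Cantor model, the standard correction converts the observed proportion p of differing sites into an estimated distance d = -(3/4) ln(1 - (4/3) p). The Python sketch below assumes two already-aligned nucleotide sequences of equal length and skips any column containing a gap; the function names and the toy sequences are hypothetical and do not come from Biology Workbench.

```python
import math


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned sequences.

    Columns where either sequence has a gap ('-') are ignored.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(x != y for x, y in pairs) / len(pairs)


def jukes_cantor(p: float) -> float:
    """Jukes-Cantor distance for an observed proportion p of differing sites."""
    if not 0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


# Toy aligned sequences, made up for illustration only.
s1 = "GATTACA"
s2 = "GACTATA"
p = p_distance(s1, s2)
print(f"p = {p:.3f}, Jukes-Cantor d = {jukes_cantor(p):.3f}")
```

The corrected distance is always at least as large as p, because the model allows for repeated mutations at the same site that a simple count of differences cannot see.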

Clustering Methods
The UPGMA (Unweighted Pair Group Method with Arithmetic Mean) method
+ easy to describe; produces an ultrametric (and hence additive) tree
- assumptions (molecular clock; all species evolve at the same rate)
Idea:
Step 1. Find the two closest taxa.
Step 2. Treat the two closest as a new combined taxon, and make a new matrix, calculating distances from the combined taxon to the others using the average of all the pairwise distances involved.
Iterate these two steps until the tree is completed.

Construct the UPGMA tree for the following distance matrix:

      A    B    C    D
A     0    9    7    5
B     9    0    8   10
C     7    8    0    8
D     5   10    8    0

Observe: A and D are closest. Next, update the matrix:

       A/D     B      C
A/D     0    19/2   15/2
B              0      8
C                     0

Now the A/D cluster and C are closest.

Exercise
1. Finish this tree.
2. The tree is ultrametric, but the data are not. (Why not?) How would the data have to be changed in order that they be ultrametric?
3. The tree is additive. Are the data?
Redo questions 1-3 for the case where the BD distance is 12 instead of 10.

       A/D     B      C
A/D     0    19/2   15/2
B              0      8
C                     0
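For checking hand calculations, the two UPGMA steps can be sketched in Python. This is an illustrative sketch only: the function name and the dictionary-of-frozensets representation are choices made here, and the average over all pairwise distances is computed as a size-weighted mean of the two merged clusters, which gives the same number.

```python
from itertools import combinations


def upgma(names, dist):
    """Return the list of merges (cluster_a, cluster_b, distance) in order.

    names: list of taxon labels.
    dist:  dict mapping frozenset({a, b}) to the distance between a and b.
    In the UPGMA tree each merge node sits at height distance / 2.
    """
    clusters = {name: (name,) for name in names}   # label -> member taxa
    d = dict(dist)
    merges = []
    while len(clusters) > 1:
        # Step 1: find the two closest clusters.
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        merges.append((a, b, d[frozenset((a, b))]))
        # Step 2: combine them and average the distances to every other cluster.
        new = f"{a}/{b}"
        na, nb = len(clusters[a]), len(clusters[b])
        for c in clusters:
            if c not in (a, b):
                d[frozenset((new, c))] = (
                    na * d[frozenset((a, c))] + nb * d[frozenset((b, c))]
                ) / (na + nb)
        members = clusters.pop(a) + clusters.pop(b)
        clusters[new] = members
    return merges


# The distance matrix from the slide above.
D = {
    frozenset("AB"): 9, frozenset("AC"): 7, frozenset("AD"): 5,
    frozenset("BC"): 8, frozenset("BD"): 10, frozenset("CD"): 8,
}
for a, b, dist_ab in upgma(["A", "B", "C", "D"], D):
    print(f"join {a} and {b} at distance {dist_ab} (height {dist_ab / 2})")
```

Changing the BD entry to 12 and rerunning gives a quick check on the second part of the exercise.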

Neighbor Joining (NJ)
+ additive (but not ultrametric); computationally efficient
- unrooted. Prior knowledge is needed to decide how to root the tree.
Note: the species which are closest according to the distance matrix need NOT be neighbors. That’s why we need a modified distance formula.
Exercise: Draw a picture of a tree on four taxa that illustrates the problem described in the note above.

Neighbor Joining Steps
Step 1: Find the two species which are closest using the modified distance formula below. Join them.
Modified distance definitions:
Let R_i = sum of all the distances from node i to all others, divided by N - 2
Let R_j = sum of all the distances to node j from all others, divided by N - 2
Let D(i,j) = matrix distance.
Calculate the modified distance from i to j as D(i,j) - R_i - R_j
We now have two fewer taxa and one more internal node, for a net of one less node than we started with.
Steps 2 and following: Repeat step 1 until all nodes are joined.
Problem: the new internal node n is not in the original matrix. This problem can be solved: in the standard method, the distance from n to each remaining node k is taken to be (D(i,k) + D(j,k) - D(i,j)) / 2.
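One joining step can be sketched in Python as follows. This is a hedged illustration rather than the Clustal implementation: the function names and data layout are invented here, branch lengths are not computed, and distances from the new internal node use the usual neighbor-joining update (D(i,k) + D(j,k) - D(i,j)) / 2. The example reuses the four-taxon matrix from the UPGMA slides.

```python
from itertools import combinations


def modified_distances(names, D):
    """Modified distance D(i,j) - R_i - R_j for every pair of nodes.

    R_i is the sum of distances from i to all other nodes, divided by N - 2.
    """
    n = len(names)
    if n < 3:
        raise ValueError("need at least three nodes")
    R = {i: sum(D[frozenset((i, k))] for k in names if k != i) / (n - 2)
         for i in names}
    return {(i, j): D[frozenset((i, j))] - R[i] - R[j]
            for i, j in combinations(names, 2)}


def join_step(names, D):
    """Join the pair with the smallest modified distance.

    Returns the joined pair, the remaining node labels, and the new matrix.
    """
    M = modified_distances(names, D)
    i, j = min(M, key=M.get)            # ties: the first pair found wins
    new = f"({i},{j})"                  # label for the new internal node
    D2 = {pair: d for pair, d in D.items() if i not in pair and j not in pair}
    for k in names:
        if k not in (i, j):
            D2[frozenset((new, k))] = (
                D[frozenset((i, k))] + D[frozenset((j, k))] - D[frozenset((i, j))]
            ) / 2
    remaining = [k for k in names if k not in (i, j)] + [new]
    return (i, j), remaining, D2


D = {
    frozenset("AB"): 9, frozenset("AC"): 7, frozenset("AD"): 5,
    frozenset("BC"): 8, frozenset("BD"): 10, frozenset("CD"): 8,
}
names = ["A", "B", "C", "D"]
print(modified_distances(names, D))
pair, remaining, D2 = join_step(names, D)
print("joined:", pair, "remaining:", remaining)
```

With four taxa two pairs tie for the smallest modified distance, which is expected: joining either pair describes the same unrooted tree.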

Final Approach
– Use the theme of food poisoning to engage both nursing and nutrition student populations
– Utilize mathematics and bioinformatics tools

Find the Microbial Gene NCBI Search

Choose a Strain d=search&term=Salmonella+enterica+16s+ribosomal+RNA+gene

BLAST Basic Local Alignment Search Tool

Paste Sequence, BLAST off!
=50&ALIGNMENT_VIEW=Pairwise&CLIENT=web&DATABASE=nr&DESCRIPTIONS=100&ENTREZ_QUERY=%28none%29&EXPECT=10&FILTER=L&FORMAT_OBJECT=Alignment&FORMAT_TYPE=HTML&NCBI_GI=on&PAGE=Nucleotides&PROGRAM=blastn&SERVICE=plain&SET_DEFAULTS.x=34&SET_DEFAULTS.y=8&SHOW_OVERVIEW=on&END_OF_HTTPGET=Yes&SHOW_LINKOUT=yes&GET_SEQUENCE=yes

BLAST Results

BLAST Sequences

GenBank
nlm.nih.gov/entrez/viewr.fcgi?db=nucleotide&val=

FASTA
fcgi?db=nucleotide&qty=1&c_start=1&list_uids=&dopt=fasta&dispmax=5&sendto=&from=begin&to=end&extrafeatpresent=1&ef_CDD=8&ef_MGC=16&ef_HPRD=32&ef_STS=64&ef_tRNA=128

Constructing a Tree
Add sequences

Clustal W
Choose the Multiple Sequence Alignment
ol.sdsc.edu/CGI/BW.cgi#!

Choose a Tree Type
Choose Rooted and/or Unrooted
Submit
ol.sdsc.edu/CGI/BW.cgi#!

Voila! Unrooted Tree
u/CGI/BW.cgi#!

Rooted Tree
Which species are the most closely related?

Final Questions
How are the data helpful if you are a
– Parent?
– Restaurant owner?
– Hospital director?
– Public health inspector?

Assessment
Student Learning Outcomes
– More comfortable with computation
– Using the tools to answer questions
– Empowerment (we hope!)