Taking the Bite (Byte?) Out of Phylogeny Jennifer Galovich Lucy Kluckhohn Jones Holly Pinkart.

Introduction The goal is to produce an exercise that will engage allied health students and – strengthen math skills and decrease math phobia – decrease molecular data phobia – increase bioinformatics literacy

Prerequisites The following will be presented to students prior to this project –Basic evolutionary concepts and use of 16S rRNA in determining relationships between prokaryotes –Introduction to Biology Workbench, BLAST and tree construction

Approach Use the theme of food poisoning to engage both nursing and nutrition student populations Utilize mathematics and bioinformatics tools

Approach Students will pick a week in which food poisoning is likely: Christmas, 4th of July, Thanksgiving, etc. Students will – identify a source of food poisoning (e.g., Salmonella), and check the Morbidity and Mortality Weekly Report tables for the number of cases in a specific state or region – calculate the proportion of cases represented by that region – answer “Is this number of cases unusual based on the data presented for this time period? How can you tell?”
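The proportion calculation described above can be sketched in a few lines of Python. This is only an illustration: the case counts and population below are hypothetical placeholders, not values taken from MMWR, and the variable names are assumptions made for the example.

```python
def percent(part, whole):
    """Return `part` expressed as a percentage of `whole`."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return 100.0 * part / whole


# Hypothetical numbers for illustration only (NOT real MMWR data).
state_cases = 30              # cases reported in the chosen state for the week
national_cases = 600          # cases reported nationally for the same week
state_population = 5_000_000  # residents of the chosen state

# Proportion of all reported cases that came from the chosen state.
share_of_cases = percent(state_cases, national_cases)

# Percent of the state's residents who are reported sick, and the same
# figure rescaled to cases per 100,000 residents.
percent_of_population = percent(state_cases, state_population)
cases_per_100k = 100_000 * state_cases / state_population

print(f"Share of reported cases: {share_of_cases:.1f}%")
print(f"Percent of residents sick: {percent_of_population:.4f}%")
print(f"Cases per 100,000 residents: {cases_per_100k:.1f}")
```

Reporting the rate per 100,000 residents alongside the raw percent can make very small percentages easier to compare between regions.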

Approach Students will then address the questions – “Without culturing the organism, how might you track it in humans or in a food supply?” – “What relationships (if any) exist between various strains of this organism?” – “Can this type of data be used to find the original strain?”

Approach Students will –obtain sequence data from NCBI’s GenBank for the organism (or virus) of interest –BLAST the sequence to find organisms with related sequences –Collect 8-13 of the closest BLAST results to perform a global alignment, and construct a tree

Questions Students choose a time period (week) and search MMWR (Morbidity and Mortality Weekly Report) for the number of cases of a particular disease in that week. 1. Given the chosen disease, how many cases of the disease occurred in a particular state (or other locale) during the week?

More Questions about the Scene 2a. How many persons are involved? Is there an index case? 2b. What percent of the population has the disease? 3. What other question might you ask from these data? 4. What microbe causes the disease? What strain, if appropriate?

Now What? (Questions about the microbe) 5. If you want to determine the specific strain of the microbe, can you find the genetic sequence? 6. How has the strain evolved? 7. What is its phylogeny, and what are the closest neighbors?

And Then... (Questions to Investigate) 8a. Why is the answer to the previous question of interest to you if you are a nurse, a dietician, a parent, the mayor, the hospital director, the first responder, a restaurant owner, a cruise ship director, a public health inspector, or other interested person (you choose)? 8b. What other questions are of interest to you in this role?

Finding the Microbe Search MMWR Morbidity Tables

Choose a Week

Choose a Disease

What Percent of the Residents are Sick?

Find a Microbe Use your text, class notes, or other resources to determine the causative agent of the disease you have chosen. Choose a microbe, then find its family tree. For the Salmonellosis example, we have chosen Salmonella enterica, a microbe with many variants, called serovars.

Basics of Tree Construction Preliminary Exercises Goal –Students will practice with small examples before trying to construct a tree

From Sequences to Pairwise Alignment The Needleman-Wunsch Method

We make a table of residue scores, S(i,j). The number S(i,j) is computed by comparing residue i in sequence (1) with residue j in sequence (2), using previously chosen values for matches and mismatches. Each alignment matrix entry, H(i,j), gives the score of the best alignment of the first i residues in sequence (1) with the first j residues of sequence (2). We have one row for each residue in sequence (2) and one column for each residue in sequence (1). To get started, we add a 0th row and a 0th column. The upper left corner is position (0,0), and we set H(0,0) = 0. The rest of the values in the top row are (reading across) -g, -2g, -3g, etc., where g is the gap penalty. Similarly, the rest of the values in the leftmost column are (reading down) -g, -2g, -3g, etc. To compute the value of H(i+1,j+1) we first consider the values to the north, west and northwest. We then find three numbers: (a) S(i+1,j+1) + the value immediately northwest; (b) the value just north - g; (c) the value just west - g.

Distance Matrix Then we choose the largest of these three numbers to be H(i+1,j+1) and draw an arrow from position (i+1,j+1) to the position that gave us the value of H(i+1,j+1). Example: Let match = 1, mismatch = -1 and g = 2. Consider the sequences (1) G A A T T C and (2) G G A T. The table, with the first entry filled in, is:

            G    A    A    T    T    C
       0   -2   -4   -6   -8  -10  -12
  G   -2    1
  G   -4
  A   -6
  T   -8

Try This Exercise (at home ok) a. Complete the table and then follow the arrows to determine the alignment: – A diagonal arrow corresponds to aligning the two letters. – A horizontal arrow (a step along the columns, which are labeled by sequence (1)) corresponds to aligning a letter from (1) with a gap. – A vertical arrow (a step along the rows, which are labeled by sequence (2)) corresponds to aligning a letter from (2) with a gap. – (Note that if you have ties, you may have more than one arrow, and so more than one “best” alignment.) b. Redo this exercise with your own choice of match, mismatch and gap values. Experiment with these values to obtain alignments different from the ones you got in part (a).
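For checking hand-completed tables, the fill and traceback rules described above can be sketched in Python. This is a minimal illustration, not the routine used by Biology Workbench: the function name and argument names are assumptions, rows correspond to sequence (2) and columns to sequence (1) as in the slides, and when there are ties the traceback follows only one arrow (diagonal first, then north, then west), so it reports just one of possibly several best alignments.

```python
def needleman_wunsch(seq1, seq2, match=1, mismatch=-1, gap=2):
    """Global alignment of seq1 (columns) with seq2 (rows).

    Returns (H, aligned1, aligned2), where H[r][c] is the best score for
    aligning the first c residues of seq1 with the first r residues of seq2.
    """
    rows, cols = len(seq2) + 1, len(seq1) + 1
    H = [[0] * cols for _ in range(rows)]
    for c in range(1, cols):          # top row: -g, -2g, -3g, ...
        H[0][c] = -gap * c
    for r in range(1, rows):          # leftmost column: -g, -2g, -3g, ...
        H[r][0] = -gap * r

    def score(r, c):
        """Residue score for column letter c of seq1 vs row letter r of seq2."""
        return match if seq1[c - 1] == seq2[r - 1] else mismatch

    for r in range(1, rows):
        for c in range(1, cols):
            H[r][c] = max(
                H[r - 1][c - 1] + score(r, c),  # northwest + residue score
                H[r - 1][c] - gap,              # north - g
                H[r][c - 1] - gap,              # west - g
            )

    # Traceback from the bottom-right corner, following one arrow per cell.
    out1, out2 = [], []
    r, c = len(seq2), len(seq1)
    while r > 0 or c > 0:
        if r > 0 and c > 0 and H[r][c] == H[r - 1][c - 1] + score(r, c):
            out1.append(seq1[c - 1]); out2.append(seq2[r - 1])   # diagonal
            r -= 1; c -= 1
        elif r > 0 and H[r][c] == H[r - 1][c] - gap:
            out1.append("-"); out2.append(seq2[r - 1])           # vertical
            r -= 1
        else:
            out1.append(seq1[c - 1]); out2.append("-")           # horizontal
            c -= 1
    return H, "".join(reversed(out1)), "".join(reversed(out2))


H, aligned1, aligned2 = needleman_wunsch("GAATTC", "GGAT")
for row in H:
    print(row)
print(aligned1)
print(aligned2)
```

Changing the `match`, `mismatch` and `gap` arguments is a quick way to try part (b) of the exercise after working part (a) by hand.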

From Pairwise Alignment to Multiple Alignment Idea of global progressive alignment: The most similar sequences are aligned first, and the remaining sequences are added in order of their similarity. A consensus is determined and then aligned to the next most similar sequence. The determination of “next most similar” is made using phylogenetic information (a guide tree).

From Alignment to Distance Matrix There are many different ways of computing the distance between pairs of sequences in a multiple alignment. Each uses different assumptions, which may or may not be reasonable for a given situation. For example, the simplest model, Jukes-Cantor, assumes that mutation occurs at a constant rate, and that each nucleotide is equally likely to mutate into any other nucleotide (at that rate). For protein sequences, the calculation is (even) more complicated. From Distance Matrix to Tree: Again, there are many different methods available. Biology Workbench uses ClustalW to construct multiple alignments. ClustalW uses the neighbor-joining method to find the guide tree. The final tree produced by Workbench is a compilation of these guide trees.
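As a concrete illustration of the Jukes-Cantor model mentioned above, its standard distance formula is d = -(3/4) ln(1 - 4p/3), where p is the observed fraction of aligned sites that differ. The slides do not state this formula, so the sketch below is an addition under that assumption; the two short sequences are hypothetical, and skipping gapped columns is a simplifying choice made for the example.

```python
import math


def p_distance(a, b):
    """Fraction of aligned, ungapped sites at which sequences a and b differ."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped sites to compare")
    return sum(x != y for x, y in pairs) / len(pairs)


def jukes_cantor(p):
    """Jukes-Cantor distance for an observed difference fraction p (p < 0.75)."""
    if not 0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


# Hypothetical aligned nucleotide sequences (one difference in ten sites).
seq_a = "GATTACAGAT"
seq_b = "GACTACAGAT"

p = p_distance(seq_a, seq_b)
d = jukes_cantor(p)
print(f"p = {p:.3f}, Jukes-Cantor distance = {d:.4f}")
```

The corrected distance is slightly larger than p because the model allows for sites that have changed more than once.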

Clustering Methods The UPGMA (Unweighted Pair Group Method with Arithmetic Mean) method. Advantages (+): easy to describe; produces an ultrametric (and hence additive) tree. Disadvantages (-): relies on strong assumptions (a molecular clock; all species evolve at the same rate). Idea: Step 1. Find the two closest taxa. Step 2. Treat the two closest as a new combined taxon, and make a new matrix, calculating distances from the combined taxon to the others using the average of all the pairwise distances involved. Iterate these two steps until the tree is completed.

Construct the UPGMA tree for the following distance matrix:

        A    B    C    D
  A     0    9    7    5
  B     9    0    8   10
  C     7    8    0    8
  D     5   10    8    0

Observe: A and D are closest. Next, update the matrix:

         A/D     B      C
  A/D     0    19/2   15/2
  B              0      8
  C                     0

Now the A/D cluster and C are closest.

Exercise 1. Finish this tree. 2. The tree is ultrametric, but the data are not. (Why not?) How would the data have to be changed in order that they be ultrametric? 3. The tree is additive. Are the data? Redo questions 1 – 3 in case the BD distance is 12 instead of 10. Reduced matrix (with BD = 10):

         A/D     B      C
  A/D     0    19/2   15/2
  B              0      8
  C                     0
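The two UPGMA steps can be sketched in Python as a way to check a hand-built tree. This is a minimal illustration with assumed function names; rather than rewriting the matrix at each step, it recomputes the distance between two clusters as the average of all the original pairwise distances between their members, which gives the same values as the updated matrix. Ties are broken by taking the first closest pair found.

```python
from itertools import combinations


def cluster_distance(c1, c2, dist):
    """Average of all pairwise distances between members of c1 and c2."""
    total = sum(dist[frozenset((x, y))] for x in c1 for y in c2)
    return total / (len(c1) * len(c2))


def upgma(names, dist):
    """Return the list of merges as (cluster1, cluster2, distance) tuples.

    `dist` maps frozenset({a, b}) to the distance between taxa a and b.
    Each cluster is a tuple of taxon names.
    """
    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        # Step 1: find the two closest clusters.
        c1, c2 = min(
            combinations(clusters, 2),
            key=lambda pair: cluster_distance(pair[0], pair[1], dist),
        )
        merges.append((c1, c2, cluster_distance(c1, c2, dist)))
        # Step 2: replace them with a single combined cluster.
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


# The distance matrix from the slides (with BD = 10).
names = ["A", "B", "C", "D"]
matrix = [
    [0, 9, 7, 5],
    [9, 0, 8, 10],
    [7, 8, 0, 8],
    [5, 10, 8, 0],
]
dist = {
    frozenset((names[i], names[j])): matrix[i][j]
    for i in range(len(names))
    for j in range(i + 1, len(names))
}

merges = upgma(names, dist)
for left, right, d in merges:
    # In a UPGMA tree the joining node sits at half the cluster distance.
    print(left, right, "distance", d, "node height", d / 2)
```

Changing the B-D entries of `matrix` from 10 to 12 reruns the calculation for the second part of the exercise.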

Neighbor Joining (NJ) Advantages (+): produces an additive (but not ultrametric) tree; computationally efficient. Disadvantages (-): the tree is unrooted, so prior knowledge is needed to decide how to root it. Note: the species which are closest according to the distance matrix need NOT be neighbors. That’s why we need a modified distance formula. Exercise: Draw a picture of a tree on four taxa that illustrates the problem described in the note above.

Neighbor Joining Steps Step 1: Find the two species which are closest using the modified distance formula below, and join them. Modified distance: Let N be the number of taxa. Let R_i = the sum of the distances from node i to all other nodes, divided by N – 2, and define R_j in the same way for node j. Let D(i,j) = the matrix distance between i and j. The modified distance from i to j is D(i,j) – R_i – R_j. After joining, we have two fewer taxa and one more internal node, for a net of one less node than we started with. Step 2 and following: Repeat step 1 until all nodes are joined. Problem: the new internal node n is not in the original matrix, so its distances to the remaining nodes must be computed. This problem can be solved.
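The modified distance can be computed for the four-taxon matrix used in the UPGMA example with a short Python sketch. The function names here are assumptions made for the illustration. The slides do not say how distances from the new internal node n are obtained; the second function uses the usual neighbor-joining update, D(n,k) = (D(i,k) + D(j,k) – D(i,j)) / 2, which is an addition here rather than something stated in the source.

```python
def modified_distances(names, D):
    """Return (R, M) for a distance table D, where D[i][j] is the distance.

    R[i] is the sum of distances from i to all other nodes, divided by N - 2.
    M[(i, j)] is the modified distance D[i][j] - R[i] - R[j].
    """
    n = len(names)
    if n < 3:
        raise ValueError("need at least three taxa")
    R = {i: sum(D[i][k] for k in names if k != i) / (n - 2) for i in names}
    M = {
        (i, j): D[i][j] - R[i] - R[j]
        for pos, i in enumerate(names)
        for j in names[pos + 1:]
    }
    return R, M


def distances_from_new_node(i, j, names, D):
    """Distances from the node joining i and j to every remaining node.

    Uses the standard NJ update, which is NOT given in the slides.
    """
    return {
        k: (D[i][k] + D[j][k] - D[i][j]) / 2
        for k in names
        if k not in (i, j)
    }


names = ["A", "B", "C", "D"]
matrix = [
    [0, 9, 7, 5],
    [9, 0, 8, 10],
    [7, 8, 0, 8],
    [5, 10, 8, 0],
]
D = {a: {b: matrix[r][c] for c, b in enumerate(names)}
     for r, a in enumerate(names)}

R, M = modified_distances(names, D)
best_pair = min(M, key=M.get)   # pair with the smallest modified distance
new_row = distances_from_new_node(*best_pair, names, D)

print("R:", R)
print("Modified distances:", M)
print("Join:", best_pair, "-> distances from new node:", new_row)
```

For these data the pairs (A, D) and (B, C) tie for the smallest modified distance, which is consistent with an unrooted four-taxon tree in which A and D are neighbors on one side and B and C on the other.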

Final Approach Use the theme of food poisoning to engage both nursing and nutrition student populations Utilize mathematics and bioinformatics tools

Find the Microbial Gene NCBI Search

Choose a Strain

BLAST Basic Local Alignment Search Tool

Paste Sequence, BLAST off!

BLAST Results

BLAST Sequences

GenBank Bases

Constructing a Tree Add sequences

Clustal W Choose the Multiple Sequence Alignment

Choose a Tree Type Choose Rooted and/or Unrooted Submit

Voila! Unrooted Tree

Rooted Tree Which species are the most closely related?

Final Questions How are the data helpful if you are a –Parent? –Restaurant owner? –Hospital director? –Public health inspector?

Assessment Student Learning Outcomes –More comfortable with computation –Using the tools to answer questions –Empowerment (we hope!)
