November 18, 2000, ICTCM 2000
Introductory Biological Sequence Analysis Through Spreadsheets
Stephen J. Merrill
Sandra E. Merrill
Marquette University, Milwaukee, WI

Teaching Mathematics to Students of Biology
The math in the courses needs to correlate with the math needed in that discipline
The most important “math” needed is statistics
The molecular biology revolution presents data in a form (sequences of letters) on which calculus has little impact

The Nature of Biological Sequence Data
The primary structures of DNA, RNA, and proteins are sequences of letters: 4 letters for DNA (ATGC) and RNA (AUGC), and 20 letters for the amino acids that make up a protein
Secondary and tertiary structures (bending, folding, and twisting) determine function; hints of them can be seen in the primary structure

Use of Spreadsheets in This Setting
Commonly found and used in biological labs for data acquisition, storage and organization, and data analysis
Commonly present on student computers and in computer labs
Unlike calculators, able to handle data sets typical of “real world” applications
R.F. Murphy at CMU has developed a set of worksheets for sequence analysis

Meaningful Questions & Problems
1. Measuring the similarity between two strings -- “alignment” or “homology”
2. Finding instances of a pattern in a string
3. Describing the composition and properties of a string
4. Graphing the evolutionary process and construction of phylogenetic trees

Measuring the Similarity between Strings
Given a gene, suggest the function of the protein it codes for by finding a similar sequence (possibly in another species)
Simple homology assigns a “1” for agreement and a “0” for disagreement at each site, then sums over all sites
Homology is that sum expressed as a percentage of the highest possible score

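The scoring rule on this slide can be sketched in a few lines of Python. This is only an illustration of the definition, not a reproduction of the spreadsheet itself; the function name `simple_homology` and the two example sequences are made up for the example, and the sequences are assumed to be already aligned and of equal length.

```python
def simple_homology(seq_a: str, seq_b: str) -> float:
    """Percent of sites at which two equal-length sequences agree."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    if not seq_a:
        raise ValueError("sequences must be non-empty")
    # Score 1 for agreement and 0 for disagreement at each site, then sum.
    score = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    # The highest possible score equals the number of sites.
    return 100.0 * score / len(seq_a)


# Hypothetical 8-letter sequences that agree at 6 of 8 sites.
print(simple_homology("ATGCATGC", "ATGAATGG"))  # 75.0
```
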
Spreadsheet #1: Simple Homology

Spreadsheet #1 (cont.): comparing random sequences

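One way to run the random-sequence comparison outside a spreadsheet is sketched below. It assumes the four DNA letters are equally likely and independent at each site, in which case two random sequences are expected to agree at about 25% of sites. The sequence length, number of trials, and seed are arbitrary choices for illustration and are not taken from the slides.

```python
import random


def random_dna(length: int, rng: random.Random) -> str:
    """Random DNA string with each of A, T, G, C equally likely (an assumption)."""
    return "".join(rng.choice("ATGC") for _ in range(length))


def percent_identity(a: str, b: str) -> float:
    """Simple homology: percent of sites at which two equal-length strings agree."""
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


rng = random.Random(2000)  # fixed seed so the experiment is repeatable
scores = [
    percent_identity(random_dna(100, rng), random_dna(100, rng))
    for _ in range(1000)
]
mean_score = sum(scores) / len(scores)
print(f"mean homology of random pairs: {mean_score:.1f}%")
print(f"range: {min(scores):.0f}% to {max(scores):.0f}%")
```

The spread of the scores gives students a baseline: a homology well above what random pairs produce is evidence that two sequences are related.
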
Finding Instances of a Particular Pattern in a String
Locating genes involves finding regions of the DNA sequence that contain patterns resembling those of known genes
Identifying sites on DNA where a restriction enzyme can cleave; the sizes of the resulting fragments are also of interest
Identifying regions of RNA that correspond to particular features (e.g. loops) and may be splice sites

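The restriction-site search can be sketched as a plain substring search followed by a subtraction of cut positions. The example uses the EcoRI recognition sequence GAATTC, which is cut after the leading G; the DNA string is invented for the example, and the sketch assumes a linear molecule and looks at one strand only.

```python
def find_sites(dna: str, pattern: str) -> list[int]:
    """0-based start positions of every (possibly overlapping) match."""
    positions = []
    start = dna.find(pattern)
    while start != -1:
        positions.append(start)
        start = dna.find(pattern, start + 1)
    return positions


def fragment_lengths(dna: str, pattern: str, cut_offset: int) -> list[int]:
    """Lengths of the pieces left after cutting a linear molecule at each site.

    cut_offset is the number of letters of the pattern left of the cut.
    """
    cuts = [p + cut_offset for p in find_sites(dna, pattern)]
    bounds = [0] + cuts + [len(dna)]
    return [right - left for left, right in zip(bounds, bounds[1:])]


dna = "TTGAATTCAAGGGAATTCCC"  # hypothetical 20-letter sequence
print(find_sites(dna, "GAATTC"))           # [2, 12]
print(fragment_lengths(dna, "GAATTC", 1))  # [3, 10, 7]
```
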
Describing the Composition and Properties of a String
Counts and frequencies of particular letters, of interest because of their properties (e.g. regions rich in G and C or in A and T in DNA)
Properties of proteins (e.g. charge or hydrophobicity) that depend on the nature and frequencies of the particular amino acids

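Letter counts and G&C-rich regions can be sketched as follows. The function names, the window length, and the example string are illustrative assumptions; a sliding window is one common way to locate rich regions, not necessarily the one used in the talk.

```python
from collections import Counter


def gc_fraction(seq: str) -> float:
    """Fraction of letters in seq that are G or C."""
    counts = Counter(seq)
    return (counts["G"] + counts["C"]) / len(seq)


def windowed_gc(seq: str, window: int) -> list[float]:
    """G+C fraction in every window of the given length, sliding one site at a time."""
    return [
        gc_fraction(seq[i:i + window])
        for i in range(len(seq) - window + 1)
    ]


seq = "ATATGCGCGC"  # hypothetical sequence: A&T-rich start, G&C-rich end
print(Counter(seq))         # counts of each letter
print(gc_fraction(seq))     # 0.6
print(windowed_gc(seq, 4))  # [0.0, 0.25, 0.5, 0.75, 1.0, 1.0, 1.0]
```
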
Spreadsheet #2: Hydropathy Plot

Spreadsheet #2 (cont.)

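A hydropathy plot is usually built by assigning each amino acid a hydrophobicity value and averaging over a sliding window. The sketch below assumes the Kyte-Doolittle scale and a window of 9 residues by default; the slides do not say which scale or window the spreadsheet uses, so both are assumptions, and the peptide in the example is invented.

```python
# Kyte-Doolittle hydropathy values (assumed scale; positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein: str, window: int = 9) -> list[float]:
    """Mean hydropathy over each window of residues, one value per window start."""
    if window < 1 or window > len(protein):
        raise ValueError("window must be between 1 and the protein length")
    values = [KYTE_DOOLITTLE[aa] for aa in protein]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Hypothetical peptide: a hydrophobic stretch followed by a charged one.
profile = hydropathy_profile("LLLLLKKKKK", window=5)
print([round(v, 2) for v in profile])
```

Plotting the returned list against window position, in a spreadsheet chart or any plotting tool, gives the hydropathy plot; sustained positive stretches suggest hydrophobic regions.
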
Graphing Evolution and Phylogenetic Trees
Evolutionary distance between two DNA sequences is used to study how the sequences change over time (e.g. the evolution of HIV or the flu viruses)
Trees are constructed to express the relationships among related sequences; distance in the tree is a monotone function of homology

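As a sketch of how a distance can be derived from homology, the code below computes the fraction of differing sites p (one minus the homology fraction) and then applies the Jukes-Cantor correction d = -(3/4) ln(1 - 4p/3), which increases as homology decreases. The slides do not specify which function is used, so the choice of Jukes-Cantor is an assumption, and the three sequences are invented. A pairwise table like this is the usual input to a tree-building method.

```python
import math


def p_distance(a: str, b: str) -> float:
    """Fraction of sites that differ (1 minus the homology fraction)."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def jukes_cantor(p: float) -> float:
    """Jukes-Cantor distance; grows monotonically as homology falls."""
    if p >= 0.75:
        raise ValueError("distance is undefined when p >= 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def distance_matrix(seqs: dict[str, str]) -> dict[tuple[str, str], float]:
    """Jukes-Cantor distance for every ordered pair of named sequences."""
    return {
        (m, n): jukes_cantor(p_distance(seqs[m], seqs[n]))
        for m in seqs
        for n in seqs
    }


# Hypothetical aligned sequences.
seqs = {"s1": "ATGCATGCAT", "s2": "ATGCATGCAA", "s3": "ATGAATGGAA"}
matrix = distance_matrix(seqs)
for (m, n), d in matrix.items():
    if m < n:
        print(m, n, round(d, 3))
```
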
Spreadsheet #3: Mutation & Evolution

Spreadsheet #3 (cont.)
To study the evolution of a sequence, we randomly pick a site for mutation, then change its letter

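The mutation step described above can be sketched as follows. The code assumes that every site is equally likely to be picked and that the new letter is drawn uniformly from the three letters other than the current one; the slides do not state the rule the spreadsheet uses. The ancestor sequence, number of steps, and seed are arbitrary.

```python
import random


def mutate_once(seq: str, rng: random.Random, alphabet: str = "ATGC") -> str:
    """Pick one site at random and replace its letter with a different one."""
    site = rng.randrange(len(seq))
    choices = [c for c in alphabet if c != seq[site]]
    return seq[:site] + rng.choice(choices) + seq[site + 1:]


def evolve(seq: str, steps: int, rng: random.Random) -> list[str]:
    """Return the ancestor followed by the sequence after each mutation."""
    history = [seq]
    for _ in range(steps):
        seq = mutate_once(seq, rng)
        history.append(seq)
    return history


def percent_identity(a: str, b: str) -> float:
    """Simple homology between two equal-length sequences, in percent."""
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


rng = random.Random(18)  # fixed seed so the run is repeatable
ancestor = "ATGC" * 10   # hypothetical 40-letter ancestor
history = evolve(ancestor, 60, rng)
for step in (0, 10, 20, 40, 60):
    print(step, percent_identity(ancestor, history[step]))
```

Because a site can be hit more than once, homology to the ancestor falls more slowly than the number of mutations suggests, which motivates corrected distance measures.
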
Conclusion
Use of a spreadsheet makes possible an experimental approach to introducing the mathematics of sequence analysis
Spreadsheets make it possible to use real-world data and present the computational tool in a meaningful context
The importance of these topics to all educated individuals suggests that they be included in many liberal arts math courses