Published by Marlene Lane. Modified over 9 years ago.
1
© 2005 Prentice Hall Inc. / A Pearson Education Company / Upper Saddle River, New Jersey 07458
Chapter 6: The Computational Foundations of Genomics
Applying algorithms to analyze genomics data
2
Contents
  What are computational biology and bioinformatics?
  Understanding computers and algorithms
  Sequence alignment
  Gene prediction
  Algorithms for analysis of phylogeny
  Analysis of microarray data
  Computer simulation
3
Computational Biology and Bioinformatics
  Computational biology
    Development of computational methods to solve problems in biology
  Bioinformatics
    Application of computational biology to analysis and management of real molecular biology data
  Why do molecular biologists need computer science?
    Discrete nature of sequence data is ideal for analysis using digital computers
    Size and complexity of genomics data make the data impossible to analyze without computers
4
Algorithm
  An algorithm is a procedure (a finite set of well-defined instructions) for accomplishing some task
  A recipe to perform a task
  Algorithms often have steps that repeat (iterate) or require decisions (such as logic or comparison)
  Algorithms can be composed to create more complex algorithms
  The formal concept dates to 1936
5
A historical perspective
  The 1960s: the birth of bioinformatics
    High-level computer languages
    Protein sequence data
    Academic access to computers
  Margaret Oakley Dayhoff
    First protein database
    First program for sequence assembly
  [Figure: IBM 7090 computer]
6
Solving problems in computer science
  Necessary parameters for assessing the difficulty of a computer science problem
    Algorithmic complexity
      Is the problem theoretically solvable?
      If so, what is the most efficient solution?
    Current state of computer technology
      Memory
      CPU speed
      Cost
7
Algorithmic problems
  Example: searching for a number in an unordered list
    If the list has N numbers, the average search time is proportional to N
  A more clever approach
    Place the numbers in order
    Do a binary search
      Step 1: Pick the number in the middle of the list
      Step 2: Restrict the search to the half that contains your number
      Return to Step 1 until you find your number
    Search time for this approach is proportional to log2(N)
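The two search strategies above can be sketched as follows. This is a minimal illustration, not taken from the slides; the function names and the step counters are assumptions added to make the N versus log2(N) behavior visible.

```python
def linear_search(values, target):
    """Scan an unordered list; return (index, comparisons), index -1 if absent."""
    comparisons = 0
    for i, value in enumerate(values):
        comparisons += 1
        if value == target:
            return i, comparisons
    return -1, comparisons


def binary_search(sorted_values, target):
    """Search a sorted list; return (index, steps), index -1 if absent."""
    lo, hi = 0, len(sorted_values) - 1
    steps = 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2          # Step 1: pick the middle element
        if sorted_values[mid] == target:
            return mid, steps
        if sorted_values[mid] < target:
            lo = mid + 1              # Step 2: keep the upper half
        else:
            hi = mid - 1              # Step 2: keep the lower half
    return -1, steps


values = list(range(0, 2048, 2))      # 1024 sorted even numbers
print(linear_search(values, 2046))    # worst case: every element is examined
print(binary_search(values, 2046))    # at most 11 steps for 1024 elements
```

The binary search only pays off once the list is in order, so it suits data that is sorted once and searched many times.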
8
The digital computer
  Represents everything in a code of zeros and ones
  Computer architecture
    CPU
    Memory
    Input / Output
  Advantages of the digital computer
    Deterministic
    Minimization of noise
  [Figure: block diagram of Input, CPU, Memory, and Output]
9
The limitations of digital computers
  The limitations of digital computers are conceptual, not just technological
  Digital computers are deterministic
    Incapable of truly random behavior
  Digital computers deal with strictly discrete values
    Can only approximate continuous behavior
    Many interesting biological phenomena occur in the continuous realm of space and time
10
Sequence databases
  What is a database?
    An indexed set of records
    Records retrieved using a query language
    Database technology is well established
  Examples of sequence databases
    GenBank
      Encompasses all publicly available protein and nucleotide sequences
    Protein Data Bank
      Contains 3-D structures of proteins
11
The client-server model
  The clients and servers are software processes
  Clients request data from servers
  Servers and clients can reside on the same or different machines
  Clients can act as servers to other processes and vice versa
  [Figure: Web browser, Web server, BLAST search engine, Database]
12
Sequence alignment
  Sequence alignments search for matches between sequences
  Two broad classes of sequence alignments
    Global (wide)
    Local (narrow)
  Alignment can be performed between two or more sequences
  [Figure: global alignment of QKESGPSSSYC with VQQESGLVRTTC; local alignment of the shared segment ESG]
13
The biological importance of sequence alignment
  Sequence alignments assess the degree of similarity between sequences
  Similar sequences suggest similar function
    Proteins with similar sequences are likely to play similar biochemical roles
    Regulatory DNA sequences that are similar will likely have similar roles in gene regulation
  Sequence similarity suggests evolutionary history
    Fewer differences mean more recent divergence
14
The algorithmic problem of aligning sequences
  Comparison of similar sequences of similar length is straightforward
  How does one deal with insertions and gaps that may hide true similarity?
  How does one interpret minimal similarity?
    Are sequences actually related?
    Is alignment by chance?
  Example sequence pairs
    QQESGPVRSTC / QKGSYQEKGYC
    QQESGPVRSTC / RQQEPVRSTC
    QQESGPVRSTC / QKESGPSRSYC
15
Methods of sequence alignment
  Graphical methods: visual
  Dynamic-programming methods: mathematically optimal but slow
  Heuristic methods: approximate, but usually close to the optimal answer
16
Dot matrix analysis
  A graphical method
  Shows all possible alignments
  Caveats
    Some guesswork in picking parameters
      Window size
      Stringency
    Not as rigorous or quantitative as other methods
  [Figure: dot matrix comparing QQESGPVRSTC with RQQEPVRSTC]
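A dot matrix with the two parameters named above can be sketched as follows. The definition used here (a dot at position (i, j) when at least `stringency` of the `window` residues starting there are identical) is one common convention and is an assumption; the slides do not spell out the exact rule. The function names are hypothetical.

```python
def dot_matrix(seq_a, seq_b, window=1, stringency=1):
    """Return the set of (i, j) start positions that receive a dot.

    Assumed rule: a dot is placed when at least `stringency` of the
    `window` residues starting at seq_a[i] and seq_b[j] are identical.
    """
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            if matches >= stringency:
                dots.add((i, j))
    return dots


def render(seq_a, seq_b, dots):
    """Draw the matrix as text: seq_b across the top, seq_a down the side."""
    lines = ["  " + seq_b]
    for i, residue in enumerate(seq_a):
        row = "".join("*" if (i, j) in dots else "." for j in range(len(seq_b)))
        lines.append(residue + " " + row)
    return "\n".join(lines)


a, b = "QQESGPVRSTC", "RQQEPVRSTC"
print(render(a, b, dot_matrix(a, b, window=1, stringency=1)))
print(render(a, b, dot_matrix(a, b, window=3, stringency=3)))
```

With a window of 1 every chance identity produces a dot; raising the window and stringency leaves only the diagonal runs, which is the noise reduction the parameters are meant to provide.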
17
Dot matrix analysis: a real example
  [Figure panels: Window size 23, Stringency 15; Window size 1, Stringency 1; Noise to signal ratio]
18
Devising a scoring system
  Scoring matrices allow biologists to quantify the quality of sequence alignments
  Use different scoring matrices for different purposes
    Score for similar structural domains in proteins
    Score for evolutionary relationship
  Some popular scoring matrices
    PAM for evolutionary studies (Percent Accepted Mutation)
    BLOSUM for finding common motifs (BLOcks amino acid SUbstitution Matrix)
19
An example of scoring
  BLOSUM62 (excerpt)
         A   R   N   D   C   Q   E
     A   4  -1  -2  -2   0  -1  -1
     R  -1   5   0  -2  -3   1   0
     N  -2   0   6   1  -3   0   0
     D  -2  -2   1   6  -3   0   2
     C   0  -3  -3  -3   9  -3  -4
     Q  -1   1   0   0  -3   5   2
     E  -1   0   0   2  -4   2   5
  A sequence comparison
     A / A    4
     D / Q    0
     D / E    2
     R / R    5
     Q / Q    5
     C / E   -4
     R / Q    1
     A / A    4
     D / Q    0
  Total score: 18
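Scoring an ungapped alignment with a substitution matrix amounts to summing one matrix entry per aligned pair. The sketch below uses the standard BLOSUM62 values for the seven residues shown on the slide; the two short sequences and the function name are hypothetical and chosen only for illustration.

```python
# Standard BLOSUM62 entries for the residues A, R, N, D, C, Q, E.
RESIDUES = "ARNDCQE"
ROWS = [
    [4, -1, -2, -2, 0, -1, -1],   # A
    [-1, 5, 0, -2, -3, 1, 0],     # R
    [-2, 0, 6, 1, -3, 0, 0],      # N
    [-2, -2, 1, 6, -3, 0, 2],     # D
    [0, -3, -3, -3, 9, -3, -4],   # C
    [-1, 1, 0, 0, -3, 5, 2],      # Q
    [-1, 0, 0, 2, -4, 2, 5],      # E
]
BLOSUM62_SUBSET = {
    (a, b): ROWS[i][j]
    for i, a in enumerate(RESIDUES)
    for j, b in enumerate(RESIDUES)
}


def score_ungapped(seq_a, seq_b, matrix=BLOSUM62_SUBSET):
    """Sum the substitution scores of residues aligned position by position."""
    if len(seq_a) != len(seq_b):
        raise ValueError("ungapped scoring needs sequences of equal length")
    return sum(matrix[(a, b)] for a, b in zip(seq_a, seq_b))


# Hypothetical example: A/A = 4, R/R = 5, N/Q = 0, D/E = 2
print(score_ungapped("ARND", "ARQE"))  # 11
```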
20
Dynamic programming (DP)
  Possibility of gaps (or insertions) makes the number of possible sequence alignments astronomical
  Dynamic programming makes sequence alignment possible by abandoning low-scoring alignments among subsequences as the algorithm progresses
  Mathematically proven to provide optimal alignments
  DP algorithms for sequence alignment
    Needleman-Wunsch-Gotoh algorithm for global alignments
    Smith-Waterman algorithm for local alignments
  DP alignment algorithms are still too slow for searching an entire sequence database
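The dynamic-programming idea can be sketched with a score-only version of both algorithms. This is a simplified illustration under stated assumptions: it uses a fixed match/mismatch score and a linear gap penalty (the Gotoh refinement uses affine gap penalties, which are not implemented here), and the parameter values are arbitrary.

```python
def needleman_wunsch_score(seq_a, seq_b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score with a linear gap penalty."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # leading gaps in seq_b
    for j in range(1, cols):
        score[0][j] = j * gap          # leading gaps in seq_a
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,   # align the two residues
                score[i - 1][j] + gap,        # gap in seq_b
                score[i][j - 1] + gap,        # gap in seq_a
            )
    return score[-1][-1]


def smith_waterman_score(seq_a, seq_b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score: cells never drop below zero."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                0,                            # start a new local alignment
                score[i - 1][j - 1] + pair,
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
            best = max(best, score[i][j])
    return best


print(needleman_wunsch_score("QQESGPVRSTC", "RQQEPVRSTC"))
print(smith_waterman_score("QKESGPSSSYC", "VQQESGLVRTTC"))
```

Each cell keeps only the best score of the alignments ending there, which is how lower-scoring alternatives are abandoned. The table has one cell per residue pair, so the work grows with the product of the sequence lengths.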
21
Heuristic methods with k-tuples
  Example: BLAST
    Using the query sequence, derive a list of words (tuples) of length w (e.g., 3)
    Keep high-scoring matching words
    High-scoring words are compared with database sequences
    Sequences with many matches to high-scoring words are used for final alignments
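The word-based filtering step can be sketched as follows. This is a deliberately simplified illustration and not BLAST itself: it counts only exact word matches, whereas BLAST also scores similar words against a threshold and then extends hits. The function names and the tiny in-memory database are hypothetical.

```python
from collections import defaultdict


def words(sequence, w=3):
    """All overlapping words (k-tuples) of length w in the sequence."""
    return [sequence[i:i + w] for i in range(len(sequence) - w + 1)]


def build_index(database, w=3):
    """Map each word to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos, word in enumerate(words(seq, w)):
            index[word].append((name, pos))
    return index


def count_word_hits(query, index, w=3):
    """Count exact word matches between the query and each database sequence."""
    hits = defaultdict(int)
    for word in words(query, w):
        for name, _pos in index.get(word, []):
            hits[name] += 1
    return dict(hits)


# Hypothetical two-sequence database built from sequences used in the slides.
database = {"seq1": "RQQEPVRSTC", "seq2": "QKGSYQEKGYC"}
index = build_index(database)
print(count_word_hits("QQESGPVRSTC", index))  # {'seq1': 5}
```

Only the sequences with many word hits would go on to the slower alignment step, which is what lets a heuristic search skip most of the database.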
22
Statistical significance
  Chance alignments have no biological significance
  Statistical significance implies a low probability of generating the alignment by chance
  Probability of long chance alignments increases with sequence length
  The extreme-value distribution
    Used to calculate the probability of a chance alignment (the basis of the E value)
    Generated by calculating the scores that result from repeatedly scrambling one of the sequences being compared
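The scrambling procedure can be sketched as follows. This is an assumption-laden toy: it uses a simple count of identical positions instead of a real alignment score, and it reports an empirical fraction of shuffled scores rather than fitting an extreme-value distribution. All names are hypothetical.

```python
import random


def identity_score(seq_a, seq_b):
    """Toy score: number of identical residues at the same position (no gaps)."""
    return sum(a == b for a, b in zip(seq_a, seq_b))


def shuffled_scores(seq_a, seq_b, trials=1000, seed=0):
    """Scores obtained after repeatedly scrambling seq_b."""
    rng = random.Random(seed)          # seeded so the run is repeatable
    residues = list(seq_b)
    scores = []
    for _ in range(trials):
        rng.shuffle(residues)
        scores.append(identity_score(seq_a, "".join(residues)))
    return scores


def empirical_p_value(observed, scores):
    """Fraction of shuffled scores at least as high as the observed one."""
    as_good = sum(s >= observed for s in scores)
    return (as_good + 1) / (len(scores) + 1)   # +1 avoids reporting exactly zero


a, b = "QQESGPVRSTC", "QKESGPSRSYC"
observed = identity_score(a, b)
background = shuffled_scores(a, b)
print(observed, empirical_p_value(observed, background))
```

A small fraction means that scrambled versions of the sequence rarely score as well as the real one, so the observed similarity is unlikely to be a chance alignment.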
23
A practical example of sequence alignment
  MASH-1, a transcription factor
24
BLAST results
25
Detailed BLAST results
26
A pairwise alignment with MASH-1
  HASH-2, a human homolog of MASH-1
  "+" indicates a conservative amino acid substitution
  "-" indicates a gap/insertion
  XXXX… shows areas of low complexity (common occurrence)
27
Multiple-sequence alignments
  Uses of multiple-sequence alignments
    Automated reconstruction of sequence fragments
    Phylogenetic analysis
    Identification of sequence families
  The problem of multiple-sequence alignment
    O(N^M), where N is the average sequence length and M is the number of sequences being aligned (optimal methods)
    Dynamic programming will work only for small M
    Heuristic methods are required
28
Some methods for global multiple-sequence alignment
  Progressive methods
    Align the most closely related sequences first, and then the less related ones
    Use phylogenetic trees to quantify similarities
    Downside: poor results with distantly related sequences
  Iterative methods
    Start with a progressive alignment
    Realign sequences after leaving one sequence out
    Add the left-out sequence
    Repeat until an acceptable alignment is achieved
  Probabilistic methods
    Hidden Markov models (discussed later)
29
Phylogenetic analysis
  Phylogenetic trees
    Describe evolutionary relationships between sequences
  Three common methods
    Maximum parsimony
    Distance
    Maximum likelihood
  [Figure: phylogenetic tree of human immunodeficiency viruses from around the world]
30
Comparison of methods for phylogenetic analysis
  Maximum parsimony (machine input) (closely related sequences)
    Finds the optimal tree (or trees) requiring the minimum number of substitutions to explain sequence variation
  Maximum likelihood (user input) (distantly related sequences)
    Finds the most probable tree
    Similar to maximum parsimony
  Distance (mix of closely and distantly related sequences)
    Compares pairs of sequences for the number of differences between them
  Use many methods to get a consensus tree
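The pairwise comparison that distance methods start from can be sketched as follows. It assumes the sequences are already aligned to equal length and simply counts differing positions; it does not build a tree. Function names and sequence labels are hypothetical.

```python
from itertools import combinations


def count_differences(seq_a, seq_b):
    """Number of positions at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b))


def distance_matrix(aligned):
    """Pairwise difference counts, keyed by a sorted pair of sequence names."""
    names = sorted(aligned)
    return {
        (a, b): count_differences(aligned[a], aligned[b])
        for a, b in combinations(names, 2)
    }


def closest_pair(distances):
    """The pair of sequences with the fewest differences."""
    return min(distances, key=distances.get)


aligned = {
    "s1": "QQESGPVRSTC",
    "s2": "QKESGPSRSYC",
    "s3": "QKGSYQEKGYC",
}
distances = distance_matrix(aligned)
print(distances)
print(closest_pair(distances))  # ('s1', 's2')
```

A tree-building procedure would typically join the closest pair first and then repeat on the reduced matrix, which ties in with the idea that fewer differences mean more recent divergence.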
31
Algorithmic complexity and phylogenetic analysis
  Four steps
    Sequence alignment
    Substitution model (scoring matrix)
    Tree building
    Tree evaluation
  Tree building and evaluation are computationally expensive
    Heuristic methods required in most cases
32
Gene prediction
  A problem of pattern recognition
  Algorithms look for features of genes
    E.g., splice sites, ORFs, starting methionine
  Identification of regulatory regions is difficult
  Statistical understanding of genes is ongoing
  Problems of this type require machine learning algorithms: learn the pattern from a small dataset
33
Central Dogma in Molecular Biology
34
Artificial neural networks
  Machine learning algorithms that mimic the brain
  Connections between "neurons" vary in strength
  Connection weights (w_ij) (strength) change while the network is exposed to a training set
  Fully trained network recognizes patterns in novel input
  GRAIL
  [Figure: a feed-forward neural network]
35
Hidden Markov models
  Can be used for machine learning
  Units constitute transition states
  Transitions not dependent on history
  Many uses in genomics
    Gene prediction
    Multiple sequence alignment
    Finding periodic patterns
36
HMMs
The example of a dishonest gambler is often used to illustrate hidden Markov models (HMMs). The gambler may carry a loaded die that he or she occasionally substitutes for a fair die, but not so often that the other players would notice. The fair die has a one-in-six chance of showing any particular number. The loaded die has a 50% chance of rolling a one and a 10% chance of rolling each of the other numbers. Stochastic models called hidden Markov models are useful in situations like this because they take into account unknown (or hidden) states.
Exactly when the cheating gambler is using the fair or the loaded die is hidden from the other players, but insight may still be gained by looking at the outcome of the cheater's rolls. Three ones in a row have a 12.5% chance (0.5^3) with the loaded die but only about a 0.5% chance ((1/6)^3) with the fair die, so such a run suggests that the loaded die is in use.
Hidden Markov models describe the probability of transitions between hidden states, as well as the probabilities associated with each state. In the example of the cheating gambler, an HMM would describe the probabilities of rolling particular numbers given the loaded or fair die and the probability that the dishonest gambler would switch from one die to another.
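The two probabilities quoted for three ones in a row can be checked with exact fractions. The die probabilities are those given in the text; the variable and function names are illustrative only.

```python
from fractions import Fraction

# Emission probabilities stated in the text.
FAIR = {face: Fraction(1, 6) for face in range(1, 7)}
LOADED = {1: Fraction(1, 2), **{face: Fraction(1, 10) for face in range(2, 7)}}


def probability_of_rolls(die, rolls):
    """Probability of a sequence of independent rolls with a single die."""
    p = Fraction(1)
    for roll in rolls:
        p *= die[roll]
    return p


p_loaded = probability_of_rolls(LOADED, [1, 1, 1])   # 1/8   = 12.5%
p_fair = probability_of_rolls(FAIR, [1, 1, 1])       # 1/216, about 0.46%
print(float(p_loaded), float(p_fair), p_loaded / p_fair)
```

The ratio of the two values is 27, so three ones in a row are 27 times more probable with the loaded die than with the fair one.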
37
HMMs continued
Hidden Markov models can be used to answer three types of questions. The first type is the likelihood question: Given a particular HMM, what is the probability of obtaining a particular outcome (e.g., rolling three ones)? The second type is the decoding question: Given a particular HMM, what is the most likely sequence of transitions between states for a particular outcome? In the case of the cheating gambler, this sequence would be the order in which he or she switched from one die to the other. The third type is the learning question: Given a particular outcome and a set of assumptions about the possible states, what are the best model parameters (e.g., the transition probabilities between states)? This third question allows HMMs to be used for machine learning.
The figure in the slide shows a simple example of a hidden Markov model being used to account for the DNA sequence at the bottom. Every HMM has a start and an end state, denoted by S and E, respectively, in the slide. Hidden states lie between the start and end states. In the figure, the squares are states, and the lines between them indicate the probability of one state transitioning to another. The loops on the upper and lower states show the probabilities associated with the state remaining the same. States transition back and forth until the HMM reaches the end state. In this HMM, the top square represents a state that has equal probabilities of generating A, G, C, or T. The bottom state has probabilities of 0.1, 0.1, 0.1, and 0.7 of generating A, G, C, and T, respectively.
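The likelihood and decoding questions can be sketched for the dishonest-gambler model with the forward and Viterbi algorithms. The emission probabilities come from the earlier text; the start and transition probabilities below are assumptions chosen for illustration, since the slides do not give them, and the explicit end state is omitted for brevity.

```python
STATES = ("fair", "loaded")
START = {"fair": 0.5, "loaded": 0.5}                       # assumed
TRANS = {                                                  # assumed
    "fair": {"fair": 0.95, "loaded": 0.05},
    "loaded": {"fair": 0.10, "loaded": 0.90},
}
EMIT = {                                                   # from the text
    "fair": {face: 1 / 6 for face in range(1, 7)},
    "loaded": {face: (0.5 if face == 1 else 0.1) for face in range(1, 7)},
}


def forward_likelihood(rolls):
    """Likelihood question: probability of the rolls, summed over all state paths."""
    alpha = {s: START[s] * EMIT[s][rolls[0]] for s in STATES}
    for roll in rolls[1:]:
        alpha = {
            s: EMIT[s][roll] * sum(alpha[p] * TRANS[p][s] for p in STATES)
            for s in STATES
        }
    return sum(alpha.values())


def viterbi_path(rolls):
    """Decoding question: the single most probable sequence of hidden states."""
    best = {s: (START[s] * EMIT[s][rolls[0]], [s]) for s in STATES}
    for roll in rolls[1:]:
        new_best = {}
        for s in STATES:
            prob, path = max(
                ((best[p][0] * TRANS[p][s], best[p][1]) for p in STATES),
                key=lambda item: item[0],
            )
            new_best[s] = (prob * EMIT[s][roll], path + [s])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


print(forward_likelihood([1, 1, 1]))
print(viterbi_path([1, 1, 1, 1, 1, 1]))
print(viterbi_path([3, 5, 2, 6, 4, 2]))
```

The learning question would adjust START, TRANS, and EMIT to fit observed rolls and is not shown here. For long sequences the products become very small, so practical implementations usually work with logarithms.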
39
HMMs for gene prediction
  HMMs are trained on sequences that are members of a known gene class
  HMM gives the probability that a particular sequence belongs to the gene class
  Length of the bar indicates probability
    The longer the bar, the higher the probability
  [Figure labels: Genscan; 2000 human introns]
40
Algorithms for secondary-structure determination
  Chou-Fasman / GOR method
    Based on experimentally determined frequency of amino acids in secondary structures
  Machine learning algorithms
    Neural networks
    Nearest-neighbor methods
    Trained on previously deduced structures to detect amino acid patterns in secondary structures
41
Analysis of microarray data
  Microarrays can measure the expression of thousands of genes simultaneously
  Vast amounts of data require computers
  Types of analysis
    Gene-by-gene
      Method: Statistical techniques
    Categorizing groups of genes
      Method: Clustering algorithms
    Deducing patterns of gene regulation
      Method: Under development
42
Unsupervised techniques
  Make no assumptions about how the data should behave
  Cluster genes based on similar patterns of gene expression
  Examples
    Hierarchical clustering
    Principal components analysis (PCA)
  [Figure panels: Hierarchical clustering; PCA]
43
Metrics for gene expression
  Need a method to measure how similar genes are based on expression
  Examples
    Euclidean distance
    Pearson correlation coefficient
  [Figure panels: Euclidean distance; Pearson correlation coefficient]
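The two metrics can be sketched as follows for expression profiles given as equal-length lists of measurements. The function names and the example profiles are hypothetical.

```python
import math


def euclidean_distance(x, y):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_correlation(x, y):
    """Pearson correlation coefficient between two expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(spread_x * spread_y)


# Hypothetical profiles: gene_b follows the same trend at ten times the level.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]
print(euclidean_distance(gene_a, gene_b))   # large: the magnitudes differ
print(pearson_correlation(gene_a, gene_b))  # close to 1: the shapes agree
```

The example shows why the choice of metric matters: Euclidean distance is sensitive to absolute expression level, whereas Pearson correlation compares only the shape of the profiles.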
44
Supervised techniques
  Divide groups of genes based on sample properties
  Can predict sample condition based on gene expression pattern
  Examples
    Support vector machine
    Nearest neighbor
  [Figure panels: Nearest neighbor; Support vector machine]
45
The usefulness of simulation
  Why simulate when you can experiment?
    Models involving many parameters may be difficult to conceptualize without simulations
    A simulation may suggest ways of testing a hypothesis
    Some experiments cannot be done in vivo or in vitro, and must therefore be done in silico
    If a simulation is good, it can be used in place of more expensive or time-consuming experiments
      Example: nuclear experiments by the US
46
Numerical methods
  Numerical methods are needed because of the discrete nature of computers
  Differential equations are turned into difference equations that deal with discrete rather than continuous quantities
  Smaller steps lead to greater simulation accuracy
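Turning a differential equation into a difference equation can be sketched with the simplest scheme, the forward Euler method. The decay equation dx/dt = -k*x is chosen here only because its exact solution is known, which makes the error measurable; it is an illustrative assumption, not an example from the slides.

```python
import math


def euler_decay(x0, k, t_end, steps):
    """Approximate dx/dt = -k*x using x[n+1] = x[n] - k*x[n]*dt."""
    dt = t_end / steps
    x = x0
    for _ in range(steps):
        x = x - k * x * dt
    return x


def absolute_error(steps, x0=1.0, k=1.0, t_end=1.0):
    """Difference between the Euler estimate and the exact value x0*exp(-k*t)."""
    exact = x0 * math.exp(-k * t_end)
    return abs(euler_decay(x0, k, t_end, steps) - exact)


for steps in (10, 100, 1000):
    print(steps, absolute_error(steps))
```

The error shrinks as the number of steps grows, at the price of proportionally more calculations. That trade-off is why accurate simulation of continuous behavior can become computationally expensive.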
47
Examples of computer simulations in biology
  Gene regulatory networks
  Simulations of cells
  Networks of neurons
  Population genetics
  [Figure: a model of gene regulation]
48
Prospects for a fully simulated cell
49
Limitations of computer simulation
  Algorithmic
    Computers can process only discrete values
    Simulating continuous behavior accurately often requires an infeasible number of calculations
  Experimental
    A simulation is only as good as the data it is based on
    Critical data are often missing from the simulation
  Conceptual
    Overly complex simulations do not contribute to understanding of a biological system
50
Summary
  Vast amounts of data require bioinformatics
  Bioinformatics methods are limited by the following:
    Algorithmic complexity of bioinformatics problems
    Computer hardware performance
  Heuristic methods are used to get around these limitations
  Bioinformatics methods are used in the following areas:
    Sequence alignment
    Phylogenetic-tree construction
    Gene prediction
    Secondary-structure determination
    Analysis of microarray data
    Simulation of biological systems