1
© 2005 Prentice Hall Inc. / A Pearson Education Company / Upper Saddle River, New Jersey 07458
Chapter 6: The Computational Foundations of Genomics
  Applying algorithms to analyze genomics data
2
Contents
  Sequence alignment
  Gene prediction
  Algorithms for analysis of phylogeny
  Analysis of microarray data
3
Computational Biology and Bioinformatics
  Computational biology
    Development of computational methods to solve problems in biology
  Bioinformatics
    Application of computational biology to analysis and management of real data
  Why do biologists need computer science?
    The discrete nature of sequence data is ideal for analysis using digital computers
    The size and complexity of genomics data make the data impossible to analyze without computers
4
Algorithmic problems
  Example: searching for a number in an unordered list
    If the list has N numbers, the average search time is proportional to N
  A more clever approach
    Place the numbers in order
    Do a binary search
      Step 1: Pick the number in the middle of the list
      Step 2: Restrict the search to the half that contains your number
      Return to Step 1 until you find your number
    Time for this approach is proportional to log2 N
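The binary search steps on this slide can be sketched as follows. This is a minimal illustration, not code from the presentation; the function name and the sample numbers are assumptions.

```python
def binary_search(sorted_numbers, target):
    """Return the index of target in sorted_numbers, or -1 if it is absent."""
    low, high = 0, len(sorted_numbers) - 1
    while low <= high:
        mid = (low + high) // 2            # Step 1: pick the middle element
        if sorted_numbers[mid] == target:
            return mid
        if sorted_numbers[mid] < target:   # Step 2: keep only the half that can hold target
            low = mid + 1
        else:
            high = mid - 1
    return -1


numbers = sorted([42, 7, 19, 3, 88, 61])   # hypothetical data, placed in order first
print(binary_search(numbers, 61))
```

Each pass halves the remaining range, which is where the log2 N behavior comes from. Sorting the list has its own cost, so the approach pays off when the same list is searched many times.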
5
The digital computer
  Represents everything in a code of zeros and ones
  Computer architecture
    CPU
    Memory
    Input / Output
  Advantages of the digital computer
    Deterministic
    Minimization of noise
  [Diagram: Input, CPU, Memory, Output]
6
Sequence databases
  What is a database?
    An indexed set of records
    Records are retrieved using a query language
    Database technology is well established
  Examples of sequence databases
    GenBank
      Encompasses all publicly available protein and nucleotide sequences
    Protein Data Bank
      Contains 3-D structures of proteins
7
The client-server model
  Clients and servers are software processes
  Clients request data from servers
  Servers and clients can reside on the same or different machines
  Clients can act as servers to other processes and vice versa
  [Diagram: Web Browser, Web Server, BLAST Search Engine, Database]
8
Sequence alignment
  Sequence alignments search for matches between sequences
  Two broad classes of sequence alignments
    Global
    Local
  Alignment can be performed between two or more sequences
  [Figure: QKESGPSSSYC and VQQESGLVRTTC compared by global alignment and by local alignment, where the local alignment is ESG]
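To make the local alignment on this slide concrete, here is a minimal dynamic-programming sketch in the Smith-Waterman style. The scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions chosen for illustration; the slide does not specify a scoring scheme.

```python
def local_align(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never go below 0.
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(local_align("QKESGPSSSYC", "VQQESGLVRTTC"))
```

With these assumed scores, the two sequences from the slide yield the shared segment ESG. A global alignment would instead force every residue of both sequences into the alignment and would not reset scores to zero.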
9
The biological importance of sequence alignment
  Sequence alignments assess the degree of similarity between sequences
  Similar sequences suggest similar function
    Proteins with similar sequences are likely to play similar biochemical roles
    Regulatory DNA sequences that are similar will likely have similar roles in gene regulation
  Sequence similarity suggests evolutionary history
    Fewer differences mean more recent divergence
10
The algorithmic problem of aligning sequences
  Comparison of similar sequences of similar length is straightforward
  How does one deal with insertions and gaps that may hide true similarity?
  How does one interpret minimal similarity?
    Are the sequences actually related?
    Is the alignment by chance?
  [Figure: example sequence pairs QQESGPVRSTC / QKGSYQEKGYC, QQESGPVRSTC / RQQEPVRSTC, and QQESGPVRSTC / QKESGPSRSYC]
11
Methods of sequence alignment
  Graphical methods
  Dynamic-programming methods
  Heuristic methods
12
Dot matrix analysis
  A graphical method
  Shows all possible alignments
  Caveats
    Some guesswork in picking parameters
      Window size
      Stringency
    Not as rigorous or quantitative as other methods
  [Figure: dot matrix comparing QQESGPVRSTC with RQQEPVRSTC]
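A dot matrix with the two parameters named on this slide can be sketched as below. The slide does not define the parameters, so this assumes a common reading: a dot is drawn at (i, j) when at least `stringency` of the `window` residues starting at those positions are identical. Function names are hypothetical.

```python
def dot_matrix(seq_a, seq_b, window=1, stringency=1):
    """Return the set of (i, j) positions where a dot would be drawn."""
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            # Count identical residues inside the window starting at (i, j).
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            if matches >= stringency:
                dots.add((i, j))
    return dots


def render(seq_a, seq_b, dots):
    """Draw the matrix as text: seq_b across the top, seq_a down the side."""
    lines = ["  " + " ".join(seq_b)]
    for i, residue in enumerate(seq_a):
        row = " ".join("*" if (i, j) in dots else "." for j in range(len(seq_b)))
        lines.append(residue + " " + row)
    return "\n".join(lines)


a, b = "QQESGPVRSTC", "RQQEPVRSTC"
print(render(a, b, dot_matrix(a, b, window=1, stringency=1)))
```

Raising the window size and stringency removes isolated single-residue dots and leaves only the diagonal runs, which is the guesswork the slide's caveat refers to.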
13
Dot matrix analysis: a real example
  [Figure: two dot plots, one with window size 23 and stringency 15, the other with window size 1 and stringency 1]
14
Devising a scoring system
  Scoring matrices allow biologists to quantify the quality of sequence alignments
  Use different scoring matrices for different purposes
    Score for similar structural domains in proteins
    Score for evolutionary relationship
  Some popular scoring matrices
    PAM for evolutionary studies
    BLOSUM for finding common motifs
15
An example of scoring
  BLOSUM62 (rows and columns for A, R, N, D, C, Q, E)
         A   R   N   D   C   Q   E
     A   4  -1  -2  -2   0  -1  -1
     R  -1   5   0  -2  -3   1   0
     N  -2   0   6   1  -3   0   0
     D  -2  -2   1   6  -3   0   2
     C   0  -3  -3  -3   9  -3  -4
     Q  -1   1   0   0  -3   5   2
     E  -1   0   0   2  -4   2   5
  A sequence comparison
     A A   4
     D Q   0
     D E   2
     R R   5
     Q Q   5
     C E  -4
     R Q   1
     A A   4
     D Q   0
  Total score: 18
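Scoring an alignment with a substitution matrix amounts to summing one lookup per aligned pair. The sketch below uses only the BLOSUM62 entries needed for the pairs on this slide; reading those pairs as the columns of two ungapped sequences is an assumption, as are the function names. Note that the pairs that survive in this transcript sum to 17 rather than the stated 18, so a column may have been lost in extraction.

```python
# Subset of BLOSUM62 covering only the residue pairs used below.
BLOSUM62_SUBSET = {
    ("A", "A"): 4, ("D", "Q"): 0, ("D", "E"): 2, ("R", "R"): 5,
    ("Q", "Q"): 5, ("C", "E"): -4, ("R", "Q"): 1,
}


def pair_score(x, y):
    """Look up a substitution score; the matrix is symmetric."""
    if (x, y) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(x, y)]
    return BLOSUM62_SUBSET[(y, x)]


def alignment_score(seq_a, seq_b):
    """Sum the substitution scores of an ungapped, equal-length alignment."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    return sum(pair_score(x, y) for x, y in zip(seq_a, seq_b))


# Columns: A/A, D/Q, D/E, R/R, Q/Q, C/E, R/Q, A/A, D/Q
print(alignment_score("ADDRQCRAD", "AQERQEQAQ"))
```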
16
Heuristic methods with k-tuples
  Example: BLAST
    Using the query sequence, derive a list of words of length w (e.g., 3)
    Keep high-scoring words
    High-scoring words are compared with database sequences
    Sequences with many matches to high-scoring words are used for final alignments
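The word-list step can be sketched in a few lines. This is a simplified illustration, not BLAST itself: it counts exact word matches only, whereas BLAST also scores similar neighbor words and then extends the hits. The function names are hypothetical.

```python
def words(sequence, w=3):
    """Return every overlapping word of length w in the sequence, in order."""
    return [sequence[i:i + w] for i in range(len(sequence) - w + 1)]


def word_hits(query, subject, w=3):
    """Count words in a database sequence that exactly match a query word."""
    query_words = set(words(query, w))
    return sum(1 for word in words(subject, w) if word in query_words)


print(words("QQESG"))
print(word_hits("QQESGPVRSTC", "RQQEPVRSTC"))
```

Database sequences with many hits would be the candidates passed on to a full alignment, which is what makes the heuristic faster than aligning the query against every sequence.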
17
Statistical significance
  Chance alignments have no biological significance
  Statistical significance implies a low probability of generating a chance alignment
  The probability of long alignments increases with longer sequences
  The extreme-value distribution
    Used to calculate the probability of a chance alignment
    Generated by calculating the scores that result from repeatedly scrambling one of the sequences being compared
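The scrambling procedure can be illustrated with a small simulation. This sketch makes several assumptions: it scores with a simple gap-free local score (match +1, mismatch -1), and it reports the empirical fraction of scrambled sequences that score at least as well as the real one, rather than fitting an extreme-value distribution to the scrambled scores. Function names, trial count, and seed are hypothetical.

```python
import random


def best_ungapped_score(a, b, match=1, mismatch=-1):
    """Best score of any gap-free local alignment (best diagonal segment)."""
    best = 0
    for offset in range(-(len(a) - 1), len(b)):
        running = 0
        for i in range(len(a)):
            j = i + offset
            if 0 <= j < len(b):
                running = max(0, running + (match if a[i] == b[j] else mismatch))
                best = max(best, running)
    return best


def chance_probability(a, b, trials=1000, seed=0):
    """Estimate how often a scrambled copy of b scores at least as well as b."""
    rng = random.Random(seed)
    observed = best_ungapped_score(a, b)
    letters = list(b)
    at_least = 0
    for _ in range(trials):
        rng.shuffle(letters)  # scrambling keeps composition, destroys order
        if best_ungapped_score(a, "".join(letters)) >= observed:
            at_least += 1
    return observed, at_least / trials


print(chance_probability("QQESGPVRSTC", "RQQEPVRSTC"))
```

Scrambling preserves the amino acid composition of the sequence, so the scrambled scores show what alignment scores look like when only composition, not order, is shared.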
18
A practical example of sequence alignment
19
BLAST results
20
Detailed BLAST results
21
A pairwise alignment with MASH-1
  HASH-2, a human homolog of MASH-1
  “+” indicates a conservative amino acid substitution
  “–” indicates a gap/insertion
  XXXX… shows areas of low complexity
22
Phylogenetic analysis
  Phylogenetic trees
    Describe evolutionary relationships between sequences
  Three common methods
    Maximum parsimony
    Distance
    Maximum likelihood
23
Gene prediction
  A problem of pattern recognition
  Algorithms look for features of genes
    E.g., splice sites, ORFs, starting methionine
  Identification of regulatory regions is difficult
  Statistical understanding of genes is ongoing
  Problems of this type require machine learning algorithms
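As one small example of looking for gene features, the sketch below scans the forward strand of a DNA string for open reading frames that begin with the start codon ATG (the starting methionine) and end at the first in-frame stop codon. It is a deliberately naive illustration, not a gene predictor: it ignores the reverse strand, splice sites, and any statistical model. The function name, the `min_codons` parameter, and the test sequence are assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, sequence) for forward-strand ORFs.

    An ORF here starts at ATG and runs through the first in-frame stop
    codon; min_codons counts the codons before the stop.
    """
    orfs = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(dna):
            if dna[i:i + 3] == "ATG":
                for j in range(i + 3, len(dna) - 2, 3):
                    if dna[j:j + 3] in STOP_CODONS:
                        if (j - i) // 3 >= min_codons:
                            orfs.append((i, j + 3, dna[i:j + 3]))
                        i = j  # resume scanning after this ORF
                        break
            i += 3
    return orfs


print(find_orfs("CCATGAAATGGTAACC"))
```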
24
Analysis of microarray data
  Microarrays can measure the expression of thousands of genes simultaneously
  Vast amounts of data require computers
  Types of analysis
    Gene-by-gene
      Method: Statistical techniques
    Categorizing groups of genes
      Method: Clustering algorithms
    Deducing patterns of gene regulation
      Method: Under development
25
Normalization of Microarray Data
  To make arrays comparable:
    Assume the total intensity from one RNA pool is the same as that from another (e.g., growth-arrested cells vs. dividing cells).
    Take the median value of all the spot intensities and subtract it from each spot’s own intensity.
  This is known as global normalization.
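Median-based global normalization can be sketched as below. The sketch assumes the spot intensities are already log2-transformed, so that subtracting the median is the appropriate operation; the values and the function name are hypothetical.

```python
from statistics import median


def global_normalize(log_intensities):
    """Subtract the array's median from every spot so the median becomes 0."""
    center = median(log_intensities)
    return [value - center for value in log_intensities]


array = [10.0, 6.5, 7.5, 8.0, 7.0]   # hypothetical log2 spot intensities
print(global_normalize(array))
```

After this step every array has a median of 0, so values from different arrays can be compared on the same scale.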
26
Example Data
27
Log data normalized to the total median intensity (Log2Ratio normalized)
  10.128 - 7.7 = 2.428
  6.5961 - 7.7 = -1.1039
28
Differentially Expressed Genes (DEGs)
  The difference between two groups of samples (e.g., arrays from tumor vs. healthy tissue, or arrays from growth-arrested cells vs. asynchronously dividing cells) can be estimated, and the genes whose mRNA expression differs significantly can be determined statistically.
29
Log2 Ratio
  1/2 = 0.5
  2/1 = 2
  log2(1/2) = log2(1) - log2(2) = -1
  log2(2/1) = log2(2) - log2(1) = 1
  On the log2 scale, a two-fold decrease and a two-fold increase are symmetric around 0.
30
Average(Arrest) - Average(Control)
  Which genes are upregulated with respect to control in the arrest phenotype?
  Which genes are downregulated with respect to control in the arrest phenotype?
31
Are these fold changes significant?
  Very basic statistics: t-test between two groups
32
How to calculate a Log2Ratio in Excel?
  Type in =AVERAGE(I2:K2)-AVERAGE(L2:N2) for FSTL1
  Drag the cell from its bottom-right corner down to fill in the other rows.
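The same calculation can be written outside a spreadsheet. This sketch assumes the values are already log2-transformed and that the first range holds the arrest replicates and the second the control replicates; the numbers are hypothetical, not the FSTL1 data.

```python
from statistics import mean


def log2_ratio(arrest, control):
    """Difference of group means on the log2 scale."""
    return mean(arrest) - mean(control)


arrest = [9.0, 9.5, 10.0]    # hypothetical log2 values, three arrest arrays
control = [7.0, 7.5, 8.0]    # hypothetical log2 values, three control arrays
print(log2_ratio(arrest, control))
```

A positive result means the gene is higher in the arrest group, a negative result means it is lower.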
33
How to calculate FoldChange in Excel?
  Raise 2 to the power of the Log2Ratio column (=2^O2 for the FSTL1 gene)
  Drag the cell from its bottom-right corner down to fill in the other rows.
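Converting a log2 ratio back to a fold change is a single exponentiation. A minimal sketch, with a hypothetical function name:

```python
def fold_change(log2_ratio):
    """Convert a log2 ratio back to a linear fold change."""
    return 2 ** log2_ratio


for ratio in (-1, 0, 1, 2):
    print(ratio, fold_change(ratio))
```

A log2 ratio of 1 corresponds to a doubling and -1 to a halving, so fold changes below 1 indicate lower expression relative to control.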
34
How to do a t-test in Excel?
  Use the TTEST function from the statistical function library
  Type in =TTEST(I2:K2,L2:N2,2,2) for the following data (the third argument, 2, requests a two-tailed test; the fourth, 2, a two-sample test assuming equal variances):
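For reference, the statistic behind a two-sample equal-variance t-test can be computed with the standard library alone. This sketch stops at the t statistic and degrees of freedom; Excel's TTEST goes one step further and returns the p-value, which requires the t distribution (available, for example, through `scipy.stats.ttest_ind(a, b, equal_var=True)`). The data values and function name are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t_statistic(group_a, group_b):
    """Return (t, degrees_of_freedom) for a two-sample equal-variance t-test."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    t = (mean(group_a) - mean(group_b)) / sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return t, df


arrest = [9.0, 9.5, 10.0]    # hypothetical log2 values
control = [7.0, 7.5, 8.0]    # hypothetical log2 values
print(pooled_t_statistic(arrest, control))
```

With only three replicates per group, as in the spreadsheet ranges above, the test has few degrees of freedom, so a large fold change alone does not guarantee significance.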
35
Metrics for Gene Expression
  Euclidean distance
36
Calculation of Euclidean distance
  Calculate the Euclidean distance between FSTL1 and AACS
37
Calculation of Euclidean distance
  The larger the Euclidean distance between two expression profiles, the more different they are from each other
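The Euclidean distance between two expression profiles is the square root of the summed squared differences across conditions. A minimal sketch with hypothetical profiles (the FSTL1 and AACS values are not part of this transcript):

```python
from math import sqrt


def euclidean_distance(profile_a, profile_b):
    """Distance between two expression profiles measured over the same conditions."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same conditions")
    return sqrt(sum((x - y) ** 2 for x, y in zip(profile_a, profile_b)))


gene_1 = [1.0, 2.0, 3.0]   # hypothetical expression values
gene_2 = [4.0, 6.0, 3.0]
print(euclidean_distance(gene_1, gene_2))
```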
38
Correlation Coefficient
39
Plot of Genes Across Conditions
40
Plot of Highly Significant Genes Across Conditions
41
Plot of Highly Significant Gene Clusters Across Conditions
42
Metrics for gene expression
  Need a method to measure how similar genes are based on expression
  Examples
    Euclidean distance
    Pearson correlation coefficient
  [Figure: panels labeled Euclidean distance and Pearson correlation coefficient]
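The Pearson correlation coefficient compares the shape of two profiles rather than their absolute levels. A minimal sketch with hypothetical profiles and a hypothetical function name:

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient between two expression profiles."""
    mean_x, mean_y = mean(xs), mean(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


rising = [1.0, 2.0, 3.0]
rising_stronger = [2.0, 4.0, 6.0]   # same trend, larger magnitude
falling = [3.0, 2.0, 1.0]
print(pearson(rising, rising_stronger), pearson(rising, falling))
```

The first two profiles are perfectly correlated even though their Euclidean distance is not zero, which is why the choice of metric changes which genes end up grouped together.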
43
Unsupervised techniques
  Make no assumptions about how the data should behave
  Cluster genes based on similar patterns of gene expression
  Examples
    Hierarchical clustering
    Principal components analysis (PCA)
  [Figure: panels labeled Hierarchical clustering and PCA]
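Hierarchical clustering can be sketched as repeated merging of the two closest groups. The version below assumes Euclidean distance with average linkage and stops at a requested number of clusters instead of building a full tree; those choices, the gene names, and the values are all illustrative assumptions. It uses `math.dist`, which requires Python 3.8 or later.

```python
from math import dist


def hierarchical_clusters(profiles, n_clusters):
    """Agglomerative clustering with average linkage.

    profiles maps a gene name to its expression vector; merging stops
    when n_clusters groups remain.
    """
    clusters = [[name] for name in profiles]

    def linkage(c1, c2):
        # Average distance over all cross-cluster gene pairs.
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(dist(profiles[a], profiles[b]) for a, b in pairs) / len(pairs)

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge them.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


profiles = {                      # hypothetical genes and expression values
    "g1": [1.0, 2.0, 3.0],
    "g2": [1.1, 2.1, 2.9],
    "g3": [3.0, 2.0, 1.0],
    "g4": [2.9, 1.9, 1.2],
}
print(hierarchical_clusters(profiles, 2))
```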
44
Supervised techniques
  Divide groups of genes based on sample properties
  Can predict sample condition based on gene expression pattern
  Examples
    Support vector machine
    Nearest neighbor
  [Figure: panels labeled Nearest neighbor and Support vector machine]
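A nearest-neighbor classifier is the simplest of the supervised examples: a new sample receives the label of the most similar labeled sample. The sketch below assumes Euclidean distance and a single neighbor; the labels and expression values are hypothetical. It uses `math.dist`, which requires Python 3.8 or later.

```python
from math import dist


def nearest_neighbor(training, sample):
    """Return the label of the training profile closest to the sample.

    training is a list of (expression_vector, label) pairs.
    """
    _, label = min(training, key=lambda item: dist(item[0], sample))
    return label


training = [                          # hypothetical labeled arrays
    ([1.0, 2.0, 3.0], "arrest"),
    ([1.2, 2.1, 2.8], "arrest"),
    ([3.0, 2.0, 1.0], "control"),
    ([2.8, 1.9, 1.1], "control"),
]
print(nearest_neighbor(training, [2.9, 2.1, 0.9]))
```

Unlike clustering, this approach needs samples whose condition is already known, which is what makes it supervised.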
45
Summary
  Vast amounts of data require bioinformatics
  Bioinformatics methods are limited by the following:
    Algorithmic complexity of bioinformatics problems
    Computer hardware performance
  Heuristic methods are used to get around these limitations
  Bioinformatics methods are used in the following areas:
    Sequence alignment
    Phylogenetic-tree construction
    Gene prediction
    Secondary-structure determination
    Analysis of microarray data
    Simulation of biological systems