© 2005 Prentice Hall Inc. / A Pearson Education Company / Upper Saddle River, New Jersey 07458 Chapter 6 The Computational Foundations of Genomics Applying.

Contents
- Sequence alignment
- Gene prediction
- Algorithms for analysis of phylogeny
- Analysis of microarray data

Computational Biology and Bioinformatics
- Computational biology
  - Development of computational methods to solve problems in biology
- Bioinformatics
  - Application of computational biology to analysis and management of real data
- Why do biologists need computer science?
  - Discrete nature of sequence data is ideal for analysis using digital computers
  - Size and complexity of genomics data make the data impossible to analyze without computers

Algorithmic problems
- Example: searching for a number in an unordered list
  - If the list has N numbers, the average search time is proportional to N
- A more clever approach
  - Place the numbers in order
  - Do a binary search
    - Step 1: Pick a number in the middle of the list
    - Step 2: Restrict the search to the half that contains your number
    - Return to Step 1 until you find your number
  - Time for this approach is proportional to log2 N
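The binary search steps can be sketched in Python as follows. This is a minimal illustration, not code from the slides; the function name and the sample numbers are made up for the example.

```python
def binary_search(sorted_numbers, target):
    """Return the index of target in sorted_numbers, or -1 if it is absent."""
    low, high = 0, len(sorted_numbers) - 1
    while low <= high:
        mid = (low + high) // 2              # Step 1: pick the middle of the current range
        if sorted_numbers[mid] == target:
            return mid
        if sorted_numbers[mid] < target:     # Step 2: keep only the half that can hold target
            low = mid + 1
        else:
            high = mid - 1
    return -1                                # range is empty: target is not in the list


# Hypothetical data; the list must be placed in order before searching.
numbers = sorted([42, 7, 19, 3, 88, 61, 25])
print(binary_search(numbers, 61))
print(binary_search(numbers, 4))
```

Each pass through the loop halves the remaining range, which is where the log2 N behaviour comes from.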

The digital computer
- Represents everything in a code of zeros and ones
- Computer architecture
  - CPU
  - Memory
  - Input / Output
- Advantages of digital computer
  - Deterministic
  - Minimization of noise
[Figure: block diagram with Input, CPU, Memory, and Output]

Sequence databases
- What is a database?
  - An indexed set of records
  - Records retrieved using a query language
  - Database technology is well established
- Examples of sequence databases
  - GenBank
    - Encompasses all publicly available protein and nucleotide sequences
  - Protein Data Bank
    - Contains 3-D structures of proteins

The client-server model
- The clients and servers are software processes
- Clients request data from servers
- Servers and clients can reside on the same or different machines
- Clients can act as servers to other processes and vice versa
[Figure: Web Browser, Web Server, BLAST Search Engine, Database]

Sequence alignment
- Sequence alignments search for matches between sequences
- Two broad classes of sequence alignments
  - Global
  - Local
- Alignment can be performed between two or more sequences
[Figure: global alignment of QKESGPSSSYC and VQQESGLVRTTC; local alignment of the shared segment ESG]

The biological importance of sequence alignment
- Sequence alignments assess the degree of similarity between sequences
- Similar sequences suggest similar function
  - Proteins with similar sequences are likely to play similar biochemical roles
  - Regulatory DNA sequences that are similar will likely have similar roles in gene regulation
- Sequence similarity suggests evolutionary history
  - Fewer differences mean more recent divergence

The algorithmic problem of aligning sequences
- Comparison of similar sequences of similar length is straightforward
- How does one deal with insertions and gaps that may hide true similarity?
- How does one interpret minimal similarity?
  - Are sequences actually related?
  - Is alignment by chance?
[Figure: example pairs QQESGPVRSTC / QKGSYQEKGYC, QQESGPVRSTC / RQQEPVRSTC, and QQESGPVRSTC / QKESGPSRSYC]
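One common way to handle insertions and gaps is dynamic programming. The slides do not name an algorithm, so the sketch below swaps in the standard Needleman-Wunsch recurrence for a global alignment score. The scoring values (match +1, mismatch -1, gap -2) are assumptions chosen only for illustration.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of a and b (Needleman-Wunsch, linear gap cost)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap                # a prefix of a aligned entirely to gaps
    for j in range(1, cols):
        score[0][j] = j * gap                # a prefix of b aligned entirely to gaps
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    return score[-1][-1]


# Two sequences from the slide that differ in length.
print(global_alignment_score("QQESGPVRSTC", "RQQEPVRSTC"))
```

With a substitution matrix in place of the fixed match and mismatch values, the same recurrence applies to protein scoring schemes.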

Dot matrix analysis
- A graphical method
- Shows all possible alignments
- Caveats
  - Some guesswork in picking parameters
    - Window size
    - Stringency
  - Not as rigorous or quantitative as other methods
[Figure: dot matrix of QQESGPVRSTC against RQQEPVRSTC]
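A dot matrix with a window size and a stringency can be sketched as below. This is a minimal illustration under assumed conventions: a dot is placed at (i, j) when the windows starting at those positions share at least `stringency` identical residues. The function names and the text rendering are not from the slides.

```python
def dot_matrix(seq_a, seq_b, window=3, stringency=2):
    """Return the set of (i, j) window starts with at least `stringency` identities."""
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            if matches >= stringency:
                dots.add((i, j))
    return dots


def render(seq_a, seq_b, dots):
    """Draw the matrix as text: seq_b across the top, seq_a down the side."""
    lines = ["  " + seq_b]
    for i, residue in enumerate(seq_a):
        row = "".join("*" if (i, j) in dots else "." for j in range(len(seq_b)))
        lines.append(residue + " " + row)
    return "\n".join(lines)


a, b = "QQESGPVRSTC", "RQQEPVRSTC"
print(render(a, b, dot_matrix(a, b, window=3, stringency=3)))
```

Raising the stringency or the window size removes scattered chance dots, and lowering them reveals weaker similarity at the cost of more noise.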

Devising a scoring system
- Scoring matrices allow biologists to quantify the quality of sequence alignments
- Use different scoring matrices for different purposes
  - Score for similar structural domains in proteins
  - Score for evolutionary relationship
- Some popular scoring matrices
  - PAM for evolutionary studies
  - BLOSUM for finding common motifs

An example of scoring
BLOSUM62 (excerpt)
     A   R   N   D   C   Q   E
A    4  -1  -2  -2   0  -1  -1
R   -1   5   0  -2  -3   1   0
N   -2   0   6   1  -3   0   0
D   -2  -2   1   6  -3   0   2
C    0  -3  -3  -3   9  -3  -4
Q   -1   1   0   0  -3   5   2
E   -1   0   0   2  -4   2   5
[Figure: a sequence comparison scored pair by pair with BLOSUM62 (pairs shown include A/A 4, D/Q 0, D/E 2, R/R 5, Q/Q 5, C/E -4, R/Q 1); total score: 18]
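Scoring an ungapped alignment with a substitution matrix amounts to summing one matrix entry per aligned pair. The sketch below uses the standard BLOSUM62 values for the seven residues A, R, N, D, C, Q, E. The two short sequences are hypothetical and are not the comparison shown on the slide.

```python
RESIDUES = "ARNDCQE"
ROWS = [
    [4, -1, -2, -2, 0, -1, -1],   # A
    [-1, 5, 0, -2, -3, 1, 0],     # R
    [-2, 0, 6, 1, -3, 0, 0],      # N
    [-2, -2, 1, 6, -3, 0, 2],     # D
    [0, -3, -3, -3, 9, -3, -4],   # C
    [-1, 1, 0, 0, -3, 5, 2],      # Q
    [-1, 0, 0, 2, -4, 2, 5],      # E
]
# Look-up table keyed by residue pair, built from the excerpt above.
BLOSUM62 = {
    (a, b): ROWS[i][j]
    for i, a in enumerate(RESIDUES)
    for j, b in enumerate(RESIDUES)
}


def alignment_score(seq_a, seq_b):
    """Sum the matrix score of each aligned pair (ungapped, equal lengths)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(BLOSUM62[(x, y)] for x, y in zip(seq_a, seq_b))


# Hypothetical pair: A/A 4 + R/R 5 + N/Q 0 + D/E 2 + C/C 9
print(alignment_score("ARNDC", "ARQEC"))  # 20
```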

Heuristic methods with k-tuples
- Example: BLAST
  - Using query sequence, derive a list of words of length w (e.g., 3)
  - Keep high-scoring words
  - High-scoring words are compared with database sequences
  - Sequences with many matches to high-scoring words are used for final alignments
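The word-based idea can be sketched in a few lines. This is a deliberate simplification and not BLAST itself: it counts exact word matches only, whereas BLAST scores neighbouring words with a substitution matrix and then extends hits into alignments. The database contents and sequence names are hypothetical.

```python
from collections import Counter


def words(sequence, w=3):
    """All overlapping words (k-tuples) of length w in the sequence."""
    return [sequence[i:i + w] for i in range(len(sequence) - w + 1)]


def rank_database(query, database, w=3):
    """Rank database sequences by how many of their words also occur in the query."""
    query_words = set(words(query, w))
    hits = Counter()
    for name, sequence in database.items():
        hits[name] = sum(1 for word in words(sequence, w) if word in query_words)
    return hits.most_common()


# Hypothetical two-entry database built from sequences used elsewhere in the slides.
database = {"seq1": "RQQEPVRSTC", "seq2": "QKGSYQEKGYC"}
print(rank_database("QQESGPVRSTC", database))
```

Sequences at the top of the ranking would be the candidates passed on to a full alignment step.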

Statistical significance
- Chance alignments have no biological significance
- Statistical significance implies low probability of generating a chance alignment
- Probability of long alignments increases with longer sequences
- The extreme-value distribution
  - Used to calculate the probability of chance alignment
  - Generated by calculating the scores resulting from repeatedly scrambling one of the sequences being compared
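The scrambling procedure can be illustrated with a small permutation test. The sketch below makes two assumptions: it uses a stand-in score (the best count of identical positions over all ungapped offsets), and it reports an empirical p-value straight from the scrambled scores instead of fitting an extreme-value distribution to them.

```python
import random


def best_ungapped_score(a, b):
    """Stand-in score: most identical positions over every ungapped offset of b against a."""
    best = 0
    for offset in range(-len(b) + 1, len(a)):
        matches = sum(
            1
            for i in range(len(a))
            if 0 <= i - offset < len(b) and a[i] == b[i - offset]
        )
        best = max(best, matches)
    return best


def empirical_p_value(a, b, trials=1000, seed=0):
    """Fraction of scrambled versions of b that score at least as well as b itself."""
    rng = random.Random(seed)            # fixed seed so the sketch is reproducible
    observed = best_ungapped_score(a, b)
    letters = list(b)
    at_least = 0
    for _ in range(trials):
        rng.shuffle(letters)             # scramble one of the two sequences
        if best_ungapped_score(a, "".join(letters)) >= observed:
            at_least += 1
    # Add-one correction keeps the estimate above zero.
    return observed, (at_least + 1) / (trials + 1)


print(empirical_p_value("QQESGPVRSTC", "QKESGPSRSYC", trials=500))
```

A small p-value means scrambled sequences rarely score as well as the real one, which is the sense in which an alignment is unlikely to be due to chance.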

A pairwise alignment with MASH-1
- HASH-2, a human homolog of MASH-1
- “+” indicates conservative amino acid substitution
- “–” indicates gap/insertion
- XXXX… shows areas of low complexity

Phylogenetic analysis
- Phylogenetic trees
  - Describe evolutionary relationships between sequences
- Three common methods
  - Maximum parsimony
  - Distance
  - Maximum likelihood

Gene prediction
- A problem of pattern recognition
- Algorithms look for features of genes:
  - E.g., splice sites, ORFs, starting methionine
- Identification of regulatory regions is difficult
- Statistical understanding of genes is ongoing
- Problems of this type require machine learning algorithms
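Of the features listed, open reading frames (ORFs) are the simplest to look for in code. The sketch below is a minimal forward-strand scan that reports each stretch from a start codon (ATG, the starting methionine) to the next in-frame stop codon. It is only an illustration: it ignores the reverse strand and splice sites, and the `min_codons` threshold is an assumed parameter.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, sequence) for forward-strand ORFs in all three frames.

    An ORF runs from an ATG to the first in-frame stop codon and must contain
    at least min_codons coding codons (the stop codon is not counted).
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos                       # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, dna[start:pos + 3]))
                start = None                      # look for the next ORF in this frame
    return orfs


# Hypothetical sequence containing one short ORF.
print(find_orfs("CCATGAAATAGGG"))
```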

Analysis of microarray data
- Microarrays can measure the expression of thousands of genes simultaneously
- Vast amounts of data require computers
- Types of analysis
  - Gene-by-gene
    - Method: Statistical techniques
  - Categorizing groups of genes
    - Method: Clustering algorithms
  - Deducing patterns of gene regulation
    - Method: Under development

Normalization of Microarray Data
- To make arrays comparable:
  - Assume the total intensity from one RNA pool is the same as that from another (e.g., growth-arrested cells vs. dividing cells).
  - Take the median value of all the spot intensities on an array and subtract it from each spot’s own intensity.
- This is known as global normalization.
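Median-based global normalization is short enough to sketch directly. The example assumes the spot intensities are already on a log scale, so that subtracting the median is meaningful. The array values are hypothetical.

```python
from statistics import median


def global_normalize(intensities):
    """Subtract the array's median intensity from every spot on that array."""
    center = median(intensities)
    return [value - center for value in intensities]


# Two hypothetical arrays whose overall brightness differs by a constant shift.
array_1 = [6.0, 7.5, 9.0, 8.0, 7.0]
array_2 = [7.0, 8.5, 10.0, 9.0, 8.0]
print(global_normalize(array_1))
print(global_normalize(array_2))
```

After normalization both arrays have a median of zero, so a constant difference in overall intensity no longer shows up as differential expression.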

Differentially Expressed Genes (DEGs)
- The difference between two groups of samples (e.g., arrays from tumor vs. healthy samples, or arrays from growth-arrested cells vs. asynchronously dividing cells) can be estimated, and the genes whose mRNA expression differs significantly can be identified statistically.

Average(Arrest)-Average(Control)
- Which genes are upregulated with respect to control in the arrest phenotype?
- Which genes are downregulated with respect to control in the arrest phenotype?

How to calculate a log2Ratio in Excel?
- Type in =AVERAGE(I2:K2)-AVERAGE(L2:N2) for FSTL1 (the difference of averages is a log2 ratio only if the intensities are already log2-transformed)
- Drag the cell from the bottom right corner down to fill in the other rows.

How to calculate FoldChange in Excel?
- Raise 2 to the power of the Log2Ratio column (=2^O2 for the FSTL1 gene)
- Drag the cell from the bottom right corner down to fill in the other rows.
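The same two spreadsheet calculations can be sketched in Python. The replicate values are hypothetical and are assumed to be log2-transformed intensities, with three arrest arrays and three control arrays per gene as in the Excel ranges.

```python
from statistics import mean


def log2_ratio(arrest, control):
    """Average(Arrest) - Average(Control) on log2-scale intensities."""
    return mean(arrest) - mean(control)


def fold_change(ratio):
    """Convert a log2 ratio back to a linear fold change (2 raised to the ratio)."""
    return 2 ** ratio


# Hypothetical log2 intensities for one gene.
arrest = [8.0, 8.5, 9.0]
control = [7.0, 7.5, 8.0]
ratio = log2_ratio(arrest, control)
print(ratio, fold_change(ratio))
```

A positive log2 ratio marks a gene as upregulated in arrest relative to control, and a negative one marks it as downregulated.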

How to do a t-test in Excel?
- Use the TTEST function from the statistical function library:
  - Type in =TTEST(I2:K2,L2:N2,2,2) for the following data:
  - The third argument (2) requests a two-tailed test; the fourth (2) requests a two-sample test assuming equal variance.
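The statistic behind a two-sample, equal-variance t-test can be sketched with the Python standard library. The sketch returns only the t statistic and the degrees of freedom; turning these into a p-value requires the t distribution, which is available for example through SciPy's `scipy.stats.ttest_ind` if that library is installed. The replicate values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t_statistic(a, b):
    """Two-sample t statistic with pooled (equal) variance, plus degrees of freedom."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    # Pooled variance: sample variances weighted by their degrees of freedom.
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / sqrt(pooled_var * (1 / na + 1 / nb))
    return t, df


# Hypothetical log2 intensities for one gene: three arrest and three control arrays.
arrest = [8.0, 8.5, 9.0]
control = [7.0, 7.5, 8.0]
print(pooled_t_statistic(arrest, control))
```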

Calculation of Euclidean distance
- The larger the Euclidean distance between two expression profiles, the more different they are from each other

Metrics for gene expression
- Need a method to measure how similar genes are based on expression
- Examples
  - Euclidean distance
  - Pearson correlation coefficient
[Figure: panels illustrating Euclidean distance and Pearson correlation coefficient]
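Both metrics are easy to compute for a pair of expression profiles. The sketch below uses hypothetical profiles chosen so that the two measures disagree: the second gene follows the same pattern as the first at twice the magnitude.

```python
from math import sqrt


def euclidean_distance(x, y):
    """Square root of the summed squared differences between two profiles."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_correlation(x, y):
    """Covariance of the profiles divided by the product of their spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Hypothetical expression profiles across four conditions.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]
print(euclidean_distance(gene_a, gene_b))
print(pearson_correlation(gene_a, gene_b))
```

Euclidean distance is sensitive to the magnitude of expression, while the Pearson coefficient compares only the shape of the profiles, so the choice of metric changes which genes count as similar.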

Unsupervised techniques
- Make no assumptions about how the data should behave
- Cluster genes based on similar patterns of gene expression
- Examples
  - Hierarchical clustering
  - Principal components analysis (PCA)
[Figure: panels illustrating hierarchical clustering and PCA]
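Hierarchical clustering can be sketched as repeated merging of the two closest clusters. The version below assumes single linkage (cluster distance is the smallest gene-to-gene Euclidean distance) and stops at a requested number of clusters instead of building a full tree. The gene names and profiles are hypothetical, and `math.dist` requires Python 3.8 or later.

```python
from math import dist


def single_linkage(profiles, n_clusters):
    """Agglomerative clustering of named profiles until n_clusters remain."""
    clusters = [[name] for name in profiles]      # start with one cluster per gene

    def cluster_distance(c1, c2):
        # Single linkage: distance between the closest pair of members.
        return min(dist(profiles[a], profiles[b]) for a in c1 for b in c2)

    while len(clusters) > n_clusters:
        pairs = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        i, j = min(pairs, key=lambda p: cluster_distance(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]   # merge the two closest clusters
        del clusters[j]
    return clusters


# Hypothetical expression profiles across three conditions.
profiles = {
    "g1": [1.0, 2.0, 3.0],
    "g2": [1.1, 2.1, 2.9],
    "g3": [5.0, 1.0, 0.5],
    "g4": [5.2, 0.9, 0.4],
}
print(single_linkage(profiles, n_clusters=2))
```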

Supervised techniques
- Divide groups of genes based on sample properties
- Can predict sample condition based on gene expression pattern
- Examples
  - Support vector machine
  - Nearest neighbor
[Figure: panels illustrating nearest neighbor and support vector machine]
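A nearest-neighbor classifier predicts the condition of a new sample from the labelled sample whose expression pattern is closest. The sketch below assumes Euclidean distance and a single neighbor. The labels and expression values are hypothetical, and `math.dist` requires Python 3.8 or later.

```python
from math import dist


def nearest_neighbor_label(training, sample):
    """Return the label of the training profile closest to the sample."""
    closest_label, _ = min(training, key=lambda pair: dist(pair[1], sample))
    return closest_label


# Hypothetical labelled samples: (condition, expression of three genes).
training = [
    ("arrest", [8.5, 2.0, 1.0]),
    ("arrest", [8.0, 2.2, 1.1]),
    ("control", [3.0, 6.0, 5.5]),
    ("control", [3.2, 5.8, 5.9]),
]
print(nearest_neighbor_label(training, [7.9, 2.5, 1.3]))
```

Unlike clustering, this approach needs samples whose condition is already known, which is what makes it supervised.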

Summary
- Vast amounts of data require bioinformatics
- Bioinformatics methods are limited by the following:
  - Algorithmic complexity of bioinformatics problems
  - Computer hardware performance
- Heuristic methods are used to get around these limitations
- Bioinformatics methods are used in the following areas:
  - Sequence alignment
  - Phylogenetic-tree construction
  - Gene prediction
  - Secondary-structure determination
  - Analysis of microarray data
  - Simulation of biological systems
