Evolving Models of Biological Sequence Similarity Daniel P. Miranker The University of Texas at Austin [Chenetal98]



Polymers Polymer: a molecule composed of a linear sequence of smaller molecules (monomers).

Biopolymers Start with monomers. Nucleic acids: DNA, RNA. Amino acids: proteins, peptides. Sugars: carbohydrates.

Monomers/Polymers Nucleic acids: DNAs, RNAs. Amino acids: proteins, peptides. Sugars: carbohydrates.

Describing Polymers Primary, Secondary and Tertiary Structure

Polymer: Primary Structure Description Most pictures borrowed from: Jiunn-Liang Chen, James M. Nolan, Michael E. Harris and Norman R. Pace, Comparative photocross-linking analysis of the tertiary structures of Escherichia coli and Bacillus subtilis RNase P RNAs, The EMBO Journal, Vol. 17, No. 5, pp. 1515–1525, 1998

Polymer Secondary Structure RNAs fold up on themselves: –Loops –Helices. Proteins: –Alpha-helix –Beta-sheet –… 7 structures and beyond [Chenetal98]

Polymer Tertiary Structure

How to model similarity? Which features do we pick? What are the metrics?

First, determine the goal Given a molecule, a biologist will ask: 1. What is it? 2. What does it do? 3. How does it do it?

What about homology? Definition: Homology. Components of two organisms (e.g., molecules) are homologous if they evolved from a common ancestor.

Homology and the Three Questions Homology is a property on its own. 1. Homology is a way of defining equivalence classes. –Classifying a molecule in a group gives it identity. 2. Homologous molecules usually perform the same function, and 3. largely function in the same way. –The small differences are an opportunity to understand the system as a whole.

Primary Structure Similarity: Has answered “What is this?”, based on homology. Important: –Large-scale production of primary structure definitions. –$1, human genome. Can use string algorithms.

Primary Structure Matching
Method                              Novelty
Needleman-Wunsch [70]               Global alignment
Sellers [74]                        [Metric] weighting
Waterman, Smith and Beyer [76]      Gaps
Smith-Waterman [81]                 Local alignment
BLAST [Altschul etal90]             Hot-spot matching

Global-alignment Needleman-Wunsch Alignment: new base case, 0’s for all “$” cells (the slide's table has columns $ P I P E R and rows $ P E P P E R, with 0 in the “$” column). –Scores the common sequence. –No penalty for different-length sequences or for parts of sequences that don’t align. Aka: longest common subsequence (LCS) problem.
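The LCS reading of this slide can be sketched in a few lines of Python. This is a minimal illustration, not the slide author's code: the function name `lcs_length` is an assumption, and the scoring (1 per matched letter, 0 for skipping a letter) is the standard LCS choice rather than anything stated on the slide.

```python
def lcs_length(v: str, w: str) -> int:
    """Length of the longest common subsequence of v and w."""
    # Row 0 and column 0 play the role of the "$" cells and stay at 0.
    s = [[0] * (len(w) + 1) for _ in range(len(v) + 1)]
    for i in range(1, len(v) + 1):
        for j in range(1, len(w) + 1):
            if v[i - 1] == w[j - 1]:
                # Matching letters extend the common sequence by one.
                s[i][j] = s[i - 1][j - 1] + 1
            else:
                # Skipping a letter in either sequence costs nothing.
                s[i][j] = max(s[i - 1][j], s[i][j - 1])
    return s[len(v)][len(w)]


print(lcs_length("PIPER", "PEPPER"))  # prints 4 (the common sequence P, P, E, R)
```

Because skipped letters are free, the two sequences can differ in length without any penalty, which is the property the slide highlights.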

Recurrence for Global Alignment
$$S_{i,j} = 0 \quad \text{if } i = 0 \text{ or } j = 0$$
$$S_{i,j} = \min \begin{cases} S_{i-1,j-1} + c(v_i, w_j) \\ S_{i,j-1} + c(\_, w_j) \\ S_{i-1,j} + c(v_i, \_) \end{cases}$$
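The recurrence can be transcribed directly into a table-filling routine. The sketch below follows the slide as written (zero border cells, minimum of the three predecessors); the names `alignment_table` and `unit_cost`, and the unit cost values themselves, are assumptions made for illustration. `None` stands in for the gap symbol `_`.

```python
from typing import Callable, List, Optional

# c(a, b): cost of pairing a with b; None stands for the gap symbol "_".
Cost = Callable[[Optional[str], Optional[str]], int]


def unit_cost(a: Optional[str], b: Optional[str]) -> int:
    """Hypothetical cost function: 0 for a match, 1 for a substitution or gap."""
    return 0 if a == b else 1


def alignment_table(v: str, w: str, c: Cost = unit_cost) -> List[List[int]]:
    """Fill S using the slide's recurrence: zero borders, min over three moves."""
    s = [[0] * (len(w) + 1) for _ in range(len(v) + 1)]  # S[i][0] = S[0][j] = 0
    for i in range(1, len(v) + 1):
        for j in range(1, len(w) + 1):
            s[i][j] = min(
                s[i - 1][j - 1] + c(v[i - 1], w[j - 1]),  # pair v_i with w_j
                s[i][j - 1] + c(None, w[j - 1]),          # gap in v
                s[i - 1][j] + c(v[i - 1], None),          # gap in w
            )
    return s


table = alignment_table("AB", "AB")
print(table[2][2])  # prints 0: identical sequences align at no cost
```

Passing a different `c` changes the scoring scheme without touching the recurrence, which is where a weighted cost matrix would plug in.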

Local alignment Smith-Waterman alignment
$$S_{i,j} = \max \begin{cases} S_{i-1,j-1} + c(v_i, w_j) \\ S_{i,j-1} + c(\_, w_j) \\ S_{i-1,j} + c(v_i, \_) \\ 0 \end{cases}$$
–No longer a metric. –max, not min. –Cost matrix penalizes edits with negative scores.
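A minimal Python sketch of the local-alignment recurrence follows. The scoring values (match +1, mismatch -1, gap -1) and the function name `local_alignment_score` are assumptions chosen for illustration; real searches use substitution matrices and tuned gap penalties.

```python
def local_alignment_score(v: str, w: str,
                          match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Best local alignment score between v and w (Smith-Waterman recurrence)."""
    s = [[0] * (len(w) + 1) for _ in range(len(v) + 1)]
    best = 0
    for i in range(1, len(v) + 1):
        for j in range(1, len(w) + 1):
            sub = match if v[i - 1] == w[j - 1] else mismatch
            s[i][j] = max(
                s[i - 1][j - 1] + sub,  # pair v_i with w_j
                s[i][j - 1] + gap,      # gap in v
                s[i - 1][j] + gap,      # gap in w
                0,                      # restart: drop a poorly scoring prefix
            )
            best = max(best, s[i][j])   # the best cell may be anywhere in the table
    return best


print(local_alignment_score("XXPEPYY", "ZZPEPWW"))  # prints 3 (the shared PEP)
```

The extra `0` option is what makes the alignment local: a region that scores badly is abandoned rather than carried forward, so only the well-conserved stretch contributes.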

Replacing Edits with “Words” Local areas of high conservation: such retained features form a larger vocabulary of building blocks
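One simple way to look for shared "words" is to index every length-k substring of each sequence and intersect the indexes. The sketch below is an illustration of that idea only; the function names and the choice of exact matching are assumptions, and it is not a description of how BLAST or any catalog of building blocks is actually implemented.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def word_positions(seq: str, k: int) -> Dict[str, List[int]]:
    """Map each length-k word in seq to the 0-based positions where it starts."""
    index: Dict[str, List[int]] = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def shared_words(a: str, b: str, k: int) -> Dict[str, Tuple[List[int], List[int]]]:
    """Words of length k present in both sequences, with their positions in each."""
    index_a, index_b = word_positions(a, k), word_positions(b, k)
    return {word: (index_a[word], index_b[word])
            for word in index_a if word in index_b}


print(shared_words("PIPER", "PEPPER", 3))  # prints {'PER': ([2], [3])}
```

Each shared word gives a candidate region where the two sequences are locally conserved, without computing a full edit-based alignment.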

Phylogenetic Footprint [Mondal etal 2007] “Key word”

Keywords, a basis of critical function e.g. active site for docking [Biespiel]

Small Differences are Revealing The basis for stabilizing a fold in an RNA [Chenetal98]

Nature Retains and Rediscovers Useful Structures Biological goal: –Determine a larger vocabulary of building blocks. Molecular data management systems play an important role: –Catalog identified building blocks (e.g. Pfam, SCOP). –Organize around functional and homologous groups. Increasingly, identity is being resolved by word-level matches.

NCBI Protein BLAST Result Pfam domain matches If you insist, a second query for sequence matches will be executed.

Sequence-based homology is no less important (biological criteria). More sequence data --> –Identification is easier –For an unknown, all definitions of identity

Where does that leave us? Models must begin to reflect chemical function. Bad news: leave a comfort zone.

A common current approach: Polymers have primary, secondary and tertiary structure. Create a triple (Primary structure descriptor, Secondary structure descriptor, Tertiary structure descriptor). Good news: lots of degrees of freedom, lots of room for different ideas.

Protein Example (W, alpha, (3.32, 1.027, )) Primary Structure: amino acid alphabet –No change Secondary Structure: alpha-helix or beta sheet, –Symbolic vocabulary of structure –Open opportunity, SCOP catalog Tertiary Structure: location, x, y, z, of a particular carbon atom in the amino acid. - Known for some proteins, PDB is the repository
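The triple can be written down as a small record type. The sketch below is one hypothetical way to do it: the class name `Residue`, the weighted-sum `residue_distance`, and the coordinates used in the example are all assumptions made for illustration (the coordinates are placeholders, not data from any protein).

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Residue:
    primary: str                           # amino acid letter, e.g. "W"
    secondary: str                         # symbolic label, e.g. "alpha" or "beta"
    tertiary: Tuple[float, float, float]   # x, y, z of a chosen carbon atom


def residue_distance(a: Residue, b: Residue,
                     w_primary: float = 1.0,
                     w_secondary: float = 1.0,
                     w_tertiary: float = 1.0) -> float:
    """Hypothetical distance: weighted sum of one term per structure level."""
    d_primary = 0.0 if a.primary == b.primary else 1.0
    d_secondary = 0.0 if a.secondary == b.secondary else 1.0
    d_tertiary = math.dist(a.tertiary, b.tertiary)  # Euclidean distance
    return w_primary * d_primary + w_secondary * d_secondary + w_tertiary * d_tertiary


a = Residue("W", "alpha", (0.0, 0.0, 0.0))  # placeholder coordinates
b = Residue("W", "beta", (3.0, 4.0, 0.0))   # placeholder coordinates
print(residue_distance(a, b))  # prints 6.0 (0 + 1 + 5)
```

The three weights are the "degrees of freedom" in miniature: how much each structure level should count is exactly the kind of modeling choice the slides leave open.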

If you have two PDB files: Generally, –3-d data is unavailable. –PDB is the basis for gold standards [wikipedia]

An Observation Even a little secondary structure information helps a lot. Despite adding new explicit dimensions, implicit dimensionality goes down. [Bhattacharya et al.]

Open Problems: DBMS: If data is organized by homology group, what are the [query] services? Database retrieval in biology is almost always a two-step, two-criteria process. 1. Retrieve a solution set based on similarity. 2. Assign a statistical significance to each result in the solution set (e.g. BLAST e-scores). Is there a one-step process (index) that embodies both? Other data types in biology, not just individual molecules: –Pathways, sets of proteins may be homologous. –Mass-spectra
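The two-step process can be sketched as two separate passes over a toy database. Everything here is an assumption for illustration: `identity_score` is a deliberately crude similarity, and the significance step uses the Karlin-Altschul form E = K·m·n·e^(-λS) with placeholder values for K and λ (in practice these depend on the scoring system, and the formula is not calibrated for this toy score).

```python
import math
from typing import Dict, List, Tuple


def identity_score(a: str, b: str) -> int:
    """Toy similarity: number of aligned positions where the letters agree."""
    return sum(x == y for x, y in zip(a, b))


def e_value(score: float, m: int, n: int, k: float = 0.1, lam: float = 0.3) -> float:
    """Expected number of chance hits, E = K*m*n*exp(-lambda*S); K, lambda assumed."""
    return k * m * n * math.exp(-lam * score)


def search(query: str, database: Dict[str, str],
           min_score: int) -> List[Tuple[str, int, float]]:
    # Step 1: retrieve a solution set based on similarity.
    hits = [(name, identity_score(query, seq)) for name, seq in database.items()]
    hits = [(name, s) for name, s in hits if s >= min_score]
    # Step 2: assign a statistical significance to each result.
    total_length = sum(len(seq) for seq in database.values())
    scored = [(name, s, e_value(s, len(query), total_length)) for name, s in hits]
    return sorted(scored, key=lambda hit: hit[2])  # most significant first


db = {"a": "PIPER", "b": "PEPPER", "c": "XXXXX"}
for name, score, e in search("PIPER", db, min_score=2):
    print(name, score, e)
```

The open question on the slide is whether an index could answer step 1 and step 2 together, rather than filtering first and scoring afterwards as this sketch does.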