Dealing with Sequence redundancy Morten Nielsen Department of Systems Biology, DTU.

Slides:

Advertisements

Similar presentations

CENTER FOR BIOLOGICAL SEQUENCE ANALYSISTECHNICAL UNIVERSITY OF DENMARK DTU T cell Epitope predictions using bioinformatics (Hidden Markov models) Morten.

Advertisements

1. Find the cost of each of the following using the Nearest Neighbor Algorithm. a)Start at Vertex M.

CENTER FOR BIOLOGICAL SEQUENCE ANALYSISTECHNICAL UNIVERSITY OF DENMARK DTU Sequence information, logos and Hidden Markov Models Morten Nielsen, CBS, BioCentrum,

Gibbs sampling Morten Nielsen, CBS, BioSys, DTU. Class II MHC binding MHC class II binds peptides in the class II antigen presentation pathway Binds peptides.

CENTER FOR BIOLOGICAL SEQUENCE ANALYSISTECHNICAL UNIVERSITY OF DENMARK DTU T cell Epitope predictions using bioinformatics (Neural Networks and hidden.

Stabilization matrix method (Rigde regression) Morten Nielsen Department of Systems Biology, DTU.

Optimization methods Morten Nielsen Department of Systems Biology, DTU.

Algorithms in Bioinformatics Morten Nielsen BioSys, DTU.

Structural bioinformatics

Artificial Neural Networks 2 Morten Nielsen Depertment of Systems Biology, DTU.

Sequence alignment & Substitution matrices By Thomas Nordahl & Morten Nielsen.

Introduction to Pattern Recognition Prediction in Bioinformatics What do we want to predict? –Features from sequence –Data mining How can we predict? –Homology.

Strict Regularities in Structure-Sequence Relationship

CENTER FOR BIOLOGICAL SEQUENCE ANALYSISTECHNICAL UNIVERSITY OF DENMARK DTU Can protein model accuracy be identified? Morten Nielsen, CBS, BioCentrum, DTU.

Protein Fold recognition Morten Nielsen, Thomas Nordahl CBS, BioCentrum, DTU.

Heuristic alignment algorithms; Cost matrices 2.5 – 2.9 Thomas van Dijk.

Protein Fold recognition

Sequence alignment & Substitution matrices By Thomas Nordahl & Morten Nielsen.

CENTER FOR BIOLOGICAL SEQUENCE ANALYSISTECHNICAL UNIVERSITY OF DENMARK DTU Sequence motifs, information content, logos, and HMM’s Morten Nielsen, CBS,

CENTER FOR BIOLOGICAL SEQUENCE ANALYSISTECHNICAL UNIVERSITY OF DENMARK DTU Sequence motifs, information content, logos, and HMM’s Morten Nielsen, CBS,

Preference Analysis Joachim Giesen and Eva Schuberth May 24, 2006.

Project list 1.Peptide MHC binding predictions using position specific scoring matrices including pseudo counts and sequences weighting clustering (Hobohm)

Protein structure Classification Ole Lund, Associate professor, CBS, DTU.

Blast heuristics Morten Nielsen Department of Systems Biology, DTU.

Detecting the Domain Structure of Proteins from Sequence Information Niranjan Nagarajan and Golan Yona Department of Computer Science Cornell University.

Protein Structures.

Protein Sequence Analysis - Overview Raja Mazumder Senior Protein Scientist, PIR Assistant Professor, Department of Biochemistry and Molecular Biology.

What is bioinformatics?. What are bioinformaticians up to, actually? Manage molecular biological data –Store in databases, organise, formalise, describe...

Algorithms in Bioinformatics Morten Nielsen Department of Systems Biology, DTU.

Discussion on Metagenomic Data for ANGUS Course Adina Howe.

Protein 3D-structure analysis Exercises. Practicals Find update frequency for RCSB PDB: weekly. When was the last update? How many protein structures.

Discover the UniProt Blast tool. Murcia, February, 2011Protein Sequence Databases Customize the BLAST results.

Motif finding with Gibbs sampling CS 466 Saurabh Sinha.

Lars Arge1 Permuting Lower Bound Permuting N elements according to a given permutation takes I/Os in “indivisibility” model Indivisibility model: Move.

Searching Molecular Databases with BLAST. Basic Local Alignment Search Tool How BLAST works Interpreting search results The NCBI Web BLAST interface Demonstration.

Sequence encoding, Cross Validation Morten Nielsen BioSys, DTU

Project list 1.Peptide MHC binding predictions using position specific scoring matrices including pseudo counts and sequences weighting clustering (Hobohm)

Improving the prediction of RNA secondary structure by detecting and assessing conserved stems Xiaoyong Fang, et al.

Copyright OpenHelix. No use or reproduction without express written consent1.

What is a Project Purpose –Use a method introduced in the course to describe some biological problem How –Construct a data set describing the problem –Define.

The Blosum scoring matrices Morten Nielsen BioSys, DTU.

BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.

Protein Sequence Analysis - Overview - NIH Proteomics Workshop 2007 Raja Mazumder Scientific Coordinator, PIR Research Assistant Professor, Department.

Starting Monday M Oct 29 –Back to BLAST and Orthology (readings posted) will focus on the BLAST algorithm, different types and applications of BLAST; in.

PROTEIN PATTERN DATABASES. PROTEIN SEQUENCES SUPERFAMILY FAMILY DOMAIN MOTIF SITE RESIDUE.

Pairwise Sequence Alignment Part 2. Outline Summary Local and Global alignments FASTA and BLAST algorithms Evaluating significance of alignments Alignment.

Protein Homologue Clustering and Molecular Modeling L. Wang.

Database Similarity Search. 2 Sequences that are similar probably have the same function Why do we care to align sequences?

Learning Sequence Motifs Using Expectation Maximization (EM) and Gibbs Sampling BMI/CS 776 Mark Craven

Blosum matrices What are they? Morten Nielsen BioSys, DTU

Psi-Blast Morten Nielsen, Department of systems biology, DTU.

Optimization methods Morten Nielsen Department of Systems biology, DTU IIB-INTECH, UNSAM, Argentina.

Computational Biology, Part C Family Pairwise Search and Cobbling Robert F. Murphy Copyright  2000, All rights reserved.

Stabilization matrix method (Ridge regression) Morten Nielsen Department of Systems Biology, DTU.

HANDS-ON ConSurf! Web-Server: The ConSurf webserver.

3.3b1 Protein Structure Threading (Fold recognition) Boris Steipe University of Toronto (Slides evolved from original material.

Protein Modularity and Evolution : An examination of organism complexity via protein domain structure Presented by Jennelle Heyer and Jonathan Ebbers December.

Lab 4.11 Lab 4.1: Multiple Sequence Alignment Jennifer Gardy Molecular Biology & Biochemistry Simon Fraser University.

Blast heuristics, Psi-Blast, and Sequence profiles Morten Nielsen Department of systems biology, DTU.

Outline Basic Local Alignment Search Tool

Research Paper on BioInformatics

Sequence motifs, information content, logos, and HMM’s

Learning Sequence Motif Models Using Expectation Maximization (EM)

Protein Structures.

Morten Nielsen, CBS, BioSys, DTU

Outline Basic Local Alignment Search Tool

Protein structure prediction.

Sequence motifs, information content, and sequence logos

Basic Local Alignment Search Tool

Presentation transcript:

Dealing with Sequence redundancy Morten Nielsen Department of Systems Biology, DTU

Outline What is data redundancy? Why is it a problem? How can we deal with it?

Databases are redundant Biological reasons –Some protein functions, or sequence motifs are more common than others Laboratory artifacts –Some protein families have been heavily investigated, others not –Mutagenesis studies makes large and almost identical replica of data –This bias is non-biological

Date redundancy What can we learn? 1.A at P1 favors binding? 2.I is not allowed at P9? 3.K at P4 favors binding? 4.Which positions are important for binding? ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV 10 MHC restricted peptides

Redundant data ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV

PDB. Example 1055 protein sequence Len Function annotations –ACTIN-BINDING –ANTIGEN –COAGULATION –HYDROLASE/DNA –LYASE/OXIDOREDUCTASE –ENDOCYTOSIS/EXOCYTOSIS –…

PDB. Example

What is similarity? Sequence identity? Blast e-values –Often too conservative Other DFLKKVPDDHLEFIPYLILGEVFPEWDERELGVGEKLLIKAVA MATGIDAKEIEESVKDTGDL-GE DVLLGADDGSLAFVP SEFSISPGEKIVFKNNAGFPHNIVFDEDSIPSGVDASKISMSEEDLLNAKGE ACDFG ACEFG 80% ID versus 24% ID

Ole Lund et al. (Protein engineering 1997)

Ole’s formula

How to deal with redundancy Hobohm 1 –Fast –Requires a prior sorting of data Hobohm 2 –Slow –Gives unique answer always –No prior sorting

Hobohm 1 Input data - sorted list A B C D E F G H I Unique Add next data point to list of unique if it is NOT similar to any of the elements already on the unique list

Hobohm 1 Input data A B C D E F G H I Unique Add next data point to list of unique if it is NOT similar to any of the elements already on the unique list

Hobohm 1 Input data A C D E F G H I Unique Add next data point to list of unique if it is NOT similar to any of the elements already on the unique list B

Hobohm 1 Input data A C F I Unique Add next data point to list of unique if it is NOT similar to any of the elements already on the unique list B D E G H Need only to align sequences against the Unique list!

Hobohm-2 Align all against all Make similarity matrix D (N*N) with value 1 if is similar to j, otherwise 0 While data points have more than one neighbor –Remove data point S with most nearest neighbors

Hobohm-2 A B C D E F G H I A B C D E F G H I D: Make similarity matrix N*N

Hobohm-2 A B C D E F G H I A B C D E F G H I N N D: S Find point S with the largest number of similarities

Hobohm-2 A B C D E F G H I A B C D E F G H I N N D: A B C D E F G H A B C D E F G H N N D: Remove point S with the largest number of similarities, and update N counts

Hobohm-2 (repeat this) A B C D E F G H A B C D E F G H N N D: Remove point S with the largest number of similarities N N A B C E F G H A B C E F G H D:

Hobohm-2 (until N=1 for all) A B C D E F G H I A B C D E F G H I N N D: C F H C F H => D’:D’: Unique list is C, F, H N111N111

Hobohm

Hobohm-1

Hobohm-2

Why two algorithms? Hobohm-2 –Unbiased –Slow (O2) –Focuses on lonely sequences –Example from exercise 1000 Sequences alignment 2 hours Hobohm-2: 22 seconds Hobohm-1 –Biased. Prioritized list –Fast (0) –Focuses on populated sequence areas –Example from exercise 1000 Sequences Hobohm-1: 12 seconds Hobohm2 in general gives more sequences than Hobohm1

Hobohm-1 versus Hobohm-2 Prioritized lists –PDB structures. Not all structures are equally good Low resolution, NMR, old? –Peptide binding data Strong binding more important than weak binding Quantitative data (yes no data) –All data are equally important