SUPERVISED NEURAL NETWORKS FOR PROTEIN SEQUENCE ANALYSIS Lecture 11 Dr Lee Nung Kion Faculty of Cognitive Sciences and Human Development UNIMAS,

Slides:



Advertisements
Similar presentations
Measuring the degree of similarity: PAM and blosum Matrix
Advertisements

Definitions Optimal alignment - one that exhibits the most correspondences. It is the alignment with the highest score. May or may not be biologically.
Structural bioinformatics
Sequence Similarity Searching Class 4 March 2010.
Strict Regularities in Structure-Sequence Relationship
Mismatch string kernels for discriminative protein classification By Leslie. et.al Presented by Yan Wang.
Expect value Expect value (E-value) Expected number of hits, of equivalent or better score, found by random chance in a database of the size.
Sequence similarity (II). Schedule Mar 23midterm assignedalignment Mar 30midterm dueprot struct/drugs April 6teams assignedprot struct/drugs April 13RNA.
Department of Computer Science, University of California, Santa Barbara August 11-14, 2003 CTSS: A Robust and Efficient Method for Protein Structure Alignment.
Author: Jason Weston et., al PANS Presented by Tie Wang Protein Ranking: From Local to global structure in protein similarity network.
Lesson 8: Machine Learning (and the Legionella as a case study) Biological Sequences Analysis, MTA.
The Protein Data Bank (PDB)
Introduction to Bioinformatics - Tutorial no. 5 MEME – Discovering motifs in sequences MAST – Searching for motifs in databanks TRANSFAC – The Transcription.
Protein Modules An Introduction to Bioinformatics.
Computational Biology, Part 2 Representing and Finding Sequence Features using Consensus Sequences Robert F. Murphy Copyright  All rights reserved.
Toward Making Online Biological Data Machine Understandable Cui Tao Data Extraction Research Group Department of Computer Science, Brigham Young University,
BLOSUM Information Resources Algorithms in Computational Biology Spring 2006 Created by Itai Sharon.
Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.
Alignment IV BLOSUM Matrices. 2 BLOSUM matrices Blocks Substitution Matrix. Scores for each position are obtained frequencies of substitutions in blocks.
Sequence similarity. Motivation Same gene, or similar gene Suffix of A similar to prefix of B? Suffix of A similar to prefix of B..Z? Longest similar.
1 Lesson 3 Aligning sequences and searching databases.
Detecting the Domain Structure of Proteins from Sequence Information Niranjan Nagarajan and Golan Yona Department of Computer Science Cornell University.
Remote Homology detection: A motif based approach CS 6890: Bioinformatics - Dr. Yan CS 6890: Bioinformatics - Dr. Yan Swati Adhau Swati Adhau 04/14/06.
Arabidopsis Gene Project GK-12 April Workshop Karolyn Giang and Dr. Mulligan.
Pattern databasesPattern databasesPattern databasesPattern databases Gopalan Vivek.
Pair-wise Sequence Alignment What happened to the sequences of similar genes? random mutation deletion, insertion Seq. 1: 515 EVIRMQDNNPFSFQSDVYSYG EVI.
Automatic methods for functional annotation of sequences Petri Törönen.
Semantic Similarity over Gene Ontology for Multi-label Protein Subcellular Localization Shibiao WAN and Man-Wai MAK The Hong Kong Polytechnic University.
Fast Search Protein Structure Prediction Algorithm for Almost Perfect Matches1 By Jayakumar Rudhrasenan S Primary Supervisor: Prof. Heiko Schroder.
Sequence analysis: Macromolecular motif recognition Sylvia Nagl.
Protein Secondary Structure Prediction: A New Improved Knowledge-Based Method Wen-Lian Hsu Institute of Information Science Academia Sinica, Taiwan.
Multiple Alignment and Phylogenetic Trees Csc 487/687 Computing for Bioinformatics.
Construction of Substitution Matrices
Sequence Alignment Csc 487/687 Computing for bioinformatics.
Condor: BLAST Monday, July 19 th, 3:15pm Alain Roy OSG Software Coordinator University of Wisconsin-Madison.
Protein Structure & Modeling Biology 224 Instructor: Tom Peavy Nov 18 & 23, 2009
HMMs for alignments & Sequence pattern discovery I519 Introduction to Bioinformatics.
A Tutorial of Sequence Matching in Oracle Haifeng Ji* and Gang Qian** * Oklahoma City Community College ** University of Central Oklahoma.
Study of Protein Prediction Related Problems Ph.D. candidate Le-Yi WEI 1.
Introduction to Bioinformatics Dr. Rybarczyk, PhD University of North Carolina-Chapel Hill
Condor: BLAST Rob Quick Open Science Grid Indiana University.
Motif discovery and Protein Databases Tutorial 5.
Basic Local Alignment Search Tool BLAST Why Use BLAST?
Protein motif extraction with neuro-fuzzy optimization Bill C. H. Chang and Author : Bill C. H. Chang and Saman K. Halgamuge Saman K. Halgamuge Adviser.
1 Improve Protein Disorder Prediction Using Homology Instructor: Dr. Slobodan Vucetic Student: Kang Peng.
DDPIn Distance and Density Based Protein Indexing David Hoksza Charles University in Prague Department of Software Engineering Czech Republic.
Condor: BLAST Monday, 3:30pm Alain Roy OSG Software Coordinator University of Wisconsin-Madison.
Sequence Based Analysis Tutorial March 26, 2004 NIH Proteomics Workshop Lai-Su L. Yeh, Ph.D. Protein Science Team Lead Protein Information Resource at.
Application of latent semantic analysis to protein remote homology detection Wu Dongyin 4/13/2015.
March 28, 2002 NIH Proteomics Workshop Bethesda, MD Lai-Su Yeh, Ph.D. Protein Scientist, National Biomedical Research Foundation Demo: Protein Information.
CZ5225 Methods in Computational Biology Lecture 2-3: Protein Families and Family Prediction Methods Prof. Chen Yu Zong Tel:
Construction of Substitution matrices
Maik Friedel, Thomas Wilhelm, Jürgen Sühnel FLI-Jena, Germany Introduction: During the last 10 years, a large number of complete.
©CMBI 2005 Database Searching BLAST Database Searching Sequence Alignment Scoring Matrices Significance of an alignment BLAST, algorithm BLAST, parameters.
Final Report (30% final score) Bin Liu, PhD, Associate Professor.
Introduction to Bioinformatics - Tutorial no. 5 MEME – Discovering motifs in sequences MAST – Searching for motifs in databanks TRANSFAC – the Transcription.
V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.
An Efficient Index-based Protein Structure Database Searching Method 陳冠宇.
Mismatch String Kernals for SVM Protein Classification Christina Leslie, Eleazar Eskin, Jason Weston, William Stafford Noble Presented by Pradeep Anand.
Demo: Protein Information Resource
LSM3241: Bioinformatics and Biocomputing Lecture 4: Sequence analysis methods revisited Prof. Chen Yu Zong Tel:
PIR: Protein Information Resource
Predict Protein Sequence by Fuzzy-Association Rules
Sequence Based Analysis Tutorial
Sequence Based Analysis Tutorial
Basic Local Alignment Search Tool
Basic Local Alignment Search Tool (BLAST)
Protein Structure Prediction by A Data-level Parallel Proceedings of the 1989 ACM/IEEE conference on Supercomputing Speaker : Chuan-Cheng Lin Advisor.
Alignment IV BLOSUM Matrices
Protein Structural Classification
Presentation transcript:

SUPERVISED NEURAL NETWORKS FOR PROTEIN SEQUENCE ANALYSIS Lecture 11 Dr Lee Nung Kion Faculty of Cognitive Sciences and Human Development UNIMAS,

Introduction Protein sequences are composed of 20 amino acids The twenty amino acid letters are: A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y Proteins are product of genes which have many functions in our body: antibodies, enzymes, structural (hairs, tendons etc) etc.

Introduction A sequence motif is a short amino pattern sequence in a protein sequence that has biological significance. For example: AATCKLMMVTVVWTTAGA Underlined are motifs important for the function of this protein Proteins in the same functional domain will share a common motif

Introduction A protein superfamily comprises set of protein sequences that are evolutionary and therefore functionally and structurally related Protein sequences in a family share some common motifs Two protein sequences are assigned to the same class if they have high homology in the sequence level (e.g., common motif).

First fact of biology “if two peptides stretches exhibit sufficient similarity at the sequence level, then they are likely to be biologically related”

Sequence alignment The similarity between two protein sequences are commonly established through multiple alignment algorithm (e.g., BLAST) Example of multiple sequence alignment to identify common motifs in protein sequences

Protein families Rapid grow in the number of protein sequences Searching one query sequence against all in all the databases is computationally expensive, need super-computer or weeks of computational time

growth of sequence data in GenBank

Solution Grouping protein into families based on the sequence level similarity can aid in: ◦ Group analysis of sequences to identify common motifs; ◦ Support rapid search of a protein sequence

The idea Protein database Fam 1 Fam 2 Fam 3 Fam m Group into families using clustering algorithms Feature extraction Features that represent each protein family Query sequence Feature extraction matching

Neural network for protein family classification Protein database Fam 1 Fam 2 Fam 3 Fam m Group into families using clustering algorithms … … ….. Extract features for all protein sequences Query sequence Input patterns from different classes Neural network training Input vector Fam ? predicti on

Issues on application on NN for biological sequences analysis

Input encoding How to convert the amino acids into numerical values? Direct encoding: ◦ Each amino acid letter is represented by a binary coding ◦ Various binary representation can be obtained based on the properties of amino acids

Input encoding Each amino acid is converted into numerical value based on its properties

Input encoding Amino acids are converted to binary vectors or feature values (Wu & McLarty, 2000)

Direct encoding 1-of-20 (Bin20) Each amino acid is represented by 20 binary digits, with only one of them is 1, others zero. E.g., A = [1, 0, 0, 0, 0, 0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0] D = [0, 1, 0, 0, 0, 0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0] … K = [0, 0, 0, 0, 0, 0, 0,0,0,0,0,0,0,0,0,0,0,0,0,1]. A protein sequence becomes the concatenation of these binary vectors.

Exchange groups The twenty amino acids can be grouped according to various chemical/structural properties.

Indirect encoding n-gram features ◦ n-gram feature is the number of occurrences of a short protein sequence of length n in a protein sequence. ◦ Definition: The n-gram features is a pair of values (v i, c i ) where v i   depicts the i-th gram feature and c i is the counts of this feature in a protein sequence for i=i…. |  | n ◦ For example: the 2-gram are: ◦ (AA, AB, …, AY, BA, BB, …, BY, YA, …, YY).

N-gram feature example AGCCDDAGAGKDDV AG – 3 GC- 1 CC -1 CD-1 DD-2 DA-1 GA-1 GK -1 KD -1 DV - 1

Indirect encoding Problems with n-gram features: ◦ Large input dimension for n > 2 ◦ Solution: feature selection N-gram# of inputs

N-gram feature N-gram feature can also be reduced by using the amino acid exchange groups in Table 6.2. E.g.

N-gram with exchange groups Using hydrophobicity ◦ A={DENQRK}, B={CSTPGHY}, C={AMILVFW} AGCCDDAGAGKDDV  CBBBAACBCBAAAC 2-grams are: CB-2, BB-2, BA-2, AA-2, AC-1, BC-1, AA-2 # of features just 3 2

Conclusion Neural network is an effective and efficient option for protein family classification with proper feature representation and input encoding.