Pattern databases
Gopalan Vivek

Pattern databases - topics: Definition, Applications, Classifications, Common Databases, Conclusions

Pattern databases - definition: Secondary databases derived from the conserved regions obtained from multiple sequence alignments of sequences in primary databases such as GenBank, EMBL, DDBJ, SWISS-PROT/TrEMBL, PIR, etc.

Primary databases (SWISS-PROT: protein; GenBank: DNA): millions of sequences.
Pattern extraction: multiple sequence alignment.
Pattern databases: thousands of patterns.

Pattern Databases - Applications
Function prediction for protein/nucleotide sequences, even when sequence similarity is low (<25%).
Classification of protein sequences into families.
Searching a pattern database takes less time than searching the primary database, since a pattern is a compact representation of features shared by many sequences.

Multiple Sequence Alignment (MSA)
Family based databases: consider the full MSA.
Motif based databases: consider local regions in the MSA (figure labels: Motif-1, Motif-3).

Pattern Databases - Protein
Motif based: PROSITE, PRINTS, BLOCKS
Family based: ProDom, PIR-ALN, ProtoMap, DOMO, ProClass, Pfam, SMART, TIGRFAMs, SBASE, SYSTERS

DNA pattern databases: REBASE, TRANSFAC

InterPro - Integrated resource of protein families and sites
Figure: PROSITE, PRINTS, BLOCKS, Pfam and ProDom feeding into InterPro.

Pattern databases Definition Applications Classifications Common Databases –PROSITE, PRINTS, BLOCKS & SMART (motif based) –MetaFam, InterPro (Integrated databases) Conclusions

Databases - General Tips
1. Source
2. Input formats & parameters
3. Output formats
4. Quality of the data
5. Other details: updates, coverage, speed, download, reference, methods, etc.

Focus
To search pattern databases for the enzyme “alkaline phosphatase” using their text or keyword search options.
To analyze the quality of the results from each database (sensitivity, specificity).
Sequence and pattern searches are covered in the afternoon’s practical.
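The two quality measures named above can be written down in a few lines. This is a minimal sketch: the function names and the counts in the usage lines are hypothetical and do not come from any of the databases discussed. It assumes the usual definitions, where sensitivity is the fraction of true family members a pattern finds and specificity is the fraction of non-members it correctly rejects.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of real family members that the pattern detects."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of non-members that the pattern correctly rejects."""
    return true_neg / (true_neg + false_pos)


# Hypothetical counts for one pattern scanned against a sequence database.
example_sensitivity = sensitivity(true_pos=45, false_neg=5)
example_specificity = specificity(true_neg=990, false_pos=10)
print(f"sensitivity={example_sensitivity:.2f} specificity={example_specificity:.2f}")
```

A pattern that misses known members loses sensitivity; one that hits unrelated sequences loses specificity, so both numbers are worth checking for each database.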

PROSITE consists of biologically significant protein sites, patterns and profiles that help to reliably identify to which known protein family (if any) a new sequence belongs. Based on SWISS-PROT/TrEMBL.

Text Search Sequence Scanner ID and text Search

Result: PROSITE documentation page
Labels: details about the pattern/profile; PROSITE ID; PROSITE pattern
[IV]-x-D-S-[GAS]-[GASC]-[GAST]-[GA]-T [S is the active site residue]
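A PROSITE pattern like the one on this slide can be read as a regular expression. The sketch below translates the common PROSITE elements (x for any residue, [..] for allowed residues, {..} for forbidden residues, (n) or (n,m) for repeats, < and > for the sequence ends) into Python `re` syntax. The function name `prosite_to_regex` and the test sequence are hypothetical, and the translation assumes a well-formed pattern. It is not the scanner PROSITE itself uses.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):          # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):            # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repeat count such as x(2) or x(2,4).
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element
        ).groups()
        if core == "x":                      # any residue
            core = "."
        elif core.startswith("{"):           # forbidden residues
            core = "[^" + core[1:-1] + "]"
        repeat = ""
        if low:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


ALP_PATTERN = "[IV]-x-D-S-[GAS]-[GASC]-[GAST]-[GA]-T"
ALP_REGEX = prosite_to_regex(ALP_PATTERN)

# Hypothetical sequence fragment, made up for illustration only.
hit = re.search(ALP_REGEX, "AAVLDSGGTATKK")
print(ALP_REGEX, hit.start(), hit.group())
```

Because a pattern either matches or does not, a single substitution at a conserved position is enough to lose the hit.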

Numerical Results PROSITE Pattern Detailed View - page 1

Detailed View - page 2 True Positives False Positives View entry in raw text format (no links)

Raw Text Format – PROSITE Format

ID Identification
AC Accession number
DT Date
DE Short description
PA Pattern
MA Matrix/profile
RU Rule
NR Numerical results
CC Comments
DR Cross-references to SWISS-PROT
3D Cross-references to PDB
DO Pointer to the documentation file
// Termination line
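Since every line of the raw text format starts with a two-character code, an entry can be split into fields with very little code. The following is a minimal sketch under that assumption. The function name `parse_prosite_text` is hypothetical, and the toy entry (ID, accession and description) is invented for illustration and is not a real PROSITE record.

```python
from collections import defaultdict

# Invented entry for illustration only; not a real PROSITE record.
TOY_ENTRY = """\
ID   EXAMPLE_SITE; PATTERN.
AC   PS99999;
DE   Hypothetical example entry.
PA   C-Y-x(2)-[DG]-G-x-[ST].
//
"""


def parse_prosite_text(text: str) -> list:
    """Group the lines of each entry by their two-character line code."""
    entries = []
    current = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):            # termination line closes the entry
            if current:
                entries.append(dict(current))
            current = defaultdict(list)
            continue
        code, value = line[:2], line[2:].strip()
        if code.strip():                     # skip blank lines
            current[code].append(value)
    if current:                              # entry without a final "//"
        entries.append(dict(current))
    return entries


parsed = parse_prosite_text(TOY_ENTRY)
print(parsed[0]["ID"], parsed[0]["PA"])
```

Values are kept as lists because codes such as CC and DR can occur on several lines of the same entry.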

PROSITE Profiles

Highly degenerate protein structural and functional domains: immunoglobulin domains, SH2 and SH3 domains.
Consensus sequences of repetitive DNA elements: SINEs, LINEs.
Basic gene expression signals: promoter elements, RNA processing signals, translational initiation sites.
DNA-binding protein motifs.
Protein and nucleic acid compositional domains: glutamine-rich activation domains, CpG islands.

PROSITE - features
Completeness
High specificity
Documentation
Periodic reviewing
Parallel update with SWISS-PROT (primary database)

Multiple Sequence Alignment: cydeggis, cyedggis, cyeeggit, cyhgdggs, cyrgdgnt
Find 4-5 functionally conserved residues.
CORE PATTERN: C-Y-x(2)-[DG]-G-x-[ST]
Scan SWISS-PROT with the core pattern. More FALSE POSITIVES?
YES: increase the sequence length of the pattern and scan again.
NO: the motif is added to the PROSITE DB.
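The step from the aligned sequences to the core pattern can be mimicked with a simple column-by-column rule. The sketch below is an illustration only: PROSITE patterns are curated by hand, and the rule used here (one residue gives a fixed position, up to `max_class_size` residues give a bracketed class, anything more variable gives x) is an assumption made for this example. The function name `derive_pattern` is hypothetical. In the output, x(2) denotes two consecutive unconstrained positions.

```python
def derive_pattern(aligned, max_class_size=2):
    """Build a PROSITE-style pattern from ungapped, equal-length sequences."""
    elements = []
    for column in zip(*aligned):
        residues = sorted({residue.upper() for residue in column})
        if len(residues) == 1:               # fully conserved position
            elements.append(residues[0])
        elif len(residues) <= max_class_size:
            elements.append("[" + "".join(residues) + "]")
        else:                                # too variable: any residue
            elements.append("x")

    # Collapse runs of x into x(n).
    collapsed, run = [], 0
    for element in elements + [None]:        # None flushes a trailing run
        if element == "x":
            run += 1
            continue
        if run:
            collapsed.append("x" if run == 1 else f"x({run})")
            run = 0
        if element is not None:
            collapsed.append(element)
    return "-".join(collapsed)


SLIDE_ALIGNMENT = ["cydeggis", "cyedggis", "cyeeggit", "cyhgdggs", "cyrgdgnt"]
core_pattern = derive_pattern(SLIDE_ALIGNMENT)
print(core_pattern)
```

Raising `max_class_size` makes the pattern more specific for these five sequences but more likely to miss family members that were not in the alignment.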

PRINTS: protein fingerprint database
Fingerprint: a set of motifs that represents the most conserved regions of a multiple sequence alignment.
Better diagnostic reliability than single-motif methods.
Source: SWISS-PROT/TrEMBL

Multiple Sequence Alignment: cydeggis, cyedggis, cyeeggit, cyhgdggs
Identification of ALL the conserved regions (motifs).
Creation of frequency matrices, one per motif.
Iterative scanning of protein databases (SWISS-PROT/TrEMBL) with the frequency matrices until convergence.
The resulting set of motifs (the fingerprint) is stored in the PRINTS DB.
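A frequency matrix for one motif can be sketched as a list of per-column residue counts. The example below builds such a matrix from the four aligned segments on this slide and scores every window of a query sequence against it. The function names and the query sequence are hypothetical. The scoring (a plain sum of column counts) is a simplification chosen for illustration, not the scoring scheme PRINTS actually uses, and the iterative rescanning step is not shown.

```python
from collections import Counter


def frequency_matrix(motif_sequences):
    """One Counter of residue counts per aligned motif column."""
    return [Counter(column) for column in zip(*motif_sequences)]


def score_window(matrix, window):
    """Sum of the column counts for the residues in the window."""
    return sum(column[residue] for column, residue in zip(matrix, window))


def best_hit(matrix, sequence):
    """Return (score, start) of the highest-scoring window in the sequence."""
    width = len(matrix)
    return max(
        (score_window(matrix, sequence[start:start + width]), start)
        for start in range(len(sequence) - width + 1)
    )


MOTIF_SEQUENCES = ["CYDEGGIS", "CYEDGGIS", "CYEEGGIT", "CYHGDGGS"]
matrix = frequency_matrix(MOTIF_SEQUENCES)

# Hypothetical query sequence, made up for illustration only.
top_score, top_start = best_hit(matrix, "AAACYEEGGISKK")
print(top_score, top_start)
```

Unlike a pattern, the matrix still gives a partial score when a residue does not match, which is one reason a set of scored motifs tolerates variation better than a single exact pattern.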

Database ID, no. of motifs and text Search Motif scanner (for searching a sequence or pattern against PRINTS database)

Page 1 for ‘alkaline phosphatase’ entry in PRINTS Documentation, Links & references

Page 2 Fingerprint details Sequence Summary

Page 3: Motif no. 1; Motif no. 2; “raw” motif; SWISS-PROT IDs; start of each motif and interval between motifs in the fingerprint

BLOCKS
Blocks are multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins.
The BLOCKS database is a collection of blocks representing known protein families that can be used to compare a protein or DNA sequence with documented families of proteins.

Making blocks: Blocks are produced by the automated PROTOMAT system (Henikoff and Henikoff, 1991), which applies a robust motif-finder to a set of related protein sequences.

Sequence, no. of blocks and text Searches Blocks Maker

Page 1 Summary Search methods using blocks

Page 2: BLOCK 1. Labels: SWISS-PROT ID; start position of the block.
Weak blocks: strength < 1100. Strong blocks: strength >= 1100.

SMART contains >500 domain families found in signaling, extracellular and chromatin-associated proteins. Each domain is extensively annotated with phyletic distributions, functional class, tertiary structures and functionally important residues.

ID and text Search ID & sequence Search Domain & GO search Alkaline Phosphatase

Results - Alkaline phosphatase “signatures”
PROSITE: represented as a single motif.
PRINTS: represented as 5 motif regions.
BLOCKS: represented as 6 block regions.
SMART: represented as a single profile.

Composite Pattern Databases: MetaFam, InterPro, CDD (Conserved Domain Database), iProClass

MetaFam & PANAL
MetaFam: protein family classification built with Blocks+, DOMO, Pfam, PIR-ALN, PRINTS, PROSITE, ProDom, SBASE and SYSTERS.
PANAL: the Protein ANALysis tool page of MetaFam.

PANAL

InterPro
Built from PROSITE, PRINTS, Pfam, ProDom, SMART, TIGRFAMs, SWISS-PROT and TrEMBL.
Text- and sequence-based searches.

PRINTS PROSITE Pfam PRODOM SMART Detailed View - page 1

Detailed View - page 2 BLOCKS database link

PR: PRINTS
PS: PROSITE
PF: Pfam
PD: ProDom
SM: SMART
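The two-letter prefixes in this legend are enough to tell which source database a signature accession comes from. A minimal sketch, assuming the accession starts with one of the prefixes listed on this slide; the function name and the accession strings in the usage lines are placeholders, not real entries.

```python
# Prefix legend as given on the slide.
SIGNATURE_PREFIXES = {
    "PR": "PRINTS",
    "PS": "PROSITE",
    "PF": "Pfam",
    "PD": "ProDom",
    "SM": "SMART",
}


def source_database(accession: str) -> str:
    """Look up the source database from the first two letters of an accession."""
    return SIGNATURE_PREFIXES.get(accession[:2].upper(), "unknown")


# Placeholder accessions for illustration only.
for placeholder in ("PS00000", "PF00000", "XX00000"):
    print(placeholder, source_database(placeholder))
```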

Detailed View - page 2

T: true positive. F: false positive. The range of each motif is also shown.

Pattern databases Definition Applications Classifications Common Databases –PROSITE, PRINTS & BLOCKS (motif based) –MetaFam, InterPro (Integrated databases) Conclusions

CONCLUSION
Pattern databases are diverse, ranging from small patterns to profiles to complex hidden Markov models (HMMs).
Each has different strengths and weaknesses.
Database formats differ.
It is best to combine and analyze results from different pattern databases.