Pattern databases Gopalan Vivek
Pattern databases - topics Definition Applications Classifications Common Databases Conclusions
Pattern databases – definition Secondary databases derived from conserved regions obtained from multiple sequence alignments of sequences in primary databases such as GenBank, EMBL, DDBJ, SWISS-PROT/TrEMBL, PIR, etc.
Primary databases (SWISS-PROT: protein; GenBank: DNA): millions of sequences. Pattern extraction by multiple sequence alignment. Pattern databases: thousands of patterns.
Pattern Databases - Applications Function prediction for protein/nucleotide sequences, even when sequence similarity is low (<25%). Classification of protein sequences into families. Searching a pattern database takes less time than searching a primary database, because patterns are a compact representation of features shared by many sequences.
Multiple Sequence Alignment (MSA) Family based databases: consider the full MSA. Motif based databases: consider local regions (motifs) in the MSA.
Pattern Databases – Protein Motif based: PROSITE, PRINTS, BLOCKS. Family based: ProDom, PIR-ALN, ProtoMap, DOMO, ProClass, Pfam, SMART, TIGRFAMs, SBASE, SYSTERS.
DNA pattern databases: REBASE, TRANSFAC
InterPro - Integrated resources of protein families and sites PROSITE PRINTS BLOCKS Pfam ProDom InterPro
Pattern databases Definition Applications Classifications Common Databases –PROSITE, PRINTS, BLOCKS & SMART (motif based) –MetaFam, InterPro (Integrated databases) Conclusions
Databases – General Tips 1. Source 2. Input formats & parameters 3. Output formats 4. Quality of the data 5. Other details – updates, coverage, speed, download, reference, methods etc.
Focus Search the pattern databases for the enzyme “Alkaline phosphatase” using their text or keyword search options. Analyze the quality of the results from each of these databases in terms of sensitivity and specificity. Sequence and pattern searches are covered in the afternoon’s practical.
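The two quality measures named on this slide can be computed from the true/false positive and negative counts that a signature produces when scanned against a database. The sketch below is only an illustration: the counts are hypothetical and are not taken from any database entry.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true family members that the signature detects."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of non-members that the signature correctly rejects."""
    return tn / (tn + fp)


def precision(tp: int, fp: int) -> float:
    """Fraction of reported hits that are real family members."""
    return tp / (tp + fp)


# Hypothetical counts, for illustration only.
tp, fp, fn, tn = 45, 5, 5, 945

print(f"sensitivity = {sensitivity(tp, fn):.3f}")
print(f"specificity = {specificity(tn, fp):.3f}")
print(f"precision   = {precision(tp, fp):.3f}")
```

Specificity needs the number of true negatives, which is rarely listed directly; precision is included because it can be computed from true and false positives alone.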
PROSITE consists of biologically significant protein sites, patterns and profiles that help to reliably identify to which known protein family (if any) a new sequence belongs. Based on SWISSPROT/TrEMBL
Text Search Sequence Scanner ID and text Search
Result: PROSITE Documentation page Details about the pattern/profile PROSITE ID PROSITE Pattern [IV]-x-D-S-[GAS]-[GASC]-[GAST]-[GA]-T [S is the active site residue]
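A PROSITE pattern such as the one on this slide can be read as a regular expression. The sketch below translates the common pattern elements into Python `re` syntax and scans a sequence with the result. The function name `prosite_to_regex` and the toy sequence are assumptions made for illustration; this is not how the PROSITE scanner itself is implemented.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Handles the common elements only: fixed residues, x (any residue),
    [..] (allowed residues), {..} (forbidden residues), repeats such as
    x(2) or x(2,4), and the < and > anchors at the pattern ends.
    """
    pattern = pattern.strip().rstrip(".")
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):
            prefix, element = "^", element[1:]
        if element.endswith(">"):
            suffix, element = "$", element[:-1]
        if not element:
            raise ValueError(f"empty element in pattern: {pattern!r}")
        # Split off an optional repeat count such as (2) or (2,4).
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core == "x":
            regex = "."
        elif core.startswith("{") and core.endswith("}"):
            regex = "[^" + core[1:-1] + "]"
        else:
            regex = core  # a single residue or a [..] class
        if low is not None:
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + regex + suffix)
    return "".join(parts)


pattern = "[IV]-x-D-S-[GAS]-[GASC]-[GAST]-[GA]-T"
regex = prosite_to_regex(pattern)

# Made-up toy sequence that contains one occurrence of the pattern.
toy_sequence = "MKVTDSAAGATLLK"
hit = re.search(regex, toy_sequence)
if hit:
    print(regex, hit.start(), hit.group())
```

The active-site serine is the fourth pattern position, so in a hit it sits at `hit.start() + 3` (0-based).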
Numerical Results PROSITE Pattern Detailed View - page 1
Detailed View - page 2 True Positives False Positives View entry in raw text format (no links)
Raw Text Format – PROSITE Format
ID  Identification
AC  Accession number
DT  Date
DE  Short description
PA  Pattern
MA  Matrix/profile
RU  Rule
NR  Numerical results
CC  Comments
DR  Cross-references to SWISS-PROT
3D  Cross-references to PDB
DO  Pointer to the documentation file
//  Termination line
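Because every line starts with a two-character line code and every entry ends with `//`, the raw text format is easy to split into records. The sketch below groups the lines of each entry by line code. The sample entry is hand-written in the style of the format; its ID, accession and documentation pointer are hypothetical and do not correspond to a real PROSITE entry.

```python
from collections import defaultdict

# Hand-written sample in the style of the format (hypothetical ID/AC/DO values).
SAMPLE = """\
ID   EXAMPLE_SITE; PATTERN.
AC   PS99999;
DE   Example active site.
PA   C-Y-x(2)-[DG]-G-x-[ST].
CC   Illustrative comment line.
DO   PDOC99999;
//
"""


def parse_entries(text: str) -> list[dict[str, list[str]]]:
    """Split flat-file text into entries keyed by two-character line code."""
    entries = []
    current = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):  # termination line closes the entry
            if current:
                entries.append(dict(current))
            current = defaultdict(list)
            continue
        if len(line) < 2:
            continue  # skip blank lines
        code, value = line[:2], line[2:].strip()
        current[code].append(value)  # a code may span several lines
    if current:  # tolerate a missing final terminator
        entries.append(dict(current))
    return entries


entry = parse_entries(SAMPLE)[0]
print(entry["ID"][0])
print("".join(entry["PA"]))
```

Values are kept as lists because a line code such as PA or CC can be continued over several lines.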
PROSITE Profiles
Highly degenerate protein structural and functional domains: immunoglobulin domains, SH2 and SH3 domains.
Consensus sequences of repetitive DNA elements: SINEs, LINEs.
Basic gene expression signals: promoter elements, RNA processing signals, translational initiation sites.
DNA-binding protein motifs.
Protein and nucleic acid compositional domains: glutamine-rich activation domains, CpG islands.
PROSITE - features Completeness. High specificity. Documentation. Periodic reviewing. Parallel update with SWISS-PROT (primary database).
Multiple Sequence Alignment (motif): cydeggis, cyedggis, cyeeggit, cyhgdggs, cyrgdgnt
1. Find 4-5 functionally conserved residues to form the CORE PATTERN: C-Y-x(2)-[DG]-G-x-[ST]
2. Scan the core pattern against SWISS-PROT.
3. More FALSE POSITIVES? YES: increase the sequence length of the pattern and scan again. NO: add the pattern to the PROSITE DB.
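The core-pattern step can be sketched as a column-by-column summary of the aligned motif: a fully conserved column becomes a fixed residue, a column with few alternatives becomes a bracketed class, and anything else becomes x. This is a simplified illustration and not the curated procedure used for PROSITE; the function name `core_pattern` and the `max_alternatives` cut-off are assumptions. Applied to the five aligned segments on the slide, it yields the slide's core pattern, with the two-residue gap written as x(2) in PROSITE syntax.

```python
from itertools import groupby


def core_pattern(aligned: list[str], max_alternatives: int = 2) -> str:
    """Summarise an ungapped alignment as a PROSITE-style pattern."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("sequences must be aligned to the same length")

    elements = []
    for column in zip(*aligned):
        residues = sorted({residue.upper() for residue in column})
        if len(residues) == 1:
            elements.append(residues[0])                  # conserved residue
        elif len(residues) <= max_alternatives:
            elements.append("[" + "".join(residues) + "]")  # few alternatives
        else:
            elements.append("x")                          # variable position

    # Collapse runs of x into x(n); other elements are kept one per column.
    merged = []
    for element, run in groupby(elements):
        count = len(list(run))
        if element == "x" and count > 1:
            merged.append(f"x({count})")
        else:
            merged.extend([element] * count)
    return "-".join(merged)


motif = ["cydeggis", "cyedggis", "cyeeggit", "cyhgdggs", "cyrgdgnt"]
print(core_pattern(motif))
```

Raising `max_alternatives` makes the pattern more specific but less tolerant of new family members, which is the trade-off behind the false-positive check in the flowchart.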
PRINTS Protein fingerprint database Fingerprint: a set of motifs that represent the most conserved regions of a multiple sequence alignment. Better diagnostic reliability than single-motif methods. Source: SWISSPROT/TrEMBL
Multiple Sequence Alignment: cydeggis, cyedggis, cyeeggit, cyhgdggs
1. Identify ALL the conserved regions (motifs); together they form the fingerprint.
2. Create a frequency matrix for each motif.
3. Scan the frequency matrices against the protein databases (SWISS-PROT / TrEMBL) iteratively until convergence.
4. Store the resulting fingerprint in the PRINTS DB.
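The frequency-matrix step can be illustrated with the four aligned segments from the slide. The sketch below counts residues per motif position and scores every window of a query sequence by summing the counts of its residues. The scoring scheme, the function names and the query sequence are assumptions for illustration; the iterative rescanning until convergence is not shown.

```python
from collections import Counter


def frequency_matrix(motif_alignment: list[str]) -> list[Counter]:
    """Count how often each residue occurs at each motif position."""
    return [Counter(column) for column in zip(*motif_alignment)]


def score(matrix: list[Counter], segment: str) -> int:
    """Sum the observed counts of the segment's residues, position by position."""
    return sum(counts[residue] for counts, residue in zip(matrix, segment))


def best_hit(matrix: list[Counter], sequence: str) -> tuple[int, int]:
    """Return (score, start) of the highest-scoring window in the sequence."""
    width = len(matrix)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the motif")
    windows = [
        (score(matrix, sequence[start:start + width]), start)
        for start in range(len(sequence) - width + 1)
    ]
    # max() with a key returns the first window that reaches the top score.
    return max(windows, key=lambda window: window[0])


motif = ["CYDEGGIS", "CYEDGGIS", "CYEEGGIT", "CYHGDGGS"]
matrix = frequency_matrix(motif)

# Made-up query sequence containing a motif-like segment.
query = "MKACYEEGGISLL"
print(best_hit(matrix, query))
```

In an iterative scheme, sequences scoring above a chosen threshold would be added to the alignment and the matrix rebuilt, repeating until no new sequences are found.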
Database ID, no. of motifs and text Search Motif scanner (for searching a sequence or pattern against PRINTS database)
Page 1 for ‘alkaline phosphatase’ entry in PRINTS Documentation, Links & references
Page 2 Fingerprint details Sequence Summary
Page 3 Motif no. 1 Motif no. 2 “Raw” motif SWISSPROT -IDs Start and Interval between motifs in the fingerprint
BLOCKS Blocks are multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins. The BLOCKS database is a collection of blocks representing known protein families that can be used to compare a protein or DNA sequence with documented families of proteins.
Blocks Making Blocks are produced by the automated PROTOMAT system (Henikoff and Henikoff, 1991), which applies a robust motif-finder to a set of related protein sequences.
Sequence, no. of blocks and text Searches Blocks Maker
Page 1 Summary Search methods using blocks
Page 2 BLOCK - 1 Start position of the block SWISSPROT ID Weak Blocks - Strength < 1100 Strong Blocks - Strength >= 1100
SMART Contains >500 domain families found in signaling, extracellular and chromatin-associated proteins. Each domain is extensively annotated with phyletic distributions, functional class, tertiary structures and functionally important residues.
ID and text Search ID & sequence Search Domain & GO search Alkaline Phosphatase
Results – Alkaline phosphatase “Signatures” PROSITE: represented as a single motif. PRINTS: represented as 5 motif regions. BLOCKS: represented as 6 block regions. SMART: represented as a single profile.
Composite Pattern Databases MetaFam, InterPro, CDD (Conserved Domain Database), iProClass
MetaFam & PANAL MetaFam: protein family classification built with Blocks+, DOMO, Pfam, PIR-ALN, PRINTS, PROSITE, ProDom, SBASE, SYSTERS. PANAL: Protein ANALysis tool page of MetaFam.
PANAL
InterPro Built from PROSITE, PRINTS, Pfam, ProDom, SMART, TIGRFAMs, SWISS-PROT and TrEMBL. Text- and sequence-based searches.
PRINTS PROSITE Pfam PRODOM SMART Detailed View - page 1
Detailed View - page 2 BLOCKS database link
PR: PRINTS; PS: PROSITE; PF: Pfam; PD: ProDom; SM: SMART
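The two-letter prefixes in this legend make it possible to tell from an accession alone which member database a signature comes from. A minimal lookup sketch follows; the accession strings are hypothetical examples of the format, not real entries.

```python
# Prefix legend as given on the slide.
SOURCE_BY_PREFIX = {
    "PR": "PRINTS",
    "PS": "PROSITE",
    "PF": "Pfam",
    "PD": "ProDom",
    "SM": "SMART",
}


def source_database(accession: str) -> str:
    """Return the member database for an accession, based on its prefix."""
    return SOURCE_BY_PREFIX.get(accession[:2].upper(), "unknown")


# Hypothetical accession strings, for illustration only.
for accession in ["PS00001", "PR00001", "PF00001", "XX00001"]:
    print(accession, source_database(accession))
```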
Detailed View - page 2
T: True Positive; F: False Positive. Range of the motif
Pattern databases Definition Applications Classifications Common Databases –PROSITE, PRINTS & BLOCKS (motif based) –MetaFam, InterPro (Integrated databases) Conclusions
CONCLUSION Pattern databases are diverse, ranging from small patterns to profiles to complex HMMs. Each has different strengths and weaknesses. Database formats differ. It is best to combine and analyze results from different pattern databases.