Published by Imogene Cobb. Modified over 9 years ago.
1
Pattern databases Gopalan Vivek
2
Pattern databases - topics Definition Applications Classifications Common Databases Conclusions
3
Pattern databases Definition Applications Classifications Common Databases Conclusions
4
Pattern databases – definition Secondary databases derived from conserved patterns obtained from multiple sequence alignments of sequences in primary databases such as GenBank, EMBL, DDBJ, SWISS-PROT/TrEMBL, PIR, etc.
5
Primary databases (SWISS-PROT – protein, GenBank – DNA): millions of sequences → Pattern extraction by multiple sequence alignment → Pattern databases: thousands of patterns
6
Pattern databases Definition Applications Classifications Common Databases Conclusions
7
Pattern Databases - Applications Function prediction for protein/nucleotide sequences, even when sequence similarity is low (<25%). Classification of protein sequences into families. Searching a pattern database takes less time than searching a primary database – since patterns are a compact representation of features shared by many sequences.
8
Pattern databases Definition Applications Classifications Common Databases Conclusions
9
Multiple Sequence Alignment (MSA) Family based databases – consider the full MSA. Motif based databases – consider local regions in the MSA (figure labels: Motif 1, Motif 3).
10
Pattern Databases – Protein Motif based: PROSITE, PRINTS, BLOCKS. Family based: ProDom, PIR-ALN, ProtoMap, DOMO, ProClass, Pfam, SMART, TIGRFAMs, SBASE, SYSTERS.
11
DNA pattern databases: REBASE, TRANSFAC
12
InterPro - Integrated resources of protein families and sites. PROSITE, PRINTS, BLOCKS, Pfam, ProDom → InterPro
13
Pattern databases Definition Applications Classifications Common Databases –PROSITE, PRINTS, BLOCKS & SMART (motif based) –MetaFam, InterPro (Integrated databases) Conclusions
14
Databases – General Tips 1. Source 2. Input formats & parameters 3. Output formats 4. Quality of the data 5. Other details – updates, coverage, speed, download, reference, methods etc.
15
Focus To search pattern databases for the “Alkaline phosphatase” enzyme using their text or keyword search options. To analyze the quality of the results from each of these databases – sensitivity, specificity. Sequence & pattern searches – in the afternoon’s practical.
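The two quality measures named on this slide can be computed from hit counts. A minimal sketch, assuming the usual definitions (sensitivity = TP / (TP + FN), specificity = TN / (TN + FP)); the function names and the counts are hypothetical and are not results from any of the databases discussed.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true family members that the signature detects."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of non-members that the signature correctly rejects."""
    return tn / (tn + fp)


# Hypothetical counts, invented for illustration only.
tp, fn = 45, 5      # family members hit / missed
tn, fp = 9948, 2    # non-members rejected / wrongly hit

print(f"sensitivity = {sensitivity(tp, fn):.3f}")
print(f"specificity = {specificity(tn, fp):.5f}")
```

A signature that is made longer usually gains specificity (fewer false positives) and may lose sensitivity (more missed members), which is why both numbers are worth reporting together.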
16
PROSITE http://www.expasy.org/prosite/ PROSITE consists of biologically significant protein sites, patterns and profiles that help to reliably identify to which known protein family (if any) a new sequence belongs. Based on SWISS-PROT/TrEMBL.
17
Text Search Sequence Scanner ID and text Search http://www.expasy.org/prosite/
19
Result: PROSITE Documentation page PROSITE ID PROSITE Pattern Details about the pattern/profile [IV]-x-D-S-[GAS]-[GASC]-[GAST]-[GA]-T [S is the active site residue]
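A PROSITE pattern such as the one on this slide can be scanned against a sequence by translating it into a regular expression. The sketch below is a simplified translator written for illustration: it handles `x`, `[...]` classes, `{...}` exclusions and `(n)` / `(n,m)` repeats, and ignores the `<` and `>` anchors. The function name and the test sequence are made up; only the pattern comes from the slide.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE pattern into a Python regular expression (simplified)."""
    tokens = []
    for element in pattern.strip().rstrip(".").split("-"):
        # Split each element into its core and an optional repeat, e.g. x(2) or x(2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.groups()
        if core == "x":
            token = "."                       # any residue
        elif core.startswith("{") and core.endswith("}"):
            token = "[^" + core[1:-1] + "]"   # any residue except those listed
        else:
            token = core                      # literal residue or [ABC] class
        if low and high:
            token += "{%s,%s}" % (low, high)
        elif low:
            token += "{%s}" % low
        tokens.append(token)
    return "".join(tokens)


# Pattern taken from the slide (alkaline phosphatase active site).
ALP_PATTERN = "[IV]-x-D-S-[GAS]-[GASC]-[GAST]-[GA]-T"
ALP_REGEX = prosite_to_regex(ALP_PATTERN)

# Made-up sequence, used only to exercise the regex.
toy_sequence = "MKQVTDSAAGATLL"
hit = re.search(ALP_REGEX, toy_sequence)
print(ALP_REGEX, hit.group(0), hit.start())
```

Python's `re` is used here because the element-by-element mapping is direct; a production scanner would also need overlapping-match handling and the anchor symbols.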
20
Numerical Results PROSITE Pattern Detailed View - page 1
21
Detailed View - page 2 True Positives False Positives View entry in raw text format (no links)
22
Raw Text Format – PROSITE Format
23
ID  Identification
AC  Accession number
DT  Date
DE  Short description
PA  Pattern
MA  Matrix/profile
RU  Rule
NR  Numerical results
CC  Comments
DR  Cross-references to SWISS-PROT
3D  Cross-references to PDB
DO  Pointer to the documentation file
//  Termination line
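The two-letter line codes make the raw text format easy to read programmatically. A minimal sketch, assuming the line code occupies the first two columns and the data starts at column 6; the parser name is hypothetical and the entry below is invented for illustration (only its PA line reuses the pattern shown earlier in the slides).

```python
def parse_prosite_entry(text: str) -> dict:
    """Group the lines of one PROSITE-style entry by their two-letter line code."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue                  # skip blank lines
        if line.startswith("//"):
            break                     # termination line ends the entry
        code, value = line[:2], line[5:].strip()
        fields.setdefault(code, []).append(value)
    return fields


# Invented entry: the ID, AC, DE and CC values are placeholders, not real data.
TOY_ENTRY = """
ID   TOY_MOTIF; PATTERN.
AC   PS99999;
DE   Made-up entry for illustration.
PA   [IV]-x-D-S-[GAS]-[GASC]-[GAST]-[GA]-T.
CC   Invented comment line one.
CC   Invented comment line two.
//
"""

entry = parse_prosite_entry(TOY_ENTRY)
print(entry["ID"], entry["PA"], len(entry["CC"]))
```

Each code maps to a list because codes such as CC and DR normally repeat within one entry.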
24
PROSITE Profiles
25
Highly degenerate protein structural and functional domains – immunoglobulin domains, SH2 and SH3 domains. Consensus sequences of repetitive DNA elements – SINEs, LINEs. Basic gene expression signals – promoter elements, RNA processing signals, translational initiation sites. DNA-binding protein motifs. Protein and nucleic acid compositional domains – glutamine-rich activation domains, CpG islands.
26
PROSITE - features Completeness. High specificity. Documentation. Periodic reviewing. Parallel update with SWISS-PROT (primary database).
27
Multiple Sequence Alignment (cydeggis, cyedggis, cyeeggit, cyhgdggs, cyrgdgnt) → Find 4-5 functionally conserved residues → CORE PATTERN C-Y-x(2)-[DG]-G-x-[ST] → Scan SWISS-PROT → More FALSE POSITIVES? YES: increase the sequence length of the pattern and scan again. NO: the motif goes into the PROSITE DB.
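The step from aligned sequences to a core pattern can be illustrated with a simple column rule. This is a sketch under an assumed rule (one residue in a column gives a literal, up to two give a bracketed class, anything more variable gives `x`); it is not PROSITE's actual curation procedure, which relies on expert choice of functionally conserved residues. The function name and the `max_class_size` parameter are hypothetical.

```python
def column_pattern(aligned, max_class_size=2):
    """Derive a PROSITE-style pattern from equal-length, ungapped aligned sequences."""
    elements = []
    for column in zip(*aligned):
        residues = sorted({r.upper() for r in column})
        if len(residues) == 1:
            elements.append(residues[0])                    # fully conserved
        elif len(residues) <= max_class_size:
            elements.append("[" + "".join(residues) + "]")  # small residue class
        else:
            elements.append("x")                            # too variable

    # Merge consecutive x positions into x(n).
    merged, run = [], 0
    for element in elements + [None]:
        if element == "x":
            run += 1
            continue
        if run:
            merged.append("x" if run == 1 else "x(%d)" % run)
            run = 0
        if element is not None:
            merged.append(element)
    return "-".join(merged)


# The five aligned segments shown on the slide.
SLIDE_SEQUENCES = ["cydeggis", "cyedggis", "cyeeggit", "cyhgdggs", "cyrgdgnt"]
print(column_pattern(SLIDE_SEQUENCES))
```

With the five slide sequences and `max_class_size=2`, this rule reproduces the slide's core pattern; raising the limit would make the pattern more specific to these examples and probably less sensitive to new family members.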
28
PRINTS http://bioinf.man.ac.uk/dbbrowser/PRINTS/ Protein fingerprint database. Fingerprint – a set of motifs that represent the most conserved regions of a multiple sequence alignment. Better diagnostic reliability than single-motif methods. Source – SWISS-PROT/TrEMBL
29
Multiple Sequence Alignment → Identification of ALL the conserved regions (motifs, e.g. cydeggis, cyedggis, cyeeggit, cyhgdggs) → Creation of frequency matrices → Iterative scanning of the protein databases (SWISS-PROT / TrEMBL) with the frequency matrices until convergence → Set of motifs (fingerprint) stored in the PRINTS DB
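A frequency matrix records, for each motif position, how often each residue occurs in the aligned examples. The sketch below builds one from the four motif instances on the slide and uses it to score windows of a sequence in a single pass. The scoring scheme (sum of raw counts), the function names and the test sequence are assumptions made for illustration; the iterative rescanning until convergence that PRINTS uses is not implemented here.

```python
from collections import Counter


def frequency_matrix(motif_instances):
    """One Counter per motif position: residue -> number of occurrences."""
    return [Counter(column) for column in zip(*motif_instances)]


def score(matrix, segment):
    """Sum, over positions, of how often the segment's residue was seen there."""
    return sum(column[residue] for column, residue in zip(matrix, segment))


def best_hit(matrix, sequence):
    """Return (score, start) of the highest-scoring window in the sequence."""
    width = len(matrix)
    windows = [
        (score(matrix, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(windows)


# Motif instances from the slide, upper-cased.
MOTIF = ["CYDEGGIS", "CYEDGGIS", "CYEEGGIT", "CYHGDGGS"]
matrix = frequency_matrix(MOTIF)

# Made-up sequence containing one motif-like segment.
toy_sequence = "AAACYEEGGISAAA"
print(best_hit(matrix, toy_sequence))
```

In an iterative scheme, new sequences scoring above a cutoff would be added to the motif instances and the matrix rebuilt, repeating until no new sequences are found.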
30
http://bioinf.man.ac.uk/dbbrowser/PRINTS/ Database ID, no. of motifs and text Search Motif scanner (for searching a sequence or pattern against PRINTS database)
32
Page 1 for ‘alkaline phosphatase’ entry in PRINTS Documentation, Links & references
33
Page 2 Fingerprint details Sequence Summary
34
Page 3 Motif no. 1 Motif no. 2 “Raw” motif SWISSPROT -IDs Start and Interval between motifs in the fingerprint
35
BLOCKS http://blocks.fhcrc.org/blocks/ Blocks are multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins. The BLOCKS database is a collection of blocks representing known protein families that can be used to compare a protein or DNA sequence with documented families of proteins.
36
Making Blocks. Blocks are produced by the automated PROTOMAT system (Henikoff and Henikoff, 1991), which applies a robust motif-finder to a set of related protein sequences.
37
http://blocks.fhcrc.org/blocks/blocksdiag.jpg
38
http://blocks.fhcrc.org/blocks/ Sequence, no. of blocks and text Searches Blocks Maker
39
Page 1 Summary Search methods using blocks
40
Page 2 BLOCK - 1 SWISSPROT ID Number represents the start position of the block Weak Blocks - Strength < 1100 Strong Blocks - Strength >= 1100
41
SMART http://smart.embl-heidelberg.de/ Contains >500 domain families found in signaling, extracellular and chromatin-associated proteins. Each domain is extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues.
42
ID and text Search ID & sequence Search Domain & GO search Alkaline Phosphatase
45
Results – Alkaline phosphatase “Signatures” PROSITE – represented as a single motif. PRINTS – represented as 5 motif regions. BLOCKS – represented as 6 block regions. SMART – represented as a single profile.
46
Composite Pattern Databases: MetaFam, InterPro, CDD (Conserved Domain Database), iProClass
47
MetaFam & PANAL MetaFam - http://metafam.ahc.umn.edu/ PANAL – Protein ANALysis tool page of MetaFam http://mgd.ahc.umn.edu/panal/ Protein family classification built with Blocks+, DOMO, Pfam, PIR-ALN, PRINTS, PROSITE, ProDom, SBASE, SYSTERS.
48
PANAL
49
InterPro http://www.ebi.ac.uk/interpro Built from PROSITE, PRINTS, Pfam, ProDom, SMART, TIGRFAM, SWISS-PROT and TrEMBL. Text- and sequence-based searches.
51
http://www.ebi.ac.uk/interpro/
52
PRINTS PROSITE Pfam PRODOM SMART Detailed View - page 1
53
Detailed View - page 2 BLOCKS database link
54
PR – PRINTS PS – PROSITE PF – Pfam PD – ProDom SM – SMART
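The two-letter prefixes listed on this slide are enough to sort the signatures of a combined result by member database. A minimal sketch, assuming accessions always begin with one of these prefixes; the function names are hypothetical and the accessions below are made-up placeholders, not real database entries.

```python
PREFIX_TO_DATABASE = {
    "PR": "PRINTS",
    "PS": "PROSITE",
    "PF": "Pfam",
    "PD": "ProDom",
    "SM": "SMART",
}


def source_database(accession: str) -> str:
    """Name the member database from the first two letters of an accession."""
    return PREFIX_TO_DATABASE.get(accession[:2].upper(), "unknown")


def group_by_database(accessions):
    """Map each database name to the accessions that came from it."""
    groups = {}
    for accession in accessions:
        groups.setdefault(source_database(accession), []).append(accession)
    return groups


# Placeholder accessions, invented for illustration.
hits = ["PS99999", "PF99999", "PR99999", "PS99998", "XX00001"]
print(group_by_database(hits))
```

Grouping hits this way makes it easy to see which member databases agree on a family assignment and which are silent.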
55
Detailed View - page 2
56
T – True Positive F – False Positive Range of the motif
57
Pattern databases Definition Applications Classifications Common Databases –PROSITE, PRINTS & BLOCKS (motif based) –MetaFam, InterPro (Integrated databases) Conclusions
58
CONCLUSION Pattern databases are diverse, ranging from short patterns to profiles to complex HMMs. Each has different strengths and weaknesses. Database formats differ. It is best to combine and analyze results from different pattern databases.