Download presentation
Presentation is loading. Please wait.
Published byEmily Stevenson Modified over 9 years ago
1
Secondary Databases Ansuman sahoo Roll: Y1011009 Bioinformatics Class Presentation 30 Jan 2013
2
Whydatabases ? biology has turned into data-rich science High-throughput genomics, proteomics, metabolomics,... Vast amount of data generated in experiments (like MS peptide fragments) need for storing and communicating large datasets has grown tremendously archiving, curation, analysis and interpretation of all of these datasets are a challenge convenient methods for proper storing, searching & retrieving necessary Databases are the means to handle this data overload Why need Database?
3
What candatabases do? Make biological data available... 1.… to scientists. 2.… in computer-readable form. Analysis (computer based) Handle and share large volumes of data Interface for computer based systems (Algorithms, Web interfaces) Store data Defined formats Automated storage and retrieval of experimental data Link knowledge with external resources What Databases can do?
4
Database classification I Type of data Nucleotide or protein sequences Protein sequence patterns and motifs Macromolecular 3D structures Gene expression data Metabolic pathways... Data entry and quality control Scientists deposit data directly Appointed curators add and update Type and degree of error checking Consistency, redundancy, conflicts, updates
5
Database classification II Primary or derived data Primary: experimental results directly into database Secondary: results of analysis of primary databases Technical design Flat-files Relational database (SQL) Object-oriented database Exchange/publication technologies (FTP, HTML, COBRA, XML, SOAP) Maintainer status Large, public institution funded by government (EMBL, NCBI) Academic group or scientist Commercial company
6
EMBL DDBJDDBJ GenBank sequences submitted directly by scientists and genome sequencing group, and sequences taken from literature and patents. entries in the EMBL, GenBank and DDBJ databases are synchronized on a daily basis accession numbers are managed in a consistent manner comparatively little error checking and fair amount of redundancy. Nucleotide sequence databases
7
Protein sequence databases UniProt KB mission to provide a comprehensive, high-quality and freely accessible resource of protein sequence and functional information SWISS-PROT is a protein sequence database which strives to provide a high level of annotations (such as the description of the function of a protein, its domains structure, post-translational modifications, variants, etc.), a minimal level of redundancy and high level of integration with other databases. TrEMBL is a computer-annotated supplement of SWISS-PROT that contains all the translations of EMBL nucleotide sequence entries not yet integrated in SWISS-PROT. PIR SWISS-PROT and PIR are different from the nucleotide databases in that they are both curated
8
Examples of Secondary Databases
9
Motif: Super secondary structure level Simple combination of a few secondary structure elements with specific geometric arrangements Helix-turn-helix is a motif Helix-loop-helix is a motif Several DNA major groove binding proteins (eg. transcription factors) have helix-turn-helix (NOT helix- loop-helix) motif Source: Tirumala kumar choudhry
10
Motif: http://www.ebi.ac.uk/thornton-srv/databases/cgi- bin/pdbsum/GetPage.pl?pdbcode=n/a&template=doc_promotif.html Motif descriptions: 1. Beta barrels 2. Beta sheets 3. Beta-alpha-beta units 4.Beta hairpins 5. Psi loops 6.Beta bulges 7.Beta strands 8. Helices 9. Helix-helix interactions 10. Beta turns 11. Gamma turns 12.Disulphides PROMOTIF: SCANPROSITE: If you are interested in finding a motif in novel protein go to: or http://prosite.expasy.org/scanprosite/
11
Domain: More of a tertiary structure level Several motifs can arrange three-dimensionally into a domain In simple terms, a domain is a fundamental unit of tertiary structure A polypeptide chain that is folded independently into stable structure is a domain Domains conveniently divide protein structures into discrete subunits, which are frequently classified separately Knowing the Domain, protein function prediction is possible. Four major classes: all alpha, beta, alpha/beta, alpha+beta, However, new classes are being added: double barrels, beta rolls are two classes added in 2010 Structural Classification of Proteins (SCOP) is where you have to look for definitions and examples http://scop.mrc-lmb.cam.ac.uk/scop/http://scop.mrc-lmb.cam.ac.uk/scop/ CATH is another database http://www.cathdb.info/http://www.cathdb.info/
12
Source: Tirumala kumar choudhry Four helix bundle is commonly seen in alpha- proteins Several alpha helical proteins show this domain. Ferritin, cytochrome C’ are some examples
13
pfam A collection of protein Domain families. Each entry is a multiple sequence alignment of a protein domain or a conserved region of interest. Based on Hidden Markov Model Pfam A: initial alignment of protein alignment is carried out by hand. Pfam B: Generated from automatic clustering of ProDom database. > 80% entries are associated with swiss-prot & TrEMBL entries.
14
Strategies for Secondary structure Prediction if an experimentally determined three- dimensional structure of a closely related protein is known, copy the secondary structure assignment from the known structure rather than attempt to predict it denovo If no related structures are known, use multiple sequence information
15
If the particular algorithm does not accept MSA as an input, try to predict the secondary structure for the target and a few of its distant homologues and use the consensus pattern of secondary structures as an additional indicator of reliability of the prediction. Run as many good methods as possible and use the agreement between their results to infer a consensus prediction.
16
Protein fold recognition the representation of the template structures (usually corresponding to proteins from the Protein Data Bank database) the evaluation of the compatibility between the target sequence and a template fold the algorithm to compute the optimal alignment between the target sequence and the template structure the way the ranking is computed and the statistical significance is estimated
17
Epitope Prediction Epitope: defined as “the chemical structure recognized by specific receptors of the immune system (antibodies, MHC molecules, and/or T cell receptors) Database: Immune Epitope Database andAnalysis Resource (IEDB)
18
Document Management Overview
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.