Domain databases and prediction. A domain is a: Compact, semi-independent unit (Richardson, 1981). Stable unit of a protein structure that can fold autonomously.


Domain databases and prediction

A domain is a: Compact, semi-independent unit (Richardson, 1981). Stable unit of a protein structure that can fold autonomously (Wetlaufer, 1973). Recurring functional and evolutionary module (Bork, 1992). “Nature is a tinkerer and not an inventor” (Jacob, 1977).

Identification of domains is essential for: high-resolution structures (e.g. Pfuhl & Pastore, 1995); sequence analysis (Russell & Ponting, 1998); multiple alignment methods; sequence database searches; prediction algorithms; fold recognition; structural/functional genomics.

Domain connectivity

Domain size The size of individual structural domains varies widely from 36 residues in E-selectin to 692 residues in lipoxygenase-1 (Jones et al., 1998), the majority (90%) having less than 200 residues (Siddiqui and Barton, 1995) with an average of about 100 residues (Islam et al., 1995). Small domains (less than 40 residues) are often stabilised by metal ions or disulphide bonds. Large domains (greater than 300 residues) are likely to consist of multiple hydrophobic cores (Garel, 1992).

Analysis of chain hydrophobicity in multidomain proteins

Domain characteristics Domains are genetically mobile units, and multidomain families are found in all three kingdoms (Archaea, Bacteria and Eukarya) underlining the finding that ‘Nature is a tinkerer and not an inventor’ (Jacob, 1977). The majority of genomic proteins, 75% in unicellular organisms and more than 80% in metazoa, are multidomain proteins created as a result of gene duplication events (Apic et al., 2001). Domains in multidomain structures are likely to have once existed as independent proteins, and many domains in eukaryotic multidomain proteins can be found as independent proteins in prokaryotes (Davidson et al., 1993).

Domain fusion For example, vertebrates have a multi-enzyme protein (GARs-AIRs-GARt) comprising the enzymes GAR synthetase (GARs), AIR synthetase (AIRs), and GAR transformylase (GARt). In insects, the polypeptide appears as GARs-(AIRs)2-GARt. However, GARs-AIRs is encoded separately from GARt in yeast, and in bacteria each domain is encoded separately (Henikoff et al., 1997). Note: GAR = glycinamide ribonucleotide; AIR = aminoimidazole ribonucleotide.

Domain fusion Genetic mechanisms influencing the layout of multidomain proteins include gross rearrangements such as inversions, translocations, deletions and duplications, homologous recombination, and slippage of DNA polymerase during replication (Bork et al., 1992). Although genetically conceivable, the transition from two single-domain proteins to a multidomain protein requires that both domains fold correctly and that they manage to bury a fraction of the previously solvent-exposed surface area in a newly generated inter-domain surface.

Domain swapping Domain swapping is one such structurally viable mechanism for forming oligomeric assemblies (Bennett et al., 1995). In domain swapping, a secondary or tertiary element of a monomeric protein is replaced by the same element of another protein. The swapped element can range from a secondary structure element to a whole structural domain. Domain swapping also represents a model of evolution for functional adaptation by oligomerization, e.g. of oligomeric enzymes that have their active site at sub-unit interfaces (Heringa and Taylor, 1997).

Domain databases

BLOCKS Domain database Blocks are short multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins. The rationale behind searching a database of blocks is that information from multiply aligned sequences is present in a concentrated form, reducing noise and increasing sensitivity to distant relationships. The BLOCKS Database (Henikoff et al., 2000) is automatically generated by looking for the most highly conserved regions in groups of proteins documented in various domain databases. Version 13.0 of the BLOCKS database consists of 8656 sequence blocks generated specifically from proteins in the PROSITE database.
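The notion of a "multiply aligned ungapped segment" can be illustrated with a small Python sketch. This is not the procedure used to build the BLOCKS Database, which relies on automated motif finding in conserved regions; it only shows how ungapped column runs could be cut out of an existing gapped alignment. The function name `find_ungapped_blocks`, the `min_len` parameter and the toy alignment are assumptions made for illustration.

```python
def find_ungapped_blocks(alignment, min_len=4, gap_chars="-."):
    """Return (start, end) column ranges (end exclusive) in which no sequence has a gap.

    Hypothetical helper for illustration only; it checks for gaps,
    not for residue conservation.
    """
    if not alignment:
        return []
    width = len(alignment[0])
    if any(len(seq) != width for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    blocks = []
    start = None
    for col in range(width):
        gap_free = all(seq[col] not in gap_chars for seq in alignment)
        if gap_free and start is None:
            start = col                      # a new ungapped run begins
        elif not gap_free and start is not None:
            if col - start >= min_len:       # keep only sufficiently long runs
                blocks.append((start, col))
            start = None
    if start is not None and width - start >= min_len:
        blocks.append((start, width))
    return blocks


# Toy alignment (made up for this sketch, not taken from BLOCKS).
toy_alignment = [
    "ADLGA-VFCDRYFQ",
    "SDVGP.RSCERFYQ",
    "ADLG-TQNCDRYYQ",
]

for start, end in find_ungapped_blocks(toy_alignment, min_len=4):
    print(start, end, [seq[start:end] for seq in toy_alignment])
```

A real block would additionally be required to be highly conserved; a per-column conservation threshold could be added to the `gap_free` test under the same assumptions.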

COGS Domain database The COGs (Clusters of Orthologous Groups) database is a phylogenetic classification of the proteins encoded within complete genomes (Tatusov et al., 2001). It primarily consists of bacterial and archaeal genomes. Incorporation of the larger genomes of multicellular eukaryotes into the COG system is achieved by identifying eukaryotic proteins that fit into already existing COGs. Eukaryotic proteins that have orthologs within different COGs are split into their individual domains. The COGs database currently consists of 3166 COGs including 75,725 proteins from 44 genomes.

PRINTS database PRINTS is a database of protein fingerprints (Attwood et al., 1999). A fingerprint is a group of conserved motifs used to characterise a protein family. Fingerprints can encode protein folds and functionalities more flexibly and powerfully than a single motif. Release 31.0 of PRINTS contains 1,550 entries, encoding 9,531 individual motifs.

Regular expressions
Alignment:
ADLGAVFALCDRYFQ
SDVGPRSCFCERFYQ
ADLGRTQNRCDRYYQ
ADIGQPHSLCERYFQ
Regular expression: [AS]-D-[IVL]-G-x4-{PG}-C-[DE]-R-[FY]2-Q
{PG} = not (P or G)
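To make the pattern notation on this slide concrete, the following Python sketch translates it into a standard regular expression and checks it against the four aligned sequences. It assumes the slide's shorthand (`x4`, `[FY]2`); official PROSITE patterns write repeats as `x(4)` and `[FY](2)`, which this sketch does not handle. The function name `pattern_to_regex` is hypothetical.

```python
import re


def pattern_to_regex(pattern: str) -> str:
    """Translate a slide-style pattern such as '[AS]-D-x4-{PG}' into a Python regex."""
    parts = []
    for element in pattern.split("-"):
        # An element is a residue, 'x', a [allowed] set or a {forbidden} set,
        # optionally followed by a repeat count.
        m = re.fullmatch(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(\d*)", element)
        if m is None:
            raise ValueError(f"unrecognised pattern element: {element!r}")
        core, count = m.groups()
        if core == "x":
            core = "."                         # x = any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"     # {PG} = not (P or G)
        parts.append(core + ("{" + count + "}" if count else ""))
    return "".join(parts)


slide_pattern = "[AS]-D-[IVL]-G-x4-{PG}-C-[DE]-R-[FY]2-Q"
regex = re.compile(pattern_to_regex(slide_pattern))

aligned_sequences = [
    "ADLGAVFALCDRYFQ",
    "SDVGPRSCFCERFYQ",
    "ADLGRTQNRCDRYYQ",
    "ADIGQPHSLCERYFQ",
]

for seq in aligned_sequences:
    print(seq, bool(regex.fullmatch(seq)))
```

Every sequence in the alignment matches, which is what is expected of a pattern derived from that alignment; a sequence with P or G at position 9 would be rejected by the `{PG}` element.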

Regular expressions
Regular expression            No. of exact matches in DB
D-A-V-I-D                     71
D-A-V-I-[DENQ]                252
[DENQ]-A-V-I-[DENQ]           925
[DENQ]-A-[VLI]-I-[DENQ]       2739
[DENQ]-[AG]-[VLI]2-[DENQ]     51506
D-A-V-E                       1088

PRINTS database PRINTS contains the most discriminating groups of regular expressions for each protein sequence

The PRODOM Database ProDom is a comprehensive set of protein domain families automatically generated from the SWISS-PROT and TrEMBL sequence databases

The PRODOM Database ProDom (Corpet et al., 2000) is a database of protein domain families automatically generated from the SWISS-PROT and TrEMBL sequence databases (Bairoch and Apweiler, 2000) using a novel procedure based on recursive PSI-BLAST searches (Altschul et al., 1997). Release of ProDom contains 283,772 domain families, 101,957 having at least 2 sequence members. ProDom-CG (Complete Genome) is a version of the ProDom database which holds genome-specific domain data.

The PROSITE Database PROSITE (Hofmann et al., 1999) is a good source of high quality annotation for protein domain families. A PROSITE sequence family is represented as a pattern or profile. The profiles provide a means of sensitive detection of common protein domains in new protein sequences. PROSITE release contains signatures specific for 1,098 protein families or domains. Each of these signatures comes with documentation providing background information on the structure and function of these proteins.

The PFAM Database Pfam is a large collection of multiple sequence alignments and hidden Markov models covering many common protein domains and families. For each family in Pfam you can: look at multiple alignments; view protein domain architectures; examine species distribution; follow links to other databases; view known protein structures; search with the hidden Markov model (HMM) for each alignment.

The PFAM Database Pfam is a database of two parts. The first, Pfam-A, is the curated part and contains over 5193 protein families. Pfam-A comprises manually crafted multiple alignments and profile-HMMs. To give Pfam more comprehensive coverage of known proteins, a supplement called Pfam-B is generated automatically. It contains a large number of small families taken from the PRODOM database that do not overlap with Pfam-A. Although of lower quality, Pfam-B families can be useful when no Pfam-A families are found.

The PFAM Database Sequence coverage Pfam-A: 73% (Gr+Bl). Sequence coverage Pfam-B: 20% (Bl). Other (Grey).

A PFAM alignment

CYB_TRYBB/1-197   M...LYKSG..EKRKG..LLMSGC.....LYR.....IYGVGFSLGFFIALQIIC..GVCLAWLFFSCFICSNWYFVLFL
CYB_MARPO/1-208   M.ARRLSILKQPIFSTFNNHLIDY.....PTPSNISYWWGFGSLAGLCLVIQILTGVFLAMHYTPHVDLAFLSVEHIMR.
CYB_HETFR/1-205   MATNIRKTH..PLLKIINHALVDL.....PAPSNISAWWNFGSLLVLCLAVQILTGLFLAMHYTADISLAFSSVIHICR.
CYB_STELO/1-204   M.TNIRKTH..PLMKILNDAFIDL.....PTPSNISSWWNFGSLLGLCLIMQILTGLFLAMHYTPDTTTAFSSVAHICR.
CYB_ASCSU/1-196              MKLDFVNSMVVSL.....PSSKVLTYGWNFGSMLGMVLGFQILTGTFLAFYYSNDGALAFLSVQYIMY.
CYB6_SPIOL/1-210  M.SKVYDWF..EERLEIQAIADDITSKYVPPHVNIFYCLGGITLT..CFLVQVATGFAMTFYYRPTVTDAFASVQYIMT.
CYB6_MARPO/1-210  M.GKVYDWF..EERLEIQAIADDITSKYVPPHVNIFYCLGGITLT..CFLVQVATGFAMTFYYRPTVTEAFSSVQYIMT.
CYB6_EUGGR/1-210  M.SRVYDWF..EERLEIQAIADDVSSKYVPPHVNIFYCLGGITFT..CFIIQVATGFAMTFYYRPTVTEAFLSVKYIMN.

CYB_TRYBB/1-197   WDFDLGFVIRSVHICFTSLLYLLLYIHIFKSITLIILFDTH..IL....VWFIGFILFVFIIIIAFIGYVLPCTMMSYWG
CYB_MARPO/1-208   .DVKGGWLLRYMHANGASMFFIVVYLHFFRGLY....YGSY..ASPRELVWCLGVVILLLMIVTAFIGYVLPWGQMSFWG
CYB_HETFR/1-205   .DVNYGWLIRNIHANGASLFFICIYLHIARGLY....YGSY..LLKE..TWNIGVILLFLLMATAFVGYVLPWGQMSFWG
CYB_STELO/1-204   .DVNYGWFIRYLHANGASMFFICLYAHMGRGLY....YGSY..MFQE..TWNIGVLLLLTVMATAFVGYVLPWGQMSFWG
CYB_ASCSU/1-196   .EVNFGWIFRVLHFNGASLFFIFLYLHLFKGLF....FMSY..RLKK..VWVSGIVILLLVMMEAFMGYVLVWAQMSFWA
CYB6_SPIOL/1-210  .EVNFGWLIRSVHRWSASMMVLMMILHVFRVYL....TGGFKKPREL..TWVTGVVLGVLTASFGVTGYSLPWDQIGYWA
CYB6_MARPO/1-210  .EVNFGWLIRSVHRWSASMMVLMMILHIFRVYL....TGGFKKPREL..TWVTGVILAVLTVSFGVTGYSLPWDQIGYWA
CYB6_EUGGR/1-210  .EVNFGWLIRSIHRWSASMMVLMMILHVCRVYL....TGGFKKPREL..TWVTGIILAILTVSFGVTGYSLPWDQVGYWA

CYB_TRYBB/1-197   LTVFSNIIATVPILGIWLCYWIWGSEFINDFTLLKLHVLHV.LLPFILLIILILHLFCLHYFM
CYB_MARPO/1-208   ATVITSLASAIPVVGDTIVTWLWGGFSVDNATLNRFFSLHY.LLPFIIAGASILHLAALHQYG
CYB_HETFR/1-205   ATVITNLLSAFPYIGDTLVQWIWGGFSIDNATLTRFFAFHF.LLPFLIIALTMLHFLFLHETG
CYB_STELO/1-204   ATVITNLLSAIPYIGTTLVEWIWGGFSVDKATLTRFFAFHF.ILPFIITALAAVHLLFLHETG
CYB_ASCSU/1-196   SVVITSLLSVIPVWGFAIVTWIWSGFTVSSATLKFFFVLHF.LVPWGLLLLVLLHLVFLHETG
CYB6_SPIOL/1-210  VKIVTGVPDAIPVIGSPLVELLRGSASVGQSTLTRFYSLHTFVLPLLTAVFMLMHFLMIRKQG
CYB6_MARPO/1-210  VKIVTGVPEAIPIIGSPLVELLRGSVSVGQSTLTRFYSLHTFVLPLLTAIFMLMHFLMIRKQG
CYB6_EUGGR/1-210  VKIVTGVPEAIPLIGNFIVELLRGSVSVGQSTLTRFYSLHTFVLPLLTATFMLGHFLMIRKQG

The coloured markup was created by Jalview (Michele Clamp). Alignments are coloured using the ClustalX scheme in Jalview (orange: glycine (G); yellow: proline (P); blue: small and hydrophobic amino acids (A, V, L, I, M, F, W); green: hydroxyl and amine amino acids (S, T, N, Q); red: charged amino acids (D, E, R, K); cyan: histidine (H) and tyrosine (Y)).

A hidden Markov model accompanying a PFAM alignment
HMMER2.0 [2.2g]
NAME  cytochrome_b_N
ACC   PF00033
DESC  Cytochrome b(N-terminal)/b6/petB
LENG  222
ALPH  Amino
RF    no
CS    no
MAP   yes
COM   hmmbuild -F HMM_ls.ann SEED.ann
COM   hmmcalibrate --seed 0 HMM_ls.ann
NSEQ  8
DATE  Thu Dec 12 02:48:
CKSUM 8731
GA
TC
NC
XT
NULT
NULE
EVD
HMM   A C D E F G H I K L M N P Q R S T V W Y
      m->m m->i m->d i->m i->i d->m d->d b->m m->e
      -300 *
Good for profile searches…

INTERPRO combined database Because the underlying construction and analysis methods of the above domain family databases are different, the databases inevitably have different diagnostic strengths and weaknesses. The InterPro database (Apweiler et al., 2000) is a collaboration between many of the domain database curators. It aims to be a central resource reducing the amount of duplication between the databases. Release 3.2 of InterPro contains 3,939 entries, representing 1,009 domains, 2,850 families, 65 repeats and 15 post-translational modification sites. Entries are accompanied by regular expressions, profiles, fingerprints and hidden Markov models, which facilitate sequence database searches.

Domain structure databases Several methods of structural classification have been developed to classify the large number of protein folds present in the PDB. The most widely used and comprehensive databases are CATH, 3Dee, FSSP and SCOP, which use four unique methods to classify protein structures at the domain level.

Detecting Structural Domains A structural domain may be detected as a compact, globular substructure with more interactions within itself than with the rest of the structure (Janin and Wodak, 1983). Therefore, a structural domain can be determined by two shape characteristics: compactness and its extent of isolation (Tsai and Nussinov, 1997). Measures of local compactness in proteins have been used in many of the early methods of domain assignment (Rossmann et al., 1974; Crippen, 1978; Rose, 1979; Go, 1978) and in several of the more recent methods (Holm and Sander, 1994; Islam et al., 1995; Siddiqui and Barton, 1995; Zehfus, 1997; Taylor, 1999).
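The compactness criterion described above (more interactions within a domain than with the rest of the structure) can be sketched in Python as follows. This is a deliberately simplified illustration, not an implementation of PUU, DETECTIVE, DOMAK or any other published method: it considers only a single cut of the chain into two continuous segments, and the distance cutoff, the scoring ratio and the names `count_contacts` and `best_single_cut` are all assumptions.

```python
import math
from itertools import combinations


def count_contacts(coords, cut, cutoff=8.0):
    """Count residue contacts for a chain split into [0, cut) and [cut, n).

    Returns (intra, inter): contacts inside either segment and contacts
    between the two segments. Two residues are in contact when their
    representative atoms lie within `cutoff` (an assumed value).
    """
    intra = inter = 0
    for i, j in combinations(range(len(coords)), 2):
        if math.dist(coords[i], coords[j]) <= cutoff:
            if (i < cut) == (j < cut):
                intra += 1
            else:
                inter += 1
    return intra, inter


def best_single_cut(coords, min_size=2, cutoff=8.0):
    """Return (cut, score) minimising the fraction of contacts crossing the cut."""
    best_cut, best_score = None, float("inf")
    for cut in range(min_size, len(coords) - min_size + 1):
        intra, inter = count_contacts(coords, cut, cutoff)
        total = intra + inter
        score = inter / total if total else 0.0
        if score < best_score:
            best_cut, best_score = cut, score
    return best_cut, best_score


# Toy coordinates (made up): two well-separated clusters of five residues each.
toy_coords = [(float(x), 0.0, 0.0) for x in range(5)] + \
             [(float(x), 0.0, 0.0) for x in range(20, 25)]

print(count_contacts(toy_coords, cut=5, cutoff=3.0))
print(best_single_cut(toy_coords, min_size=2, cutoff=3.0))
```

A single continuous cut cannot represent discontinuous domains, which is one reason the published methods are considerably more elaborate than this sketch.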

Detecting Structural Domains However, this approach encounters problems when faced with discontinuous or highly associated domains and many definitions will require manual interpretation. Consequently there are discrepancies between assignments made by domain databases (Hadley and Jones, 1999).

CATH The CATH domain database assigns domains based on a consensus approach using the three algorithms PUU (Holm and Sander, 1994), DETECTIVE (Swindells, 1995) and DOMAK (Siddiqui and Barton, 1995) as well as visual inspection (Jones et al., 1998). The CATH database release 2.3 contains approximately 30,000 domains ordered into five major levels: Class; Architecture; Topology/fold; Homologous superfamily; and Sequence family.

CATH Class covers α, β, and α/β proteins. Architecture is the overall shape of a domain as defined by the packing of secondary structural elements, but ignoring their connectivity. The topology level consists of structures with the same number, arrangement and connectivity of secondary structure elements, based on structural superposition using the SSAP structure comparison algorithm (Taylor and Orengo, 1989). A homologous superfamily contains proteins having high structural similarity and similar functions, which suggests that they have evolved from a common ancestor. Finally, the sequence family level consists of proteins with sequence identities greater than 35%, again suggesting a common ancestor.
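The 35% sequence identity criterion for the sequence family level can be illustrated with a short Python sketch. It assumes the two sequences are already aligned and takes the number of gap-free aligned positions as the denominator; other definitions of percent identity exist, and the slide does not say which one CATH uses. The function names and the `threshold` parameter are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap_chars="-."):
    """Percent identity over alignment columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a in gap_chars or b in gap_chars:
            continue                 # skip columns containing a gap
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


def same_sequence_family(aligned_a, aligned_b, threshold=35.0):
    """Apply the 'greater than 35% identity' criterion (threshold is a parameter)."""
    return percent_identity(aligned_a, aligned_b) > threshold


# Two short aligned sequences, used purely as toy input.
seq_a = "ADLGAVFALCDRYFQ"
seq_b = "ADLGRTQNRCDRYYQ"

print(percent_identity(seq_a, seq_b))
print(same_sequence_family(seq_a, seq_b))
```

For sequences this short a high identity is not evidence of common ancestry; the threshold is meaningful only for alignments spanning whole domains.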

CATH CATH classifies domains into approximately 700 fold families; ten of these folds are highly populated and are referred to as ‘super-folds’. Super-folds are defined as folds for which there are at least three structures without significant sequence similarity (Orengo et al., 1994). The most populated is the α/β-barrel super-fold.

3Dee The 3Dee structural domain repository (Siddiqui et al., 2001) stores alternative domain definitions for the same protein and organises the domains into sequence and structural hierarchies. Most of the database creation and update processes are performed automatically using the DOMAK (Siddiqui and Barton, 1995) algorithm. However, some domains are manually assigned. It contains non-redundant sets of sequences and structures, multiple structure alignments for all domain families, secondary structure and fold name definitions. The current 3Dee release is now two years old and contains 18,896 structural domains.

FSSP FSSP (Holm and Sander, 1997) is a complete comparison of all pairs of protein structures in the PDB. It is the basis for the Dali Domain Dictionary (Dietmann et al., 2001), a numerical taxonomy of all known structures in the PDB. The taxonomy is derived automatically from measurements of structural, functional and sequence similarities. The database is split into four hierarchical levels corresponding to super-secondary structural motifs, the topology of globular domains, remote homologues (functional families) and sequence families.

FSSP The top level of the fold classification corresponds to secondary structure composition and super-secondary structural motifs. Domains are assigned by the PUU algorithm (Holm and Sander, 1994) and classified into one of five ‘attractors’, which can be characterised as all-α, all-β, α/β, α-β meander, and antiparallel β-barrels. Domains which are not clearly assigned to a single attractor are placed in a mixed class. In September 2000, the Dali classification contained 17,101 chains, 1,375 fold types and 3,724 domain sequence families. The database contains definitions of structurally conserved cores and a library of multiple alignments of distantly related protein families.

SCOP The SCOP database (Structural Classification of Proteins) is a manual classification of protein structure (Murzin et al., 1995). The classification is at the domain level for many proteins, but in general, a protein is only split into domains when there is a clear indication that the individual domains may have existed as independent proteins. Therefore, many of the domain definitions in SCOP will be different to those in the other structural domain databases. The principal levels of hierarchy are family, superfamily and fold, split into the traditional four domain classes, all-α, all-β, α+β and α/β. Release 1.55 of the SCOP database contains 13,220 PDB entries, 605 fold types and 31,474 domains.

False sequence relationship by local search