Centre for Integrative Bioinformatics VU (IBIVU). Predicting domain features from sequence. Bioinformatics Data Analysis and Tools.

1 Centre for Integrative Bioinformatics VU (IBIVU). Predicting domain features from sequence. Bioinformatics Data Analysis and Tools, Lecture 12.

2 Centre for Integrative Bioinformatics VU, 21 May 2007. Protein domain delineation. Content: Background; Linker prediction (DomCut, Elsik); Protein domain delineation based on consistency of multiple ab initio model tertiary structures (SnapDRAGON, Rosetta); Protein domain delineation based on combining homology searching with domain prediction (Domaination); Domain delineation based on sequence hydrophobicity patterns (SCOOBY-DOmain).

3 A domain is a: compact, semi-independent unit (Richardson, 1981); stable unit of a protein structure that can fold autonomously (Wetlaufer, 1973); fundamental unit of protein function; recurring functional and evolutionary module (Bork, 1992). “Nature is a ‘tinkerer’ and not an inventor” (Jacob, 1977).

4 Domain characteristics. Domains are genetically mobile units, and multidomain families are found in all three kingdoms (Archaea, Bacteria and Eukarya). The majority of genomic proteins, 75% in unicellular organisms and more than 80% in metazoa, are multidomain proteins created as a result of gene duplication events (Apic et al., 2001). Domains in multidomain structures are likely to have once existed as independent proteins, and many domains in eukaryotic multidomain proteins can be found as independent proteins in prokaryotes (Davidson et al., 1993).

5 The DEATH domain. Present in a variety of eukaryotic proteins involved in cell death. Six helices enclose a tightly packed hydrophobic core. Some DEATH domains form homotypic and heterotypic dimers.

6 Delineating domains is essential for: obtaining high-resolution structures by NMR (due to size limitations of proteins); sequence analysis; multiple sequence alignment methods; prediction algorithms (secondary/tertiary structure, solvent accessibility, ...); fold recognition and threading; structural/functional genomics; cross-genome comparative analysis; elucidating the evolution, structure and function of a protein family (e.g. the ‘Rosetta Stone’ method).

7 Prediction of protein-protein interactions: Rosetta Stone. Gene fusion is an effective method for predicting protein-protein interactions. If proteins A and B are homologous to two domains of a protein C, A and B are predicted to interact. Although the gene-fusion method has low prediction coverage, its false-positive rate is low. [Figure: proteins A and B matching two domains of protein C]

8 Domain fusion example. Vertebrates have a multi-enzyme protein (GARs-AIRs-GARt) comprising the enzymes GAR synthetase (GARs), AIR synthetase (AIRs), and GAR transformylase (GARt). In insects, the polypeptide appears as GARs-(AIRs)2-GARt. In yeast, GARs-AIRs is encoded separately from GARt. In bacteria, each domain is encoded separately (Henikoff et al., 1997). GAR: glycinamide ribonucleotide. AIR: aminoimidazole ribonucleotide.

9 Pyruvate kinase (phosphotransferase): barrel regulatory domain; barrel catalytic substrate-binding domain; nucleotide-binding domain. 1 continuous + 2 discontinuous domains. Structural domain organisation can be nasty.

10 Domain connectivity. [Figure: domains connected by a linker] A continuous domain is often an evolutionary module.

11 Domain size. The size of individual structural domains varies widely, from 36 residues in E-selectin to 692 residues in lipoxygenase-1 (Jones et al., 1998), with the majority (90%) having fewer than 200 residues (Siddiqui and Barton, 1995) and an average of about 100 residues (Islam et al., 1995). Small domains (fewer than 40 residues) are often stabilised by metal ions or disulphide bonds. Large domains (more than 300 residues) are likely to consist of multiple hydrophobic cores (Garel, 1992).

12 Detecting structural domains. A structural domain may be detected as a compact, globular substructure with more interactions within itself than with the rest of the structure (Janin and Wodak, 1983). Therefore, a structural domain can be determined by two shape characteristics: compactness and its extent of isolation (Tsai and Nussinov, 1997). Measures of local compactness in proteins have been used in many of the early methods of domain assignment (Rossmann et al., 1974; Crippen, 1978; Rose, 1979; Go, 1978) and in several of the more recent methods (Holm and Sander, 1994; Islam et al., 1995; Siddiqui and Barton, 1995; Zehfus, 1997; Taylor, 1999).

13 Detecting structural domains. The protein core is densely packed. [Figure: contact plot]

14 Detecting structural domains. Approaches encounter problems when faced with highly associated domains (and sometimes also with discontinuous ones), and many definitions will require manual interpretation. Consequently, there are discrepancies between assignments made by domain databases (Hadley and Jones, 1999).

15 Detecting structural domains. Early on: interaction of secondary structure, where regions of weak interaction are assumed to coincide with domain boundaries (Busetta and Barans, 1984), was not very successful; contact plots, where domains are regions with high contact density (Vonderviszt & Simon, 1986), were not very successful either.

16 Detecting structural domains. More recent methods are better: Taylor (1999), covered later in this lecture.

17 Detecting domains using sequence only. Even more difficult than prediction from structure!

18 Predicting domain boundaries from linker regions. Needed: a discernible signal that sets linker regions apart from other sequence regions. Problems: linker regions are short, so it is difficult to get a statistical signal; linker regions must be told apart from intra-domain loops; no distinction between continuous and discontinuous domains is possible.

19 Predicting domain boundaries from linker regions, approaches. Building a linker index (using amino-acid propensities for being within linker or non-linker regions): LinkerDB (George & Heringa, 2002); DomCut (Suyama & Ohara, 2003), sensitivity/specificity ≈ 50%. [Linker index formula not preserved in the transcript], where i denotes the amino acid type and f the frequencies in either linker or domain.
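The formula on this slide did not survive in the transcript. As a rough illustration only, a linker index of this kind can be written as a negative log ratio of linker to domain frequencies, S_i = -ln(f_i(linker) / f_i(domain)), which is assumed here rather than taken from the slide. The pseudocount, the window length of 15 and the function names are likewise assumptions made for the sketch.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def linker_index(linker_seqs, domain_seqs, pseudocount=1.0):
    """Per-residue index -ln(f_linker / f_domain); the pseudocount is an assumption."""
    linker_counts = Counter("".join(linker_seqs))
    domain_counts = Counter("".join(domain_seqs))
    linker_total = sum(linker_counts[a] for a in ALPHABET) + pseudocount * len(ALPHABET)
    domain_total = sum(domain_counts[a] for a in ALPHABET) + pseudocount * len(ALPHABET)
    index = {}
    for a in ALPHABET:
        f_linker = (linker_counts[a] + pseudocount) / linker_total
        f_domain = (domain_counts[a] + pseudocount) / domain_total
        index[a] = -math.log(f_linker / f_domain)
    return index

def smoothed_profile(seq, index, window=15):
    """Average the index over a sliding window; low values suggest linker-like regions."""
    half = window // 2
    scores = [index.get(a, 0.0) for a in seq]
    profile = []
    for i in range(len(seq)):
        lo, hi = max(0, i - half), min(len(seq), i + half + 1)
        profile.append(sum(scores[lo:hi]) / (hi - lo))
    return profile
```

With this sign convention, residues enriched in linkers get negative scores, so troughs in the smoothed profile are the candidate linker positions.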

20 Predicting domain boundaries from linker regions, approaches. Bae, Mallick, and Elsik (2005) developed a hidden Markov model (HMM) of linker/non-linker sequence regions using a linker index derived from amino acid propensity. They employed an efficient Bayesian estimation of the model using Markov Chain Monte Carlo (MCMC), particularly Gibbs sampling, to simulate parameters from the posteriors. The model generates a probabilistic output. The method was applied to a dataset of protein sequences in which domains and inter-domain linkers had been delineated using the Pfam-A database. Prediction results are superior to those of a simpler method that also uses a linker index (DomCut). Open question: L-L, L-D, D-D, D-L transitions?

21 SnapDRAGON. Richard A. George. Integrating protein multiple alignment, secondary and tertiary structure prediction to predict domain boundaries in sequence data. George R.A. and Heringa J. (2002) J. Mol. Biol., 316, 839-851.

22 SnapDRAGON. Scientific name: Antirrhinum majus. Common name: snapdragon.





27 SNAPDRAGON. Domain boundary prediction protocol using sequence information alone (Richard George): 1. Input: multiple sequence alignment (MSA) and predicted secondary structure. 2. Generate 100 DRAGON 3D models for the protein structure associated with the MSA. 3. Assign domain boundaries to each of the 3D models (Taylor, 1999). 4. Sum proposed boundary positions within the 100 models along the length of the sequence, and smooth boundaries using a weighted window. George R.A. and Heringa J. (2002) SnapDRAGON - a method to delineate protein structural domains from sequence data, J. Mol. Biol. 316, 839-851.

28 SnapDRAGON. [Figure: workflow from the multiple alignment and predicted secondary structure (e.g. CCHHHCCEEE), through folds generated by DRAGON and boundary recognition (Taylor, 1999), to summed and smoothed boundaries]

29 SNAPDRAGON. Domain boundary prediction protocol using sequence information alone (Richard George): 1. Input: multiple sequence alignment (MSA), built by (a) sequence searches using PSI-BLAST (Altschul et al., 1997), (b) followed by sequence redundancy filtering using OBSTRUCT (Heringa et al., 1992), (c) and alignment by PRALINE (Heringa, 1999); and predicted secondary structure, obtained with (d) the PREDATOR secondary structure prediction program. (George and Heringa, 2002)

30 Information content of a multiple alignment. Align homologous sequences (ideally orthologues).

31 SNAPDRAGON. Domain boundary prediction protocol using sequence information alone (Richard George): 2. Generate 100 DRAGON (Aszodi & Taylor, 1994) models for the protein structure associated with the MSA. DRAGON folds proteins based on the requirement that (conserved) hydrophobic residues cluster together. (Predicted) secondary structures are used to further estimate distances between residues (e.g. between the first and last residue in a β-strand). Based on these constraints, it compiles a target matrix with ‘desired’ distances. It then constructs 100 random high-dimensional Cα (and pseudo Cβ) distance matrices. For each distance matrix, distance geometry is used to find the 3D conformation corresponding to the prescribed target matrix of desired distances between residues (by gradual inertia projection and based on the input MSA and predicted secondary structure). DRAGON = Distance Regularisation Algorithm for Geometry OptimisatioN.

32 The Cα distance matrix is divided into smaller clusters. Separately, each cluster is embedded into a local centroid. The final predicted structure is generated from full embedding of the multiple centroids and their corresponding local structures. [Figure: input data (multiple alignment and predicted secondary structure, e.g. CCHHHCCEEE), the N × N target matrix, 100 randomised initial N × N Cα distance matrices, and the resulting 100 predictions]




36 Taylor method (1999), DOMAIN-3D. 3. Assign domain boundaries to each of the 3D models (Taylor, 1999). Easy and clever method. Uses a notion from spin glass theory (disordered magnetic systems) to delineate domains in a protein 3D structure. Steps: 1. Take the sequence with residue numbers (1..N). 2. Look at the neighbourhood of each residue (first shell). 3. If the average neighbourhood residue number > residue number, then residue number = residue number + 1, else residue number = residue number - 1. 4. On convergence, take regions with identical “residue number” as domains and terminate. Taylor, W.R. (1999) Protein structural domain identification. Protein Engineering 12:203-216.

37 Taylor method (1999). Example: residue 41 has neighbours 5, 6, 56, 78 and 89. Repeat until convergence: if 41 < (5+6+56+78+89)/5 then residue 41 becomes 42 (up 1), else residue 41 becomes 40 (down 1). Here the average is 46.8, so residue 41 becomes 42.
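The update rule can be sketched in a few lines. This is a minimal illustration, not Taylor's implementation: the neighbour lists are assumed to be given (for example from a distance cut-off on a 3D model), all labels are updated synchronously, a label equal to the neighbourhood average is left unchanged, and an iteration cap guards against oscillation. All of these choices, and the function names, are assumptions.

```python
def update_label(label, neighbour_labels):
    """One step: move the label one unit towards the neighbourhood average."""
    mean = sum(neighbour_labels) / len(neighbour_labels)
    if mean > label:
        return label + 1
    if mean < label:
        return label - 1
    return label  # tie handling is an assumption

def taylor_labels(neighbours, max_iter=1000):
    """neighbours maps residue number -> list of neighbouring residue numbers."""
    labels = {r: r for r in neighbours}
    for _ in range(max_iter):
        new = {}
        for r, nbrs in neighbours.items():
            if nbrs:
                new[r] = update_label(labels[r], [labels[n] for n in nbrs])
            else:
                new[r] = labels[r]
        if new == labels:  # converged: no label changed
            break
        labels = new
    return labels

# The slide's example: residue 41 with neighbours 5, 6, 56, 78 and 89 moves up to 42.
step = update_label(41, [5, 6, 56, 78, 89])

# Toy structure with two separate, fully connected clusters of residues.
toy = {1: [2, 3], 2: [1, 3], 3: [1, 2], 10: [11, 12], 11: [10, 12], 12: [10, 11]}
domains = taylor_labels(toy)
```

After convergence, residues that share a label are read off as one domain; in the toy case each cluster collapses onto its own label.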

38 Taylor method (1999). [Figure: ‘residue number’ plotted against sequence location; the initial labels 1, 2, 3, …, 198, 199, 200 converge to 49, 49, 49, …, 151, 151, 151; example profiles are shown for continuous and discontinuous domains]

39 SNAPDRAGON. Domain boundary prediction protocol using sequence information alone (Richard George): 4. Sum proposed boundary positions within the 100 models along the length of the sequence, and smooth boundaries using a weighted window (assign to the central position). Window score $= \sum_{1 \le i \le l} S_i \times W_i$, where $W_i = (p - |p - i|)/p^2$ and $p = \tfrac{1}{2}(n+1)$. It follows that $\sum_{l} W_i = 1$. [Figure: weights $W_i$ plotted against window position $i$] (George and Heringa, 2002)
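A small sketch of this weighted window follows. It assumes that n in the formula is the window length l and that the length is odd, which is the case in which the weights sum exactly to 1. Zero-padding at the sequence ends and the function names are further assumptions.

```python
def triangular_weights(length):
    """W_i = (p - |p - i|) / p**2 with p = (length + 1) / 2, for i = 1..length."""
    if length % 2 == 0:
        raise ValueError("window length is assumed to be odd")
    p = (length + 1) / 2
    return [(p - abs(p - i)) / p ** 2 for i in range(1, length + 1)]

def smooth_boundary_counts(counts, length=5):
    """Assign each position the weighted sum of the counts in the window centred on it."""
    weights = triangular_weights(length)
    half = length // 2
    padded = [0] * half + list(counts) + [0] * half  # zero-padding is an assumption
    return [
        sum(s * w for s, w in zip(padded[c:c + length], weights))
        for c in range(len(counts))
    ]

weights3 = triangular_weights(3)                   # [0.25, 0.5, 0.25]
smoothed = smooth_boundary_counts([0, 0, 4, 0, 0], length=3)
```

The centre of the window gets the largest weight, so an isolated spike of boundary votes is spread over its neighbours while keeping its peak position.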

40 SNAPDRAGON. Statistical significance: convert peak scores to Z-scores using z = (x - mean)/stdev. If z > 2, assign a domain boundary. Statistical significance can be tested further using random models: test hydrophobic collapse given the distribution of hydrophobicity over the sequence; make 5 scrambled multiple alignments (MSAs) and predict their secondary structure; make 100 models for each MSA; compile the mean and stdev from the boundary distribution over the 500 random models; if the observed peak lies more than 2.0 stdev above the mean of the random models (z > 2.0), assign a domain boundary.
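The Z-score step can be sketched as below. The slide does not say whether a population or sample standard deviation is used, nor how a peak is defined, so the population form and a simple local-maximum test are assumptions here, as are the function names.

```python
import statistics

def z_scores(profile):
    """z = (x - mean) / stdev over the smoothed boundary profile."""
    mean = statistics.fmean(profile)
    sd = statistics.pstdev(profile)  # population stdev is an assumption
    if sd == 0:
        return [0.0] * len(profile)
    return [(x - mean) / sd for x in profile]

def significant_peaks(profile, threshold=2.0):
    """Positions that are local maxima with z above the threshold."""
    z = z_scores(profile)
    peaks = []
    for i, zi in enumerate(z):
        left = z[i - 1] if i > 0 else float("-inf")
        right = z[i + 1] if i + 1 < len(z) else float("-inf")
        if zi > threshold and zi >= left and zi >= right:
            peaks.append(i)
    return peaks

profile = [0.0] * 20
profile[10] = 10.0
peaks = significant_peaks(profile)
```

For the test against random models, the same function could be given a mean and stdev compiled from the scrambled-alignment models instead of from the profile itself.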

41 SnapDRAGON prediction assessment. Test set of 414 multiple alignments: 183 single-domain and 231 multiple-domain proteins. Boundary predictions are compared to the region of the protein connecting two domains (maximally ±10 residues from the true boundary).

42 SnapDRAGON prediction assessment. Baseline method I: divide the sequence into equal parts based on the number of domains predicted by SnapDRAGON. Baseline method II: similar to Wheelan et al., based on a domain length partition density function (PDF). The PDF is derived from 2750 non-redundant structures (deposited at NCBI). Given a sequence, calculate the probability of a one-domain, two-domain, ..., protein. The highest probability is taken and the sequence is split equally as in baseline method I.
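Baseline method I amounts to the following sketch. Rounding to the nearest residue and the function name are assumptions; the number of domains would come from SnapDRAGON (or, for baseline II, from the most probable domain count).

```python
def equal_split_boundaries(length, n_domains):
    """Place n_domains - 1 boundaries at equal intervals along the sequence."""
    if n_domains < 1:
        raise ValueError("need at least one domain")
    return [round(k * length / n_domains) for k in range(1, n_domains)]

three_domain_split = equal_split_boundaries(300, 3)
```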

43 Average prediction results per protein. Coverage is the percentage of linkers predicted, TP/(TP+FN). Success (PPV) is the percentage of predictions that are correct, TP/(TP+FP).
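The two measures can be computed as in the sketch below. It simplifies each true linker to a single boundary position and counts a prediction as correct if it lies within a tolerance of 10 residues of a true boundary; this simplification and the function name are assumptions.

```python
def assess(predicted, true_boundaries, tolerance=10):
    """Return (coverage, ppv) for predicted versus true boundary positions."""
    found = [t for t in true_boundaries
             if any(abs(p - t) <= tolerance for p in predicted)]
    correct = [p for p in predicted
               if any(abs(p - t) <= tolerance for t in true_boundaries)]
    coverage = len(found) / len(true_boundaries) if true_boundaries else 0.0  # TP / (TP + FN)
    ppv = len(correct) / len(predicted) if predicted else 0.0                 # TP / (TP + FP)
    return coverage, ppv

coverage, ppv = assess(predicted=[95, 300], true_boundaries=[100, 200])
```

In this example one of two true boundaries is found and one of two predictions is correct, so both values are 0.5.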

44 Average prediction results per protein

45 SnapDRAGON. Uses consistency in the absence of a standard of truth. Goes from primary + secondary to tertiary structure ‘just’ to chop protein sequences. Is very slow (can take hours for proteins > 400 aa), so a cluster or GRID implementation is needed. A SnapDRAGON web server is under way. The strategy is now used by the Baker group (UW, Seattle): RosettaDOM.

46 RosettaDOM: domain boundary distribution and models that were made by the Rosetta de novo structure prediction method for T0248. The first plot displays the domain boundaries assigned to models produced by Rosetta, and the corresponding models for three examples are shown on the right. The Z-scores for each position are shown in the second plot. The CASP domain assignments in the context of the native structure are displayed in the bottom left corner. Interestingly, models with roughly the correct domain boundaries are being produced by Rosetta. PROTEINS: Structure, Function, and Bioinformatics Suppl 7:193-200 (2005).

47 RosettaDOM. Automated Prediction of Domain Boundaries in CASP6 Targets Using Ginzu and RosettaDOM. David E. Kim, Dylan Chivian, Lars Malmstrom, and David Baker, University of Washington, Seattle, Washington. “We developed a de novo domain prediction method that is similar in concept to SnapDRAGON but uses the Rosetta de novo structure prediction method to produce models. RosettaDOM generates 400 three-dimensional models using Rosetta, and then selects the top 200 scoring models that pass filters that eliminate structures with too many local contacts or unlikely strand topologies. Domain boundaries are then assigned for each of the 200 models using Taylor’s structure-based domain identification algorithm described above. Final domain boundary predictions are made based on consistencies found in the domain assignments of these models by taking the sum of boundary assignments at each position along the protein chain, smoothing the values using a center weighted sliding window, and then converting the smoothed boundary distributions to Z-scores as described by George et al. Positions with Z-scores of 2.5 or greater are treated as potential domain boundaries. Because logic is not applied to assign discontinuous domains and continuous domains are unlikely to be less than 50 residues in length, final domain boundaries are assigned for positions with the highest Z-scores that are at least 50 residues apart and are not within 50 residues of the N and C terminus.” PROTEINS: Structure, Function, and Bioinformatics Suppl 7:193-200 (2005).
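The final selection rule in the quoted passage can be read as a greedy pick, sketched below. The greedy order (highest Z-score first), the 0-based indexing and the exact treatment of the terminal regions are one interpretation of the text, not the published code.

```python
def select_boundaries(z, z_min=2.5, min_sep=50):
    """Pick highest-Z positions that are >= min_sep apart and away from both termini."""
    n = len(z)
    candidates = [i for i in range(n)
                  if z[i] >= z_min and min_sep <= i < n - min_sep]
    candidates.sort(key=lambda i: z[i], reverse=True)
    chosen = []
    for i in candidates:
        if all(abs(i - j) >= min_sep for j in chosen):
            chosen.append(i)
    return sorted(chosen)

z = [0.0] * 300
z[20], z[100], z[120], z[200], z[280] = 5.0, 3.0, 4.0, 2.6, 6.0
boundaries = select_boundaries(z)
```

In this toy profile the peaks at 20 and 280 are rejected for being too close to the termini, and the peak at 100 is rejected because it lies within 50 residues of the stronger peak at 120.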

48 CASP. CASP, which stands for Critical Assessment of Techniques for Protein Structure Prediction, is a community-wide experiment (though it is commonly referred to as a competition) for protein structure prediction that has taken place every two years since 1994. CASP provides research groups with an opportunity to assess the quality of their methods for protein structure prediction from the primary structure of the protein. As a consequence, CASP provides the research community with an assessment of the state of the art in this field. It is not uncommon for entire research groups to shut down for months while they focus on getting their results ready for CASP. Protein structures that are either expected to be solved shortly or that have been recently solved, but not yet discussed in public, are used as targets for the prediction. If the given sequence is found (for example, using sequence alignment methods such as BLAST or FASTA) to be similar to a protein sequence of known structure, comparative protein modeling may be used to predict the tertiary structure. Otherwise, other methods such as protein threading or de novo protein structure prediction must be applied.

49 CASP. Evaluation of the results is carried out in the following prediction categories: tertiary structure prediction (all CASPs); secondary structure prediction (dropped after CASP5); prediction of structure complexes (CASP2 only; a separate experiment, CAPRI, carries on this subject); residue-residue contact prediction (starting CASP4); disordered regions prediction (starting CASP5); domain boundary prediction (starting CASP6); function prediction (starting CASP6); model quality assessment (starting CASP7); model refinement (starting CASP7). The tertiary structure prediction category was further subdivided into: homology modeling; fold recognition (also called protein threading; note that this is incorrect, as threading is a method); de novo structure prediction, now referred to as 'New Fold' because many methods apply evaluation, or scoring, functions that are biased by knowledge of native protein structures, for example an artificial neural network.

50 CAFASP. CAFASP, or the Critical Assessment of Fully Automated Structure Prediction, is a large-scale blind experiment in protein structure prediction that studies the performance of automated structure prediction web servers in homology modeling, fold recognition, and ab initio prediction of protein tertiary structures based only on amino acid sequence. The experiment runs once every two years in parallel with CASP, which focuses on predictions that incorporate human intervention and expertise. Compared to the related benchmarking techniques LiveBench and EVA, which run weekly against newly solved protein structures deposited in the Protein Data Bank, CAFASP generates much less data, but has the advantage of producing predictions that are directly comparable to those produced by human prediction experts. Recently, CAFASP has been run essentially integrated into the CASP results rather than as a separate experiment.

51 DOMAINATION. Richard A. George. Protein domain identification and improved sequence searching using PSI-BLAST. Integrating protein sequence database searching and on-the-fly domain recognition. George R.A. and Heringa J. (2002) Protein domain identification and improved sequence similarity searching using PSI-BLAST, Proteins: Struct. Func. Gen. 48, 672-681.

52 Domaination. Current iterative homology search methods (e.g. PSI-BLAST) do not take the following into account: domains may have different ‘rates of evolution’; common conserved domains, such as the tyrosine kinase domain, can obscure weak but relevant matches to other domain types; premature convergence (false negatives); matrix migration / profile wander (false positives).

53 PSI (Position Specific Iterated) BLAST. Basic idea: use results from a BLAST query to construct a profile matrix; search the database with the profile instead of the query sequence; iterate.

54 A profile matrix (position-specific scoring matrix, PSSM). This is the same as a profile without position-specific gap penalties.

55 PSI-BLAST: constructing the profile matrix. Figure from Altschul et al., Nucleic Acids Research 25, 1997.

56 PSI-BLAST iteration. [Figure: the query sequence Q is used in a gapped BLAST search; the database hits are used to build a PSSM (one row of position-specific scores per residue type), which is used in the next gapped BLAST search; iterate]

57 PSI-BLAST steps in words. Query sequences are first scanned for the presence of so-called low-complexity regions (Wootton and Federhen, 1996; see next slide), which are masked. The program then initially operates on a single query sequence by performing a gapped BLAST search. Then, the program takes the significant local alignments (hits) found, constructs a ‘multiple alignment’ (master-slave alignment) and abstracts a position-specific scoring matrix (PSSM) from this alignment. The database is rescanned in a subsequent round, using the PSSM, to find more homologous sequences. Iteration continues until the user decides to stop or the search has converged.

58 Low-complexity sequences. For example: AAAAA… or AYLAYLAYL… or AYLLYAALY… Low-complexity (sub)sequences have a biased composition and contain less information than high-complexity sequences. Because of the low information content, they often lead to spurious hits without a biological basis (for example, you can’t tell whether a poly-A sequence is more similar to a globin, an immunoglobulin or a kinase sequence).
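One simple way to put a number on "less information" is the Shannon entropy of the residue composition, sketched below. This is only an illustration of the idea and not the masking algorithm used by PSI-BLAST; the function name and the comparison sequence are made up for the example.

```python
import math
from collections import Counter

def composition_entropy(seq):
    """Shannon entropy (bits per residue) of the amino-acid composition of seq."""
    counts = Counter(seq)
    n = len(seq)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

poly_a = composition_entropy("AAAAA")        # a single residue type
repeat = composition_entropy("AYLAYLAYL")    # three equally frequent residue types
varied = composition_entropy("MKTAYIAKQRQISFVKSHFSRQ")  # hypothetical, more varied sequence
```

Note that a composition-based measure treats AYLAYLAYL and AYLLYAALY identically, since both use the same three residue types in equal proportions.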

59 PSI-BLAST entry page. [Screenshot annotations: paste your query sequence; switch this off for a default run]


61 1 - This portion of each description links to the sequence record for a particular hit. 2 - Score or bit score is a value calculated from the number of gaps and substitutions associated with each aligned sequence. The higher the score, the more significant the alignment. Each score links to the corresponding pairwise alignment between query sequence and hit sequence (also referred to as subject or target sequence). 3 - E Value (Expect Value) describes the likelihood that a sequence with a similar score will occur in the database by chance. The smaller the E Value, the more significant the alignment. For example, the first alignment has a very low E value of e-117, meaning that a sequence with a similar score is very unlikely to occur simply by chance. 4 - These links provide the user with direct access from BLAST results to related entries in other databases. ‘L’ links to LocusLink records and ‘S’ links to structure records in NCBI's Molecular Modeling DataBase.

62 ‘X’ residues denote low-complexity sequence fragments that are ignored.

63 Sequence searching. [Figure: a query searched against a database, with database sequences classified as true positives, false positives, true negatives and false negatives relative to a threshold T]

64 PSI-BLAST query. Strategy: combine C- and N-termini of local alignments to delineate domain boundaries.


66 Post-processing low complexity. Remove local fragments with > 15% low-complexity (LC) sequence.

67 Identifying domain boundaries. Sum N- and C-termini of gapped local alignments. True N- and C-termini are counted twice (within 10 residues). Boundaries are smoothed using two windows (15 residues long). Combine scores using a biased protocol: if $N_i \times C_i = 0$ then $S_i = N_i + C_i$, else $S_i = N_i + C_i + (N_i \times C_i)/(N_i + C_i)$.
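The biased combination rule can be written directly as a short function. The list-based interface and the function name are assumptions; N and C are taken to be the (smoothed) per-position counts of alignment N- and C-termini.

```python
def combine_termini_scores(n_scores, c_scores):
    """S_i = N_i + C_i, plus a bonus (N_i * C_i) / (N_i + C_i) when both are non-zero."""
    combined = []
    for n, c in zip(n_scores, c_scores):
        s = n + c
        if n * c != 0:
            s += (n * c) / (n + c)
        combined.append(s)
    return combined

scores = combine_termini_scores([0, 2, 4], [3, 2, 0])
```

Positions where alignment starts and ends coincide are rewarded: with two of each the score is 5 rather than 4, whereas four termini of a single type still score 4.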

68 Identifying domain deletions. Deletions in the query (or insertions in the DB sequences) are identified when two adjacent segments in the query align to the same DB sequences (> 70% overlap), which have a region of > 35 residues not aligned to the query (remove N- and C-termini). [Figure: DB sequence aligned to the query]

69 Identifying domain permutations. A domain shuffling event is declared when two local alignments (> 35 residues) within a single DB sequence match two separate segments in the query (> 70% overlap), but have a different sequential order. [Figure: segments a and b occur in the order b-a in the DB sequence and a-b in the query]

70 Identifying continuous and discontinuous domains. Each segment is assigned an independence score (In). If In > 10%, the segment is assigned as a continuous domain. An association score is calculated between non-adjacent fragments by assessing the shared sequence hits to the segments. If the score is > 50%, the segments are considered to form a discontinuous domain and are joined.

71 Creating domain profiles. A representative set of the database sequence fragments that overlap a putative domain is selected for alignment using OBSTRUCT (Heringa et al., 1992), with > 20% and < 60% sequence identity (including the query sequence). A multiple sequence alignment is generated using PRALINE (Heringa 1999, 2002; Simossis et al., 2005). Each domain multiple alignment is used as a profile in further database searches using PSI-BLAST (Altschul et al., 1997). The whole process is iterated until no new domains are identified.

72 Domain boundary prediction accuracy. Set of 452 multidomain proteins. 56% of proteins were correctly predicted to have more than one domain. 42% of predictions are within ±20 residues of a true boundary. 49.9% (± 44.6%) correct boundary predictions per protein.

73 Domain boundary prediction accuracy. 23.3% of all linkers were found in the 452 multidomain proteins. This is not a surprise, since: structural domain boundaries will not always coincide with sequence (motif) domain boundaries; proteins must have undergone some domain shuffling. For proteins with discontinuous domains, 34.2% of linkers were identified. 30% of discontinuous domains were successfully joined (good for a sequence-only method).

74 Benchmarking sequence searching improvement versus PSI-BLAST. A set of 452 non-homologous multidomain protein structures was used. Each sequence was delineated using true structural domains. PSI-BLAST database searches were done using the individual domain sequences. It was then tested to what extent PSI-BLAST and DOMAINATION, when run on the full-length protein sequences, can capture the sequences found by the reference PSI-BLAST searches using the individual domains.

75 Two reference sets based on individual domain searches (using known domains). Reference set 1 consists of database sequences for which PSI-BLAST finds all domains contained in the corresponding full-length query. Reference set 2 consists of database sequences found by searching with one or more of the domain sequences. Therefore, set 2 contains many more sequences than set 1. [Figure: a query and database sequences (Seq 1 to Seq 7) illustrating reference sets 1 and 2]

76 Sequences found over reference sets 1 and 2. Note that PSI-BLAST and DOMAINATION were run over full sequences in reference sets 1 and 2.

77 Reference set 1: PSI-BLAST finds 97.9% of sequences; Domaination finds 99.1% of sequences. Reference set 2: PSI-BLAST finds 83.2% of sequences; Domaination finds 90.6% of sequences.

78 SSEARCH significance test. Verify the statistical significance of database sequences found by relating them to the original query sequence (instead of to the PSSM created by PSI-BLAST at each iteration). SSEARCH (Pearson & Lipman, 1988) was used. It calculates an E-value for each generated local alignment. This filter will lose distant homologies (bad E-values). Use the 452 proteins with known structure.

79 Significant sequences found in database searches. At an E-value cut-off of 0.1, the performance of DOMAINATION searches with the full-length proteins is 15% better than that of PSI-BLAST.

80 Scooby-domain: prediction of globular domains in protein sequence. Richard A. George (1,2), Kuang Lin (3) and Jaap Heringa (4, corresponding author). (1) Inpharmatica Ltd, 60 Charlotte Street, London W1T 2NU, UK. (2) European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK. (3) Division of Mathematical Biology, National Institute for Medical Research, The Ridgeway, Mill Hill NW7 1AA, UK. (4) Centre for Integrative Bioinformatics (IBIVU), Faculty of Sciences and Faculty of Earth and Life Sciences, Vrije Universiteit, De Boelelaan 1081a, 1081HV Amsterdam, The Netherlands. George, R.A., Lin, K., and Heringa J. (2005) Scooby-Domain: prediction of globular domains in protein sequence, Nucleic Acids Res., 33 (Web Server issue), W160-W163.

81 Generating a domain probability matrix for a query sequence. Scooby-domain uses a multilevel smoothing window to predict the location of domains in a query sequence. Based on the window length and its average hydrophobicity, the probability that it can fold into a domain is found directly from the distribution of domain size and hydrophobicity, calculated using sequence-level domain representatives from the CATH domain database (S-level). Visualisation of the Scooby-domain probability matrix for a sequence can be used to effectively identify regions that are likely to fold into domains or are likely to be unstructured.

82 First plot: the number of CATH domains as a function of their hydrophobicity and domain length. Second plot: the average CATH domain hydrophobicity minus the average hydrophobicity for randomised sequences (generated from a random selection of residues from sequences in the CATH database). This information is used to create a partition density function for domain likelihood.
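A minimal sketch of how a length/hydrophobicity distribution could be turned into a likelihood lookup. The function names, the bin widths and the toy input are assumptions made for illustration; the slide does not give the actual binning, and the subtraction of the randomised-sequence background is left out here.

```python
from collections import Counter

def _bin(length, hyd, length_bin, hyd_bin_pct):
    # Hydrophobicity is binned as an integer percentage to avoid float edge cases.
    return (length // length_bin, round(hyd * 100) // hyd_bin_pct)

def build_likelihood_table(domains, length_bin=10, hyd_bin_pct=5):
    """Count (length, hydrophobicity) pairs per bin and scale to the fullest bin.

    `domains` is an iterable of (length, hydrophobic_fraction) pairs.
    Bin widths are illustrative assumptions, not values from the source.
    """
    counts = Counter(_bin(l, h, length_bin, hyd_bin_pct) for l, h in domains)
    top = max(counts.values())
    return {key: n / top for key, n in counts.items()}

def lookup(table, length, hyd, length_bin=10, hyd_bin_pct=5):
    """Relative likelihood that a window of this size and hydrophobicity is domain-like."""
    return table.get(_bin(length, hyd, length_bin, hyd_bin_pct), 0.0)

# Hypothetical toy observations standing in for CATH S-level representatives.
toy_domains = [(120, 0.45), (125, 0.46), (128, 0.48), (300, 0.40)]
table = build_likelihood_table(toy_domains)
```

With this toy table, a window of length 122 and hydrophobic fraction 0.47 falls in the most populated bin, while a combination never observed returns 0.0.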

83 [Figure panels: CATH domains; randomized domain sequences; CATH domains minus randomized domain sequences]

84 (b) Multilevel smoothing window: the horizontal axis corresponds to the sequence position and the vertical axis represents the window length used in the smoothing of sequence hydrophobicity. Each position in the matrix corresponds to the average hydrophobicity assigned to the centre of a window during smoothing. (Eleven amino acid types are considered hydrophobic: Ala, Cys, Phe, Gly, Ile, Leu, Met, Pro, Val, Trp and Tyr.) (c) Each position in the matrix is then converted to a probability that the corresponding window will fold into a domain, based on the lengths and hydrophobicities observed in the distribution of CATH domains.
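The matrix in (b) can be sketched as follows, assuming a binary scale in which the average hydrophobicity of a window is simply the fraction of its residues that belong to the eleven hydrophobic types listed above. The function name, the `min_window` default and the convention for the centre of an even-length window are assumptions for illustration.

```python
# One-letter codes for Ala, Cys, Phe, Gly, Ile, Leu, Met, Pro, Val, Trp, Tyr.
HYDROPHOBIC = set("ACFGILMPVWY")

def hydrophobicity_matrix(seq, min_window=3, max_window=None):
    """Return {window_length: row}, one row per window length.

    row[i] holds the hydrophobic fraction of the window centred at
    sequence index i (0-based), or None where no full window fits.
    """
    seq = seq.upper()
    n = len(seq)
    if max_window is None:
        max_window = n
    # Prefix sums give each window's hydrophobic count in constant time.
    prefix = [0]
    for aa in seq:
        prefix.append(prefix[-1] + (aa in HYDROPHOBIC))
    matrix = {}
    for w in range(min_window, min(max_window, n) + 1):
        row = [None] * n
        for start in range(n - w + 1):
            centre = start + w // 2  # assumed convention for even w
            row[centre] = (prefix[start + w] - prefix[start]) / w
        matrix[w] = row
    return matrix

m = hydrophobicity_matrix("AAKK", min_window=2)
```

Each row of the result corresponds to one horizontal band of the plot; longer windows leave more undefined positions at the sequence ends, which gives the matrix its triangular shape.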

85 (d) i. The highest scoring window (first predicted domain) is identified in the probability matrix and the sequence region it encapsulates (blue triangle) is removed from the sequence. ii. The resulting sequence fragments are rejoined and the probability matrix is recalculated. iii. The smoothing windows that encapsulate the last 15 residues of the N-terminal fragment and the first 15 residues of the C-terminal fragment have their probabilities set to zero (white bands). If the next highest scoring region is found in the red region, the excised domain will be discontinuous; otherwise it will be continuous. [Figure labels: discontinuous domains; continuous domains]

86 Automatic domain boundary assignment. The Scooby-domain web server performs fast, automatic domain annotation by identifying the most domain-like regions in the query sequence. The highest probability in the domain probability matrix represents the first predicted domain. The corresponding stretch of sequence for this domain is removed from the sequence; the first predicted domain will always have a continuous sequence, whereas further domain predictions can encompass discontinuous domains. If the excised domain is at a central position in the sequence, the resulting N- and C-terminal fragments are rejoined and the probability matrix is recalculated as before. The second highest probability is then found and the corresponding sub-sequence removed.
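The excise-and-rejoin loop described above can be sketched as below. This is a simplified illustration, not the server's implementation: `score` is a hypothetical caller-supplied function mapping (window length, hydrophobic fraction) to a domain probability, all function names are invented here, and the zeroing of windows around the rejoined ends is omitted.

```python
HYDROPHOBIC = set("ACFGILMPVWY")

def best_window(residues, score, min_window):
    """Return (score, start, length) of the highest scoring window, or None."""
    best = None
    for w in range(min_window, len(residues) + 1):
        for start in range(len(residues) - w + 1):
            window = residues[start:start + w]
            hyd = sum(aa in HYDROPHOBIC for _, aa in window) / w
            s = score(w, hyd)
            if best is None or s > best[0]:
                best = (s, start, w)
    return best

def to_segments(positions):
    """Collapse sorted positions into (first, last) runs of consecutive residues."""
    segments = []
    for p in positions:
        if segments and p == segments[-1][1] + 1:
            segments[-1][1] = p
        else:
            segments.append([p, p])
    return [tuple(s) for s in segments]

def excise_domains(seq, score, min_window=3, max_domains=2):
    """Repeatedly remove the best window and rejoin the remaining fragments."""
    # Keep original 1-based positions so rejoined fragments stay traceable.
    residues = list(enumerate(seq.upper(), start=1))
    domains = []
    while len(residues) >= min_window and len(domains) < max_domains:
        found = best_window(residues, score, min_window)
        if found is None or found[0] <= 0:
            break
        _, start, w = found
        positions = [pos for pos, _ in residues[start:start + w]]
        domains.append(to_segments(positions))
        residues = residues[:start] + residues[start + w:]  # rejoin fragments
    return domains

def toy_score(w, h):
    # Hypothetical score: only windows of length 4 count, ranked by hydrophobicity.
    return h if w == 4 else 0.0

domains = excise_domains("AKLLLLKAKK", toy_score)
```

A domain returned with more than one segment spans the join left by an earlier excision, which is how a discontinuous domain arises in this scheme; the first domain is always a single segment.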

87 Table 1.

                      No weighting                  N- and C-termini weighting
       Method         Sensitivity  Accuracy (PPV)   Sensitivity  Accuracy (PPV)
First  ScoobyDo       50.5         23.2             51.8         30.8
       Domaination    59.6         27.6             59.8         29.5
       Linker         42.7         14.8             42.7         14.8
       Class          41.6         22.9             40.1         25.1
Best   ScoobyDo       75.1         44.4             76.7         50.1
       Domaination    88.8         44.4             87.4         47.4
       Linker         79.4         34.1             79.4         34.1
       Class          71.0         46.6             70.9         48.0

Two measures are used to score predictions: the percentage of real boundaries predicted (sensitivity) and the percentage of correct predictions made (accuracy). 'N- and C-termini weighting' are predictions made with increased probability of domain boundaries at the ends of the protein sequences. 'Domaination' are results for ScoobyDo predictions made with added information from Domaination. 'Linker' are results for ScoobyDo predictions made with added information from the interdomain linker propensities from the Linker database. 'Class' are ScoobyDo predictions made using three smoothing windows to separately predict all-α, all-β and α-β domains. 'First' is the highest probability prediction made. 'Best' is the best of ten predictions made.
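The two measures can be sketched as follows. The slide does not say how close a predicted boundary must be to a real one to count as correct, so the `tolerance` parameter and its default are assumptions, as are the function name and the toy positions.

```python
def score_boundaries(predicted, real, tolerance=20):
    """Return (sensitivity, accuracy) as percentages.

    sensitivity: share of real boundaries with a prediction within `tolerance`.
    accuracy (PPV): share of predictions within `tolerance` of a real boundary.
    The tolerance value is an illustrative assumption.
    """
    real_found = sum(
        any(abs(r - p) <= tolerance for p in predicted) for r in real
    )
    pred_correct = sum(
        any(abs(p - r) <= tolerance for r in real) for p in predicted
    )
    sensitivity = 100.0 * real_found / len(real) if real else 0.0
    accuracy = 100.0 * pred_correct / len(predicted) if predicted else 0.0
    return sensitivity, accuracy

# Hypothetical boundary positions, for illustration only.
sens, ppv = score_boundaries(predicted=[50, 120, 300], real=[55, 200], tolerance=10)
```

In this toy case one of the two real boundaries is recovered and one of the three predictions is correct, which shows why over-predicting boundaries can raise sensitivity while lowering accuracy.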


89 Improvements: use multiple sequence alignments and average the prediction results; use an A* protocol that combines the domain delineations of the 10 top predictions. (Under review.)

90 SCOOBYDOmain: the Scooby+MSA prediction for the hyperthermostable D-ribose-5-phosphate isomerase from Pyrococcus horikoshii (PDB 1LK5, chain A). a) The structure of 1LK5, coloured according to the linker prediction by Scooby-Domain. The corresponding predicted boundary positions are 136 and 207. The CATH domain annotation shows that the protein consists of two domains: a discontinuous domain made of two segments, 1-128 (green) and 208-229 (blue), and a continuous domain, 129-207 (red). b) The Scooby-Domain plot for 1LK5.

91 Scooby-domain prediction

92 Wrapping up. There are different approaches to the domain-delineation problem. It is a hard problem even with a protein structure at hand, and it is mind-boggling to attempt it from sequence information alone. Approaches range from simple window-based linker prediction (DomCut) to elaborate consistency-based and 3-D model-reliant prediction (SnapDRAGON). Performance is still low, but results can be very helpful. Domaination: combined iterative methods can improve each of the single methods.
