Protein structure – introduction “Bioinformatics: genes, proteins and computers” Orengo, Jones and Thornton (2003).
Secondary structure elements: α-helix, β-strand, β-sheet
Tertiary structure = protein fold complete 3-dimensional structure why is it interesting? isn’t the sequence enough? a key to understand protein function Structure-based drug design detection of distant evolutionary relationships the structure is more conserved!
Fold classification classification: clustering proteins into structural families motivation? profound analysis of evolutionary mechanisms constraints on secondary structure packing? classification at domain level
hierarchical classification of protein domain structures in the Brookhaven Protein Data Bank (PDB). domains are clustered at four major levels: Class, Architecture, Topology, Homologous superfamily, with a further Sequence family level below them. CATH – Protein Structure Classification
Class: secondary structure content: mainly α, mainly β, α–β, low 2nd structure content. Architecture: gross orientation of secondary structures, independent of connectivity. Topology (= fold): clusters structures according to their topological connections. CATH – hierarchical classification
CATH – architectures
CATH – architectures (cont.)
Homologous superfamily homologous domains identified by sequence similarity, and structure similarity Sequence family domains clustered in the same sequence families, with sequence identity>35% CATH – hierarchical classification other classification schemes: SCOP, FSSP partial disagreement between them.
Growing demand for protein structures! PDB contains 20,868 structures X-Ray and NMR have limitations. WE NEED FASTER METHODS! GenBank contains 24,027,936 sequences!
Protein Structure Prediction I) Ab initio = ‘from the beginning’ - Simulation (physics) - search for conformation with lowest energy - Knowledge-based (i.e. statistics) protein sequence: RGYSLGNWVC KVFGRCELAA AMKRHGLDNY AAKFESNFNT QATNRNTDGS TDYGILQINS RWWCNDGRTP GSRNLCNIPC SALLSSDITA SVNCAKKIVS DGNGMNAWVA WRNRCKGTDV Limited to very short peptides!
Can known structures assist prediction? the number of possible folds seems to be limited! CATH inspection: more than 36,000 domains, but... only ~800 topology groups. [Figure: total of "new folds" (light blue) and "old folds" (orange) for a given year] PDB inspection: a ‘new’ protein has a good chance to be of a known structure!
Template-based prediction (fold recognition) II) Comparative modeling (homology modeling) - alignment with homologous sequence of known structure. - high sequence identity areas: similar structure - variable areas: must be built can’t be used if no sequence similarity found! III) Threading - alignment with structure sequences in fold library - sophisticated scoring function finds most similar fold - ‘Threading’ aligns target sequence onto template structure
“What are the baselines for protein fold recognition?” McGuffin, Bryson and Jones (2001) Goals: 1. what constitutes a baseline level of success for protein fold recognition methods, above random guesswork? 2. can simple methods that make use of 2nd structure information assign folds more reliably? 3. how valuable might these methods be in the rapid construction of a useful hierarchical classification?
The methods evaluated (ordered by complexity and runtime)
1. Absolute difference in length
2. Absolute difference in number of secondary structure elements
3. Simple alignment of secondary structure elements
4. Alignment of secondary structure elements (Przytycka et al., 1999)
5. Alignment of secondary structure elements without additional scoring
6. Alignment of secondary structure elements using DSSP as secondary structure assignment
7. Alignment of secondary structure elements with gap penalty
8. Alignment of secondary structure elements with gap penalty for long elements
9. Alignment of secondary structure elements with absolute difference in length as scoring scheme
10. Alignment of full length secondary structure strings
11. Alignment of primary sequence
Notes: shorten 2nd structure strings: CCCHHHHCCCEEECCHHHCCC -> HCECH. pairwise alignment. scoring function also considers length of elements.
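The string shortening shown in the example above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names `shorten_ss_string` and `ss_elements` are hypothetical, and the rule (collapse each run of identical states to one letter, then drop terminal coil) is inferred only from the single example CCCHHHHCCCEEECCHHHCCC -> HCECH.

```python
from itertools import groupby


def ss_elements(ss):
    """Split a per-residue secondary structure string into (state, length) runs."""
    return [(state, len(list(run))) for state, run in groupby(ss)]


def shorten_ss_string(ss, coil="C"):
    """Collapse each run to one letter and strip leading/trailing coil.

    Assumption: terminal coil is dropped, as suggested by the slide's example.
    """
    collapsed = "".join(state for state, _ in ss_elements(ss))
    return collapsed.strip(coil)


example = "CCCHHHHCCCEEECCHHHCCC"
short = shorten_ss_string(example)
elements = ss_elements(example)
```

The element lengths returned by `ss_elements` are kept separately because the slide notes that the scoring function also considers the length of elements; how those lengths enter the score is not specified in the source.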
A representative set of protein domains: a set of 1087 domains representing different “Sequence Families” was selected from CATH. generate an informative file for each domain:
1. >1atx00
2. GAAaLbKSDGPNTRGNSMSGTIWVFGcPSGWNNbEGRAIIGYacKQ
3. EEE TTS S TTSSEEEEEESS TT EEE SSSSSEEEE
4. CEEEEEHHECEEEECCCECEEEECCCEECCEECEEECCEECEEEEC
First evaluation: true positive percentage
compare true positive percentage, at a fixed 3% false positive.
1. run each method on all possible pairs from the 1087 set: (a,b) (a,c) (a,d)... (g,d) (g,e)... (k,f)... (r,s)... ~590,000 pairs
2. sort each score list by descending similarity score.
3. for each list: go top downward, and compare assignment to CATH, counting true and false assignments, until 3% false positives are reached.
Example: let's assume there are 100 structurally similar pairs and 100 dissimilar pairs.
(a,d) 0.99 CATH (a) = CATH (d) true
(g,e) 0.98 CATH (g) != CATH (e) false
(r,s) 0.87 CATH (r) = CATH (s) true
(a,b) 0.63 CATH (a) != CATH (b) false
(k,f) 0.45 CATH (k) != CATH (f) false
(g,d) 0.37
true counter = 2, false counter = 3. STOP! 3% false positives reached. true positive for this method = 2%
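The counting procedure above can be written out as a short Python sketch. It is an illustration under stated assumptions: the function name `true_positive_rate_at_fp` and its parameters are hypothetical, and the totals of similar and dissimilar pairs are passed in explicitly so that the slide's toy example (100 of each) can be reproduced from a truncated score list.

```python
def true_positive_rate_at_fp(scored_pairs, n_similar, n_dissimilar, max_fp_rate=0.03):
    """Walk down a score list and return the true positive fraction
    reached when the allowed fraction of false positives is hit.

    scored_pairs: iterable of (similarity_score, same_topology_in_cath)
    n_similar / n_dissimilar: total pairs of each kind in the full set
    """
    fp_limit = max_fp_rate * n_dissimilar
    true_count = 0
    false_count = 0
    # Highest similarity first, as on the slide.
    for _score, same_fold in sorted(scored_pairs, key=lambda p: p[0], reverse=True):
        if same_fold:
            true_count += 1
        else:
            false_count += 1
            if false_count >= fp_limit:
                break  # allowed false positives reached
    return true_count / n_similar


# Toy example from the slide; (g,d) is omitted because its CATH label is not given.
pairs = [
    (0.99, True),   # (a,d)
    (0.98, False),  # (g,e)
    (0.87, True),   # (r,s)
    (0.63, False),  # (a,b)
    (0.45, False),  # (k,f)
]
rate = true_positive_rate_at_fp(pairs, n_similar=100, n_dissimilar=100)
print(f"true positive percentage: {rate:.0%}")
```

With 100 dissimilar pairs the walk stops at the third false assignment, by which point two true assignments have been counted, giving the 2% of the slide's example.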
We need lower and upper controls to compare with. lower control: intelligent guesswork. 1. randomly assign CATH topology codes according to frequency 2. calculate true positive and false positive percentages. upper control: automated recognition (given the 3D structure). 1. FSSP, SCOP and CATH databases were screened for all dissimilar domains that exist in all three of them. 2. FSSP gave similarity scores to all possible pairs. 3. FSSP assignments were compared against CATH, and against SCOP.
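The lower control, random assignment of topology codes according to their frequency, might look like the following Python sketch. The function name `random_topology_assignment`, the `seed` parameter and the example codes are illustrative assumptions; the source does not describe how the random assignment was implemented.

```python
import random
from collections import Counter


def random_topology_assignment(true_codes, seed=None):
    """Assign each domain a topology code drawn at random,
    weighted by how often each code occurs in the data set."""
    rng = random.Random(seed)
    counts = Counter(true_codes)
    codes = list(counts)
    weights = [counts[code] for code in codes]
    return rng.choices(codes, weights=weights, k=len(true_codes))


# Illustrative topology codes only; not taken from the 1087-domain set.
true_codes = ["2.60.40"] * 5 + ["1.10.490"] * 3 + ["3.40.50"] * 2
guessed = random_topology_assignment(true_codes, seed=0)

# Fraction of domains whose guessed code happens to match the true one.
agreement = sum(g == t for g, t in zip(guessed, true_codes)) / len(true_codes)
```

The guessed codes can then be scored in the same way as any other method to give the baseline that real methods must exceed.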
Optimisation of similarity scoring methods: “Class pre-filter”. each domain was assigned a class according to 2nd structure: percentage of residues constituting α-helices / β-strands. Example: domain “1cgt03”: 80% of AA in β-strand, 10% of AA in α-helix
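A class pre-filter of this kind can be sketched in Python as below. The source gives only the idea (class from the percentage of helix and strand residues) and one example; the cut-off value `threshold=0.15`, the decision rule and the function names are hypothetical choices made for illustration, not the values used in the study.

```python
def secondary_structure_fractions(ss):
    """Return (helix_fraction, strand_fraction) for a three-state string (H/E/C)."""
    if not ss:
        raise ValueError("empty secondary structure string")
    n = len(ss)
    return ss.count("H") / n, ss.count("E") / n


def assign_class(ss, threshold=0.15):
    """Assign a CATH-like class label from secondary structure content.

    Assumption: a state counts as 'present' when its fraction reaches
    `threshold`; this cut-off is hypothetical.
    """
    helix, strand = secondary_structure_fractions(ss)
    if helix >= threshold and strand >= threshold:
        return "alpha-beta"
    if helix >= threshold:
        return "mainly alpha"
    if strand >= threshold:
        return "mainly beta"
    return "low secondary structure content"


# A made-up string with the same proportions as the 1cgt03 example:
# 80% strand, 10% helix, 10% coil.
toy = "E" * 8 + "H" + "C"
label = assign_class(toy)
```

Pairs of domains whose labels differ could then be skipped before any alignment is attempted, which is what makes the step a pre-filter.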
Results: the most accurate is method number 5, “Alignment of secondary structure elements without additional scoring”, with 27.18% true positive. partial agreement between classification schemes: FSSP compared with SCOP: 61.1%, FSSP compared with CATH: 46.7%. methods that use 2nd structure alignments are in better agreement with CATH. accuracy ordering of methods doesn’t correspond to their relative complexity. methods that use 2nd structure usually don’t benefit notably from class pre-filter.
Second evaluation: CASP-like sensitivity similarly to CASP – we measure the sensitivity of each method: what is the probability of a method correctly assigning a fold? lower control: a random proportional fold assignment upper control: FSSP was used as a scoring method
Sensitivity results: method 5 wins again: 31.8% sensitivity. other 2nd structure based methods follow with a small gap. sensitivity order of the methods ~ true positive percentage order.
Similarity trees - can we construct classification? Best method’s similarity scores for all pairs were clustered into a tree. a. globin-like <> casein kinase b. immunoglobulin-like <> thrombin subunit H. whole tree: generally disordered. [Figure: subtrees (a) and (b) of the similarity tree, with CATH domain identifiers as leaf labels]
Conclusions 1. Baseline level to be exceeded by fold recognition methods: 27% true positive assignments allowing 3% false positive; sensitivity level of 32%. 2. methods which make use of 2nd structure information seem more accurate and sensitive than those that don’t. 3. simple 2nd structure alignments alone cannot construct a reliable classification hierarchy. 4. the agreement between FSSP, SCOP and CATH classification schemes is surprisingly low.