Bioinformatics: Protein and RNA Structures. University of Latvia (LU), 2008, Juris Vīksna
Today: Proteins: what we mean by protein structure, structure representation; Problems related to protein structures; RNA: what we mean by RNA structure; Problems related to RNA structures; Methods for comparing protein structures; Protein structure databases; Tools for protein structure comparison and visualization; Protein structure classifications; RNA structure prediction
Proteins [Adapted from R.Shamir] Protein sequence: ...VPEKNRPLKGRINLVLSRELKEPPQGAHFLSRSLDDALKLTEQPELANK...
Protein structure [Adapted from R.Shamir]
Protein structure [Adapted from R.Shamir] We will be interested mostly in secondary and tertiary structure.
Protein structure determination - crystallography [Adapted from G.Lee] The basics: purify the protein and obtain a crystal; shoot an X-ray beam through the rotating crystal; collect the data in one of many ways; interpret the data.
Protein structure determination - crystallography [Adapted from G.Lee] Problems: crystal setup takes (almost) forever; interpreting the data is no easy task (but all methods create this mass of data); it is expensive.
Protein structure determination - crystallography [Adapted from G.Lee] In the end, biologists want the best results possible, and X-ray crystallography provides this right now. It gets the job done. No other method does the job better.
Protein structure determination - crystallography
Protein structure determination - NMR [Adapted from V.Arcus] NMR: nuclear magnetic resonance. [Figure: NMR spectrometer with labelled magnet, radio-frequency amplifiers and samples]
Protein structure determination - NMR [Adapted from V.Arcus]
Protein structure determination - NMR [Adapted from V.Arcus] Protein NMR requires large amounts of very pure protein. Extraction from the natural source: a major disadvantage is the very low level of protein in tissues (for example, one might start with 10 l of blood and get 1 mg of protein), and this also requires a large number of purification steps; the main advantage is that post-translational modifications are maintained.
NMR or crystallography? [Adapted from V.Arcus] Both are techniques for determining protein structures. NMR uses protein in solution; X-ray crystallography uses protein crystals. Both techniques require large amounts of pure protein. Both techniques require expensive equipment!
Advantages of NMR [Adapted from V.Arcus] Protein in solution! Can look at the dynamic properties of the protein structure. Can look at the interactions between the protein and ligands, substrates or other proteins. Can look at protein folding. The sample is not damaged in any way. No "phase problem". Can "characterise" your protein using NMR.
Disadvantages of NMR [Adapted from V.Arcus] Size limit! The maximum size of a protein for NMR structure determination is ~30 kDa; this eliminates ~50% of all proteins. High solubility is a requirement. Comparatively low resolution.
Advantages of crystallography [Adapted from V.Arcus] No size limit, as long as you can crystallise it. The solubility requirement is less stringent. Simple definition of resolution. Direct calculation from data to electron density and back again.
Disadvantages of crystallography [Adapted from V.Arcus] Crystallisation! This is a process bottleneck and is binary (all or nothing). Phase problem: if the cell contains two electrons (each with the same scattering power) and their positional relationship is such that the distance between them is exactly one-half the distance between reflecting planes, then they will cancel out each other's contribution to diffraction.
Protein structure file (PDB file format)
HEADER HYDROLASE 03-NOV-00 1G65
TITLE CRYSTAL STRUCTURE OF EPOXOMICIN:20S PROTEASOME REVEALS A
TITLE 2 MOLECULAR BASIS FOR SELECTIVITY OF ALPHA,BETA-EPOXYKETONE
TITLE 3 PROTEASOME INHIBITORS
COMPND MOL_ID: 1;
ATOM 115 CD PRO A C
ATOM 116 N SER A N
ATOM 117 CA SER A C
ATOM 118 C SER A C
ATOM 119 O SER A O
ATOM 120 CB SER A C
ATOM 121 OG SER A O
ATOM 122 N GLY A N
ATOM 123 CA GLY A C
ATOM 124 C GLY A C
ATOM 125 O GLY A O
ATOM 126 N LYS A N
Protein structure - atom coordinates [Adapted from M.Gerstein and I.Eidhammer, I.Jonassen] The structure is described by the 3D coordinates (X, Y, Z) of all Cα atoms.
Protein structure - folds. "Fold" representation of 7timA0.
Protein structure - helices [Adapted from S.Rafferty] Hydrogen bonding patterns for four helices: 2_7 ribbon, 3_10 helix, α helix, π helix. Structures are represented in a diagrammatic way to simplify counting the atoms in each H-bonded loop.
Protein structure - β sheets [Adapted from S.Rafferty] Composed of strands. Adjacent strands may be parallel or antiparallel. Strands are flat: think of a β strand as a helix with two residues per turn. [Figure panels: parallel, antiparallel]
Protein folds - sandwich
Protein folds - barrels
Protein folds - horseshoe
Protein folds - helix "bundles"
Protein folds - interactions. Transcription factors: homeodomain proteins.
Protein folds - some numbers
Protein structures - other representations. Different representations of the myoglobin molecule. Contact map (graph-based) representation of protein structure.
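The contact-map representation mentioned on this slide can be sketched in a few lines. This is a minimal illustration, assuming one representative point per residue (for example the Cα atom) and a distance cutoff of 8 Å; the cutoff, the function name `contact_map` and the toy coordinates are assumptions made for this example, not values taken from the slides.

```python
import math

def contact_map(coords, cutoff=8.0):
    """Return a symmetric 0/1 matrix with 1 where two residues lie within `cutoff`."""
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                cmap[i][j] = cmap[j][i] = 1  # contacts are symmetric
    return cmap

# Toy chain of four residues (hypothetical coordinates, in angstroms).
toy = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]
for row in contact_map(toy):
    print(row)
```

Read as an adjacency matrix, the result is the graph in the "graph-based" view: residues are vertices and contacts are edges.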
Problems related to protein structures: Determination (not quite a bioinformatics problem); Prediction (the protein folding problem; one of the Holy Grails of bioinformatics...); Comparison (not entirely trivial, but there are methods that work well enough in practice); Representations; Surface modelling; Modelling/prediction of protein interactions; Visualization.
Protein folding. The folded state is a low-energy state under physiological conditions: H2O, pH ~7.0, NaCl. [Figure: Gibbs free energy $G$ for the states U, I and F, with the free-energy difference $\Delta G_{U-F}$]
Protein folding. What influences protein folding: hydrophobic forces ("squeezing out" of water); hydrogen bonds; electrostatic forces; disulfide bonds; chaperones.
Chaperones Chaperone proteins were first identified as "heat-shock proteins" (hsp60 and hsp70) Hsp70 recognizes exposed, unfolded regions of new protein chains - especially hydrophobic regions It binds to these regions, apparently protecting them until productive folding reactions can occur Occurs while the chain is still being translated
CASP
CAFASP
Prions. Prion: proteinaceous infectious particle. PrP^C: the normal version. Hypothetical structure of PrP^Sc.
Prions. Spontaneous (rare): the normal fold is overwhelmingly the favored conformation. Inherited: a mutation in the PRNP gene destabilizes the normal conformation. Transmitted: exposure to PrP^Sc through diet, surgical instruments, blood, or blood-derived products.
Molecular surfaces. Lock-and-key principle:
RNA structure. RNA sequence: ...AGGCUAUGGCCA... Single-stranded, but A tends to pair with U and G tends to pair with C.
RNA secondary structure [Adapted from C.Staben] [Figure: stem-loop drawn from the 5' end to the 3' end, with G-C, C-G, U-A and G-C base pairs and unpaired A residues]
RNA secondary structure [Adapted from K.Selesniemi] Pseudoknot
RNA tertiary structure [Adapted from K.Selesniemi]
RNA tertiary structure [Adapted from K.Selesniemi]
RNA structure determination - physical methods [Adapted from P. De Rijk] The experimental method giving the highest resolution is single-crystal X-ray diffraction. X-ray diffraction reveals secondary, tertiary and three-dimensional structures. Unfortunately, it is very difficult to obtain crystals of RNA molecules suitable for X-ray diffraction. The structures of tRNAs have been solved using this technique.
RNA structure determination - physical methods [Adapted from P. De Rijk] NMR can provide details about local conformation, and can be used to determine secondary, tertiary and, in theory, three-dimensional structures. The size of RNA molecules that can be studied using NMR is currently rather limited. Oligonucleotides used in NMR studies are designed to adopt structures found in larger RNA molecules.
RNA structure determination - physical methods [Adapted from P. De Rijk] Direct observation of partially denatured RNA molecules is possible using electron microscopy. However, the choice of denaturing conditions is crucial, and the resolution of electron microscopy is usually too limited to see fine details.
RNA structure determination - chemical methods [Adapted from P. De Rijk] RNA structure has been probed by testing the accessibility of nucleotides to chemical and enzymatic modification. The RNA molecules are exposed to chemical reagents or enzymes with a specific affinity for either single-stranded or double-stranded RNA. This method is only applicable to short RNAs because of the limited resolution of gel electrophoresis. For larger RNAs, reverse transcriptase is used to synthesize DNA complementary to the RNA, starting from a radioactively labeled primer. Modified residues cause the reverse transcriptase to stop, and separation of the synthesized DNAs by gel electrophoresis can then be used to determine the positions of modification.
RNA structure determination - mutation analysis [Adapted from P. De Rijk] RNA structure or protein-RNA interactions can also be studied by introducing specific mutations into the RNA sequence. The effect of the mutations can be assayed by measuring the ability of the mutated sequence to bind a protein that specifically recognizes the normal RNA, or by testing the change in some function. Caution is required when interpreting mutation analysis results: loss of protein binding or of other functions is not necessarily caused by a change in RNA secondary structure.
Problems related to RNA structures: Determination (not quite a bioinformatics problem); Prediction (unlike for proteins, comparatively easy, but for the secondary rather than the tertiary structure); Comparison (the goals are somewhat different than for proteins); Interaction (we have possibly not got that far yet).
Structure comparison [Adapted from T.Hanekamp] Structures are superimposed by translation, by rotation, or by translation and rotation. Example of a translation by $d$ along the x axis: $(x_1, y_1, z_1), (x_2, y_2, z_2), (x_3, y_3, z_3) \to (x_1 + d, y_1, z_1), (x_2 + d, y_2, z_2), (x_3 + d, y_3, z_3)$.
Structure comparison - RMSD [Adapted from T.Hanekamp] How to estimate the "quality" of a comparison? Root mean square deviation: $\mathrm{RMSD} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} d_i^2}$, where $n$ = number of atoms and $d_i$ = distance between the $i$-th pair of corresponding atoms in the two structures.
RMSD [Adapted from T.Hanekamp] RMSD is expressed in units of length, e.g. ångströms. Identical structures: RMSD = 0. Similar structures: RMSD is small (1-3 Å). Distant structures: RMSD > 3 Å.
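The RMSD definition above translates directly into code. The sketch below assumes the two structures are given as equally long lists of already paired and already superimposed 3D points; the function name `rmsd` and the sample points are illustrative only.

```python
import math

def rmsd(a, b):
    """Root mean square deviation between two lists of paired 3D points."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty point lists of equal length")
    # d_i is the distance between the i-th pair of corresponding atoms.
    total = sum(math.dist(p, q) ** 2 for p, q in zip(a, b))
    return math.sqrt(total / len(a))

structure_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
structure_b = [(0.0, 0.0, 0.0), (1.0, 2.0, 0.0)]
print(rmsd(structure_a, structure_a))  # identical structures
print(rmsd(structure_a, structure_b))  # distances 0 and 2
```

Note that this function does not superimpose anything: it only scores a given placement, so the value depends on how the two structures were positioned beforehand.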
Coordinate RMSD [Adapted from I.Eidhammer, I.Jonassen]
Distance RMSD [Adapted from I.Eidhammer, I.Jonassen] Experimentally it has been shown that these two measures are approximately linearly related: $\mathrm{RMSD}_D \approx 0.75\,\mathrm{RMSD}_C + 0.2$.
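The slides do not reproduce the formula for the distance RMSD, so the sketch below assumes one common definition: compare the intra-structure distance matrices, $\mathrm{RMSD}_D = \frac{1}{n}\sqrt{\sum_i \sum_j (d^A_{ij} - d^B_{ij})^2}$. The function name `distance_rmsd` and the sample points are illustrative. The point of the example is that this measure needs no superposition, because internal distances do not change under rotation and translation.

```python
import math

def distance_rmsd(a, b):
    """Distance RMSD: compares internal distance matrices of two paired point lists."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty point lists of equal length")
    n = len(a)
    total = 0.0
    for i in range(n):
        for j in range(n):
            diff = math.dist(a[i], a[j]) - math.dist(b[i], b[j])
            total += diff * diff
    return math.sqrt(total) / n

shape = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
# Same shape rotated by 90 degrees about z and shifted by (5, -1, 2).
moved = [(-y + 5.0, x - 1.0, z + 2.0) for x, y, z in shape]
print(distance_rmsd(shape, moved))
```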
RMSD methods [Adapted from I.Eidhammer, I.Jonassen]
RMSD - finding the optimal transformation. Given two sets of 3D points $P=\{p_i\}$, $Q=\{q_i\}$, $i=1,\dots,n$, find a 3D rotation $R_0$ and a translation $T_0$ such that $\min_{R,T} \sum_i |R p_i + T - q_i|^2 = \sum_i |R_0 p_i + T_0 - q_i|^2$. This can be done in time $O(n)$.
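The slide does not say which method achieves the $O(n)$ bound. One standard choice, used in the sketch below, is Horn's closed-form quaternion method: a single pass over the points builds a 3x3 cross-covariance matrix, and the rotation is read off the dominant eigenvector of a fixed-size 4x4 matrix. To stay within the standard library, the eigenvector is found here by shifted power iteration, which is a simplification chosen for this example; the function names, the iteration count and the start vector are likewise assumptions. The sketch assumes non-degenerate input (points not all identical or collinear).

```python
import math

def centroid(points):
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(3))

def optimal_superposition(P, Q, iterations=300):
    """Return (R, T) minimising sum |R p_i + T - q_i|^2 for paired 3D points."""
    cp, cq = centroid(P), centroid(Q)
    # Cross-covariance S[a][b] = sum_i p_a * q_b over centred coordinates: O(n).
    S = [[0.0] * 3 for _ in range(3)]
    for p, q in zip(P, Q):
        for a in range(3):
            for b in range(3):
                S[a][b] += (p[a] - cp[a]) * (q[b] - cq[b])
    (sxx, sxy, sxz), (syx, syy, syz), (szx, szy, szz) = S
    # Horn's symmetric 4x4 matrix; its top eigenvector is the unit quaternion.
    N = [
        [sxx + syy + szz, syz - szy, szx - sxz, sxy - syx],
        [syz - szy, sxx - syy - szz, sxy + syx, szx + sxz],
        [szx - sxz, sxy + syx, -sxx + syy - szz, syz + szy],
        [sxy - syx, szx + sxz, syz + szy, -sxx - syy + szz],
    ]
    # Shift by a Gershgorin bound so the largest eigenvalue dominates in magnitude.
    shift = max(sum(abs(v) for v in row) for row in N)
    for i in range(4):
        N[i][i] += shift
    v = [1.0, 0.3, 0.2, 0.1]  # arbitrary start vector (assumption)
    for _ in range(iterations):
        v = [sum(N[i][j] * v[j] for j in range(4)) for i in range(4)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    w, x, y, z = v
    R = [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]
    # The optimal translation maps the rotated centroid of P onto the centroid of Q.
    T = tuple(cq[a] - sum(R[a][b] * cp[b] for b in range(3)) for a in range(3))
    return R, T

P = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0)]
Q = [(-y + 5.0, x - 1.0, z + 2.0) for x, y, z in P]  # P rotated 90 degrees about z, then shifted
R, T = optimal_superposition(P, Q)
print(R, T)
```

A production implementation would more likely use an SVD-based routine (the Kabsch algorithm) from a numerical library; the quaternion form is shown because it needs only a fixed-size eigenproblem after the linear pass.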
RMSD - finding structural similarity. So: given k atom pairs, it is not hard to find the transformation that minimises RMSD. But: the number of possible sets of atom pairs is exponential (in the protein "length" n and/or the number of pairs k). Finding the optimal set of pairs is believed (?) to be an NP-complete problem... In practice the so-called double dynamic programming heuristic is often used.
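To get a feel for the growth in the number of candidate pair sets, the sketch below counts them under one simplifying assumption that is not stated on the slide: both chains have length n and the pairing must preserve sequence order. Choosing k residues in each chain and pairing them in order gives $\binom{n}{k}^2$ possibilities. The function name `ordered_pair_sets` is illustrative.

```python
from math import comb

def ordered_pair_sets(n, k):
    """Number of order-preserving sets of k residue pairs between two chains of length n."""
    return comb(n, k) ** 2

for n, k in [(10, 5), (50, 25), (100, 50)]:
    print(n, k, ordered_pair_sets(n, k))
```

Dropping the order-preservation constraint makes the count larger still, since the k chosen residues can then be matched in any of k! ways.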
RMSD - some further aspects. Sequence-order-dependent alignment: the order of the atom pairs included in the RMSD matches, in both structures, their order in the amino acid sequences. Sequence-order-independent alignment: the order of the atom pairs included in the RMSD is unrelated to the order of the atoms in the amino acid sequences. It is not unambiguously clear which of the approaches is "better". The most popular structure comparison programs probably do take the order of atoms in the amino acid sequences into account.
RMSD - some further aspects. So far we have assumed that protein structures are rigid. In principle structures can also be flexible: they may change slightly depending on "external conditions". There are modifications of the algorithms that take structural flexibility into account: e.g., we can first search for small rigid similar structure fragments, and then check whether we can fit them into both structures in the same order.
RMSD - some further aspects. For sequences we started with pairwise comparison and then argued that it is often more interesting to compare more than two sequences simultaneously. What about structures? In principle there are programs that compare more than two structures simultaneously (using, e.g., something similar to the progressive heuristic), but the multiple alignment problem is less pressing for structures: structural similarity between homologs is preserved much better than sequence similarity; the "evolutionary model" is different, and for distant structures a multiple alignment usually will not reveal well-conserved structure fragments.
RMSD - the basic DDP procedure [Adapted from I.Eidhammer, I.Jonassen]
RMSD - we start with a similarity matrix [Adapted from M.Gerstein] The initial matrix can be constructed based on amino acid similarity, although other criteria are often used as well.
RMSD - similarity matrices [Adapted from M.Gerstein] Structural alignment: the similarity $S(i,j)$ depends on the 3D coordinates of residues $i$ and $j$. With $d$ the distance between $i$ and $j$: $M(i,j) = 100 / (5 + d^2)$. The similarity of each atom pair is then recalculated: the smaller the distance after the RMSD-minimising transformation, the more "similar" the pair.
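The distance-based score $M(i,j) = 100 / (5 + d^2)$ can be tabulated for every residue pair once the two structures share a coordinate system. The sketch below assumes each structure is a list of 3D points (one per residue) and that any superposition has already been applied; the function name `similarity_matrix` and the sample points are illustrative.

```python
import math

def similarity_matrix(a, b):
    """M[i][j] = 100 / (5 + d^2), with d the distance between a[i] and b[j]."""
    return [[100.0 / (5.0 + math.dist(p, q) ** 2) for q in b] for p in a]

first = [(0.0, 0.0, 0.0)]
second = [(0.0, 0.0, 0.0), (1.0, 2.0, 2.0)]  # distances 0 and 3 from the point in `first`
print(similarity_matrix(first, second))
```

The score peaks at 20 for coinciding residues and decays with the squared distance, so a dynamic programming pass over this matrix favours pairing residues that lie close together.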
RMSD - similarity matrices [Adapted from R.B.Altman]
RMSD - similarity matrices [Adapted from I.Eidhammer, I.Jonassen]
Disadvantages of RMSD [Adapted from T.Hanekamp] All atoms are treated as equal (but residues on the surface usually have greater freedom of movement than residues inside the structure). The best alignment does not necessarily mean the best RMSD. RMSD values depend on the size of the molecules.
Alternatives to RMSD [Adapted from T.Hanekamp] aRMSD = best root mean square deviation calculated over all aligned alpha-carbon atoms; bRMSD = the RMSD over the highest-scoring residue pairs; wRMSD = weighted RMSD.
Example - comparison of 3znf and 4znf [Adapted from T.Hanekamp] 30 CA atoms: RMS = 0.70 Å; 248 atoms: RMS = 1.42 Å. [Figure label: Lys30]
How easy is it to spot structural similarity? [Adapted from M.Gerstein] Easy: globins, 125 res., ~1.5 Å. Tricky: Ig C & V, 85 res., ~3 Å. Very subtle: G3P-dehydrogenase, C-terminal domain, >5 Å.
Structural similarity and computer vision [Adapted from M.Shatsky]
A simple heuristic algorithm [Adapted from M.Shatsky] For each pair of point triples (one from each molecule) that form "almost equal" triangles, find an affine transformation that maps one onto the other. Count the pairs of points that are "almost superimposed" by this transformation and rank the hypotheses by this count. For the best hypotheses, improve the transformation using RMSD. Complexity (assuming there are $n$ points in each molecule): $O(n^7)$. If $n=100$, then $n^7=10^{14}$ :(
Reference point triples [Adapted from M.Shatsky] [Figure: points $p_1$, $p_2$, $p_3$] A reference frame is a triple of orthogonal unit vectors originating from a single point. Every (non-degenerate) triple of 3D points can be unambiguously assigned such a reference frame.
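The slide states that a non-degenerate point triple determines a reference frame but does not give the construction. The sketch below uses one common convention, which is an assumption here: the origin is $p_1$, the first axis points from $p_1$ to $p_2$, the third axis is normal to the plane of the triple, and the second axis completes a right-handed frame. All function names are illustrative.

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def unit(v):
    length = math.sqrt(dot(v, v))
    return tuple(x / length for x in v)

def reference_frame(p1, p2, p3):
    """Return (origin, (e1, e2, e3)) for a non-collinear triple of distinct 3D points."""
    e1 = unit(sub(p2, p1))
    normal = cross(e1, sub(p3, p1))
    if math.sqrt(dot(normal, normal)) < 1e-9:
        raise ValueError("degenerate (collinear) point triple")
    e3 = unit(normal)
    e2 = cross(e3, e1)  # already unit length, completes a right-handed frame
    return p1, (e1, e2, e3)

def to_local(point, frame):
    """Coordinates of `point` in the given reference frame."""
    origin, axes = frame
    d = sub(point, origin)
    return tuple(dot(d, e) for e in axes)

frame = reference_frame((0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 3.0, 0.0))
print(to_local((1.0, 2.0, 3.0), frame))
```

Because the local coordinates depend only on the positions relative to the triple, they are unchanged when the whole molecule is rotated or translated, which is what makes them usable as hash keys.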
Geometric hashing - the idea [Adapted from M.Shatsky] Choose a reference frame. Find the point coordinates in this reference frame. Use these coordinates as "hash" addresses and place the points in a hash table. Repeat for each reference frame.
Geometric hashing - the idea [Adapted from M.Shatsky] We choose a universal reference frame and, for each triple, hash the transformation to the local reference frame (time $O(n^4)$).
Geometric hashing - recognition [Adapted from M.Shatsky] For the target protein: choose a reference frame; find the coordinates of the other points in this reference frame; use the coordinates to select points from the hash table; find RMSD transformations for the best hypotheses; repeat for each reference frame; select the best alignments. Complexity: $O(n^4 + n^4 \cdot \mathit{BinSize}) \sim O(n^5)$. If $n=100$, then $n^5=10^{10}$.
Geometric hashing - a 2D example [Adapted from I.Eidhammer, I.Jonassen]
Geometric hashing - a 2D example [Adapted from I.Eidhammer, I.Jonassen] Point sets: (a) (0,0), (6,2), (8,0), (9,4), (6,10), (3,8), (-1,6); (b) (1,8), (2,2), (0,0), (4,-2), (10,0), (8,3), (8,7); (c) (0,0), (3,-2), (8,0), (6,2), (10,4), (3,8), (0,6).
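The preprocessing and recognition steps can be sketched in 2D, where an ordered pair of points is enough to fix a reference frame. This is a simplified illustration rather than the procedure from the cited sources: it handles rigid motions only (no scaling), uses a bin size of 1.0, and all function names are assumptions. Point set (a) from the slide is used as the model, and the target is a copy of it rotated by 90 degrees and shifted, so that the correct basis should collect a vote from every remaining point.

```python
import math
from collections import Counter, defaultdict

def frame_coords(points, i, j):
    """Coordinates of all other points in the frame with origin points[i] and x axis towards points[j]."""
    ox, oy = points[i]
    dx, dy = points[j][0] - ox, points[j][1] - oy
    length = math.hypot(dx, dy)
    ux, uy = dx / length, dy / length
    coords = []
    for k, (x, y) in enumerate(points):
        if k in (i, j):
            continue
        rx, ry = x - ox, y - oy
        coords.append((rx * ux + ry * uy, -rx * uy + ry * ux))
    return coords

def quantise(c, bin_size):
    return (round(c[0] / bin_size), round(c[1] / bin_size))

def build_table(model, bin_size=1.0):
    """Preprocessing: hash every point under every ordered basis pair of the model."""
    table = defaultdict(set)
    for i in range(len(model)):
        for j in range(len(model)):
            if i != j:
                for c in frame_coords(model, i, j):
                    table[quantise(c, bin_size)].add((i, j))
    return table

def vote(table, target, i, j, bin_size=1.0):
    """Recognition: votes per model basis for one basis pair (i, j) of the target."""
    votes = Counter()
    for c in frame_coords(target, i, j):
        for basis in table.get(quantise(c, bin_size), ()):
            votes[basis] += 1
    return votes

model = [(0, 0), (6, 2), (8, 0), (9, 4), (6, 10), (3, 8), (-1, 6)]  # set (a)
target = [(-y + 3, x - 2) for x, y in model]  # rotated 90 degrees, shifted by (3, -2)
table = build_table(model)
print(vote(table, target, 0, 1).most_common(3))
```

The same `vote` call can be repeated for every basis pair of sets (b) and (c) to look for partial matches; high-scoring hypotheses would then be refined with an RMSD-minimising fit.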
Geometric hashing - a 2D example [Adapted from I.Eidhammer, I.Jonassen]
Reference secondary structure elements. A base fingerprint is a 5D vector composed of: SSE types (helix, strand), line distance, midpoint distance, and angle. [Figure labels: midpoint distance, line distance]
Geometric hashing - advantages [Adapted from M.Shatsky] Independent of sequence. Can be used for partially disconnected structures. Allows interesting "patterns" to be found. Comparatively fast. Can also be applied to the docking problem. Can be easily parallelized.
Protein structure databases - PDB
PDB file fragment
ATOM 1575 C ASP E ENT1729
ATOM 1576 O ASP E ENT1730
ATOM 1577 CB ASP E ENT1731
ATOM 1578 CG ASP E ENT1732
ATOM 1579 OD1 ASP E ENT1733
ATOM 1580 OD2 ASP E ENT1734
ATOM 1581 N GLY E ENT1735
ATOM 1582 CA GLY E ENT1736
ATOM 1583 C GLY E ENT1737
ATOM 1584 O GLY E ENT1738
ATOM 1585 N ILE E ENT1739
ATOM 1586 CA ILE E ENT1740
ATOM 1587 C ILE E ENT1741
Format: 80 characters per line, with fixed positions for each attribute. Fields: atom number, atom, amino acid, chain, amino acid number, X, Y, Z, occupancy, temperature factor. Only 5 digits are available for the atom serial number, but some structures have already been received with more than 99,999 atoms...
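Because every attribute sits at a fixed position, an ATOM record can be read by slicing columns rather than splitting on whitespace. The sketch below follows the column layout of the published PDB format (serial number in columns 7-11, coordinates in columns 31-54, and so on). The function name `parse_atom_line` is an assumption, and the residue number, coordinates, occupancy and temperature factor in the sample record are hypothetical, since the slide fragment does not show them.

```python
def parse_atom_line(line):
    """Split one fixed-width PDB ATOM record into named fields."""
    return {
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "temp_factor": float(line[60:66]),
    }

# Hypothetical record assembled field by field to make the columns visible.
sample = (
    "ATOM  "      # columns 1-6: record type
    " 1575"       # columns 7-11: atom serial number (only 5 digits)
    " "           # column 12: blank
    " C  "        # columns 13-16: atom name
    " "           # column 17: alternate location indicator
    "ASP"         # columns 18-20: residue (amino acid) name
    " "           # column 21: blank
    "E"           # column 22: chain identifier
    "  98"        # columns 23-26: residue number
    "    "        # columns 27-30: insertion code and padding
    "  12.345"    # columns 31-38: X
    "  -6.789"    # columns 39-46: Y
    "   0.500"    # columns 47-54: Z
    "  1.00"      # columns 55-60: occupancy
    " 20.50"      # columns 61-66: temperature factor
)
print(parse_atom_line(sample))
```

Slicing matters in practice because adjacent fields may run together with no separating space, for example when a coordinate fills its whole 8-character column.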
Protein structure databases - MMDB
Structure visualization: 1) Rasmol and Protein Explorer; 2) Cn3D; 3) DeepView Swiss-PDB Viewer (a quite powerful modeling program; also calculates various RMSDs).
Structure comparison - DaliLite
Structure comparison - SSAP
Structure comparison - VAST
Structure comparison - CE and CL
Protein structure classifications - SCOP
Protein structure classifications - CATH
Protein structure classifications - CATH. CATH: hierarchical classification of protein domain structures [C.Orengo, J.Thornton et al; UCL]. CATH number: Class (C), Architecture (A), Topology (T), Homologous superfamily (H).
Protein structure classifications - CATH. Class (C): 1 - mainly alpha; 2 - mainly beta; 3 - alpha-beta; 4 - low secondary structure content. Assigned automatically.
Protein structure classifications - CATH. Architecture (A): overall shape of the domain structure according to the orientations of secondary structures. Assigned manually.
Protein structure classifications - CATH. Topology (T): shape and connectivity of secondary structures. Assigned automatically by the SSAP algorithm.
Protein structure classifications - CATH. Homologous superfamily (H): proteins that share a common ancestor. Assigned automatically by sequence comparisons and SSAP.
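A CATH number is usually written as four dot-separated integers in the order C.A.T.H. The sketch below splits such a code into its levels and attaches the class names listed on the slide; the function name `parse_cath`, the dictionary keys and the example code string are illustrative choices for this example.

```python
CLASS_NAMES = {
    1: "mainly alpha",
    2: "mainly beta",
    3: "alpha-beta",
    4: "low secondary structure content",
}

def parse_cath(code):
    """Split a 'C.A.T.H' string into its four hierarchy levels."""
    parts = code.split(".")
    if len(parts) != 4:
        raise ValueError("expected four dot-separated numbers")
    c, a, t, h = (int(part) for part in parts)
    return {
        "class": c,
        "architecture": a,
        "topology": t,
        "homologous_superfamily": h,
        "class_name": CLASS_NAMES.get(c, "unknown"),
    }

print(parse_cath("3.40.50.720"))
```

Two domains that share the first three numbers have the same class, architecture and topology, so comparing codes by prefix gives a quick measure of how closely two domains are classified.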
Protein structure classifications - DALI
Protein structure classifications - DALI
RNA structure prediction? RNA sequence: ...AGGCUAUGGCCA... Fortunately, here we can do better...
RNA structure [Adapted from R.B.Altman]
RNA structure - pseudoknots [Adapted from R.B.Altman]
Energy minimization [Adapted from R.B.Altman]
RNA structure prediction - DP algorithm [Adapted from R.B.Altman]
RNA structure prediction - DP algorithm [Adapted from R.B.Altman]
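The slides do not reproduce the recurrence, so the sketch below uses a textbook stand-in: Nussinov-style dynamic programming that maximises the number of base pairs instead of minimising an energy function. It assumes only A-U and G-C pairs (as in the pairing rules given earlier), a minimum hairpin loop of 3 unpaired bases, and no pseudoknots; the function names and the `min_loop` parameter are assumptions made for this example.

```python
def can_pair(a, b):
    """Watson-Crick pairs only: A-U and G-C."""
    return {a, b} in ({"A", "U"}, {"G", "C"})

def max_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs in `seq` (Nussinov-style DP)."""
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] = maximum number of pairs within seq[i..j]
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i + 1][j]  # case 1: base i stays unpaired
            for k in range(i + min_loop + 1, j + 1):  # case 2: base i pairs with base k
                if can_pair(seq[i], seq[k]):
                    inside = best[i + 1][k - 1]
                    outside = best[k + 1][j] if k < j else 0
                    score = max(score, 1 + inside + outside)
            best[i][j] = score
    return best[0][n - 1]

print(max_pairs("GGGAAACCC"))     # a three-pair stem closing an AAA loop
print(max_pairs("AGGCUAUGGCCA"))  # the example sequence from the slides
```

The table has $O(n^2)$ cells and each is filled in $O(n)$ time, so the sketch runs in $O(n^3)$. Energy-based predictors keep the same overall structure but replace the "+1 per pair" score with stacking and loop energies.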