1/62 An Iterative Relaxation Technique for the NMR Backbone Assignment Problem
Wen-Lian Hsu, Institute of Information Science, Academia Sinica
2/62 Characteristics of Our Method
- Model this as a constraint satisfaction problem
- Solve it using natural language parsing techniques
  - Both top-down and bottom-up
- An iterative approach
  - Create spin systems based on noisy data.
  - Link spin systems by using maximum independent set finding techniques.
3/62 Outline
- Introduction
- Method
- Experimental Results
- Conclusion
4/62 Blind Man’s Elephant
- We cannot directly “see” the positions of these atoms (the structure)
- But we can measure a set of parameters (with constraints) on these atoms
  - These can help us infer their coordinates
- Each experiment can determine only a subset of the parameters (with noise)
- To combine the parameters from different experiments, we need to stitch them together
5/62 The Flow of NMR Experiments
Get protein samples => Collect NMR spectra => Resonance assignment => Structure constraints => Calculation and simulation
- Energy minimization
- Fitness of structure constraints
6/62 Find out Chemical Shift for Each Atom
- Backbone atoms: Ca, Cb, C’, N, NH
  - Various experiments: HSQC, CBCANH, CBCA(CO)NH, HN(CA)CO, HNCO, HN(CO)CA, HNCA
- Side chain: all others (especially CHs)
  - TOCSY-HSQC, HCCCONH, CCCONH, HCCH-TOCSY
[Figure: chemical shift assignment for the atoms of one amino acid]
7/62 Some Relevant Parameters
[Figure: structural formula of the peptide backbone (repeating -N-C-C- units) with side chains attached; chemical shifts are given in ppm]
8/62 Three Important Experiments
- Backbone: Ca, Cb, C’, N, NH
  - HSQC, CBCANH, CBCA(CO)NH, HN(CA)CO, HNCO, HN(CO)CA, HNCA
- Sequential assignment: chemical shifts of Ca, Cb, NH
[Figure: HSQC]
9/62 Our NMR Spectra
- HSQC
- CBCA(CO)NH (2 peaks)
- CBCANH, also labeled HNCACB (4 peaks)
10/62 HSQC Spectra
- HSQC peaks (1 set of chemical shifts for one amino acid)
[Table: HSQC peak list with columns H, N, Intensity]
11/62 CBCA(CO)NH Spectra
- CBCA(CO)NH peaks (2 sets of chemical shifts for one amino acid)
[Table: CBCA(CO)NH peak list with columns H, N, C, Intensity]
12/62 CBCANH Spectra
- CBCANH peaks (4 sets of chemical shifts for one amino acid)
  - Peak sign: Ca (+), Cb (-)
[Table: CBCANH peak list with columns H, N, C, Intensity]
13/62 A Dataset Example
[Figure: one HSQC peak on the N-H plane with its 4 HNCACB peaks and 2 CBCA(CO)NH peaks]
14/62 Backbone Assignment
- Goal
  - Assign chemical shifts to N, NH, Ca (and Cb) along the protein backbone.
- General approach
  - Generate spin systems
    - A spin system: an amino acid with known chemical shifts on its N, NH, Ca (and Cb).
  - Link spin systems
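The definition of a spin system above can be made concrete with a small record type. This is only a sketch: the class name `SpinSystem`, its field names, and the shift values are illustrative assumptions, and the two `*_prev` fields anticipate the inter-residue (i-1) carbons that the CBCANH and CBCA(CO)NH slides describe.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SpinSystem:
    """Chemical shifts (ppm) attached to one amino acid; names are illustrative."""
    n: float                         # backbone amide nitrogen shift
    hn: float                        # amide proton shift
    ca: Optional[float] = None       # Ca shift of this residue (i)
    cb: Optional[float] = None       # Cb shift of this residue (i)
    ca_prev: Optional[float] = None  # Ca shift of the preceding residue (i-1)
    cb_prev: Optional[float] = None  # Cb shift of the preceding residue (i-1)

    def is_complete(self) -> bool:
        """True when all four carbon shifts are known."""
        return None not in (self.ca, self.cb, self.ca_prev, self.cb_prev)


# Made-up shift values, for illustration only.
ss = SpinSystem(n=120.5, hn=8.21, ca=56.3, cb=30.1, ca_prev=58.0, cb_prev=33.4)
print(ss.is_complete())
```

Keeping the record immutable (`frozen=True`) makes it safe to place the same spin system in several candidate segments at once.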
15/62 Ambiguities
- All 4-point experiments are mixed together
- All 2-point experiments are mixed together
- Each spin system can be mapped to several amino acids in the protein sequence
- False positives, false negatives
16/62 Previous Approaches
- Constrained bipartite matching problem
  - The spin system might be ambiguous
  - Can’t deal with ambiguous links
[Figure: legal matching vs. illegal matching under constraints]
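17/62 Natural Language Processing: Signal or Noise?
- Speech recognition: homophone selection
- Input sentence: 台北市一位小孩走失了 ("A child in Taipei City has gone missing")
- Competing candidates for the same sounds: 台北市 (Taipei City), 台北 (Taipei), 適宜 (suitable), 一位 (one person), 一味 (blindly), 移位 (to shift position), 小孩 (child), 走失 (to go missing), 事宜 (matters)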
18/62 An Error-Tolerant Algorithm
19/62 Phrase, Sentence Combination
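20/62 Hierarchical Analysis
- Sentence-meaning templates (句意模版)
- Sentence-pattern templates (句型模版)
- Phrase templates (片語模版)
- Word templates (字詞模版)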
21/62 Perfect Group
- Each spin group contains 6 points:
  - 4 points from the first experiment (CBCANH)
  - 2 points from the second experiment (CBCA(CO)NH)
[Figure: two adjacent residues with their backbone H, N, C, and O atoms]
23/62 A Perfect Spin System Group
[Figure: peak tables with columns N, H, C, Intensity. CBCA(CO)NH supplies Ca(i-1) and Cb(i-1); CBCANH supplies Ca(i-1), Cb(i-1), Ca(i), and Cb(i)]
24/62 False Positives and False Negatives
- False positives
  - Noise with high intensity
  - Produce fake spin systems
- False negatives
  - Peaks with low intensity
  - Missing peaks
- In real wet-lab data, nearly 50% of the peaks are noise (false positives).
25/62 Spin System Group
[Figure: perfect, false negative, and false positive spin system groups on the N-H plane]
27/62 Main Idea
- Deal with false negatives in the spin system generation procedure.
- Eliminate false positives in the spin system linking procedure.
- Perform the spin system generation and linking procedures iteratively.
28/62 Spin System Group Generation
- Three types of spin system groups are generated, based on the quality of the CBCANH data:
  - Perfect
  - Weak false negative
  - Severe false negative
29/62 Perfect Spin Systems
- A spin system is determined without any added pseudo peak.
[Figure: peak tables with columns N, H, C, Intensity. CBCA(CO)NH supplies Ca(i-1) and Cb(i-1); CBCANH supplies Ca(i-1), Cb(i-1), Ca(i), and Cb(i)]
30/62 Weak False Negative Spin System Group
- A spin system is determined with one added pseudo peak.
[Figure: CBCA(CO)NH and CBCANH peak tables with columns N, H, C, Intensity; one expected CBCANH peak is missing and is filled in by a pseudo peak]
31/62 Severe False Negative Spin System Group
- A spin system is determined with two added pseudo peaks.
[Figure: CBCA(CO)NH and CBCANH peak tables with columns N, H, C, Intensity; two expected CBCANH peaks are missing and are filled in by pseudo peaks]
Note: it is also possible that Ca(i-1) = and Cb(i-1) =
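The three group types differ only in how many pseudo peaks must be added to reach the four expected CBCANH carbon peaks. A minimal sketch of that classification follows; the function name, the use of `None` to stand for a pseudo peak, and the shift values are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

EXPECTED_CBCANH_PEAKS = 4  # Ca(i-1), Cb(i-1), Ca(i), Cb(i)

LABELS = {
    0: "perfect",
    1: "weak false negative",
    2: "severe false negative",
}


def classify_group(carbon_shifts: List[float]) -> Tuple[List[Optional[float]], str]:
    """Pad the observed carbon shifts with pseudo peaks (None) and label the group."""
    missing = EXPECTED_CBCANH_PEAKS - len(carbon_shifts)
    if missing not in LABELS:
        raise ValueError(f"cannot build a group from {len(carbon_shifts)} peaks")
    padded: List[Optional[float]] = list(carbon_shifts) + [None] * missing
    return padded, LABELS[missing]


print(classify_group([58.0, 33.4, 56.3, 30.1])[1])
print(classify_group([58.0, 33.4, 56.3])[1])
print(classify_group([58.0, 33.4])[1])
```

A real implementation would also have to decide which of the four slots a pseudo peak fills; this sketch simply appends it.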
32/62 A Note on Spin System Generation
- To generate *ALL* possible spin systems, a peak can be included in more than one spin system.
  - False positives are eliminated in the spin system linking procedure.
  - False negatives are treated by adding pseudo peaks.
- A rule-based mechanism is used to filter out incompatible spin systems (false positives).
  - Adopt a maximum weight independent set algorithm
33/62 Spin System Linking
- Goal
  - Link spin systems into segments that are as long as possible.
- Constraints
  - Each spin system is uniquely assigned to a position of the target protein sequence.
  - Two spin systems are linked only if the chemical shift differences between their intra- and inter-residue values are less than the predefined thresholds.
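The linking constraint can be sketched as a pairwise test: the intra-residue carbons of the spin system at position i-1 must agree with the inter-residue (i-1) carbons recorded by the spin system at position i. The dictionary keys, the function name, and the tolerance values below are placeholders chosen for illustration; the slides do not state the thresholds.

```python
from typing import Dict


def can_link(prev_ss: Dict[str, float], next_ss: Dict[str, float],
             ca_tol: float = 0.2, cb_tol: float = 0.4) -> bool:
    """Return True if prev_ss (position i-1) may precede next_ss (position i).

    ca_tol and cb_tol are hypothetical thresholds in ppm.
    """
    ca_diff = abs(prev_ss["ca"] - next_ss["ca_prev"])
    cb_diff = abs(prev_ss["cb"] - next_ss["cb_prev"])
    return ca_diff < ca_tol and cb_diff < cb_tol


# Made-up shift values, for illustration only.
a = {"ca": 58.0, "cb": 33.4, "ca_prev": 55.1, "cb_prev": 29.8}
b = {"ca": 56.3, "cb": 30.1, "ca_prev": 58.1, "cb_prev": 33.2}
c = {"ca": 61.0, "cb": 38.5, "ca_prev": 59.0, "cb_prev": 33.4}

print(can_link(a, b))  # a's own carbons match b's i-1 carbons
print(can_link(a, c))  # Ca differs by 1.0 ppm, so no link
```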
34/62 A Peculiar Parking Lot (Valet Parking)
- Information you have: the make of your car and, approximately, the make of the car parked in front of you.
- Together with the other drivers, try to identify as many cars in the right order as possible (maximizing the overall satisfaction).
35/62 Backbone Assignment
Target sequence: DGRIGEIKGRKTLATPAVRRLAMENNIKLS
36/62 Spin System Positioning
- We assign spin system groups to a protein sequence according to their codes.
[Figure: residues with their codes (D 50, G 10, R 40, I 50), each mapped (=>) to candidate spin systems]
37/62 Link Spin System Groups
[Figure: Segment 1, Segment 2, and Segment 3 formed by linking spin system groups on the sequence DGRI]
38/62 Iterative Concatenation
[Figure: on the sequence DGRI….FKJJREKL…., 56 spin systems are concatenated step by step (Step 1, Step 2, …, Step n-1, Step n) into segments such as Segment 1, Segment 2, Segment 31, Segment 78, and Segment 79]
39/62 Conflict Segments
[Figure: segments 71, 78, 79, 97, 98, and 99 placed on the sequence DGRIGEIKGRKTLATPAVRRLAMENNIKLS]
- Two kinds of conflict between segments
  - Overlap (e.g. segment 71 and segment 99)
  - Use of the same spin system (e.g. segment 78 and segment 79 both contain spin system 1)
40/62 A Graph Model for Spin System Linking
- G(V, E)
  - V: a set of nodes (segments).
  - E: the set of edges (u, v) with u, v ∈ V such that u and v are in conflict.
- Goal
  - Assign as many non-conflicting segments as possible => find the maximum independent set of G.
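The graph model can be sketched directly from the two conflict rules (overlapping positions, shared spin system). The `Segment` class and function names are illustrative assumptions, and the start positions in the example are made up; only the spin system lists are taken from the example slide that follows.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class Segment:
    name: str
    start: int                     # 0-based sequence position of the first spin system
    spin_systems: Tuple[str, ...]  # linked spin systems, in sequence order

    @property
    def end(self) -> int:
        """Position just past the last residue covered (exclusive)."""
        return self.start + len(self.spin_systems)


def in_conflict(a: Segment, b: Segment) -> bool:
    overlap = a.start < b.end and b.start < a.end
    shared = bool(set(a.spin_systems) & set(b.spin_systems))
    return overlap or shared


def build_conflict_graph(segments: List[Segment]) -> Dict[str, Set[str]]:
    """Adjacency sets: an edge joins every pair of conflicting segments."""
    graph: Dict[str, Set[str]] = {s.name: set() for s in segments}
    for a, b in combinations(segments, 2):
        if in_conflict(a, b):
            graph[a.name].add(b.name)
            graph[b.name].add(a.name)
    return graph


# Start positions are hypothetical and chosen so that no two segments overlap.
segments = [
    Segment("Seg1", 0, ("SP12", "SP13", "SP14")),
    Segment("Seg2", 10, ("SP9", "SP13", "SP20", "SP4")),
    Segment("Seg3", 3, ("SP8", "SP15", "SP21")),
    Segment("Seg4", 20, ("SP7", "SP1", "SP15", "SP3")),
]
graph = build_conflict_graph(segments)
print(graph)
```

With these placements the only edges come from shared spin systems: Seg1 and Seg2 share SP13, and Seg3 and Seg4 share SP15.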
41/62 An Example of G
Seq.: GEIKGRKTLATPAVRRLAMENNIKLSE
Segment1: SP12->SP13->SP14
Segment2: SP9->SP13->SP20->SP4
Segment3: SP8->SP15->SP21
Segment4: SP7->SP1->SP15->SP3
[Figure: the four segments placed on the sequence and the resulting conflict graph on Seg1, Seg2, Seg3, and Seg4, with edges labeled SP13, SP15, and Overlap]
42/62 Segment Weight
- The longer a segment is, the higher its weight.
- The lower the frequency of a segment is, the higher its weight.
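The slide gives only the direction of the two effects, not a formula. One function that respects both is length divided by frequency; this particular form is an assumption made for illustration and is not taken from the talk.

```python
def segment_weight(length: int, frequency: int) -> float:
    """Hypothetical weight: grows with segment length, shrinks with frequency."""
    if length < 1 or frequency < 1:
        raise ValueError("length and frequency must be positive")
    return length / frequency


print(segment_weight(5, 1) > segment_weight(3, 1))  # longer segment weighs more
print(segment_weight(4, 1) > segment_weight(4, 2))  # rarer segment weighs more
```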
43/62 Find Maximum Weight Independent Set of G
- Boppana, R. and M.M. Halldórsson, Approximating Maximum Independent Sets by Excluding Subgraphs. BIT, (2).
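To show what selecting an independent set of segments looks like, here is a simple greedy heuristic. It is not the Boppana and Halldórsson algorithm cited above; it is a stand-in that repeatedly picks the remaining node with the best weight per (degree + 1) and discards its neighbours. The graph and the weights in the example are made up.

```python
from typing import Dict, List, Set


def greedy_mwis(graph: Dict[str, Set[str]], weight: Dict[str, float]) -> List[str]:
    """Greedy approximation of a maximum weight independent set."""
    remaining = set(graph)
    chosen: List[str] = []
    while remaining:
        # sorted() makes tie-breaking deterministic (first name wins).
        best = max(
            sorted(remaining),
            key=lambda v: weight[v] / (len(graph[v] & remaining) + 1),
        )
        chosen.append(best)
        remaining -= graph[best] | {best}  # drop the pick and everything it conflicts with
    return chosen


conflicts = {"Seg1": {"Seg2"}, "Seg2": {"Seg1"}, "Seg3": {"Seg4"}, "Seg4": {"Seg3"}}
weights = {"Seg1": 3.0, "Seg2": 4.0, "Seg3": 3.0, "Seg4": 4.0}
selected = greedy_mwis(conflicts, weights)
print(selected)
```

The greedy rule carries no optimality guarantee, but the result is always independent: no two selected segments share an edge.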
44/62 An Iterative Approach
- We perform spin system generation and linking iteratively, in three stages.
45/62 First Stage
- Generate perfect spin systems;
- Perform spin system concatenation on the newly generated perfect spin systems to generate segments;
- Retain segments that contain at least 3 spin systems;
- Perform MaxIndSet on the segments;
- Drop spin systems (and related peaks) that are used in the resulting segments.
46/62 Second Stage
- Generate weak false negative spin systems;
- Perform segment extension on the resulting segments of the first stage (using unused perfect and newly generated weak false negative spin systems);
- Perform spin system concatenation on the unused spin systems (perfect + weak false negative) to generate longer segments;
- Retain segments that contain at least 3 spin systems;
- Perform MaxIndSet on the segments;
- Drop spin systems (and related peaks) that are used in the resulting segments.
47/62 Third Stage
- Generate severe false negative spin systems;
- Perform segment extension on the resulting segments of the second stage (using unused perfect and weak false negative spin systems, as well as newly generated severe false negative ones);
- Perform spin system concatenation on the unused spin systems (perfect + weak false negative + severe false negative) to generate longer segments;
- Retain segments that contain at least 3 spin systems;
- Perform MaxIndSet on the segments.
48/62 Segment Extension
[Figure: existing segments on the sequence ….FKJJREKL…. are extended with newly generated spin systems]
49/62 Segment Extension
[Figure: segments 77, 97, and 99 and their extended versions 97' and 99' on the sequence DGRGEKGRKTLATPAVRRLAMENNIKLS, before and after MaxIndSet]
51/62 Experimental Results
- Two datasets obtained from our collaborator Dr. Tai-Huang Huang at IBMS, Academia Sinica:
  - Average precision: 87.5%
  - Average recall: 73.1%
- Perfect data from BMRB: 99.1%
52/62 Real Wet-Lab Datasets
The two datasets were obtained from our collaborator Dr. Tai-Huang Huang at IBMS, Academia Sinica, Taiwan.

Datasets                                              sbd      lbd
# of amino acids                                      53       85
# of amino acids assigned manually by biologists      42       80
# of HSQC peaks                                       58       78
# of CBCA(CO)NH peaks
# of HNCACB peaks
# of expected CBCA(CO)NH peaks                        84       160
# of expected HNCACB peaks
False positives of CBCA(CO)NH                         67.4%    41.0%
False positives of HNCACB                             25.0%    48.4%
53/62 Experimental Results on Real Data

Datasets                      sbd    lbd
# of amino acids              53     85
# of assigned amino acids     42     81
# of HSQC peaks               58     78
# of CBCANH peaks
# of CBCA(CO)NH peaks

                  # correctly assigned    # assigned    Accuracy    Recall
Method on sbd                                                       76.2%
Method on lbd                                                       70.0%
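The reported figures can be related to one another with a little arithmetic. The sketch below assumes the usual definitions (precision = correctly assigned / assigned, recall = correctly assigned / manually assigned residues); the slides do not spell these out, and the counts passed to the two functions are toy numbers, not results from the talk. The last lines check that the two per-dataset recalls average to the 73.1% quoted on the summary slide.

```python
def precision(correct: int, assigned: int) -> float:
    """Fraction of the method's assignments that are correct."""
    return correct / assigned


def recall(correct: int, reference: int) -> float:
    """Fraction of the manually assigned residues that the method recovers."""
    return correct / reference


# Toy counts for illustration only.
print(precision(6, 8))
print(recall(6, 10))

# Per-dataset recalls reported for sbd and lbd.
avg_recall = (76.2 + 70.0) / 2
print(round(avg_recall, 1))  # 73.1
```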
55/62 Conclusion
- We model the backbone assignment problem as a constraint satisfaction problem
- This problem is solved using a natural language parsing technique (both bottom-up and top-down)
- The same approach seems to work for a large class of noise reduction problems that are discrete in nature
56/62 A Genetic Algorithm for NMR Backbone Resonance Assignment (I)
- Randomly generate a population of chromosomes
  - Each chromosome represents a possible backbone resonance assignment
- Fitness function
  - Evaluate the fitness of each chromosome according to the connectivity between adjacent amino acids
57/62 A Genetic Algorithm for NMR Backbone Resonance Assignment (II)
- Crossover operation
  - An offspring inherits different connected blocks from its parents
- Mutation operation
  - Make a new connected block from any position to increase the population diversity
58/62 Generation of a Random Chromosome
Step 1. Randomly select a position x.
Step 2. Randomly select an SSGroup i from CL(x).
Step 3. Extend connected fragments from i to both sides by using adjacency lists until no more extension can be found.
Step 4. Repeat Step 1 to Step 3 until all positions are assigned.
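The four steps above can be sketched as follows. The data layout is assumed for illustration: `candidates[x]` plays the role of CL(x), `succ` and `pred` are the adjacency lists (which groups may follow or precede a group), and a chromosome is a list holding one group id or `None` per position. The rule that each group is used at most once, and leaving a position as `None` when no candidate is free, are also assumptions.

```python
import random
from typing import Dict, List, Optional, Set


def random_chromosome(n_positions: int,
                      candidates: Dict[int, List[int]],
                      succ: Dict[int, List[int]],
                      pred: Dict[int, List[int]],
                      rng: random.Random) -> List[Optional[int]]:
    chrom: List[Optional[int]] = [None] * n_positions
    used: Set[int] = set()
    order = list(range(n_positions))
    rng.shuffle(order)                       # Steps 1 and 4: visit positions in random order
    for x in order:
        if chrom[x] is not None:
            continue                         # already filled by an earlier extension
        free = [g for g in candidates.get(x, []) if g not in used]
        if not free:
            continue                         # nothing left for x; it stays unassigned
        group = rng.choice(free)             # Step 2: pick a group from CL(x)
        chrom[x] = group
        used.add(group)
        for step, links in ((1, succ), (-1, pred)):   # Step 3: extend right, then left
            pos, cur = x + step, group
            while 0 <= pos < n_positions and chrom[pos] is None:
                options = [h for h in links.get(cur, [])
                           if h not in used and h in candidates.get(pos, [])]
                if not options:
                    break
                cur = rng.choice(options)
                chrom[pos] = cur
                used.add(cur)
                pos += step
    return chrom


# Tiny example in which only one fully connected assignment exists.
cl = {0: [0], 1: [1], 2: [2], 3: [3]}
follows = {0: [1], 1: [2], 2: [3]}
precedes = {1: [0], 2: [1], 3: [2]}
chromosome = random_chromosome(4, cl, follows, precedes, random.Random(0))
print(chromosome)
```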
59/62 Fitness Evaluation
- Fitness(ch) = the number of connected pairs, combined with their chemical shift differences.
- Two principles:
  1. The more connected pairs a chromosome has, the higher its score.
  2. The smaller its chemical shift differences are, the higher its score.
- Building blocks: connected fragments
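The slide states the two principles but not the exact formula. One scoring rule that follows both is sketched below: every connected adjacent pair contributes 1 plus a bonus that grows as its chemical shift difference shrinks. The formula, the tolerance `tol`, and the callback `shift_diff` are assumptions made for illustration, and the differences in the example are made up.

```python
from typing import Callable, List, Optional


def fitness(chrom: List[Optional[int]],
            shift_diff: Callable[[int, int], float],
            tol: float = 0.5) -> float:
    """Score a chromosome; shift_diff(a, b) is the shift difference (ppm)
    between group a at position i and group b at position i+1."""
    score = 0.0
    for a, b in zip(chrom, chrom[1:]):
        if a is None or b is None:
            continue                       # an unassigned position breaks connectivity
        d = shift_diff(a, b)
        if d < tol:                        # the pair counts as connected
            score += 1.0 + (tol - d) / tol
    return score


# Hypothetical pairwise differences; unlisted pairs are treated as unconnected.
diffs = {(0, 1): 0.1, (1, 2): 0.4, (2, 3): 0.9}


def lookup(a: int, b: int) -> float:
    return diffs.get((a, b), float("inf"))


print(fitness([0, 1, 2, 3], lookup))
print(fitness([0, 1, None, 3], lookup))
```

Under this rule each connected pair contributes between 1 and 2, so a chromosome with k connected pairs always scores above k.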
60/62 Crossover Operation
[Figure: two parents, the cutting site, and the resulting offspring]
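A minimal sketch of a crossover at a single cutting site: the offspring takes the positions before the cut from one parent and the rest from the other. The single-cut form and the repair step (blanking a group that would otherwise appear twice) are assumptions; the slide shows only parents, a cutting site, and an offspring.

```python
from typing import List, Optional


def crossover(parent1: List[Optional[int]],
              parent2: List[Optional[int]],
              cut: int) -> List[Optional[int]]:
    """Offspring = parent1[:cut] + parent2[cut:], with duplicate groups blanked."""
    child: List[Optional[int]] = list(parent1[:cut])
    used = {g for g in child if g is not None}
    for g in parent2[cut:]:
        if g is not None and g in used:
            child.append(None)       # group already inherited from parent1
        else:
            child.append(g)
            if g is not None:
                used.add(g)
    return child


offspring = crossover([0, 1, 2, 3], [4, 5, 1, 3], cut=2)
print(offspring)  # [0, 1, None, 3]
```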
61/62 Mutation Operation
- Once a position is selected to mutate, the positions that follow it also mutate so as to produce a connected fragment.
[Figure: mutation point]
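The mutation can be sketched as: replace the group at the mutation point, then keep overwriting the following positions with groups that connect to the new one, stopping when no connected choice remains. The data layout (`candidates` for the per-position candidate lists, `succ` for the forward adjacency lists) and the rule that a group may appear only once are assumptions for illustration.

```python
import random
from typing import Dict, List, Optional


def mutate(chrom: List[Optional[int]], point: int,
           candidates: Dict[int, List[int]],
           succ: Dict[int, List[int]],
           rng: random.Random) -> List[Optional[int]]:
    child = list(chrom)

    def is_free(group: int, pos: int) -> bool:
        """True if group is not placed at any position other than pos."""
        return all(h != group for i, h in enumerate(child) if i != pos)

    options = [g for g in candidates.get(point, [])
               if g != chrom[point] and is_free(g, point)]
    if not options:
        return child                      # nothing to mutate to
    cur = rng.choice(options)
    child[point] = cur
    pos = point + 1
    while pos < len(child):               # the following positions mutate too
        nxt = [h for h in succ.get(cur, [])
               if h in candidates.get(pos, []) and is_free(h, pos)]
        if not nxt:
            break                         # the connected fragment ends here
        cur = rng.choice(nxt)
        child[pos] = cur
        pos += 1
    return child


cl = {0: [0], 1: [1, 4], 2: [2, 5], 3: [3]}
follows = {4: [5], 5: [3]}
mutant = mutate([0, 1, 2, 3], 1, cl, follows, random.Random(0))
print(mutant)  # [0, 4, 5, 3]
```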
62/62 Experimental Results
- The accuracy on the two real datasets
  - SBD: 95.1% (FP: 67%)
  - LBD: 100% (FP: 48%)
- The average accuracy on perfect BMRB datasets (902 proteins)