1/62 An Iterative Relaxation Technique for the NMR Backbone Assignment Problem Wen-Lian Hsu Institute of Information Science Academia Sinica.

Presentation transcript:

1/62 An Iterative Relaxation Technique for the NMR Backbone Assignment Problem
Wen-Lian Hsu, Institute of Information Science, Academia Sinica

2/62 Characteristics of Our Method
• Model this as a constraint satisfaction problem
• Solve it using natural language parsing techniques
  - Both top-down and bottom-up
• An iterative approach
  - Create spin systems based on noisy data.
  - Link spin systems by using maximum independent set finding techniques.

3/62 Outline
• Introduction
• Method
• Experimental Results
• Conclusion

4/62 Blind Man's Elephant
• We cannot directly "see" the positions of these atoms (the structure).
• But we can measure a set of parameters (with constraints) on these atoms,
  - which can help us infer their coordinates.
• Each experiment can determine only a subset of the parameters (with noise).
• To combine the parameters of different experiments, we need to stitch them together.

5/62 The Flow of NMR Experiments
Get protein samples => Collect NMR spectra => Resonance assignment => Structure constraints => Calculation and simulation
  - Energy minimization
  - Fitness of structure constraints

6/62 Find out Chemical Shift for Each Atom
• Backbone atoms: Ca, Cb, C', N, NH
  - Various experiments: HSQC, CBCANH, CBCA(CO)NH, HN(CA)CO, HNCO, HN(CO)CA, HNCA
• Side chain: all others (especially CHs)
  - TOCSY-HSQC, HCCCONH, CCCONH, HCCH-TOCSY
[Figure: chemical shift assignment for one amino acid]

7/62 Some Relevant Parameters
[Figure: backbone chain -N-C-C-N-C-C-N-C-C-N-C-C- with its side chains; chemical shifts in ppm]

8/62 Three Important Experiments
Backbone: Ca, Cb, C', N, NH
Experiments: HSQC, CBCANH, CBCA(CO)NH, HN(CA)CO, HNCO, HN(CO)CA, HNCA
• sequential assignment
• chemical shifts of Ca, Cb, NH

9/62 Our NMR Spectra
• HSQC
• CBCA(CO)NH (2 peaks)
• HNCACB (4 peaks)
[Figure: CBCANH and CBCA(CO)NH spectra]

10/62 HSQC Spectra
• HSQC peaks (1 chemical shift for an amino acid)
[Table: HSQC peak list with columns H, N, Intensity]

11/62 CBCA(CO)NH Spectra
• CBCA(CO)NH peaks (2 chemical shifts for one amino acid)
[Table: CBCA(CO)NH peak list with columns H, N, C, Intensity]

12/62 CBCANH Spectra
• CBCANH peaks (4 chemical shifts for one amino acid)
  - Ca (+), Cb (-)
[Table: CBCANH peak list with columns H, N, C, Intensity]

13/62 A Dataset Example
• HSQC
• HNCACB: 4
• CBCA(CO)NH: 2
[Figure: example peaks plotted on N and H axes]

14/62 Backbone Assignment
• Goal
  - Assign chemical shifts to N, NH, Ca (and Cb) along the protein backbone.
• General approaches
  - Generate spin systems
    A spin system: an amino acid with known chemical shifts on its N, NH, Ca (and Cb).
  - Link spin systems

15/62 Ambiguities
• All 4-point experiments are mixed together
• All 2-point experiments are mixed together
• Each spin system can be mapped to several amino acids in the protein sequence
• False positives, false negatives

16/62 Previous Approaches
• Constrained bipartite matching problem
  - The spin system might be ambiguous
  - Can't deal with ambiguous links
[Figure: legal matching vs. illegal matching under constraints]

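17/62 Natural Language Processing: Signal or Noise?
• Speech recognition: homophone selection
  - Sentence: 台北市一位小孩走失了 ("A child in Taipei City has gone missing")
  - Candidate words: 台北市 (Taipei City), 小孩 (child), 台北 (Taipei), 適宜 (suitable), 走失 (to go missing), 事宜 (matters), 一位 (one person), 一味 (persistently), 移位 (to shift position)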

18/62 An Error-Tolerant Algorithm

19/62 Phrase, Sentence Combination

20/62 Hierarchical Analysis
• Sentence-meaning template (句意模版)
• Sentence-pattern template (句型模版)
• Phrase template (片語模版)
• Word template (字詞模版)

21/62 Perfect Group
• Each spin group contains 6 points: 4 points are from the first experiment and 2 points are from the second experiment.
[Figure: backbone atoms (N, H, C, O) of two adjacent residues]

23/62 A Perfect Spin System Group
[Figure: CBCA(CO)NH peak table (columns N, H, C, Intensity) giving Ca(i-1) and Cb(i-1), and CBCANH peak table (columns N, H, C, Intensity) giving Ca(i-1), Cb(i-1), Ca(i), and Cb(i)]

24/62 False Positives and False Negatives
• False positives
  - Noise with high intensity
  - Produce fake spin systems
• False negatives
  - Peaks with low intensity
  - Missing peaks
• In real wet-lab data, nearly 50% of the peaks are noise (false positives).

25/62 Spin System Group
[Figure: perfect, false negative, and false positive spin system groups plotted on N and H axes]

26/62 Outline
• Introduction
• Method
• Experimental Results
• Conclusion

27/62 Main Idea
• Deal with false negatives in the spin system generation procedure.
• Eliminate false positives in the spin system linking procedure.
• Perform the spin system generation and linking procedures in an iterative fashion.

28/62 Spin System Group Generation
• Three types of spin system groups are generated based on the quality of the CBCANH data:
  - Perfect
  - Weak false negative
  - Severe false negative

29/62 Perfect Spin Systems
• A spin system is determined without any added pseudo peak.
[Figure: CBCA(CO)NH and CBCANH peak tables (columns N, H, C, Intensity) for Ca(i-1), Cb(i-1), Ca(i), and Cb(i)]

30/62 Weak False Negative Spin System Group
• A spin system is determined with an added pseudo peak.
[Figure: CBCA(CO)NH and CBCANH peak tables (columns N, H, C, Intensity) for Ca(i-1), Cb(i-1), Ca(i), and Cb(i), with one pseudo Ca peak added]

31/62 Severe False Negative Spin System Group
• A spin system is determined with two added pseudo peaks.
[Figure: CBCA(CO)NH and CBCANH peak tables (columns N, H, C, Intensity) for Ca(i-1), Cb(i-1), Ca(i), and Cb(i), with two pseudo peaks added]
Note: it is also possible that Ca(i-1) = and Cb(i-1) =
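
The three group types on slides 29 to 31 differ only in how many pseudo peaks are added. A minimal Python sketch of that classification follows. It assumes a complete CBCANH group has 4 peaks, as stated on slide 12, and that the number of pseudo peaks equals the number of missing CBCANH peaks. The names `GroupType` and `classify_group` are hypothetical and do not come from the slides.

```python
from enum import Enum
from typing import Optional


class GroupType(Enum):
    PERFECT = 0                # no pseudo peak added
    WEAK_FALSE_NEGATIVE = 1    # one pseudo peak added
    SEVERE_FALSE_NEGATIVE = 2  # two pseudo peaks added


# Assumption: a complete CBCANH group has Ca(i-1), Cb(i-1), Ca(i), Cb(i).
EXPECTED_CBCANH_PEAKS = 4


def classify_group(observed_cbcanh_peaks: int) -> Optional[GroupType]:
    """Classify a group by how many pseudo peaks it would need."""
    missing = EXPECTED_CBCANH_PEAKS - observed_cbcanh_peaks
    if missing < 0 or missing > 2:
        return None  # outside the three cases described on the slides
    return GroupType(missing)


print(classify_group(3))
```

Groups that would need more than two pseudo peaks return `None` here, because the slides describe only the three cases above.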

32/62 A note on spin system generation
• To generate *ALL* possible spin systems, a peak can be included in more than one spin system.
  - False positives are eliminated in the spin system linking procedure.
  - False negatives are treated by adding pseudo peaks.
• A rule-based mechanism is used to filter out incompatible spin systems (false positives).
  - Adopt a maximum weight independent set algorithm

33/62 Spin System Linking
• Goal
  - Link spin systems into segments that are as long as possible.
• Constraints
  - Each spin system is uniquely assigned to a position of the target protein sequence.
  - Two spin systems are linked only if the chemical shift differences of their intra- and inter-residues are less than the predefined thresholds.
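
The linking constraint can be sketched in Python as a threshold test. This is an illustration under assumptions, not the authors' implementation: the `SpinSystem` fields, the function `can_link`, and the tolerance values (in ppm) are all hypothetical, since the slides do not give the thresholds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpinSystem:
    ca_prev: float  # Ca(i-1), inter-residue shift
    cb_prev: float  # Cb(i-1), inter-residue shift
    ca: float       # Ca(i), intra-residue shift
    cb: float       # Cb(i), intra-residue shift


def can_link(left: SpinSystem, right: SpinSystem,
             ca_tol: float = 0.2, cb_tol: float = 0.4) -> bool:
    """True if `right` may directly follow `left` in the sequence.

    The intra-residue shifts of `left` must match the inter-residue
    shifts of `right` within the (hypothetical) tolerances.
    """
    return (abs(left.ca - right.ca_prev) < ca_tol
            and abs(left.cb - right.cb_prev) < cb_tol)


a = SpinSystem(ca_prev=55.0, cb_prev=30.0, ca=58.1, cb=33.0)
b = SpinSystem(ca_prev=58.15, cb_prev=33.2, ca=62.0, cb=40.0)
print(can_link(a, b), can_link(b, a))
```

The test is directional: `a` can precede `b` only if the shifts `a` reports for its own residue agree with what `b` reports for its predecessor.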

34/62 A Peculiar Parking Lot (valet parking)
• Information you have: the make of your car and (approximately) the car parked in front of you.
• Together with others, try to identify as many cars in the right order as possible (maximizing the overall satisfaction).

35/62 Backbone Assignment
Sequence: DGRIGEIKGRKTLATPAVRRLAMENNIKLS

36/62 Spin System Positioning
• We assign spin system groups to a protein sequence according to their codes.
[Figure: spin systems mapped to sequence positions, with residue codes D 50, G 10, R 40, I 50]

37/62 Link Spin System Groups
[Figure: Segment 1, Segment 2, and Segment 3 aligned to the sequence DGRI]

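38/62 Iterative Concatenation
[Figure: starting from 56 spin systems at Step 1, each step concatenates spin systems into longer segments aligned to the sequence DGRI….FKJJREKL (Segment 1, Segment 2, …, Segment 31 at Step 2; Segment 78, Segment 79, … at Step n-1), up to Step n]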

39/62 Conflict Segments
[Figure: segments 71, 78, 79, 97, 98, and 99 aligned to the sequence DGRIGEIKGRKTLATPAVRRLAMENNIKLS]
• Two kinds of conflict segments
  - Overlap (e.g. segment 71 and segment 99)
  - Use the same spin system (e.g. both segment 78 and segment 79 contain spin system 1)

40/62 A Graph Model for Spin System Linking
• G(V, E)
  - V: a set of nodes (segments).
  - E: (u, v), u, v ∈ V, where u and v are in conflict.
• Goal
  - Assign as many non-conflicting segments as possible => find the maximum independent set of G.

41/62 An Example of G
• Seq.: GEIKGRKTLATPAVRRLAMENNIKLSE
  - Segment1: SP12->SP13->SP14
  - Segment2: SP9->SP13->SP20->SP4
  - Segment3: SP8->SP15->SP21
  - Segment4: SP7->SP1->SP15->SP3
[Figure: Seg1, Seg3, Seg4, and Seg2 placed along the sequence, and the conflict graph on Seg1 to Seg4 with edges labeled SP13, SP15, and Overlap]
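
The conflict graph for this example can be sketched in Python as follows. The spin system lists are taken from the slide. The start positions are hypothetical, since the slide text does not give them, and they were chosen so that Seg4 and Seg2 overlap. The names `Segment`, `in_conflict`, and `conflict_edges` are illustrative only.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Segment:
    name: str
    start: int            # hypothetical 0-based position in the sequence
    spin_systems: tuple   # spin system IDs in linking order

    @property
    def end(self) -> int:
        return self.start + len(self.spin_systems)  # exclusive end


def in_conflict(u: Segment, v: Segment) -> bool:
    """Two segments conflict if they overlap or share a spin system."""
    overlap = u.start < v.end and v.start < u.end
    shared = bool(set(u.spin_systems) & set(v.spin_systems))
    return overlap or shared


def conflict_edges(segments):
    """Edge set E of the conflict graph G(V, E)."""
    return {(u.name, v.name)
            for u, v in combinations(segments, 2)
            if in_conflict(u, v)}


segments = [
    Segment("Seg1", 0, ("SP12", "SP13", "SP14")),
    Segment("Seg2", 12, ("SP9", "SP13", "SP20", "SP4")),
    Segment("Seg3", 5, ("SP8", "SP15", "SP21")),
    Segment("Seg4", 10, ("SP7", "SP1", "SP15", "SP3")),
]
edges = conflict_edges(segments)
print(sorted(edges))
```

With these assumed positions, Seg1 and Seg2 conflict through SP13, Seg3 and Seg4 through SP15, and Seg2 and Seg4 through overlap.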

42/62 Segment weight
• The longer a segment is, the higher its weight.
• The less frequent a segment is, the higher its weight.

43/62 Find Maximum Weight Independent Set of G
• Boppana, R. and M. M. Halldórsson, Approximating Maximum Independent Sets by Excluding Subgraphs. BIT, (2).
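
To make the independent set step concrete, here is a simple greedy heuristic in Python. It is not the Boppana and Halldórsson algorithm cited on the slide; it is only a sketch of the problem being solved. The weights are assumed to equal the segment lengths, and the function name `greedy_mwis` is hypothetical.

```python
def greedy_mwis(weights, edges):
    """Greedy heuristic for a maximum weight independent set.

    weights: dict mapping node -> weight
    edges:   iterable of (u, v) conflict pairs
    Repeatedly picks the heaviest node that is not yet blocked,
    then blocks all of its neighbors.
    """
    neighbors = {node: set() for node in weights}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    chosen = []
    blocked = set()
    # Heaviest first; ties are broken by name so the result is deterministic.
    for node in sorted(weights, key=lambda n: (-weights[n], n)):
        if node not in blocked:
            chosen.append(node)
            blocked.add(node)
            blocked |= neighbors[node]
    return chosen


# Assumed weights: the number of spin systems in each segment.
weights = {"Seg1": 3, "Seg2": 4, "Seg3": 3, "Seg4": 4}
edges = [("Seg1", "Seg2"), ("Seg2", "Seg4"), ("Seg3", "Seg4")]
selected = greedy_mwis(weights, edges)
print(selected)
```

A greedy pass gives no optimality guarantee in general, which is one reason to prefer an approximation algorithm with a proven bound on larger conflict graphs.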

44/62 An Iterative Approach
• We perform spin system generation and linking iteratively.
• Three stages.

45/62 First Stage
• Generate perfect spin systems;
  - Perform spin system concatenation on the newly generated perfect spin systems to generate segments;
  - Retain segments that contain at least 3 spin systems;
  - Perform MaxIndSet on the segments;
  - Drop spin systems (and related peaks) that are used in the resulting segments.

46/62 Second Stage
• Generate weak false negative spin systems.
  - Perform segment extension on the resulting segments of the first stage (using unused perfect and newly generated weak false negative spin systems);
  - Perform spin system concatenation on the unused spin systems (perfect + weak false negative) to generate longer segments;
  - Retain segments that contain at least 3 spin systems;
  - Perform MaxIndSet on the segments;
  - Drop spin systems (and related peaks) that are used in the resulting segments.

47/62 Third Stage
• Generate severe false negative spin systems.
  - Perform segment extension on the resulting segments of the second stage (using unused perfect and weak false negative spin systems, as well as newly generated severe false negative spin systems);
  - Perform spin system concatenation on the unused spin systems (perfect + weak false negative + severe false negative) to generate longer segments;
  - Retain segments that contain at least 3 spin systems;
  - Perform MaxIndSet on the segments.

48/62 Segment Extension
[Figure: a segment on the sequence ….FKJJREKL…. is extended with new spin systems (e.g. New 109)]

49/62 Segment Extension
[Figure: segments 77, 97, and 99 and the extended segments 97' and 99' on the sequence DGRGEKGRKTLATPAVRRLAMENNIKLS, before and after MaxIndSet]

50/62 Outline
• Introduction
• Method
• Experimental Results
• Conclusion

51/62 Experimental Results
• Two datasets obtained from our collaborator Dr. Tai-Huang Huang at IBMS, Academia Sinica:
  - Average precision: 87.5%
  - Average recall: 73.1%
• Perfect data from BMRB: 99.1%
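
For reference, precision and recall for an assignment can be computed as in the Python sketch below. It assumes precision is the fraction of assigned residues that are correct and recall is the fraction of the manually assigned (reference) residues that are recovered. The counts in the example are hypothetical and are not taken from the datasets.

```python
def precision_recall(num_correct: int, num_assigned: int, num_reference: int):
    """Return (precision, recall) as fractions in [0, 1].

    num_correct:   residues assigned by the method and matching the reference
    num_assigned:  residues assigned by the method
    num_reference: residues assigned manually (the reference assignment)
    """
    precision = num_correct / num_assigned if num_assigned else 0.0
    recall = num_correct / num_reference if num_reference else 0.0
    return precision, recall


# Hypothetical counts, for illustration only.
p, r = precision_recall(num_correct=30, num_assigned=40, num_reference=50)
print(f"precision={p:.1%} recall={r:.1%}")
```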

52/62 Real Wet-Lab Datasets
• The two datasets are obtained from our collaborator Dr. Tai-Huang Huang at IBMS, Academia Sinica, Taiwan.

Datasets                                            sbd     lbd
# of amino acids                                    53      85
# of amino acids assigned manually by biologists    42      80
# of HSQC peaks                                     58      78
# of CBCA(CO)NH peaks
# of HNCACB peaks
# of expected CBCA(CO)NH                            84      160
# of expected HNCACB
false positive of CBCA(CO)NH                        67.4%   41.0%
false positive of HNCACB                            25.0%   48.4%

53/62 Experimental Results on Real Data

Datasets                     sbd   lbd
# of amino acids             53    85
# of assigned amino acids    42    81
# of HSQC peaks              58    78
# of CBCANH peaks
# of CBCA(CO)NH peaks

                 # of correctly assigned   # of assigned   accuracy   recall
Method on sbd                                                         76.2%
Method on lbd                                                         70.0%

54/62 Outline
• Introduction
• Method
• Experimental Results
• Conclusion

55/62 Conclusion
• We model the backbone assignment problem as a constraint satisfaction problem.
• This problem is solved using a natural language parsing technique (both bottom-up and top-down approaches).
• The same approach seems to work for a large class of noise reduction problems that are discrete in nature.

56/62 A genetic algorithm for NMR backbone resonance assignment (I)
• Randomly generate a population of chromosomes
  - Each chromosome represents a possible backbone resonance assignment
• Fitness function
  - Evaluate the fitness of each chromosome according to the connectivity between adjacent amino acids

57/62 A genetic algorithm for NMR backbone resonance assignment (II)
• Crossover operation
  - An offspring inherits different connected blocks from its parents
• Mutation operation
  - Make a new connected block from any position to increase the population diversity

58/62 Generation of a random chromosome
• Step1. Randomly select a position x
• Step2. Randomly select a SSGroup i from CL(x)
• Step3. Extend connected fragments from i to both sides by using adjacency lists until no more extension can be found.
• Step4. Repeat Step1~Step3 until all positions are assigned
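
Steps 1 to 4 can be sketched in Python as below. This is a sketch under assumptions rather than the authors' code: CL(x) is represented as a dictionary `candidates`, the adjacency lists as a dictionary `successors`, each SSGroup is used at most once, and a position with no remaining candidate is left as `None`. All names are hypothetical.

```python
import random


def random_chromosome(n_positions, candidates, successors, rng):
    """Build one random assignment of SSGroups to sequence positions.

    candidates[x]: groups that may sit at position x, i.e. CL(x)
    successors[g]: groups that may directly follow g (adjacency list)
    """
    # Derive the reverse adjacency lists for extending to the left.
    predecessors = {}
    for g, nexts in successors.items():
        for h in nexts:
            predecessors.setdefault(h, []).append(g)

    chromosome = {}  # position -> group ID, or None if nothing fits
    used = set()

    def extend(x, group, step, links):
        # Walk right (step=+1) or left (step=-1) while a compatible,
        # unused group exists for the next free position.
        while True:
            x += step
            if not 0 <= x < n_positions or x in chromosome:
                return
            options = [h for h in links.get(group, [])
                       if h in candidates.get(x, []) and h not in used]
            if not options:
                return
            group = rng.choice(options)
            chromosome[x] = group
            used.add(group)

    order = list(range(n_positions))
    rng.shuffle(order)                      # Step 1: random position order
    for x in order:                         # Step 4: until all are assigned
        if x in chromosome:
            continue
        options = [g for g in candidates.get(x, []) if g not in used]
        if not options:
            chromosome[x] = None
            continue
        group = rng.choice(options)         # Step 2: pick from CL(x)
        chromosome[x] = group
        used.add(group)
        extend(x, group, +1, successors)    # Step 3: extend to both sides
        extend(x, group, -1, predecessors)
    return [chromosome[x] for x in range(n_positions)]


candidates = {0: ["a"], 1: ["b"], 2: ["c"]}
successors = {"a": ["b"], "b": ["c"]}
result = random_chromosome(3, candidates, successors, random.Random(0))
print(result)
```

In this toy input every position has a single candidate and the adjacency lists form one chain, so any random start yields the same chromosome.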

59/62 Fitness Evaluation
Fitness(ch) = the number of connected pairs associated with their chemical shift differences.
Two principles:
1. The more connected pairs it has, the higher score it gets.
2. The smaller its chemical shift differences, the higher score it gets.
Building blocks: connected fragments
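
One way to turn the two principles into a score is sketched below in Python. The slides do not give the exact formula, so the scoring rule (1 point per connected pair plus a bonus that grows as the shift difference shrinks), the tolerance `tol`, and the tuple layout of `shifts` are all assumptions made for illustration.

```python
def fitness(chromosome, shifts, tol=0.5):
    """Score a chromosome by its connected adjacent pairs.

    chromosome: list of group IDs (or None) indexed by sequence position
    shifts[g]:  (ca_prev, cb_prev, ca, cb) chemical shifts of group g
    tol:        hypothetical tolerance (ppm) for calling a pair connected
    """
    score = 0.0
    for left, right in zip(chromosome, chromosome[1:]):
        if left is None or right is None:
            continue
        _, _, ca, cb = shifts[left]
        ca_prev, cb_prev, _, _ = shifts[right]
        diff = max(abs(ca - ca_prev), abs(cb - cb_prev))
        if diff < tol:
            # Principle 1: each connected pair adds 1.
            # Principle 2: a smaller difference adds a larger bonus.
            score += 1.0 + (1.0 - diff / tol)
    return score


shifts = {
    "a": (0.0, 0.0, 58.0, 33.0),
    "b": (58.0, 33.0, 62.0, 40.0),
    "c": (61.9, 40.0, 50.0, 20.0),
}
print(fitness(["a", "b", "c"], shifts))
```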

60/62 Crossover Operation
[Figure: two parents and their offspring, with the cutting site marked]

61/62 Mutation operation
• Once a position is going to mutate, the following positions will also mutate to produce a connected fragment.
[Figure: chromosome with the mutation point marked]

62/62 Experimental Results
• The accuracy on two real datasets
  - SBD: 95.1% (FP: 67%)
  - LBD: 100% (FP: 48%)
• The average accuracy on perfect BMRB datasets (902 proteins)