Prediction of NMR Chemical Shifts. A Chemometrical Approach. K.A. Blinov, Y.D. Smurnyy, T.S. Churanova, M.E. Elyashberg, Advanced Chemistry Development (ACD)


Structure and its spectral data (figure: a structure and its corresponding spectra)

Sometimes the solution is not obvious. In many cases several structures are consistent with the spectral data, so we need a method to rank them. The most powerful method is to compare the experimental and predicted 13C NMR spectra.

13C NMR spectral data (figure: experimental vs. predicted chemical shifts for an example structure)

How to find the best structure? In most cases the predicted spectrum of the correct structure fits the experimental spectrum best. In practice the correct structure shows an average deviation between predicted and experimental spectra of 2-3 ppm.
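A minimal sketch of this ranking step (not ACD's code): candidates are scored by the mean absolute deviation between their predicted 13C shifts and the experimental peak list, assuming equal peak counts and a simple sorted peak-to-atom matching.

```python
# Rank candidate structures by how well their predicted 13C shifts
# match the experimental spectrum (mean absolute deviation in ppm).

def mean_abs_deviation(predicted, experimental):
    """Mean absolute deviation (ppm) between two shift lists of equal length."""
    pred, exp = sorted(predicted), sorted(experimental)
    return sum(abs(p - e) for p, e in zip(pred, exp)) / len(exp)

def rank_candidates(candidates, experimental):
    """candidates: {name: [predicted shifts]} -> list sorted best-first."""
    scored = [(mean_abs_deviation(shifts, experimental), name)
              for name, shifts in candidates.items()]
    return sorted(scored)  # smallest deviation (typically 2-3 ppm) first

if __name__ == "__main__":
    experimental = [14.1, 22.7, 31.9, 128.3, 137.5]
    candidates = {
        "structure_A": [13.8, 23.0, 32.4, 128.0, 138.1],
        "structure_B": [10.2, 25.9, 40.1, 120.3, 150.2],
    }
    for dev, name in rank_candidates(candidates, experimental):
        print(f"{name}: {dev:.2f} ppm")
```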

The role of spectrum prediction. A real-world task: an unknown structure with molecular formula C29H32N2O5 and 1D and 2D NMR spectral data. Generating all candidate structures (> ) took 20 min, but predicting the 13C NMR spectra of all of them took 24 hours. The speed of spectrum prediction must be increased.

Methods of NMR spectrum prediction:
– Quantum mechanics – extremely slow
– Database approach (HOSE codes, maximum common substructure) – accurate but slow
– Rule-based (additive scheme, neural networks) – fast but inaccurate
Our choice – improve the accuracy of the fast method.

Additive scheme: δ = Σ_i a_i·x_i, where x_i counts the occurrences of a structural feature and a_i is its increment. The main problem is to find correct values of the atom increments.
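A toy illustration of the additive idea as a dot product of increments and feature counts; the increment values below are invented for the example, not taken from ACD's tables.

```python
# Additive scheme sketch: predicted shift = base + sum_i a_i * x_i.

BASE_SHIFT = 0.0  # ppm; a real scheme would use a reference value here

increments = {            # a_i, ppm per occurrence (illustrative numbers only)
    "alpha_CH3": 9.1,
    "alpha_OH": 48.0,
    "beta_CH2": 9.4,
}

def predict_shift(feature_counts, increments, base=BASE_SHIFT):
    """delta = base + sum_i a_i * x_i over the features present."""
    return base + sum(increments.get(f, 0.0) * x for f, x in feature_counts.items())

# Example: a carbon with one alpha-CH3 and one beta-CH2 neighbour
print(predict_shift({"alpha_CH3": 1, "beta_CH2": 1}, increments))
```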

Available data: we have a database of 1.5 million 13C chemical shifts, so we can try to obtain the correct increment values!

How to encode the atom environment (figure): the input variables count how many atoms of each type (e.g. CH2, CH3, O, C) occur in the 1st sphere and the 2nd sphere around the central atom.
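A small sketch of such a sphere-count encoding over a toy adjacency-list molecule (no cheminformatics library; the atom labels and graph format are illustrative assumptions).

```python
# Count atoms of each label in spheres 1..max_sphere around a central atom.

from collections import deque

def sphere_counts(atoms, bonds, center, max_sphere=2):
    """atoms: {idx: label}; bonds: {idx: [neighbour idx]};
    returns {(sphere, label): count} around `center`."""
    counts = {}
    visited = {center}
    queue = deque([(center, 0)])
    while queue:
        idx, dist = queue.popleft()
        if dist > 0:
            key = (dist, atoms[idx])
            counts[key] = counts.get(key, 0) + 1
        if dist < max_sphere:
            for nb in bonds[idx]:
                if nb not in visited:
                    visited.add(nb)
                    queue.append((nb, dist + 1))
    return counts

# Ethanol-like fragment CH3-CH2-OH; encode the environment of the CH2 carbon
atoms = {0: "CH3", 1: "CH2", 2: "OH"}
bonds = {0: [1], 1: [0, 2], 2: [1]}
print(sphere_counts(atoms, bonds, center=1))
# {(1, 'CH3'): 1, (1, 'OH'): 1}
```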

Data for PLS regression: each sample (carbon atom) is one row; X holds its atom-environment encoding and Y its chemical shift.
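A sketch of the regression step using scikit-learn's PLS implementation on synthetic data of the right shape; the talk's actual training data (about 210 K shifts) and component count are not reproduced here.

```python
# Fit a PLS regression from encoded atom environments (X) to 13C shifts (y).

import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
n_atoms, n_features = 500, 40                      # toy sizes for the sketch
X = rng.integers(0, 3, size=(n_atoms, n_features)).astype(float)  # feature counts
true_increments = rng.normal(0.0, 5.0, size=n_features)
y = X @ true_increments + rng.normal(0.0, 1.0, size=n_atoms)      # synthetic shifts

pls = PLSRegression(n_components=10)
pls.fit(X, y)
y_hat = pls.predict(X).ravel()
print("mean abs error, ppm:", np.abs(y_hat - y).mean())
```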

Finding the best structure encoding. The best scheme of structure representation is not evident at the outset, so we should find the scheme that gives the best accuracy. We should optimize the substituent coding scheme and the number of "spheres" used (a sketch of such a search follows).
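A hedged sketch of the encoding search: try several configurations, fit on the training set, and keep the one with the lowest error on the external validation set. The featurizer `encode_dataset` is a hypothetical stand-in that returns synthetic data.

```python
# Grid search over encoding options, scored on an external validation set.

import numpy as np
from sklearn.cross_decomposition import PLSRegression

def encode_dataset(molecules, n_spheres, scheme):
    """Hypothetical featurizer: returns (X, y) for a given configuration."""
    rng = np.random.default_rng(hash((n_spheres, scheme)) % 2**32)
    n = len(molecules)
    X = rng.integers(0, 3, size=(n, 10 * n_spheres)).astype(float)
    y = X.sum(axis=1) * 3.0 + rng.normal(0.0, 1.0, size=n)
    return X, y

train, valid = list(range(1000)), list(range(300))   # placeholder "molecules"
best = None
for n_spheres in (1, 2, 3, 4):
    for scheme in ("atom_type", "atom_type+hybridization"):
        X_tr, y_tr = encode_dataset(train, n_spheres, scheme)
        X_va, y_va = encode_dataset(valid, n_spheres, scheme)
        model = PLSRegression(n_components=min(8, X_tr.shape[1])).fit(X_tr, y_tr)
        mae = np.abs(model.predict(X_va).ravel() - y_va).mean()
        if best is None or mae < best[0]:
            best = (mae, n_spheres, scheme)
print("best configuration:", best)
```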

Data used: 210 K chemical shifts as the training set and 170 K chemical shifts from recent literature as an external validation set.

How to describe an atom: atom type (C, O, etc.), hybridization (sp3, sp2, etc.), valence, number of attached H, charge, and distance in bonds to the "central" atom. (Figure: a "central" atom and a "substituent" atom annotated 7 (N), 1 (sp3).)
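A minimal sketch of how such a substituent description could be turned into a feature name; the naming format is an illustrative assumption, not ACD's actual encoding.

```python
# Build a feature key from the substituent descriptors listed on the slide.

def substituent_key(element, hybridization, valence, n_h, charge, distance):
    """e.g. element='N', sp3, valence 3, 0 H, neutral, 1 bond away -> 'd1_N_sp3_v3_h0_q0'."""
    return f"d{distance}_{element}_{hybridization}_v{valence}_h{n_h}_q{charge}"

# A nitrogen (sp3, valence 3, no attached H, neutral) one bond from the central atom:
print(substituent_key("N", "sp3", 3, 0, 0, 1))   # d1_N_sp3_v3_h0_q0
```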

Result for different atom encoding

Result for number of spheres

Is this the best possible accuracy? With this encoding the best achievable average deviation is 3.5 ppm, but we need less than 3 ppm (2 ppm is preferable). Should we use additional variables? We should be very careful when adding variables.

Substituent interference (cross effect) — figure with example aromatic 13C shifts (141.48, 125.90, 138.30, 125.38, 122.90, 134.16 ppm) and a +2.48 ppm cross-effect term.

Enhanced structure encoding (figure): in addition to single-atom counts, the input variables include counts of pairs of atoms ("crosses"), e.g. "CH2 and CH" or "C and O".
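A small sketch of the "crosses" idea: count co-occurring substituent pairs around the same central atom so the model can capture their joint, non-additive effect. The pair naming is illustrative.

```python
# Count unordered pairs of substituent labels around one central atom.

from itertools import combinations

def pair_features(substituent_labels):
    """['CH2', 'CH', ...] -> counts of unordered substituent pairs."""
    counts = {}
    for a, b in combinations(sorted(substituent_labels), 2):
        key = f"{a}&{b}"
        counts[key] = counts.get(key, 0) + 1
    return counts

# A central carbon with CH2, CH and O neighbours:
print(pair_features(["CH2", "CH", "O"]))
# {'CH&CH2': 1, 'CH&O': 1, 'CH2&O': 1}
```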

Result for atom pairs (crosses): mean error (ppm) as a function of the distance between atoms within a cross and the number of spheres (figure).

More enhancements? The accuracy is now good enough on average (2.3 ppm), but it is still poor in some cases. Unfortunately, those cases are very important, so these "special" cases should be taken into account.

Stereo effects: double bonds. We use a "topological" (through-bond) distance, but sometimes the same topological distance corresponds to different "real" through-space distances, as for substituents on either side of a double bond (figure; one of the distances shown is 2.9 Å).

Modified structure encoding: the variables now combine atoms, pairs of atoms (crosses), and "stereo" effects.
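A sketch of combining the three groups of variables into one feature vector. The single cis/trans indicator used here is an illustrative assumption; the talk does not spell out which stereo descriptors ACD actually uses.

```python
# Merge atom counts, pair counts and a stereo indicator into one feature dict.

def combine_features(atom_counts, pair_counts, cis_across_double_bond):
    features = {}
    features.update({f"atom:{k}": v for k, v in atom_counts.items()})
    features.update({f"pair:{k}": v for k, v in pair_counts.items()})
    features["stereo:cis_across_C=C"] = 1 if cis_across_double_bond else 0
    return features

print(combine_features({"CH2": 1, "O": 1}, {"CH2&O": 1}, cis_across_double_bond=True))
```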

Prediction of spectra by different methods (mean error, ppm):

Taken into account            All atom types   CH3    =C
Atoms only                    3.52             1.55   8.03
+ pairs of atoms (crosses)    2.32             1.50   3.22
+ "stereo" effects            2.27             1.24   3.22
+ solvent                     2.25             1.24   3.20
+ to be continued?

Size of the training set. We have 1.5 million chemical shifts and should try to use all available data. The only problem is matrix size: in many cases the matrix grows beyond 2 GB.

Bigger dataset – smaller mean error!

The final results (table: method, average deviation, calculation rate in shifts/sec): old method – HOSE codes; new method – additive scheme. The new scheme is faster by three orders of magnitude!

Prediction time, past and present (for C29H32N2O5):

Method            Average deviation   Time
HOSE codes        1.72                > 24 hours
Additive scheme   1.63                2 min

Conclusions: combining a "new" chemometric method with an old, well-known algorithm can produce very good (and unexpected) results.