Molecular Modeling: Statistical Analysis of Complex Data C372 Dr. Kelsey Forsythe.

Terminology
SAR (Structure-Activity Relationships)
  – Circa 19th century?
QSPR (Quantitative Structure-Property Relationships)
  – Relate structure to any physical-chemical property of a molecule
QSAR (Quantitative Structure-Activity Relationships)
  – Specific to some biological/pharmaceutical function of a molecule (Absorption, Distribution/Digestion, Metabolism, Excretion)
  – Crum Brown and Fraser (1868-9): ‘constitution’ related to biological response
  – LogP

Statistical Models
Simple
  – Mean, median and variation
  – Regression
Advanced
  – Validation methods
  – Principal components, covariance
  – Multiple regression
QSAR, QSPR

Modern QSAR
  – Hansch et al. (1963)
Activity → ‘travel through body’ → partitioning between varied solvents
  – C (minimum dosage required)
  – π (hydrophobicity)
  – σ (electronic)
  – E_s (steric)
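The equation from this slide did not survive in the transcript. A commonly quoted linear form of the Hansch relationship, written with the symbols listed above, is sketched below. The coefficients k_1 to k_4 are hypothetical names for the fitted regression constants, and some formulations add a term in π squared.

```latex
\log\frac{1}{C} = k_1\,\pi + k_2\,\sigma + k_3\,E_s + k_4
```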

Choosing Descriptors
Buffon’s Problem
  – Needle length?
  – Needle color?
  – Needle composition?
  – Needle sheen?
  – Needle orientation?

Choosing Descriptors
Constitutional
  – MW, N atoms
Topological
  – Connectivity, Wiener index
Electrostatic
  – Polarity, polarizability, partial charges
Geometrical Descriptors
  – Length, width, molecular volume
Quantum Chemical
  – HOMO and LUMO energies
  – Vibrational frequencies
  – Bond orders
  – Total energy

Choosing Descriptors
Constitutional
  – MW, N atoms of each element
Topological
  – Connectivity, Wiener index (sums of bond distances)
  – 2D fingerprints (bit-strings)
  – 3D topographical indices, pharmacophore keys
Electrostatic
  – Polarity, polarizability, partial charges
Geometrical Descriptors
  – Length, width, molecular volume

Choosing Descriptors
Chemical
  – Hydrophobicity (LogP)
  – HOMO and LUMO energies
  – Vibrational frequencies
  – Bond orders
  – Total energy
  – ΔG, ΔS, ΔH

Statistical Methods
1-D analysis
Large dimension sets require decomposition techniques
  – Multiple Regression
  – PCA
  – PLS
Connecting a descriptor with a structural element so as to interpolate and extrapolate data

Simple Error Analysis (1-D)
Given N data points
  – Mean
  – Variance
  – Regression

Simple Error Analysis (1-D)
Given N data points
  – Regression

Simple Error Analysis (1-D)
Given N data points
  – (Poor) 0 < R² < 1 (Good)
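The formulas on these slides were not captured in the transcript. As a minimal sketch of the 1-D quantities they name (mean, variance, a least-squares line, and R²), assuming NumPy and a small made-up data set, one might write:

```python
import numpy as np

def simple_stats(x, y):
    """Mean, sample variance, least-squares line and R^2 for 1-D data."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    mean_y = y.mean()
    var_y = y.var(ddof=1)                   # sample variance (N - 1 in the denominator)
    slope, intercept = np.polyfit(x, y, 1)  # least-squares straight line
    y_calc = slope * x + intercept
    rss = np.sum((y - y_calc) ** 2)         # residual sum of squares
    tss = np.sum((y - mean_y) ** 2)         # total sum of squares
    r2 = 1.0 - rss / tss                    # close to 1 means a good fit
    return mean_y, var_y, slope, intercept, r2

# Hypothetical data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
mean_y, var_y, slope, intercept, r2 = simple_stats(x, y)
print(f"mean={mean_y:.2f} variance={var_y:.2f} slope={slope:.2f} "
      f"intercept={intercept:.2f} R^2={r2:.3f}")
```

The choice of the N - 1 denominator for the variance is an assumption; the slide does not say which convention it used.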

Correlation vs. Dependence?
Correlation
  – Two or more variables/descriptors may correlate to the same property of a system
Dependence
  – When the correlation can be shown to arise because one variable changes in response to a change in another
Ex. Elephant's head and legs
  – Correlation exists between size of head and size of legs
  – The size of one does not depend on the size of the other

Quantitative Structure Activity/Property Relationships (QSAR, QSPR)
Discern relationships between multiple variables (descriptors)
Identify connections between structural traits (type of substituents, bond angles, substituent locale) and descriptor values (e.g. activity, LogP, % denaturation)

Pre-Qualifications
Size
  – Minimum of FIVE samples per descriptor
Verification
  – Variance
  – Scaling
  – Correlations

QSAR/QSPR Pre-Qualifications
Variance
  – Coefficient of Variation

QSAR/QSPR Pre-Qualifications
Scaling
  – Standardizing or normalizing descriptors to ensure they have equal weight (in terms of magnitude) in subsequent analysis

QSAR/QSPR Pre-Qualifications
Scaling
  – Unit Variance (Auto Scaling)
  – Ensures equal statistical weights (initially)
  – Mean Centering
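The variance and scaling checks above can be sketched as follows. This is an illustrative example assuming NumPy and a descriptor matrix with one row per molecule and one column per descriptor; the function names and the data are hypothetical, and the coefficient of variation is taken here as standard deviation divided by mean.

```python
import numpy as np

def coefficient_of_variation(X):
    """Per-descriptor standard deviation divided by the mean (columns = descriptors)."""
    return X.std(axis=0, ddof=1) / X.mean(axis=0)

def autoscale(X):
    """Mean-center each descriptor and scale it to unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Hypothetical data: 5 molecules x 2 descriptors on very different scales
X = np.array([[100.0, 0.10],
              [150.0, 0.30],
              [200.0, 0.20],
              [250.0, 0.50],
              [300.0, 0.40]])

cv = coefficient_of_variation(X)   # flags descriptors that barely vary
Z = autoscale(X)                   # every column now has mean 0 and variance 1
print("CV per descriptor:", cv)
print("column means after autoscaling:", Z.mean(axis=0))
```

After autoscaling, a descriptor measured in the hundreds no longer dominates one measured in tenths, which is the "equal statistical weights" point made on the slide.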

QSAR/QSPR Pre-Qualifications
Correlations
  – Remove correlated descriptors
  – Keep correlated descriptors so as to reduce data set size
  – Apply math operation to remove correlation (PCR)

QSAR/QSPR Pre-Qualifications Correlations
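One way to act on the "remove correlated descriptors" option is to compute the descriptor correlation matrix and drop one member of every strongly correlated pair. The sketch below assumes NumPy; the function name and the 0.9 threshold are illustrative choices, not values from the slides.

```python
import numpy as np

def drop_correlated(X, threshold=0.9):
    """Return indices of descriptors kept after removing one of each highly correlated pair."""
    corr = np.corrcoef(X, rowvar=False)   # P x P correlation matrix of the descriptors
    keep = []
    for j in range(X.shape[1]):
        # keep descriptor j only if it is not strongly correlated with one already kept
        if all(abs(corr[j, k]) < threshold for k in keep):
            keep.append(j)
    return keep

# Hypothetical data: column 1 is almost a rescaled copy of column 0
rng = np.random.default_rng(0)
a = rng.normal(size=30)
b = rng.normal(size=30)
X = np.column_stack([a, 2.0 * a + 0.01 * rng.normal(size=30), b])

kept = drop_correlated(X)
print("descriptors kept:", kept)
```

Which member of a correlated pair is kept depends on column order in this sketch; in practice the more interpretable descriptor would usually be retained.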

QSAR/QSPR Scheme
Goal
  – Predict what happens next (extrapolate)!
  – Predict what happens between data points (interpolate)!

QSAR/QSPR Scheme
Types of Variable
  – Continuous
      Concentration, occupied volume, partition coefficient, hydrophobicity
  – Discrete
      Structural (1: meta substituted, 0: no meta substitution)

QSAR/QSPR-Principal Components Analysis
Reduces dimensionality of descriptors
Principal components are a set of vectors representing the variance in the original data

QSAR/QSPR-Principal Components Analysis
Geometric Analogy (3-D to 2-D PCA)
[Figure: axes x, y, z]

QSAR/QSPR-Principal Components Analysis
Formulate matrix
Diagonalize matrix
Eigenvectors are the principal components
  – These principal components (new descriptors) are a linear combination of the original descriptors
Eigenvalues represent variance
  – Largest accounts for greatest % of data variance
  – Next corresponds to second greatest, and so on

QSAR/QSPR-Principal Components Analysis
Formulate matrix (several types)
  – Correlation or covariance (N x P)
      N is number of molecules
      P is number of descriptors
  – Variance-covariance matrix (N x N)
Diagonalize (rotate) matrix

QSAR/QSPR-Principal Components Analysis
Eigenvectors (Loadings)
  – Represent the contribution from each original descriptor to a PC (new descriptor)
      # columns = # of descriptors
      # rows = # of descriptors OR # of molecules
Eigenvalues
  – Indicate which PC is most important (representative of original descriptors)
      Benzene has 2 non-zero and 1 zero eigenvalue (planar)

QSAR/QSPR-Principal Components Analysis
Scores
  – Graphing each object/molecule in the space of 2 or more PCs
      # rows = # of objects/molecules
      # columns = # of descriptors OR # of molecules
      For benzene this corresponds to a graph in the PC1 (x’) and PC2 (y’) system
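The benzene remark can be checked numerically. The sketch below, assuming NumPy, runs PCA by diagonalizing the covariance matrix of the x, y, z coordinates of an idealized benzene carbon ring (a regular hexagon with a 1.39 Å C-C distance, tilted by an arbitrary 30 degrees so that the plane is not aligned with the axes). The function name and the tilt angle are illustrative assumptions.

```python
import numpy as np

def pca(X):
    """PCA by diagonalizing the covariance matrix of the columns of X."""
    Xc = X - X.mean(axis=0)                   # mean-center
    cov = np.cov(Xc, rowvar=False)            # P x P covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)    # symmetric matrix, so eigh applies
    order = np.argsort(eigvals)[::-1]         # sort by decreasing variance
    eigvals, loadings = eigvals[order], eigvecs[:, order]
    scores = Xc @ loadings                    # coordinates of each atom in PC space
    return eigvals, loadings, scores

# Idealized benzene carbons: regular hexagon of radius 1.39 Angstrom in the xy-plane
angles = np.arange(6) * np.pi / 3
ring = np.column_stack([1.39 * np.cos(angles), 1.39 * np.sin(angles), np.zeros(6)])

# Tilt the ring about the x-axis so all three coordinates vary
tilt = np.deg2rad(30)
rot_x = np.array([[1.0, 0.0, 0.0],
                  [0.0, np.cos(tilt), -np.sin(tilt)],
                  [0.0, np.sin(tilt), np.cos(tilt)]])
coords = ring @ rot_x.T

eigvals, loadings, scores = pca(coords)
print("eigenvalues:", eigvals)             # two non-zero values and one (numerically) zero
print("PC1/PC2 scores:\n", scores[:, :2])  # the ring drawn in the x', y' system
```

Here the "descriptors" are simply the three Cartesian coordinates, so the loadings matrix is 3 x 3 and the scores matrix has one row per atom.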

QSAR-PCA SYBYL (Tripos Inc.)

10D → 3D

SYBYL (Tripos Inc.)
Eigenvalues → explanation of variance in data

SYBYL (Tripos Inc.)
Each point corresponds to a column (# points = # descriptors) in the original data
Proximity → correlation

SYBYL (Tripos Inc.)
Each point corresponds to a row of the original data (i.e. # points = # molecules), or a graph of molecules in PC space
Proximity → similarity
[Figure: scores plot; point labels: He, Naphthalene, H2O; axis: Molecular Size; annotation: "Small acting Big"]

SYBYL (Tripos Inc.) Outlier

SYBYL (Tripos Inc.)

QSAR/QSPR-Regression Types Principal Component Analysis

Non-Linear Mappings
Calculate “distance” between points in N-d descriptor/parameter space
  – Euclidean
  – City-block distances
Randomly assign compounds in set to points in a 2-D or 3-D space
Minimize difference (optimal N-d → 2D plot)

Non-Linear Mappings
Advantages
  – Non-linear
  – No assumptions!
  – Chance groupings unlikely (2D group likely an N-D group)
Disadvantages
  – Dependence on initial guess (use PCA scores to improve)
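The mapping procedure described above can be sketched in a few lines. This example assumes NumPy and minimizes a plain sum of squared differences between the N-d and 2-D distances by gradient descent; that particular error function, the step size, and all function names are assumptions chosen for illustration, since the slides do not state which variant was used.

```python
import numpy as np

def distances(X, metric="euclidean"):
    """All pairwise distances between the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    if metric == "cityblock":
        return np.abs(diff).sum(axis=-1)
    return np.sqrt((diff ** 2).sum(axis=-1))

def nonlinear_map(X, n_dims=2, n_iter=300, seed=0, metric="euclidean"):
    """Place each row of X in n_dims so that low-d distances mimic the N-d ones."""
    n = X.shape[0]
    D = distances(X, metric)                 # target distances in descriptor space
    rng = np.random.default_rng(seed)
    Y = rng.normal(size=(n, n_dims))         # random initial guess
    step = 1.0 / (2 * n)                     # conservative fixed step size
    history = []
    for _ in range(n_iter):
        diff = Y[:, None, :] - Y[None, :, :]
        d = np.sqrt((diff ** 2).sum(axis=-1))
        history.append(np.sum((d - D) ** 2) / 2.0)   # each pair counted once
        np.fill_diagonal(d, 1.0)             # avoid 0/0 on the diagonal
        coef = (d - D) / d
        np.fill_diagonal(coef, 0.0)
        grad = 2.0 * (coef[:, :, None] * diff).sum(axis=1)
        Y = Y - step * grad
    return Y, history

# Hypothetical data: 15 compounds described by 5 descriptors
rng = np.random.default_rng(1)
X = rng.normal(size=(15, 5))
Y, history = nonlinear_map(X)
print(f"mismatch before: {history[0]:.2f}, after: {history[-1]:.2f}")
```

Changing `seed` changes the starting layout and can change the final map, which is the dependence on the initial guess noted on the slide; seeding `Y` with PCA scores instead of random numbers is the suggested remedy.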

QSAR/QSPR-Regression Types
Multiple Regression
PCR
PLS

QSAR/QSPR-Regression Types
Linear Regression
  – Minimize difference between calculated and observed values (residuals)
Multiple Regression

QSAR/QSPR-Regression Types
Principal Component Regression
  – Regression, but with principal components substituted for the original descriptors/variables
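A minimal sketch of principal component regression, assuming NumPy: the descriptors are mean-centered, replaced by their first k principal-component scores, and the response is regressed on those scores. The function name and the data are hypothetical.

```python
import numpy as np

def pcr_fit(X, y, n_components):
    """Regress y on the first n_components principal components of X."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    # Rows of Vt are the principal-component loadings of the centered data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:n_components].T                  # P x k loadings
    T = Xc @ V                               # N x k scores (the new descriptors)
    gamma, *_ = np.linalg.lstsq(T, y - y_mean, rcond=None)
    beta = V @ gamma                         # coefficients for the original descriptors
    intercept = y_mean - x_mean @ beta
    return beta, intercept

# Hypothetical data: 40 molecules, 3 descriptors
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 3.0 + 0.1 * rng.normal(size=40)

beta_2, b0_2 = pcr_fit(X, y, n_components=2)        # reduced model
beta_full, b0_full = pcr_fit(X, y, n_components=3)  # same as multiple regression
print("2-component coefficients:", beta_2)
print("3-component coefficients:", beta_full)
```

With all components retained, PCR reproduces ordinary multiple regression; the method only differs once low-variance components are dropped.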

QSAR/QSPR-Regression Types
Partial Least Squares
  – Cross-validation determines the number of descriptors/components to use
  – Derive equation
  – Use bootstrapping and t-test to test coefficients in QSAR regression

QSAR/QSPR-Regression Types
Partial Least Squares (a.k.a. Projection to Latent Structures)
  – Regression of a regression
      Provides insight into variation in the x’s (the b_ij’s, as in PCA) AND the y’s (the a_i’s)
  – The t_i’s are orthogonal
  – M = (# of variables/descriptors OR # of observations/molecules, whichever is smaller)
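The equations for this slide are missing from the transcript. As a hedged illustration of the idea, the sketch below implements a NIPALS-style PLS for a single response with NumPy. The names `W`, `P`, `T`, and `q` are conventional labels chosen here, not the slide's notation; the columns of `T` play the role of the latent variables t_i.

```python
import numpy as np

def pls1(X, y, n_components):
    """NIPALS-style PLS for one response; returns scores T and regression coefficients."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, T, q = [], [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk                 # direction in X most covariant with y
        w /= np.linalg.norm(w)
        t = Xk @ w                    # latent variable (score vector)
        tt = t @ t
        p = Xk.T @ t / tt             # X loadings
        c = (yk @ t) / tt             # y loading
        Xk = Xk - np.outer(t, p)      # deflate X and y before the next component
        yk = yk - c * t
        W.append(w); P.append(p); T.append(t); q.append(c)
    W, P, T, q = np.array(W).T, np.array(P).T, np.array(T).T, np.array(q)
    beta = W @ np.linalg.solve(P.T @ W, q)   # coefficients for the original descriptors
    intercept = y_mean - x_mean @ beta
    return T, beta, intercept

# Hypothetical data: 40 molecules, 3 descriptors
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=40)

T, beta, intercept = pls1(X, y, n_components=3)
gram = T.T @ T
print("off-diagonal of T^T T:", gram - np.diag(np.diag(gram)))  # ~0: scores are orthogonal
print("coefficients:", beta)
```

Unlike PCR, each component here is chosen with the response in view, which is the coupling of component generation and QSAR generation that distinguishes PLS.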

QSAR/QSPR-Regression Types
PLS is NOT MR or PCR in practice
  – PLS is MR w/ cross-validation
  – PLS is faster: it couples the target representation (QSAR generation) and component generation, while in PCA and PCR these are separate
PLS is well suited to multivariate problems

QSAR/QSPR Post-Qualifications
Confidence in Regression
  – TSS: Total Sum of Squares
  – ESS: Explained Sum of Squares
  – RSS: Residual Sum of Squares

QSAR/QSPR Post-Qualifications Confidence in Prediction (Predictive Error Sum of Squares)
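The defining equations for these quantities are not present in the transcript. Their standard textbook forms are given below as a reference sketch. The notation is an assumption: y_i is an observed value, \hat{y}_i the value calculated by the regression, \bar{y} the mean of the observations, and \hat{y}_{(i)} the prediction for point i from a model fitted without that point. The cross-validated q^2 is a common companion statistic that the slides do not name.

```latex
\begin{aligned}
\mathrm{TSS} &= \sum_{i=1}^{N} (y_i - \bar{y})^2, &
\mathrm{ESS} &= \sum_{i=1}^{N} (\hat{y}_i - \bar{y})^2, &
\mathrm{RSS} &= \sum_{i=1}^{N} (y_i - \hat{y}_i)^2, \\
R^2 &= \frac{\mathrm{ESS}}{\mathrm{TSS}} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}, &
\mathrm{PRESS} &= \sum_{i=1}^{N} \bigl(y_i - \hat{y}_{(i)}\bigr)^2, &
q^2 &= 1 - \frac{\mathrm{PRESS}}{\mathrm{TSS}}.
\end{aligned}
```

The identity TSS = ESS + RSS, and with it the two equivalent forms of R^2, holds for a least-squares fit that includes an intercept.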

QSAR/QSPR Post-Qualification
Bias?
  – Bootstrapping
Choosing best model?
  – Cross-validation

QSAR/QSPR Post-Qualification
Bootstrapping
  – ASSUME calculated data is experimental/observed data
  – Randomly choose N data (allowing multiple picks of the same data)
  – Regenerate parameters/regression
  – Repeat M times
  – Average over M bootstraps
  – Compare (calculate residual)
      If close to zero, then no bias
      If large, then bias exists
M is typically 50-100
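The bootstrap loop above can be sketched for a straight-line fit, assuming NumPy. This example resamples the (x, y) pairs directly, which is one common variant and an assumption here, not necessarily the exact procedure on the slide; the function name and data are hypothetical.

```python
import numpy as np

def bootstrap_bias(x, y, n_boot=100, seed=0):
    """Average refitted parameters over n_boot resamples and compare with the original fit."""
    rng = np.random.default_rng(seed)
    original = np.polyfit(x, y, 1)            # [slope, intercept] from the full data set
    n = len(x)
    params = np.empty((n_boot, 2))
    for m in range(n_boot):
        idx = rng.integers(0, n, size=n)      # N picks, repeats allowed
        params[m] = np.polyfit(x[idx], y[idx], 1)
    return params.mean(axis=0) - original     # near zero suggests little bias

# Hypothetical data: a noisy straight line
rng = np.random.default_rng(42)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + 1.0 + 0.1 * rng.normal(size=50)

bias = bootstrap_bias(x, y, n_boot=100)
print("bias in [slope, intercept]:", bias)
```

The default of 100 resamples follows the 50-100 range quoted on the slide.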

QSAR/QSPR Post-Qualification
Cross-Validation (used in PLS)
  – Remove one or more pieces of input data
  – Rederive QSAR equation
  – Calculate omitted data
  – Compute root-mean-square error to evaluate efficacy of model
      Typically 20% of data is removed for each iteration
      The model with the lowest RMS error has the optimal number of components/descriptors
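A minimal sketch of this procedure, assuming NumPy: 20% of the data is left out in each of five folds, a principal-component regression is refitted on the remainder, and the RMS error over all omitted points is compared across the number of components. Using PCR as the model, along with every function name and the data, is an illustrative assumption; the same loop applies to PLS.

```python
import numpy as np

def pcr_predict(X_train, y_train, X_test, k):
    """Fit a k-component principal component regression and predict X_test."""
    x_mean, y_mean = X_train.mean(axis=0), y_train.mean()
    Xc = X_train - x_mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:k].T
    gamma, *_ = np.linalg.lstsq(Xc @ V, y_train - y_mean, rcond=None)
    return y_mean + (X_test - x_mean) @ V @ gamma

def cv_rmse(X, y, k, n_folds=5, seed=0):
    """RMS error of predictions for data omitted from the fit (20% per fold by default)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    sq_err = 0.0
    for test_idx in folds:
        train = np.ones(len(y), dtype=bool)
        train[test_idx] = False               # remove this fold from the fit
        pred = pcr_predict(X[train], y[train], X[test_idx], k)
        sq_err += np.sum((y[test_idx] - pred) ** 2)
    return np.sqrt(sq_err / len(y))

# Hypothetical data: 50 molecules, 4 descriptors with different spreads
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4)) * np.array([3.0, 2.0, 1.0, 0.5])
y = X @ np.array([1.0, -1.0, 2.0, 0.5]) + 0.05 * rng.normal(size=50)

rmse = [cv_rmse(X, y, k) for k in range(1, 5)]
best_k = int(np.argmin(rmse)) + 1             # lowest RMS error picks the model
print("RMS error by number of components:", rmse)
print("selected number of components:", best_k)
```

The same seed is used for every value of k so that all candidate models are judged on identical splits.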

QSPR Example
Relation between musk odourant properties and benzenoid structure
  – Training set of 148 compounds (81 non-musk and 67 musk)
  – 47 chemical descriptors initially
  – Pre-qualifications
      Correlations (47-12=35)
  – Post-qualifications
      Bootstrapping
      Test set: 6/6 musks, 8/9 non-musks
Narvaez, J. N., Lavine, B. K. and Jurs, P. C. Chemical Senses, 11, 145-156 (1986)

Practical Issues
10 times as many compounds as parameters fit
3-5 compounds per descriptor
Traditional QSAR
  – Good for activity prediction
  – Not good for determining whether activity is due to binding or transport

Advanced Methods
Neural Networks
Genetic/Evolutionary Algorithms
Monte Carlo
Alternate descriptors
  – Reduced graphs
  – Molecular connectivity indices
  – Indicator variables (0 or 1)
Combinatorics (e.g. multiple substituent sites)

Tools Available
Sybyl (Tripos Inc.)
Insight II (Accelrys Inc.)
Pole Bio-Informatique Lyonnais
  – http://pbil.univ-lyon1.fr/
Molecular Biology
  – http://www.infobiogen.fr/services/deambulum/english/logiciels.html

Summary
QSAR/QSPR
  – Statistics connect structure/behavior w/ observables
  – Interpolate/Extrapolate
Multi-Variate Analysis
  – Pre-Qualification
  – Regression
      PCA
      PLS
      MLS
  – Post-Qualification
