Statistics evaluation and graphics


Statistics evaluation and graphics with ChemAxon tools and Statistica and WEKA towards QSPR and QSAR development
Tobias Kind, FiehnLab at UC Davis Genome Center, November 2006
Statistics - QSPR/QSAR - with JChem and Statistica and WEKA and YALE
Technical presentation: see notes and comments for deeper discussion. ChemAxon, Fiehnlab (fiehnlab.ucdavis.edu)
Free academic licenses for JChem and Instant JChem provided by ChemAxon.
Academic license for Statistica Dataminer provided by Statsoft.
GNU General Public License for WEKA provided by the WEKA Machine Learning Project.

Metabolomics - The science of the small molecules
Molecules under investigation, by compound class: sugars, amino acids, steroids, fatty acids, lipids, phospholipids, organic acids, ...
[Figure: 3D model of a molecule with surface plot]
Visit us! www.fiehnlab.ucdavis.edu

Techniques and tools
Analytical techniques (LC-MS, GC-MS, FT-MS, NMR, IR): liquid chromatography (LC-MS) and gas chromatography (GC-MS)
BioInformatics and ChemoInformatics
Statistics (Statistica Dataminer)
Open source tools

ChemAxon JChem now has PCA and PLS
PCA: principal component analysis. PLS: partial least squares.
Create a new library with the JChem Manager GUI (test case here: fingerprints).
Extract the fingerprints and do a dimension reduction with principal component analysis (PCA) using the command line tool PCA.bat or pca.sh.

ChemAxon JChem Principal Component Analysis (PCA)

Start the PCA by getting the data from the DB (here Access, but it can be Oracle, Derby or MySQL). The test case is 250,000 chemicals from the NCI DB. PCA can be done on any descriptor: chemical fingerprints, BCUT etc. This is just a simple example made from the 16 standard fingerprints. Be sure to select only the descriptors you want (and not the molecule ID).

Run 1, PCA with database input (run-pca.bat):

    PCA -d "sun.jdbc.odbc.JdbcOdbcDriver" -u "jdbc:odbc:jchem-z" -l test -p test -P JChemProperties -q "SELECT cd_fp1, cd_fp2, cd_fp3, cd_fp4, cd_fp5, cd_fp6, cd_fp7, cd_fp8, cd_fp9, cd_fp10, cd_fp11, cd_fp12, cd_fp13, cd_fp14, cd_fp15, cd_fp16 FROM nci99 WHERE cd_id <= 250000" -o PCA-scores.txt -t PCA-Eigenvalues.txt

    TimeThis : Command Line : run-pca.bat
    TimeThis : Start Time : Mon Nov 27 17:02:02 2006
    TimeThis : End Time : Mon Nov 27 17:19:52 2006
    TimeThis : Elapsed Time : 00:17:49.812

Run 2, PCA with file input:

    TimeThis : Command Line : pca -i test-25kx16.txt -o PCA250k-scores-external.txt -t PCA250k-eigen-external.txt
    TimeThis : Start Time : Mon Nov 27 22:24:16 2006
    activeColumns [I@b1b4c3
    TimeThis : End Time : Mon Nov 27 22:40:01 2006
    TimeThis : Elapsed Time : 00:15:45.375

Test system: AMD Dual Opteron 2.8 GHz, 2.8 GByte RAM, WINXP 32 bit.

Help output:

    Z:\>pca -h
    PCA 3.2, (C) 2002-2006 ChemAxon Ltd.
    Principal Component Analisis.
    Usage: pca [options]
    General options:
      -h --help                      this help message
      -d --driver <JDBC driver>      JDBC driver
      -u --dburl <url>               URL of database
      -l --login <login>             login name
      -p --password <password>       password
      -s --saveconf                  save settings into "C:\Documents and Settings\Tobi\chemaxon\.jchem"
      -m --meancenter                Don't autoscale just mean center data
      -s --noStandardize             Don't mean center and autoscale the data
      -e --maxerr                    maximal error during the iteration
    Input options (default: standard input):
      -i --input <path>              input file
      -q --query <sql>               SQL query string for reading input (database input)
    Output options (default: standard output):
      -o --scoreOutput <filepath>    output file path for principal components scores (text file output)
      -t --infoOutput <filepath>     output file path for Eigenvalues, Cumulated variance ... (text file output)

Problems here:
A) The JDBC extraction is not tuned: DB extraction of the values takes nearly 2 minutes.
B) The PCA calculation time is too long: 15 minutes for a 250,000 x 16 matrix.
The current PCA algorithm needs to be changed, it is very inefficient (faster matrix routines exist for Java). Database extraction time with Statistica: 8 seconds. The same PCA with Statistica is finished in 1 second (no joke, that's a factor of 1:900).
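To illustrate why a PCA on a 250,000 x 16 matrix does not have to be slow, here is a minimal sketch in plain Python. It is not the algorithm used by JChem or Statistica (the slides do not say what either uses). It autoscales the data, builds the small p x p correlation matrix in one pass over the rows, and extracts the leading components by power iteration with deflation. All function names and the synthetic demo data are assumptions made for illustration.

```python
import math
import random


def autoscale(rows):
    """Mean-center every column and scale it to unit standard deviation."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    stds = []
    for j in range(p):
        var = sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1)
        stds.append(math.sqrt(var) or 1.0)  # constant column: avoid dividing by zero
    return [[(r[j] - means[j]) / stds[j] for j in range(p)] for r in rows]


def correlation_matrix(scaled):
    """Return the p x p matrix Z^T Z / (n - 1); the only pass over all n rows."""
    n, p = len(scaled), len(scaled[0])
    c = [[0.0] * p for _ in range(p)]
    for r in scaled:
        for i in range(p):
            for j in range(i, p):
                c[i][j] += r[i] * r[j]
    for i in range(p):
        for j in range(i, p):
            c[i][j] /= n - 1
            c[j][i] = c[i][j]  # mirror the upper triangle
    return c


def matvec(c, v):
    """Multiply the square matrix c by the vector v."""
    return [sum(cij * vj for cij, vj in zip(row, v)) for row in c]


def top_components(c, k, iters=500, seed=0):
    """k largest eigenpairs of symmetric c via power iteration with deflation."""
    p = len(c)
    work = [row[:] for row in c]  # keep the caller's matrix untouched
    rng = random.Random(seed)
    pairs = []
    for _ in range(k):
        v = [rng.uniform(-1.0, 1.0) for _ in range(p)]
        for _ in range(iters):
            w = matvec(work, v)
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        eigval = sum(x * y for x, y in zip(v, matvec(work, v)))
        pairs.append((eigval, v))
        for i in range(p):  # deflate: remove the component just found
            for j in range(p):
                work[i][j] -= eigval * v[i] * v[j]
    return pairs


def scores(scaled, loading):
    """Project every autoscaled row onto one loading vector."""
    return [sum(x * y for x, y in zip(r, loading)) for r in scaled]


def demo_data(n=200, seed=1):
    """Hypothetical data: two nearly identical columns plus one independent column."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        a = rng.gauss(0.0, 1.0)
        rows.append([a, a + 0.05 * rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)])
    return rows
```

With this layout only correlation_matrix and scores touch all n rows; the eigenvalue problem itself is solved on a p x p matrix (16 x 16 for the standard fingerprints). The sign of each loading vector is arbitrary, so two programs can report the same component with opposite signs.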

JChem PCA output

Eigenvalues, % and cumulated variance (in rows) =
    1.77 1.623 1.518 1.326 1.106 1.028 0.999 0.94 0.919 0.849 0.824 0.788 0.742 0.71 0.674 0.602 0.582
    10.409 9.547 8.93 7.798 6.505 6.048 5.879 5.527 5.407 4.994 4.847 4.638 4.362 4.177 3.965 3.543 3.424
    10.409 19.957 28.886 36.684 43.189 49.236 55.115 60.643 66.05 71.043 75.891 80.528 84.89 89.068 93.033 96.576 100

Loadings (in rows) =
    0.191 0 -0.159 -0.17 0.306 0.617 -0.419 0.105 0.338 0.307 -0.304 -0.263 -0.324 0.348 -0.31 0.101 0.563 0.076 0.085 0.577 -0.117 0 0.128 0.084 -0.123 -0.255 -0.146 0.084 -0.682 0.335 -0.374 -0.63 0.11 0.063 0.182 -0.167 -0.049 0.181 -0.553 0.233 0.126 -0.016 -0.286 0.344 -0.535 -0.055 0.469 -0.035 0.235 -0.442 0.29 0.141 -0.572 0.077 -0.073

------ PCA scores
    -0.873 0.597 1.843 -0.131 0.204 -1.141 1.016 0.806 -0.263 0.221 0.208 1.54 1.704 -1.382 0.705 1.397 0.622 0.668 0.233 -0.175 0.748 0.801 1.087 1.366 -0.91 -1.369 0.192 1.919 -2.231 -0.218 1.043 1.13 -0.672 0.723 -1.015 -0.089 0.477 1.877 0.381 0.766 -0.59 -0.082 0.877 0.466 0.2 -0.397 1.189 1.308 -0.102 0.304 0.81 0.896 -1.853 0.435 0.551 0.32 -1.083 -0.439 0.346 1.081 0.557 -0.624 -0.042 -2.87 -0.835 -1.519 -0.705 1.147 -0.62 0.198 0.492 -0.34 -0.526 0.484 0.011 -0.456 -0.299 0.509 -1.294 -0.801 -0.947 0.455 -0.595 -0.673 -2.836 0.796 -0.631 0.353 -1.157 -1.519 0.957 0.966 1.113 0.919 0.368 -1.399 -0.215 -1.106 -1.638 -0.673 -1.918 -0.477 1.168 1.835 -0.755 -2.252 -0.962 -0.515 -0.722 1.023 2.844 1.349 0.353 -2.345 -0.737 0.808 1.778 0.343 -0.197 -0.221 -0.529 3.189 -1.481 -1.754 0.152 -0.881 -2.449 -0.649 0.622 0.301 0.928 -1.174 0.526 -0.322 -0.17 -0.589 0.233 -0.149 -0.783 -0.704 1.524 -1.547 -1.642 -1.085 0.981 0

Compared with the Statistica output, the JChem PCA results matrix is inverted and the values are multiplied by -1.
Problem: there are currently no graphics, but multivariate statistics lives from graphics. The following simple graphic examples are made with Statistica or WEKA via a DB query.
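The second and third rows of the eigenvalue block follow directly from the first row. As a worked check, the short sketch below recomputes the percentage and cumulated variance from the eigenvalues listed on the slide. The function name is an assumption for illustration; the eigenvalues are copied from the output above.

```python
# Eigenvalues copied from the first row of the JChem PCA output on the slide.
EIGENVALUES = [1.77, 1.623, 1.518, 1.326, 1.106, 1.028, 0.999, 0.94, 0.919,
               0.849, 0.824, 0.788, 0.742, 0.71, 0.674, 0.602, 0.582]


def explained_variance(eigenvalues):
    """Return (percent, cumulative) lists of explained variance per component."""
    total = sum(eigenvalues)
    percent = [100.0 * e / total for e in eigenvalues]
    cumulative = []
    running = 0.0
    for share in percent:
        running += share
        cumulative.append(running)
    return percent, cumulative


percent, cumulative = explained_variance(EIGENVALUES)
for number, (e, p, c) in enumerate(zip(EIGENVALUES, percent, cumulative), start=1):
    print(f"PC{number:2d}  eigenvalue {e:6.3f}  {p:7.3f} %  cumulated {c:8.3f} %")
```

The recomputed values agree with the slide to within the rounding of the printed eigenvalues. The sign flip relative to Statistica is consistent with the general fact that a principal component is only defined up to its sign.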

The following slides show either "what could be" in the future or "what can be done" right now. Check the pretty comprehensive statistics link http://www.statsoft.com/textbook/stathome.html

Machine learning and statistics tools
[Figure panels: PLS, machine learning (KNN), feature selection, tree model, neural network, cluster analysis, response curves]
We use Statistica Dataminer as a comprehensive statistics work tool. WEKA or YALE are free but not (yet :-) as powerful as the Statistica Dataminer.
Try MATLAB if you want to die from command line sickness, BUT MATLAB is very fast and compiled for each specific CPU.
YALE tutorial: http://superb-east.dl.sourceforge.net/sourceforge/yale/yale-3.4-tutorial.pdf

Connection of a JChem molecule DB via JDBC with Statistica
Time for query + copy of 4,000,000 values (250k molecules x 16 fingerprints) = 8 seconds.
Test system: JChem 3.2 with MS Access and Statistica Dataminer 7.1, Dual Opteron 2.8 GHz.
For Oracle and Apache Derby and multi-core CPU speed with JChem calculations, check my other presentation for JChem (see www.chemaxon.com/forum/ftopic2218.html), or copy this: http://www.chemaxon.com/forum/download2189.ppt, or this (not mine): http://www.oracle.com/technology/products/berkeley-db/pdf/je-derby-performance.pdf

Statistica with JChem data
Statistica has 11,000 built-in functions and most (if not all) statistical routines. It's way more comfortable than R or MATLAB; R has only a command line. Try also YALE or WEKA.

PCA Scree plot - determine the optimal number of factors to retain
[Scree plot from Statistica Dataminer 7.1; annotation: visible step]
Four factors can be retained. The 16-dimensional space can be compressed into a 4-dimensional space. (The scree plot is not optimal here.)

PCA Loadings plot - which variables are influential?
[Loadings plot from Statistica Dataminer 7.1]
Which of the 16 fingerprints are similar? Those that "cluster" together are similar (fp_11 and fp_14). The variables fp_5 and fp_16 influence factor 1 in the same way. Variables inside or near the center (0,0) have no discrimination power.
Remember that PCA is not a cluster analysis! If you want to cluster the loadings, you have to put the loadings output into a cluster analysis.

PCA Scores plot - picture of the reduced dimensionality
[Scores plot from Statistica Dataminer 7.1]
The 16 fingerprints are compressed into 2D. We can use other high-dimensional descriptors for enhanced examples. Cases (molecules) which "cluster" together may have the same properties or functional groups (depending on the input). Here we see that the KOW molecule set covers the whole NCI dataset, based on the 16 fps.

PCA Scores 3D plot - KOWWIN versus silicon compound test set
[3D scores plot from Statistica Dataminer 7.1]
The 16 fingerprints are compressed into 3D. The KOWWIN test set does not cover the whole molecular space of important silicon-containing molecules. You can also do an Overlap Analysis (compare two databases) within the all-new Instant-JChem.

Statistica - Random Forest machine learning
1024-DIM FC descriptor space. Statistica generates all graphical output + SQL code.
Chemical fingerprint descriptors were generated with JChem GenerateMD:

    Z:\>timethis "generatemd c 10k-test.smi -T -2 -k CF >10k-fp.txt"
    TimeThis : Command Line : generatemd c 10k-test.smi -T -2 -k CF >10k-fp.txt
    TimeThis : Start Time : Wed Nov 29 20:35:27 2006
    TimeThis : End Time : Wed Nov 29 20:35:33 2006
    TimeThis : Elapsed Time : 00:00:05.421

GenerateMD performance: 1800 molecules/second for 1024-dimensional fp on Dual Opteron 2.8 GHz (one core used only).

------ Miklos
Chemical fingerprint generation: 500/s
Pharmacophore fingerprint generation: calculated 80/s, rule-based 200/s
Screening: 12000/s
Optimization: 10s/metric
Hardware/software environment: P4 3GHz, 1GB RAM, Red Hat Linux 9, Java 1.4.2
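The quoted GenerateMD rate can be reproduced from the TimeThis log with a few lines of arithmetic. This is a small sketch; the helper names are made up for illustration, and the molecule count of 10,000 is an assumption inferred from the input file name 10k-test.smi.

```python
def elapsed_seconds(stamp):
    """Convert a TimeThis 'HH:MM:SS.mmm' elapsed-time string to seconds."""
    hours, minutes, seconds = stamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def throughput(n_items, stamp):
    """Items processed per second for a run that took `stamp`."""
    return n_items / elapsed_seconds(stamp)


N_MOLECULES = 10_000  # assumption: taken from the file name 10k-test.smi
rate = throughput(N_MOLECULES, "00:00:05.421")
print(f"GenerateMD: about {rate:.0f} molecules/second")
```

Under that assumption the result lands slightly above the rounded 1800 molecules/second given on the slide.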

CART tree method for QSPR and QSAR
Classification trees, boosting trees, random forest, regression trees and honest trees and adaptive trees: lots of wood and forests. Did you hear about them? That's no joke, check out scholar.google.com

Other machine learning techniques from Statistica Dataminer we use
Most of them work for classification and regression.

 #   Model class                       Specific model
 1   Generalized Linear Models (GLM)   General Discriminant Analysis
 2   Generalized Linear Models (GLM)   Binary logit (logistic) regression
 3   Generalized Linear Models (GLM)   Binary probit regression
 4   Nonlinear model                   Multivariate adaptive regression splines (MARS)
 5   Tree models                       Standard Classification Trees (CART)
 6   Tree models                       Standard General Chi-square Automatic Interaction Detector (CHAID)
 7   Tree models                       Exhaustive CHAID
 8   Tree models                       Boosting classification trees
 9   Neural Networks                   Multilayer Perceptron neural network (MLP)
10   Neural Networks                   Radial Basis Function neural network (RBF)
11   Machine Learning                  Support Vector Machines (SVM)
12   Machine Learning                  Naive Bayes classifier
13   Machine Learning                  k-Nearest Neighbors (KNN)

More than 11,000 functions available.

Now with the open source data mining tool WEKA
[Screenshot labels: URL, SQL, Data; Yellow = OK]
For MS Access: create from ADMIN tools, JDBC driver, add DSN file, create DB; or use Oracle settings.
File databaseutils.props in the WEKA root DIR:

    jdbcDriver=sun.jdbc.odbc.JdbcOdbcDriver
    jdbcURL=jdbc:odbc:jchem-z

SQL:

    SELECT silicon.`cd_fp1`, silicon.`cd_fp2`, silicon.`cd_fp3`, silicon.`cd_fp4`, silicon.`cd_fp5`, silicon.`cd_fp6`, silicon.`cd_fp7`, silicon.`cd_fp8`, silicon.`cd_fp9`, silicon.`cd_fp10`, silicon.`cd_fp11`, silicon.`cd_fp12`, silicon.`cd_fp13`, silicon.`cd_fp14`, silicon.`cd_fp15`, silicon.`cd_fp16`
    FROM `Z:\access-DB\silicon`.`silicon` silicon

Easy: enter the DB URL, enter the SQL statement, import the data. Try free AquaStudio for SQL!
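The same "select only the fingerprint columns" query can be built and run from a script instead of typing 16 column names by hand. The sketch below uses an in-memory SQLite table as a stand-in for the JChem structure table; that stand-in, the function names and the dummy values are assumptions for illustration only. The column names cd_id and cd_fp1 to cd_fp16 and the table names are taken from the slides.

```python
import sqlite3

N_FINGERPRINTS = 16  # the 16 standard fingerprint columns cd_fp1 .. cd_fp16


def fingerprint_query(table, n=N_FINGERPRINTS, max_id=None):
    """Build a SELECT for the fingerprint columns only (no molecule ID).

    `table` must be a trusted identifier; it is pasted into the SQL text.
    """
    columns = ", ".join(f"cd_fp{i}" for i in range(1, n + 1))
    sql = f"SELECT {columns} FROM {table}"
    if max_id is not None:
        sql += f" WHERE cd_id <= {int(max_id)}"
    return sql


def load_matrix(conn, table, **options):
    """Run the query and return the result as a list of rows (lists)."""
    return [list(row) for row in conn.execute(fingerprint_query(table, **options))]


def demo_connection(n_rows=5):
    """Hypothetical stand-in database with a 'silicon' table of dummy values."""
    conn = sqlite3.connect(":memory:")
    columns = ", ".join(f"cd_fp{i} INTEGER" for i in range(1, N_FINGERPRINTS + 1))
    conn.execute(f"CREATE TABLE silicon (cd_id INTEGER PRIMARY KEY, {columns})")
    placeholders = ", ".join("?" * (N_FINGERPRINTS + 1))
    for cd_id in range(1, n_rows + 1):
        values = [cd_id] + [cd_id * 100 + i for i in range(1, N_FINGERPRINTS + 1)]
        conn.execute(f"INSERT INTO silicon VALUES ({placeholders})", values)
    return conn


conn = demo_connection()
matrix = load_matrix(conn, "silicon", max_id=3)
print(len(matrix), "rows x", len(matrix[0]), "columns")
```

Against a real JChem database the connection would come from the matching ODBC or JDBC bridge rather than sqlite3; the query text is the part that carries over.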

WEKA - Machine learning algorithms in Java

WEKA – fingerprint visualization Data matrix 22,000x16

Conclusions regarding statistics:
JChem PCA and PLS output (eigenvalues, scores, loadings) is provided only as a text file.
More univariate and multivariate tools are needed.
JChem PCA and PLS results must have graphical output. (They must.)
JChem PCA must be made faster (factor 600-1000) by using faster math routines.
Integration into Instant-JChem would be good, or ChemAxon provides enhanced bundled statistics tools.
Currently a JDBC query from JChem to other statistical packages like WEKA or Statistica or R or MATLAB or YALE works perfectly. Each package works best in the field it was designed for.

MATLAB, R and YALE database connections via JDBC or ODBC are not shown here:
MATLAB http://www.mathworks.com/access/helpdesk/help/toolbox/database/
R http://cran.r-project.org/doc/manuals/R-data.pdf
YALE http://sourceforge.net/project/showfiles.php?group_id=114160

That's it. Thanks