1 Tobias Kind FiehnLab at UC Davis Genome Center November 2006 Benchmarking JChem Oracle and Instant-JChem (and more) Free Academic Licenses for JChem.

Presentation transcript:

1 Tobias Kind FiehnLab at UC Davis Genome Center November 2006 Benchmarking JChem Oracle and Instant-JChem (and more) Free Academic Licenses for JChem and Instant JChem provided by

2 ChemAxon product suite We have free academic licenses for all products Source: Chemaxon.com

3 Fiehnlab - The science of the small molecules
Compound classes: sugars, amino acids, steroids, fatty acids, lipids, phospholipids, organic acids...
Pictures: molecules under investigation (shown with ChemAxon Marvin); 3D model of a molecule with surface plot (shown with ChemAxon MarvinSpace)
Visit fiehnlab.ucdavis.edu

4 Metabolomics is a truly emerging science...tries to identify all small molecules (< 2000 Da) in all life forms in a comprehensive manner Life Science Tree: Genomics (DNA) Transcriptomics (RNA) Proteomics (Proteins) Metabolomics (Small Molecules)

5 Techniques and tools Analytical techniques (LC-MS, GC-MS, NMR, IR) BioInformatics, Cheminformatics Liquid Chromatography LC-MS Gas Chromatography GC-TOF-MS BioInformatics and Cheminformatics Statistics (Statistica Dataminer) Open Source + commercial software LTQ-FT-MS

6 We use cheminformatics tools for mass spectrometry based structure elucidation
See our BMC Bioinformatics paper: Metabolomic database annotations via query of elemental compositions: Mass accuracy is insufficient even at less than 1 ppm

7 What are JChem and Instant-JChem?
JChem and Instant-JChem are cheminformatics tools for handling small molecule structures together with substance data (logP, fingerprint, pKa, toxicity, meta-information) + searches + filter + web connections and more.
Difference: JChem = complex package and Instant-JChem = one single tool
Pictures: Instant-JChem, JChem (source: ChemAxon)

8 Benchmarking Instant-JChem and JChem Oracle (and more)
Myth 1: JChem+Oracle is faster than Instant-JChem+Apache Derby – Reality: let's see...
Myth 2: JAVA is slow – Reality: It's fast (70% of C++).
Myth 3: Old Intel Xeons (Netburst) are slow – Reality: Yes.
Myth 4: Oracle is a hassle-free and handsome DB for beginners – Reality:
Myth 6: 2 CPUs are better than one – Reality: Yes.
Myth 7: Comparing apples with oranges (in Germany: pears) is unfair - c'mon...
Only the first myth is left.

9 A bit of Oracle Reality
Oracle works; lots of people invested lots of money (ORCL market cap = 92 billion dollars).
It's good for large data (TByte). It's overkill for a small DB.
If you plan to install it on your production workstation (a big No No):
It will eat MB of your valuable RAM (for nothing, on WINXP 32 bit)
It will create 15,049 files in 2,029 folders (for what?)
It will create a lot of hassle with certain network setups (DHCP)
RTFM (read the … manual) is no joke and you need to learn SQL (try the free Aqua Data Studio)
Complete learning will take you 1..2 years, but gives you extreme flexibility
If you plan to install JCHEM + Oracle you need:
JChem (includes cartridge for Oracle)
Oracle
Apache Tomcat
1-2 days time (ChemAxon documentation is good, but too many things can go wrong with Oracle)
Pictures: Happy Oracle Ace (paid 10K for certificate); 1st time Oracle user

10 A bit of Instant JChem Reality v1.0
A) Download
B) Install
C) It runs instantly
Inbuilt Apache Derby DB
JAVA engine included
Complete JChem included
Out-of-the-box tool
Can connect to other DBs

11 Importing Structures into Instant JChem
During import in Instant JChem only one CPU works. The fingerprint calculation is probably not multi-threaded. (Solution: work pool = make a pool for n CPUs)
Short import time is critical for user convenience, but not for long-term database projects.
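
The work-pool idea mentioned on this slide can be sketched as follows. This is only an illustration in Python, not how Instant JChem imports structures: `toy_fingerprint` is a hypothetical stand-in for a real fingerprint calculation, and the pool size simply follows the CPU count reported by the operating system.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def toy_fingerprint(smiles: str, n_bits: int = 64) -> int:
    """Hypothetical placeholder: hash adjacent character pairs of a SMILES string into a bit set."""
    bits = 0
    for a, b in zip(smiles, smiles[1:]):
        bits |= 1 << ((ord(a) * 31 + ord(b)) % n_bits)
    return bits


def fingerprint_all(smiles_list, workers=None):
    """Distribute the per-structure work over a pool with one worker per CPU."""
    workers = workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns the results in input order
        return list(pool.map(toy_fingerprint, smiles_list))


fingerprints = fingerprint_all(["Clc1ccccc1", "O=Cc1ccccc1", "c1ncc2ncnc2n1"])
```

In CPython, threads do not run CPU-bound pure-Python code in parallel, so a real speed-up would need a process pool (or a thread pool on the JVM); the shape of the pattern stays the same.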

12 Importing Structures into Instant-JChem: influence of JAVA hotspot compiler
The JAVA VM runs in two modes: with the client compiler and with the server compiler (directories under JRE).
If you run any calculation-intensive programs, always use server mode; in a batch file call java -server XYZ
Picture labels: Good and fast; Bad and slow

13 Importing Structures into Instant-JChem: influence of JAVA hotspot compiler
Import of 250k structures (NCI99.smi) into Instant-JChem: Server JVM is 20% faster! (chart: lower is better)
Test system: Dual Opteron 254 (2.8 GHz); WINXP-32bit; 2.88 GByte RAM (10 GByte/s transfer rate); ARECA-1120 RAID5 (read/write 200 MByte/s and burst rate 500 MByte/s); QSOFT Ramdisk Enterprise 1.2 GByte (read/write 1 GByte/s transfer)

14 Influence of JAVA hotspot compiler with Instant-JChem
Task: Search for a substructure in a 3 million compound database and calculate the Lipinski Rule of 5 on all the 4632 results.
SMILES: NC1=CC=NC2=C1C=CC(Cl)=C2
Filter: (mass() <= 500) && (logP() <= 5) && (donorCount() <= 5) && (acceptorCount() <= 10) (acceptor count for C and H)
JAVA server mode: 15 seconds (30% faster)
JAVA client mode: 21 seconds
If you want to speed up this query, you need to pre-calculate all descriptors and store them in the database.
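
The filter expression on this slide is a conjunction of four thresholds. A minimal Python sketch of the same logic over pre-calculated descriptors is shown here; the `Descriptors` class and the example values are hypothetical and were not computed with JChem.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    mass: float          # molecular mass in Da
    logp: float          # octanol/water partition coefficient
    donor_count: int     # hydrogen-bond donors
    acceptor_count: int  # hydrogen-bond acceptors


def passes_lipinski(d: Descriptors) -> bool:
    """Apply the same four conditions as the filter expression on the slide."""
    return (d.mass <= 500 and d.logp <= 5
            and d.donor_count <= 5 and d.acceptor_count <= 10)


# Illustrative values only (assumptions, not calculated by JChem)
small = Descriptors(mass=178.6, logp=2.1, donor_count=1, acceptor_count=2)
heavy = Descriptors(mass=812.0, logp=6.3, donor_count=7, acceptor_count=12)
```

With descriptors stored as plain columns like this, the filter becomes a cheap comparison instead of an on-the-fly property calculation, which is the point of pre-calculating them.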

15 Influence of number of CPUs with Instant-JChem
Task: Search for a substructure in a 3 million compound database and calculate the Lipinski Rule of 5 on all the 4632 results
                    2 CPUs        1 CPU
JAVA server mode:   15 seconds    33 seconds
JAVA client mode:   21 seconds    44 seconds
Doing the Lipinski utilizes both CPU cores! Try Intel Quad! Try Opteron 8x!
Test system: Dual Opteron 254 (2.8 GHz); WINXP-32bit; 2.88 GByte RAM (10 GByte/s transfer rate); ARECA-1120 RAID5 (read/write 200 MByte/s and burst rate 500 MByte/s); QSOFT Ramdisk Enterprise 1.2 GByte (read/write 1000 MByte/s transfer)

16 Influence of number of CPUs with Instant-JChem
Task: Search for a substructure in a 3 million compound database and calculate the Lipinski Rule of 5 on all the 4632 results (on the fly)
1 CPU (1x2.8 GHz)*     2 CPUs (1x2.8 GHz)*     8 CPUs** (2 GHz)
33 seconds             15 seconds              4 seconds
Doing the Lipinski utilizes multiple CPU cores! However, a single logP calculation depends on CPU speed, not on the number of CPU cores.
Use AMD Opteron 8xCPU systems (or better). For cheaper setups use Intel Core 2 Quad (QX6700).
Test system*: Dual Opteron 254 (2.8 GHz); WINXP-32bit; 2.88 GByte RAM (10 GByte/s transfer rate)
Test system**: 4 x Dual-Core Opteron GHz; CentOS 64-bit, 32 GByte RAM, 3.5 GB set for JAVA heap space
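
The speed-up implied by these timings can be worked out with a few lines of Python. This is a small sketch using only the three wall-clock times from the slide; the helper name `speedup` is chosen here for illustration.

```python
def speedup(t_reference: float, t_measured: float) -> float:
    """How many times faster the measured run is than the reference run."""
    return t_reference / t_measured


# Wall-clock seconds from the slide, keyed by number of CPUs
timings = {1: 33.0, 2: 15.0, 8: 4.0}

for n_cpus, seconds in timings.items():
    s = speedup(timings[1], seconds)
    print(f"{n_cpus} CPUs: speed-up {s:.2f}, per-CPU efficiency {s / n_cpus:.2f}")
```

The 8-CPU figure comes from a different machine (different clock speed, 64-bit OS, larger heap), so a per-CPU efficiency above 1 reflects the hardware change as well as the extra cores.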

17 Influence of number of CPUs on complex calculations with Instant-JChem
Task: Search in 1000 compounds from PubChem-1000-demo and calculate on the fly:
Columns: Hits; 1 CPU (1x2.8 GHz)*; 2 CPUs (1x2.8 GHz)*; 8 CPUs** (2 GHz)
Bioavailability s 17 s 7.5 s
Ghose filter s 8 s 4.4 s
Lead likeness s 25 s 9.8 s
Lipinski rule of 5 s 7.5 s 4.7 s
Muegge filter s s
Veber filter s s
Test system*: Dual Opteron 254 (2.8 GHz); WINXP-32bit; 2.88 GByte RAM (10 GByte/s transfer rate)
Test system**: 4 x Dual-Core Opteron GHz; CentOS 64-bit, 32 GByte RAM, 3.5 GB set for JAVA heap space
Take home message: The more complex the request, the more CPUs you need. The lead likeness has 7 filters and reaches a 5-8 times speed-up with more CPUs.

18 Scaling complex calculations to larger DBs with Instant-JChem
Task: Now search in 250,000 compounds from NCI2000 and calculate on the fly:
Filter               Hits      Direct query   Calculation on 8 CPUs** (2 GHz)   Extrapolated time from 1000er DB   Obtained speed-up
Bioavailability      227,997   <1 s           380 s                             2055 s                             5
Ghose filter         160,047   <1 s           230 s                             2762 s                             12
Lead likeness        159,656   <1 s           1255 s                            2947 s                             2
Lipinski rule of 5   199,821   <1 s           176 s                             1210 s                             7
Muegge filter        145,234   <1 s           299 s                             1783 s                             6
Veber filter         215,377   <1 s           20 s                              696 s                              35
Test system**: 4 x Dual-Core Opteron GHz; CentOS 64-bit, 32 GByte RAM, 3.5 GB max set for JAVA heap space, 1.5 GByte JAVA heap space used.
Take home message: Do not extrapolate calculation times from different or smaller DBs. The speed-ups here are 2-35 times larger than expected. Pre-calculate values once, store them in the DB and query the values later.
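
The "obtained speed-up" figures on this slide are consistent with dividing the extrapolated time by the measured time and rounding to the nearest integer. A short Python check of that reading, using only the numbers from the slide:

```python
# (measured seconds on 8 CPUs, seconds extrapolated from the 1000-compound DB)
rows = {
    "Bioavailability": (380, 2055),
    "Ghose filter": (230, 2762),
    "Lead likeness": (1255, 2947),
    "Lipinski rule of 5": (176, 1210),
    "Muegge filter": (299, 1783),
    "Veber filter": (20, 696),
}

# Speed-up relative to the naive extrapolation, rounded as on the slide
obtained = {name: round(extrapolated / measured)
            for name, (measured, extrapolated) in rows.items()}

for name, factor in obtained.items():
    print(f"{name}: {factor}x faster than extrapolated")
```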

19 Derby database file sizes for Instant-JChem+Apache Derby
Compounds only:
100k structures     ~30 MByte
1 Mio structures    ~300 MByte
10 Mio structures   ~3 GByte
20 Mio structures   ~6 GByte
If you have dual or quad cores, turn drive compression on. You can save almost 50% space; the speed overhead is low.
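
The sizes listed on this slide scale linearly at roughly 300 bytes per structure (30 MByte divided by 100k structures). A rough Python estimate under that assumption, taking 1 MByte as 10^6 bytes; the constant and function name are illustrative, not part of JChem:

```python
# Derived from the slide: ~30 MByte per 100k structures (compounds only)
BYTES_PER_STRUCTURE = 300


def estimated_db_size_mbyte(n_structures: int) -> float:
    """Linear estimate of the Derby file size in MByte (1 MByte = 10**6 bytes)."""
    return n_structures * BYTES_PER_STRUCTURE / 1_000_000


for n in (100_000, 1_000_000, 10_000_000, 20_000_000):
    print(f"{n:>10,} structures: ~{estimated_db_size_mbyte(n):,.0f} MByte")
```

Additional data columns would add to this figure, since the slide's numbers cover compounds only.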

20 Instant-JChem on disk based and RAMDisk based systems
People who said the OS has efficient disk caching lied. A large RAMDISK can speed up your system extremely.
A) If you have money: buy a Solid State Disk. RAMSAN-400; 128 GByte; Price $252,720; 3,000 MB/s random sustained external throughput.
B) If you have some money: buy a RAID5 card. ARECA ARC-1120 for 8 HDs, Price $ MB/s read and write access
C) If you have little money: buy a RAMDISK and stuff in as much RAM as possible (take a 64-bit OS). MB/s read and write access
...a normal hard drive has ~30-50 MB/s transfer rate

21 Instant-JChem on disk based and RAMdisk based system
A) Heap memory max 800 MByte (OK)
Load 3 Mio compound DB from RAMDISK: 2 seconds
Load 3 Mio compound DB from RAID5 disk: 11 seconds (factor 5)
Search substructure from RAMDISK DB: instant (memory buffered)
Search substructure from RAID5 DB: instant (memory buffered)
B) Heap memory max 200 MByte (too low)
Load 3 Mio compound DB from RAMDISK: 19 seconds
Load 3 Mio compound DB from RAID5 disk: 25 seconds (factor 1.3)
Search substructure from RAMDISK DB: 22 seconds
Search substructure from RAID5 DB: 38 seconds (factor 1.7)
Without enough heap memory, performance degrades: everything must be read from disk. My RAID5 is already extremely fast; still, the RAMDISK is even faster.
Take home message: Give JAVA (JChem) as much heap memory as you can. For 3 million structures you need a minimum of 300 MByte heap space.

22 JChem+Oracle DB on Xeon vs. Instant-JChem+Apache Derby DB on Opteron (apples vs. oranges)
Task: Import and indexing of 3 million compounds (NCI2000 duplicated to 3 Mio)
3 GHz Dual Xeon with 2 GB system memory - JChem+Oracle DB = 5801 seconds (96 minutes)
2.8 GHz Dual Opteron with 2.88 GB memory - Instant-JChem+Apache Derby = 5333 seconds (88 minutes)
Source of Xeon data: Oracle Cartridge Benchmark
Take home message: If you have a (modestly) modern computer, it can handle JChem and Instant-JChem, and a local database can be faster than a remote database.
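
To put the two import times on a common scale, they can be converted to structures per second. A small Python sketch using only the two timings and the 3 million compound count from the slide; the dictionary keys are labels chosen here:

```python
N_COMPOUNDS = 3_000_000

# Import + indexing time in seconds, as reported on the slide
import_seconds = {
    "JChem+Oracle (Dual Xeon)": 5801,
    "Instant-JChem+Derby (Dual Opteron)": 5333,
}


def throughput(n_structures: int, seconds: float) -> float:
    """Average number of structures imported and indexed per second."""
    return n_structures / seconds


for system, seconds in import_seconds.items():
    rate = throughput(N_COMPOUNDS, seconds)
    print(f"{system}: {seconds / 60:.1f} min, ~{rate:.0f} structures/s")
```

The two systems differ in CPU, OS and database location, so the rates describe these particular setups rather than the software alone.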

23 Instant-JChem+Apache Derby DB on Socrates* vs. Instant-JChem+Apache Derby DB on Dual Opteron 2.8 GHz (WIN-XP)** vs. JChem+Oracle DB on Dual Xeon 3 GHz (W2003 Server)*** (more apples vs. oranges)
Task: Search for substructures in a 3 million compound database (NCI2000x12)
Query                                     # Hits    Instant-JChem+Derby*   Instant-JChem+Derby**   JChem+Oracle***
C1CN1c2cnnc3c(cncc23)C4=CSC=C
O=C1ONC(N1c2ccccc2)c3ccccc
[#6]-c1cc(-[#6])nc(NS(=O)(=O)c2ccccc2)n
c1ncc2ncnc2n1                             65,208    2 s                    7 s                     14 s
Clc1ccccc1                                274,608   5 s                    15 s                    43 s
O=Cc1ccccc1                               443,580   9 s                    28 s                    85 s
Take home message: Instant-JChem is fast (nothing more).
Source: Instant-JChem (own system), JChem (ChemAxon website)
Socrates*: 4x Dual Opteron 870 2GHz; CentOS 64-bit, 32 GByte RAM, 4 GB set for JAVA
Opteron**: Dual Opteron 254 (2.8 GHz); WINXP-32bit; 2.88 GByte RAM (10 GByte/s transfer); ARECA-1120 RAID5 (read/write 200 MByte/s and burst rate 500 MByte/s); QSOFT Ramdisk Enterprise 1.2 GByte (read/write 1000 MByte/s transfer)
Xeon***: Dual Intel Xeon 3GHz, 2GB memory, 160GB IDE hard drive; Windows ; Oracle DB buffer 1 GB; 1.5.0_06-b05 Apache Tomcat/5.5.12

24 A 20 million compound DB with Instant-JChem in a local Derby DB (WinXP-32bit)
Aim: Full PubChem data (15-20 Mio) locally
Import is heavily disk dependent: several hundred million read/write operations to disk (JAVA writes in 4 KB chunks)
JAVA heap space used during import is around 600 MByte
Import time is not linear anymore
WIN XP 32-bit + NTFS desperately try to cache the 6 GByte database file, even if there is only 3 GByte memory maximum available (1 GByte max for cache).
Index creation (import smiles): 20 h (too long)
Open index for search: 1 min
Substructure search: > 1 min (too long)
20 Mio is currently too large for Instant-JChem v1.0: use JChem+Oracle (or MySQL, MS SQL)

25 Some general JAVA + JChem speed advice
1. Always use the server JVM (check directory bin\client and bin\server); check the batch or sh file options for JAVA -server xyz xyz.jar
2. Use 64-bit systems; the JAVA maximum heap space for LINUX or WIN as a 32-bit system is only 1.6 GByte (-Xmx1600m)
3. Use only multicore machines (AMD Opterons, Intel Quad)
4. Use the fastest disks you can buy (WD Raptor) or use RAID5 or RAID6 for large files (PubChem SDF data for 5 Mio compounds = 30 GByte)
5. Give Instant-JChem as much memory as you have: minimum 500 MByte for extreme speed (no wait time for searches)
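
The first two points of this advice come down to JVM launch options. The Python sketch below assembles such a command line; `-server`, `-Xmx` (maximum heap) and `-jar` are standard HotSpot options, while `xyz.jar` is the slide's placeholder and the helper function is hypothetical. How Instant-JChem itself is launched and configured is not shown on the slide.

```python
def java_command(jar: str, max_heap_mbyte: int, server: bool = True) -> list:
    """Build a JVM command line with server mode and a maximum heap size."""
    cmd = ["java"]
    if server:
        cmd.append("-server")             # select the server hotspot compiler
    cmd.append(f"-Xmx{max_heap_mbyte}m")  # maximum heap size in MByte
    cmd += ["-jar", jar]
    return cmd


# Placeholder jar name from the slide; 1600 MByte is the 32-bit limit it mentions
print(" ".join(java_command("xyz.jar", 1600)))
```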

26 Let's not forget competitors
Many good systems exist: MDL (ISIS Base), ACDLabs (ACD/ChemFolder Enterprise), Tripos (Sybyl+Auspyx), Molecular Networks (Carol), CDK and Taverna, Accelrys (Accord), Daylight (Thor and Merlin), CambridgeSoft (ChemOffice Enterprise), Molsoft (ICM+MolCart)
Why is ChemAxon better? Two reasons:
The programs work under WINDOWS and LINUX.
ChemAxon has the best and most responsive public forum: criticism is taken seriously, requested features are implemented ASAP, and there is a public response within 1-3 days. WHY? Many commercial licensees. Remember, for academics all free.

27 Results and conclusion: JChem Oracle vs. Instant-JChem
1. Instant-JChem+Derby is as fast as or faster than JChem+Oracle for DBs < 3 Mio
2. If you want to have fun and results at your fingertips: Instant-JChem
3. If you want extreme flexibility and you know JAVA+SQL: JChem-Oracle
4. We are far away from handling billions of structures in a DB (with modest efforts). We will handle such large numbers of structures file-stream based with cluster support.
5. Software producers (in general) need to put more effort into software development for multi-core CPUs + clusters under Windows and LINUX.