C. Furlanello – June 22th, 2006 Annalisa Barla, Bettina Irler, Stefano Merler, Giuseppe Jurman, Silvano Paoli, Cesare Furlanello ITC-irst,

Slides:



Advertisements
Similar presentations
Alexander Statnikov1, Douglas Hardin1,2, Constantin Aliferis1,3
Advertisements

Yinyin Yuan and Chang-Tsun Li Computer Science Department
Yue Han and Lei Yu Binghamton University.
Correlation Aware Feature Selection Annalisa Barla Cesare Furlanello Giuseppe Jurman Stefano Merler Silvano Paoli Berlin – 8/10/2005.
INTRODUCTION We connect, in a complete pipeline, an ontology-based environment for proteomics spectra management with a distributed complete validation.
1 Learning to Detect Objects in Images via a Sparse, Part-Based Representation S. Agarwal, A. Awan and D. Roth IEEE Transactions on Pattern Analysis and.
4 th NETTAB Workshop Camerino, 5 th -7 th September 2004 Alberto Bertoni, Raffaella Folgieri, Giorgio Valentini
Machine Learning techniques for biomarker discovery in proteomic pattern data Elena Marchiori Department of Computer Science Vrije Universiteit Amsterdam.
Diagnosis of Ovarian Cancer Based on Mass Spectrum of Blood Samples Committee: Eugene Fink Lihua Li Dmitry B. Goldgof Hong Tang.
Statistical Learning: Pattern Classification, Prediction, and Control Peter Bartlett August 2002, UC Berkeley CIS.
Previous Lecture: Regression and Correlation
Guidelines on Statistical Analysis and Reporting of DNA Microarray Studies of Clinical Outcome Richard Simon, D.Sc. Chief, Biometric Research Branch National.
Analysis of tandem mass spectra - II Prof. William Stafford Noble GENOME 541 Intro to Computational Molecular Biology.
Ensemble Learning (2), Tree and Forest
Proteomics Informatics Workshop Part III: Protein Quantitation
Expression profiling of peripheral blood cells for early detection of breast cancer Introduction Early detection of breast cancer is a key to successful.
Proteomics Informatics – Molecular signatures (Week 11)
A Multivariate Biomarker for Parkinson’s Disease M. Coakley, G. Crocetti, P. Dressner, W. Kellum, T. Lamin The Michael L. Gargano 12 th Annual Research.
INTRODUCTION GOAL: to provide novel types of interaction between classification systems and MIAME-compliant databases We present a prototype module aimed.
A Significance Test-Based Feature Selection Method for the Detection of Prostate Cancer from Proteomic Patterns M.A.Sc. Candidate: Qianren (Tim) Xu The.
Whole Genome Expression Analysis
Gene Expression Profiling Illustrated Using BRB-ArrayTools.
Prediction model building and feature selection with SVM in breast cancer diagnosis Cheng-Lung Huang, Hung-Chang Liao, Mu- Chen Chen Expert Systems with.
Metrological Experiments in Biomarker Development (Mass Spectrometry—Statistical Issues) Walter Liggett Statistical Engineering Division Peter Barker Biotechnology.
CSCE555 Bioinformatics Lecture 16 Identifying Differentially Expressed Genes from microarray data Meeting: MW 4:00PM-5:15PM SWGN2A21 Instructor: Dr. Jianjun.
INF380 - Proteomics-61 INF380 – Proteomics Chapter 6 – Mass Spectrometry – MALDI TOF The MALDI-TOF instruments are the simplest MS instruments suitable.
The Broad Institute of MIT and Harvard Classification / Prediction.
Reanalysis of Petricoin et al. Ovarian Cancer Data Set 3 Russ Wolfinger and Geoff Mann SAS Institute Inc. NISS Proteomics Workshop March 6, 2003.
INFSO-RI Enabling Grids for E-sciencE BioDCV: a grid-enabled complete validation setup for functional profiling EGEE User Forum.
Laxman Yetukuri T : Modeling of Proteomics Data
1 CS 391L: Machine Learning: Experimental Evaluation Raymond J. Mooney University of Texas at Austin.
Bioinformatics Expression profiling and functional genomics Part II: Differential expression Ad 27/11/2006.
Prediction of Molecular Bioactivity for Drug Design Experiences from the KDD Cup 2001 competition Sunita Sarawagi, IITB
Classification of microarray samples Tim Beißbarth Mini-Group Meeting
Previous Lecture: Bioimage Informatics Agullo-Pascual E, Reid DA, Keegan S, Sidhu M, Fenyö D, Rothenberg E, Delmar M, "Super-resolution fluorescence microscopy.
Ensembles. Ensemble Methods l Construct a set of classifiers from training data l Predict class label of previously unseen records by aggregating predictions.
MS Calibration for Protein Profiles We need calibration for –Accurate mass value Mass error: (Measured Mass – Theoretical Mass) X 10 6 ppm Theoretical.
Exploring Alternative Splicing Features using Support Vector Machines Feature for Alternative Splicing Alternative splicing is a mechanism for generating.
High throughput Protein Measurement Techniques Harin Kanani.
Evaluating Results of Learning Blaž Zupan
SVM-based techniques for biomarker discovery in proteomic pattern data Elena Marchiori Department of Computer Science Vrije Universiteit Amsterdam.
Class 23, 2001 CBCl/AI MIT Bioinformatics Applications and Feature Selection for SVMs S. Mukherjee.
Classification (slides adapted from Rob Schapire) Eran Segal Weizmann Institute.
Introduction to Microarrays Kellie J. Archer, Ph.D. Assistant Professor Department of Biostatistics
Innovative Paths to Better Medicines Design Considerations in Molecular Biomarker Discovery Studies Doris Damian and Robert McBurney June 6, 2007.
Guest lecture: Feature Selection Alan Qi Dec 2, 2004.
Consensus Group Stable Feature Selection
Introduction to Biostatistics and Bioinformatics Experimental Design.
Cluster validation Integration ICES Bioinformatics.
Comp. Genomics Recitation 10 4/7/09 Differential expression detection.
Molecular Classification of Cancer Class Discovery and Class Prediction by Gene Expression Monitoring.
Chapter 20 Classification and Estimation Classification – Feature selection Good feature have four characteristics: –Discrimination. Features.
Enabling Grids for E-sciencE ITC-irst for NA4 biomed meeting at EGEE conference: Ginevra 2006 BioDCV - Features 1.Application for analysis of microarray.
Classification of real and pseudo microRNA precursors using local structure-sequence features and support vector machine 朱林娇 14S
Notes on HW 1 grading I gave full credit as long as you gave a description, confusion matrix, and working code Many people’s descriptions were quite short.
Computational Biology Group. Class prediction of tumor samples Supervised Clustering Detection of Subgroups in a Class.
Final Report (30% final score) Bin Liu, PhD, Associate Professor.
Statistical Analysis for Expression Experiments Heather Adams BeeSpace Doctoral Forum Thursday May 21, 2009.
Tree and Forest Classification and Regression Tree Bagging of trees Boosting trees Random Forest.
Tutorial on "GRID Computing“ EMBnet Conference 2008 CNR - ITB GRID distribution supporting chaotic map clustering on large mixed microarray.
Neural networks (2) Reminder Avoiding overfitting Deep neural network Brief summary of supervised learning methods.
Serum Diagnosis of Chronic Fatigue Syndrome (CFS) Using Array-based Proteomics Pingzhao Hu W Le, S Lim, B Xing, CMT Greenwood and J Beyene Hospital for.
Fig. 1. proFIA approach for peak detection and quantification
Proteomics Informatics David Fenyő
Softberry Mass Spectra (SMS) processing tools
Model generalization Brief summary of methods
Pierre P. Massion, MD, Richard M. Caprioli, PhD 
Single Sample Expression-Anchored Mechanisms Predict Survival in Head and Neck Cancer Yang et al Presented by Yves A. Lussier MD PhD The University.
Proteomics Informatics David Fenyő
Advisor: Dr.vahidipour Zahra salimian Shaghayegh jalali Dec 2017
Presentation transcript:

C. Furlanello – June 22th, Annalisa Barla, Bettina Irler, Stefano Merler, Giuseppe Jurman, Silvano Paoli, Cesare Furlanello ITC-irst, Trento, Italy Proteome profiling without selection bias

C. Furlanello – Trento – April 4, 2006 Topics Biomarkers and Predictive classification Biomarkers and Predictive classification Prediction and Bias Prediction and Bias Biomarker Ranking algorithms with Support Vector Machines (kernel methods) Biomarker Ranking algorithms with Support Vector Machines (kernel methods) The Complete Validation Platform (BioDCV) The Complete Validation Platform (BioDCV) Pipeline of Preprocessing Procedures Pipeline of Preprocessing Procedures Grid application (EGEE-Biomed VO) Grid application (EGEE-Biomed VO) Experiments Experiments Cromwell MALDI-TOF simulated data Cromwell MALDI-TOF simulated data SELDI-TOF Ovarian cancer (NCIFDAProteomics) SELDI-TOF Ovarian cancer (NCIFDAProteomics) MALDI-TOF Ovarian cancer (Keck Labs) MALDI-TOF Ovarian cancer (Keck Labs) Biomarkers and Predictive classification Biomarkers and Predictive classification Prediction and Bias Prediction and Bias Biomarker Ranking algorithms with Support Vector Machines (kernel methods) Biomarker Ranking algorithms with Support Vector Machines (kernel methods) The Complete Validation Platform (BioDCV) The Complete Validation Platform (BioDCV) Pipeline of Preprocessing Procedures Pipeline of Preprocessing Procedures Grid application (EGEE-Biomed VO) Grid application (EGEE-Biomed VO) Experiments Experiments Cromwell MALDI-TOF simulated data Cromwell MALDI-TOF simulated data SELDI-TOF Ovarian cancer (NCIFDAProteomics) SELDI-TOF Ovarian cancer (NCIFDAProteomics) MALDI-TOF Ovarian cancer (Keck Labs) MALDI-TOF Ovarian cancer (Keck Labs)

C. Furlanello – Trento – April 4, 2006 Prediction and Bias  Bias: on data preparation, preprocessing (complex!), classification  E Petricoin, A Ardekani, B Hitt, P Levine, B Fusaro, S Steinberg, G Mills, C Simone, D Fishman, E Kohn, and L Liotta. Use of proteomic patterns in serum to identify ovarian cancer. Lancet, 359: ,  K Baggerly, J Morris, and K Coombes. Reproducibility of SELDI-TOF protein patterns in serum: comparing datasets from different experiments. Bioinformatics, 20(5): ,  Controversy: J Natl Cancer Inst 2005; 97  K Baggerly, JS Morris, SR Edmonson, KR Coombes. Signal in Noise: Evaluating Reported Reproducibility of Serum Proteomic Tests for Ovarian Cancer  LA Liotta, M Lowenthal, A Mehta, TP Conrads, TD Veenstra, DA Fishman, EFIII Petricoin. Importance of Communication Between Producers and Consumers of Publicly Available Experimental Data  DF Ransohoff. Lessons from Controversy: Ovarian Cancer Screening and Serum Proteomics  DF. Ransohoff. Bias as a threat to the validity of cancer molecular-marker research. Nature, 5: , 2005  DF. Ransohoff. Bias as a threat to the validity of cancer molecular-marker research. Nature, 5: , 2005.

The selection bias problem (*): similar results of near perfect classification with few genes published in PNAS, Machine Learning, Genome Research, BMC Bioinformatics, etc. (*) Colon cancer data: 2000 genes, 62 tissues 22 normal and 40 tumor cases (Alon et. al,1999) Pervasive in the first years of microarray classification studies: use CV to evaluate models, pick up best probes, compute again expected error with CV … A zero error (CV) may be obtained with only 8 genes (*). But when repeating the experiment after a label randomization, a very similar result is reached: 14 genes are sufficient to get a zero error estimate. The same effect can be reproduced with no-information datasets !! METHODOLOGY - Ambroise & McLachlan, Simon et. al Furlanello et. al 2003

COMPLETE VALIDATION To avoid selection bias (p>>n): * externally a stratified random partitioning, internally a model selection based on a K-fold cross-validation high computational costs due to replicates ( models)** ** Binary classification, on a genes x 45 cDNA array, 400 loops * Ambroise & McLachlan, 2002, Simon et. al 2003, Furlanello et. al 2003 OFS-M: Model tuning and Feature ranking ONF: Optimal gene panel estimator ATE: Average Test Error

C. Furlanello – Trento – April 4, 2006 E-RFE with SVM Classifiers ACCELERATIONS Parametrics Sqrt–RFE Decimation-RFE Non-Parametrics E–RFE: adapting to weight distribution by thresholding the SVM weights at w* features steps Chunk 1 … … 1-step RFE 20K MODEL: Support Vector Machines (SVM) RANKING  SELECTION Recursive Feature Elimination (RFE): a stepwise backward selection procedure. At each step, eliminate the “least interesting variable” and retrain

C. Furlanello – June 22th, 2006 High Throughput Computing Complete validation on cluster/ Grid BIODCV for National Institute of Cancer 1.COMPLETE VALIDATION CURES THE SELECTION BIAS 2.Computational solutions: Clusters and GRID 3.The by-products of complete validation Accuracy estimates Set of lists: aggregation stability k25 - kt50 mult G-221 G-2040 G-3800 G-3567 G-3275 G-795 G-238 G-867 G-832 G-2957 G-2290 G-3376 G-1408 G-3098 G-2599 G-1171 G-467 G-338 G-2267 G-676 G-1303 G-182 G-1532 G-164 G-317 G-934 G-1786 G-3902 G-183 G-534 G-451 G-6 G-2866 G-1913 G-1682 G-1346 G-1371 G-3280 G-239 G-933 G-1031 G-3140 G-1256 G-1722 G-80 G-3566 G-72 G-384 G-1059 G-1704 Mutual genelist distance Canberra distance Density SS1 SS2 PREVIOUS WORK ON MICROARRAY DATA Neural Networks BMC Bioinformatics IEEE Trans. Signal Processing IEEE Trans. Comp. Biology and Bioinformatic s Int. J. of Cancer

Harnessing Bias Data generation Preprocessing Complete Validation Biomarkers identification ▪integrate a pipeline for proteomic data preprocessing with the BioDCV complete validation process

Preprocessing 2 ppc: another R package – features defined by cluster centroids’ location R. Tibshirani et al. Sample classification from protein mass spectrometry, by “peak probability contrasts” Bioinformatics 20(17): , 2004 BatchSingle spectrum (ms Standardization) Peak Assignment 2 Centroid Identification 2 Peak Extraction 2 Normalization (A.U.C.) Baseline subtraction 1 1 PROcess: an R package -- lowess for baseline subtraction X. Li A package for processing protein mass spectrometry data. Not discussed here: calibration, filtering

Complete Validation for Proteomics GIVEN mz-ms data: spectra in a standardized mass spectrometry format; a binary label for each spectrum (e.g. +1/-1) FIND: Biomarkers valid on novel data & classification error estimates preprocessing preprocessing parameters Training: (development data) New data (test) Class. Model parameters Predictor (for new data) PREDICTED ACCURACY ESTIMATES Test Accuracy classification with complete validation Across sampleWithin sample (Standardization) Peak assignment Peak clustering Peak extraction AUC normalization Baseline Subtraction Within sample BIOMARKERS 1 ?

Methodology  H=5 splits (80%-20% or 50%-50%)  B=400 replicates  Preprocessing –AP0: non-batch –AP1: batch  Random labels exps. Goal: study biomarker selection by complete validation setup PROTEOMIC DATA TR 1 TS 1 TR 2 TS 2 TR 5 TS 5 Predicted Accuracy Estimates BIOMARKERS Test Accuracy … Predicted Accuracy Estimates BIOMARKERS Test Accuracy … classification with complete validation Within sample Across sample Within sample preprocessing Predicted Accuracy Estimates BIOMARKERS Test Accuracy

Datasets  Simulated MALDI-TOF data (Cromwell’s): 4 datasets at increasing levels of noise: ε=N(0, σ) σ=(0,10, 50, 300) tot #class 1 class -1 #m/z (100Da < m/z < 20000Da)  Ovarian 8/7/02 (SELDI-TOF)*  Ovarian ’05 (MALDI - TOF) tot #cancercontrol#m/z (0Da < m/z < 20000Da) tot #cancercontrol#m/z reflectron#m/z linear (3450Da < m/z < 28000Da) Nat. Ovarian Cancer Early Detection Program Northwestern Univ. Hospital Micromass Keck Lab Yale (Wu et al 2005) * Technical and experimental design of this dataset were questioned.

Simulated Data (I) Configuration A.Default parameters: n Voltage between plates (20 KV) n Length of drift tube (L=1 m) n Distance between charged grids (8 mm) n Standard deviation on initial particles’ velocity (50) B.Defined Parameters: n Peak sites (chosen from a real dataset) n Peak intensity (max no. of a set of particles) n Standard deviation on noise over intensity Software: v 2.0 in R, from S-Plus code K.R.Coombes et al. Understanding the characteristics of mass spectrometry data through the use of simulation Cancer Informatics 2005:1(1) Hypotheses Different peak intensity at a panel of m/z locations discriminates the two classes A “band” structure Cromwell: a proteomic MALDI-TOF simulation engine Sample plate Detector grids Drift region D1D1 D2D2 Flight path

Simulated Data (II) Design: the 2 classes are discriminated by peak intensities in bands B1 and B3, but no discriminations in B2 and B Intensity classPeak Intensity [Number of Peaks] B1B2B3B [7]7000 [7]5000 [7]1000 [60] 5000 [7] 7000 [7]10000 [7]1000 [60] σ=(0,10,50,300) M/Z Intensity M/Z Intensity

Results: Simulated Data σ=0 81 peaks detected σ=10 σ=50 Two extra non-valid peaks identified in the preprocessing phase (due to noise) After the BioDCV procedure one was rejected σ=300 The actual 13 discriminant peaks were found among the most significant features extracted A list stability indicator showed that the number of relevant variables over all run is exactly 13 Note: the 4 synthetic MALDI-TOF datasets were built each with a total of 14 discriminant peaks, but our preprocessing procedure detected only 13 of them since the first one is located too close to the inf of spectrum border. PREPROCESSING PIPELINE RESULTS COMPLETE VALIDATION RESULTS

Results: SELDI-TOF Ovarian 8/7/02 The first and the second most relevant peaks for BioDCV classification in all the sublists of the dataset confirm previous studies (and their concerns) K. Baggerly et al. Reproducibility of SELDI-TOF protein patterns in serum: comparing datasets from different experiments. Bioinformatics, 20(5): ,2004. W. Zhu et al. Detection of cancer-specific markers amid massive mass spectral data. PNAS 100(25): , December AVG Error on blind test set (5 features): ~3% Random labels: ATE on blind test=41.1% (CI 34.6, 56.8) No Info = 36% Biomarker analysis AP0: BioDCV ATE with Indip. preproc AP1: Batch. preproc Blind test set error

Results: MALDI-TOF Keck Lab The first and the fifth most relevant peaks in the Keck Lab dataset AVG cancer AVG control 95% cancer 95% control Results compliant with: Baolin Wu et al. Ovarian cancer classification based on mass spectrometry analysis of sera. Cancer Informatics, AVG Error (AP0 mode) test set (14 features): 32.5% (CI 32.1,32.7) AVG Error (AP0 mode) test set (all features): 24.5% AVG Error (AP1 mode) test set (14 features): 25.7% Random labels: ATE on blind test=49.1% No Info=45.3% Biomarker analysis:

Enabling Grids for E-sciencE BioDCV SubVersion Repository C. Furlanello, M. Serafini, S. Merler and G. Jurman. Semi-supervised learning for molecular profiling. IEEE Trans. Comp. Biology and Bioinformatics, 2(2): , More on Windows native version available

Enabling Grids for E-sciencE Grid Computing for Proteomics 1.IEEE CBMS 2006: series of experiments on proteomics data standard complete validation analysis random labels analysis 2.A strict deadline for the final version Solution: We used the EGEE Biomed grid infrastructure 20 cpus/job, for a total of jobs BioDCV jobs were run on many Biomed Sites in all Europe

Enabling Grids for E-sciencE Grid implementation The BioDCV system 2-50 MB MB grid-ftp scp grid-ftp scp Commands: 1.grid-url-copy/lcg-cp db from local to SE 2.edg-job-submit BioDCV.jdl 3.grid-url-copy/lcg-cp db from SE to local

Enabling Grids for E-sciencE BioDCV jobs on CEs in Europe BioDCV jobs was run on these Biomed’s CEs in Europe: Destination: mars-ce.mars.lesc.doc.ic.ac.uk:2119/jobmanager-sge-3hr Destination: cluster.pnpi.nw.ru:2119/jobmanager-pbs-biomed Destination: helmsley.dur.scotgrid.ac.uk:2119/jobmanager-lcgpbs-biomed Destination: ce101.grid.ucy.ac.cy:2119/jobmanager-lcgpbs-biomed Destination: grid-ce.ii.edu.mk:2119/jobmanager-lcgpbs-biomed Destination: ce01.kallisto.hellasgrid.gr:2119/jobmanager-pbs-biomed Destination: mars-ce.mars.lesc.doc.ic.ac.uk:2119/jobmanager-sge-3hr Destination: lcgce01.gridpp.rl.ac.uk:2119/jobmanager-lcgpbs-bioL Destination: ce01.ariagni.hellasgrid.gr:2119/jobmanager-pbs-biomed Destination: lcg-ce.its.uiowa.edu:2119/jobmanager-lcgpbs-biomed Destination: lcgce01.gridpp.rl.ac.uk:2119/jobmanager-lcgpbs-bioL Destination: ce01.grid.acad.bg:2119/jobmanager-lcgpbs-biomed Destination: mu6.matrix.sara.nl:2119/jobmanager-pbs-short Destination: cluster.pnpi.nw.ru:2119/jobmanager-pbs-biomed Destination: mars-ce.mars.lesc.doc.ic.ac.uk:2119/jobmanager-sge-3hr Destination: gridba2.ba.infn.it:2119/jobmanager-lcgpbs-long Destination: ce1.pp.rhul.ac.uk:2119/jobmanager-pbs-biomedgrid Destination: cluster.pnpi.nw.ru:2119/jobmanager-pbs-biomed Destination: fal-pygrid-18.lancs.ac.uk:2119/jobmanager-lcgpbs-biomed Destination: grid012.ct.infn.it:2119/jobmanager-lcglsf-short Destination: ce1.pp.rhul.ac.uk:2119/jobmanager-pbs-biomedgrid Destination: ce01.grid.acad.bg:2119/jobmanager-lcgpbs-biomed Destination: grid-ce.ii.edu.mk:2119/jobmanager-lcgpbs-biomed Destination: grid0.fe.infn.it:2119/jobmanager-lcgpbs-grid Destination: grid012.ct.infn.it:2119/jobmanager-lcglsf-short Destination: ce01.ariagni.hellasgrid.gr:2119/jobmanager-pbs-biomed Destination: grid0.fe.infn.it:2119/jobmanager-lcgpbs-grid Destination: mars-ce.mars.lesc.doc.ic.ac.uk:2119/jobmanager-sge-12hr Destination: epgce1.ph.bham.ac.uk:2119/jobmanager-lcgpbs-biomed Destination: ramses.dsic.upv.es:2119/jobmanager-pbs-biomedg Destination: t2ce02.physics.ox.ac.uk:2119/jobmanager-lcgpbs-biomed Destination: ce1.pp.rhul.ac.uk:2119/jobmanager-pbs-biomedgrid Destination: ce01.kallisto.hellasgrid.gr:2119/jobmanager-pbs-biomed Destination: grid10.lal.in2p3.fr:2119/jobmanager-pbs-biomed Destination: mars-ce.mars.lesc.doc.ic.ac.uk:2119/jobmanager-sge-6hr Destination: ce01.grid.acad.bg:2119/jobmanager-lcgpbs-biomed Destination: scaicl0.scai.fraunhofer.de:2119/jobmanager-lcgpbs-biomed Destination: mars-ce.mars.lesc.doc.ic.ac.uk:2119/jobmanager-sge-12hr Destination: grid-ce.ii.edu.mk:2119/jobmanager-lcgpbs-biomed Destination: gw39.hep.ph.ic.ac.uk:2119/jobmanager-lcgpbs-biomed Destination: grid0.fe.infn.it:2119/jobmanager-lcgpbs-grid Destination: mars-ce.mars.lesc.doc.ic.ac.uk:2119/jobmanager-sge-12hr Destination: grid10.lal.in2p3.fr:2119/jobmanager-pbs-biomed Destination: t2ce02.physics.ox.ac.uk:2119/jobmanager-lcgpbs-biomed Destination: prod-ce-01.pd.infn.it:2119/jobmanager-lcglsf-grid Destination: ce01.grid.acad.bg:2119/jobmanager-lcgpbs-biomed Destination: testbed001.grid.ici.ro:2119/jobmanager-lcgpbs-biomed Destination: mars-ce.mars.lesc.doc.ic.ac.uk:2119/jobmanager-sge-3hr Destination: lcg06.sinp.msu.ru:2119/jobmanager-lcgpbs-biomed Destination: ce01.isabella.grnet.gr:2119/jobmanager-pbs-biomed Destination: ce2.egee.cesga.es:2119/jobmanager-lcgpbs-biomed Destination: obsauvergridce01.univ-bpclermont.fr:2119/jobmanager-lcgpbs-biomed Destination: dgc-grid-40.brunel.ac.uk:2119/jobmanager-lcgpbs-short Destination: testbed001.grid.ici.ro:2119/jobmanager-lcgpbs-biomed Destination: ce01.isabella.grnet.gr:2119/jobmanager-pbs-biomed Destination: ce01.ariagni.hellasgrid.gr:2119/jobmanager-pbs-biomed Destination: obsauvergridce01.univ-bpclermont.fr:2119/jobmanager-lcgpbs-biomed Destination: ce01.marie.hellasgrid.gr:2119/jobmanager-pbs-biomed Destination: ce01.pic.es:2119/jobmanager-lcgpbs-biomed Destination: t2ce02.physics.ox.ac.uk:2119/jobmanager-lcgpbs-biomed Destination: ce01.kallisto.hellasgrid.gr:2119/jobmanager-pbs-biomed Destination: mu6.matrix.sara.nl:2119/jobmanager-pbs-short Destination: mars-ce.mars.lesc.doc.ic.ac.uk:2119/jobmanager-sge-12hr Destination: ce01.marie.hellasgrid.gr:2119/jobmanager-pbs-biomed Destination: mallarme.cnb.uam.es:2119/jobmanager-pbs-biomed Destination: ce01.grid.acad.bg:2119/jobmanager-lcgpbs-biomed Destination: clrlcgce03.in2p3.fr:2119/jobmanager-lcgpbs-biomed Destination: grid001.ics.forth.gr:2119/jobmanager-lcgpbs-biomed And 50 more sites …. Production based on benchmarks

C. Furlanello – Trento – April 4, 2006 High-Throughput: Summary Predictive profiling for high-throughput proteomics Predictive profiling for high-throughput proteomics Selection Bias Selection Bias Computational procedures for complete validation (BioDCV) Computational procedures for complete validation (BioDCV) Biomarker Lists: reproducibility, stability, correlation Biomarker Lists: reproducibility, stability, correlation Modify machine learning algorithm to directly link selection to target functions (new kernel methods, or maybe simpler classifiers) Modify machine learning algorithm to directly link selection to target functions (new kernel methods, or maybe simpler classifiers) Consider the problem of batch preprocessing for true reproducibility Consider the problem of batch preprocessing for true reproducibility Use simulator to tune systems Use ensemble methods to achieve stability of selected lists Use simulator to tune systems Use ensemble methods to achieve stability of selected lists Applications Applications Grid Computing as a viable resource for prediction with Mass spectrometry (SELDI-TOF, MALDI-TOF) Grid Computing as a viable resource for prediction with Mass spectrometry (SELDI-TOF, MALDI-TOF) Predictive profiling for high-throughput proteomics Predictive profiling for high-throughput proteomics Selection Bias Selection Bias Computational procedures for complete validation (BioDCV) Computational procedures for complete validation (BioDCV) Biomarker Lists: reproducibility, stability, correlation Biomarker Lists: reproducibility, stability, correlation Modify machine learning algorithm to directly link selection to target functions (new kernel methods, or maybe simpler classifiers) Modify machine learning algorithm to directly link selection to target functions (new kernel methods, or maybe simpler classifiers) Consider the problem of batch preprocessing for true reproducibility Consider the problem of batch preprocessing for true reproducibility Use simulator to tune systems Use ensemble methods to achieve stability of selected lists Use simulator to tune systems Use ensemble methods to achieve stability of selected lists Applications Applications Grid Computing as a viable resource for prediction with Mass spectrometry (SELDI-TOF, MALDI-TOF) Grid Computing as a viable resource for prediction with Mass spectrometry (SELDI-TOF, MALDI-TOF)

Details

Preprocessing – peak identification Yasui 2003 Tibshirani 2004

Using peaks across multiple spectra can generate thousands of features. Using peaks across multiple spectra can generate thousands of features. The number of examples required to learn a “reasonable” hypothesis increases exponentially with the number of features. The number of examples required to learn a “reasonable” hypothesis increases exponentially with the number of features. Clustering reduces these features and has a rough correction for spectrometer resolution or drift of m/z between spectra. Clustering reduces these features and has a rough correction for spectrometer resolution or drift of m/z between spectra. Extracting Common Features

Peak Alignment - Clustering (Tibshirani 2004) Complete hierarchical clustering on log(m/z) axis over all spectra Complete hierarchical clustering on log(m/z) axis over all spectra Build a dendrogram Build a dendrogram Cut at treshold T  induces centroids position Cut at treshold T  induces centroids position

Spectra with extracted centroids

Error Curve vs. Features number Blind test set Sim. Preproc spectra Ind. Preproc spectra

Top discriminant peaks Control Cancer 95% confidence