9th Benelux Bioinformatics Conference, 09/12/2014.

Presentation transcript:

9th Benelux Bioinformatics Conference, 09/12/2014

Pattern mining of mass spectrometry quality control data
Wout Bittremieux

Mass spectrometry
[Figure: protein sample, protein digestion, peptide sample, peptide separation, then a generalized mass spectrometer (ion source, ion selector, fragmentation, fragment mass analyzer, detector) producing the output spectra]

Quality control metrics
  Derived from experimental data
  Instrument settings

Walzer, M. et al. qcML: An exchange format for quality control metrics from mass spectrometry experiments. Molecular & Cellular Proteomics 13, 1905–1913 (2014).
Bittremieux, W. et al. jqcML: An open-source Java API for mass spectrometry quality control data in the qcML format. Journal of Proteome Research 13, 3484–3487 (2014).
Bittremieux, W. et al. Mass spectrometry quality control through instrument monitoring. In preparation.

Metrics derived from experimental data


Instrument settings


High dimensionality

Previous approaches: Univariate

Previous approaches: Multivariate

Our approach: Subspace clustering
Try to find a suitable subset of the original feature space in which (dis)similar items can be found.
[Table: experiments (rows) by quality control metrics QC 1 to QC 4 (columns)]
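As a rough illustration of why a subset of the feature space can matter, the sketch below compares two experiments in the full space and in a two-metric subspace. The metric values and the names `exp_a`, `exp_b`, and `distance` are invented for this example and are not taken from the talk.

```python
from math import sqrt

def distance(a, b, dims):
    """Euclidean distance between a and b restricted to the given dimensions."""
    return sqrt(sum((a[d] - b[d]) ** 2 for d in dims))

# Hypothetical experiments described by four QC metrics (QC 1 .. QC 4).
exp_a = (0.50, 0.52, 0.10, 0.90)
exp_b = (0.51, 0.50, 0.95, 0.05)

full_space = distance(exp_a, exp_b, range(4))   # all four metrics
subspace = distance(exp_a, exp_b, (0, 1))       # only QC 1 and QC 2

print(f"full space: {full_space:.3f}, subspace (QC 1, QC 2): {subspace:.3f}")
```

The two experiments look dissimilar when all four metrics are used, yet nearly identical on QC 1 and QC 2 alone, which is the kind of relationship a subspace method aims to recover.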


Frequent itemset mining

Aksehirli, E. et al. Cartification: A neighborhood preserving transformation for mining high dimensional data. In 13th IEEE International Conference on Data Mining, 937–942 (2013).
Naulaerts, S. et al. A primer to frequent itemset mining for bioinformatics. Briefings in Bioinformatics (2013).

Cartification Transactions consist of the k nearest neighbors on a single dimension for each item


Cartification
[Figure: k-nearest neighbors in the first dimension (X-axis) and k-nearest neighbors in the second dimension (Y-axis)]
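The transformation described on these slides can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `cartify`, the toy data, the choice to count an item as its own nearest neighbor, and the tie-breaking by index are all assumptions.

```python
def cartify(data, k):
    """Build one transaction per (dimension, item) pair.

    Each transaction holds the indices of the k items closest to the
    center item when only that single dimension is considered.
    Assumption: the center item counts as one of its own k neighbors.
    """
    n_items = len(data)
    n_dims = len(data[0])
    transactions = []
    for dim in range(n_dims):
        values = [row[dim] for row in data]
        for center in range(n_items):
            # Sort items by 1-D distance to the center; ties broken by index.
            order = sorted(
                range(n_items),
                key=lambda j: (abs(values[j] - values[center]), j),
            )
            transactions.append((dim, frozenset(order[:k])))
    return transactions

# Hypothetical 2-D data: items 0-2 are close on the X-axis only.
data = [
    (1.0, 10.0),
    (1.1, 50.0),
    (1.2, 90.0),
    (9.0, 10.5),
    (9.1, 50.5),
]
carts = cartify(data, k=3)
for dim, cart in carts:
    print(dim, sorted(cart))
```

Items that are neighbors on one axis end up in the same transactions for that axis, so sets of items that co-occur often point to a cluster in that dimension.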

Cartification Frequent itemset mining: 4 maximal frequent itemsets with support = 4
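To make the terms "maximal frequent itemset" and "support" concrete, here is a brute-force sketch on a small invented transaction database. It does not reproduce the example from the slide, and the function names `frequent_itemsets` and `maximal` are hypothetical; a practical miner would use a dedicated algorithm rather than enumerating every candidate.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    items = sorted(set().union(*transactions))
    frequent = {}
    for size in range(1, len(items) + 1):
        found = False
        for candidate in combinations(items, size):
            cset = frozenset(candidate)
            # Support: number of transactions containing the whole itemset.
            support = sum(1 for t in transactions if cset <= t)
            if support >= min_support:
                frequent[cset] = support
                found = True
        if not found:
            break  # no frequent itemset of this size, so none of a larger size
    return frequent

def maximal(frequent):
    """Keep only frequent itemsets that have no frequent proper superset."""
    return {
        s: sup for s, sup in frequent.items()
        if not any(s < other for other in frequent)
    }

# Invented transactions; each set lists item identifiers.
transactions = [
    {1, 2, 3}, {1, 2, 3}, {1, 2, 3},
    {3, 4}, {3, 4}, {3, 4},
    {1, 4},
]
result = maximal(frequent_itemsets(transactions, min_support=3))
for itemset, support in result.items():
    print(sorted(itemset), support)
```

With a minimum support of 3, the maximal frequent itemsets in this toy database are {1, 2, 3} and {3, 4}, each with support 3.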

CartiClus
1. Convert the high-dimensional database to a transaction database
2. Mine (maximal) frequent itemsets
3. Convert the itemsets to subspace clusters
4. Redo clustering projected on the detected subspaces (optional)
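Step 3 can be illustrated with a simplified sketch. It assumes that the transactions are kept per dimension and that an itemset's subspace is the set of dimensions in which it is frequent; this is one reading of the slides, not a confirmed description of CartiClus. The function name `subspace_of` and the hand-written carts are hypothetical.

```python
def subspace_of(itemset, carts_by_dim, min_support):
    """List the dimensions in which the itemset reaches min_support.

    Assumption: carts_by_dim maps a dimension name to the transactions
    (sets of experiment indices) obtained for that dimension.
    """
    itemset = frozenset(itemset)
    dims = []
    for dim, carts in sorted(carts_by_dim.items()):
        support = sum(1 for cart in carts if itemset <= cart)
        if support >= min_support:
            dims.append(dim)
    return dims

# Invented transactions for three QC metrics over five experiments (0..4).
carts_by_dim = {
    "QC 1": [{0, 1, 2}, {0, 1, 2}, {0, 1, 2}, {2, 3, 4}, {2, 3, 4}],
    "QC 2": [{0, 1, 3}, {1, 2, 4}, {1, 2, 4}, {0, 1, 3}, {1, 2, 4}],
    "QC 3": [{0, 1, 2}, {0, 1, 2}, {0, 1, 2}, {0, 3, 4}, {0, 3, 4}],
}
cluster = {0, 1, 2}
dims = subspace_of(cluster, carts_by_dim, min_support=3)
print(sorted(cluster), "forms a cluster in subspace", dims)
```

Here experiments 0, 1, and 2 are frequent together in QC 1 and QC 3 but not in QC 2, so under the stated assumption they would be reported as a cluster in the subspace (QC 1, QC 3).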


Results
Detected subspaces:
  Various quartiles of the same metric
  Related metrics: significant overlap with previous manually defined groups of co-occurring metrics
  New relationships between metrics, to be validated using expert knowledge
Detected clusters:
  Highly dependent on projected subspaces
  Able to capture valid relationships between experiments


Conclusion
Different sources of qualitative data:
  Metrics derived from experimental data
  Instrument settings
Subspace clustering to detect patterns in high-dimensional data:
  Univariate insufficient: metrics influence each other
  Multivariate insufficient: global transformation

Conclusion
Cartification:
  Neighborhood-preserving transformation
  Finds relevant subspaces and discards noise
  Fast
Resulting subspace clustering:
  Able to identify relationships between various qualitative metrics
  Clusters experiments exhibiting similar behavior

Acknowledgments
ADReM / biomina: Emin Aksehirli, Bart Cuypers, Aida Mrzic, Stefan Naulaerts, Pieter Meysman, Bart Goethals, Kris Laukens
InSPECtor: Hanny Willems, Lennart Martens, Dirk Valkenborg