Predictive Analysis of Gene Expression Data from Human SAGE Libraries
Alexessander Alves*, Nikolay Zagoruiko+, Oleg Okun§, Olga Kutnenko+, Irina Borisova+
* University of Porto, Portugal; + Russian Academy of Sciences, Russia; § University of Oulu, Finland

Outline 1. Goals 2. Background 3. SAGE Data 4. Gene Expression Data 5. Feature Selection 6. GRAD 7. Experiments 8. Conclusions

Goal Predictive analysis: feature selection methods from bioinformatics and machine learning, applied to cancer classification.

Background Genes encode proteins and other large biomolecules. Genes are expressed in a two-step process (the Central Dogma of Biology): 1. transcription into an RNA sequence; 2. translation into a protein. Several technologies measure transcription: SAGE, microarrays, … (Molla et al., 2003)

SAGE Data Advantages: samples can be compared across different organs and patients (no normalisation required), and SAGE collects the complete gene expression profile of a cell or tissue without prior knowledge of the mRNAs to be profiled.

SAGE Data Drawbacks: collecting data with the SAGE method is very expensive; as a consequence, very few examples are available.

Gene Expression Data Challenges posed to machine learning: the number of genes dramatically exceeds the number of examples! This brings the curse of dimensionality (the sample is not dense enough to estimate the model accurately) and over-fitting (a higher probability of finding chance relationships among data attributes).
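To make the over-fitting risk concrete, here is a small, purely illustrative Python simulation (the 74x822 shape mirrors the dataset used later; none of this code is from the original work): with far more noise features than samples, some feature appears to discriminate the classes by chance.

```python
# Illustration of chance relationships in high dimensions: 822 pure-noise
# "genes" for 74 samples (50 cancer / 24 normal, as in the dataset below).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(74, 822))        # noise only, no real signal
y = np.array([1] * 50 + [0] * 24)     # class labels

# Training accuracy of the best single noise feature, split at its median.
accs = [max(np.mean((x > np.median(x)) == y),
            np.mean((x <= np.median(x)) == y)) for x in X.T]
print(f"best single-feature accuracy on noise: {max(accs):.2f}")
# Typically well above the 0.5 expected for any one noise feature.
```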

Feature Selection Remove irrelevant and redundant genes. Methods: Wrapper: fit a classifier to a subset of the data and use classification accuracy to drive the search for relevant genes (e.g. C4.5 accuracy). Filtering: use a function to assess the goodness of a subset of genes (e.g. Euclidean distance, entropy, correlation, …). Problem complexity is O(2^n), with n the number of genes; even for the smaller dataset (n = 822), there are 2^822 ≈ 2.8x10^247 subsets, intractable with a simple exhaustive search.
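As a concrete, hypothetical sketch of the two evaluation styles, assuming a NumPy matrix X of shape samples x genes and binary labels y, with a scikit-learn decision tree standing in for C4.5 (the helper names and the centroid-distance filter are illustrative choices, not from the paper):

```python
# Sketch of the two subset-evaluation criteria named above.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

def wrapper_score(X, y, subset):
    """Wrapper: cross-validated accuracy of a classifier fitted on the subset."""
    clf = DecisionTreeClassifier(random_state=0)  # stand-in for C4.5
    return cross_val_score(clf, X[:, subset], y, cv=5).mean()

def filter_score(X, y, subset):
    """Filter: Euclidean distance between class centroids in the subset's
    subspace; no classifier is trained."""
    mu0 = X[y == 0][:, subset].mean(axis=0)
    mu1 = X[y == 1][:, subset].mean(axis=0)
    return float(np.linalg.norm(mu0 - mu1))
```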

Gene Selection in Bioinformatics Filtering is usually preferred because it is computationally less expensive. Several classification studies select genes with the Wilcoxon test or t-test, and additionally remove genes with low entropy, low variability, or low absolute expression level. Cons: redundancy; interdependency-unaware.
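A minimal sketch of this univariate filtering recipe, using SciPy's Wilcoxon rank-sum test; all thresholds are illustrative assumptions, not values from the paper:

```python
# Univariate gene filtering: drop genes with low absolute expression or
# low variability, then keep genes that differ between classes by a
# Wilcoxon rank-sum test. Thresholds here are made up for illustration.
import numpy as np
from scipy.stats import ranksums

def filter_genes(X, y, p_cutoff=0.01, min_mean=1.0, min_var=0.1):
    keep = []
    for j in range(X.shape[1]):
        g = X[:, j]
        if g.mean() < min_mean or g.var() < min_var:
            continue                      # low expression or low variability
        _, p = ranksums(g[y == 1], g[y == 0])
        if p < p_cutoff:
            keep.append(j)
    # Each gene is scored on its own, so redundancy and interdependency
    # among genes are ignored: exactly the cons listed above.
    return keep
```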

Our Proposals Study bioinformatics filtering techniques and compare them with machine learning algorithms; avoid redundancy; consider interdependency and weakly expressed genes; introduce a new filtering algorithm: GRAD.

GRAD Search Strategy 1. Use exhaustive search to form informative groups of attributes ("granules"). 2. Use AdDel to choose subsets of granules. AdDel is a combination of forward sequential search (FSS) and backward sequential search (BSS); the number of attributes to include in a subset is estimated by the algorithm.
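The slides describe AdDel only as a combination of FSS and BSS, so the following is a speculative sketch of that idea rather than the authors' exact algorithm; `score` can be any subset criterion, such as the wrapper or filter scores sketched earlier.

```python
# Speculative AdDel-style search: alternate greedy forward additions
# with backward deletions driven by a user-supplied score(subset).
def addel(candidates, score, n_add=3, n_del=1):
    selected, best = [], float("-inf")
    improved = True
    while improved:
        improved = False
        # "Ad" phase (FSS): greedily add the items that most improve the score.
        for _ in range(n_add):
            gains = [(score(selected + [c]), c)
                     for c in candidates if c not in selected]
            if not gains:
                break
            s, c = max(gains)
            if s > best:
                selected.append(c)
                best, improved = s, True
        # "Del" phase (BSS): drop items whose removal does not hurt the score.
        for _ in range(n_del):
            if len(selected) < 2:
                break
            losses = [(score([f for f in selected if f != c]), c)
                      for c in selected]
            s, c = max(losses)
            if s >= best:
                selected.remove(c)
                best, improved = s, True
    return selected
```

Because the deletion phase can discard attributes at no cost to the criterion, the selected subset stops growing once additional attributes stop helping, which matches the quality-curve behaviour described near the end of these slides.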

GRAD Algorithm The relevance measure is based on the distances to the closest neighbors, one from each class.

P0: x1, x2, …, xn (initial set of features)
Formation of granules:
G1 (ordering by individual relevance): x7, x33, x12, …, xn
G2 (all pairs by exhaustive search): x3x8, x15x88, …, xixj
G3 (all triplets by exhaustive search): x75x1x35, x11x49x55, …, xixjxk
Top level (most relevant granules chosen with AdDel): G = … AdDel
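The transcript gives only a verbal description of the relevance measure (distances to the closest neighbors, one from each class), so the function below is one plausible reading of it, and the granule tiers follow the P0/G1/G2/G3 scheme above; pool sizes are illustrative assumptions.

```python
# One plausible nearest-neighbor relevance measure plus the G1/G2/G3
# granule tiers. X: samples x genes NumPy array, y: NumPy label array.
import numpy as np
from itertools import combinations

def nn_relevance(X, y, subset):
    """Average, over samples, of d_other / (d_same + d_other), where d_same
    and d_other are distances to the closest neighbor of each class."""
    Z = X[:, list(subset)]
    total = 0.0
    for i in range(len(Z)):
        d = np.linalg.norm(Z - Z[i], axis=1)
        d[i] = np.inf                      # exclude the sample itself
        d_same = d[y == y[i]].min()
        d_other = d[y != y[i]].min()
        total += d_other / (d_same + d_other + 1e-12)
    return total / len(Z)

def granules(X, y, k_single=50, k_pool=15, k_keep=10):
    """G1: singles ranked by relevance; G2/G3: exhaustive pairs/triplets
    over a small pool of top singles (pool sizes are illustrative)."""
    rel = lambda s: nn_relevance(X, y, s)
    g1 = sorted(range(X.shape[1]), key=lambda j: -rel([j]))[:k_single]
    pool = g1[:k_pool]
    g2 = sorted(combinations(pool, 2), key=rel, reverse=True)[:k_keep]
    g3 = sorted(combinations(pool, 3), key=rel, reverse=True)[:k_keep]
    return g1, g2, g3   # the top-level subset of granules is then chosen
                        # with AdDel, as sketched earlier
```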

Experiments Comparison of: 1. GRAD; 2. Wrapper (C4.5); 3. the original dataset; 4. Filtering (Wilcoxon test, plus removal of genes with low entropy, low variability, or very low absolute expression level). Classifiers: 1. C4.5; 2. SVM; 3. RBF; 4. NN-MLP. Data: small dataset, 74x822.
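A sketch of the comparison protocol with scikit-learn stand-ins for the four classifiers (the exact models and hyperparameters of the original experiments are not specified in the transcript):

```python
# Evaluate each classifier on the 74 x 822 matrix restricted to the genes
# chosen by each selection method. Models approximate the slide's list
# (C4.5 -> decision tree; RBF -> RBF-kernel SVM; NN-MLP -> perceptron).
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

CLASSIFIERS = {
    "C4.5 (tree)": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(kernel="linear"),
    "RBF": SVC(kernel="rbf"),
    "NN-MLP": MLPClassifier(max_iter=2000, random_state=0),
}

def compare(X, y, feature_sets):
    """feature_sets maps a method name (GRAD, wrapper, ...) to gene indices."""
    for fs_name, subset in feature_sets.items():
        for clf_name, clf in CLASSIFIERS.items():
            acc = cross_val_score(clf, X[:, subset], y, cv=5).mean()
            print(f"{fs_name:10s} {clf_name:12s} {acc:.3f}")
```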

Data Characterization Not all organs have samples of both classes. Unbalanced classes: 50 cancer samples vs. 24 normal samples. Most genes are expressed at relatively low levels. The mean is quite far from the median, potentially due to outliers.

Data Characterization [Scatter plots: average vs. standard deviation; average vs. range] Both the range and the standard deviation have a roughly linear relationship with the average gene expression level.

Experimental Results: Predictive Accuracy

GRAD       86%
Wrapper    82%
Original   79%
Filtering  78%

GRAD is significantly better than using the original or the filtered dataset; the wrapper approach is not.

GRAD Results Importance of considering dependence. Distance function: the 10 best attributes selected by GRAD give P = 100%; the 10 most individually informative attributes give P = 75.7%.

GRAD Results [Scatter plots of GRAD attributes: the interdependency relationship between two non-differentially expressed genes selected with GRAD; two differentially expressed genes selected with GRAD.]

GRAD Results Examples ordered by the value of the distance function. In the future, this ordering could make it possible to estimate the degree of risk, make early diagnoses, and monitor the course of treatment.

Induced Classifiers [Decision trees: C4.5 induced on GRAD attributes; C4.5 induced using a wrapper approach.]

Conclusions 1. Coping with redundancy and dependency between attributes is very important. 2. The GRAD algorithm is an effective means of selecting a subset of attributes from a very large initial set. 3. The results presented here are only illustrative. 4. We are open to cooperation with anyone interested in the biological interpretation of the results.

Questions …

GRAD As n increases, relevance first grows; then growth stops and relevance begins to decrease because less informative, noisy attributes are added. The maximum of the quality curve identifies the optimal number of attributes. Only algorithms of the AdDel family have this property.

Gene Selection in Bioinformatics Redundancy: some genes are highly correlated (probably belonging to the same biological pathways); this feeds the curse of dimensionality. Interdependency: a few interdependent genes may jointly carry more significant information than a subset of independent genes, and ignoring this loses information relevant to discriminating among classes. All of this has a negative impact on predictive accuracy!

Feature Selection Wrapper: considers the classifier while searching for the best subset; accuracy improves, but it may overfit due to small sample sizes and huge dimensionality, and it is computationally more expensive. Filtering: potentially less accurate, but faster, since it does not require inducing a predictor; it is the commonly preferred approach in bioinformatics.