COT6930 Course Project: Gene Selection and Sequence Alignment


COT6930 Course Project

Outline
– Gene Selection
– Sequence Alignment

Why Gene Selection?
– Identify marker genes that characterize different tumor statuses.
– Many genes are redundant and introduce noise that lowers classification performance.
– The selected genes can eventually lead to a diagnostic chip (e.g., a “breast cancer chip” or a “liver cancer chip”).

Why Gene Selection? (figure)

Gene Selection
Methods fall into three categories:
– Filter methods
– Wrapper methods
– Embedded methods
Filter methods are the simplest and the most frequently used in the literature; wrapper methods are likely the most accurate.

Filter Method
Features (genes) are scored according to evidence of predictive power and then ranked. The top s genes with the highest scores are selected and used by the classifier (see the sketch below).
– Scores: t-statistics, F-statistics, signal-to-noise ratio, …
– The number of features selected, s, is then determined by cross-validation.
Advantage: fast and easy to interpret.
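The following is a minimal sketch of this idea, not the project's actual code: each gene is scored with a two-sample (Welch) t-statistic and the top s are kept. The data layout (X as samples × genes, binary labels y) and all names are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def top_s_genes_by_t(X, y, s):
    """Score each gene with a two-sample t-statistic and return the s top-ranked gene indices."""
    X0, X1 = X[y == 0], X[y == 1]                              # split samples by class label
    t, _ = stats.ttest_ind(X0, X1, axis=0, equal_var=False)    # one t-statistic per gene
    return np.argsort(-np.abs(t))[:s]                          # rank by |t| and keep the top s

# Toy example: 40 samples, 500 genes, keep the 20 highest-scoring genes.
# In practice s itself would be chosen by cross-validation, as noted above.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 500))
y = np.repeat([0, 1], 20)
selected = top_s_genes_by_t(X, y, s=20)
```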

Good versus bad features (figure)

Filter Method: Problems
Genes are considered independently:
– Redundant genes may be included.
– Genes that are strongly discriminative jointly but weak individually will be ignored.
– Good single features do not necessarily form a good feature set.
The filtering procedure is independent of the classification method:
– The selected features can be used with any type of classifier.

Wrapper Method
Iterative search: many candidate feature subsets are scored based on classification performance, and the best one is used; the goal is to select a good subset of features (see the sketch below).
Subset selection: forward selection, backward selection, or combinations of the two.
– Exhaustive search is infeasible.
– Greedy algorithms are used instead.
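A minimal sketch of a wrapper, assuming greedy forward selection scored by cross-validated accuracy of a k-NN classifier; the classifier, the fold count, and all names are illustrative choices rather than the project's actual setup.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def forward_select(X, y, k):
    """Greedily add the feature that most improves cross-validated accuracy, up to k features."""
    selected, remaining = [], list(range(X.shape[1]))
    clf = KNeighborsClassifier(n_neighbors=3)
    for _ in range(k):
        # evaluate every candidate extension of the current subset
        scores = {j: cross_val_score(clf, X[:, selected + [j]], y, cv=5).mean()
                  for j in remaining}
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected
```

Each candidate subset triggers a full train-and-evaluate cycle, which is exactly why the next slide calls the approach computationally expensive.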

Wrapper Method: Problems
Computationally expensive:
– For each feature subset considered, the classifier must be built and evaluated.
Exhaustive search is infeasible:
– Only greedy search is practical.
Easy to overfit.

Embedded Method
– Attempts to train the classifier and select the feature subset jointly (simultaneously).
– Often optimizes an objective function that rewards classification accuracy while penalizing the use of more features (see the sketch below).
– Intuitively appealing.
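One common embedded formulation, shown here as a hedged sketch rather than the method used in the project, is L1-regularized logistic regression: the L1 penalty drives many gene weights exactly to zero while the model is being fit, so training and selection happen in one step. The penalty strength C is an arbitrary illustrative value.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def l1_selected_genes(X, y, C=0.1):
    """Fit an L1-penalized logistic regression and return the indices of genes with nonzero weights."""
    model = LogisticRegression(penalty="l1", solver="liblinear", C=C).fit(X, y)
    return np.flatnonzero(model.coef_[0])   # the surviving (selected) genes
```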

Relief-F
Relief-F is a filter approach to feature selection, built as an extension of the original Relief algorithm.

Relief-F
– The original Relief can only handle binary classification problems.
– Relief-F extends it to multi-class problems (the standard weight update from the literature is reproduced below).
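For reference, this is the standard ReliefF weight update as given in the literature (Kononenko's formulation), not necessarily the exact notation used on the original slides: R_i is the sampled instance, H_j its k nearest hits, M_j(C) its k nearest misses in class C, P(C) the class prior, and m the number of sampled instances.

```latex
\[
W[A] \;\leftarrow\; W[A]
  \;-\; \sum_{j=1}^{k} \frac{\operatorname{diff}(A, R_i, H_j)}{m\,k}
  \;+\; \sum_{C \neq \operatorname{class}(R_i)}
        \frac{P(C)}{1 - P(\operatorname{class}(R_i))}
        \sum_{j=1}^{k} \frac{\operatorname{diff}(A, R_i, M_j(C))}{m\,k}
\]
```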

Relief-F
Attribute difference functions diff(A, ·, ·) for categorical and for numerical attributes (formulas shown as a figure on the original slide); a sketch follows below.
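A minimal sketch of the two difference functions and of the basic binary-class Relief weight update (one nearest hit and one nearest miss); the multi-class Relief-F described on these slides generalizes this to k neighbours per class. All names here are illustrative assumptions.

```python
import numpy as np

def diff(a, x1, x2, lo, hi, categorical):
    """Attribute difference: 0/1 for categorical attributes, range-normalised |x1 - x2| for numeric ones."""
    if categorical[a]:
        return float(x1[a] != x2[a])
    span = hi[a] - lo[a]
    return abs(x1[a] - x2[a]) / span if span > 0 else 0.0

def relief_weights(X, y, categorical, n_iter=100, seed=0):
    """Basic (binary-class) Relief: estimate one weight per attribute."""
    m, a = X.shape
    lo, hi = X.min(axis=0), X.max(axis=0)
    w = np.zeros(a)
    rng = np.random.default_rng(seed)
    for _ in range(n_iter):
        i = rng.integers(m)
        d = np.abs(X - X[i]).sum(axis=1)                   # crude L1 distance for the neighbour search
        d[i] = np.inf                                      # never pick the sampled instance itself
        hit = np.argmin(np.where(y == y[i], d, np.inf))    # nearest same-class instance
        miss = np.argmin(np.where(y != y[i], d, np.inf))   # nearest other-class instance
        for attr in range(a):
            w[attr] += (diff(attr, X[i], X[miss], lo, hi, categorical)
                        - diff(attr, X[i], X[hit], lo, hi, categorical)) / n_iter
    return w
```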

Relief-F: Problems
Time complexity:
– m × (m×a + c×m×a + a) = O(c·m²·a)
– Assume m = 100, c = 3, a = 10,000
– c·m²·a = 3 × 100² × 10,000 = 3×10⁸, i.e., about 300×10⁶ operations
Relief-F scores each attribute individually, so it cannot by itself select a subset of “good” genes that work well together.

Solution: Parallel Relief-F
Version 1:
– Cluster nodes run Relief-F in parallel, and the updated weight values are collected at the master (a sketch follows below).
– Theoretical time complexity: O(c·m²·a/p), where p is the number of cluster nodes.
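A minimal sketch of the Version 1 idea, with a local process pool standing in for the cluster and numeric attributes only: each worker runs the Relief updates for its share of the instances against the full dataset, and the master sums the partial weight vectors. This illustrates the aggregation scheme, not the project's actual implementation.

```python
import numpy as np
from multiprocessing import Pool

def partial_weights(args):
    """Run the (numeric-only) Relief updates for the instances in `idx`; return a partial weight vector."""
    X, y, idx = args
    w = np.zeros(X.shape[1])
    span = X.max(axis=0) - X.min(axis=0) + 1e-12   # per-attribute range for normalisation
    for i in idx:
        d = np.abs(X - X[i]).sum(axis=1)
        d[i] = np.inf
        hit = np.argmin(np.where(y == y[i], d, np.inf))
        miss = np.argmin(np.where(y != y[i], d, np.inf))
        w += (np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])) / span
    return w

def parallel_relief(X, y, n_workers=4):
    """Master: split the instances across workers, then aggregate the partial weights."""
    chunks = np.array_split(np.arange(len(X)), n_workers)
    with Pool(n_workers) as pool:
        parts = pool.map(partial_weights, [(X, y, c) for c in chunks])
    return np.sum(parts, axis=0) / len(X)          # every instance is visited exactly once here
```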

Parallel Relief-F
Version 2:
– Cluster nodes run Relief-F in parallel, and each node directly updates the global weight values.
– Each node also uses the current weight values when selecting nearest-neighbour instances.
– Theoretical time complexity: O(c·m²·a/p), where p is the number of cluster nodes.

Parallel Relief-F
Version 3:
– Consider selecting a subset of important features.
– Compare the effect of including versus excluding a specific feature to understand the importance of a gene with respect to an existing subset of features.
– Details to be discussed in private!

Outline: Gene Selection, Sequence Alignment
Sequence alignment task:
– Given a dataset D with N = 1000 sequences (e.g., 1000 each).
– Given an input x,
– do pair-wise global sequence alignment between x and all sequences in D: dispatch the jobs to the cluster and aggregate the results (a sketch follows below).
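A minimal sketch of this task, assuming simple Needleman-Wunsch global alignment scores (match +1, mismatch −1, gap −1) and a local process pool standing in for the cluster; the scoring scheme and all names are illustrative assumptions.

```python
import numpy as np
from multiprocessing import Pool

def nw_score(pair, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score between two sequences (score only, no traceback)."""
    a, b = pair
    dp = np.zeros((len(a) + 1, len(b) + 1))
    dp[:, 0] = gap * np.arange(len(a) + 1)          # prefix of `a` aligned against gaps
    dp[0, :] = gap * np.arange(len(b) + 1)          # prefix of `b` aligned against gaps
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            dp[i, j] = max(dp[i - 1, j - 1] + s,    # match / substitution
                           dp[i - 1, j] + gap,      # gap in b
                           dp[i, j - 1] + gap)      # gap in a
    return dp[-1, -1]

def align_all(x, D, n_workers=4):
    """Dispatch one alignment job per sequence in D and aggregate the (score, index) results."""
    with Pool(n_workers) as pool:
        scores = pool.map(nw_score, [(x, seq) for seq in D])
    return sorted(zip(scores, range(len(D))), reverse=True)

# Usage: align_all("ACGTGCA", list_of_1000_sequences) returns (score, index) pairs, best first.
```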