
Predicting Genetic Regulatory Response Using Classification: Us v. Them ("Them" being Manuel Middendorf, Anshul Kundaje, Chris Wiggins, Yoav Freund, and Christina Leslie, in "Predicting Genetic Regulatory Response Using Classification", 2004).

The Problem: Current studies of gene transcription tend to be descriptive. There is a need for a predictive system, one that can predict gene regulation for new experiments. Rather than determining patterns in sets of genes and conditions, look at the underlying causes of those patterns.

The Important Parts of Genes & Experiments: Regulation is determined by binding sites (motifs) and regulators (parents). The significance of experiments, then, is how they affect regulators; the significance of genes is which motifs they contain.

Binding Sites and Regulation: Discretize gene response into only up-regulated (1), down-regulated (-1), or unchanged (0). A motif is either present (1) or absent (0). A parent is either up-regulated (1), down-regulated (-1), or unchanged (0). Assume (and we need to check with someone who actually knows something about biology on this) that things only happen if the motif is present and the parent is either up- or down-regulated.
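A minimal sketch of this discretization in Python, assuming a symmetric cutoff on log-expression ratios (the threshold value is an assumption, not taken from the paper):

```python
# Hypothetical discretization step; the cutoff is illustrative only.
import numpy as np

def discretize_response(log_ratios, threshold=1.0):
    """Map continuous log-expression ratios to {-1, 0, +1}."""
    states = np.zeros_like(log_ratios, dtype=int)
    states[log_ratios >= threshold] = 1     # up-regulated
    states[log_ratios <= -threshold] = -1   # down-regulated
    return states

# One gene measured in four experiments
print(discretize_response(np.array([2.3, -0.4, -1.8, 0.1])))  # [ 1  0 -1  0]
```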

What our matrix really looks like: g = # of genes, e = # of experiments, p = # of parents, m = # of motifs. Then we have g*e response values. For each response, we have p*m parent/motif combinations. For each parent/motif combination, there are three possibilities: present and up-regulated, present and down-regulated, or all those other possibilities where nothing happens. Represent these possibilities as a pair of binary variables, one for up and one for down.

Rows are gene/experiment pairs; columns are the 2*m*p motif/parent indicator variables:

            M1,P1+  M1,P1-  M1,P2+  M1,P2-  ...  M1,Pp+  ...  Mm,Pp-
G1,E1
G1,E2
...
G1,Ee
...
Gg,Ee
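One way this matrix could be assembled, as a rough Python sketch; the input names, shapes, and encodings are assumptions for illustration, not the authors' code:

```python
import numpy as np

def build_feature_matrix(motif_presence, parent_state):
    """motif_presence: genes x motifs (0/1); parent_state: parents x experiments ({-1, 0, +1})."""
    n_genes, n_motifs = motif_presence.shape
    n_parents, n_exps = parent_state.shape
    parent_up = (parent_state == 1).astype(int)
    parent_down = (parent_state == -1).astype(int)

    rows = []
    for g in range(n_genes):
        for e in range(n_exps):
            # A feature fires only if the motif is present in the gene AND
            # the parent is up- (or down-) regulated in that experiment.
            up = np.outer(motif_presence[g], parent_up[:, e]).ravel()
            down = np.outer(motif_presence[g], parent_down[:, e]).ravel()
            # Interleave to get the M1,P1+ / M1,P1- / M1,P2+ / ... column order.
            rows.append(np.stack([up, down], axis=1).ravel())
    return np.array(rows)  # (g*e) x (2*m*p)

X = build_feature_matrix(np.random.randint(0, 2, (5, 4)),
                         np.random.randint(-1, 2, (3, 6)))
print(X.shape)  # (30, 24)
```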

Some Numbers: In the paper, the initial dataset had 6110 genes and 173 experiments. 354 motifs are considered and 475 regulators are considered. The set of genes to consider is reduced to only 1411 genes of interest.

Some More Numbers: Only train on genes that are up- or down-regulated. Approximately 8% of gene/experiment pairs from the overall sample appear to be, so, assuming this holds true in the reduced sample, we have 19,632 gene/experiment pairs to train on. For each of these values we have 2*354*475 = 336,300 predictor variables.

Some Problems: 19,632 by 336,300 is an awfully large matrix to want to do any calculations with, and we have far more variables than observations.

Possible Solutions: Random Projection. Pro: we can reduce our dimensionality. Con: the projection is uninformed by the data, which seems like a somewhat silly approach. Con: there will still be a lot of calculations just to make the projection.
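A hedged sketch of what the random-projection option might look like with scikit-learn, using synthetic sparse data as a stand-in for the real matrix:

```python
# Illustrative only: synthetic sparse binary-ish data standing in for the
# (gene/experiment) x (motif/parent) matrix.
from scipy import sparse
from sklearn.random_projection import SparseRandomProjection

X = sparse.random(2000, 50000, density=0.01, format="csr", random_state=0)
proj = SparseRandomProjection(n_components=500, random_state=0)
X_reduced = proj.fit_transform(X)
print(X_reduced.shape)  # (2000, 500)
```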

Possible Solutions: Variable Selection or PCA. Pro: we could reduce our dimensionality in a more informed way. Con: computationally painful.
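Likewise, a sketch of the variable-selection / PCA option with scikit-learn on stand-in data; TruncatedSVD is used here as the PCA-like reduction that works on large sparse matrices:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.decomposition import TruncatedSVD

X = np.random.randint(0, 2, (500, 2000))   # binary features (stand-in)
y = np.random.choice([-1, 1], size=500)    # up/down labels (stand-in)

X_sel = SelectKBest(chi2, k=200).fit_transform(X, y)     # keep the 200 most label-associated features
X_svd = TruncatedSVD(n_components=50).fit_transform(X)   # project onto 50 components
print(X_sel.shape, X_svd.shape)  # (500, 200) (500, 50)
```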

Possible Solutions: Random Forest. From Breiman and Cutler: grow a number of classification trees, and take the vote of the classifications of all trees. For each tree, if we have n cases, sample n cases with replacement, and at each split use some number of randomly chosen variables much smaller than the full number of variables. Pro: allows for flexibility in reducing the number of dimensions being considered. Pro: dimension is reduced without computation problems. Con: ?
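A minimal sketch of this option using scikit-learn's random forest (synthetic data; the parameter values are illustrative, not tuned for this problem):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: binary motif/parent features, up/down labels.
X = np.random.randint(0, 2, (1000, 500))
y = np.random.choice([-1, 1], size=1000)

rf = RandomForestClassifier(
    n_estimators=200,     # number of trees voting on each prediction
    max_features="sqrt",  # random subset of variables considered at each split
    bootstrap=True,       # each tree sees n cases sampled with replacement
    n_jobs=-1,
)
rf.fit(X, y)
print(rf.predict(X[:5]))
```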

Their Solution: Alternating decision tree, a tree with alternating levels of prediction nodes and splitter nodes. It combines a set of weak prediction rules (the splitter nodes) to make one strong rule.
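Alternating decision trees are not available in scikit-learn (Weka has an ADTree implementation). As a rough stand-in for the same idea of summing many weak single-variable rules, here is boosting over decision stumps; this is only an approximation, not the authors' method, and it drops ADT's conditioning of later rules on earlier prediction nodes:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

# Synthetic stand-in data: binary motif/parent features, up/down labels.
X = np.random.randint(0, 2, (1000, 500))
y = np.random.choice([-1, 1], size=1000)

# AdaBoost's default weak learner is a depth-1 decision stump, i.e. a
# single-variable threshold rule, so the final classifier is a weighted
# vote over many weak single-variable rules (loosely ADT-like).
adt_like = AdaBoostClassifier(n_estimators=300)
adt_like.fit(X, y)
print(adt_like.score(X, y))
```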

Alternating Decision Trees v. Random Forests: Both share the same rough concept: take a vote from many weak rules to get a strong rule. In ADTs, rules are based on single variables and may be conditioned on the values of other variables. In random forests, rules can be based on multiple variables and are only marginal over all values of other variables. ADTs are fairly straightforward to read and interpret.

Alternating Decision Trees: Pro: computationally kind. Pro: works better than Random Forests (Creamer & Freund, 2004). Con: we didn't come up with it.

Other Ideas?