A Bayesian Method for Rank Aggregation. Xuxin Liu, Jiong Du, Ke Deng, and Jun S. Liu, Department of Statistics, Harvard University.

Outline of the talk
Motivations
Methods review
◦ Classics: SumRank, Fisher, InvZ
◦ State-transition methods: MC4, MCT
Bayesian model for the ranks
◦ Power laws
◦ MCMC algorithm
Simulation results

Motivations
Goal: to combine rank lists from multiple experiments into a "most reliable" list of candidates.
Examples:
◦ Combine ranking results from different judges
◦ Combine biological evidence to rank the relevance of genes to a certain disease
◦ Combine different genomic experiment results for a similar biological setup

Data – the rank matrix
The ranks of N "genes" in M experiments.
Questions of interest:
◦ How many genes are "true" targets (e.g., truly differentially expressed, or truly involved in a certain biological function)?
◦ Which genes are they?

Challenges
Full rank lists versus partial rank lists
◦ Sometimes we can only "reliably" rank the top k candidates (genes)
The quality of the ranking results can vary greatly from experiment (judge) to experiment (judge)
There are also "spam" experiments that give high ranks to certain candidates for other reasons (bribes)

Some available methods
Related to the classical methods for combining multiple p-values p_1, …, p_M:
◦ Sum of p-values: T = p_1 + … + p_M
◦ Fisher: T = −2 (log p_1 + … + log p_M)
◦ Inverse normal (InvZ): T = Φ^{−1}(p_1) + … + Φ^{−1}(p_M)

Corresponding methods for ranks
Under H_0, each candidate's rank is uniformly distributed in {1, …, N}. Hence a gene ranked in the kth place has p-value k/N, and the previous three methods correspond to substituting p_j = r_j/N:
◦ SumRank: T = r_1 + … + r_M
◦ Fisher: T = −2 Σ_j log(r_j/N)
◦ InvZ: T = Σ_j Φ^{−1}(r_j/N)
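These rank-based statistics are straightforward to compute. Below is a minimal sketch; the toy rank matrix and the function names are illustrative, not from the talk, and the small continuity correction used for InvZ is a common device to avoid an infinite quantile at the bottom rank:

```python
import math
from statistics import NormalDist

# Toy rank matrix: rows = genes, columns = experiments.
# Each column is a permutation of 1..N (illustrative data only).
ranks = [
    [1, 2, 1],   # gene 0: consistently near the top
    [5, 4, 6],
    [3, 6, 2],
    [6, 5, 5],
    [2, 1, 4],
    [4, 3, 3],
]
N = len(ranks)  # number of genes

def combined_scores(gene_ranks, n_genes):
    # Under H0, a rank k corresponds to p-value k/N.
    pvals = [k / n_genes for k in gene_ranks]
    sum_rank = sum(gene_ranks)                       # SumRank: small is good
    fisher = -2 * sum(math.log(p) for p in pvals)    # Fisher: large is good
    # Continuity correction (k - 0.5)/N so that rank N does not map to
    # p = 1, where the normal quantile is infinite.
    inv_z = sum(NormalDist().inv_cdf((k - 0.5) / n_genes)
                for k in gene_ranks)                 # InvZ: small is good
    return sum_rank, fisher, inv_z

for i, r in enumerate(ranks):
    s, f, z = combined_scores(r, N)
    print(f"gene {i}: SumRank={s}, Fisher={f:.2f}, InvZ={z:.2f}")
```

Note how all three scores agree here that gene 0 is the strongest candidate; they differ mainly in how much weight they put on a few extreme ranks versus overall consistency.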

Problems with these methods
Experiments are treated equally, which is sometimes undesirable.

Transition matrix method (Google-inspired?)
Treat each gene as a node, and let P(i, j) be the transition probability from node i to node j. The stationary distribution π is given by π = πP, and the importance of each candidate i is ranked by its stationary probability π(i).

MC4 algorithm
The method is usually applied to rank the top K candidates, so P is a K × K matrix:
◦ Let U be the list of genes that are ranked in the top K at least once in some experiment
◦ For each pair of genes i, j in U, set the pairwise preference δ(i, j) = 1 if i is ranked above j in a majority of the experiments
◦ Define P so that probability mass flows from each gene toward the genes that beat it by majority vote
◦ Make P ergodic by mixing it with the uniform chain: P ← (1 − ε)P + (ε/|U|)·1
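A minimal sketch of the MC4 construction, assuming all experiments rank the same K genes; the toy lists, the value of ε, and the function names are illustrative, not from the talk:

```python
import numpy as np

# Toy data: 5 experiments, each ranking the same 4 genes best-first.
rank_lists = [
    [0, 1, 2, 3],
    [0, 2, 1, 3],
    [1, 0, 2, 3],
    [0, 1, 3, 2],
    [2, 0, 1, 3],
]
K = 4

def mc4_transition(lists, k, eps=0.05):
    """Build the MC4 chain: from state i, move to j (prob 1/k) if a
    strict majority of the lists rank j above i; otherwise stay."""
    P = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            if i == j:
                continue
            wins = sum(l.index(j) < l.index(i) for l in lists)
            if wins > len(lists) / 2:
                P[i, j] = 1.0 / k
        P[i, i] = 1.0 - P[i].sum()      # remaining mass stays put
    # Ergodicity: mix with the uniform chain
    return (1 - eps) * P + eps / k

def stationary(P, iters=500):
    # Power iteration: repeatedly apply the chain to a uniform start.
    pi = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(iters):
        pi = pi @ P
    return pi

P = mc4_transition(rank_lists, K)
pi = stationary(P)
order = np.argsort(-pi)    # aggregate ranking: high stationary mass first
print("aggregate ranking:", order.tolist())
```

Mixing with the uniform chain guarantees a unique stationary distribution, so the power iteration converges regardless of the vote pattern.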

Comments
The method can be viewed as a variation of the simple majority vote. As long as spam experiments do not dominate the truth, MC4 can filter them out. However, it is ad hoc, with no clear principles behind it.

MCT algorithm
Instead of using 0 or 1 for the pairwise preference, MCT defines it as m(i, j)/M, where m(i, j) is the number of experiments in which i is ranked above j.
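The change relative to MC4 is only in how the transition weights are filled in. A hedged sketch, using the same illustrative setup as before (names and data are mine):

```python
import numpy as np

rank_lists = [
    [0, 1, 2, 3],
    [0, 2, 1, 3],
    [1, 0, 2, 3],
]

def mct_transition(lists, k, eps=0.05):
    """MCT variant: replace the 0/1 majority vote with the fraction of
    experiments that prefer j over i."""
    M = len(lists)
    P = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            if i == j:
                continue
            # m(j, i): number of experiments ranking j above i
            m_ji = sum(l.index(j) < l.index(i) for l in lists)
            P[i, j] = m_ji / (M * k)    # soft preference instead of 0/1
        P[i, i] = 1.0 - P[i].sum()
    return (1 - eps) * P + eps / k

P_mat = mct_transition(rank_lists, 4)
print(P_mat.round(3))
```

Because the weights vary continuously with the vote counts, a 3-to-2 split and a 5-to-0 split are no longer treated identically, which is the intended refinement over MC4.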

A Bayesian model
We introduce D as an indicator vector indicating which of the candidates are among the true targets:
◦ D_i = 1 if the ith gene is one of the targets, 0 otherwise
◦ A prior P(D) is placed on D
◦ The joint probability: P(D, R_1, …, R_M) = P(D) Π_j P(R_j | D), where R_j is the rank list from the jth experiment

Given D, we decompose R_j into 3 parts:
◦ R_j^0, the relative ranks within the "null genes", i.e., genes with D_i = 0
◦ R_j^1, the relative ranks within the "targets", i.e., genes with D_i = 1
◦ R_j^*, the relative ranks of each target among the null genes
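The decomposition can be made concrete with a small sketch; the toy list, the indicator vector, and the variable names are mine (`rank_list` gives genes best-first):

```python
# One experiment's rank list and the target indicator D (illustrative).
rank_list = [3, 4, 0, 1, 2, 5]      # gene 3 ranked first, then gene 4, ...
target = [True, False, False, True, False, False]   # genes 0 and 3 are targets

def decompose(rank_list, target):
    nulls = [g for g in rank_list if not target[g]]
    targets = [g for g in rank_list if target[g]]
    # R^0: relative ranks of the null genes among themselves
    r0 = {g: r for r, g in enumerate(nulls, start=1)}
    # R^1: relative ranks of the targets among themselves
    r1 = {g: r for r, g in enumerate(targets, start=1)}
    # R^*: for each target, how many null genes are ranked above it
    r_star = {g: sum(not target[h] for h in rank_list[:rank_list.index(g)])
              for g in targets}
    return r0, r1, r_star

r0, r1, r_star = decompose(rank_list, target)
print("R^0:", r0)
print("R^1:", r1)
print("R^*:", r_star)
```

Here gene 3 sits above every null gene (R^* = 0), while gene 0 has one null gene (gene 4) above it (R^* = 1); the original list can be reassembled exactly from the three parts.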

Example

Decompose the likelihood
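The formula on this slide did not survive the transcript. One natural factorization consistent with the three-part decomposition above (a sketch; the exact form used in the talk may differ) is:

```latex
P\!\left(R_j \mid D\right)
  \;=\;
  P\!\left(R_j^{0} \,\middle|\, D\right)\,
  P\!\left(R_j^{1} \,\middle|\, D\right)\,
  P\!\left(R_j^{*} \,\middle|\, D\right),
```

with the null component $P(R_j^{0} \mid D)$ uniform over orderings of the null genes, so that only the target components carry information about the quality of experiment $j$.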

Power law?
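This slide's content is also lost. A plausible power-law form for where a target falls among the nulls, assuming each experiment $j$ has a quality parameter $\gamma_j$ (the notation here is mine), is:

```latex
P\!\left(R_j^{*}(i) = r \,\middle|\, D_i = 1,\ \gamma_j\right)
  \;\propto\; r^{-\gamma_j},
  \qquad r = 1, 2, \ldots
```

A large $\gamma_j$ concentrates targets near the top of list $j$ (a high-quality experiment); $\gamma_j \approx 0$ makes the experiment nearly uninformative, which is how varying qualities and spam experiments can be absorbed by the model.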

An MCMC algorithm
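As an illustration of what such a sampler can look like, here is a simplified Gibbs-style sketch for the indicator vector D. This is not the authors' algorithm: genes are treated independently, the experiment qualities γ_j are fixed rather than sampled, and all data and names are mine.

```python
import math, random

random.seed(1)

# Toy rank matrix: rows = genes, columns = experiments (1 = best).
ranks = [
    [1, 1, 2],
    [2, 3, 1],
    [6, 5, 6],
    [3, 2, 3],
    [5, 6, 4],
    [4, 4, 5],
]
N, M = len(ranks), len(ranks[0])
GAMMA = [2.0] * M          # assumed fixed experiment qualities
PRIOR = 0.3                # prior P(D_i = 1)

def log_lik_gene(i, d_i):
    """Stand-in likelihood: rank r gets a normalized power-law weight
    r^(-gamma_j) under H1, and is uniform on {1..N} under H0."""
    total = 0.0
    for j in range(M):
        r = ranks[i][j]
        if d_i:
            z = sum(k ** -GAMMA[j] for k in range(1, N + 1))
            total += -GAMMA[j] * math.log(r) - math.log(z)
        else:
            total += -math.log(N)
    return total

def gibbs(iters=2000):
    D = [False] * N
    counts = [0] * N
    for _ in range(iters):
        for i in range(N):
            # Flip D_i from its full conditional (here, its marginal).
            l1 = math.log(PRIOR) + log_lik_gene(i, True)
            l0 = math.log(1 - PRIOR) + log_lik_gene(i, False)
            D[i] = random.random() < 1.0 / (1.0 + math.exp(l0 - l1))
            counts[i] += D[i]
    return [c / iters for c in counts]

post = gibbs()
print("posterior target probabilities:", [round(p, 2) for p in post])
```

In the full model the genes are coupled through the rank decomposition and the γ_j are unknown, so the actual sampler would alternate between updating D and the quality parameters; this independent-gene version only shows the mechanics of the D-update.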

Simulation Study 1

True positives in top 20:

Inferred qualities of the experiments

Simulation 2: power of the spam filtering experiments

Summary
◦ The Bayesian method is robust, and performs especially well when the data are noisy and the experiments have varying qualities
◦ Fisher's method works quite well in most cases and seems rather robust to noisy experiments
◦ The MC-based methods worked surprisingly badly, without exception