Some statistical musings Naomi Altman Penn State 2015 Dagstuhl Workshop.

Some topics that might be interesting
- feature matching across samples and platforms
- preprocessing
- number of features >> number of samples
- feature screening
- replication and possibly other design issues
- PCA and relatives
- mixture modeling

Feature Matching
- e.g. (simple) should we match RNA-seq with a gene expression microarray by “gene” or by “oligo”?
- protein MS with RNA-seq or ribo-Seq
- how should we match features such as methylation sites, protein binding regions, SNPs, transcripts and proteins?

Preprocessing
These plots show the concordance of 3 normalizations of the same Affymetrix microarray. Dozens of methods are available for each platform. Matching features across platforms will depend heavily on which set of normalizations is selected.

p>>n
When the number of features > number of samples:
- correlations of magnitude very close to 1 are common
- we can always obtain multiple “perfect” predictors, so selecting “interesting” features is difficult
- “extreme” p-values, Bayes factors, etc. become common
- singular matrices occur in optimization algorithms
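The points above can be checked on pure noise. The following is a minimal sketch, assuming independent standard normal features and response (the function names and sizes are illustrative, not from the talk): it reports the largest absolute correlation between the response and any feature, and shows that least squares fits the noise response exactly once p > n.

```python
import numpy as np

def max_abs_correlation(n_samples, n_features, seed=0):
    """Largest |correlation| between a noise response and noise features."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_samples, n_features))
    y = rng.standard_normal(n_samples)
    # Standardize columns so that the scaled cross-product is a correlation.
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    ys = (y - y.mean()) / y.std()
    r = Xs.T @ ys / n_samples
    return float(np.max(np.abs(r)))

def perfect_fit(n_samples, n_features, seed=0):
    """Fit a noise response by least squares; return (max residual, rank of X)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_samples, n_features))
    y = rng.standard_normal(n_samples)
    # With p > n, lstsq returns the minimum-norm solution among many exact fits.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    max_residual = float(np.max(np.abs(X @ beta - y)))
    # rank(X) <= n < p, so the p x p matrix X'X is singular.
    return max_residual, int(np.linalg.matrix_rank(X))
```

Calling `max_abs_correlation(10, 5000)` and then `max_abs_correlation(1000, 50)` contrasts the two regimes; `perfect_fit(10, 50)` returns a residual at the level of rounding error even though the response is unrelated to every feature.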

p>>n
New statistical methods for feature selection, such as “sparse” and “sure screening” selectors, may be useful. The idea of “sure screening” selectors is that prescreening brings us to p < n-1, while with high probability all the “important” features are still selected (along with others which we will screen out later).
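One common way to prescreen is to rank features by absolute marginal correlation with the response and keep the top few. The sketch below assumes that variant and a default of n-2 retained features so that p < n-1 afterwards; the function name, the cutoff, and the simulated data are all illustrative choices, not the method from the talk.

```python
import numpy as np

def marginal_screen(X, y, n_keep=None):
    """Keep the n_keep features with the largest |marginal correlation| with y."""
    n, p = X.shape
    if n_keep is None:
        n_keep = n - 2  # assumption: leaves p < n - 1 after screening
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    r = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    order = np.argsort(-np.abs(r))
    return np.sort(order[:n_keep])

def simulate(n=100, p=2000, seed=0):
    """Noise features plus two truly important ones (columns 0 and 1)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    true_idx = (0, 1)
    y = 3.0 * X[:, 0] + 3.0 * X[:, 1] + rng.standard_normal(n)
    return X, y, true_idx
```

With `X, y, true_idx = simulate()`, the call `marginal_screen(X, y)` returns 98 column indices. In this strong-signal setting the two important columns are expected to be among them, along with many noise columns that a later selection step would need to remove.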

Experimental Design
- Randomization, replication and matching enhance our ability to reproduce research.
- In particular, replication ensures the results are not sample specific, while blocking allows variability in the samples without swamping the effects.
- Multi-omics is best done on single samples measured on multiple platforms.
- Technical replication is seldom worth the cost compared to taking more biological replicates.
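Randomization within blocks can be sketched in a few lines. The example assumes each block (for instance a batch or processing day) holds a multiple of the number of treatments, so every treatment appears equally often in every block; the function name and labels are hypothetical.

```python
import numpy as np

def randomize_within_blocks(block_labels, treatments, seed=0):
    """Randomly assign treatments to samples, balanced within each block."""
    rng = np.random.default_rng(seed)
    block_labels = np.asarray(block_labels)
    treatments = np.asarray(treatments, dtype=object)
    assignment = np.empty(len(block_labels), dtype=object)
    for block in np.unique(block_labels):
        idx = np.flatnonzero(block_labels == block)
        if len(idx) % len(treatments) != 0:
            raise ValueError("block size must be a multiple of the number of treatments")
        # Equal copies of each treatment, shuffled independently per block.
        labels = np.repeat(treatments, len(idx) // len(treatments))
        assignment[idx] = rng.permutation(labels)
    return assignment
```

For example, `randomize_within_blocks(["day1"] * 4 + ["day2"] * 4, ["control", "treated"])` gives two control and two treated samples on each day, in random order.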

Dimension Reduction
PCA (or SVD) has many relatives that can be used to reduce the number of features using projections onto a lower dimensional space.
- The components are often not interpretable.
- Many variations are available from both the machine learning and statistics communities.
- Machine learning stresses fitting the data.
- Statistics stresses fitting the data generating process.
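The link between PCA and the SVD can be made concrete. This is a minimal sketch, assuming samples in rows and features in columns: the data are centered, the SVD is taken, and the samples are projected onto the leading right singular vectors. The function name is illustrative.

```python
import numpy as np

def pca_svd(X, n_components):
    """PCA via the SVD of the column-centered data matrix."""
    Xc = X - X.mean(axis=0)
    # Xc = U diag(s) Vt; the rows of Vt are the principal directions.
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]  # projected samples
    components = Vt[:n_components]                   # loadings
    explained = s[:n_components] ** 2 / np.sum(s ** 2)
    return scores, components, explained
```

Each row of `components` is a weighted combination of all the original features, which is one reason the components are often hard to interpret.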

Mixture Modeling
In many cases we can think of a sample as a mixture of subpopulations. We can use the EM algorithm or Bayesian methods to deconvolve into the components.
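As an illustration of the EM approach, here is a minimal sketch for the simplest case, assuming a one-dimensional measurement drawn from two Gaussian subpopulations. The two-component model, the starting values, and the fixed iteration count are simplifying assumptions, not part of the talk.

```python
import numpy as np

def em_two_gaussians(x, n_iter=200):
    """EM for a two-component 1-D Gaussian mixture; returns (weights, means, sds)."""
    x = np.asarray(x, dtype=float)
    # Crude deterministic start: means at the data extremes, equal weights.
    mu = np.array([x.min(), x.max()])
    sigma = np.array([x.std(), x.std()])
    weights = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: posterior probability that each point came from each component.
        dens = weights * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) \
            / (sigma * np.sqrt(2.0 * np.pi))
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted updates of the mixing weights, means and sds.
        nk = resp.sum(axis=0)
        weights = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    return weights, mu, sigma
```

The responsibilities computed in the E-step give a soft assignment of each observation to a subpopulation, which is the deconvolution described above. Real omics data would usually need more components, multivariate densities, and a convergence check.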

Some other statistical topics already mentioned
- missing features (present but not detected) which differ between samples
- mis-identified features
- do p-values (or FDR estimates) matter?
- multiple times; multiple cells; multiple individuals
- biological variation vs measurement noise & error propagation
- how can we enhance reproducibility (statistical issues)?
- can we fit complex models? should we?
- the data are too big for most statistically trained folks
- how are we going to train the current and next generation?