University of California at San Diego


Data to Biology
Shankar Subramaniam
University of California at San Diego

“BIOLOGY DRIVEN” CHALLENGES FOR THE STC CS RESEARCHERS
KNOWLEDGE EXTRACTION FROM DATA
- DEALING WITH THE COFFEE DRINKERS PROBLEM
- HOW CAN BIOLOGICAL DATA BE INTEGRATED?
- DEFINING THE GRANULARITY OF DATA
- UNBIASED STATISTICAL METHODS
- BIOLOGY-CONSTRAINED METHODS
- INFORMATION METRICS
- HOW DO WE DEAL WITH CONTEXT?

FOUR “BIOLOGY DRIVEN” CHALLENGES FOR THE STC CS RESEARCHERS
NOISY DATA
- CAN WE DEFINE HOW MUCH NOISE AND WHAT TYPE OF NOISE CAN BE TOLERATED IN EXTRACTING KNOWLEDGE?
- IS MISSING DATA TANTAMOUNT TO NOISE? IF NOT, HOW DO WE DEAL WITH IT?

FOUR “BIOLOGY DRIVEN” CHALLENGES FOR THE STC CS RESEARCHERS
CLASSIFICATION OF MODULARITY FROM DATA
- HOW CAN WE DEFINE MODULES (FUNCTIONAL, SPATIAL, TEMPORAL, ETC.) FROM DATA?
- WHAT IS THE INFORMATION CONTENT IN THE MODULES?
- CAN WE COMPARE MODULES QUANTITATIVELY?

FOUR “BIOLOGY DRIVEN” CHALLENGES FOR THE STC CS RESEARCHERS
DEALING WITH DYNAMICAL DATA
- HOW DO WE DEAL WITH TIME SERIES DATA?
- HOW IS INFORMATION PROCESSED IN TIME SERIES DATA?
- WHAT GRANULARITY AND CONTEXT ARE NECESSARY TO ANALYZE THIS DATA?

“BIOLOGY DRIVEN” CHALLENGES FOR THE STC CS RESEARCHERS
The coffee drinkers problem (highly skewed distributions):
- 90% of people are coffee drinkers.
- What does this say about making drink predictions that are 90% accurate?
- Biology is all about highly skewed distributions, which pose significant challenges for methods, measures, and validation.
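To make the coffee drinkers problem concrete, here is a minimal sketch (not from the slides) of a predictor that always answers "coffee". The second drink label "tea", the 90/10 sample, and the choice of balanced accuracy as the alternative measure are all assumptions made for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally."""
    recalls = []
    for label in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == label]
        recalls.append(sum(y_pred[i] == label for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical population: 90 coffee drinkers, 10 tea drinkers (assumed labels).
y_true = ["coffee"] * 90 + ["tea"] * 10

# Trivial predictor that ignores its input and always says "coffee".
y_pred = ["coffee"] * len(y_true)

baseline_acc = accuracy(y_true, y_pred)           # 0.9
baseline_bal = balanced_accuracy(y_true, y_pred)  # 0.5
print(baseline_acc, baseline_bal)
```

The trivial predictor reaches 90% accuracy while learning nothing, so a method that reports 90% accuracy on such data has not been shown to be useful. Balanced accuracy is only one of several measures that expose this.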

“BIOLOGY DRIVEN” CHALLENGES FOR THE STC CS RESEARCHERS
The coffee drinkers problem, examples:
- 99% of us likely do not have the disease one might be looking for.
- 99% of protein interactions are accounted for by 5% of the proteins.
- 99% of the known disease-implicated mutations occur in less than 5% of the people.
(All figures are estimates, but largely realistic.)

“BIOLOGY DRIVEN” CHALLENGES FOR THE STC CS RESEARCHERS
The coffee drinkers problem:
- Most current techniques in data analysis are rendered useless because of this.
- Statistical significance with meaningful null hypotheses is critical (information content is one of the most commonly used measures even today).
- Simulation-based methods often do not work, so analytical methods are required.
- Methods must optimize for these analytical measures of quality.
- Validation in the absence of complete data is hard.

“BIOLOGY DRIVEN” CHALLENGES FOR THE STC CS RESEARCHERS
The coffee drinkers problem (real examples):
- When is a module in a network significant?
- When is an observed mutation in a sequenced, phenotype-implicated genome significant?
- When is an alignment of two networks significant?
- When is correlation in time-course microarray data significant?
Conversely:
- How do we detect the most significant modules in a network?
- How do we identify all phenotype-implicated mutations from a large number of sequenced diseased and normal genomes?
- How do we align networks to obtain the most statistically significant alignments?
- How do we find the most correlated signals and the associated groups of genes?
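One of the questions above, when an observed mutation is significant, can be illustrated with an analytical tail probability rather than a simulation. The sketch below is an assumption-laden illustration, not the method used in the talk: it assumes a binomial null model in which each of n genomes carries the mutation independently with a background rate p, and it applies a Bonferroni correction for the number of sites tested. The function names and the example numbers are hypothetical.

```python
from math import comb


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), computed exactly by summation."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def bonferroni(p_value, n_tests):
    """Conservative multiple-testing correction, capped at 1."""
    return min(1.0, p_value * n_tests)


# Hypothetical numbers: a mutation seen in 5 of 100 diseased genomes,
# against an assumed background rate of 1%, with 20,000 sites tested.
raw_p = binomial_tail(5, 100, 0.01)
adjusted_p = bonferroni(raw_p, 20_000)
print(raw_p, adjusted_p)
```

The point of the example is the role of the null hypothesis: the same count of 5 in 100 can look striking or unremarkable depending on the assumed background rate and on how many sites were examined.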

“BIOLOGY DRIVEN” CHALLENGES FOR THE STC CS RESEARCHERS
The Hidden Terminal Problem:
- Consider a phenotype as reflected in its genetic variants (for example, the nucleotide-level variations associated with a disease).
- Often, these variations are not consistent (e.g., liver cancer manifests itself in gene mutations that are not all at the same place).
- However, these variations correspond to significantly aligned pathways in the underlying networks (i.e., they disrupt the same function, albeit by altering different genes).
- How do we go from an observable (phenotype/disease) to an abstraction where the observable has little information content, and then to other abstractions where the observable might have significant information content?
- More importantly, how do we go backwards (predict observables)?

“BIOLOGY DRIVEN” CHALLENGES FOR THE STC CS RESEARCHERS
The Hidden Terminal Problem: Specific Instance
- Start from observed mutations in a specific disease (for example liver or breast cancer, for which significant genomic data are available).
- The mutations result from a combination of noise, other phenotypes, and the specific disease.
- A simple intersection yields no signal.
- Cross-reference against synthetic lethality data.
- Redefine intersection over pathways.
- Reassess mutations under this definition and quantify their significance w.r.t. the observed phenotype.
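The step from a gene-level intersection to a pathway-level intersection can be sketched as follows. This is a toy illustration under stated assumptions: the gene names, pathway names, and the gene-to-pathway map are invented, and the synthetic lethality cross-reference and the significance assessment are left out.

```python
def shared_across_patients(patient_sets):
    """Items present in every patient's set (plain set intersection)."""
    sets = [set(s) for s in patient_sets]
    return set.intersection(*sets) if sets else set()


def to_pathways(genes, gene_to_pathways):
    """Map mutated genes to the set of pathways they belong to."""
    return {pw for g in genes for pw in gene_to_pathways.get(g, ())}


# Hypothetical annotation: which pathways each gene participates in.
gene_to_pathways = {
    "G1": {"P_repair"},
    "G2": {"P_repair"},
    "G3": {"P_repair", "P_cycle"},
    "G4": {"P_other"},
}

# Hypothetical mutated genes observed in three patients with the same disease.
patients = [{"G1", "G4"}, {"G2"}, {"G3"}]

# Intersecting over genes finds nothing: no gene is mutated in all patients.
gene_level = shared_across_patients(patients)

# Intersecting over pathways recovers the shared disrupted function.
pathway_level = shared_across_patients(
    to_pathways(p, gene_to_pathways) for p in patients
)
print(gene_level, pathway_level)
```

In this toy data each patient carries a different mutated gene, yet all three disrupt the same pathway, which is the situation the slide describes as altering the same function through different genes.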