Data to Biology
Shankar Subramaniam
University of California at San Diego
“BIOLOGY DRIVEN” CHALLENGES FOR THE STC CS RESEARCHERS
KNOWLEDGE EXTRACTION FROM DATA
DEALING WITH THE COFFEE DRINKERS PROBLEM
HOW CAN BIOLOGICAL DATA BE INTEGRATED?
DEFINING THE GRANULARITY OF DATA
UNBIASED STATISTICAL METHODS
BIOLOGY-CONSTRAINED METHODS
INFORMATION METRICS
HOW DO WE DEAL WITH CONTEXT?
FOUR “BIOLOGY DRIVEN” CHALLENGES FOR THE STC CS RESEARCHERS
NOISY DATA
CAN WE DEFINE HOW MUCH NOISE AND WHAT TYPE OF NOISE CAN BE TOLERATED IN EXTRACTING KNOWLEDGE?
IS MISSING DATA TANTAMOUNT TO NOISE?
IF NOT, HOW DO WE DEAL WITH IT?
FOUR “BIOLOGY DRIVEN” CHALLENGES FOR THE STC CS RESEARCHERS
CLASSIFICATION OF MODULARITY FROM DATA
HOW CAN WE DEFINE MODULES (FUNCTIONAL, SPATIAL, TEMPORAL, ETC.) FROM DATA?
WHAT IS THE INFORMATION CONTENT IN THE MODULES?
CAN WE COMPARE MODULES QUANTITATIVELY?
FOUR “BIOLOGY DRIVEN” CHALLENGES FOR THE STC CS RESEARCHERS
DEALING WITH DYNAMICAL DATA
HOW DO WE DEAL WITH TIME SERIES DATA?
HOW IS INFORMATION PROCESSED IN TIME SERIES DATA?
WHAT GRANULARITY AND CONTEXT ARE NECESSARY TO ANALYZE THIS DATA?
“BIOLOGY DRIVEN” CHALLENGES FOR THE STC CS RESEARCHERS
The coffee drinkers problem [highly skewed distributions]:
90% of people are coffee drinkers.
What does this say about making drink predictions that are 90% accurate?
Biology is all about highly skewed distributions, which pose significant challenges for methods, measures, and validation.
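The point about 90% accuracy can be made concrete with a small sketch. The 90/10 split comes from the slide; the "tea" label for the minority class, the sample size of 100, and the choice of balanced accuracy as the comparison measure are assumptions made only for illustration.

```python
# Illustrative sketch: a predictor that always says "coffee" on a 90/10 population.
# The minority label "tea" and the sample size are assumptions, not from the slides.

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    hits = sum(t == p for t, p in zip(y_true, y_pred))
    return hits / len(y_true)

def recall_for(label, y_true, y_pred):
    """Fraction of the true `label` cases that were predicted as `label`."""
    relevant = [p for t, p in zip(y_true, y_pred) if t == label]
    return sum(p == label for p in relevant) / len(relevant)

def balanced_accuracy(y_true, y_pred):
    """Mean per-class recall; insensitive to how skewed the classes are."""
    labels = sorted(set(y_true))
    return sum(recall_for(l, y_true, y_pred) for l in labels) / len(labels)

y_true = ["coffee"] * 90 + ["tea"] * 10
y_pred = ["coffee"] * 100  # trivial majority-class predictor

acc = accuracy(y_true, y_pred)
tea_recall = recall_for("tea", y_true, y_pred)
bal_acc = balanced_accuracy(y_true, y_pred)

print(f"accuracy          = {acc:.2f}")
print(f"recall on 'tea'   = {tea_recall:.2f}")
print(f"balanced accuracy = {bal_acc:.2f}")
```

The trivial predictor reaches 90% accuracy while never identifying a single minority case, so plain accuracy says little about a method on data this skewed.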
“BIOLOGY DRIVEN” CHALLENGES FOR THE STC CS RESEARCHERS
The coffee drinkers problem, examples:
99% of us likely do not have the disease one might be looking for.
99% of protein interactions are accounted for by 5% of the proteins.
99% of the known disease-implicated mutations occur in fewer than 5% of people.
(All estimates, but largely realistic.)
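As a worked illustration of the first example, Bayes' rule shows what a 1% disease prevalence does to a seemingly strong test. The 1% prevalence mirrors the slide's estimate; the 99% sensitivity and 99% specificity are hypothetical values chosen only for this sketch.

```python
# Illustrative sketch: positive predictive value under a skewed base rate.
# Sensitivity and specificity below are hypothetical, not from the slides.

def positive_predictive_value(prevalence, sensitivity, specificity):
    """P(disease | positive test), computed with Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

ppv = positive_predictive_value(prevalence=0.01, sensitivity=0.99, specificity=0.99)
print(f"PPV = {ppv:.3f}")
```

Under these assumed numbers, only about half of the positive calls are true positives, even though the test is right 99% of the time on each class.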
“BIOLOGY DRIVEN” CHALLENGES FOR THE STC CS RESEARCHERS
The coffee drinkers problem:
Most current techniques in data analysis are rendered useless by such skew.
Statistical significance with a meaningful null hypothesis is critical (information content is one of the most commonly used measures even today).
Simulation-based methods often do not work, so analytical methods are required.
Methods must optimize for these analytical measures of quality.
Validation in the absence of complete data is hard.
“BIOLOGY DRIVEN” CHALLENGES FOR THE STC CS RESEARCHERS
The coffee drinkers problem (real examples):
When is a module in a network significant?
When is an observed mutation in a sequenced, phenotype-implicated genome significant?
When is an alignment of two networks significant?
When is correlation in time-course microarray data significant?
Conversely:
How do we detect the most significant modules in a network?
How do we identify all phenotype-implicated mutations from a large number of sequenced diseased and normal genomes?
How do we align networks to obtain the most statistically significant alignments?
How do we find the most correlated signals and the associated groups of genes?
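One way to read "significant" in these questions is as a tail probability under an explicit null model, computed analytically rather than by simulation. The sketch below uses the hypergeometric distribution as that null. This is an assumption chosen for illustration, not a method stated in the slides: it asks how likely a randomly drawn module of n genes would contain at least k annotated genes when K of the N genes in the network carry the annotation. The function name and the example numbers are hypothetical.

```python
# Illustrative sketch: analytic p-value for a module under a hypergeometric null.
# The null model, function name, and numbers are assumptions for illustration.
from math import comb

def hypergeom_tail(N, K, n, k):
    """P(X >= k) when n genes are drawn without replacement from N genes,
    K of which carry the annotation of interest."""
    lower = max(k, 0)
    upper = min(K, n)
    total = comb(N, n)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(lower, upper + 1))
    return favorable / total

# Skewed setting: only 5% of 1000 genes are annotated; a 20-gene module holds 5 of them.
p_value = hypergeom_tail(N=1000, K=50, n=20, k=5)
print(f"p-value = {p_value:.3g}")
```

Because the tail is computed exactly from binomial coefficients, very small p-values remain resolvable, which a sampling-based estimate with a limited number of draws cannot guarantee.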
“BIOLOGY DRIVEN” CHALLENGES FOR THE STC CS RESEARCHERS
The Hidden Terminal Problem:
Consider a phenotype as reflected in its genetic variants (for example, the nucleotide-level variations associated with a disease).
Often, these variations are not consistent (e.g., liver cancer manifests itself in gene mutations that are not all at the same place).
However, these variations correspond to significantly aligned pathways in the underlying networks (i.e., they disrupt the same function, albeit by altering different genes).
How do we go from an observable (phenotype/disease) to an abstraction (where the observable has little informative content) to other abstractions (where the observable might have significant information content)?
More importantly, how do we go backwards (predict observables)?
“BIOLOGY DRIVEN” CHALLENGES FOR THE STC CS RESEARCHERS
The Hidden Terminal Problem: Specific Instance
Start from observed mutations in a specific disease (liver and breast cancer have significant genomic data available).
The mutations result from noise, other phenotypes, and the specific disease.
A simple intersection yields no signal.
Cross-reference against synthetic lethality data.
Redefine intersection over pathways.
Reassess mutations under this definition and quantify the significance of these mutations w.r.t. the observed phenotype.
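The step from a gene-level intersection to a pathway-level one can be sketched as follows. All patient identifiers, gene names, and pathway names are hypothetical placeholders, and the sketch covers only the redefinition of the intersection; the synthetic lethality cross-reference and the significance assessment are left out.

```python
# Illustrative sketch: intersecting mutations over genes versus over pathways.
# All patient, gene, and pathway names are hypothetical placeholders.

patient_mutations = {
    "patient_1": {"GENE_A", "GENE_X"},
    "patient_2": {"GENE_B", "GENE_Y"},
    "patient_3": {"GENE_C"},
}

gene_to_pathways = {
    "GENE_A": {"pathway_1"},
    "GENE_B": {"pathway_1"},
    "GENE_C": {"pathway_1", "pathway_2"},
    "GENE_X": {"pathway_3"},
    "GENE_Y": {"pathway_4"},
}

def gene_intersection(mutations):
    """Genes mutated in every patient (the 'simple intersection')."""
    return set.intersection(*mutations.values())

def to_pathways(genes, mapping):
    """All pathways touched by at least one of the given genes."""
    pathways = set()
    for gene in genes:
        pathways |= mapping.get(gene, set())
    return pathways

def pathway_intersection(mutations, mapping):
    """Pathways disrupted in every patient, regardless of which gene was hit."""
    per_patient = [to_pathways(genes, mapping) for genes in mutations.values()]
    return set.intersection(*per_patient)

shared_genes = gene_intersection(patient_mutations)
shared_pathways = pathway_intersection(patient_mutations, gene_to_pathways)

print("shared genes:   ", shared_genes)
print("shared pathways:", shared_pathways)
```

In this toy data no gene is mutated in all three patients, yet every patient carries a mutation in pathway_1, which is the kind of signal the gene-level intersection misses.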