Biological data representation and data mining
Xin Chen

Biology will never be the same again
Accumulation of data
High-throughput experiments
Scattered and layered knowledge
Challenges in representing and integrating data and knowledge
– Faithful and full representation of the observations, not only the conclusions
– Connecting heterogeneous types of data
– Building a computational framework
You can never understand what an elephant is by looking at its hairs

The crown jewel of biology
My personal opinion:
Data analysis in general
– BLAST: homology analysis
– HMM: the concept of “families”
– Structure analysis, clinical trials, orthogonal experimental design, etc.
– Statistics adapted to biology
Data mining in particular
– Analysis of relationships between entities

Data mining flavors
Representation of data:
– A sample (tuple) is represented by a fixed or variable number of elements (features, which are categorical or continuous values)
Binary relationships
– Unsupervised learning
  Looking for structure in the samples, assuming a “similarity” measure that makes biological sense
  Examples: K-means, hierarchical clustering
– Supervised learning
  Looking for a function that describes the relationship between features of samples
  Examples: support vector machine, neural network, Bayesian network, regression
Network relationships:
– Which assumed network structure and parameters best describe the observations
– Confidence in the network as a whole and confidence in its individual elements
– Examples: probabilistic network (Bayesian network), neural network
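
As an illustration of the unsupervised flavor, here is a minimal K-means sketch in plain Python. It is not taken from the slides: the function names (`kmeans`, `squared_distance`, `mean_point`), the squared Euclidean distance used as the “similarity”, and the six two-feature samples are all assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of feature tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(samples, k, iterations=100, seed=0):
    """Hypothetical helper: plain K-means (Lloyd's algorithm)."""
    rng = random.Random(seed)
    # Start from k distinct samples chosen at random.
    centroids = rng.sample(samples, k)
    labels = None
    for _ in range(iterations):
        # Assignment step: each sample joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(s, centroids[c]))
            for s in samples
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [s for s, lab in zip(samples, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean_point(members)
    return labels, centroids


# Toy data (made up): two well-separated groups of samples.
samples = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8),
           (5.0, 5.2), (5.1, 4.9), (4.8, 5.0)]
labels, centroids = kmeans(samples, k=2)
print(labels)
print(centroids)
```

The choice of distance is where the biology enters: swapping `squared_distance` for a measure that reflects the experiment (for example, a correlation-based distance for expression profiles) changes which structures the algorithm can find.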

When you get your hands on …
Pre-processing
– Clean the data (outliers, missing values, dependencies…)
– Feel the data (structure, relevance …): the most important, most difficult, and most underestimated step
Data mining
– Choose an algorithm (adapt it if necessary)
– Run the analysis
Post-processing (interpretation)
– What is expected and what is unexpected
– Connecting results with knowledge and discoveries
Biology is the key
– Where to look is always more important than how to look
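
The cleaning step above can be sketched as follows. This is only one possible approach, assuming missing values are encoded as `None`, are filled with the median, and outliers are flagged with Tukey's interquartile-range fences. The function names and the measurements are hypothetical, and whether a flagged value is an error or a discovery remains a biological question.

```python
import statistics


def impute_missing(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def flag_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v < low or v > high for v in values]


# Made-up measurements with two missing values and one suspicious reading.
raw = [2.1, 2.4, None, 2.2, 9.8, 2.3, None, 2.0]
filled = impute_missing(raw)
outliers = flag_outliers(filled)

for value, is_outlier in zip(filled, outliers):
    print(value, "outlier" if is_outlier else "")
```

The median is used for imputation because, unlike the mean, it is barely affected by the extreme reading that the second step then flags.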

In a nutshell
If results obtained with unbiased (knowledge-independent) approaches correspond to existing knowledge, they are evidence for both your analysis approach and the validity of your discoveries.

The course format
Each of you will give at least one paper presentation in class and finish a toy data mining project (with a written report) using datasets from the UCI Machine Learning Repository. Your grade will be based 70% on your paper presentation and class activity, and 30% on your course project report.