Biological data representation and data mining Xin Chen


1 Biological data representation and data mining Xin Chen xinchen@zju.edu.cn

2 Biology will never be the same again
Accumulation of data
High-throughput experiments
Scattered and layered knowledge
Challenges in representing and integrating data and knowledge
– Faithful and complete representation of the observations, not only the conclusions
– Connecting heterogeneous types of data
– Building a computational framework
You can never understand what an elephant is by looking at its hairs

3 The crown jewel of biology
My personal opinion:
Data analysis in general
– BLAST – homology analysis
– HMM – concept of “families”
– Structure analysis, clinical trials, orthogonal experimental design, etc.
– Statistics adapted to biology
Data mining in particular
– Analysis of relationships between entities

4 Data mining flavors
Representation of data:
– A sample (tuple) is represented by a fixed or variable number of elements (features, which are categorical or continuous values)
Binary relationships:
– Unsupervised learning: looking for structure in the samples, assuming a “similarity” measure that makes biological sense. Examples: K-means, hierarchical clustering
– Supervised learning: looking for a function that describes the relationship between the features of the samples. Examples: support vector machine, neural network, Bayesian network, regression
Network relationships:
– Which assumed network structure and parameters best describe the observations
– Confidence in the network and confidence in the network elements
– Examples: probabilistic network (Bayesian network), neural network
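As an illustration of the unsupervised flavor named on this slide, here is a minimal K-means sketch in plain Python. It is not taken from the course material: the function name `kmeans`, the toy `samples`, and the choice of Euclidean distance as the "similarity" are all assumptions made for the example, and seeding the centroids with the first k samples is a simplification (random or k-means++ seeding is more usual in practice).

```python
import math

def kmeans(points, k, iterations=20):
    """Lloyd's algorithm on a list of equal-length numeric tuples."""
    # Assumption: the first k points are the initial centroids. This keeps
    # the example deterministic but is a poor choice in general.
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each sample joins its nearest centroid
        # (Euclidean distance stands in for biological "similarity").
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                mean = tuple(sum(col) / len(members) for col in zip(*members))
                new_centroids.append(mean)
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return labels, centroids

# Hypothetical toy data: two continuous features per sample.
samples = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
           (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = kmeans(samples, k=2)
print(labels)
print(centroids)
```

The result depends entirely on the distance function, which is where the slide's caveat about a similarity "with biological sense" comes in: replacing `math.dist` with a domain-specific measure changes which structure is found.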

5 When you get hands-on …
Pre-processing
– Clean the data (outliers, missing values, dependencies …)
– Get a feel for the data (structure, relevance …): the most important, most difficult, and most underestimated step
Data mining
– Choose an algorithm (adapt it if necessary)
– Run the analysis
Post-processing (interpretation)
– What is expected and what is unexpected
– Connecting results with knowledge and discoveries
Biology is the key
– Where to look is always more important than how to look
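The cleaning step of pre-processing can be sketched as follows. This is one possible approach under stated assumptions, not the procedure taught in the course: missing values are filled with the median of the observed values, and outliers are flagged with the median/MAD-based modified z-score, using the conventional 0.6745 scaling constant and 3.5 cutoff. The helper names and the `raw` measurements are hypothetical.

```python
import statistics

def impute_missing(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    return [fill if v is None else v for v in values]

def flag_outliers(values, threshold=3.5):
    """Flag values whose modified z-score exceeds the threshold.

    Assumption: the score is 0.6745 * (x - median) / MAD, with the
    commonly used cutoff of 3.5.
    """
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        # No spread around the median: nothing can be flagged this way.
        return [False] * len(values)
    return [abs(0.6745 * (v - med) / mad) > threshold for v in values]

# Hypothetical measurements of one feature; None marks a missing value.
raw = [2.1, 2.4, None, 2.2, 9.8, 2.3, 2.0]
filled = impute_missing(raw)
outliers = flag_outliers(filled)
cleaned = [v for v, bad in zip(filled, outliers) if not bad]
print(cleaned)
```

Whether a flagged value such as 9.8 is a measurement error or a genuine biological signal is a judgment the code cannot make, which is why the slide puts getting a feel for the data ahead of running any algorithm.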

6 In a nutshell
Results obtained with unbiased (knowledge-independent) approaches, if they correspond to existing knowledge, are proof that your analysis approach is sound and your discoveries are valid.

7 The course format
Each of you will give at least one paper presentation in class and complete a toy data mining project (with a written report) using datasets from the UCI Machine Learning Repository. You will be evaluated 70% on your paper presentation and class activity, and 30% on your course project report.

