
1 Introduction to Bioinformatics: Lecture VIII. Classification and Supervised Learning. Jarek Meller, Division of Biomedical Informatics, Children's Hospital Research Foundation & Department of Biomedical Engineering, UC

2 Outline of the lecture
- Motivating story: correlating inputs and outputs
- Learning with a teacher
- Regression and classification problems
- Model selection, feature selection and generalization
- k-nearest neighbors and some other classification algorithms
- Phenotype fingerprints and their applications in medicine

3 Web watch: an on-line biology textbook by JW Kimball
Dr. J. W. Kimball's Biology Pages: http://users.rcn.com/jkimball.ma.ultranet/BiologyPages/
Story #1: B-cells and DNA editing; Apolipoprotein B and RNA editing: http://users.rcn.com/jkimball.ma.ultranet/BiologyPages/R/RNA_Editing.html#apoB_gene
Story #2: ApoB, cholesterol uptake, LDL and its endocytosis: http://users.rcn.com/jkimball.ma.ultranet/BiologyPages/E/Endocytosis.html#ldl
Complex patterns of mutations in genes related to cholesterol transport and uptake (e.g. LDLR, ApoB) may lead to an elevated level of LDL in the blood.

4 Correlations and fingerprints
Instead of an underlying molecular model, which is often difficult to decipher, one may simply try to find correlations between inputs and outputs. If measurements on certain attributes correlate with molecular processes, underlying genomic structures, phenotypes, disease states etc., one can use such attributes as indicators of these "hidden" states and make predictions for new cases. Consider, for example, elevated levels of low density lipoprotein (LDL) particles in the blood as an indicator (fingerprint) of atherosclerosis.

5 Correlations and fingerprints: LDL example
Healthy cases: blue; heart attack or stroke within 5 years of the exam: red (simulated data). Axes: x, LDL; y, HDL; z, age. (See the study by Westendorp et al., Arch Intern Med. 2003, 163(13):1549.)

6 LDL example: 2D projection

7 LDL example: regression with binary output and 1D projection for classification

8 Unsupervised vs. supervised learning
In the case of unsupervised learning, the goal is to "discover" structure in the data and group (cluster) similar objects, given a similarity measure. In the case of supervised learning (or learning with a teacher), a set of examples with class assignments (e.g. healthy vs. diseased) is given, and the goal is to find a representation of the problem in some feature (attribute) space that provides a proper separation of the imposed classes. Such representations, with the resulting decision boundaries, may subsequently be used to make predictions for new cases. (Figure: three clusters labeled Class 1, Class 2, Class 3.)
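To make the contrast concrete, here is a minimal sketch, assuming scikit-learn is available; the data and labels are made up for illustration. The clustering step sees only the feature matrix X, while the classifier is also given the class assignments y:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical data: 6 patients, 2 features (e.g. LDL, HDL)
X = np.array([[230, 60], [220, 55], [210, 65],
              [120, 50], [110, 45], [90, 70]], dtype=float)
y = np.array([1, 1, 1, 0, 0, 0])  # class labels, used only by the classifier

# Unsupervised: discover structure in X alone (no labels given)
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Supervised: learn decision boundaries from labeled examples
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)
prediction = clf.predict([[200, 58]])  # assign a new case to a class
```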

9 Choice of the model, problem representation and feature selection: another simple example
(Figure: example feature axes height, weight, estrogen, testosterone; class labels F vs. M and adults vs. children.)

10 Gene expression example again: JRA clinical classes
(Picture courtesy of B. Aronow.)

11 Advantages of prior knowledge and, on the other hand, problems with class assignment (e.g. in clinical practice)
(Figure: FixL, PYP and globins; no sequence similarity.)
Prior knowledge places these proteins in the same class despite low sequence similarity, suggesting that a distance based on sequence similarity alone is not sufficient; adding structure-derived features might help (the "good model" question again).

12 Three phases in supervised learning protocols
- Training data: examples with class assignments are given.
- Learning: (i) an appropriate model (or representation) of the problem needs to be selected, in terms of attributes, distance measure and classifier type; (ii) the adaptive parameters in the model need to be optimized to provide correct classification of the training examples (e.g. by minimizing the number of misclassified training vectors).
- Validation: cross-validation, independent control sets and other measures of "real" accuracy and generalization should be used to assess the success of the model and the training phase (finding a trade-off between accuracy and generalization is not trivial).
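As an illustration of the validation phase, here is a minimal cross-validation sketch, assuming scikit-learn; the tiny data set is hypothetical:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical labeled training data (features x_i, classes C_k)
X = np.array([[230, 60], [220, 55], [210, 65],
              [120, 50], [110, 45], [90, 70]], dtype=float)
y = np.array([1, 1, 1, 0, 0, 0])

# Validation: each fold is held out once as a control set, so the
# score estimates generalization rather than training accuracy.
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=3)
print(scores.mean())
```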

13 Training set: LDL example again
A set of objects (here patients) x_i, i = 1, …, N is given. For each patient, a set of features (attributes and the corresponding measurements on these attributes) is given too. Finally, for each patient we are given the class C_k, k = 1, …, K, to which he/she belongs: { x_i, C_k }, i = 1, …, N.

Age  LDL  HDL  Sex  Class
41   230  60   F    healthy (0)
32   120  50   M    stroke within 5 years (1)
45   90   70   M    heart attack within 5 years (1)
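A minimal sketch of how such a training set could be encoded for a learning algorithm; the numeric coding of Sex is an assumption made for illustration:

```python
import numpy as np

# Features per patient: age, LDL, HDL, sex (coded F=0, M=1; an assumption)
X = np.array([[41, 230, 60, 0],
              [32, 120, 50, 1],
              [45,  90, 70, 1]], dtype=float)

# Class labels C_k: 0 = healthy, 1 = stroke/heart attack within 5 years
y = np.array([0, 1, 1])
```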

14 Optimizing adaptable parameters in the model
Find a model y(x; w) that describes the objects of each class as a function of the features x and the adaptive parameters (weights) w. Prediction: given x (e.g. LDL=240, age=52, sex=male), assign the class C (e.g. if y(x; w) > 0.5 then C = 1, i.e. the patient is likely to suffer a stroke or heart attack within the next 5 years).
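The slide does not fix a particular form for y(x; w); one common choice is a logistic model, sketched below. The weights w here are made-up placeholders, and only the 0.5-thresholding rule comes from the slide:

```python
import numpy as np

def y(x, w):
    """Logistic model y(x; w) = 1 / (1 + exp(-w.x)); output in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-np.dot(w, x)))

# x = [1 (bias), age, LDL, sex (M=1)]; the weights w are hypothetical and
# would in practice be optimized on the training examples
x = np.array([1.0, 52, 240, 1])
w = np.array([-10.0, 0.05, 0.03, 0.5])

C = 1 if y(x, w) > 0.5 else 0  # predicted class for the new patient
```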

15 Examples of machine learning algorithms for classification and regression problems
- Linear perceptron, least squares
- LDA/FDA (Linear/Fisher Discriminant Analysis): simple linear cuts; kernel-based non-linear generalizations
- SVM (Support Vector Machines): optimal, wide-margin linear cuts; kernel-based non-linear generalizations
- Decision trees (logical rules)
- k-NN (k-nearest neighbors): simple, non-parametric
- Neural networks: general non-linear models, adaptivity, "artificial brain"
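For reference, each family listed above has a standard scikit-learn implementation; a minimal sketch (the estimator names are real scikit-learn classes, the parameter choices are illustrative):

```python
from sklearn.linear_model import Perceptron
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

classifiers = {
    "perceptron": Perceptron(),
    "LDA/FDA": LinearDiscriminantAnalysis(),
    "SVM (RBF kernel)": SVC(kernel="rbf"),
    "decision tree": DecisionTreeClassifier(),
    "k-NN": KNeighborsClassifier(n_neighbors=3),
    "neural network": MLPClassifier(hidden_layer_sizes=(10,)),
}
# Each estimator exposes the same fit(X, y) / predict(X) interface,
# so they can be swapped into the earlier sketches.
```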

16 Training accuracy vs. generalization

17 Model complexity, training set size and generalization

18 Similarity measures

19 k-nearest neighbors as a simple algorithm for classification
- Given a training set of N objects with known class assignments and k < N, assign each new object (not included in the training set) to one of the classes based on the assignments of its k nearest neighbors.
- A simple, non-parametric method that works surprisingly well, especially for low-dimensional problems.
- Note, however, that the choice of the distance measure may again have a profound effect on the results.
- The optimal k is found by trial and error.

20 k-nearest neighbor algorithm
Step 1: Compute pairwise distances and take the k closest neighbors.
Step 2: Assign the class by simple majority voting: the new point belongs to the class to which most of its neighbors belong.
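A minimal from-scratch sketch of these two steps; Euclidean distance is one possible choice here, and as the previous slide notes, the distance measure itself is a modeling decision:

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_new, k=3):
    """Assign x_new to the majority class among its k nearest neighbors."""
    # Step 1: compute distances to all training points, take the k closest
    dists = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(dists)[:k]
    # Step 2: simple majority vote over the neighbors' class labels
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]
```

The optimal k can then be chosen by trial and error, e.g. by scoring several values of k with the cross-validation sketch from slide 12.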

