1
Lesson 8: Machine Learning (and the Legionella as a case study) Biological Sequences Analysis, MTA
2
Introduction to Machine Learning
3
Some cool examples
4
Types of learning
Supervised learning: uses "labeled" examples of inputs and their desired outputs.
Unsupervised learning: models a set of inputs; labeled examples are not available.
Reinforcement learning: learns from feedback on its actions, obtained by observing the environment (maximizing long-term reward).
5
Clustering
6
Clustering definition
Input: a set of instances.
Output: subsets (called clusters) such that instances in the same cluster are similar.
Is it supervised or not? What does "similar" mean?
9
K-means clustering
0. Choose the number of clusters (k)
1. Initialization: randomly generate k centers
2. Assignment: assign each point to the nearest cluster center
3. Update: move each center to the mean of the points assigned to it
4. Repeat steps 2-3 until there is no further change
K-means - Interactive demo
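The K-means steps on this slide can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: points are 2-D tuples, distance is squared Euclidean, and the initial centers are sampled from the data points (the slide only says "randomly generate"). The function name `kmeans` and its parameters are made up for this sketch.

```python
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster 2-D points into k groups; returns (centers, assignments)."""
    rng = random.Random(seed)
    # Step 1: initialization. Assumption: centers are sampled from the data points.
    centers = rng.sample(points, k)
    assignments = None
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest center (squared Euclidean distance).
        new_assignments = [
            min(
                range(k),
                key=lambda c: (p[0] - centers[c][0]) ** 2 + (p[1] - centers[c][1]) ** 2,
            )
            for p in points
        ]
        # Step 4: stop when no assignment changed since the previous round.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: move each center to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centers, assignments


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = kmeans(points, k=2)
print(labels)  # the first three points share one label, the last three the other
```

The result depends on the initial centers, so in practice the algorithm is usually run several times with different seeds.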
10
Other clustering algorithms
Take into account:
homogeneity: similarity of instances inside a cluster.
separation: dissimilarity of instances in different clusters.
Allow "fuzzy clustering": an instance may belong to more than one cluster.
Hierarchical clustering
11
Hierarchical clustering
[Figure: a raw table (rows 1-5, columns C1 C2 C3 C4 C5 C6..) is turned into a similarity matrix of scores by a similarity criterion, and the similarity matrix is turned into a hierarchical clustering of items 1-5 by a cluster criterion]
12
Hierarchical clustering
Cluster criteria: UPGMA (you should already know it…), Neighbor-joining
[Figure: the same pipeline (raw table, similarity criterion, scores, cluster criterion), with a tree built over items A, B, C, D, E by successive merges such as (C,B) and ((C,B),E)]
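As a reminder of how UPGMA builds such a tree, here is a minimal sketch of average-linkage merging. It assumes the pairwise distances are given as a dictionary keyed by `frozenset` pairs, and it returns only the merge order as nested tuples (a full UPGMA implementation would also report branch heights). The function name `upgma` and the example distances are invented for illustration.

```python
def upgma(names, dist):
    """Average-linkage agglomerative clustering.

    names: list of leaf labels.
    dist: dict mapping frozenset({a, b}) to the distance between leaves a and b.
    Returns a nested tuple describing the merge order (topology only).
    """
    # Each cluster is (tree, list of member leaves).
    clusters = [(n, [n]) for n in names]

    def avg_dist(c1, c2):
        # UPGMA distance: unweighted mean over all leaf pairs across the two clusters.
        total = sum(dist[frozenset((a, b))] for a in c1[1] for b in c2[1])
        return total / (len(c1[1]) * len(c2[1]))

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: avg_dist(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        # Remove j first (j > i) so that index i stays valid.
        del clusters[j]
        del clusters[i]
        clusters.append(merged)
    return clusters[0][0]


# Hypothetical distances: B and C are closest, A is next, D is far from everything.
pairs = [(("A", "B"), 4), (("A", "C"), 4), (("A", "D"), 10),
         (("B", "C"), 1), (("B", "D"), 10), (("C", "D"), 10)]
dist = {frozenset(p): v for p, v in pairs}
print(upgma(["A", "B", "C", "D"], dist))
```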
13
Hierarchical clustering
Wait a minute… A tree is a clustering?!
14
Classifying
15
What is classification?
Input: a labeled training set and an unlabeled data set.
Learn to classify (assign labels) according to the features of the training set.
Output: labels for the unlabeled data set.
Example: qualified boy/girlfriend
16
Where to draw the line?!?!
[Figure: scatter plot, X axis 0 to 6, Y axis 0 to 100, points labeled Unqualified and Qualified]
20
Where to draw the line?!?!
[Figure: scatter plot, X axis 0 to 6, Y axis 0 to 100, points labeled Unqualified and Qualified]
Now consider dozens of features…
21
How to classify
KNN (K Nearest Neighbors)
Decision trees
SVM (Support Vector Machine)
Naïve Bayes
Bayesian Networks
NN (Neural Networks)
Many many more…
22
KNN (K Nearest Neighbors)
[Figure: scatter plot, X axis 0 to 6, Y axis 0 to 100]
Lazy (no pre-processing)
Local
Can deal with complex patterns
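A minimal KNN sketch makes the "lazy" and "local" properties concrete: nothing is computed ahead of time, and a query is labeled only from its k closest training points. The data values, the function name `knn_predict`, and the choice of squared Euclidean distance are assumptions made for this example.

```python
from collections import Counter


def knn_predict(train, query, k=3):
    """train: list of ((x, y), label); query: (x, y).

    Returns the majority label among the k nearest training points.
    """
    # "Lazy": there is no training step; all work happens at query time.
    by_distance = sorted(
        train,
        key=lambda item: (item[0][0] - query[0]) ** 2 + (item[0][1] - query[1]) ** 2,
    )
    # "Local": only the k closest points get a vote.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training data in the (X, Y) ranges of the plot.
train = [
    ((1.0, 20), "Unqualified"), ((1.2, 30), "Unqualified"), ((0.8, 25), "Unqualified"),
    ((4.0, 80), "Qualified"), ((4.5, 70), "Qualified"), ((5.0, 90), "Qualified"),
]
print(knn_predict(train, (4.2, 75)))
```

Because X and Y live on very different scales here, Y dominates the distance; features are often rescaled before KNN is applied.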
23
Decision trees
[Figure: scatter plot, X axis 0 to 6, Y axis 0 to 100, next to a decision tree with the tests X ≥ 1.7 / X < 1.7 and Y ≥ 36 / Y < 36, and leaves marked "?"]
Tree actually means something!
Can deal with complex patterns
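The slide does not say how thresholds such as X ≥ 1.7 are chosen. One common approach, assumed here purely for illustration, is to try every midpoint between observed values and keep the split with the lowest weighted Gini impurity. The sketch below finds a single split (one node of a tree); the names `gini` and `best_split` and the data are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of labels (0 means the group is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(rows):
    """rows: list of (features_tuple, label).

    Returns (feature_index, threshold) of the best single split,
    where a row goes right if feature >= threshold and left otherwise.
    """
    best = None
    best_score = float("inf")
    n_features = len(rows[0][0])
    for f in range(n_features):
        values = sorted({feats[f] for feats, _ in rows})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [lab for feats, lab in rows if feats[f] < t]
            right = [lab for feats, lab in rows if feats[f] >= t]
            # Weighted impurity of the two sides of the split.
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if score < best_score:
                best_score = score
                best = (f, t)
    return best


# Hypothetical data: feature 0 plays the role of X, feature 1 the role of Y.
rows = [
    ((1.0, 50), "Unqualified"), ((1.4, 80), "Unqualified"),
    ((2.0, 60), "Qualified"), ((3.0, 40), "Qualified"),
]
print(best_split(rows))  # picks feature 0 with a threshold between 1.4 and 2.0
```

A full tree would apply the same search recursively to each side of the split until the groups are pure or too small.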
24
SVM (Support Vector Machine)
25
SVM (Support Vector Machine)
Finds the optimal linear separation.
Maximizes the margin between the two data sets.
Can use a transformation to a higher dimension when the data are not linearly separable.
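To make "maximizes the margin" concrete, the sketch below only evaluates the margin of a given separating line w·x + b = 0; it does not train an SVM. Training would search for the w and b that make this quantity as large as possible. The function name `margin`, the ±1 label encoding, and the data are assumptions for this example.

```python
import math


def margin(w, b, data):
    """Smallest signed distance from any point to the line w.x + b = 0.

    data: list of ((x, y), label) with label +1 or -1.
    A positive result means the line separates the two classes.
    """
    norm = math.hypot(w[0], w[1])
    return min(label * (w[0] * p[0] + w[1] * p[1] + b) / norm for p, label in data)


# Hypothetical, linearly separable data.
data = [((0, 0), -1), ((0, 1), -1), ((4, 0), +1), ((4, 1), +1)]

print(margin((1, 0), -2, data))  # the line x = 2, halfway between the classes
print(margin((1, 0), -1, data))  # the line x = 1, closer to the negative class
```

Both lines separate the data, but the first one leaves more room on each side, which is the line an SVM would prefer.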
26
Naïve Bayes
[Figure: features X and Y, each linked to the class]
Can easily compute: P(class | X), for each of the two classes
Can do the same for: P(class | Y), for each of the two classes
Score(class) = P(class | X,Y)
Score(class) = P(class | X) · P(class | Y)
27
Naïve Bayes – graphical representation
[Figure: features X, Y and Z, each linked to the class, annotated with P(class | X), P(class | Y) and P(class | Z)]
Score(class) = P(class | X,Y,Z) = P(class | X) · P(class | Y) · P(class | Z)
What if there are dependencies??
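The score on this slide can be estimated directly from counts. The sketch below follows the slide's notation, multiplying the estimated probability of the class given each feature value; note that the textbook formulation of Naïve Bayes instead multiplies the class prior by the per-feature likelihoods P(feature | class), and the two can rank classes differently when the classes are unbalanced. The categorical feature values, the class names, and the function names `train` and `score` are hypothetical.

```python
from collections import Counter, defaultdict


def train(rows):
    """rows: list of (features_tuple, label).

    Counts how often each label occurs together with each feature value.
    """
    counts = defaultdict(Counter)  # (feature_index, value) -> Counter of labels
    for feats, label in rows:
        for i, v in enumerate(feats):
            counts[(i, v)][label] += 1
    return counts


def score(counts, feats, label):
    """Slide-style score: product over features of the estimated P(label | feature value)."""
    result = 1.0
    for i, v in enumerate(feats):
        seen = counts[(i, v)]
        total = sum(seen.values())
        # Assumption: an unseen feature value contributes a factor of 0.
        result *= seen[label] / total if total else 0.0
    return result


# Hypothetical training data with two categorical features (X, Y).
rows = [
    (("a", "c"), "Qualified"), (("a", "c"), "Qualified"), (("a", "d"), "Qualified"),
    (("b", "d"), "Unqualified"), (("b", "d"), "Unqualified"),
    (("b", "c"), "Unqualified"), (("a", "d"), "Unqualified"),
]
counts = train(rows)
for label in ("Qualified", "Unqualified"):
    print(label, score(counts, ("a", "c"), label))
```

The predicted class is simply the one with the higher score.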
28
Bayesian network
[Figure: network over X, Y and Z, annotated with P(class | X,Z), P(class | Y) and P(X | Z)]
Score(class) = P(class | X,Y,Z) = P(class | X,Z) · P(class | Y)
A Bayesian network takes dependencies into account
29
How to choose a classifier (estimate performance)?
Use a labeled test set (in addition to the training set)
Cross-validation:
10-fold
Leave-one-out
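Cross-validation can be sketched as follows: the labeled data are split into k folds, each fold is held out once as the test set, and the accuracies are averaged. With k = 10 this is 10-fold cross-validation; with k equal to the number of examples it is leave-one-out. The helper names, the `fit`/`predict` callback convention, and the toy majority-class classifier are assumptions for this sketch, which also assumes k is no larger than the number of examples.

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k folds; yields (train_indices, test_indices) pairs."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for test in folds:
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        yield train, test


def cross_validate(rows, k, fit, predict):
    """Average accuracy over k folds.

    rows: list of (features, label).
    fit(train_rows) -> model; predict(model, features) -> label.
    """
    accuracies = []
    for train_idx, test_idx in k_fold_indices(len(rows), k):
        model = fit([rows[i] for i in train_idx])
        correct = sum(predict(model, rows[i][0]) == rows[i][1] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / len(accuracies)


# Toy classifier for demonstration: always predict the most common training label.
def fit_majority(train_rows):
    labels = [label for _, label in train_rows]
    return max(set(labels), key=labels.count)


def predict_majority(model, features):
    return model


rows = [((i,), "A" if i < 8 else "B") for i in range(10)]
print(cross_validate(rows, 10, fit_majority, predict_majority))  # leave-one-out here, since k == len(rows)
```

Comparing this average accuracy across several classifiers is one way to choose among them without touching the final test set.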
30
Legionella pneumophila case study
31
How did it all begin?
32
Legionnaires' disease nowadays
33
Legionella pneumophila
[Figure] Copyright © 2005 Nature Publishing Group. Created by Arkitek from Nature Reviews Microbiology
34
Identifying the effectors
35
The features
Homology to host proteins
Regulatory elements
Genome proximity to other effectors
Secretion signal
Abundance in Metazoa / Bacteria
GC content
Sequence homology
36
The effectors machine
37
The big picture
Prior knowledge: validated effectors, non-effectors
Features: similarity to known effectors, similarity to host proteins, regulatory elements, G-C content, secretory signals, abundance in Metazoa\Bacteria, genome arrangement
Feature selection
Classification algorithms: NN, SVM, Naïve Bayes, Bayesian Net, Voting
Trained model, applied to unclassified genes: predicted effectors, predicted non-effectors
Experimental validation: newly validated effectors
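The slide lists "Voting" alongside the individual classifiers but does not say how the votes are combined. A simple majority vote, assumed here for illustration only, would look like this; the function name `majority_vote` and the example predictions are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """predictions: list of labels, one per classifier. Returns the most frequent label.

    Assumption: in case of a tie, the label that appeared first wins.
    """
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical predictions for one gene from NN, SVM, Naïve Bayes and Bayesian Net.
print(majority_vote(["effector", "non-effector", "effector", "effector"]))
```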
38
Does it really work??