Published by Camron Malone. Modified over 9 years ago.
1
1 Gene family classification using a semi-supervised learning method Nan Song Advisors: John Lafferty, Dannie Durand
2
2 Outline Introduction – A motivating application: genome annotation A graphical model of sequence relatedness Gene classification using machine learning Empirical evaluation Conclusion
3
The complete genetic material of an organism or species The Genome
4
Key genomic component: genes ACCCTTAGCTAGACCTTTAGGAGG... A gene is a DNA subsequence
5
Key genomic component: genes. Genes encode proteins, the building blocks of the cell. A gene is a DNA subsequence: ACCCTTAGCTAGACCTTTAGGAGG... A protein is an amino acid sequence: V H L T P E...
6
6 Whole Genome Sequencing 413 whole genome sequences: 41 eukarya, 28 archaea, 344 bacteria. In progress: 1034 prokaryotic genomes, 629 eukaryotic genomes. Source: www.genomesonline.org
7
atgcaccttg
8
8 Gene prediction and annotation (International Human Genome Consortium, Nature 2001)
– Known genes: 14,882
– Predicted genes: 16,896
– Total: 31,778
9
Gene annotation We are given a new genome sequence with predicted genes. A few genes are well studied. Identify other genes in the same family to predict function. Verify predictions experimentally Two contexts: –Individual scientist –High throughput
10
10 Outline Introduction –Molecular biology –A motivating application: genome annotation A graphical model of sequence relatedness Gene classification using machine learning Empirical evaluation Conclusion
11
11 Evolutionarily related genes have related functions Ancestral gene atgccaggactcccagtga… atgcgccgtctggcatgt… β-globin atgcaaggagtcccagagc… γ-globin atgcgaggtctcccatgt… ε-globin Adult Fetal Embryonic Duplication
12
Evolutionarily related genes have related functions. Gene family classification is a powerful source of information for inferring evolutionary, functional and structural properties of genes. [Figure: an ancestral gene gives rise by duplication to β-globin, γ-globin and ε-globin; example sequences: atgcgaggtctcccatgt…, atgccaggactcccagtga…, atgcgccgtctggcatgt…, atgcaaggagtcccagagc…]
13
13 Outline Introduction A graphical model of sequence relatedness Gene classification using machine learning Empirical evaluation Conclusion
14
14 A graphical model of sequence relatedness G = (V,E) V: vertices represent sequences, e.g. x_i = …atgcaaggagtcccagagcc… and x_j = …atgcgaggtctcccagtgtc… E: the weight of an edge is proportional to the similarity between the two sequences.
16
16 Gene family classification Goal: Given known genes, identify genes in the same family. Biological scenario: a small number of known genes and a large number of unknown genes.
17
17 Outline Introduction A graphical model of sequence relatedness Gene classification using machine learning Empirical evaluation Conclusion
18
18 Framework: binary classification Determine which unlabeled genes belong to the family. Machine learning scenario: a small number of labeled data (genes known to be in the family, genes clearly not in the family) and a large number of unlabeled data.
19
19 Several challenging problems of gene family classification Traditionally, similarity is represented by sequence comparison. [Figure: an ancestral gene diverges through duplication, mutations (e.g. atgcgccgtctggcatgt… becoming atgcgccccccggcatgt…) and DNA shuffling.]
21
21 Several challenging problems of gene family classification Families –do not form a clique –do not form a connected component –have edges to sequences outside the family.
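The three graph properties listed on this slide can be checked mechanically. The following is a minimal sketch, assuming an unweighted edge list over sequence identifiers; the function names and the toy graph are illustrative and do not come from the talk. "Connected" is read here as "all family members lie in one connected component of the whole graph".

```python
from collections import deque

def build_adjacency(edges):
    """Turn an undirected edge list into {vertex: set of neighbours}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def is_clique(family, adj):
    """True if every family member is adjacent to every other member."""
    family = set(family)
    return all(family - {u} <= adj.get(u, set()) for u in family)

def in_same_component(family, adj):
    """True if all family members are reachable from one another in the graph."""
    family = set(family)
    start = next(iter(family))
    seen = {start}
    queue = deque([start])
    while queue:  # breadth-first search over the whole graph
        u = queue.popleft()
        for v in adj.get(u, set()) - seen:
            seen.add(v)
            queue.append(v)
    return family <= seen

def outside_edges(family, adj):
    """Edges that join a family member to a sequence outside the family."""
    family = set(family)
    return sorted((u, v) for u in family for v in adj.get(u, set()) - family)

# Hypothetical toy graph: a, b, c form the family, x is an outsider.
adj = build_adjacency([("a", "b"), ("b", "c"), ("c", "x")])
family = {"a", "b", "c"}
print(is_clique(family, adj))          # a and c are not adjacent
print(in_same_component(family, adj))  # a path a-b-c exists
print(outside_edges(family, adj))      # the c-x edge leaves the family
```

In a weighted similarity graph, an edge would first have to be defined by a weight threshold; that threshold is a modelling choice the slides do not specify.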
22
22 Outline Introduction A graphical model of sequence relatedness Gene classification using machine learning –Semi-supervised learning algorithm –Supervised learning algorithm Empirical evaluation Conclusion
23
23 Gene family classification Goal: Binary classification Machine learning scenario: a large number of unlabeled data and a small number of labeled data. Semi-supervised learning: exploits information from both labeled and unlabeled data; has performed well in many applications.
24
24 Graphical semi-supervised learning (Binary classification) Notation: V: the whole data set L: labeled data set U: unlabeled data set Each vertex: (x_i, y_i) if labeled, or (x_k, f(k)) if unlabeled, e.g. (x_i, y_i = 1), (x_j, y_j = 0), (x_k, f(k)). Reference: Xiaojin Zhu, Zoubin Ghahramani, John Lafferty, The Twentieth International Conference on Machine Learning (ICML-2003)
25
25 Graphical semi-supervised learning (Binary classification) Input: –family members: (x_i, y_i = 1) –nonfamily members: (x_j, y_j = 0) Output: –Assign a real value f(k) to every vertex x_k in the graph –Find a cutoff to separate the two classes. Reference: Xiaojin Zhu, Zoubin Ghahramani, John Lafferty, The Twentieth International Conference on Machine Learning (ICML-2003)
26
26 Graphical semi-supervised learning (Binary classification) G = (V,E) L: labeled data set (vertices with y = 0 or y = 1) U: unlabeled data set (vertices (x_k, f(k))) S_ij: label on the edge between x_i and x_j. Assign real values f to all vertices in the graph so as to minimize the energy E(f).
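A minimal sketch of this minimization, assuming the quadratic energy of the cited Zhu, Ghahramani and Lafferty paper, E(f) = 1/2 Σ_ij w_ij (f(i) − f(j))^2, with labeled vertices held fixed at 0 or 1. The minimizer of that energy is harmonic: each unlabeled score equals the weighted average of its neighbours' scores. The code approximates the solution by repeated averaging rather than by the closed-form matrix solve; the function name, the dict-of-dicts graph and the toy chain are illustrative and not taken from the talk.

```python
def harmonic_scores(adj, labels, max_iter=10000, tol=1e-10):
    """Approximate the harmonic function on a weighted graph.

    adj:    {vertex: {neighbour: weight}}, symmetric, weights > 0
    labels: {vertex: 0 or 1} for the labeled vertices only
    Returns {vertex: score}; labeled vertices keep their label.
    """
    # Unlabeled vertices start at 0.5 (an arbitrary starting point).
    f = {v: float(labels.get(v, 0.5)) for v in adj}
    unlabeled = [v for v in adj if v not in labels]
    for _ in range(max_iter):
        max_change = 0.0
        for k in unlabeled:
            total = sum(adj[k].values())
            if total == 0:
                continue  # isolated vertex: nothing to average over
            new = sum(w * f[j] for j, w in adj[k].items()) / total
            max_change = max(max_change, abs(new - f[k]))
            f[k] = new
        if max_change < tol:
            break
    return f

# Hypothetical chain a - u1 - u2 - b with unit weights.
adj = {
    "a": {"u1": 1.0},
    "u1": {"a": 1.0, "u2": 1.0},
    "u2": {"u1": 1.0, "b": 1.0},
    "b": {"u2": 1.0},
}
labels = {"a": 1, "b": 0}  # a is a family member, b is not
scores = harmonic_scores(adj, labels)
print(scores["u1"], scores["u2"])  # about 2/3 and 1/3
```

Scores closer to 1 indicate vertices that are better connected to labeled family members; a cutoff on the scores then separates the two classes.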
27
27 Graph-based semi-supervised learning f(x k ) http://www.cs.wisc.edu/~jerryzhu/research/ssl/animation.html Works well
28
28 Graph-based semi-supervised learning f(x k ) http://www.cs.wisc.edu/~jerryzhu/research/ssl/animation.html Works well Works well ?
29
29 Outline Introduction A graphical model of sequence relatedness Gene classification using machine learning –Semi-supervised learning –Supervised learning Empirical evaluation Conclusion
30
Semi-supervised vs kernel-based supervised learning Semi-supervised learning: Supervised learning: where L is the labeled data set and U is the unlabeled data set
31
31 Outline Introduction A graphical model of sequence relatedness Gene classification using machine learning Empirical evaluation –Methodology –Results Conclusion
32
32 Graph construction G = (V,E) V: all mouse sequences from SwissProt (n = 7439) E: edges weighted by a newly designed sequence similarity measure S(i, j), with 0 < S(i, j) < 1
33
33 Methodology Graph construction Test set construction Experiments performed Basis for evaluation
34
Test set construction 18 well studied protein families –Receptors, enzymes, transcription factors, motor proteins, structural proteins, and extracellular matrix proteins. Families: ACSL, FOX, Laminin, PDE, TRAF, ADAM, GATA, SEMA, T-box, DVL, Kinase, Myosin, USP, FGF, Kinesin, Notch, TNFR, WNT
35
35 Test set construction Retrieved all complete mouse sequences from SwissProt database (7,439) Identified sequences for each test family based on – Nomenclature committee reports – Structural properties – Literature surveys
36
36 Methodology Graph construction Test set construction Experiments performed Basis for evaluation
37
Experiments performed Compare semi-supervised with supervised learning algorithm Tested parameters: – Scaling parameter, σ, in the kernel function – Number of Labeled Family members (LF) – Number of Labeled Nonfamily members (LN)
38
Tested parameters: number of Labeled Family members (LF), number of Labeled Nonfamily members (LN), σ. For each set of parameters, 20 tests were performed.
39
Tested parameters (1) Tested σ values: 0.05, 0.1, 0.5, 1, 2, 10, 100 [Plot: W against S, both on a 0 to 1 scale, one curve per σ: σ = 100, 10, 1, 0.5, 0.2, 0.1, 0.08, 0.05, 0.02]
40
Tested parameters (2) Labeled Family members (LF): 10-70% of family size. Labeled Nonfamily members (LN): 100, 500, 1000 (about 1-10% of nonfamily size). Database size: 7439
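One way to set up the repeated tests for a single parameter setting is sketched below. The slides do not say how the labeled examples were chosen, so random sampling without replacement is an assumption, and the function name, identifiers and seed are hypothetical. The sizes follow the slides: a family of 30 (FOX) in a database of 7439 sequences, LF = 10% of the family, LN = 100, and 20 tests.

```python
import random

def sample_labeled_sets(family, nonfamily, lf_fraction, ln, rng):
    """Draw one labeled set: a fraction of the family plus ln nonfamily members."""
    n_lf = max(1, round(lf_fraction * len(family)))
    labeled_family = rng.sample(sorted(family), n_lf)
    labeled_nonfamily = rng.sample(sorted(nonfamily), ln)
    return labeled_family, labeled_nonfamily

rng = random.Random(0)  # fixed seed so the draws are reproducible
family = [f"fam{i}" for i in range(30)]       # placeholder identifiers
nonfamily = [f"non{i}" for i in range(7409)]  # 7439 - 30 remaining sequences

# 20 independent draws for one (LF, LN) setting.
trials = [sample_labeled_sets(family, nonfamily, 0.10, 100, rng) for _ in range(20)]
print(len(trials), len(trials[0][0]), len(trials[0][1]))
```

Everything not drawn into a labeled set is treated as unlabeled and scored by the classifier in that test.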
41
41 Methodology Graph construction Test set construction Experiments performed Basis for evaluation
42
42 Semi-supervised learning Goal: f(i) > f(j) when x_i is a family member and x_j is not. Evaluation criteria: –Visualization –AUC score –False negatives
43
Visualization Sort all unlabeled data by f(x). [Rank plot: f(x) against rank, with family members and nonfamily members marked separately]
44
AUC (Area Under ROC Curve) [ROC curve: sensitivity against 1 - specificity] [Rank plot: f(x) against rank, with family members and nonfamily members marked separately]
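The AUC can be computed directly from the scores f(x) without drawing the ROC curve: it equals the fraction of (family member, nonfamily member) pairs in which the family member receives the higher score, with ties counted as one half. A minimal sketch follows; the function name and the toy scores are illustrative, not data from the talk.

```python
def auc_score(scores, is_family):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for s, y in zip(scores, is_family) if y]
    neg = [s for s, y in zip(scores, is_family) if not y]
    if not pos or not neg:
        raise ValueError("need at least one family and one nonfamily member")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # correctly ordered pair
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))

# Hypothetical scores for five unlabeled sequences.
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
is_family = [True, True, False, True, False]
auc = auc_score(scores, is_family)
print(auc)  # 5 of the 6 pairs are correctly ordered
```

The pairwise loop is quadratic in the number of sequences; for a database of several thousand sequences a rank-based formula would be faster, but the result is the same.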
45
Advantages of rank plot AUC = 0.9382
46
AUC scores do not reflect all the information we need. Additional measure: false negatives after the first false positive, i.e. the number of family members that are missed because they rank below the first nonfamily member.
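This count can be sketched as follows, assuming the unlabeled sequences are ranked by decreasing f(x) and that the "first false positive" is the highest-ranked nonfamily member. The function name and toy data are illustrative; ties in f(x) are broken by input order, which is an arbitrary choice.

```python
def missed_after_first_fp(scores, is_family):
    """Count family members ranked below the top-ranked nonfamily member."""
    ranked = sorted(zip(scores, is_family), key=lambda p: p[0], reverse=True)
    for rank, (_, fam) in enumerate(ranked):
        if not fam:
            # Everything below this point that is a family member is missed.
            return sum(1 for _, y in ranked[rank + 1:] if y)
    return 0  # no nonfamily member in the list at all

# Hypothetical scores for five unlabeled sequences.
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
is_family = [True, True, False, True, False]
missed = missed_after_first_fp(scores, is_family)
print(missed)  # one family member (score 0.4) ranks below the first nonfamily member
```

Unlike the AUC, this measure looks only at the top of the ranking, which is the part an experimentalist would follow up on.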
47
47 Outline Introduction A graphical model of sequence relatedness Gene classification using machine learning Empirical evaluation –Methodology –Results Conclusion
48
48 Several challenging problems of gene family classification Families –do not form a clique –do not form a connected component –have edges to sequences outside the family. Edges to sequences outside the family are mainly a problem if they have strong edge weights
49
49 Test families have different graph properties W: Edges to sequences outside the family have weak edge weights S: Edges to sequences outside the family have strong edge weights
50
Results Compare semi-supervised with supervised learning algorithm Tested parameters: – Scaling parameter, σ, in the kernel function – Number of Labeled Family members (LF) – Number of Labeled Nonfamily members (LN)
51
Tested parameters: number of Labeled Family members (LF), number of Labeled Nonfamily members (LN), σ [Plot: average AUC against σ for Notch, LF = 1, LN = 1000]
52
The effect of σ [Plot: W against raw similarity score (s), both on a 0 to 1 scale, one curve per σ: σ = 100, 10, 1, 0.5, 0.2, 0.1, 0.08, 0.05, 0.02]
53
53 Test families have different graph properties W: Edges to sequences outside the family have weak edge weights S: Edges to sequences outside the family have strong edge weights
54
Edges to sequences outside the family are mainly a problem if they have strong edge weights
55
[Histograms for FOX and Notch: number of edges against raw edge weight]
56
Case study: rank plots for semi-supervised learning in FOX (LF = 3, LN = 100, family size: 30) [Panels: σ = 0.1, σ = 1, σ = 10, σ = 100]
57
Case study: rank plots for semi-supervised learning in Notch (labeled family seqs: 1 out of 4; labeled nonfamily seqs: 100 out of 7435) [Panels: σ = 0.1, σ = 1, σ = 10]
58
[Plots: average AUC against σ for Notch (LF = 1, LN = 1000) and for FOX (LF = 3, LN = 1000)]
59
Summary on σ For most families, the performance is not very sensitive to σ. For almost all families that form a clique, there is at least one value of σ (usually many) –such that both semi-supervised and supervised learning algorithms have perfect classification performance
60
Results Compare semi-supervised with supervised learning algorithm Tested parameters: – Scaling parameter, σ, in the kernel function – Number of Labeled Family members (LF) – Number of Labeled Nonfamily members (LN)
61
61 Test families have different graph properties W: Edges to sequences outside the family have weak edge weights S: Edges to sequences outside the family have strong edge weights
62
The connection among sequences in ADAM family [Histogram: number of connected ADAM sequences]
63
The connection among sequences in ADAM family
64
Tested parameters: number of Labeled Family members (LF), number of Labeled Nonfamily members (LN), σ. For each combination of LF and LN, take the maximum over σ, i.e. the σ that achieves the best average AUC score.
65
The impact of number of labeled family and nonfamily members on the performance [Plot for ADAM: AUC against number of labeled family seqs (LF); curve: Supervised, LN = 100]
66
The impact of number of labeled family and nonfamily members on the performance [Plot for ADAM: AUC against number of labeled family seqs (LF); curves: Supervised, LN = 100; Semi-supervised, LN = 100] A paired t-test was performed to detect the difference between the semi-supervised and supervised methods for each set of parameters.
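The paired t-test compares the two methods on the same trials, so it works on the per-trial differences in AUC. A minimal sketch of the test statistic is given below; the function name is illustrative and the AUC values are made-up placeholders, not results from the talk. Converting the statistic to a p-value needs the Student t distribution with n - 1 degrees of freedom, which is left out here.

```python
import math
import statistics

def paired_t_statistic(a, b):
    """Paired t statistic and degrees of freedom for two matched samples."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two samples of equal length >= 2")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1

# Placeholder AUC scores from four matched trials (NOT real results).
semi_auc = [0.95, 0.97, 0.93, 0.96]
supervised_auc = [0.90, 0.95, 0.92, 0.91]
t_stat, dof = paired_t_statistic(semi_auc, supervised_auc)
print(t_stat, dof)
```

A large positive statistic would indicate that the first method scores consistently higher across the matched trials.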
67
The impact of number of labeled family and nonfamily members on the performance [Plot for ADAM: AUC against number of labeled family seqs (LF); curves: Supervised, LN = 100; Supervised, LN = 1000; Semi-supervised, LN = 100]
68
The impact of number of labeled family and nonfamily members on the performance [Plot for ADAM: AUC against number of labeled family seqs (LF); curves: Supervised, LN = 100; Supervised, LN = 1000; Semi-supervised, LN = 100; Semi-supervised, LN = 1000]
69
Graph structure of ADAM Troublemaker: ADAMTS10 matches only 8 out of 26 sequences in the ADAM family. ADAMTS10 is often misclassified. ADAMTS10 is implicated in a genetic disease that causes impaired vision and heart defects.
70
70 Semi-supervised method Supervised method
71
71 Several challenging problems of gene family classification Sequences in the same family –do not form a clique –do not exist in the same connected component Sequences in different families –have significant matches
72
72 Test families have different graph properties W: Edges to sequences outside the family have weak edge weights S: Edges to sequences outside the family have strong edge weights
73
The connection among sequences in TNFR family
74
[Histogram: number of connected TNFR sequences] 20 TNFR in this connected component
75
The impact of number of labeled family and nonfamily members on the performance [Plot for TNFR (family size 24): AUC against number of labeled family seqs (LF); curves: Semi-supervised, LN = 100; Semi-supervised, LN = 1000; Supervised, LN = 100; Supervised, LN = 1000]
76
Summary for Number of labeled family members The performance of both semi-supervised and supervised learning improves as LF increases for all families. In non-clique families, semi-supervised learning performs better than supervised when LF is small.
77
Rank plots for semi-supervised learning in TNFR (σ = 0.1, LF = 2, LN = 100) AUC values do not reflect all the information that we need
78
The impact of number of labeled family and nonfamily members on the performance [Plot for TNFR (family size 24): number of missed TNFR against number of labeled family seqs (LF); curves: Semi-supervised, LN = 100; Semi-supervised, LN = 1000; Supervised, LN = 100; Supervised, LN = 1000]
80
Summary for Number of labeled non-family members (LN) The performance of supervised learning improves as LN increases for all families. For semi-supervised learning, increasing LN is sometimes helpful and sometimes not.
81
81 Summary of results Clique Connected
82
Insights - 1 SSL is most effective for families that are not cliques but are connected. In the test set, 12/18 families are cliques and 3/18 are not connected. What fraction of protein families are cliques? Is the large number of cliques in the test set due to sample bias?
83
Insights - 2 Performance evaluation measures should match the needs of the user. – AUC scores penalize all FNs and FPs. – For experimental biologists, top ranked predictions are of interest – The number of FNs after the first false positive can reveal some information
84
Insights - 3 The semi-supervised learning algorithm provides an appealing visualization tool for identifying family members, especially when the number of known family members is small
85
Acknowledgements John Lafferty Dannie Durand Jerry Zhu Durand Lab Robbie Sedgewick Rose Hoberman Ben Vernot Narayanan Raghupathy Aiton Goldman Jacob Joseph Annette McLeod Maureen Stolzer
86
Thank You !