1 Gene family classification using a semi-supervised learning method Nan Song Advisors: John Lafferty, Dannie Durand.


1 Gene family classification using a semi-supervised learning method. Nan Song. Advisors: John Lafferty, Dannie Durand

2 Outline Introduction – A motivating application: genome annotation A graphical model of sequence relatedness Gene classification using machine learning Empirical evaluation Conclusion

3 The Genome: the complete genetic material of an organism or species

4 Key genomic component: genes ACCCTTAGCTAGACCTTTAGGAGG... A gene is a DNA subsequence

5 Key genomic component: genes. A gene is a DNA subsequence: ACCCTTAGCTAGACCTTTAGGAGG... Genes encode proteins, the building blocks of the cell. A protein is an amino acid sequence: V H L T P E...

6 Whole Genome Sequencing. 413 whole genome sequences: 41 eukarya, 28 archaea, 344 bacteria. In progress: 1034 prokaryotic genomes, 629 eukaryotic genomes. Source: www.genomesonline.org

7 atgcaccttg

8 Gene prediction and annotation (International Human Genome Consortium, Nature 2001). Known genes: 14,882. Predicted genes: 16,896. Total: 31,778.

9 Gene annotation. We are given a new genome sequence with predicted genes. A few genes are well studied. Identify other genes in the same family to predict function. Verify predictions experimentally. Two contexts: individual scientist; high throughput.

10 Outline Introduction –Molecular biology –A motivating application: genome annotation A graphical model of sequence relatedness Gene classification using machine learning Empirical evaluation Conclusion

11 Evolutionarily related genes have related functions. [Figure: an ancestral gene gives rise, by duplication, to β-globin (adult), γ-globin (fetal) and ε-globin (embryonic), each shown with its DNA sequence]

12 Evolutionarily related genes have related functions. Gene family classification is a powerful source of information for inferring evolutionary, functional and structural properties of genes. [Figure: an ancestral gene gives rise, by duplication, to β-globin, γ-globin and ε-globin, each shown with its DNA sequence]

13 Outline Introduction A graphical model of sequence relatedness Gene classification using machine learning Empirical evaluation Conclusion

14 A graphical model of sequence relatedness. G = (V, E). V: vertices represent sequences, e.g. x_i = …atgcaaggagtcccagagcc… and x_j = …atgcgaggtctcccagtgtc… E: the weight of an edge is proportional to the similarity between the two sequences.

15 A graphical model of sequence relatedness. G = (V, E). V: vertices represent sequences (x_i, x_j). E: the weight of an edge is proportional to the similarity between the two sequences.

16 Gene family classification. Goal: given known genes, identify genes in the same family. Biological scenario: a small number of known genes and a large number of unknown genes.

17 Outline Introduction A graphical model of sequence relatedness Gene classification using machine learning Empirical evaluation Conclusion

18 Framework: binary classification. Determine which unlabeled genes belong to the family. Machine learning scenario: a small number of labeled data (genes known to be in the family, genes clearly not in the family) and a large number of unlabeled data.

19 Several challenging problems of gene family classification. Traditionally, similarity is represented by sequence comparison. [Figure: example sequences descending from an ancestral gene by duplication, then diverging through mutations and DNA shuffling]

20 Several challenging problems of gene family classification. Traditionally, similarity is represented by sequence comparison. [Figure: example sequences descending from an ancestral gene by duplication, then diverging through mutations and DNA shuffling]

21 Several challenging problems of gene family classification. Families –do not necessarily form a clique –do not necessarily form a connected component –may have edges to sequences outside the family.
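The three graph properties listed on this slide can be checked mechanically once the similarity graph is available. The following is a minimal sketch, not the presenter's code: it assumes the graph is given as a set of undirected edges (`frozenset` pairs), and it interprets "connected component" as "the family is connected using only edges between family members". The function names are hypothetical.

```python
from collections import deque


def is_clique(family, edges):
    """Return True if every pair of family members is joined by an edge."""
    members = list(family)
    return all(
        frozenset((a, b)) in edges
        for i, a in enumerate(members)
        for b in members[i + 1:]
    )


def is_connected_within_family(family, edges):
    """Return True if the family is connected using family-internal edges only."""
    family = set(family)
    if not family:
        return True
    start = next(iter(family))
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        # Visit every family member not yet reached that is adjacent to u.
        for v in family - seen:
            if frozenset((u, v)) in edges:
                seen.add(v)
                queue.append(v)
    return seen == family


def outside_edges(family, edges):
    """Return the edges with exactly one endpoint inside the family."""
    family = set(family)
    return [e for e in edges if len(e & family) == 1]


# Toy graph: a-b, b-c inside the family, c-x leaving it.
toy_edges = {frozenset(p) for p in [("a", "b"), ("b", "c"), ("c", "x")]}
toy_family = {"a", "b", "c"}
print(is_clique(toy_family, toy_edges))                   # a-c is missing
print(is_connected_within_family(toy_family, toy_edges))
print(outside_edges(toy_family, toy_edges))
```

In the toy graph the family is connected but not a clique, and it has one edge to an outside sequence, which is the kind of family the later slides single out.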

22 Outline Introduction A graphical model of sequence relatedness Gene classification using machine learning –Semi-supervised learning algorithm –Supervised learning algorithm Empirical evaluation Conclusion

23 Gene family classification. Goal: binary classification. Machine learning scenario: a large number of unlabeled data and a small number of labeled data. Semi-supervised learning: exploits information from both labeled and unlabeled data; has performed well in many applications.

24 Graphical semi-supervised learning (binary classification). Notation: V: the whole data set; L: labeled data set; U: unlabeled data set. Each vertex is either labeled, (x_i, y_i), or unlabeled, (x_k, f(k)). Examples: (x_i, y_i = 1), (x_j, y_j = 0), (x_k, f(k)). Reference: Xiaojin Zhu, Zoubin Ghahramani, John Lafferty, The Twentieth International Conference on Machine Learning (ICML-2003).

25 Graphical semi-supervised learning (binary classification). Input: –family members: (x_i, y_i = 1) –nonfamily members: (x_j, y_j = 0). Output: –Assign a real value f(k) to every vertex in the graph –Find a cutoff to separate the two classes. Reference: Xiaojin Zhu, Zoubin Ghahramani, John Lafferty, The Twentieth International Conference on Machine Learning (ICML-2003).

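26 Graphical semi-supervised learning (binary classification). G = (V, E); L: labeled data set; U: unlabeled data set. Labeled vertices carry y = 0 or y = 1; unlabeled vertices carry (x_k, f(k)). Assign real values to all vertices in the graph to minimize $E(f) = \frac{1}{2} \sum_{i,j} S_{ij} \, (f(i) - f(j))^2$, with f(i) = y_i held fixed on the labeled vertices and S_ij the edge weight between x_i and x_j.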
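The minimization on this slide can be illustrated with a short sketch. It assumes E(f) is the quadratic energy of Zhu, Ghahramani and Lafferty (2003), in which squared differences f(i) - f(j) are weighted by S_ij. Under that assumption the minimizer is harmonic: each unlabeled value equals the weighted average of its neighbours' values, so repeatedly applying that average converges when every unlabeled vertex can reach a labeled one. The function name and input layout are hypothetical, and this is an illustration rather than the implementation used in the talk.

```python
def harmonic_labels(weights, labels, n_iter=1000, tol=1e-9):
    """Iteratively compute harmonic values on a weighted graph.

    weights: dict mapping an undirected pair (i, j) to its edge weight S_ij.
    labels:  dict mapping labeled vertices to 0 or 1 (held fixed).
    Returns a dict mapping every vertex that appears in an edge to f(vertex).
    """
    # Build a symmetric adjacency structure from the undirected edge list.
    neighbors = {}
    for (i, j), w in weights.items():
        neighbors.setdefault(i, {})[j] = w
        neighbors.setdefault(j, {})[i] = w

    # Labeled vertices start (and stay) at their label; unlabeled start at 0.5.
    f = {v: float(labels.get(v, 0.5)) for v in neighbors}
    unlabeled = [v for v in neighbors if v not in labels]

    for _ in range(n_iter):
        max_change = 0.0
        for k in unlabeled:
            total = sum(neighbors[k].values())
            if total == 0:
                continue
            # Harmonic condition: f(k) is the weighted mean of its neighbours.
            new_value = sum(w * f[j] for j, w in neighbors[k].items()) / total
            max_change = max(max_change, abs(new_value - f[k]))
            f[k] = new_value
        if max_change < tol:
            break
    return f


# Toy example: one unlabeled sequence "u" between a family member "a"
# (strong edge) and a nonfamily member "b" (weak edge).
scores = harmonic_labels({("a", "u"): 3.0, ("u", "b"): 1.0}, {"a": 1, "b": 0})
print(scores["u"])
```

The unlabeled vertex is pulled toward the label of the neighbour it is most similar to; a cutoff on f then separates the two classes.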

27 Graph-based semi-supervised learning: f(x_k). Animation: http://www.cs.wisc.edu/~jerryzhu/research/ssl/animation.html Works well.

28 Graph-based semi-supervised learning: f(x_k). Animation: http://www.cs.wisc.edu/~jerryzhu/research/ssl/animation.html Works well. Works well?

29 Outline Introduction A graphical model of sequence relatedness Gene classification using machine learning –Semi-supervised learning –Supervised learning Empirical evaluation Conclusion

30 Semi-supervised vs kernel-based supervised learning Semi-supervised learning: Supervised learning: where L is the labeled data set and U is the unlabeled data set
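The formulas on this slide did not survive in the transcript, so the exact supervised baseline is not known. As a point of comparison only, one common kernel-based supervised predictor is a kernel-weighted average over the labeled neighbours (Nadaraya-Watson style), which ignores all unlabeled vertices. The sketch below assumes that form; the function name and input layout are hypothetical.

```python
def kernel_supervised_score(k, weights, labels):
    """Score vertex k from its labeled neighbours only.

    weights: dict mapping an undirected pair (i, j) to its edge weight.
    labels:  dict mapping labeled vertices to 0 or 1.
    Returns the weighted mean label of k's labeled neighbours,
    or None if k has no labeled neighbour.
    """
    numerator = 0.0
    denominator = 0.0
    for (i, j), w in weights.items():
        if i == k and j in labels:
            other = j
        elif j == k and i in labels:
            other = i
        else:
            continue  # edge does not connect k to a labeled vertex
        numerator += w * labels[other]
        denominator += w
    return numerator / denominator if denominator > 0 else None


toy_weights = {("a", "u"): 3.0, ("u", "b"): 1.0, ("u", "v"): 5.0}
toy_labels = {"a": 1, "b": 0}
print(kernel_supervised_score("u", toy_weights, toy_labels))
print(kernel_supervised_score("v", toy_weights, toy_labels))
```

In the toy graph, "v" has no labeled neighbour, so a predictor of this kind cannot score it at all, whereas a graph-based semi-supervised method could still reach it through the unlabeled vertex "u".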

31 Outline Introduction A graphical model of sequence relatedness Gene classification using machine learning Empirical evaluation –Methodology –Results Conclusion

32 Graph construction. G = (V, E). V: all mouse sequences from SwissProt (n = 7439). E: based on a newly designed sequence similarity measure S, with 0 < S(i, j) < 1.

33 Methodology Graph construction Test set construction Experiments performed Basis for evaluation

34 Test set construction. 18 well studied protein families –Receptors, enzymes, transcription factors, motor proteins, structural proteins, and extracellular matrix proteins. Families: ACSL, FOX, Laminin, PDE, TRAF, ADAM, GATA, SEMA, T-box, DVL, Kinase, Myosin, USP, FGF, Kinesin, Notch, TNFR, WNT.

35 Test set construction. Retrieved all complete mouse sequences from the SwissProt database (7,439). Identified sequences for each test family based on – Nomenclature committee reports – Structural properties – Literature surveys

36 Methodology Graph construction Test set construction Experiments performed Basis for evaluation

37 Experiments performed. Compare the semi-supervised with the supervised learning algorithm. Tested parameters: – Scaling parameter, σ, in the kernel function – Number of Labeled Family members (LF) – Number of Labeled Nonfamily members (LN)

38 Tested parameters: number of Labeled Family members, number of Labeled Nonfamily members, σ. For each set of parameters, 20 tests were performed.

39 Tested parameters (1). Tested σ values: 0.05, 0.1, 0.5, 1, 2, 10, 100. [Figure: edge weight W as a function of raw similarity score S, both axes from 0 to 1, one curve per σ; curve labels include σ = 100, 10, 1, 0.5, 0.2, 0.1, 0.08, 0.05, 0.02]
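The kernel function itself is not preserved in the transcript, so the following only illustrates how a scaling parameter σ can reshape raw similarity scores into edge weights. It uses a Gaussian kernel on the distance 1 - S purely as an assumed stand-in; the function name `edge_weight` and the kernel form are hypothetical, not taken from the talk.

```python
import math


def edge_weight(s, sigma):
    """Map a raw similarity s in [0, 1] to an edge weight in (0, 1].

    Assumed form: a Gaussian kernel on the distance d = 1 - s.
    Large sigma flattens the curve (all weights near 1);
    small sigma keeps only near-identical pairs strongly connected.
    """
    d = 1.0 - s
    return math.exp(-(d * d) / (sigma * sigma))


# Print the weight of a moderately similar pair for several sigma values.
for sigma in (0.05, 0.1, 0.5, 1, 2, 10, 100):
    print(sigma, edge_weight(0.5, sigma))
```

Under this assumed kernel, very large σ makes every edge look equally strong, while very small σ effectively removes all but the strongest edges, which is one way the choice of σ could change the graph the classifier sees.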

40 Tested parameters (2). Labeled Family members (LF): 10-70% of family size. Labeled Nonfamily members (LN): 100, 500, 1000 (about 1-10% of nonfamily size). Database size: 7439.

41 Methodology Graph construction Test set construction Experiments performed Basis for evaluation

42 Semi-supervised learning. Goal: f(i) > f(j) when x_i is a family member and x_j is not. Evaluation criteria: visualization; AUC score; false negatives.

43 Visualization: sort all unlabeled data by f(x). [Rank plot: f(x) versus rank, with family members and nonfamily members marked]

44 AUC (Area Under ROC Curve). [Figure: ROC curve of sensitivity versus 1 - specificity, next to a rank plot of f(x) versus rank with family members and nonfamily members marked]

45 Advantages of rank plot AUC = 0.9382

46 AUC scores do not reflect all the information we need. False negatives after the first false positive: the number of family members ranked below the first false positive, i.e. the number of missed data after the first false positive.
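Both evaluation measures can be computed directly from the ranked scores. The sketch below is illustrative and uses hypothetical function names. It computes AUC via its standard interpretation as the probability that a randomly chosen family member outscores a randomly chosen nonfamily member (ties count one half), and it counts the family members that appear below the first nonfamily member in the ranking.

```python
def auc_from_scores(scored):
    """AUC for a list of (score, is_family) pairs.

    Equals the fraction of (family, nonfamily) pairs in which the
    family member has the higher score; ties count as one half.
    """
    pos = [s for s, is_family in scored if is_family]
    neg = [s for s, is_family in scored if not is_family]
    if not pos or not neg:
        raise ValueError("need at least one family and one nonfamily item")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def false_negatives_after_first_fp(scored):
    """Count family members ranked below the first nonfamily member."""
    ranked = sorted(scored, key=lambda item: item[0], reverse=True)
    seen_false_positive = False
    missed = 0
    for _, is_family in ranked:
        if not is_family:
            seen_false_positive = True
        elif seen_false_positive:
            missed += 1
    return missed


# Toy ranking: two family members on top, then a nonfamily sequence,
# then one more family member, then another nonfamily sequence.
toy = [(0.9, True), (0.8, True), (0.7, False), (0.6, True), (0.1, False)]
print(auc_from_scores(toy))
print(false_negatives_after_first_fp(toy))
```

The pairwise loop is quadratic in the number of sequences, which is fine for illustration; a rank-based formula would be preferable for a database of several thousand sequences.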

47 Outline Introduction A graphical model of sequence relatedness Gene classification using machine learning Empirical evaluation –Methodology –Results Conclusion

48 Several challenging problems of gene family classification. Families –do not form a clique –do not form a connected component –have edges to sequences outside the family. Edges to sequences outside the family are mainly a problem if they have strong edge weights.

49 Test families have different graph properties. W: edges to sequences outside the family have weak edge weights. S: edges to sequences outside the family have strong edge weights.

50 Results. Compare the semi-supervised with the supervised learning algorithm. Tested parameters: – Scaling parameter, σ, in the kernel function – Number of Labeled Family members (LF) – Number of Labeled Nonfamily members (LN)

51 Tested parameters: number of Labeled Family members, number of Labeled Nonfamily members, σ. [Figure: average AUC versus σ for Notch, LF = 1, LN = 1000]

52 The effect of σ. [Figure: edge weight W as a function of raw similarity score (s), both axes from 0 to 1, one curve per σ; curve labels include σ = 100, 10, 1, 0.5, 0.2, 0.1, 0.08, 0.05, 0.02]

53 Test families have different graph properties. W: edges to sequences outside the family have weak edge weights. S: edges to sequences outside the family have strong edge weights.

54 Edges to sequences outside the family are mainly a problem if they have strong edge weights

55 [Figure: number of edges versus raw edge weight, for FOX and Notch]

56 Case study: rank plots for semi-supervised learning in FOX (LF = 3, LN = 100, family size: 30). Panels: σ = 0.1, σ = 1, σ = 10, σ = 100.

57 Case study: rank plots for semi-supervised learning in Notch. Labeled family seqs: 1 (out of 4); labeled nonfamily seqs: 100 (out of 7435). Panels: σ = 0.1, σ = 1, σ = 10.

58 [Figure: average AUC versus σ for Notch (LF = 1, LN = 1000) and FOX (LF = 3, LN = 1000)]

59 Summary on σ. For most families, the performance is not very sensitive to σ. For almost all families that form a clique, there is at least one value of σ (usually many) such that both the semi-supervised and the supervised learning algorithms have perfect classification performance.

60 Results. Compare the semi-supervised with the supervised learning algorithm. Tested parameters: – Scaling parameter, σ, in the kernel function – Number of Labeled Family members (LF) – Number of Labeled Nonfamily members (LN)

61 Test families have different graph properties. W: edges to sequences outside the family have weak edge weights. S: edges to sequences outside the family have strong edge weights.

62 The connection among sequences in ADAM family. [Figure: number of connected ADAM sequences]

63 The connection among sequences in ADAM family

64 Tested parameters: number of Labeled Family members, number of Labeled Nonfamily members, σ. For each number of labeled family members and labeled nonfamily members, take the σ that achieves the best average AUC score.

65 The impact of number of labeled family and nonfamily members on the performance. [Figure: AUC versus number of labeled family sequences (LF) for ADAM; curve: Supervised, LN = 100]

66 The impact of number of labeled family and nonfamily members on the performance. [Figure: AUC versus number of labeled family sequences (LF) for ADAM; curves: Supervised, LN = 100; Semi-supervised, LN = 100] A paired t-test was performed to detect the difference between the semi-supervised and supervised methods for a set of parameters.
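A paired t-test of this kind compares the two methods on the same test runs by testing whether the mean of the per-run AUC differences is zero. The sketch below computes only the test statistic and degrees of freedom with the standard library; the function name is hypothetical and the AUC values are made-up placeholders, not results from the talk.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired samples.

    t = mean(d) / (stdev(d) / sqrt(n)), where d are the per-run differences.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired observations")
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1


# Placeholder AUC values for three paired runs (illustrative only).
semi_supervised_auc = [0.9, 0.8, 0.7]
supervised_auc = [0.8, 0.6, 0.6]
t_value, dof = paired_t_statistic(semi_supervised_auc, supervised_auc)
print(t_value, dof)
```

The resulting statistic would then be compared against a t distribution with n - 1 degrees of freedom (19 if all 20 tests per parameter set are paired) to obtain a p-value.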

67 The impact of number of labeled family and nonfamily members on the performance. [Figure: AUC versus number of labeled family sequences (LF) for ADAM; curves: Supervised, LN = 100; Supervised, LN = 1000; Semi-supervised, LN = 100]

68 The impact of number of labeled family and nonfamily members on the performance. [Figure: AUC versus number of labeled family sequences (LF) for ADAM; curves: Supervised, LN = 100; Supervised, LN = 1000; Semi-supervised, LN = 100; Semi-supervised, LN = 1000]

69 Graph structure of ADAM. Troublemaker: ADAMTS10 matches only 8 out of 26 sequences in the ADAM family. ADAMTS10 is often misclassified. ADAMTS10 is implicated in a genetic disease that causes impaired vision and heart defects.

70 [Figure: semi-supervised method versus supervised method]

71 Several challenging problems of gene family classification. Sequences in the same family –do not form a clique –do not exist in the same connected component. Sequences in different families –have significant matches.

72 Test families have different graph properties. W: edges to sequences outside the family have weak edge weights. S: edges to sequences outside the family have strong edge weights.

73 The connection among sequences in TNFR family

74 [Figure: number of connected TNFR sequences] 20 TNFR in this connected component.

75 The impact of number of labeled family and nonfamily members on the performance. TNFR (family size 24). [Figure: AUC versus number of labeled family sequences; curves: Semi-supervised, LN = 100; Semi-supervised, LN = 1000; Supervised, LN = 100; Supervised, LN = 1000]

76 Summary for Number of labeled family members The performance of both semi-supervised and supervised learning improves as LF increases for all families. In non-clique families, semi-supervised learning performs better than supervised when LF is small.

77 Rank plots for semi-supervised learning in TNFR (σ = 0.1, LF = 2, LN = 100). AUC values do not reflect all the information that we need.

78 The impact of number of labeled family and nonfamily members on the performance. TNFR (family size 24). [Figure: number of missed TNFR versus number of labeled family sequences; curves: Semi-supervised, LN = 100; Semi-supervised, LN = 1000; Supervised, LN = 100; Supervised, LN = 1000]

79 Summary for Number of labeled family members The performance of both semi-supervised and supervised learning improves as LF increases for all families. In non-clique families, semi-supervised learning performs better than supervised when LF is small.

80 Summary for Number of labeled non-family members (LN). The performance of supervised learning improves as LN increases for all families. For semi-supervised learning, increasing LN is sometimes helpful and sometimes not.

81 Summary of results Clique Connected

82 Insights - 1 SSL is most effective for families that are not cliques but are connected. In test set, 12/18 cliques, 3/18 not connected. What fraction of protein families are cliques? Is the large number of cliques in the test set due to sample bias?

83 Insights - 2 Performance evaluation measures should match the needs of the user. – AUC scores penalize all FNs and FPs. – For experimental biologists, top ranked predictions are of interest – The number of FNs after the first false positive can reveal some information

84 Insights - 3 The semi-supervised learning algorithm provides an appealing visualization tool for identifying family members, especially when the number of known family members is small.

85 Acknowledgements John Lafferty Dannie Durand Jerry Zhu Durand Lab Robbie Sedgewick Rose Hoberman Ben Vernot Narayanan Raghupathy Aiton Goldman Jacob Joseph Annette McLeod Maureen Stolzer

86 Thank You !
