Gene family classification using a semi-supervised learning method
Nan Song
Advisors: John Lafferty, Dannie Durand

Outline
Introduction
– A motivating application: genome annotation
A graphical model of sequence relatedness
Gene classification using machine learning
Empirical evaluation
Conclusion

The Genome
The complete genetic material of an organism or species

Key genomic component: genes
A gene is a DNA subsequence, e.g. ACCCTTAGCTAGACCTTTAGGAGG...
Genes encode proteins, the building blocks of the cell.
A protein is an amino acid sequence, e.g. V H L T P E...

Whole Genome Sequencing
413 whole genome sequences: 41 eukarya, 28 archaea, 344 bacteria
In progress: 1034 prokaryotic genomes, 629 eukaryotic genomes
Source: www.genomesonline.org

Gene prediction and annotation
Known genes: 14,882
Predicted genes: 16,896
Total: 31,778
Source: International Human Genome Consortium, Nature 2001

Gene annotation
We are given a new genome sequence with predicted genes.
A few genes are well studied.
Identify other genes in the same family to predict function.
Verify predictions experimentally.
Two contexts:
– Individual scientist
– High throughput

Outline
Introduction
– Molecular biology
– A motivating application: genome annotation
A graphical model of sequence relatedness
Gene classification using machine learning
Empirical evaluation
Conclusion

Evolutionarily related genes have related functions
[Figure: an ancestral gene (atgccaggactcccagtga…) gives rise by duplication to β-globin (atgcgccgtctggcatgt…, adult), γ-globin (atgcaaggagtcccagagc…, fetal) and ε-globin (atgcgaggtctcccatgt…, embryonic)]

Evolutionarily related genes have related functions
Gene family classification is a powerful source of information for inferring evolutionary, functional and structural properties of genes.
[Figure: duplication of an ancestral gene into β-globin, γ-globin and ε-globin]

A graphical model of sequence relatedness
G = (V, E)
V: vertices x_i, x_j represent sequences (e.g. …atgcaaggagtcccagagcc… and …atgcgaggtctcccagtgtc…).
E: the weight of the edge between x_i and x_j is proportional to the similarity between the two sequences.
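The graph G = (V, E) described above can be sketched as a nested dictionary of edge weights. This is only an illustration: `toy_similarity` is a hypothetical stand-in (fraction of matching positions), not the similarity measure used in the talk, and `build_graph` and `min_weight` are names introduced here.

```python
from itertools import combinations


def toy_similarity(a, b):
    """Hypothetical stand-in for S(i, j): fraction of matching positions."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / longest


def build_graph(sequences, similarity=toy_similarity, min_weight=0.0):
    """Return {name: {neighbor: weight}}, keeping edges with weight > min_weight."""
    graph = {name: {} for name in sequences}
    for u, v in combinations(sequences, 2):
        w = similarity(sequences[u], sequences[v])
        if w > min_weight:
            # The graph is undirected, so store the weight in both directions.
            graph[u][v] = w
            graph[v][u] = w
    return graph


# Made-up toy sequences; only x_i and x_j are similar enough to share an edge.
seqs = {"x_i": "atgcaaggag", "x_j": "atgcgaggtc", "x_k": "ccctttaaag"}
graph = build_graph(seqs, min_weight=0.5)
```

Vertices with no sufficiently similar partner, such as `x_k` here, stay in the graph with an empty neighbor dictionary.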

Gene family classification
Goal: given known genes, identify genes in the same family.
Biological scenario:
– small number of known genes
– large number of unknown genes

Framework: binary classification
Determine which unlabeled genes belong to the family.
Machine learning scenario:
– small number of labeled data: genes known to be in the family, and genes clearly not in the family
– large number of unlabeled data

Several challenging problems of gene family classification
Traditionally, similarity is represented by sequence comparison.
[Figure: after duplication of an ancestral gene, mutations (e.g. atgcgccccccggcatgt…) and DNA shuffling (sequence segments rearranged or replaced) change the descendant sequences]

Several challenging problems of gene family classification
Families
– do not form a clique
– do not form a connected component
– have edges to sequences outside the family.
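The three graph properties listed above can be checked directly on a weighted graph. The sketch below assumes the graph is stored as `{vertex: {neighbor: weight}}`; the function names and the toy graph are illustrative, not taken from the talk.

```python
from collections import deque


def is_clique(graph, family):
    """True if every pair of family members is joined by an edge."""
    members = sorted(family)
    return all(v in graph[u]
               for i, u in enumerate(members)
               for v in members[i + 1:])


def in_same_component(graph, family):
    """True if all family members lie in one connected component of the graph."""
    family = set(family)
    if not family:
        return True
    start = next(iter(family))
    seen = {start}
    queue = deque([start])
    while queue:  # breadth-first search over the whole graph
        u = queue.popleft()
        for v in graph[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return family <= seen


def outside_edges(graph, family):
    """List (member, outsider, weight) for edges leaving the family."""
    family = set(family)
    return [(u, v, w)
            for u in sorted(family)
            for v, w in graph[u].items()
            if v not in family]


# Toy graph: a-b and b-c are family edges, c-x leaves the family.
toy = {
    "a": {"b": 0.9},
    "b": {"a": 0.9, "c": 0.8},
    "c": {"b": 0.8, "x": 0.7},
    "x": {"c": 0.7},
}
family = {"a", "b", "c"}
```

In this toy graph the family is connected but not a clique (there is no a-c edge), and it has one edge to an outside sequence.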

Outline
Introduction
A graphical model of sequence relatedness
Gene classification using machine learning
– Semi-supervised learning algorithm
– Supervised learning algorithm
Empirical evaluation
Conclusion

Gene family classification
Goal: binary classification
Machine learning scenario:
– large number of unlabeled data
– small number of labeled data
Semi-supervised learning:
– Exploits information from both labeled and unlabeled data
– Has performed well in many applications

Graphical semi-supervised learning (binary classification)
Xiaojin Zhu, Zoubin Ghahramani, John Lafferty. The Twentieth International Conference on Machine Learning (ICML-2003)
Notation:
– V: the whole data set
– L: labeled data set
– U: unlabeled data set
– Each vertex: (x_i, y_i) if labeled, or (x_k, f(k)) if unlabeled
Input:
– family members: (x_i, y_i = 1)
– nonfamily members: (x_j, y_j = 0)
Output:
– Assign a real value to every vertex in the graph
– Find a cutoff to separate the two classes

Graphical semi-supervised learning (binary classification)
G = (V, E), with L the labeled data set and U the unlabeled data set
Labeled vertices carry y = 1 or y = 0; an unlabeled vertex x_k carries f(k); the edge between x_i and x_j is annotated S_ij.
Assign real values to all vertices in the graph so as to minimize the energy E(f).
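In the cited Zhu, Ghahramani and Lafferty (ICML 2003) formulation, the energy is E(f) = 1/2 Σ_{i,j} w_ij (f(i) − f(j))², and its minimizer with the labeled values held fixed is harmonic: each unlabeled f(k) equals the weighted average of its neighbors. The sketch below assumes the talk uses this form, with edge weights derived from S_ij; it finds the minimizer by repeated averaging rather than by a matrix solve, and `harmonic_labels` is a name introduced here.

```python
def harmonic_labels(weights, labels, n_iter=1000, tol=1e-9):
    """Iteratively average neighbor values on unlabeled vertices.

    weights: {vertex: {neighbor: weight}}, symmetric, every vertex a key.
    labels:  {vertex: 0 or 1} for the labeled vertices (held fixed).
    Returns {vertex: f(vertex)} with real values in [0, 1].
    """
    # Unlabeled vertices start at 0.5; labeled vertices keep their label.
    f = {v: float(labels.get(v, 0.5)) for v in weights}
    for _ in range(n_iter):
        max_change = 0.0
        for v in weights:
            if v in labels:
                continue
            total = sum(weights[v].values())
            if total == 0:
                continue  # isolated vertex: no information reaches it
            new = sum(w * f[u] for u, w in weights[v].items()) / total
            max_change = max(max_change, abs(new - f[v]))
            f[v] = new
        if max_change < tol:
            break
    return f


# Toy chain a - b - c - d with unit weights; a is in the family, d is not.
chain = {
    "a": {"b": 1.0},
    "b": {"a": 1.0, "c": 1.0},
    "c": {"b": 1.0, "d": 1.0},
    "d": {"c": 1.0},
}
f = harmonic_labels(chain, {"a": 1, "d": 0})
```

On this chain the values fall off linearly from the positive label to the negative one, so `b` scores higher than `c` even though neither is adjacent to both labeled vertices.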

Graph-based semi-supervised learning
Animation: http://www.cs.wisc.edu/~jerryzhu/research/ssl/animation.html
[Figure: f(x_k) on two example graphs, one where the method works well and one where it is an open question whether it works well]

Semi-supervised vs kernel-based supervised learning
Semi-supervised learning: f(k) is computed from both L and U.
Supervised learning: f(k) is computed from L only.
Here L is the labeled data set and U is the unlabeled data set.
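The formulas on this slide did not survive in the transcript. As an illustration only, one common kernel-based supervised predictor is a weighted average of the labels of the labeled neighbors; whether the talk used exactly this form is an assumption, and `supervised_score` is a hypothetical name.

```python
def supervised_score(weights, labels, k):
    """Weighted average of labels over the labeled neighbors of vertex k.

    Only vertices in L (the keys of `labels`) contribute; unlabeled
    neighbors are ignored. Returns None when k has no labeled neighbor.
    """
    num = 0.0
    den = 0.0
    for j, w in weights[k].items():
        if j in labels:
            num += w * labels[j]
            den += w
    return num / den if den > 0 else None


# Toy graph: b touches both labeled vertices, c only touches unlabeled b.
toy = {
    "a": {"b": 0.8},
    "b": {"a": 0.8, "c": 0.6, "d": 0.2},
    "c": {"b": 0.6},
    "d": {"b": 0.2},
}
labels = {"a": 1, "d": 0}
score_b = supervised_score(toy, labels, "b")
score_c = supervised_score(toy, labels, "c")
```

Vertex `c` gets no score from this predictor because its only edge leads to an unlabeled vertex, whereas a method that also propagates values through U can still reach it.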

Outline
Introduction
A graphical model of sequence relatedness
Gene classification using machine learning
Empirical evaluation
– Methodology
– Results
Conclusion

Graph construction
G = (V, E)
V: all mouse sequences from SwissProt (n = 7439)
E: based on a newly designed sequence similarity measure, with 0 < S(i, j) < 1

Methodology
– Graph construction
– Test set construction
– Experiments performed
– Basis for evaluation

Test set construction
18 well-studied protein families
– Receptors, enzymes, transcription factors, motor proteins, structural proteins, and extracellular matrix proteins.
Families: ACSL, ADAM, DVL, FGF, FOX, GATA, Kinase, Kinesin, Laminin, Myosin, Notch, PDE, SEMA, T-box, TNFR, TRAF, USP, WNT

Test set construction
Retrieved all complete mouse sequences from the SwissProt database (7,439)
Identified sequences for each test family based on
– Nomenclature committee reports
– Structural properties
– Literature surveys

Experiments performed
Compare the semi-supervised with the supervised learning algorithm
Tested parameters:
– Scaling parameter, σ, in the kernel function
– Number of Labeled Family members (LF)
– Number of Labeled Nonfamily members (LN)

Tested parameters
[Figure: parameter grid with axes number of Labeled Family members (LF), number of Labeled Nonfamily members (LN), and σ]
For each set of parameters, 20 tests were performed.

Tested parameters (1)
Tested σ values: 0.05, 0.1, 0.5, 1, 2, 10, 100
[Figure: edge weight W as a function of similarity S, both on a 0 to 1 scale, with one curve per σ from 0.02 to 100]

Tested parameters (2)
Labeled Family members (LF): 10-70% of family size
Labeled Nonfamily members (LN): 100, 500, 1000 (about 1-10% of nonfamily size)
Database size: 7439
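The sampling scheme above can be sketched as follows. The gene names, the fixed seed, and the rule of labeling at least one family member are assumptions made for illustration; the family size of 30 follows the FOX case study, and 20 draws matches the number of tests per parameter set.

```python
import random


def sample_labeled(family, nonfamily, lf_fraction, ln, rng):
    """Draw one labeled set: a fraction of the family plus ln nonfamily genes."""
    # Assumption: always label at least one family member.
    lf = max(1, round(lf_fraction * len(family)))
    labeled_family = rng.sample(sorted(family), lf)
    labeled_nonfamily = rng.sample(sorted(nonfamily), ln)
    labels = {gene: 1 for gene in labeled_family}
    labels.update({gene: 0 for gene in labeled_nonfamily})
    return labels


rng = random.Random(0)  # fixed seed so the draws are reproducible
family = [f"fam{i}" for i in range(30)]        # placeholder gene names
nonfamily = [f"non{i}" for i in range(7409)]   # 7439 sequences minus 30
# 20 independent draws for one parameter setting (LF = 10%, LN = 100).
trials = [sample_labeled(family, nonfamily, 0.1, 100, rng) for _ in range(20)]
```

Every gene not present in a trial's `labels` dictionary is treated as unlabeled for that trial.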

Semi-supervised learning
Goal: f(i) > f(j) when x_i is a family member and x_j is not.
Evaluation criteria:
– Visualization
– AUC score
– False negatives

Visualization
Sort all unlabeled data by f(x).
[Figure: rank plot of f(x) against rank, with family members and nonfamily members marked]

AUC (Area Under ROC Curve) and rank plot
[Figure: ROC curve (sensitivity against 1 - specificity) next to the rank plot (f(x) against rank, with family members and nonfamily members marked)]
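The AUC equals the probability that a randomly chosen family member is ranked above a randomly chosen nonfamily member, which gives a direct way to compute it from the scores f(x). The sketch below uses that pairwise definition, counting ties as one half; the function name and the example scores are made up for illustration.

```python
def auc_score(scores, is_family):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for s, fam in zip(scores, is_family) if fam]
    neg = [s for s, fam in zip(scores, is_family) if not fam]
    if not pos or not neg:
        raise ValueError("need at least one family and one nonfamily member")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # family member correctly ranked higher
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


# Made-up scores: one nonfamily member (0.8) outranks one family member (0.7).
example_auc = auc_score([0.9, 0.8, 0.7, 0.3], [True, False, True, False])
```

The pairwise loop is quadratic in the number of sequences, which is acceptable for a sketch; a rank-based formula gives the same value faster on large data sets.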

Advantages of rank plot
[Figure: rank plot for a classification with AUC = 0.9382]

AUC scores do not reflect all the information we need
False negatives after the first false positive: the number of family members missed because they rank below the first false positive.
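Counting false negatives after the first false positive only needs the ranked list. A minimal sketch, with a made-up example and a function name introduced here; ties in score are broken by input order:

```python
def false_negatives_after_first_fp(scores, is_family):
    """Count family members ranked below the highest-ranked nonfamily member."""
    ranked = sorted(zip(scores, is_family), key=lambda pair: pair[0], reverse=True)
    seen_false_positive = False
    missed = 0
    for _, fam in ranked:
        if not fam:
            seen_false_positive = True
        elif seen_false_positive:
            missed += 1  # a family member below the first false positive
    return missed


# Made-up scores: the family member at 0.7 sits below the nonfamily member at 0.8.
missed = false_negatives_after_first_fp([0.9, 0.8, 0.7, 0.3],
                                        [True, False, True, False])
```

A value of zero means every family member outranks every nonfamily member, which is the case where a single cutoff separates the two classes perfectly.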

Several challenging problems of gene family classification
Families
– do not form a clique
– do not form a connected component
– have edges to sequences outside the family.
Edges to sequences outside the family are mainly a problem if they have strong edge weights.

Test families have different graph properties
W: edges to sequences outside the family have weak edge weights
S: edges to sequences outside the family have strong edge weights

Results
Compare the semi-supervised with the supervised learning algorithm
Tested parameters:
– Scaling parameter, σ, in the kernel function
– Number of Labeled Family members (LF)
– Number of Labeled Nonfamily members (LN)

Tested parameters: σ
[Figure: average AUC against σ for Notch, LF = 1, LN = 1000]

The effect of σ
[Figure: edge weight W as a function of raw similarity score (s), both on a 0 to 1 scale, with one curve per σ from 0.02 to 100]

Edges to sequences outside the family are mainly a problem if they have strong edge weights

[Figure: number of edges against raw edge weight, for FOX and for Notch]

Case study: rank plots for semi-supervised learning in FOX
LF = 3, LN = 100, family size: 30
[Figure: rank plots for σ = 0.1, σ = 1, σ = 10 and σ = 100]

Case study: rank plots for semi-supervised learning in Notch
Labeled family seqs: 1 (out of 4); labeled nonfamily seqs: 100 (out of 7435)
[Figure: rank plots for σ = 0.1, σ = 1 and σ = 10]

[Figure: average AUC against σ for Notch (LF = 1, LN = 1000) and for FOX (LF = 3, LN = 1000)]

Summary on σ
For most families, the performance is not very sensitive to σ.
For almost all families that form a clique, there is at least one value of σ (usually many) such that both the semi-supervised and the supervised learning algorithms achieve perfect classification performance.

The connection among sequences in the ADAM family
[Figure: number of connected ADAM sequences]

Tested parameters: number of Labeled Family members (LF) and number of Labeled Nonfamily members (LN)
For each combination of LF and LN, the σ that achieves the best average AUC score is used.

The impact of the number of labeled family and nonfamily members on the performance (ADAM)
[Figure: AUC against the number of labeled family sequences (LF), for supervised and semi-supervised learning, each with LN = 100 and LN = 1000]
A paired t-test was performed to detect the difference between the semi-supervised and the supervised method for a set of parameters.
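A paired t-test compares the two methods on the same test draws by looking at the per-draw differences in AUC. The sketch below computes only the t statistic and its degrees of freedom with the standard library; the AUC values are invented for illustration and are not results from the talk. A p-value would then come from a t table, or from `scipy.stats.ttest_rel` if SciPy is available.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for paired samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all paired differences are identical")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Invented AUC scores for four paired test draws (illustration only).
semi_auc = [0.90, 0.80, 0.95, 0.85]
sup_auc = [0.80, 0.75, 0.90, 0.80]
t_stat, dof = paired_t_statistic(semi_auc, sup_auc)
```

Pairing matters here because both methods are run on the same labeled and unlabeled split, so the per-split difficulty cancels out in the differences.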

Graph structure of ADAM
Troublemaker: ADAMTS10 matches only 8 of the 26 sequences in the ADAM family.
ADAMTS10 is often misclassified.
ADAMTS10 is implicated in a genetic disease that causes impaired vision and heart defects.

[Figure: semi-supervised method compared with supervised method]

Several challenging problems of gene family classification
Sequences in the same family
– do not form a clique
– do not exist in the same connected component
Sequences in different families
– have significant matches

The connection among sequences in the TNFR family
[Figure: number of connected TNFR sequences]
20 TNFR sequences are in this connected component.

The impact of the number of labeled family and nonfamily members on the performance: TNFR (family size 24)
[Figure: AUC for semi-supervised and supervised learning, each with LN = 100 and LN = 1000]

Summary for number of labeled family members (LF)
The performance of both semi-supervised and supervised learning improves as LF increases for all families.
In non-clique families, semi-supervised learning performs better than supervised learning when LF is small.

Rank plots for semi-supervised learning in TNFR
σ = 0.1, LF = 2, LN = 100
AUC values do not reflect all the information that we need.

The impact of the number of labeled family and nonfamily members on the performance: TNFR (family size 24)
[Figure: number of missed TNFR sequences for semi-supervised and supervised learning, each with LN = 100 and LN = 1000]

Summary for number of labeled nonfamily members (LN)
The performance of supervised learning improves as LN increases for all families.
For semi-supervised learning, increasing LN is sometimes helpful and sometimes not.

Summary of results
[Table: test families grouped by graph property (clique, connected)]

Insights - 1
SSL is most effective for families that are not cliques but are connected.
In the test set, 12/18 families are cliques and 3/18 are not connected.
What fraction of protein families are cliques?
Is the large number of cliques in the test set due to sample bias?

Insights - 2
Performance evaluation measures should match the needs of the user.
– AUC scores penalize all FNs and FPs.
– For experimental biologists, the top-ranked predictions are of interest.
– The number of FNs after the first false positive can reveal some information.

Insights - 3
The semi-supervised learning algorithm provides an appealing visualization tool for identifying family members, especially when the number of known family members is small.

Acknowledgements
John Lafferty, Dannie Durand, Jerry Zhu
Durand Lab: Robbie Sedgewick, Rose Hoberman, Ben Vernot, Narayanan Raghupathy, Aiton Goldman, Jacob Joseph, Annette McLeod, Maureen Stolzer

Thank You !
