Published by Gannon Redfern. Modified over 10 years ago.
1
Locating conserved genes in whole genome scale
Prudence Wong, University of Liverpool
June 2005
Joint work with HL Chan, TW Lam, HF Ting, SM Yiu (HKU), WK Sung (NUS)
2
Outline
- Motivation
- Challenges of Whole Genome Alignment
- Four approaches and their performance
  - Longest Common Subsequence
  - Clustering Approach
  - Mutation Sensitive Selection
  - Hybrid Approach
- Remarks
4
Mouse & Human
- Do they look the same?
- Mouse and human are genetically very similar
- What do we mean by similar? Many genes found in human are also found in mouse: conserved genes
[Figure labels: Mouse Chromosome 16, Human Chromosome 16, m16, h03]
5
Whole Genome Alignment
- Identify regions on the genomes that possibly contain their conserved genes.
- A difference in the ordering of conserved genes could be related to mutations.
- For related species, the number of mutations is usually small.
[Figure: Genome A and Genome B with Gene X, Gene Y, Gene Z; annotation: "possibly a mutation"]
7
Data size
Usually very large (e.g., human chromosomes vs. mouse chromosomes)
Examples:
Human Chr No.   Length   Mouse Chr No.   Length
1               245M     1               134M
3               200M     2               181M
11              135M     7               134M
15              100M     8               129M
20              64M      16              99M
Global alignment tools cannot be used because of the large size
8
Observations
- A conserved gene may not be identical in the two genomes; nevertheless, there are some common substrings unique to this conserved gene (each called a MUM)
- Locate all MUMs over the two genomes; yet not every MUM corresponds to a conserved gene
[Figure labels: Gene X, Gene Y, Gene X, Noise]
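To make the notion of a MUM concrete, here is a minimal sketch, assuming the usual reading of MUM as a maximal unique match: a common substring that occurs exactly once in each sequence and cannot be extended on either side. The function names and the `min_len` parameter are illustrative, and the brute-force search below is only meant for short strings; it is not the suffix tree construction used at genome scale.

```python
def count_occurrences(text, word):
    """Count possibly overlapping occurrences of word in text."""
    count, start = 0, text.find(word)
    while start != -1:
        count += 1
        start = text.find(word, start + 1)
    return count


def find_mums(a, b, min_len=3):
    """Return (pos_a, pos_b, length) for every maximal match that is
    unique in both a and b. Brute force; for illustration only."""
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # Skip matches that could be extended to the left.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend to the right as far as the characters agree.
            n = 0
            while i + n < len(a) and j + n < len(b) and a[i + n] == b[j + n]:
                n += 1
            word = a[i:i + n]
            if (n >= min_len
                    and count_occurrences(a, word) == 1
                    and count_occurrences(b, word) == 1):
                mums.append((i, j, n))
    return mums


print(find_mums("GATTACAGG", "CCGATTACT"))  # [(0, 2, 6)]
```

In the example, "GATTAC" is the only common substring of length at least 3 that is unique in both strings and maximal.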
9
Number of MUMs
Mouse Chr No.   Human Chr No.   # of MUMs
7               19              52,394
15              22              71,613
16              16              66,536
16              22              61,200
17              16              29,001
17              19              56,236
19              11              29,814
The number of MUMs is small compared with the chromosome lengths
10
MUMs for M16-H03
[Figure: MUMs and conserved genes, Mouse Chromosome 16 vs. Human Chromosome 03]
11
Generation of MUMs using a suffix tree
How to choose the right MUMs?
13
MUM Selection
- MUMmer-1 [Delcher et al. Nucleic Acids Research 1999]: longest common subsequences (effectively assumes no mutations)
- MUMmer-2 [Delcher et al. Nucleic Acids Research 2002] & MUMmer-3 [Kurtz et al. Genome Biology 2004]: clustering heuristics; most popular tool to uncover conserved genes at whole-genome (WG) scale
- MaxMinCluster [Wong et al. Bioinformatics 2004*]: clustering, optimization
- MSS, Mutation Sensitive Selection [Chan et al. Bioinformatics 2005*]: captures mutations
- Hybrid approach [Chan et al. Bioinformatics 2005*]: combines the mutation sensitive and clustering approaches
* our results
14
Overview of Results
Average coverage (sensitivity), in %

Method              Mouse/Human   Intragenus Baculoviridae   Intergenus Baculoviridae
MUMmer-3            77 (27)       66 (71)                    43 (62)
MaxMinCluster       84 (29)       69 (75)                    45 (59)
MSS                 91 (29)       79 (75)                    36 (53)
MUMmer-3+MSS        91 (28)       79 (75)                    48 (43)
MaxMinCluster+MSS   91 (27)       79 (82)                    51 (53)

coverage: % of published conserved genes reported
sensitivity: % of MUMs reported that reside in published conserved genes
15
Overview of Results (same table as slide 14)
MSS outperforms MaxMinCluster and MUMmer-3 on closely related species
16
Overview of Results (same table as slide 14)
BUT MSS performs worse on species relatively farther apart
17
Overview of Results (same table as slide 14)
Both hybrid approaches perform well for species farther apart
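The two measures defined on the results slides can be computed as in the sketch below. It assumes, for illustration only, that genes are given as (start, end) intervals on one genome, that a reported MUM is a (start, length) pair on the same genome, that a MUM "resides in" a gene when it lies entirely inside the interval, and that a gene counts as reported when at least one MUM resides in it. The slides do not state these criteria, and all names are hypothetical.

```python
def inside(mum, gene):
    """True if the MUM (start, length) lies entirely within gene (start, end)."""
    start, length = mum
    g_start, g_end = gene
    return g_start <= start and start + length <= g_end


def coverage_and_sensitivity(mums, genes):
    """Return (coverage %, sensitivity %) under the assumptions above."""
    covered = sum(1 for g in genes if any(inside(m, g) for m in mums))
    in_gene = sum(1 for m in mums if any(inside(m, g) for g in genes))
    coverage = 100.0 * covered / len(genes) if genes else 0.0
    sensitivity = 100.0 * in_gene / len(mums) if mums else 0.0
    return coverage, sensitivity


genes = [(0, 100), (200, 300), (400, 500), (600, 700)]
mums = [(10, 20), (50, 30), (220, 15), (350, 10)]
print(coverage_and_sensitivity(mums, genes))  # (50.0, 75.0)
```

Two of the four genes contain a reported MUM (coverage 50%), and three of the four reported MUMs fall inside a gene (sensitivity 75%).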
19
Longest Common Subsequence LCS
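The idea behind LCS-based selection can be sketched as follows: keep the heaviest set of MUMs that appear in the same order on both genomes. This is a simple quadratic dynamic program written for illustration, not MUMmer-1's actual implementation; the (pos_a, pos_b, length) tuple layout and the use of MUM length as weight are assumptions.

```python
def select_collinear_mums(mums):
    """Pick a maximum-weight chain of MUMs (pos_a, pos_b, length) whose
    positions increase in both genomes. O(n^2) dynamic programming."""
    mums = sorted(mums)
    n = len(mums)
    if n == 0:
        return []
    best = [m[2] for m in mums]   # best[i]: weight of the best chain ending at i
    prev = [-1] * n               # back-pointers for reconstructing the chain
    for i in range(n):
        for j in range(i):
            increasing = mums[j][0] < mums[i][0] and mums[j][1] < mums[i][1]
            if increasing and best[j] + mums[i][2] > best[i]:
                best[i] = best[j] + mums[i][2]
                prev[i] = j
    end = max(range(n), key=best.__getitem__)
    chain = []
    while end != -1:
        chain.append(mums[end])
        end = prev[end]
    return chain[::-1]


mums = [(0, 0, 10), (20, 50, 5), (30, 15, 8), (45, 30, 12), (60, 70, 6)]
print(select_collinear_mums(mums))
# [(0, 0, 10), (30, 15, 8), (45, 30, 12), (60, 70, 6)]
```

The MUM at (20, 50) is out of order relative to its neighbours and is dropped, which illustrates why this kind of selection effectively assumes no mutations.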
20
Outline: Motivation; Challenges of Whole Genome Alignment; Four approaches and their performance (Longest Common Subsequence, Clustering Approach, Mutation Sensitive Selection, Hybrid Approach); Remarks
- LCS approach (MUMmer-1) does not take mutations into account
- MUMmer-2 & -3 cluster by heuristics
- MaxMinCluster formalizes clustering as a combinatorial optimization problem
21
Clustering approach
Observations
- Noise MUMs are usually short and isolated
- A conserved gene usually contains a sequence of MUMs that are close and have sufficient length => clusters
[Figure labels: Gene X, Gene Y, Gene X, Noise]
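These observations suggest a very simple clustering rule, sketched below: chain MUMs that follow each other closely on both genomes, then discard chains whose total length is too small. This is a deliberately simplified illustration and not the heuristic used by MUMmer-2/3 or the MaxMinCluster formulation; the names `gap` and `min_size` and the (pos_a, pos_b, length) layout are assumptions.

```python
def cluster_mums(mums, gap=100, min_size=40):
    """Group MUMs (pos_a, pos_b, length) into clusters of nearby MUMs and
    keep clusters whose total MUM length is at least min_size."""
    clusters, current = [], []
    for m in sorted(mums):
        if current:
            pa, pb, ln = current[-1]
            da = m[0] - (pa + ln)   # distance to previous MUM on genome A
            db = m[1] - (pb + ln)   # distance to previous MUM on genome B
            if not (0 <= da <= gap and 0 <= db <= gap):
                clusters.append(current)
                current = []
        current.append(m)
    if current:
        clusters.append(current)
    return [c for c in clusters if sum(x[2] for x in c) >= min_size]


mums = [(0, 500, 20), (30, 530, 25), (400, 100, 10),
        (1000, 2000, 15), (1050, 2040, 30)]
for cluster in cluster_mums(mums):
    print(cluster)
# [(0, 500, 20), (30, 530, 25)]
# [(1000, 2000, 15), (1050, 2040, 30)]
```

The short, isolated MUM at (400, 100) forms a cluster of total length 10 and is filtered out as noise.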
22
Challenge: some conserved genes do not induce clusters of sufficient length
Solution: relax the definition of clusters to allow the presence of noise
23
Noisy cluster
Suppose Gap = 100, MinSize = 40
[Figure: a 1-noisy cluster; annotations: "> 100 apart", "length = 20"]
24
Noisy cluster
Suppose Gap = 100, MinSize = 40
[Figure: a 2-noisy cluster; annotations: "> 100 apart", "length = 20"]
25
MaxMinCluster
Problem formulation: find a collection of k-noisy clusters such that the smallest cluster has the maximum weight
Dynamic programming: O(k^2 n^2) time, O(k^2 n) space
26
Outline: Motivation; Challenges of Whole Genome Alignment; Four approaches and their performance (Longest Common Subsequence, Clustering Approach, Mutation Sensitive Selection, Hybrid Approach); Remarks
Capture mutations more directly
27
Mutation Sensitive Selection
- Select subsets of MUMs, where one subset is transformed into the other by a few mutations
- Three types of mutations: reversal, transposition, reversed-transposition
28
k-mutated subsequences
Given two sequences A and B and an integer k, a pair consisting of a subsequence X of A and a subsequence Y of B is called a pair of k-mutated subsequences if X can be transformed into Y by at most k mutations
[Figure: a pair of 2-mutated subsequences; labels: reversal, transposition]
MUMs are signed; a reversal flips the sign of the MUMs it reverses
29
Mutation Sensitive Selection
Problem formulation: find a pair of k-mutated subsequences with maximum weight
- We believe that the problem is NP-hard: the Genome Rearrangement Problem, believed to be NP-hard, can be reduced to this problem
- We give an efficient approximation algorithm
  - the resulting weight is at least 1/(3k+1) times the maximum possible weight
  - O(n^2 log n + k n^2) time, O(n^2) space
31
Hybrid Approach
- First apply a clustering approach to identify clusters which are obviously conserved genes
  - can apply either MUMmer-3 or MaxMinCluster
  - these clusters are treated as MUMs with bigger weight
- Then apply MSS to process these MUMs together with the remaining MUMs
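The first stage of the hybrid approach can be pictured as the preprocessing sketch below: each cluster is collapsed into a single item whose weight is the sum of its MUM lengths, and the result is merged with the MUMs that were not clustered. How the talk actually assigns the "bigger weight" is not stated, so summing lengths is an assumption, as are the function names and the (pos_a, pos_b, weight) layout. The selection step (MSS) itself is not implemented here.

```python
def collapse_cluster(cluster):
    """Merge a cluster of (pos_a, pos_b, length) MUMs into one weighted
    item (pos_a, pos_b, weight). Weight = total MUM length (assumption)."""
    start_a = min(p for p, _, _ in cluster)
    start_b = min(q for _, q, _ in cluster)
    weight = sum(length for _, _, length in cluster)
    return (start_a, start_b, weight)


def build_selection_input(clusters, all_mums):
    """Return the collapsed clusters plus the MUMs not in any cluster,
    sorted by position, ready to hand to a selection step such as MSS."""
    clustered = {m for c in clusters for m in c}
    items = [collapse_cluster(c) for c in clusters]
    items += [m for m in all_mums if m not in clustered]
    return sorted(items)


all_mums = [(0, 500, 20), (30, 530, 25), (400, 100, 10)]
clusters = [[(0, 500, 20), (30, 530, 25)]]
print(build_selection_input(clusters, all_mums))
# [(0, 500, 45), (400, 100, 10)]
```

The clustered MUMs enter the selection step as one heavy item, so a weight-maximizing selection is less likely to drop them in favour of scattered noise.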
33
Remarks
Experiments show that
- MaxMinCluster > LCS
- MSS > MaxMinCluster for closely related species
- MSS does not perform well for species relatively farther apart
- The hybrid approach is the best for both closely related and farther apart species
34
Thank you! Q & A
35
Approximation Algorithm
- Super-Backbone: a maximum weight common subsequence
- Identify k mutation blocks that have high weight and do not overlap with the Super-Backbone too much; this is formulated as a sub-problem and solved optimally by dynamic programming
- Report the Super-Backbone and the k mutation blocks
- O(n^2 log n + k n^2) time, O(n^2) space
36
Mutations
Three types of mutations: reversal, transposition, reversed-transposition
original:                a b c d e f g h i j k l m n o p q r s t u v w x y z
reversal:                a d c b e f g h i j k l m n o p q r s t u v w x y z
transposition:           a d c b e k l m n o p q r s t u v w x y f g h i j z
reversed-transposition:  a d c b e k l t s r q p o m n u v w x y f g h i j z
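The alphabet example on this slide can be reproduced with three small string operations. The sketch below is illustrative only: the function names and the index conventions (half-open block `[i:j]`, destination index `dest` measured after the block has been removed) are assumptions, and sign flipping of MUMs under reversal is not modelled.

```python
def reversal(seq, i, j):
    """Reverse the block seq[i:j] in place."""
    return seq[:i] + seq[i:j][::-1] + seq[j:]


def transposition(seq, i, j, dest):
    """Cut the block seq[i:j] and reinsert it at index dest of the remainder."""
    block = seq[i:j]
    rest = seq[:i] + seq[j:]
    return rest[:dest] + block + rest[dest:]


def reversed_transposition(seq, i, j, dest):
    """Cut the block seq[i:j], reverse it, and reinsert it at index dest."""
    block = seq[i:j][::-1]
    rest = seq[:i] + seq[j:]
    return rest[:dest] + block + rest[dest:]


s0 = "abcdefghijklmnopqrstuvwxyz"
s1 = reversal(s0, 1, 4)                     # reverse "bcd"
s2 = transposition(s1, 5, 10, 20)           # move "fghij" to just before "z"
s3 = reversed_transposition(s2, 9, 15, 7)   # move reversed "opqrst" before "mn"
print(s1)  # adcbefghijklmnopqrstuvwxyz
print(s2)  # adcbeklmnopqrstuvwxyfghijz
print(s3)  # adcbekltsrqpomnuvwxyfghijz
```

Each step applies one mutation to the result of the previous one, so the final string is three mutations away from the original.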