Locating conserved genes in whole genome scale
Prudence Wong, University of Liverpool
June 2005
Joint work with HL Chan, TW Lam, HF Ting, SM Yiu (HKU), WK Sung (NUS)

Outline
- Motivation
- Challenges of Whole Genome Alignment
- Four approaches and their performance
  - Longest Common Subsequence
  - Clustering Approach
  - Mutation Sensitive Selection
  - Hybrid Approach
- Remarks

Mouse & Human
- Do they look the same?
- Mouse and human are genetically very similar
- What do we mean by similar?
  - Many genes found in human are also found in mouse: conserved genes
[Figure labels: Mouse Chromosome 16, Human Chromosome 16; m16, h03]

Whole Genome Alignment
- Identify regions on the two genomes that possibly contain their conserved genes
- Differences in the ordering of conserved genes could be related to mutations
- For related species, the number of mutations is usually small
[Figure: Genome A and Genome B with Gene X, Gene Y, Gene Z; a change in gene order is marked "possibly a mutation"]

Data size
- Usually very large (e.g., human chromosomes vs mouse chromosomes)

Examples:
Human Chr No.   Length   Mouse Chr No.   Length
1               245M     1               134M
3               200M     2               181M
11              135M     7               134M
15              100M     8               129M
20              64M      16              99M

- Cannot use global alignment tools because of the large size

Observations
- A conserved gene may not be identical in the two genomes; nevertheless, there are some common substrings unique to this conserved gene (called MUMs)
- Locate all MUMs over the two genomes; yet not every MUM corresponds to a conserved gene
[Figure labels: Gene X, Gene Y, Gene X, Noise]

Number of MUMs
Mouse Chr No.   Human Chr No.   # of MUMs
7               19              52,394
15              22              71,613
16              16              66,536
16              22              61,200
17              16              29,001
17              19              56,236
19              11              29,814

- The number of MUMs is small compared with the chromosome lengths

MUMs for M16-H03
[Figure: plot of MUMs with conserved genes marked; labels: Human Chromosome 03, Mouse Chromosome 16]

Generation of MUMs using a suffix tree
- How to choose the right MUMs?
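To make the notion of a MUM concrete, here is a minimal brute-force sketch in Python. It is not the suffix-tree construction the slide refers to: it swaps in a quadratic scan that is only practical for toy inputs. The function name `find_mums`, the `(pos_a, pos_b, length)` tuple layout, and the `min_len` parameter are assumptions made for illustration.

```python
def find_mums(a, b, min_len=1):
    """Return matches (pos_a, pos_b, length) that occur exactly once in a,
    exactly once in b, and cannot be extended to the left or right.

    Brute force for illustration only; a suffix tree does this far faster.
    """
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # Skip starts that could be extended to the left (not left-maximal).
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend to the right as far as the two strings agree.
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            if k < min_len:
                continue
            s = a[i:i + k]
            # Unique in a string iff first and last occurrence coincide.
            if a.find(s) == a.rfind(s) and b.find(s) == b.rfind(s):
                mums.append((i, j, k))
    return mums


if __name__ == "__main__":
    print(find_mums("abcdeXfgh", "abcdeYfgh", min_len=3))
```

The uniqueness check compares `find` with `rfind` rather than using `str.count`, because `count` ignores overlapping occurrences.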

MUM Selection
- MUMmer-1 [Delcher et al. Nucleic Acids Research 1999]
  - longest common subsequence (effectively assumes no mutations)
- MUMmer-2 [Delcher et al. Nucleic Acids Research 2002] & MUMmer-3 [Kurtz et al. Genome Biology 2004]
  - clustering heuristics
  - most popular tool to uncover conserved genes at whole-genome scale
- MaxMinCluster [Wong et al. Bioinformatics 2004*]
  - clustering, optimization
- MSS, Mutation Sensitive Selection [Chan et al. Bioinformatics 2005*]
  - captures mutations
- Hybrid approach [Chan et al. Bioinformatics 2005*]
  - combines the mutation sensitive and clustering approaches
* our results

Overview of Results
- Average coverage (sensitivity), in %

                     Mouse/Human   Intragenus Baculoviridae   Intergenus Baculoviridae
MUMmer-3             77 (27)       66 (71)                    43 (62)
MaxMinCluster        84 (29)       69 (75)                    45 (59)
MSS                  91 (29)       79 (75)                    36 (53)
MUMmer-3+MSS         91 (28)       79 (75)                    48 (43)
MaxMinCluster+MSS    91 (27)       79 (82)                    51 (53)

- coverage: % of published conserved genes reported
- sensitivity: % of MUMs reported that reside in published conserved genes
- MSS outperforms MaxMinCluster and MUMmer-3 on closely related species
- BUT MSS performs worse on species relatively farther apart
- both hybrid approaches perform well for species farther apart

Longest Common Subsequence (LCS)
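As an illustration of LCS-style selection, the sketch below picks the heaviest set of MUMs whose positions increase in both genomes, which is what selecting a common subsequence of MUMs amounts to when no mutations are assumed. It is a plain quadratic dynamic program, not MUMmer-1's actual code; the function name `heaviest_chain`, the `(pos_a, pos_b, length)` layout, and the use of MUM length as weight are assumptions.

```python
def heaviest_chain(mums):
    """Pick MUMs (pos_a, pos_b, length) that are increasing in both genomes
    and have maximum total length. O(n^2) dynamic programming sketch;
    overlaps between neighbouring MUMs are ignored for simplicity."""
    ordered = sorted(mums)          # sort by position in genome A
    n = len(ordered)
    if n == 0:
        return []
    best = [m[2] for m in ordered]  # best[i]: weight of best chain ending at i
    prev = [-1] * n                 # back-pointers for reconstruction
    for i in range(n):
        for j in range(i):
            same_order = (ordered[j][0] < ordered[i][0]
                          and ordered[j][1] < ordered[i][1])
            if same_order and best[j] + ordered[i][2] > best[i]:
                best[i] = best[j] + ordered[i][2]
                prev[i] = j
    # Walk back from the heaviest chain end.
    i = max(range(n), key=best.__getitem__)
    chain = []
    while i != -1:
        chain.append(ordered[i])
        i = prev[i]
    return chain[::-1]


if __name__ == "__main__":
    example = [(0, 0, 5), (10, 50, 3), (20, 10, 4), (30, 20, 6)]
    print(heaviest_chain(example))
```

In the example, the MUM at `(10, 50, 3)` is out of order relative to the others and is dropped, which is exactly how a transposed conserved gene would be lost under this approach.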

- LCS approach (MUMmer-1): does not take mutations into account
- MUMmer-2 & MUMmer-3: cluster MUMs by heuristics
- MaxMinCluster: formalizes clustering as a combinatorial optimization problem

Clustering approach
- Observations
  - Noise MUMs are usually short and isolated
  - A conserved gene usually contains a sequence of MUMs that are close and have sufficient length => clusters
[Figure labels: Gene X, Gene Y, Gene X, Noise]
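The observation above suggests a simple grouping rule, sketched below in Python. This is one plausible reading, not the MUMmer or MaxMinCluster algorithm: consecutive MUMs join a cluster when they are within `gap` of each other in both genomes, and a cluster is kept only if its total MUM length reaches `min_size`. The parameter names echo the Gap and MinSize values used in the slides; the function name and the exact distance rule are assumptions.

```python
def cluster_mums(mums, gap=100, min_size=40):
    """Group MUMs (pos_a, pos_b, length) into clusters of nearby MUMs.

    Assumed rule: the next MUM joins the current cluster if it starts at
    most `gap` after the previous MUM ends, in both genomes. Clusters
    whose total length is below `min_size` are discarded as noise.
    """
    clusters = []
    current = []
    for mum in sorted(mums):
        if current:
            prev_a, prev_b, prev_len = current[-1]
            gap_a = mum[0] - (prev_a + prev_len)
            gap_b = mum[1] - (prev_b + prev_len)
            if not (0 <= gap_a <= gap and 0 <= gap_b <= gap):
                clusters.append(current)   # too far apart: close the cluster
                current = []
        current.append(mum)
    if current:
        clusters.append(current)
    return [c for c in clusters if sum(m[2] for m in c) >= min_size]


if __name__ == "__main__":
    example = [(0, 0, 20), (50, 60, 30), (500, 900, 20),
               (1000, 1000, 25), (1050, 1040, 30)]
    for cluster in cluster_mums(example):
        print(cluster)
```

In the example, the short isolated MUM at `(500, 900, 20)` forms a cluster of total length 20 and is filtered out, matching the "short and isolated" description of noise.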

Challenge
- Challenge: some conserved genes do not induce clusters of sufficient length
- Solution: relax the definition of clusters to allow the presence of noise

Noisy cluster
- Suppose Gap = 100, MinSize = 40
[Figure labels: "> 100 apart", "length = 20"; one example of a 1-noisy cluster and one of a 2-noisy cluster]

MaxMinCluster
- Problem formulation
  - find a collection of k-noisy clusters such that the smallest cluster has the maximum weight
- Dynamic programming: O(k^2 n^2) time, O(k^2 n) space

- Mutation Sensitive Selection: capture mutations more directly

Mutation Sensitive Selection
- select a subset of MUMs in one genome that can be transformed into a subset of MUMs in the other by a few mutations
- three types of mutations: reversal, transposition, reversed-transposition

k-mutated subsequences
- Given two sequences A & B and an integer k,
  - a pair consisting of a subsequence X of A and a subsequence Y of B is called a pair of k-mutated subsequences if X can be transformed to Y by at most k mutations
- MUMs are signed; a reversal flips the sign of the MUMs it reverses
[Figure: a reversal and a transposition, giving a pair of 2-mutated subsequences]

Mutation Sensitive Selection
- Problem formulation:
  - To find a pair of k-mutated subsequences with maximum weight
- We believe that the problem is NP-hard
  - The Genome Rearrangement Problem, believed to be NP-hard, can be reduced to this problem
- We give an efficient approximation algorithm
  - the resulting weight is close to (at least 1/(3k+1) times) the maximum possible weight
  - O(n^2 log n + k n^2) time, O(n^2) space
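The approximation guarantee can be written out as an inequality with a small worked case. The symbols are notation assumed here for illustration, not taken from the slides: $w(S)$ is the weight of the pair returned by the algorithm and $w(S^{*})$ is the maximum possible weight.

```latex
\[
  w(S) \;\ge\; \frac{1}{3k+1}\, w(S^{*}),
  \qquad\text{so for } k = 2:\quad w(S) \;\ge\; \tfrac{1}{7}\, w(S^{*}).
\]
```

This is a worst-case bound; it says nothing about how close to the optimum the algorithm is on typical inputs.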

Hybrid Approach
- first apply a clustering approach to identify clusters which are obviously conserved genes
  - can apply either MUMmer-3 or MaxMinCluster
- these clusters are treated as MUMs with bigger weight
- then apply MSS to process these MUMs together with the remaining MUMs
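The middle step, treating each cluster as a single heavier MUM, can be sketched as follows. This only prepares the input for the final selection; MSS itself is not implemented here. The function name `collapse_clusters`, the `(pos_a, pos_b, weight)` layout, and the choice to sum member lengths as the cluster weight are assumptions for illustration.

```python
def collapse_clusters(mums, clusters):
    """Replace each cluster by one heavy pseudo-MUM and keep the rest.

    mums:     list of (pos_a, pos_b, length)
    clusters: list of lists of MUMs taken from `mums`
    Returns a sorted list of (pos_a, pos_b, weight) items, where a
    collapsed cluster starts at its earliest member and weighs the sum
    of its members' lengths (an assumed weighting).
    """
    clustered = {m for cluster in clusters for m in cluster}
    items = []
    for cluster in clusters:
        pos_a = min(m[0] for m in cluster)
        pos_b = min(m[1] for m in cluster)
        weight = sum(m[2] for m in cluster)
        items.append((pos_a, pos_b, weight))
    # MUMs outside every cluster pass through unchanged.
    items.extend(m for m in mums if m not in clustered)
    return sorted(items)


if __name__ == "__main__":
    mums = [(0, 0, 20), (50, 60, 30), (500, 900, 20)]
    clusters = [[(0, 0, 20), (50, 60, 30)]]
    print(collapse_clusters(mums, clusters))
```

The returned list would then be handed to the mutation-sensitive selection step in place of the raw MUMs.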

Remarks
- Experiments show that
  - MaxMinCluster > LCS
  - MSS > MaxMinCluster for closely related species
  - MSS does not perform well for species relatively farther apart
  - Hybrid approach is the best for both closely related and farther apart species

Thank you! Q & A

Approximation Algorithm
- Super-Backbone
  - maximum weight common subsequence
- Identify k mutation blocks
  - having high weight
  - do not overlap with the Super-Backbone too much
  - this is formulated as a sub-problem and solved optimally by dynamic programming
- Report Super-Backbone & k mutation blocks
- O(n^2 log n + k n^2) time, O(n^2) space

Mutations
- three types of mutations: reversal, transposition, reversed-transposition

original:
a b c d e f g h i j k l m n o p q r s t u v w x y z
after a reversal:
a d c b e f g h i j k l m n o p q r s t u v w x y z
after a transposition:
a d c b e k l m n o p q r s t u v w x y f g h i j z
after a reversed-transposition:
a d c b e k l t s r q p o m n u v w x y f g h i j z
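The three mutation types can be expressed as small string operations, and applying them in order reproduces the a..z example above. The function names and the index conventions (half-open slices, insertion position counted in the sequence after the block is removed) are assumptions for illustration. Signs are omitted: on signed MUMs, the two reversing operations would also flip the sign of each reversed element.

```python
def reversal(seq, i, j):
    """Reverse the block seq[i:j] in place."""
    return seq[:i] + seq[i:j][::-1] + seq[j:]


def transposition(seq, i, j, k):
    """Cut the block seq[i:j] and reinsert it at index k of the remainder."""
    block = seq[i:j]
    rest = seq[:i] + seq[j:]
    return rest[:k] + block + rest[k:]


def reversed_transposition(seq, i, j, k):
    """Cut the block seq[i:j], reverse it, and reinsert it at index k
    of the remainder."""
    block = seq[i:j][::-1]
    rest = seq[:i] + seq[j:]
    return rest[:k] + block + rest[k:]


if __name__ == "__main__":
    s0 = "abcdefghijklmnopqrstuvwxyz"
    s1 = reversal(s0, 1, 4)                    # reverse "bcd"
    s2 = transposition(s1, 5, 10, 20)          # move "fghij" to just before "z"
    s3 = reversed_transposition(s2, 9, 15, 7)  # move reversed "opqrst" before "m"
    for s in (s0, s1, s2, s3):
        print(" ".join(s))
```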
