Structural genomics includes the genetic mapping, physical mapping and sequencing of entire genomes
How to get a genomic library: breaking the DNA, cloning the fragments, and ordering. Cut the isolated DNA with a restriction enzyme taken at a low concentration, so that many cleavage sites remain uncut. [Figure: cloned DNA fragments 1,...,6; cleavage sites]
BAC Fingerprinting: Gel-based Fragment Separation. 96 samples, 25 marker lanes (a marker every fifth lane). Marra et al., Genome Res., 7 (1997)
Distance functions. Clones as math vectors A and B. Hamming distance: $H(A,B) = \sum_{i=1}^{n} |A_i - B_i|$ (mutual overlap). Limited fingerprinting resolution: bands shared by chance. Probability that at least one fragment will be shared by chance between clones A and B: $p = 1 - (1 - 1/t)^{m}$, where $t = L/(2R)$ is the number of bins on a gel of length $L$ and $R$ is the resolution.
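The two quantities on this slide can be sketched in a few lines of Python. This is only an illustration: the band vectors, gel length, and resolution are hypothetical values, and because the slide does not define `m`, it is treated here as a plain input to the formula.

```python
def hamming_distance(a, b):
    """Sum of |A_i - B_i| over all gel bins for two equal-length band vectors."""
    if len(a) != len(b):
        raise ValueError("band vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))


def chance_share_probability(gel_length, resolution, m):
    """p = 1 - (1 - 1/t)^m, with t = L / (2R) bins on the gel."""
    t = gel_length / (2 * resolution)
    return 1 - (1 - 1 / t) ** m


# Hypothetical fingerprints: 1 = band present in that bin, 0 = absent.
clone_a = [1, 0, 1, 1, 0, 0, 1, 0]
clone_b = [1, 1, 0, 1, 0, 0, 1, 0]

print(hamming_distance(clone_a, clone_b))
# Illustrative numbers only: L = 1000, R = 5 gives t = 100 bins.
print(chance_share_probability(gel_length=1000, resolution=5, m=30))
```

A coarser resolution (larger `R`) gives fewer bins `t`, so the chance of a spurious shared band grows. This is the "limited fingerprinting resolution" effect the slide points at.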
Genome physical mapping problems are computationally challenging. “… We have been looking at the assemblies of large genomes … and for every ‘draft’ genome we look at, we find hundreds - and sometimes thousands - of mis-assemblies.” Salzberg & Yorke (2005) Beware of mis-assembled genomes. Bioinformatics, 21:
Bioinformatics and Human Factors. Which factors may affect the quality of a physical map? Where can bioinformatics help? Reading the scores; clustering (contig assembly); ordering the clusters; merging contigs; anchoring (getting genetic and physical maps together); verification of mapping results (at each stage).
The major mapping steps. “Mapping” means “positioning” based on some distance. Fingerprinted clones $C_k$, $k = 1, …$; distances $d_{ij}$ for $(C_i, C_j)$ from shared bands; clustering (high stringency); ordering (high stringency); merging (lower stringency); anchoring and verification.
P-value of clone overlaps. Sulston score (Sulston et al., 1988): $p = 1 - (1 - 1/N)^{n(c_2)}$ is the probability of random coincidence of two bands; $n(c)$ is the number of bands in clone $c$; $N$ is the total number of distinguishable bands.
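A minimal sketch of how the per-band probability above could be turned into an overlap p-value. The binomial tail used here is an assumption (bands are treated as matching independently); the slide itself gives only the per-band formula, and the function names are hypothetical.

```python
from math import comb


def band_match_probability(n_bands_c2, n_distinguishable):
    """p = 1 - (1 - 1/N)^n(c2): chance that one band coincides with some band of c2."""
    return 1 - (1 - 1 / n_distinguishable) ** n_bands_c2


def overlap_p_value(shared, n_bands_c1, n_bands_c2, n_distinguishable):
    """Assumed binomial tail: P(at least `shared` of c1's bands match c2 by chance)."""
    p = band_match_probability(n_bands_c2, n_distinguishable)
    return sum(
        comb(n_bands_c1, j) * p**j * (1 - p) ** (n_bands_c1 - j)
        for j in range(shared, n_bands_c1 + 1)
    )


# Hypothetical clones with 40 bands each and N = 500 distinguishable bands.
for shared in (5, 15, 30):
    print(shared, overlap_p_value(shared, 40, 40, 500))
```

Under this assumption, the more bands two clones share, the smaller the p-value, and a cutoff on that value decides which overlaps count as significant.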
Approximation of the exact model of random clone overlap: IoE approximation; Wendl’s exact theory (J. Comput. Biol. 2005, 12: )
Band abundances: Unexploited source to improve mapping quality 3B
Adaptive Clustering. Varying cutoff: increasing rather than decreasing stringency; protected clusters.
Network representation of significant clone overlaps: vertices correspond to clones, and edges to significant clone overlaps.
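The network described here can be sketched as an adjacency map built from pairwise overlap scores, with clusters read off as connected components. The clone names, the scores, and the rule "edge if score is at or below the cutoff" are illustrative assumptions.

```python
def build_overlap_network(scores, cutoff):
    """Vertices are clones; an edge joins two clones whose score is <= cutoff."""
    graph = {}
    for (a, b), p in scores.items():
        graph.setdefault(a, set())
        graph.setdefault(b, set())
        if p <= cutoff:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def clusters(graph):
    """Connected components of the overlap network (depth-first search)."""
    seen, result = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            v = stack.pop()
            if v in component:
                continue
            component.add(v)
            stack.extend(graph[v] - component)
        seen |= component
        result.append(component)
    return result


# Hypothetical pairwise scores for five clones.
scores = {
    ("c1", "c2"): 1e-30,
    ("c2", "c3"): 1e-28,
    ("c3", "c4"): 1e-10,
    ("c4", "c5"): 1e-40,
}
print(clusters(build_overlap_network(scores, cutoff=1e-25)))
```

Tightening the cutoff removes edges, so clusters can only split, never merge.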
Network representation of significant clone overlaps (wheat 1B). [Figure legend: clones; clones from the selected diametric path (MTP)]
Identification of putative Q-clones and Q-overlaps
Identification of contig non-linearity (wheat 1BS, Ctg13). Using the net of significant clone overlaps to find the diametric path and calculate the width of the net. Width > 1 is diagnostic for a non-linear cluster. [Figure labels: diam, width]
Identification of contig non-linearity. Diametric path: calculate ranks $r_j = r_j(c_i)$ for all clones $c_j$ relative to clone $c_i$ (through significant clone overlaps). The diametric path (MTP) is the shortest path through significant clone overlaps connecting clones $c_i$ and $c_j$ with maximal $r_j(c_i)$. Width of the net: maximal rank relative to the diametric path. Width > 1: non-linear cluster.
Identification of contig non-linearity. Example with Q-clone:
Identification of contig non-linearity. Using the net of significant clone overlaps, for each clone $c_i$ calculate ranks $r_{ij}$ for all clones $c_j$. Diametric path (MTP): for the pair of clones with maximal $r_{ij}$, identify the shortest path through significant clone overlaps. Width of the net: maximal rank relative to the diametric path. Width > 1 is diagnostic for a non-linear cluster. (PAG)
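The rank, diametric-path, and width definitions above can be sketched with breadth-first search. This sketch assumes that a rank is the number of overlap edges on a shortest path and that the cluster is connected; the two small example nets are hypothetical.

```python
from collections import deque


def ranks_from(graph, sources):
    """Breadth-first ranks (edge counts) of every reachable clone from the sources."""
    rank = {s: 0 for s in sources}
    queue = deque(sources)
    while queue:
        v = queue.popleft()
        for w in graph[v]:
            if w not in rank:
                rank[w] = rank[v] + 1
                queue.append(w)
    return rank


def shortest_path(graph, start, goal):
    """One shortest path from start to goal (goal must be reachable)."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        if v == goal:
            break
        for w in graph[v]:
            if w not in parent:
                parent[w] = v
                queue.append(w)
    path, node = [], goal
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]


def diametric_path_and_width(graph):
    """Path between the pair with maximal rank, and maximal rank relative to it."""
    best = None
    for c_i in graph:
        rank = ranks_from(graph, [c_i])
        c_j = max(rank, key=rank.get)
        if best is None or rank[c_j] > best[0]:
            best = (rank[c_j], c_i, c_j)
    _, c_i, c_j = best
    path = shortest_path(graph, c_i, c_j)
    width = max(ranks_from(graph, path).values())
    return path, width


# Hypothetical nets: a chain a-b-c-d-e with a buried clone f, and one with a branch c-x-y.
linear = {"a": ["b"], "b": ["a", "c", "f"], "c": ["b", "d", "f"],
          "d": ["c", "e", "f"], "e": ["d"], "f": ["b", "c", "d"]}
branched = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d", "x"],
            "d": ["c", "e"], "e": ["d"], "x": ["c", "y"], "y": ["x"]}
print(diametric_path_and_width(linear))
print(diametric_path_and_width(branched))
```

In the first net every clone off the path is a direct neighbor of it, so the width is 1. In the second, the branch reaches two steps away from the path, which is the Width > 1 signal for a non-linear cluster.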
“Linearization” by removing clones at cluster branching points
Reducing genome mapping (linear ordering) problems to the traveling salesman problem (TSP). Order 1: a b c d e f g h k l m n, length $l_1$; Order 2: b a c d e f g h k l m n, length $l_2$; …; Order N: f c m h e a g n k l b d, length $l_N$. For n = 60 there are N = 60!/2 possible orders. The problem: how to choose the best (true) order, i.e., the one that gives the map of minimal length?
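A small sketch of why exhaustive search is hopeless here: it counts the candidate orders and brute-forces the minimal-length order for a toy case. The five clone positions are hypothetical, and the distance between two clones is taken to be the absolute difference of their positions.

```python
from itertools import permutations
from math import factorial


def number_of_orders(n):
    """Linear orders of n clones when an order and its reverse count once (n >= 2)."""
    return factorial(n) // 2


def map_length(order, dist):
    """Sum of distances between neighbours in the given order."""
    return sum(dist[a][b] for a, b in zip(order, order[1:]))


def best_order_brute_force(items, dist):
    """Exhaustive search over all permutations; feasible only for very small n."""
    return min(permutations(items), key=lambda order: map_length(order, dist))


# Hypothetical clone positions along a chromosome, given in scrambled order.
position = {"c": 3, "a": 0, "d": 6, "b": 1, "e": 10}
dist = {x: {y: abs(position[x] - position[y]) for y in position} for x in position}

best = best_order_brute_force(list(position), dist)
print(best, map_length(best, dist))
print(len(str(number_of_orders(60))), "digits in 60!/2")
```

Five clones need only 120 permutations, whereas the count for n = 60 is far beyond enumeration, which is why the ordering step is handled with TSP heuristics.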
Example: A Contig
Re-sampling based order verification. Excluding parallel clones allows constructing a stable "skeleton" map and specifying the coordinates of all clones relative to this map.
Testing the FPC contigs by using LTC wheat 1B
Testing the FPC contigs by using LTC. Wheat 1B: some of the FPC contigs have a non-linear topological structure inconsistent with the linear structure of the chromosome: Q-clones?
Testing the FPC contigs by using LTC. FPC contigs with non-linear topology, and even cycles (Ctg2). Edges represent the significant overlaps (with a Sulston score cutoff of 1e-25). Increasing the stringency up to 1e-75 does not help here in getting a non-trivial linearization!
Problematic contigs (simulated maize)
[Figure labels: Xuhw258, Xuhiuw264, Xuhiuw265, Xuhw259, Xuhw264-5-T7, Xuhw264-3-T7, Xuhw T7, Yr15; #3, #28, #4, #5, #6, #7; Brachypodium synteny-based markers; French clones-based markers; 450 Kb ?]