Assembly of Solexa tomato reads María José Truco Bioinformatics and Genomics Program CRG-Centre de Regulació Genòmica Barcelona
RESULTS: Testing of the three tomato run concentrations
Methodology:
- Mixture of 9 BACs (2 incomplete). Three concentration runs: 1 pM, 2 pM and 4 pM
- Alignment to reference genomes using ELAND (Solexa). A sequence counts as aligned if it has no more than 2 mismatches and aligns to only a single location of the reference genome
~30% of the sequences are contamination from E. coli and vector

                  1 pM                 2 pM                 4 pM
                  Sequences   %        Sequences   %        Sequences   %
E. coli           632242      25.8     1095147     25.5     1330215     23.4
pBeloBAC11        96316       3.9      170772      4.0      220972
Yeast             84480       3.5      151431               194927      3.4
Human BACs*       84          0.0      68                   75
Tomato BACs       1002996     41.0     1852042     43.1     2658390     46.8
Repeats**         414523      16.9     702356      16.3     854140      15.0
Non-matching***   218044      8.9      336869      7.8      418374      7.4
All reads         2448685              4301227              5677093

* Testing for contamination during sample preparation
** Reads matching 2 or more sites in the reference genomes
*** Reads with more than 2 mismatches or not aligning to any reference genome
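The alignment rule and the percentage columns can be illustrated with a minimal sketch. It does not parse ELAND output; each read is assumed to be already summarised as a list of mismatch counts, one per candidate location in the reference genomes. The names `classify_read` and `percentages` are hypothetical, and the counts are the 1 pM column of the table.

```python
MAX_MISMATCHES = 2  # threshold stated on the slide


def classify_read(mismatches_per_site):
    """Classify one read from the mismatch count at each candidate site.

    Assumed input shape: a list with one integer per candidate location.
    """
    acceptable = [m for m in mismatches_per_site if m <= MAX_MISMATCHES]
    if not acceptable:
        return "non-matching"   # no site, or every site has > 2 mismatches
    if len(acceptable) == 1:
        return "unique"         # aligned to a single location
    return "repeat"             # matches 2 or more sites


def percentages(counts):
    """Return each category as a percentage of all reads, to one decimal."""
    total = sum(counts.values())
    return {name: round(100.0 * n / total, 1) for name, n in counts.items()}


# Read counts copied from the 1 pM column of the table.
counts_1pm = {
    "E. coli": 632242,
    "pBeloBAC11": 96316,
    "Yeast": 84480,
    "Human BACs": 84,
    "Tomato BACs": 1002996,
    "Repeats": 414523,
    "Non-matching": 218044,
}

pct_1pm = percentages(counts_1pm)
# E. coli plus vector, as a fraction of all 1 pM reads.
contamination_1pm = (counts_1pm["E. coli"] + counts_1pm["pBeloBAC11"]) / sum(counts_1pm.values())
```

With these counts the category totals add up to the "All reads" figure for the 1 pM run, and E. coli plus vector come to just under 30% of the reads.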
RESULTS: BAC coverage before assembly
[Plot: horizontal axis 0 to 100000, vertical axis 100 to 500]
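A coverage profile of this kind can be computed from read alignment positions. The sketch below is one possible way to do it, not necessarily how the plot was produced. It assumes 0-based read start positions on a single BAC and a fixed read length; the function name `coverage` and its parameters are hypothetical.

```python
def coverage(starts, read_len, ref_len):
    """Per-base read depth along a reference of length ref_len.

    starts   : 0-based start position of each aligned read (assumed input)
    read_len : fixed read length (assumed identical for all reads)
    """
    # Difference array: +1 where a read begins, -1 just past where it ends.
    diff = [0] * (ref_len + 1)
    for s in starts:
        end = min(s + read_len, ref_len)  # clip reads running off the end
        diff[s] += 1
        diff[end] -= 1

    # A running sum of the difference array gives the depth at each base.
    depth = []
    running = 0
    for d in diff[:ref_len]:
        running += d
        depth.append(running)
    return depth


# Toy example: three reads of length 3 on a 6-base reference.
toy_depth = coverage([0, 2, 2], read_len=3, ref_len=6)
```

The difference array keeps the cost proportional to the number of reads plus the reference length, rather than to their product.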
RESULTS: BAC recovery after assembly (VELVET: Zerbino & Birney; http://www.ebi.ac.uk/~zerbino/velvet/) using sequences from the 4 pM run (3920769 sequences) and the 1 pM run (1627900 sequences)
4 pM run: BAC recovery 66.4-89% (81.3-94.6%)
1 pM run: BAC recovery 45.1-82.4% (66-97.6%)
[Plot: horizontal axis 0 to 100000, vertical axis 500 to 2500]
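The slide does not say how BAC recovery was measured. One plausible definition, assumed here, is the fraction of BAC bases covered by at least one aligned contig. The sketch below computes that fraction from contig alignment intervals; `recovered_fraction` and the half-open interval convention are hypothetical choices.

```python
def recovered_fraction(intervals, bac_len):
    """Fraction of a BAC covered by the union of contig alignments.

    intervals : (start, end) pairs on the BAC, 0-based, end exclusive
    bac_len   : length of the BAC sequence
    """
    covered = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Disjoint from the previous block: close it and open a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent: extend the current block.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered / bac_len


# Toy example: two overlapping contigs and one separate contig on a 50-base BAC.
toy_recovery = recovered_fraction([(0, 10), (5, 20), (30, 40)], bac_len=50)
```

Merging overlapping intervals first avoids counting bases twice when several contigs align to the same region.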
RESULTS: Gap filling of incomplete BAC EF606852
Methodology:
- Selection of two fragments of ~4 kb flanking the gap
- BLAST of the assembled contigs against the two sequences flanking the gap
[Diagram: assembled contig with positions 25, 4121, 4153, 7297 and 12372 marked; left gap flanking region at 25-4121, right gap flanking region at 4153-7297, GAP between them; legend: perfect match, mismatch]
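The gap-filling step amounts to finding a contig hit by both flanking sequences and reading off the contig bases between the two hits. The sketch below shows that logic on pre-parsed hits; it does not run or parse BLAST. The tuple layout and the name `spanning_contigs` are hypothetical. The example reuses the coordinates from the diagram, and reading them as positions on the assembled contig is an assumption.

```python
from collections import defaultdict


def spanning_contigs(hits):
    """Find contigs matched by both flanks and the contig region between them.

    hits : iterable of (flank, contig, start, end), where flank is "left"
           or "right" and start/end are 1-based inclusive contig positions
           (assumed layout, not a BLAST output format).
    Returns {contig: (first_base_between, last_base_between)}.
    """
    best = defaultdict(dict)
    for flank, contig, start, end in hits:
        lo, hi = min(start, end), max(start, end)  # tolerate reversed hits
        prev = best[contig].get(flank)
        # Keep only the longest hit of each flank on each contig.
        if prev is None or hi - lo > prev[1] - prev[0]:
            best[contig][flank] = (lo, hi)

    between = {}
    for contig, flanks in best.items():
        if "left" in flanks and "right" in flanks:
            first, second = sorted([flanks["left"], flanks["right"]])
            # Bases strictly between the two hits; the range is empty
            # (first > last) when the hits touch or overlap.
            between[contig] = (first[1] + 1, second[0] - 1)
    return between


# Coordinates from the diagram, assumed to be contig positions;
# "contig_A" and "contig_B" are placeholder names.
example_hits = [
    ("left", "contig_A", 25, 4121),
    ("right", "contig_A", 4153, 7297),
    ("left", "contig_B", 10, 500),   # hit by one flank only, so ignored
]
example_between = spanning_contigs(example_hits)
```

Only contigs matched by both flanks are returned, since a contig matching a single flank cannot bridge the gap.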