Published by Rolf Watkins. Modified over 9 years ago.
1
Plant Biology Division
Post-processing of the IMGAG M.t. 2.0 Release
Affymetrix Medicago Probe Set – IMGAG 2.0 / MTGI 8.0 Mapping
Zhao Bioinformatics Lab
2
IMGAG M.t. 2.0
Data downloaded from ftp://ftpmips.gsf.de/plants/medicago/MT_2_0/MT2.0_medicago_chrX_20080303_NoOverlap.xml.tar.gz
● Summary
- 38,844 TUs and 38,844 gene models, in one-to-one correspondence
- 38,759 distinct gene names, so 82 models are redundant in gene name
- Of the 38,844 models, 85 have a CDS region that is not compatible with the FASTA file
- 4,644 models with 5'-UTR + CDS
- 5,846 models with CDS + 3'-UTR
- 11,656 models with 5'-UTR + CDS + 3'-UTR
- 16,698 models with CDS only
3
Evidence Code
● F (5036 genes) full coverage/FL-cDNA: the complete gene model, from translation start to translation stop, is covered by expressed Medicago sequence, e.g. FL-cDNA or EST alignments across the full length of the coding sequence.
● E (14737 genes) expressed/EST matches: expression of the gene is supported by Medicago EST sequence that (partially) matches the gene call.
● H (14209 genes) homology/heterologous: the gene call is supported by similarity to Medicago or other ESTs, protein, FL-cDNA, genomic or other sequences, with partial or full-length alignments.
● I (1375 genes) intrinsic/ab initio/inferred/hypothetical: the gene call is based only on intrinsic prediction tools such as FGENESH, Genscan or Eugene, and no significant alignments to other sequences are available. The length of the prediction is greater than 300 bp, or there is a significant domain match in Interpro.
● L (3830 genes) 'low quality' gene calls: gene calls not in F, E, or H, with no significant Interpro domain match and a length of less than 300 bp, i.e. unsupported short intrinsic predictions, which statistically contain many false predictions.
Total genes: 38334 NON-OVERLAPPED genes
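The five evidence codes read as a decision cascade. The sketch below is only an illustration of that reading: the function name, the boolean inputs, and the precedence order F, E, H are assumptions and are not taken from the IMGAG pipeline. The slide also does not say how a prediction of exactly 300 bp is handled; here it falls into L.

```python
def evidence_code(full_length_support, est_support, homology_support,
                  interpro_domain, length_bp):
    """Assign an evidence code (hypothetical helper, assumed precedence F > E > H)."""
    if full_length_support:      # FL-cDNA/EST alignment covers the whole CDS
        return "F"
    if est_support:              # partial Medicago EST support
        return "E"
    if homology_support:         # similarity to other sequences
        return "H"
    # Intrinsic predictions only: long enough or with an Interpro domain -> I
    if length_bp > 300 or interpro_domain:
        return "I"
    # Short, unsupported intrinsic predictions (assumption: exactly 300 bp lands here)
    return "L"


print(evidence_code(False, False, False, False, 250))
```
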
4
Affymetrix Medicago Probe Set – IMGAG Gene Mapping
Two approaches
● A. BLAST-based approach
(1) HSP length / Affymetrix probe-set target length >= threshold 1
(2) Matching identity length / max HSP length >= threshold 2
● B. Affy probe-set level matching
(1) IMGAG gene sequences were matched to the corresponding Affymetrix probe sets using a position-weighted scoring index in which mismatches near the middle of a probe were penalized most heavily, with per-position weights (1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,2,2,2,2,2,1,1,1,1,1).
(2) A perfect match for a probe yields a score of 45, the sum of the 25 weights. A probe set was declared a match when at least 8 of its 11 probes had scores of 43 or higher.
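The position-weighted index of approach B can be sketched as follows. This is a minimal illustration, not the lab's implementation: the function names are hypothetical, and it assumes ungapped comparison of each 25-mer probe against every 25-base window of the gene sequence on the same strand.

```python
# Per-position weights from the slide; they sum to 45.
WEIGHTS = (1,) * 5 + (2,) * 5 + (3,) * 5 + (2,) * 5 + (1,) * 5


def probe_score(probe, window):
    """Sum the weights of the positions where probe and window agree."""
    if len(probe) != len(WEIGHTS) or len(window) != len(WEIGHTS):
        raise ValueError("probe and window must both be 25 bases long")
    return sum(w for w, p, t in zip(WEIGHTS, probe, window) if p == t)


def probe_set_matches(probes, gene_seq, min_score=43, min_probes=8):
    """True when at least min_probes probes reach min_score somewhere in gene_seq."""
    n = len(WEIGHTS)
    hits = 0
    for probe in probes:
        # Best ungapped score over all windows (assumption: same strand only).
        best = max((probe_score(probe, gene_seq[i:i + n])
                    for i in range(len(gene_seq) - n + 1)), default=0)
        if best >= min_score:
            hits += 1
    return hits >= min_probes


probe = "ACGTACGTACGTACGTACGTACGTA"
print(probe_score(probe, probe))                 # perfect match
print(probe_score(probe, "C" + probe[1:]))       # one mismatch at the edge
```

Under these weights a single mismatch in the central five positions lowers the score to 42, below the cutoff of 43, while a single mismatch anywhere else still passes.
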
5
Statistics on Probe Sets

Type                                          Num of probe sets   Percent in the Mtr. set   Notes
Unique probe sets, e.g. Mtr.10097.1.S1_at     44182               86.80                     unique to one gene
Alternative (_a_), e.g. Mtr.10267.1.S1_a_at   1162                2.28                      alternative probe sets to one gene
Shared (_s_), e.g. Mtr.10146.1.S1_s_at        4793                9.42                      common to multiple genes
Others (_x_), e.g. Mtr.10093.1.S1_x_at        1809                3.55                      other probe sets with complicated mapping
Total                                         50900               100
6
Statistics on Approach A – scenario #1: less stringent
● Affy probe-set target BLAST against IMGAG cDNA
Threshold 1 = 0.7; Threshold 2 = 0.7

Num of cDNA   Matching probe sets   Percent
13717         0                     35.31
10054         1                     25.88
15073         >=2                   38.80
38844         total                 100

Num of probe sets   Matching cDNA   Percent
25190               0               49.49
15223               1               29.91
10487               >=2             20.60
50900               total           100
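Tables of this kind (how many cDNAs or probe sets have 0, 1, or >=2 partners) can be derived from a list of accepted (probe set, cDNA) pairs. The sketch below is one possible way to do the tally; the function name and the toy identifiers are illustrative assumptions, not the lab's code or data.

```python
from collections import Counter


def match_distribution(pairs, all_ids, key_index):
    """Bin every ID in all_ids by its number of distinct partners.

    pairs     -- iterable of (probe_set_id, cdna_id) tuples that passed the filters
    all_ids   -- every ID on the side being summarised (including unmatched ones)
    key_index -- 0 to summarise probe sets, 1 to summarise cDNAs
    """
    partners = {}
    for pair in set(pairs):
        partners.setdefault(pair[key_index], set()).add(pair[1 - key_index])
    counts = Counter()
    for item in all_ids:
        n = len(partners.get(item, ()))
        counts["0" if n == 0 else "1" if n == 1 else ">=2"] += 1
    total = len(all_ids)
    # Each bin maps to (count, percent of total rounded to two decimals).
    return {k: (counts[k], round(100 * counts[k] / total, 2))
            for k in ("0", "1", ">=2")}


toy_pairs = [("ps1", "g1"), ("ps2", "g1"), ("ps3", "g2")]
print(match_distribution(toy_pairs, ["g1", "g2", "g3"], key_index=1))
print(match_distribution(toy_pairs, ["ps1", "ps2", "ps3", "ps4"], key_index=0))
```
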
7
Statistics on Approach A – scenario #2: perfect matches
● Affy probe-set target BLAST against IMGAG cDNA
Threshold 1 = 1.0; Threshold 2 = 1.0

Num of probe sets   Matching cDNA   Percent
39593               0               77.79
10344               1               20.32
963                 >=2             1.89
50900               total           100

Num of cDNA   Matching probe sets   Percent
28169         0                     72.52
8864          1                     22.82
1811          >=2                   4.66
38844         total                 100
8
Statistics of the Original Probe-Set vs. EST Mapping

Num of EST   Matching probe sets   Percent
6315         0                     17.12
29038        1                     78.74
1525         >=2                   4.14
36878        total                 100
9
Statistics of Our Probe-Set vs. EST Mapping

Num of EST   Matching probe sets   Percent
3304         0                     8.96
29535        1                     80.09
4039         >=2                   10.95
36878        total                 100

Overlap between our probe-set vs. EST mapping and the original Affy probe-set vs. EST mapping: 37872 ∩ 32108 = 32106.
Our method covered 32106/32108 = 99.994% of the original Affy mapping.
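The coverage figure is a set intersection divided by the size of the reference mapping. A minimal sketch, assuming each mapping is held as a set of (probe set, EST) pairs; the function name and the toy pairs are hypothetical.

```python
def coverage(ours, reference):
    """Return (shared pair count, percent of the reference mapping that we recover)."""
    shared = ours & reference
    return len(shared), 100 * len(shared) / len(reference)


# Toy example with made-up identifiers.
ours = {("ps1", "est1"), ("ps2", "est2"), ("ps3", "est3")}
affy = {("ps1", "est1"), ("ps2", "est2"), ("ps4", "est4")}
print(coverage(ours, affy))

# The counts reported on the slide: 32106 shared pairs out of 32108.
print(f"{100 * 32106 / 32108:.4f}%")
```

With the slide's counts the ratio evaluates to roughly 99.994%.
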
10
Statistics on Approach B
● IMGAG cDNA versus probe set

Num of cDNA        Matching probe sets   Percent
19961              0                     51.39
12909              1                     33.23
5974 (3134 uni)    >=2                   15.38
38844              total                 100
11
Probe Sets Mapped to IMGAG or ESTs

Item   Num of probe sets   Matched to                      Percent
1      7494                None                            14.72
2      21284               TC/EST only                     41.82
3      14362               TC/EST and IMGAGv2              28.22
         12866             TC/EST and unique IMGAGv2       25.28
         1496              TC/EST and multiple IMGAGv2     2.94
4      7760                IMGAGv2 only                    15.25
         6500              Unique IMGAGv2 only             12.77
         1260              Multiple IMGAGv2 only           2.48
       50900               Total                           100

[Venn diagram: EST only 41.82%; EST and IMGAG 28.22%; IMGAG only 15.25%; neither 14.72%]
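The categories in this table depend only on how many TC/EST hits and how many IMGAGv2 genes each probe set has. A small sketch of that classification, with a hypothetical function name and the category labels taken from the table:

```python
def classify_probe_set(n_est, n_imgag):
    """Map (number of TC/EST hits, number of IMGAGv2 genes) to a table category."""
    if n_est == 0 and n_imgag == 0:
        return "None"
    if n_imgag == 0:
        return "TC/EST only"
    kind = "unique" if n_imgag == 1 else "multiple"
    if n_est > 0:
        return f"TC/EST and {kind} IMGAGv2"
    return f"{kind.capitalize()} IMGAGv2 only"


for hits in [(0, 0), (2, 0), (1, 1), (1, 3), (0, 1), (0, 2)]:
    print(hits, classify_probe_set(*hits))
```
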
12
MTGI 8 vs. IMGAG Gene Mapping
● Mt2.0 cDNA BLASTN against MTGI8 (expectation 1e-04)
● Further applied the filters below:
HSP length / unigene length (a)
Identity length / HSP length (b)
● Result:
9333 (24.0%) cDNAs are mapped to 9255 (25.1%) unigenes (a>0.9, b>0.9);
11517 (29.6%) cDNAs are mapped to 11383 (30.9%) unigenes (a>0.8, b>0.8);
13284 (34.2%) cDNAs are mapped to 13092 (35.5%) unigenes (a>0.7, b>0.7);
9959 (25.64%) cDNAs are mapped to 10543 (28.59%) unigenes (a>0.8, b>0.95);
13063 (33.63%) cDNAs are mapped to 14585 (39.55%) unigenes (a>0.5, b>0.95).
● Total cDNA: 38844; total unigene: 36878
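The two ratio filters can be sketched as below. The `Hsp` record, its field names, and the toy values are assumptions made for illustration; in practice the fields would be filled from BLASTN output. The comparison is strict (">") to follow the thresholds as written on the slide.

```python
from dataclasses import dataclass


@dataclass
class Hsp:
    """One BLASTN high-scoring pair (hypothetical record layout)."""
    cdna_id: str          # query: Mt2.0 cDNA
    unigene_id: str       # subject: MTGI8 unigene
    hsp_length: int       # alignment length of the HSP
    identities: int       # identical positions within the HSP
    unigene_length: int   # full length of the unigene


def passes(hsp, min_a, min_b):
    """Apply a = HSP length / unigene length and b = identity length / HSP length."""
    a = hsp.hsp_length / hsp.unigene_length
    b = hsp.identities / hsp.hsp_length
    return a > min_a and b > min_b


def mapped_counts(hsps, min_a, min_b):
    """Count distinct cDNAs and distinct unigenes among the HSPs that pass."""
    kept = [h for h in hsps if passes(h, min_a, min_b)]
    return len({h.cdna_id for h in kept}), len({h.unigene_id for h in kept})


toy = [
    Hsp("cdna1", "TC1", 950, 940, 1000),   # a = 0.95, b ~ 0.989
    Hsp("cdna2", "TC1", 850, 845, 1000),   # a = 0.85, b ~ 0.994
    Hsp("cdna3", "TC2", 600, 500, 1000),   # a = 0.60, b ~ 0.833
]
print(mapped_counts(toy, 0.9, 0.9))
print(mapped_counts(toy, 0.8, 0.95))
print(mapped_counts(toy, 0.5, 0.8))
```

Because several cDNAs can pass against the same unigene, and the reverse, the two counts generally differ, as they do in the results listed on the slide.
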
13
MTGI 8 High Quality TC vs. IMGAG Gene Mapping
● I. Retrieved 9,396 High Quality TCs based on IMGAG's criteria
BLAST of TIGR's High Quality TCs vs. BACs:
(1) >95% identity over 80% of the TC length = 64.3% (current 2,500 BACs) -> 73.2% projected for 2,800 BACs to be sequenced
(2) >95% identity over 50% of the TC length = 68.6% (current 2,500 BACs) -> 77.0% projected for 2,800 BACs to be sequenced
● II. Our Mt2.0 cDNA BLASTN against the 9,396 MTGI8 High Quality TCs (expectation 1e-04)
Further applied the filters below:
HSP length / unigene length (a)
Identity length / HSP length (b)
Result:
3550 (9.14%) cDNAs are mapped to 3294 (35.06%) unigenes (a>0.8, b>0.95);
5052 (13.0%) cDNAs are mapped to 4613 (49.10%) unigenes (a>0.5, b>0.95).
Total cDNA: 38844; total High Quality TC: 9396
14
Thank You!
● Suggestions / Comments