1
Gene Prediction
2
Strategy
Assembled Genome/Contigs
Protein coding genes prediction: GeneMarkS, Glimmer3, Prodigal, RAST (merge script), BLAT (crosschecking) -> merged gene calls
RNA genes prediction: RNAmmer, tRNAscan-SE, Rfam database search (validation with Rfam)
Final Results
3
Selection of Assembly
We compared the rRNAs and tRNAs predicted by the tools with those present in the database for M21709 (H. influenzae). The assemblies produced by Mira alone, Newbler+Mira, and Newbler+Mira+Celera gave the most rRNA hits. For Mira alone and Newbler+Mira+Celera, one predicted rRNA was a false hit that is not present in the database. We also examined the tRNAs predicted on these three assemblies: both Newbler+Mira and Newbler+Mira+Celera missed the tRNA for "Glu", while extra tRNAs were predicted on the Mira assembly. We also compared the number of genes predicted by homology on these assemblies. Overall, the Newbler+Mira assembly was chosen as the final assembly.
4
Assembly               Number of rRNA
Newbler                7
Newbler+Mira           11
amos                   5
Newbler+amos           6
Celera
Newbler+Celera
Mira                   12
Newbler+Mira+Celera
H. influenzae          18

For Mira and Newbler+Mira+Celera, the following extra prediction was not present in the database:
>rRNA_M21709_c12_ _DIR+ /molecule=5s_rRNA /score=63.7
TGGCAGAGATAGTGCAATAGATCCACCTGATACCATACCGAACTCAGAAGTGAAATGTTG
TAACGCTGATGGTAGTGTGGGGTTTCCCCATGTGAGAGTAAGGCACTGCCAATCA
5
tRNAscan-SE
Assembly               Number of tRNA   Missing Amino Acid
Newbler+Mira           50               Glu
Mira                   55
Newbler+Mira+Celera    51
H. influenzae

Homology based
Assembly Strategy      # Contigs   Query coverage %   Predicted Genes
amosCMP_NC014922       226         90                 1459
CA_Assembly.ctg        380                            1367
NewblerDNLC_CA         31                             1540
miraUP                 52                             1564
newblerDNLC_amosCMP    28                             1541
newblerDNLC            37                             1544
newblerDNLC_miraUP                                    1565
Newbler_Mira_CA                                       1515
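The "Missing Amino Acid" column can be derived from the tRNAscan-SE tabular output. The following is a minimal sketch, assuming the default tab-separated output with the tRNA type in the fifth column and no whitespace inside sequence names. The function names and the 20-amino-acid reference set are illustrative and are not taken from the slides.

```python
from collections import Counter

# Assumption: the 20 standard amino acids, as three-letter codes, are the
# reference set against which predicted tRNA types are compared.
STANDARD_AMINO_ACIDS = {
    "Ala", "Arg", "Asn", "Asp", "Cys", "Gln", "Glu", "Gly", "His", "Ile",
    "Leu", "Lys", "Met", "Phe", "Pro", "Ser", "Thr", "Trp", "Tyr", "Val",
}


def trna_types(lines):
    """Count predicted tRNAs per type from tRNAscan-SE tabular output lines."""
    found = Counter()
    for line in lines:
        fields = line.split()
        # Skip header and separator lines: data rows have at least nine
        # columns and a numeric tRNA index in the second column.
        if len(fields) < 9 or not fields[1].isdigit():
            continue
        found[fields[4]] += 1
    return found


def missing_amino_acids(lines):
    """Return the standard amino acids with no predicted tRNA, sorted."""
    return sorted(STANDARD_AMINO_ACIDS - set(trna_types(lines)))
```

Calling `missing_amino_acids(open("M21709.trnascan"))` on a hypothetical output file would list the amino acids for which no tRNA was predicted.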
6
Ab-Initio prediction
7
Initial Ab-Initio Results
Strain    Prodigal   GeneMarkS   Glimmer3   RAST
M19107    1843       1929        1833       1700
M19501    1712       1757        1759       1753
M21127    1983       2049        2041       2027
M21621    1862       1899        1903       1875
M21639    2520       2615        2598       2543
M21709    1763       1805        1807       1801
8
Merging ab initio prediction results
Merging order: Prodigal + GeneMarkS -> Glimmer3 -> RAST
Overlap cutoff: empirically set to 75%.
Coverage calculation method:
- Overlap percent of Gene1 over Gene2 w.r.t. Gene1
- Overlap percent of Gene2 over Gene1 w.r.t. Gene2
If either coverage is ≥ 75%, select the gene call with the lower overlap.
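The overlap test used for merging can be sketched as follows. This is an illustration under stated assumptions, not the group's merge script: the `Gene` record, the function names, and the requirement that two calls share contig and strand are hypothetical. When two calls match, the sketch simply keeps the call already in the merged set, which is simpler than the selection rule on the slide. Coverage is taken as overlap length divided by the length of the gene it is measured against, times 100.

```python
from typing import List, NamedTuple


class Gene(NamedTuple):
    contig: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str
    source: str  # predictor that produced the call


def overlap_length(a: Gene, b: Gene) -> int:
    """Number of shared positions; zero if contig or strand differ (assumption)."""
    if a.contig != b.contig or a.strand != b.strand:
        return 0
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def coverage(a: Gene, b: Gene) -> float:
    """Overlap of a and b as a percentage of a's own length."""
    return 100.0 * overlap_length(a, b) / (a.end - a.start + 1)


def same_call(a: Gene, b: Gene, cutoff: float = 75.0) -> bool:
    """Two calls match if either coverage reaches the cutoff."""
    return coverage(a, b) >= cutoff or coverage(b, a) >= cutoff


def merge_calls(merged: List[Gene], new_calls: List[Gene],
                cutoff: float = 75.0) -> List[Gene]:
    """Add each new call that matches nothing already in the merged set."""
    result = list(merged)
    for gene in new_calls:
        if not any(same_call(gene, kept, cutoff) for kept in result):
            result.append(gene)
    return result
```

Applying `merge_calls` repeatedly, first with the Glimmer3 calls and then with the RAST calls, would follow the merging order given above.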
9
[Venn diagram: Prodigal + GeneMarkS + Glimmer3 vs. RAST; region counts 11, 1742, 35]
10
Merge Results
Strain    Prodigal   GeneMarkS   Glimmer3   RAST   Merged
M19107    1843       1929        1833       1700   1972
M19501    1712       1757        1759       1753   1779
M21127    1983       2049        2041       2027   2083
M21621    1862       1899        1903       1875   1920
M21639    2520       2615        2598       2543   2668
M21709    1763       1805        1807       1801   1840
11
ab initio results cross-referenced with Homology based results
Strain    Common Gene Calls   Unique Gene Calls   Total Gene Calls
M19107    1016                956                 1972
M19501    1115                664                 1779
M21127    1078                1005                2083
M21621    1038                882                 1920
M21639    1178                1490                2668
M21709    1568                272                 1840
Overlap cutoff: 75%
12
Homology based
13
Homology-based Gene Prediction using BLAT
Query: 1709 protein coding genes, Haemophilus influenzae
Targets: Haemophilus haemolyticus (number of contigs): M19107.fasta (99), M19501.fasta (17), M21127.fasta (29), M21621.fasta (24), M21639.fasta (49), M21709.fasta (31)
Blat-UCSC -> Output.pslx -> Predicted genes -> Query coverage (%) -> Frequency graphs -> Define cutoff
14
[Figure: frequency of hits vs. query coverage (%), with the cut-off marked]
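The query-coverage step can be sketched from the BLAT output as follows. This is a minimal illustration, assuming tab-separated PSL/PSLX lines with the standard column order (qName, qSize, qStart, qEnd in columns 10 to 13) and assuming that query coverage means the aligned query span divided by the query length. The slides do not state the exact definition, and the function names are hypothetical.

```python
from collections import Counter


def query_coverage(fields):
    """Aligned query span as a percentage of the query length (assumed definition)."""
    q_size, q_start, q_end = int(fields[10]), int(fields[11]), int(fields[12])
    return 100.0 * (q_end - q_start) / q_size


def psl_records(lines):
    """Yield the column lists of PSL/PSLX alignment lines, skipping headers."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Alignment rows have at least 21 columns and start with a match count.
        if len(fields) >= 21 and fields[0].isdigit():
            yield fields


def coverage_histogram(lines, bin_width=10):
    """Count hits per coverage bin; keys are the lower bound of each bin."""
    counts = Counter()
    for fields in psl_records(lines):
        counts[int(query_coverage(fields) // bin_width) * bin_width] += 1
    return counts


def filter_by_query_coverage(lines, cutoff=90.0):
    """Return (query, target, tStart, tEnd, strand, coverage) for hits at or above the cutoff."""
    hits = []
    for fields in psl_records(lines):
        cov = query_coverage(fields)
        if cov >= cutoff:
            hits.append((fields[9], fields[13], int(fields[15]),
                         int(fields[16]), fields[8], cov))
    return hits
```

The histogram corresponds to the frequency graph used to choose the cut-off, and `filter_by_query_coverage` applies the chosen value to `Output.pslx`.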
15
Homology-based Gene Prediction using BLAT
Preliminary Results
Strain     Contigs   Query-coverage cutoff (%)   Predicted genes   Average Length
M19107     99        90                          787               1049
M19501     17                                    1063              996
M21127     29                                    901               963
M21621     24                                    930               685
M21639     49                                    970               1277
M21709*    31                                    1515              813
16
Improvement strategy
17
Improvement strategy
1. Test assemblies
2. Test alignment strategies (BLAT vs. BLAST)
3. Increase the number of homologous protein coding genes
18
H. haemolyticus is most closely related to H. influenzae
Based on: 16S rRNA gene, infB gene, Multilocus Sequence Analysis (MLSA)
19
Homology-based Gene Prediction using BLAT
Query: 8629 protein coding genes from H. influenzae KW20, H. influenzae 86_028NP, H. influenzae PittEE, H. ducreyi HP, H. somnus 129PT
Targets: Haemophilus haemolyticus, newblerDNLC_miraUP assemblies (number of contigs): M19107.fasta (129), M19501.fasta (19), M21127.fasta (32), M21621.fasta (27), M21639.fasta (49), M21709.fasta (31)
Blat-UCSC -> Output.pslx -> Predicted genes -> Query coverage (%) -> Frequency graphs -> Define cutoff
20
Homology-based Gene Prediction using BLAT and newblerDNLC_miraUP
Results
Query coverage: 90%
Strain    # Contigs   Initial # Predicted Genes
M19107    129         2602
M19501    19          3148
M21127    32          2892
M21621    27          2862
M21639    49          3013
M21709    31          4439
21
BLAT vs. BLAST: Number of Initial Predicted Genes
Strain    BLAT   BLAST
M19107    2602   2970
M19501    3148   1975
M21127    2892   2110
M21621    2862   3156
M21639    3013   2525
M21709    4439   3125
Query: 8629 protein coding genes from H. influenzae KW20, H. influenzae 86_028NP, H. influenzae PittEE, H. ducreyi HP, H. somnus 129PT
Parameters
Assembly: newblerDNLC_miraUP
Query coverage: 90%
Input file: 8629 protein coding genes, all homologous strains
22
Homology-based Gene Prediction using BLAT, newblerDNLC_miraUP, 90% query coverage
Results
Strain    Contigs   Initial Predicted Genes   PG unique filter   Final Total (Non redundant)
M19107    129       2602                      1259               866
M19501    19        3148                      1551               1114
M21127    32        2892                      1425               1031
M21621    27        2862                      1432               1035
M21639    49        3013                      1527               1121
M21709    31        4439                      2184               1567
23
Other Functional Elements Prediction
Leo Wu
24
Rfam Database Homology Search
A collection of RNA families
-Non-coding RNA genes
-Structured cis-regulatory elements
-Self-splicing RNAs
WU-BLAST search, keeping hits with E-value < 1e-5
25
Rfam BLAST Results
The output format is: <rfam acc> <rfam id> <seq id> <seq start> <seq end> <strand> <score>
Results:
Rfam similarity evalue=2.08e-50;gc-content=52;id=SSU_rRNA_bacteria.1;model_end=1518;model_start=1;rfam-acc=RF00177;rfam-id=SSU_rRNA_bacteria

Accession #   Total ncRNA   # of rRNA   # of tRNA   # of other functional RNA   Genome Length
M19107        83            9           56          18
M19501        86            12          53          21
M21127                                  52          22
M21621                      11          55          20
M21639        96            10                      33
M21709        89                                    24
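The E-value filter and per-family counts can be sketched from the attribute string shown in the result line above. This is a minimal illustration, assuming each hit carries a semicolon-separated `key=value` attribute string with `evalue` and `rfam-id` keys, as in the example. The function names are hypothetical, and the grouping of families into rRNA, tRNA and other functional RNA is not reproduced here.

```python
from collections import Counter


def parse_attributes(attributes):
    """Split 'key=value;key=value' into a dict of strings."""
    pairs = (item.split("=", 1)
             for item in attributes.strip().split(";") if "=" in item)
    return {key: value for key, value in pairs}


def passes_evalue(attributes, max_evalue=1e-5):
    """Keep a hit only if its E-value is below the threshold."""
    return float(parse_attributes(attributes)["evalue"]) < max_evalue


def count_families(attribute_lines, max_evalue=1e-5):
    """Count retained hits per Rfam family identifier."""
    counts = Counter()
    for line in attribute_lines:
        if passes_evalue(line, max_evalue):
            counts[parse_attributes(line)["rfam-id"]] += 1
    return counts
```

Summing the counts for the rRNA and tRNA families and treating the remainder as other functional RNA would give per-strain totals of the kind tabulated above.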
26
Rfam Validation – rRNA
Columns: Strains, Predicted by RNAmmer 1.2, Verified by Rfam 10.1 Database, Predicted by Rfam 10.1 Using Contigs
M19107 8 9
M19501 10 12
M21127 11
M21621
M21639
M21709
-Some 16S rRNAs predicted by RNAmmer cannot be verified by Rfam BLAST search
-Rfam BLAST results contain one FP hit – SSU_rRNA_archaea
M > ½ 16S can’t be verified *16S+2*5S+1*FP
27
Rfam Validation – tRNA
Columns: Strains, Predicted by tRNAscan-SE 1.3, Verified by Rfam 10.1 Database, Verified by Rfam 10.1 Using Contigs
M19107 54 56
M19501 51 53
M21127 50 52
M21621 55
M21639
M21709
-Rfam BLAST results have one duplicated hit – tRNA and tRNA-Sec (Selenocysteine)
-Rfam BLAST results have one more tmRNA prediction
M > 2 tRNA-Sec
28
Rfam BLAST Results – Functional RNAs (1/2)
Columns: RNA Name, M19107, M19501, M21127, M21621, M21639, M21709
sRNA: isrK 1
Cis-reg: LR-PK1, Alpha_RBS, S15, His_leader 2, SECIS_3, sxy, Thr_leader, RtT
Antisense: C4 11
29
Rfam BLAST Results – Functional RNAs (2/2)
Columns: RNA Name, M19107, M19501, M21127, M21621, M21639, M21709
Riboswitch: Lysine 1, FMN, MOCO_RNA_motif, Glycine 2, TPP 3 4, PreQ1
Ribozyme: RNaseP_bact_a
Others: 6S, GcvB, Bacteria_small_SRP, Bacteria_large_SRP