Gene Prediction.

Presentation transcript:

Gene Prediction

Strategy
Input: assembled genome/contigs
Protein-coding gene prediction:
  - Ab initio: GeneMarkS, Glimmer3, Prodigal, RAST -> merge script -> merged gene calls
  - Homology-based: BLAT -> cross-checking against the merged gene calls
RNA gene prediction:
  - RNAmmer, tRNAscan-SE -> validation with Rfam
  - Rfam database search
Output: final results

Selection of Assembly
We compared the rRNAs and tRNAs predicted by the tools with those present in the database for M21709 (H. influenzae). The assemblies produced by Mira alone, Newbler+Mira, and Newbler+Mira+Celera gave the largest numbers of rRNA hits. For Mira alone and Newbler+Mira+Celera, the predicted rRNAs included one false hit that is not present in the database. We also examined the tRNAs predicted on these three assemblies: both Newbler+Mira and Newbler+Mira+Celera missed the tRNA for Glu, while extra tRNAs were predicted on the Mira assembly. Finally, we compared the number of genes predicted by homology on these assemblies. Overall, the Newbler+Mira assembly was chosen as the final assembly.

Assembly               Number of rRNA
Newbler                7
Newbler+Mira           11
amos                   5
Newbler+amos           6
Celera                 (missing)
Newbler+Celera         (missing)
Mira                   12
Newbler+Mira+Celera    (missing)
H. influenzae          18

For Mira and Newbler+Mira+Celera, the following extra prediction was not present in the database:
>rRNA_M21709_c12_593-707_DIR+ /molecule=5s_rRNA /score=63.7
TGGCAGAGATAGTGCAATAGATCCACCTGATACCATACCGAACTCAGAAGTGAAATGTTG
TAACGCTGATGGTAGTGTGGGGTTTCCCCATGTGAGAGTAAGGCACTGCCAATCA

tRNAscan-SE
Assembly               Number of tRNA   Missing amino acid
Newbler+Mira           50               Glu
Mira                   55
Newbler+Mira+Celera    51               Glu
H. influenzae          (missing)

Homology based
Query coverage (%): 90
Assembly strategy      # Contigs    Predicted genes
amosCMP_NC014922       226          1459
CA_Assembly.ctg        380          1367
NewblerDNLC_CA         31           1540
miraUP                 52           1564
newblerDNLC_amosCMP    28           1541
newblerDNLC            37           1544
newblerDNLC_miraUP     (missing)    1565
Newbler_Mira_CA        (missing)    1515

Ab-Initio prediction

Initial Ab-Initio Results
Strain    Prodigal   GeneMarkS   Glimmer3   RAST
M19107    1843       1929        1833       1700
M19501    1712       1757        1759       1753
M21127    1983       2049        2041       2027
M21621    1862       1899        1903       1875
M21639    2520       2615        2598       2543
M21709    1763       1805        1807       1801

Merging ab initio prediction results
Merging order: Prodigal + GeneMarkS -> Glimmer3 -> RAST
Overlap cutoff: 75% (decided empirically)
Coverage calculation method:
  - Overlap percent of Gene1 over Gene2 w.r.t. Gene1
  - Overlap percent of Gene2 over Gene1 w.r.t. Gene2
If either coverage is ≥ 75%, select the gene call with the lower overlap.
Formula: coverage w.r.t. Gene_i (%) = 100 × (length of the overlap between Gene1 and Gene2) / (length of Gene_i)
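The coverage calculation above can be sketched in Python. This is only an illustration of the two-way overlap percentage and the 75% cutoff, not the merge script used in the project. The `Gene` record, the function names, and the requirement that overlapping calls share contig and strand are assumptions. The rule for choosing between two redundant calls is simplified: an already accepted call is kept and the new one is skipped.

```python
from typing import List, NamedTuple, Tuple


class Gene(NamedTuple):
    # Hypothetical record for one gene call; coordinates are 1-based, inclusive.
    contig: str
    start: int
    end: int
    strand: str


def overlap_len(a: Gene, b: Gene) -> int:
    """Number of bases shared by two calls (0 if contig or strand differ)."""
    if a.contig != b.contig or a.strand != b.strand:
        return 0
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def coverages(a: Gene, b: Gene) -> Tuple[float, float]:
    """Overlap percent w.r.t. gene a and w.r.t. gene b."""
    ov = overlap_len(a, b)
    return (100.0 * ov / (a.end - a.start + 1),
            100.0 * ov / (b.end - b.start + 1))


def merge_calls(base: List[Gene], extra: List[Gene], cutoff: float = 75.0) -> List[Gene]:
    """Add calls from `extra` unless they overlap an accepted call at >= cutoff.

    Simplification (assumption): the accepted call is always kept.
    """
    merged = list(base)
    for g in extra:
        if not any(max(coverages(g, m)) >= cutoff for m in merged):
            merged.append(g)
    return merged


a = Gene("contig1", 1, 100, "+")
b = Gene("contig1", 51, 150, "+")   # 50% overlap with a: treated as a separate gene
c = Gene("contig1", 11, 100, "+")   # fully contained in a: treated as redundant
merged = merge_calls([a], [b, c])
```

Applying `merge_calls` repeatedly in the stated order (Prodigal + GeneMarkS, then Glimmer3, then RAST) would mimic the sequential merge, under the same assumptions.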

[Figure: Venn diagram comparing the merged Prodigal + GeneMarkS + Glimmer3 calls with the RAST calls; region counts 11, 1742, 35.]

Merge Results
Strain    Prodigal   GeneMarkS   Glimmer3   RAST   Merged
M19107    1843       1929        1833       1700   1972
M19501    1712       1757        1759       1753   1779
M21127    1983       2049        2041       2027   2083
M21621    1862       1899        1903       1875   1920
M21639    2520       2615        2598       2543   2668
M21709    1763       1805        1807       1801   1840

Ab initio results cross-referenced with homology-based results
Overlap cutoff: 75%
Strain    Common gene calls   Unique gene calls   Total gene calls
M19107    1016                956                 1972
M19501    1115                664                 1779
M21127    1078                1005                2083
M21621    1038                882                 1920
M21639    1178                1490                2668
M21709    1568                272                 1840

Homology based

Homology-based Gene Prediction using BLAT
Query: 1709 protein-coding genes from Haemophilus influenzae
Targets: Haemophilus haemolyticus assemblies (number of contigs)
  M19107.fasta   99
  M19501.fasta   17
  M21127.fasta   29
  M21621.fasta   24
  M21639.fasta   49
  M21709.fasta   31
Workflow: BLAT (UCSC) -> Output.pslx -> predicted genes -> query coverage (%) -> frequency graphs -> define cutoff

[Figure: frequency plots of query coverage (%), with the cut-off marked.]

Homology-based Gene Prediction using BLAT: Preliminary Results
Query-coverage cutoff (%): 90
Strain     Contigs   Predicted genes   Average length
M19107     99        787               1049
M19501     17        1063              996
M21127     29        901               963
M21621     24        930               685
M21639     49        970               1277
M21709*    31        1515              813
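A minimal sketch of applying a 90% query-coverage cutoff to BLAT output follows. The slides do not define how query coverage was computed. This example assumes it is the aligned span of the query (`qEnd - qStart`) as a percentage of the query length (`qSize`), using the standard tab-separated PSL column order. The function names are hypothetical.

```python
from typing import Iterable, List, Tuple


def query_coverage(fields: List[str]) -> float:
    """Percent of the query spanned by the alignment (assumed definition).

    PSL columns used: 10 = qSize, 11 = qStart (0-based), 12 = qEnd (exclusive).
    """
    q_size, q_start, q_end = int(fields[10]), int(fields[11]), int(fields[12])
    return 100.0 * (q_end - q_start) / q_size


def filter_psl(lines: Iterable[str], cutoff: float = 90.0) -> List[Tuple[str, str, float]]:
    """Return (query name, target name, coverage) for hits at or above the cutoff."""
    hits = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip header or malformed lines: data rows start with the match count.
        if len(fields) < 21 or not fields[0].isdigit():
            continue
        cov = query_coverage(fields)
        if cov >= cutoff:
            hits.append((fields[9], fields[13], cov))
    return hits


# Two made-up PSL rows: a full-length hit and a half-length hit.
full = "\t".join(["95", "5", "0", "0", "0", "0", "0", "0", "+", "geneA", "100",
                  "0", "100", "contig1", "5000", "200", "300", "1", "100,", "0,", "200,"])
half = "\t".join(["50", "0", "0", "0", "0", "0", "0", "0", "+", "geneB", "100",
                  "0", "50", "contig2", "5000", "10", "60", "1", "50,", "0,", "10,"])
kept = filter_psl(["psLayout version 3", full, half])
```

The coverage values returned by `filter_psl(lines, cutoff=0.0)` could also be binned to reproduce a frequency graph like the one used to choose the cutoff.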

Improvement strategy

Improvement strategy for predicted genes
1. Test assemblies
2. Test alignment strategies (BLAT vs. BLAST)
3. Increase the number of homologous protein-coding genes

H. haemolyticus is most closely related to H. influenzae
  - 16S rRNA gene
  - infB gene
  - Multilocus Sequence Analysis (MLSA)

Homology-based Gene Prediction using BLAT
Query: 8629 protein-coding genes from
  - H. influenzae KW20
  - H. influenzae 86_028NP
  - H. influenzae PittEE
  - H. ducreyi 35000HP
  - H. somnus 129PT
Targets: Haemophilus haemolyticus assemblies, newblerDNLC_miraUP (number of contigs)
  M19107.fasta   129
  M19501.fasta   19
  M21127.fasta   32
  M21621.fasta   27
  M21639.fasta   49
  M21709.fasta   31
Workflow: BLAT (UCSC) -> Output.pslx -> predicted genes -> query coverage (%) -> frequency graphs -> define cutoff

Homology-based Gene Prediction using BLAT and newblerDNLC_miraUP: Results
Query coverage (%): 90
Strain    # Contigs   Initial # predicted genes
M19107    129         2602
M19501    19          3148
M21127    32          2892
M21621    27          2862
M21639    49          3013
M21709    31          4439

Number of Initial Predicted Genes: BLAT vs. BLAST
Strain    BLAT   BLAST
M19107    2602   2970
M19501    3148   1975
M21127    2892   2110
M21621    2862   3156
M21639    3013   2525
M21709    4439   3125

Parameters
  - Assembly: newblerDNLC_miraUP
  - Query coverage: 90%
  - Input file: 8629 protein-coding genes, all homologous strains (H. influenzae KW20, H. influenzae 86_028NP, H. influenzae PittEE, H. ducreyi 35000HP, H. somnus 129PT)

Homology-based Gene Prediction using BLAT, newblerDNLC_miraUP, 90% query coverage: Results
Strain    Contigs   Initial predicted genes   PG unique filter   Final total (non-redundant)
M19107    129       2602                      1259               866
M19501    19        3148                      1551               1114
M21127    32        2892                      1425               1031
M21621    27        2862                      1432               1035
M21639    49        3013                      1527               1121
M21709    31        4439                      2184               1567

Other Functional Elements Prediction Leo Wu

Rfam Database Homology Search
Rfam is a collection of RNA families:
  - Non-coding RNA genes
  - Structured cis-regulatory elements
  - Self-splicing RNAs
Method: WU-BLAST search, keeping hits with E-value < 1e-5

Rfam BLAST Results
The output format is: <rfam acc> <rfam id> <seq id> <seq start> <seq end> <strand> <score>
Results:
84 Rfam similarity 25970 27512 1477.28 + . evalue=2.08e-50;gc-content=52;id=SSU_rRNA_bacteria.1;model_end=1518;model_start=1;rfam-acc=RF00177;rfam-id=SSU_rRNA_bacteria

Accession   # Total ncRNA   # of rRNA   # of tRNA   # of other functional RNA   Genome length
M19107      83              9           56          18                          1774129
M19501      86              12          53          21                          1809865
M21127      (missing)       (missing)   52          22                          2029793
M21621      (missing)       11          55          20                          1959123
M21639      96              10          (missing)   33                          2397857
M21709      89              (missing)   (missing)   24                          1822852
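The sample result line can be parsed and checked against the E-value cutoff from the Rfam search with a few lines of Python. This is a sketch only. It assumes the line follows a GFF-style layout of nine whitespace-separated columns (sequence id, source, feature, start, end, score, strand, phase, attributes), which is what the sample appears to show. The function and key names are hypothetical.

```python
from typing import Dict, Union

# The example hit shown in the results above.
SAMPLE = ("84 Rfam similarity 25970 27512 1477.28 + . "
          "evalue=2.08e-50;gc-content=52;id=SSU_rRNA_bacteria.1;"
          "model_end=1518;model_start=1;rfam-acc=RF00177;rfam-id=SSU_rRNA_bacteria")


def parse_rfam_hit(line: str) -> Dict[str, Union[str, int, float]]:
    """Parse one GFF-style Rfam hit line into a flat dictionary (assumed layout)."""
    seq_id, source, feature, start, end, score, strand, phase, attrs = line.split(None, 8)
    # Attributes are key=value pairs separated by semicolons.
    attributes = dict(item.split("=", 1) for item in attrs.strip().split(";") if item)
    return {
        "seq_id": seq_id,
        "start": int(start),
        "end": int(end),
        "score": float(score),
        "strand": strand,
        "evalue": float(attributes["evalue"]),
        "rfam_acc": attributes["rfam-acc"],
        "rfam_id": attributes["rfam-id"],
    }


def passes_cutoff(hit: Dict[str, Union[str, int, float]], max_evalue: float = 1e-5) -> bool:
    """Keep hits whose E-value is below the cutoff."""
    return hit["evalue"] < max_evalue


hit = parse_rfam_hit(SAMPLE)
keep = passes_cutoff(hit)
```

Counting the kept hits per `rfam_id` would give per-family totals of the kind summarized in the table, assuming one line per hit.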

Rfam Validation – rRNA
Columns: Strain | Predicted by RNAmmer 1.2 | Verified by Rfam 10.1 Database | Predicted by Rfam 10.1 Using Contigs
Values as preserved (column assignment is not fully recoverable):
  M19107: 8, 9
  M19501: 10, 12
  M21127: 11
  M21621, M21639, M21709: (missing)
Notes:
  - Some 16S rRNAs predicted by RNAmmer cannot be verified by the Rfam BLAST search.
  - The Rfam BLAST results contain one false-positive (FP) hit: SSU_rRNA_archaea.
  - M19501: ½ 16S cannot be verified. 10 - 1 (16S) + 2 (5S) + 1 (FP) = 12

Rfam Validation – tRNA
Columns: Strain | Predicted by tRNAscan-SE 1.3 | Verified by Rfam 10.1 Database | Verified by Rfam 10.1 Using Contigs
Values as preserved (column assignment is not fully recoverable):
  M19107: 54, 56
  M19501: 51, 53
  M21127: 50, 52
  M21621: 55
  M21639, M21709: (missing)
Notes:
  - The Rfam BLAST results have one duplicated hit: tRNA and tRNA-Sec (selenocysteine).
  - The Rfam BLAST results have one more tmRNA prediction.
  - M21709: 2 tRNA-Sec

Rfam BLAST Results – Functional RNAs (1/2)
Columns: RNA name | M19107 | M19501 | M21127 | M21621 | M21639 | M21709 (most per-strain counts are not preserved)
sRNA
  isrK          1
Cis-reg
  LR-PK1
  Alpha_RBS
  S15
  His_leader    2
  SECIS_3
  sxy
  Thr_leader
  RtT
Antisense
  C4            11

Rfam BLAST Results – Functional RNAs (2/2)
Columns: RNA name | M19107 | M19501 | M21127 | M21621 | M21639 | M21709 (most per-strain counts are not preserved)
Riboswitch
  Lysine              1
  FMN
  MOCO_RNA_motif
  Glycine             2
  TPP                 3 4
  PreQ1
Ribozyme
  RNaseP_bact_a
Others
  6S
  GcvB
  Bacteria_small_SRP
  Bacteria_large_SRP