Genome analysis and annotation

Genome analysis and annotation

Genome Annotation
- Which sequences code for proteins and structural RNAs?
- What is the function of the predicted gene products?
- Can we link genotype to phenotype? (e.g., which genes are turned on, and when? Why do two strains of the same pathogen vary in their pathogenicity?)
- Can we trace the evolutionary history of an organism from its genomic sequence and genome organization? The evolutionary history of a pathway?

Gene finding
Begins with the prediction of gene models through:
1) Identification of open reading frames (ORFs)
2) Examination of base composition differences between coding and non-coding regions
3) Computational gene recognition (exons, introns, exon-intron boundaries) using a variety of gene-finding algorithms (GLIMMER, GRAIL, FGENEH, GENSCAN, GlimmerHMM, etc.)
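Steps 1 and 2 can be illustrated with a minimal Python sketch. It assumes ATG as the only start codon and the three standard stop codons, and it scans all six reading frames. The function names (`find_orfs`, `gc_content`) and the `min_codons` parameter are hypothetical and are not taken from any of the gene finders named above, which use far richer statistical models.

```python
# Stop codons of the standard genetic code; ATG is assumed to be the only start.
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=3):
    """Scan six frames for ATG...stop ORFs.

    Returns (strand, start, end, orf_sequence) tuples. Coordinates are
    0-based, end-exclusive, and relative to the strand that was scanned.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame start codon opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3, s[start:i + 3]))
                    start = None  # look for the next ORF in this frame
    return orfs


def gc_content(seq):
    """Fraction of G and C bases, a simple base-composition measure."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


if __name__ == "__main__":
    for orf in find_orfs("CCATGAAATTTTAGGG"):
        print(orf, round(gc_content(orf[3]), 2))
```

An ORF scan like this is usually sufficient only for intron-free genomes; in eukaryotes the coding sequence is split across exons, so ORFs alone do not recover gene structures.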

Gene finding (cont'd)
Another approach to finding and confirming genes relies on experimental evidence and homology:
1) Alignment of Expressed Sequence Tags (ESTs) and full-length cDNA sequences to genomic DNA (gDNA)
   Advantages: gene discovery, proof of expression, training data for gene finders
   Disadvantages: disproportionate representation (highly expressed genes are over-represented)
2) Examination of protein translation profiles: peptide sequencing, mass spectrometry, etc.

Gene finding (cont'd)
The difficulty of gene finding varies between organisms.
Relatively easy in bacterial and archaeal genomes, mostly due to:
1) High gene density (1 kb per gene on average)
2) Short intergenic regions
3) Lack of introns
Much more difficult in eukaryotic genomes, where it can become a major focus of activity in the annotation phase:
1) Low gene density (1-200 kb per gene)
2) Presence of repeats
3) Most eukaryotic genes have introns and exons, and many undergo alternative splicing
Inaccurate predictions and false positives are common.

Repeats complicate genome assembly and gene finding (Example: Schistosoma mansoni genome)
[Figure: repeat annotation of an S. mansoni genomic region. Labels, in transcript order: 53% id.; Sm SR2 sub-family A non-LTR retrotransposon (SmR2A); 94% id.; Sm SR2 sub-family B non-LTR retrotransposon; Unknown repeat; SmR2A (95% id.); Unknown repeat; SmR2A (91% id.); SmR2A (89% id.); SmR2A (92% id.); SjR2 like (85% id.); SR2A (90% id.)]

Comparing genomes can help with gene finding
[Figure: nucleotide sequence conservation between S. japonicum and S. mansoni, plotted using mVISTA]

Sequence homology at exons
S. mansoni as reference. Conclusion: the S. japonicum sequence can be used to find exons in S. mansoni.
S. japonicum as reference. Conclusion: the S. mansoni sequence can be used to find exons in S. japonicum.
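The idea that conserved stretches point to exons can be sketched as percent identity computed in sliding windows over a pairwise alignment. This is a minimal Python illustration and not a description of how mVISTA itself works. The function name `window_identity`, its `window` and `step` parameters, and the toy alignment are all assumptions made for this example.

```python
def window_identity(aln_a, aln_b, window=100, step=50):
    """Percent identity per window over two gapped, equal-length alignment rows.

    Returns a list of (window_start, percent_identity). A column counts as a
    match only when both rows carry the same base and neither is a gap ('-').
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have the same length")
    results = []
    for start in range(0, len(aln_a) - window + 1, step):
        cols = zip(aln_a[start:start + window], aln_b[start:start + window])
        matches = sum(1 for a, b in cols if a == b and a != "-")
        results.append((start, 100.0 * matches / window))
    return results


if __name__ == "__main__":
    # Toy alignment: the first half is identical, the second half diverges.
    row_reference = "ACGTACGTAC"
    row_other = "ACGTAGG-TC"
    for start, pct in window_identity(row_reference, row_other, window=5, step=5):
        print(start, pct)
```

Windows whose identity stays above a chosen cutoff would be candidate conserved regions; deciding whether they are exons still needs supporting evidence such as transcript or protein alignments.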

THE INSTITUTE FOR GENOMIC RESEARCH (TIGR)
Case study: Gene finding in the eukaryotic parasite Schistosoma mansoni

The TIGR Gene Modeling Pipeline
Pipeline stages: Repeat Masking; Ab-initio Gene Prediction; Sequence Homology Searching; Combining Evidence; Final Gene Structures
Repeat masking:
- Prior to gene discovery efforts, repeats must be identified and masked.
- Repeats tend to confuse ab-initio gene finders.
- Fragments of transposons are often mistaken for protein-coding exons of genes.
- Masking repeats increases the signal-to-noise ratio.
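Hard-masking can be sketched in a few lines of Python: bases inside repeat intervals are replaced with 'N' so that downstream gene finders ignore them. This is an illustration only. The function name `mask_intervals` and the interval format (0-based, end-exclusive) are assumptions, and in practice the intervals would come from a repeat-annotation tool rather than being typed by hand.

```python
def mask_intervals(seq, intervals, mask_char="N"):
    """Replace every base inside the given [start, end) intervals with mask_char.

    Intervals are 0-based and end-exclusive; parts that fall outside the
    sequence are ignored, and overlapping intervals are handled naturally.
    """
    masked = list(seq)
    for start, end in intervals:
        for i in range(max(start, 0), min(end, len(seq))):
            masked[i] = mask_char
    return "".join(masked)


if __name__ == "__main__":
    # Hypothetical repeat hits on a toy sequence.
    print(mask_intervals("ACGTACGTAC", [(2, 5), (8, 20)]))
```

Some pipelines prefer soft-masking (lower-casing repeat bases) so that the sequence is preserved; the slides report 'N' counts, which is why this sketch hard-masks.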

Construction of an S. mansoni Repeat Library
- Catalog known Schistosoma transposable elements (TEs), particularly retrotransposons: SR1, SR2, Sinbad, fugitive, salmonid, boudicca, saci, cercyon
- De novo construction of a repeat library using RepeatScout (Price et al., 2005)
- repeat families found

Genome Masking Statistics
Total number of basepairs                 381,816,328
'N's found in gaps                          6,171,089
'N's found after masking                  187,957,396
Adjusted totals, accounting for N-gaps:
Total number of basepairs                 375,645,239
Masked basepairs                          181,786,307
Percentage of the genome repeat-masked          48.3%
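The adjusted totals follow from subtracting the gap 'N's from both the genome length and the post-masking 'N' count. The short Python check below reproduces that arithmetic from the three raw figures on the slide; the variable names are chosen for this example. The computed percentage comes out at roughly 48.39%, which matches the slide's 48.3% if that figure was truncated rather than rounded.

```python
# Raw figures taken from the slide.
total_bp = 381_816_328         # total number of basepairs
gap_n = 6_171_089              # 'N's that were already present as assembly gaps
n_after_masking = 187_957_396  # 'N's counted after repeat masking

# Gap 'N's are neither real sequence nor masked repeats, so remove them
# from both the denominator and the numerator.
adjusted_total = total_bp - gap_n
masked_bp = n_after_masking - gap_n
percent_masked = 100 * masked_bp / adjusted_total

print(f"adjusted total: {adjusted_total:,}")
print(f"masked bp:      {masked_bp:,}")
print(f"percent masked: {percent_masked:.2f}%")
```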

The TIGR Gene Modeling Pipeline: Ab-initio Gene Prediction
augustus:
- provided by Mario Stanke
- predicted 9,208 genes
glimmerHMM:
- provided by Ela Pertea
- predicted 25,890 genes

The TIGR Gene Modeling Pipeline: Sequence Homology Searching
Spliced protein alignments using AAT (Huang, 1997)
- Searched:
  - TIGR's internal non-redundant protein database
  - Custom protein databases: Caenorhabditis elegans and C. briggsae; Brugia malayi
  - Genewise predictions for best protein alignments
Spliced transcript alignments
- Alignments (blat, sim4) of S. mansoni ESTs and cDNAs, followed by alignment assembly using the Program to Assemble Spliced Alignments (PASA)
- AAT alignments of S. japonicum ESTs

The TIGR Gene Modeling Pipeline: Combining Evidence
EVidenceModeler (EVM) combines predicted exons and alignments into weighted consensus gene structures.
[Figure: evidence tracks spanning Start to End, with a weight axis. Labels, in transcript order: PASA transcript alignment assemblies; Genewise protein alignments; gene predictions, AAT alignments]
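The notion of a weighted consensus can be illustrated with a toy per-base vote. This is explicitly not EVM's algorithm: it is a simplified Python sketch in which each evidence source adds its weight to every base it covers, and runs of bases at or above a threshold are reported as consensus exons. The function name `consensus_exons`, the source names, the weights, and the threshold are all hypothetical.

```python
def consensus_exons(evidence, weights, length, threshold):
    """Toy weighted vote over evidence intervals (not EVM's actual method).

    evidence:  {source_name: [(start, end), ...]}, 0-based, end-exclusive
    weights:   {source_name: weight}
    Returns [(start, end), ...] for runs of bases whose summed weight
    is >= threshold.
    """
    score = [0.0] * length
    for source, intervals in evidence.items():
        # Collect covered positions first so that overlapping intervals
        # from one source are not counted twice.
        covered = set()
        for start, end in intervals:
            covered.update(range(max(start, 0), min(end, length)))
        for i in covered:
            score[i] += weights[source]

    exons, run_start = [], None
    for i, s in enumerate(score):
        if s >= threshold and run_start is None:
            run_start = i
        elif s < threshold and run_start is not None:
            exons.append((run_start, i))
            run_start = None
    if run_start is not None:
        exons.append((run_start, length))
    return exons


if __name__ == "__main__":
    # Hypothetical evidence and weights, chosen only for illustration.
    evidence = {
        "pasa": [(10, 20)],
        "genewise": [(12, 25)],
        "abinitio": [(0, 5), (11, 22)],
    }
    weights = {"pasa": 10, "genewise": 5, "abinitio": 1}
    print(consensus_exons(evidence, weights, length=30, threshold=6))
```

Giving transcript-derived evidence a higher weight than ab-initio predictions means a lone prediction (bases 0 to 5 here) does not produce an exon, while regions backed by strong or combined evidence do.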

Evidence View
[Figure: evidence tracks. Labels, in transcript order: S. mansoni PASA assemblies; S. japonicum EST alignments; Genewise alignments (predictions); nr protein alignments; Caenorhabditis sp. protein alignments; Brugia malayi protein alignments]