Genome Annotation and Databases
Genomic DNA sequence
Genomic annotation
BIO520 Bioinformatics
Jim Lund
Reading: Ch 9, Ch 10

Genome Annotation
Find known repeats
Search for new repeated sequences
Predict genes
  – BLASTX
  – Genewise, Fgenes, Genscan…
Integrate other data sources
Accuracy is highest in the “high homology” class

Genome annotation servers
Integrate information from several maps
  – DNA sequence (contigs, quality).
  – Physical (cytogenetic, STS content).
  – Genes (show gene annotations and evidence).
      Several prediction programs.
      Expressed sequence tags (ESTs, UniGene clusters).
      Evidence (predicted, confirmed).
      Non-coding RNA (ncRNA) transcripts.
  – Variation (e.g., SNPs).
  – Regions of shared synteny.

Data Release
Human genome sequence released under the 1996 Bermuda rules
  – Assembled sequence greater than 1000 bp long is deposited in a public database (GenBank/EMBL/DDBJ) within 24 hours
  – No patents are filed
Bermuda principles reaffirmed at the January 2003 WT/NIH meeting
  – Pre-release of data for all “community projects”
  – Nature 421, 875 (2003)
  – NHGRI:
  – WT: statements/WTD htm
Benefits of Open Data Access supported by OECD report

Accessing the Genome
Genome sequences are becoming available very rapidly
  – Large and difficult to handle computationally
  – Everyone expects to be able to access them immediately
Bench biologists
  – Has my gene been sequenced?
  – What are the genes in this region?
  – Where are all the GPCRs?
  – Connect the genome to other resources.
Research bioinformatics
  – Give me a dataset of human genomic DNA.
  – Give me a protein dataset.
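
The bench biologist's question "What are the genes in this region?" comes down to an interval-overlap query against the annotation. A minimal sketch follows, assuming 1-based inclusive coordinates; the `Gene` record, the gene names, and the coordinates are all invented for illustration and are not taken from any real annotation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    """A hypothetical annotation record (not a real database schema)."""
    name: str
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def genes_in_region(genes, chrom, start, end):
    """Return genes on `chrom` that overlap the closed interval [start, end]."""
    return [g for g in genes
            if g.chrom == chrom and g.start <= end and g.end >= start]


# Invented example annotations.
GENES = [
    Gene("geneA", "chr1", 1_000, 5_000),
    Gene("geneB", "chr1", 4_500, 9_000),
    Gene("geneC", "chr1", 20_000, 25_000),
    Gene("geneD", "chr2", 1_000, 5_000),
]

hits = genes_in_region(GENES, "chr1", 4_800, 10_000)
print([g.name for g in hits])  # ['geneA', 'geneB']
```

A linear scan like this is fine for a handful of genes; a genome-scale server would more likely use an indexed structure, which is part of the engineering challenge of serving whole genomes.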

Getting information out
Search/browse to find the gene or region.
Export formats:
  – Screen shot
  – FASTA sequence
  – GenBank file with features annotated
  – Feature list (GFF, tab-delimited text)
  – PIP (plot of sequence identity between organisms)
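
Two of these export formats are simple enough to handle with a few lines of code. The sketch below reads a tab-delimited GFF feature list (nine columns per line) and writes a sequence as FASTA. The example records, the `parse_gff` and `to_fasta` helper names, and the sequence are made up for illustration; real exports may carry extra directives or attribute conventions that this sketch ignores.

```python
GFF_COLUMNS = ["seqid", "source", "type", "start", "end",
               "score", "strand", "phase", "attributes"]


def parse_gff(text):
    """Parse tab-delimited GFF lines into dicts; skip comments and blank lines."""
    features = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != len(GFF_COLUMNS):
            raise ValueError(f"expected 9 columns, got {len(fields)}: {line!r}")
        record = dict(zip(GFF_COLUMNS, fields))
        record["start"] = int(record["start"])  # GFF coordinates are 1-based, inclusive
        record["end"] = int(record["end"])
        features.append(record)
    return features


def to_fasta(name, sequence, width=60):
    """Format one sequence as a FASTA record with fixed-width lines."""
    lines = [f">{name}"]
    lines.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return "\n".join(lines) + "\n"


# Invented example input, built with explicit tabs so the columns are unambiguous.
EXAMPLE_GFF = "\n".join([
    "##gff-version 3",
    "\t".join(["chr1", "example", "gene", "1000", "5000", ".", "+", ".", "ID=geneA"]),
    "\t".join(["chr1", "example", "exon", "1000", "1200", ".", "+", ".", "Parent=geneA"]),
])

for feature in parse_gff(EXAMPLE_GFF):
    print(feature["type"], feature["start"], feature["end"])
print(to_fasta("geneA_fragment", "ACGT" * 20), end="")
```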

Challenges
Scale and data flow
  – Presentation, ease of use.
  – Engineering problems.
  – User interface design.
Algorithmic
  – Partly engineering (pre-compute hard computations, etc.)
  – Partly research.

NCBI sequence assembly (sequence to chromosome)
Remove contaminants
Bin by chromosome arms
Sequence layout
Sequence building
Place on chromosomes

NCBI sequence assembly - a modified greedy approach
Sequence Layout
  – Curated finished regions
  – Curated assembly instructions
  – MegaBLAST hits
  – Consider clone order
  – BAC chromosome assignment: annotation, STS markers, personal communication
  – Remove conflicting overlaps, redundant BACs
Sequence Building
  – Consider fragment:fragment sequence overlaps for each BAC pair in layout
  – Meld overlapping sequence
  – Order and orient (o+o): alignments (mRNA, EST), BAC annotation, paired plasmid reads
[Diagram labels: BAC sequence fragments; Assemble; Order; NCBI contig]
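
To make the "greedy" idea concrete, here is a toy sketch of plain greedy overlap merging: repeatedly meld the pair of fragments with the longest suffix-to-prefix overlap. This is not NCBI's procedure, which the slide describes as a modified approach that also uses curated instructions, clone order, and order-and-orient evidence. The fragments, function names, and the `min_len` threshold are invented for illustration, and the sketch ignores sequencing errors, reverse strands, and fragments contained inside other fragments.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` equal to a prefix of `b` (>= min_len), else 0."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments, min_len=3):
    """Merge the best-overlapping pair until no pair overlaps by at least min_len."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return frags  # remaining fragments are separate contigs
        merged = frags[best_i] + frags[best_j][best_len:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)


# Invented fragments that tile a 16-base sequence.
contigs = greedy_assemble(["ATGCGTAC", "GTACCTGA", "CTGATTTG"])
print(contigs)  # ['ATGCGTACCTGATTTG']
```

A purely greedy merge can be misled by repeats, which is one reason a production pipeline would bring in the additional layout evidence listed on the slide.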

NCBI Genome Build Process
[Flow diagram. Labels, in extraction order: Contig Build & Release; Assembly; Input Data: Sequences, Curated NTs, TPF, BLAST hits; Resource Updates: STS, Clones, Annotation, LocusLink, RefSeq, Collaboration, Curation; FTP; BLAST; Input Resources; Map Viewer; Update: Links, gi’s; Prepare for release; Sequences (contig, mRNA, protein); Analysis & Review; Corrections for next build; Freeze; LocusLink; GenBank; GenomeScan; dbSNP; Public Release; Exclude problem accessions]

What is being annotated?
Feature: Method
Genes: By alignment, by prediction
Markers: By ePCR
Clones/Cytogenetic location: By alignment (BAC ends)
Variation: By alignment
Phenotype (MIM): Via gene identification, associated markers
Cytogenetic position: By annotated BAC-end sequenced clones; by FISH-mapped clones used in assembly
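
Placing markers "by ePCR" means searching the assembled sequence for a marker's two primers in the right orientation and at roughly the expected product size. The sketch below shows that idea under strong simplifying assumptions: exact primer matches only, forward strand only, and an arbitrary size `margin`. The function names, primers, and contig are invented and do not reflect the actual e-PCR program's interface or matching rules.

```python
def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_all(seq, sub):
    """Start positions (0-based) of every occurrence of `sub` in `seq`."""
    positions = []
    i = seq.find(sub)
    while i != -1:
        positions.append(i)
        i = seq.find(sub, i + 1)
    return positions


def epcr_hits(seq, fwd, rev, expected_size, margin=50):
    """Return (start, end) half-open spans where fwd and rev prime a plausible product."""
    rev_site = reverse_complement(rev)  # how the reverse primer appears on the forward strand
    hits = []
    for s in find_all(seq, fwd):
        for r in find_all(seq, rev_site):
            end = r + len(rev_site)
            if r >= s and abs((end - s) - expected_size) <= margin:
                hits.append((s, end))
    return hits


# Invented contig: flank + forward primer + spacer + reverse primer site + flank.
contig = "TTTT" + "ACGTAC" + "A" * 20 + "GGTCAA" + "TTTT"
print(epcr_hits(contig, fwd="ACGTAC", rev="TTGACC", expected_size=32, margin=5))  # [(4, 36)]
```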

RefSeq: a reagent for Contig Annotation
[Diagram labels: GenomeScan; ESTs; TBLASTN; RPSBLAST; RefSeq mRNAs; GenBank mRNAs]
RefSeq advantages:
  – Separate gene families
  – Not partial
  – Means to correct problem sequences
  – RefSeq process results in excluding problem GenBank sequences from the annotation pipeline
Potential problems with ESTs:
  – Gene families
  – Partial
  – Chimeric
  – Intron read-through
  – Linker
  – Vector
  – Wrong organism genome

NCBI: Products of annotation
RefSeqs (transcripts, proteins)
Gene ID (LocusID)
Features in chromosome coordinates
Features in contig (NT accession) coordinates
Available in:
Map Viewer
  – Graphical display
  – Tabular display
  – Sequence downloads
FTP
  – RefSeqs (contigs, transcripts, proteins)
  – Mapping data
  – LocusLink & other resources

NCBI Map Viewer

NCBI Map Viewer: Tabular report

Genes in regions of conserved synteny
Anchored by human gene order
Anchored by mouse gene order

Chromosomal segments in dog conserved with human and mouse
Dog: 38 autosomes + sex chromosomes

Query by sequence: Review the alignment
A click away:
  – Alignments (BLAST hit)
  – Gene description (LocusLink)
  – Report of all features in the region
  – Contig sequence
  – Sequence in the region
  – Other mRNAs aligning in the region
Define your own gene model based on alignments in the region

Quality Control - Genome review
Is the sequence correct?
Is the feature correctly placed?
Is there a feature that should be placed?
Are the attributes of the feature correct?
Approaches:
  – In-house analysis & review (manual curation)
  – Shared information (UCSC/Ensembl)
  – Solicited review by experts in local regions

Ensembl Annotation pipeline
Set of high-quality gene predictions
  – From known human mRNAs aligned against the genome
  – From similar proteins and mRNAs aligned against the genome
  – From Genscan predictions confirmed via BLAST against protein, cDNA, and EST databases.
Initial functional annotation from InterPro
Integration with external resources (SNPs, SAGE, OMIM)
Comparative analysis between mouse and human
  – DNA sequence alignment
  – Protein orthologs
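
The step of confirming ab initio predictions with alignment evidence can be pictured as a coverage test: what fraction of a predicted gene's exon bases is covered by alignment hits? The sketch below is only an illustration of that idea and is not Ensembl's actual rule. Intervals are half-open `(start, end)` pairs, and the function names, the coordinates, and the 0.8 `threshold` are all assumptions made for the example.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open intervals into a sorted, disjoint list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def support_fraction(exons, hits):
    """Fraction of exon bases covered by at least one evidence hit."""
    evidence = merge_intervals(hits)
    total = sum(end - start for start, end in exons)
    covered = sum(
        max(0, min(end, ev_end) - max(start, ev_start))
        for start, end in exons
        for ev_start, ev_end in evidence
    )
    return covered / total if total else 0.0


def label_prediction(exons, hits, threshold=0.8):
    """Label a prediction 'confirmed' if enough of it is supported, else 'predicted'."""
    return "confirmed" if support_fraction(exons, hits) >= threshold else "predicted"


# Invented prediction with two exons, and two invented evidence sets.
exons = [(100, 200), (300, 400)]
print(label_prediction(exons, [(90, 210), (290, 350), (340, 405)]))  # confirmed
print(label_prediction(exons, [(100, 150)]))                          # predicted
```

Merging the hits first keeps overlapping alignments from being counted twice toward coverage.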

Ensembl gene prediction pipeline
[Flow diagram. Labels: DNA; RepeatMasker; Genscan; BLAST Genscan peptides vs. protein, UniGene, EST, vertebrate mRNA; Pmatch all human proteins and cDNAs; MiniGenewise; MiniEst2genome; Genes]

Genome Annotation The generic structure of an automatic genome annotation pipeline and delivery system

[Browser view labels: Detailed View: genes, ESTs, CpG etc. (100 kb); Overview: genes and markers (1 Mb); Chromosome; Configuration]

Useful genomic annotation and browser URLs
EBI/Sanger Institute Ensembl Project:
NCBI Human Genome Browser:
The Oak Ridge National Laboratories Genome Channel:
UCSC Human Genome Browser:
The Institute for Genomic Research (TIGR):

Genome annotation - things still being worked out
Annotation servers.
  – Pro: make genomics information accessible to biologists without expert bioinformatics skills.
  – Con: make it difficult to perform large-scale data mining.
  – Solution: enable more experienced users to retrieve the data they require and to run analyses locally.
Open annotation systems.
  – Biologists need to have access to annotations available in the community and to share their own contributions with the community.
  – A common protocol between systems that enables genome data to be freely exchanged:
      AGAVE (Architecture for Genomic Annotation, Visualization and Exchange)
      Distributed Annotation System (DAS) projects
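
As a small illustration of what a common exchange protocol buys, the sketch below builds a DAS-style feature request URL. It assumes the DAS 1.x convention of `/das/<source>/features?segment=<ref>:<start>,<stop>`, which should be checked against the specification before use. The server name and data source name are placeholders, not real endpoints, and many public DAS servers have since been retired.

```python
from urllib.parse import urlencode


def das_features_url(server, source, ref, start, stop):
    """Build a DAS-style 'features' request for one segment (assumed DAS 1.x URL layout)."""
    # Keep ':' and ',' unescaped so the segment reads ref:start,stop.
    query = urlencode({"segment": f"{ref}:{start},{stop}"}, safe=":,")
    return f"{server.rstrip('/')}/das/{source}/features?{query}"


# Placeholder server and source names, for illustration only.
url = das_features_url("http://das.example.org/", "hypothetical_source", "1", 1000, 2000)
print(url)  # http://das.example.org/das/hypothetical_source/features?segment=1:1000,2000
```

Because every compliant server would answer the same kind of request, a client could layer annotations from several groups over one reference sequence without site-specific code.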

Genome annotation servers
Several ways to find information:
  – Search by clone, gene, EST, marker.
  – Browse sequence.
  – BLAST searches.
  – Homology: start in one organism, jump to the syntenic region of another.

UCSC Genome Browser