The NCBI Annotation Pipeline

Slides:



Advertisements
Similar presentations
What is RefSeqGene?.
Advertisements

EAnnot: A genome annotation tool using experimental evidence Aniko Sabo & Li Ding Genome Sequencing Center Washington University, St. Louis.
© Wiley Publishing All Rights Reserved. Using Nucleotide Sequence Databases.
Beyond PubMed and BLAST: Exploring NCBI tools and databases Kate Bronstad David Flynn Alumni Medical Library.
NCBI Genome Resources Using NCBI Resources for Gene Discovery Kim D. Pruitt Transcriptome 2002 National Center for Biotechnology Information (NCBI) National.
The National Center for Biotechnology Information (NCBI) a primary resource for molecular biology information Database Resources.
Integrating dbSNP with P. falciparum genome resources.
The Consensus CoDing Sequence (CCDS) Database
Peter Tsai, Bioinformatics Institute.  University of California, Santa Cruz (UCSC)  A rapid and reliable display of any requested portion of genomes.
Introduction to Bioinformatics Lecturer: Dr. Yael Mandel-Gutfreund Teaching Assistant: Shula Shazman Sivan Bercovici Course web site :
Sequence Analysis MUPGRET June workshops. Today What can you do with the sequence? What can you do with the ESTs? The case of SNP and Indel.
Lecture 2.21 Retrieving Information: Using Entrez.
Genome Assembly and Annotation Erik Arner Omics Science Center, RIKEN Yokohama, Japan
Biological Databases Chi-Cheng Lin, Ph.D. Associate Professor Department of Computer Science Winona State University – Rochester Center
How to access genomic information using Ensembl August 2005.
EBI is an Outstation of the European Molecular Biology Laboratory. UniProt Jennifer McDowall, Ph.D. Senior InterPro Curator Protein Sequence Database:
Sequence Analysis. Today How to retrieve a DNA sequence? How to search for other related DNA sequences? How to search for its protein sequence? How to.
UniProt - The Universal Protein Resource
Doug Brutlag 2011 Genome Databases Doug Brutlag Professor Emeritus of Biochemistry & Medicine Stanford University School of Medicine Genomics, Bioinformatics.
Login: BITseminar Pass: BITseminar2011 Login: BITseminar Pass: BITseminar2011.
Doug Brutlag Professor Emeritus Biochemistry & Medicine (by courtesy) Genome Databases Computational Molecular Biology Biochem 218 – BioMedical Informatics.
The Ensembl Gene set The “Genebuild” 21 April 2008.
Kerstin Howe, Mario Caccamo, Ian Sealy The Zebrafish Genome Sequencing Project Bioinformatics resources.
Genome Annotation and Databases Genomic DNA sequence Genomic annotation BIO520 BioinformaticsJim Lund Reading Ch 9, Ch10.
Arabidopsis Genome Annotation TAIR7 Release. Arabidopsis Genome Annotation  Overview of releases  Current release (TAIR7)  Where to find TAIR7 release.
NCBI FieldGuide A Minimal Guide to NCBI Nucleotide Resources.
What is comparative genomics? Analyzing & comparing genetic material from different species to study evolution, gene function, and inherited disease Understand.
Genomics and Personalized Health Care Databases Bailee Ludwig Quality Management.
GENOME-CENTRIC DATABASES Daniel Svozil. NCBI Gene Search for DUT gene in human.
Use cases for Tools at the Bovine Genome Database Apollo and Bovine QTL viewer.
UCSC Genome Browser 1. The Progress 2 Database and Tool Explosion : 230 databases and tools 1996 : first annual compilation of databases and tools.
Bioinformatics Overview, NCBI & GenBank JanPlan 2012.
NCBI’s Genome Annotation: Overview Incremental processing Re-annotation ( batch ) Post-annotation review Case studies NOTE: limiting discussion to annotation.
DONNA MAGLOTT, PH.D. PRO AND MEDICAL GENETICS RESOURCES AT NCBI.
Part I: Identifying sequences with … Speaker : S. Gaj Date
EnsEMBL Opening up the whole Genome Philip Lijnzaad
Organizing information in the post-genomic era The rise of bioinformatics.
NCBI resources II: web-based tools and ftp resources Yanbin Yin Fall 2014 Most materials are downloaded from ftp://ftp.ncbi.nih.gov/pub/education/ 1.
Professional Development Course 1 – Molecular Medicine Genome Biology June 12, 2012 Ansuman Chattopadhyay, PhD Head, Molecular Biology Information Services.
Sackler Medical School
Annotator Interface Sharon Diskin GUS 3.0 Workshop June 18-21, 2002.
Biological databases Exercises. Discovery of distinct sequence databases using ensembl.
Building WormBase database(s). SAB 2008 Wellcome Trust Sanger Insitute Cold Spring Harbor Laboratory California Institute of Technology ● RNAi ● Microarray.
Web Databases for Drosophila An introduction to web tools, databases and NCBI BLAST Wilson Leung08/2015.
The Reference Sequence database A non-redundant collection of richly annotated DNA, RNA, and protein sequences from diverse taxaDNARNA The collection includes.
EBI is an Outstation of the European Molecular Biology Laboratory. UniProtKB Sandra Orchard.
Copyright OpenHelix. No use or reproduction without express written consent1.
A guided tour of Ensembl This quick tour will give you an outline view of what Ensembl is all about. You will learn: –Why we need Ensembl –What is in the.
UCSC Genome Browser Zeevik Melamed & Dror Hollander Gil Ast Lab Sackler Medical School.
Tools in Bioinformatics Genome Browsers. Retrieving genomic information Previous lesson(s): annotation-based perspective of search/data Today: genomic-based.
1 of 28 Evaluating Genes and Transcripts (“Genebuild”)
What is BLAST? Basic BLAST search What is BLAST?
Welcome to the combined BLAST and Genome Browser Tutorial.
NCBI: something old, something new. What is NCBI? Create automated systems for knowledge about molecular biology, biochemistry, and genetics. Perform.
Work Presentation Novel RNA genes in A. thaliana Gaurav Moghe Oct, 2008-Nov, 2008.
The Bovine Genome Database Abstract The Bovine Genome Database (BGD, facilitates the integration of bovine genomic data. BGD is.
Web Databases for Drosophila
What is BLAST? Basic BLAST search What is BLAST?
Introduction to Genes and Genomes with Ensembl
Introduction to Bioinformatics
Retrieving Information: Using Entrez
Basics of BLAST Basic BLAST Search - What is BLAST?
Figure 1. Number of CCDS IDs and genes represented in the human (A) and mouse (B) CCDS releases. The X-axis indicates the year in which a CCDS dataset.
PlantGDB: Annotation Principles & Procedures
gene-CENTRIC database
BLAST.
Ensembl Genome Repository.
TAMU Bovine QTL db and viewer
Welcome - webinar instructions
Annotator Interface GUS 3.0 Workshop June 18-21, 2002.
Presentation transcript:

The NCBI Annotation Pipeline 6/13/2018 The NCBI Annotation Pipeline Genome Assembly Annotation Automated Manual Section 1-Sequence Characteristics - number of basepairs - characteristics of fragments - contamination NCBI Genome Resources

Assembly (sequence chromosome) Remove contaminants Bin by chromosome arms Sequence Layout Sequence Building Place on chromosomes NCBI Genome Resources

Sequence Assembly - a modified greedy approach 6/13/2018 Sequence Assembly - a modified greedy approach Sequence Layout Curated Finished Regions Curated assembly instructions MegaBLAST hits Consider TPF clone order BAC chromosome assignment annotation STS markers personal communication Remove conflicting overlaps, redundant BACs BAC Sequence Fragments Assemble Order NCBI Contig Sequence Building Consider fragment:fragment sequence overlaps for each BAC pair in layout Meld overlapping sequence Order and Orient (o+o ): alignments (mRNA, EST) BAC annotation paired plasmid reads 113,310 Melds NCBI Genome Resources

Genome Build Process Freeze Contig Build & Release Exclude Problem dbSNP STS Clones Collaboration Curation GenomeScan GenBank LocusLink RefSeq Update: Links gi’s Prepare for release LocusLink Annotation Contig Build & Release Assembly Resource Updates Freeze Input Data: Sequences Curated NTs TPF BLAST hits Public Release Sequences (contig mRNA protein) Exclude Problem accessions Analysis & Review Corrections for next build Map Viewer FTP BLAST Input Resources NCBI Genome Resources

What is being annotated? 6/13/2018 What is being annotated? Feature Method Genes: By alignment, by prediction Markers: By ePCR Variation: By alignment Clones/Cytogenetic location: By alignment (BAC ends) Phenotype (MIM): Via Gene identification, associated markers Cytogenetic Position: By annotated BAC-END sequenced clones By FISH-mapped clones used in assembly Section 1-Sequence Characteristics - number of basepairs - characteristics of fragments - contamination NCBI Genome Resources

Genome Annotation: Genes 6/13/2018 Genome Annotation: Genes Automated: By mRNA alignment (RefSeq, GenBank) protein coding and transcribed pseudogenes By EST alignments GenomeScan predictions Products: Sequences - RefSeq RNAs and proteins Resources - MapViewer, FTP, BLAST Section 1-Sequence Characteristics - number of basepairs - characteristics of fragments - contamination NCBI Genome Resources

RefSeq: a reagent for Contig Annotation Potential Problems: Gene Families Partial Chimeric Intron read-through Linker Vector Wrong organism genome RefSeq mRNAs GenBank mRNAs RefSeq Advantages: Separate Gene Families Not Partial Means to correct problem sequences ESTs TBLASTN RPSBLAST RefSeq process results in excluding problem GenBank sequences from annotation pipeline GenomeScan NCBI Genome Resources

NCBI Reference Sequences (RefSeq) Genome Oriented Resource A sequence for each macromolecule Central Dogma: Chromosome, mRNA, preprotein, mature protein Linked on a residue by residue basis Objectively non-redundant and comprehensive Curated Resource Authoritative source by genome Derivative of GenBank but corrected, merged, extended Publicly distributed, Entrez Genomes Web site Reagents for Genome Annotation and Analysis Substrate for Functional Genomics NCBI Genome Resources

Reference Sequences Goal: One sequence entry for each naturally occurring DNA, RNA and protein molecule Key: curated calculated NC_000000 NM_000000 NP_000000 XM_000000 XP_000000 chromosome NT_000000 contig mRNA predicted mRNA protein predicted protein NG_000000 gene Multiple products for one gene are instantiated as separate RefSeqs with the same LocusID. NCBI Genome Resources

RefSeq: Scope The NCBI Reference Sequence (RefSeq) project provides non-redundant sequence data including bacterial and viral genomes, mitochondrion, chromosomes, constructed genomic contigs, transcripts, and proteins. NCBI Genome Resources

Process Flow Technology: Sybase relational databases Support multiple users Instant updates in-house Daily updates public web site C, C++, Perl, stored procedures, triggers In-house processing External Collaboration LocusLink Public Release: LocusLink Web Site Annotation Data Automated BLAST Analysis RefTrack Provisional & Predicted Records (Transcripts & Proteins) Status QBLAST Update Reviewed Records (Genomic Regions,Transcripts, Proteins) Manual Curation Accessible in: BLAST BLink Entrez FTP LocusLink RefTrack Tracking Database . Decisions . Status . Accessions . History Genome Annotation Pipeline NCBI Genome Resources

Manual Curation Process RefSeq infrastructure: supporting analysis Manual Curation Process Tools: User Input Genome Analysis Database QC checks Assignment Disease Genes Gene Families Gene Clusters Identified Problems Analysis: Precomputed BLAST Vector ESTs GenBank ‘nr’ HTG Contigs blastx blastp Spidey alignments Map Viewer Evidence Viewer Review Alternate Splicing Gene Families Review Sequence Database data Literature Database Editing Sequence Editing Sequence changes Remove contamination Correct errors Extend UTRs Splice Variants Annotation Annotation: Sequin Ingenue Final Check Quality Control Store Data & QC: RefSeq validator Public Public NCBI Genome Resources

New Type of RefSeq: Genomic Regions Why? Correct Assembly through Duplications, Paralogous Gene Clusters Optimize Annotation in Gene Clusters Goals: 1. Establish genomic reference sequence 2. Provide annotation [genes, pseudogenes, variation] 3. Provide corresponding mRNA, protein RefSeq Used in: Genome Assembly & Annotation Pipeline NCBI Genome Resources

Gene Annotation Automated Annotation: Manual Annotation: 6/13/2018 Gene Annotation Automated Annotation: By mRNA alignment (RefSeq, GenBank) By EST alignments GenomeScan predictions Initial RefSeq record (provisional, predicted status) LocusLink – maintaining associations; connections to function Manual Annotation: RefSeq – genomic regions, mRNA,protein (reviewed status) Local regions of Genome Assembly – provide curated assembly instructions (join a->b; don’t join a->c) Section 1-Sequence Characteristics - number of basepairs - characteristics of fragments - contamination NCBI Genome Resources

Products of annotation 6/13/2018 Products of annotation RefSeqs (transcripts, proteins) Gene id (LocusID) features in chromosome coordinates features in contig (NT accession) coordinates Available in: Map Viewer Graphical display Tabular display Sequence downloads FTP RefSeqs (contigs, transcripts, proteins) Mapping Data LocusLink & Other resources Section 1-Sequence Characteristics - number of basepairs - characteristics of fragments - contamination NCBI Genome Resources

Map Viewer: Query support NCBI Genome Resources

Map Viewer: Tabular report NCBI Genome Resources

Genes in regions of conserved synteny Anchored by human gene order Anchored by mouse gene order NCBI Genome Resources

Query by sequence: Review the alignment 6/13/2018 Query by sequence: Review the alignment A click away: Alignments (BLAST hit) Gene Description (LocusLink) Report of all features in the region Contig sequence Sequence in the region other mRNAs aligning in the region Define your own gene model based on alignments in the region Section 1-Sequence Characteristics - number of basepairs - characteristics of fragments - contamination NCBI Genome Resources

Quality Control - Genome review 6/13/2018 Quality Control - Genome review Is the sequence correct? Is the feature correctly placed? Is there a feature that should be placed? Are the attributes of the feature correct? Approaches: In-house analysis & review (manual curation) Shared information (UCSC/Ensembl) Solicited review by experts in local regions Section 1-Sequence Characteristics - number of basepairs - characteristics of fragments - contamination NCBI Genome Resources

Acknowledgments http://www.ncbi.nlm.nih.gov/genome/guide/human.html Genome Build Team: Richa Agarwala Hsiu-Chuan Chen Slava Chetvernin Deanna Church Olga Ermolaeva Renata Geer Wratko Hlavina Wonhee Jang Jonathan Kans Ken Katz Paul Kitts David Lipman Adam Lowe Donna Maglott Jim Ostell Kim Pruitt Sergey Resenchuk Victor Sapojnikov Greg Schuler Steve Sherry Andrei Shkeda Tatiana Tatusova Lukas Wagner Sarah Wheelan Acknowledgments RefSeq Curator Staff BLAST Team Entrez Team NCBI Service Desk Staff Collaborators: Human Gene Nomenclature Committee OMIM Staff The Jackson Laboratory Rat Genome Database http://www.ncbi.nlm.nih.gov/genome/guide/human.html