Bioinformatics Overview, NCBI & GenBank JanPlan 2012.

What is Bioinformatics?
Find three different definitions of the word “bioinformatics”.
How is “bioinformatics” different from “computational biology”?
What areas of biological research are dependent on bioinformatics?

What is Bioinformatics Used For?
Database searching
Sequence analysis
Phylogenetic reconstruction
Molecular evolution
Gene expression
Genome assembly
Genome annotation
Metagenomics

Introduction to NCBI
NCBI, EMBL & DDBJ
What role do these organizations play in global society?
How do their missions differ?
NCBI Training and Tutorials page
The NCBI Handbook
NCBI How-To page
NCBI Help Manual

GenBank
Annotated collection of all publicly available nucleotide sequences and their protein translations.
Receives sequences produced in laboratories throughout the world from more than 100,000 distinct organisms.
Grows exponentially, doubling every 10 months.
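
To make the growth claim concrete: a database that doubles every 10 months grows by a factor of 2^(t/10) after t months, which works out to roughly 2.3-fold per year. The sketch below only illustrates that arithmetic. The function name and the default doubling time are taken from the slide or invented for this example; they are not an NCBI statistic or tool.

```python
def growth_factor(months, doubling_time_months=10.0):
    """Fold increase after `months`, assuming steady exponential growth.

    The 10-month doubling time is the figure quoted on the slide; it is
    treated here as a constant, which is a simplifying assumption.
    """
    return 2.0 ** (months / doubling_time_months)


if __name__ == "__main__":
    for months in (10, 12, 20, 60):
        print(f"after {months:>2} months: x{growth_factor(months):.1f}")
```

Under this assumption, five years (60 months) corresponds to six doublings, that is, a 64-fold increase.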

GenBank
Initially built and maintained at Los Alamos National Laboratory.
Transferred to NCBI in the early 1990s by congressional mandate.
Most journal publishers require deposition of sequence data into GenBank prior to publication so that an accession number can be cited.
Submitters may keep their data confidential for a specified period of time prior to publication.

Direct Submission
A typical GenBank submission consists of a single, contiguous stretch of DNA or RNA sequence (a contig) with annotations (metadata).
If part of a nucleotide sequence encodes a protein, a conceptual translation, called a CDS (coding sequence), is annotated and its span is mapped onto the nucleotide sequence.
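
As a rough illustration of what "conceptual translation" over a mapped span means, the sketch below translates the bases between a 1-based start and end position using the standard genetic code. The function name `translate_cds` and the toy sequence are invented for this example; real GenBank records can also use other genetic codes, joined spans, and the reverse strand, none of which is handled here.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(sequence, start, end):
    """Translate the span start..end (1-based, inclusive) of `sequence`.

    Translation stops at the first stop codon; codons containing
    unexpected characters are reported as 'X'.
    """
    cds = sequence[start - 1:end].upper().replace("U", "T")
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE.get(cds[i:i + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    # Toy sequence (not a real record): the CDS occupies positions 3..14.
    print(translate_cds("ccATGGCCAAATGAtt", 3, 14))
```

For the toy sequence above, the span 3..14 reads ATG GCC AAA TGA, so the function returns "MAK".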

High-Throughput Genomic Sequence (HTGS)
HTGS entries are submitted in bulk by genome centers, processed by an automated system, and then released to GenBank.
Currently, about 30 genome centers are submitting data for a number of organisms, including human, mouse, rat, rice, and Plasmodium falciparum.

High-Throughput Genomic Sequence (HTGS)
Data are submitted in 4 phases.
Phase 0: Sequences are one-to-few reads of a single clone and are not usually assembled into contigs. They are low-quality sequences that are often used to check whether another center is already sequencing a particular clone.
Phase 1: Entries are assembled into contigs that are separated by sequence gaps, the relative order and orientation of which are not known.
Phase 2: Entries are also unfinished sequences that may or may not contain sequence gaps. If there are gaps, then the contigs are in the correct order and orientation.
Phase 3: Sequences are of finished quality and have no gaps. For each organism, the group overseeing the sequencing effort determines the definition of finished quality.
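
The four phase definitions can be read as a small decision procedure. The sketch below encodes them that way purely as a study aid. The `HtgsEntry` class, its field names, and `htgs_phase` are hypothetical; in practice the submitting genome center declares the phase, and it is not computed from flags like these.

```python
from dataclasses import dataclass


@dataclass
class HtgsEntry:
    """Hypothetical summary of an HTGS entry (field names are assumptions)."""
    assembled_into_contigs: bool
    has_gaps: bool
    contigs_ordered_and_oriented: bool
    finished_quality: bool


def htgs_phase(entry):
    """Return the phase (0-3) that matches the slide's definitions."""
    if not entry.assembled_into_contigs:
        return 0  # a few unassembled reads of a single clone
    if entry.finished_quality and not entry.has_gaps:
        return 3  # finished sequence, no gaps
    if entry.has_gaps and not entry.contigs_ordered_and_oriented:
        return 1  # contigs with gaps, order and orientation unknown
    return 2      # unfinished; any gaps lie between ordered contigs


if __name__ == "__main__":
    draft = HtgsEntry(True, True, False, False)
    print(htgs_phase(draft))
```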

Whole Genome Shotgun Sequences (WGS)
Shotgun sequence reads are assembled into contigs, submitted, and updated as the sequencing project progresses and new assemblies are computed.

EST, STS, and GSS
EST = Expressed Sequence Tags (dbEST): Short (< 1 kb), single-pass cDNA sequences from a particular tissue and/or developmental stage. They lack annotation.
STS = Sequence Tagged Sites (dbSTS): Short genomic landmark sequences. They are operationally unique in that they are specifically amplified from the genome by PCR. They define a specific location on the genome and are thus useful for mapping.
GSS = Genome Survey Sequences (dbGSS): Short sequences derived from genomic DNA, about which little is known.

HTC and FLIC
HTC = High-Throughput cDNA/mRNA: Similar to ESTs, but often contain more information. May have a systematic gene name that is related to the lab or center that submitted them, and the longest ORF is often annotated as a coding region.
FLIC = Full-Length Insert cDNA: Contains the entire sequence of a cloned cDNA/mRNA. Generally longer, and sometimes full-length, mRNAs. Usually annotated with genes and coding regions. Gene names may be systematic rather than functional.

Submission Tools
BankIt: Web-based form for submission of a small number of sequences with minimal annotation to GenBank.
Sequin: More appropriate for complicated submissions containing a significant amount of annotation or many sequences. Stand-alone application available on NCBI’s FTP site.

Sequence Data Flow and Processing
Triage: Within 48 hours of direct submission with BankIt or Sequin, the database staff reviews the submission to determine whether it meets the minimal criteria and then assigns an accession number.
All sequences must be > 50 bp in length and be sequenced by, or on behalf of, the group submitting the sequence.
GenBank will not accept sequences constructed in silico.
GenBank will not accept noncontiguous sequences containing internal, unsequenced spacers.
GenBank will not accept sequences for which there is not a physical counterpart, such as those derived from a mix of genomic DNA and mRNA.
Submissions are checked to determine whether they are new or updates.
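
The minimal criteria listed above can be summarized as a checklist. The sketch below is a hypothetical pre-submission self-check, not the procedure GenBank staff follow: only the length rule is computed from the sequence, and the other criteria are passed in as flags because they cannot be determined from the sequence alone. The function and parameter names are assumptions made for this example.

```python
MIN_LENGTH_BP = 50  # sequences must be longer than this


def triage_problems(
    sequence,
    *,
    sequenced_by_submitter=True,
    built_in_silico=False,
    has_unsequenced_spacers=False,
    mixes_genomic_and_mrna=False,
):
    """Return a list of reasons the submission would fail the minimal criteria."""
    problems = []
    if len(sequence) <= MIN_LENGTH_BP:
        problems.append("sequence must be longer than 50 bp")
    if not sequenced_by_submitter:
        problems.append("must be sequenced by, or on behalf of, the submitter")
    if built_in_silico:
        problems.append("sequences constructed in silico are not accepted")
    if has_unsequenced_spacers:
        problems.append("internal unsequenced spacers are not accepted")
    if mixes_genomic_and_mrna:
        problems.append("sequence has no physical counterpart")
    return problems


if __name__ == "__main__":
    print(triage_problems("ACGT" * 10))   # 40 bp: too short
    print(triage_problems("ACGT" * 20))   # 80 bp: no problems reported
```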

Sequence Data Flow and Processing
Indexing: Submissions are checked for:
Biological validity: Translation, organism lineage, BLAST searches
Vector contamination: Is there any vector DNA present in the sequence?
Publication status: If published, citation is included in annotation and linked to Entrez
Formatting and spelling
Sequences are sent to the submitter for final review before release into the public database.
Sequences must become publicly available once the accession number or the sequence has been published.
GenBank annotation staff process about 1900 submissions/month, or about 20,000 sequences.

RefSeq
A curated collection of DNA, RNA, and protein sequences built by NCBI.
Unlike GenBank, RefSeq provides only one example of each natural biological molecule for major organisms ranging from viruses to bacteria to eukaryotes.
May include separate linked records for genomic DNA, the gene transcripts, and the proteins arising from those transcripts.
Limited to major organisms for which sufficient data are available (only 4000 as of Jan 2007), while GenBank includes sequences for any organism submitted (~250k different organisms).
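
One practical way to tell the separate RefSeq record types apart is the accession prefix: RefSeq accessions start with two letters and an underscore (for example NM_ for mRNA and NP_ for protein), whereas GenBank accessions have no underscore. The sketch below covers only a handful of common prefixes and is not part of the original slides. The function name and return format are invented here, and the example accessions are used only as illustrations.

```python
# A partial table of RefSeq accession prefixes (not exhaustive).
REFSEQ_PREFIXES = {
    "NC_": "genomic (complete molecule)",
    "NG_": "genomic (region)",
    "NM_": "mRNA",
    "NR_": "non-coding RNA",
    "NP_": "protein",
    "XM_": "predicted mRNA",
    "XR_": "predicted non-coding RNA",
    "XP_": "predicted protein",
}


def describe_accession(accession):
    """Classify an accession by its prefix.

    Returns ("RefSeq", molecule_type) for a known RefSeq prefix and
    ("other", None) otherwise.
    """
    prefix = accession[:3].upper()
    if prefix in REFSEQ_PREFIXES:
        return ("RefSeq", REFSEQ_PREFIXES[prefix])
    return ("other", None)


if __name__ == "__main__":
    for acc in ("NM_000546", "NP_000537", "U49845"):
        print(acc, describe_accession(acc))
```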

Third Party Annotation (TPA) database
Contains nucleotide sequences built from existing primary data with new annotation that has been published in a peer-reviewed scientific journal.
Two types of records:
Experimental: Annotation supported by wet-lab evidence
Inferential: Annotation supported by inference only, without direct wet-lab evidence
Bridges the gap between GenBank and RefSeq by permitting authors publishing new experimental evidence to re-annotate sequences in a public database as they think best, even if they are not the primary sequencer or the curator of a model organism database.

Universal Protein Resource (UniProt)
Protein sequence database that was formed through the merger of three protein databases:
1. Swiss-Prot, from the Swiss Institute of Bioinformatics and the European Bioinformatics Institute
2. The Translated EMBL Nucleotide Sequence Data Library (TrEMBL), from the same two institutes
3. Georgetown University’s Protein Information Resource Protein Sequence Database (PIR-PSD)

Problem Set
ftp://ftp.ncbi.nih.gov/pub/education/tutorials/genbank.pdf
Linked on today’s web page