Published by Archibald King. Modified over 9 years ago.
1
Bioinformatics Overview, NCBI & GenBank JanPlan 2012
2
What is Bioinformatics?
Find three different definitions of the word “bioinformatics”.
How is “bioinformatics” different from “computational biology”?
What areas of biological research are dependent on bioinformatics?
3
What is Bioinformatics Used For?
Database searching
Sequence analysis
Phylogenetic reconstruction
Molecular evolution
Gene expression
Genome assembly
Genome annotation
Metagenomics
4
Introduction to NCBI
NCBI, EMBL & DDBJ
What function do these organizations play in the global society?
How do their missions differ?
NCBI Training and Tutorials page
The NCBI Handbook
NCBI How-To page
NCBI Help Manual
5
GenBank
Annotated collection of all publicly available nucleotide sequences and their protein translations.
Receives sequences produced in laboratories throughout the world from more than 100,000 distinct organisms.
Grows exponentially, doubling every 10 months.
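The growth claim above can be turned into simple arithmetic. The sketch below assumes the doubling time stays constant at the 10 months quoted on the slide, which is an idealization; the function names are illustrative only.

```python
def growth_factor(months: float, doubling_months: float = 10.0) -> float:
    """Fold increase after `months`, assuming a constant doubling time."""
    return 2.0 ** (months / doubling_months)


def projected_size(current_size: float, months: float,
                   doubling_months: float = 10.0) -> float:
    """Projected database size under the same constant-doubling assumption."""
    return current_size * growth_factor(months, doubling_months)


# With a 10-month doubling time, 5 years (60 months) is 6 doublings.
print(growth_factor(60))
```

Under this assumption a database grows 64-fold in five years, which is why storage and search tools have to be designed for scale.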
6
GenBank
Initially built and maintained at Los Alamos National Laboratory.
Transferred to NCBI in the early 1990s by congressional mandate.
Most journal publishers require deposition of sequence data into GenBank prior to publication so that an accession number may be cited.
Submitters may keep their data confidential for a specified period of time prior to publication.
7
Direct Submission
A typical GenBank submission consists of a single, contiguous stretch of DNA or RNA sequence (contigs) with annotations (metadata).
If part of a nucleotide sequence encodes a protein, a conceptual translation, called a CDS (coding sequence), is annotated, and the span mapped.
Example
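To make "conceptual translation" concrete, here is a minimal sketch that translates a CDS given as a 1-based, inclusive span on the forward strand, using the standard genetic code. The function name `translate_cds` and the toy record are assumptions for illustration; real records may use other genetic codes, joined spans, or the reverse strand, none of which is handled here.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(sequence: str, start: int, end: int) -> str:
    """Translate the span start..end (1-based, inclusive, forward strand)."""
    cds = sequence[start - 1:end].upper().replace("U", "T")
    if len(cds) % 3 != 0:
        raise ValueError("CDS span length is not a multiple of 3")
    # Unknown or ambiguous codons are reported as 'X'.
    protein = "".join(
        CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, len(cds), 3)
    )
    return protein.rstrip("*")  # drop the terminal stop codon


# Toy record (made up): 4 bp of flank, a 12 bp CDS, 2 bp of flank.
record = "GGCC" + "ATGGCCAAATGA" + "TT"
print(translate_cds(record, 5, 16))  # MAK
```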
8
High-Throughput Genomic Sequence (HTGS)
HTGS entries are submitted in bulk by genome centers, processed by an automated system, and then released to GenBank.
Currently, about 30 genome centers are submitting data for a number of organisms, including human, mouse, rat, rice, and Plasmodium falciparum.
9
High-Throughput Genomic Sequence (HTGS)
Data are submitted in 4 phases.
Phase 0: Sequences are one-to-few reads of a single clone and are not usually assembled into contigs. They are low-quality sequences that are often used to check whether another center is already sequencing a particular clone.
Phase 1: Entries are assembled into contigs that are separated by sequence gaps, the relative order and orientation of which are not known.
Phase 2: Entries are also unfinished sequences that may or may not contain sequence gaps. If there are gaps, then the contigs are in the correct order and orientation.
Phase 3: Sequences are of finished quality and have no gaps. For each organism, the group overseeing the sequencing effort determines the definition of finished quality.
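The four phase definitions can be read as a small decision procedure. The sketch below encodes them that way; the `HtgsEntry` class and its field names are hypothetical and are not part of any GenBank format or tool. In practice the submitting center declares the phase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HtgsEntry:
    """Hypothetical summary of an HTGS entry's state (illustrative only)."""
    assembled_into_contigs: bool
    has_gaps: bool
    contigs_ordered_and_oriented: bool
    finished_quality: bool


def htgs_phase(entry: HtgsEntry) -> int:
    """Map an entry's state to phase 0-3 following the definitions above."""
    if not entry.assembled_into_contigs:
        return 0  # a few unassembled reads of a single clone
    if entry.finished_quality and not entry.has_gaps:
        return 3  # finished, gap-free sequence
    if entry.contigs_ordered_and_oriented or not entry.has_gaps:
        return 2  # unfinished; any gaps sit between ordered, oriented contigs
    return 1      # contigs with gaps, order and orientation unknown


draft = HtgsEntry(assembled_into_contigs=True, has_gaps=True,
                  contigs_ordered_and_oriented=False, finished_quality=False)
print(htgs_phase(draft))  # 1
```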
10
Whole Genome Shotgun Sequences (WGS)
Shotgun sequence reads are assembled into contigs, submitted, and updated as the sequencing project progresses and new assemblies are computed.
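As a rough illustration of what "assembled into contigs" means, the following toy sketch greedily merges reads that share exact suffix/prefix overlaps. It is a teaching simplification and not how genome centers assemble data: real assemblers handle sequencing errors, reverse complements, repeats, and contained reads. The function names and the reads are made up.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of `a` equal to a prefix of `b`, or 0."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for n, c in enumerate(contigs)
                   if n not in (best_i, best_j)] + [merged]


reads = ["ATGGCC", "GCCAAA", "AAATGA", "TTTTTT"]
print(sorted(greedy_assemble(reads)))  # ['ATGGCCAAATGA', 'TTTTTT']
```

The read that overlaps nothing stays behind as its own contig, which mirrors why WGS projects are resubmitted as more reads close the gaps.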
11
EST, STS, and GSS
EST = Expressed Sequence Tags (dbEST): Short (< 1 kb), single-pass cDNA sequences from a particular tissue and/or developmental stage. They lack annotation.
STS = Sequence Tagged Sites (dbSTS): Short genomic landmark sequences. They are operationally unique in that they are specifically amplified from the genome by PCR. They define a specific location on the genome and are thus useful for mapping.
GSS = Genome Survey Sequences (dbGSS): Short sequences derived from genomic DNA, about which little is known.
12
HTC and FLIC
HTC = High-Throughput cDNA/mRNA: Similar to ESTs, but often contain more information. May have a systematic gene name that is related to the lab or center that submitted them, and the longest ORF is often annotated as a coding region.
FLIC = Full-Length Insert cDNA: Contains the entire sequence of a cloned cDNA/mRNA. Generally longer than ESTs, and sometimes full-length mRNAs. Usually annotated with genes and coding regions. Gene names may be systematic rather than functional.
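The "longest ORF" mentioned for HTC records can be found with a simple scan. This sketch assumes an ORF runs from the first ATG in a frame to the next in-frame stop codon and looks only at the three forward-strand frames; annotation pipelines may use different rules. The function name and test sequence are illustrative.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(sequence: str) -> str:
    """Longest ATG...stop ORF in the three forward reading frames."""
    seq = sequence.upper()
    best = ""
    for frame in range(3):
        codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
        start = None
        for idx, codon in enumerate(codons):
            if start is None and codon == "ATG":
                start = idx                      # open an ORF at the first ATG
            elif start is not None and codon in STOP_CODONS:
                orf = "".join(codons[start:idx + 1])
                if len(orf) > len(best):
                    best = orf
                start = None                     # look for the next ORF
    return best


# Made-up cDNA: a 9 bp ORF in one frame and a 15 bp ORF in another.
cdna = "CCATGAAATAGGATGGCCGCCAAATGACC"
print(longest_orf(cdna))  # ATGGCCGCCAAATGA
```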
13
Submission Tools
BankIt: Web-based form for submission of a small number of sequences with minimal annotation to GenBank.
Sequin: More appropriate for complicated submissions containing a significant amount of annotation or many sequences. Stand-alone application available on NCBI’s FTP site.
14
Sequence Data Flow and Processing
Triage: Within 48 hours of a direct submission with BankIt or Sequin, the database staff reviews the submission to determine whether it meets the minimal criteria and then assigns an accession number.
  All sequences must be > 50 bp in length and be sequenced by, or on behalf of, the group submitting the sequence.
  GenBank will not accept sequences constructed in silico.
  GenBank will not accept noncontiguous sequences containing internal, unsequenced spacers.
  GenBank will not accept sequences for which there is not a physical counterpart, such as those derived from a mix of genomic DNA and mRNA.
Submissions are checked to determine whether they are new or updates.
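Some of the triage criteria can be expressed as a simple pre-submission check. The sketch below is hypothetical and is not GenBank's actual procedure: the function name and flags are invented, and treating an internal run of 10 or more N's as an "unsequenced spacer" is an assumption made only for illustration. Whether a sequence has a physical counterpart cannot be checked from the sequence alone, so it is left out.

```python
import re

MIN_LENGTH_BP = 50  # sequences must be longer than this


def triage_problems(sequence: str, sequenced_by_submitter: bool,
                    constructed_in_silico: bool) -> list[str]:
    """Return a list of reasons a submission would fail the minimal criteria."""
    seq = sequence.upper()
    problems = []
    if len(seq) <= MIN_LENGTH_BP:
        problems.append("sequence is not longer than 50 bp")
    if not sequenced_by_submitter:
        problems.append("not sequenced by, or on behalf of, the submitter")
    if constructed_in_silico:
        problems.append("sequence was constructed in silico")
    # ASSUMPTION: an internal run of >= 10 N's marks an unsequenced spacer.
    if re.search(r"N{10,}", seq.strip("N")):
        problems.append("contains an internal unsequenced spacer")
    return problems


print(triage_problems("ACGT" * 20, True, False))  # []
```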
15
Sequence Data Flow and Processing
Indexing:
  Biological validity: Translation, organism lineage, BLAST searches
  Vector contamination: Is there any vector DNA present in the sequence?
  Publication status: If published, citation is included in annotation and linked to Entrez
  Formatting and spelling
Sequences are sent to the submitter for final review before release into the public database.
Sequences must become publicly available once the accession number or the sequence has been published.
GenBank annotation staff process about 1900 submissions/month, or about 20,000 sequences.
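The vector-contamination question can be illustrated with a naive exact-match screen: flag every position in the submitted sequence whose k-mer also occurs in a known vector sequence. This is only a sketch of the idea; the annotation staff use alignment-based searches, and the function name, the choice of k = 12, and both sequences below are made up.

```python
def vector_hits(query: str, vector: str, k: int = 12) -> list[int]:
    """0-based query positions whose k-mer also occurs in the vector."""
    query, vector = query.upper(), vector.upper()
    vector_kmers = {vector[i:i + k] for i in range(len(vector) - k + 1)}
    return [i for i in range(len(query) - k + 1)
            if query[i:i + k] in vector_kmers]


# Toy vector fragment and a query carrying 15 bp of it in the middle.
vector = "GAATTCGAGCTCGGTACCCGGGGATCC"
query = "ACGTACGTAA" + "GAGCTCGGTACCCGG" + "TTTTGGGGCC"
print(vector_hits(query, vector))  # [10, 11, 12, 13]
```

Consecutive hit positions mark a contiguous stretch of suspected vector sequence that would need to be trimmed before release.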
16
RefSeq
A curated collection of DNA, RNA, and protein sequences built by NCBI.
Unlike GenBank, RefSeq provides only one example of each natural biological molecule for major organisms ranging from viruses to bacteria to eukaryotes.
May include separate linked records for genomic DNA, the gene transcripts, and the proteins arising from those transcripts.
Limited to major organisms for which sufficient data are available (only 4000 as of Jan 2007), while GenBank includes sequences for any organism submitted (~250k different organisms).
17
Third Party Annotation (TPA) database
Contains nucleotide sequences built from existing primary data with new annotation that has been published in a peer-reviewed scientific journal.
Two types of records:
  Experimental: Annotation supported by wet-lab evidence
  Inferential: Annotation inferred only
Bridges the gap between GenBank and RefSeq by permitting authors publishing new experimental evidence to re-annotate sequences in a public database as they think best, even if they are not the primary sequencer or the curator of a model organism database.
18
Universal Protein Resource (UniProt)
Protein sequence database that was formed through the merger of three protein databases:
1. Swiss-Prot, from the Swiss Institute of Bioinformatics and the European Bioinformatics Institute
2. Translated EMBL Nucleotide Sequence Data Library (TrEMBL), from the same two institutes
3. Georgetown University’s Protein Information Resource Protein Sequence Database (PIR-PSD)
19
Problem Set
ftp://ftp.ncbi.nih.gov/pub/education/tutorials/genbank.pdf
Linked on today’s web page