Database resources of the National Center for Biotechnology Information (NCBI)


The National Center for Biotechnology Information (NCBI) at the National Institutes of Health was created in 1988 to develop information systems for molecular biology. In addition to maintaining the GenBank nucleic acid sequence database, to which data are submitted by the scientific community, NCBI provides data retrieval systems and computational resources for the analysis of GenBank data and a variety of other biological data.

DATABASE RETRIEVAL TOOLS

Entrez
Entrez is an integrated database retrieval system for DNA and protein sequences derived from several sources, the NCBI taxonomy, genome maps, population sets, gene expression data, protein structures from the Molecular Modeling Database (MMDB), 3D and alignment-based protein domains, and the biomedical literature via PubMed, Online Mendelian Inheritance in Man (OMIM), and online Books. The records retrieved by an Entrez search can be displayed in a wide variety of formats and downloaded singly or in large batches. Formatting options vary for records of different types. For example, display formats for GenBank records include the GenBank Flatfile, FASTA, XML, ASN.1, and others.

PubMed Central (PMC)
PMC is a digital archive of peer-reviewed journals in the life sciences.
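As a rough illustration of choosing a display format when downloading a record, the sketch below builds a request URL for NCBI's E-utilities EFetch service. The slides do not mention E-utilities, so treating it as the programmatic route into Entrez is an assumption here; the helper name `efetch_url` and the accession used are illustrative only, and nothing is actually downloaded.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities (assumed here as the programmatic
# counterpart to the Entrez web interface described in the slides).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_url(db, record_id, rettype="fasta", retmode="text"):
    """Build an EFetch URL for one record; this only formats the request."""
    query = urlencode(
        {"db": db, "id": record_id, "rettype": rettype, "retmode": retmode}
    )
    return f"{EUTILS_BASE}/efetch.fcgi?{query}"


# Illustrative accession: the same record requested as FASTA and as a
# GenBank flatfile.
fasta_url = efetch_url("nucleotide", "NM_000546", rettype="fasta")
genbank_url = efetch_url("nucleotide", "NM_000546", rettype="gb")
print(fasta_url)
print(genbank_url)
```

Only the `rettype` value changes between the two requests, which mirrors the point that one record can be displayed in several formats.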

BLink
BLink displays pre-computed protein BLAST alignments for each protein sequence in the Entrez databases. BLink allows for the display of subsets of these alignments by taxonomic criteria, by database of origin, by relation to a complete genome, by membership in a Cluster of Orthologous Groups (COG), or by relation to a 3D structure or conserved protein domain.

UniGene
UniGene is a system for automatically partitioning GenBank sequences, including ESTs, into a non-redundant set of gene-oriented clusters. Each UniGene cluster contains sequences that represent a unique gene, and is linked to related information, such as the tissue types in which the gene is expressed, model organism protein similarities, the LocusLink report for the gene and its map location. UniGene databases are updated weekly with new EST sequences, and bimonthly with newly characterized sequences.

HomoloGene
HomoloGene is a database of both curated and calculated gene orthologs and homologs for 14 organisms. Computed orthologs and homologs, which are considered putative, are identified from BLAST nucleotide sequence comparisons between all UniGene clusters for each pair of organisms. HomoloGene also contains a set of triplet ortholog-based, COG-like clusters, which may include up to 14 members, in which the triplet orthologs in two organisms are both orthologous to the same gene in a third organism. The HomoloGene database can be queried using UniGene ClusterIDs, LocusLink Locus IDs, gene symbols, gene names and nucleotide accession numbers, as well as terms found in UniGene cluster titles.

Reference Sequence (RefSeq)
The RefSeq database provides curated reference sequences for mRNAs, genomic sequences, computationally derived sequences and proteins for human and other organisms.

Open Reading Frame (ORF) Finder
ORF Finder performs a six-frame translation of a nucleotide sequence and returns a graphic that indicates the location of each ORF found. The protein translations of the ORFs detected can be submitted directly for BLAST similarity searching or searching against the COGs database.

Electronic PCR (e-PCR)
e-PCR is a tool for locating Sequence Tagged Sites (STSs) within a nucleotide sequence by searching against a non-redundant database.

Map Viewer
The NCBI Map Viewer displays genome assemblies using sets of synchronized chromosomal maps.
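The six-frame translation that ORF Finder performs can be sketched in a few lines. This is a minimal illustration, not NCBI's implementation: it assumes the standard genetic code, treats only ATG as a start codon, and reports an ORF only when it ends in a stop codon. The function names are made up for this example.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def find_orfs(seq, min_codons=2):
    """Return (strand, frame, protein) for each ATG..stop ORF in six frames."""
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            protein = translate(s[frame:])
            # Every piece except the last one is terminated by a stop codon.
            for segment in protein.split("*")[:-1]:
                start = segment.find("M")
                if start != -1 and len(segment) - start >= min_codons:
                    orfs.append((strand, frame, segment[start:]))
    return orfs


print(find_orfs("ATGAAATGA"))
```

Each reported protein string is the kind of translation that could then be passed to a similarity search.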

Gene Expression Omnibus (GEO)
GEO is a data repository and retrieval system for gene expression data derived from any organism or artificial source. Gene expression data derived from spotted microarrays, high-density oligonucleotide arrays, hybridization filters and SAGE are available for download and accepted for deposit.

CAP3 Sequence Assembly Programme
The shotgun sequencing strategy has been used widely in genome sequencing projects. A major phase in this strategy is to assemble short reads into long sequences. A number of DNA sequence assembly programs have been developed. The FAKII program provides a library of routines for each phase of the assembly process. The GAP4 program has a number of useful interactive features. The PHRAP program clips 5' and 3' low-quality regions of reads and uses base quality values in evaluation of overlaps and generation of contig sequences. TIGR Assembler has been used in a number of megabase microbial genome projects. Continued development and improvement of sequence assembly programs are required to meet the challenges of the human, mouse, and maize genome projects.

The CAP3 program includes a number of improvements and new features. A capability to clip 5' and 3' low-quality regions of reads is included in the CAP3 program. Base quality values produced by PHRED are used in computation of overlaps between reads, construction of multiple sequence alignments of reads, and generation of consensus sequences. Efficient algorithms are employed to identify and compute overlaps between reads. Forward–reverse constraints are used to correct assembly errors and link contigs. Results of CAP3 on four BAC data sets are presented. The performance of CAP3 was compared with that of PHRAP on a number of BAC data sets.

An unusual feature of CAP3 is the use of forward–reverse constraints in the construction of contigs. A forward–reverse constraint is often produced by sequencing both ends of a subclone. It specifies that the two reads should be on opposite strands of the DNA molecule, within a specified range of distance. By sequencing both ends of each subclone, a large number of forward–reverse constraints are produced for a cosmid or BAC data set. A difficulty with the use of forward–reverse constraints in assembly is that some of them are incorrect because of errors in lane tracking and cloning. Our strategy for dealing with this difficulty is based on the observation that a majority of the constraints are correct and wrong constraints usually occur randomly.
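The definition of a forward–reverse constraint given above can be expressed as a small check. The sketch below is only an illustration of that definition (opposite strands, distance within a range); the `Placement` record, the way distance is measured as the outer span of the two reads, and the example numbers are all assumptions and are not taken from CAP3.

```python
from dataclasses import dataclass


@dataclass
class Placement:
    """Hypothetical placement of one read in a contig."""
    start: int   # leftmost contig position covered by the read
    end: int     # rightmost contig position covered by the read
    strand: str  # "+" or "-"


def constraint_satisfied(a, b, min_dist, max_dist):
    """True if the two reads lie on opposite strands within the distance range."""
    if a.strand == b.strand:
        return False
    span = max(a.end, b.end) - min(a.start, b.start)
    return min_dist <= span <= max_dist


def count_satisfied(pairs, min_dist, max_dist):
    """Count how many (read, mate) placements satisfy their constraint."""
    return sum(constraint_satisfied(a, b, min_dist, max_dist) for a, b in pairs)


fwd = Placement(0, 500, "+")
rev = Placement(1500, 2000, "-")
print(constraint_satisfied(fwd, rev, 1000, 3000))
```

Counting satisfied and unsatisfied constraints, rather than trusting any single one, reflects the observation that most constraints are correct while the wrong ones occur randomly.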

The assembly algorithm consists of three major phases. In the first phase, 5' and 3' poor regions of each read are identified and removed, overlaps between reads are computed, and false overlaps are identified and removed. In the second phase, reads are joined to form contigs in decreasing order of overlap scores. Then, forward–reverse constraints are used to make corrections to contigs. In the third phase, a multiple sequence alignment of reads is constructed, and a consensus sequence along with a quality value for each base is computed for each contig.
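The second phase, joining reads in decreasing order of overlap score, follows the general shape of a greedy layout. The sketch below shows that general idea only and is not CAP3's actual procedure: it assumes overlap scores are already computed, ignores strand orientation and constraints, and lets each read have at most one left and one right neighbour. All names and scores are hypothetical.

```python
def greedy_layout(reads, overlaps):
    """Join reads into contigs, taking overlaps in decreasing score order.

    overlaps holds (score, left_read, right_read) tuples, meaning the end of
    left_read overlaps the beginning of right_read.
    Returns contigs as ordered lists of read names.
    """
    right_of = {}                       # read -> read placed to its right
    has_left = set()                    # reads that already have a left neighbour
    parent = {r: r for r in reads}      # union-find, used to avoid cycles

    def find(r):
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    for score, left, right in sorted(overlaps, reverse=True):
        if left in right_of or right in has_left:
            continue                    # one of the two ends is already used
        if find(left) == find(right):
            continue                    # joining would close a cycle
        right_of[left] = right
        has_left.add(right)
        parent[find(right)] = find(left)

    contigs = []
    for r in reads:
        if r not in has_left:           # r is the leftmost read of its contig
            chain = [r]
            while chain[-1] in right_of:
                chain.append(right_of[chain[-1]])
            contigs.append(chain)
    return contigs


reads = ["r1", "r2", "r3", "r4"]
overlaps = [(90, "r1", "r2"), (80, "r2", "r3"), (70, "r1", "r3"), (10, "r3", "r1")]
print(greedy_layout(reads, overlaps))
```

Because the strongest overlaps are accepted first, a weaker overlap that conflicts with an earlier join (such as the 70 and 10 entries in the example) is simply skipped.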

Major steps of the assembly algorithm

Computation of the 5' and 3' clipping positions of read f. Read f has high local similarities to reads g and h. A pair of broken lines shows the start and end positions of a similarity. A thick line indicates the high-quality region of a read.

The new features of the CAP3 sequence assembly program include fast identification of pairs of reads with an overlap, clipping of 5' and 3' poor regions of reads, efficient computation and evaluation of overlaps, use of forward–reverse constraints to correct errors in construction of contigs, and generation of consensus sequences for contigs. An unusual feature of CAP3 is the algorithm for making use of constraints to correct assembly errors. The algorithm is very tolerant of errors in constraints. The algorithm makes a correction to the current assembly only if the correction is supported by a sufficient number of constraints. The experimental results indicate that CAP3 is able to make corrections to contigs using constraints.

PCR (Polymerase Chain Reaction)
The polymerase chain reaction is widely held to be one of the most important inventions of the 20th century in molecular biology. Small amounts of genetic material can now be amplified in order to identify and manipulate DNA; detect infectious organisms, including those that cause AIDS, hepatitis and tuberculosis; detect genetic variations, including mutations, in human genes; and perform numerous other tasks.

What is a primer?
A primer is a short synthetic oligonucleotide which is used in many molecular techniques, from PCR to DNA sequencing. Primers are designed to have a sequence which is the reverse complement of the region of template or target DNA to which we wish the primer to anneal.
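The reverse-complement relationship between a primer and its annealing site can be made concrete with a short sketch. It assumes the template is given as the top strand written 5' to 3'; the function names and the toy sequence are illustrative, and the primer length is left as a required argument because it is a design choice.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def primer_pair(template, start, end, length):
    """Return (forward, reverse) primers flanking template[start:end].

    Both primers are written 5'->3'. The forward primer copies the start of
    the region, so it is the reverse complement of its annealing site on the
    bottom strand. The reverse primer is the reverse complement of the end of
    the region on the top strand.
    """
    region = template[start:end]
    forward = region[:length]
    reverse = reverse_complement(region[-length:])
    return forward, reverse


# Toy template, far shorter than a real target, just to show the orientation.
print(primer_pair("ATGCCGTTAGCA", 0, 12, 4))
```

With this toy input the reverse primer comes out as TGCT, the reverse complement of the last four template bases AGCA.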

PCR involves the following three steps: denaturation, annealing and extension. First, the genetic material is denatured, converting the double-stranded DNA molecules to single strands. The primers are then annealed to the complementary regions of the single-stranded molecules. In the third step, the primers are extended by the action of the DNA polymerase. All these steps are temperature sensitive, and the common choice of temperatures is 94 °C, 60 °C and 70 °C respectively. Good primer design is essential for successful reactions. The important design considerations described below are key to specific amplification with high yield.

Analysis of primer sequences
When designing primers for PCR, sequencing or mutagenesis, it is often necessary to make predictions about these primers, for example their melting temperature (Tm) and their propensity to form dimers with themselves or with other primers in the reaction. When primers form hairpin loops or dimers, less primer is available for the desired reaction. For example...
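One widely used rule of thumb for a first estimate of Tm is the Wallace rule, Tm = 2(A+T) + 4(G+C) in °C. The slides do not say which method they have in mind, so the choice of this formula is an assumption; it is a rough approximation for short oligonucleotides and ignores salt and primer concentration.

```python
def wallace_tm(primer):
    """Estimate melting temperature in degrees C with the Wallace rule.

    Tm = 2 * (number of A and T) + 4 * (number of G and C).
    Characters other than A, C, G and T are ignored.
    """
    p = primer.upper()
    at = p.count("A") + p.count("T")
    gc = p.count("G") + p.count("C")
    return 2 * at + 4 * gc


# 8 A/T bases and 8 G/C bases: 2*8 + 4*8 = 48
print(wallace_tm("ATGCATGCATGCATGC"))
```

Comparing the estimates for the two primers in a pair is a quick way to spot a large Tm mismatch before using a more detailed model.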

Some thoughts on designing primers.
1. primers should be bases in length;
2. base composition should be 50-60% (G+C);
3. primers should end (3') in a G or C, or CG or GC: this prevents "breathing" of ends and increases efficiency of priming;
4. Tms between °C are preferred;
5. 3'-ends of primers should not be complementary (i.e. base pair), as otherwise primer dimers will be synthesised preferentially to any other product;
6. primer self-complementarity (ability to form 2° structures such as hairpins) should be avoided;
7. runs of three or more Cs or Gs at the 3'-ends of primers may promote mispriming at G or C-rich sequences (because of stability of annealing), and should be avoided.
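Several of the guidelines above are simple enough to check automatically. The sketch below covers only the rules whose values are stated in the list (G+C composition, the 3' G or C, the 3' run of Gs and Cs, and complementary 3' ends); the length and Tm rules are left out because their ranges are not given. The function names and the three-base window used for the 3'-end tests are assumptions made for this example.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def check_primer(primer):
    """Check one primer (written 5'->3') against guidelines 2, 3 and 7."""
    p = primer.upper()
    gc = (p.count("G") + p.count("C")) / len(p)
    return {
        "gc_50_to_60_percent": 0.50 <= gc <= 0.60,            # guideline 2
        "ends_in_g_or_c": p[-1] in "GC",                      # guideline 3
        "no_gc_run_at_3prime": not all(b in "GC" for b in p[-3:]),  # guideline 7
    }


def three_prime_ends_pair(a, b, n=3):
    """Guideline 5: True if the last n bases of a and b can base-pair."""
    return a.upper()[-n:] == reverse_complement(b.upper()[-n:])


print(check_primer("ATGCCGTAGCTAGCTAGCAC"))
print(three_prime_ends_pair("AAAGAT", "TTTATC"))
```

A primer ending in GGG would pass the "ends in G or C" test but fail the run test, which shows how guidelines 3 and 7 have to be balanced against each other.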

Primer3