Database resources of the National Center for Biotechnology Information
The National Center for Biotechnology Information (NCBI) at the National Institutes of Health was created in 1988 to develop information systems for molecular biology. In addition to maintaining the GenBank nucleic acid sequence database, to which the scientific community submits data, NCBI provides data retrieval systems and computational resources for the analysis of GenBank data and a variety of other biological data.
DATABASE RETRIEVAL TOOLS
Entrez
Entrez is an integrated database retrieval system for DNA and protein sequences derived from several sources, the NCBI taxonomy, genome maps, population sets, gene expression data, protein structures from the Molecular Modeling Database (MMDB), 3D and alignment-based protein domains, and the biomedical literature via PubMed, Online Mendelian Inheritance in Man (OMIM), and online Books. The records retrieved by an Entrez search can be displayed in a wide variety of formats and downloaded singly or in large batches. Formatting options vary for records of different types. For example, display formats for GenBank records include the GenBank Flatfile, FASTA, XML, ASN.1, and others.
PubMed Central (PMC)
PMC is a digital archive of peer-reviewed journals in the life sciences.
BLink
BLink displays pre-computed protein BLAST alignments for each protein sequence in the Entrez databases. BLink can display subsets of these alignments selected by taxonomic criteria, by database of origin, by relation to a complete genome, by membership in a Cluster of Orthologous Groups (COG), or by relation to a 3D structure or conserved protein domain.
UniGene
UniGene is a system for automatically partitioning GenBank sequences, including ESTs, into a non-redundant set of gene-oriented clusters. Each UniGene cluster contains sequences that represent a unique gene and is linked to related information, such as the tissue types in which the gene is expressed, model organism protein similarities, the LocusLink report for the gene, and its map location. UniGene databases are updated weekly with new EST sequences, and bimonthly with newly characterized sequences.
HomoloGene
HomoloGene is a database of both curated and calculated gene orthologs and homologs for 14 organisms. Computed orthologs and homologs, which are considered putative, are identified from BLAST nucleotide sequence comparisons between all UniGene clusters for each pair of organisms. HomoloGene also contains a set of triplet ortholog-based, COG-like clusters, which may include up to 14 members, in which the triplet orthologs in two organisms are both orthologous to the same gene in a third organism. The HomoloGene database can be queried using UniGene ClusterIDs, LocusLink Locus IDs, gene symbols, gene names and nucleotide accession numbers, as well as terms found in UniGene cluster titles.
Reference Sequence (RefSeq)
The RefSeq database provides curated reference sequences for mRNAs, genomic sequences, computationally derived sequences and proteins for human and other organisms.
Open Reading Frame (ORF) Finder
ORF Finder performs a six-frame translation of a nucleotide sequence and returns a graphic that indicates the location of each ORF found. The protein translations of the ORFs detected can be submitted directly for BLAST similarity searching or searching against the COGs database.
Electronic PCR (e-PCR)
e-PCR is a tool for locating Sequence Tagged Sites (STSs) within a nucleotide sequence by searching against a non-redundant database.
Map Viewer
The NCBI Map Viewer displays genome assemblies using sets of synchronized chromosomal maps.
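The six-frame scan that ORF Finder is described as performing can be illustrated with a minimal sketch. This is not NCBI's implementation: the function names, the ATG-only start rule, the standard stop codons and the `min_codons` cutoff are all assumptions made for illustration.

```python
# Hypothetical sketch of a six-frame ORF scan (not the NCBI ORF Finder code).
# Assumes ORFs start at ATG and end at the first in-frame TAA/TAG/TGA.
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=2):
    """Scan three frames on each strand.

    Returns tuples (strand, frame, start, end). Coordinates are 0-based,
    half-open, and measured on the strand that was scanned; `end` includes
    the stop codon. `min_codons` counts codons before the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # open a candidate ORF at the first ATG
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3))
                    start = None  # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    for orf in find_orfs("ATGAAATAG"):
        print(orf)
```

Reporting coordinates on the scanned strand keeps the sketch short; a real tool would normally map reverse-strand hits back to forward-strand coordinates.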
Gene Expression Omnibus (GEO)
GEO is a data repository and retrieval system for gene expression data derived from any organism or artificial source. Gene expression data from spotted microarrays, high-density oligonucleotide arrays, hybridization filters and SAGE are available for download and accepted for deposit.
CAP3 Sequence Assembly Program
The shotgun sequencing strategy has been used widely in genome sequencing projects. A major phase in this strategy is to assemble short reads into long sequences. A number of DNA sequence assembly programs have been developed. The FAKII program provides a library of routines for each phase of the assembly process. The GAP4 program has a number of useful interactive features. The PHRAP program clips 5′ and 3′ low-quality regions of reads and uses base quality values in evaluation of overlaps and generation of contig sequences. TIGR Assembler has been used in a number of megabase microbial genome projects. Continued development and improvement of sequence assembly programs are required to meet the challenges of the human, mouse, and maize genome projects.
The CAP3 program includes a number of improvements and new features. A capability to clip 5′ and 3′ low-quality regions of reads is included in the CAP3 program. Base quality values produced by PHRED are used in computation of overlaps between reads, construction of multiple sequence alignments of reads, and generation of consensus sequences. Efficient algorithms are employed to identify and compute overlaps between reads. Forward–reverse constraints are used to correct assembly errors and link contigs. Results of CAP3 on four BAC data sets are presented. The performance of CAP3 was compared with that of PHRAP on a number of BAC data sets.
An unusual feature of CAP3 is the use of forward–reverse constraints in the construction of contigs. A forward–reverse constraint is often produced by sequencing of both ends of a subclone. A forward–reverse constraint specifies that the two reads should be on the opposite strands of the DNA molecule within a specified range of distance. By sequencing both ends of each subclone, a large number of forward–reverse constraints are produced for a cosmid or BAC data set. A difficulty with use of forward–reverse constraints in assembly is that some of the forward–reverse constraints are incorrect because of errors in lane tracking and cloning. Our strategy for dealing with this difficulty is based on the observation that a majority of the constraints are correct and wrong constraints usually occur randomly.
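The definition of a forward–reverse constraint given above (two reads on opposite strands, within a specified distance range) can be written down as a small check. This is a sketch under stated assumptions, not CAP3's code: the `Placement` and `Constraint` types are hypothetical, and measuring the distance as the outer span of the two reads in contig coordinates is an assumption.

```python
# Hypothetical sketch: testing a forward-reverse constraint against a layout.
# The data model and the "outer span" distance measure are assumptions.
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    contig: str
    start: int   # 0-based start of the read in the contig
    end: int     # half-open end of the read in the contig
    strand: str  # "+" or "-"


@dataclass(frozen=True)
class Constraint:
    read_a: str
    read_b: str
    min_dist: int
    max_dist: int


def is_satisfied(constraint, placements):
    """True if both reads lie in one contig, on opposite strands,
    and their outer span falls within [min_dist, max_dist]."""
    a = placements[constraint.read_a]
    b = placements[constraint.read_b]
    if a.contig != b.contig or a.strand == b.strand:
        return False
    span = max(a.end, b.end) - min(a.start, b.start)
    return constraint.min_dist <= span <= constraint.max_dist


def count_satisfied(constraints, placements):
    """Number of constraints consistent with the current layout."""
    return sum(is_satisfied(c, placements) for c in constraints)


if __name__ == "__main__":
    layout = {
        "r1": Placement("c1", 0, 500, "+"),
        "r2": Placement("c1", 1500, 2000, "-"),
        "r3": Placement("c1", 900, 1400, "+"),
    }
    pairs = [Constraint("r1", "r2", 1000, 3000),
             Constraint("r1", "r3", 1000, 3000)]
    print(count_satisfied(pairs, layout), "of", len(pairs), "satisfied")
```

Counting satisfied and unsatisfied constraints per layout is what makes the stated observation usable: if most constraints are correct and wrong ones occur randomly, a layout contradicted by many constraints is more likely to be misassembled than the constraints are to be wrong.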
The assembly algorithm consists of three major phases. In the first phase, 5′ and 3′ poor regions of each read are identified and removed. Overlaps between reads are computed. False overlaps are identified and removed. In the second phase, reads are joined to form contigs in decreasing order of overlap scores. Then, forward–reverse constraints are used to make corrections to contigs. In the third phase, a multiple sequence alignment of reads is constructed and a consensus sequence along with a quality value for each base is computed for each contig.
Major steps of the assembly algorithm
Computation of the 5′ and 3′ clipping positions of read f. Read f has high local similarities to reads g and h. A pair of broken lines shows the start and end positions of a similarity. A thick line indicates the high-quality region of a read.
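The idea of 5′ and 3′ clipping positions can be illustrated with a deliberately simplified sketch. As the caption indicates, CAP3 also takes similarities to other reads into account; the version below uses base quality values only, and the function name and the threshold of 20 are assumptions rather than CAP3 parameters.

```python
# Hypothetical, quality-only sketch of end clipping (not the CAP3 rule,
# which also considers similarity to other reads).
def clip_positions(quals, min_qual=20):
    """Return (left, right), the half-open range of bases to keep.

    Bases are dropped from the 5' end until one reaches min_qual,
    then from the 3' end in the same way.
    """
    left = 0
    while left < len(quals) and quals[left] < min_qual:
        left += 1
    right = len(quals)
    while right > left and quals[right - 1] < min_qual:
        right -= 1
    return left, right


if __name__ == "__main__":
    read = "ACGTTGCA"
    quals = [5, 8, 30, 35, 12, 40, 9, 3]
    left, right = clip_positions(quals)
    print(read[left:right])  # the retained high-quality region
```

Note that a low-quality base inside the retained region (quality 12 in the example) is kept; only the ends are trimmed.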
The new features of the CAP3 sequence assembly program include fast identification of pairs of reads with an overlap, clipping of 5′ and 3′ poor regions of reads, efficient computation and evaluation of overlaps, use of forward–reverse constraints to correct errors in construction of contigs, and generation of consensus sequences for contigs. An unusual feature of CAP3 is the algorithm for making use of constraints to correct assembly errors. The algorithm is very tolerant of errors in constraints. The algorithm makes a correction to the current assembly only if the correction is supported by a sufficient number of constraints. The experimental results indicate that CAP3 is able to make corrections to contigs using constraints.
PCR (Polymerase Chain Reaction)
The polymerase chain reaction is widely regarded as one of the most important inventions of the 20th century in molecular biology. Small amounts of genetic material can now be amplified in order to identify and manipulate DNA, detect infectious organisms (including those that cause AIDS, hepatitis and tuberculosis), detect genetic variations, including mutations, in human genes, and perform numerous other tasks.
What is a primer?
A primer is a short synthetic oligonucleotide used in many molecular techniques, from PCR to DNA sequencing. Primers are designed to have a sequence that is the reverse complement of the region of template or target DNA to which we wish the primer to anneal.
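The reverse-complement relationship between a primer and its target region can be shown in a few lines. This is an illustrative sketch; the function names and the example template are hypothetical.

```python
# Illustrative sketch: deriving a primer as the reverse complement of a
# target region. Function names and the example template are hypothetical.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Complement each base and reverse the order (result reads 5' to 3')."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def primer_for_region(template, start, end):
    """Primer that anneals to template[start:end] (0-based, half-open)."""
    return reverse_complement(template[start:end])


if __name__ == "__main__":
    template = "AACCGGTTAC"
    print(primer_for_region(template, 0, 4))  # primer annealing to AACC
```

Both the template region and the primer are written 5′ to 3′, which is why the complement has to be reversed.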
PCR involves the following three steps: denaturation, annealing and extension. First, the genetic material is denatured, converting the double-stranded DNA molecules to single strands. The primers are then annealed to the complementary regions of the single-stranded molecules. In the third step, the annealed primers are extended by the action of the DNA polymerase. All these steps are temperature sensitive, and the common choice of temperatures is 94 °C, 60 °C and 70 °C respectively. Good primer design is essential for successful reactions. The important design considerations described below are key to specific amplification with high yield.
Analysis of primer sequences
When designing primers for PCR, sequencing or mutagenesis, it is often necessary to make predictions about these primers, for example their melting temperature (Tm) and their propensity to form dimers with themselves or with other primers in the reaction. When primers form hairpin loops or dimers, less primer is available for the desired reaction. For example...
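As one concrete way of predicting Tm, the sketch below uses the Wallace rule, Tm ≈ 2(A+T) + 4(G+C) °C, a rough rule of thumb for short oligonucleotides. The source does not say which Tm formula it has in mind, so the choice of this rule is an assumption, and the function names are hypothetical.

```python
# Hypothetical sketch: rough Tm estimate with the Wallace rule,
# Tm = 2*(A+T) + 4*(G+C) degrees C. Only a rule of thumb for short oligos;
# the source does not specify which Tm formula is intended.
def base_counts(primer):
    """Return (number of A/T bases, number of G/C bases)."""
    p = primer.upper()
    at = sum(base in "AT" for base in p)
    gc = sum(base in "GC" for base in p)
    return at, gc


def wallace_tm(primer):
    """Approximate melting temperature in degrees C."""
    at, gc = base_counts(primer)
    return 2 * at + 4 * gc


if __name__ == "__main__":
    primer = "ACGTACGTACGTACGTAC"
    print(primer, "Tm is roughly", wallace_tm(primer), "degrees C")
```

Nearest-neighbour thermodynamic models give better estimates for real primer design; the rule above is only meant to show the kind of prediction being discussed.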
Some thoughts on designing primers.
1. primers should be bases in length;
2. base composition should be 50-60% (G+C);
3. primers should end (3') in a G or C, or CG or GC: this prevents "breathing" of ends and increases efficiency of priming;
4. Tms between °C are preferred;
5. 3'-ends of primers should not be complementary (i.e. base pair), as otherwise primer dimers will be synthesised preferentially to any other product;
6. primer self-complementarity (ability to form 2° structures such as hairpins) should be avoided;
7. runs of three or more Cs or Gs at the 3'-ends of primers may promote mispriming at G or C-rich sequences (because of stability of annealing), and should be avoided.
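Several of the guidelines above can be turned into simple checks. The sketch below covers only points 2, 3, 5 and 7; the length and Tm rules are omitted because their numeric ranges are not given here, and hairpin detection (point 6) is left out. All function names are hypothetical, point 7 is read as "three or more consecutive G/C bases at the 3′ end", and the three-base window used for 3′ complementarity is an assumption.

```python
# Hypothetical sketch of checks for guidelines 2, 3, 5 and 7 above.
# Length, Tm and hairpin rules are intentionally not implemented.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def gc_fraction(primer):
    """Fraction of bases that are G or C."""
    return sum(base in "GC" for base in primer) / len(primer)


def three_prime_gc_run(primer):
    """Length of the uninterrupted G/C run ending at the 3' end."""
    run = 0
    for base in reversed(primer):
        if base not in "GC":
            break
        run += 1
    return run


def check_primer(primer):
    """Return a list of guideline violations (empty if none found)."""
    p = primer.upper()
    issues = []
    if not 0.50 <= gc_fraction(p) <= 0.60:
        issues.append("GC content outside 50-60%")
    if p[-1] not in "GC":
        issues.append("3' end is not G or C")
    if three_prime_gc_run(p) >= 3:
        issues.append("run of three or more G/C at 3' end")
    return issues


def three_prime_complementary(primer_a, primer_b, k=3):
    """True if the last k bases of the two primers can base pair
    (antiparallel), which favours primer-dimer formation."""
    a, b = primer_a.upper(), primer_b.upper()
    return a[-k:] == reverse_complement(b[-k:])


if __name__ == "__main__":
    print(check_primer("ATGCATGCATGCATGCTACG"))
    print(check_primer("ATATATATATGGG"))
    print(three_prime_complementary("AAAAGGC", "TTTTGCC"))
```

Passing the same primer as both arguments to `three_prime_complementary` gives a crude self-dimer test for the 3′ end.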
Primer3