NCBI Molecular Biology Resources: A Field Guide
Nov. 6, 2001
NCBI Resources
- About NCBI
- NCBI Sequence Databases
  - Primary Database: GenBank
  - Derivative Databases: RefSeq
- Entrez Databases and Text Searching
- BLAST Services
- Genomic Resources
The National Center for Biotechnology Information (NCBI)
- Created as a part of the National Library of Medicine in 1988
  - Establish public databases
  - Research in computational biology
  - Develop software tools for sequence analysis
  - Disseminate biomedical information
- Tools: BLAST (1990), Entrez (1992)
- GenBank (1992)
- Free MEDLINE (PubMed, 1997)
- Other databases: dbEST, dbGSS, dbSTS, MMDB, OMIM, UniGene, GeneMap, Taxonomy, CGAP, SAGE, LocusLink, RefSeq
Molecular Databases
- Primary Databases
  - Original submissions by experimentalists
  - Database staff organize but don’t add additional information
  - Example: GenBank
- Derivative Databases
  - Human curated: compilation and correction of data
    - Example: SWISS-PROT, NCBI RefSeq mRNA
  - Computationally derived
    - Example: UniGene
  - Combinations
    - Example: NCBI Genome Assembly
What is GenBank? NCBI’s Primary Sequence Database
- Nucleotide-only sequence database
- Archival in nature
- GenBank data
  - Direct submissions of individual records (BankIt, Sequin)
  - Batch submissions (EST, GSS, STS)
  - FTP accounts for sequencing centers
- Data shared nightly among three collaborating databases
  - GenBank
  - DNA Database of Japan (DDBJ)
  - European Molecular Biology Laboratory Database (EMBL) at EBI
[Figure: the three collaborating databases exchange submissions and updates: GenBank (NCBI, NIH; retrieval via Entrez), EMBL (EBI; retrieval via SRS) and DDBJ (CIB, NIG; retrieval via getentry).]
GenBank
- Full release every two months
- Incremental and cumulative updates daily
- Available only through the Internet: ftp://ncbi.nlm.nih.gov/genbank/ or ftp://genbank.sdsc.edu/pub/
- Release 126, October: ,602,262 records; 14,396,883,064 nucleotides; 80,000+ species
GenBank on FTP site
  ftp> open ftp.ncbi.nlm.nih.gov.
  ftp> cd genbank
Release 125: 243 files; Gigabytes uncompressed
GenBank Divisions
Bulk Sequence Divisions
- PAT: Patent
- EST: Expressed Sequence Tags (133 files)
- STS: Sequence Tagged Site
- GSS: Genome Survey Sequence (41 files)
- HTG: High Throughput Genome (25 files)
- HTC: High Throughput cDNA
- CON: Contig
Traditional Divisions
- BCT, INV, MAM, PHG, PLN, PRI, ROD, SYN, UNA, VRL, VRT
EST Division: Expressed Sequence Tags
[Figure: nucleus with 30,000 genes; ,000 RNA gene products; make cDNA library; ,000 unique cDNA clones in library; isolate unique clones; sequence once from each end (5’ and 3’).]
>IMAGE: ', mRNA sequence
NNTCAAGTTTTATGATTTATTTAACTTGTGGAACAAAAATAAACCAGATTAACCACAACCATGCCTTACT
TTATCAAATGTATAAGANGTAAATATGAATCTTATATGACAAAATGTTTCATTCATTATAACAAATTTCC
AATAATCCTGTCAATNATATTTCTAAATTTTCCCCCAAATTCTAAGCAGAGTATGTAAATTGGAAGTTAA
CTTATGCACGCTTAACTATCTTAACAAGCTTTGAGTGCAAGAGATTGANGAGTTCAAATCTGACCAAGAT
GTTGATGTTGGATAAGAGAATTCTCTGCTCCCCACCTCTANGTTGCCAGCCCTC
>IMAGE: ' mRNA sequence
GACAGCATTCGGGCCGAGATGTCTCGCTCCGTGGCCTTAGCTGTGCTCGCGCTACTCTCTCTTTCTGGCC
TGGAGGTATCCAGCGTACTCCAAAGATTCAGGTTTACTCACGTCATCCAGCAGAGAATGGAAAGTCAAAT
TTCCTGAATTGCTATGTGTCTGGGTTTCATCCATCCGACATTGAAGTTGACTTACTGAAGAATGGAGAGA
GAATTGAAAAAGTGGAGCATTCAGACTTGTCTTTCAGCAAGGACTGGTCTTTCTATCTCTTGTACTACAC
TGAATTCACCCCCACTGAAAAAGATGAGTATGCCTGCCGTGTTGAACCATGTNGACTTTGTCACAGNCCC
AAGTTNAGTTTAAGTGGGNATCGAGACATGTAAGGCAGGCATCATGGGAGGTTTTGAAGNATGCCGCNTT
TTGGATTGGGATGAATTCCAAATTTCTGGTTTGCTTGNTTTTTTAATATTGGATATGCTTTTG
STS Division: Sequence Tagged Sites
- Segment of gene, EST, mRNA or genomic DNA of known position (microsatellite)
- PCR with STS primers gives unique product (one per genome)
- Basis of Radiation Hybrid Mapping
  - UniGene
  - Genome Assembly
- Related resource: Electronic PCR
RH mapping using STSs
[Figure: a human chromosome carrying STS markers A, B, C and D; hybrid cells retaining different subsets of the markers (for example AB, D, AB, CD, ABCD); PCR results for each hybrid cell.]
ePCR Results
Hexokinase 1 EST SHGC
dbSTS id: 44155, GenBank Accession: G29974
Organism: Homo sapiens
Primer1: CATACGACACGGCTCACAAA
Primer2: CTGTTTGTCTCGTGGGGG
STS location: Chromosome: 10
Expected amplicon size: 129, Observed amplicon size: 130
Primers match in forward orientation
Query sequence:
    1 TTTTTGAATT GGTACAAAGT TTACTAGGTC ATACGACACG GCTCACAAAG CGGTGGGAAA
   61 TTCCAGTGAT GGCATTGTTT GTTGGTTGGT TCCTTTTATC CAAATGGAGA CAAGACACAT
  121 TTCCGCAGAC GTGTCCACCT CCCCCCACGA GACAAACAGA ATGCAAGACT GTCACACGCG
  181 GCTAGGACTG GTTCCACGGA CACACGATTT TGTGGCATTG ACACACCACG ATGCGATGCC
  241 AGGCCACAGT GGGTGCCAGG AGGGGAGGAA GCAGCTAATG CTATGCCCAC ACTCGCCTTC
  301 AGCATGTGCC CCGGGAGGAG GCCCGGCAGT GTCTGCTGGT GATAATACAT TTCACACGGG
  361 GAGGGGGAAC CAAGGATGAG CTTTGGAGGC CAGAAGGCTG TCAGGTGGTG TG
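The primer check reported on this slide can be sketched in a few lines of Python. This is only an illustration that assumes exact primer matches (the real Electronic PCR tool is more tolerant); the names `revcomp` and `amplicon_size` are hypothetical, and only the first 180 bases of the query sequence are used.

```python
def revcomp(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def amplicon_size(template: str, primer1: str, primer2: str):
    """Size of the product if primer1 matches the given strand and
    primer2 matches the opposite strand downstream; None otherwise."""
    start = template.find(primer1)
    if start < 0:
        return None
    rc = revcomp(primer2)  # how primer2 appears on the given strand
    hit = template.find(rc, start + len(primer1))
    if hit < 0:
        return None
    return hit + len(rc) - start


PRIMER1 = "CATACGACACGGCTCACAAA"
PRIMER2 = "CTGTTTGTCTCGTGGGGG"

# First three lines (180 bases) of the query sequence from the slide.
QUERY = (
    "TTTTTGAATT GGTACAAAGT TTACTAGGTC ATACGACACG GCTCACAAAG CGGTGGGAAA "
    "TTCCAGTGAT GGCATTGTTT GTTGGTTGGT TCCTTTTATC CAAATGGAGA CAAGACACAT "
    "TTCCGCAGAC GTGTCCACCT CCCCCCACGA GACAAACAGA ATGCAAGACT GTCACACGCG"
).replace(" ", "")

print(amplicon_size(QUERY, PRIMER1, PRIMER2))
```

With these inputs the sketch reports 130, which agrees with the observed amplicon size on the slide.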
Genome Sequencing
[Figure: workflow from whole BAC insert (or genome) through sonication, cloning, isolating, sequencing and assembly to draft sequence (HTG division); the GSS division is also labeled.]
GSS Division: Genome Survey Sequences
- Genomic equivalent of ESTs
- BAC and other first pass surveys
  - BAC end sequences
  - Whole Genome Shotgun (some)
- RAPIDS and other anonymous loci
[Figure: genomic clone (BAC) with its T7 end and SP6 end.]
HTG Division: High Throughput Genome Records
[Figure: a record of 40,000 to > 350,000 bp passing through phase 1, phase 2 and phase 3, from the HTG division to PRI, with its accession (Acc = AC) and gi shown at each phase.]
The GenBank Record
A Simple GenBank Record
LOCUS       AF bp mRNA INV 02-MAR-2000
DEFINITION  Limulus polyphemus myosin III mRNA, complete cds.
ACCESSION   AF
VERSION     AF GI:
KEYWORDS    .
SOURCE      Atlantic horseshoe crab.
  ORGANISM  Limulus polyphemus
            Eukaryota; Metazoa; Arthropoda; Chelicerata; Merostomata; Xiphosura; Limulidae; Limulus.
REFERENCE   1 (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R., Greenberg,R.M. and Smith,W.C.
  TITLE     A myosin III from Limulus eyes is a clock-regulated phosphoprotein
  JOURNAL   J. Neurosci. (1998) In press
REFERENCE   2 (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R., Greenberg,R.M. and Smith,W.C.
  TITLE     Direct Submission
  JOURNAL   Submitted (29-APR-1998) Whitney Laboratory, University of Florida, 9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA
REFERENCE   3 (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R., Greenberg,R.M. and Smith,W.C.
  TITLE     Direct Submission
  JOURNAL   Submitted (02-MAR-2000) Whitney Laboratory, University of Florida, 9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA
  REMARK    Sequence update by submitter
COMMENT     On Mar 2, 2000 this sequence version replaced gi:
GenBank Record, cont.
FEATURES             Location/Qualifiers
     source          /organism="Limulus polyphemus"
                     /db_xref="taxon:6850"
                     /tissue_type="lateral eye"
     CDS             /note="N-terminal protein kinase domain; C-terminal myosin heavy chain head; substrate for PKA"
                     /codon_start=1
                     /product="myosin III"
                     /protein_id="AAC "
                     /db_xref="GI: "
                     /translation="MEYKCISEHLPFETLPDPGDRFEVQELVGTGTYATVYSAIDKQA
                     NKKVALKIIGHIAENLLDIETEYRIYKAVNGIQFFPEFRGAFFKRGERESDNEVWLGI
                     EFLEEGTAADLLATHRRFGIHLKEDLIALIIKEVVRAVQYLHENSIIHRDIRAANIMF
                     SKEGYVKLIDFGLSASVKNTNGKAQSSVGSPYWMAPEVISCDCLQEPYNYTCDVWSIG
                     ITAIELADTVPSLSDIHALRAMFRINRNPPPSVKRETRWSETLKDFISECLVKNPEYR
                     PCIQEIPQHPFLAQVEGKEDQLRSELVDILKKNPGEKLRNKPYNVTFKNGHLKTISGQ
BASE COUNT  1201 a 689 c 782 g 1136 t
ORIGIN
        1 tcgacatctg tggtcgcttt ttttagtaat aaaaaattgt attatgacgt cctatctgtt
     3781 aagatacagt aactagggaa aaaaaaaa
//
Sequence and Database Identifiers: Locus, accession, gi, version
LOCUS       AF bp mRNA INV 02-MAR-2000
  Callouts: Locus Name; Sequence length; mol-type (mRNA (= cDNA), rRNA, snRNA, DNA); GB Division; Modification Date
DEFINITION  Limulus polyphemus myosin III mRNA, complete cds.
  Callout: DEF line (Title)
ACCESSION   AF
  Callout: Accession Number
VERSION     AF GI:
  Callouts: Accession.version; gi number
Keywords, Source-organism
KEYWORDS    .
  Callout: Legacy field (exceptions: EST, GSS, HTG)
SOURCE      Atlantic horseshoe crab.
  Callout: Accepted common name
  ORGANISM  Limulus polyphemus
  Callout: Scientific name
            Eukaryota; Metazoa; Arthropoda; Chelicerata; Merostomata; Xiphosura; Limulidae; Limulus.
  Callout: Taxonomic lineage according to GenBank
REFERENCE   1 (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R., Greenberg,R.M. and Smith,W.C.
  TITLE     A myosin III from Limulus eyes is a clock-regulated phosphoprotein
  JOURNAL   J. Neurosci. (1998) In press
REFERENCE   2 (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R., Greenberg,R.M. and Smith,W.C.
  TITLE     Direct Submission
  JOURNAL   Submitted (29-APR-1998) Whitney Laboratory, University of Florida, 9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA
REFERENCE   3 (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R., Greenberg,R.M. and Smith,W.C.
  TITLE     Direct Submission
  JOURNAL   Submitted (02-MAR-2000) Whitney Laboratory, University of Florida, 9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA
  REMARK    Sequence update by submitter
COMMENT     On Mar 2, 2000 this sequence version replaced gi:
Callouts: Citation; Article; Submitter Block; Update history; Previous version
FEATURES             Location/Qualifiers
     source          /organism="Limulus polyphemus"
                     /db_xref="taxon:6850"
                     /tissue_type="lateral eye"
     CDS             /note="N-terminal protein kinase domain; C-terminal myosin heavy chain head; substrate for PKA"
                     /codon_start=1
                     /product="myosin III"
                     /protein_id="AAC "
                     /db_xref="GI: "
                     /translation="MEYKCISEHLPFETLPDPGDRFEVQELVGTGTYATVYSAIDK
                     NKKVALKIIGHIAENLLDIETEYRIYKAVNGIQFFPEFRGAFFKRGERESDNEVWL "
Callouts: Feature Table; Biosource; Coding Sequence; Reading Frame; Protein Identifiers; GenPept
Sequence
BASE COUNT  1201 a 689 c 782 g 1136 t
ORIGIN
        1 tcgacatctg tggtcgcttt ttttagtaat aaaaaattgt attatgacgt cctatctgtt
     3721 accaatgtta taatatgaaa tgaaataaag cagtcatggt agcagtggct gtttgaaata
     3781 aagatacagt aactagggaa aaaaaaaa
//
Callouts: ORIGIN indicates beginning of sequence data; // marks end of record
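As a rough illustration of how the sequence block is laid out, the following Python sketch strips the leading position numbers and spaces from the last two sequence lines shown on the slide. The helper name `parse_origin` is hypothetical; it is not part of any NCBI tool.

```python
from collections import Counter


def parse_origin(lines):
    """Join sequence lines, dropping position numbers and spaces.

    Stops at the '//' end-of-record marker.
    """
    chunks = []
    for line in lines:
        line = line.strip()
        if line == "//":
            break
        parts = line.split()
        if parts and parts[0].isdigit():
            parts = parts[1:]  # the leading number is a coordinate, not sequence
        chunks.append("".join(parts))
    return "".join(chunks)


last_lines = [
    "3721 accaatgtta taatatgaaa tgaaataaag cagtcatggt agcagtggct gtttgaaata",
    "3781 aagatacagt aactagggaa aaaaaaaa",
    "//",
]

tail = parse_origin(last_lines)
last_base = 3721 + len(tail) - 1   # coordinate of the final base
counts = Counter(tail)             # per-base counts for this fragment only
print(last_base, dict(counts))
```

The final coordinate comes out as 3808, consistent with the "bases 1 to 3808" span in the REFERENCE lines of the record.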
NCBI Derivative Sequence Databases: RefSeq
NCBI Reference Sequences
mRNAs and Proteins
- NM_: Curated mRNA
- NP_: Curated Protein
- XM_: Predicted Transcript
- XP_: Predicted Protein
Gene Records
- NG_: Reference Genomic Sequence
Assemblies
- NT_: Contig (Mouse and Human Genomes)
- NC_: Chromosome (Microbial Genomes)
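The prefix scheme above lends itself to a simple lookup. The sketch below is an illustration only: the function name `refseq_type` and the example accessions are made up, and the table contains just the prefixes listed on this slide.

```python
# Prefix table copied from the slide; nothing beyond it is assumed.
REFSEQ_PREFIXES = {
    "NM_": "curated mRNA",
    "NP_": "curated protein",
    "XM_": "predicted transcript",
    "XP_": "predicted protein",
    "NG_": "reference genomic sequence",
    "NT_": "contig (mouse and human genomes)",
    "NC_": "chromosome (microbial genomes)",
}


def refseq_type(accession: str):
    """Return the record type for a RefSeq-style accession, or None."""
    return REFSEQ_PREFIXES.get(accession[:3].upper())


# Hypothetical accessions, used only to exercise the lookup.
print(refseq_type("NM_000000"))
print(refseq_type("AF000000"))  # a GenBank-style accession has no RefSeq prefix
```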
Curated RefSeq Records: NM_, NP_
RefSeq Nucleotide
LOCUS       NM_ bp mRNA PRI 26-JUL-1999
DEFINITION  Homo sapiens cystic fibrosis transmembrane conductance regulator(CFTR) mRNA.
ACCESSION   NM_
COMMENT     REFSEQ: This reference sequence was derived from M
            PROVISIONAL RefSeq: This is a provisional reference sequence record that has not yet been subject to human review. The final curated reference sequence record may be somewhat different from this one.
RefSeq Protein
LOCUS       NP_ aa PRI 26-JUL-1999
DEFINITION  cystic fibrosis transmembrane conductance regulator.
ACCESSION   NP_
PID         g
VERSION     NP_ GI:
DBSOURCE    REFSEQ: accession NM_
Reviewed
            REFSEQ: This reference sequence was derived from M , M
            On Feb 17, 2000 this sequence version replaced gi:
            Summary: Cystic fibrosis transmembrane conductance regulator is member 7 of the ATP-binding cassete sub-family C. The protein functions as a chloride channel and controls the regulation of other transport pathways. Mutations in this gene cause the autosomal recessive disorder, cystic fibrosis (CF) and congenital bilateral aplasia of the vas deferens (CBAVD). Alternative splice variants have been described, many of which result from mutations in the CFTR gene.
            COMPLETENESS: full length.
Alignment Generated Transcripts: XM_, XP_
[Figure: alignment with a mismatch highlighted.]
LOCUS       XM_ bp mRNA PRI 16-NOV-2000
DEFINITION  Homo sapiens cystic fibrosis transmembrane conductance regulator, ATP-binding cassette (sub-family C, member 7) (CFTR), mRNA.
ACCESSION   XM_
VERSION     XM_ GI:
RefSeq Human Contig: NT_
LOCUS       NT_ bp DNA CON 16-NOV-2000
DEFINITION  Homo sapiens chromosome 7 working draft sequence segment, complete sequence.
ACCESSION   NT_
VERSION     NT_ GI:
KEYWORDS    HTG.
SOURCE      human.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1 (bases 1 to )
  AUTHORS   International Human Genome Project collaborators.
  TITLE     Toward the complete sequence of the human genome
  JOURNAL   Unpublished
COMMENT     GENOME ANNOTATION REFSEQ: NCBI contigs are derived from assembled genomic sequence data. They may include both draft and finished sequence.
            COMPLETENESS: not full length.
     mRNA            complement(join( , , , , , , , , , , , , , , , , , , , , , , , , , , ))
                     /partial
                     /gene="CFTR"
                     /product="cystic fibrosis transmembrane conductance regulator, ATP-binding cassette (sub-family C, member 7)"
                     /transcript_id="XM_ "
                     /db_xref="LocusID:1080"
                     /db_xref="MIM:602421"
                     /note="derived by automated computational analysis using gene prediction method: Acembly. Supporting evidence includes similarity to: 9 proteins, 1 mRNAs See details in AceView"
     gene            complement( )
                     /gene="CFTR"
                     /note="CF; MRP7; ABC35; ABCC7"
                     /db_xref="LocusID:1080"
CONTIG      join(AC : ,gap(100),AC : ,
            gap(100),AC : ,gap(100),
            complement(AC : ),gap(100),
            AC : ,AC : ,
            AC : ,AC : ,gap(100),
            AC : ,gap(100),AC : ,gap(100),
            complement(AC : ),gap(100),
            AC : ,gap(100),AC : ,
            gap(100),complement(AC : ),gap(100),
            complement(AC : ),gap(100),
            complement(AC : ),gap(100),
            AC : ,gap(100),
            complement(AC : ),gap(100),
            complement(AC : ),gap(100),
            complement(AC : ),gap(100),
            complement(AC : ),gap(100),
            complement(AC : ),gap(100),
            complement(AC : ),gap(100),
Callout: Reordering draft sequence
Map View of RefSeqs
[Figure: map view with NM_, XM_ and NT_ records labeled.]
RefSeq Genome Records: NG_
RefSeq Chromosomes: NC_
LOCUS       NC_ bp DNA circular BCT 02-OCT-2001
DEFINITION  Escherichia coli O157:H7, complete genome.
ACCESSION   NC_
VERSION     NC_ GI:
KEYWORDS    .
SOURCE      Escherichia coli O157:H7.
  ORGANISM  Escherichia coli O157:H7
            Bacteria; Proteobacteria; gamma subdivision; Enterobacteriaceae; Escherichia.
REFERENCE   1 (sites)
  AUTHORS   Makino,K., Yokoyama,K., Kubota,Y., Yutsudo,C.H., Kimura,S., Kurokawa,K., Ishii,K., Hattori,M., Tatsuno,I., Abe,H., Iida,T., Yamamoto,K., Ohnishi,M., Hayashi,T., Yasunaga,T., Honda,T., Sasakawa,C. and Shinagawa,H.
  TITLE     Complete nucleotide sequence of the prophage VT2-Sakai carrying the verotoxin 2 genes of the enterohemorrhagic Escherichia coli O157:H7 derived from the Sakai outbreak
  JOURNAL   Genes Genet. Syst. 74 (5), (1999)
  MEDLINE
  PUBMED
Other NCBI Derivative Databases
- UniGene: gene-oriented expressed sequence clusters
- LocusLink: central resource and interface for known genes
NCBI NCBI Homepage
NCBI Homepage: Entrez, Similarity Searching, Mendelian Inheritance in Man
Using Entrez: An integrated database search and retrieval system
Entrez: Neighboring and Hard Links
[Figure: Entrez databases (PubMed abstracts, Nucleotide sequences, Protein sequences, 3-D Structure (MMDB), Genomes, Taxonomy) joined by hard links and by neighbors computed with Word weight, BLAST, VAST and Phylogeny.]
WWW Entrez
- PubMed: all of MEDLINE plus others; abstracts; links to online journals
- Nucleotides: GenBank, EMBL, DDBJ, RefSeq, PDB
- Proteins: GenBank, DDBJ, EMBL translations; PDB, PIR, SWISS-PROT, PRF, RefSeq
- Structures: NCBI’s MMDB, derived from PDB
- Genomes: reference genomes with graphical views, assembled sequence and mapping data
Database Searching with Entrez
- Using limits and field restriction to find mouse GAPD
- Linking and neighboring with mouse GAPD
Entrez Nucleotides: search for "Mouse"
Document Summaries: Mouse[All Fields]
- About 3 million records, and some hits are chicken, not mouse!?
Entrez Nucleotides: Limits: Preview/Index (search term: Mouse)
Entrez Nucleotides: Limits (search term: Mouse)
- Field Restriction: Accession, All Fields, Author Name, EC/RN Number, Feature key, Filter, Gene Name, Issue, Journal Name, Keyword, Modification Date, Organism, Page Number, Primary Accession, Properties, Protein Name, Publication Date, SeqID String, Sequence Length, Substance Name, Text Word, Title Word, Uid, Volume
- Only From: RefSeq, GenBank, EMBL, DDBJ
- Exclude unwanted categories of sequences
- Molecule: Genomic DNA/RNA, mRNA, rRNA
- Gene Location: Genomic DNA/RNA, Mitochondrion, Chloroplast
Entrez Nucleotides: Limits: Organism (search term: Mouse)
Document Summaries: Mouse[Organism]
- 2,976,070 [All Fields] - 2,921,009 [Organism] = 55,061 records that mention "mouse" somewhere but are not mouse sequences
Exclude Bulk Sequences, mRNA
Adding Terms: Preview/Index
- Term added: glyceraldehyde 3 phosphate dehydrogenase
- Fields: Accession, All Fields, Author Name, EC/RN Number, Feature key, Filter, Gene Name, Issue, Journal Name, Keyword, Modification Date, Organism, Page Number, Primary Accession, Properties, Protein Name, Publication Date, SeqID String, Sequence Length, Substance Name, Text Word, Title Word, Uid, Volume
- Search History
Mouse GAPD Records
Displaying Mouse GAPD Records
- Formats: Summary, Brief, GenBank, ASN.1, FASTA, GI list
- Links and neighbors (related records): LinkOut, PubMed Links, Protein Links, Nucleotide Neighbors, PopSet Links, Structure Links, Genome Links, Taxonomy Links, OMIM Links
Entrez GenBank / GenPept
FASTA Format
>gi|193425|gb|M |MUSGAPDS Mus musculus testis-specific isoform of glycerald
GGCAGCCAGGCCATGAGATCTTAGGCCATGTCGAGACGTGACGTGGTCCTTACCAATGTTACTGTTGTCC
AGCTACGGCGGGACCGATGCCCATGCCCATGCCCATGCCCATGTCCATGCCCATGCCCTGTGATCAGACC
ACCTCCACCCAAGCTTGAGGATCCACCACCCACGGTTGAAGAACAGCCACCGCCACCGCCGCCGCCACCT
CCACCTCCACCACCACCTCCTCCTCCTCCTCCACCCCAGATAGAGCCAGACAAGTTTGAAGAGGCTCCCC
CTCCCCCTCCCCCTCCTCCTCCTCCTCCCCCTCCCCCTCCTCCACCACTCCAAAAGCCAGCTAGAGAGCT
GACAGTGGGTATCAATGGATTTGGACGCATTGGTCGTCTGGTGCTGCGAGTCTGCATGGAGAAGGGCATT
AGGGTGGTAGCAGTGAATGACCCATTCATTGATCCAGAATACATGGTTTACATGTTCAAATATGACTCCA
CACATGGTAGATACAAAGGAAACGTGGAACATAAGAATGGACAACTAGTTGTGGACAACCTTGAGATCAA
CACGTACCAGTGCAAAGACCCTAAAGAAATCCCCTGGAGCTCTATAGGGAATCCCTACGTGGTGGAGTGT
ACAGGCGTCTATCTGTCCATCGAGGCAGCTTCGGCACATATTTCATCTGGTGCCAGGCGTGTGGTGGTCA
CTGCACCCTCCCCCGATGCACCCATGTTTGTCATGGGAGTGAACGAGAAGGACTATAACCCTGGCTCTAT
GACCATTGTCAGCAATGCATCCTGTACCACCAACTGCCTGGCTCCTCTCGCCAAGGTTATTCATGAAAAC
TTCGGGATCGTGGAAGGGCTAATGACCACAGTCCATTCCTACACAGCCACTCAGAAGACAGTGGATGGGC
CATCAAAGAAGGACTGGCGAGGTGGCCGCGGCGCTCACCAAAACATCATCCCATCGTCCACTGGGGCTGC
CAAGGCTGTAGGCAAAGTCATCCCAGAGCTCAAAGGGAAGCTAACAGGAATGGCATTCCGGGTGCCAACC
CCAAACGTGTCAGTTGTGGACCTGACCTGCCGCCTGGCCAAGCCTGCTTCTTACTCGGCTATCACGGAGG
CTGTGAAAGCTGCAGCCAAGGGACCTTTGGCTGGCATCCTTGCTTACACAGAGGACCAGGTGGTCTCCAC
GGACTTTAACGGCAATCCCCATTCTTCCATCTTTGATGCTAAGGCTGGAATTGCCCTCAATGACAACTTC
GTGAAGCTTGTTGCCTGGTACGACAACGAATATGGCTACAGTAACCGAGTGGTCGACCTCCTCCGCTACA
TGTTTAGCCGAGAGAAGTAACACAAAAGGCCCCTCCTTGCTCCCCTGCGCACCTCGCGTTCCTGACTTCG
GCTTCCACTCAAAGGCGCCGCCACCGGGTCAACAATGAAATAAAAACGAGAATGCGC
FASTA Definition Line: >gi|193425|gb|M |MUSGAPDS
- gi number: 193425
- Database identifier: gb
- Accession number: M
- Locus Name: MUSGAPDS
Database Identifiers
- gb: GenBank
- emb: EMBL
- dbj: DDBJ
- sp: SWISS-PROT
- pdb: Protein Databank
- pir: PIR
- prf: PRF
- ref: RefSeq
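The definition-line layout described on this slide can be unpacked with a short Python sketch. It is illustrative only: `parse_defline` is a hypothetical helper, it assumes the five-field `gi|number|db|accession|locus` form shown here, and the accession `M00000` in the example is a placeholder because the real digits are missing from the slide text.

```python
# Database tags as listed on the slide.
DB_TAGS = {
    "gb": "GenBank", "emb": "EMBL", "dbj": "DDBJ", "sp": "SWISS-PROT",
    "pdb": "Protein Databank", "pir": "PIR", "prf": "PRF", "ref": "RefSeq",
}


def parse_defline(defline: str) -> dict:
    """Split a '>gi|number|db|accession|locus title' line into its fields."""
    if not defline.startswith(">"):
        raise ValueError("a FASTA definition line starts with '>'")
    ident, _, title = defline[1:].partition(" ")
    fields = ident.split("|")
    if len(fields) < 5 or fields[0] != "gi":
        raise ValueError("expected gi|number|db|accession|locus")
    _, gi, db, accession, locus = fields[:5]
    return {
        "gi": int(gi),
        "database": DB_TAGS.get(db, db),  # fall back to the raw tag
        "accession": accession,
        "locus": locus,
        "title": title,
    }


# Placeholder accession; gi number, tag and locus name are from the slide.
record = parse_defline(
    ">gi|193425|gb|M00000|MUSGAPDS Mus musculus testis-specific isoform"
)
print(record)
```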
Abstract Syntax Notation: ASN.1
Seq-entry ::= set {
  level 1,
  class nuc-prot,
  descr {
    title "Mus musculus testis-specific isoform of glyceraldehyde 3-phosphate dehydrogenase (Gapd-S) mRNA, and translated products",
    update-date std { year 1994, month 11, day 9 },
    source {
      org {
        taxname "Mus musculus",
        common "house mouse",
        db { { db "taxon", tag id } },
[Figure: ASN.1 as the source of the FASTA Nucleotide, FASTA Protein, GenPept and GenBank formats.]
NCBI Toolbox
/*****************************************************************************
*
* asn2ff.c
* convert an ASN.1 entry to flat file format, using the FFPrintArrayPtrs.
*
*****************************************************************************/
#include
#include "asn2ff.h"
#include "asn2ffp.h"
#include "ffprint.h"
#include
#ifdef ENABLE_ID1
#include
#endif
FILE *fpl;
Args myargs[] = {
  {"Filename for asn.1 input","stdin",NULL,NULL,TRUE,'a',ARG_FILE_IN,0.0,0,NULL},
  {"Input is a Seq-entry","F", NULL,NULL,TRUE,'e',ARG_BOOLEAN,0.0,0,NULL},
  {"Input asnfile in binary mode","F",NULL,NULL,TRUE,'b',ARG_BOOLEAN,0.0,0,NULL},
  {"Output Filename","stdout", NULL,NULL,TRUE,'o',ARG_FILE_OUT,0.0,0,NULL},
  {"Show Sequence?","T", NULL,NULL,TRUE,'h',ARG_BOOLEAN,0.0,0,NULL},
Toolbox Sources
  ftp> open ncbi.nlm.nih.gov.
  ftp> cd toolbox
  ftp> cd ncbi_tools
ftp://ncbi.nlm.nih.gov/toolbox/ncbi_tools
Related Proteins: Protein Neighbors, Structure Links, Cn3D GAPD Structure
Advanced Neighbors: BLink
BLink
PubMed Link
Online Books
Entrez Structures: Molecular Modeling Database (MMDB) and Cn3D
MMDB: Molecular Modeling Data Base
- Derived from experimentally determined PDB records
- Value added to PDB records including:
  - Addition of explicit chemical graph information
  - Validation
  - Inclusion of Taxonomy, Citation, and other information
  - Conversion to parseable ASN.1 data description language
- Structure neighbors determined by Vector Alignment Search Tool (VAST)
Searching MMDB: 1CET
Structure Summary: Cn3D viewer, VAST neighbors, BLAST neighbors
Cn3D: Displaying Structures (Chloroquine)
Structure Neighbors
Structural Alignments: Chloroquine, NADH
Why do we need similarity searching?
- Identification and annotation
  - Incomplete or no annotations (GenBank)
  - Incorrectly annotated sequences
- Evolutionary relationships
  - Homologous molecules may have similar functions, but it ain’t necessarily so!
Basic Local Alignment Search Tool
- Widely used similarity search tool
- Heuristic approach based on the Smith-Waterman algorithm
- Finds best local alignments
- Provides statistical significance
- All combinations (DNA/Protein) of query and database
  - DNA vs DNA
  - DNA translation vs Protein
  - Protein vs Protein
  - Protein vs DNA translation
  - DNA translation vs DNA translation
- WWW, server, standalone, and network clients
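The five query/database combinations listed on this slide correspond to the standard BLAST program names, which the slide itself does not spell out. The lookup below is a small sketch; the function name `pick_program` and the string labels used as keys are assumptions made for illustration.

```python
# (query type, database type) -> BLAST program name
BLAST_PROGRAMS = {
    ("dna", "dna"): "blastn",
    ("translated dna", "protein"): "blastx",
    ("protein", "protein"): "blastp",
    ("protein", "translated dna"): "tblastn",
    ("translated dna", "translated dna"): "tblastx",
}


def pick_program(query: str, database: str) -> str:
    """Return the program for a query/database pair from the slide's list."""
    key = (query.lower(), database.lower())
    if key not in BLAST_PROGRAMS:
        raise ValueError(f"no BLAST program for {query!r} vs {database!r}")
    return BLAST_PROGRAMS[key]


print(pick_program("protein", "translated DNA"))
```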
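Local Alignment Statistics
- High scores of local alignments between two random sequences follow the Extreme Value Distribution
- For ungapped alignments, the expected number of alignments with score $S$ or greater is
  $E = K m n e^{-\lambda S}$ or, equivalently, $E = m n 2^{-S'}$
  - $m$, $n$ = lengths of the two sequences being compared (query and database)
  - $K$ = scale for search space
  - $\lambda$ = scale for scoring system
  - $S'$ = bit score = $(\lambda S - \ln K)/\ln 2$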
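The two forms of the expectation formula on this slide give the same value, which can be checked numerically. In the sketch below the function names are hypothetical, and the values chosen for lambda, K, m, n and S are arbitrary illustrative numbers rather than parameters taken from the slide.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """S' = (lambda * S - ln K) / ln 2"""
    return (lam * raw_score - math.log(k)) / math.log(2)


def evalue_from_raw(raw_score: float, m: int, n: int, lam: float, k: float) -> float:
    """E = K * m * n * exp(-lambda * S)"""
    return k * m * n * math.exp(-lam * raw_score)


def evalue_from_bits(bits: float, m: int, n: int) -> float:
    """E = m * n * 2 ** (-S')"""
    return m * n * 2.0 ** (-bits)


# Illustrative values only (assumed, not from the slide).
lam, k = 0.318, 0.134
m, n = 250, 1_000_000
raw = 100

bits = bit_score(raw, lam, k)
print(bits, evalue_from_raw(raw, m, n, lam, k), evalue_from_bits(bits, m, n))
```

The bit score folds K and lambda into the score itself, so bit scores from searches with different scoring systems can be compared directly.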
Scoring Systems
- Nucleic acids: identity matrix
- Proteins
  - Position Independent Matrices
    - PAM Matrices (Percent Accepted Mutation)
      - Implicit model of evolution
      - Higher PAM numbers all calculated from PAM1
      - PAM250 widely used
    - BLOSUM Matrices (BLOck SUbstitution Matrices)
      - Empirically determined from alignment of conserved blocks
      - Each includes information up to a certain level of identity
      - BLOSUM62 widely used
  - Position Specific Score Matrices (PSSM)
    - PSI and RPS BLAST
BLOSUM62
[Table: lower-triangular substitution matrix over the residues A R N D C Q E G H I L K M F P S T W Y V X; only the first entries survive: A/A = 4, R/A = -1, R/R = 5.]
- Common amino acids have low weights
- Rare amino acids have high weights
- Negative for less likely substitutions
- Positive for more likely substitutions
Position Specific Substitution Rates
[Figure: an active site serine compared with a typical serine.]
Position Specific Score Matrix (PSSM)
[Table: columns A R N D C Q E G H I L K M F P S T W Y V; rows for the positions starting at 206, residues D G V I S S C N G D S G G P L N C Q A; scores not recoverable.]
- Serine scored differently in these two positions
- Active site nucleophile
Gapped Alignments
- Gapping provides more biologically realistic alignments
- Statistical behavior not completely understood for gapped alignments
- Gapped BLAST parameters must be found by simulations for each matrix
- Affine gap costs: a gap of length k scores -(a + bk)
  - a = gap open penalty
  - b = gap extend penalty
  - A gap of length 1 receives the score -(a + b)
Intermission
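The affine gap cost on this slide is a one-line function. The sketch below uses a hypothetical name, `affine_gap_score`, and the penalties a = 11 and b = 1 are chosen purely as example values.

```python
def affine_gap_score(length: int, gap_open: int, gap_extend: int) -> int:
    """Score of one gap of the given length: -(a + b*k)."""
    if length < 1:
        raise ValueError("a gap has length 1 or more")
    return -(gap_open + gap_extend * length)


# Example penalties (assumed for illustration): a = 11, b = 1.
for k in (1, 2, 5):
    print(k, affine_gap_score(k, gap_open=11, gap_extend=1))
```

With these example values a single long gap costs less than several short ones covering the same number of positions, since the open penalty is paid once per gap.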