Published by Amos Bradley. Modified over 9 years ago.
1
NCBI
NCBI Molecular Biology Resources: A Field Guide
Nov. 6, 2001
2
NCBI Resources
- About NCBI
- NCBI Sequence Databases
    Primary Database: GenBank
    Derivative Databases: RefSeq
- Entrez Databases and Text Searching
- BLAST Services
- Genomic Resources
3
The National Center for Biotechnology Information (NCBI)
- Created as a part of the National Library of Medicine in 1988
    Establish public databases
    Research in computational biology
    Develop software tools for sequence analysis
    Disseminate biomedical information
- Tools: BLAST (1990), Entrez (1992)
- GenBank (1992)
- Free MEDLINE (PubMed, 1997)
- Other databases: dbEST, dbGSS, dbSTS, MMDB, OMIM, UniGene, GeneMap, Taxonomy, CGAP, SAGE, LocusLink, RefSeq
4
Molecular Databases
- Primary Databases
    Original submissions by experimentalists
    Database staff organize but don't add additional information
    Example: GenBank
- Derivative Databases
    Human curated: compilation and correction of data
      Example: SWISS-PROT, NCBI RefSeq mRNA
    Computationally derived
      Example: UniGene
    Combinations
      Example: NCBI Genome Assembly
5
What is GenBank? NCBI's Primary Sequence Database
- Nucleotide-only sequence database
- Archival in nature
- GenBank data
    Direct submissions of individual records (BankIt, Sequin)
    Batch submissions via email (EST, GSS, STS)
    FTP accounts for sequencing centers
- Data shared nightly among three collaborating databases
    GenBank
    DNA Data Bank of Japan (DDBJ)
    European Molecular Biology Laboratory Database (EMBL) at EBI
6
[Figure: data exchange among the three collaborating databases. GenBank (NCBI, NIH; retrieval via Entrez), EMBL (EBI; retrieval via SRS), and DDBJ (CIB, NIG; retrieval via getentry) each receive submissions and updates.]
7
NCBI
8
GenBank
- Full release every two months
- Incremental and cumulative updates daily
- Available only through the internet:
    ftp://ncbi.nlm.nih.gov/genbank/ or ftp://genbank.sdsc.edu/pub/

Release 126, October 2001
      13,602,262  Records
  14,396,883,064  Nucleotides
         80,000+  Species
9
GenBank on FTP site
  ftp> open ftp.ncbi.nlm.nih.gov.
  ftp> cd genbank
Release 125: 243 files; 55.23 gigabytes uncompressed
10
GenBank Divisions

Traditional Divisions
  BCT  INV  MAM  PHG  PLN  PRI  ROD  SYN  UNA  VRL  VRT

Bulk Sequence Divisions
  PAT  Patent
  EST  Expressed Sequence Tags (133 files)
  STS  Sequence Tagged Site
  GSS  Genome Survey Sequence (41 files)
  HTG  High Throughput Genome (25 files)
  HTC  High Throughput cDNA
  CON  Contig
11
EST Division: Expressed Sequence Tags
[Figure: nucleus with 30,000 genes; 80-100,000 RNA gene products; make cDNA library; 80-100,000 unique cDNA clones in library]
- isolate unique clones
- sequence once from each end (5' and 3')

>IMAGE:275615 3', mRNA sequence
NNTCAAGTTTTATGATTTATTTAACTTGTGGAACAAAAATAAACCAGATTAACCACAACCATGCCTTACT
TTATCAAATGTATAAGANGTAAATATGAATCTTATATGACAAAATGTTTCATTCATTATAACAAATTTCC
AATAATCCTGTCAATNATATTTCTAAATTTTCCCCCAAATTCTAAGCAGAGTATGTAAATTGGAAGTTAA
CTTATGCACGCTTAACTATCTTAACAAGCTTTGAGTGCAAGAGATTGANGAGTTCAAATCTGACCAAGAT
GTTGATGTTGGATAAGAGAATTCTCTGCTCCCCACCTCTANGTTGCCAGCCCTC

>IMAGE:275615 5' mRNA sequence
GACAGCATTCGGGCCGAGATGTCTCGCTCCGTGGCCTTAGCTGTGCTCGCGCTACTCTCTCTTTCTGGCC
TGGAGGTATCCAGCGTACTCCAAAGATTCAGGTTTACTCACGTCATCCAGCAGAGAATGGAAAGTCAAAT
TTCCTGAATTGCTATGTGTCTGGGTTTCATCCATCCGACATTGAAGTTGACTTACTGAAGAATGGAGAGA
GAATTGAAAAAGTGGAGCATTCAGACTTGTCTTTCAGCAAGGACTGGTCTTTCTATCTCTTGTACTACAC
TGAATTCACCCCCACTGAAAAAGATGAGTATGCCTGCCGTGTTGAACCATGTNGACTTTGTCACAGNCCC
AAGTTNAGTTTAAGTGGGNATCGAGACATGTAAGGCAGGCATCATGGGAGGTTTTGAAGNATGCCGCNTT
TTGGATTGGGATGAATTCCAAATTTCTGGTTTGCTTGNTTTTTTAATATTGGATATGCTTTTG
12
STS Division: Sequence Tagged Sites
- Segment of gene, EST, mRNA or genomic DNA of known position (microsatellite)
- PCR with STS primers gives unique product (one per genome)
- Basis of Radiation Hybrid Mapping
    UniGene
    Genome Assembly
- Related resource: Electronic PCR
    http://www.ncbi.nlm.nih.gov/genome/sts/epcr.cgi
13
RH mapping using STSs
[Figure: human chromosome with markers A, B, C, D; hybrid cells retaining different subsets of the markers; PCR results scored + or - for each marker in each hybrid.]
14
ePCR Results

Hexokinase 1 EST SHGC-35892
dbSTS id: 44155, GenBank Accession: G29974
Organism: Homo sapiens
Primer1: CATACGACACGGCTCACAAA
Primer2: CTGTTTGTCTCGTGGGGG
STS location: 30..160
Chromosome: 10
Expected amplicon size: 129, Observed amplicon size: 130
Primers match in forward orientation

Query sequence:
    1 TTTTTGAATT GGTACAAAGT TTACTAGGTC ATACGACACG GCTCACAAAG CGGTGGGAAA
   61 TTCCAGTGAT GGCATTGTTT GTTGGTTGGT TCCTTTTATC CAAATGGAGA CAAGACACAT
  121 TTCCGCAGAC GTGTCCACCT CCCCCCACGA GACAAACAGA ATGCAAGACT GTCACACGCG
  181 GCTAGGACTG GTTCCACGGA CACACGATTT TGTGGCATTG ACACACCACG ATGCGATGCC
  241 AGGCCACAGT GGGTGCCAGG AGGGGAGGAA GCAGCTAATG CTATGCCCAC ACTCGCCTTC
  301 AGCATGTGCC CCGGGAGGAG GCCCGGCAGT GTCTGCTGGT GATAATACAT TTCACACGGG
  361 GAGGGGGAAC CAAGGATGAG CTTTGGAGGC CAGAAGGCTG TCAGGTGGTG TG
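The ePCR hit on this slide can be roughly reproduced by plain string matching. The sketch below is not the e-PCR program itself (which tolerates mismatches and size ranges); it only looks for an exact copy of Primer1 and an exact reverse complement of Primer2 in the first 180 bases of the query. The function names are hypothetical.

```python
# Minimal exact-match sketch of an electronic PCR lookup.
# Assumption: both primers match perfectly; real e-PCR is more tolerant.

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A, C, G, T, N)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def find_amplicon(template: str, primer1: str, primer2: str):
    """Return (start, end, size) in 1-based inclusive coordinates, or None."""
    t = template.upper()
    start = t.find(primer1.upper())            # forward primer, 0-based
    if start < 0:
        return None
    rc = reverse_complement(primer2)           # reverse primer as seen on the top strand
    hit = t.find(rc, start)
    if hit < 0:
        return None
    end = hit + len(rc)                        # 0-based exclusive == 1-based inclusive
    return start + 1, end, end - start


# First three lines (180 bases) of the query sequence shown on the slide.
QUERY = (
    "TTTTTGAATTGGTACAAAGTTTACTAGGTCATACGACACGGCTCACAAAGCGGTGGGAAA"
    "TTCCAGTGATGGCATTGTTTGTTGGTTGGTTCCTTTTATCCAAATGGAGACAAGACACAT"
    "TTCCGCAGACGTGTCCACCTCCCCCCACGAGACAAACAGAATGCAAGACTGTCACACGCG"
)
PRIMER1 = "CATACGACACGGCTCACAAA"
PRIMER2 = "CTGTTTGTCTCGTGGGGG"

print(find_amplicon(QUERY, PRIMER1, PRIMER2))
```

With exact matching the product spans bases 30 to 159, which is 130 bp and agrees with the observed amplicon size on the slide. The slide reports the STS location as 30..160; the coordinate convention behind that end position is not stated in the source.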
15
Genome Sequencing
[Figure labels: Whole BAC insert (or genome); sonication; cloning; isolating; sequencing; assembly; Draft Sequence (HTG division); GSS division]
16
GSS Division: Genome Survey Sequences
Genomic equivalent of ESTs
- BAC and other first pass surveys
- BAC end sequences
- Whole Genome Shotgun (some)
- RAPIDS and other anonymous loci
[Figure: genomic clone (BAC) with T7 end and SP6 end]
17
HTG Division: High Throughput Genome Records
Records of 40,000 to > 350,000 bp

  Phase     Division   Accession   gi
  phase 1   HTG        AC008701    6601005
  phase 2   HTG        AC008701    6671909
  phase 3   PRI        AC008701    7328720
18
NCBI The GenBank Record
19
A Simple GenBank Record

LOCUS       AF062069     3808 bp    mRNA            INV       02-MAR-2000
DEFINITION  Limulus polyphemus myosin III mRNA, complete cds.
ACCESSION   AF062069
VERSION     AF062069.2  GI:7144484
KEYWORDS    .
SOURCE      Atlantic horseshoe crab.
  ORGANISM  Limulus polyphemus
            Eukaryota; Metazoa; Arthropoda; Chelicerata; Merostomata;
            Xiphosura; Limulidae; Limulus.
REFERENCE   1  (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     A myosin III from Limulus eyes is a clock-regulated
            phosphoprotein
  JOURNAL   J. Neurosci. (1998) In press
REFERENCE   2  (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     Direct Submission
  JOURNAL   Submitted (29-APR-1998) Whitney Laboratory, University of
            Florida, 9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA
REFERENCE   3  (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     Direct Submission
  JOURNAL   Submitted (02-MAR-2000) Whitney Laboratory, University of
            Florida, 9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA
  REMARK    Sequence update by submitter
COMMENT     On Mar 2, 2000 this sequence version replaced gi:3132700.
20
GenBank Record, cont.

FEATURES             Location/Qualifiers
     source          1..3808
                     /organism="Limulus polyphemus"
                     /db_xref="taxon:6850"
                     /tissue_type="lateral eye"
     CDS             258..3302
                     /note="N-terminal protein kinase domain; C-terminal
                     myosin heavy chain head; substrate for PKA"
                     /codon_start=1
                     /product="myosin III"
                     /protein_id="AAC16332.2"
                     /db_xref="GI:7144485"
                     /translation="MEYKCISEHLPFETLPDPGDRFEVQELVGTGTYATVYSAIDKQA
                     NKKVALKIIGHIAENLLDIETEYRIYKAVNGIQFFPEFRGAFFKRGERESDNEVWLGI
                     EFLEEGTAADLLATHRRFGIHLKEDLIALIIKEVVRAVQYLHENSIIHRDIRAANIMF
                     SKEGYVKLIDFGLSASVKNTNGKAQSSVGSPYWMAPEVISCDCLQEPYNYTCDVWSIG
                     ITAIELADTVPSLSDIHALRAMFRINRNPPPSVKRETRWSETLKDFISECLVKNPEYR
                     PCIQEIPQHPFLAQVEGKEDQLRSELVDILKKNPGEKLRNKPYNVTFKNGHLKTISGQ
BASE COUNT     1201 a    689 c    782 g   1136 t
ORIGIN
        1 tcgacatctg tggtcgcttt ttttagtaat aaaaaattgt attatgacgt cctatctgtt
     3781 aagatacagt aactagggaa aaaaaaaa
//
21
Sequence and Database Identifiers: Locus, accession, gi, version

LOCUS       AF062069     3808 bp    mRNA            INV       02-MAR-2000
  Locus Name:         AF062069
  Sequence length:    3808 bp
  mol-type:           mRNA (= cDNA); other values: rRNA, snRNA, DNA
  GB Division:        INV
  Modification Date:  02-MAR-2000
DEFINITION  Limulus polyphemus myosin III mRNA, complete cds.
  DEF line (Title)
ACCESSION   AF062069
  Accession Number
VERSION     AF062069.2  GI:7144484
  Accession.version and gi number
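The fields annotated on this slide can be pulled out of the LOCUS and VERSION lines with a few lines of code. This is a minimal sketch that assumes the whitespace-separated layout of this particular record (seven tokens on the LOCUS line). The actual flat-file format is column-based and may carry extra tokens such as a topology (for example `circular`), which this sketch rejects. The names `Locus`, `parse_locus`, and `parse_version` are hypothetical.

```python
from typing import NamedTuple, Tuple


class Locus(NamedTuple):
    name: str        # Locus Name
    length: int      # Sequence length in bp
    mol_type: str    # mol-type, e.g. mRNA
    division: str    # GB Division, e.g. INV
    date: str        # Modification Date


def parse_locus(line: str) -> Locus:
    """Parse a LOCUS line laid out like the one on the slide."""
    fields = line.split()
    if len(fields) != 7 or fields[0] != "LOCUS" or fields[3] != "bp":
        raise ValueError(f"unexpected LOCUS layout: {line!r}")
    return Locus(fields[1], int(fields[2]), fields[4], fields[5], fields[6])


def parse_version(line: str) -> Tuple[str, int, int]:
    """Split 'VERSION AF062069.2 GI:7144484' into (accession, version, gi)."""
    keyword, acc_ver, gi_field = line.split()
    if keyword != "VERSION":
        raise ValueError(f"not a VERSION line: {line!r}")
    accession, version = acc_ver.split(".")
    return accession, int(version), int(gi_field.split(":")[1])


print(parse_locus("LOCUS AF062069 3808 bp mRNA INV 02-MAR-2000"))
print(parse_version("VERSION AF062069.2 GI:7144484"))
```

The accession stays fixed across updates, while the version suffix and the gi number change when the sequence changes, which is why both are kept separately here.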
22
Keywords, Source-organism

KEYWORDS    .
  Legacy field (exception: EST, GSS, HTG)
SOURCE      Atlantic horseshoe crab.
  Accepted common name
  ORGANISM  Limulus polyphemus
  Scientific name
            Eukaryota; Metazoa; Arthropoda; Chelicerata; Merostomata;
            Xiphosura; Limulidae; Limulus.
  Taxonomic lineage according to GenBank
23
[Citation: Article]
REFERENCE   1  (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     A myosin III from Limulus eyes is a clock-regulated
            phosphoprotein
  JOURNAL   J. Neurosci. (1998) In press

[Submitter Block]
REFERENCE   2  (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     Direct Submission
  JOURNAL   Submitted (29-APR-1998) Whitney Laboratory, University of
            Florida, 9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA

[Update history]
REFERENCE   3  (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     Direct Submission
  JOURNAL   Submitted (02-MAR-2000) Whitney Laboratory, University of
            Florida, 9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA
  REMARK    Sequence update by submitter

[Previous version]
COMMENT     On Mar 2, 2000 this sequence version replaced gi:3132700.
24
Feature Table

FEATURES             Location/Qualifiers
     source          1..3808
                     /organism="Limulus polyphemus"
                     /db_xref="taxon:6850"
                     /tissue_type="lateral eye"
     CDS             258..3302
                     /note="N-terminal protein kinase domain; C-terminal
                     myosin heavy chain head; substrate for PKA"
                     /codon_start=1
                     /product="myosin III"
                     /protein_id="AAC16332.2"
                     /db_xref="GI:7144485"
                     /translation="MEYKCISEHLPFETLPDPGDRFEVQELVGTGTYATVYSAIDK
                     NKKVALKIIGHIAENLLDIETEYRIYKAVNGIQFFPEFRGAFFKRGERESDNEVWL "

Slide labels: Biosource (source); Coding Sequence (CDS); Reading Frame (/codon_start); Protein Identifiers (/protein_id, /db_xref GI); GenPept (/translation)
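The CDS location and the /codon_start qualifier together fix the reading frame. As a small worked sketch (the function name `cds_codons` is hypothetical, and only simple `start..end` locations are handled), the number of codons covered by `258..3302` can be computed as follows:

```python
def cds_codons(location: str, codon_start: int = 1):
    """Return (span_length, whole_codons, leftover_bases) for 'start..end'.

    Coordinates are 1-based and inclusive, as in the feature table.
    codon_start is the position (1, 2 or 3) of the first base of the
    first complete codon within the feature.
    """
    start_text, end_text = location.split("..")
    start, end = int(start_text), int(end_text)
    span = end - start + 1
    coding = span - (codon_start - 1)
    return span, coding // 3, coding % 3


print(cds_codons("258..3302", codon_start=1))
```

For this record the span is 3045 bases, which divides evenly into 1015 codons. If the last codon is the stop codon, as is usual for a complete cds, the translation would have 1014 residues; the slide shows only part of the translation, so that count is not confirmed here.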
25
Sequence

BASE COUNT     1201 a    689 c    782 g   1136 t
ORIGIN
        1 tcgacatctg tggtcgcttt ttttagtaat aaaaaattgt attatgacgt cctatctgtt
     3721 accaatgtta taatatgaaa tgaaataaag cagtcatggt agcagtggct gtttgaaata
     3781 aagatacagt aactagggaa aaaaaaaa
//

ORIGIN indicates beginning of sequence data
// marks the end of record
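The BASE COUNT line gives a quick consistency check against the sequence length on the LOCUS line. The sketch below (function name hypothetical) parses the counts, sums them, and derives the G+C fraction:

```python
import re


def parse_base_count(line: str) -> dict:
    """Turn 'BASE COUNT 1201 a 689 c 782 g 1136 t' into {'a': 1201, ...}."""
    pairs = re.findall(r"(\d+)\s+([acgt])\b", line)
    return {base: int(count) for count, base in pairs}


counts = parse_base_count("BASE COUNT 1201 a 689 c 782 g 1136 t")
total = sum(counts.values())
gc_fraction = (counts["g"] + counts["c"]) / total

print(counts, total, round(gc_fraction, 3))
```

The four counts add up to 3808, the length stated on the LOCUS line of this record.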
26
NCBI Derivative Sequence Databases: RefSeq
NCBI Reference Sequences

mRNAs and Proteins
  NM_123456  Curated mRNA
  NP_123456  Curated Protein
  XM_123456  Predicted Transcript
  XP_123456  Predicted Protein
Gene Records
  NG_123456  Reference Genomic Sequence
Assemblies
  NT_123456  Contig (Mouse and Human Genomes)
  NC_123455  Chromosome (Microbial Genomes)
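Because the RefSeq record type is encoded in the accession prefix, a lookup table is enough to classify an accession. This sketch covers only the seven prefixes listed on the slide; the names `REFSEQ_PREFIXES` and `classify_refseq` are hypothetical.

```python
from typing import Optional

# Prefix meanings exactly as listed on the slide.
REFSEQ_PREFIXES = {
    "NM": "curated mRNA",
    "NP": "curated protein",
    "XM": "predicted transcript",
    "XP": "predicted protein",
    "NG": "reference genomic sequence",
    "NT": "contig (mouse and human genomes)",
    "NC": "chromosome (microbial genomes)",
}


def classify_refseq(accession: str) -> Optional[str]:
    """Return the record type for a RefSeq accession, or None otherwise."""
    prefix, underscore, _rest = accession.partition("_")
    if not underscore:
        return None          # no underscore: not a RefSeq-style accession
    return REFSEQ_PREFIXES.get(prefix)


for acc in ("NM_000492", "XM_004980.3", "NT_007935", "AF062069"):
    print(acc, "->", classify_refseq(acc))
```

A GenBank accession such as AF062069 has no underscore and falls through to `None`, which is a simple way to tell primary records from RefSeq records.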
27
Curated RefSeq Records: NM_, NP_

RefSeq Nucleotide
LOCUS       NM_000492    6159 bp    mRNA            PRI       26-JUL-1999
DEFINITION  Homo sapiens cystic fibrosis transmembrane conductance
            regulator (CFTR) mRNA.
ACCESSION   NM_000492
COMMENT     REFSEQ: This reference sequence was derived from M55131.
            PROVISIONAL RefSeq: This is a provisional reference sequence
            record that has not yet been subject to human review. The final
            curated reference sequence record may be somewhat different
            from this one.

RefSeq Protein
LOCUS       NP_000483    1480 aa                    PRI       26-JUL-1999
DEFINITION  cystic fibrosis transmembrane conductance regulator.
ACCESSION   NP_000483
PID         g4502785
VERSION     NP_000483.1  GI:4502785
DBSOURCE    REFSEQ: accession NM_000492.1

Reviewed
COMMENT     REFSEQ: This reference sequence was derived from M28668.1,
            M55131.1.
            On Feb 17, 2000 this sequence version replaced gi:4502784.
            Summary: Cystic fibrosis transmembrane conductance regulator is
            member 7 of the ATP-binding cassette sub-family C. The protein
            functions as a chloride channel and controls the regulation of
            other transport pathways. Mutations in this gene cause the
            autosomal recessive disorder, cystic fibrosis (CF) and
            congenital bilateral aplasia of the vas deferens (CBAVD).
            Alternative splice variants have been described, many of which
            result from mutations in the CFTR gene.
            COMPLETENESS: full length.
28
Alignment Generated Transcripts: XM_, XP_

LOCUS       XM_004980    6128 bp    mRNA            PRI       16-NOV-2000
DEFINITION  Homo sapiens cystic fibrosis transmembrane conductance
            regulator, ATP-binding cassette (sub-family C, member 7)
            (CFTR), mRNA.
ACCESSION   XM_004980
VERSION     XM_004980.3  GI:13631444

[Figure label: mismatch]
29
RefSeq Human Contig: NT_

LOCUS       NT_007935 1888399 bp    DNA             CON       16-NOV-2000
DEFINITION  Homo sapiens chromosome 7 working draft sequence segment,
            complete sequence.
ACCESSION   NT_007935
VERSION     NT_007935.1  GI:11422165
KEYWORDS    HTG.
SOURCE      human.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata;
            Euteleostomi; Mammalia; Eutheria; Primates; Catarrhini;
            Hominidae; Homo.
REFERENCE   1  (bases 1 to 1888399)
  AUTHORS   International Human Genome Project collaborators.
  TITLE     Toward the complete sequence of the human genome
  JOURNAL   Unpublished
COMMENT     GENOME ANNOTATION REFSEQ: NCBI contigs are derived from
            assembled genomic sequence data. They may include both draft
            and finished sequence.
            COMPLETENESS: not full length.
     mRNA            complement(join(1255889..1257642,1258986..1259091,
                     1259690..1259862,1271619..1271708,1281957..1282112,
                     1296780..1297028,1309837..1309937,1312742..1312969,
                     1313881..1314031,1317797..1317876,1320768..1321018,
                     1321687..1321724,1329492..1329620,1331893..1332616,
                     1334111..1334197,1336717..1336811,1364895..1365086,
                     1375727..1375909,1382442..1382534,1384204..1384450,
                     1387877..1388002,1389139..1389302,1390185..1390274,
                     1393436..1393651,1415408..1415516,1420187..1420297,
                     1444403..1444587))
                     /partial
                     /gene="CFTR"
                     /product="cystic fibrosis transmembrane conductance
                     regulator, ATP-binding cassette (sub-family C, member 7)"
                     /transcript_id="XM_004980.1"
                     /db_xref="LocusID:1080"
                     /db_xref="MIM:602421"
                     /note="derived by automated computational analysis using
                     gene prediction method: Acembly.
                     Supporting evidence includes similarity to: 9 proteins,
                     1 mRNAs
                     See details in AceView"
     gene            complement(1255889..1444587)
                     /gene="CFTR"
                     /note="CF; MRP7; ABC35; ABCC7"
                     /db_xref="LocusID:1080"
CONTIG      join(AC073042.3:1155..2680,gap(100),AC074390.2:119526..151445,
            gap(100),AC074390.2:1..5245,gap(100),
            complement(AC074390.2:17705..23645),gap(100),
            AC074390.2:97658..119425,AC073042.3:106479..121155,
            AC074390.2:164226..165036,AC073042.3:70628..79503,gap(100),
            AC073042.3:4627..6382,gap(100),AC073042.3:2781..4526,gap(100),
            complement(AC073042.3:183627..209083),gap(100),
            AC073042.3:79604..88622,gap(100),AC073042.3:139234..160437,
            gap(100),complement(AC073042.3:6483..8319),gap(100),
            complement(AC073042.3:39354..45372),gap(100),
            complement(AC073042.3:21461..24064),gap(100),
            AC074390.2:156347..160294,gap(100),
            complement(AC074390.2:5346..10750),gap(100),
            complement(AC074390.2:153911..156246),gap(100),
            complement(AC074390.2:23746..32402),gap(100),
            complement(AC074390.2:151546..153810),gap(100),
            complement(AC074390.2:57277..75275),gap(100),
            complement(AC074390.2:75376..97557),gap(100),

Slide label: Reordering draft sequence
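The mRNA and gene features on this contig use `complement(...)` and `join(...)` locations. The sketch below reads such a location into a strand and a list of exon spans, using the first three exons of the CFTR mRNA as input. It is a simplification under stated assumptions: it handles only a single outer `complement(...)` and plain `start..end` spans, and it is not meant for the CONTIG line, whose segments carry accessions, per-segment complements, and `gap(100)` entries. The function names are hypothetical.

```python
import re
from typing import List, Tuple


def parse_location(location: str) -> Tuple[int, List[Tuple[int, int]]]:
    """Return (strand, spans) for 'complement(join(a..b,c..d))' style text.

    strand is -1 when the whole location is wrapped in complement(), else +1.
    spans are 1-based inclusive (start, end) pairs in the order written.
    """
    text = "".join(location.split())                 # drop line breaks and spaces
    strand = -1 if text.startswith("complement(") else 1
    spans = [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", text)]
    return strand, spans


def total_length(spans: List[Tuple[int, int]]) -> int:
    """Sum the lengths of inclusive spans."""
    return sum(end - start + 1 for start, end in spans)


strand, exons = parse_location(
    "complement(join(1255889..1257642,1258986..1259091, 1259690..1259862))"
)
print(strand, exons, total_length(exons))

gene_strand, gene_span = parse_location("complement(1255889..1444587)")
print(gene_strand, total_length(gene_span))
```

The three exons cover 2033 bases, while the gene feature spans 188,699 bases of the contig; most of the gene is intron.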
30
NCBI Map View of RefSeqs NM_ XM_ NT_
31
NCBI RefSeq Genome Records: NG_
32
RefSeq Chromosomes: NC_

LOCUS       NC_002695 5498450 bp    DNA   circular  BCT       02-OCT-2001
DEFINITION  Escherichia coli O157:H7, complete genome.
ACCESSION   NC_002695
VERSION     NC_002695.1  GI:15829254
KEYWORDS    .
SOURCE      Escherichia coli O157:H7.
  ORGANISM  Escherichia coli O157:H7
            Bacteria; Proteobacteria; gamma subdivision;
            Enterobacteriaceae; Escherichia.
REFERENCE   1  (sites)
  AUTHORS   Makino,K., Yokoyama,K., Kubota,Y., Yutsudo,C.H., Kimura,S.,
            Kurokawa,K., Ishii,K., Hattori,M., Tatsuno,I., Abe,H., Iida,T.,
            Yamamoto,K., Ohnishi,M., Hayashi,T., Yasunaga,T., Honda,T.,
            Sasakawa,C. and Shinagawa,H.
  TITLE     Complete nucleotide sequence of the prophage VT2-Sakai carrying
            the verotoxin 2 genes of the enterohemorrhagic Escherichia coli
            O157:H7 derived from the Sakai outbreak
  JOURNAL   Genes Genet. Syst. 74 (5), 227-239 (1999)
  MEDLINE   20198780
   PUBMED   10734605
33
Other NCBI Derivative Databases
- UniGene: gene oriented expressed sequence clusters
- LocusLink: central resource and interface for known genes
34
NCBI NCBI Homepage
35
NCBI Entrez Similarity Searching Mendelian Inheritance in Man NCBI Homepage
36
NCBI Using Entrez An integrated database search and retrieval system
37
Entrez: Neighboring and Hard Links
[Figure: linked Entrez databases: PubMed abstracts, Nucleotide sequences, Protein sequences, 3-D Structure (MMDB), Genomes, Taxonomy. Neighbor and link labels: Word weight, BLAST, VAST, Phylogeny.]
38
WWW Entrez
- All of MEDLINE plus others; abstracts; links to online journals
- GenBank, EMBL, DDBJ, RefSeq, PDB
- GenBank, DDBJ, EMBL translations; PDB, PIR, SWISS-PROT, PRF, RefSeq
- NCBI's MMDB, derived from PDB
- Reference genomes: graphical views, assembled sequence and mapping data
39
Database Searching with Entrez
- Using limits and field restriction to find mouse GAPD
- Linking and neighboring with mouse GAPD
40
NCBI Entrez Nucleotides Mouse
41
NCBI Document Summaries: Mouse[All Fields] Chicken not mouse !? 3 million records
42
NCBI Entrez Nucleotides: Limits: Preview/Index Mouse
43
Entrez Nucleotides: Limits (search term: Mouse)

Field Restriction: Accession, All Fields, Author Name, EC/RN Number, Feature key, Filter, Gene Name, Issue, Journal Name, Keyword, Modification Date, Organism, Page Number, Primary Accession, Properties, Protein Name, Publication Date, SeqID String, Sequence Length, Substance Name, Text Word, Title Word, Uid, Volume
Only From: RefSeq, GenBank, EMBL, DDBJ
Exclude unwanted categories of sequences
Molecule: Genomic DNA/RNA, mRNA, rRNA
Gene Location: Genomic DNA/RNA, Mitochondrion, Chloroplast
44
NCBI Entrez Nucleotides: Limits: Organism Mouse
45
Document Summaries: Mouse[Organism]

    2,976,070  Mouse[All Fields]
  - 2,921,009  Mouse[Organism]
  =    55,061
46
NCBI Exclude Bulk Sequences, mRNA
47
Adding Terms: Preview/Index
Term: glyceraldehyde 3 phosphate dehydrogenase
Fields: Accession, All Fields, Author Name, EC/RN Number, Feature key, Filter, Gene Name, Issue, Journal Name, Keyword, Modification Date, Organism, Page Number, Primary Accession, Properties, Protein Name, Publication Date, SeqID String, Sequence Length, Substance Name, Text Word, Title Word, Uid, Volume
Search History
48
NCBI Mouse GAPD Records
49
Displaying Mouse GAPD Records
Formats: Summary, Brief, GenBank, ASN.1, FASTA, GI list
Links and neighbors (related records): LinkOut, PubMed Links, Protein Links, Nucleotide Neighbors, PopSet Links, Structure Links, Genome Links, Taxonomy Links, OMIM Links
50
NCBI Entrez GenBank / GenPept GenPept
51
FASTA Format

>gi|193425|gb|M60978.1|MUSGAPDS Mus musculus testis-specific isoform of glycerald
GGCAGCCAGGCCATGAGATCTTAGGCCATGTCGAGACGTGACGTGGTCCTTACCAATGTTACTGTTGTCC
AGCTACGGCGGGACCGATGCCCATGCCCATGCCCATGCCCATGTCCATGCCCATGCCCTGTGATCAGACC
ACCTCCACCCAAGCTTGAGGATCCACCACCCACGGTTGAAGAACAGCCACCGCCACCGCCGCCGCCACCT
CCACCTCCACCACCACCTCCTCCTCCTCCTCCACCCCAGATAGAGCCAGACAAGTTTGAAGAGGCTCCCC
CTCCCCCTCCCCCTCCTCCTCCTCCTCCCCCTCCCCCTCCTCCACCACTCCAAAAGCCAGCTAGAGAGCT
GACAGTGGGTATCAATGGATTTGGACGCATTGGTCGTCTGGTGCTGCGAGTCTGCATGGAGAAGGGCATT
AGGGTGGTAGCAGTGAATGACCCATTCATTGATCCAGAATACATGGTTTACATGTTCAAATATGACTCCA
CACATGGTAGATACAAAGGAAACGTGGAACATAAGAATGGACAACTAGTTGTGGACAACCTTGAGATCAA
CACGTACCAGTGCAAAGACCCTAAAGAAATCCCCTGGAGCTCTATAGGGAATCCCTACGTGGTGGAGTGT
ACAGGCGTCTATCTGTCCATCGAGGCAGCTTCGGCACATATTTCATCTGGTGCCAGGCGTGTGGTGGTCA
CTGCACCCTCCCCCGATGCACCCATGTTTGTCATGGGAGTGAACGAGAAGGACTATAACCCTGGCTCTAT
GACCATTGTCAGCAATGCATCCTGTACCACCAACTGCCTGGCTCCTCTCGCCAAGGTTATTCATGAAAAC
TTCGGGATCGTGGAAGGGCTAATGACCACAGTCCATTCCTACACAGCCACTCAGAAGACAGTGGATGGGC
CATCAAAGAAGGACTGGCGAGGTGGCCGCGGCGCTCACCAAAACATCATCCCATCGTCCACTGGGGCTGC
CAAGGCTGTAGGCAAAGTCATCCCAGAGCTCAAAGGGAAGCTAACAGGAATGGCATTCCGGGTGCCAACC
CCAAACGTGTCAGTTGTGGACCTGACCTGCCGCCTGGCCAAGCCTGCTTCTTACTCGGCTATCACGGAGG
CTGTGAAAGCTGCAGCCAAGGGACCTTTGGCTGGCATCCTTGCTTACACAGAGGACCAGGTGGTCTCCAC
GGACTTTAACGGCAATCCCCATTCTTCCATCTTTGATGCTAAGGCTGGAATTGCCCTCAATGACAACTTC
GTGAAGCTTGTTGCCTGGTACGACAACGAATATGGCTACAGTAACCGAGTGGTCGACCTCCTCCGCTACA
TGTTTAGCCGAGAGAAGTAACACAAAAGGCCCCTCCTTGCTCCCCTGCGCACCTCGCGTTCCTGACTTCG
GCTTCCACTCAAAGGCGCCGCCACCGGGTCAACAATGAAATAAAAACGAGAATGCGC

FASTA Definition Line
  >gi|193425|gb|M60978.1|MUSGAPDS
  gi number:         193425
  Database:          gb
  Accession number:  M60978.1
  Locus Name:        MUSGAPDS

Database Identifiers
  gb   GenBank
  emb  EMBL
  dbj  DDBJ
  sp   SWISS-PROT
  pdb  Protein Databank
  pir  PIR
  prf  PRF
  ref  RefSeq
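The bar-separated definition line can be split into the parts labeled on this slide. This is a minimal sketch for the five-part `gi|number|db|accession|locus` layout shown here only; other definition line layouts exist and are rejected. The names `DB_TAGS` and `parse_defline` are hypothetical.

```python
# Database tags as listed on the slide.
DB_TAGS = {
    "gb": "GenBank",
    "emb": "EMBL",
    "dbj": "DDBJ",
    "sp": "SWISS-PROT",
    "pdb": "Protein Databank",
    "pir": "PIR",
    "prf": "PRF",
    "ref": "RefSeq",
}


def parse_defline(line: str) -> dict:
    """Split '>gi|193425|gb|M60978.1|MUSGAPDS title...' into named parts."""
    if not line.startswith(">"):
        raise ValueError("a FASTA definition line starts with '>'")
    identifier, _space, title = line[1:].partition(" ")
    parts = identifier.split("|")
    if len(parts) != 5 or parts[0] != "gi":
        raise ValueError(f"unexpected identifier layout: {identifier!r}")
    return {
        "gi": int(parts[1]),
        "database": DB_TAGS.get(parts[2], parts[2]),
        "accession": parts[3],
        "locus": parts[4],
        "title": title,
    }


record = parse_defline(
    ">gi|193425|gb|M60978.1|MUSGAPDS Mus musculus testis-specific isoform of glycerald"
)
print(record)
```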
52
Abstract Syntax Notation: ASN.1

Seq-entry ::= set {
  level 1,
  class nuc-prot,
  descr {
    title "Mus musculus testis-specific isoform of glyceraldehyde 3-phosphate dehydrogenase (Gapd-S) mRNA, and translated products",
    update-date std { year 1994, month 11, day 9 },
    source {
      org {
        taxname "Mus musculus",
        common "house mouse",
        db { { db "taxon", tag id 10090 } },

Display formats: FASTA Nucleotide, FASTA Protein, GenPept, GenBank, ASN.1
53
NCBI Toolbox

/*****************************************************************************
*
*   asn2ff.c
*   convert an ASN.1 entry to flat file format, using the FFPrintArrayPtrs.
*
*****************************************************************************/
#include  /* header name lost in extraction */
#include "asn2ff.h"
#include "asn2ffp.h"
#include "ffprint.h"
#include  /* header name lost in extraction */
#ifdef ENABLE_ID1
#include  /* header name lost in extraction */
#endif

FILE *fpl;

Args myargs[] = {
    {"Filename for asn.1 input","stdin",NULL,NULL,TRUE,'a',ARG_FILE_IN,0.0,0,NULL},
    {"Input is a Seq-entry","F", NULL,NULL,TRUE,'e',ARG_BOOLEAN,0.0,0,NULL},
    {"Input asnfile in binary mode","F",NULL,NULL,TRUE,'b',ARG_BOOLEAN,0.0,0,NULL},
    {"Output Filename","stdout", NULL,NULL,TRUE,'o',ARG_FILE_OUT,0.0,0,NULL},
    {"Show Sequence?","T", NULL,NULL,TRUE,'h',ARG_BOOLEAN,0.0,0,NULL},

Toolbox Sources
  ftp> open ncbi.nlm.nih.gov.
  ftp> cd toolbox
  ftp> cd ncbi_tools
ftp://ncbi.nlm.nih.gov/toolbox/ncbi_tools
54
NCBI Related Proteins Protein Neighbors-Structure Links Structure Links Cn3D GAPD Structure
55
NCBI Advanced Neighbors: BLink
56
NCBI BLink
57
NCBI PubMed Link
58
NCBI Online Books
59
NCBI Entrez Structures Molecular Modeling Database (MMDB) and Cn3D
60
MMDB: Molecular Modeling Data Base
- Derived from experimentally determined PDB records
- Value added to PDB records including:
    Addition of explicit chemical graph information
    Validation
    Inclusion of Taxonomy, Citation, and other information
    Conversion to parseable ASN.1 data description language
- Structure neighbors determined by Vector Alignment Search Tool (VAST)
61
NCBI Searching MMDB 1CET
62
NCBI Structure Summary Cn3D viewer VAST neighbors BLAST neighbors
63
NCBI Cn3D : Displaying Structures Chloroquine
64
NCBI Structure Neighbors
65
NCBI Structural Alignments Chloroquine NADH
66
Why do we need similarity searching?
- Identification and annotation
    Incomplete or no annotations (GenBank)
    Incorrectly annotated sequences
- Evolutionary relationships
    Homologous molecules may have similar functions, but it ain't necessarily so!
67
Basic Local Alignment Search Tool
- Widely used similarity search tool
- Heuristic approach based on the Smith-Waterman algorithm
- Finds best local alignments
- Provides statistical significance
- All combinations (DNA/Protein) of query and database
    DNA vs DNA
    DNA translation vs Protein
    Protein vs Protein
    Protein vs DNA translation
    DNA translation vs DNA translation
- WWW, email server, standalone, and network clients
68
Local Alignment Statistics

High scores of local alignments between two random sequences follow the Extreme Value Distribution.

For ungapped alignments, the expected number with score S or greater is

  $E = K m n e^{-\lambda S}$   or   $E = m n 2^{-S'}$

where
  $K$ = scale for search space
  $\lambda$ = scale for scoring system
  $S'$ = bit score = $(\lambda S - \ln K)/\ln 2$

http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
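The two forms of the expectation formula, E = K m n e^(-λS) with raw score S and E = m n 2^(-S') with bit score S' = (λS - ln K)/ln 2, are the same quantity written two ways (λ is the scale for the scoring system, K the scale for the search space). The sketch below checks that numerically. The values of λ and K, the raw score, and the sequence lengths are illustrative assumptions, not taken from the slides; real values depend on the scoring matrix and gap costs.

```python
import math

# Illustrative assumptions only, not values from the presentation.
LAMBDA = 0.318   # scale for the scoring system
K = 0.134        # scale for the search space


def expect_from_raw(raw_score: float, m: int, n: int, k: float, lam: float) -> float:
    """E = K * m * n * exp(-lambda * S)."""
    return k * m * n * math.exp(-lam * raw_score)


def bit_score(raw_score: float, k: float, lam: float) -> float:
    """S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_from_bits(bits: float, m: int, n: int) -> float:
    """E = m * n * 2 ** (-S')."""
    return m * n * 2.0 ** (-bits)


raw, query_len, db_len = 80, 250, 50_000_000     # hypothetical search
e_raw = expect_from_raw(raw, query_len, db_len, K, LAMBDA)
e_bits = expect_from_bits(bit_score(raw, K, LAMBDA), query_len, db_len)

print(f"E from raw score: {e_raw:.3e}")
print(f"E from bit score: {e_bits:.3e}")
```

The benefit of the bit score is that E can be computed from it without knowing λ and K, which is why bit scores can be compared across scoring systems.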
69
Scoring Systems
- Nucleic acids
    Identity matrix
- Proteins
    Position Independent Matrices
      PAM Matrices (Percent Accepted Mutation)
        Implicit model of evolution
        Higher PAM numbers are all calculated from PAM1
        PAM250 widely used
      BLOSUM Matrices (BLOck SUbstitution Matrices)
        Empirically determined from alignment of conserved blocks
        Each includes information up to a certain level of identity
        BLOSUM62 widely used
    Position Specific Score Matrices (PSSM)
      PSI and RPS BLAST
70
BLOSUM62

A   4
R  -1   5
N  -2   0   6
D  -2  -2   1   6
C   0  -3  -3  -3   9
Q  -1   1   0   0  -3   5
E  -1   0   0   2  -4   2   5
G   0  -2   0  -1  -3  -2  -2   6
H  -2   0   1  -1  -3   0   0  -2   8
I  -1  -3  -3  -3  -1  -3  -3  -4  -3   4
L  -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4
K  -1   2   0  -1  -3   1   1  -2  -1  -3  -2   5
M  -1  -1  -2  -3  -1   0  -2  -3  -2   1   2  -1   5
F  -2  -3  -3  -3  -2  -3  -3  -3  -1   0   0  -3   0   6
P  -1  -2  -2  -1  -3  -1  -1  -2  -2  -3  -3  -1  -2  -4   7
S   1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4
T   0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5
W  -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  11
Y  -2  -2  -2  -3  -2  -1  -2  -3   2  -1  -1  -2  -1   3  -3  -2  -2   2   7
V   0  -3  -3  -3  -1  -2  -2  -3  -3   3   1  -2   1  -1  -2  -2   0  -3  -1   4
X   0  -1  -1  -1  -2  -1  -1  -1  -1  -1  -1  -1  -1  -1  -2   0   0  -2  -1  -1  -1
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   X

- Common amino acids have low weights
- Rare amino acids have high weights
- Negative for less likely substitutions
- Positive for more likely substitutions
71
Position Specific Substitution Rates
[Figure labels: Active site serine; Typical serine]
72
Position Specific Score Matrix (PSSM)

          A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
206 D     0  -2   0   2  -4   2   4  -4  -3  -5  -4   0  -2  -6   1   0  -1  -6  -4  -1
207 G    -2  -1   0  -2  -4  -3  -3   6  -4  -5  -5   0  -2  -3  -2  -2  -1   0  -6  -5
208 V    -1   1  -3  -3  -5  -1  -2   6  -1  -4  -5   1  -5  -6  -4   0  -2  -6  -4  -2
209 I    -3   3  -3  -4  -6   0  -1  -4  -1   2  -4   6  -2  -5  -5  -3   0  -1  -4   0
210 S    -2  -5   0   8  -5  -3  -2  -1  -4  -7  -6  -4  -6  -7  -5   1  -3  -7  -5  -6
211 S     4  -4  -4  -4  -4  -1  -4  -2  -3  -3  -5  -4  -4  -5  -1   4   3  -6  -5  -3
212 C    -4  -7  -6  -7  12  -7  -7  -5  -6  -5  -5  -7  -5   0  -7  -4  -4  -5   0  -4
213 N    -2   0   2  -1  -6   7   0  -2   0  -6  -4   2   0  -2  -5  -1  -3  -3  -4  -3
214 G    -2  -3  -3  -4  -4  -4  -5   7  -4  -7  -7  -5  -4  -4  -6  -3  -5  -6  -6  -6
215 D    -5  -5  -2   9  -7  -4  -1  -5  -5  -7  -7  -4  -7  -7  -5  -4  -4  -8  -7  -7
216 S    -2  -4  -2  -4  -4  -3  -3  -3  -4  -6  -6  -3  -5  -6  -4   7  -2  -6  -5  -5
217 G    -3  -6  -4  -5  -6  -5  -6   8  -6  -8  -7  -5  -6  -7  -6  -4  -5  -6  -7  -7
218 G    -3  -6  -4  -5  -6  -5  -6   8  -6  -7  -7  -5  -6  -7  -6  -2  -4  -6  -7  -7
219 P    -2  -6  -6  -5  -6  -5  -5  -6  -6  -6  -7  -4  -6  -7   9  -4  -4  -7  -7  -6
220 L    -4  -6  -7  -7  -5  -5  -6  -7   0  -1   6  -6   1   0  -6  -6  -5  -5  -4   0
221 N    -1  -6   0  -6  -4  -4  -6  -6  -1   3   0  -5   4  -3  -6  -2  -1  -6  -1   6
222 C     0  -4  -5  -5  10  -2  -5  -5   1  -1  -1  -5   0  -1  -4  -1   0  -5   0   0
223 Q     0   1   4   2  -5   2   0   0   0  -4  -2   1   0   0   0  -1  -1  -3  -3  -4
224 A    -1  -1   1   3  -4  -1   1   4  -3  -4  -3  -1  -2  -2  -3   0  -2  -2  -2  -3

Slide labels: Serine scored differently in these two positions; Active site nucleophile
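A PSSM lookup differs from a BLOSUM62 lookup only in that the score depends on the position as well as the residue. The sketch below copies two serine rows (positions 211 and 216) from the matrix on this slide and compares the score for serine at each. The slide says that serine is scored differently at two positions but does not name them in the extracted text, so treating 211 and 216 as the two positions in question is an assumption. The names `COLUMNS`, `PSSM_ROWS`, and `pssm_score` are hypothetical.

```python
# Column order of the PSSM as shown on the slide.
COLUMNS = "ARNDCQEGHILKMFPSTWYV"

# Two rows copied from the slide (query residue is S at both positions).
PSSM_ROWS = {
    211: [4, -4, -4, -4, -4, -1, -4, -2, -3, -3, -5, -4, -4, -5, -1, 4, 3, -6, -5, -3],
    216: [-2, -4, -2, -4, -4, -3, -3, -3, -4, -6, -6, -3, -5, -6, -4, 7, -2, -6, -5, -5],
}


def pssm_score(position: int, residue: str) -> int:
    """Score for aligning `residue` at the given PSSM position."""
    return PSSM_ROWS[position][COLUMNS.index(residue)]


for pos in (211, 216):
    print(pos, "S:", pssm_score(pos, "S"), " A:", pssm_score(pos, "A"), " T:", pssm_score(pos, "T"))
```

At position 211 serine scores 4 and alanine or threonine are tolerated (4 and 3), while at position 216 serine scores 7 and every substitution is penalized. A position-independent matrix cannot express this distinction.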
73
Gapped Alignments
- Gapping provides more biologically realistic alignments
- Statistical behavior not completely understood for gapped alignments
- Gapped BLAST parameters must be found by simulations for each matrix
- Affine gap costs = -(a + bk)
    a = gap open penalty
    b = gap extend penalty
    k = gap length (a gap of length 1 receives the score -(a + b))
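The affine gap cost -(a + bk) can be written as a one-line function, with a the gap open penalty, b the gap extend penalty, and k the gap length. The penalty values used below (a = 11, b = 1) are assumptions chosen for illustration and do not come from the slides.

```python
def gap_score(length: int, gap_open: int, gap_extend: int) -> int:
    """Affine gap score -(a + b*k) for a gap of `length` residues."""
    if length < 1:
        raise ValueError("a gap has at least one position")
    return -(gap_open + gap_extend * length)


# Illustrative penalties (assumed): a = 11, b = 1.
for k in (1, 2, 5, 10):
    print(k, gap_score(k, gap_open=11, gap_extend=1))
```

With these values a single long gap of 10 positions scores -21, while ten separate gaps of length 1 would score -120 in total, which is the point of the affine scheme: opening a gap is expensive, extending one is cheap.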
74
NCBI Intermission