NCBI Molecular Biology Resources: Entrez. 王禄山 (Wang Lushan), Mar. 2005.


NCBI Resources
- About NCBI
- NCBI Sequence Databases
  - Primary Database: GenBank
  - Derivative Databases: RefSeq
- Entrez Databases and Text Searching
- BLAST

NCBI The National Institutes of Health Bethesda, MD

NCBI The National Center for Biotechnology Information
- Accepts submissions of primary data
- Develops tools to analyze these data
- Creates derivative databases based on the primary data
- Provides free search, link, and retrieval of these data, primarily through the Entrez system

NCBI The National Center for Biotechnology Information (NCBI)
- Created as a part of the National Library of Medicine in 1988
  - Establish public databases
  - Research in computational biology
  - Develop software tools for sequence analysis
  - Disseminate biomedical information
- Tools: Entrez (1992), BLAST (1990)
- GenBank (1992)
- Free MEDLINE (PubMed, 1997)
- Other databases: dbEST, dbGSS, dbSTS, MMDB, OMIM, UniGene, GeneMap, Taxonomy, CGAP, SAGE, LocusLink, RefSeq

NCBI NCBI WWW Users per Day

NCBI Number of Users and Hits Per Day
[Figure: number of users and hits per day, 1997 to 2003; annotation: Christmas & New Year]

NCBI Homepage - accessing the data all[filter]

NCBI all[filter] 1/11/2005

NCBI Entrez Nucleotide
Primary Data
- GenBank / DDBJ / EMBL: 46,974,918 (98.86 %)
Derivative Data
- RefSeq: 533,236 (1.12 %)
- PDB (structures): 5,484
- Third Party Annotation (TPA): 4,516
"Total": 47,518,338

NCBI GenBank: NCBI's Primary Sequence Database
Release 145, Dec 2004
- 40.6 × 10^6 records
- 44.5 × 10^9 nucleotides
- 153 gigabytes
- 705 files
- Full release every two months
- Incremental and cumulative updates daily
- Available only through the internet
- Release notes: gbrel.txt
FTP sites:
- ftp://ftp.ncbi.nih.gov/genbank/
- ftp://genbank.sdsc.edu/pub
- ftp://bio-mirror.net/biomirror/genbank

NCBI Molecular Databases
- Primary Databases
  - Original submissions by experimentalists
  - Database staff organize but don't add additional information
  - Example: GenBank
- Derivative Databases
  - Human curated
    - Compilation and correction of data
    - Example: SWISS-PROT, NCBI RefSeq mRNA
  - Computationally derived
    - Example: UniGene
  - Combinations
    - Example: NCBI Genome Assembly

NCBI Primary vs. Derivative Databases
[Diagram]
- GenBank (updated ONLY by submitters): fed by sequencing centers and labs; divisions EST, STS, GSS, HTG, PRI, ROD, PLN, MAM, BCT, INV, VRT, PHG, VRL
- Derivative databases (updated by NCBI): UniGene, UniSTS, RefSeq
- RefSeq sources: Entrez Gene and Genomes pipelines, annotation pipeline, curators

NCBI The GenBank Record

NCBI GenBank Records Header Feature Table Sequence The Flatfile Format

NCBI A Typical GenBank Record
LOCUS       NM_019570  4279 bp  mRNA  linear  INV 28-OCT-2004
DEFINITION  Mus musculus REV1-like (S. cerevisiae) (Rev1l), mRNA    (= Title)
ACCESSION   NM_019570
VERSION     NM_019570.3  GI:50811869
KEYWORDS    .
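The header fields of the flatfile above can be pulled apart with a few lines of Python. This is only a minimal sketch: `parse_genbank_header` is a hypothetical helper, the header text is retyped from the slide, and continuation lines (which real records use for long definitions) are assumed absent.

```python
def parse_genbank_header(text):
    """Collect one-line header keywords (LOCUS, DEFINITION, ...) into a dict.

    Assumption: every field fits on a single line; real GenBank records
    may wrap long values onto indented continuation lines.
    """
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        keyword, _, value = line.partition(" ")
        fields[keyword] = value.strip()
    return fields


# Header retyped from the slide's example record.
header = """LOCUS       NM_019570  4279 bp  mRNA  linear  INV 28-OCT-2004
DEFINITION  Mus musculus REV1-like (S. cerevisiae) (Rev1l), mRNA
ACCESSION   NM_019570
VERSION     NM_019570.3  GI:50811869"""

fields = parse_genbank_header(header)

# VERSION holds "accession.version" followed by the GI identifier.
accession_version, gi = fields["VERSION"].split()
accession, version = accession_version.split(".")
print(accession, version, gi)
```

The VERSION line is the useful one for tracking updates: the accession stays fixed while the number after the dot increases when the sequence changes.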

NCBI GenBank Record: Feature Table Entrez

NCBI GenBank Record: Feature Table GenPept identifier Blast Entrez

NCBI GenBank Record: sequence skip Blast

NCBI NCBI Homepage http://www.ncbi.nlm.nih.gov/

NCBI BLAST Mendelian Inheritance in Man NCBI Homepage Entrez

NCBI Online Help

NCBI Using Entrez An integrated database search and retrieval system

Entrez: Neighboring and Hard Links
[Diagram: linked Entrez databases (PubMed abstracts, Nucleotide sequences, Protein sequences, 3-D Structure (MMDB), Genomes, Taxonomy); neighbor relationships labeled Word weight, VAST, BLAST, Phylogeny]

NCBI GEO (Gene Expression Omnibus): a database that collects and stores microarray gene expression data.

Unigene

Database Searching with Entrez
- Using limits and field restriction to find mouse GAPD
- Linking and neighboring with mouse GAPD

NCBI Entrez Nucleotides Mouse

NCBI Document Summaries: Mouse[All Fields] 7 million records

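NCBI Data Rich, Knowledge Poor: do not drown yourself in the "sea of data and information"; go and find the "islands of knowledge".

NCBI What are data, information, and knowledge? Be sure to note that what bioinformatics currently uses for storage is called a DATABASE.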

NCBI Entrez Nucleotides: Limits: Preview/Index Mouse

NCBI Entrez Nucleotides: Limits
Search term: Mouse
- Field Restriction: Accession, All Fields, Author Name, EC/RN Number, Feature key, Filter, Gene Name, Issue, Journal Name, Keyword, Modification Date, Organism, Page Number, Primary Accession, Properties, Protein Name, Publication Date, SeqID String, Sequence Length, Substance Name, Text Word, Title Word, Uid, Volume
- Only From: RefSeq, GenBank, EMBL, DDBJ
- Exclude unwanted categories of sequences
- Molecule: Genomic DNA/RNA, mRNA, rRNA
- Gene Location: Genomic DNA/RNA, Mitochondrion, Chloroplast

NCBI Entrez Nucleotides: Limits: Organism Mouse

NCBI Document Summaries: Mouse[Organism] 7,247,131[All Fields] -6,850,905[Organism] 397,226

NCBI Exclude Bulk Sequences, mRNA

NCBI 502497

NCBI Preview / Index

NCBI Adding Terms: Preview/Index Search History

NCBI glyceraldehyde 3 phosphate dehydrogenase

mouse AND glyceraldehyde 3 phosphate dehydrogenase[Title]

NCBI 161 Mouse GAPD Records

NCBI History

#18 AND # 6
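The field-restricted query built up on the preceding slides can also be composed as a plain string. The sketch below assumes the organism limit is written as `mouse[Organism]`; the slides apply it through the Limits page instead. The URL step uses NCBI's E-utilities `esearch` endpoint, which the slides do not cover and is shown only as one possible programmatic route. No request is sent.

```python
from urllib.parse import urlencode

# E-utilities search endpoint (not part of the slides; included as an assumption).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def field_term(text, field):
    """Restrict a search term to one Entrez field, e.g. mouse[Organism]."""
    return f"{text}[{field}]"


def esearch_url(term, db="nucleotide"):
    """Build (but do not fetch) a search URL for the given database and term."""
    return EUTILS_ESEARCH + "?" + urlencode({"db": db, "term": term})


# Combine the two restrictions with the boolean operator AND,
# mirroring how history entries are combined on the slide.
term = " AND ".join([
    field_term("mouse", "Organism"),
    field_term("glyceraldehyde 3 phosphate dehydrogenase", "Title"),
])
url = esearch_url(term)
print(term)
print(url)
```

Boolean operators are written in upper case, as on the slides; the field tag in square brackets plays the same role as picking a field on the Limits or Preview/Index page.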

Displaying Records

NCBI Displaying Mouse GAPD Records
- Formats: Summary, Brief, GenBank, ASN.1, FASTA, GI list
- Links and neighbors (related records): LinkOut, PubMed Links, Protein Links, Nucleotide Neighbors, PopSet Links, Structure Links, Genome Links, Taxonomy Links, OMIM Links

Entrez GenBank / GenPept GenPept

NCBI FASTA Format
>gi|193425|gb|M60978.1|MUSGAPDS Mus musculus testis-specific isoform of glycerald
GGCAGCCAGGCCATGAGATCTTAGGCCATGTCGAGACGTGACGTGGTCCTTACCAATGTTACTGTTGTCC
AGCTACGGCGGGACCGATGCCCATGCCCATGCCCATGCCCATGTCCATGCCCATGCCCTGTGATCAGACC
ACCTCCACCCAAGCTTGAGGATCCACCACCCACGGTTGAAGAACAGCCACCGCCACCGCCGCCGCCACCT
CCACCTCCACCACCACCTCCTCCTCCTCCTCCACCCCAGATAGAGCCAGACAAGTTTGAAGAGGCTCCCC
CTCCCCCTCCCCCTCCTCCTCCTCCTCCCCCTCCCCCTCCTCCACCACTCCAAAAGCCAGCTAGAGAGCT
GACAGTGGGTATCAATGGATTTGGACGCATTGGTCGTCTGGTGCTGCGAGTCTGCATGGAGAAGGGCATT
AGGGTGGTAGCAGTGAATGACCCATTCATTGATCCAGAATACATGGTTTACATGTTCAAATATGACTCCA
CACATGGTAGATACAAAGGAAACGTGGAACATAAGAATGGACAACTAGTTGTGGACAACCTTGAGATCAA
CACGTACCAGTGCAAAGACCCTAAAGAAATCCCCTGGAGCTCTATAGGGAATCCCTACGTGGTGGAGTGT
ACAGGCGTCTATCTGTCCATCGAGGCAGCTTCGGCACATATTTCATCTGGTGCCAGGCGTGTGGTGGTCA
CTGCACCCTCCCCCGATGCACCCATGTTTGTCATGGGAGTGAACGAGAAGGACTATAACCCTGGCTCTAT
GACCATTGTCAGCAATGCATCCTGTACCACCAACTGCCTGGCTCCTCTCGCCAAGGTTATTCATGAAAAC
TTCGGGATCGTGGAAGGGCTAATGACCACAGTCCATTCCTACACAGCCACTCAGAAGACAGTGGATGGGC
CATCAAAGAAGGACTGGCGAGGTGGCCGCGGCGCTCACCAAAACATCATCCCATCGTCCACTGGGGCTGC
CAAGGCTGTAGGCAAAGTCATCCCAGAGCTCAAAGGGAAGCTAACAGGAATGGCATTCCGGGTGCCAACC
CCAAACGTGTCAGTTGTGGACCTGACCTGCCGCCTGGCCAAGCCTGCTTCTTACTCGGCTATCACGGAGG
CTGTGAAAGCTGCAGCCAAGGGACCTTTGGCTGGCATCCTTGCTTACACAGAGGACCAGGTGGTCTCCAC
GGACTTTAACGGCAATCCCCATTCTTCCATCTTTGATGCTAAGGCTGGAATTGCCCTCAATGACAACTTC
GTGAAGCTTGTTGCCTGGTACGACAACGAATATGGCTACAGTAACCGAGTGGTCGACCTCCTCCGCTACA
TGTTTAGCCGAGAGAAGTAACACAAAAGGCCCCTCCTTGCTCCCCTGCGCACCTCGCGTTCCTGACTTCG
GCTTCCACTCAAAGGCGCCGCCACCGGGTCAACAATGAAATAAAAACGAGAATGCGC

FASTA Definition Line: >gi|193425|gb|M60978.1|MUSGAPDS
- gi number: 193425
- Database identifier: gb
- Accession number: M60978.1
- Locus name: MUSGAPDS
Database identifiers: gb = GenBank, emb = EMBL, dbj = DDBJ, sp = SWISS-PROT, pdb = Protein Databank, pir = PIR, prf = PRF, ref = RefSeq
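The pipe-delimited definition line is easy to split by program. A minimal sketch follows, assuming the five-part `gi|number|db|accession|locus` layout shown on the slide; `parse_defline` is a hypothetical helper name, and other identifier layouts are not handled.

```python
# Database tags as listed on the slide.
DB_TAGS = {
    "gb": "GenBank", "emb": "EMBL", "dbj": "DDBJ", "sp": "SWISS-PROT",
    "pdb": "Protein Databank", "pir": "PIR", "prf": "PRF", "ref": "RefSeq",
}


def parse_defline(defline):
    """Split a '>gi|number|db|accession|locus description' line into fields."""
    head, _, description = defline.lstrip(">").partition(" ")
    parts = head.split("|")
    if len(parts) != 5 or parts[0] != "gi":
        raise ValueError(f"unexpected definition line layout: {defline!r}")
    _, gi, db_tag, accession, locus = parts
    return {
        "gi": gi,
        "database": DB_TAGS.get(db_tag, db_tag),  # fall back to the raw tag
        "accession": accession,
        "locus": locus,
        "description": description,
    }


record = parse_defline(
    ">gi|193425|gb|M60978.1|MUSGAPDS Mus musculus testis-specific isoform"
)
print(record["gi"], record["database"], record["accession"], record["locus"])
```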

Abstract Syntax Notation: ASN.1
Seq-entry ::= set {
  level 1,
  class nuc-prot,
  descr {
    title "Mus musculus testis-specific isoform of glyceraldehyde 3-phosphate dehydrogenase (Gapd-S) mRNA, and translated products",
    update-date std {
      year 1994,
      month 11,
      day 9 },
    source {
      org {
        taxname "Mus musculus",
        common "house mouse",
        db {
          {
            db "taxon",
            tag id 10090 } },
[Diagram labels: GenBank, GenPept, FASTA Nucleotide, FASTA Protein, ASN.1]

NCBI Toolbox
/*****************************************************************************
*
*   asn2ff.c
*     convert an ASN.1 entry to flat file format, using the FFPrintArrayPtrs.
*
*****************************************************************************/
#include
#include "asn2ff.h"
#include "asn2ffp.h"
#include "ffprint.h"
#include
#ifdef ENABLE_ID1
#include
#endif

FILE *fpl;

Args myargs[] = {
  {"Filename for asn.1 input", "stdin", NULL, NULL, TRUE, 'a', ARG_FILE_IN, 0.0, 0, NULL},
  {"Input is a Seq-entry", "F", NULL, NULL, TRUE, 'e', ARG_BOOLEAN, 0.0, 0, NULL},
  {"Input asnfile in binary mode", "F", NULL, NULL, TRUE, 'b', ARG_BOOLEAN, 0.0, 0, NULL},
  {"Output Filename", "stdout", NULL, NULL, TRUE, 'o', ARG_FILE_OUT, 0.0, 0, NULL},
  {"Show Sequence?", "T", NULL, NULL, TRUE, 'h', ARG_BOOLEAN, 0.0, 0, NULL},

Toolbox Sources
ftp> open ncbi.nlm.nih.gov.
ftp> cd toolbox
ftp> cd ncbi_tools
ftp://ncbi.nlm.nih.gov/toolbox/ncbi_tools

NCBI Protein Neighbors-Structure Links

Related Proteins Protein Neighbors-Structure Links Structure Links Cn3D GAPD Structure

NCBI Advanced Neighbors: BLink

NCBI BLink

NCBI Online Books

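NCBI Advice
- Never let yourself become a mere collector of data, and do not let yourself become just a database (that is the computer's job). Become a processor of this information and make yourself a knowledgeable person!
- Hua Luogeng: when reading a book, go from thin to thick, and then from thick to thin.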

Entrez Structures Molecular Modeling Database (MMDB) and Cn3D

NCBI MMDB: Molecular Modeling Data Base
- Derived from experimentally determined PDB records
- Value added to PDB records, including:
  - Addition of explicit chemical graph information
  - Validation
  - Inclusion of Taxonomy, Citation, and other information
  - Conversion to parseable ASN.1 data description language
- Structure neighbors determined by Vector Alignment Search Tool (VAST)

NCBI Searching MMDB 1CET

NCBI Structure Summary Cn3D viewer VAST neighbors BLAST neighbors

NCBI Cn3D : Displaying Structures Chloroquine

NCBI Structure Neighbors

NCBI Structural Alignments Chloroquine NADH

NCBI Why do we need similarity searching?
- Identification and annotation
  - Incomplete or no annotations (GenBank)
  - Incorrectly annotated sequences
- Evolutionary relationships
  - Homologous molecules may have similar functions, but it ain't necessarily so!

NCBI Basic Local Alignment Search Tool
- Widely used similarity search tool
- Heuristic approach based on the Smith-Waterman algorithm
- Finds best local alignments
- Provides statistical significance
- All combinations (DNA/Protein) of query and database:
  - DNA vs DNA
  - DNA translation vs Protein
  - Protein vs Protein
  - Protein vs DNA translation
  - DNA translation vs DNA translation
- WWW, email server, standalone, and network clients

NCBI Local Alignment Statistics
High scores of local alignments between two random sequences follow the Extreme Value Distribution.
For ungapped alignments, the expected number with score S or greater is

$$E = K m n \, e^{-\lambda S} \quad \text{or} \quad E = m n \, 2^{-S'}$$

- K = scale for search space
- λ = scale for scoring system
- S' = bit score = (λS − ln K) / ln 2
http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
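The two forms of the expect value on this slide can be checked against each other numerically. The sketch below takes m and n to be the query and database lengths; the values of K, λ, m, n, and the raw score are illustrative assumptions, not figures from the slides.

```python
import math


def expect_value(raw_score, m, n, K, lam):
    """E = K * m * n * exp(-lambda * S), using the raw score S."""
    return K * m * n * math.exp(-lam * raw_score)


def bit_score(raw_score, K, lam):
    """S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def expect_from_bits(bits, m, n):
    """E = m * n * 2^(-S'), using the bit score S'."""
    return m * n * 2.0 ** (-bits)


# Hypothetical inputs chosen only for illustration.
K, lam = 0.134, 0.318      # assumed statistical parameters
m, n = 250, 1_000_000_000  # assumed query length and database length
S = 100                    # assumed raw alignment score

e_raw = expect_value(S, m, n, K, lam)
e_bits = expect_from_bits(bit_score(S, K, lam), m, n)
print(e_raw, e_bits)
```

Both routes give the same E because the bit score simply folds K and λ into the score, which is why bit scores from searches with different scoring systems can be compared directly.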

NCBI Scoring Systems
- Nucleic acids: identity matrix
- Proteins
  - Position Independent Matrices
    - PAM Matrices (Percent Accepted Mutation)
      - Implicit model of evolution
      - Higher PAM numbers are all calculated from PAM1
      - PAM250 widely used
    - BLOSUM Matrices (BLOcks SUbstitution Matrix)
      - Empirically determined from alignment of conserved blocks
      - Each includes information up to a certain level of identity
      - BLOSUM62 widely used
  - Position Specific Score Matrices (PSSM)
    - PSI-BLAST and RPS-BLAST

NCBI BLOSUM62
A   4
R  -1   5
N  -2   0   6
D  -2  -2   1   6
C   0  -3  -3  -3   9
Q  -1   1   0   0  -3   5
E  -1   0   0   2  -4   2   5
G   0  -2   0  -1  -3  -2  -2   6
H  -2   0   1  -1  -3   0   0  -2   8
I  -1  -3  -3  -3  -1  -3  -3  -4  -3   4
L  -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4
K  -1   2   0  -1  -3   1   1  -2  -1  -3  -2   5
M  -1  -1  -2  -3  -1   0  -2  -3  -2   1   2  -1   5
F  -2  -3  -3  -3  -2  -3  -3  -3  -1   0   0  -3   0   6
P  -1  -2  -2  -1  -3  -1  -1  -2  -2  -3  -3  -1  -2  -4   7
S   1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4
T   0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5
W  -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  11
Y  -2  -2  -2  -3  -2  -1  -2  -3   2  -1  -1  -2  -1   3  -3  -2  -2   2   7
V   0  -3  -3  -3  -1  -2  -2  -3  -3   3   1  -2   1  -1  -2  -2   0  -3  -1   4
X   0  -1  -1  -1  -2  -1  -1  -1  -1  -1  -1  -1  -1  -1  -2   0   0  -2  -1  -1  -1
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   X

- Common amino acids have low weights
- Rare amino acids have high weights
- Negative for less likely substitutions
- Positive for more likely substitutions
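Scoring an ungapped alignment with a substitution matrix is a sum of per-column lookups. The sketch below copies only a handful of entries from the BLOSUM62 table on this slide (a full implementation would load all of them), and the two three-residue sequences are made up for illustration.

```python
# A few entries copied from the BLOSUM62 table above; the matrix is
# symmetric, so each pair is stored once.
BLOSUM62_SUBSET = {
    ("W", "W"): 11, ("I", "I"): 4, ("D", "D"): 6,
    ("W", "F"): 1, ("I", "V"): 3, ("D", "E"): 2,
    ("W", "D"): -4,
}


def pair_score(a, b):
    """Look up a residue pair in either order."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]


def ungapped_score(seq1, seq2):
    """Sum the substitution scores column by column (no gaps allowed)."""
    if len(seq1) != len(seq2):
        raise ValueError("ungapped alignment needs equal-length sequences")
    return sum(pair_score(a, b) for a, b in zip(seq1, seq2))


identical = ungapped_score("WID", "WID")     # 11 + 4 + 6
conservative = ungapped_score("WID", "FVE")  # 1 + 3 + 2
print(identical, conservative)
```

The identical alignment scores 21 and the conservative one 6: the rare tryptophan contributes 11 on its own, while the chemically similar replacements still score positive.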

NCBI Position Specific Substitution Rates Active site serineTypical serine

NCBI Position Specific Score Matrix (PSSM)
         A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
206 D    0  -2   0   2  -4   2   4  -4  -3  -5  -4   0  -2  -6   1   0  -1  -6  -4  -1
207 G   -2  -1   0  -2  -4  -3  -3   6  -4  -5  -5   0  -2  -3  -2  -2  -1   0  -6  -5
208 V   -1   1  -3  -3  -5  -1  -2   6  -1  -4  -5   1  -5  -6  -4   0  -2  -6  -4  -2
209 I   -3   3  -3  -4  -6   0  -1  -4  -1   2  -4   6  -2  -5  -5  -3   0  -1  -4   0
210 S   -2  -5   0   8  -5  -3  -2  -1  -4  -7  -6  -4  -6  -7  -5   1  -3  -7  -5  -6
211 S    4  -4  -4  -4  -4  -1  -4  -2  -3  -3  -5  -4  -4  -5  -1   4   3  -6  -5  -3
212 C   -4  -7  -6  -7  12  -7  -7  -5  -6  -5  -5  -7  -5   0  -7  -4  -4  -5   0  -4
213 N   -2   0   2  -1  -6   7   0  -2   0  -6  -4   2   0  -2  -5  -1  -3  -3  -4  -3
214 G   -2  -3  -3  -4  -4  -4  -5   7  -4  -7  -7  -5  -4  -4  -6  -3  -5  -6  -6  -6
215 D   -5  -5  -2   9  -7  -4  -1  -5  -5  -7  -7  -4  -7  -7  -5  -4  -4  -8  -7  -7
216 S   -2  -4  -2  -4  -4  -3  -3  -3  -4  -6  -6  -3  -5  -6  -4   7  -2  -6  -5  -5
217 G   -3  -6  -4  -5  -6  -5  -6   8  -6  -8  -7  -5  -6  -7  -6  -4  -5  -6  -7  -7
218 G   -3  -6  -4  -5  -6  -5  -6   8  -6  -7  -7  -5  -6  -7  -6  -2  -4  -6  -7  -7
219 P   -2  -6  -6  -5  -6  -5  -5  -6  -6  -6  -7  -4  -6  -7   9  -4  -4  -7  -7  -6
220 L   -4  -6  -7  -7  -5  -5  -6  -7   0  -1   6  -6   1   0  -6  -6  -5  -5  -4   0
221 N   -1  -6   0  -6  -4  -4  -6  -6  -1   3   0  -5   4  -3  -6  -2  -1  -6  -1   6
222 C    0  -4  -5  -5  10  -2  -5  -5   1  -1  -1  -5   0  -1  -4  -1   0  -5   0   0
223 Q    0   1   4   2  -5   2   0   0   0  -4  -2   1   0   0   0  -1  -1  -3  -3  -4
224 A   -1  -1   1   3  -4  -1   1   4  -3  -4  -3  -1  -2  -2  -3   0  -2  -2  -2  -3

- Serine scored differently in these two positions
- Active site nucleophile
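A PSSM lookup differs from a substitution-matrix lookup only in that the row is chosen by alignment position rather than by residue. As a rough sketch, the two rows below are copied from positions 211 and 216 of the matrix on this slide; `pssm_score` is a hypothetical helper.

```python
# Column order used by the PSSM on the slide.
COLUMNS = "ARNDCQEGHILKMFPSTWYV"

# Rows copied from the slide for the two serine positions.
PSSM_ROWS = {
    211: [4, -4, -4, -4, -4, -1, -4, -2, -3, -3, -5, -4, -4, -5, -1, 4, 3, -6, -5, -3],
    216: [-2, -4, -2, -4, -4, -3, -3, -3, -4, -6, -6, -3, -5, -6, -4, 7, -2, -6, -5, -5],
}


def pssm_score(position, residue):
    """Score for placing `residue` at the given alignment position."""
    return PSSM_ROWS[position][COLUMNS.index(residue)]


for pos in (211, 216):
    print(pos, {aa: pssm_score(pos, aa) for aa in "SAT"})
```

At position 211 serine, alanine, and threonine all score positively, whereas at position 216 only serine does, which is the position-specific behavior a single BLOSUM62 serine score cannot express.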

NCBI Gapped Alignments
- Gapping provides more biologically realistic alignments
- Statistical behavior is not completely understood for gapped alignments
- Gapped BLAST parameters must be found by simulations for each matrix
- Affine gap costs = -(a + bk) for a gap of length k
  - a = gap open penalty
  - b = gap extend penalty
  - A gap of length 1 receives the score -(a + b)
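The affine gap formula on this slide can be written out directly. In this small sketch the penalties a = 11 and b = 1 are example values chosen for illustration, not parameters stated in the slides.

```python
def affine_gap_score(length, gap_open, gap_extend):
    """Score of one gap of `length` residues: -(a + b*k)."""
    if length < 1:
        raise ValueError("gap length must be at least 1")
    return -(gap_open + gap_extend * length)


# Example penalties (assumed values).
a, b = 11, 1

single = affine_gap_score(1, a, b)      # -(a + b)
one_long = affine_gap_score(5, a, b)    # one gap of length 5
five_short = 5 * affine_gap_score(1, a, b)  # five separate gaps of length 1
print(single, one_long, five_short)
```

With these values one gap of length 5 costs 16 while five separate single-residue gaps cost 60, so the affine scheme favors a few long gaps over many scattered ones.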

NCBI Intermission

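NCBI Advice
- Never let yourself become a mere collector of data, and do not let yourself become just a database (that is the computer's job). Become a processor of this information and make yourself a knowledgeable person!
- Hua Luogeng: when reading a book, go from thin to thick, and then from thick to thin.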
