NCBI Molecular Biology Resources: Entrez. 王禄山, Mar. 2005.

1 NCBI Molecular Biology Resources: Entrez. 王禄山, Mar. 2005

2 NCBI Resources
    - About NCBI
    - NCBI Sequence Databases
        - Primary Database: GenBank
        - Derivative Databases: RefSeq
    - Entrez Databases and Text Searching
    - BLAST

3 The National Institutes of Health, Bethesda, MD

4 The National Center for Biotechnology Information
    - Accepts submissions of primary data
    - Develops tools to analyze these data
    - Creates derivative databases based on the primary data
    - Provides free search, link, and retrieval of these data, primarily through the Entrez system

5 The National Center for Biotechnology Information (NCBI)
    - Created as a part of the National Library of Medicine in 1988
        - Establish public databases
        - Research in computational biology
        - Develop software tools for sequence analysis
        - Disseminate biomedical information
    - Tools: Entrez (1992), BLAST (1990)
    - GenBank (1992)
    - Free MEDLINE (PubMed, 1997)
    - Other databases: dbEST, dbGSS, dbSTS, MMDB, OMIM, UniGene, GeneMap, Taxonomy, CGAP, SAGE, LocusLink, RefSeq

6 NCBI WWW Users per Day

7 Number of Users and Hits Per Day [chart: x-axis 1997 to 2003; annotation: Christmas & New Year]

8 NCBI Homepage - accessing the data all[filter]

9 NCBI all[filter] 1/11/2005

10 Entrez Nucleotide
    Primary Data
        GenBank / DDBJ / EMBL           46,974,918 (98.86 %)
    Derivative Data
        RefSeq                             533,236 (1.12 %)
        PDB (structures)                     5,484
        Third Party Annotation (TPA)         4,516
    "Total"                             47,518,338
    [callout: GenBank]

11 GenBank: NCBI's Primary Sequence Database
    Release 145, Dec 2004
    - 40.6 x 10^6 records
    - 44.5 x 10^9 nucleotides
    - 153 gigabytes
    - 705 files
    - Full release every two months
    - Incremental and cumulative updates daily
    - Available only through the internet
    - Release notes: gbrel.txt
    FTP sites:
    ftp://ftp.ncbi.nih.gov/genbank/
    ftp://genbank.sdsc.edu/pub
    ftp://bio-mirror.net/biomirror/genbank

12 Molecular Databases
    - Primary Databases
        - Original submissions by experimentalists
        - Database staff organize but don't add additional information
        - Example: GenBank
    - Derivative Databases
        - Human curated: compilation and correction of data
            - Example: SWISS-PROT, NCBI RefSeq mRNA
        - Computationally derived
            - Example: UniGene
        - Combinations
            - Example: NCBI Genome Assembly

13 Primary vs. Derivative Databases
    [Diagram labels]
    - Primary: GenBank, fed by sequencing centers and labs; updated ONLY by submitters
        - Divisions: EST, STS, GSS, HTG, PRI, ROD, PLN, MAM, BCT, INV, VRT, PHG, VRL
    - Derivative: UniGene, UniSTS, RefSeq; updated by NCBI
        - RefSeq sources: Entrez Gene and Genomes pipelines, annotation pipeline, curators

14 NCBI The GenBank Record

15 GenBank Records: The Flatfile Format
    - Header
    - Feature Table
    - Sequence

16 A Typical GenBank Record
    LOCUS       NM_019570  4279 bp  mRNA  linear  INV 28-OCT-2004
    DEFINITION  Mus musculus REV1-like (S. cerevisiae) (Rev1l), mRNA
    ACCESSION   NM_019570
    VERSION     NM_019570.3 GI:50811869
    KEYWORDS    .
    [callout: DEFINITION = Title in Entrez]
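
The header fields on this slide are keyword/value lines, so they can be picked apart with a few lines of Python. This is a minimal sketch, not NCBI's parser: the function name `parse_genbank_header` is hypothetical, and it assumes every field fits on one line, which holds for the five lines shown here but not for real records with wrapped DEFINITION lines.

```python
def parse_genbank_header(text):
    """Map each header keyword (LOCUS, ACCESSION, ...) to the rest of its line.

    Simplifying assumption: one line per keyword, no continuation lines.
    """
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        keyword, _, value = line.partition(" ")
        fields[keyword] = value.strip()
    return fields


# Header lines as shown on the slide.
record = """\
LOCUS       NM_019570  4279 bp  mRNA  linear  INV 28-OCT-2004
DEFINITION  Mus musculus REV1-like (S. cerevisiae) (Rev1l), mRNA
ACCESSION   NM_019570
VERSION     NM_019570.3 GI:50811869
KEYWORDS    .
"""

header = parse_genbank_header(record)

# VERSION carries two identifiers: accession.version and the GI number.
accession, version_number = header["VERSION"].split()[0].split(".")
gi = header["VERSION"].split("GI:")[1]
print(accession, version_number, gi)
```

The VERSION line is the one worth splitting further, since it distinguishes the stable accession from the revision of the sequence.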

17 NCBI GenBank Record: Feature Table Entrez

18 NCBI GenBank Record: Feature Table GenPept identifier Blast Entrez

19 NCBI GenBank Record: sequence skip Blast

20 NCBI NCBI Homepage http://www.ncbi.nlm.nih.gov/

21 NCBI Homepage [screenshot callouts: Entrez, BLAST, Mendelian Inheritance in Man]

22 NCBI Online Help

23 NCBI Using Entrez An integrated database search and retrieval system

24 Entrez: Neighboring and Hard Links
    [Diagram labels]
    - Databases: PubMed abstracts, Nucleotide sequences, Protein sequences, 3-D Structure (MMDB), Genomes, Taxonomy
    - Neighboring methods: Word weight, BLAST, VAST, Phylogeny

25 GEO (Gene Expression Omnibus): a database that collects and stores microarray gene expression data.

29 Unigene

32 Database Searching with Entrez
    - Using limits and field restriction to find mouse GAPD
    - Linking and neighboring with mouse GAPD

33 NCBI Entrez Nucleotides Mouse

34 NCBI Document Summaries: Mouse[All Fields] 7 million records

35 Data Rich, Knowledge Poor. Do not drown yourself in the "ocean of data and information"; go and find the "islands of knowledge".

36 What are data, information, and knowledge? Be sure to note that the repositories bioinformatics uses for storage today are called DATABASES.

37 NCBI Entrez Nucleotides: Limits: Preview/Index Mouse

38 Entrez Nucleotides: Limits (search term: Mouse)
    - Field Restriction: Accession, All Fields, Author Name, EC/RN Number, Feature key, Filter, Gene Name, Issue, Journal Name, Keyword, Modification Date, Organism, Page Number, Primary Accession, Properties, Protein Name, Publication Date, SeqID String, Sequence Length, Substance Name, Text Word, Title Word, Uid, Volume
    - Only From: RefSeq, GenBank, EMBL, DDBJ
    - Exclude unwanted categories of sequences
    - Molecule: Genomic DNA/RNA, mRNA, rRNA
    - Gene Location: Genomic DNA/RNA, Mitochondrion, Chloroplast

39 NCBI Entrez Nucleotides: Limits: Organism Mouse

40 NCBI Document Summaries: Mouse[Organism] 7,247,131[All Fields] -6,850,905[Organism] 397,226

41 NCBI Exclude Bulk Sequences, mRNA

42 NCBI 502497

43 NCBI Preview / Index

44 NCBI Adding Terms: Preview/Index Search History

45 NCBI glyceraldehyde 3 phosphate dehydrogenase

46 mouse AND glyceraldehyde 3 phosphate dehydrogenase[Title]
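
The slides build this query through the Entrez web form. As a rough sketch of the same idea in code, the snippet below assembles a field-restricted term and wraps it in a URL for NCBI's E-utilities `esearch` endpoint. The use of E-utilities is an assumption on my part (the deck only shows the web interface), the helper names `field_term` and `build_esearch_url` are hypothetical, and the `[Organism]` tag on "mouse" follows the Organism limit applied on the earlier slides. No request is sent; record counts today would differ from the 2005 figures anyway.

```python
from urllib.parse import urlencode

# Assumption: NCBI E-utilities esearch endpoint; the slides use the web form.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def field_term(text, field):
    """Restrict a search term to one Entrez field, e.g. mouse[Organism]."""
    return f"{text}[{field}]"


def build_esearch_url(db, term):
    """Return an esearch URL for the given database and query term."""
    return EUTILS_ESEARCH + "?" + urlencode({"db": db, "term": term})


term = " AND ".join([
    field_term("mouse", "Organism"),
    field_term("glyceraldehyde 3 phosphate dehydrogenase", "Title"),
])
url = build_esearch_url("nucleotide", term)
print(term)
print(url)
```

Keeping the term as a plain string mirrors what Entrez shows in its query box, so the same text can be pasted into the web form.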

47 NCBI 161 Mouse GAPD Records

49 19 3

50 NCBI History

52 #18 AND #6

54 Displaying Records

55 Displaying Mouse GAPD Records
    - Formats: Summary, Brief, GenBank, ASN.1, FASTA, GI list
    - Links and neighbors (related records): LinkOut, PubMed Links, Protein Links, Nucleotide Neighbors, PopSet Links, Structure Links, Genome Links, Taxonomy Links, OMIM Links

58 Entrez GenBank / GenPept

59 FASTA Format
    >gi|193425|gb|M60978.1|MUSGAPDS Mus musculus testis-specific isoform of glycerald
    GGCAGCCAGGCCATGAGATCTTAGGCCATGTCGAGACGTGACGTGGTCCTTACCAATGTTACTGTTGTCC
    AGCTACGGCGGGACCGATGCCCATGCCCATGCCCATGCCCATGTCCATGCCCATGCCCTGTGATCAGACC
    ACCTCCACCCAAGCTTGAGGATCCACCACCCACGGTTGAAGAACAGCCACCGCCACCGCCGCCGCCACCT
    CCACCTCCACCACCACCTCCTCCTCCTCCTCCACCCCAGATAGAGCCAGACAAGTTTGAAGAGGCTCCCC
    CTCCCCCTCCCCCTCCTCCTCCTCCTCCCCCTCCCCCTCCTCCACCACTCCAAAAGCCAGCTAGAGAGCT
    GACAGTGGGTATCAATGGATTTGGACGCATTGGTCGTCTGGTGCTGCGAGTCTGCATGGAGAAGGGCATT
    AGGGTGGTAGCAGTGAATGACCCATTCATTGATCCAGAATACATGGTTTACATGTTCAAATATGACTCCA
    CACATGGTAGATACAAAGGAAACGTGGAACATAAGAATGGACAACTAGTTGTGGACAACCTTGAGATCAA
    CACGTACCAGTGCAAAGACCCTAAAGAAATCCCCTGGAGCTCTATAGGGAATCCCTACGTGGTGGAGTGT
    ACAGGCGTCTATCTGTCCATCGAGGCAGCTTCGGCACATATTTCATCTGGTGCCAGGCGTGTGGTGGTCA
    CTGCACCCTCCCCCGATGCACCCATGTTTGTCATGGGAGTGAACGAGAAGGACTATAACCCTGGCTCTAT
    GACCATTGTCAGCAATGCATCCTGTACCACCAACTGCCTGGCTCCTCTCGCCAAGGTTATTCATGAAAAC
    TTCGGGATCGTGGAAGGGCTAATGACCACAGTCCATTCCTACACAGCCACTCAGAAGACAGTGGATGGGC
    CATCAAAGAAGGACTGGCGAGGTGGCCGCGGCGCTCACCAAAACATCATCCCATCGTCCACTGGGGCTGC
    CAAGGCTGTAGGCAAAGTCATCCCAGAGCTCAAAGGGAAGCTAACAGGAATGGCATTCCGGGTGCCAACC
    CCAAACGTGTCAGTTGTGGACCTGACCTGCCGCCTGGCCAAGCCTGCTTCTTACTCGGCTATCACGGAGG
    CTGTGAAAGCTGCAGCCAAGGGACCTTTGGCTGGCATCCTTGCTTACACAGAGGACCAGGTGGTCTCCAC
    GGACTTTAACGGCAATCCCCATTCTTCCATCTTTGATGCTAAGGCTGGAATTGCCCTCAATGACAACTTC
    GTGAAGCTTGTTGCCTGGTACGACAACGAATATGGCTACAGTAACCGAGTGGTCGACCTCCTCCGCTACA
    TGTTTAGCCGAGAGAAGTAACACAAAAGGCCCCTCCTTGCTCCCCTGCGCACCTCGCGTTCCTGACTTCG
    GCTTCCACTCAAAGGCGCCGCCACCGGGTCAACAATGAAATAAAAACGAGAATGCGC

    FASTA Definition Line: >gi|193425|gb|M60978.1|MUSGAPDS
        gi number:            193425
        Database identifier:  gb
        Accession number:     M60978.1
        Locus name:           MUSGAPDS

    Database Identifiers
        gb   GenBank
        emb  EMBL
        dbj  DDBJ
        sp   SWISS-PROT
        pdb  Protein Databank
        pir  PIR
        prf  PRF
        ref  RefSeq
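
The definition line on this slide packs four identifiers into one pipe-delimited token. The sketch below splits it into the parts the slide labels. The function name `parse_defline` is hypothetical, and the code assumes the exact `gi|number|db|accession|locus` layout shown here; other sources listed in the identifier table may arrange their fields differently.

```python
# Database tags copied from the slide's "Database Identifiers" list.
DB_TAGS = {
    "gb": "GenBank", "emb": "EMBL", "dbj": "DDBJ", "sp": "SWISS-PROT",
    "pdb": "Protein Databank", "pir": "PIR", "prf": "PRF", "ref": "RefSeq",
}


def parse_defline(defline):
    """Split a '>gi|number|db|accession|locus description' line into fields."""
    ident, _, description = defline.lstrip(">").partition(" ")
    parts = ident.split("|")
    if len(parts) != 5 or parts[0] != "gi":
        raise ValueError(f"unexpected identifier layout: {ident!r}")
    _, gi, tag, accession, locus = parts
    return {
        "gi": gi,
        "database": DB_TAGS.get(tag, tag),
        "accession": accession,
        "locus": locus,
        "description": description,
    }


# Definition line from the slide (its description is truncated there too).
fields = parse_defline(
    ">gi|193425|gb|M60978.1|MUSGAPDS Mus musculus testis-specific isoform of glycerald"
)
print(fields["gi"], fields["database"], fields["accession"], fields["locus"])
```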

63 Abstract Syntax Notation: ASN.1
    Seq-entry ::= set {
      level 1,
      class nuc-prot,
      descr {
        title "Mus musculus testis-specific isoform of glyceraldehyde 3-phosphate dehydrogenase (Gapd-S) mRNA, and translated products",
        update-date std {
          year 1994,
          month 11,
          day 9 },
        source {
          org {
            taxname "Mus musculus",
            common "house mouse",
            db {
              {
                db "taxon",
                tag id 10090 } },
    [Views derived from ASN.1: FASTA Nucleotide, FASTA Protein, GenPept, GenBank]

65 NCBI Toolbox: Toolbox Sources
    /*****************************************************************************
    *
    *   asn2ff.c
    *   convert an ASN.1 entry to flat file format, using the FFPrintArrayPtrs.
    *
    *****************************************************************************/
    #include                /* header name lost in extraction */
    #include "asn2ff.h"
    #include "asn2ffp.h"
    #include "ffprint.h"
    #include                /* header name lost in extraction */
    #ifdef ENABLE_ID1
    #include                /* header name lost in extraction */
    #endif

    FILE *fpl;

    Args myargs[] = {
      {"Filename for asn.1 input","stdin",NULL,NULL,TRUE,'a',ARG_FILE_IN,0.0,0,NULL},
      {"Input is a Seq-entry","F", NULL,NULL,TRUE,'e',ARG_BOOLEAN,0.0,0,NULL},
      {"Input asnfile in binary mode","F",NULL,NULL,TRUE,'b',ARG_BOOLEAN,0.0,0,NULL},
      {"Output Filename","stdout", NULL,NULL,TRUE,'o',ARG_FILE_OUT,0.0,0,NULL},
      {"Show Sequence?","T", NULL,NULL,TRUE,'h',ARG_BOOLEAN,0.0,0,NULL},

    ftp> open ncbi.nlm.nih.gov
    ftp> cd toolbox
    ftp> cd ncbi_tools
    ftp://ncbi.nlm.nih.gov/toolbox/ncbi_tools

66 NCBI Protein Neighbors-Structure Links

76 Protein Neighbors-Structure Links [screenshot callouts: Related Proteins, Structure Links, Cn3D, GAPD Structure]

77 NCBI Advanced Neighbors: BLink

78 NCBI BLink

79 NCBI Online Books

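80 Advice
    - Never let yourself become a mere collector of data, and do not turn yourself into just a database (that is the computer's job). Become someone who processes this information, and make yourself a person with knowledge!
    - Hua Luogeng (华罗庚): "In reading, go from thin to thick, then from thick to thin."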

82 Entrez Structures Molecular Modeling Database (MMDB) and Cn3D

83 MMDB: Molecular Modeling Data Base
    - Derived from experimentally determined PDB records
    - Value added to PDB records, including:
        - Addition of explicit chemical graph information
        - Validation
        - Inclusion of Taxonomy, Citation, and other information
        - Conversion to parseable ASN.1 data description language
    - Structure neighbors determined by Vector Alignment Search Tool (VAST)

84 NCBI Searching MMDB 1CET

85 Structure Summary [screenshot callouts: Cn3D viewer, VAST neighbors, BLAST neighbors]

86 NCBI Cn3D : Displaying Structures Chloroquine

87 NCBI Structure Neighbors

88 NCBI Structural Alignments Chloroquine NADH

89 Why do we need similarity searching?
    - Identification and annotation
        - Incomplete or no annotations (GenBank)
        - Incorrectly annotated sequences
    - Evolutionary relationships
        - Homologous molecules may have similar functions, but it ain't necessarily so!

90 Basic Local Alignment Search Tool
    - Widely used similarity search tool
    - Heuristic approach based on the Smith-Waterman algorithm
    - Finds best local alignments
    - Provides statistical significance
    - All combinations (DNA/Protein) of query and database:
        - DNA vs DNA
        - DNA translation vs Protein
        - Protein vs Protein
        - Protein vs DNA translation
        - DNA translation vs DNA translation
    - WWW, email server, standalone, and network clients
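
To make "best local alignment" concrete, here is a small sketch of the Smith-Waterman recurrence that the slide names as BLAST's starting point. This is the exact dynamic-programming algorithm, not BLAST's heuristic; the function name and the match/mismatch/gap values are illustrative assumptions, and the gap cost is linear rather than affine to keep the example short.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Each cell holds the best score of an alignment ending at that pair of
    positions; clamping at 0 lets an alignment start anywhere, which is
    what makes the result local rather than global.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# The shared core "ACGT" is found even though the flanks differ.
print(smith_waterman_score("TTACGTTT", "GGACGTGG"))
```

The cost is proportional to the product of the two sequence lengths, which is the reason a database-scale search tool resorts to a heuristic instead.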

91 Local Alignment Statistics
    High scores of local alignments between two random sequences follow the Extreme Value Distribution.
    Expected number with score S or greater (for ungapped alignments):
        $E = K m n e^{-\lambda S}$   or   $E = m n 2^{-S'}$
        $K$ = scale for search space
        $\lambda$ = scale for scoring system
        $S'$ = bit score = $(\lambda S - \ln K) / \ln 2$
    http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
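
The two forms of the E-value on this slide are the same quantity written with a raw score or with a bit score. The sketch below checks that numerically. Here m and n are taken to be the query and database lengths; the values used for lambda, K, m, n, and S are illustrative placeholders chosen for the demonstration, not parameters taken from the slides.

```python
import math


def bit_score(raw_score, lam, k):
    """S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_from_raw(raw_score, m, n, lam, k):
    """E = K * m * n * exp(-lambda * S)."""
    return k * m * n * math.exp(-lam * raw_score)


def expect_from_bits(bits, m, n):
    """E = m * n * 2 ** (-S')."""
    return m * n * 2.0 ** (-bits)


# Placeholder parameters (assumptions for illustration only).
lam, k = 0.318, 0.134
m, n = 250, 50_000_000
raw = 80

e_raw = expect_from_raw(raw, m, n, lam, k)
e_bits = expect_from_bits(bit_score(raw, lam, k), m, n)
print(e_raw, e_bits)
```

Because the bit score absorbs lambda and K, two searches run with different scoring systems can be compared through S' and the search-space size alone.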

92 Scoring Systems
    - Nucleic acids: identity matrix
    - Proteins
        - Position Independent Matrices
            - PAM Matrices (Percent Accepted Mutation)
                - Implicit model of evolution
                - Higher PAM numbers are all calculated from PAM1
                - PAM250 widely used
            - BLOSUM Matrices (BLOck SUbstitution Matrices)
                - Empirically determined from alignment of conserved blocks
                - Each includes information up to a certain level of identity
                - BLOSUM62 widely used
        - Position Specific Score Matrices (PSSM)
            - PSI and RPS BLAST

93 BLOSUM62
    A   4
    R  -1   5
    N  -2   0   6
    D  -2  -2   1   6
    C   0  -3  -3  -3   9
    Q  -1   1   0   0  -3   5
    E  -1   0   0   2  -4   2   5
    G   0  -2   0  -1  -3  -2  -2   6
    H  -2   0   1  -1  -3   0   0  -2   8
    I  -1  -3  -3  -3  -1  -3  -3  -4  -3   4
    L  -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4
    K  -1   2   0  -1  -3   1   1  -2  -1  -3  -2   5
    M  -1  -1  -2  -3  -1   0  -2  -3  -2   1   2  -1   5
    F  -2  -3  -3  -3  -2  -3  -3  -3  -1   0   0  -3   0   6
    P  -1  -2  -2  -1  -3  -1  -1  -2  -2  -3  -3  -1  -2  -4   7
    S   1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4
    T   0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5
    W  -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  11
    Y  -2  -2  -2  -3  -2  -1  -2  -3   2  -1  -1  -2  -1   3  -3  -2  -2   2   7
    V   0  -3  -3  -3  -1  -2  -2  -3  -3   3   1  -2   1  -1  -2  -2   0  -3  -1   4
    X   0  -1  -1  -1  -2  -1  -1  -1  -1  -1  -1  -1  -1  -1  -2   0   0  -2  -1  -1  -1
        A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   X

    - Common amino acids have low weights
    - Rare amino acids have high weights
    - Negative for less likely substitutions
    - Positive for more likely substitutions

94 Position Specific Substitution Rates [figure labels: Active site serine; Typical serine]

95 Position Specific Score Matrix (PSSM)
            A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
    206 D   0  -2   0   2  -4   2   4  -4  -3  -5  -4   0  -2  -6   1   0  -1  -6  -4  -1
    207 G  -2  -1   0  -2  -4  -3  -3   6  -4  -5  -5   0  -2  -3  -2  -2  -1   0  -6  -5
    208 V  -1   1  -3  -3  -5  -1  -2   6  -1  -4  -5   1  -5  -6  -4   0  -2  -6  -4  -2
    209 I  -3   3  -3  -4  -6   0  -1  -4  -1   2  -4   6  -2  -5  -5  -3   0  -1  -4   0
    210 S  -2  -5   0   8  -5  -3  -2  -1  -4  -7  -6  -4  -6  -7  -5   1  -3  -7  -5  -6
    211 S   4  -4  -4  -4  -4  -1  -4  -2  -3  -3  -5  -4  -4  -5  -1   4   3  -6  -5  -3
    212 C  -4  -7  -6  -7  12  -7  -7  -5  -6  -5  -5  -7  -5   0  -7  -4  -4  -5   0  -4
    213 N  -2   0   2  -1  -6   7   0  -2   0  -6  -4   2   0  -2  -5  -1  -3  -3  -4  -3
    214 G  -2  -3  -3  -4  -4  -4  -5   7  -4  -7  -7  -5  -4  -4  -6  -3  -5  -6  -6  -6
    215 D  -5  -5  -2   9  -7  -4  -1  -5  -5  -7  -7  -4  -7  -7  -5  -4  -4  -8  -7  -7
    216 S  -2  -4  -2  -4  -4  -3  -3  -3  -4  -6  -6  -3  -5  -6  -4   7  -2  -6  -5  -5
    217 G  -3  -6  -4  -5  -6  -5  -6   8  -6  -8  -7  -5  -6  -7  -6  -4  -5  -6  -7  -7
    218 G  -3  -6  -4  -5  -6  -5  -6   8  -6  -7  -7  -5  -6  -7  -6  -2  -4  -6  -7  -7
    219 P  -2  -6  -6  -5  -6  -5  -5  -6  -6  -6  -7  -4  -6  -7   9  -4  -4  -7  -7  -6
    220 L  -4  -6  -7  -7  -5  -5  -6  -7   0  -1   6  -6   1   0  -6  -6  -5  -5  -4   0
    221 N  -1  -6   0  -6  -4  -4  -6  -6  -1   3   0  -5   4  -3  -6  -2  -1  -6  -1   6
    222 C   0  -4  -5  -5  10  -2  -5  -5   1  -1  -1  -5   0  -1  -4  -1   0  -5   0   0
    223 Q   0   1   4   2  -5   2   0   0   0  -4  -2   1   0   0   0  -1  -1  -3  -3  -4
    224 A  -1  -1   1   3  -4  -1   1   4  -3  -4  -3  -1  -2  -2  -3   0  -2  -2  -2  -3

    - Serine scored differently in these two positions
    - Active site nucleophile
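
The point of this slide is that a PSSM scores the same residue differently depending on the column. The sketch below looks up serine and alanine at positions 211 and 216, using those two rows exactly as printed on the slide. The function name `pssm_score` is hypothetical.

```python
# Column order and two rows copied from the slide's PSSM.
COLUMNS = "A R N D C Q E G H I L K M F P S T W Y V".split()
PSSM_ROWS = {
    211: "4 -4 -4 -4 -4 -1 -4 -2 -3 -3 -5 -4 -4 -5 -1 4 3 -6 -5 -3",
    216: "-2 -4 -2 -4 -4 -3 -3 -3 -4 -6 -6 -3 -5 -6 -4 7 -2 -6 -5 -5",
}
PSSM = {
    position: dict(zip(COLUMNS, map(int, row.split())))
    for position, row in PSSM_ROWS.items()
}


def pssm_score(position, residue):
    """Score for placing `residue` at alignment column `position`."""
    return PSSM[position][residue]


for residue in ("S", "A"):
    print(residue, pssm_score(211, residue), pssm_score(216, residue))
```

At position 211 alanine scores as well as serine, while at position 216 alanine is penalised and serine is strongly favoured. A single BLOSUM62 entry cannot express that difference, since it gives one score per residue pair regardless of position.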

96 Gapped Alignments
    - Gapping provides more biologically realistic alignments
    - Statistical behavior is not completely understood for gapped alignments
    - Gapped BLAST parameters must be found by simulations for each matrix
    - Affine gap costs = -(a + bk), for a gap of length k
        - a = gap open penalty
        - b = gap extend penalty
    - A gap of length 1 receives the score -(a + b)
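
The affine formula on this slide is easy to misread, since the first gapped position pays both the open and the extend penalty. A minimal sketch follows; the function name is hypothetical and the penalties a = 11, b = 1 are illustrative values chosen for the example, not taken from the slides.

```python
def affine_gap_score(length, gap_open, gap_extend):
    """Score of one gap of `length` positions: -(a + b * k)."""
    if length < 1:
        raise ValueError("a gap has at least one position")
    return -(gap_open + gap_extend * length)


# Illustrative penalties (assumed): a = 11, b = 1.
for k in (1, 2, 5):
    print(k, affine_gap_score(k, gap_open=11, gap_extend=1))
```

One long gap is much cheaper than several short ones of the same total length, which matches the biological intuition that a single insertion or deletion event can span many residues.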

97 NCBI Intermission

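98 Advice
    - Never let yourself become a mere collector of data, and do not turn yourself into just a database (that is the computer's job). Become someone who processes this information, and make yourself a person with knowledge!
    - Hua Luogeng (华罗庚): "In reading, go from thin to thick, then from thick to thin."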

