NCBI Field Guide NCBI Molecular Biology Resources March 2007 NCBI Databases.



The National Center for Biotechnology Information
Created in 1988 as a part of the National Library of Medicine at NIH
– Establish public databases
– Research in computational biology
– Develop software tools for sequence analysis
– Disseminate biomedical information
Bethesda, MD

Web Access:

NCBI Databases and Services
GenBank: largest sequence database
Free public access to biomedical literature
– PubMed: free Medline
– PubMed Central: full text online access
Entrez: integrated molecular and literature databases
BLAST: highest volume sequence search service
VAST: structure similarity searches
Software and Databases

Types of Databases
Primary Databases
– Original submissions by experimentalists
– Content controlled by the submitter
– Examples: GenBank, SNP, GEO
Derivative Databases
– Built from primary data
– Content controlled by third party (NCBI)
– Examples: RefSeq, TPA, RefSNP, UniGene, NCBI Protein, Structure, Conserved Domain

Entrez Nucleotides
Primary
  GenBank / EMBL / DDBJ      86,766,287
Derivative
  RefSeq                      1,715,255
  Third Party Annotation          5,312
  PDB                             7,334
Total                        88,494,392

What is GenBank? NCBI’s Primary Sequence Database
Nucleotide only sequence database
Archival in nature
– Historical
– Reflective of submitter point of view (subjective)
– Redundant
GenBank Data
– Direct submissions (traditional records)
– Batch submissions (EST, GSS, STS)
– ftp accounts (genome data)
Three collaborating databases
– GenBank
– DNA Database of Japan (DDBJ)
– European Molecular Biology Laboratory (EMBL) Database

International Sequence Database Collaboration
[Diagram: three collaborating databases, each receiving submissions and updates: GenBank (NCBI, NIH; retrieval via Entrez), EMBL (EBI; retrieval via SRS), and DDBJ (NIG, CIB; retrieval via getentry).]

GenBank: NCBI’s Primary Sequence Database
ftp://ftp.ncbi.nih.gov/genbank/
Release 158, February
– ,639,920 records
– 157,335,689,977 total bases
– 263 gigabytes (non-WGS)
– 1115 files (non-WGS)
Full release every two months
Incremental updates daily
Available only via ftp

The Growth of GenBank
Release 158
– Non-WGS: 71.3 billion bases
– WGS: 86.0 billion bases
Doubling time months
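
A doubling time like the one this slide refers to can be estimated from any two release sizes, if one assumes exponential growth between them. The sketch below is only an illustration: the function name and the example figures are hypothetical and are not GenBank statistics.

```python
import math

def doubling_time(size_start: float, size_end: float, elapsed_months: float) -> float:
    """Months needed to double, assuming exponential growth between two measurements."""
    if size_start <= 0 or size_end <= size_start or elapsed_months <= 0:
        raise ValueError("need positive, increasing sizes and a positive time span")
    # Exponential model: size_end = size_start * exp(rate * elapsed_months)
    rate = math.log(size_end / size_start) / elapsed_months
    return math.log(2) / rate

# Hypothetical figures: a database growing from 40 to 80 billion bases in 18 months
print(round(doubling_time(40e9, 80e9, 18), 1))
```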

Organization of GenBank: Traditional Divisions
Records are divided into 18 divisions: 12 traditional and 6 bulk.
Traditional divisions
– Direct submissions (Sequin and BankIt)
– Accurate
– Well characterized
PRI  Primate
PLN  Plant and Fungal
BCT  Bacterial and Archaeal
INV  Invertebrate
ROD  Rodent
VRL  Viral
VRT  Other Vertebrate
MAM  Mammalian
PHG  Phage
SYN  Synthetic (cloning vectors)
ENV  Environmental Samples
UNA  Unannotated
Entrez query: gbdiv_xxx[Properties]

Organization of GenBank: Bulk Divisions
Records are divided into 18 divisions: 12 traditional and 6 bulk.
Bulk divisions
– Batch submission ( and FTP)
– Inaccurate
– Poorly characterized
EST  Expressed Sequence Tag
GSS  Genome Survey Sequence
HTG  High Throughput Genomic
STS  Sequence Tagged Site
HTC  High Throughput cDNA
PAT  Patent
Entrez query: gbdiv_xxx[Properties]
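
The `gbdiv_xxx[Properties]` pattern can be filled in mechanically from a division code. The following is a minimal sketch; the helper names are hypothetical, and the E-utilities `esearch.fcgi` address and the `nucleotide` database name are assumptions that do not come from the slides. Nothing is sent over the network; the code only builds strings.

```python
from urllib.parse import urlencode

# Division codes as listed on the two division slides.
TRADITIONAL = {"PRI", "PLN", "BCT", "INV", "ROD", "VRL",
               "VRT", "MAM", "PHG", "SYN", "ENV", "UNA"}
BULK = {"EST", "GSS", "HTG", "STS", "HTC", "PAT"}

def division_term(code: str) -> str:
    """Turn a three-letter division code into an Entrez query term."""
    code = code.upper()
    if code not in TRADITIONAL | BULK:
        raise ValueError(f"unknown GenBank division: {code}")
    return f"gbdiv_{code.lower()}[Properties]"

def esearch_url(term: str, db: str = "nucleotide") -> str:
    """Build (but do not request) an E-utilities search URL; endpoint is an assumption."""
    base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    return base + "?" + urlencode({"db": db, "term": term})

print(division_term("EST"))
print(esearch_url(division_term("PRI")))
```

Terms built this way can be combined with other Entrez fields using AND, OR, and NOT in the usual way.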

A Traditional GenBank Record: The Flatfile Format
The record has three parts: Header, Feature Table, Sequence.

LOCUS       AY bp mRNA linear PLN 04-MAY-2004
DEFINITION  Malus x domestica (E,E)-alpha-farnesene synthase (AFS1) mRNA,
            complete cds.
ACCESSION   AY
VERSION     AY GI:
KEYWORDS    .
SOURCE      Malus x domestica (cultivated apple)
  ORGANISM  Malus x domestica
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots;
            rosids; eurosids I; Rosales; Rosaceae; Maloideae; Malus.
REFERENCE   1 (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Cloning and functional expression of an (E,E)-alpha-farnesene
            synthase cDNA from peel tissue of apple fruit
  JOURNAL   Planta 219, (2004)
REFERENCE   2 (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Direct Submission
  JOURNAL   Submitted (18-NOV-2002) PSI-Produce Quality and Safety Lab,
            USDA-ARS, Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD
            20705, USA
REFERENCE   3 (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Direct Submission
  JOURNAL   Submitted (25-JUN-2003) PSI-Produce Quality and Safety Lab,
            USDA-ARS, Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD
            20705, USA
  REMARK    Sequence update by submitter
COMMENT     On Jun 26, 2003 this sequence version replaced gi:
FEATURES             Location/Qualifiers
     source
                     /organism="Malus x domestica"
                     /mol_type="mRNA"
                     /cultivar="'Law Rome'"
                     /db_xref="taxon:3750"
                     /tissue_type="peel"
     gene
                     /gene="AFS1"
     CDS
                     /gene="AFS1"
                     /note="terpene synthase"
                     /codon_start=1
                     /product="(E,E)-alpha-farnesene synthase"
                     /protein_id="AAO "
                     /db_xref="GI: "
                     /translation="MEFRVHLQADNEQKIFQNQMKPEPEASYLINQRRSANYKPNIWK
                     NDFLDQSLISKYDGDEYRKLSEKLIEEVKIYISAETMDLVAKLELIDSVRKLGLANLF
                     EKEIKEALDSIAAIESDNLGTRDDLYGTALHFKILRQHGYKVSQDIFGRFMDEKGTLE
                     DFLHKNEDLLYNISLIVRLNNDLGTSAAEQERGDSPSSIVCYMREVNASEETARKNIK
                     GMIDNAWKKVNGKCFTTNQVPFLSSFMNNATNMARVAHSLYKDGDGFGDQEKGPRTHI
                     LSLLFQPLVN"
ORIGIN
        1 ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat
       61 tcagagttca cttgcaagct gataatgagc agaaaatttt tcaaaaccag atgaaacccg
      121 aacctgaagc ctcttacttg attaatcaaa gacggtctgc aaattacaag ccaaatattt
      181 ggaagaacga tttcctagat caatctctta tcagcaaata cgatggagat gagtatcgga
      241 agctgtctga gaagttaata gaagaagtta agatttatat atctgctgaa acaatggatt
//
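
The three parts of a flatfile record (header, feature table, sequence) can be separated by looking for the `FEATURES` and `ORIGIN` keywords and the closing `//`. This is a minimal sketch under that assumption, not a full parser; the function name is hypothetical, and the toy record uses a made-up accession rather than the truncated one on the slide.

```python
def split_flatfile(record: str) -> dict:
    """Split one GenBank flatfile record into header, feature table, and sequence."""
    header, features, sequence = [], [], []
    current = header
    for line in record.splitlines():
        if line.startswith("//"):        # end-of-record marker
            break
        if line.startswith("FEATURES"):  # feature table begins
            current = features
        elif line.startswith("ORIGIN"):  # sequence block begins
            current = sequence
            continue
        current.append(line)
    # Sequence lines carry a position number and space-separated groups of bases.
    bases = "".join(ch for line in sequence for ch in line if ch.isalpha())
    return {"header": "\n".join(header),
            "features": "\n".join(features),
            "sequence": bases}

# Toy record: the accession XX000000 is hypothetical.
TOY = """\
LOCUS       XX000000  20 bp  mRNA  linear  PLN 01-JAN-2000
DEFINITION  Hypothetical example record.
ACCESSION   XX000000
VERSION     XX000000.1
FEATURES             Location/Qualifiers
     source          1..20
                     /organism="Malus x domestica"
ORIGIN
        1 ttcttgtatc ccaaacatct
//
"""

parts = split_flatfile(TOY)
print(parts["sequence"])
```

For real work, an established parser such as Biopython's `Bio.SeqIO` (format name `"genbank"`) is a safer choice than hand-written splitting.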

Traditional GenBank Record
ACCESSION U07418
VERSION U GI:
Accession: stable, reportable, universal
Version: tracks changes in sequence
GI number: NCBI internal use
Well annotated
The sequence is the data
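
Because the accession stays stable while the version tracks sequence changes, two identifiers refer to the same sequence data only when both parts match. A small sketch of that rule follows, assuming the usual `ACCESSION.VERSION` spelling; the names `SeqId`, `parse_version_field`, and `same_sequence` and the accession `X00000` are hypothetical.

```python
from typing import NamedTuple, Optional

class SeqId(NamedTuple):
    accession: str           # stable identifier for the record
    version: Optional[int]   # incremented when the sequence changes

def parse_version_field(value: str) -> SeqId:
    """Split an 'ACCESSION.VERSION' string; the version may be absent."""
    accession, _, version = value.strip().partition(".")
    return SeqId(accession, int(version) if version else None)

def same_sequence(a: SeqId, b: SeqId) -> bool:
    """True only when accession and version both match (and a version is known)."""
    return a.accession == b.accession and a.version is not None and a.version == b.version

old = parse_version_field("X00000.1")
new = parse_version_field("X00000.2")
print(old.accession == new.accession)  # same record
print(same_sequence(old, new))         # but the sequence was updated
```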

Bulk Divisions
Batch submission and htg ( and ftp); inaccurate; poorly characterized
Expressed Sequence Tag
– 1st pass single read cDNA
Genome Survey Sequence
– 1st pass single read gDNA
High Throughput Genomic
– incomplete sequences of genomic clones
Sequence Tagged Site
– PCR-based mapping reagents

GenBank Bulk Sequence: EST
Poorly characterized

ESTs in Entrez
Total               41 million records
Human               7.9 million
Mouse               4.7 million
Cow                 1.3 million
Rice                1.2 million
Zebrafish           1.2 million
Maize               1.2 million
Xenopus tropicalis  1.0 million
Rat                 0.9 million
Wheat               0.9 million
Chicken             0.6 million
Barley              0.4 million

HTG Division: Opossum Draft Sequences
– Unfinished sequences of BACs
– Gaps and unordered pieces
– Finished sequences move to traditional GenBank division

Whole Genome Shotgun Projects
ftp://ftp.ncbi.nih.gov/genbank/wgs/
>450 Projects
>400 Taxa
– 302 bacteria
– 128 eukaryotes, including 47 fungi, 53 animals, 3 flowering plants

Mammalian WGS
Duck-billed platypus, Nine-banded armadillo, Northern tree shrew, Domestic rabbit, Guinea pig, Mouse, Rat, Thirteen-lined ground squirrel, Small-eared galago, Orangutan, Human, Chimpanzee, Gorilla, Rhesus macaque, Tenrec, African elephant, Dog, Cat, Horse, European hedgehog, Eurasian shrew, Little brown bat, Cow, Gray short-tailed opossum

Derivative Databases

Entrez Protein: Derivative Database
Data Source                          Sequences
GenPept                              6,937,176
RefSeq                               3,359,561
Third Party Annotation                   5,136
Swiss Prot                             255,159
PIR                                     29,996
PRF                                     12,079
PDB                                     91,116
PAT Division                           669,035
Total                               10,690,223
BLAST nr total (no patents or env)   4,545,310

GenPept: GenBank CDS translations

FEATURES             Location/Qualifiers
     source
                     /organism="Homo sapiens"
                     /mol_type="mRNA"
                     /db_xref="taxon:9606"
                     /chromosome="3"
                     /map="3p22-p23"
     gene
                     /gene="MLH1"
     CDS
                     /gene="MLH1"
                     /note="homolog of S. cerevisiae PMS1 (Swiss-Prot Accession
                     Number P14242), S. cerevisiae MLH1 (GenBank Accession
                     Number U07187), E. coli MUTL (Swiss-Prot Accession Number
                     P23367), Salmonella typhimurium MUTL (Swiss-Prot Accession
                     Number P14161) and Streptococcus pneumoniae (Swiss-Prot
                     Accession Number P14160)"
                     /codon_start=1
                     /product="DNA mismatch repair protein homolog"
                     /protein_id="AAC "
                     /db_xref="GI:463989"
                     /translation="MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKS
                     TSIQVIVKEGGLKLIQIQDNGTGIRKEDLDIVCERFTTSKLQSFEDLASISTYGFRGE
                     ALASISHVAHVTITTKTADGKCAYRASYSDGKLKAPPKPCAGNQGTQITVEDLFYNIA
                     TRRKALKNPSEEYGKILEVVGRYSVHNAGISFSVKKQGETVADVRTLPNASTVDNIRS

>gi|463989|gb|AAC | DNA mismatch repair prote...
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...
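
A GenPept entry is the conceptual translation of a CDS feature. The sketch below illustrates the idea on the first 120 bases of the apple AFS1 mRNA shown in the flatfile slide, using the standard genetic code. Two things are assumptions: the slide does not show the CDS coordinates, so the code simply starts at the first ATG, and the function and variable names are hypothetical.

```python
from itertools import product

# Standard genetic code in the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(cds: str) -> str:
    """Translate complete codons in frame 1, stopping at the first stop codon."""
    cds = cds.upper().replace("U", "T")
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")  # X marks an unrecognized codon
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

# First two sequence lines of the AFS1 record from the flatfile slide.
fragment = (
    "ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat"
    "tcagagttca cttgcaagct gataatgagc agaaaatttt tcaaaaccag atgaaacccg"
).replace(" ", "")

start = fragment.find("atg")  # assumption: the CDS starts at the first ATG
protein = translate(fragment[start:])
print(protein)  # MEFRVHLQADNEQKIFQNQMKP
```

The result matches the beginning of the `/translation` qualifier in that record, which is what the GenPept entry stores.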

Redundant Proteins
Sources shown: PRF (prf), Swiss-Prot (sp), GenPept (gb), NCBI RefSeq (ref)

>gi|741682|prf|| A DNA mismatch repair protei...
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...
>gi|730028|sp|P40692|MLH1_HUMAN DNA mismatch repair...
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...
>gi|463989|gb|AAC | DNA mismatch repair prote...
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...
>gi| |ref|NP_ | MutL protein homolog 1...
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...
>gi| |gb|AAH | MutL protein homolog 1...
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...
>gi| |gb|AAA | DNA mismatch repair prot...
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...
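
Redundancy of this kind, where the same protein sequence appears under several identifiers from different source databases, can be collapsed by grouping records on the exact sequence. The following is a rough sketch of that idea only; the function name is hypothetical, the identifiers and short sequences are made up, and this is not a description of how NCBI builds its non-redundant sets.

```python
from collections import defaultdict

def collapse_identical(records):
    """Group (identifier, sequence) pairs that share exactly the same sequence."""
    groups = defaultdict(list)
    for ident, seq in records:
        groups[seq.upper()].append(ident)
    return dict(groups)

# Hypothetical records: three database entries for one protein, plus one other.
records = [
    ("prf|example1", "MSFVAGVIRR"),
    ("sp|example2", "MSFVAGVIRR"),
    ("gb|example3", "msfvagvirr"),
    ("pdb|example4", "SHMPIQVLPP"),
]

for seq, idents in collapse_identical(records).items():
    print(seq, idents)
```

Each distinct sequence is kept once, with the list of identifiers that point to it.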

Protein Sequences from Structures

>gi| |pdb|1B63|A Chain A, Mutl Complexed With Adpnp
SHMPIQVLPPQLANQIAAGEVVERPASVVKELVENSLDAGATRIDIDIERGGAKLIRIRDNGCGIKKDEL
ALALARHATSKIASLDDLEAIISLGFRGEALASISSVSRLTLTSRTAEQQEAWQAYAEGRDMNVTVKPAA
HPVGTTLEVLDLFYNTPARRKFLRTEKTEFNHIDEIIRRIALARFDVTINLSHNGKIVRQYRAVPEGGQK
ERRLGAICGTAFLEQALAIEWQHGDLTLRGWVADPNHTTPALAEIQYCYVNGRMMRDRLINHAIRQACED
KLGADQQPAFVLYLEIDPHQVDVNVHPAKHEVRFHQSRLVHDFIYQGVLSVLQ

Primary vs. Derivative Sequence Databases
[Diagram: sequencing centers and labs submit sequences to GenBank, which is updated ONLY by submitters. Algorithms (UniGene), curators (RefSeq), and genome assembly build derivative records from GenBank data; these are updated continually by NCBI.]

RefSeq: NCBI’s Derivative Sequence Database
ftp://ftp.ncbi.nih.gov/refseq/release/
Entrez query: srcdb_refseq[Properties]
Curated transcripts and proteins
– reviewed
– human, mouse, rat, fruit fly, zebrafish, arabidopsis
– microbial genomes (proteins), and more
Model transcripts and proteins
Assembled Genomic Regions (contigs)
– human genome
– mouse genome
– rat genome
Chromosome records
– human genome
– microbial
– organelle
Also listed on the slide
– chicken
– honeybee
– sea urchin

Selected RefSeq Accession Numbers
mRNAs and Proteins
  NM_  Curated mRNA
  NP_  Curated Protein
  NR_  Curated non-coding RNA
  XM_  Predicted mRNA
  XP_  Predicted Protein
  XR_  Predicted non-coding RNA
Gene Records
  NG_  Reference Genomic Sequence
Chromosome
  NC_  Microbial replicons, organelle genomes, human chromosomes
Assemblies
  NT_  Contig
  NW_  WGS Supercontig
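
Because the RefSeq accession series is distinct, the record type can be read off the two-letter prefix before the underscore. A minimal lookup sketch follows; the prefix table is taken from this slide, while the function name and the example accession numbers are hypothetical.

```python
# Prefix meanings as listed on the slide.
REFSEQ_PREFIXES = {
    "NM": "Curated mRNA",
    "NP": "Curated Protein",
    "NR": "Curated non-coding RNA",
    "XM": "Predicted mRNA",
    "XP": "Predicted Protein",
    "XR": "Predicted non-coding RNA",
    "NG": "Reference Genomic Sequence",
    "NC": "Microbial replicons, organelle genomes, human chromosomes",
    "NT": "Contig",
    "NW": "WGS Supercontig",
}

def classify_refseq(accession: str):
    """Return the RefSeq record type for an accession, or None if it is not RefSeq-style."""
    prefix, underscore, _ = accession.strip().partition("_")
    if not underscore:
        return None  # no underscore: not a RefSeq-style accession
    return REFSEQ_PREFIXES.get(prefix.upper())

print(classify_refseq("NM_000000.1"))  # hypothetical accession number
print(classify_refseq("XP_000000"))    # hypothetical accession number
print(classify_refseq("U07418"))       # GenBank-style accession, no RefSeq prefix
```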

GenBank to RefSeq

RefSeqs: Annotation Reagents
[Diagram labels: Genomic DNA (NC, NT, NW); Model mRNA (XM) (XR); Curated mRNA (NM) (NR); Model protein (XP); Curated protein (NP); GenBank sequences scanned against RefSeq ("Scanning.... = ?")]

RefSeq Benefits
– non-redundancy
– explicitly linked nucleotide and protein sequences
– updates to reflect current sequence data and biology
– data validation
– format consistency
– distinct accession series
– stewardship by NCBI staff and collaborators

Mouse Assembly
[Figure labels: RefSeq Contig; BAC; WGS; Other GenBank; RefSeq Transcript; UniGene Transcript]

Expressed Sequences
UniGene
GEO

What is UniGene?
A gene-oriented view of sequence entries
– MegaBlast based automated sequence clustering
– Now informed by genome hits (New!)
Nonredundant set of gene oriented clusters
– Each cluster a unique gene
– Information on tissue types and map locations
– Includes known genes and uncharacterized ESTs
– Useful for gene discovery and selection of mapping reagents
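
The general idea of turning pairwise similarity hits into gene-oriented clusters can be sketched as single-linkage grouping with a union-find structure. This is only an illustration of clustering from hits: it is not UniGene's actual pipeline, and the sequence names and hit pairs are hypothetical stand-ins for alignment results.

```python
def cluster(ids, hits):
    """Single-linkage clustering: sequences joined by any chain of hits share a cluster."""
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in hits:
        parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), set()).add(i)
    return sorted(clusters.values(), key=lambda members: sorted(members))

# Hypothetical input: one mRNA, three ESTs, and two significant pairwise hits.
ids = ["mRNA1", "est1", "est2", "est3"]
hits = [("mRNA1", "est1"), ("est1", "est2")]
for members in cluster(ids, hits):
    print(sorted(members))
```

Here `est3` has no hits and ends up alone, which corresponds to a cluster made of an uncharacterized EST.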

EST hits: Human mRNA
[Figure labels: Albumin mRNA; 5’ EST hits; 3’ EST hits]

UniGene
Chordates, Invertebrates, Plants, Fungi et al.

Xenopus laevis MLH1 Cluster
Uncharacterized ESTs

Human ALB Cluster

Expression Data

Other NCBI Databases
Structure: imported structures (PDB)
– Cn3D viewer, NCBI curation
CDD: conserved domain database
– Protein families (COGs and KOGs)
– Single domains (PFAM, SMART, CD)
dbSNP: nucleotide polymorphism
Gene: gene records
– Unifies LocusLink and Microbial Genomes

NCBI Structures and Domains

MMDB: Molecular Modeling Data Base
Derived from experimentally determined PDB records
Value added to PDB records including:
– Addition of explicit chemical graph information
– Validation (secondary structure elements)
– Inclusion of Taxonomy, Citation
– Conversion to ASN.1 data description language
Structure neighbors determined by Vector Alignment Search Tool (VAST)

Cn3D 4.1: Bacillus thuringiensis Toxin

VAST: Structure Neighbors
Vector Alignment Search Tool
– For each protein chain, locate SSEs (secondary structure elements) and represent them as individual vectors
– Align the vectors
[Figure labels: Human IL-4; IL-4 & Leptin]
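
Representing a secondary structure element as a vector can be pictured as taking the direction from its first to its last residue and then comparing such directions between structures. The sketch below shows only that geometric step; it is not VAST's alignment or scoring procedure, and the coordinates and function names are hypothetical.

```python
import math

def sse_vector(start, end):
    """Direction vector of an SSE from its start to its end coordinates (x, y, z)."""
    return tuple(e - s for s, e in zip(start, end))

def angle_between(u, v):
    """Angle in degrees between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        raise ValueError("zero-length vector")
    cosine = max(-1.0, min(1.0, dot / norm))  # guard against rounding error
    return math.degrees(math.acos(cosine))

# Hypothetical endpoints (in angstroms) for two helices in one chain.
helix_a = sse_vector((0.0, 0.0, 0.0), (15.0, 0.0, 0.0))
helix_b = sse_vector((2.0, 5.0, 0.0), (2.0, 20.0, 0.0))
print(round(angle_between(helix_a, helix_b), 1))
```

Two structures whose SSE vectors have similar relative angles and positions are candidates for being structure neighbors.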

Protein Domains
Structural Domain
– Discrete independently folding unit of a protein
Conserved Domain (sequence-based)
– Protein region with recognizable position-specific pattern of sequence conservation
Sequence-based domains often roughly correspond to structural domains
Domains often have distinct, identifiable functions

NCBI’s Conserved Domain Database
PSI-BLAST-based score matrices
Searchable with RPS-BLAST
Sources
– SMART
– PFAM
– COGs
– NCBI curated domains: structure informed alignments
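
A position-specific score matrix assigns each residue a different score at each position of a domain, so a sequence can be scanned for the window that best matches the pattern. The toy example below illustrates that scoring idea only: the matrix values, the default score, and the function names are hypothetical, and real RPS-BLAST searches involve far more than this ungapped scan.

```python
DEFAULT_SCORE = -1  # hypothetical score for residues not listed at a position

def score_window(pssm, window):
    """Sum the position-specific scores of the residues in one window."""
    return sum(column.get(residue, DEFAULT_SCORE)
               for column, residue in zip(pssm, window))

def best_match(pssm, sequence):
    """Return (start, score) of the best-scoring ungapped window in the sequence."""
    width = len(pssm)
    best = None
    for start in range(len(sequence) - width + 1):
        score = score_window(pssm, sequence[start:start + width])
        if best is None or score > best[1]:
            best = (start, score)
    return best

# Toy three-position matrix: one dict of residue scores per position.
toy_pssm = [{"G": 5, "A": 1}, {"E": 4, "D": 2}, {"A": 3}]
print(best_match(toy_pssm, "MKGEAL"))  # (2, 12)
```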

Src Domains
Four 3D domains
Three conserved domains

Structure vs Conserved Domain
[Figure labels: SH2, SH3, TyrKC; SH2 conserved phosphotyrosine binding residues]