Databases. Where to get data? GenBank – http://www.ncbi.nlm.nih.gov Protein Databases – SWISS-PROT:


Databases

Where to get data?
- GenBank: http://www.ncbi.nlm.nih.gov
- Protein databases: SWISS-PROT, PDB
- And many others

Bibliography

Growth in genome sequencing

Working Draft Sequence gaps

The reagent: databases
- Organized array of information
- Place where you put things in, and (if all is well) you should be able to get them out again
- Resource for other databases and tools
- Simplify the information space by specialization
- Bonus: allows you to make discoveries

A database contains files or tables, each containing numerous records and fields.
- Flat file: the simplest form, either a single large text file or a collection of text files.
- Relational database: the commonest type; it stores the data in a number of tables (with records and fields). The tables are linked to one another by a shared field called a key.

Flat file versus relational database model
- Queries against a relational database are written in dedicated query languages based on relational algebra.
- Structured Query Language (SQL) is commonly used.
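The idea of tables linked by a shared key and queried with SQL can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table names, column names and the in-memory database are assumptions made for the example, not part of any NCBI schema. The accessions are the RBP4 RefSeq identifiers quoted elsewhere in these slides.

```python
import sqlite3

# In-memory database; table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gene (
        gene_id INTEGER PRIMARY KEY,   -- the key shared with the other table
        symbol  TEXT NOT NULL
    );
    CREATE TABLE accession (
        accession TEXT PRIMARY KEY,
        source    TEXT NOT NULL,
        gene_id   INTEGER NOT NULL REFERENCES gene(gene_id)
    );
""")
conn.execute("INSERT INTO gene VALUES (1, 'RBP4')")
conn.executemany(
    "INSERT INTO accession VALUES (?, ?, ?)",
    [("NM_006744", "RefSeq mRNA", 1), ("NP_007635", "RefSeq protein", 1)],
)


def accessions_for(symbol):
    """Join the two tables on the shared key; return (accession, source) rows."""
    rows = conn.execute(
        "SELECT a.accession, a.source FROM accession AS a "
        "JOIN gene AS g ON g.gene_id = a.gene_id "
        "WHERE g.symbol = ? ORDER BY a.accession",
        (symbol,),
    )
    return rows.fetchall()
```

Calling accessions_for("RBP4") returns both accession rows because the gene_id key links the two tables. A flat file would need a text search for the same answer.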

XML (eXtensible Markup Language) is now a general tool for storage of data and information. XHTML is an application of XML; HTML is a closely related markup language (both HTML and XML derive from SGML). The key feature is the use of identifiers called tags. For example, a tag can mark the book title Understanding Bioinformatics, and a publisher tag can be defined and used to identify book publishers. Extraction from an XML file is similar to database querying.
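A minimal sketch of querying tagged data, using the standard-library xml.etree.ElementTree parser. The tag names (library, book, title, publisher) and the publisher names are hypothetical placeholders chosen for the example; only the book title comes from the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: tag names and publisher names are placeholders.
DOC = """
<library>
  <book>
    <title>Understanding Bioinformatics</title>
    <publisher>Example Press</publisher>
  </book>
  <book>
    <title>Another Title</title>
    <publisher>Other House</publisher>
  </book>
</library>
"""


def titles_by_publisher(xml_text, publisher):
    """Return the titles of all books whose <publisher> tag matches."""
    root = ET.fromstring(xml_text)
    return [
        book.findtext("title")
        for book in root.findall("book")
        if book.findtext("publisher") == publisher
    ]
```

Selecting records by the content of one tag and returning the content of another is the XML counterpart of a WHERE clause in a database query.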

Databases are built in layers (information system, query system, storage system, data). Examples at each layer, with a library analogy:
- Data: GenBank flat file, PDB file, interaction record. Analogy: title of a book, a book.
- Storage system: Oracle, MySQL, PC binary files, Unix text files. Analogy: boxes, bookshelves.
- Query system: indexed files, SQL, grep. Analogy: a list you look at, a catalogue.
- Information system: Google, Entrez, SRS. Analogy: the UBC library.

Bioinformatics Information Space, July 17, 1999
- Nucleotide sequences: 4,456,822
- Protein sequences: 706,862
- 3D structures: 9,780
- Human Unigene clusters: 75,832
- Maps and complete genomes: 10,870
- Different species nodes: 52,889
- dbSNP: 6,377
- RefGenes: 515
- Human contigs > 250 kb: 341 (4.9MB)
- PubMed records: 10,372,886
- OMIM records: 10,695

The challenge of the information space:
- Nucleotide records: 36,653,899
- Protein sequences: 4,436,362
- 3D structures: 19,640
- Interactions & complexes: 52,385
- Human Unigene clusters: 118,517
- Maps and complete genomes: 6,948
- Different taxonomy nodes: 283,121
- Human dbSNP: 13,179,601
- Human RefSeq records: 22,079
- bp in human contigs > 5,000 kb (116): 2,487,920,000
- PubMed records: 12,570,540
- OMIM records: 15,138
Feb

From a CBW student course evaluation: “I could probably live the rest of my life happily without ever seeing the ‘growth of GenBank’ curve … again.”

Databases
Primary (archival)
- GenBank/EMBL/DDBJ
- UniProt
- PDB
- Medline (PubMed)
- BIND
Secondary (curated)
- RefSeq
- Taxon
- UniProt
- OMIM
- SGD

Tools of trade for the “armchair scientist”
Databases
- PubMed and other NCBI databases
- Biochemical databases
- Protein domain databases
- Structural databases
- Genome comparison databases
Tools
- CDD / COGs
- VAST / FSSP

Distribution of the type of databases as classified at the NAR database web site

Types of databases
Archival or primary data
- Text: PubMed
- DNA sequence: GenBank
- Protein sequence: Entrez Proteins, TREMBL
- Protein structures: PDB
Curated or processed data
- DNA sequences: RefSeq, LocusLink, OMIM
- Protein sequences: SWISS-PROT, PIR
- Protein structures: SCOP, CATH, MMDB
- Genomes: Entrez Genomes, COGs
Nucleic Acids Research: Database Issue each January 1, with articles on ~100 different databases

4 ways to access protein and DNA sequences
[1] LocusLink with RefSeq
[2] Entrez
[3] UniGene: UniGene collects expressed sequence tags (ESTs) into clusters, in an attempt to form one gene per cluster. Use UniGene to study where your gene is expressed in the body, when it is expressed, and how abundant it is.
[4] ExPASy SRS

4 ways to access protein and DNA sequences
[1] LocusLink with RefSeq
[2] Entrez
[3] UniGene
[4] ExPASy SRS: There are many bioinformatics servers outside NCBI. Try ExPASy’s sequence retrieval system (ExPASy = Expert Protein Analysis System), or try ENSEMBL for a premier human genome web browser.

National Center for Biotechnology Information (NCBI) Page 24

The National Center for Biotechnology Information (NCBI)
Created in 1988 as a part of the National Library of Medicine, National Institutes of Health, to:
- Establish public databases
- Conduct research in computational biology
- Develop software tools for sequence analysis
- Disseminate biomedical information
Tools: BLAST (1990), Entrez (1992)
GenBank (1992)
Free MEDLINE (PubMed, 1997)
Other databases: dbEST, dbGSS, dbSTS, MMDB, OMIM, UniGene, Taxonomy, GeneMap, SAGE, LocusLink, RefSeq

What is GenBank?
Archival nucleotide sequence database.
Sample slogans: “Easy deposits, unlimited withdrawals, high interest”, “All bases covered”, “Billions and billions served”
Data are shared nightly among three collaborating databases:
- GenBank at NCBI - Bethesda, Maryland, USA
- DNA Database of Japan (DDBJ) at NIG - Mishima, Japan
- European Molecular Biology Laboratory Database (EMBL) at EBI - Hinxton, UK

Some guiding principles of working with GenBank
- GenBank is a nucleotide-centric view of the information space
- GenBank is a repository of all publicly available sequences
- In GenBank, records are grouped for various reasons
- Data in GenBank are only as good as what you put in

NCBI databases and their links (figure). Databases: article abstracts (Medline), nucleotide sequences, protein sequences, 3-D structure (MMDB), genomes, taxonomy. Links between them: word weight, BLAST, VAST, phylogeny.

Fig. 2.5 Page 25

PubMed is… National Library of Medicine's search service 16 million citations in MEDLINE links to participating online journals PubMed tutorial (via “Education” on side bar) Page 24

Entrez integrates… the scientific literature; DNA and protein sequence databases; 3D protein structure data; population study data sets; assemblies of complete genomes Page 24

Entrez is a search and retrieval system that integrates NCBI databases Page 24

Entrez: An integrated search and retrieval system

BLAST is… Basic Local Alignment Search Tool NCBI's sequence similarity search tool supports analysis of DNA and protein databases 100,000 searches per day Page 25

OMIM is… Online Mendelian Inheritance in Man catalog of human genes and genetic disorders edited by Dr. Victor McKusick, others at JHU Page 25

OMIM record for Presenilin 1 (PSEN1). Labeled parts: contents, associated LocusLink record, external resources, additional info in OMIM. Each record provides a state-of-the-art summary of current knowledge, with extensive references to the literature.

OMIM Search Results by Titles alzheimer AND presenilin 1

Entrez Genome: Gene Location. View of chromosome 14; gene name; multiple maps (STSs, ESTs, etc.)

Entrez Genomes Map Viewer Chromosome 7 GenBank Map Contig Map STS Map Integrated View of Chromosome 7 Multiple Maps STSs, ESTs, etc.

Entrez Genome: Gene Location. View of chromosome 14; gene name

Entrez Genome: Gene Location Entrez Genomes Map Viewer Chromosome 14 Cytogenetic map Location of PSEN1 and surrounding genes

Books is… searchable resource of on-line books Page 26

TaxBrowser is… browser for the major divisions of living organisms (archaea, bacteria, eukaryota, viruses) taxonomy information such as genetic codes molecular data on extinct organisms Page 26

Structure site includes… Molecular Modelling Database (MMDB) biopolymer structures obtained from the Protein Data Bank (PDB) Cn3D (a 3D-structure viewer) vector alignment search tool (VAST) Page 26

PDB (Protein Data Bank)
- Protein and NA 3D structures
- Sequence present
- YAFFF

HEADER LEUCINE ZIPPER 15-JUL-93 1DGC 1DGC 2
COMPND GCN4 LEUCINE ZIPPER COMPLEXED WITH SPECIFIC 1DGC 3
COMPND 2 ATF/CREB SITE DNA 1DGC 4
SOURCE GCN4: YEAST (SACCHAROMYCES CEREVISIAE); DNA: SYNTHETIC 1DGC 5
AUTHOR T.J.RICHMOND 1DGC 6
REVDAT 1 22-JUN-94 1DGC 0 1DGC 7
JRNL AUTH P.KONIG,T.J.RICHMOND 1DGC 8
JRNL TITL THE X-RAY STRUCTURE OF THE GCN4-BZIP BOUND TO 1DGC 9
JRNL TITL 2 ATF/CREB SITE DNA SHOWS THE COMPLEX DEPENDS ON DNA 1DGC 10
JRNL TITL 3 FLEXIBILITY 1DGC 11
JRNL REF J.MOL.BIOL. V DGC 12
JRNL REFN ASTM JMOBAK UK ISSN DGC 13
REMARK 1 1DGC 14
REMARK 2 1DGC 15
REMARK 2 RESOLUTION. 3.0 ANGSTROMS. 1DGC 16
REMARK 3 1DGC 17
REMARK 3 REFINEMENT. 1DGC 18
REMARK 3 PROGRAM X-PLOR 1DGC 19
REMARK 3 AUTHORS BRUNGER 1DGC 20
REMARK 3 R VALUE DGC 21
REMARK 3 RMSD BOND DISTANCES ANGSTROMS 1DGC 22
REMARK 3 RMSD BOND ANGLES 3.86 DEGREES 1DGC 23
REMARK 3 1DGC 24
REMARK 3 NUMBER OF REFLECTIONS DGC 25
REMARK 3 RESOLUTION RANGE ANGSTROMS 1DGC 26
REMARK 3 DATA CUTOFF 3.0 SIGMA(F) 1DGC 27
REMARK 3 PERCENT COMPLETION DGC 28
REMARK 3 1DGC 29
REMARK 3 NUMBER OF PROTEIN ATOMS 456 1DGC 30
REMARK 3 NUMBER OF NUCLEIC ACID ATOMS 386 1DGC 31
REMARK 4 1DGC 32
REMARK 4 GCN4: TRANSCRIPTIONAL ACTIVATOR OF GENES ENCODING FOR AMINO 1DGC 33
REMARK 4 ACID BIOSYNTHETIC ENZYMES. 1DGC 34
REMARK 5 1DGC 35
REMARK 5 AMINO ACIDS NUMBERING (RESIDUE NUMBER) CORRESPONDS TO THE 1DGC 36
REMARK AMINO ACIDS OF INTACT GCN4. 1DGC 37
REMARK 6 1DGC 38
REMARK 6 BZIP SEQUENCE USED FOR CRYSTALLIZATION. 1DGC 39
REMARK 7 1DGC 40
REMARK 7 MODEL FROM AMINO ACIDS SINCE AMINO ACIDS DGC 41
REMARK ARE NOT WELL ORDERED. 1DGC 42
REMARK 8 1DGC 43
REMARK 8 RESIDUE NUMBERING OF NUCLEOTIDES: 1DGC 44
REMARK 8 5' T G G A G A T G A C G T C A T C T C C 1DGC 45
REMARK DGC 46
REMARK 9 1DGC 47
REMARK 9 THE ASYMMETRIC UNIT CONTAINS ONE HALF OF PROTEIN/DNA 1DGC 48
REMARK 9 COMPLEX PER ASYMMETRIC UNIT. 1DGC 49
REMARK 10 1DGC 50
REMARK 10 MOLECULAR DYAD AXIS OF PROTEIN DIMER AND PALINDROMIC HALF 1DGC 51
REMARK 10 SITES OF THE DNA COINCIDES WITH CRYSTALLOGRAPHIC TWO-FOLD 1DGC 52
REMARK 10 AXIS. THE FULL PROTEIN/DNA COMPLEX CAN BE OBTAINED BY 1DGC 53
REMARK 10 APPLYING THE FOLLOWING TRANSFORMATION MATRIX AND 1DGC 54
REMARK 10 TRANSLATION VECTOR TO THE COORDINATES X Y Z: 1DGC 55
REMARK 10 1DGC 56
REMARK X X SYMM 1DGC 57
REMARK Y = Y SYMM 1DGC 58
REMARK Z Z SYMM 1DGC 59
SEQRES 1 A 62 ILE VAL PRO GLU SER SER ASP PRO ALA ALA LEU LYS ARG 1DGC 60
SEQRES 2 A 62 ALA ARG ASN THR GLU ALA ALA ARG ARG SER ARG ALA ARG 1DGC 61
SEQRES 3 A 62 LYS LEU GLN ARG MET LYS GLN LEU GLU ASP LYS VAL GLU 1DGC 62
SEQRES 4 A 62 GLU LEU LEU SER LYS ASN TYR HIS LEU GLU ASN GLU VAL 1DGC 63
SEQRES 5 A 62 ALA ARG LEU LYS LYS LEU VAL GLY GLU ARG 1DGC 64
SEQRES 1 B 19 T G G A G A T G A C G T C 1DGC 65
SEQRES 2 B 19 A T C T C C 1DGC 66
HELIX 1 A ALA A 228 LYS A DGC 67
CRYST P DGC 68
ORIGX DGC 69
ORIGX DGC 70
ORIGX DGC 71
SCALE DGC 72
SCALE DGC 73
SCALE DGC 74
ATOM 1 N PRO A DGC 75
ATOM 2 CA PRO A DGC 76
ATOM 842 C5 C B DGC 916
ATOM 843 C6 C B DGC 917
TER 844 C B 9 1DGC 918
MASTER DGC 919
END 1DGC 920

PDB record types shown: HEADER, COMPND, SOURCE, AUTHOR, DATE, JRNL, REMARK, SEQRES, ATOM coordinates
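Because every line of a PDB flat file starts with a record name, a few lines of code can pull out the parts of interest. The sketch below is a simplification under stated assumptions: it splits on whitespace, whereas real PDB files are fixed-column and are better read with a dedicated parser. The sample reuses a few records from the 1DGC entry above with the trailing line labels dropped, and the function names are made up for the example.

```python
from collections import Counter

# A few records from the 1DGC entry, trailing "1DGC n" line labels removed.
SAMPLE = """\
HEADER LEUCINE ZIPPER 15-JUL-93 1DGC
SEQRES 1 A 62 ILE VAL PRO GLU SER SER ASP PRO ALA ALA LEU LYS ARG
SEQRES 2 A 62 ALA ARG ASN THR GLU ALA ALA ARG ARG SER ARG ALA ARG
SEQRES 1 B 19 T G G A G A T G A C G T C
SEQRES 2 B 19 A T C T C C
END
"""


def record_counts(text):
    """Count how many lines carry each record name (first token on the line)."""
    return Counter(line.split()[0] for line in text.splitlines() if line.strip())


def seqres_chains(text):
    """Collect the residues listed in SEQRES records, grouped by chain ID."""
    chains = {}
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0] == "SEQRES":
            # Fields: SEQRES, serial number, chain ID, chain length, residues...
            chain_id, residues = fields[2], fields[4:]
            chains.setdefault(chain_id, []).extend(residues)
    return chains
```

With this sample, chain B comes back as the 19-nucleotide DNA strand and chain A as the first 26 residues of the protein.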

Accessing information on molecular sequences Page 26

Accession numbers are labels for sequences NCBI includes databases (such as GenBank) that contain information on DNA, RNA, or protein sequences. You may want to acquire information beginning with a query such as the name of a protein of interest, or the raw nucleotides comprising a DNA sequence of interest. DNA sequences and other molecular data are tagged with accession numbers that are used to identify a sequence or other record relevant to molecular data. Page 26

What is an accession number?
An accession number is a label used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence.
Examples (all for retinol-binding protein, RBP4):
DNA
- X02775: GenBank genomic DNA sequence
- NT_030059: Genomic contig
- Rs: dbSNP (single nucleotide polymorphism)
RNA
- N: An expressed sequence tag (1 of 170)
- NM_006744: RefSeq DNA sequence (from a transcript)
Protein
- NP_007635: RefSeq protein
- AAC02945: GenBank protein
- Q28369: SwissProt protein
- 1KT7: Protein Data Bank structure record
Page 27

Four ways to access DNA and protein sequences [1] Entrez Gene with RefSeq [2] UniGene [3] European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI) [4] ExPASy Sequence Retrieval System (separate from NCBI) Page 27 Note: LocusLink at NCBI was recently retired. The third printing of the book has updated these sections (pages 27-31).

4 ways to access protein and DNA sequences [1] Entrez Gene with RefSeq Entrez Gene is a great starting point: it collects key information on each gene/protein from major databases. It covers all major organisms. RefSeq provides a curated, optimal accession number for each DNA (NM_006744) or protein (NP_007635) Page 27

From the NCBI home page, type “rbp4” and hit “Go” Pevsner Fig. 2.7 Page 29

revised Fig. 2.7 Page 29

By applying limits, there are now just two entries

GenBank record (figure). Labeled fields: locus name, accession number, gi number, Medline ID, GenPept ID, protein sequence, nucleotide sequence. [Rest of protein sequence deleted for brevity] [Rest of nucleotide sequence deleted for brevity]

LOCUS, Accession, NID and protein_id
- LOCUS: Unique string of 10 letters and numbers in the database. It is not maintained among databases and is therefore a poor sequence identifier.
- ACCESSION: A unique identifier for that record and a citable entity; it does not change when the record is updated. A good record identifier, ideal for citation in publications.
- VERSION: New system in which the accession and version play the same role as the accession and gi number.
- Nucleotide gi: GenInfo identifier (gi), a unique integer that changes every time the sequence changes.
- PID: Protein identifier; a g, e or d prefix to the gi number. There can be one or two on one CDS.
- Protein gi: GenInfo identifier (gi), a unique integer that changes every time the sequence changes.
- protein_id: Identifier with the same structure and function as the nucleotide Accession.version numbers, but a slightly different format.
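The Accession.version scheme described above can be handled with a small helper that separates the stable accession from the version suffix. This is a sketch: the function name is made up for the example, and the version number used in the usage note is hypothetical, not the current version of any record.

```python
def split_accession_version(identifier):
    """Split 'ACCESSION.VERSION' into (accession, version).

    Returns (accession, None) when no version suffix is present.
    """
    accession, sep, version = identifier.partition(".")
    if not sep:
        return accession, None
    if not version.isdigit():
        raise ValueError(f"version must be an integer: {identifier!r}")
    return accession, int(version)
```

For example, split_accession_version("NM_006744.3") separates the accession NM_006744, which stays fixed when the record is updated, from the version 3, which would increase when the sequence changes.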

revised Fig. 2.8 Page 30 Entrez Gene (top of page) Note that links to many other RBP4 database entries are available

Entrez Gene (middle of page)

Entrez Gene (bottom of page)

Fig. 2.9 Page 32

FASTA format Fig Page 32

NCBI’s important RefSeq project: best representative sequences
RefSeq (accessible via the main page of NCBI) provides an expertly curated accession number that corresponds to the most stable, agreed-upon “reference” version of a sequence.
RefSeq identifiers include the following formats:
- Complete genome: NG_######
- Complete chromosome: NC_######
- Genomic contig: NT_######
- mRNA (DNA format): NM_###### e.g. NM_
- Protein: NP_###### e.g. NP_
Page 29-30

NCBI’s RefSeq project: accession for genomic, mRNA, protein sequences

Accession   Molecule   Method           Note
AC_         Genomic    Mixed            Alternate complete genomic
AP_         Protein    Mixed            Protein products; alternate
NC_         Genomic    Mixed            Complete genomic molecules
NG_         Genomic    Mixed            Incomplete genomic regions
NM_         mRNA       Mixed            Transcript products; mRNA
NM_         mRNA       Mixed            Transcript products; 9-digit
NP_         Protein    Mixed            Protein products;
NP_         Protein    Curation         Protein products; 9-digit
NR_         RNA        Mixed            Non-coding transcripts
NT_         Genomic    Automated        Genomic assemblies
NW_         Genomic    Automated        Genomic assemblies
NZ_ABCD     Genomic    Automated        Whole genome shotgun data
XM_         mRNA       Automated        Transcript products
XP_         Protein    Automated        Protein products
XR_         RNA        Automated        Transcript products
YP_         Protein    Auto. & Curated  Protein products
ZP_         Protein    Automated        Protein products
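The prefix table lends itself to a simple lookup. The sketch below copies the molecule and method columns from the table into a dictionary. Two simplifications are assumed: where a prefix appears twice (NM_, NP_) only the first row is kept, and the NZ_ABCD row is keyed as NZ. The function name is made up for the example.

```python
# Molecule type and annotation method per RefSeq prefix, taken from the table.
# Where a prefix has two rows, only the first row is used here.
REFSEQ_PREFIXES = {
    "AC": ("Genomic", "Mixed"),
    "AP": ("Protein", "Mixed"),
    "NC": ("Genomic", "Mixed"),
    "NG": ("Genomic", "Mixed"),
    "NM": ("mRNA", "Mixed"),
    "NP": ("Protein", "Mixed"),
    "NR": ("RNA", "Mixed"),
    "NT": ("Genomic", "Automated"),
    "NW": ("Genomic", "Automated"),
    "NZ": ("Genomic", "Automated"),
    "XM": ("mRNA", "Automated"),
    "XP": ("Protein", "Automated"),
    "XR": ("RNA", "Automated"),
    "YP": ("Protein", "Auto. & Curated"),
    "ZP": ("Protein", "Automated"),
}


def classify_refseq(accession):
    """Return (molecule, method) for a RefSeq accession, or None otherwise."""
    prefix, sep, _rest = accession.partition("_")
    if not sep:
        return None  # no underscore, so not a RefSeq-style accession
    return REFSEQ_PREFIXES.get(prefix.upper())
```

The underscore is what distinguishes RefSeq accessions from GenBank ones, so an identifier such as X02775 returns None.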

Four ways to access DNA and protein sequences [1] Entrez Gene with RefSeq [2] UniGene [3] European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI) [4] ExPASy Sequence Retrieval System (separate from NCBI) Page 31

DNA, RNA, complementary DNA (cDNA), protein, UniGene (Fig. 2.3, Page 23). In genetics, complementary DNA (cDNA) is DNA synthesized from a mature mRNA template in a reaction catalyzed by the enzyme reverse transcriptase.

Expressed Sequence Tag What Are ESTs and How Are They Made? ESTs are small pieces of DNA sequence (usually 200 to 500 nucleotides long) that are generated by sequencing either one or both ends of an expressed gene. The idea is to sequence bits of DNA that represent genes expressed in certain cells, tissues, or organs from different organisms and use these "tags" to fish a gene out of a portion of chromosomal DNA by matching base pairs. The challenge associated with identifying genes from genomic sequences varies among organisms and is dependent upon genome size as well as the presence or absence of introns, the intervening DNA sequences interrupting the protein coding sequence of a gene.

STS: Sequence-tagged sites (STSs) are operationally unique sequences, each identified by the pair of primers used in a PCR assay; the assay generates a mapping reagent that maps to a single position within the genome.

UniGene: unique genes via ESTs Find UniGene at NCBI: UniGene clusters contain many expressed sequence tags (ESTs), which are DNA sequences (typically 500 base pairs in length) corresponding to the mRNA from an expressed gene. ESTs are sequenced from a complementary DNA (cDNA) library. UniGene data come from many cDNA libraries. Thus, when you look up a gene in UniGene you get information on its abundance and its regional distribution. Pages 20-21

Cluster sizes in UniGene This is a gene with 1 EST associated; the cluster size is 1 Fig. 2.3 Page 23

Cluster sizes in UniGene This is a gene with 10 ESTs associated; the cluster size is 10

Cluster sizes in UniGene (human), UniGene build 194, 8/06. Table of cluster size (ESTs) versus number of clusters; the entries that survive are cluster size 1 with 42,800 clusters and cluster size 16,000-30,000 with 8 clusters.

UniGene: unique genes via ESTs Conclusion: UniGene is a useful tool to look up information about expressed genes. UniGene displays information about the abundance of a transcript (expressed gene), as well as its regional distribution of expression (e.g. brain vs. liver). We will discuss UniGene further later (gene expression). Page 31

Four ways to access DNA and protein sequences [1] Entrez Gene with RefSeq [2] UniGene [3] European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI) [4] ExPASy Sequence Retrieval System (separate from NCBI) Page 31

Ensembl to access protein and DNA sequences. Try Ensembl for a premier human genome web browser. We will encounter Ensembl as we study the human genome, BLAST, and other topics.

click human

enter RBP4

Four ways to access DNA and protein sequences [1] Entrez Gene with RefSeq [2] UniGene [3] European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI) [4] ExPASy Sequence Retrieval System (separate from NCBI) Page 33

ExPASy to access protein and DNA sequences. ExPASy sequence retrieval system (ExPASy = Expert Protein Analysis System). Page 33

Fig Page 33

Example of how to access sequence data: HIV-1 pol There are many possible approaches. Begin at the main page of NCBI, and type an Entrez query: hiv-1 pol Page 34

Searching for HIV-1 pol: Following the “genome” link yields a manageable three results

Example of how to access sequence data: HIV-1 pol
For the Entrez query hiv-1 pol there are about 40,000 nucleotide or protein records (and >100,000 records for a search for “hiv-1”), but these can be reduced in two easy steps:
- specify the organism, e.g. hiv-1[organism]
- limit the output to RefSeq!
Page 34
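The two narrowing steps can also be written as a single Entrez query string. The sketch below only builds a search URL and sends nothing over the network. It assumes the NCBI E-utilities esearch endpoint and, to the best of my knowledge, the srcdb_refseq[Properties] filter as the way to restrict results to RefSeq; both should be checked against current NCBI documentation. The function name and its parameters are made up for the example.

```python
from urllib.parse import urlencode

# Assumed endpoint of the NCBI E-utilities search service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, db="nucleotide", organism=None, refseq_only=False):
    """Build (but do not send) an Entrez search URL with optional limits."""
    parts = [term]
    if organism:
        parts.append(f"{organism}[Organism]")      # step 1: specify the organism
    if refseq_only:
        parts.append("srcdb_refseq[Properties]")   # step 2: limit to RefSeq (assumed filter)
    query = " AND ".join(parts)
    return ESEARCH + "?" + urlencode({"db": db, "term": query})


url = build_esearch_url("pol", organism="hiv-1", refseq_only=True)
```

Each limit is joined with AND, so every added restriction can only shrink the result set, which is why the counts on the slide drop so sharply.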

Over 100,000 nucleotide entries for HIV-1; only 1 RefSeq

Examples of how to access sequence data: histone

Query for “histone”         # results
protein records             21847
RefSeq entries              7544
RefSeq (limit to human)     1108
NOT deacetylase             697

At this point, select a reasonable candidate (e.g. histone 2, H4) and follow its link to Entrez Gene. There, you can confirm you have the right gene/protein.

Access to Biomedical Literature Page 35

PubMed at NCBI to find literature information

PubMed is the NCBI gateway to MEDLINE. MEDLINE contains bibliographic citations and author abstracts from over 4,600 journals published in the United States and in 70 foreign countries. It has >14 million records dating back to Page 35

MeSH is the acronym for "Medical Subject Headings." MeSH is the list of the vocabulary terms used for subject analysis of biomedical literature at NLM. MeSH vocabulary is used for indexing journal articles for MEDLINE. The MeSH controlled vocabulary imposes uniformity and consistency to the indexing of biomedical literature. Page 35

PubMed search strategies
- Try the tutorial (“education” on the left sidebar)
- Use boolean queries (capitalize AND, OR, NOT), e.g. lipocalin AND disease
- Try using “limits”
- Try “Links” to find Entrez information and external resources
- Obtain articles on-line via Welch Medical Library (and download pdf files)
Page 35

Boolean queries (Venn diagrams: 1 AND 2, 1 OR 2, 1 NOT 2)
- lipocalin AND disease (60 results)
- lipocalin OR disease (1,650,000 results)
- lipocalin NOT disease (530 results)
Fig Page 34 8/04
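The three Boolean operators correspond directly to set intersection, union and difference. The toy index below is made up for illustration: the article IDs are arbitrary integers, not PubMed identifiers, and the function name is an assumption of the example.

```python
# Toy inverted index: search term -> set of (made-up) article IDs.
INDEX = {
    "lipocalin": {1, 2, 3},
    "disease": {3, 4, 5, 6},
}


def boolean_search(index, left, op, right):
    """Evaluate 'left OP right' where OP is AND, OR or NOT."""
    a = index.get(left, set())
    b = index.get(right, set())
    if op == "AND":
        return a & b   # articles containing both terms
    if op == "OR":
        return a | b   # articles containing either term
    if op == "NOT":
        return a - b   # articles with the left term but not the right
    raise ValueError(f"unknown operator: {op}")
```

As on the slide, OR always returns at least as many results as either term alone, while AND and NOT can only return a subset of the left-hand term's results.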

Search result versus article contents (8/06)

                          “globin” is present                 “globin” is absent
“globin” is found         true positive                       false positive
                                                              (article does not discuss globins)
“globin” is not found     false negative                      true negative
                          (article discusses globins)
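The four outcomes can be expressed as a small function of two yes/no facts: whether the search found the article, and whether the article really discusses the topic. The function names and the sample outcomes are made up for the example.

```python
from collections import Counter


def classify_hit(found, present):
    """Name the outcome of one search result.

    found:   the search returned the article
    present: the article really discusses the topic (e.g. globins)
    """
    if found and present:
        return "true positive"
    if found and not present:
        return "false positive"
    if not found and present:
        return "false negative"
    return "true negative"


def tally(outcomes):
    """Count each outcome type over a list of (found, present) pairs."""
    return Counter(classify_hit(found, present) for found, present in outcomes)
```

Tallying outcomes over a set of articles shows at a glance whether a query misses relevant papers (false negatives) or returns irrelevant ones (false positives).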

A protein sequence motif is a descriptor of a protein family
Glutamine amidotransferase class I:
[PAS]-[LIVMFYT]-[LIVMFY]-G-[LIVMFY]-C-[LIVMFYN]-G-x-[QEH]-x-[LIVMFA]
[C is the active site residue]
Glutamine amidotransferase class II:
<x(0,11)-C-[GS]-[IV]-[LIVMFYW]-[AG]
[C is the active site residue]
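Motifs written in this PROSITE-style notation map almost one-to-one onto regular expressions. The converter below is a sketch that covers only the syntax elements used above plus the {…} exclusion form: x for any residue, bracketed alternatives, repeat counts in parentheses, and the < and > anchors. The function name is an assumption of the example, and the sequences in the usage note are synthetic.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pieces = []
    for element in pattern.strip().rstrip(".").split("-"):
        anchor_start = element.startswith("<")   # '<' pins the match to the N-terminus
        anchor_end = element.endswith(">")       # '>' pins the match to the C-terminus
        element = element.strip("<>")
        # Separate the residue specification from an optional repeat count,
        # e.g. x(0,11) -> core 'x', count '0,11'.
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, count = match.group(1), match.group(2)
        if core == "x":
            core = "."                            # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"        # any residue except those listed
        piece = core + ("{" + count + "}" if count else "")
        pieces.append(("^" if anchor_start else "") + piece + ("$" if anchor_end else ""))
    return "".join(pieces)


CLASS_I = "[PAS]-[LIVMFYT]-[LIVMFY]-G-[LIVMFY]-C-[LIVMFYN]-G-x-[QEH]-x-[LIVMFA]"
CLASS_II = "<x(0,11)-C-[GS]-[IV]-[LIVMFYW]-[AG]"
```

For instance, re.search(prosite_to_regex(CLASS_II), "MCGIVG") matches because the cysteine lies within the first 12 residues, whereas a sequence whose first cysteine sits further from the N-terminus does not match.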

Searching MMDB

Principles of structural alignment
- Dali: looks for minimal RMSD between Cα atoms. It calculates Cα - Cα distance matrices, then identifies the longest alignable segments.
- VAST (Vector Alignment Search Tool): looks for pairs of secondary structure elements (α-helices, β-strands) that have similar orientation and connectivity.
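The two quantities named for Dali, the intramolecular distance matrix and the RMSD, are easy to compute for toy coordinates. The sketch below is not the Dali algorithm itself: it only shows the building blocks, and the RMSD function assumes the two structures have already been superimposed and have their atoms paired in order. The coordinates in the usage note are made up.

```python
import math


def distance_matrix(coords):
    """Pairwise distances between all atoms of one structure.

    coords is a list of (x, y, z) tuples, e.g. one per C-alpha atom.
    """
    return [[math.dist(a, b) for b in coords] for a in coords]


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long, pre-aligned
    lists of atom coordinates (no superposition is performed here)."""
    if len(coords_a) != len(coords_b):
        raise ValueError("structures must have the same number of atoms")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))
```

For two atoms at (0, 0, 0) and (3, 4, 0) the distance matrix holds 5.0 off the diagonal. Shifting the whole structure by 1 along z leaves the matrix unchanged but gives an RMSD of 1.0 against the original. Distance matrices do not depend on how a structure is rotated or translated, which is why they can be compared without first superimposing the structures.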

Dali alignment of Tyr phosphatase

VAST Structure Neighbors

Structure Summary Cn3D viewer VAST neighbors BLAST neighbors

Cn3D : Displaying Structures Chloroquine

Structure Neighbors

Use of structural alignments Chloroquine NADH

UniProt
- New protein sequence database resulting from the merger of SWISS-PROT and PIR.
- It will be the annotated, curated protein sequence database.
- Data in UniProt are primarily derived from coding sequence annotations in EMBL (GenBank/DDBJ) nucleic acid sequence data.
- UniProt is a flat-file database, just like EMBL and GenBank.
- The flat-file format is SwissProt-like, or EMBL-like.

Swiss-Prot

SWISS-PROT incorporates:
- Function of the protein
- Post-translational modification
- Domains and sites
- Secondary structure
- Quaternary structure
- Similarities to other proteins
- Diseases associated with deficiencies in the protein
- Sequence conflicts, variants, etc.