Discussion Practical 1. Features of major databases (PubMed and NCBI Protein Db)


Features of major databases (PubMed and NCBI Protein Db)

Anatomy of PubMed Db

Epub ahead of print and journal impact factor
How to get the impact factor of any journal:
1) Direct source: the Web of Science database
2) Indirect sources, e.g. blogs and other sites (do a Google search)
Adapted from:

Anatomy of a PubMed record

Demo on downloading articles

Anatomy of a Protein Db

Accession numbers and GenInfo Identifiers
[Figure: an example record with its fields labeled: Accession (NM_), Version (NM_), GI (GenInfo Identifier), Source (RefSeq database)]
Other popular sources:
- dbj: DDBJ (DNA Data Bank of Japan database)
- emb: The European Molecular Biology Laboratory (EMBL) database
- prf: Protein Research Foundation database
- sp: SwissProt
- gb: GenBank
- pir: Protein Information Resource
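The source tags listed above show up in pipe-delimited description lines of sequence files. A minimal Python sketch of how such a line could be taken apart is given here. The function name `parse_defline`, the example identifiers, and the `ref` tag for RefSeq (not on the slide's list) are assumptions made for illustration, and real description lines come in more variants than this handles.

```python
# Source tags from the slide; "ref" (RefSeq) is added here and is not on the slide's list.
SOURCE_TAGS = {
    "dbj": "DDBJ",
    "emb": "EMBL",
    "prf": "Protein Research Foundation",
    "sp": "SwissProt",
    "gb": "GenBank",
    "pir": "Protein Information Resource",
    "ref": "RefSeq",
}


def parse_defline(defline: str) -> dict:
    """Split a description line such as '>gi|12345|gb|AB000001.2|name' into fields."""
    fields = defline.lstrip(">").strip().split("|")
    result = {"gi": None, "source": None, "accession": None, "version": None}
    i = 0
    while i + 1 < len(fields):
        tag, value = fields[i], fields[i + 1]
        if tag == "gi":
            # The GI is a plain integer.
            result["gi"] = int(value)
            i += 2
        elif tag in SOURCE_TAGS:
            # The value after a source tag is ACCESSION or ACCESSION.VERSION.
            accession, _, version = value.partition(".")
            result["source"] = SOURCE_TAGS[tag]
            result["accession"] = accession
            result["version"] = int(version) if version else None
            i += 2
        else:
            i += 1
    return result


# The identifiers below are made up for illustration only.
info = parse_defline(">gi|12345|gb|AB000001.2|example protein")
print(info)
```

The sketch keeps the GI and the accession.version as separate fields, since the following slides treat them as two parallel identifier systems.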

Why do we need an accession number and a GI for one record?
1) What is the difference between accession and GI?
2) Why do we need these two when both seem to be accession numbers?

Why do we need an accession number and a GI for one record?
[Figure: three successive revisions of one record (Sequence_v1, Sequence_v2, Sequence_v3), each labeled with its GI and its version (NM_); a sequence update separates each revision from the next]
Q1) Which revision will NCBI show if you were to search by the accession only, without the version number?

Accession numbers
- The unique identifier for a sequence record.
- An accession number applies to the complete record.
- Accession numbers do not change, even if information in the record is changed at the author's request.
- Sometimes, however, an original accession number might become secondary to a newer accession number, if the authors make a new submission that combines previous sequences, or if for some reason a new submission supersedes an earlier record.

GenInfo Identifiers
- GenInfo Identifier (GI): a sequence identification number.
- If a sequence changes in any way, a new GI number will be assigned.
- A separate GI number is also assigned to each protein translation within a nucleotide sequence record.
- A new GI is assigned if the protein translation changes in any way.
- GI sequence identifiers run parallel to the new accession.version system of sequence identifiers.

Version
- A nucleotide sequence identification number that represents a single, specific sequence in the GenBank database.
- If there is any change to the sequence data (even a single base), the version number will be increased, e.g., U → U , but the accession portion will remain stable.
- The accession.version system of sequence identifiers runs parallel to the GI number system, i.e., when any change is made to a sequence, it receives a new GI number AND an increase to its version number.
- A Sequence Revision History tool is available to track the various GI numbers, version numbers, and update dates for sequences that appeared in a specific GenBank record.
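The rules on the last three slides can be sketched as a toy model in Python: the accession stays fixed, while any change to the sequence bumps the version and assigns a new GI. This is only an illustration of the bookkeeping, not how NCBI implements it. The class name `SequenceRecord`, the counter that hands out GI numbers, and the accession `X00001` are all made up.

```python
from dataclasses import dataclass, field
from itertools import count

# Made-up GI source; real GI numbers are assigned by NCBI.
_gi_counter = count(1000)


@dataclass
class SequenceRecord:
    accession: str
    sequence: str
    version: int = 1
    gi: int = field(default_factory=lambda: next(_gi_counter))
    history: list = field(default_factory=list)

    @property
    def accession_version(self) -> str:
        """Identifier in ACCESSION.VERSION form."""
        return f"{self.accession}.{self.version}"

    def update_sequence(self, new_sequence: str) -> None:
        """Apply a sequence change: new version and new GI, same accession."""
        if new_sequence == self.sequence:
            return  # no sequence change, so the identifiers stay the same
        self.history.append((self.accession_version, self.gi))
        self.sequence = new_sequence
        self.version += 1
        self.gi = next(_gi_counter)


record = SequenceRecord("X00001", "ACGT")
record.update_sequence("ACGA")  # a single-base change is enough
print(record.accession_version, record.gi, record.history)
```

The `history` list plays the role of a revision history: it keeps the earlier accession.version and GI pairs for the same record.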

Anatomy of a Protein Db record

FASTA Sequence

FASTA Format
Text-based format for representing nucleic acid sequences or peptide sequences (single letter codes). Easy to manipulate and easy for programs to parse.
In the example below, each line starting with ">" is a description line/row, and the lines that follow it are the sequence data line(s).

>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH

FASTA Format (cont.)
Begins with a single-line description, followed by lines of sequence data.
Description line
- Distinguished from the sequence data by a greater-than (">") symbol.
- The word following the ">" symbol in the same row is the identifier of the sequence.
- There should be no space between the ">" and the first letter of the identifier.
- Keep the identifier short and clear; some older programs accept identifiers of at most 10 characters. For example: >gi| |Human or >HumanP53
Sequence line(s)
- Ensure that the sequence data starts in the row following the description row (be careful of the word wrap feature).
- The sequence ends if another line starting with a ">" appears; this indicates the start of another sequence.
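The format rules above are simple enough to turn into a few lines of Python. The following is a minimal sketch, not a replacement for an established parser; the function name `parse_fasta` and the short example sequences are made up for illustration.

```python
def parse_fasta(text: str) -> dict:
    """Return {identifier: sequence} for FASTA-formatted text."""
    records = {}
    identifier = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Description line: the first word after ">" is the identifier.
            words = line[1:].split()
            if not words:
                raise ValueError("description line has no identifier")
            identifier = words[0]
            if identifier in records:
                raise ValueError(f"duplicate identifier: {identifier}")
            records[identifier] = []
        elif identifier is None:
            raise ValueError("sequence data found before any description line")
        else:
            # Sequence line: belongs to the most recent description line.
            records[identifier].append(line)
    # Join the wrapped sequence lines of each record into one string.
    return {name: "".join(chunks) for name, chunks in records.items()}


example = """>SEQ_A first made-up record
MTEITAAMVK
ELRESTGAGM
>SEQ_B second made-up record
SATVSEINSE
"""
for name, sequence in parse_fasta(example).items():
    print(name, len(sequence), sequence)
```

A new ">" line ends the previous sequence, which is why the parser only needs to remember the most recent identifier.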

Amino acids & Nucleotides

IUPAC One Letter Amino Acid Code

Letter  Amino acid                          Mnemonic
A       Alanine
B       Asx (asparagine or aspartic acid)
C       Cysteine
D       Aspartic acid                       Aspar(D)ic acid
E       Glutamic acid
F       Phenylalanine                       (F)enylalanine
G       Glycine
H       Histidine
I       Isoleucine
J       Xle (leucine or isoleucine)
K       Lysine
L       Leucine
M       Methionine
N       Asparagine                          Asparagi(N)e
O       Pyrrolysine (Pyl), the 22nd         Pyrr(O)lysine
P       Proline
Q       Glutamine                           (Q)lutamine
R       Arginine                            (R)ginine
S       Serine
T       Threonine
U       Selenocysteine (Sec), the 21st
V       Valine
W       Tryptophan                          T(W)ptophan
X       Xaa (unspecified or unknown)
Y       Tyrosine                            T(Y)rosine
Z       Glx (glutamine or glutamic acid)

Note

Amino acid                          Three letter code   Single letter code
Asparagine or aspartic acid         Asx                 B
Glutamine or glutamic acid          Glx                 Z
Leucine or isoleucine               Xle                 J
Unspecified or unknown amino acid   Xaa                 X
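As a memory aid, the one-letter codes can be kept in a lookup table and used to spell out a sequence in three-letter codes. A short Python sketch follows; the names `ONE_TO_THREE` and `to_three_letter` are chosen for illustration.

```python
# One-letter to three-letter amino acid codes, including the
# ambiguity codes (B, Z, J, X) and the 21st/22nd amino acids (U, O).
ONE_TO_THREE = {
    "A": "Ala", "B": "Asx", "C": "Cys", "D": "Asp", "E": "Glu",
    "F": "Phe", "G": "Gly", "H": "His", "I": "Ile", "J": "Xle",
    "K": "Lys", "L": "Leu", "M": "Met", "N": "Asn", "O": "Pyl",
    "P": "Pro", "Q": "Gln", "R": "Arg", "S": "Ser", "T": "Thr",
    "U": "Sec", "V": "Val", "W": "Trp", "X": "Xaa", "Y": "Tyr",
    "Z": "Glx",
}


def to_three_letter(sequence: str) -> str:
    """Spell out a one-letter peptide sequence in three-letter codes."""
    return "-".join(ONE_TO_THREE[residue] for residue in sequence.upper())


# First ten residues of SEQUENCE_1 from the FASTA example.
print(to_three_letter("MTEITAAMVK"))
```

A letter that is not in the table raises a `KeyError`, which is a quick way to spot characters that do not belong in a peptide sequence.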

IUPAC Nucleotide Code
The standard IUPAC nucleotide code is used to describe ambiguous sites in a given DNA sequence motif, where a single character may represent more than one nucleotide.
[Table: IUPAC nucleotide code]
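One way to see what the ambiguity codes mean in practice is to expand a motif into a regular expression. The Python sketch below does that for DNA. The function name `motif_to_regex` and the example motif are chosen for illustration and are not from the slides.

```python
import re

# IUPAC nucleotide codes and the bases each one stands for (DNA alphabet).
IUPAC_NUCLEOTIDES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def motif_to_regex(motif: str) -> re.Pattern:
    """Compile an IUPAC-coded DNA motif into a regular expression."""
    parts = []
    for symbol in motif.upper():
        bases = IUPAC_NUCLEOTIDES[symbol]
        # A single base matches itself; an ambiguous site becomes a character class.
        parts.append(bases if len(bases) == 1 else f"[{bases}]")
    return re.compile("".join(parts))


pattern = motif_to_regex("TATAWAW")  # W stands for A or T
print(pattern.pattern)
print(bool(pattern.search("GGTATATATCC")))
```

Each ambiguous character turns into a set of allowed bases, so one short motif can stand for many concrete sequences.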

Advice
We highly recommend that you memorize the amino acid codes and their structures. Memorizing the codes, and in particular the structures, will be very useful for this module and other modules, especially for research purposes. It is not compulsory to memorize these for this module.

Features of major databases (Gene Db)

Anatomy of Gene Db

Anatomy of a Gene Db record

A section of a Gene Db record: Reference Sequences
[Figure: the Reference Sequences section of a Gene Db record, with the mRNA accession number and the protein accession number labeled]

Nucleic Acid Databases
- Entrez nucleotide database (nt)
- GenBank
- DDBJ
- EMBL
- RefSeq_genomic

Amino Acid Databases
1) Sequence repositories
- GenPept (redundant; translation of GenBank; minimal annotation)
- Entrez Protein (redundant or NR): translated DDBJ/EMBL/GenBank (i.e. GenPept), Swiss-Prot, PIR, RefSeq_protein and PDB
- RefSeq (non-redundant; reference sequences; minimal manual curation; limited species)
2) Universal curated databases
- PIR-PSD (non-redundant; focus on protein family classification)
- Swiss-Prot (non-redundant; manually annotated)
- TrEMBL (non-redundant; extensively computer-annotated)
3) Next generation of protein sequence databases
- UniProtKB (Swiss-Prot, TrEMBL and PIR-PSD integrated; less redundant than UniProt NREF)
- UniParc (like Entrez Protein but more comprehensive)
- UniProt NREF (like RefSeq but more comprehensive and rich with annotation)
Read more:

The RefSeq Project
Goal: a "comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products, for major research organisms."
Designed to reduce duplication by selecting one representative sequence for each locus, except when there are naturally occurring paralogs and splice variants.
Info from:
- Predictions from genomic sequence
- Analysis of GenBank records
- Collaborating databases
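RefSeq accessions carry a two-letter prefix followed by an underscore, and the prefix tells you whether a record is genomic DNA, a transcript, or a protein. The Python sketch below covers a handful of common prefixes only. The table is not exhaustive, the function name `refseq_molecule_type` is chosen for illustration, and the accessions in the example are made-up placeholders.

```python
# A few common RefSeq accession prefixes (not an exhaustive list).
REFSEQ_PREFIXES = {
    "NC_": "genomic DNA (complete genomic molecule)",
    "NG_": "genomic DNA (genomic region)",
    "NM_": "transcript (mRNA)",
    "NR_": "transcript (non-coding RNA)",
    "NP_": "protein",
    "XM_": "transcript (predicted mRNA)",
    "XR_": "transcript (predicted non-coding RNA)",
    "XP_": "protein (predicted)",
}


def refseq_molecule_type(accession: str) -> str:
    """Classify a RefSeq accession by its prefix."""
    return REFSEQ_PREFIXES.get(accession[:3].upper(), "unknown prefix")


# Placeholder accessions, not real records.
for accession in ["NM_000000.0", "NP_000000.0", "AB000001.1"]:
    print(accession, "->", refseq_molecule_type(accession))
```

This is the distinction behind the mRNA accession number and protein accession number shown in the Reference Sequences section of a Gene Db record.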

GenBank versus RefSeq

Choice of databases for genomic/proteomic data
[Figure: genome architecture, showing a promoter, an enhancer, and a gene with segments labeled E, I and U]
Databases to house genomic/proteomic data:
- Nucleotide: all of the above in multiple records
- Protein: all real/reliably predicted proteins in multiple records
- RefSeq_genome: reference ones only
- RefSeq_Protein: reference proteins only
- Gene: gene record with all related information included (mRNA, protein, promoter, enhancer)

Database searching can help answer questions like
- What is the sequence of human IL-10?
- What is the gene coding for human IL-10?
- Is the function of human IL-10 known? What is it?
- Are there any variants of human IL-10?
- Who sequenced this gene?
- What are the differences between IL-10 in human and in other species?
- Which species are known to have IL-10?
- Is the structure of IL-10 known?
- What are the structural and functional domains of IL-10?
- Are there any motifs in the sequence that explain its properties?
- What is an upstream region of IL-10 containing transcriptional regulation sites?
IL10 = X?
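A keyword search like the first question can also be written as a URL for NCBI's E-utilities `esearch` service. The Python sketch below only builds the URL and does not send it. The function name `build_esearch_url` is chosen for illustration, and the search term with its `[Gene Name]` and `[Organism]` field tags is an assumed way to phrase the IL-10 query, not one taken from the slides.

```python
from urllib.parse import urlencode

# Base address of the E-utilities search service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(database: str, term: str, max_results: int = 20) -> str:
    """Build (but do not send) a keyword-search URL for an Entrez database."""
    query = urlencode({"db": database, "term": term, "retmax": max_results})
    return f"{ESEARCH_URL}?{query}"


# Assumed phrasing of "human IL-10" as an Entrez query with field tags.
url = build_esearch_url(
    "protein",
    "IL10[Gene Name] AND Homo sapiens[Organism]",
    max_results=5,
)
print(url)
```

Opening the printed URL in a browser should return a list of matching record identifiers; the same term can be typed into the search box on the web site.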

Take home messages for databases
- Bioinformatics = databases + tools
- General databases versus specialized databases
- Databases come and go (especially the small ones)
- Database redundancy: many databases cover the same topic (use the most comprehensive one; if none stands out, use all of them for comprehensiveness)
- Database accuracy: published ones are more reliable; nevertheless, they are still prone to errors. It is always good to spend some time assessing the reliability of your data of interest by cross-referencing to the literature or other databases
- Fortunately, most databases are cross-referenced
- Unfortunately, there is no common standard format; you need to spend some time familiarizing yourself with each one, and it becomes easy after some practice
- Finding databases relevant to you: NAR Database catalogue, PubMed, Google
- 2 main methods for searching databases (each with its own pros and cons):
  1. Keyword search (covered today)
  2. Sequence search (day 2)