Databases - Part B: Storage and Retrieval



What are we looking for in a good database?
- Large amount of data
- Numerous entries
- Well-defined fields
- Non-redundancy
- Reliable data (periodic updating)
- Informative links to other databases
- Efficient, user-friendly associated tools (software) necessary for database access and query, insertion of information, and deletion of information
- Curated vs. non-curated databases

Nucleotide and protein sequence databases: ~20 years of data accumulation
- Repository databases (archives) vs. topic-centered databases
- First generation vs. advanced generations
- Not curated vs. well curated
- Partially annotated vs. fully annotated
- More redundant vs. less redundant

Primary sequence repositories: first-generation databases (EMBL/GenBank/DDBJ)
Like a plastered cistern that never loses a drop (highly redundant), but that never processes a drop either (poorly annotated).

EMBL/GenBank/DDBJ
A sort of sequence museum, where sequences are preserved for eternity as they were originally determined, interpreted and published by their authors (a primary sequence repository). The authors have full authority over the content of the entries they submit: editorial control of the content belongs to the authors. The result is redundancy and insufficient annotation.

Unexpected information you can find in these databases (EMBL): Who is a friend of Fidel? How many years did he keep the cigar?

EMBL/GenBank/DDBJ: unexpected information you can find in these databases

EMBL entry Z71230:
FT   source
FT                   /db_xref="taxon:4097"
FT                   /organelle="plastid:chloroplast"
FT                   /organism="Nicotiana tabacum"
FT                   /isolate="Cuban cahibo cigar, gift from President Fidel Castro"

Or:
FT   source
FT                   /chromosome="complete mitochondrial genome"
FT                   /db_xref="taxon:9267"
FT                   /organelle="mitochondrion"
FT                   /organism="Didelphis virginiana"
FT                   /dev_stage="adult"
FT                   /isolate="fresh road killed individual"
FT                   /tissue_type="liver"

Advanced generations of nucleotide sequence databases
- Non-redundant, sequence-centric databases: a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products. Example: RefSeq.
- Gene-centric databases: all the sequence information relevant to a given gene is made accessible at once. Example: Gene.
- Genome-centric databases: information about gene sequence, relative position, strand orientation, biochemical functions… Example: genome browsers.
Diagram labels: "Different entries" and "Single entry".

1. Think: phrase your scientific question.
2. Choose the appropriate database.
3. Phrase your query: keywords, fields, Boolean operators, syntax.
4. Access additional entries discussing the same or similar entities through links to additional databases (DBXref).
Think, evaluate. The computer is just a machine. You are (hopefully) a thinking organism.
Tools noted on the slide: Preview/index; Preview/index, limits; MeSH terms; History. Coverage: current tutorial; previous and current tutorials.
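Step 3, phrasing a query from keywords, fields and Boolean operators, can be sketched in a few lines of Python. This is only an illustration: the helper names `field_term` and `combine` are hypothetical, and the `term[Field]` notation assumes an Entrez-style search syntax, which the slide does not spell out.

```python
# Hypothetical helpers for composing a fielded Boolean query string.
# The term[Field] notation assumes Entrez-style syntax (an assumption).

BOOLEAN_OPERATORS = {"AND", "OR", "NOT"}


def field_term(term, field=None):
    """Restrict a keyword to one field; quote multi-word keywords."""
    if " " in term:
        term = f'"{term}"'
    return f"{term}[{field}]" if field else term


def combine(operator, *terms):
    """Join already-built terms with a single Boolean operator."""
    if operator not in BOOLEAN_OPERATORS:
        raise ValueError(f"unknown Boolean operator: {operator}")
    return "(" + f" {operator} ".join(terms) + ")"


query = combine(
    "AND",
    field_term("p53", "Gene Name"),
    field_term("Homo sapiens", "Organism"),
)
print(query)
```

Keeping the operator explicit and wrapping each combination in parentheses avoids relying on the search engine's default precedence rules.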

Evaluating search results

               Related          Unrelated
Found (+)      True positive    False positive
Not found (-)  False negative   True negative

"Related" and "Unrelated" refer to the "scientific truth". Slide annotations: "Easy to detect" and "Harder to detect (?)".
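The four outcomes can be made concrete with a small sketch. It assumes we already know which records are truly related, which is rarely the case in practice; the record identifiers and the function name `evaluate_search` are hypothetical.

```python
# Sort every record into one of the four outcomes, given the set of
# records a search returned and the set that is truly related.
# All identifiers below are made up for illustration.

def evaluate_search(found, related, all_records):
    found, related, all_records = set(found), set(related), set(all_records)
    return {
        "true_positive": sorted(found & related),     # found and related
        "false_positive": sorted(found - related),    # found but unrelated
        "false_negative": sorted(related - found),    # related but missed
        "true_negative": sorted(all_records - found - related),
    }


all_records = ["r1", "r2", "r3", "r4", "r5", "r6"]
related = ["r1", "r2", "r3"]   # the "scientific truth" (assumed known)
found = ["r1", "r2", "r4"]     # what the query returned

outcome = evaluate_search(found, related, all_records)
for name, records in outcome.items():
    print(name, records)
```

False positives appear in the result list, so they can be inspected directly; false negatives never appear in it, so finding them requires a different query or an independent source.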

Common to all databases
A database is a structured collection of information. It is composed of basic objects called records or entries. Each record is composed of fields, which hold defined data related to that record. Organizing each record into predetermined fields allows us to run queries on fields.
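As a minimal sketch of records, fields and a field query: the record layout, the accession values and the `query` helper below are hypothetical, and real sequence databases have many more fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One database entry; each attribute is a predetermined field."""
    accession: str   # made-up identifiers, not real accession numbers
    organism: str
    organelle: str


RECORDS = [
    Record("REC001", "Nicotiana tabacum", "chloroplast"),
    Record("REC002", "Didelphis virginiana", "mitochondrion"),
    Record("REC003", "Nicotiana tabacum", "mitochondrion"),
]


def query(records, field, value):
    """Return the records whose given field equals the given value."""
    return [r for r in records if getattr(r, field) == value]


tobacco = query(RECORDS, "organism", "Nicotiana tabacum")
print([r.accession for r in tobacco])
```

Because every record has the same fields, the same query works on any field name, for example `query(RECORDS, "organelle", "mitochondrion")`.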

Real life of a protein sequence…
- Source data: cDNAs, ESTs, genomes, … Some data are never submitted to public databases, or their submission is delayed or cancelled.
- EMBL, GenBank, DDBJ: nucleotide sequences with or without annotated CDS.
- TrEMBL, GenPept: coding sequences (CDS) provided by submitters.
- RefSeq (XP_NNNNN): coding sequences provided by submitters and "de novo" gene prediction.
- Swiss-Prot: manually annotated.
- PRF, PIR: sequences derived from scientific publications.
- UniProt: Swiss-Prot + TrEMBL + (PIR)
- NCBI-nr: Swiss-Prot + GenPept + (PIR) + RefSeq + PDB + PRF
PIR: Protein Identification Resource. PRF: Protein Research Foundation, Japan.

The AC number jungle: it is not always easy to recognize the origin of a record.

Type of record: sample accession format
GenBank/EMBL/DDBJ: one letter followed by five digits (e.g. U12345), or two letters followed by six digits (e.g. AF)
Swiss-Prot/TrEMBL: one letter and five digits/letters (e.g. P12345)
RefSeq nucleotide: two letters, an underscore and six digits (e.g. mRNA NM_, genomic NT_)
RefSeq protein: e.g. NP_00483
RefSeq prediction: e.g. XM_, XP_
PDB (protein structure): one digit followed by three letters or digits (e.g. 1TUP)
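The ambiguity can be shown with a rough sketch that matches an accession against the patterns listed above. The regular expressions are simplified assumptions derived only from this table (real accession formats have more variants), the RefSeq digit count is left open, and `candidate_origins` is a hypothetical helper.

```python
import re

# Simplified patterns taken from the table above (assumptions, not the
# official format definitions of any of these databases).
PATTERNS = [
    ("GenBank/EMBL/DDBJ", re.compile(r"[A-Z]\d{5}|[A-Z]{2}\d{6}")),
    ("Swiss-Prot/TrEMBL", re.compile(r"[A-Z][A-Z0-9]{5}")),
    ("RefSeq", re.compile(r"(NM|NT|NP|XM|XP)_\d+")),
    ("PDB", re.compile(r"\d[A-Z0-9]{3}")),
]


def candidate_origins(accession):
    """Return every record type whose pattern matches the whole accession."""
    acc = accession.strip().upper()
    return [name for name, pattern in PATTERNS if pattern.fullmatch(acc)]


for acc in ["U12345", "P12345", "NP_00483", "1TUP"]:
    print(acc, candidate_origins(acc))
```

The function returns a list rather than a single answer because one letter followed by five digits fits both the GenBank/EMBL/DDBJ pattern and the Swiss-Prot/TrEMBL pattern, which is exactly why the origin of a record is not always recognizable from its accession alone.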