Biological Databases.

Slides:



Advertisements
Similar presentations
NCBI/WHO PubMed/Hinari Course NCBI Literature Databases: PubMed Background.
Advertisements

NCBI data, sliding window programs and dot plots Sept. 25, 2012 Learning objectives-Become familiar with OMIM and PubMed. Understand the difference between.
Creating NCBI The late Senator Claude Pepper recognized the importance of computerized information processing methods for the conduct of biomedical research.
COT 6930 HPC and Bioinformatics Bioinformatics Resources and Databases Xingquan Zhu Dept. of Computer Science and Engineering.
Provenance in a Collaborative Bio-database RAASWiki Donald Dunbar & Jon Manning Queen’s Medical Research Institute University of Edinburgh Use Cases for.
1.
NCBI web resources I: databases and Entrez Yanbin Yin Fall 2014 Most materials are downloaded from ftp://ftp.ncbi.nih.gov/pub/education/ 1.
The National Center for Biotechnology Information (NCBI) a primary resource for molecular biology information Database Resources.
Prof. Carolina Ruiz Computer Science Department Bioinformatics and Computational Biology Program WPI WELCOME TO BCB4003/CS4803 BCB503/CS583 BIOLOGICAL.
Basic Genomic Characteristic  AIM: to collect as much general information as possible about your gene: Nucleotide sequence Databases ○ NCBI GenBank ○
The design, construction and use of software tools to generate, store, annotate, access and analyse data and information relating to Molecular Biology.
Bioinformatics for biomedicine Summary and conclusions. Further analysis of a favorite gene Lecture 8, Per Kraulis
Protein databases Morten Nielsen. Background- Nucleotide databases GenBank, National Center for Biotechnology Information.
Archives and Information Retrieval
Sequence Analysis MUPGRET June workshops. Today What can you do with the sequence? What can you do with the ESTs? The case of SNP and Indel.
Lecture 2.21 Retrieving Information: Using Entrez.
Genome Related Biological Databases. Content DNA Sequence databases Protein databases Gene prediction Accession numbers NCBI website Ensembl website.
The Cell, Central Dogma and Human Genome Project.
Biological Databases Chi-Cheng Lin, Ph.D. Associate Professor Department of Computer Science Winona State University – Rochester Center
IST Computational Biology1 Information Retrieval Biological Databases 2 Pedro Fernandes Instituto Gulbenkian de Ciência, Oeiras PT.
Protein databases Henrik Nielsen. Background- Nucleotide databases GenBank, National Center for Biotechnology Information.
Bioinformatics & LIS A brief talk for librarians, information scientists, and computer scientists about resources and collaborative opportunities with.
Signaling Pathways and Summary June 30, 2005 Signaling lecture Course summary Tomorrow Next Week Friday, 7/8/05 Morning presentation of writing assignments.
ExPASy - Expert Protein Analysis System The bioinformatics resource portal and other resources An Overview.
An Introduction to Bioinformatics Molecular Biology Databases.
Doug Brutlag Professor Emeritus Biochemistry & Medicine (by courtesy) Genome Databases Computational Molecular Biology Biochem 218 – BioMedical Informatics.
Introductory Overview
Bioinformatics.
Development of Bioinformatics and its application on Biotechnology
Databases in Bioinformatics and Systems Biology Carsten O. Daub Omics Science Center RIKEN, Japan May 2008.
Bioinformatics for biomedicine
X-ray crystallography NMR cryoEM Experimental approaches for structural biology.
Introduction to databases Tuomas Hätinen. Topics File Formats Databases -Primary structure: UniProt -Tertiary structure: PDB Database integration system.
Gene Expression Omnibus (GEO)
Information Resources for Bioinformatics 1 MARC: Developing Bioinformatics Programs July, 2008 Alex Ropelewski Hugh Nicholas
Secondary Databases Ansuman sahoo Roll: Y Bioinformatics Class Presentation 30 Jan 2013.
Searching PubMed® NCBI, NLM Resources, Micromedex -GSBS TTUHSC Preston Smith Library presents Rev. 08/17/14.
Biological Databases By : Lim Yun Ping E mail :
Doug Raiford Lesson 3.  More and more sequence data is being generated every day  Useless if not made available to other researchers.
Bioinformatics Overview, NCBI & GenBank JanPlan 2012.
Part I: Identifying sequences with … Speaker : S. Gaj Date
Organizing information in the post-genomic era The rise of bioinformatics.
Alastair Kerr, Ph.D. WTCCB Bioinformatics Core An introduction to DNA and Protein Sequence Databases.
NCBI Literature Databases: PubMed
Sequencing the World of Possibilities for Energy & Environment MGM workshop. 19 Oct 2010 Information Sources for Genomics Konstantinos Mavrommatis Genome.
EB3233 Bioinformatics Introduction to Bioinformatics.
Bioinformatics and Computational Biology
Computer Storage of Sequences
1 Discussion Practical 1. Features of major databases (PubMed and NCBI Protein Db) 2.
Primary vs. Secondary Databases Primary databases are repositories of “raw” data. These are also referred to as archival databases. -This is one of the.
EBI is an Outstation of the European Molecular Biology Laboratory. UniProtKB Sandra Orchard.
NCBI: something old, something new. What is NCBI? Create automated systems for knowledge about molecular biology, biochemistry, and genetics. Perform.
Information retrieval and sliding window programs April 5, 2011 Hand in Homework #1. Homework #2 due Tuesday, April 12. Learning objectives- Understand.
Entrez, dbSNP, GEO, OMIM & LinkOut JanPlan Entrez Distributed by NCBI in 1991 on CD-ROM Included linked nodes: GenBank & PDB Translated GenBank,
Introduction to Genes and Genomes with Ensembl
Protein databases Henrik Nielsen
Biological Databases By: Komal Arora.
Biological databases: Collection, storage and maintenance
Archives and Information Retrieval
생물정보학 Bioinformatics.
Functional Annotation of the Horse Genome
Mangaldai College, Mangaldai
Access to Sequence Data and Related Information
Gene Expression Omnibus (GEO)
Genomes and Their Evolution
Instructor: Kritika Karri
Introduction to Bioinformatics
Introduction to Databases
Gene Safari (Biological Databases)
SUBMITTED BY: DEEPTI SHARMA BIOLOGICAL DATABASE AND SEQUENCE ANALYSIS.
Presentation transcript:

Biological Databases

Biology Is A Data Science Hundreds of thousand of species Million of articles in scientific literature Genetic Information Gene names Phenotype of mutants Location of genes/mutations on chromosomes Linkage (relationships between genes)

Data and Metadata Data are “concrete” objects e.g. number, tweet, nucleotide sequence Metadata describes properties of data e.g. object is a number, each tweet has an author Database structure may contain metadata Type of object (integer, float, string, etc) Size of object (strings at most 4 characters long) Relationships between data (chromosomes have zero or more genes)

What is a Database? A data collection that needs to be : Organized Searchable Up-to-date Challenge: change “meaningless” data into useful, accessible information

A spreadsheet can be a Database Rectangular data Structured No metadata Search tools: Excel grep python/R

A filesystem can be a Database Hierarchical data Some metadata File, symlink, etc Unstructured Search tools: ls find locate

Organization and Types of Databases Every database has tools that: Store Extract Modify Flat file databases (flat DBMS) Simple, restrictive, table Hierarchical databases Simple, restrictive, tables Relational databases (RDBMS) Complex, versatile, tables Object-oriented databases (ODBMS) Data warehouses and distributed databases Unstructured databases (object store DBs)

Where do the data come from ?

Types of Biological Data Primary data types (observed properties): Molecular Sequence: nucleic or amino acids Quantity: DNA, RNA, Protein, cell count, metabolites Locality: membrane, nucleus, epithelium Structure: 3D conformation, proximity, size Secondary data types (inferred properties): Molecular Function Dynamics Relation: multiplicity, distribution, binding Association: co-occurrence, correlation Predicted: computational models

Types of Biological Databases Primary Databases: Original submissions by experimentalists Content controlled by the submitter Examples: GenBank, Trace, SRA, SNP, GEO Secondary databases: Results of analysis of primary databases Aggregate of many databases Content controlled by third party (NCBI) Examples: NCBI Protein, Refseq, RefSNP, GEO datasets, UniGene, Homologene, Structure, Conserved Domain

NCBI National Center for Biotechnology Information >50 different biological databases and tools Most popular: PubMed - search for citations/papers BLAST - search for sequences Nucleotide - all nucleotide sequences (DNA, etc) Genome - published genomes SNP (dbSNP) - all human SNPs Protein - amino acid sequences

International Sequence Database Collaboration National Centre for Biotechnology Information (NCBI) : https://www.ncbi.nlm.nih.gov/ European Nucleotide Archive (ENA) : https://www.ebi.ac.uk/ena DNA Data Bank of Japan (DDBJ) : http://www.ddbj.nig.ac.jp/

Data sharing collaboration Ensure data consistency Avoid duplication Open data sharing

Biological Databases I: Biomedical Literature

Biological Database I : Biomedical Literature Database Citation/publication databases Medline: https://www.nlm.nih.gov/bsd/pmresources.html NLM journal citation database. Includes citations 5,600 scholarly journals PubMed https://www.ncbi.nlm.nih.gov/pubmed/ Includes MEDLINE journals/manuscripts deposited in PMC NCBI Bookshelf

Searching PubMed with MeSH terms MeSH (Medical Subject Headings) is the NLM controlled vocabulary used for indexing articles for PubMed. the U.S. National Library of Medicine's controlled vocabulary arranged in a hierarchical manner called the MeSH Tree Structures updated annually

Biological Databases II: Genomics and Transcriptomics

Biological Database II - Genomics and Transcriptomics GenBank: https://www.ncbi.nlm.nih.gov/genbank/ Flat file DNA only sequence database Archival in nature: Historical, Redundant Sample GenBank record (accession number U49845) NCBI: https://www.ncbi.nlm.nih.gov/nuccore/U49845 ENA: https://www.ebi.ac.uk/ena/data/view/U49845 DDBJ: http://getentry.ddbj.nig.ac.jp/top-e.html

GenBank Flat File

Ensembl Comprehensive DNA/RNA sequence and annotation database Automated annotation: Genes (known or predicted) Single nucleotide polymorphisms (SNPs) Repeats Homology Analysis tools: BLAST BioMart Variant Effect Predictor

Nucleic Acid Structure Database NDB Nucleic acid-containing structures http://ndbserver.rutgers.edu/ NTDB Thermodynamic data for nucleic acids http://ntdb.chem.cuhk.edu.hk/ RNABase RNA-containing structures from PDB and NDB http://www.rnabase.org/ SCOR Structural classification of RNA: RNA motifs by structure, function and tertiary interactions http://scor.lbl.gov/

Biological Databases III: Proteomics

Biological Database III - Proteomics Protein sequence database: https://www.ncbi.nlm.nih.gov/protein/

Genpept

Uniprot The Universal Protein Resource Comprehensive resource for protein sequence and annotation data Collaboration between: EMBL-EBI Swiss Institute of Bioinformatics Protein Information Resource Entries in two categories: Swiss-Prot (experimentally verified) TrEMBL (computer-annotated) http://www.uniprot.org/

Protein Structure database - PDB Protein Data Bank (PDB) http://www.rcsb.org/ Dedicated to 3D structure of proteins and peptides ~150,000 predicted and experimental (solved) structures

PDB: kinesin 6

Protein Family Database http://pfam.xfam.org/family/piwi Pfam is a database of protein families that includes their annotations and multiple sequence alignments generated using hidden Markov models

Protein-Protein Interaction Database STRING: https://string-db.org/ Search Tool for the Retrieval of Interacting Genes/Proteins Database of protein/protein interactions Information from numerous sources: experimental data computational prediction methods public text collections Expressed as interaction graphs: Nodes: Network nodes represent proteins Edges: Edges represent protein-protein associations

Data vs Annotation Database RefSeq: curated nonredundant biological sequences https://www.ncbi.nlm.nih.gov/refseq/ Source: Genbank (INSDC) Annotated: Community collaboration, automated computer, NCBI staff curation Advantages of using RefSeq Non-redundancy Curated, validated Format consistency Distinct accession series Updates to reflect current sequence data and biology https://www.nature.com/scitable/topicpage/genomic-data-resources-challenges-and-promises-743721#

Selected Refseq Accession

High-Throughput Sequencing Databases Gene Expression Omnibus (GEO) NCBI Sequence Read Archive (SRA) db of Genotype and Phenotype (dbGAP) European Genome Phenome Archive GEO (e.g. GSE37757) Submitters supply their gene expression data in four sections: Platform: describes the list of features on the array (e.g.,cDNAs, oligonucleotides, etc.) Sample: describes the biological material and the experimental conditions under which the sample was handled, and the abundance measurement of each feature derived from it. Series: defines a set of related Samples that are considered to be part of an experiment. Supplementary data: original microarray scan images or raw quantification data. FTP download: All Platform, Sample and Series records, raw data, with annotation are available for bulk download via FTP at

Other Specialised Databases UCSC Xena: https://xenabrowser.net/datapages/ Genotype-Tissue Expression Gtex: https://www.gtexportal.org/home/ mirBase:http://www.mirbase.org/ Pubchem: https://pubchem.ncbi.nlm.nih.gov/ DrugBank: https://www.drugbank.ca/ Many more...

BLAST BLAST stands for Basic Local Alignment Search Tool Good balance of sensitivity and speed Searches all publicly available sequences Flexible Produce local alignments Applies heuristic, no optimality guarantee