Archives and Information Retrieval

Reading: Arthur M. Lesk, Introduction to Bioinformatics, Fourth Edition, Chapter 4.

Introduction
Learning objectives:
What is the general arrangement of biological data in the public databases?
How does one retrieve information on a particular subject in the literature?
To acquire the information retrieval skills that will allow you to make effective use of the databases.
To become familiar with basic operations.
Many databanks have embedded tutorials, which make it easy to explore their facilities.

Primary public domain bioinformatics servers
Each facility provides both databases and analysis tools:
European Bioinformatics Institute (EBI), United Kingdom: home of the EMBL Data Library.
National Center for Biotechnology Information (NCBI), United States.
GenomeNet (KEGG and DDBJ), Japan: DDBJ, the DNA Data Bank of Japan, is run by the National Institute of Genetics.

The Archives
Experiments produce massive amounts of biological data. The databases that hold this information can be classified into two types:
First-level databases hold the raw data obtained from experiments, which have passed through only simple classification, organization and annotation.
Second-level databases are built by further reorganizing those data in order to achieve specific goals.

The Archives: some examples
First-level databases:
Nucleic acid sequence databases: GenBank, EMBL Data Library, DNA Data Bank of Japan (DDBJ)
Protein sequence databases: SWISS-PROT, PIR
Protein structure database: PDB
Second-level databases:
GDB (Human Genome Database): a center for the collection of human genetic mapping data, which relies on world-class leaders in human genetics to act as curators for the data.
TRANSFAC: a database of eukaryotic transcription factors, their genomic binding sites and DNA-binding profiles.
SCOP: Structural Classification of Proteins.

Nucleic acid sequence databases
The International Nucleotide Sequence Database Collaboration is a three-way partnership:
GenBank: National Center for Biotechnology Information (NCBI), USA (1982)
EMBL Data Library: European Bioinformatics Institute (EBI), Europe (1982)
DDBJ (DNA Data Bank of Japan): National Institute of Genetics, Japan (1988)
The raw data are identical, but the format in which they are stored and the nature of the annotation vary among the three.
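To make the point about differing storage formats concrete, here is a minimal Python sketch that tells the two common flat-file layouts apart. It assumes the usual conventions (GenBank and DDBJ records begin with a LOCUS line and use full keywords such as ACCESSION, while EMBL records begin with an ID line and use two-letter line codes such as AC). The function names and both sample records are invented for illustration and are not real database entries.

```python
from typing import Optional


def detect_flatfile_format(record: str) -> str:
    """Guess the flat-file layout from the first non-blank line."""
    for line in record.splitlines():
        if not line.strip():
            continue
        if line.startswith("LOCUS"):
            return "genbank"   # GenBank/DDBJ style keyword
        if line.startswith("ID "):
            return "embl"      # EMBL style two-letter line code
        break
    return "unknown"


def accession(record: str) -> Optional[str]:
    """Return the primary accession number in either layout."""
    for line in record.splitlines():
        if line.startswith("ACCESSION"):
            parts = line.split()
            return parts[1] if len(parts) > 1 else None
        if line.startswith("AC "):
            return line[2:].strip().split(";")[0].strip() or None
    return None


# Hypothetical records: the same made-up sequence in both layouts.
GENBANK_SAMPLE = (
    "LOCUS       DEMO0001   12 bp    DNA\n"
    "DEFINITION  Illustrative record.\n"
    "ACCESSION   DEMO0001\n"
    "ORIGIN\n"
    "        1 atgcatgcat gc\n"
    "//\n"
)
EMBL_SAMPLE = (
    "ID   DEMO0001; SV 1; linear; genomic DNA; STD; UNC; 12 BP.\n"
    "AC   DEMO0001;\n"
    "DE   Illustrative record.\n"
    "SQ   Sequence 12 BP;\n"
    "     atgcatgcat gc        12\n"
    "//\n"
)

for sample in (GENBANK_SAMPLE, EMBL_SAMPLE):
    print(detect_flatfile_format(sample), accession(sample))
```

Both samples yield the same accession even though the surrounding layout differs, which is the sense in which the raw data are identical while the format varies.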

NCBI
Established in the USA in 1988 as a national resource for molecular biology information, NCBI:
creates public databases;
conducts research in computational biology;
develops software tools for analyzing genome data;
disseminates biomedical information;
all for the better understanding of molecular processes affecting human health and disease.

Nucleic acid sequence databases: GenBank
GenBank is an annotated collection of all publicly available nucleic acid sequences and their protein translations, together with the literature references and biological annotation connected with them.
A new release of GenBank is made every two months.
GenBank comes with its own information retrieval system.
GenBank is part of the International Nucleotide Sequence Database Collaboration, which also includes the DNA Data Bank of Japan (DDBJ) and the European Molecular Biology Laboratory (EMBL). The data are exchanged among these three organizations on a daily basis.
GenBank is the primary sequence repository: it contains the annotated sequences submitted by the original authors, and only the original authors can change them.

NCBI ENTREZ
Entrez is a platform that provides access and links to databases with biological information. It offers access through the following database divisions:
PubMed
MedLine
GenBank
Protein databases
Genomes
PopSet
Taxonomy
OMIM

NCBI ENTREZ: what each division contains
PubMed / MedLine: literature database.
PubMed Central (PMC): the U.S. National Library of Medicine's digital archive of life sciences journal literature. Access to the full text of articles in PMC is free, except where a journal requires a subscription for access to recent articles.
GenBank: database of all publicly available DNA sequences.
Protein databases: amino acid sequences from SwissProt, PIR, PRF and PDB, plus translations from annotated coding regions in GenBank and RefSeq.
Genomes: database of genomes from organisms and viruses.
Taxonomy: the names of all organisms that are represented in the NCBI genetic databases by at least one nucleotide or protein sequence.
PopSet: aligned sequences submitted as a set resulting from a population, phylogenetic, or mutation study, collected to analyze the evolutionary relatedness of a population. These alignments describe such events as evolution and population variation.
OMIM (Online Mendelian Inheritance in Man): a catalog of human genes and genetic disorders.
RefSeq (Reference Sequence): aims to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products, for major research organisms. RefSeq standards serve as the basis for medical, functional, and diversity studies; they provide a stable reference for gene identification and characterization, mutation analysis, expression studies, polymorphism discovery, and comparative analyses. RefSeqs are used as a reagent for the functional annotation of some genome sequencing projects, including those of human and mouse.
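The divisions above can also be queried programmatically. The sketch below is not part of the original slides: it assumes NCBI's public E-utilities web service and its ESearch endpoint, and the mapping from division names to `db` identifiers reflects the names in common use at the time of writing, so check them against the current E-utilities documentation. The code only builds the query URL and makes no network request.

```python
from urllib.parse import urlencode

# Assumed base address of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

# Assumed mapping from the division names on the slide to Entrez "db" values.
ENTREZ_DB = {
    "PubMed": "pubmed",
    "PubMed Central": "pmc",
    "GenBank": "nuccore",
    "Protein": "protein",
    "Genomes": "genome",
    "PopSet": "popset",
    "Taxonomy": "taxonomy",
    "OMIM": "omim",
}


def esearch_url(division: str, term: str, retmax: int = 20) -> str:
    """Build an ESearch URL that searches one Entrez division for a term."""
    params = {"db": ENTREZ_DB[division], "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


print(esearch_url("OMIM", "cystic fibrosis", retmax=5))
```

Opening the printed URL in a browser should return an XML list of matching record identifiers, assuming the service is reachable and the parameters are still current.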

PubMed Central
PubMed Central (PMC) is the U.S. National Library of Medicine's digital archive of life sciences journal literature.
Access to the full text of articles in PMC is free, except where a journal requires a subscription for access to recent articles.

OMIM: Online Mendelian Inheritance in Man
A catalog of human genes and genetic disorders.
Begun by Victor A. McKusick at Johns Hopkins University.
A good place to start when you want to research a certain disease or biological molecule.
The database is cross-referenced to PubMed and other NCBI-based databases.

Complete ENTREZ database divisions

How to submit sequence data to GenBank
BankIt, a web-based interface: http://www.ncbi.nlm.nih.gov/BankIt
Sequin program: http://www.ncbi.nlm.nih.gov/Sequin

Protein databases
The Protein Information Resource (PIR) was established in 1984 by the National Biomedical Research Foundation (NBRF).
The PIR Protein Sequence Database evolved from the original NBRF Protein Sequence Database, developed over 20 years.
PIR-International is a collaboration between NBRF, the Munich Information Center for Protein Sequences (MIPS), and the Japan International Protein Information Database (JIPID). Together they collect and publish what is now the oldest and largest database of biomolecular sequence, source, literature, and feature information.

PIR
PIR-International Protein Sequence Database: an annotated, non-redundant database of protein sequences, cross-referenced to the major nucleic acid, literature, genome, structure, sequence alignment and family databases.
PIR maintains several auxiliary databases to help annotation and for integrity checking:
PIR-ALN (PIR Alignment Database): sequence alignments of superfamilies, families and homology domains produced from information in the Protein Sequence Database.
FAMBASE (Family Database): a searchable database containing a single representative sequence from each protein family.
RESID (Database of Amino Acid Modifications): covalent protein modifications, based on feature information in the Protein Sequence Database.
All the databases can be accessed on the PIR Web site (http://www-nbrf.georgetown.edu/pir/) and contain hypertext links to each other and to relevant external databases. The Web site is being redesigned to include new BLAST similarity search engines and pattern matching capabilities.
The latest quarterly release of the databases can be accessed through the ATLAS multi-database retrieval software on the Atlas CD-ROM, or downloaded by FTP.

PIR http://www-nbrf.georgetown.edu/pir/

SWISS-PROT
http://www.ebi.ac.uk/swissprot/
A well-annotated protein sequence database established in 1986.
It is maintained collaboratively by the Swiss Institute of Bioinformatics (SIB) and the European Bioinformatics Institute (EBI).
It is a curated protein sequence database that provides a high level of annotation, a minimal level of redundancy and a high level of integration with other databases.
Note: Swiss-Prot and TrEMBL have been incorporated into UniProt (Universal Protein Resource) as UniProtKB/Swiss-Prot and UniProtKB/TrEMBL. UniProt is a one-stop shop allowing easy access to all publicly available information about protein sequences.
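One practical consequence of the merge into UniProt is that downloaded sequences carry a marker for the section they came from. The following Python sketch is an illustration added here, not something from the slides. It assumes the FASTA header convention in which entries start with `>sp|` for UniProtKB/Swiss-Prot or `>tr|` for UniProtKB/TrEMBL, followed by the accession and the entry name. The function name and the returned dictionary keys are hypothetical.

```python
import re

# Assumed header layout: >db|ACCESSION|ENTRY_NAME free-text description
HEADER_RE = re.compile(r"^>(sp|tr)\|([^|]+)\|(\S+)\s*(.*)$")


def parse_uniprot_header(header: str) -> dict:
    """Split a UniProt-style FASTA header into its main fields."""
    match = HEADER_RE.match(header.strip())
    if match is None:
        raise ValueError(f"not a UniProt-style FASTA header: {header!r}")
    section, accession, entry_name, description = match.groups()
    return {
        # "sp" marks the curated section, "tr" the automatically annotated one.
        "section": "Swiss-Prot" if section == "sp" else "TrEMBL",
        "accession": accession,
        "entry_name": entry_name,
        "description": description,
    }


example = ">sp|P69905|HBA_HUMAN Hemoglobin subunit alpha OS=Homo sapiens OX=9606 GN=HBA1 PE=1 SV=2"
print(parse_uniprot_header(example))
```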

PROSITE
http://ca.expasy.org/prosite/
PROSITE is a method of determining the function of uncharacterized proteins translated from genomic or cDNA sequences.
It consists of a database of biologically significant sites and patterns, formulated in such a way that, with appropriate computational tools, it can rapidly and reliably identify to which known family of proteins (if any) a new sequence belongs.
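As an illustration of how such patterns can be applied with ordinary computational tools, the sketch below translates a simplified subset of the PROSITE pattern syntax into a Python regular expression. The subset handled is an assumption of this example: elements separated by `-`, `x` for any residue, `[...]` for allowed residues, `{...}` for excluded residues, `(n)` or `(n,m)` for repeats, and `<` or `>` as terminal anchors. The function name is hypothetical and the real syntax has further cases not covered here. The N-glycosylation site pattern `N-{P}-[ST]-{P}` is used as the example.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a simplified PROSITE pattern into a Python regex string."""
    regex_parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # pattern anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # pattern anchored at the C-terminus
            suffix, element = "$", element[:-1]

        # Separate the residue specification from an optional (n) or (n,m).
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        if match is None:
            raise ValueError(f"cannot parse element {element!r}")
        core, low, high = match.groups()

        if core == "x":                                      # any residue
            body = "."
        elif core.startswith("[") and core.endswith("]"):    # allowed set
            body = core
        elif core.startswith("{") and core.endswith("}"):    # excluded set
            body = "[^" + core[1:-1] + "]"
        elif len(core) == 1 and core.isalpha():              # single residue
            body = core
        else:
            raise ValueError(f"unsupported element {core!r}")

        if low is not None:
            body += "{" + low + ("," + high if high else "") + "}"
        regex_parts.append(prefix + body + suffix)
    return "".join(regex_parts)


n_glyc = re.compile(prosite_to_regex("N-{P}-[ST]-{P}"))
print(n_glyc.pattern)
print(bool(n_glyc.search("AANGSAA")), bool(n_glyc.search("AANPSAA")))
```

The second test sequence fails to match because a proline directly follows the asparagine, which the `{P}` element excludes.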

PDB
http://www.rcsb.org/pdb/
The Protein Data Bank (PDB) is the single international repository for public data on the 3-dimensional structures of biological macromolecules.
It was established at Brookhaven National Laboratory in the United States.
The contents are primarily experimental data derived from X-ray crystallography and NMR experiments.
RasMol can display the 3D structure of a biological macromolecule from a PDB file.
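A viewer such as RasMol works from the atomic coordinates stored in a PDB file. The sketch below reads one coordinate record in Python, as an illustration only. It assumes the fixed-column layout of the legacy PDB text format (atom serial in columns 7 to 11, atom name in 13 to 16, residue name in 18 to 20, chain in 22, residue number in 23 to 26, and x, y, z in 31 to 54). The function name is hypothetical and the sample atom is invented rather than taken from a real entry.

```python
def parse_atom_line(line: str) -> dict:
    """Extract the main fields from an ATOM or HETATM record."""
    if not line.startswith(("ATOM", "HETATM")):
        raise ValueError("not an ATOM/HETATM record")
    # Python slices are 0-based and end-exclusive, so columns 7-11 are [6:11].
    return {
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
    }


# Invented sample atom, built with format specs so the columns line up exactly.
SAMPLE = (
    f"{'ATOM':<6}{1:>5} {'N':^4} {'MET':>3} {'A'}{1:>4}    "
    f"{27.340:8.3f}{24.430:8.3f}{2.614:8.3f}"
)

atom = parse_atom_line(SAMPLE)
print(atom["res_name"], atom["chain"], atom["res_seq"], atom["x"], atom["y"], atom["z"])
```

For real work an established parser is the safer choice, since actual files contain many other record types and edge cases that this sketch ignores.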