Essential Bioinformatics and Biocomputing (LSM2104: Section I) Biological Databases and Bioinformatics Software Prof. Chen Yu Zong Tel: 6874-6877.

Essential Bioinformatics and Biocomputing (LSM2104: Section I) Biological Databases and Bioinformatics Software Prof. Chen Yu Zong Tel: 6874-6877 Email: csccyz@nus.edu.sg http://xin.cz3.nus.edu.sg Room 07-24, level 7, SOC1, NUS January 2003

Essential Bioinformatics and Biocomputing (LSM2104: Section I)
Four lectures
Part 1: Biological databases
  Lecture 2. Biological information and databases
  Lecture 3. More databases, retrieval systems, and database searching
Part 2: Software
  Lecture 4. Examples of the applications of bioinformatics software and basic principles
  Lecture 5. Overview of bioinformatics software

Part 1: Biological databases
Part 1 outline:
1. Biological information and databases
   Overview and definition, types of biological databases
2. Popular databases, records, data format
   GenBank, SwissProt, OMIM, PDB, KEGG, BIND, Pfam, PROSITE, PubMed
3. Accessing biological databases, retrieval systems
   Entrez, SRS
4. Searching biological databases
   Data quality, coverage, redundancy, errors
Textbook: T.K. Attwood and D.J. Parry-Smith, Introduction to Bioinformatics. Biological databases: chapters 3 and 4
Essential Bioinformatics and Biocomputing (LSM2104), NUS

Biological Information
Cancer as an example:
  Genes: growth genes, tumor suppressor genes
  Proteins: growth factors, enzymes, receptors
  Pathways: cell death
  Systems: immune system, blood supply
  Function: role of proteins, molecular interactions

Biological Information
Nucleic acids:
  DNA sequence, genes, gene products (proteins), mutation, gene coding, distribution patterns, motifs
  Genomics: genome, gene structure and expression, genetic map, genetic disorder
  RNA sequence, secondary structure, 3D structure, interactions
Proteins:
  Protein sequence, corresponding gene, secondary structure, 3D structure, function, motifs, homology, interactions
  Proteomics: expression profile, proteins in disease processes, etc.
  Ligands and drugs (inhibitors, activators, substrates, metabolites)

Biological Information
Pathways:
  Molecular networks, biological chain events, regulation, feedback, kinetic data
Function:
  Binding sites, interactions, molecular action (binding, chemical reaction, etc.)
  Biological effect (signaling, transport, feedback, regulation, modification, etc.)
  Functional relationship, protein families, motifs, and homologs

Biological databases
Purpose:
  To disseminate biological data and information
  To provide biological data in computer-readable form
  To allow analysis of biological data
A database needs, at minimum, a specific tool for searching and data extraction. Web pages, books, journal articles, tables, text files, and spreadsheet files cannot be considered databases.
Reading materials: Baxevanis AD. The Molecular Biology Database Collection: 2002 update. Nucleic Acids Res. 2002 Jan 1;30(1):1-12.

Biological databases
Lists of biological databases:
  INFOBIOGEN Catalog of Databases http://www.infobiogen.fr/services/dbcat/
  Nucleic Acids Research Database Listing http://nar.oupjournals.org/cgi/content/full/30/1/1/DC1
These listings serve as starting points for finding biological databases. More than 500 databases have been catalogued to date, and those in the two listings satisfy minimal criteria for content, access, and quality.
Other sites as a starting point.

Biological databases
INFOBIOGEN Catalog of Databases
Type of database     No. of databases
DNA                  87
RNA                  29
Protein              94
Genomic              58
Mapping              29
Protein structure    18
Literature           43
Miscellaneous        153
Total                511

Biological databases in Nucleic Acids Research
Type of database                              No. of databases
Major Sequence Repositories                   7
Comparative Genomics                          7
Gene Expression                               20
Gene Identification and Structure             30
Genetic and Physical Maps                     10
Genomic Databases                             48
Intermolecular Interactions                   5
Metabolic Pathways and Cellular Regulation    12
Mutation Databases                            33
Pathology                                     8
Protein Databases                             50
Protein Sequence Motifs                       18
Proteome Resources                            7
RNA Sequences                                 26
Retrieval Systems and Database Structure      3
Structure                                     32
Transgenics                                   2
Varied Biomedical Content                     18
Total                                         336

Literature databases – PubMed (MedLine)
1. It contains entries for more than 11 million abstracts of scientific publications.
2. It lets users run keyword searches, provides links to a selection of full articles, and has text mining capabilities, e.g. links to related articles and GenBank entries, among others.
3. Searching PubMed efficiently requires some skill. For example, searching with the keyword “interleukin” returns 108,366 matches.

PubMed web-site (http://www3.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed)

PubMed Search (http://www3.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed)
Key Word                           No. of Entries
Cancer                             1.45M
Blood supply                       22K
Protein                            3.9K
Enzyme                             1.5K
Cancer Blood supply Enzyme Drug    500
Cancer treatment by targeting blood supply:
  Cancer growth depends on blood supply (why?) and thus requires the growth of new blood vessels (angiogenesis).
  Proteins involved in angiogenesis may be potential anticancer targets.
  You can find some of these targets by searching PubMed.
  The keywords “cancer angiogenesis enzyme drug” produce 856 entries.
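
The keyword narrowing on this slide can also be expressed as a programmatic query. The following is a minimal sketch, assuming NCBI's E-utilities `esearch` endpoint (the slides themselves only use the web form) and a hypothetical helper name `build_pubmed_query`. It only builds the query URL; it does not contact the server, and the hit counts it would return today will differ from the 2003 figures above.

```python
from urllib.parse import urlencode

# Assumption: NCBI E-utilities esearch endpoint; not part of the original slides.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_query(keywords, retmax=20):
    """Return an esearch URL that requires every keyword (boolean AND)."""
    term = " AND ".join(keywords)
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return ESEARCH_URL + "?" + urlencode(params)


# One broad keyword versus the four-keyword combination from the slide.
broad = build_pubmed_query(["cancer"])
narrow = build_pubmed_query(["cancer", "angiogenesis", "enzyme", "drug"])
print(broad)
print(narrow)
```

Each added keyword is joined with AND, so every extra term can only keep or shrink the result set, which is the effect the table illustrates.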

Nucleic Acids databases
What information is in these databases:
  DNA sequence, genes, gene products (proteins), mutation, gene coding, distribution patterns, motifs
  Genomics: genome, gene structure and expression, genetic map, genetic disorder
  RNA sequence, secondary structure, 3D structure, interactions

Nucleic Acids databases
DNA databases – GenBank, EMBL, DDBJ
1. General-purpose databases focusing on DNA sequences and their properties.
2. GenBank, EMBL-Bank, and DDBJ exchange data to ensure comprehensive worldwide coverage, and accession numbers are managed consistently among the three centers.
Reading materials: Textbook, chapter 4

DNA databases
GenBank database (http://www.ncbi.nih.gov/Genbank/)
  Contains publicly available DNA sequences from more than 100,000 organisms.
  Also contains derived protein sequences, and annotations describing biological, structural, and other relevant features.
  Accessible through Entrez, NCBI’s integrated retrieval system (studied later).
  Sequence similarity search tools: BLAST (studied later).
EMBL nucleotide sequence database (http://www.ebi.ac.uk/embl/)
  Contains nucleotide sequences collected from all public sources.
  Accessible through the Sequence Retrieval System (SRS), which allows keyword searching (studied later).
  Sequence similarity search tools: Blitz, Fasta, and BLAST (studied later).

DNA databases: GenBank Web page

DNA databases
An example from GenBank – flat file
Human alpha-lactalbumin gene
This protein is a complex of 2 proteins, A and B. In the absence of the B protein, the enzyme catalyzes the transfer of galactose from UDP-galactose to N-acetylglucosamine (cf. EC 2.4.1.90).

A GenBank entry – HEADER

GenBank Entry – Links provided in the Header
  MapViewer – find the gene position in the chromosome
  Related Sequences – other entries related to this gene (or sequence)
  OMIM – link to the catalog of human genes and genetic disorders
  Protein – retrieve the protein record from GenPept
  Medline and PubMed – literature abstracts related to this gene
  Taxonomy – classification of organisms
  UniGene – unified gene data
  UniSTS – unified sequence tagged sites, marker and mapping data
  LinkOut – links to publishers, aggregators, libraries, biological databases, sequence centers, and other Web resources
  REFSEQ – reference sequence standards
Note: These links are representative. Other links may also be found in GenBank entries.

GenBank entry - FEATURES

GenBank Entry – Links provided in the Feature section
  LocusID – locus and display of genomic and mRNA sequences
  MIM – link to the OMIM description, other entries for this sequence
  EC_number – link to the corresponding cataloged enzymes
  Protein_id – retrieve the protein record from GenPept
  CD – conserved protein domain (SMART)
  CDD – conserved protein domain (Pfam)

Biological databases: GenBank - SEQUENCE
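
The three parts of a GenBank flat file shown on the preceding slides (header, FEATURES, and sequence) can be separated with a few lines of code. The following is a minimal sketch, not a complete parser: the record `RECORD` is a made-up, abbreviated example written for illustration (it is not the alpha-lactalbumin entry), the function name `parse_genbank` is hypothetical, and multi-line locations or qualifiers are not handled.

```python
# Hypothetical, abbreviated record in GenBank flat-file layout (illustration only).
RECORD = """\
LOCUS       EXAMPLE01                 60 bp    DNA     linear   PRI 01-JAN-2003
DEFINITION  Hypothetical example record for illustration only.
ACCESSION   EXAMPLE01
FEATURES             Location/Qualifiers
     source          1..60
                     /organism="Homo sapiens"
     CDS             1..60
                     /product="hypothetical protein"
ORIGIN
        1 atggctagct agctagctag ctagctagct agctagctag ctagctagct agctagctaa
//
"""


def parse_genbank(text):
    """Split one flat-file record into header fields, features, and sequence."""
    header, features, chunks = {}, [], []
    section, key = "header", None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("//"):          # end-of-record marker
            break
        if line.startswith("FEATURES"):
            section = "features"
            continue
        if line.startswith("ORIGIN"):
            section = "sequence"
            continue
        if section == "header":
            if not line[0].isspace():      # keyword starts in the first column
                key, _, value = line.partition(" ")
                header[key] = value.strip()
            elif key:                      # indented continuation line
                header[key] += " " + line.strip()
        elif section == "features":
            stripped = line.strip()
            if stripped.startswith("/") and features:
                name, _, value = stripped[1:].partition("=")
                features[-1]["qualifiers"][name] = value.strip('"')
            else:
                kind, _, location = stripped.partition(" ")
                features.append(
                    {"type": kind, "location": location.strip(), "qualifiers": {}}
                )
        else:
            # Sequence lines: a position number followed by blocks of bases.
            chunks.append("".join(line.split()[1:]))
    return header, features, "".join(chunks)


header, features, sequence = parse_genbank(RECORD)
print(header["ACCESSION"], [f["type"] for f in features], len(sequence))
```

For real records, an established library parser is a safer choice than this sketch, since actual entries contain many more keywords and multi-line fields.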

GenBank - NOTES
The majority of GenBank entries have a form similar to our example. When accessing the database, note the following:
  Some entries are huge, containing as many as 30,000 lines. (NT_021877 Homo sapiens chromosome 1 working draft sequence segment)
  Some entries have contig information instead of sequence information. (NT_021877 Homo sapiens chromosome 1 working draft sequence segment)
  Some entries are derived from cDNA sequences and thus represent putative genes/proteins. These should be used with caution. (AK007430. Mus musculus 10 d...[gi:12840976])
  Some annotations are predicted using automated analysis. These should also be used with caution. (XM_131483 Mus musculus simi...[gi:20832685])

GenBank - Statistics
Year         Base Pairs        Sequences
(missing)    680338            606
(missing)    101008486         78608
(missing)    11101066288       10106023
(missing)    15849921438       14976310
Data size is large and increases fast.
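
How fast the data grows can be read directly from the four rows above. The following is a small sketch of that arithmetic; the year values are not available in this transcript, so rows are compared by position only, and the helper names `fold_increase` and `mean_lengths` are illustrative.

```python
# (base pairs, sequences) exactly as listed on the slide, in slide order.
# The year column is missing from the transcript, so no years are assumed here.
ROWS = [
    (680_338, 606),
    (101_008_486, 78_608),
    (11_101_066_288, 10_106_023),
    (15_849_921_438, 14_976_310),
]


def fold_increase(rows):
    """Ratio of the last row to the first row, for base pairs and sequences."""
    (first_bp, first_n), (last_bp, last_n) = rows[0], rows[-1]
    return last_bp / first_bp, last_n / first_n


def mean_lengths(rows):
    """Average sequence length (base pairs per sequence) for each row."""
    return [bp / n for bp, n in rows]


bp_fold, seq_fold = fold_increase(ROWS)
print(f"base pairs grew {bp_fold:,.0f}-fold, sequences grew {seq_fold:,.0f}-fold")
for (bp, n), mean in zip(ROWS, mean_lengths(ROWS)):
    print(f"{n:>12,} sequences, mean length {mean:,.0f} bp")
```

Both totals grow by more than four orders of magnitude between the first and last rows, while the mean entry length stays near a thousand base pairs, so the growth comes from the number of entries rather than from longer entries.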

Biological Databases
Database Searching
Databases must have methods for accessing and extracting the stored data.
The most basic search is keyword searching.
  A keyword can be any word that occurs somewhere in the database records.
  It can be the name of the gene or protein (e.g. lactalbumin), the species (e.g. Homo sapiens, human), a taxonomy term (e.g. primates), or a word from the reference title (e.g. cancer).
  Other search keys include the entry ID number and the sequence.
Databases typically have hyperlinks that provide access to additional information related to the entry from other sources.
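
Keyword searching as described here can be illustrated with a toy inverted index. This is a sketch under stated assumptions: the three records, their IDs, and their field names are invented for the example, and real databases use far more elaborate indexing.

```python
from collections import defaultdict

# Hypothetical toy records; IDs and field names are invented for illustration.
RECORDS = {
    "REC1": {
        "definition": "Human alpha-lactalbumin gene",
        "organism": "Homo sapiens",
        "taxonomy": "Primates",
    },
    "REC2": {
        "definition": "Mouse lysozyme mRNA",
        "organism": "Mus musculus",
        "taxonomy": "Rodentia",
    },
    "REC3": {
        "definition": "Human lysozyme gene",
        "organism": "Homo sapiens",
        "taxonomy": "Primates",
    },
}


def build_index(records):
    """Map every lowercase word in any field to the set of record IDs containing it."""
    index = defaultdict(set)
    for rec_id, fields in records.items():
        for text in fields.values():
            for word in text.lower().replace("-", " ").split():
                index[word].add(rec_id)
    return index


def search(index, *keywords):
    """Return the sorted IDs of records that contain all of the given keywords."""
    if not keywords:
        return []
    hits = [index.get(word.lower(), set()) for word in keywords]
    return sorted(set.intersection(*hits))


index = build_index(RECORDS)
print(search(index, "lactalbumin"))        # gene or protein name
print(search(index, "primates"))           # taxonomy term
print(search(index, "human", "lysozyme"))  # two keywords combined
```

Because every field is indexed the same way, a gene name, a species name, and a taxonomy term are all searched with the same call, which matches the slide's point that a keyword can be any word in the record.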

Biological databases: OMIM
Online Mendelian Inheritance in Man (http://www.ncbi.nlm.nih.gov/Omim/)
The OMIM database contains abstracts and texts describing genetic disorders to support genomics efforts and clinical genetics.
It provides gene maps and known disorder maps in tabular listing formats.
It supports keyword search.
Hamosh A. et al. Online Mendelian Inheritance in Man (OMIM), a knowledge base of human genes and genetic disorders. Nucleic Acids Res. 2002 30: 52-55.

Biological databases: OMIM web-page

Biological databases: OMIM search engine

Biological databases: OMIM statistics
All Entries:               14088
Established Gene Locus:    10476
Phenotype Descriptions:    1194
Other Entries:             2418

Biological databases
Protein databases
SWISS-PROT (http://us.expasy.org/sprot/sprot-top.html) is a curated database focusing on a high level of annotation (sequence, function, structure, post-translational modifications, variants, etc.) of proteins.
TrEMBL is a computer-annotated supplement to SWISS-PROT.
Reading materials: Textbook, chapter 3

Protein databases
What is in these databases:
  Protein sequence, corresponding gene, secondary structure, 3D structure, function, motifs, homology, interactions
  Proteomics: expression profile, proteins in disease processes, etc.
  Ligands and drugs (inhibitors, activators, substrates, metabolites)

Protein databases – SWISS-PROT
Notes:
  SWISS-PROT provides high-quality annotations and detailed information about sequence, structural, functional, and other properties of proteins.
  It provides a rich set of links to other sources of information on SWISS-PROT entries. Unfortunately, some of the links will not work at all times, because the Web changes constantly.
  It also provides a rich set of protein analysis tools.

SWISS-PROT web-page

SWISS-PROT entry P00709

Biological databases: Protein structure database: PDB (http://www.pdb.org)
  More than 18,000 macromolecular structures of proteins, peptides, viruses, protein/nucleic acid complexes, nucleic acids, and carbohydrates.
  Among the oldest databases: the first structure was deposited in 1972.
  The number of newly deposited structures has been growing steadily (3298 in 2001, and 1486 from Jan 1 to June 5, 2002).
  Structures are determined mainly by X-ray diffraction and NMR.
  It contains tools for keyword search, comprehensive visualization, and information extraction, such as sequence, geometry, and structural neighbor details.
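
The two deposition figures on this slide cover periods of different length, so comparing them requires annualizing the partial 2002 count. The following is a rough sketch of that arithmetic, assuming depositions arrive at a uniform rate throughout the year (an assumption made here for illustration, not a claim from the slide); the helper name `annualized` is hypothetical.

```python
from datetime import date

DEPOSITED_2001 = 3298          # full year 2001, from the slide
DEPOSITED_2002_PARTIAL = 1486  # Jan 1 to June 5, 2002, from the slide


def annualized(count, start, end, year_days=365):
    """Scale a partial-period count to a full year, assuming a uniform rate."""
    days = (end - start).days + 1  # count both the first and the last day
    return count * year_days / days


rate_2002 = annualized(DEPOSITED_2002_PARTIAL, date(2002, 1, 1), date(2002, 6, 5))
print(f"2002 pace: about {rate_2002:.0f} structures per year, "
      f"compared with {DEPOSITED_2001} in 2001")
```

Under the uniform-rate assumption the 2002 pace is somewhat above the 2001 total, which is consistent with the slide's statement that depositions are growing.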

Biological databases: PDB web-page http://www.rcsb.org/pdb/

Biological databases: A PDB entry http://www.rcsb.org/pdb/

Biological databases: PDB statistics

Biological databases
Summary of Today’s lecture
  Types of biological information, data, and databases
  Simple data retrieval method
  Popular databases: PubMed, GenBank, SwissProt, OMIM, PDB
Statistics:
  Large number of publications (MEDLINE: >12M since 1960)
  Large amount of data for sequence (DNA: >14M, Protein: >120K)
  Fair amount of data for 3D structure (Protein: >14K, Nucleic acid: >1K)