Bioinformatics Ayesha M. Khan Spring 2013.



Introduction to databases (Lec-3)
If we are to derive the maximum benefit from the deluge of sequence information, we must deal with it in a concerted way: establish, maintain, and disseminate the information contained in databases.

Introduction to databases
Databases are effectively electronic filing cabinets: a convenient and efficient method of storing vast amounts of information.
- Central, shareable resources
- Many different types of databases, depending on:
  - Nature of the information being stored
  - Manner of data storage

Primary & secondary databases
Primary and secondary databases address different aspects of sequence analysis because they store different levels of protein sequence information. A database is either primary or derived (secondary):
- Primary databases: experimental results are deposited directly into the database.
- Secondary databases: hold the results of analysing primary databases.
- Composite databases: aggregates of many databases, providing:
  - Links to other data items
  - Combination of data
  - Consolidation of data

Primary sequence databases
Early 1980s:
- Nucleic acid: EMBL (Europe), GenBank (USA), DDBJ (Japan)
- Protein: PIR, SWISS-PROT, TrEMBL, NRL-3D
Abbreviations:
- PIR: Protein Information Resource
- EMBL: European Molecular Biology Laboratory
- TrEMBL: Translated EMBL
- MIPS: Munich Information Center for Protein Sequences

EMBL: the nucleotide sequence database from the European Bioinformatics Institute (EBI). It holds sequences from direct author submissions, genome sequencing groups, the scientific literature, and patent applications.
DDBJ: the DNA Data Bank of Japan, produced, maintained, and distributed at the National Institute of Genetics.
GenBank: the DNA database from the National Center for Biotechnology Information (NCBI).

Principal requirements of a database
The principal requirements on the public data services are:
• Data quality - data quality has to be of the highest priority. However, because the data services in most cases lack access to supporting data, the quality of the data must remain the primary responsibility of the submitter.
• Supporting data - database users will need to examine the primary experimental data, either in the database itself or by following cross-references back to network-accessible laboratory databases.
• Deep annotation - deep, consistent annotation comprising supporting and ancillary information should be attached to each basic data object in the database.
• Timeliness - the basic data should be available on an Internet-accessible server within days (or hours) of publication or submission.
• Integration - each data object in the database should be cross-referenced to representations of the same or related biological entities in other databases. Data services should provide capabilities for following these links from one database or data service to another.

Exercise
Look for a gene of interest in the three primary nucleic acid databases and compare the information given in each of them.

Flowchart of sequence data from labs and literature to primary sequence databases and subsequent secondary databases:
Sources: sequencing centers, literature, researchers
  -> Primary sequence databases
       Nucleic acid: e.g. GenBank, EMBL, DDBJ
       Amino acid: e.g. SwissProt and PIR
  -> Secondary sequence databases
       Protein domains & families, metabolic pathways
       e.g. RefSeq and the Conserved Domain Database (CDD) within NCBI

CDD: The Conserved Domain Database is a resource for the annotation of functional units in proteins. Its collection of domain models includes a set curated by NCBI, which uses 3D structure to provide insights into sequence/structure/function relationships.
RefSeq: A collection of curated, non-redundant genomic DNA, transcript (RNA), and protein sequences produced by the National Center for Biotechnology Information (NCBI).

Always remember that:
- The data within primary databases is only as reliable as the data submitted.
- This depends primarily on the methods used to produce it.
- Regardless of who obtains the sequence data, nucleic acid and amino acid sequencing results are subject to errors.

Protein sequence databases
- The protein sequence database was developed at the National Biomedical Research Foundation (NBRF) in the early 1960s by Margaret Dayhoff to investigate evolutionary relationships among proteins.
- From 1988 onwards, it has been maintained collectively by:
  - Protein Information Resource (PIR) at NBRF
  - International Protein Information Database of Japan (JIPID)
  - Martinsried Institute for Protein Sequences (MIPS)

Examples of molecular sequence types in NCBI records
Type: Description
- Genome: The complete genome of an organism.
- Sequence tagged site (STS): A unique segment of DNA that occurs only once in a genome and marks a particular location. Can be generated from genomic DNA or cDNA.
- Draft sequences: Pieces of a genome that are compiled from a DNA or cDNA library. Usually a large collection of contigs that are in the process of being ordered and catalogued.
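The STS definition above (a segment that occurs only once in a genome) can be checked mechanically. The following is a minimal sketch, assuming the genome is available as a plain string and considering only the forward strand; the function names and the toy sequence are hypothetical, chosen purely for illustration.

```python
def count_occurrences(genome: str, segment: str) -> int:
    """Count occurrences of segment in genome, including overlapping ones."""
    if not segment:
        raise ValueError("segment must not be empty")
    count = 0
    start = genome.find(segment)
    while start != -1:
        count += 1
        # Step forward by one base so overlapping matches are also counted.
        start = genome.find(segment, start + 1)
    return count


def is_unique_marker(genome: str, segment: str) -> bool:
    """True if segment occurs exactly once (forward strand only)."""
    return count_occurrences(genome, segment) == 1


# Made-up toy sequence, not real genomic data.
toy_genome = "ATGCGTACGTTAGCATGCGT"
print(is_unique_marker(toy_genome, "TACGTTAG"))  # candidate marker
print(is_unique_marker(toy_genome, "ATGCGT"))    # repeated segment
```

A fuller check would also search the reverse complement, since a marker must be unique on both strands.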

Examples of molecular sequence types in NCBI records (continued)
Type: Description
- Chromosome: The whole sequence of a single chromosome.
- Locus: A known location on a chromosome for a particular gene or collection of genes that codes for a specific function.
- Contig: A contiguous segment of a chromosome made by joining overlapping clones or sequences.
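The idea behind a contig, joining overlapping sequences into one contiguous segment, can be sketched in a few lines. This is a simplified illustration, assuming the fragments are error-free and already ordered left to right; real assemblers must also find the order and tolerate sequencing errors. The function names and the `min_overlap` parameter are hypothetical.

```python
def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of left that equals a prefix of right."""
    max_len = min(len(left), len(right))
    for size in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def join_in_order(fragments: list[str], min_overlap: int = 3) -> str:
    """Join fragments, already ordered left to right, into one contig."""
    if not fragments:
        raise ValueError("at least one fragment is required")
    contig = fragments[0]
    for fragment in fragments[1:]:
        size = overlap_length(contig, fragment, min_overlap)
        if size == 0:
            raise ValueError(f"no overlap of {min_overlap}+ bases before {fragment!r}")
        # Append only the part of the fragment that extends the contig.
        contig += fragment[size:]
    return contig


# Made-up fragments: each overlaps the next by four bases.
reads = ["ATGCGTAC", "GTACGGTT", "GGTTAACC"]
print(join_in_order(reads))
```

Requiring a minimum overlap guards against joining fragments that merely share one or two bases by chance.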

Examples of molecular sequence types in NCBI records (continued)
Type: Description
- Gene: Whole gene sequence for a protein.
- Domain: A discrete portion of a protein assumed to fold independently of the rest of the protein and which possesses its own function.
- Complete CDS: A complete coding sequence for a protein.

Examples of molecular sequence types in NCBI records (continued): mRNA
Type: Description
- Expressed sequence tag (EST): A short, partial cDNA sequence, derived from mRNA, read from either the 5' or 3' end of a gene transcript.
- Complementary DNA sequence (cDNA): A DNA sequence complementary to an mRNA.
- Complete CDS: A complete mRNA sequence for a protein coding region.

Protein sequence databases
SWISS-PROT
- Started in 1986 at the University of Geneva and EMBL.
- Now maintained by the Swiss Institute of Bioinformatics (SIB) and EBI/EMBL.
TrEMBL
- Started in 1996. Follows the SWISS-PROT format and contains translations of coding sequences in EMBL.
- It also holds synthetic sequences, short amino acid fragments, and coding sequences that do not encode real proteins.
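To make "translations of coding sequences" concrete, here is a minimal sketch of translating a CDS into a protein sequence with the standard genetic code. It is an illustration only, not the pipeline TrEMBL actually uses; `translate_cds` is a hypothetical name, and the input is assumed to be an in-frame coding sequence.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds: str) -> str:
    """Translate an in-frame coding sequence, stopping at the first stop codon."""
    cds = cds.upper().replace("U", "T")  # accept RNA input as well
    if len(cds) % 3 != 0:
        raise ValueError("CDS length must be a multiple of three")
    protein = []
    for i in range(0, len(cds), 3):
        amino_acid = CODON_TABLE.get(cds[i:i + 3], "X")  # X marks an unknown codon
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Made-up CDS: Met, Ala, Lys, then a stop codon.
print(translate_cds("ATGGCCAAATGA"))
```

Because the translation is purely computational, entries produced this way carry no experimental confirmation that the protein exists, which is why such entries are kept apart from manually curated ones.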

Composite protein sequence databases
- A composite database merges a variety of different primary sources.
- It obviates the need to interrogate multiple resources.
- It can eliminate identical sequence copies only, or eliminate both identical and highly similar sequences.
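The two redundancy criteria can be sketched as follows. This is a minimal illustration, assuming records are (identifier, sequence) pairs merged from several sources. The similarity measure is Python's `difflib.SequenceMatcher`, used here as a simple stand-in for the alignment-based identity a real composite database would use; the function names, the threshold, and the sample records are all hypothetical.

```python
from difflib import SequenceMatcher


def remove_identical(records):
    """Keep the first record for each distinct sequence."""
    seen = set()
    kept = []
    for identifier, sequence in records:
        if sequence not in seen:
            seen.add(sequence)
            kept.append((identifier, sequence))
    return kept


def remove_similar(records, threshold=0.9):
    """Drop records whose similarity to an already kept record reaches threshold."""
    kept = []
    for identifier, sequence in records:
        is_redundant = any(
            SequenceMatcher(None, sequence, other, autojunk=False).ratio() >= threshold
            for _, other in kept
        )
        if not is_redundant:
            kept.append((identifier, sequence))
    return kept


# Made-up records from three imaginary sources.
records = [
    ("src1|P1", "MKTAYIAKQRQISFVK"),
    ("src2|Q9", "MKTAYIAKQRQISFVK"),  # identical copy of P1
    ("src3|X7", "MKTAYIAKQRQISFVR"),  # differs from P1 at one position
    ("src1|P2", "GSHMLEDPVVLQRRDW"),  # unrelated sequence
]
print([name for name, _ in remove_identical(records)])
print([name for name, _ in remove_similar(records)])
```

The stricter criterion yields a smaller database that is faster to search, at the cost of discarding near-identical variants that may be biologically meaningful.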