Essential Bioinformatics and Biocomputing (LSM2104: Section I) Biological Databases and Bioinformatics Software Prof. Chen Yu Zong Tel: 6874-6877 Email:


Essential Bioinformatics and Biocomputing (LSM2104: Section I) Biological Databases and Bioinformatics Software Prof. Chen Yu Zong Tel: Room 07-24, level 7, SOC1, NUS January

Lecture 5: Bioinformatics software
Outline:
- Types of bioinformatics software
  - Sequence, pattern and domain
  - Evolutionary analysis
  - Visualization
  - Modeling and prediction (sequence, structure and function)
  - Data mining (bibliographic and text searches)
- Examples

Types of bioinformatics software
1. Analysis of biological data and systems, and characterization of molecules and sequences.
2. Analysis and interpretation of experimental results.
3. Simulation of laboratory experiments, which is important for tackling large-scale problems.
4. Predictions that lead to the design of experiments.
Bioinformatics software can be accessed via the WWW or through integrated software packages (such as EMBOSS, GCG, Staden, DNAstar, …). It may be coupled with databases or may stand alone.

Major sources of bioinformatics software
- Software packages at the ExPASy Molecular Biology Server
- Software at PBIL (Pôle Bio-Informatique Lyonnais)
- Toolbox at EBI (European Bioinformatics Institute)

Major types of bioinformatics tools
- Sequence analysis tools
- Sequence comparison
- Pattern and domain search
- Evolutionary analysis
- Prediction of sequence structure and function
- Visualization of molecular structures
- Structure modeling
- Bibliographic and text searches
- Specialized and other tools

Sequence analysis tools
This kind of software focuses on the extraction and comparison of properties of DNA and protein sequences.
- Sequence analysis helps identify domains, structure, function, and other properties.
- The analysis of individual sequences supports sequence comparison.
The properties analysed include:
- composition of nucleotide or protein sequences
- codon usage in DNA
- translation and back translation
Textbook chapter 5, pages 81-93

Composition of nucleotide or protein sequences
Composition (the frequency of occurrence of each nucleotide or amino acid) is the most basic analysis. It can give important functional and structural clues. For example, CG-rich regions called CpG islands are often found in promoters, and a short region just before the splice site at the 3' end of introns often has a high C+T content.
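The composition analysis described above can be sketched in a few lines of Python. This is only a minimal illustration, not the method used by any of the tools named in the lecture; the function names `composition` and `gc_content` are hypothetical and chosen for this example.

```python
from collections import Counter


def composition(seq):
    """Return the relative frequency of each symbol in a DNA or protein sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence is empty")
    counts = Counter(seq)
    # Divide each count by the sequence length to get a frequency.
    return {symbol: counts[symbol] / len(seq) for symbol in sorted(counts)}


def gc_content(dna):
    """Return the fraction of G and C nucleotides in a DNA sequence."""
    dna = dna.upper()
    if not dna:
        raise ValueError("sequence is empty")
    return (dna.count("G") + dna.count("C")) / len(dna)


# Toy sequence made up for this example.
print(composition("AACG"))
print(gc_content("AACG"))
```

A high `gc_content` over a window of a genomic sequence would be one hint of a CG-rich region, although real CpG island detection uses additional criteria.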

Composition of protein and DNA sequences
Web:
- Network Protein (amino-acid composition)
- AA Composition
JEMBOSS (in our own laboratory):
- (nucleic, composition, compseq)


Codon usage in DNA
Web:
- Count-codon program in the Codon Usage Database (needs start and stop codons at the start and the end of the sequence)
- Tool for Gene to Codon Usage Table (does not care about start and stop codons)
JEMBOSS (in the laboratory):
- (nucleic, codon usage, cusp)
A DNA coding region should have only one stop codon, in frame at its end.
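As a rough illustration of what a codon usage count involves, the sketch below splits a coding sequence into codons, counts them, and counts the in-frame stop codons. It assumes the standard genetic code's stop codons (TAA, TAG, TGA); the function names are hypothetical and this is not how the Codon Usage Database or cusp is implemented.

```python
from collections import Counter

# Stop codons of the standard genetic code (an assumption of this sketch).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def split_codons(cds):
    """Split a coding sequence into consecutive in-frame codons."""
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]


def codon_usage(cds):
    """Count how often each codon occurs in the coding sequence."""
    return Counter(split_codons(cds))


def count_in_frame_stops(cds):
    """Count the in-frame stop codons; a well-formed coding region has one."""
    return sum(1 for codon in split_codons(cds) if codon in STOP_CODONS)


# Toy coding sequence made up for this example.
usage = codon_usage("ATGGCCGCCTAA")
print(usage)
print(count_in_frame_stops("ATGGCCGCCTAA"))
```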


Translation (DNA to protein) and back translation (protein to DNA)
Web:
- Translate tool at ExPASy (DNA to protein)
JEMBOSS (in the laboratory), DNA to protein and reverse:
- (nucleic, translation, transeq)
- (nucleic, translation, backtranseq)
If we translate a DNA sequence and then back translate the result, we will typically not recover the starting sequence, because most amino acids are encoded by more than one codon.
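The loss of information in a translation and back translation round trip can be shown with a small sketch. It assumes the standard genetic code and, for back translation, simply picks the first codon found for each amino acid; that choice is arbitrary and made only for illustration (real tools may instead pick codons from a codon usage table). The function names are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# For back translation, keep the first codon seen for each amino acid.
# This is an arbitrary illustrative choice, not a biological rule.
FIRST_CODON = {}
for codon, aa in CODON_TABLE.items():
    FIRST_CODON.setdefault(aa, codon)


def translate(dna):
    """Translate an in-frame DNA sequence into one-letter amino acids ('*' = stop)."""
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def back_translate(protein):
    """Return one of the many DNA sequences that encode the protein."""
    return "".join(FIRST_CODON[aa] for aa in protein.upper())


original = "ATGGCCTTATAA"            # toy sequence made up for this example
protein = translate(original)
recovered = back_translate(protein)
print(protein, recovered, recovered == original)
```

The recovered DNA encodes the same protein as the original but differs in its codons, which is the point made on the slide.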

Sequence comparison (the most important software)
This will be taught next month by A/P Tan Tin Wee.
Web:
- Local alignment (BLAST, FASTA)
- Multiple alignment (Clustal W)
JEMBOSS (in the laboratory):
- Local alignment: Smith-Waterman (alignment, local, water)
- Global alignment: Needleman-Wunsch (alignment, global, needle)
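To make the difference between global (Needleman-Wunsch) and local (Smith-Waterman) alignment concrete, here is a minimal dynamic-programming sketch that returns only the alignment score. The scoring values (match +1, mismatch -1, linear gap -1) are assumptions made for this example; real programs such as water and needle use substitution matrices and more elaborate gap penalties. The function name `alignment_score` is hypothetical.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-1, local=False):
    """Score the best global (default) or local alignment of two sequences."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    if not local:
        # Global alignment: leading gaps are penalised.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap

    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                # Local alignment: a negative score restarts the alignment at 0.
                cell = max(cell, 0)
            score[i][j] = cell
            best = max(best, cell)

    # Global score is in the bottom-right cell; local score is the table maximum.
    return best if local else score[-1][-1]


# Toy sequences made up for this example.
print(alignment_score("GATTACA", "GATACA"))
print(alignment_score("AAAGATTACATTT", "CCCGATTACAGGG", local=True))
```

The local score ignores the unrelated flanking regions of the second pair, whereas a global alignment of the same pair would be penalised for them.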

Evolutionary analysis
Multiple sequence alignments can be used to estimate evolutionary distances between proteins. Phylogeny systems such as WebPhylip and GeneBee represent the evolutionary distances between sequences, typically as trees.
Read textbook, page 83.
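One simple way to turn an alignment into a distance, shown here only as a sketch, is the proportion of differing positions between two aligned sequences (often called the p-distance). This is an assumption chosen for illustration; the slide does not say which distance measure WebPhylip or GeneBee use, and the function names and example sequences are hypothetical.

```python
def p_distance(aligned_a, aligned_b):
    """Fraction of differing positions between two aligned sequences, ignoring gaps."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only the columns where neither sequence has a gap.
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(alignment):
    """Pairwise p-distances for a dict mapping sequence names to aligned sequences."""
    names = list(alignment)
    return {
        (n1, n2): p_distance(alignment[n1], alignment[n2])
        for n1 in names
        for n2 in names
    }


# Toy alignment made up for this example.
alignment = {"seq1": "MATWLI", "seq2": "MA-WTV", "seq3": "MATWLV"}
for pair, distance in distance_matrix(alignment).items():
    print(pair, distance)
```

A matrix like this is the usual input to distance-based tree-building methods.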


Prediction of sequence structure and function
Sequences that have similar structure often have similar function. For many sequences, we can extract the secondary and tertiary structure from the PDB database. What if our sequence is not in the PDB? We can predict the structure of a biological sequence using appropriate software. There are several programs for the prediction of secondary structure (for example, the PHD method). For the prediction of tertiary structure, we can use modelling.


Secondary structure prediction
- The PHD program predicted four alpha helices in human IL-2 (red). The number of helices is correct, but their lengths and boundaries are not (purple).
- When we make a prediction in bioinformatics, we must have an idea of the accuracy of the prediction programs.
- To assess the accuracy of a program, we can test it with known data. The test must include enough examples for us to draw reasonable conclusions.

Secondary structure prediction: alpha-lactalbumin (PDB 1A4V)

We used nine different programs to predict the secondary structure of alpha-lactalbumin (PDB 1A4V). The best predictions for this molecule came from “Predator”, while DSC performed worst. This test does not mean that Predator is the best of the tested programs, nor that DSC is the worst. To draw such conclusions, we must first build a test set, which should contain examples from the protein family that our query protein belongs to.
The learning point: no prediction program is 100% accurate, and this applies across all bioinformatics software, not only secondary structure prediction. Users must be cautious when interpreting results from predictive software.
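Comparing prediction programs on a test set can be sketched as follows. The per-residue agreement used here (commonly called Q3 when the states are helix H, strand E and coil C) is an assumption chosen for illustration; the lecture does not state which measure was used to rank the nine programs. The function names and example strings are hypothetical.

```python
def q3_accuracy(predicted, observed):
    """Fraction of residues whose predicted state (H, E or C) matches the observed one."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("state strings must be non-empty and of equal length")
    return sum(p == o for p, o in zip(predicted, observed)) / len(observed)


def mean_q3(test_set):
    """Average per-residue accuracy over a test set of (predicted, observed) pairs."""
    scores = [q3_accuracy(predicted, observed) for predicted, observed in test_set]
    return sum(scores) / len(scores)


# Toy state strings made up for this example, not real predictions.
print(q3_accuracy("HHHHCCEE", "HHHCCCEE"))
print(mean_q3([("HHHH", "HHHH"), ("HHCC", "HHHH")]))
```

Averaging over many proteins, ideally from the same family as the query, gives a fairer ranking of programs than a single molecule does.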

Common measures of prediction accuracy (other measures also exist)
Sensitivity: SE = TP/(TP+FN)
Specificity: SP = TN/(TN+FP)
For example, prediction of peptides binding to a particular receptor:

            Experimental   Predicted     Class
Example 1   Binder         Binder        True positive (TP)
Example 2   Non-binder     Non-binder    True negative (TN)
Example 3   Binder         Non-binder    False negative (FN)
Example 4   Non-binder     Binder        False positive (FP)

A prediction system that has SE=0.8 and SP=0.9 will correctly predict 8 of 10 experimental positives, and for every 10 experimental negatives it will make one false positive prediction. This prediction accuracy may be very good for prediction of peptide binding, but is not very good for some other predictions, for example gene prediction.
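The two formulas can be checked with a short sketch that counts the four classes from paired experimental and predicted labels. The data below are made up to reproduce the SE=0.8 and SP=0.9 case described in the text; the function names are hypothetical.

```python
def confusion_counts(experimental, predicted):
    """Count TP, TN, FP and FN from paired booleans (True = binder)."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for exp, pred in zip(experimental, predicted):
        if exp and pred:
            counts["TP"] += 1      # binder predicted as binder
        elif not exp and not pred:
            counts["TN"] += 1      # non-binder predicted as non-binder
        elif exp and not pred:
            counts["FN"] += 1      # binder missed by the prediction
        else:
            counts["FP"] += 1      # non-binder wrongly predicted as binder
    return counts


def sensitivity(counts):
    """SE = TP / (TP + FN)."""
    return counts["TP"] / (counts["TP"] + counts["FN"])


def specificity(counts):
    """SP = TN / (TN + FP)."""
    return counts["TN"] / (counts["TN"] + counts["FP"])


# Ten experimental binders followed by ten experimental non-binders.
experimental = [True] * 10 + [False] * 10
# Eight binders found, two missed; nine non-binders rejected, one false alarm.
predicted = [True] * 8 + [False] * 2 + [False] * 9 + [True] * 1

counts = confusion_counts(experimental, predicted)
print(counts, sensitivity(counts), specificity(counts))
```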

Prediction of 3-D structure
Various modelling programs:
- comparative modelling, using known structures as templates
- ab initio modelling, using atomic simulation, residue statistics, etc.
These methods will be covered later in the course.
An example of comparative modelling software is SWISS-MODEL. This model is provided by . This tool has a facility for assessing the quality of predictions.


Software for visualisation of 3-D structures
This software provides different views of 3-D molecular structures, which will be taught by A/P Shoba.
- Chime, Rasmol (they use files in PDB format)
- The Scorpion database uses Chime. Chime can be downloaded from:


Text searches
Text searching software is used in association with databases. Most commonly, we search by keywords or combinations of keywords. Examples of PubMed searches:
- Diabetes: 181,672 matches
- Diabetes AND IDDM: 35,841
- Diabetes AND IDDM AND autoimmunity: 1,109
- Diabetes OR autoimmunity: 190,674
- Diabetes[Title/Abstract]: 114,624
The last example uses the more advanced PubMed option “preview/index”.
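Why AND narrows a search while OR broadens it can be seen in a toy keyword index. This sketch is not how PubMed works internally; the records, the function names and the whitespace tokenisation are all assumptions made for illustration.

```python
def build_index(records):
    """Map each lower-cased word to the set of record ids that contain it."""
    index = {}
    for record_id, text in records.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(record_id)
    return index


def search_and(index, *terms):
    """Records containing every term (set intersection)."""
    sets = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*sets) if sets else set()


def search_or(index, *terms):
    """Records containing at least one term (set union)."""
    return set().union(*(index.get(term.lower(), set()) for term in terms))


# Toy records made up for this example.
records = {
    1: "diabetes and autoimmunity in IDDM patients",
    2: "diabetes management",
    3: "autoimmunity in thyroid disease",
    4: "IDDM diabetes genetics",
}
index = build_index(records)
print(search_and(index, "diabetes", "IDDM"))
print(search_and(index, "diabetes", "IDDM", "autoimmunity"))
print(search_or(index, "diabetes", "autoimmunity"))
```

Each added AND term can only shrink the result set, and each added OR term can only grow it, which matches the pattern in the PubMed counts above.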

Summary of today's lecture
- Why bioinformatics software?
- Types of software: sequence, motif, evolution, visualization, structural modeling, simulation, text search.
- Examples of selected software:
  - Sequence composition
  - DNA-protein sequence translation
  - Evolutionary analysis
  - Protein secondary structure prediction
  - Comparative modeling
  - Text search
- To be taught later: sequence comparison, visualization, etc.

Summary of the Section: Biological databases and bioinformatics software
We first focused on biological databases. We covered:
- types of biological databases
- popular databases, described briefly
- the structure of GenBank and SWISS-PROT entries
- searching biological databases
- types of questions that can be answered by searching databases
- completeness and errors in the databases

The second topic was bioinformatics software. We covered:
- why we need bioinformatics software
- the major types of bioinformatics software, described briefly
- software for sequence composition, codon usage, translation and back translation
- the concepts of sequence alignment and evolutionary analysis
- secondary and tertiary structure prediction, and molecular visualization
- accuracy of prediction software
- text searching