What is bioinformatics?. What are bioinformaticians up to, actually? Manage molecular biological data –Store in databases, organise, formalise, describe...

Slides:



Advertisements
Similar presentations
STRING Prediction of protein networks through integration of diverse large-scale data sets Lars Juhl Jensen EMBL Heidelberg.
Advertisements

Assignment of PROSITE motifs to topological regions: Application to a novel database of well characterised transmembrane proteins Tim Nugent.
Using Support Vector Machines for transmembrane protein topology prediction Tim Nugent.
CENTER FOR BIOLOGICAL SEQUENCE ANALYSISTECHNICAL UNIVERSITY OF DENMARK DTU Sequence information, logos and Hidden Markov Models Morten Nielsen, CBS, BioCentrum,
CENTER FOR BIOLOGICAL SEQUENCE ANALYSISTECHNICAL UNIVERSITY OF DENMARK DTU T cell Epitope predictions using bioinformatics (Neural Networks and hidden.
Using phylogenetic profiles to predict protein function and localization As discussed by Catherine Grasso.
©CMBI 2005 Exploring Protein Sequences - Part 2 Part 1: Patterns and Motifs Profiles Hydropathy Plots Transmembrane helices Antigenic Prediction Signal.
Bioinformatics at IU - Ketan Mane. Bioinformatics at IU What is Bioinformatics? Bioinformatics is the study of the inherent structure of biological information.
Prediction of protein localization and membrane protein topology Gunnar von Heijne Department of Biochemistry and Biophysics Stockholm Bioinformatics Center.
Bioinformatics Dr. Aladdin HamwiehKhalid Al-shamaa Abdulqader Jighly Lecture 1 Introduction Aleppo University Faculty of technical engineering.
Introduction to Pattern Recognition Prediction in Bioinformatics What do we want to predict? –Features from sequence –Data mining How can we predict? –Homology.
. Class 1: Introduction. The Tree of Life Source: Alberts et al.
Tools to analyze protein characteristics Protein sequence -Family member -Multiple alignments Identification of conserved regions Evolutionary relationship.
An Introduction to Bioinformatics Protein Structure Prediction.
What is bioinformatics? Finding patterns in molecular biological data Implies: managing molecular biological data identifying correlations in molecular.
Protein Fold recognition Morten Nielsen, Thomas Nordahl CBS, BioCentrum, DTU.
Introduction to BioInformatics GCB/CIS535
CBS Ph.D. course, 14. april 2005 Using Microarrays to Study Cell Cycle Regulation Ulrik de Lichtenberg Center for Biological Sequence Analysis The Technical.
Bio 465 Summary. Overview Conserved DNA Conserved DNA Drug Targets, TreeSAAP Drug Targets, TreeSAAP Next Generation Sequencing Next Generation Sequencing.
Artificial Neural Networks Thomas Nordahl Petersen & Morten Nielsen.
CENTER FOR BIOLOGICAL SEQUENCE ANALYSISTECHNICAL UNIVERSITY OF DENMARK DTU Sequence motifs, information content, logos, and HMM’s Morten Nielsen, CBS,
Biological Sequence Pattern Analysis Liangjiang (LJ) Wang March 8, 2005 PLPTH 890 Introduction to Genomic Bioinformatics Lecture 16.
M.W. Mak and S.Y. Kung, ICASSP’09 1 Conditional Random Fields for the Prediction of Signal Peptide Cleavage Sites M.W. Mak The Hong Kong Polytechnic University.
Promoter Analysis using Bioinformatics, Putting the Predictions to the Test Amy Creekmore Ansci 490M November 19, 2002.
Introduction to molecular networks Sushmita Roy BMI/CS 576 Nov 6 th, 2014.
PREDICTION OF PROTEIN FEATURES Beyond protein structure (TM, signal/target peptides, coiled coils, conservation…)
Signaling Pathways and Summary June 30, 2005 Signaling lecture Course summary Tomorrow Next Week Friday, 7/8/05 Morning presentation of writing assignments.
Detecting the Domain Structure of Proteins from Sequence Information Niranjan Nagarajan and Golan Yona Department of Computer Science Cornell University.
Protein Structures.
Presented by Liu Qi An introduction to Bioinformatics Algorithms Qi Liu
Systematic Analysis of Interactome: A New Trend in Bioinformatics KOCSEA Technical Symposium 2010 Young-Rae Cho, Ph.D. Assistant Professor Department of.
Cédric Notredame (30/08/2015) Chemoinformatics And Bioinformatics Cédric Notredame Molecular Biology Bioinformatics Chemoinformatics Chemistry.
Wellcome Trust Workshop Working with Pathogen Genomes Module 3 Sequence and Protein Analysis (Using web-based tools)
Beyond the Human Genome Project Future goals and projects based on findings from the HGP.
Scoring Matrices Scoring matrices, PSSMs, and HMMs BIO520 BioinformaticsJim Lund Reading: Ch 6.1.
Master’s Degrees in Bioinformatics in Switzerland: Past, present and near future Patricia M. Palagi Swiss Institute of Bioinformatics.
Topic 1 Introduction to the Study of Life 1.1 The Unifying Characteristics of Life Biology 1001 September 9, 2005.
PART II. Prediction of functional regions within disordered proteins Zsuzsanna Dosztányi MTA-ELTE Momentum Bioinformatics Group Department of Biochemistry.
Finish up array applications Move on to proteomics Protein microarrays.
Neural Networks for Protein Structure Prediction Brown, JMB 1999 CS 466 Saurabh Sinha.
Unraveling condition specific gene transcriptional regulatory networks in Saccharomyces cerevisiae Speaker: Chunhui Cai.
TMpro: Transmembrane Helix Prediction using Amino Acid Properties and Latent Semantic Analysis Madhavi Ganapathiraju, N. Balakrishnan, Raj Reddy and Judith.
Localization prediction of transmembrane proteins Stefan Maetschke, Mikael Bodén and Marcus Gallagher The University of Queensland.
Predicting protein degradation rates Karen Page. The central dogma DNA RNA protein Transcription Translation The expression of genetic information stored.
Biological Signal Detection for Protein Function Prediction Investigators: Yang Dai Prime Grant Support: NSF Problem Statement and Motivation Technical.
Mining Biological Data. Protein Enzymatic ProteinsTransport ProteinsRegulatory Proteins Storage ProteinsHormonal ProteinsReceptor Proteins.
NY Times Molecular Sciences Institute Started in 1996 by Dr. Syndey Brenner (2002 Nobel Prize winner). Opened in Berkeley in Roger Brent,
Proteomics Session 1 Introduction. Some basic concepts in biology and biochemistry.
Central dogma: the story of life RNA DNA Protein.
Introduction to biological molecular networks
Dealing with Sequence redundancy Morten Nielsen Department of Systems Biology, DTU.
Sequence Based Analysis Tutorial March 26, 2004 NIH Proteomics Workshop Lai-Su L. Yeh, Ph.D. Protein Science Team Lead Protein Information Resource at.
AN INTRODUCTION TO BIOLOGY Nicky Mulder Acknowledgements: Anna Kramvis.
Convolutional LSTM Networks for Subcellular Localization of Proteins
Psi-Blast Morten Nielsen, Department of systems biology, DTU.
Bioinformatics Research Overview Li Liao Develop new algorithms and (statistical) learning methods > Capable of incorporating domain knowledge > Effective,
Dynamical Modeling in Biology: a semiotic perspective Junior Barrera BIOINFO-USP.
Blast heuristics, Psi-Blast, and Sequence profiles Morten Nielsen Department of systems biology, DTU.
bacteria and eukaryotes
Bioinformatics Overview
Prediction of protein features. Beyond protein structure
The Nobel Prize in Physiology or Medicine 1999
Protein Families, Motifs & Domains.
Functional Annotation of Transcripts
KEY CONCEPT Entire genomes are sequenced, studied, and compared.
KEY CONCEPT Entire genomes are sequenced, studied, and compared.
Sequence Based Analysis Tutorial
Artificial Neural Networks Thomas Nordahl Petersen & Morten Nielsen
Artificial Neural Networks Thomas Nordahl Petersen & Morten Nielsen
Neural Networks for Protein Structure Prediction Dr. B Bhunia.
Presentation transcript:

What is bioinformatics?

What are bioinformaticians up to, actually? Manage molecular biological data –Store in databases, organise, formalise, describe... Compare molecular biological data Find patterns in molecular biological data –phylogenies –correlations (sequence / structure / expression / function / disease)‏ Goals: characterise biological patterns & processes predict biological properties –low level data ⇒ high level properties (eg., sequence ⇒ function)‏

Bioinformatics: neighbour disciplines Computational biology –Broader concept: includes computational ecology, physiology, neurology etc... -omics: –Genomics –Transcriptomics –Proteomics Systems biology –Putting it all together... –Building models, identify control & regulation

Bioinformatics: prerequisites Bio- side: –Molecular biology –Cell biology –Genetics –Evolutionary theory -informatics side: –Computer science –Statistics –Theoretical physics

Molecular biology data... >alpha-D ATGCTGACCGACTCTGACAAGAAGCTGGTCCTGCAGGTGTGGGAGAAGGTGATCCGCCAC CCAGACTGTGGAGCCGAGGCCCTGGAGAGGTGCGGGCTGAGCTTGGGGAAACCATGGGCA AGGGGGGCGACTGGGTGGGAGCCCTACAGGGCTGCTGGGGGTTGTTCGGCTGGGGGTCAG CACTGACCATCCCGCTCCCGCAGCTGTTCACCACCTACCCCCAGACCAAGACCTACTTCC CCCACTTCGACTTGCACCATGGCTCCGACCAGGTCCGCAACCACGGCAAGAAGGTGTTGG CCGCCTTGGGCAACGCTGTCAAGAGCCTGGGCAACCTCAGCCAAGCCCTGTCTGACCTCA GCGACCTGCATGCCTACAACCTGCGTGTCGACCCTGTCAACTTCAAGGCAGGCGGGGGAC GGGGGTCAGGGGCCGGGGAGTTGGGGGCCAGGGACCTGGTTGGGGATCCGGGGCCATGCC GGCGGTACTGAGCCCTGTTTTGCCTTGCAGCTGCTGGCGCAGTGCTTCCACGTGGTGCTG GCCACACACCTGGGCAACGACTACACCCCGGAGGCACATGCTGCCTTCGACAAGTTCCTG TCGGCTGTGTGCACCGTGCTGGCCGAGAAGTACAGATAA >alpha-A ATGGTGCTGTCTGCCAACGACAAGAGCAACGTGAAGGCCGTCTTCGGCAAAATCGGCGGC CAGGCCGGTGACTTGGGTGGTGAAGCCCTGGAGAGGTATGTGGTCATCCGTCATTACCCC ATCTCTTGTCTGTCTGTGACTCCATCCCATCTGCCCCCATACTCTCCCCATCCATAACTG TCCCTGTTCTATGTGGCCCTGGCTCTGTCTCATCTGTCCCCAACTGTCCCTGATTGCCTC TGTCCCCCAGGTTGTTCATCACCTACCCCCAGACCAAGACCTACTTCCCCCACTTCGACC TGTCACATGGCTCCGCTCAGATCAAGGGGCACGGCAAGAAGGTGGCGGAGGCACTGGTTG AGGCTGCCAACCACATCGATGACATCGCTGGTGCCCTCTCCAAGCTGAGCGACCTCCACG CCCAAAAGCTCCGTGTGGACCCCGTCAACTTCAAAGTGAGCATCTGGGAAGGGGTGACCA GTCTGGCTCCCCTCCTGCACACACCTCTGGCTACCCCCTCACCTCACCCCCTTGCTCACC ATCTCCTTTTGCCTTTCAGCTGCTGGGTCACTGCTTCCTGGTGGTCGTGGCCGTCCACTT CCCCTCTCTCCTGACCCCGGAGGTCCATGCTTCCCTGGACAAGTTCGTGTGTGCCGTGGG CACCGTCCTTACTGCCAAGTACCGTTAA DNA sequences

Molecular biology data... Amino acid sequences Protein structure: –X-ray crystallography –NMR

Cell biology & proteomics data... Subcellular localization

Cell biology & proteomics data... protein-protein interactions

DNA microarray technology Transcriptomics: DNA microarray technology

Proteomics & transcriptomics data Proteins encoded by periodically expressed genes: a functionally diverse protein category

Cell cycle regulation Multiprotein complexes in yeast (Saccharomyces cerevisiae). Coloured components are periodically expressed (dynamic). White components are always expressed (static). Complexes contain both kinds of components (just-in-time assembly). Dynamic complex formation during the yeast cell cycle. Science Feb 4;307(5710): de Lichtenberg U, Jensen LJ, Brunak S, Bork P.

Cell cycle regulation 2 Transcriptional regulation is poorly conserved! Lars Juhl Jensen, Thomas Skøt Jensen, Ulrik de Lichtenberg, Søren Brunak and Peer Bork, Nature, Oct 5, 2006 Dynamic components overlap:

Phenotype data: human diseases

Prediction methods Homology / Alignment Simple pattern (“word”) recognition Statistical methods –Weight matrices: calculate amino acid probabilities –Other examples: Regression, variance analysis, clustering Machine learning –Like statistical methods, but parameters are estimated by iterative training rather than direct calculation –Examples: Neural Networks (NN), Hidden Markov Models (HMM), Support Vector Machines (SVM)‏ Combinations

Simple pattern (“word”) recognition‏ Example: PROSITE entry PS00014, ER_TARGET: Endoplasmic reticulum targeting sequence (”KDEL-signal”). Pattern: [KRHQSA]-[DENQ]-E-L> NB: only yes/no answers!

Statistical methods ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV Estimate probabilities for nucleotides / amino acids Information content in sequences; logos; Position- Weight Matrices. Quantitative answers.

ADDGSLAFVPSEF--SISPGEKIVFKNNAGFPHNIVFDEDSIPSGVDASKISMSEEDLLN TVNGAI--PGPLIAERLKEGQNVRVTNTLDEDTSIHWHGLLVPFGMDGVPGVSFPG---I -TSMAPAFGVQEFYRTVKQGDEVTVTIT-----NIDQIED-VSHGFVVVNHGVSME---I IE--KMKYLTPEVFYTIKAGETVYWVNGEVMPHNVAFKKGIV--GEDAFRGEMMTKD--- -TSVAPSFSQPSF-LTVKEGDEVTVIVTNLDE------IDDLTHGFTMGNHGVAME---V ASAETMVFEPDFLVLEIGPGDRVRFVPTHK-SHNAATIDGMVPEGVEGFKSRINDE---- TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI Sequence profiles Conserved Non-conserved Matching anything but G => large negative score Anything can match TKAVVLTFNTSVEICLVMQGTSIV----AAESHPLHLHGFNFPSNFNLVDPMERNTAGVP TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI Unlike Position-Weight Matrices, sequence profiles are able to handle gaps

HMM ( Hidden Markov Models ) Which parts of the sequence are –in the cytoplasm? –embedded in the membrane (TM-helices)?‏ –outside? Rule-of-thumb says amino acids in the TM helices are hydrophobic – but there are exceptions TMHMM predicts topology from sequence using a statistical model TMHMM: predicting the topology of α-helix transmembrane proteins

I K E E H V I I Q A E H E C IKEEHVIIQAE FYLNPDQSGEF….. Window Input Layer Hidden Layer Output Layer Weights Neural Networks: ”Architecture”

Neural networks Neural networks can learn higher order correlations! –What does this mean? A A => 0 A C => 1 C A => 1 C C => 0 No linear function can learn this pattern Neural Networks: advantages

How to estimate parameters for prediction?

Model selection Linear Regression Quadratic RegressionJoin-the-dots

The test set method

Problem: sequences are related If the sequences in the test set are closely related to those in the training set, we can not measure true generalization performance ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV

Protein sorting in eukaryotes Proteins belong in different organelles of the cell – and some even have their function outside the cell Günter Blobel was in 1999 awarded The Nobel Prize in Physiology or Medicine for the discovery that "proteins have intrinsic signals that govern their transport and localization in the cell"

Secretory proteins have a signal peptide Initially, they are transported across the ER membrane Protein sorting: secretory pathway / ER

Signal peptides A signal peptide is an N- terminal part of the amino acid chain, containing a hydrophobic region. Signal peptides differ between proteins, and can be hard to recognize.

SignalP Method for prediction of signal peptides from the amino acid sequence Based on Neural Networks The first SignalP paper (Nielsen H, et al., Protein Eng. 10[1]: 1-6, January 1997) was cited 3144 times in a 10 year period