Bioinformatics as an integrative science Jaap Heringa Faculty of Sciences Faculty of Earth and Life Sciences Integrative Bioinformatics Institute VU (IBIVU)



Gathering knowledge Anatomy, architecture Dynamics, mechanics Informatics (Cybernetics – Wiener, 1948) (Cybernetics has been defined as the science of control in machines and animals, and hence it applies to technological, animal and environmental systems) Genomics, bioinformatics Rembrandt, 1632 Newton, 1726

Mathematics Statistics Computer Science Informatics Biology Molecular biology Medicine Chemistry Physics Bioinformatics

“Studying informational processes in biological systems” (Hogeweg Utrecht; early 1970s) Applying algorithms and mathematical formalisms in biology (genomics) USA started but now everywhere Taking care of the computational infrastructure and data management everywhere Is a supporting science everywhere “Information technology applied to the management and analysis of biological data” (Attwood and Parry-Smith)

The Human Genome June 2000

Dinner discussion: Integrative Bioinformatics & Genomics VU metabolome proteome genome transcriptome physiome Genomics

A gene codes for a protein. DNA (4-letter alphabet): CCTGAGCCAACTATTGATGAA. Transcription gives mRNA: CCUGAGCCAACUAUUGAUGAA. Translation gives protein (20-letter alphabet): PEPTIDE.
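The two steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not part of the original lecture: the function names `transcribe` and `translate` are made up here, the codon table is the standard genetic code, and reading starts at the first base of the coding strand.

```python
# Standard genetic code, with codons enumerated in the base order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def transcribe(dna: str) -> str:
    """Return the mRNA that matches the coding strand (T replaced by U)."""
    return dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate an mRNA from its first base; stop at the first stop codon."""
    protein = []
    usable = len(mrna) - len(mrna) % 3  # ignore a trailing partial codon
    for pos in range(0, usable, 3):
        codon = mrna[pos:pos + 3].replace("U", "T")
        residue = CODON_TABLE[codon]
        if residue == "*":  # stop codon
            break
        protein.append(residue)
    return "".join(protein)


dna = "CCTGAGCCAACTATTGATGAA"
mrna = transcribe(dna)
print(mrna)             # CCUGAGCCAACUAUUGAUGAA
print(translate(mrna))  # PEPTIDE
```

The 21 nucleotides of the slide's example give seven codons, hence the seven-residue product.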

Humans have spliced genes…

DNA makes RNA makes Protein

Remarks Proteins can use different combinations of exons => alternative splicing The human factor VIII gene (whose mutations cause hemophilia A) is spread over ~186,000 bp. It consists of 26 exons ranging in size from 69 to 3,106 bp, and its 25 introns range in size from 207 to 32,400 bp. The complete gene is thus ~9 kb of exon and ~177 kb of intron. The biggest human gene yet is for dystrophin. It has > 30 exons and is spread over 2.4 million bp. Single Nucleotide Polymorphism (SNP) data important for health
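The factor VIII figures can be checked with simple arithmetic. The sketch below only restates the approximate numbers quoted above; the helper names are illustrative.

```python
# Approximate figures quoted for the human factor VIII gene.
GENE_BP = 186_000
EXON_BP = 9_000
INTRON_BP = 177_000
N_EXONS = 26


def intron_count(n_exons: int) -> int:
    """A gene with n exons has n - 1 introns separating them."""
    return max(n_exons - 1, 0)


def exon_fraction(exon_bp: int, gene_bp: int) -> float:
    """Fraction of the gene that is exon sequence."""
    return exon_bp / gene_bp


print(intron_count(N_EXONS))                            # 25
print(EXON_BP + INTRON_BP == GENE_BP)                   # True
print(round(100 * exon_fraction(EXON_BP, GENE_BP), 1))  # 4.8 (percent)
```

Under these numbers, less than 5% of the gene is exon sequence.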

Microarray with about 20K genes…

Proteomics X-ray crystallography NMR Mass spectrometry data Structural genomics: solving and categorising all existing protein folds (3D structures) Protein-protein interactions Protein-ligand interactions (drug design)

Metabolic networks Glycolysis and Gluconeogenesis KEGG database (Japan)

Physiome Metabolomics + all other little things in the cell Ions, protons, etc.

Algorithms in bioinformatics string algorithms dynamic programming machine learning (NN, k-NN, SVM, GA,..) Markov chain models hidden Markov models Markov Chain Monte Carlo (MCMC) algorithms stochastic context free grammars EM algorithms Gibbs sampling clustering tree algorithms text analysis hybrid/combinatorial techniques and more…

Free University initiatives Integrative Bioinformatics Institute VU (IBIVU) Centre for Research on BioComplex Systems (CRBCS) – Systems Biology Centre for Neurobiology and Cognitive Research (CNCR) VU Medical Centre (Microarray, CGH data)

IBIVU supporting Dutch initiatives BioRange: Pan-Dutch bioinformatics proposal (65M Euro) Centre for Medical Systems Biology (Leiden, A’dam, R’dam) Ecogenomics (A’dam, Wageningen, Nat. Inst. For Health and Environment (RIVM)) BioASP: streamline/stimulate bioinformatics teaching across The Netherlands

Dutch Centres of Excellence Cancer Genomics Consortium [DCGP] Center for Biosystem Genomics [CBSG], focuses on plant genomics (potato, tomato) Kluyver Centre for Genomics of Industrial Fermentation [Kluyver] Center for Medical Systems Biology [CMSB], focuses on multifactorial disease Netherlands Proteomics Centre for proteomics as an emerging horizontal genomics discipline

Dutch academic/industrial initiatives Nutrigenomics exploration into the prevention and care of nutritional inroads in vascular disease, diabetes, hypertension and obesity Interaction between the immune system and food; a functional genomics approach to celiac disease Mechanisms of life-threatening virus disease and new leads for treatment and vaccines Genomics of host – respiratory virus interactions: towards novel intervention strategies; Ecogenomics: Functioning of ecosystems targeted at sustainable environmentally friendly and healthy products (ecology, toxicology and sustainable innovation)

Ecogenomics

Table 1: Operational organisation of the CMSB
Columns: Task force / Coordinator / Project / Location: local coordinator, other participants

EPIDEMIOLOGY. Coordinators: Dorret Boomsma (VU/mc)*, Cornelia van Duyn (EMC)
  Populations
    EMC: van Duyn, Hofman, Oostra
    VU/mc: Boomsma, Boers, Dijkmans, Heine, Hoogendijk, van der Knaap, Meier, Pena, Pinedo
    LUMC: Slagboom, Bertina, Breedveld, Breuning, Cornelisse, Devilee, vDissel, Ferrari, Huizinga, Roosendaal, Roos, van der Velde, Westendorp, Zitman
  Genotyping
    LUMC: Slagboom, Sandkuijl, den Dunnen
    EMC: Oostra, Heutink
    VU/mc: Boomsma, (Heutink)

SYSTEMS BIOLOGY. Coordinators: Jan vd Greef (TNO/UL)*, Cor Verweij (VU/mc)
  Arraying
    LUMC: den Dunnen, Boer, Fodde
    VU/mc: Verweij, Ylstra, Brakenhoff
    EMC: Oostra
  Proteomics
    LUMC: Koning, Deelder, den Dunnen, van der Maarel
    UL: Overkleeft, Abrahams
    VU/mc: Smit, Li, van Kooyk
  Metabolomics
    UL: Verduijn Lunel, van de Geer, Verheij
    TNO: van der Greef, Havekes, te Koppele
    VU/mc: Jakobs

TECHNOLOGY. Coordinator: Huub de Groot (UL)
  Molecular interactions
    UL: Abrahams, Brouwer, IJzerman, van Boom
    LUMC: Tanke, Raap, Deelder, den Dunnen
    VU/mc: Leurs, Irth
  In vivo imaging
    UL: de Groot, Kok
    LUMC: Reiber, van Buchem, de Roos, Poelmann, Lowick
    VU/mc: Witter, Bal, Lammertsma
    EMC: van Duyn, van Swieten

MODEL SYSTEMS. Coordinator: Rune Frants (LUMC)
  Mouse / Rat, Zebrafish, Drosophila, Yeast
    EMC: Oostra
    LUMC: Verbeek, Fodde, deKloet, Verrijzer, Noordermeer, Mullenders
    TNO: Havekes
    UL: Spaink, Brouwer, Schmidt
    VU/mc: Verhage, Smit, Vandenbroucke-Grauls

CLINICAL APPLICATIONS. Coordinator: Cornelis Melief (LUMC)
  Cells, vaccines
    LUMC: Melief, Goulmy, Falkenburg, Ottenhoff; Spaan, de Vries
    VU/mc: van Kooyk; Meijer, Pinedo
  Viral
    LUMC: Spaan, Wiertz, Hoeben
    VU/mc: Gerritsen, Curiel
  Methodologies, Pharmaceuticals
    UL: IJzerman, Mulder, van Boom
    LUMC: Huizinga, Breedveld, Breuning, van Deutekom, Ferrari, Fodde, Frants, Jukema, de Kloet, Ottenhoff, van der Velde, Zitman
    VU/mc: Maassen, Dijkmans, Leurs, Meijer, Pinedo
    EMC: Stricker

CENTRAL PROJECT (Coordinator / Elements): DATA INTEGRATION, ANALYSIS AND LOGISTICS. Coordinator: NN
  Central Information Management: TFBI / BIG-VU / EBB; Rosetta Resolver®; LIMS integration/interfacing
  Biostatistics: van Houwelingen, Eijlers, Boer, Sandkuijl (LUMC); van der Vaart, de Gunst, Boers (VU/mc); Houwing-Duistermaat (EMC), van de Geer (UL)
  Bioinformatics: Boer, Svensson, Gorbalenya (LUMC), Heringa, vBeek (VU/mc), Stijnen, van der Lei, Mons (EMC), Kok (UL)
  BioASP Interface, in collaboration with: Vriend/Tellegen
  GRID – Virtual Laboratory, NWO-BMI FLEXwork: van Ommen, Boer, Svensson; in collaboration with: Stiekema (Wag), Herzberger (Ams), Vriend (Nijm)

The managing team Medical Systems Biology

Integrate data sources Integrate methods Integrate data through method integration (biological model) Integrative bioinformatics Data integration

Bioinformatics tool Data Algorithm Biological Interpretation (model) tool

“Nothing in Biology makes sense except in the light of evolution” (Theodosius Dobzhansky) “Nothing in Bioinformatics makes sense except in the light of Biology” Bioinformatics

Pair-wise sequence alignment (more than just string matching)
Input sequences: MDAGSTVILCFVG and MDAASTILCGS
Labels: Amino Acid Exchange Matrix; Gap penalties (open, extension); Search matrix; Evolution; Global dynamic programming
Resulting alignment:
MDAGSTVILCFVG-
MDAAST-ILC--GS
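Global dynamic programming of the kind named on this slide can be sketched as follows. This is a simplified illustration under stated assumptions, not the lecture's method: it uses a flat match/mismatch score instead of an amino acid exchange matrix, and a single linear gap penalty instead of separate open and extension penalties. The function name and default scores are made up for the example.

```python
def global_align(a, b, match=2, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment with a linear gap penalty.

    Returns (score, aligned_a, aligned_b).
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,    # gap in b
                score[i][j - 1] + gap,    # gap in a
            )
    # Trace back from the bottom-right cell to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + s:
                top.append(a[i - 1])
                bottom.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


total, row_a, row_b = global_align("MDAGSTVILCFVG", "MDAASTILCGS")
print(total)
print(row_a)
print(row_b)
```

With these toy scores the gap placement may differ from the alignment shown on the slide, which depends on the exchange matrix and the affine gap penalties actually used.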

Data Algorithm Biological Interpretation (model) tool Integrative bioinformatics Data integration

Data 1 Data 2 Data 3

Integrative bioinformatics Data integration Data 1 Algorithm 1 Biological Interpretation (model) 1 tool Algorithm 2 Biological Interpretation (model) 2 Algorithm 3 Biological Interpretation (model) 3 Data 2 Data 3

“The solution includes an infrastructure or data pipeline involving: a general portal virtual lab technology (virtual LIMS) ‘petabase’ data handling facilities methods, software and ‘tools’ to integrate data and extract knowledge from data in the user domain. This infrastructure calls for a central facilitation unit providing large storage and computing facilities to run central software packages with user interfaces” Could Gridlab do this? Integrative bioinformatics Data integration

Integrating Primary and Predicted Secondary Structure data for Multiple Alignment

Using secondary structure in multiple alignment “Structure more conserved than sequence” 10 years SS prediction method development: Q3 += 3% 10 years MA method development: difference in Q3 can be >30%
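Q3 is the percentage of residues whose predicted three-state secondary structure (helix H, strand E, coil C) matches the reference assignment. A minimal sketch, with an illustrative function name and toy strings:

```python
def q3(predicted: str, observed: str) -> float:
    """Percentage of positions where the predicted state equals the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("prediction and reference must have the same length")
    if not observed:
        raise ValueError("empty sequences have no defined Q3")
    agree = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * agree / len(observed)


print(q3("HHHHEECC", "HHHCEECC"))  # 87.5
```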

Protein structure hierarchical levels VHLTPEEKSAVTALWGKVNVDE VGGEALGRLLVVYPWTQRFFE SFGDLSTPDAVMGNPKVKAHG KKVLGAFSDGLAHLDNLKGTFA TLSELHCDKLHVDPENFRLLGN VLVCVLAHHFGKEFTPPVQAAY QKVVAGVANALAHKYH PRIMARY STRUCTURE (amino acid sequence) QUATERNARY STRUCTURE (oligomers) SECONDARY STRUCTURE (helices, strands) TERTIARY STRUCTURE (fold)


Flavodoxin-cheY: Praline alignment (prepro=1500)
1fx1 -PKALIVYGSTTGNT-EYTAETIARQLANAG-YEVDSRDAASVEAGGLFEGFDLVLLGCSTWGDDSI------ELQDDFIPLF-DSLEETGAQGRKVACF
FLAV_DESDE MSKVLIVFGSSTGNT-ESIaQKLEELIAAGG-HEVTLLNAADASAENLADGYDAVLFgCSAWGMEDL------EMQDDFLSLF-EEFNRFGLAGRKVAAf
FLAV_DESVH MPKALIVYGSTTGNT-EYTaETIARELADAG-YEVDSRDAASVEAGGLFEGFDLVLLgCSTWGDDSI------ELQDDFIPLF-DSLEETGAQGRKVACf
FLAV_DESSA MSKSLIVYGSTTGNT-ETAaEYVAEAFENKE-IDVELKNVTDVSVADLGNGYDIVLFgCSTWGEEEI------ELQDDFIPLY-DSLENADLKGKKVSVf
FLAV_DESGI MPKALIVYGSTTGNT-EGVaEAIAKTLNSEG-METTVVNVADVTAPGLAEGYDVVLLgCSTWGDDEI------ELQEDFVPLY-EDLDRAGLKDKKVGVf
2fcr --KIGIFFSTSTGNT-TEVADFIGKTLGA---KADAPIDVDDVTDPQALKDYDLLFLGAPTWNTG----ADTERSGTSWDEFLYDKLPEVDMKDLPVAIF
FLAV_AZOVI -AKIGLFFGSNTGKT-RKVaKSIKKRFDDET-MSDA-LNVNRVS-AEDFAQYQFLILgTPTLGEGELPGLSSDCENESWEEFL-PKIEGLDFSGKTVALf
FLAV_ENTAG MATIGIFFGSDTGQT-RKVaKLIHQKLDG---IADAPLDVRRAT-REQFLSYPVLLLgTPTLGDGELPGVEAGSQYDSWQEFT-NTLSEADLTGKTVALf
FLAV_ANASP SKKIGLFYGTQTGKT-ESVaEIIRDEFGN---DVVTLHDVSQAE-VTDLNDYQYLIIgCPTWNIGEL QSDWEGLY-SELDDVDFNGKLVAYf
FLAV_ECOLI -AITGIFFGSDTGNT-ENIaKMIQKQLGK---DVADVHDIAKSS-KEDLEAYDILLLgIPTWYYGE AQCDWDDFF-PTLEEIDFNGKLVALf
4fxn -MK--IVYWSGTGNT-EKMAELIAKGIIESG-KDVNTINVSDVNIDELL-NEDILILGCSAMGDEVL EESEFEPFI-EEIS-TKISGKKVALF
FLAV_MEGEL MVE--IVYWSGTGNT-EAMaNEIEAAVKAAG-ADVESVRFEDTNVDDVA-SKDVILLgCPAMGSEEL EDSVVEPFF-TDLA-PKLKGKKVGLf
FLAV_CLOAB -MKISILYSSKTGKT-ERVaKLIEEGVKRSGNIEVKTMNLDAVD-KKFLQESEGIIFgTPTYYAN ISWEMKKWI-DESSEFNLEGKLGAAf
3chy ADKELKFLVVDDFSTMRRIVRNLLKELGFN--NVEEAEDGVDALNKLQAGGYGFVI---SDWNMPNM DGLELL-KTIRADGAMSALPVLM T

1fx1 GCGDS-SY-EYFCGA-VDAIEEKLKNLGAEIVQD GLRIDGD--PRAARDDIVGWAHDVRGAI
FLAV_DESDE ASGDQ-EY-EHFCGA-VPAIEERAKELgATIIAE GLKMEGD--ASNDPEAVASfAEDVLKQL
FLAV_DESVH GCGDS-SY-EYFCGA-VDAIEEKLKNLgAEIVQD GLRIDGD--PRAARDDIVGwAHDVRGAI
FLAV_DESSA GCGDS-DY-TYFCGA-VDAIEEKLEKMgAVVIGD SLKIDGD--PE--RDEIVSwGSGIADKI
FLAV_DESGI GCGDS-SY-TYFCGA-VDVIEKKAEELgATLVAS SLKIDGE--PD--SAEVLDwAREVLARV
2fcr GLGDAEGYPDNFCDA-IEEIHDCFAKQGAKPVGFSNPDDYDYEESKS-VRDGKFLGLPLDMVNDQIPMEKRVAGWVEAVVSETGV
FLAV_AZOVI GLGDQVGYPENYLDA-LGELYSFFKDRgAKIVGSWSTDGYEFESSEA-VVDGKFVGLALDLDNQSGKTDERVAAwLAQIAPEFGLS--L--
FLAV_ENTAG GLGDQLNYSKNFVSA-MRILYDLVIARgACVVGNWPREGYKFSFSAALLENNEFVGLPLDQENQYDLTEERIDSwLEKLKPAV-L
FLAV_ANASP GTGDQIGYADNFQDA-IGILEEKISQRgGKTVGYWSTDGYDFNDSKA-LRNGKFVGLALDEDNQSDLTDDRIKSwVAQLKSEFGL
FLAV_ECOLI GCGDQEDYAEYFCDA-LGTIRDIIEPRgATIVGHWPTAGYHFEASKGLADDDHFVGLAIDEDRQPELTAERVEKwVKQISEELHLDEILNA
4fxn G-----SY-GWGDGKWMRDFEERMNGYGCVVVET PLIVQNE--PDEAEQDCIEFGKKIANI
FLAV_MEGEL G-----SY-GWGSGEWMDAWKQRTEDTgATVIGT AIVNEM--PDNA-PECKElGEAAAKA
FLAV_CLOAB STANSIAGGSDIA---LLTILNHLMVKgMLVYSG----GVAFGKPKTHLGYVHINEIQENEDENARIfGERiANkVKQIF
3chy VTAEAKK--ENIIAA AQAGAS GYVV-----KPFTAATLEEKLNKIFEKLGM G

Iteration 0 SP= AvSP= SId= 4009 AvSId= 0.313

Flavodoxin-cheY NJ tree

Secondary structure-induced alignment iteration

Iteration Convergence Limit cycle Divergence
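The three outcomes can be told apart by remembering every state an iteration has visited. The sketch below is a generic illustration, not the procedure used in the lecture: `step` stands for one round of the iteration (for example, realign then re-predict) and states are assumed to be hashable, such as a tuple of aligned strings. Divergence cannot be proven in finite time, so the third outcome is reported only as "no repeat found" within the iteration limit.

```python
def classify_iteration(step, start, max_iter=50):
    """Run state = step(state) and report how the sequence of states behaves.

    Returns ("convergence", 1) if a state maps to itself,
            ("limit cycle", period) if an earlier state recurs with period > 1,
            ("no repeat found", None) if max_iter is reached first.
    """
    seen = {start: 0}  # state -> iteration at which it first appeared
    state = start
    for n in range(1, max_iter + 1):
        state = step(state)
        if state in seen:
            period = n - seen[state]
            if period == 1:
                return ("convergence", 1)
            return ("limit cycle", period)
        seen[state] = n
    return ("no repeat found", None)


print(classify_iteration(lambda x: x // 2, 8))        # ('convergence', 1)
print(classify_iteration(lambda x: (x + 1) % 3, 0))   # ('limit cycle', 3)
print(classify_iteration(lambda x: x + 1, 0))         # ('no repeat found', None)
```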

Flavodoxin-cheY multiple alignment/secondary structure iteration: cheY SSEs
3chy-AA SEQUENCE|| AA |ADKELKFLVVDDFSTMRRIVRNLLKELGFNNVEEAEDGVDALNKLQAGGYGFVISDWNMP|
3chy-ITERATION-0|| PHD | EEEEEEE HHHHHHHHHHHHHHHHH E HHHHHHHHHH HHHEEE |
3chy-ITERATION-1|| PHD | EEEEEEEE HHHHHHHHHHHHHHH HHHHHHHH EEEEEE |
3chy-ITERATION-2|| PHD | EEEEEEEE HHHHHHHHHHHHHH HHHHHHHHH EEEEEE |
3chy-ITERATION-3|| PHD | EEEEEEEE HHHHHHHHHHHHHH EEE HHHHHH EEEEE |
3chy-ITERATION-4|| PHD | EEEEEEEE HHHHHHHHHHHHHH HHHHHHH EEEEE |
3chy-ITERATION-5|| PHD | EEEEEEEE HHHHHHHHHHHHHH EEE HHHHHH EEEEE |
3chy-ITERATION-6|| PHD | EEEEEEEE HHHHHHHHHHHHHH HHHHHHHH EEEEEE |
3chy-ITERATION-7|| PHD | EEEEEEEE HHHHHHHHHHHHHH EEE HHHHHH EEEEE |
3chy-ITERATION-8|| PHD | EEEEEEEE HHHHHHHHHHHHHH HHHHHHH EEEEEE |
3chy-ITERATION-9|| PHD | EEEEEEEE HHHHHHHHHHHHHH HHHHHHHHHH EEEEE |

3chy-AA SEQUENCE|| AA |NMDGLELLKTIRADGAMSALPVLMVTAEAKKENIIAAAQAGASGYVVKPFTAATLEEKLNKIFEKLGM|
3chy-ITERATION-0|| PHD | HHHHHHEEEEEE HHHHHHHHHHHHHHHHH HHHHHHHHHHHHHH |
3chy-ITERATION-1|| PHD | HHHHHHEEEEEE HHH HHHHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH |
3chy-ITERATION-2|| PHD | HHHHHHEEEEEE HHHHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH |
3chy-ITERATION-3|| PHD | HHHHHHHHHHHH HHHHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH |
3chy-ITERATION-4|| PHD | HHHHH EEEEE HHHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH |
3chy-ITERATION-5|| PHD | HHHHHHHH EEEEE HHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH |
3chy-ITERATION-6|| PHD | HHHHHHHH EEEEE HHHHHHHHHHHHHHHH EEEE HHHHHHHHHHHHHH |
3chy-ITERATION-7|| PHD | HHHHHHHH EEEEEE HHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH |
3chy-ITERATION-8|| PHD | HHHHHHHH EEEEE HHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH |
3chy-ITERATION-9|| PHD | HHHHHHHH EEEEE HHHHHHHHHHHHHHH EEEE HHHHHHHHHHHHHH |

Secondary structure prediction using MA (SymSS) [Figure: secondary structure strings (E, H and ? for uncertain positions) predicted for each sequence in the multiple alignment]

Optimal segmentation of predicted secondary structures. Each sequence within an alignment gives rise to a library of n secondary structure predictions, where n is the number of sequences in the alignment. The predictions are recorded by secondary structure type and region position in a single matrix. [Figure: prediction library for sequence pairs 1 -> 1, 1 -> 2, 1 -> 3, 1 -> 4, and the resulting matrix of H, E, C and ? scores per region]

Optimal segmentation of predicted secondary structures by Dynamic Programming. The recorded values are used in a weighted function according to their secondary structure type, that gives each position a window-specific score. The more probable the secondary structure element, the higher the score. Restrictions: H only if ws >= 4; E only if ws >= 2. [Figure: dynamic programming matrix of sequence position against window size (ws), storing max score, offset and label per cell, derived from the H, E, C and ? scores per region; the segmentation score is the total score of each path]
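A segmentation of this kind can be sketched as a one-dimensional dynamic programme. The version below is an assumption-laden illustration rather than the lecture's algorithm: the slide's weighted function is not specified, so a segment is simply scored by summing per-position scores for its label, and only the two length restrictions (H needs at least 4 positions, E at least 2) are taken from the slide. The function name and input layout are made up for the example.

```python
def segment(scores, min_len=None):
    """Find the labelling of a sequence into H/E/C segments with maximal score.

    scores maps each label to a list of per-position scores (equal lengths).
    min_len maps each label to the shortest segment allowed for that label.
    Returns (total_score, label_string).
    """
    if min_len is None:
        min_len = {"H": 4, "E": 2, "C": 1}
    n = len(next(iter(scores.values())))
    # Prefix sums so that any segment score is an O(1) lookup.
    prefix = {}
    for label, values in scores.items():
        running = [0.0]
        for value in values:
            running.append(running[-1] + value)
        prefix[label] = running
    minus_inf = float("-inf")
    best = [0.0] + [minus_inf] * n   # best[i]: best score for positions 0..i-1
    back = [None] * (n + 1)          # back[i]: (segment start, label)
    for end in range(1, n + 1):
        for label in scores:
            for width in range(min_len[label], end + 1):
                start = end - width
                if best[start] == minus_inf:
                    continue
                total = best[start] + prefix[label][end] - prefix[label][start]
                if total > best[end]:
                    best[end] = total
                    back[end] = (start, label)
    # Walk the back-pointers to recover the label string.
    pieces = []
    end = n
    while end > 0:
        start, label = back[end]
        pieces.append(label * (end - start))
        end = start
    return best[n], "".join(reversed(pieces))


toy = {
    "H": [3, 3, 3, 3, 0, 0, 0, 0],
    "E": [0, 0, 0, 0, 0, 0, 2, 2],
    "C": [1, 1, 1, 1, 1, 1, 1, 1],
}
print(segment(toy))  # (18.0, 'HHHHCCEE')
```

The minimum-length rule matters when a single position scores well for helix: with `H = [3, 0, 0, 0]` and coil scoring 1 everywhere, a one-residue helix is not allowed, so the all-coil labelling wins.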

Example of an optimally segmented secondary structure prediction library for sequence 3chy
3chy GYVV-----KPFTAATLEEKLNKIFEKLGM
3chy <- 1fx1 ??????????????? ee ?? hhhhhhhhhhhhhh ????????
3chy <- FLAV_DESDE ??????????????? ee ?? hhhhhhhhhhhhhhh ????????
3chy <- FLAV_DESVH ??????????????? ee ?? hhhhhhhhhhhhhh ????????
3chy <- FLAV_DESGI ??????????????? eee ?? ??hhhhhhhhhhhhh ????????
3chy <- FLAV_DESSA ??????????????? eee ?? ??hhhhhhhhhhhhh ????????
3chy <- 4fxn ??????????????? eee ?? hhhhhhhhhhhhh ?????????
3chy <- FLAV_MEGEL ????????????????eee ?? hh?hhhhhhhhhhh ?????????
3chy <- 2fcr e ? eeeeeee hhhhhhhhhhhhhhh ??????
3chy <- FLAV_ANASP ? eeeeeee hhhhhhhhhhhhhhh ??????
3chy <- FLAV_ECOLI eeeeeee hhhhhhhhhhhhhhh hhhhh
3chy <- FLAV_AZOVI ? eeeeeee hhhhhhhhhhhhhhh ????
3chy <- FLAV_ENTAG e eeeeeeee hhhhhhhhhhhhhhhh? ??????
3chy <- FLAV_CLOAB eeeeeee hhhhhhhhhh ???????????
3chy <- 3chy hhhhhhhhhhhhhh
Consensus EEEE----- HHHHHHHHHHHHH
Consensus-DSSP ****.....****xx***************
PHD HHHHHHHHHHHHHH
PHD-DSSP xxxx.....******************x**
DSSP EEEE.....SS HHHHHHHHHHHHHHHT
LumpDSSP EEEE..... HHHHHHHHHHHHHHH......

Integrating secondary structure prediction and multiple alignment Low key example But difficult How to scale up? Need new formalisms and technology

SnapDRAGON Richard A. George George R.A. and Heringa, J. (2002) J. Mol. Biol., 316, Integrating protein multiple alignment, secondary and tertiary structure prediction to predict structural domains in sequence data

The DEATH Domain Present in a variety of Eukaryotic proteins involved with cell death. Six helices enclose a tightly packed hydrophobic core. Some DEATH domains form homotypic and heterotypic dimers.

Pyruvate kinase (phosphotransferase): β barrel regulatory domain, α/β barrel catalytic substrate binding domain, α/β nucleotide binding domain. 1 continuous + 2 discontinuous domains. Structural domain organisation can be nasty…

Protein structure hierarchical levels VHLTPEEKSAVTALWGKVNVDE VGGEALGRLLVVYPWTQRFFE SFGDLSTPDAVMGNPKVKAHG KKVLGAFSDGLAHLDNLKGTFA TLSELHCDKLHVDPENFRLLGN VLVCVLAHHFGKEFTPPVQAAY QKVVAGVANALAHKYH PRIMARY STRUCTURE (amino acid sequence) QUATERNARY STRUCTURE SECONDARY STRUCTURE (helices, strands) TERTIARY STRUCTURE (fold)


SnapDRAGON The Cα distance matrix is divided into smaller clusters. Separately, each cluster is embedded into a local centroid. The final predicted structure is generated from full embedding of the multiple centroids and their corresponding local structures. [Figure: input data (multiple alignment, predicted secondary structure such as CCHHHCCEEE) feed an N × N Cα distance matrix and target matrix; 100 randomised initial matrices give 100 predictions]

Domains in structures assigned using method by Taylor (1997) Domain boundary positions of each model against sequence Summed and Smoothed Boundaries (Biased window protocol) SnapDRAGON
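The summing and smoothing step can be illustrated as follows. This is a hedged sketch only: the slide's "biased window protocol" is not described here, so a plain centred moving average stands in for it, positions are taken to be 0-based, and the function name and toy data are invented for the example.

```python
def smoothed_boundary_profile(boundaries_per_model, seq_len, window=3):
    """Sum domain boundary positions over all models, then smooth the counts.

    boundaries_per_model: one list of boundary positions (0-based) per model.
    Returns a list of length seq_len with the windowed average count.
    """
    counts = [0] * seq_len
    for boundaries in boundaries_per_model:
        for pos in boundaries:
            counts[pos] += 1
    half = window // 2
    profile = []
    for i in range(seq_len):
        lo = max(0, i - half)
        hi = min(seq_len, i + half + 1)
        profile.append(sum(counts[lo:hi]) / (hi - lo))
    return profile


# Five hypothetical models: four agree on a boundary near position 10.
models = [[10], [11], [9], [10], [30]]
profile = smoothed_boundary_profile(models, seq_len=40, window=3)
print(profile.index(max(profile)))  # 10
```

Peaks in the smoothed profile mark sequence positions where many independent models place a domain boundary.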

Predicting domain boundaries for a single average-size protein could take hours on a 128-node cluster computer, even with simplified significance testing. How to scale up to structural genomics? 30,000 human proteins at 1 hr each gives about 3.4 years.
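The scale-up estimate is plain arithmetic: 30,000 proteins at one hour each is 30,000 cluster-hours, which is roughly three and a half years of continuous running. A small sketch, with an illustrative function name and an assumed 365-day year:

```python
HOURS_PER_YEAR = 24 * 365


def years_of_compute(n_proteins, hours_per_protein, n_clusters=1):
    """Wall-clock years needed when each cluster handles one protein at a time."""
    return n_proteins * hours_per_protein / (HOURS_PER_YEAR * n_clusters)


print(round(years_of_compute(30_000, 1), 2))                 # 3.42
print(round(years_of_compute(30_000, 1, n_clusters=10), 2))  # 0.34
```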

What we still cannot do well “Give us sequence, we do rest” has failed so far; e.g., the number of human genes. Gene prediction is bad, RNA genes are missed. Protein structure/function prediction is unsolved; we have no clue about the function of 50% of human genes. No theory of gene regulation. Cannot predict post-translational modification well. Many (database) solutions are not generic. We have no E=mc² so need to keep all data. Integrating methods and data. Understand biologically.

Future Bioinformatics Research Topics Integration of knowledge –We have some formalisms (ontologies, distributed databases) but we need to develop many completely new formalisms and new technologies beyond what we have now

Conclusions Getting important integrative Bioinformatics/Systems Biology applications onto the Grid through Gridlab can be significant. Bioinformatics and genomics are getting clinical. Gridlab could play an important role.

The end. Thanks

Future Bioinformatics Research Topics Keywords morning session Integration of knowledge –Information transfer from one object to another –What are the rules –From genotype to phenotypes, current algorithms and ontologies not sufficient –Biological interpretation needs context –DB maintenance is dynamic process, most info is static –Need resources –Environment should allow student to make method in 3 hours

Genomics –Identifying genetic elements still bad –Collect easy primary biological facts –Gene pred, struct pred, functional all unsolved –Genetic “parts” list is incomplete and scanty –Many omics “unknowme”

Genomics –Hypothesis driven versus systematic approaches –Need databases, algorithms, biol knowledge –Data structures not suitable for complexity –Solutions such as Ensembl not generic –Need technologies beyond ontologies –Need new formalisms to be able to do “vertical genomics”

Systems Biology –Very promising area Health Pharmaceuticals Biotechnology Environment

(Medical) Systems Biology –Diego di Bernardo –Ilias Jakovidis –Very promising area

Summary How can Europe regain ground

Hans Werner Mewes DNA contains all Identifying genetic elements still bad Collect easy primary biological facts Gene pred, struct pred, functional all unsolved Genetic “parts” list is incomplete and scanty Many omics “unknowme”

Hypothesis driven versus systematic approaches Need databases, algorithms, biol knowledge Data structures not suitable for complexity Solutions such as Ensembl not generic Need technologies beyond ontologies

Information transfer from one object to another What are the rules From genotype to phenotypes, current algorithms not sufficient Biological interpretation needs context DB maintenance is dynamic process, most info is static Need resources Environment should allow student to make method in 3 hours

Diego di Bernardo TIGEM: disease genes. Bioinformatics and comp biol not on a par: 81% of genome “genomics&databases” and 19% “genomics&algorithms”. Important topics: regulation, network, digital signal processing, HMMs. Problems: algorithms not biological and no experimental verification. Bioinformatics helps design biological experiments. Richard Durbin: value of physics and engineering

Computational tools for discovery of novel objects

Ilias Jakovidis Medical informatics Health telematics eHealth Medical ontologies didn’t help Paul Schofield at all (tried with NCBI-big mess) Middleware includes ontologies so covers biology (IBM!)

Language engineering Natural language in medicine, computerize medical community Biomedical informatics: applications in healthcare, how to get to clinical? Synergy between medicine and biology informatics Alfonso: med will dominate, lot of money with unclear methods

Medical info has worked coherently, how can we do that? How can we change? Mewes: Bioinf has achieved usage, not med. Bioinf is entering clinics.

Gunnar von Heijne Databases should be funded Start problem for 5 years: and then what? With infrastructure this problem is less, so funding is relatively OK. Technology development should not become dominant Most biologists are small scale hypothesis driven Marketing problem

From 19 bioinf methods, 15 are European in genomics. Validation is not always key (Alfonso). EMBOSS project European wide, for algorithm driven research. EMBOSS is longstanding. But could not get funding from EC (no funding category)

Alfonso Valencia Often, 1 bioinformatician for everything Need of integration/collaboration –Social, technical barriers People should realise that Bioinf and Bioinformaticians are very different Integrated (med) system –Underfunded (1 postdoc) –Difficult to develop –Lack of standards and repositories –Difficult to interact with biologists –All these things essential

3-4 good bioinf groups in Spain Make virtual institute for bioinformatics There are few large groups with national funding There are few large groups with European funding There are many small groups with weak institutional funding

Create framework valid for biology Interaction reduces overhead System access for biologists, point to the right expert Create new science beyond current needs This does not compete with basic needs Support strong European areas (eg. protein interaction)

Bioinformatics is a new discipline Who solves the problem, who is interested in solving it, and not always who is qualified to solve it (engineers,..) Example “information extraction in molecular biology”: after years no real progress made. Systems Biology: what to do and how (no linear path), but we have opportunity to develop knowing Experimental validation: methods debug databases. Many proteins (90%?) have never seen an experiment

Should bioinf talk to biology or vice versa?

Jean-Marie Claverie 1951 first protein sequence (insulin) Field has come of age, so outsiders shouldn’t tell us what to do and how Bioinformatics is part of the foundation Clear difference in application of informatics or bioinformatics Future will be different Give us sequence, we do rest failed! Number of human genes is example.

Gene finding: standard genes good, RNA genes missed, no theory of gene transcription. We have no E=mc² so need to keep data. Computational biology is same as systems biology. Good integration: E. coli Bioinf-project, find all genes in small bacterium. Inclusive project. Now good consortium. Bioinformatics becomes invisible for biologists (BLAST).

Howard Bilofsky PRISM forum Provide challenges for (bio)informatics Drives Bioinf, omics, .. techniques