
Bioinformatics as an integrative science Jaap Heringa Faculty of Sciences Faculty of Earth and Life Sciences Integrative Bioinformatics Institute VU (IBIVU) heringa@cs.vu.nl, www.cs.vu.nl/~ibivu, Tel. +31-20-4447649

Gathering knowledge Anatomy, architecture Dynamics, mechanics Informatics (Cybernetics – Wiener, 1948) (Cybernetics has been defined as the science of control in machines and animals, and hence it applies to technological, animal and environmental systems) Genomics, bioinformatics Rembrandt, 1632 Newton, 1726

Mathematics Statistics Computer Science Informatics Biology Molecular biology Medicine Chemistry Physics Bioinformatics

“Studying informational processes in biological systems” (Hogeweg Utrecht; early 1970s) Applying algorithms and mathematical formalisms in biology (genomics) USA started but now everywhere Taking care of the computational infrastructure and data management everywhere Is a supporting science everywhere “Information technology applied to the management and analysis of biological data” (Attwood and Parry-Smith)

The Human Genome -- 26 June 2000

Dinner discussion: Integrative Bioinformatics & Genomics VU metabolome proteome genome transcriptome physiome Genomics

A gene codes for a protein DNA → (transcription) → mRNA → (translation) → Protein DNA: CCTGAGCCAACTATTGATGAA (4-letter alphabet); mRNA: CCUGAGCCAACUAUUGAUGAA; Protein: PEPTIDE (20-letter alphabet)
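The two steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not part of the original talk: it assumes the DNA string is the coding strand (so transcription is just T to U), it uses the standard genetic code, and the names `transcribe`, `translate` and `CODON_TABLE` are made up for the example.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order U, C, A, G
# (first base varies slowest, third base fastest). '*' marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna: str) -> str:
    """Coding-strand DNA to mRNA: replace T by U (a simplification)."""
    return dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Read the mRNA codon by codon from position 0 until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("CCTGAGCCAACTATTGATGAA")
print(mrna)             # CCUGAGCCAACUAUUGAUGAA
print(translate(mrna))  # PEPTIDE
```

The 21 nucleotides on the slide form seven codons, so the 4-letter message becomes a 7-residue word in the 20-letter alphabet.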

Humans have spliced genes…

DNA makes RNA makes Protein

Remarks Proteins can use different combinations of exons => alternative splicing The human factor VIII gene (whose mutations cause hemophilia A) is spread over ~186,000 bp. It consists of 26 exons ranging in size from 69 to 3,106 bp, and its 25 introns range in size from 207 to 32,400 bp. The complete gene is thus ~9 kb of exon and ~177 kb of intron. The biggest human gene yet is for dystrophin. It has > 30 exons and is spread over 2.4 million bp. Single Nucleotide Polymorphism (SNP) data important for health

Microarray with about 20K genes…

Proteomics X-ray crystallography NMR Mass spectrometry data Structural genomics: solving and categorising all existing protein folds (3D structures) Protein-protein interactions Protein-ligand interactions (drug design)

Metabolic networks Glycolysis and Gluconeogenesis Kegg database (Japan)

Physiome Metabolomics + all other little things in the cell Ions, protons, etc.

Algorithms in bioinformatics string algorithms dynamic programming machine learning (NN, k-NN, SVM, GA,..) Markov chain models hidden Markov models Markov Chain Monte Carlo (MCMC) algorithms stochastic context free grammars EM algorithms Gibbs sampling clustering tree algorithms text analysis hybrid/combinatorial techniques and more…

Free University initiatives Integrative Bioinformatics Institute VU (IBIVU) Centre for Research on BioComplex Systems (CRBCS) – Systems Biology Centre for Neurobiology and Cognitive Research (CNCR) VU Medical Centre (Microarray, CGH data)

IBIVU supporting Dutch initiatives BioRange: Pan-Dutch bioinformatics proposal (65M Euro) Centre for Medical Systems Biology (Leiden, A’dam, R’dam) Ecogenomics (A’dam, Wageningen, Nat. Inst. For Health and Environment (RIVM)) BioASP: streamline/stimulate bioinformatics teaching across The Netherlands

Dutch Centres of Excellence Cancer Genomics Consortium [DCGP] Center for Biosystem Genomics [CBSG], focuses on plant genomics (potato, tomato) Kluyver Centre for Genomics of Industrial Fermentation [Kluyver] Center for Medical Systems Biology [CMSB], focuses on multifactorial disease Netherlands Proteomics Centre for proteomics as an emerging horizontal genomics discipline

Dutch academic/industrial initiatives Nutrigenomics: exploration into the prevention and care of nutritional inroads in vascular disease, diabetes, hypertension and obesity; Interaction between the immune system and food: a functional genomics approach to celiac disease; Mechanisms of life-threatening virus disease and new leads for treatment and vaccines; Genomics of host – respiratory virus interactions: towards novel intervention strategies; Ecogenomics: functioning of ecosystems targeted at sustainable, environmentally friendly and healthy products (ecology, toxicology and sustainable innovation)

Ecogenomics

Table 1: Operational organisation of the CMSB
Columns: PROJECT; Coordinator; Task force; LOCATION: local coordinator, other participants
EPIDEMIOLOGY. Coordinators: Dorret Boomsma (VU/mc)*, Cornelia van Duyn (EMC)
  Populations
    EMC: van Duyn, Hofman, Oostra
    VU/mc: Boomsma, Boers, Dijkmans, Heine, Hoogendijk, van der Knaap, Meier, Pena, Pinedo
    LUMC: Slagboom, Bertina, Breedveld, Breuning, Cornelisse, Devilee, vDissel, Ferrari, Huizinga, Roosendaal, Roos, van der Velde, Westendorp, Zitman
  Genotyping
    LUMC: Slagboom, Sandkuijl, den Dunnen
    EMC: Oostra, Heutink
    VU/mc: Boomsma, (Heutink)
SYSTEMS BIOLOGY. Coordinators: Jan vd Greef (TNO/UL)*, Cor Verweij (VU/mc)
  Arraying
    LUMC: den Dunnen, Boer, Fodde
    VU/mc: Verweij, Ylstra, Brakenhoff
    EMC: Oostra
  Proteomics
    LUMC: Koning, Deelder, den Dunnen, van der Maarel
    UL: Overkleeft, Abrahams
    VU/mc: Smit, Li, van Kooyk
  Metabolomics
    UL: Verduijn Lunel, van de Geer, Verheij
    TNO: van der Greef, Havekes, te Koppele
    VU/mc: Jakobs
TECHNOLOGY. Coordinator: Huub de Groot (UL)
  Molecular interactions
    UL: Abrahams, Brouwer, IJzerman, van Boom
    LUMC: Tanke, Raap, Deelder, den Dunnen
    VU/mc: Leurs, Irth
  In vivo imaging
    UL: de Groot, Kok
    LUMC: Reiber, van Buchem, de Roos, Poelmann, Lowick
    VU/mc: Witter, Bal, Lammertsma
    EMC: van Duyn, van Swieten
MODEL SYSTEMS. Coordinator: Rune Frants (LUMC)
  Mouse / Rat, Zebrafish, Drosophila, Yeast
    EMC: Oostra
    LUMC: Verbeek, Fodde, deKloet, Verrijzer, Noordermeer, Mullenders
    TNO: Havekes
    UL: Spaink, Brouwer, Schmidt
    VU/mc: Verhage, Smit, Vandenbroucke-Grauls
CLINICAL APPLICATIONS. Coordinator: Cornelis Melief (LUMC)
  Cells, vaccines
    LUMC: Melief, Goulmy, Falkenburg, Ottenhoff; Spaan, de Vries
    VU/mc: van Kooyk; Meijer, Pinedo
  Viral
    LUMC: Spaan, Wiertz, Hoeben
    VU/mc: Gerritsen, Curiel
  Methodologies, Pharmaceuticals
    UL: IJzerman, Mulder, van Boom
    LUMC: Huizinga, Breedveld, Breuning, van Deutekom, Ferrari, Fodde, Frants, Jukema, de Kloet, Ottenhoff, van der Velde, Zitman
    VU/mc: Maassen, Dijkmans, Leurs, Meijer, Pinedo
    EMC: Stricker
CENTRAL PROJECT (Coordinator / Elements)
DATA INTEGRATION, ANALYSIS AND LOGISTICS. Coordinator: NN
  Central Information Management: TFBI / BIG-VU / EBB; Rosetta Resolver®; LIMS integration/interfacing
  Biostatistics: van Houwelingen, Eijlers, Boer, Sandkuijl (LUMC); van der Vaart, de Gunst, Boers (VU/mc); Houwing-Duistermaat (EMC); van de Geer (UL)
  Bioinformatics: Boer, Svensson, Gorbalenya (LUMC); Heringa, vBeek (VU/mc); Stijnen, van der Lei, Mons (EMC); Kok (UL)
  BioASP Interface: in collaboration with Vriend/Tellegen
  GRID – Virtual Laboratory, NWO-BMI, FLEXwork: van Ommen, Boer, Svensson; in collaboration with Stiekema (Wag), Herzberger (Ams), Vriend (Nijm)
The managing team Medical Systems Biology

Integrate data sources Integrate methods Integrate data through method integration (biological model) Integrative bioinformatics Data integration

Bioinformatics tool Data Algorithm Biological Interpretation (model) tool

“Nothing in Biology makes sense except in the light of evolution” (Theodosius Dobzhansky (1900-1975)) “Nothing in Bioinformatics makes sense except in the light of Biology” Bioinformatics

Pair-wise sequence alignment (more than just string matching) MDAGSTVILCFVG MDAASTILCGS Amino Acid Exchange Matrix Gap penalties (open, extension) Search matrix MDAGSTVILCFVG- MDAAST-ILC--GS Evolution Global dynamic programming
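Global dynamic programming over a search matrix can be sketched as follows. This is a simplified illustration under stated assumptions, not the method used for the alignment on the slide: it scores residues by identity (match/mismatch) instead of an amino acid exchange matrix, and it charges one linear penalty per gap position instead of separate open and extension penalties. The function name `global_align` and the default scores are hypothetical.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch style global alignment with a linear gap penalty.

    Returns (score, aligned_a, aligned_b).
    """
    n, m = len(a), len(b)

    def sub(i, j):
        # Stand-in for an amino acid exchange matrix lookup.
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the search matrix; row 0 and column 0 hold leading-gap costs.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right cell to recover one optimal path.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


s, top, bottom = global_align("MDAGSTVILCFVG", "MDAASTILCGS")
print(s)
print(top)
print(bottom)
```

With identity scoring and a linear gap penalty the gaps will not necessarily land where they do on the slide; an exchange matrix and affine gap penalties change which path through the matrix is optimal.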

Data Algorithm Biological Interpretation (model) tool Integrative bioinformatics Data integration

Data 1 Data 2 Data 3

Integrative bioinformatics Data integration Data 1 Algorithm 1 Biological Interpretation (model) 1 tool Algorithm 2 Biological Interpretation (model) 2 Algorithm 3 Biological Interpretation (model) 3 Data 2 Data 3

“The solution includes an infrastructure or data pipeline involving: a general portal virtual lab technology (virtual LIMS) ‘petabase’ data handling facilities methods, software and ‘tools’ to integrate data and extract knowledge from data in the user domain. This infrastructure calls for a central facilitation unit providing large storage and computing facilities to run central software packages with user interfaces” Could Gridlab do this? Integrative bioinformatics Data integration

Integrating Primary and Predicted Secondary Structure data for Multiple Alignment

Using secondary structure in multiple alignment “Structure more conserved than sequence” 10 years SS prediction method development: Q3 += 3% 10 years MA method development: difference in Q3 can be >30%
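Q3 is the percentage of residues whose three-state secondary structure (helix H, strand E, coil C) is predicted correctly. A minimal sketch of the measure, with a hypothetical function name and toy strings that are not taken from the talk:

```python
def q3(predicted: str, observed: str) -> float:
    """Percentage of positions where the predicted state equals the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("prediction and observation must have equal length")
    if not observed:
        raise ValueError("empty input")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


# Toy example: five of six positions agree.
print(round(q3("HHHEEC", "HHHECC"), 1))  # 83.3
```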

Protein structure hierarchical levels VHLTPEEKSAVTALWGKVNVDE VGGEALGRLLVVYPWTQRFFE SFGDLSTPDAVMGNPKVKAHG KKVLGAFSDGLAHLDNLKGTFA TLSELHCDKLHVDPENFRLLGN VLVCVLAHHFGKEFTPPVQAAY QKVVAGVANALAHKYH PRIMARY STRUCTURE (amino acid sequence) QUATERNARY STRUCTURE (oligomers) SECONDARY STRUCTURE (helices, strands) TERTIARY STRUCTURE (fold)

Flavodoxin-cheY: Praline alignment (prepro=1500)
1fx1       -PKALIVYGSTTGNT-EYTAETIARQLANAG-YEVDSRDAASVEAGGLFEGFDLVLLGCSTWGDDSI------ELQDDFIPLF-DSLEETGAQGRKVACF
FLAV_DESDE MSKVLIVFGSSTGNT-ESIaQKLEELIAAGG-HEVTLLNAADASAENLADGYDAVLFgCSAWGMEDL------EMQDDFLSLF-EEFNRFGLAGRKVAAf
FLAV_DESVH MPKALIVYGSTTGNT-EYTaETIARELADAG-YEVDSRDAASVEAGGLFEGFDLVLLgCSTWGDDSI------ELQDDFIPLF-DSLEETGAQGRKVACf
FLAV_DESSA MSKSLIVYGSTTGNT-ETAaEYVAEAFENKE-IDVELKNVTDVSVADLGNGYDIVLFgCSTWGEEEI------ELQDDFIPLY-DSLENADLKGKKVSVf
FLAV_DESGI MPKALIVYGSTTGNT-EGVaEAIAKTLNSEG-METTVVNVADVTAPGLAEGYDVVLLgCSTWGDDEI------ELQEDFVPLY-EDLDRAGLKDKKVGVf
2fcr       --KIGIFFSTSTGNT-TEVADFIGKTLGA---KADAPIDVDDVTDPQALKDYDLLFLGAPTWNTG----ADTERSGTSWDEFLYDKLPEVDMKDLPVAIF
FLAV_AZOVI -AKIGLFFGSNTGKT-RKVaKSIKKRFDDET-MSDA-LNVNRVS-AEDFAQYQFLILgTPTLGEGELPGLSSDCENESWEEFL-PKIEGLDFSGKTVALf
FLAV_ENTAG MATIGIFFGSDTGQT-RKVaKLIHQKLDG---IADAPLDVRRAT-REQFLSYPVLLLgTPTLGDGELPGVEAGSQYDSWQEFT-NTLSEADLTGKTVALf
FLAV_ANASP SKKIGLFYGTQTGKT-ESVaEIIRDEFGN---DVVTLHDVSQAE-VTDLNDYQYLIIgCPTWNIGEL--------QSDWEGLY-SELDDVDFNGKLVAYf
FLAV_ECOLI -AITGIFFGSDTGNT-ENIaKMIQKQLGK---DVADVHDIAKSS-KEDLEAYDILLLgIPTWYYGE--------AQCDWDDFF-PTLEEIDFNGKLVALf
4fxn       -MK--IVYWSGTGNT-EKMAELIAKGIIESG-KDVNTINVSDVNIDELL-NEDILILGCSAMGDEVL-------EESEFEPFI-EEIS-TKISGKKVALF
FLAV_MEGEL MVE--IVYWSGTGNT-EAMaNEIEAAVKAAG-ADVESVRFEDTNVDDVA-SKDVILLgCPAMGSEEL-------EDSVVEPFF-TDLA-PKLKGKKVGLf
FLAV_CLOAB -MKISILYSSKTGKT-ERVaKLIEEGVKRSGNIEVKTMNLDAVD-KKFLQESEGIIFgTPTYYAN---------ISWEMKKWI-DESSEFNLEGKLGAAf
3chy       ADKELKFLVVDDFSTMRRIVRNLLKELGFN--NVEEAEDGVDALNKLQAGGYGFVI---SDWNMPNM----------DGLELL-KTIRADGAMSALPVLM T
1fx1       GCGDS-SY-EYFCGA-VDAIEEKLKNLGAEIVQD---------------------GLRIDGD--PRAARDDIVGWAHDVRGAI--------
FLAV_DESDE ASGDQ-EY-EHFCGA-VPAIEERAKELgATIIAE---------------------GLKMEGD--ASNDPEAVASfAEDVLKQL--------
FLAV_DESVH GCGDS-SY-EYFCGA-VDAIEEKLKNLgAEIVQD---------------------GLRIDGD--PRAARDDIVGwAHDVRGAI--------
FLAV_DESSA GCGDS-DY-TYFCGA-VDAIEEKLEKMgAVVIGD---------------------SLKIDGD--PE--RDEIVSwGSGIADKI--------
FLAV_DESGI GCGDS-SY-TYFCGA-VDVIEKKAEELgATLVAS---------------------SLKIDGE--PD--SAEVLDwAREVLARV--------
2fcr       GLGDAEGYPDNFCDA-IEEIHDCFAKQGAKPVGFSNPDDYDYEESKS-VRDGKFLGLPLDMVNDQIPMEKRVAGWVEAVVSETGV------
FLAV_AZOVI GLGDQVGYPENYLDA-LGELYSFFKDRgAKIVGSWSTDGYEFESSEA-VVDGKFVGLALDLDNQSGKTDERVAAwLAQIAPEFGLS--L--
FLAV_ENTAG GLGDQLNYSKNFVSA-MRILYDLVIARgACVVGNWPREGYKFSFSAALLENNEFVGLPLDQENQYDLTEERIDSwLEKLKPAV-L------
FLAV_ANASP GTGDQIGYADNFQDA-IGILEEKISQRgGKTVGYWSTDGYDFNDSKA-LRNGKFVGLALDEDNQSDLTDDRIKSwVAQLKSEFGL------
FLAV_ECOLI GCGDQEDYAEYFCDA-LGTIRDIIEPRgATIVGHWPTAGYHFEASKGLADDDHFVGLAIDEDRQPELTAERVEKwVKQISEELHLDEILNA
4fxn       G-----SY-GWGDGKWMRDFEERMNGYGCVVVET---------------------PLIVQNE--PDEAEQDCIEFGKKIANI---------
FLAV_MEGEL G-----SY-GWGSGEWMDAWKQRTEDTgATVIGT----------------------AIVNEM--PDNA-PECKElGEAAAKA---------
FLAV_CLOAB STANSIAGGSDIA---LLTILNHLMVKgMLVYSG----GVAFGKPKTHLGYVHINEIQENEDENARIfGERiANkVKQIF-----------
3chy       VTAEAKK--ENIIAA---------AQAGAS-------------------------GYVV-----KPFTAATLEEKLNKIFEKLGM------ G
Iteration 0 SP= 136944.00 AvSP= 10.675 SId= 4009 AvSId= 0.313

Flavodoxin-cheY NJ tree

Secondary structure-induced alignment iteration

Iteration Convergence Limit cycle Divergence

Flavodoxin-cheY multiple alignment / secondary structure iteration: cheY SSEs
3chy-AA SEQUENCE|| AA  |ADKELKFLVVDDFSTMRRIVRNLLKELGFNNVEEAEDGVDALNKLQAGGYGFVISDWNMP|
3chy-ITERATION-0|| PHD | EEEEEEE HHHHHHHHHHHHHHHHH E HHHHHHHHHH HHHEEE |
3chy-ITERATION-1|| PHD | EEEEEEEE HHHHHHHHHHHHHHH HHHHHHHH EEEEEE |
3chy-ITERATION-2|| PHD | EEEEEEEE HHHHHHHHHHHHHH HHHHHHHHH EEEEEE |
3chy-ITERATION-3|| PHD | EEEEEEEE HHHHHHHHHHHHHH EEE HHHHHH EEEEE |
3chy-ITERATION-4|| PHD | EEEEEEEE HHHHHHHHHHHHHH HHHHHHH EEEEE |
3chy-ITERATION-5|| PHD | EEEEEEEE HHHHHHHHHHHHHH EEE HHHHHH EEEEE |
3chy-ITERATION-6|| PHD | EEEEEEEE HHHHHHHHHHHHHH HHHHHHHH EEEEEE |
3chy-ITERATION-7|| PHD | EEEEEEEE HHHHHHHHHHHHHH EEE HHHHHH EEEEE |
3chy-ITERATION-8|| PHD | EEEEEEEE HHHHHHHHHHHHHH HHHHHHH EEEEEE |
3chy-ITERATION-9|| PHD | EEEEEEEE HHHHHHHHHHHHHH HHHHHHHHHH EEEEE |
3chy-AA SEQUENCE|| AA  |NMDGLELLKTIRADGAMSALPVLMVTAEAKKENIIAAAQAGASGYVVKPFTAATLEEKLNKIFEKLGM|
3chy-ITERATION-0|| PHD | HHHHHHEEEEEE HHHHHHHHHHHHHHHHH HHHHHHHHHHHHHH |
3chy-ITERATION-1|| PHD | HHHHHHEEEEEE HHH HHHHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH |
3chy-ITERATION-2|| PHD | HHHHHHEEEEEE HHHHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH |
3chy-ITERATION-3|| PHD | HHHHHHHHHHHH HHHHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH |
3chy-ITERATION-4|| PHD | HHHHH EEEEE HHHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH |
3chy-ITERATION-5|| PHD | HHHHHHHH EEEEE HHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH |
3chy-ITERATION-6|| PHD | HHHHHHHH EEEEE HHHHHHHHHHHHHHHH EEEE HHHHHHHHHHHHHH |
3chy-ITERATION-7|| PHD | HHHHHHHH EEEEEE HHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH |
3chy-ITERATION-8|| PHD | HHHHHHHH EEEEE HHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH |
3chy-ITERATION-9|| PHD | HHHHHHHH EEEEE HHHHHHHHHHHHHHH EEEE HHHHHHHHHHHHHH |

Secondary structure prediction using MA (SymSS) [Figure: libraries of predicted secondary structure strings (E = strand, H = helix, ? = undetermined) for four aligned sequences, shown for the sequence orders 1234, 2134, 3124 and 4123]

Optimal segmentation of predicted secondary structures Each sequence within an alignment gives rise to a library of n secondary structure predictions, where n is the number of sequences in the alignment. The predictions are recorded by secondary structure type and region position in a single matrix. [Figure: predictions for sequence 1 derived via sequences 1 to 4 (1->1, 1->2, 1->3, 1->4), with per-position rows H score 0 0 0 0 0 ...; E score 3 4 4 4 3 ...; C score 1 0 0 0 0 ...; ? score 0 0 0 0 1 ...; Region 0 1 1 1 0 ...]

Optimal segmentation of predicted secondary structures by Dynamic Programming The recorded values (H score, E score, C score, ? score, Region) are used in a weighted function, according to their secondary structure type, that gives each position a window-specific score. The more probable the secondary structure element, the higher the score. Restrictions on window size (ws): H only if ws >= 4; E only if ws >= 2. [Figure: DP matrix of sequence position against window size; each cell stores Max score, Offset and Label (example entries: 5, H, 26); the segmentation score is the total score of each path]
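The idea of choosing the best-scoring path of labelled windows can be sketched with a small dynamic program. This is only an illustration under loud assumptions: the window score here is the plain sum of per-position scores for the window's label, which stands in for the weighted function on the slide, and the names `segment` and `MIN_WINDOW` are hypothetical. The two restrictions from the slide (H only if ws >= 4, E only if ws >= 2) are kept.

```python
from itertools import accumulate

# Minimum window size per label; H and E follow the slide, C is unrestricted.
MIN_WINDOW = {"H": 4, "E": 2, "C": 1}


def segment(scores):
    """scores: dict with keys 'H', 'E', 'C', each a list of per-position scores.

    Returns (total_score, label_string) for the best segmentation.
    """
    n = len(scores["C"])
    # Prefix sums so that any window sum is a single subtraction.
    prefix = {lab: [0] + list(accumulate(vals)) for lab, vals in scores.items()}

    neg_inf = float("-inf")
    best = [neg_inf] * (n + 1)   # best[i]: best score for positions 0..i-1
    back = [None] * (n + 1)      # back[i]: (start, label) of the last window
    best[0] = 0
    for end in range(1, n + 1):
        for lab, min_ws in MIN_WINDOW.items():
            for start in range(0, end - min_ws + 1):
                if best[start] == neg_inf:
                    continue
                cand = best[start] + prefix[lab][end] - prefix[lab][start]
                if cand > best[end]:
                    best[end] = cand
                    back[end] = (start, lab)

    # Follow the back-pointers to recover the labelled windows.
    pieces = []
    end = n
    while end > 0:
        start, lab = back[end]
        pieces.append(lab * (end - start))
        end = start
    return best[n], "".join(reversed(pieces))


toy = {
    "H": [0, 0, 0, 0, 2, 2, 2, 2],
    "E": [3, 4, 4, 3, 0, 0, 0, 0],
    "C": [1, 1, 1, 1, 1, 1, 1, 1],
}
print(segment(toy))  # (22, 'EEEEHHHH')
```

The minimum window sizes matter when a strong signal is too short: with H scores of 5, 5, 5, 0, 0 and a coil score of 1 everywhere, the best path is HHHHC, because a helix may not be shorter than four positions.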

Example of an optimally segmented secondary structure prediction library for sequence 3chy
3chy ---------------GYVV-----KPFTAATLEEKLNKIFEKLGM------
3chy <- 1fx1 ??????????????? ee ?? hhhhhhhhhhhhhh ????????
3chy <- FLAV_DESDE ??????????????? ee ?? hhhhhhhhhhhhhhh ????????
3chy <- FLAV_DESVH ??????????????? ee ?? hhhhhhhhhhhhhh ????????
3chy <- FLAV_DESGI ??????????????? eee ?? ??hhhhhhhhhhhhh ????????
3chy <- FLAV_DESSA ??????????????? eee ?? ??hhhhhhhhhhhhh ????????
3chy <- 4fxn ??????????????? eee ?? hhhhhhhhhhhhh ?????????
3chy <- FLAV_MEGEL ????????????????eee ?? hh?hhhhhhhhhhh ?????????
3chy <- 2fcr e ? eeeeeee hhhhhhhhhhhhhhh ??????
3chy <- FLAV_ANASP ? eeeeeee hhhhhhhhhhhhhhh ??????
3chy <- FLAV_ECOLI eeeeeee hhhhhhhhhhhhhhh hhhhh
3chy <- FLAV_AZOVI ? eeeeeee hhhhhhhhhhhhhhh ????
3chy <- FLAV_ENTAG e eeeeeeee hhhhhhhhhhhhhhhh? ??????
3chy <- FLAV_CLOAB eeeeeee hhhhhhhhhh ???????????
3chy <- 3chy --------------- ----- hhhhhhhhhhhhhh ------
Consensus ---------------EEEE----- HHHHHHHHHHHHH ------
Consensus-DSSP ...............****.....****xx***************......
PHD --------------- ----- HHHHHHHHHHHHHH ------
PHD-DSSP ...............xxxx.....******************x**......
DSSP ...............EEEE.....SS HHHHHHHHHHHHHHHT......
LumpDSSP ...............EEEE..... HHHHHHHHHHHHHHH......

Integrating secondary structure prediction and multiple alignment Low key example But difficult How to scale up? Need new formalisms and technology

SnapDRAGON Richard A. George George R.A. and Heringa, J. (2002) J. Mol. Biol., 316, 839-851. Integrating protein multiple alignment, secondary and tertiary structure prediction to predict structural domains in sequence data

The DEATH Domain Present in a variety of Eukaryotic proteins involved with cell death. Six helices enclose a tightly packed hydrophobic core. Some DEATH domains form homotypic and heterotypic dimers. http://www.mshri.on.ca/pawson

Pyruvate kinase Phosphotransferase β barrel regulatory domain α/β barrel catalytic substrate binding domain α/β nucleotide binding domain 1 continuous + 2 discontinuous domains Structural domain organisation can be nasty…

Protein structure hierarchical levels VHLTPEEKSAVTALWGKVNVDE VGGEALGRLLVVYPWTQRFFE SFGDLSTPDAVMGNPKVKAHG KKVLGAFSDGLAHLDNLKGTFA TLSELHCDKLHVDPENFRLLGN VLVCVLAHHFGKEFTPPVQAAY QKVVAGVANALAHKYH PRIMARY STRUCTURE (amino acid sequence) QUATERNARY STRUCTURE SECONDARY STRUCTURE (helices, strands) TERTIARY STRUCTURE (fold)

The C  distance matrix is divided into smaller clusters. Seperately, each cluster is embedded into a local centroid. The final predicted structure is generated from full embedding of the multiple centroids and their corresponding local structures. 3 N N N N C  distance matrix Target matrix N CCHHHCCEEE Multiple alignment Predicted secondary structure 100 randomised initial matrices 100 predictions Input data SnapDRAGON

Domains in structures assigned using method by Taylor (1997) Domain boundary positions of each model against sequence Summed and Smoothed Boundaries (Biased window protocol) SnapDRAGON 123123
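The summing and smoothing step can be illustrated with a short sketch. It is an assumption-laden stand-in, not the biased window protocol named on the slide: boundary positions (0-based) from all models are counted per sequence position and then summed over a plain symmetric window. The function name `boundary_profile` and the `half_window` parameter are hypothetical.

```python
def boundary_profile(models, seq_len, half_window=2):
    """Sum domain-boundary positions over all models, then smooth.

    models: one list of 0-based boundary positions per predicted 3D model.
    Returns one smoothed count per sequence position.
    """
    counts = [0] * seq_len
    for boundaries in models:
        for pos in boundaries:
            counts[pos] += 1
    # Uniform window sum (a simplification of the biased window protocol).
    return [
        sum(counts[max(0, i - half_window): i + half_window + 1])
        for i in range(seq_len)
    ]


# Toy example: three models agree on a boundary near position 10, one outlier at 30.
profile = boundary_profile([[10], [11], [9], [30]], seq_len=40)
print(profile[10], profile[30], profile[20])  # 3 1 0
```

Peaks in the smoothed profile are the positions where many independent models place a boundary, which is why small disagreements between models (9, 10, 11) reinforce rather than dilute each other.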

Predicting domain boundaries for a single average-size protein could take hours on a 128-node cluster computer, even with simplified significance testing. How to scale up to structural genomics? 30,000 human proteins at 1 hr each gives about 3.4 years.

What we still cannot do well “Give us sequence, we do rest” failed so far; e.g., number of human genes Gene prediction bad, RNA genes missed Protein structure/function prediction unsolved; we have no clue about function of 50% of human genes No theory of gene regulation Cannot predict post-translational modification well Many (database) solutions not generic We have no E=mc² so need to keep all data Integrating methods and data Understand biologically

Future Bioinformatics Research Topics Integration of knowledge –We have some formalisms (ontologies, distributed databases) but we need to develop many completely new formalisms and new technologies beyond what we have now

Conclusions Getting important integrative Bioinformatics/Systems Biology applications onto the Grid through Gridlab can be significant Bioinformatics and genomics are getting clinical. Gridlab could play an important role

The end. Thanks

Future Bioinformatics Research Topics Keywords morning session Integration of knowledge –Information transfer from one object to another –What are the rules –From genotype to phenotypes, current algorithms and ontologies not sufficient –Biological interpretation needs context –DB maintenance is dynamic process, most info is static –Need resources –Environment should allow student to make method in 3 hours

Genomics –Identifying genetic elements still bad –Collect easy primary biological facts –Gene pred, struct pred, functional all unsolved –Genetic “parts” list is incomplete and scanty –Many omics “unknowme”

Genomics –Hypothesis driven versus systematic approaches –Need databases, algorithms, biol knowledge –Data structures not suitable for complexity –Solutions such as Ensembl not generic –Need technologies beyond ontologies –Need new formalisms to be able to do “vertical genomics”

Systems Biology –Very promising area Health Pharmaceuticals Biotechnology Environment

(Medical) Systems Biology –Diego di Bernardo –Ilias Jakovidis –Very promising area

Summary How can Europe regain ground

Hans Werner Mewes DNA contains all Identifying genetic elements still bad Collect easy primary biological facts Gene pred, struct pred, functional all unsolved Genetic “parts” list is incomplete and scanty Many omics “unknowme”

Hypothesis driven versus systematic approaches Need databases, algorithms, biol knowledge Data structures not suitable for complexity Solutions such as Ensembl not generic Need technologies beyond ontologies

Information transfer from one object to another What are the rules From genotype to phenotypes, current algorithms not sufficient Biological interpretation needs context DB maintenance is dynamic process, most info is static Need resources Environment should allow student to make method in 3 hours

Diego di Bernardo TIGEM: disease genes Bioinformatics and comp biol not on a par: 81% of genome “genomics&databases” and 19% “genomics&algorithms” Important topics: regulation, network, digital signal processing, HMMs Problems: algorithms not biological and no experimental verification Bioinformatics helps design biological experiments Richard Durbin: value of physics and engineering

Computational tools for discovery of novel objects

Ilias Jakovidis Medical informatics Health telematics eHealth Medical ontologies didn’t help Paul Schofield at all (tried with NCBI-big mess) Middleware includes ontologies so covers biology (IBM!)

Language engineering Natural language in medicine, computerize medical community Biomedical informatics: applications in healthcare, how to get to clinical? Synergy between medicine and biology informatics Alfonso: med will dominate, lot of money with unclear methods

Medical info has worked coherently, how can we do that? How can we change? Mewes: Bioinf has achieved usage, not med. Bioinf is entering clinics.

Gunnar von Heijne Databases should be funded Start problem for 5 years: and then what? With infrastructure this problem is less, so funding is relatively OK. Technology development should not become dominant Most biologists are small-scale and hypothesis driven Marketing problem

From 19 bioinf methods, 15 are European in genomics Validation is not always key (Alfonso) EMBOSS project European wide, for algorithm driven research. EMBOSS is longstanding. But could not get funding from EC (no funding category)

Alfonso Valencia Often, 1 bioinformatician for everything Need of integration/collaboration –Social, technical barriers People should realise that Bioinf and Bioinformaticians are very different Integrated (med) system –Underfunded (1 postdoc) –Difficult to develop –Lack of standards and repositories –Difficult to interact with biologists –All these things essential

3-4 good bioinf groups in Spain Make virtual institute for bioinformatics There are few large groups with national funding There are few large groups with European funding There are many small groups with weak institutional funding

Create framework valid for biology Interaction reduces overhead System access for biologists, point to the right expert Create new science beyond current needs This does not compete with basic needs Support strong European areas (eg. protein interaction)

Bioinformatics is a new discipline Who solves the problem, who is interested in solving it, and not always who is qualified to solve it (engineers, ...) Example “information extraction in molecular biology”: after years no real progress made. Systems Biology: what to do and how (no linear path), but we have opportunity to develop knowing Experimental validation: methods debug databases. Many proteins (90%?) have never seen an experiment

Should bioinf talk to biology or vice versa?

Jean-Marie Claverie 1951 first protein sequence (insulin) Field has come of age, so outsiders shouldn’t tell us what to do and how Bioinformatics is part of the foundation Clear difference in application of informatics or bioinformatics Future will be different Give us sequence, we do rest failed! Number of human genes is example.

Gene finding: standard genes good, RNA genes missed, no theory of gene transcription We have no E=mc² so need to keep data Computational biology is same as systems biology Good integration: E. coli Bioinf-project, find all genes in small bacterium. Inclusive project. Now good consortium. Bioinformatics becomes invisible for biologists (Blast).

Howard Bilofsky PRISM forum Provide challenges for (bio)informatics Drives Bioinf,omics,.. techniques
