Multiple alignment: progress and perspectives in the evaluation and exploitation of algorithms and data. Marseille, 17 November 2005. Laboratory of.

Presentation transcript:

Multiple alignment: progress and perspectives in the evaluation and exploitation of algorithms and data
Marseille, 17 November 2005
Olivier Poch and Jean Claude Thierry
Laboratory of Integrative Bioinformatics and Genomics, Department of Structural Biology and Genomics, Dino Moras, Institut de Génétique et de Biologie Moléculaire et Cellulaire, Illkirch-Graffenstaden, Strasbourg, France
Collège de France

Laboratory of Integrative Bioinformatics and Genomics
BioInformatics and BioAnalysis/Genomics
- Algorithms for interconnected analysis of high throughput sequence data (automatic and high quality collection, validation, curation, analysis and maintenance)
- Specialized benchmark databases and software development
- Specialized developments scientifically driven by biological projects (genome annotation, transcriptomics data from cancers, retinal disease…)
Major themes: integration and analysis of high quality Sequence/Structure/Evolution information in Systems Biology informational systems (replication, transcription, translation)
Backup Lab. of BioInformatics platforms (RIO Génopole Alsace Lorraine)
Centre of open biocomputing resources
- user accounts, Web servers
- software and database maintenance & implementation (generalist and specialized)
- training and courses
Centre of competence and know-how
- hot line (bioinformatics analysis, program implementation and use…)
- solution design (high throughput, specialized projects)

MACS (Multiple Alignment of Complete Sequences): central role
[Figure: a MACS at the centre of the applications listed below; the example alignment is annotated with DBD, LBD, insertion domain, binding sites / mutations]
- Structure comparison, modelling
- Interaction networks
- Hierarchical function annotation: homologs, domains, motifs
- Phylogenetic studies
- Human genetics, SNPs
- Therapeutics, drug discovery and drug design
- Gene identification, validation
- RNA sequence, structure, function
- Comparative genomics

Biology: new context, « from data poor to data rich science »
Sequences:
- “Complete” genomes of viruses, prokaryotes, eukaryotes, organelles…
- “Specialized” sequencing: partial genomes, ESTs, …
Structures:
- Structural genomics: complement the protein fold universe (~1000?)
- “complete”: proteome
- “specialized”: all family members (kinase, helicase, nuclear receptor…)
GENERATED BY HIGH THROUGHPUT PROCESSES
[Figure: growth of PDB]

MACS: new landscape
High volume & heterogeneity of sequence data
- Length: from tens of amino acids or nucleotides to thousands or millions (genomes)
- Number: from tens up to thousands of sequences
- Variability: from small percent identity to almost identical
- Complexity of the sequences to be aligned
  - family with linear or highly irregular distribution of sequence variability
  - heterogeneity of length, structure or composition (large insertions or extensions, repeats, circular permutations, transmembrane regions…)
- Fidelity: 15-30% errors (sequence, eukaryotic gene prediction, annotation…)

MACS: new concepts
Distinct objectives imply distinct needs & strategies
- Overview of one sequence family, to quickly infer and integrate information from a limited number of closely related, well annotated sequences (reliable and efficient)
- Exhaustive analysis of one sequence family (very high quality) for
  - homology modeling
  - phylogenetic studies
  - subfamily-specific features (differentially conserved domains, regions or residues)
- Massive analysis of sets of sequences (reliable/high quality and efficient)
  - phylogenetic distribution, co-presence and co-absence and structural complex
  - genome annotation
  - target characterisation for functional genomics studies (transcriptomics…)

MACS: new questions
- Can one unique algorithm process all sequence alignment types?
- What is the pertinent information available within a sequence alignment?
- What are the strengths and weaknesses of the different algorithms?
- How can we evaluate the quality of highly heterogeneous alignments?
- How can we identify and exploit the pertinent information?

- Construction of a benchmark database to evaluate algorithms: BAliBASE (1999)
- Definition of an objective function to evaluate sequence alignments: NorMD (2001)
- Development of cooperative algorithms: PipeAlign (2003)
- Construction of an ontology to integrate and exploit the information: MAO (2005)

BAliBASE: objective evaluation of MACS programs
- High-quality alignments based on 3D structural superpositions and manually verified
- Alignments compared only in reliable ‘core blocks’, excluding non-superposable regions
- Separate reference sets specifically designed to address distinct alignment problems

Reference set   Description
1               small number of sequences: divergence, length
2               a family with one to 3 orphans
3               several sub-families
4               long N/C terminal extensions
5               long insertions
6               repeats
7               transmembrane regions
8               circular permutations

BAliBASE1: Thompson et al Bioinformatics
BAliBASE2: Bahr et al, 2001 Nucl Acids Res.

Example of BAliBASE Alignment
CORE BLOCKS are defined, which exclude non-superposable regions, e.g. borders of helices, loops
Up to 30%!
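Scoring a test alignment only inside the core blocks of a reference can be sketched as follows. This is a minimal illustration, not the BAliBASE scoring program: the function names, the representation of an alignment as a list of gapped strings, and the way core blocks are passed (a set of reference column indices) are all assumptions made for the example.

```python
def aligned_pairs(alignment, columns=None):
    """Collect residue pairs (seq_i, pos_i, seq_j, pos_j) that share a column.

    `alignment` is a list of equal-length gapped strings ('-' is a gap).
    If `columns` is given, only those column indices contribute pairs.
    """
    counters = [0] * len(alignment)   # ungapped position reached in each sequence
    pairs = set()
    for col in range(len(alignment[0])):
        present = []
        for i, seq in enumerate(alignment):
            if seq[col] != "-":
                present.append((i, counters[i]))
                counters[i] += 1
        if columns is None or col in columns:
            for a in range(len(present)):
                for b in range(a + 1, len(present)):
                    pairs.add(present[a] + present[b])
    return pairs


def core_sum_of_pairs(reference, test, core_columns):
    """Fraction of reference pairs inside the core columns that the test alignment reproduces."""
    ref_pairs = aligned_pairs(reference, core_columns)
    if not ref_pairs:
        return 0.0
    return len(ref_pairs & aligned_pairs(test)) / len(ref_pairs)


# Hypothetical toy data: columns 0, 1 and 3 of the reference are treated as core.
reference = ["ACDE", "AC-E"]
test = ["ACDE", "ACE-"]
score = core_sum_of_pairs(reference, test, {0, 1, 3})
```

Restricting the reference pairs to core columns means that differences in loops or helix borders, where the structural superposition is not reliable, do not penalise the program under test.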

Comparison of multiple alignment methods
[Figure: BAliBASE scores for multal, multalign, pileup, clustalx, prrp, saga, hmmt, MLpima, dialign and SBpima on each reference set, with some entries marked N/A; the numeric values are not preserved in this transcript]
- Reference 1: < 6 sequences (all, < 100 residues, > 400 residues)
- Reference 2: a family with an orphan
- Reference 3: several sub-families
- Reference 4: long N/C terminal extensions
- Reference 5: long insertions
Global algorithms work well when sequences are homologous over their full lengths; local algorithms are better for non-colinear sequences
Iterative algorithms can improve alignment quality, but are too slow for most applications
Thompson et al Nucl. Acids Res.

More recent developments: cooperative algorithms, local and global
- DbClustal (Thompson et al, 2000): integrates local motifs mined by a database search in a ClustalW global alignment
- T-COFFEE (Notredame et al, 2000): uses DP to compute ALL local and global alignments for each pair of sequences
- MAFFT (Katoh et al, 2002): detects locally conserved segments using a Fast Fourier Transform
- MUSCLE (Edgar, 2004): kmer distances and log-expectation scores, progressive and iterative refinement
- PROBCONS (Do et al, 2005): pairwise consistency based objective function
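The idea of a k-mer distance, which lets a program estimate how far apart two sequences are without aligning them, can be sketched as below. This is a simplified shared k-mer fraction and is not claimed to be the exact measure implemented in MUSCLE; the function names and the default k = 3 are assumptions for the example.

```python
from collections import Counter


def kmer_counts(seq, k=3):
    """Count every overlapping k-mer in an unaligned sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(a, b, k=3):
    """1 minus the fraction of k-mers shared by the two sequences.

    The shared count is normalised by the number of k-mers in the shorter
    sequence, so identical sequences get 0.0 and sequences with no common
    k-mer get 1.0.
    """
    denominator = min(len(a), len(b)) - k + 1
    if denominator <= 0:
        raise ValueError("both sequences must be at least k residues long")
    shared = sum((kmer_counts(a, k) & kmer_counts(b, k)).values())
    return 1.0 - shared / denominator


# Hypothetical toy peptides differing at one position.
d = kmer_distance("ACDEFG", "ACDEXG")
```

Because no dynamic programming is involved, such a distance can be computed for every pair in a large sequence set before any alignment is built, which is what makes it attractive for a first guide tree.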

BAliBASE
- More difficult test cases
- More divergent sequences (V3 excluded)
- More multi-domain proteins
- More protein folds: (SCOP coverage)
- More sequences (total proteins increased from 1444 to 6255)
- Full-length sequences for all test cases
- Semi-automatic update protocol
- Annotations in XML format
- Web site re-designed
Thompson et al Proteins

BAliBASE 3.0 Reference 3 (subfamilies): protein kinase
[Figure: domain organisation of the aligned sequences; domains shown: pkinase, focal_at, polo_box, pkinase_c, fha, sh2, sam, pb1, sh3; legend: conserved domain, core blocks]

BAliBASE 3.0
[Figure: annotated alignment; legend: ATP binding, active site, alpha helix, beta strand, core block]

Multiple Alignment Quality: truncated alignments
[Table: scores for ClustalW, Dialign, Mafft, Maffti, Muscle, Muscle_fast, Muscle_med, Tcoffee and Probcons; columns: Ref1 V1 (<20%), Ref1 V2 (20-40%), Ref2 (orphans), Ref3 (subgroups), Ref4 (extensions), Ref5 (insertions), Time (sec); the numeric values are not preserved in this transcript]
muscle_fast: muscle -maxiters 1 -diags1 -sv -distance1 kbit20_3
muscle_medium: muscle -maxiters 2
1. Significant improvement in accuracy/efficiency since 2000
2. Twilight zone still exists
3. Probcons scores best in all tests, but is MUCH slower than MAFFT or MUSCLE
4. MAFFTI scores slightly better than MUSCLE in all tests, and is more efficient

Multiple Alignment Quality: comparison of truncated (T) versus full-length (FL) sequences
[Table: scores for ClustalW, Dialign, Mafft, Maffti, Muscle, Muscle_fast, Muscle_med, Tcoffee and Probcons; columns: Ref1 V1 (<20%), Ref1 V2 (20-40%), Ref2: orphans, Ref3: subgroups, Time (sec) for all refs; the numeric values are not preserved in this transcript]
1. Loss of accuracy is more important in twilight zone (Ref1 V1, orphans, and subgroups)
2. Probcons still scores best in all tests
3. MAFFT still scores better than MUSCLE in all tests

Evaluation of alignment quality
Objective Function: estimation of the quality of a Multiple Alignment of Complete Sequences
- Detection of badly aligned or unrelated sequences
- Detection of badly aligned or non superposable regions
- Use of MACS in automatic high-throughput genome analysis projects

MD: Mean Distance
- Coordinates of each AA of a sequence in each column according to the substitution values found in the Gonnet matrix
- Calculate the distance D_ij for each column between each pair of sequences
- MD: exponential of the negative weighted mean distance (Q)
- Range of values is equal for all columns: from 0 to 1, with 1 for a completely conserved column
- Incorporation of sequence weights
[Figure: MD scoring; the AA of Seq i and the AA of Seq j in column I are plotted as points on the Ala, Cys, Tyr, Gly and other AA axes, separated by the distance D_ij]
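The column score described on this slide can be sketched as below. It is only an illustration of the principle (residues as points in substitution-score space, then the exponential of the negative weighted mean distance): the three-letter `TOY_MATRIX` is invented and is not the Gonnet matrix, and the `scale` parameter and the use of a product of sequence weights as the pair weight are assumptions.

```python
import math
from itertools import combinations

# Hypothetical substitution rows for a 3-letter alphabet (NOT Gonnet values).
# Each residue is represented by its row, i.e. its coordinates in score space.
TOY_MATRIX = {
    "A": (2.0, -1.0, 0.0),
    "G": (-1.0, 3.0, -2.0),
    "W": (0.0, -2.0, 5.0),
}


def column_md(column, weights=None, scale=1.0):
    """Return exp(-scale * weighted mean pairwise distance) for one alignment column.

    `column` is a string with one character per sequence ('-' is a gap).
    A fully conserved column gives 1.0; more diverse columns approach 0.
    """
    residues = [(i, r) for i, r in enumerate(column) if r != "-"]
    if len(residues) < 2:
        return 0.0
    if weights is None:
        weights = [1.0] * len(column)
    total = 0.0
    weight_sum = 0.0
    for (i, a), (j, b) in combinations(residues, 2):
        pair_weight = weights[i] * weights[j]
        total += pair_weight * math.dist(TOY_MATRIX[a], TOY_MATRIX[b])
        weight_sum += pair_weight
    return math.exp(-scale * total / weight_sum)


conserved = column_md("AAA")
mixed = column_md("AAG")
diverse = column_md("AGW")
```

With real data the coordinates would come from the 20 substitution values of each amino acid, and the weights from a sequence-weighting scheme that down-weights near-duplicate sequences.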

NorMD: Normalised Mean Distance
$$\mathrm{NorMD} = \frac{\mathrm{MD} - \mathrm{GAPCOST}}{\mathrm{MaxMD} \times \mathrm{LQRID}}$$
- GAPCOST: introduction of GOP (Gap Opening Penalty) and GEP (Gap Extension Penalty)
- MaxMD: maximum score if the set of studied sequences were all identical
- LQRID: representative of potential orphan sequences
[Figure: distribution of pairwise sequence hash scores (number of pairs versus pairwise score), with LQRID marked at 25%]
Thompson J.D. et al. (2001) J. Mol. Biol., 314 (4):

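Evaluation of Objective Functions using BAliBASE
[Figure: Sum of Pairs, Relative Entropy and Mean Distances scores compared across BAliBASE test categories: small number of sequences (short, medium, long), orphans, sub-families, extensions/insertions; the numeric values are not preserved in this transcript]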
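Two of the objective functions named on this slide can be written per column in a few lines. The sketch below uses a plain match/mismatch score for the sum of pairs and a caller-supplied background distribution for the relative entropy; both choices, and the function names, are assumptions for illustration and do not reproduce the exact settings used in the evaluation.

```python
import math
from collections import Counter
from itertools import combinations


def sum_of_pairs(column, match=1.0, mismatch=-1.0):
    """Sum a match/mismatch score over all residue pairs in one column (gaps ignored)."""
    residues = [r for r in column if r != "-"]
    return sum(match if a == b else mismatch for a, b in combinations(residues, 2))


def relative_entropy(column, background):
    """Kullback-Leibler divergence (in bits) of column frequencies from `background`.

    `background` maps each residue to its expected frequency; every residue
    seen in the column must have a non-zero background frequency.
    """
    residues = [r for r in column if r != "-"]
    n = len(residues)
    counts = Counter(residues)
    return sum((c / n) * math.log2((c / n) / background[r]) for r, c in counts.items())


# Hypothetical two-letter background, chosen so the result is easy to check by hand.
background = {"A": 0.25, "G": 0.75}
sp_conserved = sum_of_pairs("AAA")
sp_mixed = sum_of_pairs("AAG")
entropy_conserved = relative_entropy("AAAA", background)
```

An alignment-level score is then typically the sum or mean of the column values, which is where the two functions start to behave differently on orphans and sub-families.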

NorMD: Normalised Mean Distance
[Figure: NorMD scores across BAliBASE test categories: small number of sequences (short, medium, long), orphans, sub-families, extensions/insertions; the value 0.5 is marked on the plot]

Major observations and perspectives
- Above 30-35% identity: all algorithms perform reliably
- [range not preserved in this transcript]%: dependent on the algorithm and the sequence family
- More information is needed
  - coupling of local and global strategies
  - structural data (when available): e.g. 3D-COFFEE
- Iteration can improve quality (but can be time consuming)
To improve quality in the ‘twilight zone’: MORE PERTINENT INFORMATION IS NEEDED
- better understanding of the local information: more robust statistics
- better understanding of initial heterogeneity of the data, e.g. composition (MUSCLE), length, …
- integration of “non-sequence” information: fragments, phylogenetic position, fold plasticity
- analysis and post processing to eliminate alignment incongruities: RASCAL (horizontal, vertical clustering) and LEON incorporated in PipeAlign
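The identity thresholds in these observations refer to pairwise percent identity, which can be computed from two aligned sequences as sketched below. The convention used here (identical positions divided by positions where neither sequence has a gap) is one of several in use and is an assumption, as is the 30% default cut-off taken from the lower end of the 30-35% range on the slide.

```python
def percent_identity(a, b):
    """Percent identity of two aligned sequences, ignoring columns with a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    aligned = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not aligned:
        return 0.0
    matches = sum(x == y for x, y in aligned)
    return 100.0 * matches / len(aligned)


def identity_zone(pid, threshold=30.0):
    """Label a pair as 'reliable' or 'twilight' relative to an assumed threshold."""
    return "reliable" if pid >= threshold else "twilight"


# Hypothetical toy pair: 3 of 4 gap-free columns are identical.
pid = percent_identity("ACDE", "ACDF")
zone = identity_zone(pid)
```

Reporting the denominator convention alongside the value matters, because the same pair can fall on either side of the threshold depending on how gapped columns are counted.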

PipeAlign: high quality MACS production
[Figure: pipeline from query sequence to MACS]
- Query sequence
- BlastP search, E<10
- Ballast: LMS (local maximum segments), anchors
- DbClustal alignment
- RASCALED MACS (Multiple Alignment of Complete Sequences)
- Homologous regions
References on slide: Plewniak et al. (2000) Bioinformatics; Plewniak et al. (2003) Nucl. Acids Res.; Thompson et al. (2000) Nucl Acids Res.; Thompson et al. (2001) J Mol Biol.; Thompson et al. (2003) Bioinformatics; Thompson et al (2004) Nucl Acids Res.

PipeAlign: high quality MACS production
[Figure: pipeline from query sequence to MACS, followed by clustering and analysis]
- Query sequence
- BlastP search, E<10
- Ballast: LMS (local maximum segments), anchors
- DbClustal alignment
- RASCALED MACS (Multiple Alignment of Complete Sequences)
- Homologous regions
- Secator/DPC: automatic clustering algorithms
- OrdAli (Ordered Alignment): analysis of differentially conserved residues with automatic visualization on structure
  - strictly/mostly conserved (black, grey)
  - conserved between groups (red + yellow = orange)
  - conserved within group (red, yellow, blue)
References on slide: Plewniak et al. (2000) Bioinformatics; Thompson et al. (2000) Nucl Acids Res.; Thompson et al. (2001) J Mol Biol.; Wicker et al. (2001) Mol Biol Evol.; Wicker et al. (2002) Nucl Acids Res.; Thompson et al. (2003) Bioinformatics; Thompson et al (2004) Nucl Acids Res.

Developments: exploit the informational content of MACS
- Database searches
  - extended mining: text, structures, OMIM…
  - statistics: hyper local p-value, correlation… (Daedalus …)
- Higher quality alignment
  - post-processing approaches
  - clustering algorithms: sequence, evolution (Rascal, Leon …)
- Information cross-validation and analysis
  - clustering algorithms: hierarchized info.
  - correlation and combinatorial algorithms
  - mined info. analysis and propagation (MACSIM & MAO…)

MACSIM: integration of structural/functional information in the context of the multiple sequence alignment
- Integration of mined structural/functional information (Daedalus/SRS)
- Cross-validation analysis and propagation
- Graphical interface to access the information
[Figure: alignment annotated with the domains SH3, SH2, PI-PLC-X, PI-PLC-Y, PH, C2, CH, rhoGEF, DAG_PE-bind]

MACSIM: cross-validation and propagation
BAliBASE reference 3: aldehyde dehydrogenase-like
[Figure: alignment excerpt (GSVPTG, GSTKVG, GETRTG, GSTEVG, GSVSAG, GSRDVG, GSTNVF, GSTAVF) with rows labelled E and C, showing Uniprot annotation of NAD binding and active site]

Application: target characterisation for SPINE (Structural Proteomics IN Europe)
Detection of structural homologs, for a training set of 510 potential targets:

Method                      No. of targets with at least 1 PDB neighbour
BlastP (E<10^-4)            166 (33%)
BlastP (E<10^-7)            142 (28%)
PipeAlign (BlastP E<10)     196 (38%)
PipeAlign (PDB-Blast)       223 (44%)

Domain organisation:

                   No. of targets with at least 1 domain
                   Pfam A        Pfam A / B
Pfam database      288 (56%)     336 (67%)
propagated         414 (81%)     477 (94%)

[The slide table also has "Total no. of domains" columns (Pfam A, Pfam A / B); their values are not preserved in this transcript]

SPINE target identity cards (in collaboration with EBI)
Target: nuclear receptor coactivator 2 (NCOA-2)
[Figure: alignment of NCoA-2, NCoA-3, Clock, HIF-1α, Single-minded and BMAL with SWISSPROT and Pfam domain annotation; labels shown: ODD, PAC, PAS, HLH, DNA-binding, NTAD, CTAD, ID, Poly-Gln, LXXLL, receptor-interacting domain, interaction with CREBBP, acetyltransferase activity (AT); modification sites: acetylation (by CREBBP), S-nitrosylation, hydroxylation]

Data integration issues: Multiple Alignment Ontology (MAO)
Integration of data from different domains poses a number of problems:
Syntactic differences: file formats
- Sequence databases: GenBank, TrEMBL, Swissprot, PDB
- Indexing applications: Entrez, SRS
- Standard data exchange formats: XML
Semantic differences: terminology, naming conventions
- Sequence ‘names’: Genbank GI ≠ EMBL ID
- Function definitions, e.g. Glycyl-tRNA synthetase alpha chain, Glycyl-tRNA synthetase alpha subunit, Glycine-tRNA synthetase, Glycine--tRNA ligase, GlyQ
- Natural language parsing: text mining, statistical approaches
- Constrained vocabularies, ontologies explicitly define concepts

What is an ontology?
An ontology is a formal specification of a shared conceptualisation of a domain of interest (Gruber, 1993)
Ontologies provide:
- a community reference: knowledge is authored in a single language
- a standardised vocabulary that facilitates data integration
- concepts structured in a hierarchy that represents knowledge and allows computational reasoning
[Figure: part_of hierarchy with the terms cell, cell wall, nucleus, nuclear membrane, inner membrane, outer membrane, nucleoplasm]
Applications:
- data exchange formats for program communication, reuse of software
- integration of information from different databases (e.g. transcriptomics, proteomics, metabolomics, toxicology)
- more efficient database queries: exact terms for searching can be used, e.g. searching for ‘mitochondrial double stranded DNA binding proteins’, all and only those proteins will be found
- computational reasoning: automatic, large-scale analyses
- presentation of relevant information to the biologist

MAO: Multiple Alignment Ontology
Also available from OBO web site:
MAO consortium:
- RNA analysis (Steve HOLBROOK, Berkeley)
- MACS algorithm (Kazutake KATOH, Kyoto)
- Protein 3D analysis (Patrice KOEHL, Davis)
- Protein 3D structure (Dino MORAS, Strasbourg)
- 3D RNA structure (Eric WESTHOF, Strasbourg)
Thompson et al. (2005) Nucleic Acids Res.

MAO: scope and structure
Hierarchical organisation, characterisation
- Most of the features associated with multiple alignments are defined as MAO concepts, ranging from a single residue to sub-families of sequences and/or 3D structures.
- Concepts are organised in a DAG (directed acyclic graph).
- Links are provided to OBO ontologies and external databases.
[Figure: DAG of MAO concepts; concepts shown: multiple_sequence_alignment, sub_alignment, alignment_column, alignment_sequence, sequence_feature, sequence_feature_type, domain, motif, residue, amino_acid, nucleotide, column_conservation; relations shown: is_a, part_of, is_attribute]
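How a DAG of concepts supports simple computational reasoning can be sketched as follows. The concept names are taken from the slide, but the specific edges in `RELATIONS` are assumptions made for this example (the figure's edges are not recoverable from the transcript), and the triple representation is illustrative rather than the MAO file format.

```python
# (child, relation, parent) triples; the edges are assumed for illustration only.
RELATIONS = [
    ("sub_alignment", "part_of", "multiple_sequence_alignment"),
    ("alignment_column", "part_of", "multiple_sequence_alignment"),
    ("alignment_sequence", "part_of", "multiple_sequence_alignment"),
    ("residue", "part_of", "alignment_sequence"),
    ("amino_acid", "is_a", "residue"),
    ("nucleotide", "is_a", "residue"),
    ("domain", "is_a", "sequence_feature_type"),
    ("motif", "is_a", "sequence_feature_type"),
]


def ancestors(term, relations=RELATIONS, kinds=("is_a", "part_of")):
    """Return every concept reachable from `term` by following the given relation kinds."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for child, relation, parent in relations:
            if child == current and relation in kinds and parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


all_parents = ancestors("amino_acid")
is_a_parents = ancestors("amino_acid", kinds=("is_a",))
```

A query such as "everything that is part of a multiple_sequence_alignment" then becomes a graph traversal rather than a text search, which is the practical benefit of organising the concepts as a DAG.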

Sequence-structure relationships
Either link to existing PDB entry or enter 3D coordinates for atoms
[Figure: DAG of MAO concepts; concepts shown: multiple_sequence_alignment, sub_alignment, alignment_column, alignment_sequence, residue, amino_acid, nucleotide, atom, 3d_atomic_point, pdb_name, ndb_name; relations shown: is_a, part_of, is_attribute]

OBO interactions
[Figure: MAO concepts and their links to OBO ontologies and external resources; MAO concepts shown: alignment_column, alignment_sequence, residue, residue_function, structural_location, column_conservation, binding_site, ptm, mutation, structural_bond, enzyme_active_site, phenotype, feature, domain, accession; external resources shown: MI, GO, TAXID, EC, CSA_catalytic_site, Interpro; relations shown: is_a, part_of, is_attribute]

MAO Knowledge Base: knowledge base of annotated protein family alignments
IL-1 proteins (C-terminal mature form) are involved in the inflammatory response and immunity.
[Figure: alignment of IL1Fx, IL1A, IL1B and IL1RN showing the Interleukin-1 propeptide and Interleukin-1 domains, IL1A propeptide processing and comparison with IL1B; annotated features: NLS, RNA interaction, myristoylation, phosphorylation, cleavage sites, mutation R>Q « damaging », mutation A>S « benign » (** mutations from SeattleSNPs)]
- Differential effects of IL-1 on tumor development: IL1A reduces tumorigenicity; IL1B promotes invasiveness (Song et al, 2003)
- Within the nucleus, the IL1A propeptide may interact with elements of RNA processing, affecting alternate splicing of genes involved in the regulation of apoptosis (Pollock et al, 2003)

Perspectives for algorithm developments
Rationale for objective sequence/structure/evolution analysis
- role of sequence conservation in structure (fold, plasticity, oligomerisation, …)
- impact of sequence changes (evolutionary, “indel”, mutation…)
- spatial relation between conservation and physico-chemical properties…
Automatic correction and integration of high throughput data
- from sequence/structure/evolution data to systems biology
- DbW: automatic daily update of protein alignments (Prigent et al, 2005 Bioinformatics)
- vALId: validation of predicted protein sequences (Bianchetti et al, 2005 JBCB)
- GOAnno: GO annotation based on multiple alignment (Chalmel et al, 2005 Bioinformatics)
Promoter analysis: multiple alignment algorithms in test
- phylogenetic footprinting coupled to MACS, promoter site prediction and statistical estimation

Bioanalysis and biological projects
Implications in cancer and inherited disease: target characterisation and high throughput data analysis & integration (transcriptomics, interactomics…)
- Cancer targets in Structural Proteomics IN Europe (SPINE, European Integrated Project 2000)
- Head and Neck Squamous Cell Carcinomas (HNSCC, European I. P. 2000)
- Prostate cancer (ProCure BioPharm, European I. P. 2001)
- Prostate cancer (Prima, European I. P. 2003)
- Retinal disease (RetNet, European Research Training Network, 2003) WP5
- Functional Genomics of the Retina (EVI-GENORET, European I.P. 2005) WP14 & 16
- Annotation of full length cDNAs from the thermotolerant metazoan Alvinella pompejana (Alvinella consortium, Genoscope)
Implications in integrated and automated processes
- Construction of a « GRID version » of PipeAlign (IBM/AFM/CNRS)
- Transcriptomic and Bioinformatic platforms (CRP santé, Luxembourg)
- MS2PH project: from Structural Mutation to Phenotype of Human Pathology (Decrypthon program, IBM/AFM/CNRS)
- Integrative approach of start codon prediction (ACI Protéomique et génie des protéines)

Laboratory of Integrative Genomics and Bioinformatics IGBMC, Strasbourg

Integrated analyses
- Sequence validation: ~44% of predicted proteins from whole-genome shotgun sequencing projects, ~30% of high-throughput cDNA (HTC) may contain errors (Bianchetti et al. 2005)
- Structural characterisation: ~50% sequences in GenBank can be assigned to known structures (PSSH, Schafferhans et al, 2003)
- Functional characterisation: 20-30% of ORFs are ‘hypothetical proteins’ (Siew, 2004)
Multiple alignments of complete sequences (MACS) provide an ideal environment:
- cross-validation of experimental and predicted data
- propagation of information from known to unknown proteins
Integrated processes for automatic information collection, validation and analysis
High quality, automatic multiple alignments

Multiple alignment methods: a brief history
[Figure: methods placed on a local/global axis]
- Progressive: SBpima, multal, multalign, pileup, clustalx, MLpima
- Iterative: prrp, dialign (SEGMENT), saga (GA), hmmer (HMM)

Progressive multiple alignment methods
[Figure: SBpima, multal, multalign, pileup, clustalx and MLpima placed on a local/global axis and labelled by guide-tree method: SB, ML, UPGMA, NJ]
- SB: sequential branching
- UPGMA: Unweighted Pair Group Method with Arithmetic mean
- ML: maximum likelihood
- NJ: neighbor-joining
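Of the guide-tree methods in this legend, UPGMA is the simplest to write down. The sketch below is a minimal version that repeatedly merges the two closest clusters and averages distances in proportion to cluster size; the input format (a dict keyed by name pairs), the nested-tuple tree and the function name are assumptions for the example, and branch lengths are deliberately left out.

```python
from itertools import combinations


def upgma(names, distances):
    """Build a UPGMA tree as nested tuples from pairwise distances.

    `distances` maps (name_a, name_b) pairs to a distance; each unordered
    pair must appear once. Returns e.g. ('C', ('A', 'B')).
    """
    sizes = {name: 1 for name in names}                        # cluster -> leaf count
    d = {frozenset(pair): value for pair, value in distances.items()}
    while len(sizes) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        merged = (a, b)
        # Distance to the merged cluster is the size-weighted average.
        for other in sizes:
            if other not in (a, b):
                d[frozenset((merged, other))] = (
                    sizes[a] * d[frozenset((a, other))]
                    + sizes[b] * d[frozenset((b, other))]
                ) / (sizes[a] + sizes[b])
        sizes[merged] = sizes[a] + sizes[b]
        del sizes[a], sizes[b]
    return next(iter(sizes))


# Hypothetical distances: A and B are close, C is distant from both.
tree = upgma(["A", "B", "C"], {("A", "B"): 2.0, ("A", "C"): 6.0, ("B", "C"): 6.0})
```

A progressive aligner would then walk such a tree from the leaves upward, aligning the closest sequences first and adding more distant ones to the growing profile.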