Joint EBI-Wellcome Trust Summer School 14-18 June 2010.

Presentation transcript:


Concepts, historical milestones & the central place of bioinformatics in modern biology: a European perspective
Teresa K. Attwood, University of Manchester (8/12/2015)

Concepts, historical milestones & the central place of bioinformatics in modern biology: a personal perspective from a European


Overview
Where the concept of bioinformatics originated
Some key milestones & key people
Its place in ‘the new biology’

Disclaimer
Bear in mind that this is a personal view
& that it’s hard
  – to step out of a situation, look back in & remain objective
  – to separate the European & American histories
Observers from different perspectives will see & tell the story differently!
So this is just my perspective…
  – & it’s bound up with sequences & dbs

Origin of bioinformatics
The origins of bioinformatics are rooted in sequence analysis
& driven by the desire to
  – collect sequences
  – annotate them
  – & analyse them systematically (i.e., using computers)!
The concept ‘bioinformatics’ was barely known pre-1990…

Key milestones (timeline figure): insulin; ribonuclease; Dayhoff Atlas; ARPAnet

Margaret Dayhoff
Pioneered the development of computer methods to compare protein sequences
  – & to derive evolutionary histories from alignments
Particularly interested in deducing evolutionary connections from sequence evidence

Margaret Dayhoff
Collected all the known protein sequences
  – made them available to the scientific community
In 1965, she compiled a book
  – the 1st Atlas of Protein Sequence and Structure


Key milestones (timeline figure, extended): Dayhoff Atlas (65 sequences); auto protein sequencers; DNA sequencing; PDB (7 structures); auto DNA sequencing; Internet

Data overload in the USA


Data overload in Europe
The data overload problem had also been noticed in Europe
The solution was to create the 1st nucleotide sequence database
  – this was the EMBL databank
  – it preceded the 1st release of GenBank by ~6 months

Key milestones (timeline figure, extended): EMBL, GenBank (568 sequences); PIR-PSD (859 sequences)

Enter Amos Bairoch
A crazy postgrad student in Switzerland
  – interested in space exploration & the search for ET life
His project was to develop software to analyse protein & nucleotide sequences
  – PC/Gene

Amos Bairoch
He published his 1st paper in 1982
A letter to the Biochemical Journal (BJ) suggesting the use of checksums to “facilitate the detection of typographical & keyboard errors”
  – a true computer nerd!
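The checksum idea can be illustrated with a minimal sketch. The slide does not say which checksum scheme the letter proposed, so CRC-32 from Python's standard `zlib` module is used here purely as a stand-in; the function names and the example sequence are hypothetical and chosen only for illustration.

```python
import zlib


def sequence_checksum(sequence: str) -> int:
    """Return a CRC-32 checksum of a sequence, ignoring whitespace and case.

    CRC-32 is an assumption made for this sketch; it is not claimed to be
    the scheme proposed in the 1982 letter.
    """
    normalised = "".join(sequence.split()).upper()
    return zlib.crc32(normalised.encode("ascii"))


def matches_published(sequence: str, published_checksum: int) -> bool:
    """Check a retyped sequence against the checksum printed alongside it."""
    return sequence_checksum(sequence) == published_checksum


# Hypothetical example: a short sequence as printed, and a retyped copy
# in which the final residue has been mistyped.
original = "GIVEQCCTSICSLYQLENYCN"
retyped = "GIVEQCCTSICSLYQLENYCM"

published = sequence_checksum(original)
print(matches_published(original, published))  # True
print(matches_published(retyped, published))   # False
```

If a checksum is printed alongside each sequence, anyone retyping it from the page can recompute the value and spot a slip immediately, which matters when sequences are being keyed in by hand from a book.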

Amos Bairoch
Why did he do this?
In the process of developing PC/Gene, he typed in >1,000 protein sequences
  – some from the literature, most from the Atlas
  – by 1981, the Atlas was a large book with several supplements, & listed 1,660 proteins
  – it was not then available electronically

Amos Bairoch
In 1983, he acquired a computer tape of the EMBL databank
  – this was version 2, with 811 sequences
In 1984, he received the 1st available computer tape copy of the Atlas
  – (which quickly became the PIR-PSD)
  – but he was deeply unhappy with the PIR format

Amos Bairoch
So he decided to convert the PIR database into the semi-structured format of EMBL
  – part manually & part automatically
  – the result was PIR+
  – it was distributed as part of PC/Gene (now commercial)
In summer 1986, he decided to release the database independently of PC/Gene
  – so that it would be available to all, free of charge

Amos Bairoch
The new database was called Swiss-Prot
The 1st release was made on 21 July 1986
  – the exact number of entries is unknown, as he can’t find the original floppy disks!

Key milestones (timeline figure, extended): PIR (859 sequences); DDBJ; Swiss-Prot (~3,900 sequences); PROSITE (58 entries); PRINTS (30 entries)

Global data overload
The number of sequences was growing
The number of structures was growing
So was the number of protein family signatures
Two extraordinary developments had yet to take place
  – what were they?

Key milestones (timeline figure, extended): www; FlyBase

Key milestones (timeline figure, extended): HT DNA sequencing; genomes of H. influenzae, M. jannaschii, S. cerevisiae, C. elegans, D. melanogaster & H. sapiens; Pfam; InterPro (2,423 entries); TrEMBL (70,000 sequences)

Original InterPro partners (diagram): Pfam, Profiles, ProDom, PRINTS, PROSITE

What is InterPro?
“InterPro is an integrated documentation resource for protein families, domains & sites. By uniting databases that use different methodologies & a varying degree of biological information, InterPro capitalises on their individual strengths, producing a powerful integrated database & diagnostic tool.”

The vision?
Naïvely, we wanted to make life easier!
We aimed to
  – simplify & rationalise protein family analysis
  – centralise & streamline the annotation process & reduce manual annotation burdens
  – &, in the wake of all the genome projects, to facilitate automatic functional annotation of uncharacterised proteins
In fact (& now with 11 partners) we made life a lot harder! But that’s another story…


Key milestones (timeline figure, extended): UniProt

Key milestones (timeline figure, extended): ENA; sequence counts of 10,867,798, 185,231,366 & 517,100

Key milestones (timeline figure, extended): annotations reading ‘hundreds more’, ‘billions more’ & ‘hundreds more’

The central place of bioinformatics in modern biology
Hopefully, this potted history speaks for itself
In the last 30 years, bioinformatics has given us
  – the first ‘complete’ catalogues of DNA & protein sequences
      including genomes & proteomes of organisms across the Tree of Life
  – software to analyse biological data on an unprecedented scale
  – & hence tools to help understand more about
      evolutionary processes in general
      our place on the Tree of Life in particular
      &, ultimately, more about health & disease
It isn’t a panacea, but its contribution has been huge

Recommended reading
A.B. Richon. A short history of bioinformatics (
A. Bairoch (2000) Serendipity in bioinformatics, the tribulations of a Swiss bioinformatician through exciting times. Bioinformatics, 16(1),
M. Ashburner (2006) Won for all – How the Drosophila genome was sequenced. Cold Spring Harbor Laboratory Press.
B.J. Strasser (2008) GenBank – Natural history in the 21st century? Science, 322,