Joint EBI-Wellcome Trust Summer School June 2010
Concepts, historical milestones & the central place of bioinformatics in modern biology: a European perspective
Teresa K. Attwood, University of Manchester
8/12/2015

Concepts, historical milestones & the central place of bioinformatics in modern biology: a personal perspective from a European

Overview
Where the concept of bioinformatics originated
Some key milestones & key people
Its place in ‘the new biology’

Disclaimer
Bear in mind that this is a personal view
That it’s hard
  – to step out of a situation & look back in & remain objective
  – to separate the European & American histories
Observers from different perspectives will see & tell the story differently!
So this is just my perspective…
  – & it’s bound up with sequences & dbs

Origin of bioinformatics
The origins of bioinformatics are rooted in sequence analysis
And driven by the desire to
  – collect sequences
  – annotate them
  – & analyse them systematically (i.e., using computers)!
The concept ‘bioinformatics’ was barely known pre-1990…

Key milestones (timeline figure): insulin; ribonuclease; Dayhoff Atlas; ARPAnet

Margaret Dayhoff
Pioneered development of computer methods to compare protein sequences
  – & to derive evolutionary histories from alignments
Particularly interested in deducing evolutionary connections from sequence evidence

Margaret Dayhoff
Collected all the known protein sequences
  – made them available to the scientific community
In 1965, she compiled a book
  – the 1st Atlas of Protein Sequence and Structure

Key milestones (timeline figure): insulin; ribonuclease; Dayhoff Atlas (65 sequences); ARPAnet; automated protein sequencers; DNA sequencing; PDB (7 structures); automated DNA sequencing; Internet

Data overload in the USA

Data overload in Europe
The data overload problem had also been noticed in Europe
The solution was to create the 1st nucleotide sequence database
  – this was the EMBL databank, which preceded the 1st release of GenBank by ~6 months

Key milestones (timeline figure): insulin; ribonuclease; Dayhoff Atlas (65 sequences); ARPAnet; automated protein sequencers; DNA sequencing; PDB (7 structures); automated DNA sequencing; EMBL, GenBank (568 sequences); PIR-PSD (859 sequences); Internet

Enter Amos Bairoch
A crazy postgrad student in Switzerland
  – interested in space exploration & the search for ET life
His project was to develop software to analyse protein & nucleotide sequences
  – PC/Gene

Amos Bairoch
He published his 1st paper in 1982
A letter to the BJ (Biochemical Journal) suggesting the use of checksums to “facilitate the detection of typographical & keyboard errors”
  – a true computer nerd!

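The checksum idea can be sketched in a few lines. This is only an illustration, not the scheme proposed in the 1982 letter (the slides do not describe that algorithm): it assumes CRC-32 from Python's standard `zlib` module as a stand-in, and the function names and the sample sequence are hypothetical. A checksum published alongside a sequence lets anyone who re-types that sequence confirm that nothing was mistyped.

```python
import zlib


def sequence_checksum(sequence: str) -> str:
    """Return an 8-digit hexadecimal CRC-32 of a sequence (stand-in algorithm)."""
    # Drop whitespace and normalise case so that layout differences
    # (line breaks, spacing, lower case) do not change the checksum.
    normalised = "".join(sequence.split()).upper()
    return format(zlib.crc32(normalised.encode("ascii")), "08X")


def matches_published(sequence: str, published_checksum: str) -> bool:
    """Check a re-typed sequence against the checksum printed with it."""
    return sequence_checksum(sequence) == published_checksum.upper()


# Illustrative sample only: a short peptide and a copy with one mistyped residue.
published = "GIVEQCCTSICSLYQLENYCN"
mistyped = "GIVEQCCTSICSLYQLENYCM"

checksum = sequence_checksum(published)
print(matches_published(published, checksum))  # the faithful copy passes
print(matches_published(mistyped, checksum))   # the single-residue typo is caught
```

CRC-32 is used here because it detects every single-character substitution and every swap of two adjacent characters, which are the typical keyboard errors when sequences are typed in by hand.
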
Amos Bairoch
Why did he do this?
In the process of developing PC/Gene, he typed in >1,000 protein sequences
  – some from the literature, most from the Atlas
  – by 1981, the Atlas was a large book with several supplements, & listed 1,660 proteins
  – it was not then available electronically

Amos Bairoch
In 1983, he acquired a computer tape of the EMBL databank
  – this was version 2, with 811 sequences
In 1984, he received the 1st available computer tape copy of the Atlas
  – (which quickly became the PIR-PSD)
  – but he was deeply unhappy with the PIR format

Amos Bairoch
So he decided to convert the PIR database into the semi-structured format of EMBL
  – part manually & part automatically
  – the result was PIR+
  – it was distributed as part of PC/Gene (now commercial)
In summer 1986, he decided to release the database independently of PC/Gene
  – so that it would be available to all, free of charge

Amos Bairoch
The new database was called Swiss-Prot
The 1st release was made on 21 July 1986
  – the exact number of entries is unknown, as he can’t find the original floppy disks!

Key milestones (timeline figure): insulin; ribonuclease; Dayhoff Atlas (65 sequences); ARPAnet; automated protein sequencers; DNA sequencing; PDB (7 structures); automated DNA sequencing; EMBL, GenBank (568 sequences); PIR (859 sequences); DDBJ; Swiss-Prot (~3,900 sequences); PROSITE (58 entries); PRINTS (30 entries); Internet

Global data overload
The number of sequences was growing
The number of structures was growing
So was the number of protein family signatures
Two extraordinary developments had yet to take place
  – what were they?

Key milestones (timeline figure): insulin; ribonuclease; Dayhoff Atlas (65 sequences); ARPAnet; automated protein sequencers; DNA sequencing; PDB (7 structures); automated DNA sequencing; EMBL, GenBank (568 sequences); PIR (859 sequences); DDBJ; Swiss-Prot (~3,900 sequences); PROSITE (58 entries); PRINTS (30 entries); Internet; www; FlyBase

Key milestones (timeline figure): insulin; ribonuclease; Dayhoff Atlas (65 sequences); ARPAnet; automated protein sequencers; DNA sequencing; PDB (7 structures); automated DNA sequencing; EMBL, GenBank (568 sequences); PIR (859 sequences); DDBJ; Swiss-Prot (~3,900 sequences); PROSITE (58 entries); PRINTS (30 entries); Internet; high-throughput (HT) DNA sequencing; www; H. influenzae genome; M. jannaschii genome; S. cerevisiae genome; D. melanogaster genome; H. sapiens genome; C. elegans genome; FlyBase; Pfam; InterPro (2,423 entries); TrEMBL (70,000 sequences)

Original InterPro partners (figure): Pfam; Profiles; ProDom; PRINTS; PROSITE

What is InterPro?
“InterPro is an integrated documentation resource for protein families, domains & sites. By uniting databases that use different methodologies & a varying degree of biological information, InterPro capitalises on their individual strengths, producing a powerful integrated database & diagnostic tool.”

The vision?
Naïvely, we wanted to make life easier!
We aimed to
  – simplify & rationalise protein family analysis
  – centralise & streamline the annotation process & reduce manual annotation burdens
  – &, in the wake of all the genome projects, to facilitate automatic functional annotation of uncharacterised proteins
In fact (& now with 11 partners) we made life a lot harder!
But that’s another story…

Key milestones (timeline figure): insulin; ribonuclease; Dayhoff Atlas (65 sequences); ARPAnet; automated protein sequencers; DNA sequencing; PDB (7 structures); automated DNA sequencing; EMBL, GenBank (568 sequences); PIR (859 sequences); DDBJ; Swiss-Prot (~3,900 sequences); PROSITE (58 entries); PRINTS (30 entries); Internet; high-throughput (HT) DNA sequencing; www; H. influenzae genome; M. jannaschii genome; S. cerevisiae genome; D. melanogaster genome; H. sapiens genome; C. elegans genome; FlyBase; Pfam; InterPro (2,423 entries); TrEMBL (70,000 sequences); UniProt

Key milestones (timeline figure): insulin; ribonuclease; Dayhoff Atlas (65 sequences); ARPAnet; automated protein sequencers; DNA sequencing; PDB (7 structures); automated DNA sequencing; EMBL, GenBank (568 sequences); PIR (859 sequences); DDBJ; Swiss-Prot (~3,900 sequences); PROSITE (58 entries); PRINTS (30 entries); Internet; high-throughput (HT) DNA sequencing; www; H. influenzae genome; M. jannaschii genome; S. cerevisiae genome; D. melanogaster genome; H. sapiens genome; C. elegans genome; FlyBase; Pfam; InterPro (2,423 entries); TrEMBL (70,000 sequences); UniProt; ENA; later counts shown on the figure (labels not recoverable): 10,867,798 sequences; 185,231,366 sequences; 517,100 sequences

Key milestones (timeline figure): insulin; ribonuclease; Dayhoff Atlas (65 sequences); ARPAnet; automated protein sequencers; DNA sequencing; PDB (7 structures); automated DNA sequencing; EMBL, GenBank (568 sequences); PIR (859 sequences); DDBJ; Swiss-Prot (~3,900 sequences); PROSITE (58 entries); PRINTS (30 entries); Internet; high-throughput (HT) DNA sequencing; www; H. influenzae genome; M. jannaschii genome; S. cerevisiae genome; D. melanogaster genome; H. sapiens genome; C. elegans genome; FlyBase; Pfam; InterPro (2,423 entries); TrEMBL (70,000 sequences); UniProt; ENA; later counts shown on the figure (labels not recoverable): 10,867,798 sequences; 185,231,366 sequences; 517,100 sequences; annotations: hundreds more; billions more; hundreds more

The central place of bioinformatics in modern biology
Hopefully, this potted history speaks for itself
In the last 30 years, bioinformatics has given us
  – the first ‘complete’ catalogues of DNA & protein sequences, including genomes & proteomes of organisms across the Tree of Life
  – software to analyse biological data on an unprecedented scale
  – & hence tools to help understand more about evolutionary processes in general, our place on the Tree of Life in particular &, ultimately, more about health & disease
It isn’t a panacea, but its contribution has been huge

Recommended reading
A.B. Richon. A short history of bioinformatics (
A. Bairoch (2000) Serendipity in bioinformatics, the tribulations of a Swiss bioinformatician through exciting times. Bioinformatics, 16(1),
M. Ashburner (2006) Won for all – How the Drosophila genome was sequenced. Cold Spring Harbor Laboratory Press.
B.J. Strasser (2008) GenBank – Natural history in the 21st century? Science, 322,