1
Doing it again: Workflows and Ontologies Supporting Science. Phillip Lord, Frank Gibson. Newcastle University.
2
Outline: Describe the background problem. Introduce distributed services, workflows, eScience and (a bit of) ontologies. CARMEN. Provenance. Can we repeat an experiment?
3
Data-intensive bioinformatics
ID   MURA_BACSU STANDARD; PRT; 429 AA.
DE   PROBABLE UDP-N-ACETYLGLUCOSAMINE 1-CARBOXYVINYLTRANSFERASE
DE   (EC 2.5.1.7) (ENOYLPYRUVATE TRANSFERASE) (UDP-N-ACETYLGLUCOSAMINE
DE   ENOLPYRUVYL TRANSFERASE) (EPT).
GN   MURA OR MURZ.
OS   BACILLUS SUBTILIS.
OC   BACTERIA; FIRMICUTES; BACILLUS/CLOSTRIDIUM GROUP; BACILLACEAE;
OC   BACILLUS.
KW   PEPTIDOGLYCAN SYNTHESIS; CELL WALL; TRANSFERASE.
FT   ACT_SITE 116 116 BINDS PEP (BY SIMILARITY).
FT   CONFLICT 374 374 S -> A (IN REF. 3).
SQ   SEQUENCE 429 AA; 46016 MW; 02018C5C CRC32;
     MEKLNIAGGD SLNGTVHISG AKNSAVALIP ATILANSEVT IEGLPEISDI ETLRDLLKEI
     GGNVHFENGE MVVDPTSMIS MPLPNGKVKK LRASYYLMGA MLGRFKQAVI GLPGGCHLGP
     RPIDQHIKGF EALGAEVTNE QGAIYLRAER LRGARIYLDV VSVGATINIM LAAVLAEGKT
     IIENAAKEPE IIDVATLLTS MGAKIKGAGT NVIRIDGVKE LHGCKHTIIP DRIEAGTFMI
4
Around the world in 80 days: Biology is still largely a cottage industry, on a global stage.
5
Websites everywhere
12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt
12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt
12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct
12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt
12421 taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt
12481 aggagctatt tatatattct ggatacaagt tctttatcag atacacagtt tgtgactatt
12541 ttcttataag tctgtggttt ttatattaat gtttttattg atgactgttt tttacaattg
12601 tggttaagta tacatgacat aaaacggatt atcttaacca ttttaaaatg taaaattcga
12661 tggcattaag tacatccaca atattgtgca actatcacca ctatcatact ccaaaagggc
12721 atccaatacc cattaagctg tcactcccca atctcccatt ttcccacccc tgacaatcaa
12781 taacccattt tctgtctcta tggatttgcc tgttctggat attcatatta atagaatcaa
6
WBS Workflows (workflow diagram; only its labels survive)
Inputs and sequence retrieval: START; GenBank Accession No; GenBank Entry; Seqret; Nucleotide seq (Fasta).
Nucleotide services: GenScan; prettyseq; restrict; cpgreport; RepeatMasker; ncbiBlastWrapper; sixpack; transeq.
Nucleotide results: Coding sequence; ORFs; 6 ORFs; Restriction enzyme map; CpG Island locations and %; Repetitive elements; Translation/sequence file (good for records and publications); Blastn vs nr, est databases; Amino Acid translation.
Protein services: epestfind; pepcoil; pepstats; pscan; SignalP; TargetP; PSORTII; InterPro; PFAM; Prosite; Smart; Pepwindow?; Octanol?
Protein results: Identifies PEST seq; Identifies FingerPRINTS; MW, length, charge, pI, etc; Predicts Coiled-coil regions; Hydrophobic regions; Predicts cellular location; Identifies functional and structural domains/motifs.
BLAST and filtering: ncbiBlastWrapper; URL inc GB identifier; tblastn vs nr, est, est_mouse, est_human databases; Blastp vs nr; RepeatMasker; Query nucleotide sequence; Sort for appropriate Sequences only.
Legend: Pink: Outputs/inputs of a service. Purple: Tailor-made services. Green: Emboss soaplab services. Yellow: Manchester soaplab services. Grey: Unknowns.
7
myGrid is an EPSRC funded UK eScience Program Pilot Project. Particular thanks to the other members of the Taverna project, http://taverna.sf.net
8
Web Services: Web services support machine-to-machine interaction over a network. Note: NOT the same as services on the web. A web service is: –a technology and standard for exposing code or databases through an API that a third party can consume remotely; –a description of how to interact with that API. Web services are: self-contained, self-describing, modular, platform independent.
9
Workflows: A workflow language specifies how bioinformatics processes fit together. The high-level workflow diagram is separated from any lower-level coding; you don’t have to be a coder to build workflows. A workflow is a kind of script or protocol that you configure when you run it. It is easier to explain, share, relocate, reuse and repurpose. It is the METHODS section of a scientific publication.
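The idea of a workflow as a declared sequence of steps, kept separate from the code inside each step, can be sketched in a few lines of Python. This is an illustrative toy, not Taverna's workflow language: the step names, the `run_workflow` helper and the `WORKFLOW` list are all hypothetical, and real workflows call remote services rather than local functions.

```python
from typing import Callable, List, Tuple

# A step is a (name, function) pair. All names here are hypothetical.
Step = Tuple[str, Callable[[str], str]]


def strip_whitespace(seq: str) -> str:
    """Remove spaces and line breaks pasted in with a sequence."""
    return "".join(seq.split())


def to_upper(seq: str) -> str:
    """Normalise the sequence to upper case."""
    return seq.upper()


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    pairs = {"A": "T", "T": "A", "C": "G", "G": "C", "N": "N"}
    return "".join(pairs[base] for base in reversed(seq))


def run_workflow(steps: List[Step], data: str) -> Tuple[str, List[str]]:
    """Feed the output of each step into the next and record step names."""
    trace = []
    for name, func in steps:
        data = func(data)
        trace.append(name)
    return data, trace


# The workflow itself is plain data: it can be shared, reordered or reused
# without touching the code of the individual steps.
WORKFLOW: List[Step] = [
    ("strip_whitespace", strip_whitespace),
    ("to_upper", to_upper),
    ("reverse_complement", reverse_complement),
]

result, trace = run_workflow(WORKFLOW, "acat ttct")
print(result, trace)
```

The list of step names returned alongside the result reads much like a methods section: it says what was done, in order, without the implementation detail.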
10
The Taverna Workbench http://taverna.sourceforge.net http://www.mygrid.org.uk
11
Workflows Automating away cutting and pasting. Helps to deal with distribution of data. myGrid and Taverna built on the open nature of bioinformatics. Can we adapt the same approach to another discipline?
12
CARMEN: Code, Analysis, Repository and Modelling for e-Neuroscience. www.carmen.org.uk. Engineering and Physical Sciences Research Council.
13
Consortium & Profile: Stirling, St. Andrews, Newcastle, York, Sheffield, Cambridge, Imperial, Plymouth, Warwick, Leicester, Manchester. $10M over 4 years. 20 Investigators. Commenced 1st October 2006.
14
Industry & Associates
15
Virtual Laboratory for Neurophysiology Enabling sharing and collaborative exploitation of data, analysis code and expertise that are not physically collocated
16
Potential Barriers. Technical: –Multiple proprietary formats –No standardised metadata –Volume of data to be analysed. Cultural: –Multiple communities acting independently –Concerns about implications of sharing
17
Comparing to bioinformatics Cottage industry Global distribution Need to share But….
18
Age and Impact.
19
No sequences! DNA and Protein sequence form a core datatype for bioinformatics It’s simple to structure and to store, and it is of high-value Initially, there wasn’t much of it, and textual metadata was fine. Many people built tools over it, for transforming and manipulating.
20
The need for clear metadata: Most neuroscience data is relatively simple in structure, but often contextually complex, and sometimes associated with behavioural features.
21
Neuroscience spike data: The raw data is just a waveform. But what is the experiment for? What stimulus is the organism/tissue receiving? Even, which channel is which? The data sets being produced are (reasonably) large (10s of Gb, or 1Tb in three months).
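The questions on this slide can be turned into a minimal metadata check. The sketch below is an assumption about what such a record might hold, not CARMEN's actual schema: `RecordingMetadata`, its fields and `missing_context` are hypothetical names chosen to mirror the three questions (purpose, stimulus, channel identity).

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RecordingMetadata:
    """Hypothetical context record for one waveform recording."""
    purpose: str = ""                      # what is the experiment for?
    stimulus: str = ""                     # what is the organism/tissue receiving?
    channel_map: Dict[int, str] = field(default_factory=dict)  # which channel is which?


def missing_context(meta: RecordingMetadata, n_channels: int) -> List[str]:
    """List the pieces of context a reader would need but cannot find."""
    problems = []
    if not meta.purpose.strip():
        problems.append("purpose")
    if not meta.stimulus.strip():
        problems.append("stimulus")
    unlabelled = [ch for ch in range(n_channels) if ch not in meta.channel_map]
    if unlabelled:
        problems.append(f"channels {unlabelled}")
    return problems


# Example: a two-channel recording with a purpose but no stimulus
# and only one labelled channel.
meta = RecordingMetadata(purpose="spike sorting test", channel_map={0: "electrode A"})
print(missing_context(meta, n_channels=2))
```

A waveform file that passes a check like this is still not fully described, but one that fails it cannot be interpreted by anyone other than the person who recorded it.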
22
Data Sharing in bioinformatics: Data sharing was an early tradition in biology. Gene patenting, NDAs and the like came as quite a surprise. Many political battles were fought, culminating in the Clinton/Blair statement.
23
Data Sharing in Neurosciences: The data is easy to structure, but the metadata is not. There is, therefore, less point to sharing data. Many neuroscientists come from a medical background, which tends to be a more hierarchical, secretive profession, with everyone worried about getting sued. A lot of neuroscientists use invasive, live-animal experiments, so security is more than a passing concern.
24
A Following Wind: The achievements and processes of bioinformatics are familiar to neuroscience, so it seems to be easier to argue for the value of standardisation. But there is less of a do-it-yourself attitude: “But you can’t just make up a standard.” “We’re just trying to build a list of terms, which we all understand. Then the experts can turn it into an ontology.”
25
The difference in neuroscience: Less data sharing tradition. No rich ecosystem of tools. Higher barrier to entry for metadata. Larger datasets.
26
Virtual Laboratory Node: Search for Data & Analysis Code; Raw Signal Data Search & Visualisation; Deployment of Data & Analysis Code in Processes; Raw & Derived Data File Store; Security Policies Controlling Access to Data & Code; Structured Metadata Store Enabling Search & Annotation; Analysis & Model Code Store.
27
CARMEN: Development Timeline
Metadata (April 2008): Structured Metadata allowing data and analysis code to be described and searched.
Data and Scripting Support (April 2008): Support for extended range of data formats and scripting languages.
Security (April 2008): Security allowing access to data and analysis code to be controlled.
Provenance (July 2008): Provenance of analysis and modelling processes leading to scientific results.
CARMEN v1.0 (October 2008): Release of CARMEN v 1.0. Virtual laboratory nodes open to the CARMEN consortium.
CARMEN v2.0 (October 2009): Release of CARMEN v 2.0. Virtual laboratory nodes “networked”.
28
Virtual Laboratory Infrastructure Networked Nodes at Newcastle and York. More planned …
29
Vision – Global Laboratory
30
Some Unexpected Advantages: A big problem with bioinformatics services is that over time they tend to disappear. CARMEN keeps services and data together. This means we should be able to rerun analyses later. We should be able to store provenance.
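Keeping services and data together makes it possible to record what was run on what, and to check later whether a rerun gives the same answer. The following is a minimal sketch of that idea under stated assumptions: `ProvenanceRecord`, `run_with_provenance` and `reproduces` are hypothetical names, the "service" is just a local Python function, and nothing here reflects how CARMEN actually stores provenance.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Tuple


def digest(data: bytes) -> str:
    """Content hash used to identify a piece of data without storing it twice."""
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical record of one analysis run."""
    service: str
    service_version: str
    input_digest: str
    output_digest: str


def run_with_provenance(
    service: str,
    version: str,
    func: Callable[[bytes], bytes],
    data: bytes,
) -> Tuple[bytes, ProvenanceRecord]:
    """Run a service on some data and record what went in and what came out."""
    output = func(data)
    record = ProvenanceRecord(service, version, digest(data), digest(output))
    return output, record


def reproduces(record: ProvenanceRecord, func: Callable[[bytes], bytes], data: bytes) -> bool:
    """Check that rerunning func on data matches the recorded input and output."""
    return (
        digest(data) == record.input_digest
        and digest(func(data)) == record.output_digest
    )


# Example: the "service" simply upper-cases its input.
output, record = run_with_provenance("upper", "1.0", bytes.upper, b"spike")
print(reproduces(record, bytes.upper, b"spike"))   # same service, same data
print(reproduces(record, bytes.lower, b"spike"))   # the service has changed
```

If the service has disappeared, `func` is simply unavailable and no check of this kind can be made, which is the problem the slide describes.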
31
What is Provenance?
32
CARMEN’s perspective: We wish to store data, store its provenance, store its usage. We need release policies, we need retention policies, we need to understand ownership.
33
What does it mean to rerun an experiment? Replicability: one scientist should be able to repeat another’s experiment, under equivalent conditions, at a different time. Rerunability: a scientist should be able to apply an equivalent technique under new circumstances. The addition of services into this mix complicates the issue. (Diagram labels: New Data; Old Data; Replicability; Rerunability.)
34
(Diagram.) Axes: New Data, Old Data; Old Services, New Services. Labels: Replicability; Rerunability. Questions: Is the specification of what happened actually right? Has the state of the world advanced since previously? Has the world changed, in a comparable way? Has the service changed in a comparable way? Roles: Error-Prone Neuroscientist; Eager Neuroscientist; Neuroscientist comparing to existing work; Tool Builder.
35
There is a difficulty: There is less tradition of data sharing. The tendency to want to control data is much larger. If we want to data mine, we have to cope with “data is mine”. If we have many different repositories, this needs to be supported computationally.
36
An Example: Licensing Computationally amenable licenses are available Take, for example, Creative Commons
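What "computationally amenable" could mean in practice is that a program can read a licence code and answer simple questions about permitted use. The sketch below is a deliberately simplified illustration, assuming short codes of the form `CC-BY-NC` built from the standard Creative Commons elements (BY: attribution, NC: non-commercial, ND: no derivatives, SA: share-alike). The functions `parse_cc` and `permits` are hypothetical, and the check is far cruder than the licence texts themselves.

```python
from typing import FrozenSet

# The four standard Creative Commons licence elements.
KNOWN_ELEMENTS = {"BY", "NC", "ND", "SA"}


def parse_cc(code: str) -> FrozenSet[str]:
    """Split a short code such as 'CC-BY-NC' into its licence elements."""
    parts = code.upper().split("-")
    if parts[0] != "CC" or not set(parts[1:]) <= KNOWN_ELEMENTS:
        raise ValueError(f"not a recognised CC short code: {code!r}")
    return frozenset(parts[1:])


def permits(code: str, *, commercial: bool, derivative: bool) -> bool:
    """Simplified check: does the licence allow this kind of reuse?"""
    elements = parse_cc(code)
    if commercial and "NC" in elements:
        return False   # non-commercial licences rule out commercial reuse
    if derivative and "ND" in elements:
        return False   # no-derivatives licences rule out adapted works
    return True


# Example: filter a set of hypothetical datasets by intended use.
datasets = {"recording-1": "CC-BY", "recording-2": "CC-BY-NC", "recording-3": "CC-BY-ND"}
usable = sorted(
    name for name, code in datasets.items()
    if permits(code, commercial=False, derivative=True)
)
print(usable)
```

With many repositories, a filter of this kind is what lets a data-mining job respect each owner's terms without a person reading every licence.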
38
Conclusions Automated workflows have been applied very successfully in bioinformatics. But applying these directly to neuroinformatics is a different issue. Technology has to fit the domain. We are investigating metadata for describing neuroinformatics
39
myGrid acknowledgements: Carole Goble, Norman Paton, Robert Stevens, Anil Wipat, David De Roure, Steve Pettifer. OMII-UK: Tom Oinn, Katy Wolstencroft, Daniele Turi, June Finch, Stuart Owen, David Withers, Stian Soiland, Franck Tanoh, Matthew Gamble. Research: Martin Szomszor, Duncan Hull, Jun Zhao, Pinar Alper, Antoon Goderis, Alastair Hampshire, Qiuwei Yu, Wang Kaixuan. Current contributors: Matthew Pocock, James Marsh, Khalid Belhajjame, PsyGrid project, Bergen people, EMBRACE people. User Advocates and their bosses: Simon Pearce, Claire Jennings, Hannah Tipney, May Tassabehji, Andy Brass, Paul Fisher, Peter Li, Simon Hubbard, Tracy Craddock, Doug Kell. Past Contributors: Matthew Addis, Nedim Alpdemir, Tim Carver, Rich Cawley, Neil Davis, Alvaro Fernandes, Justin Ferris, Robert Gaizaukaus, Kevin Glover, Chris Greenhalgh, Mark Greenwood, Yikun Guo, Ananth Krishna, Phillip Lord, Darren Marvin, Simon Miles, Luc Moreau, Arijit Mukherjee, Juri Papay, Savas Parastatidis, Milena Radenkovic, Stefan Rennick-Egglestone, Peter Rice, Martin Senger, Nick Sharman, Victor Tan, Paul Watson, and Chris Wroe. Industrial: Dennis Quan, Sean Martin, Michael Niemi (IBM), Chimatica. Funding: EPSRC, Wellcome Trust.
40
Acknowledgements: Professor Colin Ingram, Professor Jim Austin, Professor Leslie Smith, Professor Paul Watson, Dr. Stuart Baker, Professor Roman Borisyuk, Dr. Stephen Eglen, Professor Jianfeng Feng, Dr. Kevin Gurney, Dr. Tom Jackson, Dr. Marcus Kaiser, Dr. Phillip Lord, Dr. Paul Overton, Dr. Stefano Panzeri, Dr. Rodrigio Quian Quiroga, Dr. Simon Schultz, Dr. Evelyne Sernagor, Dr. V. Anne Smith, Dr. Tom Smulders, Professor Miles Whittington, Christoph Echtermeyer, Martyn Fletcher, Frank Gibson, Mark Jessop, Dr. Bojian Liang, Juan Martinez-Gomez, Dr. Chris Mountford, Agah Ogungboye, Georgios Pitsilis, Dr. Daniel Swan. University of St Andrews. The University Of Sheffield.