GGF Summer School 24th July 2004, Italy Middleware for in silico Biology Professor Carole Goble University of Manchester

Vision: Collaboratory
A collaboratory is "…a center without walls, in which the nation's researchers can perform their research without regard to geographical location, interacting with colleagues, accessing instrumentation, sharing data and computational resources, and accessing information in digital libraries."
William Wulf, 1989, U.S. National Science Foundation

Vision: The Grid
Grid computing has emerged as an important new field, distinguished from conventional distributed computing by its focus on large-scale resource sharing, innovative applications, and, in some cases, high-performance orientation... we [define] the "Grid problem"… as flexible, secure, coordinated resource sharing among dynamic collections of individuals, institutions, and resources - what we refer to as virtual organizations.
From "The Anatomy of the Grid: Enabling Scalable Virtual Organizations" by Foster, Kesselman and Tuecke

Knowledge workers, fluid communities
Capturing, generating, gathering, integrating, sharing, processing, analysing, weeding, cleaning, correlating, archiving, retiring knowledge
Much of it not theirs & not of their creation
Much of it destined for others
Know-how as important as know-what
Know-why, when, where, who as important

Roadmap
Part 1
  – Application context
Part 2
  – Architecture
Part 3
  – Information model
  – LSIDs
  – Services
  – Data and metadata
  – Workflow environment
Part 4
  – Semantic publication and discovery
  – Provenance
  – Semantic Web and the Grid
Part 5
  – Wrap up

myGrid is an EPSRC funded UK eScience Program Pilot Project
Particular thanks to the other members of the Taverna project,

Application Testbeds
Graves' Disease
  – Simon Pearce and Claire Jennings, Institute of Human Genetics, School of Clinical Medical Sciences, University of Newcastle
  – Autoimmune disease of the thyroid
  – Discover all you can about a gene: Affymetrix microarray analysis, gene annotation
  – Services from Japan, Hong Kong, various sites in UK
Williams-Beuren Syndrome
  – Hannah Tipney, May Tassabehji, Andy Brass, St Mary's Hospital, Manchester, UK
  – Microdeletion of ~1.5 Mbases on Chromosome 7
  – Characterise an unknown gene: gene alerting service, gene and protein annotation
  – Services from USA, Japan, various sites in UK
Trypanosomiasis in cattle
  – Steve Kemp, University of Liverpool, UK
  – Annotation pipelines and gene expression analysis
  – Services from USA, Japan, various sites in UK

Point, click, cut, paste
ID MURA_BACSU STANDARD; PRT; 429 AA.
DE PROBABLE UDP-N-ACETYLGLUCOSAMINE 1-CARBOXYVINYLTRANSFERASE
DE (EC ) (ENOYLPYRUVATE TRANSFERASE) (UDP-N-ACETYLGLUCOSAMINE
DE ENOLPYRUVYL TRANSFERASE) (EPT).
GN MURA OR MURZ.
OS BACILLUS SUBTILIS.
OC BACTERIA; FIRMICUTES; BACILLUS/CLOSTRIDIUM GROUP; BACILLACEAE;
OC BACILLUS.
KW PEPTIDOGLYCAN SYNTHESIS; CELL WALL; TRANSFERASE.
FT ACT_SITE BINDS PEP (BY SIMILARITY).
FT CONFLICT S -> A (IN REF. 3).
SQ SEQUENCE 429 AA; MW; 02018C5C CRC32;
MEKLNIAGGD SLNGTVHISG AKNSAVALIP ATILANSEVT IEGLPEISDI ETLRDLLKEI
GGNVHFENGE MVVDPTSMIS MPLPNGKVKK LRASYYLMGA MLGRFKQAVI GLPGGCHLGP
RPIDQHIKGF EALGAEVTNE QGAIYLRAER LRGARIYLDV VSVGATINIM LAAVLAEGKT
IIENAAKEPE IIDVATLLTS MGAKIKGAGT NVIRIDGVKE LHGCKHTIIP DRIEAGTFMI
Slide courtesy of GSK
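The record on this slide is a flat file in which each line starts with a two-letter line code (ID, DE, GN, OS, OC, KW, ...). As a rough illustration of what replacing "cut and paste" with a program might look like, here is a minimal Python sketch that groups such lines by code. The function name `parse_flat_record` and the abbreviated sample record are assumptions made for this example; they are not part of myGrid or of any library.

```python
from collections import defaultdict

# Abbreviated copy of the record shown on the slide (a few line codes only).
RECORD = """\
ID   MURA_BACSU STANDARD; PRT; 429 AA.
GN   MURA OR MURZ.
OS   BACILLUS SUBTILIS.
OC   BACTERIA; FIRMICUTES; BACILLUS/CLOSTRIDIUM GROUP; BACILLACEAE;
OC   BACILLUS.
KW   PEPTIDOGLYCAN SYNTHESIS; CELL WALL; TRANSFERASE.
"""


def parse_flat_record(text):
    """Group the lines of a flat-file record by their two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        code, _, value = line.partition(" ")
        fields[code].append(value.strip())
    # Lines that repeat a code (e.g. OC) are continuations: join them.
    return {code: " ".join(values) for code, values in fields.items()}


record = parse_flat_record(RECORD)
keywords = record["KW"].rstrip(".").split("; ")
print(record["OS"])
print(keywords)
```

Even this small step removes one manual copy operation: the organism and keywords become values a later analysis step can consume directly.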

Life Sciences: knowledge generation
Informational Science
Large Scale
Distributed
No one organisation owns it all
Integrating across scales, models, types, communities
Small groups drawing on pooled resources

Data deluge, processing bottleneck
[Figure: computational load and genome data plotted against Moore's Law; data sources labelled ESTs, Combinatorial Chemistry, Human Genome, Pharmacogenomics, Metabolic Pathways]

Union of lots of small experiments...
[Figure: kinds of biological data, annotated with scales ranging from hundred thousands through millions to billions]
DNA: sequences, alignments (atcgaattccaggcgtcacattctcaattcca...)
Proteins: sequence, 2º structure, 3º structure (MPMILGYWDIRGLAHAIRLLLEYTDSSYEEKKYT...)
Protein-Protein Interactions: metabolism, pathways, receptor-ligand, 4º structure
Polymorphism and Variants: genetic variants, individual patients, epidemiology
Physiology, Cellular biology, Biochemistry, Neurobiology, Endocrinology, etc.
ESTs: expression patterns, large-scale screens
Genetics and Maps: linkage, cytogenetic, clone-based

Much data is text
Descriptive as well as numeric
Literature
Analogy/knowledge-based

The bottleneck is not computation. It's integration.

Williams-Beuren Syndrome Microdeletion
[Figure: physical map of the ~1.5 Mb WBS deletion region at 7q11.23 on Chr 7 (~155 Mb), marking patient deletions (WBS, SVAS), a gap, and clones CTA-315H11 and CTB-51J22]
Genes labelled: GTF2I, RFC2, CYLN2, GTF2IRD1, NCF1, WBSCR1/E1f4H, LIMK1, ELN, CLDN4, CLDN3, STX1A, WBSCR18, WBSCR21, TBL2, BCL7B, BAZ1B, FZD9, WBSCR5/LAB, WBSCR22, FKBP6, POM121, NOLR1, GTF2IRD2, WBSCR14, STAG3, PMS2L
Repeat blocks: Block A; Block B (GTF2IP, NCF1P, GTF2IRD2P); Block C (FKBP6T, POM121, NOLR1)
Block copies: C-cen, C-mid, C-tel, A-cen, A-mid, A-tel, B-cen, B-mid, B-tel

WBS Workflows
[Figure: workflow diagram; the labels recoverable from it are listed below]
Input: GenBank Accession No -> GenBank Entry -> Seqret -> Nucleotide seq (Fasta)
Services applied to the nucleotide sequence:
  – GenScan: coding sequence, ORFs
  – sixpack: 6 ORFs
  – restrict: restriction enzyme map
  – cpgreport: CpG island locations and %
  – RepeatMasker: repetitive elements
  – prettyseq: translation/sequence file, good for records and publications
  – ncbiBlastWrapper: Blastn vs nr, est databases
  – transeq: amino acid translation
Services applied to the amino acid translation:
  – epestfind: identifies PEST seq
  – pscan: identifies FingerPRINTS
  – pepstats: MW, length, charge, pI, etc.
  – pepcoil: predicts coiled-coil regions
  – SignalP, TargetP, PSORTII: predict cellular location
  – InterPro, PFAM, Prosite, Smart: identify functional and structural domains/motifs
  – Pepwindow? Octanol?: hydrophobic regions
  – ncbiBlastWrapper: tblastn vs nr, est, est_mouse, est_human databases; Blastp vs nr
Also shown: query nucleotide sequence, RepeatMasker, ncbiBlastWrapper, "Sort for appropriate sequences only", "URL inc GB identifier"
Colour key:
  – Pink: outputs/inputs of a service
  – Purple: tailor-made services
  – Green: EMBOSS Soaplab services
  – Yellow: Manchester Soaplab services
  – Grey: unknowns
Interoperability
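A workflow like this one is, structurally, a directed acyclic graph: each service consumes the output of the services upstream of it. The following Python sketch shows one way that idea could be modelled. It is not how Taverna or Freefluo work internally. The wiring between steps is a simplified assumption covering only a few of the services named on the slide, and the "services" are stand-in functions that just record their inputs.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each step maps to the set of steps whose output it consumes.
# This wiring is an illustrative assumption, not the real WBS workflow.
WORKFLOW = {
    "seqret": set(),
    "repeatmasker": {"seqret"},
    "blastn": {"repeatmasker"},
    "genscan": {"seqret"},
    "transeq": {"genscan"},
    "pepstats": {"transeq"},
}

# Stand-in services: each returns a string showing what it was called with.
SERVICES = {
    name: (lambda *inputs, _n=name: f"{_n}({', '.join(inputs)})")
    for name in WORKFLOW
}


def run_workflow(workflow, services, initial_input):
    """Run every step after its upstream steps, passing results along."""
    results = {}
    for step in TopologicalSorter(workflow).static_order():
        # Steps with no upstream dependency receive the workflow input.
        upstream = [results[dep] for dep in sorted(workflow[step])] or [initial_input]
        results[step] = services[step](*upstream)
    return results


results = run_workflow(WORKFLOW, SERVICES, "accession")
print(results["pepstats"])  # pepstats(transeq(genscan(seqret(accession))))
```

Replacing a stand-in with a real service call would not change `run_workflow`, which is the point of describing the experiment as a graph rather than as a sequence of manual steps.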

The problem
Two major steps:
  – Extend into the gap: similarity searches; RepeatMasker, BLAST
  – Characterise the new sequence: NIX, Interpro, etc…
Numerous web-based services (e.g. BLAST, RepeatMasker)
Cutting and pasting between screens
Large number of steps
Frequently repeated – info now rapidly added to public databases
Don't always get results
Time consuming
Huge amount of interrelated data is produced – handled in lab book and files saved to local hard drive
Mundane
Much knowledge remains undocumented
Bioinformatician does the analysis

Classical Approach to the Bioinformatics Data Analysis - Microarray
Import microarray data to Affymetrix Data Mining Tool, run analyses and select
Experiment design to test hypotheses
Find restriction sites and design primers by eye for genotyping experiments
Study annotations for many different genes
Select gene and visually examine SNPs lying within

The Graves' Disease Scenario
[Figure: scenario stages]
Microarray data storage + analysis
Annotation pipeline
Gene & SNP characterisation
Experimental design

Experiment life cycle
Discovering and reusing experiments and resources
Managing lifecycle, provenance and results of experiments
Sharing services & experiments
Personalisation
Forming experiments
Executing and monitoring experiments
The Grid is a technology; the scientist wants a solution.

Scientists…
…Experiment
  – Can workflow be used as an experimental method?
  – How many times has this experiment been run?
…Analyze
  – How do we manage the results to draw conclusions from them?
…Collaborate
  – Can we share workflows, results, metadata etc?
…Publish
  – Can we link to these workflows and results from our papers?
…Review
  – Can I find, comprehend and review your work?
  – How was that result derived?

Service Registration

Portal

Status reporting

WBS Life Cycle
Wrap services as web services
Register them
Build a workflow using the services
Evolve the workflow
Run it over and over again in case data has changed
Record results & provenance
Inspect and compare results & provenance
Set up event notification to fire the workflow
Set up a portal to run the workflow
Publish the workflow template in a registry to share with the world
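Two of the steps above, "run it over and over again in case data has changed" and "record results & provenance", can be illustrated with a small Python sketch. This is only a toy under stated assumptions: `run_if_changed`, `provenance_log` and the `gc_count` workflow are hypothetical names invented here, and real myGrid provenance recorded far more than an input hash and a timestamp.

```python
import hashlib
from datetime import datetime, timezone

provenance_log = []  # one entry per workflow run that actually executed


def fingerprint(data: str) -> str:
    """Return a SHA-256 digest used to detect whether the input changed."""
    return hashlib.sha256(data.encode("utf-8")).hexdigest()


def run_if_changed(workflow, input_data, last_fingerprint):
    """Run the workflow only if the input differs from the previous run."""
    current = fingerprint(input_data)
    if current == last_fingerprint:
        return last_fingerprint, None  # nothing new, so skip the run
    result = workflow(input_data)
    provenance_log.append({
        "input_sha256": current,
        "result": result,
        "ran_at": datetime.now(timezone.utc).isoformat(),
    })
    return current, result


def gc_count(sequence: str) -> int:
    """Toy 'workflow': count g and c bases in a nucleotide string."""
    return sum(sequence.count(base) for base in "gc")


fp, first = run_if_changed(gc_count, "atcgaattcc", None)
fp, second = run_if_changed(gc_count, "atcgaattcc", fp)    # unchanged input
fp, third = run_if_changed(gc_count, "atcgaattccgg", fp)   # new data arrived
print(first, second, third)
```

An event notification service would replace the explicit calls at the bottom: a "database updated" event would trigger `run_if_changed` instead of a person doing so.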

Delivering results
Williams-Beuren Syndrome
Cuts down the time taken to perform one pipeline from 2 weeks to 2 hours
Much more systematic collection and analysis. More regularly undertaken. Less boring. Less prone to mistakes.
Once notification is installed, won't even have to initiate it.
Possible lead already found – but I can't tell you.
Benchmark: first run through of two iterations of workflows
  – Reduced gap by bp at its centromeric end
  – Correctly located all seven known genes in this region
  – Identified 33 of the 36 known exons residing in this location

Delivering results
Easy to get started with Taverna
Sharing happens
  – IPR issues, and suspicions still abound
Network effect necessary and happens
Managed the transition from generic middleware development to practical day to day useful services.
Architecture is solid.
SOA – good idea

In a nutshell
Bioinformatics toolkit
Open (Web) Services – myGrid components and external domain services
  – Publication, discovery, interoperation, composition, decommissioning of myGrid services
  – No control or influence over domain service providers
Metadata Driven
  – LSIDs, Common information model, Ontologies, Semantic Web technologies
Open extensible architecture
  – Assemble your own components
  – Designed to work together
  – Loosely coupled
[Figure: component diagram labelled Freefluo WfEE, Taverna WfDE, View, UDDI registry, Event Notification, mIR, Pedro, Semantic Discovery, Feta, Info. Model, Soaplab, Gowlab, Gateway & CHEF Portal, LSID, Haystack Provenance Browser]
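LSIDs (Life Science Identifiers) are URNs of the form `urn:lsid:authority:namespace:object`, with an optional trailing revision. That layout comes from the LSID specification rather than from these slides. The Python sketch below splits such an identifier into its parts; the `LSID` tuple, the `parse_lsid` function and the example identifier are hypothetical and made up for illustration, not real myGrid identifiers or APIs.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text: str) -> LSID:
    """Split 'urn:lsid:authority:namespace:object[:revision]' into fields."""
    parts = text.split(":")
    # The 'urn:lsid' prefix is matched case-insensitively.
    has_prefix = [p.lower() for p in parts[:2]] == ["urn", "lsid"]
    if len(parts) not in (5, 6) or not has_prefix or not all(parts[2:5]):
        raise ValueError(f"not an LSID: {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return LSID(parts[2], parts[3], parts[4], revision)


# Hypothetical identifier for a workflow result, for illustration only.
lsid = parse_lsid("urn:lsid:example.org:workflow:wbs-1:2")
print(lsid.authority, lsid.namespace, lsid.object_id, lsid.revision)
```

Because the identifier carries its authority and namespace, a result, a workflow and its provenance record can refer to one another without depending on where any of them is currently stored.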

Virtual organisations
[Figure: participants and the resources they share]
Participants: Service Providers, Bioinformaticians, Biologists, Annotation providers, Tool & middleware developers, Service & Platform Administrators
Labels: Service, Information, Workflow, Registries, mIRs, Resources, Reuse

Collaborative e-Science
High level services for e-Science experimental management:
  – Provenance
  – Event notification
  – Personalisation
Sharing knowledge and sharing components
  – Scientific discovery is personal & global.
  – Federated third party registries for workflows and services
  – Workflow and service discovery for reuse and repurposing
[Figure: Registry with operations Register, Find, Annotate]
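The three registry operations named on this slide (Register, Find, Annotate) can be sketched as a tiny in-memory Python class. This is a deliberately simplified illustration: the `Registry` and `Entry` classes are hypothetical, annotations are plain strings rather than ontology terms, and nothing here reflects the UDDI or Feta interfaces that myGrid actually used.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    name: str
    kind: str  # "service" or "workflow"
    annotations: set = field(default_factory=set)


class Registry:
    """Toy registry: register entries, annotate them, find by annotation."""

    def __init__(self):
        self._entries = {}

    def register(self, name, kind):
        self._entries[name] = Entry(name, kind)

    def annotate(self, name, *terms):
        # Annotations are stored lower-cased so lookups ignore case.
        self._entries[name].annotations.update(t.lower() for t in terms)

    def find(self, term):
        term = term.lower()
        return sorted(
            e.name for e in self._entries.values() if term in e.annotations
        )


registry = Registry()
registry.register("RepeatMasker", "service")
registry.register("ncbiBlastWrapper", "service")
registry.annotate("RepeatMasker", "repeat masking", "nucleotide sequence")
registry.annotate("ncbiBlastWrapper", "similarity search", "Nucleotide sequence")

print(registry.find("nucleotide sequence"))
```

Letting third parties add annotations, rather than only the service provider, is what makes discovery for reuse possible when, as an earlier slide puts it, "the services are not ours".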

Key Characteristics
Data intensive, upstream analysis
Pipelines - experiments as workflows (chiefly)
Ad hoc exploratory investigative workflows for individuals from no particular a priori community
Openness – the services are not ours.
Low activation energy, incremental take-on
Foundations for sharing knowledge and sharing experimental objects
Multiple stakeholders
Collection of components for assembly

Putting the user first
User-driven end to end scenarios essential
Whole solution that fits with them
Users vs Machines (vs Interesting computer science)
  – Mismatch for information needs
Scufl instead of BPEL/WSFL
Layers of Provenance
Service/workflow descriptions for PEOPLE not just machines
Bury complexity, increasingly simplify
  – Bioinformaticians HARDLY EVER want to have their services automatically selected
Except SHIMs, Replicas, User specified equivalences
Service providers and developers are users too!

Discovering and reusing experiments and resources
Managing lifecycle, provenance and results of experiments
Sharing services & experiments
Personalisation
Forming experiments
Executing and monitoring experiments
Soaplab

Publications
R. Stevens, H.J. Tipney, C. Wroe, T. Oinn, M. Senger, P. Lord, C.A. Goble, A. Brass and M. Tassabehji. Exploring Williams-Beuren Syndrome Using myGrid. To appear in Proceedings of 12th International Conference on Intelligent Systems in Molecular Biology, 31st Jul-4th Aug 2004, Glasgow, UK.
C.A. Goble, S. Pettifer, R. Stevens and C. Greenhalgh. Knowledge Integration: In silico Experiments in Bioinformatics. In The Grid: Blueprint for a New Computing Infrastructure, Second Edition, eds. Ian Foster and Carl Kesselman, 2003, Morgan Kaufmann, November
R. Stevens, A. Robinson, and C.A. Goble. myGrid: Personalised Bioinformatics on the Information Grid. In proceedings of 11th International Conference on Intelligent Systems in Molecular Biology, 29th June–3rd July 2003, Brisbane, Australia, published Bioinformatics Vol. 19 Suppl , pp

Statistics
Data sizes (results, iterations)
Number of services
Duration of stateful collaborations
Size limitations on Axis – large messages.
Document, rpc, literal encoded stuff.
Concatenating
Complex types and toolkits.
Performance?
Coding round Axis.
Internal processes – documents, rpc
Complexity of workflows