Presentation is loading. Please wait.

Presentation is loading. Please wait.

GGF Summer School 24th July 2004, Italy Middleware for in silico Biology Professor Carole Goble University of Manchester

Similar presentations


Presentation on theme: "GGF Summer School 24th July 2004, Italy Middleware for in silico Biology Professor Carole Goble University of Manchester"— Presentation transcript:

1 GGF Summer School 24th July 2004, Italy Middleware for in silico Biology Professor Carole Goble University of Manchester http://www.mygrid.org.uk

2 GGF Summer School 24th July 2004, Italy Vision: Collaboratory A collaboratory is …a center without walls, in which the nation's researchers can perform their research without regard to geographical location, interacting with colleagues, accessing instrumentation, sharing data and computational resources, and accessing information in digital libraries William Wulf, 1989 U.S. National Science Foundation

3 GGF Summer School 24th July 2004, Italy Vision: The Grid Grid computing has emerged as an important new field, distinguished from conventional distributed computing by its focus on large-scale resource sharing, innovative applications, and, in some cases, high-performance orientation...we [define] the "Grid problem”…as flexible, secure, coordinated resource sharing among dynamic collections of individuals, institutions, and resources - what we refer to as virtual organizations From "The Anatomy of the Grid: Enabling Scalable Virtual Organizations" by Foster, Kesselman and Tuecke

4 GGF Summer School 24th July 2004, Italy Knowledge workers, fluid communities Capturing, generating, gathering, integrating, sharing, processing, analysing, weeding, cleaning, correlating, archiving, retiring knowledge Much of it not theirs & not of their creation Much of it destined for others Know-how as important as know- what Know-why, when, where, who as important

5 GGF Summer School 24th July 2004, Italy Roadmap Part 1 –Application context Part 2 –Architecture Part 3 –Information model –LSIDs –Services –Data and metadata –Workflow environment Part 4 –Semantic publication and discovery –Provenance –Semantic Web and the Grid Part 5 –Wrap up

6 GGF Summer School 24th July 2004, Italy myGrid is an EPSRC funded UK eScience Program Pilot Project Particular thanks to the other members of the Taverna project, http://taverna.sf.nethttp://taverna.sf.net

7 GGF Summer School 24th July 2004, Italy Application Testbeds Grave’s Disease Simon Pearce and Claire Jennings, Institute of Human Genetics School of Clinical Medical Sciences, University of Newcastle Autoimmune disease of the thyroid Discover all you can about a gene: Affymetrix microarray analysis, Gene annotation Services from Japan, Hong Kong, various sites in UK Williams-Beuren Syndrome Hannah Tipney, May Tassabehji, Andy Brass, St Mary’s Hospital, Manchester, UK Microdeletion of 155 Mbases on Chromosome 7 Characterise an unknown gene: Gene alerting service, gene and protein annotation Services from USA, Japan, various sites in UK Trypanosomiasis in cattle Steve Kemp, University of Liverpool, UK Annotation pipelines and Gene expression analysis Services from USA, Japan, various sites in UK

8 GGF Summer School 24th July 2004, Italy Point, click, cut, paste ID MURA_BACSU STANDARD; PRT; 429 AA. DE PROBABLE UDP-N-ACETYLGLUCOSAMINE 1-CARBOXYVINYLTRANSFERASE DE (EC 2.5.1.7) (ENOYLPYRUVATE TRANSFERASE) (UDP-N-ACETYLGLUCOSAMINE DE ENOLPYRUVYL TRANSFERASE) (EPT). GN MURA OR MURZ. OS BACILLUS SUBTILIS. OC BACTERIA; FIRMICUTES; BACILLUS/CLOSTRIDIUM GROUP; BACILLACEAE; OC BACILLUS. KW PEPTIDOGLYCAN SYNTHESIS; CELL WALL; TRANSFERASE. FT ACT_SITE 116 116 BINDS PEP (BY SIMILARITY). FT CONFLICT 374 374 S -> A (IN REF. 3). SQ SEQUENCE 429 AA; 46016 MW; 02018C5C CRC32; MEKLNIAGGD SLNGTVHISG AKNSAVALIP ATILANSEVT IEGLPEISDI ETLRDLLKEI GGNVHFENGE MVVDPTSMIS MPLPNGKVKK LRASYYLMGA MLGRFKQAVI GLPGGCHLGP RPIDQHIKGF EALGAEVTNE QGAIYLRAER LRGARIYLDV VSVGATINIM LAAVLAEGKT IIENAAKEPE IIDVATLLTS MGAKIKGAGT NVIRIDGVKE LHGCKHTIIP DRIEAGTFMI Slide courtesy of GSK

9 GGF Summer School 24th July 2004, Italy Life Sciences: knowledge generation Informational Science Large Scale Distributed No one organisation owns it all Integrating across scales, models, types, communities Small groups drawing on pooled resources

10 GGF Summer School 24th July 2004, Italy Data deluge, processing bottleneck 199020002010 ESTs Combinatorial Chemistry Human Genome Pharmacogenomics Metabolic Pathways Computational Load Genome Data Moores Law

11 GGF Summer School 24th July 2004, Italy Union of lots of small experiments... atcgaattccaggcgtcacattctcaattcca... billions DNA sequences alignments MPMILGYWDIRGLAHAIRLLLEYTDSSYEEKKYT... Proteins sequence 2º structure 3º structure Hundred thousands Protein-Protein Interactions metabolism pathways receptor-ligand 4º structure millions billions Polymorphism and Variants genetic variants individual patients epidemiology Physiology Cellular biology Biochemistry Neurobiology Endocrinology etc. millions ESTs Expression patterns Large-scale screens Genetics and Maps Linkage Cytogenetic Clone-based

12 GGF Summer School 24th July 2004, Italy Much data is text Descriptive as well as numeric Literature Analogy/ knowledge- based

13 GGF Summer School 24th July 2004, Italy The bottleneck is not computation Its integration

14 GGF Summer School 24th July 2004, Italy Williams-Beuren Syndrome Microdeletion ** Chr 7 ~155 Mb ~1.5 Mb 7q11.23 GTF2I RFC2CYLN2 GTF2IRD1 NCF1 WBSCR1/E1f4H LIMK1ELNCLDN4CLDN3STX1A WBSCR18 WBSCR21 TBL2BCL7BBAZ1B FZD9 WBSCR5/LAB WBSCR22 FKBP6POM121 NOLR1 GTF2IRD2 C-cen C-midA-cen B-mid B-cen A-midB-telA-telC-tel WBSCR14 WBS SVAS STAG3 PMS2L Block A FKBP6T POM121 NOLR1 Block C GTF2IP NCF1P GTF2IRD2P Block B Patient deletions CTA-315H11 CTB-51J22 Gap Physical Map

15 GGF Summer School 24th July 2004, Italy WBS Workflows: GenBank Accession No GenBank Entry Seqret Nucleotide seq (Fasta) GenScanCoding sequence ORFs prettyseq restrict cpgreport RepeatMasker ncbiBlastWrapper sixpack transeq 6 ORFs Restriction enzyme map CpG Island locations and % Repetative elements Translation/sequence file. Good for records and publications Blastn Vs nr, est databases. Amino Acid translation epestfind pepcoil pepstats pscan Identifies PEST seq Identifies FingerPRINTS MW, length, charge, pI, etc Predicts Coiled-coil regions SignalP TargetP PSORTII InterPro PFAM Prosite Smart Hydrophobic regions Predicts cellular location Identifies functional and structural domains/motifs Pepwindow? Octanol? ncbiBlastWrapper URL inc GB identifier tblastn Vs nr, est, est_mouse, est_human databases. Blastp Vs nr RepeatMasker Query nucleotide sequence ncbiBlastWrapper Sort for appropriate Sequences only Pink: Outputs/inputs of a service Purple: Taylor-made services Green: Emboss soaplab services Yellow: Manchester soaplab services Grey: Unknowns RepeatMasker Interoperability

16 GGF Summer School 24th July 2004, Italy The problem Two major steps: Extend into the gap: Similarity searches; RepeatMasker, BLAST Characterise the new sequence: NIX, Interpro, etc… Numerous web-based services (i.e. BLAST, RepeatMasker) Cutting and pasting between screens Large number of steps Frequently repeated – info now rapidly added to public databases Don’t always get results Time consuming Huge amount of interrelated data is produced – handled in lab book and files saved to local hard drive Mundane Much knowledge remains undocumented Bioinformatician does the analysis

17 GGF Summer School 24th July 2004, Italy Classical Approach to the Bioinformatics Data Analysis - Microarray Import microarray data to Affymetrix data Mining Tool, Run Analyses and select Experiment Design to test Hypotheses Find restriction sites and design primers by eye for genotyping experiments Study Annotations for many different Genes Select Gene and Visually examine SNPS lying within

18 GGF Summer School 24th July 2004, Italy The Graves’ Disease Scenario Microarray data Storage + analysis Annotation Pipeline Gene & SNP Characterisation Experimental Design

19 GGF Summer School 24th July 2004, Italy Experiment life cycle Discovering and reusing experiments and resources Managing lifecycle, provenance and results of experiments Sharing services & experiments Personalisation Forming experiments Executing and monitoring experiments The Grid is a technology; the scientist wants a solution.

20 GGF Summer School 24th July 2004, Italy Scientists… …Experiment Can workflow be used as an experimental method? How many times has this experiment been run? …Analyze How do we manage the results to draw conclusions from them? …Collaborate Can we share workflows, results, metadata etc? …Publish Can we link to these workflows and results from our papers? …Review Can I find, comprehend and review your work? How was that result derived?

21 GGF Summer School 24th July 2004, Italy

22 Service Registration

23 GGF Summer School 24th July 2004, Italy Portal

24 GGF Summer School 24th July 2004, Italy

25

26

27 Status reporting

28 GGF Summer School 24th July 2004, Italy

29

30 WBS Life Cycle Wrap services as web services Register them Build a workflow using the services Evolve the workflow Run it over and over again in case data has changed Record results & provenance Inspect and compare results & provenance Set up event notification to fire the workflow Set up a portal to run the workflow Publish the workflow template in a registry to share with the world

31 GGF Summer School 24th July 2004, Italy Delivering results William-Beuren Syndrome Cuts down the time taken to perform one pipeline from 2 weeks to 2 hours Much more systematic collection and analysis. More regularly undertaken. Less boring. Less prone to mistakes. Once notification installed won’t even have to initiate it. Possible lead already found – but I can’t tell you. Benchmark: first run though of two iterations of workflows –Reduced gap by 267 693 bp at its centrmeric end –Correctly located all seven known genes in this region –Identified 33 of the 36 known exons residing in this location

32 GGF Summer School 24th July 2004, Italy Delivering results Easy to get started with Taverna Sharing happens –IPR issues, and suspicions still abound Network effect necessary and happens Managed the transition from generic middleware development to practical day to day useful services. Architecture is solid. SOA – good idea

33 GGF Summer School 24th July 2004, Italy

34 In a nutshell Bioinformatics toolkit Open (Web) Services – my Grid components and external domain services –Publication, discovery, interoperation, composition, decommissioning of my Grid services –No control or influence over domain service providers Metadata Driven –LSIDs, Common information model, Ontologies, Semantic Web technologies Open extensible architecture –Assemble your own components –Designed to work together –Loosely coupled Freefluo WfEE Taverna WfDE View UDDI registry Event Notification mIR Pedro Semantic Discovery Feta Info. Model Soaplab Gowlab Gateway & CHEF Portal LSID Haystack Provenance Browser

35 GGF Summer School 24th July 2004, Italy Virtual organisations Service Providers Bioinformaticians Biologists Annotation providers Tool & middleware developers Service & Platform Administrators Service Information Workflow Registries mIRs Resources Reuse

36 GGF Summer School 24th July 2004, Italy Collaborative e-Science High level services for e-Science experimental management; –Provenance –Event notification –Personalisation Sharing knowledge and sharing components –Scientific discovery is personal & global. –Federated third party registries for workflows and services –Workflow and service discovery for reuse and repurposing Registry Register Find Annotate

37 GGF Summer School 24th July 2004, Italy Key Characteristics Data Intensive, Up stream analysis Pipelines - experiments as workflows (chiefly) Adhoc exploratory investigative workflows for individuals from no particular a priori community Openness – the services are not ours. Low activation energy, incremental take-on Foundations for sharing knowledge and sharing experimental objects Multiple stakeholders Collection of components for assembly

38 GGF Summer School 24th July 2004, Italy Putting the user first User-driven end to end scenarios essential Whole solution that fits with them Users vs Machines (vs Interesting computer science) –Mismatch for information needs Scufl instead of BPEL/WSFL Layers of Provenance Service/workflow descriptions for PEOPLE not just machines Bury complexity, increasingly simplify –Bioinformaticans HARDLY EVER want to have their services automatically selected Except SHIMs, Replicas, User specified equivalences Service providers and developers are users too!

39 GGF Summer School 24th July 2004, Italy Discovering and reusing experiments and resources Managing lifecycle, provenance and results of experiments Sharing services & experiments Personalisation Forming experiments Executing and monitoring experiments Soaplab

40 GGF Summer School 24th July 2004, Italy Publications R. Stevens, H.J. Tipney, C. Wroe, T. Oinn, M. Senger, P. Lord, C.A. Goble, A. Brass and M. Tassabehji Exploring Williams-Beuren Syndrome Using myGrid to appear in Proceedings of 12th International Conference on Intelligent Systems in Molecular Biology, 31st Jul-4th Aug 2004, Glasgow, UK. C.A. Goble, S. Pettifer, R. Stevens and C. Greenhalgh Knowledge Integration: In silico Experiments in Bioinformatics in The Grid: Blueprint for a New Computing Infrastructure Second Edition eds. Ian Foster and Carl Kesselman, 2003, Morgan Kaufman, November 2003. R. Stevens, A. Robinson, and C.A. Goble myGrid: Personalised Bioinformatics on the Information Grid in proceedings of 11th International Conference on Intelligent Systems in Molecular Biology, 29th June–3rd July 2003, Brisbane, Australia, published Bioinformatics Vol. 19 Suppl. 1 2003, pp302-304.

41 GGF Summer School 24th July 2004, Italy http://www.mygrid.org.uk

42 GGF Summer School 24th July 2004, Italy Statistics Data sizes (results, iterations) Number of services Duration of stateful collaborations Size limitations on Axis – large messages. Document, rpc, literal encoded stuff. Concaternating Complex types and toolkits. Performance ? Coding round Axis. Internal processes – documents, rpc Complexity of workflows


Download ppt "GGF Summer School 24th July 2004, Italy Middleware for in silico Biology Professor Carole Goble University of Manchester"

Similar presentations


Ads by Google