
1 The iPlant Collaborative Cyberinfrastructure
aka Development of Public Cyberinfrastructure to Support Plant Science
Presented by Dan Stanzione, Co-PI and Cyberinfrastructure Lead, iPlant Collaborative; Deputy Director, Texas Advanced Computing Center

2 Today’s Schedule
Presentations:
– Me: Overview and architecture
– Matt: CI for Genotype to Phenotype
– Sheldon: CI for Tree of Life
– Uwe: CI for Educating the next generation of biologists
A quick break
Live interactive demos of DNA Subway, Discovery Environment, and the MyPlant site.

3 What is iPlant?
– iPlant’s mission is to build the CI to support plant biology’s Grand Challenge solutions
– Grand Challenges were not defined in advance, but identified through engagement with the community
– A virtual organization with Grand Challenge teams relying on national cyberinfrastructure
– Long-term focus on sustainable food supply, climate change, biofuels, ecological stability, etc.
– Hundreds of participants globally; working group members at >50 US institutions, USDA, DOE, etc.

4 Brief History
– Formally approved by National Science Board – 12/2007
– Funding by NSF – February 1st, 2008
– iPlant Kickoff Conference at CSHL – April 2008 (~200 participants)
– Grand Challenge Workshops – Sept–Dec 2008
– CI Workshop – Jan 2009
– Grand Challenge White Paper Review – March 2009
– Project Recommendations – March 2009
– Project Kickoffs – May 2009 & August 2009
– First Release of Discovery Environments – April 2010

5 Grand Challenge & CI Workshops
– Mechanistic Basis of Plant Adaptation (9-30-08)
– Impact of Climate Change on Plant Productivity: Prediction of Phenotype from Genotype (9-30-08)
– Developing common models for molecular mechanisms, crop physiology, and ecology (11-7-08)
– Assembling the Tree of Life to Enable the Plant Sciences (11-19-08)
– Computational Morphodynamics of Plants (12-15-08)
– Botanical Information & Ecology Network CI Workshop

6 GC Projects
Recommended by the iPlant Board of Directors, March 2009. Initial projects:
– Plant Tree of Life (iPToL) – May ’09, plus Taxonomic Intelligence, APWeb2, and a Social Networking Website
– Genotype to Phenotype (iPG2P) – Aug ’09, plus an Image Analysis Platform

7 iPlant Tree of Life Working Groups
– Trait Evolution, Brian O’Meara – Post-tree analysis and mapping of ancestral traits
– Tree Reconciliation, Todd Vision – Large-scale reconciliation of gene trees, co-evolving parasites, etc., with species trees
– Big Trees, Alexandros Stamatakis – HPC phylogenetic inference with 500K taxa
– Tree Visualization, Michael Sanderson, Karen Cranston – Cross-cutting group for the visualization needs of all
– Data Integration, Val Tannen, Bill Piel – Cross-cutting group for the data integration needs of all
– Data Assembly, Doug Soltis, Pam Soltis, Michael Donoghue – Community and network building, data assembly

8 iPlant Genotype to Phenotype Working Groups
– NextGen Sequencing – Establishing an informatics pipeline that will allow the plant community to process NextGen sequence data
– Statistical Inference – Developing a platform using advanced computational approaches to statistically link genotype to phenotype
– Modeling Tools – Developing a framework to support tools for the construction, simulation, and analysis of computational models of plant function at various scales of resolution and fidelity
– Visual Analytics – Generating, adapting, and integrating visualization tools capable of displaying diverse types of data from laboratory, field, and in silico analyses and simulations
– Data Integration – Investigating and applying methods for describing and unifying data sets into virtual systems that support iPG2P activities

9 NSF Cyberinfrastructure Vision
– High Performance Computing
– Data and Data Analysis
– Virtual Organizations
– Learning and Workforce
Ref: “Cyberinfrastructure Vision for 21st Century Discovery”, NSF Cyberinfrastructure Council, March 2007.

10 What is Cyberinfrastructure? (Originally about TeraGrid)
– It’s a Grid! It’s Storage! It’s a Common Software Environment! It’s a Network! They are HPC Centers! It’s Apps and Support!
– And more: visualization, facilities, data collections…
“It was six men of Indostan, / To learning much inclined, / Who went to see the elephant, / (Though all of them were blind), / That each by observation / Might satisfy his mind.”
WWW.TERAGRID.ORG

11 Cyberinfrastructure versus Bioinformatics
– Leveraging production compute and storage infrastructure, hundreds of millions in NSF investment; these aren’t machines in our lab.
– Focus on a *platform*, not tools:
– Methods for leveraging physical resources
– Methods for integrating tools
– Methods for integrating data
– Emphasis on a sustainable, species-independent platform.

12 What is the iPlant CI?
Two grand challenges:
– iPlant Tree of Life (IPTOL): Build a single tree showing the evolutionary relationships of all green plant species on Earth
– iPlant Genotype-to-Phenotype (IPG2P): Construct a methodology whereby an investigator, given the genomic and environmental information about a given plant, can predict its characteristics.
Strong focus on data integration, not simulation: plant science is truly data driven. Still many computational challenges (e.g. inferring phylogenies from genome data).
[Figure: prototype visualization tool, showing a 220,000-taxon phylogenetic tree]

13 Open Source Philosophy, Commercial Quality Process
iPlant is open in every sense of the word:
– Open access to source
– Open API to build a community of contributors
– Open standards adopted wherever possible
– Open access to data (where users so choose)
iPlant code design, implementation, and quality control will be based on best industrial practice.

14 CI Timelines
– Per NSF mandate: no development before conclusion of GC selection in March 2009
– GC projects kicked off requirements-gathering phases in May and July 2009, respectively
– Software engineering practices established and staffing expanded in summer of 2009
– Architecture designed; first production and prototype coding began in September 2009
– Initial prototype rollouts began Jan. 2010
– First product betas began March 2010
– New releases of the DE quarterly, with periodic releases of other products

15 IPTOL CI – At a Very High Level
Goal: Build and use very large trees, perhaps all green plant species.
Needs:
– Most of the data isn’t collected, and a lot of what is collected isn’t organized.
– Lots of analysis tools exist (probably plenty of them), but they don’t work together and use many different data formats.
– The tree-builder tools take too long to run.
– The visualization tools don’t scale to the tree sizes needed.

16 IPTOL CI – High Level
Addressing these needs through CI:
– MyPlant – the social networking site for phylogenetic data collection (organized by clade)
– Provide a common repository for data without an NCBI home (e.g. 1KP)
– Discovery Environment: build a common interface, data format, and API to unite tools
– Enhance tree-builder tools (RAxML, NINJA, SATé) with parallelization and checkpointing
– Build a remote visualization tool capable of running where we can guarantee RAM resources

17 Support of Existing Tools
The IPTOL working groups have identified a number of tools that need to be enhanced to meet initial scientific goals:
– NINJA (Neighbor Joining)
– RAxML (Maximum Likelihood)
Both pose significant scalability challenges that iPlant staff are helping the developers tackle.

18 Tree Visualization
Clade-based navigation, guaranteed to scale to 1M taxa

19 Discovery Environment

20 First DE
Support of only one workflow, independent contrasts, but:
– Seamless remote execution of compute tasks on TeraGrid resources
– Incorporation of existing informatics tools behind the iPlant interface
– Parsing of multiple data formats into the iPlant format
– Seamless integration of online data resources
– Role-based access and basic provenance support
Mostly foundation work…

21 Second DE Release
– Added functionality for G2P, specifically high-throughput sequencing (transcript abundance, variant detection)
– Substantially enhanced UI
– iRODS integration

22 Portfolio of Activities
Maintaining a balance of “past, present, future” strategies:
– “Past”: make services, systems, and support available to existing bioinformatics projects, either to enhance them or simply to make critical tools more widely available.
– “Present”: build the best bioinformatics software tools that today’s technologies can provide.
– “Future”: track emerging technologies and, where appropriate, stimulate research into the creation and use of those technologies.

23 Portfolio of Activities
In a nutshell:
– 12 working groups in the two grand challenges, each of which is defining requirements for DE development. Each group not only holds discussions that lead to final projects, but also spawns prototyping efforts, tech evaluation projects, tool support projects, etc.
– Services group: provide cycles, storage, hosting, etc. to users.
– A comprehensive technology evaluation program to find, borrow, or build relevant technologies, headlined by the semantic web effort.
– A number of ancillary projects related to grand challenges, e.g. the APWeb refit and high-throughput image analysis.
– The core development/integration effort.

24 The iPlant Cyberinfrastructure
– Physical Infrastructure: compute, storage, persistent virtual machines (TeraGrid, Open Science Grid, UA/ASU/TACC)
– iPlant Middleware: job submission, workflow management, service/data APIs (iRODS, grid technologies, Condor, RESTful services)
– iPlant Discovery Environments: Grand Challenge workflows, iPlant interfaces, third-party tools, iPlant-built tools, community-contributed tools and data!
Build a CI that’s robust, leverages national infrastructure, and can grow through community contribution!

25 Systems and Services
– Provide access for problems like these on large-scale systems
– Provide the storage infrastructure for biological data (again, in support of existing projects)
– Provide cloud-style VM infrastructure for service hosting

26 Existing Systems
We have made resources available to iPlant users from a number of TeraGrid and local systems:
– Ranger (TG/large-scale supercomputer)
– Stampede (TACC/high throughput)
– Longhorn (TG/remote visualization and GPU)
The Contrast tool runs in production on Stampede; TreeViz on Longhorn. Several groups are accessing these systems for real science now; command line only, but open for business!

27 Storage Services
We have also begun offering storage to a number of projects connected to the grand challenges in some way, as well as to iPlant internal efforts:
– iRODS interface
– Corral at TACC; a local storage array at UA
Data arriving now for the 1KP project and the Gates C3/C4 project, with some labs starting to use it… open for business.

28 Cloud Services
– iPlant is now offering “cloud”-style hosting services
– Dynamically launch virtual servers hosted by iPlant
– Still in prototype

29 A Discovery Environment
[Diagram: an integration layer connecting remote repositories for data, models, and algorithms; local datasets; community annotation; computational tools & web services; community ontologies & controlled vocabularies; a programmatic access API; and a collaboration-friendly front end]

30 Discovery Environment Releases
First release, March 2010 (technology preview):
– Tests of architecture, identity management, remote system integration, etc.
– One supported workflow for IPTOL
Second release, June 2010:
– Six workflows for IPG2P around high-throughput sequencing
– Integrated tree visualization tool
– UI refinements based on user feedback
Third release, September 2010:
– Basic support for incorporation of 3rd-party tools
– Enhanced collaboration features
– Taxon name scrubber

31 The iPlant Application Programmer Interface And What it Means to You

32 API – Why is it important?
The API is the Application Programmer Interface (it comes with an associated SDK, or software developer kit). This is the way bioinformatics tools and data get integrated with iPlant. First pieces to be released late Sept/early October.
The cardinal sin of API support: release lots of versions, each incompatible with the last.
– Our approach: incremental releases; each release will add new areas of functionality, not change old syntax.
– Initial support: getting files in and out of the environment, running jobs.
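To make the “initial support” concrete, here is a minimal sketch of a thin client for such a RESTful API. The base hostname appears later in this talk; the service names, endpoint paths, and parameters are illustrative assumptions, not the published iPlant API.

```python
# Hypothetical sketch of a thin client for an iPlant-style REST API.
# Only the base hostname comes from this talk; the "io"/"job" service
# names and parameters below are assumptions for illustration.
from urllib.parse import urljoin, urlencode

BASE = "https://services.iplantcollaborative.org/"

def endpoint(service, action, **params):
    """Build a request URL for a given service and action."""
    url = urljoin(BASE, f"{service}/{action}")
    return url + ("?" + urlencode(params) if params else "")

# Getting a file into the environment, then running a job on it
# (service/parameter names assumed):
upload_url = endpoint("io", "file/put", path="mydata/alignment.phy")
job_url = endpoint("job", "submit", app="contrast")
```

Because each release only adds endpoints rather than changing old syntax, a wrapper like this keeps working across API versions.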

33 Architecture

34 Core Services
– Eventing
– I/O
– Data Transforms
– App Discovery
– Job Mgmt.
– User Profile Mgmt.
– Authentication
– User/Project Auditing
– Mashups (Orchestration)

35 Application Discovery Services
Application discovery and management (different from semantic web service discovery):
– /apps: add a new application to the iPlant CI
– /apps/list: list all supported applications
– /apps/search: search for a specific application
– /apps/type/list: list all supported application types
– /apps/type/ : list all supported applications of a specific type
– /apps/name/ : list all supported applications matching a given name
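The route strings above can be assembled client-side with a small helper. This is a sketch: the `/apps` routes come from the slide, while the helper function and the example type/name arguments are assumptions for illustration.

```python
# Sketch of a client-side helper for the /apps discovery routes listed
# above. The routes come from the slide; the example arguments
# ("phylogenetics", "RAxML") are assumed placeholders, since the slide
# elides the actual parameter values.
from urllib.parse import quote

BASE = "https://services.iplantcollaborative.org"

def apps_url(*parts):
    """Join path segments onto the /apps discovery root, URL-escaping each."""
    return "/".join([BASE, "apps", *(quote(p, safe="") for p in parts)])

register = apps_url()                       # POST here to add an application
list_all = apps_url("list")                 # /apps/list
by_type  = apps_url("type", "phylogenetics")  # /apps/type/<type>
by_name  = apps_url("name", "RAxML")          # /apps/name/<name>
```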

36 API/DE Back End

37 Publishing Your Service Through iPlant
– Wrap your service in our API (or get us to do it for you).
– Give us the package to deploy on our platforms (optional, but a good idea).
– We register it as a service, discoverable through the app API.
– Describe the user interface to our discovery environment (graphical tool to build forms).
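The steps above amount to supplying metadata about your tool. As a purely hypothetical illustration (none of these field names are the actual iPlant schema), a registration might carry something like:

```python
# Hypothetical service-registration metadata illustrating the three
# inputs a tool author supplies: the wrapped service, an optional
# deployment package, and a form description for the Discovery
# Environment. Field names are invented for illustration.
service_description = {
    "name": "my-aligner",                   # how the app API would list it
    "deploy_package": "my-aligner.tar.gz",  # optional bundle for iPlant platforms
    "interface": {                          # the form the DE would render
        "fields": [
            {"label": "Input sequences", "type": "file"},
            {"label": "Gap penalty", "type": "number", "default": 10},
        ]
    },
}
```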

38 Using the API in your work *outside* the Discovery Environment
You need not come through the DE to make use of a service. Embed calls to the web service in your own code, or even run them from the command line. For example, to get an output file from your Phylip run:
https://services.iplantcollaborative.org/contrast/file/get/( )
While it is nice to do this by hand, the key thing is that it can be *automated*.
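The kind of automation meant here can be sketched in a few lines. The service URL is the one shown above; the file-identifier argument and the download helper are assumptions for illustration, since the slide leaves the identifier blank.

```python
# Sketch: retrieving a contrast (Phylip) output file from a script
# instead of the Discovery Environment. The service URL matches the
# slide; the file identifier and download helper are assumed.
import urllib.request

SERVICE = "https://services.iplantcollaborative.org/contrast/file/get"

def output_url(file_id):
    """URL for one output file from a contrast run."""
    return f"{SERVICE}/{file_id}"

def fetch_output(file_id, dest):
    """Download one output file -- the step a pipeline would automate."""
    with urllib.request.urlopen(output_url(file_id)) as resp, \
         open(dest, "wb") as out:
        out.write(resp.read())

# A workflow would then loop over many runs, e.g.:
#   for fid in run_ids:
#       fetch_output(fid, f"results/{fid}.out")
```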

39 Roadmap
API Service | Expected Beta Release | Development Team
io, data, event | September 2010 | TACC, CSH
apps, job | October 2010 | TACC
profile, auth | October 2010 | UA, TACC
audit | November 2010 | TACC, UA
mashup | January 2011 | UA, TACC, CSH
TACC – Texas Advanced Computing Center; CSH – Cold Spring Harbor Laboratory; UA – University of Arizona

40 Technology Eval Activities
– Largest investment in semantic web activities – key for addressing the massive data integration challenges
– Exploring alternate implementations of QTL mapping algorithms
– Experimental reproducibility
– Policy and technology for provenance management
– Evaluation of HubZero, workflow engines, and numerous other tools

41 Deployment Strategy
Broadly, iPlant CI deployment can be grouped into 3 categories: systems, services (middleware), and tools.
In each category, there are a couple of types of development/deployment activities: prototype and production.
The transition from prototype to production can usually follow a relatively robust engineering schedule; prototyping less so.

42 So, what can I get from iPlant Right NOW! Tools:
– Use the Discovery Environment to do transcript abundance, variant detection, or trait evolution analyses, or just store your stuff
– Access prototype tools for large-scale tree visualization, or very large tree-building runs with neighbor joining or maximum likelihood
– Use MyPlant to find data and colleagues working on related species
– Use DNA Subway to do genome annotation and train your students

43 So, what can I get from iPlant Right NOW! Systems/Services:
– Request a repository to provide command line or WebDAV access to large-scale datasets on high-integrity storage systems
– Get command line access to some of the most powerful computing and visualization systems in the world
– Use the iPlant Cloud to host your web application in a virtual machine

44 So, what can I get from iPlant SOON? Services:
– Use the API to embed access to iPlant tools, systems, and data repositories in your own scripts and workflows
– Submit your bioinformatics tool to be registered as an iPlant service (run on large platforms, available to others through the API), or make your web service discoverable through iPlant
– A little later: have your tool incorporated in the iPlant DE with its own graphical interface
Tools: more coming online steadily

45 Collaborations
– More than 80 faculty at 45 institutions involved in working groups
– Gates Integrated Breeding Platform
– Gates C3/C4 photosynthesis project
– 1KP thousand-plant transcriptome project
– Nascent “National Virtual Herbaria” and many existing herbaria

46 Discussion
(See demo clips at http://iplantcollaborative.org/videos)

47 CI Master Project List
DE, API, Semantic Web, GLM, GLM–GPU, MyPlant, RAxML, NINJA, Experimental Reproducibility, Image management pipeline, APWEB refit, Ingest pipeline (Phlawd), DNA Subway, BrachyBio, DropBox, Cloud service, Storage repositories, Analytics pipeline, Visualization explorations, Workflow tools analysis, Large Scale Tree Visualization, Name resolution service

48 DNA Subway

49 MyPlant
– Social networking for plant biologists
– Organized by clade
– Used to organize the data collection for the “big tree”

50 Scope: What iPlant won’t do
– iPlant is not a funding agency – a large grant shouldn’t become a bunch of small grants
– iPlant does not fund data collection
– iPlant will (probably) not continue funding for projects whose funding is ending
– iPlant will not seek to replace all online data repositories
– iPlant will not *impose* standards on the community

51 Scope: What iPlant *will* do
– Provide storage, computation, hosting, and lots of programmer effort to support grand challenge efforts
– Work with the community to support and develop standards
– Provide forums to discuss the role and design of CI in plant science
– Help organize the community to collect data
– Provide appropriate funding for time spent helping us design and test the CI

52 Experimental Systems
We are experimenting with some newer technologies to plug gaps in the existing lineup for demonstrated needs (also leveraging some other funding):
– New model for shared memory (ScaleMP cluster to be deployed soon); will support whole-genome assembly
– “Cloud storage” models to reduce archive cost and increase capacity (HDFS system on a commodity cluster to be deployed this quarter); will also support Hadoop data processing

53 Deployment Timelines Summary
Systems:
– Production systems (HPC, storage, throughput, visualization) available *now* and in use
– Experimental systems (simulated shared memory, cloud, cloud storage) coming up in prototype stage
Services:
– Web service API to incorporate 3rd-party tools prototyping now; public releases in Q3
Tools:
– A number of prototypes available now, many underway
– Contrast workflow, variant detection, and transcript quantification all released now

54 The iPlant CI
– Engagement with the CI community to leverage best practice and new research
– Unprecedented engagement with the user community to drive requirements
– A single CI for all plant scientists, with customized discovery environments to meet grand challenges
– An exemplar virtual organization for modern computational science
– A foundation of computational and storage capability
– Open source principles, commercial-quality development process

