Unleash your inner (data) scientist : The ability and audacity to scale your science with extensible cyberinfrastructure Nirav Merchant The University.

Presentation transcript:

Unleash your inner (data) scientist : The ability and audacity to scale your science with extensible cyberinfrastructure Nirav Merchant The University of Arizona & iPlant Collaborative

Topic Coverage
The “Big Data” and “Data Scientist” wave
What is cyberinfrastructure (CI)
Delivering pragmatic CI ecosystem
What has the community built with our CI
Lifecycle of research and innovation
Continuing education and learning with CI
Future thoughts and challenges

Science Paradigms
1. A thousand years ago: science was empirical, describing natural phenomena and observations
2. Last few hundred years: a theoretical branch using models and generalizations
3. Last few decades: a computational branch simulating complex phenomena
4. Today: data exploration (eScience), unifying theory, experiment, and simulation
Based on the transcript of a talk given by the late Jim Gray to the National Research Council – Computer Science and Telecommunications Board in Mountain View, CA, on January 11, 2007

The Fourth Paradigm: Data-Intensive Scientific Discovery
Increasingly, scientific breakthroughs will be powered by advanced computing capabilities that help researchers manipulate and explore massive datasets. The speed at which any given scientific discipline advances will depend on how well its researchers collaborate with one another, and with technologists, in areas of eScience such as databases, workflow management, visualization, and cloud computing technologies.

The Discovery Lifecycle
The Fourth Paradigm: Data-Intensive Scientific Discovery

Evolution of X-Info
The evolution of X-Info and Comp-X for each discipline X (e.g., Bio-Informatics, Computational-Biology)
How to codify and represent our knowledge
The Generic Problems:
How to share it with others
Query and Vis tools
Building and executing models
Integrating data and literature
Documenting experiments
Curation and long-term preservation
Data ingest
Managing a petabyte
Common schema
How to organize it
How to reorganize it
The Fourth Paradigm: Data-Intensive Scientific Discovery

Paradigm Shift
Classic paradigm: You produce data, analyze, and interpret (end to end)
Conventional paradigm: Consortia/centers produce data and you consume it
New paradigm: Consortia/centers have produced data and are creating “cyberinfrastructure” to tackle the “grand challenge”

∧ big

Real Cost of Sequencing (2011)
The real cost of sequencing: higher than you think. Genome Biol. 2011; 12(8): 125

Big Data
Extracting meaningful results from vast amounts of data (linked data)
Big data “information assets” demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
“Big Data” is only the beginning of extreme information management
Big Data technology: all is not new
Attributed to Gartner Consulting

A few words about “Big Data” and “Data Science”
The 2014 Gartner Technology Hype-Cycle

+ = Simple Formula for Success

The Reality
Excel, R, Perl, Python, ArcGIS, Java, Ruby, Fortran, C, C#, C++, Matlab, etc.
Amazon, Azure, Rackspace, campus HPC, XSEDE, etc.
and lots of glue...

+ = Simple Formula

The relevance
“Bioinformatics has become too central to biology to be left to specialist bioinformaticians. Biologists are all bioinformaticians now” - Lincoln Stein Dec

iPlant Collaborative: Vision Enable life science researchers and educators to use and extend cyberinfrastructure

The iPlant Collaborative
We are a Cyberinfrastructure
Platforms, tools, datasets
Storage and compute
Training and support

The iPlant Collaborative
And a virtual organization
Developer Expertise
Computational Capacity
Science Domain Expertise
Training
Administrative and Organization

iPlant Collaborative: CI for Scalable Science
Facilitating the 4A’s of “Computational Thinking” approaches for Life Sciences: Abstraction, Automation, Ability and Audacity
Allowing researchers and educators to establish and manage data driven collaborations: supporting distributed teams and virtual organizations (VO) at global scale
Making efficient and coordinated use of CI resources from national, regional, institutional and commercial providers: NSF XSEDE, iPlant, campus HPC and high bandwidth connections to commercial cloud providers
Adopting best practices from science domains where key CI challenges have been solved: Astronomy, Particle Physics etc.
Community driven, self-provisioning, extensible and open source: development and prioritization driven through community engagement, active engagement with CISE communities

iPlant Collaborative: Platform Philosophy
Strive to provide the CI Lego blocks
Danish 'leg godt' - 'play well'
Also translates as 'I put together' in Latin
If desired functionality is not available, the community can craft their own by using and extending iPlant CI components (like Lego blocks)
Through these extensible and customized platforms, create an ecosystem of interoperable tools that benefit the broad community (and not just a few lab groups)
Provide the tools to allow the community to manage their digital assets (cloud, HPC etc.)
Improve Computational Productivity

Who did we build it for ?

iPlant: Platform for Big Data Collaborations

iPlant Collaborative: Products
Ready to use Platforms
Foundational Capabilities
Established CI Components
Extensible Services
Ease of use

iPlant: Cohesive Platform for Big Data lifecycle

Researchers like to share!
User Statistics
~27000 user accounts
4900 users with data
2600 users (53% of users with data) made at least 1 share
2100 shares per user
42 million files (58% shared)
59 million (1.1 million/month) shares
Community Data Statistics
5 million files
55 million (1.0 million/month) shares
~1.1PB of user-managed data
Our users consume 5M+ SU annually and more (we graduate them to compete for their own allocations from XSEDE)

How is it being used?
Users build their own systems (powered by iPlant components) but managed by them
Consume specific components (a la carte, data store, Atmosphere)
Directly use applications (DE)
Custom design appliances (Atmosphere)
Publish their findings (PNAS, Nature)
Advocate use
Create learning material and courses

iPlant CI: What is the community building?
Many 1000’s-omes projects manage their data & analysis
Execute large scale workflows (25-50TB data, Million+ CPU hours)
Data infrastructure to coordinate digitization efforts for multiple sites
Sharing, Visualizing (3D) & Analyzing high resolution microscopy images (40K x 40K) via web browser
Learning material, new course work, custom applications

And it goes way beyond plants and life science

iPlant Collaborative: Training data scientists
Partnership with Software Carpentry and Data Carpentry to provide best practices necessary to make efficient use of CI
Allowing individual researchers and educators to utilize data and computational infrastructure at scale (and encounter real challenges)
Community contributed material (built on iPlant CI)

Applied Cyberinfrastructure Concepts (ACIC)
Semester-long, project-based learning course: introduces fundamental concepts, tools and resources for effectively managing common tasks associated with analyzing large datasets.
Graduate + undergraduate course working on REAL research workflows where scalability is a bottleneck
Provide familiarity with cyberinfrastructure (CI) resources available at the University of Arizona campus, iPlant Collaborative, NSF XSEDE centers, Cloud (Future Grid and commercial providers such as Amazon).
Learning to apply relevant CI skills (for final project) and developing wiki-based documentation of these best practices.
Learning how to effectively collaborate in interdisciplinary team settings.
Deliver a functional solution to the stakeholder

From research question to reality

Why is it valuable?
Users are able to overcome data and computational bottlenecks
Share data of ANY size with ANYONE
Connect data and compute on a single platform
Manage their data and computations regardless of scale
Build their own apps and solutions (create their own community: iAnimal, iVirome)
Create custom appliances

Even the tech geeks notice

Connect with iPlant!
Get an account: us: Questions: #iPlant
Facebook: facebook.com/iPlantCollab
LinkedIn: iplant.co/iPlantCollabLinkedIn
Google+: iplant.com/iPlantGooglePlus

Luck favors the brave
Analysis favors the organized