3DAPAS/ECMLS Panel: Dynamic Distributed Data-Intensive Analysis Environments for Life Sciences, June 8, 2011, San Jose. Geoffrey Fox, Shantenu Jha, Dan Katz, Judy Qiu, Jon Weissman

Discussion Topics?
Programming methods: languages vs. frameworks (advantages and disadvantages of each)
Moving compute to data: is the data-localization model imposed by clouds scalable and/or sustainable?
Does Life Sciences want clouds or supercomputers?
Data model for Life Sciences: is it dynamic? What is its size? What is the access pattern? Is it shared or individual?
How important are data security and privacy?

Programming methods: languages vs. frameworks for data-intensive / Life Science areas
SaaS offers "BLAST etc." on demand
Role of PGAS and data-parallel compilers (like Chapel), i.e. mainstream HPC high-level approaches
Nodes vs. cores vs. GPUs – do hybrid programming models have special features?
MapReduce vs. MPI (see the sketch after this slide)
Distributed environments like SAGA
Data-parallel analysis languages like Pig Latin and Sawzall
Role of databases like SciDB and SQL-based analysis – see DryadLINQ
Is R (cloud R, parallel R) critical? What about Excel, Matlab …
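By way of illustration (not part of the original slides), the sketch below shows the MapReduce style referred to above in plain Python: a "map" phase counting k-mers per read, and a "reduce" phase merging the partial counts. The reads and k value are hypothetical; an MPI version would instead manage processes and explicit message passing.

```python
# Minimal MapReduce-style sketch: k-mer counting over hypothetical DNA reads.
from collections import Counter
from functools import reduce

def map_phase(read, k=3):
    """Emit k-mer counts for one read -- the 'map' step."""
    return Counter(read[i:i + k] for i in range(len(read) - k + 1))

def reduce_phase(counts_a, counts_b):
    """Merge two partial counts -- the 'reduce' step."""
    counts_a.update(counts_b)
    return counts_a

reads = ["ACGTACGT", "CGTACGTT", "TTACGTAC"]     # hypothetical reads
partials = [map_phase(r) for r in reads]         # embarrassingly parallel
total = reduce(reduce_phase, partials, Counter())
print(total.most_common(5))
```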

Moving compute to data: is the data-localization model imposed by clouds scalable and/or sustainable?
This relates to the privacy and programming-model questions
Is data stored in central resources?
Does the data have co-located compute resources (cloud)?
If co-located, are data and compute on the same cluster?
– How is data spread out over disks on nodes?
Or is the data in a storage system supporting a wide-area file system shared by the nodes of the cloud?
Or is the data in a database (SciDB, SkyServer)?
Or is the data in an object store like OpenStack?
What kind of middleware exists, or needs to be developed, to enable effective compute-data movement? Or is it just a run-time scheduling problem? (see the locality-aware scheduling sketch after this slide)
What are the performance issues, and how do we get data there for dynamic data such as that produced by sequencers?
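The following toy sketch (hypothetical node and block names, not from the slides) illustrates the scheduling side of "moving compute to data": each analysis task is placed on a node that already holds a replica of its input block, falling back to the least-loaded node only when no replica exists.

```python
# Toy locality-aware scheduler: prefer nodes that already store the input block.
block_locations = {            # block id -> nodes holding a replica (hypothetical)
    "reads-000": {"node-a", "node-b"},
    "reads-001": {"node-b", "node-c"},
    "reads-002": {"node-a", "node-c"},
}
node_load = {"node-a": 0, "node-b": 0, "node-c": 0}

def schedule(block_id):
    """Pick a node holding the block; otherwise the least-loaded node."""
    candidates = block_locations.get(block_id, set())
    pool = candidates if candidates else node_load.keys()
    node = min(pool, key=lambda n: node_load[n])
    node_load[node] += 1
    return node

for blk in block_locations:
    print(blk, "->", schedule(blk))
```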

Data model for Life Sciences: is it dynamic? What is its size? What is the access pattern? Is it shared or individual?
Is it a few large centers? Is it a distributed set of repositories containing, say, all data from a particular lab?
– Or both of the above?
– How to manage and present the stream of new data?
The world created ~1000 exabytes of data this year – how much will Life Sciences create? (a back-of-envelope sketch follows this slide)
Relative importance of large shared data centers versus instrument- or computer-generated, individually owned data?
Is data replication important?
Storage model – files, objects, databases?
How often are the different types of data read (presumably written once!)?
– Which data is most important? Raw, or processed to some level?
What is the metadata challenge?
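The back-of-envelope sketch below (all parameters are hypothetical and only illustrate the form of the sizing question; they are not estimates from the panel) shows how one might relate sequencer output and replication to the ~1000 EB/year figure on the slide.

```python
# Illustrative sizing arithmetic with hypothetical numbers.
sequencers         = 1_000     # hypothetical instruments worldwide
runs_per_year      = 100       # hypothetical runs per instrument per year
tb_per_run         = 0.5       # hypothetical raw output per run, in TB
replication_factor = 3         # copies kept for safety/replication

raw_tb    = sequencers * runs_per_year * tb_per_run
stored_tb = raw_tb * replication_factor
world_eb  = 1000                              # ~1000 EB/year, from the slide
share     = stored_tb / (world_eb * 1_000_000)  # 1 EB = 1,000,000 TB

print(f"raw data:    {raw_tb:,.0f} TB/year")
print(f"stored data: {stored_tb:,.0f} TB/year (x{replication_factor} replication)")
print(f"share of ~{world_eb} EB/year: {share:.4%}")
```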

Does Life Sciences want clouds or supercomputers?
Clouds are cost-effective and elastic for varying needs
Supercomputers support low-latency (MPI) parallel applications
Clouds are the main commercial offering; supercomputers are the main academic large-scale computing solution
– Also Open Science Grid, EGI …
Cost (time) of transporting data from sequencers and repositories to analysis engines (clouds) – a transfer-time sketch follows this slide
– Will NLR or Internet2 link to clouds? They do link to TeraGrid
What can the LS data-intensive community learn from the HEP community? E.g., will the HEP approach of community-wide "workload management systems" and VOs work?
What is the role of campus clusters/resources in genomic data sharing?
No history of large cloud budgets in federal grants
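As a worked illustration of the "cost in time" of transport (hypothetical data sizes, link speeds, and efficiency, not figures from the panel):

```python
# Rough transfer-time estimate for moving data to a cloud analysis engine.
def transfer_time_hours(data_tb, link_gbps, efficiency=0.7):
    """Wall-clock hours to move data_tb TB over a link_gbps link,
    assuming only `efficiency` of nominal bandwidth is achieved."""
    data_bits = data_tb * 1e12 * 8            # TB -> bits
    effective_bps = link_gbps * 1e9 * efficiency
    return data_bits / effective_bps / 3600

for tb, gbps in [(1, 1), (10, 1), (10, 10), (100, 10)]:   # hypothetical cases
    print(f"{tb:>4} TB over {gbps:>2} Gbps: {transfer_time_hours(tb, gbps):6.1f} hours")
```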

How important are data security and privacy?
Human genome processing cannot use the most cost-effective solutions, which will be shared resources such as public clouds
– Commercial, military applications
What other research applications have such concerns?
– Analysis of copyrighted material such as digital books
Partly a technical issue; partly a policy issue of establishing a trusted approach (a client-side encryption sketch follows this slide)
– Companies accept off-site paper storage?
See recent hacking attacks such as those on the Sony network and Gmail
How important is fault tolerance / autonomic computing?
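One possible ingredient of such a trusted approach (a sketch, not a recommendation from the panel) is client-side encryption: sensitive genomic data are encrypted before upload, and only ciphertext ever reaches the shared public cloud. The example uses the third-party `cryptography` package; the data and workflow shown are hypothetical.

```python
# Client-side encryption sketch: keep the key on-premises, upload only ciphertext.
from cryptography.fernet import Fernet

key = Fernet.generate_key()          # keep this key off the cloud
cipher = Fernet(key)

plaintext = b"ACGTACGT..."           # stand-in for a genomic data file
ciphertext = cipher.encrypt(plaintext)

# ... upload `ciphertext` to the public cloud of choice ...

recovered = cipher.decrypt(ciphertext)
assert recovered == plaintext
```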