1 Overview of Cyberinfrastructure and the Breadth of Its Application
Geoffrey Fox
Computer Science, Informatics, Physics
Chair, Informatics Department
Director, Community Grids Laboratory and Digital Science Center
Indiana University, Bloomington IN
(Presenter: Marlon Pierce)

2 Evolution of Scientific Computing: Evidence of Intelligent Design?
[Timeline figure: Parallel Computing, then Grids and Federated Computing, Scientific Enterprise Computing, Scientific Web 2.0, Cloud Computing, and Parallel Computing again along the time axis; the Y-axis is whatever you want it to be.]

3 What is High Performance Computing?
The meaning of this was clear 20 years ago when we were planning/starting the HPCC (High Performance Computing and Communication) Initiative. It meant parallel computing, and HPCC lasted for 10 years.
As an outgrowth of this, NSF started funding supercomputer centers, and we debated vector versus "massively parallel" systems. Data did not exist …. TeraGrid is the current incarnation.
NSF subsequently established the Office of Cyberinfrastructure: a comprehensive approach to physical infrastructure. The complementary NSF concept is "Computational Thinking".
Everyone needs cyberinfrastructure.
The core idea is always connecting resources through messages: MPI, JMS, XML, Twitter, etc.
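As an illustration of this connect-by-messages idea (not part of the original slides), here is a minimal MPI sketch in Python using the mpi4py package (an assumed dependency): rank 0 packages a small piece of work as a message and rank 1 receives it.

# Minimal message-passing sketch with mpi4py (assumed installed).
# Run with, e.g.:  mpiexec -n 2 python send_recv.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    # Rank 0 ships a small, pickled work description to rank 1.
    comm.send({"task": "analyze", "chunk": 42}, dest=1, tag=0)
elif rank == 1:
    # Rank 1 blocks until the message arrives, then acts on it.
    msg = comm.recv(source=0, tag=0)
    print("rank 1 received", msg)

The same pattern recurs at every scale in the talk, whether the transport is MPI inside a cluster, JMS between services, or XML and Twitter messages on the wider Internet.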

4 TeraGrid High Performance Computing Systems
[Map of TeraGrid computational resources (size approximate, not to scale) at SDSC, TACC, NCSA, ORNL, PU, IU, PSC, NCAR, Tennessee, LONI/LSU, and UC/ANL; annotations include (504TF) and 2008 (~1PF). Slide courtesy Tommy Minyard, TACC.]

5 Resources for many disciplines!
More than 120,000 processors in aggregate.
Resource availability grew during 2008 at unprecedented rates.

6 Large Hadron Collider, CERN, Geneva: 2008 Start
[Diagram of the 27 km tunnel in Switzerland & France with the detectors: ATLAS and CMS (pp, general purpose), LHCb (B-physics), ALICE (heavy ions, HI), and TOTEM; pp collisions at √s = 14 TeV, L = 10^34 cm^-2 s^-1.]
Physics goals: Higgs, SUSY, Extra Dimensions, CP Violation, QG Plasma, … the Unexpected.
Physicists from 250+ institutes in 60+ countries.
Challenges: analyze petabytes of complex data cooperatively; harness global computing, data & network resources.

7 Linked Environments for Atmospheric Discovery (LEAD)
Grid services, triggered by abnormal events and controlled by workflow, process real-time data from radar and high-resolution simulations for tornado forecasts.
[Screenshot: typical graphical interface to service composition.]

8 Cyberinfrastructure Center for Polar Science (CICPS)

9 Environmental Monitoring Cyberinfrastructure at Clemson

10

11 Forces on Cyberinfrastructure: Clouds, Multicore, and Web

12 Gartner 2008 Technology Hype Curve
Clouds, Microblogs and Green IT appear; Basic Web Services, Wikis and SOA becoming mainstream.

13 Gartner’s 2005 Hype Curve

14 Relevance of Web 2.0
Web 2.0 can help e-Research in many ways.
Its tools (web sites) can enhance scientific collaboration, i.e. effectively support virtual organizations, in different ways from grids.
The popularity of Web 2.0 means it provides high-quality technologies and software that (due to large commercial investment) can be very useful in e-Research and preferable to complex Grid or Web Service solutions.
The usability and participatory nature of Web 2.0 can bring science and its informatics to a broader audience.
Cyberinfrastructure is the research analogue of major commercial initiatives, leading e.g. to important job opportunities for students!

15 Enterprise Approach vs. Web 2.0 Approach
JSR 168 Portlets vs. Google Gadgets, Widgets, badges
Server-side integration and processing vs. AJAX, client-side integration and processing, JavaScript
SOAP vs. RSS, Atom, JSON
WSDL vs. REST (GET, PUT, DELETE, POST)
Portlet Containers vs. Open Social Containers (Orkut, LinkedIn, Shindig), Facebook, StartPages
User-Centric Gateways vs. Social Networking Portals
Workflow managers (Taverna, Kepler, XBaya, etc.) vs. Mash-ups
WS-Eventing, WS-Notification, Enterprise Messaging vs. Blogging and Micro-blogging with REST, RSS/Atom, and JSON messages (Blogger, Twitter)
Semantic Web: RDF, OWL, ontologies vs. Microformats, folksonomies
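To make the SOAP/WSDL-versus-REST/JSON contrast in the table concrete, here is an illustrative sketch (not from the slides) of a Web 2.0-style call using the Python requests library; the endpoint URL and response fields are hypothetical.

# REST/JSON access to a hypothetical resource: a plain HTTP GET plus JSON,
# instead of a WSDL-described SOAP operation.
import requests

resp = requests.get(
    "https://example.org/api/jobs/1234",      # hypothetical resource URL
    headers={"Accept": "application/json"},
    timeout=10,
)
resp.raise_for_status()
job = resp.json()                             # parsed JSON document
print(job.get("status"))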

16 Cloud Computing: Infrastructure and Runtimes
Cloud infrastructure: outsourcing of servers, computing, data, file space, etc., handled through Web services that control virtual machine lifecycles.
Cloud runtimes: tools for using clouds to do data-parallel computations: Apache Hadoop, Google MapReduce, Microsoft Dryad, and others.
Designed for information retrieval but excellent for a wide range of machine learning and science applications (e.g. Apache Mahout).
They may also be a good match for the many-core computers available in the next 5 years.
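As a sketch of the data-parallel style these runtimes support (illustrative, not taken from the talk), the classic word count can be written as a Hadoop Streaming mapper and reducer: plain scripts that read stdin and emit tab-separated key/value pairs, which Hadoop replicates across the data.

# wordcount_streaming.py -- used as both mapper and reducer under Hadoop Streaming,
# e.g.: -mapper "python wordcount_streaming.py map" -reducer "python wordcount_streaming.py reduce"
import sys
from itertools import groupby

def mapper():
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")                    # emit (word, 1)

def reducer():
    pairs = (line.rstrip("\n").split("\t") for line in sys.stdin)
    # Hadoop delivers mapper output sorted by key, so groupby can sum per word.
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        print(f"{word}\t{sum(int(count) for _, count in group)}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()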

17 Some Commercial Clouds
Cloud/Service | Amazon | Microsoft Azure | Google (and Apache)
Data | S3, EBS, SimpleDB | Blob, Table, SQL Services | GFS, BigTable
Computing | EC2, Elastic MapReduce (runs Hadoop) | Compute Service | MapReduce (not public, but Hadoop)
Service Hosting | Amazon Load Balancing | Web Hosting Service | AppEngine/AppDrop
Bold-faced entries in the original table have open source equivalents.
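As one concrete, hedged example of the storage row above, the sketch below lists objects in an S3 bucket with the boto3 AWS SDK; the bucket name is a placeholder and credentials are assumed to be configured in the environment.

# List a few objects in an S3 bucket (boto3 assumed installed and configured).
import boto3

s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket="my-example-bucket", MaxKeys=10)  # placeholder bucket
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])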

18 Clouds as Cost Effective Data Centers
Exploit the Internet by allowing one to build giant data centers with 100,000's of computers; ~ to a shipping container.
"Microsoft will cram between 150 and 220 shipping containers filled with data center gear into a new 500,000 square foot Chicago facility. This move marks the most significant, public use of the shipping container systems popularized by the likes of Sun Microsystems and Rackable Systems to date."

19 Clouds Hide Complexity
Build portals around all computing capability:
SaaS: Software as a Service
IaaS: Infrastructure as a Service (or HaaS: Hardware as a Service)
PaaS: Platform as a Service, which delivers SaaS on IaaS
Cyberinfrastructure is "Research as a Service".
[Photo: two Google warehouses of computers on the banks of the Columbia River, in The Dalles, Oregon.]
Such centers use 20MW-200MW (future) each, at 150 watts per core.
Save money from large size, positioning with cheap power, and access with the Internet.

20 Open Architecture Clouds
Amazon, Google, Microsoft, et al., don't tell you how to build a cloud; it is proprietary knowledge. Indiana University and others want to document this publicly.
What is the right way to build a cloud? It is more than just running software.
What is the minimum-sized organization that can run a cloud? A department? A university? A university consortium? Or do you outsource it all? There are analogous issues in government, industry, and enterprise.
Example issues: What hardware setups work best? What are you getting into? What is the best virtualization technology for different problems?

21 Data-File Parallelism and Clouds
Now that you have a cloud, you may want to do large-scale processing with it.
Classic problems perform the same (sequential) algorithm on fragments of extremely large data sets.
Cloud runtime engines manage these replicated algorithms in the cloud; they can be chained together in pipelines (Hadoop) or DAGs (Dryad), and the runtimes handle problems like failure control.
We are exploring both scientific applications and classic parallel algorithms (clustering, matrix multiplication) using clouds and cloud runtimes.
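A minimal local sketch of this pattern (illustrative only: the data_fragments directory and the per-fragment count_records step are assumptions) runs the same sequential code over every file fragment and then merges the results, which is what the cloud runtimes do at much larger scale with scheduling and failure control added.

# Same sequential step over many file fragments, then a simple merge --
# a single-node stand-in for the map half of Hadoop/Dryad.
from multiprocessing import Pool
from pathlib import Path

def count_records(path):
    # Stand-in for the real per-fragment analysis (e.g. an alignment or a filter).
    with open(path) as f:
        return path.name, sum(1 for _ in f)

if __name__ == "__main__":
    fragments = sorted(Path("data_fragments").glob("*.txt"))   # hypothetical inputs
    with Pool() as pool:
        results = pool.map(count_records, fragments)           # replicated "map"
    total = sum(count for _, count in results)                 # merge / "reduce"
    print(f"{len(results)} fragments, {total} records")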

22 Data Intensive Research
Research is advanced by observation, i.e. analyzing data from:
Gene sequencers
Accelerators
Telescopes
Environmental sensors
Web crawlers
Ethnographic interviews
This data is "filtered", "analyzed", "data mined" (the term used in computer science) to produce conclusions.
Weather forecasting and climate prediction are of this type.

23 Geospatial Examples
Image processing and mining, e.g. SAR images from the Polar Grid project (J. Wang): apply to 20 TB of data.
Flood modeling I: chaining flood models over a geographic area.
Flood modeling II: parameter fits and inversion problems.
Real-time GPS processing (filter).
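The real-time GPS item can be pictured as a streaming filter over position samples. The sketch below is not the project's actual filter and the readings are made up; it simply applies an exponential moving average to noisy latitude/longitude pairs as they arrive.

# Toy streaming smoother for GPS samples: exponential moving average.
def smooth(samples, alpha=0.2):
    state = None
    for lat, lon in samples:
        if state is None:
            state = (lat, lon)                     # initialize from first reading
        else:
            state = (alpha * lat + (1 - alpha) * state[0],
                     alpha * lon + (1 - alpha) * state[1])
        yield state

noisy = [(39.17, -86.52), (39.18, -86.51), (39.16, -86.53)]    # made-up readings
for lat, lon in smooth(noisy):
    print(f"{lat:.5f}, {lon:.5f}")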

24 Parallel Clustering and Parallel Multidimensional Scaling (MDS)
Applied to ~5000-dimensional gene sequences and ~20-dimensional patient record data; very good parallel speedup.
[MDS plots: gene-sequence data sets (pairwise aligned, Clustal MSA, and Clustal MSA with Kimura2 distance; 3000-4500 points) and 4000 points of patient record data on obesity and environment.]
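A small single-node sketch of the MDS-plus-clustering pipeline (the production runs used custom parallel MDS and clustering codes on real sequence distances; the synthetic data and scikit-learn calls here are assumptions for illustration):

# Illustrative MDS + k-means on a synthetic pairwise-distance matrix.
import numpy as np
from sklearn.manifold import MDS
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 50))              # stand-in for high-dimensional data
dist = np.linalg.norm(features[:, None] - features[None, :], axis=-1)  # pairwise distances

coords = MDS(n_components=3, dissimilarity="precomputed",
             random_state=0).fit_transform(dist)   # embed into 3-D for visualization
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(coords)
print(coords.shape, np.bincount(labels))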

25 Some Other File/Data Parallel Examples from the Indiana University Biology Department
EST (Expressed Sequence Tag) assembly (Dong): 2 million mRNA sequences generates files taking 15 hours on 400 TeraGrid nodes (the CAP3 run dominates).
MultiParanoid/InParanoid gene sequence clustering (Dong): 476 core-years just for prokaryotes.
Population genomics (Lynch): looking at all pairs separated by up to 1000 nucleotides.
Sequence-based transcriptome profiling (Cherbas, Innes): MAQ, SOAP.
Systems microbiology (Brun): BLAST, InterProScan.
Metagenomics (Fortenberry, Nelson): pairwise alignment of sequence data took 12 hours on TeraGrid.
All can use Dryad or Hadoop.
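These jobs are "pleasingly parallel": the same binary applied to many independent inputs. A local sketch of that shape is below (the cap3 executable on PATH and the est_chunks directory are assumptions); on a cloud the identical pattern would be expressed as Hadoop or Dryad tasks instead of local processes.

# Run an external tool over many independent input files in parallel.
import subprocess
from multiprocessing import Pool
from pathlib import Path

def assemble(fasta):
    # Each task is an independent run of the same sequential tool.
    log = fasta.parent / (fasta.stem + ".log")
    with log.open("w") as out:
        subprocess.run(["cap3", str(fasta)], check=True, stdout=out)
    return fasta.name

if __name__ == "__main__":
    inputs = sorted(Path("est_chunks").glob("*.fa"))           # hypothetical inputs
    with Pool() as pool:
        for name in pool.imap_unordered(assemble, inputs):
            print("finished", name)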

26 Intel's Projection
Technology might support:
2010: 16-64 cores, 200 GF-1 TF
2013: 64-256 cores, 500 GF-4 TF
2016: cores, 2 TF-20 TF

27 Too much Computing?
Historically, both grids and parallel computing have tried to increase computing capabilities by:
Optimizing performance of codes at the cost of re-usability
Exploiting all possible CPUs, such as graphics co-processors and "idle cycles" (across administrative domains)
Linking central computers together, such as NSF/DoE/DoD supercomputer networks, without clear user requirements
The next crisis in this technology area will be the opposite problem: commodity chips will be highly parallel in 5 years' time, and we currently have no idea how to use them on commodity systems, especially on clients.
There will be only 2 releases of standard software (e.g. Office) in this time span, so we need solutions that can be implemented in the next 3-5 years.
Intel's RMS analysis: gaming and generalized decision support (data mining) are ways of using these cycles.