Experts in numerical algorithms and High Performance Computing services
Challenges of the exponential increase in data
Andrew Jones, March 2010, SOS14

2 “Explosion of data”
- Able to generate ever-growing volumes of data
- Rapid growth in computational capability and capacity
- Increasingly cost-viable HPC at individual scale (GPUs etc.)
- Improvement in the resolution of instrumental sources
- Collecting more data than ever before
- Increasing use of multiple data sources – fusion of various sensors and models to form predictions

3 The problem of huge data growth
- Drowning in the volume of data?
- Hindered by the diversity, formats, quality, etc.?
- Potential value of the increased data lost through our inability to manage and comprehend it?
- Does having more data mean more information – or less, due to analysis overload?

4 This is not a new story
- When I started using computers for research, much of my time was spent working with data:
  - Writing routines to convert from one format to another (a sketch follows below)
  - Manually massaging source data into a form usable by the codes (e.g. cleaning up object models, calibrating instrumental data to correct for limitations of collection, etc.)
  - Storing model output with metadata for re-use
  - Attending meetings to tackle the coming curation and provenance challenges
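
The format-conversion chore is easy to illustrate. Below is a minimal, hypothetical Python sketch: it converts a CSV file to JSON and attaches provenance metadata (source checksum, instrument, timestamp) so the output stays re-usable. The file names, field names, and metadata layout are invented for illustration, not taken from the talk.

```python
import csv
import hashlib
import json
from datetime import datetime, timezone

def convert_with_metadata(src_csv, dst_json, instrument):
    """Convert a CSV dataset to JSON, attaching provenance metadata."""
    with open(src_csv, newline="") as f:
        records = list(csv.DictReader(f))

    # A checksum of the source file lets a later user verify provenance.
    with open(src_csv, "rb") as f:
        src_digest = hashlib.sha256(f.read()).hexdigest()

    payload = {
        "metadata": {
            "source_file": src_csv,
            "source_sha256": src_digest,
            "instrument": instrument,
            "converted_at": datetime.now(timezone.utc).isoformat(),
            "record_count": len(records),
        },
        "records": records,
    }
    with open(dst_json, "w") as f:
        json.dump(payload, f, indent=2)

# Hypothetical usage:
# convert_with_metadata("sensor_raw.csv", "sensor_clean.json", "lidar-03")
```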

5 Challenge can always mean opportunity...
- New insights possible as a result of volume
  - Statistics, trends, exceptions, data mining, coverage
  - Enables an alternative research process of data exploration rather than hypothesis validation
  - Anecdote: discovery by anomaly (a toy sketch follows below)
- Broader market for data-led HPC
  - Databases, data analytics, business intelligence, etc.
  - A growth opportunity for the HPC community
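
As a toy illustration of “discovery by anomaly”, here is a short sketch that flags exceptions in a series using a simple z-score. The threshold and data are illustrative only; real exploratory analysis over large volumes would use far more robust methods.

```python
import statistics

def flag_anomalies(values, z_threshold=2.0):
    """Return indices of values more than z_threshold standard
    deviations from the mean (a crude exception detector)."""
    mean = statistics.fmean(values)
    stdev = statistics.stdev(values)
    if stdev == 0:
        return []
    return [i for i, v in enumerate(values)
            if abs(v - mean) / stdev > z_threshold]

# Illustrative data: one reading that merits a closer look.
readings = [10.1, 9.8, 10.0, 10.3, 55.0, 9.9, 10.2]
print(flag_anomalies(readings))  # -> [4]
```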

6 e.g. key driver of HPC – Climate Science
- Predict and store variations of tens of variables over centuries in pseudo-3D space (a back-of-envelope estimate follows below)
- Include various measurement data
- Compare multiple models against each other
- First to demand greater storage, archival facilities, or similar from HPC centres
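
A back-of-envelope estimate shows why this kind of output stresses storage. Every number below is an illustrative assumption, not a figure from the talk.

```python
# Back-of-envelope storage estimate for one climate model run.
grid_points = 360 * 180 * 40      # 1-degree lon/lat grid, 40 vertical levels
variables = 30                    # "tens of variables"
snapshots_per_year = 365 * 4      # 6-hourly output
years = 200                       # a multi-century run
bytes_per_value = 4               # single precision

total_bytes = grid_points * variables * snapshots_per_year * years * bytes_per_value
print(f"{total_bytes / 1e12:.0f} TB for one run")  # ~91 TB under these assumptions

# Comparing several models multiplies the archive again, before any
# ensemble members or observational data are included.
```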

7 e.g. key driver of HPC – Engineering
- Computer model of the design
  - geometry, material parameters, etc.
- Perform CFD/CEM/structural/etc. simulations
- Preparation is often a much greater task (in human effort and elapsed time) than the simulation itself
- Input data has high value – and often drives memory limits
- Output data (field data) is large in volume, but also critically needs an audit trail to be meaningful (a sketch follows below)
- Post-processing – especially visualization
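
One way to keep field data meaningful is to write an audit record alongside it. The sketch below is hypothetical: the sidecar layout, file names, and run-configuration keys are invented for illustration.

```python
import hashlib
import json
from datetime import datetime, timezone

def write_field_with_audit(field_file, raw_bytes, run_config):
    """Write raw field output plus a sidecar audit record."""
    with open(field_file, "wb") as f:
        f.write(raw_bytes)

    audit = {
        "field_file": field_file,
        "sha256": hashlib.sha256(raw_bytes).hexdigest(),
        "written_at": datetime.now(timezone.utc).isoformat(),
        # Enough context to interpret (or reproduce) the result later:
        "solver": run_config.get("solver"),
        "mesh": run_config.get("mesh"),
        "input_checksums": run_config.get("input_checksums", {}),
    }
    with open(field_file + ".audit.json", "w") as f:
        json.dump(audit, f, indent=2)

# Hypothetical usage:
# write_field_with_audit("pressure_t0100.bin", data,
#                        {"solver": "cfd-v2.3", "mesh": "wing_rev7"})
```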

8 e.g. key driver of HPC – Intelligence
- Monitoring the communications of other governments has evolved into monitoring the communications of millions of individuals and groups
  - voice/SMS/email/Skype/Twitter/forums/etc.
- Plus data from non-communications sensors
- Plus people-movement data (border control)
- Probably more data than can be collected, let alone processed coherently into assessments

9 Three example key drivers
- Three key technical and “political” drivers of HPC
- Each requires major increases in compute
  - e.g. each can describe a clear need for exascale
- In each case the data aspects are at least as important to success as the compute

10 And yet... in supercomputer centres
- For most of my time in HPC management, “massive data” has been the next big thing...
  - Data-intensive computing
  - Data explosion
  - ...

11 And yet... in supercomputer centres
- ... and the last thing on anyone's mind in procurements
  - How many CPUs? How fast? What’s the price? (oh, and it has some disk stuff too? – that’s good)

12 The challenge – 1: cultural/political
- Recognise data, not FLOPS, as the primary act of HPC?
  - What data are we collecting?
  - What do we need out?
  - What are the requirements for curation, provenance, etc.?
  - Design the compute engine to enable this
- Community priorities
  - e.g. the data theme left until the breakfast session on the last day...

13 Example – SKA
- State the scientific need
- Design the sensor → huge data collected
- Recognize user output requirements
  - Useful images with sufficient underlying data to enable further analysis
- Design the data processing at each stage to deliver this (rough arithmetic follows below)
  - ~PF at each sensor to manage the live data bandwidth
  - ~EF centrally to process global (stored?) data
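
The PF/EF split falls out of simple scaling arithmetic: per-sensor work grows with the data rate, while central correlation grows with the number of sensor pairs. The sketch below uses invented numbers, not actual SKA design figures.

```python
# Rough scaling arithmetic behind "~PF at each sensor, ~EF centrally".
sensors = 1_000
per_sensor_flops = 1e15          # ~1 PF to reduce each live data stream

# Central correlation compares sensor streams pairwise, so its cost
# scales with the number of pairs rather than the number of sensors.
pairs = sensors * (sensors - 1) // 2
flops_per_pair = 1e12            # assumed cost of correlating one pair

central_flops = pairs * flops_per_pair
print(f"pairs: {pairs:,}")                     # 499,500
print(f"central: {central_flops:.1e} FLOP/s")  # ~5.0e+17, approaching an EF
```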

14 The challenge – 2: technical
- Curation (a sketch follows below)
  - Do we know what is there?
  - Can we read it a specified lifetime later?
  - Is the original platform obsolete?
- Provenance
  - how the data was created, validity/limitations, ownership, etc.
- Assurance
  - security, reliability, etc.
- Comprehension
  - processing, data mining, visualization, etc.
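
The curation questions (“do we know what is there? can we read it later?”) have a simple mechanical core: record what you stored, and verify it later. A minimal fixity-check sketch follows, with a hypothetical JSON manifest format.

```python
import hashlib
import json
import os

def build_manifest(root, manifest_path):
    """Record every file under root with its checksum, so a future
    curator can verify the archive is complete and unchanged."""
    entries = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            entries[os.path.relpath(path, root)] = digest
    with open(manifest_path, "w") as f:
        json.dump(entries, f, indent=2, sort_keys=True)

def verify_manifest(root, manifest_path):
    """Return files that are missing or whose contents have changed."""
    with open(manifest_path) as f:
        expected = json.load(f)
    problems = []
    for rel, digest in expected.items():
        path = os.path.join(root, rel)
        if not os.path.exists(path):
            problems.append((rel, "missing"))
            continue
        with open(path, "rb") as f:
            if hashlib.sha256(f.read()).hexdigest() != digest:
                problems.append((rel, "changed"))
    return problems
```

Run on a schedule, anything `verify_manifest` reports has either left the archive or silently changed; this addresses fixity only, not format obsolescence.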

15 Summary
- The explosion in data volume is not news
- But it is real
- The obstacles to extracting insight from increasing volumes may be cultural/political before they are technical
- An opportunity (both scientific and commercial) as much as a problem
- But either way – it is a challenge

16 Summary
- The HPC community (vendors, centres, funding bodies, users) finally appears to be taking software as seriously as hardware for the exascale roadmap
- Can we step up to the same need for the data challenge?
- Will we ever see software, data, and hardware as three balanced design axes in exascale computing technology for innovation, insight, and new science?

A Research Councils UK High End Computing Service
Capability Science. NAG HPC expertise.