Programming models for data-intensive computing

A multi-dimensional problem
Sophistication of the target user
– N(data analysts) > N(computational scientists)
Level of expressivity
– A high level is important for interactive analysis
Volume of data
– The complex gigabyte vs. the enormous petabyte
Scale and nature of platform
– How important are reliability, failure handling, etc.?
– What QoS is needed? Where is it enforced?

Separating concerns
What things carry over from conventional HPC?
– Parallel file systems, collective I/O, workflow, MPI, OpenMP, PETSc, ESMF, etc.
What things carry over from conventional data management?
– The need for abstractions and data-level APIs: R, SPSS, MATLAB, SQL, NetCDF, HDF, Kepler, Taverna (see the sketch below)
– Streaming databases and streaming data systems
What is unique to “data HPC”?
– New needs at the platform level
– New tradeoffs between the high level and the platform
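To make “data-level APIs” concrete, here is a minimal sketch using the HDF5 Python bindings (h5py). The file name, dataset shape, and attribute are illustrative assumptions, not anything from the slides:

```python
# A minimal sketch of a data-level API: HDF5 via h5py.
# File name, shape, and attributes are invented for illustration.
import numpy as np
import h5py

# Write: the application thinks in named, typed, multi-dimensional
# datasets rather than byte offsets in a file.
with h5py.File("climate.h5", "w") as f:
    temps = f.create_dataset("temperature", shape=(365, 180, 360),
                             dtype="f4", chunks=True, compression="gzip")
    temps[0] = np.random.rand(180, 360).astype("f4")
    temps.attrs["units"] = "K"

# Read back a slice: the library maps the logical selection onto
# chunked, compressed storage behind the abstraction.
with h5py.File("climate.h5", "r") as f:
    day0 = f["temperature"][0, :, :]
    print(day0.shape, f["temperature"].attrs["units"])
```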

Current models
Data-parallel
– A space of data objects
– A set of operators on those objects
Streaming
Scripting
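A toy rendering of the data-parallel model, assuming partitioned lists as the “space of data objects” and map/reduce as the operators. The names are invented for illustration and do not correspond to any particular framework:

```python
# A space of data objects (partitions) plus operators applied
# uniformly across them; iteration is sequential here to keep the
# sketch runnable, but each partition could live on a different node.
from functools import reduce

def par_map(fn, partitions):
    # Apply fn to every record in every partition.
    return [[fn(x) for x in part] for part in partitions]

def par_reduce(fn, partitions, zero):
    # Reduce within each partition, then combine the partial results;
    # this is legal when fn is associative.
    partials = [reduce(fn, part, zero) for part in partitions]
    return reduce(fn, partials, zero)

# Example: mean of squares over a partitioned dataset.
data = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
squares = par_map(lambda x: x * x, data)
total = par_reduce(lambda a, b: a + b, squares, 0)
count = par_reduce(lambda a, b: a + b, par_map(lambda x: 1, data), 0)
print(total / count)  # mean of the squared values
```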

Conclusions
Current HPC programming models fail to address important data-intensive needs
An urgent need for a careful gap analysis aimed at identifying important things that cannot [easily] be done with current tools
– Ask people for their “top 20” questions
– Ethnographic studies
A need to revisit the “stack” from the perspective of data-intensive HPC apps

Programming models for data-intensive computing
Will the flat message-passing model scale to >1M cores?
How does multi-level parallelism (e.g., GPUs) impact data-intensive computing (DIC)?
MapReduce, Dryad, Swift: what apps do they support?
– How suited are they for PDEs?
How will 1K-core PCs change DIC?
Powerful data-centric programming primitives to express high-level parallelism in a natural way while shielding physical configuration issues: what do we need?
If we design a supercomputer for DIC, what are the requirements?
What if storage controllers allow application-level control? Permit cross-layer control
New frameworks for reliability and availability (go beyond checkpointing)
How will different models and frameworks interoperate?
How do we support people who want large shared memory?

Programming models
Data parallel
– MapReduce
Loosely synchronized chunks of work
– Dryad, Swift, scripting (see the sketch below)
Libraries
– e.g., Ntropy
Expressive power vs. scale
BigTable (HBase)
Streaming, online
Dataflow
What operators for data-intensive computing? (beyond map/reduce)
– Sum, Average, …
Two main models
– Data parallel
– Streaming
Goal: “use within 30 minutes; still discovering new power in 2 years’ time”
Integration with programming environments
Working remotely with large datasets
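One way to picture the “loosely synchronized chunks of work” style that Dryad/Swift-like systems support, approximated here with a local thread pool. The stage structure and data are invented:

```python
# Chunks run independently within a stage; synchronization happens
# only at stage boundaries, not per message as in fine-grained MPI.
from concurrent.futures import ThreadPoolExecutor

def preprocess(chunk):
    return [x * 2 for x in chunk]

def analyze(chunk):
    return sum(chunk)

chunks = [[1, 2], [3, 4], [5, 6]]
with ThreadPoolExecutor() as pool:
    # Stage 1: independent tasks, no coordination between chunks.
    cleaned = list(pool.map(preprocess, chunks))
    # Stage 2: starts only after stage 1's results are available.
    results = list(pool.map(analyze, cleaned))
print(results)  # [6, 14, 22]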

Dataset (see the sketch below)
– Put it in the time domain and the frequency domain, plot the result
Multiple levels of abstraction? All-pairs.
Note that there are many ways to express things at the high level; the challenge is implementing them
“Users don’t want to compile anymore”
Who are we targeting? Specialists or generalists?
Focus on the need for rapid decision making
Composable models
Dimensions of the problem
– Level of expressivity
– Volume of data
– Scale of platform: reliability, failure, etc.
Gauge the size of the problem you are being asked to solve
QoS guarantees
Ability to practice on smaller datasets
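A minimal sketch of the time-domain/frequency-domain/plot workflow, assuming a synthetic NumPy signal in place of a real dataset:

```python
# Time domain -> frequency domain -> plot, on a synthetic signal.
import numpy as np
import matplotlib.pyplot as plt

t = np.linspace(0, 1, 1024, endpoint=False)        # time axis (s)
signal = np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 120 * t)

freqs = np.fft.rfftfreq(t.size, d=t[1] - t[0])     # frequency axis (Hz)
spectrum = np.abs(np.fft.rfft(signal))              # magnitude spectrum

fig, (ax1, ax2) = plt.subplots(2, 1)
ax1.plot(t, signal)
ax1.set_xlabel("time (s)")
ax2.plot(freqs, spectrum)
ax2.set_xlabel("frequency (Hz)")
plt.show()
```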

Types of data + nature of the operators
Select, e.g., on a spatial region; temporal operators
Data scrubbing: data transposition, transforms
Data normalization
Statistical analysis operators
Look at LINQ
Aggregation: combine
Smart segregation to fit on the hardware
Need to deal with distributed data
– e.g., column-oriented stores can help with that (see the sketch below)
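A small illustration of select, normalization, and aggregation over a column-oriented layout (one NumPy array per column). The schema and values are invented; the point is that a spatial select touches only the coordinate columns before the aggregate reads the value column:

```python
# Column-oriented layout: one array per column instead of row records.
import numpy as np

lat = np.array([10.1, 42.5, 43.0, 41.8])
lon = np.array([20.0, -71.1, -70.9, -71.3])
temp = np.array([301.2, 285.4, 284.9, 286.1])

# Select on a spatial region (a bounding box): reads only lat/lon.
mask = (lat > 41) & (lat < 44) & (lon > -72) & (lon < -70)

# Normalization and a statistical aggregate over the selection.
selected = temp[mask]
normalized = (selected - selected.mean()) / selected.std()
print(selected.mean(), normalized)
```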

Moving forward
Ethnographic studies (e.g., Borgman)
Ask for people’s top 20 questions/scenarios
– Astronomers
– Environmental science
– Chemistry …
– …
E.g., SciDB is reaching out to communities

DIC hardware architecture
Different compute-to-I/O balance (see the arithmetic sketch below)
– 0.1 B/flop for a supercomputer (“all memory to disk in 5 minutes” is an unrealizable goal)
– Assume that it should be greater: Amdahl
– See Alex Szalay’s paper
– GPU-like systems but with more memory per core
– Future streaming rates: what are they?
– Innovative networking: data routing
– Heterogeneous systems, perhaps (e.g., M vs Ws)
Reliability: where is it implemented?
– What about software failures?
– A special OS?
New ways of combining hardware and software?
– Within a system, and/or between systems
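A back-of-envelope check of these balance figures. The machine sizes below are assumptions chosen only to make the arithmetic concrete; the 0.1 B/flop figure is the slide’s own rough number:

```python
# Compute/I-O balance arithmetic, illustrative sizes only.
peak_flops = 1e15            # assume a petaflop-class machine
bytes_per_flop = 0.1         # the slide's supercomputer balance figure
io_bandwidth = peak_flops * bytes_per_flop
print(f"I/O bandwidth at 0.1 B/flop: {io_bandwidth / 1e12:.0f} TB/s")

# "All memory to disk in 5 minutes" as a checkpoint-bandwidth target:
memory_bytes = 1e15          # assume ~1 PB of aggregate memory
checkpoint_bw = memory_bytes / (5 * 60)
print(f"Checkpoint bandwidth needed: {checkpoint_bw / 1e12:.2f} TB/s")
```

Even under these modest assumptions the required sustained disk bandwidth is far beyond what storage systems deliver, which is the sense in which the 5-minute goal is unrealizable.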

Modeling
“Query estimation” and status monitoring for DIC applications

1000-core PCs
– Increase the data management problem
– Enable a wider range of users to do DIC
– A more complex memory hierarchy (“200 mems”)
– We’ll have amazing games with realistic physics

Infinite bandwidth
– Do everything in the cloud

MapReduce-related thoughts
– MR is library-based; this makes optimization more difficult
– Type checking. Annotations. (see the sketch below)
– Are there opportunities for optimization if we incorporate these ideas into extensible languages?
– Ways to enforce/leverage/enable domain-specific semantics
– Interoperability/portability?
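One hedged sketch of how annotations might expose semantics that a library-based MapReduce cannot see. The `associative` marker is invented for illustration: a runtime that knows a reducer is associative can legally insert per-partition combiners or reorder the reduction tree.

```python
# Invented annotation exposing domain semantics to the runtime.
from functools import reduce

def associative(fn):
    fn.is_associative = True   # metadata a smarter runtime could inspect
    return fn

@associative
def add(a, b):
    return a + b

def run_reduce(fn, partitions):
    if getattr(fn, "is_associative", False):
        # Optimization: reduce each partition locally (a combiner),
        # then combine the partial results.
        partials = [reduce(fn, p) for p in partitions]
        return reduce(fn, partials)
    # Fallback: a single global, in-order reduction.
    return reduce(fn, [x for p in partitions for x in p])

print(run_reduce(add, [[1, 2], [3, 4, 5]]))  # 15
```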

Most important ideas
– How badly current practice works: current HPC practice fails for DIC
– Make it easier for the domain scientist; enable new types of science
– Gap analysis: articulate what we can do with MPI and MR, what we can’t do with either, and why
– Propagating information between layers