Programming models for data-intensive computing
A multi-dimensional problem
Sophistication of the target user
  – N(data analysts) > N(computational scientists)
Level of expressivity
  – High level important for interactive analysis
Volume of data
  – The complex gigabyte vs. the enormous petabyte
Scale and nature of platform
  – How important are reliability, failure, etc.?
  – What QoS needs? Where enforced?
Separating concerns
What things carry over from conventional HPC?
  – Parallel file systems, collective I/O, workflow, MPI, OpenMP, PETSc, ESMF, etc.
What things carry over from conventional data?
  – Need for abstractions and data-level APIs: R, SPSS, MATLAB, SQL, NetCDF, HDF, Kepler, Taverna
  – Streaming databases, streaming data systems
What is unique to “data HPC”?
  – New needs at the platform level
  – New tradeoffs between the high level and the platform
Current models
Data-parallel
  – A space of data objects
  – A set of operators on those objects
Streaming
Scripting
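The first two models can be contrasted in a few lines. This is a minimal sketch, assuming Python; the function names `data_parallel_map` and `streaming_running_mean` are invented for illustration and do not come from any of the systems discussed. The data-parallel version applies one operator independently to every object in a collection; the streaming version sees each record once and keeps only running state.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Iterator, List


def data_parallel_map(op: Callable[[float], float],
                      objects: List[float],
                      workers: int = 4) -> List[float]:
    """Data-parallel: apply one operator to every object, independently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in input order.
        return list(pool.map(op, objects))


def streaming_running_mean(records: Iterable[float]) -> Iterator[float]:
    """Streaming: consume records one at a time, keeping O(1) state."""
    total, count = 0.0, 0
    for x in records:
        total += x
        count += 1
        yield total / count  # mean of everything seen so far


squares = data_parallel_map(lambda x: x * x, [1.0, 2.0, 3.0, 4.0])
means = list(streaming_running_mean([2.0, 4.0, 6.0]))
```

The data-parallel form needs the whole collection up front but can be split across workers freely; the streaming form never holds more than one record plus its running totals.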
Conclusions
Current HPC programming models fail to address important data-intensive needs
An urgent need for a careful gap analysis aimed at identifying important things that cannot [easily] be done with current tools
  – Ask people for their “top 20” questions
  – Ethnographic studies
A need to revisit the “stack” from the perspective of data-intensive HPC apps
Programming models for data-intensive computing
Will the flat message-passing model scale to >1M cores?
How does multi-level parallelism (e.g., GPUs) impact data-intensive computing (DIC)?
MapReduce (MR), Dryad, Swift: what apps do they support?
  – How suited are they for PDEs?
How will 1K-core PCs change DIC?
Powerful data-centric programming primitives to express high-level parallelism in a natural way while shielding physical configuration issues: what do we need?
If we design a supercomputer for DIC, what are the requirements?
What if storage controllers allowed application-level control?
  – Would permit cross-layer control
New frameworks for reliability and availability (going beyond checkpointing)
How will different models and frameworks interoperate?
How do we support people who want large shared memory?
Programming models
Data parallel
  – MapReduce
Loosely synchronized chunks of work
  – Dryad, Swift, scripting
Libraries
  – e.g., Ntropy
Expressive power vs. scale
BigTable (HBase)
Streaming, online
Dataflow
What operators for data-intensive computing? (beyond MapReduce)
  – Sum, Average, …
Two main models
  – Data parallel
  – Streaming
Goal: “use within 30 minutes; still discovering new power in 2 years’ time”
Integration with programming environments
Working remotely with large datasets
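Sum and Average look similar but behave differently as data-parallel operators: partial sums can be combined directly, whereas an average of chunk averages is wrong when chunks differ in size. A minimal sketch, assuming Python and hypothetical helper names (`map_chunk`, `combine`, `average`), carries a (sum, count) pair per chunk so the combine step stays associative:

```python
from functools import reduce
from typing import List, Tuple

Partial = Tuple[float, int]  # (sum, count) for one chunk of data


def map_chunk(chunk: List[float]) -> Partial:
    """Map step: summarize one chunk locally."""
    return (sum(chunk), len(chunk))


def combine(a: Partial, b: Partial) -> Partial:
    """Reduce step: associative and commutative, so order does not matter."""
    return (a[0] + b[0], a[1] + b[1])


def average(chunks: List[List[float]]) -> float:
    total, count = reduce(combine, map(map_chunk, chunks), (0.0, 0))
    if count == 0:
        raise ValueError("average of an empty dataset is undefined")
    return total / count


# Chunks of unequal size, as they might be spread over different nodes.
chunks = [[1.0, 2.0, 3.0], [4.0], [5.0, 6.0]]
result = average(chunks)
```

Because `combine` is associative, the partial results can be merged in any grouping, which is what lets a framework schedule the reduction as a tree rather than a sequential pass.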
Dataset
  – Put in time domain, freq domain, plot the result
Multiple levels of abstraction?
All-pairs
Note that there are many ways to express things at the high level; the challenge is implementing them
“Users don’t want to compile anymore”
Who are we targeting? Specialists or generalists?
Focus on need for rapid decision making
Composable models
Dimensions of problem
  – Level of expressivity
  – Volume of data
  – Scale of platform
  – Reliability, failure, etc.
Gauge the size of the problem you are asking to solve
QoS guarantees
Ability to practice on smaller datasets
Types of data + nature of the operators
Select, e.g., on spatial region; temporal operators
Data scrubbing: data transposition, transforms
Data normalization
Statistical analysis operators
Look at LINQ
Aggregation
  – Combine
Smart segregation to fit on the hardware
Need to deal with distributed data
  – e.g., column-oriented stores can help with that
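Three of the operators listed here (select, normalize, aggregate) compose naturally into a pipeline. The sketch below assumes Python; the `Reading` record and its fields are a hypothetical schema invented for illustration, and min-max scaling is just one possible normalization.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List, Set


@dataclass(frozen=True)
class Reading:
    region: str   # hypothetical spatial key
    t: int        # hypothetical time step
    value: float


def select(rows: Iterable[Reading], regions: Set[str],
           t_min: int, t_max: int) -> Iterator[Reading]:
    """Select on a spatial region and a time window."""
    return (r for r in rows if r.region in regions and t_min <= r.t <= t_max)


def normalize(rows: Iterable[Reading]) -> List[Reading]:
    """Min-max scale values into [0, 1] (needs a full pass, so it materializes)."""
    rows = list(rows)
    if not rows:
        return []
    lo = min(r.value for r in rows)
    hi = max(r.value for r in rows)
    span = (hi - lo) or 1.0  # avoid dividing by zero when all values are equal
    return [Reading(r.region, r.t, (r.value - lo) / span) for r in rows]


def aggregate_mean(rows: Iterable[Reading]) -> Dict[str, float]:
    """Aggregate: mean value per region."""
    sums: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for r in rows:
        sums[r.region] += r.value
        counts[r.region] += 1
    return {k: sums[k] / counts[k] for k in sums}


data = [
    Reading("north", 0, 10.0), Reading("north", 1, 20.0),
    Reading("south", 1, 30.0), Reading("south", 5, 99.0),
    Reading("east", 1, 50.0),
]
summary = aggregate_mean(normalize(select(data, {"north", "south"}, 0, 2)))
```

Note that `select` streams while `normalize` needs global minima and maxima, so even this small pipeline mixes operators with different data-movement costs.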
What things carry over from conventional HPC?
  – Parallel file systems, collective I/O, workflow, MPI, OpenMP, PETSc, ESMF, etc.
What things carry over from conventional data?
  – Need for abstractions and data-level APIs: R, SPSS, MATLAB, SQL, NetCDF, HDF, Kepler, Taverna
What is unique to data HPC?
Moving forward
Ethnographic studies (e.g., Borgman)
Ask for people’s top 20 questions/scenarios
  – Astronomers
  – Environmental science
  – Chemistry …
  – …
E.g., see how SciDB is reaching out to communities
DIC hardware architecture
Different compute-I/O balance
  – 0.1 B/flop for supercomputer (“all mem to disk in 5 mins” is an unrealizable goal)
  – Assume that it should be greater: Amdahl
  – See Alex Szalay paper
  – GPU-like systems but with more memory per core
  – Future streaming rates: what are they?
  – Innovating networking: data routing
  – Heterogeneous systems perhaps, e.g., M vs Ws
Reliability
  – Where is it implemented?
  – What about software failures?
  – A special OS?
New ways of combining hardware and software?
  – Within a system, and/or between systems
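The balance question can be made concrete with back-of-envelope arithmetic. The sketch below assumes Python, and every machine figure in it (100 TB of memory, 1 Pflop/s) is invented for illustration rather than taken from any real system. It also assumes one reading of the ratios involved: I/O bytes per second divided by flops per second, and Amdahl's rule of thumb that a balanced system delivers about one bit of I/O per second per instruction per second.

```python
def required_bandwidth(mem_bytes: float, seconds: float) -> float:
    """Bytes/s needed to write all of memory to disk within `seconds`."""
    return mem_bytes / seconds


def io_bytes_per_flop(io_bytes_per_s: float, flops_per_s: float) -> float:
    """I/O balance expressed as bytes of I/O per floating-point operation."""
    return io_bytes_per_s / flops_per_s


def amdahl_number(io_bytes_per_s: float, ops_per_s: float) -> float:
    """Bits of I/O per second per operation per second (about 1.0 = balanced)."""
    return 8.0 * io_bytes_per_s / ops_per_s


# Hypothetical machine, chosen only to show the arithmetic.
MEM_BYTES = 100e12        # 100 TB of aggregate memory (assumed)
FLOPS = 1e15              # 1 Pflop/s peak (assumed)
DUMP_SECONDS = 5 * 60     # the "all mem to disk in 5 mins" target

bw = required_bandwidth(MEM_BYTES, DUMP_SECONDS)
ratio = io_bytes_per_flop(bw, FLOPS)
balance = amdahl_number(bw, FLOPS)

print(f"bandwidth needed: {bw / 1e9:.0f} GB/s")
print(f"I/O bytes per flop: {ratio:.2e}")
print(f"Amdahl number: {balance:.2e}")
```

Under these assumed figures, even a full memory dump in five minutes corresponds to an I/O ratio far below one bit per operation, which is the sense in which a compute-oriented machine is unbalanced for data-intensive work.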
Modeling
“Query estimation” and status monitoring for DIC applications
1000-core PCs
Increases data management problem
Enables a wider range of users to do DIC
More complex memory hierarchy: 200 mems
We’ll have amazing games with realistic physics
Infinite bandwidth
Do everything in the cloud
MapReduce-related thoughts
MR is library-based, which makes optimization more difficult
Type checking
Annotations
Are there opportunities for optimization if we incorporate ideas into extensible languages?
Ways to enforce/leverage/enable domain-specific semantics
Interoperability/portability?
Most important ideas
Current HPC practice works badly for DIC
Make it easier for the domain scientist; enable new types of science
Gap analysis: articulate what we can do with MPI and MR; what we can’t do with either, and why
Propagating information between layers