Combinatorial Scientific Computing: A View to the Future

Bruce Hendrickson
Senior Manager for Math & Computer Science
Sandia National Laboratories, Albuquerque, NM
University of New Mexico, Computer Science Dept.

Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.

Combinatorial Scientific Computing

The development, application, and analysis of combinatorial algorithms to enable scientific and engineering computations.

Highlighted areas from a survey talk I composed in 2003:
– Sparse matrices (direct & iterative methods)
– Optimization & derivatives
– Parallel computing
– Mesh generation
– Statistical physics
– Chemistry
– Biology

A Brief History of CSC

Grew out of a series of minisymposia at SIAM meetings.
– Deeper origins in:
  – The sparse direct methods community (1950s and onward)
  – Statistical physics – graphs and Ising models (1940s & 1950s)
  – Chemical classification (1800s, Cayley)
– Recognition of a common aesthetic, techniques, and goals among researchers who were far apart in the traditional scientific taxonomy.

Name selected in 2002:
– After lengthy discussion among ~30 people.
– Now ~3000 hits for "combinatorial scientific computing" on Google.

Previous Milestones

This is the 4th major CSC workshop:
– SIAM '04 (with Parallel Processing); organizers J. Gilbert, B. Hendrickson, A. Pothen, H. Simon, S. Toledo
– CERFACS '05
– SIAM '07 (with Computational Science & Engineering)
– Coming soon: SIAM '09 (with Applied Linear Algebra), SIAM '11 (with Optimization) (?)

Special issue of ETNA in 2004.

Importance recognized by the scientific community and funding agencies.

Invited Speakers from Earlier CSC Workshops

– Richard Brualdi (combinatorial matrix theory)
– Dan Gusfield (computational biology)
– Shang-Hua Teng (smoothed analysis of algorithms)
– Stan Eisenstat (sparse direct methods)
– Dan Halperin (geometric algorithms)
– Denis Trystram (parallel scheduling)
– Iain Duff (sparse direct methods)
– Phil Duxbury (statistical physics)

Outline

A look back:
– A brief history of a brief history

A look ahead:
– New application opportunities: data-centric computing
  – Graph models of information retrieval
  – Emerging science of complex networks
– Architectural revolution: challenges and promise
  – Challenges of near-future machines
  – Potential architectures for discrete problems

Conclusions

Data-Centric Computing

Many science disciplines generate huge data sets:
– Biology, astronomy, high-energy physics, environmental science, social sciences (internet data), etc.

Important scientific knowledge lurks within this data. What abstractions and algorithms are needed?

Claim:
– Combinatorial algorithms have an important role to play.
– "Combinatorial problems generated by challenges in data mining and related topics are now central to computational science." [I. Beichl & F. Sullivan, 2008]

Example 1: Information Retrieval

Consider a document corpus:
– Each document is a "bag of words".
– Represent the corpus as a non-negative t × d term/document matrix A, with t terms and d documents.
– A(i,j) encodes the frequency of term i in document j.
– A set of terms in a query can be thought of as a vector q.
– Large entries in A^T q identify good matches for retrieval.
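To make the matrix picture concrete, here is a minimal sketch in Python of bag-of-words retrieval. The toy corpus, whitespace tokenization, and raw term counts are illustrative assumptions, not from the talk.

```python
import numpy as np

docs = ["graph algorithms for sparse matrices",
        "sparse direct methods for linear systems",
        "social network graph analysis"]
terms = sorted({w for d in docs for w in d.split()})
t_index = {w: i for i, w in enumerate(terms)}

# A is the t x d term/document matrix: A[i, j] = frequency of term i in doc j.
A = np.zeros((len(terms), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        A[t_index[w], j] += 1

# A query is a vector over the same terms; large entries of A^T q are matches.
q = np.zeros(len(terms))
for w in "sparse graph".split():
    q[t_index[w]] += 1

scores = A.T @ q
print(scores.argsort()[::-1])   # documents ranked by relevance
```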

Latent Semantic Analysis

LSA uses the truncated SVD for dimension reduction:
– A ≈ U_k Σ_k V_k^T

A retrieval query now becomes:
– A^T q ≈ V_k Σ_k U_k^T q

Widely used idea to reduce noise and reduce query expense. [Deerwester, et al., 1990]

The basic idea has many applications:
– Image recognition, machine translation, pattern recognition, etc.
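A sketch of the LSA projection above, using NumPy's dense SVD; the rank k and the reuse of the toy A and q from the previous sketch are assumptions for illustration (production systems would use a sparse, truncated solver instead).

```python
import numpy as np

def lsa_scores(A, q, k):
    """Score documents against query q in a rank-k latent space."""
    # Truncated SVD: A ~= U_k Sigma_k V_k^T (dense SVD, keep k factors).
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
    # Retrieval in the reduced space: A^T q ~= V_k Sigma_k U_k^T q
    return Vt_k.T @ (s_k * (U_k.T @ q))

# e.g. lsa_scores(A, q, k=2) with A and q from the previous sketch
```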

Graph Based Alternative

View the term-document matrix as a bipartite graph:
– Terms and documents have weighted links if they are related.

Embed the graph in a low-dimensional space using (for example) Laplacian eigenvectors. Given a query vector, map it to the same space and look for nearby documents.
– Fiedler retrieval [H., 2007]

Algebraically, this involves the low eigenvectors of the Laplacian of the bipartite graph,

    L = [ D_1   -A  ]
        [ -A^T  D_2 ]

whereas LSA involves the low eigenvectors of

    [ 0    A ]
    [ A^T  0 ]

(D_1 and D_2 are the diagonal degree matrices of the terms and documents.)
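Here is a minimal sketch of the Laplacian embedding just described, under the same toy setup; the embedding dimension and the use of a dense eigensolver are illustrative choices, not from the talk.

```python
import numpy as np

def fiedler_embedding(A, dim=2):
    """Embed terms and documents together via low Laplacian eigenvectors."""
    t, d = A.shape
    # Adjacency of the bipartite term-document graph: W = [[0, A], [A^T, 0]].
    W = np.block([[np.zeros((t, t)), A],
                  [A.T, np.zeros((d, d))]])
    L = np.diag(W.sum(axis=1)) - W     # graph Laplacian L = D - W
    vals, vecs = np.linalg.eigh(L)     # eigenvalues in ascending order
    # Skip the trivial constant eigenvector; keep the next `dim` lowest.
    return vecs[:, 1:dim + 1]          # first t rows: terms; last d: documents
```

Because terms and documents land in the same space, a query can be mapped there as well, and nearby documents are the matches.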

Advantages of Graph Representation

Terms & documents live in the same space:
– Principled method for adding doc-doc or term-term similarities, e.g. the former from a dictionary, the latter from citation analysis or hyperlinks.
– Unified text and link analysis.

Supports more complex queries:
– "Similar to these documents and these terms."

Supports extensions to more classes of objects:
– E.g., instead of just term-document, could do term-document-author.

Example II: Network Science

Graphs are ideal for representing entities and relationships. Rapidly growing use in social, environmental, and other sciences.

[Figures: Zachary's karate club (|V| = 34) – "the way it was" – and a Twitter social network (|V| ≈ 20K) – "the way it is now".]

New Questions

New algorithms:
– Community detection, centrality, graph generation, etc.
– The right set of questions and concepts is still emerging.

New issues:
– Noisy, error-filled data. What can we conclude robustly?
– Semantic graphs with edges and vertices of different types (e.g. people, organizations, events). How should this be exploited algorithmically? Multilinear instead of linear algebra?

New paradigms:
– E.g. the graph evolves over time: temporal analysis, dynamics, streaming algorithms on graphs, etc.

Enormous opportunities for combinatorial algorithms.
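As a small concrete example of two of these questions, the sketch below runs off-the-shelf community detection and centrality on Zachary's karate club graph (shown earlier). The choice of NetworkX and of these particular algorithms is an illustrative assumption, not from the talk.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.karate_club_graph()              # |V| = 34

# Community detection via greedy modularity maximization.
for i, c in enumerate(greedy_modularity_communities(G)):
    print(f"community {i}: {sorted(c)}")

# Centrality: which vertex lies on the most shortest paths?
bc = nx.betweenness_centrality(G)
print("most central vertex:", max(bc, key=bc.get))
```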

Outline

A look back:
– A brief history of a brief history

A look ahead:
– New application opportunities: data-centric computing
  – Graph models of information retrieval
  – Emerging science of complex networks
– Architectural revolution: challenges and promise
  – Challenges of near-future machines
  – Potential architectures for discrete problems

Conclusions

A Renaissance in Architecture Research

Good news:
– Moore's Law marches on.
– Real estate on a chip is essentially free.
– Major paradigm change – a huge opportunity for innovation.

Bad news:
– Power considerations limit the improvement in clock speed.
– Eventual consequences are unclear.

Current response: multicore processors.
– The computation/communication ratio will get worse.
– Makes life harder for applications.

Applications Also Getting More Complex

Leading-edge scientific applications increasingly include:
– Adaptive, unstructured data structures
– Complex, multiphysics simulations
– Multiscale computations in space and time
– Complex synchronizations (e.g. discrete events)

Significant parallelization challenges on today's machines:
– Finite degree of coarse-grained parallelism
– Load balancing and memory hierarchy optimization
– Dramatically harder on millions of cores

Huge need for new algorithmic ideas – CSC will be critical.

Architectural Challenges for Graph Algorithms

Runtime is dominated by latency:
– Particularly true for data-centric applications
– Random accesses to a global address space, perhaps many at once (fine-grained parallelism)
– Essentially no computation to hide access time

Access pattern is data dependent:
– Prefetching unlikely to help (see the toy sketch after this slide)
– Usually only want a small part of a cache line

Potentially abysmal locality at all levels of the memory hierarchy.
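The prefetching point can be seen even from Python with a toy pointer chase (my illustration, not from the talk): when each load's address depends on the result of the previous load, the hardware cannot run ahead. The gap is far larger in a compiled language, where interpreter overhead does not mask memory latency.

```python
import random
import time

N = 1 << 22                              # ~4M-element chains

# Sequential chain: the next index is predictable, prefetch-friendly.
nxt_seq = [(i + 1) % N for i in range(N)]

# Random single-cycle chain: the next index is data dependent.
order = list(range(N))
random.shuffle(order)
nxt_rand = [0] * N
for a, b in zip(order, order[1:] + order[:1]):
    nxt_rand[a] = b

def chase(nxt):
    i = 0
    for _ in range(N):
        i = nxt[i]                       # address of next load depends on this one
    return i

for name, chain in (("sequential", nxt_seq), ("random", nxt_rand)):
    t0 = time.perf_counter()
    chase(chain)
    print(f"{name}: {time.perf_counter() - t0:.2f} s")
```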

Locality Challenges

[Figure comparing the locality of three classes of codes: what we traditionally care about, what industry cares about, and emerging codes.]

From: Murphy and Kogge, "On the Memory Access Patterns of Supercomputer Applications: Benchmark Selection and Its Implications," IEEE Transactions on Computers, July 2007.

Example: AMD Opteron

[Figure, built up over several slides: an annotated Opteron die photo. The annotations group the chip into latency avoidance (L1 I-cache, L1 D-cache, L2 cache), latency tolerance (memory controller; I-fetch, scan, align; load/store unit; out-of-order execution; load/store and memory/coherency logic), memory and I/O interfaces (bus, DDR, HT), and finally the execution units (FPU and integer execution), labeled "COMPUTER". Thanks to Thomas Sterling.]

Architectural Wish List for Graphs

– Low latency / high bandwidth – for small messages!
– Latency tolerant
– Light-weight synchronization mechanisms
– Global address space:
  – No graph partitioning required
  – Avoids a memory-consuming profusion of ghost nodes
  – No local/global numbering conversions

One machine with these properties is the Cray MTA-2 – and its successor, the XMT.

How Does the MTA Work?

Latency tolerance via massive multithreading:
– Context switch in a single tick
– Global address space, hashed to reduce hot spots
– No cache or local memory
– Multiple outstanding loads: a remote memory request doesn't stall the processor; other streams work while your request gets fulfilled
– Light-weight, word-level synchronization minimizes conflicts and enables parallelism
– Flexible dynamic load balancing

Notes:
– 220 MHz clock
– Largest machine is 40 processors

Case Study I: MTA-2 vs. BlueGene

With LLNL, implemented s-t shortest paths in MPI and ran it on the IBM/LLNL BlueGene/L, the world's fastest computer. Finalist for the 2005 Gordon Bell Prize.
– 4B-vertex, 20B-edge Erdős-Rényi random graph
– Analysis touches about 200K vertices
– Time: 1.5 seconds on 32K processors

Ran a similar problem on the MTA-2:
– 32 million vertices, 128 million edges
– Measured: touches about 23K vertices
– Time: 0.7 seconds on one processor, 0.09 seconds on 10 processors

Conclusion: 4 MTA-2 processors = 32K BlueGene/L processors. [Berry, H., Kahan, Konecny, 2007]

Case Study II: Single-Source Shortest Path

[Plot: SSSP time (s) vs. number of processors for PBGL and the MTA.]

Parallel Boost Graph Library (PBGL):
– Lumsdaine, et al., on an Opteron cluster
– Some graph algorithms can scale on some inputs

PBGL-MTA comparison on SSSP:
– Erdős-Rényi random graph (|V| = 2^28)
– PBGL SSSP can scale on non-power-law graphs
– Order of magnitude speed difference
– 2 orders of magnitude efficiency difference
– Big difference in power consumption

[Lumsdaine, Gregor, H., Berry, 2007]

Longer Term Architectural Opportunities

Near-future trends:
– Multithreading to tolerate latencies
– XMT-like capability on commodity machines?
– Potential big impact on latency-dominated applications (e.g. graphs)

Further out:
– Application-specific circuitry, e.g. hashing, feature detection, etc.
– Reconfigurable hardware? Adapt circuits to the application at run time.

Lots of new combinatorial problems in these alternative computing models.

Conclusions

CSC is in robust health – growing in breadth, depth, impact, and visibility.

Trends in science play to our strengths:
– Growing complexity of traditional applications requires more CSC: unstructured, adaptive meshes; bigger problems; multiphysics; optimization; etc.
– New science domains with combinatorial needs are emerging: social sciences, ecology, structural biology, etc.
– Many sciences are becoming more data-rich.
– Complex computers require new discrete algorithms. We can help applications on multicore nodes, and maybe influence future architectures. Enormous need for new models and algorithmic improvements.

It's a great time to be doing CSC!

Thanks

Cevdet Aykanat, Jon Berry, Rob Bisseling, Erik Boman, Bill Carlson, Ümit Çatalyürek, Edmond Chow, Karen Devine, Iain Duff, Danny Dunlavy, Alan Edelman, Jean-Loup Faulon, John Gilbert, Assefaw Gebremedhin, Mike Heath, Paul Hovland, Simon Kahan, Pat Knupp, Tammy Kolda, Gary Kumfert, Fredrik Manne, Mike Merrill, Richard Murphy, Esmond Ng, Ali Pınar, Cindy Phillips, Steve Plimpton, Alex Pothen, Robert Preis, Padma Raghavan, Steve Reinhardt, Suzanne Rountree, Rob Schreiber, Viral Shah, Jonathan Shewchuk, Horst Simon, Dan Spielman, Shang-Hua Teng, Sivan Toledo, Keith Underwood, etc.