Graphs, Data Mining, and High Performance Computing
Bruce Hendrickson
Sandia National Laboratories, Albuquerque, NM
University of New Mexico, Computer Science Dept.
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

Outline
– High performance computing
– Why current approaches can’t work for data mining
– Test case: graphs for knowledge representation
– High performance graph algorithms, an oxymoron?
– Implications for broader data mining community
– Future trends

Data Mining and High Performance Computing
“We can only consider simple algorithms”
– Data too big for anything but O(n) algorithms
– Often have some kind of real-time constraints
This greatly limits the kinds of questions we can address
– Terascale data gives different insights than gigascale data
– Current search capabilities are wonderful, but innately limited
Can high-performance computing make an impact?
– What if our algorithms ran 100x faster and could use 100x more memory? 1000x?
Assertion: Quantitative improvements in capabilities result in qualitative changes in the science that can be done.

Modern Computers
Fast processors, slow memory
Use memory hierarchy to keep processor fed
– Stage some data in smaller, faster memory (cache)
– Can dramatically enhance performance
But only if accesses have spatial or temporal locality
– Use accessed data repeatedly, or use nearby data next
Parallel computers are collections of these
– Pivotal to have a processor own most data it needs
Memory patterns determine performance
– Processor speed hardly matters
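To make the locality point concrete, here is a minimal C++ sketch (mine, not from the talk) contrasting a cache-friendly sequential sweep with a pointer-chasing sweep over the same amount of data. The two loops do the same arithmetic, but on cache-based machines the second is typically limited by memory latency rather than processor speed.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sequential sweep: consecutive addresses, so cache lines and hardware
// prefetchers are used effectively (spatial locality).
std::uint64_t sum_sequential(const std::vector<std::uint64_t>& a) {
    std::uint64_t s = 0;
    for (std::uint64_t x : a) s += x;
    return s;
}

// Pointer-chasing sweep: each element stores the index of the next one,
// so every load depends on the previous load and addresses are scattered.
// Same work, but the data-dependent access pattern defeats caching and
// prefetching, and each miss stalls the core for a full memory round trip.
std::uint64_t sum_chase(const std::vector<std::size_t>& next,
                        const std::vector<std::uint64_t>& value) {
    std::uint64_t s = 0;
    std::size_t i = 0;
    for (std::size_t steps = 0; steps < next.size(); ++steps) {
        s += value[i];
        i = next[i];   // address of the next load is not known until now
    }
    return s;
}
```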

High Performance Computing
Largely the purview of science and engineering communities
– Machines, programming models, & algorithms to serve their needs
– Can these be utilized by learning and data mining communities?
– Search companies make great use of parallelism for simple things, but not general purpose
Goals
– Large (cumulative) core for holding big data sets
– Fast and scalable performance of complex algorithms
– Ease of programmability

Algorithms We’ve Seen This Week
Hashing (of many sorts)
Feature detection
Sampling
Inverse index construction
Sparse matrix and tensor products
Training
Clustering
All of these involve
– complex memory access patterns
– only small amounts of computation
Performance dominated by latency – waiting for data
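As one concrete illustration (mine, not the talk's), a sparse matrix-vector product in compressed sparse row form does roughly one multiply-add per word fetched, and the gather through the column index array jumps unpredictably through memory, so the loop spends its time waiting for data rather than computing.

```cpp
#include <cstddef>
#include <vector>

// y = A * x with A stored in compressed sparse row (CSR) form.
// The nonzeros of row i occupy positions row_ptr[i] .. row_ptr[i+1]-1.
void spmv_csr(const std::vector<std::size_t>& row_ptr,
              const std::vector<std::size_t>& col,
              const std::vector<double>& val,
              const std::vector<double>& x,
              std::vector<double>& y) {
    const std::size_t n = row_ptr.size() - 1;
    for (std::size_t i = 0; i < n; ++i) {
        double sum = 0.0;
        for (std::size_t k = row_ptr[i]; k < row_ptr[i + 1]; ++k) {
            sum += val[k] * x[col[k]];   // indirect, data-dependent access into x
        }
        y[i] = sum;                      // ~2 flops per matrix entry fetched
    }
}
```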

Architectural Challenges
Runtime is dominated by latency
– Lots of indirect addressing, pointer chasing, etc.
– Perhaps many at once
Very little computation to hide memory costs
Access pattern can be data dependent
– Prefetching unlikely to help
– Usually only want small part of cache line
Potentially abysmal locality at all levels of memory hierarchy
– Bad serial and abysmal parallel performance

Graphs for Knowledge Representation
Graphs can capture rich semantic structure in data
More complex than “bag of features”
Examples:
– Protein interaction networks
– Web pages with hyperlinks
– Semantic web
– Social networks, etc.
Algorithms of interest include
– Connectivity (of various sorts)
– Clustering and community detection
– Common motif discovery
– Pattern matching, etc.
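As a small, self-contained taste of this algorithm class (an illustration of mine, not code from the talk), the sketch below counts the triangles through each vertex of a graph held in a compressed adjacency structure; per-vertex triangle counts are the raw material for clustering coefficients and the simplest kind of motif discovery.

```cpp
#include <cstddef>
#include <vector>

// Minimal compressed-sparse-row adjacency structure: the neighbors of v are
// adj[xadj[v]] .. adj[xadj[v+1]-1], stored in ascending order.
struct Graph {
    std::vector<std::size_t> xadj;
    std::vector<std::size_t> adj;
    std::size_t num_vertices() const { return xadj.size() - 1; }
};

// Count the triangles through each vertex -- the simplest "motif", and an
// ingredient of clustering-coefficient style community measures.
std::vector<std::size_t> triangle_counts(const Graph& g) {
    std::vector<std::size_t> tri(g.num_vertices(), 0);
    for (std::size_t v = 0; v < g.num_vertices(); ++v) {
        for (std::size_t i = g.xadj[v]; i < g.xadj[v + 1]; ++i) {
            std::size_t u = g.adj[i];
            if (u <= v) continue;                 // visit each edge (v,u) once
            // Common neighbors of v and u close triangles on edge (v,u);
            // find them by merging the two sorted neighbor lists.
            std::size_t a = g.xadj[v], b = g.xadj[u];
            while (a < g.xadj[v + 1] && b < g.xadj[u + 1]) {
                if (g.adj[a] < g.adj[b])      ++a;
                else if (g.adj[b] < g.adj[a]) ++b;
                else { ++tri[v]; ++tri[u]; ++tri[g.adj[a]]; ++a; ++b; }
            }
        }
    }
    // Each triangle was discovered once from each of its three edges.
    for (auto& t : tri) t /= 3;
    return tri;
}
```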

Semantic Graph Example

Finding Threats = Subgraph Isomorphism
Image source: T. Coffman, S. Greenblatt, S. Marcus, “Graph-based technologies for intelligence analysis,” CACM 47(3), March 2004, pp. 45-47.

Omar Khadr (at Guantanamo)
Mohammed Jabarah (Canadian citizen handed over to US authorities on suspicion of links to 9/11)
Thanks to Kevin McCurley

Graph-Based Informatics: Data
Graphs can be enormous
– High performance computing may be needed for memory and performance
Graphs are highly unstructured
– High variance in number of neighbors
– Little or no locality
– Not partitionable
– Experience with scientific computing graphs of limited utility
Terrible locality in memory access patterns

Desirable Architectural Features
Low latency / high bandwidth
– For small messages!
Latency tolerant
Light-weight synchronization mechanisms
Global address space
– No data partitioning required
– Avoid memory-consuming profusion of ghost-nodes
– No local/global numbering conversions
One machine with these properties is the Cray MTA-2
– And successor XMT

Massive Multithreading: The Cray MTA-2
Slow clock rate (220 MHz)
128 “streams” per processor
Global address space
Fine-grain synchronization
Simple, serial-like programming model
Advanced parallelizing compilers
Latency tolerant: important for graph algorithms

Cray MTA Processor Each thread can have 8 memory refs in flight Round trip to memory ~150 cycles No Processor Cache! Hashed Memory!
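A rough back-of-the-envelope reading of these numbers (my arithmetic, using only the figures on these two slides and the idealizing assumption of one instruction issued per cycle) shows why the design tolerates latency instead of avoiding it:

```latex
% Concurrency needed to keep one processor busy across a memory round trip:
%   ~150 cycles of latency x 1 instruction issued per cycle
%   => on the order of 150 ready instructions from independent streams.
% Concurrency the hardware can sustain:
%   128 streams x 8 outstanding memory references per stream.
\text{needed} \approx 150\ \text{cycles} \times 1\,\tfrac{\text{instr}}{\text{cycle}} = 150,
\qquad
\text{available} = 128\ \text{streams} \times 8\,\tfrac{\text{refs}}{\text{stream}} = 1024 .
```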

How Does the MTA Work?
Latency tolerance via massive multi-threading
– Context switch in a single tick
– Global address space, hashed to reduce hot-spots
– No cache or local memory. Context switch on memory request.
– Multiple outstanding loads
Remote memory request doesn’t stall processor
– Other streams work while your request gets fulfilled
Light-weight, word-level synchronization
– Minimizes access conflicts
Flexibly supports dynamic load balancing
Notes:
– MTA-2 is 7 years old
– Largest machine is 40 processors
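The word-level synchronization mentioned above works through a full/empty bit that the MTA attaches to every memory word in hardware. As a purely illustrative software analogy (my sketch, not how the machine is actually programmed), the same protocol can be mimicked with an atomic state word; the names readFE and writeEF echo the usual MTA/XMT-style intrinsics.

```cpp
#include <atomic>
#include <thread>

// Software analogy of a full/empty synchronized word.  readFE blocks until
// the word is full, returns its value, and leaves it empty; writeEF blocks
// until the word is empty, stores a value, and leaves it full.  On the MTA
// the "blocking" costs nothing because the processor simply switches to
// another ready stream; here we can only yield the OS thread.
struct SyncWord {
    enum : int { EMPTY = 0, FULL = 1, LOCKED = 2 };
    std::atomic<int> state{EMPTY};
    long value = 0;

    long readFE() {
        for (;;) {
            int expected = FULL;
            if (state.compare_exchange_weak(expected, LOCKED)) {
                long v = value;                        // exclusive while LOCKED
                state.store(EMPTY, std::memory_order_release);
                return v;
            }
            std::this_thread::yield();                 // MTA: zero-cost stream switch
        }
    }

    void writeEF(long v) {
        for (;;) {
            int expected = EMPTY;
            if (state.compare_exchange_weak(expected, LOCKED)) {
                value = v;
                state.store(FULL, std::memory_order_release);
                return;
            }
            std::this_thread::yield();
        }
    }
};
```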

Case Study: MTA-2 vs. BlueGene/L
With LLNL, implemented s-t shortest paths in MPI
Ran on IBM/LLNL BlueGene/L, world’s fastest computer
Finalist for 2005 Gordon Bell Prize
– 4B vertex, 20B edge, Erdős-Rényi random graph
– Analysis: touches about 200K vertices
– Time: 1.5 seconds on 32K processors
Ran similar problem on MTA-2
– 32 million vertices, 128 million edges
– Measured: touches about 23K vertices
– Time: 0.7 seconds on one processor, 0.09 seconds on 10 processors
Conclusion: 4 MTA-2 processors = 32K BlueGene/L processors
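For orientation, the kernel being measured is in essence the following breadth-first search for the s-to-t distance in an unweighted graph (a serial reference sketch of mine; the MPI and MTA codes in the study are of course far more elaborate). The "touches about 200K / 23K vertices" figures refer to how little of the graph such a search actually visits.

```cpp
#include <cstddef>
#include <queue>
#include <vector>

// Unweighted s-t shortest path by breadth-first search.
// adj[v] lists the neighbors of vertex v; returns -1 if t is unreachable.
long st_shortest_path(const std::vector<std::vector<std::size_t>>& adj,
                      std::size_t s, std::size_t t) {
    std::vector<long> dist(adj.size(), -1);      // -1 means "not yet visited"
    std::queue<std::size_t> frontier;
    dist[s] = 0;
    frontier.push(s);
    while (!frontier.empty()) {
        std::size_t v = frontier.front();
        frontier.pop();
        if (v == t) return dist[t];              // stop once the target is reached
        for (std::size_t u : adj[v]) {
            if (dist[u] < 0) {
                dist[u] = dist[v] + 1;
                frontier.push(u);                // only the region between s and t is touched
            }
        }
    }
    return -1;
}
```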

But Speed Isn’t Everything
Unlike MTA code, MPI code limited to Erdős-Rényi graphs
– Can’t support power-law graphs, which are pervasive in informatics
MPI code is 3 times larger than MTA-2 code
– Took considerably longer to develop
MPI code can only solve this very special problem
– MTA code is part of general and flexible infrastructure
MTA easily supports multiple, simultaneous users
But … MPI code runs everywhere
– MTA code runs only on MTA/Eldorado and on serial machines

Multithreaded Graph Software Design
Build generic infrastructure for core operations including…
– Breadth-first search (e.g. short paths)
– Distributed local searches (e.g. subgraph isomorphism)
– Rich filtering operations (numerous applications)
Separate basic kernels from instance specifics
Infrastructure is challenging to write
– Parallelization & performance challenges reside in infrastructure
– Must port to multiple architectures
But with infrastructure in place, application development is highly productive and portable

Customizing Behavior: Visitors
Idea from BOOST (Lumsdaine)
Application programmer writes small visitor functions
– Get invoked at key points by basic infrastructure, e.g. when a new vertex is visited, etc.
– Adjust behavior or copy data; build tailored knowledge products
For example, with one breadth-first-search routine, you can…
– Find short paths
– Construct spanning trees
– Find connected components, etc.
Architectural dependence is hidden in infrastructure
– Applications programming is highly productive
Use just enough C++ for flexibility, but not too much
Note: Code runs on serial Linux, Windows, Mac machines
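A minimal sketch of the visitor idea (the hook names discover_vertex and tree_edge are illustrative, echoing the Boost GL style the slide cites; this is not the actual Eldorado/MTGL interface, and it is serial where the real kernel is multithreaded):

```cpp
#include <cstddef>
#include <queue>
#include <vector>

// A breadth-first-search kernel parameterized by a visitor.  The hook names
// (discover_vertex, tree_edge) are illustrative only.
template <typename Visitor>
void bfs(const std::vector<std::vector<std::size_t>>& adj,
         std::size_t source, Visitor& vis) {
    std::vector<bool> visited(adj.size(), false);
    std::queue<std::size_t> q;
    visited[source] = true;
    vis.discover_vertex(source);
    q.push(source);
    while (!q.empty()) {
        std::size_t v = q.front();
        q.pop();
        for (std::size_t u : adj[v]) {
            if (!visited[u]) {
                visited[u] = true;
                vis.tree_edge(v, u);        // hook: called once per BFS tree edge
                vis.discover_vertex(u);     // hook: called once per reached vertex
                q.push(u);
            }
        }
    }
}

// One possible visitor: record a BFS spanning tree through parent pointers.
struct SpanningTreeVisitor {
    std::vector<std::size_t> parent;
    explicit SpanningTreeVisitor(std::size_t n) : parent(n) {}
    void discover_vertex(std::size_t) {}
    void tree_edge(std::size_t from, std::size_t to) { parent[to] = from; }
};
```

Swapping in a different visitor (one that records distances, or component labels) reuses the same kernel unchanged, which is the productivity argument of this slide; in the Eldorado infrastructure the kernel is also where all of the parallelism lives.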

Eldorado Graph Infrastructure: C++ Design Levels (layered diagram)
– Analyst support: “Visitor” class – gets parallelism for free
– Algorithms programmer: Algorithm class
– Infrastructure programmer: Graph class, data structure class – gives parallelism, hides most concurrency
Inspired by Boost GL, but not Boost GL

Kahan’s Algorithm for Connected Components

Infrastructure Implementation of Kahan’s Algorithm
Infrastructure kernels: Search (tricky), Shiloach-Vishkin CRCW (tricky)
Visitors: Kahan’s Phase I visitor, Phase II visitor (trivial), Phase III visitor (trivial)
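For reference, the Shiloach-Vishkin kernel named here is a hook-and-shortcut method for connected components. A simplified serial rendering (my sketch; the real thing is a CRCW PRAM algorithm, and Kahan's code embeds it in the multi-phase structure above) looks like this:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Simplified, serial rendering of the hook-and-shortcut idea behind
// Shiloach-Vishkin connected components.  Returns a component label per vertex.
std::vector<std::size_t>
connected_components(std::size_t n,
                     const std::vector<std::pair<std::size_t, std::size_t>>& edges) {
    std::vector<std::size_t> parent(n);
    for (std::size_t v = 0; v < n; ++v) parent[v] = v;

    bool changed = true;
    while (changed) {
        changed = false;
        // Hook: point the larger of the two current labels at the smaller one.
        for (const auto& e : edges) {
            std::size_t pu = parent[e.first], pv = parent[e.second];
            if (pu == pv) continue;
            if (pu < pv) parent[pv] = pu; else parent[pu] = pv;
            changed = true;
        }
        // Shortcut: pointer-jump every vertex to its current root.
        for (std::size_t v = 0; v < n; ++v)
            while (parent[v] != parent[parent[v]])
                parent[v] = parent[parent[v]];
    }
    return parent;   // parent[v] is the component label of v
}
```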

Infrastructure Implementation of Kahan’s Algorithm
Phase I: “component” values start “empty”; make them “full.” Wait until both endpoints are “full,” then add the pair to the hash table.

Traceview Output for Infrastructure Impl. of Kahan’s CC algorithm

More General Filtering: The “Bully” Algorithm

“Bully” Algorithm Implementation
Traverse edge “e” if we would anyway, or if this test returns true [or, and, replace]
Lock the destination vertex while testing
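One plausible reading of that filter, sketched below as an illustration (my reconstruction, not the actual MTA/MTGL code): each vertex carries a component label, a search traverses an edge it would traverse anyway, or it "bullies" its way into an already-claimed destination when its own label is smaller, and the test-and-replace on the destination label must be atomic, which is what "lock dest while testing" provides on the MTA. A compare-and-swap loop stands in for the hardware word lock here.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Illustrative "bully" filter for a concurrent search.  label[v] holds the
// id of the search that owns v; an unvisited vertex can be encoded with a
// sentinel larger than any real id, so the same comparison also covers the
// ordinary "traverse if unvisited" case.
struct BullyFilter {
    std::vector<std::atomic<std::size_t>>& label;  // shared per-vertex labels
    std::size_t my_label;                          // id of the traversing search

    // Return true if the search should traverse the edge into `dest`.
    bool visit_test(std::size_t dest) {
        std::size_t seen = label[dest].load(std::memory_order_relaxed);
        while (my_label < seen) {
            // Atomically replace the larger label with ours; on failure,
            // `seen` is refreshed with the current value and we re-test.
            if (label[dest].compare_exchange_weak(seen, my_label,
                                                  std::memory_order_acq_rel))
                return true;   // we now own dest and keep searching from it
        }
        return false;          // dest already has an equal or smaller label
    }
};
```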

Traceview Output for the Bully Algorithm

MTA-2 Scaling of Connected Components [scaling plot]
Power-law graph (highly unstructured); labeled run times include 5.41s and 2.91s.

Computational Results: Subgraph Isomorphism

A Renaissance in Architecture
Bad news
– Power considerations limit the improvement in clock speed
Good news
– Moore’s Law marches on
Real estate on a chip is no longer at a premium
– On a processor, much is already memory control
– Tiny bit is computing (e.g. floating point)
The future is not like the past…

Example: AMD Opteron [annotated die diagram, built up across several slides]
– Memory (latency avoidance): L2 cache, L1 I-cache, L1 D-cache
– Memory (latency tolerance): memory controller, I-fetch, scan, align, load/store unit, out-of-order execution, load/store, memory/coherency
– Memory and I/O interfaces: bus, DDR, HT
– “COMPUTER”: FPU execution, integer execution
Thanks to Thomas Sterling

Consequences
Current response: stamp out more processors
– Multicore processors. Not very imaginative.
– Makes life worse for most of us
Near future trends
– Multithreading to tolerate latencies
– MTA-like capability on commodity machines
Potentially big impact on data-centric applications
Further out
– Application-specific circuitry, e.g. hashing, feature detection, etc.
– Reconfigurable hardware? Adapt circuits to the application at run time

Summary
Massive multithreading has great potential for data mining & learning
Software development is challenging
– correctness
– performance
Well designed infrastructure can hide many of these challenges
– Once built, infrastructure enables high productivity
Potential to become mainstream. Stay tuned…

Acknowledgements
– Jon Berry
– Simon Kahan, Petr Konecny (Cray)
– David Bader, Kamesh Madduri (Ga. Tech) (MTA s-t connectivity)
– Will McClendon (MPI s-t connectivity)