UC Berkeley Par Lab Overview 2 David Patterson Par Lab’s original research “bets” Software platform: data center + mobile client Let compelling applications.

Slides:



Advertisements
Similar presentations
EECS Electrical Engineering and Computer Sciences B ERKELEY P AR L AB P A R A L L E L C O M P U T I N G L A B O R A T O R Y EECS Electrical Engineering.
Advertisements

MicroKernel Pattern Presented by Sahibzada Sami ud din Kashif Khurshid.
RAMP Gold : An FPGA-based Architecture Simulator for Multiprocessors Zhangxi Tan, Andrew Waterman, David Patterson, Krste Asanovic Parallel Computing Lab,
Using MapuSoft Instead of OS Vendor’s Simulators.
Emerging of Software Technologies Reporter: Jeremie Lucero.
1 Hardware Support for Isolation Krste Asanovic U.C. Berkeley MURI “DHOSA” Site Visit April 28, 2011.
Android Platform Overview (1)
Master/Slave Architecture Pattern Source: Pattern-Oriented Software Architecture, Vol. 1, Buschmann, et al.
1 Lawrence Livermore National Laboratory By Chunhua (Leo) Liao, Stephen Guzik, Dan Quinlan A node-level programming model framework for exascale computing*
The road to reliable, autonomous distributed systems
Introduction to Operating Systems CS-2301 B-term Introduction to Operating Systems CS-2301, System Programming for Non-majors (Slides include materials.
Parallel Applications Parallel Hardware Parallel Software 1 The Parallel Computing Laboratory Krste Asanovic, Ras Bodik, Jim Demmel, Tony Keaveny, Kurt.
Parallel Applications Parallel Hardware Parallel Software IT industry Users 1 Par Lab Overview Dave Patterson Parallel Computing Laboratory (Par Lab) U.C.
Contiki A Lightweight and Flexible Operating System for Tiny Networked Sensors Presented by: Jeremy Schiff.
1 Dr. Frederica Darema Senior Science and Technology Advisor NSF Future Parallel Computing Systems – what to remember from the past RAMP Workshop FCRC.
CUDA Programming Lei Zhou, Yafeng Yin, Yanzhi Ren, Hong Man, Yingying Chen.
Chapter 13 Embedded Systems
Tessellation OS Architecting Systems Software in a ManyCore World John Kubiatowicz UC Berkeley
CAD and Design Tools for On- Chip Networks Luca Benini, Mark Hummel, Olav Lysne, Li-Shiuan Peh, Li Shang, Mithuna Thottethodi,
1 7 Questions for Parallelism Applications: 1. What are the apps? 2. What are kernels of apps? Hardware: 3. What are the HW building blocks? 4. How to.
Final Presentation Spring 2003 Project ID: D0822 Project Name: WinCE integrating BT media share application Supervisor: Evgeny Rivkin Performed by: Maya.
5 th Biennial Ptolemy Miniconference Berkeley, CA, May 9, 2003 MESCAL Application Modeling and Mapping: Warpath Andrew Mihal and the MESCAL team UC Berkeley.
Copyright Arshi Khan1 System Programming Instructor Arshi Khan.
Operating Systems Concepts 1. A Computer Model An operating system has to deal with the fact that a computer is made up of a CPU, random access memory.
Asst.Prof.Dr.Ahmet Ünveren SPRING Computer Engineering Department Asst.Prof.Dr.Ahmet Ünveren SPRING Computer Engineering Department.
Client/Server Architectures
SEC(R) 2008 Intel® Concurrent Collections for C++ - a model for parallel programming Nikolay Kurtov Software and Services.
Course Outline DayContents Day 1 Introduction Motivation, definitions, properties of embedded systems, outline of the current course How to specify embedded.
Lecture 29 Fall 2006 Lecture 29: Parallel Programming Overview.
Computer System Architectures Computer System Software
German National Research Center for Information Technology Research Institute for Computer Architecture and Software Technology German National Research.
ICOM 5995: Performance Instrumentation and Visualization for High Performance Computer Systems Lecture 7 October 16, 2002 Nayda G. Santiago.
1 Hardware Security Mechanisms Krste Asanovic U.C. Berkeley August 20, 2009.
Eric Keller, Evan Green Princeton University PRESTO /22/08 Virtualizing the Data Plane Through Source Code Merging.
Compiler BE Panel IDC HPC User Forum April 2009 Don Kretsch Director, Sun Developer Tools Sun Microsystems.
Uncovering the Multicore Processor Bottlenecks Server Design Summit Shay Gal-On Director of Technology, EEMBC.
Programming Models & Runtime Systems Breakout Report MICS PI Meeting, June 27, 2002.
Tessellation: Space-Time Partitioning in a Manycore Client OS Rose Liu 1,2, Kevin Klues 1, Sarah Bird 1, Steven Hofmeyr 3, Krste Asanovic 1, John Kubiatowicz.
4.2.1 Programming Models Technology drivers – Node count, scale of parallelism within the node – Heterogeneity – Complex memory hierarchies – Failure rates.
HPC User Forum Back End Compiler Panel SiCortex Perspective Kevin Harris Compiler Manager April 2009.
Issues Autonomic operation (fault tolerance) Minimize interference to applications Hardware support for new operating systems Resource management (global.
Writing Systems Software in a Functional Language An Experience Report Iavor Diatchki, Thomas Hallgren, Mark Jones, Rebekah Leslie, Andrew Tolmach.
MAPLD 2005/254C. Papachristou 1 Reconfigurable and Evolvable Hardware Fabric Chris Papachristou, Frank Wolff Robert Ewing Electrical Engineering & Computer.
Next Generation Operating Systems Zeljko Susnjar, Cisco CTG June 2015.
Debugging parallel programs. Breakpoint debugging Probably the most widely familiar method of debugging programs is breakpoint debugging. In this method,
CS 460/660 Compiler Construction. Class 01 2 Why Study Compilers? Compilers are important – –Responsible for many aspects of system performance Compilers.
GPUs: Overview of Architecture and Programming Options Lee Barford firstname dot lastname at gmail dot com.
Application Software System Software.
Programmability Hiroshi Nakashima Thomas Sterling.
MPEG-4: Multimedia Coding Standard Supporting Mobile Multimedia System Lian Mo, Alan Jiang, Junhua Ding April, 2001.
Published in ACM SIGPLAN, 2010 Heidi Pan MassachusettsInstitute of Technology Benjamin Hindman UC Berkeley Krste Asanovi´c UC Berkeley 1.
Chapter 1 Basic Concepts of Operating Systems Introduction Software A program is a sequence of instructions that enables the computer to carry.
+ Advances in the Parallelization of Music and Audio Applications Eric Battenberg, David Wessel & Juan Colmenares.
Background Computer System Architectures Computer System Software.
3/12/07CS Visit Days1 A Sea Change in Processor Design Uniprocessor SpecInt Performance: From Hennessy and Patterson, Computer Architecture: A Quantitative.
1 Chapter 2: Operating-System Structures Services Interface provided to users & programmers –System calls (programmer access) –User level access to system.
B ERKELEY P AR L AB Lithe Composing Parallel Software Efficiently PLDI  June 09, 2010 Heidi Pan, Benjamin Hindman, Krste Asanovic  {benh,
Tools and Libraries for Manycore Computing Kathy Yelick U.C. Berkeley and LBNL.
Heterogeneous Processing KYLE ADAMSKI. Overview What is heterogeneous processing? Why it is necessary Issues with heterogeneity CPU’s vs. GPU’s Heterogeneous.
Towards a High Performance Extensible Grid Architecture Klaus Krauter Muthucumaru Maheswaran {krauter,
Computer System Structures
Introduction to threads
Chapter 1 Introduction.
Mobile App Development
Chapter 1 Introduction.
Texas Instruments TDA2x and Vision SDK
Chapter 1 The Nature of Software
Chapter 1 The Nature of Software
CMPE419 Mobile Application Development
CMPE419 Mobile Application Development
Presentation transcript:

UC Berkeley Par Lab Overview 2 David Patterson

Par Lab’s original research “bets” Software platform: data center + mobile client Let compelling applications drive research agenda Identify common programming patterns Productivity versus efficiency programmers Autotuning and software synthesis Build correctness + power/performance diagnostics into stack OS/Architecture support applications, provide primitives not pre-packaged solutions FPGA simulation of new parallel architectures: RAMP Above all, no preconceived big idea – see what works driven by application needs 3

Par Lab Research Overview 4 4 Personal Health Image Retrieval Hearing, Music Speech Parallel Browser Design Patterns/Motifs Sketching Legacy Code Schedulers Communication and Synch. Primitives Efficiency Language Compilers Easy to write correct programs that run efficiently on manycore Legacy OS Multicore/GPGPU OS Libraries & Services ParLab Manycore/RAMP Hypervisor OS Arch. Productivity Layer Efficiency Layer Correctness Applications Composition & Coordination Language (C&CL) Parallel Libraries Parallel Frameworks Static Verification Dynamic Checking Debugging with Replay Directed Testing Autotuners C&CL Compiler/Interpreter Efficiency Languages Type Systems Diagnosing Power/Performance

Dominant Application Platforms Data Center or Cloud (“Server”) Laptop/Handheld (“Mobile Client”) Both together (“Server+Client”) New ParLab-RADLab collaborations Par Lab focuses on mobile clients But many technologies apply to data center 5 5

Music and Hearing Application (David Wessel) 6 Musicians have an insatiable appetite for computation + real-time demands More channels, instruments, more processing, more interaction! Latency must be low (5 ms) Must be reliable (No clicks!) 1. Music Enhancer Enhanced sound delivery systems for home sound systems using large microphone and speaker arrays Laptop/Handheld recreate 3D sound over ear buds 2. Hearing Augmenter Handheld as accelerator for hearing aid 3. Novel Instrument User Interface New composition and performance systems beyond keyboards Input device for Laptop/Handheld 6 Berkeley Center for New Music and Audio Technology (CNMAT) created a compact loudspeaker array: 10-inch- diameter icosahedron incorporating 120 tweeters.

Health Application: Stroke Treatment (Tony Keaveny) 7  Stroke treatment time-critical, need supercomputer performance in hospital  Goal: First true 3D Fluid-Solid Interaction analysis of Circle of Willis  Based on existing codes for distributed clusters

Content-Based Image Retrieval (Kurt Keutzer) 8 Built around Key Characteristics of personal databases Very large number of pictures (>5K) Non-labeled images Many pictures of few people Complex pictures including people, events, places, and objects 8 Relevance Feedback ImageDatabase Query by example Similarity Metric Candidate Results Final Result 1000’s of images

Robust Speech Recognition (Nelson Morgan) 9 Meeting Diarist Laptops/ Handhelds at meeting coordinate to create speaker identified, partially transcribed text diary of meeting 9 Use cortically-inspired manystream spatio-temporal features to tolerate noise

Parallel Browser (Ras Bodik) Goal: Desktop quality browsing on handhelds Enabled by 4G networks, better output devices Bottlenecks to parallelize Parsing, Rendering, Scripting 84ms 2ms

11 Compelling Apps in a Few Years Name Whisperer Built from Content Based Image Retrieval Like Presidential Aid Handheld scans face of approaching person Matches image database Whispers name in ear, along with how you know him

Architecting Parallel Software with Patterns (Kurt Keutzer/Tim Mattson) 12 Our initial survey of many applications brought out common recurring patterns: “Dwarfs” -> Motifs Computational patterns Structural patterns Insight: Successful codes have a comprehensible software architecture: Patterns give human language in which to describe architecture

Motif (nee “Dwarf”) Popularity (Red Hot  Blue Cool) Motif (nee “Dwarf”) Popularity (Red Hot  Blue Cool) 13 How do compelling apps relate to 12 motifs?

Architecting Parallel Software 14 Pipe-and-Filter Agent-and-Repository Event-based Bulk Synchronous MapReduce Layered Systems Arbitrary Task Graphs Decompose Tasks/Data Order tasksIdentify Data Sharing and Access Graph Algorithms Dynamic programming Dense/Spare Linear Algebra (Un)Structured Grids Graphical Models Finite State Machines Backtrack Branch-and-Bound N-Body Methods Circuits Spectral Methods Identify the Software Structure Identify the Key Computations

People, Patterns, and Frameworks Design PatternsFrameworks Application DeveloperUses application design patterns (e.g. feature extraction) to architect the application Uses application frameworks (e.g. CBIR) to implement the application Application-Framework Developer Uses programming design patterns (e.g. Map/Reduce) to architect the application framework Uses programming frameworks (e.g MapReduce) to implement the application framework

Productivity/Efficiency and Patterns 16 Domain-literate programming gurus (1% of the population) Application frameworks Parallel patterns and programming frameworks + Parallel programming gurus (1-10% of programmers) Parallel programming frameworks 3 2 Domain Experts End-user, application programs Application patterns and frameworks + 1 The hope is for Domain Experts to create parallel code with little or no understanding of parallel programming Leave hardcore “bare metal” efficiency-layer programming to the parallel programming experts

Par Lab Research Overview 17 Personal Health Image Retrieval Hearing, Music Speech Parallel Browser Design Patterns/Motifs Sketching Legacy Code Schedulers Communication & Synch. Primitives Efficiency Language Compilers Easy to write correct programs that run efficiently on manycore Legacy OS Multicore/GPGPU OS Libraries and Services ParLab Manycore/RAMP Hypervisor OS Arch. Productivity Layer Efficiency Layer Correctness Applications Composition & Coordination Language (C&CL) Parallel LibrariesParallel Frameworks Static Verification Dynamic Checking Debugging with Replay Directed Testing Autotuners C&CL Compiler/Interpreter Efficiency Languages Type Systems Diagnosing Power/Performance

Par Lab is Multi-Lingual Applications require ability to compose parallel code written in many languages and several different parallel programming models Let application writer choose language/model best suited to task High-level productivity code and low-level efficiency code Old legacy code plus shiny new code Correctness through all means possible Static verification, annotations, directed testing, dynamic checking Framework-specific constraints on non-determinism Programmer-specified semantic determinism Require common spec between languages for static checker Common linking format at low level (Lithe) not intermediate compiler form Support hand-tuned code and future languages & parallel models 18

Why Consider New Languages? Most of work is in runtime and libraries Do we need a language? And a compiler? If higher level syntax is needed for productivity We need a language If static analysis is needed to help with correctness We need a compiler (front-end) If static optimizations are needed to get performance We need a compiler (back-end) Will prototype frameworks in conventional languages, but investigate how new languages or pattern-specific compilers can improve productivity, efficiency, and/or correctness 19

Selective Embedded Just-In-Time Specialization (SEJITS) for Productivity Modern scripting languages (e.g., Python and Ruby) have powerful language features and are easy to use Idea: Dynamically generate source code in C within the context of a Python or Ruby interpreter, allowing app to be written using Python or Ruby abstractions but automatically generating, compiling C at runtime Like a JIT but Selective: Targets a particular method and a particular language/platform (C+OpenMP on multicore or CUDA on GPU) Embedded: Make specialization machinery productive by implementing in Python or Ruby itself by exploiting key features: introspection, runtime dynamic linking, and foreign function interfaces with language-neutral data representation 20

Selective Embedded Just-In-Time Specialization for Productivity Case Study: Stencil Kernels on AMD Barcelona, 8 threads Hand-coded in C/OpenMP: 2-4 days SEJITS in Ruby: 1-2 hours Time to run 3 stencil codes: 21 Hand-coded (seconds) SEJITS from cache (seconds) Extra JIT-time 1 st time executed (seconds)

Autotuning for Code Generation (Demmel, Yelick) 22 Search space for block sizes (dense matrix): Axes are block dimensions Temperature is speed Problem: generating optimal code like searching for needle in haystack Manycore  even more diverse New approach: “Auto-tuners” 1st generate program variations of combinations of optimizations (blocking, prefetching, …) and data structures Then compile and run to heuristically search for best code for that computer Examples: PHiPAC (BLAS), Atlas (BLAS), Spiral (DSP), FFT-W (FFT)

Anatomy of a Par Lab Application 23 Tessellation OS Productivity Programmer Efficiency Programmer Machine Generated System Libraries Legacy Parallel Library Parallel Framework Library Legacy Serial Code Lithe Parallel Runtime Autotuner Tuned Code Interface Productivity Language High- Level Code

From OS to User-Level Scheduling Tessellation OS allocates hardware resources (e.g., cores) at coarse-grain, and user software shares hardware threads co-operatively using Lithe ABI Lithe provides performance composability for multiple concurrent and nested parallel libraries Already supports linking of parallel OpenMP code with parallel TBB code, without changing legacy OpenMP/TBB code and without measurable overhead 24

Tessellation: Space-Time Partitioning for Manycore Client OS 25 Wireless radio Memory Media PlayerNetwork Driver Filesystem Browser Video decoder GUI Windows VM De-scheduled Partitions QoS Allocations

Tessellation Kernel Structure 26 Tessellation Kernel Partition Management Layer Hardware Partitioning Mechanisms CPUs Physical Memory Interconnect Bandwidth Cache Performance Counters Partition Mechanism Layer (Trusted) Application Or OS Service Custom Scheduler Library OS Functionality Configure HW-supported Communication Message Passing Configure Partition Resources enforced by HW at runtime Partition Allocator Partition Scheduler Comm. Reqs Sched Reqs. Partition Resizing Callback API Res. Reqs. Kernel being written in Ivy, a safe C dialect

Par Lab Architecture Architect a long-lived horizontal software platform for independent software vendors (ISVs) ISVs won’t rewrite code for each chip or system Customer buys application from ISV 8 years from now, wants to run on machine bought 13 years from now (and see improvements) 27 …instead, one type of multi-paradigm core Fat Cores (I nst LP) Thin Cores (T hread LP) Weird Cores (D ata LP) Weirder Cores (G ate LP) Not multiple paradigms of core Core L2U$ / LLC slice Scalar (ILP) L1D$ L1I$ “Macro” TLP across cores Lan e Vector-Thread Unit (DLP+TLP) “Micro” TLP within core Specialized functional units exploiting GLP System Interconnect

RAMP Gold Rapid accurate simulation of manycore architectural ideas using FPGAs Initial version models 64 cores of SPARC v8 with shared memory system on $750 board 28 Cost Performance (MIPS) Simulations per day Software Simulator $2, RAMP Gold $2,000 + $

Par Lab’s original research “bets” Software platform: data center + mobile client Let compelling applications drive research agenda Identify common programming patterns Productivity versus efficiency programmers Autotuning and software synthesis Build correctness + power/perf. diagnostics into stack OS/Architecture support applications, provide primitives not pre-packaged solutions FPGA simulation of new parallel architectures: RAMP Above all, no preconceived big idea – see what works driven by application needs To learn more: 29

Par Lab Research Overview 30 Easy to write correct programs that run efficiently on manycore Personal Health Image Retrieval Hearing, Music Speech Parallel Browser Design Patterns/Motifs Sketching Legacy Code Schedulers Communication & Synch. Primitives Efficiency Language Compilers Legacy OS Multicore/GPGPU OS Libraries & Services ParLab Manycore/RAMP Hypervisor OS Arch. Productivity Layer Efficiency Layer Correctness Applications Composition & Coordination Language (C&CL) Parallel LibrariesParallel Frameworks Static Verification Dynamic Checking Debugging with Replay Directed Testing Autotuners C&CL Compiler/Interpreter Efficiency Languages Type Systems Diagnosing Power/Performance