
GePSeA: A General Purpose Software Acceleration Framework for Lightweight Task Offloading
Ajeet Singh, Pavan Balaji, Wu-chun Feng
Dept. of Computer Science, Virginia Tech
Math. and Computer Science, Argonne National Laboratory

Hardware Offload Engines

Specialized engines
– Widely used in High-End Computing (HEC) systems for accelerating task processing
– Built for specific purposes: not easily extendable or programmable; serve a small niche of applications

Trends in HEC systems: increasing size and complexity
– Difficult for hardware to deal with this complexity:
  Fault tolerance (understanding application and system requirements is complex and too environment-specific)
  Multi-path communication (finding the optimal solution is NP-complete)

Pavan Balaji, Argonne National Laboratory. ICPP, 09/24/2009.

General Purpose Processors

Multi- and manycore processors
– Quad- and hex-core processors are commodity components today
– Intel Larrabee will have 16 cores; Intel Terascale will have 80 cores
– Simultaneous Multi-threading (SMT, or Hyperthreading) is becoming common

Expected future
– Each physical node will have a massive number of processing elements ("Terascale on the chip")

[Figure: future multicore systems]

Benefits of Multicore Architectures

Cost
– Cheap: Moore's law will continue to drive costs down
– HEC is a small market; multicores != HEC

Flexibility
– General purpose processing units
– A huge number of tools already exist to program and utilize them (e.g., debuggers, performance-measurement tools)
– Extremely flexible and extendable

Multicore vs. Hardware Accelerators

Will multicore architectures eradicate hardware accelerators?
– Unlikely: hardware accelerators have their own benefits
– Hardware accelerators provide two advantages:
  More processing power and better on-board memory bandwidth
  Dedicated processing capabilities: they run compute kernels in a dedicated manner and do not deal with shared-mode processing like CPUs
– But more complex machines need dedicated processing for more things
– More powerful hardware offload techniques are possible, but with decreasing returns

GePSeA: General Purpose Software Acceleration

Hybrid hardware-software engines
– Utilize hardware offload engines for the low-hanging fruit, where the return on investment is maximum
– Utilize software "onload" engines for more complex tasks; these can imitate some of the benefits of hardware offload engines, such as dedicated processing

Basic idea of GePSeA
– Dedicate a small subset of processing elements on a multicore architecture for "protocol onloading"
– Different capabilities are added to GePSeA as plugins
– The application interacts with GePSeA for dedicated processing

GePSeA: Basic Overview

GePSeA does not try to take over tasks performed well by hardware offload engines; it only supplements their capabilities.

[Figure: GePSeA sits between the application or software middleware and platforms with advanced, basic, or no intelligent hardware offload]

Previous Work: ProOnE

ProOnE was an initial design of GePSeA concentrating on communication onloading
– Studied issues related to onloading MPI internal protocols onto a dedicated core
– Showed significant performance benefits

This paper extends that work to more general purpose onloading
– Presents the design for such onloading and a case-study application

"ProOnE: A General Purpose Protocol Onload Engine for Multi- and Many-core Architectures", P. Lai, P. Balaji, R. Thakur and D. K. Panda, ISC 2009.

Presentation Roadmap

Introduction and Motivation
GePSeA: General Purpose Software Acceleration
Case Study: mpiBLAST
Experimental Results and Analysis
Conclusions and Future Work

GePSeA Infrastructure

P -> Application Process; A -> Accelerator
[Figure: application-accelerator interaction for a three-node cluster]

GePSeA: Intra-node Infrastructure

[Figure: two application processes send registration requests to the (logical) accelerator's message processing block, which feeds an intra-node service queue (higher priority) and an inter-node service queue (lower priority) above the GePSeA communication layer; completions are notified to all registered processes]

Overview of GePSeA Capabilities

[Figure: GePSeA core components and the application plug-ins layer over a high-speed network]
– Data Management Component: Distributed Data Caching, Data Streaming Service, Data Compression Engine, Distributed Data Sorting
– Memory Management Component: Global Memory Aggregator, Directory Services
– Output Processing Component: Reliable Advertising Service, Distributed Lock Management, Dynamic Load Balancing, Global Process-state Management, Bulletin Board Service, High-Speed Reliable UDP Data Transfer
– Application Plug-ins Layer: sits between the application and the core components

Data Management Core Components

Responsible for all data movement and the data processing associated with the application. GePSeA currently includes:
– Distributed Data Sorting: a data sorting service distributed across multiple nodes (incremental sorting and merging)
– Data Compression Engine: compresses and uncompresses data; converts output data to metadata and vice versa
– Distributed Data Streaming: responsible for prefetching data for the application from other nodes; swaps data

Memory Management Core Components

Allows processes to access the entire system memory as a single aggregate memory space. GePSeA currently includes:
– Global Memory Aggregator: keeps track of cluster-wide memory; maintains local-global address mappings; stores and retrieves data from global memory when local memory is committed
– Database Directory Services: maintains up-to-date information about the database fragments spread across the cluster

I/O Processing Core Components

Facilitates efficient output data processing; coordinates and synchronizes output data. GePSeA currently includes:
– Reliable Advertisement Service: a messaging service used by the communication layer
– Dynamic Load Balancing: responsible for balancing the load among nodes
– Global Process-state Management: maintains the current status of the nodes in the cluster at all times
– High-Speed Connectionless Data Transfer: provides high-speed, UDP-based data transfer

Presentation Roadmap

Introduction and Motivation
GePSeA: General Purpose Software Acceleration
Case Study: mpiBLAST
Experimental Results and Analysis
Conclusions and Future Work

mpiBLAST: Introduction (1/2)

Parallel genome sequence search application. Identifies regions of similarity between two sequences to discover functional, structural, or evolutionary relationships between them.

Image source: mpiblast.org

mpiBLAST: Introduction (2/2)

Database and query segmentation
Scatter-search-gather approach

[Figure: input queries distributed over database fragments]

mpiBLAST over GePSeA

mpiBLAST Plug-ins

Asynchronous Output Consolidation
– Uses the accelerator's capability to merge and sort the data distributed across multiple nodes in parallel
– Removes bottlenecks

Runtime Output Compression
– Our results show that BLAST output can be compressed to less than 10% of its original size
– Significantly reduces output-data transfer time over the network

Hot-Swap Database Fragments
– Swaps a node's DB fragment with another fragment on demand
– Useful for load balancing

Presentation Roadmap

Introduction and Motivation
GePSeA: General Purpose Software Acceleration
Case Study: mpiBLAST
Experimental Results and Analysis
Conclusions and Future Work

mpiBLAST over GePSeA: Committed Core

Running the accelerator on a "committed" core: speed-up over plain mpiBLAST.

[Figure: with the accelerator, processes P1-P4 and accelerator A1 share cores 1-4; without the accelerator, processes P1-P4 occupy cores 1-4 alone]

Worker Search Time

Output Consolidation

Output Compression

Presentation Roadmap

Introduction and Motivation
GePSeA: General Purpose Software Acceleration
Case Study: mpiBLAST
Experimental Results and Analysis
Conclusions and Future Work

Concluding Remarks and Future Work

Proposed, designed, and evaluated a general purpose software accelerator (GePSeA)
– Utilizes a small subset of the cores to onload complex tasks
Presented a detailed design of the GePSeA infrastructure
Onloaded the mpiBLAST application as a case study
– GePSeA-enabled mpiBLAST provides significant performance benefits
Future work:
– Study performance and scalability on large-scale systems

Thank You! Contacts: Pavan Balaji: Wu-chun Feng: