Tohoku University, Japan


jh170049-ISJ: Theory and Practice of Vector Processing for Data and Memory Centric Applications
Tohoku University, Japan / Moscow State University, Russia
Hiroaki Kobayashi (Graduate School of Information Sciences, Tohoku University), Akihiro Musa, Kazuhiko Komatsu, Ryusuke Egawa, Hiroyuki Takizawa; Vladimir Voevodin, Alexander Antonov, Dmitry Nikitenko, Alexey Teplov, Ilya Afanasyev

Background
The design and development of architectures for exaflop-level systems is one of the most challenging tasks in contemporary high-performance computing. One of the most promising ways to address it is vectorization, which raises CPU throughput. Vectorization has its origins in the linear memory structure of computers and in the wide use of data structures such as arrays. Vector operations make it much easier to specify compute blocks in terms of machine instructions: the code becomes compact, and at the same time the corresponding blocks run much more efficiently than those built on traditional approaches.

Implementations of many well-known algorithms, adapted for vector-pipeline supercomputers and designed for vector processing from the outset, can prove more effective by a large factor than implementations of the same algorithms for other architectures. The whole spectrum of such topics is to be studied and researched as part of the project.

Vector data processing is now widely implemented, not only in supercomputers. The figure illustrates an example: an implementation of a Strongly Connected Components (SCC) algorithm for CPU, GPU, and KNL. The research proposed by the joint team is to be conducted at the frontier of the state of the art. The Japanese team has successfully designed and developed world-leading microprocessor architectures based on vector processing.
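The compactness argument can be sketched with NumPy (an illustration of the programming model only, not code from the project): the vectorized form expresses the whole computation as a single array expression, the same style of formulation that vector hardware executes as a few vector instructions instead of a long scalar loop.

```python
import numpy as np

def saxpy_scalar(a, x, y):
    # Element-by-element loop: one scalar operation per element.
    return [a * xi + yi for xi, yi in zip(x, y)]

def saxpy_vector(a, x, y):
    # Whole-array expression: the runtime (or, on a vector machine,
    # the compiler) can map this onto instructions that process
    # many elements at once.
    return a * x + y

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([10.0, 10.0, 10.0, 10.0])
assert np.allclose(saxpy_vector(2.0, x, y), saxpy_scalar(2.0, x, y))
print(saxpy_vector(2.0, x, y))  # [12. 14. 16. 18.]
```

Both forms compute the same result; the point is that the vector formulation is the compact, machine-friendly one.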
The team also possesses all the experience, knowledge, and technology needed to provide hardware-related support for the project. The Russian team has deep, decades-long experience in the analysis of program and algorithm structures, built on world-class fundamental and applied research, as well as many years of experience supporting the largest HPC center in Russia, the Research Computing Center of Moscow State University.

Motivation and goals
The goal of the project is in-depth research into the potential of vector data processing, which can be used both to develop promising new HPC systems and to develop methods for solving extremely large-scale numerical problems that operate on huge amounts of data and access memory intensively. As part of the project, the theory and practice of vector data processing will be analyzed, together with the peculiarities of its hardware implementation and software-stack support. The analysis results will serve as the basis for developing methods for solving large-scale tasks that take into account the current state of supercomputing, which is characterized by the huge performance capabilities of HPC systems, large amounts of data, and an extremely high and still-growing level of parallelism.

The developed methods will be validated on a set of algorithms for processing graph structures of ultra-high scale, containing over 10^9 vertices and edges (a scale encountered when analyzing social networks). These algorithms are characterized by highly intensive memory utilization, irregular memory access, and the absence of floating-point operations; at the same time, the task scale demands precise data processing at all levels of the memory hierarchy. Efficiency estimates for methods of solving this kind of task on HPC systems based on vector data processing will be obtained as part of the research results.
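As a concrete reference point for the class of algorithms in question, SCC detection can be sketched with Kosaraju's textbook two-pass algorithm (a minimal illustration only; the project's vectorized CPU/GPU/KNL implementations are assumed to be far more elaborate). Note how the work consists entirely of pointer chasing over adjacency lists, with no floating-point operations:

```python
from collections import defaultdict

def kosaraju_scc(n, edges):
    """Strongly connected components of a directed graph on vertices
    0..n-1 (Kosaraju's two-pass algorithm). Returns a list of sets."""
    g, gr = defaultdict(list), defaultdict(list)
    for u, v in edges:
        g[u].append(v)
        gr[v].append(u)   # reversed edge

    # Pass 1: record vertices in order of DFS completion on g.
    seen, order = [False] * n, []
    for s in range(n):
        if seen[s]:
            continue
        seen[s] = True
        stack = [(s, iter(g[s]))]
        while stack:
            v, it = stack[-1]
            advanced = False
            for w in it:
                if not seen[w]:
                    seen[w] = True
                    stack.append((w, iter(g[w])))
                    advanced = True
                    break
            if not advanced:
                order.append(v)
                stack.pop()

    # Pass 2: sweep the reversed graph in reverse finishing order;
    # each sweep collects exactly one strongly connected component.
    comp, sccs = [None] * n, []
    for s in reversed(order):
        if comp[s] is not None:
            continue
        comp[s], cur, stack = len(sccs), {s}, [s]
        while stack:
            v = stack.pop()
            for w in gr[v]:
                if comp[w] is None:
                    comp[w] = len(sccs)
                    cur.add(w)
                    stack.append(w)
        sccs.append(cur)
    return sccs

# Two 2-cycles joined by a one-way edge -> two components.
sccs = kosaraju_scc(4, [(0, 1), (1, 0), (1, 2), (2, 3), (3, 2)])
print(sorted(map(sorted, sccs)))  # [[0, 1], [2, 3]]
```

At the target scale (over 10^9 vertices) this naive formulation is infeasible; it only fixes the algorithmic baseline against which vectorized variants are compared.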
Graph problems are known to be the basis for solving a wide range of applied and scientific challenges, for example: optimization of communication and transport networks, cryptography, social network analysis, bioengineering, and web data analysis. These problems are particularly sensitive to the organization of the memory hierarchy and to the use of special instruction sets when good performance is required at the target task scales. A promising way to achieve high performance on problems of this type is to take advantage of vector processors and machines.

Reaching the project goals will provide immediate results in the form of documented algorithm properties, such as scalability and vectorization potential. This rich experimental data is especially valuable when it becomes commonly and widely available, for example as part of the AlgoWiki project. Combining theoretical definitions and estimates for the best-known graph algorithms and their implementations with experimentally obtained information on their dynamic properties will provide all-round descriptions. These can be used for the efficient development of scientific applications that need such approaches, as well as for a better understanding of the requirements for next-generation vector processors and memory architectures. Both the RU and JP teams will benefit greatly from the theoretical and experimental study in this area: the AlgoWiki project can be enriched with experimental data on solving graph-based problems, and updated scalability and other algorithm properties can contribute to the development of next generations of the vector processing environment.

Early performance evaluation
To achieve the joint project goals and results, experimental runs on a top-level vector supercomputer are necessary. The tables illustrate the present scales of well-known graph problems that can be solved using compute nodes of traditional heterogeneous supercomputers, and the performance achieved by the RU team.
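The memory-bound behavior measured in such evaluations follows largely from how large graphs are stored. A compressed sparse row (CSR) layout, the usual choice at this scale, makes the indirect access pattern explicit; the sketch below is illustrative code, not from the project:

```python
import numpy as np

def to_csr(n, edges):
    """Compressed Sparse Row adjacency: col_idx[row_ptr[v]:row_ptr[v+1]]
    lists v's neighbors, so the edge list compresses to two flat arrays."""
    deg = np.zeros(n + 1, dtype=np.int64)
    for u, _ in edges:
        deg[u + 1] += 1
    row_ptr = np.cumsum(deg)
    col_idx = np.empty(len(edges), dtype=np.int64)
    fill = row_ptr[:-1].copy()          # next free slot per vertex
    for u, v in edges:
        col_idx[fill[u]] = v
        fill[u] += 1
    return row_ptr, col_idx

def degree_sum_of_neighbors(row_ptr, col_idx, v):
    # Reading col_idx[row_ptr[v]:row_ptr[v+1]] is sequential, but the
    # follow-up indexed read row_ptr[nbrs] jumps across memory: this
    # indirect (gather-style) access is what makes graph kernels
    # memory-bound and hard for caches, yet natural for vector hardware.
    nbrs = col_idx[row_ptr[v]:row_ptr[v + 1]]
    return int((row_ptr[nbrs + 1] - row_ptr[nbrs]).sum())

row_ptr, col_idx = to_csr(4, [(0, 1), (0, 2), (1, 2), (2, 3)])
print(row_ptr.tolist())                              # [0, 2, 3, 4, 4]
print(degree_sum_of_neighbors(row_ptr, col_idx, 0))  # deg(1)+deg(2) = 2
```

The same gather pattern, executed over billions of vertices, is the access behavior the project's vector-processing methods target.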
The diagram shows an example of experimental results obtained on traditional heterogeneous nodes.