Background; Gaussian Elimination; Fault Tolerance; Single or multiple core failures; Single or multiple core additions; Simultaneous core failures and additions

References

[1] G. Bosilca, R. Delmas, J. Dongarra, and J. Langou, "Algorithmic Based Fault Tolerance Applied to High Performance Computing," LAPACK Working Note #205, June.
[2] J. Geraci, "A comparison of parallel Gaussian elimination solvers for the computation of electrochemical battery models on the Cell processor," Ph.D. thesis, MIT.
[3] E. Kursun and C. Chen-Yong, "Thermal variation characterization and thermal management of multicore architectures," IEEE Micro, Vol. 29, No. 1, Jan.

Background

Bosilca et al. [1] present an algorithm-based fault tolerance scheme for multiprocessor systems that deals effectively with processor loss (also known as erasure failures). Their scheme is diskless and relies on checksums: it uses 2p - 1 checksum processors for every p^2 computing processors, so its efficiency grows with the number of processors. For the smaller core counts found on today's multicore processors, however, such as 8, 16, 32, 64, or 80 cores, at best 17% of the cores must be allocated to checksum duty. This overhead can be prohibitive for embedded systems with tight power and space requirements.

Our fault-tolerant algorithm uses no extra cores to protect against erasure faults. Instead, it uses the algorithm's natural synchronization points to check for core faults and to back up important information. Additionally, the same mechanism can be used to dynamically add cores to a workload.

Gaussian Elimination

The in-core algorithm presented in [2, p. 195] is implemented for this work. It operates on banded matrices like the one shown in Fig. 1. Everything outside the center band of width b is zero, so no computation is performed on that data; everything within the band is computed on.

The algorithm initially distributes the rows among the cores in a block-round-robin manner, so that each core receives a series of contiguous rows, as seen in Fig. 1. Elimination starts from the upper-left corner of the matrix and proceeds to the lower-right corner. Each elimination step starts with a base row, which is used to eliminate the leftmost column of the b/2-by-b/2 block directly below the base row. When moving from the first work block to the second, the first work block's base row is stored in main memory, the new base row is broadcast to all the cores, and a new row of data is brought onto the chip. This process is shown in Fig. 1.

Fig. 1: Only three rows of data need to be moved before each block.

This algorithm yields a compute-to-memory-access ratio of

    compute / memory access = (b/2)^2 / (3b) = b/12

Fault Tolerance

Fault tolerance (FT) is added to the algorithm by having the cores back up their rows to memory periodically. While this FT implementation is based on the FT Gaussian elimination algorithm for the Cell Broadband Engine presented in [2], it differs significantly in implementation and has more capabilities. The FT algorithm in [2] could only redistribute rows upon an erasure fault; this implementation can, in addition to redistributing the workload in response to a failure, add cores independently of a failure or as part of the response to a failure. The fault-tolerance capability takes advantage of the natural serialization points of the algorithm to check for core failures.
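To make this concrete, the following is a minimal sketch, in plain C, of how a serialization point can double as an erasure-fault check. It is not the authors' implementation; every name in it (core_alive, owner, redistribute_rows, and so on) is hypothetical, and the liveness flags are assumed to be maintained by whatever heartbeat or status mechanism the platform provides.

#include <stdbool.h>

#define NUM_CORES 8    // cores participating in the solve
#define NUM_ROWS  64   // rows of the band assigned across the cores

static bool core_alive[NUM_CORES];  // maintained elsewhere (heartbeat/status check)
static int  owner[NUM_ROWS];        // which core currently owns each row

// Reassign the rows of a failed core to the surviving cores, round-robin.
// Every survivor runs the same deterministic loop, so they all agree on
// the new distribution without extra communication.
static void redistribute_rows(int dead_core)
{
    int next = 0;
    for (int r = 0; r < NUM_ROWS; r++) {
        if (owner[r] != dead_core)
            continue;
        do {
            next = (next + 1) % NUM_CORES;   // find the next surviving core
        } while (!core_alive[next]);
        owner[r] = next;   // the survivor replays any lost elimination work
                           // from the copy of row r backed up in main memory
    }
}

// Called by every surviving core at each natural serialization point.
void check_for_failures(void)
{
    for (int c = 0; c < NUM_CORES; c++)
        if (!core_alive[c])
            redistribute_rows(c);
}

A core addition would be handled the same way in reverse: the newcomer is marked alive and a corresponding loop hands it rows until the per-core row counts are roughly equal.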
While checking for failures is free, fault tolerance requires that the data be backed up in main memory. If these accesses can be hidden behind compute, the fault tolerance incurs no performance cost. No attempt was made to hide these accesses in the present implementation; this optimization has been left for future work.

while (present_base_row < last_row) {
    eliminate_on_my_rows(present_base_row);
    back_up_data_to_memory();        // introduced for fault tolerance
    synchronize_with_other_cores();  // the next base_row information is shared here
    present_base_row++;
}

Algorithm 1: Rough pseudo-code for the algorithm.

Single or multiple core failures:

In this situation N cores are computing and M < N cores fail. The rows associated with the failed cores are redistributed among the remaining cores. If necessary, the remaining cores perform any computation lost on the defunct cores' rows before continuing with the solve.

Fig. 2: Eight cores are computing and then three cores fail. The failed cores' rows are redistributed among the remaining cores.

Single or multiple core additions:

In this situation cores are added to a workload. The original cores give up enough rows to the new cores so that the work done by each core is about equal.

Fig. 3: Three cores are computing and then three more cores are added mid-compute.

Simultaneous core failures and additions:

When a failure occurs, the failed cores are first removed, then the new cores are started, and finally the workload is redistributed.

Fig. 4: Three cores are computing, two of the cores fail, and the system responds by adding four new cores.

Further Applications

Thermal Management:

In [3], it was found that there can be significant variation in thermal performance among cores on the same chip. Furthermore, there can be significant thermal variation for the same functional unit on different cores running the same tasks: during their experiments with a two-core test chip running identical tasks, a 7.5 °C difference was measured between the load-store units of the two cores. The presented fault-tolerant scheme can be combined with per-core thermal information to improve energy efficiency. By monitoring the thermal behavior of the cores during computation, it can be determined whether any computing core is significantly hotter than its partner cores. Using the "simultaneous core failures and additions" capability of the algorithm, the hot core can be shut down and its workload either moved to a different, cooler core or redistributed among the remaining cores, thereby improving energy efficiency.

Preemptive/Cooperative multitasking:

As proposed in [2], FT can be used as the basis for a preemptive/cooperative multitasking environment for multicore systems, because each application knows what to do when its number of available cores changes. Instead of having to remove a task from all the cores on which it is running, a few cores can be liberated and assigned to a new task. If the new task exits before the original task has completed, the cores used by the new task can be reassigned to the original task.

Performance

(© Square Enix Co., Ltd)

Forward elimination without fault tolerance achieves a peak of around 4 Gflops on an IBM QS22. With fault tolerance activated, that performance drops to around 0.7 Gflops. For comparison, around 0.14 Gflops was achieved using 'lu' in Matlab on a Dell M6400 with a 2.67 GHz Core 2 Duo T9550, with b = 768. The drop in performance between the non-FT and FT implementations is due to the large number of new memory accesses introduced by the FT algorithm.
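One way these backup accesses could be hidden behind compute, as suggested above, is double buffering: the row is copied into an on-chip staging buffer and an asynchronous transfer drains it to main memory while the next elimination step runs. The sketch below only illustrates that idea; it is not the authors' code, and dma_put_async() / dma_wait() are hypothetical placeholders for the platform's asynchronous copy mechanism (MFC DMA on the Cell), not a real SDK interface.

#include <stddef.h>
#include <string.h>

#define ROW_LEN 768   // row length within the band (b = 768 in the reported runs)

typedef struct { double v[ROW_LEN]; } row_t;

// Hypothetical asynchronous-copy hooks.
extern void dma_put_async(const void *src, void *dst, size_t n, int tag);
extern void dma_wait(int tag);   // returns immediately if the tag has no pending transfer

static row_t backup_buf[2];      // two on-chip staging buffers

// Back up 'row' to 'dst' in main memory without stalling the next eliminate step.
void backup_row_overlapped(const row_t *row, row_t *dst, int step)
{
    int buf = step & 1;                     // alternate between the two buffers

    dma_wait(buf);                          // reuse a buffer only after its
                                            // previous transfer has drained
    memcpy(&backup_buf[buf], row, sizeof(row_t));
    dma_put_async(&backup_buf[buf], dst, sizeof(row_t), buf);

    // The caller proceeds to eliminate_on_my_rows() while the copy completes,
    // so the backup traffic overlaps with compute instead of adding to it.
}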
This drop in performance can be mitigated in future implementations through Cell coding techniques that hide the backup accesses behind computation, as noted above.

Fig. 5: Performance peaks at around 8 SPEs.
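As a closing illustration of the thermal-management idea described under Further Applications, here is a minimal, hypothetical sketch of a rebalancing check that retires a hot core through the same path used for core failures. read_core_temp(), retire_core(), and add_core() are assumed hooks rather than part of the presented implementation, and the 7.5 °C threshold simply echoes the spread reported in [3].

#include <stdbool.h>

#define NUM_CORES   8
#define HOT_DELTA_C 7.5   // on the order of the spread reported in [3]

extern double read_core_temp(int core);   // hypothetical sensor read
extern void   retire_core(int core);      // shut the core down; treated as a failure
extern int    add_core(void);             // start a spare core, or return -1 if none
extern bool   core_active[NUM_CORES];

// Called occasionally, e.g. every few elimination blocks.
void thermal_rebalance(void)
{
    double coolest = 1e9, hottest_temp = -1e9;
    int hottest = -1;

    for (int c = 0; c < NUM_CORES; c++) {
        if (!core_active[c])
            continue;
        double t = read_core_temp(c);
        if (t < coolest)
            coolest = t;
        if (t > hottest_temp) {
            hottest_temp = t;
            hottest = c;
        }
    }

    // If one core runs much hotter than its coolest partner, remove it,
    // optionally start a cooler spare, and let the normal redistribution
    // path rebalance the rows, exactly as after a real core failure.
    if (hottest >= 0 && hottest_temp - coolest > HOT_DELTA_C) {
        retire_core(hottest);
        (void)add_core();
    }
}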