Intra-Socket and Inter-Socket Communication in Multi-core Systems
Roshan N.P, S7 CSB, Roll No: 29


Introduction
The increasing computational and communication demands of the scientific and industrial communities require a clear understanding of the performance trade-offs involved in multi-core computing platforms. Such analysis can help application and toolkit developers design better, topology-aware communication primitives suited to the needs of various high-end computing applications. Here we take on the challenge of designing and implementing a portable intra-node communication framework for streaming computing and evaluate its performance on some popular multi-core architectures developed by Intel, AMD and Sun.

Analysis shows that a single communication method cannot achieve the best performance for all message sizes. The basic memory-based copy method gives good performance for small messages, vector loads and stores provide better performance for medium-sized messages, and a kernel-based approach achieves better performance for large messages.

Multi-core Systems
A multi-core processor is a processing system composed of two or more independent cores. A many-core processor is one in which the number of cores is large enough that traditional multi-processor techniques are no longer efficient. As the number of cores increases, the potential performance also increases.

Nowadays multi-core processors are widely used for high-performance computing. As more and more users attempt to run their distributed workloads within a single node or a small set of nodes, even small communication delays degrade performance. Efficient intra-node communication, and hardware support for accelerating it, has therefore become critical to obtaining optimal application performance.

Inter-node data exchange uses interconnects such as Myrinet or InfiniBand.
Myrinet: an early high-speed interconnect. Its protocols have less overhead than Ethernet, so it provides better performance, less interference and lower latency. Each link uses two optical fibers, one for upstream and one for downstream traffic, and nodes are connected through switches. It has a latency of about 5.5 microseconds, but its hardware and software costs are high.

InfiniBand: a high-speed interconnect for servers and peripherals. It handles connections between servers and between servers and peripherals such as storage and switches. Its main disadvantage is software overhead.

Methodology
Here we describe the approach used to evaluate the communication mechanisms of various multi-core architectures.
Basic Memory-Based Copy: We use the basic memcpy function to copy data from the user buffer to the shared communication buffer, and from there into the user buffer on the receiving side. This form of data transfer would take paths 3 to 8.
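A minimal sketch of this two-copy scheme, assuming a POSIX shared-memory segment as the shared communication buffer; the segment name and size are illustrative, and in practice the sender and receiver would be separate processes mapping the same segment.

```c
/* Two-copy transfer through a shared buffer (sketch).
 * "/comm_buf" and BUF_SIZE are illustrative choices, not the paper's setup. */
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

#define BUF_SIZE (64 * 1024)

/* Sender side: copy 1, user buffer -> shared communication buffer. */
static void send_msg(void *shared, const void *user_buf, size_t len)
{
    memcpy(shared, user_buf, len);
}

/* Receiver side: copy 2, shared communication buffer -> user buffer. */
static void recv_msg(void *user_buf, const void *shared, size_t len)
{
    memcpy(user_buf, shared, len);
}

int main(void)
{
    int fd = shm_open("/comm_buf", O_CREAT | O_RDWR, 0600);
    ftruncate(fd, BUF_SIZE);
    void *shared = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);

    char src[128] = "hello", dst[128];
    send_msg(shared, src, sizeof src);   /* first memcpy  */
    recv_msg(dst, shared, sizeof dst);   /* second memcpy */

    munmap(shared, BUF_SIZE);
    close(fd);
    shm_unlink("/comm_buf");
    return 0;
}
```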

Vector Instructions
Streaming SIMD Extensions 2 (SSE2) is Intel's SIMD (single instruction, multiple data) instruction set extension. A single SIMD instruction performs the same operation on a set of different data elements at once.

SSE2 includes two vector move instructions, movdqa and movdqu. Each transfers 16 bytes of data from the source buffer to the destination buffer.
movdqa: requires 16-byte-aligned memory locations.
movdqu: works with either aligned or unaligned memory.
These transfers follow paths 3 to 8.
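A minimal sketch of a 16-bytes-at-a-time copy using the corresponding compiler intrinsics, assuming the length is a multiple of 16 bytes: _mm_load_si128/_mm_store_si128 compile to movdqa and require 16-byte alignment, while _mm_loadu_si128/_mm_storeu_si128 compile to movdqu and tolerate unaligned buffers.

```c
/* 16-bytes-at-a-time copies with SSE2 vector loads/stores (sketch).
 * len is assumed to be a multiple of 16 bytes. */
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>

/* Aligned variant: both buffers must be 16-byte aligned (movdqa). */
void copy_sse2_aligned(void *dst, const void *src, size_t len)
{
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    for (size_t i = 0; i < len / 16; i++) {
        __m128i v = _mm_load_si128(&s[i]);   /* movdqa load  */
        _mm_store_si128(&d[i], v);           /* movdqa store */
    }
}

/* Unaligned variant: works for any alignment (movdqu). */
void copy_sse2_unaligned(void *dst, const void *src, size_t len)
{
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    for (size_t i = 0; i < len / 16; i++) {
        __m128i v = _mm_loadu_si128(&s[i]);  /* movdqu load  */
        _mm_storeu_si128(&d[i], v);          /* movdqu store */
    }
}
```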

Streaming Instructions
SSE2 also includes streaming instructions that do not pollute the cache hierarchy; these would follow paths 1 and 2. Two streaming store instructions from the SSE2 instruction set are relevant here.
movntdq: a non-temporal store that writes data to the destination address without polluting the cache lines. If the data is already in cache, the cache lines are updated. This instruction stores 16 bytes of aligned data at a time.
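A minimal sketch of a cache-bypassing copy built on the corresponding intrinsic, _mm_stream_si128 (movntdq). It assumes a 16-byte-aligned destination and a length that is a multiple of 16 bytes; an sfence is issued afterwards so the non-temporal stores become globally visible.

```c
/* Cache-bypassing copy with non-temporal stores (sketch).
 * dst must be 16-byte aligned; len a multiple of 16 bytes. */
#include <emmintrin.h>
#include <stddef.h>

void copy_sse2_streaming(void *dst, const void *src, size_t len)
{
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    for (size_t i = 0; i < len / 16; i++) {
        __m128i v = _mm_loadu_si128(&s[i]);  /* ordinary load             */
        _mm_stream_si128(&d[i], v);          /* movntdq: bypass the cache */
    }
    _mm_sfence();  /* make the non-temporal stores visible to other cores */
}
```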

movnti: a non-temporal store similar to movntdq, except that it writes only 4 bytes at a time.
Kernel-Level Direct Copy: The standard memory-copy approaches involve two copies to transfer data from the source to the destination buffer. For large messages we opt for a single-copy approach with kernel-level memory copies using the LiMIC2 library. The library abstracts the memory transfers into a few user-level functions.

The kernel-level approach has costs of its own, such as the context switches between user and kernel space. It is therefore not ideal for small or medium-sized messages, where the context-switch overhead offsets the benefit of eliminating one memory copy.
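Since the LiMIC2 API itself is not shown in these slides, the sketch below uses Linux's process_vm_readv system call (available since kernel 3.2) to illustrate the same single-copy idea: the kernel moves data directly from the sender's address space into the receiver's buffer, trading one memcpy for a user/kernel transition. This is an analogy, not the LiMIC2 interface.

```c
/* Single-copy transfer through the kernel (sketch).
 * Uses process_vm_readv as a stand-in for a LiMIC2-style kernel copy:
 * the receiver pulls data straight from the sender's pages. */
#define _GNU_SOURCE
#include <sys/uio.h>
#include <sys/types.h>
#include <stddef.h>

ssize_t single_copy_recv(pid_t sender_pid, void *dst,
                         void *sender_src, size_t len)
{
    struct iovec local  = { .iov_base = dst,        .iov_len = len };
    struct iovec remote = { .iov_base = sender_src, .iov_len = len };

    /* One kernel-side copy from the sender's buffer into dst.
       The price is the system-call transition, which is why this
       only pays off for large messages. */
    return process_vm_readv(sender_pid, &local, 1, &remote, 1, 0);
}
```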

EXPERIMENTAL RESULTS
The experimental setup consists of three different platforms.
Intel-Nehalem: An Intel Nehalem system with two quad-core sockets operating at 2.93 GHz, with 48 GB RAM and a PCIe 2.0 interface, running Red Hat Enterprise Linux Server release 5.2 with Linux.
AMD-Opteron: A four-socket, quad-core AMD Opteron board operating at 1.95 GHz, with 16 GB RAM and a PCIe 1.0 interface, running Red Hat Enterprise Linux Server release 5.4 with Linux.

Sun-Niagara: A Sun UltraSPARC T2 processor with 8 cores on chip. Each core operates at 1.2 GHz and supports 8 hardware threads to hide memory latency.
In the evaluation we use the memcpy, streaming, vector and LiMIC approaches, compiled with GCC or ICC. GCC, along with ICC version 11.0, was used in the experimental evaluation.

GCC Compiler
The GNU Compiler Collection includes front ends for C, C++, Objective-C, Fortran and Java. GCC was originally written as the compiler for the GNU operating system, which was developed to be 100% free software, free in the sense that it respects the user's freedom.

ICC Compiler
Intel C++ Compiler describes a group of C/C++ compilers from Intel, available for Linux, Microsoft Windows and Mac OS X. Intel supports compilation for its IA-32, Intel 64 and Itanium 2 processors; it does not support non-Intel but compatible processors, such as certain AMD processors. License: proprietary.

Observation
memcpy-based approaches give the best performance for very small messages, SSE2-based approaches give the best performance for small to medium-sized messages, and the kernel-based copy approach using LiMIC gives the best performance for large messages.
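This observation naturally leads to a size-based dispatch in the communication layer. A minimal sketch follows; the thresholds are illustrative placeholders rather than the paper's measured crossover points (which are architecture dependent), and copy_kernel_assisted is a hypothetical wrapper around the kernel-copy path.

```c
/* Size-based dispatch among the three copy methods (sketch).
 * SMALL_MAX and MEDIUM_MAX are illustrative placeholders that would be
 * tuned per architecture from measurements. */
#include <string.h>
#include <stddef.h>

#define SMALL_MAX   256          /* up to here: plain memcpy         */
#define MEDIUM_MAX  (64 * 1024)  /* up to here: SSE2 vector copy     */

/* From the earlier sketch (assumes its alignment/length requirements). */
void copy_sse2_unaligned(void *dst, const void *src, size_t len);
/* Hypothetical wrapper over a LiMIC2-style kernel-assisted copy. */
void copy_kernel_assisted(void *dst, const void *src, size_t len);

void copy_message(void *dst, const void *src, size_t len)
{
    if (len <= SMALL_MAX)
        memcpy(dst, src, len);               /* small: basic memory copy    */
    else if (len <= MEDIUM_MAX)
        copy_sse2_unaligned(dst, src, len);  /* medium: vector loads/stores */
    else
        copy_kernel_assisted(dst, src, len); /* large: single kernel copy   */
}
```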

Observation
We use Memcpy-GCC to compare all the architectures, as GCC was the only common compiler and memcpy the only method available on all the machines used in our experiments, with the exception of the Intel Nehalem, where we use LiMIC2 for large messages. We see that for intra-socket communication the Intel Nehalem gives the best performance, followed by the AMD Opteron. Due to technical difficulties, we were not able to gather inter-socket numbers on the Sun Niagara.

Conclusion
We have found that intra-node communication performance is highly dependent on the memory and cache architecture, as seen in the performance analysis of the basic memory-based copy as well as the SSE2 vector instructions. We can achieve an intra-socket small-message latency of 120, 222 and 371 nanoseconds on the Intel Nehalem, AMD Opteron and Sun Niagara 2, respectively. The inter-socket small-message latencies for the Intel Nehalem and AMD Opteron are 320 and 218 nanoseconds.

The maximum intra-socket communication bandwidths obtained were 6.5, 1.37 and Gbytes/second for the Intel Nehalem, AMD Opteron and Sun Niagara 2, respectively. We were also able to obtain inter-socket communication bandwidths of 1.2 and 6.6 Gbytes/second for the AMD Opteron and Intel Nehalem.