Gravitational N-body Simulation

Major Design Goals
- Efficiency
- Versatility (ability to use different numerical methods)
- Scalability

Lesser Design Goals
- Flexibility (control parameters must be configurable)
- Persistence (pause and continue)
- Visualization
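The versatility goal above (swapping numerical methods without rewriting the rest of the simulation) is commonly met with an abstract force-evaluation interface. A minimal C++ sketch under that assumption; the names ForceMethod, accelerations, and step are illustrative and not taken from the original design:

```cpp
#include <cstddef>
#include <vector>

struct Particle { double x, y, z, vx, vy, vz, m; };
struct Vec3     { double x = 0, y = 0, z = 0; };

// Any numerical method (direct sum, treecode, field solver) implements the
// same interface, so the integrator never needs to know which one is in use.
class ForceMethod {
public:
    virtual ~ForceMethod() = default;
    // Implementations fill acc[i] with the acceleration of particle i.
    virtual void accelerations(const std::vector<Particle>& particles,
                               std::vector<Vec3>& acc) const = 0;
};

// The time integrator is written once against the interface (simple Euler step here).
void step(std::vector<Particle>& p, const ForceMethod& force, double dt) {
    std::vector<Vec3> a(p.size());
    force.accelerations(p, a);
    for (std::size_t i = 0; i < p.size(); ++i) {
        p[i].vx += a[i].x * dt;  p[i].vy += a[i].y * dt;  p[i].vz += a[i].z * dt;
        p[i].x  += p[i].vx * dt; p[i].y  += p[i].vy * dt; p[i].z  += p[i].vz * dt;
    }
}
```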

Hardware

Single Computer Configuration
- 1-4 CPUs
- 1-4 cores
- 3-4 GHz CPUs
- 4 x 32-bit FP IPC
- 2 x 64-bit FP IPC
- Windows

Cluster Configurations
- LION-XO (80 x 2 x Opteron / 8 GB + 40 x 4 x Opteron / 16 GB; 2.4 GHz)
- 1.6 TFlops (32-bit); 800 GFlops (64-bit); single core assumed
- Gigabit Ethernet
- GNU/Linux
- Single or dual core CPUs? CPU model?

6 GFlops average desktop; 256 GFlops top-of-the-line server
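As a rough cross-check of these figures, theoretical peak throughput is sockets × cores × clock × FP operations per cycle. A hypothetical helper with assumed example parameters (none of the figures passed to it below come from the slide itself; they are chosen only to land near the 6 GFlops and 256 GFlops estimates):

```cpp
#include <cstdio>

// Theoretical peak: sockets * cores * clock (GHz) * FP operations per cycle.
double peak_gflops(int sockets, int cores, double ghz, int flops_per_cycle) {
    return sockets * cores * ghz * flops_per_cycle;
}

int main() {
    // Assumed: a 1.5 GHz single core doing 4 single-precision FLOPs/cycle.
    std::printf("average desktop: %.0f GFlops\n", peak_gflops(1, 1, 1.5, 4));
    // Assumed: 4 sockets x 4 cores at 4 GHz, 4 FLOPs/cycle.
    std::printf("top-line server: %.0f GFlops\n", peak_gflops(4, 4, 4.0, 4));
}
```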

Algorithms

Direct Methods: O(N²)
+ very simple
+ scalable
- inefficient (~30,000 particles at 256 GFlops)

Treecode / Multipole: O(N log N)
- more difficult to implement
- scalability harder to achieve
+ efficient (… particles)

Field Methods: O(N log N) or O(N)
- involve solving Poisson's equation
- an area of active research
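A minimal sketch of the direct O(N²) method in C++; the Plummer softening parameter eps and the Particle/Vec3 layouts are assumptions of this sketch rather than part of the original design:

```cpp
#include <cmath>
#include <vector>

struct Particle { double x, y, z, m; };
struct Vec3     { double x = 0, y = 0, z = 0; };

// O(N^2) pairwise gravitational accelerations with Plummer softening.
void direct_sum(const std::vector<Particle>& p, std::vector<Vec3>& acc,
                double G = 1.0, double eps = 1e-3) {
    acc.assign(p.size(), Vec3{});
    for (std::size_t i = 0; i < p.size(); ++i) {
        for (std::size_t j = 0; j < p.size(); ++j) {
            if (i == j) continue;                       // skip self-interaction
            double dx = p[j].x - p[i].x;
            double dy = p[j].y - p[i].y;
            double dz = p[j].z - p[i].z;
            double r2 = dx * dx + dy * dy + dz * dz + eps * eps;
            double inv_r3 = 1.0 / (r2 * std::sqrt(r2)); // 1 / r^3, softened
            acc[i].x += G * p[j].m * dx * inv_r3;
            acc[i].y += G * p[j].m * dy * inv_r3;
            acc[i].z += G * p[j].m * dz * inv_r3;
        }
    }
}
```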

Levels of Parallelization

1) SIMD: up to 4 lanes
- 4 x 32-bit flops/cycle
- 2 x 64-bit flops/cycle

2) SMP/MPU: up to 4 threads
- 1-4 cores
- 1-4 CPUs

3) Cluster: up to N nodes
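A sketch of how the first two levels map onto the direct-sum loop: the outer loop is split across threads (SMP level) and the inner loop is vectorized (SIMD level). The structure-of-arrays layout, the OpenMP pragmas, and the compile flags are assumptions; the cluster level would additionally partition particles across nodes, e.g. with MPI.

```cpp
#include <cmath>

// Structure-of-arrays layout vectorizes better than an array of structs.
// Compile with e.g. g++ -O3 -fopenmp.
void direct_sum_omp(long n, const double* px, const double* py, const double* pz,
                    const double* m, double* ax, double* ay, double* az,
                    double eps2) {
    #pragma omp parallel for schedule(static)          // SMP level: one chunk per thread
    for (long i = 0; i < n; ++i) {
        double sx = 0, sy = 0, sz = 0;
        #pragma omp simd reduction(+:sx,sy,sz)         // SIMD level: vectorized inner loop
        for (long j = 0; j < n; ++j) {
            double dx = px[j] - px[i], dy = py[j] - py[i], dz = pz[j] - pz[i];
            double r2 = dx * dx + dy * dy + dz * dz + eps2;  // softening absorbs i == j
            double inv_r3 = 1.0 / (r2 * std::sqrt(r2));
            sx += m[j] * dx * inv_r3;
            sy += m[j] * dy * inv_r3;
            sz += m[j] * dz * inv_r3;
        }
        ax[i] = sx; ay[i] = sy; az[i] = sz;
    }
}
```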

Memory Requirements

1) Position: x, y, z
2) Velocity: vx, vy, vz

6 x 4 = 24 bytes per particle (32-bit fp)
6 x 8 = 48 bytes per particle (64-bit fp)

~2,500 particles per 64 KB (32-bit)
~1,300 particles per 64 KB (64-bit)
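A small check of these figures under the same assumptions (positions and velocities only; masses and other per-particle state excluded, as on the slide):

```cpp
#include <cstdio>

struct ParticleF { float  x, y, z, vx, vy, vz; };   // 6 x 4 = 24 bytes
struct ParticleD { double x, y, z, vx, vy, vz; };   // 6 x 8 = 48 bytes

int main() {
    std::printf("32-bit particle: %zu bytes, %zu per 64 KB\n",
                sizeof(ParticleF), (64 * 1024) / sizeof(ParticleF));
    std::printf("64-bit particle: %zu bytes, %zu per 64 KB\n",
                sizeof(ParticleD), (64 * 1024) / sizeof(ParticleD));
}
```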

Levels of Memory

1) L1 cache: 64 KB
- runs at CPU clock speed
- negligible latency

2) L2 cache: 1 MB
- runs at CPU clock speed
- low latency

3) RAM: GBs
- reduced speed (up to 12-24 GB/s)
- high latency

4) Network (the weakest link)
- 1 Gbit/s
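One common way to work with this hierarchy is to tile the inner loop so a block of source particles stays resident in cache while every target particle streams over it. A sketch, where the tile size is an assumed, tunable parameter rather than a value from the slides:

```cpp
#include <algorithm>
#include <cmath>

// Blocked direct sum: the j loop is tiled so one tile of source particles
// stays resident in L1/L2 cache while all target particles stream over it.
// tile = 2048 is an assumed starting point (~64 KB of SoA data); tune per CPU.
void direct_sum_blocked(long n, const double* px, const double* py, const double* pz,
                        const double* m, double* ax, double* ay, double* az,
                        double eps2, long tile = 2048) {
    std::fill(ax, ax + n, 0.0);
    std::fill(ay, ay + n, 0.0);
    std::fill(az, az + n, 0.0);
    for (long jb = 0; jb < n; jb += tile) {
        long jend = std::min(jb + tile, n);
        for (long i = 0; i < n; ++i) {
            for (long j = jb; j < jend; ++j) {
                double dx = px[j] - px[i], dy = py[j] - py[i], dz = pz[j] - pz[i];
                double r2 = dx * dx + dy * dy + dz * dz + eps2;
                double inv_r3 = 1.0 / (r2 * std::sqrt(r2));
                ax[i] += m[j] * dx * inv_r3;
                ay[i] += m[j] * dy * inv_r3;
                az[i] += m[j] * dz * inv_r3;
            }
        }
    }
}
```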

10⁹ Particles Require…

Memory: 24 GB (32-bit)
Instructions per iteration: log₂(10⁹) × 10⁹ × const ≈ 3 × 10¹² ops = 3 Tflop
Time per iteration: ~(3 × 10¹² ops) / (sustained flop rate)
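The same back-of-envelope estimate written out explicitly; the per-interaction constant (~100 ops, implied by the slide's 3 × 10¹² total) and the 300 GFlops sustained rate are assumptions used only to make the arithmetic concrete:

```cpp
#include <cmath>
#include <cstdio>

int main() {
    double n = 1e9;                                 // particles
    double bytes = n * 24;                          // 24 bytes each (32-bit pos + vel)
    double ops_per_particle_level = 100;            // assumed "const" in the estimate
    double ops = std::log2(n) * n * ops_per_particle_level;  // N log N style work
    double sustained_gflops = 300;                  // illustrative sustained rate

    std::printf("memory:         %.0f GB\n", bytes / 1e9);
    std::printf("ops/iteration:  %.1e (~3e12)\n", ops);
    std::printf("time/iteration: %.0f s at %.0f GFlops sustained\n",
                ops / (sustained_gflops * 1e9), sustained_gflops);
}
```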