841.11 / ZZ88 Performance of Parallel Neuronal Models on Triton Cluster
Anita Bandrowski, Prithvi Sundararaman, Subhashini Sivagnanam, Kenneth Yoshimoto, Amitava Majumdar and Maryann Martone
University of California, San Diego, CA

Presentation transcript:

Introduction and Methodology
Large computational clusters can be very useful for quickly simulating large neuronal networks; however, building a cluster can be a cost-prohibitive and daunting task for an individual laboratory. Research staff in UCSD's Neuroscience department, along with members of UCSD's San Diego Supercomputer Center (SDSC), are exploring how models built for clusters can run on large machines and what scales are appropriate, with two main goals: to check the validity of data coming from large machines against that from single-core machines, and to determine performance metrics. Neuronal models obtained from ModelDB or neuroConstruct were run on the Triton compute cluster at SDSC (256 nodes, 24 gigabytes of memory per node, Xeon E5530 processors with 8 megabytes of cache). The Triton compute cluster is composed of Appro gB222X Blade Server nodes with dual quad-core Intel Nehalem-architecture processors running at 2.40 GHz. Models were run on varying numbers of processors on the Triton cluster to simulate smaller clusters, including desktop machines.

Results
Comparing Walltime for Different Node/Processors-per-node Combinations: For a model from ModelDB, wall-clock time was compared across different node/processors-per-node combinations while keeping the total number of processors per run constant (8 or 16).

Consistency Results: Output results are consistent when the model is run with different numbers of processors. The model used in this set of runs is the Traub (2005) model entitled "A single column thalamocortical network model", obtained from ModelDB (#45539). This model has 3,560 multi-compartment neurons; four output files from this set of runs are overlaid and show perfect alignment for one neuron.
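The poster does not describe the file format or tooling used for this overlap comparison. Purely as an illustration, the short Python sketch below shows one way such a consistency check could be scripted, assuming each run writes a two-column (time, voltage) trace for the recorded neuron; the file names, tolerance, and format are hypothetical, not the study's actual outputs.

```python
# Hypothetical consistency check: compare a recorded soma voltage trace
# across runs that used different processor counts. File names and the
# two-column (time_ms, voltage_mV) format are assumptions, not the
# poster's actual output format.
import numpy as np

run_files = {
    1:  "soma_trace_1proc.dat",    # serial reference run
    8:  "soma_trace_8procs.dat",
    16: "soma_trace_16procs.dat",
    32: "soma_trace_32procs.dat",
}

reference = np.loadtxt(run_files[1])          # columns: time, voltage
for nprocs, path in run_files.items():
    trace = np.loadtxt(path)
    # The same output time grid is assumed for every run, so the voltage
    # columns can be compared point by point within a small tolerance.
    max_diff = np.max(np.abs(trace[:, 1] - reference[:, 1]))
    status = "OK" if max_diff < 1e-6 else "MISMATCH"
    print(f"{nprocs:>3} processors: max |dV| = {max_diff:.3e} mV  [{status}]")
```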
Scaling Results: Models of different sizes (numbers of neurons) were run on varying numbers of processors. Large neuron-count models demonstrated reduced run time at higher processor counts; this was not true for low neuron-count (20-neuron) models.

Scalability Results: We observe a decrease in wall-clock (execution) time as the number of processors increases.

Conclusion
Parallel implementations of models with greater numbers of neurons performed well in terms of strong scaling: simulation time decreased as the number of cores increased, in a mode that provided sufficient memory per CPU (an illustrative speedup calculation is sketched at the end of this transcript). Models with fewer than 20 neurons showed no real need for larger computational clusters, since serial runs outperformed parallel runs in simulation time. The study identified several potential bottlenecks: changing the number of processors per node, while keeping the total number of processors per run fixed, can lead to different runtimes; parallel performance was affected by caching; and interconnect speed and the number of sockets per node could also affect results on other clusters. The study showed that result consistency is not affected by running a properly parallelized model with different processor counts, but it also clearly showed that when parallelization was not done properly, the consistency of the results varied greatly. We anticipate that dividing tasks of the neuronal model across the parallel cores may provide improved parallel performance; this will be investigated further.

Funded in part by the NIH Neuroscience Blueprint (HHSN C) via NIDA and in part by the Triton Research Grant.
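The transcript does not include the raw timing numbers behind the scalability and strong-scaling statements above. As a minimal sketch of how such wall-clock measurements are usually summarized, the Python snippet below computes speedup S(p) = T(1)/T(p) and parallel efficiency E(p) = S(p)/p from a table of (core count, wall-clock time) pairs; the times shown are placeholders, not data from the Triton runs.

```python
# Strong-scaling summary: speedup S(p) = T(1)/T(p), efficiency E(p) = S(p)/p.
# The wall-clock times below are placeholders for illustration only; they
# are not measurements from the Triton runs described in the poster.
walltimes = {      # cores -> wall-clock time in seconds (hypothetical)
    1: 3600.0,
    8: 520.0,
    16: 300.0,
    32: 190.0,
}

t_serial = walltimes[1]
print(f"{'cores':>6} {'time (s)':>10} {'speedup':>9} {'efficiency':>11}")
for cores in sorted(walltimes):
    t = walltimes[cores]
    speedup = t_serial / t          # how much faster than the serial run
    efficiency = speedup / cores    # fraction of ideal linear speedup
    print(f"{cores:>6} {t:>10.1f} {speedup:>9.2f} {efficiency:>10.0%}")
```

Efficiency well below 100% at high core counts would point to the kinds of bottlenecks noted in the conclusion (memory per CPU, caching, interconnect speed, and sockets per node).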