UPC Research Activities at UF
Presentation for UPC Workshop ’04
Alan D. George, Hung-Hsun Su, Burton C. Gordon, Bryan Golden, Adam Leko
HCS Research Laboratory, University of Florida

2 Outline
- FY04 research activities
  - Objectives
  - Overview
  - Results
  - Conclusions
- New activities for FY05
  - Introduction
  - Approach
  - Conclusions

3 Research Objective (FY04)
Goal:
- Extend Berkeley UPC support to SCI through a new GASNet SCI conduit.
- Compare UPC performance on platforms with various interconnects using existing and new benchmarks.
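For context, here is a minimal sketch of the kind of UPC program these comparisons exercise. It is not from the slides; the Berkeley UPC build and run commands at the end reflect typical usage, with the conduit name and thread count as illustrative assumptions.

    /* hello_upc.c: each thread writes one element of a shared array,
       then thread 0 reads them all back (one-sided remote reads). */
    #include <upc_relaxed.h>
    #include <stdio.h>

    shared int hits[THREADS];          /* one element with affinity to each thread */

    int main(void)
    {
        hits[MYTHREAD] = MYTHREAD;     /* purely local write */
        upc_barrier;                   /* global synchronization point */

        if (MYTHREAD == 0) {
            int i, sum = 0;
            for (i = 0; i < THREADS; i++)
                sum += hits[i];        /* reads with i != 0 are remote gets over the conduit */
            printf("sum of thread ids = %d\n", sum);
        }
        return 0;
    }

With Berkeley UPC, a program like this is typically compiled and launched along the lines of "upcc -network=sci hello_upc.c -o hello" followed by "upcrun -n 4 ./hello", where the -network flag selects the GASNet conduit (sci, vapi, elan, gm, or mpi); exact options depend on the installation.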

4 GASNet SCI Conduit - Design

5 Experimental Testbed
- Elan, VAPI (Xeon), MPI, and SCI conduits
  - Nodes: dual 2.4 GHz Intel Xeons, 1 GB DDR PC2100 (DDR266) RAM, Intel SE7501BR2 server motherboard with E7501 chipset.
  - SCI: 667 MB/s (300 MB/s sustained) Dolphin SCI D337 (2D/3D) NICs, using PCI 64/66, in a 4x2 torus.
  - Quadrics: 528 MB/s (340 MB/s sustained) Elan3, using PCI-X in two nodes with a QM-S16 16-port switch.
  - InfiniBand: 4x (10 Gb/s, 800 MB/s sustained) Infiniserv HCAs, using PCI-X 100, InfiniIO port switch from Infinicon.
  - RedHat 9.0 with gcc compiler v3.3.2; SCI uses MP-MPICH beta from RWTH Aachen University, Germany; Berkeley UPC runtime system 1.1.
- VAPI (Opteron)
  - Nodes: dual AMD Opteron 240, 1 GB DDR PC2700 (DDR333) RAM, Tyan Thunder K8S server motherboard.
  - InfiniBand: same as in VAPI (Xeon).
- GM (Myrinet) conduit (c/o access to cluster at MTU)
  - Nodes*: dual 2.0 GHz Intel Xeons, 2 GB DDR PC2100 (DDR266) RAM.
  - Myrinet*: 250 MB/s Myrinet 2000, using PCI-X, on 8 nodes connected with a 16-port M3F-SW16 switch.
  - RedHat 7.3 with Intel C compiler v7.1; Berkeley UPC runtime system 1.1.
- ES80 AlphaServer (Marvel)
  - Four 1 GHz EV7 Alpha processors, 8 GB RD1600 RAM, proprietary inter-processor connections.
  - Tru64 5.1B Unix, HP UPC v2.1 compiler.
* via testbed made available courtesy of Michigan Tech

6 IS (Class A) from NAS Benchmarks
- IS (Integer Sort): lots of fine-grained communication, low computation.
- Poor performance in the GASNet communication system does not necessarily indicate poor performance in the UPC application.
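To illustrate why IS stresses the communication system, here is a hedged sketch (not the NAS source) of the access pattern it represents: many tiny remote updates and very little computation per key. The array sizes and the counting step are placeholders.

    #include <upc_relaxed.h>

    #define KEYS_PER_THREAD 4096                      /* placeholder problem size */

    shared int keys[KEYS_PER_THREAD * THREADS];       /* default cyclic layout */
    shared int bucket_count[THREADS];                 /* one counter per thread */

    void count_keys(void)
    {
        int i;
        /* Each thread visits the keys it has affinity to... */
        upc_forall (i = 0; i < KEYS_PER_THREAD * THREADS; i++; &keys[i]) {
            int owner = keys[i] % THREADS;
            /* ...but most increments land on another thread's counter:
               one small remote update per key (not atomic; illustration only). */
            bucket_count[owner] += 1;
        }
        upc_barrier;
    }

Each iteration moves only a few bytes, so per-message latency and overhead in the conduit, rather than raw bandwidth, dominate the run time.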

7 DES Differential Attack Simulator
- S-DES (8-bit key) cipher, integer-based.
- Creates the basic components used in differential cryptanalysis: S-boxes, Difference Pair Tables (DPT), and Differential Distribution Tables (DDT).
- Bandwidth-intensive application.
- Designed for a high cache-miss rate, so very costly in terms of memory access.
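A hedged sketch of the bulk, bandwidth-bound access style such an application relies on; the table name, chunk size, and layout below are assumptions rather than details of the simulator.

    #include <upc_relaxed.h>

    #define CHUNK 4096                               /* placeholder chunk size, in ints */

    shared [CHUNK] int ddt[CHUNK * THREADS];         /* one table chunk per thread */

    void fetch_remote_chunk(int owner, int *local_buf)
    {
        /* One-sided bulk get of the chunk owned by 'owner'.  A single
           upc_memget becomes one large transfer, so sustained bandwidth,
           not per-message latency, determines the cost. */
        upc_memget(local_buf, &ddt[(size_t)owner * CHUNK], CHUNK * sizeof(int));
    }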

8 DES Analysis
- With an increasing number of nodes, bandwidth and NIC response time become more important.
- Interconnects with high bandwidth and fast response times perform best.
  - Marvel shows near-perfect linear speedup, but the processing time of integers is an issue.
  - VAPI shows constant speedup.
  - Elan shows near-linear speedup from 1 to 2 nodes, but more nodes are needed in the testbed for better analysis.
  - GM does not begin to show any speedup until 4 nodes, and then only minimally.
  - The MPI conduit is clearly inadequate for high-bandwidth programs.
  - The SCI conduit performs well for high-bandwidth programs, but with the same speedup problem as GM.

9 Differential Cryptanalysis for CAMEL Cipher
- Uses 1024-bit S-boxes.
- Given a key, encrypts data, then tries to guess the key solely from the encrypted data using a differential attack.
- Has three main phases:
  - Compute the optimal difference pair based on the S-box (not very CPU-intensive).
  - Perform the main differential attack (extremely CPU-intensive): get a list of candidate keys and check all candidate keys using brute force in combination with the optimal difference pair computed earlier.
  - Analyze data from the differential attack (not very CPU-intensive).
- Computationally intensive (independent processes) plus several synchronization points.
- Parameters: MAINKEYLOOP = 256, NUMPAIRS = 400,000, initial key = 12345.
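The phase structure above maps naturally onto UPC. The following is a rough sketch under stated assumptions: try_pair() is a hypothetical stand-in for the real pair test, the candidate-key array assumes a static-THREADS compile, and only the two loop parameters come from the slide.

    #include <upc_relaxed.h>

    #define MAINKEYLOOP 256                 /* parameter from the slide */
    #define NUMPAIRS    400000L             /* parameter from the slide */

    /* Assumes a static-THREADS compilation environment; a dynamic-THREADS
       build would size this with THREADS or use upc_all_alloc instead. */
    shared int candidate_hits[MAINKEYLOOP];

    extern int try_pair(int subkey, long pair);   /* hypothetical pair test */

    void differential_attack(void)
    {
        int k;
        long p;

        upc_barrier;                              /* phase boundary: attack starts */

        /* Independent work: each thread handles the candidate keys it owns. */
        upc_forall (k = 0; k < MAINKEYLOOP; k++; &candidate_hits[k]) {
            int hits = 0;
            for (p = 0; p < NUMPAIRS; p++)        /* the CPU-intensive inner loop */
                hits += try_pair(k, p);
            candidate_hits[k] = hits;             /* local write: affinity matches k */
        }

        upc_barrier;                              /* phase boundary: analysis follows */
    }

Because the iterations are independent and the writes are local, most of the cost outside the inner loop comes from the barriers, which matches the synchronization sensitivity discussed on the next slide.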

10 CAMEL Analysis
- Marvel
  - Attained almost perfect speedup.
  - Synchronization cost very low.
- Berkeley UPC
  - Speedup decreases with an increasing number of threads; the cost of synchronization grows as threads are added.
  - Run time varied greatly as the number of threads increased, making it hard to get consistent timing readings.
  - Still decent performance at 32 threads (76.25% efficiency, VAPI).
  - Performance is more sensitive to data affinity (see the sketch below).
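The affinity sensitivity noted in the last bullet can be seen in a minimal sketch like the one below (the array size and blocking are assumptions): the same loop distributed once by data ownership and once by iteration index.

    #include <upc_relaxed.h>

    #define N_PER_THREAD 1024                                   /* placeholder size */

    shared [N_PER_THREAD] double work[N_PER_THREAD * THREADS];  /* blocked layout */

    void scale_local(double a)
    {
        int i;
        /* Affinity expression follows the data: every access is local. */
        upc_forall (i = 0; i < N_PER_THREAD * THREADS; i++; &work[i])
            work[i] *= a;
    }

    void scale_remote(double a)
    {
        int i;
        /* Round-robin by index: with a blocked layout most accesses are
           remote, so iterations pay interconnect round trips. */
        upc_forall (i = 0; i < N_PER_THREAD * THREADS; i++; i)
            work[i] *= a;
    }

On a shared-memory machine like Marvel the two versions would be expected to differ far less than on a cluster conduit, where the mismatched version is dominated by remote traffic; that gap is the sensitivity the bullet describes.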

11 Conclusions (FY04)
- SCI conduit
  - A functional, optimized version is available.
  - Although limited by the current driver from the vendor, it is able to achieve performance comparable to other conduits.
  - Enhancements to resolve the driver limitations are being investigated in close collaboration with Dolphin:
    - Support access to all virtual memory on a remote node.
    - Minimize transfer setup overhead.
- Performance comparison
  - Marvel: provides better compiler warnings; has better speedup.
  - Berkeley UPC system: a promising COTS cluster tool.
    - Performance on par with HP UPC.
    - VAPI and Elan are initially found to be strongest.
  - Surprisingly bad performance is possible in UPC!

12 Introduction to New Activity (FY05): UPC Performance Analysis Tool (PAT)
- Motivations
  - A UPC program does not yield the expected performance. Why?
  - Due to the complexity of parallel computing, this is difficult to determine without tools for performance analysis.
  - Discouraging for users, new and old; few options for shared-memory computing in the UPC and SHMEM communities.
- Goals
  - Identify important performance "factors" in UPC computing.
  - Develop a framework for a performance analysis tool, either as a new tool or as an extension/redesign of existing non-UPC tools.
  - Design with both performance and user productivity in mind.
  - Attract new UPC users and support improved performance.

13 Approach
- Define layers to divide the workload.
- Conduct the existing-tool study and the performance-layers study in parallel to:
  - Minimize development time.
  - Maximize usefulness of the PAT.

14 Conclusions (FY05)
- PAT development cannot be successful without UPC developer and user input.
  - Develop a UPC user pool to obtain user input: What kind of information is important? Familiarity with any existing PAT? Preferences, if any, and why? Past experience with program optimization.
  - Requires extensive knowledge of how each UPC compiler works in order to support each of them successfully: compilation strategies, optimization techniques, and the list of current and future platforms.
  - Propose the idea of a standard set of performance measurements, covering computation (local, remote) and communication, for all UPC platforms and implementations (a sketch of one such measurement follows below).
  - Develop a repository of known performance-bottleneck issues.
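As a concrete illustration of the standard-measurements idea flagged above, here is a rough microbenchmark sketch; the timer, repetition count, and output format are assumptions, not part of the proposal.

    #include <upc_relaxed.h>
    #include <stdio.h>
    #include <sys/time.h>

    #define REPS 100000

    shared int cell[THREADS];             /* one int with affinity to each thread */

    static double wall_seconds(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec * 1e-6;
    }

    int main(void)
    {
        upc_barrier;
        if (MYTHREAD == 0 && THREADS > 1) {
            volatile int sink = 0;
            double t0, t_local, t_remote;
            int r;

            t0 = wall_seconds();
            for (r = 0; r < REPS; r++) sink += cell[0];   /* local shared reads */
            t_local = wall_seconds() - t0;

            t0 = wall_seconds();
            for (r = 0; r < REPS; r++) sink += cell[1];   /* remote shared reads */
            t_remote = wall_seconds() - t0;

            printf("avg local  read: %.3f us\n", 1e6 * t_local  / REPS);
            printf("avg remote read: %.3f us\n", 1e6 * t_remote / REPS);
        }
        upc_barrier;
        return 0;
    }

A production tool would need a finer-grained timer, warm-up runs, statistics over many repetitions, and care that the relaxed memory model does not coalesce the repeated reads; this only shows the shape of the measurement.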

15 Suggestions and Questions?

16 Appendix – Sample User Survey
- Are you currently using any performance analysis tools? If so, which ones? Why?
- What features do you think are most important in deciding which tool to use?
- What kind of information is most important to you when determining performance bottlenecks?
- Which platforms do you target most? Which compiler(s)?
- From past experience, what coding practices typically lead to most of the performance bottlenecks (for example, bad data affinity to a node)?

17 Appendix - GASNet Latency on Conduits

18 Appendix - GASNet Throughput on Conduits