UPC Research Activities at UF
Presentation for UPC Workshop ’04
Alan D. George, Hung-Hsun Su, Burton C. Gordon, Bryan Golden, Adam Leko
HCS Research Laboratory, University of Florida
2 Outline
FY04 research activities
  Objectives
  Overview
  Results
  Conclusions
New activities for FY05
  Introduction
  Approach
  Conclusions
3 Research Objective (FY04)
Goal
  Extend Berkeley UPC support to SCI with a new SCI conduit.
  Compare UPC performance on platforms with various interconnects using existing and new benchmarks.
4 GASNet SCI Conduit - Design
5 Experimental Testbed
Elan, VAPI (Xeon), MPI, and SCI conduits
  Nodes: Dual 2.4 GHz Intel Xeons, 1GB DDR PC2100 (DDR266) RAM, Intel SE7501BR2 server motherboard with E7501 chipset.
  SCI: 667 MB/s (300 MB/s sustained) Dolphin SCI D337 (2D/3D) NICs, using PCI 64/66, 4x2 torus.
  Quadrics: 528 MB/s (340 MB/s sustained) Elan3, using PCI-X in two nodes with QM-S16 16-port switch.
  InfiniBand: 4x (10 Gb/s, 800 MB/s sustained) Infiniserv HCAs, using PCI-X 100, InfiniIO port switch from Infinicon.
  RedHat 9.0 with gcc compiler V 3.3.2; SCI uses MP-MPICH beta from RWTH Aachen Univ., Germany. Berkeley UPC runtime system 1.1.
VAPI (Opteron)
  Nodes: Dual AMD Opteron 240, 1GB DDR PC2700 (DDR333) RAM, Tyan Thunder K8S server motherboard.
  InfiniBand: Same as in VAPI (Xeon).
GM (Myrinet) conduit (c/o access to cluster at MTU)
  Nodes*: Dual 2.0 GHz Intel Xeons, 2GB DDR PC2100 (DDR266) RAM.
  Myrinet*: 250 MB/s Myrinet 2000, using PCI-X, on 8 nodes connected with 16-port M3F-SW16 switch.
  RedHat 7.3 with Intel C compiler V 7.1, Berkeley UPC runtime system 1.1.
ES80 AlphaServer (Marvel)
  Four 1 GHz EV7 Alpha processors, 8GB RD1600 RAM, proprietary inter-processor connections.
  Tru64 5.1B Unix, HP UPC V2.1 compiler.
* via testbed made available courtesy of Michigan Tech
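The peak and sustained figures quoted for each interconnect can be compared directly. The following is a minimal sketch of that arithmetic, using only the numbers on this slide. The helper names are hypothetical, and the InfiniBand conversion assumes the 10 Gb/s figure is the raw 4x signaling rate with 1 MB taken as 10^6 bytes, which the slide does not state.

```python
# Peak and sustained bandwidth in MB/s, copied from the testbed slide.
LINKS_MB_S = {
    "SCI": (667.0, 300.0),
    "Quadrics Elan3": (528.0, 340.0),
}

def gbps_to_mbytes_per_s(gbps):
    """Convert Gb/s to MB/s (assumes 1 MB = 10**6 bytes, no encoding overhead)."""
    return gbps * 1000.0 / 8.0

def sustained_fraction(peak_mb_s, sustained_mb_s):
    """Fraction of the peak rate that is reported as sustained."""
    return sustained_mb_s / peak_mb_s

# InfiniBand 4x: 10 Gb/s quoted, 800 MB/s sustained.
LINKS_MB_S["InfiniBand 4x"] = (gbps_to_mbytes_per_s(10.0), 800.0)

for name, (peak, sustained) in LINKS_MB_S.items():
    print(f"{name}: {sustained_fraction(peak, sustained):.1%} of peak")
```

These ratios only describe the link rates as quoted; they say nothing about latency, which the benchmark slides treat separately.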
6 IS (Class A) from NAS Benchmark
IS (Integer Sort): lots of fine-grain communication, low computation.
Poor performance in the GASNet communication system does not necessarily indicate poor performance in the UPC application.
7 DES Differential Attack Simulator
S-DES (8-bit key) cipher (integer-based).
Creates basic components used in differential cryptanalysis: S-Boxes, Difference Pair Tables (DPT), and Differential Distribution Tables (DDT).
Bandwidth-intensive application.
Designed for high cache miss rate, so very costly in terms of memory access.
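As an illustration of one of the components named above, here is a minimal sketch of how a Differential Distribution Table can be built for a single S-box. It assumes the textbook S-DES S-box S0 (4-bit input, 2-bit output), since the slides do not list the tables the simulator actually uses. The function names are hypothetical and this is not the simulator's code.

```python
# Textbook S-DES S-box S0 (an assumption; the slides do not give the tables).
S0 = [
    [1, 0, 3, 2],
    [3, 2, 1, 0],
    [0, 2, 1, 3],
    [3, 1, 3, 2],
]

def sbox(x):
    """Map a 4-bit input to a 2-bit output.

    Bits 1 and 4 (outer bits) select the row; bits 2 and 3 select the column.
    """
    b1, b2, b3, b4 = (x >> 3) & 1, (x >> 2) & 1, (x >> 1) & 1, x & 1
    return S0[(b1 << 1) | b4][(b2 << 1) | b3]

def difference_distribution_table():
    """Count, for each input difference dx, how often each output difference dy occurs."""
    ddt = [[0] * 4 for _ in range(16)]
    for dx in range(16):
        for x in range(16):
            dy = sbox(x) ^ sbox(x ^ dx)
            ddt[dx][dy] += 1
    return ddt

if __name__ == "__main__":
    for dx, row in enumerate(difference_distribution_table()):
        print(f"dx={dx:04b}: {row}")
```

Every row sums to 16 because each of the 16 inputs contributes exactly one output difference. A differential attack looks for rows in which one output difference dominates.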
8 DES Analysis
With increasing number of nodes, bandwidth and NIC response time become more important.
Interconnects with high bandwidth and fast response times perform best.
  Marvel shows near-perfect linear speedup, but processing time of integers is an issue.
  VAPI shows constant speedup.
  Elan shows near-linear speedup from 1 to 2 nodes, but more nodes are needed in the testbed for better analysis.
  GM does not begin to show any speedup until 4 nodes, and then only minimal speedup.
  MPI conduit is clearly inadequate for high-bandwidth programs.
  SCI conduit performs well for high-bandwidth programs but has the same speedup problem as GM.
9 Differential Cryptanalysis for CAMEL Cipher
Uses 1024-bit S-Boxes.
Given a key, encrypts data, then tries to guess the key based solely on the encrypted data using a differential attack.
Has three main phases:
  1. Compute the optimal difference pair based on the S-Box (not very CPU-intensive).
  2. Perform the main differential attack (extremely CPU-intensive): get a list of candidate keys and check all of them by brute force in combination with the optimal difference pair computed earlier.
  3. Analyze data from the differential attack (not very CPU-intensive).
Computationally intensive (independent processes), with several synchronization points.
Parameters
  MAINKEYLOOP = 256
  NUMPAIRS = 400,000
  Initial Key: 12345
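The slides do not say how the main attack loop is divided among UPC threads. As a sketch only, the code below assumes a cyclic distribution in which iteration i runs on thread i % THREADS, which is what a upc_forall loop with an integer affinity expression would give. The function names are hypothetical. It shows how evenly MAINKEYLOOP = 256 iterations split for a given thread count.

```python
MAINKEYLOOP = 256  # parameter value from the slide

def iterations_for(thread, threads, total=MAINKEYLOOP):
    """Iterations a thread would own under a cyclic (i % threads) distribution."""
    return [i for i in range(total) if i % threads == thread]

def max_load(threads, total=MAINKEYLOOP):
    """Largest number of iterations assigned to any single thread."""
    return max(len(iterations_for(t, threads, total)) for t in range(threads))

if __name__ == "__main__":
    for threads in (1, 2, 4, 8, 16, 32):
        print(f"{threads:2d} threads: at most {max_load(threads)} iterations per thread")
```

With power-of-two thread counts the 256 iterations divide evenly. Under this assumed distribution, any loss of speedup would come from synchronization and data placement rather than from load imbalance.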
10 CAMEL Analysis
Marvel
  Attained almost perfect speedup.
  Synchronization cost very low.
Berkeley UPC
  Speedup decreases with increasing number of threads.
  Cost of synchronization increases with number of threads.
  Run time varied greatly as number of threads increased.
  Hard to get consistent timing readings.
  Still decent performance for 32 threads (76.25% efficiency, VAPI).
  Performance is more sensitive to data affinity.
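The slide does not define efficiency. Assuming the usual definition (speedup divided by thread count, with speedup as single-thread time over parallel time), the 76.25% figure at 32 threads corresponds to a speedup of about 24.4. The sketch below works through that arithmetic. The function names are hypothetical and the timings in the example call are made up.

```python
def speedup(t_serial, t_parallel):
    """Speedup: single-thread run time divided by parallel run time."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, threads):
    """Parallel efficiency: speedup divided by the number of threads."""
    return speedup(t_serial, t_parallel) / threads

def speedup_from_efficiency(eff, threads):
    """Invert the definition: the speedup implied by a reported efficiency."""
    return eff * threads

if __name__ == "__main__":
    # Reported on the slide: 76.25% efficiency at 32 threads (VAPI).
    print(speedup_from_efficiency(0.7625, 32))
    # Hypothetical timings, for illustration only: 100 s serial, 5 s on 32 threads.
    print(efficiency(100.0, 5.0, 32))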
11 Conclusions (FY04)
SCI conduit
  Functional, optimized version is available.
  Although limited by the current driver from the vendor, it achieves performance comparable to other conduits.
  Enhancements to resolve the driver limitation are being investigated in close collaboration with Dolphin.
    Support access to all virtual memory on a remote node.
    Minimize transfer setup overhead.
Performance comparison
  Marvel
    Provides better compiler warnings.
    Has better speedup.
  Berkeley UPC system is a promising COTS cluster tool.
    Performance on par with HP UPC.
    VAPI and Elan are initially found to be strongest.
Surprisingly bad performance is possible in UPC!
12 Introduction to New Activity (FY05)
UPC Performance Analysis Tool (PAT)
Motivations
  A UPC program does not yield the expected performance. Why?
  Due to the complexity of parallel computing, the cause is difficult to determine without tools for performance analysis.
  Discouraging for users, new and old; few options for shared-memory computing in the UPC and SHMEM communities.
Goals
  Identify important performance “factors” in UPC computing.
  Develop a framework for a performance analysis tool.
    As a new tool or as an extension/redesign of existing non-UPC tools.
  Design with both performance and user productivity in mind.
  Attract new UPC users and support improved performance.
13 Approach
Define layers to divide the workload.
Conduct existing-tool study and performance layers study in parallel to:
  Minimize development time
  Maximize usefulness of PAT
14 Conclusions (FY05)
PAT development cannot be successful without UPC developer and user input.
Develop a UPC user pool to obtain user input.
  What kind of information is important?
  Familiarity with any existing PAT? Preference, if any? Why?
  Past experience with program optimization.
Requires extensive knowledge of how each UPC compiler works in order to support each of them successfully.
  Compilation strategies.
  Optimization techniques.
  List of current and future platforms.
Propose the idea of a standard set of performance measurements for all UPC platforms and implementations.
  Computation (local, remote).
  Communication.
Develop a repository of known performance bottleneck issues.
16 Appendix – Sample User Survey
Are you currently using any performance analysis tools? If so, which ones? Why?
What features do you think are most important in deciding which tool to use?
What kind of information is most important for you when determining performance bottlenecks?
Which platforms do you target most? Which compiler(s)?
From past experience, what coding practices typically lead to most of the performance bottlenecks (for example, bad data affinity to node)?
17 Appendix - GASNet Latency on Conduits
18 Appendix - GASNet Throughput on Conduits