IceCube simulation with PPC on GPUs
Dmitry Chirkin, UW Madison
PPC: photon propagation code; GPU: graphics processing unit

IceCube simulation with PPC
Photonics: 2000 – up to now
Photon propagation code (PPC): … – now

Photonics: conventional, on CPU
First, run photonics to fill space with photons and tabulate the result. Create such tables for the nominal light sources: cascade and uniform half-muon. Then simulate photon propagation by looking up the photon density in the tabulated distributions.
– Table generation is slow
– Simulation suffers from a wide range of binning artifacts
– Simulation is also slow! (most time is spent loading the tables)

Direct photon tracking with PPC (photon propagation code)
– simulating flasher/standard candle photons
– same code for muon/cascade simulation
– using precise scattering function: linear combination of HG+SAM
– using tabulated (in 10 m depth slices) layered ice structure
– employing 6-parameter ice model to extrapolate in wavelength
– tilt in the ice layer structure is properly taken into account
– transparent folding of acceptance and efficiencies
– precise tracking through layers of ice, no interpolation needed
– precise simulation of the longitudinal development of cascades and angular distribution of particles emitting Cherenkov photons

Approximation to Mie scattering
Mie scattering describes scattering on acid, mineral, salt, and soot particles with the concentrations and radii measured at the South Pole. In ppc it is approximated by a linear combination of the Henyey-Greenstein (HG) and simplified Liu (SL) phase functions, mixed with fraction f_SL.
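For reference, a minimal LaTeX sketch of the standard functional forms, under the assumption that both terms are parametrized by the same mean cosine g; the Henyey-Greenstein form is standard, while the simplified-Liu normalization and the relation alpha = 2g/(1-g) are written so that each term has mean cosine g and should be checked against the SPICE papers before use:

p_{\mathrm{HG}}(\cos\theta) = \frac{1}{2}\,\frac{1-g^{2}}{\left(1+g^{2}-2g\cos\theta\right)^{3/2}}

p_{\mathrm{SL}}(\cos\theta) = \frac{\alpha+1}{2}\left(\frac{1+\cos\theta}{2}\right)^{\alpha}, \qquad \alpha=\frac{2g}{1-g}

p(\cos\theta) = (1-f_{\mathrm{SL}})\,p_{\mathrm{HG}}(\cos\theta) + f_{\mathrm{SL}}\,p_{\mathrm{SL}}(\cos\theta)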

Dependence on g = ⟨cosθ⟩ and f_SL
[Plots: comparison with flasher data for different values of g and f_SL; flashing … 64-50]

Ice tilt in ppc
Measured with dust loggers (Ryan Bay)

Photon angular profile (from the thesis of Christopher Wiebusch)

PPC simulation on GPU (graphics processing unit)
[Diagram: execution threads perform propagation steps (between scatterings); when a photon is absorbed, a new photon is created (taken from the pool); threads complete their execution when there are no more photons]
Running on an NVidia GTX 295 CUDA-capable card, ppc is configured with:
– 448 threads in 30 blocks (total of 13,440 threads)
– average of ~1024 photons per thread (total of ~1.4e7 photons per call)
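A minimal CUDA sketch of this execution model with the launch configuration quoted above; this is not the actual ppc kernel, and the photon-pool handling and names are simplified assumptions:

#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: each thread repeatedly takes a photon from a global pool
// and performs propagation steps until the photon is absorbed or the pool is empty.
__global__ void propagate(int *pool, int per_thread)
{
    for (int i = 0; i < per_thread; i++) {
        int remaining = atomicSub(pool, 1);   // take the next photon from the pool
        if (remaining <= 0) return;           // no more photons: the thread completes
        // ... propagation steps (between scatterings) would go here ...
    }
}

int main()
{
    const int blocks = 30, threads = 448;     // configuration quoted above for the GTX 295
    const int per_thread = 1024;              // ~1024 photons per thread
    int n_photons = blocks * threads * per_thread;

    int *pool;
    cudaMalloc(&pool, sizeof(int));
    cudaMemcpy(pool, &n_photons, sizeof(int), cudaMemcpyHostToDevice);

    propagate<<<blocks, threads>>>(pool, per_thread);
    cudaDeviceSynchronize();

    cudaFree(pool);
    return 0;
}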

Photon Propagation Code: PPC
There are 5 versions of ppc:
– original c++
– "fast" c++
– in Assembly
– for CUDA GPU
– icetray module
All versions verified to produce identical results; also compared with i3mcml.

ppc icetray module
– uses a wrapper: private/ppc/i3ppc.cxx, compiled by the cmake system into libppc.so
– an additional library, libxppc.so, is compiled by cmake (set GPU_XPPC:BOOL=ON or OFF)
– or it can also be compiled by running make in private/ppc/gpu:
  "make glib" compiles the gpu-accelerated version (needs cuda tools)
  "make clib" compiles the cpu version (from the same sources!)

ppc example script run.py

if(len(sys.argv)!=6):
    print "Use: run.py [corsika/nugen/flasher] [gpu] [seed] [infile/num of flasher events] [outfile]"
    sys.exit()
…
det = "ic86"
detector = False
…
os.putenv("PPCTABLESDIR", expandvars("$I3_BUILD/ppc/resources/ice/mie"))
…
if(mode == "flasher"):
    …
    str=63
    dom=20
    nph=8.e9
    tray.AddModule("I3PhotoFlash", "photoflash")(…)
    os.putenv("WFLA", "405")  # flasher wavelength; set to 337 for standard candles
    os.putenv("FLDR", "-1")   # direction of the first flasher LED
    …
    # Set FLDR=x+(n-1)*360, where 0<=x<360 and n>0, to simulate n LEDs in a
    # symmetrical n-fold pattern, with the first LED centered in the direction x.
    # Negative or unset FLDR simulates a pattern of light symmetric in azimuth.
    tray.AddModule("i3ppc", "ppc")(
        ("gpu", gpu),
        ("bad", bad),
        ("nph", nph*0.1315/25),    # corrected for efficiency and DOM oversize factor; eff(337)=…
        ("fla", OMKey(str, dom)),  # set str=-str for tilted flashers, str=0 and dom=1,2 for SC1 and 2
    )
else:
    …

ppc-pick and ppc-eff

ppc-pick: restrict to primaries below MaxEpri

load("libppc-pick")
tray.AddModule("I3IcePickModule<…>", "emax")(
    ("DiscardEvents", True),
    ("MaxEpri", 1.e9*I3Units.GeV)
)

ppc-eff: reduce efficiency from 1.0 to eff

load("libppc-eff")
tray.AddModule("AdjEff", "eff")(
    ("eff", eff)
)
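Reducing the efficiency from 1.0 to eff amounts to keeping each simulated hit with probability eff; a minimal sketch of such a thinning step, illustrative only and not the code of the AdjEff module:

#include <cstdio>
#include <random>
#include <vector>

// Keep each MC hit with probability eff (0 < eff <= 1). Assumes the hits were
// produced at 100% DOM efficiency, as recommended on the "Typical run times" slide.
std::vector<double> thin_hits(const std::vector<double>& hit_times, double eff, unsigned seed)
{
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> u(0.0, 1.0);
    std::vector<double> kept;
    for (double t : hit_times)
        if (u(rng) < eff) kept.push_back(t);
    return kept;
}

int main()
{
    std::vector<double> hits(10000, 0.0);                 // dummy hit times
    std::vector<double> kept = thin_hits(hits, 0.9, 42);  // reduce efficiency to 0.9
    std::printf("kept %zu of %zu hits\n", kept.size(), hits.size());
    return 0;
}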

ppc homepage

GPU scaling
[Table of relative speeds: Original: 1/2.08 1/2.70; CPU c++: …; Assembly: …; GTX 295: …; GTX/Ori: …; C1060: …; C2050: …; GTX 480: …]

On GTX 295: … GHz, running on 30 MPs x 448 threads; kernel uses: l=0 r=35 s=8176 c=62400
On GTX 480: … GHz, running on 15 MPs x 768 threads; kernel uses: l=0 r=40 s=3960 c=62400
On C1060: … GHz, running on 30 MPs x 448 threads; kernel uses: l=0 r=35 s=3992 c=62400
On C2050: … GHz, running on 14 MPs x 768 threads; kernel uses: l=0 r=41 s=3960 c=62400

Uses cudaGetDeviceProperties() to get the number of multiprocessors; uses cudaFuncGetAttributes() to get the maximum number of threads (see the sketch below).
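A minimal sketch of how these two runtime calls can be combined to choose a launch configuration; the kernel name and the one-block-per-multiprocessor heuristic are illustrative assumptions, not ppc's exact logic:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void propagate_kernel() { /* placeholder kernel */ }

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                  // query device 0
    printf("MPs: %d, clock: %.3f GHz\n",
           prop.multiProcessorCount, prop.clockRate * 1e-6);

    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, propagate_kernel);     // per-kernel limits
    printf("max threads/block for this kernel: %d\n", attr.maxThreadsPerBlock);
    printf("registers: %d, shared mem: %zu bytes\n",
           attr.numRegs, attr.sharedSizeBytes);

    // Example heuristic: one block per multiprocessor, as many threads as the kernel allows.
    int blocks  = prop.multiProcessorCount;
    int threads = attr.maxThreadsPerBlock;
    propagate_kernel<<<blocks, threads>>>();
    cudaDeviceSynchronize();
    return 0;
}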

GTX 480 vs. GTX 295

GTX 295 has 2 GPUs:
– 240 cores in 30 MPs per GPU (8 cores per MP)
– 2 single-precision sFPUs per MP → 60 sFPUs per GPU
– 480 cores per card, 120 sFPUs per card
– shared memory: 16 KB per MP, 960 KB per card

GTX 480 has 1 GPU:
– 480 cores in 15 MPs (32 cores per MP)
– 4 single-precision sFPUs per MP → 60 sFPUs per GPU
– 480 cores per card, 60 sFPUs per card (!)
– shared memory: up to 48 KB per MP, up to 720 KB per card

Why is ppc not a factor 2 faster on a GTX 480 GPU than on a GTX 295 GPU?

Kernel time calculation

Run 3232 (corsika) IC86 processing on cuda002 (per file):
GTX 295: Device time: … (in-kernel: …) [ms]
GTX 480: Device time: … (in-kernel: …) [ms]
If more than 1 thread is running on the same GPU: Device time: … (in-kernel: …) [ms]

3 counters:
1. time difference before/after kernel launch in the host code
2. in-kernel, using the cycle counter: min thread time
3. max thread time

Also, real/user/sys times from top:
gpus: 6, cpus: 1, cores: 8, files: 693
real: 749m4.693s, user: 3456m10.888s, sys: 39m50.369s
Device time: … [ms]
files: 693; real: …, user: …, gpu: …, kernel: … [seconds]
81%-91% GPU utilization

Concurrent execution time
[Diagram: timelines of alternating CPU and GPU phases within one thread, and of two threads whose CPU and GPU phases overlap]

One thread: create track segments → copy track segments to GPU → process photon hits → copy photon hits from GPU.
Need 2 buffers for track segments and photon hits. However: we already have 2 buffers, 1 on the host and 1 on the GPU! Just need to synchronize before the buffers are re-used.
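A minimal CUDA sketch of this buffering pattern, using pinned host buffers, one stream, and an event that marks when the host buffer may be re-used; the buffer contents and sizes are illustrative assumptions, not ppc's actual data structures:

#include <cuda_runtime.h>

// Illustrative kernel: consumes track segments, produces photon hits.
__global__ void propagate(const float *seg, float *hits, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) hits[i] = seg[i];                    // placeholder for real propagation
}

int main()
{
    const int N = 1 << 20;
    float *h_seg, *h_hits, *d_seg, *d_hits;
    cudaHostAlloc(&h_seg,  N * sizeof(float), cudaHostAllocDefault);   // pinned host buffers
    cudaHostAlloc(&h_hits, N * sizeof(float), cudaHostAllocDefault);
    cudaMalloc(&d_seg,  N * sizeof(float));
    cudaMalloc(&d_hits, N * sizeof(float));

    cudaStream_t s;       cudaStreamCreate(&s);
    cudaEvent_t  copied;  cudaEventCreate(&copied);

    for (int batch = 0; batch < 10; batch++) {
        for (int i = 0; i < N; i++) h_seg[i] = (float)batch;           // CPU: fill track segments

        cudaMemcpyAsync(d_seg, h_seg, N * sizeof(float), cudaMemcpyHostToDevice, s);
        cudaEventRecord(copied, s);                 // after this point h_seg may be re-used
        propagate<<<(N + 255) / 256, 256, 0, s>>>(d_seg, d_hits, N);
        cudaMemcpyAsync(h_hits, d_hits, N * sizeof(float), cudaMemcpyDeviceToHost, s);

        // The CPU may prepare the next batch as soon as the host->device copy is done,
        // while the kernel and the device->host copy are still in flight:
        cudaEventSynchronize(copied);
        // ... next-batch CPU work could start here ...

        cudaStreamSynchronize(s);                   // before reading h_hits
        // CPU: process the photon hits in h_hits
    }

    cudaEventDestroy(copied); cudaStreamDestroy(s);
    cudaFreeHost(h_seg); cudaFreeHost(h_hits);
    cudaFree(d_seg); cudaFree(d_hits);
    return 0;
}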

Typical run times
corsika: run 3232: … sec, … files; ic86/spx/3232 on cuda00[123] (53.4 seconds per job)
→ 1.2 days of real detector time in 6.5 days
nugen: run 2972: … event files, E^-2 weighted; ic86/spx/2972 on cudatest (25.1 seconds per job)
→ entire 10k set of files in 2.9 days; this is enough for an atmnu/diffuse analysis!

Considerations:
– maximize GPU utilization by running only the mmc+ppc parts on the GPU nodes (still, IC40 mmc+ppc+detector was run with ~80% GPU utilization)
– run with 100% DOM efficiency and save all ppc events with at least 1 MC hit
– apply a range of allowed efficiencies (70-100%) later with the ppc-eff module

Use in analysis
PPC run on GPUs has already been used in several analyses, published or in progress. The ease of changing the ice parameters facilitated propagation of the ice uncertainties through the analysis: all "systematics" simulation sets are produced in roughly the same amount of time, with no extra overhead. A comparable uncertainty analysis based on photonics simulation would have taken much longer because of the large CPU cost of the initial table generation.

OSG Summer School ‘10
DAG (Directed Acyclic Graph)-based simulation:
– separate the simulation chain into tasks
– assign each task to a node in the DAG

Dedicated GPU cluster: GPU-based simulation
We have recently begun experimenting with a GPU-based implementation of portions of the IceCube simulation. The DAG assigns separate tasks to different compute nodes, and the photon propagation simulation is executed on dedicated GPU nodes. For many simulations the GPU segment of the chain is much faster than the rest of the simulation, so a small number of GPU-enabled machines can consume the data from a large pool of CPU cores.
[Diagram: several generator and detector nodes feeding a PPC node]

GPU-based simulation
The optimal DAG differs depending on the specific simulation.

Current Status of GPU-based Simulation
– Production NuGen and CORSIKA simulation is currently running on the Madison cluster: NPX3+CUDA.
– Testing optimal DAG configurations to take advantage of GPUs.
– The current Condor queue has an option for selecting machines with GPUs. There are multiple cores and multiple GPUs on each machine; Condor assigns the environment variable ${_CONDOR_SLOT}, which is used as a parameter to select a GPU in PPC (see the sketch below).
– SGE and PBS IceProd plugins are being written to support DAGs, in order to incorporate other non-Condor sites: NERSC Dirac and Tesla, and clusters in Dortmund, DESY, Alberta.
– Other IceCube sites report having GPUs available and will be incorporated into production.
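A minimal sketch of how a ${_CONDOR_SLOT} value such as "slot3" could be mapped to a CUDA device index; the slot-name format and the parsing below are assumptions, and the actual mapping used in production may differ:

#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

int main()
{
    // Condor sets _CONDOR_SLOT to something like "slot3" (assumed format).
    const char *slot = getenv("_CONDOR_SLOT");
    int gpu = 0;
    if (slot) {
        const char *digits = slot + strspn(slot, "abcdefghijklmnopqrstuvwxyz_");
        gpu = atoi(digits) - 1;                    // slot numbering starts at 1
        if (gpu < 0) gpu = 0;
    }

    int count = 0;
    cudaGetDeviceCount(&count);
    if (count > 0) {
        cudaSetDevice(gpu % count);                // wrap around if there are more slots than GPUs
        printf("slot %s -> using GPU %d of %d\n", slot ? slot : "(unset)", gpu % count, count);
    }
    return 0;
}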

GPU/PPC production: coincident neutrino-CR muons

Our initial GPU cluster
4 computers: 1 cudatest, 3 cuda nodes (cuda001-3)
Each has a 4-core CPU and 3 GPU cards, each card with 2 GPUs (i.e., 6 GPUs per computer)
Each computer is ~$3500

Our initial cluster

cudatest: Found 6 devices, driver 2030, runtime …
0(1.3): GeForce GTX 295, … GHz, G(…) S(16384) C(65536) R(16384) W(32) l1 o1 c0 h1 i0 m30 a256 M(262144) T(512: 512,512,64) G(65535,65535,1)
1(1.3): GeForce GTX 295, … GHz, G(…) S(16384) C(65536) R(16384) W(32) l0 o1 c0 h1 i0 m30 a256 M(262144) T(512: 512,512,64) G(65535,65535,1)
2-5(1.3): GeForce GTX 295, same parameters as device 1

3 GTX 295 cards, each with 2 GPUs (device pairs 0 and 1, 2 and 3, 4 and 5 share a card and PSU connection)

nvidia-smi -lsa:
GPU 0: GeForce GTX 295, Serial: …, PCI ID: 5eb10de, Temperature: 87 C
GPU 1: GeForce GTX 295, Serial: …, PCI ID: 5eb10de, Temperature: 90 C
GPU 2: GeForce GTX 295, Serial: …, PCI ID: 5eb10de, Temperature: 100 C
GPU 3: GeForce GTX 295, Serial: …, PCI ID: 5eb10de, Temperature: 105 C
GPU 4: GeForce GTX 295, Serial: …, PCI ID: 5eb10de, Temperature: 100 C
GPU 5: GeForce GTX 295, Serial: …, PCI ID: 5eb10de, Temperature: 103 C

BAD multiprocessors (MPs)

[Table: node list (clist: cudatest, cuda nodes) with the number of bad MPs (#badmps) per cuda node]

Disable 3 bad GPUs out of 24: 12.5%
Disable 3 bad MPs out of 720: 0.4%! (a sketch of skipping a single MP follows the log below)

Without the workaround (device 2):
Configured: xR=5 eff=0.95 sf=0.2 g=0.943
Loaded 12 angsens coefficients
Loaded 6x170 dust layer points
Loaded random multipliers
Loaded 42 wavelenth points
Loaded 171 ice layers
Loaded 3540 DOMs (19x19)
Processing f2k muons from stdin on device 2
Total GPU memory usage: …
photons: … hits: 991
Error: TOT was a nan or an inf 1 times! Bad MP #20
photons: … hits: 393
photons: … hits: 570
photons: … hits: 501
photons: … hits: 832
photons: … hits: 717
CUDA Error: unspecified launch failure
Total GPU memory usage: …
photons: … hits: 938
Error: TOT was a nan or an inf 9 times! Bad MP #20 #20 #20 #20
photons: … hits: 442
photons: … hits: 627
CUDA Error: unspecified launch failure

With the bad MP excluded:
[gpu]$ cat mmc.1.f2k | BADMP=20 ./ppc 2 > /dev/null
Configured: xR=5 eff=0.95 sf=0.2 g=0.943
Loaded 12 angsens coefficients
Loaded 6x170 dust layer points
Loaded random multipliers
Loaded 42 wavelenth points
Loaded 171 ice layers
Loaded 3540 DOMs (19x19)
Processing f2k muons from stdin on device 2
Not using MP #20
Total GPU memory usage: …
photons: … hits: 871
…
photons: … hits: 114
Device time: … (in-kernel: …) [ms]

Failure rates: [plot]
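The BADMP=20 run above shows the idea of excluding a single multiprocessor rather than the whole GPU. A minimal CUDA sketch of one way to do this, reading the SM id via the %smid register and idling blocks that land on the flagged MP; this is an illustration of the technique only, not ppc's actual implementation:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__device__ unsigned int smid()
{
    unsigned int id;
    asm("mov.u32 %0, %%smid;" : "=r"(id));        // multiprocessor this block is running on
    return id;
}

__global__ void propagate(int bad_mp)
{
    if (smid() == (unsigned int)bad_mp) return;   // idle out blocks scheduled on the bad MP
    // ... normal photon propagation work here ...
}

int main()
{
    const char *env = getenv("BADMP");            // e.g. BADMP=20 ./ppc 2
    int bad_mp = env ? atoi(env) : -1;
    if (bad_mp >= 0) printf("Not using MP #%d\n", bad_mp);
    propagate<<<30, 448>>>(bad_mp);
    cudaDeviceSynchronize();
    return 0;
}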

The IceWave Cluster (~$200,000)
3 x DELL PowerEdge C410x
– 48 Tesla M2070 GPGPUs
– 21504 GPU cores
– 48 TFlops single precision
– 24 TFlops double precision
6 x DELL PowerEdge C6145
– 24 AMD Opteron™ 6100 processors
– 288 CPU cores
– 7 TFlops double precision
QDR Infiniband interconnect
– allows high-speed MPI applications

Basic Elements
DELL PowerEdge C410x
– 16-slot PCIe expansion chassis
– use of C2050 TESLA GPUs
– flexible assignment of GPUs to servers: allows 1-4 GPUs per server

Basic Elements
DELL PowerEdge C6145
– 2 x 4-CPU servers
– AMD Opteron™ 6100 (Magny-Cours)
– 12 cores per processor
– 48-96 cores per 2U server
– 192 GB per 2U server

Concluding remarks
– PPC (photon propagation code) is used by IceCube for photon tracking.
– Precise treatment of the photon propagation; includes all known effects (longitudinal development of particle cascades, ice tilt, etc.).
– PPC can be run on CPUs or GPUs; running on a GPU is 100s of times faster than on a CPU core.
– We use a DAG to run PPC routinely on GPUs for mass production of simulated data.
– GPU computers can be assembled with NVidia or AMD video cards; however, some problems exist in consumer video cards: bad MPs can be worked around in PPC, or computing-grade hardware can be used instead.

Extra slides

Oversize DOM treatment
[Diagram: nominal DOM vs. DOM oversized ~5 times, with a photon track]
Oversized DOM treatment (designed for minimum bias compared to oversize=1):
– oversize only in the direction perpendicular to the photon
– the time needed to reach the nominal (non-oversized) DOM surface is added
– re-use the photon after it hits a DOM, and ensure causality in the flasher simulation
The oversize model was chosen carefully to produce the best possible agreement with the nominal x1 case (see next slide).
This is a crucial optimization; however, some bias is unavoidable since the DOMs occupy a larger space:
x1: diameter of 33 cm; x5: 1.65 m; x16: 5.3 m
This could lead to a ~5-10% variation of the individual DOM simulated rates.

Timing of oversized DOM MC
[Plots: timing distributions for different oversize handling options]
– xR=1
– default
– do not track back to detected DOM
– do not track after detection
– no oversize delta correction!
– do not check causality
– del=(sqrtf(b*b+(1/(e.zR*e.zR-1)*c)-D)*e.zR-h
– del=e.R-OMR
Flashing …; xR=1, default

Ice density: m.w.e.
– handbook of chemistry and physics
– T. Gow's data of density near the surface
– T = …*d + 5.822e-6*d… (fit to AMANDA data)
– fit to (1 − p1*exp(−p2*d)) * f(T(d)) * (1 + 0.94e-12*9.8*917*d)

Device enumeration

cuda002: Found 5 devices, driver 3010, runtime …
0(2.0): GeForce GTX 480, … GHz, G(…) S(49152) C(65536) R(32768) W(32) l0 o1 c0 h1 i0 m15 a512 M(…) T(1024: 1024,1024,64) G(65535,65535,1)
1(1.3): GeForce GTX 295, … GHz, G(…) S(16384) C(65536) R(16384) W(32) l0 o1 c0 h1 i0 m30 a256 M(…) T(512: 512,512,64) G(65535,65535,1)
2-4(1.3): GeForce GTX 295, same parameters as device 1

2 GTX 295 cards and 1 GTX 480 card; note that the CUDA device numbering (GTX 480 is device 0) differs from the nvidia-smi numbering (GTX 480 is GPU 2).

nvidia-smi -a:
GPU 0: GeForce GTX 295, PCI ID: 5eb10de, Temperature: 68 C
GPU 1: GeForce GTX 295, PCI ID: 5eb10de, Temperature: 73 C
GPU 2: GeForce GTX 480, PCI ID: 6c010de, Temperature: 106 C
GPU 3: GeForce GTX 295, PCI ID: 5eb10de, Temperature: 90 C
GPU 4: GeForce GTX 295, PCI ID: 5eb10de, Temperature: 91 C

Fermi vs. Tesla

cudatest: Found 6 devices, driver 2030, runtime …
0(1.3): GeForce GTX 295, … GHz, G(…) S(16384) C(65536) R(16384) W(32) l1 o1 c0 h1 i0 m30 a256 M(262144) T(512: 512,512,64) G(65535,65535,1)
tesla: Found 1 devices, driver 3000, runtime …
0(1.3): Tesla C1060, … GHz, G(…) S(16384) C(65536) R(16384) W(32) l1 o1 c0 h1 i0 m30 a256 M(…) T(512: 512,512,64) G(65535,65535,1)
fermi: Found 1 devices, driver 3000, runtime …
0(2.0): Tesla C2050, … GHz, G(…) S(49152) C(65536) R(32768) W(32) l1 o1 c0 h1 i0 m14 a512 M(…) T(1024: 1024,1024,64) G(65535,65535,1)
beta: Found 1 devices, driver 3010, runtime …
0(2.0): Tesla C2050, … GHz, G(…) S(49152) C(65536) R(32768) W(32) l1 o1 c0 h1 i0 m14 a512 M(…) T(1024: 1024,1024,64) G(65535,65535,1)

Build configurations:
11: arch=11 make gpu
12: arch=12 make gpu (default/best)
1x: arch=12 make gpu with -ftz=true -prec-div=false -prec-sqrt=false
20: arch=20 make gpu
2x: arch=20 make gpu with -ftz=true -prec-div=false -prec-sqrt=false

[Plots: flasher and f2k muon comparisons for these configurations]

Other
Consider:
– building production computers with only 2 cards, leaving a space in between
– using 6-core CPUs if paired with 3 GPU cards
– 4-way Tesla GPU-only servers as a possible solution
Consumer GTX cards are much faster than Tesla/Fermi cards; GTX 295 was so far found to be a better choice than GTX 480, but: no longer available!
Reliability: 0.4% loss of advertised capacity in GTX 295 cards; however, 2 of 3 affected cards were "refurbished". Do cards deteriorate over time? The failed MPs did not change in ~3 months.