1 RAMP 100K Core Breakout
Assorted RAMPants
RAMP Retreat, UC San Diego
June 14, 2007
1M
2 Two Kinds of 1M Core Machine
- Scientific supercomputer (e.g., BlueGene)
- Data center (e.g., Google)
Some commonalities:
- Both will require heavy virtualization of the physical FPGA resources
- Both will need host disk to hold target RAM state
  - Goal is 1M cores in a “few” racks
  - Roughly 100-1,000 TB of target RAM (100-1,000 disks)
Some differences:
- Latency of core interaction
  - Supercomputer: microseconds
  - Datacenter: milliseconds
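The capacity figures on this slide can be checked with a short back-of-envelope calculation. This is only a sketch: the 1 GB/core and 1 TB/disk figures appear on a later slide, while the 100 MB/core lower bound and the function names are assumptions picked to reproduce the 100-1,000 TB range.

```python
import math

def target_ram_tb(cores: int, mb_per_core: int) -> float:
    """Total target RAM in TB, using decimal units (1 TB = 1,000,000 MB)."""
    return cores * mb_per_core / 1_000_000

def disks_needed(total_tb: float, tb_per_disk: float = 1.0) -> int:
    """Host disks needed to hold all target RAM state (assumes 1 TB disks)."""
    return math.ceil(total_tb / tb_per_disk)

CORES = 1_000_000
low_tb = target_ram_tb(CORES, 100)     # assumed 100 MB of target RAM per core
high_tb = target_ram_tb(CORES, 1000)   # 1 GB of target RAM per core

print(low_tb, disks_needed(low_tb))
print(high_tb, disks_needed(high_tb))
```

Under these assumptions, one disk per FPGA at 1,000 cores per disk implies on the order of 1,000 FPGAs for the full machine.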
3 Virtualization Techniques (virtualizing the old PMS model)
- Cores: virtualize both the functional and timing models on a single physical pipeline on the FPGA
  - A simple barrel pipeline should suffice, since the timing model is always updated even if the target thread is stalled
- Routers: virtualize the crossbar using a single physical switch/RAM block
  - Probably need to buffer one target cycle's worth of inputs to allow an arbitrary arbitration scheme in the model
- Memory: virtualize memory ports using a single physical RAM
- Each PMS arc is a virtualized channel; maybe use simple striping everywhere to make things composable
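The core-virtualization idea can be illustrated with a small software model. This is a minimal sketch, not the breakout group's design: the `VirtualCore` fields, the toy miss rule (`miss_every`, `miss_penalty`), and the function names are all hypothetical. It shows why a fixed round-robin (barrel) schedule is enough: every visit advances the thread's timing model by one target cycle, whether or not the functional model makes progress.

```python
from dataclasses import dataclass

@dataclass
class VirtualCore:
    """Per-thread state multiplexed onto one physical pipeline (hypothetical layout)."""
    pc: int = 0            # functional state: instructions retired so far
    target_cycle: int = 0  # timing state: target cycles simulated so far
    stall_until: int = 0   # timing state: target cycle at which a stall ends

def barrel_step(cores, host_cycle, miss_every=4, miss_penalty=3):
    """Service exactly one virtual core per host cycle, in fixed round-robin order."""
    core = cores[host_cycle % len(cores)]
    if core.target_cycle >= core.stall_until:
        # Functional model: retire one instruction.
        core.pc += 1
        # Toy timing rule (assumption): every miss_every-th instruction misses
        # and stalls the thread for miss_penalty further target cycles.
        if core.pc % miss_every == 0:
            core.stall_until = core.target_cycle + 1 + miss_penalty
    # The timing model advances on every visit, stalled or not, so the
    # pipeline slot is never wasted and no dynamic scheduling is needed.
    core.target_cycle += 1
    return core

def run(num_cores, host_cycles):
    cores = [VirtualCore() for _ in range(num_cores)]
    for h in range(host_cycles):
        barrel_step(cores, h)
    return cores

if __name__ == "__main__":
    for c in run(num_cores=4, host_cycles=40):
        print(c)
```

With 4 virtual cores and 40 host cycles, each core is visited 10 times and therefore simulates 10 target cycles, while retiring fewer instructions than that because of stalls.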
4 Virtualized Memory Hierarchy
- 16 physical cores/FPGA generate only <1 request/cycle off chip (~few % miss rate, independent of the number of virtualized threads + timing models)
  - 8 B/cycle @ 100 MHz = 800 MB/s
- DRAM provides the next level of cache
  - What miss rate from a 4 GB DRAM cache is needed to run at full speed?
- Assume roughly one disk/FPGA
  - 1 TB is enough state for 1,000 cores with 1 GB each
  - Provides <100 MB/s bandwidth, best case
  - Only need <10% miss rate???? (100 MB/s of disk bandwidth against 800 MB/s of off-chip demand gives a ceiling of about 12.5%, before any block-transfer overhead)
- BUT! Disk has huge latency
  - Even with large block transfers (pages? tracks?), want to predetermine memory requests and schedule disk accesses to get reasonable performance
  - Use a runahead technique to guess what each thread will want in the next few simulation cycles: checkpoint registers, run 1,000 cycles ahead (don't write memory, but record misses), restore to the checkpoint, then run 1,000 cycles in demand mode
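The runahead steps on this slide (checkpoint, run ahead without writing memory, record misses, restore, run in demand mode) can be sketched as follows. Everything here is a toy under stated assumptions: the `Thread` class, the 4 KB `PAGE` size, and the fixed-stride address stream are hypothetical, and adding pages to the `resident` set stands in for one scheduled batch of disk reads. Because this toy address stream does not depend on memory contents, the prediction is perfect; a real target could diverge during runahead and still take demand misses.

```python
import copy

PAGE = 4096  # assumed block size for disk transfers

class Thread:
    """Toy target thread that touches one address per target cycle (hypothetical)."""
    def __init__(self, base, stride):
        self.regs = {"addr": base, "stride": stride, "cycles": 0}

    def step(self, memory, resident, write=True):
        """Run one target cycle; return the missing page number, or None on a hit."""
        addr = self.regs["addr"]
        page = addr // PAGE
        miss = None if page in resident else page
        if write and miss is None:
            memory[addr] = memory.get(addr, 0) + 1   # demand mode updates memory
        if miss is None or not write:
            # Demand mode stalls on a miss; runahead mode keeps going past it.
            self.regs["addr"] = addr + self.regs["stride"]
            self.regs["cycles"] += 1
        return miss

def runahead(thread, resident, window):
    """Checkpoint registers, run ahead without writing memory, record misses, restore."""
    checkpoint = copy.deepcopy(thread.regs)
    misses = []
    for _ in range(window):
        page = thread.step(memory=None, resident=resident, write=False)
        if page is not None and page not in misses:
            misses.append(page)
    thread.regs = checkpoint
    return misses

def simulate_window(thread, memory, resident, window=1000):
    """Prefetch the pages predicted by runahead, then run the window in demand mode."""
    predicted = runahead(thread, resident, window)
    resident.update(predicted)   # stand-in for a scheduled batch of disk reads
    demand_misses = 0
    for _ in range(window):
        if thread.step(memory, resident, write=True) is not None:
            demand_misses += 1
    return predicted, demand_misses

if __name__ == "__main__":
    t = Thread(base=0, stride=1024)
    predicted, demand_misses = simulate_window(t, memory={}, resident=set())
    print(len(predicted), demand_misses, t.regs["cycles"])
```

Collecting the misses for a whole window up front is what lets the host sort and batch the disk accesses instead of paying the disk latency once per miss.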
5 Interaction Latency/Bandwidth
- Supercomputer model design is more difficult due to the need to model low-latency interactions
  - Only a few target clock cycles between core interactions
  - Some interactions are synchronous (e.g., barrier sync logic)
- Datacenter cores only interact through Ethernet
  - OK to run each core for longer before checking for an interaction event
  - Interactions are asynchronous (e.g., can schedule NIC interrupts when convenient for the model)
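One way to see why longer interaction latency makes the model easier is to treat the minimum latency as a synchronization quantum: cores can run independently for that many target cycles before they must exchange messages. The sketch below is an illustration under assumptions, not the group's scheme. The 1 GHz target clock, the 1 µs and 1 ms picks for "microseconds" and "milliseconds", and the function names are all hypothetical.

```python
def sync_quantum_cycles(latency_s: float, target_hz: float) -> int:
    """Target cycles a core may run before a message sent at the start of the
    quantum could be due at another core."""
    return max(1, round(latency_s * target_hz))

def run_quanta(total_cycles, quantum, latency, sends, num_cores):
    """Run cores in lock-step quanta, exchanging messages only at quantum boundaries.

    sends: list of (send_cycle, dst_core) pairs. Returns the number of sync
    points and, per core, the target-cycle timestamps at which messages arrive.
    """
    assert quantum <= latency, "a quantum longer than the latency could deliver late"
    pending = sorted(sends)
    inbox = [[] for _ in range(num_cores)]
    sync_points = 0
    now = 0
    while now < total_cycles:
        # Each core would simulate [now, now + quantum) independently here.
        now = min(now + quantum, total_cycles)
        sync_points += 1
        # Hand over everything sent during the quantum that just finished.
        while pending and pending[0][0] < now:
            sent, dst = pending.pop(0)
            arrival = sent + latency
            assert arrival >= now   # never due before the receiver's current time
            inbox[dst].append(arrival)
    return sync_points, inbox

if __name__ == "__main__":
    TARGET_HZ = 1e9  # assumed 1 GHz target clock
    for name, latency_s in [("supercomputer", 1e-6), ("datacenter", 1e-3)]:
        q = sync_quantum_cycles(latency_s, TARGET_HZ)
        syncs, _ = run_quanta(10_000_000, q, q, sends=[], num_cores=2)
        print(name, q, syncs)
```

Under these assumptions the datacenter model synchronizes a thousand times less often than the supercomputer model over the same stretch of target time, and synchronous interactions such as barriers would shrink the supercomputer quantum further.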