RAMP-White
Derek Chiou and Hari Angepat
The University of Texas at Austin
Supported in part by DOE, NSF, IBM, Intel, and Xilinx.
© Derek Chiou

2 RAMP-White
6/11/2015 Derek Chiou, RAMP-White Tutorial, FCRC 2007
Requirements
- Coherent shared memory experimental platform
- Configurable coherence protocol, engine
- Scalable to the same level as other RAMP machines
  - 1K eventual target
  - Down to 2
- Full system (OS, I/O, etc.)
Intentions
- ISA/Architecture independent (like all RAMP efforts)
  - Use different cores
- Integrate components from other RAMP participants
  - A test-bed for sharing IP

3 Texas Modifications to RAMP-White
- New code in Bluespec rather than Verilog/VHDL
  - Many advantages including interfaces, configurability
  - My group's hardware development is exclusively Bluespec
  - Free/low cost for academics (www.bluespec.com)
- Start with XUP board
  - We had XUP before BEE2
- Embedded PowerPC is starting core
  - It's a free, fast core with real (incoherent) 16KB caches
  - No space issues on XUP
    - 2 Leons + MMU + memory controller barely fits (no space for our stuff)
  - RAMP is core independent
  - My research needs fast cores
  - Can then use synthesizable 405s
- Multi-OS shared space
  - Processors map to shared global space
  - May try SMP OS, but unlikely to scale well to 1K processors

4 High-Level Architecture Philosophy
- Flexibility
  - Avoid wasted work
  - Easy changes
- Module-agnostic
  - Processors, network, I/O, etc.
- Interfaces
  - Complete set of necessary interfaces
  - All communication via messages
    - Fixed fields, but fields are configurable
  - "Shims" connect components to White infrastructure
    - Use existing IP
- Building one instance to confirm interface completeness

5 32b Address in Shared Memory Machine??
- 4GB possible per BEE2 FPGA
  - Need more than 32b
- Eventually, hope for 64b soft-core processors
- For now two options:
  - Live with 4GB space
  - Or, provide one more layer of translation
    - Physical address in certain region is global virtual address
    - Translated by hardware to node + physical address
- Also useful for multiple OSs in single memory
  - OSs tend to assume they own physical address 0
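The extra translation layer can be sketched in software. This is only an illustration: the slide does not give a memory map, so the window base, window size, per-node share, and local base below are all hypothetical values chosen to make the idea concrete. A physical address inside the window is treated as a global address and split into a node number and a physical address on that node; anything outside the window stays local.

```python
from typing import NamedTuple, Optional

# All constants are hypothetical; the slides do not specify a layout.
GLOBAL_BASE = 0x8000_0000        # start of the window treated as global addresses
GLOBAL_SIZE = 0x4000_0000        # size of that window (1 GiB)
NODE_SHIFT = 24                  # each node contributes 2**24 bytes (16 MiB)
LOCAL_SHARED_BASE = 0x0100_0000  # where a node's shared region sits in its own DRAM


class GlobalTarget(NamedTuple):
    node: int       # node that owns the data
    phys_addr: int  # physical address on that node


def translate(paddr: int) -> Optional[GlobalTarget]:
    """Map a processor physical address to (node, physical address).

    Returns None when the address lies outside the global window,
    meaning it is an ordinary local access that needs no translation.
    """
    if not (GLOBAL_BASE <= paddr < GLOBAL_BASE + GLOBAL_SIZE):
        return None
    offset = paddr - GLOBAL_BASE
    node = offset >> NODE_SHIFT                     # high bits select the node
    node_offset = offset & ((1 << NODE_SHIFT) - 1)  # low bits select the byte
    return GlobalTarget(node, LOCAL_SHARED_BASE + node_offset)
```

Because every OS image sees its own memory starting at physical address 0 and reaches shared data only through the window, the same scheme lets several OSs coexist, which is the point of the last bullet on the slide.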

6 RAMP-White Block Diagram
[Figure: block diagram with Network Router, Intersection Unit (IU), Memory Controller (MC), IO & Platform Devices, Processor, Network Interface (NIU), and Coherent $; part of the diagram is marked "Proc dependent"]

7 Three Phase Approach to Hardware
- Phase 1: Incoherent shared memory
  - No hardware global cache, just global shared memory support
    - Optional cache for local memory
  - However, software can maintain coherence if necessary
    - Network virtual memory
    - Run a simulator on top of the processor
  - Ring network
- Phase 2: Ring-based coherence (scalable bus)
  - Requires a coherent cache, IU awareness
  - Running what is essentially a snoopy protocol
    - True coherence engine not required
  - But, very restricted communication
  - Sufficient for testing, modeling many targets
- Phase 3: General network-based coherence
  - Requires general coherence engine, general network
[Figure: three node diagrams, each showing IU, P, $$, MC, I/O, and C $ blocks]

8 Intersection Unit
- Processor interface
  - Slave
  - Snoop
- Network interface
  - Master (send)
  - Slave (receive)
- Memory interface
  - Master (issue memory requests)
- Hooks for coherency engine
  - Bluespec nice to specify coherence engine
  - Incoherent version is a special case
- Programmable memory regions
  - Global (local and remote)
  - Local translation
[Figure: block diagram with Intersection Unit (IU), Memory Controller (MC), IO & Platform Devices, Processor, Network Interface (NIU), and Coherent $]

9 Intersection Unit Internals
[Figure: Intersection Unit containing a Controller, BRAMs, and Global Address Translation hardware, with Proc, IO, and Net ports (each label appears twice) and a connection to the Memory Controller & DRAM Controller]

10 Network Interface Unit
- Currently two virtual channels
- Split into two components
  - Msg composition/Queuing
  - Net transmit/receive
    - Insert/extract for ring
    - Intended to permit other net-specific transmit/receive
- One input/one output
  - Creates a simple unidirectional ring
  - Can interface to more advanced fabrics
[Figure: block diagram with Intersection Unit (IU), Memory Controller (MC), IO & Platform Devices, Processor, Network Interface (NIU), and Coherent $]

11 IU Internal Message
- Defaults
  - PRI: High priority, Low priority
  - CMD: Read, Write, Coherence, …
  - PERM: Modified, Exclusive, Shared, Invalid
  - SIZE: Byte, word, double word, cache-line
  - GADDR: global address (translated by IU)
  - DATA: dependent on size
- Bluespec permits easy modification for your protocol
[Figure: message layout with fields PRI, CMD, PERM, SIZE, TAG, GADDR, DATA]
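The fixed-but-configurable field idea can be illustrated with a small packing routine. The real design is in Bluespec and the slide gives no bit widths, so the widths in FIELDS, the numeric enum encodings, and the names IUMessage, pack_header, and unpack_header are all assumptions made for this sketch. Changing a width or adding a field only means editing the FIELDS table, which is roughly the kind of easy modification the slide has in mind.

```python
from dataclasses import dataclass
from enum import IntEnum


class Pri(IntEnum):
    LOW = 0
    HIGH = 1


class Cmd(IntEnum):
    READ = 0
    WRITE = 1
    COHERENCE = 2


class Perm(IntEnum):
    INVALID = 0
    SHARED = 1
    EXCLUSIVE = 2
    MODIFIED = 3


class Size(IntEnum):
    BYTE = 0
    WORD = 1
    DOUBLE_WORD = 2
    CACHE_LINE = 3


# (field name, width in bits), most significant field first.
# Widths are hypothetical; edit this table to reconfigure the header.
FIELDS = (("pri", 1), ("cmd", 4), ("perm", 2), ("size", 2), ("tag", 8), ("gaddr", 32))


@dataclass(frozen=True)
class IUMessage:
    pri: Pri
    cmd: Cmd
    perm: Perm
    size: Size
    tag: int
    gaddr: int
    data: bytes = b""  # payload; its length depends on SIZE

    def pack_header(self) -> int:
        """Concatenate the header fields into one integer."""
        word = 0
        for name, width in FIELDS:
            value = int(getattr(self, name))
            if value >> width:
                raise ValueError(f"{name}={value} does not fit in {width} bits")
            word = (word << width) | value
        return word


def unpack_header(word: int, data: bytes = b"") -> IUMessage:
    """Inverse of pack_header: peel fields off starting from the low bits."""
    values = {}
    for name, width in reversed(FIELDS):
        values[name] = word & ((1 << width) - 1)
        word >>= width
    return IUMessage(Pri(values["pri"]), Cmd(values["cmd"]), Perm(values["perm"]),
                     Size(values["size"]), values["tag"], values["gaddr"], data)
```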

12 Network Message
- PRI: High and Low
- DEST, SRC: destination, source of message
- SIZE: Total message size
- NETTAG: network tag (optional)
- CMD: network command (optional)
- MESSAGE: data
[Figure: message layout with fields PRI, DEST, SRC, NETTAG, CMD, MESSAGE, SIZE]
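A rough model of how such a message might travel on the simple unidirectional ring described for the NIU: each node has one input and one output, so a message inserted at SRC is forwarded to the next node until DEST extracts it. The header size, the helper names make_message and route_on_ring, and the treatment of SIZE as header plus payload bytes are assumptions for this sketch, not details taken from the slides.

```python
from dataclasses import dataclass
from typing import List, Optional

HEADER_BYTES = 8  # hypothetical header size


@dataclass(frozen=True)
class NetMessage:
    pri: int                      # 1 = high, 0 = low
    dest: int                     # destination node
    src: int                      # source node
    size: int                     # total message size in bytes (assumed: header + payload)
    message: bytes                # payload
    nettag: Optional[int] = None  # optional network tag
    cmd: Optional[int] = None     # optional network command


def make_message(src: int, dest: int, payload: bytes, pri: int = 0) -> NetMessage:
    """Wrap a payload in a network message with SIZE filled in."""
    return NetMessage(pri, dest, src, HEADER_BYTES + len(payload), payload)


def route_on_ring(msg: NetMessage, num_nodes: int) -> List[int]:
    """Return the nodes a message visits, from its source to its destination."""
    if not (0 <= msg.src < num_nodes and 0 <= msg.dest < num_nodes):
        raise ValueError("source or destination is not on the ring")
    path = [msg.src]
    node = msg.src
    while node != msg.dest:
        node = (node + 1) % num_nodes  # single output: always forward to the next node
        path.append(node)
    return path
```

On a four-node ring, a message from node 3 to node 1 has to pass through node 0, which shows why the ring is simple but offers restricted communication.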

13 Programmer View
- Sequential consistency
  - PowerPC
    - Global addresses labeled as uncached
      - Ordered accesses from PowerPC 405
    - Coherent global cache still uncached
  - Soft cores can be weaker
- User interface
  - Terminal per core/OS if desired
  - Mmap to map shared memory

14 Operating System
- Issues with SMP OS on embedded PowerPC
  - Incoherent cache
  - Load-reservation/store-conditional instructions not MP capable
  - Also missing TLB invalidation & OpenPIC (interprocessor interrupts, bring-up)
  - How scalable anyway? (1K processors)
- Therefore, separate OS per core
  - Region of memory is global
    - Mmap
  - Locks implemented using regular loads/stores + sequential consistency
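One well-known way to build a lock from plain loads and stores under sequential consistency is Peterson's algorithm for two processors. The slides do not say which algorithm RAMP-White uses, so Peterson's is an assumption here, and the code below is a software model rather than anything that runs on the hardware. Each simulated core is a Python generator that yields after every load or store, and a seeded scheduler interleaves the two cores in a single global order, which is what sequential consistency guarantees. The loop body is a "take lock, increment counter, release lock" sequence.

```python
import random


def worker(me, mem, iterations):
    """One simulated core. Each `yield` marks the end of a single load or
    store, so the scheduler can interleave cores between memory operations."""
    other = 1 - me
    for _ in range(iterations):
        # Take lock (Peterson's algorithm, plain stores and loads only).
        mem["flag"][me] = 1                   # store: I want the lock
        yield
        mem["turn"] = other                   # store: let the other core go first
        yield
        while True:
            other_wants = mem["flag"][other]  # load
            yield
            if not other_wants:
                break
            turn = mem["turn"]                # load
            yield
            if turn != other:
                break
        # Increment counter: a non-atomic load followed by a store.
        value = mem["counter"]
        yield
        mem["counter"] = value + 1
        yield
        # Release lock.
        mem["flag"][me] = 0
        yield


def run(iterations=100, seed=0):
    """Interleave two cores in a seeded random order; return the final counter."""
    rng = random.Random(seed)
    mem = {"flag": [0, 0], "turn": 0, "counter": 0}
    live = [worker(0, mem, iterations), worker(1, mem, iterations)]
    while live:
        core = rng.choice(live)  # pick which core performs the next memory operation
        try:
            next(core)
        except StopIteration:
            live.remove(core)
    return mem["counter"]
```

If the lock provides mutual exclusion, no increment is lost and run(n) returns 2 * n for any seed. The argument relies on every core observing stores in one global order; on hardware with weaker ordering the same code would need memory barriers.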

15 Status: Phase 1 RAMP-White
- Hari Angepat did the work
- Components
  - Written in Bluespec
  - NIU code complete and tested
    - 2 processor ring
  - IU code complete and tested
    - Processor slave (no coherence right now)
    - PLB master/slave interface (I/O)
    - NIU interface
  - Hardware intended to target different ISAs
    - PLB master and slave shims written
- Some preliminary OS work
  - Multi-image mmap interface running

16 Current RAMP-White Phase 1
[Figure: two nodes, each with a PPC 405, an Intersection Unit (IU), and a Network Interface (NIU); other labels shown are IO & Platform Devices, Memory Controller (MC), PLB shim, and Linux]

17 Phase 1 Demo on XUP
- Configuration
- See both processors boot and run (top, cpu_info)
- Run a simple "take-lock, increment counter, release lock"

18 Our Long Term Plans
- Phase 1, XUP just started to work
  - With multi-OS, limited device support
  - Limited alpha release end of 3Q07
- Phase 2
  - Coherent cache, IU forwarding modifications
  - Better OS support (ProtoFlex?)
  - Limited alpha release 1Q08
- Phase 3
  - Arbitrary network, cache coherency engine
    - Getting network from Washington, Berkeley
  - RDL? Leon?
  - Release depends on ease of integration

19 Conclusions
- RAMP-White architecture
  - Phased approach minimizes wasted work
  - Designed to be easy to modify for your purpose
    - Many architectures only require modified coherence engine, maybe cache
  - ISA/implementation agnostic
    - Care taken to not be specific
- RAMP-White Phase 1 works
  - Running on XUP
- We will be our own customer
  - Building cycle-accurate x86 CMP simulator on top

20 Extra slides

21 Node Architecture
[Figure: node diagrams built from IU, P, $$, MC, I/O, and C $ blocks]

22 Generalized Architecture
[Figure: Proc, $, IU (Intersection Unit), NIU (Network Interface Unit), MC, Mem, OPB bridge, and PLB, divided into "Proc dependent" and "Proc independent" parts]

23 Sharing IP: Some Preliminary Experience
- We looked at RAMP-Red
  - XUP
    - Used some code (PLB master)
  - Red-BEE is not ready to distribute
- Looking for switch code
  - Berkeley's code on CVS repository
    - But, we can't use memory controller because we don't have BEE2 board yet
- Bluespec
  - We are spinning almost all of our own code right now
- Would like to steal software
  - OS (kernel proxy)
  - SMP OS port
- Naming
  - MPI reference design in BEE2 repository
    - Is that RAMP-Blue?
- A central CVS repository for RAMP code?

24 Sharing Over the Long Term
- Processor is shared
  - Leon
  - PowerPC
  - MicroBlaze
  - Everything else
- MC is shared
  - Xilinx or Berkeley
- Coherent cache can be shared
  - Transactional/traditional
  - Borrow Stanford's?
- Coherency engine can be shared
  - CMU/Stanford
- IU functionality can be shared
  - Trying to make ours general
- NIU can be shared
  - Borrow half from Berkeley?
- Network can be shared
  - Borrow Berkeley's?
[Figure: block diagram with Proc, $, IU, NIU, MC, Mem, Peripherals, and CCE]

