ATLAS (a.k.a. RAMP Red): Parallel Programming with Transactional Memory
Njuguna Njoroge and Sewook Wee
Transactional Coherence and Consistency
Computer Systems Lab, Stanford University
2 Why we built ATLAS
- Multicore processors expose the challenges of multithreaded programming
- Transactional Memory (TM) simplifies parallel programming
  - As simple as coarse-grain locks
  - As fast as fine-grain locks
- What is currently missing for evaluating TM: fast TM prototypes to develop software on
- FPGAs' improving capabilities make them attractive for CMP prototyping
  - Fast: can operate at > 100 MHz
  - More logic, memory, and I/Os
  - Larger libraries of pre-designed IP cores
- ATLAS: an 8-processor Transactional Memory system
  - 1st FPGA-based hardware TM system
  - Member of the RAMP initiative (RAMP Red)
3 ATLAS provides …
- Speed
  - > 100x speed-up over a SW simulator [FPGA 2007]
- Rich software environment
  - Linux OS
  - Full GNU environment (gcc, glibc, etc.)
- Productivity
  - Guided performance tuning
  - Standard GDB environment + deterministic replay
4 TCC's Execution Model
- Transaction: the building block of a program
  - A critical region
  - Executed atomically & in isolation from others
5 In TCC, All Transactions All The Time [PACT 2004]
[Figure: execution timeline of three CPUs. Each CPU executes transaction code, arbitrates for the commit token, and commits. When one CPU commits a store to 0xbeef, another CPU that had speculatively loaded 0xbeef is violated: it undoes its speculative work and re-executes its transaction.]
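The loop below is a minimal C sketch of this execute/arbitrate/commit cycle. It illustrates the model only; every function in it (tx_execute, arbitrate_for_token, and so on) is a hypothetical stand-in for hardware behavior, not ATLAS code.

    #include <stdbool.h>

    /* Hypothetical stand-ins for hardware mechanisms. */
    extern void tx_execute(int cpu);          /* run the transaction, buffering stores */
    extern void arbitrate_for_token(int cpu); /* block until the commit token is held */
    extern bool violated(int cpu);            /* did a remote commit hit our read-set? */
    extern void tx_undo(int cpu);             /* discard all speculative state */
    extern void tx_commit(int cpu);           /* make buffered stores visible at once */

    void tcc_cpu_loop(int cpu) {
        for (;;) {                        /* all transactions, all the time */
            tx_execute(cpu);              /* speculative execution */
            arbitrate_for_token(cpu);     /* commits are serialized by arbitration */
            if (violated(cpu)) {          /* a committed store hit a line we read */
                tx_undo(cpu);             /* roll back ... */
                continue;                 /* ... and re-execute the transaction */
            }
            tx_commit(cpu);               /* atomically publish the write-set */
        }
    }

In the real hardware a violation can arrive at any point during execution, not only at arbitration time; the sketch collapses that detail into a single check.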
6 CMP Architecture for TCC
- Speculatively Read (R) bits: set by speculative loads, e.g., ld 0xdeadbeef
- Speculatively Written (W) bits: set by speculative stores, e.g., st 0xcafebabe
- Violation detection: compare each incoming commit address to the R bits
- Commit: read pointers from the Store Address FIFO, flush the addresses whose W bits are set
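A toy software model of these bits and the commit-time check might look as follows. The line size, table size, and function names are illustrative assumptions, not the ATLAS RTL.

    /* Toy model of per-line R/W speculation bits and commit-time
     * violation detection; sizes and granularity are assumed. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define NUM_LINES 256
    #define LINE(addr) (((addr) >> 5) % NUM_LINES)   /* assumed 32-byte lines */

    static bool r_bit[NUM_LINES];   /* set by speculative loads  */
    static bool w_bit[NUM_LINES];   /* set by speculative stores */

    static void spec_load(uint32_t addr)  { r_bit[LINE(addr)] = true; }
    static void spec_store(uint32_t addr) { w_bit[LINE(addr)] = true; }

    /* A committing transaction broadcasts its store addresses; a hit on
     * a line with the R bit set violates the local transaction. */
    static bool snoop_commit(uint32_t addr) { return r_bit[LINE(addr)]; }

    int main(void) {
        spec_load(0xdeadbeef);                    /* sets an R bit */
        spec_store(0xcafebabe);                   /* sets a W bit  */
        printf("violation on 0xdeadbeef commit: %d\n",
               snoop_commit(0xdeadbeef));         /* 1: conflict   */
        printf("violation on 0x00001000 commit: %d\n",
               snoop_commit(0x00001000));         /* 0: no overlap */
        return 0;
    }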
7 ATLAS: 8-way CMP on the BEE2 Board
- User FPGAs (100 MHz)
  - 4 FPGAs for a total of 8 TCC CPUs
  - PPCs, TCC caches, BRAMs, and busses
- Control FPGA (Linux, 300 MHz)
  - Launch TCC apps here
  - Handles system services for the TCC PowerPCs
- Fabric (100 MHz)
8 ATLAS Software Overview
- Software stack: TM Application → TM API / ATLAS Profiler → ATLAS Subsystem → Linux OS → ATLAS HW on BEE2
- TM applications can be written easily with the TM API
- The ATLAS profiler provides runtime profiling and guided performance tuning
- The ATLAS subsystem provides Linux OS support for the TM application
9 ATLAS Subsystem
[Figure: the Linux PPC coordinates TCC PPC0 through TCC PPC7, which exchange commit and violation traffic.]
- The Linux PPC transfers the initial context to the TCC PPCs
- It invokes the parallel work and later joins it
- The application exits with its statistics reported
10 ATLAS System Support
- A TCC PPC requests OS support (e.g., a TLB miss or a system call)
- The Linux PPC regenerates and services the request (e.g., a system call or a page-out)
- The Linux PPC replies back to the requestor
- If the request is irrevocable, the transaction is serialized first
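A rough sketch of that request/reply path follows; the mailbox interface (mailbox_send, mailbox_recv_reply) and the message layout are invented for illustration, not the ATLAS interface.

    #include <stdint.h>

    typedef struct {
        uint32_t kind;     /* hypothetical tags, e.g. REQ_SYSCALL, REQ_TLB_MISS */
        uint32_t number;   /* syscall number or faulting address */
        uint32_t args[6];  /* syscall arguments */
    } tm_request_t;

    /* Assumed inter-processor mailbox primitives. */
    extern void     mailbox_send(int cpu, const tm_request_t *req);
    extern uint32_t mailbox_recv_reply(int cpu);

    /* On the TCC PPC: the trap handler forwards the request and blocks.
     * An irrevocable request (e.g., file I/O) cannot be rolled back, so
     * the transaction must first win commit arbitration, serializing
     * execution, before the request is forwarded. */
    uint32_t tcc_request_os_service(int cpu, tm_request_t *req) {
        mailbox_send(cpu, req);          /* ship the request to the Linux PPC */
        return mailbox_recv_reply(cpu);  /* block until it has been serviced  */
    }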
11 Coding with TM API: histogram

    int main(int argc, char* argv[]) {
        /* ... sequential code ... */
        TM_PARALLEL(run, NULL, numCpus);   /* fork numCpus threads running run() */
        /* ... sequential code ... */
    }

    /* static scheduling with interleaved access to A[] */
    void* run(void* args) {
        for (int i = TM_GET_THREAD_ID(); i < NUM_LOOP; i += TM_GET_NUM_THREAD()) {
            TM_BEGIN();
            bucket[A[i]]++;   /* each increment is one small transaction */
            TM_END();
        }
        return NULL;
    }

OpenTM will provide high-level (OpenMP-style) pragmas; see the sketch below.
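For comparison, the same loop under such pragmas might read roughly as follows. This is a hedged guess at the direction the slide points to; the pragma name and clause are assumptions, not a published OpenTM specification.

    /* Hypothetical OpenTM-style version of the histogram loop.
     * "omp transfor" is an assumed transactional parallel-for pragma. */
    void histogram(void) {
        #pragma omp transfor schedule(static)
        for (int i = 0; i < NUM_LOOP; i++) {
            bucket[A[i]]++;   /* each iteration executes as a transaction */
        }
    }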
12 Guided Performance Tuning
- TAPE: a light-weight runtime profiler [ICS 2005]
- Tracks the most significant violations (longest loss time)
  - Violated object address
  - PC where the object was read
  - Loss time & number of occurrences
  - Committing thread's ID and transaction PC
- Tracks the most significant overflows (longest duration)
  - Overflow: when speculative state can no longer stay in the TCC cache
  - PC where the overflow occurred
  - Overflow duration & number of occurrences
  - Type of overflow (LRU or Write Buffer)
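Read as a data structure, each violation entry TAPE keeps would carry roughly the fields below. This C struct is an illustrative reconstruction from the list above, with invented field names.

    #include <stdint.h>

    /* Illustrative layout of one TAPE violation entry (names invented). */
    typedef struct {
        uint32_t object_addr;      /* violated object address             */
        uint32_t read_pc;          /* PC where the object was read        */
        uint64_t loss_cycles;      /* execution time lost to re-execution */
        uint32_t occurrences;      /* how often this violation fired      */
        uint32_t committer_id;     /* committing thread's ID              */
        uint32_t committer_tx_pc;  /* committing transaction's PC         */
    } tape_violation_entry;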
13 Deterministic Replay
- All Transactions All The Time
  - TM 101: a transaction executes atomically and in isolation
  - TM's illusion: a transaction starts after all older transactions finish
  - So we only need to record the order of commits
  - Minimal runtime overhead & footprint: 1 B per transaction
- Logging execution: the commit order of the transactions' write-sets (e.g., T0, T1, T2) is appended to the LOG
- Replay execution: a token arbiter enforces the commit order specified in the LOG
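The essence of the scheme fits in a few lines of C. The sketch below assumes the 1-byte-per-transaction log from the slide; the arbiter interface is invented for illustration.

    #include <stdint.h>

    #define MAX_TX (1u << 20)
    static uint8_t  commit_log[MAX_TX];  /* 1 B per transaction: committing CPU's ID */
    static uint32_t log_len;

    /* Logging run: the commit arbiter appends each winner's CPU ID. */
    void log_commit(uint8_t cpu_id) {
        commit_log[log_len++] = cpu_id;
    }

    /* Replay run: the token arbiter grants the commit token only to the
     * CPU named next in the log, reproducing the original interleaving. */
    uint8_t next_committer(uint32_t tx_index) {
        return commit_log[tx_index];
    }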
14 Useful Features of Replay
- Monitoring code can be added inside a transaction
  - Remember: we only record the transaction commit order, not the instruction stream
- Verification: the log is not written in stone
  - Editing it makes complete runtime-scenario coverage possible
- Choice of where to run the replay
  - On ATLAS itself: HW support for other debugging tools (see next slide)
  - On a local machine (your favorite desktop or workstation)
    - Runs natively and sequentially on the faster local machine
    - Seamless access to existing debugging tools
15 GDB Support
- Current status
  - GDB is integrated with local-machine replay
  - GDB provides debuggability while guaranteeing deterministic replay
- Work in progress
  - Breakpoints
    - Thread-local BPs vs. global BPs
    - Stop the world by controlling the commit token
  - Stepping
    - Backward stepping: a transaction is always ready to roll back
    - Transaction stepping
  - Unlimited data-watches (ATLAS only)
    - A separate monitor TCC cache registers the data-watches
16 Conclusion: ATLAS provides
- Speed
  - > 100x speed-up over a SW simulator [FPGA 2007]
- Software environment
  - Linux OS
  - Full GNU environment (gcc, glibc, etc.)
- Productivity
  - TAPE: guided performance tuning
  - Deterministic replay
  - Standard GDB environment
- Future work
  - High-level language support (Java, Python, …)
17 Questions and Answers
ATLAS Team Members
- System hardware: Njuguna Njoroge, PhD candidate
- System software: Sewook Wee, PhD candidate
- High-level languages: Jiwon Seo, PhD candidate
- HW performance: Lewis Mbae, BS candidate
Past contributors
- Interconnection fabric: Jared Casper, PhD candidate
- Undergrads: Justy Burdick, Daxia Ge, Yuriy Teslar