Published by Candace Bradley. Modified over 9 years ago.
1
Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing
Barroso, Gharachorloo, McNamara, et al.
Proceedings of the 27th Annual ISCA, June 2000
Presented by Garver Moore
ECE259 Spring 2006, Professor Daniel Sorin
2
Motivation
+ Economic: high demand for OLTP machines
+ Disconnect between the ILP focus and this demand
+ OLTP:
-- High memory latency
-- Little ILP (get, process, store)
-- Large TLP
+ OLTP is unserved by aggressive ILP machines
+ Use “old” cores and an ASIC design methodology to build “glueless,” scalable OLTP machines with low development cost and short time to market
+ Amdahl’s Law
3
The Piranha Processing Node*
*Directly from Barroso et al.
+ Separate I/D L1 for each CPU
+ Logically shared, interleaved L2 cache
+ Eight memory controllers interface to a bank of up to 32 Rambus DRAM chips
+ Aggregate max bandwidth of 12.8 GB/sec
+ 180 nm process (2000)
+ Almost entirely ASIC design: 50% of the clock speed and 200% of the area versus a full-custom methodology
+ CPU: Alpha, “ECE152 work”: single-issue, in-order, 8-stage pipeline
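The memory figures on this slide can be checked with a little arithmetic. The sketch below divides the aggregate bandwidth and the DRAM chips evenly across the eight controllers, and shows one plausible way an interleaved cache could pick a bank. The even split, the 64-byte line size, the bank count of eight, and the helper names are assumptions made for illustration, not details taken from the slide.

```python
# Figures quoted on the slide.
MEMORY_CONTROLLERS = 8
AGGREGATE_GB_PER_S = 12.8
MAX_RDRAM_CHIPS = 32

# Assumptions for illustration only (not stated on this slide).
LINE_BYTES = 64   # assumed cache-line size
L2_BANKS = 8      # assumed: one L2 bank per memory controller


def per_controller_gb_per_s() -> float:
    """Bandwidth per controller if the aggregate is split evenly."""
    return AGGREGATE_GB_PER_S / MEMORY_CONTROLLERS


def chips_per_controller() -> int:
    """DRAM chips per controller if the chips are split evenly."""
    return MAX_RDRAM_CHIPS // MEMORY_CONTROLLERS


def l2_bank(physical_address: int) -> int:
    """Hypothetical interleaving: consecutive lines map to consecutive banks."""
    line_number = physical_address // LINE_BYTES
    return line_number % L2_BANKS


if __name__ == "__main__":
    print(per_controller_gb_per_s(), "GB/s per controller")
    print(chips_per_controller(), "chips per controller")
    print([l2_bank(a) for a in range(0, 10 * LINE_BYTES, LINE_BYTES)])
```

Interleaving on low-order line-address bits spreads sequential accesses across banks, which is the usual reason for choosing it; the actual bit selection used by Piranha may differ.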
4
Communication Assist
+ Home Engine and Remote Engine support shared memory across multiple nodes
+ System Control tackles system miscellany: interrupts, exceptions, init, monitoring, etc.
+ Output queue (OQ), router, input queue (IQ), and switch are standard
+ Total inter-node I/O bandwidth: 32 GB/sec
+ Each link and block here corresponds to actual wiring and an actual module
+ This allows for rapid parallel development and a semi-custom design methodology
+ Also facilitates multiple clock domains
THERE IS NO INHERENT I/O CAPABILITY.
5
I/O Organization
+ Smaller than the processing node
+ Router has only 2 links, which removes the need for a routing table
+ Memory is globally visible and part of the coherency scheme
+ CPU gives an optimized placement for drivers, translations, etc. that need low-latency access to I/O
+ Reused dL1 design provides the interface to PCI/X
+ Supports an arbitrary I/O:P node ratio and network topology
+ Glueless scaling up to 1024 nodes of any type supports application-specific customization
6
Coherence: Local
+ Each L2 bank and its associated controller contain directory data for intra-chip requests (centralized directory)
+ The intra-chip switch (ICS) is responsible for all on-chip communication
+ L2 is “non-inclusive”
+ L2 acts as a “large victim buffer” for the L1s and keeps copies of the tags and state of L1 data
+ The L2 controller can determine whether data is cached remotely, and if so whether exclusively. The majority of L1 requests then require no CA assist.
+ On a request, the L2 can service it directly, forward it to the owner L1, forward it to a protocol engine, or get the data from memory
+ On forwards, the L2 blocks conflicting requests
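The four outcomes of an L1 request at the L2 controller can be sketched as a small decision function. This is a minimal illustration, not the controller's real logic: the `LineState` fields, the order of the checks, and all names are hypothetical, chosen only to show how the four cases listed above partition the requests.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class Action(Enum):
    SERVICE_FROM_L2 = auto()
    FORWARD_TO_OWNER_L1 = auto()
    FORWARD_TO_PROTOCOL_ENGINE = auto()
    FETCH_FROM_MEMORY = auto()


@dataclass
class LineState:
    """Hypothetical summary of what the L2 controller knows about a line."""
    in_l2: bool = False                   # data is present in this L2 bank
    owner_l1: Optional[int] = None        # on-chip L1 holding the line exclusively
    home_is_local: bool = True            # this node is the line's home
    remote_exclusive: bool = False        # another node holds the line exclusively


def dispatch(requester_l1: int, line: LineState) -> Action:
    """Pick one of the four actions; the check order is an assumption."""
    # Another on-chip L1 owns the line: forward the request to that owner.
    if line.owner_l1 is not None and line.owner_l1 != requester_l1:
        return Action.FORWARD_TO_OWNER_L1
    # The L2 holds the data: service the request directly.
    if line.in_l2:
        return Action.SERVICE_FROM_L2
    # The data lives on another node: hand off to a protocol engine.
    if not line.home_is_local or line.remote_exclusive:
        return Action.FORWARD_TO_PROTOCOL_ENGINE
    # Otherwise the local memory has a valid copy.
    return Action.FETCH_FROM_MEMORY


if __name__ == "__main__":
    print(dispatch(0, LineState(in_l2=True)))
    print(dispatch(0, LineState(owner_l1=3)))
    print(dispatch(0, LineState(home_is_local=False)))
    print(dispatch(0, LineState()))
```

Only the third branch involves the communication assist, which matches the slide's point that most L1 requests are resolved on chip.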
7
Coherence: Global
+ Trades ECC granularity for “free” directory data storage (4x coarser granularity leaves 44 bits per 64-byte line)
+ Invalidation-based distributed directory protocol, with some optimizations
+ No NACKing: deadlock avoidance through I/O, L, and H priority virtual lanes
-- L: home node, low priority
-- H: forwarded requests, replies
+ Also guarantees that forwards are always serviced by their targets, e.g. an owner writing back to home holds the data until home acknowledges
+ Removes NACK/retry traffic, as well as “ownership change” (DASH), retry counts (Origin), and “No, seriously” (Token)
+ Old messages are routed toward empty buffers; buffer requirement depends linearly on N
+ Buffer space is shared among lanes, and “CMI” (cruise-missile invalidate) invalidations avoid deadlock
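The 44-bit figure can be reproduced with a short calculation. The sketch assumes a standard SECDED Hamming code, a 64-byte (512-bit) line, and ECC words of 64 bits before the change and 256 bits after it (the 4x coarser granularity). The slide states only the 4x factor and the 44-bit result; the code family and function names are assumptions.

```python
def secded_check_bits(data_bits: int) -> int:
    """Check bits for a SECDED Hamming code over data_bits of data.

    Single-error correction needs the smallest r with 2**r >= data_bits + r + 1;
    one extra overall parity bit adds double-error detection.
    """
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1


def ecc_bits_per_line(line_bits: int, word_bits: int) -> int:
    """Total ECC bits for a line protected in independent words."""
    words = line_bits // word_bits
    return words * secded_check_bits(word_bits)


LINE_BITS = 64 * 8  # assumed 64-byte cache line


def freed_directory_bits() -> int:
    """Bits freed by moving from 64-bit to 256-bit ECC words."""
    fine = ecc_bits_per_line(LINE_BITS, 64)      # 8 words of 64 bits
    coarse = ecc_bits_per_line(LINE_BITS, 256)   # 2 words of 256 bits
    return fine - coarse


if __name__ == "__main__":
    print("check bits per 64-bit word:", secded_check_bits(64))
    print("check bits per 256-bit word:", secded_check_bits(256))
    print("bits freed per line:", freed_directory_bits())
```

Under these assumptions a line needs 64 ECC bits at 64-bit granularity and 20 at 256-bit granularity, so the difference is available for directory state at no extra storage cost.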
8
Evaluation Methodology
+ Admittedly favorable OLTP benchmarks chosen (modified TPC-B and TPC-D)
+ Simulated and compared against an aggressive OOO core (Alpha 21364) with integrated coherence and cache hardware
+ “Fudged” for full-custom effect
+ Four evaluations:
-- P1: one-core Piranha @ 500 MHz
-- INO: 1 GHz single-issue, in-order aggressive core
-- OOO: 4-issue, 1 GHz
-- P8: the system as specified
9
Results
10
Questions/Discussion
+ Deadlock avoidance w/o NACK
+ CMP vs SMP
+ “Fishy” evaluation methodology?
+ Specialized computing
+ Buildability?