Published by Julian Nelson. Modified over 9 years ago.
1
The Scale Vector-Thread Processor
Ronny Krashinsky, Christopher Batten, Krste Asanovic

Embedded computing today…
  Automobiles, Sensor Nets, Servers, Robots, Laptops, Routers, Set-top Boxes, Games, TVs, Smart phones

Modern embedded systems
  [Diagram labels: CPU, Chip Set, DSP1, DSP2, FPGA, ASIC, DRAM, SRAM, DRAM]
  Multiple programming languages and models
  Multiple distinct memories
  Multiple communication and synchronization models
  Inflexible, inefficient, expensive

All-purpose programmable core
  Handles all information processing
  Unified software programming model
  Competitive in performance and energy
  Scale by tiling an efficient core
2
Vector-Thread Architecture
  [Diagram labels: Control Processor, VP0, VP1, VP2, VP3, ..., VPN, Memory, vector-fetch, thread-fetch]

VT unifies the vector and multithreaded compute models.
A control processor interacts with a vector of virtual processors (VPs).
Vector-fetch: the control processor fetches instruction blocks for all VPs in parallel.
Thread-fetch: a VP fetches its own instruction blocks.
VT allows a seamless intermixing of vector and thread control.

Vector Execution
  [Diagram: the control processor issues repeated vector-load, vector-fetch, and vector-store commands across VP0, VP1, VP2, VP3, ..., VPN]

Threaded Execution
  [Diagram: Control Proc. and VP0, VP1, VP2, VP3, ..., VPN]
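The vector-fetch and thread-fetch behaviour described on this slide can be illustrated with a small software model. The following is only a sketch under stated assumptions, not Scale's actual instruction set or microarchitecture: the class `ToyVT`, the register names `x` and `steps`, and the Collatz example are all hypothetical, and the VPs are stepped one after another here, whereas the slide describes them running in parallel.

```python
from typing import Callable, Dict, List, Optional

Registers = Dict[str, int]
# Hypothetical model of an instruction block: a function that updates one
# VP's registers and optionally returns the next block for that VP to fetch.
Block = Callable[[Registers], Optional[Callable]]


class ToyVT:
    """Toy control processor driving a vector of virtual processors (VPs)."""

    def __init__(self, num_vps: int) -> None:
        self.vps: List[Registers] = [dict() for _ in range(num_vps)]

    def vector_load(self, reg: str, memory: List[int], base: int = 0) -> None:
        # Unit-stride load: VP i receives memory[base + i].
        for i, vp in enumerate(self.vps):
            vp[reg] = memory[base + i]

    def vector_store(self, reg: str, memory: List[int], base: int = 0) -> None:
        # Unit-stride store: VP i writes memory[base + i].
        for i, vp in enumerate(self.vps):
            memory[base + i] = vp[reg]

    def vector_fetch(self, block: Block) -> None:
        # Vector-fetch: the control processor hands the same block to every VP.
        for vp in self.vps:
            next_block = block(vp)
            # Thread-fetch: each VP keeps fetching its own blocks until done.
            while next_block is not None:
                next_block = next_block(vp)


def collatz_step(vp: Registers) -> Optional[Callable]:
    # Data-dependent loop body: each VP iterates a different number of times.
    if vp["x"] == 1:
        return None
    vp["x"] = vp["x"] // 2 if vp["x"] % 2 == 0 else 3 * vp["x"] + 1
    vp["steps"] += 1
    return collatz_step


def count_steps(vp: Registers) -> Optional[Callable]:
    # Vector-fetched entry block: initialise, then continue by thread-fetch.
    vp["steps"] = 0
    return collatz_step


data = [1, 2, 3, 6]
out = [0, 0, 0, 0]
vt = ToyVT(num_vps=4)
vt.vector_load("x", data)
vt.vector_fetch(count_steps)
vt.vector_store("steps", out)
print(out)  # [0, 1, 7, 8]
```

The loads and stores stay under control-processor (vector) control, while the loop with a per-element trip count is handled by each VP fetching its own blocks, which is the intermixing the slide refers to.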
3
[Chip plot and block diagram labels: Lane 0 to Lane 3, each with clusters C0, C1, C2, C3 plus CMU and SD; Vector-Mem Unit; Read/Write Crossbars; 32 KB SRAM; Cache Tags; Cache Control; CP; scale marker 3 mm]

Control Processor (CP): scalar RISC core
Vector-Thread Unit: 4 lanes, 16 decoupled clusters, instruction fetch, load/store, and command management units, up to 128 VP threads
Vector-Memory Unit: unit-stride, strided, and segment loads and stores, refill/access decoupling
Cache: 4-port, non-blocking, 32-way set-associative, 32 KB

[Cluster detail labels: Register File 32x32-bit, Instr. cache 32x46-bit, Datapath 32-bit, Control Logic]

Automatic synthesis, place & route
Preplaced standard cells, RAM blocks
Aggressive clock-gating
Iterative design flow
Verification: formal equivalence checking plus simulation

Vectorizable data processing applications, e.g. 802.11a wireless transmitter: 9.7 ops per cycle
Non-vectorizable encoder/decoder algorithms, e.g. ADPCM speech decompression: 6.5 ops per cycle
Threaded IP routing table lookups: 6.1 ops per cycle

TSMC 180 nm, 6 layers Al
7.1 M transistors, 1.4 M gates, 397 K cells, 300 K RAM bits
16.6 mm² core area, 23.1 mm² chip area
260 MHz at 1.8 V, 600 mW typical
24 person-months design effort
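To put the ops-per-cycle figures in context, they can be combined with the quoted clock rate and power. This is a back-of-the-envelope sketch only: it assumes the ops-per-cycle numbers are sustained at 260 MHz and that the 600 mW typical figure applies to each workload, neither of which the slide states. The function and dictionary names are illustrative.

```python
CLOCK_HZ = 260e6   # 260 MHz at 1.8 V (from the slide)
POWER_W = 0.600    # 600 mW typical (assumed to apply to every workload)

# Ops per cycle as quoted on the slide.
OPS_PER_CYCLE = {
    "802.11a wireless transmitter": 9.7,
    "ADPCM speech decompression": 6.5,
    "IP routing table lookups": 6.1,
}


def ops_per_second(ops_per_cycle: float, clock_hz: float = CLOCK_HZ) -> float:
    """Throughput if the ops-per-cycle rate is sustained at the given clock."""
    return ops_per_cycle * clock_hz


def energy_per_op_pj(ops_per_cycle: float, power_w: float = POWER_W) -> float:
    """Average energy per operation in picojoules under the assumptions above."""
    return power_w / ops_per_second(ops_per_cycle) * 1e12


for name, opc in OPS_PER_CYCLE.items():
    gops = ops_per_second(opc) / 1e9
    print(f"{name}: {gops:.2f} Gops/s, {energy_per_op_pj(opc):.0f} pJ/op")
```

Under these assumptions the three workloads land at roughly 2.5, 1.7, and 1.6 billion operations per second respectively.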