Building Fake Body Parts: Digital Mockups
Frank Vahid, Univ. of California, Riverside (UCR)
Support provided by NSF, SRC, Dept. of Educ. Also CareFusion, Xilinx, METI
Chen Huang (UC Riverside, now Amazon)
Bailey Miller (UC Riverside, intern at SpaceX)
Prof. Tony Givargis (UC Irvine)
Ting-Shuo Chou (UC Irvine)
Others...
Models of the physical world that run in real time
–Test cyber-physical systems
[Figure labels: Physical mockup; Transducer models; Environment model; Digital mockup]
Issue: Real-time achieved via inaccuracy
Weibel lung complexity
–4 gen: 32 ODEs
–6 gen: 128 ODEs
–8 gen: 512 ODEs
–10 gen: 2048 ODEs
“2-3 minutes to simulate one breath accurately”
[Figure labels: V[1],R[1]; V[2],R[2]; V[7],R[7]]
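The four ODE counts listed for the Weibel lung model all fit 2^(g+1) for g generations. The sketch below encodes that pattern. The formula is inferred from those four data points only and is not stated on the slide, and the function name `weibel_ode_count` is hypothetical.

```c
#include <stdint.h>

/* ODE count for a Weibel lung model with `generations` levels.
 * Assumption: the 2^(g+1) pattern that fits the slide's four data
 * points (4, 6, 8, 10 generations) also holds for other values.
 * Valid for generations < 63. */
uint64_t weibel_ode_count(unsigned generations)
{
    return (uint64_t)1 << (generations + 1);
}
```

Under this assumption, each added generation doubles the number of ODEs to solve per timestep, which is why accurate models quickly outgrow real-time execution on a PC.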
PC & GPU
[Chart: Performance (ms) of PC(1), PC(4), and GPU for the models Weibel, Neuron, Weibel + gas, Hemodynamic, Weibel + hemo]
Speedup vs real-time: PC(1): 0.8x; PC(4): 3.1x; GPU: 1.6x
Parallel computations + neighbor communication: seems like a great match for FPGAs
FPGAs: Sw circuits (parallel)
C code for FIR filter: for (i = 0; i < 128; i++) y += c[i] * x[i];
–Processor: 1000’s of instructions, several thousand cycles
–Circuit for FIR filter on FPGA: ~7 cycles (though slower clock)
Speedup > 10x-100x
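A self-contained version of the FIR loop above is sketched here. The function name, the `double` element type, and the `NUM_TAPS` constant are assumptions for illustration, since the slide gives only the loop itself.

```c
#include <stddef.h>

#define NUM_TAPS 128

/* One FIR output sample: multiply-accumulate over all taps.
 * c[] holds the filter coefficients, x[] the most recent NUM_TAPS
 * input samples. A processor executes these iterations one after
 * another; a circuit can perform the multiplications side by side. */
double fir_output(const double c[NUM_TAPS], const double x[NUM_TAPS])
{
    double y = 0.0;
    for (size_t i = 0; i < NUM_TAPS; i++) {
        y += c[i] * x[i];
    }
    return y;
}
```

One plausible reading of the "~7 cycles" figure, offered as an assumption, is that the 128 products are computed in parallel and summed by an adder tree of depth log2(128) = 7.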
FPGAs “101” (A Quick Intro)
[Figure: a 4x2 memory used as a lookup table (LUT) with inputs a, b and outputs F, G; a 2x2 switch matrix (SM); LUTs and switch matrices combined into an FPGA with inputs a, b, c and outputs D, E]
HLS / FPGAs
–HLS (high-level synthesis): compiler that converts a program to circuits
[Chart: Performance (ms) of PC(1), PC(4), GPU, and HLS/FPGA for the models Weibel, Neuron, Weibel + gas, Hemodynamic, Weibel + hemo]
Speedup vs real-time: PC(1): 0.8x; PC(4): 3.1x; GPU: 1.6x; HLS/FPGA: 3.2x
Network of synchronized PEs on FPGAs
General Processing Element (PE)
–Iterative ODE solver (Euler/RK4)
–0.1 ms / 0.01 ms timestep
–1 PE: 300 MHz
[Figure labels: FPGA; Digital mockup; PE]
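The two solver options named for the general PE, Euler and RK4, can be sketched in software as below. This is a minimal illustration of the two textbook methods, not the PE's actual implementation. The test equation dy/dt = -y and all function names are assumptions chosen for the example.

```c
/* Right-hand side of dy/dt = f(t, y). */
typedef double (*ode_rhs)(double t, double y);

/* Forward Euler: one slope evaluation per step. */
double euler_step(ode_rhs f, double t, double y, double h)
{
    return y + h * f(t, y);
}

/* Classical fourth-order Runge-Kutta: four slope evaluations per step. */
double rk4_step(ode_rhs f, double t, double y, double h)
{
    double k1 = f(t, y);
    double k2 = f(t + 0.5 * h, y + 0.5 * h * k1);
    double k3 = f(t + 0.5 * h, y + 0.5 * h * k2);
    double k4 = f(t + h, y + h * k3);
    return y + (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4);
}

/* Hypothetical test equation dy/dt = -y (exact solution y0 * exp(-t)). */
double decay(double t, double y)
{
    (void)t; /* this equation does not depend on t */
    return -y;
}

/* Integrate from t = 0 to t_end with a fixed step h. */
double integrate(ode_rhs f, double y0, double t_end, double h, int use_rk4)
{
    long steps = (long)(t_end / h + 0.5); /* round to nearest step count */
    double y = y0;
    for (long n = 0; n < steps; n++) {
        double t = (double)n * h;
        y = use_rk4 ? rk4_step(f, t, y, h) : euler_step(f, t, y, h);
    }
    return y;
}
```

For this test equation with y0 = 1, t_end = 1 and h = 0.1, Euler returns about 0.3487 while RK4 returns about 0.3679, close to the exact value exp(-1). RK4 costs four evaluations of f per step, so the choice of method and timestep trades accuracy against cycles per step.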
Synthesis tool
–Phase 1: Maps ODEs to virtual PEs using simulated annealing
–Phase 2: Converts virtual PEs to physical circuits using FPGA place-route
[Figure labels: 10K iterations; 150K iterations]
General PEs
[Chart: Performance (ms) of PC(1), PC(4), GPU, HLS, and General PEs for the models Weibel, Neuron, Weibel + gas, Hemodynamic, Weibel + hemo]
Speedup vs real-time: PC(1): 0.8x; PC(4): 3.1x; GPU: 1.6x; HLS: 3.2x; General PEs: 4.9x
Problem: More PEs, lower frequency
–Inter-PE critical path
–Internal PE critical path (DSP, INST MEM, DATA MEM)
[Chart: 11-gen Weibel model, Virtex6 240T FPGA, general PEs; series: real ODEs/sec, ODEs/sec lost due to frequency drop]
Use model structure to improve
–Graph embedding: map guest graph to host graph, minimizing max wire length
–Guest: virtual PEs; host: physical PEs
–Avoid using FPGA placement (Phase 2)
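To illustrate what minimizing the max wire length means, the sketch below places a simple chain of virtual PEs onto a 2D grid in two ways and measures the longest wire between consecutive PEs. The chain-shaped guest graph, the Manhattan distance measure, and both placement functions are assumptions made for this example. They are not the embedding algorithms used in the work.

```c
#include <stdlib.h>

/* Grid position of a physical PE. */
typedef struct { int x, y; } slot;

/* Row-major placement: node k goes to column k % width, row k / width. */
slot row_major(int k, int width)
{
    slot s = { k % width, k / width };
    return s;
}

/* Snake placement: odd rows run right to left, so consecutive
 * nodes of the chain always land on adjacent grid slots. */
slot snake(int k, int width)
{
    int row = k / width, col = k % width;
    slot s = { (row % 2 == 0) ? col : width - 1 - col, row };
    return s;
}

/* Longest wire (Manhattan distance) between consecutive nodes
 * of an n-node chain under the given placement function. */
int max_chain_wire(int n, int width, slot (*place)(int, int))
{
    int worst = 0;
    for (int k = 0; k + 1 < n; k++) {
        slot a = place(k, width), b = place(k + 1, width);
        int d = abs(a.x - b.x) + abs(a.y - b.y);
        if (d > worst) worst = d;
    }
    return worst;
}
```

For a 16-node chain on a 4-wide grid, the row-major placement has a longest wire of 4 (the jump from the end of one row to the start of the next), while the snake placement keeps every wire at length 1. Exploiting the known structure of the guest graph in this way is the idea behind embedding-based placement.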
Phase 2: Map virtual PEs to physical PEs
Embedding algorithm
–H-tree embedding
–Linear embedding
–Direct map embedding
[Figure labels: Guest; Host; FPGA]
[1] Zienicke, P. Embeddings of Treelike Graphs into 2-Dimensional Meshes. (WG '90).
[2] Aleliunas, R., and Rosenberg, A.L. On Embedding Rectangular Grids in Square Grids. (Computers '82).
[3] Berman, F., and Snyder, L. On mapping parallel algorithms into parallel architectures. (PDC '87).
2D grid of physical PEs
–Bypass FPGA placement
–(Phase 1: May require "graph folding" first to reduce #PEs)
[Figure: equation pairs EqP1/EqV1 through EqP7/EqV7 mapped onto the FPGA's grid of physical PEs]
Compare/backup: Simulated annealing
Cost function: C = w1*sum + w2*max + w3*gaps
–Sum = sum of wire distances
–Max = max wire length (Euclidean dist.)
–Gaps = wires across architectural features
Neighbor function: Swap PEs based on distance to neighbors
[Figure labels: FPGA; P1; P2]
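The cost function C = w1*sum + w2*max + w3*gaps could be evaluated for a candidate placement roughly as follows. This is a sketch under stated assumptions, not the tool's code. It swaps in Manhattan distance on integer grid coordinates where the slide specifies Euclidean distance for max, to keep the arithmetic in integers. It also models the only architectural feature as a single vertical column at `feature_x`. All type and function names are hypothetical.

```c
#include <stdlib.h>

typedef struct { int x, y; } point;  /* grid position of a PE */
typedef struct { int a, b; } wire;   /* indices of the two PEs a wire connects */

/* C = w1*sum + w2*max + w3*gaps for a given placement.
 * Assumptions: distances are Manhattan (the slide uses Euclidean
 * for max), and a wire counts as a gap when its two ends lie on
 * opposite sides of one vertical feature column at x = feature_x. */
double placement_cost(const point *pos, const wire *wires, int n_wires,
                      int feature_x, double w1, double w2, double w3)
{
    int sum = 0, max = 0, gaps = 0;
    for (int i = 0; i < n_wires; i++) {
        point p = pos[wires[i].a], q = pos[wires[i].b];
        int d = abs(p.x - q.x) + abs(p.y - q.y);
        sum += d;
        if (d > max) max = d;
        if ((p.x < feature_x) != (q.x < feature_x)) gaps++;
    }
    return w1 * sum + w2 * max + w3 * gaps;
}
```

An annealer would call a function like this after each proposed PE swap and keep or reject the swap based on the change in C. Raising w2 relative to w1 pushes the search toward shortening the single worst wire, which is what limits clock frequency.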
Results
[Figures: No placement strategy; Simulated annealing placement; Embedding placement (4 generations shown, 5 generations shown)]
Results: 2D Neuron model, 256 PEs, Xilinx Virtex6
[Table: Strategy (None, SA, Embed) vs. # LUTs, # BRAM, # DSP, Equivalent LUTs]
–No impact on size
[Table: Strategy (None, SA, Embed) vs. Total power (mW), Dynamic power (mW), Static power (mW)]
[Annotations: Not routable; % more power]
Graph emb (Gen PEs)
Speedup vs real-time (avg): PC(1): 0.8x; PC(4): 3.1x; GPU: 1.6x; HLS: 3.2x; General PE: 4.9x; Graph emb (GPE): 11.2x
Miller, B., F. Vahid, and T. Givargis. Embedding-Based Placement of Processing Element Networks on FPGAs for Physical Model Simulation. ACM Int. Symp. on FPGAs.
Custom Processing Element
–Custom datapath to solve a specific type of equation
–Custom PE for each ODE type, e.g. V' = F_1 - F_2 and F' = P_1 - P_2 - (F*C_R)*C_L
–Modified synthesis tool to create custom PEs for the given ODEs first, then synthesize ODEs to PEs
[Figure labels: MUL; SUB; Const ROM; Data RAM; Controller; Address; Input_sel; Inputs; Output; We; PE; FPGA; Digital mockup; Interface]
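The two equation types shown for the custom PE, V' = F_1 - F_2 and F' = P_1 - P_2 - (F*C_R)*C_L, need only subtractions and multiplications, which matches the SUB and MUL units in the datapath. A software sketch of one forward-Euler update per equation follows. Using Euler here, the timestep parameter `h`, and the function names are assumptions for illustration.

```c
/* One forward-Euler update of the volume equation V' = F1 - F2.
 * h is the timestep (an assumed parameter; the slide does not name it). */
double volume_step(double v, double f1, double f2, double h)
{
    return v + h * (f1 - f2);
}

/* One forward-Euler update of the flow equation
 * F' = P1 - P2 - (F * C_R) * C_L, with the symbols named as on the slide. */
double flow_step(double f, double p1, double p2,
                 double c_r, double c_l, double h)
{
    return f + h * (p1 - p2 - (f * c_r) * c_l);
}
```

A general PE would fetch and decode instructions to carry out these same operations. A custom PE hard-wires the sequence for one equation type, which is the source of its speed and of its inflexibility.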
Custom PEs
[Chart: Performance (ms) of PC(1), PC(4), GPU, HLS, General PEs, and Custom PEs for the models Weibel, Neuron, Weibel + gas, Hemodynamic, Weibel + hemo]
Speedup vs real-time (avg): PC(1): 0.8x; PC(4): 3.1x; GPU: 1.6x; HLS: 3.2x; General PE: 4.9x; Graph emb (GPE): 11.2x; Custom PE: 6.1x
Huang, Vahid, Givargis. Synthesis of networks of custom processing elements for real-time physical system emulation. Transactions on Design Automation of Electronic Systems (TODAES), 2013 (to appear).
Networks of Heterogeneous PEs
–General PE: slow, flexible (can solve any type of ODE)
–Custom PE: fast, inflexible (only solves one type of ODE)
–Multi-type PE: combines multiple types of ODEs into a single custom PE
Huge solution space:
–How to choose types of PEs?
–How many PEs to allocate?
–How to bind ODEs to PEs?
[Figure labels: FPGA; Digital mockup; Interface]
Huang, Miller, Vahid, Givargis. Synthesis of Heterogeneous Processing Elements for Physical System Emulation. CODES+ISSS 2012, Oct.
Automatic allocation and binding (simulated annealing)
[Flowchart labels: Initial random allocation; PE allocator; ODE-to-PE mapper; New PE allocation; Cycles of each PE; Better solution? (Y/N); Best solution]
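The flowchart scores each candidate allocation by the cycles of each PE. One way such a score could be computed is sketched below. It assumes that a PE solves the ODEs bound to it one after another and that the cycle count of each ODE on its PE is known, so the slowest PE sets the time per timestep. The function name, the `MAX_PES` limit, and the error convention are hypothetical.

```c
#define MAX_PES 64

/* Cycles per timestep of the busiest PE.
 * binding[i]    = index of the PE that ODE i is bound to
 * ode_cycles[i] = cycles ODE i needs on that PE (assumed known)
 * Assumption: each PE processes its ODEs sequentially, so its load
 * is the sum of their cycles. Returns -1 on invalid input. */
long slowest_pe_cycles(const int *binding, const long *ode_cycles,
                       int n_odes, int n_pes)
{
    long load[MAX_PES] = {0};
    if (n_pes < 1 || n_pes > MAX_PES) return -1;
    for (int i = 0; i < n_odes; i++) {
        if (binding[i] < 0 || binding[i] >= n_pes) return -1;
        load[binding[i]] += ode_cycles[i];
    }
    long worst = 0;
    for (int p = 0; p < n_pes; p++) {
        if (load[p] > worst) worst = load[p];
    }
    return worst;
}
```

In an annealing loop, a new allocation and binding would count as a better solution when this value drops, subject to the design still fitting on the FPGA.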
Heterogeneous PEs
[Chart: Performance (ms) of PC(1), PC(4), GPU, HLS, General PEs, Custom PEs, and Heterogeneous PEs for the models Weibel, Neuron, Weibel + gas, Hemodynamic, Weibel + hemo]
Speedup vs real-time (avg): PC(1): 0.8x; PC(4): 3.1x; GPU: 1.6x; HLS: 3.2x; General PE: 4.9x; Graph emb (GPE): 11.2x; Custom PE: 6.1x; Heterog PE: 34.5x
C. Huang, B. Miller, F. Vahid, T. Givargis. Synthesis of Custom Networks of Heterogeneous Processing Elements for Complex Physical System Emulation. IEEE/ACM Conf on Hardware/Software Codesign and System Synthesis (CODES/ISSS, part of ESWEEK), Finland, Oct
Network of general/custom/heterogeneous PEs vs. HLS (regularity extraction)
Heterogeneous PE (speed, size) relative to:
–HLS: (10x, 1.1x)
–General PE: (7x, 0.85x)
–Custom PE: (6x, 1.35x)
Performance (ms): time to emulate 1000 ms, using Euler with 0.01 ms step
Size: equivalent LUTs
Speedup / dollar
–CPU (I Intel X58 board): $480
–GPU (GTX460 + I H55 board): $380
–FPGA (Xilinx Virtex6 240T-2 board): $1800
Heterogeneous PEs: 3x better than PC(4), 4.5x better than GPU
FPGA: Easier to build custom interfaces
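The speedup-per-dollar comparison can be checked with a few lines of arithmetic. The sketch assumes the ratios were computed from the average speedups reported elsewhere in the deck (34.5x for heterogeneous PEs, 3.1x for PC(4), 1.6x for GPU) divided by the board prices listed here. The function names are hypothetical.

```c
/* Speedup delivered per dollar of hardware. */
double speedup_per_dollar(double speedup, double cost_usd)
{
    return speedup / cost_usd;
}

/* How many times better platform A is than platform B
 * in speedup per dollar. */
double per_dollar_advantage(double speedup_a, double cost_a,
                            double speedup_b, double cost_b)
{
    return speedup_per_dollar(speedup_a, cost_a)
         / speedup_per_dollar(speedup_b, cost_b);
}
```

With those inputs, `per_dollar_advantage(34.5, 1800, 3.1, 480)` is about 2.97 and `per_dollar_advantage(34.5, 1800, 1.6, 380)` is about 4.55, consistent with the stated 3x and 4.5x.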
Other projects
Assistive monitoring
Web-based learning
–"Textbook is dead"
–Multi-univ synergy
–pcpp.zyante.com (C++)
Embedded systems educ.
–New prog. model, virtual lab, programmingembeddedsystems.com
–Also riosscheduler.org
Drunk driving (DUI)
–duicam.org
Summary
Speedup vs real-time (avg): PC(1): 0.8x; PC(4): 3.1x; GPU: 1.6x; HLS: 3.2x; General PE: 4.9x; Graph emb (GPE): 11.2x; Custom PE: 6.1x; Heterog PE: 34.5x (Graph emb + HPE: 48.5x)
FPGAs: Fastest cost-effective execution of physical models
Future
–Manycore device
–Beyond testing CPS: implement end-products
Questions?