
May 2009. Wu Jinyuan (Fermilab, jywu168@fnal.gov), Huang Yifei (IMSA)

Slide 1: An FPGA Computing Demo Core for Space Charge Simulation
Wu, Jinyuan (Fermilab)
Huang, Yifei (Illinois Math & Science Academy)
May 2009

Slide 2: About Illinois Math & Science Academy
One coauthor (Huang Yifei) is with the Illinois Math & Science Academy (IMSA).
IMSA enrolls academically talented Illinois students in grades 10-12.
Nobel Laureate Dr. Leon Lederman is an IMSA founder and Resident Scholar.
The work is done through the Student Inquiry and Research (SIR) program. (SIR consists of 22 Wednesdays in the 2008-09 academic year. The work is done at Fermilab. Transportation to Fermilab is provided by IMSA.)

Slide 3: What?
For the space charge simulation computing task, a 16-bit demo core is developed in a low-cost FPGA:
One FPGA = 5 Intel Core 2 Duo 2.2 GHz CPUs
(0.5 W) vs. (5 x 75 W)

Slide 4: The Space Charge Computing
Each electron sees sum of Coulomb forces from other N-1 electrons.
The total number of calculations is about N^2 and each calculation of the Coulomb force requires a square root, a division and several multiplications.
Regular sequential computers are not fast enough.

Number of Electrons   Number of Calculations/Iteration   Computing Time/1000 Iterations @ 10^7 Calculations/s
10^3                  ~10^6                              100 s
10^4                  ~10^8                              2.7 hours
10^5                  ~10^10                             11.6 days
10^6                  ~10^12                             3.2 years
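The N^2 cost described on this slide can be illustrated with a minimal floating-point sketch of a direct Coulomb sum. This is not the FPGA core; the function names, the unit charge and the constant `k` are assumptions made for the illustration. The second function reproduces the timing column of the table using the slide's N^2 approximation.

```python
import math

def coulomb_forces(positions, k=1.0):
    """Direct sum: force on each particle from all the others (unit charges).

    Returns the force list and the number of pairwise evaluations, N*(N-1).
    """
    n = len(positions)
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    evaluations = 0
    for i in range(n):
        xi, yi, zi = positions[i]
        for j in range(n):
            if i == j:
                continue  # a particle exerts no force on itself
            dx = xi - positions[j][0]
            dy = yi - positions[j][1]
            dz = zi - positions[j][2]
            r2 = dx * dx + dy * dy + dz * dz   # three multiplications
            r = math.sqrt(r2)                  # one square root
            inv_r3 = 1.0 / (r2 * r)            # one division
            forces[i][0] += k * dx * inv_r3
            forces[i][1] += k * dy * inv_r3
            forces[i][2] += k * dz * inv_r3
            evaluations += 1
    return forces, evaluations

def seconds_per_1000_iterations(n, rate=1e7):
    """Time for 1000 iterations at `rate` calculations/s, taking N^2 per iteration."""
    return 1000 * n * n / rate

if __name__ == "__main__":
    for exponent in (3, 4, 5, 6):
        n = 10 ** exponent
        print(f"N = 10^{exponent}: {seconds_per_1000_iterations(n):.3g} s")
```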

Slide 5: The FPGA Board
Up to 16 FPGA devices ($32 each) can be installed on each board.
Each FPGA hosts one core.

Slide 6: The 16-bit Demo Core

Slide 7: A Double-Layer + Single-Layer Sequencer
A double-layer loop is followed by a single-layer loop.
[Figure: sequencer counter values, labeled Inner Loop, Outer Loop and State Control; the counters run from 0 to 255.]
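The order in which such a sequencer might step through its states can be sketched in Python. The slide does not say what each loop does; this sketch assumes the double-layer loop walks all particle pairs and the single-layer loop then visits each particle once. The name `sequencer` and the state labels "double" and "single" are hypothetical.

```python
def sequencer(n):
    """Yield (state, outer, inner) in the order of a double-layer loop
    followed by a single-layer loop, each counter running from 0 to n-1."""
    # Double-layer loop: the inner counter wraps n times, once per outer value.
    for outer in range(n):
        for inner in range(n):
            yield ("double", outer, inner)
    # Single-layer loop: one pass over the same index range.
    for index in range(n):
        yield ("single", index, 0)

if __name__ == "__main__":
    steps = list(sequencer(256))
    # 256 * 256 double-loop steps plus 256 single-loop steps
    print(len(steps))
```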

Slide 8: The Lookup Table
[Figure: three squarers (x^2) and an adder feed the LUT, 10 bits in, 16 bits out.]
The LUT replaces:
- A square root
- Two multiplications
- A reciprocal operation
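The three replaced operations are consistent with a table of 1/r^3 indexed by r^2, that is s^(-3/2) for a sum of squares s, but the slide does not state the tabulated function, so the sketch below treats it as an assumption. The output scale (full scale at s = 256) and the handling of s = 0 are also hypothetical choices.

```python
LUT_IN_BITS = 10    # from the slide: 10-bit input
LUT_OUT_BITS = 16   # from the slide: 16-bit output

def build_lut():
    """Build a table of scaled s**-1.5 values for every 10-bit input s.

    Assumed scaling: the entry at s = 256 equals the largest 16-bit value;
    smaller non-zero inputs saturate, and s = 0 maps to 0.
    """
    max_out = (1 << LUT_OUT_BITS) - 1
    scale = max_out * 256 ** 1.5
    table = [0]  # s = 0: no force contribution (assumption)
    for s in range(1, 1 << LUT_IN_BITS):
        table.append(min(max_out, round(scale / s ** 1.5)))
    return table

if __name__ == "__main__":
    lut = build_lut()
    # One table read stands in for sqrt, two multiplications and a reciprocal.
    print(len(lut), lut[256], lut[1023])
```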

Slide 9: Number of Bits for Input to LUT
[Figure: datapath from 16-bit coordinates (x_i, y_i, z_i) through subtractors, squarers (x^2), an adder giving the 32-bit sum of squares, the LUT (10 bits in, 16 bits out) and multipliers to 32-bit forces.]
A 32-bit input LUT is too big: 2^32 = 4G words.
Shifters are used before and after the LUT.
Leading zeros are eliminated: 00000001010110 -> 0101011000

Slide 10: Bit Evolution Before LUT
[Figure: bit widths through the stages x1, x2 -> (x1-x2) -> (x1-x2)^2 -> Sum of 3 Squares -> LUT.]
If ((High Bits) != 0) Choose (High Bits) Else Choose (Low Bits)
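One way to read the "choose high bits, else low bits" rule is as a cascade of selection stages that together remove an even number of leading zeros. The Python sketch below assumes stage sizes of 16, 8, 4 and 2 bits; the stage sizes and the name `normalize_for_lut` are hypothetical, not taken from the slides.

```python
def normalize_for_lut(s, width=32, lut_bits=10):
    """Remove leading zeros two bits at a time and return (lut_index, n).

    Each stage looks at the top k bits: if they are all zero, the low bits
    are chosen (a left shift by k); otherwise the value is kept as it is.
    The total shift is 2*n.
    """
    mask = (1 << width) - 1
    shift = 0
    for k in (16, 8, 4, 2):
        if k < width and (s >> (width - k)) == 0:
            s = (s << k) & mask
            shift += k
    lut_index = s >> (width - lut_bits)  # keep only the top lut_bits bits
    return lut_index, shift // 2

if __name__ == "__main__":
    # The 14-bit pattern from the previous slide, shifted left by 6 bits
    index, n = normalize_for_lut(0b00000001010110, width=14)
    print(format(index, "010b"), n)  # prints "0101011000 3"
```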

Slide 11: Bit Evolution After LUT
[Figure: bit widths after the LUT; labels: (x1-x2), LUT.]
Shift 2n before LUT
Shift 3n after LUT
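The 2n and 3n shift amounts fit together if the LUT tabulates a function proportional to s^(-3/2) of the sum of squares s. That is an assumption here; the slides do not state the function. Under it, scaling the input by 2^(2n) scales the output by 2^(-3n), which a shift of 3n bits after the LUT undoes:

```latex
% Assumption: the LUT stores f(s) = s^{-3/2}, where s is the sum of squares.
\begin{align*}
  s' &= 2^{2n}\, s
     && \text{(shift by $2n$ bits before the LUT)} \\
  f(s') &= \left(2^{2n} s\right)^{-3/2} = 2^{-3n}\, s^{-3/2} \\
  s^{-3/2} &= 2^{3n}\, f(s')
     && \text{(shift by $3n$ bits after the LUT)}
\end{align*}
```

Shifting the input by an even number of bits keeps the output correction a whole number of bits, because the exponent 3/2 turns 2n into 3n.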

Slide 12: Two Electrons with Natural Scales
[Figure: two electrons (e, e); scales shown: 256 nm, 28 ps.]

Slide 13: 256 Charged Particles, Iteration 0

Slide 22: Speed Comparison with Regular CPU
The FPGA core is about 10 times faster than a typical 2.2 GHz CPU core.
The FPGA core runs at 200 MHz, i.e. 200 M Coulomb force calculations/s.
It seems the CPU core needs 80-100 clock cycles for each Coulomb force calculation.
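The figures on this slide can be cross-checked with a few lines of arithmetic. The constants are the slide's own numbers; the function name `speedup` is hypothetical.

```python
CPU_CLOCK_HZ = 2.2e9   # typical CPU core clock, from the slide
FPGA_RATE = 200e6      # Coulomb force calculations/s for one FPGA core

def speedup(cycles_per_calculation):
    """FPGA-to-CPU throughput ratio for a given CPU cycle count per calculation."""
    cpu_rate = CPU_CLOCK_HZ / cycles_per_calculation
    return FPGA_RATE / cpu_rate

if __name__ == "__main__":
    for cycles in (80, 100):
        print(f"{cycles} cycles/calculation: x{speedup(cycles):.1f}")
    # 80 cycles gives about x7.3 and 100 cycles about x9.1,
    # so the quoted x10 sits at the upper end of this range.
```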

Slide 23: One Board: 8 FPGA Cores
One board has the calculation capacity of 40 dual-core CPUs.
The power consumption of one board is < 4.5 W.
Newer FPGAs capable of hosting 4 cores/FPGA are available.
One Core/FPGA = 5 Dual Core CPUs
8 Cores/Board = 40 Dual Core CPUs
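The board-level numbers can be checked against the per-device figures quoted earlier in the talk (0.5 W per FPGA, 75 W per CPU, one FPGA core equal to 5 dual-core CPUs). A small sketch, with hypothetical variable names:

```python
FPGAS_PER_BOARD = 8
CPUS_PER_FPGA_CORE = 5   # dual-core CPUs matched by one FPGA core
FPGA_POWER_W = 0.5       # per FPGA, from the "What?" slide
CPU_POWER_W = 75.0       # per CPU, from the "What?" slide

equivalent_cpus = FPGAS_PER_BOARD * CPUS_PER_FPGA_CORE
board_power_w = FPGAS_PER_BOARD * FPGA_POWER_W
cpu_power_w = equivalent_cpus * CPU_POWER_W

if __name__ == "__main__":
    print(f"{equivalent_cpus} dual-core CPUs per board")
    print(f"FPGA power on board: {board_power_w} W (quoted board power: < 4.5 W)")
    print(f"Equivalent CPU power: {cpu_power_w} W")
```

The 4.0 W figure counts only the eight FPGAs, which leaves some margin under the quoted 4.5 W for the rest of the board.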

Slide 24: The Execution & Non-Execution Cycles
In current micro-processors:
- Each instruction takes one clock cycle to execute.
- It takes many clock cycles to prepare for executing an instruction.
- Pipelined? Yes. But the non-execution pipeline stages consume silicon area, power etc.
- To execute an instruction != to do useful calculation.
Can we do something different?
- Arithmetic, Algorithm, Architecture.
From MIT 6.823 Open Course Site

Slide 25: The End
Thanks
