by Martin Kruliš (v1.0), 6. 1. 2015


2 • Adapteva Company
    ◦ Small fabless semiconductor company
    ◦ Founded in 2008
    ◦ Main objective is to design massively parallel chips with an emphasis on power efficiency
      ▪ First company to design a chip expected to scale beyond 1000 cores
    ◦ Current products
      ▪ Epiphany processor (16-core and 64-core versions)
      ▪ Parallella board
    ◦ Parallella University Program started this year

3 [Board diagram; labels:] 16-core Epiphany coprocessor · 1 GB SDRAM · μUSB · 1 Gb Ethernet · μSD · μHDMI · μUSB · Zynq dual-core ARM Cortex-A9 (with integrated FPGA) · expansion slots


6 • Coprocessor
    ◦ 32-bit superscalar RISC cores
    ◦ 32 KB local memory per core (1-cycle latency)
      ▪ Divided into four independent banks
    ◦ IEEE 754 compliant floating-point instruction set
    ◦ Two DMA channels
  • eMesh (Network-on-Chip)
    ◦ Both on-chip and off-chip communication
    ◦ No specific API, works with memory transactions
  • eLink (Chip-to-Chip Links)
    ◦ 4 I/O ports for external communication

7 • Coprocessor Cores
    ◦ Simple in-order RISC architecture
      ▪ Most instructions take 1 cycle
      ▪ 8-stage dual-issue pipeline
      ▪ Instruction set optimized for signal processing
    ◦ Separate integer and floating-point ALUs
    ◦ 64 32-bit registers (shared by the IALU and FPU)
      ▪ Load/store architecture
      ▪ Register file accesses per cycle: 3 reads / 1 write for the FPU, 2 reads / 1 write for the IALU, plus 1 load/store
    ◦ Performance
      ▪ 16 cores: ~2 GFLOPS each; 64 cores: ~1.6 GFLOPS each

8 • Memory Model
    ◦ The internal memory of each node is mapped into the global address space

9 • Local Memory
    ◦ Divided into four banks with independent controllers
    ◦ Each clock cycle, each bank may perform:
      ▪ Send a 64-bit word to the program sequencer
      ▪ Transfer a 64-bit word between memory and registers
      ▪ Receive a 64-bit word from the eMesh interface
      ▪ Local DMA sends a 64-bit word to the eMesh interface
    ◦ Memory order model
      ▪ Local reads and writes follow a strong memory model
      ▪ Non-local transactions follow a weak memory model
      ▪ Operations may not propagate in the same order

10 • eMesh
    ◦ 2D topology with nearest-neighbor connections
    ◦ 3 orthogonal (independent) meshes
      ▪ cMesh: on-chip write transactions (8 B/cycle)
      ▪ xMesh: off-chip write transactions (1 B/cycle)
      ▪ rMesh: read requests (1 request per 8 cycles)
    ◦ Edge connections may be interfaced with other Epiphany chips
      ▪ Or with other types of buses (off-core memory, I/O ports, …)
    ◦ Significantly favors write operations over reads
      ▪ Write transactions are 16x faster
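To make the nearest-neighbor topology concrete, the following sketch estimates how many mesh links a transaction crosses between two cores. It assumes minimal routing on a 2D mesh without wrap-around links, so the hop count is the Manhattan distance; the `mesh_coord` type and the `mesh_hops` helper are hypothetical names for illustration and are not part of the Epiphany SDK.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical type: position of a core in the 2D mesh. */
typedef struct {
    int row;
    int col;
} mesh_coord;

/* Number of nearest-neighbor links crossed between two cores, assuming
 * minimal routing on a mesh with no wrap-around links. */
static int mesh_hops(mesh_coord from, mesh_coord to)
{
    assert(from.row >= 0 && from.col >= 0 && to.row >= 0 && to.col >= 0);
    return abs(to.row - from.row) + abs(to.col - from.col);
}
```

Under these assumptions, opposite corners of a 4x4 mesh are 6 hops apart, while direct neighbors are 1 hop apart.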

11 • eMesh

12 • eMesh Routing
    ◦ The upper 12 bits of an address identify the core
      ▪ 6 bits for the row index, 6 bits for the column index
    ◦ Each node uses a simple routing algorithm
    ◦ Nodes use round-robin arbitration to avoid deadlock
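The address split described on this slide can be sketched with a few bit operations. The sketch assumes 32-bit global addresses, with the 12-bit core ID in the top bits (row in the upper 6, column in the lower 6) and the remaining 20 bits used as the offset inside the core's window. All helper names are hypothetical and are not SDK functions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: [ row:6 | col:6 | offset:20 ] in a 32-bit address. */
enum { COL_BITS = 6, ROW_BITS = 6, OFFSET_BITS = 20 };

/* Hypothetical helper: 12-bit core ID from mesh coordinates. */
static uint32_t core_id(uint32_t row, uint32_t col)
{
    assert(row < (1u << ROW_BITS) && col < (1u << COL_BITS));
    return (row << COL_BITS) | col;          /* same as row * 64 + col */
}

/* Hypothetical helper: global address of `offset` inside core (row, col). */
static uint32_t global_address(uint32_t row, uint32_t col, uint32_t offset)
{
    assert(offset < (1u << OFFSET_BITS));
    return (core_id(row, col) << OFFSET_BITS) | offset;
}

/* Decoding helpers: recover the three fields from a global address. */
static uint32_t addr_row(uint32_t addr)
{
    return addr >> (OFFSET_BITS + COL_BITS);
}

static uint32_t addr_col(uint32_t addr)
{
    return (addr >> OFFSET_BITS) & ((1u << COL_BITS) - 1u);
}

static uint32_t addr_offset(uint32_t addr)
{
    return addr & ((1u << OFFSET_BITS) - 1u);
}
```

For example, under this layout offset 0x6000 in the core at row 32, column 8 maps to the global address 0x80806000, and the decoding helpers recover the three fields from that value.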

13 • DMA
    ◦ Two DMA channels per node
    ◦ 2D addressing awareness, flexible strides
    ◦ Local-external memory and external-external memory transfers
    ◦ Completion signaling by HW interrupt
    ◦ Master and slave modes
      ▪ Slave DMA is controlled by external I/O or another DMA
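What "2D addressing with flexible strides" means can be illustrated with a plain software model of a strided copy. This is only a sketch of the idea: the `dma2d_desc` structure, its field names, and `dma2d_copy` are hypothetical, and they do not mirror the hardware descriptor format or any SDK call.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical descriptor for a software model of a 2D strided transfer. */
typedef struct {
    size_t    elem_size;         /* bytes per element                     */
    size_t    inner_count;       /* elements per row                      */
    size_t    outer_count;       /* number of rows                        */
    ptrdiff_t src_inner_stride;  /* bytes between elements in the source  */
    ptrdiff_t src_outer_stride;  /* bytes between row starts in the source */
    ptrdiff_t dst_inner_stride;  /* bytes between elements in the target  */
    ptrdiff_t dst_outer_stride;  /* bytes between row starts in the target */
} dma2d_desc;

/* Copy outer_count rows of inner_count elements, stepping through source
 * and destination with independent strides. */
static void dma2d_copy(const dma2d_desc *d, const void *src, void *dst)
{
    assert(d != NULL && src != NULL && dst != NULL);
    for (size_t r = 0; r < d->outer_count; ++r) {
        for (size_t c = 0; c < d->inner_count; ++c) {
            const unsigned char *s = (const unsigned char *)src
                + (ptrdiff_t)r * d->src_outer_stride
                + (ptrdiff_t)c * d->src_inner_stride;
            unsigned char *t = (unsigned char *)dst
                + (ptrdiff_t)r * d->dst_outer_stride
                + (ptrdiff_t)c * d->dst_inner_stride;
            memcpy(t, s, d->elem_size);
        }
    }
}
```

With a source row stride of four elements and a destination row stride of two, this model extracts a 2x2 tile from a row-major 4x4 matrix into a contiguous buffer, which is the kind of transfer a tiled algorithm needs.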

14 • Epiphany SDK
    ◦ Separate compilation for host and coprocessor code
      ▪ Epiphany uses e-gcc and e-objcopy
    ◦ The host runtime provides a way to
      ▪ Detect the coprocessor
      ▪ Allocate memory, transfer data
      ▪ Execute precompiled binaries on the coprocessor
  • OpenCL
    ◦ The coprocessor is presented as an OpenCL accelerator
    ◦ Each core is a compute unit, on-chip memory is local memory, …

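15 • Host Code Example

    /* Excerpt: declarations of i, j, coreid, emem, emsg, _BufSize and flag
       are not shown on the slide. */
    e_platform_t platform;
    e_epiphany_t dev;

    e_init(NULL);                      /* initialize the host library */
    e_reset_system();                  /* reset the Epiphany system */
    e_get_platform_info(&platform);    /* query mesh origin and size */
    e_open(&dev, 0, 0, platform.rows, platform.cols);   /* workgroup of all cores */
    e_load_group("coproccode.srec", &dev, 0, 0, platform.rows, platform.cols);

    for (i = 0; i < platform.rows; ++i)
        for (j = 0; j < platform.cols; ++j) {
            /* 12-bit core ID: 6-bit row, 6-bit column */
            coreid = (i + platform.row) * 64 + j + platform.col;
            usleep(100000);
            e_read(&emem, 0, 0, 0x0, emsg, _BufSize);          /* read from the emem buffer */
            e_read(&dev, i, j, 0x6000, &flag, sizeof(flag));   /* read from core (i, j) at 0x6000 */
            ...
        }

    e_close(&dev);
    e_finalize();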


17 • Matrix Multiplication
    ◦ A tiles are rotated vertically in each column
    ◦ B tiles are rotated horizontally in each row
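The rotate-and-accumulate pattern on this slide resembles the textbook Cannon's algorithm, which the sketch below simulates on the host for a small mesh. Several things here are assumptions: the mesh side `P`, the use of one scalar per core instead of a full tile, and the textbook orientation in which A moves along rows and B along columns. The orientation stated on the slide depends on how its figure lays out the tiles, which this transcript does not preserve.

```c
#include <assert.h>
#include <string.h>

#define P 3  /* hypothetical mesh side: P x P cores, one scalar "tile" each */

/* Rotate row r of m left by k positions, wrapping around. */
static void rotate_row_left(double m[P][P], int r, int k)
{
    double tmp[P];
    for (int c = 0; c < P; ++c)
        tmp[c] = m[r][(c + k) % P];
    memcpy(m[r], tmp, sizeof tmp);
}

/* Rotate column c of m up by k positions, wrapping around. */
static void rotate_col_up(double m[P][P], int c, int k)
{
    double tmp[P];
    for (int r = 0; r < P; ++r)
        tmp[r] = m[(r + k) % P][c];
    for (int r = 0; r < P; ++r)
        m[r][c] = tmp[r];
}

/* Textbook Cannon's algorithm on a simulated mesh. a and b are modified in
 * place (tiles really move between cores); out receives the product a * b. */
static void cannon_multiply(double a[P][P], double b[P][P], double out[P][P])
{
    assert(a != b);  /* the operands are rotated independently */
    for (int i = 0; i < P; ++i)
        for (int j = 0; j < P; ++j)
            out[i][j] = 0.0;

    /* Initial skew: row i of A moves left by i, column i of B moves up by i. */
    for (int i = 0; i < P; ++i) {
        rotate_row_left(a, i, i);
        rotate_col_up(b, i, i);
    }

    for (int step = 0; step < P; ++step) {
        /* Every core multiplies the two values it currently holds. */
        for (int i = 0; i < P; ++i)
            for (int j = 0; j < P; ++j)
                out[i][j] += a[i][j] * b[i][j];
        /* Then each row of A and each column of B rotates by one position. */
        for (int i = 0; i < P; ++i) {
            rotate_row_left(a, i, 1);
            rotate_col_up(b, i, 1);
        }
    }
}
```

On the real chip each core would hold a tile rather than a scalar, and the rotations would be writes into a neighbor's local memory, which suits the write-favoring eMesh.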
