Multi-core SOC for Future Media Processing
Qin Xing, Yan Xiaolang
The Institute of VLSI Design, Zhejiang University
Outline
- Opportunities & challenges from media processing
- Multimedia algorithm characteristics & mapping
- Multi-core SOC architecture & technology
- Benchmarking results
- Project status
- Future work
The Institute of VLSI Design, Zhejiang Univ. 2018/9/17
Opportunities
- Video conference
- IP-phone
- Smart terminal
- PDA
- Video camera
- HDTV
- Set-top box
- …
Challenges: multiple standards
[Figure: bit rate (Mbit/s, axis from 1 to 6) versus year (1994 to 2005) for the 1st to 5th encoder generations. Standards shown: MPEG-2, MPEG-4, H.263, H.26L, H.264 / MPEG-4 Part 10, WMV, VP3, AVS]
Challenges: excellent hardware
- Very high computational complexity: H.264 encoding of 720 x 576 pixels at 30 frames/s needs up to 30 GOPS
- Multiple standards coexist
- Demands for flexibility and programmability
- Low power
- Low cost
Best choice: application-specific instruction-set processor (ASIP)
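To put the 30 GOPS figure in perspective, the short sketch below converts it into an average operation budget per pixel. The function names `pixel_rate` and `ops_per_pixel` are illustrative and do not come from the slides; only the frame size, frame rate and GOPS figure are taken from the slide above.

```c
#include <assert.h>

/* Pixels processed per second for a given frame size and frame rate. */
long long pixel_rate(int width, int height, int fps)
{
    assert(width > 0 && height > 0 && fps > 0);
    return (long long)width * height * fps;
}

/* Average number of operations available per pixel when the whole
 * workload is stated in giga-operations per second (GOPS). */
double ops_per_pixel(double gops, int width, int height, int fps)
{
    return gops * 1e9 / (double)pixel_rate(width, height, fps);
}
```

With the slide's numbers, `pixel_rate(720, 576, 30)` is 12,441,600 pixels per second, so `ops_per_pixel(30.0, 720, 576, 30)` comes to roughly 2,400 operations per pixel.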
Multimedia algorithm characteristics
Outer loop and inner loop
- Outer loop:
  - Interface (GUI)
  - OS (Linux)
  - Bit-stream parsing (pack/unpack, VLC, CABAC)
  - Data transfer
- Inner loop:
  - Regular algorithms (prediction, FIR, DCT, motion estimation)
Multimedia algorithm mapping
Programmable, heterogeneous processors are the preferred choice for the implementation:
- General MCU (RISC core): outer loop
- Enhanced DSP (EDSP, with added bit-wise operations): outer loop
- Vector processor (VP, VLIW + SIMD): inner loop
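The mapping can be sketched as a small lookup from task category to processor type. This is only an illustration: the enum and function names are hypothetical, and the split of the outer-loop tasks between the MCU and the EDSP is an assumption (bit-stream parsing is placed on the EDSP because of its bit-wise operations; GUI, OS and data transfer are placed on the MCU).

```c
/* Hypothetical task categories, following the outer-loop/inner-loop split. */
enum task_kind {
    TASK_GUI,
    TASK_OS,
    TASK_DATA_TRANSFER,
    TASK_BITSTREAM_PARSING,
    TASK_PREDICTION,
    TASK_FIR,
    TASK_DCT,
    TASK_MOTION_ESTIMATION
};

/* The three processor types named on the slide. */
enum core_kind { CORE_MCU, CORE_EDSP, CORE_VP };

/* Choose a processor type for a task category. */
enum core_kind core_for_task(enum task_kind task)
{
    switch (task) {
    case TASK_BITSTREAM_PARSING:
        return CORE_EDSP;   /* outer loop, bit-wise heavy (assumed split) */
    case TASK_PREDICTION:
    case TASK_FIR:
    case TASK_DCT:
    case TASK_MOTION_ESTIMATION:
        return CORE_VP;     /* inner loop: regular, data-parallel kernels */
    default:
        return CORE_MCU;    /* remaining outer-loop tasks (assumed split) */
    }
}

/* Non-zero when the task belongs to the inner loop. */
int is_inner_loop(enum task_kind task)
{
    return core_for_task(task) == CORE_VP;
}
```

A table-driven version would work equally well; the switch is used here only to keep the assumed assignments visible next to each case.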
Multi-core SOC architecture
[Figures: top level; media processing kernel]
Inside the media processing kernel
[Block diagram with the following labels: EDSP control path; vector control path; GAG1 to GAG4; GDM; GTM; V-DM1 to V-DM4; 2D crossbar connection network; E-DP; V-DP1 to V-DP4; DMA and off-chip memories]
Technologies: specialized instruction set
Adapt programmable processors to specific algorithms by introducing specialized instructions for frequently occurring operations of higher complexity.
Example: integer IDCT in H.264
C reference code (excerpt):
    for (j = 0; j < BLOCK_SIZE; j++) {
        for (i = 0; i < BLOCK_SIZE; i++) {
            m5[i] = img->cof[i0][j0][i][j];
        }
        m6[0] = (m5[0] + m5[2]);
        m6[1] = (m5[0] - m5[2]);
        m6[2] = (m5[1] >> 1) - m5[3];
        m6[3] = m5[1] + (m5[3] >> 1);
    }   // the output stage of the loop body is not shown
Intel MMX: 13 cycles (excerpt)
    __asm {
        mov      edx, mptr
        movdqu   xmm1, [edx]
        packssdw xmm1, xmm1      // read m5[0] from memory to xmm1
    }
    __asm {
        movdqu   xmm4, [edx+48]
        packssdw xmm4, xmm4      // read m5[3] from memory
    }
    __asm {
        movq     xmm5, xmm1
        psubw    xmm1, xmm3      // m6[1] = (m5[0] - m5[2])
        paddw    xmm3, xmm5      // m6[0] = (m5[0] + m5[2])
        movq     xmm5, xmm2
        psraw    xmm2, 1
        psubw    xmm2, xmm4      // m6[2] = (m5[1] >> 1) - m5[3]
        psraw    xmm4, 1
        paddw    xmm4, xmm5      // m6[3] = m5[1] + (m5[3] >> 1)
    }
Our instruction set: 6 cycles
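For readers who want to run the butterfly on its own, here is a minimal sketch of one 1-D pass of the 4-point integer inverse transform. The first stage matches the `m6[]` expressions on the slide. The second (output) stage is not on the slide and is assumed here from the standard H.264 inverse transform. The function name `idct4_1d` is illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* One 1-D pass of the 4-point H.264 integer inverse transform.
 * in[]  : four coefficients of one row or column (m5[] on the slide)
 * out[] : four reconstructed values, before the final rounding shift
 * Note: >> on negative values is implementation-defined in C; this
 * sketch assumes the usual arithmetic shift. */
void idct4_1d(const int in[4], int out[4])
{
    assert(in != NULL && out != NULL);

    /* First stage: identical to the m6[] computation on the slide. */
    int m6_0 = in[0] + in[2];
    int m6_1 = in[0] - in[2];
    int m6_2 = (in[1] >> 1) - in[3];
    int m6_3 = in[1] + (in[3] >> 1);

    /* Second stage: assumed from the standard transform, not shown
     * on the slide. */
    out[0] = m6_0 + m6_3;
    out[1] = m6_1 + m6_2;
    out[2] = m6_1 - m6_2;
    out[3] = m6_0 - m6_3;
}
```

A full 4x4 block applies this pass to every row and then to every column, followed by a rounding shift. A DC-only input such as `{64, 0, 0, 0}` yields four equal outputs, which is a quick sanity check.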
Technologies: instruction merging
Example: 6-tap sub-pixel interpolation
    result = 0;
    pres_y = dy == 1 ? y_pos : y_pos + 1;
    pres_y = max(0, min(maxold_y, pres_y));            // load
    for (x = -2; x < 4; x++) {                         // control
        pres_x = max(0, min(maxold_x, x_pos + x));     // load
        result += imY[pres_y][pres_x] * COEF[x + 2];   // computation, permutation and load
    }
    result1 = max(0, min(255, (result + 16) / 32));    // computation
Instruction mix before merging:
    Load/Store    30%
    Permutation   25%
    Computation   35%
    Control       10%
After merging: load/store and permutation are merged; computation and control remain.
Result: execution time is reduced by half.
Benchmarking results for CPU core CK520
Simulation results for DSP performance: enhanced DSP
CAVLC (context-adaptive variable-length coding)
    Sequence (CIF)    MIPS/frame (max)    MIPS/frame (average)
    Foreman           0.147,832           0.029,898
    Mobile            0.541,943           0.134,240
OGG (new audio standard)
    Function          MIPS/frame
    MDCT              6
    De_VQ             2.5
    Floor/Coupling    3.5
Simulation results for DSP performance: vector processor
H.264 baseline decoder, 298-frame sequences, MIPS at 30 frames/s
    Format    Sequence       Max      Average
    QCIF      Foreman        28.1     12.7
    QCIF      Akiyo          19.8     5.3
    CIF       (unlabeled)    116.3    52.3
    CIF       (unlabeled)    92.9     22.8
Project status
- Finished two versions of the CPU core
- Released the DSP instruction set
- Writing and verifying RTL of the enhanced DSP
- Benchmarking the vector processor
- Developing software tools
Future work
- Scheduling for task-level parallelism (TLP) between heterogeneous processors
- Simulation and debugging tools for heterogeneous processors
- Methodologies for design space exploration
Thank you!