Performance Tuning Team: Summer Projects
Chia-heng Tu
June 30, 2009
Optimization Levels in General
- System architecture design & source code level
- Compile level (compiler)
- Library level
- OS level
- Architecture level: bus, processing elements (ARM, PPC, etc.), accelerators (DSPs, FPGAs, ASICs, etc.), I/O devices (UART, USB, LCD, etc.)
Figure source: http://i.zdnet.com/blogs/android-architecture-485b.jpg
List of Summer Projects
1. Performance Evaluation of CUDA Programs on Multicore Platforms (compiler, architecture, parallel computing, performance tools)
2. Establishing a Heterogeneous Multicore Environment (QEMU)
   - (System software) Integrate an existing DSP simulator
   - (System software) Build a communication facility (MSG library) on the environment
   - (Compiler) Write a DSP emulator
3. Performance Analysis Infrastructure (QEMU)
   - (System software) Port PAPI onto QEMU (ARM processor)
   - (Architecture) Add hardware performance monitoring events
   - (Performance tool) Port the tracing tool library onto QEMU
4. Embedded Development Platform (TI Davinci)
   - Port the tracing tool onto the TI Davinci platform
   - Port the MSG library onto the TI Davinci platform
   - Integrate PAPI on the TI Davinci platform
5. Study of the Impact of CPU Architecture on Program Performance (architecture, performance tools)
6. Memory opportunity
7. MOEA Project
Performance Evaluation of CUDA Programs on Multicore Platforms
Programming model vs. CPU architectures
- Application layer: real apps written in the CUDA programming model; a code translator; real apps as parallel C programs; the Cell compiler; binaries (PPE + SPE)
- OS layer: Red Hat or Fedora 9 Linux
- Platform layer: Intel Nehalem architecture vs. IBM Cell Broadband Engine Architecture
Image sources: Cell chip diagram, http://www.hec.nasa.gov/news/gallery_images/cell.chip_diagram.jpg (via http://www.hec.nasa.gov/news/features/2008/cell.074208.html); Nehalem image, http://news.cnet.com/8301-13924_3-10008472-64.html
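The code translator's core job is to turn CUDA's implicit thread grid into explicit C control flow. Below is a minimal sketch of one well-known strategy, serializing each thread block into a loop (the approach taken by, for example, MCUDA), applied to a simple vector-add kernel. The kernel and the names `vec_add`, `vec_add_block`, `grid_dim`, and `block_dim` are hypothetical, and the project's translator may work differently.

```c
/*
 * Original CUDA kernel (reference only; not compiled here):
 *
 *   __global__ void vec_add(const float *a, const float *b, float *c, int n) {
 *       int i = blockIdx.x * blockDim.x + threadIdx.x;
 *       if (i < n) c[i] = a[i] + b[i];
 *   }
 */

/* One thread block as plain C: the implicit threadIdx.x becomes loop index t.
   This is valid only because the kernel has no __syncthreads(); a barrier
   would force the translator to split the thread loop at that point. */
static void vec_add_block(const float *a, const float *b, float *c, int n,
                          unsigned block_idx, unsigned block_dim)
{
    for (unsigned t = 0; t < block_dim; t++) {
        /* blockIdx.x * blockDim.x + threadIdx.x */
        long i = (long)block_idx * (long)block_dim + (long)t;
        if (i < n)
            c[i] = a[i] + b[i];
    }
}

/* The grid as plain C: each block is an independent unit of work that a
   runtime could hand to a separate core (e.g., an SPE or a Nehalem thread);
   in this sketch the blocks simply run one after another. */
void vec_add(const float *a, const float *b, float *c, int n,
             unsigned grid_dim, unsigned block_dim)
{
    for (unsigned bid = 0; bid < grid_dim; bid++)
        vec_add_block(a, b, c, n, bid, block_dim);
}
```

For example, `vec_add(a, b, c, 10, 3, 4)` runs the equivalent of a 3-block by 4-thread launch; the bounds check leaves the two surplus threads idle, exactly as in the CUDA version.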
Establishing a Heterogeneous Multicore Environment
- Integrate an existing DSP simulator
- Build a communication facility (MSG library) on the environment
- Write a DSP emulator
Software/hardware stack:
- Application layer: real apps (crypto, multimedia, etc.); ARM binary and DSP binary
- Library layer: high-level communication interface and communication library
- OS layer: Linux
- Platform layer (virtual platform, QEMU): ARM core, memory, I2C bus, accelerator (PAC DSP)
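The MSG library's API is not specified on this slide. As a minimal sketch, assuming the ARM core and the DSP share a region of memory, a one-directional mailbox (a single-producer, single-consumer ring buffer) is a common building block for this kind of communication facility. The names `mailbox_t`, `msg_send`, `msg_recv`, `mbox_init`, and the slot and payload sizes are all hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MSG_SLOTS   8   /* hypothetical ring capacity; must divide 2^32 (power of two) */
#define MSG_PAYLOAD 32  /* hypothetical maximum payload size in bytes */

typedef struct {
    uint16_t type;              /* e.g., a command ID understood by the DSP side */
    uint16_t len;               /* payload bytes actually used */
    uint8_t  data[MSG_PAYLOAD];
} msg_t;

/* One-directional mailbox placed in shared memory: the sender (e.g., ARM)
   only advances `head`, the receiver (e.g., DSP) only advances `tail`.
   The counters run freely and wrap modulo 2^32. */
typedef struct {
    volatile uint32_t head;     /* total messages written */
    volatile uint32_t tail;     /* total messages read */
    msg_t slot[MSG_SLOTS];
} mailbox_t;

void mbox_init(mailbox_t *mb)
{
    memset(mb, 0, sizeof *mb);
}

/* Returns false if the mailbox is full or the payload is too large. */
bool msg_send(mailbox_t *mb, uint16_t type, const void *buf, uint16_t len)
{
    if (len > MSG_PAYLOAD || mb->head - mb->tail == MSG_SLOTS)
        return false;
    msg_t *m = &mb->slot[mb->head % MSG_SLOTS];
    m->type = type;
    m->len = len;
    memcpy(m->data, buf, len);
    /* On real hardware a memory barrier and/or cache write-back belongs here,
       so the payload is visible before the new head is; volatile alone does
       not guarantee that ordering across cores. */
    mb->head++;
    return true;
}

/* Returns false if the mailbox is empty. */
bool msg_recv(mailbox_t *mb, msg_t *out)
{
    if (mb->head == mb->tail)
        return false;
    *out = mb->slot[mb->tail % MSG_SLOTS];
    mb->tail++;
    return true;
}
```

A bidirectional channel would simply use two such mailboxes, one per direction, so that each counter still has exactly one writer.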
Performance Analysis Infrastructure
1. Port the tracing tool library onto QEMU (performance tools): frequency and time of function calls, and the call graph
2. Integrate the Performance Application Programming Interface (PAPI)
3. Add hardware performance events
Software/hardware stack:
- Application layer (1): real apps (crypto, multimedia, etc.) and a source code instrumentor
- Library layer (2): high-level performance analysis interface, tracing library, PAPI library
- OS layer: Linux with Perfctr (PMU driver)
- Platform layer (3) (virtual platform, QEMU): ARM core with a performance monitoring unit (PMU) providing cache miss rate, etc.; logical time stamp counter (TSC); timing facility; I2C bus; accelerator (PAC DSP ISS)
Example of instrumented source (PMU_dijkstra.c):

    #include <libperfctr.h>  /* perfctr user-space library; defines struct perfctr_sum_ctrs */

    /* Provided by the instrumentation runtime: snapshot the PMU counters. */
    void read_PMU(struct perfctr_sum_ctrs *ctrs);
    /* Entry point of the original Dijkstra benchmark (dijkstra.c); signature assumed. */
    void dijkstra(void);

    int main(int argc, char **argv)
    {
        /* The data structure recording the performance data */
        struct perfctr_sum_ctrs before, after;

        /* ... prologue: set up the environment ... */
        read_PMU(&before);
        dijkstra();
        read_PMU(&after);
        /* ... epilogue: dump the performance data (instruction counts) ... */
        return 0;
    }
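Raw counter snapshots such as `before` and `after` in PMU_dijkstra.c become useful only once they are differenced and turned into derived metrics, such as the cache miss rate mentioned for the PMU. A minimal sketch, assuming a hypothetical snapshot layout with four counters (the real `struct perfctr_sum_ctrs` and PAPI event sets are organized differently):

```c
#include <stdint.h>

/* Hypothetical counter snapshot; on the real setup these values would come
   from perfctr/PAPI reads such as read_PMU() in PMU_dijkstra.c. */
typedef struct {
    uint64_t cycles;
    uint64_t instructions;
    uint64_t dcache_accesses;
    uint64_t dcache_misses;
} pmu_snapshot;

typedef struct {
    uint64_t instructions;     /* instructions retired in the measured region */
    double cpi;                /* cycles per instruction */
    double dcache_miss_rate;   /* misses / accesses, in [0, 1] */
} pmu_report;

/* Difference two snapshots and derive metrics. Unsigned subtraction stays
   correct even if a 64-bit counter wrapped once between the two reads. */
pmu_report pmu_diff(const pmu_snapshot *before, const pmu_snapshot *after)
{
    uint64_t cyc = after->cycles - before->cycles;
    uint64_t ins = after->instructions - before->instructions;
    uint64_t acc = after->dcache_accesses - before->dcache_accesses;
    uint64_t mis = after->dcache_misses - before->dcache_misses;

    pmu_report r;
    r.instructions = ins;
    r.cpi = ins ? (double)cyc / (double)ins : 0.0;          /* avoid divide by zero */
    r.dcache_miss_rate = acc ? (double)mis / (double)acc : 0.0;
    return r;
}
```

For example, snapshots {1000, 500, 200, 10} and {3000, 1500, 600, 50} give 1000 instructions, a CPI of 2.0, and a data-cache miss rate of 0.1.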
Embedded Development Platform (TI Davinci)
- Port the MSG library onto the TI Davinci platform
- Port the tracing tool onto the TI Davinci platform
- Integrate PAPI on the TI Davinci platform
Software/hardware stack:
- Application layer: real apps (crypto, multimedia, etc.)
- Library layer: high-level communication interface and communication library; high-level performance analysis interface, tracing library (call graph), and PAPI library
- OS layer: two OS (micro-kernel) instances
- Platform layer (TI Davinci): ARM core, C64x DSP, performance monitoring unit (PMU), I2C bus
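The tracing library reports the frequency and time of function calls and a call graph, and it has to be ported across platforms (QEMU, then TI Davinci), so keeping its core bookkeeping in portable C helps. A minimal sketch, assuming hypothetical `trace_enter`/`trace_exit` hooks that the instrumentor inserts at every function entry and exit (GCC's `-finstrument-functions` provides similar hooks, `__cyg_profile_func_enter` and `__cyg_profile_func_exit`, keyed by function address rather than by name). The table sizes and all function names here are illustrative.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

#define MAX_FUNCS 64   /* hypothetical table sizes for this sketch */
#define MAX_DEPTH 64

typedef struct {
    const char *name;      /* must outlive the tracer (e.g., a string literal) */
    unsigned long calls;   /* how many times the function was entered */
    double total_ns;       /* inclusive wall-clock time spent in the function */
} func_stat;

static func_stat funcs[MAX_FUNCS];
static size_t nfuncs;
static unsigned long edges[MAX_FUNCS][MAX_FUNCS]; /* edges[caller][callee] = calls */

typedef struct { size_t id; struct timespec start; } frame;
static frame stack[MAX_DEPTH];  /* shadow call stack */
static size_t depth;            /* current nesting depth (may exceed MAX_DEPTH) */

static size_t find_func(const char *name)
{
    for (size_t i = 0; i < nfuncs; i++)
        if (strcmp(funcs[i].name, name) == 0)
            return i;
    return MAX_FUNCS;  /* not found */
}

static size_t func_id(const char *name)
{
    size_t i = find_func(name);
    if (i == MAX_FUNCS && nfuncs < MAX_FUNCS) {
        funcs[nfuncs].name = name;
        i = nfuncs++;
    }
    return i;  /* MAX_FUNCS means "table full, not tracked" */
}

static double elapsed_ns(const struct timespec *a, const struct timespec *b)
{
    return (double)(b->tv_sec - a->tv_sec) * 1e9 + (double)(b->tv_nsec - a->tv_nsec);
}

void trace_enter(const char *name)
{
    size_t id = func_id(name);
    if (id < MAX_FUNCS) {
        funcs[id].calls++;
        /* The frame currently on top of the shadow stack is the caller. */
        if (depth > 0 && depth <= MAX_DEPTH && stack[depth - 1].id < MAX_FUNCS)
            edges[stack[depth - 1].id][id]++;
    }
    if (depth < MAX_DEPTH) {
        stack[depth].id = id;
        timespec_get(&stack[depth].start, TIME_UTC);
    }
    depth++;
}

void trace_exit(void)
{
    if (depth == 0)
        return;
    depth--;
    if (depth < MAX_DEPTH && stack[depth].id < MAX_FUNCS) {
        struct timespec now;
        timespec_get(&now, TIME_UTC);
        funcs[stack[depth].id].total_ns += elapsed_ns(&stack[depth].start, &now);
    }
}

unsigned long trace_calls(const char *name)
{
    size_t i = find_func(name);
    return i < MAX_FUNCS ? funcs[i].calls : 0;
}

unsigned long trace_edge(const char *caller, const char *callee)
{
    size_t a = find_func(caller), b = find_func(callee);
    return (a < MAX_FUNCS && b < MAX_FUNCS) ? edges[a][b] : 0;
}
```

The shadow stack is what turns a flat sequence of enter/exit events into caller-to-callee edges; dumping `edges` after the run yields the call graph, and `total_ns` gives inclusive time per function.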
Study of the Impact of CPU Architecture on Program Performance
- Compare the performance of parallel programs on different multicore architectures
- Impact factors: cache size, cache hierarchy, interconnection among cores, etc.
Software/hardware stack:
- Application layer: real apps (parallel C programs)
- OS layer: Linux
- Platform layer: IBM Cell Broadband Engine Architecture, AMD Quad-Core architecture, Intel Nehalem architecture
Image source (AMD Quad-Core): http://www.digital-daily.com/cpu/quad_core_opteron/
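To see how cache size and cache hierarchy, two of the impact factors listed above, show up in measurements, a classic pointer-chasing microbenchmark can be compiled and run unchanged on each Linux platform. This is a minimal sketch of that generic technique, not the project's workload; the function names, the 4 KiB starting size, the xorshift PRNG, and the use of C11 `timespec_get` for timing are illustrative choices.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* xorshift64 PRNG: small and fast; the state must be nonzero. */
static uint64_t xorshift64(uint64_t *state)
{
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    return *state = x;
}

/* Build next[] so that following i -> next[i] visits all n slots in one
   random cycle (Sattolo's algorithm). Each load's address depends on the
   previous load, so out-of-order execution and prefetching cannot hide
   the memory latency. Caller frees the result. */
size_t *make_chain(size_t n, uint64_t seed)
{
    if (n == 0)
        return NULL;
    size_t *next = malloc(n * sizeof *next);
    if (next == NULL)
        return NULL;
    for (size_t i = 0; i < n; i++)
        next[i] = i;
    uint64_t s = seed ? seed : 1;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)(xorshift64(&s) % i);  /* j < i: Sattolo, not Fisher-Yates */
        size_t tmp = next[i];
        next[i] = next[j];
        next[j] = tmp;
    }
    return next;
}

/* Number of steps needed to return to slot 0; equals n for a single cycle. */
size_t cycle_length(const size_t *next)
{
    size_t p = next[0], len = 1;
    while (p != 0) {
        p = next[p];
        len++;
    }
    return len;
}

/* Follow the chain `steps` times; return the average time per dependent load. */
double ns_per_load(const size_t *next, size_t steps, size_t *sink)
{
    struct timespec t0, t1;
    size_t p = 0;
    timespec_get(&t0, TIME_UTC);
    for (size_t s = 0; s < steps; s++)
        p = next[p];
    timespec_get(&t1, TIME_UTC);
    *sink = p;  /* publish the result so the loop cannot be optimized away */
    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
    return steps ? ns / (double)steps : 0.0;
}

/* Print the load latency for working sets from 4 KiB up to max_kib. */
void sweep(size_t max_kib, size_t steps)
{
    for (size_t kib = 4; kib <= max_kib; kib *= 2) {
        size_t n = kib * 1024 / sizeof(size_t);
        size_t *next = make_chain(n, 42);
        if (next == NULL)
            return;
        size_t sink;
        double ns = ns_per_load(next, steps, &sink);
        printf("%8zu KiB: %7.2f ns/load (end slot %zu)\n", kib, ns, sink);
        free(next);
    }
}
```

Calling, say, `sweep(64 * 1024, 20000000)` from `main` prints one line per working-set size. The latency per load is expected to rise in steps as the working set outgrows each cache level, and where those steps fall (and how high they are) is exactly what differs between architectures with different cache sizes and hierarchies.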
Everyone is welcome to join us!
Practical, system-wide, and up-to-date research projects