High-performance tracing of many-core systems with LTTng
Simon Marchi
Laboratoire DORSAL
Département de génie informatique
Noyau d'un système d'exploitation

Outline
- Intro to tracing
- Problem description
- Studied platforms
- Characteristics of many-core processors
- Work done and planned work

Intro to tracing (1/2)
- Very high-performance logging
- How?
  - Very compact output format
  - Lockless synchronization
  - Low-level optimization (architecture-dependent)
- Small footprint
- Won't block the application
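
The techniques listed above (compact binary output, lockless synchronization, never blocking the application) can be illustrated with a minimal C sketch. This is not LTTng's actual implementation: the names `trace_buffer`, `tb_reserve`, and `trace_event`, the 8-byte header layout, and the 4096-byte buffer size are all assumptions chosen for illustration, and the buffer here neither wraps around nor has a consumer.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BUF_SIZE 4096u  /* hypothetical buffer size, in bytes */

/* Compact binary event header: 8 bytes, no text formatting. */
struct event_header {
    uint32_t timestamp_delta;  /* time since the previous event */
    uint16_t event_id;         /* numeric ID instead of a name string */
    uint16_t payload_len;      /* payload bytes following the header */
};

struct trace_buffer {
    _Atomic size_t write_pos;      /* bytes reserved so far */
    unsigned char data[BUF_SIZE];
};

/*
 * Reserve len bytes without taking a lock, using a compare-and-swap loop.
 * Returns the offset of the reserved slot, or (size_t)-1 if the buffer
 * is full; the caller then drops the event instead of blocking.
 */
static size_t tb_reserve(struct trace_buffer *tb, size_t len)
{
    size_t old = atomic_load_explicit(&tb->write_pos, memory_order_relaxed);

    for (;;) {
        if (old + len > BUF_SIZE)
            return (size_t)-1;
        if (atomic_compare_exchange_weak_explicit(&tb->write_pos, &old,
                                                  old + len,
                                                  memory_order_acq_rel,
                                                  memory_order_relaxed))
            return old;
        /* CAS failed: old now holds the current position, so retry. */
    }
}

/* Record one event. Returns 0 on success, -1 if the event was dropped. */
static int trace_event(struct trace_buffer *tb, uint16_t id, uint32_t delta,
                       const void *payload, uint16_t len)
{
    struct event_header hdr = {
        .timestamp_delta = delta,
        .event_id = id,
        .payload_len = len,
    };
    size_t off = tb_reserve(tb, sizeof hdr + len);

    if (off == (size_t)-1)
        return -1;
    memcpy(tb->data + off, &hdr, sizeof hdr);
    if (len > 0)
        memcpy(tb->data + off + sizeof hdr, payload, len);
    return 0;
}
```

Each writer owns the slot it reserved, so concurrent writers never wait on each other, and a full buffer costs a dropped event rather than a stalled application.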

Intro to tracing (2/2)
- Used by
  - Kernel and application developers
  - Sysadmins
  - Security analysts
  - Education

LTTng!
- Open source project started at Polytechnique
- Linux kernel and userspace application tracer
- Very active development, with many industrial partners
- http://www.lttng.org

Problem description
- Latest generation of many-core processors
  - Tilera, Intel Xeon Phi, Freescale, Adapteva
- Expected to become more popular
  - Energy-efficient
  - Best way to use the ever-increasing number of transistors on chips
- Developers need good tools

LTTng helps developers with performance problems or bugs related to parallel programming. A tracer will no doubt be a good friend on a 50-core machine.

Problem description
- Port and optimize LTTng for many-core architectures
- Expected challenges
  - Limited storage
  - High volume of data generated
  - Highly parallel architectures, performance scaling

We expect to do more at the same time with these processors, so there will necessarily be more to trace.

Studied platforms
- Tilera TILE-Gx8036
  - 36 cores (versions with up to 100 cores to come)
  - Target market: cloud computing, packet processing, data mining, multimedia, etc.
  - Already available at the lab
- Intel Xeon Phi
  - 60 x86-compatible cores
  - Target market: coprocessor in servers, high-performance computing
  - Launched November 2012, general availability January 2013... on its way!

Tilera TILE-Gx architecture
Source: TILE-Gx8036 product brief, http://www.tilera.com/sites/default/files/productbriefs/TILE-Gx8036_PB033-02_0.pdf

Intel Xeon Phi architecture
Source: Intel, http://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-codename-knights-corner

Common characteristics
- Interconnection network between cores
  - Shared memory becomes a bottleneck
  - TILE-Gx: mesh-like network
  - Xeon Phi: ring interconnect
  - Very fast
- Distributed cache architecture
  - Each core has its own L1/L2 cache
  - On an L2 cache miss, the core looks the line up in the other cores' L2 caches (virtual L3)
    - Uses the interconnection network
  - Direct L2-to-I/O transfer to avoid main memory
  - Tilera: 64 KB L1 + 256 KB L2
  - Xeon Phi: ?

Very fast network, delay for interconnection: ~1-5 cycles per hop
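
To give a feel for what "~1-5 cycles per hop" means on the two topologies, here is a small C sketch that counts hops. It rests on assumptions not stated in the slides: the mesh is treated as a square grid with dimension-ordered (XY) routing and row-major tile numbering, the ring is treated as bidirectional with one stop per core, and the function names are hypothetical.

```c
#include <stdlib.h>

/* Hops between two tiles on a 2D mesh of the given width,
 * assuming XY routing and row-major tile numbering. */
static int mesh_hops(int width, int src, int dst)
{
    int dx = abs(src % width - dst % width);
    int dy = abs(src / width - dst / width);
    return dx + dy;
}

/* Hops between two stops on a bidirectional ring of n stops:
 * traffic takes the shorter of the two directions. */
static int ring_hops(int n, int src, int dst)
{
    int d = abs(src - dst) % n;
    return d < n - d ? d : n - d;
}

/* Interconnect latency in cycles for a given per-hop cost. */
static int hop_latency(int hops, int cycles_per_hop)
{
    return hops * cycles_per_hop;
}
```

Under these assumptions, the farthest pair of tiles on a 6 x 6 mesh is `mesh_hops(6, 0, 35)` = 10 hops apart, which is 10 to 50 cycles at 1 to 5 cycles per hop; on a 60-stop ring the worst case is `ring_hops(60, 0, 30)` = 30 hops.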

Common characteristics
- In-memory filesystem
  - 8 GB of memory
  - No permanent storage
  - The trace has to be stored somewhere else
- High-bandwidth I/O
  - PCI Express link to host
  - Tilera: 4 x 10GbE network controller
- Runs a full Linux OS
  - Most standard tools (e.g. gdb, oprofile) are already compatible
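
The reason the trace has to leave the device can be made concrete with some back-of-the-envelope arithmetic, sketched here in C. The function names are hypothetical, and the event rate and event size used in the note below are illustrative assumptions, not measurements from these platforms.

```c
#include <stdint.h>

/* Aggregate trace output in bytes per second across all cores. */
static uint64_t trace_rate(uint64_t cores,
                           uint64_t events_per_sec_per_core,
                           uint64_t bytes_per_event)
{
    return cores * events_per_sec_per_core * bytes_per_event;
}

/* Seconds until capacity_bytes is full if nothing drains the trace. */
static double seconds_until_full(uint64_t capacity_bytes,
                                 uint64_t rate_bytes_per_sec)
{
    return (double)capacity_bytes / (double)rate_bytes_per_sec;
}
```

For example, with purely illustrative figures of 36 cores, one million events per second per core, and 32 bytes per event, the trace would grow by about 1.15 GB/s. Even if all 8 GB (taken as 8 x 2^30 bytes) were available for the trace, it would be full in roughly 7.5 seconds, so the data has to be streamed out over the high-bandwidth I/O links.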

Tilera TILE-Gx characteristics
- Mesh network
  - Developers can use it as a “software” ASIC
- Many hardware accelerators for
  - Packet processing/routing
  - Cryptography (SSL, DSA, RSA, IPSec, etc.)
  - Compression (gzip)
- Runs a hypervisor
  - Possibility to dedicate cores to different simultaneously running OSes
  - Possibility to run Zero Overhead Linux and bare-metal applications

Software ASIC: lots of small processors connected together, short wires

Work done
- Basic port of LTTng (UST and kernel) to the Tilera
  - Only one small fix was necessary on the LTTng side
  - A few issues reported to Tilera were fixed on their side

Planned work
- Direct port of LTTng to the platforms
  - Just get it to work
- Develop a benchmark suite
  - Various real-life, heavily parallel applications
- Find bottlenecks, optimize
  - Make use of the special communication hardware
  - Adapt to the architectural features
- Integrate the work
  - Find ways to abstract for other many-core platforms

Conclusion
- Problem: many-core = a lot of data to trace
- Requires different approaches than classic processors
  - Different bottlenecks / constraints
  - New hardware features / accelerators

Questions?