OCR on Knights Landing (Xeon Phi)
31 March 2016

Acknowledgment: This material is based upon work supported by the Department of Energy Office of Science under cooperative agreements DE-SC0008717 and DE-SC0014355, and Lawrence Livermore National Laboratory subcontract B608115.

Knights Landing Overview
Three modes
  - Self-boot processor
  - Self-boot with integrated fabric
  - Co-processor (PCIe add-on card)
MCDRAM: three memory modes
  - Flat – entirely addressable
  - Cache – MCDRAM acts as a direct-mapped cache in front of DDR
  - Hybrid – part cache, part memory
Cluster modes (cache-coherent mesh interconnect)
  - All-to-all: addresses uniformly hashed
  - Quadrant: software-transparent; each address is hashed to a directory in the same quadrant as its memory
  - Sub-NUMA: exposed as 4 NUMA nodes
Source: KNL presentation at Hot Chips '15

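To make the "direct-mapped" property of cache mode concrete, here is a minimal sketch of how a direct-mapped cache assigns each address exactly one slot, so two addresses one cache-size apart evict each other. The 64-byte line size and the plain modulo mapping are assumptions for illustration only; the actual KNL address hashing is not described in these slides. The 16 GiB capacity matches the 16384 MB MCDRAM node reported later in this deck.

```python
# Illustrative model of a direct-mapped cache (not the real KNL hash).
LINE_BYTES = 64                  # assumed cache-line size
CACHE_BYTES = 16 * 2**30         # 16 GiB, the MCDRAM size shown by numactl
NUM_LINES = CACHE_BYTES // LINE_BYTES


def cache_slot(phys_addr):
    """Return the single slot a physical address can occupy."""
    return (phys_addr // LINE_BYTES) % NUM_LINES


def conflicts(addr_a, addr_b):
    """True if the two addresses are on different lines that share one slot."""
    different_lines = addr_a // LINE_BYTES != addr_b // LINE_BYTES
    return different_lines and cache_slot(addr_a) == cache_slot(addr_b)


if __name__ == "__main__":
    # Two addresses exactly one cache size apart compete for the same slot.
    print(conflicts(0, CACHE_BYTES))
    # Neighbouring lines land in different slots.
    print(conflicts(0, LINE_BYTES))
```

Under this simplified model, a working set larger than MCDRAM can thrash on conflicting lines, which is one reason flat mode with explicit placement is worth comparing against cache mode.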
OCR on KNL
  - 1 policy domain with up to 288 workers
  - MCDRAM in flat mode, with two allocators

    $ numactl -H
    available: 2 nodes (0-1)
    node 0 cpus: 0 255
    node 0 size: 98200 MB
    node 0 free: 90312 MB
    node 1 cpus:
    node 1 size: 16384 MB
    node 1 free: 15519 MB
    node distances:
    node   0   1
      0:  10  31
      1:  31  10

  - Memory hints to choose the allocator on MCDRAM (OCR_HINT_DB_HIGHBW)

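In flat mode the MCDRAM shows up as a NUMA node with memory but no CPUs (node 1 in the listing above). As a rough sketch, a script could locate that node by parsing `numactl -H` output. The function names `parse_nodes` and `cpuless_nodes` are hypothetical helpers, and the sample text is copied from the slide as printed; this is not part of OCR.

```python
import re

# numactl -H output as shown on the slide (distance table omitted).
SAMPLE = """\
available: 2 nodes (0-1)
node 0 cpus: 0 255
node 0 size: 98200 MB
node 0 free: 90312 MB
node 1 cpus:
node 1 size: 16384 MB
node 1 free: 15519 MB
"""


def parse_nodes(text):
    """Collect the cpus/size/free fields of each NUMA node into a dict."""
    nodes = {}
    for line in text.splitlines():
        m = re.match(r"node (\d+) (cpus|size|free):\s*(.*)", line)
        if not m:
            continue  # skip the header and the distance table
        node_id, field, value = int(m.group(1)), m.group(2), m.group(3)
        entry = nodes.setdefault(node_id, {})
        if field == "cpus":
            entry["cpus"] = [int(c) for c in value.split()]
        else:
            entry[field + "_mb"] = int(value.split()[0])
    return nodes


def cpuless_nodes(nodes):
    """Nodes with memory but no CPUs; in flat mode this is the MCDRAM."""
    return sorted(n for n, info in nodes.items() if not info.get("cpus"))


if __name__ == "__main__":
    nodes = parse_nodes(SAMPLE)
    for n in cpuless_nodes(nodes):
        print("candidate MCDRAM node:", n, "size (MB):", nodes[n]["size_mb"])
```

Outside of OCR, a node found this way could be passed to `numactl --membind=<node>` to place a whole process in MCDRAM; within OCR, the slide indicates that placement is instead requested per data block through the OCR_HINT_DB_HIGHBW hint.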
Results – Stencil 2D weak scaling
[Figure panels: Xeon; KNL]
Preliminary results! Software under optimization

Results – MCDRAM vs DDR
[Figure: Stencil 2D with 256 threads]
Preliminary results! Software under optimization

Results – Stream
  - Runtime bottlenecks? Profiling underway
  - Limited vectorization opportunities?
Preliminary results! Software under optimization

Next Steps
  - Root-cause and fix MCDRAM performance
  - Study all-to-all vs. sub-NUMA modes
  - Single vs. multiple policy domains
  - Performance counters and introspection