Software Tuning Fundamentals
February 23, 2004
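Why Tune?
Same code >> different performance

Chart labels (in slide order, as extracted): SELF, RELEASE, OPT:4, IMSL, CXML, ATLAS, MKL50, MKL
Chart times (in slide order; a leading value is truncated in the source): 5.445 s, 5.457 s, 10.996 s, 3.328 s, 0.762 s, 0.848 s, 0.738 s

Version 1 (i-j-k loop order):

    for (i = 0; i < NUM; i++) {
        for (j = 0; j < NUM; j++) {
            for (k = 0; k < NUM; k++) {
                /* innermost loop walks down a column of b (stride NUM) */
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
            }
        }
    }

Version 2 (i-k-j loop order):

    for (i = 0; i < NUM; i++) {
        for (k = 0; k < NUM; k++) {
            for (j = 0; j < NUM; j++) {
                /* innermost loop walks along rows of c and b (unit stride) */
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
            }
        }
    }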
Objectives
- Identify the main tasks of performance tuning
- Define some important performance tuning terms
- Use Intel tools to help
Agenda
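The Tuning Cycle
1. Gather performance data (start here)
2. Analyze the data and draw conclusions
3. Determine changes that resolve the problem
4. Modify the code to implement the optimization
5. Test the results, then go around the cycle again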
When (Why) to Start
- User requirement?
- Software vendor requirement?
- Put the performance requirement into the requirements document.
- Performance should be considered at every stage of the product life cycle (requirements gathering, design, and testing).
- Exception: do "code tuning" only after a simple, readable, non-optimized version of the application exists.
Effort vs. Results
When to Stop
- Architecture is at maximum efficiency?
- Be sure you know what this is: calculate the theoretical maximum.
- Performance requirement is satisfied.
- Incrementally do Wide Mesh Optimizations until done.
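For a compute-bound kernel, one common way to put a number on the "theoretical maximum" is peak floating-point throughput, and then to compare the measured rate against it. The sketch below assumes that model; the function names are hypothetical, and the core count, clock rate, and operations per cycle are values you would need to look up for your own hardware.

```c
#include <assert.h>

/* Peak floating-point rate in GFLOP/s for a hypothetical machine.
   All three inputs are assumptions supplied by the caller. */
double peak_gflops(int cores, double clock_ghz, int flops_per_cycle)
{
    assert(cores > 0 && clock_ghz > 0.0 && flops_per_cycle > 0);
    return cores * clock_ghz * flops_per_cycle;
}

/* Achieved rate of an n x n x n matrix multiply, which performs
   2*n^3 floating-point operations (one multiply and one add per
   innermost iteration). */
double matmul_gflops(int n, double seconds)
{
    assert(n > 0 && seconds > 0.0);
    return 2.0 * n * n * n / seconds / 1e9;
}

/* Fraction of the peak that was actually achieved. */
double efficiency(double achieved_gflops, double peak)
{
    assert(peak > 0.0);
    return achieved_gflops / peak;
}
```

If the efficiency is already close to 1.0, further code tuning has little room left; if it is far below, the gap suggests how much speed-up is still available. Memory-bound code would need a bandwidth-based bound instead.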
Tuning Principles
"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil." (Donald Knuth)
Quality code is:
- Portable
- Readable
- Maintainable
- Reliable
Intelligently sacrifice quality for performance.
Gathering Performance Data
- Timer
  - Use to get wall-clock time
  - Accuracy, low overhead
- Use the Intel® VTune™ Performance Analyzer
  - Profiler: gathers information about code usage
  - Performance monitor: gathers information about system resource usage
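A minimal sketch of the timer approach, assuming a POSIX system where `clock_gettime` with `CLOCK_MONOTONIC` is available. The helper names and the `sample_work` workload are hypothetical stand-ins, not part of any Intel tool.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Wall-clock seconds read from a monotonic clock (POSIX clock_gettime). */
double wall_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Times a single call of `work` and returns the elapsed seconds. */
double time_once(void (*work)(void))
{
    double start = wall_seconds();
    work();
    return wall_seconds() - start;
}

/* Repeats the measurement and keeps the smallest time, which tends to be
   the run least disturbed by other activity on the machine. */
double time_best_of(void (*work)(void), int repeats)
{
    assert(work != NULL && repeats > 0);
    double best = time_once(work);
    for (int r = 1; r < repeats; r++) {
        double t = time_once(work);
        if (t < best)
            best = t;
    }
    return best;
}

/* Hypothetical stand-in workload: sums the first million integers.
   The volatile sink keeps the compiler from deleting the loop. */
static volatile double sink;
void sample_work(void)
{
    double sum = 0.0;
    for (int i = 0; i < 1000000; i++)
        sum += i;
    sink = sum;
}
```

Calling `time_best_of(sample_work, 5)` returns the best of five runs in seconds. A timer like this gives only a total; a profiler is still needed to see where inside the workload the time goes.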
Workload
A good workload should have these characteristics:
- Measurable
- Reproducible
- Static
- Representative
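One way to keep a synthetic workload reproducible and static is to generate its input from a fixed seed with a generator you control, so every run on every machine sees the same data. The sketch below assumes that approach; `fill_workload` is a hypothetical helper, and the generator is a simple 32-bit linear congruential generator, chosen for portability rather than statistical quality.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fills `data` with pseudo-random values in [0, 1) derived only from `seed`.
   The same seed always yields the same sequence, independent of the
   platform's rand() implementation. */
void fill_workload(double *data, size_t count, uint32_t seed)
{
    assert(data != NULL || count == 0);
    uint32_t state = seed;
    for (size_t i = 0; i < count; i++) {
        /* LCG step modulo 2^32 (unsigned arithmetic wraps by definition) */
        state = state * 1664525u + 1013904223u;
        data[i] = state / 4294967296.0;
    }
}
```

Whether generated data is also representative depends on the application; real input captured from production use is often the better choice when it is available.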
Analyzing Data and Drawing Conclusions
- Baseline current performance
- Examine hot spots
- Identify bottlenecks
- Calculate potential maximum performance
Examine Hot Spots
- The Pareto Principle, a.k.a. the 80/20 rule
- Concentrate on the vital few vs. the trivial many
- Hot spot: the part of an application or system that accounts for most of the computation
  - Generally consists of a loop
- For applications that don't have hot spots, examine:
  - Memory layout
  - Exceptions
  - Effective compiler usage
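The 80/20 rule can be applied mechanically to a flat profile: sort the entries by time and count how many are needed to cover most of the total. The sketch below assumes a simple hypothetical `profile_entry` record; it is not the output format of VTune or any other profiler.

```c
#include <assert.h>
#include <stdlib.h>

/* One row of a flat profile (hypothetical layout). */
struct profile_entry {
    const char *name;
    double seconds;
};

/* qsort comparator: larger times first. */
static int by_seconds_desc(const void *a, const void *b)
{
    double x = ((const struct profile_entry *)a)->seconds;
    double y = ((const struct profile_entry *)b)->seconds;
    return (x < y) - (x > y);
}

/* Sorts `entries` hottest-first and returns how many of them are needed
   to cover `fraction` (for example 0.8) of the total time. */
size_t count_hot_spots(struct profile_entry *entries, size_t n, double fraction)
{
    assert(fraction > 0.0 && fraction <= 1.0);

    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += entries[i].seconds;

    qsort(entries, n, sizeof entries[0], by_seconds_desc);

    double covered = 0.0;
    size_t count = 0;
    while (count < n && covered < fraction * total) {
        covered += entries[count].seconds;
        count++;
    }
    return count;
}
```

If the returned count is small relative to `n`, the application has clear hot spots to concentrate on. If it is close to `n`, the profile is flat, which is the case where the slide suggests looking at memory layout, exceptions, and compiler usage instead.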
Additional Topics
- Big O
- Utilization, efficiency, throughput, latency
- Bottlenecks: I/O, memory, CPU
- MIPS/FLOPS/CPI
- Concurrency, parallelism
- Scalability
- Loads/stores per calculation
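Levels of Optimization in Design
- Problem definition
- System architecture
- Algorithms and data structures
- Code tuning
- System software
- System hardware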
Code Tuning
Hardest to develop and maintain
- Assembly instructions
- Intrinsics
- C++ vector class library
- Multithreading
- Loop transformations
- Compiler and compiler switches
- Performance libraries
Easiest to develop, port and maintain
Code Tuning
- If parallel processing: break the algorithm up across clusters (distributed memory)
- Single-node optimization
- Break the algorithm up across processors (SMP)
Modifying Code to Implement Optimizations
- Use Intel® libraries
- Use various compiler switches
- Find out whether the compiler or hardware already performs the enhancement automatically before implementing it yourself
- Modify the source (e.g. loop transformations, SWP, SIMD, OpenMP, intrinsics, assembly)
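As an illustration of two of the source-level changes listed above, the sketch below combines a loop transformation (the i-k-j loop order) with an OpenMP work-sharing directive on a matrix multiply. The function name and the flat row-major layout are assumptions for the example. The pragma is guarded so the code also builds as plain serial C when OpenMP is not enabled.

```c
#include <assert.h>
#include <stddef.h>

/* Computes c += a * b for n x n row-major matrices stored as flat arrays.
   Rows of c are independent, so the outer loop can be split across threads. */
void matmul_ikj_omp(size_t n, const double *a, const double *b, double *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    long rows = (long)n;   /* signed loop index for older OpenMP versions */

#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (long i = 0; i < rows; i++) {
        size_t row = (size_t)i;
        for (size_t k = 0; k < n; k++) {
            double aik = a[row * n + k];
            /* unit-stride inner loop over a row of b and a row of c */
            for (size_t j = 0; j < n; j++)
                c[row * n + j] += aik * b[k * n + j];
        }
    }
}
```

With GCC or Clang the directive takes effect only when the program is compiled with `-fopenmp`. Before hand-writing code like this, it is worth checking whether a tuned library routine or a compiler switch already gives the same gain.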
Test!
- Make sure the application still runs correctly (regression testing)
- Make sure the enhancement actually increases performance
- Calculate the speed-up
- Decide if you're done optimizing
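A regression check for numerical code usually compares the optimized output against the baseline output within a tolerance, because reordering floating-point operations can change the last few bits of a result. The sketch below assumes that approach; `results_match` and its tolerance scheme are hypothetical choices, not a prescribed method.

```c
#include <assert.h>
#include <stddef.h>

static double abs_d(double x)
{
    return x < 0.0 ? -x : x;
}

/* Returns 1 if every element of `optimized` agrees with `baseline` within
   `rel_tol`, measured relative to the larger magnitude (with a floor of 1.0
   so values near zero are compared absolutely). Returns 0 otherwise,
   including when a difference is NaN. */
int results_match(const double *baseline, const double *optimized,
                  size_t count, double rel_tol)
{
    assert(rel_tol >= 0.0);
    for (size_t i = 0; i < count; i++) {
        double diff = abs_d(baseline[i] - optimized[i]);
        double scale = abs_d(baseline[i]);
        if (abs_d(optimized[i]) > scale)
            scale = abs_d(optimized[i]);
        if (scale < 1.0)
            scale = 1.0;
        /* written as !(<=) so that a NaN difference counts as a mismatch */
        if (!(diff <= rel_tol * scale))
            return 0;
    }
    return 1;
}
```

A suitable tolerance depends on the algorithm and the data; exact comparison is appropriate only when the optimization does not change the order of floating-point operations.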
Speed-Up
The two basic formulas:
Speed-Up = Baseline Time / Optimized Time
Speed-Up = Optimized Throughput / Baseline Throughput
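The two formulas differ only in which measurement goes on top: time shrinks when code gets faster, throughput grows. A minimal sketch with hypothetical function names:

```c
#include <assert.h>

/* Speed-up from elapsed times: baseline over optimized. */
double speedup_from_time(double baseline_seconds, double optimized_seconds)
{
    assert(baseline_seconds > 0.0 && optimized_seconds > 0.0);
    return baseline_seconds / optimized_seconds;
}

/* Speed-up from throughput (work per unit time): optimized over baseline. */
double speedup_from_throughput(double baseline_rate, double optimized_rate)
{
    assert(baseline_rate > 0.0 && optimized_rate > 0.0);
    return optimized_rate / baseline_rate;
}
```

For example, a run that drops from 10 s to 5 s has a speed-up of 2.0. A value below 1.0 means the change made the code slower.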
Summary
Optimization tasks:
- Gather performance data
- Analyze data and identify issues
- Generate alternatives to resolve the issue
- Implement enhancements
- Test results
Use Intel® Software Development Tools for every step in the process.