Memory and network stack tuning in Linux:

Slides:

Advertisements

Similar presentations

Unix Systems Performance Tuning Project of COSC 513 Name: Qinghui Mu Instructor: Prof. Anvari.

Advertisements

RHEL6 tuning guide for mellanox ethernet card.

Device Layer and Device Drivers

Exadata Distinctives Brown Bag New features for tuning Oracle database applications.

© 2010 VMware Inc. All rights reserved Confidential Performance Tuning for Windows Guest OS IT Pro Camp Presented by: Matthew Mitchell.

Performance Analysis of Virtualization for High Performance Computing A Practical Evaluation of Hypervisor Overheads Matthew Cawood University of Cape.

Multi-Layer Switching Layers 1, 2, and 3. Cisco Hierarchical Model Access Layer –Workgroup –Access layer aggregation and L3/L4 services Distribution Layer.

Institute of Computer Science Foundation for Research and Technology – Hellas Greece Computer Architecture and VLSI Systems Laboratory Exploiting Spatial.

LCG TCP performance optimization for 10 Gb/s LHCOPN connections 1 on behalf of M. Bencivenni, T.Ferrari, D. De Girolamo, Stefano.

VSphere vs. Hyper-V Metron Performance Showdown. Objectives Architecture Available metrics Challenges in virtual environments Test environment and methods.

1 Design and Implementation of A Content-aware Switch using A Network Processor Li Zhao, Yan Luo, Laxmi Bhuyan University of California, Riverside Ravi.

RDMA ENABLED WEB SERVER Rajat Sharma. Objective  To implement a Web Server serving HTTP client requests through RDMA replacing the traditional TCP/IP.

What’s needed to transmit? A look at the minimum steps required for programming our anchor nic’s to send packets.

FreeBSD Network Stack Performance Srinivas Krishnan University of North Carolina at Chapel Hill.

Implementing Efficient RSS Capable Hardware and Drivers for Windows 7

Capacity Planning in SharePoint Capacity Planning Process of evaluating a technology … Deciding … Hardware … Variety of Ways Different Services.

Availability Configuration PerformanceCapacity.

1 Some Context for This Session…  Performance historically a concern for virtualized applications  By 2009, VMware (through vSphere) and hardware vendors.

Performance Tradeoffs for Static Allocation of Zero-Copy Buffers Pål Halvorsen, Espen Jorde, Karl-André Skevik, Vera Goebel, and Thomas Plagemann Institute.

Windows RDMA File Storage

Configuration of Linux Terminal Server Group: LNS10A6 Thebe Laxmi, Sharma Prabhakar, Patrick Appiah.

Dual Stack Virtualization: Consolidating HPC and commodity workloads in the cloud Brian Kocoloski, Jiannan Ouyang, Jack Lange University of Pittsburgh.

CSC 360- Instructor: K. Wu CPU Scheduling. CSC 360- Instructor: K. Wu Agenda 1.What is CPU scheduling? 2.CPU burst distribution 3.CPU scheduler and dispatcher.

© 2008 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice Achieving 10 Gb/s Using Xen Para-virtualized.

Jamel Callands Austin Chaet Carson Gallimore.  Downloading  Recommended Specifications  Features  Reporting and Monitoring  Questions.

Profiling Grid Data Transfer Protocols and Servers George Kola, Tevfik Kosar and Miron Livny University of Wisconsin-Madison USA.

Boosting Event Building Performance Using Infiniband FDR for CMS Upgrade Andrew Forrest – CERN (PH/CMD) Technology and Instrumentation in Particle Physics.

1 © 2003, Cisco Systems, Inc. All rights reserved. CCNA 2 Module 9 Basic Router Troubleshooting.

Processor or Socket NUMA Node Core LP Processor or Socket NUMA Node Core LP Processor or Socket NUMA Node Core LP Processor or Socket NUMA Node Core.

Guide to Linux Installation and Administration, 2e1 Chapter 2 Planning Your System.

Politecnico di Torino Dipartimento di Automatica ed Informatica TORSEC Group Performance of Xen’s Secured Virtual Networks Emanuele Cesena Paolo Carlo.

Recent advances in the Linux kernel resource management Kir Kolyshkin, OpenVZ

System Troubleshooting TCS Network, System, and Load Monitoring TCS for Developers.

10GE network tests with UDP

A Measurement Based Memory Performance Evaluation of High Throughput Servers Garba Isa Yau Department of Computer Engineering King Fahd University of Petroleum.

Srihari Makineni & Ravi Iyer Communications Technology Lab

S4-Chapter 3 WAN Design Requirements. WAN Technologies Leased Line –PPP networks –Hub and Spoke Topologies –Backup for other links ISDN –Cost-effective.

Cisco 3 - Switching Perrine. J Page 16/4/2016 Chapter 4 Switches The performance of shared-medium Ethernet is affected by several factors: data frame broadcast.

1 Network Performance Optimisation and Load Balancing Wulf Thannhaeuser.

Boot Sequence, Internal Component & Cisco 3 Layer Model 1.

TCP Offload Through Connection Handoff Hyong-youb Kim and Scott Rixner Rice University April 20, 2006.

Hyper-V Performance, Scale & Architecture Changes Benjamin Armstrong Senior Program Manager Lead Microsoft Corporation VIR413.

Intel Research & Development ETA: Experience with an IA processor as a Packet Processing Engine HP Labs Computer Systems Colloquium August 2003 Greg Regnier.

Commodity Flash-Based Systems at 40GbE - FIONA Philip Papadopoulos* Tom Defanti Larry Smarr John Graham Qualcomm Institute, UCSD *Also San Diego Supercomputer.

Mellanox Connectivity Solutions for Scalable HPC Highest Performing, Most Efficient End-to-End Connectivity for Servers and Storage September 2010 Brandon.

Exploiting Task-level Concurrency in a Programmable Network Interface June 11, 2003 Hyong-youb Kim, Vijay S. Pai, and Scott Rixner Rice Computer Architecture.

Cluster Computers. Introduction Cluster computing –Standard PCs or workstations connected by a fast network –Good price/performance ratio –Exploit existing.

CDA-5155 Computer Architecture Principles Fall 2000 Multiprocessor Architectures.

DAQ & ConfDB Configuration DB workshop CERN September 21 st, 2005 Artur Barczyk & Niko Neufeld.

REMINDER Check in on the COLLABORATE mobile app Best Practices for Oracle on VMware - Deep Dive Darryl Smith Chief Database Architect Distinguished Engineer.

R&D on data transmission FPGA → PC using UDP over 10-Gigabit Ethernet Domenico Galli Università di Bologna and INFN, Sezione di Bologna XII SuperB Project.

16 th IEEE NPSS Real Time Conference 2009 IHEP, Beijing, China, 12 th May, 2009 High Rate Packets Transmission on 10 Gbit/s Ethernet LAN Using Commodity.

1© Copyright 2015 EMC Corporation. All rights reserved. NUMA(YEY) BY JACOB KUGLER.

© 2013 MontaVista Software, LLC. MontaVista Confidential and Proprietary. CGE7 Core Isolation Nawneet Anand.

LISA Linux Switching Appliance Radu Rendec Ioan Nicu Octavian Purdila Universitatea Politehnica Bucuresti 5 th RoEduNet International Conference.

Shaopeng, Ho Architect of Chinac Group

NFV Compute Acceleration APIs and Evaluation

Kernel/Hardware for bifrost

Software Defined Storage

Heterogeneous Computation Team HybriLIT

Data Transfer Node Performance GÉANT & AENEAS

Multi-PCIe socket network device

Introduction of Week 3 Assignment Discussion

Boost Linux Performance with Enhancements from Oracle

Windows Server 2016 Software Defined Storage

Xen Network I/O Performance Analysis and Opportunities for Improvement

DPDK Accelerated Load Balancer

Process scheduling Chapter 5.

ECE 671 – Lecture 8 Network Adapters.

Cluster Computers.

Presentation transcript:

Memory and network stack tuning in Linux: the story of highly loaded servers migration to fresh Linux distribution Dmitry Samsonov Всем привет! Привет, админы! надеюсь все передохнули, обсудили доклады и готовы к новой информации

Dmitry Samsonov Expertise: Zabbix CFEngine Linux tuning Lead System Administrator at Odnoklassniki Expertise: Zabbix CFEngine Linux tuning dmitry.samsonov@odnoklassniki.ru https://www.linkedin.com/in/dmitrysamsonov

OpenSuSE 10.2 Release: 07.12.2006 End of life: 30.11.2008 CentOS 7 В конце прошлого года мы начали переводить наши сервера со старого Opensuse на новый Centos, потому что обновлять пакеты своими силами становилось всё труднее, разработчики требовали новые версии приложений и библиотек, а безопасники пугали уязвимостями.

Video distribution 4 x 10Gbit/s to users 2 x 10Gbit/s to storage 256GB RAM — in-memory cache 22 х 480GB SSD — SSD cache 2 х E5-2690 v2 Благодаря качественному подготовительному процессу миграция пошла гладко и центос пополз по проду, пришла очередь серверов видео раздачи, самых мощных и нагруженным в проде таких серверов у нас больше 40 после раскатки мы столкнулись с чередой проблем, так же заодно мы решили разобраться и с некоторыми старыми проблемами

TOC Memory OOM killer Swap Network Broken pipe Network load distribution between CPU cores SoftIRQ о том как мы их решали и заново открывали для себя линукс и будет мой доклад Всего будет 5 частей - 2 про память и 3 про сеть. Итак, начнём.

Memory OOM killer

1. All the physical memory 3. Each zone NODE 0 (CPU N0) NODE 1 (CPU N1) 20*PAGE_SIZE 21*PAGE_SIZE 22*PAGE_SIZE 23*PAGE_SIZE 24*PAGE_SIZE 25*PAGE_SIZE 26*PAGE_SIZE 27*PAGE_SIZE ... 28*PAGE_SIZE ... 29*PAGE_SIZE ... 210*PAGE_SIZE ... 2. NODE 0 (only) ZONE_NORMAL (4+GB) Ноды Зоны Чанки память доступна приложениям не как 1 кусок, а как много кусков разного размера сначала они большие, по мере выделения приложениям они режутся на более мелкие ZONE_DMA32 (0-4GB) ZONE_DMA (0-16MB)

OOM killer, system CPU spikes! What is going on? OOM killer, system CPU spikes! 18:00 до 12:00 час пик, ночь

Memory fragmentation Memory after server has booted up After some time After some more time Фрагмантация есть всегда, но не всегда она мешает.

Why this is happening? Lack of free memory Memory pressure

What to do with fragmentation? Increase vm.min_free_kbytes! High/low/min watermark. /proc/zoneinfo Node 0, zone Normal pages free 2020543 min 1297238 low 1621547 high 1945857 увеличивая шанс, что в ней будут свободными страницы нужного размера уровень (watermark) при котором начинается reclaim и в случае неудачи compaction (дефрагментация памяти) увеличивать пока проблемы не прекратятся, у нас 1Г, а по умолчанию высчитывает ядро по объёму памяти

Current fragmentation status /proc/buddyinfo Node 0, zone DMA 0 0 1 0 ... Node 0, zone DMA32 1147 980 813 450 ... Node 0, zone Normal 55014 15311 1173 120 ... Node 1, zone Normal 70581 15309 2604 200 ... ... 2 1 1 0 1 1 3 ... 386 115 32 14 2 3 5 ... 5 0 0 0 0 0 0 ... 32 0 0 0 0 0 0

Why is it bad to increase min_free_kbytes? Part of the memory min_free_kbytes-sized will not be available. ни приложение, ни ядром даже под page cache Был случай, когда спасло только min_free_kbytes =10Г

Memory Swap

40GB of free memory and vm.swappiness=0, but server is still swapping! What is going on? 40GB of free memory and vm.swappiness=0, but server is still swapping! Кроме фрагментации. Может просто тормозить, или опять же съедать system time, т.к. свободные страницы будут на другой ноде. NUMA

1. All the physical memory 3. Each zone NODE 0 (CPU N0) NODE 1 (CPU N1) 20*PAGE_SIZE 21*PAGE_SIZE 22*PAGE_SIZE 23*PAGE_SIZE 24*PAGE_SIZE 25*PAGE_SIZE 26*PAGE_SIZE 27*PAGE_SIZE ... 28*PAGE_SIZE ... 29*PAGE_SIZE ... 210*PAGE_SIZE ... 2. NODE 0 (only) ZONE_NORMAL (4+GB) если хв и ос Numa-aware, то приложения будут получать память на той ноде на которой работают, будет выше cache hit и выше скорость ZONE_DMA32 (0-4GB) ZONE_DMA (0-16MB)

Uneven memory usage between nodes Free Used Free Used NODE 0 (CPU N0) NODE 1 (CPU N1) у нас на серверах есть 1 большой tmpfs, который создаётс 1 тридом, поэтому в основном использовалась память одной ноды при высоком memory pressure ядро может решить, что засвопить часть страниц лучше, чем мигрировать их между нодами

Current usage by nodes numastat -m <PID> numastat -m Node 0 Node 1 Total --------------- --------------- --------------- MemFree 51707.00 23323.77 75030.77 ... у нас на серверах есть 1 большой tmpfs, который создаётся 1 тридом, поэтому в основном использовалась память одной ноды

What to do with NUMA? Prepare application Multithreading in all parts Node affinity Turn off NUMA For the whole system (kernel parameter): numa=off Per process: numactl — interleave=all <cmd> Многопоточность, что бы не было какого-то элемента приложения, который будет висеть на одной ноде и пороть работу всего приложегния.

Network Про сеть с высокой пропускной способностью!

What already had to be done Ring buffer: ethtool -g/-G Transmit queue length: ip link/ip link set <DEV> txqueuelen <PACKETS> Receive queue length: net.core.netdev_max_backlog Socket buffer: net.core.<rmem_default|rmem_max> net.core.<wmem_default|wmem_max> net.ipv4.<tcp_rmem|udp_rmem> net.ipv4.<tcp_wmem|udp_wmem> net.ipv4.udp_mem Offload: ethtool -k/-K

Network Broken pipe

What is going on? Broken pipe errors background In tcpdump - half-duplex close sequence. Проблема была ДО centos Первая сетевая проблема заключалась в фоне ошибок broken pipe. По дампу трафика было видно, что клиенты, связь с которыми завершается ошибкой Broken pipe, пред этим совершают быстрое закрытие TCP соединения - half-duplex close sequence. Почему это происходит?

OOO Out-of-order packet, i.e. packet with incorrect SEQuence number. Приводит к некорректному (для отправителя данных) закрытию соединения - half-duplex tcp close sequence, т.е. отправитель просто получит RST. OOO, т.е. не тот SEQ. это плохо, потому что retransmit

What to do with OOO? One connection packets by one route: Same CPU core Same network interface Same NIC queue Configuration: Bind threads/processes to CPU cores Bind NIC queues to CPU cores Use RFS выбор очередей в карте выбор интерфейсов в бонде (в обоих направлениях) выбор ядра для очереди выбор ядра для процесса/триды про RFS будет дальше

Before/after Broken pipes per second per server результат

Why is static binding bad? Load distribution between CPU cores might be uneven

Network load distribution between CPU cores Поиск в гугле по повышенной нагрузке на CPU0 возвращает от 50 тысяч до 150 тысяч страниц. Как вы думаете сколько по CPU1? До миллиона, но ни одного релевантного.

CPU0 utilization at 100%

Why this is happening? Single queue - turn on more: ethtool -l/-L Interrupts are not distributed: dynamic distribution - launch irqbalance/irqd/birq static distribution - configure RSS динамика распределяет неравномерно строго говоря, динамическое распределение может быть полезным, когда важна latency, т.к. демоны учитывают NUMA и соблюдают локальность данных

RSS CPU RSS eth0 1 2 3 4 5 6 7 8 9 10 11 12 Q0 Q1 Q2 Q3 Q4 Q5 Q6 Q7 1 2 3 4 5 6 7 8 9 10 11 12 eth0 Q0 Q1 Q2 Q3 Q4 Q5 Q6 Q7 Допустим мы это сделали CPU0 отпустило Пропускная способность выросла, но вскоре опять упёрлись в нагрузку по другим ядрам Network

CPU0-CPU7 utilization at 100% статическое распределение через RSS

We need more queues!

16 core utilization at 100% RSS может распределить нагрузку только на 16 очередей

scaling.txt RPS = Software RSS XPS = RPS for outgoing packets RFS? Use packet consumer core number https://www.kernel.org/doc/Documen tation/networking/scaling.txt мы это используем!

Why is RPS/RFS/XPS bad? Load distribution between CPU cores might be uneven. CPU overhead т.к. отдельные соединения могут быть быстрее если тридов приложеня меньше, чем ядер

Accelerated RFS Mellanox supports it, but after switching it on maximal throughput on 10G NICs were only 5Gbit/s.

Signature Filter (also known as ATR - Application Targeted Receive) Intel Signature Filter (also known as ATR - Application Targeted Receive) RPS+RFS counterpart Хуже распределяет нагрузку и допускает packet reordering

Network SoftIRQ мутная тема Не смотря на то что это раздача и исходящих пакетов больше, softirq net_rx прерываний значительно больше, об этом поговорим чуть позже.

How SoftIRQs are born Network eth0 Q0 Q... Что происходит при получении пакета: … eth0 Q0 Q...

How SoftIRQs are born Network CPU C0 C... HW IRQ 42 eth0 Q0 Q... RSS генерируются на первый пакет, CPU C0 C... HW IRQ 42 eth0 Q0 Q... RSS

How SoftIRQs are born Network SoftIRQ NET_RX CPU0 HW interrupt processing is finished затем отключаются, выгребается N пакетов, затем включаются CPU C0 C... HW IRQ 42 eth0 Q0 Q... RSS

How SoftIRQs are born Network SoftIRQ NET_RX CPU0 HW interrupt processing is finished NAPI poll HW отключаются, выгребается N пакетов, затем включаются CPU C0 C... HW IRQ 42 eth0 Q0 Q... RSS

What to do with high SoftIRQ?

Interrupt moderation ethtool -c/-C

Why is interrupt moderation bad? You have to balance between throughput and latency

What is going on? Too rapid growth

Health ministry is warning! CHANGES REVERT! обязательно откатывайте те изменения, который вам не дали однозначного выигрыша! в противном случае рано или поздно они ухудшат производительность системы! KEEP IT TESTS

Thank you! Dmitry Samsonov dmitry.samsonov@odnoklassniki.ru Odnoklassniki technical blog on habrahabr.ru http://habrahabr.ru/company/odnoklassniki/ More about us http://v.ok.ru/ Dmitry Samsonov dmitry.samsonov@odnoklassniki.ru На выполнение этой задачи было потрачено много времени, но мы решили и старые, и новые проблемы освоили новые технологии и углубили знания о линуксе и самое главное - получили бесценный опыт Подходите ко мне после доклада, задавайте вопросы, рассказывайте про ваши истории успеха и проблемы - мне это интересно!