Learn the Linux Kernel with Ftrace Steven Rostedt 4096R/5A56DE73 5ED9 A48F C54C 0A22 D1D0 804C EBC2 6CDB 5A56 DE73.


What is Ftrace? ● Two meanings – Linux function tracer infrastructure – Generic kernel tracing infrastructure ● Function tracing ● Tracepoints ● Latency tracing ● Kernel debugging

Building with Ftrace ● Kernel Hacking --> Tracers ● Kernel Function Tracer – CONFIG_FUNCTION_TRACER ● Kernel Function Graph Tracer – CONFIG_FUNCTION_GRAPH_TRACER ● git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git – cd linux

Overview of ftrace ● Mounting ● CLI to administer ftrace ● events ● function tracing ● trace-cmd

The Debugfs ● Officially mounted at – /sys/kernel/debug ● I prefer – mkdir /debug – mount -t debugfs nodev /debug – This presentation will use /debug ● Do what you want
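The C examples later in this deck call a `find_debugfs()` helper without showing it. Since the mount point is a matter of preference, one way to locate it is to scan `/proc/mounts` for a `debugfs` entry. The following is only a sketch in Python, not the helper from the talk; the function name and its text-in, path-out interface are assumptions made for illustration.

```python
def find_debugfs(mounts_text):
    """Return the first debugfs mount point in /proc/mounts-style text, or None."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts columns: device, mount point, filesystem type, options, ...
        if len(fields) >= 3 and fields[2] == "debugfs":
            return fields[1]
    return None
```

On a live system the input would be the contents of `/proc/mounts`; the result is `/sys/kernel/debug`, `/debug`, or wherever debugfs happens to be mounted.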

The Tracing Directory # ls /debug/tracing available_events dyn_ftrace_total_info kprobe_profile printk_formats set_ftrace_notrace sysprof_sample_period trace_pipe tracing_on available_filter_functions events ksym_profile README set_ftrace_pid trace trace_stat tracing_thresh available_tracers failures ksym_trace_filter saved_cmdlines set_graph_function trace_clock tracing_cpumask buffer_size_kb function_profile_enabled options set_event stack_max_size trace_marker tracing_enabled current_tracer kprobe_events per_cpu set_ftrace_filter stack_trace trace_options tracing_max_latency

Tracers ● Found in available_tracers – function – function_graph – wakeup and wakeup_rt – irqsoff, preemptoff, preemptirqsoff – mmiotrace – nop

The Function Tracer tracing]# echo function > current_tracer tracing]# cat trace | head -15 # tracer: function # # TASK-PID CPU# TIMESTAMP FUNCTION # | | | | | simpress.bin-2792 [000] : unix_poll <-sock_poll simpress.bin-2792 [000] : sock_poll_wait <-unix_poll simpress.bin-2792 [000] : fput <-do_sys_poll simpress.bin-2792 [000] : fget_light <-do_sys_poll simpress.bin-2792 [000] : sock_poll <-do_sys_poll simpress.bin-2792 [000] : unix_poll <-sock_poll simpress.bin-2792 [000] : sock_poll_wait <-unix_poll simpress.bin-2792 [000] : fput <-do_sys_poll simpress.bin-2792 [000] : fget_light <-do_sys_poll simpress.bin-2792 [000] : sock_poll <-do_sys_poll simpress.bin-2792 [000] : unix_poll <-sock_poll

How the Function Tracer Works ● mcount c1283ccb : c1283ccb: 55 push %ebp c1283ccc: 89 e5 mov %esp,%ebp c1283cce: 57 push %edi c1283ccf: 56 push %esi c1283cd0: 53 push %ebx c1283cd1: 83 ec 78 sub $0x78,%esp c1283cd4: e8 5b f9 d7 ff call c c1283cd9: b c1 mov $0xc ,%eax c1283cde: c mov %eax,-0x74(%ebp) c1283ce1: mov %eax,-0x78(%ebp) c1283ce4: 64 8b 1d 74 f9 46 c1 mov %fs:0xc146f974,%ebx c1283ceb: 8b 55 8c mov -0x74(%ebp),%edx c1283cee: d 6c ec 41 c1 add -0x3ebe1394(,%ebx,4),%edx c1283cf5: 89 d8 mov %ebx,%eax

How the Function Tracer Works ● mcount c1283ccb : c1283ccb: 55 push %ebp c1283ccc: 89 e5 mov %esp,%ebp c1283cce: 57 push %edi c1283ccf: 56 push %esi c1283cd0: 53 push %ebx c1283cd1: 83 ec 78 sub $0x78,%esp c1283cd4: 0f 1f nop c1283cd9: b c1 mov $0xc ,%eax c1283cde: c mov %eax,-0x74(%ebp) c1283ce1: mov %eax,-0x78(%ebp) c1283ce4: 64 8b 1d 74 f9 46 c1 mov %fs:0xc146f974,%ebx c1283ceb: 8b 55 8c mov -0x74(%ebp),%edx c1283cee: d 6c ec 41 c1 add -0x3ebe1394(,%ebx,4),%edx c1283cf5: 89 d8 mov %ebx,%eax

How the Function Tracer Works ● mcount c1283ccb : c1283ccb: 55 push %ebp c1283ccc: 89 e5 mov %esp,%ebp c1283cce: 57 push %edi c1283ccf: 56 push %esi c1283cd0: 53 push %ebx c1283cd1: 83 ec 78 sub $0x78,%esp c1283cd4: e8 61 f9 d7 ff call c c1283cd9: b c1 mov $0xc ,%eax c1283cde: c mov %eax,-0x74(%ebp) c1283ce1: mov %eax,-0x78(%ebp) c1283ce4: 64 8b 1d 74 f9 46 c1 mov %fs:0xc146f974,%ebx c1283ceb: 8b 55 8c mov -0x74(%ebp),%edx c1283cee: d 6c ec 41 c1 add -0x3ebe1394(,%ebx,4),%edx c1283cf5: 89 d8 mov %ebx,%eax
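The three disassembly slides above differ only in the five bytes at c1283cd4: a call, then a nop, then a call with a different displacement. The call destinations are cut off in this transcript, but they can be recomputed from the instruction bytes that survived, because `e8` is the x86 near call with a 32-bit little-endian displacement relative to the next instruction. A minimal sketch follows; the function name `call_target` is made up for illustration, and which symbols live at the resulting addresses is not shown in the slides (presumably mcount and ftrace_caller, going by the slide bullets).

```python
def call_target(insn_addr, insn_bytes):
    """Decode the destination of a 5-byte x86 `call rel32` (opcode 0xe8)."""
    if len(insn_bytes) != 5 or insn_bytes[0] != 0xE8:
        raise ValueError("not a call rel32 instruction")
    rel = int.from_bytes(insn_bytes[1:], "little", signed=True)
    # The displacement is relative to the address of the *next* instruction.
    return (insn_addr + 5 + rel) & 0xFFFFFFFF


# Bytes taken from the first and third disassembly slides.
first = call_target(0xC1283CD4, bytes.fromhex("e85bf9d7ff"))
patched = call_target(0xC1283CD4, bytes.fromhex("e861f9d7ff"))
print(hex(first), hex(patched))  # prints 0xc1003634 0xc100363a
```

The two destinations are six bytes apart, which fits the picture of the call site being repointed from one entry stub to a neighbouring one when tracing is switched on.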

How the Function Tracer Works ● ftrace_caller
    ENTRY(mcount)
            ret
    END(mcount)
    ENTRY(ftrace_caller)
            cmpl $0, function_trace_stop
            jne ftrace_stub
            pushl %eax
            pushl %ecx
            pushl %edx
            movl 0xc(%esp), %eax
            movl 0x4(%ebp), %edx
            subl $MCOUNT_INSN_SIZE, %eax
    .globl ftrace_call
    ftrace_call:
            call ftrace_stub
            popl %edx
            popl %ecx
            popl %eax
    .globl ftrace_stub
    ftrace_stub:
            ret
    END(ftrace_caller)

set_ftrace_filter tracing]# echo schedule > set_ftrace_filter tracing]# cat set_ftrace_filter schedule tracing]# echo function > current_tracer tracing]# cat trace | head -15 # tracer: function # # TASK-PID CPU# TIMESTAMP FUNCTION # | | | | | Xorg-1849 [001] : schedule <-schedule_hrtimeout_range -0 [001] : schedule <-cpu_idle Xorg-1849 [001] : schedule <-__cond_resched kondemand/ [001] : schedule <-worker_thread Xorg-1849 [001] : schedule <-sysret_careful Xorg-1849 [001] : schedule <-schedule_hrtimeout_range gnome-terminal-2112 [001] : schedule <-schedule_hrtimeout_range Xorg-1849 [001] : schedule <-schedule_hrtimeout_range Xorg-1849 [001] : schedule <-schedule_hrtimeout_range gnome-terminal-2112 [001] : schedule <-schedule_hrtimeout_range Xorg-1849 [001] : schedule <-sysret_careful

set_ftrace_filter (Continued) tracing]# echo schedule_tail >> set_ftrace_filter tracing]# cat set_ftrace_filter schedule_tail schedule tracing]# echo 'sched*' > set_ftrace_filter tracing]# cat set_ftrace_filter | head -10 sched_avg_update sched_group_shares sched_group_rt_runtime sched_group_rt_period sched_slice sched_rt_can_attach sched_feat_open sched_debug_open sched_feat_show sched_feat_write

Acceptable Globs ● match* – Selects all functions starting with “match” ● *match – Selects all functions ending with “match” ● *match* – Selects all functions with “match” in its name
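The three glob forms listed above reduce to prefix, suffix and substring tests. The sketch below models only those three forms plus an exact match; it is an illustration of the selection rule, not the kernel's matching code, and `ftrace_glob_match` is a hypothetical name.

```python
def ftrace_glob_match(pattern, name):
    """Model the 'match*', '*match' and '*match*' forms from the slide."""
    starts_wild = pattern.startswith("*")
    ends_wild = len(pattern) > 1 and pattern.endswith("*")
    core = pattern[1:] if starts_wild else pattern
    core = core[:-1] if ends_wild else core
    if starts_wild and ends_wild:
        return core in name          # *match*: substring
    if starts_wild:
        return name.endswith(core)   # *match: suffix
    if ends_wild:
        return name.startswith(core) # match*: prefix
    return name == core              # no wildcard: exact name
```

For example, `sched*` selects `schedule_tail` and `sched_slice`, while plain `schedule` selects only `schedule`, which agrees with the set_ftrace_filter listings on the earlier slides.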

set_ftrace_notrace tracing]# echo > set_ftrace_filter tracing]# echo '*lock*' > set_ftrace_notrace tracing]# cat set_ftrace_notrace | head -10 xen_pte_unlock alternatives_smp_unlock user_enable_block_step __acpi_release_global_lock __acpi_acquire_global_lock unlock_vector_lock lock_vector_lock parse_no_kvmclock kvm_set_wallclock kvm_register_clock

:mod: command tracing]# echo :mod:e1000e > set_ftrace_filter tracing]# cat set_ftrace_filter | head -10 e1e_rphy e1e_wphy e1000_put_hw_semaphore_82571 e1000_set_d0_lplu_state_82571 e1000e_clear_vfta e1000_check_mng_mode_82574 e1000_led_on_82574 e1000_valid_led_default_82571 e1000e_get_laa_state_82571 e1000_write_nvm_82571 ● Works for both set_ftrace_filter and set_ftrace_notrace

The Function Graph Tracer tracing]# echo function_graph > current_tracer tracing]# cat trace | head -20 # tracer: function_graph # # CPU DURATION FUNCTION CALLS # | | | | | | | 1) | down_read_trylock() { 1) us | _spin_lock_irqsave(); 1) us | _spin_unlock_irqrestore(); 1) us | } 1) us | __might_sleep(); 1) us | _cond_resched(); 1) us | find_vma(); 1) | handle_mm_fault() { 1) us | pud_alloc(); 1) us | pmd_alloc(); 1) | __do_fault() { 1) | filemap_fault() { 1) | find_get_page() { 1) us | page_cache_get_speculative(); 1) us | } 1) | lock_page() {
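The function_graph output encodes nesting purely through braces: `name() {` opens a call, `}` closes it, and `name();` is a leaf. A small sketch of recovering call depth from such lines is given below. It assumes the text after the last `|` is the call column and that the lines come from a single CPU; the function name `call_depths` is made up for illustration.

```python
def call_depths(lines):
    """Return (depth, function) pairs from function_graph trace lines."""
    depth = 0
    calls = []
    for line in lines:
        # Skip the '#' header block and anything without a call column.
        if line.lstrip().startswith("#") or "|" not in line:
            continue
        text = line.rsplit("|", 1)[1].strip()
        if text.startswith("}"):
            # A trace may start in the middle of a call, so never go below 0.
            depth = max(depth - 1, 0)
        elif text.endswith("{"):
            calls.append((depth, text[:-1].strip().rstrip("()")))
            depth += 1
        elif text.endswith(";"):
            calls.append((depth, text[:-1].rstrip("()")))
    return calls
```

Interrupt markers such as `==========>` leave the call column empty and are ignored, so the functions of an interrupt handler simply appear nested under whatever was running when the interrupt arrived.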

What Does That Function Call? tracing]# echo sys_read > set_graph_function tracing]# cat trace | head -20 # tracer: function_graph # # CPU DURATION FUNCTION CALLS # | | | | | | | 1) us | fsnotify(); 1) us | } 1) ! us | } 1) us | fput_light(); 1) ! us | } 1) | sys_read() { 1) us | fget_light(); 1) | vfs_read() { 1) | rw_verify_area() { 1) | security_file_permission() { 1) | selinux_file_permission() { 1) us | avc_policy_seqno(); 1) us | } 1) us | } 1) us | } 1) | tty_read() {

Interrupts in Function Graph 1) | __sock_recvmsg() { 1) | security_socket_recvmsg() { 1) | selinux_socket_recvmsg() { 1) ==========> | 1) | smp_apic_timer_interrupt() { 1) | apic_write() { 1) us | native_apic_mem_write(); 1) us | } 1) us | exit_idle(); 1) | irq_enter() { 1) us | rcu_irq_enter(); 1) us | idle_cpu(); 1) us | } 1) | hrtimer_interrupt() { 1) | ktime_get() { 1) | timekeeping_get_ns() { 1) us | read_hpet(); [...] 1) us | } 1) us | } 1) us | rcu_irq_exit(); 1) us | idle_cpu(); 1) us | } 1) ! us | } 1) <========== | 1) | socket_has_perm() { 1) | avc_has_perm() {

trace_pipe tracing]# cat trace_pipe 1) us | flush_tlb_page(); 1) us | } 1) us | } 1) us | } 1) | up_read() { 1) us | _spin_lock_irqsave(); 1) us | _spin_unlock_irqrestore(); 1) us | } 1) us | notify_page_fault(); 1) | down_read_trylock() { 1) us | _spin_lock_irqsave(); 1) us | _spin_unlock_irqrestore(); 1) us | } 1) us | __might_sleep(); 1) us | _cond_resched(); 1) us | find_vma(); 1) | handle_mm_fault() { 1) us | pud_alloc(); 1) us | pmd_alloc(); 1) | __do_fault() { 1) | filemap_fault() {

Latency Tracers ● wakeup – trace wake up time of highest prio task ● wakeup_rt – trace wake up time of highest prio RT task ● irqsoff – trace time interrupts are disabled ● preemptoff – trace time preemption is disabled ● preemptirqsoff – trace time preemption or interrupts are disabled

Trace Events tracing]# ls events block ext4 header_event irq kmem kvmmmu sched syscalls enable ftrace header_page jbd2 kvm module skb workqueue tracing]# ls events/sched/ enable sched_process_exit sched_stat_iowait sched_wakeup filter sched_process_fork sched_stat_sleep sched_wakeup_new sched_kthread_stop sched_process_free sched_stat_wait sched_kthread_stop_ret sched_process_wait sched_switch sched_migrate_task sched_signal_send sched_wait_task tracing]# ls events/sched/sched_wakeup enable filter format id

Enable a Single Event tracing]# echo 1 > events/sched/sched_wakeup/enable tracing]# cat trace | head -10 # tracer: nop # # TASK-PID CPU# TIMESTAMP FUNCTION # | | | | | bash-2613 [001] : sched_wakeup: task bash:2613 [120] success=0 [001] bash-2613 [001] : sched_wakeup: task bash:2613 [120] success=0 [001] bash-2613 [001] : sched_wakeup: task bash:2613 [120] success=0 [001] bash-2613 [001] : sched_wakeup: task bash:2613 [120] success=0 [001] -0 [001] : sched_wakeup: task events/1:10 [120] success=1 [001] events/1-10 [001] : sched_wakeup: task gnome-terminal:2162 [120] success=1 [001]

Enable All Subsystem Events tracing]# echo 1 > events/sched/enable tracing]# cat trace | head -10 # tracer: nop # # TASK-PID CPU# TIMESTAMP FUNCTION # | | | | | events/0-9 [000] : sched_switch: task events/0:9 [120] (S) ==> kondemand/0:1305 [120] kondemand/ [000] : sched_stat_wait: task: restorecond:1395 wait: [ns] kondemand/ [000] : sched_switch: task kondemand/0:1305 [120] (S) ==> restorecond:1395 [120] restorecond-1395 [000] : sched_stat_wait: task: restorecond:1395 wait: 0 [ns] restorecond-1395 [000] : sched_stat_sleep: task: kondemand/0:1305 sleep: [ns] restorecond-1395 [000] : sched_wakeup: task kondemand/0:1305 [120] success=1 [000]

Enable All Events tracing]# echo 1 > events/enable tracing]# cat trace | head -10 # tracer: nop # # TASK-PID CPU# TIMESTAMP FUNCTION # | | | | | acpid-1470 [001] : kfree: call_site=ffffffff810c996d ptr=(null) acpid-1470 [001] : sys_read -> 0x1 acpid-1470 [001] : sys_exit: NR 0 = 1 acpid-1470 [001] : sys_read(fd: 3, buf: 7f4ebb32ac50, count: 1) acpid-1470 [001] : sys_enter: NR 0 (3, 7f4ebb32ac50, 1, 8, 40, ) acpid-1470 [001] : kfree: call_site=ffffffff810c996d ptr=(null)

Enable Multiple Events tracing]# echo 1 > events/sched/sched_wakeup/enable tracing]# echo 1 > events/sched/sched_wakeup_new/enable tracing]# echo 1 > events/sched/sched_switch/enable tracing]# cat trace | head -15 # tracer: nop # # TASK-PID CPU# TIMESTAMP FUNCTION # | | | | | bash-2913 [001] : sched_wakeup: task bash:2913 [120] success=0 [001] bash-2913 [001] : sched_wakeup: task bash:2913 [120] success=0 [001] bash-2913 [001] : sched_wakeup: task bash:2913 [120] success=0 [001] bash-2913 [001] : sched_switch: task bash:2913 [120] (S) ==> swapper:0 [140] -0 [001] : sched_wakeup: task events/1:10 [120] success=1 [001] -0 [001] : sched_switch: task swapper:0 [140] (R) ==> events/1:10 [120] events/1-10 [001] : sched_wakeup: task gnome-terminal:2158 [120] success=1 [001] events/1-10 [001] : sched_switch: task events/1:10 [120] (S) ==> gnome-terminal:2158 [120] gnome-terminal-2158 [001] : sched_switch: task gnome-terminal:2158 [120] (S) ==> swapper:0 [140] -0 [000] : sched_wakeup: task phy0:1041 [120] success=1 [000] -0 [000] : sched_switch: task swapper:0 [140] (R) ==> phy0:1041 [120]

Event Directory or File ● set_event shows all events enabled ● available_events shows what events are available ● echo 1 > events/sched/enable – same as “echo sched > set_event” tracing]# echo 1 > events/irq/enable tracing]# cat set_event irq:irq_handler_entry irq:irq_handler_exit irq:softirq_entry irq:softirq_exit
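The set_event file lists enabled events as `subsystem:event`, which is the same hierarchy as the events directory (`events/<subsystem>/<event>/enable`). A short sketch of grouping that listing by subsystem, assuming one `subsystem:event` entry per line as in the output above; `group_events` is a hypothetical helper name.

```python
def group_events(set_event_text):
    """Group 'subsystem:event' entries from set_event by subsystem."""
    groups = {}
    for entry in set_event_text.split():
        system, _, event = entry.partition(":")
        groups.setdefault(system, []).append(event)
    return groups
```

Fed the four irq lines from the slide, this yields a single `irq` key with four event names, mirroring what `ls events/irq` would show next to its `enable` and `filter` files.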

Trace Events in the kernel
    #undef TRACE_SYSTEM
    #define TRACE_SYSTEM sample
    #if !defined(_TRACE_EVENT_SAMPLE_H) || defined(TRACE_HEADER_MULTI_READ)
    #define _TRACE_EVENT_SAMPLE_H
    #include <linux/tracepoint.h>
    TRACE_EVENT(foo_bar,
            TP_PROTO(char *foo, int bar),
            TP_ARGS(foo, bar),
            TP_STRUCT__entry(
                    __array( char, foo, 10 )
                    __field( int, bar )
            ),
            TP_fast_assign(
                    strncpy(__entry->foo, foo, 10);
                    __entry->bar = bar;
            ),
            TP_printk("foo %s %d", __entry->foo, __entry->bar)
    );
    #endif
    #undef TRACE_INCLUDE_PATH
    #undef TRACE_INCLUDE_FILE
    #define TRACE_INCLUDE_PATH .
    #define TRACE_INCLUDE_FILE trace-events-sample
    #include <trace/define_trace.h>

Trace Events in the kernel ● C file
    #define CREATE_TRACE_POINTS
    #include "trace-events-sample.h"
    static void simple_thread_func(int cnt)
    {
            set_current_state(TASK_INTERRUPTIBLE);
            schedule_timeout(HZ);
            trace_foo_bar("hello", cnt);
    }

Creating a Trace Event (modules) ● Makefile
    CFLAGS_trace-events-sample.o := -I$(src)
    obj-$(CONFIG_SAMPLE_TRACE_EVENTS) += trace-events-sample.o

Plugins vs Events ● Plugins are set via current_tracer – Events are enabled via the event directory or the set_event file ● Plugins are listed via the available_tracers file – Events are listed by the event directory or the available_events file ● Only one plugin at a time – Any number of events can be enabled – They show up in any trace

Mixing Events With Plugins ● Latency tracers work best when you can see what is happening during the latency ● The function tracer is too verbose, although filtering may help ● Just starting and stopping the trace is not enough

Filters ● Filter any trace event or ftrace entry ● Use equal '==' and logic descriptors '||' and '&&' ● Filter on any field in the format file – i.e. sched_switch's prev_state

Filter on sched_switch tracing]# echo "prev_state == 0" > events/sched/sched_switch/filter tracing]# cat trace | head -15 # tracer: nop # # TASK-PID CPU# TIMESTAMP FUNCTION # | | | | | -0 [001] : sched_switch: task swapper:0 [140] (R) ==> events/1:10 [120] -0 [001] : sched_switch: task swapper:0 [140] (R) ==> Xorg:1840 [120] Xorg-1840 [001] : sched_switch: task Xorg:1840 [120] (R) ==> gnome-settings-:2133 [120] Xorg-1840 [001] : sched_switch: task Xorg:1840 [120] (R) ==> metacity:2139 [120] Xorg-1840 [001] : sched_switch: task Xorg:1840 [120] (R) ==> wnck-applet:2220 [120] Xorg-1840 [001] : sched_switch: task Xorg:1840 [120] (R) ==> wnck-applet:2220 [120] Xorg-1840 [001] : sched_switch: task Xorg:1840 [120] (R) ==> wnck-applet:2220 [120] Xorg-1840 [001] : sched_switch: task Xorg:1840 [120] (R) ==> metacity:2139 [120] Xorg-1840 [001] : sched_switch: task Xorg:1840 [120] (R) ==> metacity:2139 [120] Xorg-1840 [001] : sched_switch: task Xorg:1840 [120] (R) ==> wnck-applet:2220 [120] Xorg-1840 [001] : sched_switch: task Xorg:1840 [120] (R) ==> wnck-applet:2220 [120]
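Every line in the filtered output shows `(R)` because `prev_state == 0` is the value that the sched_switch print format renders as "R"; non-zero values are rendered as flag letters. The sketch below mirrors the flag table from the sched_switch format file shown later in this deck. It ignores any bits outside that table, and `prev_state_name` is a name chosen here for illustration.

```python
# Flag table copied from the sched_switch "print fmt" in the format file.
PREV_STATE_FLAGS = [
    (1, "S"), (2, "D"), (4, "T"), (8, "t"),
    (16, "Z"), (32, "X"), (64, "x"), (128, "W"),
]


def prev_state_name(prev_state):
    """Render prev_state the way the sched_switch print format does."""
    if prev_state == 0:
        return "R"
    # Unknown bits are silently dropped in this simplified sketch.
    return "|".join(name for bit, name in PREV_STATE_FLAGS if prev_state & bit)
```

So a filter of `prev_state == 0` keeps only switches where the outgoing task was still runnable, that is, where it was preempted rather than put to sleep.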

tracing_on
    [tracing]# echo 0 > tracing_on
    [tracing]# echo 1 > tracing_on
    [tracing]# echo 0> tracing_on    (does not work: without the space, "0>" redirects file descriptor 0 instead of writing "0")
    [tracing]# echo 1 > tracing_on; run_test; echo 0 > tracing_on

trace_marker tracing]# echo 'Hello Chemnitz!' > trace_marker tracing]# cat trace # tracer: nop # # TASK-PID CPU# TIMESTAMP FUNCTION # | | | | | bash-3798 [001] : 0: Hello Chemnitz!

tracing_on
    int trace_on_fd;
    int main(int argc, char *argv[])
    {
            char buf[BUFSIZ];
            [...]
            find_debugfs(buf);
            strcat(buf, "/tracing/tracing_on");
            trace_on_fd = open(buf, O_WRONLY);
            [...]
            if (error_detected()) { /* hit bug */
                    write(trace_on_fd, "0", 1);
            }

tracing_on and trace_marker
    int trace_on_fd;
    int trace_mark_fd;
    int main(int argc, char *argv[])
    {
            char buf[BUFSIZ];
            [...]
            find_debugfs(buf);
            strcat(buf, "/tracing/tracing_on");
            trace_on_fd = open(buf, O_WRONLY);
            find_debugfs(buf);
            strcat(buf, "/tracing/trace_marker");
            trace_mark_fd = open(buf, O_WRONLY);
            [...]
            write(trace_mark_fd, "Testing for error\n", 18);
            if (error_detected()) { /* hit bug */
                    write(trace_on_fd, "0", 1);
            }
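The same pattern (keep both files open, drop markers as the program runs, stop the trace the moment a bug is detected) can be written from a scripting language as well. The following is a rough Python equivalent of the C fragment above, not code from the talk. The class name `TraceControl` and the `tracing_dir` parameter are assumptions; on a real system the directory would be something like `/debug/tracing`, and writing to it needs root.

```python
import os


class TraceControl:
    """Hold tracing_on and trace_marker open so writes are cheap at the point of failure."""

    def __init__(self, tracing_dir):
        self.on_fd = os.open(os.path.join(tracing_dir, "tracing_on"), os.O_WRONLY)
        self.mark_fd = os.open(os.path.join(tracing_dir, "trace_marker"), os.O_WRONLY)

    def mark(self, text):
        # Appears in the trace alongside kernel events, like the echo example.
        os.write(self.mark_fd, text.encode())

    def stop(self):
        # Same effect as "echo 0 > tracing_on".
        os.write(self.on_fd, b"0")

    def close(self):
        os.close(self.on_fd)
        os.close(self.mark_fd)
```

Opening the files once up front matters for the same reason it does in C: when the error is detected, a single `write()` stops the ring buffer before the interesting events are overwritten.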

trace-cmd ● Version 1.1-rc1 ● git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/trace-cmd.git

trace-cmd ● binary tool to read Ftrace's buffers – Records into a trace.dat file for later reads – Reads the trace.dat file ● Can record on big endian, read in little, and vice versa – Reads the raw buffers using splice – Will automatically mount debugfs if it is not mounted ● Must have root access (sudo)

trace-cmd record ● Default, writes to “trace.dat”
    [~]# trace-cmd record -e sched ls -ltr /usr > /dev/null
    disable all
    enable sched
    offset=2f2000
    offset=2f4000
    [~]# trace-cmd record -o func.dat -p function ls -ltr /usr > /dev/null
    plugin function
    disable all
    offset=2f2000
    offset=
    [~]# trace-cmd record -o fgraph.dat -p function_graph ls -ltr /usr \
            > /dev/null
    plugin function_graph
    disable all
    offset=2f2000
    offset=
    [~]# trace-cmd record -o fgraph-events.dat -e sched -p function_graph \
            ls -ltr /usr > /dev/null
    plugin function_graph
    disable all
    enable sched
    offset=2f2000
    offset=

Filters, and Options ~]# trace-cmd record -e sched_switch -f 'prev_prio < 100' ~]# trace-cmd record -p function_graph -O nograph-time ~]# trace-cmd record -p function_graph -g sys_read ~]# trace-cmd record -p function_graph -l do_IRQ -l timer_interrupt ~]# trace-cmd record -p function_graph -n '*lock*' ● -f : filter ● -O : option ● -g : same as echoing into set_graph_function ● -l : same as echoing into set_ftrace_filter ● -n : same as echoing into set_ftrace_notrace

trace-cmd report ● Default, reads from “trace.dat” ~]# trace-cmd report | head -15 version = 6 cpus=2 trace-cmd-6157 [000] : sched_stat_runtime: task: trace-cmd:61 trace-cmd-6157 [000] : sched_switch: 6157:120:S ==> 0:1 -0 [000] : sched_stat_wait: task: trace-cmd:61 -0 [000] : sched_switch: 0:120:R ==> 6158:1 ls-6158 [001] : sched_wakeup: 6158:?:? : ls-6158 [001] : sched_stat_runtime: task: trace-cmd:61 ls-6158 [001] : sched_stat_runtime: task: trace-cmd:61 ls-6158 [001] : sched_switch: 6158:120:R ==> 590 migration/ [001] : sched_stat_wait: task: trace-cmd:61 migration/ [001] : sched_migrate_task: task trace-cmd:615 migration/ [001] : sched_switch: 5900:0:S ==> 0:120 ls-6158 [000] : sched_stat_runtime: task: ls:6158 runt ls-6158 [000] : sched_stat_runtime: task: ls:6158 runt

trace-cmd report (continued) ~]# trace-cmd report -i func.dat | head -15 version = 6 cpus=2 ls-6178 [000] : function: fsnotify_modify <-- vfs_write ls-6178 [000] : function: inotify_inode_queue_event <-- fsn ls-6178 [000] : function: fsnotify_parent <-- fsnotify_modi ls-6178 [000] : function: __fsnotify_parent <-- fsnotify_pa ls-6178 [000] : function: inotify_dentry_parent_queue_event ls-6178 [000] : function: fsnotify <-- fsnotify_modify ls-6178 [000] : function: fput_light <-- sys_write ls-6178 [000] : function: audit_syscall_exit <-- sysret_aud ls-6178 [000] : function: audit_get_context <-- audit_sysca ls-6178 [000] : function: audit_free_names <-- audit_syscal ls-6178 [000] : function: path_put <-- audit_free_names ls-6178 [000] : function: dput <-- path_put ls-6178 [000] : function: mntput <-- path_put

trace-cmd report (continued) ~]# trace-cmd report -i fgraph.dat | head -15 | cut -c complement version = 6 cpus=2 ls-6186 [000] funcgraph_entry: | fsnotify_modify() { ls-6186 [000] funcgraph_entry: us | inotify_inode_queue_event(); ls-6186 [000] funcgraph_entry: | fsnotify_parent() { ls-6186 [000] funcgraph_entry: us | __fsnotify_parent(); ls-6186 [000] funcgraph_entry: us | inotify_dentry_parent_queu ls-6186 [000] funcgraph_exit: us | } ls-6186 [000] funcgraph_entry: us | fsnotify(); ls-6186 [000] funcgraph_exit: us | } ls-6186 [000] funcgraph_entry: us | fput_light(); ls-6186 [000] funcgraph_entry: | audit_syscall_exit() { ls-6186 [000] funcgraph_entry: us | audit_get_context(); ls-6186 [000] funcgraph_entry: | audit_free_names() { ls-6186 [000] funcgraph_entry: | path_put() {

trace-cmd report (continued) ~]# trace-cmd report -i fgraph-events.dat | head -15 | \ cut -c complement version = 6 cpus=2 ls-6209 [001] funcgraph_entry:0.385 us | task_of(); ls-6209 [001] funcgraph_entry: | ftrace_raw ls-6209 [001] sched_stat_wait: task: phy0:861 wait: [ns] ls-6209 [001] funcgraph_exit: us | } ls-6209 [001] funcgraph_exit: us | } ls-6209 [001] funcgraph_entry: us | __dequeue_entity( ls-6209 [001] funcgraph_exit: us | } ls-6209 [001] funcgraph_entry: us | task_of(); ls-6209 [001] funcgraph_entry: us | hrtick_start_fair ls-6209 [001] funcgraph_exit: us | } ls-6209 [001] funcgraph_exit: us | } ls-6209 [001] funcgraph_entry: us | perf_event_task_sch ls-6209 [001] funcgraph_entry: | ftrace_raw_event_sc ls-6209 [001] sched_switch: 6209:120:R ==> 861:120: phy0 ls-6209 [001] funcgraph_exit: us | }

trace-cmd start ● Using start is like echoing into debugfs – trace-cmd start -e all ● same as “echo 1 > events/enable” ● Uses the same options as trace-cmd record – trace-cmd start -p function_graph – trace-cmd start -p function -e sched_switch ● Reset for manual use – trace-cmd start -p nop

trace-cmd stop / extract ● trace-cmd stop – stops the tracer from writing: ● same as “echo 0 > tracing_on” ● trace-cmd extract -o output.dat – Makes a “dat” file that trace-cmd report can use – Without “-o...” will create “trace.dat”

trace-cmd reset ● trace-cmd stop does not stop the overhead of tracing ● trace-cmd reset disables all tracing – trace-cmd reset ● Removes trace data from kernel – Do the extract before doing the reset

trace-cmd list ● See the trace options, events or plugins – trace-cmd list -o ● shows list of trace options ● these options are used by trace-cmd record -O option – trace-cmd list -p ● available tracers (used to be called plugins) – trace-cmd list -e ● available events

trace-cmd split ● Split by time, events, CPU – trace-cmd split ● splits from timestamp to end of file – trace-cmd split -e 1000 ● splits out the first 1000 events – trace-cmd split -m 1 -r ● split 1 millisecond starting at first timestamp to second timestamp repeatedly – trace.dat.1, trace.dat.2,...

trace-cmd listen ● listen for connections from other boxes – trace-cmd listen -p d ● Record can now send to that box – trace-cmd record -N host:5678 -e all – use “-t” to force TCP otherwise trace data is sent via UDP

What does the kernel do? ● Ever wonder what happens beyond the syscall? ● Use the source Luke! – But use ftrace to see what gets called ● Networking (follow that packet) ● File Systems (where does the info go?)

System Call Tracing ● strace like calls ● Trace system calls of all applications – trace-cmd record -e syscalls sleep 10

Follow that code ● fs/read_write.c
    SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
    {
            struct file *file;
            ssize_t ret = -EBADF;
            int fput_needed;
            file = fget_light(fd, &fput_needed);
            if (file) {
                    loff_t pos = file_pos_read(file);
                    ret = vfs_read(file, buf, count, &pos);
                    file_pos_write(file, pos);
                    fput_light(file, fput_needed);
            }
            return ret;
    }

Follow that code
    ssize_t vfs_read(struct file *file, char __user *buf, size_t count, loff_t *pos)
    {
            ssize_t ret;
            if (!(file->f_mode & FMODE_READ))
                    return -EBADF;
            if (!file->f_op || (!file->f_op->read && !file->f_op->aio_read))
                    return -EINVAL;
            if (unlikely(!access_ok(VERIFY_WRITE, buf, count)))
                    return -EFAULT;
            ret = rw_verify_area(READ, file, pos, count);
            if (ret >= 0) {
                    count = ret;
                    if (file->f_op->read)
                            ret = file->f_op->read(file, buf, count, pos);
                    else
                            ret = do_sync_read(file, buf, count, pos);
                    if (ret > 0) {
                            fsnotify_access(file);
                            add_rchar(current, ret);
                    }
                    inc_syscr(current);
            }
            return ret;
    }

Follow that code
    ssize_t do_sync_read(struct file *filp, char __user *buf, size_t len, loff_t *ppos)
    {
            struct iovec iov = { .iov_base = buf, .iov_len = len };
            struct kiocb kiocb;
            ssize_t ret;
            init_sync_kiocb(&kiocb, filp);
            kiocb.ki_pos = *ppos;
            kiocb.ki_left = len;
            kiocb.ki_nbytes = len;
            for (;;) {
                    ret = filp->f_op->aio_read(&kiocb, &iov, 1, kiocb.ki_pos);
                    if (ret != -EIOCBRETRY)
                            break;
                    wait_on_retry_sync_kiocb(&kiocb);
            }
            if (-EIOCBQUEUED == ret)
                    ret = wait_on_sync_kiocb(&kiocb);
            *ppos = kiocb.ki_pos;
            return ret;
    }

Follow that code ● trace-cmd record -F -p function_graph -g sys_read cat /etc/passwd ● trace-cmd report | less

Who called that Function? ● trace-cmd record -p function --func-stack -l

stacktrace -0 [001] : kmem_cache_free: call_site=ffffffff81103e40 ptr=ffff88002f4aa [001] : => kmem_cache_free => mempool_free_slab => mempool_free => bio_free => dm_bio_destructor => bio_put => clone_endio => bio_endio

Function Stack Tracer ● Very dangerous! tracing]# echo nop > current_tracer tracing]# echo mod_timer > set_ftrace_filter tracing]# echo function > current_tracer tracing]# echo 1 > options/func_stack_trace tracing]# cat trace |head -15 # tracer: function # # TASK-PID CPU# TIMESTAMP FUNCTION # | | | | | bash-2758 [001] : mod_timer+0x8/0x49 <-add_timer+0x2f/0x45 bash-2758 [001] : => add_timer+0x2f/0x45 => queue_delayed_work_on+0xe8/0x110 => queue_delayed_work+0x39/0x4f => schedule_delayed_work+0x2e/0x44 => tty_flip_buffer_push+0x6f/0x8a => pty_write+0x56/0x7e => n_tty_write+0x235/0x365 => tty_write+0x1ab/0x267 phy [000] : mod_timer+0x8/0x49 <- mod_beacon_timer+0x4f/0x6a [mac80211]

Event Format Files tracing]# cat events/sched/sched_switch/format name: sched_switch ID: 57 format: field:unsigned short common_type;offset:0;size:2; field:unsigned char common_flags;offset:2;size:1; field:unsigned char common_preempt_count;offset:3;size:1; field:int common_pid;offset:4;size:4; field:int common_lock_depth;offset:8;size:4; field:char prev_comm[TASK_COMM_LEN];offset:12;size:16; field:pid_t prev_pid;offset:28;size:4; field:int prev_prio;offset:32;size:4; field:long prev_state;offset:40;size:8; field:char next_comm[TASK_COMM_LEN];offset:48;size:16; field:pid_t next_pid;offset:64;size:4; field:int next_prio;offset:68;size:4; print fmt: "task %s:%d [%d] (%s) ==> %s:%d [%d]", REC->prev_comm, REC->prev_pid, REC->prev_prio, REC->prev_state ? __print_flags(REC->prev_state, "|", { 1, "S"}, { 2, "D" }, { 4, "T" }, { 8, "t" }, { 16, "Z" }, { 32, "X" }, { 64, "x" }, { 128, "W" }) : "R", REC->next_comm, REC->next_pid, REC->next_prio
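Each `field:` entry in a format file describes one member of the binary event record: a C declaration, a byte offset and a size. Tools that read the raw ring buffer rely on this layout. Below is a minimal sketch of parsing those entries, assuming the `field:<decl>;offset:<n>;size:<n>;` layout shown above; `parse_format_fields` and the tuple shape it returns are illustrative choices, not part of any existing tool.

```python
import re

FIELD_RE = re.compile(
    r"field:(?P<decl>[^;]+);\s*offset:(?P<offset>\d+);\s*size:(?P<size>\d+);"
)


def parse_format_fields(text):
    """Return (name, ctype, offset, size) tuples for each field in a format file."""
    fields = []
    for m in FIELD_RE.finditer(text):
        # The last word of the declaration is the field name; the rest is its C type.
        ctype, _, name = m.group("decl").strip().rpartition(" ")
        fields.append((name, ctype, int(m.group("offset")), int(m.group("size"))))
    return fields
```

Run over the sched_switch listing, this makes the layout easy to inspect. For instance, `prev_prio` ends at byte 36 while `prev_state` starts at byte 40, so the record contains four bytes of padding before the 8-byte `long`.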

stack_trace ● echo 1 > /proc/sys/kernel/stack_tracer_enabled ● kernel command line “stacktrace”

stack_trace tracing]# cat stack_trace Depth Size Location (45 entries) ) ftrace_call+0x5/0x2b 1) update_curr+0x10a/0x12b 2) enqueue_entity+0x31/0x20f 3) enqueue_task_fair+0x3d/0x98 4) enqueue_task+0x6b/0x8d [...] 28) sr_test_unit_ready+0x72/0xec 29) sr_media_change+0x57/0x264 30) media_changed+0x63/0xb2 31) cdrom_media_changed+0x44/0x5e 32) sr_block_media_changed+0x2c/0x42 33) check_disk_change+0x3c/0x85 34) cdrom_open+0x8d9/0x96b 35) sr_block_open+0x9f/0xd2 36) __blkdev_get+0xde/0x37c 37) blkdev_get+0x23/0x39 38) blkdev_open+0x85/0xd1 39) __dentry_open+0x14b/0x28f 40) nameidata_to_filp+0x51/0x76 41) do_filp_open+0x514/0x9bc 42) do_sys_open+0x71/0x131 43) sys_open+0x33/0x49 44) system_call_fastpath+0x16/0x1b

ftrace_dump_on_oops ● echo 1 > /proc/sys/kernel/ftrace_dump_on_oops ● Kernel command line “ftrace_dump_on_oops” ● Dumps the trace to the console on oops or panic or NMI lockup detection ● Dump to console via “sysrq-z”

trace_printk ● Acts just like printk in the kernel ● Needs recompile when adding new printks ● Records to ring buffer, so it is fast – use in the scheduler – use in interrupts – safe to use even in NMIs

Questions?