Learn the Linux Kernel with Ftrace Steven Rostedt 4096R/5A56DE73 5ED9 A48F C54C 0A22 D1D0 804C EBC2 6CDB 5A56 DE73
What is Ftrace? ● Two meanings – Linux function tracer infrastructure – Generic kernel tracing infrastructure ● Function tracing ● Tracepoints ● Latency tracing ● Kernel debugging
Building with Ftrace ● Kernel Hacking --> Tracers ● Kernel Function Tracer – CONFIG_FUNCTION_TRACER ● Kernel Function Graph Tracer – CONFIG_FUNCTION_GRAPH_TRACER ● git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git – cd linux
Overview of ftrace ● Mounting ● CLI to administer ftrace ● events ● function tracing ● trace-cmd
The Debugfs ● Officially mounted at – /sys/kernel/debug ● I prefer – mkdir /debug – mount -t debugfs nodev /debug – This presentation will use /debug ● Do what you want
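Since the mount point is a matter of taste, a script can look for the tracing directory instead of hard-coding /debug. This is a minimal sketch: the function name `find_tracing_dir` and the default candidate list are assumptions for illustration (newer kernels also expose the same files through tracefs at /sys/kernel/tracing).

```sh
#!/bin/sh
# find_tracing_dir [CANDIDATE ...]
# Print the first candidate directory that contains a current_tracer file.
# With no arguments, try a few common mount points (an assumed list).
find_tracing_dir() {
    if [ "$#" -eq 0 ]; then
        set -- /sys/kernel/tracing /sys/kernel/debug/tracing /debug/tracing
    fi
    for dir in "$@"; do
        if [ -e "$dir/current_tracer" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1    # nothing mounted, or not readable by this user
}
```

A typical use would be `cd "$(find_tracing_dir)"` as root before running the commands on the following slides.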
The Tracing Directory # ls /debug/tracing available_events dyn_ftrace_total_info kprobe_profile printk_formats set_ftrace_notrace sysprof_sample_period trace_pipe tracing_on available_filter_functions events ksym_profile README set_ftrace_pid trace trace_stat tracing_thresh available_tracers failures ksym_trace_filter saved_cmdlines set_graph_function trace_clock tracing_cpumask buffer_size_kb function_profile_enabled options set_event stack_max_size trace_marker tracing_enabled current_tracer kprobe_events per_cpu set_ftrace_filter stack_trace trace_options tracing_max_latency
Tracers ● Found in available_tracers – function – function_graph – wakeup and wakeup_rt – irqsoff, preemptoff, preemptirqsoff – mmiotrace – nop
The Function Tracer tracing]# echo function > current_tracer tracing]# cat trace | head -15 # tracer: function # # TASK-PID CPU# TIMESTAMP FUNCTION # | | | | | simpress.bin-2792 [000] : unix_poll <-sock_poll simpress.bin-2792 [000] : sock_poll_wait <-unix_poll simpress.bin-2792 [000] : fput <-do_sys_poll simpress.bin-2792 [000] : fget_light <-do_sys_poll simpress.bin-2792 [000] : sock_poll <-do_sys_poll simpress.bin-2792 [000] : unix_poll <-sock_poll simpress.bin-2792 [000] : sock_poll_wait <-unix_poll simpress.bin-2792 [000] : fput <-do_sys_poll simpress.bin-2792 [000] : fget_light <-do_sys_poll simpress.bin-2792 [000] : sock_poll <-do_sys_poll simpress.bin-2792 [000] : unix_poll <-sock_poll
How the Function Tracer Works ● mcount c1283ccb : c1283ccb: 55 push %ebp c1283ccc: 89 e5 mov %esp,%ebp c1283cce: 57 push %edi c1283ccf: 56 push %esi c1283cd0: 53 push %ebx c1283cd1: 83 ec 78 sub $0x78,%esp c1283cd4: e8 5b f9 d7 ff call c c1283cd9: b c1 mov $0xc ,%eax c1283cde: c mov %eax,-0x74(%ebp) c1283ce1: mov %eax,-0x78(%ebp) c1283ce4: 64 8b 1d 74 f9 46 c1 mov %fs:0xc146f974,%ebx c1283ceb: 8b 55 8c mov -0x74(%ebp),%edx c1283cee: d 6c ec 41 c1 add -0x3ebe1394(,%ebx,4),%edx c1283cf5: 89 d8 mov %ebx,%eax
How the Function Tracer Works ● mcount c1283ccb : c1283ccb: 55 push %ebp c1283ccc: 89 e5 mov %esp,%ebp c1283cce: 57 push %edi c1283ccf: 56 push %esi c1283cd0: 53 push %ebx c1283cd1: 83 ec 78 sub $0x78,%esp c1283cd4: 0f 1f nop c1283cd9: b c1 mov $0xc ,%eax c1283cde: c mov %eax,-0x74(%ebp) c1283ce1: mov %eax,-0x78(%ebp) c1283ce4: 64 8b 1d 74 f9 46 c1 mov %fs:0xc146f974,%ebx c1283ceb: 8b 55 8c mov -0x74(%ebp),%edx c1283cee: d 6c ec 41 c1 add -0x3ebe1394(,%ebx,4),%edx c1283cf5: 89 d8 mov %ebx,%eax
How the Function Tracer Works ● mcount c1283ccb : c1283ccb: 55 push %ebp c1283ccc: 89 e5 mov %esp,%ebp c1283cce: 57 push %edi c1283ccf: 56 push %esi c1283cd0: 53 push %ebx c1283cd1: 83 ec 78 sub $0x78,%esp c1283cd4: e8 61 f9 d7 ff call c c1283cd9: b c1 mov $0xc ,%eax c1283cde: c mov %eax,-0x74(%ebp) c1283ce1: mov %eax,-0x78(%ebp) c1283ce4: 64 8b 1d 74 f9 46 c1 mov %fs:0xc146f974,%ebx c1283ceb: 8b 55 8c mov -0x74(%ebp),%edx c1283cee: d 6c ec 41 c1 add -0x3ebe1394(,%ebx,4),%edx c1283cf5: 89 d8 mov %ebx,%eax
How the Function Tracer Works
● ftrace_caller
ENTRY(mcount)
        ret
END(mcount)

ENTRY(ftrace_caller)
        cmpl $0, function_trace_stop
        jne ftrace_stub
        pushl %eax
        pushl %ecx
        pushl %edx
        movl 0xc(%esp), %eax
        movl 0x4(%ebp), %edx
        subl $MCOUNT_INSN_SIZE, %eax
.globl ftrace_call
ftrace_call:
        call ftrace_stub
        popl %edx
        popl %ecx
        popl %eax
.globl ftrace_stub
ftrace_stub:
        ret
END(ftrace_caller)
set_ftrace_filter tracing]# echo schedule > set_ftrace_filter tracing]# cat set_ftrace_filter schedule tracing]# echo function > current_tracer tracing]# cat trace | head -15 # tracer: function # # TASK-PID CPU# TIMESTAMP FUNCTION # | | | | | Xorg-1849 [001] : schedule <-schedule_hrtimeout_range -0 [001] : schedule <-cpu_idle Xorg-1849 [001] : schedule <-__cond_resched kondemand/ [001] : schedule <-worker_thread Xorg-1849 [001] : schedule <-sysret_careful Xorg-1849 [001] : schedule <-schedule_hrtimeout_range gnome-terminal-2112 [001] : schedule <-schedule_hrtimeout_range Xorg-1849 [001] : schedule <-schedule_hrtimeout_range Xorg-1849 [001] : schedule <-schedule_hrtimeout_range gnome-terminal-2112 [001] : schedule <-schedule_hrtimeout_range Xorg-1849 [001] : schedule <-sysret_careful
set_ftrace_filter (Continued) tracing]# echo schedule_tail >> set_ftrace_filter tracing]# cat set_ftrace_filter schedule_tail schedule tracing]# echo 'sched*' > set_ftrace_filter tracing]# cat set_ftrace_filter | head -10 sched_avg_update sched_group_shares sched_group_rt_runtime sched_group_rt_period sched_slice sched_rt_can_attach sched_feat_open sched_debug_open sched_feat_show sched_feat_write
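The `>` versus `>>` behaviour shown on these slides can be wrapped in a small helper. This is only a sketch: `trace_only` is a hypothetical name, and the tracing directory is passed as the first argument rather than assumed.

```sh
#!/bin/sh
# trace_only TRACING_DIR FUNCTION_OR_GLOB [...]
# Replace the function filter with the given entries, then select the
# function tracer.
trace_only() {
    tracing_dir="$1"
    shift
    # Opening the file for truncation clears any previous filter.
    : > "$tracing_dir/set_ftrace_filter" || return 1
    for pattern in "$@"; do
        # ">>" appends, so entries written earlier in the loop are kept.
        printf '%s\n' "$pattern" >> "$tracing_dir/set_ftrace_filter" || return 1
    done
    echo function > "$tracing_dir/current_tracer"
}
```

For example, `trace_only /debug/tracing schedule 'sched_*'` would leave only `schedule` and the functions matching `sched_*` in the filter. Quote the globs so the shell does not expand them against file names.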
Acceptable Globs ● match* – Selects all functions starting with “match” ● *match – Selects all functions ending with “match” ● *match* – Selects all functions with “match” anywhere in their names
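The three glob forms behave like ordinary shell patterns, so their effect can be tried out without touching the kernel. The sketch below uses the shell's own `case` matching as a stand-in. It is not the kernel's matcher, and `glob_selects` is a hypothetical helper name. Shell patterns also accept forms (such as `?`) that are not on the list above.

```sh
#!/bin/sh
# glob_selects PATTERN NAME
# Succeed if NAME is selected by PATTERN, using shell pattern matching.
glob_selects() {
    case "$2" in
        $1) return 0 ;;   # unquoted on purpose: "*" must act as a wildcard
        *)  return 1 ;;
    esac
}

# Show which of a few sample function names each pattern would select.
for pattern in 'sched*' '*lock' '*lock*'; do
    for fn in sched_slice spin_unlock lock_page; do
        if glob_selects "$pattern" "$fn"; then
            echo "$pattern selects $fn"
        fi
    done
done
```

Here `sched*` picks only `sched_slice`, `*lock` picks only `spin_unlock`, and `*lock*` picks both `spin_unlock` and `lock_page`.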
set_ftrace_notrace tracing]# echo > set_ftrace_filter tracing]# echo '*lock*' > set_ftrace_notrace tracing]# cat set_ftrace_notrace | head -10 xen_pte_unlock alternatives_smp_unlock user_enable_block_step __acpi_release_global_lock __acpi_acquire_global_lock unlock_vector_lock lock_vector_lock parse_no_kvmclock kvm_set_wallclock kvm_register_clock
:mod: command tracing]# echo :mod:e1000e > set_ftrace_filter tracing]# cat set_ftrace_filter | head -10 e1e_rphy e1e_wphy e1000_put_hw_semaphore_82571 e1000_set_d0_lplu_state_82571 e1000e_clear_vfta e1000_check_mng_mode_82574 e1000_led_on_82574 e1000_valid_led_default_82571 e1000e_get_laa_state_82571 e1000_write_nvm_82571 ● Works for both set_ftrace_filter and set_ftrace_notrace
The Function Graph Tracer tracing]# echo function_graph > current_tracer tracing]# cat trace | head -20 # tracer: function_graph # # CPU DURATION FUNCTION CALLS # | | | | | | | 1) | down_read_trylock() { 1) us | _spin_lock_irqsave(); 1) us | _spin_unlock_irqrestore(); 1) us | } 1) us | __might_sleep(); 1) us | _cond_resched(); 1) us | find_vma(); 1) | handle_mm_fault() { 1) us | pud_alloc(); 1) us | pmd_alloc(); 1) | __do_fault() { 1) | filemap_fault() { 1) | find_get_page() { 1) us | page_cache_get_speculative(); 1) us | } 1) | lock_page() {
What Does That Function Call? tracing]# echo sys_read > set_graph_function tracing]# cat trace | head -20 # tracer: function_graph # # CPU DURATION FUNCTION CALLS # | | | | | | | 1) us | fsnotify(); 1) us | } 1) ! us | } 1) us | fput_light(); 1) ! us | } 1) | sys_read() { 1) us | fget_light(); 1) | vfs_read() { 1) | rw_verify_area() { 1) | security_file_permission() { 1) | selinux_file_permission() { 1) us | avc_policy_seqno(); 1) us | } 1) us | } 1) us | } 1) | tty_read() {
Interrupts in Function Graph 1) | __sock_recvmsg() { 1) | security_socket_recvmsg() { 1) | selinux_socket_recvmsg() { 1) ==========> | 1) | smp_apic_timer_interrupt() { 1) | apic_write() { 1) us | native_apic_mem_write(); 1) us | } 1) us | exit_idle(); 1) | irq_enter() { 1) us | rcu_irq_enter(); 1) us | idle_cpu(); 1) us | } 1) | hrtimer_interrupt() { 1) | ktime_get() { 1) | timekeeping_get_ns() { 1) us | read_hpet(); [...] 1) us | } 1) us | } 1) us | rcu_irq_exit(); 1) us | idle_cpu(); 1) us | } 1) ! us | } 1) <========== | 1) | socket_has_perm() { 1) | avc_has_perm() {
trace_pipe tracing]# cat trace_pipe 1) us | flush_tlb_page(); 1) us | } 1) us | } 1) us | } 1) | up_read() { 1) us | _spin_lock_irqsave(); 1) us | _spin_unlock_irqrestore(); 1) us | } 1) us | notify_page_fault(); 1) | down_read_trylock() { 1) us | _spin_lock_irqsave(); 1) us | _spin_unlock_irqrestore(); 1) us | } 1) us | __might_sleep(); 1) us | _cond_resched(); 1) us | find_vma(); 1) | handle_mm_fault() { 1) us | pud_alloc(); 1) us | pmd_alloc(); 1) | __do_fault() { 1) | filemap_fault() {
Latency Tracers ● wakeup – trace wake-up time of the highest-priority task ● wakeup_rt – trace wake-up time of the highest-priority RT task ● irqsoff – trace time interrupts are disabled ● preemptoff – trace time preemption is disabled ● preemptirqsoff – trace time preemption or interrupts are disabled
Trace Events tracing]# ls events block ext4 header_event irq kmem kvmmmu sched syscalls enable ftrace header_page jbd2 kvm module skb workqueue tracing]# ls events/sched/ enable sched_process_exit sched_stat_iowait sched_wakeup filter sched_process_fork sched_stat_sleep sched_wakeup_new sched_kthread_stop sched_process_free sched_stat_wait sched_kthread_stop_ret sched_process_wait sched_switch sched_migrate_task sched_signal_send sched_wait_task tracing]# ls events/sched/sched_wakeup enable filter format id
Enable a Single Event tracing]# echo 1 > events/sched/sched_wakeup/enable tracing]# cat trace | head -10 # tracer: nop # # TASK-PID CPU# TIMESTAMP FUNCTION # | | | | | bash-2613 [001] : sched_wakeup: task bash:2613 [120] success=0 [001] bash-2613 [001] : sched_wakeup: task bash:2613 [120] success=0 [001] bash-2613 [001] : sched_wakeup: task bash:2613 [120] success=0 [001] bash-2613 [001] : sched_wakeup: task bash:2613 [120] success=0 [001] -0 [001] : sched_wakeup: task events/1:10 [120] success=1 [001] events/1-10 [001] : sched_wakeup: task gnome-terminal:2162 [120] success=1 [001]
Enable All Subsystem Events tracing]# echo 1 > events/sched/enable tracing]# cat trace | head -10 # tracer: nop # # TASK-PID CPU# TIMESTAMP FUNCTION # | | | | | events/0-9 [000] : sched_switch: task events/0:9 [120] (S) ==> kondemand/0:1305 [120] kondemand/ [000] : sched_stat_wait: task: restorecond:1395 wait: [ns] kondemand/ [000] : sched_switch: task kondemand/0:1305 [120] (S) ==> restorecond:1395 [120] restorecond-1395 [000] : sched_stat_wait: task: restorecond:1395 wait: 0 [ns] restorecond-1395 [000] : sched_stat_sleep: task: kondemand/0:1305 sleep: [ns] restorecond-1395 [000] : sched_wakeup: task kondemand/0:1305 [120] success=1 [000]
Enable All Events tracing]# echo 1 > events/enable tracing]# cat trace | head -10 # tracer: nop # # TASK-PID CPU# TIMESTAMP FUNCTION # | | | | | acpid-1470 [001] : kfree: call_site=ffffffff810c996d ptr=(null) acpid-1470 [001] : sys_read -> 0x1 acpid-1470 [001] : sys_exit: NR 0 = 1 acpid-1470 [001] : sys_read(fd: 3, buf: 7f4ebb32ac50, count: 1) acpid-1470 [001] : sys_enter: NR 0 (3, 7f4ebb32ac50, 1, 8, 40, ) acpid-1470 [001] : kfree: call_site=ffffffff810c996d ptr=(null)
Enable Multiple Events tracing]# echo 1 > events/sched/sched_wakeup/enable tracing]# echo 1 > events/sched/sched_wakeup_new/enable tracing]# echo 1 > events/sched/sched_switch/enable tracing]# cat trace | head -15 # tracer: nop # # TASK-PID CPU# TIMESTAMP FUNCTION # | | | | | bash-2913 [001] : sched_wakeup: task bash:2913 [120] success=0 [001] bash-2913 [001] : sched_wakeup: task bash:2913 [120] success=0 [001] bash-2913 [001] : sched_wakeup: task bash:2913 [120] success=0 [001] bash-2913 [001] : sched_switch: task bash:2913 [120] (S) ==> swapper:0 [140] -0 [001] : sched_wakeup: task events/1:10 [120] success=1 [001] -0 [001] : sched_switch: task swapper:0 [140] (R) ==> events/1:10 [120] events/1-10 [001] : sched_wakeup: task gnome-terminal:2158 [120] success=1 [001] events/1-10 [001] : sched_switch: task events/1:10 [120] (S) ==> gnome-terminal:2158 [120] gnome-terminal-2158 [001] : sched_switch: task gnome-terminal:2158 [120] (S) ==> swapper:0 [140] -0 [000] : sched_wakeup: task phy0:1041 [120] success=1 [000] -0 [000] : sched_switch: task swapper:0 [140] (R) ==> phy0:1041 [120]
Event Directory or File ● set_event shows all events enabled ● available_events shows what events are available ● echo 1 > events/sched/enable – same as “echo sched > set_event” tracing]# echo 1 > events/irq/enable tracing]# cat set_event irq:irq_handler_entry irq:irq_handler_exit irq:softirq_entry irq:softirq_exit
Trace Events in the kernel
#undef TRACE_SYSTEM
#define TRACE_SYSTEM sample

#if !defined(_TRACE_EVENT_SAMPLE_H) || defined(TRACE_HEADER_MULTI_READ)
#define _TRACE_EVENT_SAMPLE_H

#include <linux/tracepoint.h>

TRACE_EVENT(foo_bar,
        TP_PROTO(char *foo, int bar),
        TP_ARGS(foo, bar),
        TP_STRUCT__entry(
                __array(char, foo, 10)
                __field(int, bar)
        ),
        TP_fast_assign(
                strncpy(__entry->foo, foo, 10);
                __entry->bar = bar;
        ),
        TP_printk("foo %s %d", __entry->foo, __entry->bar)
);
#endif

#undef TRACE_INCLUDE_PATH
#undef TRACE_INCLUDE_FILE
#define TRACE_INCLUDE_PATH .
#define TRACE_INCLUDE_FILE trace-events-sample
#include <trace/define_trace.h>
Trace Events in the kernel
● C file
#define CREATE_TRACE_POINTS
#include "trace-events-sample.h"

static void simple_thread_func(int cnt)
{
        set_current_state(TASK_INTERRUPTIBLE);
        schedule_timeout(HZ);
        trace_foo_bar("hello", cnt);
}
Creating a Trace Event (modules)
● Makefile
CFLAGS_trace-events-sample.o := -I$(src)
obj-$(CONFIG_SAMPLE_TRACE_EVENTS) += trace-events-sample.o
Plugins vs Events ● Plugins are set via current_tracer – Events are enabled via the event directory or the set_event file ● Plugins are listed via the available_tracers file – Events are listed by the event directory or the available_events file ● Only one plugin at a time – Any number of events can be enabled – They show up in any trace
Mixing Events With Plugins ● Latency tracers are most useful when you can also see what is happening during the latency ● The function tracer is too verbose, although filtering may help ● Seeing only where the latency starts and stops is not enough
Filters ● Filter any trace event or ftrace entry ● Compare with '==' and combine expressions with the logical operators '||' and '&&' ● Filter on any field in the format file – e.g. sched_switch's prev_state
Filter on sched_switch tracing]# echo "prev_state == 0" > events/sched/sched_switch/filter tracing]# cat trace | head -15 # tracer: nop # # TASK-PID CPU# TIMESTAMP FUNCTION # | | | | | -0 [001] : sched_switch: task swapper:0 [140] (R) ==> events/1:10 [120] -0 [001] : sched_switch: task swapper:0 [140] (R) ==> Xorg:1840 [120] Xorg-1840 [001] : sched_switch: task Xorg:1840 [120] (R) ==> gnome-settings-:2133 [120] Xorg-1840 [001] : sched_switch: task Xorg:1840 [120] (R) ==> metacity:2139 [120] Xorg-1840 [001] : sched_switch: task Xorg:1840 [120] (R) ==> wnck-applet:2220 [120] Xorg-1840 [001] : sched_switch: task Xorg:1840 [120] (R) ==> wnck-applet:2220 [120] Xorg-1840 [001] : sched_switch: task Xorg:1840 [120] (R) ==> wnck-applet:2220 [120] Xorg-1840 [001] : sched_switch: task Xorg:1840 [120] (R) ==> metacity:2139 [120] Xorg-1840 [001] : sched_switch: task Xorg:1840 [120] (R) ==> metacity:2139 [120] Xorg-1840 [001] : sched_switch: task Xorg:1840 [120] (R) ==> wnck-applet:2220 [120] Xorg-1840 [001] : sched_switch: task Xorg:1840 [120] (R) ==> wnck-applet:2220 [120]
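Setting and clearing a filter can be scripted the same way as the `echo` above. This is a sketch under stated assumptions: `set_event_filter` and `clear_event_filter` are hypothetical helper names, and the tracing directory is passed in explicitly. Writing `0` to a filter file removes the filter.

```sh
#!/bin/sh
# set_event_filter TRACING_DIR SUBSYSTEM EVENT EXPRESSION
# Write EXPRESSION to the event's filter file, then enable the event.
set_event_filter() {
    event_dir="$1/events/$2/$3"
    if [ ! -d "$event_dir" ]; then
        echo "no such event: $2/$3" >&2
        return 1
    fi
    printf '%s\n' "$4" > "$event_dir/filter" &&
        echo 1 > "$event_dir/enable"
}

# clear_event_filter TRACING_DIR SUBSYSTEM EVENT
# Writing 0 to the filter file removes the filter; the event stays enabled.
clear_event_filter() {
    echo 0 > "$1/events/$2/$3/filter"
}
```

For the slide above this would be `set_event_filter /debug/tracing sched sched_switch 'prev_state == 0'`. Any field name from the event's format file can appear in the expression.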
tracing_on
tracing]# echo 0 > tracing_on
tracing]# echo 1 > tracing_on
tracing]# echo 0> tracing_on    (wrong: without the space, "0>" is a redirection of file descriptor 0, so no "0" is written)
tracing]# echo 1 > tracing_on; run_test; echo 0 > tracing_on
trace_marker tracing]# echo 'Hello Chemnitz!' > trace_marker tracing]# cat trace # tracer: nop # # TASK-PID CPU# TIMESTAMP FUNCTION # | | | | | bash-3798 [001] : 0: Hello Chemnitz!
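The two files combine naturally: turn recording on, drop a marker, run the workload, turn recording off. A minimal sketch, assuming the tracing directory is passed as the first argument. `trace_around` is a hypothetical name.

```sh
#!/bin/sh
# trace_around TRACING_DIR COMMAND [ARG ...]
# Record only while COMMAND runs and label the start of the run in the trace.
trace_around() {
    tracing_dir="$1"
    shift
    echo 1 > "$tracing_dir/tracing_on" || return 1
    # The marker must be written while tracing is on to be recorded.
    echo "start: $*" > "$tracing_dir/trace_marker"
    "$@"
    status=$?
    echo 0 > "$tracing_dir/tracing_on"
    return "$status"    # hand back the exit status of COMMAND
}
```

For example, `trace_around /debug/tracing ./run_test` leaves the buffer frozen right after the test finishes, so `cat trace` shows the marker followed by whatever the enabled tracer or events recorded.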
tracing_on
int trace_on_fd;

int main(int argc, char *argv[])
{
        char buf[BUFSIZ];
        [...]
        find_debugfs(buf);
        strcat(buf, "/tracing/tracing_on");
        trace_on_fd = open(buf, O_WRONLY);
        [...]
        if (error_detected()) {
                /* hit bug */
                write(trace_on_fd, "0", 1);
        }
tracing_on and trace_marker
int trace_on_fd;
int trace_mark_fd;

int main(int argc, char *argv[])
{
        char buf[BUFSIZ];
        [...]
        find_debugfs(buf);
        strcat(buf, "/tracing/tracing_on");
        trace_on_fd = open(buf, O_WRONLY);
        find_debugfs(buf);
        strcat(buf, "/tracing/trace_marker");
        trace_mark_fd = open(buf, O_WRONLY);
        [...]
        write(trace_mark_fd, "Testing for error\n", 18);
        if (error_detected()) {
                /* hit bug */
                write(trace_on_fd, "0", 1);
        }
trace-cmd ● Version 1.1-rc1 ● git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/trace-cmd.git
trace-cmd ● binary tool to read Ftrace's buffers – Records into a trace.dat file for later reads – Reads the trace.dat file ● Can record on big endian, read in little, and vice versa – Reads the raw buffers using splice – Will automatically mount debugfs if it is not mounted ● Must have root access (sudo)
trace-cmd record
● Default, writes to “trace.dat”
~]# trace-cmd record -e sched ls -ltr /usr > /dev/null
disable all
enable sched
offset=2f2000
offset=2f4000
~]# trace-cmd record -o func.dat -p function ls -ltr /usr > /dev/null
plugin function
disable all
offset=2f2000
offset=
~]# trace-cmd record -o fgraph.dat -p function_graph ls -ltr /usr \
        > /dev/null
plugin function_graph
disable all
offset=2f2000
offset=
~]# trace-cmd record -o fgraph-events.dat -e sched -p function_graph \
        ls -ltr /usr > /dev/null
plugin function_graph
disable all
enable sched
offset=2f2000
offset=
Filters, and Options ~]# trace-cmd record -e sched_switch -f 'prev_prio < 100' ~]# trace-cmd record -p function_graph -O nograph-time ~]# trace-cmd record -p function_graph -g sys_read ~]# trace-cmd record -p function_graph -l do_IRQ -l timer_interrupt ~]# trace-cmd record -p function_graph -n '*lock*' ● -f : filter ● -O : option ● -g : same as echoing into set_graph_function ● -l : same as echoing into set_ftrace_filter ● -n : same as echoing into set_ftrace_notrace
trace-cmd report ● Default, reads from “trace.dat” ~]# trace-cmd report | head -15 version = 6 cpus=2 trace-cmd-6157 [000] : sched_stat_runtime: task: trace-cmd:61 trace-cmd-6157 [000] : sched_switch: 6157:120:S ==> 0:1 -0 [000] : sched_stat_wait: task: trace-cmd:61 -0 [000] : sched_switch: 0:120:R ==> 6158:1 ls-6158 [001] : sched_wakeup: 6158:?:? : ls-6158 [001] : sched_stat_runtime: task: trace-cmd:61 ls-6158 [001] : sched_stat_runtime: task: trace-cmd:61 ls-6158 [001] : sched_switch: 6158:120:R ==> 590 migration/ [001] : sched_stat_wait: task: trace-cmd:61 migration/ [001] : sched_migrate_task: task trace-cmd:615 migration/ [001] : sched_switch: 5900:0:S ==> 0:120 ls-6158 [000] : sched_stat_runtime: task: ls:6158 runt ls-6158 [000] : sched_stat_runtime: task: ls:6158 runt
trace-cmd report (continued) ~]# trace-cmd report -i func.dat | head -15 version = 6 cpus=2 ls-6178 [000] : function: fsnotify_modify <-- vfs_write ls-6178 [000] : function: inotify_inode_queue_event <-- fsn ls-6178 [000] : function: fsnotify_parent <-- fsnotify_modi ls-6178 [000] : function: __fsnotify_parent <-- fsnotify_pa ls-6178 [000] : function: inotify_dentry_parent_queue_event ls-6178 [000] : function: fsnotify <-- fsnotify_modify ls-6178 [000] : function: fput_light <-- sys_write ls-6178 [000] : function: audit_syscall_exit <-- sysret_aud ls-6178 [000] : function: audit_get_context <-- audit_sysca ls-6178 [000] : function: audit_free_names <-- audit_syscal ls-6178 [000] : function: path_put <-- audit_free_names ls-6178 [000] : function: dput <-- path_put ls-6178 [000] : function: mntput <-- path_put
trace-cmd report (continued) ~]# trace-cmd report -i fgraph.dat | head -15 | cut -c complement version = 6 cpus=2 ls-6186 [000] funcgraph_entry: | fsnotify_modify() { ls-6186 [000] funcgraph_entry: us | inotify_inode_queue_event(); ls-6186 [000] funcgraph_entry: | fsnotify_parent() { ls-6186 [000] funcgraph_entry: us | __fsnotify_parent(); ls-6186 [000] funcgraph_entry: us | inotify_dentry_parent_queu ls-6186 [000] funcgraph_exit: us | } ls-6186 [000] funcgraph_entry: us | fsnotify(); ls-6186 [000] funcgraph_exit: us | } ls-6186 [000] funcgraph_entry: us | fput_light(); ls-6186 [000] funcgraph_entry: | audit_syscall_exit() { ls-6186 [000] funcgraph_entry: us | audit_get_context(); ls-6186 [000] funcgraph_entry: | audit_free_names() { ls-6186 [000] funcgraph_entry: | path_put() {
trace-cmd report (continued) ~]# trace-cmd report -i fgraph-events.dat | head -15 | \ cut -c complement version = 6 cpus=2 ls-6209 [001] funcgraph_entry:0.385 us | task_of(); ls-6209 [001] funcgraph_entry: | ftrace_raw ls-6209 [001] sched_stat_wait: task: phy0:861 wait: [ns] ls-6209 [001] funcgraph_exit: us | } ls-6209 [001] funcgraph_exit: us | } ls-6209 [001] funcgraph_entry: us | __dequeue_entity( ls-6209 [001] funcgraph_exit: us | } ls-6209 [001] funcgraph_entry: us | task_of(); ls-6209 [001] funcgraph_entry: us | hrtick_start_fair ls-6209 [001] funcgraph_exit: us | } ls-6209 [001] funcgraph_exit: us | } ls-6209 [001] funcgraph_entry: us | perf_event_task_sch ls-6209 [001] funcgraph_entry: | ftrace_raw_event_sc ls-6209 [001] sched_switch: 6209:120:R ==> 861:120: phy0 ls-6209 [001] funcgraph_exit: us | }
trace-cmd start ● Using start is like echoing into debugfs – trace-cmd start -e all ● same as “echo 1 > events/enable” ● Uses the same options as trace-cmd record – trace-cmd start -p function_graph – trace-cmd start -p function -e sched_switch ● Reset for manual use – trace-cmd start -p nop
trace-cmd stop / extract ● trace-cmd stop – stops the tracer from writing: ● same as “echo 0 > tracing_on” ● trace-cmd extract -o output.dat – Makes a “dat” file that trace-cmd report can use – Without “-o...” will create “trace.dat”
trace-cmd reset ● trace-cmd stop does not stop the overhead of tracing ● trace-cmd reset disables all tracing – trace-cmd reset ● Removes trace data from kernel – Do the extract before doing the reset
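Put together, start, stop, extract and reset form a manual alternative to `trace-cmd record`. The sketch below strings them in the order the slides recommend, with extract before reset. `capture_events` and the `TRACE_CMD` override are hypothetical names added for illustration. Setting `TRACE_CMD='echo trace-cmd'` gives a dry run that only prints the commands.

```sh
#!/bin/sh
# capture_events OUTPUT EVENT COMMAND [ARG ...]
# Enable EVENT, run COMMAND, then save the buffer to OUTPUT and clean up.
capture_events() {
    tc="${TRACE_CMD:-trace-cmd}"    # left unquoted below so a dry-run prefix splits
    output="$1"
    event="$2"
    shift 2
    $tc start -e "$event" || return 1
    "$@"                            # the workload being traced
    $tc stop                        # stop writing to the buffer
    $tc extract -o "$output"        # save the data before it is discarded
    $tc reset                       # disable tracing and free the buffer
}
```

As root, `capture_events sched.dat sched ls -ltr /usr` followed by `trace-cmd report -i sched.dat` would show the scheduler events recorded while `ls` ran.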
trace-cmd list ● See the trace options, events or plugins – trace-cmd list -o ● shows list of trace options ● these options are used by trace-cmd record -O option – trace-cmd list -p ● available tracers (used to be called plugins) – trace-cmd list -e ● available events
trace-cmd split ● Split by time, events, CPU – trace-cmd split ● splits from timestamp to end of file – trace-cmd split -e 1000 ● splits out the first 1000 events – trace-cmd split -m 1 -r ● split 1 millisecond starting at first timestamp to second timestamp repeatedly – trace.dat.1, trace.dat.2,...
trace-cmd listen ● listen for connections from other boxes – trace-cmd listen -p d ● Record can now send to that box – trace-cmd record -N host:5678 -e all – use “-t” to force TCP otherwise trace data is sent via UDP
What does the kernel do? ● Ever wonder what happens beyond the syscall? ● Use the source, Luke! – But use ftrace to see what gets called ● Networking (follow that packet) ● File Systems (where does the info go?)
System Call Tracing ● strace like calls ● Trace system calls of all applications – trace-cmd record -e syscalls sleep 10
Follow that code SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count) { struct file *file; ssize_t ret = -EBADF; int fput_needed; file = fget_light(fd, &fput_needed); if (file) { loff_t pos = file_pos_read(file); ret = vfs_read(file, buf, count, &pos); file_pos_write(file, pos); fput_light(file, fput_needed); } return ret; } ● fs/read_write.c
Follow that code ssize_t vfs_read(struct file *file, char __user *buf, size_t count, loff_t *pos) { ssize_t ret; if (!(file->f_mode & FMODE_READ)) return -EBADF; if (!file->f_op || (!file->f_op->read && !file->f_op->aio_read)) return -EINVAL; if (unlikely(!access_ok(VERIFY_WRITE, buf, count))) return -EFAULT; ret = rw_verify_area(READ, file, pos, count); if (ret >= 0) { count = ret; if (file->f_op->read) ret = file->f_op->read(file, buf, count, pos); else ret = do_sync_read(file, buf, count, pos); if (ret > 0) { fsnotify_access(file); add_rchar(current, ret); } inc_syscr(current); } return ret; }
Follow that code ssize_t do_sync_read(struct file *filp, char __user *buf, size_t len, loff_t *ppos) { struct iovec iov = {.iov_base = buf,.iov_len = len }; struct kiocb kiocb; ssize_t ret; init_sync_kiocb(&kiocb, filp); kiocb.ki_pos = *ppos; kiocb.ki_left = len; kiocb.ki_nbytes = len; for (;;) { ret = filp->f_op->aio_read(&kiocb, &iov, 1, kiocb.ki_pos); if (ret != -EIOCBRETRY) break; wait_on_retry_sync_kiocb(&kiocb); } if (-EIOCBQUEUED == ret) ret = wait_on_sync_kiocb(&kiocb); *ppos = kiocb.ki_pos; return ret; }
Follow that code ● trace-cmd record -F -p function_graph -g sys_read cat /etc/passwd ● trace-cmd report | less
Who called that Function? ● trace-cmd record -p function --func-stack -l
stacktrace -0 [001] : kmem_cache_free: call_site=ffffffff81103e40 ptr=ffff88002f4aa [001] : => kmem_cache_free => mempool_free_slab => mempool_free => bio_free => dm_bio_destructor => bio_put => clone_endio => bio_endio
Function Stack Tracer ● Very dangerous! tracing]# echo nop > current_tracer tracing]# echo mod_timer > set_ftrace_filter tracing]# echo function > current_tracer tracing]# echo 1 > options/func_stack_trace tracing]# cat trace |head -15 # tracer: function # # TASK-PID CPU# TIMESTAMP FUNCTION # | | | | | bash-2758 [001] : mod_timer+0x8/0x49 <-add_timer+0x2f/0x45 bash-2758 [001] : => add_timer+0x2f/0x45 => queue_delayed_work_on+0xe8/0x110 => queue_delayed_work+0x39/0x4f => schedule_delayed_work+0x2e/0x44 => tty_flip_buffer_push+0x6f/0x8a => pty_write+0x56/0x7e => n_tty_write+0x235/0x365 => tty_write+0x1ab/0x267 phy [000] : mod_timer+0x8/0x49 <- mod_beacon_timer+0x4f/0x6a [mac80211]
Event Format Files tracing]# cat events/sched/sched_switch/format name: sched_switch ID: 57 format: field:unsigned short common_type;offset:0;size:2; field:unsigned char common_flags;offset:2;size:1; field:unsigned char common_preempt_count;offset:3;size:1; field:int common_pid;offset:4;size:4; field:int common_lock_depth;offset:8;size:4; field:char prev_comm[TASK_COMM_LEN];offset:12;size:16; field:pid_t prev_pid;offset:28;size:4; field:int prev_prio;offset:32;size:4; field:long prev_state;offset:40;size:8; field:char next_comm[TASK_COMM_LEN];offset:48;size:16; field:pid_t next_pid;offset:64;size:4; field:int next_prio;offset:68;size:4; print fmt: "task %s:%d [%d] (%s) ==> %s:%d [%d]", REC->prev_comm, REC->prev_pid, REC->prev_prio, REC->prev_state ? __print_flags(REC->prev_state, "|", { 1, "S"}, { 2, "D" }, { 4, "T" }, { 8, "t" }, { 16, "Z" }, { 32, "X" }, { 64, "x" }, { 128, "W" }) : "R", REC->next_comm, REC->next_pid, REC->next_prio
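The field names in a format file are exactly the names a filter expression may use, so it can be handy to list them. A minimal sketch, assuming the one-field-per-line layout of the kernel's format files. `format_fields` is a hypothetical helper name.

```sh
#!/bin/sh
# format_fields FORMAT_FILE
# Print the name of every field declared in an event format file.
format_fields() {
    # Keep the declaration between "field:" and the first ";" ...
    sed -n 's/^[[:space:]]*field:\([^;]*\);.*/\1/p' "$1" |
        # ... then take its last word and drop any array suffix.
        awk '{ name = $NF; sub(/\[.*/, "", name); print name }'
}
```

Running `format_fields events/sched/sched_switch/format` from the tracing directory would list names such as `prev_comm`, `prev_pid`, `prev_prio` and `prev_state`.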
stack_trace ● echo 1 > /proc/sys/kernel/stack_tracer_enabled ● kernel command line “stacktrace”
stack_trace tracing]# cat stack_trace Depth Size Location (45 entries) ) ftrace_call+0x5/0x2b 1) update_curr+0x10a/0x12b 2) enqueue_entity+0x31/0x20f 3) enqueue_task_fair+0x3d/0x98 4) enqueue_task+0x6b/0x8d [...] 28) sr_test_unit_ready+0x72/0xec 29) sr_media_change+0x57/0x264 30) media_changed+0x63/0xb2 31) cdrom_media_changed+0x44/0x5e 32) sr_block_media_changed+0x2c/0x42 33) check_disk_change+0x3c/0x85 34) cdrom_open+0x8d9/0x96b 35) sr_block_open+0x9f/0xd2 36) __blkdev_get+0xde/0x37c 37) blkdev_get+0x23/0x39 38) blkdev_open+0x85/0xd1 39) __dentry_open+0x14b/0x28f 40) nameidata_to_filp+0x51/0x76 41) do_filp_open+0x514/0x9bc 42) do_sys_open+0x71/0x131 43) sys_open+0x33/0x49 44) system_call_fastpath+0x16/0x1b
ftrace_dump_on_oops ● echo 1 > /proc/sys/kernel/ftrace_dump_on_oops ● Kernel command line “ftrace_dump_on_oops” ● Dumps the trace to the console on oops or panic or NMI lockup detection ● Dump to console via “sysrq-z”
trace_printk ● Acts just like printk in the kernel ● Needs recompile when adding new printks ● Records to ring buffer, so it is fast – use in the scheduler – use in interrupts – safe to use even in NMIs
Questions?