ktap is a new script-based dynamic tracing tool for Linux http://www.ktap.org
ktap is a new script-based dynamic tracing tool for Linux. It uses a scripting language and lets the user trace the Linux kernel dynamically. ktap is designed to give operational insights with interoperability that allows users to tune, troubleshoot, and extend the kernel and applications. It is similar to Linux SystemTap and Solaris DTrace.
ktap has different design principles from the mainstream Linux dynamic tracing languages in that it is based on bytecode: it does not depend on GCC, does not require compiling a kernel module for each script, is safe to use in production environments, and fulfills the embedded ecosystem's tracing needs.
Highlights features:
Requirements
CONFIG_EVENT_TRACING enabled
CONFIG_PERF_EVENTS enabled
CONFIG_DEBUG_FS enabled
Make sure debugfs is mounted before running insmod ktapvm:
mount -t debugfs none /sys/kernel/debug/
libelf (optional): install elfutils-libelf-devel on RHEL-based distros, or libelf-dev on Debian-based distros. Use make NO_LIBELF=1 to build without libelf support. libelf is required for resolving symbols to addresses in DSOs, and for SDT.
Note that these configuration options are normally already enabled in Linux distributions such as RHEL, Fedora, and Ubuntu.
Clone ktap from GitHub
$ git clone http://github.com/ktap/ktap.git
Compile ktap
$ cd ktap
$ make #generate ktapvm kernel module and ktap binary
Load the ktapvm kernel module (make sure debugfs is mounted)
$ make load #need to be root or have sudo access
Run ktap
$ ./ktap samples/helloworld.kp
ktap's syntax is designed with the C language syntax in mind. This is for lowering the entry barrier for C programmers who are working on the kernel or other systems software.
Variable declarations
The biggest syntactic difference from C is that ktap is a dynamically-typed language, so you do not need to add any variable type declarations; just use the variable.
Functions
All functions in ktap must be declared with the "function" keyword.
Comments
Comments in ktap start with #. Long comments are not supported right now.
Others
Semicolons (;) are not required at the end of statements in ktap. ktap uses a free-syntax style, so you are free to use ';' or not.
ktap uses nil as NULL. The result of an arithmetic operation on nil is also nil.
ktap does not have array structures, and it does not have any pointer operations.
ktap's if/else statement is the same as the C language's.
There are three kinds of for-loop in ktap:
a Lua-like style:
for (i = init, limit, step) { body }
the same form as in C:
for (i = init; i < limit; i += step) { body }
Lua's table iterating style:
for (k, v in pairs(t)) { body } # looping all elements of table
Note that ktap does not have the continue keyword, but C does.
Associative arrays are heavily used in ktap; they are also called "tables".
Table declarations:
t = {}
How to use tables:
t[1] = 1
t[1] = "xxx"
t["key"] = 10
t["key"] = "value"
for (k, v in pairs(t)) { body } # looping all elements of table
print (...)
Receives any number of arguments, and prints their values. print is not intended for formatted output, but only as a quick way to show values, typically for debugging.
For formatted output, use printf instead.
printf (fmt, ...)
Similar to C's printf, for formatted string output.
pairs (t)
Returns three values: the next function, the table t, and nil, so that the construction
for (k, v in pairs(t)) { body }
will iterate through all the key-value pairs in the table t.
len (t) / len (s)
If the argument is a string, returns the length of the string.
If the argument is a table, returns the number of table pairs.
in_interrupt ()
Checks whether the current code is running in interrupt context.
exit ()
quits the ktap program, similar to the exit syscall.
pid ()
returns the current process's pid.
tid ()
returns the current thread id.
uid ()
returns the current process's uid.
execname ()
returns the current process executable's name in a string.
cpu ()
returns the current CPU id.
arch ()
returns the machine architecture, such as x86 or arm.
kernel_v ()
returns the Linux kernel version string, such as 3.9.
user_string (addr)
accepts a userspace address, reads the string data from userspace, and returns the ktap string value.
histogram (t)
accepts a table and outputs the table histogram to the user.
curr_task_info (offset, fetch_bytes)
fetches the value of the field at the given offset in the task_struct structure. The fetch_bytes argument can be 4 or 8; if it is not given, it defaults to 4.
The user may need to obtain the field offset with another debugging tool such as gdb, for example:
gdb vmlinux
(gdb)p &(((struct task_struct *)0).prio)
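To make the notion of a field offset concrete, here is a minimal Python sketch using ctypes. The ToyTask structure, its fields, and the field_info helper are hypothetical stand-ins for illustration only; they are not the real task_struct layout. Real offsets depend on the kernel version and configuration and still have to come from vmlinux, as in the gdb session above.

```python
import ctypes


class ToyTask(ctypes.Structure):
    # Hypothetical stand-in for task_struct; the real layout is different.
    _fields_ = [
        ("state", ctypes.c_long),
        ("stack", ctypes.c_void_p),
        ("prio", ctypes.c_int),
    ]


def field_info(struct_type, name):
    """Return (offset, size) in bytes for a field of a ctypes structure."""
    descriptor = getattr(struct_type, name)
    return descriptor.offset, descriptor.size


if __name__ == "__main__":
    offset, size = field_info(ToyTask, "prio")
    # offset plays the role of the first curr_task_info argument,
    # size (4 for an int) the role of fetch_bytes.
    print("offset:", offset, "fetch_bytes:", size)
```

The two numbers correspond to the offset and fetch_bytes arguments of curr_task_info: a 4-byte field such as an int uses the default, while a long or pointer on a 64-bit kernel would need fetch_bytes set to 8.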
print_backtrace ()
prints the current task's kernel stack backtrace.
kdebug.probe_by_id (eventdef_info, eventfun)
This function is the underlying interface for the higher level tracing primitives.
Note that the eventdef_info argument is just a C pointer value pointing to a userspace memory block holding the real eventdef_info structure. The structure definition is as follows:
struct ktap_eventdef_info {
    int nr; /* number of ids in id_arr */
    int *id_arr; /* id array */
    char *filter;
};
Those ids are read from /sys/kernel/debug/tracing/events/$SYS/$EVENT/id.
The second argument, eventfun, is a ktap function object:
function eventfun () { action }
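As a rough illustration of where the nr and id_arr values come from, the following Python sketch collects the ids for a SYS:EVENT pattern by reading the id files under the debugfs events directory. The read_event_ids helper and its events_root parameter are assumptions made for this example; this is not how the ktap userspace tool is implemented. Reading debugfs normally requires root.

```python
import glob
import os


def read_event_ids(pattern, events_root="/sys/kernel/debug/tracing/events"):
    """Collect tracepoint ids for a SYS:EVENT pattern such as 'sched:*'."""
    subsystem, _, event = pattern.partition(":")
    ids = []
    # Wildcards in either part are expanded by glob against the directory tree.
    for path in sorted(glob.glob(os.path.join(events_root, subsystem, event, "id"))):
        with open(path) as id_file:
            ids.append(int(id_file.read().strip()))
    return ids


if __name__ == "__main__":
    ids = read_event_ids("sched:sched_switch")
    # len(ids) corresponds to nr, and ids to id_arr, in ktap_eventdef_info.
    print(len(ids), ids)
```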
kdebug.probe_end (endfunc)
This function registers a function to be invoked when tracing ends. ktap waits until the user presses CTRL-C to stop tracing and then calls its argument, the endfunc function. The user can output tracing results in that function, or do other things.
Users usually do not need to use the kdebug library directly; they can just use the trace/trace_end keywords provided by the language.
table.new (narr, nrec)
pre-allocates a table with narr array entries and nrec records.
tracepoints, probe, timer, filters, ring buffer
trace EVENTDEF /FILTER/ { ACTION }
This is the basic tracing block in ktap. You need to use a specific EVENTDEF string, and your own event function.
There are four types of EVENTDEF: tracepoints, kprobes, uprobes, SDT probes.
tracepoint:
EventDef | Description |
---|---|
syscalls:* | trace all syscalls events |
syscalls:sys_enter_* | trace all syscalls entry events |
kmem:* | trace all kmem related events |
sched:* | trace all sched related events |
sched:sched_switch | trace sched_switch tracepoint |
*:* | trace all tracepoints in system |
All tracepoint events are based on /sys/kernel/debug/tracing/events/$SYS/$EVENT
ftrace (kernel 3.3+, and must be compiled with CONFIG_FUNCTION_TRACER)
EventDef | Description |
---|---|
ftrace:function | trace kernel functions based on ftrace |
Users need to use a filter (/ip==*/) to trace specific functions. The function must be listed in /sys/kernel/debug/tracing/available_filter_functions.
Note on the function event
perf has supported the ftrace:function tracepoint since Linux 3.3 (see the commit below). ktap is based on perf callbacks, so the kernel must be newer than 3.3 to use this feature.
commit ced39002f5ea736b716ae233fb68b26d59783912
Author: Jiri Olsa <jolsa@redhat.com>
Date: Wed Feb 15 15:51:52 2012 +0100
ftrace, perf: Add support to use function tracepoint in perf
kprobe:
EventDef | Description |
---|---|
probe:schedule | trace schedule function |
probe:schedule%return | trace schedule function return |
probe:SyS_write | trace SyS_write function |
probe:vfs* | trace all functions matching the vfs* wildcard |
uprobe:
EventDef | Description |
---|---|
probe:/lib64/libc.so.6:malloc | trace malloc function |
probe:/lib64/libc.so.6:malloc%return | trace malloc function return |
probe:/lib64/libc.so.6:free | trace free function |
probe:/lib64/libc.so.6:0x82000 | trace function with file offset 0x82000 |
probe:/lib64/libc.so.6:* | trace all libc functions |
Symbol resolution needs libelf support.
sdt:
EventDef | Description |
---|---|
sdt:/lib64/libc.so.6:lll_futex_wake | trace stapsdt lll_futex_wake |
sdt:/lib64/libc.so.6:* | trace all static markers in libc |
SDT resolution needs libelf support.
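The EVENTDEF forms in the tables above can be told apart by their prefix. The Python sketch below encodes those rules as read from the tables; classify_eventdef is a hypothetical helper, and the rule that a probe target starting with "/" is a uprobe is inferred from the examples rather than taken from ktap's own parser.

```python
def classify_eventdef(eventdef):
    """Return (kind, target, is_return) for an EVENTDEF string."""
    prefix, _, rest = eventdef.partition(":")
    if prefix == "probe":
        is_return = rest.endswith("%return")
        target = rest[: -len("%return")] if is_return else rest
        # Assumption: a path-like target means a userspace probe.
        kind = "uprobe" if target.startswith("/") else "kprobe"
        return kind, target, is_return
    if prefix == "sdt":
        return "sdt", rest, False
    if prefix == "ftrace":
        return "ftrace", rest, False
    # Anything else is treated as a SYS:EVENT tracepoint pattern.
    return "tracepoint", eventdef, False


if __name__ == "__main__":
    for example in ("sched:sched_switch", "probe:schedule%return",
                    "probe:/lib64/libc.so.6:malloc", "sdt:/lib64/libc.so.6:*"):
        print(example, "->", classify_eventdef(example))
```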
trace_end { ACTION }
argevent
Event object. You can print it with print(argevent), turning the event into a human-readable string. The result is mostly the same as each entry in /sys/kernel/debug/tracing/trace
argname
Event name. Each event has a name associated with it.
arg1..9
Evaluates to argument 1 to 9 of the event object.
Note on arg offsets
The arg offset (1..9) is determined by the event format shown in debugfs.
# cat /sys/kernel/debug/tracing/events/sched/sched_switch/format
name: sched_switch
ID: 268
format:
    field:char prev_comm[32]; <- arg1
    field:pid_t prev_pid; <- arg2
    field:int prev_prio; <- arg3
    field:long prev_state; <- arg4
    field:char next_comm[32]; <- arg5
    field:pid_t next_pid; <- arg6
    field:int next_prio; <- arg7
As shown above, the tracepoint event sched:sched_switch takes 7 arguments, from arg1 to arg7.
Note that arg1 of a syscall event is the syscall number, not the first argument of the syscall function. Use arg2 as the first argument of the syscall function. For example:
SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
                                    <arg2>             <arg3>       <arg4>
The same applies to kprobes and uprobes: arg1 of a kprobe/uprobe event is always __probe_ip, not the first argument given by the user. For example:
# ktap -e 'trace probe:/lib64/libc.so.6:malloc size=%di'
# cat /sys/kernel/debug/tracing/events/ktap_uprobes_3796/malloc/format
    field:unsigned long __probe_ip; <- arg1
    field:u64 size; <- arg2
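The mapping from format fields to arg1..argN can be worked out mechanically. Here is a minimal Python sketch that parses the text of a format file and numbers the fields in order. The arg_names helper is hypothetical, and skipping the common_* header fields that real format files list first is an assumption made to match the numbering shown above; the arg1..9 limit is not enforced.

```python
import re

FIELD_RE = re.compile(r"field:(?P<decl>[^;]+);")

SAMPLE = """\
name: sched_switch
ID: 268
format:
    field:char prev_comm[32];
    field:pid_t prev_pid;
    field:int prev_prio;
    field:long prev_state;
    field:char next_comm[32];
    field:pid_t next_pid;
    field:int next_prio;
"""


def arg_names(format_text):
    """Map 'arg1'..'argN' to field names, in the order the fields appear."""
    mapping = {}
    index = 0
    for match in FIELD_RE.finditer(format_text):
        # The field name is the last word of the declaration,
        # without a leading '*' or a trailing '[size]'.
        name = match.group("decl").split()[-1].lstrip("*").split("[")[0]
        if name.startswith("common_"):
            continue  # assumption: common_* header fields are not counted
        index += 1
        mapping["arg%d" % index] = name
    return mapping


if __name__ == "__main__":
    for arg, field in arg_names(SAMPLE).items():
        print(arg, "=", field)
```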
tick-Ns { ACTION }
tick-Nsec { ACTION }
tick-Nms { ACTION }
tick-Nmsec { ACTION }
tick-Nus { ACTION }
tick-Nusec { ACTION }
profile-Ns { ACTION }
profile-Nsec { ACTION }
profile-Nms { ACTION }
profile-Nmsec { ACTION }
profile-Nus { ACTION }
profile-Nusec { ACTION }
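The timer names above combine a kind (tick or profile), a count N, and a unit suffix. The following Python sketch shows one way to read such a name, assuming the usual meaning of the suffixes (s/sec for seconds, ms/msec for milliseconds, us/usec for microseconds). parse_timer is a hypothetical helper for illustration, not part of ktap.

```python
import re

# Microseconds per unit; both spellings of each unit are accepted.
UNIT_TO_US = {
    "s": 1000000, "sec": 1000000,
    "ms": 1000, "msec": 1000,
    "us": 1, "usec": 1,
}
SPEC_RE = re.compile(r"^(tick|profile)-(\d+)(s|sec|ms|msec|us|usec)$")


def parse_timer(spec):
    """Return (kind, interval_in_microseconds) for a name like 'tick-1ms'."""
    match = SPEC_RE.match(spec)
    if match is None:
        raise ValueError("not a timer spec: %r" % spec)
    kind, count, unit = match.groups()
    return kind, int(count) * UNIT_TO_US[unit]


if __name__ == "__main__":
    print(parse_timer("tick-1ms"))
    print(parse_timer("profile-2s"))
```

According to the timer sample in this document, a tick timer fires on one CPU and a profile timer fires on every CPU.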
architecture overview picture reference (png format)
one-liners
simple event tracing
Q: Why use a bytecode design?
A: Using bytecode is a clean and lightweight solution: you do not need the GCC toolchain to compile every script; all you need is the ktapvm kernel module and the userspace tool called "ktap". Because the language uses a virtual machine design, it has great portability. Suppose you are working on a multi-arch cluster: if you want to run a tracing script on each board, you do not need to cross-compile your tracing scripts for all the boards. You can just use the ktap tool to run scripts right away.
The bytecode-based design also makes execution safer than the native code generation approach.
It has already been observed that SystemTap is not widely used in embedded Linux systems. This is mainly a consequence of SystemTap's architectural design decisions. It is a natural design for Red Hat and IBM, because they focus on the server area, not the embedded area.
Q: What are the differences from SystemTap and DTrace?
A: For SystemTap, the answer was already given in the question above: SystemTap chooses the translator design, sacrificing usability for runtime performance. The dependency on the GCC toolchain when running scripts is the problem that ktap wants to solve.
DTrace shares the same design decision of using bytecode, so DTrace and ktap are basically more alike. There have been some projects aimed at porting DTrace from Solaris to Linux, but these efforts are still under way and are progressing relatively slowly. DTrace has its roots in Solaris, and there are many huge differences between Solaris's tracing infrastructure and Linux's.
DTrace is based on the D language, a subset of C. It is a restricted language (for example, it has no for-loops) so that it is safe to use in production systems. It seems that DTrace for Linux only supports the x86 architecture and does not work on PowerPC, ARM, or MIPS, so it is not currently suited for embedded Linux.
DTrace uses CTF as input for debuginfo handling, whereas SystemTap uses vmlinux.
On the license side, DTrace is released under the CDDL, which is incompatible with the GPL. (This is why it is impossible to upstream DTrace into mainline.)
Q: Why use a dynamically-typed language instead of a statically-typed language?
A: It is hard to say which one is better than the other. Dynamically-typed languages bring efficiency and fast prototyping, but lose type checking at the compile phase, so it is easy to make mistakes that only surface at runtime. They also need many runtime checks. In contrast, statically-typed languages win on programming safety and performance. Statically-typed languages would suit interoperation with the kernel, as the kernel is written mainly in C. Note that SystemTap and DTrace both use statically-typed languages.
ktap chooses a dynamically-typed language for its initial implementation.
Q: Why do we need ktap for event tracing? There is already a built-in ftrace.
A: This is also a common question for all dynamic tracing tools, not only ktap. ktap provides more flexibility than the built-in tracing infrastructure. Suppose you need to print a global variable at a tracepoint hit, or you want to print a backtrace. Furthermore, you want to store some info in an associative array, and display it as a histogram when tracing ends. ftrace cannot handle all these requirements. Overall, ktap provides you with great flexibility to script your own tracing needs.
Q: How about the performance? Is ktap slow?
A: ktap is not slow. The bytecode is very high-level, based on Lua. The language's virtual machine is register-based (compared to the stack-based JVM and CLR), with a small number of instructions. The table data structure is heavily optimized in ktapvm. ktap uses per-cpu allocation in many places, without the global locking scheme. It is very fast when executing tracepoint callbacks. Performance benchmarks show that the overhead of ktap runtime is nearly 10% (storing event name into associative array), compared to the full speed running time without any tracepoints enabled.
ktap will continue to be optimized. Hopefully the overhead will decrease to little more than 5%, or even less.
Q: Why not port a higher-level language, like Python or Java, directly into the kernel?
A: I take the size of the VM and its memory footprint seriously. The Python VM is too large to embed into the kernel, and Python has many advanced features which we do not really need.
Other higher-level languages also have a large number of bytecode opcodes. ktap only has 32 bytecode opcodes, whereas Python/Java/Erlang all have nearly two hundred opcodes. There are also some problems when porting those languages into the kernel. Kernel programming is very different from userspace programming: there are no floating-point numbers, sleeping code needs special handling, infinite loops are not allowed in the kernel, multi-threading has to be managed, and so on. So it is impossible to port large language implementations to the kernel environment with trivial effort.
Q: What is the status of ktap now?
A: Basically it works on x86-32, x86-64, PowerPC, and ARM. It could also work on other hardware architectures, but that has not been tested yet. (I don't have enough hardware to test.) If you find any bugs, fix them yourself or just report them to me.
Q: How can I hack on ktap? I want to write some extensions for ktap.
A: Patches welcome! Volunteers welcome! You can write your own libraries to fulfill your specific needs, or write scripts for fun.
Q: What's the plan for ktap? Is there a roadmap?
A: The current plan is to deliver stable ktapvm kernel modules, more ktap scripts, and more bugfixes.
For more release info, please look at RELEASES.txt in the project root directory.
simplest one-liner command to enable all tracepoints
ktap -e "trace *:* { print(argevent) }"
syscall tracing on target process
ktap -e "trace syscalls:* { print(argevent) }" -- ls
ftrace (kernel newer than 3.3, and must be compiled with CONFIG_FUNCTION_TRACER)
ktap -e "trace ftrace:function { print(argevent) }"
ktap -e "trace ftrace:function /ip==mutex*/ { print(argevent) }"
simple syscall tracing
trace syscalls:* {
    print(cpu(), pid(), execname(), argevent)
}
syscall tracing in histogram style
var s = {}
trace syscalls:sys_enter_* {
    s[argname] += 1
}
trace_end {
    histogram(s)
}
kprobe tracing
trace probe:do_sys_open dfd=%di fname=%dx flags=%cx mode=+4($stack) {
    print("entry:", execname(), argevent)
}
trace probe:do_sys_open%return fd=$retval {
    print("exit:", execname(), argevent)
}
uprobe tracing
trace probe:/lib/libc.so.6:malloc {
    print("entry:", execname(), argevent)
}
trace probe:/lib/libc.so.6:malloc%return {
    print("exit:", execname(), argevent)
}
stapsdt tracing (userspace static marker)
trace sdt:/lib64/libc.so.6:lll_futex_wake {
    print("lll_futex_wake", execname(), argevent)
}
or:
# trace all static markers in libc
trace sdt:/lib64/libc.so.6:* {
    print(execname(), argevent)
}
timer
tick-1ms {
    printf("time fired on one cpu\n");
}
profile-2s {
    printf("time fired on every cpu\n");
}
FFI (call kernel functions from a ktap script; ktap must be compiled with FFI=1)
ffi.cdef[[
    int printk(char *fmt, ...);
]]
ffi.C.printk("This message is called from ktap ffi\n")
More examples can be found in the samples directory.
Here is the complete syntax of ktap in extended BNF. (based on Lua syntax: http://www.lua.org/manual/5.1/manual.html#5.1)
chunk ::= {stat [';']} [laststat [';']]
block ::= chunk
stat ::= varlist '=' explist |
         functioncall |
         { block } |
         while exp { block } |
         repeat block until exp |
         if exp { block {elseif exp { block }} [else block] } |
         for Name '=' exp ',' exp [',' exp] { block } |
         for namelist in explist { block } |
         function funcname funcbody |
         function Name funcbody |
         var namelist ['=' explist]
laststat ::= return [explist] | break
funcname ::= Name {'.' Name} [':' Name]
varlist ::= var {',' var}
var ::= Name | prefixexp '[' exp ']'| prefixexp '.' Name
namelist ::= Name {',' Name}
explist ::= {exp ','} exp
exp ::= nil | false | true | Number | String | '...' | function |
        prefixexp | tableconstructor | exp binop exp | unop exp
prefixexp ::= var | functioncall | '(' exp ')'
functioncall ::= prefixexp args | prefixexp ':' Name args
args ::= '(' [explist] ')' | tableconstructor | String
function ::= function funcbody
funcbody ::= '(' [parlist] ')' { block }
parlist ::= namelist [',' '...'] | '...'
tableconstructor ::= '{' [fieldlist] '}'
fieldlist ::= field {fieldsep field} [fieldsep]
field ::= '[' exp ']' '=' exp | Name '=' exp | exp
fieldsep ::= ',' | ';'
binop ::= '+' | '-' | '*' | '/' | '^' | '%' | '..' |
          '<' | '<=' | '>' | '>=' | '==' | '!=' |
          and | or
unop ::= '-'