Exploring the Linux Storage Path - Tracing block I/O kernel events

Introduction

Typically, when an operating system is deployed, it uses a generic configuration intended to provide fair performance for any kind of application. The term application, in this paper, refers to any software executed on top of the operating system. Therefore, databases, web servers, e-mail servers, in-house software, etc., are all referred to here by the generic term "application".

The behavior of each application depends on how it was designed and developed, but also on how it is used. In the end, the behavior of the application is reflected in the behavior of the operating system and, given the limitations of the underlying hardware in fulfilling the operating system's requests, the application behavior is influenced in return.

If we consider the data flow, from the application down to the hardware, there are several layers of software and hardware working together, which we can call the I/O path.

The objective of this work is to present a method to analyze the I/O path and to show how to tune the operating system to improve the performance of block I/O operations.

Enabling the Linux kernel to trace block I/O events

When the kernel is built with tracing support, the compiler inserts a call to a profiling routine at the beginning of every traceable kernel function; at boot time each of these calls is patched into a small 5-byte No-Operation (NOP) instruction. When tracing is enabled, the NOPs are patched back into tracer calls, which are used to gather, for example, the timestamp of each function entry.

When tracing is disabled, the overhead of these NOP instructions is negligible.

Load the Linux kernel configuration interface, specifying an alternate directory to hold the new kernel:

# cd /usr/src/linux
# make O=/fs1/newkernel menuconfig
  • select "General setup"
  • add any string to identify this kernel:
    (-ftrace) Local version - append to kernel release
  • select <Exit>
  • select "Kernel hacking"
  • select "Tracers"
  • pressing "?", you will see:

        CONFIG_FTRACE:
        Enable the kernel tracing infrastructure.
        Symbol: FTRACE [=n]
        Prompt: Tracers
        Defined at kernel/trace/Kconfig:118
        Depends on: TRACING_SUPPORT [=y]
        Location:
          -> Kernel hacking
  • press "y" to include it:
  • [*] Tracers --->
  • press <enter> to go to the submenus
  • pressing "y", select the items that make sense for the type of debugging you intend to do:
  • [*] Kernel Function Tracer
  • [*] Sysprof Tracer
  • [*] Scheduling Latency Tracer
  • [*] Trace syscalls
  • [*] Trace boot initcalls
  • [*] Support for tracing block io actions
  • [*] Kernel function profiler
  • [*] Perform a startup test on ftrace
  • [*] Memory mapped IO tracing

Press <Exit> until you return to the prompt. A message like this will appear:

#
# configuration written to .config
#
*** End of Linux kernel configuration.
*** Execute 'make' to build the kernel or try 'make help'.

You can confirm your choices by doing something like this:

# grep -i tracer /fs1/newkernel/.config
CONFIG_NOP_TRACER=y
CONFIG_HAVE_FUNCTION_TRACER=y
CONFIG_HAVE_FUNCTION_GRAPH_TRACER=y
CONFIG_TRACER_MAX_TRACE=y
CONFIG_CONTEXT_SWITCH_TRACER=y
CONFIG_GENERIC_TRACER=y
CONFIG_FUNCTION_TRACER=y
CONFIG_FUNCTION_GRAPH_TRACER=y
CONFIG_IRQSOFF_TRACER=y
CONFIG_SYSPROF_TRACER=y
CONFIG_SCHED_TRACER=y
CONFIG_BOOT_TRACER=y

Compile the kernel

# make O=/fs1/newkernel

Install the new kernel and modules

# sudo make O=/fs1/newkernel modules_install install
sh /usr/src/linux-2.6.32.12-0.7/arch/x86/boot/install.sh 2.6.32.12-ftrace arch/x86/boot/bzImage \
System.map "/boot"
Kernel image:   /boot/vmlinuz-2.6.32.12-ftrace
Initrd image:   /boot/initrd-2.6.32.12-ftrace
Root device:    /dev/sda3 (mounted on / as ext3)
Resume device:  /dev/sda1
Kernel Modules: hwmon thermal_sys thermal scsi_mod scsi_transport_spi mptbase mptscsih mptspi libata ata_piix ata_generic ide-core piix ide-pci-generic processor fan jbd mbcache ext3 edd crc-t10dif sd_mod usbcore ohci-hcd ehci-hcd uhci-hcd hid usbhid
Features:       block usb resume.userspace resume.kernel
Bootsplash:     SLES (1024x768)
80675 blocks
New entry automatically inserted in the /boot/grub/menu.lst file:
title Ftrace -- SUSE Linux Enterprise Server 11 SP1 - 2.6.32.12
    root (hd0,2)
    kernel /boot/vmlinuz-2.6.32.12-ftrace root=/dev/sda3 resume=/dev/sda1 splash=silent crashkernel=128M-:64M showopts vga=0x317
    initrd /boot/initrd-2.6.32.12-ftrace

As the Release Notes for SUSE Linux Enterprise Server 11 state, every kernel module has a flag 'supported', which may assume the values "yes", "external", or "" (empty, not set, "unsupported"), and all modules of a recompiled kernel are marked as unsupported by default. Therefore, if the machine is rebooted now in order to activate the ftrace kernel, the recompiled modules will not be loaded, failing with "unsupported module" messages.

To work around this problem, edit the file /etc/modprobe.d/unsupported-modules and assign the value 1 to the attribute allow_unsupported_modules. Afterwards, a new initrd has to be built.
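A minimal way to make that change, assuming the stock SLES file where the flag appears as "allow_unsupported_modules 0":

 # sed -i 's/^allow_unsupported_modules 0$/allow_unsupported_modules 1/' /etc/modprobe.d/unsupported-modules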

Rename the initrd created during the compilation process:

 # mv initrd-2.6.32.12-ftrace initrd-2.6.32.12-ftrace.from_compilation

Create a new initrd for the ftrace kernel

# mkinitrd -k vmlinuz-2.6.32.12-ftrace -i initrd-2.6.32.12-ftrace

Kernel image:   /boot/vmlinuz-2.6.32.12-ftrace
Initrd image:   /boot/initrd-2.6.32.12-ftrace
Root device:    /dev/disk/by-id/scsi-3600605b0012083a0ff00023621f9fb85-part3 (/dev/sda3) (mounted on / as ext3)
Resume device:  /dev/disk/by-id/scsi-3600605b0012083a0ff00023621f9fb85-part1 (/dev/sda1)
Kernel Modules: scsi_mod megaraid_sas hwmon thermal_sys processor thermal libata ata_piix ata_generic ide-core ide-pci-generic fan jbd mbcache ext3 edd crc-t10dif sd_mod usbcore ohci-hcd ehci-hcd uhci-hcd hid usbhid
Features:       block usb resume.userspace resume.kernel
Bootsplash:     SLES (1024x768)
77176 blocks

Now, reboot the machine:

# shutdown -r now

Ftrace overview

Ftrace uses the debugfs file system to hold its control files and the files used for output. debugfs is automatically configured into the kernel when ftrace is configured, and its mount point directory /sys/kernel/debug is also created.

debugfs may already be listed in /etc/fstab, in which case it is mounted automatically at boot time. Verify whether it is already mounted by checking the file /etc/mtab. If it is not mounted, mount it:

# mount -t debugfs nodev /sys/kernel/debug
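To have it mounted automatically at every boot, an entry like this can be added to /etc/fstab:

 debugfs  /sys/kernel/debug  debugfs  defaults  0 0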

Go to the ftrace working directory.

 # cd /sys/kernel/debug/tracing

You can list all the functions that ftrace is able to trace by looking into the file available_filter_functions.
Although the name "ftrace" stands for "function tracer", there are more tracers available than just function tracing. You can see which tracers have been compiled into the kernel by looking into the file available_tracers, and choose which tracer to use by echoing its name into the file current_tracer. If the file current_tracer contains the word "nop", no tracer has been chosen.
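For example, on a kernel configured as above, the contents should look similar to this (the exact list depends on which tracers were selected during the kernel configuration):

 # cat available_tracers
 blk function_graph mmiotrace sysprof function sched_switch nop
 # echo function > current_tracer
 # cat current_tracer
 function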

The file named "trace" holds the output of what is being traced, in human-readable format. It is also possible to control the level of information in this file by changing the contents of the file trace_options. To see which options are available, just list its contents. The words beginning with the string "no" represent trace options that are disabled. To enable an option, echo in the option name without the string "no"; to disable it, echo the name with "no" prepended. For example:

To enable the option stacktrace:

# echo stacktrace > trace_options

To disable the option stacktrace:

# echo nostacktrace > trace_options

The start and stop of the tracing activity is controlled by the file tracing_enabled, by echoing 1 and 0 into it, respectively. The recording into the ring buffer, which is what we see in the "trace" file, is controlled by echoing 1 and 0 into the file tracing_on. When tracing_on holds 0 but tracing_enabled holds 1, the calls made by the tracers still happen, which implies some overhead, but these calls notice that the ring buffer is not recording and therefore will not write any data into it.

The amount of data recorded by the tracer depends on the size of its buffer, which is controlled by the file buffer_size_kb. The number in it represents the buffer size, in kilobytes, allocated per CPU. In order to modify this file, the tracing activity must be stopped and no tracer may be selected. For example:

 # echo 0 > tracing_enabled
# echo nop > current_tracer
# echo 1000 > buffer_size_kb

A very interesting feature of ftrace is its ability to trace events, which are grouped by subsystem. They are listed in the file available_events in the format <subsystem>:<event>. It is possible to enable the tracing of a specific event, of all the events of a given subsystem, or of all the events available. For example:

To enable the tracing of the specific event block_bio_complete:

 # echo 1 > events/block/block_bio_complete/enable

To enable the tracing of all events of the block subsystem:

 # echo 1 > events/block/enable

To enable the tracing of all the available events:

 # echo 1 > events/enable

The events enabled are listed in the file set_event. It is also possible to enable events by echoing them directly into this file.
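For example, assuming the block events are available on this kernel:

 # echo block:block_rq_issue > set_event
 # echo block:block_rq_complete >> set_event
 # cat set_event
 block:block_rq_issue
 block:block_rq_complete

Note that echoing with ">" replaces the current list, while ">>" appends to it.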

Monitoring the block I/O

Now, we proceed with the monitoring of the block I/O activity.

Verify that the tracing activity is enabled:

 # cat tracing_enabled
1

Specify that all the block I/O events are to be traced:

 # echo 0 > tracing_on
# echo blk > current_tracer
# echo 1 > events/block/enable

# cat set_event
block:block_rq_abort
block:block_rq_insert
block:block_rq_issue
block:block_rq_requeue
block:block_rq_complete
block:block_bio_bounce
block:block_bio_complete
block:block_bio_backmerge
block:block_bio_frontmerge
block:block_bio_queue
block:block_getrq
block:block_sleeprq
block:block_plug
block:block_unplug_timer
block:block_unplug_io
block:block_split
block:block_remap
block:block_rq_remap

Now, start the capture of the tracing information, execute the application that you want to analyze and stop the capture:

 # echo 1 > tracing_on
<<< execute the application >>>
# echo 0 > tracing_on

See the captured tracing information:

 # cat trace
 
tracer: blk
  flush-8:0-1117  [000]  1677.505600: block_remap: 8,0 W 1831633 + 8 <- (8,2) 787408
  flush-8:0-1117  [000]  1677.505603: block_bio_queue: 8,0 W 1831633 + 8 [flush-8:0]
  flush-8:0-1117  [000]  1677.505607: block_getrq: 8,0 W 1831633 + 8 [flush-8:0]
  flush-8:0-1117  [000]  1677.505610: block_plug: [flush-8:0]
  flush-8:0-1117  [000]  1677.505611: block_rq_insert: 8,0 W 0 () 1831633 + 8 [flush-8:0]
     <idle>-0     [000]  1677.509490: block_unplug_timer: [swapper] 1
  kblockd/0-18    [000]  1677.509502: block_unplug_io: [kblockd/0] 1
  kblockd/0-18    [000]  1677.509507: block_rq_issue: 8,0 W 0 () 1831633 + 8 [kblockd/0]
     <idle>-0     [000]  1677.510832: block_rq_complete: 8,0 W () 1831633 + 8 [0]
         dd-7382  [001]  1678.034199: block_bio_queue: 8,16 R 267280 + 8 [dd]
         dd-7382  [001]  1678.034212: block_getrq: 8,16 R 267280 + 8 [dd]
         dd-7382  [001]  1678.034215: block_plug: [dd]
         dd-7382  [001]  1678.034216: block_rq_insert: 8,16 R 0 () 267280 + 8 [dd]
         dd-7382  [001]  1678.034220: block_unplug_io: [dd] 1
         dd-7382  [001]  1678.034222: block_rq_issue: 8,16 R 0 () 267280 + 8 [dd]
     <idle>-0     [001]  1678.277551: block_rq_complete: 8,16 R () 267280 + 8 [0]
         dd-7382  [001]  1678.277763: block_bio_queue: 8,16 R 267272 + 8 [dd]
         dd-7382  [001]  1678.277767: block_getrq: 8,16 R 267272 + 8 [dd]
         dd-7382  [001]  1678.277769: block_plug: [dd]
         dd-7382  [001]  1678.277770: block_rq_insert: 8,16 R 0 () 267272 + 8 [dd]
         dd-7382  [001]  1678.277773: block_unplug_io: [dd] 1
         dd-7382  [001]  1678.277775: block_rq_issue: 8,16 R 0 () 267272 + 8 [dd]
     <idle>-0     [001]  1678.288475: block_rq_complete: 8,16 R () 267272 + 8 [0]

To clear the trace output before a new execution:

 # echo 0 > trace

Let's use the latency format of the trace options:

 # echo latency-format > trace_options
# echo 1 > tracing_on
<<< execute the application >>>
# echo 0 > tracing_on

# cat trace
flush-8:-12819   0..... 115357738us+: block_bio_queue: 8,16 W 267280 + 8 [flush-8:16]
flush-8:-12819   0..... 115357753us+: block_getrq: 8,16 W 267280 + 8 [flush-8:16]
flush-8:-12819   0d.... 115357755us : block_plug: [flush-8:16]
flush-8:-12819   0d.... 115357756us+: block_rq_insert: 8,16 W 0 () 267280 + 8 [flush-8:16]
flush-8:-12819   0..... 115357759us : block_bio_queue: 8,16 W 267288 + 8 [flush-8:16]
flush-8:-12819   0d.... 115357761us+: block_bio_backmerge: 8,16 W 267288 + 8 [flush-8:16]
flush-8:-12819   0..... 115357763us : block_bio_queue: 8,16 W 393216 + 8 [flush-8:16]
flush-8:-12819   0..... 115357764us : block_getrq: 8,16 W 393216 + 8 [flush-8:16]
flush-8:-12819   0d.... 115357764us!: block_rq_insert: 8,16 W 0 () 393216 + 8 [flush-8:16]
  <idle>-0       0..s.. 115360522us+: block_unplug_timer: [swapper] 2
kblockd/-18      0..... 115360544us+: block_unplug_io: [kblockd/0] 2
kblockd/-18      0d.... 115360548us!: block_rq_issue: 8,16 W 0 () 267280 + 16 [kblockd/0]
kblockd/-18      0d.... 115360688us!: block_rq_issue: 8,16 W 0 () 393216 + 8 [kblockd/0]
  <idle>-0       0..s.. 115362079us!: block_rq_complete: 8,16 W () 267280 + 16 [0]
  <idle>-0       0..s.. 115362837us : block_rq_complete: 8,16 W () 393216 + 8 [0]

Now, the question is: what are the meanings of the columns?

In the first column, we have the kernel thread (or process) name and its PID.

In the second column, we have several fields:

  • field 1 - number of the CPU that processed the call (0 in the example)
  • field 2 - whether interrupts are disabled: 'd' interrupts are disabled, '.' otherwise
  • field 3 - whether a reschedule has been requested: 'N' the task's need_resched flag is set, '.' otherwise
  • field 4 - whether it is running in an interrupt context (hardirq/softirq): 'H' hard irq occurred inside a softirq, 'h' hard irq is running, 's' soft irq is running, '.' normal context
  • field 5 - whether preemption has been disabled: the level of preempt_disabled (an integer number)

When using the latency trace option, the timestamp is relative to the start of the trace, in microseconds.

The field after the timestamp and before the colon calls attention to especially long delays. The meanings of the symbols are:

  • '!' - greater than preempt_mark_thresh (default 100 microseconds)
  • '+' - greater than 1 microsecond
  • ' ' - less than or equal to 1 microsecond

Next, we have the block I/O events. The meanings of all of them are given below. Let's describe the structure of two of them:

 block_bio_queue: 8,16 W 267288 + 8 [flush-8:16]

<event name>: <device major number>,<device minor number> <type of I/O operation> <sector number> + <number of sectors> [kernel thread]

block_rq_complete: 8,16 W () 267280 + 16 [0]

<event name>: <device major number>,<device minor number> <type of I/O operation> (<command string; empty for filesystem requests>) <sector number> + <number of sectors> [errors]

What to look for?

The most obvious observation is the elapsed time to complete the I/O operation. It is possible to infer this time by subtracting the timestamp of the block_rq_issue from the timestamp of the corresponding block_rq_complete. This reveals the time taken by the device driver to process the I/O request, which in most cases includes the SAN and storage unit processing time. Depending on the vendor/model of the storage unit and the configuration of its LUNs, an elapsed time between 5 ms and 10 ms is adequate.
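As a sketch of how this subtraction can be automated, the following awk script pairs each block_rq_issue with the next block_rq_complete for the same device and sector and prints the elapsed time in milliseconds. It assumes the default (non-latency) trace format shown above, where the sector is the 9th field of an issue line and the 8th field of a complete line; adjust the field positions if your kernel prints a different layout:

 # awk '
   /block_rq_issue: /    { ts = $3; sub(/:/, "", ts); issue[$5 " " $9] = ts }
   /block_rq_complete: / { ts = $3; sub(/:/, "", ts); key = $5 " " $8
                           if (key in issue) {
                               printf "%s sector %s: %.3f ms\n", $5, $8, (ts - issue[key]) * 1000
                               delete issue[key]
                           }
                         }' trace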

The second observation would be any errors in the block_rq_complete events, which are revealed by the value between brackets [ ]. Further investigation of these errors is recommended.

If events of the type block_rq_requeue are observed, that could mean the SAN or the storage unit is not able to handle the I/O operations, due to high traffic for example. Verify the SAN zoning configuration, the LUN mappings in the storage unit controllers, and the HBA driver settings on the server side. In the multipath driver, verify and test the different path-selection algorithms for the next I/O operation.

The event block_bio_bounce means that it is not possible to transfer data directly between the block I/O data memory and the device driver memory. This could indicate an error in the configuration of the device driver memory allocation.

The events block_bio_backmerge, block_bio_frontmerge and block_plug mean that I/O requests are being aggregated before being sent to the device driver. These events don't necessarily reveal a problem, since the idea here is to execute I/O in larger chunks in order to improve performance. On the other hand, whether this behavior is good or not depends on how the other components are configured and whether they match your application's I/O profile.

The number of in-flight I/O operations can be modified, considering that it is sometimes better to have the device queue depth smaller than the scheduler depth:

 # echo 256 > /sys/block/sdc/queue/nr_requests
# echo 128 > /sys/block/sdc/device/queue_depth
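The current values can be verified by reading the same files back:

 # cat /sys/block/sdc/queue/nr_requests
 256
 # cat /sys/block/sdc/device/queue_depth
 128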

The kernel I/O scheduler may also affect these events. You can change it for each specific disk and read the file back to confirm; the scheduler between brackets is the active one:

 # echo cfq > /sys/block/sdc/queue/scheduler
 # cat /sys/block/sdc/queue/scheduler
 noop anticipatory deadline [cfq]

If the application has an I/O profile in which read operations are more important than writes, it may be recommended to increase the prefetch (read-ahead) size, for example for the disk sdc:

 # echo 2048 > /sys/block/sdc/queue/read_ahead_kb
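The same setting can also be inspected or changed with the blockdev utility, which works in units of 512-byte sectors instead of kilobytes (so 4096 sectors correspond to the 2048 KB above):

 # blockdev --setra 4096 /dev/sdc
 # blockdev --getra /dev/sdc
 4096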

The prefetch size can, and should, also be increased in the storage unit.

The event block_split may reveal that an I/O alignment problem exists between the layers: application, filesystem, LVM, partitioning and the RAID configuration.

In this case, when creating filesystems using mkfs, use the extended options to inform the ext2/3/4 driver that the underlying disk is actually a RAID array:

   -E stride=<stride-size>,stripe-width=<stripe-width>
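As a worked example, consider a hypothetical RAID-5 array with 4 data disks and a 64 KB chunk size, formatted with 4 KB filesystem blocks: stride = 64 KB / 4 KB = 16 blocks, and stripe-width = 16 blocks x 4 data disks = 64 blocks:

 # mkfs.ext3 -b 4096 -E stride=16,stripe-width=64 /dev/sdc1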

Also, create the LVM physical volume aligned with the underlying RAID disk:

 # pvcreate -M2 --dataalignment <chunk size> <device>
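For example, assuming a hypothetical /dev/sdc1 partition on a RAID array with a 64 KB chunk size:

 # pvcreate -M2 --dataalignment 64k /dev/sdc1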

In summary, there are many layers to be investigated, whenever they exist: application, filesystem, LVM, partitioning, software raid, multipath driver, HBA driver, SAN and Storage Unit.

Description of the block I/O events

block_rq_abort - Abort Block Operation Request
@q: queue containing the block operation request
@rq: block IO operation request
Called immediately after a pending block IO operation request @rq in queue @q is aborted. The fields in the operation request @rq can be examined to determine which device and sectors the pending operation would have accessed.

block_rq_insert - Insert Block Operation Request into Queue
@q: target queue
@rq: block IO operation request
Called immediately before block operation request @rq is inserted into queue @q. The fields in the operation request @rq struct can be examined to determine which device and sectors the pending operation would access.

block_rq_issue - Issue Pending Block IO Request Operation to Device Driver
@q: queue holding operation
@rq: block IO operation request
Called when block operation request @rq from queue @q is sent to a device driver for processing.

block_rq_requeue - Place Block IO Request Back on a Queue
@q: queue holding operation
@rq: block IO operation request
The block operation request @rq is being placed back into queue @q. For some reason the request was not completed and needs to be put back in the queue.

block_rq_complete - Block IO Operation Completed by Device Driver
@q: queue containing the block operation request
@rq: block operations request
The block_rq_complete tracepoint event indicates that some portion of the operation request has been completed by the device driver. If @rq->bio is NULL, then there is absolutely no additional work to do for the request. If @rq->bio is non-NULL, then additional work is required to complete the request.

block_bio_bounce - Used Bounce Buffer When Processing Block Operation
@q: queue holding the block operation
@bio: block operation
A bounce buffer was used to handle the block operation @bio in @q. This occurs when hardware limitations prevent a direct transfer of data between the @bio data memory area and the IO device. Use of a bounce buffer requires extra copying of data and decreases performance.

block_bio_complete - Completed All Work on the Block Operation
@q: queue holding the block operation
@bio: block operation completed
This tracepoint indicates there is no further work to do on this block IO operation @bio.

block_bio_backmerge - Merging Block Operation to the End of an Existing Operation
@q: queue holding operation
@bio: new block operation to merge
Merging block request @bio to the end of an existing block request in queue @q.

block_bio_frontmerge - Merging Block Operation to the beginning of an Existing Operation
@q: queue holding operation
@bio: new block operation to merge
Merging block IO operation @bio to the beginning of an existing block operation in queue @q.

block_bio_queue - Putting New Block IO Operation in Queue
@q: queue holding operation
@bio: new block operation
About to place the block IO operation @bio into queue @q.

block_getrq - Get a Free Request Entry in Queue for Block IO Operations
@q: queue for operations
@bio: pending block IO operation
@rw: low bit indicates a read (%0) or a write (%1)
A request struct for queue @q has been allocated to handle the block IO operation @bio.

block_sleeprq - Waiting to Get a Free Request Entry in Queue for Block IO Operation
@q: queue for operation
@bio: pending block IO operation
@rw: low bit indicates a read (%0) or a write (%1)
In the case where a request struct cannot be provided for queue @q, the process needs to wait for a request struct to become available. This tracepoint event is generated each time the process goes to sleep waiting for a request struct to become available.

block_plug - Keep Operations Requests in Request Queue
@q: request queue to plug
Plug the request queue @q. Do not allow block operation requests to be sent to the device driver. Instead, accumulate requests in the queue to improve throughput performance of the block device.

block_unplug_timer - Timed Release of Operations Requests in Queue to Device Driver
@q: request queue to unplug
Unplug the request queue @q because a timer expired and allow block operation requests to be sent to the device driver.

block_unplug_io - Release of Operations Requests in Request Queue
@q: request queue to unplug
Unplug request queue @q because device driver is scheduled to work on elements in the request queue.

block_split - Split a Single bio struct into Two bio structs
@q: queue containing the bio
@bio: block operation being split
@new_sector: The starting sector for the new bio
The bio request @bio in request queue @q needs to be split into two bio requests. The newly created @bio request starts at @new_sector. This split may be required due to a hardware limitation, such as the operation crossing device boundaries in a RAID system.

block_remap - Map Request for a Partition to the Raw Device
@q: queue holding the operation
@bio: revised operation
@dev: device for the operation
@from: original sector for the operation
An operation for a partition on a block device has been mapped to the raw block device.
