The debug method of kernel deadlock~

When a deadlock occurs, what methods can be used to find and solve related problems.

lockdep deadlock detection module

The Linux kernel provides a deadlock debugging module Lockdep, which tracks the state of each lock and the dependencies between each lock, and goes through a series of verification rules to ensure that the dependencies between the locks are correct. The kernel has been introduced in 2006 (https://lwn.net/Articles/185666/).

Enable Lockdep

Locks detected by Lockdep include spinlock, rwlock, mutex, rwsem deadlocks, incorrect release of locks, and sleep errors in atomic operations.

The following are the lockcep kernel options and their explanations:

CONFIG_DEBUG_RT_MUTEXES=y
detects the deadlock of rt mutex and automatically reports the deadlock site information.

CONFIG_DEBUG_SPINLOCK=y
detects issues such as uninitialized use of spinlock. Used with NMI watchdog, spinlock deadlock can be found.

CONFIG_DEBUG_MUTEXES=y
detects and reports mutex errors

CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y
detects the slowpath test of wait/wound type mutex.

CONFIG_DEBUG_LOCK_ALLOC=y
detects that the lock in use (spinlock/rwlock/mutex/rwsem) is released, or the lock in use is reinitialized, or the lock is held when the process exits.

CONFIG_PROVE_LOCKING=y
enables the kernel to report deadlock details before the deadlock occurs. See /proc/lockdep_chains.

CONFIG_LOCKDEP=y
the master switch of the entire Lockdep. See /proc/lockdep, /proc/lockdep_stats.

CONFIG_LOCK_STAT=y
remembers the information of the lock holding competition area, including waiting time, holding time and so on. See /proc/lock_stat.

CONFIG_DEBUG_LOCKDEP=y
will perform more self-checks during the use of Lockdep, which will increase a lot of extra overhead.

CONFIG_DEBUG_ATOMIC_SLEEP=y
Sleeping in atomic p may cause many unpredictable problems. These atomic p include spinlock locks, rcu read operations, prohibition of kernel preemption, interrupt processing, and so on.

Lock related kernel nodes

If /proc/sys/kernel/lock_stat is
set, you can view the statistics of /proc/lock_stat, and turn off the statistics of lockdep if it is clear.

/proc/sys/kernel/max_lock_depth

/proc/sys/kernel/prove_locking

/proc/locks

/proc/lock_stat
on lock usage statistics

/proc/lockdep
has a dependent lock

/proc/lockdep_stats
has dependency lock statistics

/proc/lockdep_chains
dependency lock linked list

The kernel also provides Tracepoint to assist in discovering lock usage problems: /sys/kernel/debug/tracing/events/lock.

Lockdep test

Test spin_lock deadlock

void hack_spinAB(void)
{
  printk("hack_lockdep:A->B\n");
  spin_lock(&hack_spinA);
  spin_lock(&hack_spinB);
}

void hack_spinBA(void)
{
  printk("hack_lockdep:B->A\n");
  spin_lock(&hack_spinB);
}

static int __init lockdep_test_init(void)
{
  printk("al: lockdep error test init\n");
  hack_spinAB();
  hack_spinBA();
  return 0;
}

After execution, you can probably know the type of deadlock from the deadlock description . Then it introduces the deadlock point in detail . At this time, you can roughly know which lock is and what calls have caused the deadlock. Next is the detailed backtrace of the deadlock , which helps to analyze the stack backtrace when the deadlock occurs.

al: lockdep error test init
hack_lockdep:A->B
hack_lockdep:B->A

=============================================
[ INFO: possible recursive locking detected ]--------检测到的死锁描述:递归死锁类型
4.0.0+ #87 Tainted: G O 
insmod/658 is trying to acquire lock:--------死锁细节描述:欲持锁点和已持锁点
(hack_spinB){+.+...}, at: [<bf002030>] lockdep_test_init+0x30/0x3c [lock]--------lockdep_test_init中调用hack_spinBA再次持有hack_spinB锁

but task is already holding lock:
(hack_spinB){+.+...}, at: [<bf000038>] hack_spinAB+0x38/0x3c [lock]--------hack_spinB已经在hack_spinAB函数中被持有

other info that might help us debug this:--------锁的其它补充信息
Possible unsafe locking scenario:

CPU0
----
lock(hack_spinB);
lock(hack_spinB);

*** DEADLOCK ***

May be due to missing lock nesting notation

2 locks held by insmod/658:--------进程共持有两个锁
#0: (hack_spinA){+.+...}, at: [<bf000030>] hack_spinAB+0x30/0x3c [lock]
#1: (hack_spinB){+.+...}, at: [<bf000038>] hack_spinAB+0x38/0x3c [lock]

stack backtrace:--------栈回溯信息:可以看出从lockdep_test_init->_raw_spin_lock->lock_acquire的调用路径。
CPU: 0 PID: 658 Comm: insmod Tainted: G O 4.0.0+ #87
Hardware name: ARM-Versatile Express
[<c00171b4>] (unwind_backtrace) from [<c0012e7c>] (show_stack+0x20/0x24)
[<c0012e7c>] (show_stack) from [<c05ade10>] (dump_stack+0x8c/0xb4)
[<c05ade10>] (dump_stack) from [<c006b988>] (__lock_acquire+0x1aa4/0x1f64)
[<c006b988>] (__lock_acquire) from [<c006c55c>] (lock_acquire+0xf4/0x190)
[<c006c55c>] (lock_acquire) from [<c05b4ec8>] (_raw_spin_lock+0x60/0x98)
[<c05b4ec8>] (_raw_spin_lock) from [<bf002030>] (lockdep_test_init+0x30/0x3c [lock])
......
[<c00a4ddc>] (load_module) from [<c00a55fc>] (SyS_init_module+0x140/0x150)
[<c00a55fc>] (SyS_init_module) from [<c000ec80>] (ret_fast_syscall+0x0/0x4c)
INFO: rcu_sched self-detected stall on CPU
0: (2099 ticks this GP) idle=5ed/140000000000001/0 softirq=13024/13024 fqs=1783 
(t=2100 jiffies g=-51 c=-52 q=22)
Task dump for CPU 0:
insmod R running 0 658 657 0x00000002
[<c00171b4>] (unwind_backtrace) from [<c0012e7c>] (show_stack+0x20/0x24)
[<c0012e7c>] (show_stack) from [<c0052874>] (sched_show_task+0x128/0x184)
......
[<c0008740>] (gic_handle_irq) from [<c0013a44>] (__irq_svc+0x44/0x5c)
Exception stack(0xed5c9d18 to 0xed5c9d60)
9d00: 00000000 00010000
9d20: 0000ffff c02f3898 bf0001b0 c0b1d248 123cc000 00000000 0c99b2c5 00000000
9d40: 00000000 ed5c9d84 ed5c9d60 ed5c9d60 c0070cb4 c0070cb4 60000013 ffffffff
[<c0013a44>] (__irq_svc) from [<c0070cb4>] (do_raw_spin_lock+0xf0/0x1e0)
[<c0070cb4>] (do_raw_spin_lock) from [<c05b4eec>] (_raw_spin_lock+0x84/0x98)
[<c05b4eec>] (_raw_spin_lock) from [<bf002030>] (lockdep_test_init+0x30/0x3c [lock])
......
BUG: spinlock lockup suspected on CPU#0, insmod/658--------错误类型是spinlock,下面的backtrace和上面基本一致。
lock: hack_spinB+0x0/0xfffffedc [lock], .magic: dead4ead, .owner: insmod/658, .owner_cpu: 0--------发生死锁的是hack_spinB
CPU: 0 PID: 658 Comm: insmod Tainted: G O 4.0.0+ #87
Hardware name: ARM-Versatile Express
[<c00171b4>] (unwind_backtrace) from [<c0012e7c>] (show_stack+0x20/0x24)
[<c0012e7c>] (show_stack) from [<c05ade10>] (dump_stack+0x8c/0xb4)
[<c05ade10>] (dump_stack) from [<c0070b2c>] (spin_dump+0x8c/0xd0)
......

Test mutex deadlock

======================================================
[ INFO: possible circular locking dependency detected ]
4.0.0+ #92 Tainted: G           O   
-------------------------------------------------------
kworker/1:1/343 is trying to acquire lock:
 (mutex_a){+.+...}, at: [<bf000080>] lockdep_test_worker+0x24/0x58 [mutexlock]

but task is already holding lock:
 ((&(&delay_task)->work)){+.+...}, at: [<c0041078>] process_one_work+0x130/0x60c

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 ((&(&delay_task)->work)){+.+...}:
       [<c00406f4>] flush_work+0x4c/0x2bc
       [<c0041cc4>] __cancel_work_timer+0xa8/0x1d0
       [<c0041e28>] cancel_delayed_work_sync+0x1c/0x20
       [<bf000138>] lockdep_thread+0x84/0xa4 [mutexlock]
       [<c0046ee0>] kthread+0x100/0x118
       [<c000ed50>] ret_from_fork+0x14/0x24

-> #0 (mutex_a){+.+...}:
       [<c006c55c>] lock_acquire+0xf4/0x190
       [<c05b09e4>] mutex_lock_nested+0x90/0x480
       [<bf000080>] lockdep_test_worker+0x24/0x58 [mutexlock]
       [<c0041138>] process_one_work+0x1f0/0x60c
       [<c0041fd0>] worker_thread+0x54/0x530
       [<c0046ee0>] kthread+0x100/0x118
       [<c000ed50>] ret_from_fork+0x14/0x24

other info that might help us debug this:

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock((&(&delay_task)->work));
                               lock(mutex_a);
                               lock((&(&delay_task)->work));
  lock(mutex_a);

 *** DEADLOCK ***

2 locks held by kworker/1:1/343:
 #0:  ("events"){.+.+.+}, at: [<c0041078>] process_one_work+0x130/0x60c
 #1:  ((&(&delay_task)->work)){+.+...}, at: [<c0041078>] process_one_work+0x130/0x60c
stack backtrace:
CPU: 1 PID: 343 Comm: kworker/1:1 Tainted: G           O    4.0.0+ #92
Hardware name: ARM-Versatile Express
Workqueue: events lockdep_test_worker [mutexlock]
<c00171b4>] (unwind_backtrace) from [<c0012e7c>] (show_stack+0x20/0x24)
[<c0012e7c>] (show_stack) from [<c05ade10>] (dump_stack+0x8c/0xb4)
[<c05ade10>] (dump_stack) from [<c0065e80>] (print_circular_bug+0x21c/0x344)
[<c0065e80>] (print_circular_bug) from [<c006be44>] (__lock_acquire+0x1f60/0x1f64)
[<c006be44>] (__lock_acquire) from [<c006c55c>] (lock_acquire+0xf4/0x190)
[<c006c55c>] (lock_acquire) from [<c05b09e4>] (mutex_lock_nested+0x90/0x480)
[<c05b09e4>] (mutex_lock_nested) from [<bf000080>] (lockdep_test_worker+0x24/0x58 [mutexlock])
[<bf000080>] (lockdep_test_worker [mutexlock]) from [<c0041138>] (process_one_work+0x1f0/0x60c)
[<c0041138>] (process_one_work) from [<c0041fd0>] (worker_thread+0x54/0x530)
[<c0041fd0>] (worker_thread) from [<c0046ee0>] (kthread+0x100/0x118)
[<c0046ee0>] (kthread) from [<c000ed50>] (ret_from_fork+0x14/0x24)

If you turn on CONFIG_PROVE_LOCKING, the relevant trace will be printed:

lockdep_test_worker
  ->mutex_lock(&mutex_a)
    ->mutex_lock_nested
      ->__mutex_lock_common
        ->mutex_acquire_nest
          ->lock_acquire_exclusive
            ->lock_acquire
              ->__lock_acquire---------下面的validate_chain在打开CONFIG_PROVE_LOCKING才会进行检查。
                ->validate_chain->...->print_circular_bug

1. Inventory of domestic RISC-V core MCU manufacturers!

2. Have you paid attention to the barometer of the economic development of the embedded industry and the industrial development of the EU?

3. How high is the turnover cost of a technical employee?

4. New world, new way: 2021 International Embedded Exhibition and Conference is going digital

5. How is the architecture of the domestic Linux operating system designed?

6. Introduction to Artificial Intelligence: Neural Networks Based on Linux and Python

Disclaimer: This article is reproduced online, and the copyright belongs to the original author. If you are involved in the copyright of the work, please contact us, we will confirm the copyright based on the copyright certification materials you provide and pay the author's remuneration or delete the content.

Guess you like

Origin blog.csdn.net/DP29syM41zyGndVF/article/details/114324390