Kernel Exception Problem Analysis Guide

Learning with you for life. This is Programmer Android.

By reading this article, you will learn about the following topics:

1. Overview of Kernel Exception
2. Kernel space layout
3. Overview of printk
4. AEE db log mechanism
5. Early exception handling
6. die() process
7. panic() process
8. nested panic

1. Overview of Kernel Exception (KE)

Android OS consists of 3 layers: the bottom layer is the Kernel, the middle layer is Native bin/lib, and the top layer is the Java layer:

[Figure: Android OS 3-layer structure]

Any software can hit exceptions, such as wild pointers, runaway execution, and deadlocks.
When an exception occurs in the kernel layer, we call it a KE (kernel exception). Similarly, an exception in the Native layer is an NE, and one in the Java layer is a JE. This article focuses only on the lowest layer, KE.

1. KE category

The kernel has two crash categories, oops and panic:

  1. oops (similar to an assert, with a chance to recover)

"Oops" is a common colloquial expression among Americans. It means something unexpected, surprising, or sudden. When an oops occurs, the kernel notifies interested modules and prints various information, such as register values and stack contents.
We can debug and solve the problem based on the registers and other information printed by the oops.
When /proc/sys/kernel/panic_on_oops is 1, an oops escalates to a panic. Our default setting is 1, so every oops panics.

  2. panic

"Panic" means confusion or alarm: the Linux kernel has encountered a situation in which it does not know how to proceed. The kernel notifies interested modules and then crashes or reboots.
Some kernel code contains error checks. If an error is found, panic() may be called directly, and information is output to help with debugging.
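Whether an oops escalates to a panic can be checked from user space by reading the sysctl file named above. The following is a minimal sketch in C, not part of the original article; the helper names (parse_sysctl_flag, read_panic_on_oops, oops_will_panic) are hypothetical, and it assumes the file holds a single decimal number.

```c
#include <stdio.h>
#include <stdlib.h>

/* Parse the text of a sysctl flag file such as "1\n".
 * Returns the value, or -1 if the text is not a non-negative number. */
int parse_sysctl_flag(const char *text)
{
    char *end;
    long value;

    if (text == NULL)
        return -1;
    value = strtol(text, &end, 10);
    if (end == text || value < 0)
        return -1;
    return (int)value;
}

/* Read a flag file, normally /proc/sys/kernel/panic_on_oops.
 * Returns -1 if the file cannot be opened or parsed. */
int read_panic_on_oops(const char *path)
{
    char buf[16];
    int value = -1;
    FILE *fp = fopen(path, "r");

    if (fp == NULL)
        return -1;
    if (fgets(buf, sizeof(buf), fp) != NULL)
        value = parse_sysctl_flag(buf);
    fclose(fp);
    return value;
}

/* Nonzero when an oops will be escalated to a panic. */
int oops_will_panic(int panic_on_oops)
{
    return panic_on_oops > 0;
}
```

Calling read_panic_on_oops("/proc/sys/kernel/panic_on_oops") on a device and passing the result to oops_will_panic() tells you whether an oops on that device ends in a panic.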

2. KE common debugging methods

Every program has bugs. Bugs always appear in unexpected places. It is said that the world's first bug was a moth that flew into a relay computer. The unlucky moth got caught between the relays and caused the computer to malfunction. Because of this little bug, errors in the program are called bugs.

Where there are bugs, there is debugging, and debugging is a very personal job: ten people may have ten debugging methods. In terms of approach, however, they fall roughly into two categories: online debugging (Online Debug) and offline debugging (Offline Debug).

3. Online debugging

Online debugging means monitoring the behavior of the program while it is running and checking whether it meets expectations. It usually relies on tools such as GDB and Trace32, sometimes assisted by hardware such as emulators or JTAG. The environment is difficult to prepare and troublesome to use, so it is rarely used unless a runtime problem requires it.

4. Offline debugging

Offline debugging means collecting the required information while the program is running and analyzing it after a bug occurs. It is usually divided into two methods: Logging and Memory Dump.

Logging
Collecting logs or related information makes the execution flow of the code clearly visible. It is an effective way to analyze logic problems and, because it is simple and easy to use, it is also the most important analysis method.

Memory Dump
A memory dump saves all memory contents to external storage when an exception occurs, that is, it backs up the scene of the exception for later analysis. It is a very effective way to analyze abnormal CPU execution. On the Windows platform, after a program exception occurs, you can choose to start the debugger immediately. On the Linux platform, a core dump is written after a program exception, and this core dump can be debugged with GDB. Similar dumps can also be taken for kernel exceptions.

2. Kernel space layout

Before analyzing KE, you need to understand the kernel memory layout to know which addresses are used for what and what problems may occur.

The following important segments exist in the kernel space:

1. vmlinux code/data segment:

Like any program, the kernel has a TEXT segment (executable code), an RW segment (initialized data), and a ZI segment (uninitialized data), corresponding to .text, .data, and .bss.

2. module area:

The kernel supports loadable modules (.ko), so it needs a region to hold their code and data segments.

3. vmalloc area:

Besides memory with contiguous physical addresses, the kernel can also allocate memory that is physically discontiguous but virtually contiguous, which avoids allocation failures caused by memory fragmentation.

4. io map area:

The area reserved for mapping IO registers. Some versions have no io map area and use the vmalloc area directly.

5. memmap:

The kernel describes memory through the page structure. Each page frame has a corresponding page structure, and memmap is the array of these structures.

There are other, smaller segments not listed here, and they may differ between versions.

6. ARM64bit kernel layout

Smartphones have now moved to 64bit, so there is both a 32bit layout and a 64bit layout. They are explained one by one below.

ARM64 can use up to 48bit physical and virtual addresses (extended to 64bit, with the high bits all 1 or all 0). The Linux kernel here is currently configured with a 39bit kernel space.

A 39bit space provides up to 512GB, so the entire RAM can be mapped. Addresses from 0xFFFFFFC000000000 upward are linearly mapped (one to one), and there is no need for high memory.

Besides serving its usual purpose, the vmalloc area also maps the peripheral registers directly, so there is no separate IO map space as there is in the 32bit layout.
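The arithmetic behind the 39bit kernel space can be sketched in a few lines of C. This is an illustration built only from the two numbers given in the text (39 bits and 0xFFFFFFC000000000); the macro and function names are hypothetical, and the boundaries differ between kernel versions, as the layouts listed below show.

```c
#include <stdint.h>

/* Values taken from the text: 39bit kernel space, linear map
 * starting at 0xFFFFFFC000000000. Both vary with kernel version. */
#define KERNEL_VA_BITS   39
#define LINEAR_MAP_START UINT64_C(0xFFFFFFC000000000)

/* Size in bytes of an address space that is `bits` wide (bits < 64). */
uint64_t va_space_size(unsigned bits)
{
    return UINT64_C(1) << bits;
}

/* A kernel virtual address has every bit above the low
 * KERNEL_VA_BITS bits set to 1. */
int is_kernel_va(uint64_t addr)
{
    uint64_t high_mask = ~(va_space_size(KERNEL_VA_BITS) - 1);

    return (addr & high_mask) == high_mask;
}

/* Nonzero if the address lies in the linearly mapped RAM region. */
int is_linear_mapped(uint64_t addr)
{
    return addr >= LINEAR_MAP_START;
}
```

With these values, va_space_size(39) is 512GB, and the kernel space begins at 0xFFFFFF8000000000, so the linear map described in the text is its upper half. When a KE log reports a faulting address, a check like is_kernel_va() is a quick way to tell a plausible kernel pointer from a corrupted one.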

Different versions of the kernel have slightly different layouts:

  • kernel-3.10

[Figure: ARM64 kernel layout, kernel-3.10]

  • >= kernel-3.18 && < kernel-4.6

[Figure: ARM64 kernel layout, >= kernel-3.18 && < kernel-4.6]

  • >= kernel-4.6/N0.MP8 kernel-4.4(patch back)

[Figure: ARM64 kernel layout, >= kernel-4.6/N0.MP8 kernel-4.4(patch back)]

7. ARM32bit kernel layout

The diagram below shows the layout (some addresses may vary):

[Figure: ARM32bit kernel layout]

The entire address space is 4G: the kernel is configured with 1G, and user programs occupy 3G.

The kernel code starts at 0xC0008000, with the page table placed in front of it (starting at 0xC0004000). If modules (*.ko) are supported, the module area starts at 0xBF000000.

The kernel itself occupies only 1G, so if the RAM exceeds 1G, the kernel cannot map all of it. The solution is to map only part of it permanently, which is called low memory. The rest is mapped on demand, and the VMALLOC area is used for on-demand mapping.

ARM's peripheral registers share a unified address space with memory, so a region above 0xF0000000 is used to map peripheral registers, which makes it easy to operate hardware modules.

0xFFFF0000 is a special address where the CPU's exception vector table is stored. Most kernel exceptions are CPU exceptions (aborts, undefined instructions, and other exceptions reported by hardware such as the MMU).
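When a KE log shows a suspicious 32bit address, a first step is to see which region it falls in. The sketch below classifies an address using only the boundaries quoted above; it is an illustration, not kernel code. The enum and function names are hypothetical, the low memory and VMALLOC areas are lumped together because the text gives no boundary between them, and treating everything from 0xFFFF0000 upward as the vector area is a simplification.

```c
#include <stdint.h>

enum arm32_region {
    REGION_USER,     /* below the module area: user space          */
    REGION_MODULE,   /* 0xBF000000 and up: kernel modules (*.ko)    */
    REGION_KERNEL,   /* 0xC0000000 and up: low memory and VMALLOC   */
    REGION_IO,       /* 0xF0000000 and up: peripheral registers     */
    REGION_VECTORS   /* 0xFFFF0000 and up: exception vector table   */
};

/* Classify an address against the ARM32 layout described in the text.
 * The checks run from the highest boundary down. */
enum arm32_region classify_arm32_addr(uint32_t addr)
{
    if (addr >= 0xFFFF0000u)
        return REGION_VECTORS;
    if (addr >= 0xF0000000u)
        return REGION_IO;
    if (addr >= 0xC0000000u)
        return REGION_KERNEL;
    if (addr >= 0xBF000000u)
        return REGION_MODULE;
    return REGION_USER;
}
```

For example, a kernel fault on address 0x00000000 classifies as REGION_USER, which in kernel context usually points at a NULL pointer dereference rather than a real user-space access.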

The above is only a rough description. Check the code for the complete details (the kernel keeps evolving, and some parts may still change).

3. Overview of printk

1. kernel log

When you first learned programming, you must have used printf(). The kernel has a corresponding function called printk().

The simplest debugging method is to use printk() to print the information you want to know. The oops/panic paths described in the previous chapter also use printk() to write register and stack information to the kernel log buffer.

The kernel log can be output through the serial port, or, after an oops/panic occurs, the buffer can be saved as a file and packaged into the db. Either the serial port log or the db can then be used to debug and analyze the kernel.

A mobile phone usually keeps a serial port test point, but capturing the serial port log usually requires disassembling the phone, which is troublesome. As mentioned above, the kernel log can also be saved as a file and packaged into the db. So what is the db?

4. AEE db log mechanism

db is a file generated by a module called AEE (Android Exception Engine, integrated in Mediatek mobile software) that detects exceptions and collects exception information. It contains key information such as logs required for debugging. db is a bit like the black box of an airplane.

For KE, the db contains the following files (the db can be unpacked with the GAT tool; please refer to the FAQ in the appendix):

  • __exp_main.txt: Exception type, call stack and other key information.

  • _exp_detail.txt: Detailed exception information

  • SYS_ANDROID_LOG: android main log

  • SYS_KERNEL_LOG: kernel log

  • SYS_LAST_KMSG: kernel log before the last restart

  • SYS_MINI_RDUMP: similar to coredump, can be debugged with gdb/trace32

  • SYS_REBOOT_REASON: Information recorded by the hardware during restart.

  • SYS_VERSION_INFO: kernel version, used for comparison with vmlinux. Only matching vmlinux can be used to analyze this exception.

  • SYS_WDT_LOG: Watchdog reset information

The above files are generally sufficient to debug KE, unless some special problems require other information, such as serial port logs, etc.

1. Key information when the system restarts

In addition to keeping the last kmsg, the ram console also has important system information, which is very helpful for us to debug. This information is stored in the header ram_console_buffer of the ram console.

[Figure: ram console]

off_linux in this structure points to struct last_reboot_reason, which stores important information:

[Figure: ram console]

After the restart, this information is packaged into the SYS_REBOOT_REASON file in the db.

5. Early exception handling

1. CPU exception capture

Exceptions such as wild pointers and runaway execution are intercepted by the MMU and reported to the CPU. This entire sequence is hardware behavior.

This type of problem is difficult to locate and accounts for the majority of KEs. Typical causes include memory being trampled and pointer use after free. The exception may not occur at the moment the memory is corrupted; the crash can come later, when that memory is used.

2. Software exception capture

In kernel code, BUG(), BUG_ON(), and panic() are generally used to catch unexpected behavior. This is how software reports exceptions proactively.

These in-kernel calls make it easy to flag bugs, provide assertions, and output information. The two most commonly used are BUG() and BUG_ON(). When called, they trigger an oops, which prints a stack trace and error messages. Usage:

if (condition)
   BUG();

or:

BUG_ON(condition); /* just a thin wrapper around BUG() */

#define BUG_ON(condition) do { if (unlikely(condition)) BUG(); } while(0)

3. 32bit kernel:

BUG() is implemented by planting an undefined instruction (0xE7F001F2; remember this value, because if you see it in the log, you know that BUG()/BUG_ON() was called).

[Figure: 32bit implementation of BUG()]
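Spotting that value in a log can be automated. On a 32bit ARM kernel, the "Code:" line of an oops lists the instruction words around the PC and puts the faulting one in parentheses. The sketch below, which is an illustration and not AEE or kernel code, extracts that word and compares it with 0xE7F001F2. The function names are hypothetical, and it assumes the parenthesised word is plain hexadecimal.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Undefined instruction planted by BUG() on a 32bit ARM kernel. */
#define ARM32_BUG_INSN 0xE7F001F2UL

/* Extract the parenthesised instruction word from a "Code:" line.
 * Returns 0 on success and -1 if no "(hex)" word is found. */
int faulting_insn(const char *code_line, uint32_t *insn)
{
    const char *open = strchr(code_line, '(');
    char *end;
    unsigned long value;

    if (open == NULL)
        return -1;
    value = strtoul(open + 1, &end, 16);
    if (end == open + 1 || *end != ')')
        return -1;
    *insn = (uint32_t)value;
    return 0;
}

/* Nonzero if the faulting instruction is the BUG() trap. */
int is_bug_trap(const char *code_line)
{
    uint32_t insn;

    return faulting_insn(code_line, &insn) == 0 && insn == ARM32_BUG_INSN;
}
```

For a made-up line such as "Code: e59f0010 e1a01003 eb0a4f6d (e7f001f2)", is_bug_trap() returns nonzero, so the next step is to find the BUG()/BUG_ON() call at the reported PC rather than to hunt for memory corruption.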

4. 64bit kernel:

In the native kernel, BUG() directly calls panic():

[Figure: native 64bit implementation of BUG()]

However, Mediatek has modified the implementation of BUG() so that more debugging information is output (die() outputs register and other information):

[Figure: MTK modification]

When you see the following log, you should know that it is caused by BUG()/BUG_ON()!

[ 147.234926]<0>-(0)[122:kworker/u8:3]Unable to handle kernel paging request at virtual address 0000dead
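A log line like the one above can be split into its fields before it is searched. The sketch below does this with sscanf() and then checks for the 0000dead marker. It is an illustration only: the struct and function names are hypothetical, and the reading of the prefix as timestamp, log level, CPU number, and pid:task name is an assumption based on this single sample line.

```c
#include <stdio.h>
#include <string.h>

struct klog_prefix {
    double seconds;       /* time since boot              */
    int level;            /* printk log level, e.g. 0     */
    int cpu;              /* assumed: CPU number          */
    int pid;              /* assumed: pid of the task     */
    char task[64];        /* assumed: task name           */
    const char *message;  /* points into the input line   */
};

/* Split "[ time]<level>-(cpu)[pid:task]message".
 * Returns 0 on success and -1 if the line does not match. */
int parse_klog_line(const char *line, struct klog_prefix *out)
{
    int consumed = 0;
    int fields = sscanf(line, "[%lf]<%d>-(%d)[%d:%63[^]]]%n",
                        &out->seconds, &out->level, &out->cpu,
                        &out->pid, out->task, &consumed);

    /* consumed stays 0 if the closing ']' was not matched. */
    if (fields != 5 || consumed == 0)
        return -1;
    out->message = line + consumed;
    return 0;
}

/* Nonzero if the message carries the marker shown in the sample log. */
int looks_like_mtk_bug(const char *message)
{
    return strstr(message, "virtual address 0000dead") != NULL;
}
```

Keeping the prefix separate from the message makes it easy to report which task and CPU hit the BUG()/BUG_ON() once the marker is found.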

6. die() process

After the previous steps we reach the die() function, which mainly outputs important information such as registers and stack contents for debugging. Analyzing a KE from the log means analyzing this information, so we need to know the entire process. The general flow of die() => panic() is as follows:

[Figure: die() flow chart]

When learning these processes, it is recommended to read the code and the KE log together, so that you will know where the information in the log is printed in the code.

1. die() overall process

Let's start with die() and look at its general process:

[Figure: die() overall process]

When execution reaches debug_locks_off(), a log is output, as follows:

[Figure: debug_locks_off() log output]

If the exception is caused by calling BUG()/BUG_ON() in the code, additional log output describes it:

[Figure: corresponding code]

The output log is roughly as follows:

[Figure: log]

2. __die() process

Most of the key information is output by the __die() function. The process is as follows:

[Figure: __die() process]

Exception type information

__die() starts by printing the exception type and related information. To check whether the kernel log contains an oops, just search for the keyword "Internal error:".

[Figure: corresponding code]

The output is roughly as follows:

[Figure: log]

3. module information

Next is the module information, but we do not recommend using modules and will not introduce them here.

4. CPU register information

Then comes the important CPU register information (32bit code shown; 64bit is similar):

[Figure: CPU information]

The output is roughly as follows:

[Figure: log information]

5. Memory near registers

Next is the memory around the addresses held in the registers. It helps us analyze the problem, and the cause is likely to be found in it.

[Figure: corresponding code]

The output is roughly as follows:

[Figure: log]

6. Call stack

Sometimes the problem can be seen directly from the call stack, which shows how important the call stack is.

[Figure: corresponding code]

The output is roughly as follows:

[Figure: log]

7. Instructions near the PC

Finally, the instructions near the PC are printed:

[Figure: corresponding code]

The output is roughly as follows:

[Figure: log]

8. Analyze log

At this point die() has completed its mission and output the important information. How you debug from here depends on your own skills. You can:

  • Take the function that the PC points to and use addr2line (described later under GNU tools) to locate the file and line. This tells you roughly what happened. If that is not enough to locate the problem, add printk() calls and reproduce the KE several times, troubleshooting with the logs. If the KE is caused by BUG()/BUG_ON(), you can start fixing the problem directly.

  • Check the call stack. Sometimes the call stack explains the flow and shows whether the code ran as expected. If it did not, you can use printk() to locate the problem.

  • If you want to see function parameters or global variables, you need the techniques from "Advanced: Ramdump Analysis" to debug.

7. panic() process

The process has now reached panic(), and death (an abnormal restart) is not far away. The key information has already been output to the kernel log. So what does panic() do?

1. panic() process

[Figure: panic() process]

panic() has a signature log output, which is roughly as follows:

[Figure: kernel panic exception]

Therefore, we can also search for the keyword "Kernel panic" to find out whether a panic occurred.
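The two keyword searches mentioned in this article ("Internal error:" for an oops, "Kernel panic" for a panic) are easy to script when many logs have to be triaged. The following is a minimal sketch in C; the enum and function names are hypothetical, and the sample strings in the note below are illustrative rather than taken from a real device.

```c
#include <string.h>

enum ke_kind {
    KE_NONE,   /* neither keyword found        */
    KE_OOPS,   /* "Internal error:" found      */
    KE_PANIC   /* "Kernel panic" found         */
};

/* Classify a kernel log (one line or a whole buffer) by keyword.
 * A panic takes priority, since an oops may escalate to a panic. */
enum ke_kind classify_kernel_log(const char *log)
{
    if (strstr(log, "Kernel panic") != NULL)
        return KE_PANIC;
    if (strstr(log, "Internal error:") != NULL)
        return KE_OOPS;
    return KE_NONE;
}
```

For instance, a buffer containing "Internal error: Oops: 817 [#1] PREEMPT SMP ARM" classifies as KE_OOPS, and one that also contains "Kernel panic - not syncing: Fatal exception" classifies as KE_PANIC.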

2. Panic notification chain

panic() calls the callback functions that interested modules have registered on the panic notification chain. For example, our aee registers a callback that saves key information such as the kernel log and mini dump to the expdb partition of the emmc or similar storage. After the restart, the data is read back and saved as the KE db.

3. expdb

DRAM contents are lost during the restart, so the information can only be saved in flash. One entry in the partition table is expdb:

[Figure: expdb entry in the partition table]

The process is roughly as follows (versions keep evolving and may change significantly; for reference only):

[Figure: expdb save and restore process]

After restarting, aee reads back the expdb partition data and converts it into the KE db.

8. Nested panic

Sometimes the die()/panic() process does not complete normally: an exception occurs at some step and the exceptions nest. In this case we generally ignore the later exceptions and focus on the first one, the exception that started it all.

To keep nested exceptions from piling up, we intercept the second exception when it occurs. Nested panics are intercepted in three places:

  • do_PrefetchAbort()

  • do_DataAbort()

  • do_undefinstr()

[Figure: nested panic interception]

After interception, the die()/panic() process is not entered again, because it may itself raise exceptions. Instead, the function aee_stop_nested_panic() that we wrote is used:

[Figure: aee_stop_nested_panic()]
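The interception idea can be modelled in a few lines of C. This is a hypothetical user-space model and not the MTK implementation, whose code is not reproduced in this article: the struct, enum, and function names are made up, and a single counter stands in for whatever per-CPU state the real handlers keep.

```c
/* Which path an exception handler should take. */
enum ke_path {
    KE_PATH_DIE,          /* first exception: normal die()/panic() flow */
    KE_PATH_STOP_NESTED   /* nested exception: minimal dump, then wait  */
};

/* Hypothetical bookkeeping: how many exceptions are being handled. */
struct cpu_ke_state {
    int nest_count;
};

/* Called on entry to an abort or undefined-instruction handler.
 * Only the first exception is allowed into die()/panic(). */
enum ke_path enter_exception(struct cpu_ke_state *state)
{
    state->nest_count++;
    if (state->nest_count > 1)
        return KE_PATH_STOP_NESTED;
    return KE_PATH_DIE;
}
```

In this model the three handlers listed above would each call enter_exception() first and branch to the nested path on KE_PATH_STOP_NESTED, so an exception raised while die() is running can never re-enter die().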

aee_stop_nested_panic() uses other kernel modules as little as possible, since they may trigger further exceptions. It just outputs registers and other important information to the ram console and then waits for death (an infinite loop, waiting for the watchdog reset). You can then find this information in SYS_LAST_KMSG in the db you captured; it looks roughly as follows (with slight differences between versions):

[Figure: nested panic log in SYS_LAST_KMSG]


It contains register information, stack contents, and the call stack. We can use addr2line to recover the location of the exception.

However, a nested panic offers very little information to work with; it is not as rich as an ordinary KE.

Origin blog.csdn.net/wjky2014/article/details/131733648