Contents
2.1 Synchronous exception types
2.2 External aborts and ECC errors in ARMv8
2.3 ESR_ELx, Exception Syndrome Register (ELx)
3 Flow for handling physical memory ECC errors in kernel 5.4.18
3.3 The function that handles synchronous external aborts (SEA): do_sea()
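1 Introduction
The ECC feature of physical memory hardware can detect memory errors.
Single-bit errors can be corrected, so they need no special handling by the kernel. Multi-bit errors, by contrast, cannot be corrected and can affect running programs in unpredictable ways.
This article analyzes how linux-5.4.18 handles multi-bit errors on the ARMv8 architecture.
2 Information from the ARMv8 manual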
2.1 Synchronous exception types
......
In some implementations, External aborts. External aborts are failed memory accesses, and include accesses to those parts of the memory system that occur during the address translation. The ARMv8 architecture permits, but does not require, implementations to treat such exceptions synchronously.
ARM Architecture Reference Manual ARMv8, for ARMv8-A architecture profile, page D1-1477
2.2 External aborts and ECC errors in ARMv8
The ARM architecture defines external aborts as errors that occur in the memory system, other than those that are detected by the MMU or debug logic. External aborts include parity or ECC errors detected by the caches or other parts of the memory system.
ARM Architecture Reference Manual ARMv8, for ARMv8-A architecture profile, page D3-1638
The ARM architecture supports the reporting of both synchronous and asynchronous parity or ECC errors from the cache system. It is IMPLEMENTATION DEFINED what parity or ECC errors in the cache systems, if any, result in synchronous or asynchronous parity or ECC errors.
A fault code is defined for reporting parity or ECC errors, see Use of the ESR_EL1, ESR_EL2, and ESR_EL3 on page D1-1453.
ARM Architecture Reference Manual ARMv8, for ARMv8-A architecture profile, page D3-1639
2.3 ESR_ELx, Exception Syndrome Register (ELx)
Field descriptions

| Bits | Name | Field description | Value (binary) | Meaning of the value |
|---|---|---|---|---|
| [31:26] | EC | Exception Class. Indicates the reason for the exception that this register holds information about. | 100100 | Data Abort that caused entry from a lower Exception level, where that Exception level could be using AArch64 or using AArch32. Used for MMU faults generated by data accesses, alignment faults other than those caused by the Stack Pointer misalignment, and Synchronous external aborts, including synchronous parity or ECC errors. Not used for debug related exceptions. |
| | | | 100101 | Data Abort that caused entry from a current Exception level, where the current Exception level must be using AArch64. Used for MMU faults generated by data accesses, alignment faults other than those caused by the Stack Pointer misalignment, and Synchronous external aborts, including synchronous parity or ECC errors. Not used for debug related exceptions. |
| ...... | | | | |
| [24:0] | ISS | Instruction Specific Syndrome | IFSC, bits [5:0] = 010000 | Synchronous external abort, other than synchronous parity or ECC error, not on translation table walk |
| | | | IFSC, bits [5:0] = 011000 | Synchronous parity or ECC error on memory access, not on translation table walk |
| | | | IFSC, bits [5:0] = 010100 | Synchronous external abort, other than synchronous parity or ECC error, on translation table walk, level 0 |
| | | | IFSC, bits [5:0] = 010101 | Synchronous external abort, other than synchronous parity or ECC error, on translation table walk, level 1 |
| | | | IFSC, bits [5:0] = 010110 | Synchronous external abort, other than synchronous parity or ECC error, on translation table walk, level 2 |
| | | | IFSC, bits [5:0] = 010111 | Synchronous external abort, other than synchronous parity or ECC error, on translation table walk, level 3 |
| | | | IFSC, bits [5:0] = 011100 | Synchronous parity or ECC error on memory access on translation table walk, level 0 |
| | | | IFSC, bits [5:0] = 011101 | Synchronous parity or ECC error on memory access on translation table walk, level 1 |
| | | | IFSC, bits [5:0] = 011110 | Synchronous parity or ECC error on memory access on translation table walk, level 2 |
| | | | IFSC, bits [5:0] = 011111 | Synchronous parity or ECC error on memory access on translation table walk, level 3 |
ARM Architecture Reference Manual ARMv8, for ARMv8-A architecture profile, page D7-1850
ARM Architecture Reference Manual ARMv8, for ARMv8-A architecture profile, page D7-1869
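To make the field layout concrete, here is a minimal C sketch of decoding an ESR_ELx value. It assumes only the bit positions and encodings quoted in the table (EC in bits [31:26], the fault status code in ISS bits [5:0]). The macro and function names are hypothetical and chosen for illustration; they are not the kernel's own helpers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Field positions as quoted from the ESR_ELx description. */
#define ESR_EC_SHIFT   26
#define ESR_EC_MASK    0x3fu
#define ESR_FSC_MASK   0x3fu

#define EC_DABT_LOWER  0x24u  /* 100100: Data Abort from a lower Exception level */
#define EC_DABT_CUR    0x25u  /* 100101: Data Abort from the current Exception level */

/* Extract the Exception Class, bits [31:26]. */
static inline uint32_t esr_ec(uint32_t esr)
{
    return (esr >> ESR_EC_SHIFT) & ESR_EC_MASK;
}

/* Extract the fault status code, ISS bits [5:0]. */
static inline uint32_t esr_fsc(uint32_t esr)
{
    return esr & ESR_FSC_MASK;
}

/* True if the syndrome describes a Data Abort (either EC value). */
static inline bool esr_is_data_abort(uint32_t esr)
{
    uint32_t ec = esr_ec(esr);
    return ec == EC_DABT_LOWER || ec == EC_DABT_CUR;
}

/* 010000, or 010100..010111 for translation table walk levels 0..3. */
static inline bool fsc_is_sync_external_abort(uint32_t fsc)
{
    return fsc == 0x10u || (fsc >= 0x14u && fsc <= 0x17u);
}

/* 011000, or 011100..011111 for translation table walk levels 0..3. */
static inline bool fsc_is_sync_parity_ecc(uint32_t fsc)
{
    return fsc == 0x18u || (fsc >= 0x1cu && fsc <= 0x1fu);
}
```

For example, the value 0x96000018 has EC = 100101 and a fault status code of 011000, so under this sketch it reads as a Data Abort taken from the current Exception level that reports a synchronous parity or ECC error.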
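2.4 Summary of the manual information
The ARMv8 architecture has four exception levels, and at each level exceptions are divided into four types:
- Synchronous exception
- Interrupt (IRQ)
- Fast interrupt (FIQ)
- System error (SError)
Synchronous exceptions are further divided into:
- System call
  - Exception level 0 uses the svc (Supervisor Call) instruction to trap into exception level 1
  - Exception level 1 uses the hvc (Hypervisor Call) instruction to trap into exception level 2
  - Exception level 2 uses the smc (Secure Monitor Call) instruction to trap into exception level 3
- Data abort
- Instruction abort
- Misaligned stack pointer or instruction address
- Undefined instruction
- Debug exception
《Linux内核深度解析》, p. 405
Physical memory errors detected by ECC are handled by the Data Abort exception, which is one of the synchronous exceptions.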
3 Flow for handling physical memory ECC errors in kernel 5.4.18
3.1 Synchronous exception handling at exception level 1
// arch/arm64/kernel/entry.S
el1_sync:
    kernel_entry 1
    mrs   x1, esr_el1                    // read the syndrome register
    lsr   x24, x1, #ESR_ELx_EC_SHIFT     // exception class
    cmp   x24, #ESR_ELx_EC_DABT_CUR      // data abort in EL1; ESR_ELx_EC_DABT_CUR is 0x25 (binary 100101)
    b.eq  el1_da
    cmp   x24, #ESR_ELx_EC_IABT_CUR      // instruction abort in EL1
    b.eq  el1_ia
    cmp   x24, #ESR_ELx_EC_SYS64         // configurable trap
    b.eq  el1_undef
    cmp   x24, #ESR_ELx_EC_PC_ALIGN      // pc alignment exception
    b.eq  el1_pc
    cmp   x24, #ESR_ELx_EC_UNKNOWN       // unknown exception in EL1
    b.eq  el1_undef
    cmp   x24, #ESR_ELx_EC_BREAKPT_CUR   // debug exception in EL1
    b.ge  el1_dbg
    b     el1_inv
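The branch chain in el1_sync can be read as a plain dispatch on the exception class. The C sketch below mirrors the listing one comparison at a time and returns which label would be taken. Only the value 0x25 for ESR_ELx_EC_DABT_CUR is stated in this article; the other EC values are filled in from the ARMv8 EC encoding as an assumption and should be checked against arch/arm64/include/asm/esr.h. The enum and the function name are hypothetical.

```c
#include <stdint.h>

#define ESR_ELx_EC_SHIFT        26
#define ESR_ELx_EC_UNKNOWN      0x00u  /* assumed value */
#define ESR_ELx_EC_SYS64        0x18u  /* assumed value */
#define ESR_ELx_EC_IABT_CUR     0x21u  /* assumed value */
#define ESR_ELx_EC_PC_ALIGN     0x22u  /* assumed value */
#define ESR_ELx_EC_DABT_CUR     0x25u  /* stated in the listing comment */
#define ESR_ELx_EC_BREAKPT_CUR  0x31u  /* assumed value */

/* Hypothetical stand-ins for the branch targets in entry.S. */
enum el1_target { EL1_DA, EL1_IA, EL1_UNDEF, EL1_PC, EL1_DBG, EL1_INV };

static inline enum el1_target el1_sync_target(uint64_t esr)
{
    uint64_t ec = esr >> ESR_ELx_EC_SHIFT;  /* lsr x24, x1, #ESR_ELx_EC_SHIFT */

    if (ec == ESR_ELx_EC_DABT_CUR)          /* data abort in EL1 */
        return EL1_DA;
    if (ec == ESR_ELx_EC_IABT_CUR)          /* instruction abort in EL1 */
        return EL1_IA;
    if (ec == ESR_ELx_EC_SYS64)             /* configurable trap */
        return EL1_UNDEF;
    if (ec == ESR_ELx_EC_PC_ALIGN)          /* pc alignment exception */
        return EL1_PC;
    if (ec == ESR_ELx_EC_UNKNOWN)           /* unknown exception in EL1 */
        return EL1_UNDEF;
    if (ec >= ESR_ELx_EC_BREAKPT_CUR)       /* b.ge: every class from here up */
        return EL1_DBG;
    return EL1_INV;                         /* anything else is unexpected */
}
```

Note that the last comparison is a range check rather than an equality test, which is why the assembly uses b.ge for the debug classes.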
3.2 The data abort handler: el1_da
3.3 The function that handles synchronous external aborts (SEA): do_sea()
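4 Summary
On an ARMv8 CPU running the linux-5.4.18 kernel, when the ECC of physical memory detects an uncorrectable multi-bit error, the kernel responds differently depending on whether the error occurred in user mode or in kernel mode:
- If the error occurred in user mode, the corresponding process is killed.
- If the error occurred in kernel mode, the kernel panics.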
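The two cases above can be tied back to the EC field with a short C sketch. It assumes the common setup where the kernel runs at EL1 and user space at EL0, so a Data Abort from a lower Exception level (EC 100100) means user mode and one from the current Exception level (EC 100101) means kernel mode. It also assumes the caller has already established that the abort reports an uncorrectable ECC error. All names are hypothetical, and this is a simplification of the policy, not the kernel's code.

```c
#include <stdint.h>

#define SEA_EC_DABT_LOWER  0x24u  /* 100100: abort taken from user mode (EL0) */
#define SEA_EC_DABT_CUR    0x25u  /* 100101: abort taken in kernel mode (EL1) */

enum sea_outcome {
    SEA_NOT_DATA_ABORT,  /* syndrome is not a Data Abort at all */
    SEA_KILL_PROCESS,    /* error hit user mode: kill the process */
    SEA_KERNEL_PANIC     /* error hit kernel mode: panic */
};

/* Map an ESR value to the outcome described in the summary. */
static inline enum sea_outcome sea_outcome_for(uint32_t esr)
{
    uint32_t ec = (esr >> 26) & 0x3fu;  /* EC, bits [31:26] */

    if (ec == SEA_EC_DABT_LOWER)
        return SEA_KILL_PROCESS;
    if (ec == SEA_EC_DABT_CUR)
        return SEA_KERNEL_PANIC;
    return SEA_NOT_DATA_ABORT;
}
```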