Linux Reading Field-Linux Kernel Monthly Report (October 2020)

About Linux Kernel Monthly Report

Linux reading field

The Linux Kernel Monthly Report column summarizes the most important front-line development activity in the Linux kernel community each month, making it easier for readers to follow the latest kernel development trends.

Due to space limitations, we give only a rough overview of the latest technology; please look forward to follow-up articles for technical details. Readers are also welcome to contribute articles to the Linux Reading Field community.

The main contributors of this monthly report:

Zhang Jian, Liao Weixiong, Chenwei, Xia

Previous links:

Linux Reading Field-Linux Kernel Monthly Report (June 2020)

Linux Reading Field-Linux Kernel Monthly Report (July 2020)

Linux Reading Field-Linux Kernel Monthly Report (August 2020)

Linux Reading Field-Linux Kernel Monthly Report (September 2020)

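Call for submissions: Linux Reading Field (Linux阅码场) is soliciting articles from Linux engineers about their front-line development experience, as well as in-depth analyses of specific Linux technical topics from engineers, university students and teachers, and researchers at research institutes. Your article will reach an audience of nearly 100,000 front-line Linux engineers. Submission requirements: original articles that have never been published in any media outlet, blog, or public account, and that discuss a technical point or area comprehensively and in depth. To submit, contact Xiaoyue on WeChat: linuxer2016. Accepted articles receive a modest fee of 300-500 yuan per article as a token of our thanks.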

1. Architecture related

1.1 ARM/arm64 set_fs

Patch set 1: ARM: remove set_fs callers and implementation

Patch set 2: arm64: remove set_fs() and friends

When you see fs in kernel code, what comes to mind? File system? Not this time: the fs we are talking about today is the x86 fs register [1]. Version 0.1 of the kernel introduced set_fs to set the address range for user space (`set_fs(USER_DS)`) and kernel space (`set_fs(KERNEL_DS)`). Originally set_fs only set the x86 fs register, and other architectures later adopted the same name. A confusing name alone would be harmless; the real problem is that set_fs, while acting as a protection mechanism, is also an attack point.

If the limit is not restored with `set_fs(USER_DS)` before returning from the kernel to user space, user space gains unrestricted access to memory. Similar issues were reported in 2010 (CVE-2010-4258) and in 2016.
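The failure mode described above can be sketched with a toy model. This is not kernel code: the `Task` class, the `access_ok` method signature, the `kernel_helper` function and the address constants are all hypothetical and only illustrate how a forgotten `set_fs(USER_DS)` leaves the address limit wide open.

```python
# Toy model of the address-limit check behind set_fs(); not kernel code.
# All names and constants here are illustrative assumptions.
TASK_SIZE = 0x0000_8000_0000_0000   # hypothetical user/kernel split
USER_DS = TASK_SIZE                 # "user" pointers must end at or below this
KERNEL_DS = 2 ** 64                 # no effective limit: any address passes


class Task:
    def __init__(self):
        self.addr_limit = USER_DS

    def set_fs(self, limit):
        self.addr_limit = limit

    def access_ok(self, addr, size):
        """Return True if [addr, addr + size) lies below the current limit."""
        return addr + size <= self.addr_limit


def kernel_helper(task, restore=True):
    """Temporarily lift the limit to pass a kernel buffer to a user-access routine."""
    old_fs = task.addr_limit
    task.set_fs(KERNEL_DS)
    # ... work on a kernel buffer through code that expects a user pointer ...
    if restore:
        task.set_fs(old_fs)   # the step that must never be skipped


KERNEL_ADDR = 0xFFFF_8000_0000_1000   # hypothetical kernel address

if __name__ == "__main__":
    task = Task()
    print("before:", task.access_ok(KERNEL_ADDR, 8))
    kernel_helper(task, restore=False)   # buggy path: limit never restored
    print("after buggy path:", task.access_ok(KERNEL_ADDR, 8))
```

With `restore=False` the task returns to "user space" with the kernel limit still in place, so a user-supplied kernel address passes the check.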

Hardening set_fs [2] looked like one way forward, but it was rejected because it would hurt the performance of system calls. That left only one option: remove set_fs. This was the conclusion of the community discussion three years ago. This year Christoph Hellwig picked his original proposal up again. The core kernel code has now been converted, and the remaining work is in the individual architectures. This month, the ARM and ARM64 patches were submitted for review.

For ARM64, besides cleaning up code, the patches have to deal with the code that switches between kernel and user space, which mainly involves three areas: SDEI, PAN and UAO. For ARM, besides deleting the set_fs code, several OABI system calls need to be handled.

References
[1]https://lwn.net/Articles/722267/
[2]https://lwn.net/Articles/721305/

1.2 KASan for ARM

Kernel Address SANitizer (KASAN) is a dynamic memory error detector that mainly checks for out-of-bounds and use-after-free accesses. It is currently supported on x86, ARM64, RISC-V, PowerPC and other architectures.

This month Linus Walleij sent version 14 of the KASan for ARM patch set, which adds KASan support to the ARM architecture. Compared with last month's version, Linus fixed a crash he hit on the Qualcomm APQ8060 platform, and he hopes the change also fixes the issues reported by Florian and Ard.

KASan on ARM64 is gaining hardware tag-based memory error detection. This patch series applies the Memory Tagging Extension (MTE) to KASan: a random tag is generated every time memory is allocated and checked automatically every time memory is accessed. If the tags do not match, a tag fault is generated and KASan reports the error.

(Patch name: kasan: add hardware tag-based mode for arm64)
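The tag-on-allocate, check-on-access idea can be illustrated with a small software model. This is a sketch under stated assumptions, not the kernel implementation: `TaggedHeap`, `TagFault` and the retag-on-free policy are hypothetical, and only the 4-bit tag width and 16-byte granule come from MTE itself.

```python
# Toy model of tag-based checking; not kernel or hardware code.
import random

TAG_BITS = 4        # MTE tags are 4 bits wide
GRANULE = 16        # MTE tags memory in 16-byte granules


class TagFault(Exception):
    """Raised when a pointer tag does not match the memory tag."""


class TaggedHeap:
    def __init__(self, seed=None):
        self._rng = random.Random(seed)
        self._memory_tags = {}      # address -> tag currently set on that memory
        self._next_addr = 0x1000

    def _new_tag(self, exclude=None):
        # Pick a random tag, avoiding `exclude` so a retag always changes the tag.
        while True:
            tag = self._rng.randrange(1 << TAG_BITS)
            if tag != exclude:
                return tag

    def alloc(self):
        """Allocate one granule and return a (tag, address) "pointer"."""
        addr = self._next_addr
        self._next_addr += GRANULE
        tag = self._new_tag()
        self._memory_tags[addr] = tag
        return (tag, addr)

    def check(self, ptr):
        """Model the automatic check performed on every access."""
        tag, addr = ptr
        if self._memory_tags.get(addr) != tag:
            raise TagFault(f"tag mismatch at {addr:#x}")

    def free(self, ptr):
        """Retag the memory so that stale pointers no longer match."""
        self.check(ptr)
        tag, addr = ptr
        self._memory_tags[addr] = self._new_tag(exclude=tag)


if __name__ == "__main__":
    heap = TaggedHeap()
    p = heap.alloc()
    heap.check(p)          # valid access passes
    heap.free(p)
    try:
        heap.check(p)      # use-after-free: stale tag no longer matches
    except TagFault as exc:
        print("reported:", exc)
```

Because tags are random and only 4 bits wide, an out-of-bounds access into a neighbouring object can still match by chance; the model above only makes the use-after-free case deterministic.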

1.3 Carry forward IMA measurement log on kexec on ARM64

The per-architecture implementations in the kernel share a lot while each has its own peculiarities, and latecomers often rework an earlier architecture's implementation to make it more general. The RISC-V NUMA work we introduced in a previous kernel monthly report was based on ARM64. Today's patch set is similar.

IMA (Integrity Measurement Architecture) is one of the two major components of the kernel integrity subsystem. IMA can check the integrity of a file when it is executed or opened. For kexec (which boots a new kernel from the currently running kernel), IMA can check the new kernel, initramfs and command line. Lakshmi's four patches add this capability, which ARM64 does not currently support.

1.4 Besides the patches above, many other patches in October deserve attention, for example:

  • Introduce the TDP MMU

The goal is to improve live migration performance for virtual machines with terabytes of memory. TDP stands for two-dimensional paging. The TDP MMU is already used in Google's 12TiB m2-ultramem-416 VM with 416 vCPUs to provide the necessary live migration performance.

The motivation for this work is to handle page faults in very large virtual machines in parallel. When a VM has hundreds of vCPUs and terabytes of memory, KVM's MMU lock suffers extreme contention, causing soft lockups or very long page fault latency in the guest OS.

  • Add support for Asymmetric AArch32 systems

The goal is to run AArch32 applications in user space on systems where only some of the arm64 CPUs support AArch32 at EL0. Note: an arm64 CPU can support two execution states, AArch64 and AArch32.

  • PKS: Add Protection Keys Supervisor (PKS) support

This is kernel-side work similar to PKU (memory protection keys for user space); the expected use cases are trusted keys and PMEM.

2. Core-kernel related

2.1 Sleepable tracepoints

Tracers currently cannot access user-space data because they cannot handle page faults, yet this is sometimes needed.

This patch series adds a framework that lets tracers handle page faults within the tracepoint infrastructure; the individual tracers will be adapted accordingly later.

2.2 Core scheduling v8

Core scheduling is a feature that allows only mutually trusted tasks to run at the same time on CPUs that share core resources. Its main purpose is to mitigate core-level side-channel attacks without disabling SMT.

By default, this feature does not change any existing scheduling behavior; the user decides which tasks may run on the same core at the same time. For example, while process A is running, the other hyperthreads of the same core are either idle or run only processes that A trusts.
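The rule in the example above can be written down as a tiny selection function. This is a simplified illustration, not the scheduler's actual algorithm: the function name and the `(task_name, cookie)` tuples are hypothetical, with tasks that trust each other assumed to share the same cookie value.

```python
# Toy model of the core-scheduling constraint; not the kernel scheduler.
def pick_for_sibling(running_cookie, runnable):
    """Pick a task for a sibling hyperthread.

    `running_cookie` identifies the trust group of the task already running
    on the core. `runnable` is a list of (task_name, cookie) pairs in
    priority order. Returns the first task in the same trust group, or None,
    which means the sibling is forced to stay idle.
    """
    for name, cookie in runnable:
        if cookie == running_cookie:
            return name
    return None


if __name__ == "__main__":
    queue = [("untrusted-job", 2), ("helper-of-A", 1)]
    print(pick_for_sibling(1, queue))             # a task trusted by A
    print(pick_for_sibling(1, [("other", 3)]))    # nothing compatible: idle
```

The trade-off is visible even in this sketch: a higher-priority task from another trust group is skipped, and the hyperthread may idle although work is available.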

2.3 KFENCE v5: a low-overhead, sampling-based memory error detection tool

KFENCE's main feature is that it samples memory errors in production at the lowest possible performance cost. Compared with KASAN, KFENCE is less precise, but its performance impact is small enough that it can be deployed widely in production environments while still detecting memory bugs.

This patch series adds KFENCE support for the arm64 and x86 architectures.
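The precision-versus-cost trade-off of sampling can be shown with a toy allocator. This is only a sketch: the class and its count-based sampling are assumptions made for brevity, whereas KFENCE itself samples on a time interval and detects errors with guard pages around the sampled objects.

```python
# Toy model of sampled error detection; not how KFENCE is implemented.
class SampledGuardAllocator:
    def __init__(self, sample_every=100):
        self.sample_every = sample_every   # hypothetical: guard every Nth allocation
        self._count = 0
        self._guarded = {}                 # object id -> size, only for sampled objects

    def alloc(self, size):
        """Return an object id; every `sample_every`-th object is guarded."""
        obj_id = self._count
        self._count += 1
        if self._count % self.sample_every == 0:
            self._guarded[obj_id] = size
        return obj_id

    def access(self, obj_id, offset):
        """Return an error string if a guarded object is accessed out of bounds."""
        size = self._guarded.get(obj_id)
        if size is not None and not 0 <= offset < size:
            return "out-of-bounds"
        return None   # in bounds, or object not sampled (error goes unnoticed)


if __name__ == "__main__":
    allocator = SampledGuardAllocator(sample_every=3)
    ids = [allocator.alloc(32) for _ in range(3)]
    for obj_id in ids:
        print(obj_id, allocator.access(obj_id, 40))
```

Only the sampled object reports the bad access; the same bug on the other objects is missed, which is the price paid for near-zero overhead. Across a large fleet and long uptimes, a recurring bug is still likely to land on a sampled object eventually.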

3. File system and Block Layer

3.1 Block request filtering and block device snapshot module

* Patch: https://lwn.net/Articles/834867/

* veeamsnap: https://github.com/veeam/veeamsnap/

The patch author works at Veeam, a company that provides Linux backup solutions. For a long time it has shipped its block device backup functionality as an out-of-tree kernel module (veeamsnap). This submission is an attempt to merge its main functionality into mainline.

The patch implements two components, blk-filter and blk-snap:

  • blk-filter intercepts BIO requests to block devices. Interception happens very early and does not affect the request queue at all. It can filter an entire storage device or individual partitions. It also supports enabling and disabling filters dynamically: when a block device is added, the filter can automatically start filtering its BIO requests, and when the block device is removed, the filter is removed automatically as well.

  • blk-snap implements snapshots and changed block tracking, and it depends on blk-filter. It is designed to create backup copies of any block device without using device mapper. Snapshots are temporary and are destroyed once the backup process finishes. Changed block tracking makes incremental and differential backups possible.
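How changed block tracking enables incremental and differential backups can be sketched as follows. This is a minimal illustration and not the blk-snap design: the `TrackedDevice` class and its methods are hypothetical, and a real implementation tracks writes by intercepting BIO requests rather than through a Python method.

```python
# Toy model of changed block tracking; not the blk-snap implementation.
class TrackedDevice:
    def __init__(self, num_blocks):
        self.blocks = [b""] * num_blocks
        self._since_full = set()   # blocks written since the last full backup
        self._since_last = set()   # blocks written since the last backup of any kind

    def write(self, index, data):
        self.blocks[index] = data
        self._since_full.add(index)
        self._since_last.add(index)

    def full_backup(self):
        """Copy every block and restart both tracking sets."""
        copy = dict(enumerate(self.blocks))
        self._since_full.clear()
        self._since_last.clear()
        return copy

    def incremental_backup(self):
        """Copy only blocks changed since the previous backup."""
        delta = {i: self.blocks[i] for i in sorted(self._since_last)}
        self._since_last.clear()
        return delta

    def differential_backup(self):
        """Copy all blocks changed since the last full backup."""
        return {i: self.blocks[i] for i in sorted(self._since_full)}


if __name__ == "__main__":
    dev = TrackedDevice(4)
    dev.write(0, b"a")
    dev.full_backup()
    dev.write(1, b"b")
    print(dev.incremental_backup())
    dev.write(2, b"c")
    print(dev.incremental_backup())
    print(dev.differential_backup())
```

An incremental backup stays small but needs the whole chain to restore, while a differential backup grows over time but needs only the last full backup plus itself.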

3.2 btrfs adds support for 4K subpage read and write

* Patch: https://lwn.net/Articles/834872/

The patch set mainly allows a system with a 64K page size to mount a btrfs file system with a 4K sector size and to read and write it normally.

A 64K page size wastes space internally for small data. A smaller operation granularity greatly reduces this waste, which is why support for 4K subpage operations is needed. After testing, the author considers the patch set stable enough and believes that pure metadata operations at subpage size, such as reflink, can be performed. The author verified subpage data reads in both the compressed and uncompressed cases, verified subpage metadata reads and writes, and also verified full-page writes in the uncompressed case.

This patch set still has a lot of remaining work and many challenges. For example, subpage data (non-metadata) writes are not implemented yet, and iomap support for subpage data writeback is still outstanding.
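The space argument is simple rounding arithmetic. The sketch below uses a hypothetical helper, `allocated_bytes`, and deliberately ignores btrfs details such as inline extents, compression and metadata overhead.

```python
# Rounding arithmetic only; ignores inline extents, compression and metadata.
PAGE_64K = 64 * 1024
SECTOR_4K = 4 * 1024

# How many 4K sectors fit in one 64K page.
SECTORS_PER_PAGE = PAGE_64K // SECTOR_4K


def allocated_bytes(file_size, granule):
    """Round `file_size` up to a whole number of `granule`-sized units."""
    units = -(-file_size // granule)   # ceiling division
    return units * granule


if __name__ == "__main__":
    size = 5 * 1024   # a 5 KiB file
    for granule in (PAGE_64K, SECTOR_4K):
        used = allocated_bytes(size, granule)
        print(f"granule {granule}: allocated {used}, wasted {used - size}")
```

For a 5 KiB file, 64K granularity allocates 65536 bytes while 4K granularity allocates 8192 bytes, so the padding shrinks from 60 KiB to 3 KiB.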

4. Virtualization

4.1 KVM protected memory extension

== Background and related issues ==

Many hardware features (such as MKTME and SEV) can protect virtual machine memory from unauthorized access by the host. This patch set proposes a pure software feature that mitigates some of the same host-side read-only attacks.

== What does this patch set mitigate? ==

- The host kernel "accidentally" accessing virtual machine data (think speculation)

- The host kernel being induced to access virtual machine data (write(fd, &guest_data_ptr, len))

- Host user space accessing virtual machine data (for example through a compromised QEMU)

- Privilege escalation of the virtual machine through compromised QEMU device emulation

== What does this patch set not mitigate? ==

- A fully compromised host kernel, which can simply map the pages again

- Hardware-based attacks

The second version of the RFC patch set addresses most of the feedback.

However, no good solution has been found yet for reboot and kexec. These operations require removing the protection from all memory, which contradicts the goal of the feature.

Clearing most of the memory before unprotecting what reboot (or kexec) needs is tedious and error-prone. Perhaps these operations should simply be declared unsupported?

== Series overview ==

Hardware features protect virtual machine data by encrypting it and ensuring that only the right virtual machine can decrypt it. A side effect is that the kernel direct mapping and the user-space mappings (used by QEMU and others) become useless.

This, however, tells us something very useful: for ordinary virtual machine operation, the kernel mapping and the user-space mappings are sometimes not really necessary.

This patch set does not use encryption; it simply unmaps the memory. Compared with allowing access to ciphertext, one advantage is that stray accesses fault and get caught instead of silently reading garbage, as they would with encrypted data.

4.2 KVM: Dirty ring interface

The patch set was updated twice in October, both times rebasing v15 onto the latest KVM branch. This work continues the KVM dirty ring interface work done by Lei Cao and Paolo Bonzini.

The new dirty ring interface is another way to collect dirty pages for virtual machines. It differs from the existing dirty logging interface in several ways, mainly:

  • Data format: dirty data is now stored in a ring instead of a bitmap, so the amount of data to synchronize no longer depends on the size of the virtual machine's memory but on how fast memory is dirtied. In addition, each vCPU has its own dirty ring, whereas the dirty bitmap is per-VM and shared by all vCPUs.

  • Data copy: synchronizing dirty pages no longer requires copying data. Instead, the dirty ring is shared between user space and kernel space through shared pages (via mmap on vcpu_fd).

  • Interface: to reset the collected dirty pages back to write-protected mode, the dirty ring uses the new KVM_RESET_DIRTY_RINGS ioctl instead of the existing KVM_GET_DIRTY_LOG and KVM_CLEAR_DIRTY_LOG interfaces. To collect dirty bits, we only need to read the data in the dirty ring; no ioctl call is needed.
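The three differences above can be pictured with a toy per-vCPU ring. This is a sketch and not the KVM ABI: the `DirtyRing` class, its index names and its methods are hypothetical, `harvest` stands in for user space reading the shared pages, and `reset` stands in for the KVM_RESET_DIRTY_RINGS ioctl.

```python
# Toy model of a per-vCPU dirty ring; not the KVM data layout or ABI.
class DirtyRing:
    def __init__(self, size):
        self.size = size
        self.entries = [None] * size
        self.dirty_index = 0   # next slot the kernel side fills
        self.fetch_index = 0   # next slot user space reads
        self.reset_index = 0   # slots before this have been recycled

    def is_full(self):
        return self.dirty_index - self.reset_index == self.size

    def push(self, gfn):
        """Kernel side: record a dirtied guest frame number.

        Returns False when the ring is full, i.e. user space must harvest
        and reset before this vCPU can log more dirty pages.
        """
        if self.is_full():
            return False
        self.entries[self.dirty_index % self.size] = gfn
        self.dirty_index += 1
        return True

    def harvest(self):
        """User side: read new entries from shared memory, no ioctl needed."""
        collected = []
        while self.fetch_index < self.dirty_index:
            collected.append(self.entries[self.fetch_index % self.size])
            self.fetch_index += 1
        return collected

    def reset(self):
        """Recycle harvested slots; returns how many slots were recycled."""
        recycled = self.fetch_index - self.reset_index
        self.reset_index = self.fetch_index
        return recycled


if __name__ == "__main__":
    ring = DirtyRing(size=4)
    for gfn in (0x10, 0x11, 0x12):
        ring.push(gfn)
    print([hex(g) for g in ring.harvest()])
    print("recycled:", ring.reset())
```

The work done by `harvest` is proportional to the number of pages dirtied since the last harvest, not to guest memory size, which is the point of the ring format.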

(END)


Origin blog.csdn.net/21cnbao/article/details/109699266