Chapter 3. The /proc Filesystem

Introduction

Process Information

Kernel Information and Manipulation

System Information and Manipulation

Conclusion

3.1. Introduction

One of the main reasons Linux is so popular today is that it combines many of the best features of its UNIX ancestors. One of these features is the /proc filesystem, which Linux inherited from System V and which is a standard part of the kernels shipped with all of the major distributions. Some distributions provide entries in /proc that others do not, so there is no single standard /proc specification, and anything read from it should be treated with a degree of caution.

The /proc filesystem is one of the most important mechanisms that Linux provides for examining and configuring the inner workings of the operating system. It can be thought of as a window directly into the kernel’s data structures and the kernel’s view of the user processes running on the system. It appears to the user as a filesystem just like / or /home, so all the common file manipulation programs and system calls, such as cat(1), more(1), grep(1), open(2), read(2), and write(2)[1], can be used with it. If permissions are sufficient, values can also be written to certain files, either by redirecting output to the file with the > shell operator at a shell prompt or by calling the write(2) system call within an application.

[1] When a Linux operation name is followed by a number in parentheses, the number refers to a man page section. Section 1 is for executable programs or shell commands, and section 2 is for system calls (functions provided by the kernel). Typing man 2 read displays the man page for the read system call from section 2.
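As a minimal sketch of the point that ordinary system calls work on /proc, the following helper reads a /proc file with open(2) and read(2). The function name read_proc_file and its buffer handling are illustrative choices made for this example, not part of any kernel or library interface.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Read up to cap-1 bytes of the file at path into buf and NUL-terminate it.
 * Returns the number of bytes read, or -1 on error.
 * (read_proc_file is a hypothetical helper name used for illustration.) */
ssize_t read_proc_file(const char *path, char *buf, size_t cap)
{
    assert(path != NULL && buf != NULL && cap > 0);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    size_t total = 0;
    while (total < cap - 1) {
        ssize_t n = read(fd, buf + total, cap - 1 - total);
        if (n < 0) {            /* read error */
            close(fd);
            return -1;
        }
        if (n == 0)             /* end of the data generated by the kernel */
            break;
        total += (size_t)n;
    }
    buf[total] = '\0';
    close(fd);
    return (ssize_t)total;
}
```

Calling read_proc_file("/proc/self/status", buf, sizeof buf) should fill buf with the same text that cat /proc/self/status prints. Writing to a writable /proc file would follow the same pattern with O_WRONLY and write(2).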

The goal of this chapter is not to be an exhaustive reference of the /proc filesystem, as that would be an entire publication in itself. Instead the goal is to point out and examine some of the more advanced features and tricks primarily related to problem determination and system diagnosis. For more general reference, I recommend reading the proc(5) man page.


Note: If you have the kernel sources installed on your system, I also recommend reading /usr/src/linux/Documentation/filesystems/proc.txt.

3.2. Process Information

Along with viewing and manipulating system information, obtaining user process information is another area in which the /proc filesystem shines. When you list the files in /proc, you will immediately notice a large number of directories whose names are numbers. Each number is a process ID, and the directory contains detailed information about that process. All Linux systems have a /proc/1 directory. The process with ID 1 is always the “init” process, the first user process started during bootup. Even though init is a special program, it is a process like any other, and /proc/1 contains the same kinds of information as the directory of any other process, including the ls command you use to view the contents of this or any other directory! The following sections go into more detail on the most useful information found in the /proc/<pid>[2] directory, such as viewing and understanding a process’ address space, viewing CPU and memory configuration information, and understanding settings that can greatly enhance application and system troubleshooting.

[2] Because process IDs, with the exception of that of the init process, are not known in advance, a process’ directory under the /proc filesystem is commonly written generically as /proc/<pid>.

3.2.1. /proc/self

As a quick introduction to how processes are represented in the /proc filesystem, let’s first look at the special link “/proc/self.” The kernel provides this as a link to the currently executing process. Typing “cd /proc/self” takes you directly into the directory containing the process information for your shell process. This is because cd is a built-in function of the shell (the process that is running when the “self” link is resolved) and not an external program. If you run ls -l /proc/self, you will instead see a link to the process directory of the ls process, which goes away as soon as the directory listing completes and the shell prompt returns. The following sequence of commands and their associated output illustrates this.

Note: $$ is a special shell parameter that expands to the shell’s process ID, and “/proc/<pid>/cwd” is a special link provided by the kernel that points to the absolute path of the process’ current working directory.

 

penguin> echo $$

2602

penguin> ls -l /proc/self

lrwxrwxrwx  1 root   root 64 2003-10-13 08:04 /proc/self -> 2945

penguin> cd /proc/self

penguin> ls -l cwd

lrwxrwxrwx  1 dbehman build  0 2003-10-13 13:00 cwd -> /proc/2602

penguin>

 

The main thing to understand in this example is that 2945 is the process ID of the ls command. The reason for this is that the /proc/self link, just as all files in /proc, is dynamic and will change to reflect the current state at any point in time. The cwd link matches the same process ID as our shell process because we first used “cd” to get into the /proc/self directory.

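The same behavior can be observed from inside a program. The sketch below resolves the /proc/self link with readlink(2) and returns the process ID it names, which should match the caller’s own getpid() value. The helper name proc_self_pid is an assumption made for this example.

```c
#define _POSIX_C_SOURCE 200809L   /* exposes readlink() under strict ISO C modes */
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

/* Resolve the /proc/self symbolic link and return the PID it names,
 * or -1 if the link cannot be read.
 * (proc_self_pid is a hypothetical helper name used for illustration.) */
long proc_self_pid(void)
{
    char target[32];
    ssize_t len = readlink("/proc/self", target, sizeof target - 1);
    if (len < 0)
        return -1;
    assert(len < (ssize_t)sizeof target);
    target[len] = '\0';           /* readlink() does not NUL-terminate */
    return strtol(target, NULL, 10);
}
```

Because the link is resolved at the moment of the call, every process that runs this function sees its own PID, just as ls saw 2945 rather than the shell’s 2602 in the example above.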

3.2.2. /proc/<pid> in More Detail

With the understanding that typing “cd /proc/self” will change the directory to the current shell’s /proc directory, let’s examine the contents of this directory further. The commands and output are as follows:


penguin> cd /proc/self

penguin> ls -l

total 0

-r--r--r--    1 dbehman build  0 2003-10-13 13:34 cmdline

lrwxrwxrwx    1 dbehman build  0 2003-10-13 13:34 cwd -> /proc/2602

-r--------    1 dbehman build  0 2003-10-13 13:34 environ

lrwxrwxrwx    1 dbehman build  0 2003-10-13 13:34 exe -> /bin/bash

dr-x------    2 dbehman build  0 2003-10-13 13:34 fd

-rw-------    1 dbehman build  0 2003-10-13 13:34 mapped_base

-r--r--r--    1 dbehman build  0 2003-10-13 13:34 maps

-rw-------    1 dbehman build  0 2003-10-13 13:34 mem

-r--r--r--    1 dbehman build  0 2003-10-13 13:34 mounts

lrwxrwxrwx    1 dbehman build  0 2003-10-13 13:34 root -> /

-r--r--r--    1 dbehman build  0 2003-10-13 13:34 stat

-r--r--r--    1 dbehman build  0 2003-10-13 13:34 statm

-r--r--r--    1 dbehman build  0 2003-10-13 13:34 status

 

Notice that the sizes of all the files are 0, yet when we start examining some of them more closely, it is clear that they do in fact contain information. The size is 0 because these entries are not regular files stored on disk; they are a window directly into the kernel’s data structures. When filesystem operations are performed on files within the /proc filesystem, the kernel recognizes what is being requested and dynamically generates the data for the calling process, just as if it were being read from disk.
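To make the zero-size observation concrete, the following sketch compares the size that stat(2) reports for a file with the number of bytes that read(2) actually delivers. The helper name size_vs_content is hypothetical. For a /proc entry such as /proc/self/status, the reported size is expected to be 0 while the delivered byte count is not.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Store the size reported by stat(2) in *reported and the number of bytes
 * actually returned by read(2) in *delivered.
 * Returns 0 on success, -1 on error.
 * (size_vs_content is a hypothetical helper name used for illustration.) */
int size_vs_content(const char *path, long long *reported, long long *delivered)
{
    assert(path != NULL && reported != NULL && delivered != NULL);

    struct stat st;
    if (stat(path, &st) != 0)
        return -1;

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    char chunk[4096];
    long long total = 0;
    ssize_t n;
    while ((n = read(fd, chunk, sizeof chunk)) > 0)
        total += n;             /* count bytes generated on demand */
    close(fd);
    if (n < 0)
        return -1;

    *reported = (long long)st.st_size;
    *delivered = total;
    return 0;
}
```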

3.2.2.1. /proc/<pid>/maps

The “maps” file provides a view of the process’ memory address space. Every process has its own address space that is handled and provided by the Virtual Memory Manager. The name “maps” is derived from the fact that each line represents a mapping of some part of the process to a particular region of the address space. For this discussion, we’ll focus on the 32-bit x86 hardware. However, 64-bit hardware is becoming more and more important, especially when using Linux, so we’ll discuss the differences with Linux running on x86_64 at the end of this section.


Figure 3.1 shows a sample maps file, which we will analyze in the sections that follow.

Figure 3.1. A /proc/<pid>/maps file.


08048000-080b6000 r-xp 00000000 03:08 10667   /bin/bash

080b6000-080b9000 rw-p 0006e000 03:08 10667   /bin/bash

080b9000-08101000 rwxp 00000000 00:00 0

40000000-40018000 r-xp 00000000 03:08 6664    /lib/ld-2.3.2.so

40018000-40019000 rw-p 00017000 03:08 6664    /lib/ld-2.3.2.so

40019000-4001a000 rw-p 00000000 00:00 0

4001a000-4001b000 r--p 00000000 03:08 8598    /usr/lib/locale/en_US/LC_IDENTIFICATION

4001b000-4001c000 r--p 00000000 03:08 9920    /usr/lib/locale/en_US/LC_MEASUREMENT

4001c000-4001d000 r--p 00000000 03:08 9917    /usr/lib/locale/en_US/LC_TELEPHONE

4001d000-4001e000 r--p 00000000 03:08 9921    /usr/lib/locale/en_US/LC_ADDRESS

4001e000-4001f000 r--p 00000000 03:08 9918    /usr/lib/locale/en_US/LC_NAME

4001f000-40020000 r--p 00000000 03:08 9939    /usr/lib/locale/en_US/LC_PAPER

40020000-40021000 r--p 00000000 03:08 9953    /usr/lib/locale/en_US/LC_MESSAGES/SYS_LC_MESSAGES

40021000-40022000 r--p 00000000 03:08 9919    /usr/lib/locale/en_US/LC_MONETARY

40022000-40028000 r--p 00000000 03:08 10057   /usr/lib/locale/en_US/LC_COLLATE

40028000-40050000 r-xp 00000000 03:08 10434   /lib/libreadline.so.4.3

40050000-40054000 rw-p 00028000 03:08 10434   /lib/libreadline.so.4.3

40054000-40055000 rw-p 00000000 00:00 0

40055000-4005b000 r-xp 00000000 03:08 10432   /lib/libhistory.so.4.3

4005b000-4005c000 rw-p 00005000 03:08 10432   /lib/libhistory.so.4.3

4005c000-40096000 r-xp 00000000 03:08 6788    /lib/libncurses.so.5.3

40096000-400a1000 rw-p 00039000 03:08 6788    /lib/libncurses.so.5.3

400a1000-400a2000 rw-p 00000000 00:00 0

400a2000-400a4000 r-xp 00000000 03:08 6673    /lib/libdl.so.2

400a4000-400a5000 rw-p 00002000 03:08 6673    /lib/libdl.so.2

400a5000-401d1000 r-xp 00000000 03:08 6661    /lib/i686/libc.so.6

401d1000-401d6000 rw-p 0012c000 03:08 6661    /lib/i686/libc.so.6

401d6000-401d9000 rw-p 00000000 00:00 0

401d9000-401da000 r--p 00000000 03:08 8600    /usr/lib/locale/en_US/LC_TIME

401da000-401db000 r--p 00000000 03:08 9952    /usr/lib/locale/en_US/LC_NUMERIC

401db000-40207000 r--p 00000000 03:08 10056   /usr/lib/locale/en_US/LC_CTYPE

40207000-4020d000 r--s 00000000 03:08 8051    /usr/lib/gconv/gconv-modules.cache

4020d000-4020f000 r-xp 00000000 03:08 8002    /usr/lib/gconv/ISO8859-1.so

4020f000-40210000 rw-p 00001000 03:08 8002    /usr/lib/gconv/ISO8859-1.so

40210000-40212000 rw-p 00000000 00:00 0

bfffa000-c0000000 rwxp ffffb000 00:00 0

The first thing that should stand out is the name of the executable /bin/bash. This makes sense because the commands used to obtain this maps file were “cd /proc/self ; cat maps.” Try doing “less /proc/self/maps” and note how it differs.


Let’s look at what each column means. Taking the first line of the output just listed as an example, we know from the proc(5) man page that 08048000-080b6000 is the range of the process’ address space occupied by this entry; r-xp indicates that this mapping is readable, executable, and private; 00000000 is the offset into the file; 03:08 is the device (major:minor); 10667 is the inode; and /bin/bash is the pathname. But what does all this really mean?

It means that /bin/bash, which is inode 10667 (“stat /bin/bash” to confirm) on partition 8 of device 03 (examine /proc/devices and /proc/partitions for number to name mappings), had the readable and executable sections of itself mapped into the address range of 0x08048000 to 0x080b6000.

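The column layout just described can be sketched as a small parser. The struct map_entry type and the parse_maps_line function below are hypothetical names introduced for illustration. The format string follows the column order given in proc(5), and the sketch assumes pathnames without embedded spaces.

```c
#include <assert.h>
#include <stdio.h>

/* One parsed line of a maps file (hypothetical type for illustration). */
struct map_entry {
    unsigned long start, end;          /* address range [start, end) */
    char perms[5];                     /* for example "r-xp" */
    unsigned long offset;              /* offset into the mapped file */
    unsigned int dev_major, dev_minor; /* device, printed in hex as major:minor */
    unsigned long inode;               /* 0 for anonymous mappings */
    char path[256];                    /* empty for anonymous mappings */
};

/* Parse one maps line into *e. Returns 1 on success, 0 on failure.
 * Assumes the pathname, if present, contains no spaces. */
int parse_maps_line(const char *line, struct map_entry *e)
{
    assert(line != NULL && e != NULL);
    e->path[0] = '\0';
    int n = sscanf(line, "%lx-%lx %4s %lx %x:%x %lu %255s",
                   &e->start, &e->end, e->perms, &e->offset,
                   &e->dev_major, &e->dev_minor, &e->inode, e->path);
    return n >= 7;                     /* the pathname column is optional */
}
```

Feeding the first line of Figure 3.1 to parse_maps_line should yield start 0x08048000, end 0x080b6000, permissions r-xp, device 3:8, inode 10667, and the path /bin/bash.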

Now let’s examine what each individual line means. Because the output is the address mappings of the /bin/bash executable, the first thing to point out is where the program itself lives in the address space. On 32-bit x86-based architectures, the first address to which any part of the executable gets mapped is 0x08048000. This address will become very familiar the more you look at maps files: it appears in every maps file and will always be this address unless someone has gone to great lengths to change it. Because of Linux’s open source nature, this is possible but very unlikely. The next thing that becomes obvious is that the first two lines are very similar, and that the third line’s address range follows immediately after the second line’s. This is because the three lines combined contain all the information associated with the executable /bin/bash.

Generally speaking, each of the three lines is considered a segment and can be named the code segment, data segment, and heap segment respectively. Let’s dissect each segment along with its associated line in the maps file.


3.2.2.1.1. Code Segment

The code segment is also very often referred to as the text segment. As will be discussed further in Chapter 9, “ELF: Executable and Linking Format,” the .text section is contained within this segment and is the section that contains all the executable code.


Note: If you’ve ever seen the error message text file busy (ETXTBSY) when trying to delete or write to an executable program that you know to be binary and not ASCII text, the wording of the message stems from the fact that executable code is stored in the .text section.

 

Using /bin/bash as our example, the code segment taken from the maps file in Figure 3.1 is represented by this line:


08048000-080b6000 r-xp 00000000 03:08 10667 /bin/bash

 

This segment contains the program’s executable instructions, which is confirmed by the r-xp in the permissions column. The code is mapped without write permission, so a program cannot modify its own instructions through this mapping, and because the code is actually executed, the execute permission is set. For a hands-on example of what this really means, consider the following code:

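#include <stdio.h>

#include <unistd.h>   /* getpid(), sleep() */

int main( void )

{

  /* %p with a void * cast prints an address portably */

  printf( "Address of function main is %p\n", (void *)main );

  printf( "Sleeping infinitely; my pid is %d\n", (int)getpid() );

  while( 1 )

    sleep( 5 );

  return 0;

}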

 

Compiling and running this code will give this output:


Address of function main is 0x804839c

Sleeping infinitely; my pid is 4059

 

While the program is sleeping, examining /proc/4059/maps gives the following maps file:


08048000-08049000 r-xp 00000000 03:08 130198 /home/dbehman/testing/c

08049000-0804a000 rw-p 00000000 03:08 130198 /home/dbehman/testing/c

40000000-40018000 r-xp 00000000 03:08 6664   /lib/ld-2.3.2.so

40018000-40019000 rw-p 00017000 03:08 6664   /lib/ld-2.3.2.so

40019000-4001b000 rw-p 00000000 00:00 0

40028000-40154000 r-xp 00000000 03:08 6661   /lib/i686/libc.so.6

40154000-40159000 rw-p 0012c000 03:08 6661   /lib/i686/libc.so.6

40159000-4015b000 rw-p 00000000 00:00 0

bfffe000-c0000000 rwxp fffff000 00:00 0

 

Looking at the code segment’s address mapping of 08048000-08049000, we see that main’s address of 0x804839c does indeed fall within this range. This is an important observation when debugging programs, especially with a debugger such as GDB: when looking at various addresses in a debugging session, knowing roughly what they are often helps to put the puzzle pieces together much more quickly.
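The manual range check performed above can be automated. The sketch below scans a stream in maps format for the mapping that contains a given address and reports that mapping’s permissions. The function name find_mapping is an assumption made for this example, and the stream would normally come from fopen("/proc/<pid>/maps", "r").

```c
#include <assert.h>
#include <stdio.h>

/* Scan a maps-format stream for the mapping that contains addr.
 * On a hit, copy the permission string into perms and return 1;
 * return 0 if no mapping contains the address.
 * (find_mapping is a hypothetical helper name used for illustration.) */
int find_mapping(FILE *maps, unsigned long addr, char perms[5])
{
    assert(maps != NULL && perms != NULL);

    char line[512];
    while (fgets(line, sizeof line, maps) != NULL) {
        unsigned long start, end;
        char p[5] = "";
        if (sscanf(line, "%lx-%lx %4s", &start, &end, p) != 3)
            continue;                  /* skip lines that do not parse */
        if (addr >= start && addr < end) {
            for (int i = 0; i < 5; i++)
                perms[i] = p[i];
            return 1;
        }
    }
    return 0;
}
```

With the maps file of the sleeping test program, looking up 0x804839c should report r-xp, which places main in the code segment.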

3.2.2.1.2. Data Segment

For quick reference, the data segment of /bin/bash is represented by line two in Figure 3.1:


 

080b6000-080b9000 rw-p 0006e000 03:08 10667   /bin/bash

 

At first glance this line appears very similar to the code segment line, but in fact it is quite different. The primary differences are the address mapping and the permissions setting of rw-p, which means read-write, non-executable, and private. Logically speaking, a program consists mostly of instructions and variables. We now know that the instructions are in the code segment, which is read-only and executable. Because variables can certainly change throughout the execution of a program and are not considered to be executable, it makes perfect sense that they belong in the data segment. It is important to know, however, that only certain kinds of variables exist in this segment. How and where variables are declared in the program’s source code dictates the segment and section in which they appear in the process’ address space. The variables that exist in the data segment are initialized global variables. The following program demonstrates this.

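#include <stdio.h>

#include <unistd.h>   /* getpid(), sleep() */

int global_var = 3;

int main( void )

{

  /* %p with a void * cast prints an address portably */

  printf( "Address of global_var is %p\n", (void *)&global_var );

  printf( "Sleeping infinitely; my pid is %d\n", (int)getpid() );

  while( 1 )

    sleep( 5 );

  return 0;

}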

 

Compiling and running this program produces the following output:


Address of global_var is 0x8049570

Sleeping infinitely; my pid is 4472

 

While this program sleeps, examining /proc/4472/maps shows the following:


08048000-08049000 r-xp 00000000 03:08 130200  /home/dbehman/testing/d

08049000-0804a000 rw-p 00000000 03:08 130200  /home/dbehman/testing/d

40000000-40018000 r-xp 00000000 03:08 6664    /lib/ld-2.3.2.so

40018000-40019000 rw-p 00017000 03:08 6664    /lib/ld-2.3.2.so

40019000-4001b000 rw-p 00000000 00:00 0

40028000-40154000 r-xp 00000000 03:08 6661    /lib/i686/libc.so.6

40154000-40159000 rw-p 0012c000 03:08 6661    /lib/i686/libc.so.6

40159000-4015b000 rw-p 00000000 00:00 0

bfffe000-c0000000 rwxp fffff000 00:00 0

 

We see that the address of the global variable does indeed fall within the data segment address mapping range of 0x08049000-0x0804a000. Two other very common types of variables are stack and heap variables. Stack variables will be discussed in the Stack section further below, and heap variables will be discussed next.

3.2.2.1.3. Heap Segment

As the name implies, this segment holds a program’s heap variables. Heap variables are those whose memory is dynamically allocated through programming interfaces such as malloc() in C and the new operator in C++. Both typically rely on the brk() system call to extend the end of the segment to accommodate the memory requested. This segment also contains the bss section, a special section that holds uninitialized global variables. These variables are kept in a section separate from the data section because space can then be saved in the file’s on-disk image: no value needs to be stored in association with the variable. This is also why the bss section is located at the end of the executable’s mappings; space is allocated in memory only when these variables get mapped. The following program demonstrates how variable declarations in source code correspond to the heap segment.

#include <stdio.h>

#include <stdlib.h>   /* malloc(), system() */

#include <unistd.h>   /* getpid(), sbrk() */

int g_bssVar;

int main( void )

{

   char *pHeapVar = NULL;

   char szSysCmd[128];

   sprintf( szSysCmd, "cat /proc/%d/maps", (int)getpid() );

   printf( "Address of g_bssVar is %p\n", (void *)&g_bssVar );

   printf( "sbrk( 0 ) value before malloc is %p\n", sbrk( 0 ) );

   printf( "My maps file before the malloc call is:\n" );

   system( szSysCmd );

   printf( "Calling malloc to get 1024 bytes for pHeapVar\n" );

   pHeapVar = (char*)malloc( 1024 );

   printf( "Address of pHeapVar after malloc is %p\n",

          (void *)pHeapVar );

   printf( "sbrk( 0 ) value after malloc is %p\n", sbrk( 0 ) );

   printf( "My maps file after the malloc call is:\n" );

   system( szSysCmd );

   return 0;

}

Note: Notice the unusual variable naming convention used. This is taken from what’s called “Hungarian Notation,” which embeds indications of the type and scope of a variable in the name itself. For example, sz means a NUL-terminated string, p means pointer, and g_ means global in scope.

 

Compiling and running this program produces the following output:


penguin> ./heapseg

Address of g_bssVar is 0x8049944

sbrk( 0 ) value before malloc is 0x8049948

My maps file before the malloc call is:

08048000-08049000 r-xp 00000000 03:08 130260 /home/dbehman/book/src/heapseg

08049000-0804a000 rw-p 00000000 03:08 130260  /home/dbehman/book/src/heapseg

40000000-40018000 r-xp 00000000 03:08 6664    /lib/ld-2.3.2.so

40018000-40019000 rw-p 00017000 03:08 6664    /lib/ld-2.3.2.so

40019000-4001b000 rw-p 00000000 00:00 0

40028000-40154000 r-xp 00000000 03:08 6661    /lib/i686/libc.so.6

40154000-40159000 rw-p 0012c000 03:08 6661    /lib/i686/libc.so.6

40159000-4015b000 rw-p 00000000 00:00 0

bfffe000-c0000000 rwxp fffff000 00:00 0

Calling malloc to get 1024 bytes for pHeapVar

Address of pHeapVar after malloc is 0x8049998

sbrk( 0 ) value after malloc is 0x806b000

My maps file after the malloc call is:

08048000-08049000 r-xp 00000000 03:08  130260 /home/dbehman/book/src/heapseg

08049000-0804a000 rw-p 00000000 03:08  130260 /home/dbehman/book/src/heapseg

0804a000-0806b000 rwxp 00000000 00:00 0

40000000-40018000 r-xp 00000000 03:08 6664    /lib/ld-2.3.2.so

40018000-40019000 rw-p 00017000 03:08 6664    /lib/ld-2.3.2.so

40019000-4001b000 rw-p 00000000 00:00 0

40028000-40154000 r-xp 00000000 03:08 6661    /lib/i686/libc.so.6

40154000-40159000 rw-p 0012c000 03:08 6661    /lib/i686/libc.so.6

40159000-4015b000 rw-p 00000000 00:00 0

bfffe000-c0000000 rwxp fffff000 00:00 0

 

When examining this output, it may seem that a contradiction exists as to where the bss section actually resides. I’ve written that it exists in the heap segment, but the preceding output shows that the bss variable lives in the data segment (that is, its address of 0x8049944 lies within the address range 0x08049000-0x0804a000). The reason is that there is unused space at the end of the data segment, due to the small size of the example and the small number of global variables declared, so the bss section is placed in the data segment to limit wasted space. This fact in no way changes its properties.

Note: As will be discussed in Chapter 9, the curious reader can verify that g_bssVar’s address of 0x08049944 is in fact in the .bss section by examining the output of readelf -e <exe_name> and searching for where the .bss section begins. In our example, the section header shows that .bss begins at 0x08049940.

 

The same space-saving effect explains why the brk pointer (determined by calling sbrk with a parameter of 0) appears in the data segment in this example, when we would expect to see it in the heap segment. The moral of this example is that the three separate entries in the maps file for the executable do not necessarily correspond to hard segment ranges; rather, they are more of a soft guide.

The next important thing to note from this output is that before the malloc call, the heapseg executable had only two entries in the maps file. This means that there was no heap at that particular point in time. After the malloc call, we see a third line, which represents the heap segment. We also see that after the malloc call, the brk pointer points to the end of the range reported in the maps file, 0x0806b000. Now you may be a bit confused, because the brk pointer moved from 0x08049948 to 0x0806b000, which is a total of 136888 bytes. This is far more than the 1024 bytes that we requested, so what happened? The malloc implementation anticipates that the program will quite likely need more heap memory in the future, so rather than calling the expensive brk() system call to move the pointer for every malloc call, it asks for a much larger chunk of memory than is immediately needed. This way, when malloc is called again for a relatively small chunk of memory, brk() need not be called again, and malloc can simply hand out some of the extra memory. Doing this provides a large performance boost, especially if the program requests many small chunks of memory via malloc calls.
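The batching behavior described above can be observed with a short experiment. The sketch below performs a series of small allocations and counts how many of them actually moved the program break as reported by sbrk(0). The function name count_break_moves and the 1024-call limit are arbitrary choices for this example. The exact count depends on the C library’s allocator, so it should be read as an observation rather than a guarantee.

```c
#define _DEFAULT_SOURCE          /* exposes sbrk() in <unistd.h> on glibc */
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

/* Perform `calls` allocations of `size` bytes each and count how many of
 * them moved the program break. The blocks are held until the end so the
 * allocator cannot satisfy every request by recycling a single block.
 * (count_break_moves is a hypothetical helper name used for illustration.) */
int count_break_moves(int calls, size_t size)
{
    assert(calls > 0 && calls <= 1024);

    void *blocks[1024];
    int moves = 0;

    for (int i = 0; i < calls; i++) {
        void *before = sbrk(0);      /* current program break */
        blocks[i] = malloc(size);
        if (sbrk(0) != before)       /* did this malloc extend the heap? */
            moves++;
    }
    for (int i = 0; i < calls; i++)
        free(blocks[i]);
    return moves;
}
```

On a typical system, count_break_moves(100, 1024) is expected to return a number far smaller than 100, because most requests are served from memory obtained by an earlier, larger break adjustment.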

3.2.2.1.4. Mapped Base / Shared Libraries

Continuing our examination of the maps file, the next point of interest is what’s commonly referred to as the mapped base address, which defines where the shared libraries for an executable get loaded. In standard kernel source code (as downloaded from kernel.org), the mapped base address is a hardcoded location defined as TASK_UNMAPPED_BASE in each architecture’s processor.h header file. For example, in the 2.6.0 kernel source code, the file include/asm-i386/processor.h contains the definition:

/* This decides where the kernel will search for a free chunk of vm

* space during mmap's.

*/

#define TASK_UNMAPPED_BASE   (PAGE_ALIGN(TASK_SIZE / 3))
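The value of this macro on 32-bit x86 can be worked out by hand. The sketch below repeats the arithmetic in user space, assuming the usual i386 settings of 4 KiB pages and a 3 GiB user address space (TASK_SIZE of 0xC0000000). The SKETCH_ and sketch_ names are stand-ins invented for this example and are not the kernel’s own definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed i386 values: 4 KiB pages and a 3 GiB user address space. */
#define SKETCH_PAGE_SIZE  UINT32_C(0x1000)
#define SKETCH_TASK_SIZE  UINT32_C(0xC0000000)

/* Round addr up to the next page boundary, the same idea as PAGE_ALIGN. */
static uint32_t sketch_page_align(uint32_t addr)
{
    return (addr + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1);
}

/* Evaluate PAGE_ALIGN(TASK_SIZE / 3) under the assumptions above. */
uint32_t sketch_task_unmapped_base(void)
{
    uint32_t base = sketch_page_align(SKETCH_TASK_SIZE / 3);
    assert(base % SKETCH_PAGE_SIZE == 0);   /* result is page aligned */
    return base;
}
```

Since 0xC0000000 divided by 3 is exactly 0x40000000, which is already page aligned, sketch_task_unmapped_base() returns 0x40000000, the address at which the first shared library appears in the bash maps file.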

 

Resolving the definitions of PAGE_ALIGN and TASK_SIZE, TASK_UNMAPPED_BASE equates to 0x40000000 on this architecture. Note that some distributions, such as SuSE, include a patch that allows this value to be dynamically modified. See the discussion on the /proc/<pid>/mapped_base file in this chapter. Continuing our examination of the mapped base, let’s look at the maps file for bash again:


08048000-080b6000 r-xp 00000000 03:08 10667   /bin/bash

080b6000-080b9000 rw-p 0006e000 03:08 10667   /bin/bash

080b9000-08101000 rwxp 00000000 00:00 0

40000000-40018000 r-xp 00000000 03:08 6664    /lib/ld-2.3.2.so

40018000-40019000 rw-p 00017000 03:08 6664    /lib/ld-2.3.2.so

40019000-4001a000 rw-p 00000000 00:00 0

4001a000-4001b000 r--p 00000000 03:08 8598    /usr/lib/locale/en_US/LC_IDENTIFICATION

4001b000-4001c000 r--p 00000000 03:08 9920    /usr/lib/locale/en_US/LC_MEASUREMENT

4001c000-4001d000 r--p 00000000 03:08 9917    /usr/lib/locale/en_US/LC_TELEPHONE

4001d000-4001e000 r--p 00000000 03:08 9921    /usr/lib/locale/en_US/LC_ADDRESS

4001e000-4001f000 r--p 00000000 03:08 9918    /usr/lib/locale/en_US/LC_NAME

4001f000-40020000 r--p 00000000 03:08 9939    /usr/lib/locale/en_US/LC_PAPER

40020000-40021000 r--p 00000000 03:08 9953    /usr/lib/locale/en_US/LC_MESSAGES/SYS_LC_MESSAGES

40021000-40022000 r--p 00000000 03:08 9919    /usr/lib/locale/en_US/LC_MONETARY

40022000-40028000 r--p 00000000 03:08 10057   /usr/lib/locale/en_US/LC_COLLATE

40028000-40050000 r-xp 00000000 03:08 10434   /lib/libreadline.so.4.3

40050000-40054000 rw-p 00028000 03:08 10434   /lib/libreadline.so.4.3

40054000-40055000 rw-p 00000000 00:00 0

40055000-4005b000 r-xp 00000000 03:08 10432   /lib/libhistory.so.4.3

4005b000-4005c000 rw-p 00005000 03:08 10432   /lib/libhistory.so.4.3

4005c000-40096000 r-xp 00000000 03:08 6788    /lib/libncurses.so.5.3

40096000-400a1000 rw-p 00039000 03:08 6788    /lib/libncurses.so.5.3

400a1000-400a2000 rw-p 00000000 00:00 0

400a2000-400a4000 r-xp 00000000 03:08 6673    /lib/libdl.so.2

400a4000-400a5000 rw-p 00002000 03:08 6673    /lib/libdl.so.2

400a5000-401d1000 r-xp 00000000 03:08 6661    /lib/i686/libc.so.6

401d1000-401d6000 rw-p 0012c000 03:08 6661    /lib/i686/libc.so.6

401d6000-401d9000 rw-p 00000000 00:00 0

401d9000-401da000 r--p 00000000 03:08 8600    /usr/lib/locale/en_US/LC_TIME

401da000-401db000 r--p 00000000 03:08 9952    /usr/lib/locale/en_US/LC_NUMERIC

401db000-40207000 r--p 00000000 03:08 10056   /usr/lib/locale/en_US/LC_CTYPE

40207000-4020d000 r--s 00000000 03:08 8051    /usr/lib/gconv/gconv-modules.cache

4020d000-4020f000 r-xp 00000000 03:08 8002    /usr/lib/gconv/ISO8859-1.so

4020f000-40210000 rw-p 00001000 03:08 8002    /usr/lib/gconv/ISO8859-1.so

40210000-40212000 rw-p 00000000 00:00 0

bfffa000-c0000000 rwxp ffffb000 00:00 0

 

Note the line:

40000000-40018000 r-xp 00000000 03:08 6664   /lib/ld-2.3.2.so

 

This shows us that /lib/ld-2.3.2.so was the first shared library to be mapped when this process began. /lib/ld-2.3.2.so is the dynamic linker itself, so this makes perfect sense and is in fact the case for every executable that dynamically links in shared libraries. When an executable that depends on one or more shared libraries is built, the path to the dynamic linker is recorded in the executable, and the linker is mapped into memory when the program starts. Because the dynamic linker is responsible for loading the shared libraries and resolving all of their external symbols, it must be mapped into memory first, which is why it will always be the first shared library to show up in the maps file.

这告诉我们, 当这个进程开始时, /lib/ld-2.3. 2.so是第一个被加载共享的库。/lib/ld-2.3. 2.so 链接器本身也是如此, 所以这是完全意义上的, 实际上是所有可执行文件使用共享库中动态链接的情况。基本上, 当创建一个可执行文件将在一个或多个共享库中链接时, 链接器也会隐式链接到可执行文件中。由于链接器负责解析链接共享库中的所有外部符号, 因此必须首先将其映射到内存中, 这就是为什么它始终是在映射文件中显示的第一个共享库。

After the linker, all shared libraries that an executable depends upon appear in the maps file. To see what an executable needs without running it and examining its maps file, run the ldd command as shown here:

在链接器之后, 可执行文件所依赖的所有共享库都将显示在映射文件中。您可以通过运行 ldd 命令来检查可执行文件, 而不运行它并查看映射文件, 如下所示:

penguin> ldd /bin/bash

     libreadline.so.4 => /lib/libreadline.so.4 (0x40028000)

     libhistory.so.4 => /lib/libhistory.so.4 (0x40055000)

     libncurses.so.5 => /lib/libncurses.so.5 (0x4005c000)

     libdl.so.2 => /lib/libdl.so.2 (0x400a2000)

     libc.so.6 => /lib/i686/libc.so.6 (0x400a5000)

     /lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x40000000)

 

You can now correlate the list of libraries and their addresses to Figure 3.1 and see what they look like in the maps file.

现在, 您可以将库及其地址的列表与图3.1 关联起来, 并在映射文件中查看它们的样子。

Note: ldd is actually a script that does many things, but its main job is to set the LD_TRACE_LOADED_OBJECTS environment variable to a non-empty value, which makes the dynamic linker list a program's dependencies instead of running the program. Try the following sequence of commands and see what happens:

注意:ldd 实际上是一个脚本, 它做了很多事情, 但它所做的主要事情是将 LD_TRACE_LOADED_OBJECTS 环境变量设置为非 NULL。请尝试下面的命令序列, 看看会发生什么:

    export LD_TRACE_LOADED_OBJECTS=1

    less

Note: Be sure to do an unset LD_TRACE_LOADED_OBJECTS to return things to normal.

注意: 一定要做一个未设置的 LD_TRACE_LOADED_OBJECTS 将事情恢复到正常。

 

But what about all those extra LC_ lines in the maps file in Figure 3.1? As the full path indicates, they are all special mappings used by libc’s locale functionality. The glibc library call, setlocale(3), prepares the executable for localization functionality based on the parameters passed to the call. Compiling and running the following source will demonstrate this.

但图3.1 中映射文件中的所有额外 LC_ 行呢?正如完整路径所示, 它们都是 libc 的地域设置功能使用的特殊映射。glibc 库调用, setlocale (3), 根据传递给调用的参数为本地化功能准备可执行文件。编译并运行以下源代码将演示此操作。

#include <stdio.h>

#include <stdlib.h>   /* system() */

#include <unistd.h>   /* getpid() */

#include <locale.h>

 

int main( void )

{

  char szCommand[64];

 

  setlocale( LC_ALL, "en_US" );

 

  snprintf( szCommand, sizeof( szCommand ), "cat /proc/%d/maps", (int)getpid() );

 

  system( szCommand );

 

  return 0;

}

 

Running the program produces the following output:

运行该程序会产生以下输出:


08048000-08049000 r-xp 00000000 03:08 206928  /home/dbehman/book/src/l

08049000-0804a000 rw-p 00000000 03:08 206928  /home/dbehman/book/src/l

0804a000-0806b000 rwxp 00000000 00:00 0

40000000-40018000 r-xp 00000000 03:08 6664    /lib/ld-2.3.2.so

40018000-40019000 rw-p 00017000 03:08 6664    /lib/ld-2.3.2.so

40019000-4001a000 rw-p 00000000 00:00 0

4001a000-4001b000 r--p 00000000 03:08 8598    /usr/lib/locale/en_US/LC_IDENTIFICATION

4001b000-4001c000 r--p 00000000 03:08 9920    /usr/lib/locale/en_US/LC_MEASUREMENT

4001c000-4001d000 r--p 00000000 03:08 9917    /usr/lib/locale/en_US/LC_TELEPHONE

4001d000-4001e000 r--p 00000000 03:08 9921    /usr/lib/locale/en_US/LC_ADDRESS

4001e000-4001f000 r--p 00000000 03:08 9918    /usr/lib/locale/en_US/LC_NAME

4001f000-40020000 r--p 00000000 03:08 9939    /usr/lib/locale/en_US/LC_PAPER

40020000-40021000 r--p 00000000 03:08 9953    /usr/lib/locale/en_US/LC_MESSAGES/SYS_LC_MESSAGES

40021000-40022000 r--p 00000000 03:08 9919    /usr/lib/locale/en_US/LC_MONETARY

40022000-40028000 r--p 00000000 03:08 10057   /usr/lib/locale/en_US/LC_COLLATE

40028000-40154000 r-xp 00000000 03:08 6661    /lib/i686/libc.so.6

40154000-40159000 rw-p 0012c000 03:08 6661    /lib/i686/libc.so.6

40159000-4015b000 rw-p 00000000 00:00 0

4015b000-4015c000 r--p 00000000 03:08 8600    /usr/lib/locale/en_US/LC_TIME

4015c000-4015d000 r--p 00000000 03:08 9952    /usr/lib/locale/en_US/LC_NUMERIC

4015d000-40189000 r--p 00000000 03:08 10056   /usr/lib/locale/en_US/LC_CTYPE

bfffe000-c0000000 rwxp fffff000 00:00 0

 

The LC_* mappings here are identical to the mappings in Figure 3.1.

此处的 LC_ 映射与图3.1 中的映射相同

3.2.2.1.5. Stack Segment

The final segment in the maps output is the stack segment. The stack is where the local variables and parameters of all functions are stored. It is aptly named: data is pushed onto it and popped from it, just as with the fundamental data structure of the same name. Understanding how the stack works is key to diagnosing and debugging many tricky problems, so refer to Chapter 5, “The Stack,” for the details. In the context of the maps file, the important point is that the stack grows toward the heap segment. It is commonly said that on x86 hardware the stack grows “downward,” which can be confusing when visualizing the maps file. All it really means is that as data is added to the stack, the locations (addresses) of the data become smaller. This fact is demonstrated with the following program:

映射输出中的最后一段是堆栈段。堆栈是存储所有函数的局部变量的地方。函数参数也存储在堆栈上。栈是非常恰当的命名为数据是 "推" 到它和 "弹出", 就像在基本数据结构。了解堆栈的工作原理是诊断和调试许多棘手问题的关键, 因此建议引用第5章 "堆栈"。在映射文件的上下文中, 了解堆栈将向堆段扩展是很重要的。通常说, 在 x86 硬件上, 堆栈会"向下"增长。当可视化映射文件时, 这可能会很混乱。它真正的意思是, 当数据被添加到堆栈中时, 数据的位置 (地址) 就会变小。下面的程序演示了这一事实:

#include <stdio.h>

#include <stdlib.h>   /* system() */

#include <unistd.h>   /* getpid() */

 

int main( void )

{

   int stackVar1 = 1;

   int stackVar2 = 2;

   char szCommand[64];

 

   /* %p is the portable conversion for printing a pointer */

   printf( "Address of stackVar1 is %p\n\n", (void *)&stackVar1 );

   printf( "Address of stackVar2 is %p\n\n", (void *)&stackVar2 );

 

   snprintf( szCommand, sizeof( szCommand ), "cat /proc/%d/maps", (int)getpid() );

 

   system( szCommand );

 

   return 0;

}

 

Compiling and running this program produces the following output:

编译和运行此程序会产生以下输出

Address of stackVar1 is 0xbffff2ec

 

Address of stackVar2 is 0xbffff2e8

 

08048000-08049000 r-xp 00000000 03:08 206930 /home/dbehman/book/src/stack

08049000-0804a000 rw-p 00000000 03:08 206930 /home/dbehman/book/src/stack

40000000-40018000 r-xp 00000000 03:08 6664   /lib/ld-2.3.2.so

40018000-40019000 rw-p 00017000 03:08 6664   /lib/ld-2.3.2.so

40019000-4001b000 rw-p 00000000 00:00 0

40028000-40154000 r-xp 00000000 03:08 6661   /lib/i686/libc.so.6

40154000-40159000 rw-p 0012c000 03:08 6661   /lib/i686/libc.so.6

40159000-4015b000 rw-p 00000000 00:00 0

bfffe000-c0000000 rwxp fffff000 00:00 0

 

As you can see, the first stack variable’s address is higher than the second one by four bytes, which is the size of an int.

正如您所看到的, 第一个堆栈变量的地址大于第二个值的四字节, 即 int 的大小。

So if stackVar1 is the first stack variable and its address is 0xbffff2ec, then what is in the address space above it (at higher addresses closer to 0xc0000000)? The answer is that the kernel stores information such as the environment, the argument count, and the argument vector for the program. As has been alluded to previously, the linker plays a very important role in the execution of a program. It also runs through several routines, and some of its information is stored at the beginning of the stack as well.

因此, 如果 stackVar1 是第一个堆栈变量, 其地址为 0xbffff2ec, 那么它上面的地址空间中有什么 (在更高的地址接近 0xc0000000)?答案是内核存储的诸如环境、参数计数和程序的参数向量等信息。正如前面提到的, 链接器在程序的执行中起着非常重要的作用。它还可以运行几个例程, 并且它的一些信息也存储在堆栈的开头。

3.2.2.1.6. The Kernel Segment

The only remaining segment in a process’ address space to discuss is the kernel segment. The kernel segment starts at 0xc0000000 and is inaccessible to user processes. Every process contains this segment, which makes transferring data between the kernel and the process’ virtual memory quick and easy. The details of this segment’s contents, however, are beyond the scope of this book.

进程 "地址空间" 中唯一要讨论的部分是内核段。内核段从0xc0000000 开始, 用户进程无法访问。每个过程都包含此段, 这使得在内核和进程的虚拟内存之间传输数据是快速而容易的。然而, 这部分内容的细节超出了本书的范围。

Note: You may have realized that this segment accounts for one quarter of the entire address space for a process. This is called 3/1 split address space. Losing 1GB out of 4GB isn’t a big deal for the average user, but for high-end applications such as database managers or Web servers, this can become an issue. The real solution is to move to a 64-bit platform where the address space is not limited to 4GB, but due to the large amount of existing 32-bit x86 hardware, it is advantageous to address this issue. There is a patch known as the 4G/4G patch, which can be found at ftp.kernel.org/pub/linux/kernel/people/akpm/patches/ or http://people.redhat.com/mingo/4g-patches. This patch moves the 1GB kernel segment out of each process’ address space, thus providing the entire 4GB address space to applications.

注意: 您可能已经意识到此段占了进程的整个地址空间的四分之一。这称为3/1 分割地址空间。对于普通用户来说, 4GB中的1GB 不算什么, 但是对于高端应用程序 (如数据库管理器或 Web 服务器) 来说, 这可能会成为一个问题。真正的解决方案是移动到一个64位平台, 其地址空间不限于 4GB, 但由于现有的32位 x86 硬件大量存在, 因此解决此问题是有利的。有一个补丁称为4G/4G 补丁, 可以在 ftp 上找到ftp.kernel.org/pub/linux/kernel/people/akpm/patches/或 http://people.redhat.com/mingo/4g-patches。此修补程序将1GB 内核段移出每个进程的地址空间, 从而为应用程序提供整个4GB 地址空间。

3.2.2.1.7. 64-bit /proc/<pid>/maps Differences

32-bit systems are limited to 2^32 bytes = 4GB of total addressable memory. In other words, 0xffffffff (2^32 - 1) is the largest address that a process on a 32-bit system can handle. 64-bit computing raises this limit to 2^64 bytes = 16 EB (1 EB is roughly 1,000,000 TB), which is currently only a theoretical limit. Because of this, the typical locations for the various segments in a 32-bit program do not make sense in a 64-bit address space. Following is the maps file for /bin/bash on an AMD64 Opteron machine. Note that due to the length of each line, word-wrapping is unavoidable. Using the 32-bit maps file as a guide, it should be clear what the lines really look like.

32位系统被限制为 2^32 字节 = 4GB 可寻址内存。换言之, 0xffffffff (2^32 - 1) 是32位系统上的一个进程可以处理的最大地址。64位计算将此限制提高到 2^64 字节 = 16 EB (1 EB 约为 100万 TB), 这目前只是一个理论上的限制。因此, 32 位程序中各个段的典型位置在64位地址空间中没有意义。以下是 AMD64 Opteron 机器上 /bin/bash 的映射文件。注意, 由于每行的长度, 自动换行是不可避免的。以32位映射文件作为参照, 应该可以看清每行的真实样子。


0000000000400000-0000000000475000 r-xp 0000000000000000 08:07 10810 /bin/bash

0000000000575000-0000000000587000 rw-p 0000000000075000 08:07 10810 /bin/bash

0000000000587000-0000000000613000 rwxp 0000000000000000 00:00 0

0000002a95556000-0000002a9556b000 r-xp 0000000000000000 08:07 6820 /lib64/ld-2.3.2.so

0000002a9556b000-0000002a9556c000 rw-p 0000000000000000 00:00 0

0000002a9556c000-0000002a9556d000 r--p 0000000000000000 08:07 8050 /usr/lib/locale/en_US/LC_IDENTIFICATION

0000002a9556d000-0000002a9556e000 r--p 0000000000000000 08:07 9564 /usr/lib/locale/en_US/LC_MEASUREMENT

0000002a9556e000-0000002a9556f000 r--p 0000000000000000 08:07 9561 /usr/lib/locale/en_US/LC_TELEPHONE

0000002a9556f000-0000002a95570000 r--p 0000000000000000 08:07 9565 /usr/lib/locale/en_US/LC_ADDRESS

0000002a95570000-0000002a95571000 r--p 0000000000000000 08:07 9562 /usr/lib/locale/en_US/LC_NAME

0000002a95571000-0000002a95572000 r--p 0000000000000000 08:07 9583 /usr/lib/locale/en_US/LC_PAPER

0000002a95572000-0000002a95573000 r--p 0000000000000000 08:07 9597 /usr/lib/locale/en_US/LC_MESSAGES/SYS_LC_MESSAGES

0000002a95573000-0000002a95574000 r--p 0000000000000000 08:07 9563 /usr/lib/locale/en_US/LC_MONETARY

0000002a95574000-0000002a9557a000 r--p 0000000000000000 08:07 9701 /usr/lib/locale/en_US/LC_COLLATE

0000002a9557a000-0000002a9557b000 r--p 0000000000000000 08:07 8052 /usr/lib/locale/en_US/LC_TIME

0000002a9557b000-0000002a9557c000 r--p 0000000000000000 08:07 9596 /usr/lib/locale/en_US/LC_NUMERIC

0000002a9557c000-0000002a9557d000 rw-p 0000000000000000 00:00 0

0000002a95581000-0000002a95583000 rw-p 0000000000000000 00:00 0

0000002a95583000-0000002a955af000 r--p 0000000000000000 08:07 9700 /usr/lib/locale/en_US/LC_CTYPE

0000002a955af000-0000002a955b5000 r--s 0000000000000000 08:07 9438 /usr/lib64/gconv/gconv-modules.cache

0000002a9566b000-0000002a9566d000 rw-p 0000000000015000 08:07 6820 /lib64/ld-2.3.2.so

0000002a9566d000-0000002a9569b000 r-xp 0000000000000000 08:07 10781 /lib64/libreadline.so.4.3

0000002a9569b000-0000002a9576d000 ---p 000000000002e000 08:07 10781 /lib64/libreadline.so.4.3

0000002a9576d000-0000002a957a6000 rw-p 0000000000000000 08:07 10781 /lib64/libreadline.so.4.3

0000002a957a6000-0000002a957a7000 rw-p 0000000000000000 00:00 0

0000002a957a7000-0000002a957ad000 r-xp 0000000000000000 08:07 10779 /lib64/libhistory.so.4.3

0000002a957ad000-0000002a958a7000 ---p 0000000000006000 08:07 10779 /lib64/libhistory.so.4.3

0000002a958a7000-0000002a958ae000 rw-p 0000000000000000 08:07 10779 /lib64/libhistory.so.4.3

0000002a958ae000-0000002a958f8000 r-xp 0000000000000000 08:07 9799 /lib64/libncurses.so.5.3

0000002a958f8000-0000002a959ae000 ---p 000000000004a000 08:07 9799 /lib64/libncurses.so.5.3

0000002a959ae000-0000002a95a0f000 rw-p 0000000000000000 08:07 9799 /lib64/libncurses.so.5.3

0000002a95a0f000-0000002a95a12000 r-xp 0000000000000000 08:07 6828 /lib64/libdl.so.2

0000002a95a12000-0000002a95b0f000 ---p 0000000000003000 08:07 6828 /lib64/libdl.so.2

0000002a95b0f000-0000002a95b12000 rw-p 0000000000000000 08:07 6828 /lib64/libdl.so.2

0000002a95b12000-0000002a95c36000 r-xp 0000000000000000 08:07 6825 /lib64/libc.so.6

0000002a95c36000-0000002a95d12000 ---p 0000000000124000 08:07 6825 /lib64/libc.so.6

0000002a95d12000-0000002a95d50000 rw-p 0000000000100000 08:07 6825 /lib64/libc.so.6

0000002a95d50000-0000002a95d54000 rw-p 0000000000000000 00:00 0

0000002a95d54000-0000002a95d56000 r-xp 0000000000000000 08:07 9389 /usr/lib64/gconv/ISO8859-1.so

0000002a95d56000-0000002a95e54000 ---p 0000000000002000 08:07 9389 /usr/lib64/gconv/ISO8859-1.so

0000002a95e54000-0000002a95e56000 rw-p 0000000000000000 08:07 9389 /usr/lib64/gconv/ISO8859-1.so

0000007fbfffa000-0000007fc0000000 rwxp ffffffffffffb000 00:00 0

 

Notice that each address in the address ranges is twice as wide as in the 32-bit maps file: 16 hexadecimal digits instead of 8. Also notice the following differences:

请注意地址范围中的每个地址是32位映射文件中的两倍。同时注意以下差异:

Table 3.1. Address Mapping Comparison.

                             x86 (32 bit)    AMD64 (64 bit)
Start of code segment        0x08048000      0x0000000000400000
Start of shared libraries    0x40000000      0x0000002a95556000
Start of stack segment       0xbfffffff      0x0000007fbfffffff
Start of kernel segment      0xc0000000      0x0000007fc0000000
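
Because the 32-bit and 64-bit maps files share the same field layout and differ only in the width of the addresses, one parser can handle both. The following is a minimal sketch, not a definitive implementation: the struct map_entry type and the parse_maps_line function are names invented for this illustration, and the 255-character path limit is an arbitrary assumption.

```c
#include <stdio.h>
#include <string.h>

/* One parsed line of /proc/<pid>/maps (hypothetical helper type). */
struct map_entry {
    unsigned long long start;   /* first address of the mapping */
    unsigned long long end;     /* first address past the mapping */
    char perms[5];              /* for example "r-xp" */
    unsigned long long offset;  /* offset into the backing file */
    unsigned int dev_major;     /* device numbers are printed in hex */
    unsigned int dev_minor;
    unsigned long inode;        /* 0 for anonymous mappings */
    char path[256];             /* empty for anonymous mappings */
};

/* Parse one maps line. Returns 0 on success, -1 if the line is malformed. */
int parse_maps_line(const char *line, struct map_entry *e)
{
    int consumed = -1;
    int n = sscanf(line, "%llx-%llx %4s %llx %x:%x %lu %n",
                   &e->start, &e->end, e->perms, &e->offset,
                   &e->dev_major, &e->dev_minor, &e->inode, &consumed);
    if (n < 7)
        return -1;

    /* Whatever follows the inode is the path; it may be absent. */
    e->path[0] = '\0';
    if (consumed > 0) {
        snprintf(e->path, sizeof(e->path), "%s", line + consumed);
        e->path[strcspn(e->path, "\n")] = '\0';   /* drop the newline */
    }
    return 0;
}
```

Feeding each line returned by fgets(3) on /proc/self/maps to parse_maps_line gives the same fields that were picked out by eye in the listings above.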

3.2.3. /proc/<pid>/cmdline

The cmdline file contains the process’ complete argv. This is very useful to quickly determine exactly how a process was executed including all command-line parameters passed to it. Using the bash process again as an example, we see the following:

命令行文件包含进程的完整 argv。这对于快速确定进程的执行方式 (包括传递给它的所有命令行参数) 非常有用。再次使用 bash 过程作为示例, 我们将看到以下内容:

penguin> cd /proc/self

penguin> cat cmdline

bash
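
The arguments in cmdline are separated by NUL bytes rather than spaces, which is why cat appears to run multi-argument command lines together. The sketch below shows one way to read the file and split it back into individual arguments. The function names read_cmdline and split_nul_args are invented for this example, and the caller chooses the buffer size.

```c
#include <stdio.h>
#include <string.h>

/*
 * Read /proc/<pid>/cmdline into buf ("self" may be used for the calling
 * process). Returns the number of bytes read, or -1 if the file cannot
 * be opened.
 */
long read_cmdline(const char *pid, char *buf, size_t bufsize)
{
    char path[64];
    snprintf(path, sizeof(path), "/proc/%s/cmdline", pid);

    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return -1;
    size_t n = fread(buf, 1, bufsize - 1, fp);
    fclose(fp);
    buf[n] = '\0';              /* guarantee termination */
    return (long)n;
}

/*
 * Split a NUL-separated buffer into pointers to each argument.
 * Returns the number of arguments stored in argv_out.
 */
size_t split_nul_args(char *buf, size_t len, char **argv_out, size_t max_args)
{
    size_t count = 0;
    size_t pos = 0;

    while (pos < len && count < max_args) {
        char *end = memchr(buf + pos, '\0', len - pos);
        if (end == NULL)
            break;              /* unterminated tail: ignore it */
        argv_out[count++] = buf + pos;
        pos = (size_t)(end - buf) + 1;   /* step past the NUL */
    }
    return count;
}
```

Calling read_cmdline("self", ...) followed by split_nul_args and printing each pointer on its own line shows the process' argv one element at a time.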

3.2.4. /proc/<pid>/environ

The environ file provides a window directly into the process’ environment. It is basically a link directly to memory at the very bottom of the process’ stack, which is where the kernel stores this information when the program is started. Because the file reads that original area, it may not reflect variables that the program later adds or changes with putenv or setenv. Examining this file can be very useful when you need to know the environment variable settings a running program was started with. A common programming error is misuse of the getenv and putenv library functions; comparing this file with what the program itself sees can help diagnose these problems.

环境文件直接向进程的当前环境提供窗口。它基本上直接连接到内存的进程栈的底部, 这是内核存储这些信息的地方。当您需要知道程序执行过程中环境变量的设置时, 检查此文件会非常有用。常见的编程错误是误用 getenv 和 putenv 库函数;此文件可以帮助诊断这些问题。
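
Like cmdline, the environ file is a sequence of NUL-terminated NAME=value strings. The following sketch looks up one variable in such a block once it has been read into memory. The function name find_env_value is made up for this example.

```c
#include <stddef.h>
#include <string.h>

/*
 * Search a NUL-separated block of NAME=value strings (the layout of
 * /proc/<pid>/environ) for name. Returns a pointer to the value, or
 * NULL if the variable is not present.
 */
const char *find_env_value(const char *block, size_t len, const char *name)
{
    size_t name_len = strlen(name);
    size_t pos = 0;

    while (pos < len) {
        const char *entry = block + pos;
        const char *end = memchr(entry, '\0', len - pos);
        if (end == NULL)
            break;              /* unterminated tail: stop */
        if ((size_t)(end - entry) > name_len &&
            strncmp(entry, name, name_len) == 0 &&
            entry[name_len] == '=')
            return entry + name_len + 1;
        pos = (size_t)(end - block) + 1;   /* step past the NUL */
    }
    return NULL;
}
```

Comparing the result with what getenv(3) returns inside the program is one way to spot a variable that was changed after startup.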

3.2.5. /proc/<pid>/mem

By opening this file and seeking to an address, for example with the fseek library function, one can directly read the process’ pages. The kernel restricts this access, so the reader generally needs permission to trace the target process. One possible application of this could be to write a customized debugger of sorts. For example, say your program has a rather large and complex control block that stores some important information that the rest of the program relies on. In the case of a program malfunction, it would be advantageous to dump out that information. You could do this by opening the mem file for the PID in question and seeking to the known location of the control block. You could then read as many bytes as the control block occupies into another structure, which the homemade debugger could display in a format that programmers and service analysts can understand.

通过使用 fseek 库函数访问此文件, 可以直接访问进程的内存页。其中一个可能的应用是编写排序的自定义调试器。例如, 假设您的程序有一个相当大且复杂的控制块, 它存储了其他程序所依赖的一些重要信息。在程序出现故障的情况下, 将这些信息丢弃是有利的。可以通过打开有关 PID 的内存文件并查找控制块的已知位置来完成此任务。然后, 您可以将控制块的大小读入另一个结构, 自定义的调试器可以以程序员和服务分析人员可以理解的格式显示。
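
The seek-then-read sequence can be sketched as follows. This is only an illustration under stated assumptions: read_process_memory is a name invented here, it uses the unbuffered lseek(2) and read(2) calls rather than fseek so that no read-ahead touches neighboring pages, and whether the open or the read succeeds depends on the kernel version and on the caller's permissions.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Copy len bytes starting at virtual address addr of process pid
 * ("self" for the caller) into out.
 * Returns 0 on success, -1 on any failure.
 */
int read_process_memory(const char *pid, unsigned long addr,
                        void *out, size_t len)
{
    char path[64];
    snprintf(path, sizeof(path), "/proc/%s/mem", pid);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    int rc = -1;
    /* The file offset is the virtual address in the target process. */
    if (lseek(fd, (off_t)addr, SEEK_SET) != (off_t)-1 &&
        read(fd, out, len) == (ssize_t)len)
        rc = 0;

    close(fd);
    return rc;
}
```

A homemade debugger would pass the known address of the control block and sizeof the control block structure, then format the copied structure for display.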

3.2.6. /proc/<pid>/fd

The fd directory contains symbolic links pointing to each file for which the process currently has a file descriptor. The name of the link is the number of the file descriptor itself. File descriptor leaks are common programming errors that can be difficult problems to diagnose. If you suspect the program you are debugging has a leak, examine this directory carefully throughout the life of the program.

fd 目录包含指向该进程当前具有文件描述符的每个文件的符号链接。链接的名称是文件描述符本身的编号。文件描述符泄漏是常见的编程错误, 可能是很难诊断的问题。如果您怀疑正在调试的程序有漏洞, 请在程序的整个生命周期中仔细检查此目录。
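
One simple way to watch for a leak is to count the entries in the fd directory at intervals and see whether the number keeps climbing. The sketch below does this; count_open_fds is a hypothetical helper name, not an existing library call.

```c
#include <dirent.h>
#include <stdio.h>

/*
 * Count the entries in /proc/<pid>/fd ("self" for the calling process).
 * Returns -1 if the directory cannot be read. When used on "self", the
 * count includes the descriptor opendir() itself is using.
 */
int count_open_fds(const char *pid)
{
    char path[64];
    snprintf(path, sizeof(path), "/proc/%s/fd", pid);

    DIR *dir = opendir(path);
    if (dir == NULL)
        return -1;

    int count = 0;
    struct dirent *ent;
    while ((ent = readdir(dir)) != NULL) {
        if (ent->d_name[0] != '.')   /* skip "." and ".." */
            count++;
    }
    closedir(dir);
    return count;
}
```

Logging this count periodically from a test harness, or from the program itself, gives a quick indication of whether descriptors are being opened without being closed.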

3.2.7. /proc/<pid>/mapped_base

As was mentioned previously, the starting point for where the shared library mappings begin in a process’ address space is defined in the Linux kernel by TASK_UNMAPPED_BASE. In the current stable releases of the 2.4 and 2.6 kernels, this value is hardcoded and cannot be changed. In the case of i386, 0x40000000 is not the greatest location because it occurs about one-third into the process’ addressable space. Some applications require and/or benefit from allocating very large contiguous chunks of memory, and in some cases the TASK_UNMAPPED_BASE gets in the way and hinders this.

如前所述, 在进程 "地址空间" 中开始共享库映射的起始点由 TASK_UNMAPPED_BASE 在 Linux 内核中定义。在2.4 和2.6 内核的当前稳定版本中, 此值是硬编码的, 无法更改。在 i386 的情况下, 0x40000000 不是大的位置, 因为它发生在进程的可寻址空间约1/3处。有些应用程序需要和 (或) 受益于分配非常大的连续内存块, 在某些情况下, TASK_UNMAPPED_BASE 会妨碍到这一点。

To address this problem, some distributions such as SuSE Linux Enterprise Server 8 have included a patch that allows the system administrator to set the TASK_UNMAPPED_BASE value to whatever he or she chooses. The /proc/<pid>/mapped_base file is the interface you use to view and change this value.

为了解决此问题, 某些发行版 (如 SuSE Linux 企业服务器 8) 包含了一个修补程序, 允许系统管理员将 TASK_UNMAPPED_BASE 值设置为他或她选择的任何值。/proc/<pid>/mapped_base 文件是用来查看和更改此值的接口.

To view the current value, simply cat the file:

要查看当前值, 只需将文件 cat:

penguin> cat /proc/self/mapped_base

1073741824

penguin>

 

This shows the value in decimal form and is rather ugly. Viewing it as hex is much more meaningful:

这显示了十进制形式的值, 而且相当难看。以十六进制的形式来看, 意义更大:

penguin> printf "0x%x\n" `cat /proc/self/mapped_base`

0x40000000

penguin>

 

We know from our examination of the maps file that the executable’s mapping begins at 0x08048000 in the process address space. We also now know that the in-memory mapping is likely to be larger than the on-disk size of the executable. This is because of variables that are in the BSS segment and because of memory dynamically allocated from the process heap. With mapped_base at the default value, the space allowed for all of this is 939229184 bytes (0x40000000 - 0x08048000). This is roughly 896MB, somewhat under 1GB, and certainly overkill. A more reasonable value would be 0x10000000, which would give the executable room for 133922816 bytes (0x10000000 - 0x08048000). This is just under 128MB and should be plenty of space for most applications. To make this change, root authority is required. It’s also very important to note that the change is only picked up by children of the process in which the change was made, so it might be necessary to call execv() or a similar function for the change to be picked up. The following command will update the value:

我们从对映射文件的检查中知道, 可执行文件的映射从进程地址空间的 0x08048000 开始。我们现在也知道, 内存中的映射可能大于可执行文件在磁盘上的大小。这是由于 BSS 段中的变量以及从进程堆动态分配的内存。使用 mapped_base 的默认值时, 这一切可用的空间为 939229184 字节 (0x40000000 - 0x08048000)。这大约是 896MB, 略低于 1GB, 肯定绰绰有余。一个更合理的值是 0x10000000, 它为可执行文件留出 133922816 字节的空间 (0x10000000 - 0x08048000)。这略低于 128MB, 对大多数应用程序来说应该足够了。要进行此更改, 需要 root 权限。还有很重要的一点是, 该更改只会被进行更改的进程的子进程获取, 因此可能需要调用 execv() 或类似的函数才能使更改生效。以下命令将更新该值:

 

penguin> echo 0x10000000 > /proc/<pid>/mapped_base

 

3.3. Kernel Information and Manipulation

Alongside the process ID directories, at the top level of the /proc filesystem hierarchy, are a number of very useful files and directories. They provide information and allow various items to be set at the system and kernel level rather than per process. Some of the more useful and interesting entries are described in the following sections.

在/proc文件系统层次结构中, 所有进程 ID 目录所在的同一级别都是许多非常有用的文件和目录。这些文件和目录提供信息, 并允许做系统和内核级别而不是每个进程中设置各种项目。下面几节将介绍一些更有用和有趣的条目。

3.3.1. /proc/cmdline

This is a special version of the cmdline that appears in all /proc/<pid> directories. It shows all the parameters that were used to boot the currently running kernel. This can be hugely useful especially when debugging remotely without direct access to the computer.

这是在所有/proc/<pid>中显示的命令行的特殊版本.它显示用于引导当前正在运行的内核的所有参数。这可能非常有用, 特别是在远程调试而不能直接访问计算机时.

3.3.2. /proc/config.gz or /proc/sys/config.gz

This file is not part of the mainline kernel source for either 2.4.24 or 2.6.0, but some distributions such as SuSE include it. It is very useful for quickly examining exactly which options the current kernel was compiled with. For example, to find out whether your running kernel was compiled with Kernel Magic SysRq support, search /proc/config.gz for SYSRQ:

该文件不是2.4.24 或2.6.0 的主线内核的一部分, 但某些发行版 (如 SuSE) 也包含在它们的发布中。快速检查当前内核编译的是什么选项非常有用。例如, 如果您想快速查明您的运行内核是否包括Kernel Magic SysRq内核选项, 则搜索/proc/config.gz :

penguin> zcat /proc/config.gz | grep SYSRQ

CONFIG_MAGIC_SYSRQ=y

3.3.3. /proc/cpufreq

At the time of this writing, this file is not part of the mainline 2.4.24 kernel but is part of the 2.6.0 mainline source. Many distributions such as SuSE have back-ported it to their 2.4 kernels, however. This file provides an interface to manipulate the speed at which the processor(s) in your machine run, depending on various governing factors. There is excellent documentation included in the /usr/src/linux/Documentation/cpu-freq directory if your kernel contains support for this feature.

在编写此文档时, 此文件不是主线2.4.24 内核的一部分, 而是2.6.0 内核的一部分。然而, 许多发行版 (如 SuSE) 都将其移植到2.4 内核中。此文件提供一个接口, 用于操作计算机上处理器的运行速度, 具体取决于各种控制因素。如果内核支持此功能, 则在/usr/src/linux/ Documentation/cpu-freq 目录中包含此文档。

3.3.4. /proc/cpuinfo

This is one of the first files someone will look at when determining the characteristics of a particular computer. It contains detailed information on each of the CPUs in the computer such as speed, model name, and cache size.

这是在确定特定计算机的特性时, 必看的文件之一。它包含计算机中每个 cpu 的详细信息, 如速度、型号名称和缓存大小。

Note: The cpuinfo file can tell you whether the CPUs in your system have Intel HyperThreaded(TM) technology, but you need to know what to look for. If, for example, your system has four HyperThreaded CPUs, cpuinfo will report eight CPUs in total, with processor numbers ranging from 0 to 7. However, examining the “physical id” field of each of the eight entries will yield only four unique values, one for each physical CPU.

注意: 要确定系统中的 cpu 是否具有英特尔超线程 (TM) 技术, 请查看 cpuinfo 文件, 但需要知道要查找的内容。例如, 如果系统有四超线程 cpu, cpuinfo 将报告共总有八个 cpu, 处理器数从0到7。但是, 检查这个八个CPU的 "物理 id" 字段时, 只会看到四个唯一值, 用来表示4个不同的物理 CPU。
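
The comparison described in the note can be automated by counting the “processor” lines and the distinct “physical id” values. The sketch below works on the text of cpuinfo once it has been read into a string. The function name count_cpus and the MAX_PACKAGES bound are assumptions made for this example, and the field names are those used on x86; other architectures may label things differently.

```c
#include <stdio.h>
#include <string.h>

#define MAX_PACKAGES 256   /* illustrative upper bound on physical ids */

/*
 * Count logical CPUs ("processor" lines) and distinct "physical id"
 * values in the text of /proc/cpuinfo.
 */
void count_cpus(const char *cpuinfo, int *logical, int *physical)
{
    int seen[MAX_PACKAGES] = {0};
    const char *line = cpuinfo;

    *logical = 0;
    *physical = 0;
    while (line != NULL && *line != '\0') {
        int id;
        if (strncmp(line, "processor", 9) == 0) {
            (*logical)++;
        } else if (sscanf(line, "physical id : %d", &id) == 1 &&
                   id >= 0 && id < MAX_PACKAGES && !seen[id]) {
            seen[id] = 1;           /* first time this package is seen */
            (*physical)++;
        }
        line = strchr(line, '\n');  /* advance to the next line */
        if (line != NULL)
            line++;
    }
}
```

If the logical count is larger than the physical count, several logical processors share a physical CPU, which is what the note describes for HyperThreaded systems.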

3.3.5. /proc/devices

This file displays a list of all configured character and block devices. Note that the device entry in the maps file can be cross-referenced with the block devices of this section to translate a device number into a name. The following shows the /proc/devices listing for my computer.

此文件显示了所有配置的字符和块设备。请注意, 映射文件中的设备项可以与本节的块设备交叉引用, 以便将设备编号转换为名称。下面显示了我的计算机的/proc/devices列表。


Character devices:

  1 mem

  2 pty

  3 ttyp

  4 ttyS

  5 cua

  6 lp

  7 vcs

 10 misc

 13 input

 14 sound

 21 sg

 29 fb

 81 video_capture

116 alsa

119 vmnet

128 ptm

136 pts

162 raw

171 ieee1394

180 usb

188 ttyUSB

254 pcmcia

 

Block devices:

  1 ramdisk

  2 fd

  3 ide0

  7 loop

  9 md

 11 sr

 22 ide1

 

Figure 3.1 shows that /bin/bash is mapped from device 03 which, according to my devices file, is the “ide0” device. This makes perfect sense, as I have only one hard drive, which is IDE.

图3.1 显示了/bin/bash 是从设备03映射的, 根据我的设备文件, 它是 "ide0" 设备。这是完全有意义的, 因为我只有一个硬盘驱动器, IDE。
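
The cross-reference from a major number to a driver name can be sketched in a few lines. The function below searches only the “Block devices:” section of text already read from /proc/devices. The name block_device_name is invented for this illustration.

```c
#include <stdio.h>
#include <string.h>

/*
 * Look up a major number in the "Block devices:" section of the text of
 * /proc/devices. On success, copies the name into name and returns 0;
 * returns -1 if the section or the major number is not found.
 */
int block_device_name(const char *devices, unsigned int major,
                      char *name, size_t name_size)
{
    const char *line = strstr(devices, "Block devices:");
    if (line == NULL)
        return -1;

    /* Step through the lines that follow the section heading. */
    while ((line = strchr(line, '\n')) != NULL) {
        unsigned int num;
        char buf[64];

        line++;
        if (sscanf(line, "%u %63s", &num, buf) == 2 && num == major) {
            snprintf(name, name_size, "%s", buf);
            return 0;
        }
    }
    return -1;
}
```

With the listing above as input, looking up major 3 yields “ide0”, matching the manual lookup for /bin/bash.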

3.3.6. /proc/kcore

This file represents all physical memory in your computer, presented in core file format. With an unstripped kernel binary, a debugger can be used to examine any part of the kernel. This can be useful when the kernel is doing something unexpected or when developing kernel modules.

此文件表示计算机中的所有物理内存。使用 unstripped 内核二进制文件, 调试器可用于检查内核的任何部分。当内核出了意外或正在开发内核模块时, 这可能很有用。

3.3.7. /proc/locks

This file shows all file locks that currently exist in the system. When you know that your program locks certain files, examining this file can be very useful in debugging a wide array of problems.

此文件显示系统中当前存在的所有文件锁。当您知道程序锁定某些文件时, 检查此文件对于调试一系列问题非常有用。

3.3.8. /proc/meminfo

This file is probably the second file after cpuinfo to be examined when determining the specs of a given system. It displays such things as total physical RAM, total used RAM, total free RAM, amount cached, and so on. Examining this file can be very useful when diagnosing memory-related issues.

在确定给定系统的规格时, 此文件可能是 cpuinfo 之后要查的第二个文件。它显示了诸如物理内存数量、使用内存数量、空闲的内存数量、缓存的数量等内容。在诊断与内存相关的问题时, 检查此文件可能非常有用。
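
When a program needs one of these values rather than the whole file, it can pick out a single “Name: value kB” line. The following sketch assumes the file has already been read into a string; meminfo_kb is a hypothetical helper name, and the set of keys available varies between kernel versions.

```c
#include <stdio.h>
#include <string.h>

/*
 * Find the line starting with "key:" in the text of /proc/meminfo and
 * return its value in kilobytes, or -1 if the key is not present.
 */
long meminfo_kb(const char *meminfo, const char *key)
{
    size_t key_len = strlen(key);
    const char *line = meminfo;

    while (line != NULL && *line != '\0') {
        if (strncmp(line, key, key_len) == 0 && line[key_len] == ':') {
            long value;
            if (sscanf(line + key_len + 1, "%ld", &value) == 1)
                return value;
            return -1;              /* key found but no number */
        }
        line = strchr(line, '\n');  /* advance to the next line */
        if (line != NULL)
            line++;
    }
    return -1;
}
```

Sampling meminfo_kb(text, "MemFree") at intervals while a test runs is a simple way to watch free memory during a suspected leak.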

3.3.9. /proc/mm

This file is part of Jeff Dike’s User Mode Linux (UML) patch, which allows an instance of a kernel to be run as a process within a booted kernel. This can be valuable in kernel development. One of the biggest advantages of UML is that a crash in a UML kernel will not bring down the entire computer; the UML instance simply needs to be restarted. UML is part of neither the mainstream 2.4.24 kernel nor the 2.6.0 kernel, but some distributions have back-ported it. The basic purpose of the mm file is to create a new address space by opening it. You can then modify the new address space by writing directly to this file.

此文件是Jeff Dike的用户模式 Linux (UML) 修补程序的一部分, 它允许内核的在引导内核中运行。这在内核开发中是很有价值的。UML的最大优点是 UML 中的内核崩溃不会使整个计算机崩溃;UML 的内核只需要重新启动即可。UML 不是主流2.4.24 和2.6.0 内核的一部分, 但有些发行版已将其移植。mm 文件的基本目的是通过它来创建一个新的地址空间。然后, 您可以通过直接写入此文件来修改新的地址空间。

3.3.10. /proc/modules

This file contains a listing of all modules currently loaded by the system. Generally, the lsmod(8) command is a more common way of seeing this information. lsmod will print the information, but it doesn’t add anything to what’s in this file. Running lsmod after modprobe(8) or insmod(8) is a quick way to confirm that the kernel has, in fact, loaded a module. It is also worth checking before unloading a module with the rmmod(8) command: usually, if a module is in use by at least one process, that is, its “Used by” count is greater than 0, the kernel cannot unload it.

此文件包含系统当前加载的所有模块的列表。通常, lsmod (8) 命令是一种更常见的查看此信息的方式。lsmod 将打印信息, 但它不会向这个文件添加任何东西。在运行 modprobe (8) 或 insmod (8) 以动态加载内核模块后,运行 lsmod以查看内核是否已加载它时非常有用。在使用 rmmod (8) 命令卸载模块时, 查看它是否被卸载也非常有用。通常, 如果一个模块被至少一个进程使用, 即它的 "使用" 计数大于 0, 则内核无法卸载它。

3.3.11. /proc/net

This directory contains several files that represent many different facets of the networking layer. Directly accessing some can be useful in specific situations, but generally it’s much easier and more meaningful to use the netstat(8) command.

此目录包含多个代表网络层的许多不同方面的文件。直接访问一些文件在特定情况下是有用的, 但通常使用 netstat (8) 命令会更容易、更有意义。

3.3.12. /proc/partitions

This file holds a list of disk partitions that Linux is aware of. The partitions file categorizes the system’s disk partitions by “major” and “minor.” The major number refers to the device number, which can be cross-referenced with the /proc/devices file. The minor number refers to the unique partition number on the device. The partition number also appears in the maps file immediately next to the device number.

此文件包含 Linux 所知道的磁盘分区列表。分区文件按 "主要" 和 "次要" 对系统的磁盘分区进行分类。主编号指的是设备号, 可以与/proc/devices文件交叉引用。次要引用设备上的唯一分区号。分区号还会出现在映射文件中设备编号旁边。

Looking at the maps output for /bin/bash in Figure 3.1, the device field is “03:08.” We can look up the device and partition directly from the partitions file. Using the following output in conjunction with Figure 3.1, we can see that /bin/bash resides on partition 8 of block device 3, or /dev/hda8.

查看图3.1 中/bin/bash的映射输出, 设备字段为 "03:08"。我们可以直接从分区文件查找设备和分区。以下输出与图3.1 一起, 我们可以看到/bin/bash 驻留在块设备3分区8或/dev/hda8 的中。


major minor #blocks  name       rio rmerge rsect ruse wio wmerge wsect wuse running use aveq

   3    0   46879560 hda 207372 1315792 7751026 396280 91815 388645 3871184 1676033 -3 402884 617934

   3    1   12269848 hda1 230 1573 1803 438 0 0 0 0 0 438 438

   3    2          1 hda2 0 0 0 0 0 0 0 0 0 0 0

   3    5   15119968 hda5 6441 625334 631775 911375 0 0 0 0 0 1324183 911375

   3    6    1028128 hda6 8 0 8 58 0 0 0 0 0 58 58

   3    7    1028128 hda7 66 275 2728 1004 445 4071 38408 27407 0 2916 28525

   3    8   17425768 hda8 200618 688575 7114640 3778237 91370 384574 3832776 1648626 0 1585895 1170787
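
The lookup from a maps device field to a partition name can be sketched as follows. The function takes a field such as “03:08” (printed in hexadecimal in the maps file) and the text of /proc/partitions, whose major and minor columns are decimal. The name partition_name is made up for this example.

```c
#include <stdio.h>
#include <string.h>

/*
 * Given a maps device field such as "03:08" and the text of
 * /proc/partitions, copy the matching partition name into name.
 * Returns 0 on success, -1 if the field is malformed or not found.
 */
int partition_name(const char *partitions, const char *dev_field,
                   char *name, size_t name_size)
{
    unsigned int want_major, want_minor;
    if (sscanf(dev_field, "%x:%x", &want_major, &want_minor) != 2)
        return -1;

    const char *line = partitions;
    while (line != NULL && *line != '\0') {
        unsigned int major, minor;
        unsigned long long blocks;
        char buf[64];

        /* The header line fails to parse as numbers and is skipped. */
        if (sscanf(line, "%u %u %llu %63s",
                   &major, &minor, &blocks, buf) == 4 &&
            major == want_major && minor == want_minor) {
            snprintf(name, name_size, "%s", buf);
            return 0;
        }
        line = strchr(line, '\n');  /* advance to the next line */
        if (line != NULL)
            line++;
    }
    return -1;
}
```

For the output above, “03:08” resolves to hda8, that is, /dev/hda8.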

3.3.13. /proc/pci

This file contains detailed information on each device connected to the PCI bus of your computer. Examining this file can be useful when diagnosing problems with a certain PCI device. The information contained within this file is very specific to each particular device.

此文件包含连接到计算机 PCI 总线的每个设备的详细信息。在诊断特定 PCI 设备的问题时, 检查此文件可能很有用。此文件中包含的信息对每个特定设备都非常具体。

3.3.14. /proc/slabinfo

This file contains statistics on certain kernel structures and caches. It can be useful to examine this file when debugging system memory-related problems. Refer to the slabinfo(5) man page for more detailed information.

此文件包含某些内核结构和缓存的统计信息。在调试与系统内存相关的问题时检查此文件可能很有用。有关详细信息, 请参阅 slabinfo (5) 帮助手册。

3.4. System Information and Manipulation

A key subdirectory in the /proc filesystem is the sys directory. It contains many kernel configuration entries. These entries can be used to view or in some cases manipulate kernel settings. Some of the more useful and important entries will be discussed in the following sections.

/proc文件系统中的关键子目录是 sys 目录。它包含许多内核配置项。这些配置项可用于查看或在某些情况下操作内核设置。下面几节将讨论一些更有用和重要的配置项。

3.4.1. /proc/sys/fs

This directory contains a number of pseudo-files representing a variety of file system information. There is good documentation in the proc(5) man page on the files within this directory, but it’s worth noting some of the more important ones.

此目录包含一些表示各种文件系统信息的伪文件。这个目录中的文件在proc (5) 帮助手册中中有很好的文档 , 但有一些要特别值得注意。

3.4.1.1. dir-notify-enable

This file acts as a switch for the Directory Notification feature. This can be a useful feature in problem determination in that you can have a program watch a specific directory and be notified immediately of any change to it. See /usr/src/linux/Documentation/dnotify.txt for more information and for a sample program.

此文件用作目录通知功能的开关。在问题确定中, 这可能是一个有用的功能: 您可以让程序监视特定的目录, 一旦有任何更改就立即获得通知。有关更多信息和示例程序, 请参见 /usr/src/linux/Documentation/dnotify.txt。

3.4.1.2. file-nr

This read-only file contains statistics on the number of files presently opened and available on the system. The file shows three separate values. The first value is the number of allocated file handles, the second is the number of free file handles, and the third is the maximum number of file handles.

此只读文件包含系统当前打开和可用文件数量的统计信息。该文件显示三个单独的值。第一个值是分配的文件句柄的数量, 第二个是可用文件句柄的数目, 第三个是文件句柄的最大数目。

penguin> cat /proc/sys/fs/file-nr

2858    177     104800

 

On my system, 2858 file handles are allocated, 177 of them are free for use, and the maximum is 104800 file handles in total. The kernel dynamically allocates file handles but does not free them when they’re no longer used. Therefore the first number, 2858, is the high water mark for total file handles in use at one time on my system. The maximum limit of file handles is also reflected in the /proc/sys/fs/file-max file.

在我的系统上, 有2858个文件句柄已分配;还有177个句柄可供使用, 最大限制为104800个文件句柄。内核动态分配文件句柄, 但在不再使用它们时不会释放它们。因此, 第一个数字, 2858, 是我的系统中一次使用的总文件句柄的最大值。文件句柄的最大限制也反映在/proc/sys/fs/file-max 文件中。
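As a rough illustration of reading these three fields, here is a minimal shell sketch. The function name file_nr_summary and its output format are made up for this example; it assumes the file-nr layout described above (allocated, free, maximum).

```shell
# Summarize a file-nr style line read from stdin: "<allocated> <free> <max>".
# file_nr_summary is a hypothetical helper, not a standard tool.
file_nr_summary() (
    read -r allocated free max || exit 1
    in_use=$((allocated - free))
    pct=$((in_use * 100 / max))
    echo "in use: $in_use of $max (${pct}%)"
)

# Run against the live system only when the file is readable.
if [ -r /proc/sys/fs/file-nr ]; then
    file_nr_summary < /proc/sys/fs/file-nr
fi
```

With the sample values shown earlier (2858, 177, 104800), the function reports 2681 handles in use, about 2% of the limit.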

3.4.1.3. file-max

This file represents the system-wide limit on the number of files that can be open at the same time. If you’re running a large workload on your system such as a database management system or a Web server, you may see errors in the system log about running out of file handles. Examine this file along with the file-nr file to determine whether increasing the limit is a valid option. If so, do the following as root:

 

echo 104800 > /proc/sys/fs/file-max

 

Do this with a fair degree of caution, because excessive use of file descriptors could indicate a programming error commonly referred to as a file descriptor leak. Be sure to refer to your application’s documentation to determine the recommended value for file-max.
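One way to look for a possible descriptor leak before raising file-max is to count the entries in a process's /proc/<pid>/fd directory. The sketch below assumes that directory is available on your kernel; count_fds and the PROC_ROOT override are hypothetical names used only for this example.

```shell
# Count the open file descriptors of one process by listing its fd directory.
# PROC_ROOT is a hypothetical override that lets the sketch run on a test tree.
count_fds() {
    ls "${PROC_ROOT:-/proc}/$1/fd" 2>/dev/null | wc -l | tr -d ' '
}

# Show the descriptor count for the current shell.
echo "pid $$ has $(count_fds $$) open descriptors"
```

A count that keeps growing while the workload stays steady is a hint that the application, rather than the limit, is the problem.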

3.4.1.4. aio-max-nr, aio-max-pinned, aio-max-size, aio-nr, and aio-pinned

These files are not included as part of the 2.4.24 and 2.6.0 mainline kernels. They provide additional interfaces to the Asynchronous I/O feature, which is a part of the 2.6.0 mainline kernel but not 2.4.24.


3.4.1.5. overflowgid and overflowuid

These files hold the group ID and user ID to use on remote systems whose filesystems do not support 32-bit gids and uids as Linux does. Keep this in mind, because NFS is very commonly used and NFS-related problems can be tricky to diagnose. On my system, these values are defined as follows:

penguin> cat overflowgid
65534
penguin> cat overflowuid
65534

3.4.2. /proc/sys/kernel

This directory contains several very important files related to kernel tuning and information. Much of the information here is low-level and will never need to be examined or changed by the average user, so I’ll just highlight some of the more interesting entries, especially pertaining to problem determination.


3.4.2.1. core_pattern

This file is new in the 2.6 kernel, but some distributions such as SuSE have backported it to their 2.4 kernels. Its value is a template for the name of the file written when an application dumps its core. The advantage is that, through % specifiers, the administrator has full control over where core files are written and what they are named. For example, it may be advantageous to create a directory called /core and set core_pattern with a command like the following:

 

penguin> echo "/core/%e.%p" > core_pattern

 

For example, if the program foo causes an exception and dumps its core, the file /core/foo.3135 will be created.

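To see how such a template maps to a file name, the substitution can be mimicked in the shell. This sketch handles only the %e (executable name) and %p (process ID) specifiers used above; expand_core_pattern is a hypothetical helper, not something the kernel provides.

```shell
# Expand %e and %p in a core_pattern style template.
# Usage: expand_core_pattern <template> <executable> <pid>
# Simplification: other specifiers are left untouched, and names containing
# sed metacharacters such as | or & are not handled.
expand_core_pattern() {
    printf '%s\n' "$1" | sed -e "s|%e|$2|g" -e "s|%p|$3|g"
}

expand_core_pattern "/core/%e.%p" foo 3135
```

This prints /core/foo.3135, the same name given in the example above.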

3.4.2.2. msgmax, msgmnb, and msgmni

These three files are used to configure the kernel parameters for System V IPC messages. msgmax sets the maximum number of bytes in a single message written to a message queue. msgmnb stores the maximum number of bytes a queue may hold, and is used to initialize subsequently created message queues. msgmni defines the maximum number of message queue identifiers allowed on the system. These values are often very dependent on the workload that your system is running and may need to be updated. Many applications will automatically change these values, but some might require the administrator to do it.
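Before changing any of these limits it helps to see the current values side by side. The following is a small sketch; show_entries is a hypothetical helper that takes a directory and a list of entry names.

```shell
# Print each named entry of a directory as "name = value".
# Usage: show_entries <directory> <name>...
show_entries() (
    dir=$1
    shift
    for name in "$@"; do
        if [ -r "$dir/$name" ]; then
            printf '%s = %s\n' "$name" "$(cat "$dir/$name")"
        else
            printf '%s = (not available)\n' "$name"
        fi
    done
)

show_entries /proc/sys/kernel msgmax msgmnb msgmni
```

The same helper works for any other group of single-value entries under /proc/sys.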

3.4.2.3. panic and panic_on_oops

The panic file lets the user control what happens when the kernel enters a panic state. If the value of this file is 0, the kernel will loop, and the machine will remain in the panic state until it is manually rebooted. A non-zero value represents the number of seconds the kernel should remain in panic mode before rebooting. Having the kernel automatically reboot the system in the event of a panic could be a very useful feature if high availability is a primary concern.

The panic_on_oops file is new in the 2.6.0 mainline kernel. When set to 1, it tells the kernel to pause for a few seconds before panicking when it encounters a BUG or an Oops. This gives klogd an opportunity to write the Oops or BUG report to disk so that it can be easily examined when the system is returned to a normal state.
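A small helper for changing an entry such as panic might look like the following. This is only a sketch: set_tunable is a hypothetical name, the value 30 is an arbitrary choice, and the real call is left commented out because it changes how the system reacts to a panic.

```shell
# Write a value to a kernel tunable and read it back.
# Usage: set_tunable <file> <value>
# Entries under /proc/sys normally require root to modify.
set_tunable() {
    if [ ! -w "$1" ]; then
        echo "cannot write $1 (not root, or entry missing)" >&2
        return 1
    fi
    echo "$2" > "$1" && echo "$1 is now $(cat "$1")"
}

# Example: reboot 30 seconds after a panic (commented out on purpose).
# set_tunable /proc/sys/kernel/panic 30
```

Reading the value back after writing confirms that the kernel accepted it.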

3.4.2.4. printk

This file contains four values that determine how kernel messages are logged: console_loglevel, default_message_loglevel, minimum_console_loglevel, and default_console_loglevel, in that order. Generally, the default values suffice, although changing the values might be advantageous when debugging the kernel.

3.4.2.5. sem

This file contains four numbers that define limits for System V IPC semaphores. These limits are SEMMSL, SEMMNS, SEMOPM, and SEMMNI respectively. SEMMSL represents the maximum number of semaphores per semaphore set; SEMMNS is the maximum number of semaphores in all semaphore sets for the whole system; SEMOPM is the maximum number of operations that can be used in a semop(2) call; and SEMMNI represents the maximum number of semaphore identifiers for the whole system. The values needed for these parameters will vary by workload and application, so it is always best to consult your application’s documentation.

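Because the file is just four bare numbers, it is easy to misread which is which. A minimal sketch that labels them in the order given above (label_sem is a hypothetical helper):

```shell
# Label the four fields of a sem style line read from stdin.
# Field order follows the text: SEMMSL, SEMMNS, SEMOPM, SEMMNI.
label_sem() (
    read -r semmsl semmns semopm semmni || exit 1
    printf 'SEMMSL=%s\nSEMMNS=%s\nSEMOPM=%s\nSEMMNI=%s\n' \
        "$semmsl" "$semmns" "$semopm" "$semmni"
)

# Run against the live system only when the file is readable.
if [ -r /proc/sys/kernel/sem ]; then
    label_sem < /proc/sys/kernel/sem
fi
```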

3.4.2.6. shmall, shmmax, and shmmni

These three files define the limits for System V IPC shared memory. shmall is the limit on the total number of pages of shared memory for the system. shmmax defines the maximum size, in bytes, of a shared memory segment. shmmni defines the maximum number of shared memory segments allowed for the system. These values are very workload-dependent and may need to be changed when running a database management system or a Web server.
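Because shmall is expressed in pages while shmmax is a byte count, comparing them takes a conversion by the page size. A minimal sketch follows; pages_to_bytes is a hypothetical helper, the value 2097152 is an arbitrary example, and getconf PAGESIZE is assumed to be available.

```shell
# Convert a page count to bytes: pages_to_bytes <pages> <page_size>.
# Very large page counts may overflow shell arithmetic.
pages_to_bytes() {
    echo $(( $1 * $2 ))
}

# Use the system page size if getconf can report it, else assume 4096.
page_size=$(getconf PAGESIZE 2>/dev/null || echo 4096)
echo "2097152 pages = $(pages_to_bytes 2097152 "$page_size") bytes"
```

With a 4096-byte page, 2097152 pages correspond to 8589934592 bytes (8 GB).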

3.4.2.7. sysrq

This file controls whether the “kernel magic sysrq key” is enabled or not. This feature may have to be explicitly turned on during compilation. If /proc/sys/kernel/sysrq exists, the feature is available; otherwise, you’ll need to recompile your kernel before using it. It is recommended to have this feature enabled because it can help to diagnose some of the tricky system hangs and crashes.

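The existence check described above is easy to script. In this sketch, sysrq_status is a hypothetical helper; the optional path argument exists only so the function can be tried against an ordinary file.

```shell
# Report whether magic SysRq support is present and switched on.
# Usage: sysrq_status [file]   (default: /proc/sys/kernel/sysrq)
sysrq_status() (
    file=${1:-/proc/sys/kernel/sysrq}
    if [ ! -e "$file" ]; then
        echo "not compiled in"
    elif [ "$(cat "$file" 2>/dev/null)" = "0" ]; then
        echo "available but disabled"
    else
        echo "enabled"
    fi
)

sysrq_status
```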

The basic idea is that the ALT-SysRq hotkey combination interrupts the kernel directly, bypassing the rest of the operating system, so that it displays certain information. In many cases where the machine seems to be hung, the ALT-SysRq key can still be used to gather kernel information for examination and/or forwarding to a distribution’s support area or other experts.

To enable this feature, do the following as root:


 

penguin> echo 1 > /proc/sys/kernel/sysrq

 

To test the kernel magic, switch to your first virtual console. You need not log in because the key combination triggers the kernel directly. Hold down the right ALT key, then press and hold the PrtSc/SysRq key, then press the number 5. You should see something similar to the following:


 

SysRq : Changing Loglevel
Loglevel set to 5

 

If you do not see this message, it could be that the kernel is set to send messages to virtual console 10 by default. Press CTRL-ALT-F10 to switch to virtual console 10 and check whether the messages appear there. If they do, then you know that the kernel magic is working properly. If you’d like to change where the messages are sent by default, for example to virtual console 1 instead of 10, run this command as root:

 

/usr/sbin/klogmessage -r 1

 

This change will only be in effect until the next reboot, so to make the change permanent, grep through your system’s startup scripts for “klogmessage” to determine where it gets set to virtual console 10 and change it to whichever virtual console you wish. For my SuSE Pro 9.0 system, this setting occurs in /etc/init.d/boot.klog.


Where the messages get sent is important to note: if your system hangs and kernel magic may be of use, the system must already be on the virtual console where the messages appear. This is because it is very likely that the kernel won’t respond to the CTRL-ALT-Function keys to switch virtual consoles.

So what can you do with the kernel magic stuff then?

Press ALT-SysRq-h to see a Help screen. You should see the following:


SysRq : HELP : loglevel0-8 reBoot Crash Dumpregisters tErm kIll saK
showMem showPc unRaw Sync showTasks Unmount

 

If you’re seeing these messages you can gather this information to determine the cause of the problem. Some of the commands such as showTasks will dump a large amount of data, so it is highly recommended that a serial console be set up to gather and save this information. See the “Setting up a Serial Console” section for more information. Note however, that depending on the state of the kernel, the information may be saved to the /var/log/messages file as well so you may be able to retrieve it after a reboot.

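If the kernel was healthy enough for the messages to reach the log, something like the following can locate them afterward. This is a sketch that assumes the log lives in /var/log/messages and that each command's output begins with a "SysRq :" header, as in the samples in this section; sysrq_log_lines is a made-up helper name.

```shell
# Print the SysRq header lines of a log file, prefixed with line numbers.
# Usage: sysrq_log_lines [logfile]   (default: /var/log/messages)
# Only the headers match; the dump itself follows each header in the log.
sysrq_log_lines() {
    grep -n 'SysRq :' "${1:-/var/log/messages}" 2>/dev/null
}

# Show the most recent headers, if any were logged.
sysrq_log_lines | tail -n 20
```

The line numbers make it easy to jump to each dump in the log with a pager or editor.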

The most important pieces of information to gather are showPc, showMem, and showTasks. Output samples of these commands are shown here. Note that the output of the showTasks command had to be truncated because it dumps quite a bit of data. Dumpregisters is also valuable to have, but it requires special configuration and is not enabled by default. After capturing this information, it is advisable to execute the Sync and reBoot commands to properly restart the system if an Oops or other kernel error was encountered. Simply using kernel magic at any given time is usually harmless and does not require a Sync or reBoot command to be performed.

3.4.2.7.1. showPc Output:

SysRq : Show Regs

Pid: 0, comm:              swapper
EIP: 0010:[default_idle+36/48] CPU: 0 EFLAGS: 00003246 Tainted: PF
EIP: 0010:[<c0106f94>] CPU: 0 EFLAGS: 00003246 Tainted: PF
EAX: 00000000 EBX: c0106f70 ECX: 00000000 EDX: 00000019
ESI: c0326000 EDI: c0326000 EBP: ffffe000 DS: 0018 ES: 0018
CR0: 8005003b CR2: 4001a000 CR3: 05349000 CR4: 000006d0
Call Trace:    [cpu_idle+50/96] [rest_init+0/32]
Call Trace:    [<c0106ff2>] [<c0105000>]

3.4.2.7.2. showMem Output:

SysRq : Show Memory
Mem-info:
Free pages:       15316kB ( 2044kB HighMem)
Zone:DMA freepages:  3780kB
Zone:Normal freepages:  9492kB
Zone:HighMem freepages:  2044kB
( Active: 128981, inactive: 102294, free: 3829 )
1*4kB 0*8kB 0*16kB 0*32kB 1*64kB 1*128kB 0*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 3780kB)
65*4kB 54*8kB 72*16kB 157*32kB 15*64kB 1*128kB 0*256kB 1*512kB 1*1024kB 0*2048kB 0*4096kB = 9492kB)
7*4kB 2*8kB 3*16kB 15*32kB 9*64kB 1*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 2044kB)
Swap cache: add 295, delete 32, find 0/0, race 0+0
Free swap:       1026940kB
262000 pages of RAM
32624 pages of HIGHMEM
3973 reserved pages
457950 pages shared
263 pages swap cached
33 pages in page table cache
Buffer memory:    74760kB
Cache memory:   778952kB
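The three long lines in the showMem output appear to break each zone's free memory down by block size: each count*size term is a number of free blocks of that size, and the terms add up to the zone's freepages value. A small sketch (sum_free_blocks is a hypothetical helper) that checks this arithmetic for the first of those lines:

```shell
# Sum a string of "count*sizekB" terms and print the total in kB.
sum_free_blocks() (
    set -f                      # keep the * in each term from globbing
    total=0
    for term in $1; do
        case $term in
            *'*'*kB)
                count=${term%%\**}      # text before the '*'
                size=${term#*\*}        # text after the '*'
                size=${size%kB}         # drop the unit
                total=$((total + count * size))
                ;;
        esac
    done
    echo "${total}kB"
)

sum_free_blocks "1*4kB 0*8kB 0*16kB 0*32kB 1*64kB 1*128kB 0*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB"
```

This prints 3780kB, which matches the Zone:DMA freepages value above.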

3.4.2.7.3. showTasks Output:

SysRq : Show Memory
SysRq : Show State

                         free                       sibling
  task             PC    stack   pid father child younger older
init          S CDFED120  236     1      0  4746 (NOTLB)
Call Trace:    [schedule_timeout+99/176] [process_timeout+0/16] [do_select+481/560] [__pollwait+0/208] [sys_select+808/1232]
Call Trace:    [<c0125923>] [<c01258b0>] [<c0154471>] [<c01540d0>] [<c0154818>]
  [system_call+51/64]
  [<c0108dd3>]
keventd       S C4925AC0         0       2         1                    3 (L-TLB)
Call Trace:    [context_thread+259/448] [rest_init+0/32] [rest_init+0/32] [arch_kernel_thread+35/48] [context_thread+0/448]
Call Trace:    [<c0129d53>] [<c0105000>] [<c0105000>] [<c0107333>] [<c0129c50>]
ksoftirqd_CPU S CDFED080           0       3          1                  4     2 (L-TLB)
Call Trace:    [rest_init+0/32] [ksoftirqd+183/192] [arch_kernel_thread+35/48] [ksoftirqd+0/192]
Call Trace:    [<c0105000>] [<c0121e57>] [<c0107333>] [<c0121da0>]
kswapd        S C764E680    1260           4          1                   5    3 (L-TLB)
Call Trace:    [kswapd+171/176] [rest_init+0/32] [rest_init+0/32] [arch_kernel_thread+35/48] [kswapd+0/176]
Call Trace:    [<c013c4ab>] [<c0105000>] [<c0105000>] [<c0107333>] [<c013c400>]
bdflush       S C02DEEE0      60           5          1                   6    4 (L-TLB)
Call Trace:    [interruptible_sleep_on+61/96] [bdflush+195/208] [rest_init+0/32] [arch_kernel_thread+35/48] [bdflush+0/208]
Call Trace:    [<c011a24d>] [<c01490a3>] [<c0105000>] [<c0107333>] [<c0148fe0>]
kupdated      S C4925AC0     820          6           1                   7    5 (L-TLB)
Call Trace:    [schedule_timeout+99/176] [process_timeout+0/16] [kupdate+205/384] [rest_init+0/32] [rest_init+0/32]
Call Trace:    [<c0125923>] [<c01258b0>] [<c014917d>] [<c0105000>] [<c0105000>]
   [arch_kernel_thread+35/48] [kupdate+0/384]
   [<c0107333>] [<c01490b0>]

<<SNIP>>

3.4.2.8. tainted

This file indicates whether the kernel has loaded modules that are not under the GPL. If it has loaded proprietary modules, the tainted flag is bitwise ORed with 1. If a module was loaded forcefully (that is, by running insmod -f), the tainted flag is bitwise ORed with 2. During a kernel Oops or panic, the value of the tainted flag is dumped to reflect the module loading history.
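Since the value is a bit mask, a quick decode can be sketched in the shell. The helper name decode_tainted and its messages are made up for this example, and only the two bits described above are interpreted; any other bits a kernel may set are ignored.

```shell
# Decode the two tainted bits described in the text.
# Usage: decode_tainted <value>
decode_tainted() (
    value=$1
    if [ "$value" -eq 0 ]; then
        echo "not tainted"
        exit 0
    fi
    if [ $((value & 1)) -ne 0 ]; then
        echo "proprietary module loaded"
    fi
    if [ $((value & 2)) -ne 0 ]; then
        echo "module force-loaded"
    fi
)

# Decode the live value only when the file is readable.
if [ -r /proc/sys/kernel/tainted ]; then
    decode_tainted "$(cat /proc/sys/kernel/tainted)"
fi
```

A value of 3, for instance, means both conditions apply, which corresponds to the "Tainted: PF" seen in the showPc sample.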

3.4.3. /proc/sys/vm

This directory holds several files that allow the user to tune the Virtual Memory subsystem of the kernel. The default values are normally fine for everyday use; therefore, these files needn’t be modified too much. The Linux VM is arguably the most complex part of the Linux kernel, so discussion of it is beyond the scope of this book. There are many resources on the Internet that provide documentation for it such as Mel Gorman’s “Understanding the Linux Virtual Memory Manager” Web site located at http://www.csn.ul.ie/~mel/projects/vm/guide/html/understand.


3.5. Conclusion

There is more to the /proc filesystem than was discussed in this chapter. However, the goal here was to highlight some of the more important and commonly used entries. The /proc filesystem also varies by kernel version and by distribution, as various features appear in some versions and not others. In any case, the /proc filesystem offers a wealth of insight into your system and all the processes that run on it.

Reposted from blog.csdn.net/mounter625/article/details/102711849