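Arriving at the office in the morning, I found an alert saying that the root partition (/) of a production server was full. I logged in to the server and checked disk usage with df -h: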
# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda2        36G   36G    0G 100% /
tmpfs           7.8G     0  7.8G   0% /dev/shm
/dev/vdb        296G  154M  281G   1% /data
Next I used du -h --max-depth to see which directory was taking up the space:
# du -h --max-depth=1 /var/log
5.8M /var/log/sa
212K /var/log/tomcat6
4.0K /var/log/httpd
4.0K /var/log/cups
4.0K /var/log/sssd
20K /var/log/logstash
4.0K /var/log/qemu-ga
28K /var/log/prelink
8.0K /var/log/samba
25M /var/log/audit
8.0K /var/log/ConsoleKit
4.0K /var/log/ntpstats
23G /var/log
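One caveat about the listing above: du --max-depth=1 prints only directories, so a single huge file sitting directly in /var/log shows up in the 23G total and nowhere else. A minimal sketch that also lists files and sorts by size, assuming GNU du and sort; the function name largest_entries is made up for illustration:

```shell
# Print the N largest files and directories directly under DIR (sizes in KiB).
largest_entries() {
    dir=${1:-/var/log}
    count=${2:-10}
    # -a: include regular files, -k: sizes in KiB, -x: stay on one filesystem
    du -akx --max-depth=1 "$dir" 2>/dev/null | sort -rn | head -n "$count"
}

# The first line is the directory total; the biggest entries follow.
largest_entries /var/log 10
```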
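In /var/log, the maillog file turned out to be over 20 GB. At first I assumed this file alone was the problem, so I emptied it with the command below; when I checked the disk again, usage had dropped.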
# echo '' > /var/log/maillog
# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda2        36G  3.1G   31G  10% /
tmpfs           7.8G     0  7.8G   0% /dev/shm
/dev/vdb        296G  985M  280G   1% /data
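A side note on the emptying step: echo '' leaves a one-byte file containing a newline, and deleting a log that a daemon still holds open would not release its blocks until the daemon closes it. A sketch of truncating in place instead, assuming the truncate command from GNU coreutils is available; empty_log is a hypothetical helper name:

```shell
# Truncate a log file to zero bytes without removing it, so the writer's
# open file descriptor stays valid and the blocks are released right away.
empty_log() {
    log_file=$1
    [ -f "$log_file" ] || { echo "not a regular file: $log_file" >&2; return 1; }
    truncate -s 0 "$log_file"
}

# Example (needs write permission on the file):
#   empty_log /var/log/maillog
```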
But I soon noticed that the file was still being written to at a furious rate, so I looked at its contents:
# tail -f /var/log/maillog
Sep 29 15:50:41 VM_26_233_centos postfix/postdrop[11290]: warning: mail_queue_enter: create file maildrop/98512.11290: No space left on device
Sep 29 15:50:41 VM_26_233_centos postfix/postdrop[28639]: warning: mail_queue_enter: create file maildrop/98551.28639: No space left on device
Sep 29 15:50:41 VM_26_233_centos postfix/postdrop[13193]: warning: mail_queue_enter: create file maildrop/98611.13193: No space left on device
Sep 29 15:50:41 VM_26_233_centos postfix/postdrop[6030]: warning: mail_queue_enter: create file maildrop/98512.6030: No space left on device
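The No space left ... message in the log quickly made me realize that the inodes were probably exhausted: the same error is reported when a filesystem runs out of data blocks and when it runs out of inodes, and the previous step had only freed blocks. I checked with the following command: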
# df -ih
Filesystem     Inodes IUsed IFree IUse% Mounted on
/dev/vda2        2.3M  2.3M    0M  100% /
tmpfs            2.0M     1  2.0M    1% /dev/shm
/dev/vdb          19M  1.5K   19M    1% /data
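Because a filesystem can run out of blocks and inodes independently, it helps to check both at once. A minimal sketch, assuming GNU df (the -P flag keeps each filesystem on a single line); usage_pct and check_fs are hypothetical names, and the 90% threshold is arbitrary:

```shell
# Print the usage percentage of the filesystem holding PATH.
# $1 = extra df flag ("" for blocks, "-i" for inodes), $2 = path
usage_pct() {
    df -P $1 "$2" | awk 'NR == 2 {
        sub(/%/, "", $5)
        # some filesystems report "-" for inodes; treat that as 0
        if ($5 ~ /^[0-9]+$/) print $5; else print 0
    }'
}

# Report block and inode usage; succeed only if both are below LIMIT percent.
check_fs() {
    path=${1:-/}
    limit=${2:-90}
    blocks=$(usage_pct "" "$path")
    inodes=$(usage_pct "-i" "$path")
    echo "$path: blocks ${blocks}% inodes ${inodes}%"
    [ "$blocks" -lt "$limit" ] && [ "$inodes" -lt "$limit" ]
}

check_fs / 90 || echo "warning: / is at or above the threshold" >&2
```

Run from cron or a monitoring agent, a check like this would have flagged the inode exhaustion even while df -h looked healthy.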
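Sure enough! So what had used up all the inodes? Googling create file maildrop/367284.14836: No space left on device mostly pointed to postdrop exhausting resources. Following the method suggested online, I checked:
# ls /var/spool/postfix/maildrop/
That directory held only a dozen or so files, so postdrop clearly was not what filled the inodes. For details on the postdrop problem, see: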
http://www.duyumi.net/442.html
https://hambut.com/2015/12/22/crontab-sendmail-postdrop-system-crash/
Next I had to work out which directory was consuming the inodes. I searched level by level with the following command:
# for i in /var/spool/*; do echo "$i"; find "$i" | wc -l; done
/var/spool/abrt
580000000
/var/spool/abrt-upload
1
/var/spool/anacron
4
/var/spool/at
2
/var/spool/cron
2
/var/spool/cups
2
/var/spool/lpd
1
/var/spool/mail
4
/var/spool/plymouth
2
/var/spool/postfix
41
It turned out that /var/spool/abrt held so many files that they had used up the inodes.
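A variant of the loop above that stays on one filesystem and sorts by count, so the culprit lands on the first line. This is a sketch assuming GNU find; inode_hogs is a hypothetical name, and the counts are approximate (a hard-linked file is counted once per name):

```shell
# For each subdirectory of PARENT, print "<entry count><TAB><directory>",
# largest first. Each file, directory or symlink found uses one inode.
inode_hogs() {
    parent=${1:-/var/spool}
    for d in "$parent"/*/; do
        [ -d "$d" ] || continue
        # -xdev keeps find from crossing into other mounted filesystems
        n=$(( $(find "$d" -xdev 2>/dev/null | wc -l) ))
        printf '%s\t%s\n' "$n" "${d%/}"
    done | sort -rn
}

inode_hogs /var/spool | head -n 5
```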
Next I looked at what the files in it contained:
# pwd
/var/spool/abrt
# ll
total 580000000
-rw------- 1 root root 37 Sep 29 10:07 last-via-server
drwxr-x--- 2 abrt root 4096 Sep 29 09:48 pyhook-2018-09-29-09:44:04-12072
drwxr-x--- 2 abrt root 4096 Sep 29 09:48 pyhook-2018-09-29-09:45:03-12071
drwxr-x--- 2 abrt root 4096 Sep 29 09:48 pyhook-2018-09-29-09:47:31-15236
drwxr-x--- 2 abrt root 4096 Sep 29 09:48 pyhook-2018-09-29-09:48:04-16068
drwxr-x--- 2 abrt root 4096 Sep 29 10:07 pyhook-2018-09-29-09:49:05-16791
drwxr-x--- 2 abrt root 4096 Sep 29 10:07 pyhook-2018-09-29-09:50:04-17222
drwxr-x--- 2 abrt root 4096 Sep 29 10:07 pyhook-2018-09-29-10:04:22-23859
drwxr-x--- 2 abrt root 4096 Sep 29 10:07 pyhook-2018-09-29-10:07:08-25148
# cd pyhook-2018-09-29-10:07:08-25148
# ll
total 48
-rw-r----- 1 abrt root 5 Sep 29 10:07 abrt_version
-rw-r----- 1 abrt root 6 Sep 29 10:07 analyzer
-rw-r----- 1 abrt root 6 Sep 29 10:07 architecture
-rw-r----- 1 abrt root 504 Sep 29 10:07 backtrace
-rw-r----- 1 abrt root 53 Sep 29 10:07 cmdline
-rw-r----- 1 abrt root 37 Sep 29 10:07 executable
-rw-r----- 1 abrt root 46 Sep 29 10:07 hostname
-rw-r----- 1 abrt root 34 Sep 29 10:07 kernel
-rw-r----- 1 abrt root 26 Sep 29 10:07 os_release
-rw-r----- 1 abrt root 58 Sep 29 10:07 reason
-rw-r----- 1 abrt root 10 Sep 29 10:07 time
-rw-r----- 1 abrt root 1 Sep 29 10:07 uid
# cat reason
<string>:1:connect:error: [Errno 110] Connection timed out
# cat cmdline
/usr/bin/python /data/apps/scripts/count_nginx_respt.py
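At this point everything was clear. A crontab entry kept running /usr/bin/python /data/apps/scripts/count_nginx_respt.py; each run failed with an uncaught exception (the connection timeout shown above), and each failure left a crash report (one of the pyhook-* directories) under /var/spool/abrt.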
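Only one dump was inspected by hand above. To confirm that the same command is behind most of them, the cmdline files can be tallied. A sketch, assuming every dump directory contains a cmdline file as in the listing; dump_sources is a hypothetical name:

```shell
# Count how many crash dumps each command line is responsible for.
dump_sources() {
    dump_dir=${1:-/var/spool/abrt}
    # cmdline files may lack a trailing newline, so add one after each file
    # and drop the empty lines this can produce.
    find "$dump_dir" -mindepth 2 -maxdepth 2 -type f -name cmdline \
        -exec sh -c 'for f; do cat "$f"; echo; done' sh {} + 2>/dev/null |
        grep -v '^$' | sort | uniq -c | sort -rn
}

dump_sources /var/spool/abrt | head -n 5
```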
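Good. The fix, then, is to have the Python script catch the error instead of crashing, so that no crash report is produced.
Wrapping the two statements in question in a try block is enough.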
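Next, a closer look at the /var/spool/abrt directory.
This directory fills up so often because all crash reports and backtraces, including those from kernel drivers, are written into subdirectories of /var/spool/abrt. To stop backtrace gathering, stop the following two services: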
# service abrtd stop
# service abrt-oops stop
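Stopping a service does not keep it from starting again at the next boot. A sketch for also disabling the two services, assuming a SysV-init system with chkconfig (the host looks like CentOS 6, but that is an assumption; on systemd hosts systemctl disable would be the counterpart). disable_abrt is a hypothetical name:

```shell
# Stop the ABRT daemons now and keep them from starting at boot.
disable_abrt() {
    for svc in abrtd abrt-oops; do
        service "$svc" stop      # stop the running daemon
        chkconfig "$svc" off     # remove it from the boot runlevels
    done
}

# Example (run as root):
#   disable_abrt
```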
All of those directories and files can then be removed with the following abrt-cli command:
# abrt-cli rm /var/spool/abrt/*
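With this many dump directories, expanding /var/spool/abrt/* on the command line may run into the kernel's argument-length limit ("Argument list too long"). A sketch that lets find do the iteration and bypasses abrt-cli, assuming abrtd is stopped and that plain removal of the directories is acceptable; purge_dumps and the 7-day default are illustrative choices:

```shell
# Remove crash-dump directories under DUMP_DIR that were last modified
# more than DAYS days ago. find batches the arguments for rm itself.
purge_dumps() {
    dump_dir=${1:-/var/spool/abrt}
    days=${2:-7}
    find "$dump_dir" -mindepth 1 -maxdepth 1 -type d -mtime +"$days" \
        -exec rm -rf {} +
}

# Example (run as root, destructive):
#   purge_dumps /var/spool/abrt 7
```

Keeping the most recent dumps leaves something to inspect if the crashes come back.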