Best Practices for Recovering Corrupted HDFS Blocks in Production

1. Upload the file hello.txt

[root@cdh-node01 apps]# hdfs dfs -mkdir /blockrecover

[root@cdh-node01 apps]# echo "hello word" > hello.txt

[root@cdh-node01 apps]# hdfs dfs -put hello.txt /blockrecover

[root@cdh-node01 apps]# hdfs dfs -ls /blockrecover

Found 1 items
-rw-r--r--   2 root supergroup         11 2019-03-03 18:26 /blockrecover/hello.txt

[root@cdh-node01 apps]# hdfs fsck /

Connecting to namenode via http://cdh-node01:50070/fsck?ugi=root&path=%2F
FSCK started by root (auth:SIMPLE) from /192.168.17.20 for path / at Sun Mar 03 18:27:50 CST 2019

Status: HEALTHY
 Number of data-nodes:    3
 Number of racks:        1
 Total dirs:            40
 Total symlinks:        0

Replicated Blocks:
 Total size:    108216 B
 Total files:    35
 Total blocks (validated):    25 (avg. block size 4328 B)
 Minimally replicated blocks:    25 (100.0 %)
 Over-replicated blocks:    0 (0.0 %)
 Under-replicated blocks:    0 (0.0 %)
 Mis-replicated blocks:        0 (0.0 %)
 Default replication factor:    2
 Average block replication:    2.0
 Missing blocks:        0
 Corrupt blocks:        0
 Missing replicas:        0 (0.0 %)

Erasure Coded Block Groups:
 Total size:    0 B
 Total files:    0
 Total block groups (validated):    0
 Minimally erasure-coded block groups:    0
 Over-erasure-coded block groups:    0
 Under-erasure-coded block groups:    0
 Unsatisfactory placement block groups:    0
 Average block group size:    0.0
 Missing block groups:        0
 Corrupt block groups:        0
 Missing internal blocks:    0
FSCK ended at Sun Mar 03 18:27:50 CST 2019 in 65 milliseconds


The filesystem under path '/' is HEALTHY

2. Delete one replica of one of the file's blocks directly on a DataNode (replication factor 2)

Locate the block file and its meta file on the DataNode, then delete both:

[root@cdh-node02 subdir0]# rm -rf blk_1073741874 blk_1073741874_1065.meta
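
The transcript does not show how the replica was located. As a hedged sketch, one way is to ask the NameNode for the file's block IDs with `hdfs fsck <path> -files -blocks -locations`, then search the DataNode's storage directory (the value of `dfs.datanode.data.dir`) for the matching files. The helper names `block_names` and `find_block_files` are illustrative assumptions, not part of the original procedure.

```shell
# Read `hdfs fsck <path> -files -blocks -locations` output on stdin and
# print each distinct block file name (blk_<id>) it mentions.
block_names() {
  grep -oE 'blk_[0-9]+' | sort -u
}

# Search a DataNode storage directory for a block file and its .meta file.
# $1: storage directory (hypothetical; use your dfs.datanode.data.dir value)
# $2: block file name, e.g. blk_1073741874
find_block_files() {
  find "$1" -type f \( -name "$2" -o -name "${2}_*.meta" \)
}
```

A typical use would be `hdfs fsck /blockrecover/hello.txt -files -blocks -locations | block_names` on a client, followed by `find_block_files "$DATA_DIR" blk_1073741874` on one of the DataNodes that fsck lists, where `DATA_DIR` is a placeholder for that node's storage directory.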

Restart HDFS so that the damage takes effect, then check with fsck:

[root@cdh-node01 ~]# hdfs fsck /

Connecting to namenode via http://cdh-node01:50070/fsck?ugi=root&path=%2F
FSCK started by root (auth:SIMPLE) from /192.168.17.20 for path / at Sun Mar 03 19:48:31 CST 2019

/blockrecover/hello.txt:  Under replicated BP-794681415-192.168.17.20-1548403311677:blk_1073741874_1065. Target Replicas is 2 but found 1 live replica(s), 0 decommissioned replica(s), 0 decommissioning replica(s).

/user/root/.Trash/Current/blockrecover/hello.txt: MISSING 1 blocks of total size 11 B.
Status: CORRUPT
 Number of data-nodes:    3
 Number of racks:        1
 Total dirs:            45
 Total symlinks:        0

Replicated Blocks:
 Total size:    108227 B
 Total files:    36
 Total blocks (validated):    26 (avg. block size 4162 B)
  ********************************
  UNDER MIN REPL'D BLOCKS:    1 (3.8461537 %)
  MINIMAL BLOCK REPLICATION:    1
  CORRUPT FILES:    1
  MISSING BLOCKS:    1
  MISSING SIZE:        11 B
  ********************************
 Minimally replicated blocks:    25 (96.15385 %)
 Over-replicated blocks:    0 (0.0 %)
 Under-replicated blocks:    1 (3.8461537 %)
 Mis-replicated blocks:        0 (0.0 %)
 Default replication factor:    2
 Average block replication:    1.8846154
 Missing blocks:        1
 Corrupt blocks:        0
 Missing replicas:        1 (1.9230769 %)

Erasure Coded Block Groups:
 Total size:    0 B
 Total files:    0
 Total block groups (validated):    0
 Minimally erasure-coded block groups:    0
 Over-erasure-coded block groups:    0
 Under-erasure-coded block groups:    0
 Unsatisfactory placement block groups:    0
 Average block group size:    0.0
 Missing block groups:        0
 Corrupt block groups:        0
 Missing internal blocks:    0
FSCK ended at Sun Mar 03 19:48:31 CST 2019 in 100 milliseconds


The filesystem under path '/' is CORRUPT
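
Reading the full report by eye is error-prone on a large cluster. The following is a minimal sketch, assuming the summary layout shown above, that pulls the overall status and the missing-block count out of fsck output. The function names are illustrative assumptions.

```shell
# Print the overall status reported by fsck (HEALTHY or CORRUPT).
# Reads fsck output on stdin.
fsck_status() {
  awk '/^Status:/ { print $2; exit }'
}

# Print the number on the "Missing blocks:" summary line.
# Reads fsck output on stdin.
missing_blocks() {
  awk -F: '/^[[:space:]]*Missing blocks:/ { gsub(/[[:space:]]/, "", $2); print $2; exit }'
}
```

For example, `hdfs fsck / | fsck_status` prints a single word that a monitoring script can compare against `HEALTHY`. To list only the affected files, `hdfs fsck / -list-corruptfileblocks` is also available.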

3. Manual repair with hdfs debug

Repair command:

[root@cdh-node01 apps]# hdfs debug recoverLease -path /blockrecover/hello.txt -retries 10
recoverLease SUCCEEDED on /blockrecover/hello.txt

Checking directly on the DataNode shows that the block file and meta file have been restored:

[root@cdh-node02 subdir0]# ll
total 8
-rw-r--r-- 1 root root 11 Mar  4 10:38 blk_1073741874
-rw-r--r-- 1 root root 11 Mar  4 10:38 blk_1073741874_1065.meta
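
When several files are affected, the same command can be applied to each path in turn. This is a sketch under assumptions: the function name `recover_leases` and the `HDFS_CMD` variable are hypothetical, and `HDFS_CMD` exists only so the loop can be dry-run with `echo` before touching a real cluster.

```shell
# Run `hdfs debug recoverLease` for every HDFS path read from stdin.
# $1: number of retries (defaults to 10, as in the command above)
# HDFS_CMD: hypothetical override for the hdfs binary, e.g. HDFS_CMD=echo
recover_leases() {
  retries="${1:-10}"
  while IFS= read -r path; do
    # Skip empty lines in the input list.
    [ -n "$path" ] || continue
    # Redirect stdin so the command cannot consume the remaining paths.
    "${HDFS_CMD:-hdfs}" debug recoverLease -path "$path" -retries "$retries" </dev/null
  done
}
```

With `HDFS_CMD=echo`, `printf '/blockrecover/hello.txt\n' | recover_leases 10` only prints the arguments that would be passed; leaving the variable unset runs the real `hdfs` command.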

4. Automatic repair

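After a block replica is damaged, the DataNode does not notice until it runs its next directory scan.
The directory scan runs every 6 hours by default (the value is in seconds):
dfs.datanode.directoryscan.interval : 21600
The block is not repaired until the DataNode sends its next block report to the NameNode.
The block report is also sent every 6 hours by default (the value is in milliseconds):
dfs.blockreport.intervalMsec : 21600000
Only after the NameNode receives the block report does it start the recovery.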
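
The two settings use different units, which is easy to misread. The small sketch below converts each value to hours. It assumes the value is a plain number; some Hadoop versions report a unit suffix, which these hypothetical helpers do not handle.

```shell
# Convert an interval given in seconds to whole hours.
secs_to_hours() {
  echo $(( $1 / 3600 ))
}

# Convert an interval given in milliseconds to whole hours.
msecs_to_hours() {
  echo $(( $1 / 3600000 ))
}
```

The effective values on a cluster can be read with `hdfs getconf -confKey dfs.datanode.directoryscan.interval` and `hdfs getconf -confKey dfs.blockreport.intervalMsec`, then passed to these helpers. If waiting for the next report is not acceptable, `hdfs dfsadmin -triggerBlockReport <datanode_host:ipc_port>` asks a DataNode to send one immediately.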

Summary:

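In production I generally prefer the manual repair approach, but it requires deleting the damaged block by hand first.
Remember: delete the damaged block file and its meta file, not the HDFS file.
Alternatively, you can download the file with get, delete it from HDFS, and then upload it again.
Never do the deletion with hdfs fsck / -delete: that deletes the corrupted files themselves, so the data is lost. Only use it if losing the data does not matter, or if you are confident you can backfill the data into HDFS from another source!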
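
The download, delete, and re-upload alternative can be sketched as follows. This is an illustration under assumptions: the function name `reupload` and the `HDFS_CMD` variable are hypothetical, and the download only succeeds while at least one healthy replica of every block is still readable.

```shell
# Download an HDFS file, delete it from HDFS, then upload it again.
# $1: HDFS path of the file
# $2: local path for the temporary copy (must not already exist)
# HDFS_CMD: hypothetical override for the hdfs binary, e.g. HDFS_CMD=echo
reupload() {
  # Each step runs only if the previous one succeeded, so the HDFS file
  # is never removed unless the local copy was downloaded first.
  "${HDFS_CMD:-hdfs}" dfs -get "$1" "$2" &&
  "${HDFS_CMD:-hdfs}" dfs -rm "$1" &&
  "${HDFS_CMD:-hdfs}" dfs -put "$2" "$1"
}
```

Setting `HDFS_CMD=echo` and calling `reupload /blockrecover/hello.txt /tmp/hello.txt` prints the three commands without running them, which is a cheap way to check the paths before doing it for real.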

Reposted from blog.csdn.net/lin443514407lin/article/details/88099704