Compressing files with Hadoop

Compress the files in one directory into another directory

Using ‘cut -f 2’

hadoop jar ./share/hadoop/tools/lib/hadoop-streaming-2.7.5.jar \
  -Dmapred.output.compress=true \
  -Dmapred.compress.map.output=true \
  -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
  -Dmapred.reduce.tasks=0 \
  -input /home/houzhizhen/defaultfs/test/input \
  -output /home/houzhizhen/defaultfs/test/outputcut \
  -mapper "cut -f 2"

This produces one file in the output directory for each file in the input directory. After decompressing an output file with ‘gunzip’, its length does not equal the length of the source file: it is shorter by 1 byte for every line in the file, probably because ‘\r\n’ is replaced with ‘\n’.
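
The line-ending guess can be checked locally without Hadoop. The following is a minimal sketch, assuming the source files really use CRLF line endings: the sample text is made up, and `tr -d '\r'` is only a stand-in for whatever the job does to the line endings. If the guess is right, the size difference equals the number of lines.

```shell
#!/bin/sh
# Hypothetical sample data: three lines, each ending in CRLF.
tmpdir=$(mktemp -d)
printf 'first line\r\nsecond line\r\nthird line\r\n' > "$tmpdir/source.txt"

# Stand-in for the job's effect on line endings (assumption): drop every CR.
tr -d '\r' < "$tmpdir/source.txt" > "$tmpdir/converted.txt"

# Arithmetic expansion strips the padding some wc implementations add.
src_bytes=$(( $(wc -c < "$tmpdir/source.txt") ))
out_bytes=$(( $(wc -c < "$tmpdir/converted.txt") ))
lines=$(( $(wc -l < "$tmpdir/source.txt") ))

# One byte lost per line means the difference matches the line count.
diff_bytes=$((src_bytes - out_bytes))
echo "lines=$lines difference=$diff_bytes"

rm -rf "$tmpdir"
```

On the real data, the same comparison would use `wc -c` and `wc -l` on a source file and on the matching gunzipped part file.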

[houzhizhen@localhost outputcut]$ ll
total 12
-rw-r--r--. 1 houzhizhen root 2938 May 16 10:07 part-00000.gz
-rw-r--r--. 1 houzhizhen root  325 May 16 10:07 part-00001.gz
-rw-r--r--. 1 houzhizhen root  128 May 16 10:07 part-00002.gz
-rw-r--r--. 1 houzhizhen root    0 May 16 10:07 _SUCCESS

Using ‘/bin/cat’

The output is identical to that of the previous test.

hadoop jar ./share/hadoop/tools/lib/hadoop-streaming-2.7.5.jar \
  -Dmapred.reduce.tasks=0 \
  -Dmapred.output.compress=true \
  -Dmapred.compress.map.output=true \
  -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
  -input /home/houzhizhen/defaultfs/test/input \
  -output /home/houzhizhen/defaultfs/test/output-gz \
  -mapper /bin/cat \
  -inputformat org.apache.hadoop.mapred.TextInputFormat \
  -outputformat org.apache.hadoop.mapred.TextOutputFormat
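
The `mapred.*` property names used in these commands are the older spellings. Hadoop 2.x still accepts them but treats them as deprecated. Below is a sketch of the same ‘/bin/cat’ job using the replacement names from Hadoop's deprecated-properties table. `HADOOP_CMD` is a hypothetical variable introduced only for this sketch: left unset, the command is printed rather than executed.

```shell
#!/bin/sh
# HADOOP_CMD is hypothetical (not part of Hadoop). Unset, this is a dry run
# that only prints the command; set HADOOP_CMD=hadoop to actually submit it.
HADOOP_CMD=${HADOOP_CMD:-echo hadoop}

# Same map-only gzip job as above, with the newer property names:
#   mapred.reduce.tasks              -> mapreduce.job.reduces
#   mapred.output.compress           -> mapreduce.output.fileoutputformat.compress
#   mapred.compress.map.output       -> mapreduce.map.output.compress
#   mapred.output.compression.codec  -> mapreduce.output.fileoutputformat.compress.codec
cmd_output=$($HADOOP_CMD jar ./share/hadoop/tools/lib/hadoop-streaming-2.7.5.jar \
  -Dmapreduce.job.reduces=0 \
  -Dmapreduce.output.fileoutputformat.compress=true \
  -Dmapreduce.map.output.compress=true \
  -Dmapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec \
  -input /home/houzhizhen/defaultfs/test/input \
  -output /home/houzhizhen/defaultfs/test/output-gz \
  -mapper /bin/cat \
  -inputformat org.apache.hadoop.mapred.TextInputFormat \
  -outputformat org.apache.hadoop.mapred.TextOutputFormat)
echo "$cmd_output"
```

The behaviour should be the same as with the old names. The change mainly avoids the deprecation messages in the job log.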

Reduce into one compressed file directly

Note: this sends all the data to a single reduce task, so it runs very slowly when the input is large.

hadoop jar ./share/hadoop/tools/lib/hadoop-streaming-2.7.5.jar \
  -Dmapred.reduce.tasks=1 \
  -Dmapred.output.compress=true \
  -Dmapred.compress.map.output=true \
  -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec \
  -input /home/houzhizhen/defaultfs/test/input \
  -output /home/houzhizhen/defaultfs/test/archive \
  -mapper /bin/cat \
  -reducer /bin/cat \
  -inputformat org.apache.hadoop.mapred.TextInputFormat \
  -outputformat org.apache.hadoop.mapred.TextOutputFormat
Decompress

cd /home/houzhizhen/defaultfs/test/archive
bunzip2 part-00000.bz2
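
When a single gzip file is wanted, one possible way to avoid the single reduce task is to keep the map-only gzip job and concatenate its part files afterwards, because several gzip members joined end to end still form a valid gzip file. This is a minimal local sketch, not something from the original tests: the two part files here are made-up stand-ins for real job output.

```shell
#!/bin/sh
tmpdir=$(mktemp -d)

# Hypothetical stand-ins for part-00000.gz and part-00001.gz from a map-only job.
printf 'line1\nline2\n' | gzip -c > "$tmpdir/part-00000.gz"
printf 'line3\n' | gzip -c > "$tmpdir/part-00001.gz"

# Plain concatenation; the shell expands the glob in sorted order.
cat "$tmpdir"/part-*.gz > "$tmpdir/all.gz"

# gzip -t checks the integrity of the combined file.
if gzip -t "$tmpdir/all.gz"; then
  valid=yes
else
  valid=no
fi

# gunzip -c decompresses every member in sequence to stdout.
merged=$(gunzip -c "$tmpdir/all.gz")
echo "valid=$valid"
echo "$merged"

rm -rf "$tmpdir"
```

Unlike the reduce step, this keeps the lines in part-file order and does not sort or shuffle them.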

Reposted from blog.csdn.net/houzhizhen/article/details/80332793