Exploring More Efficient Ways to Read and Write Files

Keep creating, grow faster! This is day 5 of my participation in the Juejin "Daily New Plan · June Writing Challenge".

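Background

This part is a digression; skip it if that is not your thing.

Recently I have been digging into why Kafka is so fast. What is the secret behind it?

Driven by curiosity, I started peeling it back layer by layer, like an onion, to uncover step by step the real reasons Kafka can outrun other message queues. Having done so, I have to say: Kafka is the GOAT.

I learned how it uses sequential disk writes.

I discovered its sparse indexes.

I came to appreciate the power of its zero-copy technique.

I caught on to the masterstroke of mmap (memory-mapped files; written "mmp" in the code and output below).

......

If mmap is that impressive, can it also make file compression dramatically faster?

Curious once more, I decided to test this in practice (otherwise our knowledge is only armchair theory).

Coding is our core skill, and curiosity is always our sharpest tool. We must not lose either.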

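A BA once told me about his experience: after moving from DEV to BA, his coding grew rusty, so he made himself pick up a small story every iteration to keep in practice.

A senior colleague once told me: even if you grow into an architect or something more senior, you must not lose the ability to code. Otherwise things get awkward, and people will call you a "requirements translator".

......

This is not feel-good advice; it is heartfelt counsel. I have come to understand deeply that coding really is a job where you keep learning for as long as you keep working.

Seeing many excellent colleagues leave, and talking with them, has only deepened this feeling.

So remember: once you learn something, make the effort to apply it at least once, so that it sticks. And when learning, do not settle for a superficial grasp: knowing what to do is not enough, you should also know why it is done that way.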

......

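Scenario analysis

Scenario 1: compressing a single small file

1. Source file: 63.7M, CSV, a single file

2. Techniques compared: the method circulating online, a buffer, a channel, mmap

3. Results:

3.1. Method 1: the method circulating online (a popular legend that turns out to be a rose with thorns)

Xiao Wang has just joined the team. One day he is suddenly given a requirement to compress a file. He has never written this before, so what does he do? He searches online and finds this method.

Result (alarmingly slow)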

zipMethod=withoutBuffer

costTime=327000ms

The code is as follows:

public void zipFileWithoutBuffer(String outFile){
    long beginTime = System.currentTimeMillis();
    File zipFile = new File(outFile);
    File inputFile = new File(INPUT_FILE);
    try(ZipOutputStream zipOutputStream = new ZipOutputStream(new FileOutputStream(zipFile))) {
        try (InputStream inputStream = new FileInputStream(inputFile)){
            zipOutputStream.putNextEntry(new ZipEntry(inputFile.getName()));
            int temp;
            while ((temp = inputStream.read()) != -1){
                zipOutputStream.write(temp);
            }
        }
        printResult(beginTime,"withoutBuffer");
    } catch (Exception e) {
        e.printStackTrace();
        System.out.println("error" + e.getMessage());
    } 
}
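The snippets in this article call printResult and read from INPUT_FILE, but neither is shown. The following is only a guess at what such helpers might look like, based on the two output lines printed above; the class name, the formatResult method and the file path are hypothetical.

```java
// Hypothetical holder for the helpers the article's snippets rely on.
final class ZipBenchmarkSupport {
    // Assumed location of the file to compress; the article does not give the real path.
    static final String INPUT_FILE = "input.csv";

    // Builds the two-line report that appears in the article's output.
    static String formatResult(long beginTimeMillis, long endTimeMillis, String zipMethod) {
        return "zipMethod=" + zipMethod + System.lineSeparator()
                + "costTime=" + (endTimeMillis - beginTimeMillis) + "ms";
    }

    // Same call shape the snippets use: a start time plus a label for the method.
    static void printResult(long beginTime, String zipMethod) {
        System.out.println(formatResult(beginTime, System.currentTimeMillis(), zipMethod));
    }

    public static void main(String[] args) {
        long beginTime = System.currentTimeMillis();
        printResult(beginTime, "demo");
    }
}
```

Wall-clock timing with System.currentTimeMillis is coarse, so numbers from a single run like this are best read as rough indications.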

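3.2. Method 2: using a buffer

Xiao Wang happily committed the code and moved the story to "ready for acceptance".

Xiao Hua, the team's senior engineer, was baffled on reviewing the code: "Did you find this online? It will be very slow. Look into it some more."

So Xiao Wang tried a different approach, using the buffered streams BufferedInputStream and BufferedOutputStream.

Result (much faster)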

zipMethod=withBuffer

costTime=5170ms

The code is as follows:

public void zipFileWithBuffer(String outFile){
    long beginTime = System.currentTimeMillis();
    File zipFile = new File(outFile);
    File inputFile = new File(INPUT_FILE);
    try(ZipOutputStream zipOutputStream = new ZipOutputStream(new FileOutputStream(zipFile));
        BufferedOutputStream bufferedOutputStream = new BufferedOutputStream(zipOutputStream)) {
        try (BufferedInputStream bufferedInputStream = new BufferedInputStream(new FileInputStream(inputFile))){
            zipOutputStream.putNextEntry(new ZipEntry(inputFile.getName()));
            int temp;
            while ((temp = bufferedInputStream.read()) != -1){
                bufferedOutputStream.write(temp);
            }
        }
        printResult(beginTime,"withBuffer");
    } catch (Exception e) {
        e.printStackTrace();
        System.out.println("error" + e.getMessage());
    } 
}

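3.3. Method 3: using a channel

Nervously, Xiao Wang called everyone together for another code review.

Xiao Hua: "The speed requirements are not that strict. This is good enough, and the code can be committed."

In fact, while studying Kafka recently I had come across NIO and learned that it offers a mechanism called a channel: Channel.

Result (very fast)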

zipMethod=withChannel

costTime=1642ms

The code is as follows:

public void zipFileWithChannel(String outFile){
    long beginTime = System.currentTimeMillis();
    File zipFile = new File(outFile);
    File inputFile = new File(INPUT_FILE);
    try(ZipOutputStream zipOutputStream = new ZipOutputStream(new FileOutputStream(zipFile));
        WritableByteChannel writableByteChannel = Channels.newChannel(zipOutputStream)) {
        try (FileChannel fileChannel = new FileInputStream(inputFile).getChannel()){
            zipOutputStream.putNextEntry(new ZipEntry(inputFile.getName()));
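            // transferTo may move fewer bytes than requested, so loop until the whole file is sent.
            long position = 0;
            long size = fileChannel.size();
            while (position < size) {
                position += fileChannel.transferTo(position, size - position, writableByteChannel);
            }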
        }
        printResult(beginTime,"withChannel");
    } catch (Exception e) {
        e.printStackTrace();
        System.out.println("error" + e.getMessage());
    } 
}

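3.4. Method 4: using mmap

While studying Kafka I learned that NIO offers not only channels (Channel) but also another technique: mmap (memory-mapped files).

Result (very fast)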

zipMethod=withMmp

costTime=1554ms

The code is as follows:

public void zipFileWithMmp(String outFile){
    long beginTime = System.currentTimeMillis();
    File zipFile = new File(outFile);
    File inputFile = new File(INPUT_FILE);
    try(ZipOutputStream zipOutputStream = new ZipOutputStream(new FileOutputStream(zipFile));
        WritableByteChannel writableByteChannel = Channels.newChannel(zipOutputStream)) {
        zipOutputStream.putNextEntry(new ZipEntry(inputFile.getName()));
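        // Close the RandomAccessFile and its channel once the data has been written.
        try (RandomAccessFile randomAccessFile = new RandomAccessFile(inputFile, "r");
             FileChannel fileChannel = randomAccessFile.getChannel()) {
            // A single mapping is limited to Integer.MAX_VALUE bytes (about 2 GB).
            MappedByteBuffer mappedByteBuffer =
                    fileChannel.map(FileChannel.MapMode.READ_ONLY, 0, fileChannel.size());
            writableByteChannel.write(mappedByteBuffer);
        }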
        printResult(beginTime,"withMmp");
    } catch (Exception e) {
        e.printStackTrace();
        System.out.println("error" + e.getMessage());
    } 
}

Scenario 2: compressing a single large file

1. Source file: 585M, CSV, a single file

2. Techniques compared: a buffer, a channel, mmap

3. Results:

Buffer	Channel	mmap
costTime=46034ms	costTime=11885ms	costTime=10810ms

Scenario 3: compressing multiple large files

1. Source files: 585M each, CSV, 5 files

2. Techniques compared: a buffer, a channel, mmap

3. Results:

Buffer	Channel	mmap
costTime=173122ms	costTime=53982ms	costTime=50543ms
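The article does not show the code used for the multi-file scenario. One way it could be done with the channel approach is sketched below, with one ZipEntry per input file. The class and method names are hypothetical, and the sketch assumes the input files have distinct names (duplicate entry names make ZipOutputStream throw).

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

final class MultiFileZipSketch {
    // Writes every input file into one archive, one ZipEntry per file.
    static void zipFiles(List<Path> inputFiles, Path zipFile) throws IOException {
        try (OutputStream out = Files.newOutputStream(zipFile);
             ZipOutputStream zipOutputStream = new ZipOutputStream(out);
             WritableByteChannel target = Channels.newChannel(zipOutputStream)) {
            for (Path inputFile : inputFiles) {
                zipOutputStream.putNextEntry(new ZipEntry(inputFile.getFileName().toString()));
                try (FileChannel source = FileChannel.open(inputFile, StandardOpenOption.READ)) {
                    long position = 0;
                    long size = source.size();
                    // transferTo may move fewer bytes than requested, so keep going.
                    while (position < size) {
                        long transferred = source.transferTo(position, size - position, target);
                        if (transferred <= 0) {
                            break; // nothing more to read, e.g. the file was truncated meanwhile
                        }
                        position += transferred;
                    }
                }
                zipOutputStream.closeEntry();
            }
        }
    }

    // Usage: java MultiFileZipSketch.java out.zip first.csv second.csv ...
    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: <zipFile> <inputFile>...");
            return;
        }
        List<Path> inputs = new ArrayList<>();
        for (int i = 1; i < args.length; i++) {
            inputs.add(Paths.get(args[i]));
        }
        zipFiles(inputs, Paths.get(args[0]));
    }
}
```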

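Analysis and conclusions

1. Comparison table

Scenario	Online method	Buffer	Channel	mmap
Scenario 1: single small file (63.7M)	327000ms	5170ms	1642ms	1554ms
Scenario 2: single large file (585M)	--	46034ms	11885ms	10810ms
Scenario 3: multiple large files (5 x 585M)	--	173122ms	53982ms	50543ms
Scenario 4: single 100K file	--	28ms	26ms	24ms
Scenario 5: single 5K file	--	18ms	20ms	23ms
Scenario 6: single 1K file	--	15ms	21ms	24ms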

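2. Conclusions

1) The method circulating online should not be used; it is by far the slowest.

2) The buffered approach performs tolerably, but it falls well behind the two NIO techniques (channel and mmap). The gap is most visible for files of around 500M, in both single-file and multi-file compression, where it takes roughly three to four times as long.

3) Channel and mmap perform about the same: there is essentially no difference for small files, and for large files the gap is only a few seconds.

4) For files larger than 10K, use a channel or mmap for compression.

5) For files smaller than 10K, use a buffer; in the 5K and 1K tests it outperformed both NIO techniques.

6) If your team compresses files through a third-party API, check whether its source code uses NIO. If it does not, consider switching to one of the approaches in this article.

Also, for file operations in general, it is worth trying NIO and measuring the result; in theory the gains should be substantial.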
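The size-based recommendation above could be expressed as a small dispatcher. This is only an illustration: the class, enum and constant names are hypothetical, and the 10K threshold is the article's rule of thumb, which is worth re-measuring on your own hardware and data.

```java
final class ZipStrategyChooser {
    // The two families of techniques compared in the article.
    enum Strategy { BUFFERED_STREAM, NIO_CHANNEL }

    // Rule-of-thumb threshold from the conclusions (10K); an assumption, not a hard limit.
    static final long NIO_THRESHOLD_BYTES = 10 * 1024;

    // Picks the buffered stream for tiny files and NIO for everything larger.
    static Strategy choose(long fileSizeBytes) {
        return fileSizeBytes > NIO_THRESHOLD_BYTES ? Strategy.NIO_CHANNEL : Strategy.BUFFERED_STREAM;
    }

    public static void main(String[] args) {
        System.out.println(choose(5 * 1024));   // a 5K file
        System.out.println(choose(100 * 1024)); // a 100K file
    }
}
```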

Behind the scenes

1. The method circulating online

The read method of FileInputStream is as follows:

/**
 * Reads a byte of data from this input stream. This method blocks
 * if no input is yet available.
 *
 * @return     the next byte of data, or <code>-1</code> if the end of the
 *             file is reached.
 * @exception  IOException  if an I/O error occurs.
 */
public int read() throws IOException {
    return read0();
}

private native int read0() throws IOException;

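This calls a native method to interact with the underlying operating system and read the data. The method is invoked once for every byte read, and each such interaction is expensive.

Because it reads only one byte at a time, the overhead becomes enormous when the file is large.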
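One common way to avoid the per-byte cost described above is to read in blocks with read(byte[]) and hand each block to the zip stream. The sketch below is not one of the four measured methods; the class name and the 8192-byte block size are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

final class BlockReadZipSketch {
    // Compresses one file, moving data in blocks instead of single bytes.
    static void zipWithByteArray(Path inputFile, Path zipFile) throws IOException {
        byte[] buffer = new byte[8192]; // assumed block size
        try (InputStream in = Files.newInputStream(inputFile);
             OutputStream out = Files.newOutputStream(zipFile);
             ZipOutputStream zipOutputStream = new ZipOutputStream(out)) {
            zipOutputStream.putNextEntry(new ZipEntry(inputFile.getFileName().toString()));
            int read;
            // Each read call can return up to buffer.length bytes.
            while ((read = in.read(buffer)) != -1) {
                zipOutputStream.write(buffer, 0, read);
            }
            zipOutputStream.closeEntry();
        }
    }

    // Usage: java BlockReadZipSketch.java input.csv out.zip
    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.out.println("usage: <inputFile> <zipFile>");
            return;
        }
        zipWithByteArray(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```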

2. Using a buffer

The read method of BufferedInputStream is as follows:

/**
 * See
 * the general contract of the <code>read</code>
 * method of <code>InputStream</code>.
 *
 * @return     the next byte of data, or <code>-1</code> if the end of the
 *             stream is reached.
 * @exception  IOException  if this input stream has been closed by
 *                          invoking its {@link #close()} method,
 *                          or an I/O error occurs.
 * @see        java.io.FilterInputStream#in
 */
public synchronized int read() throws IOException {
    if (pos >= count) {
        fill();
        if (pos >= count)
            return -1;
    }
    return getBufIfOpen()[pos++] & 0xff;
}

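This still returns one byte per call, but it does not go to the underlying stream every time. Instead, a single call to the underlying system reads up to buf.length bytes into the buf array, and subsequent reads return one byte at a time from buf, which avoids the cost of frequent low-level calls.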
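The effect described above can be made visible by counting how often the underlying stream is actually asked for data. The sketch below uses an in-memory stream as a stand-in for a file, so it shows call counts only, not real I/O cost; CountingInputStream and the other names are hypothetical helpers written for this illustration.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

final class ReadCallCounter {
    // Wraps a stream and counts how often its read methods are invoked.
    static final class CountingInputStream extends FilterInputStream {
        long calls;

        CountingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            calls++;
            return super.read();
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            calls++;
            return super.read(b, off, len);
        }
    }

    // Drains a stream one byte at a time, the way the article's loops do.
    static void drain(InputStream in) throws IOException {
        while (in.read() != -1) {
            // discard the byte; only the number of calls matters here
        }
    }

    // Calls that reach the underlying stream when it is read directly.
    static long callsWithoutBuffer(byte[] data) throws IOException {
        CountingInputStream counting = new CountingInputStream(new ByteArrayInputStream(data));
        drain(counting);
        return counting.calls;
    }

    // Calls that reach the underlying stream when a BufferedInputStream sits in front of it.
    static long callsWithBuffer(byte[] data) throws IOException {
        CountingInputStream counting = new CountingInputStream(new ByteArrayInputStream(data));
        drain(new BufferedInputStream(counting));
        return counting.calls;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[100_000];
        System.out.println("without buffer: " + callsWithoutBuffer(data));
        System.out.println("with buffer:    " + callsWithBuffer(data));
    }
}
```

Reading directly costs one underlying call per byte plus one for end of stream, while the buffered version needs only a handful of block reads for the same data.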

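3. Using a channel

When copying large files, FileChannel is nearly one third faster than BufferedInputStream/BufferedOutputStream, which shows FileChannel's speed advantage.

An NIO Channel is structured more like the way the operating system itself performs I/O, so it is markedly faster than traditional I/O.

The operating system can transfer bytes directly from the file system cache to the target Channel without an actual copy stage (copy here means moving the data from kernel space to user space). This direct path is available when the target is itself a file or socket channel; when the target wraps a ZipOutputStream, as in the code above, the bytes must still pass through user space to be compressed, so the gain there most likely comes from moving data in large blocks rather than byte by byte.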
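For comparison, here is a minimal sketch of a plain file-to-file copy with transferTo, the case where the target is itself a FileChannel and the operating system may be able to move the data without routing it through a Java-side buffer. The class and method names are hypothetical, and whether a direct transfer actually happens depends on the platform.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

final class ChannelCopySketch {
    // Copies source to target and returns the number of bytes transferred.
    static long copy(Path source, Path target) throws IOException {
        try (FileChannel in = FileChannel.open(source, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(target,
                     StandardOpenOption.CREATE,
                     StandardOpenOption.WRITE,
                     StandardOpenOption.TRUNCATE_EXISTING)) {
            long position = 0;
            long size = in.size();
            // transferTo may move fewer bytes than requested, so loop until done.
            while (position < size) {
                long transferred = in.transferTo(position, size - position, out);
                if (transferred <= 0) {
                    break; // no progress, e.g. the source was truncated meanwhile
                }
                position += transferred;
            }
            return position;
        }
    }

    // Usage: java ChannelCopySketch.java source.csv copy.csv
    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.out.println("usage: <source> <target>");
            return;
        }
        System.out.println("bytes copied: " + copy(Paths.get(args[0]), Paths.get(args[1])));
    }
}
```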

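4. Using mmap

A memory-mapped file treats a file on disk as the backing store for a region of the program's address space, so the file's data is simply the data at the corresponding addresses in that region.

Reading or writing the file then means operating directly on addresses in that region, which removes a memory-copy step.

That is why memory-mapped files can be more efficient than ordinary file I/O, and the larger the file, the bigger the difference.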
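To make the idea of reading a file as if it were memory more concrete, the sketch below maps a file read-only and counts its newline bytes straight from the mapped buffer. The class and method names are hypothetical, and the sketch assumes the file is smaller than 2 GB, since a single mapping cannot exceed Integer.MAX_VALUE bytes.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

final class MappedReadSketch {
    // Counts '\n' bytes by walking the mapped region instead of calling read().
    static long countNewlines(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = channel.size();
            if (size == 0) {
                return 0; // nothing to map
            }
            // Assumes size <= Integer.MAX_VALUE; larger files would need several mappings.
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, size);
            long newlines = 0;
            while (buffer.hasRemaining()) {
                if (buffer.get() == '\n') {
                    newlines++;
                }
            }
            return newlines;
        }
    }

    // Usage: java MappedReadSketch.java input.csv
    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("usage: <file>");
            return;
        }
        System.out.println("newlines: " + countNewlines(Paths.get(args[0])));
    }
}
```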