How to speed up read write base64 encoded gzipped large files in Java

rayimpr :

The task is to compress/decompress very large data (> 2 GB) that cannot be held in a single String or byte array. My solution is to write the compressed/decompressed data chunk by chunk into a file. It works, but it is not fast enough.

Compress: plain text file -> gzip -> base64 encode -> compressed file
Decompress: compressed file -> base64 decode -> gunzip -> plain text file

Test results on a laptop with 16 GB of memory:

Created compressed file, takes 571346 millis
Created decompressed file, takes 378441 millis

Code block

public static void compress(final InputStream inputStream, final Path outputFile) throws IOException {
  try (final OutputStream outputStream = new FileOutputStream(outputFile.toString());
      final OutputStream base64Output = Base64.getEncoder().wrap(outputStream);
      final GzipCompressorOutputStream gzipOutput = new GzipCompressorOutputStream(base64Output);
      final BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream))) {

    reader.lines().forEach(line -> {
      try {
        gzipOutput.write(line.getBytes());
        gzipOutput.write(System.getProperty("line.separator").getBytes());
      } catch (final IOException e) {
        e.printStackTrace();
      }
    });
  }
}

public static void decompress(final InputStream inputStream, final Path outputFile) throws IOException {
  try (final OutputStream outputStream = new FileOutputStream(outputFile.toString());
      final GzipCompressorInputStream gzipStream = new GzipCompressorInputStream(Base64.getDecoder().wrap(inputStream));
      final BufferedReader reader = new BufferedReader(new InputStreamReader(gzipStream))) {

    reader.lines().forEach(line -> {
      try {
        outputStream.write(line.getBytes());
        outputStream.write(System.getProperty("line.separator").getBytes());
      } catch (final IOException e) {
        e.printStackTrace();
      }
    });
  }
}

Furthermore, I tried batching the writes when sending data to the file, but didn't see much improvement.

// batch write
public static void compress(final InputStream inputStream, final Path outputFile) throws IOException {
  try (final OutputStream outputStream = new FileOutputStream(outputFile.toString());
      final OutputStream base64Output = Base64.getEncoder().wrap(outputStream);
      final GzipCompressorOutputStream gzipOutput = new GzipCompressorOutputStream(base64Output);
      final BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream))) {

    StringBuilder stringBuilder = new StringBuilder();
    final int chunkSize = Integer.MAX_VALUE / 1000;

    String line;
    int counter = 0;
    while((line = reader.readLine()) != null) {
      counter++;
      stringBuilder.append(line).append(System.getProperty("line.separator"));
      if(counter >= chunkSize) {
        gzipOutput.write(stringBuilder.toString().getBytes());
        counter = 0;
        stringBuilder = new StringBuilder();
      }
    }

    if (counter > 0) {
      gzipOutput.write(stringBuilder.toString().getBytes());
    }
  }
}

Questions

  1. Looking for suggestions on how to speed up the overall process.
  2. What are the likely bottlenecks?

10/2/2019 update

I did some more tests; the results show that Base64 encoding is the bottleneck.

public static void compress(final InputStream inputStream, final Path outputFile) throws IOException {
  try (final OutputStream outputStream = new FileOutputStream(outputFile.toString());
       final OutputStream base64Output = Base64.getEncoder().wrap(outputStream);
       final GzipCompressorOutputStream gzipOutput = new GzipCompressorOutputStream(base64Output)) {

    final byte[] buffer = new byte[4096];
    int n = 0;
    while (-1 != (n = inputStream.read(buffer))) {
      gzipOutput.write(buffer, 0, n);
    }
  }
}
  • 2.2G test file, with 21.5 million lines
  • Copy file only: ~ 2 seconds
  • Gzip file only: ~ 12 seconds
  • Gzip + base64: ~ 500 seconds
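
A simple elapsed-time harness along these lines (illustrative; the paths are placeholders) is enough to reproduce the per-stage timings above:

public static void main(final String[] args) throws IOException {
  final Path input = Paths.get("test.txt");          // placeholder input path
  final Path output = Paths.get("test.txt.gz.b64");  // placeholder output path

  final long start = System.nanoTime();
  try (final InputStream in = Files.newInputStream(input)) {
    compress(in, output);  // any of the compress variants above
  }
  System.out.println("Created compressed file, takes "
      + (System.nanoTime() - start) / 1_000_000 + " millis");
}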
Joop Eggen :

First: never rely on the default charset, as it is not portable.

String s = ...;
byte[] b = ...;
b = s.getBytes(StandardCharsets.UTF_8);
s = new String(b, StandardCharsets.UTF_8);

For compressing text, do not involve a Reader: it converts the bytes, given some charset, into a String (holding Unicode), and then you convert back again. Also, a String's char requires 2 bytes (UTF-16), as opposed to 1 byte for basic ASCII symbols.

Base64 converts binary data into an alphabet of 64 ASCII symbols, requiring 4/3 of the space. Do not do that unless the data must be transmitted embedded in XML or the like.
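
If the Base64 step cannot be dropped (say, the output must be ASCII-safe), buffering on both sides of the encoder is worth trying: the gzip deflater emits many small writes, and without buffering each one passes through the encoder straight to the file. A minimal sketch reusing the question's classes; the buffer sizes are illustrative, not tuned:

public static void compressBuffered(final InputStream in, final Path outputFile) throws IOException {
  try (final OutputStream file = new BufferedOutputStream(Files.newOutputStream(outputFile), 1 << 16);
       final OutputStream base64 = Base64.getEncoder().wrap(file);
       // A second buffer in front of the encoder batches gzip's small writes.
       final OutputStream gzip = new GzipCompressorOutputStream(new BufferedOutputStream(base64, 1 << 16))) {

    final byte[] buffer = new byte[1 << 16];
    int n;
    while ((n = in.read(buffer)) != -1) {
      gzip.write(buffer, 0, n);
    }
  }
}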

Large files can be (de)compressed as follows:

// GZIPOutputStream / GZIPInputStream are from java.util.zip.
final int BUFFER_SIZE = 1024 * 64;
Path textFile = Paths.get(".... .txt");
Path gzFile = textFile.resolveSibling(textFile.getFileName().toString() + ".gz");

try (OutputStream out = new GZIPOutputStream(Files.newOutputStream(gzFile), BUFFER_SIZE)) {
    Files.copy(textFile, out);
}

try (InputStream in = new GZIPInputStream(Files.newInputStream(gzFile), BUFFER_SIZE)) {
    Files.copy(in, textFile);
}

The optional BUFFER_SIZE parameter is often overlooked; leaving it out can degrade performance.

Files.copy can take additional options for handling file clashes.
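
For example, to overwrite an existing target when decompressing (StandardCopyOption is from java.nio.file):

// REPLACE_EXISTING avoids a FileAlreadyExistsException when textFile already exists.
Files.copy(in, textFile, StandardCopyOption.REPLACE_EXISTING);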
