Druid Source Code Analysis: HDFS Storage

Druid defines a number of interfaces for its peripheral functionality. For deep storage, for example, it defines the following (a sketch of the core method signatures follows the list):

  • DataSegmentArchiver: archives and restores segment files. This is useful on storage such as S3, where segments that are temporarily unused can be moved into a different bucket.
  • DataSegmentFinder: finds Druid segments under a given directory and can update the descriptor.json files on deep storage with the correct loadSpec.
  • DataSegmentKiller: deletes segment files.
  • DataSegmentMover: moves segment files.
  • DataSegmentPuller: pulls the data of a given segment into a given local directory.
  • DataSegmentPusher: pushes the data of a given segment from a local directory to deep storage.
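
A rough sketch of the core method signatures behind these interfaces, reconstructed from the @Override methods in the HDFS implementations analyzed below (only the four interfaces that the HDFS extension implements are shown, and the real Druid interfaces may carry additional methods):

  import java.io.File;
  import java.io.IOException;
  import java.util.Set;

  // Sketch only: reconstructed from the @Override methods shown later in this article,
  // not copied from the Druid sources. DataSegment and SegmentLoadingException are
  // Druid's own types.
  interface DataSegmentFinder
  {
    Set<DataSegment> findSegments(String workingDirPathStr, boolean updateDescriptor)
        throws SegmentLoadingException;
  }

  interface DataSegmentKiller
  {
    void kill(DataSegment segment) throws SegmentLoadingException;
  }

  interface DataSegmentPuller
  {
    void getSegmentFiles(DataSegment segment, File dir) throws SegmentLoadingException;
  }

  interface DataSegmentPusher
  {
    DataSegment push(File inDir, DataSegment segment) throws IOException;
  }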

The HDFS storage extension implements four of these interfaces:

  • HdfsDataSegmentFinder
  • HdfsDataSegmentKiller
  • HdfsDataSegmentPuller
  • HdfsDataSegmentPusher

1. HdfsDataSegmentFinder: finds segments under a given HDFS working directory. Its core method, findSegments, is shown below:

  @Override
  public Set<DataSegment> findSegments(String workingDirPathStr, boolean updateDescriptor)
      throws SegmentLoadingException
  {
    final Set<DataSegment> segments = Sets.newHashSet();
    final Path workingDirPath = new Path(workingDirPathStr);
    FileSystem fs;
    try {
      fs = workingDirPath.getFileSystem(config);

      log.info(fs.getScheme());
      log.info("FileSystem URI:" + fs.getUri().toString());

      if (!fs.exists(workingDirPath)) {
        throw new SegmentLoadingException("Working directory [%s] doesn't exist.", workingDirPath);
      }

      if (!fs.isDirectory(workingDirPath)) {
        throw new SegmentLoadingException("Working directory [%s] is not a directory!?", workingDirPath);
      }

      final RemoteIterator<LocatedFileStatus> it = fs.listFiles(workingDirPath, true);
      while (it.hasNext()) {
        final LocatedFileStatus locatedFileStatus = it.next();
        final Path path = locatedFileStatus.getPath();
        if (path.getName().equals("descriptor.json")) {
          final Path indexZip = new Path(path.getParent(), "index.zip");
          if (fs.exists(indexZip)) {
            final DataSegment dataSegment = mapper.readValue(fs.open(path), DataSegment.class);
            log.info("Found segment [%s] located at [%s]", dataSegment.getIdentifier(), indexZip);

            final Map<String, Object> loadSpec = dataSegment.getLoadSpec();
            final String pathWithoutScheme = indexZip.toUri().getPath();

            if (!loadSpec.get("type").equals(HdfsStorageDruidModule.SCHEME) || !loadSpec.get("path")
                                                                                        .equals(pathWithoutScheme)) {
              loadSpec.put("type", HdfsStorageDruidModule.SCHEME);
              loadSpec.put("path", pathWithoutScheme);
              if (updateDescriptor) {
                log.info("Updating loadSpec in descriptor.json at [%s] with new path [%s]", path, pathWithoutScheme);
                mapper.writeValue(fs.create(path, true), dataSegment);
              }
            }
            segments.add(dataSegment);
          } else {
            throw new SegmentLoadingException(
                "index.zip didn't exist at [%s] while descripter.json exists!?",
                indexZip
            );
          }
        }
      }
    }
    catch (IOException e) {
      throw new SegmentLoadingException(e, "Problems interacting with filesystem[%s].", workingDirPath);
    }

    return segments;
  }

As the code shows, the method recursively walks the given HDFS working directory and, for every descriptor.json that has an index.zip next to it, deserializes the DataSegment and adds it to the result set. When the descriptor's loadSpec does not match the actual location and updateDescriptor is true, the descriptor.json file on HDFS is rewritten with the corrected type and path. A minimal usage sketch follows.
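
A minimal usage sketch, assuming the finder's constructor takes the Hadoop Configuration and a Jackson ObjectMapper (matching the config and mapper fields used above); in a real deployment the finder is injected by Guice and the working directory comes from configuration, so the values below are only examples:

  // Hypothetical manual wiring of HdfsDataSegmentFinder; hadoopConfig and jsonMapper are
  // assumed to be set up elsewhere, and the working directory is an example path.
  Set<DataSegment> findExample(Configuration hadoopConfig, ObjectMapper jsonMapper)
      throws SegmentLoadingException
  {
    final HdfsDataSegmentFinder finder = new HdfsDataSegmentFinder(hadoopConfig, jsonMapper);
    // true => rewrite any descriptor.json whose loadSpec no longer matches its actual location
    return finder.findSegments("hdfs://namenode:8020/druid/segments", true);
  }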

2. HdfsDataSegmentKiller: deletes a given segment from HDFS. Besides removing the segment's partition directory, it also tries to prune the now-empty version, interval, and dataSource parent directories (see the sketch after the code). The code is as follows:

  @Override
  public void kill(DataSegment segment) throws SegmentLoadingException
  {
    final Path path = getPath(segment);
    log.info("killing segment[%s] mapped to path[%s]", segment.getIdentifier(), path);

    try {
      if (path.getName().endsWith(".zip")) {

        final FileSystem fs = path.getFileSystem(config);

        if (!fs.exists(path)) {
          log.warn("Segment Path [%s] does not exist. It appears to have been deleted already.", path);
          return ;
        }

        // path format -- > .../dataSource/interval/version/partitionNum/xxx.zip
        Path partitionNumDir = path.getParent();
        if (!fs.delete(partitionNumDir, true)) {
          throw new SegmentLoadingException(
              "Unable to kill segment, failed to delete dir [%s]",
              partitionNumDir.toString()
          );
        }

        //try to delete other directories if possible
        Path versionDir = partitionNumDir.getParent();
        if (safeNonRecursiveDelete(fs, versionDir)) {
          Path intervalDir = versionDir.getParent();
          if (safeNonRecursiveDelete(fs, intervalDir)) {
            Path dataSourceDir = intervalDir.getParent();
            safeNonRecursiveDelete(fs, dataSourceDir);
          }
        }
      } else {
        throw new SegmentLoadingException("Unknown file type[%s]", path);
      }
    }
    catch (IOException e) {
      throw new SegmentLoadingException(e, "Unable to kill segment");
    }
  }

  private boolean safeNonRecursiveDelete(FileSystem fs, Path path)
  {
    try {
      return fs.delete(path, false);
    }
    catch (Exception ex) {
      return false;
    }
  }

  private Path getPath(DataSegment segment)
  {
    return new Path(String.valueOf(segment.getLoadSpec().get(PATH_KEY)));
  }
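
To make the cleanup order concrete, here is a hedged sketch of a kill() call; the killer and segment are assumed to come from Druid's injection and metadata store, and the path is made up for illustration:

  // Sketch only: shows which directories kill() removes for a hypothetical segment path.
  void killExample(HdfsDataSegmentKiller killer, DataSegment segment) throws SegmentLoadingException
  {
    // Suppose segment.getLoadSpec().get("path") is
    //   /druid/segments/wikipedia/<interval>/<version>/0/index.zip
    killer.kill(segment);
    // 1. .../wikipedia/<interval>/<version>/0 is deleted recursively, taking index.zip
    //    and descriptor.json with it.
    // 2. The version, interval, and dataSource directories are then removed one by one
    //    with non-recursive deletes, which only succeed while a directory is empty, so
    //    the walk up the tree stops at the first ancestor still holding other segments.
  }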

3. HdfsDataSegmentPuller: pulls a segment from HDFS into a local directory. Depending on the remote path, it copies the files of a directory one by one, unzips a .zip archive, or gunzips a .gz file (a usage sketch follows the code):

  @Override
  public void getSegmentFiles(DataSegment segment, File dir) throws SegmentLoadingException
  {
    getSegmentFiles(getPath(segment), dir);
  }

  public FileUtils.FileCopyResult getSegmentFiles(final Path path, final File outDir) throws SegmentLoadingException
  {
    final LocalFileSystem localFileSystem = new LocalFileSystem();
    try {
      final FileSystem fs = path.getFileSystem(config);
      if (fs.isDirectory(path)) {

        // --------    directory     ---------

        try {
          return RetryUtils.retry(
              new Callable<FileUtils.FileCopyResult>()
              {
                @Override
                public FileUtils.FileCopyResult call() throws Exception
                {
                  if (!fs.exists(path)) {
                    throw new SegmentLoadingException("No files found at [%s]", path.toString());
                  }

                  final RemoteIterator<LocatedFileStatus> children = fs.listFiles(path, false);
                  final ArrayList<FileUtils.FileCopyResult> localChildren = new ArrayList<>();
                  final FileUtils.FileCopyResult result = new FileUtils.FileCopyResult();
                  while (children.hasNext()) {
                    final LocatedFileStatus child = children.next();
                    final Path childPath = child.getPath();
                    final String fname = childPath.getName();
                    if (fs.isDirectory(childPath)) {
                      log.warn("[%s] is a child directory, skipping", childPath.toString());
                    } else {
                      final File outFile = new File(outDir, fname);

                      // Actual copy
                      fs.copyToLocalFile(childPath, new Path(outFile.toURI()));
                      result.addFile(outFile);
                    }
                  }
                  log.info(
                      "Copied %d bytes from [%s] to [%s]",
                      result.size(),
                      path.toString(),
                      outDir.getAbsolutePath()
                  );
                  return result;
                }

              },
              shouldRetryPredicate(),
              DEFAULT_RETRY_COUNT
          );
        }
        catch (Exception e) {
          throw Throwables.propagate(e);
        }
      } else if (CompressionUtils.isZip(path.getName())) {

        // --------    zip     ---------

        final FileUtils.FileCopyResult result = CompressionUtils.unzip(
            new ByteSource()
            {
              @Override
              public InputStream openStream() throws IOException
              {
                return getInputStream(path);
              }
            }, outDir, shouldRetryPredicate(), false
        );

        log.info(
            "Unzipped %d bytes from [%s] to [%s]",
            result.size(),
            path.toString(),
            outDir.getAbsolutePath()
        );

        return result;
      } else if (CompressionUtils.isGz(path.getName())) {

        // --------    gzip     ---------

        final String fname = path.getName();
        final File outFile = new File(outDir, CompressionUtils.getGzBaseName(fname));
        final FileUtils.FileCopyResult result = CompressionUtils.gunzip(
            new ByteSource()
            {
              @Override
              public InputStream openStream() throws IOException
              {
                return getInputStream(path);
              }
            },
            outFile
        );

        log.info(
            "Gunzipped %d bytes from [%s] to [%s]",
            result.size(),
            path.toString(),
            outFile.getAbsolutePath()
        );
        return result;
      } else {
        throw new SegmentLoadingException("Do not know how to handle file type at [%s]", path.toString());
      }
    }
    catch (IOException e) {
      throw new SegmentLoadingException(e, "Error loading [%s]", path.toString());
    }
  }
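
A minimal usage sketch, assuming the puller is constructed with the same Hadoop Configuration held in the config field above; the HDFS path and the local directory are example values:

  // Hypothetical direct use of the puller; in Druid it is injected and driven by the
  // segment-loading machinery.
  FileUtils.FileCopyResult pullExample(Configuration hadoopConfig) throws SegmentLoadingException
  {
    final HdfsDataSegmentPuller puller = new HdfsDataSegmentPuller(hadoopConfig);
    final File localDir = new File("/tmp/druid/segment-cache/example");
    localDir.mkdirs();
    // A .zip path is unzipped into localDir, a directory is copied file by file,
    // and a .gz file is gunzipped to a single local file.
    return puller.getSegmentFiles(
        new Path("hdfs://namenode:8020/druid/segments/wikipedia/interval/version/0/index.zip"),
        localDir
    );
  }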

4. HdfsDataSegmentPusher: zips the files in a local directory, copies the archive to HDFS as index.zip, and writes a descriptor.json next to it (a usage sketch follows the code):

  @Override
  public DataSegment push(File inDir, DataSegment segment) throws IOException
  {
    final String storageDir = DataSegmentPusherUtil.getHdfsStorageDir(segment);

    log.info(
        "Copying segment[%s] to HDFS at location[%s/%s]",
        segment.getIdentifier(),
        config.getStorageDirectory(),
        storageDir
    );

    Path outFile = new Path(String.format("%s/%s/index.zip", config.getStorageDirectory(), storageDir));
    FileSystem fs = outFile.getFileSystem(hadoopConfig);

    fs.mkdirs(outFile.getParent());
    log.info("Compressing files from[%s] to [%s]", inDir, outFile);

    final long size;
    try (FSDataOutputStream out = fs.create(outFile)) {
      size = CompressionUtils.zip(inDir, out);
    }

    return createDescriptorFile(
        segment.withLoadSpec(makeLoadSpec(outFile))
               .withSize(size)
               .withBinaryVersion(SegmentUtils.getVersionFromDir(inDir)),
        outFile.getParent(),
        fs
    );
  }

  private DataSegment createDescriptorFile(DataSegment segment, Path outDir, final FileSystem fs) throws IOException
  {
    final Path descriptorFile = new Path(outDir, "descriptor.json");
    log.info("Creating descriptor file at[%s]", descriptorFile);
    ByteSource
        .wrap(jsonMapper.writeValueAsBytes(segment))
        .copyTo(new HdfsOutputStreamSupplier(fs, descriptorFile));
    return segment;
  }

  private ImmutableMap<String, Object> makeLoadSpec(Path outFile)
  {
    return ImmutableMap.<String, Object>of("type", "hdfs", "path", outFile.toString());
  }
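
A hedged sketch of what push() produces, assuming config.getStorageDirectory() returns /druid/segments (an example value, not a default) and that the pusher and input directory come from the indexing code:

  // Hypothetical call; pusher is an injected HdfsDataSegmentPusher and inDir holds the
  // merged segment files produced by indexing.
  DataSegment pushExample(HdfsDataSegmentPusher pusher, File inDir, DataSegment segment)
      throws IOException
  {
    final DataSegment pushed = pusher.push(inDir, segment);
    // On HDFS this leaves, under /druid/segments/<dataSource>/<interval>/<version>/<partitionNum>/:
    //   index.zip        -- the zipped contents of inDir
    //   descriptor.json  -- the returned DataSegment serialized as JSON
    // and pushed.getLoadSpec() becomes {"type": "hdfs", "path": "<...>/index.zip"}.
    return pushed;
  }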


Reposted from blog.csdn.net/mytobaby00/article/details/80045662