On a recent project I found that after migrating data with Sqoop (in --table mode), the row order of the data in Hive differed from the row order in DB2. Puzzled by this, I went straight to the Sqoop source code.
The import logic lives in org.apache.sqoop.tool.ImportTool (ImportTool.java).
The importTable method that actually drives the import looks like this:
protected boolean importTable(SqoopOptions options, String tableName,
    HiveImport hiveImport) throws IOException, ImportException {
  String jarFile = null;

  // Generate the ORM code for the tables.
  jarFile = codeGenerator.generateORM(options, tableName);

  Path outputPath = getOutputPath(options, tableName);

  // Do the actual import.
  ImportJobContext context = new ImportJobContext(tableName, jarFile,
      options, outputPath);

  // If we're doing an incremental import, set up the
  // filtering conditions used to get the latest records.
  if (!initIncrementalConstraints(options, context)) {
    return false;
  }

  if (options.isDeleteMode()) {
    deleteTargetDir(context);
  }

  if (null != tableName) {
    manager.importTable(context);
  } else {
    manager.importQuery(context);
  }

  if (options.isAppendMode()) {
    AppendUtils app = new AppendUtils(context);
    app.append();
  } else if (options.getIncrementalMode() == SqoopOptions.IncrementalMode.DateLastModified) {
    lastModifiedMerge(options, context);
  }

  // If the user wants this table to be in Hive, perform that post-load.
  if (options.doHiveImport()) {
    // For Parquet file, the import action will create hive table directly via
    // kite. So there is no need to do hive import as a post step again.
    if (options.getFileLayout() != SqoopOptions.FileLayout.ParquetFile) {
      hiveImport.importTable(tableName, options.getHiveTableName(), false);
    }
  }

  saveIncrementalState(options);

  return true;
}
The logic:
1. Generate the Java ORM class file. This class is produced by ClassWriter and contains read and write methods matching each column's name and type (I won't dig into this further; the problem is not here). This is also where preprocessing such as trimming whitespace could be hooked in; a rough sketch of what such a generated class looks like appears after this list.
2. Call manager.importTable(context). The purpose of this method is to produce the HDFS files; the call actually lands in org.apache.sqoop.manager.SqlManager.importTable().
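For reference, here is a hand-written, heavily simplified stand-in for the kind of record class ClassWriter generates. The table EMPLOYEE and its two columns are made up for illustration; the real generated class extends SqoopRecord and carries much more machinery, this only shows the read/write idea and where a preprocessing step could go.

// Hypothetical, simplified sketch of a generated record class for a
// made-up table EMPLOYEE(ID INTEGER, NAME VARCHAR).
public class EMPLOYEE implements org.apache.hadoop.mapreduce.lib.db.DBWritable {
  private Integer id;
  private String name;

  // "read" side: populate the fields from one row of the JDBC ResultSet.
  public void readFields(java.sql.ResultSet results) throws java.sql.SQLException {
    this.id = results.getInt(1);
    this.name = results.getString(2);
    // Preprocessing such as trimming whitespace could be added here, e.g.:
    // this.name = (this.name == null) ? null : this.name.trim();
  }

  // "write" side: bind the fields back into a PreparedStatement.
  public void write(java.sql.PreparedStatement stmt) throws java.sql.SQLException {
    stmt.setInt(1, this.id);
    stmt.setString(2, this.name);
  }
}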
Stepping into SqlManager.importTable in detail:
/**
 * Default implementation of importTable() is to launch a MapReduce job
 * via DataDrivenImportJob to read the table with DataDrivenDBInputFormat.
 */
public void importTable(com.cloudera.sqoop.manager.ImportJobContext context)
    throws IOException, ImportException {
  String tableName = context.getTableName();
  String jarFile = context.getJarFile();
  SqoopOptions opts = context.getOptions();

  context.setConnManager(this);

  ImportJobBase importer;
  if (opts.getHBaseTable() != null) {
    // Import to HBase.
    if (!HBaseUtil.isHBaseJarPresent()) {
      throw new ImportException("HBase jars are not present in "
          + "classpath, cannot import to HBase!");
    }
    if (!opts.isBulkLoadEnabled()){
      importer = new HBaseImportJob(opts, context);
    } else {
      importer = new HBaseBulkImportJob(opts, context);
    }
  } else if (opts.getAccumuloTable() != null) {
    // Import to Accumulo.
    if (!AccumuloUtil.isAccumuloJarPresent()) {
      throw new ImportException("Accumulo jars are not present in "
          + "classpath, cannot import to Accumulo!");
    }
    importer = new AccumuloImportJob(opts, context);
  } else {
    // Import to HDFS.
    importer = new DataDrivenImportJob(opts, context.getInputFormat(),
        context);
  }

  checkTableImportOptions(context);

  // Pick out a particular column -- at this point I didn't yet know what it was for.
  String splitCol = getSplitColumn(opts, tableName);

  // runImport is where the data actually gets loaded into HDFS files.
  importer.runImport(tableName, jarFile, splitCol, opts.getConf());
}
Next, step into runImport, which lives in ImportJobBase.
/**
 * Run an import job to read a table in to HDFS.
 *
 * @param tableName the database table to read; may be null if a free-form
 * query is specified in the SqoopOptions, and the ImportJobBase subclass
 * supports free-form queries.
 * @param ormJarFile the Jar file to insert into the dcache classpath.
 * (may be null)
 * @param splitByCol the column of the database table to use to split
 * the import
 * @param conf A fresh Hadoop Configuration to use to build an MR job.
 * @throws IOException if the job encountered an IO problem
 * @throws ImportException if the job failed unexpectedly or was
 * misconfigured.
 */
public void runImport(String tableName, String ormJarFile, String splitByCol,
    Configuration conf) throws IOException, ImportException {
  // Check if there are runtime error checks to do
  if (isHCatJob && options.isDirect()
      && !context.getConnManager().isDirectModeHCatSupported()) {
    throw new IOException("Direct import is not compatible with "
        + "HCatalog operations using the connection manager "
        + context.getConnManager().getClass().getName()
        + ". Please remove the parameter --direct");
  }
  if (options.getAccumuloTable() != null && options.isDirect()
      && !getContext().getConnManager().isDirectModeAccumuloSupported()) {
    throw new IOException("Direct mode is incompatible with "
        + "Accumulo. Please remove the parameter --direct");
  }
  if (options.getHBaseTable() != null && options.isDirect()
      && !getContext().getConnManager().isDirectModeHBaseSupported()) {
    throw new IOException("Direct mode is incompatible with "
        + "HBase. Please remove the parameter --direct");
  }

  if (null != tableName) {
    LOG.info("Beginning import of " + tableName);
  } else {
    LOG.info("Beginning query import.");
  }
  String tableClassName = null;
  if (!getContext().getConnManager().isORMFacilitySelfManaged()) {
    tableClassName =
        new TableClassName(options).getClassForTable(tableName);
  }
  // For ORM self managed, we leave the tableClassName to null so that
  // we don't check for non-existing classes.
  loadJars(conf, ormJarFile, tableClassName);

  Job job = createJob(conf);
  try {
    // Set the external jar to use for the job.
    job.getConfiguration().set("mapred.jar", ormJarFile);
    if (options.getMapreduceJobName() != null) {
      job.setJobName(options.getMapreduceJobName());
    }

    propagateOptionsToJob(job);
    // Prepare the format in which the data will be read.
    configureInputFormat(job, tableName, tableClassName, splitByCol);
    configureOutputFormat(job, tableName, tableClassName);
    configureMapper(job, tableName, tableClassName);
    configureNumTasks(job);
    cacheJars(job, getContext().getConnManager());

    jobSetup(job);
    setJob(job);
    boolean success = runJob(job);
    if (!success) {
      throw new ImportException("Import job failed!");
    }

    completeImport(job);

    if (options.isValidationEnabled()) {
      validateImport(tableName, conf, job);
    }

    if (options.doHiveImport() || isHCatJob) {
      // Publish data for import job, only hive/hcat import jobs are supported now.
      LOG.info("Publishing Hive/Hcat import job data to Listeners for table " + tableName);
      PublishJobData.publishJobData(conf, options, OPERATION, tableName, startTime);
    }
  } catch (InterruptedException ie) {
    throw new IOException(ie);
  } catch (ClassNotFoundException cnfe) {
    throw new IOException(cnfe);
  } finally {
    unloadJars();
    jobTeardown(job);
  }
}
The problem most likely lies in the configureInputFormat method.
Next, into DataDrivenImportJob.java:
@Override
protected void configureInputFormat(Job job, String tableName,
    String tableClassName, String splitByCol) throws IOException {
  ConnManager mgr = getContext().getConnManager();
  try {
    String username = options.getUsername();
    if (null == username || username.length() == 0) {
      DBConfiguration.configureDB(job.getConfiguration(),
          mgr.getDriverClass(), options.getConnectString(),
          options.getFetchSize(), options.getConnectionParams());
    } else {
      DBConfiguration.configureDB(job.getConfiguration(),
          mgr.getDriverClass(), options.getConnectString(),
          username, options.getPassword(), options.getFetchSize(),
          options.getConnectionParams());
    }

    if (null != tableName) {
      // Import a table.
      String [] colNames = options.getColumns();
      if (null == colNames) {
        colNames = mgr.getColumnNames(tableName);
      }

      String [] sqlColNames = null;
      if (null != colNames) {
        sqlColNames = new String[colNames.length];
        for (int i = 0; i < colNames.length; i++) {
          sqlColNames[i] = mgr.escapeColName(colNames[i]);
        }
      }

      // It's ok if the where clause is null in DBInputFormat.setInput.
      String whereClause = options.getWhereClause();

      // We can't set the class properly in here, because we may not have the
      // jar loaded in this JVM. So we start by calling setInput() with
      // DBWritable and then overriding the string manually.

      /* This looks like where the SELECT used to pull the data is prepared! */
      DataDrivenDBInputFormat.setInput(job, DBWritable.class,
          mgr.escapeTableName(tableName), whereClause,
          mgr.escapeColName(splitByCol), sqlColNames);

      // If user specified boundary query on the command line propagate it to
      // the job
      if (options.getBoundaryQuery() != null) {
        DataDrivenDBInputFormat.setBoundingQuery(job.getConfiguration(),
            options.getBoundaryQuery());
      }
    } else {
      // Import a free-form query.
      String inputQuery = options.getSqlQuery();
      String sanitizedQuery = inputQuery.replace(
          DataDrivenDBInputFormat.SUBSTITUTE_TOKEN, " (1 = 1) ");

      String inputBoundingQuery = options.getBoundaryQuery();
      if (inputBoundingQuery == null) {
        inputBoundingQuery = buildBoundaryQuery(splitByCol, sanitizedQuery);
      }
      DataDrivenDBInputFormat.setInput(job, DBWritable.class,
          inputQuery, inputBoundingQuery);
      new DBConfiguration(job.getConfiguration()).setInputOrderBy(
          splitByCol);
    }

    if (options.getRelaxedIsolation()) {
      LOG.info("Enabling relaxed (read uncommitted) transaction "
          + "isolation for imports");
      job.getConfiguration()
          .setBoolean(DBConfiguration.PROP_RELAXED_ISOLATION, true);
    }

    LOG.debug("Using table class: " + tableClassName);
    job.getConfiguration().set(ConfigurationHelper.getDbInputClassProperty(),
        tableClassName);

    job.getConfiguration().setLong(LargeObjectLoader.MAX_INLINE_LOB_LEN_KEY,
        options.getInlineLobLimit());

    if (options.getSplitLimit() != null) {
      org.apache.sqoop.config.ConfigurationHelper.setSplitLimit(
          job.getConfiguration(), options.getSplitLimit());
    }

    LOG.debug("Using InputFormat: " + inputFormatClass);
    job.setInputFormatClass(inputFormatClass);
  } finally {
    try {
      mgr.close();
    } catch (SQLException sqlE) {
      LOG.warn("Error closing connection: " + sqlE);
    }
  }
}
Inside it, the call

DataDrivenDBInputFormat.setInput(job, DBWritable.class,
    mgr.escapeTableName(tableName), whereClause,
    mgr.escapeColName(splitByCol), sqlColNames);

appears to be what prepares how the data is read from the source. Digging one level deeper, it finally becomes clear that the splitCol passed in from the outermost call is the column used to split the import into ranges, and therefore the key that effectively determines the order of the output!
Step into the corresponding method:
/**
 * Determine what column to use to split the table.
 * @param opts the SqoopOptions controlling this import.
 * @param tableName the table to import.
 * @return the splitting column, if one is set or inferrable, or null
 * otherwise.
 */
protected String getSplitColumn(SqoopOptions opts, String tableName) {
  String splitCol = opts.getSplitByCol();
  if (null == splitCol && null != tableName) {
    // If the user didn't specify a splitting column, try to infer one.
    splitCol = getPrimaryKey(tableName);
  }
  return splitCol;
}
So this method's purpose is to find the primary key! Step into getPrimaryKey:
@Override
public String getPrimaryKey(String tableName) {
  try {
    DatabaseMetaData metaData = this.getConnection().getMetaData();
    ResultSet results = metaData.getPrimaryKeys(null, null, tableName);
    if (null == results) {
      return null;
    }
    try {
      if (results.next()) {
        return results.getString("COLUMN_NAME");
      } else {
        return null;
      }
    } finally {
      results.close();
      getConnection().commit();
    }
  } catch (SQLException sqlException) {
    LoggingUtils.logAll(LOG, "Error reading primary key metadata: "
        + sqlException.toString(), sqlException);
    return null;
  }
}
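Note what happens with a composite primary key: DatabaseMetaData.getPrimaryKeys returns one row per key column, but the method above calls results.next() only once and returns whichever column comes back first. A small standalone sketch makes this visible (the JDBC URL, credentials and the table name EMPLOYEE are placeholders):

// Hypothetical illustration: list every primary-key column JDBC reports for a
// made-up table EMPLOYEE with composite key (DEPT_ID, EMP_ID).
import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.DriverManager;
import java.sql.ResultSet;

public class ShowPrimaryKeys {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection(
        "jdbc:db2://db2host:50000/SAMPLE", "user", "password")) {
      DatabaseMetaData meta = conn.getMetaData();
      try (ResultSet rs = meta.getPrimaryKeys(null, null, "EMPLOYEE")) {
        // For a composite key this loop prints several rows (e.g. DEPT_ID and EMP_ID);
        // Sqoop's getPrimaryKey() above stops after the first one.
        while (rs.next()) {
          System.out.println(rs.getString("COLUMN_NAME")
              + " (KEY_SEQ=" + rs.getShort("KEY_SEQ") + ")");
        }
      }
    }
  }
}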
So in SqlManager, if --split-by is not specified, the first primary-key column returned is used as the split (and effectively the ordering) column.
Conclusion: when Sqoop migrates data from DB2 into HDFS files, the output order follows the first primary-key column only, because that single column is used to split the import into ranges. In DB2, however, the rows are ordered by the full composite primary key, which is why the row order seen in Hive does not match the row order in DB2.
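To make the effect concrete, here is a minimal standalone sketch of how a numeric split column's [min, max] range is cut into per-mapper WHERE clauses. This is not Sqoop's actual splitter, and the table EMPLOYEE, the column DEPT_ID, and the min/max values are made up; it assumes a bounding query along the lines of SELECT MIN(DEPT_ID), MAX(DEPT_ID) FROM EMPLOYEE returned 1 and 100, with 4 map tasks. Each mapper writes its own part file, so the files are grouped by ranges of that single column rather than by the composite-key order seen in DB2.

import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for range splitting over a numeric split column.
public class SplitSketch {
  public static void main(String[] args) {
    long min = 1, max = 100;   // assumed result of the bounding query
    int numMappers = 4;        // assumed number of map tasks (-m 4)

    List<String> conditions = new ArrayList<>();
    long step = (max - min + 1) / numMappers;
    long lo = min;
    for (int i = 0; i < numMappers; i++) {
      // The last split absorbs any remainder so the whole range is covered.
      long hi = (i == numMappers - 1) ? max + 1 : lo + step;
      conditions.add("DEPT_ID >= " + lo + " AND DEPT_ID < " + hi);
      lo = hi;
    }

    // Each condition becomes the WHERE clause of one mapper's SELECT, e.g.
    //   SELECT ... FROM EMPLOYEE WHERE DEPT_ID >= 1 AND DEPT_ID < 26
    // so each output file covers one contiguous range of DEPT_ID values.
    conditions.forEach(System.out::println);
  }
}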