On a recent project I found that after migrating data with Sqoop (in --table mode), the row order of the data in Hive differed from the row order in DB2. Puzzled by this, I went straight to the Sqoop source code.
The import logic lives in org.apache.sqoop.tool.ImportTool (ImportTool.java).
The importTable method that actually drives the import looks like this:
protected boolean importTable(SqoopOptions options, String tableName,
    HiveImport hiveImport) throws IOException, ImportException {
  String jarFile = null;

  // Generate the ORM code for the tables.
  jarFile = codeGenerator.generateORM(options, tableName);

  Path outputPath = getOutputPath(options, tableName);

  // Do the actual import.
  ImportJobContext context = new ImportJobContext(tableName, jarFile,
      options, outputPath);

  // If we're doing an incremental import, set up the
  // filtering conditions used to get the latest records.
  if (!initIncrementalConstraints(options, context)) {
    return false;
  }

  if (options.isDeleteMode()) {
    deleteTargetDir(context);
  }

  if (null != tableName) {
    manager.importTable(context);
  } else {
    manager.importQuery(context);
  }

  if (options.isAppendMode()) {
    AppendUtils app = new AppendUtils(context);
    app.append();
  } else if (options.getIncrementalMode() == SqoopOptions.IncrementalMode.DateLastModified) {
    lastModifiedMerge(options, context);
  }

  // If the user wants this table to be in Hive, perform that post-load.
  if (options.doHiveImport()) {
    // For Parquet file, the import action will create hive table directly via
    // kite. So there is no need to do hive import as a post step again.
    if (options.getFileLayout() != SqoopOptions.FileLayout.ParquetFile) {
      hiveImport.importTable(tableName, options.getHiveTableName(), false);
    }
  }

  saveIncrementalState(options);

  return true;
}
The logic:
1. Generate the Java ORM class file. This class is produced by ClassWriter and contains read and write methods matching each column's name and type (I won't dig into this further; the problem is not here). This is also where preprocessing such as trimming whitespace could be hooked in; a rough sketch of what such a generated class looks like appears after this list.
2. Call manager.importTable(context). The purpose of this method is to produce the HDFS files; the call actually lands in org.apache.sqoop.manager.SqlManager.importTable().
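For reference, here is a hand-written, heavily simplified stand-in for the kind of record class ClassWriter generates. The table EMPLOYEE and its two columns are made up for illustration; the real generated class extends SqoopRecord and carries much more machinery, this only shows the read/write idea and where a preprocessing step could go.

// Hypothetical, simplified sketch of a generated record class for a
// made-up table EMPLOYEE(ID INTEGER, NAME VARCHAR).
public class EMPLOYEE implements org.apache.hadoop.mapreduce.lib.db.DBWritable {
  private Integer id;
  private String name;

  // "read" side: populate the fields from one row of the JDBC ResultSet.
  public void readFields(java.sql.ResultSet results) throws java.sql.SQLException {
    this.id = results.getInt(1);
    this.name = results.getString(2);
    // Preprocessing such as trimming whitespace could be added here, e.g.:
    // this.name = (this.name == null) ? null : this.name.trim();
  }

  // "write" side: bind the fields back into a PreparedStatement.
  public void write(java.sql.PreparedStatement stmt) throws java.sql.SQLException {
    stmt.setInt(1, this.id);
    stmt.setString(2, this.name);
  }
}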
Stepping into SqlManager.importTable in detail:
/**
 * Default implementation of importTable() is to launch a MapReduce job
 * via DataDrivenImportJob to read the table with DataDrivenDBInputFormat.
 */
public void importTable(com.cloudera.sqoop.manager.ImportJobContext context)
    throws IOException, ImportException {
  String tableName = context.getTableName();
  String jarFile = context.getJarFile();
  SqoopOptions opts = context.getOptions();

  context.setConnManager(this);

  ImportJobBase importer;
  if (opts.getHBaseTable() != null) {
    // Import to HBase.
    if (!HBaseUtil.isHBaseJarPresent()) {
      throw new ImportException("HBase jars are not present in "
          + "classpath, cannot import to HBase!");
    }
    if (!opts.isBulkLoadEnabled()){
      importer = new HBaseImportJob(opts, context);
    } else {
      importer = new HBaseBulkImportJob(opts, context);
    }
  } else if (opts.getAccumuloTable() != null) {
    // Import to Accumulo.
    if (!AccumuloUtil.isAccumuloJarPresent()) {
      throw new ImportException("Accumulo jars are not present in "
          + "classpath, cannot import to Accumulo!");
    }
    importer = new AccumuloImportJob(opts, context);
  } else {
    // Import to HDFS.
    importer = new DataDrivenImportJob(opts, context.getInputFormat(),
        context);
  }

  checkTableImportOptions(context);

  // Pick out a particular column -- at this point I didn't yet know what it was for.
  String splitCol = getSplitColumn(opts, tableName);

  // runImport is where the data actually gets loaded into HDFS files.
  importer.runImport(tableName, jarFile, splitCol, opts.getConf());
}
Next, step into runImport, which lives in ImportJobBase.
/**
 * Run an import job to read a table in to HDFS.
 *
 * @param tableName the database table to read; may be null if a free-form
 * query is specified in the SqoopOptions, and the ImportJobBase subclass
 * supports free-form queries.
 * @param ormJarFile the Jar file to insert into the dcache classpath.
 * (may be null)
 * @param splitByCol the column of the database table to use to split
 * the import
 * @param conf A fresh Hadoop Configuration to use to build an MR job.
 * @throws IOException if the job encountered an IO problem
 * @throws ImportException if the job failed unexpectedly or was
 * misconfigured.
 */
public void runImport(String tableName, String ormJarFile, String splitByCol,
    Configuration conf) throws IOException, ImportException {
  // Check if there are runtime error checks to do
  if (isHCatJob && options.isDirect()
      && !context.getConnManager().isDirectModeHCatSupported()) {
    throw new IOException("Direct import is not compatible with "
        + "HCatalog operations using the connection manager "
        + context.getConnManager().getClass().getName()
        + ". Please remove the parameter --direct");
  }
  if (options.getAccumuloTable() != null && options.isDirect()
      && !getContext().getConnManager().isDirectModeAccumuloSupported()) {
    throw new IOException("Direct mode is incompatible with "
        + "Accumulo. Please remove the parameter --direct");
  }
  if (options.getHBaseTable() != null && options.isDirect()
      && !getContext().getConnManager().isDirectModeHBaseSupported()) {
    throw new IOException("Direct mode is incompatible with "
        + "HBase. Please remove the parameter --direct");
  }

  if (null != tableName) {
    LOG.info("Beginning import of " + tableName);
  } else {
    LOG.info("Beginning query import.");
  }
  String tableClassName = null;
  if (!getContext().getConnManager().isORMFacilitySelfManaged()) {
    tableClassName =
        new TableClassName(options).getClassForTable(tableName);
  }
  // For ORM self managed, we leave the tableClassName to null so that
  // we don't check for non-existing classes.
  loadJars(conf, ormJarFile, tableClassName);

  Job job = createJob(conf);
  try {
    // Set the external jar to use for the job.
    job.getConfiguration().set("mapred.jar", ormJarFile);
    if (options.getMapreduceJobName() != null) {
      job.setJobName(options.getMapreduceJobName());
    }

    propagateOptionsToJob(job);
    // Prepare the format in which the data will be read.
    configureInputFormat(job, tableName, tableClassName, splitByCol);
    configureOutputFormat(job, tableName, tableClassName);
    configureMapper(job, tableName, tableClassName);
    configureNumTasks(job);
    cacheJars(job, getContext().getConnManager());

    jobSetup(job);
    setJob(job);
    boolean success = runJob(job);
    if (!success) {
      throw new ImportException("Import job failed!");
    }

    completeImport(job);

    if (options.isValidationEnabled()) {
      validateImport(tableName, conf, job);
    }

    if (options.doHiveImport() || isHCatJob) {
      // Publish data for import job, only hive/hcat import jobs are supported now.
      LOG.info("Publishing Hive/Hcat import job data to Listeners for table " + tableName);
      PublishJobData.publishJobData(conf, options, OPERATION, tableName, startTime);
    }
  } catch (InterruptedException ie) {
    throw new IOException(ie);
  } catch (ClassNotFoundException cnfe) {
    throw new IOException(cnfe);
  } finally {
    unloadJars();
    jobTeardown(job);
  }
}
The problem most likely lies in the configureInputFormat method.
Next, into DataDrivenImportJob.java:
@Override
protected void configureInputFormat(Job job, String tableName,
    String tableClassName, String splitByCol) throws IOException {
  ConnManager mgr = getContext().getConnManager();
  try {
    String username = options.getUsername();
    if (null == username || username.length() == 0) {
      DBConfiguration.configureDB(job.getConfiguration(),
          mgr.getDriverClass(), options.getConnectString(),
          options.getFetchSize(), options.getConnectionParams());
    } else {
      DBConfiguration.configureDB(job.getConfiguration(),
          mgr.getDriverClass(), options.getConnectString(),
          username, options.getPassword(), options.getFetchSize(),
          options.getConnectionParams());
    }

    if (null != tableName) {
      // Import a table.
      String [] colNames = options.getColumns();
      if (null == colNames) {
        colNames = mgr.getColumnNames(tableName);
      }

      String [] sqlColNames = null;
      if (null != colNames) {
        sqlColNames = new String[colNames.length];
        for (int i = 0; i < colNames.length; i++) {
          sqlColNames[i] = mgr.escapeColName(colNames[i]);
        }
      }

      // It's ok if the where clause is null in DBInputFormat.setInput.
      String whereClause = options.getWhereClause();

      // We can't set the class properly in here, because we may not have the
      // jar loaded in this JVM. So we start by calling setInput() with
      // DBWritable and then overriding the string manually.

      /* This looks like where the SELECT used to pull the data is prepared! */
      DataDrivenDBInputFormat.setInput(job, DBWritable.class,
          mgr.escapeTableName(tableName), whereClause,
          mgr.escapeColName(splitByCol), sqlColNames);

      // If user specified boundary query on the command line propagate it to
      // the job
      if (options.getBoundaryQuery() != null) {
        DataDrivenDBInputFormat.setBoundingQuery(job.getConfiguration(),
            options.getBoundaryQuery());
      }
    } else {
      // Import a free-form query.
      String inputQuery = options.getSqlQuery();
      String sanitizedQuery = inputQuery.replace(
          DataDrivenDBInputFormat.SUBSTITUTE_TOKEN, " (1 = 1) ");

      String inputBoundingQuery = options.getBoundaryQuery();
      if (inputBoundingQuery == null) {
        inputBoundingQuery = buildBoundaryQuery(splitByCol, sanitizedQuery);
      }
      DataDrivenDBInputFormat.setInput(job, DBWritable.class,
          inputQuery, inputBoundingQuery);
      new DBConfiguration(job.getConfiguration()).setInputOrderBy(
          splitByCol);
    }

    if (options.getRelaxedIsolation()) {
      LOG.info("Enabling relaxed (read uncommitted) transaction "
          + "isolation for imports");
      job.getConfiguration()
          .setBoolean(DBConfiguration.PROP_RELAXED_ISOLATION, true);
    }

    LOG.debug("Using table class: " + tableClassName);
    job.getConfiguration().set(ConfigurationHelper.getDbInputClassProperty(),
        tableClassName);

    job.getConfiguration().setLong(LargeObjectLoader.MAX_INLINE_LOB_LEN_KEY,
        options.getInlineLobLimit());

    if (options.getSplitLimit() != null) {
      org.apache.sqoop.config.ConfigurationHelper.setSplitLimit(
          job.getConfiguration(), options.getSplitLimit());
    }

    LOG.debug("Using InputFormat: " + inputFormatClass);
    job.setInputFormatClass(inputFormatClass);
  } finally {
    try {
      mgr.close();
    } catch (SQLException sqlE) {
      LOG.warn("Error closing connection: " + sqlE);
    }
  }
}
Inside it, the call

DataDrivenDBInputFormat.setInput(job, DBWritable.class,
    mgr.escapeTableName(tableName), whereClause,
    mgr.escapeColName(splitByCol), sqlColNames);

appears to be what prepares how the data is read from the source. Digging one level deeper, it finally becomes clear that the splitCol passed in from the outermost call is the column used to split the import into ranges, and therefore the key that effectively determines the order of the output!
Step into the corresponding method:
/**
 * Determine what column to use to split the table.
 * @param opts the SqoopOptions controlling this import.
 * @param tableName the table to import.
 * @return the splitting column, if one is set or inferrable, or null
 * otherwise.
 */
protected String getSplitColumn(SqoopOptions opts, String tableName) {
  String splitCol = opts.getSplitByCol();
  if (null == splitCol && null != tableName) {
    // If the user didn't specify a splitting column, try to infer one.
    splitCol = getPrimaryKey(tableName);
  }
  return splitCol;
}
So this method's purpose is to find the primary key! Step into getPrimaryKey:
@Override
public String getPrimaryKey(String tableName) {
  try {
    DatabaseMetaData metaData = this.getConnection().getMetaData();
    ResultSet results = metaData.getPrimaryKeys(null, null, tableName);
    if (null == results) {
      return null;
    }
    try {
      if (results.next()) {
        return results.getString("COLUMN_NAME");
      } else {
        return null;
      }
    } finally {
      results.close();
      getConnection().commit();
    }
  } catch (SQLException sqlException) {
    LoggingUtils.logAll(LOG, "Error reading primary key metadata: "
        + sqlException.toString(), sqlException);
    return null;
  }
}
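Note what happens with a composite primary key: DatabaseMetaData.getPrimaryKeys returns one row per key column, but the method above calls results.next() only once and returns whichever column comes back first. A small standalone sketch makes this visible (the JDBC URL, credentials and the table name EMPLOYEE are placeholders):

// Hypothetical illustration: list every primary-key column JDBC reports for a
// made-up table EMPLOYEE with composite key (DEPT_ID, EMP_ID).
import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.DriverManager;
import java.sql.ResultSet;

public class ShowPrimaryKeys {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection(
        "jdbc:db2://db2host:50000/SAMPLE", "user", "password")) {
      DatabaseMetaData meta = conn.getMetaData();
      try (ResultSet rs = meta.getPrimaryKeys(null, null, "EMPLOYEE")) {
        // For a composite key this loop prints several rows (e.g. DEPT_ID and EMP_ID);
        // Sqoop's getPrimaryKey() above stops after the first one.
        while (rs.next()) {
          System.out.println(rs.getString("COLUMN_NAME")
              + " (KEY_SEQ=" + rs.getShort("KEY_SEQ") + ")");
        }
      }
    }
  }
}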
So in SqlManager, if --split-by is not specified, the first primary-key column returned is used as the split (and effectively the ordering) column.
Conclusion: when Sqoop migrates data from DB2 into HDFS files, the output order follows the first primary-key column only, because that single column is used to split the import into ranges. In DB2, however, the rows are ordered by the full composite primary key, which is why the row order seen in Hive does not match the row order in DB2.
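To make the effect concrete, here is a minimal standalone sketch of how a numeric split column's [min, max] range is cut into per-mapper WHERE clauses. This is not Sqoop's actual splitter, and the table EMPLOYEE, the column DEPT_ID, and the min/max values are made up; it assumes a bounding query along the lines of SELECT MIN(DEPT_ID), MAX(DEPT_ID) FROM EMPLOYEE returned 1 and 100, with 4 map tasks. Each mapper writes its own part file, so the files are grouped by ranges of that single column rather than by the composite-key order seen in DB2.

import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for range splitting over a numeric split column.
public class SplitSketch {
  public static void main(String[] args) {
    long min = 1, max = 100;   // assumed result of the bounding query
    int numMappers = 4;        // assumed number of map tasks (-m 4)

    List<String> conditions = new ArrayList<>();
    long step = (max - min + 1) / numMappers;
    long lo = min;
    for (int i = 0; i < numMappers; i++) {
      // The last split absorbs any remainder so the whole range is covered.
      long hi = (i == numMappers - 1) ? max + 1 : lo + step;
      conditions.add("DEPT_ID >= " + lo + " AND DEPT_ID < " + hi);
      lo = hi;
    }

    // Each condition becomes the WHERE clause of one mapper's SELECT, e.g.
    //   SELECT ... FROM EMPLOYEE WHERE DEPT_ID >= 1 AND DEPT_ID < 26
    // so each output file covers one contiguous range of DEPT_ID values.
    conditions.forEach(System.out::println);
  }
}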