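Hadoop Iterator Reuse and Other Issues: A Summary

The Hadoop Iterator Reuse Problem

I first ran into this when a business job produced final results that did not match expectations; adding log statements to the code while debugging exposed the cause. The javadoc of the reduce method already describes what can go wrong: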

Quote:
The framework calls this method for each pair in the grouped inputs. Output values must be of the same type as input values. Input keys must not be altered. The framework will reuse the key and value objects that are passed into the reduce, therefore the application should clone the objects they want to keep a copy of.

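In other words, although the reduce method runs many times, there are only two objects behind key and value, and the framework reuses those two objects over and over. So if you want to keep a key or a value, you must either copy its contents out or clone a new object; you cannot simply store the reference. The reference points to the same object from start to finish, so every stored entry ends up showing the last value written, which corrupts the final result.

What does that mean in practice?

Consider the following code: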

    @Override
    protected void reduce(Text key, Iterable<DZDataType> values, Context context) throws IOException, InterruptedException {
        List<DZDataType> bankDataList = new ArrayList<DZDataType>();
        List<DZDataType> cpcnDataList = new ArrayList<DZDataType>();

        for (DZDataType flagDataType : values) {
            // The framework reuses flagDataType on every iteration, so copy
            // its fields into a fresh object before storing it.
            DZDataType data = new DZDataType();
            data.setFlag(flagDataType.getFlag());
            data.setInfo(flagDataType.getInfo());

            if (FileTypeEnum.FILETPYE_BANK.getValue() == flagDataType.getFlag()) {
                bankDataList.add(data);
            } else {
                cpcnDataList.add(data);
            }
        }
        // The empty try/catch of the original was dropped: reduce already
        // declares IOException and InterruptedException.
    }

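In the for-each loop, if you do not create a new object and instead add flagDataType itself on every iteration, every entry in bankDataList is a reference to the same reused object, so the list fills up with many copies of the same value.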
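The effect can be reproduced without a cluster. The following is a minimal sketch, not Hadoop code: Record and IteratorReuseDemo are hypothetical names, and reusingIterable only imitates the framework's behavior by handing back one shared object that is refilled on each call to next().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical stand-in for a value type such as DZDataType.
class Record {
    private String info;
    Record(String info) { this.info = info; }
    String getInfo() { return info; }
    void setInfo(String info) { this.info = info; }
}

class IteratorReuseDemo {
    // Imitates the framework: a single Record is refilled on every next().
    static Iterable<Record> reusingIterable(final List<String> raw) {
        return () -> new Iterator<Record>() {
            private final Record shared = new Record(null);
            private int i = 0;

            @Override
            public boolean hasNext() { return i < raw.size(); }

            @Override
            public Record next() {
                if (!hasNext()) throw new NoSuchElementException();
                shared.setInfo(raw.get(i++));
                return shared; // always the same object
            }
        };
    }

    // Wrong: stores the same reference on every iteration.
    static List<String> collectByReference(Iterable<Record> values) {
        List<Record> kept = new ArrayList<>();
        for (Record r : values) {
            kept.add(r);
        }
        return infos(kept);
    }

    // Right: copies the fields into a fresh object before storing.
    static List<String> collectByCopy(Iterable<Record> values) {
        List<Record> kept = new ArrayList<>();
        for (Record r : values) {
            kept.add(new Record(r.getInfo()));
        }
        return infos(kept);
    }

    private static List<String> infos(List<Record> records) {
        List<String> out = new ArrayList<>();
        for (Record r : records) {
            out.add(r.getInfo());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("a", "b", "c");
        System.out.println(collectByReference(reusingIterable(raw))); // [c, c, c]
        System.out.println(collectByCopy(reusingIterable(raw)));      // [a, b, c]
    }
}
```

Storing the reference leaves three entries that all show the last value written, while copying keeps each value as it was when it was read.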

Another problem that may occur is:

Hadoop Mapreduce Error: GC overhead limit exceeded

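The problem lies in the configuration of the mapred.child.java.opts parameter.

Map phase: mapreduce.admin.map.child.java.opts < mapred.child.java.opts < mapred.map.child.java.opts. In other words, if the settings conflict, the JVM options defined in mapred.map.child.java.opts are the ones that finally take effect.

Reduce phase: mapreduce.admin.reduce.child.java.opts < mapred.child.java.opts < mapred.reduce.child.java.opts

Adjusting this parameter, typically by raising the JVM heap size with -Xmx, can avoid the problem to some extent.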
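The precedence order listed for the map phase can be read as "the rightmost property that is set wins". Below is a minimal sketch of that reading only, using a plain Map in place of Hadoop's Configuration. ChildOptsPrecedence and effectiveOpts are hypothetical names, the -Xmx values are illustrative, and this is not Hadoop's own resolution code, which may combine options rather than pick a single property.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ChildOptsPrecedence {
    // Map-phase property names in increasing priority, as listed in the text.
    static final List<String> MAP_ORDER = Arrays.asList(
            "mapreduce.admin.map.child.java.opts",
            "mapred.child.java.opts",
            "mapred.map.child.java.opts");

    // Returns the value of the highest-priority property that is set,
    // or null when none of them is present.
    static String effectiveOpts(Map<String, String> conf, List<String> order) {
        String result = null;
        for (String name : order) {
            String value = conf.get(name);
            if (value != null) {
                result = value; // a later (higher-priority) entry overrides
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("mapred.child.java.opts", "-Xmx512m");      // illustrative value
        conf.put("mapred.map.child.java.opts", "-Xmx1024m"); // illustrative value
        System.out.println(effectiveOpts(conf, MAP_ORDER));  // -Xmx1024m
    }
}
```

Under this reading, a heap size raised in mapred.child.java.opts has no effect on map tasks when mapred.map.child.java.opts is also set, so the phase-specific property is the one to check first.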