Both MapTask and ReduceTask sort data by key. Hadoop's default ordering is lexicographic (dictionary) order on the keys, and the in-memory sort is implemented with quicksort.
1. MapTask sorting
Map outputs are first collected in a circular (ring) buffer. When the buffer's usage reaches a certain threshold, the data in the buffer is quick-sorted and the sorted run is spilled to disk. After all input has been processed, the spill files on disk are merge-sorted into a single sorted file.
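The spill mechanism can be sketched in plain Java (a simplified illustration, not Hadoop's actual implementation; the tiny buffer size and threshold here are stand-ins for the real 100 MB ring buffer and the 80% spill percentage, `mapreduce.task.io.sort.mb` and `mapreduce.map.sort.spill.percent`):

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of the MapTask spill mechanism: keys accumulate in a
// fixed-size buffer; when the fill ratio crosses the threshold, the buffer
// is sorted and written out ("spilled") as one sorted run.
public class SpillSketch {
    static final int BUFFER_SIZE = 4;           // stand-in for the 100 MB ring buffer
    static final double SPILL_THRESHOLD = 0.75; // stand-in for the 0.80 spill percent

    public static List<List<String>> process(List<String> keys) {
        List<List<String>> spills = new ArrayList<>();
        List<String> buffer = new ArrayList<>();
        for (String k : keys) {
            buffer.add(k);
            if (buffer.size() >= BUFFER_SIZE * SPILL_THRESHOLD) {
                spills.add(spill(buffer));
            }
        }
        if (!buffer.isEmpty()) {
            spills.add(spill(buffer)); // final spill once the input is exhausted
        }
        return spills;
    }

    private static List<String> spill(List<String> buffer) {
        List<String> run = new ArrayList<>(buffer);
        run.sort(null); // natural (lexicographic) order, Hadoop's default key ordering
        buffer.clear();
        return run;
    }
}
```

Each returned run is internally sorted; the runs themselves still need the final merge sort described above.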
2. ReduceTask sorting
The ReduceTask remotely copies its partition of the output from every MapTask. Each copied file is spilled to disk if it exceeds a size threshold; otherwise it is kept in memory.
When the number of files on disk reaches a threshold, they are merge-sorted into one larger file. When the size or number of in-memory files exceeds a threshold, they are merged and spilled to disk. Once all copies have finished, the ReduceTask performs one final merge sort over all the data in memory and on disk.
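The final merge can be sketched as a k-way merge over sorted runs driven by a min-heap (again a plain-Java simplification, not Hadoop's actual merge code):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Simplified sketch of the merge the ReduceTask performs over sorted runs
// (spill files / in-memory segments): a k-way merge using a priority queue.
public class MergeSketch {
    @SuppressWarnings("unchecked")
    public static List<String> merge(List<List<String>> runs) {
        // heap entries: { current value, iterator over the rest of that run }
        PriorityQueue<Object[]> heap =
                new PriorityQueue<>((x, y) -> ((String) x[0]).compareTo((String) y[0]));
        for (List<String> run : runs) {
            Iterator<String> it = run.iterator();
            if (it.hasNext()) heap.add(new Object[]{it.next(), it});
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            Object[] top = heap.poll();          // smallest head among all runs
            merged.add((String) top[0]);
            Iterator<String> it = (Iterator<String>) top[1];
            if (it.hasNext()) heap.add(new Object[]{it.next(), it}); // refill from the same run
        }
        return merged;
    }
}
```

Because every run is already sorted, each step only compares the current heads, which is what makes merging many large spill files cheap.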
3. Categories of sorting
(1) Partial sort
MapReduce sorts the dataset by the keys of the input records, guaranteeing that each output file is internally sorted.
(2) Total sort
The final output is a single file that is sorted as a whole.
The simplest implementation is to configure only one ReduceTask, but this is extremely inefficient on large datasets because it gives up all reduce-side parallelism.
(3) Auxiliary sort (GroupingComparator grouping)
Keys are grouped on the Reduce side. For example, when the key is a bean object and you want records to enter the same reduce() call based on one or a few of the bean's fields rather than full key equality, use grouping.
(4) Secondary sort
In a custom sort, if compareTo() compares on two criteria, it is a secondary sort.
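A minimal plain-Java sketch of a two-criterion compareTo() (the class and field names here are made up for illustration; the same pattern applies inside a WritableComparable's compareTo()):

```java
// Sketch of a two-criterion (secondary) sort: primary key ascending,
// secondary key descending.
public class Item implements Comparable<Item> {
    final String group; // primary sort key, ascending
    final int score;    // secondary sort key, descending

    Item(String group, int score) {
        this.group = group;
        this.score = score;
    }

    @Override
    public int compareTo(Item o) {
        int byGroup = this.group.compareTo(o.group); // first criterion
        if (byGroup != 0) return byGroup;
        return Integer.compare(o.score, this.score); // second criterion, reversed for descending
    }
}
```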
4. Custom sorting with WritableComparable
When a bean object is transferred as the key, it must implement the WritableComparable interface and override compareTo() to define the sort order.
public class Flow implements WritableComparable<Flow> { // WritableComparable already extends Writable

    private Long upFlow;   // upstream traffic
    private Long downFlow; // downstream traffic
    private Long sumFlow;  // total traffic

    @Override
    public String toString() {
        return "Flow{" +
                "upFlow=" + upFlow +
                ", downFlow=" + downFlow +
                ", sumFlow=" + sumFlow +
                '}';
    }

    public Long getUpFlow() {
        return upFlow;
    }

    public void setUpFlow(long upFlow) {
        this.upFlow = upFlow;
    }

    public Long getDownFlow() {
        return downFlow;
    }

    public void setDownFlow(long downFlow) {
        this.downFlow = downFlow;
    }

    public Long getSumFlow() {
        return sumFlow;
    }

    public void setSumFlow(long sumFlow) {
        this.sumFlow = sumFlow;
    }

    // Serialization: the field order here must match readFields() below
    @Override
    public void write(DataOutput dataOutput) throws IOException {
        dataOutput.writeLong(upFlow);
        dataOutput.writeLong(downFlow);
        dataOutput.writeLong(sumFlow);
    }

    @Override
    public void readFields(DataInput dataInput) throws IOException {
        upFlow = dataInput.readLong();
        downFlow = dataInput.readLong();
        sumFlow = dataInput.readLong();
    }

    // Sort ascending by total traffic; swap receiver and argument for descending order
    @Override
    public int compareTo(Flow o) {
        return this.getSumFlow().compareTo(o.getSumFlow());
    }
}
5. GroupingComparator grouping (auxiliary sort)
Groups the data in the Reduce phase by one or more fields.
Grouping steps:
(1) Create a class that extends WritableComparator and override its compare() method.
(2) Add a constructor that passes the key class being compared to the superclass (the `true` flag tells WritableComparator to create instances of the key):
public OrderGroupingCompartor() {
    super(Order.class, true);
}
Case study: find the highest-priced item in each order.
The code is as follows:
public class Order implements WritableComparable<Order> {

    private String id;
    private Float price;

    public String getId() {
        return id;
    }

    public void setId(String id) {
        this.id = id;
    }

    public Float getPrice() {
        return price;
    }

    public void setPrice(Float price) {
        this.price = price;
    }

    @Override
    public void readFields(DataInput input) throws IOException {
        this.id = input.readUTF();
        this.price = input.readFloat();
    }

    @Override
    public void write(DataOutput output) throws IOException {
        output.writeUTF(this.id);
        output.writeFloat(this.price);
    }

    // Secondary sort: order id ascending; within the same order, price descending
    @Override
    public int compareTo(Order o) {
        if (o.getId().equals(this.id)) {
            return o.getPrice().compareTo(this.getPrice());
        } else {
            return this.id.compareTo(o.getId());
        }
    }

    @Override
    public String toString() {
        return "Order [id=" + id + ", price=" + price + "]";
    }
}
public class OrderMapper extends Mapper<LongWritable, Text, Order, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // expected input line: orderId \t itemId \t price
        String line = value.toString();
        String[] splits = line.split("\t");
        Order order = new Order();
        order.setId(splits[0]);
        order.setPrice(Float.valueOf(splits[2]));
        context.write(order, NullWritable.get());
    }
}
public class OrderReduce extends Reducer<Order, NullWritable, Order, NullWritable> {
    @Override
    protected void reduce(Order order, Iterable<NullWritable> iterable,
                          Context context) throws IOException, InterruptedException {
        // With the grouping comparator, the key seen here is the first record of the
        // group, i.e. the highest-priced item of that order
        System.out.println("order=" + order.toString());
        context.write(order, NullWritable.get());
    }
}
public class OrderGroupingCompartor extends WritableComparator {
    public OrderGroupingCompartor() {
        super(Order.class, true);
    }

    // Group keys by order id only, so all records of one order enter the same reduce() call
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        Order aOrder = (Order) a;
        Order bOrder = (Order) b;
        return aOrder.getId().compareTo(bOrder.getId());
    }
}
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
    System.setProperty("HADOOP_USER_NAME", "root");
    Configuration configuration = new Configuration();
    Job job = Job.getInstance(configuration);

    job.setMapperClass(OrderMapper.class);
    job.setMapOutputKeyClass(Order.class);
    job.setMapOutputValueClass(NullWritable.class);

    job.setReducerClass(OrderReduce.class);
    job.setOutputKeyClass(Order.class);
    job.setOutputValueClass(NullWritable.class);

    // optional custom partitioner
    //job.setPartitionerClass(OrderPartition.class);
    //job.setNumReduceTasks(3);

    // register the grouping comparator
    job.setGroupingComparatorClass(OrderGroupingCompartor.class);

    FileInputFormat.setInputPaths(job, new Path("/mapreduce/sort/groupingCompartor"));
    FileOutputFormat.setOutputPath(job, new Path("/mapreduce/sort/output"));

    boolean waitForCompletion = job.waitForCompletion(true);
    System.exit(waitForCompletion ? 0 : 1);
}
Output:
[root@master mapreduce]# hdfs dfs -text /mapreduce/sort/output/part-r-00000
Order [id=0000001, price=222.8]
Order [id=0000002, price=722.4]
Order [id=0000003, price=232.8]
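To make the interplay between the secondary sort and the grouping comparator concrete, here is a plain-Java simulation (no Hadoop; the helper class and its names are hypothetical) of what the shuffle does with this job's keys: records are sorted by id ascending then price descending, groups are cut wherever the id changes, and the first record of each group is the one the reducer emits.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java simulation of sort + grouping: sorting by (id asc, price desc)
// mirrors Order.compareTo(); cutting a new group whenever the id changes
// mirrors OrderGroupingCompartor. The first record of each group therefore
// carries that order's maximum price.
public class GroupingSimulation {
    public static List<String> topPricePerOrder(List<String[]> records) {
        // records: {orderId, price}; sort like Order.compareTo()
        records.sort((a, b) -> {
            int byId = a[0].compareTo(b[0]);
            if (byId != 0) return byId;
            return Float.compare(Float.parseFloat(b[1]), Float.parseFloat(a[1]));
        });
        List<String> result = new ArrayList<>();
        String currentId = null;
        for (String[] r : records) {
            if (!r[0].equals(currentId)) {      // id changed: new group begins
                result.add(r[0] + "\t" + r[1]); // first record in the group = max price
                currentId = r[0];
            }
        }
        return result;
    }
}
```

This is why the job's output above contains exactly one line per order id, each with the highest price seen for that order.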