Join Operations in MapReduce

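I. Background

MapReduce supports several ways to join tables: map-side join, reduce-side join, and semi-join. A map-side join combines the data during the map phase, so nothing has to be sent to a reducer. It is usually far more efficient than a reduce-side join, which pushes every record of both inputs through the shuffle and is therefore expensive in network and disk I/O. This article first walks through a reduce-side join and then turns to the map-side join.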

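II. Join in Detail

1. A join example

Suppose we have two files: products.txt, which stores product information, and orders.txt, which stores order information. The data looks like this:

products.txt:

// product ID, product name, product category (a number; we assume a mapping from numbers to concrete categories exists)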
p0001,xiaomi,001
p0002,chuizi,001

orders.txt:
```
// order ID, date, product ID, quantity purchased
1001,20170710,p0001,1 
1002,20170710,p0001,3 
1003,20170710,p0001,3 
1004,20170710,p0002,1
```
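Imagine there are many products and a huge volume of orders, stored across multiple HDFS blocks. We want the total quantity sold for each product: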
```
xiaomi,7
chuizi,1
```
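How do we get there? The result we want consists of the product name and the quantity sold, and these two attributes live in different files. So we read both files in one place (the mapper) and combine the records that share a product ID in another place (the reducer). This is a join in MapReduce, more precisely a reduce-side join.

The code is as follows:

Mapper: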

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class joinMapper extends Mapper<LongWritable,Text,Text,Text> {

    private Text outKey=new Text();
    private Text outValue=new Text();
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String[] split = line.split(",");
        FileSplit inputSplit = (FileSplit) context.getInputSplit();
        String name = inputSplit.getPath().getName();
        // Both files are processed by the same mapper class;
        // the file name tells us which kind of record this is.
        if(name.startsWith("products")){
            // products.txt: key = product ID (field 0), value = product name (field 1)
            outKey.set(split[0]);
            outValue.set("product#" + split[1]);
            context.write(outKey, outValue);
        }else{
            // orders.txt: key = product ID (field 2), value = quantity purchased (field 3)
            outKey.set(split[2]);
            outValue.set("order#" + split[3]);
            context.write(outKey, outValue);
        }
    }
}

Reducer:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class joinReducer extends Reducer<Text,Text,Text,Text> {
    private Text outValue = new Text();
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        // Product names seen for this product ID
        List<String> productsList = new ArrayList<String>();
        // Total quantity purchased for this product ID
        int totalOrders = 0;

        for (Text text : values) {
            String value = text.toString();
            if (value.startsWith("product#")) {
                productsList.add(value.split("#")[1]); // extract the product name
            } else if (value.startsWith("order#")) {
                totalOrders += Integer.parseInt(value.split("#")[1].trim()); // add this order's quantity
            }
        }
        // The total is computed once above, so it is not accumulated again
        // if several product records happen to share the same ID.
        for (String productName : productsList) {
            outValue.set(productName + "\t" + totalOrders);
            // Final output: product ID, product name, total quantity
            context.write(key, outValue);
        }
    }
}

App:

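import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class App  {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "file:///");   // run against the local file system

        // Delete the output directory if it already exists; otherwise the job fails
        Path outPath = new Path("F:\\mr\\join\\out");
        FileSystem fileSystem = outPath.getFileSystem(conf);
        if(fileSystem.exists(outPath)){
            fileSystem.delete(outPath,true);
        }
        Job job = Job.getInstance(conf);
        // Configure the job
        job.setJobName("App");                          // job name
        job.setJarByClass(App.class);                   // locate the job JAR via this class
        job.setInputFormatClass(TextInputFormat.class); // input format

        job.setMapperClass(joinMapper.class);
        job.setReducerClass(joinReducer.class);
        // Input path: the directory that holds products.txt and orders.txt
        FileInputFormat.addInputPath(job,new Path("F:\\mr\\join\\map"));
        // Output path
        FileOutputFormat.setOutputPath(job,outPath);
        // Output key/value types; the map output uses the same types
        // because no separate map output types are set
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}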

Output:

p0001    xiaomi    7
p0002    chuizi    1
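
To follow the data flow without a Hadoop installation, the tag, shuffle, and reduce steps above can be traced in plain Java. This is only a sketch: the class `ReduceSideJoinTrace` and its methods are hypothetical names introduced for illustration, and a sorted map stands in for the shuffle phase.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java trace of the reduce-side join above; no Hadoop classes are used.
final class ReduceSideJoinTrace {

    // "Map" step: tag each record with its source and key it by product ID.
    static void mapRecord(String fileName, String line, Map<String, List<String>> shuffle) {
        String[] f = line.split(",");
        String key;
        String tagged;
        if (fileName.startsWith("products")) {
            key = f[0];
            tagged = "product#" + f[1];
        } else {
            key = f[2];
            tagged = "order#" + f[3].trim();
        }
        // Grouping by key stands in for the shuffle phase.
        shuffle.computeIfAbsent(key, k -> new ArrayList<>()).add(tagged);
    }

    // "Reduce" step: for one product ID, pair the product name with the summed quantity.
    static String reduceGroup(String key, List<String> values) {
        String name = null;
        int total = 0;
        for (String v : values) {
            if (v.startsWith("product#")) {
                name = v.substring("product#".length());
            } else if (v.startsWith("order#")) {
                total += Integer.parseInt(v.substring("order#".length()));
            }
        }
        return key + "\t" + name + "\t" + total;
    }

    static List<String> run(List<String> products, List<String> orders) {
        Map<String, List<String>> shuffle = new TreeMap<>(); // sorted by key, like reducer input
        for (String p : products) mapRecord("products.txt", p, shuffle);
        for (String o : orders) mapRecord("orders.txt", o, shuffle);
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : shuffle.entrySet()) {
            out.add(reduceGroup(e.getKey(), e.getValue()));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> products = List.of("p0001,xiaomi,001", "p0002,chuizi,001");
        List<String> orders = List.of("1001,20170710,p0001,1", "1002,20170710,p0001,3",
                "1003,20170710,p0001,3", "1004,20170710,p0002,1");
        run(products, orders).forEach(System.out::println);
    }
}
```

With the sample data, every order record passes through the grouping map, which is the cost a map-side join avoids.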

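2. Map Join

A map join applies when one dataset is very large and the other is small enough to fit entirely in memory. MAPJOIN loads the whole small table into memory and ships a copy of it to every instance that processes part of the large table. During the map phase, each record of the large table is matched directly against the in-memory table. Because the join is completed in the map phase, the shuffle and reduce phases are skipped, which makes the job much faster.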

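- For a left outer join, the left table must be the large table.
- For a right outer join, the right table must be the large table.
- For an inner join, either the left or the right table can be the large table.
- A full outer join cannot use MAPJOIN.
- MAPJOIN accepts a subquery as the small table. When MAPJOIN refers to the small table or the subquery, it must do so by alias. A MAPJOIN may use non-equi join conditions or combine several conditions with OR.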
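
One way to see why the preserved side of an outer join has to be the large table: each map task holds the complete small table but only one split of the large table. It can therefore tell at once that a streamed row has no match, but it cannot know whether a small-table row is unmatched across all splits. The sketch below illustrates this for a single split; `OuterMapJoinSketch`, `leftOuterJoinSplit`, and the `NULL` placeholder are hypothetical names chosen for illustration, not part of any library.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class OuterMapJoinSketch {

    // Joins one split of the large (streamed) table against the in-memory small table.
    // Rows are comma-separated; bigKeyIdx and smallKeyIdx are the join-column positions.
    static List<String> leftOuterJoinSplit(List<String> bigSplit, int bigKeyIdx,
                                           List<String> smallTable, int smallKeyIdx) {
        // Build phase: index the whole small table by join key.
        Map<String, List<String>> index = new HashMap<>();
        for (String row : smallTable) {
            index.computeIfAbsent(row.split(",")[smallKeyIdx], k -> new ArrayList<>()).add(row);
        }
        // Probe phase: every streamed row is seen exactly once, so an unmatched
        // row can be emitted immediately with a NULL placeholder.
        List<String> out = new ArrayList<>();
        for (String row : bigSplit) {
            List<String> matches = index.get(row.split(",")[bigKeyIdx]);
            if (matches == null) {
                out.add(row + ",NULL");
            } else {
                for (String m : matches) out.add(row + "," + m);
            }
        }
        return out;
    }
}
```

For example, `leftOuterJoinSplit(List.of("o1,c1", "o2,c9"), 1, List.of("c1,tom"), 0)` keeps the unmatched large-table row `o2,c9` and pads it with `NULL`.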

2.1. Map Join example

customers table:

1,tom,12
2,tomaa,13
3,tomada,14
4,tomas,15

orders table:

1,no001,12.23,1
2,no001,12.23,1
3,no001,12.23,2
4,no001,12.23,2
5,no001,12.23,2
6,no001,12.23,3
7,no001,12.23,3
8,no001,12.23,3
9,no001,12.23,3

Expected output:

1,tom,12,1,no001,12.23
1,tom,12,2,no001,12.23
2,tomaa,13,3,no001,12.23
2,tomaa,13,4,no001,12.23
2,tomaa,13,5,no001,12.23
3,tomada,14,6,no001,12.23
3,tomada,14,7,no001,12.23
3,tomada,14,8,no001,12.23
3,tomada,14,9,no001,12.23
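
The expected output above can be reproduced with a minimal plain-Java sketch of the map join. It assumes the last field of each orders row is the customer ID, which is what the expected output implies. `MapJoinSketch`, `loadCustomers`, and `joinOrder` are hypothetical names: in a real Hadoop job the loading step would typically live in the mapper's `setup` method and the per-row join in `map`, with no reducer configured.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class MapJoinSketch {

    // Stands in for the mapper's setup step: load the small customers table into memory.
    static Map<String, String> loadCustomers(List<String> lines) {
        Map<String, String> byId = new HashMap<>();
        for (String line : lines) {
            byId.put(line.split(",")[0], line); // key = customer ID
        }
        return byId;
    }

    // Stands in for the map function: join one order row; returns null when no customer matches.
    static String joinOrder(String orderLine, Map<String, String> customers) {
        String[] f = orderLine.split(","); // order ID, order number, price, customer ID (assumed)
        String customer = customers.get(f[3]);
        if (customer == null) {
            return null; // inner join: drop orders without a customer
        }
        return customer + "," + f[0] + "," + f[1] + "," + f[2];
    }

    static List<String> run(List<String> customerLines, List<String> orderLines) {
        Map<String, String> customers = loadCustomers(customerLines);
        List<String> out = new ArrayList<>();
        for (String order : orderLines) {
            String joined = joinOrder(order, customers);
            if (joined != null) out.add(joined);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> customers = List.of("1,tom,12", "2,tomaa,13", "3,tomada,14", "4,tomas,15");
        List<String> orders = List.of("1,no001,12.23,1", "2,no001,12.23,1", "3,no001,12.23,2",
                "4,no001,12.23,2", "5,no001,12.23,2", "6,no001,12.23,3",
                "7,no001,12.23,3", "8,no001,12.23,3", "9,no001,12.23,3");
        run(customers, orders).forEach(System.out::println);
    }
}
```

Customer 4 (`tomas`) has no orders, so it does not appear in the result, which matches the inner-join semantics of the expected output.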
