Join operations in MapReduce

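1. Reduce-side join

Use the distributed cache API to join two datasets.

Advantage: simple to implement.
Disadvantage: a large volume of data is shuffled from the map side to the reduce side, which significantly reduces performance.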

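How the join works:
(1) The map side reads the input and emits the join key as the key and the content to be joined as the value. Each value carries an extra tag
    that identifies its source table: the tag is 1 if the value comes from table 1 and 2 if it comes from table 2.
    The map output is then sent to the reducers.
(2) The reduce side receives the shuffled map output as <key, values> and iterates over values, handling each value
    as follows:
    it checks the tag and, if the value comes from table 1, puts it into a list created for table 1;
    if the value comes from table 2, it goes into a list created for table 2. Finally, the reducer computes the Cartesian product of the two lists and writes it out; this is the join result.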
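
The two steps above can be sketched in plain Java without any Hadoop classes. This is only an illustration under stated assumptions: the class name ReduceSideJoinSketch, the "1:"/"2:" tag format, and the in-memory TreeMap standing in for the shuffle are all hypothetical choices, not part of the Hadoop API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of a reduce-side join; no Hadoop classes are used.
class ReduceSideJoinSketch {

    // "Map" step: prefix each value with a tag naming its source table.
    static void mapTagged(Map<String, List<String>> shuffled, int tableTag, String key, String value) {
        shuffled.computeIfAbsent(key, k -> new ArrayList<>()).add(tableTag + ":" + value);
    }

    // "Reduce" step for one key: split values by tag, then emit the Cartesian product.
    static List<String> reduce(String key, List<String> taggedValues) {
        List<String> fromTable1 = new ArrayList<>();
        List<String> fromTable2 = new ArrayList<>();
        for (String tagged : taggedValues) {
            String payload = tagged.substring(2); // drop the "1:" or "2:" tag
            if (tagged.startsWith("1:")) {
                fromTable1.add(payload);
            } else {
                fromTable2.add(payload);
            }
        }
        List<String> joined = new ArrayList<>();
        for (String left : fromTable1) {
            for (String right : fromTable2) {
                joined.add(key + "\t" + left + "\t" + right);
            }
        }
        return joined;
    }

    // Runs both steps over two in-memory tables of {key, value} rows.
    static List<String> join(String[][] table1, String[][] table2) {
        Map<String, List<String>> shuffled = new TreeMap<>(); // stands in for the shuffle
        for (String[] row : table1) mapTagged(shuffled, 1, row[0], row[1]);
        for (String[] row : table2) mapTagged(shuffled, 2, row[0], row[1]);
        List<String> output = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : shuffled.entrySet()) {
            output.addAll(reduce(entry.getKey(), entry.getValue()));
        }
        return output;
    }

    public static void main(String[] args) {
        String[][] cities = {{"1", "Beijing"}, {"2", "Guangzhou"}};
        String[][] teams = {{"1", "Beijing Rising"}, {"2", "Guangzhou Honda"}, {"1", "Back of Beijing"}};
        join(cities, teams).forEach(System.out::println);
    }
}
```

A key that appears in only one table produces an empty list on the other side, so the Cartesian product for that key is empty and nothing is written. Every tagged value still crosses the shuffle, which is the cost named in the disadvantage above.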

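2. Problem analysis

How to join in MapReduce depends on the size of the datasets and on how they are partitioned.
If one dataset is large and the other is small, the small one can be distributed to every node in the cluster.

The mapper stage reads the records of the large dataset.
The reducer reads the data available on its own node (that is, the small dataset) and performs the join.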

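3. Local directory settings for the cache

The following are the default values (local.cache.size is given in bytes; 10737418240 bytes is 10 GB):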
<property>
  <name>mapred.local.dir</name>
  <value>${hadoop.tmp.dir}/mapred/localdir/filecache</value>
</property>

<property>
  <name>local.cache.size</name>
  <value>10737418240</value> 
</property>

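4. Usage

The old DistributedCache class has been marked as deprecated. The code below uses the newer API available in Hadoop 2.2.0 and later; it was tested on Hadoop 2.7.2.
Job job = Job.getInstance(conf);
// Add a file on HDFS to the distributed cache
job.addCacheFile(new URI("hdfs://url:port/filename#symlink"));

The cached file can then be accessed through context in the map/reduce functions. Initialization is usually done by overriding the setup method.
Reading the localized copy with plain java.io: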
protected void setup(Context context) throws IOException, InterruptedException {
    super.setup(context);
    if (context.getCacheFiles() != null && context.getCacheFiles().length > 0) {
        // File name of the localized copy of the first cache file
        String path = context.getLocalCacheFiles()[0].getName();
        File itermOccurrenceMatrix = new File(path);
        FileReader fileReader = new FileReader(itermOccurrenceMatrix);
        BufferedReader bufferedReader = new BufferedReader(fileReader);
        String s;
        while ((s = bufferedReader.readLine()) != null) {
            // TODO: process each line as needed
        }
        bufferedReader.close();
        fileReader.close();
    }
}
Alternatively, read the file through the Hadoop FileSystem API and LineReader:
    Configuration config = context.getConfiguration();
    FileSystem fs = FileSystem.get(config);
    FSDataInputStream in = fs.open(new Path(path));
    Text line = new Text(" ");
    LineReader lineReader = new LineReader(in, config);
    int offset = 0;
    do {
        // Read one line of path into line (a Text); the return value is the number of bytes consumed
        offset = lineReader.readLine(line);
        if (offset > 0) {
            String[] tokens = line.toString().split(",");
            countryCodesTreeMap.put(tokens[0], tokens[1]);
        }
    } while (offset != 0);

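The path returned by getLocalCacheFiles is a path on the local file system.

The getLocalCacheFiles method is itself marked as deprecated, which leaves context.getCacheFiles.
Unlike getLocalCacheFiles, getCacheFiles returns the path of the file on HDFS.
Reading through that path means the program no longer reads the copy cached on each node; all tasks access the same file on HDFS instead.
To read the local copy without getLocalCacheFiles, open the file directly through its symlink.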
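
As a rough sketch of the symlink approach: assuming the file was registered with a URI ending in "#symlink" as in the addCacheFile call shown earlier, the framework is expected to expose it under that name in the task's working directory, so it can be opened as an ordinary local file. The class name CacheSymlinkReader and the helper load are hypothetical; the parsing assumes "key,value" lines. In a real job the body of load would be called from setup.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

// Reads a localized cache file through its symlink name using only java.nio.
class CacheSymlinkReader {

    // Loads "key,value" lines from a local file into a map; malformed lines are skipped.
    static Map<String, String> load(Path localFile) throws IOException {
        Map<String, String> lookup = new HashMap<>();
        try (BufferedReader reader = Files.newBufferedReader(localFile, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] tokens = line.split(",");
                if (tokens.length == 2) {
                    lookup.put(tokens[0], tokens[1]);
                }
            }
        }
        return lookup;
    }

    public static void main(String[] args) throws IOException {
        // "symlink" is the fragment after '#' in the cache URI (an assumption of this sketch);
        // a relative path is resolved against the current working directory.
        Path link = Paths.get(args.length > 0 ? args[0] : "symlink");
        if (!Files.exists(link)) {
            System.out.println("No file named " + link + " in the working directory");
            return;
        }
        System.out.println(load(link));
    }
}
```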

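5. Implementation steps

1) Methods for placing data in the cache
public void addCacheFile(URI uri);
public void addCacheArchive(URI uri); // these two methods add a single file or archive to the distributed cache
public void setCacheFiles(URI[] files);
public void setCacheArchives(URI[] archives); // these two methods set a whole group of files or archives in the distributed cache in one call
public void addFileToClassPath(Path file);
public void addArchiveToClassPath(Path archive); // these two methods add a file or archive to the classpath of the MapReduce tasks

Two kinds of objects can be stored in the cache: files and archives.
Files are placed on the task nodes as they are, while archives are unpacked first and their contents are then placed on the task nodes.

2) Next, use the API to read data from the cache inside a map or reduce task.
The files and archives added to the task classpath can be retrieved with getFileClassPaths() and getArchiveClassPaths().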

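1. Map-side join

The join is performed on the map side, which places a requirement on table sizes: one table must be small enough to be read into memory, while the other can be very large.
Compared with a reduce-side join, a map-side join does not transfer large amounts of data; records are joined and written out on the map side, which is considerably more efficient.

How the join works:
(1) Override the setup method of the Mapper class, which runs before the map method, and load the smaller table into a HashMap there.
(2) Override the map function to read the large table line by line and look up each record's join key in the HashMap; if the key is present,
     format the joined record and write it out directly.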
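
The two steps above amount to a hash join, which can be sketched in plain Java as follows. The class MapSideJoinSketch and its setup and map methods only mimic the shape of a Hadoop Mapper; they are hypothetical stand-ins, and the record layouts ("id,city" for the small table, "name,id" for the large one) are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java imitation of a map-side (hash) join; no Hadoop classes are used.
class MapSideJoinSketch {
    private final Map<String, String> smallTable = new HashMap<>();

    // Counterpart of setup(): load the small table ("id,city" lines) once, before any record is mapped.
    void setup(List<String> smallTableLines) {
        for (String line : smallTableLines) {
            String[] tokens = line.split(",");
            if (tokens.length == 2) {
                smallTable.put(tokens[0], tokens[1]);
            }
        }
    }

    // Counterpart of map(): one large-table record ("name,id") in, at most one joined record out.
    void map(String record, List<String> output) {
        String[] tokens = record.split(",");
        if (tokens.length != 2) {
            return; // skip malformed records
        }
        String city = smallTable.get(tokens[1]);
        if (city != null) {
            output.add(tokens[0] + "\t" + city);
        }
    }

    // Drives setup and map over in-memory inputs and returns the joined records.
    static List<String> run(List<String> small, List<String> large) {
        MapSideJoinSketch mapper = new MapSideJoinSketch();
        mapper.setup(small);
        List<String> output = new ArrayList<>();
        for (String record : large) {
            mapper.map(record, output);
        }
        return output;
    }

    public static void main(String[] args) {
        List<String> small = Arrays.asList("1,Beijing", "2,Guangzhou", "3,Shenzhen");
        List<String> large = Arrays.asList("Beijing Red Star,1", "Tencent,3", "Guangzhou Honda,2");
        run(small, large).forEach(System.out::println);
    }
}
```

Because each record is joined as soon as it is read, no reduce phase is needed for the join itself, and output order follows the order of the large input.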

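2. Joining on the map side

When one of the two datasets is very small, the small dataset can be placed in the cache.
When the job starts, the cached files are copied to the nodes that run the tasks.
 When a task starts, its setup method retrieves the cached files.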

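3. Conditions for a map-side join

Unlike a reduce-side join, a map-side join requires the participating datasets to meet the following conditions:

1. All inputs must be sorted by the join key.
  All input datasets must have the same number of partitions.
  All records with the same key must be in the same partition.
These conditions are met automatically when the map tasks process the output of other MapReduce jobs.
The CompositeInputFormat class performs map-side joins; the inputs and the join type are configured through properties.

2. If one of the datasets is small enough, side data distribution through the distributed cache can be used for the map-side join.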
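
To see why condition 1 asks for inputs sorted by the join key, consider a generic sort-merge join over one pair of matching partitions: with both sides sorted, a single forward pass is enough and neither side has to be held in memory. The sketch below is a textbook merge join written for illustration; it is not the code of CompositeInputFormat, and the class name SortedMergeJoinSketch and the {key, value} row layout are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Generic sort-merge join over two inputs that are already sorted by key.
class SortedMergeJoinSketch {

    // Each row is {key, value}; both lists must be sorted by key in String order.
    static List<String> mergeJoin(List<String[]> left, List<String[]> right) {
        List<String> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < left.size() && j < right.size()) {
            int cmp = left.get(i)[0].compareTo(right.get(j)[0]);
            if (cmp < 0) {
                i++; // left key is smaller: no partner on the right
            } else if (cmp > 0) {
                j++; // right key is smaller: no partner on the left
            } else {
                String key = left.get(i)[0];
                int rightStart = j;
                // Pair every left row carrying this key with every right row carrying it.
                while (i < left.size() && left.get(i)[0].equals(key)) {
                    j = rightStart;
                    while (j < right.size() && right.get(j)[0].equals(key)) {
                        out.add(key + "\t" + left.get(i)[1] + "\t" + right.get(j)[1]);
                        j++;
                    }
                    i++;
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> cities = Arrays.asList(
            new String[] {"1", "Beijing"}, new String[] {"2", "Guangzhou"});
        List<String[]> teams = Arrays.asList(
            new String[] {"1", "Beijing Rising"}, new String[] {"2", "Guangzhou Honda"});
        mergeJoin(cities, teams).forEach(System.out::println);
    }
}
```

The other two conditions (equal partition counts, and equal keys in the same partition) ensure that the rows a map task needs from each input sit in the pair of partitions it is merging.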

Input:
The num1 and num2 files:
xm@master:~/workspace$ hadoop fs -text /b/num1
1,Beijing
2,Guangzhou
3,Shenzhen
4,Xian
xm@master:~/workspace$ hadoop fs -text /b/num2
Beijing Red Star,1
Shenzhen Thunder,3
Guangzhou Honda,2
Beijing Rising,1
Guangzhou Development Bank,2
Tencent,3
Back of Beijing,1

Output:
Back of Beijing Beijing
Beijing Red Star    Beijing
Beijing Rising  Beijing
Guangzhou Development Bank  Guangzhou
Guangzhou Honda Guangzhou
Shenzhen Thunder    Shenzhen
Tencent Shenzhen

Implementation:
package mr_01;

import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.LineReader;

public class reduceJoin { 

    static String INPUT_PATH="hdfs://master:9000/b/num2";
    static String OUTPUT_PATH="hdfs://master:9000/output";

    // The mapper emits each record as it is, with no modification
    static class MyMapper extends Mapper<Object,Object,Text,Text>{  

            Text  output_key=new Text();   //fname
            Text  output_value=new Text(); //add-id

        protected void map(Object key, Object value, Context context) throws IOException, InterruptedException{

            String[] tokens=value.toString().split(",");   

            if(tokens!=null && tokens.length==2){
                output_key.set(tokens[0]);  
                output_value.set(tokens[1]);
                context.write(output_key,output_value);
            }
        }
    }

    static class MyReduce extends Reducer<Text,Text,Text,Text>{

        Text  output_key=new Text();
        Text  output_value=new Text();

        Map<String,String> addMap=new HashMap<String,String>();  //a(  addr-id,addr-name )

        // setup loads the data from the cached file into the HashMap
        protected void setup(Context context) throws java.io.IOException, java.lang.InterruptedException{

                URI  uri=context.getCacheFiles()[0];
                Path  path=new Path(uri);
                System.out.println("path="+uri.toString());
                FileSystem fs= path.getFileSystem(context.getConfiguration());

                LineReader  lineReader=new LineReader(fs.open(path));

                Text line=new Text();
                while(lineReader.readLine(line)>0){
                    String[]  tokens=line.toString().split(",");
                    if(tokens!=null && tokens.length==2)
                        addMap.put(tokens[0], tokens[1]);
                }
                // Release the underlying input stream once the table is loaded
                lineReader.close();

                System.out.println("addMap.size="+addMap.size());

         }

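        // reduce looks up the cached table (key --> value mapping)
         protected void reduce(Text key, Iterable<Text> values, Context context) 
                 throws IOException, InterruptedException{
             //id
             if(values==null) return;

            String addrName= addMap.get(values.iterator().next().toString());
            // Skip ids that are missing from the cached table; otherwise null would be passed to Text.set
            if(addrName==null) return;
            output_value.set(addrName);

             context.write(key,output_value);
         }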

    }

    public static void main(String[] args) throws Exception{

         Path outputpath=new Path(OUTPUT_PATH);
         Path cacheFile=new Path("hdfs://master:9000/b/num1");
         Configuration conf=new Configuration();

         FileSystem  fs=outputpath.getFileSystem(conf);
         if(fs.exists(outputpath)){
             fs.delete(outputpath, true);
         }

         Job  job=Job.getInstance(conf);
         job.setJarByClass(reduceJoin.class);   // lets the framework locate the jar containing these classes

         FileInputFormat.setInputPaths(job, INPUT_PATH);
         FileOutputFormat.setOutputPath(job, outputpath);

         URI  uri=cacheFile.toUri();
         job.setCacheFiles(new URI[]{uri});

         job.setMapperClass(MyMapper.class);   //map
         job.setReducerClass(MyReduce.class);   //reduce

         job.setOutputKeyClass(Text.class);
         job.setOutputValueClass(Text.class);

         job.waitForCompletion(true);
    }
}

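To join more than two tables, simply place the additional tables in the cache and look them up in the same way.

1. Avoid creating too many I/O-bound map tasks; the number of map tasks is determined by the input.
2. Job speedup comes mainly from the map tasks, which have a higher degree of parallelism.
3. A combiner improves efficiency not only by reducing the data transferred between map and reduce tasks but also by lowering the I/O load on the map side.
4. A custom partitioner can balance the load across reducers.
5. The distributed cache is very useful for small files, but avoid storing too many files or large files in it.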
