Doing the Join Efficiently on the Map Side

Let's look at how to do a join efficiently on the map side. A reduce-side join spends a lot of time in the shuffle phase; doing the join on the map side avoids that cost and saves a lot of resources, although it only fits certain scenarios.

Use case: one table is very small and the other is very large.
Usage: when submitting the job, put the small table file into the job's DistributedCache. In the mapper, read the small table back out of the DistributedCache, parse each record into a join key and value, and load them into an in-memory container such as a HashMap. Then scan the large table; for each record, check whether its join key can be found in memory, and if so, emit the joined result directly (a plain-Java sketch of this lookup follows the sample data below).

The simulated test data is as follows:

Small table, HDFS path: hdfs://192.168.75.130:9000/root/dist/a.txt
1,三劫散仙,13575468248
2,凤舞九天,18965235874
3,忙忙碌碌,15986854789
4,少林寺方丈,15698745862

Large table, HDFS path: hdfs://192.168.75.130:9000/root/inputjoindb/b.txt
3,A,99,2013-03-05
1,B,89,2013-02-05
2,C,69,2013-03-09
3,D,56,2013-06-07

The implementation below uses Hadoop 1.2; the source code is as follows:

package com.mapjoin;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

/***
 * Map-side replicated join.
 * Hadoop discussion QQ group: 37693216
 * @author qindongliang
 ***/
public class MapJoin {

    /***
     * In the mapper's setup() method, read the cached file
     * into a HashMap and join against it in map().
     ***/
    public static class MMppe extends Mapper<Object, Text, Text, Text> {

        /**
         * Holds the small-table data.
         * Note that the small table's keys are unique;
         * it plays the role of a dimension (lookup) table here.
         **/
        private HashMap<String, String> map = new HashMap<String, String>();

        /** Output key */
        private Text outputKey = new Text();

        /** Output value */
        private Text outputValue = new Text();

        // One input line of the large table
        String mapInputStr = null;
        // The large table's columns, split from the input line
        String mapInputStrs[] = null;
        // The small table's columns (everything except the join key), concatenated
        String mapSecondPart = null;

        /**
         * Mapper initialization: load the small table into the HashMap.
         * Format: key = join key, value = the remaining columns concatenated.
         **/
        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            BufferedReader br = null;
            String temp;
            // Fetch the files shared through the DistributedCache
            Path path[] = DistributedCache.getLocalCacheFiles(context.getConfiguration());

            for (Path p : path) {
                if (p.getName().endsWith("a.txt")) {
                    br = new BufferedReader(new FileReader(p.toString()));
                    while ((temp = br.readLine()) != null) {
                        String ss[] = temp.split(",");
                        map.put(ss[0], ss[1] + "\t" + ss[2]); // put into the hash table
                    }
                    br.close();
                }
            }
        }

        /**
         * In map(), read each record of the large table and look its
         * join key up in the in-memory map built from the small table.
         ***/
        @Override
        protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            // Skip empty lines
            if (value == null || value.toString().equals("")) {
                return;
            }

            this.mapInputStr = value.toString();              // the input line
            this.mapInputStrs = this.mapInputStr.split(",");  // split into columns
            this.mapSecondPart = map.get(mapInputStrs[0]);    // look up the small-table part

            // Emit only if the join key exists in the small table
            if (this.mapSecondPart != null) {
                this.outputKey.set(mapInputStrs[0]); // output key = join key
                // The output value concatenates the columns of both tables
                this.outputValue.set(this.mapSecondPart + "\t" + mapInputStrs[1] + "\t" + mapInputStrs[2] + "\t" + mapInputStrs[3]);
                context.write(this.outputKey, this.outputValue);
            }
        }

        // Driver
        public static void main(String[] args) throws Exception {

            JobConf conf = new JobConf(MMppe.class);

            // The small table to share
            String bpath = "hdfs://192.168.75.130:9000/root/dist/a.txt";
            // Add it to the DistributedCache
            DistributedCache.addCacheFile(new URI(bpath), conf);

            conf.set("mapred.job.tracker", "192.168.75.130:9001");
            conf.setJar("tt.jar");

            Job job = new Job(conf, "2222222");
            job.setJarByClass(MapJoin.class);
            System.out.println("Mode:  " + conf.get("mapred.job.tracker"));

            // Set the custom Mapper; this is a map-only job
            job.setMapperClass(MMppe.class);
            job.setNumReduceTasks(0);

            // Map-side output types
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(Text.class);

            // Job output types
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);

            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);

            FileSystem fs = FileSystem.get(conf);
            Path op = new Path("hdfs://192.168.75.130:9000/root/outputjoindbnew3/");

            if (fs.exists(op)) {
                fs.delete(op, true);
                System.out.println("The output path already exists and has been deleted!");
            }

            FileInputFormat.setInputPaths(job, new Path("hdfs://192.168.75.130:9000/root/inputjoindb/b.txt"));
            FileOutputFormat.setOutputPath(job, op);

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }
}


Run log:

Mode:  192.168.75.130:9001
The output path already exists and has been deleted!
WARN - JobClient.copyAndConfigureFiles(746) | Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
INFO - FileInputFormat.listStatus(237) | Total input paths to process : 1
INFO - NativeCodeLoader.<clinit>(43) | Loaded the native-hadoop library
WARN - LoadSnappy.<clinit>(46) | Snappy native library not loaded
INFO - JobClient.monitorAndPrintJob(1380) | Running job: job_201404250130_0011
INFO - JobClient.monitorAndPrintJob(1393) |  map 0% reduce 0%
INFO - JobClient.monitorAndPrintJob(1393) |  map 100% reduce 0%
INFO - JobClient.monitorAndPrintJob(1448) | Job complete: job_201404250130_0011
INFO - Counters.log(585) | Counters: 19
INFO - Counters.log(587) |   Job Counters 
INFO - Counters.log(589) |     SLOTS_MILLIS_MAPS=9878
INFO - Counters.log(589) |     Total time spent by all reduces waiting after reserving slots (ms)=0
INFO - Counters.log(589) |     Total time spent by all maps waiting after reserving slots (ms)=0
INFO - Counters.log(589) |     Launched map tasks=1
INFO - Counters.log(589) |     Data-local map tasks=1
INFO - Counters.log(589) |     SLOTS_MILLIS_REDUCES=0
INFO - Counters.log(587) |   File Output Format Counters 
INFO - Counters.log(589) |     Bytes Written=172
INFO - Counters.log(587) |   FileSystemCounters
INFO - Counters.log(589) |     HDFS_BYTES_READ=188
INFO - Counters.log(589) |     FILE_BYTES_WRITTEN=55746
INFO - Counters.log(589) |     HDFS_BYTES_WRITTEN=172
INFO - Counters.log(587) |   File Input Format Counters 
INFO - Counters.log(589) |     Bytes Read=74
INFO - Counters.log(587) |   Map-Reduce Framework
INFO - Counters.log(589) |     Map input records=4
INFO - Counters.log(589) |     Physical memory (bytes) snapshot=78663680
INFO - Counters.log(589) |     Spilled Records=0
INFO - Counters.log(589) |     CPU time spent (ms)=230
INFO - Counters.log(589) |     Total committed heap usage (bytes)=15728640
INFO - Counters.log(589) |     Virtual memory (bytes) snapshot=725975040
INFO - Counters.log(589) |     Map output records=4
INFO - Counters.log(589) |     SPLIT_RAW_BYTES=114



The results are as follows:

3	忙忙碌碌	15986854789	A	99	2013-03-05
1	三劫散仙	13575468248	B	89	2013-02-05
2	凤舞九天	18965235874	C	69	2013-03-09
3	忙忙碌碌	15986854789	D	56	2013-06-07


As the output shows, the join result is correct. This approach is very efficient, but it generally applies only when one of the two tables is very large and the other is very small. How small counts as small? Roughly: if the table fits comfortably in memory without noticeably affecting the rest of the task, it can be joined on the map side with the DistributedCache replicated-join technique shown here.
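
For readers on Hadoop 2.x or later: the org.apache.hadoop.filecache.DistributedCache class used above is deprecated there, and the same pattern is usually written with Job.addCacheFile() on the driver side and context.getCacheFiles() in the mapper. The following is an untested adaptation of the listing above, not part of the original article; the class name MapJoin2 is illustrative, and the assumption that the cached file is symlinked into the task's working directory under its own file name reflects typical YARN behavior and should be verified on your cluster.

package com.mapjoin;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Untested sketch: the same map-side join written against the Hadoop 2.x+ API. */
public class MapJoin2 extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> joinMap = new HashMap<String, String>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // getCacheFiles() replaces DistributedCache.getLocalCacheFiles()
        URI[] cacheFiles = context.getCacheFiles();
        if (cacheFiles == null) {
            return;
        }
        for (URI uri : cacheFiles) {
            if (uri.getPath().endsWith("a.txt")) {
                // Assumption: the cached file is symlinked into the task's working
                // directory under its own file name (typical YARN behavior).
                String localName = new Path(uri.getPath()).getName();
                BufferedReader br = new BufferedReader(new FileReader(localName));
                String line;
                while ((line = br.readLine()) != null) {
                    String[] cols = line.split(",");
                    joinMap.put(cols[0], cols[1] + "\t" + cols[2]);
                }
                br.close();
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String[] cols = value.toString().split(",");
        String matched = joinMap.get(cols[0]);
        if (matched != null) {
            context.write(new Text(cols[0]), new Text(matched + "\t" + cols[1] + "\t" + cols[2] + "\t" + cols[3]));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "map-side join");
        job.setJarByClass(MapJoin2.class);
        // addCacheFile() replaces DistributedCache.addCacheFile()
        job.addCacheFile(new URI("hdfs://192.168.75.130:9000/root/dist/a.txt"));
        job.setMapperClass(MapJoin2.class);
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(job, new Path("hdfs://192.168.75.130:9000/root/inputjoindb/b.txt"));
        FileOutputFormat.setOutputPath(job, new Path("hdfs://192.168.75.130:9000/root/outputjoindbnew3/"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}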


Reposted from weitao1026.iteye.com/blog/2267052