Hadoop--MapReduce5--Inverted Index

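In everyday search, you enter a keyword and get back a list of documents related to it. If the document name is treated as the key and the document content as the value, an ordinary lookup retrieves the value by its key. If we instead break the content of every document into pieces, extract the keywords, and process all the files, we obtain a list in which each keyword is the key and the value is the list of documents containing it. That is an inverted index.

Requirement: given a large number of text documents such as the following: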
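As a minimal single-machine sketch of this idea (plain Java, no Hadoop), the snippet below turns a forward index (document name -> content) into an inverted index (word -> document names). The class name `InvertDemo` and the method `invert` are hypothetical names chosen for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class InvertDemo {

    // Turns a forward index (document name -> content) into an
    // inverted index (word -> sorted set of document names).
    static Map<String, Set<String>> invert(Map<String, String> docs) {
        Map<String, Set<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (String word : doc.getValue().trim().split("\\s+")) {
                if (word.isEmpty()) {
                    continue; // skip the empty token produced by blank content
                }
                index.computeIfAbsent(word, k -> new TreeSet<>()).add(doc.getKey());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("a.txt", "hello tom hello jim");
        docs.put("b.txt", "hello jerry hello jim");
        // prints {hello=[a.txt, b.txt], jerry=[b.txt], jim=[a.txt, b.txt], tom=[a.txt]}
        System.out.println(invert(docs));
    }
}
```

This version only records which documents contain a word; the per-file occurrence counts come from the MapReduce jobs described in the rest of the article.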

a.txt


hello tom

hello jim

hello kitty

hello rose

b.txt


hello jerry

hello jim

hello kitty

hello jack

c.txt


hello jerry

hello java

hello c++

hello c++

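The final result should look like the excerpt below: an index whose key is each word in the documents and whose value is the list of files the word appears in, together with the number of occurrences in each file.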

hello  a.txt-->4  b.txt-->4  c.txt-->4

java   c.txt-->1

jerry  b.txt-->1  c.txt-->1

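Approach: use two MapReduce jobs run one after the other.

Step 1:

map reads each line of a document and emits key = word-fileName, value = 1.

reduce aggregates the records that share a key and emits how many times each word occurs in each file, that is, key = word-fileName, value = the count in that file.

Step 2: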

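map reads the output of the step 1 reduce and emits key = word, value = fileName-->number of times the word occurs in that file.

reduce aggregates the records that share a key and concatenates the file names and counts into one string: word  fileName1-->count  fileName2-->count.

The key to the implementation is determining, while map processes a line, which file that line belongs to.

The InputSplit available from the map Context gives the name of the file the current line belongs to: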
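The two steps above can be sketched on a single machine in plain Java before moving to Hadoop. This is only an illustration of the data flow under stated assumptions, not the MapReduce implementation: the class `TwoStepSketch` and the methods `stepOne` and `stepTwo` are hypothetical names, sorted maps stand in for the shuffle phase, and file names are assumed to contain no `-`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;
import java.util.TreeMap;

class TwoStepSketch {

    // Step 1: count occurrences per "word-fileName" key
    // (the combined effect of map and reduce in the first job).
    static Map<String, Integer> stepOne(Map<String, List<String>> files) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<String>> file : files.entrySet()) {
            for (String line : file.getValue()) {
                for (String word : line.split(" ")) {
                    counts.merge(word + "-" + file.getKey(), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Step 2: regroup by word and join the "fileName-->count" entries with tabs.
    static Map<String, String> stepTwo(Map<String, Integer> counts) {
        Map<String, StringJoiner> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int cut = e.getKey().lastIndexOf('-'); // assumes no '-' in file names
            String word = e.getKey().substring(0, cut);
            String file = e.getKey().substring(cut + 1);
            grouped.computeIfAbsent(word, k -> new StringJoiner("\t"))
                   .add(file + "-->" + e.getValue());
        }
        Map<String, String> result = new TreeMap<>();
        for (Map.Entry<String, StringJoiner> e : grouped.entrySet()) {
            result.put(e.getKey(), e.getValue().toString());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> files = new TreeMap<>();
        files.put("a.txt", Arrays.asList("hello tom", "hello jim", "hello kitty", "hello rose"));
        files.put("b.txt", Arrays.asList("hello jerry", "hello jim", "hello kitty", "hello jack"));
        files.put("c.txt", Arrays.asList("hello jerry", "hello java", "hello c++", "hello c++"));
        stepTwo(stepOne(files))
            .forEach((word, postings) -> System.out.println(word + "\t" + postings));
    }
}
```

Splitting the work this way keeps each step a simple "group by key and aggregate" operation, which is the shape a MapReduce job handles naturally.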

FileSplit split = (FileSplit) context.getInputSplit();
String fileName = split.getPath().getName();

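This lookup can be placed in the Mapper's setup method. setup runs once for each input split, before any call to map, so the name of the file that the current map task's split belongs to is fetched only once.

The full implementation is as follows:

Step 1: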

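import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IndexStepOne {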

	public static class IndexStepOneMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
		
		String fileName = null;
		
		@Override
		protected void setup(Mapper<LongWritable, Text, Text, IntWritable>.Context context)
				throws IOException, InterruptedException {
			FileSplit inputSplit = (FileSplit) context.getInputSplit();
			fileName = inputSplit.getPath().getName();
		}

		// Emits <hello-fileName, 1>
		@Override
		protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
			String[] words = value.toString().split(" ");
			for (String w : words) {
				// Emit "word-fileName" as the key and 1 as the value
				context.write(new Text(w + "-" + fileName), new IntWritable(1));
			}
		}
	}

	public static class IndexStepOneReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

		@Override
		protected void reduce(Text key, Iterable<IntWritable> values,
				Reducer<Text, IntWritable, Text, IntWritable>.Context context)
				throws IOException, InterruptedException {
			int count = 0;
			for (IntWritable value : values) {
				count += value.get();
			}
			context.write(key, new IntWritable(count));
		}
	}
	
	
	
	public static void main(String[] args) throws Exception{
		
		Configuration conf = new Configuration(); 
		
		Job job = Job.getInstance(conf);
		job.setJarByClass(IndexStepOne.class);
		job.setMapperClass(IndexStepOneMapper.class);
		job.setReducerClass(IndexStepOneReducer.class);
		job.setNumReduceTasks(1);
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(IntWritable.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);

		FileInputFormat.setInputPaths(job, new Path("F:\\hadoop-2.8.1\\data\\index\\input"));
		FileOutputFormat.setOutputPath(job, new Path("F:\\hadoop-2.8.1\\data\\index\\out1"));

		job.waitForCompletion(true);	
	}
}

Output of step 1:

c++-c.txt	2
hello-a.txt	4
hello-b.txt	4
hello-c.txt	4
jack-b.txt	1
java-c.txt	1
jerry-b.txt	1
jerry-c.txt	1
jim-a.txt	1
jim-b.txt	1
kitty-a.txt	1
kitty-b.txt	1
rose-a.txt	1
tom-a.txt	1

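Step 2:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IndexStepTwo {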

	public static class IndexStepTwoMapper extends Mapper<LongWritable, Text, Text, Text> {

		@Override
		protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
			String[] split = value.toString().split("-");
			context.write(new Text(split[0]), new Text(split[1].replaceAll("\t", "-->")));	
		}
	}

	public static class IndexStepTwoReducer extends Reducer<Text, Text, Text, Text> {
		// One group of data: <hello,a.txt-->4> <hello,b.txt-->4> <hello,c.txt-->4>
		@Override
		protected void reduce(Text key, Iterable<Text> values,Context context)
				throws IOException, InterruptedException {
			StringBuilder sb = new StringBuilder();	
			for (Text value : values) {
				sb.append(value.toString()).append("\t");
			}
			context.write(key, new Text(sb.toString()));
		}
	}
	
	public static void main(String[] args) throws Exception{
		
		Configuration conf = new Configuration();
		
		Job job = Job.getInstance(conf);
		job.setJarByClass(IndexStepTwo.class);
		job.setMapperClass(IndexStepTwoMapper.class);
		job.setReducerClass(IndexStepTwoReducer.class);
		job.setNumReduceTasks(1);
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(Text.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(Text.class);

		FileInputFormat.setInputPaths(job, new Path("F:\\hadoop-2.8.1\\data\\index\\out1"));
		FileOutputFormat.setOutputPath(job, new Path("F:\\hadoop-2.8.1\\data\\index\\out2"));

		job.waitForCompletion(true);
	}
}

Output of step 2:

c++	c.txt-->2	
hello	a.txt-->4	b.txt-->4	c.txt-->4	
jack	b.txt-->1	
java	c.txt-->1	
jerry	b.txt-->1	c.txt-->1	
jim	a.txt-->1	b.txt-->1	
kitty	a.txt-->1	b.txt-->1	
rose	a.txt-->1	
tom	a.txt-->1
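One fragile spot in the step 2 mapper is `split("-")`: it works for this sample data, but a word that itself contains a hyphen would be cut in the wrong place. A possible, more defensive way to parse a step 1 output line is sketched below in plain Java. `StepOneLineParser` and `parse` are hypothetical names, and the sketch assumes that file names contain no `-` and that the count follows the last tab on the line.

```java
import java.util.Arrays;

class StepOneLineParser {

    // Parses "word-fileName<TAB>count" into {word, fileName, count}.
    // The last '-' before the tab is taken as the separator, so words
    // containing '-' survive; file names containing '-' would not.
    static String[] parse(String line) {
        int tab = line.lastIndexOf('\t');
        String key = line.substring(0, tab);
        String count = line.substring(tab + 1);
        int dash = key.lastIndexOf('-');
        return new String[] { key.substring(0, dash), key.substring(dash + 1), count };
    }

    public static void main(String[] args) {
        // prints [c++, c.txt, 2]
        System.out.println(Arrays.toString(parse("c++-c.txt\t2")));
        // prints [well-known, a.txt, 3]
        System.out.println(Arrays.toString(parse("well-known-a.txt\t3")));
    }
}
```

In the mapper, the three parts could then be written out as key = word and value = fileName + "-->" + count. Choosing a separator in step 1 that cannot occur in either words or file names would remove the ambiguity altogether.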
