Hadoop--MapReduce 5--Inverted Index

In everyday search, you type a keyword and get back a list of documents related to it. If we view the document name as the key and the document content as the value, ordinary retrieval looks up the value by its key. If instead we break the document contents into pieces, extract the keywords, and process all the files this way, we obtain a table keyed by keyword whose value is the list of documents containing it; this is exactly an inverted index.

Requirement: given a large number of text documents, such as the following:

a.txt


hello tom

hello jim

hello kitty

hello rose

b.txt


hello jerry

hello jim

hello kitty

hello jack

c.txt


hello jerry

hello java

hello c++

hello c++

The final result should look like this: for each word in the documents, the key is the word and the value is the list of files it appears in, together with its count in each file.

hello  a.txt-->4  b.txt-->4  c.txt-->4

java   c.txt-->1

jerry  b.txt-->1  c.txt-->1

Approach: this can be done with two chained MapReduce jobs.

Step one:

The map reads each line of a document and emits key = word-filename, value = 1.

The reduce aggregates the values of each key and outputs, for every word-filename pair, the number of times that word occurs in that file.

Step two:


The map reads the step-one reducer output and emits key = word, value = filename-->count.

The reduce aggregates by key and concatenates the filename-->count pairs into a single string, producing: word  filename1-->count  filename2-->count. A concrete trace of this data flow is shown below.
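Tracing the word hello through both jobs (counts taken from the sample files above) makes the flow concrete:

step-one map emits    : <hello-a.txt, 1>  <hello-a.txt, 1>  <hello-a.txt, 1>  <hello-a.txt, 1>  (plus the b.txt and c.txt records)
step-one reduce emits : <hello-a.txt, 4>  <hello-b.txt, 4>  <hello-c.txt, 4>
step-two map emits    : <hello, a.txt-->4>  <hello, b.txt-->4>  <hello, c.txt-->4>
step-two reduce emits : hello  a.txt-->4  b.txt-->4  c.txt-->4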

The key implementation detail is how the map determines, for each line it processes, which file that line belongs to.

The filename can be obtained from the InputSplit held by the map's context, since each split belongs to exactly one file:

FileSplit split = (FileSplit) context.getInputSplit();
String fileName = split.getPath().getName();

The Mapper's setup method is the right place to do this: it runs once per map task (i.e., once per split), before any call to map, so the filename of the split being processed can be fetched there once instead of on every line.

The complete implementation follows:

Step one:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IndexStepOne {

	public static class IndexStepOneMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
		
		String fileName = null;
		
		// runs once per map task, before any map() call: record which file this split comes from
		@Override
		protected void setup(Mapper<LongWritable, Text, Text, IntWritable>.Context context)
				throws IOException, InterruptedException {
			FileSplit inputSplit = (FileSplit) context.getInputSplit();
			fileName = inputSplit.getPath().getName();
		}

		// emits <word-filename, 1>, e.g. <hello-a.txt, 1>
		@Override
		protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
			String[] words = value.toString().split(" ");
			for (String w : words) {
				// emit "word-filename" as the key and 1 as the value
				context.write(new Text(w + "-" + fileName), new IntWritable(1));
			}
		}
	}

	public static class IndexStepOneReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

		// sums the 1s for each word-filename key
		@Override
		protected void reduce(Text key, Iterable<IntWritable> values,
				Reducer<Text, IntWritable, Text, IntWritable>.Context context)
				throws IOException, InterruptedException {
			int count = 0;
			for (IntWritable value : values) {
				count += value.get();
			}
			context.write(key, new IntWritable(count));
		}
	}
	
	
	
	public static void main(String[] args) throws Exception{
		
		Configuration conf = new Configuration(); 
		
		Job job = Job.getInstance(conf);
		job.setJarByClass(IndexStepOne.class);
		job.setMapperClass(IndexStepOneMapper.class);
		job.setReducerClass(IndexStepOneReducer.class);
		job.setNumReduceTasks(1);
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(IntWritable.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);

		FileInputFormat.setInputPaths(job, new Path("F:\\hadoop-2.8.1\\data\\index\\input"));
		FileOutputFormat.setOutputPath(job, new Path("F:\\hadoop-2.8.1\\data\\index\\out1"));

		job.waitForCompletion(true);	
	}
}
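A possible optimization, not in the original post: the step-one reducer is a pure sum, which is associative and commutative, so the same class can also be registered as a combiner to cut shuffle traffic:

		// assumed tweak: pre-aggregate counts on the map side before the shuffle
		job.setCombinerClass(IndexStepOneReducer.class);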

Step-one output:

c++-c.txt	2
hello-a.txt	4
hello-b.txt	4
hello-c.txt	4
jack-b.txt	1
java-c.txt	1
jerry-b.txt	1
jerry-c.txt	1
jim-a.txt	1
jim-b.txt	1
kitty-a.txt	1
kitty-b.txt	1
rose-a.txt	1
tom-a.txt	1

Step two:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IndexStepTwo {

	public static class IndexStepTwoMapper extends Mapper<LongWritable, Text, Text, Text> {

		// input lines look like "hello-a.txt\t4"; the word becomes the key,
		// and "a.txt\t4" is rewritten to "a.txt-->4" as the value
		@Override
		protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
			String[] split = value.toString().split("-");
			context.write(new Text(split[0]), new Text(split[1].replaceAll("\t", "-->")));	
		}
	}

	public static class IndexStepTwoReducer extends Reducer<Text, Text, Text, Text> {
		// one reduce group looks like: <hello, a.txt-->4> <hello, b.txt-->4> <hello, c.txt-->4>
		@Override
		protected void reduce(Text key, Iterable<Text> values,Context context)
				throws IOException, InterruptedException {
			StringBuilder sb = new StringBuilder();	
			for (Text value : values) {
				sb.append(value.toString()).append("\t");
			}
			context.write(key, new Text(sb.toString()));
		}
	}
	
	public static void main(String[] args) throws Exception{
		
		Configuration conf = new Configuration();
		
		Job job = Job.getInstance(conf);
		job.setJarByClass(IndexStepTwo.class);
		job.setMapperClass(IndexStepTwoMapper.class);
		job.setReducerClass(IndexStepTwoReducer.class);
		job.setNumReduceTasks(1);
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(Text.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(Text.class);

		FileInputFormat.setInputPaths(job, new Path("F:\\hadoop-2.8.1\\data\\index\\out1"));
		FileOutputFormat.setOutputPath(job, new Path("F:\\hadoop-2.8.1\\data\\index\\out2"));

		job.waitForCompletion(true);
	}
}
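The original post runs the two mains separately. As a minimal sketch of chaining them, a hypothetical driver (InvertedIndexDriver, not in the original; it assumes the two classes above are on the classpath and reuses the same local paths) can make step two start only after step one reports success:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class InvertedIndexDriver {

	public static void main(String[] args) throws Exception {
		Configuration conf = new Configuration();

		// step one: count each word per file
		Job stepOne = Job.getInstance(conf, "index-step-one");
		stepOne.setJarByClass(IndexStepOne.class);
		stepOne.setMapperClass(IndexStepOne.IndexStepOneMapper.class);
		stepOne.setReducerClass(IndexStepOne.IndexStepOneReducer.class);
		stepOne.setOutputKeyClass(Text.class);
		stepOne.setOutputValueClass(IntWritable.class);
		FileInputFormat.setInputPaths(stepOne, new Path("F:\\hadoop-2.8.1\\data\\index\\input"));
		FileOutputFormat.setOutputPath(stepOne, new Path("F:\\hadoop-2.8.1\\data\\index\\out1"));

		// abort if step one failed; its output is step two's input
		if (!stepOne.waitForCompletion(true)) {
			System.exit(1);
		}

		// step two: regroup by word and concatenate filename-->count pairs
		Job stepTwo = Job.getInstance(conf, "index-step-two");
		stepTwo.setJarByClass(IndexStepTwo.class);
		stepTwo.setMapperClass(IndexStepTwo.IndexStepTwoMapper.class);
		stepTwo.setReducerClass(IndexStepTwo.IndexStepTwoReducer.class);
		stepTwo.setOutputKeyClass(Text.class);
		stepTwo.setOutputValueClass(Text.class);
		FileInputFormat.setInputPaths(stepTwo, new Path("F:\\hadoop-2.8.1\\data\\index\\out1"));
		FileOutputFormat.setOutputPath(stepTwo, new Path("F:\\hadoop-2.8.1\\data\\index\\out2"));

		System.exit(stepTwo.waitForCompletion(true) ? 0 : 1);
	}
}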

Step-two output:

c++	c.txt-->2	
hello	a.txt-->4	b.txt-->4	c.txt-->4	
jack	b.txt-->1	
java	c.txt-->1	
jerry	b.txt-->1	c.txt-->1	
jim	a.txt-->1	b.txt-->1	
kitty	a.txt-->1	b.txt-->1	
rose	a.txt-->1	
tom	a.txt-->1	


Reposted from blog.csdn.net/u014106644/article/details/88131681