Hadoop Review

Note: Alt+/ (content assist) and Ctrl+1 (quick fix) are very useful shortcuts in Eclipse.

1. Hadoop programming template (for operating on files in HDFS)

//Create the Configuration object; this loads Hadoop's configuration files
Configuration conf = new Configuration();
//Set parameters on conf programmatically (alternatively, change them in the configuration files)
conf.set("dfs.replication", "2");
conf.set("dfs.blocksize", "64m");
//Get a FileSystem client for the cluster, acting as user "root"
FileSystem fs = FileSystem.get(new URI("hdfs://hdp-001:9000"), conf, "root");

Every client that connects to HDFS needs to perform the setup above.

Below are some commonly used Hadoop FileSystem APIs:
//Upload a local file to HDFS
fs.copyFromLocalFile(new Path("C:\\Users\\liu-xiao-ge\\Desktop\\问题.txt"), new Path("/"));

//Download a file from HDFS to the local file system
fs.copyToLocalFile(new Path("/问题.txt"), new Path("C:\\Users\\liu-xiao-ge\\Desktop\\haha.txt"));

//Move/rename a file within HDFS
fs.rename(new Path("/问题.txt"), new Path("/liu/liu.txt"));

fs.close();


2. Repost: usage of the common Hadoop FileSystem APIs

https://blog.csdn.net/pzsoftchen/article/details/17632173
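
As a quick refresher on the APIs covered in the linked post, here is a minimal sketch (a hypothetical example, reusing the hdp-001 address and the root user from section 1) of a few other frequently used FileSystem calls:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class FileSystemDemo {
	public static void main(String[] args) throws Exception {
		Configuration conf = new Configuration();
		FileSystem fs = FileSystem.get(new URI("hdfs://hdp-001:9000"), conf, "root");

		// create a directory
		fs.mkdirs(new Path("/demo"));

		// check whether a path exists
		System.out.println(fs.exists(new Path("/demo")));

		// list files recursively (files only, not directories)
		RemoteIterator<LocatedFileStatus> it = fs.listFiles(new Path("/"), true);
		while (it.hasNext()) {
			LocatedFileStatus f = it.next();
			System.out.println(f.getPath() + "  " + f.getLen() + " bytes");
		}

		// list the direct children of a directory (files and subdirectories)
		for (FileStatus s : fs.listStatus(new Path("/"))) {
			System.out.println((s.isDirectory() ? "d " : "- ") + s.getPath());
		}

		// delete a path recursively
		fs.delete(new Path("/demo"), true);

		fs.close();
	}
}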

3. MapReduce programming

(1) You need to provide implementations of the Mapper class and the Reducer class
① The Mapper implementation class: WordcountMapper

KEYIN: the type of the key the map task reads, i.e. the byte offset at which the line starts (Long)
VALUEIN: the type of the value the map task reads, i.e. the content of one line (String)
KEYOUT: the key type of the kv pairs returned by the user-defined map method; in the wordcount logic we return the word (String)
VALUEOUT: the value type of the kv pairs returned by the user-defined map method; in the wordcount logic we return a count (Integer)
However, the data produced by map has to be shipped to reduce, which requires serialization and deserialization. The JDK's native serialization produces rather bloated output, which would make data transfer during a MapReduce job inefficient.
Hadoop therefore ships its own serialization mechanism, and every data type transferred inside MapReduce must implement Hadoop's serialization interface.
For the common JDK types Long, String, Integer, Float, etc., Hadoop provides wrapper types that already implement its serialization interface: LongWritable, Text, IntWritable, FloatWritable.

Code:

public class WordcountMapper extends Mapper<LongWritable, Text, Text, IntWritable>{
	
	@Override
	protected void map(LongWritable key, Text value, Context context)
			throws IOException, InterruptedException {

		// split the line into words
		String line = value.toString();
		String[] words = line.split(" ");
		// emit (word, 1) for every word on the line
		for(String word:words){
			context.write(new Text(word), new IntWritable(1));
		}
	}
}

② The Reducer implementation class: WordcountReducer
Code:

public class WordcountReducer extends Reducer<Text, IntWritable, Text, IntWritable>{

	@Override
	protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {

		// add up all the 1s emitted by the mappers for this word
		int count = 0;

		Iterator<IntWritable> iterator = values.iterator();
		while(iterator.hasNext()){

			IntWritable value = iterator.next();
			count += value.get();
		}
		context.write(key, new IntWritable(count));

	}
}
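
For example, if one input line is "hello world hello", the map phase emits (hello,1), (world,1) and (hello,1); after the shuffle, the reduce call for the key "hello" receives the value list [1, 1] and writes (hello, 2), while the call for "world" writes (world, 1).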

③ You need a class that submits the Job (this class has the main method and is the main class)
Note that a job can be submitted in the following ways:
1) Submitting from Windows to YARN
Code:

public static void main(String[] args) throws Exception {
		
		// Set a JVM system property so that the job object picks up the user identity to use when accessing HDFS
		System.setProperty("HADOOP_USER_NAME", "root");
		
		Configuration conf = new Configuration();
		// 1. Set the default file system the job accesses at runtime
		conf.set("fs.defaultFS", "hdfs://hdp-01:9000");
		// 2. Set where the job is submitted to run
		conf.set("mapreduce.framework.name", "yarn");
		conf.set("yarn.resourcemanager.hostname", "hdp-01");
		// 3. When the submission client runs on Windows, this cross-platform parameter must be added
		conf.set("mapreduce.app-submission.cross-platform","true");
		
		Job job = Job.getInstance(conf);
		
		// 1. Parameter: location of the jar that contains the job classes
		job.setJar("D:\\appdev\\hadoop-16\\mapreduce24\\target\\mapreduce24-0.0.1-SNAPSHOT.jar");
		//job.setJarByClass(JobSubmitter.class);
		
		// 2. Parameters: the Mapper and Reducer implementation classes this job calls
		job.setMapperClass(WordcountMapper.class);
		job.setReducerClass(WordcountReducer.class);
		
		// 3. Parameters: the key/value types produced by the Mapper and by the Reducer
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(IntWritable.class);
		
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);
		
		// delete the output directory first if it already exists
		Path output = new Path("/wordcount/output");
		FileSystem fs = FileSystem.get(new URI("hdfs://hdp-01:9000"),conf,"root");
		if(fs.exists(output)){
			fs.delete(output, true);
		}
		
		// 4. Parameters: input path of the data set to process and output path of the final result
		FileInputFormat.setInputPaths(job, new Path("/wordcount/input"));
		FileOutputFormat.setOutputPath(job, output);  // note: the output path must not exist yet
		
		// 5. Parameter: the number of reduce tasks to start
		job.setNumReduceTasks(2);
		
		// 6. Submit the job to YARN and wait for completion
		boolean res = job.waitForCompletion(true);
		
		System.exit(res?0:-1);
		
	}

2) Submitting from Linux to YARN

public static void main(String[] args) throws Exception {
		
		Configuration conf = new Configuration();
		conf.set("fs.defaultFS", "hdfs://hdp-01:9000");
		conf.set("fs.hdfs.impl", "org.apache.hadoop.hdfs.DistributedFileSystem");
		// where the job should run (mapreduce.framework.name, yarn.resourcemanager.hostname) is not set here;
		// when the jar is launched with the hadoop jar command on a cluster node, those values come from the node's Hadoop configuration files

		Job job = Job.getInstance(conf);
		
		
		job.setJarByClass(JobSubmitterLinuxToYarn.class);
		
		
		job.setMapperClass(WordcountMapper.class);
		job.setReducerClass(WordcountReducer.class);
		
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(IntWritable.class);
		
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);
		
		FileInputFormat.setInputPaths(job, new Path("/wordcount/input"));
		FileOutputFormat.setOutputPath(job, new Path("/wordcount/output"));
		
		job.setNumReduceTasks(3);
		
		boolean res = job.waitForCompletion(true);
		System.exit(res?0:1);
		
	}
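
Usage note: on Linux this class is typically packaged into a jar (for example the mapreduce24-0.0.1-SNAPSHOT.jar built above) and launched on a cluster node with: hadoop jar <jar file> <fully qualified name of JobSubmitterLinuxToYarn>. The hadoop command puts the node's Hadoop configuration files on the classpath, which is where the settings not specified in the code come from.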

3) Running locally on Windows for testing

public static void main(String[] args) throws Exception {
		
		Configuration conf = new Configuration();

		// with no overrides, fs.defaultFS defaults to the local file system and
		// mapreduce.framework.name defaults to the local job runner, which is what a local test needs
		//conf.set("fs.defaultFS", "file:///");
		//conf.set("mapreduce.framework.name", "local");

		Job job = Job.getInstance(conf);
		
		job.setJarByClass(JobSubmitterLinuxToYarn.class);
		
		job.setMapperClass(WordcountMapper.class);
		job.setReducerClass(WordcountReducer.class);
		
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(IntWritable.class);
		
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);
		
		FileInputFormat.setInputPaths(job, new Path("f:/mrdata/wordcount/input"));
		FileOutputFormat.setOutputPath(job, new Path("f:/mrdata/wordcount/output"));
		
		job.setNumReduceTasks(3);
		
		boolean res = job.waitForCompletion(true);
		System.exit(res?0:1);
		
	}

4. A more advanced MapReduce example (summing per-phone traffic with a custom Writable)

(1) The job submission class (runs locally on Windows)
Code:

public static void main(String[] args) throws Exception {
		Configuration conf = new Configuration();
		
		Job job = Job.getInstance(conf);
		job.setJarByClass(JobSubmit.class);
		
		
		job.setMapperClass(MapperIp.class);
		job.setReducerClass(ReduceIp.class);
		
		// both the map output value and the final output value use the custom FlowB type
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(FlowB.class);
		
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(FlowB.class);
		
		FileInputFormat.setInputPaths(job, new Path("C:\\Users\\liu-xiao-ge\\Desktop\\1"));
		FileOutputFormat.setOutputPath(job, new Path("C:\\Users\\liu-xiao-ge\\Desktop\\1\\1"));
		
		job.setNumReduceTasks(2);
		
		job.waitForCompletion(true);

}

(2) The Mapper implementation class
Code:

public class MapperIp extends Mapper<LongWritable, Text, Text, FlowB>{

	@Override
	protected void map(LongWritable key, Text value, Context context)
			throws IOException, InterruptedException {

		// assumed input format: tab-separated log lines in which field 1 is the phone number
		// and the up-flow / down-flow are the third- and second-to-last fields
		String[] ls = value.toString().split("\t");
		String phone = ls[1];
		int upflow = Integer.parseInt(ls[ls.length-3]);
		int dflow = Integer.parseInt(ls[ls.length-2]);

		context.write(new Text(phone), new FlowB(phone, upflow, dflow));
	}
}

(3) The Reducer implementation class:
Code:

public class ReduceIp extends Reducer<Text, FlowB, Text, FlowB>{

	@Override
	protected void reduce(Text key, Iterable<FlowB> value, Reducer<Text, FlowB, Text, FlowB>.Context context)
			throws IOException, InterruptedException {
		int upsum = 0;
		int dsum = 0;

		// sum the up-flow and down-flow of all records that belong to this phone number
		for (FlowB flowB : value) {
			upsum += flowB.getUpFlow();
			dsum += flowB.getdFlow();
		}

		context.write(key, new FlowB(key.toString(), upsum, dsum));
	}
}

(4) The FlowB class
What this example demonstrates: how a custom data type implements Hadoop's serialization interface
The class must keep a no-argument constructor
The order in which the write method writes the fields' binary data must match the order in which the readFields method reads them back

Code:

public class FlowB implements Writable {

	private int upFlow;
	private int dFlow;
	private String phone;
	private int amountFlow;

	public FlowB(){}
	
	public FlowB(String phone, int upFlow, int dFlow) {
		this.phone = phone;
		this.upFlow = upFlow;
		this.dFlow = dFlow;
		this.amountFlow = upFlow + dFlow;
	}

	public String getPhone() {
		return phone;
	}

	public void setPhone(String phone) {
		this.phone = phone;
	}

	public int getUpFlow() {
		return upFlow;
	}

	public void setUpFlow(int upFlow) {
		this.upFlow = upFlow;
	}

	public int getdFlow() {
		return dFlow;
	}

	public void setdFlow(int dFlow) {
		this.dFlow = dFlow;
	}

	public int getAmountFlow() {
		return amountFlow;
	}

	public void setAmountFlow(int amountFlow) {
		this.amountFlow = amountFlow;
	}

	/**
	 * Called by the Hadoop framework when it serializes an object of this class
	 */
	@Override
	public void write(DataOutput out) throws IOException {

		out.writeInt(upFlow);
		out.writeUTF(phone);
		out.writeInt(dFlow);
		out.writeInt(amountFlow);

	}

	/**
	 * Called by the Hadoop framework when it deserializes an object of this class
	 */
	@Override
	public void readFields(DataInput in) throws IOException {
		this.upFlow = in.readInt();
		this.phone = in.readUTF();
		this.dFlow = in.readInt();
		this.amountFlow = in.readInt();
	}

	@Override
	public String toString() {
		 
		return this.phone + ","+this.upFlow +","+ this.dFlow +"," + this.amountFlow;
	}
	
}
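
As a sanity check for the rule above (the write order must match the readFields order), here is a minimal round-trip sketch (hypothetical test code, assuming the FlowB class above is on the classpath) that serializes a FlowB to a byte array with plain java.io streams and reads it back:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;

public class FlowBRoundTrip {
	public static void main(String[] args) throws Exception {
		FlowB original = new FlowB("13800000000", 100, 200);

		// serialize: Hadoop calls write(DataOutput) when it ships the object between tasks
		ByteArrayOutputStream bytes = new ByteArrayOutputStream();
		original.write(new DataOutputStream(bytes));

		// deserialize: Hadoop calls readFields(DataInput) on a freshly constructed object,
		// which is why the no-argument constructor must be kept
		FlowB copy = new FlowB();
		copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));

		// both lines print 13800000000,100,200,300 when the field order in write/readFields matches
		System.out.println(original);
		System.out.println(copy);
	}
}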

Reposted from blog.csdn.net/qq_43316411/article/details/100695509