Hadoop Review

Note: Alt+/ (content assist) and Ctrl+1 (quick fix) are very useful shortcuts in Eclipse.

1. Hadoop programming template (for operating on files in HDFS)

//Create the Configuration object; this loads Hadoop's configuration files
Configuration conf = new Configuration();
//Set parameters on conf programmatically (alternatively, change them in the configuration files)
conf.set("dfs.replication", "2");
conf.set("dfs.blocksize", "64m");
//Get a FileSystem client for the cluster, acting as user "root"
FileSystem fs = FileSystem.get(new URI("hdfs://hdp-001:9000"), conf, "root");

Every client that connects to HDFS needs to perform the setup above.

Below are some commonly used Hadoop FileSystem APIs:
//Upload a local file to HDFS
fs.copyFromLocalFile(new Path("C:\\Users\\liu-xiao-ge\\Desktop\\问题.txt"), new Path("/"));

//Download a file from HDFS to the local file system
fs.copyToLocalFile(new Path("/问题.txt"), new Path("C:\\Users\\liu-xiao-ge\\Desktop\\haha.txt"));

//Move/rename a file within HDFS
fs.rename(new Path("/问题.txt"), new Path("/liu/liu.txt"));

fs.close();


2. Repost: usage of the common Hadoop FileSystem APIs

https://blog.csdn.net/pzsoftchen/article/details/17632173
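
As a quick refresher on the APIs covered in the linked post, here is a minimal sketch (a hypothetical example, reusing the hdp-001 address and the root user from section 1) of a few other frequently used FileSystem calls:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class FileSystemDemo {
	public static void main(String[] args) throws Exception {
		Configuration conf = new Configuration();
		FileSystem fs = FileSystem.get(new URI("hdfs://hdp-001:9000"), conf, "root");

		// create a directory
		fs.mkdirs(new Path("/demo"));

		// check whether a path exists
		System.out.println(fs.exists(new Path("/demo")));

		// list files recursively (files only, not directories)
		RemoteIterator<LocatedFileStatus> it = fs.listFiles(new Path("/"), true);
		while (it.hasNext()) {
			LocatedFileStatus f = it.next();
			System.out.println(f.getPath() + "  " + f.getLen() + " bytes");
		}

		// list the direct children of a directory (files and subdirectories)
		for (FileStatus s : fs.listStatus(new Path("/"))) {
			System.out.println((s.isDirectory() ? "d " : "- ") + s.getPath());
		}

		// delete a path recursively
		fs.delete(new Path("/demo"), true);

		fs.close();
	}
}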

3. MapReduce programming

(1) You need to provide implementations of the Mapper class and the Reducer class
① The Mapper implementation class: WordcountMapper

KEYIN: the type of the key the map task reads, i.e. the byte offset at which the line starts (Long)
VALUEIN: the type of the value the map task reads, i.e. the content of one line (String)
KEYOUT: the key type of the kv pairs returned by the user-defined map method; in the wordcount logic we return the word (String)
VALUEOUT: the value type of the kv pairs returned by the user-defined map method; in the wordcount logic we return a count (Integer)
However, the data produced by map has to be shipped to reduce, which requires serialization and deserialization. The JDK's native serialization produces rather bloated output, which would make data transfer during a MapReduce job inefficient.
Hadoop therefore ships its own serialization mechanism, and every data type transferred inside MapReduce must implement Hadoop's serialization interface.
For the common JDK types Long, String, Integer, Float, etc., Hadoop provides wrapper types that already implement its serialization interface: LongWritable, Text, IntWritable, FloatWritable.

Code:

public class WordcountMapper extends Mapper<LongWritable, Text, Text, IntWritable>{
	
	@Override
	protected void map(LongWritable key, Text value, Context context)
			throws IOException, InterruptedException {

		// split the line into words
		String line = value.toString();
		String[] words = line.split(" ");
		// emit (word, 1) for every word on the line
		for(String word:words){
			context.write(new Text(word), new IntWritable(1));
		}
	}
}

② The Reducer implementation class: WordcountReducer
Code:

public class WordcountReducer extends Reducer<Text, IntWritable, Text, IntWritable>{

	@Override
	protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {

		// add up all the 1s emitted by the mappers for this word
		int count = 0;

		Iterator<IntWritable> iterator = values.iterator();
		while(iterator.hasNext()){

			IntWritable value = iterator.next();
			count += value.get();
		}
		context.write(key, new IntWritable(count));

	}
}
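
For example, if one input line is "hello world hello", the map phase emits (hello,1), (world,1) and (hello,1); after the shuffle, the reduce call for the key "hello" receives the value list [1, 1] and writes (hello, 2), while the call for "world" writes (world, 1).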

③ You need a class that submits the Job (this class has the main method and is the main class)
Note that a job can be submitted in the following ways:
1) Submitting from Windows to YARN
Code:

public static void main(String[] args) throws Exception {
		
		// Set a JVM system property so that the job object picks up the user identity to use when accessing HDFS
		System.setProperty("HADOOP_USER_NAME", "root");
		
		Configuration conf = new Configuration();
		// 1. Set the default file system the job accesses at runtime
		conf.set("fs.defaultFS", "hdfs://hdp-01:9000");
		// 2. Set where the job is submitted to run
		conf.set("mapreduce.framework.name", "yarn");
		conf.set("yarn.resourcemanager.hostname", "hdp-01");
		// 3. When the submission client runs on Windows, this cross-platform parameter must be added
		conf.set("mapreduce.app-submission.cross-platform","true");
		
		Job job = Job.getInstance(conf);
		
		// 1. Parameter: location of the jar that contains the job classes
		job.setJar("D:\\appdev\\hadoop-16\\mapreduce24\\target\\mapreduce24-0.0.1-SNAPSHOT.jar");
		//job.setJarByClass(JobSubmitter.class);
		
		// 2. Parameters: the Mapper and Reducer implementation classes this job calls
		job.setMapperClass(WordcountMapper.class);
		job.setReducerClass(WordcountReducer.class);
		
		// 3. Parameters: the key/value types produced by the Mapper and by the Reducer
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(IntWritable.class);
		
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);
		
		// delete the output directory first if it already exists
		Path output = new Path("/wordcount/output");
		FileSystem fs = FileSystem.get(new URI("hdfs://hdp-01:9000"),conf,"root");
		if(fs.exists(output)){
			fs.delete(output, true);
		}
		
		// 4. Parameters: input path of the data set to process and output path of the final result
		FileInputFormat.setInputPaths(job, new Path("/wordcount/input"));
		FileOutputFormat.setOutputPath(job, output);  // note: the output path must not exist yet
		
		// 5. Parameter: the number of reduce tasks to start
		job.setNumReduceTasks(2);
		
		// 6. Submit the job to YARN and wait for completion
		boolean res = job.waitForCompletion(true);
		
		System.exit(res?0:-1);
		
	}

2) Submitting from Linux to YARN

public static void main(String[] args) throws Exception {
		
		Configuration conf = new Configuration();
		conf.set("fs.defaultFS", "hdfs://hdp-01:9000");
		conf.set("fs.hdfs.impl", "org.apache.hadoop.hdfs.DistributedFileSystem");
		// where the job should run (mapreduce.framework.name, yarn.resourcemanager.hostname) is not set here;
		// when the jar is launched with the hadoop jar command on a cluster node, those values come from the node's Hadoop configuration files

		Job job = Job.getInstance(conf);
		
		
		job.setJarByClass(JobSubmitterLinuxToYarn.class);
		
		
		job.setMapperClass(WordcountMapper.class);
		job.setReducerClass(WordcountReducer.class);
		
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(IntWritable.class);
		
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);
		
		FileInputFormat.setInputPaths(job, new Path("/wordcount/input"));
		FileOutputFormat.setOutputPath(job, new Path("/wordcount/output"));
		
		job.setNumReduceTasks(3);
		
		boolean res = job.waitForCompletion(true);
		System.exit(res?0:1);
		
	}
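
Usage note: on Linux this class is typically packaged into a jar (for example the mapreduce24-0.0.1-SNAPSHOT.jar built above) and launched on a cluster node with: hadoop jar <jar file> <fully qualified name of JobSubmitterLinuxToYarn>. The hadoop command puts the node's Hadoop configuration files on the classpath, which is where the settings not specified in the code come from.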

3) Running locally on Windows for testing

public static void main(String[] args) throws Exception {
		
		Configuration conf = new Configuration();

		// with no overrides, fs.defaultFS defaults to the local file system and
		// mapreduce.framework.name defaults to the local job runner, which is what a local test needs
		//conf.set("fs.defaultFS", "file:///");
		//conf.set("mapreduce.framework.name", "local");

		Job job = Job.getInstance(conf);
		
		job.setJarByClass(JobSubmitterLinuxToYarn.class);
		
		job.setMapperClass(WordcountMapper.class);
		job.setReducerClass(WordcountReducer.class);
		
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(IntWritable.class);
		
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);
		
		FileInputFormat.setInputPaths(job, new Path("f:/mrdata/wordcount/input"));
		FileOutputFormat.setOutputPath(job, new Path("f:/mrdata/wordcount/output"));
		
		job.setNumReduceTasks(3);
		
		boolean res = job.waitForCompletion(true);
		System.exit(res?0:1);
		
	}

4. A more advanced MapReduce example (summing per-phone traffic with a custom Writable)

(1) The job submission class (runs locally on Windows)
Code:

public static void main(String[] args) throws Exception {
		Configuration conf = new Configuration();
		
		Job job = Job.getInstance(conf);
		job.setJarByClass(JobSubmit.class);
		
		
		job.setMapperClass(MapperIp.class);
		job.setReducerClass(ReduceIp.class);
		
		// both the map output value and the final output value use the custom FlowB type
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(FlowB.class);
		
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(FlowB.class);
		
		FileInputFormat.setInputPaths(job, new Path("C:\\Users\\liu-xiao-ge\\Desktop\\1"));
		FileOutputFormat.setOutputPath(job, new Path("C:\\Users\\liu-xiao-ge\\Desktop\\1\\1"));
		
		job.setNumReduceTasks(2);
		
		job.waitForCompletion(true);

}

(2) The Mapper implementation class
Code:

public class MapperIp extends Mapper<LongWritable, Text, Text, FlowB>{

	@Override
	protected void map(LongWritable key, Text value, Context context)
			throws IOException, InterruptedException {

		// assumed input format: tab-separated log lines in which field 1 is the phone number
		// and the up-flow / down-flow are the third- and second-to-last fields
		String[] ls = value.toString().split("\t");
		String phone = ls[1];
		int upflow = Integer.parseInt(ls[ls.length-3]);
		int dflow = Integer.parseInt(ls[ls.length-2]);

		context.write(new Text(phone), new FlowB(phone, upflow, dflow));
	}
}

(3) The Reducer implementation class:
Code:

public class ReduceIp extends Reducer<Text, FlowB, Text, FlowB>{

	@Override
	protected void reduce(Text key, Iterable<FlowB> value, Reducer<Text, FlowB, Text, FlowB>.Context context)
			throws IOException, InterruptedException {
		int upsum = 0;
		int dsum = 0;

		// sum the up-flow and down-flow of all records that belong to this phone number
		for (FlowB flowB : value) {
			upsum += flowB.getUpFlow();
			dsum += flowB.getdFlow();
		}

		context.write(key, new FlowB(key.toString(), upsum, dsum));
	}
}

(4) The FlowB class
What this example demonstrates: how a custom data type implements Hadoop's serialization interface
The class must keep a no-argument constructor
The order in which the write method writes the fields' binary data must match the order in which the readFields method reads them back

Code:

public class FlowB implements Writable {

	private int upFlow;
	private int dFlow;
	private String phone;
	private int amountFlow;

	public FlowB(){}
	
	public FlowB(String phone, int upFlow, int dFlow) {
		this.phone = phone;
		this.upFlow = upFlow;
		this.dFlow = dFlow;
		this.amountFlow = upFlow + dFlow;
	}

	public String getPhone() {
		return phone;
	}

	public void setPhone(String phone) {
		this.phone = phone;
	}

	public int getUpFlow() {
		return upFlow;
	}

	public void setUpFlow(int upFlow) {
		this.upFlow = upFlow;
	}

	public int getdFlow() {
		return dFlow;
	}

	public void setdFlow(int dFlow) {
		this.dFlow = dFlow;
	}

	public int getAmountFlow() {
		return amountFlow;
	}

	public void setAmountFlow(int amountFlow) {
		this.amountFlow = amountFlow;
	}

	/**
	 * Called by the Hadoop framework when it serializes an object of this class
	 */
	@Override
	public void write(DataOutput out) throws IOException {

		out.writeInt(upFlow);
		out.writeUTF(phone);
		out.writeInt(dFlow);
		out.writeInt(amountFlow);

	}

	/**
	 * Called by the Hadoop framework when it deserializes an object of this class
	 */
	@Override
	public void readFields(DataInput in) throws IOException {
		this.upFlow = in.readInt();
		this.phone = in.readUTF();
		this.dFlow = in.readInt();
		this.amountFlow = in.readInt();
	}

	@Override
	public String toString() {
		 
		return this.phone + ","+this.upFlow +","+ this.dFlow +"," + this.amountFlow;
	}
	
}
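
As a sanity check for the rule above (the write order must match the readFields order), here is a minimal round-trip sketch (hypothetical test code, assuming the FlowB class above is on the classpath) that serializes a FlowB to a byte array with plain java.io streams and reads it back:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;

public class FlowBRoundTrip {
	public static void main(String[] args) throws Exception {
		FlowB original = new FlowB("13800000000", 100, 200);

		// serialize: Hadoop calls write(DataOutput) when it ships the object between tasks
		ByteArrayOutputStream bytes = new ByteArrayOutputStream();
		original.write(new DataOutputStream(bytes));

		// deserialize: Hadoop calls readFields(DataInput) on a freshly constructed object,
		// which is why the no-argument constructor must be kept
		FlowB copy = new FlowB();
		copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));

		// both lines print 13800000000,100,200,300 when the field order in write/readFields matches
		System.out.println(original);
		System.out.println(copy);
	}
}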

Reposted from blog.csdn.net/qq_43316411/article/details/100695509