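Section 1: How to Learn This Chapter Well
1. Linux basics
2. Java programming
3. Core big data components: installing, deploying, and configuring Hadoop, and so on
Section 2: Building the Project
1. Create a new project
2. Convert the new project into a Maven project
3. Install and configure Maven, and edit the settings.xml file
4. Configure Maven in IDEA
5. Edit the pom.xml file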
<properties>
    <hadoop.version>2.7.0</hadoop.version>
</properties>
<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>${hadoop.version}</version>
    </dependency>
</dependencies>
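6. Click Import Changes so that IDEA downloads the dependency
Section 3: Initializing HDFS Access Through the Java API
Configuration -> FileSystem
Section 4: Reading a File
Path -> FSDataInputStream -> IOUtils
Two configuration files are required on the classpath: core-site.xml and hdfs-site.xml
Section 5: Writing a File
FSDataOutputStream, FileInputStream
IOUtils.copyBytes(fileInputStream,outputStream,4096,false);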
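To make the four arguments of the copyBytes call above concrete, here is a minimal plain-Java sketch of a buffered stream copy. It is only an approximation of what such a call does, not Hadoop's actual implementation: the class CopyBytesSketch and its copyBytes method are hypothetical stand-ins, and in-memory streams replace the local file and the HDFS output stream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

class CopyBytesSketch {

    // Hypothetical stand-in for a call shaped like copyBytes(in, out, buffSize, close).
    // Copies everything from in to out through a buffer of buffSize bytes.
    static void copyBytes(InputStream in, OutputStream out, int buffSize, boolean close)
            throws IOException {
        try {
            byte[] buffer = new byte[buffSize];
            int bytesRead;
            // read() returns -1 once the input is exhausted
            while ((bytesRead = in.read(buffer)) != -1) {
                out.write(buffer, 0, bytesRead);
            }
            out.flush();
        } finally {
            // With close == false the caller stays responsible for closing both streams
            if (close) {
                in.close();
                out.close();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "hadoop spark\n".getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        copyBytes(new ByteArrayInputStream(data), sink, 4096, false);
        System.out.print(sink.toString("UTF-8"));
    }
}
```

Passing false as the last argument is why the listing that follows closes its streams itself in a finally block.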
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

import java.io.File;
import java.io.FileInputStream;

public class HdfsAPP {

    // Build a FileSystem from the configuration files on the classpath
    private FileSystem getFileSystem() throws Exception {
        Configuration configuration = new Configuration();
        return FileSystem.get(configuration);
    }

    // Read a file from HDFS and print it to standard output
    private void readHDFSFile(String filePath) {
        FSDataInputStream fsDataInputStream = null;
        try {
            Path path = new Path(filePath);
            fsDataInputStream = this.getFileSystem().open(path);
            // The last argument is false, so System.out is left open after the copy
            IOUtils.copyBytes(fsDataInputStream, System.out, 4096, false);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // closeStream ignores null, so no null check is needed
            IOUtils.closeStream(fsDataInputStream);
        }
    }

    // Copy a local file to the given HDFS path
    private void writeHDFS(String localPath, String hdfsPath) {
        FSDataOutputStream outputStream = null;
        FileInputStream fileInputStream = null;
        try {
            Path path = new Path(hdfsPath);
            outputStream = this.getFileSystem().create(path);
            fileInputStream = new FileInputStream(new File(localPath));
            IOUtils.copyBytes(fileInputStream, outputStream, 4096, false);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            IOUtils.closeStream(fileInputStream);
            IOUtils.closeStream(outputStream);
        }
    }

    public static void main(String[] args) {
        HdfsAPP hdfsAPP = new HdfsAPP();
        //String filePath = "hdfs://bigdata-pro-m01.kfk.com:9000/user/kfk/hdfs/core-site.xml";
        String hdfsPath = "hdfs://bigdata-pro-m01.kfk.com:9000/user/kfk/hdfs/local.xml";
        String localPath = "/Users/zhangjingyu/Documents/workspace/src/main/resources/hdfs-site.xml";
        //hdfsAPP.readHDFSFile(filePath);
        hdfsAPP.writeHDFS(localPath, hdfsPath);
    }
}
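Sections 6 and 7: The MR Programming Model
1. MR (MapReduce) is a distributed computing model.
2. MR abstracts the whole parallel computation into two functions:
map(): processes each independent input element in parallel
reduce(): merges the results produced for those elements
3. For a simple MR program we only specify map(), reduce(), input, and output; the framework takes care of the rest.
The MR data processing flow:
1) Processing stages: input -> map -> reduce -> output
2) Throughout processing, MR requires the data to be in <key,value> form
Three questions to consider:
1. input -> how does it become <key,value>?
default (key = byte offset of the line, value = line text) -> <0,hadoop spark>
2. map -> how is the data processed and turned into <key,value>?
<0,hadoop spark> -> split -> <hadoop,1> <spark,1>
3. reduce -> how is the data processed and turned into <key,value>?
<hadoop,List(1)> -> <hadoop,1>
<spark,List(1,1,1,1)> -> <spark,4>
<java,List(1,1)> -> <java,2>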
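The first question, how input lines become <key,value> pairs, can be illustrated with a small plain-Java sketch. With Hadoop's default text input format, the key is the byte offset at which a line starts and the value is the line itself. The class LineOffsetSketch and its toRecords method are hypothetical names for illustration only, and the sketch assumes lines terminated by a single '\n'.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

class LineOffsetSketch {

    // Returns <byte offset of the line start, line text> for '\n'-terminated text.
    // LinkedHashMap keeps the records in file order.
    static Map<Long, String> toRecords(String text) {
        Map<Long, String> records = new LinkedHashMap<>();
        long offset = 0;
        for (String line : text.split("\n")) {
            records.put(offset, line);
            // Advance past the line's bytes plus one byte for the '\n' terminator
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        String text = "hadoop spark\nspark hive\njava java hive\n";
        toRecords(text).forEach((k, v) -> System.out.println("<" + k + "," + v + ">"));
    }
}
```

For this input the records are <0,hadoop spark>, <13,spark hive>, and <24,java java hive>: the first line occupies 12 bytes plus the newline, so the second line starts at offset 13.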
We use wordcount to analyze and understand the MR programming model
Input data:
hadoop spark
spark hive
java java hive
spark kafka
kafka spark storm
Result:
hadoop 1
hive 2
java 2
kafka 2
spark 4
storm 1
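The path from the input data to the result above can be traced without a cluster. The following plain-Java sketch simulates the three steps (map, grouping by key, reduce) in memory. It is an illustration of the model under simplifying assumptions, not how the framework executes a job: the class WordCountFlowSketch and its methods are hypothetical, and a TreeMap stands in for the framework's sorting and grouping of keys.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class WordCountFlowSketch {

    // map: one input line -> a list of <word,1> pairs
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.split(" ")) {
            pairs.add(new SimpleEntry<>(word, 1));
        }
        return pairs;
    }

    // grouping: collect all values that share a key; TreeMap keeps keys sorted
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // reduce: <word,List(1,1,...)> -> the sum of the list
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Runs map over every line, groups the pairs, then reduces each group
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(pairs).entrySet()) {
            result.put(group.getKey(), reduce(group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "hadoop spark", "spark hive", "java java hive", "spark kafka", "kafka spark storm");
        wordCount(lines).forEach((word, count) -> System.out.println(word + " " + count));
    }
}
```

Run on the five input lines, the sketch prints the same six word counts listed in the result, in the same sorted order.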
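Section 8: Writing the MR Template
1. Implement the map class
2. Implement the reduce class
3. Driver (assembles all the steps into a job)
1) get conf
2) create job
3.1) input
3.2) map
3.3) reduce
3.4) output
4) submit the job
Section 9: Writing the wordcount Business Logic
Map processing:
<0,hadoop spark> -> split -> <hadoop,1> <spark,1>
Reduce processing:
<hadoop,List(1)> -> <hadoop,1>
<spark,List(1,1,1,1)> -> <spark,4>
<java,List(1,1)> -> <java,2>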
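The split step in the map phase deserves a closer look, because splitting on a single space produces empty "words" when a line contains repeated or leading spaces. The sketch below shows one way to tokenize defensively. It is a suggestion under the assumption that words are separated by runs of whitespace; the class TokenizeSketch and its tokenize method are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

class TokenizeSketch {

    // Splits a line on runs of whitespace and drops empty tokens,
    // so blank lines and repeated spaces yield no empty words.
    static List<String> tokenize(String line) {
        List<String> words = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("hadoop   spark ")); // prints [hadoop, spark]
        System.out.println(tokenize("   "));             // prints []
    }
}
```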
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class WordCountMR {

    // 1. map: <byte offset, line> -> <word, 1>
    public static class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        private Text mapOutputKey = new Text();
        private IntWritable mapOutputValue = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Debug output: shows the input key (byte offset) and value (line text)
            System.out.println("keyIn:" + key + " ValueIn:" + value);
            String lineValue = value.toString();
            // Split on runs of whitespace so repeated spaces do not produce empty words
            String[] strs = lineValue.trim().split("\\s+");
            for (String str : strs) {
                if (str.isEmpty()) {
                    continue; // blank line
                }
                mapOutputKey.set(str);
                context.write(mapOutputKey, mapOutputValue);
            }
        }
    }

    // 2. reduce: <word, List(1,1,...)> -> <word, sum>
    public static class WordCountReduce extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable outputValue = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            outputValue.set(sum);
            context.write(key, outputValue);
        }
    }

    // 3. driver: assemble all the steps into a job
    public int run(String[] args) throws Exception {
        // 1) get conf
        Configuration configuration = new Configuration();
        // 2) create job
        Job job = Job.getInstance(configuration, this.getClass().getSimpleName());
        job.setJarByClass(this.getClass());
        // 3.1) input
        Path path = new Path(args[0]);
        FileInputFormat.addInputPath(job, path);
        // 3.2) map
        job.setMapperClass(WordCountMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        // 3.3) reduce
        job.setReducerClass(WordCountReduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // 3.4) output
        Path output = new Path(args[1]);
        FileOutputFormat.setOutputPath(job, output);
        // 4) submit the job and wait for it to finish
        boolean isSuc = job.waitForCompletion(true);
        return (isSuc) ? 0 : 1;
    }

    public static void main(String[] args) {
        // Remember to comment out this assignment when testing on YARN
        args = new String[]{
                "hdfs://bigdata-pro-m01.kfk.com:9000/user/kfk/datas/wordcount.txt",
                "hdfs://bigdata-pro-m01.kfk.com:9000/user/kfk/mr/output"
        };
        WordCountMR wordCountMR = new WordCountMR();
        try {
            // If the output path already exists, delete it first
            Path fileOutPath = new Path(args[1]);
            // Resolve the file system from the path so that the hdfs:// URI is honored
            FileSystem fileSystem = fileOutPath.getFileSystem(new Configuration());
            if (fileSystem.exists(fileOutPath)) {
                fileSystem.delete(fileOutPath, true);
            }
            int status = wordCountMR.run(args);
            System.exit(status);
        } catch (Exception e) {
            e.printStackTrace();
            System.exit(1);
        }
    }
}
Section 10: Local Testing
For local testing, remove the "hadoop.tmp.dir" property from core-site.xml
Section 11: Packaging and Testing