【Big Data Development】Advanced Hadoop Programming (Part 1)

 Section 1: How to study this chapter well

  1. Linux basics

  2. Java programming

  3. The core big data components: Hadoop installation, deployment, configuration, and so on

Section 2: Setting up the project

  1. Create a new project

  2. Convert the new project into a Maven project

  3. Install and configure the Maven environment, and edit the settings.xml file

  4. Configure Maven in the IntelliJ IDEA IDE

  5. Edit the pom.xml file:

    <properties>
        <hadoop.version>2.7.0</hadoop.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
    </dependencies>

   6. Import Changes (let IDEA re-import the Maven project and download the dependencies)
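The single hadoop-client artifact transitively pulls in the HDFS and MapReduce client libraries used in the rest of this article, so no other Hadoop dependencies need to be declared.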


Section 3: Initialization for HDFS operations with the Java API

Configuration -> FileSystem
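A minimal sketch of this step (the class name HdfsInit is illustrative, not part of the course code): a Configuration object picks up core-site.xml and hdfs-site.xml from the classpath, and FileSystem.get() returns the client handle that every later HDFS call goes through.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class HdfsInit {
    public static void main(String[] args) throws Exception {
        // Loads core-site.xml / hdfs-site.xml found on the classpath (e.g. src/main/resources)
        Configuration configuration = new Configuration();
        // Builds the client for the file system named by fs.defaultFS in that configuration
        FileSystem fileSystem = FileSystem.get(configuration);
        System.out.println("Connected to: " + fileSystem.getUri());
    }
}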

Section 4: Reading a file

Path -> FSDataInputStream -> IOUtils

Two configuration files are needed on the classpath: core-site.xml and hdfs-site.xml.

Section 5: Writing a file

FSDataOutputStream (the HDFS output stream) + FileInputStream (the local input stream)

IOUtils.copyBytes(fileInputStream,outputStream,4096,false);

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

import java.io.File;
import java.io.FileInputStream;

public class HdfsAPP {

    // Build the FileSystem client from the classpath configuration
    private FileSystem getFileSystem() throws Exception {
        Configuration configuration = new Configuration();
        FileSystem fileSystem = FileSystem.get(configuration);
        return fileSystem;
    }

    // Read an HDFS file and print its content to standard output
    private void readHDFSFile(String filePath) {
        FSDataInputStream fsDataInputStream = null;
        try {
            Path path = new Path(filePath);
            fsDataInputStream = this.getFileSystem().open(path);
            IOUtils.copyBytes(fsDataInputStream, System.out, 4096, false);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            if (fsDataInputStream != null) {
                IOUtils.closeStream(fsDataInputStream);
            }
        }
    }

    // Copy a local file into HDFS
    private void writeHDFS(String localPath, String hdfsPath) {
        FSDataOutputStream outputStream = null;
        FileInputStream fileInputStream = null;
        try {
            Path path = new Path(hdfsPath);
            outputStream = this.getFileSystem().create(path);
            fileInputStream = new FileInputStream(new File(localPath));
            IOUtils.copyBytes(fileInputStream, outputStream, 4096, false);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            if (fileInputStream != null) {
                IOUtils.closeStream(fileInputStream);
            }
            if (outputStream != null) {
                IOUtils.closeStream(outputStream);
            }
        }
    }

    public static void main(String[] args) {
        HdfsAPP hdfsAPP = new HdfsAPP();
        //String filePath = "hdfs://bigdata-pro-m01.kfk.com:9000/user/kfk/hdfs/core-site.xml";
        String hdfsPath = "hdfs://bigdata-pro-m01.kfk.com:9000/user/kfk/hdfs/local.xml";
        String localPath = "/Users/zhangjingyu/Documents/workspace/src/main/resources/hdfs-site.xml";

        //hdfsAPP.readHDFSFile(filePath);
        hdfsAPP.writeHDFS(localPath, hdfsPath);
    }
}
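After running main(), the upload can be checked from the command line with, for example, hdfs dfs -cat /user/kfk/hdfs/local.xml (the target path used above).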

Sections 6 & 7: The MapReduce programming model

1. MapReduce (MR) is a distributed computing model.

2. Over the whole parallel computation, MR abstracts two functions:

map(): operates on each independent element of the input in parallel.

reduce(): merges (aggregates) the intermediate data produced for those elements.

3. For a simple MR program we only need to specify map(), reduce(), the input and the output; the framework takes care of everything else.

The MR data-processing flow:

1) Processing stages: input -> map -> reduce -> output

2) Throughout the process, MR requires data in <key,value> form.

Three questions to consider:

1. input -> how does it become <key,value>?

   default -> <0, hadoop spark> (with the default TextInputFormat, the key is the byte offset of the line and the value is the line itself)

2. map -> how is the data processed and turned into <key,value>?

   <0, hadoop spark> -> split -> <hadoop,1> <spark,1>

3. reduce -> how is the grouped data processed and turned into <key,value>?

   <hadoop,List(1)>

   <spark,List(1,1,1,1)>

   <java,List(1,1)>

We use WordCount to analyze and understand the MR programming model.

Input data:

hadoop spark

spark hive

java java hive

spark kafka

kafka spark storm

Expected result:

hadoop  1

hive    2

java    2

kafka   2

spark   4

storm   1

Section 8: Writing the MR template

1. Implement the map class

2. Implement the reduce class

3. Driver (assemble all of the steps into a job):

1) get conf

2) create job

3.1) input

3.2) map

3.3) reduce

3.4) output

4) commit

Section 9: Writing the WordCount business logic

Map-side processing:

   <0, hadoop spark> -> split -> <hadoop,1> <spark,1>

Reduce-side processing:

   <hadoop,List(1)>

   <spark,List(1,1,1,1)>

   <java,List(1,1)>

//Java

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class WordCountMR {

    // 1. map
    public static class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        private Text mapOutputKey = new Text();
        private IntWritable mapOutputValue = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            System.out.println("keyIn:" + key + "    ValueIn:" + value);
            String lineValue = value.toString();
            String[] strs = lineValue.split(" ");
            for (String str : strs) {
                mapOutputKey.set(str);
                context.write(mapOutputKey, mapOutputValue);
            }
        }
    }

    // 2. reduce
    public static class WordCountReduce extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable outputValue = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            outputValue.set(sum);
            context.write(key, outputValue);
        }
    }

    public int run(String[] args) throws Exception {
        // driver
        // 1) get conf
        Configuration configuration = new Configuration();

        // 2) create job
        Job job = Job.getInstance(configuration, this.getClass().getSimpleName());
        job.setJarByClass(this.getClass());

        // 3.1) input
        Path path = new Path(args[0]);
        FileInputFormat.addInputPath(job, path);

        // 3.2) map
        job.setMapperClass(WordCountMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // 3.3) reduce
        job.setReducerClass(WordCountReduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // 3.4) output
        Path output = new Path(args[1]);
        FileOutputFormat.setOutputPath(job, output);

        // 4) commit
        boolean isSuc = job.waitForCompletion(true);
        return (isSuc) ? 0 : 1;
    }

    public static void main(String[] args) {
        // Hard-coded arguments for IDE testing; remember to comment these out when testing on YARN
        args = new String[]{
                "hdfs://bigdata-pro-m01.kfk.com:9000/user/kfk/datas/wordcount.txt",
                "hdfs://bigdata-pro-m01.kfk.com:9000/user/kfk/mr/output"
        };

        WordCountMR wordCountMR = new WordCountMR();
        try {
            // Check whether the output path already exists; if it does, delete it first
            Path fileOutPath = new Path(args[1]);
            FileSystem fileSystem = FileSystem.get(new Configuration());
            if (fileSystem.exists(fileOutPath)) {
                fileSystem.delete(fileOutPath, true);
            }
            int status = wordCountMR.run(args);
            System.exit(status);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
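The output directory is deleted up front because FileOutputFormat refuses to start a job whose output path already exists; without this check, a second run of the program would fail immediately.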


Section 10: Local testing

For local testing, remove the "hadoop.tmp.dir" entry from the core-site.xml on the classpath.
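For reference, a minimal core-site.xml for such a local run might keep only the NameNode address (this is an assumption about the course setup, using the address that appears in the code above; the original only says to drop hadoop.tmp.dir):

<?xml version="1.0"?>
<configuration>
    <!-- Only the NameNode address is kept; hadoop.tmp.dir has been removed for local runs -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://bigdata-pro-m01.kfk.com:9000</value>
    </property>
</configuration>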

Section 11: Packaging and submitting the job for testing
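A typical flow, as a sketch (the jar name below is hypothetical and depends on the artifactId and version in pom.xml; the hard-coded arguments in main() should be commented out first, as noted in the code):

mvn clean package
hadoop jar target/hadoop-demo-1.0.jar WordCountMR \
    hdfs://bigdata-pro-m01.kfk.com:9000/user/kfk/datas/wordcount.txt \
    hdfs://bigdata-pro-m01.kfk.com:9000/user/kfk/mr/output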

