Using Hadoop Serialization in MapReduce

Serialization Overview

1. What is serialization
Serialization is the process of converting an object into a byte sequence so that it can be stored on disk or transmitted over the network.
Deserialization is the reverse process: converting a byte sequence back into an object.

2. Why serialize
Objects in a running program cannot be sent over the network or persisted directly, so serialization is needed whenever data has to cross hosts or be written to durable storage.

3. Why not use Java's native serialization
Java's native serialization is a heavyweight mechanism: a serialized object carries a lot of extra information (checksums, headers, the class inheritance hierarchy), which makes it a poor fit for persistence and network transfer. Hadoop therefore implements its own compact serialization framework.
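
As a rough illustration, the following sketch (not part of the original job; the class names SerializationSizeDemo and JavaFlow and the sample values are made up) serializes the same three long fields once with Java's ObjectOutputStream and once with a plain DataOutputStream, which is essentially what a Writable's write() method does.

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializationSizeDemo {

    // A plain Java bean holding the same three longs a FlowBean would hold
    static class JavaFlow implements Serializable {
        private static final long serialVersionUID = 1L;
        long upFlow = 100, downFlow = 34300, sumFlow = 34400;
    }

    public static void main(String[] args) throws IOException {
        // Java native serialization: a stream header, class metadata and field
        // descriptors are written alongside the three long values
        ByteArrayOutputStream javaBytes = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(javaBytes)) {
            oos.writeObject(new JavaFlow());
        }

        // Writable-style serialization: only the raw field values, 3 * 8 = 24 bytes
        ByteArrayOutputStream writableBytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(writableBytes)) {
            out.writeLong(100L);
            out.writeLong(34300L);
            out.writeLong(34400L);
        }

        System.out.println("Java native serialization: " + javaBytes.size() + " bytes");
        System.out.println("Writable-style serialization: " + writableBytes.size() + " bytes");
    }
}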

Using Serialization in MapReduce

When a custom object is passed between the map and reduce phases of a MapReduce program, that object must implement Hadoop's serialization interface (Writable). The following example walks through the details.
Requirement
For each phone number, compute the total upstream traffic, total downstream traffic, and overall total traffic.

Input data format

phone number,upstream traffic,downstream traffic
13881743089,100,34300
13655669078,34434,300
......

Expected output format

phone number,total upstream traffic,total downstream traffic,total traffic
13881743089,4540,39300,43840
......

Implementation code

FlowBean.java

import org.apache.hadoop.io.Writable;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public class FlowBean implements Writable {

    private long upFlow;
    private long downFlow;
    private long sumFlow;

    // A no-argument constructor is required: the framework instantiates the bean
    // via reflection and then calls readFields() during deserialization.
    public FlowBean() {
        super();
    }

    public FlowBean(long upFlow, long downFlow) {
        super();
        this.upFlow = upFlow;
        this.downFlow = downFlow;
        this.sumFlow = upFlow + downFlow;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        // Serialization: write each field to the output stream
        out.writeLong(upFlow);
        out.writeLong(downFlow);
        out.writeLong(sumFlow);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // Deserialization: fields must be read back in exactly the same order they were written
        this.upFlow = in.readLong();
        this.downFlow = in.readLong();
        this.sumFlow = in.readLong();
    }

    public long getUpFlow() {
        return upFlow;
    }

    public void setUpFlow(long upFlow) {
        this.upFlow = upFlow;
    }

    public long getDownFlow() {
        return downFlow;
    }

    public void setDownFlow(long downFlow) {
        this.downFlow = downFlow;
    }

    public long getSumFlow() {
        return sumFlow;
    }

    public void setSumFlow(long sumFlow) {
        this.sumFlow = sumFlow;
    }

    // TextOutputFormat writes the reducer's output value via toString(), so this defines the final record format
    @Override
    public String toString() {
        return upFlow + "," + downFlow + "," + sumFlow;
    }
}
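
A quick way to sanity-check the write()/readFields() pair is to round-trip a FlowBean through plain Java streams, which is essentially what the framework does during the shuffle. The class FlowBeanRoundTrip below is a minimal sketch and not part of the original article.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class FlowBeanRoundTrip {
    public static void main(String[] args) throws IOException {
        FlowBean original = new FlowBean(100L, 34300L);

        // Serialize: the framework calls write(DataOutput) when the bean leaves the mapper
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        original.write(new DataOutputStream(buffer));

        // Deserialize: the framework calls readFields(DataInput) on the reduce side
        FlowBean copy = new FlowBean();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));

        System.out.println(copy); // expected output: 100,34300,34400
    }
}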

FlowMapper.java

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class FlowMapper extends Mapper<LongWritable, Text, Text, FlowBean> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

        // Each input value is one line: phone number,upstream traffic,downstream traffic
        String line = value.toString();
        String[] fields = line.split(",");

        String phoneNumber = fields[0];
        long upFlow = Long.parseLong(fields[1]);
        long downFlow = Long.parseLong(fields[2]);
        FlowBean flowBean = new FlowBean(upFlow, downFlow);

        // Emit <phone number, FlowBean>; FlowBean must be Writable so it can be shuffled to the reducer
        context.write(new Text(phoneNumber), flowBean);
    }
}

FlowReducer.java

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class FlowReducer extends Reducer<Text, FlowBean, Text, FlowBean> {
    @Override
    protected void reduce(Text key, Iterable<FlowBean> values, Context context) throws IOException, InterruptedException {

        long sumUpFlow = 0;
        long sumDownFlow = 0;

        // Accumulate the upstream and downstream traffic of all records for this phone number
        for (FlowBean flowBean : values) {
            sumUpFlow += flowBean.getUpFlow();
            sumDownFlow += flowBean.getDownFlow();
        }

        FlowBean result = new FlowBean(sumUpFlow, sumDownFlow);
        context.write(key, result);
    }
}

FlowCount.java

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;


public class FlowCount {
    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        Job job = Job.getInstance(configuration);
        job.setJarByClass(FlowCount.class);
        job.setJobName("flowcount");

        // Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Configure the Mapper and its output key/value types
        job.setMapperClass(FlowMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(FlowBean.class);

        // Configure the Reducer and the final output key/value types
        job.setReducerClass(FlowReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FlowBean.class);

        job.setNumReduceTasks(1);

        // Exit with a status code that reflects whether the job succeeded
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

pom.xml

<dependency>
	<groupId>org.apache.hadoop</groupId>
	<artifactId>hadoop-client</artifactId>
	<version>2.6.5</version>
</dependency>
<dependency>
	<groupId>org.apache.hadoop</groupId>
	<artifactId>hadoop-common</artifactId>
	<version>2.6.5</version>
</dependency>

Input file

[root@master software]# cat flow.txt 
13881743089,100,34300
13655669078,34434,300
18677563354,3443,3209
13881743089,109,3300
13655669078,3434,230

Package the project and submit it to the cluster

yarn jar mapreduce-1.0-SNAPSHOT.jar cn.aiaudit.flow.FlowCount /input/flow.txt /output 

Result file

[root@master software]# hdfs dfs -text  /output/part-r-00000
13655669078     37868,530,38398
13881743089     209,37600,37809
18677563354     3443,3209,6652
