Twelve, Hadoop serialization

One, basic overview of serialization

1, What is serialization

Serialization converts an object in memory into a sequence of bytes (or into another format required by the transfer protocol) so that it can be persisted to disk or transmitted over a network.

2, Why do we need serialization

Under normal circumstances, an object lives only in the local memory of one process and can only be called locally. Distributed programs, however, require processes on different hosts to exchange objects, which means an object has to be transferred over the network to another host. An object cannot be transmitted over the network as it is. It must first be serialized into a byte sequence, which can then be sent and rebuilt on the receiving side.

3, Serialization in Java

Java provides its own serialization mechanism: as long as a class is declared to implement the Serializable interface, Java handles its serialization automatically. For example:

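import java.io.Serializable;

public class Test implements Serializable {

    // serialVersionUID identifies the version of the class. Declaring it is
    // recommended; if it is omitted, the JVM computes one from the class structure.
    // 1L is only a placeholder value.
    private static final long serialVersionUID = 1L;
}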

However, Java's built-in serialization attaches a lot of extra information to each object, such as validation data, headers, and the class inheritance hierarchy. This makes it inefficient for transmission over the network. Hadoop therefore implements its own serialization mechanism, which is compact, uses little bandwidth, and is fast to serialize and deserialize.
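The size difference can be seen with a small experiment. The following is a minimal sketch using only the JDK: it serializes a hypothetical one-field class (`Point`, not part of Hadoop or the original article) once with `ObjectOutputStream` and once by writing only the field through `DataOutputStream`, which is the style Hadoop's Writable classes use.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializationSizeDemo {

    // Hypothetical example class holding a single int field.
    static class Point implements Serializable {
        private static final long serialVersionUID = 1L;
        int value;

        Point(int value) {
            this.value = value;
        }
    }

    // Size in bytes when using Java's built-in object serialization.
    static int javaSerializedSize(Point p) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(p);
        }
        return buffer.size();
    }

    // Size in bytes when only the field value is written, Writable-style.
    static int compactSize(Point p) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeInt(p.value);
        }
        return buffer.size();
    }

    public static void main(String[] args) throws IOException {
        Point p = new Point(42);
        System.out.println("Java serialization: " + javaSerializedSize(p) + " bytes");
        System.out.println("Field only:         " + compactSize(p) + " bytes");
    }
}
```

The field-only form is always 4 bytes for an int, while the built-in form also carries the stream header and a description of the class.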

Two, Hadoop serialization

1, Basic classes and their dependencies

A class can be serialized by Hadoop once it implements the Writable interface. Hadoop also ships serialization classes for many of the basic types. The dependency graph is as follows:

Figure 2.1: dependency graph of the Hadoop serialization classes

As the figure shows, the basic type classes all implement the WritableComparable interface, which in turn extends both the Writable and Comparable interfaces. Let's look at the three interfaces:

//WritableComparable.java
public interface WritableComparable<T> extends Writable, Comparable<T> {
}
/*
An empty interface that only combines Writable and Comparable
*/

//Writable.java
public interface Writable {
    void write(DataOutput var1) throws IOException;

    void readFields(DataInput var1) throws IOException;
}
/*
Contains the methods that write (serialize) and read (deserialize) an object
*/

//Comparable.java
public interface Comparable<T> {
    public int compareTo(T o);
}
/*
Provides the method for comparing two serializable objects
*/

2, Table of Java types and the corresponding Hadoop Writable types

Java type    Hadoop Writable type
boolean      BooleanWritable
byte         ByteWritable
int          IntWritable
float        FloatWritable
long         LongWritable
double       DoubleWritable
String       Text
Map          MapWritable
array        ArrayWritable

3, Source code of a common serialization class

Below we pick IntWritable, a commonly used serialization class, and look at its source:

package org.apache.hadoop.io;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.classification.InterfaceAudience.Public;
import org.apache.hadoop.classification.InterfaceStability.Stable;

@Public
@Stable
public class IntWritable implements WritableComparable<IntWritable> {
    private int value;

    public IntWritable() {
    }

    public IntWritable(int value) {
        this.set(value);
    }

    public void set(int value) {
        this.value = value;
    }

    public int get() {
        return this.value;
    }

    // The two methods of the Writable interface are implemented here
    public void readFields(DataInput in) throws IOException {
        this.value = in.readInt();
    }

    public void write(DataOutput out) throws IOException {
        out.writeInt(this.value);
    }

    // equals method for comparing two serializable objects
    public boolean equals(Object o) {
        if (!(o instanceof IntWritable)) {
            return false;
        } else {
            IntWritable other = (IntWritable)o;
            return this.value == other.value;
        }
    }

    public int hashCode() {
        return this.value;
    }

    // Method for comparing the order of two objects
    public int compareTo(IntWritable o) {
        int thisValue = this.value;
        int thatValue = o.value;
        return thisValue < thatValue ? -1 : (thisValue == thatValue ? 0 : 1);
    }

    public String toString() {
        return Integer.toString(this.value);
    }

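    /* This is the key part: the Comparator nested class below is registered as
    the default comparator for IntWritable. Because a static initializer block is
    used, the block runs as soon as the class is loaded and the Comparator object
    is created right away, so no object has to be created later just to call
    compare. The Comparator also compares the serialized bytes directly, whereas
    compareTo above needs IntWritable objects to exist first, so it is faster.
    */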
    static {
        WritableComparator.define(IntWritable.class, new IntWritable.Comparator());
    }

    // This nested class also implements a compare method, working on raw bytes
    public static class Comparator extends WritableComparator {
        public Comparator() {
            super(IntWritable.class);
        }

        public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
            int thisValue = readInt(b1, s1);
            int thatValue = readInt(b2, s2);
            return thisValue < thatValue ? -1 : (thisValue == thatValue ? 0 : 1);
        }
    }
}

The serialization classes for other types, such as short and long, are implemented in a similar way.
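To make the byte-level comparison concrete, here is a minimal sketch that runs without Hadoop. The `readInt` helper is a stand-in written for this example that decodes a big-endian int the way `WritableComparator.readInt` is used in the source above; the class name `RawIntCompareDemo` and the `serialize` helper are likewise assumptions for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class RawIntCompareDemo {

    // Serialize an int the same way IntWritable.write does: DataOutput.writeInt.
    static byte[] serialize(int value) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeInt(value);
        }
        return buffer.toByteArray();
    }

    // Stand-in helper: decode a big-endian int starting at offset start.
    static int readInt(byte[] bytes, int start) {
        return ((bytes[start] & 0xff) << 24)
             | ((bytes[start + 1] & 0xff) << 16)
             | ((bytes[start + 2] & 0xff) << 8)
             | (bytes[start + 3] & 0xff);
    }

    // Same comparison logic as IntWritable.Comparator.compare, applied to
    // the serialized bytes without creating any IntWritable-like object.
    static int compare(byte[] b1, int s1, byte[] b2, int s2) {
        int thisValue = readInt(b1, s1);
        int thatValue = readInt(b2, s2);
        return thisValue < thatValue ? -1 : (thisValue == thatValue ? 0 : 1);
    }

    public static void main(String[] args) throws IOException {
        byte[] a = serialize(3);
        byte[] b = serialize(7);
        System.out.println(compare(a, 0, b, 0));
        System.out.println(compare(b, 0, a, 0));
        System.out.println(compare(a, 0, a, 0));
    }
}
```

Decoding to an int before comparing, rather than comparing byte by byte, keeps negative values ordered correctly.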

4, Custom serialization classes

Key points:
(1) The class must implement the Writable interface
(2) The class must have a no-argument constructor, because deserialization uses reflection to call the no-argument constructor
(3) Override the serialization method

public void write(DataOutput out) throws IOException{
    // The DataOutput interface defines a write method for each primitive type; long is used here as an example
    out.writeLong(upFlow);
    out.writeLong(downFlow);
    out.writeLong(sumFlow);
}

(4) Override the deserialization method

public void readFields(DataInput in) throws IOException{
    upFlow = in.readLong();
    downFlow = in.readLong();
    sumFlow = in.readLong();
}

(5) Note that deserialization must read the fields in exactly the same order in which serialization wrote them
(6) Override the toString method as needed, so that the content can be conveniently written to a file
(7) If the custom class is used as a key, it must also implement the Comparable interface (in practice, WritableComparable) and its compareTo method, because MapReduce sorts by key and therefore has to compare keys

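public int compareTo(Test o) {
    // Return a negative value, 0, or a positive value when this object is
    // less than, equal to, or greater than o.
    // Comparing by sumFlow is only an example; use whichever field defines the order.
    return Long.compare(this.sumFlow, o.sumFlow);
}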
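Putting the points above together, the following is a minimal sketch of a complete custom class with a serialization round trip. It is written to run without Hadoop, so `Writable` here is a local stand-in with the same two methods as `org.apache.hadoop.io.Writable`. The class name `FlowBean`, its two-argument constructor, and the `roundTrip` helper are assumptions for illustration; only the three long fields come from the snippets above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

public class FlowBeanDemo {

    // Local stand-in for org.apache.hadoop.io.Writable (same two methods).
    interface Writable {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    // Hypothetical key class with the three long fields used above.
    static class FlowBean implements Writable, Comparable<FlowBean> {
        private long upFlow;
        private long downFlow;
        private long sumFlow;

        // Point (2): no-argument constructor for reflective creation.
        FlowBean() {
        }

        FlowBean(long upFlow, long downFlow) {
            this.upFlow = upFlow;
            this.downFlow = downFlow;
            this.sumFlow = upFlow + downFlow;
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeLong(upFlow);
            out.writeLong(downFlow);
            out.writeLong(sumFlow);
        }

        // Point (5): fields are read in the same order they were written.
        @Override
        public void readFields(DataInput in) throws IOException {
            upFlow = in.readLong();
            downFlow = in.readLong();
            sumFlow = in.readLong();
        }

        // Point (7): ordering by sumFlow is an arbitrary choice for this sketch.
        @Override
        public int compareTo(FlowBean other) {
            return Long.compare(sumFlow, other.sumFlow);
        }

        // Point (6): tab-separated form, convenient for text output files.
        @Override
        public String toString() {
            return upFlow + "\t" + downFlow + "\t" + sumFlow;
        }
    }

    // Serialize to a byte array, then rebuild a new object from those bytes.
    static FlowBean roundTrip(FlowBean original) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        original.write(new DataOutputStream(buffer));
        FlowBean copy = new FlowBean();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));
        return copy;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(roundTrip(new FlowBean(10, 20)));
    }
}
```

In a real job the class would implement Hadoop's own interfaces instead of the stand-in, and the framework would perform the write and readFields calls.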

5, Custom classes whose fields are themselves custom serialization classes

First, the classes used as fields must themselves implement the serialization interface. In the example below, DateDimension and ContactDimension have already implemented serialization.

public class ComDimension extends BaseDimension {
    private DateDimension dateDimension = new DateDimension();
    private ContactDimension contactDimension = new ContactDimension();

    // To serialize, simply call each field's own write method, as shown below
    @Override
    public void write(DataOutput dataOutput) throws IOException {
        this.dateDimension.write(dataOutput);
        this.contactDimension.write(dataOutput);
    }

    @Override
    public void readFields(DataInput dataInput) throws IOException {
        this.dateDimension.readFields(dataInput);
        this.contactDimension.readFields(dataInput);
    }

}
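The delegation pattern can be tried out without Hadoop with a sketch like the one below. The `Writable` interface is a local stand-in, the fields inside `DateDimension` and `ContactDimension` (`year`, `month`, `phone`) are invented for this example because the article does not show them, and the `BaseDimension` parent class is left out.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

public class NestedWritableDemo {

    // Local stand-in for org.apache.hadoop.io.Writable.
    interface Writable {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    // Hypothetical field class; year and month are assumed fields.
    static class DateDimension implements Writable {
        int year;
        int month;

        public void write(DataOutput out) throws IOException {
            out.writeInt(year);
            out.writeInt(month);
        }

        public void readFields(DataInput in) throws IOException {
            year = in.readInt();
            month = in.readInt();
        }
    }

    // Hypothetical field class; phone is an assumed field.
    static class ContactDimension implements Writable {
        String phone = "";

        public void write(DataOutput out) throws IOException {
            out.writeUTF(phone);
        }

        public void readFields(DataInput in) throws IOException {
            phone = in.readUTF();
        }
    }

    // The composite class only delegates to its fields, in a fixed order.
    static class ComDimension implements Writable {
        DateDimension dateDimension = new DateDimension();
        ContactDimension contactDimension = new ContactDimension();

        public void write(DataOutput out) throws IOException {
            dateDimension.write(out);
            contactDimension.write(out);
        }

        public void readFields(DataInput in) throws IOException {
            dateDimension.readFields(in);
            contactDimension.readFields(in);
        }
    }

    // Serialize the composite object and rebuild it from the bytes.
    static ComDimension roundTrip(ComDimension original) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        original.write(new DataOutputStream(buffer));
        ComDimension copy = new ComDimension();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));
        return copy;
    }

    public static void main(String[] args) throws IOException {
        ComDimension original = new ComDimension();
        original.dateDimension.year = 2019;
        original.dateDimension.month = 10;
        original.contactDimension.phone = "13800000000";
        ComDimension copy = roundTrip(original);
        System.out.println(copy.dateDimension.year + "-" + copy.dateDimension.month
                + " " + copy.contactDimension.phone);
    }
}
```

Initializing the field objects in the declaration matters: readFields is called on an object created through the no-argument constructor, so the fields must already exist at that point.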

Origin blog.51cto.com/kinglab/2446185