Overview
Flink uses a StreamPartitioner to control how the elements of a DataStream are distributed to downstream operators. Flink provides eight StreamPartitioners:
- BroadcastPartitioner
- GlobalPartitioner
- RebalancePartitioner
- ShufflePartitioner
- RescalePartitioner
- ForwardPartitioner
- KeyGroupStreamPartitioner
- CustomPartitionerWrapper
StreamPartitioner implements the ChannelSelector interface. A channel is a simple abstraction of a destination that Flink writes data to (a subtask of a downstream parallel operator); we can think of it simply as one parallel instance of a downstream operator. Every subclass of StreamPartitioner must implement the selectChannel() method to choose the partition number. Let's look at how each of them works.
ChannelSelector
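public interface ChannelSelector<T extends IOReadableWritable> {
    // Initializes the number of downstream channels, i.e. the number of downstream subtasks.
    void setup(int numberOfChannels);
    /*
     * Decides, based on the current record and the total number of channels,
     * which downstream channel the record should be sent to.
     * Each partitioning strategy implements this method differently.
     */
    int selectChannel(T record);
    // Whether records are broadcast to all downstream operator instances.
    boolean isBroadcast();
}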
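To make the contract concrete, here is a minimal sketch of how a caller might use these methods to decide where a record goes. It is not Flink's record writer: `SimpleChannelSelector`, `ChannelRoutingSketch`, and the helper methods are hypothetical names introduced only for illustration, and the interface is a simplified stand-in for ChannelSelector.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified stand-in for Flink's ChannelSelector (not the real interface). */
interface SimpleChannelSelector<T> {
    void setup(int numberOfChannels);
    int selectChannel(T record);
    boolean isBroadcast();
}

final class ChannelRoutingSketch {

    /** Returns the indices of the downstream channels that would receive the record. */
    static <T> List<Integer> targetChannels(
            SimpleChannelSelector<T> selector, T record, int numberOfChannels) {
        selector.setup(numberOfChannels);
        List<Integer> targets = new ArrayList<>();
        if (selector.isBroadcast()) {
            // Broadcast: every channel gets the record, selectChannel() is never called.
            for (int channel = 0; channel < numberOfChannels; channel++) {
                targets.add(channel);
            }
        } else {
            // Otherwise exactly one channel is chosen per record.
            targets.add(selector.selectChannel(record));
        }
        return targets;
    }

    /** Selector that always picks channel 0. */
    static <T> SimpleChannelSelector<T> alwaysFirst() {
        return new SimpleChannelSelector<T>() {
            @Override public void setup(int numberOfChannels) {}
            @Override public int selectChannel(T record) { return 0; }
            @Override public boolean isBroadcast() { return false; }
        };
    }

    /** Selector that broadcasts and therefore refuses to select a single channel. */
    static <T> SimpleChannelSelector<T> broadcast() {
        return new SimpleChannelSelector<T>() {
            @Override public void setup(int numberOfChannels) {}
            @Override public int selectChannel(T record) {
                throw new UnsupportedOperationException("broadcast does not select a channel");
            }
            @Override public boolean isBroadcast() { return true; }
        };
    }
}
```

With three channels, the always-first selector yields the single target 0, while the broadcast selector yields 0, 1, and 2.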
StreamPartitioner
public abstract class StreamPartitioner<T> implements ChannelSelector<SerializationDelegate<StreamRecord<T>>>, Serializable {
    private static final long serialVersionUID = 1L;
    protected int numberOfChannels;
    public StreamPartitioner() {
    }
    // Initializes the number of downstream channels, i.e. the number of downstream subtasks.
    public void setup(int numberOfChannels) {
        this.numberOfChannels = numberOfChannels;
    }
    // By default, records are not broadcast to all downstream operator instances.
    public boolean isBroadcast() {
        return false;
    }
    public abstract StreamPartitioner<T> copy();
}
BroadcastPartitioner
Calling method: dataStream.broadcast();
Role: Send every record to all downstream operator instances. Each downstream instance then holds a complete copy of the upstream data and can read it locally. When joining a large DataStream with a small one, broadcasting the small DataStream downstream can make the join more efficient.
Principle: No downstream channel needs to be selected, so isBroadcast() returns true and selectChannel() is not supported.
Source code:
@Internal
public class BroadcastPartitioner<T> extends StreamPartitioner<T> {
    private static final long serialVersionUID = 1L;
    public BroadcastPartitioner() {
    }
    public int selectChannel(SerializationDelegate<StreamRecord<T>> record) {
        throw new UnsupportedOperationException("Broadcast partitioner does not support select channels.");
    }
    public boolean isBroadcast() {
        return true;
    }
    public StreamPartitioner<T> copy() {
        return this;
    }
    public String toString() {
        return "BROADCAST";
    }
}
GlobalPartitioner
Calling method: dataStream.global();
Role: Output all data to the first instance of the downstream operator only.
Principle: The selectChannel() method always returns channel 0.
Source code:
@Internal
public class GlobalPartitioner<T> extends StreamPartitioner<T> {
    private static final long serialVersionUID = 1L;
    public GlobalPartitioner() {
    }
    public int selectChannel(SerializationDelegate<StreamRecord<T>> record) {
        return 0;
    }
    public StreamPartitioner<T> copy() {
        return this;
    }
    public String toString() {
        return "GLOBAL";
    }
}
RebalancePartitioner
Calling method: dataStream.rebalance();
Role: Send data to the downstream instances in turn.
Principle: Pick a random downstream instance as the starting point, then output records round-robin from there. This distributes records evenly across the downstream instances and is often used to even out a skewed input stream.
Source code:
@Internal
public class RebalancePartitioner<T> extends StreamPartitioner<T> {
    private static final long serialVersionUID = 1L;
    private int nextChannelToSendTo;
    public RebalancePartitioner() {
    }
    public void setup(int numberOfChannels) {
        super.setup(numberOfChannels);
        // Initialize the starting channel with a pseudo-random number in [0, numberOfChannels).
        this.nextChannelToSendTo = ThreadLocalRandom.current().nextInt(numberOfChannels);
    }
    public int selectChannel(SerializationDelegate<StreamRecord<T>> record) {
        this.nextChannelToSendTo = (this.nextChannelToSendTo + 1) % this.numberOfChannels;
        return this.nextChannelToSendTo;
    }
    public StreamPartitioner<T> copy() {
        return this;
    }
    public String toString() {
        return "REBALANCE";
    }
}
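The effect of the random start followed by round-robin can be checked in isolation. The sketch below reuses the same two lines of selection logic outside Flink; `RoundRobinSketch` and `countPerChannel` are hypothetical names used only for this illustration.

```java
import java.util.concurrent.ThreadLocalRandom;

final class RoundRobinSketch {
    private final int numberOfChannels;
    private int nextChannelToSendTo;

    RoundRobinSketch(int numberOfChannels) {
        this.numberOfChannels = numberOfChannels;
        // Random starting point in [0, numberOfChannels), as in setup() above.
        this.nextChannelToSendTo = ThreadLocalRandom.current().nextInt(numberOfChannels);
    }

    /** Advances to the next channel, wrapping around at the end. */
    int selectChannel() {
        nextChannelToSendTo = (nextChannelToSendTo + 1) % numberOfChannels;
        return nextChannelToSendTo;
    }

    /** Routes the given number of records and counts how many each channel received. */
    static int[] countPerChannel(int numberOfChannels, int records) {
        RoundRobinSketch selector = new RoundRobinSketch(numberOfChannels);
        int[] counts = new int[numberOfChannels];
        for (int i = 0; i < records; i++) {
            counts[selector.selectChannel()]++;
        }
        return counts;
    }
}
```

Whatever the starting channel, 400 records over 4 channels give each channel exactly 100 records; when the record count is not a multiple of the channel count, the per-channel counts differ by at most one.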
ShufflePartitioner
Calling method: dataStream.shuffle();
Role: Output each record to a randomly chosen parallel instance of the downstream operator.
Principle: Use java.util.Random to pick a downstream instance at random for every record.
Source code:
public class ShufflePartitioner<T> extends StreamPartitioner<T> {
    private static final long serialVersionUID = 1L;
    private Random random = new Random();
    public ShufflePartitioner() {
    }
    public int selectChannel(SerializationDelegate<StreamRecord<T>> record) {
        return this.random.nextInt(this.numberOfChannels);
    }
    public StreamPartitioner<T> copy() {
        // Each copy gets its own Random instance.
        return new ShufflePartitioner<T>();
    }
    public String toString() {
        return "SHUFFLE";
    }
}
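Unlike round-robin, random selection only balances the load approximately. The following sketch counts where records land when each one is routed with `Random.nextInt`; `ShuffleSketch` and the `seed` parameter are assumptions added so the run is repeatable, and Flink's partitioner itself uses an unseeded Random.

```java
import java.util.Random;

final class ShuffleSketch {

    /** Routes records to random channels and counts how many each channel received. */
    static int[] countPerChannel(int numberOfChannels, int records, long seed) {
        Random random = new Random(seed); // seeded only to make the sketch repeatable
        int[] counts = new int[numberOfChannels];
        for (int i = 0; i < records; i++) {
            // nextInt(bound) returns a value in [0, bound), so it is always a valid channel.
            counts[random.nextInt(numberOfChannels)]++;
        }
        return counts;
    }
}
```

The counts always add up to the number of records, but individual channels typically deviate a little from the average, which is the trade-off compared with rebalance().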
RescalePartitioner
Calling method: dataStream.rescale();
Role: Distribute records round-robin to a subset of the downstream operator instances.
Principle: Based on the parallelism of the upstream and downstream operators, each upstream instance is connected to only some of the downstream instances and outputs records to them in round-robin order. The partitioner itself only cycles through its own output channels; the restriction to a subset comes from how the upstream and downstream subtasks are connected.
Example: If the upstream parallelism is 2 and the downstream parallelism is 4, one upstream instance outputs records round-robin to two of the downstream instances, and the other upstream instance outputs records round-robin to the remaining two.
If the upstream parallelism is 4 and the downstream parallelism is 2, two of the upstream instances output records to one downstream instance, and the other two upstream instances output records to the other downstream instance.
Source code:
@Internal
public class RescalePartitioner<T> extends StreamPartitioner<T> {
    private static final long serialVersionUID = 1L;
    private int nextChannelToSendTo = -1;
    public RescalePartitioner() {
    }
    public int selectChannel(SerializationDelegate<StreamRecord<T>> record) {
        if (++this.nextChannelToSendTo >= this.numberOfChannels) {
            this.nextChannelToSendTo = 0;
        }
        return this.nextChannelToSendTo;
    }
    public StreamPartitioner<T> copy() {
        return this;
    }
    public String toString() {
        return "RESCALE";
    }
}
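The 2-to-4 and 4-to-2 examples above can be reproduced with a small sketch. It assumes that one parallelism evenly divides the other and that downstream instances are assigned to upstream instances in contiguous blocks; Flink's actual wiring for other ratios is not modeled here. `RescaleMappingSketch`, `downstreamTargets`, and `route` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

final class RescaleMappingSketch {

    /** Downstream instances reachable from one upstream instance (evenly divisible case only). */
    static List<Integer> downstreamTargets(
            int upstreamIndex, int upstreamParallelism, int downstreamParallelism) {
        List<Integer> targets = new ArrayList<>();
        if (downstreamParallelism >= upstreamParallelism) {
            // Fan-out: each upstream instance owns a contiguous block of downstream instances.
            int fanOut = downstreamParallelism / upstreamParallelism;
            for (int i = 0; i < fanOut; i++) {
                targets.add(upstreamIndex * fanOut + i);
            }
        } else {
            // Fan-in: several upstream instances share a single downstream instance.
            int fanIn = upstreamParallelism / downstreamParallelism;
            targets.add(upstreamIndex / fanIn);
        }
        return targets;
    }

    /** Downstream instance chosen for each of the first records, cycling as selectChannel() does. */
    static List<Integer> route(
            int upstreamIndex, int upstreamParallelism, int downstreamParallelism, int records) {
        List<Integer> targets =
                downstreamTargets(upstreamIndex, upstreamParallelism, downstreamParallelism);
        List<Integer> sequence = new ArrayList<>();
        int nextChannelToSendTo = -1;
        for (int i = 0; i < records; i++) {
            if (++nextChannelToSendTo >= targets.size()) {
                nextChannelToSendTo = 0;
            }
            sequence.add(targets.get(nextChannelToSendTo));
        }
        return sequence;
    }
}
```

Under these assumptions, `route(1, 2, 4, 4)` returns `[2, 3, 2, 3]`: the second of two upstream instances alternates between the last two of four downstream instances.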
ForwardPartitioner
Calling method: dataStream.forward();
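Role: Send each record to the downstream operator instance that runs locally alongside the upstream instance. The implementation is the same as GlobalPartitioner, but channel 0 here is the first downstream instance seen locally rather than the first instance globally. Forward partitioning requires the upstream and downstream operators to have the same parallelism.
Principle: The selectChannel() method always returns channel 0.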
Source code:
@Internal
public class ForwardPartitioner<T> extends StreamPartitioner<T> {
    private static final long serialVersionUID = 1L;
    public ForwardPartitioner() {
    }
    public int selectChannel(SerializationDelegate<StreamRecord<T>> record) {
        return 0;
    }
    public StreamPartitioner<T> copy() {
        return this;
    }
    public String toString() {
        return "FORWARD";
    }
}
KeyGroupStreamPartitioner
Calling method: dataStream.keyBy(<KeySelector>);
Role: This is the StreamPartitioner that the keyBy() operator uses under the hood.
Principle: The key is first extracted from the record's value and then hashed twice: the first pass is Java's built-in hashCode(), the second is MurmurHash. The resulting hash modulo the maximum parallelism gives the key group. Multiplying the key group by the operator parallelism and dividing by the maximum parallelism gives the final partition ID.
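The arithmetic can be sketched without Flink. In the sketch below the hash is reduced modulo the maximum parallelism to a key group, and the key group is then scaled to an operator index. `KeyGroupSketch` and its methods are hypothetical names, and `mix` is a stand-in bit mixer, not Flink's MurmurHash implementation, so the concrete channel for a given key will differ from what Flink computes.

```java
final class KeyGroupSketch {

    /** Stand-in integer mixer (assumption): NOT Flink's MurmurHash implementation. */
    static int mix(int hash) {
        hash ^= hash >>> 16;
        hash *= 0x85ebca6b;
        hash ^= hash >>> 13;
        hash *= 0xc2b2ae35;
        hash ^= hash >>> 16;
        // Clear the sign bit so the modulo in keyGroupFor() is never negative.
        return hash & Integer.MAX_VALUE;
    }

    /** Second hash over the key's hashCode(), reduced to a key group in [0, maxParallelism). */
    static int keyGroupFor(Object key, int maxParallelism) {
        return mix(key.hashCode()) % maxParallelism;
    }

    /** Scales a key group to an operator index in [0, parallelism). */
    static int operatorIndexFor(int keyGroup, int parallelism, int maxParallelism) {
        return keyGroup * parallelism / maxParallelism;
    }

    /** Full path from key to downstream channel. */
    static int selectChannel(Object key, int parallelism, int maxParallelism) {
        return operatorIndexFor(keyGroupFor(key, maxParallelism), parallelism, maxParallelism);
    }
}
```

With a parallelism of 4 and a maximum parallelism of 128, key groups 0 to 31 map to operator index 0, 32 to 63 to index 1, and so on, so each operator instance owns a contiguous range of key groups. The same key always maps to the same channel.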
Source code:
CustomPartitionerWrapper
Calling method: dataStream.partitionCustom(<Partitioner>, <key field>);
Role: Apply user-defined partitioning logic.
Principle: The custom partitioning logic is written by implementing the Partitioner interface and is passed to the partitionCustom() method.
For example, a custom partitioner:
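dataStream.partitionCustom(new Partitioner<String>() {
    @Override
    public int partition(String key, int numPartitions) {
        // Route each record by the length of its key.
        return key.length() % numPartitions;
    }
}, 0); // 0 is the position of the key field in the record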
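To see what this particular partitioning function does to a set of keys, the same expression can be run outside Flink. `LengthPartitionSketch` and `assign` are hypothetical names used only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class LengthPartitionSketch {

    /** Same logic as the partitioner above: key length modulo the number of partitions. */
    static int partition(String key, int numPartitions) {
        return key.length() % numPartitions;
    }

    /** Groups keys by the partition they would be sent to, ordered by partition number. */
    static Map<Integer, List<String>> assign(List<String> keys, int numPartitions) {
        Map<Integer, List<String>> byPartition = new TreeMap<>();
        for (String key : keys) {
            byPartition
                    .computeIfAbsent(partition(key, numPartitions), p -> new ArrayList<>())
                    .add(key);
        }
        return byPartition;
    }
}
```

For the keys "a", "ab", "abc", and "abcd" with 3 partitions, the result is `{0=[abc], 1=[a, abcd], 2=[ab]}`. Keys of equal length always land on the same partition, so a stream dominated by one key length would be skewed under this function.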