问题解决
运行代码
public class JavaSourceEx {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// 1)使用fromCollection(Collection) 读取数据
//ArrayList<String> list = new ArrayList<>();
//list.add("hello");list.add("word");list.add("cctv");
//DataStreamSource<String> stream01 = env.fromCollection(list);
// 2)使用fromCollection(Iterator, Class) 读取数据
Iterator<String> it = list.iterator();
DataStreamSource<String> stream02 = env.fromCollection(it, TypeInformation.of(String.class));
stream02.print().setParallelism(1);
env.execute();
}
}
报错内容
Exception in thread "main" org.apache.flink.api.common.InvalidProgramException: java.util.ArrayList$Itr@a1cdc6d is not serializable. The implementation accesses fields of its enclosing class, which is a common reason for non-serializability. A common solution is to make the function a proper (non-inner) class, or a static inner class.
at org.apache.flink.api.java.ClosureCleaner.clean(ClosureCleaner.java:164)
at org.apache.flink.api.java.ClosureCleaner.clean(ClosureCleaner.java:132)
at org.apache.flink.api.java.ClosureCleaner.clean(ClosureCleaner.java:69)
at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.clean(StreamExecutionEnvironment.java:2053)
at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.addSource(StreamExecutionEnvironment.java:1737)
at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.fromCollection(StreamExecutionEnvironment.java:1147)
at Examples.JavaSourceEx.main(JavaSourceEx.java:30)
Caused by: java.io.NotSerializableException: java.util.ArrayList$Itr
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1184)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at org.apache.flink.util.InstantiationUtil.serializeObject(InstantiationUtil.java:624)
at org.apache.flink.api.java.ClosureCleaner.clean(ClosureCleaner.java:143)
... 6 more
Process finished with exit code 1
解决方式
若是从内部容器中读取数据:
1)flink官方还提供如下方法:fromCollection(Collection)
,可以将从迭代器读取数据的方法替换为该方法;
2)报错原因是因为迭代器未实现序列机接口。容器已实现序列化,但是迭代器未实现,所以若想使用,需要自定义迭代器并实现序列化接口,该操作比较多余,所以建议按照第一种方式解决;
自定义实现序列化
package Examples.Utils;
import com.sun.org.apache.xpath.internal.functions.WrongNumberArgsException;
import com.sun.tools.jdi.EventSetImpl;
import java.io.Serializable;
import java.util.Arrays;
import java.util.Iterator;
public class MyListItr<T> implements Serializable{
private static int default_capacity = 10;
private int size = 0;
private Object[] elements;
public MyListItr(){
this.elements = new Object[default_capacity];
}
public MyListItr(int capa){
this.default_capacity = capa;
this.elements = new Object[default_capacity];
}
public int size(){
return this.size;
}
public T get(int index) throws MyException {
if(index<0){
throw new MyException("Index given cannot be less than 0");
}
if(index>=size){
throw new MyException("Index given cannot be larger than or equal to the collection size");
}
return (T)elements[index];
}
public T add(T ele){
if(size == default_capacity){
elements = Arrays.copyOf(elements,default_capacity*2);
default_capacity *=2;
elements[size++] = ele;
}else{
elements[size++] = ele;
}
return (T)ele;
}
public Iterator iterator(){
return new Itr();
}
private class Itr implements Iterator<T>, Serializable {
int cursor;
Itr(){
}
@Override
public boolean hasNext() {
return cursor!=size();
}
@Override
public T next() {
return (T)elements[cursor++];
}
}
public static void main(String[] args) throws MyException {
MyListItr<Integer> obj = new MyListItr<>();
obj.add(1);
obj.add(2);
System.out.println(obj.get(0));
obj.add(3);
Iterator it = obj.iterator();
while(it.hasNext()){
System.out.println(it.next());
}
}
}
class MyException extends Exception implements Serializable{
public MyException(String message) {
super(message);
}
}
重新执行代码
//实现序列化
MyListItr<Integer> myList = new MyListItr<>();
myList.add(1);myList.add(2);myList.add(3);
Iterator<Integer> it02 = myList.iterator();
DataStreamSource<Integer> stream02 = env.fromCollection(it02, TypeInformation.of(Integer.class));
stream02.print().setParallelism(1);
env.execute();
可以运行:
问题深入
为什么会抛出该错误
说点深入的,java需要运行在jvm平台上,并且以字节码的形式被JVM所解释运行。flink是分布式计算,所以map等算子内的数据会在各个网络节点中分发进行计算。另外当flink的源代码编译为字节码文件后,可以从算子部分字节码文件中看到会读取对象进入该算子中,进入算子的所有对象都要实现序列化。如果没有序列化,就会抛出错误。
为什么需要序列化
在分布式计算中,比如spark、mapreduce、flink等计算前提都需要实现计算对象的可序列化。序列化是为了在网络节点中减少数据传输和交换带来的延迟、损失、资源消耗。未序列化的对象将无法在网络节点中分发。
该错误抛出源
这个错误是flink在执行闭包清理逻辑时报错的。具体逻辑在这个类中:org.apache.flink.api.java.ClosureCleaner
。
为什么要清理闭包
很多时候为了方便快捷会选择使用匿名类或者嵌套子类。那么当类A需要被序列化传输的时候,就同时也需要内部子类也可以被序列化,但是一般嵌套类内部可能会引用一些不必要的类或者不必要的变量信息,那么flink有必要进行清理,节省序列化的开销。