ViLT: Vision-Language Transformer Model Without Convolution and Regional Supervision
Origin blog.csdn.net/qq_27590277/article/details/132399877