Implementing Simple Image Captioning with Keras


I found another Keras implementation of image captioning on GitHub, along with a fairly detailed tutorial.

The difference is that the input here is not the raw image; instead, a pretrained CNN is used to extract image features first, and these features are saved for later use.
The Inception-v3 model is used for this feature extraction.

Preprocessing


The model expects inputs of size (299, 299, 3), so the images are preprocessed first. We drop the model's final classification layer and keep global max pooling, so each image is mapped to a vector of shape (1, 2048), which we save to disk.

import os
import pickle
import numpy as np
from tqdm import tqdm
from keras.preprocessing import image
from keras.applications.inception_v3 import InceptionV3

def preprocess(image_path):
    # Load and resize the image to the InceptionV3 input size.
    img = image.load_img(image_path, target_size=(299, 299))
    x = image.img_to_array(img)
    x /= 255
    x = np.expand_dims(x, axis=0)
    return x

# Drop the classification head and keep global max pooling,
# so each image maps to a (1, 2048) feature vector.
model = InceptionV3(weights='imagenet', include_top=False, pooling='max')
print(model.predict(preprocess(train_images[0])).shape)  # (1, 2048)

def extract_features(images):
    # Extract and cache a feature vector for every image, keyed by filename.
    features = {}
    for img in tqdm(images):
        features[os.path.split(img)[1]] = model.predict(preprocess(img)).reshape(-1)
    with open('features.pkl', 'wb') as f:
        pickle.dump(features, f)
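
The later code assumes that the saved features and a caption vocabulary (features, word2id, words, max_len, vocab_size) already exist. Below is a minimal sketch of how they could be built; the captions dict (filename mapped to a list of caption strings wrapped with <start>/<end>) and the id convention (ids start at 1, with 0 reserved for padding) are assumptions here, not something the original post specifies.

from collections import Counter

# Assumption: `captions` maps each image filename to a list of caption
# strings that are already wrapped with "<start>" and "<end>" tokens.
with open('features.pkl', 'rb') as f:
    features = pickle.load(f)

counter = Counter()
for caps in captions.values():
    for cap in caps:
        counter.update(cap.split())

words = [w for w, _ in counter.most_common()]        # output index i -> word
word2id = {w: i + 1 for i, w in enumerate(words)}    # ids start at 1; 0 is the padding id (mask_zero)
vocab_size = len(words)
max_len = max(len(cap.split()) for caps in captions.values() for cap in caps)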

Model


The model takes two inputs: the features of an image and a partial caption; the task is to predict the next word of the caption.

from keras.layers import (Input, Dense, RepeatVector, Embedding, LSTM,
                          TimeDistributed, Concatenate)
from keras.models import Model
from keras.optimizers import Adam
from keras.utils import plot_model

# embedding_size, max_len and vocab_size are assumed to be defined earlier.

# Input 1: the features extracted from the image
inputs1 = Input(shape=(2048,))
# Pass them through a fully connected layer
x1 = Dense(embedding_size, activation='relu')(inputs1)
# Repeat the result max_len times
x1 = RepeatVector(max_len)(x1)

# Input 2: the partial caption
inputs2 = Input(shape=(max_len,))
# Embed the words
x2 = Embedding(vocab_size + 1, embedding_size, input_length=max_len, mask_zero=True)(inputs2)
# Run an LSTM and keep the output at every time step
x2 = LSTM(256, return_sequences=True)(x2)
# Apply a fully connected layer to every time step
x2 = TimeDistributed(Dense(200, activation='relu'))(x2)

# Concatenate the two branches along the last dimension
x3 = Concatenate()([x1, x2])
# Another LSTM, keeping only the last time step
x3 = LSTM(256)(x3)
# Softmax over the vocabulary to get per-word probabilities
outputs = Dense(vocab_size, activation='softmax')(x3)

model = Model([inputs1, inputs2], outputs)
model.compile(loss='categorical_crossentropy', optimizer=Adam(), metrics=['accuracy'])
model.summary()
plot_model(model, show_shapes=True)

In the model graph produced by plot_model, the left branch processes the caption and the right branch the image features.
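
The post does not show how the training pairs are built or how the model is fit. Under the assumptions above (a captions dict and word ids starting at 1, with the one-hot target shifted down by one so that output index j corresponds to words[j]), a rough training sketch could look like this:

from keras.preprocessing import sequence
from keras.utils import to_categorical

def build_training_data(captions, features, word2id, max_len, vocab_size):
    # Every prefix of a caption becomes one sample whose target is the next word.
    X1, X2, y = [], [], []
    for img_name, caps in captions.items():
        feat = features[img_name]
        for cap in caps:
            ids = [word2id[w] for w in cap.split()]
            for i in range(1, len(ids)):
                X1.append(feat)
                X2.append(ids[:i])
                # One-hot over vocab_size classes: output index j <-> words[j] <-> id j + 1
                y.append(to_categorical(ids[i] - 1, num_classes=vocab_size))
    X2 = sequence.pad_sequences(X2, maxlen=max_len, padding='post')
    return np.array(X1), X2, np.array(y)

X1, X2, y = build_training_data(captions, features, word2id, max_len, vocab_size)
model.fit([X1, X2], y, batch_size=128, epochs=20)

For a larger dataset this would normally be wrapped in a data generator rather than materializing every sample in memory.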

Testing


At generation time you can either greedily take the most probable word at each step or use beam search.

from keras.preprocessing import sequence

def predict_captions(image):
    # Greedy decoding: repeatedly feed the caption generated so far and append
    # the most probable next word until <end> or max_len is reached.
    start_word = ["<start>"]
    while True:
        par_caps = [word2id[i] for i in start_word]
        par_caps = sequence.pad_sequences([par_caps], maxlen=max_len, padding='post')
        feature = features[os.path.split(image)[1]]
        preds = model.predict([np.reshape(feature, (1, -1)), np.array(par_caps)])
        word_pred = words[np.argmax(preds[0])]
        start_word.append(word_pred)
        if word_pred == "<end>" or len(start_word) > max_len:
            break
    return ' '.join(start_word[1:-1])

def beam_search_predictions(image, beam_index=5):
    # Beam search: keep the beam_index highest-scoring partial captions,
    # scoring candidates by the sum of the raw predicted probabilities.
    start = [word2id["<start>"]]

    start_word = [[start, 0.0]]

    while len(start_word[0][0]) < max_len:
        temp = []
        for s in start_word:
            par_caps = sequence.pad_sequences([s[0]], maxlen=max_len, padding='post')
            feature = features[os.path.split(image)[1]]
            preds = model.predict([np.reshape(feature, (1, -1)), np.array(par_caps)])
            
            # Expand this candidate with its beam_index most probable next words
            word_preds = np.argsort(preds[0])[-beam_index:]
            for w in word_preds:
                next_cap, prob = s[0][:], s[1]
                next_cap.append(w)
                prob += preds[0][w]
                temp.append([next_cap, prob])
                    
        start_word = temp
        # Sorting according to the probabilities
        start_word = sorted(start_word, reverse=False, key=lambda x: x[1])
        # Getting the top words
        start_word = start_word[-beam_index:]
    
    start_word = start_word[-1][0]
    intermediate_caption = [words[i] for i in start_word]

    final_caption = []
    
    for i in intermediate_caption:
        if i != '<end>':
            final_caption.append(i)
        else:
            break
    
    final_caption = ' '.join(final_caption[1:])
    return final_caption
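
For example (test_images is an assumption about how the test image paths are stored, mirroring train_images above):

img_path = test_images[0]   # any image whose features were extracted earlier
print('Greedy:     ', predict_captions(img_path))
print('Beam search:', beam_search_predictions(img_path, beam_index=5))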

In the end the results are reasonable, though some captions are not very accurate. Here is the code I ran myself.


Reposted from blog.csdn.net/uhauha2929/article/details/82856552