LSTM classification for VQA (classifying the combined image and text features to find the best answer)

Now let's try an LSTM, the model we covered in the earlier Deep Learning section, which currently performs quite well on language modeling.

First, unlike the text processing in the MLP we just built, here we take time steps into account, i.e. the order in which words appear in context. I'll write down the implementation first; the data isn't uploaded yet, since I'm still processing and debugging it.
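
Concretely, instead of averaging word vectors as the MLP did, each question becomes a sequence of word vectors zero-padded to max_len time steps. Below is a minimal sketch of what get_questions_tensor_timeseries (used in the training loop later) is expected to do; the padding scheme is my assumption based on the call sites:

import numpy as np

def get_questions_tensor_timeseries(questions, nlp, timesteps):
    # returns a (num_questions, timesteps, word_vec_dim) tensor where row i
    # holds the spaCy word vectors of question i, zero-padded/truncated to
    # timesteps tokens (sketch -- the real helper lives in the project code)
    word_vec_dim = nlp(u'word')[0].vector.shape[0]
    tensor = np.zeros((len(questions), timesteps, word_vec_dim))
    for i, q in enumerate(questions):
        for j, token in enumerate(nlp(q)):
            if j < timesteps:
                tensor[i, j, :] = token.vector
    return tensor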

import sys
from random import shuffle
import argparse

import numpy as np
import scipy.io  # needed for loadmat when loading the VGG features below

from keras.models import Sequential
from keras.layers.core import Dense, Activation, Dropout, Reshape
from keras.layers import Merge
from keras.layers.recurrent import LSTM
from keras.utils import np_utils, generic_utils

from sklearn import preprocessing
from sklearn.externals import joblib

from spacy.en import English
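
# Note: the helpers used below -- grouper, get_questions_tensor_timeseries,
# get_images_matrix, get_answers_matrix -- are not defined in this snippet;
# they live in the original project's utils.py / features.py. The exact
# import is an assumption:
# from utils import grouper
# from features import get_questions_tensor_timeseries, get_images_matrix, get_answers_matrix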

# Hyperparameters
max_len = 30
word_vec_dim= 300
img_dim = 4096
dropout = 0.5
activation_mlp = 'tanh'
num_epochs = 1
model_save_interval = 5

num_hidden_units_mlp = 1024
num_hidden_units_lstm = 512
num_hidden_layers_mlp = 3
num_hidden_layers_lstm = 1
batch_size = 128

# First, build an image model dedicated to the image features
image_model = Sequential()
image_model.add(Reshape((img_dim,), input_shape = (img_dim,)))

# Next, a language model dedicated to the text;
# only the language branch needs an LSTM.
language_model = Sequential()
if num_hidden_layers_lstm == 1:
    language_model.add(LSTM(output_dim = num_hidden_units_lstm, return_sequences=False, input_shape=(max_len, word_vec_dim)))
else:
    language_model.add(LSTM(output_dim = num_hidden_units_lstm, return_sequences=True, input_shape=(max_len, word_vec_dim)))
    for i in xrange(num_hidden_layers_lstm-2):
        language_model.add(LSTM(output_dim = num_hidden_units_lstm, return_sequences=True))
    language_model.add(LSTM(output_dim = num_hidden_units_lstm, return_sequences=False))
    
# Next, merge the two models above and
# do the final "classification" step
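# Note: nb_classes (the number of answer classes, used in the final Dense
# layer) is not defined in this snippet. In the original project it comes
# from a scikit-learn LabelEncoder fit on the training answers, which is
# also the labelencoder used in the training loop below; a sketch (the
# pickle path is an assumption):
# labelencoder = joblib.load('../data/labelencoder.pkl')
# nb_classes = len(labelencoder.classes_)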
model = Sequential()
model.add(Merge([language_model, image_model], mode='concat', concat_axis=1))
for i in xrange(num_hidden_layers_mlp):
    model.add(Dense(num_hidden_units_mlp, init='uniform'))
    model.add(Activation(activation_mlp))
    model.add(Dropout(dropout))
model.add(Dense(nb_classes))
model.add(Activation('softmax'))

# As before, save the model architecture to disk
json_string = model.to_json()
model_file_name = '../data/lstm_1_num_hidden_units_lstm_' + str(num_hidden_units_lstm) + \
                    '_num_hidden_units_mlp_' + str(num_hidden_units_mlp) + '_num_hidden_layers_mlp_' + \
                    str(num_hidden_layers_mlp) + '_num_hidden_layers_lstm_' + str(num_hidden_layers_lstm)
open(model_file_name + '.json', 'w').write(json_string)

model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
print 'Compilation done'
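
As a quick sanity check before real training, you can push random arrays of the right shapes through the compiled model; the merged model takes a list of [question tensor, image matrix]. This is a sketch of mine, not part of the original script, assuming nb_classes was defined as above:

# sanity check with random inputs (sketch)
dummy_q = np.random.rand(2, max_len, word_vec_dim).astype('float32')
dummy_i = np.random.rand(2, img_dim).astype('float32')
print model.predict([dummy_q, dummy_i]).shape   # expect (2, nb_classes)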

Now we can start training:
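
The batching in the loop below relies on the grouper helper; a minimal sketch, which is just the standard itertools recipe (Python 2):

from itertools import izip_longest

def grouper(iterable, n, fillvalue=None):
    # collect data into fixed-length chunks, padding the last one with fillvalue:
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return izip_longest(*args, fillvalue=fillvalue)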

# vgg_model_path should point at the precomputed VGG features .mat file
# (e.g. '../data/coco/vgg_feats.mat' -- the exact path is an assumption)
features_struct = scipy.io.loadmat(vgg_model_path)
VGGfeatures = features_struct['feats']
print 'loaded vgg features'
image_ids = open('../data/coco_vgg_IDMap.txt').read().splitlines()
img_map = {}
for ids in image_ids:
    id_split = ids.split()
    img_map[id_split[0]] = int(id_split[1])

nlp = English()
print 'loaded word2vec features...'
## training
print 'Training started...'
for k in xrange(num_epochs):
    
    progbar = generic_utils.Progbar(len(questions_train))
    
    for qu_batch,an_batch,im_batch in zip(grouper(questions_train, batch_size, fillvalue=questions_train[-1]), 
                                            grouper(answers_train, batch_size, fillvalue=answers_train[-1]), 
                                            grouper(images_train, batch_size, fillvalue=images_train[-1])):

        X_q_batch = get_questions_tensor_timeseries(qu_batch, nlp, max_len)
        X_i_batch = get_images_matrix(im_batch, img_map, VGGfeatures)
        Y_batch = get_answers_matrix(an_batch, labelencoder)
        loss = model.train_on_batch([X_q_batch, X_i_batch], Y_batch)
        progbar.add(batch_size, values=[("train loss", loss)])


    if k%model_save_interval == 0:
        model.save_weights(model_file_name + '_epoch_{:03d}.hdf5'.format(k))

model.save_weights(model_file_name + '_epoch_{:03d}.hdf5'.format(k))

loaded vgg features
loaded word2vec features...
Training started...
215552/215519 [==============================] - 282s - train loss: 3.4884
Again, we only trained for one epoch.
You can try bumping the epoch count to 150 (about 20-odd minutes on my GTX 1060 6GB).

The result is a bit better than the MLP's, though the MLP I tuned earlier was also decent (it's simple and brute-force, after all).
Feel free to run more experiments and compare the results.
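
If you want to reuse a trained model later instead of retraining, reloading looks roughly like this (a sketch; questions_test and images_test are hypothetical variables for a held-out split, and labelencoder is the LabelEncoder mentioned above):

from keras.models import model_from_json

model = model_from_json(open(model_file_name + '.json').read())
model.load_weights(model_file_name + '_epoch_000.hdf5')  # pick a saved checkpoint
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

X_q = get_questions_tensor_timeseries(questions_test, nlp, max_len)  # hypothetical test split
X_i = get_images_matrix(images_test, img_map, VGGfeatures)
y_pred = model.predict_classes([X_q, X_i], verbose=0)
print labelencoder.inverse_transform(y_pred)  # map class ids back to answer strings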
