## `Reviewed` 1. Issue: Tensor type error when training a word-embedding model on the PTB dataset

 + Keywords: `data type`, `dtype`

 + Problem description: When training a word-embedding model on the PTB dataset with the input layers' `dtype` set to `float32`, a tensor type error is raised as soon as training starts.

 + Error message:

```
<ipython-input-6-daf8837e1db3> in train(use_cuda, train_program, params_dirname)
     37         num_epochs=1,
     38         event_handler=event_handler,
---> 39         feed_order=['firstw', 'secondw', 'thirdw', 'fourthw', 'nextw'])

/usr/local/lib/python3.5/dist-packages/paddle/fluid/contrib/trainer.py in train(self, num_epochs, event_handler, reader, feed_order)
    403         else:
    404             self._train_by_executor(num_epochs, event_handler, reader,
--> 405                                     feed_order)
    406 
    407     def test(self, reader, feed_order):

/usr/local/lib/python3.5/dist-packages/paddle/fluid/contrib/trainer.py in _train_by_executor(self, num_epochs, event_handler, reader, feed_order)
    481             exe = executor.Executor(self.place)
    482             reader = feeder.decorate_reader(reader, multi_devices=False)
--> 483             self._train_by_any_executor(event_handler, exe, num_epochs, reader)
    484 
    485     def _train_by_any_executor(self, event_handler, exe, num_epochs, reader):

/usr/local/lib/python3.5/dist-packages/paddle/fluid/contrib/trainer.py in _train_by_any_executor(self, event_handler, exe, num_epochs, reader)
    510                                       fetch_list=[
    511                                           var.name
--> 512                                           for var in self.train_func_outputs
    513                                       ])
    514                 else:

/usr/local/lib/python3.5/dist-packages/paddle/fluid/executor.py in run(self, program, feed, fetch_list, feed_var_name, fetch_var_name, scope, return_numpy, use_program_cache)
    468 
    469         self._feed_data(program, feed, feed_var_name, scope)
--> 470         self.executor.run(program.desc, scope, 0, True, True)
    471         outs = self._fetch_data(fetch_list, fetch_var_name, scope)
    472         if return_numpy:

EnforceNotMet: Tensor holds the wrong type, it holds f at [/paddle/paddle/fluid/framework/tensor_impl.h:29]
PaddlePaddle Call Stacks: 
```

 + Reproducing the problem: The network's input layers are defined with the `fluid.layers.data` API, each with its own `name`, with `shape` set to `[1]` and `dtype` set to `float32`. Starting training then triggers the error above. The incorrect code is:

```python
first_word = fluid.layers.data(name='firstw', shape=[1], dtype='float32')
second_word = fluid.layers.data(name='secondw', shape=[1], dtype='float32')
third_word = fluid.layers.data(name='thirdw', shape=[1], dtype='float32')
fourth_word = fluid.layers.data(name='fourthw', shape=[1], dtype='float32')
```

 + Solution: The PTB dataset has already converted every word into an integer ID before training, so the input data are integers rather than floating-point numbers; feeding them into `float32` layers is exactly what causes the error. The correct code is below, followed by a quick check of the reader's output:

```python
# PTB words have already been converted to integer IDs, so the data layers use int64
first_word = fluid.layers.data(name='firstw', shape=[1], dtype='int64')
second_word = fluid.layers.data(name='secondw', shape=[1], dtype='int64')
third_word = fluid.layers.data(name='thirdw', shape=[1], dtype='int64')
fourth_word = fluid.layers.data(name='fourthw', shape=[1], dtype='int64')
```
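
The quickest way to confirm this is to look at one raw training sample. Below is a minimal sketch assuming the fluid-era `paddle.dataset.imikolov` (PTB) reader used in the Book's word2vec example; the printed values are only illustrative.

```python
import paddle

# Assumption: the fluid-era paddle.dataset.imikolov module (the PTB reader
# used by the Book's word2vec example) is available.
word_dict = paddle.dataset.imikolov.build_dict()
reader = paddle.dataset.imikolov.train(word_dict, 5)   # 5-gram samples
sample = next(reader())
print(sample)           # a tuple of five integer word IDs, e.g. (9, 41, 27, 1017, 2)
print(type(sample[0]))  # <class 'int'> -- integers, hence dtype='int64' in the data layers
```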

 + Going further: PaddlePaddle input layers support `float`, `int`, `uint`, and `bool` data types, but not strings, so training data must first be converted into one of these types. This is why the PTB dataset maps each word from a string to an integer ID.

 + Analysis: When building a neural network, as with writing any program, the details matter; one wrong detail can keep the program from running at all, and in deep-learning code an incorrect data type is one of the most common mistakes. To avoid it, be familiar with the data types of your training data. If you are not, the simplest remedy is to print the data's type and shape before use, so that you can write a matching `fluid.layers.data` definition (see the sketch below).
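
For example, a habit like the following makes the mismatch obvious before the data layers are even written. The mini-batch of word IDs here is hypothetical, not taken from the original model:

```python
import numpy as np

# Hypothetical mini-batch of word IDs, shaped [batch_size, 1] as the feeder expects.
batch_words = np.array([[3], [127], [56], [9]])
print(batch_words.dtype)  # int64 on a 64-bit build -> declare dtype='int64', not 'float32'
print(batch_words.shape)  # (4, 1) -> per-sample shape is [1]
```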


## `Reviewed` 2. Issue: Training fails when the embedding's data type is set to an integer type

 + Keywords: `data type`, `word embedding`

 + Problem description: An N-gram network is defined to train on the PTB dataset, using PaddlePaddle's built-in `fluid.layers.embedding` API to compute the word embeddings. Setting that layer's data type to `int64` causes an error.

 + Error message:

```
<ipython-input-6-daf8837e1db3> in train(use_cuda, train_program, params_dirname)
     31         # optimizer=fluid.optimizer.SGD(learning_rate=0.001),
     32         optimizer_func=optimizer_func,
---> 33         place=place)
     34 
     35     trainer.train(

/usr/local/lib/python3.5/dist-packages/paddle/fluid/contrib/trainer.py in __init__(self, train_func, optimizer_func, param_path, place, parallel, checkpoint_config)
    280         with self._prog_and_scope_guard():
    281             exe = executor.Executor(place)
--> 282             exe.run(self.startup_program)
    283 
    284         if self.checkpoint_cfg and self.checkpoint_cfg.load_serial is not None:

/usr/local/lib/python3.5/dist-packages/paddle/fluid/executor.py in run(self, program, feed, fetch_list, feed_var_name, fetch_var_name, scope, return_numpy, use_program_cache)
    468 
    469         self._feed_data(program, feed, feed_var_name, scope)
--> 470         self.executor.run(program.desc, scope, 0, True, True)
    471         outs = self._fetch_data(fetch_list, fetch_var_name, scope)
    472         if return_numpy:

EnforceNotMet: op uniform_random does not have kernel for data_type[int64_t]:data_layout[ANY_LAYOUT]:place[CPUPlace]:library_type[PLAIN] at [/paddle/paddle/fluid/framework/operator.cc:733]
PaddlePaddle Call Stacks: 
```

 + Reproducing the problem: The word embeddings are defined with the `fluid.layers.embedding` API, with the `dtype` parameter set to `int64` and `size` set to `[vocabulary size, embedding dimension]`. Training then fails with the error above. The incorrect code is:

```python
embed_first = fluid.layers.embedding(
    input=first_word,
    size=[dict_size, EMBED_SIZE],
    dtype='int64',
    is_sparse=is_sparse,
    param_attr='shared_w')
embed_second = fluid.layers.embedding(
    input=second_word,
    size=[dict_size, EMBED_SIZE],
    dtype='int64',
    is_sparse=is_sparse,
    param_attr='shared_w')
embed_third = fluid.layers.embedding(
    input=third_word,
    size=[dict_size, EMBED_SIZE],
    dtype='int64',
    is_sparse=is_sparse,
    param_attr='shared_w')
embed_fourth = fluid.layers.embedding(
    input=fourth_word,
    size=[dict_size, EMBED_SIZE],
    dtype='int64',
    is_sparse=is_sparse,
    param_attr='shared_w')
```

 + Solution: Although the input layers feed `int64` word IDs, the word vectors themselves are `float32` values; the error comes from mistakenly assuming that the embedding shares the input layers' integer dtype. As the traceback shows, the failure occurs while the startup program initializes the parameters: the embedding table is initialized by a `uniform_random` op, which has no `int64` kernel. The correct code is below, followed by a small numpy analogy that illustrates the distinction:

```python
# `dtype` here is the type of the embedding output (the word vectors), not of the input IDs
embed_first = fluid.layers.embedding(
    input=first_word,
    size=[dict_size, EMBED_SIZE],
    dtype='float32',
    is_sparse=is_sparse,
    param_attr='shared_w')
embed_second = fluid.layers.embedding(
    input=second_word,
    size=[dict_size, EMBED_SIZE],
    dtype='float32',
    is_sparse=is_sparse,
    param_attr='shared_w')
embed_third = fluid.layers.embedding(
    input=third_word,
    size=[dict_size, EMBED_SIZE],
    dtype='float32',
    is_sparse=is_sparse,
    param_attr='shared_w')
embed_fourth = fluid.layers.embedding(
    input=fourth_word,
    size=[dict_size, EMBED_SIZE],
    dtype='float32',
    is_sparse=is_sparse,
    param_attr='shared_w')
```
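
To see why the two dtypes differ, it helps to look at what an embedding lookup actually does. The numpy sketch below is only an analogy (the names `dict_size` and `EMBED_SIZE` are reused from the snippet above, with made-up values): integer indices select rows from a floating-point table, so the indices are `int64` while the table, the word vectors themselves, is `float32`.

```python
import numpy as np

# Toy stand-ins for the real values; only the dtypes matter here.
dict_size, EMBED_SIZE = 2000, 32

# The embedding parameter (what 'shared_w' stores) is a float32 matrix.
table = np.random.rand(dict_size, EMBED_SIZE).astype('float32')

# The data layers feed int64 word IDs, which only index into the table.
word_ids = np.array([3, 127, 56, 9], dtype='int64')
vectors = table[word_ids]   # row lookup

print(vectors.shape)  # (4, 32)
print(vectors.dtype)  # float32 -- this is what `dtype` in fluid.layers.embedding refers to
```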

 + Going further: A word-embedding model maps a one-hot vector to a lower-dimensional real-valued vector (an embedding vector), for example `embedding(Mother's Day) = [0.3, 4.2, −1.5, ...]` and `embedding(carnation) = [0.2, 5.6, −2.3, ...]`. In this real-valued representation, we want the vectors of two semantically (or usage-wise) similar words to be "closer" to each other.

 + Analysis: In NLP, word embeddings are a low-level building block that supports many higher-level techniques, such as RNNs and LSTMs, whose inputs are vectors produced by an embedding layer. Encoding words as dense vectors that preserve their semantic information is preferable: one-hot vectors are simple, but they run into the curse of dimensionality and the semantic gap, since every pair of distinct one-hot vectors looks equally unrelated, as the toy example below illustrates.
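
A toy comparison of the two representations (the vocabulary size, word indices, and embedding dimension are all made up for illustration):

```python
import numpy as np

dict_size, EMBED_SIZE = 2000, 32  # made-up sizes for illustration

# One-hot: as long as the vocabulary, and any two distinct words are orthogonal,
# so the representation carries no similarity information (the "semantic gap").
one_hot = np.eye(dict_size, dtype='float32')
print(one_hot[3].shape)                  # (2000,)
print(float(one_hot[3] @ one_hot[127]))  # 0.0 for every pair of different words

# Dense embedding: far lower dimension, and after training similar words can end
# up with similar vectors. Random values here merely stand in for learned ones.
dense = np.random.rand(EMBED_SIZE).astype('float32')
print(dense.shape)                       # (32,)
```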
 

Reposted from blog.csdn.net/PaddlePaddle/article/details/87929216