## `Reviewed` 1. Issue: Tensor type error when training a word-embedding model on the PTB dataset

 + Keywords: `data type`, `dtype`

 + Problem description: When training a word-embedding model on the PTB dataset with the input layers' `dtype` set to `float32`, a tensor type error is raised as soon as training starts.

 + Error message:

```
<ipython-input-6-daf8837e1db3> in train(use_cuda, train_program, params_dirname)
     37         num_epochs=1,
     38         event_handler=event_handler,
---> 39         feed_order=['firstw', 'secondw', 'thirdw', 'fourthw', 'nextw'])

/usr/local/lib/python3.5/dist-packages/paddle/fluid/contrib/trainer.py in train(self, num_epochs, event_handler, reader, feed_order)
    403         else:
    404             self._train_by_executor(num_epochs, event_handler, reader,
--> 405                                     feed_order)
    406 
    407     def test(self, reader, feed_order):

/usr/local/lib/python3.5/dist-packages/paddle/fluid/contrib/trainer.py in _train_by_executor(self, num_epochs, event_handler, reader, feed_order)
    481             exe = executor.Executor(self.place)
    482             reader = feeder.decorate_reader(reader, multi_devices=False)
--> 483             self._train_by_any_executor(event_handler, exe, num_epochs, reader)
    484 
    485     def _train_by_any_executor(self, event_handler, exe, num_epochs, reader):

/usr/local/lib/python3.5/dist-packages/paddle/fluid/contrib/trainer.py in _train_by_any_executor(self, event_handler, exe, num_epochs, reader)
    510                                       fetch_list=[
    511                                           var.name
--> 512                                           for var in self.train_func_outputs
    513                                       ])
    514                 else:

/usr/local/lib/python3.5/dist-packages/paddle/fluid/executor.py in run(self, program, feed, fetch_list, feed_var_name, fetch_var_name, scope, return_numpy, use_program_cache)
    468 
    469         self._feed_data(program, feed, feed_var_name, scope)
--> 470         self.executor.run(program.desc, scope, 0, True, True)
    471         outs = self._fetch_data(fetch_list, fetch_var_name, scope)
    472         if return_numpy:

EnforceNotMet: Tensor holds the wrong type, it holds f at [/paddle/paddle/fluid/framework/tensor_impl.h:29]
PaddlePaddle Call Stacks: 
```

 + Reproducing the problem: The network's input layers are defined with the `fluid.layers.data` API, each with its own `name`, with `shape` set to `[1]` and `dtype` set to `float32`. Starting training then triggers the error above. The incorrect code is:

```python
first_word = fluid.layers.data(name='firstw', shape=[1], dtype='float32')
second_word = fluid.layers.data(name='secondw', shape=[1], dtype='float32')
third_word = fluid.layers.data(name='thirdw', shape=[1], dtype='float32')
fourth_word = fluid.layers.data(name='fourthw', shape=[1], dtype='float32')
```

 + Solution: The PTB dataset has already converted every word into an integer ID before training, so the input data are integers rather than floating-point numbers; feeding them into `float32` layers is exactly what causes the error. The correct code is below, followed by a quick check of the reader's output:

```python
# PTB words have already been converted to integer IDs, so the data layers use int64
first_word = fluid.layers.data(name='firstw', shape=[1], dtype='int64')
second_word = fluid.layers.data(name='secondw', shape=[1], dtype='int64')
third_word = fluid.layers.data(name='thirdw', shape=[1], dtype='int64')
fourth_word = fluid.layers.data(name='fourthw', shape=[1], dtype='int64')
```
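
The quickest way to confirm this is to look at one raw training sample. Below is a minimal sketch assuming the fluid-era `paddle.dataset.imikolov` (PTB) reader used in the Book's word2vec example; the printed values are only illustrative.

```python
import paddle

# Assumption: the fluid-era paddle.dataset.imikolov module (the PTB reader
# used by the Book's word2vec example) is available.
word_dict = paddle.dataset.imikolov.build_dict()
reader = paddle.dataset.imikolov.train(word_dict, 5)   # 5-gram samples
sample = next(reader())
print(sample)           # a tuple of five integer word IDs, e.g. (9, 41, 27, 1017, 2)
print(type(sample[0]))  # <class 'int'> -- integers, hence dtype='int64' in the data layers
```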

 + Going further: PaddlePaddle input layers support `float`, `int`, `uint`, and `bool` data types, but not strings, so training data must first be converted into one of these types. This is why the PTB dataset maps each word from a string to an integer ID.

 + Analysis: When building a neural network, as with writing any program, the details matter; one wrong detail can keep the program from running at all, and in deep-learning code an incorrect data type is one of the most common mistakes. To avoid it, be familiar with the data types of your training data. If you are not, the simplest remedy is to print the data's type and shape before use, so that you can write a matching `fluid.layers.data` definition (see the sketch below).
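
For example, a habit like the following makes the mismatch obvious before the data layers are even written. The mini-batch of word IDs here is hypothetical, not taken from the original model:

```python
import numpy as np

# Hypothetical mini-batch of word IDs, shaped [batch_size, 1] as the feeder expects.
batch_words = np.array([[3], [127], [56], [9]])
print(batch_words.dtype)  # int64 on a 64-bit build -> declare dtype='int64', not 'float32'
print(batch_words.shape)  # (4, 1) -> per-sample shape is [1]
```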


## `Reviewed` 2. Issue: Training fails when the embedding's data type is set to an integer type

 + Keywords: `data type`, `word embedding`

 + Problem description: An N-gram network is defined to train on the PTB dataset, using PaddlePaddle's built-in `fluid.layers.embedding` API to compute the word embeddings. Setting that layer's data type to `int64` causes an error.

 + Error message:

```
<ipython-input-6-daf8837e1db3> in train(use_cuda, train_program, params_dirname)
     31         # optimizer=fluid.optimizer.SGD(learning_rate=0.001),
     32         optimizer_func=optimizer_func,
---> 33         place=place)
     34 
     35     trainer.train(

/usr/local/lib/python3.5/dist-packages/paddle/fluid/contrib/trainer.py in __init__(self, train_func, optimizer_func, param_path, place, parallel, checkpoint_config)
    280         with self._prog_and_scope_guard():
    281             exe = executor.Executor(place)
--> 282             exe.run(self.startup_program)
    283 
    284         if self.checkpoint_cfg and self.checkpoint_cfg.load_serial is not None:

/usr/local/lib/python3.5/dist-packages/paddle/fluid/executor.py in run(self, program, feed, fetch_list, feed_var_name, fetch_var_name, scope, return_numpy, use_program_cache)
    468 
    469         self._feed_data(program, feed, feed_var_name, scope)
--> 470         self.executor.run(program.desc, scope, 0, True, True)
    471         outs = self._fetch_data(fetch_list, fetch_var_name, scope)
    472         if return_numpy:

EnforceNotMet: op uniform_random does not have kernel for data_type[int64_t]:data_layout[ANY_LAYOUT]:place[CPUPlace]:library_type[PLAIN] at [/paddle/paddle/fluid/framework/operator.cc:733]
PaddlePaddle Call Stacks: 
```

 + Reproducing the problem: The word embeddings are defined with the `fluid.layers.embedding` API, with the `dtype` parameter set to `int64` and `size` set to `[vocabulary size, embedding dimension]`. Training then fails with the error above. The incorrect code is:

```python
embed_first = fluid.layers.embedding(
    input=first_word,
    size=[dict_size, EMBED_SIZE],
    dtype='int64',
    is_sparse=is_sparse,
    param_attr='shared_w')
embed_second = fluid.layers.embedding(
    input=second_word,
    size=[dict_size, EMBED_SIZE],
    dtype='int64',
    is_sparse=is_sparse,
    param_attr='shared_w')
embed_third = fluid.layers.embedding(
    input=third_word,
    size=[dict_size, EMBED_SIZE],
    dtype='int64',
    is_sparse=is_sparse,
    param_attr='shared_w')
embed_fourth = fluid.layers.embedding(
    input=fourth_word,
    size=[dict_size, EMBED_SIZE],
    dtype='int64',
    is_sparse=is_sparse,
    param_attr='shared_w')
```

 + Solution: Although the input layers feed `int64` word IDs, the word vectors themselves are `float32` values; the error comes from mistakenly assuming that the embedding shares the input layers' integer dtype. As the traceback shows, the failure occurs while the startup program initializes the parameters: the embedding table is initialized by a `uniform_random` op, which has no `int64` kernel. The correct code is below, followed by a small numpy analogy that illustrates the distinction:

```python
# `dtype` here is the type of the embedding output (the word vectors), not of the input IDs
embed_first = fluid.layers.embedding(
    input=first_word,
    size=[dict_size, EMBED_SIZE],
    dtype='float32',
    is_sparse=is_sparse,
    param_attr='shared_w')
embed_second = fluid.layers.embedding(
    input=second_word,
    size=[dict_size, EMBED_SIZE],
    dtype='float32',
    is_sparse=is_sparse,
    param_attr='shared_w')
embed_third = fluid.layers.embedding(
    input=third_word,
    size=[dict_size, EMBED_SIZE],
    dtype='float32',
    is_sparse=is_sparse,
    param_attr='shared_w')
embed_fourth = fluid.layers.embedding(
    input=fourth_word,
    size=[dict_size, EMBED_SIZE],
    dtype='float32',
    is_sparse=is_sparse,
    param_attr='shared_w')
```
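
To see why the two dtypes differ, it helps to look at what an embedding lookup actually does. The numpy sketch below is only an analogy (the names `dict_size` and `EMBED_SIZE` are reused from the snippet above, with made-up values): integer indices select rows from a floating-point table, so the indices are `int64` while the table, the word vectors themselves, is `float32`.

```python
import numpy as np

# Toy stand-ins for the real values; only the dtypes matter here.
dict_size, EMBED_SIZE = 2000, 32

# The embedding parameter (what 'shared_w' stores) is a float32 matrix.
table = np.random.rand(dict_size, EMBED_SIZE).astype('float32')

# The data layers feed int64 word IDs, which only index into the table.
word_ids = np.array([3, 127, 56, 9], dtype='int64')
vectors = table[word_ids]   # row lookup

print(vectors.shape)  # (4, 32)
print(vectors.dtype)  # float32 -- this is what `dtype` in fluid.layers.embedding refers to
```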

 + Going further: A word-embedding model maps a one-hot vector to a lower-dimensional real-valued vector (an embedding vector), for example `embedding(Mother's Day) = [0.3, 4.2, −1.5, ...]` and `embedding(carnation) = [0.2, 5.6, −2.3, ...]`. In this real-valued representation, we want the vectors of two semantically (or usage-wise) similar words to be "closer" to each other.

 + Analysis: In NLP, word embeddings are a low-level building block that supports many higher-level techniques, such as RNNs and LSTMs, whose inputs are vectors produced by an embedding layer. Encoding words as dense vectors that preserve their semantic information is preferable: one-hot vectors are simple, but they run into the curse of dimensionality and the semantic gap, since every pair of distinct one-hot vectors looks equally unrelated, as the toy example below illustrates.
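
A toy comparison of the two representations (the vocabulary size, word indices, and embedding dimension are all made up for illustration):

```python
import numpy as np

dict_size, EMBED_SIZE = 2000, 32  # made-up sizes for illustration

# One-hot: as long as the vocabulary, and any two distinct words are orthogonal,
# so the representation carries no similarity information (the "semantic gap").
one_hot = np.eye(dict_size, dtype='float32')
print(one_hot[3].shape)                  # (2000,)
print(float(one_hot[3] @ one_hot[127]))  # 0.0 for every pair of different words

# Dense embedding: far lower dimension, and after training similar words can end
# up with similar vectors. Random values here merely stand in for learned ones.
dense = np.random.rand(EMBED_SIZE).astype('float32')
print(dense.shape)                       # (32,)
```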
 

Reposted from blog.csdn.net/PaddlePaddle/article/details/87929216