Debugging CTC loss problems in TensorFlow and Keras

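I have been working on speech recognition lately, and CTC loss is unavoidable there. The Keras and TensorFlow documentation for it is extremely terse, so I was stuck on the two functions below for several days before I finally got them working after trying many things. This post records the problems I ran into and how I solved them.
The functions are tf.nn.ctc_loss and keras.backend.ctc_batch_cost.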

Corrections are welcome.

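1. First, understand what CTC loss is

Here are the blog posts I found reasonably easy to follow when I was learning about CTC loss.

Read this one first:
https://towardsdatascience.com/intuitively-understanding-connectionist-temporal-classification-3797e43a86c
Then this one:
https://distill.pub/2017/ctc/
After these two you should basically understand what CTC does. In short, it performs a comparison: it measures how far the predicted sequence (y_pred) is from the label sequence (y_true). Note that the label sequence is fixed, whereas the prediction holds, for every position in the sequence, a probability for each class. The loss is the negative log of the total probability of all frame-level alignments that collapse to the label sequence.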
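To make "alignments that collapse to the label sequence" concrete, here is a minimal sketch of the CTC collapse rule: merge consecutive repeats, then drop blanks. The function name ctc_collapse and the blank index 5 are assumptions chosen for illustration only.

```python
from itertools import groupby

def ctc_collapse(path, blank):
    """Map a frame-level path to a label sequence: merge repeats, then drop blanks."""
    merged = [k for k, _ in groupby(path)]      # collapse runs of identical symbols
    return [c for c in merged if c != blank]    # remove the blank symbol

BLANK = 5  # assumed: 5 real classes (0-4) plus one blank
a = ctc_collapse([4, 4, 5, 2, 2, 1], BLANK)
b = ctc_collapse([1, 5, 1, 5, 5], BLANK)
print(a)  # [4, 2, 1]
print(b)  # [1, 1]
```

The second call shows why a blank is needed between two identical labels: without it the two 1s would be merged into one.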

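2. Both TF and Keras provide a CTC implementation

In the end I used keras.backend.ctc_batch_cost, but I go through tf.nn.ctc_loss first.

One simple example runs through all of the debugging below.

The batch has 2 samples.
Input and output always have 3 time steps (time_frame).
Each time step has 4 features (n_channel).
There are 5 classes (n_class): 0, 1, 2, 3, 4.

Input
batch_x has shape (batch_size, time_frame, n_channel).
batch_y has shape (batch_size, time_frame); it holds the classes as integer labels.
An example of a possible y:
[[4, 2, 1],
[1, 2, 0]]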

2.1 tf.nn.ctc_loss

https://www.tensorflow.org/api_docs/python/tf/nn/ctc_loss?hl=en

tf.nn.ctc_loss(
    labels,
    logits,
    label_length,
    logit_length,
    logits_time_major=True,
    unique=None,
    blank_index=None,
    name=None
)

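The most important arguments are:
labels: tensor of shape [batch_size, max_label_seq_length] or SparseTensor
Note the SparseTensor option. The older TF 1.x version of this function (now tf.compat.v1.nn.ctc_loss) accepts labels only as a SparseTensor, so there the dense input has to be converted first or an error is raised; with the signature above a dense tensor is also accepted, as long as label_length is supplied. Keras provides tf.keras.backend.ctc_label_dense_to_sparse() for the conversion, but I hit all sorts of puzzling errors when I tried it; I will add details later.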

logits: tensor of shape [frames, batch_size, num_labels], if logits_time_major == False, shape is [batch_size, frames, num_labels].

label_length: tensor of shape [batch_size], None if labels is SparseTensor. Length of reference label sequence in labels.

logit_length: tensor of shape [batch_size] Length of input sequence in logits.

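labels holds the ground truth for one batch as integer class indices.
logits holds the unnormalized per-frame score of each label, i.e. the output of the network's last layer before the softmax. tf.nn.ctc_loss applies the softmax internally, so softmax outputs should not be passed here.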
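The two layouts and the logits/probabilities distinction are easy to mix up, so here is a small NumPy sketch. It only illustrates shapes and normalization and does not call TensorFlow; the 6 output classes (5 real classes plus one blank) and the random values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, frames, num_labels = 2, 3, 6   # assumed: 5 classes + 1 blank

# Raw scores in batch-major layout: [batch_size, frames, num_labels]
logits_batch_major = rng.normal(size=(batch_size, frames, num_labels))

# Time-major layout, as expected when logits_time_major=True: [frames, batch_size, num_labels]
logits_time_major = np.transpose(logits_batch_major, (1, 0, 2))

def softmax(x):
    """Numerically stable softmax over the last axis."""
    shifted = x - x.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

# Per-frame probabilities: this is what a softmax output layer produces.
probs = softmax(logits_batch_major)
print(logits_time_major.shape, probs.sum(axis=-1))
```

Raw scores like logits_batch_major are what tf.nn.ctc_loss expects, while probabilities like probs match the "output of the softmax" that the keras.backend.ctc_batch_cost documentation describes for y_pred.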

2.2 keras.backend.ctc_batch_cost

Under the hood this also runs on the TF backend. Here are the official docs.
https://www.tensorflow.org/api_docs/python/tf/keras/backend/ctc_batch_cost
https://keras.io/backend/#ctc_batch_cost

y_true: tensor (samples, max_string_length) containing the truth labels.
y_pred: tensor (samples, time_steps, num_categories) containing the prediction, or output of the softmax.
input_length: tensor (samples, 1) containing the sequence length for each batch item in y_pred.
label_length: tensor (samples, 1) containing the sequence length for each batch item in y_true.

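Points to note:

y_true is a tensor of integer labels.
y_pred holds, for every frame, the probability of each class.
Also, the third argument, input_length, is the per-sample sequence length of y_pred, while the last argument, label_length, is the sequence length of y_true. The arguments do not pair up as 1-3 and 2-4, but as 1-4 and 2-3.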
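As a sketch of how the two length arguments can be built with the required (samples, 1) shape, the helper below derives them from unpadded label lists and per-sample frame counts. make_length_arrays is a hypothetical helper written for this post, not a Keras function.

```python
import numpy as np

def make_length_arrays(label_seqs, frames_per_sample):
    """Build (input_length, label_length), each of shape (samples, 1)."""
    input_length = np.array([[n] for n in frames_per_sample], dtype=np.int64)  # pairs with y_pred
    label_length = np.array([[len(s)] for s in label_seqs], dtype=np.int64)    # pairs with y_true
    return input_length, label_length

# Two samples, 3 frames each; the second label sequence is shorter.
input_length, label_length = make_length_arrays([[4, 2, 1], [1, 2]], [3, 3])
print(input_length.shape, label_length.ravel())
```

Returning them in the same order as the function's arguments (input_length third, label_length fourth) makes it harder to swap them by accident.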

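For quick model building in Keras, keras.backend.ctc_batch_cost is clearly a higher-level wrapper than the TF function and saves a lot of trouble, so that is what I use.

A small test program, which runs without errors: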

import keras
import tensorflow as tf
import numpy as np

# 5 real classes (0-4) plus one extra last class (index 5) for the CTC blank, see 3.1
y_true = np.array([[4, 2, 1], [2, 3, 0]])                                    # (2, 3)
y_pred = keras.utils.to_categorical(np.array([[4, 1, 3], [1, 2, 4]]), 6)     # (2, 3, 6)

# only the first 2 frames and the first 2 labels of each sample are used
input_length = np.array([[2], [2]])                                         # (2, 1)
label_length = np.array([[2], [2]])                                         # (2, 1)

cost = keras.backend.ctc_batch_cost(y_true, y_pred, input_length, label_length)

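One more thing to watch: from the theory we know that the y_pred sequence must be at least as long as the label sequence, so every item in input_length must be no smaller than the corresponding item in label_length, plus one extra frame for each pair of adjacent identical labels, because CTC has to place a blank between them. Otherwise an error is raised. This follows from how CTC works. (Think of a speech recognition task: the labels may be the English letters of a sentence, while y_pred is a per-frame phoneme prediction, and there are clearly far more frames than letters.)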
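To see where the length requirement comes from, here is a minimal NumPy sketch of the standard CTC forward (alpha) recursion for a single sample. It is a teaching sketch, not the TensorFlow implementation: it works in plain probability space without log-space stabilization, and the function name and the uniform toy probabilities are assumptions.

```python
import numpy as np

def ctc_neg_log_likelihood(probs, label, blank):
    """probs: (T, C) per-frame class probabilities; label: class ids without blanks."""
    T = probs.shape[0]
    ext = [blank]                      # extended sequence: blank, l1, blank, l2, ..., blank
    for c in label:
        ext += [c, blank]
    S = len(ext)
    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, ext[0]]     # start with a blank ...
    if S > 1:
        alpha[0, 1] = probs[0, ext[1]] # ... or with the first label
    for t in range(1, T):
        for s in range(S):
            total = alpha[t - 1, s]                      # stay on the same symbol
            if s >= 1:
                total += alpha[t - 1, s - 1]             # advance by one symbol
            # skipping a blank is only allowed between two different labels
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[t - 1, s - 2]
            alpha[t, s] = total * probs[t, ext[s]]
    p = alpha[T - 1, S - 1] + (alpha[T - 1, S - 2] if S > 1 else 0.0)
    return -np.log(p) if p > 0 else np.inf

uniform = lambda T: np.full((T, 3), 1.0 / 3.0)   # 2 real classes + blank (index 2)
loss_ok = ctc_neg_log_likelihood(uniform(2), [0], blank=2)          # 3 valid paths of 9
loss_short = ctc_neg_log_likelihood(uniform(2), [0, 0], blank=2)    # needs 3 frames, has 2
loss_repeat = ctc_neg_log_likelihood(uniform(3), [0, 0], blank=2)   # only path: 0, blank, 0
print(loss_ok, loss_short, loss_repeat)
```

With two frames the label [0, 0] has no valid alignment at all, so its probability is zero and the loss is infinite; that is the situation the library rejects with an error.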

3. Debugging keras.backend.ctc_batch_cost errors

3.1 Something the docs barely mention: the model's null label

 (0) Invalid argument: Saw a non-null label (index >= num_classes - 1) following a null label, batch: 0 num_classes: 28 labels: 24,22,9,17,3,24,4,26,26,21,18,11,10,7,21,8,7,6,20,17,11,4,13,7,20,23,16,27,17,0,22,22,0,4,2,17,11,21,3,26,11,27,5,21,10,19,12,1,14,3 labels seen so far: 24,22,9,17,3,24,4,26,26,21,18,11,10,7,21,8,7,6,20,17,11,4,13,7,20,23,16
	 [[{{node loss_4/time_distributed_45_loss/CTCLoss}}]]

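In other words, a non-null label appears after a null label, and the message is about the labels (y_true). Since a non-null label is not allowed after a null label, the null label effectively acts as an end-of-sequence marker: in a padded y_true, every sequence should be padded at the end with the null label (the last index of the alphabet).

In practice ctc_loss treats every value greater than or equal to num_classes - 1 as the blank, which amounts to saying "the last entry of your prediction alphabet is the null label of ctc_loss". Values smaller than num_classes - 1 are the real, valid labels.

num_classes is taken from the shape of y_pred: it is shape[2] of y_pred.

From the theory we know that CTC uses a blank symbol '-'. If the number of classes the network predicts equals the number of classes we actually have, the network provides no output for the CTC blank.

Even without such an output, the function still treats any class with index >= num_classes - 1 as the null label, and then the error above can occur.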
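The rule described above can be checked on the data before training. The sketch below re-implements that rule in plain Python as I understand it from the error message; it is not TensorFlow's actual code, and check_labels is a hypothetical helper. The sample labels are an excerpt from the log above (num_classes: 28, where 27 is followed by 17).

```python
def check_labels(labels, num_classes):
    """Return None if the sequence is acceptable, otherwise a description of the problem."""
    blank = num_classes - 1          # the last class index is treated as the null label
    seen_blank = False
    for i, c in enumerate(labels):
        if c >= blank:
            seen_blank = True        # from here on only padding is allowed
        elif seen_blank:
            return f"non-null label {c} at position {i} follows a null label"
    return None

bad = check_labels([23, 16, 27, 17, 0], num_classes=28)   # 27 is the null label here
good = check_labels([23, 16, 17, 27, 27], num_classes=28) # null label used only as padding
print(bad)
print(good)
```

With one more output class (num_classes = 29 here), 27 becomes an ordinary label and the first sequence passes.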

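The fix:

If the dataset was built without a blank symbol, add one more output to the last layer of the network to predict the CTC null label,
so that num_classes is n_class + 1. (num_classes is the class count inside ctc_loss, which is also the number of classes in y_pred; n_class is the number of classes in our alphabet.)
Alternatively, add a blank symbol at the end of the alphabet when building the dataset and use it for padding.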
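A minimal sketch of the padding step, assuming the blank is given the index n_class (one past the last real class) so that the network has n_class + 1 outputs. pad_labels is a hypothetical helper for illustration.

```python
import numpy as np

def pad_labels(label_seqs, n_class):
    """Pad variable-length label lists to a dense array, using the blank index as filler."""
    blank = n_class                                   # extra class; num_classes = n_class + 1
    max_len = max(len(s) for s in label_seqs)
    padded = np.full((len(label_seqs), max_len), blank, dtype=np.int64)
    for i, s in enumerate(label_seqs):
        padded[i, :len(s)] = s                        # real labels first, blanks after
    return padded

y_true = pad_labels([[4, 2, 1], [1, 2]], n_class=5)
print(y_true)
```

Because the filler only ever appears after the real labels, no non-null label can follow a null label.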

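3.2 Length problems

As said above, the y_pred sequence must be at least as long as the label sequence, i.e. every item in input_length must be no smaller than the corresponding item in label_length.
If that does not hold, you get this error: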

(0) Invalid argument: Not enough time for target transition sequence (required: 150, available: 90)0You can turn this error into a warning by using the flag ignore_longer_outputs_than_inputs
	 [[{{node loss_10/time_distributed_99_loss/CTCLoss}}]]
  (1) Invalid argument: Not enough time for target transition sequence (required: 150, available: 90)0You can turn this error into a warning by using the flag ignore_longer_outputs_than_inputs
	 [[{{node loss_10/time_distributed_99_loss/CTCLoss}}]]
	 [[training_8/Adam/gradients/loss_10/time_distributed_99_loss/CTCLoss_grad/mul/_2619]]
0 successful operations.

There are other length-related problems as well; the fix is always to check the lengths.
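One way to check the lengths is to scan each batch before it reaches the loss. The sketch below flags samples whose input is too short for their label, counting one extra frame per pair of adjacent identical labels as CTC theory requires. find_too_short is a hypothetical helper, and its "required" number is derived from the theory rather than taken from TensorFlow's source.

```python
import numpy as np

def find_too_short(y_true, input_length, label_length):
    """Return (sample index, required frames, available frames) for every offending sample."""
    bad = []
    for i in range(len(y_true)):
        n = int(label_length[i, 0])
        seq = y_true[i, :n]                                # ignore the padding
        repeats = int(np.sum(seq[1:] == seq[:-1]))         # adjacent identical labels
        required = n + repeats
        available = int(input_length[i, 0])
        if required > available:
            bad.append((i, required, available))
    return bad

y_true = np.array([[1, 1, 2], [3, 4, 0]])
input_length = np.array([[3], [3]])
label_length = np.array([[3], [2]])
problems = find_too_short(y_true, input_length, label_length)
print(problems)  # [(0, 4, 3)]
```

Sample 0 has three labels and three frames, but the repeated 1 needs a blank in between, so four frames are required.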

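3.3 Other problems

I wrap ctc_loss as a dedicated Keras loss for training, i.e. as a custom loss function. Without a reshape, tf.nn.ctc_loss fails with a puzzling error, and this one cost me 4 days. I finally printed y_true and y_pred and found that their shapes were not specified. Setting the shapes explicitly with reshape is what got the draft program running.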

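#%% CTC loss
import numpy as np
import tensorflow as tf

# BATCH_SIZE, time_step_len and NUM_CHARACTERS are constants defined elsewhere in the script.
def ctc_loss(y_true, y_pred):
    # Printing the tensors shows that their static shapes are not specified.
    print(y_true)
    print(y_pred)

    # Pin the shapes explicitly; the +1 is the extra output for the CTC blank (see 3.1).
    y_true = tf.reshape(y_true, (BATCH_SIZE, time_step_len))
    y_pred = tf.reshape(y_pred, (BATCH_SIZE, time_step_len, NUM_CHARACTERS + 1))
    # y_pred = tf.reshape(y_pred, (BATCH_SIZE, time_step_len, NUM_CHARACTERS))

    # Draft only: the lengths are hard-coded for a batch of 2, in the order
    # (input_length, label_length). Sample 0 pairs 90 input frames with a label
    # length of 150, which is exactly the error in 3.2; use the real lengths here.
    return tf.keras.backend.ctc_batch_cost(y_true, y_pred, np.array([[90], [150]]), np.array([[150], [20]]))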

Walter_0000
