Question:

Understanding CTC loss for speech recognition in Keras

轩辕海

2023-03-14

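I am trying to understand how CTC loss works for speech recognition and how it can be implemented in Keras.

What I think I understand (please correct me if I'm wrong!)

Broadly, the CTC loss is added on top of a classic network so that sequential information is decoded element by element (letter by letter, for text or speech) rather than decoding a block of elements (a word, for example) directly.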
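To make the "element by element" idea concrete, here is a minimal sketch of the collapse rule that CTC uses to turn one label per time step into a transcript: consecutive repeats are merged, then blanks are dropped. The blank symbol "-" and the function name ctc_collapse are assumptions for illustration only, not part of Keras.

```python
from itertools import groupby

BLANK = "-"  # assumed blank symbol, for illustration only


def ctc_collapse(frames):
    """Merge consecutive repeated symbols, then drop the blanks."""
    merged = [symbol for symbol, _ in groupby(frames)]
    return "".join(symbol for symbol in merged if symbol != BLANK)


# One predicted symbol per time step; the blank between the two "l" runs
# is what lets a double letter survive the merge.
print(ctc_collapse("hh-e-ll-lo--"))  # hello
```

Many different frame-level sequences collapse to the same transcript, and the CTC loss sums the probability of all of them.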

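Suppose we feed utterances of some sentences to the network as MFCCs.

The goal of using the CTC loss is to learn how to match each letter to the MFCCs at each time step. The Dense + softmax output layer therefore has as many neurons as there are elements needed to compose the sentences:

the alphabet (a, b, ..., z)

The softmax layer then has 29 neurons (26 for the alphabet plus 3 special characters).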
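As a small sketch of how those 29 classes could be laid out, the snippet below builds a character-to-index table. The three special symbols (space, end-of-sentence ">" and blank "-") are assumptions; the post only says that there are special characters. The blank is placed last because, as far as I know, the TensorFlow CTC loss that K.ctc_batch_cost wraps reserves the last class index for it.

```python
import string

# Assumed special symbols; only their count (3) comes from the question.
SPACE, EOS, BLANK = " ", ">", "-"

ALPHABET = list(string.ascii_lowercase) + [SPACE, EOS, BLANK]
ALPHABET_LENGTH = len(ALPHABET)  # 26 letters + 3 special symbols
CHAR_TO_INDEX = {char: index for index, char in enumerate(ALPHABET)}


def encode(text):
    """Map a transcript to integer class indices and append the EOS index."""
    return [CHAR_TO_INDEX[char] for char in text.lower()] + [CHAR_TO_INDEX[EOS]]


labels = encode("hello world")
print(ALPHABET_LENGTH)  # 29
```

The blank never appears in the encoded labels; it only exists as an output class of the softmax.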

To implement it, I found that I can do something like this:

# CTC implementation from Keras example found at
# https://github.com/keras-team/keras/blob/master/examples/image_ocr.py

def ctc_lambda_func(args):
    y_pred, labels, input_length, label_length = args
    # the 2 is critical here since the first couple outputs of the RNN
    # tend to be garbage:
    # print "y_pred_shape: ", y_pred.shape
    y_pred = y_pred[:, 2:, :]
    # print "y_pred_shape: ", y_pred.shape
    return K.ctc_batch_cost(labels, y_pred, input_length, label_length)



input_data = Input(shape=(1000, 20))
# let's say each MFCC matrix is (1000 time steps x 20 features)

x = Bidirectional(LSTM(..., return_sequences=True))(input_data)

x = Bidirectional(LSTM(..., return_sequences=True))(x)

y_pred = TimeDistributed(Dense(units=ALPHABET_LENGTH, activation='softmax'))(x)

loss_out = Lambda(function=ctc_lambda_func, name='ctc', output_shape=(1,))(
                  [y_pred, y_true, input_length, label_length])

model = Model(inputs=[input_data, y_true, input_length, label_length],
              outputs=loss_out)

with ALPHABET_LENGTH = 29 (alphabet length + special characters)

and:

y_true: tensor (samples, max_string_length) containing the truth labels

(source)
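One detail that is easy to miss with this setup: because ctc_lambda_func drops the first 2 time steps of y_pred, the input_length passed to the loss has to count only the remaining steps. A minimal NumPy sketch of that bookkeeping, assuming the recurrent layers keep one output per input frame and using made-up frame counts:

```python
import numpy as np

MAX_FRAMES = 1000  # time steps per padded MFCC matrix, as in Input(shape=(1000, 20))
TRIM = 2           # ctc_lambda_func drops the first 2 time steps of y_pred

# Hypothetical number of real (unpadded) frames for each utterance in a batch.
frames_per_utterance = np.array([1000, 640, 873])

# Usable length of y_pred for each sample, shaped (samples, 1).
input_length = (frames_per_utterance - TRIM).reshape(-1, 1)

print(input_length.ravel().tolist())  # [998, 638, 871]
```

If the network downsampled in time (strided convolutions or pooling, for example), the frame counts would have to be scaled accordingly before subtracting.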

Now, I'm facing some problems:

Is this the right way to code and use CTC loss?

澹台新知

2023-03-14

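y_true is your ground truth data: the data you are going to compare with the model's outputs during training. (y_pred, on the other hand, is the model's computed output.)

This loss seems to expect that your model's outputs (y_pred) have varying lengths, and so does your ground truth data (y_true). This is probably to avoid calculating the loss for garbage characters after the end of the sentences (since you need a fixed-size tensor to process many sentences at once).

Since the function's documentation asks for shape (samples, length), the format of the labels is the character index of each character in each sentence.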
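As a sketch of what that label format could look like in practice, the helper below stacks variable-length index sequences into one (samples, length) array and keeps the true lengths alongside it. The name pad_labels, the padding value 0 and the toy indices are assumptions for illustration.

```python
import numpy as np


def pad_labels(sequences, pad_value=0):
    """Stack variable-length index lists into a (samples, max_length) array."""
    label_length = np.array([[len(seq)] for seq in sequences], dtype="int32")
    max_length = int(label_length.max())
    y_true = np.full((len(sequences), max_length), pad_value, dtype="int32")
    for row, seq in enumerate(sequences):
        y_true[row, :len(seq)] = seq
    return y_true, label_length


# Hypothetical index sequences for "cab" and "be" (a=0, b=1, c=2, e=4).
y_true, label_length = pad_labels([[2, 0, 1], [1, 4]])
print(y_true.tolist())                # [[2, 0, 1], [1, 4, 0]]
print(label_length.ravel().tolist())  # [3, 2]
```

The padded positions are exactly the "garbage" that the length arguments let the loss ignore.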

There are some possibilities.

If all lengths are the same, you can easily use it as a regular loss:

def ctc_loss(y_true, y_pred):

    return K.ctc_batch_cost(y_true, y_pred, input_length, label_length)
    #where input_length and label_length are constants you created previously
    #the easiest way here is to have a fixed batch size in training 
    #the lengths should have the same batch size (see shapes in the link for ctc_cost)    

model.compile(loss=ctc_loss, ...)   

#here is how you pass the labels for training
model.fit(input_data_X_train, ground_truth_data_Y_train, ....)
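For the fixed-length case, the two constants could be built along these lines. This is only a sketch: the batch size and both lengths are made-up values, and the arrays would still have to match the batch size actually used in training.

```python
import numpy as np

BATCH_SIZE = 32     # assumed fixed batch size
TIME_STEPS = 1000   # assumed number of output time steps of the model
LABEL_STEPS = 50    # assumed length of every label sequence

# One length per sample, shaped (batch_size, 1).
input_length = np.full((BATCH_SIZE, 1), TIME_STEPS, dtype="int32")
label_length = np.full((BATCH_SIZE, 1), LABEL_STEPS, dtype="int32")

print(input_length.shape, label_length.shape)  # (32, 1) (32, 1)
```

Since the shapes are tied to the batch size, the last incomplete batch of an epoch would need to be dropped or padded.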

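If the lengths vary, it is a little more complicated: you need your model to somehow tell you the length of each output sentence. Some options:

Have an end_of_sentence character and detect where it is in the sentence.
Have a branch of your model compute this number and round it to an integer.
(Hardcore) If you are using a stateful manual training loop, get the index of the iteration at which you decided to finish a sentence.

I like the first idea, and will exemplify it here.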

def ctc_find_eos(y_true, y_pred):
    #eos_index and max_length are constants you created previously
    #(this assumes y_true and y_pred are both padded to max_length steps)

    #convert y_pred from one-hot to label indices
    y_pred_ind = K.argmax(y_pred, axis=-1)

    #to make sure y_pred has one end_of_sentence (to avoid errors)
    y_pred_end = K.concatenate([
                                  y_pred_ind[:,:-1], 
                                  eos_index * K.ones_like(y_pred_ind[:,-1:])
                               ], axis = 1)

    #to make sure the first occurrence of the char is more important than subsequent ones
    #descending weights max_length, ..., 1 (without the step of -1 the range would be empty)
    occurrence_weights = K.arange(max_length, 0, -1, dtype=K.floatx())

    #is eos? (K.cast works on tensors; K.cast_to_floatx is meant for numpy arrays)
    is_eos_true = K.cast(K.equal(y_true, eos_index), K.floatx())
    is_eos_pred = K.cast(K.equal(y_pred_end, eos_index), K.floatx())

    #lengths
    true_lengths = 1 + K.argmax(occurrence_weights * is_eos_true, axis=1)
    pred_lengths = 1 + K.argmax(occurrence_weights * is_eos_pred, axis=1)

    #reshape
    true_lengths = K.reshape(true_lengths, (-1,1))
    pred_lengths = K.reshape(pred_lengths, (-1,1))

    return K.ctc_batch_cost(y_true, y_pred, pred_lengths, true_lengths)

model.compile(loss=ctc_find_eos, ....)
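To see why the weighted argmax picks the first end_of_sentence marker, here is the same idea in plain NumPy on a toy batch. The values of eos_index and max_length and the toy label rows are assumptions for illustration.

```python
import numpy as np

eos_index = 3   # assumed index of the end_of_sentence character
max_length = 6  # assumed padded sentence length

# Toy batch: the first EOS is at position 2 in row 0 and at position 4 in row 1.
y = np.array([[0, 1, 3, 3, 3, 3],
              [2, 2, 0, 1, 3, 3]])

# Descending weights 6, 5, ..., 1: the earliest EOS gets the largest product.
occurrence_weights = np.arange(max_length, 0, -1, dtype="float32")
is_eos = (y == eos_index).astype("float32")

# Position of the first EOS, plus one, is the sentence length including EOS.
lengths = 1 + np.argmax(occurrence_weights * is_eos, axis=1)
print(lengths.tolist())  # [3, 5]
```

With ascending or constant weights, argmax could no longer single out the earliest marker, which is the whole reason for the descending range.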

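If you go with the other option, use a model branch to calculate the lengths, concatenate these lengths to the first or last step of the output, and make sure you do the same with the true lengths in your ground truth data. Then, in the loss function, just take the section that holds the lengths: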

def ctc_concatenated_length(y_true, y_pred):

    #assuming you concatenated the length in the first step
    true_lengths = y_true[:,:1] #may need to cast to int
    y_true = y_true[:, 1:]

    #since y_pred uses one-hot, you will need to concatenate to full size of the last axis, 
    #thus the 0 here
    pred_lengths = K.cast(y_pred[:, :1, 0], "int32")
    y_pred = y_pred[:, 1:]

    return K.ctc_batch_cost(y_true, y_pred, pred_lengths, true_lengths)
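On the data side, packing the true lengths into the first step of the ground truth could look like the sketch below. The helper name pack_lengths and the toy arrays are assumptions; the slicing at the end mirrors what the loss function does with y_true.

```python
import numpy as np


def pack_lengths(labels, lengths):
    """Prepend each sentence's true length as an extra first column."""
    return np.concatenate([lengths.reshape(-1, 1), labels], axis=1)


labels = np.array([[2, 0, 1], [1, 4, 0]])  # padded label indices
lengths = np.array([3, 2])                 # true length of each sentence
packed = pack_lengths(labels, lengths)     # shape (2, 4)

# The same split the loss function performs on y_true:
true_lengths = packed[:, :1]
y_true = packed[:, 1:]

print(packed.tolist())        # [[3, 2, 0, 1], [2, 1, 4, 0]]
print(true_lengths.tolist())  # [[3], [2]]
```

The packed array is what would be passed to model.fit as the targets, so the extra column travels with each batch automatically.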
