1710.10467 Generalized End-to-End Loss for Speaker Verification

2020-02-24  Jack_Woo

Overview

Abstract

In 2017, the authors proposed a new loss function called the generalized end-to-end (GE2E) loss, which makes training speaker verification models more efficient than the earlier (2016) tuple-based end-to-end (TE2E) loss.

Unlike TE2E, the GE2E loss updates the network parameters in a new way: it emphasizes examples that are difficult to verify at each step of the training process. In addition, the GE2E loss does not require an initial stage of example selection. With these properties, the model trained with the new loss reduces the speaker verification EER by more than 10% while shortening training time by 60%.

The authors also introduce the MultiReader technique, which enables domain adaptation: training a single, more accurate model that supports multiple keywords (i.e., "OK Google" and "Hey Google") as well as multiple dialects.

Points of Confusion

The paper covers quite a few points:

Points the paper sidesteps:

Background

This section covers several related loss functions:

Softmax

Used with the cross-entropy loss; it directly outputs class probabilities for classification.


The softmax function is an activation function that turns numbers (aka logits) into probabilities that sum to one. It outputs a vector representing the probability distribution over a list of potential outcomes.
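
A minimal sketch in PyTorch; the example logits are arbitrary:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.1])
probs = F.softmax(logits, dim=0)   # tensor([0.6590, 0.2424, 0.0986])
print(probs.sum())                 # 1.0 -- probabilities sum to one
```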

Triplet Loss

Triplet loss was introduced in the 2015 FaceNet paper.


The Triplet Loss minimizes the distance between an anchor and a positive, both of which have the same identity, and maximizes the distance between the anchor and a negative of a different identity.

Problems triplet loss addresses:

The strength of triplet loss is fine-grained discrimination; its weaknesses are slow convergence and, in some cases, failure to converge at all.
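
A minimal triplet-loss sketch in PyTorch; the margin value and tensor shapes here are illustrative assumptions, not values from FaceNet:

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Squared Euclidean distances between embeddings.
    d_ap = (anchor - positive).pow(2).sum(dim=1)   # anchor <-> positive
    d_an = (anchor - negative).pow(2).sum(dim=1)   # anchor <-> negative
    # Hinge: only penalize triplets where the negative is not at least
    # `margin` farther away than the positive.
    return F.relu(d_ap - d_an + margin).mean()

a, p, n = (F.normalize(torch.randn(8, 128), dim=1) for _ in range(3))
print(triplet_loss(a, p, n))
```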

Offline triplet

Generate triplets offline every n steps, using the most recent network checkpoint and computing the argmin and argmax on a subset of the data.

Based on the distances between samples, triplets are divided into three types: easy, semi-hard, and hard; semi-hard and hard triplets are the ones selected for training.

This approach is not very efficient, because every few epochs the negative examples have to be re-mined.

Online triplet

Generate triplets online. This can be done by selecting the hard positive/negative exemplars from within a mini-batch.
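
A sketch of in-batch mining: build a pairwise distance matrix and take the hardest negative per anchor. The function name and shapes are illustrative, and this shows hard (not semi-hard) mining only:

```python
import torch
import torch.nn.functional as F

def hardest_negatives(embeddings, labels):
    """For each sample in the batch, return the index of the closest
    sample with a different label -- the hardest in-batch negative."""
    dist = torch.cdist(embeddings, embeddings)          # (B, B) distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)   # same-label mask
    dist = dist.masked_fill(same, float("inf"))         # drop self/positives
    return dist.argmin(dim=1)

emb = F.normalize(torch.randn(6, 32), dim=1)
labels = torch.tensor([0, 0, 1, 1, 2, 2])
print(hardest_negatives(emb, labels))
```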

Classification with triplet embeddings

FaceNet is a feature extractor: it outputs a vector in Euclidean space, which can then be classified with any standard machine learning algorithm.

From the FaceNet paper: "In this paper we present a system, called FaceNet, that directly learns a mapping from face images to a compact Euclidean space where distances directly correspond to a measure of face similarity. Once this space has been produced, tasks such as face recognition, verification and clustering can be easily implemented using standard techniques with FaceNet embeddings as feature vectors."

Prior Work: Tuple-Based End-to-End Loss

"End-to-end text-dependent speaker verification," ICASSP 2016.

Tuple-Based End-to-End Loss:


Pros

Cons

This Paper: Generalized End-to-End Loss

"Generalized end-to-end loss for speaker verification," 2017 (arXiv:1710.10467).

GE2E Loss

(Figure: utterance embeddings grouped by speaker; embeddings of the same color belong to the same speaker, and $c_k$ marks each speaker's centroid.)

The GE2E loss pushes each embedding towards the centroid of its true speaker, and away from the centroid of the most similar different speaker.

Similarity Matrix

Construct a similarity matrix for each batch:

$$e_{ji} = \frac{f(x_{ji}; w)}{\lVert f(x_{ji}; w) \rVert_2}$$

$$c_k = \frac{1}{M} \sum_{m=1}^{M} e_{km}$$

$$S_{ji,k} = w \cdot \cos(e_{ji}, c_k) + b, \quad w > 0$$

where $e_{ji}$ is the L2-normalized embedding of utterance $i$ from speaker $j$, $c_k$ is the centroid of speaker $k$'s embeddings, and $w$, $b$ are a learnable scale and bias.
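
A minimal PyTorch sketch of this similarity matrix. The tensor layout (N speakers × M utterances × D dimensions) and the initial values of `w` and `b` are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def similarity_matrix(embeddings, w, b):
    """embeddings: (N, M, D), L2-normalized along D.
    Returns S of shape (N, M, N) with S[j, i, k] = w * cos(e_ji, c_k) + b."""
    centroids = F.normalize(embeddings.mean(dim=1), dim=1)  # c_k, (N, D)
    cos = embeddings @ centroids.t()   # dot of unit vectors = cosine, (N, M, N)
    return w * cos + b

N, M, D = 4, 5, 64
e = F.normalize(torch.randn(N, M, D), dim=2)
w = torch.tensor(10.0, requires_grad=True)   # learnable scale, kept positive
b = torch.tensor(-5.0, requires_grad=True)   # learnable bias
S = similarity_matrix(e, w, b)
print(S.shape)                               # torch.Size([4, 5, 4])
```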

Softmax vs. Contrast


Each row of the similarity matrix $S_{ji,k}$ for utterance $e_{ji}$ defines the similarity between $e_{ji}$ and every centroid $c_k$. We want $e_{ji}$ to be close to $c_j$ and far away from $c_k$ for $k \neq j$.

The paper compares two ways of turning the similarity matrix into a loss:

$$L(e_{ji}) = -S_{ji,j} + \log \sum_{k=1}^{N} \exp(S_{ji,k}) \quad \text{(softmax)}$$

$$L(e_{ji}) = 1 - \sigma(S_{ji,j}) + \max_{\substack{1 \le k \le N \\ k \neq j}} \sigma(S_{ji,k}) \quad \text{(contrast)}$$

where $\sigma$ is the sigmoid function. The contrast loss is essentially triplet-style: it is computed from the positive pair plus the most aggressive (hardest) negative pair.

An open question: in the authors' experiments, why does the contrast loss perform worse than softmax on the text-independent task?
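
A sketch of both loss variants, operating on the (N, M, N) similarity matrix `S` assumed above:

```python
import torch

def ge2e_softmax_loss(S):
    """S: (N, M, N) similarity matrix."""
    pos = S.diagonal(dim1=0, dim2=2).t()            # S[j, i, j], shape (N, M)
    return (-pos + torch.logsumexp(S, dim=2)).sum()

def ge2e_contrast_loss(S):
    N = S.shape[0]
    sig = torch.sigmoid(S)                          # (N, M, N)
    pos = sig.diagonal(dim1=0, dim2=2).t()          # own-centroid similarity
    # Mask out the own-speaker column, then take the hardest negative.
    mask = torch.eye(N, dtype=torch.bool, device=S.device).unsqueeze(1)
    neg = sig.masked_fill(mask, float("-inf")).max(dim=2).values
    return (1.0 - pos + neg).sum()
```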

Trick about centroid

For the true speaker's centroid, we should exclude the embedding itself: when computing $c_j^{(-i)}$, leave out $e_{ji}$:

$$c_j^{(-i)} = \frac{1}{M-1} \sum_{\substack{m=1 \\ m \neq i}}^{M} e_{jm}$$


This also avoids the trivial solution in which all utterances collapse to the same embedding. Unifying both cases into a single formula:

$$S_{ji,k} = \begin{cases} w \cdot \cos(e_{ji}, c_j^{(-i)}) + b & k = j \\ w \cdot \cos(e_{ji}, c_k) + b & \text{otherwise} \end{cases}$$
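
A sketch of the leave-one-out centroid, computed from the per-speaker sum so no loop is needed (shapes as assumed above):

```python
import torch
import torch.nn.functional as F

def exclusive_centroids(embeddings):
    """embeddings: (N, M, D). Returns (N, M, D), where entry [j, i] is the
    centroid of speaker j computed without utterance i."""
    M = embeddings.shape[1]
    total = embeddings.sum(dim=1, keepdim=True)   # (N, 1, D)
    excl = (total - embeddings) / (M - 1)         # leave e_ji out of the mean
    return F.normalize(excl, dim=2)

e = F.normalize(torch.randn(4, 5, 64), dim=2)
print(exclusive_centroids(e).shape)               # torch.Size([4, 5, 64])
```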

Efficiency estimate

TE2E vs. GE2E

The main idea: TE2E computes one tuple at a time, whereas one GE2E step is roughly equivalent to an entire batch of tuples processed on the GPU at once, which is far more efficient.

For TE2E, assume a training batch has:

- N speakers
- M utterances per speaker
- P enrollment utterances per speaker

The number of all possible TE2E tuples grows combinatorially with N, M, and P. Theoretically, one GE2E step is equivalent to at least $2(N-1)$ TE2E steps.

TODO: I don't fully understand the computation here, i.e., how the $2(N-1)$ relationship is derived. (My reading: in one GE2E step every utterance is scored against its own centroid and all $N-1$ other centroids at once, covering both the positive and the negative tuple comparisons that TE2E would need separate steps for.)

Comparison with Triplet Loss

The author's take on triplet loss's pros and cons:

Here, "runtime behavior" refers to the usage flow in a voice/face verification scenario (see the sketch after the list):

  1. Enrollment: record enrollment utterances -> extract an embedding per utterance -> average the embeddings of the multiple utterances.
  2. Verification: extract an embedding from the utterance recorded at verification time -> compute its similarity (e.g., cosine distance) to the averaged enrollment embedding -> decide whether it is the same person based on a preset threshold.
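
A minimal sketch of this flow; the threshold value and embedding size are illustrative assumptions, since the real threshold is tuned on a development set:

```python
import torch
import torch.nn.functional as F

def enroll(utterance_embeddings):
    # Average the enrollment embeddings, then re-normalize to unit length.
    return F.normalize(utterance_embeddings.mean(dim=0), dim=0)

def verify(test_embedding, enrolled, threshold=0.7):
    # Cosine similarity of unit vectors is a plain dot product.
    score = torch.dot(F.normalize(test_embedding, dim=0), enrolled)
    return score.item() >= threshold

enrolled = enroll(F.normalize(torch.randn(3, 256), dim=1))
print(verify(torch.randn(256), enrolled))
```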

Text-Independent

Text-Independent Speaker Verification

We are interested in identifying the speaker from arbitrary speech. Challenge: utterances have arbitrary, variable length.

Naive solution: Full sequence training?

Solution: sliding-window inference.
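
A sketch of sliding-window inference under these assumptions: a fixed window with 50% overlap, and `encoder` standing in for the trained speaker encoder:

```python
import torch
import torch.nn.functional as F

def sliding_window_embed(frames, encoder, window=160, overlap=0.5):
    """frames: (T, n_mels) features of one utterance. Embed fixed-size
    windows, then average the L2-normalized window embeddings into a
    single utterance embedding."""
    step = int(window * (1 - overlap))
    starts = range(0, max(frames.shape[0] - window, 0) + 1, step)
    embs = [F.normalize(encoder(frames[s:s + window].unsqueeze(0)).squeeze(0), dim=0)
            for s in starts]
    return F.normalize(torch.stack(embs).mean(dim=0), dim=0)

# Dummy encoder standing in for the trained LSTM: mean-pool over time.
encoder = lambda x: x.mean(dim=1)
print(sliding_window_embed(torch.randn(400, 40), encoder).shape)  # (40,)
```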


Training

Text-independent Training.

This speeds up training while making full use of the data: all utterances within a batch share the same length, so each utterance takes the same compute time. (In the paper, each batch's length is drawn at random from 140-180 frames.)
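
A sketch of this batching scheme; the [140, 180]-frame range follows the paper, while the crop logic is a minimal illustration:

```python
import random
import torch

def make_batch(utterances, lb=140, ub=180):
    """utterances: list of (T_i, n_mels) tensors with T_i >= ub.
    Crop every utterance to one shared random length t, so the batch is
    rectangular and every utterance costs the same compute time."""
    t = random.randint(lb, ub)                      # one length per batch
    crops = []
    for u in utterances:
        start = random.randint(0, u.shape[0] - t)   # random crop position
        crops.append(u[start:start + t])
    return torch.stack(crops)                       # (batch, t, n_mels)

batch = make_batch([torch.randn(300, 40) for _ in range(8)])
print(batch.shape)   # e.g. torch.Size([8, 157, 40])
```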


Experiment

Text-independent experiments.


Text-dependent

TODO

Training Notes

Text-independent

The workflow has two steps: data preprocessing and training.

Audio preprocessing (wav to spectrogram, etc.) is time-consuming; consider doing it on the GPU, for example with PyTorch audio (torchaudio), as sketched below.
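
A sketch of GPU-side feature extraction with torchaudio; the file name and spectrogram parameters are illustrative assumptions:

```python
import torch
import torchaudio

device = "cuda" if torch.cuda.is_available() else "cpu"

# The transform is an nn.Module, so it can be moved to the GPU.
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, hop_length=160, n_mels=40).to(device)

waveform, sr = torchaudio.load("utterance.wav")  # (channels, samples), on CPU
spec = mel(waveform.to(device))                  # (channels, n_mels, frames), on GPU
log_spec = torch.log(spec + 1e-6)                # log compression
print(log_spec.shape)
```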

Reference

Training-time reference:

Dataset:

Time: 1.56M steps trained over 20 days on a single GPU (GTX 1080 Ti) with a batch size of 64.

Training process:


Final result: the embedding vectors are reduced to 2-D with UMAP and plotted (see the sketch below).
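
A sketch of that visualization with umap-learn and matplotlib, using random stand-in data:

```python
import numpy as np
import umap                       # pip install umap-learn
import matplotlib.pyplot as plt

# Stand-in data: 200 embeddings from 10 "speakers", 20 utterances each.
embeddings = np.random.randn(200, 256)
labels = np.repeat(np.arange(10), 20)

proj = umap.UMAP(n_neighbors=15, min_dist=0.1).fit_transform(embeddings)
plt.scatter(proj[:, 0], proj[:, 1], c=labels, cmap="tab10", s=8)
plt.title("Speaker embeddings projected with UMAP")
plt.show()
```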

