Conference Notes: Speaker Recognition

2019-11-06 原来是酱紫呀

20190420

1. Paralinguistic speech attribute recognition

General framework
speech signals ---> feature extraction ---> representation ---> variability compensation ---> backend classification ---> result

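Speech is a short-time stationary signal: it is approximately stationary over 20-30 ms
Variable-length speech signal ---> variable-length sequence of speech features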
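
As a minimal sketch of how a variable-length signal becomes a variable-length feature sequence, the snippet below cuts a waveform into overlapping frames. The 25 ms window, 10 ms hop, and 16 kHz sampling rate are common choices assumed here, not values from the talk, and `frame_signal` is a hypothetical helper name.

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, win_ms=25.0, hop_ms=10.0):
    """Split a 1-D waveform into overlapping frames (assumed 25 ms window, 10 ms hop)."""
    win = int(round(sample_rate * win_ms / 1000.0))   # samples per frame
    hop = int(round(sample_rate * hop_ms / 1000.0))   # samples between frame starts
    if len(signal) < win:
        return np.empty((0, win), dtype=signal.dtype)
    n_frames = 1 + (len(signal) - win) // hop
    # Row t of idx holds the sample indices of frame t.
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx]

rng = np.random.default_rng(0)
one_second = frame_signal(rng.standard_normal(16000))
three_seconds = frame_signal(rng.standard_normal(48000))
print(one_second.shape, three_seconds.shape)  # frame count grows with duration
```

Each frame is short enough to be treated as stationary, and the number of frames depends on the utterance length, which is why later stages need a way to handle variable-length input.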

The text differs across utterances, so frames cannot be aligned directly ---> fit the data with a generative model

End-to-end framework
feature extraction ---> representation ---> backend classifier

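Variable length ---> encoding layer OR RNN layer

Early approach: split the speech and run a frame-level DNN, or splice a few dozen frames together as the DNN input, then average at the score level.

Current approach: utterance level, using a pooling layer. This needs a well-designed data loader.

And so on...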
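
The two ways of handling variable length mentioned above can be sketched as follows. This is an illustration only: `stats_pooling` uses mean-plus-standard-deviation pooling, one common choice of pooling layer that is assumed here, and both function names are hypothetical.

```python
import numpy as np

def score_average(frame_scores):
    """Early approach: average frame-level class scores (T, C) into one (C,) score."""
    return frame_scores.mean(axis=0)

def stats_pooling(frames, eps=1e-8):
    """Utterance-level approach: map (T, D) frame features to a fixed 2*D vector."""
    mean = frames.mean(axis=0)
    std = np.sqrt(frames.var(axis=0) + eps)   # eps keeps the square root stable
    return np.concatenate([mean, std])

rng = np.random.default_rng(0)
short_utt = stats_pooling(rng.standard_normal((50, 24)))
long_utt = stats_pooling(rng.standard_normal((300, 24)))
print(short_utt.shape, long_utt.shape)  # same size regardless of duration
```

Because the pooled vector has a fixed size, the layers after it can be ordinary fully connected layers trained on whole utterances.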

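2. Deep-learning-based short-utterance speaker recognition

Challenges in speaker recognition: short utterances, cross-channel conditions

Addressing the short-utterance problem: (1) mine the speaker information in a short utterance along multiple dimensions; (2) apply compensation to the extracted embedding
speaker embedding
Performs better in short-utterance recognition
Network architectures: d-vector, x-vector, ct-dnn, end2end

Feature compensation: WCCN, LDA, PLDA
Limitations of linear compensation: (1) simple structure, linear transform only; (2) in an i-vector, the superposition of speaker and channel variability is nonlinear and Gaussian
Neural-network-based i-vector compensation
Advantages: (1) nonlinear transformation; (2) the complexity of a hierarchical structure; (3) the strong learning capacity of backpropagation training
Network types: (1) speaker classification network; (2) speaker distance metric learning network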
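
To make "linear compensation" concrete, here is a minimal sketch of WCCN, one of the linear techniques listed above. It estimates the average within-speaker covariance W and returns a matrix B with B Bᵀ = W⁻¹, so that projecting embeddings with Bᵀ whitens the within-speaker variability. The function names and the toy data are assumptions for illustration.

```python
import numpy as np

def within_class_cov(X, labels):
    """Average of the per-speaker covariance matrices of the rows of X."""
    dim = X.shape[1]
    W = np.zeros((dim, dim))
    classes = np.unique(labels)
    for c in classes:
        Xc = X[labels == c]
        centered = Xc - Xc.mean(axis=0)
        W += centered.T @ centered / len(Xc)
    return W / len(classes)

def wccn_transform(X, labels):
    """Return B with B @ B.T = inv(W); apply it to row vectors as X @ B."""
    W = within_class_cov(np.asarray(X, dtype=float), np.asarray(labels))
    return np.linalg.cholesky(np.linalg.inv(W))

# Toy embeddings: 5 speakers, 40 vectors each, unequal variance per dimension.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(5), 40)
X = rng.standard_normal((200, 4)) * np.array([1.0, 2.0, 0.5, 3.0]) + labels[:, None]
B = wccn_transform(X, labels)
X_comp = X @ B   # compensated embeddings
```

The whole compensation is a single matrix multiplication, which is the structural simplicity the notes list as a limitation.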

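3. Deep-learning-based speaker recognition methods

Frame-level features ---> segment-level features ---> similarity measurement ---> speaker recognition
Typical systems: i-vector, x-vector
x-vector:
frame-level NN module ---> aggregation (mapping) module ---> loss function
CNN-C-D2
In a deep neural network, the outputs of different layers change gradually in resolution and semantics:

Lower-layer outputs: high time-frequency resolution, rich local information.
Higher-layer outputs: low time-frequency resolution, rich global semantic information.

Feature fusion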

4. Adversarial learning

problem: training-testing mismatch

problem: adversarial examples
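
As a toy illustration of what an adversarial example is, the sketch below applies one step of the fast gradient sign method (FGSM) to a logistic scorer. FGSM and the linear model are stand-ins chosen for brevity; the notes do not say which attack or which system the talk considered, and all names and values are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_perturb(x, y, w, b, epsilon=0.05):
    """One FGSM step against a logistic scorer p(y=1|x) = sigmoid(w.x + b)."""
    p = sigmoid(w @ x + b)
    grad_x = (p - y) * w                    # gradient of cross-entropy loss w.r.t. x
    return x + epsilon * np.sign(grad_x)    # move each dimension by at most epsilon

rng = np.random.default_rng(0)
w = rng.standard_normal(20)                 # hypothetical "accept" scorer weights
x = rng.standard_normal(20)                 # hypothetical input feature vector
x_adv = fgsm_perturb(x, y=1.0, w=w, b=0.0)
print(w @ x, w @ x_adv)                     # the target-class score drops
```

Every dimension changes by at most `epsilon`, yet the score moves by `epsilon` times the L1 norm of `w`, which is why small perturbations can flip a decision.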

5. Noise-robust speaker recognition based on adversarial multi-task learning

6. Normalization for speaker embedding

starting from gmm-ubm
neural-based embedding
properties of neural embeddings (different from I-vector)

inferred from discriminative models
less probabilistic meaning
highly discriminative

why do discriminative embeddings need a discriminative back-end?
because of normalization...
why does lda+plda work?
lda makes the conditional embeddings more gaussian, hence suitable for plda
pca also works

lda regularizes the conditional distribution
pca regularizes the marginal distribution
pca alone does not help; pca+plda does

lda/pca does not work for ivector+plda

I-vector is gaussian constrained (marginally)

problem of pca/lda normalization

plda requires prior and conditional to be gaussian; neither pca nor lda matches all.
linear shallow models cannot derive gaussian prior/conditional with complex observed marginal and observed conditional of d/x-vector
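
For reference, the Gaussian assumptions mentioned above can be written out with the two-covariance form of PLDA. This is a standard textbook formulation given here as an aid, not something shown in the talk; the symbols m, Σ_B, and Σ_W are notation introduced for this sketch.

```latex
% Prior over the mean \mu_s of speaker s (between-speaker covariance \Sigma_B)
p(\mu_s) = \mathcal{N}(\mu_s;\, m,\, \Sigma_B)

% Conditional of embedding x_{s,i} given its speaker (within-speaker covariance \Sigma_W)
p(x_{s,i} \mid \mu_s) = \mathcal{N}(x_{s,i};\, \mu_s,\, \Sigma_W)

% Implied marginal of an embedding
p(x_{s,i}) = \mathcal{N}(x_{s,i};\, m,\, \Sigma_B + \Sigma_W)
```

Under this model both the prior and the conditional are Gaussian, so the marginal is Gaussian too. A transform that only shapes the conditional (lda) or only the marginal (pca) therefore satisfies part of the model but not all of it.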

7. Recent advances in deep embedding learning for speaker identification and spoofing detection

GAN: data augmentation for speaker embeddings

Extend: VAE for data augmentation

Knowledge distillation for speaker embedding

8. Speaker recognition based on structured metric learning

Metric learning:

Loss functions:
Triplet loss, cross-entropy loss, others...
Similarity criteria:
Cosine similarity based, PLDA based, others...
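
A minimal sketch of how a loss function and a similarity criterion combine: the triplet loss below uses cosine similarity and asks the anchor to be closer to a same-speaker embedding than to a different-speaker embedding by a margin. The margin of 0.2 and the toy vectors are assumptions.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss that is zero once s(a, p) >= s(a, n) + margin."""
    s_ap = cosine_similarity(anchor, positive)
    s_an = cosine_similarity(anchor, negative)
    return max(0.0, s_an - s_ap + margin)

anchor = np.array([1.0, 0.0])
same_speaker = np.array([1.0, 0.1])
other_speaker = np.array([0.0, 1.0])
print(triplet_loss(anchor, same_speaker, other_speaker))   # satisfied triplet
print(triplet_loss(anchor, other_speaker, same_speaker))   # violated triplet
```

Swapping `cosine_similarity` for another scoring function, such as a PLDA log-likelihood ratio, changes the similarity criterion without changing the shape of the loss.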

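Problem: can metric learning directly optimize the evaluation metric?
Yes: a structured loss function (the novelty) plus a suitable similarity measure (matched to the novelty)

Cosine-similarity-based metric learning algorithm --- optimizes EER
Mahalanobis-distance-based metric learning algorithm --- optimizes pAUC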
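
For readers unfamiliar with the metric being optimized, the sketch below shows how EER is measured from trial scores with a simple threshold sweep. It only illustrates the evaluation metric; it is not the structured loss from the talk, and `compute_eer` is a hypothetical helper.

```python
import numpy as np

def compute_eer(target_scores, nontarget_scores):
    """Return (EER, threshold) where false-accept and false-reject rates are closest."""
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    # A trial is accepted when its score is >= the threshold.
    far = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    frr = np.array([(target_scores < t).mean() for t in thresholds])
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2.0, thresholds[i]

eer, threshold = compute_eer([0.6, 0.4, 0.9, 0.8], [0.5, 0.1, 0.2, 0.3])
print(eer, threshold)
```

The counting and `argmin` steps are not differentiable, which is why optimizing EER directly requires a specially designed loss rather than plain gradient descent on this function.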

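9. Joint recognition of content and speaker

Content and speaker influence each other

Voice characteristics affect how speech content is perceived
Johnson's "talker coordinate" theory
Speech content affects speaker identification (forensic speaker identification)

Summary: listeners perceive speech content and speaker information jointly; knowing one dimension significantly improves recognition and understanding of the other

Joint recognition of content and speaker

Forensic speaker identification
Anti-fraud
Focus on the content of the speech in the case to collect evidence of the fraud process
Focus on speaker information to confirm who the fraudster and the victim are
Speech quality inspection
Extract the customer service agent's speech
Analyze the content of the agent's speech to monitor non-compliant or uncivil language

Coarse-grained content alignment combined with fine-grained speaker recognition works better
Main technical routes for speaker adaptation

GMM-HMM framework: MLLR, fMLLR
LHUC
Concatenated features (i/x-vector + MFCC, etc.)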
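
Two of the adaptation routes above are simple enough to sketch in a few lines. `lhuc_scale` rescales hidden activations with per-speaker amplitudes using the 2·sigmoid parameterization commonly associated with LHUC, and `append_speaker_vector` tiles an utterance-level vector onto every acoustic frame. The function names, dimensions, and the exact parameterization are assumptions for illustration.

```python
import numpy as np

def lhuc_scale(hidden, speaker_params):
    """Rescale hidden units per speaker: h * 2*sigmoid(a), amplitudes in (0, 2)."""
    amplitude = 2.0 / (1.0 + np.exp(-speaker_params))
    return hidden * amplitude

def append_speaker_vector(mfcc_frames, speaker_vector):
    """Concatenate one utterance-level vector to every row of (T, D) features."""
    tiled = np.tile(speaker_vector, (mfcc_frames.shape[0], 1))
    return np.concatenate([mfcc_frames, tiled], axis=1)

hidden = np.array([[1.0, -2.0, 3.0]])
print(lhuc_scale(hidden, np.zeros(3)))   # zero parameters leave the layer unchanged

features = append_speaker_vector(np.zeros((100, 13)), np.ones(32))
print(features.shape)                    # 13 acoustic + 32 speaker dimensions per frame
```

Only `speaker_params` is learned per speaker in the first function, so adaptation touches a small number of parameters; the second route leaves the model unchanged and instead widens its input.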

Text-dependent speaker recognition

Hard alignment: HMM, DNN, PGMM
Soft alignment: Baum-Welch statistics
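
Soft alignment can be sketched as computing zeroth- and first-order Baum-Welch statistics: each frame is assigned to every Gaussian component with a posterior weight instead of to a single unit. The diagonal-covariance GMM and all parameter values below are assumptions for illustration.

```python
import numpy as np

def baum_welch_stats(frames, weights, means, variances):
    """Statistics of (T, D) frames under a diagonal GMM with C components.

    weights: (C,), means: (C, D), variances: (C, D).
    Returns N (C,), F (C, D), and the frame posteriors (T, C).
    """
    diff = frames[:, None, :] - means[None, :, :]                 # (T, C, D)
    log_gauss = -0.5 * (np.sum(diff ** 2 / variances[None], axis=2)
                        + np.sum(np.log(2.0 * np.pi * variances), axis=1)[None, :])
    log_post = np.log(weights)[None, :] + log_gauss
    log_post -= log_post.max(axis=1, keepdims=True)               # numerical stability
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)                       # soft alignment
    N = post.sum(axis=0)                                          # zeroth order
    F = post.T @ frames                                           # first order
    return N, F, post

rng = np.random.default_rng(0)
frames = rng.standard_normal((30, 3))
weights = np.array([0.5, 0.5])
means = np.array([[-1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
variances = np.ones((2, 3))
N, F, post = baum_welch_stats(frames, weights, means, variances)
hard_alignment = post.argmax(axis=1)   # hard alignment keeps only the best component
```

Taking the `argmax` of the posteriors gives a hard alignment, which shows the two options as ends of the same computation.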

Summary: (1) content affects speaker and speaker affects content, but at different scales
(2) the methods differ widely

deep feature for text-dependent speaker verification
collaborative joint training with multitask recurrent model for speaker recognition
unsupervised learning of disentangled and interpretable representations from sequential data
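FHVAE decomposes speech into two latent variables: content z1 and speaker z2
Limitation: emphasizes the influence of z2 on z1 and ignores the influence of z1 on z2
speaker embedding extraction with phonetic information
(1) Multi-task learning: the first few layers are shared, the later layers are separate
Advantage: uses x-vector; information from different levels
Limitation: no feedback
(2) Add feedback from speech recognition to speaker recognition
Method 1: train a phoneme-dependent TDNN using phoneme labels
Method 2: extract a phonetic vector and concatenate it before statistics pooling
Limitation: feedback in one direction only
(3) Cross feedback
Speaker: x-vector
Speech: TDNN-ASR
Considerations: different levels, shared network, cross feedback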
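
Method 2 under (2) can be sketched as follows: frame-level phonetic vectors are concatenated to the frame-level speaker features, and the joint sequence then goes through statistics pooling. The dimensions (512 and 64) and the function name are assumptions; in a real system both streams would come from trained networks rather than random arrays.

```python
import numpy as np

def pool_with_phonetic(speaker_frames, phonetic_frames):
    """Concatenate per-frame phonetic vectors to speaker features, then pool over time."""
    if speaker_frames.shape[0] != phonetic_frames.shape[0]:
        raise ValueError("both streams must have the same number of frames")
    joint = np.concatenate([speaker_frames, phonetic_frames], axis=1)   # (T, Ds + Dp)
    return np.concatenate([joint.mean(axis=0), joint.std(axis=0)])      # (2 * (Ds + Dp),)

rng = np.random.default_rng(0)
speaker_frames = rng.standard_normal((200, 512))    # stand-in for TDNN speaker features
phonetic_frames = rng.standard_normal((200, 64))    # stand-in for ASR bottleneck features
embedding_input = pool_with_phonetic(speaker_frames, phonetic_frames)
print(embedding_input.shape)
```

Information flows only from the phonetic stream into the speaker stream here, which matches the limitation noted for this approach and motivates the cross-feedback design in (3).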