Rereading R. Rabiner's Digital Speech Signal Processing: Chap2 The

2019-01-16 锅锅Iris

This chapter mainly covers the phonetic representation of speech and the production of speech.

How to represent speech

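Phonemes form a finite set; for most languages the number of phonemes is roughly 32-64.
For English there is the ARPAbet table, plus allophonic variations such as the glottal stop (a brief closure at the glottis).
- The table resembles a figure given in Chapter 1. It uses only ordinary letters of the English alphabet, so no other symbols are needed.
- The classes are diphthongs, glides (semivowels), nasals, stops/plosives, fricatives, and affricates.
- Phonemes often look quite different from one another in the frequency domain. /SH/ and /S/ look like random noise, whereas vowels such as /UH/, /IY/, and /EY/ are highly structured and quasi-periodic.
- When I have time, it would be worth learning the basics of phonetics, both to correct my pronunciation and to understand speech better.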
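To make the phoneme classes concrete, here is a small Python sketch that looks up the class of each ARPAbet symbol in a transcription. The table is written from the commonly used ARPAbet symbols rather than copied from the book's table, vowels and diphthongs are lumped into one class for simplicity, and the transcription of "speech" as `S P IY CH` is my own illustration.

```python
# Illustrative ARPAbet class table (an assumption of this sketch,
# not a copy of the table in the book).
PHONEME_CLASSES = {
    "vowel/diphthong": ["IY", "IH", "EY", "EH", "AE", "AA", "AO", "UH",
                        "OW", "UW", "AH", "ER", "AY", "AW", "OY"],
    "glide": ["W", "Y"],
    "liquid": ["L", "R"],
    "nasal": ["M", "N", "NG"],
    "stop": ["P", "B", "T", "D", "K", "G"],
    "fricative": ["F", "V", "TH", "DH", "S", "Z", "SH", "ZH", "HH"],
    "affricate": ["CH", "JH"],
}

# Invert the table so each symbol maps to its class.
SYMBOL_TO_CLASS = {
    symbol: cls
    for cls, symbols in PHONEME_CLASSES.items()
    for symbol in symbols
}


def classify(transcription):
    """Return (symbol, class) pairs for a space-separated ARPAbet string."""
    return [(p, SYMBOL_TO_CLASS.get(p, "unknown"))
            for p in transcription.split()]


# "speech" transcribed as S P IY CH (illustrative).
for symbol, cls in classify("S P IY CH"):
    print(symbol, cls)
```

A plain dictionary is enough here because the phoneme set is small and fixed.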

Models of speech production

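The figure in the book is really too crude, so I took one from Voice Acoustics: an introduction^[1] instead. I also found that this introductory site explains things quite well.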

Source Filter Model

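The vocal tract is modeled as a tube of nonuniform cross-sectional area that is bounded at one end by the vocal cords and at the other by the mouth opening. In other words, the tract runs from the vocal cords to the mouth, and its shape changes as the articulators move (Chinese and English really do phrase this differently).
Voiced sounds include vowels, liquids, glides (such as /W/), and nasals. For these the glottis opens and closes periodically as the vocal cords vibrate.
Unvoiced sounds are created by a constriction in the vocal tract, for example /SH/ and /S/.
A third mechanism combines a constriction in the vocal tract with quasi-periodic vocal cord vibration, for example /V/ and /Z/.
Plosive sounds such as /P/ and affricates such as /CH/ are formed by momentarily closing off air flow.
On formants: I had to read this passage twice before it made sense. The vocal tract changes slowly compared with the detailed time variation of the speech signal, so it can be treated as a slowly varying filter acting on the sound source. The resonances come from the particular configuration of the articulators, and these resonance frequencies are called the formant frequencies of the sound. The fine structure of the time waveform is created by the sound sources in the vocal tract, and the resonances of the vocal tract tube shape these sound sources into the phonemes.
The figure below^[2] was taken from the web.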

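Fant's source-filter model of speech production:
Because the vocal tract changes slowly, we can assume that the system's response stays essentially fixed over an interval of about 10 ms.
The system function of this linear system can be written as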
$H(z) = \frac{\sum_{k=0}^{M}b_kz^{-k}}{1-\sum_{k=1}^{N}a_kz^{-k}} =\frac{b_0\prod_{k=1}^{M}(1-d_kz^{-1})}{\prod_{k=1}^{N}(1-c_kz^{-1})}$
Here $a_k$ and $b_k$ are the filter coefficients, that is, the vocal tract parameters, and they change on the order of 50-100 times per second.
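As a reading aid, the same system can be written in the time domain. Calling the excitation $e[n]$ and the output speech $s[n]$ (my own notation, not taken from the text above), the system function $H(z)$ corresponds to the difference equation

$s[n] = \sum_{k=1}^{N}a_k s[n-k] + \sum_{k=0}^{M}b_k e[n-k]$

The minus sign in the denominator of $H(z)$ is why the $a_k$ terms appear with a plus sign here.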

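Some of the poles $c_k$ lie close to the unit circle and create the resonances that model the formants. Zeros $d_k$ are sometimes used to model nasals and fricatives. Many applications nevertheless use only poles, because that makes parameter estimation simpler.
For voiced sounds, the glottal excitation determines the pitch. A periodic train of smooth glottal pulses has a harmonic line spectrum whose amplitude decreases as frequency increases.
For unvoiced sounds, the linear system is excited by a random number generator that produces a discrete-time noise signal with a flat spectrum.
Over a short time interval, the system is treated as fixed.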
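To see how a pole near the unit circle relates to a formant, here is a minimal Python sketch using the common approximation that a pole $c = re^{j\theta}$ gives a resonance at frequency $\theta f_s / 2\pi$ with bandwidth of roughly $-\ln(r) f_s / \pi$. The sampling rate and the 500 Hz / 80 Hz formant are illustrative values I picked, and the function names are hypothetical.

```python
import cmath
import math


def pole_for_formant(freq_hz, bandwidth_hz, fs_hz):
    """Place a pole whose angle sets the formant frequency and whose
    radius sets the (approximate) bandwidth."""
    radius = math.exp(-math.pi * bandwidth_hz / fs_hz)
    angle = 2.0 * math.pi * freq_hz / fs_hz
    return cmath.rect(radius, angle)


def formant_of_pole(pole, fs_hz):
    """Inverse mapping: recover (frequency, bandwidth) in Hz from a pole."""
    freq_hz = cmath.phase(pole) * fs_hz / (2.0 * math.pi)
    bandwidth_hz = -math.log(abs(pole)) * fs_hz / math.pi
    return freq_hz, bandwidth_hz


FS = 8000  # assumed sampling rate in Hz
pole = pole_for_formant(500.0, 80.0, FS)

# The radius is slightly below 1, i.e. the pole sits just inside the unit circle.
print(abs(pole))
print(formant_of_pole(pole, FS))
```

A narrower bandwidth pushes the radius closer to 1, which gives a sharper resonance. For a real-valued filter each such pole comes with its complex conjugate.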
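The two excitation cases can be sketched in a few lines of plain Python: an all-pole filter (no zeros) is driven once by a periodic impulse train and once by white noise. This is only a toy under stated assumptions. The sampling rate, pitch, and formant values are made up for illustration, the impulse train is a crude stand-in for smooth glottal pulses (its spectrum is flat rather than falling with frequency), and the helper names are hypothetical.

```python
import math
import random

FS = 8000  # assumed sampling rate in Hz


def all_pole_coeffs(formants, fs):
    """Build a_1..a_N of H(z) = 1 / (1 - sum_k a_k z^-k) from a list of
    (frequency_hz, bandwidth_hz) pairs, one conjugate pole pair each."""
    denom = [1.0]  # coefficients of the denominator polynomial in z^-1
    for freq, bw in formants:
        r = math.exp(-math.pi * bw / fs)
        theta = 2.0 * math.pi * freq / fs
        section = [1.0, -2.0 * r * math.cos(theta), r * r]
        product = [0.0] * (len(denom) + 2)
        for i, d in enumerate(denom):
            for j, s in enumerate(section):
                product[i + j] += d * s
        denom = product
    return [-c for c in denom[1:]]


def all_pole_filter(excitation, a):
    """Run s[n] = e[n] + sum_k a_k s[n-k] sample by sample."""
    out = []
    for n, e in enumerate(excitation):
        s = e
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                s += ak * out[n - k]
        out.append(s)
    return out


def impulse_train(n_samples, pitch_hz, fs):
    """Voiced excitation: one unit impulse per pitch period."""
    period = round(fs / pitch_hz)
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def white_noise(n_samples, seed=0):
    """Unvoiced excitation: Gaussian noise with a flat spectrum."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n_samples)]


FORMANTS = [(500.0, 80.0), (1500.0, 100.0), (2500.0, 120.0)]  # illustrative
a = all_pole_coeffs(FORMANTS, FS)

n_samples = 800  # 100 ms at 8 kHz
voiced = all_pole_filter(impulse_train(n_samples, 100.0, FS), a)
unvoiced = all_pole_filter(white_noise(n_samples), a)
```

The same filter coefficients are used for both outputs; only the excitation changes. In a real synthesizer the coefficients would be updated frame by frame, since the system is only assumed fixed over short intervals.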

More Refined Models

The model described above suits most speech processing applications, but it is based on many approximations: the source and the system are assumed not to interact, the system is assumed to be linear, and the distributed continuous-time vocal tract system is assumed to be well modeled by a discrete linear time-invariant system.
Fluid mechanics and acoustic wave propagation theory are the fundamental physical principles behind more detailed models.
Many researchers have since studied glottal flow, the interaction of the glottal source and the vocal tract in speech production, and the nonlinearities that enter into sound generation and transmission.
Imaging analysis such as X-ray and MRI has also become part of recent speech science.

Summary

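Over a short interval, say 10 ms, the speech signal can be treated as stationary.
For voiced sounds such as vowels, nasals, liquids, and glides, the glottis opens and closes periodically.
Plosive sounds are produced by momentarily closing off the air flow.
For unvoiced sounds the vocal cords do not vibrate, and the excitation is random noise.
Formants, that is, resonances, depend on the articulators such as the tongue, palate, and teeth.
We approximate the speech production system as linear and time-invariant, and we assume that the source and the system do not interact.
Overall nothing here was a big problem. To go further, I could work through digital signal processing once more.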


Reference
