2018-12-14  Duqcuid

Designing CNN Filter Shapes for Music Recognition

Original post: http://www.jordipons.me/cnn-filter-shapes-discussion/

We aim to study how deep learning techniques can learn generalizable musical concepts. To do so, we discuss which musical concepts can be fitted under the constraint of a specific CNN filter shape.

Several architectures can be combined to construct deep learning models: feed-forward neural networks, RNNs, or CNNs. However, since the goal of our work is to understand which (musical) features deep learning models are learning, CNNs seemed an intuitive choice, given that it is common to feed CNNs with spectrograms. Spectrograms have a meaning in time and in frequency, and therefore the resulting CNN filters will have interpretable dimensions (at least) in the first layer: time and frequency. This basic observation motivates the following discussion.


Figure 1. Discussed filter shapes. From left to right: squared/rectangular filter, temporal filter and frequency filter.

1. CNN filter shapes discussion

Due to the success of CNNs in the computer vision research field, that literature has significantly influenced the music informatics research (MIR) community. In the image processing literature, small squared CNN filters (e.g., 3×3 or 7×7) are common. As a result, MIR researchers tend to use similar filter shape setups. However, note that filter dimensions in image processing have a spatial meaning, while filter dimensions over audio spectrograms correspond to time and frequency. Therefore, wider filters may be capable of learning longer temporal dependencies in the audio domain, while taller filters may be capable of learning more spread timbral features.

To encourage researchers to be aware of the potential impact of choosing one filter shape or another, three examples and a use case are discussed in the following. Throughout this post we assume the spectrogram dimensions to be M-by-N, the filter dimensions to be m-by-n, and the feature map dimensions to be M'-by-N', with M, m, and M' standing for the number of frequency bins and N, n, and N' for the number of time frames.
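The notation above can be made concrete with a minimal sketch. Assuming a "valid" convolution with stride 1 (an illustrative assumption; the post does not fix a padding scheme), the feature map dimensions are M' = M − m + 1 and N' = N − n + 1, so the three discussed filter shapes produce differently shaped feature maps. The 96-by-187 spectrogram size here is a hypothetical example, not from the original post:

```python
def feature_map_shape(spec_shape, filt_shape):
    """Output size of a 'valid' convolution with stride 1:
    M' = M - m + 1, N' = N - n + 1."""
    M, N = spec_shape
    m, n = filt_shape
    return (M - m + 1, N - n + 1)

# A hypothetical spectrogram: 96 frequency bins by 187 time frames.
spec = (96, 187)

# Squared filter (as in computer vision): a small time-frequency patch.
print(feature_map_shape(spec, (3, 3)))    # (94, 185)

# Temporal filter (1-by-n): spans many frames within a single bin.
print(feature_map_shape(spec, (1, 60)))   # (96, 128)

# Frequency filter (m-by-1): spans many bins within a single frame.
print(feature_map_shape(spec, (32, 1)))   # (65, 187)
```

Note how the temporal filter preserves the full frequency resolution (M' = M) and the frequency filter preserves the full temporal resolution (N' = N), which is part of what makes them interpretable.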

To conclude this section, we discuss the results posted by Keunwoo Choi as a case study. They use a 5-layer CNN of squared 3-by-3 filters for genre classification. After auralising and visualising the network filters, they provide an interpretation of the learned CNN filters in every layer.

Note that Keunwoo Choi's observations are consistent with the previously presented discussion. As a result of using small squared 3-by-3 filters, the lower layers of the deep CNN learn musical concepts that fit under the constraint of being represented in a sub-band for a short time. Moreover, note that deeper layers in the network learn horizontal and vertical lines, suggesting the plausible utility of temporal and frequency filters in CNNs for MIR.
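The intuition that temporal filters respond to horizontal structure (e.g., sustained tones) and frequency filters to vertical structure (e.g., percussive onsets) can be sketched on a toy spectrogram. This is an illustrative demo with hypothetical averaging filters, not the learned filters from Choi's network:

```python
import numpy as np

def conv2d_valid(x, k):
    """Naive 'valid' 2D correlation with stride 1 (enough for a demo)."""
    M, N = x.shape
    m, n = k.shape
    out = np.zeros((M - m + 1, N - n + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + m, j:j + n] * k)
    return out

# Toy spectrogram: a sustained tone (one horizontal line) and a
# percussive onset (one vertical line).
spec = np.zeros((12, 20))
spec[4, :] = 1.0   # tone: energy in one bin across all frames
spec[:, 10] = 1.0  # onset: energy in all bins at one frame

temporal_filt = np.ones((1, 5)) / 5    # 1-by-n: spans time
frequency_filt = np.ones((5, 1)) / 5   # m-by-1: spans frequency

t_resp = conv2d_valid(spec, temporal_filt)
f_resp = conv2d_valid(spec, frequency_filt)

print(t_resp[4, 0])    # on the tone, the temporal filter saturates (1.0)
print(f_resp[0, 10])   # on the onset, the frequency filter saturates (1.0)
print(f_resp[0, 0])    # the frequency filter barely sees the tone (0.2)
```

Each filter shape responds maximally to the structure whose orientation it matches, which is the behaviour Choi observed emerging in the deeper layers of a squared-filter network.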

As observed in this example, the model needed deep representations (stacked CNN layers) to represent large time-frequency contexts, since it is difficult for the first layers to capture long temporal dependencies or wide frequency signatures with such small squared filters. This fact highlights the potential of employing temporal and frequency filters: by using these filters in the first layer(s), the depth of the network can be used to learn features other than vertical and horizontal lines.
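Why depth is needed can be made explicit by computing the receptive field of stacked small filters. Assuming stride-1 convolutions and no pooling (a simplifying assumption; pooling between layers, as in Choi's actual network, would enlarge these figures), the receptive field grows only linearly with depth:

```python
def receptive_field(num_layers, kernel=3):
    """Receptive field of `num_layers` stacked stride-1 convolutions
    with square kernel-by-kernel filters and no pooling:
    rf = 1 + num_layers * (kernel - 1)."""
    return 1 + num_layers * (kernel - 1)

print(receptive_field(1))   # 3:  one 3x3 layer sees a 3x3 patch
print(receptive_field(5))   # 11: even five stacked 3x3 layers span
                            # only 11 bins by 11 frames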

To conclude this text, we want to remark that these interpretations do not hold only for music: similar reasoning applies to speech audio, or to any audio-related deep learning task.

The next post proposes and assesses some musically motivated architectures that build on the discussion presented here.
