The enhancement problem is defined as follows: given an input noisy signal x̃, we want to clean it to obtain the enhanced signal x̂. We propose to do so with a speech enhancement GAN (SEGAN). In our case, the G network performs the enhancement. Its inputs are the noisy speech signal x̃ together with the latent representation z, and its output is the enhanced version x̂ = G(x̃). We design G to be fully convolutional, so that there are no dense layers at all. This forces the network to focus on temporally-close correlations in the input signal and throughout the whole layering process. Furthermore, it reduces the number of training parameters and hence the training time.

The G network is structured similarly to an auto-encoder (Fig. 2). In the encoding stage, the input signal is projected and compressed through a number of strided convolutional layers followed by parametric rectified linear units (PReLUs) [23], so that a convolution result is obtained every N steps of the filter. We choose strided convolutions as they were shown to be more stable for GAN training than other pooling approaches [22]. Decimation is done until we get a condensed representation, called the thought vector c, which gets concatenated with the latent vector z. The encoding process is reversed in the decoding stage by means of fractional-strided transposed convolutions (sometimes called deconvolutions), followed again by PReLUs.
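The downsampling arithmetic of the encoding stage can be sketched in plain NumPy. This is a toy stand-in, not the actual SEGAN layers: it uses a single shared filter, a fixed-slope PReLU surrogate, and illustrative sizes (filter length 31, stride 2, five layers).

```python
import numpy as np

def strided_conv1d(x, w, stride):
    """Valid 1-D strided convolution (correlation form); x: (T,), w: (K,)."""
    K = len(w)
    out_len = (len(x) - K) // stride + 1
    return np.array([np.dot(x[i * stride : i * stride + K], w)
                     for i in range(out_len)])

rng = np.random.default_rng(0)
x = rng.standard_normal(16384)        # ~1 s of 16 kHz audio (toy input)
w = rng.standard_normal(31)           # one shared filter, for illustration

h = x
for _ in range(5):                    # five stride-2 "layers": ~32x decimation
    h = strided_conv1d(h, w, stride=2)
    h = np.where(h > 0, h, 0.1 * h)   # fixed-slope stand-in for a PReLU

c = h                                 # condensed "thought vector"
z = rng.standard_normal(c.shape)      # latent vector z
cz = np.concatenate([c, z])           # decoder input: c concatenated with z
```

Each stride-2 layer roughly halves the temporal resolution, so the condensed representation is about 32 times shorter than the raw waveform before z is attached.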

The G network also features skip connections, connecting each encoding layer to its homologous decoding layer and bypassing the compression performed in the middle of the model (Fig. 2). This is done because the input and output of the model share the same underlying structure, which is that of natural speech. Therefore, many low-level details required to reconstruct the speech waveform properly could be lost if we force all information to flow through the compression bottleneck. Skip connections directly pass the fine-grained information of the waveform to the decoding stage (e.g., phase, alignment). In addition, they offer a better training behavior, as the gradients can flow deeper through the whole structure [24].
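The mechanics of such a skip connection can be shown with a toy shape check (illustrative channel and time sizes, not SEGAN's actual ones): the decoder layer receives its upsampled features concatenated channel-wise with the feature map of the homologous encoder layer, so the fine-grained information never has to pass through the bottleneck.

```python
import numpy as np

rng = np.random.default_rng(0)
enc_feat = rng.standard_normal((64, 512))  # (channels, time) from an encoder layer
dec_feat = rng.standard_normal((64, 512))  # upsampled decoder features, same resolution

# Skip connection: channel-wise concatenation doubles the decoder's input channels.
dec_in = np.concatenate([dec_feat, enc_feat], axis=0)
```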

An important feature of G is its end-to-end structure: it processes raw speech sampled at 16 kHz, getting rid of any intermediate transformations to extract acoustic features (in contrast to many common pipelines). In this type of model, we have to be careful with typical regression losses like mean absolute error or mean squared error, as noted in the raw speech generative model WaveNet [25]. These losses rest on strong assumptions about how our output distribution is shaped and therefore impose important modeling limitations (like not allowing multi-modal distributions and biasing the predictions towards an average of all the possible predictions). Our solution to overcome these limitations is to use the generative adversarial setting. This way, D is in charge of conveying to G what is real and what is fake, such that G can slightly correct its output waveform towards the realistic distribution, getting rid of the noisy signals as those are signaled to be fake. In this sense, D can be understood as learning some sort of loss for G's output to look real.
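The averaging bias of a squared-error loss is easy to demonstrate numerically. In this toy example (unrelated to SEGAN's actual data), the target distribution has two equally likely modes, +1 and −1; the single prediction that minimizes MSE is their mean, 0, which matches neither mode.

```python
import numpy as np

# Two equally likely target "modes" — a stand-in for a multi-modal
# output distribution over waveform samples.
targets = np.array([1.0, -1.0])

# Sweep candidate constant predictions and score each by MSE.
candidates = np.linspace(-1.5, 1.5, 301)
mse = [np.mean((targets - c) ** 2) for c in candidates]

# The minimizer is the mean of the modes: an "average" output that
# corresponds to neither of the actually possible targets.
best = candidates[int(np.argmin(mse))]
```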

In preliminary experiments, we found it convenient to add a secondary component to the loss of G in order to minimize the distance between its generations and the clean examples. To measure such distance, we chose the L1 norm, as it has been proven to be effective in the image manipulation domain [20, 26]. This way, we let the adversarial component add more fine-grained and realistic results. The weight of the L1 term is controlled by a new hyper-parameter λ. Therefore, the G loss, which we choose to be the one of LSGAN (Eq. 4), becomes

\min_G V_{LSGAN}(G) = \frac{1}{2}\,\mathbb{E}_{z \sim p_z(z),\, \tilde{x} \sim p_{data}(\tilde{x})}\big[(D(G(z, \tilde{x}), \tilde{x}) - 1)^2\big] + \lambda \,\lVert G(z, \tilde{x}) - x \rVert_1
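A minimal sketch of the resulting G loss, assuming the LSGAN least-squares form for the adversarial term and a mean-absolute implementation of the L1 term. The function name and the default λ = 100 are illustrative; λ is a hyper-parameter to be tuned.

```python
import numpy as np

def g_loss(d_fake, enhanced, clean, lam=100.0):
    """Sketch of the G objective: LSGAN term plus lambda-weighted L1 distance."""
    # LSGAN pushes D's scores on G's outputs towards the "real" label 1.
    adv = 0.5 * np.mean((d_fake - 1.0) ** 2)
    # L1 distance between the enhanced output and the clean reference.
    l1 = np.mean(np.abs(enhanced - clean))
    return adv + lam * l1

d_fake = np.array([0.8, 0.9])   # D's scores on two enhanced samples
enhanced = np.zeros(4)
clean = np.zeros(4)

# With a perfect L1 match, only the adversarial term remains.
loss = g_loss(d_fake, enhanced, clean)
```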

Speech Enhancement with a Generative Adversarial Network (SEGAN)
