The enhancement problem is defined as follows: given an input noisy signal x̃, the goal is to clean it to obtain the enhanced signal x̂. We propose to achieve this with a speech enhancement GAN (SEGAN). In our case, the G network performs the enhancement. Its inputs are the noisy speech signal x̃ together with the latent representation z, and its output is the enhanced version x̂ = G(x̃). We design G to be fully convolutional, with no dense layers at all. This forces the network to focus on temporally-close correlations in the input signal and throughout the whole layering process. Additionally, it reduces the number of training parameters, thereby decreasing training time.

The G network is structured similarly to an auto-encoder (Fig. 2). In the encoding stage, the input signal is projected and compressed through several strided convolutional layers followed by parametric rectified linear units (PReLUs) [23], producing a convolution output every N steps of the filter. We opt for strided convolutions as they have proven more stable for GAN training than other pooling approaches [22]. Decimation is performed until we obtain a condensed representation, referred to as the thought vector c, which is concatenated with the latent vector z. The encoding process is reversed in the decoding stage through fractional-strided transposed convolutions (sometimes called deconvolutions), again followed by PReLUs.
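As a toy illustration of this decimation, the following numpy sketch applies repeated strided convolutions with PReLU activations, halving the temporal resolution at each layer. The filter values, widths, stride, and layer count here are placeholders for illustration, not the actual SEGAN configuration.

```python
import numpy as np

def strided_conv1d(x, kernel, stride):
    """Valid 1-D convolution with the given stride:
    one output sample every `stride` input steps."""
    k = len(kernel)
    out_len = (len(x) - k) // stride + 1
    return np.array([np.dot(x[i * stride: i * stride + k], kernel)
                     for i in range(out_len)])

def prelu(x, alpha=0.25):
    """Parametric ReLU with a single slope parameter `alpha`
    (learned in the real network, fixed here)."""
    return np.where(x >= 0, x, alpha * x)

# Toy encoder: repeated stride-2 convolutions shrink the temporal
# axis until a condensed representation remains.
signal = np.random.randn(64)   # stand-in for a raw waveform chunk
kernel = np.random.randn(4)    # stand-in for one learned filter
h = signal
for _ in range(3):
    h = prelu(strided_conv1d(h, kernel, stride=2))
print(len(signal), '->', len(h))   # temporal length decreases each layer
```

With stride 2 and filter width 4, each layer roughly halves the sequence length (64 → 31 → 14 → 6 here), which is the decimation the text describes.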

The G network also incorporates skip connections, linking each encoding layer to its corresponding decoding layer and bypassing the compression performed in the middle of the model (Fig. 2). This is implemented because the model's input and output share the same underlying structure, which is that of natural speech. Consequently, many low-level details could be lost if all information is forced to flow through the compression bottleneck, hindering proper reconstruction of the speech waveform. Skip connections directly pass the fine-grained information of the waveform to the decoding stage (e.g., phase, alignment). Furthermore, they contribute to better training behavior, as gradients can flow deeper through the entire structure [24].
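A minimal sketch of one such skip connection, under the assumption that it is realized by concatenating the saved encoder feature map onto the decoder input along the channel axis (the shapes and the helper name are illustrative, not from the paper):

```python
import numpy as np

def decoder_layer_with_skip(dec_in, enc_feat):
    """Merge an encoder feature map into the decoder path via a skip
    connection: concatenate along the channel axis before the
    (here omitted) transposed convolution."""
    assert dec_in.shape[1] == enc_feat.shape[1]  # same temporal length
    return np.concatenate([dec_in, enc_feat], axis=0)

# channels x time feature maps at one depth of the auto-encoder
enc_feat = np.random.randn(16, 128)   # saved during encoding
dec_in   = np.random.randn(16, 128)   # arriving from the layer below
merged = decoder_layer_with_skip(dec_in, enc_feat)
print(merged.shape)   # (32, 128): channels double, time axis is kept
```

The fine-grained waveform information thus reaches the decoder directly, without being squeezed through the bottleneck.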

An important feature of G is its end-to-end structure, enabling it to process raw speech sampled at 16 kHz, eliminating any intermediate transformations for extracting acoustic features (in contrast to many common pipelines). In this type of model, we need to be cautious with typical regression losses like mean absolute error or mean squared error, as noted in the raw speech generative model WaveNet [25]. These losses operate under strong assumptions regarding the shape of our output distribution, thereby imposing significant modeling limitations (such as not allowing multi-modal distributions and biasing the predictions towards an average of all possible predictions). Our solution to overcome these limitations is to employ the generative adversarial setting. This way, D is responsible for transmitting information to G about what is real and what is fake, allowing G to slightly adjust its output waveform towards the realistic distribution and eliminate noisy signals as they are identified as fake. In this context, D can be understood as learning a form of loss for G's output to appear real.
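The averaging bias of regression losses can be seen in a toy example: if the same noisy input admits two equally plausible clean outputs, the single prediction that minimizes mean squared error is their mean, which matches neither mode.

```python
import numpy as np

# Two equally likely clean targets for the same noisy input:
# a toy bimodal output distribution.
targets = np.array([1.0, -1.0])

# Scan candidate predictions and pick the MSE-optimal one.
candidates = np.linspace(-1.5, 1.5, 301)
mse = ((candidates[:, None] - targets[None, :]) ** 2).mean(axis=1)
best = candidates[np.argmin(mse)]
print(best)   # ~0.0: regression collapses both modes to their average
```

The adversarial loss avoids this collapse: D only asks whether an output looks real, so G is free to commit to one plausible mode instead of hedging between them.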

In preliminary experiments, we found it beneficial to add a secondary component to G's loss that minimizes the distance between its generations and the clean examples. To measure this distance, we chose the L1 norm, as it has proven effective in the image manipulation domain [20, 26]. This way, the adversarial component contributes more fine-grained and realistic results. The magnitude of the L1 term is controlled by a new hyper-parameter λ. Consequently, the G loss, which we adopt from LSGAN (Eq. 4), becomes

min_G (1/2) E_{z∼p_z(z), x̃∼p_data(x̃)} [(D(G(z, x̃), x̃) − 1)²] + λ ||G(z, x̃) − x||₁.
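A numpy sketch of this combined objective, assuming the adversarial term is the least-squares generator loss and the L1 term is averaged over samples; the function name and the value of λ are placeholders, not taken from the paper.

```python
import numpy as np

lam = 100.0   # λ, weight of the L1 term (hypothetical value)

def g_loss(d_out_fake, g_out, clean):
    """LSGAN generator loss plus the λ-weighted L1 distance
    between the enhanced waveform and the clean reference."""
    adv = 0.5 * np.mean((d_out_fake - 1.0) ** 2)  # push D's scores toward "real"
    l1 = lam * np.mean(np.abs(g_out - clean))     # stay close to the clean signal
    return adv + l1

d_out_fake = np.array([0.8, 0.9])     # D's scores on enhanced samples
g_out = np.array([0.1, -0.2, 0.05])   # toy enhanced waveform
clean = np.array([0.1, -0.2, 0.0])    # toy clean reference
print(g_loss(d_out_fake, g_out, clean))
```

The loss vanishes only when D scores the output as fully real and the enhanced waveform matches the clean reference exactly; λ trades off realism against fidelity.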

Speech Enhancement with Generative Adversarial Networks (SEGAN)
