Revolutionizing Speech Enhancement with Pre-trained Generative Audioencoders and Vocoders

Researchers have developed a flexible speech enhancement system using pre-trained generative audioencoders and vocoders, achieving superior speech quality and efficiency compared to existing models.

Leveraging Pre-trained Audio Models for Superior Speech Enhancement

Recent innovations in speech enhancement have shifted focus from traditional mask-based or signal prediction techniques to leveraging pre-trained audio models that extract richer and more transferable audio embeddings. Models like WavLM provide meaningful audio representations that significantly boost speech enhancement performance. Some methods use these embeddings to predict masks or combine them with spectral features, while others employ generative neural vocoders to reconstruct clean speech directly from noisy embeddings.

Limitations of Current Approaches

Many current approaches freeze pre-trained models or require extensive fine-tuning, which hampers adaptability and increases computational demands. This makes it challenging to transfer the models to other audio-related tasks such as dereverberation or source separation.

A Lightweight and Flexible Speech Enhancement System

Researchers from MiLM Plus, Xiaomi Inc., propose a novel, efficient, and adaptable speech enhancement system based on pre-trained generative audioencoders and vocoders. The process begins by extracting audio embeddings from noisy speech using a frozen audioencoder. These embeddings are then denoised by a small encoder before being passed to a vocoder that reconstructs clean speech.

Unlike task-specific models, the audioencoder and vocoder are pre-trained independently, enabling the system to adapt easily to other tasks. Experimental results demonstrate that generative models outperform discriminative models in terms of speech quality and speaker fidelity. Despite its simplicity, this approach surpasses leading speech enhancement models in listening tests.
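To make the data flow concrete, here is a minimal sketch of the inference pipeline described above. The module and function names are illustrative assumptions, not the authors' actual API: a frozen audioencoder maps noisy speech to embeddings, a small denoise encoder cleans them, and a pre-trained vocoder synthesizes the enhanced waveform.

```python
import torch

@torch.no_grad()
def enhance(noisy_wav: torch.Tensor,
            audioencoder: torch.nn.Module,     # frozen, pre-trained
            denoise_encoder: torch.nn.Module,  # small, separately trained module
            vocoder: torch.nn.Module) -> torch.Tensor:
    emb_noisy = audioencoder(noisy_wav)        # (batch, frames, dim) embeddings
    emb_clean = denoise_encoder(emb_noisy)     # denoised embeddings
    return vocoder(emb_clean)                  # reconstructed clean waveform
```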

System Architecture and Training

The system consists of three core components:

  • Pre-trained Audioencoder: Processes noisy speech into audio embeddings.
  • Denoise Encoder: Refines these embeddings to reduce noise.
  • Vocoder: Reconstructs clean speech from the denoised embeddings.

The denoise encoder and vocoder are trained separately, both relying on the frozen audioencoder. The denoise encoder is trained on paired noisy/clean speech by minimizing the Mean Squared Error between its denoised embeddings and the embeddings extracted from the corresponding clean speech. It uses a Vision Transformer (ViT) architecture with standard activation and normalization layers.
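A rough sketch of one training step for that objective, assuming hypothetical module names and paired noisy/clean utterances (this is not the authors' released code):

```python
import torch
import torch.nn.functional as F

def denoise_encoder_step(noisy_wav, clean_wav,
                         audioencoder, denoise_encoder, optimizer):
    # The audioencoder stays frozen; only the denoise encoder is updated.
    with torch.no_grad():
        emb_noisy = audioencoder(noisy_wav)
        emb_clean = audioencoder(clean_wav)   # regression target

    pred = denoise_encoder(emb_noisy)
    loss = F.mse_loss(pred, emb_clean)        # MSE between denoised and clean embeddings

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```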

The vocoder is trained in a self-supervised manner using clean speech only. It reconstructs speech waveforms by predicting Fourier spectral coefficients, converted back to audio via inverse short-time Fourier transform. The vocoder is based on a modified Vocos framework and employs a Generative Adversarial Network (GAN) setup with a ConvNeXt-based generator and multi-period, multi-resolution discriminators. Losses include adversarial, reconstruction, and feature matching components. The audioencoder remains frozen throughout training, using weights from publicly available models.
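The final reconstruction step works roughly as in the Vocos family: the generator predicts spectral coefficients that are inverted to a waveform with the inverse STFT. The snippet below is a simplified illustration of that head only; the FFT size, hop length, and magnitude/phase parameterization are assumptions, not the paper's configuration.

```python
import torch

def istft_head(features: torch.Tensor,
               n_fft: int = 1024, hop_length: int = 256) -> torch.Tensor:
    # features: (batch, frames, 2 * (n_fft // 2 + 1)) from a ConvNeXt-style generator.
    # Split the projection into log-magnitude and phase, form a complex spectrogram,
    # and invert it with the inverse short-time Fourier transform.
    freq_bins = n_fft // 2 + 1
    log_mag, phase = features.split(freq_bins, dim=-1)
    spec = torch.exp(log_mag) * torch.exp(1j * phase)   # complex spectrogram
    spec = spec.transpose(1, 2)                         # (batch, freq, frames)
    window = torch.hann_window(n_fft, device=features.device)
    return torch.istft(spec, n_fft=n_fft, hop_length=hop_length, window=window)
```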

Performance and Evaluation

Generative audioencoders like Dasheng significantly outperform discriminative counterparts. On the DNS1 dataset, Dasheng achieved a speaker similarity score of 0.881, compared with roughly 0.486 for WavLM and 0.489 for Whisper. Speech quality metrics such as DNSMOS and NISQAv2 also showed notable improvements even with smaller denoise encoders; for example, the ViT3 variant achieved a DNSMOS of 4.03 and a NISQAv2 of 4.41.

Subjective listening tests with 17 participants yielded a Mean Opinion Score (MOS) of 3.87 for the Dasheng-based system, outperforming Demucs (3.11) and LMS (2.98) and confirming its superior perceptual quality.

Summary

This study introduces an efficient, adaptable speech enhancement system that capitalizes on pre-trained generative audioencoders and vocoders, avoiding full model fine-tuning. By denoising audio embeddings with a lightweight encoder and reconstructing speech via a pre-trained vocoder, it achieves high computational efficiency and superior speech quality. The approach promises versatility and stronger speaker fidelity, making it a compelling solution for advanced speech enhancement applications.

For more details, refer to the original paper and GitHub page associated with this research.