MAGE: A Coarse-to-Fine Speech Enhancer with Masked Audio Generator

The Hieu Pham†*, Tan Dat Nguyen‡*, Phuong Thanh Tran Nguyen, Joon Son Chung, Duc Dung Nguyen

AITech Lab, Ho Chi Minh City University of Technology, VNUHCM
Korea Advanced Institute of Science and Technology (KAIST)
* These authors contributed equally to this work

Existing speech enhancement (SE) models face a trade-off: discriminative methods are efficient but generalize poorly, while powerful generative approaches are computationally expensive and difficult to train. To address this, we introduce MAGE, a lightweight mask-based generative audio enhancer. Fine-tuned from the Qwen2.5-0.5B language model and built on BigCodec, MAGE has a small footprint of approximately 200 million parameters. It employs a novel coarse-to-fine, scarcity-aware masking strategy to improve learning efficiency, and an auxiliary corrector to smooth the generation process. Extensive experiments on various enhancement tasks demonstrate that MAGE sets a new benchmark in enhancement performance while remaining significantly more efficient than previous methods.

Live Demo

Test MAGE speech enhancement with your own audio. Record your voice or upload an audio file to see the enhancement in action.

Important: For best results, audio should be sampled at 16 kHz.
Note: This demo runs on shared infrastructure and may take a few moments to process. Please be patient.

Supported formats: WAV, MP3, FLAC, M4A (uploads are converted to 16 kHz WAV)
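
To check whether a recording already matches the 16 kHz rate mentioned above before uploading it, the sample rate can be read from the file header. The following is a minimal sketch using only Python's standard library; the function names and the TARGET_RATE_HZ constant are illustrative and are not part of MAGE or of this demo.

```python
import wave

# Sample rate stated on this page; the constant name is hypothetical.
TARGET_RATE_HZ = 16_000


def wav_sample_rate(path):
    """Return the sample rate in Hz stored in a WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, target_rate_hz=TARGET_RATE_HZ):
    """Return True if the file's sample rate differs from the target rate."""
    return wav_sample_rate(path) != target_rate_hz
```

This sketch only inspects uncompressed WAV files, since the standard-library wave module does not decode MP3, FLAC, or M4A. It reports whether conversion would be needed and does not perform the resampling itself.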

Audio Samples Demo

Compare audio quality across different models. Click on any audio player to show/hide its spectrogram visualization.
