Audio-Visual Speech Separation via Bottleneck Iterative Network

Presented at the ICML 2025 Workshop on Machine Learning for Audio.

Introduction

We introduce a new audio-visual speech separation (AVSS) model, the Bottleneck Iterative Network (BIN), which iteratively refines the audio and visual representations through their fused representation: the inputs pass repeatedly through the same fusion block, whose bottleneck fusion variables and per-modality outputs feed the next iteration. Tested on two popular AVSS benchmarks, BIN strikes a good balance between separation quality and computing resources: it matches RTFS-Net's state-of-the-art performance (and improves on SI-SDR) while saving up to 74% of training time and 80% of GPU inference time. Our code is available on GitHub.
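To make the iteration concrete, here is a minimal PyTorch sketch of the idea as summarized above. It is not the released implementation; every module and variable name (`BottleneckFusionBlock`, `IterativeFusion`, `n_bottleneck`, `n_iters`, the specific attention layout) is an illustrative assumption.

```python
import torch
import torch.nn as nn

class BottleneckFusionBlock(nn.Module):
    """One fusion pass: the two modalities exchange information only
    through a small set of shared bottleneck fusion variables."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.to_bottleneck = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.to_audio = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.to_video = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, a, v, b):
        # a: (B, Ta, D) audio tokens, v: (B, Tv, D) video tokens,
        # b: (B, K, D) bottleneck variables with K << Ta, Tv.
        av = torch.cat([a, v], dim=1)
        b, _ = self.to_bottleneck(b, av, av)   # compress both modalities into b
        a = a + self.to_audio(a, b, b)[0]      # fused info flows back to audio
        v = v + self.to_video(v, b, b)[0]      # ...and back to video (residual)
        return a, v, b

class IterativeFusion(nn.Module):
    """Repeated progression through the same fusion block, so the
    unimodal representations are refined by the fused representation."""
    def __init__(self, dim, n_bottleneck=4, n_iters=3):
        super().__init__()
        self.block = BottleneckFusionBlock(dim)
        self.b0 = nn.Parameter(torch.randn(1, n_bottleneck, dim))
        self.n_iters = n_iters

    def forward(self, a, v):
        b = self.b0.expand(a.size(0), -1, -1)
        for _ in range(self.n_iters):
            a, v, b = self.block(a, v, b)
        return a, v, b

a = torch.randn(2, 200, 64)   # dummy audio token sequence
v = torch.randn(2, 50, 64)    # dummy video token sequence
a_ref, v_ref, fused = IterativeFusion(64)(a, v)
```

The key design point is that the few bottleneck tokens force all cross-modal exchange through a narrow interface, and reusing the same block across iterations keeps the parameter and compute budget small.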

Intuition

A traditional fusion method learns representations from the unimodal inputs and the fusion space (blue arrows in the figure), but it never brings the fused, late-stage information back to the early embedding space (black arrows with question marks in the figure).

This backward connection can matter. Consider an audio-visual speech separation (AVSS) example in which one female speaker says "WE MUST ADOPT THAT WAY OF" while another female speaker says "THEY'RE LIKE THE BASEBALL SCOUTS 20 YEARS". Because of noise distortion and the similarity of the two voices, the opening words "WE MUST" and "THEY'RE" sound close to each other. Without a backward connection, the information needed to disambiguate them at the early stage may be lost, or compressed too aggressively to be recovered, by the time the final fused latent representation is formed. With a backward connection, the fused late latent representation is added back to the original embeddings as a residual, and the combined representation is processed again to obtain a refined fused latent representation, so early-stage information is continually reinjected and information loss is avoided.
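A schematic, runnable contrast of the two designs follows; the `fuse`, `back_to_audio`, and `back_to_video` modules and the `n_iters` count are hypothetical stand-ins, not the paper's components.

```python
import torch
import torch.nn as nn

dim, n_iters = 64, 3
fuse = nn.Linear(2 * dim, dim)        # stand-in for any fusion module
back_to_audio = nn.Linear(dim, dim)   # projects fused info back (hypothetical)
back_to_video = nn.Linear(dim, dim)

a0 = torch.randn(2, dim)              # early audio embedding
v0 = torch.randn(2, dim)              # early video embedding

# Forward-only fusion: the early embeddings are consumed exactly once,
# so fine-grained early detail ("WE MUST" vs "THEY'RE") can be lost.
fused = fuse(torch.cat([a0, v0], dim=-1))

# Backward connection: the fused late representation is added back to
# the original embeddings as a residual, and the combination is fused
# again, so early information re-enters every refinement step.
a, v = a0, v0
for _ in range(n_iters):
    fused = fuse(torch.cat([a, v], dim=-1))
    a = a0 + back_to_audio(fused)     # residual keeps early audio info alive
    v = v0 + back_to_video(fused)     # residual keeps early video info alive
```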

Paper Citation

@misc{zhang2025audiovisualspeechseparationbottleneck,
      title={Audio-Visual Speech Separation via Bottleneck Iterative Network}, 
      author={Sidong Zhang and Shiv Shankar and Trang Nguyen and Andrea Fanelli and Madalina Fiterau},
      year={2025},
      eprint={2507.07270},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2507.07270}, 
}

Audio Samples

Case 1: BIN avoids the separation distortion, at the beginning and in the middle of the clip, that AVLIT and RTFS-Net introduce

Audio Mixture
Ground Truth Text
WE MUST ADOPT THAT WAY OF
THEY’RE LIKE THE BASEBALL SCOUTS 20 YEARS
Clean Audio
AVLIT Output
RTFS-Net Output
BIN Output

Case 2: BIN separates the audio with some flaws; AVLIT fails to separate it; RTFS-Net assigns the separated audio to the wrong speaker

Audio Mixture
Ground Truth Text
I MAY NEVER GO TO YOUR HOUSE YOU MAY
HOW CAN WE GET SO MUCH INFO
Clean Audio
AVLIT Output
RTFS-Net Output
BIN Output

Case 3: All models fail to separate the audio

Audio Mixture
Ground Truth Text
THIS ECONOMY WON'T WORK AND
AND YOU WALK AROUND AND IT POINTS
Clean Audio
AVLIT Output
RTFS-Net Output
BIN Output