ICML 2026 Project Page

Bridging the Stability-Expressivity Gap

Synthetic data can make low-resource spoken language models more stable, but too much of it flattens prosody. SE-Bridge-TTS studies that trade-off and introduces two self-alignment frameworks for expressive Thai and Lao speech.

Yizhong Geng, Yanliang Li, Jinghan Yang, Tianhan Jiang, Boxun An, Ya Li, and Xiaoyu Shen

Thai: complex tonal language with scarce transcribed speech
Lao: pure-synthetic and zero-shot cloning demonstrations
DGSA: alignment with prosody-timbre disentanglement
TDSC: self-critique under minimal authentic references

Overview

Low-resource speech synthesis needs both clarity and life.

The paper frames synthetic speech scaling as a double-edged tool: it supplies phonetic supervision where transcribed speech is scarce, yet it can pull the model toward low-entropy, less expressive prosody.

01

Observe synthetic erosion

Scaling synthetic data improves pronunciation stability first, then progressively suppresses prosodic variability beyond a critical ratio.

02

Align with real references

DGSA constructs identity-consistent preference pairs by exploiting the structural separation of prosody and timbre, recovering expressivity for Thai.

03

Bootstrap without anchors

TDSC uses multi-temperature exploration and filtering to create pseudo-real anchors when authentic Lao references are extremely limited.
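
To make the erosion in step 01 concrete, here is a minimal sketch of one way prosodic variability could be quantified from a pitch track. It is an illustration only, not the metric used in the paper: the function name `f0_semitone_stats`, the 100 Hz reference, and the 1-semitone histogram bins are all assumptions. It reports the spread of voiced F0 in semitones and the entropy of its histogram, so a flatter, lower-entropy contour yields smaller values.

```python
import math
from collections import Counter


def f0_semitone_stats(f0_hz, ref_hz=100.0, bin_width=1.0):
    """Return (std_semitones, entropy_bits) for the voiced frames of an F0 track.

    Hypothetical helper: frames given as 0 or None are treated as unvoiced
    and skipped. The reference pitch and bin width are illustrative defaults.
    """
    # Convert voiced frames from Hz to semitones relative to ref_hz.
    semis = [12.0 * math.log2(f / ref_hz) for f in f0_hz if f and f > 0]
    if len(semis) < 2:
        return 0.0, 0.0

    # Population standard deviation of the pitch contour.
    mean = sum(semis) / len(semis)
    std = math.sqrt(sum((s - mean) ** 2 for s in semis) / len(semis))

    # Shannon entropy (bits) of the binned pitch distribution.
    counts = Counter(math.floor(s / bin_width) for s in semis)
    n = len(semis)
    entropy = sum(-(c / n) * math.log2(c / n) for c in counts.values())
    return std, entropy


flat_std, flat_entropy = f0_semitone_stats([200.0] * 50)
lively_std, lively_entropy = f0_semitone_stats([150.0, 0, 200.0, 250.0, None, 300.0])
```

Under these assumptions, tracking such a statistic across models trained with increasing synthetic ratios would be one way to look for the point where variability starts to drop.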

Key idea

Bridge the gap instead of choosing one side.

Synthetic data gives phonetic stability; preference alignment restores the expressive variation that makes speech sound human.

Stability
Expressivity

The audio section lets visitors hear benchmark comparisons, zero-shot cloning, erosion across synthetic ratios, and the alignment gains from DGSA and TDSC.

Methods

Two self-alignment paths for different data regimes.

The project page keeps the method visuals from the original demo and connects them to the paper narrative, so the page works as both an abstract and a guided listening interface.

Diagram of the DGSA alignment framework
Thai · Preference alignment · Prosody recovery

Disentanglement-Guided Self-Alignment

DGSA exploits prosody-timbre separation in flow-matching SLMs to form self-supervised preference pairs that improve expressivity while preserving pronunciation and speaker identity.
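
This page does not spell out how the preference pairs are built, so the following is only a hedged sketch of what an identity-consistent pair could look like. Everything in it is hypothetical: the `Candidate` fields, the `make_preference_pair` helper, and both thresholds. It assumes each candidate for the same text already carries a speaker-similarity score and a prosody score, keeps only candidates close to the target timbre, and prefers the more expressive one over the flatter one.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Candidate:
    audio_id: str
    speaker_sim: float    # hypothetical similarity to the target timbre, in [0, 1]
    prosody_score: float  # hypothetical expressivity proxy, higher = more varied


def make_preference_pair(
    cands: List[Candidate],
    min_speaker_sim: float = 0.8,  # illustrative threshold, not from the paper
    min_margin: float = 0.5,       # illustrative threshold, not from the paper
) -> Optional[Tuple[Candidate, Candidate]]:
    """Return (chosen, rejected) or None if no usable pair exists."""
    # Identity consistency: discard candidates that drift from the target voice.
    kept = [c for c in cands if c.speaker_sim >= min_speaker_sim]
    if len(kept) < 2:
        return None

    # Prefer the most expressive candidate over the flattest one.
    chosen = max(kept, key=lambda c: c.prosody_score)
    rejected = min(kept, key=lambda c: c.prosody_score)

    # Skip pairs whose prosody difference is too small to be informative.
    if chosen.prosody_score - rejected.prosody_score < min_margin:
        return None
    return chosen, rejected


pair = make_preference_pair([
    Candidate("a", speaker_sim=0.90, prosody_score=2.0),
    Candidate("b", speaker_sim=0.85, prosody_score=0.5),
    Candidate("c", speaker_sim=0.40, prosody_score=5.0),  # dropped: wrong voice
])
```

In this toy call, candidate `c` is the most expressive but fails the speaker check, so the pair is formed from `a` and `b` only.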

Diagram of the TDSC self-critique framework
Lao · Self-critique · Pure synthetic

Temperature-Driven Self-Critique

TDSC samples candidates across generation temperatures, filters them automatically, and reuses the accepted outputs as pseudo-real anchors for very low-resource synthesis.
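
The sample, filter, and reuse loop described above can be sketched roughly as follows. This is a schematic under stated assumptions rather than the paper's implementation: `collect_pseudo_anchors`, the `generate` and `passes_filter` callables, the temperature grid, and the samples-per-temperature count are all placeholders standing in for the real model and the real automatic filter.

```python
from typing import Any, Callable, List, Sequence, Tuple


def collect_pseudo_anchors(
    text: str,
    generate: Callable[[str, float, int], Any],  # hypothetical: (text, temperature, seed) -> sample
    passes_filter: Callable[[Any], bool],        # hypothetical automatic quality check
    temperatures: Sequence[float] = (0.6, 0.8, 1.0, 1.2),  # illustrative grid
    samples_per_temp: int = 2,
) -> List[Tuple[float, Any]]:
    """Sample candidates at several temperatures and keep those that pass the filter."""
    accepted = []
    for temp in temperatures:
        for seed in range(samples_per_temp):
            sample = generate(text, temp, seed)
            # Only accepted samples are reused as pseudo-real anchors.
            if passes_filter(sample):
                accepted.append((temp, sample))
    return accepted


# Stand-in model and filter so the sketch runs without any speech stack.
def fake_generate(text, temp, seed):
    return {"text": text, "temp": temp, "seed": seed}


def fake_filter(sample):
    return sample["temp"] >= 1.0 and sample["seed"] == 0


anchors = collect_pseudo_anchors("sabaidee", fake_generate, fake_filter)
```

In practice the filter would be the interesting part, for example an intelligibility or speaker-consistency check; the stub here exists only to keep the example runnable.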

Audio demo

Selected samples for fast paper browsing.

A compact listening set covers benchmark comparisons, zero-shot cloning, synthetic erosion, and the two self-alignment methods.


Results

Headline metrics from the demo bundle.

The numbers below mirror the current project data file and are meant to orient readers before they jump into the listening examples.

Citation

Reference the ICML paper.

The repository currently hosts the project page only. Paper-related code will be added later.

@inproceedings{geng2026bridging,
  title = {Bridging the Stability-Expressivity Gap: Synthetic Data Scaling and Preference Alignment for Low-Resource Spoken Language Models},
  author = {Geng, Yizhong and Li, Yanliang and Yang, Jinghan and Jiang, Tianhan and An, Boxun and Li, Ya and Shen, Xiaoyu},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
  year = {2026}
}