RRPO: Robust Reward Policy Optimization
for LLM-based Emotional TTS

(Submitted to ICASSP 2026)

Authors: Cong Wang1,†, Changfeng Gao2, Yang Xiang2, Zhihao Du2, Keyu An2,
Han Zhao2, Qian Chen2, Xiangang Li2, Yingming Gao1, Ya Li1,*
1Beijing University of Posts and Telecommunications, Beijing, China
2Speech Team, Tongyi Lab, Alibaba Group
†Work performed during an internship at Tongyi Lab.

Related Works: CosyVoice & DiffRO
(Our method builds on the DiffRO reinforcement learning algorithm within the CosyVoice framework.)

1. Abstract

Differentiable reinforcement learning (RL) frameworks such as DiffRO offer a powerful approach to controllable text-to-speech (TTS), but they are vulnerable to reward hacking, particularly on nuanced tasks like emotion control. The policy model can exploit a vanilla Reward Model (RM) by generating acoustic artifacts that earn spurious rewards at the cost of degraded perceptual quality. To address this, we propose Robust Reward Policy Optimization (RRPO), a novel framework that employs a hybrid regularization scheme. This scheme yields a robust RM whose reward signal is more reliably aligned with human perception, compelling the policy to abandon detrimental shortcuts and instead learn the complex features of genuine emotion. Our ablation study confirms the enhanced robustness of our RM, as evidenced by its strong cross-lingual generalization. Subjective evaluation demonstrates that this robust RM effectively mitigates reward hacking, yielding significant improvements in both emotional expressiveness and naturalness over all baselines.
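
For readers unfamiliar with differentiable reward optimization, the sketch below illustrates the general mechanism in PyTorch: the policy's discrete speech-token distribution is relaxed with a Gumbel-softmax so the reward model's score can be backpropagated directly into the policy parameters. This is a hypothetical toy, not the CosyVoice/DiffRO or RRPO implementation; all module names, dimensions, and the single-step loop are placeholders. In RRPO, the frozen reward model would be the robust, regularization-trained RM described above, whose more reliable signal is what discourages the policy from learning artifact-based shortcuts.

# Minimal, hypothetical sketch of differentiable reward optimization for TTS.
# Nothing here is the RRPO / DiffRO / CosyVoice implementation.
import torch
import torch.nn.functional as F

vocab_size, hidden = 256, 64

# Toy policy: maps text features to logits over discrete speech tokens.
policy = torch.nn.Linear(hidden, vocab_size)

# Toy reward model standing in for a pre-trained RM that scores how well the
# soft token distribution matches the target emotion.
reward_model = torch.nn.Sequential(
    torch.nn.Linear(vocab_size, hidden),
    torch.nn.ReLU(),
    torch.nn.Linear(hidden, 1),
)
for p in reward_model.parameters():      # RM stays frozen during policy updates
    p.requires_grad_(False)

optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

text_feat = torch.randn(8, hidden)       # dummy batch of text/prompt features
logits = policy(text_feat)

# Gumbel-softmax relaxation keeps token "sampling" differentiable, so the
# RM's score can be backpropagated directly into the policy parameters.
soft_tokens = F.gumbel_softmax(logits, tau=1.0, hard=False)
reward = reward_model(soft_tokens).mean()

loss = -reward                           # maximize the RM's emotion reward
optimizer.zero_grad()
loss.backward()
optimizer.step()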

2. Model Framework

Fig. 1. The framework of our proposed Robust Reward Policy Optimization (RRPO).

3. LLM-based Emotional TTS Samples