Audio Large Language Models for All-Type Audio Deepfake Detection via Frequency-Time Reinforcement Learning


Recent advances in audio large language models (ALLMs) have made high-quality synthetic audio widely accessible, increasing the risk of malicious deepfakes across speech, environmental sounds, singing voice, and music. Real-world audio deepfake detection (ADD) therefore requires all-type detectors that generalize across heterogeneous audio and provide interpretable decisions. Given the strong multi-task generalization ability of ALLMs, we first investigate their performance on all-type ADD under both supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT). However, SFT using only binary real/fake labels tends to reduce the model to a black-box classifier, sacrificing interpretability. Meanwhile, naive RFT under sparse supervision is prone to reward hacking and can produce hallucinated, ungrounded rationales. To address these issues, we propose an automatic annotation and polishing pipeline that constructs Frequency-Time (FT) structured chain-of-thought (CoT) rationales, producing ~340K cold-start demonstrations. Building on these CoT data, we propose FT-GRPO, a two-stage training paradigm that cold-starts ALLMs with SFT and then applies Group Relative Policy Optimization (GRPO) under rule-based frequency-time constraints, while explicitly leveraging non-think samples flagged as mismatches during annotation. Experiments demonstrate that FT-GRPO achieves strong all-type detection performance with interpretable FT-grounded rationales.
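
As a rough illustration of the second training stage, the sketch below pairs a rule-based reward (checking that a sampled rationale follows a frequency-time format and that the final real/fake verdict matches the label) with the group-relative advantage normalization used by GRPO. The tag names (<think>, <answer>), keyword checks, and reward weights are illustrative assumptions, not the paper's exact reward specification.

```python
import re
import statistics

# Hypothetical rule-based reward for one sampled completion.
# Assumed format: a <think>...</think> rationale that should cite both
# frequency- and time-domain evidence, then <answer>real|fake</answer>.
def ft_rule_reward(completion: str, gold_label: str) -> float:
    reward = 0.0
    think = re.search(r"<think>(.*?)</think>", completion, re.S)
    answer = re.search(r"<answer>\s*(real|fake)\s*</answer>", completion, re.S)

    # Format reward: both blocks present.
    if think and answer:
        reward += 0.2

    # Frequency-time constraints: the rationale must ground its claims
    # in both domains (keyword checks here are a crude stand-in).
    if think:
        text = think.group(1).lower()
        if any(k in text for k in ("frequency", "spectral", "harmonic")):
            reward += 0.2
        if any(k in text for k in ("temporal", "time-domain", "onset", "rhythm")):
            reward += 0.2

    # Accuracy reward: final verdict matches the binary label.
    if answer and answer.group(1) == gold_label:
        reward += 1.0
    return reward

# Group-relative advantages in the style of GRPO: normalize each reward
# against the group of completions sampled for the same audio prompt.
def grpo_advantages(rewards: list[float]) -> list[float]:
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]

if __name__ == "__main__":
    group = [
        "<think>Unnatural spectral roll-off above 8 kHz; onset jitter in the temporal envelope.</think><answer>fake</answer>",
        "<think>Nothing suspicious here.</think><answer>real</answer>",
    ]
    rewards = [ft_rule_reward(c, "fake") for c in group]
    print(rewards, grpo_advantages(rewards))
```

In this sketch, completions that give grounded frequency-time evidence and the correct verdict receive higher rewards than terse or ungrounded ones, which is the kind of signal that discourages the reward hacking described above.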
