SODA is a suite of open discrete audio foundation models.
It unifies audio and text tasks (Continuation, ASR, TTS) into a single next-token prediction framework: the tasks differ only in how the model is prompted. This demo runs on soda-4b-base. We release checkpoints for all sizes (135M to 4B); find them in our Hugging Face collection.
Note: SODA was trained exclusively on English speech data and does not currently support other languages.
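To make the "one model, many prompts" idea concrete, here is a minimal sketch. Everything named in it is an assumption for illustration: the checkpoint id, the marker-token ids, and the stand-in token lists are hypothetical, not SODA's released interface; the only claim taken from above is that each task is plain next-token prediction over a task-specific prompt.

```python
import torch
from transformers import AutoModelForCausalLM

# Hypothetical checkpoint id; the real ids live in the Hugging Face collection.
model = AutoModelForCausalLM.from_pretrained("soda-4b-base")

ASR_MARKER, TTS_MARKER = 50_001, 50_002  # hypothetical special-token ids
audio_tokens = [101, 102, 103]           # stand-in for discrete audio codec tokens
text_tokens = [7, 8, 9]                  # stand-in for tokenized text

def complete(prompt_ids: list[int], max_new_tokens: int) -> list[int]:
    """Plain next-token prediction; only the prompt is task-specific."""
    ids = torch.tensor([prompt_ids])
    out = model.generate(ids, max_new_tokens=max_new_tokens, do_sample=True)
    return out[0, len(prompt_ids):].tolist()

# Continuation: prompt with audio tokens, decode more audio tokens.
continuation = complete(audio_tokens, max_new_tokens=500)

# ASR: audio tokens plus a transcription marker, decode text tokens.
transcript = complete(audio_tokens + [ASR_MARKER], max_new_tokens=128)

# TTS: text tokens plus a speech marker, decode audio tokens.
speech = complete(text_tokens + [TTS_MARKER], max_new_tokens=1000)
```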
Continue speech from an audio prompt!
The demo takes an input audio clip, with an option to automatically trim silence from the beginning and end of the audio (for more stability), and exposes generation parameters, where 100 tokens ≈ 1 second of audio; the generated continuation is returned as the output.
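Since generation length is specified in tokens, the 100 tokens ≈ 1 second rate gives a direct duration-to-budget conversion. This small helper is our own illustration, not part of the demo:

```python
TOKENS_PER_SECOND = 100  # the demo's rule of thumb: 100 tokens ≈ 1 second

def seconds_to_tokens(seconds: float) -> int:
    """Convert a target audio duration into a token budget for generation."""
    return round(seconds * TOKENS_PER_SECOND)

print(seconds_to_tokens(7.5))  # 750 tokens for roughly 7.5 s of audio
```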
We thank Marin and OpenAthena for enabling this project with their open-development LLM and training infrastructure.