🥤 SODA: Scaling Open Discrete Audio

SODA is a suite of open discrete audio foundation models. It unifies audio and text tasks (Continuation, ASR, TTS) into a single next-token prediction framework, meaning these tasks differ only in how the model is prompted. This demo runs on soda-4b-base. We release checkpoints for all sizes (135M to 4B) – find them in our Hugging Face collection.
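Since every task shares the same next-token interface, switching tasks is just a matter of prompt layout. A minimal sketch of how such prompts could be assembled — the special tokens and helper below are hypothetical illustrations, not SODA's actual vocabulary or API:

```python
# Hypothetical special tokens; SODA's real token vocabulary may differ.
AUDIO_START = "<|audio|>"
TEXT_START = "<|text|>"

def build_prompt(task: str, audio_tokens: list[int], text: str = "") -> str:
    """Assemble a prompt string; the model then simply continues it."""
    audio = AUDIO_START + "".join(f"<a{t}>" for t in audio_tokens)
    if task == "continuation":   # audio in -> more audio out
        return audio
    if task == "asr":            # audio in -> transcript out
        return audio + TEXT_START
    if task == "tts":            # text in -> audio out
        return TEXT_START + text + AUDIO_START
    raise ValueError(f"unknown task: {task}")

print(build_prompt("asr", [12, 345]))  # -> "<|audio|><a12><a345><|text|>"
```

The point is that no task-specific heads or losses are needed: continuation, ASR, and TTS are all plain next-token prediction over one interleaved audio-text token stream.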

Note: SODA was trained exclusively on English speech data and does not currently support other languages.

Continue speech from an audio prompt!

Input

Automatically trim silence from the beginning and end of input audio (for more stability)

Generation Parameters (100 tokens ≈ 1 second)

[Sliders for generation parameters; ranges: 0.1–2, 0.1–1, 100–3000, 0–1000]
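Given the stated rate of roughly 100 tokens per second of audio, the token budget maps directly to output duration. A small helper illustrating the conversion (assumption: the ≈100 tokens/s rate above, treated here as exact):

```python
TOKENS_PER_SECOND = 100  # stated rate: 100 tokens ~ 1 second of audio

def tokens_for_duration(seconds: float) -> int:
    """Number of generation tokens to request for a target duration."""
    return round(seconds * TOKENS_PER_SECOND)

def duration_for_tokens(n_tokens: int) -> float:
    """Approximate audio duration (seconds) produced by n_tokens."""
    return n_tokens / TOKENS_PER_SECOND

print(tokens_for_duration(5.0))   # -> 500
print(duration_for_tokens(3000))  # -> 30.0
```

So the 100–3000 token range above corresponds to roughly 1–30 seconds of generated audio.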

Output

We thank Marin and OpenAthena for enabling this project with open-development LLM and training infrastructure.