RAVE — AEmotionStudio mirror

Curated mirror of public RAVE (Realtime Audio Variational autoEncoder) checkpoints, used by MAESTRO's RAVE Timbre Transfer panel (opt-in starter pack). Sources:

The Intelligent-Instruments-Lab/rave-models curated set (birds, voices, organs, water, etc.).
The official ACIDS-IRCAM public catalog, pulled from the canonical anonymous API at https://play.forum.ircam.fr/rave-vst-api/get_available_models.

RAVE was developed by Antoine Caillon and the ACIDS team at IRCAM. Paper: arXiv:2111.05011. Upstream code: acids-ircam/RAVE.

License

CC-BY-NC-4.0 — non-commercial use only, inherited from the upstream distributions. Generated audio is fine for non-commercial use. Commercial use of the models themselves (e.g. shipping them inside a paid product) requires permission from the original authors / IRCAM.

Per MAESTRO's stance (see LICENSE_AUDIT.md and the feedback_download_on_demand_licensing memory), these weights are fetched on demand by the end user — the user (not MAESTRO the binary) is the licensee.

Models — IIL-curated set (b2048 streaming exports, 18 models)

Each .ts checkpoint has a <stem>.json sidecar with name, license, sample-rate, latent-dim, source URL, and a one-line description.
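
As an illustration of the sidecar convention above, here is a minimal sketch that finds and loads the JSON file sitting next to a checkpoint. The helper name `load_sidecar` is hypothetical, and the exact key names inside the JSON are not specified in this README, so the sketch returns the parsed dictionary without assuming any particular key.

```python
import json
from pathlib import Path
from typing import Optional


def load_sidecar(checkpoint_path) -> Optional[dict]:
    """Return the parsed <stem>.json next to a .ts checkpoint, or None if absent.

    Hypothetical helper: the key names inside the sidecar are not assumed here.
    """
    sidecar = Path(checkpoint_path).with_suffix(".json")
    if not sidecar.exists():
        return None
    with sidecar.open(encoding="utf-8") as handle:
        return json.load(handle)
```

Calling `load_sidecar("magnets_b2048_r48000_z8.ts")` would look for `magnets_b2048_r48000_z8.json` in the same directory; listing the returned dictionary's keys is a quick way to see which fields a given sidecar actually carries.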

Voice / speech

voice_vocalset_b2048_r48000_z16.ts — Voice (VocalSet). Voice timbre trained on the VocalSet corpus — covers vocal techniques across multiple singers. Use for the canonical 'make this sound like a voice' transfer.
voice-multi-b2048-r48000-z11.ts — Voice (Multi-speaker). Aggregated multi-speaker voice corpus. Wider speaker diversity than VocalSet — produces more 'average human' renders.
voice_hifitts_b2048_r48000_z16.ts — Voice (HiFi-TTS). High-fidelity expressive English speech corpus. Cleaner, more articulate than the multi-speaker model.
voice_jvs_b2048_r44100_z16.ts — Voice (JVS, Japanese). JVS Japanese multi-speaker corpus at 44.1 kHz. Use for Japanese-language sources or non-Latin phoneme structure.
voice_vctk_b2048_r44100_z22.ts — Voice (VCTK, English). VCTK English multi-speaker corpus from CSTR Edinburgh, 44.1 kHz. High 22-dim latent — captures more speaker idiosyncrasies.

Bird / wildlife

birds_motherbird_b2048_r48000_z16.ts — Birds (Motherbird). Bird-vocalization corpus — chirps + textural transients. The canonical 'weird' pick: produces wildly warped output for any arbitrary input.
birds_dawnchorus_b2048_r48000_z8.ts — Birds (Dawn Chorus). Dense overlapping bird vocalizations recorded at dawn. Smaller 8-dim latent — outputs lean ensemble-textural over individual calls.
birds_pluma_b2048_r48000_z12.ts — Birds (Pluma). Lighter, individual bird-call timbres. Mid-size 12-dim latent balances character + clarity.
humpbacks_pondbrain_b2048_r48000_z20.ts — Humpback Whales. Humpback-whale song. Long, slow, hauntingly deep vocal contours; pairs well with sustained input.
marinemammals_pondbrain_b2048_r48000_z20.ts — Marine Mammals. Mixed marine-mammal vocalizations — dolphins, orcas, sea-life clicks and cries.

Instruments

guitar_iil_b2048_r48000_z16.ts — Guitar (IIL). Acoustic / electric guitar timbre. Good demo for transferring voice or synth input into a plucked-string voice.
organ_bach_b2048_r48000_z16.ts — Organ (Bach). Pipe-organ timbre trained on Bach repertoire. Sustained harmonic textures — pairs well with melodic input.
organ_archive_b2048_r48000_z16.ts — Organ (Archive). Historical pipe-organ recordings — broader, dustier textures than the Bach model. Good for film-score atmospheres.
sax_soprano_franziskaschroeder_b2048_r48000_z20.ts — Soprano Sax (Schroeder). Soprano-saxophone extended techniques by Franziska Schroeder. Multiphonics, growls, key clicks. 20-dim latent — captures fine-grained articulation.
mrp_strengjavera_b2048_r44100_z16.ts — Magnetic Resonator Piano (Strengjavera). Sustained metallic-string overtones produced by electromagnetically driving piano strings — 44.1 kHz.
crozzoli_bigensemblesmusic_18d.ts — Big Ensemble Music (Crozzoli). Big-ensemble orchestral music (M. Crozzoli). Broad 18-dim latent for hugely textured renders. The sample rate is not embedded in the filename; it defaults to 48 kHz.

Textures / environment

water_pondbrain_b2048_r48000_z16.ts — Water (PondBrain). Water / aquatic textures. Treats any input as if it were running through liquid — bubbles, ripples, splashes.
magnets_b2048_r48000_z8.ts — Magnets. Ferromagnetic / electromagnetic resonance textures — metallic hums, distant industrial buzz, magnetized-string ringing.
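
The filenames in this set follow a visible pattern, so the sample rate and latent size can be read from the name when a sidecar is not at hand. The sketch below is one possible parser, not part of MAESTRO or RAVE. It assumes that `b` is the streaming block size (as the section title suggests), `r` the sample rate in Hz, and `z` the latent dimension, that a bare `<n>d` suffix gives the latent dimension only, and that the 48 kHz fallback noted for the Crozzoli model applies whenever no rate is present. The names `RaveFilenameInfo` and `parse_rave_filename` are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class RaveFilenameInfo(NamedTuple):
    stem: str                  # name without the trailing tags
    block_size: Optional[int]  # assumed meaning of the "b" tag
    sample_rate: int           # Hz, from the "r" tag or the fallback
    latent_dim: Optional[int]  # from the "z" tag or an "<n>d" suffix


# Tags may be joined with "_" or "-" (both appear in this set).
_FULL_TAGS = re.compile(r"[_-]b(\d+)[_-]r(\d+)[_-]z(\d+)$")
_DIM_ONLY = re.compile(r"[_-](\d+)d$")


def parse_rave_filename(filename: str, default_sample_rate: int = 48000) -> RaveFilenameInfo:
    """Parse block size, sample rate and latent size out of a checkpoint name."""
    stem = filename[:-3] if filename.endswith(".ts") else filename
    match = _FULL_TAGS.search(stem)
    if match:
        block, rate, dim = (int(group) for group in match.groups())
        return RaveFilenameInfo(stem[:match.start()], block, rate, dim)
    match = _DIM_ONLY.search(stem)
    if match:
        return RaveFilenameInfo(stem[:match.start()], None, default_sample_rate, int(match.group(1)))
    return RaveFilenameInfo(stem, None, default_sample_rate, None)
```

For example, `parse_rave_filename("voice_vctk_b2048_r44100_z22.ts")` yields a sample rate of 44100 and a latent size of 22, while `crozzoli_bigensemblesmusic_18d.ts` falls back to 48000 Hz with a latent size of 18. When a sidecar is available, its values should take precedence over anything inferred from the name.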

Models — ACIDS public catalog (10 models, mirrored 2026-05-18)

Pulled from the canonical anonymous-download endpoint https://play.forum.ircam.fr/rave-vst-api/get_model/<slug>. Each .ts has a matching <slug>.json sidecar in the same schema as the IIL set.

Slug	Display name	Type	Author	Year	Size	Prior
`VCTK`	VCTK (English Speech)	RAVE v1 (default)	Jb Dupuy	2022	177 MB	✓
`darbouka_onnx`	Darbouka (Percussion)	RAVE v2 (ONNX)	Antoine Caillon	2022	26 MB	–
`nasa`	NASA Apollo 11	RAVE v1 (default)	Antoine Caillon	2022	159 MB	✓
`percussion`	Percussion (Mixed)	RAVE v1 (default)	Antoine Caillon	2022	71 MB	✓
`vintage`	Vintage Music	RAVE v1 (large)	Antoine Caillon	2022	482 MB	✓
`isis`	ISiS (IRCAM Vocal DB)	RAVE v2	A. Chemla–Romeu-Santos	2023	149 MB	–
`musicnet`	MusicNet (Classical)	RAVE v2	A. Chemla–Romeu-Santos	2023	237 MB	✓
`sol_ordinario`	Studio OnLine (Ordinario)	RAVE v2	A. Chemla–Romeu-Santos	2023	149 MB	–
`sol_full`	Studio OnLine (Full)	RAVE v2	A. Chemla–Romeu-Santos	2023	149 MB	–
`sol_ordinario_fast`	Studio OnLine (Ordinario, fast)	RAVE v2 (small)	A. Chemla–Romeu-Santos	2023	43 MB	–

ACIDS set total: ~1.6 GB across 10 models.
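
The total can be checked by summing the Size column of the table above. The figures are the rounded per-model sizes as listed, so the result is approximate.

```python
# Sizes in MB, copied from the Size column of the ACIDS table.
sizes_mb = {
    "VCTK": 177,
    "darbouka_onnx": 26,
    "nasa": 159,
    "percussion": 71,
    "vintage": 482,
    "isis": 149,
    "musicnet": 237,
    "sol_ordinario": 149,
    "sol_full": 149,
    "sol_ordinario_fast": 43,
}

total_mb = sum(sizes_mb.values())
print(f"{len(sizes_mb)} models, {total_mb} MB (~{total_mb / 1000:.1f} GB)")
# prints: 10 models, 1642 MB (~1.6 GB)
```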

Note: VCTK.ts (ACIDS v1, 48 kHz, original 2022 release) and voice_vctk_b2048_r44100_z22.ts (IIL v2 retrain, 44.1 kHz) are different models trained on the same source corpus — keep both for comparison.
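
For readers who want to re-fetch a model from upstream rather than from this mirror, the sketch below builds the download URL from a slug using the endpoint pattern quoted at the top of this section. It assumes the endpoint returns the raw .ts file as the response body and needs no authentication; neither detail is confirmed here beyond the word "anonymous". The function names are hypothetical.

```python
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import quote

# Endpoint pattern as quoted in this README.
BASE_URL = "https://play.forum.ircam.fr/rave-vst-api/get_model/"


def model_url(slug: str) -> str:
    """Build the upstream download URL for a catalog slug."""
    return BASE_URL + quote(slug, safe="")


def download_model(slug: str, dest_dir: str = ".") -> Path:
    """Stream the model to <dest_dir>/<slug>.ts.

    Assumption: the response body is the TorchScript file itself.
    """
    dest = Path(dest_dir) / f"{slug}.ts"
    with urllib.request.urlopen(model_url(slug)) as response, dest.open("wb") as out:
        shutil.copyfileobj(response, out)
    return dest

# Example (performs a network request): download_model("sol_ordinario_fast")
```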

File format

Each *.ts is a TorchScript export of the RAVE model, streaming-mode (causal convolutions, cached state) — ready for realtime or offline inference.

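import torch

# Load the TorchScript export and switch to inference mode.
model = torch.jit.load("vintage.ts").eval()

# Placeholder input shaped (batch, 1, samples); replace with real mono audio
# resampled to the model's sample rate.
audio = torch.zeros(1, 1, 2 ** 16)

with torch.no_grad():
    z = model.encode(audio)  # (B, 1, T) -> latents
    y = model.decode(z)      # latents -> audio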
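
Because the exports are streaming-mode, audio can also be fed through in fixed-size blocks, as a realtime host would do. The following is a sketch under stated assumptions, not the method MAESTRO uses: it takes the block size of 2048 from the `b2048` tag in the IIL filenames, relies only on the `encode` and `decode` methods shown above, and assumes each decoded block has the same length as its input block. The names `block_ranges` and `run_streaming` are hypothetical.

```python
from typing import Iterator, Tuple


def block_ranges(num_samples: int, block_size: int = 2048) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) sample indices covering the signal in fixed-size blocks."""
    for start in range(0, num_samples, block_size):
        yield start, min(start + block_size, num_samples)


def run_streaming(model, audio, block_size: int = 2048):
    """Pass `audio` (B, 1, T) through `model` one block at a time.

    The model's cached convolution state carries context between calls.
    """
    import torch  # imported here so block_ranges stays usable without PyTorch

    outputs = []
    with torch.no_grad():
        for start, end in block_ranges(audio.shape[-1], block_size):
            chunk = audio[..., start:end]
            if chunk.shape[-1] < block_size:
                # Zero-pad the final partial block to the full block size.
                chunk = torch.nn.functional.pad(chunk, (0, block_size - chunk.shape[-1]))
            outputs.append(model.decode(model.encode(chunk)))
    # Trim the padding added to the last block.
    return torch.cat(outputs, dim=-1)[..., : audio.shape[-1]]
```

With a loaded model, `run_streaming(model, audio)` returns a tensor of the same length as the input. Since the cached state persists between calls, a fresh `torch.jit.load` per file is a simple way to avoid carrying state from one recording into the next.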

Models marked ✓ in the Prior column of the ACIDS table additionally ship a learned prior that can generate latents autoregressively (see the RAVE repo for usage).

Where to find more RAVE models

Neutone FX models — community + curated .nm files (the Neutone wrapper format).
IRCAM Forum projects — individual user-submitted models; many require a Forum account.
acids-ircam GitHub releases — reference checkpoints from the maintainers.
IRCAM RAVE Model Challenge 2025 — 11 prize-winner / submission models gated behind a Forum account.

Citation

@article{caillon2021rave,
  title={RAVE: A variational autoencoder for fast and high-quality neural audio synthesis},
  author={Caillon, Antoine and Esling, Philippe},
  journal={arXiv preprint arXiv:2111.05011},
  year={2021}
}
