RAVE — AEmotionStudio mirror

Curated mirror of public RAVE (Realtime Audio Variational autoEncoder) checkpoints, used by MAESTRO's RAVE Timbre Transfer panel (opt-in starter pack).

RAVE was developed by Antoine Caillon and the ACIDS team at IRCAM. Paper: arXiv:2111.05011. Upstream code: acids-ircam/RAVE.

License

CC-BY-NC-4.0 — non-commercial use only, inherited from the upstream distributions. Generated audio is fine for non-commercial use. Commercial use of the models themselves (e.g. shipping them inside a paid product) requires permission from the original authors / IRCAM.

Per MAESTRO's stance (see LICENSE_AUDIT.md and the feedback_download_on_demand_licensing memory), these weights are fetched on demand by the end user — the user (not MAESTRO the binary) is the licensee.


Models — IIL-curated set (b2048 streaming exports, 18 models)

Each .ts checkpoint has a <stem>.json sidecar with name, license, sample-rate, latent-dim, source URL, and a one-line description.
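
A sidecar can be read with the standard library alone. The sketch below is a minimal, hypothetical helper: the key names ("name", "license", "sample_rate", "latent_dim") are assumptions made for illustration, since the exact spelling used inside the JSON files is not listed here. Check a real sidecar before relying on them.

```python
import json
from pathlib import Path


def load_sidecar(checkpoint_path):
    """Return the metadata dict stored next to a .ts checkpoint.

    Assumes the sidecar is named <stem>.json and sits in the same folder.
    """
    sidecar = Path(checkpoint_path).with_suffix(".json")
    with sidecar.open(encoding="utf-8") as fh:
        return json.load(fh)


def describe(meta):
    """Format a one-line summary; the key names used here are hypothetical."""
    name = meta.get("name", "?")
    rate = meta.get("sample_rate", "?")
    dim = meta.get("latent_dim", "?")
    lic = meta.get("license", "?")
    return f"{name}: {rate} Hz, {dim}-dim latent, {lic}"
```

Missing keys fall back to "?" rather than raising, so the helper stays usable if the real schema names its fields differently.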

Voice / speech

  • voice_vocalset_b2048_r48000_z16.ts: Voice (VocalSet). Voice timbre trained on the VocalSet corpus, covering vocal techniques across multiple singers. Use for the canonical 'make this sound like a voice' transfer.
  • voice-multi-b2048-r48000-z11.ts: Voice (Multi-speaker). Aggregated multi-speaker voice corpus. Wider speaker diversity than VocalSet; produces more 'average human' renders.
  • voice_hifitts_b2048_r48000_z16.ts: Voice (HiFi-TTS). High-fidelity expressive English speech corpus. Cleaner, more articulate than the multi-speaker model.
  • voice_jvs_b2048_r44100_z16.ts: Voice (JVS, Japanese). JVS Japanese multi-speaker corpus at 44.1 kHz. Use for Japanese-language sources or non-Latin phoneme structure.
  • voice_vctk_b2048_r44100_z22.ts: Voice (VCTK, English). VCTK English multi-speaker corpus from CSTR Edinburgh, 44.1 kHz. High 22-dim latent that captures more speaker idiosyncrasies.

Bird / wildlife

  • birds_motherbird_b2048_r48000_z16.ts: Birds (Motherbird). Bird-vocalization corpus: chirps and textural transients. The canonical 'weird' pick: produces wildly warped output for any arbitrary input.
  • birds_dawnchorus_b2048_r48000_z8.ts: Birds (Dawn Chorus). Dense overlapping bird vocalizations recorded at dawn. Smaller 8-dim latent; outputs lean toward ensemble textures over individual calls.
  • birds_pluma_b2048_r48000_z12.ts: Birds (Pluma). Lighter, individual bird-call timbres. Mid-size 12-dim latent balances character and clarity.
  • humpbacks_pondbrain_b2048_r48000_z20.ts: Humpback Whales. Humpback-whale song. Long, slow, hauntingly deep vocal contours; pairs well with sustained input.
  • marinemammals_pondbrain_b2048_r48000_z20.ts: Marine Mammals. Mixed marine-mammal vocalizations: dolphins, orcas, sea-life clicks and cries.

Instruments

  • guitar_iil_b2048_r48000_z16.ts: Guitar (IIL). Acoustic / electric guitar timbre. Good demo for transferring voice or synth input into a plucked-string voice.
  • organ_bach_b2048_r48000_z16.ts: Organ (Bach). Pipe-organ timbre trained on Bach repertoire. Sustained harmonic textures; pairs well with melodic input.
  • organ_archive_b2048_r48000_z16.ts: Organ (Archive). Historical pipe-organ recordings with broader, dustier textures than the Bach model. Good for film-score atmospheres.
  • sax_soprano_franziskaschroeder_b2048_r48000_z20.ts: Soprano Sax (Schroeder). Soprano-saxophone extended techniques by Franziska Schroeder. Multiphonics, growls, key clicks. The 20-dim latent captures fine-grained articulation.
  • mrp_strengjavera_b2048_r44100_z16.ts: Magnetic Resonator Piano (Strengjavera). Sustained metallic-string overtones produced by electromagnetically driving piano strings (44.1 kHz).
  • crozzoli_bigensemblesmusic_18d.ts: Big Ensemble Music (Crozzoli). Big-ensemble orchestral music (M. Crozzoli). Broad 18-dim latent for hugely textured renders. The sample rate is not embedded in the filename; it defaults to 48 kHz.

Textures / environment

  • water_pondbrain_b2048_r48000_z16.ts: Water (PondBrain). Water / aquatic textures. Treats any input as if it were running through liquid: bubbles, ripples, splashes.
  • magnets_b2048_r48000_z8.ts: Magnets. Ferromagnetic / electromagnetic resonance textures: metallic hums, distant industrial buzz, magnetized-string ringing.
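
Most filenames in this set carry their own metadata, which makes it possible to recover the sample rate and latent size without opening the sidecar. The following is a minimal sketch, not an official parser. It reads r<digits> as the sample rate and z<digits> as the latent dimension, as the descriptions above suggest, and assumes b<digits> is the streaming block size. The function name and the returned keys are hypothetical.

```python
import re
from pathlib import Path

# Matches stems such as voice_vctk_b2048_r44100_z22 or voice-multi-b2048-r48000-z11.
_PATTERN = re.compile(
    r"^(?P<name>.+?)[_-]b(?P<block>\d+)[_-]r(?P<rate>\d+)[_-]z(?P<latent>\d+)$"
)


def parse_model_filename(filename, default_rate=48000):
    """Split an IIL-style checkpoint name into its parts.

    Returns a dict with name, block, rate and latent. When the name does not
    follow the b/r/z pattern, block and latent are None and rate falls back
    to default_rate (48 kHz, as noted for the Crozzoli model).
    """
    stem = Path(filename).stem
    match = _PATTERN.match(stem)
    if match is None:
        return {"name": stem, "block": None, "rate": default_rate, "latent": None}
    return {
        "name": match["name"],
        "block": int(match["block"]),
        "rate": int(match["rate"]),
        "latent": int(match["latent"]),
    }
```

The separator class accepts both underscores and hyphens because voice-multi-b2048-r48000-z11.ts uses hyphens where the other files use underscores.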

Models — ACIDS public catalog (10 models, mirrored 2026-05-18)

Pulled from the canonical anonymous-download endpoint https://play.forum.ircam.fr/rave-vst-api/get_model/<slug>. Each .ts has a matching <slug>.json sidecar in the same schema as the IIL set.
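
For reference, fetching one model from that endpoint could look like the sketch below. It assumes the endpoint answers a plain anonymous GET with the TorchScript file itself; the helper names and the default output filename are hypothetical.

```python
import shutil
import urllib.parse
import urllib.request

BASE_URL = "https://play.forum.ircam.fr/rave-vst-api/get_model/"


def model_url(slug):
    """Build the download URL for a catalog slug such as 'vintage'."""
    return BASE_URL + urllib.parse.quote(slug, safe="")


def download_model(slug, dest=None, timeout=60):
    """Stream one checkpoint to disk and return the local path.

    Assumes the response body is the .ts file (not a redirect page or JSON).
    """
    dest = dest or f"{slug}.ts"
    with urllib.request.urlopen(model_url(slug), timeout=timeout) as resp:
        with open(dest, "wb") as out:
            shutil.copyfileobj(resp, out)
    return dest
```

Calling download_model("vintage") would write vintage.ts to the working directory. Streaming with shutil.copyfileobj avoids holding a multi-hundred-megabyte checkpoint in memory.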

Slug                Display name                     Type               Author                  Year  Size    Prior
VCTK                VCTK (English Speech)            RAVE v1 (default)  Jb Dupuy                2022  177 MB
darbouka_onnx       Darbouka (Percussion)            RAVE v2 (ONNX)     Antoine Caillon         2022  26 MB
nasa                NASA Apollo 11                   RAVE v1 (default)  Antoine Caillon         2022  159 MB
percussion          Percussion (Mixed)               RAVE v1 (default)  Antoine Caillon         2022  71 MB
vintage             Vintage Music                    RAVE v1 (large)    Antoine Caillon         2022  482 MB
isis                ISiS (IRCAM Vocal DB)            RAVE v2            A. Chemla–Romeu-Santos  2023  149 MB
musicnet            MusicNet (Classical)             RAVE v2            A. Chemla–Romeu-Santos  2023  237 MB
sol_ordinario       Studio OnLine (Ordinario)        RAVE v2            A. Chemla–Romeu-Santos  2023  149 MB
sol_full            Studio OnLine (Full)             RAVE v2            A. Chemla–Romeu-Santos  2023  149 MB
sol_ordinario_fast  Studio OnLine (Ordinario, fast)  RAVE v2 (small)    A. Chemla–Romeu-Santos  2023  43 MB

ACIDS set total: ~1.6 GB across 10 models.
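
The total can be checked against the per-model sizes in the catalog table. This short sketch simply sums the listed figures, treating 1 GB as 1000 MB.

```python
# Sizes in MB, copied from the ACIDS catalog table.
SIZES_MB = {
    "VCTK": 177,
    "darbouka_onnx": 26,
    "nasa": 159,
    "percussion": 71,
    "vintage": 482,
    "isis": 149,
    "musicnet": 237,
    "sol_ordinario": 149,
    "sol_full": 149,
    "sol_ordinario_fast": 43,
}

total_mb = sum(SIZES_MB.values())
print(f"{len(SIZES_MB)} models, {total_mb} MB (~{total_mb / 1000:.1f} GB)")
# prints: 10 models, 1642 MB (~1.6 GB)
```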

Note: VCTK.ts (ACIDS v1, 48 kHz, original 2022 release) and voice_vctk_b2048_r44100_z22.ts (IIL v2 retrain, 44.1 kHz) are different models trained on the same source corpus — keep both for comparison.


File format

Each *.ts is a TorchScript export of the RAVE model, streaming-mode (causal convolutions, cached state) — ready for realtime or offline inference.

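import torch

model = torch.jit.load("vintage.ts").eval()
# Placeholder input shaped (batch, 1, samples); substitute real mono audio at the model's sample rate.
audio = torch.randn(1, 1, 65536)
with torch.no_grad():
    z = model.encode(audio)  # (B, 1, T) → latents
    y = model.decode(z)      # latents → audio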
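
When preparing input for the snippet above, it can help to know how a clip length relates to the block size. The sketch below is plain arithmetic under one assumption: that the b2048 in the IIL filenames is the number of audio samples consumed per latent frame, and that inputs should be padded to a multiple of it. Neither point is confirmed here, and the function names are hypothetical.

```python
def padding_needed(num_samples, block=2048):
    """Zeros to append so num_samples becomes a multiple of block."""
    return (-num_samples) % block


def latent_frames(num_samples, block=2048):
    """Latent frames for a clip after padding, assuming one frame per block."""
    return (num_samples + padding_needed(num_samples, block)) // block


def block_duration_ms(sample_rate, block=2048):
    """Duration of one block of audio in milliseconds."""
    return 1000.0 * block / sample_rate
```

For a 3-second clip at 48 kHz (144000 samples), padding_needed gives 1408 and latent_frames gives 71. One block lasts about 42.7 ms at 48 kHz. With PyTorch, the padding itself could be applied with torch.nn.functional.pad(audio, (0, padding_needed(audio.shape[-1]))).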

Models with "Prior available" additionally ship a learned prior that can generate latents autoregressively (see the RAVE repo for usage).

Where to find more RAVE models

Citation

@article{caillon2021rave,
  title={RAVE: A variational autoencoder for fast and high-quality neural audio synthesis},
  author={Caillon, Antoine and Esling, Philippe},
  journal={arXiv preprint arXiv:2111.05011},
  year={2021}
}