MelodyFlow — AEmotionStudio mirror

1:1 mirror of facebook/melodyflow-t24-30secs. Used by the MAESTRO / Æmotion Studio AI Workstation's MelodyFlow panel (Design → MelodyFlow).

License — Non-Commercial

Weights: CC-BY-NC-4.0. Generated outputs may NOT be used in commercial projects, paid releases, or client work.

Code (audiocraft): MIT. MelodyFlow's inference code lives in the facebook/MelodyFlow HuggingFace Space — Meta uploaded it there but never merged it into audiocraft main. MAESTRO vendors that Space's audiocraft/ subtree under backend/ai/melodyflow_pkg/. The non-commercial clause attaches only to the weights and to anything derived from running them.

Format

This mirror keeps the upstream .bin layout (PyTorch pickle) verbatim: state_dict.bin (the flow-matching DiT language model) plus compression_state_dict.bin (the EnCodec compression model, 2-channel / 32 kHz). We do NOT convert to safetensors here because the vendored audiocraft loader expects pickled {xp.cfg, best_state} packages and reads the OmegaConf cfg blob alongside the tensor dict in one torch.load call. Splitting cfg into a sidecar file would require a custom loader, so conversion is deferred.
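As a rough illustration of the package layout described above, the sketch below splits an already-loaded checkpoint dict into its config blob and its tensor dict. The helper name unpack_package is hypothetical and is not part of audiocraft or MAESTRO; only the two keys, xp.cfg and best_state, come from the description above.

```python
from typing import Any, Mapping, Tuple

# Keys named in the format description above.
REQUIRED_KEYS = ("xp.cfg", "best_state")


def unpack_package(pkg: Mapping[str, Any]) -> Tuple[Any, Any]:
    """Split a loaded checkpoint package into (cfg, state_dict).

    Hypothetical helper for illustration; it only checks that both
    expected keys are present and returns their values unchanged.
    """
    missing = [key for key in REQUIRED_KEYS if key not in pkg]
    if missing:
        raise KeyError(f"checkpoint package is missing keys: {missing}")
    return pkg["xp.cfg"], pkg["best_state"]


# Assumed usage (needs torch and a trusted local copy of the file):
#   import torch
#   pkg = torch.load("state_dict.bin", map_location="cpu", weights_only=False)
#   cfg, state = unpack_package(pkg)
```

Because both values travel in one pickle, a safetensors conversion would have to store the cfg somewhere else, which is the sidecar problem mentioned above.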

PyTorch 2.6+'s default weights_only=True rejects these pickles (numpy scalars in xp.cfg). MAESTRO's runner wraps the load in a _TorchLoadWeightsOnlyShim context manager; vanilla audiocraft users on torch ≥ 2.6 will hit the same issue and need a similar shim.
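For readers who need such a shim outside MAESTRO, here is a minimal sketch of one way to write it. It is not MAESTRO's _TorchLoadWeightsOnlyShim, whose implementation is not shown here; the name weights_only_shim and the choice to pass the module in as an argument are assumptions made for illustration. It temporarily wraps the module's load function so that weights_only defaults to False unless the caller sets it explicitly.

```python
import contextlib
import functools


@contextlib.contextmanager
def weights_only_shim(torch_module):
    """Temporarily make torch_module.load default to weights_only=False.

    Hypothetical sketch: pass the imported torch module as torch_module.
    Callers that pass weights_only explicitly keep their own value.
    """
    original_load = torch_module.load

    @functools.wraps(original_load)
    def patched_load(*args, **kwargs):
        kwargs.setdefault("weights_only", False)
        return original_load(*args, **kwargs)

    torch_module.load = patched_load
    try:
        yield
    finally:
        # Always restore the original function, even if loading fails.
        torch_module.load = original_load


# Assumed usage:
#   import torch
#   with weights_only_shim(torch):
#       model = MelodyFlow.get_pretrained(...)
```

Loading with weights_only=False unpickles arbitrary Python objects, so a shim like this should only wrap loads of checkpoints from a source you trust.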

Loading

# Requires the facebook/MelodyFlow Space's audiocraft subtree on PYTHONPATH
# (the upstream audiocraft PyPI release does NOT include MelodyFlow).
import torch
import torchaudio
from audiocraft.models import MelodyFlow

model = MelodyFlow.get_pretrained('AEmotionStudio/melodyflow-models', device='cuda')

# Generate from text alone:
model.set_generation_params(solver='midpoint', steps=64, duration=10.0)
wav = model.generate(descriptions=['cinematic strings'])

# OR edit a source clip via regularized latent inversion:
src, sr = torchaudio.load('source.wav')   # shape: (channels, samples)
if src.shape[0] == 1:                     # MelodyFlow's EnCodec is stereo, so duplicate mono
    src = src.repeat(2, 1)
if sr != model.sample_rate:               # added step: resample to the model's rate
    src = torchaudio.functional.resample(src, sr, model.sample_rate)
src = src.unsqueeze(0).to('cuda')         # add a batch dimension
with torch.no_grad():
    prompt_tokens = model.encode_audio(src)
model.set_editing_params(solver='euler', steps=25, regularize=True,
                         regularize_iters=4, lambda_kl=0.2)
edited = model.edit(prompt_tokens=prompt_tokens,
                    descriptions=['solo piano with reverb'],
                    src_descriptions=['gentle arpeggio'])
torchaudio.save('edited.wav', edited[0].cpu(), model.sample_rate)
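The loading snippet above assumes the Space's audiocraft subtree is importable. One way to arrange that from Python, rather than by exporting PYTHONPATH, is sketched below. The helper name add_vendored_audiocraft is hypothetical, and the directory you pass depends on where you placed the vendored copy (MAESTRO uses backend/ai/melodyflow_pkg/).

```python
import sys
from pathlib import Path


def add_vendored_audiocraft(pkg_root: str) -> Path:
    """Put a directory containing an audiocraft/ package first on sys.path.

    Hypothetical helper for illustration; pkg_root is the folder that
    holds the vendored audiocraft/ subtree, not the subtree itself.
    """
    root = Path(pkg_root).resolve()
    if not (root / "audiocraft").is_dir():
        raise FileNotFoundError(f"no audiocraft/ directory under {root}")
    if str(root) not in sys.path:
        # Insert at the front so the vendored copy is found before
        # any audiocraft release installed from PyPI.
        sys.path.insert(0, str(root))
    return root


# Assumed usage, before importing MelodyFlow:
#   add_vendored_audiocraft("backend/ai/melodyflow_pkg")
#   from audiocraft.models import MelodyFlow
```

Call the helper before the first import of audiocraft; a copy that has already been imported stays cached in sys.modules and will not be replaced.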

Citation

MelodyFlow is described in:

Le Lan, G., Nagaraja, V., Chang, E., Kant, D., Ni, Z., Shi, Y., Iandola, F., & Chandra, V. (2024). High Fidelity Text-Guided Music Editing via Single-Stage Flow Matching. arXiv:2407.03648.

