Audiocraft Audio Generation
AudioCraft: MusicGen text-to-music, AudioGen text-to-sound.
Skill metadata
| Field | Value |
|---|---|
| Source | Bundled (installed by default) |
| Path | skills/mlops/models/audiocraft |
| Version | 1.0.0 |
| Author | Orchestra Research |
| License | MIT |
| Dependencies | audiocraft, torch>=2.0.0, transformers>=4.30.0 |
| Tags | Multimodal, Audio Generation, Text-to-Music, Text-to-Audio, MusicGen |
Reference: full SKILL.md
The following is the complete skill definition that Hermes loads when this skill is triggered. This is what the agent sees as instructions when the skill is active.
AudioCraft: Audio Generation
Comprehensive guide to using Meta’s AudioCraft for text-to-music and text-to-audio generation with MusicGen, AudioGen, and EnCodec.
When to use AudioCraft
Use AudioCraft when:
- Need to generate music from text descriptions
- Creating sound effects and environmental audio
- Building music generation applications
- Need melody-conditioned music generation
- Want stereo audio output
- Require controllable music generation with style transfer
Key features:
- MusicGen: Text-to-music generation with melody conditioning
- AudioGen: Text-to-sound effects generation
- EnCodec: High-fidelity neural audio codec
- Multiple model sizes: Small (300M) to Large (3.3B)
- Stereo support: Full stereo audio generation
- Style conditioning: MusicGen-Style for reference-based generation
Use alternatives instead:
- Stable Audio: For longer commercial music generation
- Bark: For text-to-speech with music/sound effects
- Riffusion: For spectrogram-based music generation
- OpenAI Jukebox: For raw audio generation with lyrics
Quick start
Installation

```bash
# From PyPI
pip install audiocraft

# From GitHub (latest)
pip install git+https://github.com/facebookresearch/audiocraft.git

# Or use HuggingFace Transformers
pip install transformers torch torchaudio
```

Basic text-to-music (AudioCraft)
```python
import torchaudio
from audiocraft.models import MusicGen

# Load model
model = MusicGen.get_pretrained('facebook/musicgen-small')

# Set generation parameters
model.set_generation_params(
    duration=8,       # seconds
    top_k=250,
    temperature=1.0
)

# Generate from text
descriptions = ["happy upbeat electronic dance music with synths"]
wav = model.generate(descriptions)

# Save audio
torchaudio.save("output.wav", wav[0].cpu(), sample_rate=32000)
```

Using HuggingFace Transformers
```python
from transformers import AutoProcessor, MusicgenForConditionalGeneration
import scipy.io.wavfile

# Load model and processor
processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")
model.to("cuda")

# Generate music
inputs = processor(
    text=["80s pop track with bassy drums and synth"],
    padding=True,
    return_tensors="pt"
).to("cuda")

audio_values = model.generate(
    **inputs,
    do_sample=True,
    guidance_scale=3,
    max_new_tokens=256  # ~5 seconds at the 50 Hz EnCodec frame rate
)

# Save
sampling_rate = model.config.audio_encoder.sampling_rate
scipy.io.wavfile.write("output.wav", rate=sampling_rate, data=audio_values[0, 0].cpu().numpy())
```

Text-to-sound with AudioGen
```python
import torchaudio
from audiocraft.models import AudioGen

# Load AudioGen
model = AudioGen.get_pretrained('facebook/audiogen-medium')
model.set_generation_params(duration=5)

# Generate sound effects
descriptions = ["dog barking in a park with birds chirping"]
wav = model.generate(descriptions)

torchaudio.save("sound.wav", wav[0].cpu(), sample_rate=16000)
```

Core concepts
Architecture overview

```text
AudioCraft Architecture:
┌──────────────────────────────────────────────────────────────┐
│                     Text Encoder (T5)                        │
│                            │                                 │
│                     Text Embeddings                          │
└────────────────────────┬─────────────────────────────────────┘
                         │
┌────────────────────────▼─────────────────────────────────────┐
│                 Transformer Decoder (LM)                     │
│          Auto-regressively generates audio tokens            │
│          Using efficient token interleaving patterns         │
└────────────────────────┬─────────────────────────────────────┘
                         │
┌────────────────────────▼─────────────────────────────────────┐
│                  EnCodec Audio Decoder                       │
│            Converts tokens back to audio waveform            │
└──────────────────────────────────────────────────────────────┘
```

Model variants
| Model | Size | Description | Use Case |
|---|---|---|---|
| musicgen-small | 300M | Text-to-music | Quick generation |
| musicgen-medium | 1.5B | Text-to-music | Balanced |
| musicgen-large | 3.3B | Text-to-music | Best quality |
| musicgen-melody | 1.5B | Text + melody | Melody conditioning |
| musicgen-melody-large | 3.3B | Text + melody | Best melody conditioning |
| musicgen-stereo-* | Varies | Stereo output | Stereo generation |
| musicgen-style | 1.5B | Style transfer | Reference-based generation |
| audiogen-medium | 1.5B | Text-to-sound | Sound effects |
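The table above can be folded into a small lookup helper. This is a sketch, not part of AudioCraft: the checkpoint ids are the ones listed, but the `pick_checkpoint` helper and its task/size keys are hypothetical conveniences.

```python
# Map a desired capability to a checkpoint name from the variants table.
# Helper name and keys are illustrative, not an AudioCraft API.
CHECKPOINTS = {
    "music":  {"small": "facebook/musicgen-small",
               "medium": "facebook/musicgen-medium",
               "large": "facebook/musicgen-large"},
    "melody": {"medium": "facebook/musicgen-melody",
               "large": "facebook/musicgen-melody-large"},
    "style":  {"medium": "facebook/musicgen-style"},
    "sound":  {"medium": "facebook/audiogen-medium"},
}

def pick_checkpoint(task, size="medium"):
    """Return the checkpoint id for a task/size combination."""
    sizes = CHECKPOINTS[task]
    if size not in sizes:
        raise ValueError(f"{task!r} is only available in sizes: {sorted(sizes)}")
    return sizes[size]

print(pick_checkpoint("melody", "large"))  # facebook/musicgen-melody-large
```

The returned id can be passed directly to `MusicGen.get_pretrained(...)` or `AudioGen.get_pretrained(...)`.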
Generation parameters
| Parameter | Default | Description |
|---|---|---|
| duration | 8.0 | Length in seconds (1-120) |
| top_k | 250 | Top-k sampling |
| top_p | 0.0 | Nucleus sampling (0 = disabled) |
| temperature | 1.0 | Sampling temperature |
| cfg_coef | 3.0 | Classifier-free guidance |
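What `top_k` and `top_p` mean can be illustrated with a minimal filtering sketch over a toy next-token distribution. This mirrors the standard semantics of the parameters above (including `top_p=0.0` meaning "disabled"), not AudioCraft's internal implementation.

```python
# Sketch of top-k and nucleus (top-p) candidate filtering for sampling.
def filter_candidates(probs, top_k=250, top_p=0.0):
    """Return token ids kept for sampling, most probable first."""
    ranked = sorted(probs, key=probs.get, reverse=True)[:top_k]
    if top_p > 0.0:
        # Nucleus: keep the smallest prefix whose probability mass reaches top_p.
        kept, mass = [], 0.0
        for tok in ranked:
            kept.append(tok)
            mass += probs[tok]
            if mass >= top_p:
                break
        return kept
    return ranked  # top_p == 0 -> top-k only

probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
print(filter_candidates(probs, top_k=3))             # ['a', 'b', 'c']
print(filter_candidates(probs, top_k=4, top_p=0.8))  # ['a', 'b']
```

Higher `temperature` flattens `probs` before this filtering step, which is why it reads as "creativity" in the examples below.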
MusicGen usage
Text-to-music generation

```python
from audiocraft.models import MusicGen
import torchaudio

model = MusicGen.get_pretrained('facebook/musicgen-medium')

# Configure generation
model.set_generation_params(
    duration=30,       # Up to 30 seconds
    top_k=250,         # Sampling diversity
    top_p=0.0,         # 0 = use top_k only
    temperature=1.0,   # Creativity (higher = more varied)
    cfg_coef=3.0       # Text adherence (higher = stricter)
)

# Generate multiple samples
descriptions = [
    "epic orchestral soundtrack with strings and brass",
    "chill lo-fi hip hop beat with jazzy piano",
    "energetic rock song with electric guitar"
]

# Generate (returns [batch, channels, samples])
wav = model.generate(descriptions)

# Save each
for i, audio in enumerate(wav):
    torchaudio.save(f"music_{i}.wav", audio.cpu(), sample_rate=32000)
```

Melody-conditioned generation
```python
from audiocraft.models import MusicGen
import torchaudio

# Load melody model
model = MusicGen.get_pretrained('facebook/musicgen-melody')
model.set_generation_params(duration=30)

# Load melody audio
melody, sr = torchaudio.load("melody.wav")

# Generate with melody conditioning (melody_wavs expects a batch dimension)
descriptions = ["acoustic guitar folk song"]
wav = model.generate_with_chroma(descriptions, melody[None], sr)

torchaudio.save("melody_conditioned.wav", wav[0].cpu(), sample_rate=32000)
```

Stereo generation
```python
from audiocraft.models import MusicGen
import torchaudio

# Load stereo model
model = MusicGen.get_pretrained('facebook/musicgen-stereo-medium')
model.set_generation_params(duration=15)

descriptions = ["ambient electronic music with wide stereo panning"]
wav = model.generate(descriptions)

# wav shape: [batch, 2, samples] for stereo
print(f"Stereo shape: {wav.shape}")  # [1, 2, 480000]
torchaudio.save("stereo.wav", wav[0].cpu(), sample_rate=32000)
```

Audio continuation
```python
from transformers import AutoProcessor, MusicgenForConditionalGeneration
import torchaudio

processor = AutoProcessor.from_pretrained("facebook/musicgen-medium")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-medium")

# Load audio to continue
audio, sr = torchaudio.load("intro.wav")

# Process with text and audio
inputs = processor(
    audio=audio.squeeze().numpy(),
    sampling_rate=sr,
    text=["continue with an epic chorus"],
    padding=True,
    return_tensors="pt"
)

# Generate continuation
audio_values = model.generate(**inputs, do_sample=True, guidance_scale=3, max_new_tokens=512)
```

MusicGen-Style usage
Style-conditioned generation

```python
from audiocraft.models import MusicGen
import torchaudio

# Load style model
model = MusicGen.get_pretrained('facebook/musicgen-style')

# Configure generation with style
model.set_generation_params(
    duration=30,
    cfg_coef=3.0,
    cfg_coef_beta=5.0   # Style influence
)

# Configure style conditioner
model.set_style_conditioner_params(
    eval_q=3,            # RVQ quantizers (1-6)
    excerpt_length=3.0   # Style excerpt length in seconds
)

# Load style reference
style_audio, sr = torchaudio.load("reference_style.wav")

# Generate with text + style
descriptions = ["upbeat dance track"]
wav = model.generate_with_style(descriptions, style_audio, sr)
```

Style-only generation (no text)
```python
# Generate matching style without text prompt
model.set_generation_params(
    duration=30,
    cfg_coef=3.0,
    cfg_coef_beta=None   # Disable double CFG for style-only
)

wav = model.generate_with_style([None], style_audio, sr)
```

AudioGen usage
Sound effect generation

```python
from audiocraft.models import AudioGen
import torchaudio

model = AudioGen.get_pretrained('facebook/audiogen-medium')
model.set_generation_params(duration=10)

# Generate various sounds
descriptions = [
    "thunderstorm with heavy rain and lightning",
    "busy city traffic with car horns",
    "ocean waves crashing on rocks",
    "crackling campfire in forest"
]

wav = model.generate(descriptions)

for i, audio in enumerate(wav):
    torchaudio.save(f"sound_{i}.wav", audio.cpu(), sample_rate=16000)
```

EnCodec usage
Audio compression

```python
from audiocraft.models import CompressionModel
import torch
import torchaudio

# Load EnCodec
model = CompressionModel.get_pretrained('facebook/encodec_32khz')

# Load audio
wav, sr = torchaudio.load("audio.wav")

# Ensure correct sample rate
if sr != 32000:
    resampler = torchaudio.transforms.Resample(sr, 32000)
    wav = resampler(wav)

# Encode to tokens
with torch.no_grad():
    encoded = model.encode(wav.unsqueeze(0))
    codes = encoded[0]  # Audio codes

# Decode back to audio
with torch.no_grad():
    decoded = model.decode(codes)

torchaudio.save("reconstructed.wav", decoded[0].cpu(), sample_rate=32000)
```

Common workflows
Workflow 1: Music generation pipeline

```python
import torch
import torchaudio
from audiocraft.models import MusicGen

class MusicGenerator:
    def __init__(self, model_name="facebook/musicgen-medium"):
        self.model = MusicGen.get_pretrained(model_name)
        self.sample_rate = 32000

    def generate(self, prompt, duration=30, temperature=1.0, cfg=3.0):
        self.model.set_generation_params(
            duration=duration,
            top_k=250,
            temperature=temperature,
            cfg_coef=cfg
        )
        with torch.no_grad():
            wav = self.model.generate([prompt])
        return wav[0].cpu()

    def generate_batch(self, prompts, duration=30):
        self.model.set_generation_params(duration=duration)
        with torch.no_grad():
            wav = self.model.generate(prompts)
        return wav.cpu()

    def save(self, audio, path):
        torchaudio.save(path, audio, sample_rate=self.sample_rate)

# Usage
generator = MusicGenerator()
audio = generator.generate(
    "epic cinematic orchestral music",
    duration=30,
    temperature=1.0
)
generator.save(audio, "epic_music.wav")
```

Workflow 2: Sound design batch processing
```python
from pathlib import Path
from audiocraft.models import AudioGen
import torchaudio

def batch_generate_sounds(sound_specs, output_dir):
    """
    Generate multiple sounds from specifications.

    Args:
        sound_specs: list of {"name": str, "description": str, "duration": float}
        output_dir: output directory path
    """
    model = AudioGen.get_pretrained('facebook/audiogen-medium')
    output_dir = Path(output_dir)
    output_dir.mkdir(exist_ok=True)

    results = []

    for spec in sound_specs:
        model.set_generation_params(duration=spec.get("duration", 5))
        wav = model.generate([spec["description"]])

        output_path = output_dir / f"{spec['name']}.wav"
        torchaudio.save(str(output_path), wav[0].cpu(), sample_rate=16000)

        results.append({
            "name": spec["name"],
            "path": str(output_path),
            "description": spec["description"]
        })

    return results

# Usage
sounds = [
    {"name": "explosion", "description": "massive explosion with debris", "duration": 3},
    {"name": "footsteps", "description": "footsteps on wooden floor", "duration": 5},
    {"name": "door", "description": "wooden door creaking and closing", "duration": 2}
]

results = batch_generate_sounds(sounds, "sound_effects/")
```

Workflow 3: Gradio demo
```python
import gradio as gr
import torch
import torchaudio
from audiocraft.models import MusicGen

model = MusicGen.get_pretrained('facebook/musicgen-small')

def generate_music(prompt, duration, temperature, cfg_coef):
    model.set_generation_params(
        duration=duration,
        temperature=temperature,
        cfg_coef=cfg_coef
    )
    with torch.no_grad():
        wav = model.generate([prompt])

    # Save to temp file
    path = "temp_output.wav"
    torchaudio.save(path, wav[0].cpu(), sample_rate=32000)
    return path

demo = gr.Interface(
    fn=generate_music,
    inputs=[
        gr.Textbox(label="Music Description", placeholder="upbeat electronic dance music"),
        gr.Slider(1, 30, value=8, label="Duration (seconds)"),
        gr.Slider(0.5, 2.0, value=1.0, label="Temperature"),
        gr.Slider(1.0, 10.0, value=3.0, label="CFG Coefficient")
    ],
    outputs=gr.Audio(label="Generated Music"),
    title="MusicGen Demo"
)

demo.launch()
```

Performance optimization
Memory optimization

```python
# Use smaller model
model = MusicGen.get_pretrained('facebook/musicgen-small')

# Clear cache between generations
torch.cuda.empty_cache()

# Generate shorter durations
model.set_generation_params(duration=10)  # Instead of 30

# Use half precision
model = model.half()
```

Batch processing efficiency
```python
# Process multiple prompts at once (more efficient)
descriptions = ["prompt1", "prompt2", "prompt3", "prompt4"]
wav = model.generate(descriptions)  # Single batch

# Instead of
for desc in descriptions:
    wav = model.generate([desc])  # Multiple batches (slower)
```

GPU memory requirements
| Model | FP32 VRAM | FP16 VRAM |
|---|---|---|
| musicgen-small | ~4GB | ~2GB |
| musicgen-medium | ~8GB | ~4GB |
| musicgen-large | ~16GB | ~8GB |
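The figures above are approximate, but they can drive a rough pre-flight check that picks the largest checkpoint fitting the available VRAM. The helper below is an illustrative sketch, not an AudioCraft utility.

```python
# Pick the largest MusicGen checkpoint fitting a VRAM budget, using the
# approximate (FP32, FP16) requirements in GB from the table above.
VRAM_GB = {
    "facebook/musicgen-small":  (4, 2),
    "facebook/musicgen-medium": (8, 4),
    "facebook/musicgen-large":  (16, 8),
}

def largest_fitting_model(budget_gb, half_precision=False):
    """Return the biggest checkpoint whose VRAM need fits budget_gb, or None."""
    idx = 1 if half_precision else 0
    fitting = [(need[idx], name) for name, need in VRAM_GB.items()
               if need[idx] <= budget_gb]
    return max(fitting)[1] if fitting else None

print(largest_fitting_model(10))                       # facebook/musicgen-medium
print(largest_fitting_model(10, half_precision=True))  # facebook/musicgen-large
```

Pair this with `model.half()` from the memory-optimization snippet when choosing the FP16 column.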
Common issues
| Issue | Solution |
|---|---|
| CUDA OOM | Use smaller model, reduce duration |
| Poor quality | Increase cfg_coef, better prompts |
| Generation too short | Check max duration setting |
| Audio artifacts | Try different temperature |
| Stereo not working | Use stereo model variant |
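The CUDA OOM row above often comes down to duration. One mitigation is to retry with progressively shorter durations; the sketch below is illustrative, and `generate_fn(duration)` is a hypothetical stand-in for your own wrapper around `set_generation_params` plus `generate`.

```python
# Retry a generation call with halved durations on CUDA OOM, per the
# "CUDA OOM -> reduce duration" advice in the table above.
def generate_with_fallback(generate_fn, duration=30, min_duration=5):
    while duration >= min_duration:
        try:
            return generate_fn(duration)
        except RuntimeError as e:
            if "out of memory" not in str(e).lower():
                raise  # unrelated error: propagate
            duration //= 2  # halve and retry

    raise RuntimeError(f"Still OOM at minimum duration {min_duration}s")

# Example with a fake generator that only "fits" at <= 10 seconds:
def fake_generate(duration):
    if duration > 10:
        raise RuntimeError("CUDA out of memory")
    return f"audio[{duration}s]"

print(generate_with_fallback(fake_generate, duration=30))  # audio[7s]
```

In a real wrapper, calling `torch.cuda.empty_cache()` between retries helps reclaim fragmented memory.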
References
- Advanced Usage - Training, fine-tuning, deployment
- Troubleshooting - Common issues and solutions
Resources
- GitHub: https://github.com/facebookresearch/audiocraft
- Paper (MusicGen): https://arxiv.org/abs/2306.05284
- Paper (AudioGen): https://arxiv.org/abs/2209.15352
- HuggingFace: https://huggingface.co/facebook/musicgen-small
- Demo: https://huggingface.co/spaces/facebook/MusicGen