Heartmula — HeartMuLa: Suno-like song generation from lyrics + tags
Heartmula
Section titled “Heartmula”HeartMuLa: Suno-like song generation from lyrics + tags.
Skill metadata
Section titled “Skill metadata”| Source | Bundled (installed by default) |
| Path | skills/media/heartmula |
| Version | 1.0.0 |
| Tags | music, audio, generation, ai, heartmula, heartcodec, lyrics, songs |
| Related skills | audiocraft |
Reference: full SKILL.md
Section titled “Reference: full SKILL.md”The following is the complete skill definition that Hermes loads when this skill is triggered. This is what the agent sees as instructions when the skill is active.
HeartMuLa - Open-Source Music Generation
Section titled “HeartMuLa - Open-Source Music Generation”Overview
Section titled “Overview”HeartMuLa is a family of open-source music foundation models (Apache-2.0) that generates music conditioned on lyrics and tags, with multilingual support. Generates full songs from lyrics + tags. Comparable to Suno for open-source. Includes:
- HeartMuLa - Music language model (3B/7B) for generation from lyrics + tags
- HeartCodec - 12.5Hz music codec for high-fidelity audio reconstruction
- HeartTranscriptor - Whisper-based lyrics transcription
- HeartCLAP - Audio-text alignment model
When to Use
Section titled “When to Use”- User wants to generate music/songs from text descriptions
- User wants an open-source Suno alternative
- User wants local/offline music generation
- User asks about HeartMuLa, heartlib, or AI music generation
Hardware Requirements
Section titled “Hardware Requirements”- Minimum: 8GB VRAM with
--lazy_load true(loads/unloads models sequentially) - Recommended: 16GB+ VRAM for comfortable single-GPU usage
- Multi-GPU: Use
--mula_device cuda:0 --codec_device cuda:1to split across GPUs - 3B model with lazy_load peaks at ~6.2GB VRAM
Installation Steps
Section titled “Installation Steps”1. Clone Repository
Section titled “1. Clone Repository”cd ~/ # or desired directorygit clone https://github.com/HeartMuLa/heartlib.gitcd heartlib2. Create Virtual Environment (Python 3.10 required)
Section titled “2. Create Virtual Environment (Python 3.10 required)”uv venv --python 3.10 .venv. .venv/bin/activateuv pip install -e .3. Fix Dependency Compatibility Issues
Section titled “3. Fix Dependency Compatibility Issues”IMPORTANT: As of Feb 2026, the pinned dependencies have conflicts with newer packages. Apply these fixes:
# Upgrade datasets (old version incompatible with current pyarrow)uv pip install --upgrade datasets
# Upgrade transformers (needed for huggingface-hub 1.x compatibility)uv pip install --upgrade transformers4. Patch Source Code (Required for transformers 5.x)
Section titled “4. Patch Source Code (Required for transformers 5.x)”Patch 1 - RoPE cache fix in src/heartlib/heartmula/modeling_heartmula.py:
In the setup_caches method of the HeartMuLa class, add RoPE reinitialization after the reset_caches try/except block and before the with device: block:
# Re-initialize RoPE caches that were skipped during meta-device loadingfrom torchtune.models.llama3_1._position_embeddings import Llama3ScaledRoPEfor module in self.modules(): if isinstance(module, Llama3ScaledRoPE) and not module.is_cache_built: module.rope_init() module.to(device)Why: from_pretrained creates model on meta device first; Llama3ScaledRoPE.rope_init() skips cache building on meta tensors, then never rebuilds after weights are loaded to real device.
Patch 2 - HeartCodec loading fix in src/heartlib/pipelines/music_generation.py:
Add ignore_mismatched_sizes=True to ALL HeartCodec.from_pretrained() calls (there are 2: the eager load in __init__ and the lazy load in the codec property).
Why: VQ codebook initted buffers have shape [1] in checkpoint vs [] in model. Same data, just scalar vs 0-d tensor. Safe to ignore.
5. Download Model Checkpoints
Section titled “5. Download Model Checkpoints”cd heartlib # project roothf download --local-dir './ckpt' 'HeartMuLa/HeartMuLaGen'hf download --local-dir './ckpt/HeartMuLa-oss-3B' 'HeartMuLa/HeartMuLa-oss-3B-happy-new-year'hf download --local-dir './ckpt/HeartCodec-oss' 'HeartMuLa/HeartCodec-oss-20260123'All 3 can be downloaded in parallel. Total size is several GB.
GPU / CUDA
Section titled “GPU / CUDA”HeartMuLa uses CUDA by default (--mula_device cuda --codec_device cuda). No extra setup needed if the user has an NVIDIA GPU with PyTorch CUDA support installed.
- The installed
torch==2.4.1includes CUDA 12.1 support out of the box torchtunemay report version0.4.0+cpu— this is just package metadata, it still uses CUDA via PyTorch- To verify GPU is being used, look for “CUDA memory” lines in the output (e.g. “CUDA memory before unloading: 6.20 GB”)
- No GPU? You can run on CPU with
--mula_device cpu --codec_device cpu, but expect generation to be extremely slow (potentially 30-60+ minutes for a single song vs ~4 minutes on GPU). CPU mode also requires significant RAM (~12GB+ free). If the user has no NVIDIA GPU, recommend using a cloud GPU service (Google Colab free tier with T4, Lambda Labs, etc.) or the online demo at https://heartmula.github.io/ instead.
Basic Generation
Section titled “Basic Generation”cd heartlib. .venv/bin/activatepython ./examples/run_music_generation.py \ --model_path=./ckpt \ --version="3B" \ --lyrics="./assets/lyrics.txt" \ --tags="./assets/tags.txt" \ --save_path="./assets/output.mp3" \ --lazy_load trueInput Formatting
Section titled “Input Formatting”Tags (comma-separated, no spaces):
piano,happy,wedding,synthesizer,romanticor
rock,energetic,guitar,drums,male-vocalLyrics (use bracketed structural tags):
[Intro]
[Verse]Your lyrics here...
[Chorus]Chorus lyrics...
[Bridge]Bridge lyrics...
[Outro]Key Parameters
Section titled “Key Parameters”| Parameter | Default | Description |
|---|---|---|
--max_audio_length_ms | 240000 | Max length in ms (240s = 4 min) |
--topk | 50 | Top-k sampling |
--temperature | 1.0 | Sampling temperature |
--cfg_scale | 1.5 | Classifier-free guidance scale |
--lazy_load | false | Load/unload models on demand (saves VRAM) |
--mula_dtype | bfloat16 | Dtype for HeartMuLa (bf16 recommended) |
--codec_dtype | float32 | Dtype for HeartCodec (fp32 recommended for quality) |
Performance
Section titled “Performance”- RTF (Real-Time Factor) ≈ 1.0 — a 4-minute song takes ~4 minutes to generate
- Output: MP3, 48kHz stereo, 128kbps
Pitfalls
Section titled “Pitfalls”- Do NOT use bf16 for HeartCodec — degrades audio quality. Use fp32 (default).
- Tags may be ignored — known issue (#90). Lyrics tend to dominate; experiment with tag ordering.
- Triton not available on macOS — Linux/CUDA only for GPU acceleration.
- RTX 5080 incompatibility reported in upstream issues.
- The dependency pin conflicts require the manual upgrades and patches described above.
- Repo: https://github.com/HeartMuLa/heartlib
- Models: https://huggingface.co/HeartMuLa
- Paper: https://arxiv.org/abs/2601.10547
- License: Apache-2.0