Add ACE-Step pipeline for text-to-music generation #13095
ChuxiJ wants to merge 27 commits into huggingface:main from …
Conversation
Hi @ChuxiJ, thanks for the PR! As a preliminary comment, I tried the test script given above but got an error, which I think is due to the fact that the … If I convert the checkpoint locally from a local snapshot:

```shell
python scripts/convert_ace_step_to_diffusers.py \
    --checkpoint_dir /path/to/acestep-v15-repo \
    --dit_config acestep-v15-turbo \
    --output_dir /path/to/acestep-v15-diffusers \
    --dtype bf16
```

and then test it using the following script:

```python
import torch
import soundfile as sf

from diffusers import AceStepPipeline

OUTPUT_SAMPLE_RATE = 48000

model_id = "/path/to/acestep-v15-diffusers"
device = "cuda"
dtype = torch.bfloat16
seed = 42

pipe = AceStepPipeline.from_pretrained(model_id, torch_dtype=dtype)
pipe = pipe.to(device)

generator = torch.Generator(device=device).manual_seed(seed)

# Text-to-music generation
audio = pipe(
    prompt="A beautiful piano piece with soft melodies",
    lyrics="[verse]\nSoft notes in the morning light\n[chorus]\nMusic fills the air tonight",
    audio_duration=30.0,
    num_inference_steps=8,
    bpm=120,
    keyscale="C major",
    generator=generator,
).audios

sf.write("acestep_t2m.wav", audio[0, 0].cpu().numpy(), OUTPUT_SAMPLE_RATE)
```

I get the following sample: [audio sample]

The sample quality is lower than expected, so there is probably a bug. Could you look into it?
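Later threads in this PR quantify parity between two renders with a waveform Pearson correlation. A dependency-free sketch of that metric (the actual comparisons were run on full 48 kHz stereo renders; the function name and toy waveform here are illustrative):

```python
import math

def waveform_pearson(a, b):
    """Pearson correlation between two equal-length waveforms."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    da = [x - mean_a for x in a]
    db = [x - mean_b for x in b]
    cov = sum(x * y for x, y in zip(da, db))
    return cov / math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))

ref = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5]
same = [x * 0.9 for x in ref]  # a pure gain change leaves the correlation at 1.0
print(round(waveform_pearson(ref, same), 6))  # 1.0
```

Note the metric is gain-invariant, so a value near 1.0 says the waveforms match in shape, not in absolute level.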
model_cpu_offload_seq = "text_encoder->condition_encoder->transformer->vae"

def __init__(
I think including the audio tokenizer and detokenizer (I believe AceStepAudioTokenizer and AudioTokenDetokenizer in the original code) as registered modules would make the pipeline easier to use and more self-contained. If I understand correctly, this would allow users to supply raw audio waveforms instead of audio_codes to __call__, so the user would not have to manually tokenize the audio into an audio-code string first.
…eScheduler

Replace the hand-rolled flow-matching Euler loop with `FlowMatchEulerDiscreteScheduler`. ACE-Step still computes its own shifted / turbo sigma schedule via `_get_timestep_schedule`, but now passes it to `scheduler.set_timesteps(sigmas=...)` and delegates the ODE step to `scheduler.step()`.

The scheduler is configured with `num_train_timesteps=1` and `shift=1.0` so `scheduler.timesteps` stays in `[0, 1]` (the convention the DiT was trained on) and the scheduler doesn't re-shift already-shifted sigmas. The scheduler's appended terminal `sigma=0` reproduces the old loop's final-step "project to x0" case exactly: `prev = x + (0 - t_curr) * v`.

Parity on jieyue (seed=42, bf16 + flash-attn, turbo text2music, 8 steps):
- waveform Pearson = 0.999999
- spectral Pearson = 1.000000
- max |diff| = 2.5e-3 (fp32 step-math vs previous bf16 step-math)

fp32 Euler-loop A/B against the hand-rolled path: max |diff| = 3.6e-7.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… kwargs

- Move the DiT transformer tests out of the pipeline test file into a new tests/models/transformers/test_models_transformer_ace_step.py that follows the standard BaseModelTesterConfig + ModelTesterMixin scaffold (matches test_models_transformer_longcat_audio_dit.py).
- Drop `max_position_embeddings` from the remaining AceStepDiTModel and AceStepConditionEncoder test fixtures — neither constructor accepts that argument anymore.
- Drop `use_sliding_window` from the same fixtures — also no longer a constructor argument (the actual `sliding_window` int kwarg is kept).
- Wire `FlowMatchEulerDiscreteScheduler(num_train_timesteps=1, shift=1.0)` into `get_dummy_components()` now that the pipeline requires it.

Resolves huggingface#13095 (comment), r3115664850, r3115673059, r3115676580, r3115680700.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@bot /style
Style fix is beginning… View the workflow run here.
dg845
left a comment
Thanks for the changes! Can you run `make style` and `make quality` to fix the code style?
Fixes 5 review threads + style:

1. Converter now builds `AceStepPipeline` in memory and calls `save_pretrained`. Previously the hand-written `model_index.json` was missing the `scheduler` entry — fresh converter output couldn't be loaded by `AceStepPipeline.from_pretrained` (r3127767785). This also makes the converter robust to future `__init__` signature changes.
2. `latent_length` uses `math.ceil(...)` instead of `int(...)` so non-integer products (e.g. `latents_per_second=2.0, audio_duration=0.4 → 0.8`) round up to `1` instead of truncating to `0` and crashing shape checks (r3127790939).
3. Add `_callback_tensor_inputs = ["latents"]` on `AceStepPipeline` so the standard diffusers callback tests pick up the right tensor (r3127795954).
4. `AceStepConditionEncoder.silence_latent` no longer hard-codes the channel dim to 64. The placeholder buffer now uses the `timbre_hidden_dim` constructor argument, so smaller test configs with `timbre_hidden_dim != 64` load without shape errors (r3127812932).
5. Revert `self.vae.enable_tiling()` from `AceStepPipeline.__init__`. Users can call `pipe.vae.enable_tiling()` themselves for long-form generation; that matches the opt-in convention used by the rest of diffusers (r3127777296).
6. `ruff check --fix` + `ruff format` over all ACE-Step sources (the style fix dg845 asked for via `@bot /style`).

Also: converter now accepts sharded `model.safetensors.index.json` layouts alongside the single-file `model.safetensors`, so the 5B XL turbo variant converts without a pre-processing step.

Parity on jieyue (seed=42, bf16 + flash-attn, turbo text2music 160s, fresh converter output loaded via `from_pretrained`):
- waveform Pearson = 0.999954
- spectral Pearson = 0.999977
- max |a-b| bf16 = 4.3e-02 (dominated by the VAE tiling default flip)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
yiyixuxu
left a comment
thanks!
i left a few more comments
@claude can you do a cross-check on my review for this PR. I'd like you to
Claude finished @yiyixuxu's task in 5m 27s — View job

Cross-checking @yiyixuxu's review + independent review
Part 1: Verification of @yiyixuxu's comments

1. …
| Aspect | PR's _apg_forward | Existing guider's normalized_guidance |
|---|---|---|
| Formula | pred_cond + (guidance_scale - 1) * update | pred_uncond + guidance_scale * update |
| Default eta | 0.0 (orthogonal only) | 1.0 (both parallel + orthogonal) |
| Default norm_threshold | 2.5 (active) | 0.0 (disabled) |
| Momentum | -0.75 (negative, intentional for ACE-Step) | user-provided |
| dims | explicit (1,) (time dim only) | all spatial dims [-i for i in range(1, ndim)] |

The formula difference is mathematically significant: at guidance_scale=7.0, the PR computes pred_cond + 6 * update while the guider computes pred_uncond + 7 * update. These give different results.
Recommendation: Agree this is a follow-up item. A custom ACE-Step guider (inheriting BaseGuidance) with the correct formula, defaults, and MPS handling would be the clean solution.
9. pipeline_ace_step.py:269,275,280 — move is_turbo, sample_rate, latents_per_second to __init__
Agree. The Flux2 Klein pipeline sets derived values in __init__ (e.g., self.vae_scale_factor = ...). Computing these once is cleaner than recomputing on every property access. The fallback pattern for when VAE isn't loaded:
```python
self.sample_rate = int(self.vae.config.sampling_rate) if getattr(self, "vae", None) else 48000
```

10. pipeline_ace_step.py:358 — follow Flux2 approach for distilled/turbo guidance
Agree. The Flux2 Klein pattern is:
- `logger.warning(f"Guidance scale {guidance_scale} is ignored for distilled models.")` when `guidance_scale > 1.0` on a distilled checkpoint
- `do_classifier_free_guidance = self._guidance_scale > 1 and not self.config.is_distilled`
This is cleaner than _variant_defaults(). However, _variant_defaults also handles num_inference_steps and shift defaults. Those still need a home — probably just inline defaults in __call__.
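A minimal sketch of the "warn and ignore" pattern described above, using the stdlib `warnings` module in place of diffusers' logger. The function name, the `is_distilled` flag, and the message are illustrative, not the pipeline's actual code:

```python
import warnings

def resolve_guidance(guidance_scale: float, is_distilled: bool) -> bool:
    """Return whether to run classifier-free guidance for this checkpoint."""
    if is_distilled and guidance_scale > 1.0:
        # Distilled/turbo checkpoints bake guidance into the weights:
        # warn, then ignore the user-supplied scale.
        warnings.warn(f"Guidance scale {guidance_scale} is ignored for distilled models.")
    return guidance_scale > 1.0 and not is_distilled

print(resolve_guidance(7.0, is_distilled=True))   # False (with a warning)
print(resolve_guidance(7.0, is_distilled=False))  # True
```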
11. pipeline_ace_step.py:66 — remove SAMPLE_RATE = 48000, use self.sample_rate
Partially agree. The constant is used in 3 places:

1. `_normalize_audio_to_stereo_48k()` — this is a `@staticmethod`, can't access `self.sample_rate`
2. `_prepare_reference_audio_latents()` — instance method, can use `self.sample_rate`
3. `__call__` line 1136 — instance method, can use `self.sample_rate`
For (2) and (3), switch to self.sample_rate. For (1), either make it a regular method, or pass sample_rate as a parameter. Keeping a module-level constant just for one static method seems wrong.
12. pipeline_ace_step.py:669 — is _get_timestep_schedule different from scheduler set_timesteps?
Answer: No, it's redundant for the non-turbo path. The FlowMatchEulerDiscreteScheduler.set_timesteps accepts sigmas directly and supports a shift config parameter. The non-turbo formula:
```python
t = linspace(1.0, 0.0, N + 1)[:-1]
t = shift * t / (1 + (shift - 1) * t)
```

is the standard flow-matching shift transform that the scheduler already implements. Since the pipeline already calls scheduler.set_timesteps(sigmas=t_schedule.tolist()), the computation is happening twice.
For the turbo path, the lookup tables are also redundant (see next comment).
13. pipeline_ace_step.py:70 — is there something special with pre-defined timesteps?
Answer: No. I verified all three shift values by applying the shift formula to linspace(1.0, 0.0, 9)[:-1]:
- shift=1.0: linspace(1, 0, 9)[:-1] = [1.0, 0.875, 0.75, ...] ← matches SHIFT_TIMESTEPS[1.0] exactly
- shift=2.0: applying 2t/(1+t) to the linspace = [1.0, 0.9333, 0.8571, ...] ← matches SHIFT_TIMESTEPS[2.0] exactly
- shift=3.0: applying 3t/(1+2t) to the linspace = [1.0, 0.9545, 0.9, ...] ← matches SHIFT_TIMESTEPS[3.0] exactly
The lookup tables are just cached results of the formula. SHIFT_TIMESTEPS and VALID_SHIFTS constants can be removed entirely — the base/SFT code path (lines 669-672) already handles all cases correctly.
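A quick sketch of that verification: apply the shift formula to the even linspace and compare against the tabulated leading values. The helper name and the 8-step turbo schedule length are taken from the discussion above; this is an illustration, not the pipeline's code.

```python
def shifted_schedule(shift: float, num_steps: int = 8):
    """Flow-matching shift transform applied to linspace(1, 0, num_steps + 1)[:-1]."""
    t = [1.0 - i / num_steps for i in range(num_steps)]  # linspace without numpy
    return [shift * x / (1 + (shift - 1) * x) for x in t]

print([round(x, 4) for x in shifted_schedule(1.0)[:3]])  # [1.0, 0.875, 0.75]
print([round(x, 4) for x in shifted_schedule(2.0)[:3]])  # [1.0, 0.9333, 0.8571]
print([round(x, 4) for x in shifted_schedule(3.0)[:3]])  # [1.0, 0.9545, 0.9]
```

Since the formula reproduces every table entry, the lookup tables add no information the non-turbo branch doesn't already compute.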
14. pipeline_ace_step.py:579 — remove torch.no_grad() inside encode_prompt
Agree. The pipeline __call__ is decorated with @torch.no_grad(), making inner no_grad contexts redundant. The Flux2 pipeline's encode_prompt does not use torch.no_grad() either. This is the standard diffusers convention.
15. pipeline_ace_step.py:835 — remove warning about silence_latent
Agree. The warning at lines 833-843 (_silence_latent_tiled) is a developer safety net for incomplete conversions. Since the converter now correctly includes silence_latent, this shouldn't trigger in practice. Users won't be running the converter manually.
Same pattern exists in __call__ at lines 1206-1228 where there's a similar fallback/warning for missing silence_latent. Both should be cleaned up.
Part 2: Issues I found that the review missed
A. torch.float64 usage in _apg_forward (pipeline_ace_step.py:147-148)
```python
if device_type == "mps":
    diff_f, cond_f = diff.cpu().double(), pred_cond.cpu().double()
else:
    diff_f, cond_f = diff.double(), pred_cond.double()
```

Per models.md: "MPS and several NPU backends don't support float64." The MPS case is handled, but NPU is not. Should use the device-gated pattern:
```python
is_mps = diff.device.type == "mps"
is_npu = diff.device.type == "npu"
if is_mps or is_npu:
    diff_f, cond_f = diff.cpu().double(), pred_cond.cpu().double()
else:
    diff_f, cond_f = diff.double(), pred_cond.double()
```

Or better: use float32 here (the precision difference in APG normalization is negligible for audio inference).
B. do_cfg local variable vs do_classifier_free_guidance property (pipeline_ace_step.py:1305)
Line 1305: do_cfg = guidance_scale > 1.0 — this duplicates the logic of the do_classifier_free_guidance property (line 289-295). The local variable should use the property:

```python
do_cfg = self.do_classifier_free_guidance
```

This also becomes important when the Flux2 distilled-model pattern is adopted (the property would check not self.is_turbo).
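A minimal sketch of what centralizing the check buys: one property that both the turbo short-circuit and the denoising loop consult. The class and attribute names here are illustrative, not the pipeline's exact code.

```python
class GuidanceMixin:
    """Toy stand-in for the pipeline's guidance bookkeeping."""

    def __init__(self, guidance_scale: float, is_turbo: bool):
        self._guidance_scale = guidance_scale
        self.is_turbo = is_turbo

    @property
    def do_classifier_free_guidance(self) -> bool:
        # Turbo (guidance-distilled) checkpoints ignore CFG entirely,
        # so the check lives in exactly one place.
        return self._guidance_scale > 1.0 and not self.is_turbo


print(GuidanceMixin(7.0, is_turbo=False).do_classifier_free_guidance)  # True
print(GuidanceMixin(7.0, is_turbo=True).do_classifier_free_guidance)   # False
```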
C. Missing _no_split_modules on AceStepTransformer1DModel
Per models.md gotcha #4: "Wrong _no_split_modules. These class attributes control device placement." The transformer model doesn't define _no_split_modules, which matters for device_map="auto" and accelerate integration. Most transformer models define it as ["AceStepTransformerBlock"].
D. No tests for audio-to-audio tasks
The AceStepPipelineFastTests only covers text2music. There are no tests for cover, repaint, extract, lego, or complete tasks. At minimum, a test_ace_step_repaint (which exercises src_audio, chunk_mask, and the repaint window logic) and a test_ace_step_cover (which exercises reference audio + timbre encoding) would catch regressions in those code paths. dg845 already flagged this in their general review comment.
E. _normalize_audio_to_stereo_48k fallback uses F.interpolate (pipeline_ace_step.py:700)
```python
audio = F.interpolate(audio.unsqueeze(0), size=target_len, mode="linear", align_corners=False)[0]
```

The F.interpolate with mode="linear" expects 3D input (B, C, L) but the code does audio.unsqueeze(0) on a 2D (C, L) tensor giving (1, C, L). This is correct for the API, but mode="linear" for audio resampling is a very poor approximation (no anti-aliasing filter). The comment says "Simple linear resampling fallback" which is fair, but consider adding a note in the docstring that torchaudio is recommended for quality.
F. latent_length calculation discrepancy between prepare_latents and __call__
prepare_latents (line 613): math.ceil(audio_duration * self.latents_per_second)
__call__ (line 1158): math.ceil(audio_duration * self.latents_per_second)
Both use ceil now (looks like dg845's comment was addressed). However, the test_prepare_latents test at line 396 asserts:

```python
self.assertEqual(latents.shape, (2, 25, 8))  # 25 Hz latent rate, 1s duration -> 25 frames
```

But with the tiny test VAE (sampling_rate=4, downsampling_ratios=[1, 2]), latents_per_second = 4 / (1*2) = 2.0, so 1.0s × 2.0 = 2 frames, not 25. This test appears to be using hardcoded expectations from the real model config rather than the test config; it would fail or is currently not being run.
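The arithmetic behind both points (the ceil fix and the test-fixture mismatch) can be sketched in a few lines. The helper name is illustrative; the rate is derived from the VAE config as sampling_rate / prod(downsampling_ratios), and the 1920× / 25 Hz figures mirror the real-model numbers quoted in this review:

```python
import math

def latent_length(audio_duration: float, sampling_rate: int, downsampling_ratios) -> int:
    latents_per_second = sampling_rate / math.prod(downsampling_ratios)
    # ceil (not int truncation) keeps sub-frame durations from collapsing to 0
    return math.ceil(audio_duration * latents_per_second)

# Tiny test VAE: 4 Hz / (1*2) = 2 latents per second
print(latent_length(1.0, 4, [1, 2]))   # 2  (not the 25 the test hardcodes)
print(latent_length(0.4, 4, [1, 2]))   # 1  (int() would truncate 0.8 to 0)

# Real-model figures: 48 kHz with a 25 Hz latent rate -> 1920x downsampling
print(latent_length(30.0, 48000, [1920]))  # 750
```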
G. _create_4d_mask materializes full [B, 1, S, S] attention masks
For long audio (e.g., 5 minutes at 25 Hz = 7500 latent frames, /patch_size=2 = 3750), _create_4d_mask materializes a [1, 1, 3750, 3750] tensor. In float32 that's ~56 MB, in bfloat16 ~28 MB. The sliding-window mask is sparse by nature. Since full_attn_mask and encoder_4d_mask are being replaced with None per comment 4, only the sliding mask remains, but it's still a dense materialization. This is not a blocker but worth noting for follow-up optimization (e.g., using xformers block-diagonal or FlashAttention's sliding window natively).
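A back-of-envelope check of the figures above, independent of the pipeline code: a dense [1, 1, S, S] mask for 5 minutes of audio at the 25 Hz latent rate with patch_size=2.

```python
frames = 5 * 60 * 25   # 7500 latent frames for 5 minutes at 25 Hz
seq_len = frames // 2  # patch_size=2 -> 3750 tokens

print(seq_len * seq_len * 4)  # 56250000 bytes (~56 MB in float32)
print(seq_len * seq_len * 2)  # 28125000 bytes (~28 MB in bfloat16)
```

This is per mask per batch element, so batched or multi-mask configurations multiply the cost accordingly.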
Summary

| # | Comment | Verdict | Notes |
|---|---|---|---|
| 1-3 | Remove parity scripts | ✅ Agree | |
| 4-6 | _encode_one → _encode | ✅ Agree | Matches all other VAEs |
| 7 | Remove unused forward args | ✅ Agree | |
| 8 | All-zeros masks → None | ✅ Agree | Saves memory |
| 9 | Remove compat aliases | ✅ Agree | Also update test import |
| 10 | Inline _run_encoder_layers | ✅ Agree | models.md rule |
| 11 | Remove PR-process comments | ✅ Agree | Found 5 more instances |
| 12 | Use existing APG guider? | ❌ Can't drop-in | Formula differs; follow-up |
| 13 | Move props to __init__ | ✅ Agree | Flux2 pattern |
| 14 | Flux2 distilled pattern | ✅ Agree | Warn + ignore for turbo |
| 15 | Remove SAMPLE_RATE | Partially agree | Static method needs refactor |
| 16 | _get_timestep_schedule vs scheduler | ✅ Redundant | Scheduler can do this |
| 17 | Pre-defined timesteps special? | ❌ No | Just cached formula results |
| 18 | Remove inner no_grad | ✅ Agree | Standard convention |
| 19 | Remove silence_latent warning | ✅ Agree | Also at line 1206 |
| New | float64 in APG (NPU) | 🔴 Bug | Missing NPU guard |
| New | do_cfg vs property | 🟡 Cleanup | Use property consistently |
| New | Missing _no_split_modules | 🟡 Missing | Needed for device_map |
| New | No audio-to-audio tests | 🟡 Gap | At least repaint + cover |
| New | test_prepare_latents wrong expectations | 🔴 Bug | Uses real-model values with tiny test config |
| New | Dense sliding-window mask | 🟡 Perf | Follow-up optimization |
Code-level (22 threads):
1. Delete 3 dev/parity scripts (`scripts/audio_parity_jieyue.py`,
`scripts/dit_parity_test.py`, `scripts/run_official_generate_music.py`)
that shouldn't have been committed.
2. Rename `AutoencoderOobleck._encode_one` → `_encode` to match the convention
used by other diffusers VAEs.
3. Delete the hard-coded `SHIFT_TIMESTEPS` / `VALID_SHIFTS` table in
`pipeline_ace_step.py`: the per-shift turbo schedules are recovered
exactly by `linspace(1, 0, N+1)[:-1]` plus the flow-match shift formula
that the non-turbo branch already uses, so a single code path covers both.
4. Drop the backwards-compat `AceStepDiTModel` / `AceStepDiTLayer` aliases
and every reference (top-level `__init__`, `models/__init__`,
`transformers/__init__`, dummy objects, tests, docs toctree, model card).
`AceStepTransformer1DModel` is the only exported name now.
5. Remove the unused `attention_mask` / `encoder_attention_mask` args from
`AceStepTransformer1DModel.forward`; the model rebuilds its masks from
the sequence shape and never consumed them.
6. In the DiT forward and both encoders, pass `None` instead of an all-zero
`full_attn_mask` / `encoder_4d_mask` to non-sliding attention layers — SDPA
dispatches to a faster kernel when the mask is None.
7. Inline the shared `_run_encoder_layers` helper directly into
`AceStepLyricEncoder.forward` / `AceStepTimbreEncoder.forward` so layer
calls are visible at the forward boundary (diffusers style).
8. Move `is_turbo` / `sample_rate` / `latents_per_second` from `@property`s
that re-read module configs each call to cached attributes populated in
`__init__` (Flux2-style), with a default-ACE-Step fallback when
`self.vae` is offloaded. Drop the now-unused `SAMPLE_RATE = 48000`
module-level constant and the three property definitions.
9. Warn + coerce `guidance_scale` to 1.0 on turbo (guidance-distilled)
checkpoints, following `pipeline_flux2_klein`. Prevents over-guided
audio when users forward their base/sft CFG settings to a turbo pipe.
10. Remove the `logger.warning(...)` paths that triggered on
`silence_latent` missing/zero — those only fired for author-side
unconverted checkpoints and tests; end users always load converted
weights where the buffer is baked in.
11. Drop the redundant `with torch.no_grad():` wrappers inside
`encode_prompt` — the pipeline's `__call__` runs under `torch.no_grad`
already.
12. Strip "reviewer comment on PR huggingface#13095" attribution comments from three
docstrings (here and everywhere).
Parity on jieyue (seed=42, bf16 + flash-attn, XL turbo 160s text2music):
waveform Pearson = 0.9747
spectral Pearson = 0.9895
The shift comes from full-attention layers switching `attn_mask=0_tensor` →
`attn_mask=None`, which dispatches to a different SDPA kernel on bf16. The
two outputs are algebraically equivalent for fp32 eager; on bf16+FA the
delta is dominated by kernel-level ULPs, well within the sampler-noise
band (ear-check on the 160s example confirms no audible regression).
Still open — AudioTokenizer/Detokenizer (deferred) + APG guider follow-up
(dims differ from `diffusers.guiders.adaptive_projected_guidance`, not a
drop-in; worth a separate PR).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
What does this PR do?
This PR adds the ACE-Step 1.5 pipeline to Diffusers — a text-to-music generation model that produces high-quality stereo music with lyrics at 48kHz from text prompts.
New Components
- AceStepDiTModel (src/diffusers/models/transformers/ace_step_transformer.py): A Diffusion Transformer (DiT) model with RoPE, GQA, sliding-window attention, and flow matching for denoising audio latents. Includes custom components: AceStepRMSNorm, AceStepRotaryEmbedding, AceStepMLP, AceStepTimestepEmbedding, AceStepAttention, AceStepEncoderLayer, and AceStepDiTLayer.
- AceStepConditionEncoder (src/diffusers/pipelines/ace_step/modeling_ace_step.py): Condition encoder that fuses text, lyric, and timbre embeddings into a unified cross-attention conditioning signal. Includes AceStepLyricEncoder and AceStepTimbreEncoder sub-modules.
- AceStepPipeline (src/diffusers/pipelines/ace_step/pipeline_ace_step.py): The main pipeline supporting 6 task types:
  - text2music — generate music from text and lyrics
  - cover — generate from audio semantic codes or with timbre transfer via reference audio
  - repaint — regenerate a time region within existing audio
  - extract — extract a specific track (vocals, drums, etc.) from audio
  - lego — generate a specific track given audio context
  - complete — complete audio with additional tracks
- Conversion script (scripts/convert_ace_step_to_diffusers.py): Converts original ACE-Step 1.5 checkpoint weights to Diffusers format.

Key Features
- `_get_task_instruction`
- `bpm`, `keyscale`, `timesignature` parameters formatted into the SFT prompt template
- Source audio (`src_audio`) and reference audio (`reference_audio`) inputs with VAE encoding
- Tiled encoding (`_tiled_encode`) and decoding (`_tiled_decode`) for long audio
- `guidance_scale`, `cfg_interval_start`, and `cfg_interval_end` (primarily for base/SFT models; turbo models have guidance distilled into weights)
- `audio_cover_strength`
- `_parse_audio_code_string` extracts semantic codes from `<|audio_code_N|>` tokens for cover tasks
- `_build_chunk_mask` creates time-region masks for repaint/lego tasks
- Custom `timesteps` with `shift=3.0`

Architecture
ACE-Step 1.5 comprises three main components:
Tests
- Pipeline tests (tests/pipelines/ace_step/test_ace_step.py):
  - AceStepDiTModelTests — forward shape, return dict, gradient checkpointing
  - AceStepConditionEncoderTests — forward shape, save/load config
  - AceStepPipelineFastTests (extends PipelineTesterMixin) — 39 tests covering basic generation, batch processing, latent output, save/load, float16 inference, CPU/model offloading, encode_prompt, prepare_latents, timestep_schedule, format_prompt, and more
- Model tests (tests/models/transformers/test_models_transformer_ace_step.py):
  - TestAceStepDiTModel (extends ModelTesterMixin) — forward pass, dtype inference, save/load, determinism
  - TestAceStepDiTModelMemory (extends MemoryTesterMixin) — layerwise casting, group offloading
  - TestAceStepDiTModelTraining (extends TrainingTesterMixin) — training, EMA, gradient checkpointing, mixed precision

All 70 tests pass (39 pipeline + 31 model).
Documentation
- docs/source/en/api/pipelines/ace_step.md — Pipeline API documentation with usage examples
- docs/source/en/api/models/ace_step_transformer.md — Transformer model documentation

Usage
Before submitting
Who can review?