Add ACE-Step pipeline for text-to-music generation #13095
ChuxiJ wants to merge 27 commits into huggingface:main from …
Conversation
Hi @ChuxiJ, thanks for the PR! As a preliminary comment, I tried the test script given above but got an error, which I think is due to the fact that the … If I convert the checkpoint locally from a local snapshot:

```shell
python scripts/convert_ace_step_to_diffusers.py \
    --checkpoint_dir /path/to/acestep-v15-repo \
    --dit_config acestep-v15-turbo \
    --output_dir /path/to/acestep-v15-diffusers \
    --dtype bf16
```

and then test it using the following script:

```python
import torch
import soundfile as sf

from diffusers import AceStepPipeline

OUTPUT_SAMPLE_RATE = 48000

model_id = "/path/to/acestep-v15-diffusers"
device = "cuda"
dtype = torch.bfloat16
seed = 42

pipe = AceStepPipeline.from_pretrained(model_id, torch_dtype=dtype)
pipe = pipe.to(device)

generator = torch.Generator(device=device).manual_seed(seed)

# Text-to-music generation
audio = pipe(
    prompt="A beautiful piano piece with soft melodies",
    lyrics="[verse]\nSoft notes in the morning light\n[chorus]\nMusic fills the air tonight",
    audio_duration=30.0,
    num_inference_steps=8,
    bpm=120,
    keyscale="C major",
    generator=generator,
).audios

sf.write("acestep_t2m.wav", audio[0, 0].cpu().numpy(), OUTPUT_SAMPLE_RATE)
```

I get the following sample: [audio sample]

The sample quality is lower than expected, so there is probably a bug. Could you look into it?
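Later threads in this PR quantify parity between two renders with a waveform Pearson correlation. A dependency-free sketch of that metric (the actual comparisons were run on full 48 kHz stereo renders; the function name and toy waveform here are illustrative):

```python
import math

def waveform_pearson(a, b):
    """Pearson correlation between two equal-length waveforms."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    da = [x - mean_a for x in a]
    db = [x - mean_b for x in b]
    cov = sum(x * y for x, y in zip(da, db))
    return cov / math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))

ref = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5]
same = [x * 0.9 for x in ref]  # a pure gain change leaves the correlation at 1.0
print(round(waveform_pearson(ref, same), 6))  # 1.0
```

Note the metric is gain-invariant, so a value near 1.0 says the waveforms match in shape, not in absolute level.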
model_cpu_offload_seq = "text_encoder->condition_encoder->transformer->vae"

def __init__(
I think including the audio tokenizer and detokenizer (I believe AceStepAudioTokenizer and AudioTokenDetokenizer in the original code) as registered modules would make the pipeline easier to use and more self-contained. If I understand correctly, this would allow users to supply raw audio waveforms instead of audio_codes to __call__, so the user would not have to manually tokenize the audio into an audio-code string first.
…eScheduler

Replace the hand-rolled flow-matching Euler loop with `FlowMatchEulerDiscreteScheduler`. ACE-Step still computes its own shifted / turbo sigma schedule via `_get_timestep_schedule`, but now passes it to `scheduler.set_timesteps(sigmas=...)` and delegates the ODE step to `scheduler.step()`.

The scheduler is configured with `num_train_timesteps=1` and `shift=1.0` so `scheduler.timesteps` stays in `[0, 1]` (the convention the DiT was trained on) and the scheduler doesn't re-shift already-shifted sigmas. The scheduler's appended terminal `sigma=0` reproduces the old loop's final-step "project to x0" case exactly: `prev = x + (0 - t_curr) * v`.

Parity on jieyue (seed=42, bf16 + flash-attn, turbo text2music, 8 steps):
- waveform Pearson = 0.999999
- spectral Pearson = 1.000000
- max |diff| = 2.5e-3 (fp32 step-math vs previous bf16 step-math)

fp32 Euler-loop A/B against the hand-rolled path: max |diff| = 3.6e-7.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… kwargs

- Move the DiT transformer tests out of the pipeline test file into a new tests/models/transformers/test_models_transformer_ace_step.py that follows the standard BaseModelTesterConfig + ModelTesterMixin scaffold (matches test_models_transformer_longcat_audio_dit.py).
- Drop `max_position_embeddings` from the remaining AceStepDiTModel and AceStepConditionEncoder test fixtures — neither constructor accepts that argument anymore.
- Drop `use_sliding_window` from the same fixtures — also no longer a constructor argument (the actual `sliding_window` int kwarg is kept).
- Wire `FlowMatchEulerDiscreteScheduler(num_train_timesteps=1, shift=1.0)` into `get_dummy_components()` now that the pipeline requires it.

Resolves huggingface#13095 (comment), r3115664850, r3115673059, r3115676580, r3115680700.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@bot /style
Style fix is beginning… View the workflow run here.
dg845
left a comment
Thanks for the changes! Can you run `make style` and `make quality` to fix the code style?
Fixes 5 review threads + style:

1. Converter now builds `AceStepPipeline` in memory and calls `save_pretrained`. Previously the hand-written `model_index.json` was missing the `scheduler` entry — fresh converter output couldn't be loaded by `AceStepPipeline.from_pretrained` (r3127767785). This also makes the converter robust to future `__init__` signature changes.
2. `latent_length` uses `math.ceil(...)` instead of `int(...)` so non-integer products (e.g. `latents_per_second=2.0, audio_duration=0.4 → 0.8`) round up to `1` instead of truncating to `0` and crashing shape checks (r3127790939).
3. Add `_callback_tensor_inputs = ["latents"]` on `AceStepPipeline` so the standard diffusers callback tests pick up the right tensor (r3127795954).
4. `AceStepConditionEncoder.silence_latent` no longer hard-codes the channel dim to 64. The placeholder buffer now uses the `timbre_hidden_dim` constructor argument, so smaller test configs with `timbre_hidden_dim != 64` load without shape errors (r3127812932).
5. Revert `self.vae.enable_tiling()` from `AceStepPipeline.__init__`. Users can call `pipe.vae.enable_tiling()` themselves for long-form generation; that matches the opt-in convention used by the rest of diffusers (r3127777296).
6. `ruff check --fix` + `ruff format` over all ACE-Step sources (the style fix dg845 asked for via `@bot /style`).

Also: converter now accepts sharded `model.safetensors.index.json` layouts alongside the single-file `model.safetensors`, so the 5B XL turbo variant converts without a pre-processing step.

Parity on jieyue (seed=42, bf16 + flash-attn, turbo text2music 160s, fresh converter output loaded via `from_pretrained`):
- waveform Pearson = 0.999954
- spectral Pearson = 0.999977
- max |a-b| bf16 = 4.3e-02 (dominated by the VAE tiling default flip)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
yiyixuxu
left a comment
thanks!
i left a few more comments
@claude can you do a cross-check on my review for this PR. I'd like you to
Claude finished @yiyixuxu's task in 5m 27s — View job

Cross-checking @yiyixuxu's review + independent review
Part 1: Verification of @yiyixuxu's comments

1. …
| Aspect | PR's _apg_forward | Existing guider's normalized_guidance |
|---|---|---|
| Formula | pred_cond + (guidance_scale - 1) * update | pred_uncond + guidance_scale * update |
| Default eta | 0.0 (orthogonal only) | 1.0 (both parallel + orthogonal) |
| Default norm_threshold | 2.5 (active) | 0.0 (disabled) |
| Momentum | -0.75 (negative, intentional for ACE-Step) | user-provided |
| dims | explicit (1,) (time dim only) | all spatial dims [-i for i in range(1, ndim)] |

The formula difference is mathematically significant: at guidance_scale=7.0, the PR computes pred_cond + 6 * update while the guider computes pred_uncond + 7 * update. These give different results.
Recommendation: Agree this is a follow-up item. A custom ACE-Step guider (inheriting BaseGuidance) with the correct formula, defaults, and MPS handling would be the clean solution.
9. pipeline_ace_step.py:269,275,280 — move is_turbo, sample_rate, latents_per_second to __init__
Agree. The Flux2 Klein pipeline sets derived values in __init__ (e.g., self.vae_scale_factor = ...). Computing these once is cleaner than recomputing on every property access. The fallback pattern for when VAE isn't loaded:
```python
self.sample_rate = int(self.vae.config.sampling_rate) if getattr(self, "vae", None) else 48000
```

10. pipeline_ace_step.py:358 — follow Flux2 approach for distilled/turbo guidance
Agree. The Flux2 Klein pattern is:
- `logger.warning(f"Guidance scale {guidance_scale} is ignored for distilled models.")` when `guidance_scale > 1.0` on a distilled checkpoint
- `do_classifier_free_guidance = self._guidance_scale > 1 and not self.config.is_distilled`
This is cleaner than _variant_defaults(). However, _variant_defaults also handles num_inference_steps and shift defaults. Those still need a home — probably just inline defaults in __call__.
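A minimal sketch of the "warn and ignore" pattern described above, using the stdlib `warnings` module in place of diffusers' logger. The function name, the `is_distilled` flag, and the message are illustrative, not the pipeline's actual code:

```python
import warnings

def resolve_guidance(guidance_scale: float, is_distilled: bool) -> bool:
    """Return whether to run classifier-free guidance for this checkpoint."""
    if is_distilled and guidance_scale > 1.0:
        # Distilled/turbo checkpoints bake guidance into the weights:
        # warn, then ignore the user-supplied scale.
        warnings.warn(f"Guidance scale {guidance_scale} is ignored for distilled models.")
    return guidance_scale > 1.0 and not is_distilled

print(resolve_guidance(7.0, is_distilled=True))   # False (with a warning)
print(resolve_guidance(7.0, is_distilled=False))  # True
```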
11. pipeline_ace_step.py:66 — remove SAMPLE_RATE = 48000, use self.sample_rate
Partially agree. The constant is used in 3 places:

1. `_normalize_audio_to_stereo_48k()` — this is a `@staticmethod`, can't access `self.sample_rate`
2. `_prepare_reference_audio_latents()` — instance method, can use `self.sample_rate`
3. `__call__` line 1136 — instance method, can use `self.sample_rate`
For (2) and (3), switch to self.sample_rate. For (1), either make it a regular method, or pass sample_rate as a parameter. Keeping a module-level constant just for one static method seems wrong.
12. pipeline_ace_step.py:669 — is _get_timestep_schedule different from scheduler set_timesteps?
Answer: No, it's redundant for the non-turbo path. The FlowMatchEulerDiscreteScheduler.set_timesteps accepts sigmas directly and supports a shift config parameter. The non-turbo formula:
```python
t = linspace(1.0, 0.0, N + 1)[:-1]
t = shift * t / (1 + (shift - 1) * t)
```

is the standard flow-matching shift transform that the scheduler already implements. Since the pipeline already calls scheduler.set_timesteps(sigmas=t_schedule.tolist()), the computation is happening twice.
For the turbo path, the lookup tables are also redundant (see next comment).
13. pipeline_ace_step.py:70 — is there something special with pre-defined timesteps?
Answer: No. I verified all three shift values by applying the shift formula to linspace(1.0, 0.0, 9)[:-1]:
- shift=1.0: linspace(1, 0, 9)[:-1] = [1.0, 0.875, 0.75, ...] ← matches SHIFT_TIMESTEPS[1.0] exactly
- shift=2.0: applying 2t/(1+t) to the linspace = [1.0, 0.9333, 0.8571, ...] ← matches SHIFT_TIMESTEPS[2.0] exactly
- shift=3.0: applying 3t/(1+2t) to the linspace = [1.0, 0.9545, 0.9, ...] ← matches SHIFT_TIMESTEPS[3.0] exactly
The lookup tables are just cached results of the formula. SHIFT_TIMESTEPS and VALID_SHIFTS constants can be removed entirely — the base/SFT code path (lines 669-672) already handles all cases correctly.
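A quick sketch of that verification: apply the shift formula to the even linspace and compare against the tabulated leading values. The helper name and the 8-step turbo schedule length are taken from the discussion above; this is an illustration, not the pipeline's code.

```python
def shifted_schedule(shift: float, num_steps: int = 8):
    """Flow-matching shift transform applied to linspace(1, 0, num_steps + 1)[:-1]."""
    t = [1.0 - i / num_steps for i in range(num_steps)]  # linspace without numpy
    return [shift * x / (1 + (shift - 1) * x) for x in t]

print([round(x, 4) for x in shifted_schedule(1.0)[:3]])  # [1.0, 0.875, 0.75]
print([round(x, 4) for x in shifted_schedule(2.0)[:3]])  # [1.0, 0.9333, 0.8571]
print([round(x, 4) for x in shifted_schedule(3.0)[:3]])  # [1.0, 0.9545, 0.9]
```

Since the formula reproduces every table entry, the lookup tables add no information the non-turbo branch doesn't already compute.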
14. pipeline_ace_step.py:579 — remove torch.no_grad() inside encode_prompt
Agree. The pipeline __call__ is decorated with @torch.no_grad(), making inner no_grad contexts redundant. The Flux2 pipeline's encode_prompt does not use torch.no_grad() either. This is the standard diffusers convention.
15. pipeline_ace_step.py:835 — remove warning about silence_latent
Agree. The warning at lines 833-843 (_silence_latent_tiled) is a developer safety net for incomplete conversions. Since the converter now correctly includes silence_latent, this shouldn't trigger in practice. Users won't be running the converter manually.
Same pattern exists in __call__ at lines 1206-1228 where there's a similar fallback/warning for missing silence_latent. Both should be cleaned up.
Part 2: Issues I found that the review missed
A. torch.float64 usage in _apg_forward (pipeline_ace_step.py:147-148)
```python
if device_type == "mps":
    diff_f, cond_f = diff.cpu().double(), pred_cond.cpu().double()
else:
    diff_f, cond_f = diff.double(), pred_cond.double()
```

Per models.md: "MPS and several NPU backends don't support float64." The MPS case is handled, but NPU is not. Should use the device-gated pattern:
```python
is_mps = diff.device.type == "mps"
is_npu = diff.device.type == "npu"
if is_mps or is_npu:
    diff_f, cond_f = diff.cpu().double(), pred_cond.cpu().double()
else:
    diff_f, cond_f = diff.double(), pred_cond.double()
```

Or better: use float32 here (the precision difference in APG normalization is negligible for audio inference).
B. do_cfg local variable vs do_classifier_free_guidance property (pipeline_ace_step.py:1305)
Line 1305: do_cfg = guidance_scale > 1.0 — this duplicates the logic of the do_classifier_free_guidance property (line 289-295). The local variable should use the property:

```python
do_cfg = self.do_classifier_free_guidance
```

This also becomes important when the Flux2 distilled-model pattern is adopted (the property would check not self.is_turbo).
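A minimal sketch of what centralizing the check buys: one property that both the turbo short-circuit and the denoising loop consult. The class and attribute names here are illustrative, not the pipeline's exact code.

```python
class GuidanceMixin:
    """Toy stand-in for the pipeline's guidance bookkeeping."""

    def __init__(self, guidance_scale: float, is_turbo: bool):
        self._guidance_scale = guidance_scale
        self.is_turbo = is_turbo

    @property
    def do_classifier_free_guidance(self) -> bool:
        # Turbo (guidance-distilled) checkpoints ignore CFG entirely,
        # so the check lives in exactly one place.
        return self._guidance_scale > 1.0 and not self.is_turbo


print(GuidanceMixin(7.0, is_turbo=False).do_classifier_free_guidance)  # True
print(GuidanceMixin(7.0, is_turbo=True).do_classifier_free_guidance)   # False
```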
C. Missing _no_split_modules on AceStepTransformer1DModel
Per models.md gotcha #4: "Wrong _no_split_modules. These class attributes control device placement." The transformer model doesn't define _no_split_modules, which matters for device_map="auto" and accelerate integration. Most transformer models define it as ["AceStepTransformerBlock"].
D. No tests for audio-to-audio tasks
The AceStepPipelineFastTests only covers text2music. There are no tests for cover, repaint, extract, lego, or complete tasks. At minimum, a test_ace_step_repaint (which exercises src_audio, chunk_mask, and the repaint window logic) and a test_ace_step_cover (which exercises reference audio + timbre encoding) would catch regressions in those code paths. dg845 already flagged this in their general review comment.
E. _normalize_audio_to_stereo_48k fallback uses F.interpolate (pipeline_ace_step.py:700)
```python
audio = F.interpolate(audio.unsqueeze(0), size=target_len, mode="linear", align_corners=False)[0]
```

The F.interpolate with mode="linear" expects 3D input (B, C, L) but the code does audio.unsqueeze(0) on a 2D (C, L) tensor giving (1, C, L). This is correct for the API, but mode="linear" for audio resampling is a very poor approximation (no anti-aliasing filter). The comment says "Simple linear resampling fallback" which is fair, but consider adding a note in the docstring that torchaudio is recommended for quality.
F. latent_length calculation discrepancy between prepare_latents and __call__
prepare_latents (line 613): math.ceil(audio_duration * self.latents_per_second)
__call__ (line 1158): math.ceil(audio_duration * self.latents_per_second)
Both use ceil now (looks like dg845's comment was addressed). However, the test_prepare_latents test at line 396 asserts:

```python
self.assertEqual(latents.shape, (2, 25, 8))  # 25 Hz latent rate, 1s duration -> 25 frames
```

But with the tiny test VAE (sampling_rate=4, downsampling_ratios=[1, 2]), latents_per_second = 4 / (1*2) = 2.0, so 1.0s × 2.0 = 2 frames, not 25. This test appears to be using hardcoded expectations from the real model config rather than the test config; it would fail or is currently not being run.
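The arithmetic behind both points (the ceil fix and the test-fixture mismatch) can be sketched in a few lines. The helper name is illustrative; the rate is derived from the VAE config as sampling_rate / prod(downsampling_ratios), and the 1920× / 25 Hz figures mirror the real-model numbers quoted in this review:

```python
import math

def latent_length(audio_duration: float, sampling_rate: int, downsampling_ratios) -> int:
    latents_per_second = sampling_rate / math.prod(downsampling_ratios)
    # ceil (not int truncation) keeps sub-frame durations from collapsing to 0
    return math.ceil(audio_duration * latents_per_second)

# Tiny test VAE: 4 Hz / (1*2) = 2 latents per second
print(latent_length(1.0, 4, [1, 2]))   # 2  (not the 25 the test hardcodes)
print(latent_length(0.4, 4, [1, 2]))   # 1  (int() would truncate 0.8 to 0)

# Real-model figures: 48 kHz with a 25 Hz latent rate -> 1920x downsampling
print(latent_length(30.0, 48000, [1920]))  # 750
```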
G. _create_4d_mask materializes full [B, 1, S, S] attention masks
For long audio (e.g., 5 minutes at 25 Hz = 7500 latent frames, /patch_size=2 = 3750), _create_4d_mask materializes a [1, 1, 3750, 3750] tensor. In float32 that's ~56 MB, in bfloat16 ~28 MB. The sliding-window mask is sparse by nature. Since full_attn_mask and encoder_4d_mask are being replaced with None per comment 4, only the sliding mask remains, but it's still a dense materialization. This is not a blocker but worth noting for follow-up optimization (e.g., using xformers block-diagonal or FlashAttention's sliding window natively).
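A back-of-envelope check of the figures above, independent of the pipeline code: a dense [1, 1, S, S] mask for 5 minutes of audio at the 25 Hz latent rate with patch_size=2.

```python
frames = 5 * 60 * 25   # 7500 latent frames for 5 minutes at 25 Hz
seq_len = frames // 2  # patch_size=2 -> 3750 tokens

print(seq_len * seq_len * 4)  # 56250000 bytes (~56 MB in float32)
print(seq_len * seq_len * 2)  # 28125000 bytes (~28 MB in bfloat16)
```

This is per mask per batch element, so batched or multi-mask configurations multiply the cost accordingly.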
Summary

| # | Comment | Verdict | Notes |
|---|---|---|---|
| 1-3 | Remove parity scripts | ✅ Agree | |
| 4-6 | _encode_one → _encode | ✅ Agree | Matches all other VAEs |
| 7 | Remove unused forward args | ✅ Agree | |
| 8 | All-zeros masks → None | ✅ Agree | Saves memory |
| 9 | Remove compat aliases | ✅ Agree | Also update test import |
| 10 | Inline _run_encoder_layers | ✅ Agree | models.md rule |
| 11 | Remove PR-process comments | ✅ Agree | Found 5 more instances |
| 12 | Use existing APG guider? | ❌ Can't drop-in | Formula differs; follow-up |
| 13 | Move props to __init__ | ✅ Agree | Flux2 pattern |
| 14 | Flux2 distilled pattern | ✅ Agree | Warn + ignore for turbo |
| 15 | Remove SAMPLE_RATE | Partially agree | Static method needs refactor |
| 16 | _get_timestep_schedule vs scheduler | ✅ Redundant | Scheduler can do this |
| 17 | Pre-defined timesteps special? | ❌ No | Just cached formula results |
| 18 | Remove inner no_grad | ✅ Agree | Standard convention |
| 19 | Remove silence_latent warning | ✅ Agree | Also at line 1206 |
| New | float64 in APG (NPU) | 🔴 Bug | Missing NPU guard |
| New | do_cfg vs property | 🟡 Cleanup | Use property consistently |
| New | Missing _no_split_modules | 🟡 Missing | Needed for device_map |
| New | No audio-to-audio tests | 🟡 Gap | At least repaint + cover |
| New | test_prepare_latents wrong expectations | 🔴 Bug | Uses real-model values with tiny test config |
| New | Dense sliding-window mask | 🟡 Perf | Follow-up optimization |
Code-level (22 threads):
1. Delete 3 dev/parity scripts (`scripts/audio_parity_jieyue.py`,
`scripts/dit_parity_test.py`, `scripts/run_official_generate_music.py`)
that shouldn't have been committed.
2. Rename `AutoencoderOobleck._encode_one` → `_encode` to match the convention
used by other diffusers VAEs.
3. Delete the hard-coded `SHIFT_TIMESTEPS` / `VALID_SHIFTS` table in
`pipeline_ace_step.py`: the per-shift turbo schedules are recovered
exactly by `linspace(1, 0, N+1)[:-1]` plus the flow-match shift formula
that the non-turbo branch already uses, so a single code path covers both.
4. Drop the backwards-compat `AceStepDiTModel` / `AceStepDiTLayer` aliases
and every reference (top-level `__init__`, `models/__init__`,
`transformers/__init__`, dummy objects, tests, docs toctree, model card).
`AceStepTransformer1DModel` is the only exported name now.
5. Remove the unused `attention_mask` / `encoder_attention_mask` args from
`AceStepTransformer1DModel.forward`; the model rebuilds its masks from
the sequence shape and never consumed them.
6. In the DiT forward and both encoders, pass `None` instead of an all-zero
`full_attn_mask` / `encoder_4d_mask` to non-sliding attention layers — SDPA
dispatches to a faster kernel when the mask is None.
7. Inline the shared `_run_encoder_layers` helper directly into
`AceStepLyricEncoder.forward` / `AceStepTimbreEncoder.forward` so layer
calls are visible at the forward boundary (diffusers style).
8. Move `is_turbo` / `sample_rate` / `latents_per_second` from `@property`s
that re-read module configs each call to cached attributes populated in
`__init__` (Flux2-style), with a default-ACE-Step fallback when
`self.vae` is offloaded. Drop the now-unused `SAMPLE_RATE = 48000`
module-level constant and the three property definitions.
9. Warn + coerce `guidance_scale` to 1.0 on turbo (guidance-distilled)
checkpoints, following `pipeline_flux2_klein`. Prevents over-guided
audio when users forward their base/sft CFG settings to a turbo pipe.
10. Remove the `logger.warning(...)` paths that triggered on
`silence_latent` missing/zero — those only fired for author-side
unconverted checkpoints and tests; end users always load converted
weights where the buffer is baked in.
11. Drop the redundant `with torch.no_grad():` wrappers inside
`encode_prompt` — the pipeline's `__call__` runs under `torch.no_grad`
already.
12. Strip "reviewer comment on PR huggingface#13095" attribution comments from three
docstrings (here and everywhere).
Parity on jieyue (seed=42, bf16 + flash-attn, XL turbo 160s text2music):
waveform Pearson = 0.9747
spectral Pearson = 0.9895
The shift comes from full-attention layers switching `attn_mask=0_tensor` →
`attn_mask=None`, which dispatches to a different SDPA kernel on bf16. The
two outputs are algebraically equivalent for fp32 eager; on bf16+FA the
delta is dominated by kernel-level ULPs, well within the sampler-noise
band (ear-check on the 160s example confirms no audible regression).
Still open — AudioTokenizer/Detokenizer (deferred) + APG guider follow-up
(dims differ from `diffusers.guiders.adaptive_projected_guidance`, not a
drop-in; worth a separate PR).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
What does this PR do?
This PR adds the ACE-Step 1.5 pipeline to Diffusers — a text-to-music generation model that produces high-quality stereo music with lyrics at 48kHz from text prompts.
New Components
- AceStepDiTModel (src/diffusers/models/transformers/ace_step_transformer.py): A Diffusion Transformer (DiT) model with RoPE, GQA, sliding-window attention, and flow matching for denoising audio latents. Includes custom components: AceStepRMSNorm, AceStepRotaryEmbedding, AceStepMLP, AceStepTimestepEmbedding, AceStepAttention, AceStepEncoderLayer, and AceStepDiTLayer.
- AceStepConditionEncoder (src/diffusers/pipelines/ace_step/modeling_ace_step.py): Condition encoder that fuses text, lyric, and timbre embeddings into a unified cross-attention conditioning signal. Includes AceStepLyricEncoder and AceStepTimbreEncoder sub-modules.
- AceStepPipeline (src/diffusers/pipelines/ace_step/pipeline_ace_step.py): The main pipeline supporting 6 task types:
  - text2music — generate music from text and lyrics
  - cover — generate from audio semantic codes or with timbre transfer via reference audio
  - repaint — regenerate a time region within existing audio
  - extract — extract a specific track (vocals, drums, etc.) from audio
  - lego — generate a specific track given audio context
  - complete — complete audio with additional tracks
- Conversion script (scripts/convert_ace_step_to_diffusers.py): Converts original ACE-Step 1.5 checkpoint weights to Diffusers format.

Key Features
- `_get_task_instruction`
- `bpm`, `keyscale`, `timesignature` parameters formatted into the SFT prompt template
- Source audio (`src_audio`) and reference audio (`reference_audio`) inputs with VAE encoding
- Tiled encoding (`_tiled_encode`) and decoding (`_tiled_decode`) for long audio
- `guidance_scale`, `cfg_interval_start`, and `cfg_interval_end` (primarily for base/SFT models; turbo models have guidance distilled into weights)
- `audio_cover_strength`
- `_parse_audio_code_string` extracts semantic codes from `<|audio_code_N|>` tokens for cover tasks
- `_build_chunk_mask` creates time-region masks for repaint/lego tasks
- Custom `timesteps` with `shift=3.0`

Architecture
ACE-Step 1.5 comprises three main components:
Tests
- Pipeline tests (tests/pipelines/ace_step/test_ace_step.py):
  - AceStepDiTModelTests — forward shape, return dict, gradient checkpointing
  - AceStepConditionEncoderTests — forward shape, save/load config
  - AceStepPipelineFastTests (extends PipelineTesterMixin) — 39 tests covering basic generation, batch processing, latent output, save/load, float16 inference, CPU/model offloading, encode_prompt, prepare_latents, timestep_schedule, format_prompt, and more
- Model tests (tests/models/transformers/test_models_transformer_ace_step.py):
  - TestAceStepDiTModel (extends ModelTesterMixin) — forward pass, dtype inference, save/load, determinism
  - TestAceStepDiTModelMemory (extends MemoryTesterMixin) — layerwise casting, group offloading
  - TestAceStepDiTModelTraining (extends TrainingTesterMixin) — training, EMA, gradient checkpointing, mixed precision

All 70 tests pass (39 pipeline + 31 model).
Documentation
- docs/source/en/api/pipelines/ace_step.md — Pipeline API documentation with usage examples
- docs/source/en/api/models/ace_step_transformer.md — Transformer model documentation

Usage
Before submitting
Who can review?