
Add ACE-Step pipeline for text-to-music generation#13095

Open
ChuxiJ wants to merge 27 commits into huggingface:main from ChuxiJ:add-ace-step-pipeline

Conversation


@ChuxiJ ChuxiJ commented Feb 7, 2026

What does this PR do?

This PR adds the ACE-Step 1.5 pipeline to Diffusers — a text-to-music generation model that produces high-quality stereo music with lyrics at 48kHz from text prompts.

New Components

  • AceStepDiTModel (src/diffusers/models/transformers/ace_step_transformer.py): A Diffusion Transformer (DiT) model with RoPE, GQA, sliding window attention, and flow matching for denoising audio latents. Includes custom components: AceStepRMSNorm, AceStepRotaryEmbedding, AceStepMLP, AceStepTimestepEmbedding, AceStepAttention, AceStepEncoderLayer, and AceStepDiTLayer.

  • AceStepConditionEncoder (src/diffusers/pipelines/ace_step/modeling_ace_step.py): Condition encoder that fuses text, lyric, and timbre embeddings into a unified cross-attention conditioning signal. Includes AceStepLyricEncoder and AceStepTimbreEncoder sub-modules.

  • AceStepPipeline (src/diffusers/pipelines/ace_step/pipeline_ace_step.py): The main pipeline supporting 6 task types:

    • text2music — generate music from text and lyrics
    • cover — generate from audio semantic codes or with timbre transfer via reference audio
    • repaint — regenerate a time region within existing audio
    • extract — extract a specific track (vocals, drums, etc.) from audio
    • lego — generate a specific track given audio context
    • complete — complete audio with additional tracks
  • Conversion script (scripts/convert_ace_step_to_diffusers.py): Converts original ACE-Step 1.5 checkpoint weights to Diffusers format.

Key Features

  • Multi-task support: 6 task types with automatic instruction routing via _get_task_instruction
  • Music metadata conditioning: Optional bpm, keyscale, timesignature parameters formatted into the SFT prompt template
  • Audio-to-audio tasks: Source audio (src_audio) and reference audio (reference_audio) inputs with VAE encoding
  • Tiled VAE encode/decode: Memory-efficient chunked encoding (_tiled_encode) and decoding (_tiled_decode) for long audio
  • Classifier-free guidance (CFG): Dual forward pass with configurable guidance_scale, cfg_interval_start, and cfg_interval_end (primarily for base/SFT models; turbo models have guidance distilled into weights)
  • Audio cover strength blending: Smooth interpolation between cover-conditioned and text-only-conditioned outputs via audio_cover_strength
  • Audio code parsing: _parse_audio_code_string extracts semantic codes from <|audio_code_N|> tokens for cover tasks
  • Chunk masking: _build_chunk_mask creates time-region masks for repaint/lego tasks
  • Anti-clipping normalization: Post-decode normalization to prevent audio clipping
  • Multi-language lyrics: 50+ languages including English, Chinese, Japanese, Korean, French, German, Spanish, etc.
  • Variable-length generation: Configurable duration from 10 seconds to 10+ minutes
  • Custom timestep schedules: Pre-defined shifted schedules for shift=1.0/2.0/3.0, or user-provided timesteps
  • Turbo model variant: Optimized for 8 inference steps with shift=3.0
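
The anti-clipping step listed above can be pictured as simple peak normalization. This is an illustrative sketch only — `anti_clip` is a hypothetical name, and the pipeline's actual post-decode normalization may differ:

```python
def anti_clip(samples, limit=0.99):
    """Scale a waveform down only when its peak would clip.

    Illustrative sketch of post-decode anti-clipping normalization;
    not the pipeline's actual implementation.
    """
    peak = max((abs(s) for s in samples), default=0.0)
    if peak <= limit:
        return list(samples)  # already within range, leave untouched
    return [s * limit / peak for s in samples]
```
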

Architecture

ACE-Step 1.5 comprises three main components:

  1. Oobleck autoencoder (VAE): Compresses 48kHz stereo waveforms into 25Hz latent representations
  2. Qwen3-Embedding-0.6B text encoder: Encodes text prompts and lyrics for conditioning
  3. Diffusion Transformer (DiT): Denoises audio latents using flow matching with an Euler ODE solver

Tests

  • Pipeline tests (tests/pipelines/ace_step/test_ace_step.py):
    • AceStepDiTModelTests — forward shape, return dict, gradient checkpointing
    • AceStepConditionEncoderTests — forward shape, save/load config
    • AceStepPipelineFastTests (extends PipelineTesterMixin) — 39 tests covering basic generation, batch processing, latent output, save/load, float16 inference, CPU/model offloading, encode_prompt, prepare_latents, timestep_schedule, format_prompt, and more
  • Model tests (tests/models/transformers/test_models_transformer_ace_step.py):
    • TestAceStepDiTModel (extends ModelTesterMixin) — forward pass, dtype inference, save/load, determinism
    • TestAceStepDiTModelMemory (extends MemoryTesterMixin) — layerwise casting, group offloading
    • TestAceStepDiTModelTraining (extends TrainingTesterMixin) — training, EMA, gradient checkpointing, mixed precision

All 70 tests pass (39 pipeline + 31 model).

Documentation

  • docs/source/en/api/pipelines/ace_step.md — Pipeline API documentation with usage examples
  • docs/source/en/api/models/ace_step_transformer.md — Transformer model documentation

Usage

import torch
import soundfile as sf
from diffusers import AceStepPipeline

pipe = AceStepPipeline.from_pretrained("ACE-Step/ACE-Step-v1-5-turbo", torch_dtype=torch.bfloat16)
pipe = pipe.to("cuda")

# Text-to-music generation
audio = pipe(
    prompt="A beautiful piano piece with soft melodies",
    lyrics="[verse]\nSoft notes in the morning light\n[chorus]\nMusic fills the air tonight",
    audio_duration=30.0,
    num_inference_steps=8,
    bpm=120,
    keyscale="C major",
).audios

sf.write("output.wav", audio[0, 0].cpu().numpy(), 48000)

Before submitting

Who can review?

References

@ChuxiJ ChuxiJ marked this pull request as draft February 7, 2026 11:38
@ChuxiJ ChuxiJ marked this pull request as ready for review February 7, 2026 14:29
@dg845 dg845 requested review from dg845 and yiyixuxu February 8, 2026 03:21

dg845 commented Feb 9, 2026

Hi @ChuxiJ, thanks for the PR! As a preliminary comment, I tried the test script given above but got an error, which I think is due to the fact that the ACE-Step/ACE-Step-v1-5-turbo repo doesn't currently exist on the HF hub.

If I convert the checkpoint locally from a local snapshot of ACE-Step/Ace-Step1.5 at /path/to/acestep-v15-repo using

python scripts/convert_ace_step_to_diffusers.py \
    --checkpoint_dir /path/to/acestep-v15-repo \
    --dit_config acestep-v15-turbo \
    --output_dir /path/to/acestep-v15-diffusers \
    --dtype bf16

and then test it using the following script:

import torch
import soundfile as sf
from diffusers import AceStepPipeline

OUTPUT_SAMPLE_RATE = 48000
model_id = "/path/to/acestep-v15-diffusers"
device = "cuda"
dtype = torch.bfloat16
seed = 42

pipe = AceStepPipeline.from_pretrained(model_id, torch_dtype=dtype)
pipe = pipe.to(device)

generator = torch.Generator(device=device).manual_seed(seed)

# Text-to-music generation
audio = pipe(
    prompt="A beautiful piano piece with soft melodies",
    lyrics="[verse]\nSoft notes in the morning light\n[chorus]\nMusic fills the air tonight",
    audio_duration=30.0,
    num_inference_steps=8,
    bpm=120,
    keyscale="C major",
    generator=generator,
).audios

sf.write("acestep_t2m.wav", audio[0, 0].cpu().numpy(), OUTPUT_SAMPLE_RATE)

I get the following sample:

acestep_t2m.wav

The sample quality is lower than expected, so there is probably a bug. Could you look into it?

Comment thread src/diffusers/models/transformers/ace_step_transformer.py Outdated
Comment thread src/diffusers/models/transformers/ace_step_transformer.py
Comment thread src/diffusers/pipelines/ace_step/modeling_ace_step.py Outdated
Comment thread src/diffusers/pipelines/ace_step/pipeline_ace_step.py Outdated

model_cpu_offload_seq = "text_encoder->condition_encoder->transformer->vae"

def __init__(
I think including the audio tokenizer and detokenizer (I believe AceStepAudioTokenizer and AudioTokenDetokenizer in the original code) as registered modules would make the pipeline easier to use and more self-contained. If I understand correctly, this would allow users to supply raw audio waveforms instead of audio_codes to __call__, and the user would not have to manually tokenize the audio to an audio code string first.

Comment thread src/diffusers/pipelines/ace_step/pipeline_ace_step.py
Comment thread src/diffusers/pipelines/ace_step/pipeline_ace_step.py Outdated
…eScheduler

Replace the hand-rolled flow-matching Euler loop with
`FlowMatchEulerDiscreteScheduler`. ACE-Step still computes its own shifted /
turbo sigma schedule via `_get_timestep_schedule`, but now passes it to
`scheduler.set_timesteps(sigmas=...)` and delegates the ODE step to
`scheduler.step()`. The scheduler is configured with `num_train_timesteps=1`
and `shift=1.0` so `scheduler.timesteps` stays in `[0, 1]` (the convention the
DiT was trained on) and the scheduler doesn't re-shift already-shifted sigmas.

The scheduler's appended terminal `sigma=0` reproduces the old loop's
final-step "project to x0" case exactly: `prev = x + (0 - t_curr) * v`.
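
Expressed as a scalar sketch (illustrative, not the pipeline or scheduler code), the Euler update and its terminal case are:

```python
def euler_step(x: float, v: float, sigma_curr: float, sigma_next: float) -> float:
    # Flow-matching Euler ODE step: x' = x + (sigma_next - sigma_curr) * v
    return x + (sigma_next - sigma_curr) * v

# With the appended terminal sigma = 0, the last step reduces to the old
# loop's "project to x0" case: prev = x + (0 - t_curr) * v
```
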

Parity on jieyue (seed=42, bf16 + flash-attn, turbo text2music, 8 steps):
  waveform Pearson = 0.999999
  spectral Pearson = 1.000000
  max |diff|       = 2.5e-3  (fp32 step-math vs previous bf16 step-math)

fp32 Euler-loop A/B against the hand-rolled path: max |diff| = 3.6e-7.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions github-actions Bot added size/L PR with diff > 200 LOC and removed size/L PR with diff > 200 LOC labels Apr 22, 2026
… kwargs

- Move the DiT transformer tests out of the pipeline test file into a new
  tests/models/transformers/test_models_transformer_ace_step.py that follows
  the standard BaseModelTesterConfig + ModelTesterMixin scaffold (matches
  test_models_transformer_longcat_audio_dit.py).
- Drop `max_position_embeddings` from the remaining AceStepDiTModel and
  AceStepConditionEncoder test fixtures — neither constructor accepts that
  argument anymore.
- Drop `use_sliding_window` from the same fixtures — also no longer a
  constructor argument (the actual `sliding_window` int kwarg is kept).
- Wire `FlowMatchEulerDiscreteScheduler(num_train_timesteps=1, shift=1.0)`
  into `get_dummy_components()` now that the pipeline requires it.

Resolves huggingface#13095 (comment),
r3115664850, r3115673059, r3115676580, r3115680700.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

dg845 commented Apr 23, 2026

@bot /style

@github-actions

Style fix is beginning... View the workflow run here.

Comment thread scripts/convert_ace_step_to_diffusers.py Outdated
Comment thread src/diffusers/pipelines/ace_step/pipeline_ace_step.py Outdated
Comment thread src/diffusers/pipelines/ace_step/modeling_ace_step.py

@dg845 dg845 left a comment


Thanks for the changes! Can you run make style and make quality to fix the code style?

Fixes 5 review threads + style:

1. Converter now builds `AceStepPipeline` in memory and calls
   `save_pretrained`. Previously the hand-written `model_index.json` was
   missing the `scheduler` entry — fresh converter output couldn't be loaded
   by `AceStepPipeline.from_pretrained` (r3127767785). This also makes the
   converter robust to future `__init__` signature changes.

2. `latent_length` uses `math.ceil(...)` instead of `int(...)` so non-integer
   products (e.g. `latents_per_second=2.0, audio_duration=0.4 → 0.8`) round up
   to `1` instead of truncating to `0` and crashing shape checks (r3127790939).

3. Add `_callback_tensor_inputs = ["latents"]` on `AceStepPipeline` so the
   standard diffusers callback tests pick up the right tensor (r3127795954).

4. `AceStepConditionEncoder.silence_latent` no longer hard-codes the channel
   dim to 64. The placeholder buffer now uses the `timbre_hidden_dim`
   constructor argument, so smaller test configs with `timbre_hidden_dim != 64`
   load without shape errors (r3127812932).

5. Revert `self.vae.enable_tiling()` from `AceStepPipeline.__init__`. Users can
   call `pipe.vae.enable_tiling()` themselves for long-form generation; that
   matches the opt-in convention used by the rest of diffusers (r3127777296).

6. `ruff check --fix` + `ruff format` over all ACE-Step sources (the style fix
   dg845 asked for via `@bot /style`).
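
The `math.ceil` fix in item 2 can be checked in isolation, using the example values from the commit message (standalone sketch):

```python
import math

latents_per_second = 2.0
audio_duration = 0.4
product = latents_per_second * audio_duration  # 0.8 (a hair above 0.8 in binary float)

assert int(product) == 0        # truncation would zero out latent_length and crash shape checks
assert math.ceil(product) == 1  # rounding up keeps at least one latent frame
```
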

Also: converter now accepts sharded `model.safetensors.index.json` layouts
alongside the single-file `model.safetensors`, so the 5B XL turbo variant
converts without a pre-processing step.

Parity on jieyue (seed=42, bf16 + flash-attn, turbo text2music 160s, fresh
converter output loaded via `from_pretrained`):
  waveform Pearson  = 0.999954
  spectral Pearson  = 0.999977
  max |a-b| bf16    = 4.3e-02  (dominated by the VAE tiling default flip)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@yiyixuxu yiyixuxu left a comment


thanks!
i left a few more comments

Comment thread scripts/audio_parity_jieyue.py Outdated
Comment thread scripts/dit_parity_test.py Outdated
Comment thread scripts/run_official_generate_music.py Outdated
Comment thread src/diffusers/models/autoencoders/autoencoder_oobleck.py Outdated
Comment thread src/diffusers/pipelines/ace_step/pipeline_ace_step.py Outdated
Comment thread src/diffusers/pipelines/ace_step/pipeline_ace_step.py
@yiyixuxu

@claude can you do a cross-check on my review for this PR. I'd like you to

  1. verify each of my comments by tracing the code. Agree/disagree with reasoning. If you agree, keep the reason short, but also check whether the same pattern exists elsewhere in the PR that I missed
  2. answer my questions
  3. do your own review to see if I missed anything


github-actions Bot commented Apr 23, 2026

Claude finished @yiyixuxu's task in 5m 27s — View job


Cross-checking @yiyixuxu's review + independent review

  • Read all review guidelines (AGENTS.md, models.md, modular.md, etc.)
  • Read all changed source files under src/diffusers/
  • Read reference implementations (Flux2 Klein, APG guider, other VAEs)
  • Verify each of yiyixuxu's review comments
  • Answer yiyixuxu's questions
  • Independent review for missed issues

Part 1: Verification of @yiyixuxu's comments

1. scripts/audio_parity_jieyue.py, scripts/dit_parity_test.py, scripts/run_official_generate_music.py — remove before merge

Agree. These are developer-facing parity harnesses. They reference local paths, internal dev boxes, and the original ACE-Step repo — not useful for downstream users and shouldn't ship.


2. autoencoder_oobleck.py:372 — rename _encode_one → _encode

Agree. Every other VAE in the codebase (AutoencoderKL, AutoencoderDC, AutoencoderKLAllegro, AutoencoderKLHunyuanVideo, etc.) uses _encode as the internal encode method. The name _encode_one deviates from this convention. The suggested renames at lines 394 and 397 (calling self._encode(...) instead of self._encode_one(...)) follow directly.


3. ace_step_transformer.py:483 — remove unused attention_mask / encoder_attention_mask args

Agree. Traced the code: these parameters are accepted in the signature but never read. The model rebuilds all masks from scratch inside forward (lines 538-550). The docstring even says "Ignored; the model rebuilds…". Unused args with "kept for API stability" justification don't apply here since this is a new model.


4. ace_step_transformer.py:538 — full_attn_mask and encoder_4d_mask are all-zeros; can we pass None?

Agree. I traced _create_4d_mask with is_causal=False, is_sliding_window=False, attention_mask=None:

  • valid_mask starts as all-True
  • No filtering branches are entered
  • mask_tensor.masked_fill_(valid_mask, 0.0) → entire tensor is 0.0

So full_attn_mask and encoder_4d_mask are no-op all-zeros tensors. dispatch_attention_fn handles attn_mask=None correctly (it's the default in all backends). Passing None saves a materialized [B, 1, S, S] tensor per forward pass — meaningful for long audio. Only sliding_attn_mask carries real information and should stay.
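
The traced behavior can be mimicked with plain lists (illustrative sketch of the masking logic only, not the actual `_create_4d_mask` code):

```python
# All positions valid -> the additive attention mask is filled with 0.0
# everywhere, i.e. adding it to attention scores is a no-op.
rows, cols = 3, 3
valid_mask = [[True] * cols for _ in range(rows)]  # no filtering branch fires
neg_inf = float("-inf")

# masked_fill_(valid_mask, 0.0): valid positions become 0.0, the rest
# would stay -inf -- but here every position is valid.
mask = [[0.0 if valid_mask[i][j] else neg_inf for j in range(cols)]
        for i in range(rows)]
assert all(v == 0.0 for row in mask for v in row)
```
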


5. ace_step_transformer.py:595 — remove backward-compat aliases

Agree. AceStepDiTModel = AceStepTransformer1DModel and AceStepDiTLayer = AceStepTransformerBlock are being introduced in the same PR — there's no existing user code depending on them. Per AGENTS.md: "Avoid backwards-compatibility hacks."

Missed spot: tests/pipelines/ace_step/test_ace_step.py:23 imports AceStepDiTModel — this needs updating too if the aliases are removed.


6. modeling_ace_step.py:125 — remove _run_encoder_layers helper, inline in forward

Agree. Per models.md: "All layer calls should be visible directly in forward — avoid helper functions that hide nn.Module calls." The function hides the per-layer loop + gradient-checkpointing dispatch. Even though it's used by both AceStepLyricEncoder.forward() and AceStepTimbreEncoder.forward(), the guideline prioritizes readability over DRY for layer calls.


7. modeling_ace_step.py:140 — remove "reviewer comment on PR #13095" comments

Agree. These are meta-comments about the PR process. I found similar comments at:

  • modeling_ace_step.py:73: "reviewer comment on PR #13095"
  • modeling_ace_step.py:139-141: "reviewer comment on PR #13095"
  • pipeline_ace_step.py:291-293: "reviewer ask on PR #13095"
  • pipeline_ace_step.py:313: "Reviewer comment on PR #13095"
  • pipeline_ace_step.py:716-717: "reviewer ask on PR #13095"

All should be removed.


8. pipeline_ace_step.py:96 — can we use existing APG guider?

Answer: Not as a drop-in replacement. I compared _apg_forward with the existing normalized_guidance in src/diffusers/guiders/adaptive_projected_guidance.py. There are three significant differences:

| Aspect | PR's `_apg_forward` | Existing guider's `normalized_guidance` |
| --- | --- | --- |
| Formula | `pred_cond + (guidance_scale - 1) * update` | `pred_uncond + guidance_scale * update` |
| Default `eta` | 0.0 (orthogonal only) | 1.0 (both parallel + orthogonal) |
| Default `norm_threshold` | 2.5 (active) | 0.0 (disabled) |
| Momentum | -0.75 (negative, intentional for ACE-Step) | user-provided |
| `dims` | explicit `(1,)` (time dim only) | all spatial dims `[-i for i in range(1, ndim)]` |

The formula difference is mathematically significant: at guidance_scale=7.0, the PR computes pred_cond + 6 * update while the guider computes pred_uncond + 7 * update. These give different results.

Recommendation: Agree this is a follow-up item. A custom ACE-Step guider (inheriting BaseGuidance) with the correct formula, defaults, and MPS handling would be the clean solution.
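
A toy 2-vector example makes the divergence concrete (illustrative sketch with ad hoc helper names, not the pipeline code): with the raw diff the two formulas are algebraically equal, but once `update` is a projected diff they differ by the discarded parallel component.

```python
def sub(a, b): return [x - y for x, y in zip(a, b)]
def dot(a, b): return sum(x * y for x, y in zip(a, b))
def axpy(s, x, y): return [s * xi + yi for xi, yi in zip(x, y)]

pred_cond, pred_uncond = [1.0, 0.0], [0.0, 1.0]
diff = sub(pred_cond, pred_uncond)

# eta = 0 case: keep only the component of diff orthogonal to pred_cond
proj = dot(diff, pred_cond) / dot(pred_cond, pred_cond)
update = sub(diff, [proj * c for c in pred_cond])

s = 7.0
pr_style = axpy(s - 1.0, update, pred_cond)      # pred_cond + (s-1)*update
guider_style = axpy(s, update, pred_uncond)      # pred_uncond + s*update
# pr_style and guider_style disagree in the pred_cond direction
```
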


9. pipeline_ace_step.py:269,275,280 — move is_turbo, sample_rate, latents_per_second to __init__

Agree. The Flux2 Klein pipeline sets derived values in __init__ (e.g., self.vae_scale_factor = ...). Computing these once is cleaner than recomputing on every property access. The fallback pattern for when VAE isn't loaded:

self.sample_rate = int(self.vae.config.sampling_rate) if getattr(self, "vae", None) else 48000

10. pipeline_ace_step.py:358 — follow Flux2 approach for distilled/turbo guidance

Agree. The Flux2 Klein pattern is:

  1. logger.warning(f"Guidance scale {guidance_scale} is ignored for distilled models.") when guidance_scale > 1.0 on distilled
  2. do_classifier_free_guidance = self._guidance_scale > 1 and not self.config.is_distilled

This is cleaner than _variant_defaults(). However, _variant_defaults also handles num_inference_steps and shift defaults. Those still need a home — probably just inline defaults in __call__.
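
A minimal sketch of that pattern (hypothetical class and attribute names — not the actual Flux2 or ACE-Step code):

```python
import logging

logger = logging.getLogger(__name__)

class DistilledGuidanceMixin:
    """Sketch of Flux2-style handling of guidance on distilled checkpoints."""

    def __init__(self, is_distilled: bool):
        self._is_distilled = is_distilled
        self._guidance_scale = 1.0

    def set_guidance(self, guidance_scale: float) -> None:
        # Warn instead of silently over-guiding a guidance-distilled model.
        if self._is_distilled and guidance_scale > 1.0:
            logger.warning(f"Guidance scale {guidance_scale} is ignored for distilled models.")
        self._guidance_scale = guidance_scale

    @property
    def do_classifier_free_guidance(self) -> bool:
        # CFG only runs when the scale is active AND the model is not distilled.
        return self._guidance_scale > 1.0 and not self._is_distilled
```
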


11. pipeline_ace_step.py:66 — remove SAMPLE_RATE = 48000, use self.sample_rate

Partially agree. The constant is used in 3 places:

  1. _normalize_audio_to_stereo_48k() — this is a @staticmethod, can't access self.sample_rate
  2. _prepare_reference_audio_latents() — instance method, can use self.sample_rate
  3. __call__ line 1136 — instance method, can use self.sample_rate

For (2) and (3), switch to self.sample_rate. For (1), either make it a regular method, or pass sample_rate as a parameter. Keeping a module-level constant just for one static method seems wrong.


12. pipeline_ace_step.py:669 — is _get_timestep_schedule different from scheduler set_timesteps?

Answer: No, it's redundant for the non-turbo path. The FlowMatchEulerDiscreteScheduler.set_timesteps accepts sigmas directly and supports a shift config parameter. The non-turbo formula:

t = linspace(1.0, 0.0, N+1)[:-1]
t = shift * t / (1 + (shift - 1) * t)

is the standard flow-matching shift transform that the scheduler already implements. Since the pipeline already calls scheduler.set_timesteps(sigmas=t_schedule.tolist()), the computation is happening twice.

For the turbo path, the lookup tables are also redundant (see next comment).


13. pipeline_ace_step.py:70 — is there something special with pre-defined timesteps?

Answer: No. I verified all three shift values by applying the shift formula to linspace(1.0, 0.0, 9)[:-1]:

  • shift=1.0: linspace(1, 0, 9)[:-1] = [1.0, 0.875, 0.75, ...] ← matches SHIFT_TIMESTEPS[1.0] exactly
  • shift=2.0: applying 2t/(1+t) to linspace = [1.0, 0.9333, 0.8571, ...] ← matches SHIFT_TIMESTEPS[2.0] exactly
  • shift=3.0: applying 3t/(1+2t) to linspace = [1.0, 0.9545, 0.9, ...] ← matches SHIFT_TIMESTEPS[3.0] exactly

The lookup tables are just cached results of the formula. SHIFT_TIMESTEPS and VALID_SHIFTS constants can be removed entirely — the base/SFT code path (lines 669-672) already handles all cases correctly.
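
The verification reproduces in a few lines (standalone sketch; `shifted_schedule` is an illustrative name, not the pipeline's `_get_timestep_schedule`):

```python
def shifted_schedule(num_steps: int, shift: float) -> list[float]:
    # t = linspace(1.0, 0.0, N+1)[:-1], then the flow-match shift transform
    # shift * t / (1 + (shift - 1) * t)
    ts = [1.0 - i / num_steps for i in range(num_steps)]
    return [shift * t / (1.0 + (shift - 1.0) * t) for t in ts]

# shifted_schedule(8, 1.0) starts [1.0, 0.875, 0.75, ...]
# shifted_schedule(8, 3.0) starts [1.0, 0.9545..., 0.9, ...]
```
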


14. pipeline_ace_step.py:579 — remove torch.no_grad() inside encode_prompt

Agree. The pipeline __call__ is decorated with @torch.no_grad(), making inner no_grad contexts redundant. The Flux2 pipeline's encode_prompt does not use torch.no_grad() either. This is the standard diffusers convention.


15. pipeline_ace_step.py:835 — remove warning about silence_latent

Agree. The warning at lines 833-843 (_silence_latent_tiled) is a developer safety net for incomplete conversions. Since the converter now correctly includes silence_latent, this shouldn't trigger in practice. Users won't be running the converter manually.

Same pattern exists in __call__ at lines 1206-1228 where there's a similar fallback/warning for missing silence_latent. Both should be cleaned up.


Part 2: Issues I found that the review missed

A. torch.float64 usage in _apg_forward (pipeline_ace_step.py:147-148)

if device_type == "mps":
    diff_f, cond_f = diff.cpu().double(), pred_cond.cpu().double()
else:
    diff_f, cond_f = diff.double(), pred_cond.double()

Per models.md: "MPS and several NPU backends don't support float64." The MPS case is handled, but NPU is not. Should use the device-gated pattern:

is_mps = diff.device.type == "mps"
is_npu = diff.device.type == "npu"
if is_mps or is_npu:
    diff_f, cond_f = diff.cpu().double(), pred_cond.cpu().double()
else:
    diff_f, cond_f = diff.double(), pred_cond.double()

Or better: use float32 here (the precision difference in APG normalization is negligible for audio inference).

B. do_cfg local variable vs do_classifier_free_guidance property (pipeline_ace_step.py:1305)

Line 1305: do_cfg = guidance_scale > 1.0 — this duplicates the logic of the do_classifier_free_guidance property (line 289-295). The local variable should use the property:

do_cfg = self.do_classifier_free_guidance

This also becomes important when the Flux2 distilled-model pattern is adopted (the property would check not self.is_turbo).

C. Missing _no_split_modules on AceStepTransformer1DModel

Per models.md gotcha #4: "Wrong _no_split_modules. These class attributes control device placement." The transformer model doesn't define _no_split_modules, which matters for device_map="auto" and accelerate integration. Most transformer models define it as ["AceStepTransformerBlock"].

D. No tests for audio-to-audio tasks

The AceStepPipelineFastTests only covers text2music. There are no tests for cover, repaint, extract, lego, or complete tasks. At minimum, a test_ace_step_repaint (which exercises src_audio, chunk_mask, and the repaint window logic) and a test_ace_step_cover (which exercises reference audio + timbre encoding) would catch regressions in those code paths. dg845 already flagged this in their general review comment.

E. _normalize_audio_to_stereo_48k fallback uses F.interpolate (pipeline_ace_step.py:700)

audio = F.interpolate(audio.unsqueeze(0), size=target_len, mode="linear", align_corners=False)[0]

The F.interpolate with mode="linear" expects 3D input (B, C, L) but the code does audio.unsqueeze(0) on a 2D (C, L) tensor giving (1, C, L). This is correct for the API, but mode="linear" for audio resampling is a very poor approximation (no anti-aliasing filter). The comment says "Simple linear resampling fallback" which is fair, but consider adding a note in the docstring that torchaudio is recommended for quality.

F. latent_length calculation discrepancy between prepare_latents and __call__

prepare_latents (line 613): math.ceil(audio_duration * self.latents_per_second)
__call__ (line 1158): math.ceil(audio_duration * self.latents_per_second)

Both use ceil now (looks like dg845's comment was addressed). However, the test_prepare_latents test at line 396 asserts:

self.assertEqual(latents.shape, (2, 25, 8))  # 25 Hz latent rate, 1s duration -> 25 frames

But with the tiny test VAE (sampling_rate=4, downsampling_ratios=[1, 2]), latents_per_second = 4 / (1*2) = 2.0, so 1.0s × 2.0 = 2 frames, not 25. This test appears to be using hardcoded expectations from the real model config rather than the test config. This test would fail or is currently not being run.
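
The mismatch follows from the test config's own numbers (sketch; config values taken from the paragraph above):

```python
import math

sampling_rate = 4                 # tiny test VAE config
downsampling_ratios = [1, 2]

total_downsample = 1
for r in downsampling_ratios:
    total_downsample *= r

latents_per_second = sampling_rate / total_downsample  # 4 / 2 = 2.0
frames = math.ceil(1.0 * latents_per_second)           # 1s of audio -> 2 frames, not 25
```
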

G. _create_4d_mask materializes full [B, 1, S, S] attention masks

For long audio (e.g., 5 minutes at 25 Hz = 7500 latent frames, /patch_size=2 = 3750), _create_4d_mask materializes a [1, 1, 3750, 3750] tensor. In float32 that's ~56 MB, in bfloat16 ~28 MB. The sliding-window mask is sparse by nature. Since full_attn_mask and encoder_4d_mask are being replaced with None per comment 4, only the sliding mask remains, but it's still a dense materialization. This is not a blocker but worth noting for follow-up optimization (e.g., using xformers block-diagonal or FlashAttention's sliding window natively).
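
The ~56 MB estimate follows from the shapes alone (sketch):

```python
frames = 5 * 60 * 25    # 5 minutes of audio at the 25 Hz latent rate = 7500 frames
seq_len = frames // 2   # patch_size = 2 -> 3750 attention positions

fp32_bytes = seq_len * seq_len * 4  # one dense [1, 1, S, S] mask in float32
bf16_bytes = seq_len * seq_len * 2  # same mask in bfloat16
# fp32: ~56.25 MB, bf16: ~28.1 MB
```
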


Summary

| # | Comment | Verdict | Notes |
| --- | --- | --- | --- |
| 1-3 | Remove parity scripts | ✅ Agree | |
| 4-6 | Rename `_encode_one` → `_encode` | ✅ Agree | Matches all other VAEs |
| 7 | Remove unused forward args | ✅ Agree | |
| 8 | All-zeros masks → None | ✅ Agree | Saves memory |
| 9 | Remove compat aliases | ✅ Agree | Also update test import |
| 10 | Inline `_run_encoder_layers` | ✅ Agree | models.md rule |
| 11 | Remove PR-process comments | ✅ Agree | Found 5 more instances |
| 12 | Use existing APG guider? | ❌ Can't drop-in | Formula differs; follow-up |
| 13 | Move props to `__init__` | ✅ Agree | Flux2 pattern |
| 14 | Flux2 distilled pattern | ✅ Agree | Warn + ignore for turbo |
| 15 | Remove `SAMPLE_RATE` | ⚠️ Partial | Static method needs refactor |
| 16 | `_get_timestep_schedule` vs scheduler | ✅ Redundant | Scheduler can do this |
| 17 | Pre-defined timesteps special? | ❌ No | Just cached formula results |
| 18 | Remove inner `no_grad` | ✅ Agree | Standard convention |
| 19 | Remove `silence_latent` warning | ✅ Agree | Also at line 1206 |
| New | float64 in APG (NPU) | 🔴 Bug | Missing NPU guard |
| New | `do_cfg` vs property | 🟡 Cleanup | Use property consistently |
| New | Missing `_no_split_modules` | 🟡 Missing | Needed for device_map |
| New | No audio-to-audio tests | 🟡 Gap | At least repaint + cover |
| New | `test_prepare_latents` wrong expectations | 🔴 Bug | Uses real-model values with tiny test config |
| New | Dense sliding-window mask | 🟡 Perf | Follow-up optimization |

Code-level (22 threads):

1. Delete 3 dev/parity scripts (`scripts/audio_parity_jieyue.py`,
   `scripts/dit_parity_test.py`, `scripts/run_official_generate_music.py`)
   that shouldn't have been committed.
2. Rename `AutoencoderOobleck._encode_one` → `_encode` to match the convention
   used by other diffusers VAEs.
3. Delete the hard-coded `SHIFT_TIMESTEPS` / `VALID_SHIFTS` table in
   `pipeline_ace_step.py`: the per-shift turbo schedules are recovered
   exactly by `linspace(1, 0, N+1)[:-1]` plus the flow-match shift formula
   that the non-turbo branch already uses, so a single code path covers both.
4. Drop the backwards-compat `AceStepDiTModel` / `AceStepDiTLayer` aliases
   and every reference (top-level `__init__`, `models/__init__`,
   `transformers/__init__`, dummy objects, tests, docs toctree, model card).
   `AceStepTransformer1DModel` is the only exported name now.
5. Remove the unused `attention_mask` / `encoder_attention_mask` args from
   `AceStepTransformer1DModel.forward`; the model rebuilds its masks from
   the sequence shape and never consumed them.
6. In the DiT forward and both encoders, pass `None` instead of an all-zero
   `full_attn_mask` / `encoder_4d_mask` to non-sliding attention layers — SDPA
   dispatches to a faster kernel when the mask is None.
7. Inline the shared `_run_encoder_layers` helper directly into
   `AceStepLyricEncoder.forward` / `AceStepTimbreEncoder.forward` so layer
   calls are visible at the forward boundary (diffusers style).
8. Move `is_turbo` / `sample_rate` / `latents_per_second` from `@property`s
   that re-read module configs each call to cached attributes populated in
   `__init__` (Flux2-style), with a default-ACE-Step fallback when
   `self.vae` is offloaded. Drop the now-unused `SAMPLE_RATE = 48000`
   module-level constant and the three property definitions.
9. Warn + coerce `guidance_scale` to 1.0 on turbo (guidance-distilled)
   checkpoints, following `pipeline_flux2_klein`. Prevents over-guided
   audio when users forward their base/sft CFG settings to a turbo pipe.
10. Remove the `logger.warning(...)` paths that triggered on
    `silence_latent` missing/zero — those only fired for author-side
    unconverted checkpoints and tests; end users always load converted
    weights where the buffer is baked in.
11. Drop the redundant `with torch.no_grad():` wrappers inside
    `encode_prompt` — the pipeline's `__call__` runs under `torch.no_grad`
    already.
12. Strip "reviewer comment on PR huggingface#13095" attribution comments from three
    docstrings (here and everywhere).

Parity on jieyue (seed=42, bf16 + flash-attn, XL turbo 160s text2music):
  waveform Pearson = 0.9747
  spectral Pearson = 0.9895

The shift comes from full-attention layers switching `attn_mask=0_tensor` →
`attn_mask=None`, which dispatches to a different SDPA kernel on bf16. The
two outputs are algebraically equivalent for fp32 eager; on bf16+FA the
delta is dominated by kernel-level ULPs, well within the sampler-noise
band (ear-check on the 160s example confirms no audible regression).

Still open — AudioTokenizer/Detokenizer (deferred) + APG guider follow-up
(dims differ from `diffusers.guiders.adaptive_projected_guidance`, not a
drop-in; worth a separate PR).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions github-actions Bot added guiders size/L PR with diff > 200 LOC and removed size/L PR with diff > 200 LOC labels Apr 24, 2026

Labels

documentation, guiders, models, pipelines, size/L (PR with diff > 200 LOC), tests, utils

Projects

Status: In Progress

Development

Successfully merging this pull request may close these issues.

3 participants