Question: Applicability of TransMLA to DLM #37

Hi Authors,

Thank you for the amazing work and for open-sourcing this project!

I had a quick question: do you think **TransMLA** could be applied to *Diffusion Language Models* (DLMs)?
I’m currently exploring ways to reduce **KV cache memory** during inference in diffusion-based language or vision–language models.

Since TransMLA provides a theoretical and practical framework for converting GQA-based architectures into MLA with compressed KV caches, I was wondering whether a similar idea could be used for the iterative denoising steps in diffusion models.

Would love to hear your thoughts on whether TransMLA’s low-rank latent compression or RoRoPE decoupling could extend to diffusion-style attention or cross-attention blocks.
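
To make the memory motivation concrete, here is a rough back-of-the-envelope sketch of the per-sequence cache size for a GQA cache versus an MLA-style latent cache. All of the numbers (layer count, head sizes, `latent_dim`, `rope_dim`) are hypothetical values I picked for illustration, not taken from TransMLA or any specific model, and `kv_cache_bytes` is just a helper name I made up.

```python
def kv_cache_bytes(num_layers, seq_len, per_token_width, bytes_per_elem=2):
    """Cache size in bytes for one sequence, given the number of cached
    elements per token per layer (bytes_per_elem=2 assumes fp16/bf16)."""
    return num_layers * seq_len * per_token_width * bytes_per_elem


# Hypothetical model configuration (illustrative assumptions only).
num_layers = 32
seq_len = 4096
num_kv_heads = 8      # GQA key/value heads
head_dim = 128
latent_dim = 512      # assumed width of the compressed KV latent
rope_dim = 64         # assumed width of the decoupled RoPE key

# GQA caches full keys and values for every KV head.
gqa_width = 2 * num_kv_heads * head_dim

# An MLA-style cache stores one shared latent plus a small RoPE key per token.
mla_width = latent_dim + rope_dim

gqa_bytes = kv_cache_bytes(num_layers, seq_len, gqa_width)
mla_bytes = kv_cache_bytes(num_layers, seq_len, mla_width)

print(f"GQA cache: {gqa_bytes / 2**20:.0f} MiB")
print(f"MLA cache: {mla_bytes / 2**20:.0f} MiB")
print(f"Reduction: {gqa_bytes / mla_bytes:.2f}x")
```

With these assumed sizes the cache shrinks from 512 MiB to 144 MiB per sequence. Whether a diffusion model can actually reuse such a cache across denoising steps is exactly the part I'm unsure about.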

Thanks again for this great contribution!


