Skip to content

fix(kimi): clamp completion budget dynamically#2332

Open
wbxl2000 wants to merge 6 commits into
mainfrom
qer/kimi-dynamic-completion-budget
Open

fix(kimi): clamp completion budget dynamically#2332
wbxl2000 wants to merge 6 commits into
mainfrom
qer/kimi-dynamic-completion-budget

Conversation

@wbxl2000
Copy link
Copy Markdown
Collaborator

@wbxl2000 wbxl2000 commented May 20, 2026

Summary

This PR changes the Kimi provider path to stop hardcoding max_tokens = 32000 and instead compute a per-request max_completion_tokens that fits the current context window.

Changes:

  • remove the provider-level default max_tokens = 32000
  • prefer Kimi's max_completion_tokens request parameter
  • keep max_tokens as a compatibility alias and normalize it to max_completion_tokens
  • read KIMI_MODEL_MAX_COMPLETION_TOKENS, with KIMI_MODEL_MAX_TOKENS as a fallback alias
  • compute a dynamic completion cap before each Kimi model step and compaction request
  • clamp explicit and default caps against the remaining model context window
  • document the new env var and compatibility behavior
  • add focused tests for provider request shape, env handling, and KimiSoul dynamic budgeting

Why

The old fixed max_tokens = 32000 is unsafe for two reasons:

  1. It can exceed the remaining context window when the input is already large.
  2. For reasoning models, the completion budget is shared by reasoning_content and final content, so a too-small remaining budget can produce a 200 response with only thinking text and no usable assistant content.

The new behavior treats reserved_context_size as the default desired response budget, but the final API parameter is always clamped by the current input size and model context length:

available = model_context_length - current_input_tokens - safety_margin
max_completion_tokens = min(configured_or_default_budget, available)

Real Kimi API probe

I tested the real API rather than relying only on docs.

Target:

  • Base URL: https://api.moonshot.cn/v1
  • Model: kimi-k2.6
  • Transport: raw HTTP requests to /v1/models and /v1/chat/completions
  • Auth: real API key, redacted

Model metadata

Request:

GET /v1/models
Authorization: Bearer <redacted>

Observed kimi-k2.6 entry:

{
  "id": "kimi-k2.6",
  "object": "model",
  "owned_by": "moonshot",
  "supports_image_in": true,
  "supports_video_in": true,
  "supports_reasoning": true,
  "context_length": 262144
}

Small max_completion_tokens can return thinking only

Request:

{
  "model": "kimi-k2.6",
  "messages": [{ "role": "user", "content": "Reply exactly: ok" }],
  "stream": false,
  "max_completion_tokens": 16
}

Observed response:

{
  "status": 200,
  "usage": {
    "prompt_tokens": 11,
    "completion_tokens": 16,
    "total_tokens": 27,
    "cached_tokens": 11,
    "prompt_tokens_details": { "cached_tokens": 11 }
  },
  "finish_reason": "length",
  "message_keys": ["content", "reasoning_content", "role"],
  "content_chars": 0,
  "reasoning_chars": 64,
  "content_prefix": "",
  "reasoning_prefix": "The user wants me to reply exactly with \"ok\". No punctuation, no"
}

This is the important failure mode: the API returns HTTP 200, but the agent gets no usable content.

A slightly larger cap leaves room for content

Request:

{
  "model": "kimi-k2.6",
  "messages": [{ "role": "user", "content": "Reply exactly: ok" }],
  "stream": false,
  "max_completion_tokens": 32
}

Observed response:

{
  "status": 200,
  "usage": {
    "prompt_tokens": 11,
    "completion_tokens": 29,
    "total_tokens": 40
  },
  "finish_reason": "stop",
  "content_chars": 2,
  "reasoning_chars": 114,
  "content_prefix": "ok"
}

Legacy max_tokens is still accepted

Request:

{
  "model": "kimi-k2.6",
  "messages": [{ "role": "user", "content": "Reply exactly: ok" }],
  "stream": false,
  "max_tokens": 16
}

Observed response:

{
  "status": 200,
  "usage": {
    "prompt_tokens": 11,
    "completion_tokens": 16,
    "total_tokens": 27
  },
  "finish_reason": "length",
  "content_chars": 0,
  "reasoning_chars": 64
}

If both caps are sent, max_completion_tokens wins

Request:

{
  "model": "kimi-k2.6",
  "messages": [{ "role": "user", "content": "Reply exactly: ok" }],
  "stream": false,
  "max_completion_tokens": 16,
  "max_tokens": 32
}

Observed response:

{
  "status": 200,
  "usage": {
    "prompt_tokens": 11,
    "completion_tokens": 16,
    "total_tokens": 27
  },
  "finish_reason": "length",
  "content_chars": 0,
  "reasoning_chars": 60
}

Huge caps are accepted on short input

Request:

{
  "model": "kimi-k2.6",
  "messages": [{ "role": "user", "content": "Reply exactly: ok" }],
  "stream": false,
  "max_completion_tokens": 1000000
}

Observed response:

{
  "status": 200,
  "usage": {
    "prompt_tokens": 11,
    "completion_tokens": 27,
    "total_tokens": 38
  },
  "finish_reason": "stop",
  "content_chars": 2,
  "reasoning_chars": 107
}

So the server accepts obviously unsafe cap values. Client-side clamping is still required.

Omitting the cap is not a controlled strategy

Request:

{
  "model": "kimi-k2.6",
  "messages": [
    {
      "role": "user",
      "content": "请连续输出编号列表,从 1 开始,每行一个编号和短句,尽量一直输出直到系统停止。不要解释。"
    }
  ],
  "stream": true,
  "stream_options": { "include_usage": true }
}

Observed result:

{
  "status": 200,
  "elapsed_ms": 811785,
  "stream_chunks": 21952,
  "finish_reason": "stop",
  "usage": {
    "prompt_tokens": 37,
    "completion_tokens": 21986,
    "total_tokens": 22023,
    "cached_tokens": 37,
    "prompt_tokens_details": { "cached_tokens": 37 }
  },
  "content_chars": 15598,
  "reasoning_chars": 21520
}

This means omitting the cap can run for a long time and spend a large completion budget. It is not a safe replacement for an explicit dynamic cap.

kimi-k2.6 context-window behavior

Input over context limit:

{
  "status": 400,
  "error": {
    "message": "Invalid request: Your request exceeded model token limit: 262144 (requested: 270018)",
    "type": "invalid_request_error"
  }
}

Near context limit with a tiny cap:

{
  "status": 200,
  "usage": {
    "prompt_tokens": 260016,
    "completion_tokens": 16,
    "total_tokens": 260032,
    "cached_tokens": 249856,
    "prompt_tokens_details": { "cached_tokens": 249856 }
  },
  "finish_reason": "length",
  "content_chars": 0,
  "reasoning_chars": 60
}

Near context limit with a large cap can also fail operationally:

{
  "status": 429,
  "error": {
    "message": "The engine is currently overloaded, please try again later",
    "type": "engine_overloaded_error"
  }
}

Test plan

uv run pytest packages/kosong/tests/api_snapshot_tests/test_kimi.py tests/core/test_create_llm.py tests/core/test_kimisoul_completion_budget.py tests/core/test_simple_compaction.py
uv run ruff check packages/kosong/src/kosong/chat_provider/kimi.py src/kimi_cli/llm.py src/kimi_cli/soul/kimisoul.py src/kimi_cli/soul/compaction.py packages/kosong/tests/api_snapshot_tests/test_kimi.py tests/core/test_create_llm.py tests/core/test_kimisoul_completion_budget.py
uv run pyright src/kimi_cli/llm.py src/kimi_cli/soul/kimisoul.py src/kimi_cli/soul/compaction.py

Results:

  • 55 passed, 2 warnings
  • ruff: All checks passed!
  • pyright: 0 errors, 0 warnings, 0 informations

Open in Devin Review

Copilot AI review requested due to automatic review settings May 20, 2026 09:23
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 34b527ccc1

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines 1006 to 1008
effective_history = normalize_history(self._context.history)
chat_provider = self._with_dynamic_completion_budget(chat_provider)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Persist recovered provider instead of per-step clone

Applying _with_dynamic_completion_budget here creates a temporary Kimi copy for the request, but retry recovery (on_retryable_error) runs against that copy and may replace/close its client; the runtime’s canonical provider (self._runtime.llm.chat_provider) is never updated. After any transient connection error (or 401 refresh path), the next turn can continue with a stale/closed client and fail repeatedly even though recovery succeeded once. This regression is introduced by cloning before _run_with_connection_recovery and affects normal step execution (and similarly the compaction path).

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the Kimi integration to stop hardcoding a fixed completion cap (max_tokens=32000) and instead prefer max_completion_tokens, with per-request clamping based on the remaining context window. It also adds env var compatibility/precedence rules and new tests/docs around the new behavior.

Changes:

  • Add compute_max_completion_tokens() and use it from KimiSoul to clamp completion budgets per step/compaction.
  • Update the kosong Kimi provider to normalize legacy max_tokens to max_completion_tokens and remove the provider-level default cap.
  • Update CLI env var handling + docs, and add focused tests for env precedence and dynamic budgeting.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
src/kimi_cli/llm.py Adds completion-budget clamping helper and env var parsing for Kimi max completion tokens.
src/kimi_cli/soul/kimisoul.py Injects dynamic completion budget into each Kimi step and compaction call.
src/kimi_cli/soul/compaction.py Updates compaction comment to reflect KimiSoul-managed completion caps.
packages/kosong/src/kosong/chat_provider/kimi.py Removes hardcoded default max_tokens, adds max_completion_tokens, and normalizes legacy max_tokens.
packages/kosong/tests/api_snapshot_tests/test_kimi.py Updates snapshots/tests to assert max_completion_tokens and default omission of caps.
tests/core/test_create_llm.py Adds env precedence test + unit tests for the clamping helper; updates expected Kimi provider request shape.
tests/core/test_kimisoul_completion_budget.py Adds tests covering KimiSoul dynamic completion budgeting behavior.
docs/en/configuration/env-vars.md Documents KIMI_MODEL_MAX_COMPLETION_TOKENS and KIMI_MODEL_MAX_TOKENS alias behavior.
docs/zh/configuration/env-vars.md Same as English docs, in Chinese.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/kimi_cli/llm.py
Comment on lines +67 to +68
safe_remaining = remaining - max(0, safety_margin)
available = safe_remaining if safe_remaining > 0 else remaining
Comment on lines +165 to 166
generation_kwargs: dict[str, Any] = dict(self._generation_kwargs)

Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 1 potential issue.

View 5 additional findings in Devin Review.

Open in Devin Review

Comment thread src/kimi_cli/soul/kimisoul.py Outdated

# Normalize: merge adjacent user messages for clean API input
effective_history = normalize_history(self._context.history)
chat_provider = self._with_dynamic_completion_budget(chat_provider)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Connection recovery on shallow-copied provider closes the original provider's shared HTTP client

_with_dynamic_completion_budget creates a shallow copy of the Kimi provider via with_generation_kwargscopy.copy(self) (kimi.py:220). The copy's .client attribute is the same AsyncOpenAI instance as the original (self._runtime.llm.chat_provider). When a connection/timeout error triggers _run_with_connection_recovery (kimisoul.py:1395), on_retryable_error is called on the copy, which calls close_replaced_openai_client(old_client, ...) (kimi.py:191) — closing the shared client. This permanently corrupts the original provider's client.

After the first connection recovery, every subsequent step creates a new budget copy from the corrupted original (with the closed client), fails on its first attempt, recovers by creating yet another new client, and succeeds on retry — burning one recovery attempt per step for the rest of the session. The same issue applies to the compaction path at kimisoul.py:1196.

Prompt for agents
The root cause is that _with_dynamic_completion_budget (kimisoul.py:1128-1147) returns a shallow copy of the Kimi provider via with_generation_kwargs, which shares the same AsyncOpenAI client instance with the original provider stored in self._runtime.llm.chat_provider. When _run_with_connection_recovery (kimisoul.py:1382-1423) triggers on_retryable_error on this copy during a connection error, it closes the shared client via close_replaced_openai_client (kimi.py:191), corrupting the original provider.

The fix needs to ensure that connection recovery on a temporary budget copy does not close the original provider's client. Several approaches:

1. After calling _with_dynamic_completion_budget, propagate any client changes back to the original provider when on_retryable_error fires. This is complex.

2. Instead of creating a temporary copy per step, apply the dynamic budget to the original provider in-place (update self._runtime.llm.chat_provider). This preserves the old behavior where on_retryable_error modifies the original.

3. In _step and compact_context, after _run_with_connection_recovery succeeds following a recovery, check if the copy's client differs from the original's and update the original's client to match. This prevents the original from keeping a stale/closed client.

4. Make with_generation_kwargs (or a variant) do a proper deep copy of the client reference chain so that the copy is fully independent. However, deep-copying an AsyncOpenAI client is likely impractical.

Approach 2 or 3 seem most practical. The affected paths are _step (line 1007) and compact_context (lines 1192-1198).
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

@wbxl2000
Copy link
Copy Markdown
Collaborator Author

@codex

@chatgpt-codex-connector
Copy link
Copy Markdown

Codex Review: Didn't find any major issues. 👍

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d9e76f05cd

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

"max_tokens": 32000,
}
generation_kwargs.update(self._generation_kwargs)
generation_kwargs: dict[str, Any] = dict(self._generation_kwargs)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Reinstate a completion cap fallback for uncapped Kimi calls

Removing the provider’s default cap here means any Kimi call path that does not add its own budget now sends no max_completion_tokens/max_tokens. In this repo, /btw bypasses KimiSoul._step and calls kosong.step directly with soul._runtime.llm.chat_provider (src/kimi_cli/soul/btw.py, execute_side_question), so those requests are now unbounded unless users set an env var. That can cause very long generations and unexpectedly high latency/cost on side-question flows.

Useful? React with 👍 / 👎.

The previous approach in this PR derived a budget-clamped Kimi instance
via with_generation_kwargs() in KimiSoul._with_dynamic_completion_budget,
but that shallow copy shared self.client with runtime.llm.chat_provider.
A transient APIConnectionError would invoke on_retryable_error on the
copy, rebinding only the copy's client while the original (now closed)
client stayed attached to runtime.llm.chat_provider. Every subsequent
step then had to walk through a fresh connection-recovery cycle before
making any real progress.

Switch to a per-call generation_overrides mapping plumbed through
kosong.step / kosong.generate / ChatProvider.generate, so providers can
merge per-request kwargs without producing a parallel instance. The
single Kimi instance owned by Runtime keeps its live client and OAuth
token state, and on_retryable_error mutations are visible on the next
step. The Compaction protocol gains the same parameter so callers no
longer need to mutate the chat provider before invoking compact().

- Extend ChatProvider.generate Protocol with generation_overrides
- Forward through kosong.generate and kosong.step
- Kimi.generate normalizes max_tokens -> max_completion_tokens in overrides
- Other providers (Anthropic, OpenAILegacy, OpenAIResponses, GoogleGenAI,
  Chaos, Mock, Echo, ScriptedEcho) accept and merge the parameter
- KimiSoul replaces _with_dynamic_completion_budget with
  _compute_completion_overrides returning a dict
- compact_context no longer constructs a replacement LLM
- SimpleCompaction.compact threads generation_overrides to kosong.step
- Add regression test ensuring runtime.llm.chat_provider is not replaced
- Add kosong-level tests for per-call override merge/normalization
@wbxl2000
Copy link
Copy Markdown
Collaborator Author

@codex

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b0e4851c89

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +167 to +168
generation_kwargs: dict[str, Any] = dict(self._generation_kwargs)
if generation_overrides:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Reinstate a default cap when no per-call override exists

This change removes the provider-level max_tokens default and now sends no completion limit unless callers explicitly pass generation_overrides, but several call paths still invoke kosong.step without overrides (for example src/kimi_cli/soul/btw.py and packages/kosong/src/kosong/__main__.py). In those contexts, Kimi requests become uncapped and can run much longer and consume far more tokens than before, which is a regression in latency/cost behavior introduced by this commit; either keep a safe provider default or add clamping in every Kimi call path that currently omits overrides.

Useful? React with 👍 / 👎.

After removing the provider-level ``max_tokens=32000`` default, two call
paths still invoked ``kosong.step`` without ``generation_overrides`` and
therefore sent Kimi requests with no completion cap at all:

- ``KimiSoul`` /btw side-question loop (``execute_side_question`` calls
  ``kosong.step`` directly with ``soul._runtime.llm.chat_provider``),
- the kosong package's developer demo (``kosong.__main__``).

Both are flagged by codex as a P1 regression in latency/cost behavior.

Reuse ``KimiSoul._compute_completion_overrides`` in the /btw path so it
shares the same context-aware budget as the main step loop. Give the
kosong demo a conservative ``max_completion_tokens=8192`` default so
``python -m kosong kimi`` does not run unbounded generations. Add a
regression test asserting /btw forwards the overrides to ``kosong.step``.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4a0f9ec2ee

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/kimi_cli/soul/btw.py
# bare ``kosong.step`` call below would send a request with no completion cap on
# Kimi providers (the provider-level default was removed), so a side-question
# could run for hundreds of seconds and burn a large completion budget.
generation_overrides = soul._compute_completion_overrides(chat_provider) # pyright: ignore[reportPrivateUsage]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Recompute /btw completion cap from actual turn history

execute_side_question computes generation_overrides once before building the /btw request history, and then reuses that same cap for both turns. Because _compute_completion_overrides uses context.token_count_with_pending, this budget ignores the injected side-question wrapper message and any assistant/tool-result messages appended before turn 2, so near the model context limit it can overestimate available output tokens and send an oversized max_completion_tokens that triggers Kimi 400 token-limit errors specifically on /btw flows. Recalculate the override after constructing each turn’s history (especially before turn 2) so the cap matches the actual request payload.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledging this is a real edge case, but holding off on it in this PR for three reasons:

  1. Triggering requires all three: the main context already near the model limit and /btw fired at that moment and the LLM picking a denied tool on turn 1 forcing a turn 2. KimiSoul._step calls should_auto_compact before every step (kimisoul.py:892-898), so the main loop does not sit at that limit.
  2. A per-turn estimate would itself rely on estimate_text_tokens (char/4 heuristic, especially loose for CJK), so the precision gain is on the order of the existing 1024-token safety margin. The extra estimation duplicates the budget logic at the caller side and widens _compute_completion_overrides's signature.
  3. If we ever see real 400s from this in the wild, a focused follow-up plumbing an input_tokens override into _compute_completion_overrides from the /btw callsite is cheap to add. Preferring that to expanding this PR.

Leaving the current cap (computed once from soul.context.token_count_with_pending) as a conservative approximation that is accurate for typical /btw usage where the main context is not near full.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants