fix(kimi): clamp completion budget dynamically by wbxl2000 · Pull Request #2332 · MoonshotAI/kimi-cli

wbxl2000 · 2026-05-20T09:23:01Z

Summary

This PR changes the Kimi provider path to stop hardcoding max_tokens = 32000 and instead compute a per-request max_completion_tokens that fits the current context window.

Changes:

remove the provider-level default max_tokens = 32000
prefer Kimi's max_completion_tokens request parameter
keep max_tokens as a compatibility alias and normalize it to max_completion_tokens
read KIMI_MODEL_MAX_COMPLETION_TOKENS, with KIMI_MODEL_MAX_TOKENS as a fallback alias
compute a dynamic completion cap before each Kimi model step and compaction request
clamp explicit and default caps against the remaining model context window
document the new env var and compatibility behavior
add focused tests for provider request shape, env handling, and KimiSoul dynamic budgeting

Why

The old fixed max_tokens = 32000 is unsafe for two reasons:

It can exceed the remaining context window when the input is already large.
For reasoning models, the completion budget is shared by reasoning_content and final content, so a too-small remaining budget can produce a 200 response with only thinking text and no usable assistant content.

The new behavior treats reserved_context_size as the default desired response budget, but the final API parameter is always clamped by the current input size and model context length:

available = model_context_length - current_input_tokens - safety_margin
max_completion_tokens = min(configured_or_default_budget, available)

Real Kimi API probe

I tested the real API rather than relying only on docs.

Target:

Base URL: https://api.moonshot.cn/v1
Model: kimi-k2.6
Transport: raw HTTP requests to /v1/models and /v1/chat/completions
Auth: real API key, redacted

Model metadata

Request:

GET /v1/models
Authorization: Bearer <redacted>

Observed kimi-k2.6 entry:

{
  "id": "kimi-k2.6",
  "object": "model",
  "owned_by": "moonshot",
  "supports_image_in": true,
  "supports_video_in": true,
  "supports_reasoning": true,
  "context_length": 262144
}

Small `max_completion_tokens` can return thinking only

Request:

{
  "model": "kimi-k2.6",
  "messages": [{ "role": "user", "content": "Reply exactly: ok" }],
  "stream": false,
  "max_completion_tokens": 16
}

Observed response:

{
  "status": 200,
  "usage": {
    "prompt_tokens": 11,
    "completion_tokens": 16,
    "total_tokens": 27,
    "cached_tokens": 11,
    "prompt_tokens_details": { "cached_tokens": 11 }
  },
  "finish_reason": "length",
  "message_keys": ["content", "reasoning_content", "role"],
  "content_chars": 0,
  "reasoning_chars": 64,
  "content_prefix": "",
  "reasoning_prefix": "The user wants me to reply exactly with \"ok\". No punctuation, no"
}

This is the important failure mode: the API returns HTTP 200, but the agent gets no usable content.

A slightly larger cap leaves room for content

Request:

{
  "model": "kimi-k2.6",
  "messages": [{ "role": "user", "content": "Reply exactly: ok" }],
  "stream": false,
  "max_completion_tokens": 32
}

Observed response:

{
  "status": 200,
  "usage": {
    "prompt_tokens": 11,
    "completion_tokens": 29,
    "total_tokens": 40
  },
  "finish_reason": "stop",
  "content_chars": 2,
  "reasoning_chars": 114,
  "content_prefix": "ok"
}

Legacy `max_tokens` is still accepted

Request:

{
  "model": "kimi-k2.6",
  "messages": [{ "role": "user", "content": "Reply exactly: ok" }],
  "stream": false,
  "max_tokens": 16
}

Observed response:

{
  "status": 200,
  "usage": {
    "prompt_tokens": 11,
    "completion_tokens": 16,
    "total_tokens": 27
  },
  "finish_reason": "length",
  "content_chars": 0,
  "reasoning_chars": 64
}

If both caps are sent, `max_completion_tokens` wins

Request:

{
  "model": "kimi-k2.6",
  "messages": [{ "role": "user", "content": "Reply exactly: ok" }],
  "stream": false,
  "max_completion_tokens": 16,
  "max_tokens": 32
}

Observed response:

{
  "status": 200,
  "usage": {
    "prompt_tokens": 11,
    "completion_tokens": 16,
    "total_tokens": 27
  },
  "finish_reason": "length",
  "content_chars": 0,
  "reasoning_chars": 60
}

Huge caps are accepted on short input

Request:

{
  "model": "kimi-k2.6",
  "messages": [{ "role": "user", "content": "Reply exactly: ok" }],
  "stream": false,
  "max_completion_tokens": 1000000
}

Observed response:

{
  "status": 200,
  "usage": {
    "prompt_tokens": 11,
    "completion_tokens": 27,
    "total_tokens": 38
  },
  "finish_reason": "stop",
  "content_chars": 2,
  "reasoning_chars": 107
}

So the server accepts obviously unsafe cap values. Client-side clamping is still required.

Omitting the cap is not a controlled strategy

Request:

{
  "model": "kimi-k2.6",
  "messages": [
    {
      "role": "user",
      "content": "请连续输出编号列表，从 1 开始，每行一个编号和短句，尽量一直输出直到系统停止。不要解释。"
    }
  ],
  "stream": true,
  "stream_options": { "include_usage": true }
}

Observed result:

{
  "status": 200,
  "elapsed_ms": 811785,
  "stream_chunks": 21952,
  "finish_reason": "stop",
  "usage": {
    "prompt_tokens": 37,
    "completion_tokens": 21986,
    "total_tokens": 22023,
    "cached_tokens": 37,
    "prompt_tokens_details": { "cached_tokens": 37 }
  },
  "content_chars": 15598,
  "reasoning_chars": 21520
}

This means omitting the cap can run for a long time and spend a large completion budget. It is not a safe replacement for an explicit dynamic cap.

`kimi-k2.6` context-window behavior

Input over context limit:

{
  "status": 400,
  "error": {
    "message": "Invalid request: Your request exceeded model token limit: 262144 (requested: 270018)",
    "type": "invalid_request_error"
  }
}

Near context limit with a tiny cap:

{
  "status": 200,
  "usage": {
    "prompt_tokens": 260016,
    "completion_tokens": 16,
    "total_tokens": 260032,
    "cached_tokens": 249856,
    "prompt_tokens_details": { "cached_tokens": 249856 }
  },
  "finish_reason": "length",
  "content_chars": 0,
  "reasoning_chars": 60
}

Near context limit with a large cap can also fail operationally:

{
  "status": 429,
  "error": {
    "message": "The engine is currently overloaded, please try again later",
    "type": "engine_overloaded_error"
  }
}

Test plan

uv run pytest packages/kosong/tests/api_snapshot_tests/test_kimi.py tests/core/test_create_llm.py tests/core/test_kimisoul_completion_budget.py tests/core/test_simple_compaction.py
uv run ruff check packages/kosong/src/kosong/chat_provider/kimi.py src/kimi_cli/llm.py src/kimi_cli/soul/kimisoul.py src/kimi_cli/soul/compaction.py packages/kosong/tests/api_snapshot_tests/test_kimi.py tests/core/test_create_llm.py tests/core/test_kimisoul_completion_budget.py
uv run pyright src/kimi_cli/llm.py src/kimi_cli/soul/kimisoul.py src/kimi_cli/soul/compaction.py

Results:

55 passed, 2 warnings
ruff: All checks passed!
pyright: 0 errors, 0 warnings, 0 informations

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 34b527ccc1

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-05-20T09:26:08Z

        effective_history = normalize_history(self._context.history)
+        chat_provider = self._with_dynamic_completion_budget(chat_provider)



Persist recovered provider instead of per-step clone

Applying _with_dynamic_completion_budget here creates a temporary Kimi copy for the request, but retry recovery (on_retryable_error) runs against that copy and may replace/close its client; the runtime’s canonical provider (self._runtime.llm.chat_provider) is never updated. After any transient connection error (or 401 refresh path), the next turn can continue with a stale/closed client and fail repeatedly even though recovery succeeded once. This regression is introduced by cloning before _run_with_connection_recovery and affects normal step execution (and similarly the compaction path).

Useful? React with 👍 / 👎.

Copilot

Pull request overview

This PR updates the Kimi integration to stop hardcoding a fixed completion cap (max_tokens=32000) and instead prefer max_completion_tokens, with per-request clamping based on the remaining context window. It also adds env var compatibility/precedence rules and new tests/docs around the new behavior.

Changes:

Add compute_max_completion_tokens() and use it from KimiSoul to clamp completion budgets per step/compaction.
Update the kosong Kimi provider to normalize legacy max_tokens to max_completion_tokens and remove the provider-level default cap.
Update CLI env var handling + docs, and add focused tests for env precedence and dynamic budgeting.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
`src/kimi_cli/llm.py`	Adds completion-budget clamping helper and env var parsing for Kimi max completion tokens.
`src/kimi_cli/soul/kimisoul.py`	Injects dynamic completion budget into each Kimi step and compaction call.
`src/kimi_cli/soul/compaction.py`	Updates compaction comment to reflect KimiSoul-managed completion caps.
`packages/kosong/src/kosong/chat_provider/kimi.py`	Removes hardcoded default `max_tokens`, adds `max_completion_tokens`, and normalizes legacy `max_tokens`.
`packages/kosong/tests/api_snapshot_tests/test_kimi.py`	Updates snapshots/tests to assert `max_completion_tokens` and default omission of caps.
`tests/core/test_create_llm.py`	Adds env precedence test + unit tests for the clamping helper; updates expected Kimi provider request shape.
`tests/core/test_kimisoul_completion_budget.py`	Adds tests covering KimiSoul dynamic completion budgeting behavior.
`docs/en/configuration/env-vars.md`	Documents `KIMI_MODEL_MAX_COMPLETION_TOKENS` and `KIMI_MODEL_MAX_TOKENS` alias behavior.
`docs/zh/configuration/env-vars.md`	Same as English docs, in Chinese.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+    safe_remaining = remaining - max(0, safety_margin)
+    available = safe_remaining if safe_remaining > 0 else remaining


+        generation_kwargs: dict[str, Any] = dict(self._generation_kwargs)



devin-ai-integration

Devin Review found 1 potential issue.

View 5 additional findings in Devin Review.

devin-ai-integration · 2026-05-20T09:32:43Z


        # Normalize: merge adjacent user messages for clean API input
        effective_history = normalize_history(self._context.history)
+        chat_provider = self._with_dynamic_completion_budget(chat_provider)


🔴 Connection recovery on shallow-copied provider closes the original provider's shared HTTP client

_with_dynamic_completion_budget creates a shallow copy of the Kimi provider via with_generation_kwargs → copy.copy(self) (kimi.py:220). The copy's .client attribute is the same AsyncOpenAI instance as the original (self._runtime.llm.chat_provider). When a connection/timeout error triggers _run_with_connection_recovery (kimisoul.py:1395), on_retryable_error is called on the copy, which calls close_replaced_openai_client(old_client, ...) (kimi.py:191) — closing the shared client. This permanently corrupts the original provider's client.

After the first connection recovery, every subsequent step creates a new budget copy from the corrupted original (with the closed client), fails on its first attempt, recovers by creating yet another new client, and succeeds on retry — burning one recovery attempt per step for the rest of the session. The same issue applies to the compaction path at kimisoul.py:1196.

Prompt for agents

The root cause is that _with_dynamic_completion_budget (kimisoul.py:1128-1147) returns a shallow copy of the Kimi provider via with_generation_kwargs, which shares the same AsyncOpenAI client instance with the original provider stored in self._runtime.llm.chat_provider. When _run_with_connection_recovery (kimisoul.py:1382-1423) triggers on_retryable_error on this copy during a connection error, it closes the shared client via close_replaced_openai_client (kimi.py:191), corrupting the original provider. The fix needs to ensure that connection recovery on a temporary budget copy does not close the original provider's client. Several approaches: 1. After calling _with_dynamic_completion_budget, propagate any client changes back to the original provider when on_retryable_error fires. This is complex. 2. Instead of creating a temporary copy per step, apply the dynamic budget to the original provider in-place (update self._runtime.llm.chat_provider). This preserves the old behavior where on_retryable_error modifies the original. 3. In _step and compact_context, after _run_with_connection_recovery succeeds following a recovery, check if the copy's client differs from the original's and update the original's client to match. This prevents the original from keeping a stale/closed client. 4. Make with_generation_kwargs (or a variant) do a proper deep copy of the client reference chain so that the copy is fully independent. However, deep-copying an AsyncOpenAI client is likely impractical. Approach 2 or 3 seem most practical. The affected paths are _step (line 1007) and compact_context (lines 1192-1198).

Was this helpful? React with 👍 or 👎 to provide feedback.

wbxl2000 · 2026-05-20T10:37:00Z

@codex

chatgpt-codex-connector · 2026-05-20T10:41:34Z

Codex Review: Didn't find any major issues. 👍

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d9e76f05cd

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-05-20T11:15:37Z

-            "max_tokens": 32000,
-        }
-        generation_kwargs.update(self._generation_kwargs)
+        generation_kwargs: dict[str, Any] = dict(self._generation_kwargs)


Reinstate a completion cap fallback for uncapped Kimi calls

Removing the provider’s default cap here means any Kimi call path that does not add its own budget now sends no max_completion_tokens/max_tokens. In this repo, /btw bypasses KimiSoul._step and calls kosong.step directly with soul._runtime.llm.chat_provider (src/kimi_cli/soul/btw.py, execute_side_question), so those requests are now unbounded unless users set an env var. That can cause very long generations and unexpectedly high latency/cost on side-question flows.

Useful? React with 👍 / 👎.

The previous approach in this PR derived a budget-clamped Kimi instance via with_generation_kwargs() in KimiSoul._with_dynamic_completion_budget, but that shallow copy shared self.client with runtime.llm.chat_provider. A transient APIConnectionError would invoke on_retryable_error on the copy, rebinding only the copy's client while the original (now closed) client stayed attached to runtime.llm.chat_provider. Every subsequent step then had to walk through a fresh connection-recovery cycle before making any real progress. Switch to a per-call generation_overrides mapping plumbed through kosong.step / kosong.generate / ChatProvider.generate, so providers can merge per-request kwargs without producing a parallel instance. The single Kimi instance owned by Runtime keeps its live client and OAuth token state, and on_retryable_error mutations are visible on the next step. The Compaction protocol gains the same parameter so callers no longer need to mutate the chat provider before invoking compact(). - Extend ChatProvider.generate Protocol with generation_overrides - Forward through kosong.generate and kosong.step - Kimi.generate normalizes max_tokens -> max_completion_tokens in overrides - Other providers (Anthropic, OpenAILegacy, OpenAIResponses, GoogleGenAI, Chaos, Mock, Echo, ScriptedEcho) accept and merge the parameter - KimiSoul replaces _with_dynamic_completion_budget with _compute_completion_overrides returning a dict - compact_context no longer constructs a replacement LLM - SimpleCompaction.compact threads generation_overrides to kosong.step - Add regression test ensuring runtime.llm.chat_provider is not replaced - Add kosong-level tests for per-call override merge/normalization

wbxl2000 · 2026-05-20T13:04:16Z

@codex

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b0e4851c89

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-05-20T13:08:12Z

+        generation_kwargs: dict[str, Any] = dict(self._generation_kwargs)
+        if generation_overrides:


Reinstate a default cap when no per-call override exists

This change removes the provider-level max_tokens default and now sends no completion limit unless callers explicitly pass generation_overrides, but several call paths still invoke kosong.step without overrides (for example src/kimi_cli/soul/btw.py and packages/kosong/src/kosong/__main__.py). In those contexts, Kimi requests become uncapped and can run much longer and consume far more tokens than before, which is a regression in latency/cost behavior introduced by this commit; either keep a safe provider default or add clamping in every Kimi call path that currently omits overrides.

Useful? React with 👍 / 👎.

After removing the provider-level ``max_tokens=32000`` default, two call paths still invoked ``kosong.step`` without ``generation_overrides`` and therefore sent Kimi requests with no completion cap at all: - ``KimiSoul`` /btw side-question loop (``execute_side_question`` calls ``kosong.step`` directly with ``soul._runtime.llm.chat_provider``), - the kosong package's developer demo (``kosong.__main__``). Both are flagged by codex as a P1 regression in latency/cost behavior. Reuse ``KimiSoul._compute_completion_overrides`` in the /btw path so it shares the same context-aware budget as the main step loop. Give the kosong demo a conservative ``max_completion_tokens=8192`` default so ``python -m kosong kimi`` does not run unbounded generations. Add a regression test asserting /btw forwards the overrides to ``kosong.step``.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4a0f9ec2ee

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-20T13:27:44Z

+        # bare ``kosong.step`` call below would send a request with no completion cap on
+        # Kimi providers (the provider-level default was removed), so a side-question
+        # could run for hundreds of seconds and burn a large completion budget.
+        generation_overrides = soul._compute_completion_overrides(chat_provider)  # pyright: ignore[reportPrivateUsage]


Recompute /btw completion cap from actual turn history

execute_side_question computes generation_overrides once before building the /btw request history, and then reuses that same cap for both turns. Because _compute_completion_overrides uses context.token_count_with_pending, this budget ignores the injected side-question wrapper message and any assistant/tool-result messages appended before turn 2, so near the model context limit it can overestimate available output tokens and send an oversized max_completion_tokens that triggers Kimi 400 token-limit errors specifically on /btw flows. Recalculate the override after constructing each turn’s history (especially before turn 2) so the cap matches the actual request payload.

Useful? React with 👍 / 👎.

Acknowledging this is a real edge case, but holding off on it in this PR for three reasons:

Triggering requires all three: the main context already near the model limit and /btw fired at that moment and the LLM picking a denied tool on turn 1 forcing a turn 2. KimiSoul._step calls should_auto_compact before every step (kimisoul.py:892-898), so the main loop does not sit at that limit.

A per-turn estimate would itself rely on estimate_text_tokens (char/4 heuristic, especially loose for CJK), so the precision gain is on the order of the existing 1024-token safety margin. The extra estimation duplicates the budget logic at the caller side and widens _compute_completion_overrides's signature.

If we ever see real 400s from this in the wild, a focused follow-up plumbing an input_tokens override into _compute_completion_overrides from the /btw callsite is cheap to add. Preferring that to expanding this PR.

Leaving the current cap (computed once from soul.context.token_count_with_pending) as a conservative approximation that is accurate for typical /btw usage where the main context is not near full.

fix(kimi): clamp completion budget dynamically

34b527c

Copilot AI review requested due to automatic review settings May 20, 2026 09:23

Copilot started reviewing on behalf of wbxl2000 May 20, 2026 09:23 View session

chatgpt-codex-connector Bot reviewed May 20, 2026

View reviewed changes

test(kimi): narrow budgeted provider type

f26b1f7

Copilot AI reviewed May 20, 2026

View reviewed changes

devin-ai-integration Bot reviewed May 20, 2026

View reviewed changes

fix(kimi): avoid cloning unchanged compaction llm

a3ed772

Merge branch 'main' into qer/kimi-dynamic-completion-budget

d9e76f0

chatgpt-codex-connector Bot reviewed May 20, 2026

View reviewed changes

RealKai42 approved these changes May 20, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed May 20, 2026

View reviewed changes

		effective_history = normalize_history(self._context.history)
		chat_provider = self._with_dynamic_completion_budget(chat_provider)

		safe_remaining = remaining - max(0, safety_margin)
		available = safe_remaining if safe_remaining > 0 else remaining

		generation_kwargs: dict[str, Any] = dict(self._generation_kwargs)

		generation_kwargs: dict[str, Any] = dict(self._generation_kwargs)
		if generation_overrides:

Conversation

wbxl2000 commented May 20, 2026 • edited by devin-ai-integration Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why

Real Kimi API probe

Model metadata

Small max_completion_tokens can return thinking only

A slightly larger cap leaves room for content

Legacy max_tokens is still accepted

If both caps are sent, max_completion_tokens wins

Huge caps are accepted on short input

Omitting the cap is not a controlled strategy

kimi-k2.6 context-window behavior

Test plan

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

Uh oh!

devin-ai-integration Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

wbxl2000 commented May 20, 2026

Uh oh!

chatgpt-codex-connector Bot commented May 20, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

wbxl2000 commented May 20, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

7Sageer May 20, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

wbxl2000 commented May 20, 2026 •

edited by devin-ai-integration Bot

Loading

Small `max_completion_tokens` can return thinking only

Legacy `max_tokens` is still accepted

If both caps are sent, `max_completion_tokens` wins

`kimi-k2.6` context-window behavior