fix(kimi): clamp completion budget dynamically#2332
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 34b527ccc1
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| effective_history = normalize_history(self._context.history) | ||
| chat_provider = self._with_dynamic_completion_budget(chat_provider) | ||
|
|
There was a problem hiding this comment.
Persist recovered provider instead of per-step clone
Applying _with_dynamic_completion_budget here creates a temporary Kimi copy for the request, but retry recovery (on_retryable_error) runs against that copy and may replace/close its client; the runtime’s canonical provider (self._runtime.llm.chat_provider) is never updated. After any transient connection error (or 401 refresh path), the next turn can continue with a stale/closed client and fail repeatedly even though recovery succeeded once. This regression is introduced by cloning before _run_with_connection_recovery and affects normal step execution (and similarly the compaction path).
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Pull request overview
This PR updates the Kimi integration to stop hardcoding a fixed completion cap (max_tokens=32000) and instead prefer max_completion_tokens, with per-request clamping based on the remaining context window. It also adds env var compatibility/precedence rules and new tests/docs around the new behavior.
Changes:
- Add
compute_max_completion_tokens()and use it fromKimiSoulto clamp completion budgets per step/compaction. - Update the
kosongKimi provider to normalize legacymax_tokenstomax_completion_tokensand remove the provider-level default cap. - Update CLI env var handling + docs, and add focused tests for env precedence and dynamic budgeting.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
src/kimi_cli/llm.py |
Adds completion-budget clamping helper and env var parsing for Kimi max completion tokens. |
src/kimi_cli/soul/kimisoul.py |
Injects dynamic completion budget into each Kimi step and compaction call. |
src/kimi_cli/soul/compaction.py |
Updates compaction comment to reflect KimiSoul-managed completion caps. |
packages/kosong/src/kosong/chat_provider/kimi.py |
Removes hardcoded default max_tokens, adds max_completion_tokens, and normalizes legacy max_tokens. |
packages/kosong/tests/api_snapshot_tests/test_kimi.py |
Updates snapshots/tests to assert max_completion_tokens and default omission of caps. |
tests/core/test_create_llm.py |
Adds env precedence test + unit tests for the clamping helper; updates expected Kimi provider request shape. |
tests/core/test_kimisoul_completion_budget.py |
Adds tests covering KimiSoul dynamic completion budgeting behavior. |
docs/en/configuration/env-vars.md |
Documents KIMI_MODEL_MAX_COMPLETION_TOKENS and KIMI_MODEL_MAX_TOKENS alias behavior. |
docs/zh/configuration/env-vars.md |
Same as English docs, in Chinese. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| safe_remaining = remaining - max(0, safety_margin) | ||
| available = safe_remaining if safe_remaining > 0 else remaining |
| generation_kwargs: dict[str, Any] = dict(self._generation_kwargs) | ||
|
|
|
|
||
| # Normalize: merge adjacent user messages for clean API input | ||
| effective_history = normalize_history(self._context.history) | ||
| chat_provider = self._with_dynamic_completion_budget(chat_provider) |
There was a problem hiding this comment.
🔴 Connection recovery on shallow-copied provider closes the original provider's shared HTTP client
_with_dynamic_completion_budget creates a shallow copy of the Kimi provider via with_generation_kwargs → copy.copy(self) (kimi.py:220). The copy's .client attribute is the same AsyncOpenAI instance as the original (self._runtime.llm.chat_provider). When a connection/timeout error triggers _run_with_connection_recovery (kimisoul.py:1395), on_retryable_error is called on the copy, which calls close_replaced_openai_client(old_client, ...) (kimi.py:191) — closing the shared client. This permanently corrupts the original provider's client.
After the first connection recovery, every subsequent step creates a new budget copy from the corrupted original (with the closed client), fails on its first attempt, recovers by creating yet another new client, and succeeds on retry — burning one recovery attempt per step for the rest of the session. The same issue applies to the compaction path at kimisoul.py:1196.
Prompt for agents
The root cause is that _with_dynamic_completion_budget (kimisoul.py:1128-1147) returns a shallow copy of the Kimi provider via with_generation_kwargs, which shares the same AsyncOpenAI client instance with the original provider stored in self._runtime.llm.chat_provider. When _run_with_connection_recovery (kimisoul.py:1382-1423) triggers on_retryable_error on this copy during a connection error, it closes the shared client via close_replaced_openai_client (kimi.py:191), corrupting the original provider.
The fix needs to ensure that connection recovery on a temporary budget copy does not close the original provider's client. Several approaches:
1. After calling _with_dynamic_completion_budget, propagate any client changes back to the original provider when on_retryable_error fires. This is complex.
2. Instead of creating a temporary copy per step, apply the dynamic budget to the original provider in-place (update self._runtime.llm.chat_provider). This preserves the old behavior where on_retryable_error modifies the original.
3. In _step and compact_context, after _run_with_connection_recovery succeeds following a recovery, check if the copy's client differs from the original's and update the original's client to match. This prevents the original from keeping a stale/closed client.
4. Make with_generation_kwargs (or a variant) do a proper deep copy of the client reference chain so that the copy is fully independent. However, deep-copying an AsyncOpenAI client is likely impractical.
Approach 2 or 3 seem most practical. The affected paths are _step (line 1007) and compact_context (lines 1192-1198).
Was this helpful? React with 👍 or 👎 to provide feedback.
|
Codex Review: Didn't find any major issues. 👍 ℹ️ About Codex in GitHubCodex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback". |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d9e76f05cd
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| "max_tokens": 32000, | ||
| } | ||
| generation_kwargs.update(self._generation_kwargs) | ||
| generation_kwargs: dict[str, Any] = dict(self._generation_kwargs) |
There was a problem hiding this comment.
Reinstate a completion cap fallback for uncapped Kimi calls
Removing the provider’s default cap here means any Kimi call path that does not add its own budget now sends no max_completion_tokens/max_tokens. In this repo, /btw bypasses KimiSoul._step and calls kosong.step directly with soul._runtime.llm.chat_provider (src/kimi_cli/soul/btw.py, execute_side_question), so those requests are now unbounded unless users set an env var. That can cause very long generations and unexpectedly high latency/cost on side-question flows.
Useful? React with 👍 / 👎.
The previous approach in this PR derived a budget-clamped Kimi instance via with_generation_kwargs() in KimiSoul._with_dynamic_completion_budget, but that shallow copy shared self.client with runtime.llm.chat_provider. A transient APIConnectionError would invoke on_retryable_error on the copy, rebinding only the copy's client while the original (now closed) client stayed attached to runtime.llm.chat_provider. Every subsequent step then had to walk through a fresh connection-recovery cycle before making any real progress. Switch to a per-call generation_overrides mapping plumbed through kosong.step / kosong.generate / ChatProvider.generate, so providers can merge per-request kwargs without producing a parallel instance. The single Kimi instance owned by Runtime keeps its live client and OAuth token state, and on_retryable_error mutations are visible on the next step. The Compaction protocol gains the same parameter so callers no longer need to mutate the chat provider before invoking compact(). - Extend ChatProvider.generate Protocol with generation_overrides - Forward through kosong.generate and kosong.step - Kimi.generate normalizes max_tokens -> max_completion_tokens in overrides - Other providers (Anthropic, OpenAILegacy, OpenAIResponses, GoogleGenAI, Chaos, Mock, Echo, ScriptedEcho) accept and merge the parameter - KimiSoul replaces _with_dynamic_completion_budget with _compute_completion_overrides returning a dict - compact_context no longer constructs a replacement LLM - SimpleCompaction.compact threads generation_overrides to kosong.step - Add regression test ensuring runtime.llm.chat_provider is not replaced - Add kosong-level tests for per-call override merge/normalization
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b0e4851c89
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| generation_kwargs: dict[str, Any] = dict(self._generation_kwargs) | ||
| if generation_overrides: |
There was a problem hiding this comment.
Reinstate a default cap when no per-call override exists
This change removes the provider-level max_tokens default and now sends no completion limit unless callers explicitly pass generation_overrides, but several call paths still invoke kosong.step without overrides (for example src/kimi_cli/soul/btw.py and packages/kosong/src/kosong/__main__.py). In those contexts, Kimi requests become uncapped and can run much longer and consume far more tokens than before, which is a regression in latency/cost behavior introduced by this commit; either keep a safe provider default or add clamping in every Kimi call path that currently omits overrides.
Useful? React with 👍 / 👎.
After removing the provider-level ``max_tokens=32000`` default, two call paths still invoked ``kosong.step`` without ``generation_overrides`` and therefore sent Kimi requests with no completion cap at all: - ``KimiSoul`` /btw side-question loop (``execute_side_question`` calls ``kosong.step`` directly with ``soul._runtime.llm.chat_provider``), - the kosong package's developer demo (``kosong.__main__``). Both are flagged by codex as a P1 regression in latency/cost behavior. Reuse ``KimiSoul._compute_completion_overrides`` in the /btw path so it shares the same context-aware budget as the main step loop. Give the kosong demo a conservative ``max_completion_tokens=8192`` default so ``python -m kosong kimi`` does not run unbounded generations. Add a regression test asserting /btw forwards the overrides to ``kosong.step``.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4a0f9ec2ee
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| # bare ``kosong.step`` call below would send a request with no completion cap on | ||
| # Kimi providers (the provider-level default was removed), so a side-question | ||
| # could run for hundreds of seconds and burn a large completion budget. | ||
| generation_overrides = soul._compute_completion_overrides(chat_provider) # pyright: ignore[reportPrivateUsage] |
There was a problem hiding this comment.
Recompute /btw completion cap from actual turn history
execute_side_question computes generation_overrides once before building the /btw request history, and then reuses that same cap for both turns. Because _compute_completion_overrides uses context.token_count_with_pending, this budget ignores the injected side-question wrapper message and any assistant/tool-result messages appended before turn 2, so near the model context limit it can overestimate available output tokens and send an oversized max_completion_tokens that triggers Kimi 400 token-limit errors specifically on /btw flows. Recalculate the override after constructing each turn’s history (especially before turn 2) so the cap matches the actual request payload.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Acknowledging this is a real edge case, but holding off on it in this PR for three reasons:
- Triggering requires all three: the main context already near the model limit and
/btwfired at that moment and the LLM picking a denied tool on turn 1 forcing a turn 2.KimiSoul._stepcallsshould_auto_compactbefore every step (kimisoul.py:892-898), so the main loop does not sit at that limit. - A per-turn estimate would itself rely on
estimate_text_tokens(char/4 heuristic, especially loose for CJK), so the precision gain is on the order of the existing 1024-token safety margin. The extra estimation duplicates the budget logic at the caller side and widens_compute_completion_overrides's signature. - If we ever see real 400s from this in the wild, a focused follow-up plumbing an
input_tokensoverride into_compute_completion_overridesfrom the/btwcallsite is cheap to add. Preferring that to expanding this PR.
Leaving the current cap (computed once from soul.context.token_count_with_pending) as a conservative approximation that is accurate for typical /btw usage where the main context is not near full.
Summary
This PR changes the Kimi provider path to stop hardcoding
max_tokens = 32000and instead compute a per-requestmax_completion_tokensthat fits the current context window.Changes:
max_tokens = 32000max_completion_tokensrequest parametermax_tokensas a compatibility alias and normalize it tomax_completion_tokensKIMI_MODEL_MAX_COMPLETION_TOKENS, withKIMI_MODEL_MAX_TOKENSas a fallback aliasWhy
The old fixed
max_tokens = 32000is unsafe for two reasons:reasoning_contentand finalcontent, so a too-small remaining budget can produce a 200 response with only thinking text and no usable assistant content.The new behavior treats
reserved_context_sizeas the default desired response budget, but the final API parameter is always clamped by the current input size and model context length:Real Kimi API probe
I tested the real API rather than relying only on docs.
Target:
https://api.moonshot.cn/v1kimi-k2.6/v1/modelsand/v1/chat/completionsModel metadata
Request:
Observed
kimi-k2.6entry:{ "id": "kimi-k2.6", "object": "model", "owned_by": "moonshot", "supports_image_in": true, "supports_video_in": true, "supports_reasoning": true, "context_length": 262144 }Small
max_completion_tokenscan return thinking onlyRequest:
{ "model": "kimi-k2.6", "messages": [{ "role": "user", "content": "Reply exactly: ok" }], "stream": false, "max_completion_tokens": 16 }Observed response:
{ "status": 200, "usage": { "prompt_tokens": 11, "completion_tokens": 16, "total_tokens": 27, "cached_tokens": 11, "prompt_tokens_details": { "cached_tokens": 11 } }, "finish_reason": "length", "message_keys": ["content", "reasoning_content", "role"], "content_chars": 0, "reasoning_chars": 64, "content_prefix": "", "reasoning_prefix": "The user wants me to reply exactly with \"ok\". No punctuation, no" }This is the important failure mode: the API returns HTTP 200, but the agent gets no usable
content.A slightly larger cap leaves room for content
Request:
{ "model": "kimi-k2.6", "messages": [{ "role": "user", "content": "Reply exactly: ok" }], "stream": false, "max_completion_tokens": 32 }Observed response:
{ "status": 200, "usage": { "prompt_tokens": 11, "completion_tokens": 29, "total_tokens": 40 }, "finish_reason": "stop", "content_chars": 2, "reasoning_chars": 114, "content_prefix": "ok" }Legacy
max_tokensis still acceptedRequest:
{ "model": "kimi-k2.6", "messages": [{ "role": "user", "content": "Reply exactly: ok" }], "stream": false, "max_tokens": 16 }Observed response:
{ "status": 200, "usage": { "prompt_tokens": 11, "completion_tokens": 16, "total_tokens": 27 }, "finish_reason": "length", "content_chars": 0, "reasoning_chars": 64 }If both caps are sent,
max_completion_tokenswinsRequest:
{ "model": "kimi-k2.6", "messages": [{ "role": "user", "content": "Reply exactly: ok" }], "stream": false, "max_completion_tokens": 16, "max_tokens": 32 }Observed response:
{ "status": 200, "usage": { "prompt_tokens": 11, "completion_tokens": 16, "total_tokens": 27 }, "finish_reason": "length", "content_chars": 0, "reasoning_chars": 60 }Huge caps are accepted on short input
Request:
{ "model": "kimi-k2.6", "messages": [{ "role": "user", "content": "Reply exactly: ok" }], "stream": false, "max_completion_tokens": 1000000 }Observed response:
{ "status": 200, "usage": { "prompt_tokens": 11, "completion_tokens": 27, "total_tokens": 38 }, "finish_reason": "stop", "content_chars": 2, "reasoning_chars": 107 }So the server accepts obviously unsafe cap values. Client-side clamping is still required.
Omitting the cap is not a controlled strategy
Request:
{ "model": "kimi-k2.6", "messages": [ { "role": "user", "content": "请连续输出编号列表,从 1 开始,每行一个编号和短句,尽量一直输出直到系统停止。不要解释。" } ], "stream": true, "stream_options": { "include_usage": true } }Observed result:
{ "status": 200, "elapsed_ms": 811785, "stream_chunks": 21952, "finish_reason": "stop", "usage": { "prompt_tokens": 37, "completion_tokens": 21986, "total_tokens": 22023, "cached_tokens": 37, "prompt_tokens_details": { "cached_tokens": 37 } }, "content_chars": 15598, "reasoning_chars": 21520 }This means omitting the cap can run for a long time and spend a large completion budget. It is not a safe replacement for an explicit dynamic cap.
kimi-k2.6context-window behaviorInput over context limit:
{ "status": 400, "error": { "message": "Invalid request: Your request exceeded model token limit: 262144 (requested: 270018)", "type": "invalid_request_error" } }Near context limit with a tiny cap:
{ "status": 200, "usage": { "prompt_tokens": 260016, "completion_tokens": 16, "total_tokens": 260032, "cached_tokens": 249856, "prompt_tokens_details": { "cached_tokens": 249856 } }, "finish_reason": "length", "content_chars": 0, "reasoning_chars": 60 }Near context limit with a large cap can also fail operationally:
{ "status": 429, "error": { "message": "The engine is currently overloaded, please try again later", "type": "engine_overloaded_error" } }Test plan
Results:
55 passed, 2 warningsruff:All checks passed!pyright:0 errors, 0 warnings, 0 informations