Skip to content

Fix: Vertex streaming + native tool calling leading to blank responses#168

Open
ctroche99 wants to merge 17 commits into
owndev:mainfrom
ctroche99:claude/openwebui-tool-calling-streaming-7h27r
Open

Fix: Vertex streaming + native tool calling leading to blank responses#168
ctroche99 wants to merge 17 commits into
owndev:mainfrom
ctroche99:claude/openwebui-tool-calling-streaming-7h27r

Conversation

@ctroche99
Copy link
Copy Markdown

@ctroche99 ctroche99 commented May 24, 2026

Root Cause & Fix: Native Tool Calling Producing Blank Output

After adding INFO-level logging at every decision point in a live OpenWebUI instance, the blank-output bug with native tool calling + streaming traces to three distinct issues. The first is the dominant one and the reason every prior attempt at a fix looked correct on paper but failed at runtime.

1. The Google Genai SDK's Automatic Function Calling (AFC) was silently competing with the pipe's tool loop

The google-genai SDK enables Automatic Function Calling (AFC) by default whenever you pass Python callables (rather than types.FunctionDeclaration objects) as tools in GenerateContentConfig. AFC:

  1. Intercepts the model's function_call parts before they reach the caller's iterator
  2. Executes the Python callable itself
  3. Sends the result back to Gemini as a FunctionResponse
  4. Only yields the final text response to the pipe's stream loop

This means the pipe's custom tool-execution loop never fired. The diagnostic logs proved it: function_calls=0 on every stream round, yet server logs showed the underlying tool APIs being hit. AFC was doing the work invisibly. You can spot it in the SDK's own log line:

google.genai.models:async_generator - AFC is enabled with max remote calls: 10.

AFC and the pipe's tool loop are doing exactly the same job — they're two competing implementations of the same round-trip. AFC wins the race because it runs inside the SDK. The fix is to explicitly disable AFC so the pipe's loop owns the execution:

if tools:
    gen_config_params["tools"] = tools
    gen_config_params["automatic_function_calling"] = (
        types.AutomaticFunctionCallingConfig(disable=True)
    )

This isn't a loss of functionality. AFC is a convenience feature for simple scripts that don't need custom serialization, error handling, or UI rendering. The pipe's loop replaces it with a more capable implementation that handles OWUI-specific requirements.

2. OpenWebUI tools can return (Response, str) tuples that AFC cannot serialize for Gemini

OWUI's tool convention allows callables to return a tuple where element [0] is a Response object for the UI renderer and element [1] is the plain-text string for the model's context. When AFC executed these tools, it tried to pass the raw Python tuple — including the response object — into the FunctionResponse it sends back to Gemini. Gemini cannot deserialize that, so the model produced empty output.

This is why the symptom was "the tool runs (visible in server logs) but the chat shows nothing." The tool executed successfully; the serialization of its result to Gemini was what broke.

Fix in the manual tool loop — extract the str component when a tool returns a tuple:

_raw = tool_callable(**tool_args)
tool_result = (await _raw) if inspect.isawaitable(_raw) else _raw
if isinstance(tool_result, tuple):
    tool_result = next(
        (v for v in tool_result if isinstance(v, str)), None
    )
if tool_result is None:
    tool_result = ""
tool_result = str(tool_result)

3. Gemini sometimes namespaces tool names as default_api:foo

Once AFC was disabled and the raw function_call parts were visible, a secondary issue appeared. Gemini occasionally emits function_call.name as "default_api:tool_name" or "default_api.tool_name" rather than the plain registered name. The if tool_name in __tools__ lookup failed silently and we sent back an error FunctionResponse, after which the model gave up producing output.

Fix: strip the prefix for the lookup, but preserve the original name in the FunctionResponse so Gemini can pair it with the original FunctionCall:

raw_tool_name = fc_part.function_call.name
tool_name = raw_tool_name
for prefix in ("default_api:", "default_api."):
    if tool_name.startswith(prefix):
        tool_name = tool_name[len(prefix):]
        break

# ... execute via __tools__[tool_name] ...

types.FunctionResponse(
    name=raw_tool_name,  # ← original name, not normalized
    response={"result": tool_result},
)

Why earlier fixes didn't work

Several prior PRs correctly built a multi-round tool-execution loop over generate_content_stream with function_call detection, tool execution, follow-up streams, and <details type="tool_calls"> rendering. The code was structurally sound. But because AFC consumed the function_call parts before they reached the loop, the loop body was unreachable at runtime. And because AFC passed unserializable return values to Gemini, the final output was empty rather than an error. The bug was invisible without per-step logging.

TL;DR

The pipe's tool loop was correct in structure. AFC was just running first and producing unserializable payloads. Disabling AFC, extracting the str from OWUI tuple returns, and normalizing the default_api: prefix produces the expected end-to-end behavior in both streaming and non-streaming paths.

claude added 8 commits May 23, 2026 00:01
When streaming is enabled with native tool calling, the Gemini API returns
function_call parts that were previously silently ignored, causing blank
responses. This fix adds a tool call execution loop inside
_handle_streaming_response that:

- Detects function_call parts in each streaming round
- Executes the corresponding OpenWebUI tool callables (sync and async)
- Appends the function responses back into the conversation
- Makes a follow-up streaming call so the model can answer with results
- Loops up to 10 times to support multi-step tool use
- Emits status events so the UI shows which tools are running

The call site in pipe() now passes client, model_id, contents, and
generation_config so the handler has everything it needs for follow-up
requests.

https://claude.ai/code/session_01EvKKnagp23fb6fi92HaHBo
The non-streaming path had the same bug as streaming: function_call parts
returned by Gemini were silently ignored, causing blank or tool-unaware
responses (issue owndev#155).

The fix wraps the generate_content call in a tool call loop that:
- Detects function_call parts in each response round
- Executes the corresponding OpenWebUI tool callables (sync and async)
- Appends function responses back into the conversation contents
- Makes follow-up generate_content calls so the model can use results
- Loops up to 10 times to support multi-step tool use
- Emits status events so the UI shows which tools are running

Text, thoughts, and image parts accumulate across all rounds so the
final response includes output from every step. Image generation models
are excluded from the tool call loop since they don't use function calling.

https://claude.ai/code/session_01EvKKnagp23fb6fi92HaHBo
Two improvements based on OpenWebUI plugin documentation:

1. Replace asyncio.iscoroutinefunction() with the call-first / inspect.isawaitable()
   pattern. OpenWebUI wraps tool callables with functools.partial and similar
   constructs that fool iscoroutinefunction, returning False even when the
   underlying function is async. Calling first and then awaiting the result if
   isawaitable() handles all cases correctly.

2. Emit a <details type="tool_calls"> block into the response after each tool
   call round. This matches the exact shape OpenWebUI itself renders for native
   tool calls, giving users a collapsible history of which tools ran and what
   they returned. Applied to both streaming and non-streaming paths.

   import inspect added at module level.

https://claude.ai/code/session_01EvKKnagp23fb6fi92HaHBo
1. Remove redundant 'Running' status event — 'Calling' already fires
   immediately when the function_call part hits the stream; emitting
   'Running' milliseconds later in the execution loop is noise.

2. Separate tool call blocks from answer_chunks so grounding citation
   processing only receives real model text. Previously the <details>
   HTML block was in answer_chunks and got passed to
   _process_grounding_metadata, which would insert citation links into
   the HTML markup and produce malformed output when grounding and
   native tool calling were both enabled. tool_call_blocks is now a
   separate accumulator; the two are joined after grounding runs.

3. Final content assembly now correctly orders: thought block →
   tool call details → grounded answer text.

https://claude.ai/code/session_01EvKKnagp23fb6fi92HaHBo
Header injection (sanitize_header_value):
  The regex already removed \x00-\x1F which includes \r and \n, but
  adding them explicitly as \r\n makes the intent unambiguous and guards
  against edge cases in regex engine interpretation.

Grounding citation byte bounds (grounding_metadata_list processing):
  Added clamping of end_index to [last_byte_index, text_len] and used
  errors='replace' on decode. Malformed segment offsets from the API
  (e.g. due to encoding mismatch on unicode text) would previously cause
  silent truncation or IndexError.

Tool loop MAX_TOOL_ITERATIONS warning (both paths):
  The loop was silently breaking at the limit. Now logs a warning so
  runaway agent loops are visible in server logs.

Empty function_response_parts guard (both paths):
  If all tool calls fail to resolve (not in __tools__), we still built
  error-result response parts — so in practice this list is never empty.
  Added an explicit guard and break anyway to prevent pushing an empty
  Content object to the Gemini API, which would cause an API error.

None tool result normalisation (both paths):
  Tool callables returning None were producing the literal string "None"
  via str(). Now normalised to "" before wrapping in FunctionResponse so
  the model receives clean empty text rather than the word "None".

Tool call blocks separated from answer text (non-streaming path):
  Same fix previously applied to streaming: <details type="tool_calls">
  HTML is now accumulated in tool_call_blocks_ns and combined with the
  grounded answer text after _process_grounding_metadata runs, preventing
  citation links from being injected into HTML markup.

Status label consistency (non-streaming path):
  "Running: {tool_name}" renamed to "Calling: {tool_name}" to match the
  streaming path label.

https://claude.ai/code/session_01EvKKnagp23fb6fi92HaHBo
1. Missing newline between thought <details> block and answer text in
   both streaming (f-string join) and non-streaming (concatenation) paths.
   Content was running directly adjacent, breaking visual formatting and
   potentially confusing HTML/markdown parsers.

2. Narrow the exception catch in EncryptedStr.decrypt() to re-raise
   fatal exceptions (KeyboardInterrupt, SystemExit, MemoryError) instead
   of swallowing them. The broad `except Exception` was silently catching
   OOM and interrupt signals, masking serious runtime failures.

https://claude.ai/code/session_01EvKKnagp23fb6fi92HaHBo
Two real fixes for native tool calling and one observability change:

1. Tool name normalization (the actual bug)
   Gemini sometimes emits function_call.name as "default_api:foo" or
   "default_api.foo" instead of the plain "foo" registered via the Python
   callable. Our __tools__ lookup keyed on the plain name failed, we sent
   back an error FunctionResponse, and the model gave up — producing the
   "model thinks about tool but no output" symptom. Strip the prefix when
   matching against __tools__, but keep the raw name in the FunctionResponse
   so Gemini can pair it with the original FunctionCall.

2. function_calling detection across metadata paths
   params.get("function_calling") read from __metadata__["params"], but
   OpenWebUI stores model-level params at __metadata__["model"]["params"]
   in current versions. Check both, and treat __tools__ being non-empty as
   implicit native mode unless function_calling is explicitly "default".

3. INFO-level logging at every decision point
   __tools__ registration, function_call detection, name normalization,
   tool execution start/result/error, follow-up stream kickoff, and final
   answer assembly. Both streaming and non-streaming paths instrumented so
   the server terminal shows exactly where the loop breaks if it does.
Root cause of blank output confirmed from server logs:

1. Disable Automatic Function Calling (AFC)
   The google-genai SDK has AFC enabled by default when Python callables are
   passed as tools. AFC intercepts function_call parts and executes them
   internally before our streaming loop ever sees them. This breaks our tool
   loop (function_calls=0 always) and — critically — passes raw Python return
   values including OpenWebUI's (HTMLResponse, str) tuples directly to Gemini,
   which cannot deserialise them, producing blank output. Fix: set
   automatic_function_calling=AutomaticFunctionCallingConfig(disable=True)
   in the GenerateContentConfig whenever tools are present.

2. Handle (HTMLResponse, str) tuple return type (both streaming and non-streaming)
   OpenWebUI tools that return rich UI content yield a tuple where element[0]
   is an HTMLResponse for the UI renderer and element[1] is the plain-text
   string for the model. Extract the str component when the tool returns a
   tuple so Gemini receives clean, serialisable text.
@ctroche99 ctroche99 closed this May 24, 2026
@ctroche99
Copy link
Copy Markdown
Author

Hello,

Let me first start off by saying thank you for building this project! I am restricted to a VertexAI account in my environment and this PIPE function allowed me to set it up in OpenWebUI. I quickly ran into the issue reported in #155 and then the issue also seen in #135 . Considering OpenWebUI now considers default tool calling deprecated and "native" is required, I felt this needed to be addressed. I setup server side logging for your tool calling functions and then, based on those, learned about AFC. From there it was easy to have Claude re-write the functions with AFC disabled and handle the process via the Pipe. In my testing, it appears to have worked. Of course, I welcome your own testing and verification. I wanted to leave this comment to let you know I am not just pushing bling Claude PRs to your project. If you prefer non-AI coded PRs, that is fine, go ahead and close this out, but I did want to pass the fix along in case you found it sufficient.

@ctroche99 ctroche99 reopened this May 24, 2026
@ctroche99 ctroche99 changed the title Claude/openwebui tool calling streaming 7h27r Fix: Vertex streaming + native tool calling leading to blank responses May 24, 2026
claude added 6 commits May 24, 2026 03:18
When our pipe calls a tool callable, the tool fires its own events via its
already-bound __event_emitter__ (OWUI injects it via functools.partial before
passing callables to the pipe). This renders rich HTML widgets in the chat UI.

Emitting `replace` at the end of the streaming loop overwrites the entire
message content — including those HTML widgets — with our plain-text
final_content, making the widgets disappear.

Fix: skip the `replace` event when tool_call_blocks is non-empty (i.e., tools
ran this turn). The streamed chat:message:delta events and the final yield
already deliver the text response correctly. The HTML widgets from the tools
persist alongside the text.

Exception: still emit `replace` when grounding is present, since citation
marker injection requires replacing the answer text in-place.
Previous fix only skipped the `replace` event when tool_call_blocks was
non-empty, but with AFC handling tool execution invisibly to our loop,
tool_call_blocks stays at 0 even when a tool fired its HTML widget via
its bound __event_emitter__. As a result, `replace`, `chat:message`,
`chat:finish`, and the final yield all still override the message body
with text-only content — wiping the HTML widget the tool rendered.

Switch the gating signal from `tool_call_blocks` (our loop's view) to
`__tools__` (what was available to the request, regardless of who
executed them). When tools were available this turn, emit completion
signals without content overrides and yield empty string so the
tool-emitted HTML widget remains the canonical message body.

Grounding is still the override exception: it must inject citation
markers into the live text.
…vailable

Earlier attempts emitted chat:message and chat:finish with done=True (no
content) when tools were available, on the theory that omitting `content`
would be enough to preserve tool-emitted HTML widgets. It wasn't: the
done=True flag itself appears to trigger OWUI to re-render the assistant
message in "completed" state using the persisted text-only body, dropping
the live-rendered HTML widget.

Remove the chat:message, chat:finish, and final yield entirely when tools
were available. The async generator completion is enough signal for OWUI's
middleware. Grounding remains an exception — citation injection requires
the replace event regardless.
@olivier-lacroix
Copy link
Copy Markdown
Contributor

@ctroche99 thanks for this! Have been having this blank response issue, which is quite annoying.

Had a quick look: how do you handle the thought signature that Gemini 3 models need to? Without sending it back, things will crash.

Does your solution handle parallel tool calls?

claude added 2 commits May 24, 2026 04:41
The pipe drives its own native function-calling loop, which bypasses
OWUI's middleware — so the 'embeds' and 'files' events the middleware
would normally fire for HTMLResponse cards and base64 image returns
were being dropped, leaving rich tool UI live-only and unpersisted.

Mirror OWUI middleware's process_tool_result locally:

- (HTMLResponse, ctx) tuples -> emit 'embeds' event, send ctx to LLM
- bare HTMLResponse -> emit 'embeds' event, send generic ack to LLM
- 'data:image/...' strings -> emit 'files' event
- dict/list -> JSON-serialise for LLM
- generic tuple / str / None -> sensible coercion

Both the streaming and non-streaming tool execution sites now route
through the shared helper.
…ni 3

Gemini 3 and Gemini 2.5 thinking models attach a thought_signature to
each thought Part. When building the multi-turn history for a tool-call
round-trip, the entire model turn — thoughts AND function_calls — must
be echoed back together. Previously only the function_call parts were
included, causing the API to reject follow-up requests on thinking models.

Collect thought Parts into thought_parts_this_round alongside the
existing function_call_parts_this_round, then prepend them when
constructing the model Content. Applies to both streaming and
non-streaming tool call paths.
@ctroche99
Copy link
Copy Markdown
Author

@ctroche99 thanks for this! Have been having this blank response issue, which is quite annoying.

Had a quick look: how do you handle the thought signature that Gemini 3 models need to? Without sending it back, things will crash.

Does your solution handle parallel tool calls?

Very welcome!

Thought signatures: good catch, I forgot about that requirement; now fixed in 16e3641. I was only including function_call parts in the model turn when building tool-call history, dropping the thought Parts entirely. To fix it I collect thought Parts into thought_parts_this_round alongside the existing function_call_parts_this_round and prepend them when constructing the model Content, so the entire turn — thoughts (with signatures) + function calls — gets echoed back. Applied to both streaming and non-streaming paths.

Parallel tool calls — yes, fully handled. function_call_parts_this_round accumulates every function_call Part the model emits across a single stream round; we iterate them in one loop, execute each, build one FunctionResponse per call, and batch all responses into a single user Content before the follow-up request.

Heads up — two meaningful changes since the PR was opened that you may want to pull:

51f9fcd — emit OWUI embeds/files events for rich tool returns.

Because the pipe drives its own tool-calling loop, it bypasses OWUI's middleware — so tools that return (HTMLResponse, str) tuples or data:image/... base64 strings never get their persisted events fired. Cards would render live then vanish on reload. Added a _process_tool_result_for_owui helper that mirrors what OWUI's middleware process_tool_result does locally: detects HTMLResponse (tuple or bare), image data URLs, dicts/lists, etc., emits {"type": "embeds", "data": {"embeds": [...]}} or {"type": "files", ...}, and returns the LLM-visible string. Rich-UI tools now persist correctly across reloads. This was a bit of a rabbit hole but ultimately worked out very well in the end. I had a tool with a HTML injection that kept wiping the HTML but leaving the context injection. Turns out OWUI uses this 'embed' render tag for non text displays. Updated to catch common ones. Although, I did skip the MCP and external-tool-tuple branches because I had already gone overboard lol.

16e3641 — Gemini 3 thought-signature round-tripping (described above).

The other 6 commits in between were widget-positioning experiments that got reverted — net delta since the PR is just those two fixes.

…ing features

Exposes four individually-toggleable OWUI feature flags for Gemini's
built-in server-side tools, alongside two new Valves for Maps widget
rendering.

New OWUI feature flags (set on model in Settings → Models → Features):
  google_search    — Google Search grounding only (was bundled in google_search_tool)
  url_context      — URL Context grounding only  (was bundled in google_search_tool)
  code_execution   — model writes + runs Python server-side
  google_maps      — Google Maps grounding with optional lat/lng location context

Legacy google_search_tool flag is preserved unchanged (still enables both
Search and URL Context together for backward compatibility).

New Valves:
  GOOGLE_MAPS_ENABLE_WIDGET  — request google_maps_widget_context_token
  GOOGLE_MAPS_API_KEY        — Maps Platform key for rendering the Places widget

Response handling:
  - executable_code parts rendered as fenced code blocks in stream and
    non-stream paths
  - code_execution_result parts rendered as output blocks
  - chunk.maps grounding chunks added to _format_grounding_chunks_as_sources
  - google_maps_widget_context_token emitted as persisted "embeds" event
    when GOOGLE_MAPS_API_KEY is set

Maps lat/lng location context passed via body/params latitude+longitude
fields into tool_config.retrieval_config.lat_lng.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants