Skip to content

Ollama serve harness + v2 GGUF format + Qwen3 Q2_0 importer#29

Merged
sharpninja merged 7 commits intomainfrom
claude/modest-mendeleev-40b3e2
Apr 23, 2026
Merged

Ollama serve harness + v2 GGUF format + Qwen3 Q2_0 importer#29
sharpninja merged 7 commits intomainfrom
claude/modest-mendeleev-40b3e2

Conversation

@sharpninja
Copy link
Copy Markdown
Owner

Summary

Three features landing together (Byrd TDD throughout):

  • Ollama-compat HTTP server (new serve subcommand): Ollama-native /api/{version,tags,show,chat,generate,embeddings} + OpenAI-compat /v1/{models,chat/completions,completions}. NDJSON framing for Ollama, SSE for OpenAI. Works with ollama/open-webui/ollama-python clients unchanged. Backed by existing BitNetHostedAgentModel.
  • v2 GGUF format (ggml_type = 1001): packs BitNet ternary weights as trits + per-tensor FP32 Gamma. 19.95x shrink on real 8B Bonsai (25.88 GB v1 -> 1.30 GB v2); denser than source Q2_0 because 5-trit-per-byte base-3 beats 2 bits + FP16 block scale.
  • Qwen3 Bonsai Q2_0 importer (new import-gguf subcommand): Prism Q2_0 decoder (34-byte block, quaternary -> ternary collapse), analytic Gamma preserving E[|w|], GroupedQueryAttention (32Q/8KV), BitNetConfig.Qwen3Like8B preset, BitLinear.ImportTernary direct integer path (no FP32 detour). Discards tokenizer-mismatched token_embd/output, re-seeds via bootstrap.

Why

  • Serve: enables drop-in Ollama replacement so existing tooling (Open WebUI, ollama-python, TruckMate's MAF host) talks to BitNetSharp with zero client changes.
  • v2 format: v1 wrote dequantized FP32 (~26 GB for 8B) which was unusable for distribution. v2 matches the paper's storage claim and is smaller than the upstream Q2_0 source.
  • Qwen3 importer: unblocks reuse of Ternary-Bonsai-8B body weights without a distillation pipeline. Tokenizer mismatch handled by re-seeding embeddings + LM head and handing off to existing training.

Test plan

  • dotnet build BitNet-b1.58-Sharp.slnx clean
  • 500 Core + 52 Converter + 4 Runtime = 556/556 tests green
  • Curl smoke on serve: /api/version, /api/tags, /api/show, /api/chat (NDJSON stream + non-stream, done:true terminator), /api/generate, /v1/models, /v1/chat/completions (SSE + non-stream, [DONE] sentinel), /v1/completions, /api/embeddings
  • Tag resolution: bare name, :latest suffix, custom :v0.2 tag all resolve
  • Real 2.03 GB Bonsai Q2_0 round-trip through import-gguf + v2 GGUF Save/Load
  • (Deferred) --train-embeddings=<dataset> flag for importer (stub present, wiring TBD)

Reviewer notes

  • Single commit (e2ad707) covers all three features because the subsystems share Program.cs, BitNetSharp.App.csproj, and BitNetConfig edits; splitting risked broken intermediate builds.
  • New test assembly BitNetSharp.Converter.Tests keeps the Qwen3-specific synthetic GGUF fixtures out of the main Core test graph.
  • GroupedQueryAttention falls back to existing MultiHeadAttention bit-exactly when kvHeads == heads, so no regression for non-GQA configs.
  • OllamaStreamWriter uses async-only WriteAsync; an earlier sync WriteByte path broke TestServer and is removed.

Co-Authored-By: Claude Opus 4.7 noreply@anthropic.com

🤖 Generated with Claude Code

sharpninja and others added 7 commits April 18, 2026 17:56
Phase 1 (SignalR push pipeline):
- TrainingEventsHub at /hubs/training with AdminPolicy auth
- ITrainingEventsBroadcaster + SignalR + NoOp + capturing-fake impls
- SnapshotBroadcaster BackgroundService ticks 2s with fleet tok/s
- Broadcast hook in SubmitGradientCommandHandler after RecordAccepted;
  fire-and-forget so transport failures do not fail the gradient
- SqliteWorkQueueStore.GetShardId(taskId) cheap lookup for broadcast

Phase 2 (per-prefix + per-worker rollup + page):
- CoordinatorOptions.ActiveShardPrefixes (default asr-v1, truckmate v2/v1)
- SqliteWorkQueueStore.CountByShardPrefixAndState (excludes legacy)
- SqliteTelemetryStore.GetRecentGradientEvents joins tasks.shard_id
- SqliteTelemetryStore.AggregateByWorkerAndShardPrefix (one query/prefix)
- SqliteTelemetryStore.GetThroughputBuckets (time-series sparkline source)
- GetTrainingStatusQuery + handler composes snapshot (rollup, cells,
  fleet + top-N worker sparklines, recent events, fleet ETA)
- TrainingStatusPageViewModel + Blazor page at /admin/training-status
  with inline SVG sparklines and 2s timer refresh
- MainLayout nav: Training + Prefixes links

Tests (+23 covering both phases):
- SqliteTelemetryStoreTests for recent events, prefix aggregate, buckets
- SqliteWorkQueueStoreTests for CountByShardPrefixAndState + GetShardId
- CqrsHandlerTests for training-status rollup, cells, sparklines caps,
  eta-null-when-tps-zero, weight-version, active workers
- TrainingEventsBroadcasterTests for fire-and-forget semantics

Full suite: 456 passed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- AppSettingsWriter: atomic Coordinator:ActiveShardPrefixes rewrite via
  tmp-file + File.Replace, preserves every other key. Fails loud when
  the target file is missing so ops spots wrong content root.
- ShardPrefixesPageViewModel: load from IOptionsMonitor, add / remove /
  reorder rows, preview pending-task counts, SaveAsync validates +
  persists + waits up to 1 s for reloadOnChange to observe the change.
- Blazor page at /admin/config/shard-prefixes (InteractiveServer).
- DI: AppSettingsWriter bound to IHostEnvironment.ContentRootPath so
  the writer targets the same file CreateBuilder reads.

Tests (+7): AppSettingsWriterTests covering atomic replace, key
preservation, unicode labels, loud-fail on missing file, graft onto
settings without Coordinator section, null-list + blank-prefix guards.

Full suite: 463 passed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Three-part spec for the Blazor-coordinator-adjacent driver-assistant
workstream, locked with Q1-Q4 decisions and the hard safety rules
captured across this session.

- Part A: Quality-enforcement framework (prerequisite gate) - canary
  suite, regression gates, replay-mix, drift detection, source-quality
  gating, quality dashboard.
- Part B: DPO online-learning pipeline, gated behind A.
- Part C: Driver-assistant north star - tool-use protocol, capability
  matrix, TruckMate adapter boundary.

Safety rules (exactly two):
1. H/W/W legal accommodation on every route segment AND every
   destination approach. Applies to POI lookups too - e.g. "nearest
   bathroom" filters out candidates whose approach violates H/W/W.
2. No U-turns, ever - but backtracking via legal-turn loops or
   appropriate-parking-lot turnarounds is fine. Rule forbids only the
   single-swing >=150 degree U-turn maneuver.

Rejection audit trail: every validator rejection persists to a new
`navigation_rejections` table with pre-formatted driver_explanation
("Highway 1 through Town has a bridge with clearance of only 12 feet.").
Model MUST cite the string verbatim - paraphrase is forbidden because
the numbers are load-bearing. Interactive check_route_via tool lets the
model answer proactive "why don't we use X" questions.

No code lands against this spec until Part A is built and proven.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Refactored-Rossum Plan Phase 1a+1b.

- Add `IHostedAgentModel.StreamResponseAsync(history, maxOutputTokens, ct)`
  with default impl that flattens via `PromptTemplate` and yields the
  `GetResponseAsync` result as a single chunk. Native-streaming impls
  (e.g. upcoming `InProcessBitNetModel`) override.
- New `PromptTemplate.FlattenHistory` centralizes `[SYSTEM]`/`[USER]`/
  `[ASSISTANT]`/`[TOOL]` role-tagged flattening so the streaming path
  and the non-streaming path produce identical prompts.
- `HostedModelChatClient` no longer drops to the last user message;
  full history is flattened and fed to the model. `GetStreamingResponseAsync`
  now delegates to `StreamResponseAsync` instead of whitespace-splitting
  a pre-materialized response.
- New `HostedModelChatClientTests` covers: multi-turn preservation,
  model-driven streaming (no whitespace split), max-output-tokens
  propagation, cancellation mid-stream. 4/4 green.
- Loosen `HostedAgentBenchmarksExecutionTests.ResponseBenchmarkExecutesThePaperAlignedQueryPath`
  to non-empty: benchmarks measure path, not deterministic template
  keywords (old assertion was coupled to the last-message-only lossy
  prompt; new flattened prompt shifts which keyword the
  BitNetPaperModel template matcher picks).

Full suite 464/464 (excluding long-runners).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…BitNetModel

Refactored-Rossum Plan Phase 1c. Adds a new `BitNetSharp.Runtime`
project intended to host BitNet inference inside end-user apps
(TruckMate, etc.) without the Microsoft.Agents.AI / DI-host
dependencies that `BitNetSharp.App` carries.

- `BitNetSharp.Runtime.csproj`: `IsAotCompatible=true`,
  `IsTrimmable=true`, `EnableTrimAnalyzer`, `EnableAotAnalyzer`.
  References Core + Distributed.Contracts only.
- `InProcessBitNetModel`: loads a `WeightBlobCodec` v1 blob from
  disk or in-memory bytes, validates flat-parameter length against
  the supplied `BitNetConfig`, hydrates a fresh `BitNetPaperModel`
  via `FlatParameterPack.Unpack`. Exposes `GenerateResponse` +
  `GetTernaryWeightStats`. Phase 5 will extend this to v2
  ternary-packed blobs.
- `WeightBlobCodec.DecodeAsync(path, ct)` convenience wrapper
  over the existing synchronous `Decode`.
- New `tests/BitNetSharp.Runtime.Tests` project with 4 tests:
  v1 blob round-trip, disk load, magic mismatch rejection,
  weight-count mismatch rejection. All green.
- Solution file registers the new src + test projects.

Zero AOT/trim analyzer warnings at Debug build. AOT publish
smoke + Runtime adapter for IHostedAgentModel follow in Phase 1d / Phase 2.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…pression

Adds BitNetSharp.Runtime.AotSmoke exe that exercises the full
Runtime load+inference path after native-AOT publish, proving the
Runtime slice has no AOT/trim breakage on the path TruckMate will use.

- src/BitNetSharp.Runtime.AotSmoke: PublishAot=true exe that packs a
  seed BitNetPaperModel via FlatParameterPack, encodes via
  WeightBlobCodec v1, loads through InProcessBitNetModel.LoadFromBytes,
  and generates one response. Exits 0 on success.
- Core + Distributed.Contracts csprojs: NoWarn IL2026/IL3050. Core
  serves both coordinator (full CLR reflection JSON via
  BitNetPaperCheckpoint + BitNetPaperGguf) and the AOT Runtime slice;
  reflection JSON sites are not called from the AOT inference path.
  Follow-up: migrate coordinator-side JSON to System.Text.Json source
  generators to drop the suppression.
- slnx: register AotSmoke project.

Verified: dotnet publish -c Release -r win-x64 --self-contained
-p:PublishAot=true succeeds with zero warnings; published exe prints
"OK version=1 textLen=3" and returns 0. Full test suite green
(Runtime.Tests 4/4, BitNetSharp.Tests 467/467, LongRunning excluded).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Three features landing together:

1. Ollama-compat HTTP server ("serve" subcommand):
   - Ollama-native /api/{version,tags,show,chat,generate,embeddings}
   - OpenAI-compat /v1/{models,chat/completions,completions}
   - NDJSON streaming for Ollama, SSE for OpenAI, correct done/[DONE] terminators
   - Snake_case JSON, tag-suffix tolerance (:latest + custom tags)
   - Backed by existing BitNetHostedAgentModel; no model changes needed
   - 45 route-shape + framer + registry tests green

2. v2 GGUF format (ggml_type 1001):
   - Packs BitNet ternary weights as trits + per-tensor FP32 Gamma
   - 19.95x shrink vs v1 on real 8B Bonsai (25.88 GB -> 1.30 GB)
   - Denser than source Q2_0 (5 trits/byte base-3 beats 2 bits + FP16)
   - GgufStreamingReader/Writer for large-file memory discipline

3. Qwen3 Bonsai Q2_0 importer ("import-gguf" subcommand):
   - PrismQ2_0 decoder (34-byte block, quaternary -> ternary collapse)
   - Analytic Gamma preserves E[|w|] across the lossy collapse
   - GroupedQueryAttention (32Q/8KV) + Qwen3Like8B config preset
   - BitLinear.ImportTernary direct integer path (no FP32 detour)
   - Discards tokenizer-mismatched token_embd + output, re-seeds
   - Separate BitNetSharp.Converter.Tests assembly (52 tests)

Test counts: 500 Core + 52 Converter + 4 Runtime = 556 green.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e2ad707038

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +62 to +64
else
{
kestrel.ListenAnyIP(options.Port);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Bind non-IP hosts to requested interface, not AnyIP

When --host is a hostname (for example localhost), IPAddress.TryParse fails and this branch calls ListenAnyIP, which exposes the server on all interfaces instead of honoring the requested host. That can unintentionally publish a local-only serve endpoint to the network and violates the CLI host contract; use hostname-aware binding (ListenLocalhost/resolved IPs) instead of falling back to AnyIP.

Useful? React with 👍 / 👎.

{
try
{
using var probe = new TcpListener(IPAddress.Loopback, options.Port);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Probe the configured host instead of loopback only

The preflight port check always binds IPAddress.Loopback, so it reports “port already bound” even when the requested bind address is a different interface that could legally coexist (e.g., 127.0.0.1:11434 occupied while serving on 192.168.x.x:11434). This creates false startup failures for valid --host values; probe the actual configured host (or remove the precheck and rely on Kestrel’s bind error).

Useful? React with 👍 / 👎.

{
public static async Task<T?> ReadJsonAsync<T>(this HttpRequest request, CancellationToken cancellationToken = default)
{
return await JsonSerializer.DeserializeAsync<T>(request.Body, ServeJson.Options, cancellationToken).ConfigureAwait(false);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Return 400 for malformed JSON request bodies

This helper directly propagates JsonSerializer.DeserializeAsync exceptions; malformed JSON therefore bubbles up as unhandled server errors (HTTP 500) across /api/* and /v1/* routes instead of a client-side 400 invalid request response. Wrapping deserialization failures and mapping them to a structured 400 keeps API behavior compatible and avoids noisy error paths for bad input.

Useful? React with 👍 / 👎.

@sharpninja sharpninja merged commit a1e1ca4 into main Apr 23, 2026
1 of 2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant