Ollama serve harness + v2 GGUF format + Qwen3 Q2_0 importer #29
sharpninja merged 7 commits into main
Conversation
Phase 1 (SignalR push pipeline):
- TrainingEventsHub at /hubs/training with AdminPolicy auth
- ITrainingEventsBroadcaster + SignalR + NoOp + capturing-fake impls
- SnapshotBroadcaster BackgroundService ticks every 2 s with fleet tok/s
- Broadcast hook in SubmitGradientCommandHandler after RecordAccepted; fire-and-forget so transport failures do not fail the gradient
- SqliteWorkQueueStore.GetShardId(taskId) cheap lookup for broadcast
Phase 2 (per-prefix + per-worker rollup + page):
- CoordinatorOptions.ActiveShardPrefixes (default asr-v1, truckmate v2/v1)
- SqliteWorkQueueStore.CountByShardPrefixAndState (excludes legacy)
- SqliteTelemetryStore.GetRecentGradientEvents joins tasks.shard_id
- SqliteTelemetryStore.AggregateByWorkerAndShardPrefix (one query/prefix)
- SqliteTelemetryStore.GetThroughputBuckets (time-series sparkline source)
- GetTrainingStatusQuery + handler composes the snapshot (rollup, cells, fleet + top-N worker sparklines, recent events, fleet ETA)
- TrainingStatusPageViewModel + Blazor page at /admin/training-status with inline SVG sparklines and a 2 s timer refresh
- MainLayout nav: Training + Prefixes links
Tests (+23 covering both phases):
- SqliteTelemetryStoreTests for recent events, prefix aggregate, buckets
- SqliteWorkQueueStoreTests for CountByShardPrefixAndState + GetShardId
- CqrsHandlerTests for training-status rollup, cells, sparkline caps, eta-null-when-tps-zero, weight-version, active workers
- TrainingEventsBroadcasterTests for fire-and-forget semantics
Full suite: 456 passed.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- AppSettingsWriter: atomic Coordinator:ActiveShardPrefixes rewrite via tmp-file + File.Replace; preserves every other key. Fails loud when the target file is missing so ops spots a wrong content root.
- ShardPrefixesPageViewModel: loads from IOptionsMonitor; add / remove / reorder rows; previews pending-task counts; SaveAsync validates + persists + waits up to 1 s for reloadOnChange to observe the change.
- Blazor page at /admin/config/shard-prefixes (InteractiveServer).
- DI: AppSettingsWriter bound to IHostEnvironment.ContentRootPath so the writer targets the same file CreateBuilder reads.
Tests (+7): AppSettingsWriterTests covering atomic replace, key preservation, unicode labels, loud-fail on missing file, grafting onto settings without a Coordinator section, and null-list + blank-prefix guards.
Full suite: 463 passed.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Three-part spec for the Blazor-coordinator-adjacent driver-assistant
workstream, locked with Q1-Q4 decisions and the hard safety rules
captured across this session.
- Part A: Quality-enforcement framework (prerequisite gate) - canary
suite, regression gates, replay-mix, drift detection, source-quality
gating, quality dashboard.
- Part B: DPO online-learning pipeline, gated behind A.
- Part C: Driver-assistant north star - tool-use protocol, capability
matrix, TruckMate adapter boundary.
Safety rules (exactly two):
1. H/W/W legal accommodation on every route segment AND every
destination approach. Applies to POI lookups too - e.g. "nearest
bathroom" filters out candidates whose approach violates H/W/W.
2. No U-turns, ever - but backtracking via legal-turn loops or
appropriate-parking-lot turnarounds is fine. Rule forbids only the
single-swing >=150 degree U-turn maneuver.
Rejection audit trail: every validator rejection persists to a new
`navigation_rejections` table with pre-formatted driver_explanation
("Highway 1 through Town has a bridge with clearance of only 12 feet.").
Model MUST cite the string verbatim - paraphrase is forbidden because
the numbers are load-bearing. Interactive check_route_via tool lets the
model answer proactive "why don't we use X" questions.
No code lands against this spec until Part A is built and proven.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
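The rejection audit trail above can be sketched in a few lines. This is an illustrative shape only, not the actual schema: the record fields beyond `driver_explanation` and the `CitesVerbatim` helper are assumptions for this sketch. The point it demonstrates is the verbatim-citation rule: the model's answer must contain the stored explanation string exactly, because the numbers in it are load-bearing.

```csharp
using System;

// Hypothetical row shape for the `navigation_rejections` table;
// only DriverExplanation is named in the spec.
public sealed record NavigationRejection(
    string RouteSegmentId,     // assumed: which candidate segment was rejected
    string RuleViolated,       // assumed: e.g. "H/W/W" or "NoUTurn"
    string DriverExplanation); // pre-formatted, e.g. the 12-foot-bridge string

public static class RejectionCitation
{
    // Paraphrase is forbidden: accept the answer only if it contains the
    // stored explanation verbatim (ordinal comparison, no normalization).
    public static bool CitesVerbatim(string modelAnswer, NavigationRejection r)
        => modelAnswer.Contains(r.DriverExplanation, StringComparison.Ordinal);
}
```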
Refactored-Rossum Plan Phase 1a+1b.
- Add `IHostedAgentModel.StreamResponseAsync(history, maxOutputTokens, ct)` with a default impl that flattens via `PromptTemplate` and yields the `GetResponseAsync` result as a single chunk. Native-streaming impls (e.g. the upcoming `InProcessBitNetModel`) override.
- New `PromptTemplate.FlattenHistory` centralizes `[SYSTEM]`/`[USER]`/`[ASSISTANT]`/`[TOOL]` role-tagged flattening so the streaming path and the non-streaming path produce identical prompts.
- `HostedModelChatClient` no longer drops to the last user message; the full history is flattened and fed to the model. `GetStreamingResponseAsync` now delegates to `StreamResponseAsync` instead of whitespace-splitting a pre-materialized response.
- New `HostedModelChatClientTests` covers multi-turn preservation, model-driven streaming (no whitespace split), max-output-tokens propagation, and cancellation mid-stream. 4/4 green.
- Loosen `HostedAgentBenchmarksExecutionTests.ResponseBenchmarkExecutesThePaperAlignedQueryPath` to assert non-empty output: benchmarks measure the path, not deterministic template keywords (the old assertion was coupled to the lossy last-message-only prompt; the new flattened prompt shifts which keyword the BitNetPaperModel template matcher picks).
Full suite 464/464 (excluding long-runners).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…BitNetModel
Refactored-Rossum Plan Phase 1c. Adds a new `BitNetSharp.Runtime` project intended to host BitNet inference inside end-user apps (TruckMate, etc.) without the Microsoft.Agents.AI / DI-host dependencies that `BitNetSharp.App` carries.
- `BitNetSharp.Runtime.csproj`: `IsAotCompatible=true`, `IsTrimmable=true`, `EnableTrimAnalyzer`, `EnableAotAnalyzer`. References Core + Distributed.Contracts only.
- `InProcessBitNetModel`: loads a `WeightBlobCodec` v1 blob from disk or in-memory bytes, validates the flat-parameter length against the supplied `BitNetConfig`, and hydrates a fresh `BitNetPaperModel` via `FlatParameterPack.Unpack`. Exposes `GenerateResponse` + `GetTernaryWeightStats`. Phase 5 will extend this to v2 ternary-packed blobs.
- `WeightBlobCodec.DecodeAsync(path, ct)`: convenience wrapper over the existing synchronous `Decode`.
- New `tests/BitNetSharp.Runtime.Tests` project with 4 tests: v1 blob round-trip, disk load, magic-mismatch rejection, weight-count-mismatch rejection. All green.
- Solution file registers the new src + test projects.
Zero AOT/trim analyzer warnings at Debug build. AOT publish smoke + a Runtime adapter for IHostedAgentModel follow in Phase 1d / Phase 2.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…pression
Adds a BitNetSharp.Runtime.AotSmoke exe that exercises the full Runtime load + inference path after native-AOT publish, proving the Runtime slice has no AOT/trim breakage on the path TruckMate will use.
- src/BitNetSharp.Runtime.AotSmoke: `PublishAot=true` exe that packs a seed BitNetPaperModel via FlatParameterPack, encodes via WeightBlobCodec v1, loads through InProcessBitNetModel.LoadFromBytes, and generates one response. Exits 0 on success.
- Core + Distributed.Contracts csprojs: NoWarn IL2026/IL3050. Core serves both the coordinator (full CLR reflection JSON via BitNetPaperCheckpoint + BitNetPaperGguf) and the AOT Runtime slice; the reflection JSON sites are not called from the AOT inference path. Follow-up: migrate coordinator-side JSON to System.Text.Json source generators to drop the suppression.
- slnx: register the AotSmoke project.
Verified: `dotnet publish -c Release -r win-x64 --self-contained -p:PublishAot=true` succeeds with zero warnings; the published exe prints "OK version=1 textLen=3" and returns 0. Full test suite green (Runtime.Tests 4/4, BitNetSharp.Tests 467/467, LongRunning excluded).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Three features landing together:
1. Ollama-compat HTTP server ("serve" subcommand):
- Ollama-native /api/{version,tags,show,chat,generate,embeddings}
- OpenAI-compat /v1/{models,chat/completions,completions}
- NDJSON streaming for Ollama, SSE for OpenAI, correct done/[DONE] terminators
- Snake_case JSON, tag-suffix tolerance (:latest + custom tags)
- Backed by existing BitNetHostedAgentModel; no model changes needed
- 45 route-shape + framer + registry tests green
2. v2 GGUF format (ggml_type 1001):
- Packs BitNet ternary weights as trits + per-tensor FP32 Gamma
- 19.95x shrink vs v1 on real 8B Bonsai (25.88 GB -> 1.30 GB)
- Denser than source Q2_0 (5 trits/byte base-3 beats 2 bits + FP16)
- GgufStreamingReader/Writer for large-file memory discipline
3. Qwen3 Bonsai Q2_0 importer ("import-gguf" subcommand):
- PrismQ2_0 decoder (34-byte block, quaternary -> ternary collapse)
- Analytic Gamma preserves E[|w|] across the lossy collapse
- GroupedQueryAttention (32Q/8KV) + Qwen3Like8B config preset
- BitLinear.ImportTernary direct integer path (no FP32 detour)
- Discards tokenizer-mismatched token_embd + output, re-seeds
- Separate BitNetSharp.Converter.Tests assembly (52 tests)
Test counts: 500 Core + 52 Converter + 4 Runtime = 556 green.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
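The density claim in feature 2 can be sketched with a toy packer. This is not the actual v2 GGUF codec — `PackTrits`/`UnpackTrits` are hypothetical names for illustration — but it shows why 5 base-3 trits fit in one byte: 3^5 = 243 <= 256, i.e. 1.6 bits per ternary weight versus Q2_0's 2-bit codes plus an FP16 block scale.

```csharp
using System;

// Illustrative sketch only, not the real v2 GGUF writer.
public static class TritPack
{
    // trits holds values in {-1, 0, +1}; 5 trits pack into one byte.
    public static byte[] PackTrits(sbyte[] trits)
    {
        var bytes = new byte[(trits.Length + 4) / 5];
        for (int i = 0; i < trits.Length; i++)
        {
            int digit = trits[i] + 1; // map {-1,0,1} -> base-3 digit {0,1,2}
            bytes[i / 5] = (byte)(bytes[i / 5] + digit * Pow3(i % 5));
        }
        return bytes;
    }

    public static sbyte[] UnpackTrits(byte[] bytes, int count)
    {
        var trits = new sbyte[count];
        for (int i = 0; i < count; i++)
            trits[i] = (sbyte)((bytes[i / 5] / Pow3(i % 5)) % 3 - 1);
        return trits;
    }

    // Max packed value is 2*(1+3+9+27+81) = 242, which fits in a byte.
    private static int Pow3(int n) => n switch { 0 => 1, 1 => 3, 2 => 9, 3 => 27, _ => 81 };
}
```

Per-tensor FP32 Gamma then scales the unpacked {-1, 0, +1} values back to weight magnitudes.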
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e2ad707038
else
{
    kestrel.ListenAnyIP(options.Port);
Bind non-IP hosts to requested interface, not AnyIP
When --host is a hostname (for example localhost), IPAddress.TryParse fails and this branch calls ListenAnyIP, which exposes the server on all interfaces instead of honoring the requested host. That can unintentionally publish a local-only serve endpoint to the network and violates the CLI host contract; use hostname-aware binding (ListenLocalhost/resolved IPs) instead of falling back to AnyIP.
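One possible shape of the suggested fix, sketched under assumptions: `BindHost` is a hypothetical helper name, and the real `options` type is not shown in the excerpt. It parses literal IPs, special-cases `localhost`, and binds each DNS-resolved address instead of falling back to all interfaces.

```csharp
using System;
using System.Net;
using Microsoft.AspNetCore.Server.Kestrel.Core;

static class HostBindingSketch
{
    // Hypothetical helper: honor the requested --host instead of AnyIP.
    public static void BindHost(KestrelServerOptions kestrel, string host, int port)
    {
        if (IPAddress.TryParse(host, out var ip))
        {
            kestrel.Listen(ip, port);
        }
        else if (string.Equals(host, "localhost", StringComparison.OrdinalIgnoreCase))
        {
            kestrel.ListenLocalhost(port); // binds 127.0.0.1 and [::1] only
        }
        else
        {
            // Bind each resolved address rather than every interface.
            foreach (var resolved in Dns.GetHostAddresses(host))
                kestrel.Listen(resolved, port);
        }
    }
}
```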
{
    try
    {
        using var probe = new TcpListener(IPAddress.Loopback, options.Port);
Probe the configured host instead of loopback only
The preflight port check always binds IPAddress.Loopback, so it reports “port already bound” even when the requested bind address is a different interface that could legally coexist (e.g., 127.0.0.1:11434 occupied while serving on 192.168.x.x:11434). This creates false startup failures for valid --host values; probe the actual configured host (or remove the precheck and rely on Kestrel’s bind error).
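A host-aware preflight along the lines the review suggests could look like this. `ProbePort` is a hypothetical helper name for this sketch; the key change is that it binds the address the server will actually serve on, not hard-coded loopback.

```csharp
using System;
using System.Net;
using System.Net.Sockets;

static class PortProbeSketch
{
    // Returns true when the port is free on the requested interface.
    public static bool ProbePort(IPAddress bindAddress, int port)
    {
        try
        {
            var probe = new TcpListener(bindAddress, port);
            probe.Start(); // actually bind; constructing alone does not
            probe.Stop();
            return true;
        }
        catch (SocketException)
        {
            return false; // occupied on the interface we actually care about
        }
    }
}
```

Even this can race with another process between probe and bind, which is why the review also floats dropping the precheck and relying on Kestrel's bind error.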
{
    public static async Task<T?> ReadJsonAsync<T>(this HttpRequest request, CancellationToken cancellationToken = default)
    {
        return await JsonSerializer.DeserializeAsync<T>(request.Body, ServeJson.Options, cancellationToken).ConfigureAwait(false);
Return 400 for malformed JSON request bodies
This helper directly propagates JsonSerializer.DeserializeAsync exceptions; malformed JSON therefore bubbles up as unhandled server errors (HTTP 500) across /api/* and /v1/* routes instead of a client-side 400 invalid request response. Wrapping deserialization failures and mapping them to a structured 400 keeps API behavior compatible and avoids noisy error paths for bad input.
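A minimal sketch of that mapping, written against a plain `Stream` so it is self-contained; in the real helper the stream would be `request.Body` and the options `ServeJson.Options`. The names `ReadJsonOrThrowAsync` and `BadJsonRequestException` are assumptions for this sketch, not the PR's actual API — the idea is that route handlers catch the typed exception and return a structured 400.

```csharp
using System;
using System.IO;
using System.Text.Json;
using System.Threading;
using System.Threading.Tasks;

public sealed class BadJsonRequestException(string message) : Exception(message);

public static class JsonBodySketch
{
    public static async Task<T> ReadJsonOrThrowAsync<T>(
        Stream body, JsonSerializerOptions options, CancellationToken ct = default)
    {
        try
        {
            var value = await JsonSerializer.DeserializeAsync<T>(body, options, ct)
                .ConfigureAwait(false);
            if (value is null)
                throw new BadJsonRequestException("request body was empty");
            return value;
        }
        catch (JsonException ex)
        {
            // Malformed JSON becomes a typed client error (mapped to 400)
            // instead of bubbling up as an unhandled 500.
            throw new BadJsonRequestException(ex.Message);
        }
    }
}
```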
Summary
Three features landing together (Byrd TDD throughout):
1. Ollama-compat HTTP server (serve subcommand): Ollama-native /api/{version,tags,show,chat,generate,embeddings} + OpenAI-compat /v1/{models,chat/completions,completions}. NDJSON framing for Ollama, SSE for OpenAI. Works with ollama / open-webui / ollama-python clients unchanged. Backed by the existing BitNetHostedAgentModel.
2. v2 GGUF format (ggml_type = 1001): packs BitNet ternary weights as trits + per-tensor FP32 Gamma. 19.95x shrink on real 8B Bonsai (25.88 GB v1 -> 1.30 GB v2); denser than source Q2_0 because 5-trit-per-byte base-3 beats 2 bits + an FP16 block scale.
3. Qwen3 Bonsai Q2_0 importer (import-gguf subcommand): Prism Q2_0 decoder (34-byte block, quaternary -> ternary collapse), analytic Gamma preserving E[|w|], GroupedQueryAttention (32Q/8KV), BitNetConfig.Qwen3Like8B preset, BitLinear.ImportTernary direct integer path (no FP32 detour). Discards tokenizer-mismatched token_embd / output, re-seeds via bootstrap.
Why
Test plan
- dotnet build BitNet-b1.58-Sharp.slnx clean
- serve: /api/version, /api/tags, /api/show, /api/chat (NDJSON stream + non-stream, done:true terminator), /api/generate, /v1/models, /v1/chat/completions (SSE + non-stream, [DONE] sentinel), /v1/completions, /api/embeddings
- :latest suffix and custom :v0.2 tags all resolve
- import-gguf + v2 GGUF Save/Load
- --train-embeddings=<dataset> flag for importer (stub present, wiring TBD)
Reviewer notes
- Commit e2ad707 covers all three features because the subsystems share Program.cs, BitNetSharp.App.csproj, and BitNetConfig edits; splitting risked broken intermediate builds.
- A separate BitNetSharp.Converter.Tests assembly keeps the Qwen3-specific synthetic GGUF fixtures out of the main Core test graph.
- GroupedQueryAttention falls back to the existing MultiHeadAttention bit-exactly when kvHeads == heads, so no regression for non-GQA configs.
- OllamaStreamWriter uses async-only WriteAsync; an earlier sync WriteByte path broke TestServer and is removed.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
🤖 Generated with Claude Code