Skip to content

Add OpenTelemetry tracing and metrics#2019

Open
edknv wants to merge 7 commits into
NVIDIA:mainfrom
edknv:edwardk/otel
Open

Add OpenTelemetry tracing and metrics#2019
edknv wants to merge 7 commits into
NVIDIA:mainfrom
edknv:edwardk/otel

Conversation

@edknv
Copy link
Copy Markdown
Collaborator

@edknv edknv commented May 12, 2026

Description

image

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • If adjusting docker-compose.yaml environment variables have you ensured those are mimicked in the Helm values.yaml file.

@edknv edknv requested review from a team as code owners May 12, 2026 02:56
@edknv edknv requested a review from nkmcalli May 12, 2026 02:56
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 12, 2026

Greptile Summary

This PR adds OpenTelemetry tracing and metrics to the NeMo Retriever pipeline — both library mode (via OTELConfig / configure()) and service mode — wiring spans across the inprocess executor, Ray batch actors, and service-mode worker subprocesses via W3C trace-context propagation.

  • New observability/ package with a public OTELConfig dataclass, configure() helper, attribute constants, reusable operator_span/pipeline_span context managers, and Ray actor span wrapper.
  • Service-mode integration: OTELServiceConfig added to ServiceConfig, FastAPI auto-instrumented, per-page trace_context carrier injected at submit time for cross-process parent linkage, worker OTEL shutdown registered via atexit.
  • Metric counters for jobs submitted/completed/failed, pages completed/failed, and per-operator duration histograms, all guarded by safe_add so telemetry never breaks ingestion.

Confidence Score: 4/5

Safe to merge after fixing the unfinalised step-1 probe span; all other changes are additive and well-guarded.

The step-1 health-check span in probe.py exits without _finalize_probe_span whenever the health endpoint returns non-2xx — the exact path the code comment calls out as the common cloud-NIM case. Every probe of a remote endpoint will emit a span missing the probe_status attribute and with UNSET status, silently breaking any dashboard query that filters on that field. The rest of the observability infrastructure is solid and previously raised concerns were addressed in follow-up commits.

nemo_retriever/src/nemo_retriever/nim/probe.py — the step-1 span finalization gap on non-2xx responses.

Important Files Changed

Filename Overview
nemo_retriever/src/nemo_retriever/nim/probe.py Step 1 span exits without calling _finalize_probe_span when the health check returns a non-2xx status — the most common path triggering step 2. Spans lack probe_status and have UNSET status in this case.
nemo_retriever/src/nemo_retriever/observability/configure.py New file: one-shot OTEL SDK setup with lazy imports, provider wiring, reset_instrument_cache() call post-configure, and atexit-registered shutdown; looks correct.
nemo_retriever/src/nemo_retriever/observability/spans.py New file: operator_span and pipeline_span context managers with clean exception capture, metric recording in finally, and safe_add guards; correct.
nemo_retriever/src/nemo_retriever/observability/ray_integration.py New file: RayOperatorSpanWrapper correctly injects operator into actor, extracts parent context once at construction, and wraps each batch in an operator span.
nemo_retriever/src/nemo_retriever/service/processing/pool.py Adds _JobProgress dataclass, _resolve_batch_span_parents, worker OTEL init with atexit registration, trace_context per-page carrier, and lifecycle metric counters; addressed concerns from prior review rounds.
nemo_retriever/src/nemo_retriever/service/app.py OTEL configured before FastAPI app creation (required for middleware ordering), otel_shutdown stored on app.state and called in lifespan teardown; correct sequence.
nemo_retriever/src/nemo_retriever/service/config.py New OTELServiceConfig Pydantic model with sensible defaults; resource_attributes uses empty-dict sentinel that callers convert to None before passing to OTELConfig.
nemo_retriever/src/nemo_retriever/graph/executor.py Wraps each inprocess operator in operator_span, injects W3C parent carrier for Ray batch operators, and propagates OTEL env vars into runtime_env.
nemo_retriever/src/nemo_retriever/graph_ingestor.py Adds otel param to GraphIngestor.__init__, registers shutdown with atexit, and wraps pipeline execution in pipeline_span; correct handling of library-mode OTEL lifecycle.
nemo_retriever/tests/test_observability.py New test file covering operator span happy/failure paths, service-mode batch context inheritance, and Ray actor parent linkage; session-scoped providers work around OTEL single-install limitation.

Sequence Diagram

sequenceDiagram
    participant Client
    participant FastAPI as FastAPI (HTTP span)
    participant IngestRouter as ingest.py
    participant Pool as ProcessingPool
    participant Worker as Worker Subprocess
    participant Ray as Ray Actor

    Client->>FastAPI: POST /ingest (W3C traceparent)
    FastAPI->>IngestRouter: ingest_document()
    IngestRouter->>IngestRouter: tag_job / tag_upload
    IngestRouter->>Pool: "try_submit(trace_context=inject_current_context())"
    Pool-->>Worker: PageDescriptor (trace_context carrier)

    Note over Worker: _run_pipeline_batch()
    Worker->>Worker: _resolve_batch_span_parents(carrier)
    Worker->>Worker: pool.run_pipeline_batch span (parented to HTTP span)
    loop each operator
        Worker->>Worker: operator_span(stage_name)
    end
    Worker-->>Pool: BatchWorkerResult
    Pool->>Pool: _handle_job_completion + SSE

    Note over Ray: RayDataExecutor (batch mode)
    Pool->>Ray: ds.map_batches(RayOperatorSpanWrapper)
    Ray->>Ray: extract_context(carrier) at __init__
    loop each batch
        Ray->>Ray: "operator_span(node_name, parent_context=driver_ctx)"
    end
Loading
Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
nemo_retriever/src/nemo_retriever/nim/probe.py:134-138
**Step 1 span exits unfinalized on non-2xx health response**

When the health endpoint returns a non-2xx status (e.g. 404 for cloud NIMs that don't expose `/v1/health/ready`) the `with _probe_span` block exits normally without any `_finalize_probe_span` call. The comment at line 184 explicitly calls this the *common path* that triggers step 2, so every step-2 probe emits a span without `probe_status` and with UNSET span status — breaking any dashboard filter on that attribute.

```suggestion
            span.set_attribute("http.status_code", int(resp.status_code))
            if resp.ok:
                _finalize_probe_span(span, "ok")
                _probe_results.append(ProbeResult(url=health_url, name=name, prefix=prefix, status="ok"))
                return
            _finalize_probe_span(span, "non_2xx", f"http {resp.status_code}")
```

Reviews (4): Last reviewed commit: "add uv.lock" | Re-trigger Greptile

Comment thread nemo_retriever/src/nemo_retriever/service/processing/pool.py Outdated
Comment thread nemo_retriever/src/nemo_retriever/observability/configure.py
Comment thread nemo_retriever/src/nemo_retriever/service/processing/pool.py Outdated
Comment thread nemo_retriever/src/nemo_retriever/graph_ingestor.py Outdated
Comment thread nemo_retriever/src/nemo_retriever/service/processing/pool.py
Comment on lines +134 to +138
span.set_attribute("http.status_code", int(resp.status_code))
if resp.ok:
_finalize_probe_span(span, "ok")
_probe_results.append(ProbeResult(url=health_url, name=name, prefix=prefix, status="ok"))
return
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Step 1 span exits unfinalized on non-2xx health response

When the health endpoint returns a non-2xx status (e.g. 404 for cloud NIMs that don't expose /v1/health/ready) the with _probe_span block exits normally without any _finalize_probe_span call. The comment at line 184 explicitly calls this the common path that triggers step 2, so every step-2 probe emits a span without probe_status and with UNSET span status — breaking any dashboard filter on that attribute.

Suggested change
span.set_attribute("http.status_code", int(resp.status_code))
if resp.ok:
_finalize_probe_span(span, "ok")
_probe_results.append(ProbeResult(url=health_url, name=name, prefix=prefix, status="ok"))
return
span.set_attribute("http.status_code", int(resp.status_code))
if resp.ok:
_finalize_probe_span(span, "ok")
_probe_results.append(ProbeResult(url=health_url, name=name, prefix=prefix, status="ok"))
return
_finalize_probe_span(span, "non_2xx", f"http {resp.status_code}")
Prompt To Fix With AI
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/nim/probe.py
Line: 134-138

Comment:
**Step 1 span exits unfinalized on non-2xx health response**

When the health endpoint returns a non-2xx status (e.g. 404 for cloud NIMs that don't expose `/v1/health/ready`) the `with _probe_span` block exits normally without any `_finalize_probe_span` call. The comment at line 184 explicitly calls this the *common path* that triggers step 2, so every step-2 probe emits a span without `probe_status` and with UNSET span status — breaking any dashboard filter on that attribute.

```suggestion
            span.set_attribute("http.status_code", int(resp.status_code))
            if resp.ok:
                _finalize_probe_span(span, "ok")
                _probe_results.append(ProbeResult(url=health_url, name=name, prefix=prefix, status="ok"))
                return
            _finalize_probe_span(span, "non_2xx", f"http {resp.status_code}")
```

How can I resolve this? If you propose a fix, please make it concise.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant