New: [AEA-6581] - llm evaluation poc by bencegadanyi1-nhs · Pull Request #565 · NHSDigital/eps-assist-me

bencegadanyi1-nhs · 2026-04-29T11:13:34Z

Summary

✨ New Feature

Details

Copilot

Pull request overview

Adds a proof-of-concept LLM/RAG evaluation suite (DeepEval-based) for the EPS chatbot and wires a smoke run into the PR deployment workflow to provide early quality signals.

Changes:

Introduces packages/ragasEvaluation with DeepEval tests, fixtures, and a Bedrock-backed judge model plus a Lambda/KB client.
Adds a new Poetry dependency group (ragasEvaluation) and updates the lockfile accordingly.
Adds make eval-smoke / make eval-full targets and runs smoke evaluation in the PR release workflow.

Reviewed changes

Copilot reviewed 11 out of 14 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
`pyproject.toml`	Adds `ragasEvaluation` dependency group; bumps `urllib3` patch.
`poetry.lock`	Regenerated lockfile to include DeepEval + transitive deps and new group membership.
`packages/ragasEvaluation/tests/test_chatbot_eval.py`	Smoke/full DeepEval test suite driving live chatbot calls and metric evaluation.
`packages/ragasEvaluation/tests/conftest.py`	Session bootstrap + Bedrock judge fixture with xdist-safe caching.
`packages/ragasEvaluation/test_cases.json`	Defines evaluation prompts and ground-truth expectations (incl. smoke subset).
`packages/ragasEvaluation/pytest.ini`	Pytest defaults for the evaluation suite (incl. xdist settings).
`packages/ragasEvaluation/evaluation/chatbot.py`	Direct Lambda invocation + KB retrieve contexts for metric inputs.
`packages/ragasEvaluation/evaluation/bedrock_judge.py`	Custom DeepEval judge model using Bedrock `converse`.
`packages/ragasEvaluation/.deepeval/.deepeval_telemetry.txt`	Adds DeepEval telemetry artefact to repo.
`Makefile`	Adds `eval-smoke` and `eval-full` targets.
`.grype.yaml`	Adds vulnerability ignore list.
`.github/workflows/release_all_stacks.yml`	Adds PR-only “Chatbot RAG Evaluation” job running smoke evaluation.

+python_files = test_*.py
+python_functions = test_*
+pythonpath = .
+addopts = -v --tb=short -n 4


+def pytest_sessionstart(session: pytest.Session) -> None:
+    """Resolve Lambda name and KB ID once before any tests run.
+
+    Uses a file lock so only the first xdist worker calls CloudFormation;
+    the rest read cached values from a shared JSON file.
+    """
+    # Use a stable path so all workers (separate processes) share it.
+    cache_file = Path(tempfile.gettempdir()) / "eval_bootstrap_cache.json"
+    lock_file = Path(tempfile.gettempdir()) / "eval_bootstrap_cache.json.lock"
+
+    with FileLock(str(lock_file)):
+        if cache_file.is_file():
+            data = json.loads(cache_file.read_text())
+            os.environ["_EVAL_LAMBDA_NAME"] = data["lambda_name"]
+            os.environ["_EVAL_KB_ID"] = data["kb_id"]
+        else:
+            bootstrap()
+            cache_file.write_text(


+ - vulnerability: GHSA-38jv-5279-wg99
+ - vulnerability: GHSA-vfmq-68hx-4jfw
+ - vulnerability: GHSA-p423-j2cm-9vmq
+ - vulnerability: GHSA-58qw-9mgm-455v
+ - vulnerability: GHSA-r6ph-v2qm-q3c2
+ - vulnerability: GHSA-6w46-j5rx-g56g
+ - vulnerability: GHSA-gc5v-m9x4-r6x2


+  chatbot_evaluation:
+    name: Chatbot RAG Evaluation
+    runs-on: ubuntu-22.04
+    container:
+      image: ${{ inputs.pinned_image }}
+      options: --user 1001:1001 --group-add 128
+    defaults:
+      run:
+        shell: bash
+    if: ${{ always() && !failure() && !cancelled() && inputs.IS_PULL_REQUEST == true
+      }}
+    needs: [ release_all_code ]
+    permissions:
+      id-token: write
+      contents: read
+    steps:
+      - name: copy .tool-versions
+        run: |
+          cp /home/vscode/.tool-versions "$HOME/.tool-versions"
+
+      - name: Checkout code
+        uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd
+        with:
+          persist-credentials: false
+
+      - name: Configure AWS Credentials
+        uses: aws-actions/configure-aws-credentials@ec61189d14ec14c8efccab744f656cffd0e33f37
+        with:
+          aws-region: eu-west-2
+          role-to-assume: ${{ secrets.DEV_CLOUD_FORMATION_EXECUTE_LAMBDA_ROLE }}
+          role-session-name: eps-assist-me-evaluation
+
+      - name: Install dependencies
+        run: |
+          make install-python
+
+      - name: Run smoke evaluation
+        env:
+          CHATBOT_STACK_NAME: ${{ inputs.STACK_NAME }}
+          AWS_REGION: eu-west-2
+        run: |
+          make eval-smoke


+DEEPEVAL_ID=d0395a29-36cf-4018-afbd-c2834073dfcc
+DEEPEVAL_STATUS=old
+DEEPEVAL_LAST_FEATURE=evaluation
+DEEPEVAL_EVALUATION_STATUS=old


sonarqubecloud · 2026-04-29T15:22:34Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

feat: llm evaluation + ci

602a0e9

Copilot AI review requested due to automatic review settings April 29, 2026 11:13

Copilot started reviewing on behalf of bencegadanyi1-nhs April 29, 2026 11:14 View session

Copilot AI reviewed Apr 29, 2026

View reviewed changes

chore: zizmor

5adde19

bencegadanyi1-nhs temporarily deployed to dev-pr April 29, 2026 12:48 — with GitHub Actions Inactive

chore: adds missing policy action

cddc37b

bencegadanyi1-nhs temporarily deployed to dev-pr April 29, 2026 13:32 — with GitHub Actions Inactive

Merge branch 'main' into AEA-6581-deepeval-poc

c01c47e

bencegadanyi1-nhs temporarily deployed to dev-pr April 29, 2026 15:26 — with GitHub Actions Inactive

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

New: [AEA-6581] - llm evaluation poc#565

New: [AEA-6581] - llm evaluation poc#565
bencegadanyi1-nhs wants to merge 4 commits intomainfrom
AEA-6581-deepeval-poc

bencegadanyi1-nhs commented Apr 29, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

sonarqubecloud Bot commented Apr 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

bencegadanyi1-nhs commented Apr 29, 2026

Summary

Details

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

sonarqubecloud Bot commented Apr 29, 2026

Quality Gate passed

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants