chore: Harden kernel + searcher container against AF_ALG/algif_* bug class #142

Draft
MoeMahhouk wants to merge 6 commits into main from moe/copy-fail-defense-in-depth


@MoeMahhouk MoeMahhouk commented May 4, 2026

Summary

Defense-in-depth follow-up to #138 (the kernel bump that fixes CVE-2026-31431 / copy.fail). The kernel-level vulnerability is already patched; this PR removes the AF_ALG userspace surface that the copy.fail exploit chain runs through, so any future bug in the same family doesn't have a ready entry point on this image.

Three commits, ordered by increasing scope of removal, so the last one (or two) is easy to drop if pre-merge testing surfaces an unexpected AF_ALG consumer.

Changes

Block AF_ALG in searcher container seccomp profile

Container-only. The existing socket() rule already blocks AF_VSOCK (family 40); extend the same rule to also deny AF_ALG (family 38). One rule, two AND-ed constraints per the OCI seccomp spec. No build/image impact, only runtime behaviour inside the searcher container.
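
A minimal sketch of how that single rule is meant to evaluate, assuming the rule's action is an allow, the profile's default action is a deny, and both constraints use `SCMP_CMP_NE` on argument 0. The dict mirrors the field names of an OCI runtime-spec `syscalls` entry; `rule_matches` and `socket_allowed` are hypothetical helpers written for illustration, not part of any seccomp library. The sketch models the AND semantics only; it does not exercise libseccomp or the container runtime, so how the runtime translates two comparisons on the same argument index is worth confirming separately.

```python
# Illustrative model of one OCI seccomp "syscalls" entry. Field names
# (names, action, args, index, value, op) follow the OCI runtime spec;
# the evaluator below is a hypothetical helper, not a real seccomp API.
AF_ALG = 38
AF_VSOCK = 40

SOCKET_RULE = {
    "names": ["socket"],
    "action": "SCMP_ACT_ALLOW",  # assumed: allow rule over a deny default
    "args": [
        {"index": 0, "value": AF_VSOCK, "op": "SCMP_CMP_NE"},
        {"index": 0, "value": AF_ALG, "op": "SCMP_CMP_NE"},
    ],
}


def rule_matches(rule, syscall_name, syscall_args):
    """Return True when every arg constraint holds (constraints are AND-ed)."""
    if syscall_name not in rule["names"]:
        return False
    for cond in rule["args"]:
        if cond["op"] != "SCMP_CMP_NE":
            raise ValueError("this sketch only models SCMP_CMP_NE")
        if syscall_args[cond["index"]] == cond["value"]:
            return False
    return True


def socket_allowed(family):
    """socket(family, SOCK_STREAM, 0) passes only if the allow rule matches;
    otherwise the (assumed) deny default action applies."""
    return rule_matches(SOCKET_RULE, "socket", [family, 1, 0])
```

Under these assumptions, `socket_allowed(2)` (AF_INET) is true, while `socket_allowed(38)` and `socket_allowed(40)` are both false.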

Drop CONFIG_CRYPTO_USER_API_*

Removes the AF_ALG family from the kernel entirely (host-wide, not just inside the searcher container). Verified by source-reading that no userspace process on the image consumes AF_ALG:

  • tdx-init: Go stdlib crypto/hmac + crypto/sha256 (pure userspace), then shells out to cryptsetup.
  • cryptsetup: Debian build uses libgcrypt + libargon2 in userspace; dm-crypt uses the in-kernel skcipher API directly, not via the AF_ALG userspace surface.
  • Lighthouse and the rest of the image userspace: ring / aes-gcm.

The previous # For tdx-init annotation in 10-bob was inaccurate — these flags weren't actually being used. Replaced with a comment explaining the rationale.

Pin CONFIG_CRYPTO_AUTHENCESN=n

authencesn is the AEAD template at the heart of the copy.fail bug. Its only intended in-tree consumer is the kernel's IPsec stack when a tunnel is configured with the Extended Sequence Number option, and IPsec is fully disabled on this image (CONFIG_INET_AH/ESP/INET6_AH/INET6_ESP all "not set" in 01-sane-defaults). Pinning it off explicitly removes the algorithm even if Debian's cloud config inherits =y.
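
As a rough aid for reviewing the resolved kernel config, the sketch below reports which of the symbols this PR expects to be off are still enabled. It assumes the `CRYPTO_USER_API_*` family expands to the four upstream symbols behind algif_hash / algif_skcipher / algif_rng / algif_aead; `config_state` and `still_enabled` are hypothetical helper names, and the location of the resolved `.config` depends on the build setup.

```python
import re

# Symbols this PR expects to be off. The CRYPTO_USER_API_* names are the
# upstream Kconfig symbols behind algif_hash/skcipher/rng/aead (assumed to
# be the full set covered by CONFIG_CRYPTO_USER_API_* here).
EXPECTED_OFF = [
    "CRYPTO_USER_API_HASH",
    "CRYPTO_USER_API_SKCIPHER",
    "CRYPTO_USER_API_RNG",
    "CRYPTO_USER_API_AEAD",
    "CRYPTO_AUTHENCESN",
]


def config_state(config_text, symbol):
    """Return the value after '=' if CONFIG_<symbol> is set, or None when
    the symbol is '# ... is not set' or absent from the config text."""
    pattern = re.compile(r"^CONFIG_%s=(.*)$" % re.escape(symbol), re.MULTILINE)
    match = pattern.search(config_text)
    return match.group(1) if match else None


def still_enabled(config_text, symbols=EXPECTED_OFF):
    """List the symbols from `symbols` that are still enabled."""
    return [s for s in symbols if config_state(config_text, s) is not None]
```

An empty list from `still_enabled(...)` on the rebuilt image's config would match the intent of the two kernel-config commits.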

Pre-merge verification

  • Boot the rebuilt image; journalctl -b | grep -i 'AF_ALG\|algif' is empty (no warnings about a missing crypto API).
  • tdx-init set-passphrase end-to-end works (LUKS format → token import → header restore → MAC write → open → ext4 mount).
  • lighthouse restarts cleanly via systemctl restart lighthouse.
  • Searcher container init path completes (init-container.sh runs, container reaches running state, sshd listens on the published port).
  • Optional: strace -e trace=socket podman exec searcher-container <typical-cmd> 2>&1 | grep AF_ALG returns nothing.
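
A direct probe can complement the strace check. The sketch below simply attempts `socket(AF_ALG, SOCK_SEQPACKET, 0)` and reports the outcome; it assumes a Linux host with Python available in whichever environment is being tested (host or container), and `probe_family` is a hypothetical helper name. Interpreting the result is also an assumption: `EAFNOSUPPORT` would suggest the kernel no longer provides the family, a different errno (whatever the seccomp profile is configured to return) would suggest the filter rejected the call, and `ok` would mean the surface is still reachable.

```python
import errno
import socket

AF_ALG = 38  # numeric family value used in this PR's seccomp rule


def probe_family(family, sock_type=socket.SOCK_SEQPACKET):
    """Try socket(family, sock_type, 0) and return 'ok' or the errno name."""
    try:
        sock = socket.socket(family, sock_type, 0)
    except OSError as exc:
        # Fall back to the raw number if the errno has no symbolic name.
        return errno.errorcode.get(exc.errno, str(exc.errno))
    sock.close()
    return "ok"


if __name__ == "__main__":
    print("AF_ALG:", probe_family(AF_ALG))
```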

If anything regresses, individual commits can be git revert'd cleanly without disturbing the others.

Not in scope

This PR is not the fix for CVE-2026-31431; that fix landed in #138 via the kernel bump. This PR removes the userspace entry point so the same bug class can't be reached again on this image without both a kernel CVE and a config regression.

MoeMahhouk added 3 commits May 4, 2026 10:12
Defense in depth against the AF_ALG/algif_aead syscall surface that copy.fail
(CVE-2026-31431) abuses. The existing socket() rule already blocks AF_VSOCK
(family 40); extend the same rule to also block AF_ALG (family 38).

Multiple args in a single seccomp rule are AND-ed per the OCI spec, so the
rule now allows socket() only when arg[0] is neither 40 nor 38.

The AF_ALG userspace crypto API (algif_hash / algif_skcipher / algif_rng /
algif_aead) was enabled with a '# For tdx-init' annotation, but tdx-init
itself uses Go's stdlib crypto/hmac + crypto/sha256 (pure userspace) and
shells out to cryptsetup, which on Debian uses libgcrypt + libargon2 for
PBKDF and dm-crypt for actual block encryption -- dm-crypt talks to the
in-kernel skcipher API directly, not via the AF_ALG userspace surface.
Lighthouse and the rest of the image userspace use ring / aes-gcm.

Removing the surface eliminates the entry point for CVE-2026-31431
(copy.fail) at the kernel level and shrinks the surface for any future
algif_* CVE.

Pre-merge: boot the rebuilt image and confirm 'journalctl -b' has no
AF_ALG/algif_* warnings, and that 'tdx-init set-passphrase' / lighthouse
restart / searcher container init paths all work end-to-end.

authencesn is an AEAD template whose only intended in-tree consumer is the
IPsec/XFRM stack when an SA has the Extended Sequence Number flag set.
IPsec is disabled on this image (CONFIG_INET_AH/ESP/INET6_AH/INET6_ESP all
'not set' in 01-sane-defaults), so authencesn has no in-tree user here.

Pinning it off explicitly removes the algorithm even if Debian's cloud
config inherits it as =y, and removes the specific code path that the
copy.fail bug rearranges -- belt-and-suspenders alongside the AF_ALG
removal in the previous commit.
@MoeMahhouk MoeMahhouk requested a review from niccoloraspa May 4, 2026 10:28
MoeMahhouk added 3 commits May 8, 2026 09:10
…er seccomp profile

Defense in depth against the RxRPC and PF_KEY/XFRM kernel codepaths.
The existing socket() rule already blocks AF_VSOCK (40) and AF_ALG (38);
extend the same rule to also block AF_RXRPC (33) and AF_KEY (15).

Numeric values verified against include/linux/socket.h (PF_RXRPC = 33,
PF_KEY = 15) -- same lesson learned from copy.fail, where the rule
intended to block AF_ALG was blocking AF_VSOCK because the constant
was off by two.

Multiple args in a single seccomp rule are AND-ed per the OCI spec, so
the rule now allows socket() only when arg[0] is none of {15, 33, 38, 40}.

The host kernel does not currently compile any of these families in
(MODULES=n + CONFIG_AF_RXRPC=m / CONFIG_NET_KEY=m in the Debian base
both resolve to 'not set' after olddefconfig), so socket() with these
families already returns EAFNOSUPPORT. This change makes the rejection
explicit at the seccomp layer, which keeps the path closed even if a
future kernel-config edit re-enables one of these families.

No legitimate searcher workload uses AF_RXRPC (kernel AFS client) or
AF_KEY (legacy IPsec keying interface). The container's egress firewall
in init-container.sh already blocks the relevant network paths.

Three more kernel codepaths with no in-tree user on any flashbots image,
joining the existing # CONFIG_INET_AH/ESP/INET6_AH/INET6_ESP/NET_KEY
disables in this file:

- AF_RXRPC + RXKAD: kernel RxRPC session sockets and Kerberos security,
  used only by the in-kernel AFS filesystem client. No image runs an
  AFS client, no userspace opens AF_RXRPC sockets.

- XFRM_USER: netlink control interface for XFRM transforms (`ip xfrm`,
  strongSwan, libreswan). The image firewall is iptables; no IPsec
  daemon runs anywhere. With INET_AH/ESP/INET6_AH/INET6_ESP/NET_KEY
  already off, XFRM has no transforms to configure -- the netlink
  control interface is dead surface.

Debian's cloud-amd64 base config has CONFIG_AF_RXRPC=m, CONFIG_RXKAD=y,
CONFIG_XFRM_USER=m. CONFIG_MODULES is unset on this image
(00-no-modules), so olddefconfig already resolves AF_RXRPC and
XFRM_USER to 'not set', and RXKAD follows because it sits inside
`if AF_RXRPC` in net/rxrpc/Kconfig. RXKAD is the one to watch: it is a
straight `=y` in Debian, so MODULES=n alone does not disable it; it is
off only because AF_RXRPC is. An explicit pin is the only thing that
keeps it off if the surrounding config drifts.

Pinning the three explicitly removes the inference step and keeps the
kernel attack surface small if a future Debian config or kconfig
snippet edit changes a default. Mirrors the same belt-and-suspenders
pattern used for AUTHENCESN and the AF_ALG family elsewhere in this
branch.

Followup to the previous commit that pinned AF_RXRPC/RXKAD/XFRM_USER off.
XFRM_USER is the netlink config interface; this commit pins the rest of
the XFRM machinery so no XFRM code is compiled into the kernel at all.

In net/xfrm/Kconfig:

- CONFIG_XFRM (bool, no default) is selected only by transforms
  (INET_ESP/AH/IPCOMP, INET6_ESP/AH/IPCOMP, NET_KEY, XFRM_USER,
  XFRM_INTERFACE). All are 'not set' on this image (NET_KEY,
  INET[6]_AH/ESP/IPCOMP earlier in this file; XFRM_USER in the
  previous commit; XFRM_INTERFACE depends on IPV6 which is off).
- CONFIG_XFRM_ALGO (tristate, no default) is selected by the same
  transform protocols, all off.
- CONFIG_XFRM_ESPINTCP (bool) is the ESP-in-TCP encap glue, only
  meaningful with ESP, which is off.

So all three resolve to 'not set' via olddefconfig already; the explicit
pin removes the inference step and stays correct if a future kconfig
snippet edit selects something that pulls XFRM back in.

Functional impact: none. Verified that NET_IP_TUNNEL/NET_UDP_TUNNEL,
TLS, KVM, HYPERV, VIRTIO, container runtime, dropbear, and the
flashbox firewall do not depend on XFRM. NETFILTER_XT_MATCH_POLICY
depends on XFRM and is the only iptables match that does -- flashbox
firewall scripts do not use `-m policy` (grepped 0 hits in
init-firewall.sh, toggle, and the per-image firewall-config files),
so its absence is invisible.
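
The grep described above can be repeated with a small script. This is a sketch under assumptions: it treats any `-m policy` or `--match policy` token on a non-comment line as a use of the policy match, and `policy_match_lines` is a hypothetical helper name. Which files to feed it (init-firewall.sh, toggle, the per-image firewall-config files) is left to the caller.

```python
import re

# Matches the iptables policy match in its short and long spellings.
POLICY_MATCH = re.compile(r"(?:^|\s)(?:-m|--match)\s+policy(?:\s|$)")


def policy_match_lines(script_text):
    """Return (line_number, line) pairs that invoke the policy match,
    skipping comment-only lines."""
    hits = []
    for number, line in enumerate(script_text.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith("#"):
            continue
        if POLICY_MATCH.search(stripped):
            hits.append((number, stripped))
    return hits
```

An empty result for every firewall script corresponds to the "0 hits" noted in this commit message.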

Removes the kernel-side primitive used by the ESP-in-UDP MSG_SPLICE_PAGES
no-COW page-cache writes (Copy_Fail2 / Dirty Frag's ESP path) at the
strongest layer: the ESP code is not even compiled in.