
Test misc-next (regular, SELF)#1629

Open
kdave wants to merge 10000 commits into btrfs:ci-kvm from kdave:misc-next

Conversation


@kdave kdave commented Apr 17, 2026

No description provided.

@kdave kdave force-pushed the misc-next branch 5 times, most recently from df45462 to 12e9571 Compare April 21, 2026 04:26
broonie and others added 23 commits April 22, 2026 21:15
Baojun Xu <baojun.xu@ti.com> says:

Link: https://patch.msgid.link/20260414015441.2439-1-baojun.xu@ti.com
Signed-off-by: Mark Brown <broonie@kernel.org>
Pull tomoyo update from Tetsuo Handa:
 "Handle 64-bit inode numbers"

* tag 'tomoyo-pr-20260422' of git://git.code.sf.net/p/tomoyo/tomoyo:
  tomoyo: use u64 for holding inode->i_ino value
…/git/danielt/linux

Pull kgdb update from Daniel Thompson:
 "Only a very small update for kgdb this cycle: a single patch from
  Kexin Sun that fixes some outdated comments"

* tag 'kgdb-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux:
  kgdb: update outdated references to kgdb_wait()
…linux/kernel/git/trace/linux-trace

Pull ring-buffer fix from Steven Rostedt:

 - Make undefsyms_base.c into a real file

   The file undefsyms_base.c is used to catch any symbols used by a
   remote ring buffer built for a pKVM hypervisor. As it doesn't share
   the same text as the rest of the kernel, referencing any symbols
   within the kernel will make it fail to build for the standalone
   hypervisor.

   A file was created by the Makefile that checked for any symbols that
   could cause issues. There's no reason to have this file created by
   the Makefile; just create it as a normal file instead.

* tag 'trace-ring-buffer-v7.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: Make undefsyms_base.c a first-class citizen
…/git/rostedt/linux-ktest

Pull ktest updates from Steven Rostedt:

 - Fix month in date timestamp used to create failure directories

   On failure, a directory is created to store the logs and config file
   to analyze the failure. The Perl function localtime is used to create
   the date timestamp of the directory. The month passed back from that
   function starts at 0 and not 1, but the timestamp used does not
   account for that. Thus for April 20, 2026, the timestamp of 20260320
   is used, instead of 20260420.

 - Save the logfile to the failure directory

   Just the test log was saved to the directory on failure, but there's
   useful information in the full logfile that can be helpful to
   analyzing the failure. Save the logfile as well.

* tag 'ktest-v7.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-ktest:
  ktest: Add logfile to failure directory
  ktest: Fix the month in the name of the failure directory
…el/git/trace/linux-trace

Pull tracefs fixes from Steven Rostedt:

 - Use list_add_tail_rcu() for walking eventfs children

   The linked list of children is protected by SRCU and list walkers can
   walk the list using only SRCU. Using plain list_add_tail() on
   weakly ordered architectures can cause issues. Instead use
   list_add_tail_rcu().

 - Hold eventfs_mutex and SRCU for remount walk events

   The trace_apply_options() walks the tracefs_inodes where some are
   eventfs inodes and eventfs_remount() is called which in turn calls
   eventfs_set_attr(). This walk only holds normal RCU read locks, but
   the eventfs_mutex and SRCU should be held.

   Add eventfs_remount_(un)lock() helpers to take the necessary locks
   before iterating the list.

* tag 'tracefs-v7.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  eventfs: Hold eventfs_mutex and SRCU when remount walks events
  eventfs: Use list_add_tail_rcu() for SRCU-protected children list
This also removes the smbdirect_ prefix from the files.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/linux-cifs/CAHk-=whmue3PVi88K0UZLZO0at22QhQZ-yu+qO2TOKyZpGqecw@mail.gmail.com/
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
…kernel/git/dtor/input

Pull input updates from Dmitry Torokhov:

 - a new charlieplex GPIO keypad driver

 - an update to the aw86927 driver to support the AW86938 chip

 - an update for Chrome OS EC keyboard driver to support Fn-<key> keymap
   extension

 - a UAF fix in debugfs teardown in EDT touchscreen driver

 - a number of conversions for input drivers to use guard() and __free()
   cleanup primitives

 - several drivers for bus mice (inport, logibm) and other very old
   devices have been removed

 - OLPC HGPK PS/2 protocol has been removed as it's been broken and
   inactive for some 10 years

 - dedicated kpsmoused has been removed from psmouse driver

 - other assorted cleanups and fixups

* tag 'input-for-v7.1-rc0' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: (101 commits)
  Input: charlieplex_keypad - add GPIO charlieplex keypad
  dt-bindings: input: add GPIO charlieplex keypad
  dt-bindings: input: add settling-time-us common property
  dt-bindings: input: add debounce-delay-ms common property
  Input: imx_keypad - fix spelling mistake "Colums" -> "Columns"
  Input: edt-ft5x06 - fix use-after-free in debugfs teardown
  Input: ims-pcu - fix heap-buffer-overflow in ims_pcu_process_data()
  Input: ct82c710 - remove driver
  Input: mk712 - remove driver
  Input: logibm - remove driver
  Input: inport - remove driver
  Input: qt1070 - inline i2c_check_functionality check
  Input: qt1050 - inline i2c_check_functionality check
  Input: aiptek - validate raw macro indices before updating state
  Input: gf2k - skip invalid hat lookup values
  Input: xpad - add RedOctane Games vendor id
  Input: xpad - remove stale TODO and changelog header
  Input: usbtouchscreen - refactor endpoint lookup
  Input: aw86927 - add support for Awinic AW86938
  dt-bindings: input: awinic,aw86927: Add Awinic AW86938
  ...
Pull more VFIO updates from Alex Williamson:

 - Fix ordering of dma-buf cleanup versus device disabling in vfio-pci
   (Matt Evans)

 - Resolve an inconsistent and incorrect use of spinlock-irq in the
   virtio vfio-pci variant by conversion to mutex and proceed to
   modernize and simplify driver with use of guards (Alex Williamson)

 - Resurrect the removal of the remaining class_create() call in vfio,
   replacing with const struct class and class_register() (Jori
   Koolstra, Alex Williamson)

 - Fix NULL pointer dereference, properly serialize interrupt setup, and
   cleanup interrupt state tracking in the cdx vfio bus driver (Prasanna
   Kumar T S M, Alex Williamson)

* tag 'vfio-v7.1-rc1-pt2' of https://github.com/awilliam/linux-vfio:
  vfio/cdx: Consolidate MSI configured state onto cdx_irqs
  vfio/cdx: Serialize VFIO_DEVICE_SET_IRQS with a per-device mutex
  vfio/cdx: Fix NULL pointer dereference in interrupt trigger path
  vfio: replace vfio->device_class with a const struct class
  vfio/virtio: Use guard() for bar_mutex in legacy I/O
  vfio/virtio: Use guard() for migf->lock where applicable
  vfio/virtio: Use guard() for list_lock where applicable
  vfio/virtio: Convert list_lock from spinlock to mutex
  vfio/pci: Clean up DMABUFs before disabling function
AppArmor dfas need a minimum of two states to be valid. State 0 is the
default trap state, and State 1 the default start state. When verifying
the dfa ensure that this is the case.

Fixes: c27c6bd ("apparmor: ensure that dfa state tables have entries")
Signed-off-by: John Johansen <john.johansen@canonical.com>
error is initialized to -EPROTO but is overwritten by some of the
internal functions; unfortunately, the last two checks assume error is
still set to -EPROTO in the failure case. Ensure it is by setting it
before these checks.

Fixes: 3d28e23 ("apparmor: add support loading per permission tagging")
Reported-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: John Johansen <john.johansen@canonical.com>
In apparmor_path_rename(), when handling RENAME_EXCHANGE, the
cond_exchange structure is supposed to carry the attributes of the
*new* dentry (since it is used to authorize moving new_dentry to the
old location). However, line 412 reads:

    vfsuid = i_uid_into_vfsuid(idmap, d_backing_inode(old_dentry));

This fetches the uid of old_dentry instead of new_dentry. As a result,
the RENAME_EXCHANGE permission check uses the wrong file owner, which
can allow a rename that should be denied (if old_dentry's owner has
more privileges) or deny one that should be allowed.

Note that cond_exchange.mode on the line above correctly uses
new_dentry. Only the uid lookup is wrong.

Fix by changing old_dentry to new_dentry in the i_uid_into_vfsuid call.

Fixes: 5e26a01 ("apparmor: use type safe idmapping helpers")
Reviewed-by: Georgia Garcia <georgia.garcia@canonical.com>
Signed-off-by: Dudu Lu <phx0fer@gmail.com>
Signed-off-by: John Johansen <john.johansen@canonical.com>
When booting Ubuntu 26.04 with Linux 7.0-rc4 on an ARM64 Qualcomm
Snapdragon X1 we see a string buffer overrun:

BUG: KASAN: slab-out-of-bounds in aa_dfa_match (security/apparmor/match.c:535)
Read of size 1 at addr ffff0008901cc000 by task snap-update-ns/2120

CPU: 5 UID: 60578 PID: 2120 Comm: snap-update-ns Not tainted 7.0.0-rc4+ #22 PREEMPTLAZY
Hardware name: LENOVO 83ED/LNVNB161216, BIOS NHCN60WW 09/11/2025
Call trace:
show_stack (arch/arm64/kernel/stacktrace.c:501) (C)
dump_stack_lvl (lib/dump_stack.c:122)
print_report (mm/kasan/report.c:379 mm/kasan/report.c:482)
kasan_report (mm/kasan/report.c:597)
__asan_report_load1_noabort (mm/kasan/report_generic.c:378)
aa_dfa_match (security/apparmor/match.c:535)
match_mnt_path_str (security/apparmor/mount.c:244 security/apparmor/mount.c:336)
match_mnt (security/apparmor/mount.c:371)
aa_bind_mount (security/apparmor/mount.c:447 (discriminator 4))
apparmor_sb_mount (security/apparmor/lsm.c:719 (discriminator 1))
security_sb_mount (security/security.c:1062 (discriminator 31))
path_mount (fs/namespace.c:4101)
__arm64_sys_mount (fs/namespace.c:4172 fs/namespace.c:4361 fs/namespace.c:4338 fs/namespace.c:4338)
invoke_syscall.constprop.0 (arch/arm64/kernel/syscall.c:35 arch/arm64/kernel/syscall.c:49)
el0_svc_common.constprop.0 (./include/linux/thread_info.h:142 (discriminator 2) arch/arm64/kernel/syscall.c:140 (discriminator 2))
do_el0_svc (arch/arm64/kernel/syscall.c:152)
el0_svc (arch/arm64/kernel/entry-common.c:80 arch/arm64/kernel/entry-common.c:725)
el0t_64_sync_handler (arch/arm64/kernel/entry-common.c:744)
el0t_64_sync (arch/arm64/kernel/entry.S:596)

Allocated by task 2120:
kasan_save_stack (mm/kasan/common.c:58)
kasan_save_track (./arch/arm64/include/asm/current.h:19 mm/kasan/common.c:70 mm/kasan/common.c:79)
kasan_save_alloc_info (mm/kasan/generic.c:571)
__kasan_kmalloc (mm/kasan/common.c:419)
__kmalloc_noprof (./include/linux/kasan.h:263 mm/slub.c:5260 mm/slub.c:5272)
aa_get_buffer (security/apparmor/lsm.c:2201)
aa_bind_mount (security/apparmor/mount.c:442)
apparmor_sb_mount (security/apparmor/lsm.c:719 (discriminator 1))
security_sb_mount (security/security.c:1062 (discriminator 31))
path_mount (fs/namespace.c:4101)
__arm64_sys_mount (fs/namespace.c:4172 fs/namespace.c:4361 fs/namespace.c:4338 fs/namespace.c:4338)
invoke_syscall.constprop.0 (arch/arm64/kernel/syscall.c:35 arch/arm64/kernel/syscall.c:49)
el0_svc_common.constprop.0 (./include/linux/thread_info.h:142 (discriminator 2) arch/arm64/kernel/syscall.c:140 (discriminator 2))
do_el0_svc (arch/arm64/kernel/syscall.c:152)
el0_svc (arch/arm64/kernel/entry-common.c:80 arch/arm64/kernel/entry-common.c:725)
el0t_64_sync_handler (arch/arm64/kernel/entry-common.c:744)
el0t_64_sync (arch/arm64/kernel/entry.S:596)

The buggy address belongs to the object at ffff0008901ca000
which belongs to the cache kmalloc-rnd-06-8k of size 8192
The buggy address is located 0 bytes to the right of
allocated 8192-byte region [ffff0008901ca000, ffff0008901cc000)

The buggy address belongs to the physical page:
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x9101c8
head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:-1 pincount:0
flags: 0x8000000000000040(head|zone=2)
page_type: f5(slab)
raw: 8000000000000040 ffff000800016c40 fffffdffe2d14e10 ffff000800015c70
raw: 0000000000000000 0000000800010001 00000000f5000000 0000000000000000
head: 8000000000000040 ffff000800016c40 fffffdffe2d14e10 ffff000800015c70
head: 0000000000000000 0000000800010001 00000000f5000000 0000000000000000
head: 8000000000000003 fffffdffe2407201 fffffdffffffffff 00000000ffffffff
head: ffffffffffffffff 0000000000000000 00000000ffffffff 0000000000000008
page dumped because: kasan: bad access detected

Memory state around the buggy address:
ffff0008901cbf00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
ffff0008901cbf80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>ffff0008901cc000: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
^
ffff0008901cc080: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff0008901cc100: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc

This was introduced by previous incorrect conversion from strcpy(). Fix it
by adding the missing terminator.

Cc: stable@vger.kernel.org
Reviewed-by: Georgia Garcia <georgia.garcia@canonical.com>
Signed-off-by: Daniel J Blueman <daniel@quora.org>
Fixes: 93d4dbd ("apparmor: Replace deprecated strcpy in d_namespace_path")
Signed-off-by: John Johansen <john.johansen@canonical.com>
aa_dfa_unpack returns an ERR_PTR, not NULL, when it fails, but aa_put_dfa
only checks its input for NULL, which would cause an invalid memory access
in aa_put_dfa. Set nulldfa to NULL explicitly to fix that.

Fixes: 98b824f ("apparmor: refcount the pdb")
Signed-off-by: GONG Ruiqi <gongruiqi1@huawei.com>
Signed-off-by: John Johansen <john.johansen@canonical.com>
Since commit 2bd8248 ("xps: fix xps for stacked devices"),
skb->napi_id shares storage with sender_cpu. RX tracepoints using
net_dev_rx_verbose_template read skb->napi_id directly and can therefore
report sender_cpu values as if they were NAPI IDs.

For example, on the loopback path this can report 1 as napi_id, where 1
comes from raw_smp_processor_id() + 1 in the XPS path:

  # bpftrace -e 'tracepoint:net:netif_rx_entry{ print(args->napi_id); }'
  # taskset -c 0 ping -c 1 ::1

Report only valid NAPI IDs in these tracepoints and use 0 otherwise.

Fixes: 2bd8248 ("xps: fix xps for stacked devices")
Signed-off-by: Kohei Enju <kohei@enjuk.jp>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Link: https://patch.msgid.link/20260420105427.162816-1-kohei@enjuk.jp
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In tpacket_snd(), when PACKET_VNET_HDR is enabled, vnet_hdr points
directly into the mmap'd TX ring buffer shared with userspace. The
kernel validates the header via __packet_snd_vnet_parse() but then
re-reads all fields later in virtio_net_hdr_to_skb(). A concurrent
userspace thread can modify the vnet_hdr fields between validation
and use, bypassing all safety checks.

The non-TPACKET path (packet_snd()) already correctly copies vnet_hdr
to a stack-local variable. All other vnet_hdr consumers in the kernel
(tun.c, tap.c, virtio_net.c) also use stack copies. The TPACKET TX
path is the only caller of virtio_net_hdr_to_skb() that reads directly
from user-controlled shared memory.

Fix this by copying vnet_hdr from the mmap'd ring buffer to a
stack-local variable before validation and use, consistent with the
approach used in packet_snd() and all other callers.

Fixes: 1d036d2 ("packet: tpacket_snd gso and checksum offload")
Signed-off-by: Bingquan Chen <patzilla007@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260418112006.78823-1-patzilla007@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Fix spelling mistake "targgeting" -> "targeting" in
maintainer-netdev.rst

No functional change.

Signed-off-by: Ariful Islam Shoikot <islamarifulshoikat@gmail.com>
Reviewed-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260420114554.1026-1-islamarifulshoikat@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Firmware may not advertise correct resources if backing store is not
enabled before resource information is queried.
Fix the initial sequence of HWRMs so that the driver gets capabilities
and resource information correctly.

Fixes: 3fa9e97 ("bng_en: Initialize default configuration")
Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com>
Reviewed-by: Rahul Gupta <rahul-rg.gupta@broadcom.com>
Link: https://patch.msgid.link/20260418023438.1597876-2-vikas.gupta@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The backing store type BNGE_CTX_MRAV is not applicable to Thor Ultra
devices. Remove it from the backing store configuration, as the firmware
will not populate entities for this backing store type, which causes the
driver load to fail.

Fixes: 29c5b35 ("bng_en: Add backing store support")
Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com>
Reviewed-by: Dharmender Garg <dharmender.garg@broadcom.com>
Link: https://patch.msgid.link/20260418023438.1597876-3-vikas.gupta@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vikas Gupta says:

====================
bnge fixes

Patch-1:
    Due to a wrong HWRM sequence, the driver does not get the correct
    information regarding resources and capabilities.
    The patch fixes the initial HWRM sequence.
Patch-2:
    Remove the initialization of the backing store type that is not
    supported on Thor Ultra devices.
====================

Link: https://patch.msgid.link/20260418023438.1597876-1-vikas.gupta@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
sk_clone() increments sockets_allocated and sets the socket refcount to 2.
SCTP performs additional accounting in sctp_clone_sock(), so the clone-time
increment must be undone to avoid double counting.

Note we cannot simply remove the SCTP-side increment, because the SCTP
destroy path in sctp_destroy_sock() only decrements sockets_allocated when
sp->ep is set, which may not be true for all failure paths in
sctp_clone_sock().

Fixes: 16942cf ("sctp: Use sk_clone() in sctp_accept().")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/af8d66f928dec3e9fcbee8d4a85b7d5a6b86f515.1776460180.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When SEG6_IPTUN_MODE_L2ENCAP_RED (L2ENCAP_RED) was introduced, the
condition in seg6_build_state() that excludes L2 encap modes from
setting LWTUNNEL_STATE_OUTPUT_REDIRECT was not updated to account for
the new mode.
As a consequence, L2ENCAP_RED routes incorrectly trigger seg6_output()
on the output path, where the packet is silently dropped because
skb_mac_header_was_set() fails on L3 packets.

Extend the check to also exclude L2ENCAP_RED, consistent with L2ENCAP.

Fixes: 13f0296 ("seg6: add support for SRv6 H.L2Encaps.Red behavior")
Cc: stable@vger.kernel.org
Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Reviewed-by: Justin Iurman <justin.iurman@gmail.com>
Link: https://patch.msgid.link/20260418162838.31979-1-andrea.mayer@uniroma2.it
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
rds_for_each_conn_info() and rds_walk_conn_path_info() both hand a
caller-allocated on-stack u64 buffer to a per-connection visitor and
then copy the full item_len bytes back to user space via
rds_info_copy() regardless of how much of the buffer the visitor
actually wrote.

rds_ib_conn_info_visitor() and rds6_ib_conn_info_visitor() only
write a subset of their output struct when the underlying
rds_connection is not in state RDS_CONN_UP (src/dst addr, tos, sl
and the two GIDs via explicit memsets). Several u32 fields
(max_send_wr, max_recv_wr, max_send_sge, rdma_mr_max, rdma_mr_size,
cache_allocs) and the 2-byte alignment hole between sl and
cache_allocs remain as whatever stack contents preceded the visitor
call and are then memcpy_to_user()'d out to user space.

struct rds_info_rdma_connection and struct rds6_info_rdma_connection
are the only rds_info_* structs in include/uapi/linux/rds.h that are
not marked __attribute__((packed)), so they have a real alignment
hole. The other info visitors (rds_conn_info_visitor,
rds6_conn_info_visitor, rds_tcp_tc_info, ...) write all fields of
their packed output struct today and are not known to be vulnerable,
but a future visitor that adds a conditional write-path would have
the same bug.

Reproduction on a kernel built without CONFIG_INIT_STACK_ALL_ZERO=y:
a local unprivileged user opens AF_RDS, sets SO_RDS_TRANSPORT=IB,
binds to a local address on an RDMA-capable netdev (rxe soft-RoCE on
any netdev is sufficient), sendto()'s any peer on the same subnet
(fails cleanly but installs an rds_connection in the global hash in
RDS_CONN_CONNECTING), then calls getsockopt(SOL_RDS,
RDS_INFO_IB_CONNECTIONS). The returned 68-byte item contains 26
bytes of stack garbage including kernel text/data pointers:

    0..7   0a 63 00 01 0a 63 00 02     src=10.99.0.1 dst=10.99.0.2
    8..39  00 ...                      gids (memset-zeroed)
    40..47 e0 92 a3 81 ff ff ff ff     kernel pointer (max_send_wr)
    48..55 7f 37 b5 81 ff ff ff ff     kernel pointer (rdma_mr_max)
    56..59 01 00 08 00                 rdma_mr_size (garbage)
    60..61 00 00                       tos, sl
    62..63 00 00                       alignment padding
    64..67 18 00 00 00                 cache_allocs (garbage)

Fix by zeroing the per-item buffer in both rds_for_each_conn_info()
and rds_walk_conn_path_info() before invoking the visitor. This
covers the IPv4/IPv6 IB visitors and hardens all current and future
visitors against the same class of bug.

No functional change for visitors that fully populate their output.

Changes in v2:
- retarget at the net tree (subject prefix "[PATCH net v2]",
  net/rds: prefix in the title)
- pick up Reviewed-by tags from Sharath Srinivasan and
  Allison Henderson

Fixes: ec16227 ("RDS/IB: Infiniband transport")
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
Reviewed-by: Sharath Srinivasan <sharath.srinivasan@oracle.com>
Reviewed-by: Allison Henderson <achender@kernel.org>
Assisted-by: Claude:claude-opus-4-7
Link: https://patch.msgid.link/20260418141047.3398203-1-michael.bommarito@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
fdmanana and others added 20 commits April 28, 2026 06:49
It's not necessary since we can get the block group from the given
free space control structure.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We can get the free space control structure from the given block group,
so there is no need to pass it as an argument.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no need to pass the free space control structure as an argument
because we can grab it from the given block group.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[WARNING]
When running test cases with injected errors or shutdown, e.g.
generic/388 or generic/475, there is a chance that the following kernel
warning is triggered:

  BTRFS info (device dm-2): first mount of filesystem d8a19a28-3232-4809-b0df-38df83e71bff
  BTRFS info (device dm-2): using crc32c checksum algorithm
  BTRFS info (device dm-2): checking UUID tree
  BTRFS info (device dm-2): turning on async discard
  BTRFS info (device dm-2): enabling free space tree
  BTRFS critical (device dm-2 state E): emergency shutdown
  ------------[ cut here ]------------
  WARNING: extent_io.c:1742 at extent_writepage_io+0x437/0x520 [btrfs], CPU#2: kworker/u43:2/651591
  CPU: 2 UID: 0 PID: 651591 Comm: kworker/u43:2 Tainted: G        W  OE       7.0.0-rc6-custom+ #365 PREEMPT(full)  5804053f02137e627472d94b5128cc9fcb110e88
  RIP: 0010:extent_writepage_io+0x437/0x520 [btrfs]
  Call Trace:
   <TASK>
   extent_write_cache_pages+0x2a5/0x820 [btrfs 70299925d0856939e93b17d480651713b3cbba58]
   btrfs_writepages+0x74/0x130 [btrfs 70299925d0856939e93b17d480651713b3cbba58]
   do_writepages+0xd0/0x160
   __writeback_single_inode+0x42/0x340
   writeback_sb_inodes+0x22d/0x580
   wb_writeback+0xc6/0x360
   wb_workfn+0xbd/0x470
   process_one_work+0x198/0x3b0
   worker_thread+0x1c8/0x330
   kthread+0xee/0x120
   ret_from_fork+0x2a6/0x330
   ret_from_fork_asm+0x11/0x20
   </TASK>
  ---[ end trace 0000000000000000 ]---
  BTRFS error (device dm-2 state E): root 5 ino 259 folio 1323008 is marked dirty without notifying the fs
  BTRFS error (device dm-2 state E): failed to submit blocks, root=5 inode=259 folio=1323008 submit_bitmap=0: -117
  BTRFS info (device dm-2 state E): last unmount of filesystem d8a19a28-3232-4809-b0df-38df83e71bff

[CAUSE]
Inside btrfs we have the following pattern in several locations, for
example inside btrfs_dirty_folio():

	btrfs_clear_extent_bit(&inode->io_tree, start_pos, end_of_last_block,
			       EXTENT_DELALLOC | EXTENT_DO_ACCOUNTING | EXTENT_DEFRAG,
			       cached);

	ret = btrfs_set_extent_delalloc(inode, start_pos, end_of_last_block,
					extra_bits, cached);
	if (ret)
		return ret;

However btrfs_set_extent_delalloc() can return IO errors other than -ENOMEM
through the following callchain:

 btrfs_set_extent_delalloc()
 \- btrfs_find_new_delalloc_bytes()
    \- btrfs_get_extent()
       \- btrfs_lookup_file_extent()
          \- btrfs_search_slot()

When such an IO error happens, the previous btrfs_clear_extent_bit() has
cleared EXTENT_DELALLOC for the range, and we expect
btrfs_set_extent_delalloc() to re-set EXTENT_DELALLOC.

But since btrfs_set_extent_delalloc() failed before
btrfs_set_extent_bit(), EXTENT_DELALLOC flag is no longer present.

And if the folio range was dirty before entering
btrfs_set_extent_delalloc(), we end up with a dirty folio but no
EXTENT_DELALLOC flag.

Then we hit the folio writeback:

  extent_writepage()
  |- writepage_delalloc()
  |  No ordered extent is created, as there is no EXTENT_DELALLOC set
  |  for the folio range.
  |  This also means the folio has no ordered flag set.
  |
  |- extent_writepage_io()
     \- if (unlikely(!folio_test_ordered(folio))
        Now we hit the warning.

[FIX]
Introduce a new helper, btrfs_reset_extent_delalloc() to replace the
currently open-coded btrfs_clear_extent_bit() +
btrfs_set_extent_delalloc() combination.

Instead of calling btrfs_clear_extent_bit() first, update
EXTENT_DELALLOC_NEW first, as that part can fail due to metadata IO,
while btrfs_clear_extent_bit() and btrfs_set_extent_bit() never return
an error but retry the memory allocation until it succeeds.

This allows us to fail early without clearing the EXTENT_DELALLOC bit, so
even if the new btrfs_reset_extent_delalloc() fails before touching
EXTENT_DELALLOC, the existing dirty range will still have its old
EXTENT_DELALLOC flag present, thus avoiding the warning.

CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add a new unprivileged BTRFS_IOC_GET_CSUMS ioctl, which can be used to
query the on-disk csums for a file range.

The ioctl is deliberately per-file rather than exposing raw csum tree
lookups, to avoid leaking information to users about files they may not
have access to.

This is done by userspace passing a struct btrfs_ioctl_get_csums_args to
the kernel, which details the offset and length we're interested in, and
a buffer for the kernel to write its results into. The kernel writes a
struct btrfs_ioctl_get_csums_entry into the buffer, followed by the
csums if available. The maximum size of the user buffer is capped to
16MiB.

If the extent is an uncompressed, non-NODATASUM extent, the kernel sets
the entry type to BTRFS_GET_CSUMS_HAS_CSUMS and follows it with the
csums. If it is sparse, preallocated, or beyond the EOF, it sets the
type to BTRFS_GET_CSUMS_ZEROED - this is so userspace knows it can use
the precomputed hash of the zero sector. Otherwise, it sets the type to
BTRFS_GET_CSUMS_NODATASUM, BTRFS_GET_CSUMS_COMPRESSED,
BTRFS_GET_CSUM_ENCRYPTED, or BTRFS_GET_CSUM_INLINE.

For example, a file with a [0, 4K) hole and [4K, 12K) data extent would
produce the following output buffer:

  | [0, 4K) ZEROED | [4K, 12K) HAS_CSUMS | csum data |

We do store the csums of compressed extents, but we deliberately don't
return them here: they're calculated over the compressed data, not the
uncompressed data that's returned to userspace. Similarly for encrypted
data, once encryption is supported, in which the csums will be on the
ciphertext.

The main use case for this is for speeding up mkfs.btrfs --rootdir. For
the case when the source FS is btrfs and using the same csum algorithm,
we can avoid having to recalculate the csums - in my synthetic
benchmarks (16GB file on a spinning-rust drive), this resulted in a ~11%
speed-up (218s to 196s).

When using the --reflink option added in btrfs-progs v6.16.1, we can forgo
reading the data entirely, resulting in a ~2200% speed-up on the same test
(128s to 6s).

    # mkdir rootdir
    # dd if=/dev/urandom of=rootdir/file bs=4096 count=4194304

    (without ioctl)
    # echo 3 > /proc/sys/vm/drop_caches
    # time mkfs.btrfs --rootdir rootdir testimg
    ...
    real    3m37.965s
    user    0m5.496s
    sys     0m6.125s

    # echo 3 > /proc/sys/vm/drop_caches
    # time mkfs.btrfs --rootdir rootdir --reflink testimg
    ...
    real    2m8.342s
    user    0m5.472s
    sys     0m1.667s

    (with ioctl)
    # echo 3 > /proc/sys/vm/drop_caches
    # time mkfs.btrfs --rootdir rootdir testimg
    ...
    real    3m15.865s
    user    0m4.258s
    sys     0m6.261s

    # echo 3 > /proc/sys/vm/drop_caches
    # time mkfs.btrfs --rootdir rootdir --reflink testimg
    ...
    real    0m5.847s
    user    0m2.899s
    sys     0m0.097s

Another notable use case is for deduplication, where reading the
checksums may serve as a hint instead of reading the whole file data.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Running btrfs balance with a usage filter (-dusage=N) can trigger a
null-ptr-deref when metadata corruption causes a chunk to have no
corresponding block group in the in-memory cache:

  KASAN: null-ptr-deref in range [0x0000000000000070-0x0000000000000077]
  RIP: 0010:chunk_usage_filter fs/btrfs/volumes.c:3874 [inline]
  RIP: 0010:should_balance_chunk fs/btrfs/volumes.c:4018 [inline]
  RIP: 0010:__btrfs_balance fs/btrfs/volumes.c:4172 [inline]
  RIP: 0010:btrfs_balance+0x2024/0x42b0 fs/btrfs/volumes.c:4604
  ...
  Call Trace:
    btrfs_ioctl_balance fs/btrfs/ioctl.c:3577 [inline]
    btrfs_ioctl+0x25cf/0x5b90 fs/btrfs/ioctl.c:5313
    vfs_ioctl fs/ioctl.c:51 [inline]
    ...

The bug is reproducible on the current development branch.

[CAUSE]
Two separate data structures are involved:

1. The on-disk chunk tree, which records every chunk (logical address
   space region) and is iterated by __btrfs_balance().

2. The in-memory block group cache (fs_info->block_group_cache_tree),
   which is built at mount time by btrfs_read_block_groups() and holds
   a struct btrfs_block_group for each chunk. This cache is what the
   usage filter queries.

On a well-formed filesystem, these two are kept in 1:1 correspondence.
However, btrfs_read_block_groups() builds the cache from block group
items in the extent tree, not directly from the chunk tree. A corrupted
image can therefore contain a chunk item in the chunk tree whose
corresponding block group item is absent from the extent tree; that
chunk's block group is then never inserted into the in-memory cache.

When balance iterates the chunk tree and reaches such an orphaned chunk,
should_balance_chunk() calls chunk_usage_filter(), which queries the block
group cache:

  cache = btrfs_lookup_block_group(fs_info, chunk_offset);
  chunk_used = cache->used;   /* cache may be NULL */

btrfs_lookup_block_group() returns NULL silently when no cached entry
covers chunk_offset. chunk_usage_filter() does not check the return value,
so the immediately following dereference of cache->used triggers the crash.

[FIX]
Add a NULL check after btrfs_lookup_block_group() in chunk_usage_filter().
When the lookup fails, emit a btrfs_err() message identifying the
affected bytenr and return -EUCLEAN to indicate filesystem corruption.

Since chunk_usage_filter() now has an error path, change its return
type from bool to int: return a negative error on failure, 0 if the
chunk passes the usage filter, and 1 if it should be skipped.

Update should_balance_chunk() accordingly to propagate negative errors
from the usage filter.

Signed-off-by: ZhengYuan Huang <gality369@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
…filter()

[BUG]
Running btrfs balance with a usage range filter (-dusage=min..max) can
trigger a null-ptr-deref when metadata corruption causes a chunk to have
no corresponding block group in the in-memory cache:

  KASAN: null-ptr-deref in range [0x0000000000000070-0x0000000000000077]
  RIP: 0010:chunk_usage_range_filter fs/btrfs/volumes.c:3845 [inline]
  RIP: 0010:should_balance_chunk fs/btrfs/volumes.c:4031 [inline]
  RIP: 0010:__btrfs_balance fs/btrfs/volumes.c:4182 [inline]
  RIP: 0010:btrfs_balance+0x249e/0x4320 fs/btrfs/volumes.c:4618
  ...
  Call Trace:
    btrfs_ioctl_balance fs/btrfs/ioctl.c:3577 [inline]
    btrfs_ioctl+0x25cf/0x5b90 fs/btrfs/ioctl.c:5313
    vfs_ioctl fs/ioctl.c:51 [inline]
    ...

The bug is reproducible on a recent development branch.

[CAUSE]
Two separate data structures are involved:

1. The on-disk chunk tree, which records every chunk (logical address
   space region) and is iterated by __btrfs_balance().

2. The in-memory block group cache (fs_info->block_group_cache_tree),
   which is built at mount time by btrfs_read_block_groups() and holds
   a struct btrfs_block_group for each chunk. This cache is what the
   usage range filter queries.

On a well-formed filesystem, these two are kept in 1:1 correspondence.
However, btrfs_read_block_groups() builds the cache from block group
items in the extent tree, not directly from the chunk tree. A corrupted
image can therefore contain a chunk item in the chunk tree whose
corresponding block group item is absent from the extent tree; that
chunk's block group is then never inserted into the in-memory cache.

When balance iterates the chunk tree and reaches such an orphaned chunk,
should_balance_chunk() calls chunk_usage_range_filter(), which queries
the block group cache:

  cache = btrfs_lookup_block_group(fs_info, chunk_offset);
  chunk_used = cache->used;   /* cache may be NULL */

btrfs_lookup_block_group() returns NULL silently when no cached entry
covers chunk_offset. chunk_usage_range_filter() does not check the return
value, so the immediately following dereference of cache->used triggers
the crash.

[FIX]
Add a NULL check after btrfs_lookup_block_group() in
chunk_usage_range_filter(). When the lookup fails, emit a btrfs_err()
message identifying the affected bytenr and return -EUCLEAN to indicate
filesystem corruption.

Since chunk_usage_range_filter() now has an error path, change its
return type from bool to int: return a negative error on failure, 0 if
the chunk matches the usage range, and 1 if it should be filtered out.

Signed-off-by: ZhengYuan Huang <gality369@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
…ta_chunk()

[BUG]
Running btrfs balance can trigger a null-ptr-deref before relocating a
data chunk when metadata corruption leaves a chunk in the chunk tree
without a corresponding block group in the in-memory cache:

  KASAN: null-ptr-deref in range [0x0000000000000088-0x000000000000008f]
  RIP: 0010:btrfs_may_alloc_data_chunk+0x40/0x1c0 fs/btrfs/volumes.c:3601
  Call Trace:
    __btrfs_balance fs/btrfs/volumes.c:4217 [inline]
    btrfs_balance+0x2516/0x42b0 fs/btrfs/volumes.c:4604
    btrfs_ioctl_balance fs/btrfs/ioctl.c:3577 [inline]
    btrfs_ioctl+0x25cf/0x5b90 fs/btrfs/ioctl.c:5313
    ...

[CAUSE]
__btrfs_balance() iterates the on-disk chunk tree and passes the chunk
logical bytenr to btrfs_may_alloc_data_chunk() before relocating a data
chunk. That helper then queries the in-memory block group cache:

  cache = btrfs_lookup_block_group(fs_info, chunk_offset);
  chunk_type = cache->flags;   /* cache may be NULL */

A corrupt image can contain a chunk item whose matching block group
item is missing, so no block group is ever inserted into the cache. In
that case btrfs_lookup_block_group() returns NULL.

The code only guards this with ASSERT(cache), which becomes a no-op when
CONFIG_BTRFS_ASSERT is disabled. The subsequent dereference of
cache->flags therefore crashes the kernel.

[FIX]
Add a NULL check after btrfs_lookup_block_group() in
btrfs_may_alloc_data_chunk() and print an error message for clarity.

Signed-off-by: ZhengYuan Huang <gality369@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
A corrupted image with a chunk present in the chunk tree but whose
corresponding block group item is missing from the extent tree can be
mounted successfully, even though check_chunk_block_group_mappings()
is supposed to catch exactly this corruption at mount time.  Once
mounted, running btrfs balance with a usage filter (-dusage=N or
-dusage=min..max) triggers a null-ptr-deref:

  KASAN: null-ptr-deref in range [0x0000000000000070-0x0000000000000077]
    RIP: 0010:chunk_usage_filter fs/btrfs/volumes.c:3874 [inline]
    RIP: 0010:should_balance_chunk fs/btrfs/volumes.c:4018 [inline]
    RIP: 0010:__btrfs_balance fs/btrfs/volumes.c:4172 [inline]
    RIP: 0010:btrfs_balance+0x2024/0x42b0 fs/btrfs/volumes.c:4604

[CAUSE]
The crash occurs because __btrfs_balance() iterates the on-disk chunk
tree, finds the orphaned chunk, calls chunk_usage_filter() (or
chunk_usage_range_filter()), which queries the in-memory block group
cache via btrfs_lookup_block_group().  Since no block group was ever
inserted for this chunk, the lookup returns NULL, and the subsequent
dereference of cache->used crashes.

check_chunk_block_group_mappings() uses btrfs_find_chunk_map() to
iterate the in-memory chunk map (fs_info->mapping_tree):

  map = btrfs_find_chunk_map(fs_info, start, 1);

With @start = 0 and @length = 1, btrfs_find_chunk_map() looks for a
chunk map that *contains* the logical address 0. If no chunk contains
logical address 0, btrfs_find_chunk_map(fs_info, 0, 1) returns NULL
immediately and the loop breaks after the very first iteration,
having checked zero chunks. The entire verification function is therefore
a no-op, and the corrupted image passes the mount-time check undetected.

[FIX]
Replace the btrfs_find_chunk_map() based loop with a direct in-order
walk of fs_info->mapping_tree using rb_first_cached() + rb_next().
This guarantees that every chunk map in the tree is visited regardless
of the logical addresses involved.

No lock is taken around the traversal. This function is called during
mount from btrfs_read_block_groups(), which is invoked from open_ctree()
before any background threads (cleaner, transaction kthread, etc.) are
started. There are therefore no concurrent writers that could modify
mapping_tree at this point. An analogous lockless direct traversal of
mapping_tree already exists in fill_dummy_bgs() in the same file.

Since we walk the rb-tree directly via rb_entry() without going through
btrfs_find_chunk_map(), no reference is taken on each map entry, so the
btrfs_free_chunk_map() calls are also removed.

Signed-off-by: ZhengYuan Huang <gality369@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
…_extents()

There's no need to calculate the size for the temporary block reserve
in btrfs_replace_file_extents() again - we have already calculated it
and stored it in the 'min_size' variable.

So use the variable to make it more clear and also make the variable const
since it's not supposed to change during the whole function.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
…fields

The 'disk_cache_state' and 'cached' fields are defined with an int type
but all the values we assign to them come from the enums
btrfs_disk_cache_state and btrfs_caching_type. So change their type in
the btrfs_block_group structure from int to these enums - in practice an
enum is an int, so this is more for readability and clarity.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Sun YangKai <sunk67188@gmail.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Drop the length argument and use the simpler QSTR().

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If statement branches that lead to a DEBUG_WARN() are not expected to
be taken, and in most places we surround their expressions with the
unlikely tag; however a few places are missing it. Add the unlikely tag
to those missing places to make it explicit to a reader that the branch
is not expected and to hint the compiler to generate better code.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function always returns true or false but its return type is
defined as int, which makes no sense. Change it to bool.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Originally 2K block size support was introduced to test subpage (block
size < page size) on x86_64 where the page size is exactly the original
minimal block size.

However that 2K block size support has some problems:

- No 2K nodesize support
  This is critical, as there is still no way to exercise the subpage
  metadata routine.

- Very easy to test subpage data path now
  With the currently experimental large folio support, it's very easy to
  test the subpage data folio path already, as when a folio larger than
  4K is encountered on x86_64, we will need all the subpage folio states
  and bitmaps.

  So there is no need to use 2K block size just to verify subpage data
  path even on x86_64.

And with the incoming huge folio (2M on x86_64) support, the 2K block
size would easily double the bitmap size. Considering the maintenance
burden and the limited extra coverage, I believe it's time to remove it
before the incoming huge folio support lands.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently btrfs_writepages() just accumulates as large a bio as
possible (within writeback_control constraints) and then submits it.
This can however lead to significant latency in writeback IO submission
(I have observed tens of milliseconds) because the submitted bio can
easily be over a hundred megabytes. Consequently this leads to IO
pipeline stalls and reduced throughput.

At the same time, beyond a certain size, submitting such a large bio
provides diminishing returns because the bio is split by the block
layer immediately anyway. So compute (an estimate of) the bio size
beyond which we are unlikely to improve performance and just submit the
bio for writeback once we accumulate that much, to keep the IO pipeline
busy.
This improves writeback throughput for sequential writes by about 15% on
the test machine I was using.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
[ Fix the handling of missing device to avoid NULL pointer dereference. ]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 0b2600f ("treewide: change inode->i_ino from unsigned long
to u64") sets the inode number type to u64 unconditionally, so we can
use it directly as there's no difference between 32-bit and 64-bit
platforms. We used to keep a copy of the number in our btrfs_inode.

The size of btrfs_inode on 32-bit platforms is about 688 bytes (after
the change).

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BACKGROUND]
When bs < ps support was initially introduced, the compressed data
readahead was disabled as at that time the target page size was 64K.
This means a compressed data extent can span at most three 64K pages
(the head and tail parts are not aligned to 64K), meaning the benefit
is pretty minimal.

[UNEXPECTED WORKING SITUATION]
But with the already merged large folio support, we're already enabling
readahead with the subpage routine unintentionally, e.g.:

   0      4K      8K      12K      16K
   |   Folio 0    |    Folio 8K    |
   |<----- Compressed data ------->|

We have 2 8K sized folios, all backed by a single compressed data.

In that case add_ra_bio_pages() will continue to add folio 8K into the
read bio, as the condition to skip is only (bs < ps), not taking the
newer large folio support into consideration at all.

So folio 8K is added to the read bio, but without the subpage lock
bitmap populated.

Then at end_bbio_data_read(), folio 0 has proper locked bitmap set, but
folio 8K does not.
This inconsistency is handled by the extra safety net at
btrfs_subpage_end_and_test_lock() where if a folio has no @nr_locked, it
will just be unlocked without touching the locked bitmap.

[ENHANCEMENT]
Make add_ra_bio_pages() support bs < ps and large folio cases, by
removing the check and calling btrfs_folio_set_lock() unconditionally.

This won't make any difference on 4K page sized systems with large
folios, as the readahead is already working, although unexpectedly.

But this will enable true compressed data readahead for bs < ps cases
properly.

Please note that such readahead will only work if the compressed
extent crosses folio boundaries, which is also the existing limitation.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function add_ra_bio_pages() has been utilizing folio interfaces
since c808c1d ("btrfs: convert add_ra_bio_pages() to use only
folios"), but we are still referring to "pages" in the function name
and all the comments.

Furthermore, such folio/page mixing can even be confusing, e.g. the
variable @page_end does not really refer to the end of the page but to
the end of the folio, which matters especially now that we have large
folio support.

Enhance that function by:

- Rename "page" to "folio" to avoid confusion

- Skip to the folio end if there is already a folio in the page cache
  The existing skip is:

  	cur += folio_size(folio);

  This is incorrect if @cur is not folio size aligned, which can be
  common with large folio support.

  Thankfully this is not going to cause any real bugs, but at most will
  skip some blocks that can be added to readahead.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This feature was introduced in v6.17 under the experimental flag, and
we had several small bugs related to it or exposed by it:

  e9e3b22 ("btrfs: fix beyond-EOF write handling")
  18de34d ("btrfs: truncate ordered extent when skipping writeback past i_size")

Otherwise, the feature has been frequently tested by btrfs developers.

The latest fix only arrived in v6.19. After three releases, I think it's
time to move this feature out of experimental.

And since we're here, also remove the comment about the bitmap size
limit, which is no longer relevant in the context. It will soon be
outdated for the incoming huge folio support.

Reviewed-by: Neal Gompa <neal@gompa.dev>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Before btrfs switched to the new mount API in 2023, we were setting
SB_NOSEC in btrfs_mount_root(). This flag tells the VFS that the
filesystem may have files which don't have security xattrs, enabling it
to do some optimizations.

Unfortunately this was missed in the transition, meaning that IS_NOSEC
will always return false for a btrfs inode. As a result,
btrfs_direct_write() calls will always take the inode lock exclusively,
so DIO writes to the same file are serialized.

On my machine, this one-line change results in a ~59% improvement in DIO
throughput:

Before patch:

  test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=64
  ...
  fio-3.39
  Starting 32 processes
  test: Laying out IO file (1 file / 1024MiB)
  Jobs: 32 (f=32): [w(32)][100.0%][w=764MiB/s][w=195k IOPS][eta 00m:00s]
  test: (groupid=0, jobs=32): err= 0: pid=586: Wed Apr 22 13:03:04 2026
    write: IOPS=202k, BW=787MiB/s (826MB/s)(46.1GiB/60012msec); 0 zone resets
     bw (  KiB/s): min=498714, max=1199892, per=100.00%, avg=806659.03, stdev=4229.94, samples=3808
     iops        : min=124677, max=299971, avg=201661.82, stdev=1057.49, samples=3808
    cpu          : usr=0.32%, sys=1.27%, ctx=8329204, majf=0, minf=1163
    IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
       submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
       complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
       issued rwts: total=0,12094328,0,0 short=0,0,0,0 dropped=0,0,0,0
       latency   : target=0, window=0, percentile=100.00%, depth=64

  Run status group 0 (all jobs):
    WRITE: bw=787MiB/s (826MB/s), 787MiB/s-787MiB/s (826MB/s-826MB/s), io=46.1GiB (49.5GB), run=60012-60012msec

After patch:

  test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=64
  ...
  fio-3.39
  Starting 32 processes
  test: Laying out IO file (1 file / 1024MiB)
  Jobs: 32 (f=32): [w(32)][100.0%][w=1255MiB/s][w=321k IOPS][eta 00m:00s]
  test: (groupid=0, jobs=32): err= 0: pid=572: Wed Apr 22 13:13:46 2026
    write: IOPS=320k, BW=1250MiB/s (1311MB/s)(73.3GiB/60003msec); 0 zone resets
     bw (  MiB/s): min=  619, max= 2289, per=100.00%, avg=1251.28, stdev= 9.64, samples=3808
     iops        : min=158538, max=586025, avg=320320.80, stdev=2468.97, samples=3808
    cpu          : usr=0.35%, sys=11.50%, ctx=1584847, majf=0, minf=1160
    IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
       submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
       complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
       issued rwts: total=0,19203309,0,0 short=0,0,0,0 dropped=0,0,0,0
       latency   : target=0, window=0, percentile=100.00%, depth=64

  Run status group 0 (all jobs):
    WRITE: bw=1250MiB/s (1311MB/s), 1250MiB/s-1250MiB/s (1311MB/s-1311MB/s), io=73.3GiB (78.7GB), run=60003-60003msec

The script to reproduce that:

  #!/bin/bash
  mkfs.btrfs -f /dev/nvme0n1
  mount /dev/nvme0n1 /mnt/test
  mkdir /mnt/test/nocow
  chattr +C /mnt/test/nocow
  fio /root/test.fio

  # cat /root/test.fio
  [global]
  rw=randwrite
  ioengine=io_uring
  iodepth=64
  size=1g
  direct=1
  startdelay=20
  force_async=4
  ramp_time=5
  runtime=60
  group_reporting=1
  numjobs=32
  time_based
  disk_util=0
  clat_percentiles=0
  disable_lat=1
  disable_clat=1
  disable_slat=1
  filename=/mnt/test/nocow/fiofile
  [test]
  name=test
  bs=4k
  stonewall

This was on a VM with 8 cores and 8GB of RAM, with a real NVMe exposed
through PCI passthrough. The figures for XFS and ext4 in comparison are
both about ~3GB/s.

Fixes: ad21f15 ("btrfs: switch to the new mount API")
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A build workload started printing order-0 allocation failures on 7.1-rc1:

    sh: page allocation failure: order:0
    mode:0x14084a(__GFP_HIGHMEM|__GFP_MOVABLE|__GFP_IO|__GFP_KSWAPD_RECLAIM|
                  __GFP_COMP|__GFP_HARDWALL)
    CPU: 27 UID: 1000 PID: 855540 Comm: sh Not tainted 7.1.0-rc1-llvm-00058-gdca922e019dd #1 PREEMPTLAZY
    Call Trace:
     <TASK>
     dump_stack_lvl+0x50/0x70
     warn_alloc+0xeb/0x100
     __alloc_pages_slowpath+0x567/0x5a0
     ? filemap_get_entry+0x11a/0x140
     __alloc_frozen_pages_noprof+0x249/0x2d0
     alloc_pages_mpol+0xe4/0x180
     folio_alloc_noprof+0x80/0xa0
     add_ra_bio_pages+0x13c/0x4b0
     btrfs_submit_compressed_read+0x229/0x300
     submit_one_bio+0x9e/0xe0
     btrfs_readahead+0x185/0x1a0
     [...]

    (lldb) source list -a add_ra_bio_pages+0x13c
    .../vmlinux.unstripped add_ra_bio_pages + 316 at .../fs/btrfs/compression.c:454:8
       451
       452                  folio = filemap_alloc_folio(mapping_gfp_constraint(mapping, constraint_gfp),
       453                                              0, NULL);
    -> 454                  if (!folio)
       455                          break;

I can reproduce this consistently by running a memory hog concurrently
with a buffered writer on a machine with a very large amount of swap.

Commit 7ae37b2 ("btrfs: prevent direct reclaim during compressed
readahead") clearly intended to suppress these warnings. But because the
mask set in the address_space with mapping_set_gfp_mask() doesn't include
__GFP_NOWARN, mapping_gfp_constraint() removes it from constraint_gfp
before it is passed to filemap_alloc_folio().

Fix by refactoring the code to add __GFP_NOWARN after the call to
mapping_gfp_constraint().

Fixes: 7ae37b2 ("btrfs: prevent direct reclaim during compressed readahead")
Signed-off-by: Calvin Owens <calvin@wbinvd.org>
Signed-off-by: David Sterba <dsterba@suse.com>
Any commits after this one are for testing and evaluation only.

Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
With the recent commit "btrfs: warn about extent buffer that can not be
released", we can trigger the following warning when running test cases
like generic/388 at unmount:

 BTRFS critical (device dm-2 state E): emergency shutdown
 BTRFS error (device dm-2 state E): cow_file_range failed, root=5 inode=265 start=135168 len=118784 cur_offset=135168 cur_alloc_size=0: -5
 BTRFS error (device dm-2 state E): error while writing out transaction: -30
 BTRFS warning (device dm-2 state E): Skipping commit of aborted transaction.
 BTRFS error (device dm-2 state EA): Transaction 9 aborted (error -30)
 BTRFS: error (device dm-2 state EA) in cleanup_transaction:2068: errno=-30 Readonly filesystem
 BTRFS info (device dm-2 state EA): forced readonly
 BTRFS error (device dm-2 state EA): failed to run delalloc range, root=5 ino=265 folio=135168 submit_bitmap=0 start=135168 len=118784: -5
 BTRFS info (device dm-2 state EA): last unmount of filesystem 8b3d8748-4710-4b5a-84d9-b072cb03be2d
 ------------[ cut here ]------------
 WARNING: disk-io.c:3306 at invalidate_btree_folios+0xfd/0x1ca [btrfs], CPU#4: umount/60183
 CPU: 4 UID: 0 PID: 60183 Comm: umount Tainted: G        W  OE       7.0.0-rc6-custom+ #365 PREEMPT(full)  5804053f02137e627472d94b5128cc9fcb110e88
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022
 RIP: 0010:invalidate_btree_folios+0xfd/0x1ca [btrfs]
 Call Trace:
  <TASK>
  close_ctree+0x534/0x57a [btrfs eeeee2af86b856a32e0b81b75d427a17a62ffe29]
  generic_shutdown_super+0x89/0x1a0
  kill_anon_super+0x16/0x40
  btrfs_kill_super+0x16/0x20 [btrfs eeeee2af86b856a32e0b81b75d427a17a62ffe29]
  deactivate_locked_super+0x2d/0xb0
  cleanup_mnt+0xdc/0x140
  task_work_run+0x5a/0xa0
  exit_to_user_mode_loop+0x123/0x4b0
  do_syscall_64+0x288/0x7d0
  entry_SYSCALL_64_after_hwframe+0x4b/0x53
  </TASK>
 ---[ end trace 0000000000000000 ]---
 BTRFS warning (device dm-2 state EA): unable to release extent buffer 30507008 owner 1 gen 9 refs 2 flags 0x7
 BTRFS warning (device dm-2 state EA): unable to release extent buffer 30588928 owner 9 gen 9 refs 2 flags 0x7
 BTRFS warning (device dm-2 state EA): unable to release extent buffer 30605312 owner 257 gen 9 refs 2 flags 0x7
 BTRFS warning (device dm-2 state EA): unable to release extent buffer 30621696 owner 7 gen 9 refs 2 flags 0x7
 BTRFS warning (device dm-2 state EA): unable to release extent buffer 30638080 owner 258 gen 9 refs 2 flags 0x7
 BTRFS warning (device dm-2 state EA): unable to release extent buffer 30654464 owner 2 gen 9 refs 2 flags 0x7
 BTRFS warning (device dm-2 state EA): unable to release extent buffer 30670848 owner 10 gen 9 refs 2 flags 0x7

I'm using a stripped down version, which seems to trigger the warning
more reliably:

  _fsstress_pid=""
  workload()
  {
  	dmesg -C
  	mkfs.btrfs -f -K $dev > /dev/null
  	echo 1 > /sys/kernel/debug/clear_warn_once
  	mount $dev $mnt
  	$fsstress -w -n 1024 -p 4 -d $mnt &
  	_fsstress_pid=$!
  	sleep 0
  	$godown $mnt
  	pkill --echo -PIPE fsstress > /dev/null
  	wait $_fsstress_pid
  	unset _fsstress_pid
  	umount $mnt

  	if dmesg | grep -q "WARNING"; then
  		fail
  	fi
  }

  for (( i = 0; i < $runtime; i++ )); do
  	echo "=== $i/$runtime ==="
  	workload
  done

[CAUSE]
Inside btrfs_write_and_wait_transaction(), we first try to write all
dirty ebs, then wait for them to finish.

After that we call btrfs_extent_io_tree_release() to free all
extent states from the dirty_pages io tree.

However if we hit an error from btrfs_write_marked_extents(), then we
still call btrfs_extent_io_tree_release() to clear that dirty_pages io
tree, which may contain dirty records that we haven't yet submitted.

Furthermore, the later transaction cleanup path will utilize that
dirty_pages io tree to properly clean up those dirty ebs, but since
it's already empty, no dirty ebs are properly cleaned up, which will
later trigger the warnings inside invalidate_btree_folios().

[FIX]
Normally such dirty ebs won't cause problems, as when iput() is called
on the btree inode, the dirty ebs will be forced to be written back.
But at that stage all workers have been destroyed, and if the metadata
writeback needs any worker (e.g. RAID56), it will easily trigger a NULL
pointer dereference.

Furthermore such writeback at iput() time can be too late for zoned
btrfs, as we also need to properly update the write pointer for zoned
devices.

Instead of unconditionally calling btrfs_extent_io_tree_release(), only
call it if btrfs_write_and_wait_transaction() finished successfully, so
that @dirty_pages extent io tree is kept untouched for transaction
cleanup.

CC: stable@vger.kernel.org # 6.1+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The comments at the beginning of subpage.c are out of date; a lot of
the limitations have already been resolved.

Update them to reflect the latest status.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
…maps

[CURRENT LIMIT]
Btrfs currently only supports sub-bitmaps (e.g. dirty bitmap) no larger
than BITS_PER_LONG.

That limit allows us to easily grab an unsigned long without the need to
properly allocate memory for a larger bitmap.

Unfortunately that limit prevents us from supporting huge folios.
For 4K page size and block size, a huge folio (order 9) means 512 blocks
inside a 2M folio.

[ENHANCEMENT]
To allow direct bitmap operations without allocating new memory,
introduce two different ways to access the subpage bitmaps:

- Return an unsigned long value
  This only happens if blocks_per_folio <= BITS_PER_LONG.

  We read out the sub-bitmap into an unsigned long, and return the
  value.
  This is the old existing method.

  This involves get_bitmap_value_##name() helper functions.
  And this time the helper functions are defined as inline functions
  instead of macros to provide better type checks.

- Return a pointer where the sub-bitmap starts
  This only happens if blocks_per_folio > BITS_PER_LONG.

  This is the new method for sub-bitmaps larger than BITS_PER_LONG.
  Since the sizes of sub-bitmaps are all aligned to BITS_PER_LONG, we
  can directly access the start byte of the sub-bitmap.

  This involves get_bitmap_pointer_##name() helper functions.

Then change the existing sub-bitmaps users to use the new helpers:

- Bitmap dumping
  Switch between get_bitmap_value_##name() and
  get_bitmap_pointer_##name() depending on the sub-bitmap size.

- btrfs_get_subpage_dirty_bitmap()
  Rename it to btrfs_get_subpage_dirty_bitmap_value() to follow the new
  value/pointer naming.
  Since we do not support huge folios yet, there is no pointer version
  for it yet.

  Furthermore add the support for bs == ps cases for
  btrfs_get_subpage_dirty_bitmap_value(), so that the caller no longer
  needs to check if the folio needs subpage handling.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[CURRENT LIMIT]
Btrfs currently only supports sub-bitmaps (e.g. dirty bitmap) no larger
than BITS_PER_LONG.

One call site that utilizes this limit is btrfs_bio_ctrl::submit_bitmap,
which makes it very simple and straightforward to just grab an unsigned
long value and assign it to submit_bitmap.

Unfortunately that limit prevents us from supporting huge folios.
For 4K page size and block size, a huge folio (order 9) means 512 blocks
inside a 2M folio.

[ENHANCEMENT]
Instead of using a fixed unsigned long value, change
btrfs_bio_ctrl::submit_bitmap to an unsigned long pointer.

And for cases where an unsigned long can hold the whole bitmap,
introduce @submit_bitmap_value, and just point that pointer to that
unsigned long.

Then update all direct users of bio_ctrl->submit_bitmap to use the
pointer version.

There are several call sites that get extra changes:

- @range_bitmap inside extent_writepage_io()
  Which is only utilized to truncate the bitmap.
  Since we do not want to allocate new memory just for such temporary
  usage, change the original bitmap_set() and bitmap_and() into
  bitmap_clear() for the ranges out of the folio.

- Getting dirty subpage bitmap inside writepage_delalloc()
  Since we're passing an unsigned long pointer now, we need to go with
  different handling (bs == ps, blocks_per_folio <= BITS_PER_LONG,
  blocks_per_folio > BITS_PER_LONG).

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With all the previous preparations, it's finally time to enable the
huge folio support.

- The max folio size
  Here we define BTRFS_MAX_FOLIO_SIZE, which is fixed to 2MiB.

  This will ensure we have a large enough but not too large folio for
  btrfs.
  This limit applies to all systems regardless of page size.

  Then we also define BTRFS_MAX_BLOCKS_PER_FOLIO, which depends on
  CONFIG_BTRFS_EXPERIMENTAL.

  If it's an experimental build, BTRFS_MAX_BLOCKS_PER_FOLIO is 512,
  otherwise it's BITS_PER_LONG.

  The filemap max order will be calculated using both
  BTRFS_MAX_FOLIO_SIZE and BTRFS_MAX_BLOCKS_PER_FOLIO.

  E.g. for 64K page size with 64K fs block size, the limit will be
  BTRFS_MAX_FOLIO_SIZE (2M), which limits the filemap max order to 5.
  This will be lower than the old order (6), but folios larger than 2M
  are rarely any better for IO performance. Meanwhile excessively large
  folios can cause other problems like stalling the IO pipeline for too
  long.

  For 4K page size and 4K fs block size, the limit will be increased to
  2M from the old 256K.
  This new size is constrained by both BTRFS_MAX_FOLIO_SIZE (2M) and
  BTRFS_MAX_BLOCKS_PER_FOLIO (512 * 4K), allowing x86_64 to reach huge
  folio support, and the filemap max order will be 9.

- btrfs_bio_ctrl::submit_bitmap
  This will be enlarged to contain BTRFS_MAX_BLOCKS_PER_FOLIO bits, and
  this will be on-stack memory.
  This will increase on-stack memory usage by 56 bytes compared to the
  baseline (before any patch in the series).

- Local @delalloc_bitmap inside writepage_delalloc()
  Unfortunately we cannot afford to handle an allocation error here, so
  again we use on-stack memory.
  This will again increase on-stack memory usage by 56 bytes.

So unfortunately this means during the delalloc window, the writeback path
will have +112 bytes for on-stack memory usage, and for other cases the
writeback path will have +56 bytes on-stack memory usage.

The +56 bytes (btrfs_bio_ctrl::submit_bitmap) can be removed
after we have reworked the compression submission, so the current
on-stack submit_bitmap is mostly a workaround until then.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>