Test misc-next (regular, SELF)#1629
Open
kdave wants to merge 10000 commits into btrfs:ci-kvm from
df45462 to 12e9571
Baojun Xu <baojun.xu@ti.com> says: Link: https://patch.msgid.link/20260414015441.2439-1-baojun.xu@ti.com Signed-off-by: Mark Brown <broonie@kernel.org>
Pull tomoyo update from Tetsuo Handa: "Handle 64-bit inode numbers" * tag 'tomoyo-pr-20260422' of git://git.code.sf.net/p/tomoyo/tomoyo: tomoyo: use u64 for holding inode->i_ino value
…/git/danielt/linux Pull kgdb update from Daniel Thompson: "Only a very small update for kgdb this cycle: a single patch from Kexin Sun that fixes some outdated comments" * tag 'kgdb-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux: kgdb: update outdated references to kgdb_wait()
…linux/kernel/git/trace/linux-trace Pull ring-buffer fix from Steven Rostedt:

- Make undefsyms_base.c into a real file

  The file undefsyms_base.c is used to catch any symbols used by a remote ring buffer that is made for use of a pKVM hypervisor. As it doesn't share the same text as the rest of the kernel, referencing any symbols within the kernel will make it fail to be built for the standalone hypervisor. A file was created by the Makefile that checked for any symbols that could cause issues. There's no reason to have this file created by the Makefile; just create it as a normal file instead.

* tag 'trace-ring-buffer-v7.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: Make undefsyms_base.c a first-class citizen
…/git/rostedt/linux-ktest Pull ktest updates from Steven Rostedt:

- Fix month in date timestamp used to create failure directories

  On failure, a directory is created to store the logs and config file to analyze the failure. The Perl function localtime is used to create the date timestamp of the directory. The month passed back from that function starts at 0, not 1, but the timestamp used does not account for that. Thus for April 20, 2026, the timestamp 20260320 is used instead of 20260420.

- Save the logfile to the failure directory

  Just the test log was saved to the directory on failure, but there's useful information in the full logfile that can be helpful in analyzing the failure. Save the logfile as well.

* tag 'ktest-v7.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-ktest:
  ktest: Add logfile to failure directory
  ktest: Fix the month in the name of the failure directory
…el/git/trace/linux-trace Pull tracefs fixes from Steven Rostedt:

- Use list_add_tail_rcu() for walking eventfs children

  The linked list of children is protected by SRCU and list walkers can walk the list using only SRCU. Using just list_add_tail() on weakly ordered architectures can cause issues. Instead use list_add_tail_rcu().

- Hold eventfs_mutex and SRCU for remount walk events

  trace_apply_options() walks the tracefs_inodes, some of which are eventfs inodes, and calls eventfs_remount(), which in turn calls eventfs_set_attr(). This walk only holds normal RCU read locks, but the eventfs_mutex and SRCU should be held. Add eventfs_remount_(un)lock() helpers to take the necessary locks before iterating the list.

* tag 'tracefs-v7.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  eventfs: Hold eventfs_mutex and SRCU when remount walks events
  eventfs: Use list_add_tail_rcu() for SRCU-protected children list
This also removes the smbdirect_ prefix from the files. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/linux-cifs/CAHk-=whmue3PVi88K0UZLZO0at22QhQZ-yu+qO2TOKyZpGqecw@mail.gmail.com/ Cc: Steve French <smfrench@gmail.com> Cc: Tom Talpey <tom@talpey.com> Cc: Long Li <longli@microsoft.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-cifs@vger.kernel.org Cc: samba-technical@lists.samba.org Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Steve French <stfrench@microsoft.com>
…kernel/git/dtor/input Pull input updates from Dmitry Torokhov:

- a new charlieplex GPIO keypad driver
- an update to the aw86927 driver to support the 86938 chip
- an update for the Chrome OS EC keyboard driver to support Fn-<key> keymap extension
- a UAF fix in debugfs teardown in the EDT touchscreen driver
- a number of conversions for input drivers to use guard() and __free() cleanup primitives
- several drivers for bus mice (inport, logibm) and other very old devices have been removed
- the OLPC HGPK PS/2 protocol has been removed as it's been broken and inactive for some 10 years
- the dedicated kpsmoused has been removed from the psmouse driver
- other assorted cleanups and fixups

* tag 'input-for-v7.1-rc0' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: (101 commits)
  Input: charlieplex_keypad - add GPIO charlieplex keypad
  dt-bindings: input: add GPIO charlieplex keypad
  dt-bindings: input: add settling-time-us common property
  dt-bindings: input: add debounce-delay-ms common property
  Input: imx_keypad - fix spelling mistake "Colums" -> "Columns"
  Input: edt-ft5x06 - fix use-after-free in debugfs teardown
  Input: ims-pcu - fix heap-buffer-overflow in ims_pcu_process_data()
  Input: ct82c710 - remove driver
  Input: mk712 - remove driver
  Input: logibm - remove driver
  Input: inport - remove driver
  Input: qt1070 - inline i2c_check_functionality check
  Input: qt1050 - inline i2c_check_functionality check
  Input: aiptek - validate raw macro indices before updating state
  Input: gf2k - skip invalid hat lookup values
  Input: xpad - add RedOctane Games vendor id
  Input: xpad - remove stale TODO and changelog header
  Input: usbtouchscreen - refactor endpoint lookup
  Input: aw86927 - add support for Awinic AW86938
  dt-bindings: input: awinic,aw86927: Add Awinic AW86938
  ...
Pull more VFIO updates from Alex Williamson:

- Fix ordering of dma-buf cleanup versus device disabling in vfio-pci (Matt Evans)
- Resolve an inconsistent and incorrect use of spinlock-irq in the virtio vfio-pci variant by conversion to mutex and proceed to modernize and simplify the driver with use of guards (Alex Williamson)
- Resurrect the removal of the remaining class_create() call in vfio, replacing with a const struct class and class_register() (Jori Koolstra, Alex Williamson)
- Fix NULL pointer dereference, properly serialize interrupt setup, and clean up interrupt state tracking in the cdx vfio bus driver (Prasanna Kumar T S M, Alex Williamson)

* tag 'vfio-v7.1-rc1-pt2' of https://github.com/awilliam/linux-vfio:
  vfio/cdx: Consolidate MSI configured state onto cdx_irqs
  vfio/cdx: Serialize VFIO_DEVICE_SET_IRQS with a per-device mutex
  vfio/cdx: Fix NULL pointer dereference in interrupt trigger path
  vfio: replace vfio->device_class with a const struct class
  vfio/virtio: Use guard() for bar_mutex in legacy I/O
  vfio/virtio: Use guard() for migf->lock where applicable
  vfio/virtio: Use guard() for list_lock where applicable
  vfio/virtio: Convert list_lock from spinlock to mutex
  vfio/pci: Clean up DMABUFs before disabling function
AppArmor dfas need a minimum of two states to be valid. State 0 is the default trap state, and State 1 the default start state. When verifying the dfa ensure that this is the case. Fixes: c27c6bd ("apparmor: ensure that dfa state tables have entries") Signed-off-by: John Johansen <john.johansen@canonical.com>
error is initialized to -EPROTO but may be overwritten by some of the internal functions; unfortunately the last two checks assume error is still -EPROTO in the failure case. Ensure it is by setting it before these checks. Fixes: 3d28e23 ("apparmor: add support loading per permission tagging") Reported-by: Dan Carpenter <error27@gmail.com> Signed-off-by: John Johansen <john.johansen@canonical.com>
In apparmor_path_rename(), when handling RENAME_EXCHANGE, the
cond_exchange structure is supposed to carry the attributes of the
*new* dentry (since it is used to authorize moving new_dentry to the
old location). However, line 412 reads:
vfsuid = i_uid_into_vfsuid(idmap, d_backing_inode(old_dentry));
This fetches the uid of old_dentry instead of new_dentry. As a result,
the RENAME_EXCHANGE permission check uses the wrong file owner, which
can allow a rename that should be denied (if old_dentry's owner has
more privileges) or deny one that should be allowed.
Note that cond_exchange.mode on the line above correctly uses
new_dentry. Only the uid lookup is wrong.
Fix by changing old_dentry to new_dentry in the i_uid_into_vfsuid call.
Fixes: 5e26a01 ("apparmor: use type safe idmapping helpers")
Reviewed-by: Georgia Garcia <georgia.garcia@canonical.com>
Signed-off-by: Dudu Lu <phx0fer@gmail.com>
Signed-off-by: John Johansen <john.johansen@canonical.com>
When booting Ubuntu 26.04 with Linux 7.0-rc4 on an ARM64 Qualcomm Snapdragon X1 we see a string buffer overrun:

BUG: KASAN: slab-out-of-bounds in aa_dfa_match (security/apparmor/match.c:535)
Read of size 1 at addr ffff0008901cc000 by task snap-update-ns/2120
CPU: 5 UID: 60578 PID: 2120 Comm: snap-update-ns Not tainted 7.0.0-rc4+ #22 PREEMPTLAZY
Hardware name: LENOVO 83ED/LNVNB161216, BIOS NHCN60WW 09/11/2025
Call trace:
 show_stack (arch/arm64/kernel/stacktrace.c:501) (C)
 dump_stack_lvl (lib/dump_stack.c:122)
 print_report (mm/kasan/report.c:379 mm/kasan/report.c:482)
 kasan_report (mm/kasan/report.c:597)
 __asan_report_load1_noabort (mm/kasan/report_generic.c:378)
 aa_dfa_match (security/apparmor/match.c:535)
 match_mnt_path_str (security/apparmor/mount.c:244 security/apparmor/mount.c:336)
 match_mnt (security/apparmor/mount.c:371)
 aa_bind_mount (security/apparmor/mount.c:447 (discriminator 4))
 apparmor_sb_mount (security/apparmor/lsm.c:719 (discriminator 1))
 security_sb_mount (security/security.c:1062 (discriminator 31))
 path_mount (fs/namespace.c:4101)
 __arm64_sys_mount (fs/namespace.c:4172 fs/namespace.c:4361 fs/namespace.c:4338 fs/namespace.c:4338)
 invoke_syscall.constprop.0 (arch/arm64/kernel/syscall.c:35 arch/arm64/kernel/syscall.c:49)
 el0_svc_common.constprop.0 (./include/linux/thread_info.h:142 (discriminator 2) arch/arm64/kernel/syscall.c:140 (discriminator 2))
 do_el0_svc (arch/arm64/kernel/syscall.c:152)
 el0_svc (arch/arm64/kernel/entry-common.c:80 arch/arm64/kernel/entry-common.c:725)
 el0t_64_sync_handler (arch/arm64/kernel/entry-common.c:744)
 el0t_64_sync (arch/arm64/kernel/entry.S:596)

Allocated by task 2120:
 kasan_save_stack (mm/kasan/common.c:58)
 kasan_save_track (./arch/arm64/include/asm/current.h:19 mm/kasan/common.c:70 mm/kasan/common.c:79)
 kasan_save_alloc_info (mm/kasan/generic.c:571)
 __kasan_kmalloc (mm/kasan/common.c:419)
 __kmalloc_noprof (./include/linux/kasan.h:263 mm/slub.c:5260 mm/slub.c:5272)
 aa_get_buffer (security/apparmor/lsm.c:2201)
 aa_bind_mount (security/apparmor/mount.c:442)
 apparmor_sb_mount (security/apparmor/lsm.c:719 (discriminator 1))
 security_sb_mount (security/security.c:1062 (discriminator 31))
 path_mount (fs/namespace.c:4101)
 __arm64_sys_mount (fs/namespace.c:4172 fs/namespace.c:4361 fs/namespace.c:4338 fs/namespace.c:4338)
 invoke_syscall.constprop.0 (arch/arm64/kernel/syscall.c:35 arch/arm64/kernel/syscall.c:49)
 el0_svc_common.constprop.0 (./include/linux/thread_info.h:142 (discriminator 2) arch/arm64/kernel/syscall.c:140 (discriminator 2))
 do_el0_svc (arch/arm64/kernel/syscall.c:152)
 el0_svc (arch/arm64/kernel/entry-common.c:80 arch/arm64/kernel/entry-common.c:725)
 el0t_64_sync_handler (arch/arm64/kernel/entry-common.c:744)
 el0t_64_sync (arch/arm64/kernel/entry.S:596)

The buggy address belongs to the object at ffff0008901ca000 which belongs to the cache kmalloc-rnd-06-8k of size 8192
The buggy address is located 0 bytes to the right of allocated 8192-byte region [ffff0008901ca000, ffff0008901cc000)

The buggy address belongs to the physical page:
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x9101c8
head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:-1 pincount:0
flags: 0x8000000000000040(head|zone=2)
page_type: f5(slab)
raw: 8000000000000040 ffff000800016c40 fffffdffe2d14e10 ffff000800015c70
raw: 0000000000000000 0000000800010001 00000000f5000000 0000000000000000
head: 8000000000000040 ffff000800016c40 fffffdffe2d14e10 ffff000800015c70
head: 0000000000000000 0000000800010001 00000000f5000000 0000000000000000
head: 8000000000000003 fffffdffe2407201 fffffdffffffffff 00000000ffffffff
head: ffffffffffffffff 0000000000000000 00000000ffffffff 0000000000000008
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff0008901cbf00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 ffff0008901cbf80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>ffff0008901cc000: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
                   ^
 ffff0008901cc080: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 ffff0008901cc100: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc

This was introduced by a previous incorrect conversion from strcpy(). Fix it by adding the missing terminator.
Cc: stable@vger.kernel.org
Reviewed-by: Georgia Garcia <georgia.garcia@canonical.com>
Signed-off-by: Daniel J Blueman <daniel@quora.org>
Fixes: 93d4dbd ("apparmor: Replace deprecated strcpy in d_namespace_path")
Signed-off-by: John Johansen <john.johansen@canonical.com>
aa_dfa_unpack returns ERR_PTR not NULL when it fails, but aa_put_dfa only checks NULL for its input, which would cause invalid memory access in aa_put_dfa. Set nulldfa to NULL explicitly to fix that. Fixes: 98b824f ("apparmor: refcount the pdb") Signed-off-by: GONG Ruiqi <gongruiqi1@huawei.com> Signed-off-by: John Johansen <john.johansen@canonical.com>
Since commit 2bd8248 ("xps: fix xps for stacked devices"), skb->napi_id shares storage with sender_cpu. RX tracepoints using net_dev_rx_verbose_template read skb->napi_id directly and can therefore report sender_cpu values as if they were NAPI IDs. For example, on the loopback path this can report 1 as napi_id, where 1 comes from raw_smp_processor_id() + 1 in the XPS path: # bpftrace -e 'tracepoint:net:netif_rx_entry{ print(args->napi_id); }' # taskset -c 0 ping -c 1 ::1 Report only valid NAPI IDs in these tracepoints and use 0 otherwise. Fixes: 2bd8248 ("xps: fix xps for stacked devices") Signed-off-by: Kohei Enju <kohei@enjuk.jp> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev> Link: https://patch.msgid.link/20260420105427.162816-1-kohei@enjuk.jp Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In tpacket_snd(), when PACKET_VNET_HDR is enabled, vnet_hdr points directly into the mmap'd TX ring buffer shared with userspace. The kernel validates the header via __packet_snd_vnet_parse() but then re-reads all fields later in virtio_net_hdr_to_skb(). A concurrent userspace thread can modify the vnet_hdr fields between validation and use, bypassing all safety checks. The non-TPACKET path (packet_snd()) already correctly copies vnet_hdr to a stack-local variable. All other vnet_hdr consumers in the kernel (tun.c, tap.c, virtio_net.c) also use stack copies. The TPACKET TX path is the only caller of virtio_net_hdr_to_skb() that reads directly from user-controlled shared memory. Fix this by copying vnet_hdr from the mmap'd ring buffer to a stack-local variable before validation and use, consistent with the approach used in packet_snd() and all other callers. Fixes: 1d036d2 ("packet: tpacket_snd gso and checksum offload") Signed-off-by: Bingquan Chen <patzilla007@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20260418112006.78823-1-patzilla007@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Fix spelling mistake "targgeting" -> "targeting" in maintainer-netdev.rst No functional change. Signed-off-by: Ariful Islam Shoikot <islamarifulshoikat@gmail.com> Reviewed-by: Breno Leitao <leitao@debian.org> Link: https://patch.msgid.link/20260420114554.1026-1-islamarifulshoikat@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Firmware may not advertise correct resources if backing store is not enabled before resource information is queried. Fix the initial sequence of HWRMs so that the driver gets capabilities and resource information correctly. Fixes: 3fa9e97 ("bng_en: Initialize default configuration") Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com> Reviewed-by: Rahul Gupta <rahul-rg.gupta@broadcom.com> Link: https://patch.msgid.link/20260418023438.1597876-2-vikas.gupta@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The backing store type, BNGE_CTX_MRAV, is not applicable to Thor Ultra devices. Remove it from the backing store configuration, as the firmware will not populate entities in this backing store type, which causes the driver load to fail. Fixes: 29c5b35 ("bng_en: Add backing store support") Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com> Reviewed-by: Dharmender Garg <dharmender.garg@broadcom.com> Link: https://patch.msgid.link/20260418023438.1597876-3-vikas.gupta@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vikas Gupta says:
====================
bnge fixes
Patch-1:
Due to a wrong HWRM sequence, the driver does not get the correct
information regarding resources and capabilities.
The patch fixes the initial HWRM sequence.
Patch-2:
Remove initialization of the backing store type that is not
supported on Thor Ultra devices.
====================
Link: https://patch.msgid.link/20260418023438.1597876-1-vikas.gupta@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
sk_clone() increments sockets_allocated and sets the socket refcount to 2. SCTP performs additional accounting in sctp_clone_sock(), so the clone-time increment must be undone to avoid double counting. Note we cannot simply remove the SCTP-side increment, because the SCTP destroy path in sctp_destroy_sock() only decrements sockets_allocated when sp->ep is set, which may not be true for all failure paths in sctp_clone_sock(). Fixes: 16942cf ("sctp: Use sk_clone() in sctp_accept().") Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/af8d66f928dec3e9fcbee8d4a85b7d5a6b86f515.1776460180.git.lucien.xin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When SEG6_IPTUN_MODE_L2ENCAP_RED (L2ENCAP_RED) was introduced, the condition in seg6_build_state() that excludes L2 encap modes from setting LWTUNNEL_STATE_OUTPUT_REDIRECT was not updated to account for the new mode. As a consequence, L2ENCAP_RED routes incorrectly trigger seg6_output() on the output path, where the packet is silently dropped because skb_mac_header_was_set() fails on L3 packets. Extend the check to also exclude L2ENCAP_RED, consistent with L2ENCAP. Fixes: 13f0296 ("seg6: add support for SRv6 H.L2Encaps.Red behavior") Cc: stable@vger.kernel.org Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it> Reviewed-by: Justin Iurman <justin.iurman@gmail.com> Link: https://patch.msgid.link/20260418162838.31979-1-andrea.mayer@uniroma2.it Signed-off-by: Jakub Kicinski <kuba@kernel.org>
rds_for_each_conn_info() and rds_walk_conn_path_info() both hand a
caller-allocated on-stack u64 buffer to a per-connection visitor and
then copy the full item_len bytes back to user space via
rds_info_copy() regardless of how much of the buffer the visitor
actually wrote.
rds_ib_conn_info_visitor() and rds6_ib_conn_info_visitor() only
write a subset of their output struct when the underlying
rds_connection is not in state RDS_CONN_UP (src/dst addr, tos, sl
and the two GIDs via explicit memsets). Several u32 fields
(max_send_wr, max_recv_wr, max_send_sge, rdma_mr_max, rdma_mr_size,
cache_allocs) and the 2-byte alignment hole between sl and
cache_allocs remain as whatever stack contents preceded the visitor
call and are then memcpy_to_user()'d out to user space.
struct rds_info_rdma_connection and struct rds6_info_rdma_connection
are the only rds_info_* structs in include/uapi/linux/rds.h that are
not marked __attribute__((packed)), so they have a real alignment
hole. The other info visitors (rds_conn_info_visitor,
rds6_conn_info_visitor, rds_tcp_tc_info, ...) write all fields of
their packed output struct today and are not known to be vulnerable,
but a future visitor that adds a conditional write-path would have
the same bug.
Reproduction on a kernel built without CONFIG_INIT_STACK_ALL_ZERO=y:
a local unprivileged user opens AF_RDS, sets SO_RDS_TRANSPORT=IB,
binds to a local address on an RDMA-capable netdev (rxe soft-RoCE on
any netdev is sufficient), sendto()'s any peer on the same subnet
(fails cleanly but installs an rds_connection in the global hash in
RDS_CONN_CONNECTING), then calls getsockopt(SOL_RDS,
RDS_INFO_IB_CONNECTIONS). The returned 68-byte item contains 26
bytes of stack garbage including kernel text/data pointers:
0..7 0a 63 00 01 0a 63 00 02 src=10.99.0.1 dst=10.99.0.2
8..39 00 ... gids (memset-zeroed)
40..47 e0 92 a3 81 ff ff ff ff kernel pointer (max_send_wr)
48..55 7f 37 b5 81 ff ff ff ff kernel pointer (rdma_mr_max)
56..59 01 00 08 00 rdma_mr_size (garbage)
60..61 00 00 tos, sl
62..63 00 00 alignment padding
64..67 18 00 00 00 cache_allocs (garbage)
Fix by zeroing the per-item buffer in both rds_for_each_conn_info()
and rds_walk_conn_path_info() before invoking the visitor. This
covers the IPv4/IPv6 IB visitors and hardens all current and future
visitors against the same class of bug.
No functional change for visitors that fully populate their output.
Changes in v2:
- retarget at the net tree (subject prefix "[PATCH net v2]",
net/rds: prefix in the title)
- pick up Reviewed-by tags from Sharath Srinivasan and
Allison Henderson
Fixes: ec16227 ("RDS/IB: Infiniband transport")
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
Reviewed-by: Sharath Srinivasan <sharath.srinivasan@oracle.com>
Reviewed-by: Allison Henderson <achender@kernel.org>
Assisted-by: Claude:claude-opus-4-7
Link: https://patch.msgid.link/20260418141047.3398203-1-michael.bommarito@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
It's not necessary since we can get the block group from the given free space control structure. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
We can get the free space control structure from the given block group, so there is no need to pass it as an argument. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
There is no need to pass the free space control structure as an argument because we can grab it from the given block group. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
[WARNING]
When running test cases with injected errors or shutdown, e.g. generic/388 or generic/475, there is a chance that the following kernel warning is triggered:

BTRFS info (device dm-2): first mount of filesystem d8a19a28-3232-4809-b0df-38df83e71bff
BTRFS info (device dm-2): using crc32c checksum algorithm
BTRFS info (device dm-2): checking UUID tree
BTRFS info (device dm-2): turning on async discard
BTRFS info (device dm-2): enabling free space tree
BTRFS critical (device dm-2 state E): emergency shutdown
------------[ cut here ]------------
WARNING: extent_io.c:1742 at extent_writepage_io+0x437/0x520 [btrfs], CPU#2: kworker/u43:2/651591
CPU: 2 UID: 0 PID: 651591 Comm: kworker/u43:2 Tainted: G W OE 7.0.0-rc6-custom+ #365 PREEMPT(full) 5804053f02137e627472d94b5128cc9fcb110e88
RIP: 0010:extent_writepage_io+0x437/0x520 [btrfs]
Call Trace:
<TASK>
 extent_write_cache_pages+0x2a5/0x820 [btrfs 70299925d0856939e93b17d480651713b3cbba58]
 btrfs_writepages+0x74/0x130 [btrfs 70299925d0856939e93b17d480651713b3cbba58]
 do_writepages+0xd0/0x160
 __writeback_single_inode+0x42/0x340
 writeback_sb_inodes+0x22d/0x580
 wb_writeback+0xc6/0x360
 wb_workfn+0xbd/0x470
 process_one_work+0x198/0x3b0
 worker_thread+0x1c8/0x330
 kthread+0xee/0x120
 ret_from_fork+0x2a6/0x330
 ret_from_fork_asm+0x11/0x20
</TASK>
---[ end trace 0000000000000000 ]---
BTRFS error (device dm-2 state E): root 5 ino 259 folio 1323008 is marked dirty without notifying the fs
BTRFS error (device dm-2 state E): failed to submit blocks, root=5 inode=259 folio=1323008 submit_bitmap=0: -117
BTRFS info (device dm-2 state E): last unmount of filesystem d8a19a28-3232-4809-b0df-38df83e71bff

[CAUSE]
Inside btrfs we have the following pattern in several locations, for example inside btrfs_dirty_folio():

  btrfs_clear_extent_bit(&inode->io_tree, start_pos, end_of_last_block,
                         EXTENT_DELALLOC | EXTENT_DO_ACCOUNTING | EXTENT_DEFRAG, cached);
  ret = btrfs_set_extent_delalloc(inode, start_pos, end_of_last_block,
                                  extra_bits, cached);
  if (ret)
          return ret;

However btrfs_set_extent_delalloc() can return IO errors other than -ENOMEM through the following callchain:

  btrfs_set_extent_delalloc()
  \- btrfs_find_new_delalloc_bytes()
     \- btrfs_get_extent()
        \- btrfs_lookup_file_extent()
           \- btrfs_search_slot()

When such an IO error happens, the previous btrfs_clear_extent_bit() has already cleared EXTENT_DELALLOC for the range, and we're expecting btrfs_set_extent_delalloc() to re-set EXTENT_DELALLOC. But since btrfs_set_extent_delalloc() failed before btrfs_set_extent_bit(), the EXTENT_DELALLOC flag is no longer present. And if the folio range was dirty before entering btrfs_set_extent_delalloc(), we now have a dirty folio but no EXTENT_DELALLOC flag. Then we hit the folio writeback:

  extent_writepage()
  |- writepage_delalloc()
  |  No ordered extent is created, as there is no EXTENT_DELALLOC set
  |  for the folio range.
  |  This also means the folio has no ordered flag set.
  |
  |- extent_writepage_io()
     \- if (unlikely(!folio_test_ordered(folio))

Now we hit the warning.

[FIX]
Introduce a new helper, btrfs_reset_extent_delalloc(), to replace the currently open-coded btrfs_clear_extent_bit() + btrfs_set_extent_delalloc() combination. Instead of calling btrfs_clear_extent_bit() first, update EXTENT_DELALLOC_NEW first, as that part can fail due to metadata IO, while btrfs_clear_extent_bit() and btrfs_set_extent_bit() won't return any error but retry memory allocation until it succeeds. This allows us to fail early without clearing the EXTENT_DELALLOC bit, so even if the new btrfs_reset_extent_delalloc() fails before touching EXTENT_DELALLOC, the existing dirty range will still have its old EXTENT_DELALLOC flag present, thus avoiding the warning.
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add a new unprivileged BTRFS_IOC_GET_CSUMS ioctl, which can be used to
query the on-disk csums for a file range.
The ioctl is deliberately per-file rather than exposing raw csum tree
lookups, to avoid leaking information to users about files they may not
have access to.
This is done by userspace passing a struct btrfs_ioctl_get_csums_args to
the kernel, which details the offset and length we're interested in, and
a buffer for the kernel to write its results into. The kernel writes a
struct btrfs_ioctl_get_csums_entry into the buffer, followed by the
csums if available. The maximum size of the user buffer is capped to
16MiB.
If the extent is an uncompressed, non-NODATASUM extent, the kernel sets
the entry type to BTRFS_GET_CSUMS_HAS_CSUMS and follows it with the
csums. If it is sparse, preallocated, or beyond the EOF, it sets the
type to BTRFS_GET_CSUMS_ZEROED - this is so userspace knows it can use
the precomputed hash of the zero sector. Otherwise, it sets the type to
BTRFS_GET_CSUMS_NODATASUM, BTRFS_GET_CSUMS_COMPRESSED,
BTRFS_GET_CSUM_ENCRYPTED, or BTRFS_GET_CSUM_INLINE.
For example, a file with a [0, 4K) hole and [4K, 12K) data extent would
produce the following output buffer:
| [0, 4K) ZEROED | [4K, 12K) HAS_CSUMS | csum data |
We do store the csums of compressed extents, but we deliberately don't
return them here: they're calculated over the compressed data, not the
uncompressed data that's returned to userspace. Similarly for encrypted
data, once encryption is supported, in which the csums will be on the
ciphertext.
The main use case for this is for speeding up mkfs.btrfs --rootdir. For
the case when the source FS is btrfs and using the same csum algorithm,
we can avoid having to recalculate the csums - in my synthetic
benchmarks (16GB file on a spinning-rust drive), this resulted in a ~11%
speed-up (218s to 196s).
When using the --reflink option added in btrfs-progs v6.16.1, we can forgo
reading the data entirely, resulting in a ~2200% speed-up on the same test
(128s to 6s).
# mkdir rootdir
# dd if=/dev/urandom of=rootdir/file bs=4096 count=4194304
(without ioctl)
# echo 3 > /proc/sys/vm/drop_caches
# time mkfs.btrfs --rootdir rootdir testimg
...
real 3m37.965s
user 0m5.496s
sys 0m6.125s
# echo 3 > /proc/sys/vm/drop_caches
# time mkfs.btrfs --rootdir rootdir --reflink testimg
...
real 2m8.342s
user 0m5.472s
sys 0m1.667s
(with ioctl)
# echo 3 > /proc/sys/vm/drop_caches
# time mkfs.btrfs --rootdir rootdir testimg
...
real 3m15.865s
user 0m4.258s
sys 0m6.261s
# echo 3 > /proc/sys/vm/drop_caches
# time mkfs.btrfs --rootdir rootdir --reflink testimg
...
real 0m5.847s
user 0m2.899s
sys 0m0.097s
Another notable use case is for deduplication, where reading the
checksums may serve as a hint instead of reading the whole file data.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Running btrfs balance with a usage filter (-dusage=N) can trigger a
null-ptr-deref when metadata corruption causes a chunk to have no
corresponding block group in the in-memory cache:
KASAN: null-ptr-deref in range [0x0000000000000070-0x0000000000000077]
RIP: 0010:chunk_usage_filter fs/btrfs/volumes.c:3874 [inline]
RIP: 0010:should_balance_chunk fs/btrfs/volumes.c:4018 [inline]
RIP: 0010:__btrfs_balance fs/btrfs/volumes.c:4172 [inline]
RIP: 0010:btrfs_balance+0x2024/0x42b0 fs/btrfs/volumes.c:4604
...
Call Trace:
btrfs_ioctl_balance fs/btrfs/ioctl.c:3577 [inline]
btrfs_ioctl+0x25cf/0x5b90 fs/btrfs/ioctl.c:5313
vfs_ioctl fs/ioctl.c:51 [inline]
...
The bug is reproducible on the current development branch.
[CAUSE]
Two separate data structures are involved:
1. The on-disk chunk tree, which records every chunk (logical address
space region) and is iterated by __btrfs_balance().
2. The in-memory block group cache (fs_info->block_group_cache_tree),
which is built at mount time by btrfs_read_block_groups() and holds
a struct btrfs_block_group for each chunk. This cache is what the
usage filter queries.
On a well-formed filesystem, these two are kept in 1:1 correspondence.
However, btrfs_read_block_groups() builds the cache from block group
items in the extent tree, not directly from the chunk tree. A corrupted
image can therefore contain a chunk item in the chunk tree whose
corresponding block group item is absent from the extent tree; that
chunk's block group is then never inserted into the in-memory cache.
When balance iterates the chunk tree and reaches such an orphaned chunk,
should_balance_chunk() calls chunk_usage_filter(), which queries the block
group cache:
cache = btrfs_lookup_block_group(fs_info, chunk_offset);
chunk_used = cache->used; /* cache may be NULL */
btrfs_lookup_block_group() returns NULL silently when no cached entry
covers chunk_offset. chunk_usage_filter() does not check the return value,
so the immediately following dereference of cache->used triggers the crash.
[FIX]
Add a NULL check after btrfs_lookup_block_group() in chunk_usage_filter().
When the lookup fails, emit a btrfs_err() message identifying the
affected bytenr and return -EUCLEAN to indicate filesystem corruption.
Since chunk_usage_filter() now has an error path, change its return type
from bool to int: a negative error on failure, 0 if the chunk passes the
usage filter, and 1 if it should be skipped.
Update should_balance_chunk() accordingly to propagate negative errors
from the usage filter.
Signed-off-by: ZhengYuan Huang <gality369@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
…filter()
[BUG]
Running btrfs balance with a usage range filter (-dusage=min..max) can
trigger a null-ptr-deref when metadata corruption causes a chunk to have
no corresponding block group in the in-memory cache:
KASAN: null-ptr-deref in range [0x0000000000000070-0x0000000000000077]
RIP: 0010:chunk_usage_range_filter fs/btrfs/volumes.c:3845 [inline]
RIP: 0010:should_balance_chunk fs/btrfs/volumes.c:4031 [inline]
RIP: 0010:__btrfs_balance fs/btrfs/volumes.c:4182 [inline]
RIP: 0010:btrfs_balance+0x249e/0x4320 fs/btrfs/volumes.c:4618
...
Call Trace:
btrfs_ioctl_balance fs/btrfs/ioctl.c:3577 [inline]
btrfs_ioctl+0x25cf/0x5b90 fs/btrfs/ioctl.c:5313
vfs_ioctl fs/ioctl.c:51 [inline]
...
The bug is reproducible on the current development branch.
[CAUSE]
Two separate data structures are involved:
1. The on-disk chunk tree, which records every chunk (logical address
space region) and is iterated by __btrfs_balance().
2. The in-memory block group cache (fs_info->block_group_cache_tree),
which is built at mount time by btrfs_read_block_groups() and holds
a struct btrfs_block_group for each chunk. This cache is what the
usage range filter queries.
On a well-formed filesystem, these two are kept in 1:1 correspondence.
However, btrfs_read_block_groups() builds the cache from block group
items in the extent tree, not directly from the chunk tree. A corrupted
image can therefore contain a chunk item in the chunk tree whose
corresponding block group item is absent from the extent tree; that
chunk's block group is then never inserted into the in-memory cache.
When balance iterates the chunk tree and reaches such an orphaned chunk,
should_balance_chunk() calls chunk_usage_range_filter(), which queries
the block group cache:
cache = btrfs_lookup_block_group(fs_info, chunk_offset);
chunk_used = cache->used; /* cache may be NULL */
btrfs_lookup_block_group() returns NULL silently when no cached entry
covers chunk_offset. chunk_usage_range_filter() does not check the return
value, so the immediately following dereference of cache->used triggers
the crash.
[FIX]
Add a NULL check after btrfs_lookup_block_group() in
chunk_usage_range_filter(). When the lookup fails, emit a btrfs_err()
message identifying the affected bytenr and return -EUCLEAN to indicate
filesystem corruption.
Since chunk_usage_range_filter() now has an error path, change its
return type from bool to int: return a negative error code on failure,
0 if the chunk matches the usage range, and 1 if it should be filtered out.
Signed-off-by: ZhengYuan Huang <gality369@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
…ta_chunk()
[BUG]
Running btrfs balance can trigger a null-ptr-deref before relocating a
data chunk when metadata corruption leaves a chunk in the chunk tree
without a corresponding block group in the in-memory cache:
KASAN: null-ptr-deref in range [0x0000000000000088-0x000000000000008f]
RIP: 0010:btrfs_may_alloc_data_chunk+0x40/0x1c0 fs/btrfs/volumes.c:3601
Call Trace:
__btrfs_balance fs/btrfs/volumes.c:4217 [inline]
btrfs_balance+0x2516/0x42b0 fs/btrfs/volumes.c:4604
btrfs_ioctl_balance fs/btrfs/ioctl.c:3577 [inline]
btrfs_ioctl+0x25cf/0x5b90 fs/btrfs/ioctl.c:5313
...
[CAUSE]
__btrfs_balance() iterates the on-disk chunk tree and passes the chunk
logical bytenr to btrfs_may_alloc_data_chunk() before relocating a data
chunk. That helper then queries the in-memory block group cache:
cache = btrfs_lookup_block_group(fs_info, chunk_offset);
chunk_type = cache->flags; /* cache may be NULL */
A corrupt image can contain a chunk item whose matching block group
item is missing, so no block group is ever inserted into the cache. In
that case btrfs_lookup_block_group() returns NULL.
The code only guards this with ASSERT(cache), which becomes a no-op when
CONFIG_BTRFS_ASSERT is disabled. The subsequent dereference of
cache->flags therefore crashes the kernel.
[FIX]
Add a NULL check after btrfs_lookup_block_group() in
btrfs_may_alloc_data_chunk() and print an error message for clarity.
Signed-off-by: ZhengYuan Huang <gality369@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
A corrupted image with a chunk present in the chunk tree but whose
corresponding block group item is missing from the extent tree can be
mounted successfully, even though check_chunk_block_group_mappings()
is supposed to catch exactly this corruption at mount time. Once
mounted, running btrfs balance with a usage filter (-dusage=N or
-dusage=min..max) triggers a null-ptr-deref:
KASAN: null-ptr-deref in range [0x0000000000000070-0x0000000000000077]
RIP: 0010:chunk_usage_filter fs/btrfs/volumes.c:3874 [inline]
RIP: 0010:should_balance_chunk fs/btrfs/volumes.c:4018 [inline]
RIP: 0010:__btrfs_balance fs/btrfs/volumes.c:4172 [inline]
RIP: 0010:btrfs_balance+0x2024/0x42b0 fs/btrfs/volumes.c:4604
[CAUSE]
The crash occurs because __btrfs_balance() iterates the on-disk chunk
tree, finds the orphaned chunk, calls chunk_usage_filter() (or
chunk_usage_range_filter()), which queries the in-memory block group
cache via btrfs_lookup_block_group(). Since no block group was ever
inserted for this chunk, the lookup returns NULL, and the subsequent
dereference of cache->used crashes.
check_chunk_block_group_mappings() uses btrfs_find_chunk_map() to
iterate the in-memory chunk map (fs_info->mapping_tree):
map = btrfs_find_chunk_map(fs_info, start, 1);
With @start = 0 and @length = 1, btrfs_find_chunk_map() looks for a
chunk map that *contains* the logical address 0. If no chunk contains
logical address 0, btrfs_find_chunk_map(fs_info, 0, 1) returns NULL
immediately and the loop breaks after the very first iteration,
having checked zero chunks. The entire verification function is therefore
a no-op, and the corrupted image passes the mount-time check undetected.
[FIX]
Replace the btrfs_find_chunk_map() based loop with a direct in-order
walk of fs_info->mapping_tree using rb_first_cached() + rb_next().
This guarantees that every chunk map in the tree is visited regardless
of the logical addresses involved.
No lock is taken around the traversal. This function is called during
mount from btrfs_read_block_groups(), which is invoked from open_ctree()
before any background threads (cleaner, transaction kthread, etc.) are
started. There are therefore no concurrent writers that could modify
mapping_tree at this point. An analogous lockless direct traversal of
mapping_tree already exists in fill_dummy_bgs() in the same file.
Since we walk the rb-tree directly via rb_entry() without going through
btrfs_find_chunk_map(), no reference is taken on each map entry, so the
btrfs_free_chunk_map() calls are also removed.
Signed-off-by: ZhengYuan Huang <gality369@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
…_extents() There's no need to calculate again the size for the temporary block reserve in btrfs_replace_file_extents() - we have already calculated it and stored it in the 'min_size' variable. So use the variable to make it more clear and also make the variable const since it's not supposed to change during the whole function. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
…fields The 'disk_cache_state' and 'cached' fields are defined with an int type but all the values we assigned to them come from the enums btrfs_disk_cache_state and btrfs_caching_type. So change the type in the btrfs_block_group structure from int to these enums - in practice an enum is an int, so this is more for readability and clarity. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Sun YangKai <sunk67188@gmail.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Drop the length argument and use the simpler QSTR(). Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
If statement branches that lead to a DEBUG_WARN() are unexpected to happen and in most places we surround their expressions with the unlikely tag, however a few places are missing. Add the unlikely tag to those missing places to make it explicit to a reader that it's not expected and to hint the compiler to generate better code. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
The function always returns true or false but its return type is defined as int, which makes no sense. Change it to bool. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Originally 2K block size support was introduced to test subpage (block size < page size) on x86_64 where the page size is exactly the original minimal block size. However that 2K block size support has some problems: - No 2K nodesize support This is critical, as there is still no way to exercise the subpage metadata routine. - Very easy to test subpage data path now With the currently experimental large folio support, it's very easy to test the subpage data folio path already, as when a folio larger than 4K is encountered on x86_64, we will need all the subpage folio states and bitmaps. So there is no need to use 2K block size just to verify subpage data path even on x86_64. And with the incoming huge folio (2M on x86_64) support, the 2K block size will easily double the bitmap size, considering the burden to maintain and the limited extra coverage, I believe it's time to remove it for the incoming huge folio support. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Currently btrfs_writepages() just accumulates as large bio as possible (within writeback_control constraints) and then submits it. This can however lead to significant latency in writeback IO submission (I have observed tens of milliseconds) because the submitted bio easily has over hundred of megabytes. Consequently this leads to IO pipeline stalls and reduced throughput. At the same time beyond certain size submitting so large bio provides diminishing returns because the bio is split by the block layer immediately anyway. So compute (estimate of) bio size beyond which we are unlikely to improve performance and just submit the bio for writeback once we accumulate that much to keep the IO pipeline busy. This improves writeback throughput for sequential writes by about 15% on the test machine I was using. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Jan Kara <jack@suse.cz> [ Fix the handling of missing device to avoid NULL pointer dereference. ] Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Commit 0b2600f ("treewide: change inode->i_ino from unsigned long to u64") sets the inode number type to u64 unconditionally, so we can use it directly as there's no difference between 32bit and 64bit platforms. We used to have a copy of the number in our btrfs_inode. The size of btrfs_inode on a 32bit platform is about 688 bytes (after the change). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
[BACKGROUND] When bs < ps support was initially introduced, the compressed data readahead was disabled as at that time the target page size was 64K. This means a compressed data extent can span at most 3 64K pages (the head and tail parts are not aligned to 64K), meaning the benefit is pretty minimal. [UNEXPECTED WORKING SITUATION] But with the already merged large folio support, we're already enabling readahead with subpage routine unintentionally, e.g.: 0 4K 8K 12K 16K | Folio 0 | Folio 8K | |<----- Compressed data ------->| We have 2 8K sized folios, all backed by a single compressed data. In that case add_ra_bio_pages() will continue to add folio 8K into the read bio, as the condition to skip is only (bs < ps), not taking the newer large folio support into consideration at all. So for folio 8K, it is added to the read bio, but without subpage lock bitmap populated. Then at end_bbio_data_read(), folio 0 has proper locked bitmap set, but folio 8K does not. This inconsistency is handled by the extra safety net at btrfs_subpage_end_and_test_lock() where if a folio has no @nr_locked, it will just be unlocked without touching the locked bitmap. [ENHANCEMENT] Make add_ra_bio_pages() support bs < ps and large folio cases, by removing the check and calling btrfs_folio_set_lock() unconditionally. This won't make any difference on 4K page sized systems with large folios, as the readahead is already working, although unexpectedly. But this will enable true compressed data readahead for bs < ps cases properly. Please note that such readahead will only work if the compressed extent is crossing folio boundaries, which is also the existing limitation. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
The function add_ra_bio_folios() has been utilizing folio interfaces since c808c1d ("btrfs: convert add_ra_bio_pages() to use only folios"), but we are still referring to "pages" inside the function name and all comments. Furthermore, such folio/page mixing can even be confusing, e.g. the variable @page_end is very confusing as we're not really referring to the end of the page, but the end of the folio, especially when we already have large folio support. Enhance that function by: - Rename "page" to "folio" to avoid confusion - Skip to the folio end if there is already a folio in the page cache The existing skip is: cur += folio_size(folio); This is incorrect if @cur is not folio size aligned, and can be common with large folio support. Thankfully this is not going to cause any real bugs, but at most will skip some blocks that can be added to readahead. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
This feature was introduced in v6.17 under experimental, and we had several small bugs related to or exposed by that: e9e3b22 ("btrfs: fix beyond-EOF write handling") 18de34d ("btrfs: truncate ordered extent when skipping writeback past i_size") Otherwise, the feature has been frequently tested by btrfs developers. The latest fix only arrived in v6.19. After three releases, I think it's time to move this feature out of experimental. And since we're here, also remove the comment about the bitmap size limit, which is no longer relevant in the context. It will soon be outdated for the incoming huge folio support. Reviewed-by: Neal Gompa <neal@gompa.dev> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Before btrfs switched to the new mount API in 2023, we were setting
SB_NOSEC in btrfs_mount_root(). This flag tells the VFS that the
filesystem may have files which don't have security xattrs, enabling it
to do some optimizations.
Unfortunately this was missed in the transition, meaning that IS_NOSEC
will always return false for a btrfs inode. This means that
btrfs_direct_write() calls will always get the inode lock exclusively,
meaning that DIO writes to the same file will be serialized.
On my machine, this one-line change results in a ~59% improvement in DIO
throughput:
Before patch:
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=64
...
fio-3.39
Starting 32 processes
test: Laying out IO file (1 file / 1024MiB)
Jobs: 32 (f=32): [w(32)][100.0%][w=764MiB/s][w=195k IOPS][eta 00m:00s]
test: (groupid=0, jobs=32): err= 0: pid=586: Wed Apr 22 13:03:04 2026
write: IOPS=202k, BW=787MiB/s (826MB/s)(46.1GiB/60012msec); 0 zone resets
bw ( KiB/s): min=498714, max=1199892, per=100.00%, avg=806659.03, stdev=4229.94, samples=3808
iops : min=124677, max=299971, avg=201661.82, stdev=1057.49, samples=3808
cpu : usr=0.32%, sys=1.27%, ctx=8329204, majf=0, minf=1163
IO depths : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued rwts: total=0,12094328,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=64
Run status group 0 (all jobs):
WRITE: bw=787MiB/s (826MB/s), 787MiB/s-787MiB/s (826MB/s-826MB/s), io=46.1GiB (49.5GB), run=60012-60012msec
After patch:
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=64
...
fio-3.39
Starting 32 processes
test: Laying out IO file (1 file / 1024MiB)
Jobs: 32 (f=32): [w(32)][100.0%][w=1255MiB/s][w=321k IOPS][eta 00m:00s]
test: (groupid=0, jobs=32): err= 0: pid=572: Wed Apr 22 13:13:46 2026
write: IOPS=320k, BW=1250MiB/s (1311MB/s)(73.3GiB/60003msec); 0 zone resets
bw ( MiB/s): min= 619, max= 2289, per=100.00%, avg=1251.28, stdev= 9.64, samples=3808
iops : min=158538, max=586025, avg=320320.80, stdev=2468.97, samples=3808
cpu : usr=0.35%, sys=11.50%, ctx=1584847, majf=0, minf=1160
IO depths : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued rwts: total=0,19203309,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=64
Run status group 0 (all jobs):
WRITE: bw=1250MiB/s (1311MB/s), 1250MiB/s-1250MiB/s (1311MB/s-1311MB/s), io=73.3GiB (78.7GB), run=60003-60003msec
The script to reproduce that:
#!/bin/bash
mkfs.btrfs -f /dev/nvme0n1
mount /dev/nvme0n1 /mnt/test
mkdir /mnt/test/nocow
chattr +C /mnt/test/nocow
fio /root/test.fio
# cat /root/test.fio
[global]
rw=randwrite
ioengine=io_uring
iodepth=64
size=1g
direct=1
startdelay=20
force_async=4
ramp_time=5
runtime=60
group_reporting=1
numjobs=32
time_based
disk_util=0
clat_percentiles=0
disable_lat=1
disable_clat=1
disable_slat=1
filename=/mnt/test/nocow/fiofile
[test]
name=test
bs=4k
stonewall
This was on a VM with 8 cores and 8GB of RAM, with a real NVMe exposed
through PCI passthrough. The figures for XFS and ext4 in comparison are
both about ~3GB/s.
Fixes: ad21f15 ("btrfs: switch to the new mount API")
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A build workload started printing order-0 allocation failures on 7.1-rc1:
sh: page allocation failure: order:0
mode:0x14084a(__GFP_HIGHMEM|__GFP_MOVABLE|__GFP_IO|__GFP_KSWAPD_RECLAIM|
__GFP_COMP|__GFP_HARDWALL)
CPU: 27 UID: 1000 PID: 855540 Comm: sh Not tainted 7.1.0-rc1-llvm-00058-gdca922e019dd #1 PREEMPTLAZY
Call Trace:
<TASK>
dump_stack_lvl+0x50/0x70
warn_alloc+0xeb/0x100
__alloc_pages_slowpath+0x567/0x5a0
? filemap_get_entry+0x11a/0x140
__alloc_frozen_pages_noprof+0x249/0x2d0
alloc_pages_mpol+0xe4/0x180
folio_alloc_noprof+0x80/0xa0
add_ra_bio_pages+0x13c/0x4b0
btrfs_submit_compressed_read+0x229/0x300
submit_one_bio+0x9e/0xe0
btrfs_readahead+0x185/0x1a0
[...]
(lldb) source list -a add_ra_bio_pages+0x13c
.../vmlinux.unstripped add_ra_bio_pages + 316 at .../fs/btrfs/compression.c:454:8
451
452 folio = filemap_alloc_folio(mapping_gfp_constraint(mapping, constraint_gfp),
453 0, NULL);
-> 454 if (!folio)
455 break;
I can reproduce this consistently by running a memory hog concurrently
with a buffered writer on a machine with a very large amount of swap.
Commit 7ae37b2 ("btrfs: prevent direct reclaim during compressed
readahead") clearly intended to suppress these warnings. But because the
mask set in the address_space with mapping_set_gfp_mask() doesn't include
__GFP_NOWARN, mapping_gfp_constraint() removes it from constraint_gfp
before it is passed to filemap_alloc_folio().
Fix by refactoring the code to add __GFP_NOWARN after the call to
mapping_gfp_constraint().
Fixes: 7ae37b2 ("btrfs: prevent direct reclaim during compressed readahead")
Signed-off-by: Calvin Owens <calvin@wbinvd.org>
Signed-off-by: David Sterba <dsterba@suse.com>
Any commits after this one are for testing and evaluation only. Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
With the recent commit "btrfs: warn about extent buffer that can not be
released", we can trigger the following warning running test cases like
generic/388 at unmount:
BTRFS critical (device dm-2 state E): emergency shutdown
BTRFS error (device dm-2 state E): cow_file_range failed, root=5 inode=265 start=135168 len=118784 cur_offset=135168 cur_alloc_size=0: -5
BTRFS error (device dm-2 state E): error while writing out transaction: -30
BTRFS warning (device dm-2 state E): Skipping commit of aborted transaction.
BTRFS error (device dm-2 state EA): Transaction 9 aborted (error -30)
BTRFS: error (device dm-2 state EA) in cleanup_transaction:2068: errno=-30 Readonly filesystem
BTRFS info (device dm-2 state EA): forced readonly
BTRFS error (device dm-2 state EA): failed to run delalloc range, root=5 ino=265 folio=135168 submit_bitmap=0 start=135168 len=118784: -5
BTRFS info (device dm-2 state EA): last unmount of filesystem 8b3d8748-4710-4b5a-84d9-b072cb03be2d
------------[ cut here ]------------
WARNING: disk-io.c:3306 at invalidate_btree_folios+0xfd/0x1ca [btrfs], CPU#4: umount/60183
CPU: 4 UID: 0 PID: 60183 Comm: umount Tainted: G W OE 7.0.0-rc6-custom+ #365 PREEMPT(full) 5804053f02137e627472d94b5128cc9fcb110e88
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022
RIP: 0010:invalidate_btree_folios+0xfd/0x1ca [btrfs]
Call Trace:
<TASK>
close_ctree+0x534/0x57a [btrfs eeeee2af86b856a32e0b81b75d427a17a62ffe29]
generic_shutdown_super+0x89/0x1a0
kill_anon_super+0x16/0x40
btrfs_kill_super+0x16/0x20 [btrfs eeeee2af86b856a32e0b81b75d427a17a62ffe29]
deactivate_locked_super+0x2d/0xb0
cleanup_mnt+0xdc/0x140
task_work_run+0x5a/0xa0
exit_to_user_mode_loop+0x123/0x4b0
do_syscall_64+0x288/0x7d0
entry_SYSCALL_64_after_hwframe+0x4b/0x53
</TASK>
---[ end trace 0000000000000000 ]---
BTRFS warning (device dm-2 state EA): unable to release extent buffer 30507008 owner 1 gen 9 refs 2 flags 0x7
BTRFS warning (device dm-2 state EA): unable to release extent buffer 30588928 owner 9 gen 9 refs 2 flags 0x7
BTRFS warning (device dm-2 state EA): unable to release extent buffer 30605312 owner 257 gen 9 refs 2 flags 0x7
BTRFS warning (device dm-2 state EA): unable to release extent buffer 30621696 owner 7 gen 9 refs 2 flags 0x7
BTRFS warning (device dm-2 state EA): unable to release extent buffer 30638080 owner 258 gen 9 refs 2 flags 0x7
BTRFS warning (device dm-2 state EA): unable to release extent buffer 30654464 owner 2 gen 9 refs 2 flags 0x7
BTRFS warning (device dm-2 state EA): unable to release extent buffer 30670848 owner 10 gen 9 refs 2 flags 0x7
I'm using a stripped down version, which seems to trigger the warning
more reliably:
_fsstress_pid=""
workload()
{
dmesg -C
mkfs.btrfs -f -K $dev > /dev/null
echo 1 > /sys/kernel/debug/clear_warn_once
mount $dev $mnt
$fsstress -w -n 1024 -p 4 -d $mnt &
_fsstress_pid=$!
sleep 0
$godown $mnt
pkill --echo -PIPE fsstress > /dev/null
wait $_fsstress_pid
unset _fsstress_pid
umount $mnt
if dmesg | grep -q "WARNING"; then
fail
fi
}
for (( i = 0; i < $runtime; i++ )); do
echo "=== $i/$runtime ==="
workload
done
[CAUSE]
Inside btrfs_write_and_wait_transaction(), we first try to write all
dirty ebs, then wait for them to finish.
After that we call btrfs_extent_io_tree_release() to free all
extent states from dirty_pages io tree.
However if we hit an error from btrfs_write_marked_extent(), then we
still call btrfs_extent_io_tree_release() to clear that dirty_pages io
tree, which may contain dirty records that we haven't yet submitted.
Furthermore, later transaction cleanup path will utilize that
dirty_pages io tree to properly cleanup those dirty ebs, but since it's
already empty, no dirty ebs are properly cleaned up, thus will later
trigger the warnings inside invalidate_btree_folios().
[FIX]
Normally such dirty ebs won't cause problems, as when the iput() is
called on the btree inode, the dirty ebs will be forced written back.
But at that stage all workers have been destroyed, and if the metadata
writeback needs any worker (e.g. RAID56), it will easily trigger a NULL
pointer dereference.
Furthermore such writeback at iput() time can be too late for zoned
btrfs, as we also need to properly update the write pointer for zoned
devices.
Instead of unconditionally calling btrfs_extent_io_tree_release(), only
call it if btrfs_write_and_wait_transaction() finished successfully, so
that @dirty_pages extent io tree is kept untouched for transaction
cleanup.
CC: stable@vger.kernel.org # 6.1+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The comments at the beginning of subpage.c are out-of-date, a lot of the limits are already resolved. Update them to reflect the latest status. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
…maps [CURRENT LIMIT] Btrfs currently only supports sub-bitmaps (e.g. dirty bitmap) no larger than BITS_PER_LONG. That limit allows us to easily grab an unsigned long without the need to properly allocate memory for a larger bitmap. Unfortunately that limit prevents us from supporting huge folios. For 4K page size and block size, a huge folio (order 9) means 512 blocks inside a 2M folio. [ENHANCEMENT] To allow direct bitmap operations without allocating new memory, introduce two different ways to access the subpage bitmaps: - Return an unsigned long value This only happens if blocks_per_folio <= BITS_PER_LONG. We read out the sub-bitmap into an unsigned long, and return the value. This is the old existing method. This involves get_bitmap_value_##name() helper functions. And this time the helper functions are defined as inline functions instead of macros to provide better type checks. - Return a pointer where the sub-bitmap starts This only happens if blocks_per_folio >= BITS_PER_LONG. This is the new method for sub-bitmaps larger than BITS_PER_LONG. Since the sizes of sub-bitmaps are all aligned to BITS_PER_LONG, we can directly access the start byte of the sub-bitmap. This involves get_bitmap_pointer_##name() helper functions. Then change the existing sub-bitmaps users to use the new helpers: - Bitmap dumping Switch between get_bitmap_value_##name() and get_bitmap_pointer_##name() depending on the sub-bitmap size. - btrfs_get_subpage_dirty_bitmap() Rename it to btrfs_get_subpage_dirty_bitmap_value() to follow the new value/pointer naming. Since we do not support huge folios yet, there is no pointer version for it yet. Furthermore add the support for bs == ps cases for btrfs_get_subpage_dirty_bitmap_value(), so that the caller no longer needs to check if the folio needs subpage handling. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
[CURRENT LIMIT] Btrfs currently only supports sub-bitmaps (e.g. dirty bitmap) no larger than BITS_PER_LONG. One call site that utilizes this limit is btrfs_bio_ctrl::submit_bitmap, which makes it very simple and straightforward to just grab an unsigned long value and assign it to submit_bitmap. Unfortunately that limit prevents us from supporting huge folios. For 4K page size and block size, a huge folio (order 9) means 512 blocks inside a 2M folio. [ENHANCEMENT] Instead of using a fixed unsigned long value, change btrfs_bio_ctrl::submit_bitmap to an unsigned long pointer. And for cases where an unsigned long can hold the whole bitmap, introduce @submit_bitmap_value, and just point that pointer to that unsigned long. Then update all direct users of bio_ctrl->submit_bitmap to use the pointer version. There are several call sites that get extra changes: - @range_bitmap inside extent_writepage_io() Which is only utilized to truncate the bitmap. Since we do not want to allocate new memory just for such temporary usage, change the original bitmap_set() and bitmap_and() into bitmap_clear() for the ranges out of the folio. - Getting dirty subpage bitmap inside writepage_delalloc() Since we're passing an unsigned long pointer now, we need to go with different handling (bs == ps, blocks_per_folio <= BITS_PER_LONG, blocks_per_folio > BITS_PER_LONG). Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
With all the previous preparations, it's finally time to enable the huge folio support. - The max folio size Here we define BTRFS_MAX_FOLIO_SIZE, which is fixed to 2MiB. This will ensure we have a large enough but not too large folio for btrfs. This limit applies to all systems regardless of page size. Then we also define BTRFS_MAX_BLOCKS_PER_FOLIO, which depends on CONFIG_BTRFS_EXPERIMENTAL. If it's an experimental build, BTRFS_MAX_BLOCKS_PER_FOLIO is 512, otherwise it's BITS_PER_LONG. The filemap max order will be calculated using both BTRFS_MAX_FOLIO_SIZE and BTRFS_MAX_BLOCKS_PER_FOLIO. E.g. for 64K page size with 64K fs block size, the limit will be BTRFS_MAX_FOLIO_SIZE (2M), which limits the filemap max order to 5. This will be lower than the old order (6), but folios larger than 2M are rarely any better for IO performance. Meanwhile excessively large folios can cause other problems like stalling the IO pipeline for too long. For 4K page size and 4K fs block size, the limit will be increased to 2M from the old 256K. This new size is constrained by both BTRFS_MAX_FOLIO_SIZE (2M) and BTRFS_MAX_BLOCKS_PER_FOLIO (512 * 4K), allowing x86_64 to reach huge folio support, and the filemap max order will be 9. - btrfs_bio_ctrl::submit_bitmap This will be enlarged to contain BTRFS_MAX_BLOCKS_PER_FOLIO bits, and this will be on-stack memory. This will increase on-stack memory usage by 56 bytes compared to the baseline (before any patch in the series). - Local @delalloc_bitmap inside writepage_delalloc() Unfortunately we cannot afford to handle an allocation error here, thus again we use on-stack memory. Thus this will increase on-stack memory usage by 56 bytes again. So unfortunately this means during the delalloc window, the writeback path will have +112 bytes for on-stack memory usage, and for other cases the writeback path will have +56 bytes on-stack memory usage. 
The +56 bytes (btrfs_bio_ctrl::submit_bitmap) can be removed after we have reworked the compression submission, so the current on-stack submit_bitmap is mostly a workaround until then. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>