Merge tag 'xfs-5.18-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Darrick Wong:
 "This fixes multiple problems in the reserve pool sizing functions: an
  incorrect free space calculation, a pointless infinite loop, and even
  more braindamage that could result in the pool being overfilled. The
  pile of patches from Dave fixes myriad races and UAF bugs in the log
  recovery code that, much to our mutual surprise, nobody's tripped over.
  Dave also fixed a performance optimization that had turned into a
  regression.

  Dave Chinner is taking over as XFS maintainer starting Sunday and
  lasting until 5.19-rc1 is tagged so that I can focus on starting a
  massive design review for the (feature complete after five years)
  online repair feature. From then on, he and I will be moving XFS to a
  co-maintainership model by trading duties every other release.

  NOTE: I hope very strongly that the other pieces of the (X)FS
  ecosystem (fstests and xfsprogs) will make similar changes to spread
  their maintenance load.

  Summary:

   - Fix an incorrect free space calculation in xfs_reserve_blocks that
     could lead to a request for free blocks that will never succeed.

   - Fix a hang in xfs_reserve_blocks caused by an infinite loop and the
     incorrect free space calculation.

   - Fix yet a third problem in xfs_reserve_blocks where multiple racing
     threads can overfill the reserve pool.

   - Fix an accounting error that led to us reporting reserved space as
     "available".

   - Fix a race condition during abnormal fs shutdown that could cause
     UAF problems when memory reclaim and log shutdown try to clean up
     inodes.

   - Fix a bug where log shutdown can race with unmount to tear down the
     log, thereby causing UAF errors.

   - Disentangle log and filesystem shutdown to reduce confusion.

   - Fix some confusion in xfs_trans_commit such that a race between
     transaction commit and filesystem shutdown can cause unlogged dirty
     inode metadata to be committed, thereby corrupting the filesystem.

   - Remove a performance optimization in the log as it was discovered
     that certain storage hardware handles async log flushes so poorly as
     to cause serious performance regressions. Recent restructuring of
     other parts of the logging code means that no performance benefit is
     seen on hardware that handles it well"
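
To make the reserve pool fixes concrete, here is a minimal userspace C model of the corrected xfs_reserve_blocks() flow (a sketch only: the struct, the helper names, and the omitted locking are simplified stand-ins for the kernel code, not its API). Setting the pool target first and then moving at most min(free, delta) blocks is what eliminates both the retry loop and the overfill window:

/* Illustrative userspace model of the fixed reserve pool sizing logic.
 * Names mirror the kernel fields loosely; locking is omitted. */
#include <stdio.h>
#include <stdint.h>

struct mount {
        int64_t fdblocks;      /* free blocks available for allocation */
        int64_t resblks;       /* desired reserve pool size */
        int64_t resblks_avail; /* blocks actually held in the pool */
};

static int64_t min64(int64_t a, int64_t b) { return a < b ? a : b; }

/* Record the new target before transferring blocks: a partial fill is
 * acceptable because later frees can top the pool back up to resblks,
 * so no retry loop is needed, and a racing caller sees the already
 * updated target instead of overfilling the pool. */
static void set_reserve(struct mount *mp, int64_t request)
{
        int64_t free = mp->fdblocks;
        int64_t delta = request - mp->resblks;

        mp->resblks = request;
        if (delta > 0 && free > 0) {
                int64_t take = min64(free, delta);

                mp->fdblocks -= take;
                mp->resblks_avail += take;
        }
}

int main(void)
{
        struct mount mp = { .fdblocks = 100 };

        set_reserve(&mp, 8192); /* request exceeds free space: partial fill */
        printf("resblks=%lld avail=%lld free=%lld\n",
               (long long)mp.resblks, (long long)mp.resblks_avail,
               (long long)mp.fdblocks);
        return 0;
}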

* tag 'xfs-5.18-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: drop async cache flushes from CIL commits.
  xfs: shutdown during log recovery needs to mark the log shutdown
  xfs: xfs_trans_commit() path must check for log shutdown
  xfs: xfs_do_force_shutdown needs to block racing shutdowns
  xfs: log shutdown triggers should only shut down the log
  xfs: run callbacks before waking waiters in xlog_state_shutdown_callbacks
  xfs: shutdown in intent recovery has non-intent items in the AIL
  xfs: aborting inodes on shutdown may need buffer lock
  xfs: don't report reserved bnobt space as available
  xfs: fix overfilling of reserve pool
  xfs: always succeed at setting the reserve pool size
  xfs: remove infinite loop when reserving free block pool
  xfs: don't include bnobt blocks when reserving free block pool
  xfs: document the XFS_ALLOC_AGFL_RESERVE constant
torvalds committed Apr 2, 2022
2 parents 1fdff40 + 919edba commit b32e381
Showing 18 changed files with 347 additions and 246 deletions.
28 changes: 23 additions & 5 deletions fs/xfs/libxfs/xfs_alloc.c
@@ -82,6 +82,24 @@ xfs_prealloc_blocks(
}

/*
+ * The number of blocks per AG that we withhold from xfs_mod_fdblocks to
+ * guarantee that we can refill the AGFL prior to allocating space in a nearly
+ * full AG. Although the space described by the free space btrees, the
+ * blocks used by the freesp btrees themselves, and the blocks owned by the
+ * AGFL are counted in the ondisk fdblocks, it's a mistake to let the ondisk
+ * free space in the AG drop so low that the free space btrees cannot refill an
+ * empty AGFL up to the minimum level. Rather than grind through empty AGs
+ * until the fs goes down, we subtract this many AG blocks from the incore
+ * fdblocks to ensure user allocation does not overcommit the space the
+ * filesystem needs for the AGFLs. The rmap btree uses a per-AG reservation to
+ * withhold space from xfs_mod_fdblocks, so we do not account for that here.
+ */
+ #define XFS_ALLOCBT_AGFL_RESERVE 4
+
+ /*
+ * Compute the number of blocks that we set aside to guarantee the ability to
+ * refill the AGFL and handle a full bmap btree split.
+ *
* In order to avoid ENOSPC-related deadlock caused by out-of-order locking of
* AGF buffer (PV 947395), we place constraints on the relationship among
* actual allocations for data blocks, freelist blocks, and potential file data
@@ -93,14 +111,14 @@ xfs_prealloc_blocks(
* extents need to be actually allocated. To get around this, we explicitly set
* aside a few blocks which will not be reserved in delayed allocation.
*
- * We need to reserve 4 fsbs _per AG_ for the freelist and 4 more to handle a
- * potential split of the file's bmap btree.
+ * For each AG, we need to reserve enough blocks to replenish a totally empty
+ * AGFL and 4 more to handle a potential split of the file's bmap btree.
*/
unsigned int
xfs_alloc_set_aside(
struct xfs_mount *mp)
{
- return mp->m_sb.sb_agcount * (XFS_ALLOC_AGFL_RESERVE + 4);
+ return mp->m_sb.sb_agcount * (XFS_ALLOCBT_AGFL_RESERVE + 4);
}

/*
@@ -124,12 +142,12 @@ xfs_alloc_ag_max_usable(
unsigned int blocks;

blocks = XFS_BB_TO_FSB(mp, XFS_FSS_TO_BB(mp, 4)); /* ag headers */
- blocks += XFS_ALLOC_AGFL_RESERVE;
+ blocks += XFS_ALLOCBT_AGFL_RESERVE;
blocks += 3; /* AGF, AGI btree root blocks */
if (xfs_has_finobt(mp))
blocks++; /* finobt root block */
if (xfs_has_rmapbt(mp))
- blocks++; /* rmap root block */
+ blocks++; /* rmap root block */
if (xfs_has_reflink(mp))
blocks++; /* refcount root block */

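
As a quick sanity check of the arithmetic above, each AG now sets aside XFS_ALLOCBT_AGFL_RESERVE (4) blocks for AGFL refills plus 4 more for a potential bmap btree split. A tiny standalone C program demonstrates the formula; the 4-AG geometry is a made-up example, not from the commit:

/* Mirror of the xfs_alloc_set_aside() calculation shown in the diff. */
#include <assert.h>

#define XFS_ALLOCBT_AGFL_RESERVE 4 /* per-AG AGFL refill reserve */

static unsigned int set_aside(unsigned int agcount)
{
        return agcount * (XFS_ALLOCBT_AGFL_RESERVE + 4);
}

int main(void)
{
        assert(set_aside(4) == 32); /* 4 AGs * 8 blocks per AG */
        return 0;
}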
1 change: 0 additions & 1 deletion fs/xfs/libxfs/xfs_alloc.h
@@ -88,7 +88,6 @@ typedef struct xfs_alloc_arg {
#define XFS_ALLOC_NOBUSY (1 << 2)/* Busy extents not allowed */

/* freespace limit calculations */
- #define XFS_ALLOC_AGFL_RESERVE 4
unsigned int xfs_alloc_set_aside(struct xfs_mount *mp);
unsigned int xfs_alloc_ag_max_usable(struct xfs_mount *mp);

33 changes: 0 additions & 33 deletions fs/xfs/xfs_bio_io.c
@@ -9,39 +9,6 @@ static inline unsigned int bio_max_vecs(unsigned int count)
return bio_max_segs(howmany(count, PAGE_SIZE));
}

- static void
- xfs_flush_bdev_async_endio(
- struct bio *bio)
- {
- complete(bio->bi_private);
- }
-
- /*
- * Submit a request for an async cache flush to run. If the request queue does
- * not require flush operations, just skip it altogether. If the caller needs
- * to wait for the flush completion at a later point in time, they must supply a
- * valid completion. This will be signalled when the flush completes. The
- * caller never sees the bio that is issued here.
- */
- void
- xfs_flush_bdev_async(
- struct bio *bio,
- struct block_device *bdev,
- struct completion *done)
- {
- struct request_queue *q = bdev->bd_disk->queue;
-
- if (!test_bit(QUEUE_FLAG_WC, &q->queue_flags)) {
- complete(done);
- return;
- }
-
- bio_init(bio, bdev, NULL, 0, REQ_OP_WRITE | REQ_PREFLUSH | REQ_SYNC);
- bio->bi_private = done;
- bio->bi_end_io = xfs_flush_bdev_async_endio;
-
- submit_bio(bio);
- }
int
xfs_rw_bdev(
struct block_device *bdev,
60 changes: 27 additions & 33 deletions fs/xfs/xfs_fsops.c
@@ -17,6 +17,7 @@
#include "xfs_fsops.h"
#include "xfs_trans_space.h"
#include "xfs_log.h"
+ #include "xfs_log_priv.h"
#include "xfs_ag.h"
#include "xfs_ag_resv.h"
#include "xfs_trace.h"
@@ -347,7 +348,7 @@ xfs_fs_counts(
cnt->allocino = percpu_counter_read_positive(&mp->m_icount);
cnt->freeino = percpu_counter_read_positive(&mp->m_ifree);
cnt->freedata = percpu_counter_read_positive(&mp->m_fdblocks) -
- mp->m_alloc_set_aside;
+ xfs_fdblocks_unavailable(mp);

spin_lock(&mp->m_sb_lock);
cnt->freertx = mp->m_sb.sb_frextents;
@@ -430,46 +431,36 @@ xfs_reserve_blocks(
* If the request is larger than the current reservation, reserve the
* blocks before we update the reserve counters. Sample m_fdblocks and
* perform a partial reservation if the request exceeds free space.
+ *
+ * The code below estimates how many blocks it can request from
+ * fdblocks to stash in the reserve pool. This is a classic TOCTOU
+ * race since fdblocks updates are not always coordinated via
+ * m_sb_lock. Set the reserve size even if there's not enough free
+ * space to fill it because mod_fdblocks will refill an undersized
+ * reserve when it can.
*/
- error = -ENOSPC;
- do {
- free = percpu_counter_sum(&mp->m_fdblocks) -
- mp->m_alloc_set_aside;
- if (free <= 0)
- break;
-
- delta = request - mp->m_resblks;
- lcounter = free - delta;
- if (lcounter < 0)
- /* We can't satisfy the request, just get what we can */
- fdblks_delta = free;
- else
- fdblks_delta = delta;
-
+ free = percpu_counter_sum(&mp->m_fdblocks) -
+ xfs_fdblocks_unavailable(mp);
+ delta = request - mp->m_resblks;
+ mp->m_resblks = request;
+ if (delta > 0 && free > 0) {
/*
* We'll either succeed in getting space from the free block
- * count or we'll get an ENOSPC. If we get a ENOSPC, it means
- * things changed while we were calculating fdblks_delta and so
- * we should try again to see if there is anything left to
- * reserve.
+ * count or we'll get an ENOSPC. Don't set the reserved flag
+ * here - we don't want to reserve the extra reserve blocks
+ * from the reserve.
*
- * Don't set the reserved flag here - we don't want to reserve
- * the extra reserve blocks from the reserve.....
+ * The desired reserve size can change after we drop the lock.
+ * Use mod_fdblocks to put the space into the reserve or into
+ * fdblocks as appropriate.
*/
+ fdblks_delta = min(free, delta);
spin_unlock(&mp->m_sb_lock);
error = xfs_mod_fdblocks(mp, -fdblks_delta, 0);
+ if (!error)
+ xfs_mod_fdblocks(mp, fdblks_delta, 0);
spin_lock(&mp->m_sb_lock);
- } while (error == -ENOSPC);
-
- /*
- * Update the reserve counters if blocks have been successfully
- * allocated.
- */
- if (!error && fdblks_delta) {
- mp->m_resblks += fdblks_delta;
- mp->m_resblks_avail += fdblks_delta;
}
-
out:
if (outval) {
outval->resblks = mp->m_resblks;
@@ -528,8 +519,11 @@ xfs_do_force_shutdown(
int tag;
const char *why;

- if (test_and_set_bit(XFS_OPSTATE_SHUTDOWN, &mp->m_opstate))
+
+ if (test_and_set_bit(XFS_OPSTATE_SHUTDOWN, &mp->m_opstate)) {
+ xlog_shutdown_wait(mp->m_log);
return;
+ }
if (mp->m_sb_bp)
mp->m_sb_bp->b_flags |= XBF_DONE;

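
The xfs_do_force_shutdown() hunk above makes a losing racer block in xlog_shutdown_wait() until the winning shutdown completes, instead of returning while teardown is still in flight. Below is a hedged pthreads sketch of that pattern; the kernel uses an opstate bit and a waitqueue, and the mutex, condition variable, and flags here are userspace stand-ins rather than the kernel primitives:

/* Userspace model of "first caller tears down, racing callers wait". */
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t done_cv = PTHREAD_COND_INITIALIZER;
static bool shutdown_started, shutdown_done;

static void force_shutdown(void)
{
        pthread_mutex_lock(&lock);
        if (shutdown_started) {
                /* Racing shutdown: block until the winner finishes. */
                while (!shutdown_done)
                        pthread_cond_wait(&done_cv, &lock);
                pthread_mutex_unlock(&lock);
                return;
        }
        shutdown_started = true;
        pthread_mutex_unlock(&lock);

        /* ... tear down the log exactly once here ... */

        pthread_mutex_lock(&lock);
        shutdown_done = true;
        pthread_cond_broadcast(&done_cv);
        pthread_mutex_unlock(&lock);
}

int main(void)
{
        force_shutdown(); /* winner: performs the teardown */
        force_shutdown(); /* raced call: sees it already done, returns */
        return 0;
}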
2 changes: 1 addition & 1 deletion fs/xfs/xfs_icache.c
@@ -883,7 +883,7 @@ xfs_reclaim_inode(
*/
if (xlog_is_shutdown(ip->i_mount->m_log)) {
xfs_iunpin_wait(ip);
- xfs_iflush_abort(ip);
+ xfs_iflush_shutdown_abort(ip);
goto reclaim;
}
if (xfs_ipincount(ip))
2 changes: 1 addition & 1 deletion fs/xfs/xfs_inode.c
@@ -3631,7 +3631,7 @@ xfs_iflush_cluster(

/*
* We must use the safe variant here as on shutdown xfs_iflush_abort()
- * can remove itself from the list.
+ * will remove itself from the list.
*/
list_for_each_entry_safe(lip, n, &bp->b_li_list, li_bio_list) {
iip = (struct xfs_inode_log_item *)lip;
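
The comment corrected above is about iterator safety: the loop body in xfs_iflush_cluster() may unlink the entry it is visiting, so the walk must capture the next pointer before the body runs, which is exactly what list_for_each_entry_safe() does. A self-contained C illustration of the same idiom, using a hand-rolled list rather than the kernel's list.h:

/* Safe traversal when the body may unlink and free the current node. */
#include <stdio.h>
#include <stdlib.h>

struct node {
        int val;
        struct node *next;
};

int main(void)
{
        struct node *head = NULL;

        /* Build the list 3 -> 2 -> 1 -> 0. */
        for (int i = 0; i < 4; i++) {
                struct node *n = malloc(sizeof(*n));

                n->val = i;
                n->next = head;
                head = n;
        }

        /* Stash the successor before the body can free the current
         * node, mirroring list_for_each_entry_safe(). */
        for (struct node **pp = &head, *n; (n = *pp) != NULL; ) {
                struct node *next = n->next;

                if (n->val % 2 == 0) {  /* unlink and free even values */
                        *pp = next;
                        free(n);
                } else {
                        pp = &n->next;
                }
        }

        for (struct node *n = head; n; n = n->next)
                printf("%d\n", n->val); /* prints 3 then 1 */
        return 0;
}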
