Skip to content

Commit

Permalink
Merge tag 'for-4.13/dm-changes' of git://git.kernel.org/pub/scm/linux…
Browse files Browse the repository at this point in the history
…/kernel/git/device-mapper/linux-dm

Pull device mapper updates from Mike Snitzer:

 - Add the ability to use select or poll /dev/mapper/control to wait for
   events from multiple DM devices.

 - Convert DM's printk macros over to using pr_<level> macros.

 - Add a big-endian variant of plain64 IV to dm-crypt.

 - Add support for zoned (aka SMR) devices to DM core. DM kcopyd was
   also improved to provide a sequential write feature needed by zoned
   devices.

 - Introduce DM zoned target that provides support for host-managed
   zoned devices, the result dm-zoned device acts as a drive-managed
   interface to the underlying host-managed device.

 - A DM raid fix to avoid using BUG() for error handling.

* tag 'for-4.13/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm zoned: fix overflow when converting zone ID to sectors
  dm raid: stop using BUG() in __rdev_sectors()
  dm zoned: drive-managed zoned block device target
  dm kcopyd: add sequential write feature
  dm linear: add support for zoned block devices
  dm flakey: add support for zoned block devices
  dm: introduce dm_remap_zone_report()
  dm: fix REQ_OP_ZONE_REPORT bio handling
  dm: fix REQ_OP_ZONE_RESET bio handling
  dm table: add zoned block devices validation
  dm: convert DM printk macros to pr_<level> macros
  dm crypt: add big-endian variant of plain64 IV
  dm bio prison: use rb_entry() rather than container_of()
  dm ioctl: report event number in DM_LIST_DEVICES
  dm ioctl: add a new DM_DEV_ARM_POLL ioctl
  dm: add basic support for using the select or poll function
  • Loading branch information
torvalds committed Jul 6, 2017
2 parents 9871ab2 + 3908c98 commit 3a564bb
Show file tree
Hide file tree
Showing 21 changed files with 4,958 additions and 78 deletions.
144 changes: 144 additions & 0 deletions Documentation/device-mapper/dm-zoned.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,144 @@
dm-zoned
========

The dm-zoned device mapper target exposes a zoned block device (ZBC and
ZAC compliant devices) as a regular block device without any write
pattern constraints. In effect, it implements a drive-managed zoned
block device which hides from the user (a file system or an application
doing raw block device accesses) the sequential write constraints of
host-managed zoned block devices and can mitigate the potential
device-side performance degradation due to excessive random writes on
host-aware zoned block devices.

For a more detailed description of the zoned block device models and
their constraints see (for SCSI devices):

http://www.t10.org/drafts.htm#ZBC_Family

and (for ATA devices):

http://www.t13.org/Documents/UploadedDocuments/docs2015/di537r05-Zoned_Device_ATA_Command_Set_ZAC.pdf

The dm-zoned implementation is simple and minimizes system overhead (CPU
and memory usage as well as storage capacity loss). For a 10TB
host-managed disk with 256 MB zones, dm-zoned memory usage per disk
instance is at most 4.5 MB and as little as 5 zones will be used
internally for storing metadata and performaing reclaim operations.

dm-zoned target devices are formatted and checked using the dmzadm
utility available at:

https://github.com/hgst/dm-zoned-tools

Algorithm
=========

dm-zoned implements an on-disk buffering scheme to handle non-sequential
write accesses to the sequential zones of a zoned block device.
Conventional zones are used for caching as well as for storing internal
metadata.

The zones of the device are separated into 2 types:

1) Metadata zones: these are conventional zones used to store metadata.
Metadata zones are not reported as useable capacity to the user.

2) Data zones: all remaining zones, the vast majority of which will be
sequential zones used exclusively to store user data. The conventional
zones of the device may be used also for buffering user random writes.
Data in these zones may be directly mapped to the conventional zone, but
later moved to a sequential zone so that the conventional zone can be
reused for buffering incoming random writes.

dm-zoned exposes a logical device with a sector size of 4096 bytes,
irrespective of the physical sector size of the backend zoned block
device being used. This allows reducing the amount of metadata needed to
manage valid blocks (blocks written).

The on-disk metadata format is as follows:

1) The first block of the first conventional zone found contains the
super block which describes the on disk amount and position of metadata
blocks.

2) Following the super block, a set of blocks is used to describe the
mapping of the logical device blocks. The mapping is done per chunk of
blocks, with the chunk size equal to the zoned block device size. The
mapping table is indexed by chunk number and each mapping entry
indicates the zone number of the device storing the chunk of data. Each
mapping entry may also indicate if the zone number of a conventional
zone used to buffer random modification to the data zone.

3) A set of blocks used to store bitmaps indicating the validity of
blocks in the data zones follows the mapping table. A valid block is
defined as a block that was written and not discarded. For a buffered
data chunk, a block is always valid only in the data zone mapping the
chunk or in the buffer zone of the chunk.

For a logical chunk mapped to a conventional zone, all write operations
are processed by directly writing to the zone. If the mapping zone is a
sequential zone, the write operation is processed directly only if the
write offset within the logical chunk is equal to the write pointer
offset within of the sequential data zone (i.e. the write operation is
aligned on the zone write pointer). Otherwise, write operations are
processed indirectly using a buffer zone. In that case, an unused
conventional zone is allocated and assigned to the chunk being
accessed. Writing a block to the buffer zone of a chunk will
automatically invalidate the same block in the sequential zone mapping
the chunk. If all blocks of the sequential zone become invalid, the zone
is freed and the chunk buffer zone becomes the primary zone mapping the
chunk, resulting in native random write performance similar to a regular
block device.

Read operations are processed according to the block validity
information provided by the bitmaps. Valid blocks are read either from
the sequential zone mapping a chunk, or if the chunk is buffered, from
the buffer zone assigned. If the accessed chunk has no mapping, or the
accessed blocks are invalid, the read buffer is zeroed and the read
operation terminated.

After some time, the limited number of convnetional zones available may
be exhausted (all used to map chunks or buffer sequential zones) and
unaligned writes to unbuffered chunks become impossible. To avoid this
situation, a reclaim process regularly scans used conventional zones and
tries to reclaim the least recently used zones by copying the valid
blocks of the buffer zone to a free sequential zone. Once the copy
completes, the chunk mapping is updated to point to the sequential zone
and the buffer zone freed for reuse.

Metadata Protection
===================

To protect metadata against corruption in case of sudden power loss or
system crash, 2 sets of metadata zones are used. One set, the primary
set, is used as the main metadata region, while the secondary set is
used as a staging area. Modified metadata is first written to the
secondary set and validated by updating the super block in the secondary
set, a generation counter is used to indicate that this set contains the
newest metadata. Once this operation completes, in place of metadata
block updates can be done in the primary metadata set. This ensures that
one of the set is always consistent (all modifications committed or none
at all). Flush operations are used as a commit point. Upon reception of
a flush request, metadata modification activity is temporarily blocked
(for both incoming BIO processing and reclaim process) and all dirty
metadata blocks are staged and updated. Normal operation is then
resumed. Flushing metadata thus only temporarily delays write and
discard requests. Read requests can be processed concurrently while
metadata flush is being executed.

Usage
=====

A zoned block device must first be formatted using the dmzadm tool. This
will analyze the device zone configuration, determine where to place the
metadata sets on the device and initialize the metadata sets.

Ex:

dmzadm --format /dev/sdxx

For a formatted device, the target can be created normally with the
dmsetup utility. The only parameter that dm-zoned requires is the
underlying zoned block device name. Ex:

echo "0 `blockdev --getsize ${dev}` zoned ${dev}" | dmsetup create dmz-`basename ${dev}`
17 changes: 17 additions & 0 deletions drivers/md/Kconfig
Original file line number Diff line number Diff line change
Expand Up @@ -521,6 +521,23 @@ config DM_INTEGRITY
To compile this code as a module, choose M here: the module will
be called dm-integrity.

config DM_ZONED
tristate "Drive-managed zoned block device target support"
depends on BLK_DEV_DM
depends on BLK_DEV_ZONED
---help---
This device-mapper target takes a host-managed or host-aware zoned
block device and exposes most of its capacity as a regular block
device (drive-managed zoned block device) without any write
constraints. This is mainly intended for use with file systems that
do not natively support zoned block devices but still want to
benefit from the increased capacity offered by SMR disks. Other uses
by applications using raw block devices (for example object stores)
are also possible.

To compile this code as a module, choose M here: the module will
be called dm-zoned.

If unsure, say N.

endif # MD
2 changes: 2 additions & 0 deletions drivers/md/Makefile
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,7 @@ dm-era-y += dm-era-target.o
dm-verity-y += dm-verity-target.o
md-mod-y += md.o bitmap.o
raid456-y += raid5.o raid5-cache.o raid5-ppl.o
dm-zoned-y += dm-zoned-target.o dm-zoned-metadata.o dm-zoned-reclaim.o

# Note: link order is important. All raid personalities
# and must come before md.o, as they each initialise
Expand Down Expand Up @@ -60,6 +61,7 @@ obj-$(CONFIG_DM_CACHE_SMQ) += dm-cache-smq.o
obj-$(CONFIG_DM_ERA) += dm-era.o
obj-$(CONFIG_DM_LOG_WRITES) += dm-log-writes.o
obj-$(CONFIG_DM_INTEGRITY) += dm-integrity.o
obj-$(CONFIG_DM_ZONED) += dm-zoned.o

ifeq ($(CONFIG_DM_UEVENT),y)
dm-mod-objs += dm-uevent.o
Expand Down
2 changes: 1 addition & 1 deletion drivers/md/dm-bio-prison-v1.c
Original file line number Diff line number Diff line change
Expand Up @@ -116,7 +116,7 @@ static int __bio_detain(struct dm_bio_prison *prison,

while (*new) {
struct dm_bio_prison_cell *cell =
container_of(*new, struct dm_bio_prison_cell, node);
rb_entry(*new, struct dm_bio_prison_cell, node);

r = cmp_keys(key, &cell->key);

Expand Down
2 changes: 1 addition & 1 deletion drivers/md/dm-bio-prison-v2.c
Original file line number Diff line number Diff line change
Expand Up @@ -120,7 +120,7 @@ static bool __find_or_insert(struct dm_bio_prison_v2 *prison,

while (*new) {
struct dm_bio_prison_cell_v2 *cell =
container_of(*new, struct dm_bio_prison_cell_v2, node);
rb_entry(*new, struct dm_bio_prison_cell_v2, node);

r = cmp_keys(key, &cell->key);

Expand Down
3 changes: 3 additions & 0 deletions drivers/md/dm-core.h
Original file line number Diff line number Diff line change
Expand Up @@ -147,4 +147,7 @@ static inline bool dm_message_test_buffer_overflow(char *result, unsigned maxlen
return !maxlen || strlen(result) + 1 >= maxlen;
}

extern atomic_t dm_global_event_nr;
extern wait_queue_head_t dm_global_eventq;

#endif
21 changes: 20 additions & 1 deletion drivers/md/dm-crypt.c
Original file line number Diff line number Diff line change
Expand Up @@ -246,6 +246,9 @@ static struct crypto_aead *any_tfm_aead(struct crypt_config *cc)
* plain64: the initial vector is the 64-bit little-endian version of the sector
* number, padded with zeros if necessary.
*
* plain64be: the initial vector is the 64-bit big-endian version of the sector
* number, padded with zeros if necessary.
*
* essiv: "encrypted sector|salt initial vector", the sector number is
* encrypted with the bulk cipher using a salt as key. The salt
* should be derived from the bulk cipher's key via hashing.
Expand Down Expand Up @@ -302,6 +305,16 @@ static int crypt_iv_plain64_gen(struct crypt_config *cc, u8 *iv,
return 0;
}

static int crypt_iv_plain64be_gen(struct crypt_config *cc, u8 *iv,
struct dm_crypt_request *dmreq)
{
memset(iv, 0, cc->iv_size);
/* iv_size is at least of size u64; usually it is 16 bytes */
*(__be64 *)&iv[cc->iv_size - sizeof(u64)] = cpu_to_be64(dmreq->iv_sector);

return 0;
}

/* Initialise ESSIV - compute salt but no local memory allocations */
static int crypt_iv_essiv_init(struct crypt_config *cc)
{
Expand Down Expand Up @@ -835,6 +848,10 @@ static const struct crypt_iv_operations crypt_iv_plain64_ops = {
.generator = crypt_iv_plain64_gen
};

static const struct crypt_iv_operations crypt_iv_plain64be_ops = {
.generator = crypt_iv_plain64be_gen
};

static const struct crypt_iv_operations crypt_iv_essiv_ops = {
.ctr = crypt_iv_essiv_ctr,
.dtr = crypt_iv_essiv_dtr,
Expand Down Expand Up @@ -2208,6 +2225,8 @@ static int crypt_ctr_ivmode(struct dm_target *ti, const char *ivmode)
cc->iv_gen_ops = &crypt_iv_plain_ops;
else if (strcmp(ivmode, "plain64") == 0)
cc->iv_gen_ops = &crypt_iv_plain64_ops;
else if (strcmp(ivmode, "plain64be") == 0)
cc->iv_gen_ops = &crypt_iv_plain64be_ops;
else if (strcmp(ivmode, "essiv") == 0)
cc->iv_gen_ops = &crypt_iv_essiv_ops;
else if (strcmp(ivmode, "benbi") == 0)
Expand Down Expand Up @@ -2987,7 +3006,7 @@ static void crypt_io_hints(struct dm_target *ti, struct queue_limits *limits)

static struct target_type crypt_target = {
.name = "crypt",
.version = {1, 17, 0},
.version = {1, 18, 0},
.module = THIS_MODULE,
.ctr = crypt_ctr,
.dtr = crypt_dtr,
Expand Down
23 changes: 20 additions & 3 deletions drivers/md/dm-flakey.c
Original file line number Diff line number Diff line change
Expand Up @@ -275,7 +275,7 @@ static void flakey_map_bio(struct dm_target *ti, struct bio *bio)
struct flakey_c *fc = ti->private;

bio->bi_bdev = fc->dev->bdev;
if (bio_sectors(bio))
if (bio_sectors(bio) || bio_op(bio) == REQ_OP_ZONE_RESET)
bio->bi_iter.bi_sector =
flakey_map_sector(ti, bio->bi_iter.bi_sector);
}
Expand Down Expand Up @@ -306,6 +306,14 @@ static int flakey_map(struct dm_target *ti, struct bio *bio)
struct per_bio_data *pb = dm_per_bio_data(bio, sizeof(struct per_bio_data));
pb->bio_submitted = false;

/* Do not fail reset zone */
if (bio_op(bio) == REQ_OP_ZONE_RESET)
goto map_bio;

/* We need to remap reported zones, so remember the BIO iter */
if (bio_op(bio) == REQ_OP_ZONE_REPORT)
goto map_bio;

/* Are we alive ? */
elapsed = (jiffies - fc->start_time) / HZ;
if (elapsed % (fc->up_interval + fc->down_interval) >= fc->up_interval) {
Expand Down Expand Up @@ -359,11 +367,19 @@ static int flakey_map(struct dm_target *ti, struct bio *bio)
}

static int flakey_end_io(struct dm_target *ti, struct bio *bio,
blk_status_t *error)
blk_status_t *error)
{
struct flakey_c *fc = ti->private;
struct per_bio_data *pb = dm_per_bio_data(bio, sizeof(struct per_bio_data));

if (bio_op(bio) == REQ_OP_ZONE_RESET)
return DM_ENDIO_DONE;

if (bio_op(bio) == REQ_OP_ZONE_REPORT) {
dm_remap_zone_report(ti, bio, fc->start);
return DM_ENDIO_DONE;
}

if (!*error && pb->bio_submitted && (bio_data_dir(bio) == READ)) {
if (fc->corrupt_bio_byte && (fc->corrupt_bio_rw == READ) &&
all_corrupt_bio_flags_match(bio, fc)) {
Expand Down Expand Up @@ -446,7 +462,8 @@ static int flakey_iterate_devices(struct dm_target *ti, iterate_devices_callout_

static struct target_type flakey_target = {
.name = "flakey",
.version = {1, 4, 0},
.version = {1, 5, 0},
.features = DM_TARGET_ZONED_HM,
.module = THIS_MODULE,
.ctr = flakey_ctr,
.dtr = flakey_dtr,
Expand Down
Loading

0 comments on commit 3a564bb

Please sign in to comment.