forked from Qortal/Brooklyn
You can not select more than 25 topics
Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
194 lines
8.3 KiB
194 lines
8.3 KiB
======== |
|
dm-zoned |
|
======== |
|
|
|
The dm-zoned device mapper target exposes a zoned block device (ZBC and |
|
ZAC compliant devices) as a regular block device without any write |
|
pattern constraints. In effect, it implements a drive-managed zoned |
|
block device which hides from the user (a file system or an application |
|
doing raw block device accesses) the sequential write constraints of |
|
host-managed zoned block devices and can mitigate the potential |
|
device-side performance degradation due to excessive random writes on |
|
host-aware zoned block devices. |
|
|
|
For a more detailed description of the zoned block device models and |
|
their constraints see (for SCSI devices): |
|
|
|
https://www.t10.org/drafts.htm#ZBC_Family |
|
|
|
and (for ATA devices): |
|
|
|
http://www.t13.org/Documents/UploadedDocuments/docs2015/di537r05-Zoned_Device_ATA_Command_Set_ZAC.pdf |
|
|
|
The dm-zoned implementation is simple and minimizes system overhead (CPU |
|
and memory usage as well as storage capacity loss). For a 10TB |
|
host-managed disk with 256 MB zones, dm-zoned memory usage per disk |
|
instance is at most 4.5 MB and as little as 5 zones will be used |
|
internally for storing metadata and performing reclaim operations. |
|
|
|
dm-zoned target devices are formatted and checked using the dmzadm |
|
utility available at: |
|
|
|
https://github.com/hgst/dm-zoned-tools |
|
|
|
Algorithm |
|
========= |
|
|
|
dm-zoned implements an on-disk buffering scheme to handle non-sequential |
|
write accesses to the sequential zones of a zoned block device. |
|
Conventional zones are used for caching as well as for storing internal |
|
metadata. It can also use a regular block device together with the zoned |
|
block device; in that case the regular block device will be split logically |
|
in zones with the same size as the zoned block device. These zones will be |
|
placed in front of the zones from the zoned block device and will be handled |
|
just like conventional zones. |
|
|
|
The zones of the device(s) are separated into 2 types: |
|
|
|
1) Metadata zones: these are conventional zones used to store metadata. |
|
Metadata zones are not reported as useable capacity to the user. |
|
|
|
2) Data zones: all remaining zones, the vast majority of which will be |
|
sequential zones used exclusively to store user data. The conventional |
|
zones of the device may be used also for buffering user random writes. |
|
Data in these zones may be directly mapped to the conventional zone, but |
|
later moved to a sequential zone so that the conventional zone can be |
|
reused for buffering incoming random writes. |
|
|
|
dm-zoned exposes a logical device with a sector size of 4096 bytes, |
|
irrespective of the physical sector size of the backend zoned block |
|
device being used. This allows reducing the amount of metadata needed to |
|
manage valid blocks (blocks written). |
|
|
|
The on-disk metadata format is as follows: |
|
|
|
1) The first block of the first conventional zone found contains the |
|
super block which describes the on disk amount and position of metadata |
|
blocks. |
|
|
|
2) Following the super block, a set of blocks is used to describe the |
|
mapping of the logical device blocks. The mapping is done per chunk of |
|
blocks, with the chunk size equal to the zoned block device size. The |
|
mapping table is indexed by chunk number and each mapping entry |
|
indicates the zone number of the device storing the chunk of data. Each |
|
mapping entry may also indicate if the zone number of a conventional |
|
zone used to buffer random modification to the data zone. |
|
|
|
3) A set of blocks used to store bitmaps indicating the validity of |
|
blocks in the data zones follows the mapping table. A valid block is |
|
defined as a block that was written and not discarded. For a buffered |
|
data chunk, a block is always valid only in the data zone mapping the |
|
chunk or in the buffer zone of the chunk. |
|
|
|
For a logical chunk mapped to a conventional zone, all write operations |
|
are processed by directly writing to the zone. If the mapping zone is a |
|
sequential zone, the write operation is processed directly only if the |
|
write offset within the logical chunk is equal to the write pointer |
|
offset within of the sequential data zone (i.e. the write operation is |
|
aligned on the zone write pointer). Otherwise, write operations are |
|
processed indirectly using a buffer zone. In that case, an unused |
|
conventional zone is allocated and assigned to the chunk being |
|
accessed. Writing a block to the buffer zone of a chunk will |
|
automatically invalidate the same block in the sequential zone mapping |
|
the chunk. If all blocks of the sequential zone become invalid, the zone |
|
is freed and the chunk buffer zone becomes the primary zone mapping the |
|
chunk, resulting in native random write performance similar to a regular |
|
block device. |
|
|
|
Read operations are processed according to the block validity |
|
information provided by the bitmaps. Valid blocks are read either from |
|
the sequential zone mapping a chunk, or if the chunk is buffered, from |
|
the buffer zone assigned. If the accessed chunk has no mapping, or the |
|
accessed blocks are invalid, the read buffer is zeroed and the read |
|
operation terminated. |
|
|
|
After some time, the limited number of conventional zones available may |
|
be exhausted (all used to map chunks or buffer sequential zones) and |
|
unaligned writes to unbuffered chunks become impossible. To avoid this |
|
situation, a reclaim process regularly scans used conventional zones and |
|
tries to reclaim the least recently used zones by copying the valid |
|
blocks of the buffer zone to a free sequential zone. Once the copy |
|
completes, the chunk mapping is updated to point to the sequential zone |
|
and the buffer zone freed for reuse. |
|
|
|
Metadata Protection |
|
=================== |
|
|
|
To protect metadata against corruption in case of sudden power loss or |
|
system crash, 2 sets of metadata zones are used. One set, the primary |
|
set, is used as the main metadata region, while the secondary set is |
|
used as a staging area. Modified metadata is first written to the |
|
secondary set and validated by updating the super block in the secondary |
|
set, a generation counter is used to indicate that this set contains the |
|
newest metadata. Once this operation completes, in place of metadata |
|
block updates can be done in the primary metadata set. This ensures that |
|
one of the set is always consistent (all modifications committed or none |
|
at all). Flush operations are used as a commit point. Upon reception of |
|
a flush request, metadata modification activity is temporarily blocked |
|
(for both incoming BIO processing and reclaim process) and all dirty |
|
metadata blocks are staged and updated. Normal operation is then |
|
resumed. Flushing metadata thus only temporarily delays write and |
|
discard requests. Read requests can be processed concurrently while |
|
metadata flush is being executed. |
|
|
|
If a regular device is used in conjunction with the zoned block device, |
|
a third set of metadata (without the zone bitmaps) is written to the |
|
start of the zoned block device. This metadata has a generation counter of |
|
'0' and will never be updated during normal operation; it just serves for |
|
identification purposes. The first and second copy of the metadata |
|
are located at the start of the regular block device. |
|
|
|
Usage |
|
===== |
|
|
|
A zoned block device must first be formatted using the dmzadm tool. This |
|
will analyze the device zone configuration, determine where to place the |
|
metadata sets on the device and initialize the metadata sets. |
|
|
|
Ex:: |
|
|
|
dmzadm --format /dev/sdxx |
|
|
|
|
|
If two drives are to be used, both devices must be specified, with the |
|
regular block device as the first device. |
|
|
|
Ex:: |
|
|
|
dmzadm --format /dev/sdxx /dev/sdyy |
|
|
|
|
|
Formatted device(s) can be started with the dmzadm utility, too.: |
|
|
|
Ex:: |
|
|
|
dmzadm --start /dev/sdxx /dev/sdyy |
|
|
|
|
|
Information about the internal layout and current usage of the zones can |
|
be obtained with the 'status' callback from dmsetup: |
|
|
|
Ex:: |
|
|
|
dmsetup status /dev/dm-X |
|
|
|
will return a line |
|
|
|
0 <size> zoned <nr_zones> zones <nr_unmap_rnd>/<nr_rnd> random <nr_unmap_seq>/<nr_seq> sequential |
|
|
|
where <nr_zones> is the total number of zones, <nr_unmap_rnd> is the number |
|
of unmapped (ie free) random zones, <nr_rnd> the total number of zones, |
|
<nr_unmap_seq> the number of unmapped sequential zones, and <nr_seq> the |
|
total number of sequential zones. |
|
|
|
Normally the reclaim process will be started once there are less than 50 |
|
percent free random zones. In order to start the reclaim process manually |
|
even before reaching this threshold the 'dmsetup message' function can be |
|
used: |
|
|
|
Ex:: |
|
|
|
dmsetup message /dev/dm-X 0 reclaim |
|
|
|
will start the reclaim process and random zones will be moved to sequential |
|
zones.
|
|
|