mirror of https://github.com/Qortal/Brooklyn
You can not select more than 25 topics
Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
427 lines
15 KiB
427 lines
15 KiB
================= |
|
Thin provisioning |
|
================= |
|
|
|
Introduction |
|
============ |
|
|
|
This document describes a collection of device-mapper targets that |
|
between them implement thin-provisioning and snapshots. |
|
|
|
The main highlight of this implementation, compared to the previous |
|
implementation of snapshots, is that it allows many virtual devices to |
|
be stored on the same data volume. This simplifies administration and |
|
allows the sharing of data between volumes, thus reducing disk usage. |
|
|
|
Another significant feature is support for an arbitrary depth of |
|
recursive snapshots (snapshots of snapshots of snapshots ...). The |
|
previous implementation of snapshots did this by chaining together |
|
lookup tables, and so performance was O(depth). This new |
|
implementation uses a single data structure to avoid this degradation |
|
with depth. Fragmentation may still be an issue, however, in some |
|
scenarios. |
|
|
|
Metadata is stored on a separate device from data, giving the |
|
administrator some freedom, for example to: |
|
|
|
- Improve metadata resilience by storing metadata on a mirrored volume |
|
but data on a non-mirrored one. |
|
|
|
- Improve performance by storing the metadata on SSD. |
|
|
|
Status |
|
====== |
|
|
|
These targets are considered safe for production use. But different use |
|
cases will have different performance characteristics, for example due |
|
to fragmentation of the data volume. |
|
|
|
If you find this software is not performing as expected please mail |
|
[email protected] with details and we'll try our best to improve |
|
things for you. |
|
|
|
Userspace tools for checking and repairing the metadata have been fully |
|
developed and are available as 'thin_check' and 'thin_repair'. The name |
|
of the package that provides these utilities varies by distribution (on |
|
a Red Hat distribution it is named 'device-mapper-persistent-data'). |
|
|
|
Cookbook |
|
======== |
|
|
|
This section describes some quick recipes for using thin provisioning. |
|
They use the dmsetup program to control the device-mapper driver |
|
directly. End users will be advised to use a higher-level volume |
|
manager such as LVM2 once support has been added. |
|
|
|
Pool device |
|
----------- |
|
|
|
The pool device ties together the metadata volume and the data volume. |
|
It maps I/O linearly to the data volume and updates the metadata via |
|
two mechanisms: |
|
|
|
- Function calls from the thin targets |
|
|
|
- Device-mapper 'messages' from userspace which control the creation of new |
|
virtual devices amongst other things. |
|
|
|
Setting up a fresh pool device |
|
------------------------------ |
|
|
|
Setting up a pool device requires a valid metadata device, and a |
|
data device. If you do not have an existing metadata device you can |
|
make one by zeroing the first 4k to indicate empty metadata. |
|
|
|
dd if=/dev/zero of=$metadata_dev bs=4096 count=1 |
|
|
|
The amount of metadata you need will vary according to how many blocks |
|
are shared between thin devices (i.e. through snapshots). If you have |
|
less sharing than average you'll need a larger-than-average metadata device. |
|
|
|
As a guide, we suggest you calculate the number of bytes to use in the |
|
metadata device as 48 * $data_dev_size / $data_block_size but round it up |
|
to 2MB if the answer is smaller. If you're creating large numbers of |
|
snapshots which are recording large amounts of change, you may find you |
|
need to increase this. |
|
|
|
The largest size supported is 16GB: If the device is larger, |
|
a warning will be issued and the excess space will not be used. |
|
|
|
Reloading a pool table |
|
---------------------- |
|
|
|
You may reload a pool's table, indeed this is how the pool is resized |
|
if it runs out of space. (N.B. While specifying a different metadata |
|
device when reloading is not forbidden at the moment, things will go |
|
wrong if it does not route I/O to exactly the same on-disk location as |
|
previously.) |
|
|
|
Using an existing pool device |
|
----------------------------- |
|
|
|
:: |
|
|
|
dmsetup create pool \ |
|
--table "0 20971520 thin-pool $metadata_dev $data_dev \ |
|
$data_block_size $low_water_mark" |
|
|
|
$data_block_size gives the smallest unit of disk space that can be |
|
allocated at a time expressed in units of 512-byte sectors. |
|
$data_block_size must be between 128 (64KB) and 2097152 (1GB) and a |
|
multiple of 128 (64KB). $data_block_size cannot be changed after the |
|
thin-pool is created. People primarily interested in thin provisioning |
|
may want to use a value such as 1024 (512KB). People doing lots of |
|
snapshotting may want a smaller value such as 128 (64KB). If you are |
|
not zeroing newly-allocated data, a larger $data_block_size in the |
|
region of 256000 (128MB) is suggested. |
|
|
|
$low_water_mark is expressed in blocks of size $data_block_size. If |
|
free space on the data device drops below this level then a dm event |
|
will be triggered which a userspace daemon should catch allowing it to |
|
extend the pool device. Only one such event will be sent. |
|
|
|
No special event is triggered if a just resumed device's free space is below |
|
the low water mark. However, resuming a device always triggers an |
|
event; a userspace daemon should verify that free space exceeds the low |
|
water mark when handling this event. |
|
|
|
A low water mark for the metadata device is maintained in the kernel and |
|
will trigger a dm event if free space on the metadata device drops below |
|
it. |
|
|
|
Updating on-disk metadata |
|
------------------------- |
|
|
|
On-disk metadata is committed every time a FLUSH or FUA bio is written. |
|
If no such requests are made then commits will occur every second. This |
|
means the thin-provisioning target behaves like a physical disk that has |
|
a volatile write cache. If power is lost you may lose some recent |
|
writes. The metadata should always be consistent in spite of any crash. |
|
|
|
If data space is exhausted the pool will either error or queue IO |
|
according to the configuration (see: error_if_no_space). If metadata |
|
space is exhausted or a metadata operation fails: the pool will error IO |
|
until the pool is taken offline and repair is performed to 1) fix any |
|
potential inconsistencies and 2) clear the flag that imposes repair. |
|
Once the pool's metadata device is repaired it may be resized, which |
|
will allow the pool to return to normal operation. Note that if a pool |
|
is flagged as needing repair, the pool's data and metadata devices |
|
cannot be resized until repair is performed. It should also be noted |
|
that when the pool's metadata space is exhausted the current metadata |
|
transaction is aborted. Given that the pool will cache IO whose |
|
completion may have already been acknowledged to upper IO layers |
|
(e.g. filesystem) it is strongly suggested that consistency checks |
|
(e.g. fsck) be performed on those layers when repair of the pool is |
|
required. |
|
|
|
Thin provisioning |
|
----------------- |
|
|
|
i) Creating a new thinly-provisioned volume. |
|
|
|
To create a new thinly- provisioned volume you must send a message to an |
|
active pool device, /dev/mapper/pool in this example:: |
|
|
|
dmsetup message /dev/mapper/pool 0 "create_thin 0" |
|
|
|
Here '0' is an identifier for the volume, a 24-bit number. It's up |
|
to the caller to allocate and manage these identifiers. If the |
|
identifier is already in use, the message will fail with -EEXIST. |
|
|
|
ii) Using a thinly-provisioned volume. |
|
|
|
Thinly-provisioned volumes are activated using the 'thin' target:: |
|
|
|
dmsetup create thin --table "0 2097152 thin /dev/mapper/pool 0" |
|
|
|
The last parameter is the identifier for the thinp device. |
|
|
|
Internal snapshots |
|
------------------ |
|
|
|
i) Creating an internal snapshot. |
|
|
|
Snapshots are created with another message to the pool. |
|
|
|
N.B. If the origin device that you wish to snapshot is active, you |
|
must suspend it before creating the snapshot to avoid corruption. |
|
This is NOT enforced at the moment, so please be careful! |
|
|
|
:: |
|
|
|
dmsetup suspend /dev/mapper/thin |
|
dmsetup message /dev/mapper/pool 0 "create_snap 1 0" |
|
dmsetup resume /dev/mapper/thin |
|
|
|
Here '1' is the identifier for the volume, a 24-bit number. '0' is the |
|
identifier for the origin device. |
|
|
|
ii) Using an internal snapshot. |
|
|
|
Once created, the user doesn't have to worry about any connection |
|
between the origin and the snapshot. Indeed the snapshot is no |
|
different from any other thinly-provisioned device and can be |
|
snapshotted itself via the same method. It's perfectly legal to |
|
have only one of them active, and there's no ordering requirement on |
|
activating or removing them both. (This differs from conventional |
|
device-mapper snapshots.) |
|
|
|
Activate it exactly the same way as any other thinly-provisioned volume:: |
|
|
|
dmsetup create snap --table "0 2097152 thin /dev/mapper/pool 1" |
|
|
|
External snapshots |
|
------------------ |
|
|
|
You can use an external **read only** device as an origin for a |
|
thinly-provisioned volume. Any read to an unprovisioned area of the |
|
thin device will be passed through to the origin. Writes trigger |
|
the allocation of new blocks as usual. |
|
|
|
One use case for this is VM hosts that want to run guests on |
|
thinly-provisioned volumes but have the base image on another device |
|
(possibly shared between many VMs). |
|
|
|
You must not write to the origin device if you use this technique! |
|
Of course, you may write to the thin device and take internal snapshots |
|
of the thin volume. |
|
|
|
i) Creating a snapshot of an external device |
|
|
|
This is the same as creating a thin device. |
|
You don't mention the origin at this stage. |
|
|
|
:: |
|
|
|
dmsetup message /dev/mapper/pool 0 "create_thin 0" |
|
|
|
ii) Using a snapshot of an external device. |
|
|
|
Append an extra parameter to the thin target specifying the origin:: |
|
|
|
dmsetup create snap --table "0 2097152 thin /dev/mapper/pool 0 /dev/image" |
|
|
|
N.B. All descendants (internal snapshots) of this snapshot require the |
|
same extra origin parameter. |
|
|
|
Deactivation |
|
------------ |
|
|
|
All devices using a pool must be deactivated before the pool itself |
|
can be. |
|
|
|
:: |
|
|
|
dmsetup remove thin |
|
dmsetup remove snap |
|
dmsetup remove pool |
|
|
|
Reference |
|
========= |
|
|
|
'thin-pool' target |
|
------------------ |
|
|
|
i) Constructor |
|
|
|
:: |
|
|
|
thin-pool <metadata dev> <data dev> <data block size (sectors)> \ |
|
<low water mark (blocks)> [<number of feature args> [<arg>]*] |
|
|
|
Optional feature arguments: |
|
|
|
skip_block_zeroing: |
|
Skip the zeroing of newly-provisioned blocks. |
|
|
|
ignore_discard: |
|
Disable discard support. |
|
|
|
no_discard_passdown: |
|
Don't pass discards down to the underlying |
|
data device, but just remove the mapping. |
|
|
|
read_only: |
|
Don't allow any changes to be made to the pool |
|
metadata. This mode is only available after the |
|
thin-pool has been created and first used in full |
|
read/write mode. It cannot be specified on initial |
|
thin-pool creation. |
|
|
|
error_if_no_space: |
|
Error IOs, instead of queueing, if no space. |
|
|
|
Data block size must be between 64KB (128 sectors) and 1GB |
|
(2097152 sectors) inclusive. |
|
|
|
|
|
ii) Status |
|
|
|
:: |
|
|
|
<transaction id> <used metadata blocks>/<total metadata blocks> |
|
<used data blocks>/<total data blocks> <held metadata root> |
|
ro|rw|out_of_data_space [no_]discard_passdown [error|queue]_if_no_space |
|
needs_check|- metadata_low_watermark |
|
|
|
transaction id: |
|
A 64-bit number used by userspace to help synchronise with metadata |
|
from volume managers. |
|
|
|
used data blocks / total data blocks |
|
If the number of free blocks drops below the pool's low water mark a |
|
dm event will be sent to userspace. This event is edge-triggered and |
|
it will occur only once after each resume so volume manager writers |
|
should register for the event and then check the target's status. |
|
|
|
held metadata root: |
|
The location, in blocks, of the metadata root that has been |
|
'held' for userspace read access. '-' indicates there is no |
|
held root. |
|
|
|
discard_passdown|no_discard_passdown |
|
Whether or not discards are actually being passed down to the |
|
underlying device. When this is enabled when loading the table, |
|
it can get disabled if the underlying device doesn't support it. |
|
|
|
ro|rw|out_of_data_space |
|
If the pool encounters certain types of device failures it will |
|
drop into a read-only metadata mode in which no changes to |
|
the pool metadata (like allocating new blocks) are permitted. |
|
|
|
In serious cases where even a read-only mode is deemed unsafe |
|
no further I/O will be permitted and the status will just |
|
contain the string 'Fail'. The userspace recovery tools |
|
should then be used. |
|
|
|
error_if_no_space|queue_if_no_space |
|
If the pool runs out of data or metadata space, the pool will |
|
either queue or error the IO destined to the data device. The |
|
default is to queue the IO until more space is added or the |
|
'no_space_timeout' expires. The 'no_space_timeout' dm-thin-pool |
|
module parameter can be used to change this timeout -- it |
|
defaults to 60 seconds but may be disabled using a value of 0. |
|
|
|
needs_check |
|
A metadata operation has failed, resulting in the needs_check |
|
flag being set in the metadata's superblock. The metadata |
|
device must be deactivated and checked/repaired before the |
|
thin-pool can be made fully operational again. '-' indicates |
|
needs_check is not set. |
|
|
|
metadata_low_watermark: |
|
Value of metadata low watermark in blocks. The kernel sets this |
|
value internally but userspace needs to know this value to |
|
determine if an event was caused by crossing this threshold. |
|
|
|
iii) Messages |
|
|
|
create_thin <dev id> |
|
Create a new thinly-provisioned device. |
|
<dev id> is an arbitrary unique 24-bit identifier chosen by |
|
the caller. |
|
|
|
create_snap <dev id> <origin id> |
|
Create a new snapshot of another thinly-provisioned device. |
|
<dev id> is an arbitrary unique 24-bit identifier chosen by |
|
the caller. |
|
<origin id> is the identifier of the thinly-provisioned device |
|
of which the new device will be a snapshot. |
|
|
|
delete <dev id> |
|
Deletes a thin device. Irreversible. |
|
|
|
set_transaction_id <current id> <new id> |
|
Userland volume managers, such as LVM, need a way to |
|
synchronise their external metadata with the internal metadata of the |
|
pool target. The thin-pool target offers to store an |
|
arbitrary 64-bit transaction id and return it on the target's |
|
status line. To avoid races you must provide what you think |
|
the current transaction id is when you change it with this |
|
compare-and-swap message. |
|
|
|
reserve_metadata_snap |
|
Reserve a copy of the data mapping btree for use by userland. |
|
This allows userland to inspect the mappings as they were when |
|
this message was executed. Use the pool's status command to |
|
get the root block associated with the metadata snapshot. |
|
|
|
release_metadata_snap |
|
Release a previously reserved copy of the data mapping btree. |
|
|
|
'thin' target |
|
------------- |
|
|
|
i) Constructor |
|
|
|
:: |
|
|
|
thin <pool dev> <dev id> [<external origin dev>] |
|
|
|
pool dev: |
|
the thin-pool device, e.g. /dev/mapper/my_pool or 253:0 |
|
|
|
dev id: |
|
the internal device identifier of the device to be |
|
activated. |
|
|
|
external origin dev: |
|
an optional block device outside the pool to be treated as a |
|
read-only snapshot origin: reads to unprovisioned areas of the |
|
thin target will be mapped to this device. |
|
|
|
The pool doesn't store any size against the thin devices. If you |
|
load a thin target that is smaller than you've been using previously, |
|
then you'll have no access to blocks mapped beyond the end. If you |
|
load a target that is bigger than before, then extra blocks will be |
|
provisioned as and when needed. |
|
|
|
ii) Status |
|
|
|
<nr mapped sectors> <highest mapped sector> |
|
If the pool has encountered device errors and failed, the status |
|
will just contain the string 'Fail'. The userspace recovery |
|
tools should then be used. |
|
|
|
In the case where <nr mapped sectors> is 0, there is no highest |
|
mapped sector and the value of <highest mapped sector> is unspecified.
|
|
|