=====
Cache
=====

Introduction
============

dm-cache is a device mapper target written by Joe Thornber, Heinz
Mauelshagen, and Mike Snitzer.

It aims to improve performance of a block device (e.g., a spindle) by
dynamically migrating some of its data to a faster, smaller device
(e.g., an SSD).

This device-mapper solution allows us to insert this caching at
different levels of the dm stack, for instance above the data device for
a thin-provisioning pool. Caching solutions that are integrated more
closely with the virtual memory system should give better performance.

The target reuses the metadata library used in the thin-provisioning
target.

The decision as to what data to migrate and when is left to a plug-in
policy module. Several of these have been written as we experiment,
and we hope other people will contribute others for specific I/O
scenarios (e.g., a VM image server).

Glossary
========

Migration
    Movement of the primary copy of a logical block from one
    device to the other.
Promotion
    Migration from slow device to fast device.
Demotion
    Migration from fast device to slow device.

The origin device always contains a copy of the logical block, which
may be out of date or kept in sync with the copy on the cache device
(depending on policy).

Design
======

Sub-devices
-----------

The target is constructed by passing three devices to it (along with
other parameters detailed later):

1. An origin device - the big, slow one.

2. A cache device - the small, fast one.

3. A small metadata device - records which blocks are in the cache,
   which are dirty, and extra hints for use by the policy object.
   This information could be put on the cache device, but having it
   separate allows the volume manager to configure it differently,
   e.g. as a mirror for extra robustness. This metadata device may only
   be used by a single cache device.

Fixed block size
----------------

The origin is divided up into blocks of a fixed size. This block size
is configurable when you first create the cache. Typically we've been
using block sizes of 256KB - 1024KB. The block size must be between 64
sectors (32KB) and 2097152 sectors (1GB) and a multiple of 64 sectors (32KB).

Having a fixed block size simplifies the target a lot. But it is
something of a compromise. For instance, a small part of a block may be
getting hit a lot, yet the whole block will be promoted to the cache.
So large block sizes are bad because they waste cache space. And small
block sizes are bad because they increase the amount of metadata (both
in core and on disk).

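The block size limits quoted above can be checked with a few lines of
arithmetic. The following is a minimal sketch, not part of dm-cache: the
function names are hypothetical, a sector is assumed to be 512 bytes, and
the block count assumes that only whole blocks on the cache device are used.

```python
SECTOR_BYTES = 512            # assumed device-mapper sector size
MIN_BLOCK_SECTORS = 64        # 32KB, lower bound quoted in the text
MAX_BLOCK_SECTORS = 2097152   # 1GB, upper bound quoted in the text


def check_block_size(sectors):
    """Return the block size in bytes, or raise ValueError if it is invalid."""
    if not MIN_BLOCK_SECTORS <= sectors <= MAX_BLOCK_SECTORS:
        raise ValueError("block size out of range: %d sectors" % sectors)
    if sectors % MIN_BLOCK_SECTORS != 0:
        raise ValueError("block size not a multiple of 64 sectors: %d" % sectors)
    return sectors * SECTOR_BYTES


def cache_block_count(cache_dev_sectors, block_sectors):
    """Number of whole blocks that fit on a cache device (illustrative only)."""
    check_block_size(block_sectors)
    return cache_dev_sectors // block_sectors


if __name__ == "__main__":
    # A 512-sector block is 256KB, the low end of the typical range.
    print(check_block_size(512))
    print(cache_block_count(41943040, 512))
```

Halving the block size doubles the number of blocks that need metadata,
which is the trade-off described above.
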
Cache operating modes
---------------------

The cache has three operating modes: writeback, writethrough and
passthrough.

If writeback, the default, is selected then a write to a block that is
cached will go only to the cache and the block will be marked dirty in
the metadata.

If writethrough is selected then a write to a cached block will not
complete until it has hit both the origin and cache devices. Clean
blocks should remain clean.

If passthrough is selected, useful when the cache contents are not known
to be coherent with the origin device, then all reads are served from
the origin device (all reads miss the cache) and all writes are
forwarded to the origin device; additionally, write hits invalidate the
cache block. To enable passthrough mode the cache must be clean.
Passthrough mode allows a cache device to be activated without having to
worry about coherency. Coherency that exists is maintained, although
the cache will gradually cool as writes take place. If the coherency of
the cache can later be verified, or established through use of the
"invalidate_cblocks" message, the cache device can be transitioned to
writethrough or writeback mode while still warm. Otherwise, the cache
contents can be discarded prior to transitioning to the desired
operating mode.

A simple cleaner policy is provided, which will clean (write back) all
dirty blocks in a cache. This is useful for decommissioning a cache or
when shrinking a cache. Shrinking the cache's fast device requires all
cache blocks, in the area of the cache being removed, to be clean. If the
area being removed from the cache still contains dirty blocks the resize
will fail. Care must be taken to never reduce the volume used for the
cache's fast device until the cache is clean. This is of particular
importance if writeback mode is used. Writethrough and passthrough
modes already maintain a clean cache. Future support to partially clean
the cache, above a specified threshold, will allow for keeping the cache
warm and in writeback mode during resize.

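The handling of a write that hits a cached block differs between the three
modes described above. The table-like sketch below restates that behaviour
in code; it is a toy model for illustration only, the names are hypothetical,
and it does not reflect how the kernel target is implemented.

```python
from collections import namedtuple

# Outcome of a write to a block that is currently in the cache.
WriteHit = namedtuple("WriteHit", "destinations marks_dirty invalidates")


def route_cached_write(mode):
    """Describe what each operating mode does with a write hit."""
    if mode == "writeback":
        # Only the cache is written; the block becomes dirty.
        return WriteHit(("cache",), True, False)
    if mode == "writethrough":
        # The write completes once both devices have it; the block stays clean.
        return WriteHit(("origin", "cache"), False, False)
    if mode == "passthrough":
        # The write goes to the origin and the cache block is invalidated.
        return WriteHit(("origin",), False, True)
    raise ValueError("unknown mode: %r" % mode)


if __name__ == "__main__":
    for mode in ("writeback", "writethrough", "passthrough"):
        print(mode, route_cached_write(mode))
```

Only writeback ever produces dirty blocks, which is why it is the mode that
needs cleaning before the fast device is shrunk.
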
Migration throttling
--------------------

Migrating data between the origin and cache device uses bandwidth.
The user can set a throttle to prevent more than a certain amount of
migration occurring at any one time. Currently we're not taking any
account of normal io traffic going to the devices. More work needs
doing here to avoid migrating during those peak io moments.

For the time being, a message "migration_threshold <#sectors>"
can be used to set the maximum number of sectors being migrated,
the default being 2048 sectors (1MB).

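As a rough illustration of the message above, the sketch below converts a
size in MB to sectors and builds the argument list for a ``dmsetup message``
call. The helper names and the device name ``my_cache`` are assumptions made
for the example, a sector is assumed to be 512 bytes, and the code only
builds the command; it does not run it.

```python
SECTORS_PER_MB = 2048  # 1MB / 512-byte sectors, matching "2048 sectors (1MB)"


def mb_to_sectors(megabytes):
    """Convert a whole number of MB to 512-byte sectors."""
    if megabytes < 0:
        raise ValueError("size must not be negative")
    return megabytes * SECTORS_PER_MB


def migration_threshold_command(device, sectors):
    """Build the argv for 'dmsetup message <device> 0 migration_threshold <n>'."""
    if sectors < 0:
        raise ValueError("sector count must not be negative")
    return ["dmsetup", "message", device, "0",
            "migration_threshold", str(sectors)]


if __name__ == "__main__":
    # Raise the limit from the default 1MB to 2MB on a hypothetical device.
    print(" ".join(migration_threshold_command("my_cache", mb_to_sectors(2))))
```

The resulting list could be handed to ``subprocess.run`` by a caller with
sufficient privileges.
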
Updating on-disk metadata
-------------------------

On-disk metadata is committed every time a FLUSH or FUA bio is written.
If no such requests are made then commits will occur every second. This
means the cache behaves like a physical disk that has a volatile write
cache. If power is lost you may lose some recent writes. The metadata
should always be consistent in spite of any crash.

The 'dirty' state for a cache block changes far too frequently for us
to keep updating it on the fly. So we treat it as a hint. In normal
operation it will be written when the dm device is suspended. If the
system crashes all cache blocks will be assumed dirty when restarted.

Per-block policy hints
----------------------

Policy plug-ins can store a chunk of data per cache block. It's up to
the policy how big this chunk is, but it should be kept small. Like the
dirty flags this data is lost if there's a crash so a safe fallback
value should always be possible.

Policy hints affect performance, not correctness.

Policy messaging
----------------

Policies will have different tunables, specific to each one, so we
need a generic way of getting and setting these. Device-mapper
messages are used. Refer to cache-policies.txt.

Discard bitset resolution
-------------------------

We can avoid copying data during migration if we know the block has
been discarded. A prime example of this is when mkfs discards the
whole block device. We store a bitset tracking the discard state of
blocks. However, we allow this bitset to have a different block size
from the cache blocks. This is because we need to track the discard
state for all of the origin device (compare with the dirty bitset
which is just for the smaller cache device).

Target interface
================

Constructor
-----------

::

  cache <metadata dev> <cache dev> <origin dev> <block size>
        <#feature args> [<feature arg>]*
        <policy> <#policy args> [policy args]*

================ =======================================================
metadata dev     fast device holding the persistent metadata
cache dev        fast device holding cached data blocks
origin dev       slow device holding original data blocks
block size       cache unit size in sectors

#feature args    number of feature arguments passed
feature args     writethrough or passthrough (The default is writeback.)

policy           the replacement policy to use
#policy args     an even number of arguments corresponding to
                 key/value pairs passed to the policy
policy args      key/value pairs passed to the policy
                 E.g. 'sequential_threshold 1024'
                 See cache-policies.txt for details.
================ =======================================================

Optional feature arguments are:

==================== ========================================================
writethrough         write through caching that prohibits cache block
                     content from being different from origin block content.
                     Without this argument, the default behaviour is to write
                     back cache block contents later for performance reasons,
                     so they may differ from the corresponding origin blocks.

passthrough          a degraded mode useful for various cache coherency
                     situations (e.g., rolling back snapshots of
                     underlying storage). Reads and writes always go to
                     the origin. If a write goes to a cached origin
                     block, then the cache block is invalidated.
                     To enable passthrough mode the cache must be clean.

metadata2            use version 2 of the metadata. This stores the dirty
                     bits in a separate btree, which improves speed of
                     shutting down the cache.

no_discard_passdown  disable passing down discards from the cache
                     to the origin's data device.
==================== ========================================================

A policy called 'default' is always registered. This is an alias for
the policy we currently think is giving best all round performance.

As the default policy could vary between kernels, if you are relying on
the characteristics of a specific policy, always request it by name.

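The constructor arguments above have to be counted correctly, which is easy
to get wrong by hand. The sketch below assembles a table line from its parts;
the function name is hypothetical, and the leading start sector, length and
``cache`` target name are the usual device-mapper table prefix rather than
part of the constructor itself.

```python
def build_cache_table(start, length, metadata_dev, cache_dev, origin_dev,
                      block_size, features=(), policy="default",
                      policy_args=()):
    """Return a device-mapper table line for a cache target."""
    if len(policy_args) % 2 != 0:
        # Policy arguments are key/value pairs, so the count must be even.
        raise ValueError("policy args must be key/value pairs")
    fields = [start, length, "cache",
              metadata_dev, cache_dev, origin_dev, block_size,
              len(features), *features,
              policy, len(policy_args), *policy_args]
    return " ".join(str(field) for field in fields)


if __name__ == "__main__":
    print(build_cache_table(
        0, 41943040,
        "/dev/mapper/metadata", "/dev/mapper/ssd", "/dev/mapper/origin",
        512, features=["writethrough"]))
```

The returned string is what would be passed to ``dmsetup create`` via
``--table``; the device paths here are placeholders.
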
Status
------

::

  <metadata block size> <#used metadata blocks>/<#total metadata blocks>
  <cache block size> <#used cache blocks>/<#total cache blocks>
  <#read hits> <#read misses> <#write hits> <#write misses>
  <#demotions> <#promotions> <#dirty> <#features> <features>*
  <#core args> <core args>* <policy name> <#policy args> <policy args>*
  <cache metadata mode> <needs_check>

========================= =====================================================
metadata block size       Fixed block size for each metadata block in
                          sectors
#used metadata blocks     Number of metadata blocks used
#total metadata blocks    Total number of metadata blocks
cache block size          Configurable block size for the cache device
                          in sectors
#used cache blocks        Number of blocks resident in the cache
#total cache blocks       Total number of cache blocks
#read hits                Number of times a READ bio has been mapped
                          to the cache
#read misses              Number of times a READ bio has been mapped
                          to the origin
#write hits               Number of times a WRITE bio has been mapped
                          to the cache
#write misses             Number of times a WRITE bio has been
                          mapped to the origin
#demotions                Number of times a block has been removed
                          from the cache
#promotions               Number of times a block has been moved to
                          the cache
#dirty                    Number of blocks in the cache that differ
                          from the origin
#feature args             Number of feature args to follow
feature args              'writethrough' (optional)
#core args                Number of core arguments (must be even)
core args                 Key/value pairs for tuning the core
                          e.g. migration_threshold
policy name               Name of the policy
#policy args              Number of policy arguments to follow (must be even)
policy args               Key/value pairs e.g. sequential_threshold
cache metadata mode       ro if read-only, rw if read-write

                          In serious cases where even a read-only mode is
                          deemed unsafe no further I/O will be permitted and
                          the status will just contain the string 'Fail'.
                          The userspace recovery tools should then be used.
needs_check               'needs_check' if set, '-' if not set
                          A metadata operation has failed, resulting in the
                          needs_check flag being set in the metadata's
                          superblock. The metadata device must be
                          deactivated and checked/repaired before the
                          cache can be made fully operational again.
                          '-' indicates needs_check is not set.
========================= =====================================================

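Because the status line mixes fixed fields with counted, variable-length
groups, it is easiest to read with a small parser. The following is a minimal
sketch under stated assumptions: the function name and the sample line are
made up for illustration, the input is only the target-specific part of the
status (without the start, length and target name that ``dmsetup status``
prints first), and the trailing needs_check field is treated as optional.

```python
def parse_cache_status(text):
    """Parse the cache-specific part of a status line into a dict."""
    tokens = text.split()
    if tokens == ["Fail"]:
        return {"failed": True}
    it = iter(tokens)

    def take(count):
        return [next(it) for _ in range(count)]

    def used_total(token):
        used, total = token.split("/")
        return int(used), int(total)

    status = {"failed": False}
    status["metadata_block_size"] = int(next(it))
    status["metadata_used"], status["metadata_total"] = used_total(next(it))
    status["cache_block_size"] = int(next(it))
    status["cache_used"], status["cache_total"] = used_total(next(it))
    for name in ("read_hits", "read_misses", "write_hits", "write_misses",
                 "demotions", "promotions", "dirty"):
        status[name] = int(next(it))
    status["features"] = take(int(next(it)))
    core = take(int(next(it)))
    status["core_args"] = dict(zip(core[0::2], core[1::2]))
    status["policy"] = next(it)
    policy = take(int(next(it)))
    status["policy_args"] = dict(zip(policy[0::2], policy[1::2]))
    status["metadata_mode"] = next(it)
    status["needs_check"] = next(it, "-") == "needs_check"
    return status


if __name__ == "__main__":
    # Invented sample values, not output captured from a real device.
    sample = ("8 72/4096 128 1024/8192 100 20 30 40 5 6 0 1 writethrough "
              "2 migration_threshold 2048 mq 4 sequential_threshold 1024 "
              "random_threshold 8 rw -")
    print(parse_cache_status(sample))
```

Hit ratios and cache occupancy can then be derived from the returned
counters.
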
Messages
--------

Policies will have different tunables, specific to each one, so we
need a generic way of getting and setting these. Device-mapper
messages are used. (A sysfs interface would also be possible.)

The message format is::

  <key> <value>

E.g.::

  dmsetup message my_cache 0 sequential_threshold 1024


Invalidation is removing an entry from the cache without writing it
back. Cache blocks can be invalidated via the invalidate_cblocks
message, which takes an arbitrary number of cblock ranges. Each cblock
range's end value is "one past the end", meaning 5-10 expresses a range
of values from 5 to 9. Each cblock must be expressed as a decimal
value; in the future a variant message that takes cblock ranges
expressed in hexadecimal may be needed to better support efficient
invalidation of larger caches. The cache must be in passthrough mode
when invalidate_cblocks is used::

  invalidate_cblocks [<cblock>|<cblock begin>-<cblock end>]*

E.g.::

  dmsetup message my_cache 0 invalidate_cblocks 2345 3456-4567 5678-6789

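The "one past the end" convention for cblock ranges is easy to misread, so a
short sketch may help. It expands invalidate_cblocks arguments into the
individual cache blocks they name. The function name is hypothetical, and
rejecting empty or reversed ranges is an assumption of this example rather
than documented behaviour.

```python
def expand_cblock_args(args):
    """Expand '<cblock>' and '<begin>-<end>' arguments into block numbers."""
    blocks = []
    for arg in args:
        if "-" in arg:
            begin_text, end_text = arg.split("-")
            begin, end = int(begin_text), int(end_text)  # decimal only
            if begin >= end:
                raise ValueError("empty or reversed range: %s" % arg)
            # The end value is exclusive, so 5-10 covers blocks 5 to 9.
            blocks.extend(range(begin, end))
        else:
            blocks.append(int(arg))
    return blocks


if __name__ == "__main__":
    print(expand_cblock_args(["5-10"]))
    print(len(expand_cblock_args(["2345", "3456-4567", "5678-6789"])))
```

For the example message above this gives 1 + 1111 + 1111 cache blocks.
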
Examples
========

The test suite can be found here:

https://github.com/jthornber/device-mapper-test-suite

::

  dmsetup create my_cache --table "0 41943040 cache /dev/mapper/metadata \
          /dev/mapper/ssd /dev/mapper/origin 512 1 writeback default 0"
  dmsetup create my_cache --table "0 41943040 cache /dev/mapper/metadata \
          /dev/mapper/ssd /dev/mapper/origin 1024 1 writeback \
          mq 4 sequential_threshold 1024 random_threshold 8"