mirror of https://github.com/Qortal/Brooklyn
You can not select more than 25 topics
Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
756 lines
23 KiB
756 lines
23 KiB
.. SPDX-License-Identifier: GPL-2.0 |
|
|
|
Journal (jbd2) |
|
-------------- |
|
|
|
Introduced in ext3, the ext4 filesystem employs a journal to protect the |
|
filesystem against metadata inconsistencies in the case of a system crash. Up |
|
to 10,240,000 file system blocks (see man mke2fs(8) for more details on journal |
|
size limits) can be reserved inside the filesystem as a place to land |
|
“important” data writes on-disk as quickly as possible. Once the important |
|
data transaction is fully written to the disk and flushed from the disk write |
|
cache, a record of the data being committed is also written to the journal. At |
|
some later point in time, the journal code writes the transactions to their |
|
final locations on disk (this could involve a lot of seeking or a lot of small |
|
read-write-erases) before erasing the commit record. Should the system |
|
crash during the second slow write, the journal can be replayed all the |
|
way to the latest commit record, guaranteeing the atomicity of whatever |
|
gets written through the journal to the disk. The effect of this is to |
|
guarantee that the filesystem does not become stuck midway through a |
|
metadata update. |
|
|
|
For performance reasons, ext4 by default only writes filesystem metadata |
|
through the journal. This means that file data blocks are /not/ |
|
guaranteed to be in any consistent state after a crash. If this default |
|
guarantee level (``data=ordered``) is not satisfactory, there is a mount |
|
option to control journal behavior. If ``data=journal``, all data and |
|
metadata are written to disk through the journal. This is slower but |
|
safest. If ``data=writeback``, dirty data blocks are not flushed to the |
|
disk before the metadata are written to disk through the journal. |
|
|
|
In case of ``data=ordered`` mode, Ext4 also supports fast commits which |
|
help reduce commit latency significantly. The default ``data=ordered`` |
|
mode works by logging metadata blocks to the journal. In fast commit |
|
mode, Ext4 only stores the minimal delta needed to recreate the |
|
affected metadata in fast commit space that is shared with JBD2. |
|
Once the fast commit area fills in or if fast commit is not possible |
|
or if JBD2 commit timer goes off, Ext4 performs a traditional full commit. |
|
A full commit invalidates all the fast commits that happened before |
|
it and thus it makes the fast commit area empty for further fast |
|
commits. This feature needs to be enabled at mkfs time. |
|
|
|
The journal inode is typically inode 8. The first 68 bytes of the |
|
journal inode are replicated in the ext4 superblock. The journal itself |
|
is normal (but hidden) file within the filesystem. The file usually |
|
consumes an entire block group, though mke2fs tries to put it in the |
|
middle of the disk. |
|
|
|
All fields in jbd2 are written to disk in big-endian order. This is the |
|
opposite of ext4. |
|
|
|
NOTE: Both ext4 and ocfs2 use jbd2. |
|
|
|
The maximum size of a journal embedded in an ext4 filesystem is 2^32 |
|
blocks. jbd2 itself does not seem to care. |
|
|
|
Layout |
|
~~~~~~ |
|
|
|
Generally speaking, the journal has this format: |
|
|
|
.. list-table:: |
|
:widths: 16 48 16 |
|
:header-rows: 1 |
|
|
|
* - Superblock |
|
- descriptor\_block (data\_blocks or revocation\_block) [more data or |
|
revocations] commmit\_block |
|
- [more transactions...] |
|
* - |
|
- One transaction |
|
- |
|
|
|
Notice that a transaction begins with either a descriptor and some data, |
|
or a block revocation list. A finished transaction always ends with a |
|
commit. If there is no commit record (or the checksums don't match), the |
|
transaction will be discarded during replay. |
|
|
|
External Journal |
|
~~~~~~~~~~~~~~~~ |
|
|
|
Optionally, an ext4 filesystem can be created with an external journal |
|
device (as opposed to an internal journal, which uses a reserved inode). |
|
In this case, on the filesystem device, ``s_journal_inum`` should be |
|
zero and ``s_journal_uuid`` should be set. On the journal device there |
|
will be an ext4 super block in the usual place, with a matching UUID. |
|
The journal superblock will be in the next full block after the |
|
superblock. |
|
|
|
.. list-table:: |
|
:widths: 12 12 12 32 12 |
|
:header-rows: 1 |
|
|
|
* - 1024 bytes of padding |
|
- ext4 Superblock |
|
- Journal Superblock |
|
- descriptor\_block (data\_blocks or revocation\_block) [more data or |
|
revocations] commmit\_block |
|
- [more transactions...] |
|
* - |
|
- |
|
- |
|
- One transaction |
|
- |
|
|
|
Block Header |
|
~~~~~~~~~~~~ |
|
|
|
Every block in the journal starts with a common 12-byte header |
|
``struct journal_header_s``: |
|
|
|
.. list-table:: |
|
:widths: 8 8 24 40 |
|
:header-rows: 1 |
|
|
|
* - Offset |
|
- Type |
|
- Name |
|
- Description |
|
* - 0x0 |
|
- \_\_be32 |
|
- h\_magic |
|
- jbd2 magic number, 0xC03B3998. |
|
* - 0x4 |
|
- \_\_be32 |
|
- h\_blocktype |
|
- Description of what this block contains. See the jbd2_blocktype_ table |
|
below. |
|
* - 0x8 |
|
- \_\_be32 |
|
- h\_sequence |
|
- The transaction ID that goes with this block. |
|
|
|
.. _jbd2_blocktype: |
|
|
|
The journal block type can be any one of: |
|
|
|
.. list-table:: |
|
:widths: 16 64 |
|
:header-rows: 1 |
|
|
|
* - Value |
|
- Description |
|
* - 1 |
|
- Descriptor. This block precedes a series of data blocks that were |
|
written through the journal during a transaction. |
|
* - 2 |
|
- Block commit record. This block signifies the completion of a |
|
transaction. |
|
* - 3 |
|
- Journal superblock, v1. |
|
* - 4 |
|
- Journal superblock, v2. |
|
* - 5 |
|
- Block revocation records. This speeds up recovery by enabling the |
|
journal to skip writing blocks that were subsequently rewritten. |
|
|
|
Super Block |
|
~~~~~~~~~~~ |
|
|
|
The super block for the journal is much simpler as compared to ext4's. |
|
The key data kept within are size of the journal, and where to find the |
|
start of the log of transactions. |
|
|
|
The journal superblock is recorded as ``struct journal_superblock_s``, |
|
which is 1024 bytes long: |
|
|
|
.. list-table:: |
|
:widths: 8 8 24 40 |
|
:header-rows: 1 |
|
|
|
* - Offset |
|
- Type |
|
- Name |
|
- Description |
|
* - |
|
- |
|
- |
|
- Static information describing the journal. |
|
* - 0x0 |
|
- journal\_header\_t (12 bytes) |
|
- s\_header |
|
- Common header identifying this as a superblock. |
|
* - 0xC |
|
- \_\_be32 |
|
- s\_blocksize |
|
- Journal device block size. |
|
* - 0x10 |
|
- \_\_be32 |
|
- s\_maxlen |
|
- Total number of blocks in this journal. |
|
* - 0x14 |
|
- \_\_be32 |
|
- s\_first |
|
- First block of log information. |
|
* - |
|
- |
|
- |
|
- Dynamic information describing the current state of the log. |
|
* - 0x18 |
|
- \_\_be32 |
|
- s\_sequence |
|
- First commit ID expected in log. |
|
* - 0x1C |
|
- \_\_be32 |
|
- s\_start |
|
- Block number of the start of log. Contrary to the comments, this field |
|
being zero does not imply that the journal is clean! |
|
* - 0x20 |
|
- \_\_be32 |
|
- s\_errno |
|
- Error value, as set by jbd2\_journal\_abort(). |
|
* - |
|
- |
|
- |
|
- The remaining fields are only valid in a v2 superblock. |
|
* - 0x24 |
|
- \_\_be32 |
|
- s\_feature\_compat; |
|
- Compatible feature set. See the table jbd2_compat_ below. |
|
* - 0x28 |
|
- \_\_be32 |
|
- s\_feature\_incompat |
|
- Incompatible feature set. See the table jbd2_incompat_ below. |
|
* - 0x2C |
|
- \_\_be32 |
|
- s\_feature\_ro\_compat |
|
- Read-only compatible feature set. There aren't any of these currently. |
|
* - 0x30 |
|
- \_\_u8 |
|
- s\_uuid[16] |
|
- 128-bit uuid for journal. This is compared against the copy in the ext4 |
|
super block at mount time. |
|
* - 0x40 |
|
- \_\_be32 |
|
- s\_nr\_users |
|
- Number of file systems sharing this journal. |
|
* - 0x44 |
|
- \_\_be32 |
|
- s\_dynsuper |
|
- Location of dynamic super block copy. (Not used?) |
|
* - 0x48 |
|
- \_\_be32 |
|
- s\_max\_transaction |
|
- Limit of journal blocks per transaction. (Not used?) |
|
* - 0x4C |
|
- \_\_be32 |
|
- s\_max\_trans\_data |
|
- Limit of data blocks per transaction. (Not used?) |
|
* - 0x50 |
|
- \_\_u8 |
|
- s\_checksum\_type |
|
- Checksum algorithm used for the journal. See jbd2_checksum_type_ for |
|
more info. |
|
* - 0x51 |
|
- \_\_u8[3] |
|
- s\_padding2 |
|
- |
|
* - 0x54 |
|
- \_\_be32 |
|
- s\_num\_fc\_blocks |
|
- Number of fast commit blocks in the journal. |
|
* - 0x58 |
|
- \_\_u32 |
|
- s\_padding[42] |
|
- |
|
* - 0xFC |
|
- \_\_be32 |
|
- s\_checksum |
|
- Checksum of the entire superblock, with this field set to zero. |
|
* - 0x100 |
|
- \_\_u8 |
|
- s\_users[16\*48] |
|
- ids of all file systems sharing the log. e2fsprogs/Linux don't allow |
|
shared external journals, but I imagine Lustre (or ocfs2?), which use |
|
the jbd2 code, might. |
|
|
|
.. _jbd2_compat: |
|
|
|
The journal compat features are any combination of the following: |
|
|
|
.. list-table:: |
|
:widths: 16 64 |
|
:header-rows: 1 |
|
|
|
* - Value |
|
- Description |
|
* - 0x1 |
|
- Journal maintains checksums on the data blocks. |
|
(JBD2\_FEATURE\_COMPAT\_CHECKSUM) |
|
|
|
.. _jbd2_incompat: |
|
|
|
The journal incompat features are any combination of the following: |
|
|
|
.. list-table:: |
|
:widths: 16 64 |
|
:header-rows: 1 |
|
|
|
* - Value |
|
- Description |
|
* - 0x1 |
|
- Journal has block revocation records. (JBD2\_FEATURE\_INCOMPAT\_REVOKE) |
|
* - 0x2 |
|
- Journal can deal with 64-bit block numbers. |
|
(JBD2\_FEATURE\_INCOMPAT\_64BIT) |
|
* - 0x4 |
|
- Journal commits asynchronously. (JBD2\_FEATURE\_INCOMPAT\_ASYNC\_COMMIT) |
|
* - 0x8 |
|
- This journal uses v2 of the checksum on-disk format. Each journal |
|
metadata block gets its own checksum, and the block tags in the |
|
descriptor table contain checksums for each of the data blocks in the |
|
journal. (JBD2\_FEATURE\_INCOMPAT\_CSUM\_V2) |
|
* - 0x10 |
|
- This journal uses v3 of the checksum on-disk format. This is the same as |
|
v2, but the journal block tag size is fixed regardless of the size of |
|
block numbers. (JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3) |
|
* - 0x20 |
|
- Journal has fast commit blocks. (JBD2\_FEATURE\_INCOMPAT\_FAST\_COMMIT) |
|
|
|
.. _jbd2_checksum_type: |
|
|
|
Journal checksum type codes are one of the following. crc32 or crc32c are the |
|
most likely choices. |
|
|
|
.. list-table:: |
|
:widths: 16 64 |
|
:header-rows: 1 |
|
|
|
* - Value |
|
- Description |
|
* - 1 |
|
- CRC32 |
|
* - 2 |
|
- MD5 |
|
* - 3 |
|
- SHA1 |
|
* - 4 |
|
- CRC32C |
|
|
|
Descriptor Block |
|
~~~~~~~~~~~~~~~~ |
|
|
|
The descriptor block contains an array of journal block tags that |
|
describe the final locations of the data blocks that follow in the |
|
journal. Descriptor blocks are open-coded instead of being completely |
|
described by a data structure, but here is the block structure anyway. |
|
Descriptor blocks consume at least 36 bytes, but use a full block: |
|
|
|
.. list-table:: |
|
:widths: 8 8 24 40 |
|
:header-rows: 1 |
|
|
|
* - Offset |
|
- Type |
|
- Name |
|
- Descriptor |
|
* - 0x0 |
|
- journal\_header\_t |
|
- (open coded) |
|
- Common block header. |
|
* - 0xC |
|
- struct journal\_block\_tag\_s |
|
- open coded array[] |
|
- Enough tags either to fill up the block or to describe all the data |
|
blocks that follow this descriptor block. |
|
|
|
Journal block tags have any of the following formats, depending on which |
|
journal feature and block tag flags are set. |
|
|
|
If JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 is set, the journal block tag is |
|
defined as ``struct journal_block_tag3_s``, which looks like the |
|
following. The size is 16 or 32 bytes. |
|
|
|
.. list-table:: |
|
:widths: 8 8 24 40 |
|
:header-rows: 1 |
|
|
|
* - Offset |
|
- Type |
|
- Name |
|
- Descriptor |
|
* - 0x0 |
|
- \_\_be32 |
|
- t\_blocknr |
|
- Lower 32-bits of the location of where the corresponding data block |
|
should end up on disk. |
|
* - 0x4 |
|
- \_\_be32 |
|
- t\_flags |
|
- Flags that go with the descriptor. See the table jbd2_tag_flags_ for |
|
more info. |
|
* - 0x8 |
|
- \_\_be32 |
|
- t\_blocknr\_high |
|
- Upper 32-bits of the location of where the corresponding data block |
|
should end up on disk. This is zero if JBD2\_FEATURE\_INCOMPAT\_64BIT is |
|
not enabled. |
|
* - 0xC |
|
- \_\_be32 |
|
- t\_checksum |
|
- Checksum of the journal UUID, the sequence number, and the data block. |
|
* - |
|
- |
|
- |
|
- This field appears to be open coded. It always comes at the end of the |
|
tag, after t_checksum. This field is not present if the "same UUID" flag |
|
is set. |
|
* - 0x8 or 0xC |
|
- char |
|
- uuid[16] |
|
- A UUID to go with this tag. This field appears to be copied from the |
|
``j_uuid`` field in ``struct journal_s``, but only tune2fs touches that |
|
field. |
|
|
|
.. _jbd2_tag_flags: |
|
|
|
The journal tag flags are any combination of the following: |
|
|
|
.. list-table:: |
|
:widths: 16 64 |
|
:header-rows: 1 |
|
|
|
* - Value |
|
- Description |
|
* - 0x1 |
|
- On-disk block is escaped. The first four bytes of the data block just |
|
happened to match the jbd2 magic number. |
|
* - 0x2 |
|
- This block has the same UUID as previous, therefore the UUID field is |
|
omitted. |
|
* - 0x4 |
|
- The data block was deleted by the transaction. (Not used?) |
|
* - 0x8 |
|
- This is the last tag in this descriptor block. |
|
|
|
If JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 is NOT set, the journal block tag |
|
is defined as ``struct journal_block_tag_s``, which looks like the |
|
following. The size is 8, 12, 24, or 28 bytes: |
|
|
|
.. list-table:: |
|
:widths: 8 8 24 40 |
|
:header-rows: 1 |
|
|
|
* - Offset |
|
- Type |
|
- Name |
|
- Descriptor |
|
* - 0x0 |
|
- \_\_be32 |
|
- t\_blocknr |
|
- Lower 32-bits of the location of where the corresponding data block |
|
should end up on disk. |
|
* - 0x4 |
|
- \_\_be16 |
|
- t\_checksum |
|
- Checksum of the journal UUID, the sequence number, and the data block. |
|
Note that only the lower 16 bits are stored. |
|
* - 0x6 |
|
- \_\_be16 |
|
- t\_flags |
|
- Flags that go with the descriptor. See the table jbd2_tag_flags_ for |
|
more info. |
|
* - |
|
- |
|
- |
|
- This next field is only present if the super block indicates support for |
|
64-bit block numbers. |
|
* - 0x8 |
|
- \_\_be32 |
|
- t\_blocknr\_high |
|
- Upper 32-bits of the location of where the corresponding data block |
|
should end up on disk. |
|
* - |
|
- |
|
- |
|
- This field appears to be open coded. It always comes at the end of the |
|
tag, after t_flags or t_blocknr_high. This field is not present if the |
|
"same UUID" flag is set. |
|
* - 0x8 or 0xC |
|
- char |
|
- uuid[16] |
|
- A UUID to go with this tag. This field appears to be copied from the |
|
``j_uuid`` field in ``struct journal_s``, but only tune2fs touches that |
|
field. |
|
|
|
If JBD2\_FEATURE\_INCOMPAT\_CSUM\_V2 or |
|
JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 are set, the end of the block is a |
|
``struct jbd2_journal_block_tail``, which looks like this: |
|
|
|
.. list-table:: |
|
:widths: 8 8 24 40 |
|
:header-rows: 1 |
|
|
|
* - Offset |
|
- Type |
|
- Name |
|
- Descriptor |
|
* - 0x0 |
|
- \_\_be32 |
|
- t\_checksum |
|
- Checksum of the journal UUID + the descriptor block, with this field set |
|
to zero. |
|
|
|
Data Block |
|
~~~~~~~~~~ |
|
|
|
In general, the data blocks being written to disk through the journal |
|
are written verbatim into the journal file after the descriptor block. |
|
However, if the first four bytes of the block match the jbd2 magic |
|
number then those four bytes are replaced with zeroes and the “escaped” |
|
flag is set in the descriptor block tag. |
|
|
|
Revocation Block |
|
~~~~~~~~~~~~~~~~ |
|
|
|
A revocation block is used to prevent replay of a block in an earlier |
|
transaction. This is used to mark blocks that were journalled at one |
|
time but are no longer journalled. Typically this happens if a metadata |
|
block is freed and re-allocated as a file data block; in this case, a |
|
journal replay after the file block was written to disk will cause |
|
corruption. |
|
|
|
**NOTE**: This mechanism is NOT used to express “this journal block is |
|
superseded by this other journal block”, as the author (djwong) |
|
mistakenly thought. Any block being added to a transaction will cause |
|
the removal of all existing revocation records for that block. |
|
|
|
Revocation blocks are described in |
|
``struct jbd2_journal_revoke_header_s``, are at least 16 bytes in |
|
length, but use a full block: |
|
|
|
.. list-table:: |
|
:widths: 8 8 24 40 |
|
:header-rows: 1 |
|
|
|
* - Offset |
|
- Type |
|
- Name |
|
- Description |
|
* - 0x0 |
|
- journal\_header\_t |
|
- r\_header |
|
- Common block header. |
|
* - 0xC |
|
- \_\_be32 |
|
- r\_count |
|
- Number of bytes used in this block. |
|
* - 0x10 |
|
- \_\_be32 or \_\_be64 |
|
- blocks[0] |
|
- Blocks to revoke. |
|
|
|
After r\_count is a linear array of block numbers that are effectively |
|
revoked by this transaction. The size of each block number is 8 bytes if |
|
the superblock advertises 64-bit block number support, or 4 bytes |
|
otherwise. |
|
|
|
If JBD2\_FEATURE\_INCOMPAT\_CSUM\_V2 or |
|
JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 are set, the end of the revocation |
|
block is a ``struct jbd2_journal_revoke_tail``, which has this format: |
|
|
|
.. list-table:: |
|
:widths: 8 8 24 40 |
|
:header-rows: 1 |
|
|
|
* - Offset |
|
- Type |
|
- Name |
|
- Description |
|
* - 0x0 |
|
- \_\_be32 |
|
- r\_checksum |
|
- Checksum of the journal UUID + revocation block |
|
|
|
Commit Block |
|
~~~~~~~~~~~~ |
|
|
|
The commit block is a sentry that indicates that a transaction has been |
|
completely written to the journal. Once this commit block reaches the |
|
journal, the data stored with this transaction can be written to their |
|
final locations on disk. |
|
|
|
The commit block is described by ``struct commit_header``, which is 32 |
|
bytes long (but uses a full block): |
|
|
|
.. list-table:: |
|
:widths: 8 8 24 40 |
|
:header-rows: 1 |
|
|
|
* - Offset |
|
- Type |
|
- Name |
|
- Descriptor |
|
* - 0x0 |
|
- journal\_header\_s |
|
- (open coded) |
|
- Common block header. |
|
* - 0xC |
|
- unsigned char |
|
- h\_chksum\_type |
|
- The type of checksum to use to verify the integrity of the data blocks |
|
in the transaction. See jbd2_checksum_type_ for more info. |
|
* - 0xD |
|
- unsigned char |
|
- h\_chksum\_size |
|
- The number of bytes used by the checksum. Most likely 4. |
|
* - 0xE |
|
- unsigned char |
|
- h\_padding[2] |
|
- |
|
* - 0x10 |
|
- \_\_be32 |
|
- h\_chksum[JBD2\_CHECKSUM\_BYTES] |
|
- 32 bytes of space to store checksums. If |
|
JBD2\_FEATURE\_INCOMPAT\_CSUM\_V2 or JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 |
|
are set, the first ``__be32`` is the checksum of the journal UUID and |
|
the entire commit block, with this field zeroed. If |
|
JBD2\_FEATURE\_COMPAT\_CHECKSUM is set, the first ``__be32`` is the |
|
crc32 of all the blocks already written to the transaction. |
|
* - 0x30 |
|
- \_\_be64 |
|
- h\_commit\_sec |
|
- The time that the transaction was committed, in seconds since the epoch. |
|
* - 0x38 |
|
- \_\_be32 |
|
- h\_commit\_nsec |
|
- Nanoseconds component of the above timestamp. |
|
|
|
Fast commits |
|
~~~~~~~~~~~~ |
|
|
|
Fast commit area is organized as a log of tag length values. Each TLV has |
|
a ``struct ext4_fc_tl`` in the beginning which stores the tag and the length |
|
of the entire field. It is followed by variable length tag specific value. |
|
Here is the list of supported tags and their meanings: |
|
|
|
.. list-table:: |
|
:widths: 8 20 20 32 |
|
:header-rows: 1 |
|
|
|
* - Tag |
|
- Meaning |
|
- Value struct |
|
- Description |
|
* - EXT4_FC_TAG_HEAD |
|
- Fast commit area header |
|
- ``struct ext4_fc_head`` |
|
- Stores the TID of the transaction after which these fast commits should |
|
be applied. |
|
* - EXT4_FC_TAG_ADD_RANGE |
|
- Add extent to inode |
|
- ``struct ext4_fc_add_range`` |
|
- Stores the inode number and extent to be added in this inode |
|
* - EXT4_FC_TAG_DEL_RANGE |
|
- Remove logical offsets to inode |
|
- ``struct ext4_fc_del_range`` |
|
- Stores the inode number and the logical offset range that needs to be |
|
removed |
|
* - EXT4_FC_TAG_CREAT |
|
- Create directory entry for a newly created file |
|
- ``struct ext4_fc_dentry_info`` |
|
- Stores the parent inode number, inode number and directory entry of the |
|
newly created file |
|
* - EXT4_FC_TAG_LINK |
|
- Link a directory entry to an inode |
|
- ``struct ext4_fc_dentry_info`` |
|
- Stores the parent inode number, inode number and directory entry |
|
* - EXT4_FC_TAG_UNLINK |
|
- Unlink a directory entry of an inode |
|
- ``struct ext4_fc_dentry_info`` |
|
- Stores the parent inode number, inode number and directory entry |
|
|
|
* - EXT4_FC_TAG_PAD |
|
- Padding (unused area) |
|
- None |
|
- Unused bytes in the fast commit area. |
|
|
|
* - EXT4_FC_TAG_TAIL |
|
- Mark the end of a fast commit |
|
- ``struct ext4_fc_tail`` |
|
- Stores the TID of the commit, CRC of the fast commit of which this tag |
|
represents the end of |
|
|
|
Fast Commit Replay Idempotence |
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
|
|
|
Fast commits tags are idempotent in nature provided the recovery code follows |
|
certain rules. The guiding principle that the commit path follows while |
|
committing is that it stores the result of a particular operation instead of |
|
storing the procedure. |
|
|
|
Let's consider this rename operation: 'mv /a /b'. Let's assume dirent '/a' |
|
was associated with inode 10. During fast commit, instead of storing this |
|
operation as a procedure "rename a to b", we store the resulting file system |
|
state as a "series" of outcomes: |
|
|
|
- Link dirent b to inode 10 |
|
- Unlink dirent a |
|
- Inode 10 with valid refcount |
|
|
|
Now when recovery code runs, it needs "enforce" this state on the file |
|
system. This is what guarantees idempotence of fast commit replay. |
|
|
|
Let's take an example of a procedure that is not idempotent and see how fast |
|
commits make it idempotent. Consider following sequence of operations: |
|
|
|
1) rm A |
|
2) mv B A |
|
3) read A |
|
|
|
If we store this sequence of operations as is then the replay is not idempotent. |
|
Let's say while in replay, we crash after (2). During the second replay, |
|
file A (which was actually created as a result of "mv B A" operation) would get |
|
deleted. Thus, file named A would be absent when we try to read A. So, this |
|
sequence of operations is not idempotent. However, as mentioned above, instead |
|
of storing the procedure fast commits store the outcome of each procedure. Thus |
|
the fast commit log for above procedure would be as follows: |
|
|
|
(Let's assume dirent A was linked to inode 10 and dirent B was linked to |
|
inode 11 before the replay) |
|
|
|
1) Unlink A |
|
2) Link A to inode 11 |
|
3) Unlink B |
|
4) Inode 11 |
|
|
|
If we crash after (3) we will have file A linked to inode 11. During the second |
|
replay, we will remove file A (inode 11). But we will create it back and make |
|
it point to inode 11. We won't find B, so we'll just skip that step. At this |
|
point, the refcount for inode 11 is not reliable, but that gets fixed by the |
|
replay of last inode 11 tag. Thus, by converting a non-idempotent procedure |
|
into a series of idempotent outcomes, fast commits ensured idempotence during |
|
the replay. |
|
|
|
Journal Checkpoint |
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
|
|
|
Checkpointing the journal ensures all transactions and their associated buffers |
|
are submitted to the disk. In-progress transactions are waited upon and included |
|
in the checkpoint. Checkpointing is used internally during critical updates to |
|
the filesystem including journal recovery, filesystem resizing, and freeing of |
|
the journal_t structure. |
|
|
|
A journal checkpoint can be triggered from userspace via the ioctl |
|
EXT4_IOC_CHECKPOINT. This ioctl takes a single, u64 argument for flags. |
|
Currently, three flags are supported. First, EXT4_IOC_CHECKPOINT_FLAG_DRY_RUN |
|
can be used to verify input to the ioctl. It returns error if there is any |
|
invalid input, otherwise it returns success without performing |
|
any checkpointing. This can be used to check whether the ioctl exists on a |
|
system and to verify there are no issues with arguments or flags. The |
|
other two flags are EXT4_IOC_CHECKPOINT_FLAG_DISCARD and |
|
EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT. These flags cause the journal blocks to be |
|
discarded or zero-filled, respectively, after the journal checkpoint is |
|
complete. EXT4_IOC_CHECKPOINT_FLAG_DISCARD and EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT |
|
cannot both be set. The ioctl may be useful when snapshotting a system or for |
|
complying with content deletion SLOs.
|
|
|