NAME
WAPBL,
wapbl_start,
wapbl_stop,
wapbl_begin,
wapbl_end,
wapbl_flush,
wapbl_discard,
wapbl_add_buf,
wapbl_remove_buf,
wapbl_resize_buf,
wapbl_register_inode,
wapbl_unregister_inode,
wapbl_register_deallocation,
wapbl_jlock_assert,
wapbl_junlock_assert
—
write-ahead physical block logging for file
systems
SYNOPSIS
#include <sys/wapbl.h>
typedef void (*wapbl_flush_fn_t)(struct mount *, daddr_t *, int *, int);
int
wapbl_start(struct wapbl **wlp, struct mount *mp, struct vnode *devvp,
    daddr_t off, size_t count, size_t blksize, struct wapbl_replay *wr,
    wapbl_flush_fn_t flushfn, wapbl_flush_fn_t flushabortfn);
int
wapbl_stop(struct wapbl *wl, int force);
int
wapbl_begin(struct wapbl *wl, const char *file, int line);
void
wapbl_end(struct wapbl *wl);
int
wapbl_flush(struct wapbl *wl, int wait);
void
wapbl_discard(struct wapbl *wl);
void
wapbl_add_buf(struct wapbl *wl, struct buf *bp);
void
wapbl_remove_buf(struct wapbl *wl, struct buf *bp);
void
wapbl_resize_buf(struct wapbl *wl, struct buf *bp, long oldsz, long oldcnt);
void
wapbl_register_inode(struct wapbl *wl, ino_t ino, mode_t mode);
void
wapbl_unregister_inode(struct wapbl *wl, ino_t ino, mode_t mode);
void
wapbl_register_deallocation(struct wapbl *wl, daddr_t blk, int len);
void
wapbl_jlock_assert(struct wapbl *wl);
void
wapbl_junlock_assert(struct wapbl *wl);
DESCRIPTION
WAPBL, or write-ahead physical block logging, is an abstraction that lets
file systems write physical blocks in the buffercache(9) to a bounded-size
log before writing them to their real destinations on disk. The name means:
-
-
- logging
- batches of writes are issued atomically via a log
-
-
- physical block
- only physical blocks, not logical file system operations,
are stored in the log
-
-
- write-ahead
- before writing a block to disk, its new content, rather
than its old content for roll-back, is recorded in the log
When a file system using
WAPBL issues writes (as in
bwrite(9) or
bdwrite(9)), they are grouped
in batches called
transactions in memory, which are
serialized to be consistent with program order before
WAPBL
submits them to disk atomically.
Thus, within a transaction, after one write, another write need not wait for
disk I/O, and if the system is interrupted, e.g. by a crash or by power
failure, either both writes will appear on disk, or neither will.
When a transaction is full, it is written to a circular buffer on disk called
the
log. Once the transaction has been written to the log,
every write in it is submitted asynchronously to its real destination on disk. Finally,
the file system may issue new writes via
WAPBL once enough
writes submitted to disk have completed.
After interruption, such as a crash or power failure, some writes issued by the
file system may not have completed. However, the log is written consistently
with program order and before file system writes are submitted to disk. Hence
a consistent program-order view of the file system can be attained by
resubmitting the writes that were successfully stored in the log using
wapbl_replay(9). This may
not be the same state as just before the interruption: writes in transactions
that did not reach the disk will be excluded.
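The recovery argument above can be illustrated with a toy userland model. It is only a sketch: it does not reflect the on-disk format used by WAPBL, and struct toy_disk, struct toy_rec, and the toy_* functions are hypothetical names invented for this example. A single int stands in for the content of a physical block.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NBLOCKS 8	/* blocks on the toy "disk" */
#define TOY_LOGMAX  16	/* capacity of the toy log */

/* One logged write: the new content of a physical block. */
struct toy_rec {
	int txn;	/* transaction the write belongs to */
	int blkno;	/* physical block number */
	int value;	/* new content of the block */
};

struct toy_disk {
	int blocks[TOY_NBLOCKS];		/* real destinations */
	struct toy_rec log[TOY_LOGMAX];		/* records stored in the log */
	size_t nrec;
	int committed;	/* highest transaction fully in the log, 0 if none */
};

/* Append one write to the log; -1 if the log is full or blkno is invalid. */
int
toy_log_write(struct toy_disk *d, int txn, int blkno, int value)
{
	if (d->nrec == TOY_LOGMAX || blkno < 0 || blkno >= TOY_NBLOCKS)
		return -1;
	d->log[d->nrec++] = (struct toy_rec){ txn, blkno, value };
	return 0;
}

/* Record that every write of transactions up to txn is in the log. */
void
toy_log_commit(struct toy_disk *d, int txn)
{
	d->committed = txn;
}

/*
 * Replay after an interruption: resubmit, in log order, only the writes
 * of committed transactions.  Returns the number of writes applied.
 */
int
toy_replay(struct toy_disk *d)
{
	int applied = 0;

	for (size_t i = 0; i < d->nrec; i++) {
		if (d->log[i].txn > d->committed)
			continue;	/* transaction never reached the log */
		d->blocks[d->log[i].blkno] = d->log[i].value;
		applied++;
	}
	return applied;
}
```

If transaction 1 is committed and an interruption occurs while transaction 2 is still being logged, toy_replay() applies only the writes of transaction 1, which is a consistent state but not the newest one.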
For a file system to use
WAPBL, its
VFS_MOUNT(9) method should
first replay any journal on disk using
wapbl_replay(9), and then,
if the mount is read/write, initialize
WAPBL for the mount
by calling
wapbl_start(). The
VFS_UNMOUNT(9) method
should call
wapbl_stop().
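The mount and unmount sequence might look roughly like the following. This is a userland sketch, not kernel code: the struct definitions and the bodies of wapbl_start() and wapbl_stop() are stubs so that it compiles on its own, sketch_daddr_t plays the role of daddr_t, sketch_replay_journal() is a hypothetical helper standing in for the wapbl_replay(9) routines, and the myfs_* names and the log location passed to wapbl_start() are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userland stand-ins for kernel types. */
typedef int64_t sketch_daddr_t;		/* plays the role of daddr_t */
struct mount { int rdonly; };
struct vnode { int unused; };
struct wapbl { int running; };
struct wapbl_replay { int replayed; };
typedef void (*wapbl_flush_fn_t)(struct mount *, sketch_daddr_t *, int *, int);

static struct wapbl sketch_wl;		/* storage handed out by the stubs */
static struct wapbl_replay sketch_wr;

/* Stub with the parameter order shown in SYNOPSIS. */
int
wapbl_start(struct wapbl **wlp, struct mount *mp, struct vnode *devvp,
    sketch_daddr_t off, size_t count, size_t blksize, struct wapbl_replay *wr,
    wapbl_flush_fn_t flushfn, wapbl_flush_fn_t flushabortfn)
{
	(void)mp; (void)devvp; (void)off; (void)count; (void)blksize;
	(void)wr; (void)flushfn; (void)flushabortfn;
	sketch_wl.running = 1;
	*wlp = &sketch_wl;
	return 0;
}

int
wapbl_stop(struct wapbl *wl, int force)
{
	(void)force;
	wl->running = 0;
	return 0;
}

/* Hypothetical helper standing in for the wapbl_replay(9) routines. */
int
sketch_replay_journal(struct vnode *devvp, struct wapbl_replay **wrp)
{
	(void)devvp;
	sketch_wr.replayed = 1;
	*wrp = &sketch_wr;
	return 0;
}

/* Empty callbacks; a real file system would process deallocations here. */
void
myfs_flush(struct mount *mp, sketch_daddr_t *blks, int *lens, int cnt)
{
	(void)mp; (void)blks; (void)lens; (void)cnt;
}

void
myfs_flush_abort(struct mount *mp, sketch_daddr_t *blks, int *lens, int cnt)
{
	(void)mp; (void)blks; (void)lens; (void)cnt;
}

int
myfs_mount(struct mount *mp, struct vnode *devvp, struct wapbl **wlp)
{
	struct wapbl_replay *wr = NULL;
	int error;

	*wlp = NULL;
	error = sketch_replay_journal(devvp, &wr);	/* replay first */
	if (error)
		return error;
	if (mp->rdonly)
		return 0;		/* read-only: do not start logging */
	/* Offset 64, 1024 sectors, 512-byte blocks: made-up values. */
	return wapbl_start(wlp, mp, devvp, 64, 1024, 512, wr,
	    myfs_flush, myfs_flush_abort);
}

int
myfs_unmount(struct wapbl *wl, int force)
{
	return wl == NULL ? 0 : wapbl_stop(wl, force);
}
```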
Before issuing any
buffercache(9) writes, the
file system must acquire a shared lock on the current
WAPBL
transaction with
wapbl_begin(), which may sleep until there
is room in the transaction for new writes. After issuing the writes, the file
system must release its shared lock on the transaction with
wapbl_end(). Either all writes issued between
wapbl_begin() and
wapbl_end() will
complete, or none of them will.
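A minimal sketch of this bracketing follows. The bodies of wapbl_begin() and wapbl_end() are userland stubs that only count shared holders, myfs_update_pair() is a hypothetical file system operation, and plain assignments stand in for buffercache(9) writes. The point of the sketch is the control flow: the lock is released on every path out of the operation.

```c
#include <assert.h>
#include <errno.h>

/* Userland stand-in: only counts holders of the shared lock. */
struct wapbl { int shared; };

int
wapbl_begin(struct wapbl *wl, const char *file, int line)
{
	(void)file; (void)line;
	wl->shared++;
	return 0;
}

void
wapbl_end(struct wapbl *wl)
{
	assert(wl->shared > 0);
	wl->shared--;
}

/* Hypothetical operation that must update two metadata blocks together. */
int
myfs_update_pair(struct wapbl *wl, int *blk_a, int *blk_b, int val)
{
	int error;

	error = wapbl_begin(wl, __FILE__, __LINE__);
	if (error)
		return error;	/* lock not held, nothing to release */

	if (val < 0) {
		error = EINVAL;	/* fail before issuing any write */
		goto out;
	}
	*blk_a = val;		/* stands in for a buffercache write */
	*blk_b = val;		/* second write in the same transaction */
out:
	wapbl_end(wl);		/* release the shared lock on every path */
	return error;
}
```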
File systems may also observe an exclusive lock on the current transaction
when WAPBL, while flushing the transaction to disk or aborting a flush,
invokes one of the file system's callbacks. A file system can assert that
the transaction is locked with wapbl_jlock_assert(), or that it is not
exclusively locked with wapbl_junlock_assert().
If a file system requires multiple transactions to initialize an inode, and
needs to destroy partially initialized inodes during replay, it can register
them by
ino_t inode number before initialization with
wapbl_register_inode() and unregister them with
wapbl_unregister_inode() once initialization is complete.
WAPBL does not concern itself with whether the objects
identified by
ino_t values are ‘inodes’ or
‘quaggas’ or anything else; file systems may use this to
list any objects keyed by
ino_t value in the log.
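The registration pattern might be used as sketched below. Everything here is a userland stand-in: the struct wapbl layout, the stub bodies, and the myfs_init_step1(), myfs_init_step2(), and sketch_is_registered() helpers are hypothetical. Calling the registration routines inside the wapbl_begin() and wapbl_end() bracket is an assumption of this sketch, not something this page requires.

```c
#include <assert.h>
#include <sys/types.h>

#define SK_MAXINO 8

/* Userland stand-in that just remembers registered (ino, mode) pairs. */
struct wapbl {
	ino_t inos[SK_MAXINO];
	mode_t modes[SK_MAXINO];
	int nino;
};

int
wapbl_begin(struct wapbl *wl, const char *file, int line)
{
	(void)wl; (void)file; (void)line;
	return 0;
}

void
wapbl_end(struct wapbl *wl)
{
	(void)wl;
}

void
wapbl_register_inode(struct wapbl *wl, ino_t ino, mode_t mode)
{
	assert(wl->nino < SK_MAXINO);
	wl->inos[wl->nino] = ino;
	wl->modes[wl->nino] = mode;
	wl->nino++;
}

void
wapbl_unregister_inode(struct wapbl *wl, ino_t ino, mode_t mode)
{
	for (int i = 0; i < wl->nino; i++) {
		if (wl->inos[i] == ino && wl->modes[i] == mode) {
			wl->nino--;	/* move the last entry into the gap */
			wl->inos[i] = wl->inos[wl->nino];
			wl->modes[i] = wl->modes[wl->nino];
			return;
		}
	}
	assert(!"inode was not registered with this mode");
}

int
sketch_is_registered(const struct wapbl *wl, ino_t ino)
{
	for (int i = 0; i < wl->nino; i++)
		if (wl->inos[i] == ino)
			return 1;
	return 0;
}

/* First transaction: note that initialization of ino has begun. */
int
myfs_init_step1(struct wapbl *wl, ino_t ino, mode_t mode)
{
	int error = wapbl_begin(wl, __FILE__, __LINE__);

	if (error)
		return error;
	wapbl_register_inode(wl, ino, mode);
	/* ... first batch of buffercache writes would go here ... */
	wapbl_end(wl);
	return 0;
}

/* Later transaction: finish initialization, then drop the registration. */
int
myfs_init_step2(struct wapbl *wl, ino_t ino, mode_t mode)
{
	int error = wapbl_begin(wl, __FILE__, __LINE__);

	if (error)
		return error;
	/* ... remaining buffercache writes would go here ... */
	wapbl_unregister_inode(wl, ino, mode);
	wapbl_end(wl);
	return 0;
}
```

An interruption between the two steps leaves the inode registered, which is how a file system could find it again during replay and destroy it.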
When a file system frees resources on disk and issues writes to reflect the
fact, it cannot then reuse the resources until the writes have reached the
disk. However, as far as the
buffercache(9) is
concerned, as soon as the file system issues the writes, they will appear to
have been written. So the file system must not attempt to reuse the resource
until the current
WAPBL transaction has been flushed to
disk.
The file system can defer freeing a resource by calling
wapbl_register_deallocation() to record the disk address and the
length in bytes of the resource. Then, when
WAPBL next flushes the transaction to disk, it will pass an
array of the disk addresses and lengths in bytes to a file-system-supplied
callback. (Again,
WAPBL does not care whether the
‘disk address’ or ‘length in bytes’ is actually that;
it will pass along
daddr_t and
int
values.)
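Deferred deallocation and the two callbacks can be modelled in userland as follows. This is a toy under stated assumptions: sketch_daddr_t plays the role of daddr_t, the struct wapbl and struct mount layouts are invented, sketch_flush() is a hypothetical stand-in for the flush step, and keeping the list after a failed flush is a choice made for this model only. The callbacks use the wapbl_flush_fn_t argument order from SYNOPSIS.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define SK_NBLK       8		/* blocks tracked by the toy allocator */
#define SK_BSIZE      512	/* bytes per block */
#define SK_MAXDEALLOC 4

typedef int64_t sketch_daddr_t;		/* plays the role of daddr_t */
struct mount { unsigned char in_use[SK_NBLK]; };
typedef void (*wapbl_flush_fn_t)(struct mount *, sketch_daddr_t *, int *, int);

struct wapbl {
	struct mount *mp;
	wapbl_flush_fn_t flushfn, flushabortfn;
	sketch_daddr_t blks[SK_MAXDEALLOC];	/* pending disk addresses */
	int lens[SK_MAXDEALLOC];		/* pending lengths in bytes */
	int cnt;
};

void
wapbl_register_deallocation(struct wapbl *wl, sketch_daddr_t blk, int len)
{
	assert(wl->cnt < SK_MAXDEALLOC);
	wl->blks[wl->cnt] = blk;
	wl->lens[wl->cnt] = len;
	wl->cnt++;
}

/* Set the allocation state of every block covered by the arrays. */
void
sketch_mark(struct mount *mp, sketch_daddr_t *blks, int *lens, int cnt,
    unsigned char state)
{
	for (int i = 0; i < cnt; i++)
		for (int j = 0; j < lens[i] / SK_BSIZE; j++)
			mp->in_use[blks[i] + j] = state;
}

/* flushfn: the deferred frees finally take effect. */
void
myfs_flush(struct mount *mp, sketch_daddr_t *blks, int *lens, int cnt)
{
	sketch_mark(mp, blks, lens, cnt, 0);
}

/* flushabortfn: undo what myfs_flush() did. */
void
myfs_flush_abort(struct mount *mp, sketch_daddr_t *blks, int *lens, int cnt)
{
	sketch_mark(mp, blks, lens, cnt, 1);
}

/* Toy flush: call flushfn, then flushabortfn if writing the log fails. */
int
sketch_flush(struct wapbl *wl, int log_write_fails)
{
	wl->flushfn(wl->mp, wl->blks, wl->lens, wl->cnt);
	if (log_write_fails) {
		wl->flushabortfn(wl->mp, wl->blks, wl->lens, wl->cnt);
		return EIO;
	}
	wl->cnt = 0;	/* deallocations have been handed over */
	return 0;
}
```

Until sketch_flush() succeeds, the blocks stay marked in use, so the toy allocator cannot hand them out again.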
FUNCTIONS
-
-
- wapbl_start(wlp,
mp, devvp,
off, count,
blksize, wr,
flushfn, flushabortfn)
- Start using WAPBL for the file system
mounted at mp, storing a log of
count disk sectors at disk address
off on the block device devvp
writing blocks in units of blksize bytes. On
success, stores an opaque struct wapbl * cookie in
*wlp for use with the other
WAPBL routines and returns zero. On failure, returns an
error number.
If the file system had replayed the log with
wapbl_replay(9), then
wr must be the struct wapbl_replay
* cookie used to replay it, and wapbl_start() will
register any inodes that were in the log as if with
wapbl_register_inode(); otherwise
wr must be NULL.
flushfn is a callback that WAPBL
will invoke as flushfn (mp,
deallocblks, dealloclens,
dealloccnt) just before it flushes a transaction to
disk, with an exclusive lock held on the transaction, where
mp is the mount point passed to
wapbl_start(), deallocblks is an
array of dealloccnt disk addresses, and
dealloclens is an array of
dealloccnt lengths, corresponding to the addresses
and lengths the file system passed to
wapbl_register_deallocation(). If flushing the
transaction to disk fails, WAPBL will call
flushabortfn with the same arguments to undo any
effects that flushfn had.
-
-
- wapbl_stop(wl,
force)
- Flush the current transaction to disk and stop using
WAPBL. If flushing the transaction fails and
force is zero, return an error. If flushing the
transaction fails and force is nonzero, discard the
transaction, permanently losing any writes in it. If flushing the
transaction is successful or if force is nonzero,
free memory associated with wl and return zero.
-
-
- wapbl_begin(wl,
file, line)
- Wait for space in the current transaction for new writes,
flushing it if necessary, and acquire a shared lock on it.
The lock is not exclusive: other threads may acquire shared locks on the
transaction too. The lock is not recursive: a thread may not acquire it
again without calling wapbl_end() first.
May sleep.
file and line are the file name
and line number of the caller for debugging purposes.
-
-
- wapbl_end(wl)
- Release a shared lock on the transaction acquired with
wapbl_begin().
-
-
- wapbl_flush(wl,
wait)
- Flush the current transaction to disk. If
wait is nonzero, wait for all writes in the current
transaction to complete.
The current transaction must not be locked.
-
-
- wapbl_discard(wl)
- Discard the current transaction, permanently losing any
writes in it.
The current transaction must not be locked.
-
-
- wapbl_add_buf(wl,
bp)
- Add the buffer bp to the current
transaction, which must be locked, because someone has asked to write it.
This is meant to be called from within
buffercache(9), not by
file systems directly.
-
-
- wapbl_remove_buf(wl,
bp)
- Remove the buffer bp, which must have
been added using wapbl_add_buf(), from the current
transaction, which must be locked, because it has been invalidated (or XXX
???).
This is meant to be called from within
buffercache(9), not by
file systems directly.
-
-
- wapbl_resize_buf(wl,
bp, oldsz,
oldcnt)
- Note that the buffer bp, which must
have been added using wapbl_add_buf(), has changed
size, where oldsz is the previous allocated size in
bytes and oldcnt is the previous number of valid
bytes in bp.
This is meant to be called from within
buffercache(9), not by
file systems directly.
-
-
- wapbl_register_inode(wl,
ino, mode)
- Register ino with the mode
mode as commencing initialization.
-
-
- wapbl_unregister_inode(wl,
ino, mode)
- Unregister ino, which must have
previously been registered with wapbl_register_inode()
using the same mode, now that its initialization has
completed.
-
-
- wapbl_register_deallocation(wl,
blk, len)
- Register len bytes at the disk
address blk as ready for deallocation, so that they
will be passed to the flushfn that was given to
wapbl_start().
-
-
- wapbl_jlock_assert(wl)
- Assert that the current transaction is locked.
Note that it might not be locked by the current thread: this assertion
passes if any thread has it locked.
-
-
- wapbl_junlock_assert(wl)
- Assert that the current transaction is not exclusively
locked by the current thread.
Users of WAPBL observe exclusive locks only in the
flushfn and flushabortfn
callbacks to wapbl_start(). Outside of such contexts,
the transaction is never exclusively locked, even between
wapbl_begin() and wapbl_end().
There is no way to assert that the current transaction is not locked at all
— i.e., that the caller may acquire a shared lock on the transaction
with wapbl_begin() without danger of deadlock.
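What the two assertions can and cannot distinguish may be easier to see in a toy model of the lock state. struct sk_txlock and both predicates are hypothetical names for this sketch; the real routines take a struct wapbl * and assert rather than return a value, and the model ignores which thread holds the lock.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy lock state of the current transaction. */
struct sk_txlock {
	int shared;	/* threads between wapbl_begin() and wapbl_end() */
	bool exclusive;	/* true while a flush callback is running */
};

/* Models wapbl_jlock_assert(): locked by anyone, in either mode. */
bool
sk_jlock_ok(const struct sk_txlock *l)
{
	return l->shared > 0 || l->exclusive;
}

/* Models wapbl_junlock_assert(): not exclusively locked. */
bool
sk_junlock_ok(const struct sk_txlock *l)
{
	return !l->exclusive;
}
```

In this model a transaction with shared holders passes both checks, and an unlocked transaction passes sk_junlock_ok() as well, so neither predicate can tell "unlocked" apart from "locked shared".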
CODE REFERENCES
The
WAPBL subsystem is implemented in
sys/kern/vfs_wapbl.c, with hooks in
sys/kern/vfs_bio.c.
SEE ALSO
buffercache(9),
vfsops(9),
wapbl_replay(9)
BUGS
WAPBL works only for file system metadata managed via the
buffercache(9), and
provides no way to log writes via the page cache, as in
VOP_GETPAGES(9),
VOP_PUTPAGES(9), and
ubc_uiomove(9), which is
normally used for file data.
Not only is
WAPBL unable to log writes via the page cache, it
is also unable to defer
buffercache(9) writes until
cached pages have been written. This manifests as the well-known
garbage-data-appended-after-crash bug in FFS: when appending to a file, the
pages containing new data may not reach the disk before the inode update
reporting its new size. After a crash, the inode update will be on disk, but
the new data will not be; instead, whatever garbage data was in the free
space will appear to have been appended to the file.
WAPBL
exacerbates the problem by increasing the throughput of metadata writes,
because it can issue many metadata writes asynchronously that FFS without
WAPBL would need to issue synchronously in order for
fsck(8) to work.
The criteria for when the transaction must be flushed to disk before
wapbl_begin() returns are heuristic, i.e. wrong. There is no
way for a file system to communicate to
wapbl_begin() how
many buffers, inodes, and deallocations it will issue via
WAPBL in the transaction.
WAPBL mainly supports write-ahead, and has only limited
support for rolling back operations, in the form of
wapbl_register_inode() and
wapbl_unregister_inode(). Consequently, for example, a large
write appending to a file, which requires multiple disk block allocations and
an inode update, must occur in a single transaction: there is no way to
roll back the disk block allocations if the write fails in the middle, e.g.
because of a fault in the middle of the user buffer.
wapbl_jlock_assert() does not guarantee that the current
thread has the current transaction locked.
wapbl_junlock_assert() does not guarantee that the current
thread does not have the current transaction locked at all.
There is only one
WAPBL transaction for each file system at
any given time, and only one
WAPBL log on disk.
Consequently, all writes are serialized. Extending
WAPBL to
support multiple logs per file system, partitioned according to an appropriate
scheme, is left as an exercise for the reader.
There is no reason for
WAPBL to require its own hooks in
buffercache(9).
The on-disk format used by
WAPBL is undocumented.