Flash disks and FFS layout heuristics
Matthew Dillon
dillon at apollo.backplane.com
Tue Apr 1 11:07:58 PDT 2008
:My instinct is to not combine transactions. That is, every
:data write results in a sequence: {data, [indirect blocks],
:inode, ..., root block}. Until the root block is written to
:the disk this is not a "committed" transaction and can be
:thrown away. In a Log-FS we always append on write; we never
:overwrite any data/metadata so this is easy and the FS state
:remains consistent. FFS overwrites blocks so all this gets
:far more complicated. Sort of like the difference between
:reasoning about functional programs & imperative programs!
:
:Now, it may be possible to define certain rules that allow
:one to combine transactions. For instance,
:
: write1(block n), write2(block n) == write2(block n)
: write(block n of file f1), delete file f1 == delete file f1
:
:etc. That is, as long as write1 & associated metadata writes
:are not flushed to the disk, and a later write (write2) comes
:along, the earlier write (write1) can be thrown away. [But I
:have no idea if this is worth doing or even doable!]
This is a somewhat different problem, one that is actually fairly
easy to solve in larger systems because operating systems tend to
want to cache everything. So really what is going on is that
your operations (until you fsync()) are being cached in system memory
and are not immediately committed to the underlying storage. Because
of that, overwrites and deletions can simply destroy the related
cache entities in system memory and never touch the disk.
Ultimately you have to flush something to disk, and that is where
the transactional atomicity and ordering issues start popping up.
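
To make that concrete, here is a rough sketch in C of a dirty-buffer
cache that absorbs overwrites and deletions in memory. None of this is
real kernel code; cache_write(), cache_remove_file() and the cbuf
structure are made up for illustration, but they show how a later write
or a delete can simply cancel earlier cached work before anything ever
reaches the disk.

#include <stdlib.h>
#include <string.h>

struct cbuf {
    long         blkno;          /* block number on the device    */
    long         fileid;         /* owning file, for delete scans */
    char         data[4096];     /* cached contents               */
    struct cbuf *next;
};

static struct cbuf *buf_cache;   /* list of dirty, uncommitted buffers */

/* An overwrite replaces the cached copy; write2 absorbs write1. */
void
cache_write(long fileid, long blkno, const char *data, size_t len)
{
    struct cbuf *b;

    for (b = buf_cache; b != NULL; b = b->next) {
        if (b->blkno == blkno) {
            memcpy(b->data, data, len);
            return;
        }
    }
    if ((b = malloc(sizeof(*b))) == NULL)
        return;
    b->blkno = blkno;
    b->fileid = fileid;
    memcpy(b->data, data, len);
    b->next = buf_cache;
    buf_cache = b;
}

/* A delete throws away any pending writes for the file entirely. */
void
cache_remove_file(long fileid)
{
    struct cbuf **bp = &buf_cache, *b;

    while ((b = *bp) != NULL) {
        if (b->fileid == fileid) {
            *bp = b->next;
            free(b);             /* write + delete == delete */
        } else {
            bp = &b->next;
        }
    }
}

Only what survives in buf_cache at fsync() time ever has to be flushed,
which is exactly where the atomicity and ordering questions begin.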
:This is reminiscent of the bottom up rewrite system (BURS)
:used in some code generators (such as lcc's). The idea is the
:same here: replace a sequence of operations with an
:equivalent but lower cost sequence.
What it comes down to is how expensive do you want your fsync()
to be? You can always commit everything down to the root block
and your recovery code can always throw everything away until it
finds a good root block, and avoid the whole issue, but if you do
things that way then fsync() becomes an extremely expensive call to
make. Certain applications, primarily database applications, really
depend on having an efficient fsync().
Brute force is always simpler, but not necessarily always desirable.
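
For illustration, a brute-force recovery pass might look roughly like
the sketch below. This is not HAMMER code; the root block layout,
NROOT_SLOTS, and the helper functions are all assumptions invented for
the example. The point is only that recovery scans the candidate root
locations, keeps the newest one that checksums correctly, and
implicitly discards everything written after it.

#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define NROOT_SLOTS  16           /* assumed fixed root block locations */
#define ROOT_BLKSIZE 4096

struct rootblk {
    uint64_t seqno;               /* monotonically increasing commit id */
    uint64_t crc;                 /* integrity check over the block     */
    /* ... pointers into the rest of the filesystem ... */
};

/* Placeholder integrity check; a real FS would use a strong CRC. */
static int
root_valid(const struct rootblk *rb)
{
    return (rb->crc == (rb->seqno ^ 0xC0DEC0DEULL));
}

static int
root_read(int fd, int slot, struct rootblk *rb)
{
    off_t off = (off_t)slot * ROOT_BLKSIZE;

    return (pread(fd, rb, sizeof(*rb), off) == sizeof(*rb) ? 0 : -1);
}

/* Keep the newest valid root; anything written after it never committed. */
int
recover_root(int fd, struct rootblk *out)
{
    struct rootblk rb;
    int i, found = 0;

    for (i = 0; i < NROOT_SLOTS; ++i) {
        if (root_read(fd, i, &rb) != 0)
            continue;
        if (!root_valid(&rb))     /* torn or partial write: skip it */
            continue;
        if (!found || rb.seqno > out->seqno) {
            *out = rb;
            found = 1;
        }
    }
    return (found ? 0 : -1);
}

Cheap to implement, but every fsync() that wants this guarantee has to
push the whole chain out to a new root block, which is why it gets
expensive.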
:...
:> is in the algorithm, but not the (fairly small) amount of time it takes
:> to actually perform the recovery operation.
:
:I don't understand the complexity. Basically your log should
:allow you to define a functional programming abstraction --
:where you never overwrite any data/metadata for any active
:transactions and so reasoning becomes easier. [But maybe we
:should take any hammer discussion offline]
The complexity is there because a filesystem is actually a multi-layer
entity. One has a storage topology which must be indexed in some manner,
but one also has the implementation on top of that storage topology
which has its own consistency requirements.
For example, UFS stores inodes in specific places and has bitmaps for
allocating data blocks and blockmaps to access data from its inodes.
But UFS also has to maintain the link count for a file, the st_size
field in the inode, the directory entry in the directory, and so forth.
Certain operations require multiple filesystem entities to be adjusted
as one atomic operation. For example removing a file requires the
link count in the inode to be decremented and the entry in the directory
to be removed.
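
As a toy illustration of that grouping requirement (none of these
structures or functions come from UFS; they are invented for the
example), a remove operation might log both low-level updates under a
single transaction id:

#include <stdint.h>
#include <stdio.h>

struct inode_rec { uint32_t inum; uint16_t nlink; };
struct txn       { uint64_t id; int nrec; };

static uint64_t next_txn_id = 1;

static struct txn
txn_begin(void)
{
    struct txn tx = { next_txn_id++, 0 };

    return (tx);
}

/* In a real log these would append records to stable storage. */
static void
txn_log_inode(struct txn *tx, const struct inode_rec *ip)
{
    printf("txn %ju: inode %u nlink -> %u\n",
        (uintmax_t)tx->id, (unsigned)ip->inum, (unsigned)ip->nlink);
    tx->nrec++;
}

static void
txn_log_dirent_delete(struct txn *tx, uint32_t dir_inum, const char *name)
{
    printf("txn %ju: remove \"%s\" from directory %u\n",
        (uintmax_t)tx->id, name, (unsigned)dir_inum);
    tx->nrec++;
}

int
main(void)
{
    struct inode_rec ip = { 1234, 1 };
    struct txn tx = txn_begin();

    ip.nlink--;                               /* update 1: link count      */
    txn_log_inode(&tx, &ip);
    txn_log_dirent_delete(&tx, 2, "foo");     /* update 2: directory entry */

    /* Both records carry tx.id, so recovery can apply both or neither. */
    return (0);
}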
Undo logs are very good at describing the low level entity, allowing you
to undo changes in time order, but undo logs need additional logic to
recognize groups of transactions which must be recovered or thrown away
as a single atomic entity, or which depend on each other.
One reason why it's a big issue is that portions of those transactions
can be committed to disk out of order. The recovery code has to
recognize that dependent pieces are not present even if other
micro-transactions have been fully committed.
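
On the recovery side, the grouping might be expressed something like
the fragment below. Again, this is not HAMMER's actual log format; the
record layout and the COMMIT convention are assumptions made for the
sake of the example.

#include <stdint.h>
#include <stddef.h>

enum rec_type { REC_UNDO, REC_COMMIT };

struct log_rec {
    uint64_t      txid;           /* which transaction group this is in */
    enum rec_type type;
    /* ... block offset, prior contents, etc. ... */
};

/*
 * A group may only be applied if its COMMIT record reached the disk.
 * If some of a group's undo records are present but the commit is not,
 * the whole group has to be rolled back, even though groups written
 * later may have committed fine.
 */
static int
group_committed(const struct log_rec *log, size_t nrec, uint64_t txid)
{
    size_t i;

    for (i = 0; i < nrec; ++i) {
        if (log[i].txid == txid && log[i].type == REC_COMMIT)
            return (1);
    }
    return (0);
}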
Taking UFS as an example: UFS's fsck can clean up link counts and
directory entries, but has no concept of lost file data so you can
wind up with an inode specifying a 100K file which after recovery
is actually full of zeros (or garbage) instead of the 100K of data
that was written to it. That is an example of a low level recovery
operation that is unable to take into account high level dependencies.
-Matt
Matthew Dillon
<dillon at backplane.com>