Updated gjournal patches [20061024].
Ivan Voras
ivoras at fer.hr
Fri Oct 27 14:46:15 UTC 2006
Fluffles wrote:
> Please look at the screenshot I made of the panic message:
> http://dev.fluffles.net/images/gjournal-panic1.png
Hmm, a quick grep of the kernel sources for "Enough\." and of the
gjournal and graid5 sources for other strings from the panic screenshot
doesn't locate a likely point of failure. You'll probably need to at
least compile DDB & KDB into the kernel so that when the panic happens
you can create a backtrace (with the "bt" command) and post that
information.
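If it helps, the relevant lines in the kernel config file look roughly
like this on a 6.x kernel (the exact set of options may differ on your
version):

  options KDB          # kernel debugger framework
  options DDB          # interactive debugger backend
  options KDB_TRACE    # print a stack trace automatically on panic

With those in place, a panic should drop you to the "db>" prompt, where
"bt" prints the backtrace.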
> Also I have a question about its performance. You mentioned earlier
> that writing big files takes about twice as long with gjournal; I wonder
> if this is inherent to journaling itself or due to the current
> implementation. Windows' journaling NTFS, for example, isn't slower than
> FAT32 with big files, if I remember correctly. What major differences in
> the journaling process cause this?
Maybe MS is doing its tricks again? The way any journaling works is
this: data is not written where it's ultimately supposed to go (which is
all over the disk, because files and metadata are scattered across it),
but into a special on-disk area designated "the journal", which is
written sequentially. After some time (e.g. when the I/O load decreases)
the data is read from the journal and written to where it belongs. Thus
burst writes to the file system are very fast, and the slow operation
(relocating the data to where it belongs) is performed when the system
is under less load.
This journal area is finite in size, so when it gets full, no more
writes can happen until at least part of it is "freed" by relocating its
data to where it belongs, which is an operation that requires sequential
reading from the journal area and scattered writing to the on-disk data
area.
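Just to illustrate the mechanism (this is only a toy model, not
gjournal's actual code): writes are appended to the sequential journal
together with a note of where they finally belong, and a flush pass
later replays them to their scattered destinations.

/*
 * Toy model of the mechanism described above -- NOT gjournal's code,
 * just an illustration: writes land sequentially in the journal, a
 * flush pass later does the scattered relocation.
 */
#include <stdio.h>
#include <string.h>

#define BLKSZ    512                      /* size of one data block */
#define DISKBLKS 64                       /* "disk" size in blocks */
#define JBLKS    8                        /* journal capacity in records */

struct jrec {                             /* one journal record */
    int  dst;                             /* final block number on disk */
    char data[BLKSZ];                     /* the data itself */
};

static char        disk[DISKBLKS][BLKSZ]; /* the scattered data area */
static struct jrec journal[JBLKS];        /* the sequential journal area */
static int         jfill;                 /* records currently in use */

/* Replay the journal: sequential read, scattered writes, then empty it. */
static void
journal_flush(void)
{
    int i;

    for (i = 0; i < jfill; i++)
        memcpy(disk[journal[i].dst], journal[i].data, BLKSZ);
    jfill = 0;
}

/* A "file system" write: always appended at the journal tail. */
static void
fs_write(int dst, const char *data)
{
    if (jfill == JBLKS)          /* journal full: new writes stall   */
        journal_flush();         /* until part of it is relocated    */
    journal[jfill].dst = dst;
    memcpy(journal[jfill].data, data, BLKSZ);
    jfill++;
}

int
main(void)
{
    char blk[BLKSZ];
    int i;

    /* 20 writes to scattered destinations; only every 8th write pays
     * for the slow relocation pass. */
    for (i = 0; i < 20; i++) {
        memset(blk, 'a' + i, BLKSZ);
        fs_write((i * 7) % DISKBLKS, blk);
    }
    journal_flush();
    printf("disk block 7 now starts with '%c'\n", disk[7][0]);
    return (0);
}

The only point of the model is the shape of the I/O: fs_write() is
always sequential and cheap, journal_flush() is the scattered, expensive
part that has to run once the journal fills up.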
When large files are written to the file system "in bulk", the FS is
already smart enough to store them as sequentially as possible, but
there are two problems with this:
- the FS can't reliably detect whether the file that's about to be
written will be sequential or not, and neither can the journal driver
(so the FS simply does its best for every file, in the hope that if the
file grows large enough it won't get fragmented)
- all FS operations go through gjournal, so the sequence of operations
becomes: 1. the data is written to the journal, 2. the journal gets
full, so the data is read back from the journal and written to where it
belongs, and while that is going on new writes to the journal are at
best very slow. Hence the slowdown.
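To put very rough numbers on the second point (back-of-the-envelope
only, not a measurement of gjournal): with a disk that sustains, say,
60 MB/s, every block of a large file has to be written twice (once into
the journal, once to its final place) and read back once in between, so
the sustained rate an application sees for bulk writes is at best around
60/2 = 30 MB/s, and more realistically nearer 60/3 = 20 MB/s. That is
consistent with the "about twice as long" figure mentioned earlier.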
> Also, in your earlier post you explained the advantages of a journal
> with regard to significantly reduced fsck times at boot. But my major
> concern is data loss: on my test server I've had many kernel
> panics/freezes due to the experimental graid5 module being tested by
> Arne. This has resulted in the system not being able to boot because the
> ad0s2a (read: a!) partition has lost files. And it wouldn't be the first
> time a lockup or power failure caused data loss on my systems. That's
> why I want to use gjournal: to protect against data loss. Am I correct
> in my assumption that gjournal addresses my needs in this regard?
To guarantee (meta)data safety on the file system, the FS code must be
sure that the data it has placed on the disk will stay there. Soft
updates work by writing data in order and in batches, and the code
assumes each such "batch" arrives safely on the hardware. SU's
performance comes from delaying writes so the same data doesn't have to
be rewritten multiple times (consider deleting a huge number of files
from a directory: the same directory entry SHOULD be updated after each
delete, but SU delays it so that only the final version of the
directory, with the files removed, is written). Journaling, on the other
hand, lets each such write proceed to the disk, BUT instead of seeking
every time to wherever the data ultimately belongs, it writes everything
sequentially to the journal, which is much faster (60+ MB/s with today's
hardware). Smart journal engines (I don't know if gjournal has this
feature) will relocate only the last modified version of a data entry
(e.g. the last "state" of the directory entry from the SU example) to
its place.
Because all the intermediate data from the file system is placed in the
journal, a power drop in the middle of updating a directory entry will
result either in the directory entry being safely written to the journal
(from where it can be recovered), or in the change being completely lost
(in which case the old, un-updated directory entry is still valid). In
all of this, gjournal should do what you need.
The biggest problem today is not the software but the hardware. Most
disk drives (especially desktop-class ones) lie about having safely
written the data while it's still sitting in their buffers. This is why
the modern approach to building critical data storage is to force the
drives not to cache anything, and to employ a hardware RAID controller
with huge buffers and a battery that keeps those buffers "alive" when
the power goes down.
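For what it's worth, on FreeBSD the write cache of plain ATA disks can
be turned off with a loader tunable (at an obvious cost in write
performance); if I remember the knob correctly it's:

  # /boot/loader.conf
  hw.ata.wc="0"

SCSI drives have a similar setting in their caching mode page that can
be changed with camcontrol(8).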