Re: git: 2f036705f337 - main - Document the two recent newsyslog(8) change (-c option and <compress> configuration option).

From: Olivier Certner <olce_at_freebsd.org>
Date: Wed, 10 Jan 2024 14:33:45 UTC

Hi Xin,

Thanks for responding.

There were several ideas in my mail, some of them contradictory, and at times not grouped properly.  I hope it was still intelligible enough.

I was mostly concerned about the (future) change of default value, and still am.  But I'm also surprised by the premises some of your choices (including the default value) are based on.  To me, they look generally weak, and some do not even seem to make sense.  This is (also) what I would like to discuss.  I'm probably far from having the most stringent or intensive use of log files in this community, and I'm not an expert in SSD wear-leveling either.  So maybe it's just me, but then I'd ask for the minimum education to understand your reasoning and learn from it.

> I am open to removing '-c'.

An alternative I developed later in my initial mail (it was not apparent at the point you responded to) is to have '-c' (on the command-line) override <compress> (in some configuration file), and I think this is what you've done (and what you replied to Mike).  I'm fine with it since I have the feeling it's the general rule for most utilities where it's possible to request the same behavior on the command-line and from configuration files (in other words, it respects POLA).  My main concern here is that, if you keep '-c', you document it, as well as its relation to <compress>.  I'm saying this because, in some other message, you raised the possibility of deliberately not documenting it, which I think can't be justified here.
 
> Could you please clarify what you mean by "make it enable compression" --
> did you mean that we mark all log files to be compressible?  (It's probably
> not a good idea as some "log" files may be binary and not really
> compressible).

Yes, I meant exactly that.  In this alternative, you simply ignore compression letters but also their absence, and compress everything the same.  I understand your point about binary files, but I would be surprised if logs, even those in a binary format, weren't significantly compressible (albeit less than text) in most cases, and even if they aren't, it would only be a very minor annoyance (files are not going to get longer; for other (non-)annoyances, see below).  Moreover, all log files in base are text files, and that is also the case for all ports/applications I use, so I find it strange not to cater to what is probably the vast majority of use cases (or do you disagree with that?).

Doing so would also have the benefit that application writers just don't have to bother wondering whether their logs should be compressed or not.  What would that decision be based on?  Basing it on format (text or binary) is most probably flawed, as I've just said above.  I don't think it can be based on content either, which I suspect will always be compressible for log files (there will be redundancy, like timestamps, identifiers, etc.).  And I see this more as an administrative decision (e.g., do I have plenty of disk space or not?), which is independent.  So shifting that decision to the administrator once and for all makes sense.  If you don't like this way to make it happen, I'm suggesting another one next.

> Changing the meaning of all four legacy compression type letters to "file
> is compressible" is part of the intention.  The goal is to discourage using
> them as a way to specify a compression type, in favor of using the
> administrator configured value.

As I've just explained, I see a lot of value in having an administrator decide on a global behavior.  I will most likely use this functionality.

I had been hesitating between preserving the current meaning of the compression letters, for POLA in general, and having the configuration directive override them.  That's why I mentioned an alternative where the override would have to be explicit, through an additional, different directive.  This idea could be reused like this: Have '<compress>' affect only files without compression letters, and have '<compress_override>' affect only those with them, and perhaps also have the specified value of one of them used as the default for the other (e.g., if '<compress_override>' is set, it also affects by default files without compression letters).  I'm mentioning this for completeness in case it fulfills the needs of others.  I probably won't use this refinement personally.  And, concerning POLA, there are different levels of it.  Forgetting for a moment about the change in default value, being able to override compression letters with a directive in the configuration file is a bit surprising, but after more pondering I no longer consider it terribly annoying, provided it is sufficiently publicized.
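
To make the proposed precedence concrete, here is a minimal sketch in Python.  It only illustrates the rule described above and is not newsyslog(8) code: the function name and its parameters are hypothetical, '<compress_override>' does not exist as a directive, and the behavior when neither directive is set (a letter keeps its legacy meaning, no letter means no compression) is my assumption.

```python
def effective_method(letter_method, compress=None, compress_override=None):
    """Pick the compression method for one log entry (hypothetical sketch).

    letter_method:     method named by the entry's compression letter
                       (e.g. "bzip2"), or None if the entry has no letter.
    compress:          value of <compress>, or None if unset.
    compress_override: value of the proposed <compress_override>, or None.
    """
    if letter_method is None:
        # No letter: <compress> applies; the other directive is the fallback.
        chosen = compress if compress is not None else compress_override
        # Assumption: with no directive at all, such files stay uncompressed.
        return chosen if chosen is not None else "none"
    # Letter present: <compress_override> applies; <compress> is the fallback.
    chosen = compress_override if compress_override is not None else compress
    # Assumption: with no directive at all, the letter keeps its legacy meaning.
    return chosen if chosen is not None else letter_method
```

For instance, effective_method("bzip2", compress="zstd", compress_override="xz") yields "xz", while effective_method(None, compress="zstd", compress_override="xz") yields "zstd".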

> That's said, 'none' is a reasonable default in many ways as explained
> before (it makes grep'ing easier, compression is not really that helpful in
> the modern world because hard drives are larger than the 90's and it
> reduces the times data gets rewritten to SSDs and avoids hourly CPU load
> bursts for busy systems).

This is where my main disagreement is currently.  Most arguments have been addressed in my previous mails, so for each I'll do a small wrap-up and add a few new thoughts.

"it makes grep'ing easier": Our zgrep(1) works on any compressed file, and even on uncompressed ones, so is a drop-in replacement for grep(1).  I fail to see anything hard about using it.  Scripts already using grep(1) don't even need to be modified, thanks to some PATH or symlink tweaking.  We could even go so far as having grep(1) itself behave like zgrep(1), which could be a great usability win for newcomers as well.
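
As an illustration of why searching compressed and uncompressed logs alike is not hard, here is a minimal sketch in Python.  It is not how zgrep(1) is implemented; the function names are hypothetical, it relies only on the standard gzip, bz2 and lzma modules, and it leaves out zstd, which those modules do not handle.

```python
import bz2
import gzip
import lzma
import re

# Magic bytes identifying each compressed format handled by this sketch.
_OPENERS = (
    (b"\x1f\x8b", gzip.open),      # gzip
    (b"BZh", bz2.open),            # bzip2
    (b"\xfd7zXZ\x00", lzma.open),  # xz
)


def open_maybe_compressed(path):
    """Open `path` as text, decompressing on the fly if it looks compressed."""
    with open(path, "rb") as f:
        head = f.read(6)
    for magic, opener in _OPENERS:
        if head.startswith(magic):
            return opener(path, "rt", errors="replace")
    # No known magic: treat the file as plain text.
    return open(path, "rt", errors="replace")


def grep_any(pattern, paths):
    """Yield (path, line) for each line matching `pattern` in any of `paths`."""
    regex = re.compile(pattern)
    for path in paths:
        with open_maybe_compressed(path) as f:
            for line in f:
                if regex.search(line):
                    yield path, line.rstrip("\n")
```

Something like grep_any("error", ["messages", "messages.0.bz2"]) would then search a live log and a rotated one the same way (the file names here are only examples).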

"compression is not really that helpful in the modern world because hard drives are larger than the 90's": I certainly don't think so.  I manipulate GBs of (text) log files.  On build logs, I typically see ratios of 1/10, which is huge.  The space I'm saving is not only used to save more logs, but also for unrelated purposes, and prevents me from having to buy or dedicate more hard disks to this use.  And I'm not even talking about embedded systems, which are much more constrained, or virtual machines.

"it reduces the times data gets rewritten to SSDs": Surely, but does it matter? I don't think so.  A single rewrite of log data in most use cases shouldn't have any visible effect on wear-leveling, except for SSDs where this is the only and continuous job, but then you can have your equivalent to 'syslog' compressing on the fly, or can use ZFS with compression.  If you're really reaching the disk I/O limits on your machine and can't afford the extra bandwidth for reading and compressing, shouldn't you be sending the logs via network to another machine doing exactly that processing?  And is this a use case common enough to warrant making non-compression the new newsyslog(8)'s default?  I don't think so.

"avoids hourly CPU load bursts for busy systems": That can, and should, be solved by configuration.  You're free to choose a higher frequency, to avoid busy hours if there are less loaded ones, and to rotate logs on a smaller size limit, all of which will mitigate the problem to the point of almost non-existence.  And if the "almost" is still significant to your workload, then see the previous point.  Again, is this common or important enough?  For now, I doubt it.  And there is an advantage of having application-controlled compression: At least you can control exactly when the bursts occur, which you can't with ZFS (which has to compress blocks also).

> 'bzip2' could be a good second best default (because for most
> configurations it's how the log files are compressed with today's
> defaults), but if the administrator has already configured their systems to
> use a different method, this would break their configuration anyways.

Yes for 'bzip2' as a good default, for POLA.  If the administrator configured their system, then the best default would be 'legacy'.  That's why I was hesitating about always keeping the original meaning of the compression letters.

> There are other benefits of not compressing rotated logs.  For busy
> systems, the hourly newsyslog run would process larger logs and cause CPU
> workload bursts.
> 
> And when logs are compressed, the data is read back and compressed data is
> rewritten to disk / SSDs, causing additional wear of the flash storage, and
> all that comes with no significant benefit for modern hardware.
> 
> (I don't think it's common to have log files indexed after rotation; a more
> common use case would be to use [u]grep to look up for a certain pattern).

I think I've already addressed most of these points in the previous mail and above.

I've read and, I think, understood your points.  So please save us time and refrain from repeating them.  This is not going to change my current opinion that they all are weak at best.

On the other hand, please, after a careful reading of my objections, respond with comments, critiques or rebuttals as you see fit.  I may learn things in the process, and you might as well too.
 
> Yes, and that's not a big concern.  Achieving the maximum compression ratio
> is probably never the goal for most scenarios (not limited to logs, but
> also other places) where compression is used, and one always has to balance
> between the cost and benefit.

We are talking about logs, or at least use cases for newsyslog(8).  A frequent use case for it (it's certainly the primary one for me) is long-term storage of old logs that are infrequently read/processed.  Achieving a high compression ratio is important here, to save the space used in absolute terms *and* with respect to the expected (in a statistical sense) utility of these (i.e., low).

> If the person is distributing a release image to many thousands of users
> over the Internet, it would make a lot of sense to try the best compression
> for an 5% reduction of size because that adds up to the bandwidth cost and
> optimizes the experience for users, but it doesn't make as much sense to
> save, let's say a few MBs of disk space at the expense of spending a few
> more minutes every hour, the added "bursts" of slower response time for a
> server, and that's usually undesirable for production.

Really, I don't see where these figures can come from.  Here is a very quick example on a typical (for me) build log file of about 70MB:

  Method            | Size reduction      | Elapsed time (s)
  ------------------+---------------------+-----------------
  gzip (default)    | 95.3% (factor 21.2) | 0.426
  xz (default)      | 96.9% (factor 32.6) | 5.619
  zstd (default)    | 95.6% (factor 22.5) | 0.088

I could run many more such measurements to make the case in a statistically more serious manner.  But already, I think you can agree that "a few MBs of disk space at the expense of spending a few more minutes" is way, way off, even if you're still using xz(1).
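
For anyone wanting to repeat this kind of measurement on their own logs, here is a minimal sketch in Python.  It is not the procedure I used for the figures above: the function names are hypothetical, it uses the standard library's gzip, bz2 and lzma modules rather than the command-line tools (so timings will differ), and it leaves out zstd, which those modules do not provide.

```python
import bz2
import gzip
import lzma
import time

# Levels chosen to mirror the command-line defaults (gzip -6, bzip2 -9, xz -6).
DEFAULT_COMPRESSORS = {
    "gzip": lambda data: gzip.compress(data, compresslevel=6),
    "bzip2": lambda data: bz2.compress(data, compresslevel=9),
    "xz": lambda data: lzma.compress(data, preset=6),
}


def measure(data, compressors=None):
    """Return {name: (reduction_percent, factor, seconds)} for each compressor."""
    if compressors is None:
        compressors = DEFAULT_COMPRESSORS
    results = {}
    for name, compress in compressors.items():
        start = time.perf_counter()
        packed = compress(data)
        elapsed = time.perf_counter() - start
        factor = len(data) / len(packed)              # e.g. 21.2 means "/ 21.2"
        reduction = 100.0 * (1.0 - len(packed) / len(data))
        results[name] = (reduction, factor, elapsed)
    return results


def report(path):
    """Print one line per method for the log file at `path`."""
    with open(path, "rb") as f:
        data = f.read()
    for name, (reduction, factor, elapsed) in measure(data).items():
        print(f"{name:8} | {reduction:5.1f}% (factor {factor:.1f}) | {elapsed:.3f} s")
```

Calling report() on a rotated log gives, for each method, the size reduction, the equivalent division factor and the time spent compressing in memory.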

Thanks and regards.

-- 
Olivier Certner