lixo.org

Logging: A UI Problem

Your logs are part of the UI. They are streams of interesting and actionable events that will be consumed by both machines and humans.

The most useful practice I’ve followed so far is to keep that in mind and act accordingly: understand the computer systems parsing, filtering and analyzing logs and talk to all the people who will be notified when something of interest happens. Watch what they do, and ask yourself “how could the output of my application be more helpful in this scenario?”

The parties interested in your application’s logs are usually in conflict: what’s interesting and actionable to developers and testers isn’t so important to production support engineers, and your SQL timing statements are probably seen as junk by the analytics tool looking for security issues.

To minimize that conflict, whatever logging framework you’re using should be able to direct those streams of events with pre-defined (and hopefully, easily configurable) filters, and each type of environment or user should be able to have its own configuration.
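As a rough sketch of what directing streams with filters can look like, here is one way to do it with Python's standard `logging` module. The post doesn't name a framework, so the choice of Python, the logger names (`app`, `app.sql`, `app.auth`) and the `PrefixFilter` helper are all assumptions made for illustration.

```python
import io
import logging


class PrefixFilter(logging.Filter):
    """Pass only records whose logger name starts with one of the prefixes."""

    def __init__(self, *prefixes):
        super().__init__()
        self.prefixes = prefixes  # tuple of logger-name prefixes

    def filter(self, record):
        return record.name.startswith(self.prefixes)


def make_stream_handler(stream, level, *prefixes):
    """Build a handler for one audience: a destination, a level, optional prefixes."""
    handler = logging.StreamHandler(stream)
    handler.setLevel(level)
    if prefixes:
        handler.addFilter(PrefixFilter(*prefixes))
    return handler


# In-memory streams stand in for files, sockets or an analytics tool.
developer_stream = io.StringIO()
security_stream = io.StringIO()

app_log = logging.getLogger("app")  # hypothetical top-level logger
app_log.setLevel(logging.DEBUG)
app_log.propagate = False
# Developers see everything; the security tool only sees auth events.
app_log.addHandler(make_stream_handler(developer_stream, logging.DEBUG))
app_log.addHandler(make_stream_handler(security_stream, logging.INFO, "app.auth"))

logging.getLogger("app.sql").debug("SELECT took 12 ms")
logging.getLogger("app.auth").warning("3 failed logins for user id 42")
```

The SQL timing line lands only in the developer stream, while the failed-login warning reaches both. Each audience gets its own handler, so changing what one of them sees doesn't touch the others.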

Here are a few examples to illustrate the point:

During development, it makes sense to have every debug statement relevant to the module being worked on going to the same stream, while telling the framework to take it easy with all other modules. Events from other modules may be interesting, but they should be filtered out if they’re not actionable, as you’re not going to do anything with them. Changing the filter so you can look at different modules should take no more than a few seconds of work (but may require bouncing a server or two).
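Switching focus between modules could be as small as the helper below. It is only a sketch using Python's standard `logging` module; the `focus_on` function, the `BASE` name and the module names are hypothetical.

```python
import logging

BASE = "app"      # hypothetical top-level logger name
_focused = None   # name of the logger currently in debug mode


def focus_on(module_name):
    """Enable debug output for one module and keep the rest at WARNING."""
    global _focused
    if _focused is not None:
        # Let the previously focused module inherit the quiet base level again.
        logging.getLogger(_focused).setLevel(logging.NOTSET)
    logging.getLogger(BASE).setLevel(logging.WARNING)
    logging.getLogger(module_name).setLevel(logging.DEBUG)
    _focused = module_name


billing = logging.getLogger("app.billing")
orders = logging.getLogger("app.orders")

focus_on("app.billing")  # billing is chatty, orders only reports warnings and up
```

Calling `focus_on("app.orders")` later flips the two around. In a long-running server the same switch would probably live in a configuration file, which is where the occasional bounce comes in.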

While running unit tests on a continuous integration set-up, it may make sense to disable verbose logging altogether: if your automated testing environment is sufficiently mature, at least one of the tests will break and you’ll be able to replay the failure on a development workstation to get at the details. In that kind of environment, not only do you want to be mindful of disk usage, but the events themselves are usually not very actionable anyway.
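One way to give each environment its own configuration is to build it from the environment's name. The sketch below uses `logging.config.dictConfig` from Python's standard library; the environment names and the levels chosen for them are assumptions, and production is left out on purpose.

```python
import logging
import logging.config

# Hypothetical environments and levels; adjust to your own set-up.
LEVELS = {"development": "DEBUG", "ci": "ERROR"}


def config_for(environment):
    """Return a dictConfig dictionary for the named environment."""
    return {
        "version": 1,
        "disable_existing_loggers": False,
        "handlers": {"console": {"class": "logging.StreamHandler"}},
        "loggers": {
            "app": {  # hypothetical top-level logger name
                "level": LEVELS[environment],
                "handlers": ["console"],
                "propagate": False,
            }
        },
    }


# On the build server, only errors get through.
logging.config.dictConfig(config_for("ci"))
log = logging.getLogger("app.orders")
log.info("Order 17 accepted")          # dropped on CI
log.error("Could not reach database")  # still written to the console
```

Replaying a failure on a workstation then only means calling `config_for("development")` instead.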

In production, leave that configuration to people experienced with support: talk to the engineers who will get paged at 3am and rushed into a cab if a particular type of error happens, and get their input. They will tell you exactly what kinds of errors they’re interested in for your application in particular. Remember this is probably specific to the domain you’re working in, and that support engineers usually take care of more than one application, and more than one server.

A very common mistake I see (looking at you, JBoss!) is to treat as important those errors that developers should see (a NullPointerException, for example) but that production support people can’t do a thing about. Don’t wake them up unless there is something they can do to fix the problem, or you risk crying wolf too many times and having them filter out important, actionable notifications, like OutOfMemory errors, low disk space, etc.
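To make the "only wake them up for actionable events" rule concrete, here is a minimal sketch in Python's standard `logging` module. The `actionable` flag, the `ActionableFilter` and the `ListHandler` standing in for a real paging system are all hypothetical.

```python
import logging


class ActionableFilter(logging.Filter):
    """Pass only records explicitly marked as actionable by support."""

    def filter(self, record):
        return getattr(record, "actionable", False)


class ListHandler(logging.Handler):
    """Stand-in for a pager: collects messages instead of waking anyone up."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


pager = ListHandler()
pager.setLevel(logging.ERROR)
pager.addFilter(ActionableFilter())

log = logging.getLogger("app.pager_demo")  # hypothetical logger name
log.setLevel(logging.DEBUG)
log.propagate = False
log.addHandler(pager)

# A bug for developers to fix later: an error, but nothing support can do.
log.error("Unexpected null value while rendering invoice")
# Something support can act on right now.
log.error("Disk space low on /var", extra={"actionable": True})
```

Only the disk space message reaches the pager. The null value error would still go to whatever handler the developers read, so nothing is lost; it just doesn't wake anyone up.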