Logging in production environments

Hi everyone,

I recently read an article in a German IT magazine (iX 02/2010, p. 116, "Fit für den Betrieb" by Michael K. Klemke) about making applications ready for production (the German title would translate to "Ready for production"). In the last part of his article, Michael writes about logging in an application. I would like to restate some of his points here and add a few of my own.

Here are short summaries of his main points:
  • no debugging in production, so learn your logs to find errors
  • changeable loglevel
  • unique IDs which are valid in every layer of your application, to track transactions across different systems
  • dev has to write monitoring plugins
  • log entry and exit of each method
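The point about unique IDs is the least obvious one, so here is a minimal sketch of how it could look in Python. The article does not prescribe an implementation; the logger name "shop", the variable `transaction_id` and the functions `handle_request` and `load_order` are all made up for illustration. The idea is that the ID is set once at the edge of the system and then appears in every log line written by any layer below it.

```python
import contextvars
import logging
import uuid

# Hypothetical context variable holding the ID of the current transaction.
transaction_id = contextvars.ContextVar("transaction_id", default="-")


class TransactionIdFilter(logging.Filter):
    """Copy the current transaction ID onto every log record."""

    def filter(self, record):
        record.transaction_id = transaction_id.get()
        return True


def build_logger(stream):
    """Create the (hypothetical) 'shop' logger writing to the given stream."""
    logger = logging.getLogger("shop")
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s txn=%(transaction_id)s %(message)s")
    )
    handler.addFilter(TransactionIdFilter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


def load_order(logger):
    # A lower layer: it never sees the ID, yet its log line carries it.
    logger.info("loading order from database")


def handle_request(logger, incoming_id=None):
    # Reuse the caller's ID if another system passed one in, otherwise create one.
    token = transaction_id.set(incoming_id or uuid.uuid4().hex)
    try:
        logger.info("request received")
        load_order(logger)
    finally:
        transaction_id.reset(token)
```

To follow a transaction across systems, the same ID would also have to be passed along with every outgoing call (for example in a request header), which this sketch leaves out.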
And now some additions to his statements:
First the easy ones. Changing the loglevel should be possible without a restart. It would also be nice to have a frontend which can be used by the helpdesk. This way a supporter can raise the loglevel and then ask the user to repeat the steps which led to the error. The supporter can then attach the log output to a ticket and forward the ticket to the developers.
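As a rough sketch of what "without a restart" could mean in Python: the standard `logging` module lets you change a logger's level while the process is running. The function name `set_log_level` and the logger name used below are my own invention; a helpdesk frontend would call something like this through whatever admin interface the application offers.

```python
import logging

ALLOWED_LEVELS = {"DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"}


def set_log_level(logger_name, level_name):
    """Change a logger's level at runtime and return the previous level name."""
    level_name = level_name.upper()
    if level_name not in ALLOWED_LEVELS:
        raise ValueError(f"unknown log level: {level_name}")
    logger = logging.getLogger(logger_name)
    previous = logging.getLevelName(logger.getEffectiveLevel())
    logger.setLevel(level_name)
    return previous
```

Returning the previous level makes it easy for the supporter to switch back once the user has reproduced the error, e.g. `previous = set_log_level("shop.orders", "debug")` and later `set_log_level("shop.orders", previous)`.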

The entry and exit of each function/method/... should be logged at debug or trace level, so they are only written when the loglevel has been changed because of an error. That way the logs don't fill up with unnecessary garbage all the time. Which leads me to the first addition to Michael's points. The logs should be rotated and deleted after e.g. 30 days (as long as you don't have to fulfil any legal requirements). Rotate the logs daily and whenever they get larger than x MB. This way you avoid searching through multiple GB of logs when you look for an error. Even with tools like grep, awk, sed, ... it is very hard to find an error when you do not know what you are searching for. Another argument for log rotation is the price of storage. Today you can buy 1 TB of USB or desktop storage for under 100€, but when you save your logs on SAN storage we are quickly talking about a thousand euros or more, depending on your setup and hardware vendor.
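Both ideas can be sketched with Python's standard library. This is only an illustration under my own assumptions: `log_calls` and `daily_rotating_handler` are hypothetical helper names, and the handler covers only the daily rotation with 30 files kept. For the size limit the standard library offers a separate `RotatingFileHandler` with a `maxBytes` parameter; combining both rules would need a custom handler or an external tool such as logrotate.

```python
import functools
import logging
from logging.handlers import TimedRotatingFileHandler


def log_calls(logger):
    """Decorator: log entry and exit of a function at DEBUG level."""

    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            logger.debug("enter %s", func.__name__)
            try:
                return func(*args, **kwargs)
            finally:
                # Runs on normal return and when an exception is raised.
                logger.debug("exit %s", func.__name__)

        return wrapper

    return decorate


def daily_rotating_handler(path, keep_days=30):
    """Rotate the file at midnight and keep `keep_days` old files."""
    return TimedRotatingFileHandler(
        path, when="midnight", backupCount=keep_days, delay=True
    )
```

As long as the logger stays at INFO or above, the decorated functions produce no extra output; the entry and exit lines only show up after the level has been lowered to DEBUG.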

Another method which might ease support is to log to different logfiles. You can log system related messages (like "No access to DB") to a different log than application related messages (like "You are not allowed to do this"). This way you can handle them differently for alerting, reaction time, ... . Which leads us to the next point. Michael talks about the implementation of monitoring checks by the development department. I do not fully agree with this. The developers should help ops to set up monitoring for the application, but should concentrate only on implementing checks for the business processes that are to be monitored. Ops should then integrate these checks into the monitoring system. This way dev doesn't have to know which monitoring system is in use, and when the monitoring system changes, no application has to be adapted to the new system.
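The split into different logfiles could look roughly like this in Python. The logger names "system" and "application" and the function `configure_split_logging` are assumptions of mine, not something from the article. Because Python loggers are hierarchical, a logger named "system.db" automatically ends up in the handler attached to "system".

```python
import logging


def configure_split_logging(system_handler, application_handler):
    """Route 'system.*' and 'application.*' loggers to separate handlers."""
    targets = (
        ("system", system_handler),
        ("application", application_handler),
    )
    for name, handler in targets:
        logger = logging.getLogger(name)
        logger.handlers.clear()
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
        # Do not pass records on to the root logger, so the two
        # streams stay strictly separate.
        logger.propagate = False
```

In a real setup you would probably pass two file handlers, e.g. `configure_split_logging(logging.FileHandler("system.log"), logging.FileHandler("application.log"))`, and let ops attach different alerting rules to each file.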

Dev and ops also have to work together when monitoring logfiles. Ops has to know what to look for and whom to notify. Another requirement is that the messages are parseable. Most of the time this means that the message has to be printed on one line. These one-liners should also be as precise as possible. It does not make sense to print just the first line of a Java stack trace. It is much more informative to print something like "Error while connecting to server xyz.example.com at port 1234!" or "User xyz can't login!".
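To make the "one line, parseable" requirement a bit more concrete, here is a small Python sketch. The line layout, the class `OneLineFormatter` and the function `parse_line` are my own choices and only one of many possible conventions. The formatter squeezes a message plus any attached traceback onto a single line, and the regular expression shows how a monitoring check on the ops side could split that line into fields again.

```python
import logging
import re

# Assumed layout: "<date> <time> <LEVEL> <logger> <message>"
LINE_FORMAT = "%(asctime)s %(levelname)s %(name)s %(message)s"
LINE_PATTERN = re.compile(
    r"^(?P<date>\S+) (?P<time>\S+) (?P<level>[A-Z]+) "
    r"(?P<logger>\S+) (?P<message>.*)$"
)


class OneLineFormatter(logging.Formatter):
    """Format a record, including any traceback, as a single line."""

    def format(self, record):
        text = super().format(record)
        # Join multi-line output (e.g. a traceback) with a visible separator.
        return " | ".join(part.strip() for part in text.splitlines())


def parse_line(line):
    """Split a log line into its fields; return None if it does not match."""
    match = LINE_PATTERN.match(line)
    return match.groupdict() if match else None
```

The precise message still has to come from the developer, for example `logger.error("Error while connecting to server %s at port %d!", host, port)`; the formatter only guarantees that whatever is logged stays on one line.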

I hope these ideas will help some people to get better logging in their production environment. If you have other ideas or think Michael or I are wrong about something, please leave a comment.
