Logging in production environments

Hi everyone,

Recently I read an article in a German IT magazine (iX 02/2010, p. 116, "Fit für den Betrieb" by Michael K. Klemke) about making applications ready for production (the German title translates roughly to "Ready for production"). In the last part of his article Michael writes about logging in an application. I would like to summarise some of his points here and add a few of my own.

Here are short summaries of his main points:
  • no debugging in production, so learn your logs to find errors
  • changeable loglevel
  • unique IDs which are valid in every layer of your application, to track transactions across different systems
  • dev has to write monitoring plugins
  • log entry and exit of each method
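
To make the point about unique IDs a bit more concrete, here is a minimal sketch in Python. It is my own illustration, not something from the article. The names transaction_id, TransactionIdFilter, handle_request and store_order are made up for the example, and I assume the ID arrives from the calling system (or is created at the entry point if none was sent).

```python
import contextvars
import logging
import uuid

# Hypothetical name: holds the ID of the transaction currently being processed.
transaction_id = contextvars.ContextVar("transaction_id", default="-")


class TransactionIdFilter(logging.Filter):
    """Copy the current transaction ID onto every log record."""

    def filter(self, record):
        record.transaction_id = transaction_id.get()
        return True


def build_logger(name, stream=None):
    """Create a logger whose lines all carry the transaction ID."""
    handler = logging.StreamHandler(stream)  # stream=None means stderr
    handler.addFilter(TransactionIdFilter())
    handler.setFormatter(logging.Formatter(
        "%(asctime)s %(levelname)s [%(transaction_id)s] %(name)s: %(message)s"
    ))
    logger = logging.getLogger(name)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def store_order(logger):
    # A deeper layer: no ID is passed in, yet the log line carries it.
    logger.info("order stored")


def handle_request(logger, incoming_id=None):
    # Reuse the ID sent by the calling system, or create one at the entry point.
    token = transaction_id.set(incoming_id or uuid.uuid4().hex)
    try:
        logger.info("order received")
        store_order(logger)
        return transaction_id.get()
    finally:
        # Restore the previous value so IDs never leak into the next request.
        transaction_id.reset(token)
```

With this in place, searching all logfiles for one ID shows the whole path of a transaction, provided every system passes the ID on to the next one.
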
And now some additions to his statements:
First the easy ones. Changing the loglevel should be possible without a restart. It would also be nice to have a frontend which can be used by the helpdesk. This way a supporter can raise the loglevel and then ask the user to repeat the steps which led to the error. The supporter can then attach the log output to a ticket and forward the ticket to the developers.
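
A small sketch of what "without a restart" can look like, using Python's standard logging module. The function name set_log_level and the logger name in the usage note are assumptions for the example; how the function gets called (admin page, helpdesk frontend, ...) is left open.

```python
import logging

ALLOWED_LEVELS = {"DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"}


def set_log_level(logger_name, level_name):
    """Change a logger's level in the running process.

    Returns the name of the previously effective level so the caller
    can restore it once the error has been reproduced.
    """
    level_name = level_name.upper()
    if level_name not in ALLOWED_LEVELS:
        raise ValueError(f"unknown log level: {level_name}")
    logger = logging.getLogger(logger_name)
    previous = logging.getLevelName(logger.getEffectiveLevel())
    logger.setLevel(level_name)
    return previous
```

A helpdesk frontend could call set_log_level("shop", "debug") before the user repeats the steps and afterwards pass the returned value back in to restore the old level.
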

The entry and exit of each function/method/... should be logged at debug or trace level, so they are only written when the loglevel has been raised because of an error. This way the logs don't fill up with unnecessary garbage all the time. Which leads me to my first addition to Michael's points. The logs should be rotated and deleted after e.g. 30 days (as long as you don't have to fulfil any legal requirements). Rotate the logs daily and whenever they get larger than x MB. This way you avoid searching through multiple GB of logs when you look for an error. Even with tools like grep, awk, sed, ... it is very hard to find an error when you do not know what you are searching for. Another argument for log rotation is the price of storage. Today you can buy 1 TB of USB or desktop storage for under 100€, but when you keep your logs on SAN storage the same capacity quickly costs a thousand euros or more, depending on your setup and hardware vendor.
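
Both ideas from this paragraph can be sketched with Python's standard library. This is only an illustration under my own assumptions: log_calls and daily_rotating_handler are names I made up, and 30 kept files stands in for the 30 days mentioned above.

```python
import functools
import logging
import logging.handlers


def log_calls(logger):
    """Log entry and exit of the decorated function at DEBUG level."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            logger.debug("enter %s", func.__qualname__)
            try:
                return func(*args, **kwargs)
            finally:
                # Runs on normal return and when an exception is raised.
                logger.debug("exit %s", func.__qualname__)
        return wrapper
    return decorator


def daily_rotating_handler(path, keep_files=30):
    """Start a new file at midnight and keep at most keep_files old files."""
    return logging.handlers.TimedRotatingFileHandler(
        path, when="midnight", backupCount=keep_files, encoding="utf-8"
    )
```

As long as the logger stays at INFO, the enter/exit lines are never written. Note that TimedRotatingFileHandler rotates by time only. Rotating by size is what logging.handlers.RotatingFileHandler (maxBytes) does, so "daily and larger than x MB" together needs either a custom handler or an external tool such as logrotate.
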

Another method which might ease support is to log to different logfiles. You can write system related messages (like "No access to DB") to a different log than application related messages (like "You are not allowed to do this"). This way you can handle them differently for alerting, reaction time and so on. Which leads us to the next point. Michael talks about the implementation of monitoring checks by the development department. I do not fully agree with this. The developers should help ops to set up monitoring for the application, but should concentrate only on implementing checks for the business processes that need to be monitored. Ops should then integrate these checks into the monitoring system. This way dev doesn't have to know which monitoring system is in use, and when the monitoring system changes no application has to be adapted to the new system.
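
Splitting the logs could look roughly like this in Python. The logger names "system" and "application" and the two helper functions are assumptions for the example, any two distinct logger namespaces would work the same way.

```python
import logging


def add_file_log(logger_name, path, level=logging.INFO):
    """Send everything logged under logger_name to its own file."""
    handler = logging.FileHandler(path, encoding="utf-8")
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    )
    logger = logging.getLogger(logger_name)
    logger.addHandler(handler)
    logger.setLevel(level)
    # Do not pass records on to the root logger, so the two streams
    # are not mixed again in a common log.
    logger.propagate = False
    return logger


def configure(system_path, application_path):
    # Hypothetical logger names chosen for this sketch.
    system_log = add_file_log("system", system_path)
    application_log = add_file_log("application", application_path)
    return system_log, application_log
```

After calling configure, system_log.error("No access to DB") ends up only in the system log and application_log.warning("You are not allowed to do this") only in the application log, so ops can alert on the first file with a much shorter reaction time than on the second.
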

Dev and ops also have to work together when monitoring logfiles. Ops has to know what to look for and whom to notify. Another requirement is that the messages are parseable. Most of the time this means that the message has to be printed on one line. These one-liners should also be as precise as possible. It does not make sense to print only the first line of a Java stack trace. It is much more informative to print something like "Error while connecting to server xyz.example.com at port 1234!" or "User xyz can't login!".
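
One possible way to get such one-liners, sketched in Python. OneLineFormatter, connect and the opener parameter are names I invented for the illustration. The " | " separator is an arbitrary choice, anything the log parser does not trip over will do.

```python
import logging


class OneLineFormatter(logging.Formatter):
    """Render each record, including any traceback, on a single line."""

    def format(self, record):
        text = super().format(record)
        return text.replace("\r", "").replace("\n", " | ")


def connect(logger, host, port, opener):
    """Call opener(host, port) and log a precise one-line error on failure.

    opener is a hypothetical callable standing in for whatever
    actually opens the connection in your application.
    """
    try:
        return opener(host, port)
    except OSError as exc:
        logger.error(
            "Error while connecting to server %s at port %d: %s", host, port, exc
        )
        return None
```

The message names the server and the port, so a monitoring check can match on it and the person who gets notified knows where to look without opening the stack trace first.
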

I hope these ideas will help some people to get better logging in their production environment. If you have other ideas or think Michael or I are wrong about something, please leave a comment.


Let's get started

Hi everyone,
Let's get this blog started. I have a lot in mind that I would like to write about. To give you a first impression of what awaits you in this blog, and to bring a bit of order to my thoughts, here is a list of topics I'm going to write about (not in order of appearance):
  • non-functional requirements
  • project timeline
  • organisational structures
  • "the app is running on my PC"
  • logging
  • automation
  • ...
This is surely not a complete list, but it is a starting point. When some of these articles are finished, I will post a new list of upcoming articles.

Hi

Hi,

Finally I got myself to start this blog. I have mixed feelings about it. But I will never know what would have happened if I don't start writing it. The main topic of this blog will be the work of developers and sysadmins and what, in my opinion, is going wrong at the points where they meet. I'm working as a sysadmin, so my opinions about this topic might be a little biased, but that is what the comments are for. I hope a lot of people (both devs and ops) will comment on my postings. As I'm a sysadmin there will also be some posts about my daily work. They will serve as my personal documentation and as a way to share my knowledge or get ideas for doing things differently. The topics will most likely be Linux, Apache, Tomcat, Java, monitoring (esp. Nagios) and some other minor topics.

I'm not the first person writing about this topic, but perhaps I have some new points of view. You'll find some additional links to other blogs or resources in the link section.

I hope you enjoy reading this blog.