It's all about respect

"Don't waste my time, buddy. You're just a dogsbody sysadmin; I write the software and you merely service it. So just shut up with your petty concerns and do as I say, OK?"
Comment from serverfault.com 

Today I'm going to write about what is, in my opinion, one of the main reasons why cooperation between dev and ops does not work most of the time: a lack of respect for, and understanding of, the other person's job.

Dev and ops have different types of work to do, but both have the same goal: bring an application to production and make it as good as possible (depending on time, money, technology, ...).

In my experience, some people mix up their jobs:
  • Some sysadmins believe every developer has to know everything about infrastructure
  • Some sysadmins even think they know how to program and that they could do the developers' job much better
  • Some developers think they know how to set up infrastructure and how to configure all systems
  • Some developers think ops people should know all about the app without being told anything
Don't get me wrong: there are sysadmins who know how to program and there are developers who know how infrastructure works, but most of the time this is not the case. So when you work together, accept that everyone has their own profession and that most people know what they are doing. Let's get another point straight: it would be nice if every programmer knew the infrastructure and every sysadmin knew enough about programming, but this will not be the case. So when you argue about applications and/or infrastructure, explain to the other side why you think it should be this way and not the other; don't just say "No!" (we were there [insert link here] already).

Respect each other and be polite. In my opinion dev and ops have different tasks during a project, which they should accomplish to make cooperation between both possible:
  • ops: Set up a development environment which matches the production environment in the main points: OS, network setup, ...
  • dev: Explain what the app should be doing and what type of systems it is using
  • ops: Make clear which other systems exist that might influence the new environment
  • dev + ops: talk about non-functional requirements as early as possible and not two weeks before launch, otherwise it might get really messy
  • ...

Selenium and Nagios

Hi,

I've implemented a Nagios check for Selenium test cases. With this check it is possible to put your recorded test cases from your Selenium IDE into Nagios to use them for monitoring.

In my opinion this has the following advantages:
  • You can transfer your test cases to monitoring without making any changes.
  • You can run the test cases with multiple browsers. This means you use the JavaScript and rendering engine of a real browser, not some other HTML/HTTP library.
  • You are more flexible when monitoring more complex scenarios.
To get this working, you need the following components:
There are two possible ways to put this to work:
  • Install the Selenium RC server on your Nagios host.
  • Install the Selenium RC server on a different host.
I will only explain how to set up the Selenium RC server on your Nagios host. If you want to install it on a different server, you currently have to use nrpe/nsca to get it going. This is the only difference; the rest stays the same. Contacting a Selenium RC server on another host directly would also be possible, but this is not yet implemented.

First record your test case with the Selenium IDE. If you haven't done this before, take a look at the Selenium IDE documentation. After recording it, export it to a Java file (File -> Export Test Case As ... -> Java (JUnit) - Selenium RC). Compile this Java file with javac or your favorite IDE. For compilation you need the JUnit library and the file selenium-server.jar:
javac -cp ./junit.jar:./selenium-server.jar [your test case file]
On your Nagios host, put check_selenium into your libexec path and put the files CallSeleniumTest.class, junit.jar and selenium-server.jar somewhere on the filesystem. Adjust the classpath in the check_selenium file accordingly. As the Selenium IDE automatically puts the Java classes into the package com.example.tests, create a directory com/example/tests below a directory that is on the classpath and put the compiled test case into it.
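
The file layout described above can be sketched as a small shell helper. This is only an illustration: the function names class_to_path and install_test_class are made up for this example, the class name is the one used in the service definitions further down, and the classpath root is whatever directory you configured in check_selenium.

```shell
#!/bin/sh
# Hypothetical helpers; names and paths are illustrative only.

# Convert a fully qualified class name into the relative path of its
# .class file, e.g. com.example.tests.Foo -> com/example/tests/Foo.class
class_to_path() {
    printf '%s.class\n' "$(printf '%s' "$1" | tr '.' '/')"
}

# Copy a compiled class file below a classpath root so that the JVM can
# find it under its package name.
# usage: install_test_class <classpath root> <class name> <compiled .class file>
install_test_class() {
    root=$1
    target="$root/$(class_to_path "$2")"
    mkdir -p "$(dirname "$target")"
    cp "$3" "$target"
}
```

A call could look like: install_test_class /path/to/classes com.example.tests.GoogleTestCase ./GoogleTestCase.class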

Add the definition to your Nagios configuration and test it. If your Nagios server is a headless Linux system, you can also run Selenium RC headless with Xvfb. Take a look at this post to see how to set it up.

I hope this gives you new ways to monitor your web applications or makes the way you already use a bit easier.

The plugin also returns performance data for the test cases, but be aware the returned time also contains the startup time for the browser.

This plugin is written in Java, and the Selenium test case is loaded via reflection. I'm sure you could also write this in Perl, Ruby or something else. I did it in Java because I had a bit of a mess with the required Perl modules. Perhaps I will do a Perl version later.


Update 2011/07:
Some people asked for a more detailed explanation of how to integrate check_selenium into Nagios. Here are two possible ways to do it. With the first one, the check runs on a different host via NRPE. With the second one, the check runs on your Nagios host.
define command {
  command_name  nrpe_check_selenium
  command_line  $USER1$/check_nrpe -t 60 -H $HOSTADDRESS$ -p 5666 -c check_selenium -a "$ARG1$" "$ARG2$" "$ARG3$"
}

define command {
  command_name  local_check_selenium
  command_line  $USER1$/check_selenium "$ARG1$" "$ARG2$" "$ARG3$"
}
After that, you can define your service and assign it to a host:
define service {
  service_description   nrpe_selenium_Google
  use   service-check-05min
  host_name  your.winserver.here
  check_command  nrpe_check_selenium!com.example.tests.GoogleTestCase!http://www.google.com!*iexplore
}

define service {
  service_description  selenium_Google
  use  service-check-05min
  host_name  your.server.here
  check_command    local_check_selenium!com.example.tests.GoogleTestCase!http://www.google.com!*firefox
}

When executing the check via NRPE, add a line like this to your nrpe.cfg file on your remote host:

command[check_selenium]=/usr/local/nagios/libexec/check_selenium.sh --class "$ARG1$" --baseUrl "$ARG2$" --browsertype "$ARG3$"
Hope this will help some people to get faster results.

Documentation and automation

In addition to my article about documentation and automation, I found an interesting statement in the book "The Practice of System and Network Administration". The author of chapter 9 ("Documentation") states:

Documentation is the first step to automation: If you don't know exactly how you do something, you can't automate it.

I hadn't thought of it this way before. In my opinion he is right, although I think you don't have to write the documentation down on paper; you must at least have some sort of documentation in your head to automate a task. Still, it can be easier to write and test a script against written documentation, because you can use the documentation as a sort of pseudo code.
Another sort of documentation, your command history, is also a good starting point for automation (also mentioned in this chapter).
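
As a rough sketch of how a command history can become the first draft of a script: the helper below reads commands from standard input and prints a script skeleton. The function name history_to_script is invented for this example, and the result still has to be reviewed and edited by hand before anyone runs it.

```shell
#!/bin/sh
# Turn a list of shell commands (one per line, read from stdin) into the
# skeleton of a script. Hypothetical helper for illustration.
history_to_script() {
    # Script header: stop at the first failing command or unset variable.
    printf '#!/bin/sh\nset -eu\n\n'
    # Drop empty lines, then collapse directly repeated commands.
    grep -v '^[[:space:]]*$' | uniq
}
```

With bash, for example, something like history_to_script < ~/.bash_history > draft.sh gives you a file to start editing from.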

About automation and system administration


Hi,


there are a lot of tools for automating tasks in system administration, like puppet, cfengine, spacewalk, kickstart, fai, autoyast and nagios, to name only a few. But why are they not used by everyone?


I can think of the following arguments for using automation tools:
  • Less work: You do some tasks only once, and after that you gain a lot of time by repeating them automatically.
  • Reproducible results: Automation means scripts, and this way you can reproduce results by just executing a script or a tool for a second, third, ... time.
  • Homogeneous environment: When your work is reproducible you can set up different environments in the same way.
Some people are against automation. The reasons I've heard so far:
  • Why am I still needed when I automated my task?
  • It is not reliable enough.
  • We do not know what is happening.
As I see it, automation does not make you unnecessary; you are needed for other work. You will have to maintain and improve the automation. Your automation is also not unreliable, because you can test it. And you know what is happening, both from the tests and because you wrote the scripts and their documentation. On top of that, you get more time for the interesting things at work. I know a lot of people who keep a list, in their head or written down, of tasks they would like to do if they had more time. With automation you get this time.
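
To make the point about testable, repeatable automation a bit more concrete, here is a minimal sketch of a step that can be run any number of times with the same result. The function name ensure_line and the example setting are assumptions made for this illustration; they are not taken from any of the tools mentioned above.

```shell
#!/bin/sh
# Make sure a file contains a given line exactly once.
# Running it a second time changes nothing, so the result is reproducible.
# usage: ensure_line <file> <line>
ensure_line() {
    file=$1
    line=$2
    # -x: match whole lines only, -F: fixed string, -q: no output
    if ! grep -qxF -e "$line" "$file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$file"
    fi
}
```

Because the step has a clear expected result, it is easy to test: call it twice on a scratch file and check that the line is there exactly once.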






Automation is like documentation: when it is done well, you can pass the work on to your co-workers. Take a look at this post.


To get another view of this topic, take a look at this post by John Willis.

The application is only running on the developer system

Hi,

today I'm going to write about a situation every sysadmin has already encountered. The sysadmin gets a new version of some type of software and should install it on a server. After some hours of trying he calls the developer and tells him he's not getting the application to start. The first answer all of us get: "But it is running on my PC." Let the discussion start. ;)

In my opinion it is mostly a problem of proper communication. I have also seen (more than once) development environments (mostly Windows) that differ from the production/test/... environments (be it Linux, AIX, HP-UX, ...). This can also be a reason for problems, but that is enough for another topic. So let's come back to communication.

As communication is always two-sided, so is this problem. There can be different sources:
  • The admin has changed a default value of a configuration. 
  • The developer has changed some classes and now needs other permissions or files. 
  • The admin has installed an update and the application is installed on this test system. 
  • The developer uses a new library which is not installed on the server. 
  • ...

There can be many more reasons why software fails, but I think you get the idea. Most of these supposed problems can easily be solved by proper communication. When the admin updates a system (be it security patches, OS service packs, ...), he should write a short e-mail to the users of the system and explain in a few words what was done and what might be affected by the update. When a developer changes something in the code, he should keep a changelog. But as a developer, do me a favour and do not mail the sysadmin the complete changelog. When it is too long or contains too many terms related to the business logic or to how you changed some algorithm to gain performance, he won't read it. The changes might well be interesting for the sysadmin, but mostly he will not have the time to read it all and pick out the parts relevant to him. I know developers do not have unlimited time either, but for them it is much easier to find the parts affecting the sysadmin, because they (hopefully ;)) understand the complete changelog.

In a perfect world we would have a change management process which includes development and system administration, but as this will not always be present, just take the short route: write an e-mail, use the phone or hold a short(!!!) meeting whenever anyone knows about changes which could affect the release. Some people will now start rolling their eyes and ask themselves who is not doing so already. It's sad but true: there are a lot of people out there who are not.

This way your releases will run a lot smoother and every side gets more understanding for the other side which will positively affect other parts of your daily work.

Create Java heapdumps with the help of core dumps

Hi,

for some time I had the problem that taking Java heap dumps with jmap took too long. When one of my Tomcats crashed with an OutOfMemoryError, I had no time to take a heap dump, because it took some hours and the server had to be back online.

Now I have found a solution to my problem. The initial idea came from this post. It had a solution for Solaris, but with some googling and trial and error I found a solution for Linux too.

  1. create a core dump of your java process with gdb
    gdb --pid=[java pid]
    gcore [file name]
    detach
    quit
  2. restart the tomcat or do whatever you like with the java process
  3. attach jmap to the core dump and create a Java heap dump
    jmap -heap:format=b [java binary] [core dump file]
  4. analyze your Java heap dump with your preferred tool
 When you get the following error in step three:
Error attaching to core file: Can't attach to the core file
This might help:
In my case the error appeared because I used the wrong Java binary in the jmap call. When you are not sure about your Java binary, open the core dump with gdb:
gdb --core=[core dump file]
You will get an output similar to this one:
GNU gdb 6.6
Copyright (C) 2006 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i586-suse-linux"...
(no debugging symbols found)
Using host libthread_db library "/lib/libthread_db.so.1".

warning: core file may not match specified executable file.
(no debugging symbols found)
Failed to read a valid object file image from memory.
Core was generated by `/opt/tomcat/bin/jsvc'.
#0  0xffffe410 in _start ()
 What you are looking for is in this line:
Core was generated by `/opt/tomcat/bin/jsvc'.
 Call jmap with this binary and you will get a heapdump.
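
The gdb parts of this procedure can also be run without typing into the gdb prompt. The following is a sketch under a few assumptions: your gdb supports the -batch and -ex options, the function names dump_core and core_binary are invented for this example, and the binary path contains no spaces (the helper cuts the path at the first space so that command line arguments are dropped).

```shell
#!/bin/sh
# Step 1 without the interactive prompt: attach to the process,
# write the core file, detach again.
# usage: dump_core <java pid> <core file>
dump_core() {
    gdb --pid="$1" -batch -ex "gcore $2" -ex detach -ex quit
}

# Read gdb output on stdin and print the binary named in the
# "Core was generated by" line (everything up to the first space or quote).
core_binary() {
    sed -n 's/^Core was generated by `\([^ '\'']*\).*/\1/p' | head -n 1
}
```

To look up the binary for an existing core file you could then run: gdb --core=[core dump file] -batch 2>&1 | core_binary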

JMX and SSL

If you would like to use JMX with SSL, you have to configure a few things on both sides. First, create yourself a self-signed certificate (details here) and insert it into a keystore (details here).

Let’s assume you want to use JMX over SSL with your Tomcat and JConsole on the client. Add these parameters to your Tomcat start script:
-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=[your jmx port] -Dcom.sun.management.jmxremote.ssl=true -Dcom.sun.management.jmxremote.authenticate=false
-Djavax.net.ssl.keyStorePassword=[your password]
-Djavax.net.ssl.keyStore=[full path to keystore file]
To configure JConsole to use SSL add these parameters to the call:
jconsole -J-Djavax.net.ssl.trustStore=[full path to keystore file] -J-Djavax.net.ssl.trustStorePassword=[your password]
Make sure that the trustStore file is the same as the keyStore file for Tomcat, or trustStore and keyStore contain the same certificates with the same alias.
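
For illustration, the two preparation steps could look roughly like this in a shell script. Treat it as a sketch: the function names make_keystore and jmx_ssl_opts, the alias jmx and the one-year validity are assumptions made for this example, and very old JDKs call the keytool command -genkey instead of -genkeypair.

```shell
#!/bin/sh
# Create a keystore containing a self-signed certificate.
# usage: make_keystore <keystore file> <password> <host name>
make_keystore() {
    keytool -genkeypair -alias jmx -keyalg RSA -keysize 2048 \
        -validity 365 -dname "CN=$3" \
        -keystore "$1" -storepass "$2" -keypass "$2"
}

# Print the JVM options from above on a single line.
# usage: jmx_ssl_opts <jmx port> <keystore file> <password>
jmx_ssl_opts() {
    printf '%s ' \
        "-Dcom.sun.management.jmxremote" \
        "-Dcom.sun.management.jmxremote.port=$1" \
        "-Dcom.sun.management.jmxremote.ssl=true" \
        "-Dcom.sun.management.jmxremote.authenticate=false" \
        "-Djavax.net.ssl.keyStore=$2"
    printf '%s\n' "-Djavax.net.ssl.keyStorePassword=$3"
}
```

If your Tomcat is started through catalina.sh, the output of jmx_ssl_opts could be appended to CATALINA_OPTS; with other start scripts, put it wherever the JVM options are defined. Keep in mind that a password given on the command line is visible in the process list.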

Should you experience any problems using SSL, this parameter might help you:
-J-Djavax.net.debug=all (for jconsole)

-Djavax.net.debug=all (for tomcat)
This will also work with the check_jmx Nagios plugin. Just add the keystore file as trustStore to your call:
java -cp jmxquery.jar -Djavax.net.ssl.trustStore=[full path to keystore file] -Djavax.net.ssl.trustStorePassword=[your password] org.nagios.JMXQuery -U service:jmx:rmi:///jndi/rmi://[your host]:[your jmx port]/jmxrmi -O "java.lang:type=MemoryPool,name=Perm Gen" -A Usage -K used -I Usage


check_jmx

There are different check_jmx versions (ME, NE1, NE2 and CG) on NagiosExchange, MonitorExchange and Google Code but it seems none of them is still maintained. I tried to reach one author but got no reply. So I decided to put my modifications on the net. I also merged some other changes in this new release of check_jmx.

To make sure other people can continue development should I not be reachable, I uploaded the source to gitorious. There is a new repository for Nagios plugins which is maintained by some community members.

check_jmx is a Nagios plugin to monitor your JVM, e.g. your Tomcat or JBoss installation. It is possible to get data about your heap, gc, ... . It is also possible to query MBeans which are part of your application. check_jmx also returns performance data.

For this release I merged the original check_jmx release with additions that support longs instead of integers for the warning and critical values. I also added authentication for connections to the JMX server.



devops cooperation at flickr

Hi,

and now the presentation that goes with the podcast at redmonk I mentioned earlier. There are some very good statements in this presentation. The tools section is important not so much for the tools it mentions as for the statements that come with them:
  • single click build
  • single click deployment
  • monitoring for app and systems
  • understanding for all metrics independent of app or system
  • dark launches
But as the last slide says, it is not easy. Try to start with one point or in one project and try to establish it. When it is working take the next point or the next project.

The slides about culture are very good. One I especially want to mention is slide 57, "Don't just say 'No'". I don't know what was said during the presentation, but my understanding of this sentence is as follows:
You can say 'No', but then explain why and give alternatives. When you don't have alternatives, just say so and try to find alternatives together.
The slides about finger-pointing don't need any comment. Just take a look at them and you know everything.

But there are also slides I do not fully agree with. I don't think dev should have full access to all systems. They definitely must have access to a test environment which is almost the same as production, but they do not necessarily need access to production. They should have access to logs, but not the rights to restart services or change anything. In my opinion this would be the same as ops changing some code. This can work in small organisations where most people have more than one role, but not in bigger organisations.

Here is the presentation:
Also take a look at John's blog


What is devops?

Hello everyone,

here you'll find an article by Stephen Nelson-Smith about what devops is. It is a nice roundup of what devops is and what it tries to achieve.


dev and ops cooperation at flickr

Hi,

this is a podcast with John Allspaw (now at etsy). For me as a sysadmin, the parts of the talk at about 22:30 min and 27:00 min were very interesting. Have fun listening.


FOSDEM 2010: cucumber-nagios

I listened to the talk about cucumber-nagios by Lindsay Holmwood. Besides the slides, the presentation also had a live demonstration. As I understood the presentation, cucumber-nagios does not do anything completely new. You could have achieved the same with some other Nagios plugins or a combination of them, but I have to say: It is cool! It seems to be much more flexible and extensible.

I see cucumber-nagios as a plugin to Nagios which itself has plugins, called features. This way you extend Nagios only once and then do everything in cucumber-nagios.

As it does not use any browser, you can run it on every headless server, without the need for a graphical environment. I know you could also use a headless server for graphical checks (see this blog post), but again it makes things easier.

What I'm not sure about (but I'll find out) is authentication. How do you monitor applications which require authentication? Another problem I see is internal web applications which only run in IE and need AD authentication. Until now I had thought of using Selenium for this job, by running it on a Windows client machine and letting it connect to the web application, but perhaps cucumber-nagios could do this job as well. Another point I have to find out about is performance data. Until now I've only seen performance data for the number of tests. What would be more interesting is the duration of the individual tests. This way you would see problems, or the effect of changes to your application or infrastructure, in the monitoring system.

As Lindsay himself states, one purpose of cucumber-nagios is to bridge the gap between developers and sysadmins (see slide 68 of this presentation; it's not from FOSDEM, but it is the same slide). So one more idea: It would be nice if one could translate test cases of the development/testing team into cucumber-nagios monitoring checks, even when the development project is not using Ruby or cucumber.



On my way to FOSDEM 2010

Currently I'm on my way to FOSDEM 2010 in Brussels, Belgium. There will be some promising talks about Open Source Projects and the use of Open Source Software. I hope to see the following talks:
  • Evil on the Internet (Link to fosdem site for presentation)
  • Large scale analysis made easy - Apache Hadoop (Link to fosdem site for presentation)
  • Scaling Facebook with Open Source tools (Link to fosdem site for presentation)
  • What is my system doing - Full System Observability with SystemTap (Link to fosdem site for presentation)
  • Starting the sysadmin tools renaissance: Flapjack + cucumber-nagios (Link to fosdem site for presentation)
  • Tor: Building, Growing and Extending Online Anonymity (Link to fosdem site for presentation)
  • MINIX 3: a modular, self-healing POSIX-compatible Operating System (Link to fosdem site for presentation)
I hope to get some new and interesting ideas to make my daily work more comfortable. My highest hopes are on the presentation about cucumber-nagios. It should give me some completely new methods for monitoring the applications I have to administer.


Logging in production environments

Hi everyone,

recently I read an article in a German IT magazine (iX 02/2010, p. 116, "Fit für den Betrieb" by Michael K. Klemke) about making applications ready for production (the German title would translate to "Ready for production"). In the last part of his article Michael writes about logging in an application. I would like to state some of his points here and make some additions.

Here are short summaries of his main points:
  • no debugging in production, so learn your logs to find errors
  • changeable loglevel
  • unique IDs which are valid in every layer of your application, to track transactions across different systems
  • dev has to write monitoring plugins
  • log entry and exit of each method
And now some additions to his statements:
First the easy ones. Changing the loglevel should be possible without a restart. It would also be nice to have a frontend which can be used by the helpdesk. This way a supporter can change the loglevel and then ask the user to redo the steps which led to the error. The supporter can then attach the log output to a ticket and pass the ticket on to the developers.

The entry and exit of each function/method/... should be logged at debug or trace level, so that they are only written when the loglevel is raised in case of an error. This way the logs don't fill up with unnecessary garbage all the time. Which leads me to the first addition to Michael's points: the logs should be rotated, and deleted after e.g. 30 days (as long as you don't have to fulfil any legal requirements). Rotate the logs daily and whenever they get larger than x MB. This way you avoid searching through multiple GB of logs when you look for an error. Even with tools like grep, awk, sed, ... it is very hard to find an error when you do not know what you are searching for. Another argument for log rotation is the price of storage. Today you can buy 1 TB of USB or desktop storage for under 100€, but when you keep your logs on SAN storage we are quickly talking about a thousand euros or more, depending on your setup and hardware vendor.
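
The "delete after e.g. 30 days" part can be sketched in a few lines of shell. The function name purge_old_logs and the file name pattern *.log* are assumptions made for this example; on many Linux systems a tool like logrotate takes care of both rotation and cleanup, so this only shows the idea.

```shell
#!/bin/sh
# Delete log files below a directory that were last modified more than
# the given number of days ago.
# usage: purge_old_logs <log directory> <days>
purge_old_logs() {
    find "$1" -type f -name '*.log*' -mtime +"$2" -exec rm -f {} +
}
```

Called once a day, for example from cron, as purge_old_logs /path/to/app/logs 30, this keeps roughly one month of logs.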

Another method which might ease support is to log to different logfiles. You can log system related messages (like "No access to DB") to a different log than application related messages (like "You are not allowed to do this"). This way you can handle them differently for alerting, reaction time, ... . Which leads us to the next point. Michael talks about the implementation of monitoring checks by the development department. I do not fully agree with this. The developers should help ops to set up monitoring for the application, but should concentrate only on implementing checks for the business processes that are to be monitored. Ops should then integrate these checks into the monitoring system. This way dev doesn't have to know which monitoring system is in use, and when the monitoring system changes, no application has to be adapted to the new system.

Dev and ops also have to work together when monitoring logfiles. Ops has to know what to look for and whom to notify. Another requirement is that the messages are parseable. Most of the time this means that a message has to be printed on one line. These one-liners should also be as precise as possible. It does not make sense to print just the first line of a Java stack trace. It is much more informative to print something like "Error while connecting to server xyz.example.com at port 1234!" or "User xyz can't login!".
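
One-line messages are what make a simple logfile check possible in the first place. Here is a minimal sketch of such a check, using the usual Nagios plugin exit codes (0 = OK, 2 = CRITICAL, 3 = UNKNOWN). The function name check_log_errors and the idea that error lines start with "ERROR " are assumptions for this example; a real check would also remember how far it has already read instead of counting the whole file every time.

```shell
#!/bin/sh
# Count the lines in a logfile that match a pattern and report the result
# in Nagios plugin style.
# usage: check_log_errors <logfile> <pattern>
check_log_errors() {
    file=$1
    pattern=$2
    if [ ! -r "$file" ]; then
        echo "UNKNOWN - cannot read $file"
        return 3
    fi
    # grep -c prints the number of matching lines; it exits non-zero when
    # there are none, which is not an error here.
    count=$(grep -c -e "$pattern" "$file") || true
    if [ "$count" -gt 0 ]; then
        echo "CRITICAL - $count matching line(s) in $file"
        return 2
    fi
    echo "OK - no matching lines in $file"
    return 0
}
```

With messages spread over several lines, such as a full stack trace, a pattern like this would either miss the interesting part or count the same error several times.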

I hope these ideas will help some people to get a better logging in their production environment. When you have other ideas or think Michael or I are wrong with a statement, please leave a comment.


Let's get started

Hi everyone,
let's get this blog started. I have a lot in mind that I would like to write about. So to give you a first impression of what awaits you in this blog, and to give my thoughts a bit of order, here is a list of topics I'm going to write about (not in order of appearance):
  • non-functional requirements
  • project timeline
  • organisational structures
  • "the app is running on my PC"
  • logging
  • automation
  • ...
This is surely not a full list, but it is a starting point. When some of these articles are finished, I will post a new list with upcoming articles.

Hi

Hi,

finally I got myself to start this blog. I have mixed feelings about it, but I will never know what would have happened if I don't start writing it. The main topic of this blog will be the work of developers and sysadmins and what is, in my opinion, going wrong at the points where they meet. I'm working as a sysadmin, so my opinions on this topic might be a little biased, but that is what the comments are for. I hope a lot of people (both devs and ops) will comment on my postings. As I'm a sysadmin, there will also be some posts about my daily work. They will serve as my personal documentation and as a way to share my knowledge or get some ideas on doing things another way. The topics will most likely be Linux, Apache, Tomcat, Java, monitoring (esp. nagios) and some other minor topics.

I'm not the first person writing about this topic, but perhaps I have some new points of view. You'll find some additional links to other blogs and resources in the link section.

I hope you enjoy reading this blog.