21 October 2013

Detect server side problems using Nagios plugins and the Load Impact Server Metrics Agent

Load Impact

We recently launched our cloud-based Server Metrics Agent, a feature that lets you collect information about what is happening inside your servers while your website or application is being load tested. Installing the Server Metrics Agent on one of your machines immediately lets you see how much CPU, memory and network bandwidth the server is using throughout the load test.


This can, of course, be very useful when looking for bottlenecks, but sometimes you want to know more. For example, you might be using database software such as PostgreSQL and suspect that it is running out of some PostgreSQL-internal resource, such as client connections, causing a bottleneck for your web server in handling client requests. In this case, you will not notice any problems just by looking at, for example, the CPU or memory usage on the physical server where PostgreSQL is running. Instead, you must communicate directly with PostgreSQL and ask it how it’s doing. You want it to tell you how many connections its database clients are using and what the maximum limit is.

When we created our Server Metrics agent, we realized people would want to collect more specialized metrics like this, not just the standard physical server metrics (e.g. CPU usage, memory usage and disk usage). But we were confronted with a big problem: there are thousands of different systems, platforms and applications from which you might want to collect performance metrics in order to detect bottlenecks, and each of them communicates in a different way. We couldn't possibly write monitoring code to support every one of them.

Luckily, we have a bit of experience with uptime monitoring, and we knew that the very popular open-source monitoring solution Nagios has a simple and flexible plugin system that is easy to interface with. We came up with the idea of designing our Server Metrics agent so that it was compatible with the Nagios plugin system, allowing users to use any Nagios plugins to collect performance data during their load tests.

As a result, Server Metrics allows you to collect performance metrics from almost anything! Measurements from the Server Metrics Agent can be correlated with other measurements collected during load tests, and results are made available as a time series that can also be viewed in graph format on the test results page, or exported to CSV (comma-separated values) format for use in a spreadsheet.

The Nagios community has created over 3,000 different plugins that measure the health of all kinds of software applications, hardware products, networks and services. The plugins are available for many platforms (e.g. Linux and Windows).
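To give a feel for why the plugin system is easy to interface with: by Nagios convention, a plugin reports its status through its exit code (0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN) and prints a line of text, optionally followed by a `|` and performance data items of the form `label=value;warn;crit;min;max`. The Python sketch below parses such a line. It is only an illustration of the convention, not the Server Metrics agent's actual code; the function name `parse_plugin_output` and the sample line in the usage note are hypothetical.

```python
import re
import shlex
from typing import Dict, Tuple

# Leading number of a perfdata value, e.g. "0.01" in "0.01s" or "80" in "80".
_NUMBER = re.compile(r"^[-+]?\d*\.?\d+")


def parse_plugin_output(line: str) -> Tuple[str, Dict[str, float]]:
    """Split 'status text | label=value;warn;crit;min;max ...' into its parts.

    Returns the human-readable text and a dict of numeric perfdata values.
    Items whose value is not numeric (for example 'U' for unknown) are skipped.
    """
    text, _, perfdata = line.partition("|")
    metrics: Dict[str, float] = {}
    # shlex.split keeps single-quoted labels containing spaces in one piece.
    for item in shlex.split(perfdata):
        label, sep, rest = item.rpartition("=")
        if not sep or not label:
            continue  # not a label=value item
        value_field = rest.split(";")[0]  # drop warn/crit/min/max
        match = _NUMBER.match(value_field)
        if match:
            metrics[label] = float(match.group(0))
    return text.strip(), metrics
```

For example, calling `parse_plugin_output("BACKENDS CRITICAL: 80 of 100 connections | time=0.01s loadimpact=80;5;10;0;100")` on this made-up line returns the status text and the two metrics `time` and `loadimpact` as floats. Units such as `s` are simply dropped in this sketch.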

  1. Follow the instructions at https://loadimpact.com/server-metrics-agent-download to download, install and enable your Server Metrics agent
  2. Go to http://exchange.nagios.org/directory/Plugins and find the plugin(s) you want to use. In our case we wanted to monitor PostgreSQL, so we went to http://exchange.nagios.org/directory/Plugins/Databases/PostgresQL which lists 18 (!) different plugins that can extract information about the health of a PostgreSQL server. We chose the "check_postgres" plugin - http://exchange.nagios.org/directory/Plugins/Databases/PostgresQL/check_postgres/details
  3. Download and install the check_postgres plugin (in our case we did it locally on our PostgreSQL server)
  4. Edit the configuration file for the Server Metrics agent, which is called "li_metrics_agent.conf". Look at the section in it that says "# An external script" for information about how to make the Server Metrics agent start using your new Nagios PostgreSQL plugin. In our case we added two lines that looked like this:

    [db_connections]
    command = /usr/bin/perl /path/to/check_postgres-2.11.1/check_postgres.pl --host=localhost --port=5432 --dbname=loadimpact --dbuser=postgres --dbpass=verysecret --action backends -w 5 -c 10

Tip: if you have installed a Nagios plugin but don’t know what parameters it needs, try running it with the --help parameter.

  5. Restart your Server Metrics agent
  6. As usual, you then enable Server Metrics data collection from this particular agent when you configure a load test

Tip: the agent name should be shown as a selectable Server Metrics agent in the test configuration interface. If you do not see it listed, this means your agent hasn't started or that it can't reach loadimpact.com. The latter is often a firewall issue.
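Before relying on a new plugin in a load test, it can help to run the exact command from the configuration file by hand and check what it returns. The Python sketch below does that under the standard Nagios exit-code convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The helper name `run_plugin` is hypothetical, and this is only a manual sanity check, not a description of how the Server Metrics agent invokes plugins.

```python
import subprocess
from typing import List, Tuple

# Exit codes defined by the Nagios plugin convention.
STATUS_BY_EXIT_CODE = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def run_plugin(args: List[str], timeout: float = 30.0) -> Tuple[str, str]:
    """Run a plugin and return (status name, first line of its output).

    `args` is the command split into a list, e.g. ["/usr/bin/perl", "plugin.pl"].
    Exit codes outside 0-3 are reported as UNKNOWN.
    """
    completed = subprocess.run(
        args, capture_output=True, text=True, timeout=timeout
    )
    status = STATUS_BY_EXIT_CODE.get(completed.returncode, "UNKNOWN")
    lines = completed.stdout.splitlines()
    return status, (lines[0] if lines else "")
```

To try it with the example above, pass the same arguments that appear after `command =` in the configuration file as a list, for instance starting with `"/usr/bin/perl"` and `"/path/to/check_postgres-2.11.1/check_postgres.pl"`. If the status comes back as UNKNOWN with empty output, the plugin path or its parameters are the first things to check.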

When the test starts, you will see the Server Metrics agent coming online in the console.

Then, while the load test is running, you will be able to plot the usual CPU, memory, disk and other statistics that the Server Metrics agent collects by default, but you will also have a new metric named after the database whose client connections you are measuring (in this case, the database is called "loadimpact").

So in this example, we choose to plot this metric, which will show us the current number of clients connected to the database "loadimpact" on the PostgreSQL instance running on the physical server "dbserver1".

The orange line shows the current number of connections to the database "loadimpact", which in this example is around 80 and fairly stable.

This is, of course, just a simple example. The check_postgres plugin can measure a vast number of things related to your PostgreSQL database server. And anything it can measure you can have the Load Impact Server Metrics agent collect and relay to loadimpact.com to be stored as test result data associated with your load test. Many of the 3,000+ Nagios plugins are very powerful data collection programs, and by utilizing the Nagios plugin compatibility of Load Impact’s Server Metrics agent you suddenly have access to an incredibly wide range of measurement and monitoring options for load testing.
