Updates & News 03 February 2017

Open Source Load Testing Tool Review

Ragnar Lönn

Editor's Note:

We've updated this blog post on Dec 4, 2018 to include the k6 open source load testing tool. k6 was not originally included in this tool review as it was launched after the blog was first posted.

There are tons of load testing tools, both open- and closed-source. Open-source tools are growing in popularity, and we use mainly open-source software (OSS) at Load Impact, so we thought it might be useful to take a deep look at the available options in a detailed open source load testing tool review. (We'll call this our version 1.1 review.)

The OSS load testing landscape is a bit of a jungle, and the difference between one tool and the next in terms of usability, performance, feature set or reliability can be enormous. So we'd like to help people find the right tool for their use case.

We have chosen to look at what we consider to be the most popular, open-source load testing tools out there today. This list includes:

  • Jmeter
  • Gatling
  • Locust
  • The Grinder
  • Apachebench
  • Artillery
  • Tsung
  • Vegeta
  • Siege
  • Boom
  • Wrk
  • k6 (Newly added to this review)

This review contains both my own personal views on the good and bad sides of these load testing tools, and my thoughts based on a round of benchmark testing that gives a sense about relative tool performance.

UPDATE: If you are mainly interested in the results from the benchmarking, they were posted in a follow-up article, and we also did a second round of benchmarks that included our newly developed tool k6. You can also read about why we created k6.

So — what's the setup?

Setting up the Open Source Load Testing Review

We installed, configured and ran these tools from the command line, and spent a lot of time trying to extract results from them using shell scripts and standard Unix tools. Then we tried to figure out the relative performance of the tools: first by manually trying to squeeze the best performance out of each one, optimizing configuration parameters for the tool in question, and then by running a benchmark test on all the tools with configurations kept as similar as possible.

The benchmark numbers are one thing, but the opinions on usability that I give will be colored by my use case. For example, a tool that is hard to run from the command line, or that is hard to get useful results from when you run it that way, will make me frustrated. And when I’m frustrated, I’ll have to whine and complain about it. So that is what this article is all about.

One big caveat: I have not been looking much at the data visualization options included with each tool, so you may want to see this comparison as most useful in an automation setting, or if you are planning to push results into some external data storage and/or visualization system anyway.


Try it yourself: run the Docker image

To make things easy for people, I’ve created a public Docker image that can be used to easily repeat all the tests we have made (or to just run one of the load testing tools without having to install it yourself). Try this:

docker run -it loadimpact/loadgentest

Or, if you want to be able to simulate extra network delay you should do:

docker run -it --cap-add=NET_ADMIN loadimpact/loadgentest

You can also build the Docker image yourself, if you clone our Github repo:

git clone https://github.com/loadimpact/loadgentest

At https://github.com/loadimpact/loadgentest you will also find some documentation on how to use the setup, by the way!

So ... Which tool is best?

Actually, I think rather than coming out with a single recommendation, let's look at the top 3. I will say which tools I think have something going for them, and why, and then I’ll show you some benchmarking figures and let you decide for yourself, because one size really doesn’t fit all here.

The top 3 open source load testing tools


Gatling

Website: https://gatling.io

Apart from maybe its text output to stdout (which is as messy as Locust’s), Gatling is a very nice load testing tool. Its performance is not fantastic, but good enough. It has a consistent design where things actually make a little bit of sense, and the documentation is very good.

It has a DSL (domain-specific language) roughly comparable to what JMeter and Tsung offer. However, while JMeter and Tsung use XML with tool-specific tags to implement things like loops, Gatling lets you define Scala classes that offer similar functionality but are a lot more readable.

Yes, Scala. Apparently everyone reacts the same way upon hearing that. Gatling’s documentation specifically tells you not to panic, which of course made me panic, but once I realized I couldn’t get out of this and bothered to read the docs and look at the examples I saw that it’s not tricky at all.

My initial assumption was that Gatling would be as clunky to use as JMeter, since it's a Java app with DSL functionality similar to JMeter's. But I have now seen the error of my ways, and while I still very much prefer to use a "real", dynamic scripting language to define what happens in a load test, Gatling's DSL is perhaps the second best thing around (after a real, dynamic language). Here is how it can look:

import io.gatling.core.Predef._
import io.gatling.http.Predef._
import scala.concurrent.duration._

class GatlingSimulation extends Simulation {
  val httpConf = http
    .baseURL("http://myhost.mydomain.com")
    .disableCaching
    .acceptHeader("text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8")
    .userAgentHeader("Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:16.0) Gecko/20100101 Firefox/16.0")

  val scn = scenario("My Scenario") // A scenario is a chain of requests and pauses
    .exec(http("request_1").get("/index.html"))
    .pause(2)
    .exec(http("request_2").get("/style.css"))
    .pause(5)

  setUp(scn.inject(atOnceUsers(100)).protocols(httpConf))
}

I wasn’t going to get into visualizations, but Gatling generates web-based visualizations of test results automatically, which is nice. It uses the excellent charting library Highcharts (which I can really recommend to anyone who wants to create any kind of charts - Highcharts rules!) to create nice-looking charts of response times.

It generates useful metrics by default, such as several different response time percentiles. The Gatling docs are really good and cover most things you will ever need to know.

Gatling allows you to log individual transaction results to a log file in tab-separated format, making it simple to process after-test results and generate whatever metrics you need. One issue I experienced with both Gatling and Jmeter was that I failed to get better resolution than 1ms for my metrics. The Gatling logs look like this:

REQUEST MyScenario 5 request_1 1474293386788 1474293386793 OK

The timestamps are milliseconds since Unix EPOCH and I don’t know how to get higher resolution than that. (When running Jmeter I tried using the -D sampleresult.useNanoTime=true parameter on the command line, but with no change in behaviour from Jmeter).
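Since the log is plain whitespace-separated text, computing your own metrics from it is a quick shell job. Here is a minimal sketch, assuming the exact field layout shown above (start and end timestamps in fields 5 and 6; the log format varies between Gatling versions):

awk '$1 == "REQUEST" { sum += $6 - $5; n++ } END { if (n) print "average response time (ms):", sum / n }' simulation.log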

Anyway, to summarize: I can definitely see that a lot of Java people who would previously have defaulted to JMeter for their load testing will want to take a serious look at Gatling before they go the JMeter route.

For me, as someone who suffers from an intense and completely justified fear of the JVM, the fact that Gatling needs the JVM to run is still a big drawback, of course, but otherwise it is a very nice tool. It feels a lot more modern and user-friendly than JMeter.


k6

Website: https://k6.io

We have written articles on why k6 was created, but to sum it up quickly, we believe k6 is the best tool around for developers and DevOps teams that want to run automated load tests as part of their Continuous Integration pipeline. k6 is written in Go and you write your test cases in pure JavaScript.

k6 has seen strong interest from developers, testers and DevOps teams and has more than 4400 GitHub stars. It is fast (see the latest benchmark data here), has a nice API and a nice, command-line-based UX. It is simple to get started with and to use, and it has good documentation. Test results can be output to stdout or sent to results-analysis tools such as Load Impact Insights.

Here's a sample JavaScript test script for k6 that shows the use of Thresholds to generate a pass/fail test result, needed for automation:

import { check, sleep } from 'k6';
import http from 'k6/http';

// Options
export const options = {
  stages: [{ duration: '60s', target: 10 }, { duration: '60s' }, { duration: '60s', target: 0 }],
  thresholds: {
    http_req_duration: ['p(95)<500'],
  },
};

export default function () {
  const res = http.get('http://test.loadimpact.com');
  check(res, {
    'is status 200': (r) => r.status === 200,
  });
  sleep(1);
}

Since k6 executes load tests written in ES6 JavaScript, your tests can be modular: you can easily import standard and custom libraries, and k6 can load both ES6 modules and ES5 libraries.

Built-in Modules:

// Example import of built-in modules
import http from 'k6/http';
import { Counter, Gauge, Rate, Trend } from 'k6/metrics';
import { check } from 'k6';
import { sha256 } from 'k6/crypto';

k6 can import modules that are hosted remotely. (This functionality is only available when using k6 locally to trigger tests from the command line).

Import Remote Modules:

// Example import of a remote module
import http from 'k6/http';
import moment from 's3.amazonaws.com/k6samples/moment.js';

export default function () {
  http.get('http://test.loadimpact.com');
  console.log(moment().format());
}

QA and Performance Engineering teams can create realistic user test scenarios with tools like the Load Impact k6 Test Script Recorder. k6 runs in the Load Impact Cloud Execution mode that allows you to run large load tests on a distributed cloud infrastructure without having to manage that infrastructure yourself. Generate loads from up to 10 different global locations (load zones).

As mentioned, k6 is built for automation and integrates into your Continuous Integration (CI) pipeline. Integrate with CI tools such as Jenkins, Circle CI, Team City and GitLab.
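In practice the integration is mostly a matter of adding one command to the pipeline, since k6 exits with a non-zero status code when a threshold is breached (loadtest.js below is just a placeholder name for a script like the one above):

k6 run loadtest.js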

More information on k6 is available at https://k6.io.


Tsung

Website: https://tsung.erlang-projects.org/

Tsung is written in Erlang, and while my experience with Erlang is limited to having accidentally seen 5-10 lines of code when looking over someone’s shoulder (the code looked weird), I have heard many people talk about how good it is at concurrency. Seems this might be true, because Tsung utilizes multiple CPUs better than perhaps any other tool I have seen (i.e. it fully uses all four CPU cores in my test setup). In terms of total traffic generated, it also performs very well, on par with Jmeter.

Tsung supports many different protocols, not just HTTP. If you want to test an application that speaks WebDAV, XMPP, LDAP, SQL etc there is support for that. There is also a plugin that speaks raw TCP or UDP. Seems extensible if you have some exotic application protocol you want to test.

The system includes some useful functionality for monitoring the target server(s) with an Erlang-based agent, vanilla SNMP, or the monitoring tool Munin. Distributed load generation is a standard feature. There is a proxy recorder for recording HTTP traffic, etc, etc.

Overall, Tsung seems aimed at scalability and performance, while also offering a very competent set of basic functionality that includes just about anything you’ll need.

The whole feeling you get from the tool, when looking at the documentation and how things are designed, is that this is a competent piece of software that is probably going to be quite future-proof for most users (you will not grow out of it) while at the same time offering a low-friction onboarding process - it is not hard to get started with Tsung.

Its main drawback: its virtual user scenarios are written in XML, just like Jmeter. This is what a very simple Tsung config & scenario can look like:

<?xml version="1.0"?>
<!DOCTYPE tsung SYSTEM "/usr/share/tsung/tsung-1.0.dtd">
<tsung loglevel="notice" version="1.0" dumptraffic="protocol">
<clients>
<client host="localhost" use_controller_vm="true"/>
</clients>
<servers>
<server host="myserver" port="80" type="tcp"/>
</servers>
<monitoring>
<monitor host="myserver" type="snmp"/>
</monitoring>
<load duration="300" unit="second">
<arrivalphase phase="1" duration="300" unit="second">
<users maxnumber="20" arrivalrate="20" unit="second"/>
</arrivalphase>
</load>
<sessions>
<session name="http-example" probability="100" type="ts_http">
<for from="1" to="10000" var="i">
<request>
<http url="http://myhost.mydomain.com/index.html" method="GET" version="1.1"></http>
</request>
</for>
</session>
</sessions>
</tsung>
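Running a config like this is then done from the command line; assuming it is saved as myscenario.xml, something like:

tsung -f myscenario.xml start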

The lack of a dynamic scripting language is annoying, though Tsung XML scenarios (again, just like JMeter) can include things like loops and if-statements, so it is actually possible to write all sorts of complicated user scenario "code".

The functionality is there, but the usability is not: few developers like "programming" in XML. So it all depends on your use case. If you don't mind specifying your user scenarios in XML, take a look at Tsung, because it is a very competent piece of software without being the difficult-to-learn-and-use Godzilla application that JMeter is.

The Honorable Mention

The Grinder

Website: https://grinder.sourceforge.net/

I’m of two minds about The Grinder. On one hand it looks like an ancient and almost-dead project. It was first released in 2000, and the latest official release was in 2012. It is hosted on Sourceforge.

Who hosts their project on Sourceforge these days? (I feel bad for saying it, having been a happy Sourceforge user once upon a time, but they really blew it, didn’t they? Completely failed to stay relevant, completely missed the big Git migration).

Anyway, Grinder seems to have one guy working on it, and commits come about a year and a half apart. Is it dead? Maybe twitching a little still, but no more than that.

So on one hand it’s an almost-dead project and on the other hand it is a very competent load testing tool that seems to see a fair bit of use. It is one of few open-source tools with real scripting - i.e. where you define virtual user behaviour in your load test using a real, dynamic programming language.

In the case of Grinder the default scripting language is Jython, which is a Python implementation that runs on the JVM. Reportedly, you can also use Clojure; I have no idea why. If you're familiar with Python, programming Grinder is a piece of cake. The only issue is that you can't use all the gazillion Python modules you may be used to having. Jython ships an implementation of the Python standard library, but that's about it.

Still, it’s a nice and powerful language that you can do a lot with. Here is an example of a tiny Grinder Jython file that describes what the virtual users should be up to during the load test:

from net.grinder.script.Grinder import grinder
from net.grinder.script import Test
from net.grinder.plugin.http import HTTPRequest

test1 = Test(1, "Request resource")
request1 = HTTPRequest()
test1.record(request1)

class TestRunner:
    def __call__(self):
        result = request1.GET("http://myhost.mydomain.com/index.html")

As mentioned above, the Python/Jython code describes only what virtual users should do during the test. Test-wide parameters such as test duration, load levels etc. are defined in a separate file called grinder.properties, which is a pretty standard config file that can look like this, though there are many more parameters you can set:

grinder.processes = 1
grinder.threads = 20
grinder.runs = 0
grinder.useConsole = false
grinder.script = /path/to/yourscript.py
grinder.logDirectory = /var/log/grinder
grinder.duration = 300000

(Note: grinder.duration is expressed in milliseconds)
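With both files in place, a load-generating agent is started by pointing the Grinder main class at the properties file; something like the following, where the path to grinder.jar is a placeholder:

java -cp /path/to/grinder.jar net.grinder.Grinder grinder.properties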

Grinder has support for load distribution. Start a console process and then connect one or more agent processes that can be remote. Those processes, in turn, create worker processes to drive the virtual users. Grinder puts results in nice little files, CSV-style, one per worker process, which makes post-processing fairly simple.

Grinder, just like Gatling and Jmeter, seems incapable of producing results with higher resolution than 1ms. They are all Java apps (hmm). Some silly Java person must have decided that 1ms is "enough resolution for anybody,” to paraphrase someone else.

In many complex, real-world setups, 1ms probably is enough because response times will be many milliseconds, but I think sub-millisecond resolution adds value in an increasing number of use cases (such as measuring response times for e.g. a micro service where the clients are on the same local network), so I would definitely not settle for 1ms if I were writing a new load testing tool today.

One area where I assumed Grinder would fail was performance. I thought that given that it was an old and somewhat outdated application, plus that it was running a real, dynamic scripting language, its throughput (traffic-generating capabilities) would be limited.

I was wrong.

It performed really well and almost on par with tools like Jmeter and Tsung that are not executing "real” code. Impressive!

Grinder also has an extensive script API that allows you, among other things, to log custom metrics. To sum things up, this is a very flexible and competent tool that also performs well. Not being able to use all the Python infrastructure out there (modules) may be a deal-breaker to some, of course.

The "Pretty Good" Tools

Some tools are just "pretty good", period, and I'd like to mention these here.

Locust

Website: https://locust.io

Locust is cool because it was clearly made by developers, for developers. It allows you to write powerful, expressive and easily understandable code (real code, not some clunky, limited, invented-here DSL) that defines the behaviour of your simulated users in the load test. You write plain Python code, and who doesn’t love Python? No-one, of course!

Here is a very short example locustfile.py that defines what the virtual users in the test should do:

from locust import HttpLocust, TaskSet

def stylesheet(l):
    l.client.get("/style.css")

class UserBehavior(TaskSet):
    tasks = {stylesheet: 1}

class WebsiteUser(HttpLocust):
    task_set = UserBehavior

In this case, the target host is specified on the command line, using the --host command line option, while the path ("/style.css") is specified in Python code. This seems to be the common way to do things with Locust, but you're free to specify everything in code if you want.
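For reference, a purely command-line run with the Locust version current at the time of writing looked something like this, where --no-web skips the web UI, -c sets the number of simulated users and -r the hatch rate:

locust -f locustfile.py --host=http://myhost.mydomain.com --no-web -c 20 -r 5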

And Locust comes with a slick little web-based command-and-control UI plus support for distributed load generation, making it feel quite upgradeable and future-proof. Plus the locust.io website was obviously created by some talented designer - just look at the cool little insect logo!

But once you’ve stopped staring at their shiny website and started really using the app you realize that it does have its weak spots too, just like the rest of them. If there is one thing Python is not known for, it is high performance, and Locust suffers from being a Python app. Besides Python code not executing terribly fast, the Python GIL makes it hard for Locust to utilize more than one single CPU, which means you have to run multiple processes to scale up the load generation.

Turns out that the built-in support for distributed load generation is probably pretty vital to Locust, because without it, the application feels a lot less future proof from a user perspective. Many people don’t need to run very large-scale tests, but it is nice to at least have the ability to do it without needing to switch tools.
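Distributed mode is, as far as I can tell, just a matter of starting one master process and pointing one or more slave processes at it (flag names as of the Locust version I tested):

locust -f locustfile.py --master --host=http://myhost.mydomain.com
locust -f locustfile.py --slave --master-host=127.0.0.1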

But even with a single CPU, Locust should be able to generate quite a bit of traffic, right? Wrong.

In the benchmark it is roughly 25 times slower than the fastest application, adjusted for the fact that it only uses a single CPU (it is about 100 times slower in the benchmark, because most other tools are able to utilize all 4 CPUs in my lab setup).

Still, given a reasonable machine Locust will be able to generate hundreds or up into the low thousands of requests per second per CPU/running program instance. This is enough for many use cases so it is definitely worth considering what your particular requirements are before ruling out Locust as an option.

Results output options are limited with Locust. If you’re using the web UI you can download results in CSV format, but if you’re using the command line you get results in some not-so-pretty pretty-printed format that is hard to parse for humans and machines alike.

I guess the intention is for you to handle results output in your Python code, which is certainly doable, but unless thoughtfully implemented it can probably further lower performance and, in the worst case, introduce measurement errors. It would have been nice to have a command-line option to store detailed results in CSV or JSON format.

Also worth noting, and the thing that tipped Locust out of the top category, is the fact that Locust response time measurements are notoriously unreliable, as Locust adds a lot of internal delay to every transaction. You probably only want to look at statistical differences between test runs, or you may want to use Locust for load generation but measure response times by some other means (e.g. New Relic).

So, what is the summary then?

Well if you’re Python-centric, Locust may be the tool for you. You’ll be running pure Python and can use all your standard Python modules etc. However, the horrible performance in terms of load generation capability and measurement accuracy is a big drawback.

If you just like the Python syntax, the Grinder offers Jython, which is Python-in-Java. As much as it galls me to suggest another Java app, the Grinder offers a nice scripting language and pretty damn decent performance, despite being one of the older tools and one that is not updated as frequently as Locust (for example).

Artillery

Website: https://artillery.io

Artillery is interesting because it seems very targeted towards continuous integration (CI) and automation, something few load testing tools do well (or at all). Configure Artillery using YAML or JSON and receive pass/fail results for load tests executed.

It looks like it would sit very well in a CircleCI or Travis test suite. It also allows you to execute custom JS code, although not without also supplying a YAML/JSON config. Here is a sample Artillery JSON config (the one we use to run the benchmark; it loads a single URL in a loop):

{
  "config": {
    "target": "http://myhost.mydomain.com",
    "phases": [{ "duration": 1, "arrivalRate": 20, "name": "startphase" }]
  },
  "scenarios": [
    {
      "flow": [
        {
          "loop": [{ "get": { "url": "/style.css" } }],
          "count": 10000
        }
      ]
    }
  ]
}
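Running it is then a one-liner; assuming the config above is saved as benchmark.json:

artillery run benchmark.json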

The results are stored as JSON data, and there are a few plugins that allow you to send results to statsd, cloudwatch or InfluxDB so results output options are pretty good.

Cool as it is, there are a couple of reasons Artillery doesn’t end up among my favourites, however.

The way load tests are defined is very targeted towards extremely simple tests with few transactions in a user scenario. This may be what 99% of API testers want, I don’t know, but it means Artillery is not as suitable for more complex use cases.

One obvious omission is that there is no way to control test execution time. The authors may think there is, because they are only running short user scenarios, but you cannot interrupt a user scenario iteration.

There is no global test duration parameter that you can define that lets you do that. If you have a long-running user scenario (and in the benchmark mine is an infinite loop, so it can probably be classified as "long-running”), this lack of a global test duration parameter becomes a problem.

To configure Artillery to do something similar to what the other tools in the benchmark do, I had to use a per-VU request count limit that caused the VUs to exit once they reached the target number of requests.

This is not great: I have no way of knowing how long the test will take to run. So Artillery's way of defining "arrival phases" with durations (per phase) and number of arrivals may be perfect for some, but in my mind it lacks sufficient control.

You don’t know how many threads you have executing at any single moment and you cannot make your test run for a certain amount of time if you have more complex user scenarios.

Another negative thing with Artillery is its performance, which is only exceeded in horribleness by Locust.

Artillery is more efficient than Locust in that it generates more requests per CPU core, but just like Locust it cannot make use of more than one core per process. Unlike Locust, there is no built-in support for distributed load generation (but this seems to be on its way) which means that you’re limited to one CPU core unless you want to spend effort scripting some kind of home-built distributed execution.

So, on our benchmarking setup Locust generates only around 600-700 RPS when running on one CPU core, but can easily be configured to run multiple processes which increases performance to at least a little over 2,000 RPS.

Artillery produces about twice as much traffic as Locust per CPU core (~1,400 RPS) but given no distributed mode of operation, that’s it. The rest of the tools come in at 10,000+ RPS, so there is a big gap between these two and the better-performing tools.

So, Artillery can't generate a lot of traffic, and measurement accuracy is not very good at all, just like Locust. At any meaningful load level it will add substantial delay to all your measurements. If measurement accuracy or the ability to run large-scale tests is important to you, you'll probably want to stay well away from Artillery.

Wrk

Website: https://github.com/wg/wrk

Will Glozer created the very cool little Wrk application. I mention it here because it is plain and simply the fastest tool around. On our lab setup it saturated the target machine at ~110,000 requests per second (RPS) when we ran it natively on the source machine (i.e. not Dockerized, which reduces performance for all the tools by ~40%).

The source system was at 50-60% CPU at this point, while all four CPU cores on the target machine were completely busy serving the requests. No other tool managed to saturate the target machine, even when running natively on the source machine.

Wrk can run in both a "static” mode where it just hits a single URL, or it can execute Lua code that allows for dynamic scripting. Wrk blows away the competition with a rate of about 60,000 RPS in our Dockerized benchmark setup while hitting a single, static URL.

Apachebench, which is the next closest in terms of performance, did 30-35,000 RPS under the same circumstances and the other tools I consider to perform "well” (Jmeter, Tsung, Grinder, Boom and Siege) all did between 15,000 and 30,000 RPS. Interestingly, running Lua code does not seem to impact Wrk performance a lot. At least with a very simple Lua scenario, like the one you can see further down, Wrk performs about the same as with a static URL.

But extreme proficiency in one area usually means sacrifices in other areas and Wrk is no exception. Results are limited - you only get aggregated results for a whole test. There is no information on what response codes you got back, for instance.

The scripting API is callback-based, which means any complex user scenarios are not easy to write and the code you produce will not be easy to read either. Here is a very simple script example:

request = function()
   return wrk.format("GET", "http://myhost.mydomain.com/index.html")
end

In other words, the API allows you to define certain Lua functions that Wrk then calls at various points during execution. The "request" function is called once for every request Wrk makes and should return the HTTP request that Wrk will then send. This API is not very expressive, and writing complex user scenarios is going to be a hassle; I wouldn't recommend it.
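For completeness, a run that uses a script like the one above looks something like this (-t threads, -c connections, -d duration, -s the Lua script; script.lua is whatever you saved the snippet as):

wrk -t4 -c100 -d30s -s script.lua http://myhost.mydomain.com/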

Basically, Wrk has one use case where it rocks: when you want to generate the maximum amount of traffic from a single machine, don't care too much about detailed result metrics, and the traffic you want to simulate is not very complex.

Jmeter

Website: https://jmeter.apache.org/

While JMeter is definitely not my favourite, not giving it some kind of credit would be unfair, because it is a very competent tool. It has more features and integrations, and also a larger user base, than any other open-source load testing tool.

If you’re a Java-centric organization, and are already proficient with JMeter, there is often not much you can gain (at least short-term) by switching tools. JMeter can test several different protocols, it offers good performance and scalability, and it has a rich feature set.

The main negative thing about JMeter is that it is "a big, old, clunky beast". A bit like the old generation Lincoln Continental, if you've ever had the great misfortune of driving one of those. Or, to be more concrete: configurations, including user scenario logic, are written in XML, which makes for poor DX (developer experience), because no one enjoys "programming" in XML.

Besides configurations being unpleasant to write, they will also be unpleasant to read. If you’re already using JMeter, I would say that going from XML-based configurations to a more expressive DSL and perhaps a real, dynamic scripting language would probably be the biggest reason to switch to another tool. Here is a sample XML config for JMeter:

<?xml version="1.0" encoding="UTF-8"?>
<jmeterTestPlan version="1.2" properties="2.6" jmeter="2.11 r1554548">
<hashTree>
<TestPlan guiclass="TestPlanGui" testclass="TestPlan" testname="Test Plan" enabled="true">
<stringProp name="TestPlan.comments"></stringProp>
<boolProp name="TestPlan.functional_mode">false</boolProp>
<boolProp name="TestPlan.serialize_threadgroups">false</boolProp>
<elementProp name="TestPlan.user_defined_variables" elementType="Arguments" guiclass="ArgumentsPanel" testclass="Arguments" testname="User Defined Variables" enabled="true">
<collectionProp name="Arguments.arguments"/>
</elementProp>
<stringProp name="TestPlan.user_define_classpath"></stringProp>
</TestPlan>
<hashTree>
<ThreadGroup guiclass="ThreadGroupGui" testclass="ThreadGroup" testname="Thread Group" enabled="true">
<stringProp name="ThreadGroup.on_sample_error">continue</stringProp>
<elementProp name="ThreadGroup.main_controller" elementType="LoopController" guiclass="LoopControlPanel" testclass="LoopController" testname="Loop Controller" enabled="true">
<boolProp name="LoopController.continue_forever">false</boolProp>
<stringProp name="LoopController.loops">10000</stringProp>
</elementProp>
<stringProp name="ThreadGroup.num_threads">20</stringProp>
<stringProp name="ThreadGroup.ramp_time">1</stringProp>
<longProp name="ThreadGroup.start_time">1406901208000</longProp>
<longProp name="ThreadGroup.end_time">1406901208000</longProp>
<boolProp name="ThreadGroup.scheduler">false</boolProp>
<stringProp name="ThreadGroup.duration"></stringProp>
<stringProp name="ThreadGroup.delay"></stringProp>
</ThreadGroup>
<hashTree>
<HTTPSamplerProxy guiclass="HttpTestSampleGui" testclass="HTTPSamplerProxy" testname="HTTP Request" enabled="true">
<elementProp name="HTTPsampler.Arguments" elementType="Arguments" guiclass="HTTPArgumentsPanel" testclass="Arguments" testname="User Defined Variables" enabled="true">
<collectionProp name="Arguments.arguments"/>
</elementProp>
<stringProp name="HTTPSampler.domain">myhost.mydomain.com</stringProp>
<stringProp name="HTTPSampler.port"></stringProp>
<stringProp name="HTTPSampler.connect_timeout"></stringProp>
<stringProp name="HTTPSampler.response_timeout"></stringProp>
<stringProp name="HTTPSampler.protocol"></stringProp>
<stringProp name="HTTPSampler.contentEncoding"></stringProp>
<stringProp name="HTTPSampler.path">/style.css</stringProp>
<stringProp name="HTTPSampler.method">GET</stringProp>
<boolProp name="HTTPSampler.follow_redirects">true</boolProp>
<boolProp name="HTTPSampler.auto_redirects">false</boolProp>
<boolProp name="HTTPSampler.use_keepalive">true</boolProp>
<boolProp name="HTTPSampler.DO_MULTIPART_POST">false</boolProp>
<boolProp name="HTTPSampler.monitor">false</boolProp>
<stringProp name="HTTPSampler.embedded_url_re"></stringProp>
</HTTPSamplerProxy>
</hashTree>
</hashTree>
</hashTree>
</hashTree>
</jmeterTestPlan>

The above can probably be shortened quite a bit, who knows. Another thing is the fact that running JMeter requires a reasonable level of familiarity with the operation of Java apps and the JVM.
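For reference, actually launching a test plan like this from the command line, in non-GUI mode, is a one-liner (assuming the plan is saved as testplan.jmx; -n means non-GUI mode, -t points at the test plan and -l at the results log):

jmeter -n -t testplan.jmx -l results.jtl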

There always seem to be some headaches involved at some point when you're trying to set up a JMeter load generation system.

For instance, I ran into some weird but known JVM issue where changing a system limit (ulimit -v, see https://bugs.openjdk.java.net/browse/JDK-8043516) caused Jmeter to be unable to start (I couldn’t even get it to print its version number) unless I also specified -XX:MaxHeapSize=... and -XX:CompressedClassSpaceSize=...

It was a very weird and unintuitive bug, in my opinion. I worked around it by not using ulimit -v to set max virtual memory size, but instead set the system defaults to what I wanted, even if those defaults were the same as the values I had tried to set with ulimit...

This type of issue may of course be more JVM/Java-related than JMeter-related, so perhaps it could happen to Grinder and Gatling also, but I feel that these types of things happen more often with JMeter. (I didn’t have the same issue with Gatling or Grinder this time, despite running them using the same JVM.)

JMeter feels like it is usually one level up in trickiness to get started with than most other tools. Comparing the "Getting started” web pages for JMeter, Gatling and Grinder, the JMeter page is clearly the longest and most complicated also.

So, JMeter can do a lot, performs well, is very configurable and has a large community. It is also not user-friendly, has a scary XML-based configuration/DSL and comes with a somewhat high level of swearword-induction for us people who don’t run a lot of enterprise Java applications.

Apachebench

Website: https://httpd.apache.org/docs/current/programs/ab.html

Apachebench isn’t bad at all. It is very fast, despite being able to utilize just 1 CPU core. It is simple to use, and it provides a bunch of useful options. Reporting is limited, but not as limited as with Wrk.
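A typical single-URL run, roughly matching what the benchmark does, looks something like this (-n is the total number of requests, -c the concurrency, -k enables HTTP keep-alive):

ab -n 100000 -c 20 -k http://myhost.mydomain.com/style.css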

In general, if you just want to hit a single URL and want a tool that has at least the most essential feature set, then Apachebench may be the tool for you. At first I put Apachebench in the section called "The rest of the tools", but I changed my mind and decided to put it under "honorable mentions" because it really is by far the best tool overall when you only want to hit one single URL.

I don’t really have anything negative to say about Apachebench. It is a tool focused on one job, and it does it very well.

The rest of the tools

As we now have the bottom half of the tools left, I might as well go through those also. Some are not bad, others have a bit too many quirks and probably justify a bit of ranting for having wasted part of my life, so here goes.

Vegeta

Website: https://github.com/tsenart/vegeta

Vegeta is somewhat similar to Artillery in that it seems geared towards an automated testing use-case. It does not have a concept of VUs/threads but instead you set a target request rate (requests per second) and Vegeta will try to achieve that. There is also no ramp functionality to ramp up/down the request rate.

This makes benchmarking pretty tough, as we have to experiment with different static request rates and see where we get the most throughput without errors. Vegeta is designed to generate requests at the specified rate and simply drop the ones it cannot handle once it runs out of CPU. Since it spends more CPU cycles the more requests it tries to generate, specifying too high an RPS number causes a substantial drop in the total number of successful requests. In other words, unless you specify exactly the right request rate, you will not know the maximum number of requests per second Vegeta can actually push through.

For a use case where you want to find out maximum throughput of an application, Vegeta is therefore not a good choice of tool, but if you want to do simple, static URL testing with constant load it can be a good choice.
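A constant-rate run against a single URL looks something like this; targets are read from stdin, and the rate and duration values here are just examples:

echo "GET http://myhost.mydomain.com/style.css" | vegeta attack -rate=1000 -duration=30s | vegeta report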

Another annoying thing with Vegeta is the fact that while it has options to specify concurrency (-connections and -workers), these options configure only initial values - Vegeta will change them at runtime as it feels appropriate.

This means that the only thing you can really order Vegeta to do is to send requests at a certain rate - the other parameters are up to the app. Comparing it to other load testing tools is very hard under these circumstances. Just like with Artillery I feel there is a lack of control that would have been nice to have.

Vegeta uses multiple CPUs well, but is fairly average in terms of efficiency (how much traffic it can pump out per CPU core). It has a fair number of command-line options to do various things with the URLs it fetches (which are read from stdin), it supports HTTP/2, and it stores results in a binary format that you can then feed back into Vegeta to create reports or output results in JSON or CSV format.

The README describes how to do distributed testing with a combo of Vegeta, shell remote execution and some chewing gum. I’m sure it works, but it’s something almost any command-line tool can do so I can’t really say Vegeta has "distributed load generation capability” (or, technically, that all tools have it).

One thing that is a bit unique with Vegeta is that you can also use it as a Go library - i.e. you can write your own load testing tool using the Vegeta library.

Siege

Website: https://www.joedog.org/siege-home/

Sorry to say, Siege makes me a bit frustrated when I use it. Apologies to Jeff Fulmer for some of what I write below. He has, after all, given the world a useful, free application, but I just have to let off some steam here.

Siege’s main advantages: it works, is simple to get started with, and has a nice online manual at https://www.joedog.org/siege-manual/. But it has several annoying things about it that would make me use it only as a last resort. So, what is so bad then?

Overall, the design and operation is slightly odd, doesn’t feel 100% thought through and the docs and help tell you things that turn out to be false. It has the feeling of a quick hack that would have needed refactoring, design-wise, at some point, but which never got it. An example: When you execute Siege, it will tell you how many transactions it has done:

Transactions: 458 hits

Then it will tell you how many successful transactions it has seen:

Successful transactions: 471

More successful transactions than transactions? Doesn’t make sense. I realized that I was telling Siege to load test a URL where the server responded with a 3xx redirect, and thought that might have something to do with it. Yes. The docs on the Siege web page say that "redirects are considered successful transactions.”

Fine, but then why don’t I have twice the number of successful transactions here? (as every request in this test resulted in a 301 redirect response and then a 200 when Siege went to the redirected URL). After a little thinking, and with the word "hits” giving a hint, I think I have figured out how Siege works.

A "transaction” is not an HTTP request-response pair but instead a so-called "hit” on the web server. (I associate an HTTP request-response pair with the word "transaction,” but maybe I’m the one with the odd idea here. I tend to think in terms of network requests and responses - an old war injury.)

A transaction is considered to be one or more HTTP request-response pairs that either return the file/resource the client wants, or an error code >=400. When a server returns a 3xx redirect it is just considered an extra, intermediate step of the transaction.

  • The "Transactions” metric counts the number of response codes seen that are "not redirects”, i.e. either < 300 or >= 400
  • The "Successful transactions” metric, on the other hand, must be calculated by adding 1) the number of response codes seen that are < 300 (success) AND 2) the number of 3xx redirect responses that have not yet been redirected (when they are redirected they turn into a "Transaction”, but will not increment "Successful transactions” again, I assume...)

Huh? If you don’t consider a redirect to be a complete transaction, why include them in the "successful transactions” metric? There is probably some thought behind this second metric, whatever it may be, but I think I’d rather stick pins in my eyes than try to figure it out.

Then we have the units Siege uses. It reports response time in seconds, with two decimals of precision. Hello? Is this the 1800’s? Computers and networks are fast today, and in our benchmark setup very few requests take more than 1ms to complete.

What does Siege report in such a situation?

It reports a response time of "0.00 secs". Very useful. I have been complaining about the Java load testing tools being unable to provide better resolution than 1ms, but here we have a tool that, in 2017, thinks 10ms is a perfectly good highest resolution.

It may be that Siege is actually unable to do better than that, as the benchmark tests have indicated that Siege adds a substantial amount of delay to every single request. This extra delay seems to start at just below 10 ms and, for some incredibly strange reason, to depend mainly on network delay. I.e. it seems Siege will add more extra delay to every request, on average, if the network delay between source and target machines is high than if that delay is low.

Another minor units-related gripe is the fact that you specify a 30-second test duration as "30S". Note the capital S. I have never in modern times seen the unit "seconds" expressed using a capital 'S'.

I think I may have seen it in some COBOL or AS/400 RPG programs (or perhaps a VAX/VMS system - they like everything to be uppercase) in one of my earlier lives, but there is something to be said for writing things the way everybody else does. Likewise in Siege, minutes are capital "M" and hours capital "H".
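For reference, a 30-second run at 20 concurrent users looks something like this (-c concurrency, -t test duration):

siege -c 20 -t 30S http://myhost.mydomain.com/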

I asked our developers if they had seen this way of expressing time, and one responded that she would have interpreted "30M” as "30 months”.

Some other annoyances I ran into stem from system-wide siegerc files (e.g. /usr/local/etc/siegerc) installed by a package manager, which configure different default settings for Siege.

The man pages and docs tell you what defaults to expect when running Siege without specifying things explicitly on the command line, but when you try it in practice you find that the defaults aren't what the docs said.

Not sure if this is the author’s fault, or some stupid package (apt) maintainer that decided to change defaults between Siege versions to make life hard for the users, but it is still annoying.

Finally, Siege crashes if you try to set the concurrency too high (at around 500 VU it started to get unstable when I used it).

The only reason to use Siege, in my opinion, is if you need a simple, Apachebench-style tool that you can feed a list of URLs, rather than just a single URL. Which may be a pretty common use case, of course, so I'm not saying Siege is for no one.

What I am saying is that it's not something that will be future proof for anyone who wants to progress and create more advanced test cases beyond that static list of URLs. Also, it may not be wise to trust the Siege response time measurements.

Boom

Website: https://github.com/rakyll/boom

Last in this review comes Boom. I’m talking about the new Go version, to avoid any confusion. If you go to the Github repo you’ll see that it has changed its name to "Hey”, to avoid getting mixed up with the old Python Boom, but as it was renamed while I was doing this review, and because I think "Hey” is a silly name for a load testing tool, I’ll continue calling it Boom for now.

If you get confused, just go to the URL above. (Of course, that URL will direct you to https://github.com/rakyll/hey so I guess I should probably direct you there instead.)

Boom is boring to review because it seems pretty solid, while at the same time it doesn’t shine in any particular area.

It is a simple tool. The README states that the aim was to replace Apachebench. Unfortunately, the author forgot to implement anything that would have made it interesting to Apachebench users to switch.

Boom performs decently, but not as well as Apachebench, and its general feature set is a lot more basic than Apachebench's. What it does have is HTTP/2 support and the fact that it is written in Go, which may attract the Go community, I guess.
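A basic run looks much like the Apachebench example earlier (-n total number of requests, -c concurrent workers):

hey -n 100000 -c 20 http://myhost.mydomain.com/style.css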

To sum up, I would say it makes sense to use Boom/Hey if you’re into Go. If not, use Apachebench.

Or, possibly, if you have a machine with 8 or more CPU cores you may want to consider Boom as it can utilize multiple cores, but Apachebench can’t. (With up to 4 cores though, Apachebench is still faster.)

See this follow-up post for technical details of the open source load testing tool benchmark!

What are your thoughts about your favorite OSS testing tools?

Ragnar Lönn is Founder and former Head of Product Development at Load Impact, the world’s most popular online load-testing service for testing the performance of websites, apps and APIs. Load Impact users have executed over 2.1 million load tests since 2009. Previously, Ragnar worked in the ISP industry and founded one of Sweden’s first Internet service providers, Algonet, in 1994. His more unexpected interests include running, macroeconomics, and energy-storage technologies.
