Testing the Limits: Performance Benchmarks for Conduit

By  Haris Osmanagić

 3 Nov 2022


Introduction

Conduit is meant to be a Kafka Connect replacement with a better developer experience, but it’s just as easy to use for building real-time data pipelines in general. For that reason, we didn’t want to wait too long to find out how much Conduit can handle, or what it takes to break it. To answer those two questions, we developed a benchmarking tool. In this blog post, we’ll share our experience building and using it.

Types of performance testing

There are different types of performance testing, and in Conduit we started with the following three:

  • Load testing (i.e. testing Conduit with expected load)
  • Stress testing (i.e. testing Conduit with unusually high load)
  • Spike testing (i.e. testing Conduit with suddenly increasing or decreasing loads)

We don’t plan to stop here: we intend to expand our tests to include other types of performance testing, especially soak and capacity testing.

Principles

First, let’s mention the principles upon which we built this version of the benchmarks:

It should be possible to track the performance of Conduit itself (i.e. without connectors included)

One thing we’re especially interested in is the performance of Conduit itself. Let’s remember what a pipeline looks like:

 

Diagram: a Conduit pipeline with its connectors

 

Connectors are pluggable components which can greatly affect the performance of a pipeline. For that reason, we decided to have a number of tests that cancel out the effects of connectors. We achieved this using two special types of connectors (a conceptual sketch follows the list below):

  1. A generator source, for which generating a record comes at virtually no cost, but which can be configured to send data at a specified rate (or at varying rates, to simulate spikes).
  2. A NoOp destination, which simply drops all records without doing anything.
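
To make the idea concrete, here’s a minimal, self-contained Go sketch of what this pair of connectors boils down to conceptually. It is not Conduit’s actual connector code (real connectors are built against the Conduit connector SDK); the record type and the numbers below are made up for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// record stands in for a Conduit record; the real type lives in the
// Conduit connector SDK, this one is just for illustration.
type record struct {
	payload []byte
}

// generate emits records at the given rate (records per second) until the
// duration elapses. Creating a record is essentially free, so the generator
// itself adds no meaningful overhead to the pipeline.
func generate(rate int, d time.Duration, out chan<- record) {
	ticker := time.NewTicker(time.Second / time.Duration(rate))
	defer ticker.Stop()
	deadline := time.After(d)
	for {
		select {
		case <-ticker.C:
			out <- record{payload: []byte(`{"id":1}`)}
		case <-deadline:
			close(out)
			return
		}
	}
}

// noOpDestination drops every record it receives, so the destination side
// contributes virtually nothing to the measured cost of the pipeline.
func noOpDestination(in <-chan record) (count int) {
	for range in {
		count++
	}
	return count
}

func main() {
	ch := make(chan record, 1024)
	go generate(1000, 5*time.Second, ch) // 1000 records/s for 5 seconds
	fmt.Println("records received:", noOpDestination(ch))
}
```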

 

It should be possible to track performance of Conduit with the connectors included

While zooming in on Conduit’s own performance is definitely helpful, we don’t want the performance testing framework to be limited to that; it should also be possible to test Conduit together with connectors. That would be helpful for a number of reasons:

  1. To know what to expect, and what not to expect, from a production environment
  2. To try to reproduce behavior from a production environment
  3. To performance-test a connector you developed (e.g. if you developed a source connector, you can test it using the NoOp destination connector)

 

Benchmarks are run on-demand (automated benchmarks are planned for later)

As a first step, it’s acceptable for the performance tests to be run manually. Automated tests are a great tool for comparing the performance of two releases, or for making sure that code changes didn’t introduce degradations. However, before answering the question “was this a good change compared to the previous state?”, we need to establish a baseline. Automated benchmarks are on our roadmap, and with them we hope to be able to answer both questions.

It should be easy to manage workloads

Workloads are one of the most important parts of a performance test, so we’d like to be able to add them easily. In Conduit’s case, a workload has two significant parts:

  1. Conduit’s own configuration
  2. Pipeline setup

Ideally, both configurations can live in files. At the time we developed the benchmarking framework, pipeline configuration files were still a work in progress, so workloads are specified via Bash scripts, which create pipelines using the HTTP API. Here you can find an example of a workload which simulates bursts, i.e. performs spike testing.

The connector configuration (which is what generates the load) can be clearly seen in the scripts. Still, the scripts are relatively verbose, and we plan to replace them with pipeline configuration files.
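
For illustration, here’s a rough Go equivalent of what such a workload script does: create a pipeline, attach a generator source and a destination to it, and start the pipeline via the HTTP API. The endpoint paths, request fields, enum values and plugin references below are assumptions based on how Conduit’s gRPC-gateway-style API generally looks; check the current API reference before reusing them.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

const baseURL = "http://localhost:8080/v1" // assumed default Conduit HTTP address

// post sends a JSON body to the Conduit HTTP API and prints the response status.
func post(path, body string) {
	resp, err := http.Post(baseURL+path, "application/json", bytes.NewBufferString(body))
	if err != nil {
		fmt.Println(path, "->", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println(path, "->", resp.Status)
}

func main() {
	// 1. Create the pipeline (field names are illustrative).
	post("/pipelines", `{"config": {"name": "benchmark-spike", "description": "spike test workload"}}`)

	// 2. Attach a generator source and the NoOp destination.
	//    A real script would read the pipeline ID from the create response;
	//    "<pipeline-id>", the settings keys and the destination plugin
	//    reference are placeholders.
	post("/connectors", `{"type": "TYPE_SOURCE", "plugin": "builtin:generator", "pipeline_id": "<pipeline-id>", "config": {"name": "source", "settings": {"rate": "1000"}}}`)
	post("/connectors", `{"type": "TYPE_DESTINATION", "plugin": "<path-to-noop-destination>", "pipeline_id": "<pipeline-id>", "config": {"name": "destination", "settings": {}}}`)

	// 3. Start the pipeline.
	post("/pipelines/<pipeline-id>/start", `{}`)
}
```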

Metrics of interest

When we set out to write the benchmarking framework, one of the first questions we answered was: “what are we actually interested in?” Generally speaking, in performance tests we want to know how fast the work was performed, but also which resources were used.

As for the “work performed” part, we chose to monitor the number of records per second and the number of bytes per second, as they are the most important indicators of a pipeline’s performance. If you have metrics related to individual objects/events (for example, we track the time Conduit spends on a record), it’s also useful to show percentiles.
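
As a side note on percentiles: they come almost for free if per-record timings are recorded in a histogram. The sketch below uses the Prometheus Go client the way any Go service could; the metric name, buckets and simulated work are illustrative and are not Conduit’s actual instrumentation.

```go
package main

import (
	"math/rand"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// recordDuration is an illustrative histogram for the time spent on a single
// record; Prometheus can later turn its buckets into percentiles.
var recordDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "example_record_duration_seconds",
	Help:    "Time spent processing a single record (illustrative).",
	Buckets: prometheus.ExponentialBuckets(0.0001, 2, 12), // 100µs up to ~200ms
})

func init() {
	prometheus.MustRegister(recordDuration)
}

// processRecord simulates work on a record and records how long it took.
func processRecord() {
	start := time.Now()
	time.Sleep(time.Duration(rand.Intn(5)) * time.Millisecond) // stand-in for real work
	recordDuration.Observe(time.Since(start).Seconds())
}

func main() {
	// Expose the metrics so Prometheus can scrape them; Conduit serves its
	// metrics through its own HTTP API, this is just the plain client-library way.
	http.Handle("/metrics", promhttp.Handler())
	go func() { _ = http.ListenAndServe(":2112", nil) }()

	for {
		processRecord()
	}
}
```

On the Prometheus side, the histogram_quantile() function over the exposed buckets then yields, for example, the 99th percentile.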

With regards to resource usage, we’re generally interested in CPU and memory usage. Conduit itself doesn’t use disk or network heavily, so we’re not keeping a close eye on those. 

Data collection

Regardless of which metrics you define, all the collected data needs to be linked to the test it belongs to. This can be done via the test name, a timestamp, the version of the system under test, the version of the test framework, etc.

Conduit comes with a number of already defined metrics. The available metrics are exposed through the HTTP API and are ready to be scraped by Prometheus. You can find more information about the metrics here.

With that, using a tool like Grafana to monitor Conduit makes a lot of sense. While we do monitor Conduit through Grafana too, it’s not our primary method. Eventually, we’d like to be able to compare metrics from different test runs (e.g. to check whether there were performance degradations between two releases). Comparing results using Prometheus or Grafana alone isn’t easy, so we wrote a simple tool which collects Conduit-specific metrics and saves them to a CSV file.
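
The tool itself is not much more complicated than the following sketch: fetch the metrics endpoint, keep the lines you care about, and append them to a CSV file together with the test name and a timestamp, so every row stays linked to its test run. The /metrics path and the conduit_ prefix are assumptions; adjust them to your setup.

```go
package main

import (
	"bufio"
	"encoding/csv"
	"net/http"
	"os"
	"strings"
	"time"
)

// collect scrapes a Prometheus-formatted metrics endpoint and appends every
// metric whose name starts with the given prefix to a CSV file, tagged with
// the test name and the scrape time.
func collect(metricsURL, prefix, testName, csvPath string) error {
	resp, err := http.Get(metricsURL)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	f, err := os.OpenFile(csvPath, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		return err
	}
	defer f.Close()
	w := csv.NewWriter(f)
	defer w.Flush()

	now := time.Now().UTC().Format(time.RFC3339)
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.HasPrefix(line, "#") || !strings.HasPrefix(line, prefix) {
			continue // skip comments and unrelated metrics
		}
		// A simple metrics line looks like: metric_name{labels} value
		// (good enough for this sketch; label values with spaces would need real parsing).
		fields := strings.Fields(line)
		if len(fields) < 2 {
			continue
		}
		if err := w.Write([]string{testName, now, fields[0], fields[1]}); err != nil {
			return err
		}
	}
	return scanner.Err()
}

func main() {
	// The metric prefix "conduit_" and the /metrics path are assumptions.
	_ = collect("http://localhost:8080/metrics", "conduit_", "spike-test-1", "results.csv")
}
```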

When it comes to collecting data about resource usage, we do it in two ways. The first is instrumenting Conduit with the Prometheus client library, which gives us a lot of information about the internals (e.g. memory allocations, heap statistics, number of goroutines, etc.). The second is DataDog, which we use for general VM stats (mostly CPU- and memory-related metrics).

Here’s a tip if you’re visualizing your data: add a break between test runs. Otherwise, if test N+1 starts immediately after test N is done, all you might see on your graph is a single fall or increase, which makes it more difficult to correlate the test results with your graphs.

Target instance

We recommend running Conduit on an instance with 2 CPUs and 4 GB of RAM, so we’re running the tests against VMs with the same specifications.

The test framework we developed can run the tests either against Conduit in Docker containers or against Conduit installed on an AWS EC2 instance (sidenote: we have a great guide for launching an AWS EC2 instance and installing Conduit from scratch!).

When it comes to testing on EC2 instances, here are a couple of things we’d like to share with you:

  1. Don’t forget about them! Especially if you’re not using them very often. Otherwise, your next AWS bill may be a big surprise.
  2. Be well informed about throttling on the instance you’re using. Certain types of instances will be throttled once you run out of credits, which may affect the test results.

Data evaluation

The first step here is to actually question the data. This is especially important in cases where you’ve written some code yourself to expose certain metrics or to collect them. For example:

  1. Have you calculated a metric correctly? 
  2. Are the units correct and expected (nanoseconds vs milliseconds, megabytes vs mebibytes, etc.)?
  3. Are you able to cross check the metrics? (e.g. if a pipeline rate is shown as 100 records per second, do you actually see 6000 records in a destination after 60 seconds?)
  4. Are time zones matching? (e.g. when checking resource usage, make sure you see the same time zones in your resource graphs and your test results)

Once you are confident in your test results, you can actually start evaluating the data. Here are a few questions which may help:

  1. Is the data in a test result consistent? If not, why not? For example, in some test results we saw that Conduit spent 100 ms on a record (figures are for illustrative purposes), so you might expect a throughput of 10 records per second. However, the throughput was actually much higher. We then recalled that there was concurrent processing involved, which explained the numbers (see the note after this list).
  2. What’s the relationship between a workload and the resource usage? Are you seeing the expected increase in resource usage when you increase the workload in a specific way? For example, in our tests with large records, we do expect the memory to go up. Or, if you have spike tests, does the resource usage go back to normal once a burst is done?
  3. Is there a relationship between different workloads? 
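
Regarding the first point, the apparent contradiction disappears once you account for how many records are in flight at the same time. Roughly (the 20 records in flight below is an illustrative number, not a measurement):

```latex
\text{throughput} \approx \frac{\text{records in flight}}{\text{time per record}},
\qquad \text{e.g.}\quad \frac{20}{100\ \text{ms}} = 200\ \text{records/s} \gg 10\ \text{records/s}.
```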

Results and observations

Large messages (4MB payloads)

By default, gRPC messages are limited to 4 MB in size. We also think that, in the majority of cases, messages in data streams are much smaller than that, so this feels like a good upper bound to test. We have two variations of this test: one with a rate of 100 msg/s and one with a rate of 1000 msg/s.

At a rate of 1000 msg/s, the measured throughput is around 200 msg/s. We did expect the throughput to be lower than the configured rate, but this is something we’re going to look into and try to improve.
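
As a back-of-the-envelope cross-check of what that throughput means in bytes:

```latex
200\ \text{records/s} \times 4\ \text{MB/record} \approx 800\ \text{MB/s}.
```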

Graph: CPU usage for large message payloads

Graph: Pipeline throughput for large message payloads

Graph: Memory usage for large message payloads

Small messages, high rates

We ran a few tests with message payloads of 1 KB in size. The rates were: 10k msg/s, 15k msg/s, and 20k msg/s.

Generator rate (msg/s)                         | Pipeline rate (msg/s) | CPU (%) | Memory usage (GB)
10 000                                         | 6 650                 | 46      | 1.4
15 000                                         | 10 550                | 55      | 1.55-1.7
20 000                                         | 13 270                | 62      | 1.55-1.7
“Insane” (records sent as quickly as possible) | 29 000                | 77      | 1-1.4

As we can see, the actual throughput is roughly 70% of the configured generator rate. We have an issue open to investigate this difference. We hypothesize that, at higher rates, the ratio between the time the generator sleeps and the time it takes to return and acknowledge a record becomes more significant. In other words, it’s possible that, at higher rates, the generator produces fewer records than specified.
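
One simple way to express that hypothesis: if the generator sleeps for 1/r between records (r being the configured rate) and then spends a fixed time t returning and acknowledging each record, the effective rate is

```latex
\text{effective rate} = \frac{1}{\frac{1}{r} + t},
\qquad \text{e.g.}\quad \frac{1}{\frac{1}{10\,000/\text{s}} + 50\ \mu\text{s}}
= \frac{1}{100\ \mu\text{s} + 50\ \mu\text{s}} \approx 6\,667\ \text{records/s}.
```

The 50 µs overhead here is purely illustrative, not a measured value; with it, the result happens to land close to the 6 650 records/s we measured for the 10 000 records/s workload, but it’s the open issue that will actually confirm or refute the hypothesis.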

Bonus workload: we also have a workload where the generator produces messages as quickly as possible (the “insane” row in the table above).

Small message bursts

In this workload, we have a generator producing 10 msg/s as the base rate, with 30-second bursts of 1000 msg/s separated by 30 seconds at the base rate.

The CPU usage oscillated between 0 and 10%, and the time between peaks was exactly 60 seconds, which corresponds to the configured burst cycle (30 seconds of burst plus 30 seconds at the base rate).

Improvement loops

Last but not least, let the tests “soak” a little bit. Running them periodically or even frequently will show you how to make them more efficient and easier to run, and which additional metrics you do or don’t need. Another way to improve your benchmarks is to open-source them, letting others use them and suggest improvements. Here’s us doing exactly that. We’re looking forward to your questions, comments and suggestions!


Haris Osmanagić

Senior Software Engineer working on Conduit