Cutelyst on TechEmpower benchmarks round 16

Yesterday TechEmpower released the results for round 16 of their benchmarking tests, you can see their blog about it here. And like for round 15 I'd like add my commentary about it here.

Before you look into the results web site it's important to be aware of a few things, first round 16 runs on a new hardware newer and more powerful than the previous rounds, they also did a Dockerization of the tests which allowed us to pull different distro images, cache package install and isolate from other frameworks. So don't try to compare to round 15.

Having put Cutelyst under testing there has brought many benefits to it, in previous rounds we noticed that due testing on a server with many CPU cores it letting the operating system do the scheduling wouldn't be a good idea, so we added CPU affinity feature to cutelyst-wsgi, while uWSGI also has it our logic does this core pinning for threads and not only process.

While testing at home on an AMD Phenon II x4 using pre-fork mode was much faster than threaded mode, but on my 5th Gen Intel laptop the results were closer but threading was much better, and thanks to TFB I found out that each process were being assigned to CPU core 0, which made 28 processes be bound to a single core. In this new round you can now see the difference between running threaded or in pre-fork mode is negligible in some tests pre-fork is even faster. Pre-fork does consume more RAM, running 100 threads of Virtlyst uses around 35MiB, while 100 process around 130 MiB but multiple process are better if your code happens to crash.

An important feature of Tech Empower Benchmarks are it's filters, they allow you to filter stuff that doesn't matter to you, due the above fix you call look at the results and see the close threads vs pre-fork match here.

Using an HTML templating engine is very important for real world apps, in round 16 I've added a test that uses Grantlee for rendering, they are around 46% slower than creating the HTML directly on C++, there might be room for improvements in Grantlee but the fortunes results aren't bad.

Some people asked why there were some many occurrences of Cutelyst in the tests, the reason is that these benchmarks allow you to tests some features and see what performs better, in round 15 it became clear that using EPoll vs Qt's glib based event loop was a clear win, so in 16 we don't have Qt's glib tested anymore and EPoll is now the default event dispatcher in Cutelyst 2 on Linux.

This round however I tried to make a reasoning about TCP_NODELAY and while results are closer it's a bit clear that due the blocking nature of the SQL tests TCP_NODELAY will decrease latency and increase performance a bit due it not waiting to send a bigger TCP packet.

As I also mentioned on the release of Cutelyst 2.4.0 a fix was made to avoid stack overflow and you can see the results on the "Plain Text" test, before that fix cutelyst was crashing and respawning and that limited the results to 1,000,000 requests/second, after the fix it is 2,800,000 requests/second.

This round used Ubuntu 18.04 as base, Cutelyst 2.4.0 and Qt 5.9.5.

Last but not least if you still try to compare to round 15 :) you might notice Cutelyst went lower on the ranking, that's in part due frameworks getting optimizations but mostly due new micro/platform frameworks, but you can enable the Fullstack classification filter if you want to compare frameworks that would provide similar features that what Cutelyst offers.