Cutelyst 1.6.0 released, to infinity and beyond!

Once 1.5.0 was release I thought the next release would be a small one, it started with a bunch of bug fixes, Simon Wilper made a contribution to Utils::Sql, basically when things get out to production you find bugs, so there were tons of fixes to WSGI module.

Then TechEmpower benchmarks first preview for round 14 came out, Cutelyst performance was great, so I was planning to release 1.6.0 as it was but second preview fixed a bug that Cutelyst results were scaled up, so our performance was worse than on round 13, and that didn't make sense since now it had jemalloc and a few other improvements.

Actually the results on the 40+HT core server were close to the one I did locally with a single thread.

Looking at the machine state it was clear that only a few (9) workers were running at the same time, I then decided to create an experimental connection balancer for threads. Basically the main thread accepts incoming connections and evenly pass them to each thread, this of course puts a new bottleneck on the main thread. Once the code was ready which end up improving other parts of WSGI I became aware of SO_REUSEPORT.

The socket option reuse port is available on Linux >3.9, and different from BSD it implements a simple load balancer. This obsoleted my thread balancer but it still useful on !Linux. This option is also nicer since it works for process as well.

With 80 cores there's still the change that the OS scheduler put most of your threads on the same cores, and maybe even move them when under load. So an option for setting a CPU affinity was also added, this allows for each work be pinned to one or more cores evenly. It uses the same logic as uwsgi.

Now that WSGI module supported all these features preview 3 of benchmarks came out and the results where still terrible... further investigation revealed that a variable supposed to be set with CPU core count was set to 8 instead of 80. I'm sure all this work did improve performance for servers with a lots of cores so in the end the wrong interpretation was good after all :)

Preview 4 came out and we are back to the top, I'll do another post once it's final.

Code name "to infinity and beyond" came to head due scalability options it got :D

Last but not least I did my best to get rid of doxygen missing documentation warnings.

Have fun

Cutelyst 1.5.0 released, I18N and HTTPS built-in

Cutelyst the C++/Qt web framework just got a new stable release.

Right after last release Matthias Fehring made another important contribution adding support for internationalization in Cutelyst, you can now have your Grantlee templates properly translated depending on user setting.

Then on IRC an user asked if Cutelyst-WSGI had HTTPS support, which it didn't, you could enable HTTPS (as I do) using NGINX in front of your application or using uwsgi, but of course having that build-in Cutelyst-WSGI is a lot more convenient especially since his use would be for embedded devices.

Cutelyst-WSGI also got support for --umask, --pidfile, --pidfile2 and --stop that will send a signal to stop the instance based on the pidfile provided, better documentation. Fixes for respawning and cheaping workers, and since I have it now on production of all my web applications FastCGI got some critical fixes.

The cutelyst command was updated to use WSGI library to work on Windows and OSX without requiring uwsgi, making development easier. got Cutelyst logo, and an updated CMlyst version, though the site still looks ugly... .

Download here: .

Have fun! .

Cutelyst 1.4.0 released, C100K ready.

Yes, it's not a typo.

Thanks to the last batch of improvements and with the great help of jemalloc, cutelyst-wsgi can do 100k request per second using a single thread/process on my i5 CPU. Without the use of jemalloc the rate was around 85k req/s.

This together with the EPoll event loop can really scale your web application, initially I thought that the option to replace the default glib (on Unix) event loop of Qt had no gain, but after increasing the connection number it handle them a lot better. With 256 connections the request per second using glib event loop get's to 65k req/s while the EPoll one stays at 90k req/s a lot closer to the number when only 32 connections is tested.

Beside these lovely numbers Matthias Fehring added a new Session backend memcached and a change to finally get translations to work on Grantlee templates. The cutelyst-wsgi got --socket-timeout, --lazy, many fixes, removal of usage of deprecated Qt API, and Unix signal handling seems to be working properly now.

Get it!

Hang on FreeNode #cutelyst IRC channel or Google groups:!forum/cutelyst

Have fun!

Cutelyst 1.3.0 released

Only 21 days after the last stable release and some huge progress was made.

The first big addition is a contribution by Matthias Fehring, which adds a validator module, allowing you to validate user input fast and easy. A multitude of user input types is available, such as email, IP address, JSON, date and many more. With a syntax that can be used in multiple threads and avoid recreating the parsing rules:

static Validator v({ new ValidatorRequired(QStringLiteral("username") });
if (v.validate(c,Validator::FillStashOnError)) { ... }

Then I wanted to replace uWSGI on my server and use cutelyst-wsgi, but although performance benchmark shows that NGINX still talks faster to cutelyst-wsgi using proxy_pass (HTTP), I wanted to have FastCGI or uwsgi protocol support.

Evaluating FastCGI vs uwsgi was somehow easy, FastCGI is widely supported and due a bad design decision uwsgi protocol has no concept of keep alive. So the client talks to NGINX with keep alive but NGINX when talking to your app keeps closing the connection, and this makes a huge difference, even if you are using UNIX domain sockets.

uWSGI has served us well, but performance and flexible wise it's not the right choice anymore, uWSGI when in async mode has a fixed number of workers, which makes forking take longer and user a lot of more RAM memory, it also doesn't support keep alive on any protocol, it will in 2.1 release (that nobody knows when will be release) support keep alive in HTTP but I still fail to see how that would scale with fixed resources.

Here are some numbers when benchmarking with a single worker on my laptop:

uWSGI 30k req/s (FastCGI protocol doesn't support keep conn)
uWSGI 32k req/s (uwsgi protocol that also doesn't support keeping connections)
cutelyst-wsgi 24k req/s (FastCGI keep_conn off)
cutelyst-wsgi 40k req/s (FastCGI keep_conn on)
cutelyst-wsgi 42k req/s (HTTP proxy_pass with keep conn on)

As you can see the uwsgi protocol is faster than FastCGI so if you still need uWSGI, use uwsgi protocol, but there's a clear win in using cutelyst-wsgi.

UNIX sockets weren't supported in cutelyst-wsgi and are now supported with a HACK, yeah sadly QLocalServer doesn't expose the socket description, plus another few stuff which are waiting for response on their bug reports (maybe I find time to write and ask for review), so I inspect the children() until a QSocketNotifier is found and there I get it. Works great but might break in future Qt releases I know, at least it won't crash.

With UNIX sockets command line options like --uid, --gid, --chown-socket, --socket-access, as well as systemd notify integration.

All of this made me review some code and realize a bad decision I've made which was to store headers in lower case, since uWSGI and FastCGI protocol bring them in upper case form I was wasting time converting them, if the request comes by HTTP protocol it's case insensitive so we have to normalize anyway. This behavior is also used by frameworks like Django and the change brought a good performance boost, this will only break your code if you use request headers in your Grantlee templates (which is uncommon and we still have few users). When normalizing headers in Headers class it was causing QString to detach also giving us a performance penalty, it will still detach if you don't try to access/set the headers in the stored form (ie CONTENT_TYPE).

These changes made for a boost from 60k req/s to 80k req/s on my machine.

But we are not done, Matthias Fehring also found a security issue, I dunno when but some change I did break the code that returned an invalid user which was used to check if the authentication was successful, leading a valid username to authenticate even when the logs showed that password didn't match, with his patch I added unit tests to make sure this never breaks again.

And to finish today I wrote unit test to test PBKDF2 according to RFC 6070, and while there I noticed that the code could be faster, before my changes all tests were taking 44s, and now take 22s twice as fast is important since it's a CPU bound code that needs to be fast to authenticate users without wasting your CPU cycles.

Get it!

Oh and while the FreeNode #cutelyst IRC channel is still empty I created a Cutelyst on Google groups:!forum/cutelyst

Have fun!

Cutelyst 1.2.0 released

Cutelyst the C++/Qt web framework has a new release.

  • Test coverage got some additions to avoid breaks in future.
  • Responses without content-length (chunked or close) are now handled properly.
  • StatusMessage plugin got new methods easier to use and has the first deprecated API too.
  • Engine class now has a struct with the request subclass should create, benchmarks showed this as a micro optimization but I've heard function with many arguments (as it was before) are bad on ARM so I guess this is the first optimization for ARM :)
  • Chained dispatcher finally got a performance improvement, I didn't benchmark it but it should be much faster now.
  • Increased the usage of lambdas when the called function was small/simple, for some reason they reduce the library size so I guess it's a good thing...
  • Sql helper classes can now use QThread::objectName() which has the worker thread id as it's name so writing thread safe code is easier.
  • WSGI got static-map and static-map2 options both work the same way as in uWSGI allowing you to serve static files without the need of uWSGI or a webserver.
  • WSGI got both auto-reload and touch-reload implemented, which help a lot on the development process.
  • Request::addressString() was added to make it easier to get the string representation of the client IP also removing the IPv6 prefix if IPv4 conversion succeeds.
  • Grantlee::View now exposes the Grantlee::Engine pointer so one can add filters to it (usefull for i18n)
  • Context got a locale() method to help dealing with translations
  • Cutelyst::Core got ~25k smaller
  • Some other small bug fixes and optimizations....

For 1.3.0 I hope WSGI module can deal with FastCGI and/or uwsgi protocols, as well as helper methods to deal with i18n. But a CMlyst release might come first :)


Cutelyst 1.1.2 released

Cutelyst the C++/Qt Web Framework just got a new release.

Yesterday I was going to do the 1.1.0 release, after running tests, and being happy with current state I wrote this blog post, but before I publish I went to CMS (CMlyst) to paste the same post, and I got hit by a bug I had on 1.0.0, which at time I thought it was a bug in CMlyst, so decided to investigate and found a few other bugs with our WSGI not properly changing the current directory and not replacing the + sign with an ' ' space when parsing formdata, which was the main bug. So did the fixes tagged as 1.1.1 and today found that automatically setting Content-Length wasn't working when the View rendered nothing.

Cutelyst Core and Plugins API/ABI are stable but Cutelyst-WSGI had to have changes, I forgot to expose the needed API for it to actually be useful outside Cutelyst so this isn't a problem (nobody would be able to use it anyway). So API stability got extended to all components now.

Thanks to a KDAB video about perf I finally managed to use it and was able to detect two slow code paths in Cutelyst-WSGI.

The first and that was causing 7% overhead was due QTcpSocket emitting readyRead() signal and in the HttpParser code I was calling sender() to get the socket, turns out the sender() method implementation is not as simple as I thought it was and there is a QMutexLocker which maybe caused some thread to wait on it (not sure really), so now for each QTcpSocket there is an HttpParser class, this uses a little more memory but the code got faster. Hopefully in Qt6 the signal can have a readyRead(QIODevice *) signature.

The second was that I end up using QByteArrayMatcher to match \r\n to get request headers, perl was showing it was causing 1.2% overhead, doing some benchmarks I replaced it by a code that was faster. Using a single thread on my Intel I5 I can process 85k req/s, so things got indeed a little faster.

It got a contribution from a new developer on View::Email, which made me do a new simplemail-qt release (1.3.0), the code there still needs love regarding performance but I didn't manage to find time to improve it yet.

This new release has some new features:

  • EPoll event loop was added to Linux builds, this is supposedly to be faster than default GLib but I couldn't measure the difference properly, this is optional and requires CUTELYST_EVENT_LOOP_EPOLL=1 environment to be set.
  • ViewEmail got methods to set/get authentication method and connection type
  • WSGI: can now set TCP_NODELAY, TCP KEEPALIVE, SO_SNDBUF and SO_RCVBUF on command line, these use the same name as in uwsgi
  • Documentation was improved a bit and can be generated now with make docs, which doesn't require me to changing Cutelyst version manually anymore

And also some important bug fixes:

  • Lazy evaluation of Request methods was broken on 1.0.0
  • WSGI: Fixed loading configs and parsing command line option

Download and have fun!

Cutelyst benchmarks on TechEmpower round 13

Right on my first blog post about Cutelyst users asked me about more "realistic" benchmarks and mentioned TechEmpower benchmarks. Though it was a sort of easy task to build the tests at that time it didn't seem to be useful to waste time as Cutelyst was moving target.

Around 0.12 it became clear that the API was nearly freezing and everyone loves benchmarks so it would serve as a "marketing" thing. Since the beginning of the project I've learned lots of optimizations which made the code get faster and faster, there is still room from improvement, but changes now need to be carefully measured.

Cutelyst is the second Web Framework on TechEmpower benchmarks that uses Qt, the other one being treefrog, but Qt's appearance was noticed on their blog:

The cutelyst-thread tests suffered from a segfault (that was restarted by master process), the fix is in 1.0.0 release but it was too late for round 13, so round 14 it will get even better results.

If you want to help MongoDB/Grantlee/Clearsilver tests are still missing :D

Cutelyst 1.0.0 with stable API/ABI is out!

Cutelyst the Qt web framework just reached it's first stable release, it's been 3 years since the first commit and I can say it finally got to a shape where I think I'm able to keep it's API/ABI stable. The idea is to have any break into a 2.0 release by the end of next year although I don't expect many changes as the I'm quite happy with it's current state.

The biggest change from the last release is the SessionStoreFile class, it initially used QSettings for it's simplicity but it was obvious that it's performance would be far from ideal, so I replaced QSettings with a plain QFile and QDataStream, this produced smaller session files and made the code twice as fast, but profiling was showing it was still slow because it was writing to disk multiple times on the same request. So the code was changed to merge the changes and only save to disk when the Context get's destroyed, on my machine the performance went from 3,5k to 8.5k on read and writes and 4k to 12k on reads, this is on the same session id, should probably be faster with different ids. This is also very limited by the disk IO as we use a QLockFile to avoid concurrency, so if you need something faster you can subclass SessionStore and use Sql or Redis...

Besides that the TechEmpower 13th preview rounds showed a segfault in the new Cutelyst WSGI in threaded mode, a very hard to reproduce but with an easy fix. The initial fix made the code a bit ugly so I searched to see if XCode clang got an update on thread_local feature and finally, XCode 8 has support for it, if you are using XCode clang now you need version 8.

Many other small bugs also got in and API from 0.13.0 is probably unchanged.

Now if you are a distro packager please :D package it, or if you are a dev and was afraid of API breaks keep calm and have fun!


Cutelyst 0.13.0 released!

cutelyst-logoCutelyst the Qt web framework just got a new release, 0.13.0.

A new release was needed now that we have this nice new logo. 

Special thanks to Alessandro Longo (Alex L.) for crafting this cute logo, and a cool favicon for Cutelyst web site.

But this release ain't only about the logo, it's full of cool things: 

When I started Cutelyst a simple developer Engine (read HTTP engine) was created, it was very slow and mostly an ugly hackery but helped work on the APIs that matter, I then took a look at uWSGI due some friend saying it was awesome and it was great to be able to deal with many protocols without the hassled of writing parsers for them.

Fast forwarding to 0.12.0 release and I started to feel that I was reaching a limit on Cutelyst optimizations and uWSGI was holding us back, and it wasn't only about performance,  memory usage (scalability) was too high for something that should be rather small, it's written in C after all. 

It also has a fixed number of requests it can take, if you start it with 5 threads or process it's 5 blocking clients that can be processed at the same time, if you use the async option you then have a fixed number of clients per process, 5 process * 5 async clients = 25 clients at the same time, but this 5 async clients are always pre-allocated which means that each new process will also be bigger right from launch.

Think now about websockets, how can one deal with 5000 simultaneous clients? 50 process with async = 100? Performance on async mode was also slower due complexity to deal with them.

So before getting into writing an alternative to uWSGI in Cutelyst I did a simple experiment, asked uWSGI to load a Cutelyst app and fork 1000 times and wrote a simple QCoreApplication that would do the same, uWSGI used > 1GB of RAM and took around 10s to start, while the Qt app used < 300MB of RAM and around 3s. So ~700MB of RAM is a lot of RAM and that was enough to get me started.

Cutelyst-wsgi, is born, and granted the command line arguments are very similar to uWSGI and I also followed the same separation between socket and protocol handling, of course in C++ things are more reusable, so our Protocol class has a HTTP subclass and in future will have FastCGI and uWSGI ones too.

Did I say uWSGI before 2.1 doesn't support keep-alive? And that 2.1 is not released nor someone knows when it will? Cutelyst-wsig supports keep-alive, http pipelining, is complete async and yes, performs a little better. If you put NGINX in front of uWSGI you can get keep alive support, but guess what? the uwsgi protocol closes the connection between the front server so it's quite hard to get very high speeds. Preliminary results of TechEmpower Benchmarks #13 showed Cutelyst hitting these limits as others frameworks were using keep-alive properly.

Thanks to this new Engine the Engine API got several improvements and is quite stable now. Besides it a few other important changes were made as well:

  • Change internals to take advantage of NRVO (named return value optimization)
  • Improved speed of Context::uriFor() making Cutelyst now require Qt 5.6 due a behavior change in QUrl
  • Improved speed and memory usage of Url query parser 1s faster in 1m iterations, using QByteArray::split() is very convenient but it allocates more memory and a QList for the results, using ::indexOf() and manually getting the parts is both faster and more memory efficient but yes, this is the optimization we do in Cutelyst::Core and that makes a difference, in application code the extra complexity might not worth it.
  • C++ for ranged loops, all our Q_FOREACH & friends where replaced with for ranged loops
  • Use of new reverse and equal_range iterators
  • Use QHash for storing headers, this was done after several benchmarks that showed QHash was faster for all common cases, namely if it keept the values() in order like QMap it would be used in other places as well
  • Replaced most QList with QVector, and internally std::vector
  • Multipart/form-data got faster, it doesn't seek() anymore but requires a not sequential QIODevice as each Upload object point to parts of the body device.
  • Add a few more unit tests.

Thanks to the above the core library size is also a bit smaller, ~640KB on x64.

I was planning to do a 1.0 after 0.13 but with this new engine I think it's better to have a 0.14 version, and make sure no more changes in Core will be needed for additional protocols. 

Download here enjoy!

Cutelyst 0.12.0 is out!

Cutelyst a web framework built with Qt is closer to have it's first stable release, with it becoming 3 years old at the end of the year I'm doing my best to finally iron it to get an API/ABI compromise, this release is full of cool stuff and a bunch of breaks which most of the time just require recompiling.

For the last 2-3 weeks I've been working hard to get most of it unit tested, the Core behavior is now extensively tested with more than 200 tests. This has already proven it's benefits resulting in improved and fixed code.

Continuous integration got broader runs with gcc and clang on both OSX and Linux (Travis) and with MSVC 12 and 14 on Windows (Appveyor), luckily most of the features I wanted where implemented but the compiler is pretty upsetting. Running Cutelyst on Windows would require uwsgi which can be built with MinGW but it's in experimental state, the developer HTTP engine, is not production ready so Windows usefulness is limited at the moment.

One of the 'hypes' of the moment is non-blocking web servers, and this release also fixes this so that uwsgi --async <number_of_requests> is properly handled, of course there is no magic, if you enable this on blocking code the requests will still have to wait your blocking task to finish, but there are many benefits of using this if you have non-blocking code. At the moment once a slot is called to process the request and say you want to do a GET on some webservice you can use the QNetworkAccessManager do your call and create a local QEventLoop so once the QNetworkReply finish() is emitted you continue processing. Hopefully some day QtSql module will have an async API but you can of course create a Thread Queue.

A new plugin called StatusMessage was also introduced which generates an ID which you will use when redirecting to some other page and the message is only displayed once, and doesn't suffer from those flash race conditions.

The upload parser for Content-Type  multipart/form-data got a huge boost in performance as it now uses QByteArrayMatcher to find boundaries, the bigger the upload the more apparent is the change.

Chunked responses also got several fixes and one great improvement which will allow to use it with classes like QXmlStreamWriter by just passing the Response class (which is now a QIODevice) to it's constructor or setDevice(), on the first write the HTTP headers are sent and it will start creating chunks, for some reason this doesn't work when using uwsgi protocol behind Nginx, I still need to dig and maybe disable the chunks markup depending on the protocol used by uwsgi.

A Pagination class is also available to easy the work needed to write pagination, with methods to set the proper LIMIT and OFFSET on Sql queries.

Benchmarks for the TechEmpower framework were written and will be available on Round 14.

Last but not least there is now a QtCreator integration, which allows for creating a new project and Controller classes, but you need to manually copy (or link) the qtcreator directory to ~/.config/QtProject/qtcreator/templates/wizard.


As usual many bug fixes are in.

Help is welcome, you can mail me or hang on #cutelyst at freenode.

Download here.