Writing a reverse proxy/loadbalancer from the ground up in C, part 4: Dealing with slow writes to the network

Posted on 10 October 2013 in Linux, Programming

This is the fourth step along my road to building a simple C-based reverse proxy/loadbalancer, rsp, so that I can understand how nginx/OpenResty works -- more background here. Here are links to the first part, where I showed the basic networking code required to write a proxy that could handle one incoming connection at a time and connect it with a single backend, to the second part, where I added the code to handle multiple connections by using epoll, and to the third part, where I started using Lua to configure the proxy.

This post was unplanned; it shows how I fixed a bug that I discovered when I first tried to use rsp to act as a reverse proxy in front of this blog. The bug is fixed, and you're now reading this via rsp. The problem was that when the connection from a browser to the proxy was slower than the connection from the proxy to the backend (that is, most of the time), then when new data was received from the backend and we tried to send it to the client, we sometimes got an error telling us that the client was not ready. This error was being ignored, so a block of data would be skipped, and the pages you got back would be missing chunks. There's more about the bug here.

...and another sidetrack -- a new theme!

Posted on 3 October 2013 in Blogging, Meta, Website design

While I was at it, I figured that this blog was looking ridiculously dated. So I've fixed that with the Iconic One Wordpress theme, with a few tweaks that I think make it look a bit cleaner.

A brief sidetrack: Varnish

Posted on 2 October 2013 in Linux, Programming

In order to use this blog as a decent real-world test of rsp, I figured that I should make it as fast as possible. The quickest way to do that was to install Varnish, which is essentially a reverse proxy that caches stuff. You configure it to say what is cachable, and then it runs in place of the web server and proxies anything it can't cache back to it.

I basically used the instructions from Ewan Leith's excellent "10 Million hits a day with Wordpress using a $15 server" post.

So now, this server has:

  • rsp running on port 80, proxying everything to port 83.
  • varnish running on port 83, caching what it can and proxying the rest to port 81.
  • nginx running on port 81, serving static pages and sending PHP stuff to php5-fpm on port 9000.

I've also got haproxy running on port 82, doing the same as rsp -- proxying everything to varnish -- so that I can do some comparative speed tests once rsp does enough for such tests to give interesting results. Right now, all of the speed differences seem to be in the noise, with a run of ab pointed at varnish actually coming out slower than the two proxies.

Writing a reverse proxy/loadbalancer from the ground up in C, pause to regroup: fixed it!

Posted on 29 September 2013 in Linux, Programming

It took a bit of work, but the bug is fixed: rsp now correctly handles the case where it can't write as much as it wants to the client side. I think this is enough for it to properly work as a front-end for this website, so it's installed and running here. If you're reading this (and I've not had to switch it off in the meantime) then the pages you're reading were served over rsp. Which is very pleasing :-)

The code needs a bit of refactoring before I can present it, and the same bug still exists on the communicating-to-backends side (which is one of the reasons it needs refactoring -- this is something I should have been able to fix in one place only) so I'll do that over the coming days, and then do another post.

Writing a reverse proxy/loadbalancer from the ground up in C, pause to regroup: non-blocking output

Posted on 28 September 2013 in Linux, Programming

Before moving on to the next step in my from-scratch reverse proxy, I thought it would be nice to install it on the machine where this blog runs, and proxy all access to the blog through it. It would be useful dogfooding and might show any non-obvious errors in the code. And it did.

I found that while short pages were served up perfectly well, longer pages were corrupted and interrupted halfway through. Using curl gave various weird errors, eg. curl: (56) Problem (3) in the Chunked-Encoded data, which is a general error saying that it's receiving chunked data and the chunking is invalid.

Doubly strangely, these problems didn't happen when I ran the proxy on the machine where I'm developing it and got it to proxy the blog; only when I ran it on the same machine as the blog. They're different versions of Ubuntu, the blog server being slightly older, but not drastically so -- and none of the stuff I'm using is that new, so it seemed unlikely to be a bug in the blog server's OS. And anyway, select isn't broken.

After a ton of debugging with printfs here there and everywhere, I tracked it down. You'll remember that our code to transfer data from the backend to the client looks like this:

void handle_backend_socket_event(struct epoll_event_handler* self, uint32_t events)
{
    struct backend_socket_event_data* closure = (struct backend_socket_event_data*) self->closure;

    char buffer[BUFFER_SIZE];
    int bytes_read;

    if (events & EPOLLIN) {
        bytes_read = read(self->fd, buffer, BUFFER_SIZE);
        if (bytes_read == -1 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            return;
        }

        if (bytes_read == 0 || bytes_read == -1) {
            close_client_socket(closure->client_handler);
            close_backend_socket(self);
            return;
        }

        write(closure->client_handler->fd, buffer, bytes_read);
    }

    if ((events & EPOLLERR) | (events & EPOLLHUP) | (events & EPOLLRDHUP)) {
        close_client_socket(closure->client_handler);
        close_backend_socket(self);
        return;
    }

}

If you look closely, there's a system call there where I'm not checking the return value -- always risky. It's this:

        write(closure->client_handler->fd, buffer, bytes_read);

The write function returns the number of bytes it managed to write, or -1 on error, with errno set to say what went wrong. The debugging code revealed that sometimes it was returning -1, and errno was set to EAGAIN, meaning that the operation would have blocked on a non-blocking socket.

This makes a lot of sense. Sending stuff out over the network is a fairly complex process. There are kernel buffers of stuff to send, and as we're using TCP, which is connection-based, I imagine there's a possibility that the client being slow or transmission of data over the Internet might be causing things to back up. Possibly sometimes it was returning a non-error code, too, but was still not able to write all of the bytes I asked it to write, so stuff was getting skipped.

So that means that even for this simple example of an epoll-based proxy to work properly, we need to do some kind of buffering in the server to handle cases where we're getting stuff from the backend faster than we can send it to the client. And possibly vice versa. It's possible to get epoll events on an FD when it's ready to accept output, so that's probably the way to go -- but it will need a bit of restructuring. So the next step will be to implement that, rather than the multiple-backend handling stuff I was planning.
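Here's a minimal sketch of what that buffering might look like. It is not the code rsp ended up with: the names `pending_buffer`, `send_or_queue` and `flush_pending` are made up for illustration, and the fixed-size buffer is a simplifying assumption. The idea is to write what the kernel will take, keep the rest, and retry when epoll reports `EPOLLOUT` on the client FD.

```c
#include <errno.h>
#include <string.h>
#include <unistd.h>

#define PENDING_CAPACITY 65536

/* Hypothetical per-connection buffer for data we could not send yet. */
struct pending_buffer {
    char data[PENDING_CAPACITY];
    size_t length;
};

/* Try to send `length` bytes to a non-blocking fd; anything the kernel
 * won't accept right now is appended to `pending`.
 * Returns 0 if everything was sent or queued, -1 on a real error or if
 * the pending buffer is full (the caller should then close the connection). */
int send_or_queue(int fd, struct pending_buffer* pending, const char* buffer, size_t length)
{
    size_t written = 0;

    /* Only write directly if nothing is queued, so bytes stay in order. */
    if (pending->length == 0) {
        while (written < length) {
            ssize_t result = write(fd, buffer + written, length - written);
            if (result == -1) {
                if (errno == EINTR) continue;
                if (errno == EAGAIN || errno == EWOULDBLOCK) break;
                return -1;
            }
            written += (size_t) result;
        }
    }

    size_t remaining = length - written;
    if (remaining > PENDING_CAPACITY - pending->length) return -1;
    memcpy(pending->data + pending->length, buffer + written, remaining);
    pending->length += remaining;
    return 0;
}

/* Call when epoll reports EPOLLOUT: send as much queued data as possible.
 * Returns 0 on success (even if some data is still queued), -1 on error. */
int flush_pending(int fd, struct pending_buffer* pending)
{
    size_t written = 0;

    while (written < pending->length) {
        ssize_t result = write(fd, pending->data + written, pending->length - written);
        if (result == -1) {
            if (errno == EINTR) continue;
            if (errno == EAGAIN || errno == EWOULDBLOCK) break;
            return -1;
        }
        written += (size_t) result;
    }

    /* Shift whatever is left to the front of the buffer. */
    memmove(pending->data, pending->data + written, pending->length - written);
    pending->length -= written;
    return 0;
}
```

In a handler like the one above, the bare `write` call would become a call to `send_or_queue`, and the client FD would be registered for `EPOLLOUT` only while `pending.length` is non-zero, so that epoll doesn't keep waking us up for a socket we have nothing to send to.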

This is excellent. Now I know a little more about why writing something like nginx is hard, and have a vague idea of why I sometimes see stuff in its logs along the lines of an upstream response is buffered to a temporary file. Which is entirely why I started writing this stuff in the first place :-)

Here's a run-through of the code I had to write to fix the bug.

Writing a reverse proxy/loadbalancer from the ground up in C, part 3: Lua-based configuration

Posted on 11 September 2013 in Linux, Programming

This is the third step along my road to building a simple C-based reverse proxy/loadbalancer so that I can understand how nginx/OpenResty works -- more background here. Here's a link to the first part, where I showed the basic networking code required to write a proxy that could handle one incoming connection at a time and connect it with a single backend, and to the second part, where I added the code to handle multiple connections by using epoll.

This post is much shorter than the last one. I wanted to make the minimum changes to introduce some Lua-based scripting -- specifically, I wanted to keep the same proxy with the same behaviour, and just move the stuff that was being configured via command-line parameters into a Lua script, so that just the name of that script would be specified on the command line. It was really easy :-) -- but obviously I may have got it wrong, so as ever, any comments and corrections would be much appreciated.

Writing a reverse proxy/loadbalancer from the ground up in C, part 2: handling multiple connections with epoll

Posted on 7 September 2013 in Linux, Programming

This is the second step along my road to building a simple C-based reverse proxy/loadbalancer so that I can understand how nginx/OpenResty works -- more background here. Here's a link to the first part, where I showed the basic networking code required to write a proxy that could handle one incoming connection at a time and connect it with a single backend.

This (rather long) post describes a version that uses Linux's epoll API to handle multiple simultaneous connections -- but it still just sends all of them down to the same backend server. I've tested it using the Apache ab server benchmarking tool, and over a million requests, 100 running concurrently, it adds about 0.1ms to the average request time as compared to a direct connection to the web server, which is pretty good going at this early stage. It also doesn't appear to leak memory, which is doubly good going for someone who's not coded in C since the late 90s. I'm pretty sure it's not totally stupid code, though obviously comments and corrections would be much appreciated!

[UPDATE: there's definitely one bug in this version -- it doesn't gracefully handle cases when we can't send data to the client as fast as we're receiving it from the backend. More info here.]

Writing a reverse proxy/loadbalancer from the ground up in C, part 1: a trivial single-threaded proxy

Posted on 12 August 2013 in Programming

This is the first step along my road to building a simple C-based reverse proxy/loadbalancer so that I can understand how nginx/OpenResty works -- more explanation here. It's called rsp, for Really Simple Proxy. This version listens for connections on a particular port, specified on the command line; when one is made it sends the request down to a backend -- another server with an associated port, also specified on the command line -- and sends whatever comes back from the backend back to the person who made the original connection. It can only handle one connection at a time -- while it's handling one, it just queues up others, and it handles them in turn. This will, of course, change later.

I'm posting this in the hope that it might help people who know Python, and some basic C, but want to learn more about how the OS-level networking stuff works. I'm also vaguely hoping that any readers who code in C day to day might take a look and tell me what I'm doing wrong :-)

Writing a reverse proxy/loadbalancer from the ground up in C, part 0: introduction

Posted on 8 August 2013 in Programming

We're spending a lot of time on nginx configuration at PythonAnywhere. We're a platform-as-a-service, and a lot of people host their websites with us, so it's important that we have a reliable load-balancer to receive all of the incoming web traffic and appropriately distribute it around backend web-server nodes.

nginx is a fantastic, possibly unbeatable tool for this. It's fast, reliable, and lightweight in terms of CPU resources. We're using the OpenResty variant of it, which adds a number of useful modules -- most importantly for us, one for Lua scripting, which means that we can dynamically work out where to send traffic as the hits come in.

It's also quite simple to configure at a basic level. You want all incoming requests for site X to go to backend Y? Just write something like this:

    server {
        server_name X;
        listen 80;

        location / {
            proxy_set_header Host $host;
            proxy_pass Y;
        }
    }

Simple enough. Lua scripting is pretty easy to add -- you just put an extra directive before the proxy_pass that provides some Lua code to run, and then variables you set in the code can be accessed from the proxy_pass.

But there are many more complicated options. worker_connections, tcp_nopush, sendfile, types_hash_max_size... Some are reasonably easy to understand with a certain amount of reading, some are harder.

I'm a big believer that the best way to understand something complex is to try to build your own simple version of it. So, in my copious free time, I'm going to start putting together a simple loadbalancer in C. The aim isn't to rewrite nginx or OpenResty; it's to write enough equivalent functionality that I can better understand what they are really doing under the hood, in the same way as writing a compiler for a toy language gives you a better understanding of how proper compilers work. I'll get a good grasp on some underlying OS concepts that I have only a vague appreciation of now. It's also going to be quite fun coding in C again. I've not really written any since 1997.

Anyway, I'll document the steps I take here on this blog; partly because there's a faint chance that it might be interesting to other experienced Python programmers whose C is rusty or nonexistent and want to get a view under the hood, but mostly because the best way to be sure you really understand it is to try to explain it to other people.

I hope it'll be interesting!

Here's a link to the first post in the series: Writing a reverse proxy/loadbalancer from the ground up in C, part 1: a trivial one-shot proxy

SNI-based reverse proxying with Go(lang)

Posted on 18 July 2013 in Programming, PythonAnywhere

Short version for readers who know all about this kind of stuff: we build a simple reverse-proxy server in Go that load-balances HTTP requests using the Host header and HTTPS using the SNIs from the client handshake. Backends are selected per-host from sets stored in a redis database. It works pretty well but we won't be using it because it can't send the originating client IP to the backends when it's handling HTTPS. Code here.

We've been looking at options to load-balance our users' web applications at PythonAnywhere; this post is about something we considered but eventually abandoned. I'm posting it because the code might turn out to be useful to other people.

A bit of background first; if you already know what a reverse proxy is and how load-balancing and virtual hosting work, you can skip forward a bit.

Imagine an old-fashioned shared hosting environment. You're able to run a web application on a machine that's being used by lots of other people, and you're given that machine's IP address. You set up your DNS configuration so that your domain points to that IP address, and it all works. When a connection comes in from a browser to access your site, the web server on the machine needs to work out which person's web app it should route it to. It does this by looking at the HTTP request and finding a Host header in it. So, by using the Host header, the shared hosting provider can keep costs down by sharing an IP address and a machine between multiple clients. This is called virtual hosting.
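To make the Host header lookup concrete, here's a rough C sketch of pulling the header's value out of a raw request. The function name `extract_host` is made up for this example, and it assumes the whole header block is already in memory as a NUL-terminated string; a real server would also have to cope with requests that arrive in pieces.

```c
#include <stddef.h>
#include <string.h>
#include <strings.h>

/* Copy the value of the Host header from a raw HTTP request into `host`.
 * Returns 0 on success, -1 if there is no Host header or it doesn't fit. */
int extract_host(const char* request, char* host, size_t host_size)
{
    /* Skip the request line; headers start after the first CRLF. */
    const char* line = strstr(request, "\r\n");

    while (line != NULL) {
        line += 2;
        if (line[0] == '\r' || line[0] == '\0') break;   /* end of headers */

        /* Header names are case-insensitive. */
        if (strncasecmp(line, "Host:", 5) == 0) {
            const char* value = line + 5;
            while (*value == ' ' || *value == '\t') value++;

            size_t length = strcspn(value, "\r\n");
            if (length == 0 || length >= host_size) return -1;
            memcpy(host, value, length);
            host[length] = '\0';
            return 0;
        }
        line = strstr(line, "\r\n");
    }
    return -1;
}
```

Note that the value is returned as sent, so it includes a port if the client added one (`example.com:8080`); a virtual-hosting lookup would probably want to strip that first.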

Now consider the opposite case -- a high-traffic website, where one machine isn't enough to handle all of the traffic. Processing a request for a page on a website can take a certain amount of machine resources -- database lookups, generating dynamic pages from templates, and so on. So a single web server might not be enough to cope with lots of traffic. In this case, people use what's called a reverse proxy, or load-balancer. In the simplest case, this is just a machine running on a single IP. When a request comes in, it selects a backend -- that is, one of a number of web servers, each of which is running the full website's code. It then just sends the request down to one of them, and copies all data that comes back from that backend up to the browser that made the request. Because just copying data around from backend to browser and vice versa is much easier work than processing the actual request, a single load-balancer can handle many more requests than any of the backend web servers could, and if it's configured to select backends appropriately it can spread the load smoothly across them. Additionally, this kind of setup can handle outages gracefully -- if one backend stops responding, it can stop routing to it and use the others as backups.
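As an illustration of the "select a backend, and skip ones that have stopped responding" idea, here's a small round-robin sketch in C. Round-robin is just one possible selection policy, and the `backend` and `backend_pool` structures and the `healthy` flag are assumptions made for this example; how a proxy decides that a backend is down is a separate problem that isn't shown.

```c
#include <stdbool.h>
#include <stddef.h>

struct backend {
    const char* address;   /* e.g. "10.0.0.1:8000"; illustrative only */
    bool healthy;          /* set elsewhere, e.g. by failed connects */
};

struct backend_pool {
    struct backend* backends;
    size_t count;
    size_t next;           /* index to try first on the next request */
};

/* Round-robin selection that skips backends marked unhealthy.
 * Returns NULL if the pool is empty or every backend is down. */
struct backend* select_backend(struct backend_pool* pool)
{
    for (size_t tried = 0; tried < pool->count; tried++) {
        struct backend* candidate = &pool->backends[pool->next];
        pool->next = (pool->next + 1) % pool->count;
        if (candidate->healthy) return candidate;
    }
    return NULL;
}
```

With three backends where the second is marked unhealthy, successive calls return the first, the third, the first again, and so on.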

Now let's combine those two ideas. Imagine a platform-as-a-service, where each outward-facing IP might be responsible for handling large numbers of websites. But for reliability and performance, it might make sense to have each website backed by multiple backends. So, for example, a PaaS might have a thousand websites backed by one hundred different webservers, where website one is handled by backends one, two and three, website two by backends two, three and four, and so on. This means that the PaaS can keep costs down (running thirty web app instances per backend server) and reliability and performance up (each website having three independent backends).

So, that's the basics. There are a number of great tools which can be used to operate as super-efficient proxies that can handle this kind of many-hostnames-to-many-backends mapping. nginx is the most popular, but there are also haproxy and hipache. We are planning to choose one of these for PythonAnywhere (more about that later), but we did identify one slight problem with all of them. The code I'm shortly going to show was our attempt at working around that problem.

The description above of how virtual hosting works is fine when we're talking about HTTP. But increasingly, people want to use HTTPS for secure connections.

When an HTTPS connection comes in, the server has a problem. Before it can decode what's in the request and get the Host header, it needs to establish a secure link. Its first step to establish that link is to send a certificate to the client to prove it is who it says it is. But each of the different virtual hosts on the machine will need a different certificate, because they're all on different domains. So there's a chicken-and-egg problem; it needs to know which host it is meant to be in order to send the right certificate, but it needs to have sent the certificate in order to establish a secure connection to find out which host it is meant to be. This was a serious problem until relatively recently; basically, it meant that every HTTPS-secured site had to have its own dedicated IP address, so that the server could tell which certificate to serve when a client connected by looking at the IP address the connection came in on.

This problem was solved by an extension to the TLS protocol (TLS being the latest protocol to underlie HTTPS) called "Server Name Indication". Basically, it takes the idea of the HTTP Host header and moves it down the stack a bit. The initial handshake message that a client connecting to a server used to just say "here I am and here's the kind of SSL protocol I can handle -- now what's your certificate?" With SNI the handshake also says "here's the hostname I expect you to have".

So with SNI, a browser connects to a server, and the server looks at the handshake to find out which certificate to use. The browser and server establish a secure link and then the browser sends the normal HTTP request, which has a Host header, which the server then uses to send the request to the appropriate web app.

Let's get back to the proxy server that's handling incoming requests for lots of different websites and routing them to lots of different backends. With all of the proxies mentioned above -- nginx, hipache and haproxy -- a browser makes a connection, the proxy does all of the SNI stuff to pick the right certificate, it decodes the data from the client, works out which backend to send it to using the Host header in the decoded data, and then forwards everything on.

There's an obvious inefficiency here. The proxy shouldn't have to decode the secure connection to get the Host header -- after all, it already knows that from the information in the SNI. And it gets worse. Decoding the secure connection uses up CPU cycles on the proxy. And either the connection between the proxy and the backends is non-secure, which could be an issue if a hacker got onto the network, or it's secure, in which case the proxy is decoding and then encoding everything that goes through it -- even more CPU load. Finally, all of the certificates for every site that the proxy's handling -- and their associated private keys -- have to be available to the proxy. Which is another security risk if it gets hacked.

So, probably like many people before us, we thought "why not just route HTTPS based on the SNI? It can't be that hard!" And actually, it isn't. Here's a GitHub project with a simple Go application that routes HTTP requests using the Host header, and HTTPS using the SNI. It never needs to know anything about the certificates for the sites it's proxying for, and all data is passed through without any decryption.

So why didn't we decide to use it? Access logs and spam filters. The thing is, people who are running websites like to know who's been looking at their stuff -- for their website metrics, for filtering out spammy people using tools like Akismet, and so on. If you're using a proxy, then the backend sees every request as coming from the proxy's IP, which isn't all that useful. So normally a proxy will add an extra header to HTTP requests it passes through -- X-Forwarded-For is the usual one.
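For plain HTTP, adding that header is simple string surgery on the request. Here's a hedged C sketch: `add_forwarded_for` is a made-up name, and it assumes the full request headers are in memory as a NUL-terminated string. It also always inserts a fresh header, whereas real proxies normally append the client IP to an existing X-Forwarded-For header if there is one.

```c
#include <stdio.h>
#include <string.h>

/* Write a copy of `request` into `out` with an X-Forwarded-For header
 * inserted straight after the request line.
 * Returns the new length, or -1 if the request has no CRLF or `out`
 * is too small to hold the result. */
int add_forwarded_for(const char* request, const char* client_ip, char* out, size_t out_size)
{
    const char* line_end = strstr(request, "\r\n");
    if (line_end == NULL) return -1;

    /* Length of the request line, including its CRLF. */
    int request_line_length = (int) (line_end - request) + 2;

    int written = snprintf(out, out_size, "%.*sX-Forwarded-For: %s\r\n%s",
                           request_line_length, request, client_ip, line_end + 2);
    if (written < 0 || (size_t) written >= out_size) return -1;
    return written;
}
```

The point to notice is that this only works because the proxy can read and rewrite the request, which means it has to be working with the decrypted bytes.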

And the problem with an SNI proxy is the same as its biggest advantage. Because it's not decoding the secure stream from the browser, it can't change it, so it can't insert any extra headers. So all HTTPS requests going over any kind of SNI-based reverse proxy will appear to come from the proxy itself. Which breaks things.

So we're not going to use this. And TBH it's not really production-level code -- it was a spike and is also the first Go code I've ever written, so it's probably full of warts (comments very much welcomed!). Luckily we realised the problem with the backends not knowing about the client's IP before we started work on rewriting it test-first.

On the other hand, it might be interesting for anyone who wants to do stuff like this. The interesting stuff is mostly in handleHTTPSConnection, which decodes the TLS handshake sent by the client to extract the SNI.
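For anyone who'd rather see the shape of that parsing in C, here's a sketch of the same idea. It is not a translation of the Go code: `extract_sni` and `read_u16` are names invented here, and it assumes the entire ClientHello arrived in a single TLS record that is already in memory, which a real proxy can't rely on.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read a big-endian 16-bit value. */
static size_t read_u16(const uint8_t* p)
{
    return ((size_t) p[0] << 8) | p[1];
}

/* Extract the SNI hostname from a buffer holding a complete TLS ClientHello.
 * Returns 0 on success, -1 if there is no SNI or the data is malformed. */
int extract_sni(const uint8_t* data, size_t length, char* name, size_t name_size)
{
    size_t pos, skip;

    /* Record header (5 bytes, type 0x16 = handshake), then handshake
     * header (4 bytes, type 0x01 = ClientHello). */
    if (length < 9 || data[0] != 0x16 || data[5] != 0x01) return -1;
    pos = 9;

    /* client_version (2 bytes) + random (32 bytes) */
    if (length - pos < 34) return -1;
    pos += 34;

    /* session_id, prefixed by a 1-byte length */
    if (length - pos < 1) return -1;
    skip = data[pos]; pos += 1;
    if (length - pos < skip) return -1;
    pos += skip;

    /* cipher_suites, prefixed by a 2-byte length */
    if (length - pos < 2) return -1;
    skip = read_u16(data + pos); pos += 2;
    if (length - pos < skip) return -1;
    pos += skip;

    /* compression_methods, prefixed by a 1-byte length */
    if (length - pos < 1) return -1;
    skip = data[pos]; pos += 1;
    if (length - pos < skip) return -1;
    pos += skip;

    /* extensions block, prefixed by a 2-byte total length */
    if (length - pos < 2) return -1;
    size_t extensions_end = pos + 2 + read_u16(data + pos);
    pos += 2;
    if (extensions_end > length) return -1;

    /* Each extension: type (2 bytes), length (2 bytes), data. */
    while (extensions_end - pos >= 4) {
        size_t type = read_u16(data + pos);
        size_t ext_length = read_u16(data + pos + 2);
        pos += 4;
        if (extensions_end - pos < ext_length) return -1;

        if (type == 0) {
            /* server_name: list length (2), name type (1, 0 = host_name),
             * name length (2), then the name itself. */
            if (ext_length < 5 || data[pos + 2] != 0) return -1;
            size_t name_length = read_u16(data + pos + 3);
            if (name_length > ext_length - 5 || name_length >= name_size) return -1;
            memcpy(name, data + pos + 5, name_length);
            name[name_length] = '\0';
            return 0;
        }
        pos += ext_length;
    }
    return -1;
}
```

Nothing here touches the encrypted part of the connection: the ClientHello is sent in the clear, which is what makes it possible to route on the hostname without having the site's certificate or private key.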

I did a bit of very non-scientific testing just to make sure it all works. I started three backend servers with simple Flask apps that did a sleep on every request to simulate processing:

from flask import Flask
import time
from socket import gethostname

app = Flask(__name__)

@app.route("/")
def index():
    time.sleep(0.05)
    return "Hello from " + gethostname()

if __name__ == "__main__":
    app.run("0.0.0.0", 80, processes=4)

Then ran the Apache ab tool to see what the performance characteristics were for one of them:

root@abclient:~# ab -n1000 -c100 http://198.199.83.71/
This is ApacheBench, Version 2.3 <$Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 198.199.83.71 (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Completed 1000 requests
Finished 1000 requests

Server Software:        Werkzeug/0.9.2
Server Hostname:        198.199.83.71
Server Port:            80

Document Path:          /
Document Length:        19 bytes

Concurrency Level:      100
Time taken for tests:   21.229 seconds
Complete requests:      1000
Failed requests:        0
Write errors:           0
Total transferred:      172000 bytes
HTML transferred:       19000 bytes
Requests per second:    47.10 [#/sec] (mean)
Time per request:       2122.938 [ms] (mean)
Time per request:       21.229 [ms] (mean, across all concurrent requests)
Transfer rate:          7.91 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    3   7.4      0      37
Processing:    73 2025 368.7   2129    2387
Waiting:       73 2023 368.4   2128    2386
Total:        103 2028 363.7   2133    2387

Percentage of the requests served within a certain time (ms)
  50%   2133
  66%   2202
  75%   2232
  80%   2244
  90%   2286
  95%   2317
  98%   2344
  99%   2361
 100%   2387 (longest request)
root@abclient:~#

Then, after adding records to the proxy's redis instance to tell it to route requests with the hostname proxy to any of the backends, and hacking the hosts file on the ab client machine to make the hostname proxy point to it:

root@abclient:~# ab -n1000 -c100 http://proxy/
This is ApacheBench, Version 2.3 <$Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking proxy (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Completed 1000 requests
Finished 1000 requests

Server Software:        Werkzeug/0.9.2
Server Hostname:        proxy
Server Port:            80

Document Path:          /
Document Length:        19 bytes

Concurrency Level:      100
Time taken for tests:   7.668 seconds
Complete requests:      1000
Failed requests:        0
Write errors:           0
Total transferred:      172000 bytes
HTML transferred:       19000 bytes
Requests per second:    130.41 [#/sec] (mean)
Time per request:       766.803 [ms] (mean)
Time per request:       7.668 [ms] (mean, across all concurrent requests)
Transfer rate:          21.91 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    1   1.7      0       9
Processing:    93  695 275.4    617    1228
Waiting:       93  693 275.4    614    1227
Total:         99  696 274.9    618    1228

Percentage of the requests served within a certain time (ms)
  50%    618
  66%    799
  75%    948
  80%    995
  90%   1116
  95%   1162
  98%   1185
  99%   1204
 100%   1228 (longest request)
root@abclient:~#

So, it works. I've not done ab testing with the HTTPS side of things, but I have hacked my own hosts file and spent a day accessing Google and PythonAnywhere itself via the proxy. It works :-)

As to what we're actually going to use for load-balancing PythonAnywhere:

  • nginx is great but stores its routing config in files, which doesn't easily scale to large numbers of hosts/backends. It's doable, but it's just a nightmare to manage, especially if things go wrong.
  • haproxy is the same -- worse, it needs to be fully restarted (interrupting ongoing connections) if you change the config.
  • hipache stores data in redis (which is what inspired me to do something similar for this proxy) so it can gracefully handle rapidly-changing routing setups. But it's written in Node.js, so while it's pretty damn fast, it's not as fast as nginx.

But... as the dotcloud people who wrote hipache recently pointed out (bottom of the post), nginx's built-in Lua scripting support is now at a level where you can store your routing config in redis -- so with a bit of work, you can get the speed of nginx with the ease of configuration of hipache. So that's where we're heading. We'll just have to make sure the proxy and its certificates are super-secure, and live with the extra CPU load.