Writing a reverse proxy/loadbalancer from the ground up in C, pause to regroup: non-blocking output
Before moving on to the next step in my from-scratch reverse proxy, I thought it would be nice to install it on the machine where this blog runs, and proxy all access to the blog through it. It would be useful dogfooding and might show any non-obvious errors in the code. And it did.
I found that while short pages were served up perfectly well, longer pages were corrupted and cut off partway through. Using curl gave various weird errors, e.g.
curl: (56) Problem (3) in the Chunked-Encoded data
...which is a general error saying that it's receiving chunked data and the chunking is invalid.
Doubly strangely, these problems didn't happen when I ran the proxy on the machine where I'm developing it and got it to proxy the blog; they only happened when I ran it on the same machine as the blog. The two machines run different versions of Ubuntu, the blog server's being slightly older, but not drastically so -- and none of the stuff I'm using is that new, so it seemed unlikely to be a bug in the blog server's OS. And anyway, select isn't broken.
After a ton of debugging with printfs here, there and everywhere, I tracked it down.
You'll remember that our code to transfer data from the backend to the client looks like this:
void handle_backend_socket_event(struct epoll_event_handler* self, uint32_t events)
{
    struct backend_socket_event_data* closure = (struct backend_socket_event_data*) self->closure;

    char buffer[BUFFER_SIZE];
    int bytes_read;

    if (events & EPOLLIN) {
        bytes_read = read(self->fd, buffer, BUFFER_SIZE);
        if (bytes_read == -1 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            return;
        }

        if (bytes_read == 0 || bytes_read == -1) {
            close_client_socket(closure->client_handler);
            close_backend_socket(self);
            return;
        }

        write(closure->client_handler->fd, buffer, bytes_read);
    }

    if ((events & EPOLLERR) | (events & EPOLLHUP) | (events & EPOLLRDHUP)) {
        close_client_socket(closure->client_handler);
        close_backend_socket(self);
        return;
    }
}
If you look closely, there's a system call there where I'm not checking the return value -- always risky. It's this:
write(closure->client_handler->fd, buffer, bytes_read);
The write function returns the number of bytes it managed to write, or -1 on error. The debugging code revealed that sometimes it was returning -1 with errno set to EAGAIN, meaning that the write would have blocked on a non-blocking socket.
This makes a lot of sense. Sending stuff out over the network is a fairly complex process. There are kernel buffers of data waiting to be sent, and as we're using TCP, which is connection-based, a slow client or a congested link somewhere on the Internet can cause those buffers to back up. It's also possible that write was sometimes returning a non-error value but writing fewer bytes than I'd asked it to -- a partial write -- so data was being silently skipped because I never checked the return value.
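To make that concrete, here's roughly what checking the return value involves. This is just a sketch with a made-up helper name (write_some), not code from the proxy; the key point is that both a -1/EAGAIN result and a short write leave bytes that haven't been sent, and something has to hold on to them and retry later.

#include <errno.h>
#include <unistd.h>

/* Hypothetical helper, not part of the proxy: attempt to write len bytes
 * and report how many the kernel actually accepted.  Returns the number
 * of bytes written (possibly fewer than len, or 0 if the socket's send
 * buffer is full), or -1 on a genuine error.  The caller has to keep any
 * unwritten tail and retry it later. */
ssize_t write_some(int fd, const char* data, size_t len)
{
    ssize_t written = write(fd, data, len);
    if (written == -1) {
        if (errno == EAGAIN || errno == EWOULDBLOCK) {
            /* The send buffer is full; nothing was written, but it's not
             * a fatal error -- we just need to try again later. */
            return 0;
        }
        return -1;
    }
    /* written may be less than len: a partial write.  The remaining
     * len - written bytes still need to be sent at some point. */
    return written;
}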
[Update, later: and of course the different behaviour on the two machines would make sense. When it was running on the same machine as the blog's HTTP server, it would get data from it really quickly -- fast enough to overwhelm the "upstream" connection to the browser. By contrast, when it was running on a separate machine, the downstream and upstream bandwidth would be more similar -- data would come in from the blog at a speed much closer to the speed at which it could be sent to the client.]
So that means that even for this simple example of an epoll-based proxy to work properly, we need to do some kind of buffering in the server to handle cases where we're getting stuff from the backend faster than we can send it to the client. And possibly vice versa. It's possible to get epoll events on an FD when it's ready to accept output, so that's probably the way to go -- but it will need a bit of restructuring. So the next step will be to implement that, rather than the multiple-backend handling stuff I was planning.
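Something along these lines is probably where this is heading: when a write can't take everything, keep the leftover bytes in a per-connection buffer, ask epoll to report EPOLLOUT on the client socket, and drain the buffer when the socket becomes writable. This is only a rough sketch of the idea -- the struct and function names are invented, not from the proxy's code, and error handling is omitted.

#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical per-connection output buffer; these names aren't from the
 * real proxy code. */
struct out_buffer {
    char data[BUFFER_SIZE];
    size_t len;              /* bytes still waiting to be sent */
};

/* Called when epoll reports the client fd as writable (EPOLLOUT):
 * send as much of the buffered data as the kernel will accept. */
void on_client_writable(int epoll_fd, int client_fd, struct out_buffer* buf)
{
    ssize_t written = write(client_fd, buf->data, buf->len);
    if (written > 0) {
        /* Keep any unsent tail at the front of the buffer. */
        memmove(buf->data, buf->data + written, buf->len - written);
        buf->len -= written;
    }
    /* (-1 with EAGAIN just means "still not writable"; handling of real
     * errors -- closing both sockets -- is left out here.) */

    if (buf->len == 0) {
        /* Buffer drained: stop watching for EPOLLOUT, otherwise epoll
         * would wake us up constantly while the socket stays writable. */
        struct epoll_event ev;
        ev.events = EPOLLIN | EPOLLRDHUP;
        ev.data.fd = client_fd;
        epoll_ctl(epoll_fd, EPOLL_CTL_MOD, client_fd, &ev);
    }
}

The other half of the job would be adding EPOLLOUT to the client handler's event mask (again via EPOLL_CTL_MOD) at the moment the backend handler finds it can't write everything and stashes the leftover bytes in the buffer.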
This is excellent. Now I know a little more about why writing something like nginx is hard, and have a vague idea of why I sometimes see stuff in its logs along the lines of "an upstream response is buffered to a temporary file".
Which is entirely why I started writing this stuff in the first place :-)
Here's a run-through of the code I had to write to fix the bug.