Writing a reverse proxy/loadbalancer from the ground up in C, part 2: handling multiple connections with epoll
This is the second step along my road to building a simple C-based reverse proxy/loadbalancer so that I can understand how nginx/OpenResty works -- more background here. Here's a link to the first part, where I showed the basic networking code required to write a proxy that could handle one incoming connection at a time and connect it with a single backend.
This (rather long) post describes a version that uses Linux's
epoll API to handle multiple simultaneous
connections -- but it still just sends all of them down to the same backend
server. I've tested it using the Apache ab
server benchmarking tool,
and over a million requests, 100 running concurrently, it adds about 0.1ms to
the average request time as compared to a direct connection to the web server,
which is pretty good going at this early stage. It also doesn't appear to leak
memory, which is doubly good going for someone who's not coded in C since the
late 90s. I'm pretty sure it's not totally stupid code, though obviously
comments and corrections would be much appreciated!
[UPDATE: there's definitely one bug in this version -- it doesn't gracefully handle cases where we can't send data to the client as fast as we're receiving it from the backend. More info here.]
Just like before, the code that I'll be describing is hosted on GitHub as a project called rsp, for "Really Simple Proxy". It's MIT licensed, and the version of it I'll be walking through in this blog post is as of commit f51950b213. I'll copy and paste the code that I'm describing into this post anyway, so if you're following along there's no need to do any kind of complicated checkout.
Before we dive into the code, though, it's worth talking about epoll a bit.
You'll remember that the code for the server in my last post went something like this pseudocode:
while True:
    wait for a new incoming connection from a client
    handle the client connection
...where the code to handle the client connection was basically:
connect to the backend
read a block's worth of stuff from the client
send it to the backend
while there's still stuff to be read from the backend:
    send it to the client
Now, all of those steps to read stuff, or to wait for incoming connections, were blocking calls -- we made the call, and when there was data for us to process, the call returned. So handling multiple connections would have been impossible, as (say) while we were waiting for data from one backend we would also have to be waiting for new incoming connections, and perhaps reading from other incoming connections or backends. We'd be trying to do several things at once.
That sounds like the kind of problem threads, or even cooperating processes, were made for. That's a valid solution, and was the normal way of doing it for a long time. But that's changed (at least in part). To see why, think about what would happen on a very busy server, handling hundreds or thousands of concurrent connections. You'd have hundreds or thousands of threads or processes -- which isn't a huge deal in and of itself, but they'd all be spending most of their time just sitting there using up memory while they were waiting for data to come in. Processes, or even threads, consume a non-trivial amount of machine resources at that kind of scale, and while there's still a place for them, they're an inefficient way to do this kind of work.
A very popular option for network servers like this these days is to use non-blocking IO. It's a bit of a misnomer, but there's logic behind it. The theory is that instead of having your "read from the backend server" or "wait for an incoming connection" call just sit there and not return until something's available, it doesn't block at all -- if there's nothing there to return, it just returns a code saying "nothing for you right now".
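To make that concrete, here's a minimal sketch (not code from rsp) of what a read from a non-blocking socket looks like in C: if there's nothing waiting, read returns -1 and sets errno to EAGAIN or EWOULDBLOCK instead of sitting there:
#include <errno.h>
#include <unistd.h>

/* Minimal sketch: try to read from a socket that has O_NONBLOCK set.
   Returns bytes read, 0 if the peer closed the connection, or -1 --
   in which case errno tells you whether it's "nothing right now"
   (EAGAIN/EWOULDBLOCK) or a real error. */
ssize_t try_read(int socket_fd, char* buffer, size_t buffer_size)
{
    ssize_t bytes_read = read(socket_fd, buffer, buffer_size);
    if (bytes_read == -1 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
        /* "nothing for you right now" -- not an error, just no data yet */
        return -1;
    }
    return bytes_read;
}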
Now obviously, you shouldn't write your code so that it's constantly running through a list of the things you're waiting for saying "anything there for me yet?" because that would suck up CPU cycles to no real benefit. So what the non-blocking model does in practice is provide you with a way to register a whole bunch of things you're interested in, and then gives you a blocking (told you it was a misnomer) function that basically says "let me know as soon as there's anything interesting happening on any of these". The "things" that you're waiting for stuff on are file descriptors. So the previous loop could be rewritten using this model to look something like this:
add the "incoming client connection" file descriptor to the list of things I'm interested in
while True:
wait for an event on the list of things I'm interested in
if it's an incoming client connection:
get the file descriptor for the client connection, add it to the list
connect to the backend
add the backend's file descriptor to the list
else if it's data from an client connection
send it to the associated backend
else if it's data from a backend
send it to the associated client connection
...with a bit of extra logic to handle closing connections.
It's worth noting that this updated version can not only process multiple connections with just a single thread -- it can also handle bidirectional communication between the client and the backend. The previous version read once from the client, sent the result of that read down to the backend, and from then on only sent data from the backend to the client. The pseudocode above keeps sending stuff in both directions, so if the client sends something after the initial block of data, while the backend is already replying, then it gets sent to the backend. This isn't super-useful for simple HTTP requests, but for persistent connections (like WebSockets) it's essential.
There have been many system calls that have been the "wait for an event on the list of things I'm interested in" call in Unix/POSIX-like environments over the years -- select and poll, for example -- but they had poor performance as the number of file descriptors got large. The current popular solution, in Linux at least, is epoll. It can handle huge numbers of file descriptors with minimal reduction in performance. (The equivalent in FreeBSD and Mac OS X is kqueue, and according to Wikipedia there's something similar in Windows and Solaris called "I/O Completion Ports".)
The rest of this post shows the code I wrote to use epoll
in a way that
(a) makes sense to me and feels like it will keep making sense as I add more
stuff to rsp, and (b) works pretty efficiently.
Just as before, I'll explain it by working through the code. There are a bunch
of different files now, but the main one is still rsp.c
, which now has a main
routine that starts like this:
int main(int argc, char* argv[])
{
if (argc != 4) {
fprintf(stderr,
"Usage: %s <server_port> <backend_addr> <backend_port>\n",
argv[0]);
exit(1);
}
char* server_port_str = argv[1];
char* backend_addr = argv[2];
char* backend_port_str = argv[3];
So, some pretty normal initialisation stuff to check the command-line parameters
and put them into meaningfully-named variables. (Sharp-eyed readers will have
noticed that I've updated my code formatting -- I'm now putting the *
to
represent a pointer next to the type to which it points, which makes more sense
to me than splitting the type definition with a space, and I've also discovered
that the C99 standard allows you to declare
variables anywhere inside a function, which I think makes the code much more
readable.)
Now, the first epoll
-specific code:
int epoll_fd = epoll_create1(0);
if (epoll_fd == -1) {
perror("Couldn't create epoll FD");
exit(1);
}
epoll
not only allows you to wait for stuff to happen on multiple file
descriptors at a time -- it's also controlled by its own special type of file
descriptor. You can have multiple epoll FDs in a program, each of which gives
you the ability to wait for changes on a different set of normal FDs. A
specific normal FD could be in several different epoll FDs' sets of things to
listen to. You can even add one epoll FD to the list of FDs another epoll FD
is watching if you're so inclined. But we're not doing anything quite that
complicated here.
You create a special epoll FD using either
epoll_create
or epoll_create1
.
epoll_create
is pretty much deprecated now (see the link for details) so we
just use epoll_create1
in its simplest form, and bomb out if it returns an
error value.
So now we have an epoll instance ready to go, and we need to register some file
descriptors with it to listen to. The first one we need is the one that will
wait for incoming connections from clients. Here's the code that does that in
rsp.c
struct epoll_event_handler* server_socket_event_handler;
server_socket_event_handler = create_server_socket_handler(epoll_fd,
server_port_str,
backend_addr,
backend_port_str);
add_epoll_handler(epoll_fd, server_socket_event_handler, EPOLLIN);
This is all code that's using an abstraction I've built on top of epoll
that
makes it easy to provide callback functions that are called when an event
happens on a file descriptor, so it's worth explaining that now. Let's switch
to the file epollinterface.c
. This defines a function add_epoll_handler
that looks like this:
void add_epoll_handler(int epoll_fd, struct epoll_event_handler* handler, uint32_t event_mask)
{
struct epoll_event event;
event.data.ptr = handler;
event.events = event_mask;
if (epoll_ctl(epoll_fd, EPOLL_CTL_ADD, handler->fd, &event) == -1) {
perror("Couldn't register server socket with epoll");
exit(-1);
}
}
The important system call in there is epoll_ctl
.
This is the function that allows you to add, modify and delete file descriptors
from the list that a particular epoll
file descriptor is watching. You give it
the epoll file descriptor, an operation (EPOLL_CTL_ADD
, EPOLL_CTL_MOD
or
EPOLL_CTL_DEL
), the normal file descriptor you're interested in events for,
and a pointer to a struct epoll_event
. The event you pass in has two fields:
an event mask saying which events on the file descriptor you're interested in,
and some data.
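The code in this post only ever uses EPOLL_CTL_ADD, but for completeness, here's a sketch of what the other two operations would look like -- using the same epoll_event_handler structure described below, with hypothetical function names that aren't part of rsp:
#include <stdint.h>
#include <sys/epoll.h>

/* Hypothetical sketch: change the event mask for an FD that's already
   registered with the epoll FD. */
void update_epoll_handler(int epoll_fd, struct epoll_event_handler* handler, uint32_t event_mask)
{
    struct epoll_event event;
    event.data.ptr = handler;
    event.events = event_mask;
    epoll_ctl(epoll_fd, EPOLL_CTL_MOD, handler->fd, &event);
}

/* Hypothetical sketch: stop watching an FD. The event argument is ignored
   for EPOLL_CTL_DEL on modern kernels, but must be non-NULL on kernels
   older than 2.6.9. */
void remove_epoll_handler(int epoll_fd, struct epoll_event_handler* handler)
{
    struct epoll_event event;
    epoll_ctl(epoll_fd, EPOLL_CTL_DEL, handler->fd, &event);
}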
The data is interesting. When you do the "block until something interesting has
happened on one or more of this epoll
FD's file descriptors" call, it returns a
list of results. Obviously, you want to be able to work out for each event
where it came from so that you can work out what to do with it. Now, this could
have been done by simply returning the file descriptor for each. But epoll
's
designers were a bit cleverer than that.
The thing is, if all epoll
gave you was a set of file descriptors that have had
something happen to them, then you would need to maintain some kind of control
logic saying "this file descriptor needs to be handled by that bit of code over
there, and this other file descriptor by that other code", and so on. That could get complicated
quickly. You only need to look at the code of some of the
epoll
examples on the net
to see that while it might make sample code easier to understand at a glance, it
won't scale. (I should make it clear that this isn't a criticism of the
examples, especially the one I linked to, which is excellent -- just my opinion
that non-trivial non-sample code needs a different pattern.)
So, when epoll
tells you that something's happened on one of the file
descriptors you're interested in, it gives you an epoll_event
just like the
one you registered the FD with, with the events
field set to the bitmask of
the events you've received (rather than the set of the events you're interested
in) and the data
field set to whatever it was you gave it at registration
time.
The type of the data
field is a union, and it looks like this:
typedef union epoll_data {
void *ptr;
int fd;
uint32_t u32;
uint64_t u64;
} epoll_data_t;
Aside for people newish to C -- this was something I had to refresh myself on -- a C union is a type that allows you to put any value from a set of types into a variable. So in a variable with the type specification above, you can store either a pointer to something (
void*
), or an integer, or one of two different types of specifically-sized integers. When you retrieve the value, you have to use the field name appropriate to the type of thing you put in there -- for example, if you were to store a 32-bit integer in the data using the u32 name and then retrieve it using the ptr field, the result would be undefined. (Well, on a 32-bit machine it would probably be a pointer to whatever memory address was represented by that 32-bit integer, but that's unlikely to be what you wanted.)
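A tiny illustrative example (nothing to do with rsp's code) of that rule in action:
#include <sys/epoll.h>

/* Illustration only: whichever member of the union you store,
   that's the member you have to read back. */
void union_example(void)
{
    epoll_data_t data;

    data.fd = 42;                      /* store a plain file descriptor... */
    int the_fd = data.fd;              /* ...and read it back via .fd */

    int some_value = 99;
    data.ptr = &some_value;            /* now the union holds a pointer... */
    int* value_ptr = (int*) data.ptr;  /* ...so only .ptr is meaningful */

    (void) the_fd;
    (void) value_ptr;
}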
In this case, we're using the data
pointer inside the union, and we're setting
it to a pointer to a struct epoll_event_handler
. This is a structure I've
created to provide callback-like functionality from epoll. Let's take a look
-- it's in epollinterface.h:
struct epoll_event_handler {
int fd;
void (*handle)(struct epoll_event_handler*, uint32_t);
void* closure;
};
So, an epoll_event_handler stores:
- The file descriptor it's associated with
- A callback function to handle an epoll event, which takes a pointer to an epoll_event_handler structure and a uint32_t holding the bitmask of the events that need to be handled
- And a pointer to something called closure; basically, a place to store any data the callback function needs to do its job.
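To make the pattern concrete before we get to the real handlers, here's a hypothetical minimal handler (not part of rsp) that just prints out whatever events arrive on its FD; the server socket and client socket handlers below all follow exactly this shape:
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical closure type: whatever data this handler's callback needs. */
struct logging_handler_data {
    const char* label;
};

/* The callback: gets the handler itself plus the event bitmask. */
void handle_logging_event(struct epoll_event_handler* self, uint32_t events)
{
    struct logging_handler_data* closure = (struct logging_handler_data*) self->closure;
    printf("Events 0x%x on fd %d (%s)\n", events, self->fd, closure->label);
}

/* Build the handler: allocate the closure, fill in the struct, and the
   result is ready to pass to add_epoll_handler. */
struct epoll_event_handler* create_logging_handler(int fd, const char* label)
{
    struct logging_handler_data* closure = malloc(sizeof(struct logging_handler_data));
    closure->label = label;

    struct epoll_event_handler* handler = malloc(sizeof(struct epoll_event_handler));
    handler->fd = fd;
    handler->handle = handle_logging_event;
    handler->closure = closure;
    return handler;
}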
Right. Now we have a function called add_epoll_handler
that knows how to add
a file descriptor and an associated structure to an epoll
FD's list of things
it's interested in, so that it's possible to do a callback with data when an
event happens on the epoll FD. Let's go back to the code in rsp.c
that was
calling this. Here it is again:
struct epoll_event_handler* server_socket_event_handler;
server_socket_event_handler = create_server_socket_handler(epoll_fd,
server_port_str,
backend_addr,
backend_port_str);
add_epoll_handler(epoll_fd, server_socket_event_handler, EPOLLIN);
This presumably now makes sense -- we're creating a special handler to handle
events on the server socket (that is, the thing that listens for incoming client
connections) and we're then adding that to our epoll
FD, with an event mask
that says that we're interested in hearing from it when there's something to
read on it (EPOLLIN
) -- that is, a client connection has come in.
Let's put aside how that server socket handler works for a moment, and finish
with rsp.c
. The next lines look like this:
printf("Started. Listening on port %s.\n", server_port_str);
do_reactor_loop(epoll_fd);
return 0;
}
Pretty simple. We print out our status, then call this do_reactor_loop
function, then return. do_reactor_loop
is obviously the interesting bit; it's
another part of my epoll
abstraction layer, and it basically does the "while True"
loop in the pseudocode above -- it waits for incoming events on the epoll FD,
and when they arrive it extracts the appropriate handler, and calls its callback
with its closure data. Let's take a look, back in epollinterface.c
:
void do_reactor_loop(int epoll_fd)
{
struct epoll_event current_epoll_event;
while (1) {
struct epoll_event_handler* handler;
epoll_wait(epoll_fd, &current_epoll_event, 1, -1);
handler = (struct epoll_event_handler*) current_epoll_event.data.ptr;
handler->handle(handler, current_epoll_event.events);
}
}
It's simple enough. We go into a never-ending loop, and each time around we
call epoll_wait
, which, as you'd
expect, is the magic function that blocks until events are available on any one
of the file descriptors that our epoll
FD is interested in. It takes an
epoll
FD to wait on, a place to store incoming events, a maximum number of
events to receive right now, and a timeout. As we're saying "no timeout" with
that -1
as the last parameter, then when it returns, we know we have an event
-- so we extract its handler, and call it with the appropriate data. And back
around the loop again.
One interesting thing here: as you'd expect from the parameters, epoll_wait can actually return multiple events at once. The 1 we're passing in as the penultimate parameter says "just give us one", and we're passing in a pointer to a single struct epoll_event. If we wanted more than one, we'd pass in an array of struct epoll_events, with a penultimate parameter saying how long it is, so that epoll_wait knows the maximum number to return in this batch. If you call epoll_wait with a smaller "maximum events to get" parameter than the number that are actually pending, it returns the maximum you asked for, and the next call gives you the next ones in its queue immediately -- so the only reason to take lots of them in one go is efficiency.
But I've noticed no performance improvement from getting multiple epoll events in one go, and only accepting one event at a time has one advantage. Imagine that you're processing an event on a backend socket's FD which tells you that the backend has closed the connection. You then close the associated client socket, and you free up the memory for both the backend and the client socket's handlers and closures. Closing the client socket means that you'll never get any more events for it (closing a socket automatically removes it from any epoll FDs' lists that it's on). But what if there was already an event for the client socket in the event array returned from your last call to epoll_wait, and you'd just not got to it yet? When you did try to process it, you'd look up its handler and closure data, which had already been freed, and that would almost certainly crash the server. Handling this kind of situation would make the code significantly more complicated, so I've dodged it for now, especially given that it doesn't seem to hurt the proxy's speed.
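For what it's worth, if you did want to take events in batches, the reactor loop would look something like this sketch -- with the freed-handler problem described above still unsolved:
#include <sys/epoll.h>

#define MAX_EVENTS 64

/* Sketch only: a batched version of do_reactor_loop. Beware the problem
   described above -- if handling one event frees another event's handler,
   a later iteration of the inner loop would touch freed memory. */
void do_reactor_loop_batched(int epoll_fd)
{
    struct epoll_event events[MAX_EVENTS];
    while (1) {
        int event_count = epoll_wait(epoll_fd, events, MAX_EVENTS, -1);
        for (int i = 0; i < event_count; i++) {
            struct epoll_event_handler* handler =
                (struct epoll_event_handler*) events[i].data.ptr;
            handler->handle(handler, events[i].events);
        }
    }
}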
So that's our reactor loop (the name "reactor" comes from
Twisted and I've doubtless completely misused the word).
The code that remains unexplained is in the event-handlers. Let's start off by
looking at the one we skipped over earlier -- the server_socket_event_handler
that we created back in rsp.c
to listen for incoming connections. It's in
server_socket.c
and the create_server_socket_handler
function called from
rsp.c
looks like this:
struct epoll_event_handler* create_server_socket_handler(int epoll_fd,
char* server_port_str,
char* backend_addr,
char* backend_port_str)
{
int server_socket_fd;
server_socket_fd = create_and_bind(server_port_str);
So, we create and bind to a server socket, to get a file descriptor for it.
You'll remember that terminology from
the last post,
and in fact the create_and_bind
function (also defined in server_socket.c
) is
exactly the same code as we had to do the same job in the original
single-connection server.
Now, we do our first new thing -- we tell the system to make our server socket non-blocking, which is obviously important if we don't want calls to get data from it to block:
make_socket_non_blocking(server_socket_fd);
This isn't a system call, unfortunately -- it's a utility function, defined in
netutils.c
, and let's jump over there and take a look:
void make_socket_non_blocking(int socket_fd)
{
int flags;
flags = fcntl(socket_fd, F_GETFL, 0);
if (flags == -1) {
perror("Couldn't get socket flags");
exit(1);
}
flags |= O_NONBLOCK;
if (fcntl(socket_fd, F_SETFL, flags) == -1) {
perror("Couldn't set socket flags");
exit(-1);
}
}
Each socket has a number of flags associated with it that control various aspects of it. One of these is whether or not it's non-blocking. So this code simply gets the bitmask that represents the current set of flags associated with a socket, ORs in the "non-blocking" bit, and then applies the new bitmask to the socket. Easy enough, and thanks to Banu Systems for a neatly-encapsulated function for that in their excellent epoll example.
Let's get back to the create_server_socket_handler
function in server_socket.c
.
listen(server_socket_fd, MAX_LISTEN_BACKLOG);
You'll remember this line from the last post, too. One slight difference -- in
the first example, we had MAX_LISTEN_BACKLOG
set to 1. Now it's much higher,
at 4096. This came out of the Apache Benchmarker tests I was doing with large
numbers of simultaneous connections. If you're running a server, and it gets lots of incoming connections and goes significantly past its backlog, then the OS
might assume someone's running a SYN flood denial of service attack against you.
You'll see stuff like this in syslog
:
Sep 4 23:09:27 localhost kernel: [3036520.232354] TCP: TCP: Possible SYN flooding on port 8000. Sending cookies. Check SNMP counters.
Thanks to Erik Dubbelboer for
an excellent writeup
on how this happens and why. A value of 4096
for the maximum backlog seems to
be fine in terms of memory usage and allows this proxy to work well enough for the number of connections I've tested it with so far.
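One related detail that isn't in the code: the backlog you pass to listen is silently capped by the kernel's net.core.somaxconn setting, which for a long time defaulted to 128, so a MAX_LISTEN_BACKLOG of 4096 only takes full effect if that sysctl has been raised as well. A quick sketch for checking the current cap from C, if you're curious:
#include <stdio.h>

/* Sketch: print the kernel's cap on listen() backlogs. If this is lower
   than MAX_LISTEN_BACKLOG, the larger value is silently truncated. */
void print_listen_backlog_cap(void)
{
    FILE* somaxconn_file = fopen("/proc/sys/net/core/somaxconn", "r");
    if (somaxconn_file == NULL) {
        perror("Couldn't read somaxconn");
        return;
    }
    int somaxconn;
    if (fscanf(somaxconn_file, "%d", &somaxconn) == 1) {
        printf("net.core.somaxconn = %d\n", somaxconn);
    }
    fclose(somaxconn_file);
}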
Moving on in the code:
struct server_socket_event_data* closure = malloc(sizeof(struct server_socket_event_data));
closure->epoll_fd = epoll_fd;
closure->backend_addr = backend_addr;
closure->backend_port_str = backend_port_str;
We create a closure of a special structure type that will contain all of the information that our "there's an incoming client connection" callback will need to do its job, and fill it in appropriately.
struct epoll_event_handler* result = malloc(sizeof(struct epoll_event_handler));
result->fd = server_socket_fd;
result->handle = handle_server_socket_event;
result->closure = closure;
return result;
}
...then we create a struct epoll_event_handler
with the FD, the handler
function, and the closure, and return it.
That's how we create a server socket that can listen for incoming client
connections, which when added to the event loop that the code in epollinterface.c
defined, will call an appropriate function with the appropriate data.
Next, let's look at that callback. It's called handle_server_socket_event
,
and it's also in server_socket.c
.
void handle_server_socket_event(struct epoll_event_handler* self, uint32_t events)
{
struct server_socket_event_data* closure = (struct server_socket_event_data*) self->closure;
We need to be able to extract information from the closure we set up originally for this handler, so we start off by casting it to the appropriate type. Next, we need to accept any incoming connections. We loop through all of them, accepting them one at a time; we don't know up front how many there will be to accept so we just do an infinite loop that we can break out of:
int client_socket_fd;
while (1) {
client_socket_fd = accept(self->fd, NULL, NULL);
There are two conditions under which an accept will fail (that is, under which the call to accept will return -1):
if (client_socket_fd == -1) {
Firstly if there's nothing left to accept. If that's the case, we break out of our loop:
if ((errno == EAGAIN) || (errno == EWOULDBLOCK)) {
break;
Secondly, if there's some kind of weird internal error. For now, this means that we just exit the program with an appropriate error message.
} else {
perror("Could not accept");
exit(1);
}
}
If we were able to accept an incoming client connection, we need to create a
handler to look after it, which we'll have to add to our central epoll handler.
This is done by a new function, handle_client_connection
:
handle_client_connection(closure->epoll_fd,
client_socket_fd,
closure->backend_addr,
closure->backend_port_str);
Note that we're passing in the backend address and port that we originally put into the closure when we created the server socket, that came from the command line.
Once that's done, we go back around our accept loop:
}
And once the loop is done, we return from this function:
}
So, in summary -- when we get a message from the server socket file descriptor
saying that there are one or more incoming connections, we call
handle_server_socket_event
, which accepts all of them, calling
handle_client_connection
for each one. We have to make sure that we accept
them all, as we won't be told about them again. (This is actually slightly
surprising, for reasons I'll go into later.)
All this means that our remaining unexplained code is what happens from
handle_client_connection
onwards. This is also in server_socket.c
, and is
really simple:
void handle_client_connection(int epoll_fd,
int client_socket_fd,
char* backend_host,
char* backend_port_str)
{
struct epoll_event_handler* client_socket_event_handler;
client_socket_event_handler = create_client_socket_handler(client_socket_fd,
epoll_fd,
backend_host,
backend_port_str);
add_epoll_handler(epoll_fd, client_socket_event_handler, EPOLLIN | EPOLLRDHUP);
}
We just create a new kind of handler, one for handling events on client sockets,
and register it with our epoll
loop saying that we're interested in events
when data comes in, and when the remote end of the connection is closed.
Onward to the client connection handler code, then! create_client_socket_handler
is defined in client_socket.c
, and looks like this:
struct epoll_event_handler* create_client_socket_handler(int client_socket_fd,
int epoll_fd,
char* backend_host,
char* backend_port_str)
{
make_socket_non_blocking(client_socket_fd);
struct client_socket_event_data* closure = malloc(sizeof(struct client_socket_event_data));
struct epoll_event_handler* result = malloc(sizeof(struct epoll_event_handler));
result->fd = client_socket_fd;
result->handle = handle_client_socket_event;
result->closure = closure;
closure->backend_handler = connect_to_backend(result, epoll_fd, backend_host, backend_port_str);
return result;
}
This code should be pretty clear by now. We make the client socket
non-blocking, create a closure to store data for callbacks relating to it (in
this case, the client handler needs to know about the backend so that it can
send data to it), then we set up the event handler object, create the backend
connection, and return the handler. There are two new functions being used
here -- handle_client_socket_event
and connect_to_backend
, both of which do
exactly what they say they do.
Let's consider connect_to_backend
first. It's also in client_socket.c
, and
I won't copy it all in here, because it's essentially exactly the same code as
was used in the last post
to connect to a backend. Once it's done all of the messing around to get the
addrinfo
, connect to the backend, and get an FD that refers to that backend
connection, it does a few things that should be pretty clear:
struct epoll_event_handler* backend_socket_event_handler;
backend_socket_event_handler = create_backend_socket_handler(backend_socket_fd, client_handler);
add_epoll_handler(epoll_fd, backend_socket_event_handler, EPOLLIN | EPOLLRDHUP);
The same pattern as before -- create a handler to look after that FD, passing in
information for the closure (in this case, just as the client handler needed to
know about the backend, the backend needs to know about this client), then we
add the handler to the epoll
event loop, once again saying that we're
interested in knowing about incoming data and when the remote end closes the
connection.
The only remaining client connection code that we've not gone over is
handle_client_socket_event
. Here it is:
void handle_client_socket_event(struct epoll_event_handler* self, uint32_t events)
{
struct client_socket_event_data* closure = (struct client_socket_event_data* ) self->closure;
char buffer[BUFFER_SIZE];
int bytes_read;
After our initial setup, we work out what to do based on the event bitmask that we were provided. Firstly, if there's some data coming in, we read it:
if (events & EPOLLIN) {
bytes_read = read(self->fd, buffer, BUFFER_SIZE);
There are two possible error cases here. Firstly, perhaps we were misinformed and there's no data (this happens!). If that's the case then we do nothing.
if (bytes_read == -1 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
return;
}
Secondly, perhaps the remote end has closed the connection. We don't always get an official "remote hung up" event when that happens, so if the read reports a closed connection or some other error, we explicitly close the client connection ourselves, along with our connection to the backend.
if (bytes_read == 0 || bytes_read == -1) {
close_backend_socket(closure->backend_handler);
close_client_socket(self);
return;
}
Finally, if we did successfully read some data, we send it down to the backend:
write(closure->backend_handler->fd, buffer, bytes_read);
}
We also need to handle explicit "the client connection has been closed" events (note that the event bitmask can contain multiple events, so the following code might also run for the same notification that triggered the code above):
if ((events & EPOLLERR) | (events & EPOLLHUP) | (events & EPOLLRDHUP)) {
close_backend_socket(closure->backend_handler);
close_client_socket(self);
return;
}
...or in other words, if there's been some kind of error or the remote end hung up, we unceremoniously close the connection to the backend and the client connection itself.
}
And that's the sum of our event handling from client connections.
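One caveat, flagged in the update at the top of this post: that bare write call is the weak spot. On a non-blocking socket, write can send fewer bytes than you asked for, or fail with EAGAIN if the client can't take any more right now, and this version just drops whatever didn't make it. A proper fix needs to buffer the leftover bytes and wait for the socket to become writable again (EPOLLOUT); as a first step, a more careful write would look something like this sketch, which rsp doesn't do yet:
#include <errno.h>
#include <unistd.h>

/* Sketch: write as much as the socket will take right now. Returns the
   number of bytes actually written, or -1 on a real error. A complete
   fix would buffer anything left over and send it on EPOLLOUT. */
int write_as_much_as_possible(int fd, char* buffer, int length)
{
    int total_written = 0;
    while (total_written < length) {
        int written = write(fd, buffer + total_written, length - total_written);
        if (written == -1) {
            if (errno == EAGAIN || errno == EWOULDBLOCK) {
                break;  /* socket is full -- the rest needs buffering */
            }
            return -1;  /* real error */
        }
        total_written += written;
    }
    return total_written;
}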
There's one interesting but perhaps non-obvious thing happening in that code.
You'll remember that when we were handling the "incoming client connection"
event, we had to carefully accept every incoming connection because we weren't
going to be informed about it again. In this handler, however, we read a
maximum of BUFFER_SIZE
bytes (currently 4096). What if there were more than
4096 bytes to read?
Explaining this requires a little more background on epoll
. It can operate in
two different modes -- edge-triggered and level-triggered. Level-triggered
is the default, so it's what we're using here. In level-triggered mode, if you
receive an epoll
notification that there's data waiting for you, and only read
some of it, then epoll
notes that there's still unhandled data waiting, and
schedules another event to be delivered later. By contrast, edge-triggered mode
only informs you once about incoming data. If you don't process it all, it
won't tell you again.
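For comparison, if we registered the client socket with EPOLLET to get edge-triggered behaviour, the read side of the handler would have to drain the socket completely every time it's told about data. A sketch of what that would look like, assuming rsp's existing handler structures and BUFFER_SIZE (and still ignoring the write problem mentioned earlier):
#include <errno.h>
#include <unistd.h>

/* Sketch: reading under edge-triggered epoll (EPOLLIN | EPOLLRDHUP | EPOLLET
   at registration time). With EPOLLET you only hear about new data once, so
   you must keep reading until you get EAGAIN/EWOULDBLOCK. */
void drain_client_socket_edge_triggered(struct epoll_event_handler* self,
                                        struct epoll_event_handler* backend_handler)
{
    char buffer[BUFFER_SIZE];
    while (1) {
        int bytes_read = read(self->fd, buffer, BUFFER_SIZE);
        if (bytes_read == -1 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            break;  /* fully drained -- epoll won't remind us about the rest */
        }
        if (bytes_read == 0 || bytes_read == -1) {
            break;  /* closed or errored -- tidy up as in the real handler */
        }
        write(backend_handler->fd, buffer, bytes_read);
    }
}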
So because we're using level-triggered epoll
, we don't need to make sure we
read everything -- we know that epoll
will tell us later if there's some stuff
we didn't read. And doing things this way gives us a nice way to make sure that
when we are handling lots of connections, we time-slice between them reasonably
well. After all, if every time we got data from a client, we processed it all
in the handler, then if a client sent us lots of data in one go, we'd sit there
processing it and ignoring other clients in the meantime. Remember, we're not
multi-threading.
That's all very well, but if it's the case and we can use it when processing
data from a client, why did we have to be careful to accept all incoming client
connections? Surely we could only accept the first one, then rely on epoll
to
tell us again later that there were still more to handle?
To be honest, I don't know. It seems really odd to me. But I tried changing the accept code to only accept one client connection, and it didn't work -- we never got informed about the ones we didn't accept. Someone else got the same behaviour and reported it as a bug in the kernel back in 2006. But it's super-unlikely that something like this is a kernel bug, especially after seven years, so it must be something odd about my code, or deliberate defined behaviour that I've just not found the documentation for. Either way, the thread continuing from that bug report has comments from people saying that regardless of whether you're running edge- or level-triggered, if you want to handle lots of connections then accepting them all in one go is a good idea. So I'll stick with that for the time being, and if anyone more knowledgeable than me wants to clarify things in the comments then I'd love to hear more!
So, what's left? Well, there's the code to close the client socket handler:
void close_client_socket(struct epoll_event_handler* self)
{
close(self->fd);
free(self->closure);
free(self);
}
Simple enough -- we close the socket, then free the memory associated with the closure and the handler.
There's a little bit of extra complexity here, in how we call this close
function from the handle_client_socket_event
function. It's all to do with
memory management, like most nasty things in C programs. But it's worth having
a quick look at the backend-handling code first. As you'd expect, it's in
backend_socket.c
, and it probably looks rather familiar. We have a function to create a backend handler:
struct epoll_event_handler* create_backend_socket_handler(int backend_socket_fd,
struct epoll_event_handler* client_handler)
{
make_socket_non_blocking(backend_socket_fd);
struct backend_socket_event_data* closure = malloc(sizeof(struct backend_socket_event_data));
closure->client_handler = client_handler;
struct epoll_event_handler* result = malloc(sizeof(struct epoll_event_handler));
result->fd = backend_socket_fd;
result->handle = handle_backend_socket_event;
result->closure = closure;
return result;
}
Essentially the same as its client socket equivalent, but it doesn't need to create a client handler for its closure because one was passed in.
There's a function to handle backend events, which also won't have many surprises:
void handle_backend_socket_event(struct epoll_event_handler* self, uint32_t events)
{
struct backend_socket_event_data* closure = (struct backend_socket_event_data*) self->closure;
char buffer[BUFFER_SIZE];
int bytes_read;
if (events & EPOLLIN) {
bytes_read = read(self->fd, buffer, BUFFER_SIZE);
if (bytes_read == -1 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
return;
}
if (bytes_read == 0 || bytes_read == -1) {
close_client_socket(closure->client_handler);
close_backend_socket(self);
return;
}
write(closure->client_handler->fd, buffer, bytes_read);
}
if ((events & EPOLLERR) | (events & EPOLLHUP) | (events & EPOLLRDHUP)) {
close_client_socket(closure->client_handler);
close_backend_socket(self);
return;
}
}
And finally, code to close the backend socket:
void close_backend_socket(struct epoll_event_handler* self)
{
close(self->fd);
free(self->closure);
free(self);
}
There's a lot of duplication there. Normally I'd refactor to make as much common code as possible between client and backend connections. But the next steps into making this a useful proxy are likely to change the structure enough that it's not worth doing that right now, only to undo it a few commits later. So there it remains, for now.
That's all of the code! The only thing remaining to explain is the memory management weirdness I mentioned in the close handling.
Here's the problem: when a connection is closed, we need to free the memory
allocated to the epoll_event_handler
structure and its associated closure. So
our handle_client_socket_event
function, which is the one notified when the
remote end is closed, needs to have access to the handler structure in order to
close it. If you were wondering why the epoll
interface abstraction passes
the handler structure into the callback function (instead of just the closure,
which would be more traditional for a callback interface like this) then there's
the explanation -- so that it can be freed when the connection closes.
But, you might ask, why don't we just put the memory management for the handler
structure in the epoll
event loop, do_reactor_loop
? When an event comes in,
we could handle it as normal and then if the event said that the connection had
closed, we could free the handler's memory. Indeed, we could even handle more
obscure cases -- perhaps the handler could return a value saying "I'm done, you can free my handler".
But it doesn't work, because we're not only closing the connection for the FD the handler is handling. When a client connection closes, we need to close the backend, and vice versa. Now, when the remote end closes a connection, we get an event from epoll. But if we close it ourselves, then we don't. For most normal use, that doesn't matter -- after all, we just closed it, so we should know that we've done so and tidy up appropriately.
But when in a client connection handler we're told that the remote end has disconnected, we need to not only free the client connection (and thus free its handler and its closure), we also need to close the backend and free its stuff up. Which means that the client connection needs to have a reference not just to the backend connection's FD to send events -- it also needs to know about the backend connection's handler and closure structures because it needs to free them up too.
So there's the explanation. There are other ways we could do this kind of thing -- I've tried a bunch -- but they all require non-trivial amounts of accounting code to keep track of things. As the system itself is pretty simple right now (notwithstanding the length of this blog post) then I think it would be an unnecessary complication. But it is something that will almost certainly require revisiting later.
So, on that note -- that's it! That's the code for a trivial epoll
-based
proxy that connects all incoming connections to a backend. It can handle
hundreds of simultaneous connections -- indeed, with appropriate ulimit
changes to increase the maximum number of open file descriptors, it can handle
thousands -- and it adds very little overhead.
In the next step, I'm going to integrate Lua scripting. This is how the proxy will ultimately handle backend selection (so that client connections can be delegated to appropriate backends based on the hostname they're for) but initially I just want to get it integrated for something much simpler. Here's a link to the post.
Thanks again to Banu Systems for their awesome epoll example, which was invaluable.