Giving up on the AI chatbot tutorial (for now)
I'm a big fan of learning in public, and early last year I started trying to do that by writing an AI chatbot tutorial as I learned the technology myself. But somehow it just wasn't working -- perhaps because my understanding was evolving so quickly that each time I sat down to write, I spotted dozens of errors in the previous posts, and felt I should fix those first. So I've decided to give up on that one, at least for now.
So, back to something a bit more achievable! Some lab notes will be coming on things I've been working on, including -- later on this evening -- a post about an oddity I found the other day.
In the meantime, here's a blog post I did for PythonAnywhere late last year: Five steps to create your own PythonAnywhere AI guru, on PythonAnywhere.
As those of you who know me (and probably a fair few who don't) will already know, PythonAnywhere was acquired by Anaconda, Inc back in June of this year. We're still the same team, and I'm still leading it, but now we're part of a larger company.
It's been quite a ride. Due diligence and negotiation in the months up to the close were just as tough as I'd always been told they would be (and that's despite the fact that according to our lawyers it was a pretty smooth one as these things go). And now I have to get used to having a boss again, which is weird... but is helped by the fact that said boss is a great guy, and is aligned with us (you can tell from the lingo that I work for a larger company now, right?) on keeping the platform up and running as it was, while investing in it so that it can get better and grow faster.
So, all good news :-)
I've been vaguely considering putting together a few blog posts outlining what happens during an acquisition -- just a general discussion of the steps and what they involve. I wouldn't be putting anything in about this particular deal, of course -- there are strict non-disclosures about the terms and so on -- but just a description of what happens might be useful for other people in the position I was in earlier on this year. I had to learn a lot of stuff very quickly, and while our lawyers were awesome and explained things brilliantly, it would have been useful to have some kind of layman's background information.
What do you think -- worth posting?
A somewhat indirect way of reporting stolen cards to the bank
One of the interesting things about having a business that accepts cards on the Internet is seeing what odd things people do when trying to use your site. A case in point is someone we've noticed over the last few months, who appears to be using our site as a rather indirect way to report stolen cards.
The behaviour that we see is that they run some kind of script that signs up for a bunch of accounts, with randomly-generated usernames, and then try to upgrade them all using stolen card numbers.
Naturally, our fraud-prevention systems pick that up pretty much immediately, and we run our own script that identifies every account that they've created, finds the card details used for them, and reports every transaction and attempted transaction as fraudulent. This means that our payment processor, Stripe, can flag the card numbers as stolen, so that they can't be used elsewhere without triggering fraud alerts to the other merchants. And, if a charge actually goes through (most of the cards tend to be pre-paid with no money on them, so most charges fail), then we refund it as fraudulent, which not only notifies Stripe but also, I believe, notifies the bank that the card number is circulating amongst card fraudsters.
Now, the fact that we do this should be obvious to them. Every time they run their scripts, it causes a minor inconvenience to us (the scripts that we have to handle the problem are getting ever-simpler to use), and it means that every card that they tried on our site is now significantly less valuable as an asset to them. They're essentially paying money for lists of stolen card numbers, and then burning it up.
Given that we're doing this, and they must know that we're doing it, the only explanation I can think of is that they're actually running some kind of strange public service where they buy lists of stolen card details and then get them blocked. It does seem a very roundabout way to do it, though. Surely it would be easier to just tell the banks directly?
But perhaps there's something I'm missing.
Or perhaps they really are dim enough to be using us to check stolen cards for validity, and haven't yet noticed that doing so against a site that reports every fraudulent transaction to the card processor is not a terribly good idea...
Parsing website SSL certificates in Python
A kindly PythonAnywhere user dropped us a line today to point out that StartCom and WoSign's SSL certificates are no longer going to be supported in Chrome, Firefox and Safari. I wanted to email all of our customers who were using certificates provided by those organisations.
We have all of the domains we host stored in a database, and it was surprisingly hard to find out how I could take a PEM-formatted certificate (the normal base-64 encoded stuff surrounded by "BEGIN CERTIFICATE" and "END CERTIFICATE") in a string and find out who issued it.
After much googling, I finally found the right search terms to get to this Stack Overflow post by mhawke, so here's my adaptation of the code:
from OpenSSL import crypto

# `domains` comes from our database; each entry has its PEM certificate in .cert
for domain in domains:
    cert = crypto.load_certificate(crypto.FILETYPE_PEM, domain.cert)
    issuer = cert.get_issuer().CN
    if issuer is None:
        # This happened with a Cloudflare-issued cert
        continue
    if "startcom" in issuer.lower() or "wosign" in issuer.lower():
        # send the user an email
        pass
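The check for a missing CN matters because calling `.lower()` on `None` would raise an exception partway through the loop. Here's a minimal sketch of just the matching logic, pulled out into a helper so it can be tried without any certificates to hand. The name `is_distrusted_issuer`, the `DISTRUSTED_CA_NAMES` tuple and the sample issuer strings are all made up for illustration; they're not part of the script above.

```python
# Hypothetical helper, not part of the original script: decides whether an
# issuer common name belongs to one of the two CAs mentioned in this post.
DISTRUSTED_CA_NAMES = ("startcom", "wosign")


def is_distrusted_issuer(issuer_cn):
    """Return True if issuer_cn mentions StartCom or WoSign (case-insensitive)."""
    if issuer_cn is None:
        # Some certificates have no CN in the issuer field at all.
        return False
    lowered = issuer_cn.lower()
    return any(name in lowered for name in DISTRUSTED_CA_NAMES)


if __name__ == "__main__":
    # Illustrative issuer strings only, not taken from real customer data.
    for cn in ["StartCom Example CA", "Some Other CA", None]:
        print(cn, is_distrusted_issuer(cn))
```

Keeping the names in one tuple means adding another CA later would be a one-line change.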
pam-unshare: a PAM module that switches into a PID namespace
Today in my 10% time at PythonAnywhere (we're a bit less lax than Google) I wrote a PAM module that lets you configure a Linux system so that when someone sus, sudos, or sshes in, they are put into a private PID namespace. This means that they can't see anyone else's processes, either via ps or via /proc. It's definitely not production-ready, but any feedback on it would be very welcome.
In this blog post I explain why I wrote it, and how it all works, including some of the pitfalls of using PID namespaces like this and how I worked around them.
An HTTP request's journey through a platform-as-a-service
I'm definitely getting better as a public speaker :-) At EuroPython in Berlin last month, I gave a high-level introduction to PythonAnywhere's load-balancing system. There's a video up on PyVideo: An HTTP request's journey through a platform-as-a-service. And here are the slides [PDF].
A fun bug
While I'm plugging the memory leaks in my epoll-based C reverse proxy, I thought I might share an interesting bug we found today on PythonAnywhere. The following is the bug report I posted to our forums.
So, here's what was happening.
Each web app someone has on PythonAnywhere runs on a backend server. We have a cluster of these backends, and the cluster is behind a loadbalancer. Every backend server in the cluster is capable of running any web app; the loadbalancer's job is to spread things out between them so that each one at any given time is only running an appropriately-sized subset of them. It has a list of backends, which we can update in realtime as we add or remove backends to scale up or down, and it looks at incoming requests and uses the domain name to work out which backend to route a request to.
That's all pretty simple. The twist comes when we add the code that reloads web apps to the mix.
Reloading a PythonAnywhere web app is simply a case of making an authenticated request to a specific URL. For example, right now (and this might change, it's not an official API, so don't do anything that relies on it) to reload owned by user fred, you'd hit the URL
Now, the PythonAnywhere website itself is just another web app running on one of the backends (a bit recursive, I know). So most requests to it are routed based on the normal loadbalancing algorithm. But calls specifically to that "reload" URL need to be routed differently -- they need to go to the specific backend that is running the site that needs to be reloaded. So, for that URL, and that URL only, the loadbalancer uses the domain name that's specified second-to-the-end in the path bit of the URL to choose which backend to route the request to, instead of using the hostname at the start of the URL.
So, what happened here? Well, the clue was in the usernames of the people who were affected by the problem -- IronHand and JoeButy. Both of you have mixed-case usernames. And your web apps are and
But the code on the "Web" tab that specifies the URL for reloading the selected domain specifies it using your mixed-case usernames -- that is, it specifies that the reload calls should go to the URL for or
And you can probably guess what the problem was -- the backend selection code was case-sensitive. So requests to your web apps were going to one backend, but reload messages were going to a different one. The fix I just pushed made the backend selection code case-insensitive, as it should have been.
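To make the failure mode concrete, here's a rough sketch of how a case-sensitive lookup can send the same site to two places. It assumes the backend is picked by hashing the domain name, which is a guess on my part for illustration; the `choose_backend` function, the backend names and the `example.com` domains are all hypothetical.

```python
import hashlib

# Hypothetical backend names -- the real cluster layout isn't described here.
BACKENDS = ["backend-%d" % i for i in range(20)]


def choose_backend(domain, backends, case_sensitive):
    """Pick a backend by hashing the domain name (an assumed scheme)."""
    key = domain if case_sensitive else domain.lower()
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return backends[int(digest, 16) % len(backends)]


if __name__ == "__main__":
    web_request_host = "ironhand.example.com"  # hostname on a normal request
    reload_path_host = "IronHand.example.com"  # domain built from the username
    for case_sensitive in (True, False):
        a = choose_backend(web_request_host, BACKENDS, case_sensitive)
        b = choose_backend(reload_path_host, BACKENDS, case_sensitive)
        print("case_sensitive=%s: %s vs %s, same=%s" % (case_sensitive, a, b, a == b))
```

With `case_sensitive=False` the two lookups always agree, because both names are lower-cased before hashing; with `case_sensitive=True` they only agree by luck.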
The remaining question -- why did this suddenly crop up today? My best guess is that it's been there for a while, but it was significantly less likely to happen, and so it was written off as a glitch when it happened in the past.
The reason it's become more common is that we actually more than doubled the number of backends yesterday. Because of the way the backend selection code works, when there's a relatively small number of backends it's actually quite likely that the lower-case version of your domain will, by chance, route to the same backend as the mixed-case one. But the doubling of the number of servers changed that, and suddenly the probability that they'd route differently went up drastically.
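A back-of-the-envelope way to see why more backends made this worse: if we assume (and it is only an assumption) that each distinct name is assigned to a backend uniformly and independently, then two different spellings of a domain land on the same backend with probability 1/N. The cluster sizes below are made up, as the real numbers aren't given here.

```python
from fractions import Fraction


def mismatch_probability(num_backends):
    """Chance that two different keys land on different backends,
    assuming each key is assigned uniformly and independently."""
    return 1 - Fraction(1, num_backends)


if __name__ == "__main__":
    # Made-up cluster sizes, purely to show the trend.
    for n in (2, 5, 10, 25):
        p = mismatch_probability(n)
        print("%2d backends: %.0f%% chance of routing differently" % (n, float(p) * 100))
```

The exact figures depend on how the selection code really works, but the direction is the same: every extra backend makes an accidental match less likely.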
Why did we double the number of servers? Previously, backends were m1.xlarge AWS instances. We decided that it would be better to have a larger number of smaller backends, so that problems on one server impacted a smaller number of people. So we changed our system to use m1.large instances instead, spun up slightly more than twice as many backend servers, and switched the loadbalancer across.
So, there you have it. I hope it was as interesting to read about as it was to figure out :-)
SNI-based reverse proxying with Go(lang)
Short version for readers who know all about this kind of stuff: we built a simple reverse-proxy server in Go that load-balances HTTP requests using the Host header and HTTPS using the SNIs from the client handshake. Backends are selected per-host from sets stored in a redis database. It works pretty well, but we won't be using it because it can't send the originating client IP to the backends when it's handling HTTPS. Code here.
We've been looking at options to load-balance our users' web applications at PythonAnywhere; this post is about something we considered but eventually abandoned; I'm posting it because the code might turn out to be useful to other people.
How many Python programmers are there in the world?
We've been talking to some people recently who really wanted to know what the potential market size was for PythonAnywhere, our Python Platform-as-a-Service and cloud-based IDE.
There are a bunch of different ways to look at that, but the most obvious starting point is, "how many people are coding Python?" This blog post is an attempt to get some kind of order-of-magnitude number for that.
First things first: Wikipedia has an estimate of 10 million Java developers (though I couldn't find the numbers to back that up on the cited pages) but nothing for Python -- or, indeed, any of the other languages I checked. So nothing there.
A bit of Googling around gets one interesting hit; in this Stack Overflow answer, "Tall Jeff" says that the 2007 version of Learning Python estimated that there were 1 million Python programmers in the world. Using Amazon's "Look inside" feature on the current edition, they still have the same number but for the present day, but let's assume that they were right originally and the number has grown since then. Now, according to the Python wiki, there were 586 people at the 2007 PyCon. According to the front page at, there were 2,500 people at PyCon 2013. So if we take that as a proxy for the growth of the language, we get one guess of the number of Python developers: 4.3 million.
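For anyone who wants to check the arithmetic, the PyCon-based guess is just the 2007 estimate scaled by the growth in attendance. This uses only the figures quoted above; the variable names are mine.

```python
# Figures quoted in the paragraph above.
programmers_2007 = 1_000_000  # estimate from the 2007 edition of Learning Python
pycon_2007 = 586              # PyCon attendance in 2007
pycon_2013 = 2_500            # PyCon attendance in 2013

# Treat attendance growth as a stand-in for growth in the number of programmers.
growth = pycon_2013 / pycon_2007
estimate = programmers_2007 * growth

print("growth factor: %.2f" % growth)
print("estimate: %.1f million" % (estimate / 1e6))
```

Whether conference attendance tracks the number of programmers is, of course, the shaky part; the sum itself is the easy bit.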
Let's try another metric.'s web statistics are public. Looking at the first five months of this year, and adding up the total downloads, we get:
Month | Downloads |
Jan | 2,584,754 |
Feb | 2,539,177 |
Mar | 3,182,946 |
Apr | 3,199,012 |
May | 2,855,033 |
Averaging that and scaling it up to a full year gives us 34,466,213 downloads per year. It's worth noting that these are overwhelmingly Windows downloads -- most Linux users are going to be using the versions packaged as part of their distro, and (I think, but correct me if I'm wrong) the same is largely going to be the case on the Mac.
So, 34.5 million downloads. There were ten versions of Python released over the last year, so let's assume that each developer downloaded each version once and once only; that gives us 3.5 million Python programmers on Windows.
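Here's the download calculation spelled out, using the five monthly figures from the table and the assumption of ten releases per year. Strictly the result is nearer 3.45 million, which rounds to the 3.5 million used here if you start from the rounded 34.5 million.

```python
# Monthly download counts from the table above.
monthly_downloads = {
    "Jan": 2_584_754,
    "Feb": 2_539_177,
    "Mar": 3_182_946,
    "Apr": 3_199_012,
    "May": 2_855_033,
}

total = sum(monthly_downloads.values())

# Scale the five-month total up to twelve months.
per_year = round(total * 12 / len(monthly_downloads))

# Assume every developer downloaded each of the year's releases exactly once.
versions_released = 10
developers = per_year / versions_released

print("five-month total: {:,}".format(total))
print("downloads per year: {:,}".format(per_year))
print("developers: %.2f million" % (developers / 1e6))
```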
What other data points are there? This job site aggregator's blog post suggests using searches for resumes/CVs as a way of getting numbers. Their suggested search for Python would be
(intitle:resume OR inurl:resume) Python -intitle:jobs -resumes -apply
Being in the UK, where we use "CV" more than we use "resume", I tried this:
(intitle:resume OR inurl:resume OR intitle:cv OR inurl:cv) Python -intitle:jobs -resumes -apply
The results were unfortunately completely useless. 338,000 hits but the only actual CV/resume on the first page was Guido van Rossum's -- everything else was about the OpenCV computer vision library, or about resuming things.
So let's scrap that. What else can we do? Well, taking inspiration (and some raw data) from this excellent blog post about estimating the number of Java programmers in the world, we can do this calculation:
- Programmers in the world: 43,000,000 (see the link above for the calculation)
- Python developers as per the latest TIOBE ranking: 4.183%, which gives 1,798,690
- Python developers as per the latest ranking: 7% (taken by an approximate ratio of the Python score to the sum of the scores of all languages), which gives 2,841,410
OK, so there I'm multiplying one very approximate number of programmers by a "percentage" rating that doesn't claim to be a percentage of programmers using a given language. But this ain't rocket science, I can mix and match units if I want.
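The same sums in code, for the sceptical. The first estimate multiplies the programmer count by the TIOBE rating; the second works backwards from the quoted result to show which share it implies, which comes out a little under the rounded 7%. Only the figures from the list above are used.

```python
total_programmers = 43_000_000

# First estimate: TIOBE rating treated as a share of all programmers.
tiobe_share_percent = 4.183
tiobe_estimate = round(total_programmers * tiobe_share_percent / 100)

# Second estimate: work back from the quoted result to the share it implies.
second_estimate = 2_841_410
implied_share_percent = second_estimate / total_programmers * 100

print("TIOBE-based estimate: {:,}".format(tiobe_estimate))
print("share implied by second estimate: %.1f%%" % implied_share_percent)
```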
The good news is, we're in the same order of magnitude; we've got numbers of 1.8 million, 2.8 million, 3.5 million, and 4.3 million. So, based on some super-unscientific guesswork, I think I can happily say that the number of Python programmers in the world is in the low millions.
What do you think? Are there other ways of working this out that I've missed? Does anyone have (gasp!) hard numbers?
A super-simple chat app with AngularJS, SockJS and node.js
We're planning to move to a more advanced JavaScript library at PythonAnywhere. jQuery has been good for us, but we're rapidly reaching a stage where it's just not enough.
There are a whole bunch of JavaScript MVC frameworks out there that look tempting -- see TodoMVC for an implementation of a simple app in a bunch of them. We're asking the people we know and trust which ones are best, but in the meantime I had a look at AngularJS and knocked up a quick chat app to see how easy it would be. The answer was "very".
Here's the client-side code:
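<html ng-app>
<script src=""></script>
<script src=""></script>
<script>
var sock = new SockJS('');
function ChatCtrl($scope) {
  $scope.messages = [];
  $scope.sendMessage = function() {
    sock.send($scope.messageText);  // send the typed text to the server
    $scope.messageText = "";        // then clear the input box
  };
  sock.onmessage = function(e) {
    $scope.messages.push(e.data);   // add the incoming message to the list
    $scope.$apply();                // tell Angular to re-render
  };
}
</script>
<div ng-controller="ChatCtrl">
<ul>
  <li ng-repeat="message in messages">{{message}}</li>
</ul>
<form ng-submit="sendMessage()">
<input type="text" ng-model="messageText" placeholder="Type your message here" />
<input type="submit" value="Send" />
</form>
</div>
</html>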
Then on the server side I wrote this server (in node.js because I've moved to Shoreditch and have ironic facial hair it was easy to copy, paste and hack from the SockJS docs -- I'd use Tornado if this was on
var http = require('http');
var sockjs = require('sockjs');

// Every connection that has been opened, in order of arrival
var connections = [];

var chat = sockjs.createServer();
chat.on('connection', function(conn) {
    // Register the new connection; its position in the list is its user number
    connections.push(conn);
    var number = connections.length;
    conn.write("Welcome, User " + number);
    conn.on('data', function(message) {
        // Broadcast each incoming message to everyone, including the sender
        for (var ii=0; ii < connections.length; ii++) {
            connections[ii].write("User " + number + " says: " + message);
        }
    });
    conn.on('close', function() {
        for (var ii=0; ii < connections.length; ii++) {
            connections[ii].write("User " + number + " has disconnected");
        }
    });
});

var server = http.createServer();
chat.installHandlers(server, {prefix:'/chat'});
server.listen(9999, '');
And that's it! It basically does everything you need from a simple chat app. Definitely quite impressed with AngularJS. I'll try it in some of the other frameworks we evaluate and post more here.
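Since most of what we do is Python, here's a rough sketch of the same broadcast logic with no networking at all, just to show how little there is to it. It's not a Tornado or SockJS server; `ChatHub` and `FakeConnection` are names invented for this sketch, and it assumes each new connection is added to the list before it's given its number. Like the server above, it never removes closed connections.

```python
class FakeConnection:
    """Stand-in for a real connection; just records what was written to it."""

    def __init__(self):
        self.received = []

    def write(self, message):
        self.received.append(message)


class ChatHub:
    """Keeps a list of connections and broadcasts every message to all of them."""

    def __init__(self):
        self.connections = []

    def connect(self, conn):
        # Register first, so the first user is User 1.
        self.connections.append(conn)
        number = len(self.connections)
        conn.write("Welcome, User %d" % number)
        return number

    def on_data(self, number, message):
        self._broadcast("User %d says: %s" % (number, message))

    def on_close(self, number):
        self._broadcast("User %d has disconnected" % number)

    def _broadcast(self, text):
        for conn in self.connections:
            conn.write(text)


if __name__ == "__main__":
    hub = ChatHub()
    alice, bob = FakeConnection(), FakeConnection()
    alice_number = hub.connect(alice)
    hub.connect(bob)
    hub.on_data(alice_number, "hello")
    print(bob.received)
```

A real server would also want to drop connections from the list when they close, so that it stops writing to sockets that have gone away.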