London Financial Python User Group Meeting: September 15

Posted on 24 August 2010 in Python

The next meeting of the LFPUG will be on September 15, from 19:00 to 21:00 -- location TBD. Didrik Pinte will be talking about Enthought's port of NumPy to .NET, which I'm very interested in hearing about.

More information on the LFPUG wiki page.

Bare Git repositories

Posted on 1 July 2010 in Programming

We started a new project at Resolver today -- I'm pretty excited about it, and will be blogging about it soon. However, in the meantime, here's something that's half a note-to-self and half something to help people googling for help with Git problems.

We've previously been using Subversion as our main source code control system, but for more recent projects we've moved to Mercurial. When we started the new one today, we decided to try out Git for a change; I use GitHub for my personal stuff, but hadn't used it for anything involving multiple developers -- and various people had been telling us that it wasn't subject to some of the problems we'd had with Mercurial.

So we created a new Git repo on a shared directory, by creating a directory and then running git init in it. We then cloned it into a working directory on my machine, and started work. After a while, we had our first checkin ready, so we added the files, committed them, and then decided to push to the central repo to make sure everything worked OK. We got this error message:

remote: error: refusing to update checked out branch: refs/heads/master
remote: error: By default, updating the current branch in a non-bare repository
remote: error: is denied, because it will make the index and work tree inconsistent
remote: error: with what you pushed, and will require 'git reset --hard' to match
remote: error: the work tree to HEAD.
remote: error:
remote: error: You can set 'receive.denyCurrentBranch' configuration variable to
remote: error: 'ignore' or 'warn' in the remote repository to allow pushing into
remote: error: its current branch; however, this is not recommended unless you
remote: error: arranged to update its work tree to match what you pushed in some
remote: error: other way.
remote: error:
remote: error: To squelch this message and still keep the default behaviour, set
remote: error: 'receive.denyCurrentBranch' configuration variable to 'refuse'.

It took us a while to work out precisely what this meant, because we'd never heard of "bare" repositories before. It turns out that there are two kinds of repository in Git: bare and non-bare. A non-bare repository is the same as the ones we were used to in Mercurial; it has a bunch of working files, and a directory containing the version control information. A bare repository, by contrast, just contains the version control information -- no working files.

Now, you can (in theory) push and pull between repositories regardless of whether they are bare or not. But if you were to push to a non-bare repository, it would cause problems. Part of the SCC data that Git keeps is an index, which basically tells it what the head of the current branch looks like. If you push to the currently checked-out branch of a non-bare repository, Git will look at the working files, compare them to the index, and see that they differ -- so it will think that the working files have changed! For example, if your push added a new file, it would notice that the working directory didn't include that file, and would conclude that it had been deleted. There's a step-by-step example here.

You can see how that could be confusing. So bare repositories exist as a way of having central repositories that a number of people can push to. If you want to transfer changes from one non-bare repository to another, the correct way is to pull them into the destination rather than push them from the source -- which makes some kind of sense when you think about it. In general, any repository that someone is working on is not something that should be receiving changes without their approval... on the other hand, we've not encountered problems with pushing to regular repositories with Mercurial.

Anyway, this was our first checkin, so we had no history to lose; we fixed the problem by creating a new central repository using git --bare init in a new directory on the shared drive, cloning it to a new working repo, copying our files over from the old working repo to the new one, committing, and pushing back to the bare repository. It worked just fine. If we'd done multiple checkins before we tried our first push, we could have saved things by hand-editing the central repository; it had no working files (because we'd only just created it), so we could have moved the contents of the .git directory up to the repository's root and deleted the now-empty .git -- this would have "bared" it so that we could have pushed to it from our working repo. That would have been a bit scary, though.
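In case anyone else hits the same error and just wants the recipe, here's roughly the sequence of commands we ended up running (the paths and project name are invented for illustration -- substitute your own shared directory):

# On the shared drive: create the central repository as a bare one
mkdir /shared/newproject.git
cd /shared/newproject.git
git --bare init

# On your own machine: clone it, bring your files over, commit and push
git clone /shared/newproject.git
cd newproject
cp -r /path/to/old/working/copy/* .    # the * deliberately skips the old .git directory
git add .
git commit -m "First checkin"
git push origin master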

Running Resolver One on Mono for Windows

Posted on 28 May 2010 in Programming, Resolver One

Mono is an open source version of the .NET framework; it allows you to run .NET applications not just on Windows but on Linux and the Mac. I've spent quite some time over the last week getting our Python spreadsheet, Resolver One, to run on the Windows version, and thought it would be worth sharing some experiences.

[ Read more ]

An odd crontab problem

Posted on 18 May 2010 in Oddities

This took a little while to work out, so it's worth sharing here just in case anyone else has the same problems and is googling for solutions. We had a problem on one of our web servers at Resolver which manifested itself in some (but not all) cron jobs being run twice, which was causing all kinds of problems. Here's how we tracked it down and solved it.

[ Read more ]

Generating political news using NLTK

Posted on 4 May 2010 in Funny, Politics, Programming, Python, Resolver Systems

It's election week here in the UK; on Thursday, we'll be going to the polls to choose our next government. At Resolver Systems, thanks to the energy and inventiveness of our PR guys over at Chameleon, we've been doing a bunch of things related to this, including some analysis for the New Statesman that required us to index vast quantities of tweets and newspaper articles.

Last week I was looking at the results of this indexing, and was reminded of the fun I had playing with NLTK back in February. NLTK is the Python Natural Language Toolkit; as you'd expect, it has a lot of clever stuff for parsing and interpreting text. More unexpectedly (at least for me), it has the ability to take some input text, analyse it, and then generate more text in the same style. Here's something based on the Book of Genesis:

In the selfsame day entered Noah , and asses , flocks , and Maachah . And Joseph said unto him , Abrah and he asses , and told all these things are against me . And Jacob told Rachel that he hearkened not unto you . And Sarah said , I had seen the face of the air ; for he hath broken my covenant between God and every thing that creepeth upon the man : And Eber lived after he begat Salah four hundred and thirty years , and took of every sort shalt thou be come thither .

It was the work of a moment to knock together some code that would read in all of the newspaper articles we'd tagged as being about a particular subject, run them through a Beautiful Soup-based parser to pull out the article text, feed that into NLTK, and then dump the results into a WordPress blog (after a little manual polishing for readability).
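The code itself was nothing clever; here's a rough sketch of the kind of thing it did (this isn't the code we actually ran, and the function name is made up, but the Beautiful Soup and NLTK calls are the interesting part):

import nltk
from BeautifulSoup import BeautifulSoup   # Beautiful Soup 3

def generate_from_articles(html_pages):
    # Pull the visible text out of each article's <p> tags
    article_text = []
    for html in html_pages:
        soup = BeautifulSoup(html)
        for p in soup.findAll('p'):
            article_text.append(''.join(p.findAll(text=True)))

    # Tokenise everything, wrap it in an NLTK Text object, and let its
    # generate() method print new text in the style of the input
    # (the method's exact behaviour varies between NLTK versions)
    tokens = nltk.word_tokenize(' '.join(article_text))
    nltk.Text(tokens).generate()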

[ Read more ]

Regular expressions and Resolver One column-level formulae

Posted on 26 April 2010 in Programming, Python, Resolver One

Recently at Resolver we've been doing a bit of analysis of the way people, parties and topics are mentioned on Twitter and in the traditional media in the run-up to the UK's next national election, on behalf of the New Statesman.

We've been collecting data, including millions of tweets and indexes to newspaper articles, in a MySQL database, using Django as our ORM -- sometime in the future I'll describe the system in a little more depth. However, from our perspective the most interesting thing about it is how we're doing the analysis -- in, of course, Resolver One.

Here's one little trick I've picked up: using regular expressions in column-level formulae as a way of parsing the output of MySQL queries.

Let's take a simple example. Imagine you've queried the database for the number of tweets per day about the Digital Economy Bill (or Act). The output might look like this:

+------------+----------+
| Date       | count(*) |
+------------+----------+
| 2010-03-30 |       99 |
| 2010-03-31 |       30 |
| 2010-04-01 |       19 |
| 2010-04-02 |       12 |
| 2010-04-03 |        2 |
| 2010-04-04 |       13 |
| 2010-04-05 |       30 |
| 2010-04-06 |      958 |
| 2010-04-07 |     1629 |
| 2010-04-08 |     1961 |
| 2010-04-09 |     4038 |
| 2010-04-10 |     2584 |
| 2010-04-11 |     1940 |
| 2010-04-12 |     3333 |
| 2010-04-13 |     2421 |
| 2010-04-14 |     1319 |
| 2010-04-15 |     1387 |
| 2010-04-16 |     3194 |
| 2010-04-17 |      860 |
| 2010-04-18 |      551 |
| 2010-04-19 |      859 |
| 2010-04-20 |      685 |
| 2010-04-21 |      528 |
| 2010-04-22 |      631 |
| 2010-04-23 |      591 |
| 2010-04-24 |      320 |
| 2010-04-25 |      363 |
| 2010-04-26 |      232 |
+------------+----------+

Now, imagine you want to get these numbers into Resolver One, and because it's a one-off job, you don't want to go to all the hassle of getting an ODBC connection working all the way to the DB server. So, first step: copy from your PuTTY window, and second step, paste it into Resolver One:

Shot 1

Right. Now, the top three rows are obviously useless, so let's get rid of them:

Shot 2

Now we need to pick apart things like | 2010-03-30 | 99 | and turn them into separate columns. The first step is to import the Python regular expression library:

Shot 3

...and the next, to use it in a column-level formula in column B:

Shot 4

Now that we've parsed the data, we can use it in further column-level formulae to get the dates:

Shot 5

...and the numbers:

Shot 6

Finally, let's pick out the top 5 dates for tweets on this subject; we create a list:

Shot 7

...sort it by the number of tweets in each day...

Shot 8

...reverse it to get the ones with the largest numbers of tweets...

Shot 9

...and then use the "Unpack" command (control-shift-enter) to put the first five elements into separate cells.

Shot 10

Now, once we've done this once, it's easy to use for other data; for example, we might want to find the five days when Nick Clegg was mentioned most on Twitter. We just copy the same kind of numbers from MySQL, paste them into column A, and the list will automatically update:

Shot 11

So, a nice simple technique to create a reusable spreadsheet that parses tabular data.
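For anyone reading this without the screenshots, the regular expression work is just ordinary Python; here's a sketch of the same steps as a standalone script rather than as column-level formulae (the pattern is the interesting bit -- the Resolver One formulae use the same re calls, just spread across columns):

import re

# Each data row of the pasted MySQL output looks like
# "| 2010-03-30 |       99 |"; this pattern pulls out the two fields
ROW_PATTERN = re.compile(r'\|\s*(\d{4}-\d{2}-\d{2})\s*\|\s*(\d+)\s*\|')

def top_dates(lines, how_many=5):
    counts = []
    for line in lines:
        match = ROW_PATTERN.match(line)
        if match:   # the +----+ borders and the header row simply don't match
            counts.append((int(match.group(2)), match.group(1)))
    # Sort by tweet count, reverse to get the biggest first, take the top few
    counts.sort()
    counts.reverse()
    return [date for count, date in counts[:how_many]]

Run over the rows above, that gives 2010-04-09, 2010-04-12, 2010-04-16, 2010-04-10 and 2010-04-13 as the top five.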

An aside: SEO for restaurants

Posted on 19 March 2010 in Personal

The other day, we got an ad through our letterbox for a new Thai restaurant. We'd become fed up with the other neighbourhood Thai places, so decided to try this one this evening. We could remember the name, "Cafe de Thai", and the street, All Saints Road, but nothing more -- still, no problem: let's Google it!

The results were odd; I won't link to them because they'll change rapidly enough, but what we found was that the front page results had two links to aggregators of celebrity Twitter accounts (because someone who is apparently semi-famous tweeted about the place), but everything else was about other places on the same street, or with vaguely similar names. By contrast, a search for their competitors came up with a bunch of random London restaurant listing sites, many of which I'd never heard of -- but all of which had the information I was looking for, to wit the telephone number and the precise address.

What's interesting to me is that (a) neither restaurant's own web page was on the first page of the listings, and (b) this didn't matter. All that mattered was that the contact details were at the front of the list; the more established place had loads of listings sites giving contact details for them, but the newer place was nowhere to be found. So perhaps, while software companies spend money to make as sure as possible that their own website is at the top of the search results for their name and industry segment, SEO for restaurants is much more nuanced: you don't need your own website to come first, just that of a decent listings site. Ideally, one would assume, a listings site where you get a good rating...

Anyway, just in case anyone has wound up on this page looking for details of the restaurant:

Cafe de Thai
29 All Saints Road
London
020 7243 3001

I recommend the scallops and the weeping tiger; Lola liked her dim sum and red curry with prawns. Alan Carr recommends the green curry, apparently...

OpenCL: .NET, C# and Resolver One integration -- the very beginnings

Posted on 18 March 2010 in GPU Computing, Programming, Python, Resolver One

Today I wrote the code required to call part of the OpenCL API from Resolver One; just one function so far, and all it does is get some information about your hardware setup, but it was great to get it working. There are already .NET bindings for OpenCL, but I felt that it was worthwhile reinventing the wheel -- largely as a way of making sure I understood every spoke, but also because I wanted the simplest possible API, with no extra code to make it more .NETty. It should also work as an example of how you can integrate a C library into a .NET/IronPython application like Resolver One.

I'll be documenting the whole thing when it's a bit more finished, but if you want to try out the work in progress, and are willing to build the interop code, here's how:

  • Make sure you have OpenCL installed -- here's the NVIDIA OpenCL download page, and here's the OpenCL page for ATI. I've only tested this with NVIDIA so far, so I'm keen to hear of any incompatibilities.
  • Clone the dot-net-opencl project from Resolver Systems' GitHub account.
  • Load up the DotNetOpenCL.sln project file in the root of the project using Visual C# 2008 (here's the free "Express" version if you don't have it already).
  • Build the project
  • To try it out from IronPython, run ipy test_clGetPlatformIDs.py
  • To try it in Resolver One, load test_clGetPlatformIDs.rsl

That should be it! If you want to look at the code, the only important bit is in DotNetOpenCL.cs -- and it's simply an external method definition... the tricky bit was in working out which OpenCL function to write an external definition for, and what that definition should look like.

I've put a slightly tidied version of the notes I kept as I implemented this below, for posterity's sake; if you're interested in finding out how the implementation went, read on...

[ Read more ]

OpenCL: first investigations with an NVIDIA card

Posted on 24 February 2010 in GPU Computing, Programming

I'm taking a look at OpenCL at the moment, with the vague intention of hooking it up to Resolver One. In case you've not heard about it, OpenCL is a language that allows you to do non-graphical computing on your graphics card (GPU). Because GPUs have more raw computational power than even modern CPUs, in the form of a large number of relatively slow stream processors, this can speed up certain kinds of calculations -- in particular, those that are amenable to massive parallelisation.

Until recently, the two main graphics card manufacturers had their own languages for this kind of general-purpose GPU computing; NVIDIA had CUDA, and ATI/AMD had their Stream technology. OpenCL was created as a way of having one language that would work on all graphics cards, so although the tools for developing with it are not currently as good as those for CUDA (which has been around for longer and has great support), OpenCL looks to me like the best one to invest time in for the longer term.

It took a little bit of work to get something up and running on my machine here at work, so it's probably worth documenting to help others who are trying to do the same.

[ Read more ]

Playing with NLTK

Posted on 18 February 2010 in Programming, Python

I spent some time this afternoon playing with NLTK, the Python Natural Language Toolkit; the book is highly recommended. I couldn't resist sharing the results of the first time I tried its generate function, which takes a particular text and tries to automatically generate more text in the same style.

First, from the Book of Genesis:

In the selfsame day entered Noah , and asses , flocks , and Maachah . And Joseph said unto him , Abrah and he asses , and told all these things are against me . And Jacob told Rachel that he hearkened not unto you . And Sarah said , I had seen the face of the air ; for he hath broken my covenant between God and every thing that creepeth upon the man : And Eber lived after he begat Salah four hundred and thirty years , and took of every sort shalt thou be come thither .

Next, Jane Austen:

[ Sense and Sensibility by Jane Austen 1811 ] CHAPTER 1 The family of his relations , I know Marianne ' s indisposition , beyond giving time for the world to seeing YOU .-- Very far from being weary of her feeling how more than commonly joyful . He was four years . Every thing was disregarded . I had called in Berkeley Street , January . " Mrs . Jennings , sagaciously smiling , " For my part , not being Willoughby , where much of his fate with Miss Morton , with two pointers playing round him ,

And finally, from a corpus of IRC conversations:

now im lost i agree i saw the whole thing JOIN PART google 's start page .. lol yeah , he said this morning he is hoping they win tonight im gay ....... im happy as hell ...... jus aint #### lol U42 , how are you NICK : U23 what a combo hehehe JOIN . ACTION pictures the blues brothers behind that chicken wire screen . (((((((((( U34 ))))))))))))) Hi U7 ......... how are ya ll gotta watch my manners or she wo n't you play another somebody done somebody wrong song ? JOIN . ACTION wonders if U16

Scarily accurate :-)
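If you want to try this yourself, the corpora that come with the NLTK book make it a two-liner; something like this (you'll need to have downloaded the book collection first, and generate()'s exact behaviour has changed between NLTK versions):

# nltk.download('book') will fetch the corpora if you don't have them yet
from nltk.book import text3   # text3 is the Book of Genesis

text3.generate()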