Regular expressions and Resolver One column-level formulae
Recently at Resolver we've been doing a bit of analysis of the way people, parties and topics are mentioned on Twitter and in the traditional media in the run-up to the UK's next national election, on behalf of the New Statesman.
We've been collecting data, including millions of tweets and indexes to newspaper articles, in a MySQL database, using Django as an ORM -- sometime in the future I'll describe the system in a little more depth. However, from our perspective the most interesting thing about it is how we're doing the analysis -- in, of course, Resolver One.
Here's one little trick I've picked up: using regular expressions in column-level formulae to parse the output of MySQL queries.
Let's take a simple example. Imagine you have queried the database for the number of tweets per day about the Digital Economy Bill (or Act). It might look like this:
+------------+----------+
| Date | count(*) |
+------------+----------+
| 2010-03-30 | 99 |
| 2010-03-31 | 30 |
| 2010-04-01 | 19 |
| 2010-04-02 | 12 |
| 2010-04-03 | 2 |
| 2010-04-04 | 13 |
| 2010-04-05 | 30 |
| 2010-04-06 | 958 |
| 2010-04-07 | 1629 |
| 2010-04-08 | 1961 |
| 2010-04-09 | 4038 |
| 2010-04-10 | 2584 |
| 2010-04-11 | 1940 |
| 2010-04-12 | 3333 |
| 2010-04-13 | 2421 |
| 2010-04-14 | 1319 |
| 2010-04-15 | 1387 |
| 2010-04-16 | 3194 |
| 2010-04-17 | 860 |
| 2010-04-18 | 551 |
| 2010-04-19 | 859 |
| 2010-04-20 | 685 |
| 2010-04-21 | 528 |
| 2010-04-22 | 631 |
| 2010-04-23 | 591 |
| 2010-04-24 | 320 |
| 2010-04-25 | 363 |
| 2010-04-26 | 232 |
+------------+----------+
Now, imagine you want to get these numbers into Resolver One, and because it's a one-off job, you don't want to go to all the hassle of getting an ODBC connection working all the way to the DB server. So, first step: copy the output from your PuTTY window; second step: paste it into Resolver One:
Right. Now, the top three rows are obviously useless, so let's get rid of them:
Now we need to pick apart things like | 2010-03-30 | 99 |
and turn them into separate columns. The first step is to import the Python regular expression library:
...and the next, to use it in a column-level formula in column B:
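I can't reproduce the Resolver One grid here, but the heart of that formula is an ordinary Python regex match. As a rough sketch in plain Python (the exact pattern is an illustrative assumption, not the formula from the screenshot):

```python
import re

# A pattern that picks the date and the count out of one pasted row.
# This regex is an illustrative guess at what the column B formula does.
ROW_PATTERN = re.compile(r"\|\s*(\d{4}-\d{2}-\d{2})\s*\|\s*(\d+)\s*\|")

# One line exactly as copied from the MySQL client output above:
match = ROW_PATTERN.match("| 2010-03-30 |       99 |")
```

In the spreadsheet, the same expression sits in a column-level formula, so it runs once per row of pasted data.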
Now that we've parsed the data, we can use it in further column-level formulae to get the dates:
...and the numbers:
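In plain-Python terms, those two follow-up formulae just pull the groups back out of each row's match object -- something like this sketch (the variable names are mine, and I'm assuming the useless header rows have already been deleted):

```python
import re

ROW_PATTERN = re.compile(r"\|\s*(\d{4}-\d{2}-\d{2})\s*\|\s*(\d+)\s*\|")

# A couple of the pasted rows, as they'd sit in column A (illustrative):
raw_rows = [
    "| 2010-04-08 |     1961 |",
    "| 2010-04-09 |     4038 |",
]

# The date is the first group of the match, the tweet count the second:
dates = [ROW_PATTERN.match(row).group(1) for row in raw_rows]
counts = [int(ROW_PATTERN.match(row).group(2)) for row in raw_rows]
```

Note the `int()` call on the second group: the regex hands back strings, and you want real numbers in the spreadsheet so that sorting and charting work properly.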
Finally, let's pick out the top 5 dates for tweets on this subject; we create a list
...sort it by the number of tweets in each day...
...reverse it to get the ones with the largest numbers of tweets...
...and then use the "Unpack" command (control-shift-enter) to put the first five elements into separate cells.
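The list / sort / reverse / take-five sequence boils down to something like this in plain Python (using a subset of the counts from the table above, just to show the shape of the data):

```python
# (date, count) pairs parsed from the pasted rows -- a subset of
# the table above, purely for illustration:
rows = [
    ("2010-04-06", 958),
    ("2010-04-07", 1629),
    ("2010-04-08", 1961),
    ("2010-04-09", 4038),
    ("2010-04-10", 2584),
    ("2010-04-12", 3333),
    ("2010-04-13", 2421),
    ("2010-04-16", 3194),
]

# Sort by tweet count, reverse so the busiest days come first, then
# take the first five -- the list that "Unpack" spreads across cells:
top_five = sorted(rows, key=lambda pair: pair[1], reverse=True)[:5]
```

The Unpack step itself has no plain-Python equivalent -- it's the Resolver One command that spills a list's elements into neighbouring cells.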
Now, once we've done this, it's easy to reuse for other data; for example, we might want to find the five days when Nick Clegg was mentioned most on Twitter. We just copy the same kind of numbers from MySQL, paste them into column A, and the list will automatically update:
So, a nice simple technique to create a reusable spreadsheet that parses tabular data.