Matplotlib colours

14 02 2013
Matplotlib supports HTML colour names like ‘AquaMarine’ or ‘BlueViolet’. A complete list of available names can be obtained from this page. But it’s sometimes useful to have the list of names inside your program, e.g. if you have many (say 10 or 20) different groups of data to plot in different colours on a scatter plot. The script below scrapes the names from that page and saves them to a file.

import urllib2
from BeautifulSoup import BeautifulSoup

def get_page(url):
    user_agent = 'Mozilla/5 (Solaris 10) Gecko'
    headers = {'User-Agent' : user_agent}
    request = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(request)
    the_page = response.read()
    pool = BeautifulSoup(the_page)
    return pool

pool = get_page('')
res = pool.find('table', attrs={'class' : 'reference'})

# this is the list that will contain the colour names
c = []

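for row in res.findAll('tr')[1:]:  # skip the header row
    cols = row.findAll('td')
    if cols:
        # assumption: the colour name is the text of the first cell
        c.append(''.join(cols[0].findAll(text=True)).strip())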

# to save in a file
with open('colorlist', 'w+') as f:
    for item in c:
        f.write(item + '\n')
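
Once the names are in the file, handing them out to groups is a small step. A minimal sketch in Python 3, assuming the colorlist file written above holds one name per line; `load_colour_names` and `assign_colours` are hypothetical helper names, not part of matplotlib:

```python
from itertools import cycle


def load_colour_names(path):
    """Read one colour name per line, skipping blank lines."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]


def assign_colours(groups, names):
    """Map each group label to a colour name, cycling if names run out."""
    if not names:
        raise ValueError('no colour names available')
    return dict(zip(groups, cycle(names)))
```

Each value in the returned dict can then be used as the colour of that group's points. Matplotlib also keeps its own name-to-hex table in `matplotlib.colors.cnames`, which may save the scraping altogether.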

It’s easy… when you know it

27 12 2012

The saying “it’s easy when you know it” is rather underrated, I think. When someone says it, your immediate response is probably “of course it is…”

The only way to comprehend how much truth there is in it is to start doing things by yourself and see how badly you can screw things up when you don’t know how the pieces work together. And, sometimes, how easily it can happen.

And now you will probably tell me to look at how screwed up my life is.

Coursera’s introduction to databases

23 12 2012

For no apparent reason (read: boredom, most likely) I decided to opt in to this free, self-paced online course on databases from Coursera. After spending a horrible half year with a jerky XML subdomain called SOAP web services, the first chapters on relational databases, XML, and JSON look pretty much like child’s play :p

Analysing and predicting Schlag den Raab

16 12 2012

I’m a big fan of the Schlag den Raab TV show. Last night we got a new record prize of 3.5 million euros for the challenger. There have been 38 episodes of it so far, so I think it’s ripe for some statistics. Some data is available from Wikipedia as a wiki table, which I munged and cleaned with Google Refine and exported to a CSV file. It’s interesting to observe the winners’ statistics, e.g. professional group, gender, and age, just from this raw data. For example, there has been no female winner since the beginning of the show, and 6 out of the 13 winners are 30 years or older.
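
Tallies like these take only a few lines once the CSV exists. A rough sketch, assuming hypothetical column names `winner`, `gender` and `age` (the real export will differ) and using made-up placeholder rows rather than actual show data:

```python
import csv
import io
from collections import Counter

# Placeholder rows for illustration only; not real episode data.
SAMPLE_CSV = """episode,winner,gender,age
1,Raab,,
2,challenger,m,34
3,challenger,m,27
4,Raab,,
"""


def challenger_wins(rows):
    """Keep only the episodes the challenger won."""
    return [r for r in rows if r['winner'] == 'challenger']


def tally(rows):
    """Count challenger wins by gender and by age band (under 30 / 30+)."""
    wins = challenger_wins(rows)
    by_gender = Counter(r['gender'] for r in wins)
    by_age = Counter('30+' if int(r['age']) >= 30 else '<30' for r in wins)
    return by_gender, by_age


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
by_gender, by_age = tally(rows)
print(by_gender, by_age)
```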

I will continue collecting more data as the show continues, perhaps devising some prediction methods/models along the way, probably the good-old-but-reliable support vector machines or something of that sort. So as not to remove the fun from the show, it’s not going to be a full prediction with a “cutoff” before an episode. I’m thinking more of a prediction that is continuously updated during an episode as more and more information is unveiled, thus gaining more and more certainty.

For example, the candidate’s occupation could have some influence, but it only becomes known about 30 minutes into the episode. Some challenges are played in one form or another in every episode (e.g. “Blamieren oder Kassieren”, challenges involving car driving skills, certain types of sports, etc.). It would also be interesting to get some statistics on these, e.g. Raab almost always wins Blamieren oder Kassieren. Some data is available from this site, but it’s not complete: results from earlier episodes are missing.

These challenges can also be grouped into the set of challenges played in an episode, which can be used as one “feature” of the prediction model. E.g. if an episode contains challenges at which Raab is extremely good, then it is very likely that he will win the episode. Again, full knowledge of this set is not available before the episode, so the prediction would have to be updated as the episode progresses. It might be possible to set a cutoff at some point in the episode, e.g. once the certainty that either Raab or the candidate will win exceeds some percentage.
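
One way to picture the continuously updated prediction is a running log-odds score that each new piece of information nudges up or down. The sketch below is a naive Bayes style update, not the support vector machine mentioned earlier; the prior, the likelihood ratios and the 90% cutoff are invented placeholders, and `update` and `first_confident_call` are hypothetical names:

```python
import math


def to_prob(log_odds):
    """Convert log-odds of a Raab win into a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))


def update(log_odds, likelihood_ratio):
    """Fold in one piece of evidence.

    likelihood_ratio = P(evidence | Raab wins) / P(evidence | challenger wins)
    """
    return log_odds + math.log(likelihood_ratio)


def first_confident_call(prior, ratios, cutoff=0.9):
    """Return (step, probability) once either side passes the cutoff.

    Returns (None, final probability) if the cutoff is never reached.
    """
    log_odds = math.log(prior / (1.0 - prior))
    p = prior
    for step, ratio in enumerate(ratios, start=1):
        log_odds = update(log_odds, ratio)
        p = to_prob(log_odds)
        if p >= cutoff or p <= 1.0 - cutoff:
            return step, p
    return None, p


# Invented numbers: an even prior, then evidence from three events.
step, p = first_confident_call(0.5, [2.0, 3.0, 2.0])
print(step, round(p, 3))
```

Working in log-odds keeps each update a simple addition, and the same check covers both outcomes: a probability near 1 calls it for Raab, one near 0 for the candidate.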

Let’s see.. 🙂

GE flight quest — flight routes

6 12 2012

I thought a prediction model would need to be bound to a certain route, so route grouping it is… and mapping, why not.. This is the reference data on Nov 20, 2012. Not all flights are included, otherwise it’ll become too cluttered. 
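
Grouping by route can be as simple as keying each flight on its origin and destination. A small sketch, assuming each flight record is a dict with hypothetical `origin` and `destination` keys holding airport codes (the actual Flight Quest column names may differ):

```python
from collections import defaultdict


def group_by_route(flights):
    """Return {(origin, destination): [flight, ...]} for the given records."""
    routes = defaultdict(list)
    for flight in flights:
        routes[(flight['origin'], flight['destination'])].append(flight)
    return dict(routes)


def busiest_routes(routes, n=10):
    """Pick the n routes with the most flights, to keep a map uncluttered."""
    return sorted(routes, key=lambda r: len(routes[r]), reverse=True)[:n]


# Illustrative records only.
flights = [
    {'id': 1, 'origin': 'KSFO', 'destination': 'KJFK'},
    {'id': 2, 'origin': 'KSFO', 'destination': 'KJFK'},
    {'id': 3, 'origin': 'KJFK', 'destination': 'KSFO'},
]
routes = group_by_route(flights)
print(busiest_routes(routes, n=1))
```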

Python datetime

4 12 2012

Can’t we just get over datetime and move on to dateutil?
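
To illustrate the complaint: with the standard library alone, parsing timestamps that arrive in mixed formats means spelling out every format by hand. A sketch with a hypothetical `parse_any` helper and a guessed list of formats; `dateutil.parser.parse` takes over this guessing, which is much of its appeal:

```python
from datetime import datetime

# Formats we expect to meet; every new one has to be added by hand.
KNOWN_FORMATS = [
    '%Y-%m-%d %H:%M:%S',
    '%Y-%m-%dT%H:%M:%S',
    '%d %m %Y',
]


def parse_any(text):
    """Try each known format in turn and return the first match."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError('unrecognised date format: %r' % text)


print(parse_any('2012-12-04 08:30:00'))
print(parse_any('4 12 2012'))
```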

GE flight quest — airports

2 12 2012

As a first exploration of the data, I saved all airport locations (departure and arrival combined, at least for those that have an ICAO code, with coordinates from the airport data) in a KML file (get the link URL and open it in Google Maps). It contains 511 airports spread over the whole US mainland.
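
Writing such a file needs little more than the standard library. A minimal sketch, assuming the airports are available as `(icao, latitude, longitude)` tuples; `airports_to_kml` is a hypothetical helper and the two sample entries use approximate coordinates for illustration only:

```python
import xml.etree.ElementTree as ET

KML_NS = 'http://www.opengis.net/kml/2.2'


def airports_to_kml(airports):
    """Build a KML document with one Placemark per (icao, lat, lon) tuple."""
    ET.register_namespace('', KML_NS)
    kml = ET.Element('{%s}kml' % KML_NS)
    doc = ET.SubElement(kml, '{%s}Document' % KML_NS)
    for icao, lat, lon in airports:
        placemark = ET.SubElement(doc, '{%s}Placemark' % KML_NS)
        ET.SubElement(placemark, '{%s}name' % KML_NS).text = icao
        point = ET.SubElement(placemark, '{%s}Point' % KML_NS)
        # KML expects longitude first, then latitude.
        coords = ET.SubElement(point, '{%s}coordinates' % KML_NS)
        coords.text = '%f,%f' % (lon, lat)
    return ET.tostring(kml, encoding='unicode')


# Approximate coordinates, for illustration only.
sample = [('KSFO', 37.62, -122.38), ('KJFK', 40.64, -73.78)]
kml_text = airports_to_kml(sample)
print(kml_text)
```

The returned string can be written to a file with a .kml extension and loaded into any viewer that reads KML.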

GE Flight Quest challenge page here.