Matplotlib colours

14 02 2013
Matplotlib supports HTML colour names like ‘AquaMarine’ or ‘BlueViolet’. A complete list of available names can be obtained from this page. But it’s sometimes useful to have the list of all names inside your program, e.g. if you have many (say 10 or 20) different groups of data to be plotted in different colours on a scatter plot.

import urllib2
from BeautifulSoup import BeautifulSoup

def get_page(url):
    user_agent = 'Mozilla/5 (Solaris 10) Gecko'
    headers = {'User-Agent' : user_agent}
    request = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(request)
    the_page = response.read()
    pool = BeautifulSoup(the_page)
    return pool

pool = get_page('')  # URL of the colour-name page linked above
res = pool.find('table', attrs={'class' : 'reference'})

# this is the list that will contain the colour names
c = []

# skip the header row; the colour name is the text of the
# first cell of each remaining row
for row in res.findAll('tr')[1:]:
    cols = row.findAll('td')
    if cols:
        c.append(''.join(cols[0].findAll(text=True)))

# to save in a file
with open('colorlist', 'w+') as f:
    for item in c:
        f.write(item + '\n')

Analysing and predicting Schlag den Raab

16 12 2012

I’m a big fan of the Schlag den Raab TV show. Last night we got a new record: a 3.5 million euro prize for the challenger. There have been 38 episodes so far, so I think it’s ripe for some statistics. Some data is available from Wikipedia as a wiki table, which I munged and cleaned with Google Refine and exported to a CSV file. It’s interesting to observe winner statistics just from this raw data, e.g. by professional group, gender, or age. For example, there has been no female winner since the beginning of the show, and 6 out of the 13 winners are 30 years or older.
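
The kind of aggregation I did on the cleaned CSV can be sketched with pandas like this — the rows and column names here are made-up stand-ins for illustration, not the real episode data:

```python
import pandas as pd

# Hypothetical rows standing in for the cleaned Wikipedia export;
# the real CSV has one row per episode with fields like these.
episodes = pd.DataFrame({
    'episode': [1, 2, 3, 4],
    'winner': ['Raab', 'candidate', 'Raab', 'candidate'],
    'candidate_age': [27, 31, 24, 35],
    'candidate_gender': ['m', 'm', 'f', 'm'],
})

# How often does the challenger beat Raab?
win_rate = (episodes['winner'] == 'candidate').mean()

# Of the winning candidates, how many are 30 or older?
winners = episodes[episodes['winner'] == 'candidate']
older_winners = (winners['candidate_age'] >= 30).sum()

print(win_rate)       # 0.5 for this toy data
print(older_winners)  # 2
```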

I will continue collecting more data as the show goes on, perhaps devising some prediction methods/models along the way — probably the good old but reliable support vector machine or some such. So as not to remove the fun from the show, it’s not going to be a full prediction with a “cutoff” before an episode. I’m thinking more of a prediction that is continuously updated during an episode as more and more information is unveiled, thus gaining more and more certainty.

For example, the candidate’s occupation could have some influence, and it only becomes known about 30 minutes into the episode. Some challenges are played in one form or another in every episode (e.g. “Blamieren oder Kassieren”, challenges involving car-driving skills, certain types of sports, etc.). It would be interesting to get some statistics on these too, e.g. Raab almost always wins Blamieren oder Kassieren. Some data is available from this site, but it’s not complete: results from earlier episodes are missing.

These challenges can also be grouped into the set of challenges played in an episode, which can be used as one “feature” of the prediction model. E.g. if an episode contains challenges at which Raab is extremely good, then it is very likely that he will win the episode. Again, full knowledge of this set is not available before the episode, so the prediction would have to be updated as the episode progresses. It might be possible to set a cutoff at some point in the episode, e.g. once the certainty that either Raab or the candidate will win exceeds some percentage.
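
A minimal sketch of that idea with scikit-learn, assuming a binary feature vector of challenge types per episode — every number below is invented purely for illustration, not real show data:

```python
from sklearn.svm import SVC

# Toy feature matrix: each row is an episode, each column a challenge
# type (1 = played in that episode). Labels: 1 = Raab won the episode.
# All values are made up; the first column might stand for something
# like "Blamieren oder Kassieren".
X = [
    [1, 0, 1, 0], [1, 1, 0, 0], [1, 0, 0, 1], [1, 1, 1, 0], [1, 0, 1, 1],
    [0, 1, 1, 1], [0, 0, 0, 1], [0, 1, 0, 1], [0, 0, 1, 1], [0, 1, 1, 0],
]
y = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

clf = SVC(kernel='linear', probability=True, random_state=0).fit(X, y)

# As the episode progresses and more challenges become known, re-predict
# with the updated feature vector and watch the certainty change.
p_raab = clf.predict_proba([[1, 0, 1, 0]])[0][1]
print(p_raab)
```

The continuously-updated prediction then amounts to re-evaluating `predict_proba` whenever a new challenge is revealed, and cutting off once the probability crosses a chosen threshold.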

Let’s see.. 🙂

GE flight quest — flight routes

6 12 2012

I thought a prediction model would need to be bound to a certain route, so route grouping it is… and mapping, why not. This is the reference data from Nov 20, 2012. Not all flights are included, otherwise it would become too cluttered.
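
The route grouping itself is simple: count flights per (origin, destination) pair and keep only the busiest routes to avoid clutter. A toy sketch (the airport pairs here are invented, not from the actual reference data):

```python
from collections import Counter

# Toy flight records as (origin, destination) ICAO pairs; the real
# records come from the GE Flight Quest reference data.
flights = [
    ('KJFK', 'KLAX'), ('KJFK', 'KLAX'), ('KORD', 'KSFO'),
    ('KJFK', 'KLAX'), ('KORD', 'KSFO'), ('KATL', 'KDFW'),
]

# Group flights by route, then drop routes flown only once.
route_counts = Counter(flights)
busiest = [route for route, n in route_counts.most_common() if n >= 2]
print(busiest)  # [('KJFK', 'KLAX'), ('KORD', 'KSFO')]
```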

GE flight quest — airports

2 12 2012

First exploration of the data: I save all airport locations (departure and arrival combined, at least for those that have an ICAO code, with coordinates from the airport data) in a KML file (get the link URL and open it in Google Maps). 511 airports are contained over the whole US mainland.
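
Writing such a KML file needs no extra library — one placemark per airport is enough. A minimal sketch, with two made-up example airports standing in for the parsed list (note KML wants longitude before latitude):

```python
# Illustrative stand-ins for the parsed (ICAO code, lon, lat) records.
airports = [
    ('KJFK', -73.7781, 40.6413),
    ('KLAX', -118.4085, 33.9416),
]

kml = ['<?xml version="1.0" encoding="UTF-8"?>',
       '<kml xmlns="http://www.opengis.net/kml/2.2"><Document>']
for code, lon, lat in airports:
    # KML coordinates are "lon,lat", not "lat,lon"
    kml.append('<Placemark><name>%s</name>'
               '<Point><coordinates>%f,%f</coordinates></Point>'
               '</Placemark>' % (code, lon, lat))
kml.append('</Document></kml>')

with open('airports.kml', 'w') as f:
    f.write('\n'.join(kml))
```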

GE Flight Quest challenge page here.

Sarah Jessica Parker

14 11 2012

OK, this post actually has very little to do with the actress. It’s just that there’s this guy on YouTube commenting on my comment on a Doctor Who clip, which was meant to refer to the Sarah Jessica Parker horse joke. He/she said that “Sarah” or “Jessica” was not a common name in the Old West era. Well, I don’t want to start a fight with him/her; it’s just that I’d been playing around with historical American baby-name data, so I’m really tempted to figure out how popular those names actually were.

First, as background: the referred Doctor Who episode is called “A Town Called Mercy”, aired in September this year. The episode features the Doctor going back to the Wild West, with cowboys and all. In one scene, the Doctor, claiming that he speaks horse, contradicts a preacher, saying that the horse he’s about to ride is called Susan, not Joshua as the preacher had claimed.

There was no mention of the year in which the whole story takes place, so I can only infer it from a dialogue between the Doctor and Rory:

The Doctor: That’s not right

Rory: It’s a street lamp.

The Doctor: An electric one, about ten years too early.

Rory: That’s only a few years out.

The Doctor: That’s what you said when you left your phone charger in Henry VIII’s own suite.

Given that Thomas Edison invented the electric light bulb in 1879, the event must have taken place around the 1860s–1870s. My dataset starts in 1880, so it’s actually not so far off.

So I just went on processing the data with pandas to get the percentage over time of the name “Sarah” in the whole American population:
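
The processing can be sketched roughly like this — the rows below are toy values in the format of the SSA baby-name files (name, sex, count), not the real 1880 figures; the actual analysis reads one such file per year (yob1880.txt, …):

```python
import pandas as pd

# Toy rows in the SSA file format; counts are invented for illustration.
births = pd.DataFrame(
    [('Sarah', 'F', 1288), ('Sarah', 'M', 12), ('John', 'M', 9655),
     ('Mary', 'F', 7065), ('Jessica', 'F', 7)],
    columns=['name', 'sex', 'births'])

# Sum over both sexes, then express each name as a share of all births.
totals = births.groupby('name')['births'].sum()
percent = 100.0 * totals / births['births'].sum()
print(percent['Sarah'])
```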

Hmm, in 1880 the name “Sarah” constituted about 1.3 percent of the overall population. As we will see later, this is actually quite high. Maybe not so surprising, because Sarah is sort of a biblical name (well, I guess; at least it has some religious flavour in Islam, and I suppose it’s pretty much the same in the Bible). Extrapolating back in time from the graph, it could well have been even higher before 1880. So that kid’s comment is invalid! Sarah is a popular name…

By the way, I don’t differentiate between boys’ and girls’ names (the dataset actually does). I just sum up the statistics of both, as that is the only number of interest for my analysis.

Then another plot for “Jessica”:

All right, “Jessica” seems to be a modern-world phenomenon. It gained some popularity in the 1960s, peaked in the late 1980s, and has lost popularity since then. Now, “Parker”:

Hmmm… the name “Parker” is an even more recent phenomenon. What I find really interesting is the spike slightly after the year 2000 and the continuing popularity of the name. Just go ahead, let your imagination run free, and relate this phenomenon to the release year of the Spiderman movie… (Peter Parker, that is. You’re welcome.)

Now, about the relative proportion of the name “Sarah”: the following plot is a segment of the first plot, between 1880 and 1890, overlaid on the average proportion of all names in each year:

Here’s what it means: any single (boy or girl) name between 1880 and 1890 constitutes on average only about 0.09 percent of the population. With 1.3 percent, “Sarah” is actually quite popular…

What can we learn from this? If a horse in 1870 claims that her name is Sarah, we really should believe it. If a horse today claims that her name is Sarah Jessica Parker, I think that’s quite possible as well.

UPDATE. The name “Susan” was actually less popular than “Sarah” in 1880.

Weekend bookmark

12 11 2012

…full of data analysis and machine learning stuff…

  • Google Refine, a power tool for working with messy data. Nice, but kinda slow when I loaded the US presidential candidate donation data from Wes McKinney’s pandas tutorial (ca. 500,000 lines; I use an MBA with 4 GB of RAM, and only ca. 1 GB was available when loading the data into Refine…)
  • scikit-learn, machine learning in Python. A MUST TRY.
  • mrjob, Yelp‘s open sourced mapreduce package for Python.
  • dumbo, another Python mapreduce package. Not sure why Yelp created a new library (mrjob, that is) for the same purpose…
  • Nominatim, kinda nice tool to get (latitude, longitude) coordinate from addresses or vice versa.
  • Seven Python libraries you should know about…
  • and, what seems to be the most exciting so far: Ramp, rapid machine learning prototyping, essentially a pandas wrapper around Python’s various machine learning and statistics libraries (scikit-learn, rpy2, etc.). 

A time full of excitement awaits us, folks…