
Scraping Wikipedia data (And becoming a Wiki-contributor)

David Raxén

Like everyone (or at least almost everyone!), I set out to make some graphs of the Covid-19 data in mid-to-late March this year. Quite early on there were several sites that tracked the number of people testing positive, with the "worldometers" one being among the more popular.

I started out with the Python package "BeautifulSoup", which in many ways is both easy to work with and able to do amazing things. But when I got stuck I asked a question on reddit about how to handle sites that already have their data in neat tables, and got an answer saying that pandas actually has a great function aimed at doing just this, simply called "pandas.read_html()" (https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.read_html.html). It worked absolutely flawlessly for what I was trying to do.


But the Covid-19 data turned out to be hard to compare country by country, since every country had a different approach, both to the testing itself and to the.. truth! So I lost interest and revisited the "Titanic dataset", this time using R instead of Python. It was fun, and I'll probably finish up a post about it in the coming days. Anyway, just as I was finishing up my ML models predicting survival rates, I read another reddit post, from DS recruiters complaining about seeing the same old boring "Titanic dataset" projects in CV after CV, and figured: OK, I'll do something from scratch.


Going back to the drawing board, I thought I'd scrape some data from "Transfermarkt" or "WhoScored", but pretty soon found out that both sites had put in protection against data scraping, which is understandable given that their data is their product. From what I gathered this could be circumvented by using the "Selenium" package, but I felt that I was big enough to respect that they didn't want their data taken, so I set my eyes on Wikipedia instead! ;-)


My favourite team is Chelsea F.C., and they have a Wikipedia page for each season with some tables of info that can be used. By using the requests module and the function pandas.read_html(), those can be downloaded in just a few lines:



import requests
import pandas as pd

season = "https://en.wikipedia.org/wiki/2018-19_Chelsea_F.C._season"
res = requests.get(season)
res.raise_for_status()  # not strictly needed, but good practice
tables = pd.read_html(res.text)  # parse every table on the page into a DataFrame

This saves all the tables (almost all: some tables can have a tricky format) as DataFrames in a list stored in a variable called "tables".
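To see what read_html actually returns, here is a minimal, self-contained sketch using an inline HTML string instead of a live page (so no network is needed; the player and opponent names are just toy data):

```python
import io

import pandas as pd

# A toy page with two tables, standing in for a real Wikipedia article.
html = """
<table>
  <tr><th>Player</th><th>Goals</th></tr>
  <tr><td>Hasselbaink</td><td>23</td></tr>
</table>
<table>
  <tr><th>Round</th><th>Opponent</th></tr>
  <tr><td>3</td><td>Norwich City</td></tr>
</table>
"""

tables = pd.read_html(io.StringIO(html))  # one DataFrame per <table> found
print(len(tables))  # 2
```

Each DataFrame in the list can then be inspected with the usual tables[0].head() and so on.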

Very neat indeed!

One season is good, but I wanted to get a couple of seasons in there to make comparisons between different managers and such. Luckily, it's no big deal to create a loop that runs through a number of seasons. It might have been just as easy to manually create a list of all the seasons, but I wrote a loop where I could quickly change which season interval I was interested in.


seasons = []
i = 0
j = 1
while j < 21:
   s = str(i).zfill(2)  # zero-pad single digits, e.g. 0 -> "00"
   t = str(j).zfill(2)
   seasons.append("20" + s + "-" + t)  # e.g. "2000-01"
   i += 1
   j += 1
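With the seasons list built, the page URLs follow the same naming pattern as the 2018-19 one above (assuming every season article is named the same way):

```python
seasons = ["2000-01", "2001-02", "2002-03"]  # as produced by the loop above
urls = ["https://en.wikipedia.org/wiki/" + s + "_Chelsea_F.C._season" for s in seasons]
print(urls[0])  # https://en.wikipedia.org/wiki/2000-01_Chelsea_F.C._season
```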

In the list called "tables" there will be a couple of DataFrames with information from each game played during a season: which player scored, and often which ones got yellow or red cards. For each season I add them all together in a for loop, using an if statement that picks out the DataFrames at indices [0, 1, 2, 3, 4]. There were some format issues I needed to take care of, mostly with "regex" substitutions, that I'll cover in a separate post. (I'll also create a GitHub account and upload my code in the coming days.)
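The aggregation step can be sketched like this, with tiny hand-made DataFrames standing in for the scraped match tables (the real ones come from pd.read_html on each season's page, so the column names here are illustrative):

```python
import pandas as pd

# Toy stand-ins for the list of tables pd.read_html returns per season.
season_tables = {
    "2018-19": [pd.DataFrame({"Opponent": ["Arsenal"], "Result": ["3-2"]})],
    "2019-20": [pd.DataFrame({"Opponent": ["Liverpool"], "Result": ["1-2"]})],
}

all_games = []
for season, tables in season_tables.items():
    for idx, df in enumerate(tables):
        if idx in [0, 1, 2, 3, 4]:  # only the first few tables hold match info
            df = df.copy()
            df["Season"] = season  # tag each game with its season
            all_games.append(df)

games = pd.concat(all_games, ignore_index=True)
print(games.shape)  # (2, 3)
```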

But..! When everything was set up and all the games had been saved in a new DataFrame, I saw that it skipped the season 2002-03. To find out why, I visited said page... and to my horror I saw that this particular page used a different (and, from what I can gather, older) type of football tables. (The same format that most seasons before 2000 use, where information about who scored the opposition's goals is missing, as well as the minutes when Chelsea's goals were scored.)


This put me in an awkward position where I either had to start my analysis from 2003-04 onwards or figure something else out. I went with something else and started editing the Wikipedia page itself..! After a couple of hours, and a lot of googling of old FA Cup games from 2003, the Wikipedia page https://en.wikipedia.org/wiki/2002-03_Chelsea_F.C._season had tables as fancy as its cousins', and my DataFrame contained information from all competitive games Chelsea has played since autumn 2000. Woop woop!


More to come, in the form of cool ggplots in R!





©2019 by David Raxén.
