
Data cleaning in pandas using regexes and more regexes.

David Raxén

Alright! First things first, my code and the three datasets it creates can be found here -> https://github.com/davidraxen/wikiscraping There is still some cleaning that can be done on the data. One of the Dutch players has his name written in two different ways, for instance, and some of the nations on teams from pre-season games are a bit off - but if you (for some reason) want to do some analysis on Chelsea data that hasn't been stolen from transfermarkt and whoscored (against their will, mind you!) then feel free.


Anyhow! Even though pd.read_html() works like a charm, it does have some issues that became apparent when I got my data. I wanted to add information about which manager was managing Chelsea for each game. This can usually be found in the infobox for each season - but since there is a wikipage with exact dates for all the manager switches, I felt that was a more convenient way of getting them.


#Create df mostly including the date intervals for each manager.
import datetime

import pandas as pd
import requests

res = requests.get("https://en.wikipedia.org/wiki/List_of_Chelsea_F.C._managers")
res.raise_for_status()
managers = pd.read_html(res.text)[0]
managers = managers.iloc[:, list(range(1, 11))]
managers["Name"] = managers["Name"].apply(lambda x: x.split("[")[0])
managers["From"] = pd.to_datetime(managers["From"])
#The current manager's "To" is listed as "present"; swap in today's date.
managers.loc[managers["To"] == "present", "To"] = datetime.datetime.now()
managers["To"] = pd.to_datetime(managers["To"])
#--------------------------------------------------

First I use the requests module to get the html from the page that I'm interested in. There actually is a module called "wikipedia" that can do this as well - and if you're making a program with some sort of user input, that is probably preferable, since it makes it really easy to get the top Wikipedia hit for any search term. But for this, good old requests does the trick nicely.

The table that I'm interested in is the 0th table that pd.read_html() stores in my variable "managers" - the first one, since Python counts from 0 and not from 1. (Some languages start counting from 1 though, like R for instance.)
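A minimal sketch of that list behaviour, on a toy page rather than the real Wikipedia one:

```python
import pandas as pd
from io import StringIO

# Two tables on one page -> pd.read_html returns a *list* of DataFrames.
html = """
<table><tr><th>Name</th></tr><tr><td>Mourinho</td></tr></table>
<table><tr><th>Season</th></tr><tr><td>2004</td></tr></table>
"""
tables = pd.read_html(StringIO(html))
print(len(tables))           # 2
print(tables[0]["Name"][0])  # Mourinho
```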


I'm not using the managers dataframe for more than extracting the dates and names of the managers. But in case I change my mind, I still keep columns 1 through 10 (range(1, 11)) and filter out the rest.

I also use a lambda statement, where I perform an operation on each value in the column named "Name", to remove everything from "[" onwards, because some of the managers have a citation appended to their name that messes things up (like "Rafael Benítez[nb 2]" for instance).
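The split trick on a couple of made-up example names:

```python
# Names as they can appear in the table, some with citation markers appended
# (the exact markers here are invented for the example).
names = ["Rafael Benítez[nb 2]", "José Mourinho", "Frank Lampard[c]"]

# Everything from the first "[" onwards is dropped;
# names without a "[" pass through untouched.
cleaned = [x.split("[")[0] for x in names]
print(cleaned)  # ['Rafael Benítez', 'José Mourinho', 'Frank Lampard']
```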

Lastly I use pandas' built-in function pd.to_datetime() on the "From" and "To" columns - with a small edit where I set today's date as Frank Lampard's last day. Because with Roman you never know when it's time to move on!


After that I create a list containing all the seasons that I'm interested in collecting data from, and a dictionary with information about which international competition Chelsea played in each season - the latter in a similar fashion to the managers.
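The exact lists aren't shown in the post, so this is just an illustrative sketch of what those two structures could look like (season slugs and competition names are my own guesses, not copied from the code):

```python
# Season slugs in roughly the form used in the Wikipedia page titles.
seasons = ["2016-17", "2017-18", "2018-19"]

# Which international competition Chelsea played in each season
# (None when they didn't play in Europe).
competitions = {
    "2016-17": None,
    "2017-18": "UEFA Champions League",
    "2018-19": "UEFA Europa League",
}
```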


After that it's time to extract all the good stuff! First I create some variables that I'm going to store my data in. And this is where the data cleaning kicks in!


for i in seasons:
    season = "https://en.wikipedia.org/wiki/" + i + "_Chelsea_F.C._season"
    try:
        res = requests.get(season)
        res.raise_for_status()
        res, images2 = drx.replaceImageWithTitle(res.text) #Replace images with their resp. "alt"-tag

So here the first thing comes in. I've created a function called "replaceImageWithTitle()" which removes the html-tag for images and replaces it with the image's "alt"-tag (if it has one!). This is because pd.read_html() skips images - which is natural, since it's built to extract text from tables - and the information regarding countries on Wikipedia is often displayed as an image of the nation's flag.

The function works like this:


import re

def replaceImageWithTitle(res):
    '''Replace all images in html-text with
    their "alt"-tag to get all information in tables.'''
    pattern = r"(\<img\salt\=)(.*?)(\s\/\>)"
    pattern2 = r"(?<=\<img\salt\=\")(.*?)(?=\")"
    flupp = re.findall(pattern, res)
    pics = {}
    for tup in flupp:
        tup = ''.join(tup)
        pics[tup] = re.search(pattern2, tup).group(0).replace(" ", "")
    for k, v in pics.items():
        res = res.replace(k, v)
    return res, list(pics.values())

The function first stores two regular expressions, "pattern" and "pattern2". There is a great help site here: https://regex101.com/ that can be used to test regular expressions, since they can be a bit tricky to get right on the first try.

Basically, they each define a pattern that I can use to look for matches in a long text.

The first pattern matches everything that starts with '<img alt=' and ends with ' />' (which is how images are represented on these wikipedia pages), and the second pattern matches everything between '<img alt="' and '"' - which for the Swedish flag would be "Sweden". The function stores all the matches for the first pattern in the brilliantly named variable "flupp", then creates a dictionary called "pics" where it pairs each match with its alt-tag (found by searching for pattern2 inside each match of pattern). When done, it goes through the whole text again, replaces all the keys (matches for pattern) with their values (the internal match for pattern2), and returns the edited text together with a list of all alt-tags that have been found. The latter is useful for creating useful columns later on.
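Here is the whole round trip on a toy snippet - the img tag is made up, but it has the shape the two regexes expect:

```python
import re

html = 'vs. <img alt="Sweden" src="//flag.svg" width="23" /> 3-1'
pattern = r"(\<img\salt\=)(.*?)(\s\/\>)"
pattern2 = r"(?<=\<img\salt\=\")(.*?)(?=\")"

for tup in re.findall(pattern, html):
    tag = "".join(tup)                                        # the full <img ... /> tag
    name = re.search(pattern2, tag).group(0).replace(" ", "")  # e.g. "Sweden"
    html = html.replace(tag, name)

print(html)  # vs. Sweden 3-1
```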

This is followed by two lines: one correcting a mistake in the html-code on one of the wikipages, where "8;" should have been "8", and one making a small change to a table's class attribute, which otherwise made pd.read_html() unable to recognise the html-tag as a table.


        res = res.replace("8;", "8") #if 2018-19
        res = res.replace("wikitable sortable", "wikitable") #Needed to grab a table from 2013-14

But there are more issues in the tables. Some of the tables have "Notes" - comments on special achievements - which mess up the tables, so those notes need to be removed from the html-text. The same goes for some seasons which separated the players by their positions. This is fixed by matching the notes and the position headers with regular expressions and substituting the matches with an empty string, "".



        res = re.sub(Note, "", res) # Removes notes from tables to ensure that all games get captured
        res = re.sub(Positions, "", res) # Cleans up a table with a little too much information.

The regexes here are very long:


Note = r"""\<tr\sstyle\=\"font\-size\:85\%\"\>\<td\scolspan\=\"5\"\>\<i\>Note.*\<\/td\>\s\<\/tr\>"""
Positions = r'''\<tr\>\s\<th\scolspan=\"16\"\sstyle\=\"background\:\#dcdcdc\;\"\>.*\s\<\/th\>\<\/tr\>'''

But the basic idea is the same: the one looking for notes matches a specific font size and the word "Note" (and some other fun stuff).

And the one for the positions looks for a html-tag containing information about the background color in the table - which is different from the rows containing the information I'm interested in.
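A quick sanity check of the Note pattern on a fabricated pair of rows - one normal row, one note row whose contents are invented for the example:

```python
import re

Note = r"""\<tr\sstyle\=\"font\-size\:85\%\"\>\<td\scolspan\=\"5\"\>\<i\>Note.*\<\/td\>\s\<\/tr\>"""

rows = ('<tr><td>Arsenal</td></tr>'
        '<tr style="font-size:85%"><td colspan="5"><i>Note: example note</i></td> </tr>')

# The note row matches the pattern and is substituted away; the normal row survives.
print(re.sub(Note, "", rows))  # <tr><td>Arsenal</td></tr>
```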


More to come! :)



©2019 by David Raxén. Proudly created with Wix.com
