Python: Scraping elements relative to each other with BeautifulSoup
Last week we hosted a Game of Thrones-based intro to Cypher at the Women Who Code London meetup, and in preparation had to scrape the wiki to build a dataset.
I’ve built lots of datasets this way and it’s a painless experience as long as the pages make liberal use of CSS classes and/or IDs.
Unfortunately, the Game of Thrones wiki doesn’t really do that, so I had to find another way to extract the data I wanted: locating elements by their position relative to more prominent elements on the page.
For example, I wanted to extract Arya Stark's allegiances which look like this on the page:
We don’t have a direct route to her allegiances but we do have an indirect path via the h3 element with the text 'Allegiance'.
The following code gets us the 'Allegiance' element:
from bs4 import BeautifulSoup
file_name = "Arya_Stark"
wikia = BeautifulSoup(open("data/wikia/characters/{0}".format(file_name), "r"), "html.parser")
allegiance_element = [tag for tag in wikia.find_all('h3') if tag.text == "Allegiance"]
> print allegiance_element
[<h3 class="pi-data-label pi-secondary-font">Allegiance</h3>]
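If you only need the first match, the same lookup can probably be written without the list comprehension by passing `string=` to `find`. This is a minimal Python 3 sketch against a made-up fragment modelled on the markup above; the `pi-item` and `pi-data-value` class names and the `SAMPLE_HTML` variable are assumptions for illustration, not taken from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: only the <h3> markup is copied from the post,
# the wrapper div and the value div's class are assumptions.
SAMPLE_HTML = """
<div class="pi-item">
  <h3 class="pi-data-label pi-secondary-font">Allegiance</h3>
  <div class="pi-data-value"><a href="/wiki/House_Stark" title="House Stark">House Stark</a><br/><a href="/wiki/Faceless_Men" title="Faceless Men">Faceless Men</a> (Formerly)</div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# string= matches a tag whose single text child is exactly "Allegiance".
# find() returns the first matching Tag, or None if nothing matches.
label = soup.find("h3", string="Allegiance")
print(label)
```

Note that `find` returns `None` when the heading is missing, so real scraping code would want to check for that before going any further.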
Now we need to work out the relative position of the div containing the houses. It’s inside the same parent div so I thought it’d probably be the next sibling:
next_element = allegiance_element[0].next_sibling
> print next_element
Nope. Nothing! Hmmm, wonder why:
> print next_element.name, type(next_element)
None <class 'bs4.element.NavigableString'>
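A tiny reproduction makes the cause easier to see. In this Python 3 sketch (the one-line HTML string is invented for illustration), the newline between the two tags is parsed as its own text node, and that node is what `next_sibling` returns:

```python
from bs4 import BeautifulSoup, NavigableString

# Invented markup: a heading, a newline, then the div we actually want.
html = "<div><h3>Allegiance</h3>\n<div>House Stark</div></div>"
soup = BeautifulSoup(html, "html.parser")

heading = soup.find("h3")
sibling = heading.next_sibling

# The sibling is a text node holding only the whitespace between the tags.
is_text_node = isinstance(sibling, NavigableString)
is_blank = sibling.strip() == ""
print(is_text_node, is_blank)
```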
Ah, it’s a text node, most likely just the whitespace between the two tags. Maybe it’s the one after that?
next_element = allegiance_element[0].next_sibling.next_sibling
> print next_element.contents
[<a href="/wiki/House_Stark" title="House Stark">House Stark</a>, <br/>, <a href="/wiki/Faceless_Men" title="Faceless Men">Faceless Men</a>, u' (Formerly)']
Hoorah! After this it became a case of working out how the text was structured and pulling out what I wanted.
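As a rough idea of what that last step can look like, here is a minimal Python 3 sketch; it is not the code I ended up with. It uses `find_next_sibling`, which skips over text nodes, instead of chaining `next_sibling` twice, and then treats a trailing "(Formerly)" string as a flag on the preceding link. The sample markup, the `pi-item` and `pi-data-value` class names, and the `name`/`former` dictionary keys are all assumptions made for the example.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical fragment modelled on the output shown above.
SAMPLE_HTML = """
<div class="pi-item">
  <h3 class="pi-data-label pi-secondary-font">Allegiance</h3>
  <div class="pi-data-value"><a href="/wiki/House_Stark" title="House Stark">House Stark</a><br/><a href="/wiki/Faceless_Men" title="Faceless Men">Faceless Men</a> (Formerly)</div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
label = soup.find("h3", string="Allegiance")

# find_next_sibling only considers tags, so the whitespace node is skipped.
value_div = label.find_next_sibling("div")

allegiances = []
for link in value_div.find_all("a"):
    trailing = link.next_sibling
    # A text node right after the link may carry the "(Formerly)" qualifier.
    former = isinstance(trailing, NavigableString) and "(Formerly)" in trailing
    allegiances.append({"name": link.get_text(strip=True), "former": former})

print(allegiances)
```

Anchoring on the label and walking to its sibling keeps the lookup independent of how the rest of the page is laid out, which is the whole point when there are no useful classes or IDs to select on.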
The code I ended up with is on GitHub if you want to recreate it yourself.
About the author
I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.