What are you talking about?

One recent Friday afternoon I decided to write a Python script that would scrape News24 RSS feeds and fetch all the news stories they link to. I wanted to use it to find the themes that certain columnists cover in their posts. I used BeautifulSoup to collect the text and then Tagxedo to do the visualization.

For example, I took Simon Williamson's News24 RSS feed (http://feeds.news24.com/articles/News24/Columnists/Simon-Williamson/rss), grabbed all the articles linked from it, and then used the resulting text to create the following tag cloud.

Simon Williamson News 24 Recent Columns Keyword Cloud
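
The feed-to-text step described above can be sketched roughly as follows. This is not the code from my scraper, just a minimal illustration of the idea. It uses only the Python standard library (`xml.etree.ElementTree` and `html.parser`) instead of BeautifulSoup so that it runs without extra installs, and it assumes the feed is plain RSS 2.0 with one `<link>` per `<item>`. The helper names `fetch`, `extract_links` and `html_to_text` are made up for this example.

```python
import urllib.request
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


def fetch(url):
    """Download a URL and return the raw response body as bytes."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def extract_links(rss_xml):
    """Return the <link> text of every <item> in an RSS 2.0 document.

    Items without a <link> (or with an empty one) are skipped.
    """
    root = ET.fromstring(rss_xml)
    return [item.findtext("link").strip()
            for item in root.iter("item")
            if item.findtext("link")]


class _TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Strip tags from an HTML string and return the text, space-joined."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

With those pieces, collecting the text for one columnist would look something like `" ".join(html_to_text(fetch(link).decode("utf-8", errors="replace")) for link in extract_links(fetch(feed_url)))`. Note that this grabs all the text on each page, including navigation and footers, so a real scraper would want to narrow it down to the article body.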

As you can see, I ran the scraper in the weeks before the US election, and Simon had been commenting on it in his recent columns. Here are two other columnists. The usual suspect, Khaya Dlanga:

Khaya Dlanga Column Tag Cloud

Finally, Sibongile Mafu:

Sibongile Mafu Column Tag Cloud

This was not exhaustive or clean, but it does give a glimpse into each columnist's recent themes. There are lots of improvements that could be made to the scraper; I did not build many features into it. It would be awesome if you could just give it a news website and the name of a journalist or columnist, and it would figure out the structure automatically and return the text you want.
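
One small cleanup that would help is filtering out common words and counting what is left before handing the text to a tag cloud tool. Here is a rough sketch of that idea in plain Python. The stop-word list is a tiny hand-picked sample for illustration only, and `top_terms` is a made-up helper name, not something from my scraper or from Tagxedo.

```python
import re
from collections import Counter

# A deliberately tiny, illustrative stop-word list; a real one would be longer.
STOP_WORDS = frozenset({
    "the", "and", "that", "this", "with", "for", "are", "was", "were",
    "but", "not", "you", "have", "has", "had", "from", "they", "his",
    "her", "their", "its", "who", "what", "which", "will", "would",
})


def top_terms(text, n=20, stop_words=STOP_WORDS):
    """Return the n most frequent words in text as (word, count) pairs.

    Words are lowercased; stop words and words shorter than three
    letters are ignored.
    """
    words = (w.strip("'") for w in re.findall(r"[a-z']+", text.lower()))
    counts = Counter(w for w in words if len(w) > 2 and w not in stop_words)
    return counts.most_common(n)
```

Calling `top_terms(article_text, n=50)` on the combined text of a columnist's articles gives a quick list of the words that would dominate their cloud.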

You can get the scraper here: https://github.com/bionicv/XMLScraper
