#DataHack1: 2014 South African Election Social Media Hack #HackRU
So I am attending a bit of HackRU Spring 2014 and thought I would take the opportunity to sharpen up my Python and data science skills. The dataset I chose to do some deep shallow dives on was a Twitter archive of tweets about the upcoming 2014 South African elections. I set up a small Python script using Twitter Python. The script grabs all tweets (hopefully) that have to do with the following search keywords
The list is not exhaustive but should still result in some interesting insights. This specific blog post is about the ~7800 tweets collected on 11 April 2014.
The first bit of analysis was to go through all the tweets and extract the most used hashtags. The figure below is a histogram of the number of occurrences of each hashtag.
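For anyone who wants to redo the hashtag count from the JSON dumps, here is a minimal sketch. It assumes each tweet is a dict in the standard Twitter API layout, with hashtags under `entities.hashtags[].text`. The function name `count_hashtags` and the sample tweets are made up for illustration and are not the script I actually ran.

```python
from collections import Counter


def count_hashtags(tweets):
    """Count hashtag occurrences across a list of tweet dicts.

    Hashtags are lowercased so that #Zuma and #zuma land in the same bucket.
    """
    counts = Counter()
    for tweet in tweets:
        # Tweets without entities or hashtags are simply skipped.
        for tag in tweet.get("entities", {}).get("hashtags", []):
            counts[tag["text"].lower()] += 1
    return counts


# Made-up tweets in the assumed layout, purely for illustration.
sample_tweets = [
    {"entities": {"hashtags": [{"text": "Ayisafani"}, {"text": "zuma"}]}},
    {"entities": {"hashtags": [{"text": "ayisafani"}]}},
    {"entities": {"hashtags": []}},
]

for tag, n in count_hashtags(sample_tweets).most_common(10):
    print(f"#{tag}: {n}")
```

The `most_common` output can be fed straight into a bar chart to get a histogram like the one above.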
What is interesting about the most frequent hashtag is that it also gives us a glimpse of what was happening on April 11th. The top hashtag, #ayisafani, coincided with the rigorous campaign by the Democratic Alliance to spread their new "Banned" TV ad [YouTube Ad]. Unsurprisingly, the president also features heavily: #zuma is the second most used hashtag. It would be interesting to do some sentiment analysis on these tweets, but getting a good sentiment model would be hard given the mixture of languages used in South Africa. Anyway, if anyone is interested in implementing this, the data is available and can be accessed from the link provided later.
Klout and Retweets
Now I know this might be a bit circular, but I wanted to measure the correlation between the user Klout scores and the number of retweets they received in a given day. For this I used the Klout API via their Python package to retrieve the Klout scores of each retweeted user. I present the scatter plot of the # of Retweets vs. the User Klout score below.
I annotated the plot with the labels of the top 5 retweeted accounts. The correlation for the top 30 most retweeted users was 0.51. So the correlation is positive but not strong. This was still interesting, as we can see that @EconFreedomZA is punching above its weight on this day, with more retweets than the @DA_News account, which has the highest Klout score. Hmm. Well, something interesting happens when we look at the number of mentions vs. Klout score instead. Below I plot only the top 20 mentioned users and their Klout scores. I did not annotate it, but to help in reading it I will say that the user with the second highest Klout score is @hellenzille (leader of @DA_News). The correlation here is 0.71. Much stronger.
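The correlation figures above can be computed along these lines. This is a sketch of a plain Pearson correlation restricted to the top N accounts. I am assuming Pearson here since that is the usual default. The helper names and the numbers in the example are hypothetical, not the real data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / (spread_x * spread_y)


def top_n_pairs(counts, klout_scores, n):
    """Return (counts, scores) for the n users with the highest counts.

    Users without a known Klout score are dropped.
    """
    ranked = sorted(counts, key=counts.get, reverse=True)[:n]
    users = [u for u in ranked if u in klout_scores]
    return [counts[u] for u in users], [klout_scores[u] for u in users]


# Hypothetical retweet counts and Klout scores, for illustration only.
retweets = {"user_a": 120, "user_b": 80, "user_c": 45, "user_d": 3}
klout = {"user_a": 70.0, "user_b": 55.0, "user_c": 61.0, "user_d": 20.0}

xs, ys = top_n_pairs(retweets, klout, n=3)
print(round(pearson(xs, ys), 2))
```

Swapping the retweet counts for mention counts gives the second figure.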
Now it might be interesting if we can reverse engineer the complex Klout calculation via linear regression.
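As a starting point for that idea, here is a minimal sketch of an ordinary least-squares fit of Klout score against a single feature such as the mention count. The real Klout calculation surely uses far more than one input, so this is only a toy; `fit_line` and the numbers are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one feature."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        raise ValueError("cannot fit a line when all x values are equal")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical mention counts and Klout scores, for illustration only.
mentions = [10, 40, 75, 120]
scores = [35.0, 48.0, 60.0, 71.0]

slope, intercept = fit_line(mentions, scores)
print(f"klout ~ {slope:.3f} * mentions + {intercept:.1f}")
```

Adding more features (retweets, followers and so on) would turn this into a multiple regression, which is where a proper library becomes the better choice.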
The Mention Network
The first image in this post is a snapshot of the social network constructed by tracking the mentions of specific accounts in the Twitter dataset. The relative size of each node (account) is the number of mentions that account has received in the network. As you would expect with this metric, @DA_News has the largest size. You can view the full network with accounts that have at least 6 mentions in the network here: 11th April SA Election Mention Network
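A rough sketch of how such a mention network can be put together from the tweet JSON follows. It assumes the standard Twitter API layout (`user.screen_name` for the author and `entities.user_mentions[].screen_name` for the mentioned accounts). Keeping only edges that point at accounts above the threshold is my assumption for this sketch, and `build_mention_network` is a hypothetical name.

```python
from collections import Counter


def build_mention_network(tweets, min_mentions=6):
    """Build a directed mention network from tweet dicts.

    Returns (nodes, edges):
      nodes maps each account with at least min_mentions mentions to its
            mention count (usable as the node size);
      edges maps (author, mentioned) pairs to how often author mentioned
            that account, restricted to the kept mentioned accounts.
    """
    edge_counts = Counter()
    mention_counts = Counter()
    for tweet in tweets:
        author = tweet["user"]["screen_name"]
        for mention in tweet.get("entities", {}).get("user_mentions", []):
            target = mention["screen_name"]
            edge_counts[(author, target)] += 1
            mention_counts[target] += 1

    nodes = {a: c for a, c in mention_counts.items() if c >= min_mentions}
    edges = {pair: w for pair, w in edge_counts.items() if pair[1] in nodes}
    return nodes, edges


# Made-up tweets in the assumed layout, purely for illustration.
sample_tweets = [
    {"user": {"screen_name": "voter1"},
     "entities": {"user_mentions": [{"screen_name": "DA_News"}]}},
    {"user": {"screen_name": "voter2"},
     "entities": {"user_mentions": [{"screen_name": "DA_News"}]}},
]

nodes, edges = build_mention_network(sample_tweets, min_mentions=2)
print(nodes, edges)
```

The two dicts can then be handed to whatever graph tool you prefer for layout and drawing.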
As always, I have made the Twitter JSON dumps available on my GitHub. Grab the continuously updated data here -> github:za-2014-election-tweets
Yes, I will upload them when they get interesting.