Having given an overview in Part I of why we would want to find influential people on Twitter or any other social network, we now move on to the Intersexions dataset itself: what the data is and how it was collected. I chose Intersexions as a case study because the show had an active social media presence that formed what is now termed a "Second Screen" experience for those watching it. It was the first time I had experienced people engaging with a show this way.
The data consists of tweets that contained the word "intersexions" during the final 8 weeks of the first season of the show. The tweets were collected between 15 February 2011 and 5 April 2011. The show was broadcast on Tuesday evenings between 8:30 pm and 9 pm South African Standard Time. The breakdown of users and tweets is shown next.
|Number of messages|30435|
|Total number of users|6210|
Below is a plot of the number of unique users collected over the whole 8 weeks as a function of time, in 5-minute intervals.
As you can see, the last episode (the season finale) had the largest spike in new users joining the conversation about Intersexions. The tweets sent by the users could be classified into 3 bins:
- A statement due to an episode broadcast (Twitter Mentions)
- Tweets by the TV show's producers (Twitter Replies or Retweets)
- Tweets by other users in connection with the show (Twitter Replies or Retweets)
We can also look at this data another way: the number of tweets sent out during each episode against how many viewers are estimated to have watched the show.
Even though the plot is exaggerated, we can see that there is some correlation between the two measures. I only used the number of tweets; there are better signals to look at if the goal is to provide better analytics to TV networks. BlueFin Labs is an example of a company taking this to the max. One thing we see is that in the last week a lot of new users joined the conversation and a lot of tweets were sent. Two factors can account for the increase in users: it was the show's finale, and the broadcast was an hour long instead of the normal 30 minutes.
Moving on, I used the list of users to construct a network by finding whom each user followed within the network, and vice versa. Here are some stats on the "Friend/Follower" network.
|Follower edges (directed connections between users)|146627|
|Maximum Diameter (maximum number of hops from one user to another)|6|
An interesting takeaway is that the graph diameter is 6 (the maximum number of hops, or follows, it takes to go from one user to another in the network), which fits in with the six degrees of separation theory. Yay science! Another important observation is that the network has 1 connected component. Simply put, there is always a path from one user to any other through other users. A "biased" sliver of the Intersexions follower network is below.
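For readers who want to see what "diameter" and "connected component" mean in practice, here is a minimal sketch using only the Python standard library. It is not the script used for this analysis. The toy edge list and the function names are made up for illustration, and the sketch assumes follows are treated as undirected links when counting hops.

```python
from collections import deque

def undirected_adjacency(follow_edges):
    """Build an undirected adjacency map from (follower, followed) pairs."""
    adj = {}
    for follower, followed in follow_edges:
        adj.setdefault(follower, set()).add(followed)
        adj.setdefault(followed, set()).add(follower)
    return adj

def bfs_distances(adj, start):
    """Hop counts from start to every user reachable from it."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adj[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist

def connected_components(adj):
    """Group users so that everyone in a group can reach everyone else in it."""
    seen = set()
    components = []
    for node in adj:
        if node not in seen:
            component = set(bfs_distances(adj, node))
            seen |= component
            components.append(component)
    return components

def diameter(adj):
    """Longest shortest path; assumes the graph is one connected component."""
    return max(max(bfs_distances(adj, node).values()) for node in adj)

# Hypothetical toy data, not taken from the Intersexions dataset.
toy_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a"), ("d", "e")]
adj = undirected_adjacency(toy_edges)
components = connected_components(adj)
print(len(components), diameter(adj))
```

Running one breadth-first search per user is slow on large graphs, but for a few thousand users it is a reasonable way to check numbers like the ones in the table.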
I went further and also constructed a "Mention" network. In this second network, a user has a connection to another user only if they mention them in a tweet during the time data was being collected. This was done to filter out users who just follow other users but do not actually engage with them. This second "Mention" network has these properties:
As expected, the "Mention" network is sparser and has a larger diameter. There are far fewer connections between users: the mention connections number less than 10% of the follower connections. Another insight is that we have 72 groups of users who converse with and around each other. We can deduce this from the number of connected components.
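The idea behind the "Mention" network can be sketched in a few lines of Python. This is an illustration under assumptions, not the original code: the sample tweets and user names are invented, the `@name` regular expression is a simplification of how Twitter parses mentions, and groups are counted while ignoring edge direction.

```python
import re

# Simplified pattern: an @ followed by word characters.
MENTION_RE = re.compile(r"@(\w+)")

def mention_edges(tweets):
    """Return the set of (author, mentioned_user) pairs found in the tweets."""
    edges = set()
    for author, text in tweets:
        source = author.lower()
        for name in MENTION_RE.findall(text):
            target = name.lower()
            if target != source:  # ignore self-mentions
                edges.add((source, target))
    return edges

def count_components(edges):
    """Count groups of users linked by mentions, using a small union-find."""
    parent = {}

    def find(user):
        parent.setdefault(user, user)
        while parent[user] != user:
            parent[user] = parent[parent[user]]  # path halving
            user = parent[user]
        return user

    for source, target in edges:
        parent[find(source)] = find(target)
    return len({find(user) for user in list(parent)})

# Hypothetical sample tweets as (author, text) pairs.
sample_tweets = [
    ("thandi", "Loving #intersexions tonight @sipho"),
    ("sipho", "@thandi same here! #intersexions"),
    ("lebo", "@zanele are you watching intersexions?"),
    ("mpho", "intersexions finale tonight"),
]
edges = mention_edges(sample_tweets)
print(sorted(edges), count_components(edges))
```

Note that a user who never mentions anyone and is never mentioned (like the made-up "mpho" above) does not appear in this network at all, which is one reason a mention network ends up much smaller than a follower network.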
Collecting the data
Anyone who has worked with Twitter data will appreciate the work that goes into getting it all done. I initially used NodeXL from Microsoft to collect the tweets, but it was limited by the number of queries that could be made via the Twitter API. What I ended up using was TwapperKeeper to archive all the tweets. TwapperKeeper has since been shut down as a Twitter archive, and just after I finished my data collection Twitter also instituted more stringent policies on third-party archiving. I was one of the lucky ones who finished and downloaded my collections before the features were removed.
Now I had all the tweets, but I did not have the follower network. To construct it I wrote a Python script that used the Twitter API and Python Twitter. The script took all the users in the message dump from TwapperKeeper and then found who followed whom. Because of the restriction on how many API calls developers can make in an hour, I had to come up with a clever way to schedule and pause my API calls. It took a while to build the network with this setup, but at least I didn't have a million users to analyze.
If you want to do this, I think the best way is to look at services that allow you to collect and download massive Twitter archives. For example, look at the TwitterArchivist service. There is so much more I could talk about here, but it would go on forever. Contact me if you have specific queries.
So that's Part II. We now know a little more about our network; next we want to start finding those influential people. That is Part III.
Trivia: The Twitter Beast refers to the archiving policy changes that affected a lot of researchers in early 2011. There are now a number of services you can use, for a fee, to get archives that Twitter approves of. I also believe that if you ask nicely as an academic, the services sometimes waive the fees. Anyway, always ask; you never know.
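To give a flavour of the "schedule and pause" idea, here is a minimal sketch of a crawler that stops and waits whenever it has used up its call budget for the current window. It is not the original script. `fetch_following` is a hypothetical stand-in for whatever API call returns the accounts a user follows, and the default `calls_per_window` and `window_seconds` values are placeholders rather than Twitter's actual limits.

```python
import time

def crawl_following(users, fetch_following, calls_per_window=150,
                    window_seconds=3600, sleep=time.sleep,
                    clock=time.monotonic):
    """Collect (follower, followed) edges among `users`, pausing when the
    call budget for the current window is spent.

    fetch_following(user) is a hypothetical callable that returns the
    accounts `user` follows; each call counts against the budget.
    """
    known = set(users)
    edges = []
    window_start = clock()
    calls_made = 0
    for user in users:
        if calls_made >= calls_per_window:
            # Budget spent: wait out the rest of the window, then reset.
            remaining = window_seconds - (clock() - window_start)
            if remaining > 0:
                sleep(remaining)
            window_start = clock()
            calls_made = 0
        calls_made += 1
        for followed in fetch_following(user):
            if followed in known:  # keep only edges inside our user list
                edges.append((user, followed))
    return edges

# Demo with made-up data; pauses are recorded instead of actually sleeping.
fake_graph = {"a": ["b", "x"], "b": ["a"], "c": []}
pauses = []
edges = crawl_following(["a", "b", "c"], lambda user: fake_graph[user],
                        calls_per_window=2, window_seconds=10,
                        sleep=pauses.append)
print(edges, len(pauses))
```

Passing `sleep` and `clock` in as parameters is a design choice that makes the pausing logic easy to try out without waiting an hour. A real crawler would also need to handle errors, protected accounts, and paged results.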