Wow, already at Part III. In Part I, I covered why one would want to find influential people in a social network. In Part II, I discussed the data I am analyzing and how it was collected from Twitter. Now we move on to the really fun science part: quantifying influence! More concretely, I used a measure of social correlation as a starting point to explore influence. The measures I used are based on this work:
Aris Anagnostopoulos, Ravi Kumar, and Mohammad Mahdian. 2008. Influence and correlation in social networks. Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD '08). http://doi.acm.org/10.1145/1401890.1401897
Joining the conversation
In the Intersexions dataset, the question of interest is: why do people start tweeting with the keyword "intersexions"? Most importantly, are they tweeting because someone they follow has also tweeted with the word? We want a measure of this correlation. There could be several reasons why, when one person starts tweeting, another does the same. Namely:
- Homophily: This is the tendency of individuals to choose friends who share their interests and thus they tend to converse about similar things.
- Confounding Variables: These are external factors that are not part of the network that get people to tweet. In this case a user watching the show can spontaneously tweet after watching a scene.
- Influence: The user starts tweeting after seeing a tweet about a topic from someone they follow. This is exactly what we want to detect.
To start with, we have a model that quantifies the probability of a user joining the conversation, given that a of their friends have already started talking about Intersexions and given the correlation in the network. Following the paper, the model is logistic:
p(a) = e^(α ln(a+1) + β) / (1 + e^(α ln(a+1) + β))
where a is the number of friends who are already active, α is the correlation coefficient and β is a constant. This can be re-arranged to:
ln( p(a) / (1 - p(a)) ) = α ln(a+1) + β
So this model gives us the probability of "activation" (tweeting with the word "intersexions"), given that a of your friends have already tweeted with the word, as a function of the correlation in the network. Obviously we don't have the correlation coefficient, as that is what we want to find out. It can be estimated with maximum likelihood logistic regression. See the paper for further details.
What is important in this model is that it gives correlation little weight when few of your friends have activated, and more weight as more and more of them become activated. So when few of a user's friends have used the word "intersexions", the user's own tweet is less likely to be due to their friends' influence or to homophily. But when, let's say, 99% of their friends (the people they are following) have used the word, there is a high likelihood that the user joined because of their friends' influence or homophily. For this analysis I ran the maximum likelihood logistic regression at 5-minute intervals, as per the paper.
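To make the estimation step concrete, here is a minimal sketch in plain Python. It assumes the logistic form from the paper, p(a) = e^(α ln(a+1) + β) / (1 + e^(α ln(a+1) + β)), and fits α and β with a few Newton-Raphson steps on the log-likelihood. The function names and the counts in the example are made up for illustration; this is not the code used for the analysis in this post.

```python
import math

def activation_probability(a, alpha, beta):
    """Probability of activating when `a` friends are already active."""
    z = alpha * math.log(a + 1) + beta
    return 1.0 / (1.0 + math.exp(-z))

def fit_alpha_beta(counts, iterations=50):
    """Maximum likelihood fit of (alpha, beta) by Newton-Raphson.

    counts maps a -> (activated, not_activated): how often a user with
    `a` active friends did or did not start tweeting (hypothetical layout).
    """
    alpha, beta = 0.0, 0.0
    for _ in range(iterations):
        g_a = g_b = 0.0              # gradient of the log-likelihood
        h_aa = h_ab = h_bb = 0.0     # entries of the negated Hessian
        for a, (y, n) in counts.items():
            x = math.log(a + 1)
            p = activation_probability(a, alpha, beta)
            total = y + n
            resid = y - total * p
            w = total * p * (1.0 - p)
            g_a += resid * x
            g_b += resid
            h_aa += w * x * x
            h_ab += w * x
            h_bb += w
        det = h_aa * h_bb - h_ab * h_ab
        if abs(det) < 1e-12:
            break                    # degenerate data, give up on updating
        # Newton step: solve the 2x2 system (-H) * delta = gradient
        d_alpha = (h_bb * g_a - h_ab * g_b) / det
        d_beta = (h_aa * g_b - h_ab * g_a) / det
        alpha += d_alpha
        beta += d_beta
        if abs(d_alpha) + abs(d_beta) < 1e-10:
            break
    return alpha, beta

# Made-up counts for one 5-minute snapshot: a -> (activated, not activated)
example_counts = {0: (12, 4000), 1: (30, 2500), 2: (41, 1500),
                  5: (60, 800), 10: (55, 300)}
alpha_hat, beta_hat = fit_alpha_beta(example_counts)
print("alpha:", alpha_hat, "beta:", beta_hat)
```

Running a fit like this on the counts accumulated up to each 5-minute mark would give one α value per time point, which is the kind of curve discussed next.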
So what does this actually give us? At each time point during and shortly after the airing of each episode, we can keep track of the correlation. We would expect the correlation to be low at the start of the episode. As the conversations on Twitter intensify, we would expect the correlation to increase, and then to decrease after the show has ended. If we look at the next plot, that is exactly what we see in week one (the blue line is the important one).
For now, let us focus on the blue line in the plot. We see that as more and more users join the conversation (the Y axis indicates this), alpha changes. Initially alpha is a bit erratic, but it increases after a few people have joined the conversation and then starts decreasing towards the end, when everyone has joined in.
Now an interesting thing to do is what is termed a shuffle test. This test shuffles the time of entry of each user. If timing does not matter, the correlation computed after the shuffle should look about the same; if the resulting plot is different, we can look to see where influence could have played a part. In this case, for week one, the shuffle test results in a different plot, especially when there are few users who are using the word. This gives us a glimpse that people might have joined the conversation because of some specific users in the network spearheading the discussion. Now we have to find these people.
That's where Part IV comes in.
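For the curious, the shuffle test described above can be sketched roughly as follows. This is a toy illustration under my own assumptions: the follower graph, the activation times and the function names are all hypothetical, and time is measured in whole steps (think 5-minute bins). The sketch only covers permuting the entry times and re-tallying, for each number of active friends a, how often users did or did not activate. The correlation coefficient would then be re-estimated from the shuffled tallies in the same way as from the original ones.

```python
import random

def shuffle_activation_times(activation_time, rng):
    """Randomly permute the entry times among the users who activated."""
    users = list(activation_time)
    times = [activation_time[u] for u in users]
    rng.shuffle(times)
    return dict(zip(users, times))

def tally_counts(following, activation_time, num_steps):
    """Tally activations by number of already-active friends.

    following maps user -> set of users they follow.
    Returns a -> [activated, not_activated], where each non-active user
    contributes one observation per time step until they activate.
    """
    counts = {}
    for user, friends in following.items():
        t_user = activation_time.get(user)
        last = t_user if t_user is not None else num_steps - 1
        for t in range(last + 1):
            # friends who activated strictly before this time step
            a = sum(1 for f in friends
                    if activation_time.get(f, num_steps) < t)
            slot = counts.setdefault(a, [0, 0])
            if t_user == t:
                slot[0] += 1
            else:
                slot[1] += 1
    return counts

# Toy data (made up): who follows whom, and when each user first tweeted
following = {
    "ann": {"bob"},
    "bob": set(),
    "cat": {"ann", "bob"},
    "dan": {"cat"},
}
activation_time = {"bob": 0, "ann": 1, "cat": 2}   # dan never tweets
num_steps = 4

rng = random.Random(42)
shuffled_time = shuffle_activation_times(activation_time, rng)

print("original:", tally_counts(following, activation_time, num_steps))
print("shuffled:", tally_counts(following, shuffled_time, num_steps))
```

The shuffle keeps who activated and the set of entry times, and only breaks the link between a user and their own entry time, which is what lets a difference in the correlation point towards influence.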
Trivia: Wired recently had a great article about how TV ratings are changing because of Twitter. Check it out:
THE NEW RULES OF THE HYPER-SOCIAL, DATA-DRIVEN, ACTOR-FRIENDLY, SUPER-SEDUCTIVE PLATINUM AGE OF TELEVISION http://bit.ly/149aZLl