Tuesday, 23 July 2013

Can Twitter predict the royal baby's name? (Updated)


One of today's main news stories is the birth of the royal baby. We congratulate Kate and William and wish them and their son much health and happiness!
Is it possible to predict the prince's name by analysing tweets? I analysed the tweets using NLP methods together with the theory of frequent itemsets and association rules. The analysis was done in the R environment with the algorithms from my previous studies. I obtained the following distribution of names:





So, we'll see whether the prince's name really is among these male names.
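A distribution of this kind can be tallied by counting how many tweets mention each candidate name. The original analysis was done in R and its code is not shown, so what follows is only a minimal Python sketch: the candidate list, the tokenisation rule and the sample tweets are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical candidate list; the post does not say which names were tracked.
CANDIDATE_NAMES = {"george", "james", "alexander", "louis", "henry"}

def name_distribution(tweets, names=CANDIDATE_NAMES):
    """Return, for each candidate name, the share of tweets that mention it."""
    counts = Counter()
    for text in tweets:
        # Lowercase and split into alphabetic tokens; a name counts once per tweet.
        tokens = set(re.findall(r"[a-z]+", text.lower()))
        counts.update(tokens & names)
    total = len(tweets)
    if total == 0:
        return {}
    return {name: counts[name] / total for name in names}

# Invented sample tweets, not taken from the downloaded data.
sample_tweets = [
    "I bet the #RoyalBaby will be George",
    "George or James? #RoyalBaby",
    "Hoping for Prince Louis!",
    "No idea what they will call him",
]
distribution = name_distribution(sample_tweets)
```

Counting each name at most once per tweet keeps a single enthusiastic tweet from inflating a name's share.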



After the Royal baby's name was announced


At last the Royal baby's name has been announced: Prince George of Cambridge!
So tweet mining could predict the Royal baby's name! What does this mean? Someone wrote to me that this study is nuts. Taken literally, it is indeed not a serious problem. But the main goal of the study is to test whether there is a correlation between the opinions of social network users and the decisions made by individuals who are highly influential in certain spheres of society. The results obtained suggest that such a correlation does exist.
The prince's full name is George Alexander Louis. Unfortunately, I don't know the history of England very well, and I didn't take into account that the Royal baby's full name might consist of three names. I therefore re-examined the array of tweets that had been downloaded before the name was announced. Using the theory of frequent itemsets and association rules, we studied which names occur together in tweets. The analysis showed that the three names George, Alexander and Louis form one of the top 5 frequent itemsets by support.
Top frequent itemsets:
    items                        support
1   {alexander,george,james}     0.135593220
2   {george,henry,james}         0.121725732
3   {george,james,louis}         0.104776579
4   {alexander,james,louis}      0.098613251
5   {alexander,george,louis}     0.098613251
6   {george,henry,louis}         0.095531587
7   {alexander,henry,james}      0.093990755
8   {alexander,george,henry}     0.093990755
9   {henry,james,louis}          0.092449923
10  {alexander,henry,louis}      0.090909091
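The support of an itemset is the fraction of tweets that contain every name in the set. As a rough illustration of how a ranking like the one above can be produced, here is a minimal Python sketch. It enumerates name combinations directly instead of running the R routines used for the actual analysis, and the transactions are invented toy data.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, size, min_support):
    """Return {itemset: support} for itemsets of `size` meeting `min_support`.

    Support is the fraction of transactions containing every item in the set.
    This is a brute-force enumeration, adequate only for a handful of names.
    """
    counts = Counter()
    for items in transactions:
        # Sorting gives each itemset one canonical tuple representation.
        for combo in combinations(sorted(set(items)), size):
            counts[combo] += 1
    n = len(transactions)
    return {combo: c / n for combo, c in counts.items() if c / n >= min_support}

# Toy data: each set holds the candidate names found in one tweet.
transactions = [
    {"george", "alexander", "louis"},
    {"george", "james"},
    {"george", "alexander", "louis", "james"},
    {"henry"},
]
triples = frequent_itemsets(transactions, size=3, min_support=0.5)
```

In this toy example only {alexander, george, louis} reaches a support of 0.5, because it is the only triple that occurs in two of the four transactions.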


The formation of frequent itemsets can be represented as the following graph:



On the basis of the three-element frequent itemsets, we analysed the association rules with high support and confidence. Rules linking the names George, Alexander and Louis also appear in the top 5 association rules ranked by confidence:

Top association rules:

    lhs                   rhs          support     confidence  lift
1   {james,louis}      => {george}     0.10477658  0.9855072   1.714730
2   {henry,louis}      => {george}     0.09553159  0.9841270   1.712328
3   {alexander,louis}  => {james}      0.09861325  0.9696970   2.192799
4   {alexander,louis}  => {george}     0.09861325  0.9696970   1.687221
5   {james,louis}      => {alexander}  0.09861325  0.9275362   4.459045
6   {george,louis}     => {james}      0.10477658  0.9189189   2.077973
7   {alexander,james}  => {george}     0.13559322  0.8888889   1.546619
8   {george,louis}     => {alexander}  0.09861325  0.8648649   4.157758
9   {alexander,george} => {james}      0.13559322  0.8543689   1.932005
10  {george,louis}     => {henry}      0.09553159  0.8378378   3.649374
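The three numeric columns of the rules table appear to be support, confidence and lift, in the order such tools usually report them; that reading is an assumption, since the table carries no header. Under the usual definitions, the confidence of a rule A => B is supp(A and B) / supp(A), and the lift is the confidence divided by supp(B). A minimal Python sketch on invented toy transactions:

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def rule_metrics(transactions, lhs, rhs):
    """Return (support, confidence, lift) for the rule lhs => rhs."""
    supp_rule = support(transactions, set(lhs) | set(rhs))
    confidence = supp_rule / support(transactions, lhs)
    lift = confidence / support(transactions, rhs)
    return supp_rule, confidence, lift

# Toy data: each set holds the candidate names found in one tweet.
transactions = [
    {"george", "james", "louis"},
    {"james", "louis"},
    {"alexander", "henry"},
    {"henry"},
]
supp, conf, lift = rule_metrics(transactions, {"james", "louis"}, {"george"})
```

Here the rule {james, louis} => {george} has support 0.25, confidence 0.5 and lift 2.0. The table itself is consistent with these definitions: for the first rule, 0.9855072 / 1.714730 is about 0.575, which would be the support of {george} alone, and the fourth rule gives the same value.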


The top 5 association rules obtained can be represented as follows:


 

Consider the structure of the set of users who took part in the discussion of the prince's name. To identify the communities that formed dynamically in this discussion, we used a fast greedy modularity optimization algorithm. To lay out the graph, we used the Fruchterman-Reingold algorithm, which belongs to the family of force-directed (spring) algorithms. The shape of the graph follows from the model these algorithms use: vertices are treated as balls that repel one another, and edges are treated as springs that pull the connected vertices together.
In the tweet array, we found 6,919 users who sent 37,191 tweets. These tweets mentioned 2,645 users, and a substantial share of the mentions are retweets. For further analysis, we took the active users: those who sent more than one tweet during the discussion or who were mentioned in tweets more than once. We found 2,300 users who sent more than one tweet and 923 users who were mentioned more than once. Figure 6 shows the graph of relations between users; the shades of color mark the user communities. The graph shows several large user communities.
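The graph-building and filtering step can be sketched as follows. This is an illustrative Python sketch, not the R code used for the study: it assumes each tweet is available as an (author, text) pair and that a mention is any "@name" token, so a retweet such as "RT @Telegraph: ..." counts as a mention of Telegraph. The sample tweets are invented.

```python
import re
from collections import Counter

MENTION = re.compile(r"@(\w+)")

def mention_edges(tweets):
    """Yield (author, mentioned_user) pairs from (author, text) tuples."""
    for author, text in tweets:
        for user in MENTION.findall(text):
            yield author.lower(), user.lower()

def active_users(tweets):
    """Return (users with more than one tweet, users mentioned more than once)."""
    sent = Counter(author.lower() for author, _ in tweets)
    mentioned = Counter(target for _, target in mention_edges(tweets))
    senders = {user for user, n in sent.items() if n > 1}
    targets = {user for user, n in mentioned.items() if n > 1}
    return senders, targets

# Invented sample; real input would be the downloaded tweet array.
tweets = [
    ("alice", "RT @Telegraph: George is the favourite"),
    ("alice", "I like James #RoyalBaby"),
    ("bob", "RT @Telegraph: George is the favourite"),
    ("carol", "@bob maybe Louis?"),
]
edges = list(mention_edges(tweets))
senders, targets = active_users(tweets)
```

The resulting edge list is what a community detection routine and a force-directed layout would take as input. In Python, for example, networkx provides `greedy_modularity_communities` and `spring_layout` for these two steps.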


Revealed user communities.
 

Our next step was to repeat the analysis after removing the most popular users, those mentioned in tweets 100 times or more. We found only 6 such users. After removing them, we obtained a new community graph. The removed users make up only about 0.2% of all users mentioned in tweets. The data show that removing just these most popular users changes the community structure significantly: only numerous small communities remain.
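The effect of removing a few hubs can be illustrated on a toy mention graph. The Python sketch below drops every user mentioned at least `threshold` times and then counts connected components. Connected components are used here only as a crude stand-in for communities; the study itself used modularity-based community detection. The edge list and the threshold of 3 are invented for the example (the post uses 100).

```python
from collections import Counter, defaultdict

def drop_popular(edges, threshold):
    """Remove users mentioned `threshold` times or more, with their edges."""
    mentions = Counter(target for _, target in edges)
    popular = {user for user, n in mentions.items() if n >= threshold}
    kept = [(a, b) for a, b in edges if a not in popular and b not in popular]
    return kept, popular

def components(edges):
    """Connected components of the undirected graph, as a list of sets.

    Users left without any edge do not appear, since only edges are stored.
    """
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, result = set(), []
    for start in list(neighbours):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:  # depth-first traversal from `start`
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(neighbours[node] - group)
        seen |= group
        result.append(group)
    return result

# Toy (author, mentioned_user) edges with one heavily mentioned account.
edges = [("a", "hub"), ("b", "hub"), ("c", "hub"), ("a", "b"), ("c", "d")]
before = components(edges)
kept, popular = drop_popular(edges, threshold=3)
after = components(kept)
```

With the hub present, all five users form a single component; without it, the graph splits into {a, b} and {c, d}, which mirrors the fragmentation described above.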
 
User communities without the six most popular users.
 
The results of the study demonstrate that tweet mining could predict the Royal baby's name. We showed that George, the first name of the newborn prince, was dominant in the spectrum of names before the official announcement. The data also show that the theory of frequent itemsets gives a more precise prediction of the full name than an analysis of name frequencies, which can predict the first name only. The prince's three names, George, Alexander and Louis, form a frequent itemset, and this itemset was among the top 5 frequent itemsets by support. We also showed that the structure of the dynamically formed communities of users who took part in the discussion is determined by only a few leaders, who have a significant influence on the position of other users.
What do these results mean? Taken literally, this is not a serious problem. But the main goal of the study is to test whether there is any correlation between the opinions of social network users and the decisions made by individuals who are highly influential in certain spheres of society. Our study suggests that such a correlation does exist: there is a certain correlation between the viewpoints of bloggers and the Royal family's decision about the prince's name.

Popular retweets about the Royal baby's name:

"RT @Lord_Voldemort7: They should name the #RoyalBaby 'Weasley' so that in future people can go around singing "Weasley is our King." "
"RT @Lord_Voldemort7: #RoyalBabyName It seems only fitting that the son of Prince William and commoner Kate Middleton be named Severus Snape…"
"RT @PrincessKateNOT: We have decided to name our #RoyalBaby with a popular British boys name. Mohammed."
"RT @eonline: We still don't know the #RoyalBaby's name...but we may have an idea of what his surname could be!"
"RT @AmazingPhil: I think they should name him after his great grandfather! Prince Philip. #RoyalBaby"
"RT @AdamCatterall: I woke to see #thunderstorm was trending. For a moment I thought they'd let Kanye West name the #RoyalBaby"
"RT @gracehelbig: Should've named it "Norther Wester." #RoyalBaby"
"RT @wescraven: Suggested name for the new little prince... Freddy. #RoyalBaby"
"RT @Lord_Voldemort7: The #RoyalBaby has not yet been named. They should just call him 'You Know Who.'"
"RT @MelissaJoanHart: I seriously don't wanna go to bed without knowing the #royalbaby name. Isn't that ridiculous?! "
"RT @Telegraph: #RoyalBaby: George is the bookies' favourite for the new prince's name, followed by James, Alexander, Louis and Henry"