Since this Pew report came out, researchers and journalists in my circles have been trying to untangle what it actually means. One interpretation is that Twitter is full of haters. Another reads Twitter as a mainstream liberal but a conservative wonk (srsly?). The notion that opinions raised on Twitter are biased, since the population of active users on the network is not representative of the general public, makes a lot of sense. In the report, Pew researchers monitored opinions on Twitter across a number of political events during 2012 using Crimson Hexagon’s sentiment analysis service. At the same time, they ran national polls gauging sentiment around the chosen events. While they conclude that ‘Twitter reaction to events (is) often at odds with overall public opinion’, it seems like what they actually prove is that Crimson Hexagon’s (CH) sentiment analysis method for Twitter doesn’t reflect public opinion… and is therefore meaningless for assessing public sentiment on Twitter.
These two conclusions are very different. The former suggests that Twitter cannot be used at all to assess opinions, while what I’m suggesting is that language-based statistical models built from tweets in aggregate will not provide meaningful results when evaluating general public sentiment around an event. If context around users is not taken into account, specifically their historical propensity to respond to a topic as well as their positioning within the network, we lose the ability to gain interesting insight from data coming from social networked spaces. Let me explain.
I spent some time looking into CH’s documentation online, and while I feel like I have a much better handle on what they do, I’m still partly guessing, as much of the meat counts as “proprietary algorithms”. From my understanding, CH gets access to the Twitter firehose; then, for every project, a series of keywords and phrases contained within a time period are chosen, and all tweets where the keywords appear are extracted as the observed dataset. All non-English tweets are then filtered out (I’m not sure exactly how this is done, or what happens with mixed-language tweets and/or slang). The chosen period seems to vary, from several hours to multiple days. For the Pew studies over the past year, the chosen time period seems to vary for every observed event.
In the next step, sentiment-related “assertions” are identified across all tweets. This is most likely done using a pre-existing dictionary of words and phrases that are based on a manual classification (labeled dataset) of tweets from the past. Then a random sample of assertions from the observed event is manually labeled. This is used to train a classifier which then runs across the whole dataset of assertions, breaking them down into 4 bins: positive, negative, neutral (informational) and irrelevant.
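To make the pipeline concrete, here is a minimal sketch of how I imagine the flow working: keyword extraction followed by binning each tweet into one of the four categories. Everything here is hypothetical and deliberately simplistic (the keyword lists, the sentiment lexicon, and the example tweets are all invented); CH presumably uses a trained statistical classifier rather than a fixed lexicon.

```python
# Hypothetical event keywords and sentiment lexicon -- NOT CH's actual data.
EVENT_KEYWORDS = {"debate", "obama", "romney"}
POSITIVE = {"great", "win", "love"}
NEGATIVE = {"awful", "lost", "hate"}

def classify(tweet: str) -> str:
    """Bin a tweet into one of the four categories described above."""
    words = set(tweet.lower().split())
    # Step 1: keyword filter -- tweets without event keywords are irrelevant.
    if not words & EVENT_KEYWORDS:
        return "irrelevant"
    # Step 2: naive lexicon-based sentiment scoring.
    pos, neg = len(words & POSITIVE), len(words & NEGATIVE)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"  # informational

tweets = [
    "great debate tonight",
    "that debate was awful",
    "watching the debate now",
    "pizza for dinner",
]
counts = {}
for t in tweets:
    label = classify(t)
    counts[label] = counts.get(label, 0) + 1
```

Even this toy version makes the failure modes below easy to see: a single out-of-lexicon word, a hashtag, or a sarcastic phrasing lands a tweet in the wrong bin.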
I’ve seen this type of methodology work well for long-form text, but have yet to see interesting results come out of Twitter data. Some of my main concerns are outlined below:
1. We cannot expect to assign sentiment correctly 100% of the time. Even humans often disagree about the sentiment of text:
- Twitter is spoken, non-homogeneous language that is constantly evolving. It is very hard to train a classifier to accurately represent a single model for “language” on Twitter.
- Cultural differences (“sick” – positive for some, negative for others).
- Sarcasm and innuendo (“old men and women” – Chomsky’s constructional homonymity) – this is a crucial problem, especially around political content.
- How are hashtags dealt with? Many of them, especially around political events, are words never previously seen by a model.
- How does CH know that a person is situated in the US? They filter out English content, but this doesn’t necessarily mean the person is located in the US.
2. There’s no user context around the tweets:
- What if a user tweets multiple times during an event? This should be taken into account, as it highly skews the results.
- What about retweets? They are an interesting signal reflecting user opinion, but there are many reasons why someone might retweet a message, and repetitive retweets from a single person reflecting effectively a single opinion should not be counted multiple times.
- Why do we think that the general sentiment of an event is simply the sum of sentiment in individual tweets? At the end of the day, the data is coming from people. If we don’t understand who these users are, especially how they’re interconnected and what a meaningful sample would be, the sum of all tweets is meaningless.
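The skew from missing user context is easy to demonstrate. In the hypothetical records below (users, sentiments, and retweet flags are all invented), a tweet-level count and a user-level count of the same data tell opposite stories:

```python
from collections import Counter

# Hypothetical records: (user, sentiment, is_retweet)
tweets = [
    ("alice", "positive", False),
    ("alice", "positive", False),  # same user, same opinion, tweeted twice
    ("alice", "positive", True),   # plus a retweet on top
    ("bob",   "negative", False),
]

# Naive aggregation: every tweet counts equally.
naive = Counter(sentiment for _, sentiment, _ in tweets)

# User-level aggregation: retweets collapsed, one vote per user
# (the majority sentiment of that user's original tweets).
per_user = {}
for user, sentiment, is_retweet in tweets:
    if not is_retweet:
        per_user.setdefault(user, []).append(sentiment)
user_votes = Counter(
    Counter(sentiments).most_common(1)[0][0]
    for sentiments in per_user.values()
)
```

The naive count reads 75% positive; counting each user once, it is an even split. One enthusiastic user is enough to swing a tweet-level tally.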
3. If a service claims a 97% accuracy rate, especially on a problem that’s not considered solvable, you should be highly suspicious. How do they define accuracy here? Based on how their models classify their trained dataset? We have to be very careful here.
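One way accuracy numbers like this can arise is by evaluating on the training data itself. The toy experiment below (entirely synthetic labels with no real signal) shows how a model that simply memorizes its training set scores perfectly there while doing no better than chance on unseen data:

```python
import random

random.seed(0)
# Synthetic labeled "assertions": the labels are random, so no classifier
# should beat chance on data it hasn't seen.
data = [(i, random.choice(["positive", "negative"])) for i in range(200)]
train, test = data[:100], data[100:]

memorized = {x: y for x, y in train}

def predict(x):
    # Memorization: perfect recall on training items, a blind guess otherwise.
    return memorized.get(x, "positive")

train_acc = sum(predict(x) == y for x, y in train) / len(train)
test_acc = sum(predict(x) == y for x, y in test) / len(test)
# train_acc is exactly 1.0; test_acc hovers around chance (~0.5).
```

Without knowing whether a quoted figure was measured on held-out data, a 97% claim tells us very little.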
The fact that this is all proprietary technology makes it incredibly difficult to critique. I’ve yet to see Twitter sentiment analysis results that are actually meaningful – highlighting interesting, important and timely insight. Topsy claims that with #Twindex they accurately predicted election results for almost all US states. I haven’t played around with CH or Topsy as they are quite pricey. But I have spent the past couple of years working on insight from Twitter data, and while it is straightforward to identify extremely positive and extremely negative posts, the majority of content tends to land somewhere in the middle. From my experience, using natural language-based models on Twitter data without any context around the users or the observed event will not bring sufficiently valuable insight.
Is it totally useless? I don’t think so. There’s value in understanding amount of buzz around an event, especially if there’s enough data about what normal behavior looks like. Yet IMHO, some of the most interesting insight can come from taking a networked approach to analyzing user response to an event.
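Measuring buzz against a baseline is straightforward once you have a sense of normal behavior. A minimal sketch, with invented hourly tweet counts, would be something like flagging hours that sit several standard deviations above the historical mean:

```python
import statistics

# Hypothetical hourly tweet counts for a topic: a quiet baseline window,
# then the hour of an event.
baseline = [120, 95, 130, 110, 105, 125, 115]
event_hour = 480

mean = statistics.mean(baseline)
stdev = statistics.stdev(baseline)
z = (event_hour - mean) / stdev  # how many standard deviations above normal
is_buzz = z > 3                  # flag hours far outside normal behavior
```

Volume spikes like this are a reliable signal that *something* happened, even when the sentiment of the underlying tweets is unreadable.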
For example, on election day 2012, we saw over 100,000 users who self-reported their vote, literally tweeting out “I voted for …”. Obviously we know that the network is biased and that there are many more young liberals using Twitter (as noted by Pew’s latest study on the demographics of social media users). We can account for this bias, but additionally, we can start to identify communities of users: teenagers in Florida, moms in Ohio, media professionals across the US. Communities are inherent to the organizational scheme of interest-based social networks such as Twitter. By getting more context on how heavily each community is involved within an observed event, and sampling accordingly, we may be able to significantly improve the way we gauge the general public opinion through the lens of Twitter.
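A sketch of what sampling accordingly could look like: reweight each community’s average sentiment by its share of the general public rather than its share of the Twitter sample. The community labels, population shares, and sentiment scores below are all invented for illustration; in practice the shares would come from demographic surveys like Pew’s.

```python
from collections import defaultdict

# Hypothetical observed sample: (community, sentiment score in [-1, 1]).
# Young liberals are overrepresented, mirroring Twitter's known skew.
sample = [
    ("young_liberals", 0.8), ("young_liberals", 0.6), ("young_liberals", 0.7),
    ("older_conservatives", -0.5),
]

# Hypothetical share of each community in the general public.
population_share = {"young_liberals": 0.3, "older_conservatives": 0.7}

by_community = defaultdict(list)
for community, score in sample:
    by_community[community].append(score)

# Raw Twitter average vs. population-reweighted average.
raw = sum(score for _, score in sample) / len(sample)
reweighted = sum(
    population_share[c] * (sum(scores) / len(scores))
    for c, scores in by_community.items()
)
```

With these made-up numbers, the raw average is solidly positive while the reweighted one flips negative – the kind of gap Pew observed between Twitter reaction and national polls.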
Does this approach better align with the formal polls? I have no idea.
But a network-based approach may help give us important context around event polling – a smarter way to sample user data coming from social networks. In any case, we need to continue to experiment in this space and be much more critical of what we’re told by companies who promise algorithmic accuracy.
Some related links:
- A recent example of community cluster analysis looking at a specific event – the Harlem Shake.
- Watch out for Marshall Kirkpatrick’s new startup – Little Bird - an interesting actor in this space.
- Alex Johnson, Journalist at NBC, blogs about his usage of Crimson Hexagon’s sentiment analysis technology.